Minor Report
Submitted by
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
(DEC 2022)
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY
Department of Computer Science and Engineering
CERTIFICATE
This is to certify that this MINOR project report, “Recommendation
mechanism to forge connections between users with similar interests”, is
submitted by Indrakant Dana (02514802719), Akshat Ajay (75714802719),
and Udit Agarwal (12914802719), who carried out the project work under my
supervision.
ABSTRACT
This project addresses the problem individuals face when searching for people with
similar interests. It is common for us to look for people who have similar interests when
traveling to an unfamiliar city. A person's interests can range from films or music to
hobbies, certain personality traits, lifestyle, and more. These traits and interests will be
input by the user during the time of registration and will be a part of their profile. The
number of users keeps growing, and new accounts are registered all the time. Despite the
growing number of users, only a few contacts are made daily. A matchmaking system
based on the text in a user's profile can be used in this case. Users' profiles contain text
that serves as their fingerprint. Each of these fingerprints can be embedded as a vector in
a high-dimensional space using deep learning. As users with similar interests will have
similar vectors, we can analyze this large behavioral dataset by forming groups or
clusters of like-minded individuals. With this project, users with similar interests are
recommended based on these vectors using clustering algorithms and classification
techniques, with the best possible accuracy. This interest-based matching can lead to a
more engaging and satisfying user experience, as users are more
likely to find products or services that they are interested in. By providing personalized
recommendations and targeting marketing efforts more effectively, individuals are able to
save time, gain control over their experience, and reduce the likelihood of getting
undesirable matches.
ACKNOWLEDGEMENT
Place: Delhi
Date:
TABLE OF CONTENTS
List of Figures 6
1. Introduction
1.1. Need and Significance 7
1.2. Objective 7
1.3. Technologies used 8
1.4. Motivation 12
2. Literature Survey 13
3. Approach/Methodology
3.1. Preliminaries 15
3.2. Data Gathering and Generation 15
3.3. Data Pre-Processing 16
3.4. Algorithms Used 17
3.4.1 KMeans Clustering
3.4.2 DBSCAN Clustering
3.4.3 HDBSCAN Clustering
3.4.4 Agglomerative Hierarchical Clustering
4. Experiments & Results 20
5. Conclusion 28
6. Future Scope 29
7. Appendices 30
8. References 32
9. Research Paper 34
LIST OF FIGURES
1. DBSCAN algorithm 18
LIST OF TABLES
1. HDBSCAN Clustering Parameters 27
INTRODUCTION
Online interactions between individuals are becoming more frequent as web usage
increases. With enhanced Web technology and rising Web popularity, users are
increasingly turning to online social networks to connect with new friends or similar
users. Many prominent social networking sites have appeared in the previous decade, and
the number of monthly active members has also increased rapidly.
Social matching, which refers to computational methods of detecting and fostering new
social connections between people, is a relatively new strategy toward this goal. In the
absence of an automatic selection method, choosing the best candidate from a large pool
of applicants becomes time-consuming and all but impossible. Social networks are
therefore increasingly adopting solutions that dynamically support matchmaking by
recommending potential matches. The connections users share in these networks, and the
approach used to form those connections, are a crucial part of these social networks. One
of the popular approaches is using sub-networks of users having
similar interests, opinions, and lifestyles. One of the key benefits of user-matching
systems is their ability to provide personalized recommendations to users. By analyzing
data on user behavior, interests, and preferences, these systems are able to make
suggestions that are tailored to each individual user. This can lead to a more engaging and
satisfying user experience, as users are more likely to find products or services that they
are interested in.
Overall, user matching systems are a valuable technology that can help businesses to
improve the user experience and increase conversions and have become an integral part
of the online matching experience, providing individuals with a more efficient and
effective way to find potential partners.
1.2 Objective
● To explore various clustering algorithms and compare them, to get the best
possible accuracy in terms of matches given by the recommendation system
● Users are given a profile feature that serves as their unique digital fingerprint. The
interests and personality traits of the user are included in their profile, which can
be modified by the user. When recommending a user, their profile information is
used.
1.3 Technologies Used

1.3.1 Python
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built-in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development, as well as
for use as a scripting or glue language to connect existing components together. Python's
simple, easy to learn syntax emphasizes readability and therefore reduces the cost of
program maintenance. Python supports modules and packages, which encourages
program modularity and code reuse. The Python interpreter and the extensive standard
library are available in source or binary form without charge for all major platforms, and
can be freely distributed. Since there is no compilation step, the edit-test-debug cycle is
incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a
segmentation fault. Instead, when the interpreter discovers an error, it raises an exception.
The debugger is written in Python itself, testifying to Python's introspective power.
The Python language comes with many libraries and frameworks that make coding
easier and save a significant amount of time. The most popular libraries are NumPy,
which is used for scientific calculations; SciPy, for more advanced computations; and
scikit-learn, for data mining and data analysis.
These libraries work alongside powerful frameworks like TensorFlow, CNTK, and
Apache Spark. These libraries and frameworks are essential when it comes to machine
and deep learning projects.
1.3.2 Matplotlib
Matplotlib is a cross-platform data visualization and graphical plotting library for Python
and its numerical extension NumPy. As such, it offers a viable open source alternative to
MATLAB. Developers can also use matplotlib’s APIs (Application Programming
Interfaces) to embed plots in GUI applications. Matplotlib is used to create 2D graphs and
plots using Python scripts. It has a module named pyplot, which makes plotting easy by
providing features to control line styles, font properties, axis formatting, and so on. It
supports a wide variety of graphs and plots, including histograms, bar charts, power
spectra, and error charts. Used along with NumPy, it provides an environment that is an
effective open-source alternative to MATLAB. It can also be used with graphics toolkits
like PyQt and wxPython.
1.3.3 Numpy
NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the
fundamental package for scientific computing with Python. It is open-source software. It
contains various features, including these important ones:
● a powerful N-dimensional array object
● sophisticated (broadcasting) functions
● tools for integrating C/C++ and Fortran code
● useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data types can be defined, which
allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
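A small sketch of these array features, using made-up interest scores of the kind this project stores per user:

```python
import numpy as np

# A 2-D array of per-user interest scores (rows = users, columns = categories).
scores = np.array([[7, 2, 9],
                   [6, 3, 8],
                   [1, 8, 0]])

# Vectorized operations work element-wise without explicit Python loops.
scaled = scores / 9.0                 # broadcasting a scalar over every element
per_user_mean = scores.mean(axis=1)   # one mean per row (user)

# Cosine similarity between the first two users' score vectors.
a, b = scores[0].astype(float), scores[1].astype(float)
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Users 0 and 1 have nearly parallel score vectors, so their cosine similarity is close to 1 — exactly the property the matching pipeline exploits.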
1.3.4 BeautifulSoup
Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup
languages. Say you’ve found some webpages that display data relevant to your research,
such as date or address information, but that do not provide any way of downloading the
data directly. Beautiful Soup helps you pull particular content from a webpage, remove
the HTML markup, and save the information. It is a tool for web scraping that helps you
clean up and parse the documents you have pulled down from the web.
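A minimal sketch of pulling particular content out of markup with Beautiful Soup. The HTML snippet, class names, and profile fields below are invented for illustration; in practice the markup would come from a downloaded page:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded profile page.
html = """
<html><body>
  <div class="profile"><span class="name">Asha</span>
    <p class="bio">Guitarist and amateur astronomer.</p></div>
  <div class="profile"><span class="name">Ravi</span>
    <p class="bio">C++ enthusiast, loves trekking.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull the content we care about and discard the HTML markup.
profiles = [(div.select_one(".name").get_text(),
             div.select_one(".bio").get_text())
            for div in soup.select("div.profile")]
```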
1.3.5 Seaborn
Seaborn is an amazing visualization library for statistical graphics plotting in Python. It
provides beautiful default styles and color palettes to make statistical plots more
attractive. It is built on top of the matplotlib library and is closely integrated with the
data structures from pandas. Seaborn aims to make visualization a central part of
exploring and understanding data. It provides dataset-oriented APIs, so that we can
switch between different visual representations of the same variables for a better
understanding of the dataset.
1.3.7 Streamlit
Streamlit is an open-source python framework for building web apps for Machine
Learning and Data Science. We can instantly develop web apps and deploy them easily
using Streamlit. Streamlit allows you to write an app the same way you write Python
code, and makes it seamless to work in the interactive loop of coding and viewing results
in the web app.
1.3.8 Pandas
Pandas is an open-source library that is made mainly for working with relational or
labeled data both easily and intuitively. It provides various data structures and operations
for manipulating numerical data and time series. This library is built on top of the NumPy
library. Pandas is fast and offers high performance and productivity for its users.
Pandas stands for ‘panel data’. Note that pandas is typically stylized as an all-lowercase
word, although it is considered a best practice to capitalize its first letter at the beginning
of sentences. Pandas was designed to work with two-dimensional data (similar to Excel
spreadsheets). Just as the NumPy library had a built-in data structure called an array with
special attributes and methods, the pandas library has a built-in two-dimensional data
structure called a DataFrame.
This tool is essentially your data’s home. Through pandas, you get acquainted with your
data by cleaning, transforming, and analyzing it. Pandas is built on top of the NumPy
package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in
pandas is often used to feed statistical analysis in SciPy, plotting functions from
Matplotlib, and machine learning algorithms in Scikit-learn.
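A minimal DataFrame sketch of the cleaning/transforming/analyzing steps mentioned above, on made-up user-profile rows:

```python
import pandas as pd

# A DataFrame of user profiles: one row per user, one column per attribute.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "movies":  [8, 2, 7, 1],
    "sports":  [1, 9, 2, 8],
})

# Transforming: derive a new scaled column.
users["movies_scaled"] = users["movies"] / 9.0

# Cleaning/filtering: boolean-mask selection of sporty users.
sporty = users[users["sports"] >= 8]

# Analyzing: column-wise statistics.
means = users[["movies", "sports"]].mean()
```

The resulting DataFrame can then feed SciPy statistics, Matplotlib plots, or scikit-learn models directly.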
1.3.9 Pickle
The pickle module implements binary protocols for serializing and de-serializing a
Python object structure. “Pickling” is the process whereby a Python object hierarchy is
converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte
stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
The process of converting any kind of Python object (list, dict, etc.) into a byte stream
(0s and 1s) is called pickling, serialization, flattening, or marshalling. We can convert the
byte stream (generated through pickling) back into Python objects through a process
called unpickling.
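A pickling/unpickling round trip in miniature (the profile dictionary is a made-up example of the kind of object one might serialize, such as a trained model):

```python
import pickle

# Pickling: convert a Python object hierarchy into a byte stream.
profile = {"name": "Asha",
           "interests": ["music", "astronomy"],
           "scores": [7, 2, 9]}
data = pickle.dumps(profile)      # serialize to bytes
                                  # (pickle.dump(obj, f) writes to a file instead)

# Unpickling: reconstruct an equal object from those bytes.
restored = pickle.loads(data)
```

The restored object compares equal to the original, which is what makes pickle convenient for caching fitted models between runs.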
1.3.10 Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical
modeling including classification, regression, clustering and dimensionality reduction via
a consistent interface in Python. This library, which is largely written in Python, is built
upon NumPy, SciPy and Matplotlib.
The package provides easy-to-use, effective functions for data mining and machine
learning for data analysis. Support vector machines, gradient
boosting, random forests, k-means, and other regression, classification, and clustering
algorithms are included.
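The report later trains classification models and reports F1 scores; a minimal sketch of that scikit-learn workflow, on synthetic data standing in for the labeled user vectors (dataset sizes and the random forest choice are illustrative, not the project's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled user vectors (features + cluster labels).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The consistent fit/predict interface is the same across sklearn estimators.
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
```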
1.4 Motivation
User matching systems are a valuable technology that can help businesses to improve the
user experience and increase conversions and have become an integral part of the online
matching experience, providing individuals with a more efficient and effective way to
find potential partners. Various studies have been conducted in the past focusing on the
methods used to form connections between users. This research on social matching
systems has identified several drawbacks.
LITERATURE SURVEY
There have been numerous studies and research projects focused on developing effective
user matchmaking algorithms in various contexts.
[7] presents a collaborative filtering recommendation algorithm that uses user interest
clustering to address the limitations of traditional algorithms. The improved algorithm has
been shown to have higher recommendation efficiency and precision through
experimental results. [8] proposes an algorithm based on Ant Colony is proposed to solve
the optimization problem of clustering/matching people in a social network. The
numerical results show that the algorithm can successfully perform clustering with a
variable number of individuals. A tag-based common interest discovery approach in
online social networks is proposed in [10]. It is suggested that user-generated tags are
effective for representing user interests because they can more accurately reflect
understanding. The approach is able to effectively discover common interest topics in
online social networks, such as Douban, without any information on the online
connections among users. A different type of approach under the project called
Match-MORE has been proposed in [11] to address the issues surrounding
Proximity-based Mobile Social Networks, which can pose risks to user privacy and create
a system overhead. The concept of friends-of-friends is utilized in Match-MORE to find
common connections among friends and design a private matching scheme. This is
achieved through the use of a novel similarity function that takes into account both the
social strength between users and the similarity of their profiles. It allows users to
discover new potential friends based on recommendations from their existing friends,
with adjustable accuracy and without disclosing too much personal information. The use
of Bloom filters to estimate common attributes helps to reduce the system overhead. The
security and performance of this project have also been carefully analyzed and evaluated
through simulations.
A related study identified several strengths and weaknesses of the online dating network. To improve the
effectiveness of the network, the proposed system utilizes a recommender system that
combines both the attributes of the vertices (individual users) and the interactions within
the network. This allows for the identification of more compatible and potentially
successful (12% improvement in accuracy) matches within the network. [6] presents a
method for recommending social groups in online services by extracting multiple
interests and adaptively selecting similar users and items to create a compact rating
matrix for efficient collaborative filtering. The approach was evaluated through extensive
experiments. In [12], a new collaborative filtering (CF) recommendation method is
proposed that is based on users' interests and sequences (IS). The concept of "interest
sequences" is defined in order to depict the dynamic evolution patterns of users' interests
in online recommendation systems, drawing inspiration from work in location-based
recommendation systems. The method for calculating users' similarities based on IS,
which is used to identify the top K nearest neighbors of the target user, is introduced,
along with methods for calculating the length of the longest common sub-IS (LCSIS) and
the count of all common sub-IS (ACSIS). Finally, a method for predicting users' ratings
for unrated items is presented, and the effectiveness of the proposed recommendation
method is verified through comprehensive experiments on three datasets, examining the
influence of several factors on the results.
Much of the earlier research on this problem targets a group of users rather than a single
specific user. The algorithms used by existing applications that offer interest-based
matchmaking services are proprietary and kept secret. Our goal
is to investigate various clustering algorithms [3] and determine the optimal strategy for
delivering matches that are more accurate. Our analysis and research indicate that we may
create a recommendation system by integrating multiple clustering algorithms [3][10],
such as DBSCAN clustering and hierarchical clustering using various linkages.
METHODOLOGY
3.1 Preliminaries
We need user data containing details about each user, such as a bio and their interest in
different topics, with each interest measured on a scale of 0-9. Additionally, we need data
on the compatibility between individuals, such as their common interests and a
compatibility score. This information is used to match individuals with each other based
on their preferences and compatibility.
3.2 Data Gathering and Generation

The machine learning algorithms we have devised require a significant quantity of data to
train our model. Because such a dataset is not publicly available on the internet, we
generate fake user profile data. There are various websites that can assist us in generating
a large quantity of fake data. By using BeautifulSoup for web scraping, we can turn the
generated data into a data frame. For each user, we need data that shows their interests in
different categories like movies, sports, and politics. This can be done by randomly
assigning numbers from 0 to 9 to each category in the data frame.
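The random-assignment step can be sketched as follows (the category names, user count, and seed are illustrative; the bios themselves would come from the scraped fake-profile data):

```python
import numpy as np
import pandas as pd

# Hypothetical interest categories for the synthetic profiles.
categories = ["movies", "sports", "politics", "music", "travel"]
rng = np.random.default_rng(42)

# Randomly assign an interest score from 0 to 9 per user per category.
n_users = 1000
profiles = pd.DataFrame(
    rng.integers(0, 10, size=(n_users, len(categories))),
    columns=categories,
)
profiles.insert(0, "user_id", range(n_users))
```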
3.3 Data Preprocessing
We can preprocess the user bio data using NLP to find the words most frequently used by
users in their bios across the platform. Using the NLTK library we can perform
tokenization and lemmatization of the bio data, i.e., splitting full sentences into
individual words and converting each word into its base form (for example, converting
‘Joking’ into ‘Joke’). As an extra measure, we exclude stop words like ‘a’, ‘the’, ‘of’, etc.
The next step is building a set of unique words with their corresponding usage
frequencies across the platform. Many of these words may be modifiers of a noun, as in
‘c++ enthusiast’; we pair such words, find their frequency scores, and build up our data
frame for the next step.
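The shape of this pipeline can be sketched in plain Python. The tiny stop-word list and the crude suffix-stripping "lemmatizer" below are stand-ins purely for illustration; the actual pipeline would use NLTK's tokenizers, stop-word corpus, and WordNetLemmatizer:

```python
import re
from collections import Counter

# Toy stop-word list (illustrative; NLTK ships a full corpus).
STOP_WORDS = {"a", "the", "of", "and", "i", "am", "an"}

def lemmatize(word: str) -> str:
    # Crude suffix stripping standing in for real lemmatization.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(bio: str) -> list[str]:
    tokens = re.findall(r"[a-z+#]+", bio.lower())   # tokenization
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS]

bios = ["I am a c++ enthusiast and love joking",
        "The joy of trekking and joking"]

# Unique words with their usage frequency across the platform.
freq = Counter(word for bio in bios for word in preprocess(bio))
```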
To improve the performance of our clustering algorithms, we scale the interest categories
(Movies, TV, religion, etc.). This reduces the time required to fit the algorithms to the
dataset. The next phase structures the data into a series of numerical vectors, each
representing one data point; this is the vectorization of the users' bio data. We employ
two techniques, Count Vectorization and TF-IDF Vectorization, and the two data frames
they produce are combined and scaled into a new dataset. Because this dataset has too
many dimensions, we apply Principal Component Analysis (PCA), a statistical method
for reducing a dataset's dimensionality. It accomplishes this by translating the data into a
new coordinate system with the principal components as the new axes. Once the data is
ready, these techniques can be combined with evaluation metrics such as the Silhouette
coefficient, Davies-Bouldin score, and Calinski-Harabasz score to locate the ideal number
of clusters. These metrics assess how effectively the clustering algorithms work. After an
in-depth analysis of the algorithms using these scoring metrics, we concluded that the
optimal number of clusters for a better match probability is 2. We then train a
classification model using the optimal cluster value, aiming for the best possible
accuracy.
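A compressed sketch of this vectorize → combine → PCA → evaluate chain, on four made-up bios (the bios, the tiny candidate set of k values, and the use of KMeans as the clusterer are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

bios = ["movie buff and amateur guitarist",
        "guitarist who loves indie movies",
        "marathon runner and trekking addict",
        "trekking, running and outdoor sports"]

# Count and TF-IDF vectorization, combined and scaled into one dataset.
counts = CountVectorizer().fit_transform(bios).toarray()
tfidf = TfidfVectorizer().fit_transform(bios).toarray()
combined = StandardScaler().fit_transform(np.hstack([counts, tfidf]))

# PCA to reduce dimensionality before clustering.
reduced = PCA(n_components=2).fit_transform(combined)

# Pick the number of clusters with the best silhouette score.
scores = {k: silhouette_score(
              reduced,
              KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced))
          for k in (2, 3)}
best_k = max(scores, key=scores.get)
```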
When a user looks for similar users, their data is treated as new. Assigning new data to
the existing clusters requires running the clustering algorithms again; before that, the
new data goes through NLP processing, tokenization, scaling, vectorization, and PCA.
The clustering algorithms then run and return the top 10 matches to the user.
One of the main advantages of the k-means clustering is that it is fast and efficient,
especially for large datasets. However, it can be sensitive to the initial placement of the
centroids and may not always produce the best possible clusters.
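The centroid-sensitivity issue mentioned above is commonly mitigated by restarting k-means from several random seeds; a sketch on two synthetic, well-separated blobs of "user vectors" (the data is invented):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs standing in for user vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# n_init restarts k-means from several random centroid placements and keeps
# the best run (lowest inertia), mitigating sensitivity to initialization.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```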
Fig 1: The DBSCAN algorithm was utilized to generate two clusters, which consist of
three types of points: key points (green) that meet the criteria for clustering, border points
(blue) that do not meet the clustering criteria but are within the reach of a key point, and
noise points (black) that do not fit into either.
One of the advantages of DBSCAN is that it does not require the user to specify the
number of clusters in advance. This is useful because the number of clusters in a dataset
is often not known beforehand. It also has the ability to identify and label points that are
not part of any cluster, which is useful for outlier detection. However, DBSCAN has
some disadvantages as well. It can be sensitive to the choice of parameters and can be
affected by the presence of noise or outliers in the dataset. Additionally, it does not work
well for data that is not evenly distributed or that has clusters of varying densities.
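These properties can be seen in a small sketch: two dense synthetic blobs plus deliberate outliers, clustered without specifying the number of clusters (the eps and min_samples values are illustrative tuning choices):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few far-away noise points.
X = np.vstack([rng.normal(0, 0.3, (40, 2)),
               rng.normal(4, 0.3, (40, 2)),
               [[10, 10], [-10, 8]]])

# eps: neighbourhood radius; min_samples: points needed to form a core point.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# DBSCAN labels noise points as -1, so outliers fall out for free.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
```

Note that no cluster count was passed in: both clusters and the outliers were discovered from the density structure alone.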
The algorithm is able to handle noisy or outlier data well, making it a good choice for
data that may not be well-behaved. However, like DBSCAN, it is sensitive to the choice
of parameters and can be affected by the presence of noise or outliers in the dataset.
EXPERIMENTS & RESULTS
The Silhouette Coefficient is a useful tool for evaluating the performance of a clustering
algorithm and for selecting the appropriate number of clusters for a dataset. Here are
some points to remember about the Silhouette Coefficient:
1. The score is constrained to a range of -1 for defective clustering and +1 for dense
clustering. Scores near zero indicate clusters that overlap.
2. When clusters are dense and well divided, the score is greater, which refers to a
common definition of a cluster.
3. Convex clusters typically have a larger Silhouette Coefficient than other types of
clusters, such as the density-based clusters found by DBSCAN.
It is important to note that the Silhouette Coefficient is sensitive to the number of clusters
chosen and may not be reliable when the number of clusters is not appropriate for the
data. Thus, we consider its limitations as well and use it in conjunction with other
evaluation methods to get a more comprehensive understanding of the clustering results.
The Calinski-Harabasz (CH) Index is defined as

CH = [BGSS / (K - 1)] / [WGSS / (N - K)]

where:
N is the total number of observations,
K is the total number of clusters,
BGSS is the between-group sum of squares (between-group dispersion), and
WGSS is the within-group sum of squares (within-group dispersion).
A high CH value indicates better clustering: the observations within each cluster are
more tightly packed (denser), and the clusters themselves are well separated. One of the
main advantages of the CH Index is that it is relatively easy and fast to compute, as it
only requires the mean and variance of each cluster, which can be easily calculated from
the data.
The Davies-Bouldin index is a metric that links cluster size to the distance between
clusters. A model with a lower Davies-Bouldin index separates its clusters better. The
index is defined as the ratio between within-cluster scatter and between-cluster
separation. The lowest possible score is zero, and values nearer to 0 denote better
partitions.
One of the main advantages of the Davies-Bouldin score is that its computation is less
complicated than that of the Silhouette score. Also, since its computation uses only
point-wise distances, the index relies solely on elements and characteristics present in
the dataset.
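All three evaluation metrics are available in scikit-learn; a sketch computing them on one synthetic clustering (the two-blob data is invented, and the threshold interpretations in the comments follow the definitions above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Two well-separated synthetic clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (60, 2)),
               rng.normal(3, 0.4, (60, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1]; higher is better
ch = calinski_harabasz_score(X, labels)  # higher is better
db = davies_bouldin_score(X, labels)     # >= 0; lower is better
```

On dense, well-separated clusters like these, the silhouette score is near 1, CH is large, and Davies-Bouldin is near 0, which is the pattern the report uses to select the number of clusters.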
4.2. Results
Fig 3: C-index minimum for k=19.5
In DBSCAN clustering, the model's min_samples parameter was taken as the average of
the optimal values from the silhouette, Calinski-Harabasz, and Davies-Bouldin scores,
i.e., min_samples = 12. Training a classification model with this optimal value gave an
F1 score of 93%.
Fig 8: Davies-Bouldin score for different values of k
In Agglomerative clustering, the optimal number of clusters for the model was taken as
the average of the optimal values from the silhouette, Calinski-Harabasz, and
Davies-Bouldin scores, i.e., k = 4. Training a classification model with this optimal value
gave an F1 score of 92%.
Fig 10: C-Index score for different values of k
In HDBSCAN clustering, min_samples under the ‘leaf’ selection metric is 4 and the
resulting value of n_clusters is 2. Training classification models on these clusters gives
an F1 score of 95%.
Table 1: HDBSCAN Clustering Parameters

Parameter         | Value
------------------|----------
min_cluster_size  | 182.00000
min_samples       |   4.00000
validity_score    |   0.02719
n_clusters        |   2.00000
CONCLUSION
FUTURE SCOPE
Our study can be extended further and clustering algorithms can be improved to enhance
the accuracy and effectiveness of user matching systems. Currently, many user matching
systems rely on a limited set of data points, such as age, gender, and location, to group
users into clusters. However, incorporating additional data points, such as interests,
preferences, and behavior patterns, can provide a more comprehensive and accurate
representation of users and improve the accuracy of cluster assignments.
Machine learning algorithms, such as neural networks and support vector machines, can
be utilized to improve the accuracy of clustering algorithms by learning patterns and
relationships in data. This can help to identify more complex and subtle similarities
between users, resulting in more accurate and relevant clusters. Along with that,
combining multiple clustering algorithms and techniques can improve the accuracy and
robustness of user matching systems. For example, using a combination of hierarchical
and k-means clustering can provide a more comprehensive analysis of user data and
result in more accurate cluster assignments.
Also, improving the visualization and analysis tools used to understand and interpret user
clusters can help identify patterns and trends in user data and optimize cluster
assignments. Additionally, user feedback can be used to improve the accuracy of these
systems: by soliciting feedback from users about the accuracy of the system's
recommendations and the usefulness of the matched groups, the system can continuously
improve and adapt to user preferences.
APPENDICES
Personalization algorithms:
Personalization algorithms work by analyzing data about a user's past interactions and
using that data to make predictions about what the user might be interested in or likely to
purchase. For example, a recommendation algorithm might suggest a book to a user
based on their past purchases or a music streaming service might recommend songs based
on a user's listening history.
Collaborative filtering:
There are two main types of collaborative filtering: user-based and item-based. In
user-based collaborative filtering, recommendations are made based on the ratings of
similar users. For example, if a user has rated several movies highly and another user has
also rated those same movies highly, the second user may be recommended additional
movies that the first user has rated highly.
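A toy sketch of user-based collaborative filtering on a tiny rating matrix (the ratings, the cosine similarity choice, and the "rated at least 4" threshold are all illustrative assumptions):

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated). Rows: users, columns: movies.
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 5.0, 2.0],
              [1.0, 0.0, 5.0, 4.0]])

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Find the user most similar to the target, based on rating vectors.
target = 0
sims = [cosine(R[target], R[other]) if other != target else -1.0
        for other in range(len(R))]
neighbour = int(np.argmax(sims))

# Recommend items the neighbour rated highly that the target hasn't rated.
recommended = [j for j in range(R.shape[1])
               if R[target, j] == 0 and R[neighbour, j] >= 4]
```

Here user 1 rates the same movies highly as user 0, so user 1's highly rated, still-unrated movie is what gets recommended to user 0.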
One of the key benefits of collaborative filtering is that it can handle large amounts of
data and make personalized recommendations in real-time. It also does not require
explicit feedback from users, as it relies on the ratings and preferences of other users to
make recommendations.
Context-aware Systems:
Context-aware systems are computing systems that are able to adapt to changing
environments and circumstances by taking into account the context in which they are
operating. These systems are designed to be highly responsive and flexible, able to adapt
to new situations and environments in order to deliver optimal performance.
One of the key features of context-aware systems is their ability to gather and analyze
data from various sources in order to make informed decisions. This data can come from
a variety of sources, including sensors, user input, and external data feeds. By gathering
and analyzing this data, context-aware systems are able to learn about their environment
and adapt to it in real-time.
REFERENCES
1. Statista. Global Social Media Ranking 2022 [Online]. Available:
https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number
-of-users/
2. Statista. Online Dating India- Revenue Highlights 2022 [Online]. Available:
https://www.statista.com/outlook/dmo/eservices/dating-services/online-dating/india
3. K. Bindra and A. Mishra, "A detailed study of clustering algorithms," 6th
International Conference on Reliability, Infocom Technologies and Optimization
(ICRITO). (2017).
4. Thomas Olsson, Jukka Huhtamäki, and Hannu Kärkkäinen. Directions for
professional social matching systems. Commun. ACM 63, 2 60–69, February
(2020).
5. Sangeetha Kutty, Richi Nayak, Lin Chen, “A People-to-People Matching System
using Graph Mining Techniques”, World Wide Web, Volume 17, Issue 3, (2014).
6. D. Qin, X. Zhou, L. Chen, G. Huang and Y. Zhang, "Dynamic Connection-Based
Social Group Recommendation," in IEEE Transactions on Knowledge and Data
Engineering, vol. 32, no. 3, pp. 453-467, 1 March (2020)
7. Yunfei Yu and Yinghua Zhou, "Research on recommendation system based on
interest clustering", AIP Conference Proceedings 1820, 080021 (2017)
8. Mendonça, Luziane. “An Approach for Personalized Social Matching Systems by
Using Ant Colony”. Social Networking. 03. 102-107. (2014).
9. Xu, Rui & Wunsch, Donald. “Survey of Clustering Algorithms”. Neural Networks,
IEEE Transactions on. 16. 645 - 678. (2005).
10. Bin, Sheng & Sun, Gengxin & Zhang, Peijian & Zhou, Yixin. “Tag-Based
Interest-Matching Users Discovery Approach in Online Social Network”.
International Journal of Hybrid Information Technology. 9. 61-70. (2016).
11. F. Li, Y. He, B. Niu, H. Li and H. Wang, "Match-MORE: An efficient private
matching scheme using friends-of-friends' recommendation," International
Conference on Computing, Networking and Communications (ICNC), pp. 1-6.
(2016)
12. Cheng, Weijie & Yin, Guisheng & Dong, Yuxin & Dong, Hongbin & Zhang,
Wansong. Collaborative Filtering Recommendation on Users’ Interest Seq. PLOS
ONE. (2016).
13. Hartigan, J. A., and M. A. Wong. “Algorithm AS 136: A K-Means Clustering
Algorithm.” Journal of the Royal Statistical Society. Series C (Applied Statistics).
(1979).
14. S. Na, L. Xumin and G. Yong, "Research on k-means Clustering Algorithm: An
Improved k-means Clustering Algorithm," Third International Symposium on
Intelligent Information Technology and Security Informatics, pp. 63-67. (2010).
15. K. P. Sinaga and M. -S. Yang, "Unsupervised K-Means Clustering Algorithm," in
IEEE Access, vol. 8, pp. 80716-80727. (2020).
16. Tran, Thanh & Drab, Klaudia & Daszykowski, Michal. “Revised DBSCAN
algorithm to cluster data with dense adjacent clusters”. Chemometrics and Intelligent
Laboratory Systems. 120. 92–96. (2013).
17. K. Khan, S. U. Rehman, K. Aziz, S. Fong and S. Sarasvady, "DBSCAN: Past,
present and future," The Fifth International Conference on the Applications of
Digital Information and Web Technologies, pp. 232-238. (2014).
18. Stewart G, Al-Khassaweneh M. An Implementation of the HDBSCAN* Clustering
Algorithm. Applied Sciences. (2022).
19. Sasirekha, K., & Baby, P. Agglomerative hierarchical clustering algorithm-a.
International Journal of Scientific and Research Publications, 83(3), 83. (2013).
20. Ackermann, M.R., Blömer, J., Kuntze, D. et al. Analysis of Agglomerative
Clustering. Algorithmica 69, 184–215. (2014).
RESEARCH PAPER
RESEARCH PAPER CONFERENCE SUBMISSION