
Recommendation mechanism to forge connections

between users with similar interests


A MINOR PROJECT
REPORT

Submitted by

Indrakant Dana Akshat Ajay Udit Agarwal


Enrollment No: 02514802719 Enrollment No: 75714802719 Enrollment No: 12914802719

BACHELOR OF TECHNOLOGY
IN

COMPUTER SCIENCE AND ENGINEERING

Under the Guidance of
Mr. Saurabh Rastogi
Mr. Saurabh Rastogi
(Assistant Professor, CSE)

Department of Computer Science and Engineering

Maharaja Agrasen Institute of Technology,
PSP Area, Sector – 22, Rohini, New Delhi – 110086
(Affiliated to Guru Gobind Singh Indraprastha University, New Delhi)

(DEC 2022)
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY
Department of Computer Science and Engineering

CERTIFICATE
This is to certify that this MINOR project report, “Recommendation
mechanism to forge connections between users with similar interests”, is
submitted by Indrakant Dana (02514802719), Akshat Ajay (75714802719), and
Udit Agarwal (12914802719), who carried out the project work under my
supervision.

I approve this MINOR project for submission.

Prof. Namita Gupta Mr. Saurabh Rastogi


(HoD, CSE) (Assistant Professor, CSE)
(Project Guide)

ABSTRACT
This project addresses the problem individuals face when searching for people with
similar interests. It is common to look for like-minded people when traveling to an
unfamiliar city. A person's interests can range from films or music to hobbies, certain
personality traits, lifestyle, and more. These traits and interests are entered by the user
at the time of registration and become part of their profile. The number of users keeps
growing, and new accounts are registered every day; despite this growth, only a few
contacts are made daily. A matchmaking system based on the text in a user's profile can
be used in this case. A user's profile text serves as their fingerprint, and each of these
fingerprints can be embedded as a vector in a high-dimensional space using deep
learning. As users with similar interests will have similar vectors, we can analyze this
large body of behavioral data by forming groups or clusters of like-minded individuals.
In this project, users with similar interests are recommended based on these vectors
using clustering algorithms and classification techniques, with the best possible
accuracy. Such interest-based matching can lead to a more engaging and satisfying user
experience, as users are more likely to find products or services that interest them. By
providing personalized recommendations and targeting marketing efforts more
effectively, individuals are able to save time, gain control over their experience, and
reduce the likelihood of undesirable matches.

ACKNOWLEDGEMENT

It gives me immense pleasure to express my deepest sense of gratitude and sincere
thanks to my respected guide Mr. Saurabh Rastogi, Assistant Professor, CSE, MAIT
Delhi, for his valuable guidance, encouragement and help in completing this work.
His useful suggestions throughout this work and his cooperative behavior are sincerely
acknowledged.

I also wish to express my indebtedness to my parents as well as my family members


whose blessings and support always helped me to face the challenges ahead.

Indrakant Dana Udit Agarwal Akshat Ajay


(02514802719) (12914802719) (75714802719)

Place: Delhi
Date:

TABLE OF CONTENTS

List of Figures

List of Tables

1. Introduction
1.1. Need and Significance
1.2. Objective
1.3. Technologies Used
1.4. Motivation

2. Literature Survey

3. Approach/Methodology
3.1. Preliminaries
3.2. Data Gathering and Generation
3.3. Data Pre-Processing
3.4. Algorithms Used
3.4.1. KMeans Clustering
3.4.2. DBSCAN Clustering
3.4.3. HDBSCAN Clustering
3.4.4. Agglomerative Hierarchical Clustering

4. Experiments & Results
4.1. Performance Evaluation Metrics
4.1.1. Silhouette Coefficient
4.1.2. Calinski-Harabasz Index
4.1.3. Davies-Bouldin Score
4.2. Results

5. Conclusion

6. Future Scope

7. Appendices

8. References

9. Research Paper

10. Research Paper Conference Submission
LIST OF FIGURES

1. DBSCAN algorithm
2. KMeans - Silhouette score for different values of k
3. KMeans - C-index minimum for k=19.5
4. KMeans - Davies-Bouldin score
5. KMeans - Gap-statistic lower on the left side
6. DBSCAN - Silhouette score for different values of k
7. DBSCAN - C-index minimum for k=19.5
8. DBSCAN - Davies-Bouldin score for different values of k
9. AG Clustering - Silhouette score
10. AG Clustering - C-index score for different values of k
11. AG Clustering - Davies-Bouldin score for different values of k
12. Number of points vs λ-value

LIST OF TABLES

1. HDBSCAN Clustering Parameters
2. Scores of different metrics for the Algorithms used

INTRODUCTION

1.1 Need and Significance

Online interactions between individuals are becoming more frequent as web usage
increases. With enhanced Web technology and rising Web popularity, users are
increasingly turning to online social networks to connect with new friends or similar
users. Many prominent social networking sites have appeared in the previous decade, and
the number of monthly active members has also increased rapidly.

Social matching, which refers to computational methods of detecting and fostering new
social connections between people, is a relatively new strategy for this purpose. In the
absence of an automatic selection method, choosing the best candidate from a large pool
of applicants becomes time-consuming and largely impractical. Social networks are
therefore increasingly adopting solutions that dynamically support matchmaking by
recommending potential matches. The connections users share in these networks, and
the approach used to form those connections, are a crucial part of these social networks.
One popular approach is to use sub-networks of users having similar interests, opinions,
and lifestyles. One of the key benefits of user-matching systems is their ability to provide
personalized recommendations to users. By analyzing data on user behavior, interests,
and preferences, these systems are able to make suggestions that are tailored to each
individual user. This can lead to a more engaging and satisfying user experience, as
users are more likely to find products or services that they are interested in.

Overall, user-matching systems are a valuable technology that can help businesses
improve the user experience and increase conversions. They have become an integral part
of the online matching experience, providing individuals with a more efficient and
effective way to find potential partners.

1.2 Objective

● As a way to streamline matchmaking, we propose to build an ML model that
adopts a recommendation system based on various clustering algorithms.
● To explore various clustering algorithms and compare them, in order to obtain the
best possible accuracy in terms of the matches given by the recommendation system.
● Users are given a profile feature that serves as their unique digital fingerprint. The
interests and personality traits of the user are included in their profile, which can
be modified by the user. When recommending a user, their profile information is
used.

1.3 Technologies Used

1.3.1 Python
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built-in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development, as well as
for use as a scripting or glue language to connect existing components together. Python's
simple, easy to learn syntax emphasizes readability and therefore reduces the cost of
program maintenance. Python supports modules and packages, which encourages
program modularity and code reuse. The Python interpreter and the extensive standard
library are available in source or binary form without charge for all major platforms, and
can be freely distributed. Since there is no compilation step, the edit-test-debug cycle is
incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a
segmentation fault. Instead, when the interpreter discovers an error, it raises an exception.
The debugger is written in Python itself, testifying to Python's introspective power.

The Python language comes with many libraries and frameworks that make coding easy.
This also saves a significant amount of time. The most popular libraries are NumPy,
which is used for scientific calculations; SciPy, for more advanced computations; and
scikit-learn, for data mining and data analysis.
These libraries work alongside powerful frameworks like TensorFlow, CNTK, and
Apache Spark. These libraries and frameworks are essential when it comes to machine
and deep learning projects.

1.3.2 Matplotlib
Matplotlib is a cross-platform data visualization and graphical plotting library for Python
and its numerical extension NumPy. As such, it offers a viable open-source alternative to
MATLAB. Developers can also use Matplotlib's APIs (Application Programming
Interfaces) to embed plots in GUI applications. Matplotlib is used to create 2D graphs and
plots using Python scripts. It has a module named pyplot which makes plotting easy by
providing features to control line styles, font properties, axis formatting, etc. It supports a
very wide variety of graphs and plots, namely histograms, bar charts, power spectra,
error charts, etc. It is used along with NumPy to provide an environment that is an
effective open-source alternative to MATLAB. It can also be used with graphics toolkits
like PyQt and wxPython.
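
As a small illustration of the pyplot interface described above, the following sketch (with made-up data) draws a labeled line plot:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(0, 10, 100)        # 100 evenly spaced sample points
    plt.plot(x, np.sin(x), label="sin(x)")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()                         # opens an interactive plot window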

1.3.3 Numpy
NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the
fundamental package for scientific computing with Python. It is open-source software. It
contains various features including these important ones:

● A powerful N-dimensional array object.


● Sophisticated (broadcasting) functions.
● Tools for integrating C/C++ and Fortran code.
● Useful linear algebra, Fourier transform, and random number capabilities.

Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data types can be defined, which
allows NumPy to seamlessly and speedily integrate with a wide variety of
databases.

1.3.4 BeautifulSoup
Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup
languages. Say you’ve found some webpages that display data relevant to your research,
such as date or address information, but that do not provide any way of downloading the
data directly. Beautiful Soup helps you pull particular content from a webpage, remove
the HTML markup, and save the information. It is a tool for web scraping that helps you
clean up and parse the documents you have pulled down from the web.
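
A minimal sketch of this scrape-and-parse workflow is shown below; the URL and the choice of tag are placeholders, not the actual source used for our generated profiles:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/profiles").text   # placeholder URL
    soup = BeautifulSoup(html, "html.parser")
    # Pull the text of every paragraph tag, with the HTML markup stripped
    bios = [p.get_text(strip=True) for p in soup.find_all("p")]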

1.3.5 Seaborn
Seaborn is an amazing visualization library for statistical graphics plotting in Python. It
provides beautiful default styles and color palettes to make statistical plots more
attractive. It is built on top of the Matplotlib library and is closely integrated with the data
structures from pandas. Seaborn aims to make visualization a central part of exploring
and understanding data. It provides dataset-oriented APIs, so that we can switch between
different visual representations of the same variables for a better understanding of the dataset.

1.3.6 NLTK Library

NLTK (Natural Language Toolkit) is a suite that contains libraries and programs
for statistical language processing. It is one of the most powerful NLP libraries, and it
contains packages to make machines understand human language and reply to it with an
appropriate response. One of the most useful applications of NLP is in data pre-processing:
the process of cleaning unstructured text data so that it can be used to predict, analyze, and
extract information. Real-world text data is unstructured and inconsistent, so data
preprocessing becomes a necessary step. Tokenization, lemmatization, and frequency
distribution of words are some of the important methods in data pre-processing.
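
A brief sketch of tokenization, stop-word removal, and lemmatization with NLTK (the download calls are one-time setup; the sample bio is made up):

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords

    nltk.download("punkt")
    nltk.download("wordnet")
    nltk.download("stopwords")

    bio = "Joking around, watching movies, and hiking on weekends"
    tokens = word_tokenize(bio.lower())              # split the sentence into words
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    # Keep alphabetic, non-stop-word tokens, reduced to their base form
    words = [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop]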

1.3.7 Streamlit
Streamlit is an open-source Python framework for building web apps for Machine
Learning and Data Science. We can instantly develop web apps and deploy them easily
using Streamlit. Streamlit allows you to write an app the same way you write Python
code, and it makes it seamless to work in the interactive loop of coding and viewing
results in the web app.
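
A toy app (hypothetical, not this project's actual interface) shows the pattern; it is launched with the command streamlit run app.py:

    import streamlit as st

    st.title("Find similar users")
    bio = st.text_area("Enter your bio")       # widgets return their current value
    if st.button("Find matches"):
        st.write("Searching for users similar to:", bio)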

1.3.8 Pandas
Pandas is an open-source library that is made mainly for working with relational or
labeled data both easily and intuitively. It provides various data structures and operations
for manipulating numerical data and time series. This library is built on top of the NumPy
library. Pandas is fast and offers high performance and productivity to its users.

Pandas stands for ‘panel data’. Note that pandas is typically stylized as an all-lowercase
word, although it is considered a best practice to capitalize its first letter at the beginning
of sentences. Pandas was designed to work with two-dimensional data (similar to Excel
spreadsheets). Just as the NumPy library has a built-in data structure called an array with
special attributes and methods, the pandas library has a built-in two-dimensional data
structure called a DataFrame.

This tool is essentially your data’s home. Through pandas, you get acquainted with your
data by cleaning, transforming, and analyzing it. Pandas is built on top of the NumPy
package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in
pandas is often used to feed statistical analysis in SciPy, plotting functions from
Matplotlib, and machine learning algorithms in Scikit-learn.
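
For instance, a DataFrame of user interest scores (made-up values) can be built and summarized in a few lines:

    import pandas as pd

    df = pd.DataFrame(
        {"movies": [7, 2, 9], "sports": [1, 8, 3], "politics": [4, 5, 0]},
        index=["user_a", "user_b", "user_c"],
    )
    print(df.describe())                 # per-column summary statistics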

1.3.9 Pickle
The pickle module implements binary protocols for serializing and de-serializing a
Python object structure. “Pickling” is the process whereby a Python object hierarchy is
converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte
stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
The process of converting any kind of Python object (list, dict, etc.) into a byte stream (0s
and 1s) is called pickling, serialization, flattening, or marshalling. We can convert the
byte stream (generated through pickling) back into Python objects by a process called
unpickling.
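
Saving and restoring a fitted object, for example, takes one call each way (the file name here is arbitrary):

    import pickle

    model = {"n_clusters": 2, "algorithm": "hdbscan"}   # any Python object
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)                # pickling: object -> byte stream

    with open("model.pkl", "rb") as f:
        restored = pickle.load(f)            # unpickling: byte stream -> object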

1.3.10 Sklearn
Scikit-learn (sklearn) is one of the most useful and robust libraries for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical
modeling, including classification, regression, clustering, and dimensionality reduction, via
a consistent interface in Python. This library, which is largely written in Python, is built
upon NumPy, SciPy and Matplotlib.

scikit-learn is an open-source Python library that implements a range of machine learning,


pre-processing, cross-validation, and visualization algorithms using a unified interface.

The package provides functions for data mining and machine learning algorithms for
data analysis that are easy to use and effective. Support vector machines, gradient
boosting, random forests, k-means, and other regression, classification, and clustering
algorithms are included.

1.4 Motivation

User-matching systems are a valuable technology that can help businesses improve the
user experience and increase conversions. They have become an integral part of the online
matching experience, providing individuals with a more efficient and effective way to
find potential partners. Various studies have been conducted in the past focusing on the
methods used in the formation of connections between users. Research on social
matching systems has identified several drawbacks, including:

1. Limited personalization: Many social matching systems rely on algorithms that do
not take into account individual preferences and may result in mismatches or
unsatisfactory matches.
2. Lack of transparency and fairness in the matching algorithm: Many social
matching systems use proprietary algorithms that are not publicly available, which
can lead to concerns about fairness and bias.
3. Difficulty in maintaining user engagement: Social matching systems often rely on
user-generated content and activity, but this can be difficult to sustain over time,
leading to a decline in user engagement and activity.
4. Limited scope and functionality: Many social matching systems are focused on a
specific type of interaction or activity, such as dating or job searching, which
limits their potential usefulness and appeal to users.
5. Privacy concerns: Social matching systems often require users to provide personal
information, which can raise concerns about data privacy and security.

By providing personalized recommendations and targeting marketing efforts more
effectively, these systems help individuals save time, gain control over their dating
experience, and reduce the likelihood of rejection. They can be highly effective in helping
users find other users with whom they are compatible, but they can also have potential
drawbacks, such as bias or discrimination. It is important for designers of user-matching
systems to carefully consider these issues and take steps to mitigate them.

LITERATURE SURVEY

There have been numerous studies and research projects focused on developing effective
user matchmaking algorithms in various contexts.

[7] presents a collaborative filtering recommendation algorithm that uses user interest
clustering to address the limitations of traditional algorithms. The improved algorithm has
been shown to have higher recommendation efficiency and precision through
experimental results. In [8], an algorithm based on Ant Colony optimization is proposed
to solve the optimization problem of clustering/matching people in a social network. The
numerical results show that the algorithm can successfully perform clustering with a
variable number of individuals. A tag-based common interest discovery approach in
online social networks is proposed in [10]. It is suggested that user-generated tags are
effective for representing user interests because they can more accurately reflect
understanding. The approach is able to effectively discover common interest topics in
online social networks, such as Douban, without any information on the online
connections among users. A different type of approach under the project called
Match-MORE has been proposed in [11] to address the issues surrounding
Proximity-based Mobile Social Networks, which can pose risks to user privacy and create
a system overhead. The concept of friends-of-friends is utilized in Match-MORE to find
common connections among friends and design a private matching scheme. This is
achieved through the use of a novel similarity function that takes into account both the
social strength between users and the similarity of their profiles. It allows users to
discover new potential friends based on recommendations from their existing friends,
with adjustable accuracy and without disclosing too much personal information. The use
of Bloom filters to estimate common attributes helps to reduce the system overhead. The
security and performance of this project have also been carefully analyzed and evaluated
through simulations.

The above research concentrated on user recommendation mechanisms; however, we can
still gain valuable insights from studies on group recommendation systems. The proposed
system in [5] focuses on conducting a thorough investigation of online dating networks.
To do this, the system models the network as a graph and performs an analysis using
social network analysis (SNA) methods. Through this analysis, the system has identified

several strengths and weaknesses of the online dating network. To improve the
effectiveness of the network, the proposed system utilizes a recommender system that
combines both the attributes of the vertices (individual users) and the interactions within
the network. This allows for the identification of more compatible and potentially
successful (12% improvement in accuracy) matches within the network. [6] presents a
method for recommending social groups in online services by extracting multiple
interests and adaptively selecting similar users and items to create a compact rating
matrix for efficient collaborative filtering. The approach was evaluated through extensive
experiments. In [12], a new collaborative filtering (CF) recommendation method is
proposed that is based on users' interests and sequences (IS). The concept of "interest
sequences" is defined in order to depict the dynamic evolution patterns of users' interests
in online recommendation systems, drawing inspiration from work in location-based
recommendation systems. The method for calculating users' similarities based on IS,
which is used to identify the top K nearest neighbors of the target user, is introduced,
along with methods for calculating the length of the longest common sub-IS (LCSIS) and
the count of all common sub-IS (ACSIS). Finally, a method for predicting users' ratings
for unrated items is presented, and the effectiveness of the proposed recommendation
method is verified through comprehensive experiments on three datasets, examining the
influence of several factors on the results.

Much of the earlier research on this issue has a group of users as its primary audience
instead of one specific user. The algorithms used by existing applications that offer
matchmaking services based on interests are not open source and are kept secret. Our goal
is to investigate various clustering algorithms [3] and determine the optimal strategy for
delivering matches that are more accurate. Our analysis and research indicate that we may
create a recommendation system by integrating multiple clustering algorithms [3][10],
such as DBScan clustering and hierarchical clustering using various linkages.

METHODOLOGY

3.1 Preliminaries

The data needed by matchmaking applications to suggest potential partners can be
broken down into the following categories:

● Demographics: age, gender, location, education level, occupation


● Interests: hobbies, activities, sports, music, movies, books
● Personal values: political views, religious beliefs, relationship preferences
● Personality traits: introversion/extroversion, openness, conscientiousness,
agreeableness, neuroticism
● Social connections: friends, family, colleagues, shared connections with other
users
● Behavioral data: online activity, preferences, interactions with other users on
the platform.

We would need user data containing the attributes mentioned above, such as a bio and
interest scores for different topics, with each interest measured on a scale of 0-9.
Additionally, we would need to collect data on the compatibility between individuals,
such as their common interests and a compatibility score. This information would be used
to match individuals with each other based on their preferences and compatibility.

3.2 Data Gathering and Generation

The machine learning algorithms we have devised require a significant quantity of data to
train our model. Because such a dataset is not publicly available on the internet, we
generate fake user profile data. There are various websites that can assist us in generating
a large quantity of fake data. By using BeautifulSoup for web scraping, we can turn the
generated data into a data frame. For each user, we need data that shows their interests in
different categories like movies, sports, and politics. This can be done by randomly
assigning numbers from 0 to 9 to each category in the data frame.
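
A sketch of this generation step is given below, assuming the scraped profile text is already collected in a list; the bios, category names, and seed are illustrative:

    import numpy as np
    import pandas as pd

    bios = ["loves indie films", "weekend cricketer", "history podcast addict"]
    categories = ["movies", "tv", "sports", "music", "politics", "religion"]

    rng = np.random.default_rng(42)
    # Random interest scores from 0 to 9 for each user and category
    df = pd.DataFrame(rng.integers(0, 10, size=(len(bios), len(categories))),
                      columns=categories)
    df.insert(0, "bio", bios)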

3.3 Data Preprocessing

We can preprocess the user bio data using NLP to find the words most frequently used
by users in their bios across the platform. Using the NLTK library we can
perform tokenization and lemmatization of the user bio data, i.e., splitting full sentences
into individual words and converting words into their base form. For example,
‘Joking’ is converted into ‘Joke’. We take an extra measure to exclude stop words like ‘a’,
‘the’, ‘of’, etc. Our next step is to build a set of unique words with their
corresponding usage frequency across the platform. Many of these
words might be adjectives attached to a noun, as in ‘c++ enthusiast’; we need to pair such
words. We make a list of such word pairs, find their frequency scores, and build up our
data frame for the next step.

To improve the performance of our clustering algorithm, we will proceed to scale the
categories (such as movies, TV, religion, etc.). This will reduce the time required to fit the
algorithm to the dataset. The next phase entails structuring the data into a
series of numerical vectors, where each vector stands for a particular data point. This is
known as vectorization of the users' bio data. We'll employ the Count Vectorization
and TF-IDF Vectorization techniques. The two distinct data frames created
by these two methods will be combined and scaled into a new dataset.
Because there are too many dimensions in the dataset, we will use Principal Component
Analysis (PCA), a statistical method for decreasing a dataset's dimensionality. It
accomplishes this by translating the data into a new coordinate system with principal
components as the new axes. Once our data is ready, the techniques discussed above can
be used to locate the ideal number of clusters based on evaluation metrics like the
Silhouette Coefficient, Davies-Bouldin Score, and Calinski-Harabasz Score. These metrics
will assess how effectively the clustering algorithms work. After an in-depth analysis of
the algorithms using the scoring metrics, we concluded that the optimal number of
clusters for a better match probability is 2. We will then train a classification model using
the optimal cluster value and aim for the best possible classification accuracy.
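
A condensed sketch of this pipeline is shown below, assuming df holds the generated bios from the previous step; the 95% variance cutoff for PCA is an assumed setting, not a figure from our experiments:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.decomposition import PCA

    count_df = pd.DataFrame(CountVectorizer().fit_transform(df["bio"]).toarray())
    tfidf_df = pd.DataFrame(TfidfVectorizer().fit_transform(df["bio"]).toarray())

    combined = pd.concat([count_df, tfidf_df], axis=1)   # merge the two data frames
    scaled = MinMaxScaler().fit_transform(combined)      # rescale features to [0, 1]

    pca = PCA(n_components=0.95)   # keep components explaining 95% of the variance
    reduced = pca.fit_transform(scaled)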

When a user wants to look for similar users, their data will be treated as new. Entering
new data into existing clusters will require running the clustering algorithms all over
again. Before that, the new data would go through NLP processing, tokenization, scaling,
vectorization, and PCA. After this, the clustering algorithms would run and provide the
top 10 matches to the user.
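
In outline, serving a new user could look like the sketch below. It is simplified relative to the full pipeline (a single fitted vectorizer, no scaling step) and uses KMeans for concreteness, since KMeans supports prediction on unseen points; distance-based ranking inside the assigned cluster is one reasonable way to pick the top 10:

    import numpy as np

    def top_matches(new_bio, vectorizer, pca, kmeans, reduced, k=10):
        # Project the new user into the same PCA space as existing users
        vec = pca.transform(vectorizer.transform([new_bio]).toarray())
        label = kmeans.predict(vec)[0]               # cluster of the new user
        members = np.where(kmeans.labels_ == label)[0]
        # Rank cluster members by Euclidean distance to the new user
        dists = np.linalg.norm(reduced[members] - vec, axis=1)
        return members[np.argsort(dists)[:k]]        # indices of the top-k matches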

3.4 Algorithms Used

3.4.1 KMeans Clustering


K-means clustering is an unsupervised machine learning algorithm for clustering data into
k groups, or clusters. The goal of the algorithm is to partition the data into clusters such
that the data points within a cluster are more similar to each other than they are to data
points in other clusters. The algorithm works by first initializing k centroids or the center
points of the clusters. Then, the data points are assigned to the cluster whose centroid is
closest to the data point. The centroids are then updated based on the mean of the data
points assigned to the cluster. This process is repeated until the centroids no longer move
or the assignments of data points to clusters stop changing.

One of the main advantages of k-means clustering is that it is fast and efficient,
especially for large datasets. However, it can be sensitive to the initial placement of the
centroids and may not always produce the best possible clusters.
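
With scikit-learn, the loop described above reduces to a single fit call; the toy data below is random, and k = 11 echoes the value found in our experiments (Section 4.2):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 5)                   # toy feature matrix
    kmeans = KMeans(n_clusters=11, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:10])                   # cluster assignment per point
    print(kmeans.cluster_centers_.shape)         # (11, 5): one centroid per cluster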

3.4.2 DBSCAN Clustering


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a
density-based clustering algorithm. It works by identifying clusters of high density,
defined as groups of points that are closely packed together, and marking points that are
not part of any cluster as noise. The algorithm starts by identifying a point at random and
then searching its surrounding area to find other points that are close by. If it finds enough
points in the surrounding area to form a cluster, it will mark all of those points as part of
the cluster. If it doesn't find enough points, it will mark the starting point as noise.

Fig 1: The DBSCAN algorithm was utilized to generate two clusters, which consist of
three types of points: key points (green) that meet the criteria for clustering, border points
(blue) that do not meet the clustering criteria but are within the reach of a key point, and
noise points (black) that do not fit into either.

One of the advantages of DBSCAN is that it does not require the user to specify the
number of clusters in advance. This is useful because the number of clusters in a dataset
is often not known beforehand. It also has the ability to identify and label points that are
not part of any cluster, which is useful for outlier detection. However, DBSCAN has
some disadvantages as well. It can be sensitive to the choice of parameters and can be
affected by the presence of noise or outliers in the dataset. Additionally, it does not work
well for data that is not evenly distributed or that has clusters of varying densities.
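
In scikit-learn, the two controlling parameters are the neighborhood radius eps and min_samples; the eps value below is illustrative, while min_samples = 12 echoes our results in Section 4.2:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(200, 5)                   # toy feature matrix
    db = DBSCAN(eps=0.5, min_samples=12).fit(X)
    labels = db.labels_                          # -1 marks noise points
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(n_clusters, "clusters,", (labels == -1).sum(), "noise points")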

3.4.3 HDBSCAN Clustering


HDBSCAN is an implementation [18] of the DBSCAN clustering algorithm that is able to
handle data with varying densities. It is an extension of DBSCAN that is able to
automatically determine the appropriate value for the density threshold parameter, which
controls how tightly packed the points in a cluster need to be. HDBSCAN uses a
hierarchical approach to build clusters, starting with the largest clusters and then adding
smaller clusters until all points have been assigned to a cluster. This allows it to find
clusters of different densities and shapes, and to identify points that are not part of any
cluster.
One of the advantages of HDBSCAN is that it is able to handle datasets with a large
number of dimensions, which can be challenging for other clustering algorithms. It is also
able to handle noisy or outlier data well, making it a good choice for data that may not be
well-behaved. However, like DBSCAN, it is sensitive to the choice of parameters and can
be affected by the presence of noise or outliers in the dataset.
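
A sketch using the hdbscan package (a separate install from scikit-learn); the parameter values are the ones reported in Table 1 and would need retuning for other data:

    import numpy as np
    import hdbscan

    X = np.random.rand(500, 5)                   # toy feature matrix
    clusterer = hdbscan.HDBSCAN(min_cluster_size=182, min_samples=4)
    labels = clusterer.fit_predict(X)            # -1 marks noise, as in DBSCAN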

3.4.4 Agglomerative Hierarchical Clustering


Agglomerative hierarchical clustering is a type of clustering algorithm that is used to
group data into clusters. It is a bottom-up approach, meaning that it starts by treating each
data point as a separate cluster and then iteratively merges clusters until all points are part
of a single cluster or a predetermined number of clusters has been formed. The
algorithm first calculates the distance between each pair of data points. Then, it iteratively
merges the two closest clusters, based on the distance between their points. This process
continues until all points are part of a single cluster or the desired number of clusters has
been formed. One of the advantages of agglomerative hierarchical clustering is that it
lets the user choose the number of clusters after inspecting the cluster hierarchy, rather
than fixing it in advance as in algorithms such as k-means. It is also relatively simple to
implement and can handle data with a large number of dimensions. However, it can be
computationally expensive for large datasets and is sensitive to the choice of distance
metric used to calculate the similarity between points. It is also not appropriate for data
that is not well-structured or that has clusters of varying sizes and densities.
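
scikit-learn's implementation exposes both the number of clusters and the linkage criterion; ward linkage is shown here as one common choice, and k = 4 echoes our result in Section 4.2:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    X = np.random.rand(200, 5)                   # toy feature matrix
    agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
    labels = agg.fit_predict(X)                  # merge bottom-up until 4 clusters remain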

EXPERIMENTS & RESULTS

4.1. Performance Evaluation Metrics

Evaluating the performance of a clustering method is more complicated than counting
errors or computing the precision and recall of a supervised classification algorithm. In
particular, any evaluation metric should not consider the absolute values of the cluster
labels, but rather whether the clustering defines separations of the data that are similar to
some ground-truth set of classes, or that satisfy the assumption that members of one class
are more similar to one another than to members of other classes, as measured by some
similarity metric.

4.1.1. Silhouette Coefficient


This metric measures the cohesiveness and separation of clusters, based on the difference
between the average distance to points in the same cluster and the average distance to
points in the nearest neighboring cluster. The Silhouette Coefficient is defined for each
sample and consists of two scores:
a: the sample's average distance to every other point in its own cluster.
b: the sample's average distance to every point in the nearest other cluster.

The Silhouette Coefficient s for a single sample is then given as:

s = (b − a) / max(a, b)

It is a useful tool for evaluating the performance of a clustering algorithm and for
selecting the appropriate number of clusters for a dataset. Here are some points to be
remembered about the Silhouette Coefficient:

1. The score is bounded between -1 for incorrect clustering and +1 for highly dense
clustering. Scores near zero indicate overlapping clusters.
2. The score is higher when clusters are dense and well separated, which relates to
the standard concept of a cluster.
3. Convex clusters typically have a higher Silhouette Coefficient than other types,
such as the density-based clusters found by DBSCAN.

It is important to note that the Silhouette Coefficient is sensitive to the number of clusters
chosen and may not be reliable when the number of clusters is not appropriate for the
data. Thus, we consider its limitations as well and use it in conjunction with other
evaluation methods to get a more comprehensive understanding of the clustering results.

4.1.2. Calinski-Harabasz Index


Another metric used to assess clustering algorithms is the Calinski-Harabasz
index (CH). It is most frequently used to gauge how well a K-Means clustering algorithm
splits data for a specific number of clusters. The Calinski-Harabasz index, also known as
the Variance Ratio Criterion, is the ratio of the between-cluster dispersion to the
within-cluster dispersion across all clusters (where dispersion is a sum of squared
distances).

The Calinski-Harabasz index is calculated as:

CH = [BGSS / (K − 1)] / [WGSS / (N − K)]

where:
N is the total number of observations,
K is the total number of clusters,
BGSS is the between-group sum of squares (between-group dispersion), and
WGSS is the within-group sum of squares (within-group dispersion).

A high CH value indicates better clustering: observations within each cluster are closely
spaced (dense), while the clusters themselves are well separated. One of the main
advantages of the CH index is that it is relatively easy and fast to compute, as it only
requires the mean and variance of each cluster, which can be easily calculated from the
data.

4.1.3. Davies-Bouldin Score


The Davies-Bouldin index is a validation statistic that is frequently used to determine the
ideal number of clusters. It is constructed from the average similarity of each cluster to its
most similar cluster, where "similarity" is a metric that relates the within-cluster scatter to
the distance between clusters. A model with a lower Davies-Bouldin index separates its
clusters better. The index is defined as a ratio between the cluster scatter and the clusters'
separation; the lowest possible score is zero, and values nearer to zero denote better
partitions.

One of the main advantages of the Davies-Bouldin score is that its computation is less
complicated than that of the Silhouette score. Also, as its computation exclusively makes
use of point-wise distances, the index relies solely on quantities and features that are
present in the dataset.
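
All three metrics are available in scikit-learn and take the feature matrix together with the predicted labels; a minimal sketch on toy data:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                                 davies_bouldin_score)

    X = np.random.rand(300, 5)
    labels = KMeans(n_clusters=11, n_init=10, random_state=0).fit_predict(X)

    print("Silhouette:", silhouette_score(X, labels))                 # higher is better
    print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))   # higher is better
    print("Davies-Bouldin:", davies_bouldin_score(X, labels))         # lower is better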

4.2. Results

In this study, we aimed to evaluate the performance of different clustering algorithms on
our dataset. In KMeans clustering, the optimal number of clusters for the model came out
to be the average of all the optimal k values (from the silhouette score, distortion score,
calinski_harabasz score, and davies_bouldin score), i.e. k = 11. On training a
classification model using this optimal value, we got an F1 score of 94%.

Fig 2: Silhouette score for different values of k

Fig 3: C-index minimum for k=19.5

Fig 4: Davies-Bouldin score for KMeans

Fig 5: Gap-statistic lower on the left side

In DBSCAN clustering, the minimal sample value for the model came out to be the
average of all the optimal values (from the silhouette score, calinski_harabasz score, and
davies_bouldin score), i.e. min_sample = 12. On training a classification model using
this optimal value, we got an F1 score of 93%.

Fig 6: Silhouette score for different values of k

Fig 7: C-index minimum for k=19.5

Fig 8: Davies-Bouldin score for different values of k

In Agglomerative clustering, the optimal number of clusters for the model came out to be
the average of all the optimal values (from the silhouette score, calinski_harabasz score,
and davies_bouldin score), i.e. k = 4. On training a classification model using this
optimal value, we got an F1 score of 92%.

Fig 9: Silhouette score for AG clustering

Fig 10: C-Index score for different values of k

Fig 11: Davies-Bouldin score for different values of k

In HDBSCAN clustering, min_samples in the case of the ‘leaf’ cluster selection method is
4, and the value of n_clusters is 2. On training with classification models, we get an F1
score of 95%.

Parameter          Value
min_cluster_size   182.00000
min_samples        4.00000
validity_score     0.02719
n_clusters         2.00000

Table 1: HDBSCAN Clustering Parameters

Fig 12: Number of points vs λ-value

Table 2: Scores of different metrics for the algorithms used

Algorithm                  Silhouette Score   Calinski_Harabasz   Davies_Bouldin
KMeans Clustering          0.0294             91.5824             4.9644
Agglomerative Clustering   0.034494           103.061426          4.352123
DBSCAN Clustering          0.1110             15.3085             2.1883

CONCLUSION

In this report, four clustering methods (K-means, Hierarchical Agglomerative Clustering,
DBSCAN, and HDBSCAN) have been considered, and we evaluated the
performance of these algorithms using three metrics: silhouette score, Calinski_Harabasz
index, and Davies_Bouldin score. K-means and hierarchical clustering algorithms tend to
perform well in terms of efficiency and simplicity, but may not always produce the most
accurate results. On the other hand, density-based algorithms such as DBSCAN can
produce more accurate clusters, but may be more computationally complex and require
more fine-tuning of parameters. In conclusion, the performance of clustering algorithms
for matchmaking systems can vary greatly depending on the specific characteristics of
the data and the desired outcomes of the system.

Overall, it is important to carefully evaluate the strengths and limitations of each


clustering algorithm and consider the specific needs of the matchmaking system before
deciding on the most appropriate approach. By carefully selecting and implementing the
right clustering algorithm, matchmaking systems can effectively group individuals or
items together in a way that maximizes compatibility and satisfaction. Further research
should be conducted to explore the potential of other clustering algorithms and to
optimize their performance in different scenarios.

FUTURE SCOPE

Our study can be extended further and clustering algorithms can be improved to enhance
the accuracy and effectiveness of user matching systems. Currently, many user matching
systems rely on a limited set of data points, such as age, gender, and location, to group
users into clusters. However, incorporating additional data points, such as interests,
preferences, and behavior patterns, can provide a more comprehensive and accurate
representation of users and improve the accuracy of cluster assignments.

Machine learning algorithms, such as neural networks and support vector machines, can
be utilized to improve the accuracy of clustering algorithms by learning patterns and
relationships in data. This can help to identify more complex and subtle similarities
between users, resulting in more accurate and relevant clusters. Along with that,
combining multiple clustering algorithms and techniques can improve the accuracy and
robustness of user matching systems. For example, using a combination of hierarchical
and k-means clustering can provide a more comprehensive analysis of user data and
result in more accurate cluster assignments.

Also, improving the visualization and analysis tools used to understand and interpret user
clusters can help to identify patterns and trends in user data and optimize cluster
assignments. Additionally, user feedback can be used to improve the accuracy of these
systems. By soliciting feedback from users about the accuracy of the system's
recommendations and the usefulness of the matched groups, the system can continuously
improve and adapt to user preferences.

APPENDICES

Personalization algorithms:

Personalization algorithms have become increasingly popular in recent years as


companies seek to better understand and serve their customers. These algorithms use data
and machine learning techniques to tailor products, content, and recommendations to
individual users based on their preferences and behaviors. While personalization
algorithms have the potential to improve customer experiences, they also raise concerns
about privacy and the potential for bias.

Personalization algorithms work by analyzing data about a user's past interactions and
using that data to make predictions about what the user might be interested in or likely to
purchase. For example, a recommendation algorithm might suggest a book to a user
based on their past purchases or a music streaming service might recommend songs based
on a user's listening history.

Collaborative filtering:

Collaborative filtering is a machine learning technique that aims to recommend items to


users based on the preferences and ratings of similar users. It is a popular method for
recommender systems, which are used in a variety of applications such as online retail,
streaming services, and social media platforms.

There are two main types of collaborative filtering: user-based and item-based. In
user-based collaborative filtering, recommendations are made based on the ratings of
similar users. For example, if a user has rated several movies highly and another user has
also rated those same movies highly, the second user may be recommended additional
movies that the first user has rated highly.

Item-based collaborative filtering, on the other hand, focuses on the relationships


between items rather than users. In this case, recommendations are made based on the
ratings of similar items. For example, if a user has rated several horror movies highly,
they may be recommended additional horror movies that have also received high ratings
from other users.

One of the key benefits of collaborative filtering is that it can handle large amounts of
data and make personalized recommendations in real-time. It also does not require
explicit feedback from users, as it relies on the ratings and preferences of other users to
make recommendations.
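
A toy user-based variant illustrates the idea; the ratings matrix is made up, and cosine similarity is one common choice of similarity measure:

    import numpy as np

    # Rows = users, columns = items; 0 means "not yet rated" (made-up ratings)
    R = np.array([[5, 4, 0, 1],
                  [4, 5, 5, 0],
                  [1, 0, 5, 4]], dtype=float)

    def cosine(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

    target = 0
    sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
    sims[target] = -1.0                    # exclude the user themselves
    nearest = int(sims.argmax())           # most similar other user
    # Recommend items the neighbour rated highly that the target has not rated
    recs = np.where((R[target] == 0) & (R[nearest] >= 4))[0]
    print("Recommend items:", recs)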

Context-aware Systems:

Context-aware systems are computing systems that are able to adapt to changing
environments and circumstances by taking into account the context in which they are
operating. These systems are designed to be highly responsive and flexible, able to adapt
to new situations and environments in order to deliver optimal performance.

One of the key features of context-aware systems is their ability to gather and analyze
data from various sources in order to make informed decisions. This data can come from
a variety of sources, including sensors, user input, and external data feeds. By gathering
and analyzing this data, context-aware systems are able to learn about their environment
and adapt to it in real-time.

Context-aware systems are being used in a variety of other applications, including


healthcare, transportation, and retail. In the healthcare industry, for example,
context-aware systems can be used to monitor patients and provide personalized care
based on their individual needs. In transportation, context-aware systems can be used to
optimize routes and reduce fuel consumption. In retail, context-aware systems can be
used to improve customer experience by providing personalized recommendations and
offers based on a customer's preferences and past purchases.

REFERENCES
1. Statista. Global Social Media Ranking 2022 [Online]. Available:
https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/
2. Statista. Online Dating India- Revenue Highlights 2022 [Online]. Available:
https://www.statista.com/outlook/dmo/eservices/dating-services/online-dating/india
3. K. Bindra and A. Mishra, "A detailed study of clustering algorithms," 6th
International Conference on Reliability, Infocom Technologies and Optimization
(ICRITO). (2017).
4. Thomas Olsson, Jukka Huhtamäki, and Hannu Kärkkäinen. Directions for
professional social matching systems. Commun. ACM 63, 2 60–69, February
(2020).
5. Sangeetha Kutty, Richi Nayak, Lin Chen, “A People-to-People Matching System
using Graph Mining Techniques”, World Wide Web, Volume 17, Issue 3, (2014).
6. D. Qin, X. Zhou, L. Chen, G. Huang and Y. Zhang, "Dynamic Connection-Based
Social Group Recommendation," in IEEE Transactions on Knowledge and Data
Engineering, vol. 32, no. 3, pp. 453-467, 1 March (2020)
7. Yunfei Yu and Yinghua Zhou , "Research on recommendation system based on
interest clustering", AIP Conference Proceedings 1820, 080021 (2017)
8. Mendonça, Luziane. “An Approach for Personalized Social Matching Systems by
Using Ant Colony”. Social Networking. 03. 102-107. (2014).
9. Xu, Rui & Wunsch, Donald. “Survey of Clustering Algorithms”. Neural Networks,
IEEE Transactions on. 16. 645 - 678. (2005).
10. Bin, Sheng & Sun, Gengxin & Zhang, Peijian & Zhou, Yixin. “Tag-Based
Interest-Matching Users Discovery Approach in Online Social Network”.
International Journal of Hybrid Information Technology. 9. 61-70. (2016).
11. F. Li, Y. He, B. Niu, H. Li and H. Wang, "Match-MORE: An efficient private
matching scheme using friends-of-friends' recommendation," International
Conference on Computing, Networking and Communications (ICNC), pp. 1-6.
(2016)
12. Cheng, Weijie & Yin, Guisheng & Dong, Yuxin & Dong, Hongbin & Zhang,
Wansong. “Collaborative Filtering Recommendation on Users’ Interest Sequences”.
PLOS ONE. (2016).
13. Hartigan, J. A., and M. A. Wong. “Algorithm AS 136: A K-Means Clustering
Algorithm.” Journal of the Royal Statistical Society. Series C (Applied Statistics).
(1979).
14. S. Na, L. Xumin and G. Yong, "Research on k-means Clustering Algorithm: An
Improved k-means Clustering Algorithm," Third International Symposium on
Intelligent Information Technology and Security Informatics, pp. 63-67. (2010).

15. K. P. Sinaga and M. -S. Yang, "Unsupervised K-Means Clustering Algorithm," in
IEEE Access, vol. 8, pp. 80716-80727. (2020).
16. Tran, Thanh & Drab, Klaudia & Daszykowski, Michal. “Revised DBSCAN
algorithm to cluster data with dense adjacent clusters”. Chemometrics and Intelligent
Laboratory Systems. 120. 92–96. (2013).
17. K. Khan, S. U. Rehman, K. Aziz, S. Fong and S. Sarasvady, "DBSCAN: Past,
present and future," The Fifth International Conference on the Applications of
Digital Information and Web Technologies, pp. 232-238. (2014).
18. Stewart G, Al-Khassaweneh M. An Implementation of the HDBSCAN* Clustering
Algorithm. Applied Sciences. (2022).
19. Sasirekha, K., & Baby, P. Agglomerative hierarchical clustering algorithm-a.
International Journal of Scientific and Research Publications, 83(3), 83. (2013).
20. Ackermann, M.R., Blömer, J., Kuntze, D. et al. Analysis of Agglomerative
Clustering. Algorithmica 69, 184–215. (2014).

RESEARCH PAPER

RESEARCH PAPER CONFERENCE SUBMISSION

