
Abstract:

Social media networks share large amounts of information among users and their friends, forming large communities. Social networks also play a crucial role in connecting different types of users with their friends and families, developing relationships among them, and keeping people connected to each other. Recommendation systems therefore play a key role on social networks in connecting the right users to each other and in bringing more and more people into contact. Present social media networks suggest friends based on the respective user's network, but this may not be ideal, because a suggestion drawn only from the user's existing network does not reflect the user's real-life circumstances.
In this research, we propose a Facebook Friend Recommendation System built on the XGBoost algorithm. The system recommends friends to users based on the social graph of the user's network, considering different connection parameters and similarity measures. We compute similarity measures such as the Jaccard distance and the Otsuka-Ochiai coefficient, which filter information from the graph and are used to calculate a rating for different users. The extracted features used in building the model are the Adamic-Adar index, HITS score, Katz centrality, and PageRank. The similarity measures compare each user's rating with those of other users, and the recommendation system then suggests a list of friends having similar ratings. We implemented this model on ordinary computers and laptops with the same dataset and obtained a training accuracy of 100% using the XGBoost algorithm, which is well known for its efficient speed and performance.

1 Introduction:
Social networks are an important source of abundant information for a large community of people, building communities where people can share their activities, interests, thoughts, and knowledge. Social media sites are used by millions of users and have become essential for sharing content around the world; their enormous user bases and volumes of shared content make their communities grow worldwide. To keep users connected, it is important for social media sites to build recommendation systems accurate and powerful enough that every user can be connected to the right people. Social media sites use such recommendation systems to suggest other users as friends or likely acquaintances.
Earlier social media recommendation systems used machine learning methods such as natural language processing and data mining.
These systems surfaced information the user would enjoy through information filtering techniques such as content-based filtering. Over the last five years, however, personalization-based recommendation systems have come to the fore, and social media sites now use personalized recommendation systems that give recommendations more precisely and accurately than the earlier computational methods.
This friend recommendation system has been tested with two algorithms, Random Forest and XGBoost; we prefer XGBoost for its faster speed and higher accuracy. In this research, we build an XGBoost-based friend recommendation system that recommends friends based on the followers and followees of a person. We use the Facebook Recruiting dataset of 9.43M edges, where each edge has a source node and a destination node; 80% of the dataset is used for training the model and the remaining 20% for testing it. We compute similarity measures such as the Jaccard distance, extract a number of features for each user, and rank users with PageRank. The classification algorithm used in this model is XGBoost, a gradient-boosted decision tree method well known for its speed and performance. A confusion matrix is generated to describe the performance of the classification algorithm in the model.
The accuracy score is reported together with the precision score, recall score, and F1 score.
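The evaluation just described can be sketched in plain Python; the confusion-matrix counts below are illustrative only, not results from this paper.

```python
# Sketch of the evaluation above: precision, recall and F1
# derived from the counts of a 2x2 confusion matrix.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy predictions: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
```

In practice these scores would be computed on the 20% held-out test edges.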

2 Previous Research

1) Swathi Sambhangi et al. (2019) [1] proposed a recommendation system for an ecommerce site. The main idea behind the system is a peer-to-peer, text-based review recommendation approach that helps users select the best products efficiently. The system takes its data from Amazon, the largest ecommerce company, whose recommendation system matches products with the most relevant information and reviews on the web. This recommendation system greatly helps the user choose the best product from a list of products, with a focus on user satisfaction.
2) Neha Verma et al. (2019) [2] proposed a recommendation system to model user behaviour in ecommerce companies such as Flipkart and Amazon and the streaming company Netflix. The model works in two phases: the first phase collects and gathers information about the user and his interests, and the second phase analyses that information, such as what the user searches for, the items he has bought or is looking to buy, the contents of his cart, and his interests. Recommendations are then made based on this analysis of the user's information.
3) Snigdha Luthra et al. (2019) [3] proposed a new method built on the observation that a social media network is dynamic and can change at any time; the network at one second may already have changed by the next. To predict these continuously occurring changes, for which a static graph embedding cannot capture an evolving, unlabelled graph with such a large variety of nodes and edges, the proposed system uses a clustering technique that groups nodes with similar edges into the same cluster for use in machine learning. However, this system requires a larger number of connections between the different nodes within a cluster.
4) Imane Belkhadir et al. (2019) [4] proposed a recommendation system based on a social regularization approach that combines trust information with information from users' social networks to build strong trust paths in social network graphs. The system recommends friends who have similar interests and behaviours, and it is evaluated using the matrix factorization method. To address the drawbacks of matrix factorization, tags and friendships are used as regularization terms in the recommender.
5) Ivana Andjelkovic et al. (2019) [5] proposed a recommendation system for music. The system recommends musical bands and artists through a novel visualization that links different moods to musicians: by changing an avatar or mood, the user receives recommendations matching that mood. The work showed that certain combinations of features and the visualization design achieved consistent recommendation accuracy, allowing the user to match a musician's mood to his own current activity and perception.
6) Peng Liu et al. (2019) [6] proposed a recommendation system built on a dynamic graph-based embedding model that can address major challenges of social media recommendation. The model captures different patterns such as the user's nature and behaviour, his social relationships through the connections of his network, and the semantic edge patterns of his social network; a probability matrix is calculated to analyse the semantic edges. To serve recommendations, the query-processing algorithm learns incrementally, part by part (an incremental learning algorithm). The system is evaluated against difficult real-world datasets to estimate its accuracy, working over both continuous and discrete values describing users, their trusted contacts, and their interests.

3 Methodology

3.1 Data Description

The dataset for this recommendation system is taken from Facebook's Recruiting challenge on the Kaggle platform. The dataset contains two columns, a source node and a destination node.
The complete data is not used by the model; the model is built on only 100,000 rows.
The recommendation system traditionally used by social media giants such as Facebook, Twitter, and Instagram is the FoF (Friends of Friends) algorithm, which recommends friends to a user by considering the user's individual social network: if A is a friend of B, it recommends the friends in A's social network to B, on the assumption that they may become B's friends in the near future. This recommendation may be inaccurate and misleading, because it does not take real-life situations into account; the recommended friend may not actually be known to the user.
To solve this problem we add an Influence algorithm on top of the FoF algorithm, which makes the recommendation system more accurate than plain FoF. The algorithm works through methods such as influence_map, which takes the social graph and a user as parameters and returns a map of users who have at least one common friend with the input user but are not themselves friends of the input user. A second method, influence_graph, sorts the keys of this map in descending order of their values. The Influence algorithm thus takes into account how much a user's social network influences the person, such as his interests and the content he follows. This lets it make more accurate recommendations than the FoF algorithm, which the results also confirm.
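A minimal sketch of the influence_map and influence_graph steps described above, assuming the social graph is stored as an adjacency dict of friend sets (the concrete implementation is our assumption, not the paper's code):

```python
# Hedged sketch of the Influence algorithm's two methods.

def influence_map(graph, user):
    """Map each non-friend of `user` to their number of common friends."""
    friends = graph.get(user, set())
    counts = {}
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user himself and existing friends.
            if candidate != user and candidate not in friends:
                counts[candidate] = counts.get(candidate, 0) + 1
    return counts

def influence_graph(counts):
    """Keys of the map sorted in descending order of their values."""
    return sorted(counts, key=lambda c: counts[c], reverse=True)

# Tiny undirected friendship graph for illustration.
graph = {
    "A": {"B", "C"},
    "B": {"A", "D", "E"},
    "C": {"A", "D"},
    "D": {"B", "C"},
    "E": {"B"},
}
ranked = influence_graph(influence_map(graph, "A"))
```

Here D shares two common friends with A and E shares one, so D is recommended first.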
The problem is that the Influence algorithm is accurate only for smaller datasets, up to about 100,000 people; it does not scale to social media giants such as Facebook, which have billions of records. To overcome this problem we developed an XGBoost-based friend recommendation system using graph mining, which can make accurate decisions at scale. The resulting system, driven by XGBoost, which is widely known for its speed and performance, achieved a training score of 100%, keeping it accurate even with billions of records.
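Before training, the pairwise similarity features named in the abstract (Jaccard, Otsuka-Ochiai, Adamic-Adar) can be computed from the neighbour sets of a candidate edge; a minimal illustrative sketch, not the paper's exact feature code:

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two neighbour sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def otsuka_ochiai(a, b):
    """|A ∩ B| / sqrt(|A| * |B|), the cosine similarity of two sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def adamic_adar(graph, u, v):
    """Sum of 1 / log(degree) over common neighbours of u and v."""
    common = graph[u] & graph[v]
    return sum(1.0 / math.log(len(graph[w]))
               for w in common if len(graph[w]) > 1)

# Toy graph: node -> set of neighbours.
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}
j = jaccard(graph[2], graph[4])
```

Each candidate (source, destination) pair yields one feature row, which is what the classifier consumes.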

3.2 XGBOOST Recommender System

XGBoost is one of the most popular machine learning algorithms for prediction, whether applied to regression or classification. XGBoost stands for Extreme Gradient Boosting and introduces regularization parameters to reduce overfitting. Gradient-boosted trees use regression trees as weak learners in a sequential learning process; these regression trees are similar to the decision trees used to make decisions. XGBoost assigns a continuous score to each leaf once the last node of the tree has finished growing, and these leaf scores are summed to give the prediction once the tree is fully grown. At each iteration i a tree t is grown, leaf scores w are calculated, and a prediction y is produced. The learning process aims to minimize an overall objective that combines the loss of the model at iteration i-1 with the loss and complexity of the new tree structure t. Thus XGBoost grows regression trees sequentially, each learning from the previous iterations, thereby minimizing the overall objective.
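The per-iteration objective described above can be written in the standard XGBoost formulation (our notation, not reproduced from this paper):

```latex
\mathcal{L}^{(t)} \;=\; \sum_{i=1}^{n} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) \;+\; \Omega(f_t),
\qquad
\Omega(f) \;=\; \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}
```

where f_t is the regression tree added at iteration t, T is its number of leaves, w its vector of leaf scores, and gamma and lambda the regularization parameters.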

XGBoost also applies gradient descent to its gradient trees, calculating the optimal value for each leaf node and the optimal overall score of tree t that minimizes the objective; this score is also called the impurity of a tree's predictions. The loss function includes a regularization or penalty term whose purpose is to reduce the complexity of the regression tree functions. This parameter can be tuned and takes values greater than zero, with values between 0 and 1 controlling the permitted complexity. If the regularization term is set to zero, there is no difference between the predictions of plain gradient-boosted trees and XGBoost, which defeats the purpose of the algorithm. XGBoost also exposes parameters such as the learning rate, which is the shrinkage applied at each step, and column subsampling, which selects a random subset of features for the tree-boosting algorithm and further reduces overfitting. To handle overfitting, XGBoost adds several mechanisms beyond the regularization found in standard gradient-descent models.

XGBoost has stood out in recent years because of features such as sequential tree growing, loss minimization through gradient descent, parallel processing for speed, and parameters like the regularization term and learning rate. It is faster than other tree-ensembling algorithms, with its regularization parameter reducing variance while gradient descent minimizes the loss. Additionally, XGBoost has many hyperparameters: subsample, which bootstraps the training data and considers only part of it; the maximum depth of trees, up to which node depth a tree is processed; the minimum child weight required to split a node; and the number of estimators, i.e. the number of trees to use.

In contrast to the bagging used by Random Forest, boosting in XGBoost builds trees with fewer splits. These small trees are fast to process and easy to interpret. Hyperparameters such as the number of trees (or iterations), the learning rate at which gradient boosting learns, and the maximum depth of a tree are selected optimally through k-fold cross validation. The boosting ensemble consists of three simple steps. First, an initial model F0 is defined to predict the target variable y; this model leaves residuals (y - F0). Next, a new model h1 is fit to the residuals of the previous step. Then F0 and h1 are combined to give F1, a boosted version of F0, whose mean squared error is lower than that of F0. To improve performance further, a new model can be fit to the residuals of F1, giving F2, a boosted version of F1. This process can continue for m iterations, with the residuals reduced at each step via the update F_m(x) = F_{m-1}(x) + h_m(x).
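The F0, h1, F1 steps above can be sketched with a toy one-split stump as the weak learner (illustrative only; real gradient boosting fits full regression trees and applies a learning rate):

```python
def fit_stump(x, y):
    """Weak learner: one-split decision stump minimising squared error."""
    best = None
    for thr in sorted(set(x)):
        left = [t for xi, t in zip(x, y) if xi <= thr]
        right = [t for xi, t in zip(x, y) if xi > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - (lm if xi <= thr else rm)) ** 2
                  for xi, t in zip(x, y))
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda xi: lm if xi <= thr else rm

def mse(x, y, predict):
    return sum((t - predict(xi)) ** 2 for xi, t in zip(x, y)) / len(y)

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 9.0, 11.0]

mean_y = sum(y) / len(y)
f0 = lambda xi: mean_y                           # initial model F0
residuals = [t - f0(xi) for xi, t in zip(x, y)]  # y - F0
h1 = fit_stump(x, residuals)                     # h1 fit to the residuals
f1 = lambda xi: f0(xi) + h1(xi)                  # boosted F1 = F0 + h1
```

The boosted model F1 has a lower mean squared error than F0, exactly as the text claims, and the same step could be repeated to build F2, F3, and so on.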
To apply the XGBoost algorithm in our model we specify the hyperparameters: n_estimators, the number of estimators or trees used in the model, in the range 105 to 125; and max_depth, the maximum depth of each tree, in the range 10 to 15, meaning the algorithm processes each regression tree to a depth of 10 to 15 nodes. We then pass to the RandomizedSearchCV function the estimator, the distributions over n_estimators and max_depth, the number of iterations (5), and the f1 scoring used to rank each candidate setting. The model is then trained by the XGBClassifier with the selected hyperparameters. After fitting, the resulting model carries parameters such as a base score of 0.5, the gbtree booster for gradient-boosted trees, column subsampling by level, by node, and by tree all equal to 1, a learning rate of 0.1, a maximum delta step of 0, a maximum tree depth of 14, a minimum child weight of 1, missing values of None, 120 estimators (trees), one job, a binary logistic objective for the regression trees, and other parameters.

Table 1 lists the training and testing splits of the dataset used with the model after applying the XGBoost algorithm with the specified hyperparameters and regularization parameters:

Table 1: Training and testing splits of dataset


Generalization      Train     Test
G-1 (50-50)         49599     50000
G-2 (90-10)         89999      9999
G-3 (80-20)         79999     19999
