
UNIVERSITY OF BUEA

Faculty of Science

Department of Computer Science

BACHELOR OF SCIENCE IN COMPUTER SCIENCE

Project Report:

JAVA IMPLEMENTATION OF COLLABORATIVE

FILTERING

By:

NGOH BERNARD ACHA

SC20C906

Supervisor:

Dr. Nkweteyim Denis L., PhD

June 2023
CERTIFICATION
This is to certify that this report entitled “JAVA IMPLEMENTATION OF
COLLABORATIVE FILTERING” is the original work of NGOH BERNARD ACHA with
Registration Number SC20C906, student at the Department of Computer Science at the
University of Buea. All borrowed ideas and materials have been duly acknowledged by means of
references and citations. The report was supervised in accordance with the procedures laid down
by the University of Buea. It has been read and approved by:

__________________________________          ______________________________
Supervisor's name and affiliation                                        Date

Dr. Nkweteyim Denis L., PhD

__________________________________          ______________________________
Name of Head of Department                                               Date

Head of Department of Computer Science

DECLARATION
This report has been written by me and has not received any previous academic credit at this or
any other institution.

________________________________
NGOH BERNARD ACHA, SC20C906
Department of Computer Science, Faculty of Science.

Dedication
To my dear mother, Afuh Deborah Kah, whose unwavering love and support have been the
foundation of my life. Your tireless dedication and sacrifice have made me who I am today, and I
am forever grateful. This project is a testament to your unwavering encouragement and belief in
me. Thank you for being the best mother anyone could ask for.

Abstract
Collaborative filtering is a popular technique used in recommender systems to predict the
preferences of users based on their past behavior and the behavior of similar users. It is widely
used in e-commerce, social media, and other online platforms to provide personalized
recommendations to users. In this project, we present a Java implementation of collaborative
filtering that can be used to build personalized recommendation systems. Our implementation
uses a user-item rating matrix to calculate similarity scores between users, and then predicts
ratings for new items based on the ratings of similar users. The scope of this project is limited to
working out the different components of such a system by building a small-scale version that can
handle small matrices of data.

Table of Contents
Dedication....................................................................................................................................................iii
Abstract........................................................................................................................................................iv
Table of Contents............................................................................................................................................v
1. Introduction.........................................................................................................................................1
2. Literature review.................................................................................................................................2
3. Analysis..............................................................................................................................3
4. Results and Discussion........................................................................................................6
4.1 Results.........................................................................................................................6
4.1.1 Implementation........................................................................................................6
I. Data processing:..........................................................................................................7
II. Matrix Creation:..........................................................................................................8
III. Matrix Factorization:...............................................................................................9
IV. Cosine similarity:...................................................................................................12
V. Recommendation:......................................................................................................13
VI. Results:..................................................................................................................16
4.2 Discussion..................................................................................................................18
5. Conclusion.........................................................................................................................26
6. References.........................................................................................................................27
7. Appendices........................................................................................................................28
1. Mathematical concept of matrix factorization..................................................................28
2. Some code snippets.......................................................................................................................29

1. Introduction
Collaborative filtering algorithms have become increasingly popular in recent years as a
means of predicting user preferences and recommending items to users. In this project, I
explore the implementation of a collaborative filtering algorithm based on matrix factorization,
using cosine similarity as the similarity measure.

My interest in this project stems from the growing importance of recommendation systems in
various industries, from e-commerce to streaming services.

The aims of this project are to develop and test a collaborative filtering algorithm, and to
understand the concepts of matrix factorization and vector similarity and how they are
applied to collaborative filtering.

Within the boundaries of this project, the aim is to develop a thorough understanding of
collaborative filtering algorithms and their applications in recommendation systems.
However, the scope of this project is limited to small datasets.

Figure 1 illustration of collaborative filtering.

2. Literature review
Collaborative filtering algorithms have been extensively studied and developed over the past two
decades, since their introduction in the late 1990s. Researchers and practitioners have explored
various techniques for improving the accuracy and scalability of these algorithms, as well as
their applications in recommendation systems.

One of the most well-known collaborative filtering algorithms is the matrix factorization method,
which decomposes the user-item ratings matrix into low-rank matrices representing user and
item latent factors. This method has been shown to outperform traditional methods like item-
based and user-based filtering in terms of accuracy, and has also enabled the development of
more personalized recommendation systems.

Another area of research in collaborative filtering algorithms is optimization techniques, such as
parallel processing and stochastic gradient descent. These techniques aim to improve the speed
and scalability of the algorithms, making it possible to handle large datasets and achieve real-time
recommendations; however, this goes beyond the scope of this project.

Recently, deep learning techniques have also been applied to collaborative filtering algorithms,
using neural networks to learn user-item interactions directly from raw data. These methods have
shown promising results in improving the accuracy of recommendations, particularly for implicit
feedback scenarios where explicit ratings are not available.

Overall, the literature on collaborative filtering algorithms and recommendation systems is
extensive, covering a wide range of topics from algorithm development to applications in
different industries. This research has contributed to the development of more effective and
personalized recommendation systems, as well as a deeper understanding of user preferences and
behavior.

3. Analysis
3.1 Problem statement
The problem at hand is to implement a collaborative filtering algorithm in the Java
programming language. Collaborative filtering is a technique used in recommendation
systems to suggest items to users based on the items that they and other similar users
have liked in the past. The implementation of this algorithm for a small array in Java
requires the development of data structures and algorithms that can accurately handle a
limited dataset and make recommendations based on user preferences. The goal of this
project is to create a simple but effective recommendation engine that can provide users
with relevant item suggestions based on their interests and past purchases.
3.2 Project aim
The aim of this project is to implement a collaborative filtering algorithm for a small
dataset in the Java programming language, with the goal of creating a simple but effective
recommendation engine. The algorithm should be able to accurately handle a limited
dataset and make recommendations based on user preferences. This project will
contribute to the development of recommendation systems and attempt to improve the
user experience by delivering personalized and relevant item suggestions. Additionally,
the project will help develop skills in data structures, algorithms, and programming in
Java.
This project will explore matrix factorization and cosine similarity in order to
recommend items in the grid and predict user ratings based on other users' behavior.
Understanding and implementing collaborative filtering and cosine similarity is also an
aim of this project.

3.3 Methodology
The project involved implementing a collaborative filtering algorithm in the Java programming
language. The methodology consists of multiple steps to accomplish this task. The
chronology of the steps is as follows.
1. Data Preparation: I obtained a dataset consisting of user-item ratings from a text file (.txt),
loaded the data into a Java program, and created a data structure to store the ratings.

2. Matrix Creation: Here, a u x i matrix is created to hold the user-item ratings, where u is the
number of users whose ratings we are examining and i is the number of items being rated by
the users.

3. Matrix Factorization: I applied stochastic gradient descent (SGD) to the matrix and
factorized it into latent factor matrices: a user preference matrix and an item
feature matrix.

Figure 2 Matrix factorization [1]
4. Cosine similarity [2]: We computed the cosine similarity between the feature vectors to find
similarities among different items. We wrote a function in Java to calculate the cosine
similarity between vectors. [3] If we think of two columns x and y of the utility matrix as n-
dimensional vectors, x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), then we can use the Euclidean dot
product (inner product) formula to compute the cosine of the angle θ that the two vectors make at
the origin:

cos θ = (x · y) / (||x|| ||y||) = (x1y1 + x2y2 + ... + xnyn) / (sqrt(x1² + ... + xn²) · sqrt(y1² + ... + yn²))

This is called the cosine similarity measure.

5. Recommendation: I utilized the cosine similarities between pairs of items to recommend
highly similar items to users who had already rated or watched other similar items, then
completed the unrated items in the input matrix with the predictions from the collaborative
filtering.

6. Results: Lastly, the results are written into another text (.txt) file. The results contain a
completed version of the initial matrix, with the unrated user items filled in by the
algorithm's predictions.

In summary, the methodology section of this report explains the approach taken in implementing
a collaborative filtering algorithm using Java. It highlights the important components required
for the implementation and provides a clear picture of how it was implemented.

4. Results and Discussion
4.1 Results
4.1.1 Implementation
While the implementation section of this project may share a similar chronology to the
methodology, it provides a more detailed description of the steps involved in each phase of
the implementation process. In the data preprocessing phase, for instance, the implementation
section elaborates on how the data is loaded into the program from the text file and all the
code that was written in the process. Similarly, in the similarity calculation phase, the
implementation section explains how the user-item matrix is constructed and how cosine
similarity is calculated between each pair of rows. This level of detail continues throughout
the neighborhood selection, recommendation generation, and evaluation phases of the
implementation. Additionally, the implementation section discusses details at the level of the
algorithms and mathematical concepts backing it.

I. Data processing:
The general implementation of the `txtToMatrix` function in Java involves reading a txt file
containing a matrix of integers separated by tabs, creating a 2D matrix to store the integers, and
populating the 2D matrix with the integer values. To accomplish this, the function first uses a
`BufferedReader` to read the input file line by line in order to count the number of lines and infer
the dimensions of the output matrix.

Figure 3 sample data input file for 5 by 4 matrix

Figure 4 code snippet: Function for turning text file to matrix

This function covers steps 1 and 2 of the methodology. The following step is a continuation of
the same function.

II. Matrix Creation:

The function then creates a new `BufferedReader` to read the input file again, this time to
populate the 2D matrix by splitting each line into an array using a tab as the delimiter. Finally,
the function returns the populated 2D matrix as output.

Figure 5 code snippet matrix population

This is what we get after running this function.

Figure 6 text file content turned to matrix in console

The matrix used in this example is a 5 by 4 matrix (rows x columns), which represents five (5)
users' preferences/ratings for four items. In a real-life scenario, an item could be a movie, a song
on a music streaming platform, an item on a shopping website, or even a social media article. The
ratings are on a scale of 1 to 5, where 5 represents the highest rating a user can give for an item and
1 is the lowest rating a user can give for an item. User items rated 0.0 are items which have not yet
been given any rating by the user.

Use case: suppose we are on a shopping website and some items are being recommended to us on
the first page. In most cases these items are the ones which, by collaborative filtering, the
algorithm predicts would best fit our preferences. Our interaction with these items gives feedback
to the algorithm on the accuracy of the prediction. This is what we delve into in more detail in the
subsequent parts of this section.

III. Matrix Factorization:


The function `matrixFactorization`, as seen in the snippet below, performs matrix factorization
using stochastic gradient descent. Matrix factorization is a technique for decomposing a large
matrix into two (or more) smaller matrices that can be multiplied together to produce the original
matrix (in this case an approximation of the original matrix).
Use case: This technique can be applied to large rating matrices in collaborative filtering
recommendation systems, where each row of the matrix represents a user and each column
represents an item that the user could rate. Matrix factorization can then be used to estimate the
unknown ratings for each user-item pair, enabling personalized recommendations.
The `matrixFactorization` function accepts a 2D matrix of user-ratings as input and returns the
user and item latent factors as a 3-dimensional array. The latent factors are smaller feature
matrices that are used to estimate the values of the original rating matrix. The function also
accepts other parameters, such as the number of latent factors, the learning rate, the
regularization term, and the number of iterations to perform stochastic gradient descent.
1. matrix: The input 2D matrix of ratings, where each row is a user, each column is an item, and
each cell is a rating. This rating matrix is the base for matrix factorization, which the function
uses to estimate the unknown ratings for users and items.
2. numFactors: The number of latent factors to use in the matrix factorization. This is the
number of features that the function will generate for each user and item to capture their
underlying preferences and characteristics.
3. learningRate: The learning rate, or step size, to use in the stochastic gradient descent
algorithm. This parameter controls the size of the updates to the user and item latent factors
during each iteration of the algorithm. A smaller learning rate leads to slower convergence but
can provide more accurate results, while a larger learning rate leads to faster convergence but
may result in suboptimal latent factors.
4. regularization: The regularization parameter, which is used to prevent overfitting in the
matrix factorization. This parameter controls the strength of the penalty applied to the user and
item latent factors during optimization. A higher regularization parameter leads to stronger
regularization and may prevent overfitting, while a lower regularization parameter may lead to
better fit to the data but may result in overfitting.

5. numIterations: The number of iterations to perform the stochastic gradient descent algorithm.
This parameter controls the number of times the function will update the user and item latent
factors. A larger number of iterations may lead to better convergence, but may also take longer to
run.

Overall, these parameters control the behavior of the matrix factorization algorithm used by the
`matrixFactorization` function and affect the quality of the recommendations generated by the
function. The parameter settings need to be tuned carefully to achieve good performance on a
particular problem.

Figure 7 code snippet matrix factorization function

The implementation of the `matrixFactorization` function initializes the user and item latent
factors with random numbers and performs the stochastic gradient descent for the specified
number of iterations. Stochastic gradient descent is an iterative optimization algorithm that
updates the user and item latent factors to minimize the difference between the predicted and

actual ratings. This is done by calculating the error between the predicted and actual rating for
each user-item pair and then updating the latent factors based on the error and the parameters.

Figure 8 code snippet Stochastic gradient descent

The `matrixFactorization` function then returns the user and item latent factors as a 3-
dimensional array (tensor), which can be used to estimate the unknown ratings for each user-item
pair. This step completes the process of matrix factorization using stochastic gradient descent and
provides an encoding of users and items that can be used to calculate user similarities.
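The following is an illustrative reconstruction of such a routine, a minimal sketch that follows the parameter descriptions above rather than the exact code shown in Figures 7 to 9 (the class name and the initialization details are assumptions):

```java
import java.util.Random;

public class MatrixFactorizer {

    // Factorizes a user-item rating matrix into user and item latent factors
    // using stochastic gradient descent. Entries equal to 0.0 are treated as unrated.
    public static double[][][] matrixFactorization(double[][] matrix, int numFactors,
                                                   double learningRate, double regularization,
                                                   int numIterations) {
        int numUsers = matrix.length;
        int numItems = matrix[0].length;
        Random random = new Random();

        // Initialize the latent factors with small random values.
        double[][] userFactors = new double[numUsers][numFactors];
        double[][] itemFactors = new double[numItems][numFactors];
        for (int u = 0; u < numUsers; u++)
            for (int k = 0; k < numFactors; k++)
                userFactors[u][k] = 0.1 * random.nextDouble();
        for (int i = 0; i < numItems; i++)
            for (int k = 0; k < numFactors; k++)
                itemFactors[i][k] = 0.1 * random.nextDouble();

        for (int iter = 0; iter < numIterations; iter++) {
            for (int u = 0; u < numUsers; u++) {
                for (int i = 0; i < numItems; i++) {
                    if (matrix[u][i] == 0.0) continue; // skip unrated entries

                    // Predicted rating is the dot product of the two latent vectors.
                    double prediction = 0.0;
                    for (int k = 0; k < numFactors; k++)
                        prediction += userFactors[u][k] * itemFactors[i][k];

                    double error = matrix[u][i] - prediction;

                    // Gradient step with L2 regularization.
                    for (int k = 0; k < numFactors; k++) {
                        double uk = userFactors[u][k];
                        double ik = itemFactors[i][k];
                        userFactors[u][k] += learningRate * (error * ik - regularization * uk);
                        itemFactors[i][k] += learningRate * (error * uk - regularization * ik);
                    }
                }
            }
        }
        // Return both factor matrices as a 3-dimensional (jagged) array, as described above.
        return new double[][][] { userFactors, itemFactors };
    }
}
```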

Figure 9 code snippet latent factor matrices

IV. Cosine similarity:


The function `cosineSimilarity` accepts the user latent factors as input, which are generated using
matrix factorization. The function retrieves the user latent factors from the input 3-dimensional
array and calculates the cosine similarity matrix between all pairs of users in the system.
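A minimal sketch of what such a function might look like is given below, assuming that index 0 of the latent factor tensor holds the user factors (the actual implementation is shown in Figure 10):

```java
public class Similarity {

    // Computes the pairwise cosine similarity between all users, based on their
    // latent factor vectors (latentFactors[0] is assumed to hold the user factors).
    public static double[][] cosineSimilarity(double[][][] latentFactors) {
        double[][] userFactors = latentFactors[0];
        int numUsers = userFactors.length;
        double[][] similarity = new double[numUsers][numUsers];

        for (int a = 0; a < numUsers; a++) {
            for (int b = 0; b < numUsers; b++) {
                double dot = 0.0, normA = 0.0, normB = 0.0;
                for (int k = 0; k < userFactors[a].length; k++) {
                    dot += userFactors[a][k] * userFactors[b][k];
                    normA += userFactors[a][k] * userFactors[a][k];
                    normB += userFactors[b][k] * userFactors[b][k];
                }
                // cos(theta) = (x . y) / (||x|| * ||y||); guard against zero vectors.
                double denom = Math.sqrt(normA) * Math.sqrt(normB);
                similarity[a][b] = (denom == 0.0) ? 0.0 : dot / denom;
            }
        }
        return similarity;
    }
}
```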

Figure 10 code snippet cosine similarity function

The output of this function is a cosine similarity matrix that captures the similarity score
between any pair of users in the system. This information can be used to identify users who have
similar tastes and preferences and to generate personalized recommendations for each user based
on the preferences of similar users.

Figure 11 user similarity matrix

This is a sample of what the function will return.

V. Recommendation:
The function `predictRating` predicts a single user's rating for a specific item in a collaborative filtering
recommendation system. The function uses the similarity matrix between users to identify similar users
and their corresponding ratings for the item in question and to generate a prediction for the user’s rating.
The `predictRating` function accepts the following parameters:
1. userRatings: A 2D matrix that contains the ratings for all users for all items, where each row
represents a user and each column an item
2. similarityMatrix: A 2D matrix that contains the pairwise similarity scores between users, based on
their past ratings for items.
3. userIndex: The index of the user for whom we're predicting the rating.
4. itemIndex: The index of the item for which we're making the rating prediction.
The implementation of the `predictRating` function involves the following steps:
First, the function calculates the norms of the similarity scores of the users who have rated the item in
question; these norms are used to normalize the similarity scores between users.

Figure 12 code snippet predictRating function

Then, the function iterates over all users who have rated the item in question and calculates the similarity
score between the target user and each of these users. The similarity score is computed as the dot product
between the similarity scores of the target user and the other user, normalized by the product of their
vector norms. The function then multiplies each user's rating for the item by its similarity score and adds
the resulting value to the predicted rating.

Figure 13 code snippet predictRating continuation

The function normalizes the predicted rating by dividing it by the sum of the absolute values of the
similarity scores between the target user and the other users who have rated the item. And finally, the
function returns the predicted rating as a double value.
Overall, this function provides a way to generate personalized recommendations for a single user based
on the ratings of similar users for a specific item. By taking into account the similarities between users
and the item preferences, the function can provide tailored predictions that reflect a user's unique tastes
and preferences.
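For illustration, the sketch below shows a simplified variant of this prediction: a plain similarity-weighted average of the other users' ratings. It omits the extra normalization of the similarity rows described above, so it is not the exact code of Figures 12 and 13:

```java
public class Predictor {

    // Predicts the rating of user `userIndex` for item `itemIndex` as a
    // similarity-weighted average of the ratings given by the other users.
    public static double predictRating(double[][] userRatings, double[][] similarityMatrix,
                                       int userIndex, int itemIndex) {
        double weightedSum = 0.0;
        double normalizer = 0.0;

        for (int other = 0; other < userRatings.length; other++) {
            if (other == userIndex) continue;            // skip the target user
            double rating = userRatings[other][itemIndex];
            if (rating == 0.0) continue;                 // skip users who have not rated the item

            double sim = similarityMatrix[userIndex][other];
            weightedSum += sim * rating;
            normalizer += Math.abs(sim);
        }

        // If no similar user has rated the item, fall back to 0.0 (unknown).
        return (normalizer == 0.0) ? 0.0 : weightedSum / normalizer;
    }
}
```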

VI. Results:
The `completeMatrix` function is a Java function that completes an incomplete user-item rating
matrix by using collaborative filtering to predict the missing ratings using a similarity matrix.
The function calls the `predictRating` function to predict the ratings for the missing items and
then updates the original rating matrix by adding these predictions.

The `completeMatrix` function accepts the user rating matrix and the similarity matrix as input
and returns a completed rating matrix where all items have been rated by all users.

The implementation of this function is as follows

First, we initialize the completed matrix by creating a copy of the input user rating matrix and storing it
in a new variable called `completedMatrix`. Then the function predicts the ratings for the missing items
by iterating over each cell in the `completedMatrix` and checking whether the cell is empty (i.e., whether
the corresponding item has not been rated by the corresponding user). For each empty cell in the matrix,
the function calls the `predictRating` function to predict the rating for the item and stores the result in the
`completedMatrix`.

Next, it converts the predicted ratings to a scale of 1 to 5: The function calls the
`convertToScale5` function to convert the predicted ratings to a scale of 1 to 5, which is a typical
rating scale used in many recommendation systems.

Finally, the function returns the completed user-item rating matrix as output, with all empty cells filled
with predicted ratings.
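Putting these steps together, a minimal sketch of the completion step could look as follows. It reuses the `predictRating` sketch above, and `convertToScale5` is shown here as a simple clamp to the 1-to-5 range, which is an assumption about how the helper behaves:

```java
public class MatrixCompleter {

    // Fills every unrated (0.0) cell of the rating matrix with a predicted rating.
    public static double[][] completeMatrix(double[][] userRatings, double[][] similarityMatrix) {
        int numUsers = userRatings.length;
        int numItems = userRatings[0].length;

        // Work on a copy so the original matrix is left untouched.
        double[][] completedMatrix = new double[numUsers][numItems];
        for (int u = 0; u < numUsers; u++)
            completedMatrix[u] = userRatings[u].clone();

        for (int u = 0; u < numUsers; u++) {
            for (int i = 0; i < numItems; i++) {
                if (completedMatrix[u][i] == 0.0) {
                    double predicted = Predictor.predictRating(userRatings, similarityMatrix, u, i);
                    completedMatrix[u][i] = convertToScale5(predicted);
                }
            }
        }
        return completedMatrix;
    }

    // Maps a raw prediction onto the 1-to-5 rating scale (illustrative assumption).
    private static double convertToScale5(double value) {
        return Math.max(1.0, Math.min(5.0, value));
    }
}
```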

Figure 14 code snippet completeMatrix function

Figure 15 code snippet completeMatrix function continuation.

Now, putting everything together, we need to feed some data to the program and walk through the
complete algorithm.

4.2 Discussion
This section presents the detailed execution of the whole program with a matrix of 100 users by 10
items. The matrix contains the 100 users' ratings for the items, with 0 marking items that have not been
rated. The complete matrix cannot be included in this section due to its size.

Figure 16 data file.

Let's consider only the first 15 rows of the matrix. We can see the different ratings for each item
as well as the unrated ones.

Figure 17 code snippet main function

First, we want to read the matrix into our program and store it in a multi-dimensional array
called `usrmtrx`, short for user matrix. The data is read from a file called `data02.txt`.

Next, we want to set the other parameters that would be used in our gradient descent function.

Figure 18 code snippet initializing parameters

The latent factor is a tensor that is returned from the function call below. The function is
invoked with the parameters set above.

Figure 19 code snippet function call matrix factorization

The cosine similarity function computes the similarity for each pair of users in the matrix
based on their latent factors and stores the results in the `userSimilarity` matrix. This similarity
measure is decisive for predicting a user's preferences. It is stored in a square n x n matrix,
where n is the number of users in the matrix; more details on this below.

Figure 20 code snippet functions call cosine similarity and completeMatrix

Finally, we want to complete the initial matrix by filling in the non-rated items. This is the job of
`completeMatrix`, which makes use of the function `predictRating` to predict the rating for every
entry in the user-item matrix which is equal to 0.0. So `completeMatrix` loops through the user-item
matrix and completes all the zero entries using the similarity matrix and the position of the entry in
the matrix.
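Wired together, the main method could look roughly like the sketch below. The file name `data02.txt` and the variable `usrmtrx` come from the text; the hyper-parameter values are placeholders chosen for illustration, and the class names refer to the sketches given earlier in this report and in the appendix:

```java
import java.io.IOException;

public class Main {

    public static void main(String[] args) throws IOException {
        // 1. Read the user-item rating matrix from the data file.
        double[][] usrmtrx = MatrixIO.txtToMatrix("data02.txt");

        // 2. Parameters for stochastic gradient descent (placeholder values).
        int numFactors = 2;
        double learningRate = 0.01;
        double regularization = 0.02;
        int numIterations = 5000;

        // 3. Factorize the matrix into user and item latent factors.
        double[][][] latentFactors = MatrixFactorizer.matrixFactorization(
                usrmtrx, numFactors, learningRate, regularization, numIterations);

        // 4. Compute the user-user cosine similarity matrix from the user factors.
        double[][] userSimilarity = Similarity.cosineSimilarity(latentFactors);

        // 5. Complete the rating matrix by predicting every 0.0 entry.
        double[][] completed = MatrixCompleter.completeMatrix(usrmtrx, userSimilarity);

        // 6. Write the completed matrix to the results file.
        MatrixWriter.writeMatrixToFile(completed, "results.txt");
    }
}
```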

Figure 21 code snippet predict matrix

After running our program, we want to see the different outputs at each step by writing them to
the console and/or to a file where necessary.

Figure 22 code snippet to print user matrix

First, the user-item matrix as seen above. Recall that we are interested only in the first 15 rows of
the user-item matrix. The complete 100 x 10 matrix is provided in the appendix section
for further investigation.

Figure 23 user matrix in console log

After the user-item matrix is collected, the latent factors are computed. Here is a preview of the
results.

Figure 24 user latent factor in console log

From the preview above, the first embedding, denoted `Dimension 0`, is the latent factor matrix for the
users. Since there are 100 users, only the first 15 rows are shown.

The second matrix in the tensor, denoted `Dimension 1`, is the embedding for the items. For 10
items, we have a 10 x 2 array of item embeddings.

Figure 25 Item latent factor in console log

If we multiply these two arrays, we get an approximation of the initial user-item matrix (see the sketch below).
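A small illustrative helper (not part of the original program) that performs this reconstruction could be:

```java
public class Reconstruction {

    // Rebuilds an approximation of the original rating matrix by multiplying the
    // user factors (numUsers x k) with the transposed item factors (numItems x k).
    public static double[][] approximate(double[][] userFactors, double[][] itemFactors) {
        int numUsers = userFactors.length;
        int numItems = itemFactors.length;
        int k = userFactors[0].length;

        double[][] approx = new double[numUsers][numItems];
        for (int u = 0; u < numUsers; u++) {
            for (int i = 0; i < numItems; i++) {
                double sum = 0.0;
                for (int f = 0; f < k; f++) {
                    sum += userFactors[u][f] * itemFactors[i][f];
                }
                approx[u][i] = sum; // approximation of the original rating at (u, i)
            }
        }
        return approx;
    }
}
```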

The similarity matrix is a 100 x 100 matrix. The similarities of the first 15 users could not be
included here since we would be getting a very large matrix (15 x 100).

Figure 26 user similarity matrix in console log

Just for illustration purposes, this is a similarity matrix of a much smaller dataset.

Figure 27 user similarity matrix in console log for smaller matrix

Observing this matrix, we can check the correctness of the computation. Looking at the main
diagonal of the matrix, we notice that the entries are all 1. This tells us that along the main
diagonal each user is being compared with itself. The matrix is also symmetric: taking the
transpose of the similarity matrix gives the same matrix, since the similarity sim(A, B) is
equal to sim(B, A).

Figure 28 user prediction matrix in console log

Figure 29 user prediction matrix written to file

5. Conclusion
In conclusion, this project explored the implementation of a recommendation system using
collaborative filtering algorithms. I implemented and tested the algorithm on a small dataset and on a
larger one (medium-sized compared to real-life tasks) to see how the algorithm would scale in a
real-life scenario and what to take into account. The evaluation revealed that matrix factorization
algorithms performed well overall, but this implementation would require much more work in order to
handle real-life datasets, with many more parameters to consider when predicting users' preferences.
Larger and more complex datasets also introduce problems such as biases in the data, which have to be
accounted for by including techniques for bias reduction and diversity promotion to improve the quality
of recommendations.

6. References

[1] X. J. L. J. P. T. Chia Ling, "Geographical and Overlapping Community Modeling Based on Business
Circles for POI Recommendation," 28 Nov 2017.

[2] J. R. Hubbard, Java Data Analysis, O'Reilly.

[3] Insight Data Science, "Explicit Matrix Factorization: ALS, SGD, and All That Jazz," 16 Mar 2016.
[Online]. Available: https://blog.insightdatascience.com/explicit-matrix-factorization-als-sgd-and-all-that-jazz-b00e4d9b21ea

7. Appendices
1. Some code snippets
These are all the relevant Java packages that I used in the implementation of this program.

Here is a simple function to print a 2-dimensional matrix.

A function that prints a 3-dimensional matrix.
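For reference, minimal sketches of such print helpers might look like this (method and class names are assumptions; the `Dimension` labels mirror the console output shown earlier):

```java
public class MatrixPrinter {

    // Prints a 2-dimensional matrix row by row, tab-separated.
    public static void printMatrix(double[][] matrix) {
        for (double[] row : matrix) {
            StringBuilder sb = new StringBuilder();
            for (double value : row) {
                sb.append(value).append('\t');
            }
            System.out.println(sb.toString().trim());
        }
    }

    // Prints a 3-dimensional matrix (tensor) as a sequence of 2D slices.
    public static void printTensor(double[][][] tensor) {
        for (int d = 0; d < tensor.length; d++) {
            System.out.println("Dimension " + d + ":");
            printMatrix(tensor[d]);
            System.out.println();
        }
    }
}
```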

This function is responsible for writing a matrix into a file. It turns the matrix into a string of
characters and writes each row on a new line. It was used above in the main function to write the
results of the collaborative filtering to the results file.
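A minimal sketch of such a writer, with the name `writeMatrixToFile` assumed, could be:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class MatrixWriter {

    // Writes a matrix to a text file, one row per line, values separated by tabs.
    public static void writeMatrixToFile(double[][] matrix, String filePath) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
            for (double[] row : matrix) {
                StringBuilder sb = new StringBuilder();
                for (double value : row) {
                    sb.append(value).append('\t');
                }
                writer.write(sb.toString().trim());
                writer.newLine();
            }
        }
    }
}
```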

Table of figures
Figure 1 illustration of collaborative filtering
Figure 2 Matrix factorization
Figure 3 sample data input file for 5 by 4 matrix
Figure 4 code snippet: Function for turning text file to matrix
Figure 5 code snippet matrix population
Figure 6 text file content turned to matrix in console
Figure 7 code snippet matrix factorization function
Figure 8 code snippet Stochastic gradient descent
Figure 9 code snippet latent factor matrices
Figure 10 code snippet cosine similarity function
Figure 11 user similarity matrix
Figure 12 code snippet predictRating function
Figure 13 code snippet predictRating continuation
Figure 14 code snippet completeMatrix function
Figure 15 code snippet completeMatrix function continuation
Figure 16 data file
Figure 17 code snippet main function
Figure 18 code snippet initializing parameters
Figure 19 code snippet function call matrix factorization
Figure 20 code snippet functions call cosine similarity and completeMatrix
Figure 21 code snippet predict matrix
Figure 22 code snippet to print user matrix
Figure 23 user matrix in console log
Figure 24 user latent factor in console log
Figure 25 Item latent factor in console log
Figure 26 user similarity matrix in console log
Figure 27 user similarity matrix in console log for smaller matrix
Figure 28 user prediction matrix in console log
Figure 29 user prediction matrix written to file

