
2019 3rd International Conference on Informatics and Computational Sciences (ICICoS)

Rating Prediction on Movie Recommendation System: Collaborative Filtering Algorithm (CFA) vs. Dissymmetrical Percentage Collaborative Filtering Algorithm (DSPCFA)
Johan Eko Purnomo
Department of Computer Science
Universitas Diponegoro
Semarang, Indonesia
johanepu@gmail.com

Sukmawati Nur Endah
Department of Computer Science
Universitas Diponegoro
Semarang, Indonesia
sukma_ne@if.undip.ac.id

978-1-7281-4610-2/19/$31.00 ©2019 IEEE

Abstract: A recommendation system is one of many solutions for retrieving information quickly from large amounts of available data, and one of its applications is the movie recommendation system. A movie recommendation system filters information and then recommends movies based on rating preferences or user information. One of the most widely used algorithms is the user-based collaborative filtering algorithm (CFA), which predicts the ratings of movies to be recommended based on the similarity between the target user and other users, regardless of common items, that is, the number of movies that have been rated by both. A different approach is the dissymmetrical percentage collaborative filtering algorithm (DSPCFA), which takes common items into account when measuring similarity. This study also uses two similarity measurement methods, Pearson correlation similarity and cosine similarity, as a comparison to determine the characteristics of each method. The experimental results show that the DSPCFA produces a lower error value than the CFA, with an error decrease of about 5% under the Root Mean Squared Error (RMSE) evaluation method and about 7% under the Mean Absolute Error (MAE) evaluation method. Of the two similarity measurement methods tested, Pearson correlation similarity produces a lower error value than cosine similarity.

Keywords: movie recommendation system, collaborative filtering algorithm, dissymmetrical percentage collaborative filtering algorithm, common items

I. INTRODUCTION

The large amount of existing information means that sorting out the information that is needed has to be done quickly and precisely. This also applies to entertainment media, especially movies, whose data are large in quantity and complexity and which, if not handled effectively and efficiently, create problems such as wasted time.

A movie recommendation system presents movie predictions tailored to each user. It is important because it reduces the time a user needs to find a movie that matches the user's preferences. Recommendation systems are generally divided into three filtering techniques: content-based filtering, collaborative filtering, and hybrid filtering [1]. In some studies the filtering technique is combined with a supporting algorithm such as KNN, computational intelligence, or Singular Value Decomposition, with the aim of reducing or eliminating the weaknesses of the filtering technique used.

Reference [2] uses an item-based filtering algorithm that relies on genre-based movie similarity functions to eliminate the cold-start problem in the recommendation system, but it depends heavily on the accuracy of the movie genre determination. Reference [3] developed a hybrid model based on collaborative filtering that combines dimensionality reduction techniques with clustering algorithms: feature selection is carried out in a smaller-dimensional space, and K-means is then used to cluster users that are considered similar, which at least overcomes problems related to data dimensions that are too large. Reference [4] built a recommendation system using simple item-based collaborative filtering and compared the number of users with the number of movie items, finding that the item-based approach runs well when the number of users is much larger than the number of movie items.

Based on past research and the increasing need for recommendation systems, new techniques are needed to improve recommendation system performance [4]. This study focuses on the user-based collaborative filtering algorithm (CFA), which predicts the ratings of movies to be recommended to a user by using the movie ratings of other users with similar behavior.

Implementations of the CFA often compute the similarity between users without considering the number of items (movies) rated by both users, which can select neighbors that are less suitable or relevant to the target user [5]. One solution is to make the number of common items part of the similarity measurement. There are three variations of the CFA that apply a threshold on common items: a percentage-based threshold, a log-value-based threshold, and a probability-value-based threshold [5].
This study compares the conventional CFA with a variation of the CFA that uses a percentage-based threshold with an asymmetrical approach, called the dissymmetrical percentage collaborative filtering algorithm (DSPCFA). The DSPCFA uses a percentage of the number of ratings given by the target user as the threshold for the minimum number of common items that the target user and another user must share [5]. This study also uses two similarity measurement methods, Pearson correlation similarity and cosine similarity. Tests compare the error values of the ratings predicted by the CFA with those predicted by the DSPCFA, and at the same time compare the characteristics of the two similarity measurement methods. The tests seek to determine the effect of using common items as a similarity threshold on the rating predictions, so that the results can inform future development of recommendation systems and improve the quality of their output.

II. METHODOLOGY

Several processes are needed to generate prediction ratings from the recommendation system. This study produces prediction ratings according to experiment scenarios that combine filtering algorithms and similarity measurement methods. The process block for generating and evaluating prediction ratings is shown in Fig. 1.

Fig. 1. Process block for generating and evaluating prediction rating

A. Data Collection

The data are downloaded directly from the MovieLens website, which is managed by GroupLens Research. The data are a collection of movies from various years together with ratings given by a number of users. The initial data has 100,004 ratings and 3,683 tags applied to 9,125 movies. The movies were rated by 610 users between March 29, 1996 and September 24, 2018. The dataset consists of 4 CSV files: ratings.csv, tags.csv, movies.csv, and links.csv [6].

B. Pre-processing Data

In the pre-processing stage, the data are simplified before entering the computational processes. This stage includes selecting the columns (attributes) used from the dataset and combining files in the dataset to create the data structures needed by the computational processes. One step in this stage transforms the combined movies data and ratings data by mapping their records into the rows and columns of a dataframe. This rating dataframe maps user ratings with users and movies as the cross index [7]; the mapping of the rating dataframe can be seen in Fig. 2.

Users are represented by userId and movies are represented by movieId. A blank element in the dataframe, referred to as a 'NaN' (Not a Number) element, occurs when the user has not rated the associated movie. Blank values are ignored and are not included in the similarity measurement.

C. Similarity Measurement

The similarity between two users is calculated from the ratings of the movies rated by both users. Each user has an N-dimensional vector that represents movie ratings. Measurements are made using the rating dataframe as the input data. In this process, the CFA and the DSPCFA differ in how the similarity measurement method is applied.

In the DSPCFA, the common items are counted first and used as a requirement on the output of the similarity measurement method. The requirement is that the number of common items must reach the threshold, where the threshold is determined by the variable $p$: $p$ is multiplied by the number of ratings given by the target user, and the result is used as the threshold value. In the CFA, by contrast, the number of common items does not have to reach a threshold, but it cannot be zero. The DSPCFA can be written as follows [5]:

\[ \mathrm{sim}(x, y) = \begin{cases} 0, & \mathrm{CNum}(x, y) < p \cdot \mathrm{Num}(x) \\ f(x, y), & \mathrm{CNum}(x, y) \ge p \cdot \mathrm{Num}(x) \end{cases} \tag{1} \]

In (1), $\mathrm{sim}(x, y)$ denotes the similarity between user x and user y, and $p$ is the percentage. $\mathrm{Num}(x)$ is the number of ratings given by user x, and $\mathrm{CNum}(x, y)$ is the number of common items between user x and user y. $f(x, y)$ is the similarity measurement method used. This study uses two similarity measurement methods, Pearson correlation similarity and cosine similarity.

1) Pearson Correlation Similarity

This method measures the extent to which two variables are linearly related to each other and is defined as [7]:

\[ \mathrm{sim}(x, y) = \frac{\sum_{s \in S_{xy}} (r_{x,s} - \bar{r}_x)(r_{y,s} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{x,s} - \bar{r}_x)^2} \, \sqrt{\sum_{s \in S_{xy}} (r_{y,s} - \bar{r}_y)^2}} \tag{2} \]


Fig. 2. Mapping of rating dataframe (cross index of movies and user ratings)
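The mapping shown in Fig. 2 can be sketched with a pandas pivot. This is a minimal illustration, not the authors' code: the toy ratings are made up, and only the column names userId, movieId, and rating follow the text; the variable names `ratings`, `rating_frame`, and `common` are assumptions introduced here.

```python
import pandas as pd

# Toy stand-in for ratings.csv; the values are invented for illustration.
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 3],
        "movieId": [10, 20, 10, 30, 20],
        "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
    }
)

# Rows are users, columns are movies; unrated (user, movie) pairs become NaN.
rating_frame = ratings.pivot(index="userId", columns="movieId", values="rating")

# Common items of users 1 and 2: columns where neither user has NaN.
common = rating_frame.loc[[1, 2]].dropna(axis="columns").columns.tolist()

print(rating_frame)
print(common)
```

In this toy frame only movie 10 is rated by both user 1 and user 2, so it is their single common item; every other cell for the pair contains at least one NaN and is left out of the similarity measurement.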

In (2), $\mathrm{sim}(x, y)$ denotes the similarity between user x and user y, $r_{x,s}$ is the rating given to movie s by user x, and $\bar{r}_x$ is the mean rating given by user x. Likewise, $r_{y,s}$ is the rating given to movie s by user y and $\bar{r}_y$ is the mean rating given by user y, while $S_{xy}$ is the set of movies in the user-movie cross index that both users have rated.

The difference between Pearson correlation similarity and cosine similarity is the normalization of the rating vectors in the terms $r_{x,s} - \bar{r}_x$ and $r_{y,s} - \bar{r}_y$, where each rating is reduced by the mean rating of the user who gave it. This mean-centering makes the measure insensitive to how high or low each user tends to rate overall.

2) Cosine Similarity

Cosine similarity differs from the Pearson-based measure in that it is a vector-space model based on linear algebra rather than a statistical approach [1]. It measures the similarity between two n-dimensional vectors based on the angle between them [7]. The similarity between two users x and y can be defined as follows [7]:

\[ \mathrm{sim}(x, y) = \frac{\sum_{s \in S_{xy}} r_{x,s} \, r_{y,s}}{\sqrt{\sum_{s \in S_{xy}} r_{x,s}^2} \, \sqrt{\sum_{s \in S_{xy}} r_{y,s}^2}} \tag{3} \]

The notation is the same as for Pearson correlation similarity.

D. Movie Recommendation using CFA/DSPCFA

This stage contains the prediction process and the selection of the recommended movies. The prediction is based on the K nearest users, that is, those with the highest similarity values [5].

1) Selection for the K-nearest user

This process sorts the other users by their similarity to the target user, starting with the highest similarity value. The variable K then sets the maximum number of nearest users.

2) Calculate prediction rating

The K nearest users selected in the previous process are used to calculate prediction ratings for movies that have not been rated by the target user. The formula for calculating prediction ratings is as follows [7]:

\[ r_{x,i} = \bar{r}_x + k \sum_{y \in U} \mathrm{sim}(x, y) \cdot (r_{y,i} - \bar{r}_y) \tag{4} \]

where

\[ k = \frac{1}{\sum_{y \in U} \lvert \mathrm{sim}(x, y) \rvert} \tag{5} \]

In (4) and (5), $r_{x,i}$ denotes the predicted score of user x for movie i, and $\bar{r}_x$ is the mean rating given by user x. $r_{y,i}$ is the rating given to movie i by user y, and $\bar{r}_y$ is the mean rating given by user y. $k$ is the normalizing weight computed over the K nearest users $U$ of user x, and $\mathrm{sim}(x, y)$ is the similarity value between user x and user y.

3) Make recommendations based on prediction rating

This process determines which movies are recommended: the movies with the highest predicted ratings, excluding movies that the target user has already watched or rated, so that the recommendations contain only movies the target user has not watched. The number of recommended movies is set by the independent variable N, so the final output of the system is N movie recommendations.
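Sections C and D can be sketched end to end in plain Python. This is a hedged illustration under stated assumptions rather than the implementation used in the study: the `RATINGS` data and all function names are hypothetical, user means are taken over all of a user's ratings, neighbors are restricted to users who rated the target movie, and falling back to the user's mean when no neighbor qualifies is a choice made here, not something the text specifies.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}. Not data from the study.
RATINGS = {
    "x": {"m1": 5.0, "m2": 3.0, "m3": 4.0, "m4": 1.0},
    "a": {"m1": 4.0, "m2": 2.0, "m3": 5.0, "m5": 4.0},
    "b": {"m1": 2.0, "m2": 5.0, "m5": 1.0},
    "c": {"m4": 2.0, "m5": 3.0},
}

def mean(user):
    vals = RATINGS[user].values()
    return sum(vals) / len(vals)

def common_items(x, y):
    return sorted(set(RATINGS[x]) & set(RATINGS[y]))

def pearson(x, y):
    """Eq. (2): mean-centered ratings over the common items."""
    items = common_items(x, y)
    dx = [RATINGS[x][m] - mean(x) for m in items]
    dy = [RATINGS[y][m] - mean(y) for m in items]
    den = sqrt(sum(d * d for d in dx)) * sqrt(sum(d * d for d in dy))
    return sum(a * b for a, b in zip(dx, dy)) / den if den else 0.0

def cosine(x, y):
    """Eq. (3): raw ratings over the common items."""
    items = common_items(x, y)
    vx = [RATINGS[x][m] for m in items]
    vy = [RATINGS[y][m] for m in items]
    den = sqrt(sum(v * v for v in vx)) * sqrt(sum(v * v for v in vy))
    return sum(a * b for a, b in zip(vx, vy)) / den if den else 0.0

def dspcfa_sim(x, y, p, f=pearson):
    """Eq. (1): zero unless CNum(x, y) >= p * Num(x); p = 0 behaves like CFA."""
    if len(common_items(x, y)) < p * len(RATINGS[x]):
        return 0.0
    return f(x, y)

def predict(x, item, K, p=0.0, f=pearson):
    """Eqs. (4)-(5) over the K most similar users who rated `item`."""
    sims = [(dspcfa_sim(x, y, p, f), y) for y in RATINGS
            if y != x and item in RATINGS[y]]
    nearest = sorted((t for t in sims if t[0] != 0.0), reverse=True)[:K]
    norm = sum(abs(s) for s, _ in nearest)
    if norm == 0:
        return mean(x)  # assumed fallback when no neighbor qualifies
    return mean(x) + sum(s * (RATINGS[y][item] - mean(y)) for s, y in nearest) / norm

def recommend(x, N, K, p=0.0, f=pearson):
    """Top-N unseen movies by predicted rating."""
    unseen = {m for u in RATINGS for m in RATINGS[u]} - set(RATINGS[x])
    scored = sorted(((predict(x, m, K, p, f), m) for m in unseen), reverse=True)
    return [m for _, m in scored[:N]]

print(predict("x", "m5", K=1, p=0.6))
print(recommend("x", N=1, K=2, p=0.6))
```

In this toy data, user c shares a single movie with x, and with p = 0 that one common item is enough for a Pearson similarity of 1. With p = 0.6 the threshold is 0.6 * 4 = 2.4 common items, so only user a (three common items) remains a neighbor, which is the effect the percentage threshold is meant to have.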


E. Evaluation and Interpretation

At this stage, the results of the prediction calculation are evaluated and interpreted. Error values are calculated with the Root Mean Squared Error (RMSE) [8] and the Mean Absolute Error (MAE) [1]. The formulas for the RMSE and MAE error values are as follows [1][8]:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{x,i} (p_{x,i} - r_{x,i})^2} \tag{6} \]

In (6), $p_{x,i}$ denotes the predicted score of user x for movie i, $r_{x,i}$ is the actual rating given by user x to movie i, and $N$ is the number of ratings given by user x in the movie set.

\[ \mathrm{MAE} = \frac{1}{N} \sum_{x,i} \lvert p_{x,i} - r_{x,i} \rvert \tag{7} \]

The notation in (7) is the same as in the RMSE equation. To give a representative error value for the whole system, this study averages the RMSE or MAE error values over the sample users, so that the system-level error value can be written as follows [5]:

\[ \mathrm{Avg\ RMSE} = \frac{\sum_{i=1}^{N} \mathrm{RMSE}_i}{N} \tag{8} \]

In (8), $\mathrm{RMSE}_i$ denotes the root mean squared error of sample user i, where i is the index over sample users and $N$ is the number of sample users.

\[ \mathrm{Avg\ MAE} = \frac{\sum_{i=1}^{N} \mathrm{MAE}_i}{N} \tag{9} \]

The notation in (9) is the same as in the Avg RMSE equation.

III. EXPERIMENT RESULTS

A. Testing Data

The test uses 36 randomly sampled users out of the 671 available users. The users are divided into sample groups of 20 users each, and random users are drawn from each group; this is expected to give a picture representative of testing on the whole user set. The test focuses on rating values: predicted ratings for movies that already have a rating are compared with the original ratings, so that the combinations of methods can be examined and compared with little subjectivity.

B. Testing Scenarios

Test scenarios are divided by combination of filtering algorithm and similarity measurement method, as shown in Table I. The predicted ratings from Scenario 1 (A / B) and Scenario 2 (A / B) are evaluated to obtain their error values.

TABLE I. MAPPING OF TEST SCENARIOS

Scenario   1                                         2
A          CFA with Pearson correlation similarity   DSPCFA with Pearson correlation similarity
B          CFA with cosine similarity                DSPCFA with cosine similarity

C. Test Results

1) Scenario 1

In this scenario, ratings are predicted with the CFA using the two similarity measurement methods, which are compared in a 2-dimensional graph showing the error values of the predicted ratings across values of K, the number of nearest users. The error value is obtained with RMSE and MAE by comparing the rating predicted by the CFA for a movie that has already been rated with its actual rating. The error values for Scenario 1 are shown in Fig. 3 and Fig. 4.

Fig. 3 and Fig. 4 show that the error values obtained with Pearson correlation similarity are lower than those obtained with cosine similarity under both RMSE and MAE. In both evaluations, the error value decreases as K increases, up to a certain value of K.

Fig. 3. The comparison of RMSE error value for Scenario 1 (CFA) on the distribution of K
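The error measures in (6) to (9), which produce the values plotted for each scenario, can be sketched as follows. The function names and the two sample users are hypothetical and chosen only to make the arithmetic easy to follow; this is an illustration of the formulas, not the evaluation code used in the study.

```python
from math import sqrt

def rmse(predicted, actual):
    """Eq. (6) for one sample user; inputs are parallel lists of ratings."""
    n = len(actual)
    return sqrt(sum((p - r) ** 2 for p, r in zip(predicted, actual)) / n)

def mae(predicted, actual):
    """Eq. (7) for one sample user."""
    n = len(actual)
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / n

def average(values):
    """Eqs. (8)-(9): mean of the per-user error values."""
    return sum(values) / len(values)

# Hypothetical (predicted, actual) ratings for two sample users.
user_1 = ([3.5, 4.0, 2.0], [4.0, 4.0, 3.0])
user_2 = ([5.0, 1.0], [4.0, 2.0])

avg_rmse = average([rmse(*user_1), rmse(*user_2)])
avg_mae = average([mae(*user_1), mae(*user_2)])
print(avg_rmse, avg_mae)
```

Because RMSE squares each error before averaging, a single large miss weighs more than it does under MAE, so for any one user the RMSE is never smaller than the MAE.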


Fig. 4. The comparison of MAE error value for Scenario 1 (CFA) on the distribution of K

2) Scenario 2

In this scenario, ratings are predicted with the DSPCFA using the two similarity measurement methods, which are compared in 3-dimensional graphs showing the error values of the predicted ratings across values of p, the percentage, and K, the number of nearest users. The error value is obtained with RMSE and MAE by comparing the rating predicted by the DSPCFA for a movie that has already been rated with its actual rating. The error values for Scenario 2 are shown in Fig. 5 to Fig. 8.

Fig. 5 to Fig. 8 show that the shape of the error distribution differs little from one figure to another: on average, the lowest error values occur around p = 0.3 and the highest around p = 0.8. The DSPCFA therefore gives its best results at a suitable value of p, and the average best value of p = 0.3 indicates a common-item threshold that matches the distribution of user ratings in this dataset.

Fig. 5. The comparison of RMSE error value for Scenario 2A (DSPCFA-Pearson)

Fig. 6. The comparison of RMSE error value for Scenario 2B (DSPCFA-Cosine)

Fig. 7. The comparison of MAE error value for Scenario 2A (DSPCFA-Pearson)

Fig. 8. The comparison of MAE error value for Scenario 2B (DSPCFA-Cosine)

D. Result Analysis

The final test compares the RMSE and MAE error values of Scenarios 1A/B and 2A/B with one another. Table II shows the minimum RMSE and MAE error value for each scenario and identifies the combination of filtering algorithm and similarity measurement method that produces the smallest error value. Table III shows the same minimum error values together with the decrease in error of the DSPCFA relative to the CFA.
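The percentages in the last column of Table III appear consistent with a relative decrease, (CFA error - DSPCFA error) / CFA error; the formula is not stated in the text, so this reading is an assumption. The short check below applies it to the minimum error values reported in Table II; the names `MIN_ERROR` and `relative_decrease` are introduced here for illustration.

```python
# Minimum error values as reported in Table II: (CFA, DSPCFA).
MIN_ERROR = {
    ("RMSE", "Pearson"): (0.86119, 0.80483),
    ("RMSE", "Cosine"): (0.88768, 0.83988),
    ("MAE", "Pearson"): (0.67463, 0.62192),
    ("MAE", "Cosine"): (0.69791, 0.64953),
}

def relative_decrease(cfa, dspcfa):
    """Decrease of the DSPCFA error relative to the CFA error, in percent."""
    return (cfa - dspcfa) / cfa * 100.0

decrease = {key: round(relative_decrease(*pair), 2) for key, pair in MIN_ERROR.items()}

for (metric, similarity), pct in decrease.items():
    print(f"{metric:4} {similarity:8} {pct:.2f}%")
```

Under this assumption the four results round to 6.54%, 5.38%, 7.81%, and 6.93%, matching the values listed in Table III.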


TABLE II. THE RESULT OF MINIMUM ERROR VALUE FOR EACH SCENARIO

Evaluation Method   Filtering Algorithm    Similarity Method   Error Value
RMSE                Scenario 1 (CFA)       Pearson (A)         0.86119
                                           Cosine (B)          0.88768
                    Scenario 2 (DSPCFA)    Pearson (A)         0.80483
                                           Cosine (B)          0.83988
                    Min. Error: DSPCFA     Pearson (A)         0.80483
MAE                 Scenario 1 (CFA)       Pearson (A)         0.67463
                                           Cosine (B)          0.69791
                    Scenario 2 (DSPCFA)    Pearson (A)         0.62192
                                           Cosine (B)          0.64953
                    Min. Error: DSPCFA     Pearson (A)         0.62192

TABLE III. THE RESULT OF DSPCFA MINIMUM ERROR VALUE TO CFA

Evaluation Method   Similarity Method   Scenario 1 (CFA)   Scenario 2 (DSPCFA)   Decrease of DSPCFA error relative to CFA
RMSE                Pearson             0.86119            0.80483               6.54%
                    Cosine              0.88768            0.83988               5.38%
MAE                 Pearson             0.67463            0.62192               7.81%
                    Cosine              0.69791            0.64953               6.93%

Based on the charts and the tables above, the Pearson correlation similarity method produces a smaller error value than the cosine similarity method. This means that normalizing the rating vectors results in a smaller error value than using them without normalization. For the DSPCFA, the threshold reduced the error value by roughly 5% to 8% (Table III) with both Pearson correlation similarity and cosine similarity. One thing to note is that the DSPCFA works effectively when users share varying numbers of common items; the fewer users who have items in common with the target user, the less effective the DSPCFA becomes, because the minimum number of common items then has less and less influence. The DSPCFA also requires attention to the threshold on common items: if the threshold is too high or too low, it reduces both the effectiveness and the accuracy of the rating prediction.

IV. CONCLUSION

Using a threshold on the minimum number of common items, as in the DSPCFA, produces a smaller evaluation error than the CFA, with a decrease of about 5% to 7% for RMSE (5.38% with cosine similarity and 6.54% with Pearson correlation similarity) and about 7% to 8% for MAE (6.93% and 7.81%, respectively). In addition, the Pearson correlation similarity method produces a smaller error value than the cosine similarity method. These results indicate that the limitation on common items used in the DSPCFA gives better predictions than the CFA.

REFERENCES

[1] F. Isinkaye, Y. O. Folajimi, and B. A. Ojokoh, "Recommendation systems: Principles, methods and evaluation," Egyptian Informatics Journal, 2015.
[2] P. Pirasteh, J. J. Jung, and D. Hwang, "Item-Based Collaborative Filtering with Attribute Correlation: A Case Study on Movie Recommendation," ACIIDS 2014, pp. 245-252, 2014.
[3] Z. Wang, X. Yu, N. Feng, and Z. Wang, "An improved collaborative movie recommendation system using computational intelligence," Journal of Visual Languages and Computing, no. 25, pp. 667-675, 2014.
[4] L. T. Ponnam, S. D. Punyasamudram, S. N. Nallagulla, and S. Yellamati, "Movie Recommender System Using Item Based Collaborative Filtering Technique," 2016.
[5] Q. Wang, T. Zhang, and Z. Rong, "Collaborative Filtering Similarity Algorithm Using Common Items," WHICEB 2017, vol. 16, 2017.
[6] GroupLens, "MovieLens | GroupLens," 2015. [Online]. Available: http://grouplens.org/datasets/movielens/. [Accessed 13 08 2018].
[7] Bei-Bei C., "Design and Implementation of Movie Recommendation System Based on," ITM Web of Conferences, vol. 12, 2017.
[8] C. J. Willmott and K. Matsuura, "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance," Climate Research, vol. 30, pp. 79-82, 2005.
