
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March-April (2013), pp. 366-372, IAEME. Journal Impact Factor (2013): 6.1302 (Calculated by GISI)


Anuj Verma (1), Kishore Bhamidipati (2)

(1) Dept. of Computer Science and Engineering, Manipal Institute of Technology, Manipal University, Manipal, Karnataka - 576104, India
(2) Asst. Professor - Sr. Scale, Dept. of Computer Science and Engineering, Manipal Institute of Technology, Manipal University, Manipal, Karnataka - 576104, India

ABSTRACT

Cyberspace aims at providing an increasingly dynamic experience to users. The rise of electronic commerce has led to efforts to provide a highly efficient, high-quality experience to the consumer. Recommender systems are a step in this direction. They help make sense of the vast amount of data available and, in particular, help in understanding each user. One of the most successful techniques for generating recommendations is collaborative filtering. The technique uses available information about existing users to generate predictions for the active user. A widely employed approach for this purpose is the memory-based algorithm. The existing preferences of users are represented in the form of a user-item matrix. The method uses the complete or partial user-item matrix to isolate the nearest users for the active user and then generates the prediction. The majority of initial efforts dedicated to understanding electronic commerce and recommender systems concentrate only on technical aspects such as algorithm building and the computational needs of such systems. Little attention has been paid to questions about the need for such systems or how effective they are at what they try to do. Along with looking at the various stages of a memory-based collaborative filtering system, we propose an experiment to check the effectiveness of the predictions or ratings generated by such systems.

Keywords: Collaborative Filtering, Collective Intelligence, Memory Based Algorithm, Recommendation System





1. INTRODUCTION

Recommender systems are an important tool for market analysis and for a deeper understanding of customer behavior. This is especially relevant in the current era of information explosion, when a customer is confounded by the variety of options available on every issue. Recommender systems try to bridge the gap between the user and the market by mathematically determining what a user may prefer.

The distinguishing feature of collaborative filtering is its versatility. The technique is unchanged for any type of data, i.e., the working of the system remains the same irrespective of the nature of the information, so the system structure is the same for any application, from a book recommendation system to a movie recommendation system. Collaborative filtering can be realized through two different approaches: memory-based and model-based. Memory-based collaborative filtering systems use the complete user-item rating matrix, or a part of it, to generate recommendations. The model-based approach attempts to determine a pattern or trend in the given ratings data and then constructs a model to generate recommendations [1].

The memory-based approach has been discussed at length and is predominantly used in commercial systems for several reasons. The first is its ease of use: since it concentrates on the user-item database, it is easier to apply and account for. The second is its intuitive nature: as the system keeps collecting data about a particular user, it takes this new information into account when generating recommendations, so the predictions are always up to date. The third is cost: memory-based systems are less costly and hence outperform the other approach in speed and resource usage [2].

The first limitation of this approach is that it is rating dependent. The behavioral trends or tastes of a user may change over time. The user can also become reluctant during the rating process and may rate items selectively or incorrectly. Another factor is the limited scope of ratings: data belonging to a particular domain can be used to generate predictions for that domain only. It is difficult to predict the breakfast preferences of a user by analyzing the music that user listens to. The second limitation is data sparsity. When a new user is introduced to the system, it takes time to build a profile because no information about that user exists yet. This is called the cold start problem [2].

We examine the various stages of a general collaborative filtering algorithm and then analyze the effectiveness of the techniques employed in them.

2. COLLABORATIVE FILTERING ALGORITHM

A recommender system can be imagined as a black box used as a filter of information. The input is the data gathered about a user (the active user). One of the most important algorithms inside the filter is the similarity computation algorithm, which determines the proximity of different users and represents this nearness in the form of a numerical weight. This weight can be any measure of nearness between two entities. Euclidean distance and the correlation coefficient are widely used measures; angular distance can also be used. The output of the filter is the generated recommendation. The three main stages of the process are: Representation of Input Data, Similarity Computation and Recommendation Generation [3].
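The three stages named above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the toy ratings, all function names, and the neighborhood parameter `k` are invented here. The Euclidean weight is taken as 1/(1 + distance), the Pearson means are computed over co-rated items only, and the prediction is the usual mean-centered weighted average; each of these is one common choice among several.

```python
from math import sqrt

# Stage 1: input data as a sparse user-item matrix (toy ratings, invented here).
ratings = {
    "A": {"m1": 5, "m2": 3, "m3": 4},
    "B": {"m1": 4, "m2": 2, "m3": 5, "m4": 4},
    "C": {"m1": 1, "m2": 5, "m3": 2, "m4": 2},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def co_rated(a, b):
    """Items rated by both users."""
    return sorted(set(ratings[a]) & set(ratings[b]))

# Stage 2: similarity weights between two users.
def euclidean_sim(a, b):
    """1 / (1 + Euclidean distance) over co-rated items; lies in (0, 1]."""
    items = co_rated(a, b)
    if not items:
        return 0.0
    dist = sqrt(sum((ratings[a][i] - ratings[b][i]) ** 2 for i in items))
    return 1.0 / (1.0 + dist)

def pearson_sim(a, b):
    """Pearson correlation over co-rated items; lies in [-1, 1]."""
    items = co_rated(a, b)
    if len(items) < 2:
        return 0.0
    mean_a = mean(ratings[a][i] for i in items)
    mean_b = mean(ratings[b][i] for i in items)
    dev_a = [ratings[a][i] - mean_a for i in items]
    dev_b = [ratings[b][i] - mean_b for i in items]
    denom = sqrt(sum(x * x for x in dev_a)) * sqrt(sum(y * y for y in dev_b))
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denom if denom else 0.0

def cosine_sim(a, b):
    """Cosine of the angle between the two users' rating vectors."""
    dot = sum(ratings[a][i] * ratings[b][i] for i in co_rated(a, b))
    norm_a = sqrt(sum(r * r for r in ratings[a].values()))
    norm_b = sqrt(sum(r * r for r in ratings[b].values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stage 3: prediction as a mean-centered weighted average over the neighborhood.
def predict(active, item, sim=pearson_sim, k=30):
    others = [u for u in ratings if u != active]
    neighbors = sorted(others, key=lambda u: sim(active, u), reverse=True)[:k]
    raters = [u for u in neighbors if item in ratings[u]]
    denom = sum(abs(sim(active, u)) for u in raters)
    if denom == 0:
        return None  # no usable neighbor rated the item: prediction not generated
    num = sum(sim(active, u) * (ratings[u][item] - mean(ratings[u].values()))
              for u in raters)
    return mean(ratings[active].values()) + num / denom

print(predict("A", "m4"))
```

Passing `sim=euclidean_sim` or `sim=cosine_sim` to `predict` swaps the similarity measure without touching the other two stages, which is the property that makes the measures directly comparable.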



2.1 Representation of Input Data

Data about a specific user can be gathered by explicit and implicit methods. The explicit method involves collecting ratings or asking about the likes and dislikes of a consumer (Thumbs Up/Thumbs Down buttons). The implicit method involves checking the browsing history, tracking the number of clicks and recording the time spent on a particular page. The gathered data is represented in the form of the user-item matrix [4].

2.2 Similarity Computation

Similarity computation is the most important step of the recommendation system because the accuracy of these calculations determines the accuracy of the system. This step identifies the k nearest users to the active user. These k users form the neighborhood of the active user, and the rating is generated from this neighborhood [3]. The different methods to calculate the similarity are:

2.2.1 Euclidean Distance

This method takes into account all the items that two users have rated in common and represents them as the axes of a graph. The users are then represented as points on the graph, and the distance between the points is measured using the Euclidean distance formula. The distance-based weight between users A and B can be represented as follows:
$$w(A,B) = \frac{1}{1 + \sqrt{\sum_{i=1}^{n} (r_{A,i} - r_{B,i})^2}} \qquad (1)$$

where $r_{A,i}$ and $r_{B,i}$ are the ratings for the $i$-th item of users A and B respectively, who have a total of n co-rated items. Despite its simplicity, a disadvantage of this method is the two-dimensional nature of the measure. The range of this measure is [0,1] [5].

2.2.2 Pearson Correlation

The Pearson Correlation Coefficient is a widely used statistical measure of how strongly two entities are related. It determines the degree of association between two variables. The nearer the points are to a linear trajectory, the higher their strength of association. The Pearson Correlation between users A and B can be represented as follows [6]:

$$w(A,B) = \frac{\sum_{i=1}^{n} (r_{A,i} - \bar{r}_A)(r_{B,i} - \bar{r}_B)}{\sqrt{\sum_{i=1}^{n} (r_{A,i} - \bar{r}_A)^2}\,\sqrt{\sum_{i=1}^{n} (r_{B,i} - \bar{r}_B)^2}} \qquad (2)$$


where $r_{A,i}$ and $r_{B,i}$ are the ratings for the $i$-th item of users A and B respectively, who have a total of n co-rated items, and $\bar{r}_A$ and $\bar{r}_B$ are the average ratings for user A and user B respectively. Unlike the Euclidean distance, it has a wider range, [-1,1], and can assume negative values. Its strength is that it accommodates differences in rating scale between users and thus corrects for non-normalized data.

2.2.3 Cosine Similarity

Vector based cosine similarity is an important technique used for string matching and for checking the similarity of two documents. It can be applied to collaborative filtering as well if the users are considered as documents to be matched, items are considered as words and the ratings are considered as the frequency of occurrence of words. By using this measure we are trying to establish the angle between the two vectors [6]. The cosine similarity between two users A and B can be represented as follows:

$$w(A,B) = \frac{\sum_{i=1}^{n} r_{A,i}\, r_{B,i}}{\sqrt{\sum_{i=1}^{n_1} r_{A,i}^2}\,\sqrt{\sum_{i=1}^{n_2} r_{B,i}^2}} \qquad (3)$$


where $r_{A,i}$ and $r_{B,i}$ are the ratings for the $i$-th item of users A and B respectively, who have a total of n co-rated items. $n_1$ and $n_2$ denote the number of items rated by user A and user B respectively. It has a range of [-1,1]. The cosine of an angle of 0 is 1, which indicates that the vectors coincide and hence that the users have similar tastes. This measure is particularly useful when the data is sparse or the co-rated items are few, so that a useful relationship cannot be determined using other measures.

2.3 Prediction Generation

Once the similarity computation is completed and a suitable neighborhood is formed, the prediction is generated. The task can be accomplished using various methods, the most trivial of which is taking a simple average of the neighbors' ratings. A more effective method is to take the weighted average of the available ratings. The predicted rating for a particular item k for user A can be represented as follows [6]:

$$p_{A,k} = \bar{r}_A + \frac{\sum_{q=1}^{n} w(A,q)\,(r_{q,k} - \bar{r}_q)}{\sum_{q=1}^{n} |w(A,q)|} \qquad (4)$$

where $\bar{r}_A$ is the average rating for items rated by user A, $w(A,q)$ is the similarity between user A and neighbor q, and n is the number of neighbors in the neighborhood. $r_{q,k}$ is user q's rating for item k and $\bar{r}_q$ is the average rating for all items rated by user q. The aim is to calculate the expected rating for all items that have not yet been rated by the active user and then recommend the N highest rated items in the neighborhood. This is called the Top N recommendation approach [7].

3. EXPERIMENTAL DESIGN

The success of a collaborative filtering based algorithm depends on the effectiveness of the similarity computation method used. Hence our task is the evaluation of three techniques commonly used for memory-based collaborative filtering: Euclidean Distance, Pearson Correlation Coefficient and Vector Based Cosine Similarity. An empirical study was conducted to measure the effectiveness of the ratings generated using the various similarity measures. For our study, we considered explicit input, i.e., numerical ratings given by the user. The dataset used was a popular dataset for movie recommendation systems, the MovieLens 100,000 movie ratings dataset (MovieLens is a free service provided by GroupLens Research at the University of Minnesota). A standard one-page questionnaire was prepared containing a list of 100 common movies belonging to the dataset, and users were asked to provide ratings for them. The ratings were collected from 52 users locally as well as through online social networking media. Basic demographic information about the users was also recorded. For an input form to be valid, a user must have rated at least 15% of the items, i.e., 15 movies out of 100. Participant details are as follows:


Total Participants = 52
Age Range = 14-48 years
Gender Ratio (Male:Female) = 2:1
Valid Ratings = 1188

To empirically evaluate the techniques, we use repeated random sub-sampling validation. About 10% of the collected ratings act as validation data, i.e., 120 ratings out of 1188, so that we can generate the corresponding ratings from the system and then compare the deviation. The validation subset is generated at random. The rest of the collected ratings form one part of the training subset; the other part consists of the ratings already available in the MovieLens dataset for the corresponding items. To arrive at a firm conclusion, this procedure is repeated thrice for each similarity measure. Mean-centered ratings are used. We are interested in the relative performance of the three measures. The neighborhood size is fixed at 30. Two users must have at least 5 co-rated items for their similarity to be considered.

Total Valid Collected Ratings = 1188
Collected Ratings to be Tested = 10% of the Valid Collected Ratings = 120
Collected Ratings used to generate predictions by the CF System = 1188 - 120 = 1068
Already available Ratings for the corresponding items from the MovieLens Dataset = 19228
Total Ratings used by the CF System = 1068 + 19228 = 20296

Therefore, 20296 ratings are used to generate predictions for 120 ratings.

Neighborhood Size = 30 Nearest Neighbors
The process is carried out thrice for each of the 3 techniques, hence Total No. of Passes = 3 x 3 = 9

4. RESULTS

To judge the accuracy of each similarity computation technique, we consider the following parameters:

Average deviation for the generated ratings. Deviation is the difference between the actual rating and the predicted rating. Average deviation is measured by calculating the MAE (Mean Absolute Error), given by:

$$MAE = \frac{1}{|R_{gen}|} \sum_{i=1}^{|R_{gen}|} |r_i - p_i|$$

where $r_i$ is a rating in the test set $R_{test}$ for which a prediction could be generated, $p_i$ is the corresponding rating generated using the training set $R_{train}$, and $R_{gen}$ is the set of generated ratings.

No. of satisfactory or good predictions, i.e., the number of generated predictions for which the deviation was < 0.5.

No. of unsatisfactory or bad predictions, i.e., the number of generated predictions for which the deviation was > 1.

Following is the data obtained after carrying out the studied design:



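Table 1: Obtained Results

| Pass | Similarity Type | R_test | R_train | R_gen | Total Deviation | MAE | Good Ratings | Bad Ratings |
|---|---|---|---|---|---|---|---|---|
| 1 | PCC | 120 | 20296 | 107 | 98.05 | 0.916 | 38 | 39 |
| 2 | PCC | 120 | 20296 | 108 | 102.89 | 0.953 | 36 | 45 |
| 3 | PCC | 120 | 20296 | 110 | 95.43 | 0.868 | 39 | 33 |
| 4 | ED | 120 | 20296 | 88 | 72.55 | 0.824 | 28 | 33 |
| 5 | ED | 120 | 20296 | 96 | 87.35 | 0.91 | 32 | 37 |
| 6 | ED | 120 | 20296 | 104 | 87.64 | 0.843 | 32 | 39 |
| 7 | CS | 120 | 20296 | 120 | 85.49 | 0.712 | 49 | 32 |
| 8 | CS | 120 | 20296 | 120 | 95.49 | 0.796 | 41 | 36 |
| 9 | CS | 120 | 20296 | 120 | 85.64 | 0.714 | 45 | 29 |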

(Note: PCC - Pearson Correlation Coefficient; ED - Euclidean Distance; CS - Cosine Similarity)

An important observation is that the number of ratings generated, $|R_{gen}|$, can be smaller than the number of ratings tested, $|R_{test}|$. This is because when similarity is computed for an active user, only 30 neighbors are considered. The items rated by these 30 users do not always cover all 100 items for which ratings have been recorded. Hence, in some cases, the rating for a particular item for a specific user is left un-generated. Therefore, in calculating MAE, we divide by $|R_{gen}|$ instead of $|R_{test}|$. All three passes of a similarity type are then used to measure the mean performance:

Table 2: Mean Performance

| SN | Similarity Type | Avg. MAE | Avg. % Good Ratings | Avg. % Bad Ratings |
|---|---|---|---|---|
| 1 | Euclidean Distance | 0.859 | 31.97 | 37.85 |
| 2 | Pearson Correlation | 0.912 | 34.77 | 36.04 |
| 3 | Cosine Similarity | 0.741 | 37.5 | 26.94 |
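The mean performance figures can be recomputed from the per-pass results of Table 1. The sketch below assumes that each pass's MAE and percentages are taken relative to the number of ratings generated in that pass, and that the three per-pass values are then averaged; this reading reproduces the Table 2 figures to the precision shown. The names `TABLE_1` and `mean_performance` are illustrative only.

```python
# Rows of Table 1: (similarity type, ratings generated, total deviation, good, bad)
TABLE_1 = [
    ("PCC", 107, 98.05, 38, 39),
    ("PCC", 108, 102.89, 36, 45),
    ("PCC", 110, 95.43, 39, 33),
    ("ED", 88, 72.55, 28, 33),
    ("ED", 96, 87.35, 32, 37),
    ("ED", 104, 87.64, 32, 39),
    ("CS", 120, 85.49, 49, 32),
    ("CS", 120, 95.49, 41, 36),
    ("CS", 120, 85.64, 45, 29),
]

def mean_performance(rows, sim_type):
    """Average the per-pass MAE and good/bad percentages for one similarity type."""
    selected = [r for r in rows if r[0] == sim_type]
    n = len(selected)
    return {
        "avg_mae": sum(dev / gen for _, gen, dev, _, _ in selected) / n,
        "avg_pct_good": sum(100 * good / gen for _, gen, _, good, _ in selected) / n,
        "avg_pct_bad": sum(100 * bad / gen for _, gen, _, _, bad in selected) / n,
    }

for sim_type in ("ED", "PCC", "CS"):
    summary = mean_performance(TABLE_1, sim_type)
    print(sim_type, {key: round(value, 3) for key, value in summary.items()})
```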

As can be observed, for a slight increase in average MAE, the Pearson Correlation Coefficient produces a higher percentage of good predictions and a lower percentage of bad predictions than the Euclidean distance measure. It can thus be considered the superior measure of the two. The Vector Based Cosine Similarity outperforms the other two on all three parameters and is hence the most effective of the three. It can be stated that the memory-based collaborative filtering algorithm performs the recommendation task with reasonable accuracy and precision. It can thus be considered a significant approach for collaborative filtering based recommendation.
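The evaluation parameters used in this section (a random hold-out of the collected ratings, MAE over the generated ratings, and counts of predictions with deviation below 0.5 or above 1) can be sketched as follows. This is an illustrative outline only: the function names, the `seed` parameter and the convention of marking an un-generated prediction as `None` are assumptions, not details taken from the study.

```python
import random

def split_ratings(all_ratings, n_test, seed=None):
    """Randomly hold out n_test ratings for validation; the rest are for training."""
    rng = random.Random(seed)
    shuffled = list(all_ratings)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (training part, test part)

def evaluate(pairs):
    """pairs: list of (actual rating, predicted rating or None if not generated)."""
    deviations = [abs(actual - pred) for actual, pred in pairs if pred is not None]
    return {
        "generated": len(deviations),
        "total_deviation": sum(deviations),
        # MAE divides by the number of generated ratings, not the number tested.
        "mae": sum(deviations) / len(deviations) if deviations else None,
        "good": sum(d < 0.5 for d in deviations),
        "bad": sum(d > 1 for d in deviations),
    }

# One pass over hypothetical data: 1188 collected ratings, 120 held out.
train, test = split_ratings(range(1188), n_test=120, seed=0)
print(len(train), len(test))
print(evaluate([(4, 4.2), (3, None), (5, 3.5), (2, 2.8)]))
```

Repeating the split with different seeds and averaging the per-pass results corresponds to the repeated random sub-sampling described in Section 3.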




5. CONCLUSION

The aim of recommender systems is to automate the generation of precise predictions. We set out to check the effectiveness of a widely used approach for this purpose, the memory-based algorithm. The most crucial stage in the algorithm is neighborhood formation by similarity calculation, so we checked the effectiveness of commonly used similarity measures. The quantitative results of our experiments indicate that Vector Based Cosine Similarity is a more effective similarity measure than the Pearson Correlation Coefficient and Euclidean distance based similarity. The memory-based algorithm produces practicable predictions and is thus an effective technique for online recommendation. Possible extensions include carrying out a similar study after normalizing the ratings (z-score normalization can be used for this) and varying the similarity weight according to the number of co-rated items (significance weighting).

REFERENCES

[1] B.M. Sarwar, G. Karypis, J.A. Konstan and J. Riedl, Item based collaborative filtering recommendation algorithms, Proc. 10th International Conference on World Wide Web (WWW '01), 2001, 285-295.
[2] X. Su and T.M. Khoshgoftaar, A Survey of Collaborative Filtering Techniques, Advances in Artificial Intelligence, Hindawi Publishing Corporation, Article ID 421425, 2009, 19 pages.
[3] E. Vozalis and K.G. Margaritis, Analysis of Recommender Systems Algorithms, Proc. 6th Hellenic-European Conference on Computer Mathematics and its Applications (HERCMA), 2003/9.
[4] D. Militaru and C. Zaharia, A survey of collaborative filtering-based systems for online recommendation, Proc. 12th International Conference on Electronic Commerce: Roadmap for the Future of Electronic Business (ICEC '10), ACM, New York, 2010, 43-47.
[5] G. Adomavicius and A. Tuzhilin, Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, IEEE Transactions on Knowledge and Data Engineering, 17(6), 2005, 734-749.
[6] T. Segaran, Making Recommendations, in Programming Collective Intelligence, (USA: O'Reilly Media, 2007) 7-28.
[7] J. Breese, D. Heckerman and C. Kadie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering, Microsoft Research, Redmond, Technical Report MSR-TR-98-12, 1998, 43-52.
[8] C.R. Cyril Anthoni and A. Christy, Integration of Feature Sets with Machine Learning Techniques for Spam Filtering, International Journal of Computer Engineering & Technology (IJCET), Volume 2, Issue 1, 2011, pp. 47-52, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[9] Suresh Kumar RG, S. Saravanan and Soumik Mukherjee, Recommendations for Implementing Cloud Computing Management Platforms using Open Source, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 3, 2012, pp. 83-93, ISSN Print: 0976-6367, ISSN Online: 0976-6375.