
Proc. of the Intl. Conf. on Advances In Engineering And Technology - ICAET-2014
Copyright Institute of Research Engineers and Doctors. All rights reserved.
ISBN: 978-1-63248-028-6 doi: 10.15224/978-1-63248-028-6-01-10

A Survey on Recommendation Algorithm for Movie Recommendation on Cloud
[ Swati Pandey, Dr. T. Senthil Kumar]

Swati Pandey
Amrita University, Coimbatore -461112, Tamilnadu, India
swati.pandeycs@gmail.com

Dr. T. Senthil Kumar
Amrita University, Coimbatore -461112, Tamilnadu, India
senthan111@gmail.com

Abstract In the current era, the Web is the best source for getting information or making a decision about something. People seek suggestions before making decisions such as buying a product or booking movie tickets. In such cases recommendation systems play an important role. A recommendation system works on data about users and about the items to be recommended. Because this data is huge, distributed systems come into play.

Keywords Recommendation System, Hadoop, HBase, Mahout, Map-Reduce

I. Introduction

Nowadays, people prefer the web for finding solutions to their problems. Search engines fulfill their requirements only partially. A search engine does not give results according to the user's preference, the user's taste, or the content of the item required by the user; it provides all possible outcomes related to the user's query. Hence, for making an efficient decision regarding an item or person, recommendation systems are needed. Recommendation systems are used for the personalization of information. They help users make decisions regarding an item or a person [12], for example which movie to watch or which book to buy. There are websites through which users can give their ratings or views about a particular item. Input for a recommendation system can be [11]:

Rating: the opinion of a user about an item. It can be collected implicitly or explicitly.

Demographic data: information about the user such as age, gender, and occupation. It is normally collected explicitly.

Output of the recommendation system can be [11]:

Prediction: the predicted opinion of a user about an item. It should be on the same scale as the input rating.

Recommendation: a list of the top N homogeneous items recommended for an active user.

For recommendation purposes we need to collect data about related items or things, users or user groups, and their views or ratings. A recommendation system uses machine learning algorithms for recommending items or persons. Since the dataset used for recommendation may be large, processing it on a single node is not efficient. To optimize execution and get solutions fast, we go for the cloud. A cloud is a specialized form of distributed system. A distributed system consists of a collection of autonomous computers connected through a network and distributed middleware, which enables the computers to coordinate their activities and to share the resources of the system, so that the user perceives the system as a single, integrated computing facility [1]. Hadoop is an open-source, cluster-based framework used for writing and running distributed applications that process large amounts of data [3]. It is a framework for Map-Reduce programming. The Map-Reduce programming model is used to process and generate large datasets according to Map and Reduce functions [2]. The whole dataset is divided into key/value pairs. The Map function, specified by the user, processes key/value pairs to generate intermediate key/value pairs. After this processing, the Reduce function merges all intermediate values associated with the same intermediate key. The final result is then given to the user [2].

The rest of the paper is arranged as follows. Section 2 describes the basic procedure for building a recommendation system. Section 3 elaborates on the related work. Section 4 focuses on the outcome of the survey. Section 5 describes the proposed customization of a movie recommendation system.

II. Procedure for Building Recommendation System

1. Collect relevant data about users and item ratings or views of users for items (segmentation and noise removal of relevant web pages)
2. Convert data into the required format (matrix, table, etc.)
3. Recommendation for similar items or users (based on the algorithm to be used)
4. Recommendation computation is performed based on the chosen algorithm
5. Recommend item/person or list of items/persons

Figure 1: Basic Procedure for Recommendation System
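The Map-Reduce flow described in the Introduction (a user-supplied Map function emits intermediate key/value pairs, and Reduce merges all values that share a key) can be illustrated with a minimal single-process sketch. This is plain Python rather than Hadoop code; the names `map_rating`, `reduce_ratings`, `run_mapreduce` and the sample ratings are assumptions made purely for illustration.

```python
from collections import defaultdict

def map_rating(record):
    """Map: turn one (user, movie, rating) record into an intermediate (movie, rating) pair."""
    user, movie, rating = record
    yield movie, rating

def reduce_ratings(movie, ratings):
    """Reduce: merge all ratings that share the same movie key into their average."""
    return movie, sum(ratings) / len(ratings)

def run_mapreduce(records, mapper, reducer):
    """Simulate the shuffle step: group intermediate values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical sample data, not taken from the paper.
ratings = [
    ("u1", "MovieA", 4.0),
    ("u2", "MovieA", 2.0),
    ("u1", "MovieB", 5.0),
]
average_by_movie = run_mapreduce(ratings, map_rating, reduce_ratings)
print(average_by_movie)
```

On a real cluster the grouping step is performed by the framework across nodes; here it is a dictionary, which keeps the roles of the Map and Reduce functions visible without any cluster setup.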


For a recommendation engine, we first need to collect data. Data can be collected implicitly or explicitly. The data may contain unwanted information, which can be removed using noise removal techniques. After getting relevant data, we convert it into the required form (generally a matrix or tables). Now the data is ready to use. With the help of an algorithm, we can recommend an item or person, or predict which item will be liked by a given user. The basic procedure for building a recommendation system is shown in figure 1.

III. Related Work

A. Segmentation and Noise Removal
To extract relevant information from a web page, segmentation plays an important role. For a movie recommendation system, implicit or explicit ratings of users on movies are required. Sometimes users write comments on movies; to get a rating from such comments we need to extract them from the particular web page. A web page can contain multiple parts with different information content [4]. Hence the page has to be segmented and non-relevant parts have to be removed to get precise results. For recommending goods, we need the ratings or views of different users about items, and we need to find similar users. To collect this data, only the relevant part of the web page is accessed. Research work on segmentation and noise removal is shown in table 1.

B. Machine Learning Approaches for Movie Recommendation System
There are different machine learning approaches for recommendation. These approaches can be categorized as follows [8]:

Random Prediction Algorithm: An item is picked at random from the large set of items and recommended to the user. Its accuracy depends on luck, hence this algorithm fails in practice.

Frequent Sequence Algorithm: It recommends an item to the user based on the user's past ratings or views of items. Its accuracy depends on the user's past ratings.

Content Based Filtering: It focuses on the content and attributes of items. The item whose content correlates most with the content of items the user has already viewed, and which satisfies the user's preferences, is recommended.

Collaborative Filtering: This algorithm identifies users that have relevant interests and preferences by calculating similarities and dissimilarities between user profiles.

Hybrid Approach: It combines both the content-based and the collaborative filtering approaches.

Collaborative filtering can be categorized into two types: user-based collaborative filtering and item-based collaborative filtering. User-based collaborative filtering is also known as nearest-neighbor collaborative filtering. In this approach, statistical techniques are used to find the nearest neighbors of the target user. These systems use algorithms to calculate the similarity between the user profiles of nearest neighbors, either to predict whether the target user will like a particular item or to recommend the top-N items to the target user [9]. User-based similarity calculations are performed row-wise on the rating matrix. Item-based collaborative filtering focuses on the neighborhood of similar items. The basic idea is that a user mostly purchases items similar to those s/he has already bought in the past [10]. In this approach, we can choose the criterion for the similarity metric; for example, item rating can be a metric. If we calculate similarity based on the content of the item, the approach becomes content based. Methods and surveys that have been performed on movie datasets are given in table 2.

IV. Discussion

From the work surveyed, it can be stated that content-based filtering techniques are good for data that has content as the main criterion, and when recommendation has to be performed on the same type of items as those previously bought by the target user. Collaborative filtering is good when we have a sufficient number of users whose preferences are similar to each other and a sufficient number of items rated by users. Basically, if item rating is the metric for recommendation and prediction, then collaborative filtering performs well. User-based collaborative filtering has limitations related to scalability. Compared to user-based filtering, item-based algorithms handle sparse data better and scale well. Their shortcoming is the cost of building the item-item matrix: similarity evaluation takes more computing time and resources.

When the data set is large, the computation may take hours. This problem can be solved by a cloud platform, e.g. Hadoop. Even though Hadoop can handle large amounts of data efficiently, the results may not be accurate. This is overcome by combining the user-based recommendation results with the item-based recommendations. But each approach has its own disadvantages, so by applying a clustering technique to the combined result we can get accurate recommendations.

V. Conclusion

This paper proposes a movie recommendation system using a combined collaborative filtering algorithm (user-based collaborative filtering using Pearson correlation similarity as well as item-based collaborative filtering) on the open-source


cloud environment Hadoop, using the Mahout library and HBase as the database. Hadoop can evaluate and generate large amounts of data. Mahout [13] is a machine learning library that supports collaborative filtering well. Hadoop uses the HDFS file system, which stores data as flat files. HDFS follows a write-once, read-many model; it does not support random reads and writes. To overcome these problems of HDFS, we use the column-based NoSQL database HBase, which stores data as key/value pairs. It provides low-latency access to small amounts of data within a large data set. Hence the system will provide more accurate results for a large movie data set as well.
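The user-based collaborative filtering with Pearson correlation similarity named in the Conclusion can be sketched as follows. This is a minimal in-memory illustration, not the Mahout or Hadoop implementation; the function names, the choice to keep only positively correlated neighbors, and the sample ratings are assumptions made for the example.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated; 0.0 if undefined."""
    common = [item for item in ratings_a if item in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def predict(ratings, target, item):
    """Predict target's rating of item: the target's mean rating plus a
    similarity-weighted average of each neighbor's deviation from its own mean."""
    target_mean = sum(ratings[target].values()) / len(ratings[target])
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == target or item not in other_ratings:
            continue
        sim = pearson(ratings[target], other_ratings)
        if sim <= 0:
            continue  # assumption: ignore uncorrelated or negatively correlated users
        other_mean = sum(other_ratings.values()) / len(other_ratings)
        num += sim * (other_ratings[item] - other_mean)
        den += sim
    return target_mean if den == 0 else target_mean + num / den

# Hypothetical ratings on a 1-5 scale, not taken from the paper.
ratings = {
    "alice": {"M1": 5, "M2": 3, "M3": 4},
    "bob": {"M1": 4, "M2": 2, "M3": 3, "M4": 4},
    "carol": {"M1": 1, "M2": 5, "M4": 2},
}
print(predict(ratings, "alice", "M4"))
```

In this sample, bob rates consistently one point below alice and so correlates perfectly with her, while carol's tastes are opposite and she is skipped; the prediction for alice therefore follows bob's above-average opinion of M4.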

TABLE 1: RELATED WORK IN NOISE AND SEGMENTATION

| Author | Method/Algorithm | Merit | De-merit |
|---|---|---|---|
| Christian Kohlschütter, Wolfgang Nejdl [13] | Densitometric approach for web page segmentation, based on token density in a text fragment | Focuses on low-level properties of the text; detects duplicate blocks and non-duplicate blocks also | Accuracy for detecting duplicate blocks is 61.7% |
| David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga [6] | Method aligns the DOM trees of web pages of a site in order to uncover their implicit structure | Useful for segmenting web sites which are data-intensive; segments the web pages and is also able to cluster the segments into classes | Less efficient for generating rules that can be used for string manipulation |
| Jan Zeleny, Radek Burget [7] | Method combines vision-based segmentation and a template-based clustering algorithm | Precise, fast | Accuracy |
| Erdinç Uzan, Hayri Volkan Agun, Tarik Terlikaya [14] | Hybrid approach for extracting informative content, in 2 steps: 1. Discover informative content using decision tree learning; 2. Extract rules obtained from step (1) | Effective for structured as well as semi-structured data | First stage of this approach is appropriate where time performance is not important |
| Fei Hu, Ming Li, Yi Nan Zhang, Tao Peng, Yang Lei [15] | Reduces noise in web pages based on word density | Works well with pages that do not meet the XML specification; less time consuming | When the topic content has a small number of words, the purification is not ideal |
| Zhao Cheng-Li, Yi Dong-Yun [16] | Eliminates noise by Style Tree Model (DOM tree based) | Adaptive, fast | Less efficient for generating rules that can be used for string manipulation |
| Shekhar Babu Boddu [17] | Determines spatial locality (vision based) | Efficient | Less efficient for generating rules that can be used for string manipulation |
| Hiroyuki Sano, Shun Shiramatsu, Tadachika Ozono [18] | Method comprises 3 steps: 1. Layout template detection; 2. Division into minimum blocks and detecting title blocks; 3. Combination into web content bits | Suitable for extracting title blocks for segmentation | Effective when high precision of title block and high recall of deciding non-title block |
| Renato Dominguez Garcia et al. [19] | Approach improves topic exploration in the blogosphere by detecting relevant segments | Efficient for getting relevant data | The automatic segmentation tool which has been used is not perfect |


TABLE 2: RELATED WORK IN MOVIE RECOMMENDATION ALGORITHM

| Author | Method | Merit | Demerit | Comment |
|---|---|---|---|---|
| Zhi-Dan Zhao, Ming-Sheng Shang [20] | User-based collaborative filtering (CF) on Hadoop | Scalable because Hadoop is a cloud platform | Cannot reduce recommendation response time for a single user | Hadoop has been used, which overcomes the scalability problem (big datasets can also be used) |
| Carlos E. Seminario, David C. Wilson [21] | CF using Mahout (user-based and item-based) | Prediction accuracy for both user-based and item-based has been improved | Mahout similarity weighting is not very effective as a weighting technique | Use of Mahout with collaborative filtering enhances the accuracy |
| Manos Papagelis, Dimitris Plexousakis [22] | User-based with implicit rating and explicit rating; item-based with implicit rating and explicit rating | Prediction based on explicit rating is better than implicit rating | Scalability problem for big data | Item-based prediction algorithm is better than user-based algorithm |
| Truong Khanh Quan, Ishikawa Fuyuki, Honiden Shinichi [23] | Clustering of items on stability of user similarity, then apply CF | Good prediction accuracy | Number of groups must be given; final clustering may be locally optimal | Clustering of items in a group performed better |
| Dhoha Almazro, Ghadeer Shahatah, et al. [24] | Cluster all items; demographic information of user; combine both user-based and item-based | Accuracy is 66.01% | Scalability for big data | Method has combined user-based and item-based |
| Hee Choon Lee, Seok Jun Lee, Young Jun Chung [25] | Neighborhood CF (NBCFA), Correspondence mean algorithm (CMA) | | | The preference prediction performance of CMA is better than NBCFA |
| Yanhong Guo, Xuefen Cheng, et al. [27] | CF based on trust factor; cosine correlation and Pearson correlation have been used | CF based on trust factor is better than traditional CF | Scalability problem for big data | Trust factor is based on the user who gives reviews for others |
| Kai Yu, Xiaowei Xu, Jianhua Tao, et al. [28] | Memory-based CF; TURF1, TURF2, TURF3, TURF4 | Reduces the storage requirement of training data | Scalability problem for big data | Select users with rational and novel profile |
| Dilek Tapucu, Seda Kasap, Fatih Tekbacak [29] | CF; Pearson correlation coefficient; Spearman correlation coefficient; Tanimoto coefficient; log-likelihood similarity; Euclidean distance similarity | | Scalability problem for big data; improving quality is a challenge | Time complexity is less for Pearson correlation coefficient |
| Mustansar Ali Ghazanfar, Adam Prügel-Bennett [30] | Method combines rating, feature and demographic information of item | Better prediction | Scalability problem for big data | Combining all factors provides better result |
| Badrul Sarwar, George Karypis, et al. [31] | Item-based CF | High quality recommendation | Scalability problem for big data | Item-based CF performs well for users who have rated fewer items |
| Alexandros Karatzoglou, Alex Smola, Markus Weimer [32] | CF with hashing using ε-insensitive loss function, Huber loss function | Model can be scaled to bigger dataset on large server and to still dataset on small machines | Time complexity | Hashing is used to bound the required memory; loss function is used to achieve a large marginal model |
| Yajie Hu, Ziqi Wang, Wei Wu, Jianzhong Guo, Ming Zhang [33] | Semantic distance measurement, considering the features of the movie; recommendation based on YAGO and IMDB | Able to give a list of recommended movies along with the stars of each movie | Does not take users' feedback into consideration | Semantic distance has been used |
| Xiao Yan Shi, Hong Wu Ye, et al. [34] | Combine user-based and item-based | Better than traditional CF | Scalability problem for big data | To overcome the scalability problem, cloud can be used |
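Table 2 names several similarity measures (cosine, Tanimoto coefficient, Euclidean distance). A minimal sketch of three of them over rating dictionaries is given below. Restricting cosine and Euclidean distance to co-rated items, and mapping distance to similarity with 1 / (1 + d), are common conventions assumed here for illustration; they are not necessarily the definitions used in the cited works.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors, restricted to co-rated items."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tanimoto_similarity(a, b):
    """Tanimoto (Jaccard) coefficient on the sets of rated items; rating values are ignored."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

def euclidean_similarity(a, b):
    """Map the Euclidean distance over co-rated items into (0, 1] via 1 / (1 + d)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    distance = sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + distance)

# Hypothetical rating profiles, not taken from the paper.
u = {"M1": 3, "M2": 4}
v = {"M1": 3, "M2": 4, "M3": 5}
print(cosine_similarity(u, v), tanimoto_similarity(u, v), euclidean_similarity(u, v))
```

The two profiles agree on every co-rated movie, so the cosine and Euclidean measures report full similarity, while the Tanimoto coefficient is lower because v has rated a movie that u has not.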

References

[1] Andrew S. Tanenbaum and Maarten van Steen, Distributed Systems: Principles and Paradigms, Pearson Prentice Hall, 2nd edition, May 2005.
[2] Jeffrey Dean and Sanjay Ghemawat, Map-Reduce: Simplified data processing on large clusters, in OSDI 2004.
[3] Chuck Lam, Hadoop in Action, Manning Publications, 2010.
[4] S. Yu, D. Cai, J.-R. Wen, W.-Y. Ma, Improving pseudo relevance feedback in web information retrieval using web page segmentation, in Proceedings of the 12th International Conference on World Wide Web (WWW '03), New York, pp. 11-13, ACM.
[5] Christian Kohlschütter and Wolfgang Nejdl, A densitometric approach to web page segmentation, ACM 2008.
[6] David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto and Edisson Braga, A site oriented method for segmenting web pages, SIGIR '11, Beijing, ACM 2011.
[7] Jan Zeleny and Radek Burget, Cluster based page segmentation: A fast and precise method for web page preprocessing, ACM 2013.
[8] J. B. Schafer, J. Konstan and J. Riedl, Recommender systems in e-commerce, 1st ACM Conference on Electronic Commerce, ACM Press, pp. 158-166, 1999.
[9] Sarwar B. and Karypis, Item based collaborative filtering algorithms, in 10th International World Wide Web Conference, pp. 285-295, 2001.
[10] Mukund Deshpande and George Karypis, Item based top-N recommendation algorithm, in ACM Transactions on Information Systems, volume 22, no. 1, pp. 143-177, 2004.
[11] Aristomenis S. Lampropoulos and George A. Tsihrintzis, A survey approach to designing recommendation system, Springer 2013.
[12] Loren Terveen and Will Hill, Beyond recommender systems: Helping people help each other, in HCI in the New Millennium, Jack Carroll, Addison Wesley, 2001.
[13] Christian Kohlschütter and Wolfgang Nejdl, A densitometric approach to web page segmentation, ACM 2008.
[14] Erdinç Uzan, Hayri Volkan Agun, Tarik Terlikaya, A hybrid approach for extracting informative content from web pages, Science Direct 2013.
[15] Fei Hu, Ming Li, Yi Nan Zhang, Tao Peng, Yang Lei, A non-template approach to purify web pages based on word density, Proceedings of International Conference on Information Engineering and Application (IEA) 2012, Springer 2013.
[16] Zhao Cheng-Li and Yi Dong-Yun, A method for eliminating noises in web pages by style tree model and its applications, Wuhan University Journal of Natural Sciences 2004, vol. 4, no. 5.
[17] Shekhar Babu Boddu, Eliminate the noisy data from web pages using data mining techniques, GESJ: Computer Science and Telecommunication 2013.
[18] Hiroyuki Sano, Shun Shiramatsu, Tadachika Ozono, A web page segmentation method based on page layouts and title blocks, IJCSNS International Journal of Computer Science and Network Security, vol. 11, no. 10, 2011.
[19] Renato Dominguez Garcia, Alexandru Berlea, Philipp Scholl, Doreen Böhnstedt, Christoph Rensing, Ralf Steinmetz, Improving topic exploration in the blogosphere by detecting relevant segmentation, Journal of Universal Computer Science 2009.
[20] Zhi-Dan Zhao and Ming-Sheng Shang, User based collaborative filtering recommendation algorithm on Hadoop, IEEE 2012.
[21] Carlos E. Seminario and David C. Wilson, Case study evaluation of Mahout as a recommender platform, presented in Workshop on Recommendation Utility Evaluation: Beyond RMSE, held in conjunction with ACM in Ireland, 2012.
[22] Manos Papagelis and Dimitris Plexousakis, Qualitative analysis of user based and item based prediction algorithms for recommendation agents, Science Direct 2005.
[23] Truong Khanh Quan, Ishikawa Fuyuki, Honiden Shinichi, Improving accuracy of recommendation system by clustering item based on stability of user similarity, IEEE 2006.
[24] Dhoha Almazro, Ghadeer Shahatah, Lamia Albbulkarim, Mona Kherees, Romy Martinez, William Nzoukou, A survey paper on recommendation system, ACM 2010.
[25] Hee Choon Lee, Seok Jun Lee, Young Jun Chung, A study on improved collaborative filtering algorithm for recommendation system, IEEE 2007.
[26] Wu Yueping and Zheng Jianguo, A research of recommendation algorithm based on cloud model, IEEE 2010.
[27] Yanhong Guo, Xuefen Cheng, Dahai Dong, Chunyu Luo, Rishuang Wang, An improved collaborative filtering algorithm based on trust in e-commerce recommendation system, IEEE 2010.
[28] Kai Yu, Xiaowei Xu, Jianhua Tao, Martin Ester, Hans-Peter Kriegel, Instance selection techniques for memory based collaborative filtering, SIAM.
[29] Dilek Tapucu, Seda Kasap, Fatih Tekbacak, Performance comparison of combined collaborative filtering algorithm for recommender system, IEEE 2012.
[30] Mustansar Ali Ghazanfar and Adam Prügel-Bennett, A scalable, accurate hybrid recommender system.
[31] Badrul Sarwar, George Karypis, Joseph Konstan, John Riedl, Item based collaborative filtering recommendation algorithms, WWW10, Hong Kong, ACM 2001.
[32] Alexandros Karatzoglou, Alex Smola, Markus Weimer, Collaborative filtering on a budget, in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics 2010, Italy.
[33] Yajie Hu, Ziqi Wang, Wei Wu, Jianzhong Guo, Ming Zhang, Recommendation for movies and stars using YAGO and IMDB, IEEE 2010.
[34] Xiao Yan Shi, Hong Wu Ye, Song Jie Gong, A personalized recommender integrating item based and user based collaborative filtering, IEEE 2008.

About Author(s):

Swati Pandey is pursuing M.Tech (CSE) at Amrita University, Coimbatore. She holds first rank in M.Tech. Her areas of interest are Machine Learning and Distributed Computing.

Dr. T. Senthil Kumar has around 12 years of teaching experience and 2 years of industry experience. His areas of interest include cloud computing, software engineering, video processing, wireless sensor networks, Dot Net programming, JIST simulator, and data mining. He is currently working as an Assistant Professor (Selection Grade) in the Computer Science and Engineering Department at Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore. He has publications in 10 national conferences, 6 international conferences and 6 international journals.