Submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology in
Computer Science and Engineering
Techno India
EM 4/1, Salt Lake, Sector V, Kolkata 700 091.
CERTIFICATE
This is to certify that the project report entitled Recommender System Using Apache Hadoop
prepared under my supervision by Ankan Banerjee (13000112064), Ankit (13000112065),
Ankit Gupta (13000112067) and Hemant Kumar Joshi (13000112101) of B.Tech. (Computer
Science & Engg.), Final Year, has been done according to the regulations of the Degree of
Bachelor of Technology in Computer Science & Engineering. The candidates have fulfilled the
requirements for the submission of the project report.
It is to be understood that the undersigned does not necessarily endorse any statement made,
opinion expressed or conclusion drawn therein, but approves the report only for the purpose for
which it has been submitted.
----------------------------------------------
(Signature of the External Examiner with Designation and Institute)
ACKNOWLEDGEMENT
We would like to express our sincere gratitude to Mr. Tapan Chowdhury of the Department of
Computer Science and Engineering, whose role as project guide was invaluable for the project.
We are extremely thankful for the keen interest he took in advising us, for the books and
reference materials provided, and for the moral support extended to us.
Last but not least, we convey our gratitude to all the teachers for providing us the technical skill that
will always remain our asset, and to all non-teaching staff for the gracious hospitality they offered us.
Ankan Banerjee
Ankit
Ankit Gupta
Contents
1. INTRODUCTION ............................................................................................................. 6
1.1 Abstract ........................................................................................................................ 6
1.2 Problem Domain .......................................................................................................... 6
1.3 Related Study ............................................................................................................... 7
1.4 Glossary ....................................................................................................................... 7
2. PROBLEM DEFINITION ................................................................................................. 9
2.1 Scope ........................................................................................................................... 9
2.2 Exclusions .................................................................................................................... 9
2.3 Assumption .................................................................................................................. 9
3. PROJECT PLANNING ................................................................................................... 10
3.1 Software Life Cycle Model......................................................................................... 10
3.2 Scheduling ................................................................................................................. 11
3.3 Cost Analysis ............................................................................................................. 13
4. REQUIREMENT ANALYSIS ......................................................................................... 14
4.1 Requirement Matrix ................................................................................................... 14
4.2 Requirement Elaboration ............................................................................................ 15
5. DESIGN .......................................................................................................................... 17
5.1 Technical Environment............................................................................................... 17
5.2 Hierarchy of modules ................................................................................................. 17
5.3 Detailed Design .......................................................................................................... 17
5.4 Test Plan .................................................................................................................... 26
6. IMPLEMENTATION ...................................................................................................... 28
6.1 Implementation Details............................................................................................... 28
6.2 System Installation Step ............................................................................................. 31
6.3 System Usage Instruction ........................................................................................... 33
7. CONCLUSION................................................................................................................ 34
7.1 Project Benefits .......................................................................................................... 34
7.2 Future Scope for improvements .................................................................................. 34
8. REFERENCES ................................................................................................................ 35
APPENDIX ......................................................................................................................... 37
A.1 Core-site.xml ............................................................................................................. 37
A.2 localhost:54310 ......................................................................................................... 37
A.3 Hdfs-site.xml ............................................................................................................. 38
A.4 U.data ........................................................................................................................ 38
A.5 MapReduce 1 ............................................................................................................ 39
A.6 MapReduce 2 ............................................................................................................ 39
A.7 MapReduce 3 ............................................................................................................ 40
List of Tables:
Table 1.1: Hardware Specification ....................................................................... 6
Table 6.1: Test Plan ........................................................................................... 27
List of Figures:
Figure 3.1: Iterative Waterfall Model ................................................................. 10
Figure 3.2: Gantt Chart ..................................................................................... 12
Figure 4.3: Cost Analysis .................................................................................. 13
Figure 5.1: Requirement Matrix ........................................................................ 14
Figure 5.1: Hierarchy of Modules ...................................................................... 17
Figure 5.2: Use Case Diagram .......................................................................... 17
Figure 5.3: Class Diagram ................................................................................ 18
Figure 6.4: Collaborative Filtering ..................................................................... 21
1. INTRODUCTION
1.1 Abstract
Recommender systems are a new generation of internet tools that help users navigate
the information on the internet and receive information related to their preferences. [1]
Although recommender systems are most often applied in online shopping and in
entertainment domains such as movies and music, their applicability is being researched
in other areas as well. This report presents an overview of the recommender systems
currently working in the domain of online movie recommendation. [2] The report also
proposes a new movie recommender system that combines a user's choices not only
with those of similar users but with those of other users as well, to give diverse
recommendations that change over time. The overall architecture of the proposed
system is presented. [3]
Processor:
Memory: 3 GB
Disk Space: 160 GB
Table 1.1: Hardware Specification
Recommender systems typically produce a list of recommendations in one of two ways:
through collaborative or content-based filtering. Collaborative filtering approaches
build a model from a user's past behavior (items previously purchased or selected
and/or numerical ratings given to those items) as well as similar decisions made by other
users. [2] This model is then used to predict items (or ratings for items) that the user may
have an interest in. Content-based filtering approaches utilize a series of discrete
characteristics of an item in order to recommend additional items with similar properties.
These approaches are often combined. [3][5]
Apache Hadoop is an open-source software framework written in Java for distributed
storage and distributed processing of very large data sets on computer clusters built from
commodity hardware. [11] All the modules in Hadoop are designed with a fundamental
assumption that hardware failures (of individual machines, or racks of machines) are
commonplace and thus should be automatically handled in software by the framework.
The core of Apache Hadoop consists of a storage part (the Hadoop Distributed File System
(HDFS)) and a processing part (MapReduce). [15] Hadoop splits files into large blocks
and distributes them amongst the nodes in the cluster. To process the data, Hadoop
MapReduce transfers packaged code for nodes to process in parallel, based on the data
each node needs to process. This approach takes advantage of data locality (nodes
manipulating the data that they have on hand) to allow the data to be processed faster
and more efficiently than it would be in a more conventional supercomputer architecture
that relies on a parallel file system where computation and data are connected via high-speed
networking. [13][15]
1.4 Glossary
Apache Hadoop: Apache Hadoop is an open-source software framework written in Java for
distributed storage and distributed processing of very large data sets on computer clusters built
from commodity hardware.
NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree
of all files in the file system, and tracks where across the cluster the file data is kept.
DataNode: A DataNode stores data in the HDFS. A functional filesystem has more than one
DataNode, with data replicated across them.
JobTracker: The JobTracker is the service within Hadoop that farms out MapReduce tasks to
specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
TaskTracker: A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle
operations - from a JobTracker.
Replication Factor: The replication factor specifies how many copies of each data block are
stored on different nodes; in this way we can achieve high fault tolerance and high availability.
HDFS: The Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop
project. This Apache Software Foundation project is designed to provide a fault-tolerant file
system designed to run on commodity hardware.
2. PROBLEM DEFINITION
2.1 Scope
Recommender systems are widespread tools that are employed by a wide range of
organizations and companies for recommending items such as movies, books and even
employees for projects. But with the advent of big data it has become difficult to process
the large amount of data for recommendations. Due to this reason, Apache Hadoop is
employed for scalability, reliability and faster processing. [1][5]
Recommender systems (sometimes replacing "system" with a synonym such as platform
or engine) are a subclass of information filtering systems that seek to predict the 'rating' or
'preference' that a user would give to an item. Recommender systems have become
extremely common in recent years and are applied in a variety of applications.
2.2 Exclusions
Big Data Collection Interface and modification of user data (ratings).
2.3 Assumption
Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of
hundreds or thousands of server machines, each storing part of the file system's data. The
fact that there are a large number of components, and that each component has a
non-trivial probability of failure, means that some component of HDFS is always
non-functional. Therefore, detection of faults and quick, automatic recovery from them is a
core architectural goal of HDFS. [5]
Large Data Set
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes
to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high
aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should
support tens of millions of files in a single instance.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once
created, written, and closed need not be changed. This assumption simplifies data
coherency issues and enables high throughput data access. A MapReduce application or a
web crawler application fits perfectly with this model. There is a plan to support
appending-writes to files in the future.
3. PROJECT PLANNING
3.1 Software Life Cycle Model
Iterative Waterfall Model
The problems with the Waterfall Model created a demand for a new method of
developing systems which could provide faster results, require less up-front information
and offer greater flexibility. In the iterative model, the project is divided into small parts.
This allows the development team to demonstrate results earlier in the process and obtain
valuable feedback from system users. Each iteration is effectively a mini-Waterfall process,
with the feedback from one phase providing vital information for the design of the next
phase.
Figure 3.1: Iterative Waterfall Model
3.2 Scheduling
4. REQUIREMENT ANALYSIS
4.1 Requirement Matrix
5. DESIGN
5.1 Technical Environment
The recommender system is deployed on a two-node cluster. The Java runtime on each
node is set up to use the common Hadoop configuration, as specified by the NameNode
(master node) of the cluster.
5.3.2.1 Architecture
HDFS is based on a master/slave architecture. An HDFS cluster consists of a single
NameNode (the master) and a number of DataNodes (the slaves). The NameNode and
DataNodes are pieces of software designed to run on commodity machines. These
machines typically run a GNU/Linux operating system (OS). The use of the highly
portable Java language means that HDFS can be deployed on a wide range of machines.
A typical deployment has a dedicated machine that runs only the NameNode software.
Each of the other machines in the cluster runs one instance of the DataNode software.
The architecture does not preclude running multiple DataNodes on the same machine, but
in a real deployment one machine usually runs one DataNode. The existence of a single
NameNode in a cluster greatly simplifies the architecture of the system. The NameNode
is the arbitrator and repository for all HDFS metadata. The system is designed in such a
way that user data never flows through the NameNode.
5.3.2.2 NameNode
The NameNode manages the filesystem namespace: the metadata for all the files and
directories in the tree. Each file is divided into large blocks (typically 128 megabytes,
but selectable file-by-file), and each block is independently replicated at multiple DataNodes
(typically three, but again selectable file-by-file) to provide reliability. The NameNode
maintains and stores the namespace tree and the mapping of file blocks to DataNodes
persistently on the local disk in the form of two files: the namespace image and the edit
log. The NameNode also knows the DataNodes on which all the blocks for a given file
are located. However, it does not store block locations persistently, since this information
is reconstructed from the DataNodes when the system starts. [14]
If the NameNode fails, the filesystem becomes inaccessible, because only the NameNode
knows how to reconstruct the files from the blocks on the DataNodes. For this reason it
is important to make the NameNode resilient to failure, and Hadoop provides two
mechanisms for this: the Checkpoint Node and the Backup Node.
5.3.2.3 Writing to a File
To write to a file, the HDFS client first creates an empty file without any blocks. File
creation is only possible when the client has write permission and the file does not
already exist in the system. The NameNode records the new file's creation and allocates
data blocks to a list of suitable DataNodes to host replicas of the first block of the file.
The DataNodes hosting the replicas form a pipeline. When the first block is filled, new
DataNodes are requested to host replicas of the next block. A new pipeline is organized,
and the client sends the further bytes of the file. The choice of DataNodes is likely to be
different for each block.
If a DataNode in the pipeline fails while the data is being written, the pipeline is first
closed, the partial block on the failed DataNode is deleted, and the failed DataNode is
removed from the pipeline. New DataNodes are then chosen to write the remaining
blocks of data.
5.3.2.4 Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It
stores each file as a sequence of blocks; all blocks in a file except the last block are the
same size. The blocks of a file are replicated for fault tolerance. The block size and
replication factor are configurable per file. An application can specify the number of
replicas of a file. The replication factor can be specified at file creation time and can be
changed later. Files in HDFS are write-once and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically
receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport
contains a list of all blocks on a DataNode.
5.3.2.5 Replica Placement
Large HDFS instances run on a cluster of computers that commonly spread across many
racks. Communication between two nodes in different racks has to go through switches.
In most cases, network bandwidth between machines in the same rack is greater than
network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined
in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on
unique racks. This prevents losing data when an entire rack fails and allows use of
bandwidth from multiple racks when reading data. This policy evenly distributes replicas
in the cluster which makes it easy to balance load on component failure. However, this
policy increases the cost of writes because a write needs to transfer blocks to multiple
racks.
For the common case, when the replication factor is three, HDFS's placement policy is to
put one replica on one node in the local rack, another on a node in a different (remote)
rack, and the last on a different node in the same remote rack. This policy cuts the
inter-rack write traffic, which generally improves write performance. The chance of rack
failure is far less than that of node failure, so this policy does not impact data reliability
and availability guarantees. However, it does reduce the aggregate network bandwidth used
when reading data, since a block is placed in only two unique racks rather than three.
With this policy, the replicas of a file do not evenly distribute across the racks. One third
of replicas are on one node, two thirds of replicas are on one rack, and the other third are
evenly distributed across the remaining racks. This policy improves write performance
without compromising data reliability or read performance.
5.3.2.6 Data Blocks
HDFS is designed to support very large files. Applications that are compatible with
HDFS are those that deal with large data sets. These applications write their data only
once but they read it one or more times and require these reads to be satisfied at
streaming speeds. HDFS supports write-once-read-many semantics on files. A typical
block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB
chunks, and if possible, each chunk will reside on a different DataNode.
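The block-and-replica arithmetic described above can be sketched as follows. This is an illustrative calculation only (the class name, file size and replication factor are assumptions, not values from the project's data set):

```java
// Sketch: how many 64 MB blocks a file occupies in HDFS, and how many
// stored replicas result with a replication factor of 3.
public class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // the 64 MB block size cited above

    // Number of blocks: ceiling of fileSize / blockSize (the last block may be smaller).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;  // an illustrative 200 MB file
        long blocks = blockCount(fileSize);  // 4 blocks: 64 + 64 + 64 + 8 MB
        long storedCopies = blocks * 3;      // with replication factor 3
        System.out.println(blocks + " blocks, " + storedCopies + " stored block replicas");
    }
}
```

Each of the four blocks is placed on a different DataNode where possible, so reads of a large file can be served by several nodes in parallel.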
5.3.3 Approach
5.3.3.1 Collaborative Filtering
Item-based collaborative filtering is a model-based algorithm for making
recommendations. In the algorithm, the similarities between different items in the dataset
are calculated by using one of a number of similarity measures, and then these similarity
values are used to predict ratings for user-item pairs not present in the dataset.
Similarities between items
The similarity values between items are measured by observing all the users who have
rated both the items. As shown in the diagram below, the similarity between two items is
dependent upon the ratings given to the items by users who have rated both of them:
Figure 6.4: Collaborative Filtering
Similarity measures
There are a number of different mathematical formulations that can be used to calculate
the similarity between two items. As can be seen in the formulae below, each formula
includes terms summed over the set of common users U.
Cosine-based similarity
Also known as vector-based similarity, this formulation views two items and their ratings
as vectors, and defines the similarity between them as the cosine of the angle between
these vectors:
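In symbols, sim(a, b) = (a . b) / (||a|| ||b||), summed over the set of common users U. A minimal sketch of this measure follows; it is not the project's actual code, and the class, method and variable names are illustrative:

```java
// Sketch of cosine-based similarity between two items, computed over the
// ratings of the users who have rated both items.
public class CosineSimilarity {
    // sim(a, b) = (a . b) / (||a|| * ||b||): the cosine of the angle
    // between the two rating vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int u = 0; u < a.length; u++) { // u ranges over the common users U
            dot += a[u] * b[u];
            normA += a[u] * a[u];
            normB += b[u] * b[u];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] itemA = {5, 3, 4};  // ratings by three common users
        double[] itemB = {4, 3, 5};
        System.out.printf("%.4f%n", cosine(itemA, itemB)); // 0.9800
    }
}
```

The adjusted cosine variant used in the implementation additionally subtracts each user's mean rating before computing the products, to correct for users who rate systematically high or low.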
Our implementation
We implemented item-based collaborative filtering using these parameters:
- Similarity measure: adjusted cosine-based similarity
- Minimum number of common users for each item-item pair: 5
- Number of similar items stored per item: 50
Challenges
We tried item-based collaborative filtering on the MovieLens dataset, but as the results
page shows, it did not perform very well in testing. In particular, we isolated two main
problems, both mainly due to the sparsity of the data:
The second challenge arose when we used the weighted sum to calculate the rating
for test user-movie pairs. Since we were storing only 50 similar movies for
each movie, and for each target movie we only consider the similar movies
that the active user has seen, it was often the case with the MovieLens dataset
that there were not many such movies for many of the users. This resulted in
bad predictions overall for large test sets. Because this was due to the sparsity
of the dataset itself, we could not come up with a straightforward solution to this
problem.
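The weighted-sum prediction mentioned above can be sketched as follows. This is an illustrative version, not the project's actual code, and all names and values are assumptions:

```java
// Sketch of weighted-sum rating prediction: the predicted rating for a target
// movie is the similarity-weighted average of the active user's ratings of the
// similar movies they have actually seen.
public class WeightedSum {
    // similarities[i] = similarity of the i-th seen movie to the target movie;
    // ratings[i] = the user's rating of that movie.
    static double predict(double[] similarities, double[] ratings) {
        double num = 0, den = 0;
        for (int i = 0; i < similarities.length; i++) {
            num += similarities[i] * ratings[i];
            den += Math.abs(similarities[i]);
        }
        // With few or no co-rated similar movies, den is small or zero,
        // which is exactly the sparsity problem described above.
        return den == 0 ? Double.NaN : num / den;
    }

    public static void main(String[] args) {
        double[] sims = {0.9, 0.7, 0.4};  // similarities to the target movie
        double[] rates = {5, 4, 2};       // the user's ratings of those movies
        System.out.printf("%.3f%n", predict(sims, rates)); // 4.050
    }
}
```

When the user has seen none of the stored similar movies, the denominator is zero and no meaningful prediction can be made, which is why sparse data led to bad predictions on large test sets.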
5.3.4.1 Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records.
The transformed intermediate records do not need to be of the same type as the input
records. A given input pair may map to zero or many output pairs.
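The Mapper contract above can be sketched without any Hadoop dependency; the classic word-count map step is used as the example (the class and method names are illustrative, not Hadoop's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hadoop-free sketch of a map step: one input record (a line of text) is
// transformed into zero or more intermediate (key, value) pairs -- here,
// a (word, 1) pair for each word, as in word count.
public class MiniMap {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1)); // emit (word, 1)
        }
        return out; // an empty line maps to zero output pairs
    }

    public static void main(String[] args) {
        System.out.println(map("hadoop hdfs hadoop")); // [hadoop=1, hdfs=1, hadoop=1]
    }
}
```

Note that the intermediate pairs need not resemble the input record at all: here a text line becomes string/integer pairs.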
5.3.4.2 Reducer
The Reducer reduces a set of intermediate values which share a key to a smaller set of
values. The number of reduce tasks for the job is set by the user via
JobConf.setNumReduceTasks(int).
Overall, Reducer implementations are passed the JobConf for the job via the
JobConfigurable.configure(JobConf) method and can override it to initialize
themselves. The framework then calls the reduce(WritableComparable, Iterator,
OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped
inputs. Applications can then override the Closeable.close() method to perform any
required cleanup.
The Reducer has three primary phases: shuffle, sort and reduce.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework
fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have
output the same key) in this stage. The shuffle and sort phases occur simultaneously;
while map-outputs are being fetched they are merged.
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter)
method is called for each <key, (list of values)> pair in the grouped inputs.
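The group-by-key and reduce behaviour described above can be simulated in plain Java (a sketch only; real Hadoop performs the grouping across many machines, and the names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of the sort and reduce phases: intermediate (key, value)
// pairs are grouped by key, then a reduce function folds each value list to a
// single output -- here, summing counts as in word count.
public class MiniReduce {
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> mapOutput) {
        // TreeMap groups equal keys and keeps them sorted, mimicking the
        // framework's sort phase.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // Reduce phase: sum the value list for each key.
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, vs) -> result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("hadoop", 1), Map.entry("hdfs", 1), Map.entry("hadoop", 1));
        System.out.println(reduce(pairs)); // {hadoop=2, hdfs=1}
    }
}
```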
5.3.4.3 Partitioner
Partitioner partitions the key space.
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The
key (or a subset of the key) is used to derive the partition, typically by a hash function.
The total number of partitions is the same as the number of reduce tasks for the job.
Hence this controls which of the m reduce tasks the intermediate key (and hence the
record) is sent to for reduction.
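The hash-based partitioning described above can be sketched as follows; this mirrors the behaviour of Hadoop's default HashPartitioner but is written without any Hadoop dependency, with illustrative names:

```java
// Sketch of default hash partitioning: the partition for a key is derived from
// its hash code, modulo the number of reduce tasks.
public class MiniPartitioner {
    static int getPartition(String key, int numReduceTasks) {
        // Mask off the sign bit so a negative hashCode cannot yield a
        // negative partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[]{"user", "item", "rating"}) {
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
    }
}
```

Because the partition depends only on the key, every occurrence of a given intermediate key is sent to the same reduce task, which is what makes the grouping in the sort phase possible.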
5.4 Test Plan
Table 6.1: Test Plan
T-CLU1.1 -- Cluster node: check the network. Successful: node up and running.
Unsuccessful: node unresponsive. (Ref. A.3)
T-HDFS1.1 -- Storage of data in HDFS with replication factor 1: test with input data;
DataNode checked. Successful: a single copy of the data stored on the closest empty
DataNode. Unsuccessful: HDFS write error on the GUI. (Ref. A.1, A.2)
T-HDFS1.2 -- Updating data: update user data on the NameNode. Successful: file
replaced with the updated user data. Unsuccessful: data update fails. (Ref. A.4)
Default replication: with the default replication factor, three copies of the data are
stored in accordance with rack awareness and the replication factor is maintained;
when the DataNodes are full with data, an alert for adding new DataNodes is raised
(scalability feature).
T-MR1.1 -- MapReduce package (number of words): test with data sets; program output
in key-value-pair form. Successful: data sets tested and key/value pairs created.
Unsuccessful: output format/result error. (Ref. A.5)
T-MR1.2 -- MapReduce for correlation: test with the data set. Successful: correlation
produced within -1 to 1. Unsuccessful: correlation value undefined (NaN) or output
format/result error. (Ref. A.6)
T-MR1.3 -- MapReduce to sort the data set and format the output: test with the data
set. Successful: sorted and desired output produced, in sorted order. Unsuccessful:
output format/result error. (Ref. A.7)
T-ALG -- Algorithm test (pen and paper calculation): pick a number of random users
and items and try to predict their ratings using the algorithm, based on the correlation
value; calculate the RMSE between the prediction and the actual rating. The lower the
value, the better the result.
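The RMSE check used in the algorithm test above can be sketched as follows; the values shown are illustrative, not results from the MovieLens runs:

```java
// Sketch of root-mean-square error between predicted and actual ratings:
// RMSE = sqrt( mean( (predicted_i - actual_i)^2 ) ). Lower is better.
public class Rmse {
    static double rmse(double[] predicted, double[] actual) {
        double sumSq = 0;
        for (int i = 0; i < predicted.length; i++) {
            double err = predicted[i] - actual[i];
            sumSq += err * err; // squared error for this user-item pair
        }
        return Math.sqrt(sumSq / predicted.length);
    }

    public static void main(String[] args) {
        double[] predicted = {4.05, 2.9, 3.5}; // illustrative predictions
        double[] actual = {4.0, 3.0, 4.0};     // illustrative actual ratings
        System.out.printf("RMSE = %.4f%n", rmse(predicted, actual));
    }
}
```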
6. IMPLEMENTATION
6.1 Implementation Details
6.1.1 Cluster Configuration
Installing a Hadoop cluster typically involves unpacking the software on all the
machines in the cluster, or installing it via a packaging system appropriate for your
operating system. It is important to divide the hardware up by function.
Typically one machine in the cluster is designated as the NameNode and another
machine as the ResourceManager, exclusively. These are the masters. Other services
(such as the Web App Proxy Server and the MapReduce Job History server) are usually
run either on dedicated hardware or on shared infrastructure, depending upon the load.
The rest of the machines in the cluster act as both DataNode and NodeManager. These
are the slaves.
Administrators should use the etc/hadoop/hadoop-env.sh and optionally the
etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts to do site-specific
customization of the Hadoop daemons' process environment.
To start a Hadoop cluster you will need to start both the HDFS and YARN cluster.
6.1.2.2 Determination of Block Size of the Data Set & Block Replication Factor
Each HDFS file is chopped up into 64 MB chunks, and each chunk is stored as three
replicas (replication factor 3); where possible, each replica resides on a different
DataNode.
Setting Up Hadoop
You can set Hadoop environment variables by appending the following commands to
~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Now apply all the changes into the current running system.
$ source ~/.bashrc
Hadoop Configuration
You can find all the Hadoop configuration files in the location
$HADOOP_HOME/etc/hadoop. It is required to make changes in those configuration files
according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in Java, you have to reset the Java environment
variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of
Java on your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop instance,
memory allocated for the file system, memory limit for storing the data, and size of
Read/Write buffers.
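For reference, a minimal core-site.xml of the kind listed in Appendix A.1 might look as follows. The port 54310 matches the NameNode address shown in Appendix A.2; the rest of the file is a sketch, and any additional properties your cluster needs must be added to it:

```xml
<configuration>
  <!-- Address and port of the HDFS NameNode (Hadoop 1.x-style property). -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
```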
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of the replication data, the
namenode path, and the datanode paths of your local file systems, that is, the place where
you want to store the Hadoop infrastructure.
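A sketch of such an hdfs-site.xml follows (compare Appendix A.3). The local paths are illustrative assumptions, and the replication value must match the replication factor chosen for your cluster:

```xml
<configuration>
  <!-- Number of replicas kept for each block. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Illustrative local paths for NameNode and DataNode storage. -->
  <property>
    <name>dfs.name.dir</name>
    <value>file:///usr/local/hadoop/hadoopinfra/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///usr/local/hadoop/hadoopinfra/hdfs/datanode</value>
  </property>
</configuration>
```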
7. CONCLUSION
7.1 Project Benefits
Recommender systems are a powerful new technology for extracting additional value
for a business from its user databases. These systems help users find items they want
to buy from a business. Recommender systems benefit users by enabling them to find
items they like. Conversely, they help the business by generating more sales.
Recommender systems are rapidly becoming a crucial tool in E-commerce on the
Web. Recommender systems are being stressed by the huge volume of user data in
existing corporate databases, and will be stressed even more by the increasing volume
of user data available on the Web. New technologies are needed that can dramatically
improve the scalability of recommender systems.
In our project we have used collaborative filtering, which promises to process large
data sets and at the same time produce high-quality recommendations.
7.2 Future Scope for improvements
The system implemented in the project uses static data to recommend movies to the
users. To incorporate dynamic data, distributed databases such as HBase or Cassandra
can be used, which can be regularly updated to add new users and ratings. To turn the
system into a web application, the data needs to be accessible in real time; the solution
to this, too, can be the use of a distributed database.
The recommender system can be improved by combining user-based collaborative
filtering and content-based filtering with the current system. This combination is also
called hybrid filtering, and it can significantly improve performance.
The comparison made between the different similarity metrics was based on the run
time and not on the precision of the recommendations.
8. REFERENCES
[1] A. Felfernig, G. Friedrich and L. Schmidt-Thieme, "Recommender Systems," IEEE
Intelligent Systems, pp. 18-21, 2007.
[2] P. Resnick, N. Iacovou, M. Suchak, and J. Riedl, "GroupLens: An Open Architecture
for Collaborative Filtering of Netnews," in Proceedings of CSCW '94, Chapel Hill,
NC, 1994.
[3] U. Shardanand and P. Maes, "Social Information Filtering: Algorithms for Automating
'Word of Mouth'," in Proceedings of CHI '95, Denver, 1995.
[4] D. Asanov, "Algorithms and Methods in Recommender Systems," Berlin Institute of
Technology, Berlin, 2011.
[5] B. Sarwar, G. Karypis and J. Riedl, "Item-Based Collaborative Filtering
Recommendation Algorithms," IW3C2, Hong Kong, China, 2001 [Online]. Available:
http://wwwconference.org/www10/cdrom/papers/519/index.html
[6] F. Ricci, L. Rokach and B. Shapira, "Introduction," in Recommender Systems
Handbook, New York: Springer Science+Business Media, 2011, ch. 1, sec. 1.4,
pp. 10-14.
[7] D. Borthakur, "HDFS Architecture Guide," Hadoop Apache Project, 2008 [Online].
Available: http://hadoop.apache.org/common/docs/current/hdfsdesign.pdf
[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large
Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[9] F. Zhang, H. Liu and J. Chao, "A Two-stage Recommendation Algorithm Based on
K-means Clustering in Mobile E-commerce," Journal of Computational Information
Systems, vol. 6, no. 10, pp. 3327-3334, 2010.
[10] T.-H. Kim, Y.-S. Ryu, S.-I. Park, and S.-B. Yang, "An Improved Recommendation
Algorithm in Collaborative Filtering,"
APPENDIX
A.1 Core-site.xml
A.2 localhost:54310
A.3 Hdfs-site.xml
A.4 U.data
A.5 MapReduce 1
A.6 MapReduce 2
A.7 MapReduce 3