
RECOMMENDER SYSTEM

USING APACHE HADOOP


Submitted by

ANKAN BANERJEE (13000112064)


ANKIT (13000112065)
ANKIT GUPTA (13000112067)
HEMANT KUMAR JOSHI (13000112101)

Under the guidance of Mr. TAPAN CHOWDHURY.

Submitted for the partial fulfillment for the degree of Bachelor of Technology in
Computer Science and Engineering

Techno India
EM 4/1, Salt Lake, Sector V, Kolkata 700 091.

CERTIFICATE
This is to certify that the project report entitled Recommender System Using Apache Hadoop
prepared under my supervision by Ankan Banerjee (13000112064), Ankit (13000112065),
Ankit Gupta (13000112067) and Hemant Kumar Joshi (13000112101) of B.Tech. (Computer
Science & Engg.), Final Year, has been done according to the regulations of the Degree of
Bachelor of Technology in Computer Science & Engineering. The candidates have fulfilled the
requirements for the submission of the project report.
It is to be understood that, the undersigned does not necessarily endorse any statement made,
opinion expressed or conclusion drawn thereof, but approves the report only for the purpose for
which it has been submitted.

----------------------------------------------
Mr. Tapan Chowdhury


Asst. Professor
Computer Science and Engineering
Techno India Salt Lake

----------------------------------------------
Prof. (Dr.) C. K. Bhattacharyya


Head of Department
Computer Science and Engineering
Techno India Salt Lake

----------------------------------------------
(Signature of the External Examiner with Designation and Institute)

--------------------------------------------------------------------------------------------
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
--------------------------------------------------------------------------------------------

ACKNOWLEDGEMENT
We would like to express our sincere gratitude to Mr. Tapan Chowdhury of the Department of
Computer Science and Engineering, whose role as project guide was invaluable to the project.
We are extremely thankful for the keen interest he took in advising us, for the books and
reference materials provided, and for the moral support extended to us.
Last but not the least, we convey our gratitude to all the teachers for providing us the technical skills
that will always remain our asset, and to all non-teaching staff for the gracious hospitality they offered us.

Place: Techno India, Salt Lake


Date: 12th May, 2016

Ankan Banerjee

Ankit

Ankit Gupta

Hemant Kumar Joshi

Contents
1. INTRODUCTION ............................................................................................................. 6
1.1 Abstract ........................................................................................................................ 6
1.2 Problem Domain .......................................................................................................... 6
1.3 Related Study ............................................................................................................... 7
1.4 Glossary ....................................................................................................................... 7
2. PROBLEM DEFINITION ................................................................................................. 9
2.1 Scope ........................................................................................................................... 9
2.2 Exclusions .................................................................................................................... 9
2.3 Assumption .................................................................................................................. 9
3. PROJECT PLANNING ................................................................................................... 10
3.1 Software Life Cycle Model......................................................................................... 10
3.2 Scheduling ................................................................................................................. 11
3.3 Cost Analysis ............................................................................................................. 13
4. REQUIREMENT ANALYSIS ......................................................................................... 14
4.1 Requirement Matrix ................................................................................................... 14
4.2 Requirement Elaboration ............................................................................................ 15
5. DESIGN .......................................................................................................................... 17
5.1 Technical Environment............................................................................................... 17
5.2 Hierarchy of modules ................................................................................................. 17
5.3 Detailed Design .......................................................................................................... 17
5.4 Test Plan .................................................................................................................... 26
6. IMPLEMENTATION ...................................................................................................... 28
6.1 Implementation Details............................................................................................... 28
6.2 System Installation Step ............................................................................................. 31
6.3 System Usage Instruction ........................................................................................... 33
7. CONCLUSION................................................................................................................ 34
7.1 Project Benefits .......................................................................................................... 34
7.2 Future Scope for improvements .................................................................................. 34
8. REFERENCES ................................................................................................................ 35
APPENDIX ......................................................................................................................... 37
A.1 Core-site.xml ............................................................................................................. 37
A.2 localhost:54310 ......................................................................................................... 37
A.3 Hdfs-site.xml ............................................................................................................. 38
A.4 U.data ........................................................................................................................ 38
A.5 MapReduce 1 ............................................................................................................ 39
A.6 MapReduce 2 ............................................................................................................ 39
A.7 MapReduce 3 ............................................................................................................ 40

List of Tables:
Table 1.1: Hardware Specification .......................................................... 6
Table 5.1: Test Plan ....................................................................... 27

List of Figures:
Figure 3.1: Iterative Waterfall Model ..................................................... 10
Figure 3.2: Gantt Chart ................................................................... 12
Figure 3.3: Cost Analysis ................................................................. 13
Figure 4.1: Requirement Matrix ............................................................ 14
Figure 5.1: Hierarchy of Modules .......................................................... 17
Figure 5.2: Use Case Diagram .............................................................. 17
Figure 5.3: Class Diagram ................................................................. 18
Figure 5.4: Collaborative Filtering ....................................................... 21

1. INTRODUCTION
1.1 Abstract
Recommender systems are new-generation internet tools that help users navigate
through information on the internet and receive information related to their
preferences. [1] Although recommender systems are most often applied in
online shopping and in entertainment domains such as movies and music, their
applicability is being researched in other areas as well. This report presents an
overview of the recommender systems currently working in the domain of
online movie recommendation. [2] The report also proposes a new movie
recommender system that combines a user's choices not only with similar users but
with other users as well, to give diverse recommendations that change over time. The
overall architecture of the proposed system is presented. [3]

1.2 Problem Domain


1.2.1 Software and Language Versions
Hadoop 1.2.1
Java 1.6
1.2.2 Hardware Specification of each Hadoop Node
Hadoop clusters have identical hardware specifications for all the cluster nodes.
Table 1.1 lists the specification of each node.
Operating System    Ubuntu 12.04 LTS (64-bit)
Processor           Intel Core i3 (Quad Core)
Memory              3 GB
Disk Space          160 GB

Table 1.1: Hardware Specification

1.2.3 Business Domain


Traditional recommender systems suggest items belonging to a single domain, that is,
movies on Netflix, songs on Last.fm, etc. [4] This is not perceived as a limitation, but
as a focus on a certain market. Recommender systems may be used primarily as
systems that suggest appropriate actions to satisfy a user's needs.

1.3 Related Study

Recommender systems typically produce a list of recommendations in one of two ways: through
collaborative or content-based filtering. Collaborative filtering approaches build a model from a
user's past behaviour (items previously purchased or selected and/or numerical ratings given to
those items) as well as from similar decisions made by other users. [2] This model is then used to
predict items (or ratings for items) that the user may have an interest in. Content-based filtering
approaches utilize a series of discrete characteristics of an item in order to recommend additional
items with similar properties. These approaches are often combined. [3][5]
Apache Hadoop is an open-source software framework written in Java for distributed storage and
distributed processing of very large data sets on computer clusters built from commodity
hardware. [11] All the modules in Hadoop are designed with the fundamental assumption that
hardware failures (of individual machines, or racks of machines) are commonplace and thus
should be automatically handled in software by the framework. The core of Apache Hadoop
consists of a storage part (the Hadoop Distributed File System, HDFS) and a processing part
(MapReduce). [15] Hadoop splits files into large blocks and distributes them amongst the nodes
in the cluster. To process the data, Hadoop MapReduce transfers packaged code to the nodes for
them to process in parallel, based on the data each node needs to process. This approach takes
advantage of data locality (nodes manipulating the data they have on hand) to allow the data to
be processed faster and more efficiently than in a more conventional supercomputer architecture
that relies on a parallel file system where computation and data are connected via high-speed
networking. [13][15]

1.4 Glossary
Apache Hadoop: Apache Hadoop is an open-source software framework written in Java for
distributed storage and distributed processing of very large data sets on computer clusters built
from commodity hardware.

NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree
of all files in the file system, and tracks where across the cluster the file data is kept.
DataNode: A DataNode stores data in the HDFS. A functional filesystem has more than one
DataNode, with data replicated across them.
JobTracker: The JobTracker is the service within Hadoop that farms out MapReduce tasks to
specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
TaskTracker: A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle
operations - from a JobTracker.
Replication Factor: The replication factor determines how many copies of each data block are
kept on different nodes; in this way high fault tolerance and high availability can be achieved.
HDFS: The Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop
project. This Apache Software Foundation project is designed to provide a fault-tolerant file
system designed to run on commodity hardware.

2. PROBLEM DEFINITION
2.1 Scope
Recommender systems are widespread tools that are employed by a wide range of
organizations and companies for recommending items such as movies, books and even
employees for projects. But with the advent of big data it has become difficult to process
the large amount of data for recommendations. Due to this reason, Apache Hadoop is
employed for scalability, reliability and faster processing. [1][5]
Recommender systems (sometimes replacing "system" with a synonym such as platform
or engine) are a subclass of information filtering systems that seek to predict the 'rating' or
'preference' that a user would give to an item. Recommender systems have become
extremely common in recent years and are applied in a variety of applications.
2.2 Exclusions
Big Data Collection Interface and modification of user data (ratings).
2.3 Assumption
Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of
hundreds or thousands of server machines, each storing part of the file system's data. The
fact that there are a large number of components and that each component has a non-trivial
probability of failure means that some component of HDFS is always non-functional.
Therefore, detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS. [5]
Large Data Set
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes
to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high
aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should
support tens of millions of files in a single instance.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once
created, written, and closed need not be changed. This assumption simplifies data
coherency issues and enables high throughput data access. A MapReduce application or a
web crawler application fits perfectly with this model. There is a plan to support
appending-writes to files in the future.

3. PROJECT PLANNING
3.1 Software Life Cycle Model
Iterative Waterfall Model
The problems with the Waterfall Model created a demand for a new method of
developing systems which could provide faster results, require less up-front information
and offer greater flexibility. In the iterative model, the project is divided into small parts. This
allows the development team to demonstrate results earlier in the process and obtain
valuable feedback from system users. Each iteration is actually a mini-Waterfall process,
with the feedback from one phase providing vital information for the design of the next
phase.

Figure 3.1: Iterative Waterfall Model


3.2 Scheduling

Figure 3.2.1: Gantt Chart



Figure 3.2.2: Gantt Chart



3.3 Cost Analysis


Here we have assumed that the standard rate of each resource person is $2/hour. Under this
assumption the total cost of the project comes to $14,820.00, which corresponds to roughly
7,410 person-hours across the team. The budget and cost analysis is given below.

Figure 3.3: Cost Analysis


4. REQUIREMENT ANALYSIS
4.1 Requirement Matrix

Figure 4.1: Requirement Matrix



4.2 Requirement Elaboration


4.2.1 Cluster Configuration
The project has to be implemented on huge sets of data, namely Big Data. Hence, an
HDFS cluster of five computers has been configured as per the project requirements. One
computer has been configured as the NameNode and the remaining four as DataNodes.
4.2.1.1 Rack Awareness Implementation
As per the project requirement, three racks of computers have been set up: two racks
have two computers each, and one has a single computer, the NameNode. The replication
factor has been set to three, so that for each block of data there are three copies, one on
each of the racks.
4.2.2 Data Storage
Since the project is based on huge sets of data, an efficient system for the storage,
retrieval and analysis of this data was required. Hadoop has its own file system, the
Hadoop Distributed File System, which has been used for this purpose. The data for the
project is stored in HDFS.
4.2.2.1 Data Storage in HDFS cluster for analysis
The data obtained from the datasets is stored in the Hadoop Distributed File System in
blocks of 64 MB. Each block of data has three copies, one stored on each of the racks.
4.2.3 Analysis of data and recommendation
4.2.3.1 InputFormat to select the data for input to MapReduce and define the
InputSplits that break a file into tasks
The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of
<key, value> pairs as the output of the job, conceivably of different types.
Input and output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)


4.2.3.2 MapReduce Program to customize Data Set


The first task is, for each user, to emit a row containing their 'postings' (item, rating); the
reducer then emits the user's rating sum and count for use in later steps.

4.2.3.3 MapReduce Program to perform Correlation

For each row, i.e. each pair of movies (movie, movie2), we calculate similarity by computing
the number of people who rated both movie and movie2, the sum over all elements in each
ratings vector (sum_x, sum_y) and the squared sum of each vector (sum_xx, sum_yy). With
these quantities we can calculate the correlation between the movies. The correlation can be
expressed as shown below.
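The formula itself is not reproduced in the report; as a sketch, with n the number of common
raters, x_u and y_u the two movies' ratings from user u (so that sum_x = sum of x_u,
sum_xx = sum of x_u^2, and similarly for y), and sum_xy the sum of the products x_u * y_u
(a quantity assumed here, not listed above), the standard Pearson correlation is

corr(x, y) = \frac{n \sum_u x_u y_u - \sum_u x_u \sum_u y_u}
                  {\sqrt{n \sum_u x_u^2 - (\sum_u x_u)^2}\;\sqrt{n \sum_u y_u^2 - (\sum_u y_u)^2}}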

4.2.3.4 MapReduce Program to sort and Format Recommendations


The last step of the job sorts the top-correlated items for each item and prints them to
the output.
4.2.3.5 Output Collector
Produce personalized recommendations for individual users according to their requirements,
using Apache Hive.


5. DESIGN
5.1 Technical Environment
The recommender system shall be deployed over a two-node cluster, with the Java runtime
set up to use the common Hadoop configuration as specified by the NameNode
(master node) of the cluster.

5.2 Hierarchy of modules

Figure 5.1: Hierarchy of modules

5.3 Detailed Design

Figure 5.2: Use Case Diagram



Figure 5.3: Class Diagram

5.3.1 Hadoop Cluster Configuration


The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver high
availability, the library itself is designed to detect and handle failures at the application
layer, thereby delivering a highly available service on top of a cluster of computers, each
of which may be prone to failure. [13]

5.3.2 Data Storage and Replication


The Hadoop filesystem is designed for storing very large files (petabytes of data) with
streaming data access, built on the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern. HDFS stores metadata on a dedicated server, called the
NameNode. Application data are stored on other servers called DataNodes. All the servers are
fully connected and communicate with each other using TCP-based protocols.


5.3.2.1 Architecture
HDFS is based on a master/slave architecture. An HDFS cluster consists of a single
NameNode (as master) and a number of DataNodes (as slaves). The NameNode and
DataNodes are pieces of software designed to run on commodity machines. These
machines typically run a GNU/Linux operating system (OS). The use of the highly
portable Java language means that HDFS can be deployed on a wide range of machines.
A typical deployment has a dedicated machine that runs only the NameNode software.
Each of the other machines in the cluster runs one instance of the DataNode software.
The architecture does not preclude running multiple DataNodes on the same machine, but
in a real deployment one machine usually runs one DataNode. The existence of a single
NameNode in a cluster greatly simplifies the architecture of the system. The NameNode
is the arbitrator and repository for all HDFS metadata. The system is designed in such a
way that user data never flows through the NameNode.
5.3.2.2 NameNode
The NameNode manages the filesystem namespace, i.e. the metadata for all the files and
directories in the tree. A file is divided into large blocks (typically 128 megabytes, but
selectable per file) and each block is independently replicated at multiple DataNodes
(typically three, but again selectable per file) to provide reliability. The NameNode
maintains and stores the namespace tree and the mapping of file blocks to DataNodes
persistently on the local disk in the form of two files: the namespace image and the edit
log. The NameNode also knows the DataNodes on which all the blocks for a given file
are located. However, it does not store block locations persistently, since this information
is reconstructed from the DataNodes when the system starts. [14]
On NameNode failure the filesystem becomes inaccessible, because only the NameNode
knows how to reconstruct the files from the blocks on the DataNodes. For this reason it is
important to make the NameNode resilient to failure, and Hadoop provides two
mechanisms for this: the Checkpoint Node and the Backup Node.

5.3.2.3 HDFS Client


Reading a file
To read a file, the HDFS client first contacts the NameNode, which returns a list of addresses
of the DataNodes that hold a copy of each block of the file. The client then connects directly
to the closest DataNode for each block and requests the transfer of the desired block.


Writing to a File
To write to a file, the HDFS client first creates an empty file without any blocks. File
creation is only possible when the client has write permission and a file with that name does
not already exist in the system. The NameNode records the new file creation and allocates
data blocks to a list of suitable DataNodes to host replicas of the first block of the file.
Replication of the data arranges the DataNodes in a pipeline. When the first block is filled,
new DataNodes are requested to host replicas of the next block. A new pipeline is organized,
and the client sends the further bytes of the file. The choice of DataNodes is likely to be
different for each block.
If a DataNode in the pipeline fails while the data is being written, the pipeline is first closed,
the partial block on the failed DataNode is deleted, and the failed DataNode is removed from
the pipeline. New DataNodes are then chosen for the pipeline to write the remaining blocks
of data.
5.3.2.4 Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It
stores each file as a sequence of blocks; all blocks in a file except the last block are the
same size. The blocks of a file are replicated for fault tolerance. The block size and
replication factor are configurable per file. An application can specify the number of
replicas of a file. The replication factor can be specified at file creation time and can be
changed later. Files in HDFS are write-once and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically
receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport
contains a list of all blocks on a DataNode.
5.3.2.5 Replica Placement
Large HDFS instances run on a cluster of computers that commonly spread across many
racks. Communication between two nodes in different racks has to go through switches.
In most cases, network bandwidth between machines in the same rack is greater than
network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined
in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on
unique racks. This prevents losing data when an entire rack fails and allows use of
bandwidth from multiple racks when reading data. This policy evenly distributes replicas
in the cluster which makes it easy to balance load on component failure. However, this
policy increases the cost of writes because a write needs to transfer blocks to multiple
racks.

For the common case, when the replication factor is three, HDFS's placement policy is to
put one replica on one node in the local rack, another on a node in a different (remote)
rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack
write traffic, which generally improves write performance. The chance of rack failure
is far less than that of node failure; this policy does not impact data reliability and
availability guarantees. However, it does reduce the aggregate network bandwidth used
when reading data, since a block is placed in only two unique racks rather than three.
With this policy, the replicas of a file do not distribute evenly across the racks. One third
of the replicas are on one node, two thirds of the replicas are on one rack, and the other
third are evenly distributed across the remaining racks. This policy improves write
performance without compromising data reliability or read performance.
5.3.2.6 Data Blocks
HDFS is designed to support very large files. Applications that are compatible with
HDFS are those that deal with large data sets. These applications write their data only
once but they read it one or more times and require these reads to be satisfied at
streaming speeds. HDFS supports write-once-read-many semantics on files. A typical
block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB
chunks, and if possible, each chunk will reside on a different DataNode.
5.3.3 Approach
5.3.3.1 Collaborative Filtering
Item-based collaborative filtering is a model-based algorithm for making
recommendations. In the algorithm, the similarities between different items in the dataset
are calculated by using one of a number of similarity measures, and then these similarity
values are used to predict ratings for user-item pairs not present in the dataset.
Similarities between items
The similarity values between items are measured by observing all the users who have
rated both the items. As shown in the diagram below, the similarity between two items is
dependent upon the ratings given to the items by users who have rated both of them:

Figure 5.4: Collaborative Filtering


Similarity measures
There are a number of different mathematical formulations that can be used to calculate
the similarity between two items. As can be seen in the formulae below, each formula
includes terms summed over the set of common users U.
Cosine-based similarity
Also known as vector-based similarity, this formulation views two items and their ratings
as vectors, and defines the similarity between them as the cosine of the angle between these
vectors, as shown below.
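The formula is not reproduced in the report; a standard statement of it, treating i and j as the
two items' rating vectors over the users who rated both, is

sim(i, j) = \cos(\vec{i}, \vec{j}) = \frac{\vec{i} \cdot \vec{j}}{\lVert \vec{i} \rVert_2 \, \lVert \vec{j} \rVert_2}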

Our implementation
We implemented item-based collaborative filtering with the following parameters:
Similarity measure: adjusted cosine-based similarity
Minimum number of users for each item-item pair: 5
Number of similar items stored: 50
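For reference (again not reproduced in the report), the adjusted cosine similarity subtracts each
co-rating user's average rating \bar{R}_u before taking the cosine; this subtraction is what the
first challenge below refers to. With U the set of users who rated both items i and j, and
R_{u,i} user u's rating of item i,

sim(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}
                 {\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\;\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}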
Challenges
We tried item-based collaborative filtering on the MovieLens dataset, but as our testing
showed, it did not perform very well. In particular, we isolated two main problems, both
mainly due to the sparsity of the data:

The first problem manifested itself during the adjusted-cosine similarity calculation, in the
case where there was only one common user between two movies. Since we subtract the
average rating of the user, the adjusted-cosine similarity for items with only one common
user is 1, which is the highest possible value. As a result, for such items, which are common
in the MovieLens database, the most similar items end up being only these items with one
common user. The solution we implemented was to specify a minimum number of users (in
this case, 5) that two movies needed to have in common before they could be called similar.

The second challenge arose when we used a weighted sum to calculate the rating for test
user-movie pairs. Since we store only 50 similar movies for each movie, and for each target
movie we only consider the similar movies that the active user has seen, it was often the
case with the MovieLens dataset that there were not many such movies for many of the
users. This resulted in bad predictions overall for large test sets. Because this was due to
the sparsity of the dataset itself, we could not come up with a straightforward solution to
this problem.


5.3.4 Hadoop MapReduce


Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner [8].
A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically both the input
and the output of the job are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them and re-executes the failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System are running on the same set of nodes.
This configuration allows the framework to effectively schedule tasks on the nodes where
data is already present, resulting in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master JobTracker and one slave
TaskTracker per cluster-node. The master is responsible for scheduling the jobs'
component tasks on the slaves, monitoring them and re-executing the failed tasks. The
slaves execute the tasks as directed by the master.
Inputs and Outputs
The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of
<key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework.
Input and output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)


5.3.4.1 Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records.
The transformed intermediate records do not need to be of the same type as the input
records. A given input pair may map to zero or many output pairs.
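The report does not include the mapper source; the following is a minimal, illustrative sketch
of a mapper for the 'postings' step of Section 4.2.3.2, written against the Hadoop 1.x
(org.apache.hadoop.mapred) API described in this chapter. It assumes MovieLens-style input
lines of the form user<TAB>movie<TAB>rating<TAB>timestamp; the class name is ours.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits one intermediate pair per rating: key = user id, value = "movie:rating".
public class PostingsMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\t");
    if (fields.length >= 3) {
      output.collect(new Text(fields[0]), new Text(fields[1] + ":" + fields[2]));
    }
  }
}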
5.3.4.2 Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of
values. The number of reduces for the job is set by the user via
JobConf.setNumReduceTasks(int).
Overall, Reducer implementations are passed the JobConf for the job via the
JobConfigurable.configure(JobConf) method and can override it to initialize
themselves. The framework then calls the reduce(WritableComparable, Iterator,
OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped
inputs. Applications can then override the Closeable.close() method to perform any
required cleanup.
Reducer has three primary phases: shuffle, sort and reduce.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework
fetches the relevant partition of the output of all the mappers via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have
output the same key) in this stage. The shuffle and sort phases occur simultaneously;
while map-outputs are being fetched they are merged.
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter)
method is called for each <key, (list of values)> pair in the grouped inputs.
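Continuing the hedged sketch started in the Mapper section (again the Hadoop 1.x mapred API
and a hypothetical class name), a reducer for the 'postings' step described in Section 4.2.3.2
could look as follows.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// For each user, emits the rating count, rating sum and the posting list "movie:rating ...".
public class PostingsReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text user, Iterator<Text> postings,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    double ratingSum = 0.0;
    int ratingCount = 0;
    StringBuilder items = new StringBuilder();
    while (postings.hasNext()) {
      String posting = postings.next().toString();           // "movie:rating"
      ratingSum += Double.parseDouble(posting.split(":")[1]);
      ratingCount++;
      items.append(posting).append(' ');
    }
    output.collect(user,
        new Text(ratingCount + "\t" + ratingSum + "\t" + items.toString().trim()));
  }
}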

5.3.4.3 Partitioner
Partitioner partitions the key space.
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The
key (or a subset of the key) is used to derive the partition, typically by a hash function.
The total number of partitions is the same as the number of reduce tasks for the job.
Hence this controls which of the m reduce tasks the intermediate key (and hence the
record) is sent to for reduction.
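By default Hadoop uses a hash-based partitioner; as a hedged illustration (class name ours), a
custom partitioner in the 1.x API that routes keys to reducers by hash could be written as below.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each intermediate key to a reduce task based on the hash of the key,
// mirroring what Hadoop's default HashPartitioner does.
public class UserHashPartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf job) {
    // no configuration needed for this example
  }

  public int getPartition(Text key, Text value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}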

5.3.4.4 Output Collector


Output Collector is a generalization of the facility provided by the MapReduce
framework to collect data output by the Mapper or the Reducer (either the
intermediate outputs or the output of the job).
5.3.4.5 Job Configuration
JobConf represents a MapReduce job configuration.
JobConf is the primary interface for a user to describe a MapReduce job to the
Hadoop framework for execution.
The framework tries to faithfully execute the job as described by JobConf; however,
while some job parameters are straightforward to set (e.g. setNumReduceTasks(int)),
other parameters interact subtly with the rest of the framework and/or the job
configuration.
It is more complex to set the Partitioner, Reducer, InputFormat, OutputFormat and
OutputCommitter implementations. JobConf also indicates the set of input files
(setInputPaths(JobConf, Path...) / addInputPath(JobConf, Path) and
setInputPaths(JobConf, String) / addInputPaths(JobConf, String)) and where the output
files should be written (setOutputPath(Path)).
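As a hedged sketch tying the pieces together (class names and HDFS paths here are
hypothetical, following the Hadoop 1.x API described above), a driver for the 'postings' job
could be configured like this.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class PostingsJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PostingsJob.class);
    conf.setJobName("user-postings");

    conf.setMapperClass(PostingsMapper.class);     // sketched in Section 5.3.4.1
    conf.setReducerClass(PostingsReducer.class);   // sketched in Section 5.3.4.2
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setNumReduceTasks(4);

    FileInputFormat.setInputPaths(conf, new Path("/user/input/u.data"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/output/postings"));

    JobClient.runJob(conf);   // submits the job and waits for completion
  }
}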


5.4 Test Plan


Serial No. 10
Test ID: T-CLU1.1
Description: Checking the connected nodes of the cluster, their responsiveness and their arrangement.
Test Input: Live node; dead node (on disconnection from the network).
Desired Result: Disconnected node reported as not responsive, with slow processing and HDFS searching shown on the HDFS GUI.
Test Result: HDFS cluster arranged and up & running.
Reference: A.3

Serial No. 11
Test ID: T-HDF1.1
Description: Check storage of data in the HDFS cluster nodes.
Test Input: Replication factor 1; default replication factor; DataNodes full with data.
Desired Result: With replication factor 1, a single copy of the data stored on the closest empty DataNode; with the default replication factor, 3 copies of the data stored in accordance with rack awareness; alert for adding new DataNodes when the existing DataNodes are full.
Test Result: DataNode storage checked; replication factor maintained; scalability feature verified.
Reference: A.1, A.2

Serial No. 12
Test ID: T-HDFS1.2
Description: Updating data stored in HDFS.
Test Input: Update user data in place on the NameNode; replace the file with updated user data.
Desired Result: In-place update rejected with an HDFS write error on the GUI (data update unsuccessful); file replacement completes (data updated successfully).
Test Result: Unsuccessful and successful, as expected.
Reference: A.3, A.4

Serial No. 13
Test ID: T-MR1.1
Description: Test of the MapReduce package with the data sets.
Test Input: Test with data sets program; corrupt package.
Desired Result: Number of words emitted in key-value pair form; output format/result error for a corrupt package.
Test Result: Data sets tested and key-value pairs created.
Reference: A.5

Serial No. 14
Test ID: T-MR1.2
Description: MapReduce for correlation.
Test Input: Successful test with data set; unsuccessful test.
Desired Result: Correlation produced within -1 to 1; correlation with undefined (NaN) output or output format/result error on failure.
Test Result: Correlation values created.
Reference: A.6

Serial No. 15
Test ID: T-MR1.3
Description: MapReduce to sort and format the recommendations.
Test Input: Successful test with data set; unsuccessful test.
Desired Result: Sorted and desired output produced; output format/result error on failure.
Test Result: Sorted output obtained.
Reference: A.7

Serial No. 16
Test ID: T-ALG
Description: Algorithm test (pen & paper calculation): pick a number of random users and items, predict their ratings using the algorithm, and calculate the RMSE between the predictions and the actual ratings.
Test Input: Randomly chosen user-item pairs.
Desired Result: The lower the RMSE value, the better.
Test Result: Based on the correlation values, lower RMSE and better results obtained, in sorted order.
Reference: A.7

Table 5.1: Test Plan

6. IMPLEMENTATION
6.1 Implementation Details
6.1.1 Cluster Configuration
Installing a Hadoop cluster typically involves unpacking the software on all the
machines in the cluster or installing it via a packaging system as appropriate for your
operating system. It is important to divide up the hardware into functions.
Typically one machine in the cluster is designated as the NameNode and another
machine as the ResourceManager, exclusively. These are the masters. Other services
(such as the Web App Proxy Server and the MapReduce Job History server) are usually run
either on dedicated hardware or on shared infrastructure, depending upon the load.
The rest of the machines in the cluster act as both DataNode and NodeManager. These
are the slaves.
Administrators should use the etc/hadoop/hadoop-env.sh and optionally the
etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts to do site-specific
customization of the Hadoop daemons' process environment.
To start a Hadoop cluster you will need to start both the HDFS and YARN clusters.

6.1.2 Data Storage


6.1.2.1 Storage of Data in HDFS
To format the configured HDFS file system, open the NameNode (HDFS server) and execute
the following command:
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command
will start the NameNode as well as the DataNodes as a cluster:
$ start-dfs.sh
Listing Files in HDFS
After loading the information into the server, we can find the list of files in a directory, or the
status of a file, using ls. Given below is the syntax of ls; you can pass a directory or a
filename as an argument.

$ $HADOOP_HOME/bin/hadoop fs -ls <args>


Inserting Data into HDFS
Assume we have data in a file called file.txt in the local system which ought to be
saved in the HDFS file system. Follow the steps given below to insert the required file into
the Hadoop file system.
Create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Transfer and store a data file from local systems to the Hadoop file system using the
put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple demonstration
of retrieving the required file from the Hadoop file system.
Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

6.1.2.2 Determination of Block Size of the Data Set & Block Replication Factor
An HDFS file is chopped up into 64 MB chunks and, with the replication factor set to three,
each chunk is stored as three copies; where possible, each copy resides on a different
DataNode. For example, a 200 MB data set would be split into four blocks (three full 64 MB
blocks and one partially filled), each replicated three times across the cluster.

6.1.3 Analysis of Data And Recommendation


Our goal is to calculate how similar pairs of movies are, so that we can recommend
movies similar to movies a user liked. Using correlation, we do this as follows:
for every pair of movies A and B, find all the people who rated both A and B, use
these ratings to form a Movie A vector and a Movie B vector, and then calculate the
correlation between these two vectors. Now, when someone watches a movie, we can
recommend the movies most correlated with it.


6.1.3.1 Input Format


InputFormat describes the input specification for a MapReduce job. The MapReduce
framework relies on the InputFormat of the job to:
validate the input specification of the job;
split up the input file(s) into logical InputSplits, each of which is then assigned to an
individual Mapper; and
provide the RecordReader implementation to be used to glean input records from the
logical InputSplit for processing by the Mapper.

6.1.3.2 MapReduce to Customize Dataset


The first step is to get our movies file, which has three columns: (user, movie, rating). For
this task we use the MovieLens dataset, and we use MapReduce to convert the data set
into the required format.
Our first task is, for each user, to emit a row containing their 'postings' (item, rating); the
reducer then emits the user's rating sum and count for use in later steps.
6.1.3.3 MapReduce to Perform Correlation
For each pair of movies (movie, movie2), the calculate-similarity step computes the number
of people who rated both movie and movie2, the sum over all elements in each ratings
vector (sum_x, sum_y) and the squared sum of each vector (sum_xx, sum_yy). With these
quantities we can calculate the correlation between the movies, as sketched below.
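The report's correlation job is not listed; the following hedged sketch (Hadoop 1.x API,
hypothetical class name) shows how a reducer could accumulate these sums for an item pair
whose co-ratings arrive as values of the form "rating_x:rating_y", and emit the Pearson
correlation defined in Section 4.2.3.3.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Key: "movie,movie2"; values: one "rating_x:rating_y" pair per user who rated both movies.
public class CorrelationReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, DoubleWritable> {

  public void reduce(Text moviePair, Iterator<Text> coRatings,
                     OutputCollector<Text, DoubleWritable> output, Reporter reporter)
      throws IOException {
    double n = 0, sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
    while (coRatings.hasNext()) {
      String[] r = coRatings.next().toString().split(":");
      double x = Double.parseDouble(r[0]);
      double y = Double.parseDouble(r[1]);
      n++; sumX += x; sumY += y; sumXX += x * x; sumYY += y * y; sumXY += x * y;
    }
    double denominator = Math.sqrt(n * sumXX - sumX * sumX)
                       * Math.sqrt(n * sumYY - sumY * sumY);
    if (n > 1 && denominator != 0) {
      double correlation = (n * sumXY - sumX * sumY) / denominator;
      output.collect(moviePair, new DoubleWritable(correlation));
    }
  }
}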
6.1.3.4 MapReduce to Perform Sort
The last step of the job sorts the top-correlated items for each item and prints them to
the output, as sketched below.
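One common way to implement this step (again a hedged sketch in the Hadoop 1.x API, not
the report's actual code) is to re-emit each record with the correlation folded into the key, so
that the framework's shuffle sort orders each movie's partners by decreasing correlation.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Input line: "movie,movie2<TAB>correlation".  Output key: "movie<TAB>(1 - correlation)"
// formatted with a fixed width, so that for a given movie the most correlated partners
// (correlation in [-1, 1], hence 1 - correlation in [0, 2]) sort first lexicographically.
public class SortByCorrelationMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().split("\t");
    String[] movies = parts[0].split(",");
    double correlation = Double.parseDouble(parts[1]);
    String sortKey = movies[0] + "\t" + String.format("%.6f", 1.0 - correlation);
    output.collect(new Text(sortKey), new Text(movies[1] + "\t" + parts[1]));
  }
}

A pass through Hadoop's IdentityReducer (or a reducer that truncates each movie's list to the
top N partners) then writes the sorted recommendations.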
6.1.3.5 Output Collector
Collects the required output from the result dataset of the sort MapReduce program and
displays it to the user.


6.2 System Installation Step


Before installing Hadoop in the Linux environment, we need to set up Linux using
ssh (Secure Shell). Follow the steps given below for setting up the Linux environment.
Installing Java
Java is the main prerequisite for Hadoop. First of all, you should verify the existence
of Java on your system using the command java -version:
$ java -version
Step 1: Download Java (JDK <latest version> - X64.tar.gz), then verify and extract the
jdk-7u71-linux-x64.gz archive.
Step 2: To make Java available to all users, move it to the location /usr/local/.
Step 3: To set up the PATH and JAVA_HOME variables, add the corresponding commands to
the ~/.bashrc file.
Step 4: Configure the Java alternatives and verify with the java -version command from the
terminal as explained above.

Setting Up Hadoop
You can set the Hadoop environment variables by appending the following commands to
the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Now apply all the changes into the current running system.
$ source ~/.bashrc


Hadoop Configuration
You can find all the Hadoop configuration files in the location $HADOOP_HOME/etc/hadoop.
You need to make changes in those configuration files according to your Hadoop
infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in Java, you have to reset the Java environment variables
in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on
your system:
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop
instance, the memory allocated for the file system, the memory limit for storing the data, and
the size of the read/write buffers (see Appendix A.1).
hdfs-site.xml
The hdfs-site.xml file contains information such as the replication factor, the namenode
path and the datanode paths of your local file systems, that is, the place where you want to
store the Hadoop infrastructure (see Appendix A.3).


6.3 System Usage Instruction


After the installation of Apache Hadoop on the master and slave nodes, the following
steps have to be performed in order to run the jar file.
1. Data Storage in HDFS
The data that has to be processed initially exists in the local file system. It has to
be transferred to HDFS in order for the jar to operate upon it. This is done by typing
the following into the terminal:
$ hadoop fs -put <location of dataset in local FS> <destination location in HDFS>
2. JAR Execution
The JAR is our executable file which will operate upon the dataset that we
transferred to HDFS. It is run by typing the following into the terminal:
$ hadoop jar <location of jar> <location of input file(s) in HDFS> <location of output
file in HDFS>
3. Output Retrieval
The output produced by a MapReduce program is stored within the directory which
has been set as the output folder. Within it, the final output of the MapReduce is kept.
If the job is map-only, the file containing the output will be named part-m-00000;
if it is a MapReduce job where the output is produced after the reduce, the output
file is named part-r-00000. The output can be retrieved in either of the following two
ways:
a. Using the terminal
Since the location of the output file is known, it can be displayed in the
terminal itself with the following command:
$ hadoop fs -cat <filename>
b. Using the HDFS web-based file browser
Hadoop comes with a web-based file browser, which provides a GUI to browse the
HDFS. So, if the file location is known, the destination can be reached by navigating
through the browser. The URL for the browser is:
http://localhost:50070/explorer.html#/

7. CONCLUSION
7.1 Project Benefits
Recommender systems are a powerful new technology for extracting additional value
for a business from its user databases. These systems help users find items they want
to buy from a business. Recommender systems benefit users by enabling them to find
items they like; conversely, they help the business by generating more sales.
Recommender systems are rapidly becoming a crucial tool in e-commerce on the
Web. They are being stressed by the huge volume of user data in existing corporate
databases, and will be stressed even more by the increasing volume of user data
available on the Web. New technologies are needed that can dramatically improve
the scalability of recommender systems.
In our project we have used collaborative filtering, which promises to process large
data sets and at the same time produce high-quality recommendations.
7.2 Future Scope for improvements
The system implemented in the project uses static data to recommend movies to the
users. To incorporate dynamic data, distributed databases such as HBase or Cassandra
can be used, which can be regularly updated to add new users and ratings. To build a
web application, the data needs to be accessible in real time; the solution to this too
can be the use of a distributed database.
The recommender system can be improved by combining user-based collaborative
filtering and content-based filtering with the current system. This combination is also
called hybrid filtering, and it helps significantly in improving performance.
The comparison made between the different similarity metrics was based on the run
time and not on the precision of the recommendations.


8. REFERENCES
[1] A. Felfernig, G. Friedrich and L. Schmidt-Thieme, "Recommender Systems", IEEE
Intelligent Systems, pp. 18-21, 2007.
[2] P. Resnick, N. Iacovou, M. Suchak and J. Riedl, "GroupLens: An Open Architecture for
Collaborative Filtering of Netnews", in Proceedings of CSCW '94, Chapel Hill, NC, 1994.
[3] U. Shardanand and P. Maes, "Social Information Filtering: Algorithms for Automating
'Word of Mouth'", in Proceedings of CHI '95, Denver, 1995.
[4] Daniar Asanov, "Algorithms and Methods in Recommender Systems", Berlin Institute of
Technology, Berlin, 2011.
[5] Badrul Sarwar, George Karypis and John Riedl, "Item-Based Collaborative Filtering
Recommendation Algorithms", in Proceedings of WWW10, Hong Kong, China, 2001
[Online]. http://wwwconference.org/www10/cdrom/papers/519/index.html
[6] Francesco Ricci, Lior Rokach and Bracha Shapira, Introduction to Recommender Systems
Handbook, New York: Springer Science+Business Media, 2011, ch. 1, sec. 1.4, pp. 10-14.
[7] D. Borthakur, "HDFS Architecture Guide", Hadoop Apache Project, 2008 [Online].
http://hadoop.apache.org/common/docs/current/hdfsdesign.pdf
[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters",
Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[9] Fuzhi Zhang, Huilin Liu and Jinbo Chao, "A Two-stage Recommendation Algorithm Based
on K-means Clustering in Mobile E-commerce", Journal of Computational Information
Systems, vol. 6, issue 10, pp. 3327-3334, 2010.
[10] Taek-Hun Kim, Young-Suk Ryu, Seok-In Park and Sung-Bong Yang, "An Improved
Recommendation Algorithm in Collaborative Filtering", Department of Computer Science,
Yonsei University.
[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia and Robert Chansler, "The Hadoop
Distributed File System", in Proceedings of the IEEE 26th Symposium on Mass Storage
Systems and Technologies (MSST), 2010.
[12] Emmanouil Vozalis and Konstantinos G. Margaritis, "Analysis of Recommender Systems
Algorithms", IEEE conference proceedings.
[13] Brian McFee, Luke Barrington and Gert Lanckriet, "Learning Content Similarity for Music
Recommendation", IEEE Transactions on Audio, Speech, and Language Processing, vol. 20,
no. 8, 2012.
[14] Paul C. Zikopoulos and Chris Eaton, Understanding Big Data: Analytics for Enterprise
Class Hadoop and Streaming Data, McGraw-Hill, 2012.
[15] Chuck Lam, Hadoop in Action, Manning Publications, 2010.


APPENDIX
A.1 Core-site.xml

A.2 localhost:54310


A.3 Hdfs-site.xml

A.4 U.data


A.5 MapReduce 1

A.6 MapReduce 2


A.7 MapReduce 3

