1 Objective
The goal of this assignment is to try out different ways of implementing and configuring a recommender, and
to evaluate different approaches. Start with an existing dataset of user-item ratings, and implement at least
two of these recommendation algorithms:
Content-Based Filtering
User-User Collaborative Filtering
Item-Item Collaborative Filtering
Evaluate and compare different approaches, and provide at least one graph and a summary of findings and
recommendations.
For this project, we use dataset 1_2 from the Jester database of joke ratings. Ratings are given as real
values ranging from -10.00 (worst) to +10.00 (best).
We use the recommenderlab package to develop and test our recommender system using different
recommendation algorithms. In addition, we use some of the ideas and code suggestions from
library(tidyverse)
library(recommenderlab)
library(knitr)
3 Data preparation
We start by preparing the data for the recommender system. We note from the data description for the Jester
database that the rating value “99” corresponds to “null” or “not rated”, and that the first column in the
dataset is the number of jokes rated by each user. So, our data preparation includes the following steps:
Confirm that the number of non-blank ratings per row matches the number in the first column
Drop the first column
Convert rating values of “99” to NA
Set row names to user ID and column names to joke ID
Create the ratings matrix as type realRatingMatrix
View the ratings matrix.
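The steps above can be sketched as follows; the file name "jester-data-2.csv" is an assumption, so substitute the local path to your copy of dataset 1_2:

```r
library(recommenderlab)

# the file name below is a placeholder for the local copy of Jester dataset 1_2
raw <- read.csv("jester-data-2.csv", header = FALSE)

# confirm that the count in column 1 matches the number of non-blank ratings
stopifnot(all(raw[[1]] == rowSums(raw[, -1] != 99)))

df <- raw[, -1]            # drop the rating-count column
df[df == 99] <- NA         # "99" means "null" / "not rated"
rownames(df) <- paste0("u", seq_len(nrow(df)) - 1)   # user IDs u0, u1, ...
colnames(df) <- paste0("j", seq_len(ncol(df)) - 1)   # joke IDs j0, j1, ...

r <- as(as.matrix(df), "realRatingMatrix")
r
```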
In the framework of the user-item utility matrix, the “users” are the raters and the “items” are the jokes. We
see that the ratings matrix includes 1.71 million ratings in total, with 23,500 users and 100 jokes. This implies
that the ratings matrix is 73% populated with non-blank ratings (or conversely, 27% sparse), compared to a
total of 2.35 million possible ratings.
glimpse(df)
## Observations: 23,500
## Variables: 100
## $ j0 <dbl> NA, -4.37, NA, 0.34, NA, NA, 3.50, -7.67, 1.02, 3.64, 9.27...
## $ j1 <dbl> 8.11, -3.88, NA, -6.55, NA, NA, 2.28, 7.14, 6.07, 1.60, 6....
## $ j2 <dbl> NA, 0.73, NA, 2.86, NA, NA, 3.01, -6.31, 6.65, -0.39, -9.8...
## $ j3 <dbl> NA, -3.20, NA, NA, NA, NA, -0.63, NA, -0.87, 3.74, -8.83, ...
## $ j4 <dbl> -2.28, -6.41, 0.73, -3.64, 9.13, -1.46, 4.95, -9.32, 6.80,...
## $ j5 <dbl> -4.22, 1.17, NA, 1.12, NA, NA, -0.49, 6.07, 0.68, -9.61, 0...
## $ j6 <dbl> 5.49, 7.82, 5.53, 5.34, -9.32, 2.72, -0.49, 3.59, 5.34, 8....
## $ j7 <dbl> -2.62, -4.76, 3.25, 2.33, -2.04, -3.83, -0.68, -2.72, 4.56...
## $ j8 <dbl> NA, -6.41, NA, NA, NA, NA, 3.30, NA, -0.39, -7.09, -6.17, ...
## $ j9 <dbl> -2.28, 0.73, NA, 2.33, NA, NA, -1.41, -8.01, 8.25, 3.88, -...
## $ j10 <dbl> 5.34, 1.99, NA, 3.54, NA, NA, 4.71, -9.13, 8.93, -6.36, 9....
## $ j11 <dbl> 3.54, 0.63, NA, 0.92, 2.04, NA, 3.30, -8.06, 6.21, -9.76, ...
## $ j12 <dbl> -9.71, -1.41, -2.04, -5.87, -2.04, -3.83, 1.26, 2.67, 0.10...
## $ j13 <dbl> 3.30, -4.32, 2.91, 2.77, NA, 0.97, -4.03, 7.43, 9.08, -7.9...
## $ j14 <dbl> -5.58, 0.63, 2.62, 4.76, -2.23, -3.83, -5.49, -7.23, -0.73...
## $ j15 <dbl> -6.07, -9.22, -1.70, -3.54, -8.50, -1.26, -5.44, -7.23, 2....
## $ j16 <dbl> -0.29, 0.87, 2.28, -7.38, 3.16, 0.00, 2.86, -0.53, 6.46, -...
## $ j17 <dbl> 2.33, 0.68, 3.11, -4.37, 2.18, -4.90, 1.17, 1.60, -5.34, -...
## $ j18 <dbl> 3.93, 0.29, 2.91, -1.75, 3.93, -2.72, 1.80, -7.91, 0.97, 2...
## $ j19 <dbl> -0.34, -4.51, 6.75, -4.81, 2.96, -0.39, 1.07, -8.11, 3.83,...
## $ j20 <dbl> -6.89, -6.65, 2.72, -1.75, 6.41, 1.50, 2.62, -7.23, 8.59, ...
## $ j21 <dbl> -2.09, -4.03, 2.82, 2.72, 2.62, NA, 0.73, -9.13, 4.71, 4.8...
## $ j22 <dbl> -8.83, 1.94, NA, 2.96, NA, NA, -0.39, -7.77, 8.45, -9.13, ...
## $ j23 <dbl> NA, -3.11, NA, NA, NA, NA, -1.94, NA, -0.44, 6.12, -8.98, ...
## $ j24 <dbl> 5.15, -4.32, NA, 1.02, NA, NA, 1.99, -8.88, 8.59, 1.60, -6...
## $ j25 <dbl> 3.98, -0.63, 4.22, 2.96, 5.10, 2.62, 1.21, 7.38, 7.52, -9....
## $ j26 <dbl> 4.47, 5.83, 2.77, 3.25, 0.05, 3.74, 2.09, 0.00, 6.26, -9.3...
## $ j27 <dbl> 3.88, 3.64, 3.35, 0.10, 6.12, NA, 3.74, -8.59, 9.08, -3.83...
## $ j28 <dbl> 5.29, 0.39, 4.22, 2.18, 7.18, 3.06, 0.63, 1.65, 7.28, -0.4...
## $ j29 <dbl> NA, -0.44, NA, 0.97, NA, NA, 3.30, -8.16, 6.17, -6.17, -6....
## $ j30 <dbl> 7.09, 0.73, 3.20, -6.89, 8.35, 2.04, 2.04, 8.40, 6.99, -8....
## $ j31 <dbl> -2.23, 1.07, 4.51, -3.93, 7.43, 4.03, -2.48, -6.55, 8.30, ...
## $ j32 <dbl> NA, 3.06, NA, NA, NA, NA, 3.35, -7.86, -7.91, -2.67, -0.49...
## $ j33 <dbl> NA, -4.51, NA, -2.43, NA, NA, 2.18, -8.01, 6.80, -4.90, 4....
## $ j34 <dbl> 3.11, -0.39, 5.24, 4.17, 7.57, 4.27, 4.32, 0.44, 2.82, 3.2...
## $ j35 <dbl> -4.42, 3.11, 2.33, -4.47, 5.49, 5.49, 1.89, -1.50, 3.50, 4...
## $ j36 <dbl> NA, 1.02, NA, NA, NA, NA, -1.99, NA, 0.15, 3.30, -9.08, 4....
## $ j37 <dbl> -5.87, -2.09, 0.00, 0.63, 8.98, 3.06, 4.17, -8.88, 1.99, 0...
## $ j38 <dbl> NA, -6.21, 1.84, 1.50, NA, NA, 2.43, 8.20, 9.17, -5.87, -0...
## $ j39 <dbl> -7.28, -4.71, 4.42, 3.45, NA, NA, 2.86, -9.13, 8.20, -4.17...
## $ j40 <dbl> NA, 1.07, NA, 2.91, NA, NA, 4.47, -8.45, 8.40, -3.54, -9.6...
## $ j41 <dbl> -4.71, 2.77, 2.82, 1.65, 5.63, 2.72, -1.02, 7.38, 6.80, -9...
## $ j42 <dbl> -8.93, 2.04, NA, NA, NA, NA, 2.38, -7.48, 8.30, -2.62, -0....
## $ j43 <dbl> NA, -5.97, NA, 3.06, NA, NA, -0.63, NA, -1.26, 1.75, -8.83...
## $ j44 <dbl> 3.40, -0.58, NA, 0.58, 4.66, NA, -4.76, 7.86, 8.01, -2.57,...
## $ j45 <dbl> -7.18, -3.74, 4.17, 3.93, NA, 0.97, -2.09, 7.67, 7.48, -1....
## $ j46 <dbl> 2.28, 3.30, NA, 3.69, NA, NA, 3.16, 7.48, 8.11, -8.88, -8....
## $ j47 <dbl> 7.48, -1.46, 3.69, 5.15, NA, 0.97, 3.79, -8.69, 7.91, 1.31...
## $ j48 <dbl> 5.15, 6.02, 3.79, 3.30, 4.17, 4.08, 4.27, -7.57, 9.03, -3....
## $ j49 <dbl> 5.73, 0.44, 2.48, 1.55, 8.11, 1.99, 1.99, 1.60, 8.88, -3.5...
## $ j50 <dbl> NA, -4.22, NA, -4.90, 7.33, NA, 1.70, 8.20, 7.28, -9.66, -...
## $ j51 <dbl> NA, -0.44, NA, -6.99, NA, NA, 4.08, 7.82, 8.11, -9.66, -7....
## $ j52 <dbl> 5.78, -1.94, 3.64, 3.06, 6.02, 4.90, 0.87, 0.68, 9.08, -4....
## $ j53 <dbl> 3.20, 8.45, 3.01, 3.64, 4.71, 0.97, -2.62, 7.38, 8.88, -2....
## $ j54 <dbl> -0.44, -3.01, NA, 1.21, -0.58, NA, 2.04, -8.69, 6.21, -3.3...
## $ j55 <dbl> 5.73, -4.51, 2.52, 0.68, 7.09, 1.50, 3.30, -7.23, 8.83, -8...
## $ j56 <dbl> NA, 3.54, NA, NA, -2.23, 0.68, 3.06, NA, 3.54, -8.01, -9.8...
## $ j57 <dbl> NA, -0.44, 4.71, NA, NA, NA, -3.74, NA, 0.19, -5.78, -4.13...
## $ j58 <dbl> NA, 4.56, NA, 1.02, NA, NA, 0.34, -8.16, 6.65, -9.76, -4.1...
## $ j59 <dbl> NA, 2.04, NA, 2.28, NA, NA, -0.92, -7.67, 6.07, -0.34, -0....
## $ j60 <dbl> -8.25, 5.78, -0.34, 1.41, 5.78, 2.09, 1.75, 8.35, 8.93, -3...
## $ j61 <dbl> 4.61, 5.63, 3.35, 1.21, 1.94, 3.06, 0.78, -7.23, 3.40, -1....
## $ j62 <dbl> NA, 6.02, NA, 0.83, NA, NA, 3.69, -7.72, 8.11, -9.71, -6.8...
## $ j63 <dbl> NA, -4.37, NA, -0.10, NA, NA, -0.68, -7.67, 8.54, 6.75, -0...
## $ j64 <dbl> -3.98, -8.98, 3.88, 4.71, 5.24, 3.20, 3.35, -1.80, 7.82, 1...
## $ j65 <dbl> 4.90, 0.87, -0.15, 1.12, 6.36, 1.89, 1.65, -8.01, 7.67, -5...
## $ j66 <dbl> NA, -4.66, NA, 3.01, NA, NA, 0.92, -6.07, 5.29, 0.24, -0.3...
## $ j67 <dbl> -4.32, 2.82, 3.06, 2.67, NA, 1.50, 2.52, -7.23, -0.29, -0....
## $ j68 <dbl> -2.72, 2.86, 2.96, 3.06, NA, 3.69, 1.70, -8.01, 0.78, -6.0...
## $ j69 <dbl> NA, -4.03, NA, 1.31, NA, NA, 5.00, -8.01, 8.06, -9.71, -3....
## $ j70 <dbl> NA, 3.06, NA, -3.64, NA, NA, 7.18, NA, -0.05, 0.44, NA, NA...
## $ j71 <dbl> NA, 2.91, NA, NA, NA, NA, 6.75, NA, 8.50, -2.14, NA, NA, N...
## $ j72 <dbl> NA, 5.44, 4.56, 2.43, NA, NA, 4.71, NA, 8.88, -9.56, NA, N...
## $ j73 <dbl> NA, -4.76, NA, NA, NA, NA, 3.20, NA, 4.27, -9.71, NA, NA, ...
## $ j74 <dbl> NA, -8.93, NA, NA, NA, NA, 2.23, NA, 9.08, -9.27, NA, NA, ...
## $ j75 <dbl> NA, -0.49, NA, NA, NA, NA, 4.47, NA, 8.54, -9.08, NA, NA, ...
## $ j76 <dbl> NA, 8.16, NA, NA, -0.05, NA, 3.50, -4.32, 6.17, -8.01, NA,...
## $ j77 <dbl> NA, -3.16, NA, NA, NA, NA, 5.97, NA, 8.74, -9.61, NA, NA, ...
## $ j78 <dbl> NA, 0.83, NA, NA, NA, NA, 0.44, NA, 3.20, -9.61, NA, NA, N...
## $ j79 <dbl> NA, 1.94, NA, -4.66, NA, NA, 4.51, NA, 8.50, -9.56, NA, NA...
## $ j80 <dbl> NA, -0.49, NA, NA, NA, 1.80, 7.14, NA, 0.92, 4.61, 9.13, 8...
## $ j81 <dbl> NA, 2.62, NA, NA, NA, NA, 5.00, -6.21, 9.13, -0.29, -9.71,...
## $ j82 <dbl> NA, 1.99, NA, NA, NA, NA, 8.98, NA, 8.54, -9.56, NA, NA, N...
## $ j83 <dbl> NA, -3.11, NA, NA, NA, NA, 5.39, NA, 6.36, -3.45, NA, NA, ...
## $ j84 <dbl> NA, -0.44, NA, NA, NA, NA, 4.27, NA, 9.03, -9.71, NA, NA, ...
## $ j85 <dbl> NA, -6.50, NA, NA, NA, NA, 4.47, NA, 9.03, 1.12, NA, NA, N...
## $ j86 <dbl> NA, 2.04, NA, NA, NA, NA, 3.20, NA, 3.54, -6.50, 9.03, NA,...
## $ j87 <dbl> NA, -3.16, NA, NA, NA, NA, 4.56, NA, 4.66, -6.94, -5.83, N...
## $ j88 <dbl> 1.21, 2.09, NA, NA, NA, NA, 4.22, 1.65, 8.93, -5.10, NA, N...
## $ j89 <dbl> NA, 5.34, 2.38, NA, NA, NA, 3.83, NA, 6.46, -6.46, NA, NA,...
## $ j90 <dbl> NA, 5.73, NA, NA, NA, NA, 1.89, NA, 6.75, 8.59, NA, NA, NA...
## $ j91 <dbl> NA, -6.70, NA, NA, NA, NA, 3.54, NA, 8.74, -3.69, NA, NA, ...
## $ j92 <dbl> NA, 1.99, NA, NA, NA, NA, 2.38, NA, 9.17, -1.84, NA, NA, N...
## $ j93 <dbl> NA, 2.62, NA, NA, NA, 1.89, 2.77, NA, 7.52, 3.11, NA, 7.04...
## $ j94 <dbl> NA, -0.49, 3.16, NA, NA, NA, -0.24, NA, 5.73, 0.87, NA, NA...
## $ j95 <dbl> -5.92, 3.45, NA, NA, NA, NA, 2.28, NA, 5.92, 1.02, NA, NA,...
## $ j96 <dbl> NA, 3.20, NA, NA, NA, NA, 5.05, 1.60, -6.60, 4.17, NA, NA,...
## $ j97 <dbl> NA, -0.53, NA, NA, NA, NA, 4.51, NA, 0.24, -6.55, NA, NA, ...
## $ j98 <dbl> NA, -0.53, NA, NA, NA, NA, 4.08, NA, 9.08, -8.93, NA, NA, ...
## $ j99 <dbl> NA, -2.96, NA, NA, NA, NA, 2.96, NA, 8.98, -9.42, NA, NA, ...
# create raw ratings matrix
r <- as.matrix(df) %>% as("realRatingMatrix")
r
## 23500 x 100 rating matrix of class 'realRatingMatrix' with 1708993 ratings.
getRatingMatrix(r)[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'j0', 'j1', 'j2' ... ]]
##
## u0 . 8.11 . . -2.28 -4.22 5.49 -2.62 . -2.28
## u1 -4.37 -3.88 0.73 -3.20 -6.41 1.17 7.82 -4.76 -6.41 0.73
## u2 . . . . 0.73 . 5.53 3.25 . .
## u3 0.34 -6.55 2.86 . -3.64 1.12 5.34 2.33 . 2.33
## u4 . . . . 9.13 . -9.32 -2.04 . .
## u5 . . . . -1.46 . 2.72 -3.83 . .
## u6 3.50 2.28 3.01 -0.63 4.95 -0.49 -0.49 -0.68 3.30 -1.41
## u7 -7.67 7.14 -6.31 . -9.32 6.07 3.59 -2.72 . -8.01
## u8 1.02 6.07 6.65 -0.87 6.80 0.68 5.34 4.56 -0.39 8.25
## u9 3.64 1.60 -0.39 3.74 -6.36 -9.61 8.88 -4.08 -7.09 3.88
For comparison, we also create the normalized forms of the rating matrix, using centering and Z-score
methods for normalization.
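The normalized matrices used below can be created with recommenderlab's normalize(); the object names r_n1 and r_n2 follow the text:

```r
# centered ratings: subtract each user's mean rating
r_n1 <- normalize(r, method = "center")

# Z-score ratings: center and divide by each user's standard deviation
r_n2 <- normalize(r, method = "Z-score")
```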
# distribution of ratings
par(mfrow = c(1, 3))
hist(getRatings(r), main = "Raw Ratings")
hist(getRatings(r_n1), main = "Norm. Ratings - Centered")
hist(getRatings(r_n2), prob = TRUE, xlim = c(-5, 5), breaks = 40,
main = "Norm. Ratings - Z-Score")
curve(dnorm(x), add = TRUE, col = "blue")
summary(rowCounts(r))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.00 53.00 72.00 72.72 100.00 100.00
par(mfrow = c(1, 3))
hist(rowMeans(r), main = "Avg. Rating per User")
hist(rowMeans(r_n1), main = "Avg. Rating (Centered) per User")
hist(rowMeans(r_n2), main = "Avg. Rating (Z-Score) per User")
summary(colCounts(r))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8164 9732 18648 17090 23329 23499
par(mfrow = c(1,3))
hist(colMeans(r), main = "Avg. Rating per Joke")
hist(colMeans(r_n1), main = "Avg. Rating (Centered) per Joke")
hist(colMeans(r_n2), main = "Avg. Rating (Z-Score) per Joke")
par(mfrow = c(1, 1))
summary(colMeans(r))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.5738 -0.4375 0.9445 0.7163 1.8245 3.4593
summary(colMeans(r_n1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.25191 -1.19483 0.20238 -0.06176 0.97356 2.69231
summary(colMeans(r_n2))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.95941 -0.25569 0.05488 -0.01312 0.22596 0.60609
We can see that the most frequently rated jokes were rated by virtually all 23,500 users.
joke   ratings
j12    23499
j14    23499
j16    23499
j35    23499
j49    23499
j4     23498
j19    23498
j52    23498
j6     23497
j7     23497
On the other hand, the least frequently rated jokes were rated by under 9,000 users.
joke   ratings
j70     8164
j72     8231
j71     8288
j73     8392
j74     8393
j77     8494
j75     8513
j76     8551
j78     8586
j79     8643
Finally, we summarize the top 10 most highly rated jokes based on the raw ratings as well as the normalized
ratings (for both centering and Z-scores). Note that the top 10 lists are almost identical regardless of rating
method; only a few jokes move slightly in position.
Top 10 Most Highly Rated Jokes (Based on Raw, Centered, and Z-Score Ratings)
Joke 88: “A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to
go see an optometrist. The doctor started with some simple testing, and showed him a standard eye
chart with letters of diminishing size: CRKBNWXSKZY … ‘Can you read this?’ the doctor asked. ‘Read
it?’ the Czech answered. ‘Doc, I know him!’”
Joke 49: “Three engineering students were gathered together discussing the possible designers of the
human body. One said, ‘It was a mechanical engineer. Just look at all the joints.’ Another said, ‘No, it
was an electrical engineer. The nervous system has many thousands of electrical connections.’ The last
said, ‘Actually it was a civil engineer. Who else would run a toxic waste pipeline through a recreational
area?’”
Joke 31: “President Clinton looks up from his desk in the Oval Office to see one of his aides nervously
approach him. ‘What is it?’ exclaims the President. ‘It’s this Abortion Bill Mr. President, what do you
want to do about it?’ the aide replies. ‘Just go ahead and pay it.’ responds the President.”
Likewise, the lowest rated jokes are consistent regardless of rating method.
Joke 57: “Why are there so many Jones’s in the phone book? Because they all have phones.”
Joke 15: “Q: What did the blind person say when given some matzah? A: Who the hell wrote this?”
Joke 14: “The father was very anxious to marry off his only daughter so he wanted to impress her
date. ‘Do you like to screw,’ he says. ‘Huh’ replied the surprised first date. ‘My daughter she loves to
screw and she’s good at it, you and her should go screw,’ carefully explained the father. Now very
interested the boy replied, ‘Yes, sir.’ Minutes later the girl came down the stairs, kissed her father
goodbye and the couple left. After only a few minutes she reappeared, furious, dress torn, hair a mess
and screamed ‘Dammit, Daddy, it’s the TWIST, get it straight!’”
5 Model development
The distributions of rating count per user and per joke are similar to those seen above in the exploratory
data analysis, except that we now work with a smaller sample of 2,000 users rather than all 23,500.
summary(colCounts(rs))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 674 827 1597 1455 1988 2000
5.2.1 Top-N lists
We start by creating a sample recommender system using the “POPULAR” algorithm (see list of prediction
algorithms below). First we show top 5 and top 3 recommended jokes for various users. Note that the top-N
lists are returned already sorted in descending order.
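A minimal sketch of the POPULAR recommender and its top-N lists, using the 2,000-user sample rs; the choice of which users to show is illustrative:

```r
rec <- Recommender(rs, method = "POPULAR")

# top-5 recommendations for the first three users (returned best-first)
top5 <- predict(rec, rs[1:3, ], n = 5)
as(top5, "list")

# top-3 recommendations for the next three users
top3 <- predict(rec, rs[4:6, ], n = 3)
as(top3, "list")
```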
5.2.2 Predicted ratings
Second, we show predicted ratings for various users. In the "ratings" version, items that a user has already
rated are returned as "NA", while they are retained in the "ratingMatrix" version. Also note that the
ratings are normalized using centering on rows.
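The two prediction types can be requested as follows; this sketch reuses a POPULAR model on the sample rs:

```r
rec <- Recommender(rs, method = "POPULAR")

# "ratings": items the user has already rated come back as NA
pred_r <- predict(rec, rs[1:3, ], type = "ratings")
as(pred_r, "matrix")[, 1:5]

# "ratingMatrix": known ratings are filled in alongside the predictions
pred_m <- predict(rec, rs[1:3, ], type = "ratingMatrix")
as(pred_m, "matrix")[, 1:5]
```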
90/10 split of the data into training and holdout / test samples.
For the holdout sample, the model is given 20 ratings (“given-20 protocol”) and then it predicts
ratings; recall that for this dataset, each user has rated at least 36 jokes.
For predicted ratings, a rating of 5 or higher (on a scale from -10.0 to +10.0) will be classified as a
good rating.
Calculated error metrics include the root mean squared error (RMSE), the mean squared error (MSE), and the
mean absolute error (MAE).
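Under these settings, the error metrics can be computed with calcPredictionAccuracy(); a sketch using UBCF:

```r
# 90/10 split, given-20 protocol, ratings >= 5 count as "good"
es <- evaluationScheme(rs, method = "split", train = 0.9,
                       given = 20, goodRating = 5)

rec  <- Recommender(getData(es, "train"), method = "UBCF")
pred <- predict(rec, getData(es, "known"), type = "ratings")

# RMSE, MSE, and MAE against the held-out ("unknown") ratings
calcPredictionAccuracy(pred, getData(es, "unknown"))
```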
We illustrate with a top-N recommender system using the UBCF algorithm under the evaluation scheme
described above:
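The evaluation behind the ROC and precision-recall plots can be sketched as follows; the list of N values is an assumption:

```r
es <- evaluationScheme(rs, method = "split", train = 0.9,
                       given = 20, goodRating = 5)

# evaluate UBCF top-N lists for several list lengths
results <- evaluate(es, method = "UBCF", type = "topNList",
                    n = c(1, 3, 5, 10, 15, 20))
```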
plot(results, annotate = TRUE, main = "ROC Curve for UBCF Method")
plot(results, "prec/rec", annotate = TRUE, main = "Precision-Recall Plot for UBCF Method")
5. Develop recommender models
In order to develop a recommender system for the Jester dataset, we consider five algorithms, including:
UBCF: user-based collaborative filtering; depends on the parameter nn, the similarity measure, and the
normalization method (defaults are 25, cosine, and center, respectively)
SVD: singular value decomposition; depends on the parameter k and the normalization method
(defaults are 10 and center, respectively)
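The candidate algorithms can be compared in a single call to evaluate(); the exact algorithm list and parameter settings below are assumptions based on the defaults quoted above and the Pearson-similarity finding:

```r
# assumed candidate list; parameters follow the defaults quoted in the text,
# with Pearson similarity substituted for cosine in UBCF and IBCF
algorithms <- list(
  RANDOM  = list(name = "RANDOM"),
  POPULAR = list(name = "POPULAR"),
  UBCF    = list(name = "UBCF", param = list(nn = 25, method = "pearson")),
  IBCF    = list(name = "IBCF", param = list(method = "pearson")),
  SVD     = list(name = "SVD",  param = list(k = 10))
)

es <- evaluationScheme(rs, method = "split", train = 0.9,
                       given = 20, goodRating = 5)
results <- evaluate(es, algorithms, type = "topNList", n = c(1, 3, 5, 10))
```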
For both the IBCF and UBCF algorithms, Pearson similarity outperformed cosine similarity for this
dataset. It would be interesting to see whether this conclusion still holds under different evaluation schemes.
Out of the 5 algorithms considered, UBCF and POPULAR were the top-performing algorithms, as
measured by ROC, precision-recall, and error metrics. It is disconcerting that simply choosing the most
popular items works just as well as our collaborative filtering model. Perhaps the UBCF model can be
improved with larger nearest-neighbor (nn) values.
In the final analysis, the best models (UBCF and POPULAR) are still not particularly good. For
instance, the RMSE and MAE for both models are in the 4.0-4.5 range, meaning that model-predicted
ratings are accurate only to within ±4 to ±4.5 points on average, which is a spread of 8-9 points out of a
total rating scale of 20 points (from -10 to +10).