All content following this page was uploaded by Sang T.T. Nguyen on 02 November 2019.
(2)
(3)
(4)
where TP, TN, FP, P, and N refer to the numbers of true positive, true negative, false positive, positive, and negative samples, respectively [6].

In addition, this study considers the root mean-squared error (RMSE), given in (5), in some experimental evaluations.

$$\mathrm{RMSE} = \sqrt{\frac{(p_1 - a_1)^2 + (p_2 - a_2)^2 + \dots + (p_n - a_n)^2}{n}} \qquad (5)$$

where p1, p2, …, pn are the predicted values for the test instances, and a1, a2, …, an are the actual values.

The time taken to build the classification models is also taken into account to evaluate their efficiency in some experiments.

4. BUILDING MODEL-BASED BOOK RECOMMENDER SYSTEMS
4.1 Framework
The framework of the model-based book recommender systems is shown in Figure 1. It has three processing units: (1) the data pre-processing unit, (2) the classifiers (or classification models), and (3) the book prediction engine.

Figure 1. Framework of the proposed book recommender systems

4.1.1 The data pre-processing unit
In the pre-processing unit, input data is cleaned and converted into a format that can be read by the classifiers. Because the system uses the classifiers from the Weka library, the input data needs to be in the ARFF format [7], as shown in the following example of the weather data.
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes

Since the data source of the system is the Book-crossing dataset, it is necessary to convert the dataset into ARFF files. The Book-crossing dataset includes three tables in CSV files: BX-Users, BX-Books, and BX-Book-Ratings. The BX-Users table contains user profiles: User-ID, Location, and Age. The BX-Books table stores book information: ISBN, Book-Title, Book-Author, Year-Of-Publication, Publisher, Image-URL-S, Image-URL-M, and Image-URL-L. The BX-Book-Ratings table includes User-ID, ISBN, and Book rating. Book ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation). In practice, this information can be obtained from book shopping sites. Because titles, authors, and locations contain many special characters and delimiters, e.g., ',', ';', and '"', which are confused with the separator between columns or attributes (features), we need to remove such characters. The pre-processing unit does this job and converts the tables to the ARFF format. In addition, attribute values need to be formatted in the required types, e.g., numeric or string.

Moreover, because each book title carries meaning that might affect book classification, the book titles are vectorized to facilitate the classification. In other words, a title is converted to a numeric value for easy comparison. The Word2Vec model is used to model the titles, so each title is a collection of word vectors. There are two ways to transform a title into a single number: for each title, we can compute the length of the sum of the word vectors or the dot product of the word vectors.

To select features (or attributes) flexibly from the dataset for training, that is, to join records between the tables, the object classes User, Book, and Book-rating are developed and linked. The User-ID and ISBN attributes in the Book-rating class are linked to the User-ID attribute in the User class and the ISBN attribute in the Book class, respectively. The reason is that we need to combine user profiles, book information, and ratings for classification, since the features of users and books affect ratings. The system can then make predictions based on user profile, item (book), and user interest (rating).

4.1.2 Classifiers
In this study, two groups of classifiers are employed: those that do not accept text data and those that do. Naïve Bayes, C4.5, and Bayes
Network belong to the first group. NaiveBayesMultinomialText and SGDText belong to the second group. Any of these classifiers can be selected in the system for training on the input data.

After building the classifier model, we can predict a rating for a query, i.e., a new record of a user and a book, or we can evaluate the performance of the classifier on testing data.

4.1.3 Book prediction engine
The book prediction engine predicts the rating that a user would give a book, using one of the classifiers.

Furthermore, the system can be extended to recommend books to the target user. The system can find books, combine them with the user profile, and predict book ratings. If the predicted rating of a book is higher than 5, that book is recommended.

4.2 Feature Selection
As discussed, features influencing ratings need to be selected for training. Given the Book-crossing dataset, the eight selected features (attributes) are: ISBN, title, author, year, publisher, location, age, and rating (class). The first five features are book information, the next two are user information, and the last one is the book rating. User profile and book information are combined for making rating predictions.

With a rating class of 10 values, it is difficult to evaluate the precision of predicting user interest (like or unlike), and the accuracy will not be high. Therefore, the multivalue rating class is converted to a binary class in some experiments for comparison. In particular, class 0 represents ratings from 1 to 5, namely "unlike", and class 1 represents ratings from 6 to 10, namely "like".

Among the selected features, we pay attention to the location and age of users, because book recommendation can depend on demographic data, i.e., a group of users having a similar profile. Moreover, a book title is text, and its type can be treated as nominal or string in some classification experiments. When titles are vectorized, their type becomes numeric. The experiments below give more details of the classification processes.

4.3 Training and Testing
The Book-crossing dataset contains 278,858 users, 271,379 books, and 1,149,780 ratings, including some missing values. After pre-processing, the features mentioned above are selected for training and testing, so each data record holds the values of the selected features (attributes). One record is called an instance. The total number of extracted instances is 1,028,978.

In this study, six experimental cases are carried out; the book title attribute is not involved in the first four. They show how the performance evaluation results differ when attribute types are selected differently.
Case 1: Six nominal attributes and a multivalue class.
Case 2: Six nominal attributes and a binary class.
Case 3: Six nominal, string and numeric attributes, and a binary class.
Case 4: Seven nominal attributes and a binary class.
Case 5-6: Book title is considered in classification.

For each case, a number of extracted instances are used for training and testing. 10-fold cross-validation is used in all experiments. The accuracy, precision, and RMSE results are averaged over the 10 iterations of the cross-validation and reported as the final results.

5 EXPERIMENTAL RESULTS AND EVALUATIONS
In this study, the experiments were run on a PC with an Intel Core i7-4770 processor at 3.40 GHz and 8 GB of RAM.

Case 1: Six nominal attributes (ISBN, author, year, publisher, location, and age) and a rating class are used. The rating class has 10 values, from 1 to 10.

Table 1 describes the performance evaluation of Case 1. In this case, we just observe the classification accuracy of typical algorithms, e.g., NaiveBayesMultinomialText, Naïve Bayes, and C4.5, on the pre-processed dataset with the 10 original classes. As seen, when the number of validated instances is larger, the classification accuracy is lower. Modeling a decision tree takes too much memory, so we cannot run C4.5 with 90% of 1,000,000 instances for training. Moreover, the accuracy of C4.5 is lower than that of Naïve Bayes, as shown when the number of instances used is 28,978. The accuracy of Naïve Bayes is higher than that of NaiveBayesMultinomialText when the number of training instances is small, but the opposite holds when the number of training instances is greater. The RMSE of Naïve Bayes is about 0.24, slightly higher than that of NaiveBayesMultinomialText.

Table 1. Case 1 – Evaluation
                      NaiveBayesMultinomialText  Naïve Bayes  C4.5
# of instances: 1,000,000
Accuracy              62.748%                    59.997%      -
RMSE                  0.230                      0.236
# of instances: 28,978
Accuracy              63.6552%                   64.352%      63.6552%
RMSE                  0.228                      0.241        0.104

Case 2: Six nominal attributes (ISBN, author, year, publisher, location, and age) and a rating class are used. The rating class has two values, 0 and 1.

Table 2 describes the performance evaluation of Case 2. In this case, the NaiveBayesMultinomialText, SGDText, Naïve Bayes, Bayes Networks, and C4.5 classifier models are used. The accuracy and RMSE of classification, and the precision of class 1 prediction, are measured to compare the performance of the models. As in Case 1, C4.5 takes too much time and memory for training on 900,000 instances, so it fails to produce validation results. The Naïve Bayes model achieves the highest accuracy and precision, and the lowest RMSE. The time taken to build the Naïve Bayes model is short (0.09s when 28,978 instances are used), only slightly longer than for NaiveBayesMultinomialText. In addition, Bayes Networks can provide high prediction accuracy but take more time to build the model than Naïve Bayes.

Table 2. Case 2 – Evaluation
                      NaiveBayesMultinomialText  SGDText  Naïve Bayes  Bayes Networks  C4.5
# of instances: 1,000,000
Accuracy              68.750%                    68.750%  69%          68.210%         -
Precision             0                          0        0.510        0.490           -
RMSE                  0.460                      0.560    0.460        0.460           -
# of instances: 28,978
Accuracy              69.680%                    69.677%  71.990%      71.710%         69.680%
Precision             0                          0        0.590        0.560           0
RMSE                  0.460                      0.551    0.450        0.450           0.460
Model building time   0.060s                     2.040s   0.090s       0.490s          34.150s

Compared with Case 1, the classifier models perform better in Case 2. That means classifying the dataset with the binary class is more effective.

Case 3: Six attributes (nominal ISBN, nominal author, numeric year, nominal publisher, string location, and numeric age) and a rating class are used. The rating class has two values, 0 and 1. In this case, we test the attribute values with their natural types. Year and age values are numeric, so they should be computed as numbers rather than as nominal values. Comparing the strings of ISBN, author, and publisher is not necessary, so we keep their types nominal.

Table 3 describes the performance evaluation of Case 3. In this case, the NaiveBayesMultinomialText and SGDText classifier models are used to handle string attributes. The number of instances used is 28,978. We only test a small dataset because it takes too much time to build the SGDText model. Although the classification accuracy of SGDText is higher than that of NaiveBayesMultinomialText, its memory usage and model building time are excessive. Therefore, this case is just for reference. Moreover, the accuracy of NaiveBayesMultinomialText is not higher than in Case 2. It can be said that string comparison is not effective in the classification process.

Table 3. Case 3 – Evaluation
                      NaiveBayesMultinomialText  SGDText
# of instances: 28,978
Accuracy              69.360%                    72.430%
Model building time   0.190s                     389.250s

Case 4: Seven nominal attributes (ISBN, author, year, publisher, User-ID, location, and age) and a rating class are used. The rating class has two values, 0 and 1.

Table 4 describes the performance evaluation of Case 4. In this case, the NaiveBayesMultinomialText, Bayes Networks, Naïve Bayes, and C4.5 classifier models are used. The Naïve Bayes classifier model still achieves the highest performance in terms of accuracy and precision (of class 1 prediction), and the lowest RMSE. Notably, the accuracy of the classifier models is higher than in Cases 1-3. That means the prediction accuracy is higher when the User-ID attribute is added, because each specific user is considered. In other words, considering personalization can lead to better predictions.

Table 4. Case 4 – Evaluation
                      NaiveBayesMultinomialText  Bayes Networks  Naïve Bayes  C4.5
# of instances: 1,000,000
Accuracy              68.754%                    71.695%         72.223%      -
Precision             0                          0.546           0.559        -
RMSE                  0.463                      0.448           0.441        -
# of instances: 28,978
Accuracy              69.677%                    72.835%         72.990%      69.677%
Precision             0                          0.590           0.622        0
RMSE                  0.460                      0.452           0.452        0.460
Model building time   0.040s                     0.640s          0.090s       32.870s

Compared with some previous studies [10, 11] using the same dataset, the proposed book recommendation system using Naïve Bayes is promising, with lower RMSEs: about 0.24 for predicting book ratings in the range [1-10], and 0.5 for predicting book ratings in the range [0-1]. We do not compare accuracy or precision because the way they make predictions is different from ours.

Case 5: Book title is considered to evaluate its impact on rating prediction.

Based on the above experiments, Naïve Bayes is the most efficient classifier, so it is used in this experiment while the attributes are changed. Based on Cases 2 and 3, setting the attribute types to nominal gives higher performance, so the attribute types in this case are converted to nominal. In this case, 1,028,978 records are put into the Naïve Bayes classifier. There are four sub-cases of selecting attributes. The class value, i.e., rating, is binary.
Sub-case 1: Eight nominal attributes (ISBN, titleDotPro, author, year, publisher, User-ID, location, and age) are selected. A titleDotPro value is the dot product of a book title.
Sub-case 2: similar to Sub-case 1, but ISBN is not included.
Sub-case 3: same as Sub-case 2, but titleDotPro is changed to titleSumVecLen, which is the sum vector length of the book title.
Sub-case 4: same as Sub-case 2, but titles are kept as strings and the title type is considered nominal.

Table 5. Case 5 – Performance over four sub-cases
           Sub-case 1  Sub-case 2  Sub-case 3  Sub-case 4
Accuracy   72.117%     72.946%     72.891%     72.191%
Precision  0.556       0.573       0.572       0.558
RMSE       0.442       0.430       0.430       0.441

Table 5 shows the performance of Naïve Bayes over the four sub-cases. The performance is higher when ISBN is removed from the dataset. Representing titles as the dot products of word vectors is slightly more effective than the sum vector lengths, as can be seen in Sub-cases 2 and 3. The comparison with Sub-case 4 shows that title vectorization gives higher performance than comparing titles as strings.

Case 6: Book title is considered to evaluate its impact on rating prediction, and numeric attributes are carefully considered.

The following sub-cases are taken into account when running the Naïve Bayes model on the dataset of 1,028,978 instances.
Sub-case 1: Eight attributes (nominal ISBN, numeric titleDotPro, nominal author, numeric year, nominal publisher, nominal User-ID, nominal location, and numeric age) are selected.
Sub-case 2: similar to Sub-case 1, but ISBN is not included.
Sub-case 3: same as Sub-case 1, but titleDotPro is changed to titleSumVecLen, which is the sum vector length of the book title.
Sub-case 4: same as Sub-case 3, but ISBN is not included.

Table 6. Case 6 – Performance over four sub-cases
           Sub-case 1  Sub-case 2  Sub-case 3  Sub-case 4
Accuracy   71.810%     72.597%     72.029%     72.828%
Precision  0.550       0.565       0.556       0.574
RMSE       0.446       0.434       0.444       0.433

The experimental results in Table 6 show that the numeric attribute type does not improve the Naïve Bayes models.

6 CONCLUSIONS
In conclusion, this study has presented solutions for selecting suitable features for rating prediction, selecting attribute types, and selecting appropriate classifier models in book recommender systems. The selected features, which are author, year, publisher, User-ID, location, and age, determine book rating predictions. As shown in the experiments, Sub-case 2 in Case 5 is the best solution: Naïve Bayes is applied to the seven nominal attributes, excluding ISBN, with book titles converted to dot products. Naïve Bayes is the best solution in most of the experimental cases. It achieves the highest performance and is efficient, saving memory and time when building the classifier model.

Attribute types should be nominal when applying model-based recommendation methods. Numeric and string types are not efficient for the classifier models. Applying the word embedding method to represent book titles can improve title representation and lead to better predictions. In the future, more advanced classifiers, e.g., neural networks, will be evaluated, but more time and memory might be needed for training the models. Because of memory limitations, those models are not employed in this study.

7 ACKNOWLEDGMENT
This research is funded by International University VNUHCM under grant number 06-IT-2017.

8 REFERENCES
[1] C. C. Aggarwal. 2016. Recommender Systems. Springer International Publishing.
[2] S. Yang, M. Korayem, K. AlJadda, T. Grainger, and S. Natarajan. 2017. Combining content-based and collaborative filtering for job recommendation system: A cost-sensitive Statistical Relational Learning approach. Knowledge-Based Systems.
[3] P. Covington, J. Adams, and E. Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, Massachusetts, USA, 191-198.
[4] P. Z. Han Zhu, Guozheng Li, Jie He, Han Li, Kun Gai. 2018. Learning Tree-based Deep Model for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, New York, NY, 1079-1088. DOI= https://doi.org/10.1145/3219819.3219826.
[5] M. Chandak, S. Girase, and D. Mukhopadhyay. 2015. Introducing Hybrid Technique for Optimization of Book Recommender System. Procedia Computer Science, vol. 45, 23-31.
[6] Jiawei Han and Micheline Kamber. 2011. Data Mining: Concepts and Techniques, 3rd Edition. Morgan Kaufmann.
[7] Ian H. Witten, Eibe Frank, and Mark A. Hall. 2011. Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Morgan Kaufmann.
[8] Mikolov, T., et al. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems – Volume 2. Lake Tahoe, Nevada, 3111-3119.
[9] David Guthrie, et al. 2006. A Closer Look at Skip-Gram Modelling. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-2006).
[10] M. Chandak, S. Girase, and D. Mukhopadhyay. 2015. Introducing Hybrid Technique for Optimization of Book Recommender System. Procedia Computer Science, vol. 45, 23-31.
[11] A. Tashkandi, L. Wiese, and M. Baum. 2017. Comparative Evaluation for Recommender Systems for Book Recommendations. In Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn.
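
As a supplementary illustration of the binary rating class from Section 4.2 and the evaluation measures reported in Section 5, the following minimal Python sketch may help. It is not the Weka-based implementation used in the study. It assumes the usual definitions of accuracy (share of correct predictions) and precision of class 1 (TP / (TP + FP)), and it applies the RMSE formula (5) directly to predicted and actual values. The function names and the toy ratings are hypothetical, and returning 0.0 for precision when class 1 is never predicted is a convention chosen here, not something taken from the paper.

```python
from math import sqrt


def to_binary(rating):
    """Map a 1-10 rating to the binary class: 0 ("unlike") for 1-5, 1 ("like") for 6-10."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    return 1 if rating >= 6 else 0


def accuracy(actual, predicted):
    """Share of instances whose predicted class equals the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def precision(actual, predicted, positive=1):
    """TP / (TP + FP) for the given class; 0.0 if that class is never predicted (assumed convention)."""
    tp = sum(1 for a, p in zip(actual, predicted) if p == positive and a == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if p == positive and a != positive)
    return tp / (tp + fp) if tp + fp else 0.0


def rmse(actual, predicted):
    """Root mean-squared error between predicted values p_i and actual values a_i, as in (5)."""
    n = len(actual)
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


# Hypothetical toy data: actual and predicted ratings on the 1-10 scale.
actual_ratings = [8, 3, 10, 5, 7, 2]
predicted_ratings = [7, 6, 9, 4, 5, 1]

actual = [to_binary(r) for r in actual_ratings]
predicted = [to_binary(r) for r in predicted_ratings]

# A book is recommended when its predicted rating is higher than 5 (Section 4.1.3).
recommended = [r > 5 for r in predicted_ratings]

print("accuracy :", accuracy(actual, predicted))
print("precision:", precision(actual, predicted))
print("RMSE     :", rmse(actual, predicted))
```

On the binary class, the condition "predicted rating higher than 5" coincides with predicting class 1, which is why the precision of class 1 is the measure of interest for recommendation.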