
Designing a Book Recommender Engine: A New Perspective with Deep Learning Techniques
Radha Guha
CSE Department
SRM University AP, Amaravati

Abstract: Over the last decade, deep learning technology has shown remarkable performance improvements in the fields of computer vision and natural language processing (NLP). NLP's recent leap forward has enabled computers to interpret ambiguous human language reasonably well. In this paper, the benefits of deep learning techniques in book recommender system design are explored and validated. As every book is huge in content, the content-based filtering used in recommendation system design can benefit from NLP's breakthrough word embedding technique, which captures word context, semantics, and word dependency better and helps in dimensionality reduction as well. The subsequent advance in language modeling, the attention-based transformer architecture, deciphers word and sentence meaning better by considering a larger context. Content-based filtering computes nearest-neighbor recommendations, and this technique benefits because the cosine similarity of one book to another can now be computed more efficiently. A second method used in recommender design is collaborative filtering, which analyzes users' past item preferences and computes user-to-user and item-to-item similarity. Deep learning techniques capture non-trivial, non-linear user-item interactions better than traditional matrix factorization algorithms. Deep learning trains its models on huge amounts of data in a parallel processing architecture; multi-core CPUs, GPUs, and TPUs support this architecture in handling big data and capturing complex user-item interaction hierarchies. The contribution of this paper is to explain recommendation system design aspects and deep learning technology, and to compare deep learning with traditional machine learning techniques by working through a book recommendation system design.

Keywords: E-commerce, Recommendation System, Bookcrossing dataset, Project Gutenberg's e-books, Content-based filtering, Cosine distance, Collaborative filtering, Matrix Factorization, Deep learning, Autoencoder, RMSE, MAE, Hit rate

I. Introduction: Recommender System


Nowadays, for reading a book, watching a movie, listening to a song, or reading news on the internet, we have an overwhelming number of choices, and finding the best one manually would take forever without help. An artificially intelligent automatic recommender system is a data mining and data analytics tool that provides this help, and such systems are being implemented by most big online businesses these days. Online shopping from e-commerce websites and online content consumption from e-entertainment and social media websites are a growing trend, especially among Generation Y. In the competitive online marketplace, companies must use the latest AI technology to entice new customers and retain old ones by providing relevant content that they will like. A recommendation system is software that helps users easily discover items they may prefer. In this regard, a recommender system differs from a plain information retrieval system [1], [2], [3], as a recommender system [4], [5], [6] further filters information based on a user's past preference for items. This requires the recommender system to filter out an enormous number of irrelevant choices for each particular user.
Some examples of everyday uses of recommendation systems (RecSys) follow. A prime example is the movie recommendation system of Netflix. Amazon, the biggest online bookseller, also uses a recommender system. YouTube recommends videos to its huge client base. Spotify and Pandora recommend songs. Google News recommends personalized news to each of its users. The social networking site Facebook recommends friends, the online dating site OkCupid provides expert matchmaking, and the list goes on. A few more well-known e-commerce and e-resource companies that use recommendation systems are AliExpress, Wikipedia, Udemy, GoodReads, TripAdvisor, Expedia, Yahoo, LinkedIn, LastFM, and IMDb. These online sellers advertise to users that those who bought this item also bought these other items, and they also ask their customers to explicitly rate the items they have purchased. Nowadays, without a recommender system, customers would be utterly baffled by the sheer volume of choices they have for everything: music, movies, videos, books, articles, or news. A recommender system helps a user navigate easily to relevant information as per the user's taste and traits. A RecSys helps a company promote its items, increase user-item interaction time, and become more profitable. Even though big online companies already use recommender systems, smaller companies still need to design their own.
As recommendation systems play such an indispensable role in everyday decision making, they are a hot research topic in the age of the big data explosion. Globalization of marketplaces and digitization of everything are creating a humongous amount of online data, which must be processed efficiently for end users' consumption [7], [8], [9], [10]. Commercial applications in various domains like entertainment, personalized content webpages, eLearning, and travel and financial advice services have several business objectives: to sell more items, to sell a diverse set of items, to recommend a sequence of items, to recommend a bundle of items (e.g., selling a PC and anti-virus software together), to increase user satisfaction by reducing shopping time and satisfying users' tastes, to gather dynamic knowledge about users' likes and dislikes, and ultimately to garner more revenue.
E-commerce companies strive to improve their recommendation engines to raise customer satisfaction even by a small percentage and, in turn, to increase their own profits manifold. In 2006, Netflix announced an open competition with a one-million-dollar prize for over 10% improvement of its then-current recommendation system in predicting users' movie ratings [11]. In 2009 the prize was awarded to a team that used a collaborative filtering method applying the matrix factorization technique also used for topic modeling. Since then, recommendation system design research has become very popular. Analysts estimate that 75% of the movies watched on Netflix and 35% of the products sold by Amazon come from recommendations. Today online giants have made successful strides in recommender system design research, but academia is still falling behind in exploring this area.

Table 1: Book Details

Table 2: User Profile and Age Distribution

Table 3: Book Rating Table and Rating Distribution

Usually, companies keep records of their items, customers, and sales. But just recommending the best-selling items from the sales database ignores a customer's personal taste and will not work most of the time. So, on the contrary, a RecSys tries to personalize product offerings through online data mining: analyzing customers' product review comments, explicit ratings of an item, or implicit likeness for an item by individual customers. Companies capture a user's taste for an item explicitly from customer feedback or implicitly from the user's browsing history, click history, how long he views an item, etc. These ratings can be on a scale of 1 to 5, or just 1 (like) or 0 (dislike) as collected on YouTube by the thumbs-up and thumbs-down buttons. Explicit ratings are rare and hard to collect, as lazy users will not cast their votes. Recommender system design therefore depends more on a user's implicit likeness for an item, which is gathered automatically by the businesses.
Implicit ratings are abundant, can be captured in real time, and are more appropriate for recommender system design. The only drawback of implicit ratings is that no negative feedback can be recorded. Many websites use cookies: a cookie is a small block of data placed on a user's computer to remember the user and to capture details like location, age, time of day, how many times the user has visited the page, what products he has browsed, etc. For example, if a user puts items in a shopping cart but leaves the session without buying them, and can still find the items in the cart on his next visit to the website, that is a great help to the shopper. Knowing a user beforehand, more personalized content can be delivered conveniently to each user. Users' purchase, browsing, and click histories are assigned numerical values and treated as implicit ratings. However, cookies can breach a user's privacy and security, as they gather users' information without their consent.
From the historical data, the RecSys engine estimates or predicts how much a user will like an unseen item. Amazon recommends books by content similarity and based on the user's past preference history on books. Usually, three tables of information are created for any RecSys: (i) an item profile table with book ISBN, book title, book author, book description, etc. (Table 1 for the book recommendation system); (ii) a user profile table containing user id, age, and location information (Table 2); and (iii) an item rating table with user id, item id, and each user's rating for the item (Table 3). From this information, the RecSys can compute many other derived quantities such as the number of ratings per item, average rating per item, user age distribution, location distribution, rating distribution, etc. Finally, the RecSys engine estimates the likeness of each user(i) and item(j) pair, i.e., $\hat{L}(U_i, I_j)$, for every user and every unseen item in the system. This measure is just an estimate and should be as close as possible to the actual likeness $L(U_i, I_j)$. The likeness function can be Boolean, i.e., the user likes an item or does not, or it can be measured on a scale of 1 to 5, where 5 is maximum likeness and 1 is minimum likeness. After processing the likeness function, the RecSys recommends the top-N items to a user.
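As a concrete starting point, the three tables and some derived statistics can be assembled with Pandas. This is a minimal sketch assuming the standard Bookcrossing CSV release (semicolon-separated, Latin-1 encoded files named BX-Books.csv, BX-Users.csv, and BX-Book-Ratings.csv); adjust the file names and column labels to your local copy.

```python
import pandas as pd

# Load the three RecSys tables: item profile, user profile, and ratings.
books = pd.read_csv("BX-Books.csv", sep=";", encoding="latin-1", on_bad_lines="skip")
users = pd.read_csv("BX-Users.csv", sep=";", encoding="latin-1", on_bad_lines="skip")
ratings = pd.read_csv("BX-Book-Ratings.csv", sep=";", encoding="latin-1", on_bad_lines="skip")

# Derived information: number of ratings and average rating per book.
stats = ratings.groupby("ISBN")["Book-Rating"].agg(num_ratings="count", avg_rating="mean")

# Rating distribution: most entries are 0, i.e., implicit/unrated interactions.
print(ratings["Book-Rating"].value_counts().sort_index())
print(users["Age"].describe())   # age distribution of readers
```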

Figure 1: Illustration of Content Based Filtering (CBF) vs. Collaborative Filtering (CF)

A RecSys computes likeness measures for items with two basic approaches: Content-Based Filtering (CBF) or Collaborative Filtering (CF) (Figure 1) [4], [5], [6], [12], [13]. In both approaches, item-to-item or user-to-user similarity measures are the key. In CBF, if user-A has read a book, then similar books are recommended to him. In general, item features like the color and size of a garment, or movie genre, actors, and theme, or book content are used to determine item-to-item similarity. This is called content-based recommendation. Collaborative filtering is of two types: item collaborative filtering and user collaborative filtering. Apart from content, item-to-item similarity can be determined if both items are rated similarly by several users. User-to-user similarity can be found from users' histories of purchasing common items; this approach is used for user collaborative filtering. If user-A and user-B have purchased several common items, then they are similar, and any item purchased by user-A but not by user-B can be recommended to user-B, and vice versa.
Both CBF and CF suffer from some problems. CBF may recommend homogeneous, boring items instead of surprising users with new ones. CF suffers from the cold start problem, when a user is new to the system and his behavior or taste is little known, or when an item is new in the database and has not been purchased by many. A hybrid method tries to combine CBF and CF in the best possible way for more complex but better recommendations. When users' context information like age, location, income, etc. is used to recommend items, it is called context-based recommendation. Context information can be combined with CBF and CF recommenders as a remedy to the cold start problem and for better recommender performance.
In this paper, the design challenges and opportunities of a book recommendation system are explored. Reading a book usually takes much longer than watching a movie; if a good book is recommended to a user, he will cherish reading it and remember the story much longer. Comparing the MovieLens dataset [14] with the Bookcrossing dataset [15], we find that the MovieLens dataset has less text data to process. MovieLens data keeps the date of release, genre, actors' names, a short summary of the movie theme, average movie rating, movie reviews along with their sentiment polarity, etc. The Bookcrossing data keeps book title, author's name, year of publication, ISBN, publisher, genre, book reviews, etc., but unlike the movie dataset, no summary of the book's storyline. Thus, for content-based similarity determination, the whole book content needs to be processed. First, one must scrape book data from online sites like Project Gutenberg, where a lot of e-books are available for free reading. Because of the huge amount of text, processing efficiency and dimensionality reduction of unstructured text data are major requirements for a book recommendation system, and these can be achieved more efficiently with deep learning techniques.
As pointed out before, the recent breakthrough of NLP's word embedding technique works magic in capturing word context, semantics, word-to-word similarity, and dependency, and in reducing dimensionality in the word representation itself. Two word embedding algorithms are Word2Vec [16] and GloVe [17], introduced in 2013 at Google and in 2014 at Stanford University, respectively. With the smallest token of the text, i.e., a word or term, represented more efficiently, all other NLP tasks like document-to-document similarity computation, document summarization, automatic text generation, and language translation have been performing better in recent times. Building on this pre-calculated base word embedding, the subsequent invention of the attention-based Transformer architecture in 2017 [18], also by Google, captures word context and dependency over a much longer window, up to 3072 terms. Self-attention, multi-head attention, and positional embeddings now generate contextualized embeddings from the transformer architecture [19], [20], [21]. As is aptly said, a word is known by the company it keeps.
A deep learning (DL) neural network uses multiple hidden layers to extract hierarchical information from the input data. This parallel processing architecture can process huge numbers of word vectors efficiently for any non-linear mapping of input to output. DL also eliminates the manual feature extraction subtask of traditional machine learning, can combine heterogeneous content like text, image, and audio in the same model more efficiently, and works better when a lot of training and testing data is available. Any neural computing, especially deep learning, needs huge numbers of matrix multiplications and additions. Python's PyTorch library maps these computations onto the powerful GPUs and AI accelerators known as tensor processing units (TPUs) available today for efficient parallel processing.
A book RecSys is thus especially suitable for deep learning. The limited book metadata available in the book detail table, such as book title, author, year of publication, and publisher, is not enough to capture the storyline of a book. Book topic modeling can be done more accurately with deep learning techniques, as explained in Section II. Big companies have had remarkable success in improving their recommender system designs using deep learning [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32]. The difference between big companies and academic research on deep learning is that big companies can afford the computing power of high-end GPUs and TPUs, whereas massive parallel processing on GPUs and TPUs is not always possible in academia. As a result, deep learning techniques have not been much explored in academia for the recent business need for good recommendation system design. Another hindrance to deep learning research is that deep learning acts as a black box, and explaining its output is a challenging task. The motivation for this paper comes from the pervasiveness of deep learning techniques in many computational domains, like face recognition, speech recognition, and machine translation, including recommender system design. The purpose of this paper is to take a new perspective on book recommender design with deep learning techniques, which is possible even on a single laptop computer with a limited amount of data. Our book recommender system can be used for novels, textbooks, research articles, and news media exploration.
The remainder of this paper is organized as follows. Section II introduces the Bookcrossing dataset, points out that data preprocessing and data wrangling are important prerequisites for data analytics, illustrates the content-based filtering and collaborative filtering methods used in recommender system design, and showcases the problems of implementing them with traditional machine learning algorithms like nearest neighbor calculation and matrix factorization. Sections III and IV survey how deep learning reduces the shortcomings of the traditional methods, present several experimental results, and introduce the performance metrics that can be used for recommender system evaluation. Section V concludes with the need for more open research in academia.

II. Recommender Design: Traditional Machine Learning Challenges


In this section, aspects and challenges of recommender system design with the traditional machine learning approach are illustrated. Before processing data for a recommender system, raw data exploration and visualization give important insight into the data. The Bookcrossing (BX) dataset, comprising books, users, and ratings tables, was compiled in 2004 for building recommendation systems. Table 1 shows book details: ISBN, book title, average rating of the book, book author, publisher, year of publication, etc. There are 271,360 books in the database. Table 2 shows the user profile and age distribution; there are 278,858 users, and the age distribution shows more young readers than older readers. Table 3 shows the book ratings and rating distribution; there are 1,149,780 book ratings, on a scale of 1 to 10. The rating distribution shows that most of the books are unrated (rating zero) by the users in the database. Table 4 computes a user-book pivot table with each user's rating for each book. This table is used for measuring user-to-user or item-to-item similarity according to rating traits, as done in collaborative filtering. Most book titles are unrated by users and have NaN values. So, data preprocessing and data wrangling steps such as missing value correction or replacing NaN values with zero are important so that mathematical operations can be applied. Unnecessary columns can also be dropped in the preprocessing step, and numerical values can be normalized for better performance. Normalization of data is critical in neural network processing.
Table 4: User-Book Rating Pivot Table

Traditional Collaborative Filtering


For collaborative filtering, the user-item rating table is used to compute cosine similarity or Pearson correlation similarity for finding either user-to-user or item-to-item similarity. But the user-item pivot matrix (Table 4) contains many zeroes; it is a very sparse matrix, posing a cold start problem for traditional collaborative filtering. In fact, the sparsity level of the BX dataset is 99.99%, and cosine distance measured on such a sparse matrix does not give accurate results. One remedy for the sparsity of the user-item rating matrix is dimensionality reduction by matrix factorization, trained with the stochastic gradient descent (SGD) backpropagation error correction algorithm. Algorithms like singular value decomposition (SVD) and non-negative matrix factorization (NMF) factor the original sparse matrix into lower-dimensional user-topic and topic-item matrices. Table 5 shows matrix factorization using the NMF algorithm; a minimal code sketch follows below. After this factorization, user-to-user or item-to-item similarity can be computed over a lower-dimensional set of latent factors or topics. In effect, users are rating a reduced number of movie or book topics, like comedy, thriller, romance, and drama, rather than numerous movie or book titles. The original matrix is recreated by taking the inner product of the user-topic and topic-book matrices, as shown on the right-hand side of Table 5. The original ratings are almost the same in the reconstructed matrix, but the zero values are replaced with real numbers, which are the estimated ratings of unseen book titles. From these estimated ratings, the top 5 or 10 books can be recommended to a user.
Table 5: NMF Matrix Factorization for Collaborative Filtering
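The following is a minimal sketch of this idea using scikit-learn's NMF on a toy rating matrix; the matrix values and the choice of two latent topics are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy user-book rating matrix (rows: users, columns: books); 0 means unrated.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

# Factor R into user-topic (W) and topic-book (H) matrices with 2 latent topics.
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(R)   # shape: (4 users, 2 topics)
H = model.components_        # shape: (2 topics, 4 books)

# Reconstruction: zeros are replaced by estimated ratings for unseen books.
R_hat = W @ H
print(np.round(R_hat, 2))
```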

Dimensionality reduction of the original user-item interaction matrix is called model-based collaborative filtering. This matrix factorization technique does a decent job in collaborative filtering recommendation, but it is a linear model and cannot capture more complex, non-linear user-item interaction relationships.

The reconstructed matrix approximates the original rating matrix, and its prediction accuracy can be evaluated with the root mean square error (RMSE) or mean absolute error (MAE) metrics, among many others [33]. Considering the true rating $r_{ij}$ and the model's predicted rating $y_{ij}$, RMSE is given by the formula $\mathrm{RMSE}=\sqrt{\tfrac{1}{n}\sum_{j=1}^{n}(r_{ij}-y_{ij})^{2}}$, and MAE is given by the formula $\mathrm{MAE}=\tfrac{1}{n}\sum_{j=1}^{n}\lvert r_{ij}-y_{ij}\rvert$. For a good prediction, RMSE and MAE should be as small as possible. The Bookcrossing rating matrix is factorized with SVD decomposition, and the RMSE and MAE errors are reported in Figure 2. For model evaluation, the data is split into training and test sets in an 80:20 ratio. To ensure the reliability of model evaluation, five-fold cross validation is done: the dataset is split into five fractions, and each time one fraction is used as the test set while the other four are used for training. The maximum RMSE is 0.9391 and MAE is 0.7375, for Fold 5; Fold 5 also had the smallest model fit time and test time of all folds. RMSE is always larger than MAE, as it penalizes larger errors more. To judge overall prediction accuracy, MAE is the preferred error metric, while RMSE is the better metric for detecting outliers or bad predictions. Python's Surprise is an open-source library for building collaborative filtering recommender systems, and Python Pandas is used for loading the data.
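A minimal sketch of this evaluation using the Surprise library follows; the file name and rating scale assume the Bookcrossing release described above, and the number of latent factors is an illustrative choice.

```python
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate

# Load (user, item, rating) triples; BX ratings are on a 0-10 scale.
ratings = pd.read_csv("BX-Book-Ratings.csv", sep=";", encoding="latin-1")
reader = Reader(rating_scale=(0, 10))
data = Dataset.load_from_df(ratings[["User-ID", "ISBN", "Book-Rating"]], reader)

# SVD matrix factorization evaluated with five-fold cross validation,
# reporting RMSE and MAE per fold along with fit and test times.
algo = SVD(n_factors=50)
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
```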

Figure 2: SVD Matrix Factorization: RMSE and MAE Errors

An autoencoder, a deep learning architecture that generalizes matrix factorization and does a better job of predicting user-item interaction relationships, is explored and validated in Section IV.

Traditional Content Based Recommender


The book details table (Table 1) contains only a little metadata about a book: title, author, publisher, year of publication, genre, etc. If a book summary is available, the book's content is known better, and its similarity with other books' content can be measured more accurately. Various user context, like location, language, age, time of day, etc., can also be combined with the book content for better recommendations. But book details and user profiles are all unstructured text data in natural languages like English, German, or Hindi, and they must first be converted to numeric values for further processing. With this numerical representation, a book can be compared with any other book: if a user has read one book, the cosine similarity of this book with all other books is computed and its k-nearest neighbors are found.
Natural languages are ambiguous in meaning, and computers struggle to capture the true meaning of a sentence or text. Sometimes, to understand the meaning of an isolated word or sentence, we need many surrounding words or sentences, referred to as context. In the beginning, computers were built to follow instructions written in artificial programming languages like C, C++, Java, and Python, and they do an excellent job of following such languages for number processing with a very high degree of precision. But the need for automatic natural language processing (NLP) emerged only after the invention of the internet in 1990 and, subsequently, its widespread use, the resulting big data explosion, and the emergence of novel business applications of data mining.
From now on, we refer to a book as a document and the collection of all documents as a corpus. Project Gutenberg is a digital collection of the world's great literature, available for free reading. Those novels can be scraped from the website to create a large corpus, which, after processing, is used for content-based filtering recommendation. Content-based filtering (CBF) involves computing document-to-document similarity over all documents of the corpus. At this point, CBF is exactly the same as information retrieval techniques [2], [3].
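As a small example, a novel can be fetched directly from Project Gutenberg in plain text. This sketch assumes the ebook ID and URL pattern below are current (Moby Dick is listed as ebook #2701); verify against the site's catalog and usage policy before scraping at scale.

```python
import requests

# Download the plain-text edition of "Moby Dick; or, The Whale".
URL = "https://www.gutenberg.org/files/2701/2701-0.txt"
text = requests.get(URL, timeout=30).text

# Strip the Project Gutenberg header and footer, keeping only the novel body.
start = text.find("*** START OF")
end = text.find("*** END OF")
body = text[text.find("\n", start) + 1 : end]
print(body[:300])
```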

Figure 3: Word Cloud Created from Term Frequencies, Novel: Moby Dick or the Whale

First, let us briefly look at how unstructured text data is converted to numerical values; then we can look at the other challenges of natural languages. The corpus is first transformed into a numerical representation by counting the word or term frequency (TF) in each document and then computing the inverse document frequency (IDF) of each word. While a high TF increases the importance of a word within a document, a word's use across multiple documents reduces its discriminative power. Thus, the product of TF and IDF represents a word's adjusted importance relative to the other words in the corpus. The traditional TF-IDF matrix, also known as the term-document matrix (TDM), has as many rows as there are unique words in the corpus and as many columns as there are documents. In the TDM, each row is a word vector and each column is a document vector.
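A minimal sketch of building this TF-IDF representation with scikit-learn follows; the three one-line documents stand in for whole books.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus; in practice each "document" is an entire book.
corpus = [
    "the whale hunt on the open sea",
    "a romance in the old city",
    "the captain and the white whale",
]

# Build the TF-IDF matrix. Note scikit-learn returns documents as rows,
# i.e., the transpose of the TDM orientation described above.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)          # shape: (n_documents, n_terms)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```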
By plotting the word frequency or TF-IDF score of each word in a document, we can create a word cloud of the document, as shown in Figure 3, and visualize its main topic. Figure 3 shows the word cloud of the 1851 classic novel "Moby Dick or the Whale" by Herman Melville, scraped from Project Gutenberg's website.
Even though the corpus has many documents and a large vocabulary, not all words appear in every document, so the original TDM is a very sparse matrix and poses a problem for cosine similarity measurement. The TDM is also very high dimensional, so dimensionality reduction is the next step for efficient computation. Dimensionality reduction is possible because all documents, and all words in the corpus, basically belong to a very limited number of latent topics, like drama, action, romance, and fiction, and each document contains those latent topics in varying proportions. Finding the topic distribution of a document is called topic modeling, and it is the key to finding document-to-document similarity more accurately. Better results are obtained if the high-dimensional sparse matrix is decomposed into lower-dimensional word-topic and topic-document matrices. But how good is our topic modeling today?
An efficient topic modeling algorithm in traditional unsupervised machine learning, Latent Dirichlet Allocation (LDA), was introduced by David Blei et al. in 2003 [2]. LDA factorizes the original sparse TDM into two lower-dimensional word-topic and document-topic matrices. LDA has two hyperparameters, alpha and beta, which control the topic distribution in a document and the tightness of a topic, respectively. LDA topic modeling helps in dimensionality reduction and increases the accuracy of document-to-document similarity measures, but its hyperparameter tuning takes quite an effort to produce meaningful topics. Also, since the underlying TDM is just a bag of words in which word ordering has no significance, counting word frequencies and offsetting them by IDF does not capture word order, word context, or word semantics very well. LDA built on this naive TDM also suffers to some extent from the synonymy and polysemy problems of the English language.
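A minimal LDA sketch with Gensim follows; the toy token lists and the choice of two topics are illustrative assumptions (Gensim exposes the beta hyperparameter under the name "eta").

```python
from gensim import corpora
from gensim.models import LdaModel

# Tokenized toy documents; real input would be preprocessed book texts.
docs = [["whale", "sea", "ship", "captain"],
        ["love", "city", "romance", "letter"],
        ["ship", "storm", "sea", "sailor"]]

# Map tokens to integer ids and build the bag-of-words corpus.
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit LDA with 2 latent topics; alpha controls per-document topic mixing.
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
               alpha="auto", passes=10, random_state=0)
for bow in bow_corpus:
    print(lda.get_document_topics(bow))   # per-document topic distribution
```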
Later innovations, word embedding algorithms like Word2Vec and GloVe, use a shallow neural network architecture and generate better word representations. An even better context-aware word embedding mechanism arrived in 2017 with the transformer architecture, which replaces recurrent networks such as the bidirectional long short-term memory (LSTM) with an attention mechanism. Deep neural networks like BERT (bidirectional encoder representations from transformers) [19], [20], [21] now exceed the topic modeling accuracy of traditional LDA matrix factorization. Because of these better word embeddings, computers nowadays do a decent job of understanding natural languages. In the next section, we will see how deep neural networks improve recommender performance.

III. Neural Architecture: Word Embedding for Content Based Filtering

Figure 4: Word2Vec Architectures i) CBOW and ii) Skip Gram

Figure 4 shows Word2Vec's shallow neural network architecture for word embedding, with only one input layer, one hidden layer, and one output layer. This word embedding algorithm, introduced by Mikolov et al. at Google, has two models, CBOW (continuous bag of words) and Skip-Gram, to capture the word ordering and semantics of a text. In the CBOW model, the context words (a window of 5 to 10 terms or grams surrounding a target word) predict the target word, whereas in the Skip-Gram model the target word predicts all its context words (n-grams). In the input layer, words are represented with sparse one-hot encoding, but in the hidden layer a word is projected as a dense, low-dimensional vector of 100 to 300 dimensions. CBOW and Skip-Gram train to optimize their predictions by backpropagating the error at the output layer and adjusting the weight matrix with the stochastic gradient descent (SGD) algorithm. This breakthrough in word representation is improving all other NLP tasks, such as document-to-document similarity, document classification, clustering, sentiment analysis, text summarization, language translation, question answering, contextual advertising, and automatic text generation.
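Training such a model takes a few lines with Gensim; this is a minimal sketch on toy sentences, where sg=1 selects Skip-Gram (sg=0 would select CBOW) and the hyperparameter values are illustrative.

```python
from gensim.models import Word2Vec

# Sentences tokenized from the corpus; real input would be the scraped novels.
sentences = [["the", "whale", "breached", "near", "the", "ship"],
             ["the", "captain", "watched", "the", "sea"],
             ["the", "whale", "swam", "in", "the", "sea"]]

# vector_size is the hidden-layer (embedding) width; window is the context size.
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=50)

print(model.wv["whale"][:5])                   # first 5 dims of one embedding
print(model.wv.most_similar("whale", topn=3))  # nearest words in vector space
```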
Figure 5: Word Embedding Vector Space: Similar Words Cluster Together

In neural word embedding techniques, similar words are represented by vectors that are close in the vector space, as can be seen in Figure 5. The word vectors of a few of the most frequently used words from the novel "Moby Dick or the Whale", viz. 'whale', 'ship', 'sea', 'man', and 'eye', are used to find their similar words, i.e., words used in the same context. Each word is represented by a 100-dimensional vector, which must be projected into 2D or 3D for visualization. Two-dimensional visualization of the word vectors by the PCA or t-SNE algorithm shows that similar words cluster together, as they have close vector representations. In Figure 5, similar context words for 'whale' are whales, fish, humpback, dolphin, shark, etc., and similar words for 'ship' are ships, boats, vessels, cargo, etc.
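Such a 2D projection can be produced with scikit-learn's PCA; this sketch reuses the hypothetical `model` trained in the previous snippet and an assumed word list.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 100-dimensional embeddings of a few words into 2D with PCA.
words = ["whale", "ship", "sea", "captain"]
vectors = [model.wv[w] for w in words]
coords = PCA(n_components=2).fit_transform(vectors)

# Scatter the 2D points and label each with its word; close points indicate
# words used in similar contexts.
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```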
With this capability of giving close vector representations to words used in the same context, document-to-document similarity measurement, and hence content-based filtering, becomes more accurate. In fact, averaging the word vectors of a sentence gives a sentence vector, and averaging the sentence vectors of a document gives a document vector. From the cosine similarity scores of these document vectors, the k-nearest neighbors of a particular book are computed and recommended.
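A minimal sketch of this averaging-and-ranking step follows, again reusing the hypothetical Word2Vec `model` from above; the token lists stand in for full books.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def document_vector(tokens, wv):
    """Average the word vectors of a tokenized document, skipping
    out-of-vocabulary tokens."""
    return np.mean([wv[t] for t in tokens if t in wv], axis=0)

# Each "book" is a token list here for illustration.
books = {"moby_dick": ["whale", "sea", "ship"],
         "sea_story": ["ship", "sea", "captain"],
         "city_tale": ["the", "captain", "watched"]}

vecs = np.array([document_vector(toks, model.wv) for toks in books.values()])
sims = cosine_similarity(vecs)

# Rank the other books by similarity to the first one (its nearest neighbors).
order = np.argsort(-sims[0])
print([list(books)[i] for i in order if i != 0])
```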
Training a word embedding model takes a huge dataset and is computationally expensive. Many pre-trained word embeddings, trained on huge datasets such as Wikipedia or Google News, are freely available. We can also train our own Word2Vec or GloVe models on Goodreads data or on a corpus created from Project Gutenberg's e-books. Word2Vec and GloVe are accessible through Python's NLTK and Gensim packages.

IV. Deep Learning for Collaborative Filtering


An autoencoder is the building block of a deep learning architecture for a collaborative filtering recommendation engine. As the name suggests, an autoencoder reproduces its input at its output layer, so the number of neurons in the input layer equals the number of neurons in the output layer. The autoencoder is an unsupervised learning model, as no labeled data is required to reconstruct the output from the input. A simple autoencoder has an input layer, one hidden layer, and one output layer (Figure 6). The input layer and the hidden layer together form the encoder, and the hidden layer and output layer together form the decoder. Data is passed through the input layer, and in the hidden and output layers some non-linear transformation of the input data takes place. The encoding function $\Phi(x)$ encodes the high-dimensional input $X$ into a low-dimensional latent vector $Z$ at the hidden layer. This encoding is non-linear: the input $x$ is multiplied by a weight matrix $W$, summed with a bias $b$, and then transformed by a non-linear activation function $\sigma$ like sigmoid, tanh, or ReLU, as shown in Equation 1. The decoding function $\psi(z)$ decodes the low-dimensional latent vector $z$ at the hidden layer back into the high-dimensional output $\hat{x}$ by a similar transformation, as shown in Equation 2.

Figure 6: Simple Auto Encoder Architecture


Encoding function: $\Phi : X \to Z$, $x \mapsto \Phi(x) = \sigma(Wx + b) =: z$ .... Equation 1

Decoding function: $\psi : Z \to \hat{X}$, $z \mapsto \psi(z) = \sigma(\hat{W}z + b) =: \hat{x}$ ..... Equation 2

$L(x, \hat{x}) = \dfrac{\sum_{i=1}^{n} m_i \,\lVert x_i - \hat{x}_i \rVert^2}{\sum_{i=1}^{n} m_i} = \dfrac{\sum_{i=1}^{n} m_i \,\lVert x_i - \sigma(\hat{W}z_i + b) \rVert^2}{\sum_{i=1}^{n} m_i} = \dfrac{\sum_{i=1}^{n} m_i \,\lVert x_i - \sigma(\hat{W}\sigma(Wx_i + b) + b) \rVert^2}{\sum_{i=1}^{n} m_i}$ ..... Equation 3

The autoencoder neural network is trained over several epochs to minimize the difference between the input and the reconstructed input at the output layer, as expressed by the loss function $L(x, \hat{x})$ in Equation 3. Here $x_i$ is the actual rating of the $i$-th item and $\hat{x}_i$ is the predicted rating. The input is a sparse vector of a user's ratings (or an item's ratings) over already-seen items, but the output layer produces a dense vector in which the blank values of the input are replaced by real numbers, the estimated ratings for unseen items. This is possible because the loss function is a masked mean square error (MMSE), where $m_i$ is the mask value: 0 for items with no rating and 1 for items with a rating, so only observed ratings contribute to the loss. The loss function is optimized with the Adam optimizer, which works well in practice.
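A minimal PyTorch sketch of this model and its masked loss follows; the layer sizes, learning rate, and toy batch are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AutoRec(nn.Module):
    """One-hidden-layer autoencoder in the spirit of Equations 1-3."""
    def __init__(self, n_items, n_latent=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_items, n_latent), nn.Sigmoid())
        self.decoder = nn.Linear(n_latent, n_items)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def masked_mse(x, x_hat):
    # Mask m_i is 1 where a rating exists and 0 where it is missing (stored
    # as 0), so unobserved entries do not contribute to the loss.
    mask = (x > 0).float()
    return (mask * (x - x_hat) ** 2).sum() / mask.sum()

# One training step on a toy batch of user rating vectors (0 = unrated).
n_items = 1000
model = AutoRec(n_items)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

ratings = torch.zeros(32, n_items)
ratings[:, :10] = torch.randint(1, 11, (32, 10)).float()  # a few known ratings

opt.zero_grad()
loss = masked_mse(ratings, model(ratings))
loss.backward()
opt.step()
```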

Figure 7: Deeper Auto Encoder Architecture with Multiple Hidden Layers


A neural network can approximate any function to the required precision. The neural embedding technique, which maps a large number of user ratings or book ratings into a lower-dimensional latent vector space in the hidden layers, works as a dimensionality reduction technique and improves the accuracy of user-to-user and book-to-book rating correlation measurement.
A deep autoencoder has many hidden layers (Figure 7). The purpose of the additional layers is to capture more complex, non-linear data correlations: while the first layer captures first-order features, deeper layers capture higher-order features. This is explained more easily in the context of image classification, where first-order features are the edges in an image and second-order features describe which edges co-occur to form a corner, and so on. Thus, the quality of a recommendation engine that applies a deep neural network improves significantly, as learning complex data correlations from huge amounts of unlabeled data is possible only with deep learning.
The deep learning autoencoder is a flexible architecture in which the number of hidden layers and the number of neurons per layer can be varied experimentally for better accuracy. To reduce the effect of the cold start problem, users' context information, e.g., country, time of day, income, and device, needs to be integrated. Such heterogeneous information can easily be combined in the input and output layers of an autoencoder, which was not possible in the traditional matrix factorization used for collaborative filtering.

One of the earliest models of the modern deep learning approach to recommender design is presented here; many other variants of the autoencoder have evolved over time. TensorFlow, Keras, and PyTorch are some of the Python deep learning libraries used for recommender system design.

Figure 9: Auto Encoder Performance Result: Deep Autoencoder MAE Error

Figure 10: Books Recommendation by Collaborative Filtering

A recommender system can be evaluated with many metrics, viz. recall, precision, RMSE, mean reciprocal rank (MRR), mean average precision at k (MAP@k), etc. The experiments performed in this paper show the superiority of deep learning approaches over traditional matrix factorization techniques and validate the flexibility of deep learning in recommender system design.

In this paper, a hybrid recommender is designed with a weighted score of CBF and CF, both using deep learning neural networks. Figure 10 shows the top-6 books recommended for a user who has read Herman Melville's "Moby Dick, or the Whale." Future research will extend the recommender design framework to integrate sentiment analysis of book reviews collected from social networking sites.
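The paper does not list its blending weights, so the following is a minimal sketch of one common way to combine the two scores; the 0.4/0.6 weights and the per-book scores are illustrative assumptions.

```python
import numpy as np

def hybrid_scores(cbf_scores, cf_scores, w_cbf=0.4, w_cf=0.6):
    """Blend content-based and collaborative scores with a weighted sum."""
    def norm(s):
        # Min-max normalize so the two score scales are comparable.
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-9)
    return w_cbf * norm(cbf_scores) + w_cf * norm(cf_scores)

# Toy per-book scores: cosine similarity from CBF, predicted rating from CF.
cbf = [0.91, 0.40, 0.75, 0.10]   # e.g., similarity to "Moby Dick"
cf  = [7.5, 9.0, 4.0, 6.0]       # e.g., autoencoder-predicted ratings (1-10)

ranking = np.argsort(-hybrid_scores(cbf, cf))
print(ranking)                   # book indices in recommendation order
```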

V. Conclusions
In today’s bigdata era use of recommender system in e-commerce businesses will be ubiquitous. At the same
time deep learning techniques is becoming the de facto standard for any kind of learning from data. This paper
makes contribution towards understanding book recommender system design challenges and opportunities and
how advances in deep learning and NLP can be exploited in improving book recommender performance. DL
can reduce the cold start and sparse matrix problem of traditional machine learning algorithm for recommender
design. DL can capture non-trivial, non-linear user-item preference relation even from noisy training data. In
deep learning there is no need for domain expert tor manual feature engineering. All the benefits of DL that are
highlighted here are validated in our implementation of Book Recommender Design with Deep Auto Encoder
neural network. DL is very data hungry and compute intensive, needing parallel processing power of GPUs and
TPUs. Thus, this field is not extensively researched by individual academic researchers. But we can see that the
computer vision and natural language processing field has improved their accuracy significantly by using deep
learning. So, there is a great potential for deep learning to improve accuracy of recommender system engine
also. Understanding of deep learning will encourage readers to apply this technique in designing books, research
paper, news articles and course recommendation etc. This research paper motivates academician by providing a
new perspective of recommender system design benefit with deep learning architecture.

References:
1. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
2. D. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet Allocation. In Journal of Machine
Learning Research, 3, 993-1022.
3. Radha Guha, 2020. Exploring Information Retrieval by Latent Semantic and Latent Dirichlet
Allocation Techniques. International Research Journal of Computer Science, Vol. 7, Issue 5.
4. Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender
systems: A survey of the state-of-the-art and possible extensions. IEEE transactions on knowledge and
data engineering 17, 6 (2005), 734–749
5. Pasquale Lops, Marco Degemmis, and Giovanni Semeraro. Content-based recommender
systems: State of the art and trends. In Recommender Systems Handbook, 2011.
6. Yehuda Koren and Robert Bell. Advances in collaborative filtering. In Recommender systems
handbook, pages 77–118. Springer, 2015.
7. Ben Shneiderman 1997. Direct Manipulation for Comprehensible, Predictable, and
Controllable User Interfaces. Proceedings of IUI97, 1997 International Conference on
Intelligent User Interfaces, Orlando, FL, January 6-9, 1997, 33-39.
8. Christopher Avery, Paul Resnick, and Richard Zeckhauser 1999. The Market for Evaluations.
American Economic Review 89(3): pp 564-584.
9. J. Ben Schafer, Joseph Konstan, and John Riedl. Recommender Systems in E-commerce. EC '99: Proceedings of the 1st ACM Conference on Electronic Commerce, November 1999, pages 158-166.
10. Dietmar Jannach et al. Measuring the Business Value of Recommender Systems. arXiv:1908.08328v3 [cs.IR], Dec 2019.
11. Carlos A Gomez-Uribe and Neil Hunt. The Netflix recommender system: Algorithms, business
value, and innovation. ACM Transactions on Management Information Systems (TMIS),
6(4):13, 2016.
12. Koren, Y., Bell, R., Volinsky, C.: Matrix Factorization Techniques for Recommender Systems. Computer 42(8), 30-37 (2009).
13. Haruna K et.al. A Collaborative Approach for Research Paper Recommender System. PLoS
ONE 12(10): e0184516, 2017. https://doi.org/10.1371/journal.pone.0184516.
14. Maxwell Harper and Joseph A. Konstan. The MovieLens Datasets: History and Context. ACM
Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19, December 2015,
DOI=http://dx.doi.org/10.1145/2827872.
15. N. Kurmashov, K. Latuta, and A. Nussipbekov. Online Book Recommendation System. Twelfth International Conference on Electronics Computer and Computation (ICECCO), 2015, pp. 1-4, doi: 10.1109/ICECCO.2015.7416895.
16. T. Mikolov et al. Distributed Representations of Words and Phrases and Their Compositionality. Advances in Neural Information Processing Systems, 3111-3119, 2013.
17. J. Pennington et al. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.
18. Ashish Vaswani et al. Attention is All You Need. arXiv:1706.03762. 2017.
19. M. Heidari and J. H. Jones, "Using BERT to Extract Topic-Independent Sentiment Features for
Social Media Bot Detection," 2020 11th IEEE Annual Ubiquitous Computing, Electronics &
Mobile Communication Conference (UEMCON), 2020, pp. 0542-0547, doi:
10.1109/UEMCON51285.2020.9298158.
20. Tom B. Brown et al. Language Models are Few-Shot Learners. arXiv:2005.14165v4 [cs.CL], Jul 2020.
21. Radha Guha, 2020. Impact of Artificial Intelligence and Natural Language Processing on
Programming and Software Engineering. International Research Journal of Computer
Science, Vol. 7, Issue 9.
22. Angelov, D. (2020). Top2Vec: Distributed Representations of Topics. arXiv preprint
arXiv:2008.09470.
23. Xin Dong, Lei Yu, Zhonghuo Wu, Yuxia Sun, Lingfeng Yuan, and Fangxi Zhang. 2017. A Hybrid
Collaborative Filtering Model with Deep Structure for Recommender Systems. In AAAI. 1309–1315.
24. Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2018. Deep Learning based Recommender System: A
Survey and New Perspectives. ACM Comput. Surv. 1, 1, Article 1 (July 2018), 35 pages. DOI:
0000001.0000001
25. Basiliyos Tilahun Betru, Charles Awono Onana, and Bernabe Batchakui. 2017. Deep Learning
Methods on Recommender System: A Survey of State-of-the-art. International Journal of Computer
Applications 162, 10 (Mar 2017).
26. Jianpeng Cheng et al. Long Short-Term Memory-Networks for Machine Reading. arXiv
preprint arXiv:1601.06733, 2016.
27. Chris Dyer et al. Recurrent Neural Network Grammars. In Proc. of NAACL, 2016.
Matthew E. Peters et al. Deep Contextualized Word Representations. arXiv:1802.05365, 2018.
28. Xiangnan He et al. Neural Collaborative Filtering. arXiv:1708.05031v2 [cs.IR] 26 Aug 2017.
29. Heng-Tze Cheng et al. Wide & Deep Learning for Recommender Systems. arXiv:1606.07792v1 [cs.LG], Jun 2016.
Shuai Zhang et al. Deep Learning Based Recommender System: A Survey and New Perspectives. arXiv:1707.07435v7 [cs.IR], Jul 2019.
Diana Ferreira et al. Recommendation System Using Autoencoders. MDPI Applied Sciences, 2020.
30. Kuchaiev, O.; Ginsburg, B. Training deep autoencoders for collaborative filtering. arXiv 2017,
arXiv:1708.01715.
31. Haghighi, P.S.; Seton, O.; Nasraoui, O. An Explainable Autoencoder For Collaborative Filtering
Recommendation. arXiv 2019, arXiv:2001.04344.
32. Radha Guha. Improving the Performance of an Artificial Intelligence Recommendation Engine with Deep Learning Neural Nets. 2021 6th International Conference for Convergence in Technology (I2CT), Pune, India, Apr 02-04, 2021.
33. Willmott, C.J., Matsuura, K.: Advantages of the mean absolute error (MAE) over the root
mean square error (RMSE) in assessing average model performance. Climate Res. 30(1), 79–
82 (2005)
