
2013 BRICS Congress on Computational Intelligence & 11th Brazilian Congress on Computational Intelligence

Generating Synthetic Data for Context-Aware Recommender Systems

Marden Pasinato, COPPE/UFRJ, Rio de Janeiro, Brazil, marden@cos.ufrj.br
Carlos Eduardo Mello, UFRRJ, Rio de Janeiro, Brazil, carlos.mello@ufrrj.br
Marie-Aude Aufaure, MAS/ECP, Paris, France, marie-aude.aufaure@ecp.fr
Geraldo Zimbrão, COPPE/UFRJ, Rio de Janeiro, Brazil, zimbrao@cos.ufrj.br

Abstract: Context-Aware Recommender Systems (CARS) have emerged as a way of providing more precise and interesting recommendations by using data about the context in which consumers buy goods and services. CARS consider not only the ratings given to items by consumers (users), but also the context attributes related to these ratings. Several algorithms and methods have been proposed in the literature to deal with context-aware ratings. Although there are many proposals and approaches for this kind of recommendation, adequate public datasets containing users' context-aware ratings of items are limited, and even these are usually not large enough to evaluate the proposed CARS well. One solution is to crawl this kind of data from e-commerce websites. However, crawling can be very time-consuming and is complicated by legal rights and privacy issues. In addition, data crawled from e-commerce websites may not be enough for a complete evaluation, since it cannot simulate all possible user behaviors and characteristics. In this article, we propose a methodology to generate a synthetic dataset for context-aware recommender systems, enabling researchers and developers to create their own dataset with the characteristics under which they want to evaluate their algorithms and methods. Our methodology enables researchers to define the user's rating behavior through the Probability Distribution Function (PDF) associated with each profile.

Keywords: Synthetic Data Generator; Context-Aware Recommender Systems; Data Mining

978-1-4799-3194-1/13 $31.00 © 2013 IEEE
DOI 10.1109/BRICS-CCI-CBIC.2013.99

I. INTRODUCTION

Over the last decade, Recommender Systems have been widely studied in both academia and industry. Given the huge number of e-commerce applications available on the Internet and of goods offered in them, providing an ideal product recommendation to consumers has become an important feature of these applications. These systems are able to leverage e-commerce sales by converting browsers into buyers, increasing cross-selling and building consumer loyalty [1]. Therefore, academia and industry have put considerable effort into proposing algorithms and methods to improve product recommendation.

Different types of recommender systems have emerged. The methods used by recommender systems can be organized according to three main approaches: content-based, collaborative and hybrid. An orthogonal classification is also possible, dividing them into heuristic-based and model-based. For more details about the state of the art of recommender systems, Adomavicius and Tuzhilin present a good survey in [2]. The authors also present new trends and future directions.

New approaches and methods have been proposed in the literature that try to bring in more information about the context in which products were bought. Methods considering contextual information have been receiving more and more attention, and context-aware recommender systems (CARS) emerged as a new approach to improve recommendations [3]. From this perspective, context attributes should be considered when a user rates a product.

Although there are many proposals and approaches working on this contextual information, adequate public datasets containing users' context attributes and product ratings are very limited, and even these are usually not large enough to evaluate proposed CARS well [3]. One solution is to collect this kind of data from e-commerce websites by crawling them. However, crawling can be very time-consuming and is complicated by legal rights and privacy issues. In addition, the performance of recommender systems is highly dependent on various characteristics of the dataset. Evaluating algorithms on only one or two datasets is often not sufficient [4]. Moreover, a detailed analysis can be performed by applying systematic changes to the data, which cannot be done with real data.

Several studies in the database systems literature have addressed the generation of synthetic datasets [5]. The aim of synthetic data generators is to test the correctness and the performance of algorithms, as is done, for example, in the TPC benchmarks [6]. These tools are often specialized and not reusable. To be useful, the synthetic data must be realistic and correct in terms of size and distributions. However, not much work has been done on generating datasets for recommender system evaluation, and the existing work does not focus on context-aware recommendations.

In this work we propose a methodology to generate synthetic data in order to evaluate CARS. Our methodology aims at generating data that include the items' ratings and the context attributes under which users gave these ratings. Moreover, we expect to provide researchers with a powerful tool that is able to generate datasets according to the users' behavior when rating items. The idea is to allow researchers to set up the users' profiles according to their needs, so that the dataset can then be generated following the user behavior defined by the researcher.

Therefore, we expect that this work can contribute to the development of context-aware recommender systems by allowing researchers to generate synthetic data for evaluating their proposed methods and algorithms. Moreover, our methodology enables several kinds of evaluation to be performed by varying the characteristics of the users' profiles, which is useful for testing different kinds of users, such as "heavy raters", "cold start users", "pessimists" and so on.

This paper is organized in 4 sections, of which this is the first. In the second section, we briefly describe related work on synthetic data generation. In the third section, we present our proposal, describing the basic definitions and the algorithms used. The fourth and last section describes our conclusions so far and presents the proposed experiment that can be carried out to validate our methodology.

II. RELATED WORK

Synthetic data has been widely used by database researchers, and by researchers from other fields as well, mainly for simulation purposes. Therefore, the size and the distribution of the synthetic data must be realistic and correct, so that the simulation is as close to the real scenario as possible [6]. Several tools for generating synthetic data have emerged; however, they are often specialized and not reusable. To deal with this issue, much work has been conducted to design Synthetic Data Generators that are as general as possible.

Many Synthetic Data Generators (SDGs) have been proposed in the literature, such as [5], [6], [7], [8] and [9]. Most of this synthetic data is used to evaluate multidimensional models and OLAP tools. In general, these SDGs are supported by description languages, which allow us to define constraints and the exact characteristics of the data to be generated. Such SDG tools are developed to generate huge amounts of data in little time [5][9], and many of them are available as commercial products [5].

There are many works in which authors propose an SDG in order to evaluate their own recommender systems. In [4], the authors present several of those works. They claim that, although various SDGs for evaluating the behavior of data with several attributes have been developed, most of those found in the literature are used to evaluate one specific algorithm. Moreover, as the SDG was never the main focus of those authors, it is neither described in detail nor generic enough to be reused for evaluating other algorithms.

Therefore, to the best of our knowledge, only the work proposed in [4] describes in detail an SDG for evaluating recommender systems and tries to be generic enough to be reused in other evaluations. However, that work focuses on generating data for attribute-aware recommender systems, which are quite different from context-aware recommender systems. Furthermore, even if that work could be adapted to context-aware recommender systems, it would not enable researchers to model the user and the item by the Probability Distribution Function (PDF) of their ratings and context attributes, as we do in our proposed methodology.

III. PROPOSED METHODOLOGY

In this section we present our proposed methodology to generate synthetic data for evaluating CARS. To do so, we describe the synthetic data generator (SDG) that implements our proposal. The SDG consists of 2 profile generators, 2 profile arrays, a penalization function and a sampling algorithm, organized as depicted in Figure 1.

Figure 1: Synthetic Data Generator

The Profile Generators, the Sampling Algorithm and the Penalization Function are used to define the behavior of the synthetic data. With them, we can define behaviors such as:

• how many users and items will be generated;
• how each user gives ratings to items;
• how each context attribute influences the rating;
• how many items each user will rate;
• how users choose items to rate;
• which products are more likely to receive ratings.

These and other possible attributes are supported by the use of PDFs assigned to the random variables that represent the context attributes, the ratings and the number of items evaluated. Therefore, we have a useful set of random variables that models the users' behavior and enables us to systematically vary the model's parameters according to the purpose of the evaluation.

In the following subsections we describe how each component of the SDG works. We begin by describing the user's and item's profiles and the role they play in the SDG. Afterwards, we present the Profile Generators and how to use them to model users and items. Finally, we describe the Penalization Function and the Sampling Algorithm.
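As a rough illustration of how the components named above could fit together, the following Python sketch generates a tiny trip-rating dataset. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `UserProfile`, `ItemProfile`, `penalize` and `generate`, the choice of Gaussian and uniform distributions, their parameter values, and the linear month-gap penalty are all hypothetical. It also does not enforce a fixed total number of evaluations.

```python
import random
from dataclasses import dataclass


@dataclass
class UserProfile:
    n_ratings: int     # how many destinations the user evaluates
    taste_mean: float  # centre of the user's rating distribution
    taste_sd: float    # spread of the user's ratings
    pref_month: int    # preferred travel month, 1..12 (user context)


@dataclass
class ItemProfile:
    max_ratings: int   # how many evaluations the destination may receive
    peak_month: int    # month in which the destination is most sought


def clamp_int(x, lo, hi):
    """Round x and clamp it to the integer range [lo, hi]."""
    return max(lo, min(hi, round(x)))


def month_gap(a, b):
    """Circular distance between two months (0..6)."""
    d = abs(a - b) % 12
    return min(d, 12 - d)


def penalize(rating, pref_month, peak_month, max_penalty=2):
    """Assumed linear penalty: up to max_penalty points at a 6-month gap."""
    penalty = round(max_penalty * month_gap(pref_month, peak_month) / 6)
    return max(1, rating - penalty)


def generate(n_users, n_items, mean_ratings_per_user, seed=0):
    """Return a list of (user, item, month, rating) tuples."""
    rng = random.Random(seed)
    # Profile generators: every profile value is drawn from a PDF
    # (the distributions and their parameters are illustrative assumptions).
    users = [
        UserProfile(
            n_ratings=clamp_int(rng.gauss(mean_ratings_per_user, 2), 0, n_items),
            taste_mean=rng.gauss(3.0, 1.0),
            taste_sd=0.8,
            pref_month=rng.randint(1, 12),
        )
        for _ in range(n_users)
    ]
    items = [
        ItemProfile(
            max_ratings=clamp_int(rng.gauss(n_users / 2, n_users / 4), 1, n_users),
            peak_month=rng.randint(1, 12),
        )
        for _ in range(n_items)
    ]
    counts = [0] * n_items
    dataset = []
    for i, user in enumerate(users):
        # Sampling algorithm: skip destinations that reached their quota.
        open_items = [j for j in range(n_items) if counts[j] < items[j].max_ratings]
        chosen = rng.sample(open_items, min(user.n_ratings, len(open_items)))
        for j in chosen:
            counts[j] += 1
            raw = clamp_int(rng.gauss(user.taste_mean, user.taste_sd), 1, 5)
            # Penalization function: adjust the raw rating for context mismatch.
            final = penalize(raw, user.pref_month, items[j].peak_month)
            dataset.append((i, j, user.pref_month, final))
    return dataset
```

Calling `generate(50, 20, 5)` yields a list of `(user, item, month, rating)` tuples. Raising or lowering `mean_ratings_per_user` is one way to move the population toward "heavy raters" or "light raters".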
A. User's and Item's Profile

The user's and the item's profiles are defined in the same way, so we first describe a generic profile. The aim of a profile is to store the model that describes the characteristics of an entity, such as a user or an item. Profiles can therefore be defined in several ways, e.g., by simple attributes or by complex functions.

We use users' and items' profiles in order to model the user's behavior in:

• choosing items to rate;
• choosing an appropriate value for the rating given the context;
• simulating their preferred context.

To do that, we use a set of random variables representing the context attributes and the way in which a particular user evaluates items. Therefore, by modeling how a user rates items through a random variable, we are assigning a PDF of ratings to each user. Although it may seem that items' profiles are not necessary, we need them to model the most evaluated items and to establish the PDF of the context variables of each item.

Many different kinds of PDFs can be assigned to the user's ratings and to the context attributes, depending on what we want to model. For instance, "pessimistic" or "demanding" users can be defined by a Gaussian distribution with its mean shifted toward low rating values. The same can be done for "optimistic" or "uncritical" users, with the mean shifted toward high rating values. There are also "controversial" users, who have high variance in their ratings; for them we can use a Gaussian distribution with high variance or even a bimodal distribution. There are many ways to model user behavior through PDFs; we just have to choose those that represent the behavior we want in the data.

For items, we need to define the PDF of each context attribute. Though it is not directly related to the users' behavior, with these profiles we can determine how the rating is influenced by the context. We are aware that context is a very broad term, so, for the sake of clarity, we shall consider a trip recommender system in which users are travelers and items are destinations.

In this scenario, we can see how important the context related to items is for the recommendation. It is fair to suppose that a traveler will be much more reluctant to travel to Rio de Janeiro in winter than in summer, since the city is known worldwide for its summer attractions. A similar argument can be made about Bordeaux: travelers would be much more willing to go there during the wine harvest than in other seasons, since the city is well known for its great wine.

Thus, some cities are visited more during certain months or periods of the year, depending on their attractions. In order to design an accurate trip recommender system, we need to take this pattern into account. In our model, we have a context random variable describing the probability that an average traveler chooses that city in each month of the year. For instance, we could model the probability of Rio de Janeiro being chosen as a travel destination by a Gaussian distribution with its peak in January and a variance spanning December, February and March, the summer months.

B. Profile Generators

In the profile generators we assign a PDF, with its parameters, to each variable in the user's and item's profiles. For instance, in the case of the trip recommender system, we begin by choosing the number of users N, the number of destinations M and the number of evaluations E our dataset will have. For each user i, we have the following random variables describing the user's behavior:

• The number of destinations evaluated by user i. This variable may assume integer values from 0 to M. The sum of this variable over all users must be equal to E.
• The user's "taste", i.e., how user i evaluates destinations by rating them. This variable may assume integer values from 1 to 5.
• The user's preference regarding the period of the year in which to travel, i.e., the context preference of user i. This variable may assume integer values from 1 to 12, indicating the month of the year.

Likewise, for each destination j, we have the following random variables describing the destination's features:

• The number of users that evaluated destination j. This variable may assume integer values from 0 to N. The sum of this variable over all destinations must be equal to E.
• The period of the year in which destination j is most sought after for traveling. This variable may assume integer values from 1 to 12, indicating the month of the year.

Each of these random variables has an associated PDF that governs its behavior. The PDF can be of any sort, e.g., a Gaussian distribution, a bimodal distribution, an exponential distribution, a chi-square distribution or even a nonparametric distribution. The PDF's parameters are also selected depending on the characteristics we want our entities to have.

For example, suppose one wants to model a "heavy rater" user, by which we mean a user who rates many destinations. A feasible model of this behavior is to assign to the variable holding the number of destinations evaluated by user i a Gaussian distribution with a large mean and a small variance. The same reasoning can be extended to the other random variables, both in the user's and in the destination's profile.

As the desired dataset may have thousands or even millions of users, choosing the appropriate PDF parameters for each random variable and for each profile can be a burden for the researcher. Hence, in order to configure the SDG more easily, we propose a Bayesian approach for selecting these parameters.

Suppose that, for every user i in the dataset, one of these random variables follows a Gaussian distribution. Since the Gaussian distribution requires only two parameters (mean and variance), if the dataset required 1000 users, the researcher would have to set 2000 parameters manually: two for every user i (a mean and a variance).

Instead, we can assume that the parameter itself is a random variable with its own PDF. Thus, the only manual requirement is to determine this PDF, since the parameter value for each user i follows directly from it. In the example above, suppose the value of a given parameter for each user i is drawn from a single random variable, and that this variable follows a Gaussian distribution determined by its own mean and variance. Then we need to select manually only these two values in order to set that parameter for all users.

Having control over this mean and variance enables us to create the dataset under hypothetical situations that we may want to confirm or refute. For instance, imagine that the researcher has designed an algorithm that performs well even when most users are "light raters", i.e., when most users evaluate very few items. Finding a real context-aware dataset is already difficult; finding one with these specific characteristics is even harder. With our SDG the researcher could easily create such a dataset by assigning small values to this mean and variance.

C. Penalization Function

In Context-Aware Recommender Systems, the context influences, in some way, the rating given by users. Therefore, the Penalization Function aims at penalizing the rating given by a user if the destination's context differs too much from the user's preferred context. The implementation of the penalization function depends on how, and how much, we expect the rating to be influenced by the context.

The Penalization Function takes as input the user's preferred context, the destination's context and the user's rating. As the values chosen for the rating take into account only the mathematical model of the user's taste, without considering the context involved, we need to introduce some variation in the rating to account for the influence of the context.

The function that calculates the penalty for the ratings according to the context can be implemented in many ways. We suggest using a Fuzzy Logic engine for that. The way in which the rating is penalized can then be described by fuzzy rules and fuzzy sets, which seems easier to us than using complicated mathematical functions. Besides, it is more useful for explaining how the context affects the ratings.

D. Sampling Algorithm

The Sampling Algorithm is the core of the SDG and therefore of our methodology too. This algorithm determines which users will evaluate which destinations and applies the Penalization Function according to the profiles' variables in order to synthesize the "real" rating.

After all the profiles have been created, we need to associate each user with the destinations that user evaluates. To do so, we pick a particular user and select, among the entire set of destinations, those which the user will evaluate. This must be done under the established constraints on the number of destinations the user evaluates and the number of evaluations each destination receives.

The complete algorithm is shown in Figure 2. We begin by defining the total number of users, destinations and evaluations the dataset will have. Then, the Profile Generator is responsible for setting the users' and destinations' profiles. In other words, it is responsible for assigning values to the profiles' random variables according to their PDFs. We then move to the Sampling Algorithm, which associates users with destinations and applies the Penalization Function in order to get the real rating.

Input
    Number of users: N
    Number of destinations: M
    Number of evaluations: E
Profile Generators
    For each user i:
        Create user profile up_i = {number of destinations to evaluate, rating (taste), preferred context}
    For each destination j:
        Create destination profile dp_j = {maximum number of evaluations, context (most sought period)}
Sampling Algorithm
    For each user i:
        Choose the destinations that user i will evaluate, as many as set in up_i
        For each destination j:
            Check whether the destination has reached its maximum number of evaluations
            If yes:
                Discard the destination and search for another that still needs to be evaluated
            Else:
                Assign a rating r_ij that user i "gave" to destination j
Penalization Function
    Penalize the rating r_ij according to the user's preferred context and the destination's real context

Figure 2: Complete SDG algorithm.

IV. CONCLUSIONS AND FUTURE WORK

In this work we presented a methodology to generate synthetic data for evaluating Context-Aware Recommender Systems, in particular one aimed at trip recommendations. We also described an SDG that implements the proposed methodology, generating data for CARS based on users' and destinations' profiles. These profiles are modeled by random variables and their PDFs. We hope that with this modeling we are able to simulate the users' rating behavior and their contexts.

Although simple random variables seem an interesting tool for modeling user behavior, more research in this direction is needed to understand the complex relationship between rating and context. Besides, there are many other statistical tools fit for the task that may outperform our approach, such as Markov chains, complex networks and joint distributions.

We are working on a complete case study in order to show how good and reliable evaluations of CARS can be achieved through our proposed methodology. To do so, we expect to apply this approach to enlarge a real dataset, keeping or even improving the recommender system performance.
Nevertheless, we need to discover the PDFs of the real dataset
in order to model users and items.
A context-aware movie dataset with only 90 users, 950 items and 1600 ratings is described in [10]. It seems a good candidate for our tests. We intend to extract, from this real dataset, the PDFs that best describe the users' and items' behavior and their context attributes. We can do this either by trying known distributions or by a nonparametric approach using Kernel Density Estimators [11]. The latter seems more appropriate, given the wide range of possibilities for modeling human behavior.
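To make the nonparametric option concrete, here is a minimal sketch of a one-dimensional kernel density estimator with a Gaussian kernel, written in plain Python. The function name `gaussian_kde`, the fixed bandwidth and the sample ratings are assumptions for illustration only; they are not taken from the dataset in [10] or from the authors' tooling.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples with a Gaussian kernel."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    # Normalising constant so that the estimated density integrates to 1.
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical ratings given by one user of a real dataset.
observed_ratings = [4, 5, 3, 4, 4, 2, 5, 4]
pdf = gaussian_kde(observed_ratings, bandwidth=0.5)

# Relative weight of each rating value 1..5 under the estimated PDF;
# these weights could drive the sampling of synthetic ratings.
weights = [pdf(r) for r in range(1, 6)]
```

The bandwidth controls how much the estimate is smoothed. A small value follows the observed ratings closely, while a large one flattens the differences between users.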
Therefore, our methodology allows not only generating synthetic data and evaluating CARS, but also understanding the users' behavior and how the context may interfere with this behavior. We hope to make the implementation of our SDG available as a Matlab Toolbox as soon as possible.

ACKNOWLEDGMENTS
Our thanks to CNPq (Brazilian funding agency) and EUBRANEX for funding this research.

REFERENCES
[1] J.B. Schafer, J. Konstan, and J. Riedl, “Recommender systems in e-commerce,”
Proceedings of the 1st ACM conference on Electronic
commerce, Denver, Colorado, United States: ACM, 1999, pp. 158-166.
[2] G. Adomavicius and A. Tuzhilin, “Toward the next generation of
recommender systems: a survey of the state-of-the-art and possible
extensions,” Knowledge and Data Engineering, IEEE Transactions on,
vol. 17, 2005, pp. 734-749.
[3] G. Adomavicius, R. Sankaranarayanan, S. Sen, and A. Tuzhilin,
“Incorporating contextual information in recommender systems using a
multidimensional approach,” ACM Trans. Inf. Syst., vol. 23, 2005, pp.
103-145.
[4] K. Tso and L. Schmidt-Thieme, “Empirical Analysis of Attribute-Aware
Recommendation Algorithms with Variable Synthetic Data,” Data
Science and Classification, 2006, pp. 271-278.
[5] J.E. Hoag and C.W. Thompson, “A parallel general-purpose synthetic data
generator,” SIGMOD Rec., vol. 36, 2007, pp. 19-24.
[6] K. Houkjær, K. Torp, and R. Wind, “Simple and realistic data generation,”
Proceedings of the 32nd international conference on Very large data
bases, Seoul, Korea: VLDB Endowment, 2006, pp. 1243-1246.
[7] P.J. Lin, B. Samadi, A. Cipolone, D.R. Jeske, S. Cox, C. Rendon, D. Holt,
and R. Xiao, “Development of a Synthetic Data Set Generator for
Building and Testing Information Discovery Systems,” Proceedings of
the Third International Conference on Information Technology: New
Generations, IEEE Computer Society, 2006, pp. 707-712.
[8] N. Bruno and S. Chaudhuri, “Flexible database generators,” Proceedings of
the 31st international conference on Very large data bases, Trondheim,
Norway: VLDB Endowment, 2005, pp. 1097-1107.
[9] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P.J. Weinberger,
“Quickly generating billion-record synthetic databases,” Proceedings of
the 1994 ACM SIGMOD international conference on Management of
data, Minneapolis, Minnesota, United States: ACM, 1994, pp. 243-252.
[10] A. Košir, A. Odic, M. Kunaver, M. Tkalcic, and J.F. Tasic, “Database for
contextual personalization,” Elektrotehniški vestnik [English print ed.],
vol. 78, no. 5, 2011, pp. 270-274. [COBISS.SI-ID 8871764].
[11] W. Härdle, “Applied Nonparametric Regression,” Cambridge
University Press, 1992.
