
DOI: 10.1609/aaai.12016

RECOMMENDER SYSTEMS

Progress in recommender systems research: Crisis? What crisis?

Paolo Cremonesi¹  Dietmar Jannach²

¹ Politecnico di Milano, Italy
² University of Klagenfurt, Austria

Correspondence
Paolo Cremonesi, Politecnico di Milano, Italy.
Email: paolo.cremonesi@polimi.it

Abstract
Scholars in algorithmic recommender systems research have developed a largely standardized scientific method, where progress is claimed by showing that a new algorithm outperforms existing ones on one or more accuracy measures. In theory, reproducing and thereby verifying such improvements is easy, as it merely involves the execution of the experiment code on the same data. However, as recent work shows, the reported progress is often only virtual, because of a number of issues related to (i) a lack of reproducibility, (ii) technical and theoretical flaws, and (iii) scholarship practices that are strongly prone to researcher biases. As a result, several recent works could show that the latest published algorithms actually do not outperform existing methods when evaluated independently. Despite these issues, we currently see no signs of a crisis, where researchers re-think their scientific method, but rather a situation of stagnation, where researchers continue to focus on the same topics. In this paper, we discuss these issues, analyze their potential underlying reasons, and outline a set of guidelines to ensure progress in recommender systems research.

INTRODUCTION

The central components in any recommender system are the algorithms that determine which items will be shown to individual users based on specific contexts. Correspondingly, a core topic of recommender systems research lies in the continuous improvement of these algorithms. Early research from more than a quarter-century ago relied on comparably simple algorithms like nearest-neighbor heuristics (Resnick et al. 1994). Since then, increasingly more complex machine learning approaches were proposed, from linear models, to matrix factorization techniques, to the latest deep learning models that we see today (He et al. 2017; Koren, 2008; Ning and Karypis, 2011; Wu et al. 2019). Although the type of problems covered by the research has gradually expanded over the years, the classical rating prediction and top-N recommendation problems still attract most of the attention, and the research community has developed a seemingly standardized way of operationalizing these problems. In almost all published research, algorithms are compared through offline experiments on historical data. In such experiments, progress is claimed if a new algorithm is better in predicting held-out data than previous ones in terms of measures such as RMSE, Precision, Recall, MAP, or NDCG.

Given the huge amount of published research works over 25 years and the agreed-upon standards for conducting experimental research, one would certainly expect to observe continuing and strong progress in this area. This should particularly be the case because the majority of algorithms papers claim to improve the state-of-the-art. However, a number of scholars repeatedly voiced concerns regarding this progress over the years. Ten years ago, for example, Ekstrand et al. (2011) found it “[...] difficult to reproduce and extend recommender systems research results.”

AI Magazine. 2021;42:43–54. © 2021, Association for the Advancement of Artificial Intelligence. All rights reserved.

Later, in 2013, a workshop on reproducibility and replication was held in conjunction with the ACM Conference on Recommender Systems. In that context, Konstan and Adomavicius (2013) observed that the community faces a situation of limited progress, in particular because of a lack of rigor and problematic evaluation practices. Related phenomena were investigated in more depth in (Beel et al. 2016). Here, the authors not only found that the recommender systems community “[...] widely seems to accept that research results are difficult to reproduce.” Their experiments also showed that minor variations in the evaluation setup can have significant effects on the observed outcomes.

In our own past works, we benchmarked more than twenty of the most recent deep learning (neural) methods against a set of non-neural algorithms, most of them published many years earlier. Very surprisingly, these studies showed that—except for a very small set of experimental configurations—the latest neural methods, despite their computational complexity, were almost consistently outperformed by long-known existing methods (Ferrari Dacrema et al. 2019, 2021, 2020a; Ludewig et al. 2020). In several cases, even nearest-neighbor techniques developed 25 years ago were better than the latest machine learning models. These observations were made both for the traditional top-N recommendation task and for more recent session-based recommendation scenarios. These findings, which were reproduced and confirmed also by other researchers (Kouki et al. 2020; Rendle et al. 2020; Sun et al. 2020), indicate that we are facing major issues, which may prevent us from truly moving forward as a field.

Furthermore, our studies revealed that the reproducibility of existing works is actually not high. In the context of the works presented in (Ferrari Dacrema et al. 2019, 2021), for example, we found that about 55% of the papers published at top-level conferences could not be reproduced based on shared code and data, nor by contacting authors for help and clarification. Put differently, for 55% of the papers the verification or falsification of the made claims—which is a central element of any scientific research—is almost impossible.

All in all, it seems that algorithms research is showing signs of stagnation, where constantly new models are proposed, but where the claimed improvements “don’t add up”, as observed previously in the domain of information retrieval (Armstrong et al. 2009; Yang et al. 2019). Various reasons contribute to this phenomenon. Researchers for example sometimes compare their new methods to existing methods (baselines), which are too weak in general or which are not properly tuned. Furthermore, the fact that researchers are generally flexible regarding their experimental designs in terms of baselines, datasets, evaluation protocol, and performance measures may easily lead to researcher bias, where algorithm designers may only look for evidence that supports the superiority of their own method. Unfortunately, despite the existing evidence of the problem, we do not observe clear signs of a crisis yet, where researchers would re-think their methodological approaches. In contrast, with the current boom of AI and machine learning research, even more algorithms are published every year that rely on research practices that are only partially suited to demonstrate progress.

This paper aims to stimulate a discussion on the lack of progress in some areas of research on recommender systems. While supporting the points described here, we do not question the overall quality of research in recommender systems: in many aspects, the community has advanced far beyond what we had achieved a decade ago. Also, the bad practices identified here are not specific to any individual or institution. We made such mistakes ourselves, but by making researchers more aware of them, we aim to avoid them in the future.

In the remainder of the paper, we first discuss the issues of today's research practice in machine learning applied to recommender systems. We then look at the potential underlying reasons for the observed phenomena. Finally, we sketch a number of ways forward to increase the reproducibility of our research and ensure progress in the future.

THE PROBLEMS

According to our observations, it is not one single problem that leads to the apparent stagnation, but multiple issues that can all contribute to a lack of progress:

- Lack of reproducibility: If a work has never been reproduced—either because it is not reproducible or because the research community never attempted to reproduce it—its claimed gains can neither be verified nor denied, and as such, any progress described in the work cannot be considered reliable.
- Methodological issues: Some works exhibit experimental flaws that produce non-existent gains over state-of-the-art approaches; examples are the adoption of weak baselines, or the leakage of testing information into the training process.
- Theoretical flaws: A few works describe new methods based on assumptions that are not verified, neither theoretically nor empirically.

Lack of reproducibility

Reproducibility in research on recommendation algorithms is important because it is the only tool for the research community to ensure the correctness of an empirical study.

A reviewer cannot guarantee that the gains reported in an empirical paper are correct. Exceptions are, for instance, theoretical studies, where mathematical properties of new models can be formally proved. Reproducibility ensures transparency and creates the possibility for the research community to confirm or refute what is stated in an empirical paper. Reproducibility can only work as a tool for verifying the correctness of claimed empirical results if two requirements are met: (i) papers must be reproducible and (ii) researchers must reproduce previous empirical results before relying on them for further progress.

According to the study by Gundersen et al. (2018), the vast majority of empirical works in AI in general is not well documented and thus not reproducible. Given the inconsistent use of the terms reproducibility and replication, Gundersen and Kjensmo (2018) define reproducibility as the “[...] ability of an independent research team to produce the same results using the same AI method based on the documentation made by the original research team.” Moreover, they define three levels of reproducibility, where level R1, termed Experiment Reproducibility, refers to a situation when “[...] the execution of the same implementation of an AI method produces the same results when executed on the same data.” In the other two levels of reproducibility, R2 and R3, independent researchers can either (i) obtain the same results for the same data using an alternative implementation, or (ii) obtain the same results using an alternative implementation on different data. R1 is thus the weakest level, as it does not assess the sensitivity of the AI method to implementation details, nor does it assess the ability of the method to generalize under different experimental conditions. In the rest of this discussion, whenever we talk about reproducibility, we refer to level R1 as defined by Gundersen and Kjensmo (2018), unless differently specified.

Research in recommender systems suffers from the same problems as general AI. In our previous studies (Ferrari Dacrema et al. 2019, 2021), we systematically scanned top-level conference series such as KDD, IJCAI or SIGIR for algorithmic works on the traditional top-N recommendation problem (Cremonesi et al. 2010). As mentioned above, less than 50% of the works were reproducible despite the high quality expectations for such conferences. On a positive note, the observed level of R1 reproducibility, which requires that code and data are shared by researchers, is much higher than in the study on general AI research, where Gundersen and Kjensmo (2018) for example found that the method code was only shared in 8% of the cases.

Note that sharing the method code, that is, the implementation of a recommendation machine learning model, and the used data is not necessarily sufficient to ensure R1 reproducibility. In our studies, we found that sometimes researchers shared some code, but the code was limited to a skeleton of the method and not executable. Furthermore, in the majority of the cases, no code was provided for data preprocessing, hyperparameter tuning, or for running the experiment including the compared baselines. The code of the baseline methods was actually missing in the majority of the cases. As a result, it remains difficult to understand what was done exactly in the experiment. Ultimately, any gains that are observed for a new method might, at worst, be the result of an incorrect implementation of the baseline methods. Often, also important pieces of information regarding the used hyperparameters, hyperparameter ranges or random seeds are missing in the provided documentation, see also (Henderson et al. 2018).

In our own studies, we also observed that almost all researchers implement their own code for the experimental evaluation. Before deep learning became popular, a number of libraries for reproducible evaluation were frequently used, such as LibRec or MyMediaLite (Guo et al. 2015; Gantner et al. 2011), which ensured that the evaluation procedures were inspected and quality-assured by more than one researcher. As a result of the constant re-implementation of evaluation code, technical mistakes and inconsistencies among evaluation methodologies are probably now more common.

As a result of their analysis of more than 400 papers, Gundersen (2020) points out that AI, like other scientific disciplines such as psychology, is affected by a reproducibility crisis. When interpreting a crisis positively, that is, as a turning point, there is hope that the situation improves. Actually, Gundersen and Kjensmo (2018) found some patterns of improvement between 2013 and 2016 in terms of R1 reproducibility (but not in the other dimensions). This indicates that researchers increasingly share their code, and the findings of our own studies corroborate this development.

Methodological issues

The second important dimension that may prevent us from obtaining true progress is related to methodological issues. We identified three main types of issues. The first set of problems is related to bad evaluation practices and other technical mistakes. The second set of problems is due to the choice of weak baselines for the empirical comparison with the proposed methods. The third set of problems can be attributed to the fact that the majority of published research works has no explicit research questions or hypotheses (Gundersen and Kjensmo, 2018).

Information Leakage

In machine learning, information leakage happens whenever information from the testing set is used in the training process. This type of information is not expected to be available at prediction time, causing evaluation metrics to overestimate the predictive accuracy of the method. Information leakage can occur at different levels, some obvious and easy to avoid, others more devious and difficult to detect.

One of the most common forms of information leakage we have encountered in our analysis concerns early stopping (Ferrari Dacrema et al. 2019, 2021). Early stopping should be implemented by evaluating the stopping criteria on a validation set different from the test set. However, 50% of the works we have analyzed evaluate the early stopping criteria on the test set itself, thus overfitting to the test set through the stopping criterion. Even more serious, some works report results in which, for each metric and cutoff, the best value is computed on the testing set and each value corresponds to a different epoch. In all the affected papers, this information leakage occurs only for the newly proposed methods but not for the baselines.
An equally problematic issue was recently reported in the analysis by Sun et al. (2020), who found that 37% of their examined works tuned the hyperparameters of their new method on the test set. If such a procedure were admissible, it would be trivial to come up with an algorithm for implicit feedback that uses a parameter for each user-item combination and then “learns” the best setting for this variable by looking at the test set.

These two examples of information leakage are easy to detect and control, and authors should be able to take care of them without too much effort. Other types of information leakage are more devious and concern the splitting of datasets into training, validation and test sets.
Probably the most often used experimental design in the entire literature on recommendation algorithms is to predict held-out ratings on one of the many publicly available datasets containing implicit or explicit ratings. For the evaluation, the rating dataset is typically randomly split into an 80% training and 20% test split. Such a random data split however leads to the effect that the training data contains future data to predict a past event¹, see also (Ji et al. 2020). Consider, for instance, the popular MovieLens dataset. If the training data by chance contains highly positive ratings by a user for “Rocky II” and “Rocky III”, the model might quite easily learn from the data that the user will also rate “Rocky I” highly. As a result, the observed prediction performance in such an experiment might be higher than when the model is applied in practice, where it is probably much more difficult to guess if a user will like “Rocky I” when it is released.
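The difference between the two splitting protocols can be made concrete in a few lines. The sketch below contrasts a purely random hold-out with a simple per-user temporal hold-out; it assumes a pandas DataFrame with `user`, `item`, and `timestamp` columns, which is an illustrative convention rather than the format of any specific dataset.

```python
import pandas as pd

def random_split(ratings: pd.DataFrame, test_fraction: float = 0.2, seed: int = 42):
    """Random hold-out: test interactions may lie in the past of training ones,
    so the model can 'postdict' events it should not know about.
    Assumes the DataFrame has a unique index."""
    test = ratings.sample(frac=test_fraction, random_state=seed)
    train = ratings.drop(test.index)
    return train, test

def temporal_split(ratings: pd.DataFrame, test_fraction: float = 0.2):
    """Temporal hold-out per user: the most recent interactions of every user
    go to the test set, so no training interaction lies in the test future."""
    ratings = ratings.sort_values("timestamp")
    test_size = (ratings.groupby("user")["item"].transform("size") * test_fraction).astype(int)
    rank_from_end = ratings.groupby("user").cumcount(ascending=False)  # 0 = most recent per user
    is_test = rank_from_end < test_size
    return ratings[~is_test], ratings[is_test]
```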

The above problems concerning information leakage are part of a wider set of problems relating to evaluation methodologies, problems which we have identified in many works and which are sometimes significant, sometimes of little relevance. In several cases, the selection of negative samples was problematic, leading to a situation in which a different number of negative samples was used for different users. In other cases, we found that researchers implemented accuracy measures in uncommon ways. In one of the major cases, we found that the train-test split provided by the researchers was not documented, not reproducible and not consistent with any of the best practices about dataset partitioning. In all of these cases, the reported gains over baselines vanish when replicating the experiments with more conventional approaches.

Weak baselines

One of the most common problems that leads to an illusion of progress in research on recommender systems is related to the use of weak baselines. In some works the baselines are inherently weak, in other works the baselines are potentially stronger than reported, but are not adequately tuned for the problem under consideration. Lin (2019) and Yang et al. (2019), while recognizing the progress made by transformer models for certain tasks of document retrieval, observed that it is not rare to find articles comparing neural ranking algorithms only against weak baselines, raising questions regarding the empirical rigor in the field. Rendle et al. (2019) found that many algorithms that were published between 2015 and 2019 actually did not outperform longer-existing baselines, when these were properly tuned.

The apparent lack of progress can be attributed to two factors: the choice of the baselines and the lack of proper optimization of the baselines. Ultimately, these phenomena can lead to a cascade effect, where only the most recent models are considered as the state-of-the-art even though they actually do not outperform existing models.

Choice of baselines
Hundreds of recommendation algorithms may have been designed over the past decade, and determining what represents the state-of-the-art is inherently difficult, because there is actually no “best” algorithm in general, as will be discussed below. Another observation is that, in many works, only the latest complex machine learning algorithms are considered as relevant baselines. For instance, in (Ferrari Dacrema et al. 2021) it was observed that more than 80% of the papers on deep-learning recommender systems use other deep learning algorithms as the only baselines. In that context, a key question is whether a specific family of algorithms is the best available to solve the problem addressed in the paper, or whether there are other families of better algorithms to use as stronger baselines. Simpler and slightly older methods are often ignored as possible baselines, although some of them have been published in top-level venues and often lead to strong results. To some extent, this phenomenon might also be a result of our publication process, where reviewers might mainly ask for the consideration of the latest models as baselines, thereby implicitly suggesting that older approaches are outdated.

Lack of proper tuning of baselines
Even in cases when suitable baseline algorithms are selected in a comparison, a frequently occurring issue is that the baselines used in the experimental evaluation are not properly tuned. This is probably the most common issue observed in our and other studies, and is not specifically tied to recommender algorithms. Researchers apparently invest significant efforts in tuning the hyperparameters of their own new method but do not always pay the same attention to the tuning of baselines. One commonly found mistake is that of authors using hyperparameter settings reported as optimal in a previous article, although these settings refer to experimental conditions other than those currently considered.
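A simple safeguard against this asymmetry is to push every method in the comparison, baselines included, through the same search procedure with the same budget on the same validation split. The sketch below illustrates the idea with a generic random search; `build_model`, `score`, and the commented constructor names such as `MyNewModel` and `ItemKNN` are hypothetical placeholders, not references to actual implementations.

```python
import random

def random_search(build_model, param_space, train, valid, score, n_trials=50, seed=0):
    """Evaluate `n_trials` random hyperparameter configurations and return the
    best one according to the validation score. `build_model(**params)` must
    return an object with a `fit(train)` method; nothing is tuned on the test set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        model = build_model(**params)
        model.fit(train)
        trial_score = score(model, valid)
        if trial_score > best_score:
            best_params, best_score = params, trial_score
    return best_params, best_score

# Crucially, the same budget is spent on every method in the comparison,
# including the baselines (hypothetical constructors, shown for illustration only):
# for name, (builder, space) in {"new_model": (MyNewModel, NEW_SPACE),
#                                "itemknn_baseline": (ItemKNN, KNN_SPACE)}.items():
#     best, _ = random_search(builder, space, train, valid, ndcg_at_10, n_trials=50)
```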
Lack of explicit research questions

Gundersen and Kjensmo (2018) report that less than 10% of the AI papers analyzed in their study contained an explicit statement regarding the addressed research goals and objectives. Similar trends exist in the recommender systems literature, where the research goal is only stated implicitly. Usually, the implicit goal is to create better recommender systems by making better relevance predictions, that is, to obtain higher accuracy.

A common claim in most algorithmic research works in recommender systems is that “our method significantly outperforms the state-of-the-art.” Usually, these improvements are attributed to a novel model that better captures certain patterns in the data and/or the creative consideration of existing or additional types of information. These improvements are then evidenced by reporting empirical results in which the new model is benchmarked against a set of baseline algorithms. Unfortunately, the way the experiments are designed cannot inform us if a method improves the state-of-the-art as claimed. The observed performance and ranking of algorithms can depend on a multitude of factors, including the characteristics of the used datasets, the chosen performance measures, the cut-off length for the measures, data pre-processing, specifics of the evaluation protocol like the data splitting procedure, or even the metric sampling approach (Krichene and Rendle, 2020; Rossetti et al. 2016).² For all these aspects, no standards exist, and recent papers highlight that researchers use all sorts of experimental configurations in their experiments, often without a theory-guided justification for their choices (Ferrari Dacrema et al. 2021; Sun et al. 2020). Therefore, what can be shown in a paper is that some proposed model outperforms existing approaches in very specific experimental settings, many of which can be arbitrarily chosen by the researcher. One possible consequence of this freedom is that researchers may be subject to psychological biases, most importantly a confirmation bias. In such a situation, researchers may unconsciously seek evidence that supports their hoped-for outcomes, that is, that their new model works, and this psychological phenomenon might guide their search for corresponding experimental configurations.

Another problem related to the lack of explicit research questions is the failure to identify the sources of empirical gains, as reported by Lipton and Steinhardt (2019). In many works, together with a new sophisticated algorithm, the authors propose a number of other empirical and apparently minor contributions, such as optimization heuristics, clever data-preprocessing, or extensive hyperparameter tuning. Too often, just one of the minor contributions is actually responsible for the performance gain, but the lack of proper ablation studies gives the false impression that all of the proposed changes are necessary.

Theoretical flaws

Besides reproducibility and methodological problems, our field is sometimes also plagued by theoretical issues.

Using unsuited models for the data

Research on novel recommendation algorithms sometimes involves speculations that do not properly undergo scientific scrutiny. For instance, in recent years, researchers have explored the use of convolutional neural networks (CNNs) for recommendation problems. The application of CNNs relies on the assumption that there is some form of relationship between neighboring or nearby data elements. One specific idea that was explored in at least three recent papers—all presented at IJCAI conferences—is to combine latent factor models (embeddings) with the power of CNNs. The underlying assumption in these papers is that there are correlations in the latent factor models. To consider these correlations in the models, the authors propose to apply convolutions over user-item interaction maps that are obtained from the outer product of their embeddings.

According to the results presented in the respective papers, CNNs have helped to significantly improve recommendation accuracy.

When looking closer at how the user-item interaction maps are usually created, one can however observe that the order of the elements in the embeddings carries no semantic meaning. Therefore, any performance gain observed through the application of CNNs cannot be attributed to correlations between nearby elements, but only to the fact that neural networks act as universal approximators. As shown by Ferrari Dacrema et al. (2020b), perturbing the order of the elements in the embeddings, as expected, has no effect on recommendation accuracy. These findings do not rule out that CNNs can be useful for various recommendation tasks. In the mentioned cases, however, the claims made in research papers are not theoretically valid. Overall, this points to a more general issue in today's research practice in applied machine learning, where the goal of improving prediction accuracy is very often not accompanied by the questions “What worked?” and “Why?”, see also (Lipton and Steinhardt, 2019). In general, while it might often be difficult to theoretically prove how exactly a particular model contributes to the overall performance, suitable ablation studies may usually help to discover potential issues empirically.
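The sanity check described above can be expressed in a few lines. The sketch below builds the outer-product interaction map from a user and an item embedding and then applies the same random permutation to the latent dimensions of both; this is our own illustration of such a permutation test, not the code used in the cited study. If a model's accuracy is unchanged when it is trained on permuted maps, the reported gains cannot stem from spatial correlations between nearby map entries.

```python
import numpy as np

def interaction_map(user_emb: np.ndarray, item_emb: np.ndarray) -> np.ndarray:
    """Outer product of a user and an item embedding: entry (i, j) equals
    user_emb[i] * item_emb[j], the structure fed to the CNN in these papers."""
    return np.outer(user_emb, item_emb)

def permuted_interaction_map(user_emb: np.ndarray, item_emb: np.ndarray, seed: int = 0) -> np.ndarray:
    """Permutation test: shuffle the latent dimensions of both embeddings with
    one fixed random permutation before building the map. Since the ordering of
    latent factors carries no semantic meaning, a model whose gains truly came
    from local spatial structure in the map should degrade on permuted inputs."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(user_emb))       # same permutation for rows and columns
    return interaction_map(user_emb[perm], item_emb[perm])
```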
Unsuited experiment designs for the research goals

The experimental analyses reported in many papers deviate significantly from the claimed or implicit research goals. Probably the most typical example is that of research papers that propose new algorithms for implicit-feedback recommendation scenarios. The goal of these research works is to demonstrate that a new model is better than previous ones at predicting whether a user will like an item or not. When it comes to the evaluation, it is however not uncommon that researchers take an explicit-feedback dataset from MovieLens and transform all ratings, including the negative ones, to positive signals. As a result, what is measured here is how good an algorithm is at predicting who will rate which item next. This, however, is almost never the research goal. Likewise, many sequential recommendation approaches are also based on MovieLens data. Here, the sequence of events is determined by the timestamp of the rating, which can be days, months, or years after a movie was watched. Again, in these approaches, researchers predict who will rate what next.

One could argue that such subtleties are not important because what matters is the ranking of the algorithms, which would probably be the same if only the very positive ratings, for example, those with 4 or 5 stars, were considered as positive signals. While the ranking might indeed be the same in some cases, we believe that it is unimaginable in other scientific disciplines to make claims regarding the power of a prediction model for a given research question when the underlying experiment is actually not suited to answer the question. The fact that this seems to be acceptable in our domain somehow indicates problems regarding rigor and our scientific method.

Unreliable conclusions from sampled metrics

Offline evaluation of modern machine learning algorithms can be computationally demanding when there are many items in the database. This is in particular the case when ranking or classification measures are used. To determine such measures, one has to rank all items in the catalog for each user, which usually means that relevance scores for potentially tens of thousands of items have to be computed per user. To avoid this problem, researchers commonly apply an evaluation procedure where the positive items of a user in the test set—these are typically not many—are ranked together with N randomly sampled negative items, that is, items where no preference (or a negative one) is available for the user. (Koren, 2008) is an early example where this approach was taken, using N = 1000.

This evaluation approach is very common today. However, often much smaller values for N are chosen, for example, 50 or 100. Only recently, the question was raised if the results obtained through such sampled metrics actually correspond to the algorithm ranking that one would obtain when all items were considered, that is, when the exact metric is used. In their analysis, Krichene and Rendle (2020) showed that the sampled metrics are inconsistent with their exact version, which means that statements like “Recommender A is better than Recommender B” cannot be reliably derived when using such metrics. This finding therefore may have major implications for what we believe to know about the relative ranking of algorithms, which is the main focus of almost all published algorithms research.
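The gap between the exact and the sampled protocol is easy to state in code. The sketch below computes the rank of a held-out positive item once against the full catalog and once against N randomly sampled negatives; `scores` is an assumed vector of model scores over all catalog items, and the functions are illustrative rather than an implementation of any published evaluation library.

```python
import numpy as np

def exact_rank(scores: np.ndarray, pos_item: int) -> int:
    """Rank of the positive item when it competes against the whole catalog
    (1 = best). `scores` holds the model's score for every catalog item."""
    return int(np.sum(scores > scores[pos_item])) + 1

def sampled_rank(scores: np.ndarray, pos_item: int, n_negatives: int = 100, seed: int = 0) -> int:
    """Rank of the positive item when it competes only against `n_negatives`
    randomly sampled other items, as in the sampled evaluation protocol."""
    rng = np.random.default_rng(seed)
    candidates = [i for i in range(len(scores)) if i != pos_item]
    negatives = rng.choice(candidates, size=n_negatives, replace=False)
    return int(np.sum(scores[negatives] > scores[pos_item])) + 1

# Illustration: an item with exact rank around 500 in a 10,000-item catalog is
# beaten, on average, by only about 5 of 100 sampled negatives, so it appears
# within the top 10 of the sampled candidate list and inflates cutoff metrics.
```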
HOW COULD THIS HAPPEN?

To recap, other researchers and we ourselves have found that in many cases the latest and more complex machine learning models presented at prestigious conferences are often not actually better than previously existing methods. Similar observations were also made for non-neural recommendation methods by Rendle et al. (2019), for information retrieval research by Yang et al. (2019) and by Armstrong et al. (2009), and for other areas like time-series forecasting or certain clinical prediction tasks by Bellamy et al. (2020) and Makridakis et al. (2018).

Many factors might contribute to these phenomena, which are seemingly not tied to applied machine learning research in recommender systems or to deep learning. Here, we provide some thoughts and speculations regarding what might have put us in this unsatisfactory situation.

One quite obvious reason for limited reproducibility is that providing reproducible artifacts is work-intensive and there is little reward. The code that is written during the process might have been developed while the researcher was still learning about the technology, it may contain code and method variants that are no longer used as they did not lead to success, or contain programming hacks that were introduced under time pressure before a deadline. Preparing and re-working all artifacts into a sharable form, including the creation of instructions for installation and execution, might therefore seem to be a time investment that does not pay off for the researcher. Furthermore, when the provision of source code is strongly encouraged for paper submissions at a conference or is even a reviewing criterion, researchers might only prepare a minimal set of artifacts for sharing, that is, the core code of their method but not the other material that would be needed to reproduce the experiments. Not being able or willing to invest more time in reproducibility or in systematically optimizing the baselines may also have to do with our academic incentive system and a corresponding publication pressure. An additional reason for not sharing the code might lie in a certain fear by researchers that their code contains mistakes or methodological issues. If such a mistake is detected, this may harm the reputation of the researcher, in particular if a found mistake invalidates the claimed findings. Finally, experimental activities, in particular programming, might often be carried out by young researchers who may not be fully aware of the possible impact on the results when they do not meticulously follow best evaluation practices. For instance, they might not be aware of subtleties that can lead to information leakage.

Considering the above-mentioned methodological issues and bad research practices, it seems that there are too few incentives (or too little pressure) in our community, for example, from reviewers, program chairs, and journal editors, to improve these aspects. The general research methodology seems agreed-upon, and it appears that much more focus is put on the details of a fancy new technical contribution than on double-checking that the research methodology is appropriate to support a claim. For example, not reporting if and how the baseline methods were optimized is common; not reporting the results of statistical significance tests is not considered an issue; not justifying why certain datasets, measures, or cutoffs are used for the evaluation is acceptable. Finally, formulating a specific research question or laying out hypotheses, which are key pillars in scientific research elsewhere, is less than common.

In some ways, our research approach sometimes seems to boil down to a leaderboard-chasing culture, where the goal is to obtain an accuracy improvement on some dataset, no matter where the improvements come from or whether they would actually matter in practice. In particular this latter aspect was discussed years ago by Wagstaff (2012), who advocated that we should more often focus on “machine learning that matters”. Symptoms of machine learning research that has limited impact in practice include a hyper-focus on benchmark datasets and machine learning competitions, a hyper-focus on abstract accuracy measures, and a limited connection to real-world problems. A recent analysis of the (often limited) correspondence of the offline and online performance of recommendation algorithms can be found in (Krauth et al. 2020), see also (Rossetti et al. 2016) and (Cremonesi et al. 2013).

Furthermore, the academic incentive system and publication culture focus on a “machine learning contribution”, but do not reward researchers enough for performing all the other activities that are needed to develop something that has impact in the real world. Similar observations were made more recently by Kerner (2020), who found that AI research too often considers that real-world applications are not relevant. While this is probably a general issue in AI—see (D'Amour et al. 2020) for a discussion of the often poor performance of machine learning models when deployed in practice—it is particularly troubling in the context of recommender systems research, which by definition is oriented towards a particular and very successful application of machine learning in practice.

Finally, some of the observed problems might also stem from the current situation, where conferences receive thousands of submissions, and where it might be very challenging to recruit a sufficient number of senior reviewers that would be required to critically evaluate the research methodology. In our own experience, early-stage researchers who act as reviewers more often focus strongly on technical innovation and novel models, and put less thought on evaluation aspects, as the standards for evaluation seem agreed-upon and of secondary relevance.

RECOMMENDED ACTION

What follows are a number of suggestions, based on our personal experiences, on what we, as a community, could do to counter these trends, besides simply suggesting that authors should refrain from today's bad practices.

Best practices for reproducibility

Regarding reproducibility in AI in general, Gundersen et al. (2018) provide a set of best practices, recommendations, and a detailed checklist for authors. In the following, we summarize the main points and add specific aspects that are relevant in the context of recommender systems research.

According to the checklist by Gundersen et al. (2018), the data used in experiments should be publicly available, be documented, possibly include license information, and should have a DOI or persistent URL attached. Since in recommender systems research the raw data is very often preprocessed, for example, to reduce the sparsity, authors should also publicly share the preprocessed data. Furthermore, sharing the used data splits and the negative samples used in the experiments can be very helpful to ensure reproducibility even when the splitting and sampling procedures are described in the paper.

The requirements of public accessibility, documentation, licences and permanent reference hold for the method code as well. Moreover, the code (and the data) should be provided in a way that re-running the experiments reported in the paper is as easy as possible. This requires that all code is shared: the code of the new method, the baseline code, the code used for data pre-processing, the one used for hyperparameter optimization, and the code used for evaluation. Moreover, the code should be accompanied by relevant instructions regarding software or hardware requirements, installation scripts, and prepared scripts and configuration files to run the experiments. In case a complex software environment is required to run the experiments, the provision of a Docker image or similar formats is advisable. In terms of additional documentation that might not fit into a research paper, authors should provide their optimal hyperparameter settings and the ranges they have explored on a validation set.
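Several of these items can be captured in one small, shareable artifact, for example an experiment manifest that records the split, the seeds, the explored ranges, and the finally selected values for the new method and the baselines alike. The fragment below is purely hypothetical: the field names and values are invented for illustration and do not follow any particular tool or the setup of any cited paper.

```python
# Hypothetical experiment manifest: a single place that records what would be
# needed to re-run the study (dataset version, split, seeds, search ranges,
# and the finally selected hyperparameters for the new method and the baselines).
EXPERIMENT_CONFIG = {
    "dataset": {"name": "movielens-20m", "preprocessing": "min 5 interactions per user"},
    "split": {"strategy": "temporal per-user holdout", "test_fraction": 0.2,
              "files": ["train.csv", "valid.csv", "test.csv"]},
    "random_seeds": [101, 202, 303],
    "metrics": {"ndcg_cutoff": 10, "recall_cutoff": 10, "negative_sampling": None},  # None = exact metrics
    "models": {
        "proposed_method": {
            "search_space": {"embedding_size": [32, 64, 128], "learning_rate": [1e-4, 1e-3, 1e-2]},
            "selected": {"embedding_size": 64, "learning_rate": 1e-3},
        },
        "itemknn_baseline": {
            "search_space": {"neighbors": [50, 100, 200, 500], "shrinkage": [0, 10, 100]},
            "selected": {"neighbors": 200, "shrinkage": 10},
        },
    },
}
```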
Note that (standardized) benchmarks are often considered a possible cure to many of the mentioned problems. In the best case, the existence of such benchmarks could relieve researchers from the burden of running all previous baselines; they would just have to report their new results. Making a dataset publicly available is, however, not enough to achieve this goal. It requires that a specific train-test split is provided and that the exact evaluation procedure, including the metrics, is known. In the best case, the code to be used for the evaluation should be standardized as well, because small implementation details can make a difference. Besides these challenges, the use of benchmark problems can in general be a double-edged sword, where researchers focus too much on a very specific known problem setting and hyperparameter tuning, which might prevent them from exploring other and probably more important research questions.

On a positive note, we can observe increased awareness in recent years in the community regarding the potential helpfulness of best practices, research guidelines and reproducibility checklists, see for example (Pineau et al. 2020). Besides more general guidelines, as summarized in Table 1, also more specific best practices for certain aspects of machine learning research were recently proposed, for example, in (Lindauer and Hutter, 2019).

Recommendations for chairs, editors, and reviewers

With this work, our goal is to make conference chairs, journal editors, and reviewers better aware of the severity of today's issues with respect to reproducibility and methodology. As a first step forward, we believe that more emphasis on these topics should be placed both in the call for contributions and in the reviewing process. In particular, reproducibility should be a key criterion in every evaluation of algorithmic contributions. Conference chairs and journal editors should therefore provide clear guidelines to reviewers on what level of reproducibility is expected and on how much weight should be given to this aspect when evaluating a paper.

Certainly, there might be reasons why only parts of the code and the data can be shared, but these reasons should be well explained. In particular, researchers from industry often face the problem that they are not allowed to share the materials that they used in their experiments. The involvement of industry in recommender systems research, with no doubt, is highly important and absolutely necessary to move our field forward. It might therefore be advisable to create dedicated opportunities for contributions from industry, where it is clear that these works do not necessarily meet the needed reproducibility standards.

Regarding dedicated publication outlets, we furthermore encourage the creation of opportunities to publish reproducibility studies, for example, in the form of a special conference track. Correspondingly, the papers submitted to such tracks should be evaluated differently from regular submissions, where, for example, novelty aspects are not in the focus.

For reviewers, once they are aware of the existing issues, we expect that they will more often have a closer look at evaluation aspects when assessing a paper. We speculate that reviewers who work on recommendation algorithms themselves are more interested in the novelty and the particularities of the proposed technical approach than in the finer details of the experimental setup. With clear guidelines provided by the chairs and editors, they can be pointed to additional aspects that should be checked, for example, whether the work is reproducible or whether the chosen experimental design is suited to answer the research question that is asked in a paper.

TABLE 1  Reproducibility checklist
(1) Data: Share well-documented data, including
  1.1 Raw dataset, with meta-data, license information and permanent link
  1.2 Processed dataset, including train, validation and test sets, as used to obtain the reported results
  1.3 Other data samples used in the evaluation, for example, negative samples
(2) Code: Share documented code, including
  2.1 Code for the proposed method
  2.2 Code for the baselines
  2.3 Code used for data pre-processing
  2.4 Code for hyperparameter optimization
  2.5 Evaluation code
(3) Execution instructions: Share additional instructions for re-running the experiment
  3.1 Hardware and software requirements, including all external libraries necessary to run the code
  3.2 Installation instructions
  3.3 Scripts for installation and experiment execution
(4) Experiment documentation: Share additional details of the experiment
  4.1 Final hyperparameter settings and explored ranges (new method and baselines)
  4.2 Detailed experimental results
  4.3 Any additional required artifacts, for example, configuration files

Finally, as our machine learning models increasingly become more complex, some reviewers might be misled into thinking that the complexity of a model is always positively correlated with its effectiveness. This has probably led to a certain level of “mathiness”, defined as a “tangling of formal and informal claims”, which is observed by Lipton and Steinhardt (2019) for machine learning research in general. Instead, the principle of Occam's Razor should be kept in mind. Translated to our problem, this means that one should prefer simple over complex models, provided that they perform equally well. Today, we unfortunately still too often observe a self-reinforcing pattern, where researchers may intrinsically have a certain fondness for complex models and reviewers at the same time ask for or appreciate these more fancy models.

Recommendations for researchers and the community

We believe that today's issues regarding the lack of reproducibility and progress observed in our previous studies are mainly related to a lack of awareness and attention on these topics in our community. In the future, when the community appropriately rewards and incentivizes the additional efforts by researchers, we hope that it will be “natural” for researchers to ensure that their work is reproducible and thus verifiable. Providing all artifacts for reproducibility should in the long run also serve today's scientific career mechanics, as reproducible work might be more often used as a basis for innovation and comparisons by other researchers.

However, to achieve true progress in our field, ensuring reproducibility of new algorithms is not enough. Accuracy optimization for top-N recommendation tasks has been done for about twenty-five years now, with a large fraction of research works using movie rating datasets for the evaluations. The question has been raised before whether there is a “magic barrier” beyond which we cannot improve our algorithms anymore, for example, because of the natural noise in the data (Said et al. 2012; Amatriain et al. 2009). Moreover, there are various works that indicate that these usually slight improvements in accuracy obtained in offline experiments might not matter in practice, and a number of user studies suggest that higher accuracy does not correlate positively with the users' quality perceptions (Cremonesi et al. 2012; Krauth et al. 2020). In recent years, we might have expanded our research operationalization from matrix completion to alternative scenarios like session-based or sequential recommendation, and we have explored questions of recommendation diversity and novelty. The problems of offline evaluations however remain the same.

To achieve true progress, we believe that researchers in recommender systems should more often focus on problems “that matter”, in the sense of (Wagstaff, 2012). In our opinion, this means that, in principle, we should more often first work with specific problems in an application domain—ultimately recommender systems research is an applied discipline—and only then try to generalize.

We can no longer try to solve the problem of finding “the best algorithm”, because it does not exist. Whether an algorithm is preferred over another one basically cannot be judged in isolation, because we have to take into account what the purpose of the recommendations is in the first place (Jannach and Adomavicius, 2016). Today, we mainly measure what is easy to measure, that is, accuracy in offline experiments, but this seems to be too limited to make most of our research impactful in practice. We therefore highly encourage researchers to leave established paths and focus on more interesting and relevant problems, see, for example, (Jannach and Bauer, 2020) for potential ways forward.

CONCLUSION

A number of recent studies indicate that research in recommendation algorithms has reached a certain level of stagnation. In this paper, we have reviewed the possible reasons that lead to the effect that many new algorithms are published, even at top-level outlets, for which it is ultimately not clear if they really improve the state-of-the-art. Among the most important causes, we have identified limited reproducibility, weak baselines, improper evaluation methodologies and lack of explicit research questions. Furthermore, we outlined a number of research best practices to avoid the “phantom progress” due to low levels of reproducibility and the use of improper methodologies. Ultimately, however, we do not only need better research practices, but we have to re-think how we do algorithms research in recommender systems, where we have to shift from a hyper-focus on accuracy in offline experiments or on already well-explored problems, to questions that really matter and have an impact in practice.

NOTES
¹ (Jannach et al. 2016) used the term “postdict”.
² Recently, fundamental questions were also raised by Ferrante et al. (2021) and others regarding the validity of experimental results that are obtained with common information retrieval measures like Precision and Recall, which are also commonly used in recommender systems evaluations.

REFERENCES
Amatriain, X., J. M. Pujol, N. Tintarev, and N. Oliver. 2009. “Rate It Again: Increasing Recommendation Accuracy by User Re-Rating.” In Proceedings of the Third ACM Conference on Recommender Systems, RecSys '09, 173–80.
Armstrong, T. G., A. Moffat, W. Webber, and J. Zobel. 2009. “Improvements That Don't Add Up: Ad-hoc Retrieval Results Since 1998.” In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), 601–10.
Beel, J., C. Breitinger, S. Langer, A. Lommatzsch, and B. Gipp. 2016. “Towards Reproducibility in Recommender-Systems Research.” User Modeling and User-Adapted Interaction 26(1): 69–101.
Bellamy, D., L. Celi, and A. L. Beam. 2020. “Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data.” CoRR, abs/2010.01149. https://arxiv.org/abs/2010.01149.
Cremonesi, P., Y. Koren, and R. Turrin. 2010. “Performance of Recommender Algorithms on Top-N Recommendation Tasks.” In Proceedings of the 4th ACM Conference on Recommender Systems (RecSys '10), 39–46.
Cremonesi, P., F. Garzotto, and R. Turrin. 2012. “Investigating the Persuasion Potential of Recommender Systems from a Quality Perspective: An Empirical Study.” Transactions on Interactive Intelligent Systems 2(2): 1–41.
Cremonesi, P., F. Garzotto, and R. Turrin. 2013. “User-Centric vs. System-Centric Evaluation of Recommender Systems.” In IFIP Conference on Human-Computer Interaction, 334–51. Springer.
D'Amour et al. 2020. “Underspecification Presents Challenges for Credibility in Modern Machine Learning.” CoRR, abs/2011.03395. https://arxiv.org/abs/2011.03395.
Ekstrand, M. D., M. Ludwig, J. A. Konstan, and J. T. Riedl. 2011. “Rethinking the Recommender Research Ecosystem: Reproducibility, Openness, and LensKit.” In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, 133–40.
Ferrante, M., N. Ferro, and N. Fuhr. 2021. “Towards Meaningful Statements in IR Evaluation. Mapping Evaluation Measures to Interval Scales.” CoRR, abs/2101.02668. https://arxiv.org/abs/2101.02668.
Ferrari Dacrema, M., P. Cremonesi, and D. Jannach. 2019. “Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches.” In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys '19), 101–9.
Ferrari Dacrema, M., P. Cremonesi, and D. Jannach. 2020a. “Methodological Issues in Recommender Systems Research (Extended Abstract).” In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 4706–10.
Ferrari Dacrema, M., F. Parroni, P. Cremonesi, and D. Jannach. 2020b. “Critically Examining the Claimed Value of Convolutions over User-Item Embedding Maps for Recommender Systems.” In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM 2020).
Ferrari Dacrema, M., S. Boglio, P. Cremonesi, and D. Jannach. 2021. “A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research.” ACM Transactions on Information Systems (TOIS) 39(2): 1–49.
Gantner, Z., S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. 2011. “MyMediaLite: A Free Recommender System Library.” In Proceedings of the Fifth ACM Conference on Recommender Systems, 305–8.
Gundersen, O. E. 2020. “The Reproducibility Crisis Is Real.” AI Magazine 41(3): 103–6.
Gundersen, O. E., and S. Kjensmo. 2018. “State of the Art: Reproducibility in Artificial Intelligence.” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 1644–51.
Gundersen, O. E., Y. Gil, and D. W. Aha. 2018. “On Reproducible AI: Towards Reproducible Research, Open Science, and Digital Scholarship in AI Publications.” AI Magazine 39(3): 56–68.
Guo, G., J. Zhang, Z. Sun, and N. Yorke-Smith. 2015. “LibRec: A Java Library for Recommender Systems.” In Posters, Demos, Late-breaking Results and Workshop Proceedings of the 23rd Conference on User Modeling, Adaptation, and Personalization (UMAP 2015).

He, X., L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. 2017. “Neural Collaborative Filtering.” In Proceedings of the 26th International Conference on World Wide Web (WWW '17), 173–82.
Henderson, P., R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. 2018. “Deep Reinforcement Learning That Matters.” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 3207–14.
Jannach, D., and G. Adomavicius. 2016. “Recommendations with a Purpose.” In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys 2016), 7–10, Boston, Massachusetts, USA.
Jannach, D., and C. Bauer. 2020. “Escaping the McNamara Fallacy: Towards More Impactful Recommender Systems Research.” AI Magazine 40(4).
Jannach, D., P. Resnick, A. Tuzhilin, and M. Zanker. 2016. “Recommender Systems - Beyond Matrix Completion.” Communications of the ACM 59(11): 94–102.
Ji, Y., A. Sun, J. Zhang, and C. Li. 2020. “A Critical Study on Data Leakage in Recommender System Offline Evaluation.” CoRR, abs/2010.11060. https://arxiv.org/abs/2010.11060.
Kerner, H. 2020. “Too Many AI Researchers Think Real-World Problems Are Not Relevant.” MIT Technology Review, August 2020. https://www.technologyreview.com/2020/08/18/1007196/ai-research-machine-learning-applications-problems-opinion/.
Konstan, J. A., and G. Adomavicius. 2013. “Toward Identification and Adoption of Best Practices in Algorithmic Recommender Systems Research.” In Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation, 23–8.
Koren, Y. 2008. “Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model.” In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), 426–34.
Kouki, P., I. Fountalis, N. Vasiloglou, X. Cui, E. Liberty, and K. Al Jadda. 2020. “From the Lab to Production: A Case Study of Session-Based Recommendations in the Home-Improvement Domain.” In Fourteenth ACM Conference on Recommender Systems, RecSys '20, 140–9.
Krauth, K., S. Dean, A. Zhao, W. Guo, M. Curmei, B. Recht, and M. I. Jordan. 2020. “Do Offline Metrics Predict Online Performance in Recommender Systems?” CoRR, abs/2011.07931. https://arxiv.org/abs/2011.07931.
Krichene, W., and S. Rendle. 2020. “On Sampled Metrics for Item Recommendation.” In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, 1748–57.
Lin, J. 2019. “The Neural Hype and Comparisons Against Weak Baselines.” ACM SIGIR Forum 52(2): 40–51.
Lindauer, M., and F. Hutter. 2019. “Best Practices for Scientific Research on Neural Architecture Search.” CoRR, abs/1909.02453. http://arxiv.org/abs/1909.02453.
Lipton, Z. C., and J. Steinhardt. 2019. “Research for Practice: Troubling Trends in Machine-Learning Scholarship.” Communications of the ACM 62(6): 45–53.
Ludewig, M., S. Latifi, N. Mauro, and D. Jannach. 2020. “Empirical Analysis of Session-Based Recommendation Algorithms.” User Modeling and User-Adapted Interaction 31(3): 49–181.
Makridakis, S., E. Spiliotis, and V. Assimakopoulos. 2018. “Statistical and Machine Learning Forecasting Methods: Concerns and Ways Forward.” PLoS ONE 13(3): 1–26. https://doi.org/10.1371/journal.pone.0194889.
Ning, X., and G. Karypis. 2011. “SLIM: Sparse Linear Methods for Top-N Recommender Systems.” In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM '11), 497–506.
Pineau, J., P. Vincent-Lamarre, K. Sinha, V. Larivière, A. Beygelzimer, F. d'Alché Buc, E. Fox, and H. Larochelle. 2020. “Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program).” CoRR, abs/2003.12206. https://arxiv.org/abs/2003.12206.
Rendle, S., L. Zhang, and Y. Koren. 2019. “On the Difficulty of Evaluating Baselines: A Study on Recommender Systems.” CoRR, abs/1905.01395. http://arxiv.org/abs/1905.01395.
Rendle, S., W. Krichene, L. Zhang, and J. Anderson. 2020. “Neural Collaborative Filtering vs. Matrix Factorization Revisited.” In Proceedings of the 14th ACM Conference on Recommender Systems (RecSys '20).
Resnick, P., N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. 1994. “GroupLens: An Open Architecture for Collaborative Filtering of Netnews.” In Proceedings of the 1994 ACM Conference on Computer-Supported Cooperative Work (CSCW '94), 175–86.
Rossetti, M., F. Stella, and M. Zanker. 2016. “Contrasting Offline and Online Results when Evaluating Recommendation Algorithms.” In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16), 31–4.
Said, A., B. J. Jain, S. Narr, and T. Plumbaum. 2012. “Users and Noise: The Magic Barrier of Recommender Systems.” In User Modeling, Adaptation, and Personalization, 237–48. Berlin, Heidelberg: Springer.
Sun, Z., D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, and C. Geng. 2020. “Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison.” In Fourteenth ACM Conference on Recommender Systems, RecSys '20, 23–32.
Wagstaff, K. 2012. “Machine Learning That Matters.” In Proceedings of the 29th International Conference on Machine Learning (ICML '12), 529–36.
Wu, S., Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan. 2019. “Session-Based Recommendation with Graph Neural Networks.” In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, 346–53.
Yang, W., K. Lu, P. Yang, and J. Lin. 2019. “Critically Examining the 'Neural Hype': Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models.” In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1129–32.

AUTHOR BIOGRAPHIES

Paolo Cremonesi (paolo.cremonesi@polimi.it) is professor of Computer Science Engineering at Politecnico di Milano, Italy. His current research interests focus on recommender systems and quantum computing. He is the coordinator of the new-born Quantum Computing lab at Politecnico di Milano.

Dietmar Jannach (dietmar.jannach@aau.at) is a professor of computer science at the University of Klagenfurt, Austria.

His research is focused on practical applications of artificial intelligence, with a focus on recommender systems. He is also the leading author of the first textbook on recommender systems published with Cambridge University Press.

How to cite this article: Cremonesi, P., and Jannach, D. 2021. “Progress in recommender systems research: Crisis? What crisis?” AI Magazine 42: 43–54. https://doi.org/10.1609/aaai.12016
