
Anchoring Effects of Recommender Systems

Jingjing Zhang
Department of Information and Decision Sciences
University of Minnesota
Minneapolis, MN
jingjing@umn.edu

ABSTRACT
We explore how consumer preferences at the time of consumption are impacted by predictions generated by recommender systems. We conducted three controlled laboratory experiments to explore the effects of system recommendations on preferences. Results provide strong evidence that the rating provided by a recommender system serves as an anchor for the consumer's constructed preference. Consumers' preferences appear malleable and can be significantly influenced by the recommendation received. Additionally, the effects of pure number-based anchoring can be separated from the effects of the perceived reliability of a recommender system. In particular, when the recommender system was described to the participants as being in a testing phase, the anchoring effect was reduced. Finally, the effect of anchoring is roughly continuous, operating over a range of perturbations of the system's predictions.

Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human factors; H.3.3 [Information Search and Retrieval]: Relevance feedback.

General Terms
Human Factors, Experimentation, Theory.

Keywords
Recommender Systems, Anchoring Effects, Behavioral Decision Making, Experimental Research, Preferences.

1. INTRODUCTION
Recommender systems provide suggestions to consumers of products in which they may be interested and allow firms to leverage the power of recommendations to better serve their customers and increase sales. Research in this area has mostly focused on the development and improvement of the algorithms that allow these systems to make accurate recommendations and predictions. Less well studied are the behavioral aspects of using recommender systems in the electronic marketplace.

Many recommender systems ask consumers to rate their liking of an item that they have experienced and use these ratings as inputs to estimate consumer preferences for other not-yet-consumed items. The estimated preferences are often presented to the consumers in the form of "system ratings" and essentially serve as recommendations. The subsequent consumer ratings serve as additional inputs to the system, completing a feedback loop that is central to a recommender system's use and value, as illustrated in Figure 1. This research focuses on the feed-forward influence of the recommender system upon the consumer ratings. We believe that providing consumers with a prior rating generated by the recommender system can introduce anchoring biases and significantly influence consumer preferences and, thus, their subsequent rating of an item.

[Figure 1: feedback loop between the recommender system (consumer preference estimation) and the consumer (item consumption). Predicted ratings express recommendations for unknown items; actual ratings express preferences for consumed items, feed back into the system, and determine its accuracy.]
Figure 1. Ratings as part of a feedback loop in consumer-recommender interactions.

The issue of biased ratings has been largely ignored by algorithm developers. A common underlying assumption in the vast majority of the recommender systems literature is that consumers have preferences for products that are developed independently of the recommendation system. Therefore, the presumption is that user-reported ratings can be trusted, and the majority of the research directly uses user-reported ratings as true and authentic user preferences. However, researchers in behavioral decision making, behavioral economics, and applied psychology have found that people's preferences are often influenced by elements in the environment in which preferences are constructed [e.g., 3, 4, 7, 11]. This suggests that the common assumption that consumers have true, non-malleable preferences for items is questionable, which raises the following question: whether and to what extent is the performance of recommender systems reflective of the process by which preferences are elicited?

The main objective of this research is to answer the above question and understand the influence of recommender systems' predicted ratings on consumers' preferences. In particular, we explore four issues related to the impact of recommender systems. (1) The anchoring issue: understanding any potential anchoring effect, particularly at the point of consumption, is the principal goal of this study. Are people's preference ratings for items they just consumed drawn toward predictions that are given to them? (2) The timing issue: does it matter whether the system's prediction is presented before or after the user's consumption of the item? Showing the prediction prior to consumption could provide a prime that influences the user's consumption experience and his/her subsequent rating of the item. If this explanation is operative, the anchoring effect would be expected to be lessened when the recommendation is provided after consumption. (3) The system reliability issue: does it matter whether the system is characterized as more or less reliable? If the system's reliability impacts anchoring, then this would provide evidence against the thesis that anchoring in recommender systems is a purely numeric effect of users applying numbers to their experience. (4) The generalizability issue: does the anchoring effect extend beyond a single context? We investigate two different contexts in this research: Studies 1 and 2 observe ratings of television shows, and Study 3 addresses anchoring for ratings of jokes. Consistency of findings supports a more general phenomenon that affects preference ratings immediately following consumption, when recommendations are provided.
2. BACKGROUND
Behavioral research has indicated that judgments can be constructed upon request and, consequently, are often influenced by elements of the environment in which this construction occurs. One such influence arises from the use of an anchoring-and-adjustment heuristic [4, 11]. Prior research has established anchoring and adjustment as leading to biases in judgment. Many studies primarily used almanac-type questions (the answers for which can be looked up in a reference source) delivered in an artificial survey setting; however, the effect has proved robust with respect to context, e.g., researchers have found anchoring by mock jurors making legal case decisions [3], by students assessing instructors [10], and by real estate professionals making house appraisals [8].

Past studies have largely been performed using tasks for which a verifiable outcome is being judged, leading to a bias measured against an objective performance standard (see the review by [4]). In the recommendation setting, the judgment is a subjective preference and is not verifiable against an objective standard. The application of previous studies to the preference context is therefore not a straightforward generalization, and our work differs from other anchoring research in its focus on the effects of anchoring in a consumer's construction of preferences and its application to the specific domain of recommender systems. The work that comes closest to ours is [5], which deals with a related but significantly different anchoring phenomenon in the context of recommender systems.

The studies in [5] explored the effects of recommendations on user re-ratings of movies. They found that users showed high test-retest consistency when asked to re-rate a movie with no prediction provided. However, when users were asked to re-rate a movie while being shown a "predicted" rating that was altered upward or downward from their original rating for the movie by a single fixed amount (1 rating point), they tended to give higher or lower ratings, respectively.

Although [5] involves recommender systems and preferences, our study differs from it in several important ways. First, we address a fuller range of possible perturbations of the predicted ratings. This allows us to more fully explore the anchoring issue as to whether any effect is obtained in a discrete fashion or more continuously over the range of possible perturbations. More fundamentally, the focus of [5] was on the effects of anchors on a recall task, i.e., users had already "consumed" the movies they were asked to re-rate in the study, had done so prior to entering the study, and were asked to remember how well they liked these movies from their past experiences. Thus, anchoring effects were moderated by potential recall-related phenomena, and preferences were being remembered instead of constructed. In contrast, our work focuses on anchoring effects that occur in the construction of preferences at the time of actual consumption. In our study, no recall is involved in the task impacted by anchors; participants consume the good for the first time in our controlled environment, and we measure the immediate effects of anchoring.

3. THREE RESEARCH STUDIES
We designed three user experiments to explore issues related to the impact of providing recommendations. In particular, three hypotheses are tested in this research. Discussions of these hypotheses and details of the experiments are provided in [1, 2].

Anchoring Hypothesis: Users receiving a recommendation biased to be higher will provide higher ratings than users receiving a recommendation biased to be lower.

Timing Hypothesis: Users receiving a recommendation prior to consumption will provide ratings that are closer to the recommendation than users receiving a recommendation after viewing.

Perceived System Reliability Hypothesis: Users receiving a recommendation from a system that is perceived as more reliable will provide ratings closer to the recommendation (i.e., will be more affected by the anchor) than users receiving a recommendation from a less reliable system.

3.1 Impact of Artificial Recommendations
In Study 1, subjects received artificial anchors, i.e., the system ratings were not produced by a recommender system. All subjects were shown the same TV show episode during the study and were asked to provide their rating of the show after viewing. 206 people completed the study using a web-based interface in a behavioral lab. Following a welcome screen, subjects were shown a list of 105 recent TV shows and were asked to indicate whether they had ever seen each show as well as their familiarity with it. For shows the subject indicated having seen or being familiar with, ratings were collected on a 5-star scale (half-star ratings were also allowed). These ratings were not used to produce the artificial system recommendations in Study 1; instead, they were collected to create a database for the recommender system used in Study 2. Following this initial rating task, all subjects watched the same TV episode of a situation comedy. Participants were randomly assigned to one of seven experimental groups. Those in the treatment groups received an artificial system rating for the TV show with the wording appropriate to their condition. All subjects rated the episode just viewed using the same 5-star rating scale. Finally, subjects completed a short survey that included questions on demographic information and TV viewing patterns.

Three factors were manipulated in the rating provision. The first factor was the direction of the anchor, i.e., the system rating was set to either a low value (1.5, on a scale of 1 through 5) or a high value (4.5). The second factor was the timing of the recommendation, i.e., the artificial system rating was given either before or after the show was watched (but always before the viewer was asked to rate the show). Together, the first two factors form a 2 × 2 between-subjects design. Intersecting with this design is a third factor: the perceived reliability of the system (strong or weak). For parsimony of design, the third factor was manipulated only within the Before conditions. Thus, within the Before conditions of the Timing factor, the factors of Anchoring (High/Low) and Reliability of the anchor (Strong/Weak) form a 2 × 2 between-subjects design. In addition to the six treatment groups, a control condition, in which no system recommendation was provided, was also included.
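For concreteness, the seven between-subjects groups implied by this design can be enumerated as in the sketch below. This is only an illustrative Python sketch; the condition labels, anchor values, and the assignment helper are our own notation, not materials from the study.

```python
import random

# Illustrative enumeration of the seven Study 1 groups
# (timing, anchor direction, perceived reliability); labels are ours.
CONDITIONS = [
    ("Before", "High", "Strong"),
    ("Before", "High", "Weak"),
    ("Before", "Low", "Strong"),
    ("Before", "Low", "Weak"),
    ("After", "High", "Strong"),
    ("After", "Low", "Strong"),
    ("Control", None, None),  # no system rating shown
]

# Artificial anchor values on the 1-5 star scale.
ANCHOR_VALUE = {"High": 4.5, "Low": 1.5}

def assign_condition(rng=random):
    """Randomly assign a subject to one of the seven groups."""
    timing, anchor, reliability = rng.choice(CONDITIONS)
    return {
        "timing": timing,
        "anchor": anchor,
        "reliability": reliability,
        "shown_rating": ANCHOR_VALUE.get(anchor),  # None for the control group
    }
```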

Table 1 shows the mean ratings for the viewed episode for the seven experimental groups. Our preliminary analyses included information collected in the survey, including both demographic data (e.g., gender, age, occupation) and questionnaire responses (e.g., hours watching TV per week, general attitude towards recommender systems), as covariates and random factors. However, none of these variables or their interaction terms turned out to be statistically significant in explaining the variation in user ratings, and hence we focus on the fixed factors.

Table 1. Mean (SD) Ratings of the Viewed TV Show by Experimental Condition in Study 1.

Group (timing-anchor-reliability)    N     Mean (SD)
Before-High-Strong                   31    3.48 (1.04)
After-High-Strong                    28    3.43 (0.81)
Control                              29    3.22 (0.98)
Before-High-Weak                     31    3.08 (1.07)
Before-Low-Weak                      29    2.83 (0.75)
After-Low-Strong                     29    2.88 (0.79)
Before-Low-Strong                    29    2.78 (0.92)

We applied a general linear model to analyze the direction of the anchor (High/Low) and its timing (Before/After viewing). Results showed no effect of Timing (F(1,113) = 0.02, p = .89). We continued the analyses using direct tests of the contrasts of interest. The difference between the High and Low conditions was in the expected direction, showing a substantial effect between groups (one-tailed t(58) = 2.79, p = .004). We then analyzed the direction of the anchor (High/Low) and the perceived system reliability (Strong/Weak). There was no significant difference between the High and Low conditions with Weak recommendations (p = .15), unlike with Strong recommendations (p = .004). Thus, weak recommendations did not operate as a significant anchor when the perceived reliability of the system was lowered.
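The contrast tests reported above are standard independent-samples t-tests between condition means. A minimal sketch of such a test is shown below, assuming the per-group rating vectors are available; the arrays here are placeholders, not the study data.

```python
import numpy as np
from scipy import stats

# Placeholder arrays standing in for the 5-star ratings of the High-anchor
# and Low-anchor groups (Strong reliability, Before timing); not the real data.
high_ratings = np.array([4.0, 3.5, 3.0, 4.5, 3.5])
low_ratings = np.array([2.5, 3.0, 2.0, 3.5, 2.5])

# Two-sample t-test; the paper reports one-tailed tests, so the two-sided
# p-value is halved when the difference lies in the predicted direction.
t_stat, p_two_sided = stats.ttest_ind(high_ratings, low_ratings)
p_one_tailed = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.2f}, one-tailed p = {p_one_tailed:.4f}")
```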
In summary, the artificial recommendations did impact the viewers' preferences. A statistically significant effect was found when comparing a high recommendation to a low recommendation, and the difference between the High and Low conditions was in the expected direction. Thus, the Anchoring Hypothesis was supported. When the recommender system was presented as less reliable, being described as in a test phase and providing only tentative recommendations, the anchoring effect was reduced to a minimum, in support of the Perceived System Reliability Hypothesis. The Timing Hypothesis was not supported, however: the magnitude of the anchoring effect did not differ depending on whether the system recommendation was received before or after the viewing experience. This suggests that the effect is not attributable to a priming of one's attitude prior to viewing. Instead, anchoring is likely operating at the time the subject is formulating a response.

3.2 Impact of Actual Recommendations
Study 2 follows up Study 1 by replacing the artificially fixed anchors with actual personalized recommendations provided by the well-known item-based collaborative filtering algorithm [9]. We evaluated several alternatives using a cross-validation approach, and the item-based CF algorithm gave the best performance on this dataset. Using the user preferences for TV shows collected in Study 1, a recommender system was implemented to estimate the preferences of subjects in Study 2 for unrated shows. Using a design parallel to Study 1, we examine the anchoring effects with a recommender system comparable to the ones used in practice.
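For readers unfamiliar with the technique in [9], the sketch below illustrates the standard item-based collaborative filtering prediction rule: adjusted-cosine similarity between items, followed by a weighted sum over the target user's rated items. It is a simplified, assumption-laden illustration (dense rating matrix, no neighborhood truncation by default), not the system actually deployed in Study 2.

```python
import numpy as np

def item_based_predict(R, user, item, k=None):
    """Predict R[user, item] from the user's ratings of similar items.

    R is a users x items matrix with np.nan for missing ratings.
    Similarity is adjusted cosine (ratings centered on each user's mean),
    following the standard item-based CF formulation.
    """
    user_means = np.nanmean(R, axis=1, keepdims=True)
    centered = R - user_means  # nan entries stay nan

    def sim(i, j):
        # Adjusted-cosine similarity over users who rated both items.
        mask = ~np.isnan(centered[:, i]) & ~np.isnan(centered[:, j])
        if mask.sum() < 2:
            return 0.0
        a, b = centered[mask, i], centered[mask, j]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    rated = [j for j in range(R.shape[1])
             if j != item and not np.isnan(R[user, j])]
    sims = sorted(((sim(item, j), j) for j in rated), reverse=True)
    if k is not None:
        sims = sims[:k]  # optional neighborhood truncation

    # Weighted sum over positively similar items the user has rated.
    num = sum(s * R[user, j] for s, j in sims if s > 0)
    den = sum(s for s, j in sims if s > 0)
    return num / den if den else float(np.nanmean(R[:, item]))
```

With the Study 1 TV-show ratings arranged as the matrix R, a function of this kind would yield the personalized predicted rating that is then shown to, or adjusted for, each subject.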
Three levels were used for adjusting the recommender system's rating provided to subjects in Study 2: Low (i.e., adjusted to be 1.5 points below the system's predicted rating), Accurate (the system's actual predicted rating), and High (1.5 points above the system's predicted rating). In addition to the three treatment groups, a control group was included for which no system recommendation was provided. 197 people completed the study. During the experiment, the system took as input the subject's ratings of shows and produced predicted ratings for unseen shows in real time. One of the unseen shows was selected for viewing for each subject. Since the subjects did not all see the same show, the preference ratings (the actual ratings submitted by the users after watching the episode) for the viewed show were adjusted for the predicted ratings of the system, in order to obtain a response variable on a comparable scale across subjects. Thus, the main response variable is the rating drift, which we define as:

Rating Drift = Actual Rating − Predicted Rating.
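The relationship between the displayed anchor, the system's prediction, and the response variable can be stated compactly as below. The clipping of displayed values to the 1-5 scale is our assumption; the paper does not describe how out-of-range anchors were handled.

```python
# Offsets applied to the system's prediction in the three treatment groups.
ANCHOR_OFFSET = {"Low": -1.5, "Accurate": 0.0, "High": +1.5}

def displayed_rating(predicted, condition):
    """Rating shown to the subject; clipped to the 1-5 scale (our assumption)."""
    return min(5.0, max(1.0, predicted + ANCHOR_OFFSET[condition]))

def rating_drift(actual, predicted):
    """Response variable used in Studies 2 and 3."""
    return actual - predicted
```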
Similarly to Study 1, our preliminary analyses indicated that none of the variables collected in the survey demonstrated significance in explaining the response variable; hence, in the following analyses we focused on the main treatment conditions and turned to direct tests of the contrasts of interest. The mean (standard deviation) values of this variable across the four conditions of the study are shown in Table 2.

Table 2. Mean (SD) Rating Drift of the Viewed TV Show by Experimental Condition, Study 2.

Group       N     Mean (SD)
High        51    0.40 (1.00)
Control     48    0.14 (0.94)
Accurate    51    0.13 (0.96)
Low         47    -0.12 (0.94)

Overall, the three experimental groups (i.e., High, Low, and Accurate) differed significantly. Providing an accurate recommendation did not significantly affect preferences for the show, as compared to the Control condition (two-tailed t(97) = 0.02, p = .98). Consistent with Study 1, the High recommendation condition led to inflated ratings compared to the Low condition (one-tailed t(96) = 2.63, p = .005). Comparably, providing a recommendation that was raised or lowered, compared to not providing any recommendation, led to increased preference (t(97) = 1.39, p = .08) and decreased preference (t(93) = 1.38, p = .09), respectively. In summary, the Anchoring Hypothesis is supported in Study 2, consistently with Study 1.

3.3 Actual Recommendations with Jokes
Study 3 provides a generalization of Study 2 within a different content domain, applying a recommender system to joke preferences. We also apply a wider variety of perturbations to the actual recommendations for each subject, ranging from -1.5 to +1.5. The anchors received by subjects were based on the recommendations of an item-based collaborative filtering recommender system. A total of 100 jokes were used in the study. The jokes and the rating data for training the recommendation algorithm were taken from the Jester Online Joke Recommender System repository [6].

61 people completed the study. Subjects first evaluated 50 randomly selected jokes to provide a basis for computing recommendations. The subjects then received 40 jokes with a predicted rating displayed next to the input for providing a preference response (rating). Thirty of these predicted ratings were perturbed, 5 each using perturbations of -1.5, -1.0, -0.5, +0.5, +1.0, and +1.5. Ten predicted ratings were not perturbed and were displayed exactly as predicted. These 40 jokes were randomly intermixed. The final 10 jokes were added as a control, i.e., displayed with no predicted rating provided. Subjects completed the study with a short survey. Study 3 allows us to examine whether the anchoring effect is continuous over this range or a discrete behavioral reaction.
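The per-subject manipulation schedule described above (30 perturbed predictions, 5 at each offset, plus 10 unperturbed and 10 control jokes) can be sketched as below; the shuffling details and function names are our own illustration.

```python
import random

PERTURBATIONS = [-1.5, -1.0, -0.5, +0.5, +1.0, +1.5]

def build_schedule(rng=random):
    """Return 50 joke 'slots': 40 with a (possibly perturbed) displayed
    prediction offset, and 10 control slots with no prediction shown."""
    perturbed = [p for p in PERTURBATIONS for _ in range(5)]  # 30 slots
    unperturbed = [0.0] * 10                                  # 10 slots
    with_prediction = perturbed + unperturbed
    rng.shuffle(with_prediction)                              # randomly intermixed
    control = [None] * 10                                     # no prediction shown
    return with_prediction + control

schedule = build_schedule()
assert len(schedule) == 50 and schedule.count(None) == 10
```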

As with Study 2, the main response variable for Study 3 was Rating Drift. As an overall illustration, Figure 2 shows the mean Rating Drift, aggregated across items and subjects, for each perturbation used in the study. In the aggregate, there is a linear relationship for both negative and positive perturbations. The general pattern for Study 3 parallels that for Study 2.

[Figure 2: mean Rating Drift (y-axis) plotted against the perturbation of the recommendation (x-axis: -1.5, -1, -0.5, +0.5, +1, +1.5) and for the Control condition; the drift increases roughly linearly with the perturbation.]
Figure 2. Aggregated Mean Rating Drift as a Function of the Amount of Perturbation of the Recommendation and for the Control Condition in Study 3.
The design of Study 3 also allows us to investigate behavior at an individual level of analysis. We tested the slopes across subjects between negative and positive perturbations, and no significant difference was observed (two-tailed t(60) = 1.39, p = .17). Consequently, for the individual analyses, we combine the data to obtain a single fit for each subject. The mean slope value across the 61 subjects is 0.35, and the value is significantly positive (two-tailed t(60) = 10.74, p < .0001). Thus, there is clearly a systematic, predominantly positive effect of the perturbations on the rating drift. Perturbations have a continuous effect upon ratings with, on average, a drift of 0.35 rating points occurring for every rating point of perturbation (e.g., the mean rating drift is 0.53 for a perturbation of +1.5).
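The individual-level analysis amounts to fitting, for each subject, one slope of rating drift against perturbation and then testing the mean slope against zero. A sketch of that computation is below; the data layout and the toy data are illustrative, not the study's.

```python
import numpy as np
from scipy import stats

def per_subject_slopes(data):
    """data: dict mapping subject id -> list of (perturbation, rating_drift)
    pairs. Returns one ordinary-least-squares slope per subject."""
    slopes = []
    for pairs in data.values():
        x = np.array([p for p, _ in pairs], dtype=float)
        y = np.array([d for _, d in pairs], dtype=float)
        slope, _intercept = np.polyfit(x, y, 1)  # degree-1 (linear) fit
        slopes.append(slope)
    return np.array(slopes)

# One-sample t-test of the mean slope against zero, mirroring the paper's
# individual-level analysis (toy data, not the study data).
toy = {s: [(p, 0.35 * p + np.random.normal(0, 0.3))
           for p in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]
       for s in range(10)}
slopes = per_subject_slopes(toy)
t_stat, p_value = stats.ttest_1samp(slopes, 0.0)
print(f"mean slope = {slopes.mean():.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```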
Overall, the analyses strongly suggest that the anchoring effect is not prompted only at the extreme manipulations. Instead, the effect of perturbations on rating drift is continuous, not discrete: the magnitude of the drift in ratings is proportional to the magnitude of the perturbation of the recommendation.

4. CONCLUSIONS AND FUTURE WORK
In this study, we conducted three laboratory experiments and systematically examined the impact of recommendations on consumer preferences. The results of this research provide strong evidence that biased output from recommender systems can significantly influence the preference ratings of consumers.

The findings of this study have several important practical implications. First, they suggest that standard performance metrics for recommender systems may need to be rethought to account for these phenomena. If recommendations can influence consumer-reported ratings, then how should recommender systems be objectively evaluated? Second, how does this influence impact the inputs to recommender systems? If two consumers provide the same rating, but based on different initial recommendations, do their preferences really match when identifying future recommendations? Third, our findings bring to light the potential impact of recommender systems on strategic practices. For example, it is well known that Netflix uses its recommender system as a means of inventory management, filtering recommendations based on the availability of items. Taking this one step further, online retailers could potentially use preference bias based on recommendations to increase sales.

As part of future work, we will further explore the effects of recommender systems on consumer preferences and behavior. Issues of trust, decision bias, and preference realization appear to be intricately linked in the context of recommendations in online marketplaces. Moreover, our future research directions also include investigating the error-compounding issue of anchoring: how far can people be pulled in their preferences if a recommender system keeps providing biased recommendations? Finally, this study has brought to light a potentially significant issue in the design and implementation of recommender systems. Since recommender systems rely on preference inputs from users, bias in these inputs may have a cascading error effect on the performance of recommender system algorithms. Further research will examine the full impact of these biases.

5. ACKNOWLEDGMENTS
I would like to thank my advisor, Prof. Gediminas Adomavicius, for offering me the opportunity to work in this interesting and challenging area. Also, many thanks to my co-authors, Prof. Shawn Curley and Prof. Jesse Bockstedt, for all their efforts and for the insightful discussions which made this work possible.

6. REFERENCES
[1] Adomavicius, G., Bockstedt, J., Curley, S., and Zhang, J. 2011. "Do Recommender Systems Manipulate Consumer Preferences? A Study of Anchoring Effects." Working paper, University of Minnesota.
[2] Adomavicius, G., Bockstedt, J., Curley, S., and Zhang, J. 2010. "Impact of Recommender Systems on Consumer Preferences: A Study of Anchoring Effects," Winter Conference on Business Intelligence, Salt Lake City, Utah.
[3] Chapman, G., and Bornstein, B. 1996. "The More You Ask for, the More You Get: Anchoring in Personal Injury Verdicts," Applied Cognitive Psychology, 10, 519-540.
[4] Chapman, G., and Johnson, E. 2002. "Incorporating the Irrelevant: Anchors in Judgments of Belief and Value," in Heuristics and Biases: The Psychology of Intuitive Judgment, T. Gilovich, D. Griffin, and D. Kahneman (eds.). Cambridge: Cambridge University Press, 120-138.
[5] Cosley, D., Lam, S., Albert, I., Konstan, J.A., and Riedl, J. 2003. "Is Seeing Believing? How Recommender Interfaces Affect Users' Opinions," CHI 2003 Conference, Fort Lauderdale, FL.
[6] Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. 2001. "Eigentaste: A Constant Time Collaborative Filtering Algorithm," Information Retrieval, 4, (2), 133-151.
[7] Mussweiler, T., and Strack, F. 2000. "Numeric Judgments under Uncertainty: The Role of Knowledge in Anchoring," Journal of Experimental Social Psychology, 36, 495-518.
[8] Northcraft, G., and Neale, M. 1987. "Experts, Amateurs, and Real Estate: An Anchoring-and-Adjustment Perspective on Property Pricing Decisions," Organizational Behavior and Human Decision Processes, 39, 84-97.
[9] Sarwar, B., Karypis, G., Konstan, J.A., and Riedl, J. 2001. "Item-Based Collaborative Filtering Recommendation Algorithms," Proceedings of the 10th International WWW Conference, Hong Kong, 285-295.
[10] Thorsteinson, T., Breier, J., Atwell, A., Hamilton, C., and Privette, M. 2008. "Anchoring Effects on Performance Judgments," Organizational Behavior and Human Decision Processes, 107, 29-40.
[11] Tversky, A., and Kahneman, D. 1974. "Judgment under Uncertainty: Heuristics and Biases," Science, 185, 1124-1131.

