2017 - JCR - MTurk Character Misrepresentation Assessment and Solutions

This tutorial was invited by editors Darren Dahl, Eileen Fischer, Gita Johar, and Vicki Morwitz.

Advance Access publication April 17, 2017

1 While the "Prevent Ballot Box Stuffing" option was selected in this Qualtrics study, participants can make multiple attempts at a study if they clear the cookies from their web browser or simply switch browsers.

© The Author 2017. Published by Oxford University Press on behalf of Journal of Consumer Research, Inc. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Vol. 44, 2017
DOI: 10.1093/jcr/ucx053
212 JOURNAL OF CONSUMER RESEARCH
treatments. Seventeen percent of respondents in the cancer study had the same Worker IDs as those in the shoulder study (Tong et al. 2012).2

• The third author asked for Turkers who had written over 10 reviews on Yelp to complete a study. Almost 900 Turkers began the study, and all but 33 dropped out when they were asked to provide a screenshot that verified their qualifications.

These disturbing examples mirror similar cases reported by Chandler and Paolacci (2017) demonstrating consistent distortions in responses when MTurk participants are able to retake a screener or falsify their identities in order to complete a study. Our goal is to identify the degree of misrepresentation in paid MTurk studies and its implications for the legitimacy of scientific inquiry. We then propose a two-step process to achieve appropriate theory-driven samples. The first step assesses a respondent's qualification in a context where the respondent has neither the motive nor the requisite knowledge to deceive. The second step then makes the study available and viewable only to those who have qualified in the first step. Finally, we detail ways that this two-step method can be incorporated into a larger panel creation and management process that enables research with known and trusted MTurk respondents.

Amazon Mechanical Turk is the focus of this tutorial on misrepresentation because Turkers provide the dominant source of web-based studies for those studying consumer behavior (Goodman and Paolacci 2017). However, similar deception may occur on other crowdsourcing platforms, professional marketing research panels, or in-person studies. For example, a person interested in being part of a focus group about diaper brands that pays $150 may claim to be a mother with young children when in fact she is not (Leitch 2004). Thus, our recommendations are also relevant to other online and offline respondent recruiting platforms. While the problem is not limited to online studies, it may be particularly severe in this context given that one can more easily misrepresent oneself in the anonymity of an online environment.

There are four key lessons from this tutorial. First, we demonstrate that MTurk workers are willing to misrepresent themselves to gain access to a desired study, and that those who do so generate distorted responses to other questions in the study. Second, we show that the level of character misrepresentation is negligible when there is no economic motive to lie. Third, we characterize the role of online Turker communities, demonstrating how the goals of MTurk workers interact and sometimes conflict with the practices and values of the consumer behavior research community. Finally, we evaluate various measures to prevent misrepresentation, arguing that traditional measures of response quality are not very useful and need to be replaced by a two-step process that separates character identification from the study itself. Details on the mechanics are provided in the web appendixes.

There are a number of issues related to using MTurk respondents that are only briefly mentioned in this tutorial because they are well covered elsewhere. The important issue of the representativeness of the Turker community relative to different populations has been extensively explored by other researchers (Berinsky, Huber, and Lenz 2012; Goodman and Paolacci 2017; Paolacci, Chandler, and Ipeirotis 2010; Ross et al. 2010). We also do not cover attrition due to study manipulations that can distort research conclusions, such as a writing task in one condition but not the other (Zhou and Fishbach 2016). Finally, we do not explore the disturbing finding that people who complete many social psychology research studies become non-naïve, and are thus differentially affected by specific manipulations, various forms of memory tasks, and attention checks (Chandler, Mueller, and Paolacci 2014; Chandler et al. 2015).

2 A Worker ID is a unique identifier for each MTurk worker.

TESTING CHARACTER MISREPRESENTATION

We begin with a series of two-stage tests that assess the extent to which Turkers misrepresent themselves when they have a motive and opportunity to do so. In the first stage, respondents provide their demographic characteristics, activities, and product ownership in a context that neither offers any monetary incentive to misrepresent nor provides any information on the desired response. In the second stage, a screener question permits respondents to alter their answers from the first-stage questions in order to take a new study. Comparing respondents' answers across stages allows us to assess the degree of misrepresentation and the extent to which Turkers provide distorted answers to subsequent questions. We also compare these results to a simple take/retake group to separate misrepresentation from reliability in survey response.

Stage 1: Collecting Panel Characteristics

To assess character misrepresentation, we first built a panel with "true" characteristics and activities including product and pet ownership from 1,108 Turkers located in the United States. These questions were spread across eight different surveys that asked about (1) political and religious affiliations (MoralFoundations.org); (2) moral beliefs (MFQ: Graham et al. 2011); (3) material values (MVS: Richins 2004); (4) personality trait importance (GSA: adapted from Barriga et al. 2001); (5) extroversion and agreeableness (John and Srivastava 1999); (6) personality (TIPI: Gosling, Rentfrow, and Swann 2003); (7) product ownership (i.e., sports and technology), pet ownership (dog, fish, cat, etc.), food consumption (Sharpe, Staelin, and Huber 2008), health consciousness (Gould 1988), and social desirability bias (Crowne and Marlowe 1960); and (8) willingness to compromise moral beliefs (MFSS:
SHARPE WESSLING, HUBER, AND NETZER 213
Haidt, Graham, and Joseph 2009). The specific contents of each survey are outlined in web appendix A; however, a thorough analysis of these data goes beyond the scope of this tutorial.

All eight surveys were launched simultaneously so that any MTurk worker could take as many surveys as desired within the first hour of posting. At the end of the hour, any worker who had taken one or more of the panel surveys became a "panelist" and gained access for the next four weeks to take any of the uncompleted eight surveys.3 Only those identified as panelists could see or take the panel surveys after the initial one-hour cutoff. On average, our panelists completed 7.1 of the eight available panel surveys.

3 This was accomplished through the MTurk qualification functionality. We created a qualification type called "qual" and set this value to 1 for every panelist (see the appendix for details). We also batch-notified our panelists of other surveys that they were eligible to take using the R package MTurkR, which may have contributed to the high response rate (see web appendix E).

Each panelist saw a consent form at the beginning of each first- and second-stage survey. The consent form notified respondents of the possibility that their answers from other studies could be linked through their unique MTurk Worker ID, and if participants did not agree to these terms, they could exit the study. Including this consent form has implications, as respondents who expected to cheat may have questioned whether they wanted to complete the survey or study and thus might have dropped out of our panel. However, we found the dropout rate to be minimal. Across eight surveys with more than 1,000 respondents, 96 respondents abandoned a survey, with only 16 of these dropouts occurring at the consent form stage.

Stage 2: Testing Misrepresentation

We conducted five studies to determine the extent to which participants altered earlier responses to qualify for a study. As detailed in web appendix B, the studies differed in terms of screening requirements and the questions asked in the body of the study. Only panelists were permitted to view the MTurk HIT (i.e., Human Intelligence Task) description and participate in the studies. In this second stage, the invitation described the general topic of the study (e.g., product-related study, health-related study, pet food survey) and whether it would be restricted to those with certain characteristics. We provided this detail to respondents for two reasons. First, in treating potential respondents ethically, analogous to many lab situations, we informed potential participants of the requirements so they could freely choose to take the study (Gleibs 2016). Second, because Turkers often complain about "unpaid" screeners, for four out of our five studies we informed them of the qualification requirements a priori so they would not waste their time if they did not meet the requirement. If participants chose to accept the task, they clicked on a survey link and viewed the consent screen indicating that their responses could be tied to other studies. Once respondents passed the screener and the study questions, they entered a unique completion code in order to be paid. Thus, our two-stage design allows us to assess the extent of misrepresentation when Turkers are given the opportunity to do so.

In discussing these studies, we concentrate on the degree of character misrepresentation and the distortion in responses to subsequent questions. We focus on responses that were statistically different between those who did and did not misrepresent themselves. Later sections examine the contexts in which strong misrepresentation occurs, the role of Turker communities and norms, and possible solutions to character misrepresentation in MTurk studies.

The five studies screened respondents on (1) owning a cat and/or a dog, (2) owning a kayak, (3) being over 49 years old, (4) being raised Catholic, and (5) being female. In all studies, we define impostors as those who provided the requested response to the screener question when it differed from their response in stage 1. It is important to control for possible alternative explanations for inconsistent responses between the two stages, such as take/retake reliability error and a change in status or character between the two surveys (e.g., someone may have purchased a kayak between the two phases of our sports equipment–related study). We do so by including in four out of the five studies a "control" condition in which the "screener" question was included as part of the survey but not used as a screener. The proportion of inconsistent responses between stage 1 and stage 2 in the control condition, where the focal question was not a screener, provides an estimate for differences that are due to random inconsistency or a change in characteristic status but not due to misrepresentation.

Table 1 provides for each study the percent of first-stage panelists who had the qualification requirement when there was no incentive to lie in stage 1 (column A), and the percent of respondents in the second stage who altered their earlier response to enable them to take the study (column B). Column B shows unacceptable rates of misrepresentation ranging from 24% to 83%, with greater rates occurring when there are relatively few Turkers who can honestly respond to the screen (low rates in column A). Because the proportion of possible misrepresentation is "capped" at the proportion of respondents who are "eligible" to do so, we report (column C) the proportion of impostors (column B) divided by the proportion of respondents who are "eligible" to misrepresent (1 – column A). This measure gives us a "standardized" degree of misrepresentation. Looking at column C, we see misrepresentation of around 80% for pet and kayak ownership, but around 50% for age, religious upbringing, and gender. This suggests that respondents are less likely to deceive with respect to stable, identifiable demographic characteristics compared to product ownership, which is more difficult to disprove. We encourage future studies to further explore the kinds of characteristics that respondents are more or less willing to misrepresent.
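The column C measure above is a one-line calculation. The sketch below expresses it in Python (our choice of language for illustration; the authors worked with Qualtrics and the R package MTurkR, not this code), using the kayak study's figures reported later in the tutorial: 7% stage 1 ownership (A) and 83% clear impostors in stage 2 (B).

```python
def standardized_misrepresentation(prop_qualified_stage1, prop_impostors_stage2):
    """Column C of table 1: impostors as a share of those 'eligible' to lie,
    C = B / (1 - A), where A is the proportion truly qualified in stage 1
    (no incentive to lie) and B is the proportion of stage 2 respondents
    who altered their earlier answer to pass the screen."""
    return prop_impostors_stage2 / (1.0 - prop_qualified_stage1)

# Kayak study: A = 0.07 (7% owned a kayak in stage 1), B = 0.83.
c_kayak = standardized_misrepresentation(0.07, 0.83)  # ≈ 0.89
```

The result, roughly 89%, matches the high column C values the text reports for the ownership screens.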
TABLE 1
CHARACTER MISREPRESENTATION IN THE FIVE SCREENED STUDIES
[Columns: A: % of stage 1 panelists who satisfy the screen (n ≈ 1,000 per study); B: screened study: % of stage 2 respondents who satisfy the screen; C: deceivers: % of altered responses relative to those eligible; and inconsistency in the no-screen take/retake control (0%–4%; NA for two studies). Rows: pet studies (dog or cat; dog and cat), kayak study, dietary fiber (over 49) study, politics (raised Catholic) study, and cellphone case (must be female) study.]

The final column of table 1 reports the corresponding inconsistency in a take/retake study where there was no screen and thus no motive to impersonate. We see a baseline inconsistency of 0–4% when there is no motive to deceive. That baseline inconsistency is important in providing the prime justification for screening in a separate survey.

We now describe each of the studies and the differences in responses between those who did and did not misrepresent themselves. Web appendix B provides the details of each study.

Pet Ownership. The first test required at least one dog or cat to qualify, and the second test required at least one dog and one cat. Upon entering the second-stage tests, participants were asked to complete a screening question about pet ownership. If they reported having the required number (independent of whether they reported the correct answer in the first-stage survey), they were shown the consent screen; those who did not were told that they did not qualify and could not continue. … they had either a dog or a cat in the first stage. In the second … (… p = .033; cat food: 94% vs. 84%; p = .004). Across our studies, we often found that impostors are significantly less likely to choose the "none" option. One possible explanation is that impostors want to appear knowledgeable and involved and hence are less likely to go beyond the listed brands.

These results are disappointing in demonstrating substantial levels of misrepresentation and significant differences in subsequent responses. Inconsistent answers could also reflect benign accounts such as take/retake error or a true change in status; to assess the degree to which inconsistencies between the two studies may be attributed to such accounts, in the next studies we include a control group that received the same survey without any screeners. Responses from the control group also measure the fundamental variability in the response to the screening variable across stages.

Kayak Ownership. We determined kayak ownership in stage 1 by asking respondents about their sports equipment ownership. In doing so, 7% of our panelists checked a box indicating that they currently owned at least one kayak. Thus, in this study, due to the relatively low ownership of kayaks reported in the first stage, 93% of the respondents to the first study had an opportunity to deceive. Two months later, a second-stage study was posted, stating it was just for kayak owners. Once past the consent screen, panel members chose again among the same sports equipment options as in stage 1 and were permitted into the paid study if they checked the box indicating that they owned a kayak. Of the 146 respondents in stage 2 who indicated that they currently owned a kayak, 132 (88%) had indicated earlier that they did not. However, seven participants also indicated that they had recently purchased a kayak, which leads us to conclude that at least 83% of stage 2 participants were clear kayak owner impostors.4

Because only 18 respondents reported in both the first and second stages that they owned a kayak, this study did not provide a sufficient sample size to compare the responses of impostors and consistent respondents to other questions. In this study we asked a separate group of respondents to report their kayak ownership with no incentive to impersonate (take/retake) and found that only 4% of those who reported having a kayak in stage 2 did not report the same in stage 1. This may be due to the purchase of a kayak between the two studies (although no one indicated a recently purchased kayak) or due to response inconsistency. Thus, we can conclude that the vast majority of the change in response to the kayak ownership question between the two surveys is due to intentional misrepresentation and not merely inconsistency in response.

4 Note that respondents also had an incentive to lie about acquiring a kayak in between the studies to justify their inconsistency between the two studies.

Dietary Fiber for Those over 50. In the first stage, 13% of panel respondents indicated that they were 50 years old or older. In the second stage, the recruiting statement explicitly stated that only those 50 and over would qualify. Upon entrance to the survey, participants viewed the consent screen and reported their age. Those who said they were 50 or above were permitted to take the study. There was substantial age misrepresentation, with 43% of the 141 stage 2 respondents being revealed as impostors. To make sure that the stage 2 age screen result was not due to take/retake error, a separate group of panelists responded to a similar survey but without any screener. 100% of the 144 respondents in this control condition reported an age bracket that was perfectly consistent with the age reported in stage 1.

Among other questions, participants made a choice of a fiber supplement among Metamucil Tablets ($15.99), Fiber Well Gummies ($14.99), Benefiber Powder ($25.99), and a "none" option. The impostors, with an average age of 33, were significantly less likely to choose the "none" option relative to those who legitimately passed the screener (8% vs. 25%; z = –2.567, p = .010). Impostors also overstated their average vitamin intake frequency (ranging from never = 0 to daily = 3) compared to those legitimately over 49 years old (impostors: M = 2.36; over 49: M = 1.96; F(1, 140), p = .036). Thus, we find that not only do respondents misrepresent their age, but, more importantly, impostors exhibited different responses to other questions, leading to biased survey results.

Catholic Upbringing. In the first stage, 30% of panel members indicated that they had a Catholic upbringing. The second-stage recruiting statement specified that only those raised Catholic could take the study. Once in the survey, if respondents indicated in the screener question that they were not raised Catholic, the study ended and they were not compensated. However, if they claimed that they were raised Catholic, they completed the study and were paid regardless of whether their claim matched their first-stage response. Then participants were shown an excerpt from a CNN article (Burke 2016) reporting a controversy between Pope Francis and Donald Trump and asked if they agreed with the Pope's statement that "A person who thinks only about building walls, wherever they may be, and not building bridges, is not Christian" (Strongly disagree = 1, Strongly agree = 5).

Of the stage 2 respondents, 61% of the 120 participants consistently matched their earlier statement that they had been raised Catholic, while the other 39% contradicted their earlier response about their religious upbringing. For comparison purposes, we relaunched the study with no screener, and only 4% of 138 respondents changed their reported religious upbringing in a take/retake study when there was no monetary incentive to misrepresent. Furthermore, we found that those raised Catholic were statistically more likely to agree with the Pope's statement than the impostors (raised Catholic: M = 3.93; impostors: M = 3.38; p = .028).

Woman's Cell Phone Case Conjoint. The final experiment tested gender misrepresentation and included a standard choice-based conjoint task. In the first four studies, the unscreened "control" condition was launched after the screening condition; thus, differences between control and screen may have been due to selection effects, given that those who had previously taken the screener version of the study were excluded from taking the control relaunch. To mitigate such possible selection effects, we randomly assigned panel members to either a screen or a no-screen condition, both of which were run simultaneously. As shown in table 1, 25% of the 141 respondents in the screener condition changed their reported gender to gain entrance to the study. By contrast, none of the 154 respondents in the unscreened condition changed their gender identities.

All respondents completed 12 choice-based conjoint tasks selecting among cell phone case designs. As shown in the example task in figure 1, the attributes and levels for the alternatives included color (pink, black, or navy), style (slim design, ultra-slim profile, or easy on/off), drop protection (included or limited), radiation protection (included or limited), and price (ranging from $29.99 to $59.99).

FIGURE 1
[EXAMPLE CHOICE-BASED CONJOINT TASK]

Table 2 summarizes the conjoint estimates. We found that males posing as females statistically differed from true females on the stereotypically female attributes of color and design. Specifically, males impersonating females had higher estimated utility (part-worth) for a pink cell phone case (females: M = –0.53; impostors: M = 1.85; p = .013) and an ultra-slim case profile (females: M = 0.40; impostors: M = 1.09; p < .0001) compared to actual females surveyed. Those misrepresenting their gender also had a higher utility value for the "none" option (females: M = –3.43; impostors: M = –1.70; p = .043) and chose the "none" option more often than females (females: 7%; impostors: 13%; p = .013). This result may seem to contradict the earlier finding that impostors are less likely to choose the "none" option. However, when we examine the control condition, we can see that males posing as females had marginally lower utility values for the "none" option compared to males in the control condition. That result is consistent with our previous findings that those who impersonate tend to be more averse to choosing the "none" option compared to those who are being honest (control males: M = –0.06; impostors: M = –1.70; p = .088). There was no reliable difference in utilities on the less stereotypically female attributes (i.e., drop and radiation protection) between males in the control condition and males posing as females.
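Comparisons of shares like those above (8% vs. 25% choosing "none" in the fiber study; 7% vs. 13% here) are two-sample tests of proportions. A minimal pooled z-test can be sketched as follows; the counts in the example are illustrative only, since the exact group sizes behind each reported test are not fully restated here.

```python
import math

def two_proportion_z(successes1, n1, successes2, n2):
    """Pooled two-sample z-test for equal proportions.
    Returns (z statistic, two-sided normal p-value)."""
    p1, p2 = successes1 / n1, successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Illustrative counts only: 10 of 100 impostors vs. 20 of 100 legitimate
# respondents choosing the "none" option.
z, p = two_proportion_z(10, 100, 20, 100)  # z ≈ -1.98, p ≈ 0.048
```

A negative z here indicates the first group chose "none" less often, the direction the studies repeatedly found for impostors.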
TABLE 2
PARTWORTH UTILITIES FOR CELL PHONE CASE STUDY FOR IMPOSTOR MALES, FEMALES, AND NONIMPOSTOR/CONTROL MALES
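The pattern across the five studies, with far higher impostor rates for rarer screens, follows mechanically if a roughly fixed fraction of unqualified respondents lies when screened. The toy model below is our illustration, not the authors' analysis; the deceit propensity of 0.5 and the "common screen" incidence are assumptions chosen only to show the direction of the effect.

```python
def impostor_share(q, d):
    """q: share of the respondent pool truly qualified for the screen;
    d: share of the unqualified who misrepresent to get in.
    Returns the expected share of impostors among those passing the screen."""
    liars = (1.0 - q) * d
    return liars / (q + liars)

# Holding the deceit propensity fixed at d = 0.5, a rare screen
# (7% qualified, as with kayak ownership) admits a far larger impostor
# share than a screen that roughly half the panel satisfies.
rare = impostor_share(0.07, 0.5)    # ≈ 0.87
common = impostor_share(0.49, 0.5)  # ≈ 0.34
```

This is one way to see why the conclusions below single out narrow or rare screening categories as the riskiest.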
CONCLUSIONS FROM THE FIVE STUDIES

The five tests demonstrate that studies using screeners that rely on respondents' self-reports are susceptible to an unacceptably large proportion of impostors. In particular, we find that from 24% to 83% of those passing the screener questions are impostors, and that deceit occurs among 49–89% of those who are "eligible" to misrepresent. The risk of misrepresentation is greater for narrow or rare screening categories and when the characteristic misrepresented is flexible, like ownership, rather than inflexible, like demographics. Thus, we can conclude that without safeguards, misrepresentation can be destructively common.

Further, those who pretend to be someone else may use one of three different strategies in answering questions. First, impersonators may be reluctant to admit their lack of knowledge and thus may be less likely to choose the "none" response. Second, impostors may attempt to project what they expect the mimicked persona would think, and in doing so overemphasize stereotypes. That appears to happen with male impostors improperly projecting that women prefer pink cell phone cases. Finally, where projection to a different person is difficult, deceivers may simply default to their own personal views or preferences. That may have happened when those misrepresenting their Catholic upbringing were more likely to disagree with the Pope than actual Catholics. The important point here is that there are various ways a deceiver may continue to deceive, and it is very difficult to predict the direction or magnitude of the bias.

The good news from our tests is the strong evidence of minimal distortion when there is no economic motive to deceive: the control studies showed less than 5% inconsistency between the stages when no screener was needed to gain entry into the study. This high degree of take/retake reliability among Turkers is reasonable simply because telling the truth is easy, while deceit takes effort. It also speaks to the fairly high internal validity of MTurk responses. Before we examine how one mitigates this threat to the validity of studies, it is important to understand the roles that web forums play in Turkers' behavior and particularly in the likelihood of addressing deception.

ONLINE TURKER FORUMS AND DECEPTION

Given the substantial number of impostors in our test studies, we were interested in the potential role that online Turker communities have in either encouraging or discouraging deception. The following table provides a list of the major Turker forums.

Name (website): Registered users⁵; Open to the public? (need for registration)
• MTurk Forum (MTF), http://www.mturkforum.com: 54,831; Yes (no registration to view)
• Hits Worth Turking For (HWTF), https://www.reddit.com/r/HITsWorthTurkingFor: 35,626; Yes (no registration to view)
• MTurk Reddit (MTR), https://www.reddit.com/r/mturk: 20,146; Yes (no registration to view)
• Turker Nation (TN), http://www.turkernation.com: 17,891; No, this is a private site (requesters may sign up and receive limited access)
• TurkerHub.com (TH), https://turkerhub.com: 12,408⁶; Yes (no registration to view)
• Turk Opticon (TO), https://turkopticon.ucsd.edu/: no user information published; Yes (need to register)
• MTurk Crowd (MTC), http://www.mturkcrowd.com/: 2,740; Yes (no registration to view)

A number of researchers have documented the frustration and difficulty associated with being a Turker (Dholakia 2015; Martin et al. 2014). MTurk online forums have been created by Turkers and serve four primary functions that limit that frustration. First, the websites help Turkers select desirable HITs by including estimates of actual pay per minute (which can differ from the estimated pay rate) and any warnings about difficult, boring (e.g., "bubble hell"), or "tricky" tasks (e.g., attention checks, memory checks). Second, and most relevant to the current discussion, some threads make suggestions on how to pass qualification screens. Using self-reported data, Chandler, Mueller, and Paolacci (2014) suggest that this behavior does occur, but the extent of this distortion is unknown. Third, these forums provide a place for venting anger or frustration with requesters or other Turkers. Fourth, the forums encourage coworker friendship, which includes discussions of personal challenges that may or may not be related to completing MTurk tasks (Brawley and Pury 2016). The following table provides example quotes (some
edited for clarity) that give a sense of how such MTurk communities operate.

5 As of December 20, 2016.

6 TurkerHub.com was previously MTurk Grind (MTG; http://www.mturkgrind.com/), which had 12,408 registered users. User information for the newly created TurkerHub.com has not been published. However, daily views (by registered and nonregistered users) ranged from 8,984 to 46,213 (mean 18,855) during the forum's second month.

MTurk community websites can thus generate problems for researchers by revealing experimental conditions, by undermining tests of respondent abilities or knowledge, or by enabling character misrepresentation that permits a person to enter a study under false pretenses. It is important to note that such forums not only increase the risk of deception in studies but may also serve as a safeguard against such deception. For example, we conducted a 12-cent study with 736 Turkers who were asked to guess the number of gumballs in a jar, with the ability to "win" a $1 bonus if they guessed correctly. After each respondent made a guess, we revealed to the respondent the correct number of gumballs. We monitored whether the proportion of Turkers guessing the number correctly increased over time, as well as the activity on MTurk forums, to see if the correct answer was posted online. Indeed, shortly after we posted the study, a correct answer appeared briefly on HITsWorthTurkingFor (HWTF), notifying fellow Turkers of the response that would lead to the $1 bonus. However, the post was criticized and taken down by the forum moderator within minutes (see the screenshot in web appendix C). As a result, relatively few people (3.8% of respondents) "guessed" the correct answer. Thus, while a small level of deception occurred, the moderator served to limit its impact by reinforcing norms of Turkers being reliable respondents.

A major function of the forum websites is to provide greater worker power. In particular, Turk Opticon (TO) was created to try to restore some balance of power between workers and requesters. The TO platform allows Turkers to rate requesters and comment on the HITs that requesters post based on four dimensions that workers care about: communicativity, generosity, fairness, and promptness. While separate from the MTurk platform, anyone may review the individual ratings on the TO site. Those with a Turker account may also load a browser script from Opticon that automatically generates and displays a requester's aggregated Opticon scores while they browse for HITs on MTurk.

This drive for greater Turker control arose in part out of the perception that requesters are unfair because they have the ability to unreasonably reject or block Turkers. Through Amazon's accept/reject functionality, requesters can reject a submission, and thus not pay, if a worker makes multiple attempts at a study, fails an attention check, does not submit the correct end-of-survey code, answers the survey too fast, makes a submission but never completes the study, or for any other reason. This rejection leads to an immediate loss in income and negatively impacts the worker's approval rating. Because requesters often require that Turkers have a particular approval rating (typically 95% or above), Turkers try to avoid anything that could hurt their rating. Further, a repeat offender may be blocked from all subsequent studies by that requester. Being blocked by several requesters can lead to the worker's account being suspended and the worker being barred from completing any MTurk tasks. As a result, workers are highly sensitive to actions that threaten their ability to work. The forums allow Turkers to quickly identify and publicize requesters who commonly reject Turkers. While the forums restore some of the balance of power between requesters and Turkers, they may also discourage requesters from appropriately rejecting or blocking truly offending workers from their studies. Additionally, researchers sometimes do not reject or block offending Turkers because such processes require additional effort after the data collection has been completed. Instead, researchers are often motivated to quietly remove poor responses from their data. However, requesters who abstain from taking actions against deceptive
Turkers may be hurting the research community by not pun- Turkers are able to lessen their efforts and improve their
ishing these offenders. earnings through a collective system of notifying and
Overall, the MTurk online forums help workers trans- warning fellow workers. Therefore, and as recently recom-
form a difficult job of responding to studies into one that is mended by others (Cheung et al. 2016; Farrell, Grenier,
more predictable, pleasant, and economically justifiable. In and Leiby 2017), it is important for researchers to become
that way, forums benefit requesters by increasing the will- familiar with these Turker communities and follow the chat-
ingness of people to participate in research studies. Forums room discussions when a study is live. Doing so can help re-
also encourage requesters to act in ways that support the searchers evaluate how Turkers perceive the study, and
joint system. In particular, the forums penalize requesters whether their payment level is sufficient for the effort put
who pay a low hourly wage (Gleibs 2016), those who into the study. It will also help researchers determine the ex-
underreport the expected length of the study, those who tent to which screeners, attention checks, manipulations, or
annoy workers with unexpected or boring tasks, and those desired responses have been revealed to other Turkers.
who block workers unjustifiably (Brawley and Pury 2016).
In effect, online MTurk communities serve as an infor-
mal labor union (Bederson and Quinn 2011), whereby
POSSIBLE WAYS TO MINIMIZE
CHARACTER MISREPRESENTATION
7 A HWTF (“HITs Worth Turking For”) is any task that pays 10 cents or
more per minute to complete. It is based on the actual time that a Turker There are a number of ways to limit distortion from re-
took to complete the task and not the posted time by the researcher. spondents who falsify their identities. We begin with a
number of solutions that are either infeasible or impractical, and then move to describe a version of a two-step process that can reduce, if not eliminate, the opportunity for deception.

Disguise Desired Screener Answers

Chandler and Paolacci (2017) have demonstrated that disguising a screener requirement reduces the amount of deception in MTurk studies. To make it more difficult for deception to occur, the screening questions should contain a number of items where it is hard to determine which responses will grant access to the study. However, it is often challenging to disguise a screener even if the researcher adds a list of possible options, because the respondent may still answer the questions in a way that maximizes her likelihood of qualifying for a study. For example, a respondent may claim product ownership for all (or a larger number of) products to maximize the likelihood of passing the screen. Furthermore, Turkers often complain about being screened out of a study without being paid and without prior warning. Studies with disguised screeners are also susceptible to Turkers repeatedly taking the study (by clearing the cookies from their browsers) or to the leakage of screener criteria through the Turker communities.

Identify False Qualifiers after the Fact

Researchers commonly use attention checks or response time to screen out respondents who are not sufficiently diligent (Peer, Vosgerau, and Acquisti 2014). Can similar approaches be effective for screening impostors ex post? Suppose one suspects that respondents have misrepresented their identity. Is there a way to adjust for it after the fact? Can one infer from responses to other questions, or from response style, which respondents lied to get into a study compared to those who didn’t? Unfortunately, the simple answer is no.

First, consider approval ratings. In our studies, we deliberately chose not to set an approval rating threshold so that we could assess the common requirement by researchers that Turkers should have a 95% approval rating to take their studies. The self-reported approval ratings gathered in our panel surveys had a mean of 99.1%, with only 1% of our panelists under the 95% threshold, making it a difficult criterion for separating impostors from those who answered honestly (Brawley and Pury 2016). Table 3 shows that in the cell phone conjoint study the average approval rating for impostors was 99.2%, compared to a 99.1% approval rating for those who legitimately passed the screen. Indeed, across our five studies the average approval rating of impostors was not statistically different from that of honest respondents.

TABLE 3

Table 3 also gives the results for traditional quality metrics. It shows that there is no statistically significant difference in failed attention and memory checks between those who deceived and those who honestly qualified in our cell phone study. Thus, including these checks in one’s studies and either controlling for or eliminating those who fail them does not weed out impostors. Turkers, in general, are very good at detecting traditional attention checks (Farrell et al. 2017; Hauser and Schwarz 2015). There was also no difference in the time spent on the study between impostors and those who legitimately qualified. Finally, impostors and legitimately qualified respondents did not differ with regard to the conjoint fit statistic, RLH (Sawtooth Software 2013, 22). It appears that impostors are just as practiced and vigilant as honest Turkers.

We do find some demographic and psychographic differences between those who impersonate and those who are honest. There is preliminary evidence that extroverts (p < .001) and males (p < .001) on MTurk have a higher propensity to impersonate, but it would certainly not be desirable to remove everyone who fits these characteristics from a research study.

Pay All Respondents without Screening

We demonstrate that misrepresentation occurs rarely if there is no benefit from doing so. Therefore, if one is interested in a select group for pragmatic or theoretical reasons, a feasible solution is to simply collect information from everyone and statistically control for, or remove, undesired respondents from subsequent analyses. That strategy requires payment to unneeded respondents but has the advantage of providing information about the effect of individual differences. This approach is particularly attractive if the base rate of the screened population is relatively high. However, if the base-rate proportion of the screened population is low (e.g., people suffering from a particular disease), this
approach can be prohibitively expensive. Still, one can limit wasted participants by moving respondents with undesired characteristics into other studies where those characteristics are desired. In a medical study, for example, those respondents 40 and over could take the lung cancer study, while those under 40 could take the shoulder dislocation study.

Use a Commercial Panel to Deliver Prescreened Respondents

Companies like Qualtrics and SSI provide access to prescreened panelists. However, these vendors tend to cost orders of magnitude more than managing the process oneself. Typical fees in 2016 were $20 per completed 15-minute study, compared with $2 on MTurk. The price charged is generally much higher for rare populations. There are emerging enterprises, such as TurkPrime (Litman, Robinson, and Abberbock 2016) and Prolific Academic (ProA), that allow screening for a lower fee. Thus, we can expect the cost per respondent to decrease. However, while these commercial companies claim confidence in their prescreening, they offer little external verification. We encourage researchers who use such services to monitor and validate the quality of the screening. It is important for these organizations to test their panels just as our two-stage process tested the MTurk workers.

RECOMMENDED TWO-STEP APPROACH

We believe that prescreening participants before the focal study is the best way to reduce the expense of a study and limit the number of impostors. We first explain a one-off approach within MTurk and then describe a way to create and manage a panel of qualified respondents across multiple studies or researchers, administered by a behavioral lab.

Run a Short Paid Prescreen

Researchers can run a prescreen questionnaire to establish who will be appropriate for a subsequent test, perhaps involving a simple $.10 survey with a few quick questions. As mentioned above, it is important that the prescreen not be part of the actual study. If the actual study is desirable because it is highly paid or interesting, it is likely that the desired qualification conditions will be posted on an MTurk forum or that Turkers will attempt to retake the study. Additionally, it is important that the screening question be masked by other questions. For example, if one looks for respondents above a particular age or who own a particular product, the researcher should ask a few demographic and multiple product ownership questions in the paid prescreening questionnaire.

Develop an Ongoing Panel

Researchers who conduct multiple studies or coordinated studies within a behavioral lab setting could gain substantially by building an ongoing panel similar to the one that we used to test the extent of misrepresentation. Figure 2 provides a flowchart for creating and managing such a panel. The panel could begin, as in our studies, with general questions to define a number of critical screening variables. Because any panel will gradually lose members over time, it is useful to include categorization questions in all studies that build information for future studies and test respondent consistency with earlier ones. With such a panel, studies that need a targeted population would be made available only to prescreened panel members. Even so, we recommend that a consistency check be included in the focal research study. For example, in a study where only females are permitted, we recommend including a gender question in the demographic section as a way to check for consistency with the initial panel response. However, it would also be useful to allow a relatively small number of nonpanel members to take open studies to gradually develop and replenish the panel with new participants. It is also helpful to test panel members in various ways. For example, Chandler and Paolacci (2017) asked whether respondents own a brand that does not exist, or whether they have rare diseases or do unlikely activities. Asking questions about impossible activities or fictitious events can help identify opportunistic, long-term, consistent deceivers. Note, however, that such questions should be used with caution, as Turkers are likely to catch on, especially if the question can be factually verified (Goodman, Cryder, and Cheema 2013).

It is useful in setting up a panel to build a centralized repository for study responses. While a single researcher could easily manage such a data set in Excel, a robust system with more complex database management could emerge as part of a behavioral lab. In the ideal case, all MTurk studies would be managed through a central MTurk account that uses “qualification” codes to designate which Turkers would qualify based on prior responses. Web appendixes D and E explain the mechanics of using qualification codes for creating and managing a panel. The R package MTurkR is useful in creating and updating qualification codes once the panel size becomes sufficiently large (Leeper 2017). This package is also helpful for sending batch emails to notify prequalified respondents that they are eligible for new studies. In this way, a researcher or lab coordinator can manage an MTurk pool, similar in nature to a professional panel company or student participant pool, while benefiting from the relatively low cost of using MTurk.

FIGURE 2

1. Post dedicated surveys (as we have done) or use existing studies (without screening constraints) which include questions with key demographics needed for a targeted sample. Store responses in a panel database (e.g., using Excel or Access) which can later be referenced for comparing consistency between studies or when creating a targeted sample. Make sure that each response is tied to a Turker’s Worker ID. Example: Create a Qualtrics survey which includes basic demographics, personality measures (e.g., TIPI), and questions that may be used as screening requirements for future use (e.g., smoker).

2. Within MTurk, designate if a person is included in the panel. See the appendix for how to accomplish this through the use of qualification codes. Example: Create a “gender” qualification code in MTurk, assigning a 1 to every female and 0 to every male (based on the panel creation step).

3. Using the qualification codes that have been set up in MTurk, launch the focal research study, available to prescreened respondents. Higher response rates can be obtained by emailing prescreened respondents either manually (through MTurk) or using a batch email protocol (through the R package MTurkR). Example: For a study targeting females, launch an MTurk HIT with gender = 1 in the qualification step (see step 4 in the appendix for how to do this). As a consistency check, include a gender question at the end of the study.

4. Incorporate study responses into the panel database. Check for response consistency within the panel database and remove repeatedly inconsistent Turkers from the panel. Example: Remove from the panel database any respondent who claims to be male in the focal research study.

5. Expand the number of panelists by incorporating respondents from your general population studies into your panelist database. Towards the end of the study (so as to not impact the research stimuli), include questions that define characteristics that will be useful for future studies. Example: In preparation for a later medical insurance study, ask about current medical coverage.

DISCUSSION

There are four goals to this tutorial. First, we demonstrate the extent to which character misrepresentation occurs when Turkers are given the opportunity to do so. Deceivers, having gained access to a desired study, distort their identities and can generate unstable responses to later questions. Second, we provide evidence that MTurk workers
are very consistent when there is no motive to lie. Third, we explore the motivations and activities of Turkers as revealed by their comments on MTurk forums. We advocate and detail a two-step process where the first step is to identify appropriate respondents and the second is to target directly those who qualify. Finally, we recommend that this two-step process be incorporated within a larger panel management system.

The fact that the results of MTurk studies depend on how each study is introduced and managed within the system implies that more effort is needed to document how a study is implemented and how respondents are recruited. Scientific progress requires others to be able to replicate a study, and as a field, we need to move toward including the kinds of detail shown in the following table as part of the study reporting. Of course, not all of this information is needed for every study, but such detail is appropriate in a web appendix to help readers better understand and be able to replicate the work.

Perhaps the greatest lesson from recent work demonstrating the likelihood of deceit from Turkers is the need for constant vigilance on the part of researchers. Such vigilance requires a number of efforts, such as including validation tests that ask the same question in different ways and checking for consistency. Unlike categorical and substantial lies, such softer inconsistency only suggests a heightened probability of deceit or undesired sloppiness. The question then arises of the appropriate reaction on the part of a researcher who suspects that a Turker is behaving irresponsibly. One response is to reject the Turker’s submission, an action that will reduce the Turker’s approval rating. Requesters may also block the Turker from taking future studies. Both solutions are quite effective in penalizing the individual Turker but can result in an unfair penalty for an honest mistake or inconsistency, as well as negative reactions against the researcher if the incident is disseminated within the Turker communities. An alternative response is to remove the respondent from the panel, which eliminates the possibility that the respondent will contaminate future studies. Such actions are better for both the individual Turker and researcher in the short term. However, the formal action of rejecting the submission or blocking the respondent from taking future studies provides a greater benefit to the entire research community, which gains from holding our participants accountable for honest and dishonest responses. We encourage researchers to contribute to the
community by flagging poor-quality Turkers, but because such actions will have a direct effect on a Turker’s source of income, we recommend doing so only when the dishonesty is clear and disruptive to scientific progress.

Finally, we build on Goodman and Paolacci’s (2017) tutorial in urging consumer behavior researchers who use MTurk workers for their studies to better understand these participants and treat them as important contributors to their research (Gleibs 2016). Thus, it is important that HIT descriptions help respondents find topics that they can manage well and even enjoy (Brawley and Pury 2016). Researchers also need to avoid the negative surprises from hidden tests that lead to frustration or anger. Ironically, strong positive surprises can also be distorting if they encourage respondents to misrepresent themselves to gain access. As a long-run proposition, we find that building a stable but continuously refined MTurk panel benefits both parties. The MTurk workers gain from steadier and more predictable work from a regular source, while the researchers gain from a loyal, dependable panel about which much is known before the study begins.

DATA COLLECTION INFORMATION

The first author collected the data for the eight panel surveys (leading to the panel creation), the five deception tests, and the gumball study on Amazon Mechanical Turk from June 2015 to February 2016. Funds to collect this data were provided from the institutional research budgets of all three authors. Analysis of the data was completed by the first author with oversight from the second and third authors from February 2016 to February 2017.

8 Micro-batches are when a researcher launches the same study multiple times in order to achieve the desired sample size. Each time the study is launched, the MTurk platform places it at the top of the queue of HITs, which may result in faster completion times.

APPENDIX

USING QUALIFICATION CODES TO CREATE AN MTURK PANEL

This appendix is primarily focused on creating and using Qualification codes within MTurk for the purposes of managing a participant pool on MTurk. Qualifications are particularly useful in accomplishing the following:

• Designating your panelists: indicating which workers (“Turkers”) are to be included in your panel (procedure described here).
• Prequalifying participants for a study: indicating if a participant (after taking a prequalifying survey) meets certain requirements (e.g., respondent is female) for taking a future study (see web appendix D for procedure).
• Removing participants from your panel: this is a way to “soft block” participants from taking future studies (see web appendix D for procedure).

Creating a panel using qualification codes within MTurk involves the following four steps:

1. Create a new qualification type (to be used to designate whether or not someone is in your panel).
2. Download the Worker file and assign Turkers to your panel.
3. Upload the updated Worker file (which includes your panelist designations).
4. Include your new panel designations as a criterion when launching a new MTurk study.
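The logic of these four steps can be sketched in miniature. The `Panel` class below is purely illustrative (it is not MTurk's or MTurkR's API, and the example worker IDs are the hypothetical ones used later in this appendix): assigning a value stands in for the qualification code of steps 2-3, and filtering on that value stands in for the worker requirement of step 4.

```python
from dataclasses import dataclass, field


@dataclass
class Panel:
    """Hypothetical in-memory stand-in for an MTurk qualification type."""
    qual_name: str                              # step 1: the "Friendly Name", e.g., "qual"
    scores: dict = field(default_factory=dict)  # Worker ID -> qualification value

    def assign(self, worker_id: str, value: int) -> None:
        # Steps 2-3: tag workers; MTurk qualification values run from 0 to 100.
        if not 0 <= value <= 100:
            raise ValueError("qualification values must be between 0 and 100")
        self.scores[worker_id] = value

    def eligible(self, required: int = 1) -> list:
        # Step 4: the HIT criterion (e.g., qual "equal to" 1) selects panelists.
        return [w for w, v in self.scores.items() if v == required]
```

On MTurk itself, the tagging runs through the Worker .csv file described in the steps below rather than an in-memory dictionary.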
Step 1: Create a New Qualification Type

For your qualification type, name your panel by entering a label under the Friendly Name field and, as required by MTurk, provide a description. Note: Turkers will be able to view your name and description, so it is advisable to keep your qualification names and descriptions general, but specific enough for you to remember why you are using them. We labeled our qualification name “qual,” which is short and generic.
When the new qualification type has been created, you should be able to view it in the Manage Qualification Types table within the MTurk interface. It may take a few minutes for the system to update, and you will need to refresh the page to view it. At this point there will be a 0 in the “Workers who have this Qualification” column, as workers have not yet been added to your panel.
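Requesters who prefer scripting to the web interface can create the same qualification type through the MTurk API. A minimal sketch in Python: the `qualification_request` helper is ours, and the commented-out boto3 call is one assumed way to submit the parameters (the Name, Description, and QualificationTypeStatus keys follow the API's CreateQualificationType operation).

```python
def qualification_request(friendly_name: str, description: str) -> dict:
    """Build CreateQualificationType parameters for a panel qualification.

    Both fields are visible to Turkers, so keep them general but memorable.
    """
    if not description:
        raise ValueError("MTurk requires a description")
    return {
        "Name": friendly_name,  # e.g., the short, generic "qual"
        "Description": description,
        "QualificationTypeStatus": "Active",
    }


# Assumed usage with boto3 (credentials configured separately):
# import boto3
# mturk = boto3.client("mturk")
# mturk.create_qualification_type(**qualification_request("qual", "General research panel"))
```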
Step 2: Download the Workers File and Tag Each Participant (by Worker ID)
To add participants to your panel, download your global MTurk Workers file. To do so, click on Workers under the
Manage tab.
Here you will find a list of all of the Turkers who have ever completed a HIT for you. For each Turker, the first column shows the Worker ID, and the second column shows the number of HITs that the Turker has completed and the number that you have approved. For example, the fourth Turker on the list below has completed eight of our studies, and we have approved all eight of his or her submissions (as reflected in the lifetime approval rate). This 100% approval rating is just for our studies and does not incorporate the approval ratings from other researchers (i.e., requesters).
Next, click on the Download CSV button to export this table.
This .csv file includes a list of every worker who has ever completed a study for you. In addition to the lifetime stats (pertaining to your studies) for each individual, you will find two columns for each qualification type that you have created. The columns are automatically named with the following convention: CURRENT-Friendly Name and UPDATE-Friendly Name, where Friendly Name refers to the name that you chose to call your panel. In our example, our Friendly Name is “qual,” so the two columns associated with our panel are CURRENT-qual and UPDATE-qual.
To add a worker to your panel, assign a numerical code (anywhere from 0 to 100) in the UPDATE column. We use the following convention when creating a panel: 1 for anyone in our panel and blank for everyone else.9 In our example .csv file, we have entered a 1 in the UPDATE-qual column for the following Worker IDs: A1RJ2LOEXAMPLE, A8DRC9EXAMPLE, and 8UHC9EXAMPLE2. Thus, when this procedure is complete, these three workers will be included in our newly created panel.
When you have finished revising this Worker file, save it as a .csv file to be used in the next step.
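The same revision can be scripted instead of edited by hand in a spreadsheet. A sketch in Python, assuming the “Worker ID” and “UPDATE-qual” column headers match your own export (verify them before running anything against a real file):

```python
import csv
import io


def tag_panelists(workers_csv: str, panel_ids: set) -> str:
    """Write a 1 into UPDATE-qual for every worker in panel_ids.

    workers_csv is the text of the Workers .csv exported from MTurk;
    the "Worker ID" and "UPDATE-qual" column names are assumptions
    based on the CURRENT-/UPDATE- convention described above.
    """
    rows = list(csv.DictReader(io.StringIO(workers_csv)))
    for row in rows:
        if row["Worker ID"] in panel_ids:
            row["UPDATE-qual"] = "1"  # a blank cell leaves a worker out of the panel
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```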
9 We leave the space blank for Turkers about whom we do not yet have enough information to discern whether they should be in our panel. If we know at this point that someone should not be in our panel (e.g., Turkers who have demonstrated inconsistency or deception in the past), we assign a 0 to the qual code of these individuals.
Step 3: Upload the Updated Worker File (.csv)

Next, select your .csv file (click Browse) and click Upload CSV. Note: Excel files do not work within the MTurk environment. If you have your updates saved in an Excel file, convert it to a .csv file before uploading.
Throughout this process, you may have noticed that you have an option to block specific Turkers from ever taking future
studies (in the Block Status column). We recommend against using this feature, as in our experience it leads to emails from
Turkers concerned about their MTurk accounts being revoked. Qualification codes are a far more effective way to limit who
is allowed to take part in your studies.
Once you have uploaded your revised Worker file (.csv), you have created your panel. You will see on the screen which
workers are included and which ones are not. In our example, there is a qualification named “qual,” and some Turkers (each
having a unique Worker ID) have been assigned the value of 1.
Step 4: Include Your Panel as a Criterion When Launching a New Study

Scroll down to the “Worker requirements” section and click the “(+) Add another criterion” button.
Scroll down to the “Qualification Types you have created” section within the drop-down menu and select your panel name
(this is the Friendly Name from earlier). In our example, “qual” is selected and set “equal to” the value of 1, indicating that
only panelists are eligible to take part in our studies.
In the HIT Visibility section, make sure that Hidden is checked, so that only your panelists can view and take part in your study. Otherwise, you may receive email requests from nonpanelists asking to be added to your panel.
Then continue to post your new HIT as usual. Note: to improve the response rate, you may want to notify Turkers of the new study that you have posted. Unfortunately, there is no easy way to do this within the MTurk platform: you would need to click on each Worker ID and manually send a personal email to each Turker who qualifies. The R package MTurkR does, however, allow for batch notifications. See web appendix E for example code for sending out batch notifications.
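As a sketch of what such a batch protocol does under the hood: MTurk's NotifyWorkers operation accepts up to 100 Worker IDs per request, so a notification run is essentially a matter of chunking the panel into batches. The `notify_panel` helper and its boto3-style `client` argument below are illustrative assumptions, not MTurkR's actual interface.

```python
from itertools import islice

NOTIFY_LIMIT = 100  # NotifyWorkers accepts at most 100 Worker IDs per request


def batches(worker_ids, size=NOTIFY_LIMIT):
    """Yield successive lists of at most `size` Worker IDs."""
    it = iter(worker_ids)
    while chunk := list(islice(it, size)):
        yield chunk


def notify_panel(client, worker_ids, subject, message):
    """Send one notification request per batch of Worker IDs.

    `client` is assumed to expose a boto3-style notify_workers method;
    MTurkR performs equivalent batching from R.
    """
    for chunk in batches(worker_ids):
        client.notify_workers(Subject=subject, MessageText=message, WorkerIds=chunk)
```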
Leeper, Thomas J. (2017), “MTurkR: R Client for the MTurk Requester API,” R package version 0.8.0, https://cran.r-project.org/web/packages/MTurkR/MTurkR.pdf.

Leitch, Will (2004), “Group Thinker,” New York Magazine, June 21, http://nymag.com/nymetro/shopping/features/9299/#comments.

Litman, Leib, Johnathan Robinson, and Tzvi Abberbock (2016), “TurkPrime.com: A Versatile Crowdsourcing Data Acquisition Platform for the Behavioral Sciences,” Behavior Research Methods, forthcoming.

Martin, David, Benjamin V. Hanrahan, Jacki O’Neill, and Neha Gupta (2014), “Being a Turker,” in Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work and Social Computing, New York: ACM, 224–35.

Paolacci, Gabriele, Jesse Chandler, and Panagiotis G. Ipeirotis (2010), “Running Experiments on Amazon Mechanical Turk,” Judgment and Decision Making, 5 (5), 411–9.

Peer, Eyal, Joachim Vosgerau, and Alessandro Acquisti (2014), “Reputation as a Sufficient Condition for Data Quality on Amazon Mechanical Turk,” Behavior Research Methods, 46, 1023–31.

Richins, Marsha L. (2004), “The Material Values Scale: Measurement Properties and Development of a Short Form,” Journal of Consumer Research, 31 (1), 209–19.

Ross, Joel, Lilly Irani, M. Silberman, Andrew Zaldivar, and Bill Tomlinson (2010), “Who Are the Crowdworkers?: Shifting Demographics in Mechanical Turk,” in CHI’10 Extended Abstracts on Human Factors in Computing Systems, New York: ACM, 2863–72.

Sawtooth Software (2013), “The CBC System for Choice-Based Conjoint Analysis,” Technical Paper Series, https://www.sawtoothsoftware.com/download/techpap/cbctech.pdf.

Sharpe, Kathryn, Richard Staelin, and Joel Huber (2008), “Using Extremeness Aversion to Fight Obesity: Policy Implications of Context Dependent Demand,” Journal of Consumer Research, 35 (3), 406–22.

Tong, Betty C., Joel Huber, Deborah D. Ascheim, John Puskas, T. Bruce Ferguson Jr., Eugene Blackstone, and Peter K. Smith (2012), “Weighting Composite Endpoints in Clinical Trials: Essential Evidence from the Heart Team,” Annals of Thoracic Surgery, 94 (6), 1908–13.

Wessling, Kathryn Sharpe, Oded Netzer, and Joel Huber (2016), “Customer Response to Within-Chain Price Hikes,” working paper.

Zhou, Haotian and Ayelet Fishbach (2016), “The Pitfall of Experimenting on the Web: How Unattended Selective Attrition Leads to Surprising (Yet False) Research Conclusions,” Journal of Personality and Social Psychology, 111 (4), 493–504.