

More Gamification Is Not Always Better: A Case Study of
Promotional Gamification in a Question Answering Website∗

REZA HADI MOGAVI, Hong Kong University of Science and Technology, Hong Kong SAR
EHSAN-UL HAQ, Hong Kong University of Science and Technology, Hong Kong SAR
SUJIT GUJAR, International Institute of Information Technology, India
PAN HUI, Hong Kong University of Science and Technology and University of Helsinki, Hong Kong SAR
and Finland
XIAOJUAN MA, Hong Kong University of Science and Technology, Hong Kong SAR
Community Question Answering Websites (CQAs) like Stack Overflow rely on continuous user contributions
to keep their services active. Nevertheless, they often undergo a sharp decline in their user participation
during the holiday season, undermining their performance. To address this issue, some CQAs have developed
their own special promotional gamification schemes to incentivize users to maintain their contributions
throughout the holiday season. These promotional gamification schemes are often time-limited, optional, and
run alongside the default gamification schemes of their websites. However, the impact of such promotional
gamification schemes on user behavior remains largely unexplored in the existing literature. This paper takes
the first steps toward filling this knowledge gap by conducting a large-scale empirical study of a particular
promotional gamification scheme called Winter Bash (WB) on the CQA of Stack Overflow. According to our
findings, promotional gamification schemes may not be the panacea they are portrayed to be. For example, in
the case of WB, we find that the scheme is not effective for improving the collective engagement of all users.
Only some particular user types (i.e., experienced and reputable users) tend to respond to WB. Most
novice users, who comprise the majority of Stack Overflow's user base, seem to be indifferent to such
a gamification scheme. Our research also shows the importance of studying the quantity and quality of user
engagement in unison to better understand the effectiveness of a gamification scheme. Previous gamification
studies in the literature have focused predominantly on studying the quantity of user engagement alone. Last
but not least, we conclude our paper by presenting some practical considerations for improving the design of
future promotional gamification schemes in CQAs and similar platforms.
CCS Concepts: • Human-centered computing → Empirical studies in HCI; • Information systems →
Collaborative and social computing systems and tools; Data mining; • Applied computing → Computer
games.
Additional Key Words and Phrases: Gamification, new gamification schemes, promotional gamification,
temporary gamification, user engagement, user behavior analysis, Community Question Answering Website
(CQA), Difference-in-Differences (DiD)
∗ This manuscript is the pre-print version.

Authors’ addresses: Reza Hadi Mogavi, rhadimogavi@cse.ust.hk, Hong Kong University of Science and Technology, Hong
Kong SAR; Ehsan-Ul Haq, euhaq@connect.ust.hk, Hong Kong University of Science and Technology, Hong Kong SAR;
Sujit Gujar, sujit.gujar@iiit.ac.in, International Institute of Information Technology, India; Pan Hui, panhui@cse.ust.hk,
Hong Kong University of Science and Technology and University of Helsinki, Hong Kong SAR and Finland; Xiaojuan Ma,
mxj@cse.ust.hk, Hong Kong University of Science and Technology, Hong Kong SAR.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2573-0142/2022/11-ART452 $15.00
https://doi.org/10.1145/3555553


ACM Reference Format:


Reza Hadi Mogavi, Ehsan-Ul Haq, Sujit Gujar, Pan Hui, and Xiaojuan Ma. 2022. More Gamification Is Not
Always Better: A Case Study of Promotional Gamification in a Question Answering Website. Proc. ACM
Hum.-Comput. Interact. 6, CSCW2, Article 452 (November 2022), 32 pages. https://doi.org/10.1145/3555553

1 INTRODUCTION
Gamification, which refers to “the use of game design elements in non-game contexts [23]”, has gained
widespread adoption in most Community Question Answering Websites (CQAs), such as Stack
Overflow [106], Quora [105], Baidu Zhidao [104], and Zhihu [107]. These websites commonly utilize
a default gamification scheme to enhance user engagement and encourage frequent exchanges of
questions and answers [5, 17, 88].
However, on certain occasions, these default gamification schemes may not be compelling enough
to motivate users to be active. For example, a major problem currently faced by some (gamified)
CQAs is the decline in user participation during the Christmas and New Year (holiday season). In
fact, there is a vast body of literature that confirms a decline in user engagement during the holiday
season (see [27, 61, 68]).
The adverse consequences of such a decline in user engagement include an increased tendency
for users to churn (i.e., leave the website without any intention to return anytime soon) [38, 67, 77],
delayed services (e.g., late responses) [10, 49], and other issues concerning sustainability [67, 89].
Therefore, many CQAs are actively seeking solutions to address the problem of user engagement
during specific periods such as the holiday season.
Some CQAs opine that promotional (or additional) gamification schemes could serve as a possible
remedy. Running in parallel with their websites’ default (gamification) schemes, these gamification
schemes encourage users to maintain their participation when user contributions are scarce and
most needed, such as during the holiday season. As a case in point, Winter Bash (WB) [93] is one
such gamification scheme from the software engineering and programming CQA of Stack Overflow
[39]. The three salient features that distinguish WB from the default gamification scheme at Stack
Overflow are as follows.
• WB is a temporary, playful festival held in winter for a limited period.
• Although the invitation is open for all, user participation in WB is not mandatory. Put
differently, users can decide for themselves whether to participate or not. To this end, there
are corresponding buttons in users’ personal accounts labeled “I love hats” or “No hats for me,
please.”
• Several innovations have been made in WB itself. To illustrate, the gamified feast comes with
its own new types of badge rewards, which are sticky visual items known as “hats.” Figure 1
shows some sample hats.
Alas, such promotional gamification schemes’ (potential) impact on user behavior remains
largely unexplored in the literature. Further research is needed to understand whether promotional
gamification schemes work, which users tend to be influenced the most, and what designers and
practitioners can take away from the design lessons. Against this backdrop, our work is motivated
to fill this knowledge gap by presenting the first empirical study of the impact of WB on user
engagement on a large-scale CQA. Our research is designed as a comprehensive case study. More
specifically, we analyze user engagement during and after the gamified celebration of WB from the
following aspects: quantity/quality of user engagement and churn rate.
We conduct our study at three levels to investigate the potential impact of WB. These are as
follows: (1) the collective behavior of all users, (2) different user types (based on Stack Overflow’s
categorization), and (3) users who choose to participate in the WB festivities.


Fig. 1. An anonymized screenshot that shows the sticky, hat-shaped badges (or simply hats) used in the
promotional gamification scheme of Winter Bash (WB). The hat rewards are removed upon the completion of WB.

The categorization in level (2) is based on Stack Overflow's reputation measure1 [16]: novice
users (rep < 10), low reputation users (rep in [10, 1,000)), established users (rep in [1,000, 20K)), and
trusted users (rep ≥ 20K).
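For illustration, a minimal Python sketch (not the authors' code) of this reputation-based bucketing, using the thresholds listed above:

```python
def user_type(reputation: int) -> str:
    """Map a Stack Overflow reputation score to the paper's four user types."""
    if reputation < 10:
        return "novice"
    if reputation < 1_000:
        return "low reputation"
    if reputation < 20_000:
        return "established"
    return "trusted"
```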
Research questions. Formally, the following research questions guide our research:
• RQ1: How does the collective behavior of users change during and after WB? (level 1 analysis)
• RQ2: How does the behavior of different user types change during and after WB? (level 2
analysis)
• RQ3: Does users’ participation status in WB make a meaningful difference in their behavior
during and after WB? (level 3 analysis)
In this paper, we use data from the three most recent WB events (2018, 2019, and 2020) to answer
our research questions. In RQ1, we find that WB cannot (completely) prevent the collective decline
in user engagement during the holiday season. In fact, collective user churn probably poses the
greatest threat to Stack Overflow's overall performance during this crucial period. From RQ1,
the key design lesson researchers and practitioners can infer is that more gamification does not
necessarily work better or appeal to all users.
In RQ2, we stratify the dataset by user type (i.e., novice, low reputation, established, and
trusted users). The advantage of this approach is that it reveals more detail about the behavior of
smaller but important user groups, such as trusted users, whose behavior is normally obscured by
the unbalanced proportion of data from more populous user types such as novices.
We find that users with low reputations tend to show the least significant changes in their
posting behavior during and after the bash. In contrast, the posting behavior of established users
is the most susceptible to significant change during and after the bash. Succinctly put, the key
takeaway from RQ2 is that the potential impact of WB on user engagement differs by user type.
Therefore, our results highlight the importance of personalizing gamification.
In RQ3, which is a core part of this paper, we seek to quantify the potential impact of WB on the
behavior of WB-participating users. We do this through a deep and focused comparison between
WB-participating and Non-participating users. In our work, we use a well-known econometric
method known as Difference-in-Differences (DiD). Consistent with RQ2, we quantitatively show that
1 Reputation is a rough measure of how much the Stack Overflow community trusts a user (see [75]).


WB has a potentially more significant and favorable impact on the engagement of WB-participating
users belonging to user types with high reputations. The most important finding from RQ3 is
that user participation in the WB celebration (alone) is not sufficient to bring about a significant
(positive) change in user behavior.
Contributions. This paper assumes significance for Computer-Supported Cooperative Work
(CSCW) and Human-Computer Interaction (HCI) communities because it adds nuance to the
literature on gamification. To the best of our knowledge, our work denotes the first systematic
quantitative study that aims to determine the usefulness of promotional gamification in a large-scale
online community. In summary, we make the following main contributions to the existing body of
knowledge in this area:
• While most studies of user incentivization solely focus on the quantity of user engagement,
we demonstrate the need to examine both the quantity and quality of user engagement
together to provide a more comprehensive picture of user engagement.
• This paper also underscores the need to implement more tailored and inclusive gamification
designs in general and promotional gamification schemes in particular.
• Last but not least, we present some design considerations for gamification researchers and
practitioners, especially those working in the domains of online communities and CQAs.
These design considerations are aimed at helping them improve the design of their pro-
motional gamification schemes and attract more users for more effective and accountable
engagement.

2 RELATED WORK AND BACKGROUND


This section consists of three parts. In the first part, we survey the literature on gamification and its
underlying theories. In the second part, we summarize the current state of research on Community
Question Answering Websites (CQAs). Finally, we provide a brief introduction to WB.

2.1 Gamification and the Underlying Theories


In recent years, the interest in gamification has skyrocketed, leading to increased research and use
in a variety of fields: health [3, 52, 53, 83], education [22, 56, 69, 96], civic engagement [36, 44, 73],
and crowdsourcing [70, 71, 109]. However, this increased interest in gamification has given rise to
the need to analyze the effectiveness of gamification in driving user behavior on different platforms
and in different scenarios [43, 85].
In practice, many gamification designers and practitioners want to know when and how each
gamification element works better [17, 57, 109]. The most commonly studied gamification elements
in the literature are badges, points, and leaderboards [43, 45, 85]. Moreover, it has been shown that
different gamification components have different impacts on people’s motivation [42].
In an insightful research study, Mekler et al. employ a psychological framework known as
Self-Determination Theory (SDT) to study the effects of game design elements on people’s intrinsic
motivation and performance [66]. According to SDT, the satisfaction of users’ psychological needs
for autonomy (volition and personal agency), competence (sense of efficacy), and relatedness (social
interaction) stimulates their intrinsic motivation [41, 82, 97]. Mekler et al. note that user engagement
can be effectively enhanced by means of points, levels, and leaderboards [66].
Interestingly, a study conducted by Lieberoth revealed that even the mere introduction of a
game frame into an activity can modify the behavior of users and enhance their performance
[60]. Slotting participants into two different research conditions, Lieberoth asks them to conduct
brainstorming sessions. Under the first condition, called deep gamification, participants have access
to a competitive discussion board game. The second condition, introduced as shallow gamification,
is where participants still have access to the same game artifacts, albeit without the actual game


elements. It is noteworthy that both conditions yield similar psychological effects and significantly
increase participants’ performance compared to the control group.
The dark side. Nevertheless, due to the destructive potential of gamification, researchers and
practitioners in this field usually have two main concerns [13, 29, 30, 110]: (1) does gamification have
negative effects on user behavior? And (2) do users utilize gamification as the designers intended
them to? For example, the escape theory posits that the entertaining activities in gamification
schemes can sometimes distract enthusiastic people from their original tasks or responsibilities
[37, 64]. What happens if users change their behavior by focusing only on the quantitative aspect
of their activities, thereby neglecting the quality?
Thus, an important threat facing many gamification applications is when gamification becomes
dysfunctional [7, 24, 95]. In such cases, gamification is reduced to a meaningless medium for
users to participate in an application rather than a genuine source of motivation [7, 11, 24, 110]. Worse still,
users might be tempted to explore nefarious activities such as cheating, spamming, or griefing to
win or just get more gamification rewards. Technically, it is said that users are gaming the platform
[21, 37].
Furthermore, the overjustification effect is another frequently cited criticism of gamified applica-
tions [81]. According to this theory, additional and unjustified extrinsic rewards can reduce users’
intrinsic motivation to complete the tasks they used to perform in the absence of additional rewards
[94]. This implies that if rewards are suspended, users might suddenly stop doing their tasks, or
they might do them reluctantly [20, 94]. For this reason, we believe that identifying the destructive
and positive aspects of gamification together is an important research direction in this field.
Reflection. Our literature review did not reveal a large-scale study analyzing user behavior during
and after the end of a time-limited gamification scheme. Thus, we believe that WB makes it possible
to conduct a large-scale empirical analysis and fill this knowledge gap. In addition, WB provides
researchers with a unique opportunity to attain large-scale empirical support for the phenomenon
of the overjustification effect. Do users become less active than before after the termination of WB?
Also, which users are influenced the most? In addition, our research helps CQAs find users who are
more likely to game their platforms by analyzing both the quality and quantity of user behavior in
conjunction.

2.2 Community Question Answering Websites (CQAs)


Community Question Answering Websites (CQAs) enable users to ask questions and contribute
answers [15, 88]. In fact, CQAs encourage users to engage in problem- and inquiry-based learning,
two effective methods of fostering learner engagement [40]. Also, some researchers consider CQAs
to be a type of collaborative project (see [50]) consisting of questioners and answerers. Thus, CQAs
are considered one of the most fundamental applications in CSCW [39, 59, 91].
In this context, most studies on CQA services focus on finding ways of improving user collabora-
tion and experience. The richness and accessibility of data on CQA users’ interactions (with CQAs
and with each other) have made data mining and knowledge discovery studies on these platforms
particularly beneficial [9, 62, 80]. For instance, many researchers have worked toward enhancing
the design of user interventions (e.g., badges) [5, 54], predicting churn [26, 63, 77], and offering
users supportive Natural Language Processing (NLP) services [31, 86]. Moreover, identifying the
barriers and facilitators to user engagement on these websites is one of the main goals of this CQA
mining and knowledge discovery effort [32, 67, 109].
Several empirical studies demonstrate the effectiveness of using gamification, and badges, in
particular, to increase user engagement in CQAs [5, 28, 54, 103]. A study conducted by Vasilescu
provides empirical evidence that answerers tend to answer questions faster in Stack Overflow’s
gamified environment than on (similar) mailing lists [103].


Furthermore, according to a previous CSCW study conducted by Cavusoglu et al., badges can
sometimes facilitate greater user engagement in a broader range of activities – as opposed to
merely the activity for which the badges were designed [17]. This is an interesting observation
because, contrary to the literature’s predominant trend that suggests gamification might undermine
users’ intrinsic motivation [18], the aforementioned CSCW paper assumes that some users might
internalize the incentives created by earned badges, thereby becoming motivated to perform more,
and more varied, activities. However, the sheer complexity of this topic means that the intrinsic values
of gamification still necessitate further research.
Another study of Stack Overflow – a natural experiment conducted by Bornfeld and Rafaeli
– shows that badges do not always successfully direct user behavior toward the desired goals
(e.g., answering a series of archaic questions) [14]. More specifically, 18 of the 22 badges tested
in their study were found to have a positive effect, three had a negative effect, and one had no
effect. Bornfeld and Rafaeli attribute these differences in effect to the types of users. However, it is
noteworthy that the user types are not detailed in their study.
According to the literature, users also tend to increase their contributions to CQAs when they
are close to earning a badge. However, their contributions decrease significantly when they have
just earned a badge and still are far away from earning their next badge [34, 87, 109]. To explain
how users change their usual participation patterns in order to obtain the desired badges, Anderson
et al. analyze user behavior in Stack Overflow and use a computational model based on Markov
Decision Processes [5]. Their model can suggest areas where CQAs should place badges in their
design to motivate more users to make certain contributions. Despite hypothesizing that different
badges have different values for different users, they do not test this postulation in their work.
Kusmierczyk et al. examine first-time badges at Stack Overflow and attempt to determine more
specific reasons as to why some badges are not as successful as others [54]. Notably, they conduct
natural experiments based on a novel inferential framework to determine the causal effect of these
badges. According to their findings, only some badges rewarded for tasks with a lower initial utility
to users have a significant causal effect on driving user behavior. Their work primarily focuses on
badges, which makes their results unique and interesting.
Reflection. In most of these studies, the quality of user participation is often ignored, as a result
of which the focus is only on users’ behavioral engagement (i.e., only on the quantity and volume
of activities). Furthermore, the important contribution of asking questions is often overshadowed
by other types of user contributions. By considering these subtle aspects, our paper builds on the
related work by explicating user engagement in a large-scale CQA. We examine the quantity and
quality of user contributions cumulatively and study user questions and answers separately.

2.3 Winter Bash: A Brief Introduction


Each year, Stack Overflow and Stack Exchange family websites run a winter promotional gamifica-
tion scheme called Winter Bash (WB) to celebrate the end of a year with special rewards referred to
as hats [76]. This gamified celebration is often held from mid-December to early January. One of the
main goals of this celebration is to increase user contributions, and in effect, the website’s traffic,
when user activity is at one of its lowest rates [99]. In addition, some users believe that the relevance
of this gamified celebration is particularly prominent during the Christmas and New Year period
after a stressful or eventful year (such as the pandemic), by bringing people together [99, 101].
Despite the popularity of WB, its effectiveness remains contentious. For example, while ardent
supporters insist on the anecdotal positive effects of WB on user engagement (especially due to the
manner in which Stack Overflow introduces the bash each year), most critics are skeptical of such
positive effects. They also express concerns about the potential side effects (e.g., increased number
of lower quality posts) [100, 102]. The lack of systematic research in this area has exacerbated the


situation, leaving many CQA developers and enthusiastic gamification practitioners unsure of how
to proceed further. In the remainder of this article, we aim to address some of these questions and
uncertainties.

3 DATA
This section consists of three parts that contain information about our data collection, data
description, and data processing.

3.1 Data Collection


We utilize Stack Exchange Data Explorer 2 to extract all of the features required for our analysis. In
addition, we utilize an automatic crawler called Helium Scraper 3 to collect and verify the users’
WB participation status from the WB leaderboards. The complete list of features used in this work
and their corresponding explanations can be found in Appendix A. The features are either adopted
or adapted from [77] and fall into seven categories: temporal features, consistency, speed, gratitude,
competitiveness, content, and knowledge level.

3.2 Data Description


Our dataset comprises nearly 6.1 million users, 1.8 million questions, and 2.2 million answers from
the three most recent WB events, i.e., from 2018 to 2020. Each year includes user behavior from
November 1 to February 28 (120 days). Table 1 shows how Stack Overflow users are distributed
across different user types and their participation rate (in WB) during the aforementioned period.
The Table also shows that the population of novices is the largest, whereas the population of trusted
users is the smallest. After the novices, users with low reputation and established users are the
next most populous user types (respectively).

3.3 Data Processing


As mentioned earlier, user participation in WB is voluntary. Therefore, it is important to label users
as either WB-participating or Non-participating. To determine the participation status of users, we
use the information from the official WB leaderboards4 each year. More specifically, we define
WB-participating users as those who appear in the WB leaderboards and use at least one hat to
manually customize their profile photos. For readers unfamiliar with Stack Overflow, WB
leaderboards are simple gamification mechanics that rank users by their number of achievements
(i.e., hat rewards) during the festivity. They also help CQAs render the list of users who have
participated in the event.
The different user types’ participation rates can also be found in Table 1. Evidently, WB 2020 has
the highest participation rate, whereas WB 2019 is apparently the least popular (also see [98]).
Moreover, while the participation rate of established users is the highest among all user types,
users with low reputations have the largest volume of participation in the festivity.
Another point that needs further explanation is the duration of the WB festivities, which usually
varies from one year to another. In our study, the durations of the WBs from 2018 to 2020 are 21,
24, and 20 days, respectively. More specifically, WB 2018 was held between December 12, 2018, and
January 1, 2019. WB 2019 was held between December 9, 2019, and January 1, 2020. Lastly, WB 2020
was held between December 16, 2020, and January 4, 2021. However, our data spans (observation
periods) are a bit longer and encompass some time before and after each WB event. Formally, we
define all days from November 1 to the first day of WB (but not including it) as before WB, and all
days after the last day of WB to February 28 as after WB. For more complementary information
about each WB, refer to [76].
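As a concrete illustration, the following hedged Python sketch encodes the two labeling rules used in this subsection: the participation rule (appearing on a WB leaderboard and using at least one hat) and the before/during/after observation windows. The date ranges are taken from the text; all variable and function names are illustrative only.

```python
from datetime import date

# WB date ranges as stated above; both endpoints count as 'during'.
WB_RANGES = {
    2018: (date(2018, 12, 12), date(2019, 1, 1)),
    2019: (date(2019, 12, 9), date(2020, 1, 1)),
    2020: (date(2020, 12, 16), date(2021, 1, 4)),
}

def wb_participating(user_id, leaderboard_ids, hats_used) -> bool:
    """WB-participating: listed on the WB leaderboard and used at least one hat."""
    return user_id in leaderboard_ids and hats_used.get(user_id, 0) >= 1

def wb_period(d: date, year: int) -> str:
    """Label a date within a season's observation window (November 1 to February 28)."""
    start, end = WB_RANGES[year]
    if d < start:
        return "before"  # November 1 up to, but not including, the first WB day
    if d <= end:
        return "during"
    return "after"       # after the last WB day, through February 28
```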
2 https://data.stackexchange.com/
3 https://www.heliumscraper.com/
4 https://winterbash2020.stackexchange.com/leaderboard/stackoverflow.com


Table 1. The distribution of users across different user types and their participation rate in WB

WB Year | All Users | Novice Users (N) | N's WB Part. | Low Reputation Users (L) | L's WB Part. | Established Users (E) | E's WB Part. | Trusted Users (T) | T's WB Part.
2020 | 2,176,202 | 1,557,317 | 1.25% | 545,249 | 46.61% | 69,973 | 90.38% | 3,663 | 70.46%
2019 | 2,089,295 | 1,465,891 | 0.70% | 585,002 | 18.06% | 34,639 | 84.87% | 3,763 | 64.01%
2018 | 1,926,803 | 1,350,425 | 0.79% | 558,023 | 36.01% | 14,547 | 86.95% | 3,808 | 80.77%
Total | 6,192,300 | 4,373,633 | 0.93% | 1,688,274 | 33.21% | 119,159 | 88.36% | 11,234 | 71.79%

(Part. = Winter Bash participation rate of that user type.)


4 RESEARCH ETHICS
During this research, we also took measures to ensure the ethical aspects of the research and
to comply with Stack Overflow’s rules and terms of use. Before commencing data collection, we
double-checked with Stack Overflow (via email) to determine if it was possible to collect and process
their public domain data. After receiving their confirmation, we also obtained the Institutional
Review Board (IRB) approval from our university. In compliance with the IRB, our research team
is committed to following all of the ethical guidelines introduced by the Association of Internet
Researchers (AoIR) [74] and the General Data Protection Regulation (GDPR) [33]. For example, our
efforts to safeguard the ethical and legal rights of users include, but are not limited to, the following:
(1) requiring each researcher to complete certified training to learn how to work with human data
prior to data collection or analysis; (2) anonymizing the identity, location, and contact information
of all users (if posted on the platform); (3) prohibiting any attempts to discover the users’ true
identity at any stage of the study or unauthorized marketing; and (4) ensuring adherence to the
site’s traffic rules and thus crawling the benchmark dataset in a gradual and scheduled manner to
make fair use of the site’s bandwidth.

5 THE STUDY
Our study can be seen as a natural extension of the research series “Does Gamification Work?” [5, 43,
54, 57, 65, 108]. However, this study is unique in that it represents a groundbreaking investigation
of a promotional gamification scheme on a large-scale CQA platform. This section expounds on
our study in three subsections in the order of our original research questions (RQ1 to RQ3). Each
subsection begins with a prologue and a summary of the main findings. Next, we explain the
settings and methods. Finally, we conclude each part by undertaking a detailed discussion on the
findings.

5.1 Winter Bash and Collective User Behavior (RQ1)


In this subsection, we analyze the behavior of the Stack Overflow population as a whole. In
particular, we examine whether and how users' collective engagement changes during and after
WB. The main inferences of RQ1 can be summarized as follows.
• The most important lesson researchers and practitioners can draw from RQ1 is that adding
more gamification schemes to an already gamified platform does not necessarily improve
users’ collective engagement or willingness to make more contributions.
• Longitudinal analysis of users’ collective behavior from the periods during and after WB
reveals that WB cannot (completely) halt the reduction in engagement. On the contrary, some
user behaviors actually worsen (e.g., churn). This suggests that the current
WB scheme may not be the panacea it appears to be for everyone.


• Cross-sectional analysis of user engagement on days of the year that were part of WB in some
years but not others shows no statistically significant difference between users’ collective
engagement on these days. This observation itself is another piece of evidence supporting
the general conclusion of RQ1 already mentioned in the first item.
In the remainder of this subsection, we will detail the settings and methods as well as the results.
5.1.1 Settings and Methods. In this paper, the quantity of user engagement is a measure that indi-
cates the average number of user contributions (questions or answers) per day for each individual.
The quality is a complementary measure that refers to the average score5 of user contributions per
day per post. Notably, the definitions of engagement quantity and quality in our work are inspired
by the features of “frequency” and “quality” from the extant literature (see [77]). Again, congruent
with the literature, we define churned users as those who have not posted any new posts on the
website for at least six months after their last post (see [77]). Accordingly, we define the daily
churn rate as the percentage of that day's active participants (i.e., users who asked or answered
questions) who churned on that day.
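To make these definitions concrete, here is an illustrative pandas sketch (the schema is assumed, not the authors' pipeline): a `posts` table with one row per contribution and columns `user_id`, `date` (a pandas datetime), and `score`.

```python
import pandas as pd

def daily_measures(posts: pd.DataFrame) -> pd.DataFrame:
    """Per-day engagement quantity, quality, and churn rate as defined above."""
    out = posts.groupby("date").agg(
        quantity=("user_id", lambda s: len(s) / s.nunique()),  # posts per active user
        quality=("score", "mean"),                             # mean score per post
    )
    # A user churns on the day of their last post if at least six months
    # (approximated here as 182 days) pass without any newer post.
    last_post = posts.groupby("user_id")["date"].transform("max")
    horizon = posts["date"].max() - pd.Timedelta(days=182)
    flagged = posts.assign(churned=(posts["date"] == last_post) & (posts["date"] <= horizon))
    out["churn_rate"] = flagged.groupby("date").apply(
        lambda g: 100.0 * g.loc[g["churned"], "user_id"].nunique() / g["user_id"].nunique()
    )
    return out
```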
To study the collective behavior of users in the presence of WB and in its absence, we use two
approaches: longitudinal and cross-sectional. Under the first approach (longitudinal), we look at
each WB year’s time window independently and compare the collective behavior of users before
WB with the behavior of the same users in the observation periods during and after the festivity. In
the second approach (cross-sectional), we focus on days of the year that were part of WB in some
years but not in others. However, it must be noted that the behaviors we compare in the second
approach do not necessarily belong to the same users. The explanation is that an interval of almost
a year between festivities is a long time by CQA standards, and most users are likely to churn [77]
during this period, advance to higher user types (see the reputation changes in [4]), or
simply decide not to attend the WB festivities in subsequent years.
Based on our data, only three periods are suitable for our cross-sectional analysis. The first period
is between December 9 and 12, which was included in WB 2019 but not in WB 2018. The second
period is between December 9 and 16, which was included in WB 2019 but not in WB 2020. Finally,
the last period is between January 1 and January 4. This period was also included in WB 2020 but
not in WB 2018 or 2019. In total, we conduct four separate cross-sectional tests for each category
of quantity and quality of user engagement and churn rate. More specifically, we compare user
engagement in WB 2019 separately with WB 2018 and WB 2020; and we compare user engagement
in WB 2020 separately with WB 2018 and WB 2019.
Furthermore, to account for the non-Gaussian distributions of our data, we use a series of non-
parametric hypothesis tests known as Mann-Whitney U tests (MWU) in both approaches to perform
our comparisons. We run all our tests using the statistical software IBM SPSS Statistics (v. 27.0). In
line with the existing research and in order to reduce the impact of unwanted false discoveries, all
p-values reported in our work are compared with their corresponding adjusted p-values calculated
through the Holm-Bonferroni correction [47]. Moreover, for all of our longitudinal studies, we
first take the mathematical union of all p-values resulting from the different years and then report
the significance of the behavioral differences only if the behavioral trends are consistent across
all three WB festivals. The advantage of this union calculation is that it makes our judgment
criterion more stringent: we report the weakest significance observed across the different years,
which we assume to be scientifically more reliable (see [55]). However, at this point, it must be
made clear that none of our comparisons suggest a causal relationship between the observed
behaviors of users and celebrating WB. Instead, we merely adhere to the exploratory and
observational nature of the study.

5 In the context of Stack Overflow, the score is a quality-monitoring measure for user posts that shows the summation of a post's received up-votes (+) and down-votes (−).


Table 2. The results of how the collective behavior of users changes during and after WB (longitudinal analysis). The Change columns show the extent of an increase (+) or decrease (−) in the quantity or quality of user participation and churn in the periods during and after WB compared to the period before WB. The significance levels come from the corresponding MWU tests.

Comparison | Measure | Questions (Change, Sig.) | Answers (Change, Sig.) | Churn Rate (Change, Sig.)
Before vs. During | Quantity | −0.15, < 0.001 | +0.89, < 0.001 | +5.17, < 0.001
Before vs. During | Quality | −0.01, 0.044 | −0.03, 0.124 | —
Before vs. After | Quantity | −0.09, < 0.001 | −0.22, < 0.001 | +2.60, 0.007
Before vs. After | Quality | −0.01, 0.030 | −0.05, 0.120 | —

Notes: In the original table, shading distinguishes Quantity from Quality rows and marks the cells where the null hypothesis is rejected.


5.1.2 Results. Table 2 shows the summary of the results of our longitudinal analysis from the three
perspectives of user Questions, Answers, and Churn Rate. The Change columns show the extent of
an increase or decrease in the quantity or quality of user participation and churn in the periods
during and after WB compared to the period before WB. The plus signs (+) in the Change columns
imply that the measures have increased at the latter reference time (i.e., during or after WB). The
minus signs (-), on the other hand, indicate a decreasing trend. The significance levels in the Table
(shown with Sig.) come from the corresponding MWU tests.
A positive change in the quantity or quality of user questions or answers, or a decrease in the
user churn rate, can support the hypothesis that WB may positively influence some (engagement)
aspects of user behavior on Stack Overflow. However, a negative trend in the quantity or quality
of questions or answers, or an increase in the churn rate, may indicate that WB has not fully
accomplished its task of correcting or stopping the adverse effects of the holiday season on user
engagement.
As can be inferred from the results in Table 2, the current WB scheme in Stack Overflow mostly
fails to achieve the goal of collective user engagement. First, we find that the quantity of user
questions decreases significantly during and after the WB celebration. Second, although the
quantity of user answers increases somewhat (i.e., +0.89) during the WB celebration, this trend
is transient. In fact, the quantity of user answers falls significantly below its pre-WB level after
the celebration ends. This phenomenon could also indicate the destructive overjustification
effect known from the gamification literature. Furthermore, since most gamification rewards in
WB are reserved for answering questions (rather than asking them), this temporary surge in
answering is not particularly surprising. Third, we find that the churn rate during WB is
significantly higher than in the period before WB. Our results can thus imply that the main
reason for the decrease in user engagement during the WB festivity is the high churn rate of users.
■ Cross-sectional analysis. The cross-sectional analysis also provides support for our longitudinal
findings that WB is perhaps not suitable for everyone. That is, we cannot find any significant
difference between users’ collective engagement (of any kind: quantity, quality, or churn) on days
with WB and without WB. However, this could be partly due to the short time period of our
cross-sectional analysis, forced by the inherent limitations of our data (i.e., the smaller number of
such days). The complete results of our cross-sectional analysis can be found in Appendix B.


Table 3. The results of how the behaviors of different user types change during and after WB (compared to the time before WB)

Comparison | Engagement Measure | Novice (Change, Sig.) | Low Reputation (Change, Sig.) | Established (Change, Sig.) | Trusted (Change, Sig.)
Before vs. During | Questions, Quantity | −0.63, 0.000* | +1.07, 0.001 | −0.14, 0.011 | +0.18, 0.012
Before vs. During | Questions, Quality | +0.03, 0.041 | −0.12, 0.017 | −0.05, 0.012 | +0.16, 0.019
Before vs. During | Answers, Quantity | +0.85, 0.000* | +0.94, 0.001 | +1.80, 0.000* | +1.96, 0.000*
Before vs. During | Answers, Quality | +0.01, 0.130 | −0.15, 0.023 | −0.05, 0.000* | +0.02, 0.000*
Before vs. During | Churn Rate | +6.06, 0.000* | +3.21, 0.000* | +1.08, 0.000* | +0.05, 0.045
Before vs. After | Questions, Quantity | −0.10, 0.000* | −0.08, 0.001 | +0.01, 0.010 | +0.03, 0.005
Before vs. After | Questions, Quality | +0.03, 0.028 | −0.13, 0.020 | +0.02, 0.012 | +0.16, 0.021
Before vs. After | Answers, Quantity | −0.19, 0.000* | −0.28, 0.001 | −0.56, 0.000* | −0.76, 0.023
Before vs. After | Answers, Quality | +0.01, 0.012 | −0.23, 0.015 | −0.17, 0.000* | +0.06, 0.011
Before vs. After | Churn Rate | +3.08, 0.120 | +1.50, 0.000* | +1.05, 0.000* | +0.02, 0.042

Notes: In the original table, shading distinguishes Quantity from Quality rows and marks the cells where the null hypothesis is rejected.

5.2 Winter Bash and Different User Types (RQ2)


As most of Stack Overflow’s users are novices who do not actually participate in WB (see Table
1), the collective engagement of users (from RQ1) was expected to be low. However, it can be
argued that this unbalanced distribution of users across different user types can be an impediment
to identifying the potential impact of WB on the user types that are in the minority: i.e., low
reputation, established, and trusted users. To address this gap, the current subsection provides a
more meticulous and detailed analysis of the engagement of different user types to determine which
users are likely to be most impacted and how. The main takeaways from RQ2 can be summarized
as follows.
• The key finding from RQ2 is that the potential impact of gamification on the engagement of
different user types is not the same. As such, our results highlight the need and importance
of adapting (tailoring) different gamification schemes.
• Stratifying user data by different user types is probably a good (exploratory) data mining
practice to better understand users’ behaviors. This is especially true when the distribution
of users across different user types is highly imbalanced.
• Upon closer inspection, we find that the promotional gamification scheme of WB is not a
successful strategy for stopping the deteriorating churn rate during the bash (even among
established users). The churn rate of every user type except trusted users increases significantly
during the period of WB.
• We conclude that the posting behavior of established users is likely to be the most influenced by
WB. Conversely, the posting behavior of users with low reputations appears to be the least
influenced by the winter celebration of WB.
The rest of this subsection explains the settings, methods, and results in finer detail.
5.2.1 Settings and Methods. The settings and methods used in RQ2 are similar to those described
for RQ1. The only difference worth noting is the increased number of comparisons in RQ2. Rather
than analyzing the behavior of all users together (collectively), RQ2
examines the behaviors of different user types separately, i.e., novice, low reputation, established,
and trusted users.
5.2.2 Results. Table 3 shows the summary of the results of our longitudinal analysis (for each user
type) from the three perspectives of user Questions, Answers, and Churn Rate (similar to Table 2).
• Novices. Unsurprisingly, the behavior of novices (comprising the majority of Stack Overflow
users) is consistent with the collective behavioral results established in RQ1. In summary, we find


that the quantity of questions asked by novices decreases significantly during and after WB. In
addition, while novices tend to give more answers to questions during WB, the quantity of their
answers decreases significantly after WB. Moreover, the churn rate of novice users during WB
is found to be significantly higher than in the period before WB. On a side note, similar to the
collective user engagement studied in RQ1, the analysis of RQ2 (again) shows no significant changes
in the quality of user contributions for novices.
• Low reputation users. However, our analysis does not end with the analysis of novices. When
we examine the behavior of the other user types, we come to some non-intuitive results that are
specific to RQ2. Following the order of user reputation, we next describe our observations about
users with low reputations. We find that they are more likely to churn during WB and that their
churn rate is also significantly higher in the period after WB. Nevertheless, we note that this
post-WB change in churn rate is noticeably lower than during WB (ratio: 1.50 to 3.21). There are
two points here that might be worth mentioning. First, low reputation users are the largest group
of users for whom we can statistically verify the increased churn rate after WB. Second, they are
also the only user group for whom the significance of changes in their quantity or quality of posts
during and after WB cannot be demonstrated. However, we can empirically observe that the rising
and falling trends in the quantity and quality of posts are somewhat similar to those of novice
users, with the exception that users with low reputations tend to ask more questions during the
WB (i.e., +1.07), and that all their contribution qualities always change negatively after the launch
of WB (p-value > 0.0004).
• Established users. Next, we scrutinize our findings of established users. We find that established
users answer more questions during the WB, but they do not keep up the pace after WB; as
such, we see a significant decrease in the quantity of their answers after the end of WB. Another
interesting point that makes the previous observations (on quantity) important is the complementary
information on the quality of user participation. Established users are the largest user group for
whom we can statistically show that their answering quality decreases (significantly) in the periods
during (change = -0.05) and after WB (change = -0.17). However, similar to what was observed for
low reputation users, our analysis of established users does not show any significant changes in
the quantity and quality of users’ question-asking behavior during and after WB. The final point
regarding established users is that they tend to have a significantly higher churn rate both during
and after WB. However, similar to all other user types, their churn rate is higher during WB than
after WB.
• Trusted users. Finally, we detect only one pair of significant changes in the behavior of trusted
users. More specifically, we find that both the quantity and quality of answers from these users
increase significantly during WB. Notably, trusted users are also the only group of users whose
quality of answers increases significantly during WB.
■ Cross-sectional analysis. The cross-sectional results help us look at user engagement during
WB from a different perspective. Please refer to Appendix C for a comprehensive compilation of
our cross-sectional comparisons. Again, most differences we notice in terms of user engagement do
not reach statistical significance. However, this may also be due (in part) to the natural
scarcity of the days we have at hand for making our cross-sectional comparisons. In what follows,
we explain some of the most insightful excerpts from our findings.
• Novices. First, consistent with the cross-sectional findings in RQ1, the analysis in RQ2 shows
no significant differences between novice engagement on days with WB and on days without WB
across different years. This point, coupled with the earlier longitudinal observations, suggests that
the current design of WB may not be the best gamification scheme for engaging novices.
• Low reputation users. Second, a quick engagement screening of low-reputation users shows
that these users tend to ask more questions in the presence of WB. Despite this pattern, it is only


the comparison of the period from December 9 to 16, 2019 (with WB) and 2020 (without WB) that
shows a significant increase in the quantities of questions asked, thereby stressing the need for
further verification. Coincidentally, our previous longitudinal results also show that users with low
reputations tend to ask more questions during WB; however, (again) we cannot comment on the
statistical significance of this observation (see Table 3).
• Established users. Third, we observe another interesting pattern showing that established
users tend to answer more questions when exposed to WB. However, comparing the period from
December 9 to 16, in 2019 (with WB) and 2020 (without WB), provides the only compelling evidence
of statistical significance. Moreover, this cross-sectional observation is also consistent with the
longitudinal findings reported in Table 3, according to which established users tend to answer
significantly more questions during WB. Based on all the results taken together, the potential
influence of WB on established users appears to be promising.
• Trusted users. Finally, regarding the trusted users, two observations stand out. First, we observe
that these users generally answer more questions in the presence of WB (except when we compare
2020 and 2019, January 1 to 4). Second, we observe that the quality of users’ answers also tends to
improve upon exposure to WB. The only time the quality of answers remains unchanged with or
without WB is when we compare 2020 and 2018 (January 1 to 4).
However, similar to our previous cross-sectional findings, these pattern observations are only
statistically significant when comparing December 9 to 16, in 2019 (with WB) and 2020 (without WB).
These observations are in exact agreement with the only significant results from the longitudinal
studies reported in Table 3. To reiterate, the trusted users tend to answer questions significantly
better quantitatively and qualitatively over the period of WB.
5.2.3 Which user types are potentially influenced the most? The remainder of this subsection
compares the behavior of different user types to figure out which user types are likely to be
influenced the most. Such comparisons can make it slightly easier for CQAs to understand the
effectiveness of their promotional gamification schemes. It also provides a suitable conclusion for
RQ2.
We use two simple measures computed based on the previous findings in Section 5.2.2 to make
our comparisons: (1) the absolute sum of significant changes in user behavior (called magnitude)
and (2) the number of significant changes in user behavior (called frequency). The bird’s-eye view
of our comparisons can be seen in Table 4.
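Concretely, and as a hedged sketch only, the two measures reduce to the following, applied to the list of statistically significant (signed) changes observed for a user type:

```python
def magnitude(significant_changes):
    """Absolute sum of a user type's significant behavioral changes."""
    return sum(abs(change) for change in significant_changes)

def frequency(significant_changes):
    """Number of a user type's significant behavioral changes."""
    return len(significant_changes)
```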
Considering the magnitude of changes in user churn rate, we see that novice users experience the
most notable changes. However, for users with low reputations and established users, the churn rate
changes more frequently, i.e., there are two changes. Furthermore, without considering the changes
in churn rates, it becomes evident that established users are indeed the most influence-prone user
types when it comes to analyzing their posting behavior. This is made clear both when we consider
the magnitude of changes in their posting behavior (i.e., 3.45) and when we consider the frequency
of significant changes in their posting behavior (i.e., five changes). Using the same measures about
users’ posting behavior, we can conclude that users with low reputations are likely to be the least
influenced by WB, or at least we cannot detect any traits that suggest significant influences on
their behavior.
■ More subtle cross-comparisons. In what follows, we explain two further noteworthy
observations (from an empirical point of view). First, we note that trusted users are the
only user type for which most of their behavioral changes show an increasing trend during and
after WB. Aside from the positive churn rates, which do not augur well for Stack Overflow, all other
changes are favorable. In addition, we also note that trusted users have the greatest improvement
in the quantity of their answers during WB.


Table 4. The comparisons between the behavioral changes of different user types

Reference | Measure | Novice (Posts, Churn) | Low Reputation (Posts, Churn) | Established (Posts, Churn) | Trusted (Posts, Churn)
Longitudinal Results | Magnitude | 1.77, 6.06 | N/A, 4.71 | 2.58, 2.13 | 1.98, N/A
Longitudinal Results | Frequency | 4, 1 | N/A, 2 | 4, 2 | 2, N/A
Cross-sectional Results | Magnitude | N/A, N/A | 0.06, N/A | 0.87, N/A | 0.79, N/A
Cross-sectional Results | Frequency | N/A, N/A | 1, N/A | 1, N/A | 2, N/A
Total | Magnitude | 1.77, 6.06 | 0.06, 4.71 | 3.45, 2.13 | 2.77, N/A
Total | Frequency | 4, 1 | 1, 2 | 5, 2 | 4, N/A

All these observed improvements reinforce the hypothesis that the promotional gamification
scheme of WB is likely to be more beneficial for the engagement of trusted users. This hypothesis
is congruent with the majority of observations from the cross-sectional findings. However, since
some of these observations are not statistically validated, further research on this matter still seems
necessary.

5.3 Winter Bash and WB-participating Users (RQ3)


Since user participation in WB is voluntary, we can conduct a more careful analysis to understand
the potential impact of this winter festivity on the behavior of different types of WB-participating
users. According to our data, 88% of established users and 71% of trusted users are WB-participating
users. This gives rise to some critical questions: does participation in the bash
really make any difference? What is the potential impact size of the bash on the engagement of
WB-participating users? Could the feast help them become more productive than Non-participating
users? The current subsection aims to address these questions.
To this end, we utilize a well-established econometric method known as Difference-in-Differences
(DiD). Due to its robust inferential setting, and because DiD does not necessarily require a
traditional (randomized) experimental setup, it is a reasonable methodological choice for our
work (see [1, 8]). Furthermore, we use non-parametric hypothesis tests as a complementary
method and seek further evidence where relevant. The main takeaways from RQ3 can be
summarized as follows.
• One of the most important findings from RQ3 is that mere user participation in a gamification
scheme does not necessarily guarantee successful user engagement. Surprisingly, we find
that the impact of WB on the engagement of participating novice users is not statistically
significant.
• According to our findings, user reputation seems to play a role in attracting users’ attention
to gamification. We find that the promotional gamification scheme of WB can potentially
exert a better impact on the engagement of highly reputed participating users. As per our
finding, the engagement of participating trusted users is probably influenced the most.
The remainder of this subsection will first detail our adopted DiD method and then discuss our
results.
5.3.1 Settings and Methods. One of the main reasons we choose DiD for our work is its supporting
mechanisms for eliminating, or at least mitigating, the influence of confounding factors on the
observations (e.g., through the matching technique). From this point of view, DiD provides a more
reliable way to investigate the potential impact of WB on the engagement of WB-participating
users. In addition, DiD analysis helps us better interpret the results of RQ1 and RQ2 and develop a
more precise way of thinking about promotional gamification.


However, before getting to the details, we should reiterate that the impacts we speak of in this
work are not necessarily causal, and proving a causal relationship would likely require further and
more controlled research environments. Against this backdrop, we can now explain how DiD is
applied in our work in four steps.
(1) Conventions and Assumptions. To formulate the potential impacts systematically, we need
to posit that all WB-participating users encounter two external gamification-related interventions
during the observation period. The first intervention is experienced when WB is launched. Similarly,
the second one is experienced when WB is terminated. As such, the festivity divides the time span
of our observation period into three parts: before (𝑠 = 0), during (𝑠 = 1), and after the bash (𝑠 = 2).
The index variable 𝑠 helps us distinguish between different time slots.
In order to estimate the potential impacts of WB (both when it exists and when it is eventually
lifted), we compare the behaviors of WB-participating users (target group) with the behaviors of
another anchor group of Non-participating users. To that end, we select both groups of users (i.e.,
target and anchor⁶) from the active users in the time window 𝑠 = 0 (i.e., before WB), which serves
as a reference point in the timelines of our study. To enable DiD analysis, an important assumption
is that both the target and anchor groups follow an initial parallel (or near-parallel [78]) trend in
the time 𝑠 = 0.
Evidently, the way we select the target and anchor groups from different user categories (i.e.,
respectively from WB-participating and Non-participating users) does not introduce selection bias
into our analysis [12]. The anchor group’s role here is primarily to cancel (or mitigate) the influence
of confounding factors that are unrelated to the bash [58]. To elaborate, many hidden confounding
variables can influence user behavior during and after WB in many different ways. The relatively
strong assumption here (similar to other DiD studies [46, 58]) is that these confounding factors
would impact the behavior of users in the target and anchor groups in almost the same way.
Therefore, in conjunction with the previous assumption of initial parallel trends, any change in
the behavior of target group users relative to the behavior of anchor group users can be associated
with the intervention, which, in our case, is WB.
We have depicted the relationship between the behaviors of target and anchor trusted users
as an example and for the purpose of illustration in Figure 2. More specifically, Figures 2a and
2b respectively show the quantity and quality of trusted users’ answers before, during, and after
WB in a time range from November 1, 2020, to February 28, 2021. A more comprehensive set of
visualizations is included in Appendix D.
In the period prior to WB (𝑠 = 0), target and anchor users show near-parallel and very similar
engagement trends on the platform, which can be visually acknowledged through Figures 2a and
2b. Based on DiD’s framework of thought, it can be reasonably expected that these parallel trends
would have persisted in the time periods 𝑠 = 1 and 𝑠 = 2 if the interventions of WB had left no
impact. However, we note that since the beginning of WB, the trends have diverged significantly
over time. Thus, we infer that the influence of WB is likely.
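To make the parallel-trends check concrete, the following is a minimal Python sketch (not the authors' code) of how the pre-WB slopes of the two groups could be compared, assuming a hypothetical DataFrame daily with columns date, group ("P" or "NP"), and answers_per_user:

import numpy as np
import pandas as pd

def pre_trend_slopes(daily: pd.DataFrame, wb_start: str, window: int = 7) -> dict:
    """Compare the pre-WB (s = 0) engagement slopes of the target (P) and
    anchor (NP) groups; near-equal slopes support the parallel-trend assumption."""
    pre = daily[daily["date"] < pd.Timestamp(wb_start)]
    slopes = {}
    for group, g in pre.groupby("group"):
        g = g.sort_values("date")
        # Smooth the raw series, in the spirit of the moving averages of Figure 2.
        smoothed = g["answers_per_user"].rolling(window, min_periods=1).mean()
        days = (g["date"] - g["date"].min()).dt.days.to_numpy()
        # Slope of a straight-line fit over the pre-WB window.
        slopes[group] = np.polyfit(days, smoothed.to_numpy(), 1)[0]
    return slopes

# e.g., pre_trend_slopes(daily, wb_start="2020-12-07")  # illustrative WB start date

A small absolute difference between the two fitted slopes is consistent with the (near-)parallel pre-trend that DiD requires; more rigorous diagnostics are discussed in [78].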
(2) Matching. An important and legitimate question that arises here is: how can researchers and
practitioners find well-paired and parallel groups of target and anchor users for their comparisons?
The short answer is: by a statistical method called matching. More specifically, we use one of the
best-known matching techniques, Coarsened Exact Matching (CEM) [48, 51]. One of the outstanding
capabilities of CEM is that matching on the observable covariates can also help find matches that
account for confounding factors that we do not see or have access to.

⁶ For unversed readers, we should explain that our study’s target and anchor groups are equivalent to the treatment and
control groups in randomized or causal studies. We pick other terms to avoid confusion with such studies.


[Two line charts; y-axes: Moving Average Value (Per Day Per User) and Moving Average Score (Per Day Per Post); x-axis: Nov 1 to Feb 21; legend: Answers (P), Answers (NP), with the WB period marked.]

(a) Quantity of answers (b) Quality of answers

Fig. 2. Figures (a) and (b) respectively show the quantity and quality of trusted users’ answers in the periods
before, during, and after Winter Bash (WB). The brown (solid) line shows the behavioral pattern of WB-
participating users (P) from the target group, whereas the golden (dotted) line represents the behavioral
pattern of Non-participating (NP) users from the anchor group.

Specifically, we use an R programming package, also called CEM⁷, to implement CEM. The CEM
package allows us to automatically match WB-participating and Non-participating users and
accordingly assign the matched users to the target and anchor groups. One advantage of using
this automated tool for matching is that researchers do not have to worry about the manual
coarsening of the data (i.e., stratifying the data into discrete bins according to observable covariates
[48]), which often requires a significant amount of expertise and domain knowledge [79]. Appendix
A comprises the full list of the observed covariates used for matching in this work. These covariates
include, for example, variables such as the time interval between posts, average post length, and
average user reputation. Finally, we conduct CEM for each user type separately (i.e., novice, low
reputation, established, and trusted users) and exclusively for each WB year. The underlying reason
for this decision is the same as the one we explained in RQ2: to ensure a fair and more accurate
comparison between users within each user-type category.
After matching, we end up with 5,966 pairs of matched users (total = 11,932 users). Each user from
the anchor group is matched with exactly one user from the target group (similar to [58]). The share
of each user type from these pairs is as follows: novice users = 1,475 (24.72%), low reputation users
= 1,306 (21.89%), established users = 1,645 (27.57%), and trusted users = 1,540 (25.81%). Furthermore,
the distribution of these matched pairs among the different WB years is as follows: year 2018 =
1,864 (31.24%), year 2019 = 1,913 (32.06%), and finally year 2020 = 2,189 (36.69%).
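The matching itself is performed with the R cem package; purely as an illustration of the core idea, the following Python sketch coarsens the covariates into bins, forms strata from the joint bins, and exact-matches users within each stratum (the participated column and the uniform binning are simplifying assumptions of ours):

import pandas as pd

def coarsened_exact_match(users: pd.DataFrame, covariates: list, bins: int = 5) -> list:
    """A simplified CEM sketch: users is assumed to carry a boolean
    'participated' column plus the observed covariates (e.g., average
    post length); returns one-to-one (target, anchor) index pairs."""
    df = users.copy()
    # 1) Coarsen: discretize every covariate into a small number of bins.
    for c in covariates:
        df[c + "_bin"] = pd.cut(df[c], bins=bins, labels=False)
    strata = [c + "_bin" for c in covariates]
    pairs = []
    # 2) Exact-match within each stratum of identical coarsened values.
    for _, stratum in df.groupby(strata):
        targets = stratum[stratum["participated"]].index.tolist()
        anchors = stratum[~stratum["participated"]].index.tolist()
        # 3) Pair one-to-one and discard the unmatched surplus.
        pairs.extend(zip(targets, anchors))
    return pairs

In the real CEM procedure, the coarsening is chosen per covariate (automatically by the package), which is precisely the manual effort the automated tool spares researchers.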
(3) Evaluation of Matches. Figures 2a and 2b demonstrate the effectiveness of our matching
method from an empirical point of view. We see that the user engagement trends for target (P) and
anchor (NP) users are almost identical in the period 𝑠 = 0. However, we must use methods more
rigorous than visual inspection to verify the effectiveness of matching with greater certitude. To
this end, we use the standardized mean difference (SMD), a well-known quantitative measure, to
assess the balance between the target and anchor groups [2, 58, 92].
⁷ https://cran.r-project.org/web/packages/cem/cem.pdf


Table 5. The summary of results showing the potential impact of WB on the behavior changes of participating
users during and after WB (with reference to the time before WB)

DiD Analysis                                   Novice           Low Reputation    Established       Trusted
                   Engagement Measures         Impact⩫  Sig.    Impact⩫  Sig.     Impact⩫  Sig.     Impact⩫  Sig.
Before vs. During  Questions (quantity)        +0.002   0.541   +0.210   0.000*   −0.012   0.021    +0.001   0.682
(𝒔 = 𝟎, 𝒔 = 𝟏)     Questions (quality)         +0.011   0.715   +0.003   0.600    −0.023   0.011    +0.015   0.080
                   Answers (quantity)          +0.001   0.742   +0.003   0.521    +0.191   0.000*   +0.230   0.000*
                   Answers (quality)           +0.001   0.856   −0.001   0.790    −0.001   0.798    +0.172   0.000*
                   Churn Rate                  −0.029   0.095   −0.041   0.000*   −0.034   0.063    −0.001   0.804
Before vs. After   Questions (quantity)        +0.001   0.458   −0.008   0.110    +0.001   0.172    +0.001   0.779
(𝒔 = 𝟎, 𝒔 = 𝟐)     Questions (quality)         +0.002   0.690   −0.001   0.897    +0.001   0.330    +0.011   0.061
                   Answers (quantity)          −0.002   0.744   −0.011   0.183    −0.015   0.042    −0.207   0.000*
                   Answers (quality)           −0.003   0.739   −0.001   0.925    −0.041   0.000*   +0.001   0.813
                   Churn Rate                  +0.003   0.802   +0.015   0.305    +0.160   0.006    −0.001   0.939
Notes: ⩫ denotes the potential impact (the mean δ across the WB years); * marks results for which the null hypothesis is rejected.

According to this measure, the target and anchor groups are indistinguishable and balanced if
all observable confounders have an absolute SMD of less than 0.25 [2, 58, 90]. To verify this, we use
an R programming package called tableone⁸ to calculate the SMDs for all observable covariates.
The results confirm that all covariates are within the acceptable range of less than 0.25. Thus, our
matching method passes the test.
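For illustration only, the SMD of a single covariate can be computed with a common pooled-standard-deviation form along these lines (a sketch, not the tableone implementation itself):

import numpy as np

def standardized_mean_difference(target: np.ndarray, anchor: np.ndarray) -> float:
    """SMD of one observed covariate between the matched target and anchor groups."""
    pooled_sd = np.sqrt((target.var(ddof=1) + anchor.var(ddof=1)) / 2.0)
    return (target.mean() - anchor.mean()) / pooled_sd

# Balance criterion used above: abs(standardized_mean_difference(t, a)) < 0.25
# for every observable covariate after matching.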
(4) Difference-in-Differences (DiD). After making all these preparations, we are now ready to
conduct our DiD analysis. Put succinctly, DiD aims to assess and estimate the potential impact
of the intervention (i.e., WB). It does so by examining the average behavioral changes of the
target group (after exposure to the intervention) and comparing them to those of the anchor group
[8]. The implicit presumption is that this difference would be zero if the intervention had not been
carried out.
We accordingly formulate our work with simple DiD models in their canonical form (similar
to [6, 25, 35]). The reason for choosing the canonical form is that we always compare only two
time periods (i.e., either (𝑠 = 0, 𝑠 = 1) or (𝑠 = 0, 𝑠 = 2)) and two user groups (i.e., the target and
anchor groups) [35]. Furthermore, all comparisons made within each WB year are essentially
independent of the other WB years.⁹ More specifically, we perform our DiD analyses by fitting the
following linear regression model to our data:
    Y_{ugs} = \theta_0 + \theta_1 \alpha_g + \theta_2 \beta_s + \delta D_{gs} + \epsilon_{ugs}    (1)

where Y_{ugs} refers to the dependent variable for a user u in her/his associated group g ∈ {target =
1, anchor = 0} at time s. The dependent variables we study here as Y_{ugs} include the quantity/quality
of users’ questions/answers and the churn rate. Also, α_g and β_s are binary variables that capture
the fixed effects of group and time, respectively, and the θ coefficients refer to the parameters being
fit by the model. D_{gs} is a dummy variable that represents the intervention status (i.e., D_{gs} = α_g · β_s),
and ε_{ugs} denotes the associated error term. The coefficient δ represents the quantified estimate
of the impact of the intervention (i.e., the arrival or termination of WB) on the dependent variable Y.
More precisely, if δ > 0, then the influence of WB is potentially positive; if δ < 0, the influence is
potentially negative; and if δ = 0, then WB is not influential at all. Although there are many ways
to train the regression models for DiD, we use the Scikit-learn machine learning library in Python
for this purpose [19].
⁸ https://cran.r-project.org/web/packages/tableone/vignettes/introduction.html
⁹ It should be mentioned here that building a single (universal) model for all three WB events is not feasible. The reason is
the different duration of WB in each year.
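As a minimal sketch of how Equation (1) can be fit for one WB year, one user type, and one engagement measure (the DataFrame layout and column names are our assumptions, not the authors' exact code):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def estimate_delta(df: pd.DataFrame) -> float:
    """Canonical two-group, two-period DiD. df is assumed to hold one row
    per user-period with columns: y (engagement measure), group (1 = target,
    0 = anchor), and period (1 = during/after WB, 0 = before WB)."""
    X = df[["group", "period"]].to_numpy(dtype=float)
    # D_gs = alpha_g * beta_s: 1 only for target users in the exposed period.
    interaction = (X[:, 0] * X[:, 1]).reshape(-1, 1)
    X = np.hstack([X, interaction])
    model = LinearRegression().fit(X, df["y"].to_numpy())
    return model.coef_[2]  # delta, the estimated impact of WB

A positive return value corresponds to δ > 0 (a potentially positive influence of WB); statistical significance would be assessed separately, e.g., with standard errors from an econometrics package.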


5.3.2 Results. The summary of this DiD analysis is compiled in Table 5. Before getting into the
details, a few points are worth mentioning. First, the impact values we report here are the mean
δ values from our regression models constructed for the WB years 2018 to 2020. Second, for each
impact size, we report the largest (i.e., weakest) p-value across all the corresponding regression
models. As explained in RQ1 and RQ2, one of the main advantages of such global p-values is that
they increase the reliability of the results, in the sense that we only report the weakest significance
for a potential impact size. With these points in mind, we can now elaborate on our results.
• Novices. We begin with an interesting observation related to the engagement of novice users. As
can be recalled from RQ2, when reporting some engagement disruptions for novice users during and
after WB, we pointed out that those results were not unexpected, since the majority of novice
users were Non-participating users. However, does the state of participation in WB really matter?
Surprisingly, the answer is negative. The new findings from RQ3 show that WB does not create
any discernible advantages (or disadvantages) for WB-participating users over Non-participating
users (neither during nor after the bash). To corroborate (triangulate) this important finding, we
conducted a series of complementary non-parametric MWU tests with novice users (a minimal
sketch of such a test follows this list). According to our findings, their engagement measures (of
any type: quantity, quality, and churn) were not significantly different between WB-participating
and Non-participating users, neither during WB nor afterward.
• Low reputation users. We find that participating users with low reputations tend to ask
significantly more questions during the bash (see Table 5). A similar rising trend in question asking
was also observed in RQ2, albeit without statistical significance. This is an interesting observation,
since the majority of users with low reputations are Non-participating users. We can deduce
that most Non-participating users also tend to ask more questions during the bash; however, this
proclivity is not as intense as for WB-participating users, since otherwise the influence of WB on
the participating users could not have been significantly positive. Another notable finding, seemingly
at odds with our general results in RQ2, is the decreasing trend in the churn rates of WB-participating
users during WB, which also turns out to be statistically significant. The explanation we can offer
for this discrepancy is that WB-participating users are a minority among low reputation users;
thus, their lower churn rate alone does not change much in the overall behavior of all low
reputation users.
• Established users. Unlike their low-reputation counterparts, participating established users are
more likely to engage in answering questions during the bash. The results in Table 5 show
that this engagement is also statistically significant, which agrees well with our previous
findings from RQ2 (see Table 3). To reiterate, the majority (88%) of established users are WB-
participating users. In addition, we also find that the quality of the answers participating established
users provide after the bash ends tends to decline significantly (see Table 5). This is again consistent
with our prior findings from RQ2 (refer to Table 3).
• Trusted users. Similar to established users, participating trusted users appear to be more
involved in answering questions than asking them. The results in Table 5 show that both the
quantity and quality of their answers tend to increase significantly during the bash, which augurs
very well for Stack Overflow. Furthermore, these results match our previous results from RQ2
(see Table 3). However, considering that the majority of trusted users (71%) participate in WB,
this congruency should not be particularly surprising. Another significant influence, not confirmed
before RQ3, is the significant decline in the quantity of participating users’ answers after the bash
ends. This phenomenon could also be partly attributed to the overjustification effect.
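As referenced in the discussion of novice users above, such a complementary MWU test could look as follows; this is a self-contained sketch with synthetic data standing in for the real engagement measures of the matched novice users:

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical engagement measures (e.g., answers per day during WB);
# in the actual analysis these come from the matched novice users.
participating = rng.poisson(lam=1.2, size=200)
non_participating = rng.poisson(lam=1.2, size=200)

stat, p_value = mannwhitneyu(participating, non_participating, alternative="two-sided")
# A large p-value mirrors the finding above: no significant difference
# between participating and Non-participating novices.
print(f"U = {stat:.1f}, p = {p_value:.3f}")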


Table 6. The comparisons between the potential impacts of the bash on the behavioral changes of participating
users with different types

                              Novice          Low Reputation    Established      Trusted
Reference       Measures      Posts   Churn   Posts   Churn     Posts   Churn    Posts   Churn
DiD Analysis    Magnitude     N/A     N/A     0.21    0.04      0.23    N/A      0.60    N/A
                Frequency     N/A     N/A     1       1         2       N/A      3       N/A

5.3.3 Which participating users are potentially influenced the most? The remainder of this subsection
differs from Subsection 5.2.3 in that we now compare only the behavior of participating users. It is
interesting to determine whether the new ordering of users in RQ3 differs from what was observed
in RQ2. A bird’s-eye view of our comparisons can be found in Table 6. As can be seen, established
users are no longer at the top of the list. Instead, participating trusted users are the most influence-
prone user type, and participating novices are the least influence-prone. When we limit our focus
to the influence of WB on users’ posting behavior, established users and low reputation users are
the second and third most influence-prone user types after trusted users. Furthermore, low
reputation users are the only user type for which we observe a significant influence on their
churning behavior. It is for this reason that the two orderings are somewhat different.

6 DISCUSSION
The discussion section is divided into three parts. First, we summarize the main results of our work
in the form of “lessons learned”. Second, we present some design considerations for gamification
designers and practitioners to help improve their gamification designs. Finally, we explain the
limitations of our research and point out some directions for future work.

6.1 Lessons Learned


According to our research findings, the following main lessons are learned:
• According to RQ1, WB cannot (completely) halt the reduction in user engagement during the
  holiday season. Furthermore, some user activities, such as asking and answering questions,
  tend to decline further after the temporary gamification scheme is lifted. Thus, we infer that
  adding additional gamification schemes to an already gamified platform does not automatically
  lead to more positive (favorable) user engagement.
• According to RQ2, an increase in the quantity of posts is sometimes accompanied by a
  decrease in the quality of posts by certain types of users (e.g., established users). Hence, we
  infer that evaluating the efficacy of gamification through the “quantity” and “quality” of user
  engagement together is more valuable; previous studies have often failed to do this and
  focused on the impact of gamification on the quantity of user engagement alone.
• In analyzing the impact of promotional or other types of time-limited gamification schemes,
  it is important not to be blindsided by what transpires while these schemes are running; it is
  also necessary to analyze what happens after a scheme ends. As noted in this study, removing
  gamification mechanisms can sometimes reduce user engagement.
• According to RQ2 and RQ3, not all users respond to promotional gamification schemes
the same way. Gamification designers should take this fact into consideration and plan
accordingly. For example, in the case of Stack Overflow, we found that participating novice
users were almost indifferent to the WB, while participating trusted users were likely to be
influenced the most. Thus, we conclude that a more calibrated user understanding and
personalization of gamification schemes could be beneficial in optimizing the user experience.
• Promotional gamification schemes may not always be the best choice for instilling user
  motivation for voluntary online activities. In this study, we observed that trusted users did
  not maintain their pace in answering questions after WB; the decline was noticeable even
  compared to the period before WB.

6.2 Design Considerations


This section presents some design considerations to help gamification designers and practitioners
improve their gamification designs.
The first consideration for improving promotional gamification schemes is that they should
not appear frivolous to their target users. A recent qualitative study by Mogavi et al. shows that
overindulgence in playfulness is one of the main reasons users abuse gamification and make
low-quality contributions to the tasks they are asked to complete [37]. According to this study,
gamification designers and practitioners should spend time explaining the purpose of their
gamification schemes to their users and even try to distinguish between their gamified design and
the usual non-serious games.

This assumes significance because some designers may be tempted to not take their promotional
gamification schemes as seriously as their default ones. Using the design to reflect the importance
of user contributions to the community, service, or website can make gamification designs
significantly more meaningful [72]. In the case of WB, a good starting point for designers would be
to contemplate what would have happened if users had been unwilling to contribute during the
holiday season. This value can be demonstrated with better gamification mechanics. A viable option
is an explicit and persistent measure (or badge) that shows the importance of a user’s contribution
in keeping the site up and running (during the holiday season or a pandemic). According to the
literature, such a value-based reward policy is more likely to reduce users’ desire to consider
malicious use of the platform [37, 110].
We also notice that established users are the largest user group for whom, despite their higher
activity rate, the quality of responses drops significantly during and after WB. This suggests
that a quality assurance mechanism should be developed to prevent quality degradation. In fact,
in contrast to the prevailing trend in business-only platforms, we suggest that knowledge-sharing
platforms such as CQAs, MOOCs, and other intelligent tutoring systems should adopt more
stringent measures to safeguard the quality of user contributions when gamifying their systems
[38]. This can be achieved by using computational models such as anomaly prediction methods to
anticipate the quality of user behavior. Designers could also benefit from using gamification reward
triggers that prompt users to satisfy certain quality criteria. For example, users can be required
to ask or answer questions with specific positive scores to qualify for quality-based gamification
rewards (a hypothetical sketch of such a trigger follows below). However, future work must further
explore such adjustments and the trade-offs involved in balancing the quantitative and qualitative
aspects of reward assignment.
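As a purely hypothetical sketch of such a quality-gated trigger (the names and thresholds below are illustrative and not an existing Stack Overflow mechanism):

from dataclasses import dataclass

@dataclass
class Contribution:
    kind: str   # "question" or "answer"
    score: int  # community vote score of the post

def earns_quality_reward(contribs: list, min_posts: int = 3, min_score: int = 2) -> bool:
    """Grant a promotional reward only when enough contributions clear a
    positive-score bar, tying the reward to quality rather than volume."""
    qualifying = [c for c in contribs if c.score >= min_score]
    return len(qualifying) >= min_posts

The thresholds themselves would need tuning per community, which is exactly the kind of quantity-quality trade-off noted above.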
Finally, gamification designers should have a concrete plan in case they decide to phase out
their promotional gamification schemes in the future [84]. One of the potential benefits would be
to avoid or mitigate the risk of mishaps such as the overjustification effect. For example, in the
case of WB, gamification developers could phase out WB more cautiously and gradually by slowly
reducing the number of promotional rewards over time and then replacing them with suitable
substitute rewards from the default gamification scheme, as sketched below.
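Such a wind-down could take the shape of a simple linear taper; the following sketch is purely illustrative (WB itself currently ends all at once, and the reward counts and window length here are made up):

def promotional_rewards_available(day: int, wind_down_days: int, initial_rewards: int) -> int:
    """Linearly taper the number of promotional rewards offered per day
    over a wind-down window instead of removing them abruptly."""
    if day >= wind_down_days:
        return 0  # fully phased out; only the default gamification scheme remains
    remaining_fraction = 1.0 - day / wind_down_days
    return round(initial_rewards * remaining_fraction)

# Example: with a 14-day wind-down and 10 promotional rewards per day,
# day 0 -> 10, day 7 -> 5, day 13 -> 1, day 14 -> 0.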


6.3 Limitations and Future Work


As one of the first research studies on promotional gamification schemes, our work opens up
many opportunities for interested CSCW and HCI researchers. Since our results are purely
empirical, they call for follow-up studies on other community platforms, such as Reddit, Wikipedia,
and Quora, to cross-validate the generalizability of our findings. As of February 2021, over 170
CQAs use a promotional gamification scheme similar to that of Stack Overflow to increase user
engagement. Given this opportunity, researchers would also do well to examine the impact of WB
on user engagement through the lens of different community-level factors, e.g., community size,
community discussion topic, and even community age.
Another valuable line of research is to examine how fixed user groups evolve and respond to
WB over different and longer periods. This, of course, raises several other interesting questions:
what percentage of users abandon WB over the long term? Can we determine this before the next
WB fest? Is users’ churn behavior influenced by previous WB experience? If yes, to what extent?
We also opine that there is a need for research similar to that of [5, 54] to scrutinize the specific
effects of the particular gamification rewards employed during the WB. For instance, it would be
interesting to understand how users prioritize their participation strategies when they want to
earn the default gamification rewards as opposed to when they want to get the promotional ones.
However, such analysis necessitates more complex inferential frameworks.

7 CONCLUSION
This paper presents an exploratory study of the promotional gamification scheme of WB in the
CQA of Stack Overflow. By analyzing the advantages and disadvantages of user engagement in
this gamification event (using quantity, quality, and churn measures), we contribute to the ongoing
debate about the efficacy of these promotional gamification schemes. Despite the temporary positive
impact of WB on some aspects of user engagement during the holiday season, we find that this
type of gamification fails to accomplish the desired objectives in three important aspects: 1) it
is not as inclusive as previously thought, and the current design of WB only appeals to certain
user groups; 2) the scheme does not ensure the quality of user contributions during and after
WB; and finally, 3) user participation in the gamified WB event does not necessarily guarantee
successful user engagement. Moreover, we observe signs of the overjustification phenomenon at
scale. Against this backdrop, our work provides gamification
designers and practitioners with some insights and design considerations to help them improve
their future gamification designs.

ACKNOWLEDGMENTS
The authors would like to thank Mr. Rahman Hadi Mogavi for proofreading and editing this paper.
This research has been supported in part by project 16214817 from the Research Grants Council of
Hong Kong, and the 5GEAR and FIT projects from the Academy of Finland. This work was also
supported in part by HKUST-WeBank Joint Laboratory Project Grant No.: WEB19EG01-d.

REFERENCES
[1] Alberto Abadie. 2010. Difference-in-Difference Estimators. In Microeconometrics. Palgrave Macmillan UK, 36–39.
https://doi.org/10.1057/9780230280816_6
[2] Tim Althoff, Pranav Jindal, and Jure Leskovec. 2017. Online Actions with Offline Impact: How Online Social Networks
Influence Online and Offline User Behavior. In Proceedings of the Tenth ACM International Conference on Web Search
and Data Mining (Cambridge, United Kingdom) (WSDM ’17). Association for Computing Machinery, New York, NY,
USA, 537–546. https://doi.org/10.1145/3018661.3018672


[3] Maximilian Altmeyer, Pascal Lessel, Marc Schubhan, Vladislav Hnatovskiy, and Antonio Krüger. 2019. Germ Destroyer
- A Gamified System to Increase the Hand Washing Duration in Shared Bathrooms. In Proceedings of the Annual
Symposium on Computer-Human Interaction in Play (Barcelona, Spain) (CHI PLAY ’19). Association for Computing
Machinery, New York, NY, USA, 509–519. https://doi.org/10.1145/3311350.3347157
[4] Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2012. Discovering Value from Community
Activity on Focused Question Answering Sites: A Case Study of Stack Overflow. In Proceedings of the 18th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (Beijing, China) (KDD ’12). Association for
Computing Machinery, New York, NY, USA, 850–858. https://doi.org/10.1145/2339530.2339665
[5] Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2013. Steering User Behavior with Badges.
In Proceedings of the 22nd International Conference on World Wide Web (Rio de Janeiro, Brazil) (WWW ’13). Association
for Computing Machinery, New York, NY, USA, 95–106. https://doi.org/10.1145/2488388.2488398
[6] Ashton Anderson, Lucas Maystre, Ian Anderson, Rishabh Mehrotra, and Mounia Lalmas. 2020. Algorithmic Effects
on the Diversity of Consumption on Spotify. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW ’20).
Association for Computing Machinery, New York, NY, USA, 2155–2165. https://doi.org/10.1145/3366423.3380281
[7] Fernando R. H. Andrade, Riichiro Mizoguchi, and Seiji Isotani. 2016. The Bright and Dark Sides of Gamification. In
Intelligent Tutoring Systems. Springer International Publishing, 176–186. https://doi.org/10.1007/978-3-319-39583-8_17
[8] Joshua D Angrist and Jörn-Steffen Pischke. 2008. Mostly harmless econometrics. Princeton university press.
[9] Peter Babinec and Ivan Srba. 2017. Education-Specific Tag Recommendation in CQA Systems. In Adjunct Publication
of the 25th Conference on User Modeling, Adaptation and Personalization (Bratislava, Slovakia) (UMAP ’17). Association
for Computing Machinery, New York, NY, USA, 281–286. https://doi.org/10.1145/3099023.3099081
[10] Timur Bachschi, Aniko Hannak, Florian Lemmerich, and Johannes Wachs. 2020. From Asking to Answering: Getting
More Involved on Stack Overflow. arXiv preprint arXiv:2010.04025 (2020).
[11] Ryan Shaun Baker, Albert T. Corbett, Kenneth R. Koedinger, and Angela Z. Wagner. 2004. Off-Task Behavior in the
Cognitive Tutor Classroom: When Students "Game the System". In Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems (Vienna, Austria) (CHI ’04). Association for Computing Machinery, New York, NY, USA,
383–390. https://doi.org/10.1145/985692.985741
[12] Richard A. Berk. 1983. An Introduction to Sample Selection Bias in Sociological Data. American Sociological Review
48, 3 (June 1983), 386. https://doi.org/10.2307/2095230
[13] Max V. Birk and Regan L. Mandryk. 2018. Combating Attrition in Digital Self-Improvement Programs Using Avatar
Customization. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC,
Canada) (CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3173574.
3174234
[14] Benny Bornfeld and Sheizaf Rafaeli. 2017. Gamifying with badges: A big data natural experiment on stack exchange.
First Monday (2017).
[15] Grégoire Burel. 2016. Community and thread methods for identifying best answers in online question answering
communities. Ph.D. Dissertation. The Open University.
[16] Fabio Calefato, Filippo Lanubile, and Nicole Novielli. 2018. How to ask for technical help? Evidence-based guidelines
for writing questions on Stack Overflow. Information and Software Technology 94 (Feb. 2018), 186–207. https:
//doi.org/10.1016/j.infsof.2017.10.009
[17] Huseyin Cavusoglu, Zhuolun Li, and Ke-Wei Huang. 2015. Can Gamification Motivate Voluntary Contributions?
The Case of StackOverflow Q&A Community. In Proceedings of the 18th ACM Conference Companion on Computer
Supported Cooperative Work & Social Computing (Vancouver, BC, Canada) (CSCW’15 Companion). Association for
Computing Machinery, New York, NY, USA, 171–174. https://doi.org/10.1145/2685553.2698999
[18] Edna Chan, Fiona Fui-Hoon Nah, Qizhang Liu, and Zhiwei Lu. 2018. Effect of Gamification on Intrinsic Motivation.
In HCI in Business, Government, and Organizations. Springer International Publishing, 445–454. https://doi.org/10.
1007/978-3-319-91716-0_35
[19] David Cournapeau. 2021. Scikit-learn Python Library. Retrieved February 28, 2021 from https://scikit-learn.org/
[20] Martin V. Covington and Kimberly J. Müeller. 2001. Intrinsic Versus Extrinsic Motivation: An Approach/Avoidance
Reformulation. Educational Psychology Review 13, 2 (2001), 157–176. https://doi.org/10.1023/a:1009009219144
[21] Ryan S. J. d. Baker, Albert T. Corbett, Ido Roll, and Kenneth R. Koedinger. 2008. Developing a generalizable detector
of when students game the system. User Modeling and User-Adapted Interaction 18, 3 (Jan. 2008), 287–314. https:
//doi.org/10.1007/s11257-007-9045-6
[22] Paul Denny, Fiona McDonald, Ruth Empson, Philip Kelly, and Andrew Petersen. 2018. Empirical Support for a Causal
Relationship Between Gamification and Learning Outcomes. In Proceedings of the 2018 CHI Conference on Human
Factors in Computing Systems (Montreal QC, Canada) (CHI ’18). Association for Computing Machinery, New York,
NY, USA, 1–13. https://doi.org/10.1145/3173574.3173885


[23] Sebastian Deterding, Dan Dixon, Rilla Khaled, and Lennart Nacke. 2011. From Game Design Elements to Gamefulness:
Defining "Gamification". In Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future
Media Environments (Tampere, Finland) (MindTrek ’11). Association for Computing Machinery, New York, NY, USA,
9–15. https://doi.org/10.1145/2181037.2181040
[24] Sarah Diefenbach and Annemarie Müssig. 2019. Counterproductive effects of gamification: An analysis on the
example of the gamified task manager Habitica. International Journal of Human-Computer Studies 127 (July 2019),
190–210. https://doi.org/10.1016/j.ijhcs.2018.09.004
[25] Stephen G Donald and Kevin Lang. 2007. Inference with Difference-in-Differences and Other Panel Data. Review of
Economics and Statistics 89, 2 (May 2007), 221–233. https://doi.org/10.1162/rest.89.2.221
[26] Gideon Dror, Dan Pelleg, Oleg Rokhlenko, and Idan Szpektor. 2012. Churn Prediction in New Users of Yahoo!
Answers. In Proceedings of the 21st International Conference on World Wide Web (Lyon, France) (WWW ’12 Companion).
Association for Computing Machinery, New York, NY, USA, 829–834. https://doi.org/10.1145/2187980.2188207
[27] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2017. Periodicity in User Engagement with a Search Engine
and Its Application to Online Controlled Experiments. ACM Trans. Web 11, 2, Article 9 (April 2017), 35 pages.
https://doi.org/10.1145/2856822
[28] David Easley and Arpita Ghosh. 2016. Incentives, Gamification, and Game Theory: An Economic Approach to Badge
Design. ACM Trans. Econ. Comput. 4, 3, Article 16 (June 2016), 26 pages. https://doi.org/10.1145/2910575
[29] Joseph Fanfarelli, Stephanie Vie, and Rudy McDaniel. 2015. Understanding Digital Badges through Feedback, Reward,
and Narrative: A Multidisciplinary Approach to Building Better Badges in Social Environments. Commun. Des. Q. Rev
3, 3 (June 2015), 56–60. https://doi.org/10.1145/2792989.2792998
[30] Oluwaseyi Feyisetan, Elena Simperl, Max Van Kleek, and Nigel Shadbolt. 2015. Improving Paid Microtasks through
Gamification and Adaptive Furtherance Incentives. In Proceedings of the 24th International Conference on World Wide
Web (Florence, Italy) (WWW ’15). International World Wide Web Conferences Steering Committee, Republic and
Canton of Geneva, CHE, 333–343. https://doi.org/10.1145/2736277.2741639
[31] Simone Filice, Nachshon Cohen, and David Carmel. 2020. Voice-Based Reformulation of Community Answers. In
Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New
York, NY, USA, 2885–2891. https://doi.org/10.1145/3366423.3380053
[32] Adabriand Furtado, Nazareno Andrade, Nigini Oliveira, and Francisco Brasileiro. 2013. Contributor Profiles, Their
Dynamics, and Their Importance in Five Q&a Sites. In Proceedings of the 2013 Conference on Computer Supported
Cooperative Work (San Antonio, Texas, USA) (CSCW ’13). Association for Computing Machinery, New York, NY, USA,
1237–1252. https://doi.org/10.1145/2441776.2441916
[33] GDPR. 2018. The European Union’s General Data Protection Regulation. Retrieved April 1, 2020 from https://gdpr-
info.eu/
[34] Paulo B. Goes, Chenhui Guo, and Mingfeng Lin. 2016. Do Incentive Hierarchies Induce User Effort? Evidence from
an Online Knowledge Exchange. Information Systems Research 27, 3 (Sept. 2016), 497–516. https://doi.org/10.1287/
isre.2016.0635
[35] Andrew Goodman-Bacon. 2021. Difference-in-differences with variation in treatment timing. Journal of Econometrics
225, 2 (Dec. 2021), 254–277. https://doi.org/10.1016/j.jeconom.2021.03.014
[36] Paul Grau, Babak Naderi, and Juho Kim. 2018. Personalized Motivation-Supportive Messages for Increasing Par-
ticipation in Crowd-Civic Systems. Proc. ACM Hum.-Comput. Interact. 2, CSCW, Article 60 (Nov. 2018), 22 pages.
https://doi.org/10.1145/3274329
[37] Reza Hadi Mogavi, Bingcan Guo, Yuanhao Zhang, Ehsan-Ul Haq, Pan Hui, and Xiaojuan Ma. 2022. When Gamification
Spoils Your Learning: A Qualitative Case Study of Gamification Misuse in a Language-Learning App. (2022), 175–188.
https://doi.org/10.1145/3491140.3528274
[38] Reza Hadi Mogavi, Xiaojuan Ma, and Pan Hui. 2021. Characterizing Student Engagement Moods for Dropout
Prediction in Question Pool Websites. Proc. ACM Hum.-Comput. Interact. 5, CSCW1, Article 12 (April 2021), 22 pages.
https://doi.org/10.1145/3449086
[39] Reza Hadi Mogavi, Yuanhao Zhang, Ehsan-Ul Haq, Yongjin Wu, Xiaojuan Ma, and Pan Hui. 2022. What Do Users
Think of Promotional Gamification Schemes? A Qualitative Case Study in a Question Answering Website. Proc. ACM
Hum.-Comput. Interact. 6, CSCW2, Article 399 (Nov. 2022). https://doi.org/10.1145/3555124
[40] Reza Hadi Mogavi, Yankun Zhao, Ehsan Ul Haq, Pan Hui, and Xiaojuan Ma. 2021. Student Barriers to Active Learning
in Synchronous Online Classes: Characterization, Reflections, and Suggestions. In Proceedings of the Eighth ACM
Conference on Learning @ Scale (Virtual Event, Germany) (L@S ’21). Association for Computing Machinery, New
York, NY, USA, 101–115. https://doi.org/10.1145/3430895.3460126
[41] Stuart Hallifax, Audrey Serna, Jean-Charles Marty, Guillaume Lavoué, and Elise Lavoué. 2019. Factors to Consider for
Tailored Gamification. In Proceedings of the Annual Symposium on Computer-Human Interaction in Play (Barcelona,
Spain) (CHI PLAY ’19). Association for Computing Machinery, New York, NY, USA, 559–572. https://doi.org/10.1145/3311350.3347167
[42] Juho Hamari. 2013. Transforming homo economicus into homo ludens: A field experiment on gamification in a
utilitarian peer-to-peer trading service. 12, 4 (July 2013), 236–245. https://doi.org/10.1016/j.elerap.2013.01.004
[43] Juho Hamari, Jonna Koivisto, and Harri Sarsa. 2014. Does Gamification Work? – A Literature Review of Empirical
Studies on Gamification. In 2014 47th Hawaii International Conference on System Sciences. IEEE. https://doi.org/10.
1109/hicss.2014.377
[44] Lobna Hassan. 2016. Governments Should Play Games: Towards a Framework for the Gamification of Civic Engage-
ment Platforms. Simulation & Gaming 48, 2 (Dec. 2016), 249–267. https://doi.org/10.1177/1046878116683581
[45] Lobna Hassan and Juho Hamari. 2020. Gameful civic engagement: A review of the literature on gamification of
e-participation. Government Information Quarterly 37, 3 (July 2020), 101461. https://doi.org/10.1016/j.giq.2020.101461
[46] Jennifer L. Hicks, Tim Althoff, Rok Sosic, Peter Kuhar, Bojan Bostjancic, Abby C. King, Jure Leskovec, and Scott L.
Delp. 2019. Best practices for analyzing large-scale health data from wearables and smartphone apps. npj Digital
Medicine 2, 1 (June 2019). https://doi.org/10.1038/s41746-019-0121-1
[47] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics (1979),
65–70.
[48] Stefano M Iacus, Gary King, and Giuseppe Porro. 2012. Causal inference without balance checking: Coarsened exact
matching. Political analysis 20, 1 (2012), 1–24.
[49] Oluwabukola Mayowa (Ishola) Idowu and Gordon McCalla. 2018. Better Late Than Never but Never Late Is Better:
Towards Reducing the Answer Response Time to Questions in an Online Learning Community. In Lecture Notes in
Computer Science. Springer International Publishing, 184–197. https://doi.org/10.1007/978-3-319-93843-1_14
[50] Andreas M. Kaplan and Michael Haenlein. 2010. Users of the world, unite! The challenges and opportunities of Social
Media. Business Horizons 53, 1 (Jan. 2010), 59–68. https://doi.org/10.1016/j.bushor.2009.09.003
[51] Gary King, Richard Nielsen, Carter Coberley, James E Pope, and Aaron Wells. 2011. Comparative effectiveness of
matching methods for causal inference. Unpublished manuscript, Institute for Quantitative Social Science, Harvard
University, Cambridge, MA (2011).
[52] Jonna Koivisto, Aqdas Malik, Bahadir Gurkan, and Juho Hamari. 2019. Getting Healthy by Catching Them All: A
Study on the Relationship Between Player Orientations and Perceived Health Benefits in an Augmented Reality Game.
In Proceedings of the 52nd Hawaii International Conference on System Sciences. Hawaii International Conference on
System Sciences. https://doi.org/10.24251/hicss.2019.216
[53] Haridimos Kondylakis, Anca Bucur, Chiara Crico, Feng Dong, Norbert Graf, Stefan Hoffman, Lefteris Koumakis, Alice
Manenti, Kostas Marias, Ketti Mazzocco, Gabriella Pravettoni, Chiara Renzi, Fatima Schera, Stefano Triberti, Manolis
Tsiknakis, and Stephan Kiefer. 2020. Patient empowerment for cancer patients through a novel ICT infrastructure.
Journal of Biomedical Informatics 101 (Jan. 2020), 103342. https://doi.org/10.1016/j.jbi.2019.103342
[54] Tomasz Kusmierczyk and Manuel Gomez-Rodriguez. 2018. On the Causal Effect of Badges. In Proceedings of the
2018 World Wide Web Conference (Lyon, France) (WWW ’18). International World Wide Web Conferences Steering
Committee, Republic and Canton of Geneva, CHE, 659–668. https://doi.org/10.1145/3178876.3186147
[55] Nicholas Lange, Jay N. Giedd, F. Xavier Castellanos, A.Catherine Vaituzis, and Judith L. Rapoport. 1997. Variability of
human brain structure size: ages 4–20 years. Psychiatry Research: Neuroimaging 74, 1 (March 1997), 1–12. https:
//doi.org/10.1016/s0925-4927(96)03054-5
[56] Nikoletta-Zampeta Legaki, Nannan Xi, Juho Hamari, Kostas Karpouzis, and Vassilios Assimakopoulos. 2020. The
effect of challenge-based gamification on learning: An experiment in the context of statistics education. International
Journal of Human-Computer Studies 144 (Dec. 2020), 102496. https://doi.org/10.1016/j.ijhcs.2020.102496
[57] Pascal Lessel, Maximilian Altmeyer, Lea Verena Schmeer, and Antonio Krüger. 2019. "Enable or Disable Gamification?":
Analyzing the Impact of Choice in a Gamified Image Tagging Task. In Proceedings of the 2019 CHI Conference on
Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New
York, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300380
[58] Ang Li, Alice Wang, Zahra Nazari, Praveen Chandar, and Benjamin Carterette. 2020. Do Podcasts and Music
Compete with One Another? Understanding Users’ Audio Streaming Habits. In Proceedings of The Web Conference
2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 1920–1931. https:
//doi.org/10.1145/3366423.3380260
[59] Guo Li, Haiyi Zhu, Tun Lu, Xianghua Ding, and Ning Gu. 2015. Is It Good to Be Like Wikipedia? Exploring the
Trade-Offs of Introducing Collaborative Editing Model to Q&A Sites. In Proceedings of the 18th ACM Conference on
Computer Supported Cooperative Work & Social Computing (Vancouver, BC, Canada) (CSCW ’15). Association for
Computing Machinery, New York, NY, USA, 1080–1091. https://doi.org/10.1145/2675133.2675155
[60] Andreas Lieberoth. 2015. Shallow Gamification: Testing Psychological Effects of Framing an Activity as a Game.
Games and Culture 10, 3 (2015), 229–248. https://doi.org/10.1177/1555412014559978


[61] Wern Han Lim, Mark James Carman, and Sze-Meng Jojo Wong. 2017. Estimating Relative User Expertise for
Content Quality Prediction on Reddit. In Proceedings of the 28th ACM Conference on Hypertext and Social Media
(Prague, Czech Republic) (HT ’17). Association for Computing Machinery, New York, NY, USA, 55–64. https:
//doi.org/10.1145/3078714.3078720
[62] Yuli Liu, Yiqun Liu, Ke Zhou, Min Zhang, and Shaoping Ma. 2017. Detecting Collusive Spamming Activities in
Community Question Answering. In Proceedings of the 26th International Conference on World Wide Web (Perth,
Australia) (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of
Geneva, CHE, 1073–1082. https://doi.org/10.1145/3038912.3052594
[63] Zhenguang Liu, Yingjie Xia, Qi Liu, Qinming He, Chao Zhang, and Roger Zimmermann. 2018. Toward Personalized
Activity Level Prediction in Community Question Answering Websites. ACM Trans. Multimedia Comput. Commun.
Appl. 14, 2s, Article 41 (April 2018), 15 pages. https://doi.org/10.1145/3187011
[64] Szymon Tomasz Machajewski. 2017. Application of Gamification in a College STEM Introductory Course: A Case Study.
Northcentral University.
[65] Andrew Marder. 2015. Stack Overflow Badges and User Behavior: An Econometric Approach. In Proceedings of the
12th Working Conference on Mining Software Repositories (Florence, Italy) (MSR ’15). IEEE Press, 450–453.
[66] Elisa D. Mekler, Florian Brühlmann, Alexandre N. Tuch, and Klaus Opwis. 2017. Towards understanding the effects of
individual gamification elements on intrinsic motivation and performance. Computers in Human Behavior 71 (June
2017), 525–534. https://doi.org/10.1016/j.chb.2015.08.048
[67] Reza Hadi Mogavi, Sujit Gujar, Xiaojuan Ma, and Pan Hui. 2019. HRCR: Hidden Markov-Based Reinforcement to
Reduce Churn in Question Answering Forums. In PRICAI 2019: Trends in Artificial Intelligence. Springer International
Publishing, 364–376. https://doi.org/10.1007/978-3-030-29908-8_29
[68] Lukas Moldon. 2020. Sending signals in open source: Evidence from a natural experiment. Ph.D. Dissertation.
RWTH Aachen University.
[69] Alberto Mora, Gustavo F. Tondello, Lennart E. Nacke, and Joan Arnedo-Moreno. 2018. Effect of personalized
gameful design on student engagement. In 2018 IEEE Global Engineering Education Conference (EDUCON). IEEE.
https://doi.org/10.1109/educon.2018.8363471
[70] Benedikt Morschheuser, Juho Hamari, Jonna Koivisto, and Alexander Maedche. 2017. Gamified crowdsourcing:
Conceptualization, literature review, and future agenda. International Journal of Human-Computer Studies 106 (Oct.
2017), 26–43. https://doi.org/10.1016/j.ijhcs.2017.04.005
[71] Benedikt Morschheuser, Juho Hamari, and Alexander Maedche. 2019. Cooperation or competition – When do people
contribute more? A field experiment on gamification of crowdsourcing. International Journal of Human-Computer
Studies 127 (July 2019), 7–24. https://doi.org/10.1016/j.ijhcs.2018.10.001
[72] Scott Nicholson. 2014. A RECIPE for Meaningful Gamification. In Gamification in Education and Business. Springer
International Publishing, 1–20. https://doi.org/10.1007/978-3-319-10208-5_1
[73] Nicholas O’Donnell. 2018. Plan Your Brisbane: Exploring Game Design for Civic Engagement. In Proceedings of the
2018 Annual Symposium on Computer-Human Interaction in Play Companion Extended Abstracts (Melbourne, VIC,
Australia) (CHI PLAY ’18 Extended Abstracts). Association for Computing Machinery, New York, NY, USA, 257–269.
https://doi.org/10.1145/3270316.3272051
[74] Association of Internet Researchers (AoIR). 2019. Internet Research: Ethical Guidelines 3.0. Retrieved April 1, 2020
from https://aoir.org/ethics/
[75] Stack Overflow. 2021. What is reputation? How do I earn (and lose) it? Retrieved October 5, 2021 from https:
//stackoverflow.com/help/whats-reputation
[76] Panda. 2021. Winter Bash Records. Retrieved March 28, 2021 from https://winterba.sh/
[77] Jagat Sastry Pudipeddi, Leman Akoglu, and Hanghang Tong. 2014. User Churn in Focused Question Answering
Sites: Characterizations and Prediction. In Proceedings of the 23rd International Conference on World Wide Web
(Seoul, Korea) (WWW ’14 Companion). Association for Computing Machinery, New York, NY, USA, 469–474. https:
//doi.org/10.1145/2567948.2576965
[78] Ashesh Rambachan and Jonathan Roth. 2019. An honest approach to parallel trends. Academic Report, Harvard
University.
[79] John E Ripollone, Krista F Huybrechts, Kenneth J Rothman, Ryan E Ferguson, and Jessica M Franklin. 2019. Evaluating
the Utility of Coarsened Exact Matching for Pharmacoepidemiology Using Real and Simulated Claims Data. 189, 6
(Dec. 2019), 613–622. https://doi.org/10.1093/aje/kwz268
[80] Andreas Rücklé, Krishnkant Swarnkar, and Iryna Gurevych. 2019. Improved Cross-Lingual Question Retrieval
for Community Question Answering. In The World Wide Web Conference (San Francisco, CA, USA) (WWW ’19).
Association for Computing Machinery, New York, NY, USA, 3179–3186. https://doi.org/10.1145/3308558.3313502
[81] Chrystal Rutledge, Catharine M. Walsh, Nathan Swinger, Marc Auerbach, Danny Castro, Maya Dewan, Mona Khattab,
Alyssa Rake, Ilana Harwayne-Gidansky, Tia T. Raymond, Tensing Maa, and Todd P. Chang. 2018. Gamification in
Action. Academic Medicine 93, 7 (July 2018), 1014–1020. https://doi.org/10.1097/acm.0000000000002183


[82] Herman Saksono, Ashwini Ranade, Geeta Kamarthi, Carmen Castaneda-Sceppa, Jessica A. Hoffman, Cathy Wirth,
and Andrea G. Parker. 2015. Spaceship Launch: Designing a Collaborative Exergame for Families. In Proceedings of the
18th ACM Conference on Computer Supported Cooperative Work & Social Computing (Vancouver, BC, Canada) (CSCW
’15). Association for Computing Machinery, New York, NY, USA, 1776–1787. https://doi.org/10.1145/2675133.2675159
[83] Manuel Schmidt-Kraepelin, Philipp A Toussaint, Scott Thiebes, Juho Hamari, and Ali Sunyaev. 2020. Archetypes of
Gamification: Analysis of mHealth Apps. JMIR mHealth and uHealth 8, 10 (Oct. 2020), e19280. https://doi.org/10.
2196/19280
[84] Katie Seaborn. 2021. Removing Gamification: A Research Agenda. In Extended Abstracts of the 2021 CHI Conference
on Human Factors in Computing Systems (Yokohama, Japan) (CHI EA ’21). Association for Computing Machinery,
New York, NY, USA, Article 343, 7 pages. https://doi.org/10.1145/3411763.3451695
[85] Katie Seaborn and Deborah I. Fels. 2015. Gamification in theory and action: A survey. International Journal of
Human-Computer Studies 74 (Feb. 2015), 14–31. https://doi.org/10.1016/j.ijhcs.2014.09.006
[86] Bin Shao and Jiafei Yan. 2017. Recommending Answerers for Stack Overflow with LDA Model. In Proceedings of the 12th
Chinese Conference on Computer Supported Cooperative Work and Social Computing (Chongqing, China) (ChineseCSCW
’17). Association for Computing Machinery, New York, NY, USA, 80–86. https://doi.org/10.1145/3127404.3127426
[87] Zachary J. Sheffler, De Liu, and Shawn P. Curley. 2020. Ingredients for successful badges: evidence from a field
experiment in bike commuting. European Journal of Information Systems 29, 6 (Aug. 2020), 688–703. https://doi.org/
10.1080/0960085x.2020.1808539
[88] Ivan Srba and Maria Bielikova. 2016. A Comprehensive Survey and Classification of Approaches for Community
Question Answering. ACM Trans. Web 10, 3, Article 18 (Aug. 2016), 63 pages. https://doi.org/10.1145/2934687
[89] Ivan Srba and Maria Bielikova. 2016. Why is Stack Overflow Failing? Preserving Sustainability in Community
Question Answering. IEEE Software 33, 4 (July 2016), 80–89. https://doi.org/10.1109/ms.2016.34
[90] Elizabeth A. Stuart. 2010. Matching Methods for Causal Inference: A Review and a Look Forward. Statist. Sci. 25, 1
(Feb. 2010). https://doi.org/10.1214/09-sts313
[91] Jiankai Sun, Sobhan Moosavi, Rajiv Ramnath, and Srinivasan Parthasarathy. 2018. QDEE: Question Difficulty and
Expertise Estimation in Community Question Answering Sites. Proceedings of the International AAAI Conference on
Web and Social Media 12, 1 (Jun. 2018). https://ojs.aaai.org/index.php/ICWSM/article/view/15015
[92] Nozomi Takeshima, Takashi Sozu, Aran Tajika, Yusuke Ogawa, Yu Hayasaka, and Toshiaki A Furukawa. 2014. Which
is more generalizable, powerful and interpretable in meta-analyses, mean difference or standardized mean difference?
BMC Medical Research Methodology 14, 1 (Feb. 2014). https://doi.org/10.1186/1471-2288-14-30
[93] Stack Overflow Talent. 2016. Winter Bash (WB) Explained: Winter Bash 2016. Retrieved November 13, 2021 from
https://www.youtube.com/watch?v=CDi_nj1-G6U
[94] Shu-Hua Tang and Vernon C. Hall. 1995. The overjustification effect: A meta-analysis. Applied Cognitive Psychology
9, 5 (Oct. 1995), 365–404. https://doi.org/10.1002/acp.2350090502
[95] Armando M. Toda, Pedro H. D. Valle, and Seiji Isotani. 2018. The Dark Side of Gamification: An Overview of Negative
Effects of Gamification in Education. In Communications in Computer and Information Science. Springer International
Publishing, 143–156. https://doi.org/10.1007/978-3-319-97934-2_9
[96] Gustavo Fortes Tondello and Lennart E Nacke. 2019. A Pilot Study of a Digital Skill Tree in Gameful Education.. In
GamiLearn.
[97] April Tyack and Elisa D. Mekler. 2020. Self-Determination Theory in HCI Games Research: Current Uses and Open
Questions. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA)
(CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–22. https://doi.org/10.1145/3313831.3376723
[98] Stack Overflow User. 2020. Why is Winter Bash 2020 participation so high? Retrieved September 16, 2021 from
https://meta.stackexchange.com/questions/358895/why-is-winter-bash-2020-participation-so-high
[99] Stack Overflow User. 2020. Winter Bash event demystified. Retrieved October 5, 2021 from https://meta.stackexchange.
com/questions/358420/winter-bash-event-demystified?noredirect=1&lq=1
[100] Stack Overflow User. 2021. Do we want hats? Retrieved April 4, 2021 from https://puzzling.meta.stackexchange.com/
questions/1479/do-we-want-hats-a-puzzling-situation
[101] Stack Overflow User. 2021. Does activity on stack exchange increase during the Winter Bash? Retrieved March 27,
2021 from https://meta.stackexchange.com/questions/271093/does-activity-on-stack-exchange-increase-during-the-
winter-bash
[102] Stack Overflow User. 2021. What should we consider for next year’s Winter Bash? Retrieved April 4, 2021 from
https://meta.stackexchange.com/questions/213574/what-should-we-consider-for-next-years-winter-bash
[103] Bogdan Vasilescu. 2014. Human Aspects, Gamification, and Social Media in Collaborative Software Engineering.
In Companion Proceedings of the 36th International Conference on Software Engineering (Hyderabad, India) (ICSE
Companion 2014). Association for Computing Machinery, New York, NY, USA, 646–649. https://doi.org/10.1145/2591062.2591091
[104] Baidu Zhidao (Knows) Website. 2021. A Community Question Answering Website (CQA). Retrieved November 12,
2021 from https://zhidao.baidu.com/
[105] Quora Website. 2021. A Community Question Answering Website (CQA). Retrieved November 12, 2021 from
https://www.quora.com/
[106] Stack Overflow Website. 2021. A Community Question Answering Website (CQA) for Programming Purposes. Retrieved
March 27, 2021 from https://stackoverflow.com/
[107] Zhihu Website. 2021. A Community Question Answering Website (CQA). Retrieved November 12, 2021 from
https://www.zhihu.com/
[108] Nannan Xi and Juho Hamari. 2020. Does gamification affect brand engagement and equity? A study in online brand
communities. Journal of Business Research 109 (March 2020), 449–460. https://doi.org/10.1016/j.jbusres.2019.11.058
[109] Stav Yanovsky, Nicholas Hoernle, Omer Lev, and Kobi Gal. 2019. One Size Does Not Fit All: Badge Behavior in Q&A
Sites. In Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization (Larnaca, Cyprus)
(UMAP ’19). Association for Computing Machinery, New York, NY, USA, 113–120. https://doi.org/10.1145/3320435.
3320438
[110] Gabe Zichermann and Christopher Cunningham. 2011. Gamification by design: Implementing game mechanics in web
and mobile apps. O'Reilly Media, Inc.


A THE FULL LIST OF FEATURES AND COVARIATES


This first appendix provides an overview of all the features and covariates used in this paper. All of the
features, categories, and explanations are adapted (or adopted directly) from [77].

Category 1: Temporal
1-1- Time gap before the first post: The time period between the creation of an account and the first post of a user (showing the initial passion or need).
1-2- Average time gap between all posts before WB: The average time interval between all posts of a user, from the first to the last post before WB (showing the user's passion and commitment before WB).
1-3- Time gap between the last two posts before WB: The time gap between the last two consecutive posts before WB (showing the user's engagement status just before WB).
1-4- Last time seen posting before WB: The time interval between the last post of a user and the commencement of WB (showing how close the time of the user's last activity was to WB).
1-5- Last time seen posting* (for churn-finding purposes): The time gap between the last post of a user and six months after the end of the observation period (used for sifting churned from non-churned users).

Category 2: Consistency
2-1- Consistency of user answers: Standard deviation of the reputation scores obtained for the answers (showing how consistently the answers of a user can be trusted).
2-2- Consistency of user questions: Standard deviation of the reputation scores obtained for the questions (showing how consistently the questions of a user can be trusted).

Category 3: Speed
3-1- Responsiveness: Inverse of the time gap between a question being posted and the user answering it (showing an estimate of how quickly the user responds to a relevant or interesting question).

Category 4: Textual Gratitude
4-1- Community's appreciation of a provided answer: Average number of comments made on the user's answers (showing the extent to which the user's answers are commented on).
4-2- Community's appreciation of an asked question: Average number of comments made on the user's questions (showing the extent to which the user's questions are commented on).

Category 5: Competitiveness
5-1- Measure of being an outstanding answerer: Average of the total number of answers for a question divided by the rank of the user's answer (showing the mastery and competitiveness of the answers of a user; see the formalization sketch after this table).

Category 6: Content
6-1- User's length of answers: Average length of an answer (showing a measure of the time and effort a user spends on writing detailed answers).
6-2- User's length of questions: Average length of a question (showing a measure of the time and effort a user spends on writing detailed questions).

Category 7: Knowledge level
7-1- Average user reputation for answers: Mean reputation score gained for each answer (showing a handy measure of how knowledgeable and trustworthy an answerer is).
7-2- Average user reputation for questions: Mean reputation score gained for each question (showing a handy measure of how knowledgeable and trustworthy a questioner is).
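
To make the competitiveness measure (feature 5-1) concrete, one plausible formalization is sketched below. The notation is ours and is introduced only for illustration (it is not taken from [77]): $Q_u$ is the set of questions user $u$ has answered, $N_q$ is the total number of answers posted to question $q$, and $\mathrm{rank}_u(q)$ is the rank of $u$'s answer among them.

\[
\mathrm{Competitiveness}(u) \;=\; \frac{1}{|Q_u|}\sum_{q \in Q_u} \frac{N_q}{\mathrm{rank}_u(q)}
\]

Under this reading, a user whose answers consistently rank first among many competing answers receives the highest score.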

Note: A subtle point concerning the quality of user contributions deserves elaboration here. The default score
feature in the Stack Exchange Data Explorer is often not appropriate for calculating a daily score measurement,
because the score value there reflects only the cumulative total over time, making it difficult to extract the
daily value from it directly (without additional considerations). Therefore, following our own practice in this
paper, we advise researchers to first count the daily up-votes (+1) and down-votes (-1) for each day separately
and then sum them to obtain the daily score, as illustrated in the sketch below.
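
As a minimal sketch of this computation, the following Python snippet aggregates daily scores from raw vote records, assuming the votes have been exported (e.g., from the public Stack Exchange data dump) into a table with PostId, VoteTypeId, and CreationDate columns; in that schema, VoteTypeId 2 denotes an up-vote and 3 a down-vote. The function name and surrounding scaffolding are ours, not the exact pipeline used in this paper.

```python
import pandas as pd

def daily_scores(votes: pd.DataFrame) -> pd.DataFrame:
    """Compute per-post daily scores from raw vote records.

    Expects columns PostId, VoteTypeId, and CreationDate, as in the
    public Stack Exchange data dump (VoteTypeId 2 = up-vote,
    VoteTypeId 3 = down-vote).
    """
    # Keep only up-votes and down-votes; other vote types do not
    # affect a post's displayed score.
    votes = votes[votes["VoteTypeId"].isin([2, 3])].copy()
    # Map each vote to +1 or -1 and bucket it by calendar day.
    votes["delta"] = votes["VoteTypeId"].map({2: 1, 3: -1})
    votes["day"] = pd.to_datetime(votes["CreationDate"]).dt.date
    # Sum the signed votes per post per day to get the daily score.
    return (
        votes.groupby(["PostId", "day"])["delta"]
        .sum()
        .reset_index(name="daily_score")
    )
```

Summing each post's daily scores over time should recover the cumulative score reported by the Data Explorer, which provides a convenient sanity check for the extraction.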


B CROSS-SECTIONAL ANALYSIS FOR RQ1 (COLLECTIVE USER ENGAGEMENT)

C CROSS-SECTIONAL ANALYSIS FOR RQ2 (DIFFERENT USER TYPES)


C.1 Novice Users


C.2 Low Reputation Users

C.3 Established Users


C.4 Trusted Users


D WB-PARTICIPATING VS. NON-PARTICIPATING USERS (VISUALIZATIONS)

[Figure grid omitted in this text extraction. Columns: Novice Users, Low Reputation Users, Established Users, and Trusted Users. Rows: Questions (Quantity), Answers (Quantity), Questions (Quality), and Answers (Quality).]
