Abstract
Although large language models (LLMs) are reshaping various aspects of human life, our
current understanding of their impacts remains somewhat constrained. Here we investigate
the impact of LLMs on human communication, using data on consumer complaints in the
financial industry. By employing an AI detection tool on more than 820K complaints gathered
by the Consumer Financial Protection Bureau (CFPB), we find a sharp increase in the likely
use of LLMs shortly after the release of ChatGPT. Moreover, the likely LLM usage was
positively correlated with message persuasiveness (i.e., increased likelihood of obtaining relief
from financial firms). Computational linguistic analyses suggest that the positive correlation
may be explained by LLMs’ enhancement of various linguistic features. Based on the results
of these observational studies, we hypothesize that LLM usage may enhance a comprehensive
set of linguistic features, increasing message persuasiveness to receivers with heterogeneous
linguistic preferences (i.e., linguistic feature alignment). We test this hypothesis in preregistered
experiments and find support for it. As one of the early empirical demonstrations of LLM
usage for enhancing persuasion, our research highlights the transformative potential of LLMs
in human communication.
∗ Minkyu Shin: Assistant Professor of Marketing, City University of Hong Kong, minkshin@cityu.edu.hk. † Jin Kim: Postdoctoral Research Associate, Northeastern University, jin.kim1@northeastern.edu. We thank the Department of Marketing at City University of Hong Kong for their financial support, which enabled us to fully use the AI detection API. We also thank Thomas L. Griffiths, Gal Zauberman, Jiwoong Shin, Anthony Dukes, K Sudhir, Liang Guo, Yanzhi Li, Inseong Song, Peter Arcidiacono, Soheil Ghili, Kosuke Uetake, Hortense Fong, Christoph Fuchs, Grant Packard, Lei Su, and Camilla Song for their constructive feedback. During the review process, the OSF link can be provided by the authors via email. Part of the results in this paper were previously circulated under the title “Enhancing Human Persuasion With Large Language Models.”
Large language models (LLMs) have begun to impact a wide range of critical human activities,
from economic to medical fields (Eloundou, Manning, Mishkin, and Rock, 2023; Korinek, 2023;
Larsen and Proserpio, 2023; Ali, Dobbs, Hutchings, and Whitaker, 2023). This impact can be
attributed to one of the core attributes of LLMs — their inherent focus on language — and
the fact that language plays a vital role in human communication. As LLMs become
more integrated into our daily lives, understanding their impact on human communication,
particularly persuasion, has become increasingly important. We thus examine whether and
how the use of LLMs can change the message from a sender to a receiver, allowing the sender
to achieve their desired outcomes. Specifically, our research focuses on the use of LLMs in
writing persuasive complaints to financial firms to achieve desired outcomes (e.g., offer of
financial relief).
The use of LLMs may enhance or reduce persuasion in communication between humans.
On the one hand, LLMs can enhance the persuasiveness of messages by generating content
that is coherent, informative, and formal (Simon and Muise, 2022; Wu, Jiang, Li, Rabe, Staats,
Jamnik, and Szegedy, 2022). On the other hand, however, LLMs can reduce persuasiveness
by generating content that is verbose or redundant (Saito, Wachi, Wataoka, and Akimoto,
2023). Moreover, LLMs may fall short of capturing the subtle nuances of human emotion
that are important for effective interpersonal communication (Shahriar and Hayawi, 2023).
In addition, given that a majority of LLMs are not trained for the sole purpose of improving
interpersonal communication, it is uncertain whether LLMs can indeed improve human
communication. This uncertainty motivates our first research question: whether the use of
LLMs can indeed enhance the persuasiveness of messages and increase the chance of achieving
desired outcomes for senders.
Persuasion can be enhanced through peripheral routes by leveraging issue-irrelevant factors
(Chaiken, 1980; Dewatripont and Tirole, 2005), and linguistic feature alignment is one such
factor. For instance, receivers may be influenced by various aspects of the message rather than by its substantive content alone.

2 Hypothesis
Figure 1: The figure illustrates the mechanism of linguistic feature alignment. In Panel (a), a
sender focuses on professionalism when writing their complaint, which is later handled by a
receiver who prioritizes clarity. Due to this linguistic feature misalignment, the complaint is
relatively less likely to result in the sender’s desired outcome. In Panel (b), by contrast, another
sender who also focuses on professionalism while using an LLM to write their complaint
ends up sending a complaint that is improved on multiple linguistic features — including
clarity — which is then handled by the same receiver prioritizing clarity. Due to this linguistic
feature alignment, the complaint is relatively more likely to result in the sender’s desired
outcome.
The CFPB Consumer Complaint Database serves as a pivotal platform that captures the
persuasive appeals by consumers directed toward financial firms. Each entry in this com-
prehensive database details the consumer’s issue with a financial product or service, the
company involved, and the geographic location, among other details. Most importantly, each
entry includes both the narrative of the complaint and the firm’s response. These consumer
complaints submitted to the CFPB do more than just seek individual solutions; they often
aim for both monetary outcomes, such as obtaining refunds or financial compensation, and
non-monetary outcomes, including the correction of inaccuracies in credit reports or the
resolution of transaction errors.
In our analyses, the responses by financial firms to consumer complaints serve as the main
dependent variable of our investigations. Firms can classify their responses as “Closed with monetary relief,” “Closed with non-monetary relief,” “Closed with explanation,” or other categories.

Figure 2 presents a time trend of the monthly proportion of complaints that the AI detection
tool identifies as likely written by an LLM (i.e., those with an AI Score ≥ 99%, henceforth
referred to as “Likely-AI” complaints). In the months immediately following the initial release
of ChatGPT (on November 30, 2022), we observe a sharp increase in the proportion of
Likely-AI complaints, from near 0% up to 5%. This surge is substantial, particularly in light
of the recent survey by the Pew Research Center (conducted from July 17-23, 2023), which
revealed that only a small fraction of Americans (18%) had used ChatGPT at the time (Park
and Gelles-Watnick, 2023).
It is notable that before the release of ChatGPT, the proportion of Likely-AI complaints
was very close to zero. Although there were some instances where the AI detection tool
(almost certainly) incorrectly labeled a complaint as likely written by LLMs before ChatGPT
was available (i.e., false positive), the rate was consistently negligible, near 0%. When we
consider less stringent thresholds for the AI Score (e.g., 80% or 70%) to classify Likely-AI
complaints, the rate of false positives increases, but more importantly, the surge in the
share of Likely-AI complaints is still prominent; see SI Appendix, section 1C. Therefore, in
the subsequent analyses, we apply the conservative threshold of 99% AI Score to reliably
determine which complaints can be considered as having been generated with or assisted by
LLMs.
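To illustrate this classification step, here is a minimal sketch in R (not the authors' code; the file and the column names `date_received` and `ai_score` are hypothetical) for computing the monthly share of Likely-AI complaints, as plotted in Figure 2:

```r
# Sketch: classify complaints by AI Score and compute the monthly share of
# Likely-AI complaints (as in Figure 2). File and column names are assumptions.
library(dplyr)

complaints <- read.csv("cfpb_complaints_scored.csv")  # hypothetical scored dataset

monthly_share <- complaints %>%
  mutate(
    month     = format(as.Date(date_received), "%Y-%m"),
    likely_ai = ai_score >= 99  # conservative 99% threshold used in the paper
  ) %>%
  group_by(month) %>%
  summarise(share_likely_ai = mean(likely_ai), .groups = "drop")
```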
Having identified the Likely-AI complaints, we tested whether firms would respond more
positively to those complaints. We first selected two groups of complaints: Likely-AI
complaints (AI scores ≥ 99%) and Likely-Human complaints (AI scores ≤ 1%). We then
conducted logistic regression analyses with the dependent variable being whether the complaint
received any form of relief — monetary or non-monetary — from firms, and the independent
variable being whether the complaint was classified as Likely-AI. The first column of Table
1 reveals a significant difference in the probability of receiving relief (Likely-AI: 56.3% vs.
Likely-Human: 36.5%). Even when controlling for more than 150 sub-topics of complaints,
as indicated in the second column of Table 1, the positive association remains. Furthermore,
this positive association persists even after controlling for more than 6,700 zip codes, as
shown in the third column of Table 1. Results presented in Table 1 suggest that LLM usage
can increase the likelihood of positive outcomes for message senders.
Dependent variable: whether relief was offered or not

                            (1)           (2)           (3)
Constant                 −0.551***
                          (0.005)
Likely-AI dummy           0.297***      0.196***      0.179***
                          (0.029)       (0.033)       (0.038)
Sub-topics controlled        N             Y             Y
Zip codes controlled         N             N             Y
Log-likelihood          −99,222.3    −91,477.45    −87,574.33
Observations              150,908       150,908       150,908

Unit of observation: each complaint. Sample period: Dec 2022 – Oct 2023.
Note. *** p < .001.
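A hedged sketch of the Table 1 specifications in R (not the authors' code; `relief`, `sub_topic`, and `zip_code` are assumed variable names):

```r
# Restrict to the two groups compared in Table 1: Likely-AI (score >= 99)
# and Likely-Human (score <= 1) complaints.
dat <- subset(complaints, ai_score >= 99 | ai_score <= 1)
dat$likely_ai <- as.integer(dat$ai_score >= 99)

# Columns 1-3: logistic regressions adding sub-topic and zip-code controls.
m1 <- glm(relief ~ likely_ai, data = dat, family = binomial)
m2 <- glm(relief ~ likely_ai + factor(sub_topic), data = dat, family = binomial)
m3 <- glm(relief ~ likely_ai + factor(sub_topic) + factor(zip_code),
          data = dat, family = binomial)
# With ~6,700 zip-code dummies, fixest::feglm() would be a faster alternative.
```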
To better understand the differences between Likely-AI complaints and Likely-Human com-
plaints, we conducted a comprehensive analysis of the complaints’ distinct linguistic features.
We compared these two groups in terms of linguistic features proposed by prior computational
linguistics research (Kincaid, Fishburne, Rogers, and Chissom, 1975; Jwalapuram, Joty, and
Lin, 2022; Danescu-Niculescu-Mizil, Sudhof, Jurafsky, Leskovec, and Potts, 2013; Huang,
Wang, and Yang, 2023). Each column in Table 2 presents results from a regression analysis
in which the dependent variable was each computational linguistic metric estimated for the
complaints and the independent variable was the binary variable of whether the complaint
was Likely-AI (1) or Likely-Human (0) as in Table 1.
Dependent variable:

                          Coherence    Readability    Politeness    Sentiment
Likely-AI dummy           18.349***     −1.961***      0.233***     −0.008***
                           (0.133)       (0.164)       (0.020)       (0.002)
Sub-topics controlled        Yes           Yes            Yes           Yes
Zip codes controlled         Yes           Yes            Yes           Yes
R²                          0.351         0.170          0.137         0.160
Adjusted R²                 0.320         0.130          0.096         0.119
Observations              150,908       150,908        150,908       148,122

Unit of observation: each complaint. Sample period: Dec 2022 – Oct 2023.
Note. *** p < .001.
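Continuing the sketch above, the Table 2 analyses amount to one OLS model per linguistic metric with the same controls (again, column names are assumptions):

```r
# One regression per linguistic metric, each on the Likely-AI dummy
# plus sub-topic and zip-code controls (as in Table 2).
metrics <- c("coherence", "readability", "politeness", "sentiment")
models <- lapply(metrics, function(m) {
  lm(reformulate(c("likely_ai", "factor(sub_topic)", "factor(zip_code)"),
                 response = m),
     data = dat)
})
names(models) <- metrics
```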
Our analyses reveal that Likely-AI complaints were significantly different from Likely-Human complaints across various linguistic features. The first column of Table 2 shows that Likely-AI complaints were more coherent (i.e., show higher coherence scores) than Likely-Human complaints based on the recently developed coherence metric (Jwalapuram, Joty, and Lin, 2022). Moreover, as shown in the second column of Table 2, Likely-AI complaints were easier to read than Likely-Human complaints, as indicated by their lower readability scores (Kincaid, Fishburne, Rogers, and Chissom, 1975). The third and fourth columns of Table 2 show that Likely-AI complaints were also more polite and slightly more negative in sentiment than Likely-Human complaints.[3]
[3] Due to the maximum token limit of 512, longer text in the sample is skipped, resulting in a lower number of observations in the fourth column compared to the other columns. The qualitative result remains similar when we apply a procedure of truncation and aggregation to the longer text. In addition, while a more negative tone in a complaint can underscore the severity of an issue and convey a sense of urgency, it is debatable whether increased negativity can always be considered an improvement.
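For reference, the readability metric from Kincaid, Fishburne, Rogers, and Chissom (1975) is presumably the Flesch–Kincaid grade level (an assumption consistent with lower scores indicating easier reading), computed as:

$$ \text{Grade level} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59 $$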
Although the observational studies’ results suggest that message senders’ use of LLMs may
increase the persuasiveness of their messages (i.e., increase the likelihood of obtaining relief
from the message receiver), the evidence may be merely correlational and not necessarily
causal. To test whether and how LLM usage can indeed cause an increase in message
persuasiveness, we conducted two experiments. Experiment 1 presented its participants
with messages that were either edited or not edited with an LLM (ChatGPT with GPT-4;
hereafter ChatGPT) to test whether the participants would be more likely to be persuaded
by edited messages. Then, after finding evidence that the messages edited with an LLM
were indeed more likely to bring about the message senders’ desired outcomes, we conducted
Experiment 2 to not only replicate these results, but also to test our proposed mechanism.
Pilot Study. Before the main experiments, we conducted a pilot study for two purposes.
First, we sought to test our assumption that people have varied linguistic preferences such
that they would focus on different linguistic features when editing persuasive messages with
an LLM. Second, we sought to determine which linguistic features people would be more
likely to focus on when editing persuasive messages with an LLM.
We recruited 212 participants from Prolific and asked them to imagine that they wrote complaints “to obtain monetary relief from a financial institution.”[4] We then presented each participant with a series of 5 complaints randomly selected from a set of 150 complaints from the CFPB data set (see SI Appendix, section 2B, for details on selecting the 150 complaints). For each complaint, participants expressed which linguistic features they would focus more or less on by completing 10 items, presented in a random order on a 5-point scale (1 = Strongly Disagree, 5 = Strongly Agree), each sharing the stem “For this complaint, I would ask ChatGPT to make the complaint . . .” completed by a different linguistic feature (e.g., “more polite,” “more coherent,” “more fact-based”).
[4] When describing the goal of the complaints, we excluded non-monetary relief because non-monetary relief is often complex for participants to understand and requires detailed information about each financial contract or transaction.
Figure 4: The figure shows the procedure and results of Experiment 1. From the CFPB
dataset, 20 complaints were selected (20 Control complaints) and each was edited with
ChatGPT to be more clear, coherent, and professional (20 Clear, 20 Coherent, and 20
Professional complaints). The edited complaints were rated by Experiment 1 participants
as more likely to receive hypothetical monetary compensation (M = 3.52, SD = 1.91) than
unedited complaints (M = 2.78, SD = 1.78), p < .001, Cohen’s d = 0.39. SI Appendix,
Table S1 presents more results from linear fixed effects model estimations and robustness
tests, which show the same results in terms of both direction and statistical significance.
Thus, Experiment 1 yielded results consistent with the results from the observational
study (Observation 2 in Section 3.3). Specifically, Experiment 1 shows that LLM usage does
increase message persuasiveness, as the complaints edited with an LLM were more likely
to result in the desired outcome (receiving hypothetical monetary compensation) than the
complaints not edited with an LLM.
Experiment 1 also showed results consistent with the computational linguistic analysis
results (Observation 3 in Section 3.4). Specifically, complaints edited to improve on each
linguistic feature (i.e., Clear, Coherent, and Professional complaints) were perceived as
improved in all other linguistic features as well (see Fig. 5). For example, Clear complaints
(i.e., those edited with ChatGPT to be more clear) were not only more clear than Control
(unedited) complaints (b = 1.53, SE = 0.10, t = 14.94, p < .001), but they were also more
coherent (b = 1.61, SE = 0.10, t = 15.60, p < .001) and more professional (b = 2.23, SE
= 0.11, t = 20.48, p < .001) than Control complaints (see the circle in each of the three
panels of Fig. 5). Likewise, Coherent and Professional complaints were improved on all
three linguistic features relative to Control complaints (squares and triangles, respectively, in
the three panels of Fig. 5). These results further support the notion that editing messages
with an LLM improves the messages on multiple linguistic features simultaneously. This
assumption — supported by results from both Observation 3 (Table 2) and Experiment
1 — forms the basis of our linguistic feature alignment hypothesis, which we test in the next
experiment.
When the complaint handlers focused on the clarity of the complaints, we found the following results:
• The complaints edited to be more professional were more likely to receive monetary
compensation than the unedited complaints (b = 0.90, SE = 0.16, t = 5.73, p < .001).
• The complaints edited to be more coherent were also more likely to receive monetary
compensation than the unedited complaints (b = 1.08, SE = 0.16, t = 6.93, p < .001).
Similarly, when the complaint handlers focused on the professionalism of the complaints,
we found the following results:
• The complaints edited to be more clear were more likely to receive monetary compen-
sation than the unedited complaints (b = 0.92, SE = 0.15, t = 6.33, p < .001).
• The complaints edited to be more coherent were also more likely to receive monetary
compensation than the unedited complaints (b = 1.17, SE = 0.14, t = 8.25, p < .001).
These results are qualitatively the same as the results presented in Fig. 6, which show mean
compensation likelihood ratings by experimental condition (Focus on Clarity vs. Focus on
Professionalism) and complaint type. The results suggest that enhanced persuasion through
LLM usage in the observational study (Observation 2 in Section 3.3) and Experiment 1 can
be driven by linguistic feature alignment. That is, editing messages with LLMs improves the
messages on multiple linguistic features simultaneously, such that there is a better alignment
between the (intentionally or unintentionally) improved features and features that message
receivers highly value. This linguistic feature alignment, in turn, enhances the persuasiveness
of the messages, increasing the likelihood of obtaining the outcomes desired by the message
sender.
We conducted three additional analyses in the same way, taking three other subsets of data from Experiment 2 and each time examining a different linguistic feature misalignment.
The second linear fixed effects model (Model 2 of Table 4) focused on Professional complaints
(20 in total) and complaint handlers who focused on clarity as their main criterion for
compensation judgments (n = 200), controlling for the same fixed effects as in Model 1. As
expected, compensation likelihood for the Professional complaints was positively associated
with perceived clarity of the complaints when the complaint handlers focused on clarity, b =
0.41, SE = 0.093, t = 3.88, p < .001. Again, the significant positive coefficient shows that in
this opposite case of misalignment — between Professional complaints and complaint handlers
focusing on Clarity — an improvement in the unintended linguistic feature (Clarity) predicted
the message’s persuasiveness. The remaining two models (Models 3 and 4) focused on two
other misalignment cases involving Coherent complaints, namely, the misalignment between
Coherent complaints and complaint handlers focusing on either professionalism or clarity. In
each of these models, we find the same significant positive association between the incidental
improvement on the linguistic feature valued by the complaint handler (professionalism or
clarity, respectively) and compensation likelihood, bs > 0.63, SE s < 0.11, ts > 6.1, ps < .001.
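As an illustration, Model 2 of Table 4 can be sketched in R as follows (not the authors' code; `exp2` and all variable names are assumptions):

```r
# Model 2 (Table 4): Professional complaints rated by handlers in the
# Focus-on-Clarity condition; DV = compensation likelihood, IV = perceived
# clarity, with complaint-content, participant, and position fixed effects.
model2 <- lm(
  comp_likelihood ~ perceived_clarity +
    factor(content_id) + factor(participant_id) + factor(position),
  data = subset(exp2,
                complaint_type == "Professional" & condition == "Focus on Clarity")
)
summary(model2)  # the text reports b = 0.41 (SE = 0.093) on perceived clarity
```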
5 Discussion
1 Observational Studies
We investigate whether and how large language models (LLMs), such as ChatGPT, can
be used to enhance persuasion in human communication. Although many LLMs widely
available currently (e.g., ChatGPT) were not specifically developed to improve interpersonal
communication, their extensive training on a wide array of human interaction texts could be
beneficial in refining the quality of messages produced by people. Given these considerations,
we focus on the following questions:
(1) Whether people use ChatGPT to write persuasive messages
(2) Whether ChatGPT usage can increase message persuasiveness
(3) Whether linguistic feature alignment can be a mechanism
In investigating these questions, however, we faced two major challenges. The first
challenge was determining whether a given message was written with an LLM. In general,
the messages themselves, let alone whether LLMs were used to write those messages, are not
recorded in large-scale datasets. The second challenge was determining whether messages’
persuasiveness changed (i.e., increased). The change in persuasiveness (i.e., whether a given
message was more or less likely to result in an outcome desired by the message sender) is
difficult to determine because it is rare for a large-scale dataset to systematically document
both senders’ persuasive messages and the outcomes of those messages (receivers’ responses
to those messages).
We sought to overcome these challenges by using a recently developed LLM usage detection
tool (for the first challenge) and by using a unique and publicly available dataset (for the
second challenge). Specifically, to identify potential usage of LLMs like ChatGPT in senders’
messages, we utilized Winston.ai (Winston AI, 2023), a state-of-the-art AI detection program
designed to estimate the probability that a given text was generated by LLMs.
We take advantage of the CFPB Consumer Complaint Database to study the impact of ChatGPT on the persuasiveness of consumer complaints. In this context, consumers act as senders of messages, in the form of complaints aimed at persuading financial companies, and these companies are the receivers of the messages. Below, we provide more details
about the communication steps involved in this process.
The communication begins when a consumer encounters a problem with a financial
product or service provided by a company, leading them to file a complaint to the CFPB. To
ensure effective communication between consumers and companies, the CFPB checks the
complaint for completeness before forwarding it to the concerned company. Upon receipt,
the company is expected to review the complaint, engage with the consumer as necessary,
and determine the appropriate actions. The company is given a 15-day window to provide an
initial response. If this response does not conclusively address the complaint, the company is
required to inform the CFPB. Following the initial response, the CFPB gives the company
60 days to deliver a final response.
The final response is required to provide details of the steps taken to address the complaint.
In our research, we utilized Winston AI (Winston AI, 2023), one of the most effective AI
detection programs, to determine whether a complaint was written with an LLM. This tool’s
performance has been tested and verified by many users and researchers (Prillaman, 2023;
Weber-Wulff, Anohina-Naumeca, Bjelobaba, Foltýnek, Guerrero-Dib, Popoola, Šigut, and
Waddington, 2023), who have deemed it one of the most effective AI detection tools, on par
with Originality.ai (Gillham, 2023). Notably, this program is designed and trained to detect
outputs by a wide range of popular LLMs including GPT-3, GPT-4, Bard, and many others.
To verify the accuracy of Winston AI, we conducted our own tests using the CFPB database. We can be confident that prior to December 2022, very few, if any, complaints would have been written with the assistance of currently available LLMs such as ChatGPT.[7] Thus, we can confidently categorize these complaints as human-authored, serving as ground truth for our test. We randomly selected 100 complaints from this group. We then created counterfactual, LLM-edited versions of these complaints using ChatGPT and categorized them as LLM-authored. The two groups of complaints (human-authored and LLM-authored) were processed through Winston AI as a test of the tool's detection accuracy.
The results effectively illustrate the tool’s capability to differentiate between human-
authored and AI-authored complaints. For human-authored complaints, the AI scores are
significantly lower (first quartile: 0.010, median: 0.380, mean: 5.097, third quartile: 2.042).
On the other hand, AI-authored complaints have much higher scores (first quartile: 95.65,
median: 99.69, mean: 93.42, third quartile: 99.94). The stark difference in these sets of AI
scores underscores the tool’s effectiveness in distinguishing between complaints authored by
humans and those authored by LLMs.
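A sketch of this accuracy check (assuming `human_scores` and `llm_scores` are numeric vectors, 0–100, of Winston AI scores for the two ground-truth groups):

```r
# Compare score distributions for the two ground-truth groups.
summary(human_scores)  # expect low quartiles/median for human-authored text
summary(llm_scores)    # expect high quartiles/median for LLM-edited text

# Accuracy at the conservative 99% threshold used in the main analyses:
mean(human_scores < 99)  # share of human complaints correctly below threshold
mean(llm_scores >= 99)   # share of LLM complaints correctly flagged
```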
[7] Before the end of 2022, when ChatGPT was released, LLMs from other GPT series (e.g., text-ada-001 or text-curie-001) had been released but were not widely used. This limited usage was primarily due to limited access and a general lack of public understanding of these models.
1C Threshold for Defining Likely-AI Complaints
Here we present our rationale behind selecting 99% as the threshold for defining Likely-AI
complaints. Figure S1 illustrates two key observations: first, a sharp increase in Likely-AI
complaints is evident immediately following the release of ChatGPT, irrespective of the
threshold level. Second, the use of a less stringent threshold results in a higher incidence of
false positives in the classification of Likely-AI complaints.
Figure S1: Each panel in the series displays the temporal trend for “Likely-AI” complaints
defined by using various threshold levels. The top panel represents cases classified as Likely-AI
when the score is greater than or equal to 90%. The middle panel corresponds to cases with
a score threshold of greater than or equal to 80%, and the bottom panel includes cases with
scores greater than or equal to 70%. More false positive cases are observed with a lower
threshold.
As mentioned in the main text, we conducted experiments to test whether the relationship
between LLM usage and message persuasiveness found in the observational studies can
be causal. That is, we test whether LLM usage can cause an increase in persuasiveness
of messages. This section of the Appendix provides more detailed information about these
experiments and a pilot study. We first conducted a pilot study to develop stimuli to use in
Experiments 1 and 2. We then conducted Experiments 1 and 2 to test the causal relationship.
We conducted a pilot study to develop stimuli to use in the main experiments. The entire
procedure of stimuli development is discussed in the next section. Here, we simply provide
the detailed procedure of the pilot study.
We recruited 210 participants (Mage = 40; 49% men, 49% women, and 2% other) from
Prolific. Participants began the study by learning that consumers can report complaints
regarding their financial institutions to the CFPB and that such complaints can later be
accessed by the public. Participants were then told that they would see five different complaints
and that for each complaint, their task was to (1) imagine that they wrote the complaint
to “obtain monetary relief from a financial institution” and (2) indicate how they would
use ChatGPT in editing their complaint “to improve [their] chances of obtaining monetary
compensation.”
Participants then proceeded to review a set of five random complaints from a set of 150 complaints selected from the CFPB dataset (details on selecting these 150 complaints are provided in the next section). Participants read one complaint at a time and, for each complaint, answered the study's main dependent measure, consisting of 10 items, each with a 5-point scale of disagreement or agreement: “For this complaint, I would ask ChatGPT to make the complaint . . . (1) more polite, (2) more coherent, (3) more fact-based,” and so forth.
2B Stimuli Development
As mentioned in the main text, the stimuli used in Experiments 1 and 2 were 80 consumer
complaints, 20 of which were the original, unedited complaints from the CFPB data set
(Control complaints), and 60 of which were three different edited versions of these 20 com-
plaints — namely, 20 Clear, 20 Coherent, and 20 Professional complaints. These stimuli were
carefully developed in three stages with the following steps:
1. From the CFPB dataset, we identified consumer complaints that the AI detection tool
identified as completely written by humans (i.e., those with the AI scores of 0).
2. From this set of Likely-Human complaints, we randomly selected a set of 300 consumer complaints (see the R code on the project's OSF page).
3. From the above set of 300 complaints, we eliminated 84 complaints that were exact duplicates of other complaints, leaving 216 nonduplicated complaints.
Pilot Study to Determine the Focal Linguistic Features of the Main Experiment
8. In the pilot study (N = 212; described in the previous section), we asked participants
how they would use ChatGPT to edit their hypothetical complaints to obtain monetary
relief from financial institutions. The results suggested that people would be likely to
use ChatGPT to edit their complaints to be more clear, coherent, and professional. We
thus concluded that clarity, coherence, and professionalism are three linguistic features
that people are relatively more likely to focus on when using an LLM like ChatGPT to
edit persuasive messages. Accordingly, we focused on these three linguistic features in
Experiments 1 and 2.
Figure S2: This figure shows the counts of consumer complaints for given ranges of uniqueness
scores for the 216 nonduplicated complaints that the AI detection tool determined as
completely written by humans. We randomly selected 20 complaints from the set of complaints
with uniqueness scores larger than 30.
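The excerpt does not define the uniqueness score, but the reference to Levenshtein (1966) suggests an edit-distance-based measure. One plausible version, offered only as an assumption, is the normalized Levenshtein distance from each complaint to its nearest neighbor:

```r
# Hypothetical uniqueness score: normalized Levenshtein distance (0-100)
# from each complaint to its most similar counterpart. `texts` is assumed
# to be the character vector of the 216 nonduplicated complaints.
uniqueness <- sapply(seq_along(texts), function(i) {
  d    <- drop(adist(texts[i], texts[-i]))             # Levenshtein distances
  norm <- d / pmax(nchar(texts[i]), nchar(texts[-i]))  # normalize by length
  100 * min(norm)                                      # nearest-neighbor distance
})
```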
We verified our stimuli (the 80 complaints) in Experiment 1 (described in the main text
and in the next section) by asking the experiment’s participants to rate them on the three
intended linguistic features. Specifically, we asked participants about each linguistic feature
(i.e., clarity, coherence, and professionalism) for a given complaint. Figure 5 in the main text
shows the results on these linguistic feature ratings.
We recruited 210 participants (Mage = 44; 49.5% men, 50% women, and 0.5% other) from
Prolific. After thanking participants for their participation, we presented each participant
with 5 distinct complaints randomly chosen from the set of 80 complaints. We ensured that each of the five randomly chosen complaints featured different complaint content; that is, no two complaints were identical or originated from the same unedited complaint. For each complaint, we asked participants to rate the likelihood
of providing monetary compensation regarding each complaint (“If you were handling the
complaint above, how likely would you be to offer monetary compensation to address the
consumer’s complaint? [1 = Not likely at all, 7 = Extremely likely]”). Following this
dependent measure, we also asked participants to rate each complaint on clarity, coherence, and professionalism.
For Experiment 1, we could have chosen to manipulate linguistic features using a study design
in which each participant would be randomly assigned to receive one complaint out of the 80
(Alternative Design). However, we had a few justifications for opting for our preregistered
study design instead.
First, presenting multiple complaints, as in our design, naturally led participants to compare complaints to one another, a process through which they likely established their own standards of what makes a good complaint. Such standards, in turn, would have made them better judges of complaints. This process would be unlikely to occur if participants saw only one complaint out of the 80 (Alternative Design), as is typical in between-subjects experiments, where a single stimulus, rather than multiple stimuli, is often presented.
Second, our design more closely matched the real-world setting, in which evaluators of consumer complaints encounter many complaints of a wide variety of types rather than a single complaint of one specific type. Thus, exposing participants to multiple complaints of different types made our study more ecologically valid than exposing them to a single complaint (Alternative Design).
In sum, our study design likely enhanced the robustness and real-world applicability of our
study’s findings.
2G Results of Experiment 2
Dependent variable: likelihood of offering monetary compensation

                                              (1)                  (2)
Unedited vs. More professional (dummy)      0.90***
                                            (0.16)
Unedited vs. More clear (dummy)                                  0.92***
                                                                 (0.15)
Unedited vs. More coherent (dummy)          1.08***              1.17***
                                            (0.14)               (0.14)
Condition included in model            Focus on Clarity   Focus on Professionalism
Complaints included in model           Professional,       Clear, Coherent,
                                       Coherent, Control   Control
Complaint content (20 contents) FE           Y                    Y
Participant FE                               Y                    Y
Presentation position (1–5) FE               Y                    Y
Observations                                600                  600

Note. *** p < .001.
Bumcrot, C. B., J. Lin, and A. Lusardi (2011): “The geography of financial literacy,”
RAND Working Paper Series No. WR-893-SSA.
Chaiken, S. (1980): “Heuristic versus systematic information processing and the use of
source versus message cues in persuasion.,” Journal of personality and social psychology,
39(5), 752.
Chen, J., W. Fan, J. Wei, and Z. Liu (2022): “Effects of linguistic style on persuasiveness
of word-of-mouth messages with anonymous vs. identifiable sources,” Marketing Letters,
33(4), 593–605.
Dou, Y., M. Hung, G. She, and L. L. Wang (2023): “Learning from peers: Evidence
from disclosure of consumer complaints,” Journal of Accounting and Economics, p. 101620.
Dou, Y., and Y. Roh (2019): “Public disclosure and consumer financial protection,” Journal
of Financial and Quantitative Analysis, pp. 1–56.
Eloundou, T., S. Manning, P. Mishkin, and D. Rock (2023): “GPTs are GPTs: An early look at the labor market impact potential of large language models,” arXiv preprint arXiv:2303.10130.
Gillham, J. (2023): “AI Content Detector Accuracy Review + Open Source Dataset
and Research Tool,” https://originality.ai/blog/ai-content-detection-accuracy,
Accessed: 2023-11-26.
Guo, B., X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu (2023): “How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection,” arXiv preprint arXiv:2301.07597.
Huang, A. H., H. Wang, and Y. Yang (2023): “FinBERT: A large language model for
extracting information from financial text,” Contemporary Accounting Research, 40(2),
806–841.
Jwalapuram, P., S. Joty, and X. Lin (2022): “Rethinking Self-Supervision Objectives for
Generalizable Coherence Modeling,” in Proc. 60th Ann. Meeting Assoc. Comput. Linguistics
(Vol. 1: Long Papers), pp. 6044–6059.
Larsen, P., and D. Proserpio (2023): “The Impact of Large Language Models on Search
Advertising: Evidence from Google’s BERT,” USC Marshall School of Business Research
Paper Sponsored by iORB.
Levenshtein, V. I., et al. (1966): “Binary codes capable of correcting deletions, insertions,
and reversals,” in Soviet physics doklady, vol. 10, pp. 707–710. Soviet Union.
Lingard, L. (2023): “Writing with ChatGPT: An illustration of its capacity, limitations &
implications for academic writers,” Perspectives on Medical Education, 12(1), 261.
Liu, A. X., Y. Xie, and J. Zhang (2019): “It’s not just what you say, but how you say it:
The effect of language style matching on perceived quality of consumer reviews,” Journal
of Interactive Marketing, 46(1), 70–86.
Lusardi, A., and O. S. Mitchell (2014): “The economic importance of financial literacy: Theory and evidence,” Journal of Economic Literature, 52(1), 5–44.
Lythreatis, S., S. K. Singh, and A.-N. El-Kassar (2022): “The digital divide: A review
and future research agenda,” Technological Forecasting and Social Change, 175, 121359.
Matz, S., J. Teeny, S. S. Vaid, G. M. Harari, and M. Cerf (2023): “The Potential of
Generative AI for Personalized Persuasion at Scale,” PsyArXiv.
Noy, S., and W. Zhang (2023): “Experimental evidence on the productivity effects of
generative artificial intelligence,” Science, 381, 187–192.
Packard, G., J. Berger, and R. Boghrati (2023): “How Verb Tense Shapes Persuasion,”
Journal of Consumer Research, p. ucad006.
Park, E., and R. Gelles-Watnick (2023): “Most Americans haven’t used ChatGPT; few think it will have a major impact on their job,” Pew Research Center, https://pewrsr.ch/3SCeWX9. Accessed: 2023-11-15.
Saito, K., A. Wachi, K. Wataoka, and Y. Akimoto (2023): “Verbosity bias in preference
labeling by large language models,” arXiv preprint arXiv:2310.10076.
Shahriar, S., and K. Hayawi (2023): “Let’s have a chat! A Conversation with ChatGPT:
Technology, Applications, and Limitations,” arXiv preprint arXiv:2302.13817.
Shin, J., and C.-Y. Wang (2024): “The Role of Messenger in Advertising Content: Bayesian
Persuasion Perspective,” Marketing Science.
Shin, M., J. Kim, B. van Opheusden, and T. L. Griffiths (2023): “Superhuman artificial
intelligence can improve human decision-making by increasing novelty,” Proceedings of the
National Academy of Sciences, 120(12), e2214840120.
Simon, N., and C. Muise (2022): “TattleTale: Storytelling with Planning and Large
Language Models,” in ICAPS Workshop on Scheduling and Planning Applications.
Toubia, O., J. Berger, and J. Eliashberg (2021): “How quantifying the shape of
stories predicts their success,” Proceedings of the National Academy of Sciences, 118(26),
e2011695118.