
Large language models can enhance persuasion through

linguistic feature alignment


Minkyu Shin∗ Jin Kim†

February 13, 2024

Abstract

Although large language models (LLMs) are reshaping various aspects of human life, our
current understanding of their impacts remains somewhat constrained. Here we investigate
the impact of LLMs on human communication, using data on consumer complaints in the
financial industry. By employing an AI detection tool on more than 820K complaints gathered
by the Consumer Financial Protection Bureau (CFPB), we find a sharp increase in the likely
use of LLMs shortly after the release of ChatGPT. Moreover, the likely LLM usage was
positively correlated with message persuasiveness (i.e., increased likelihood of obtaining relief
from financial firms). Computational linguistic analyses suggest that the positive correlation
may be explained by LLMs’ enhancement of various linguistic features. Based on the results
of these observational studies, we hypothesize that LLM usage may enhance a comprehensive
set of linguistic features, increasing message persuasiveness to receivers with heterogeneous
linguistic preferences (i.e., linguistic feature alignment). We test this hypothesis in preregistered
experiments and find support for it. As one of the early empirical demonstrations of LLM
usage for enhancing persuasion, our research highlights the transformative potential of LLMs
in human communication.

Keywords: Large language model; ChatGPT; Persuasion; Linguistic feature alignment; Language style matching; AI detection; CFPB


∗ Minkyu Shin: Assistant Professor of Marketing, City University of Hong Kong, minkshin@cityu.edu.hk. † Jin Kim: Postdoctoral Research Associate, Northeastern University, jin.kim1@northeastern.edu. We thank the Department of Marketing at City University of Hong Kong for their financial support, enabling us to fully use the AI detection API. We also thank Thomas L. Griffiths, Gal Zauberman, Jiwoong Shin, Anthony Dukes, K Sudhir, Liang Guo, Yanzhi Li, Inseong Song, Peter Arcidiacono, Soheil Ghili, Kosuke Uetake, Hortense Fong, Christoph Fuchs, Grant Packard, Lei Su, and Camilla Song for their constructive feedback. For the review process, the OSF link can be provided by the authors via email in the current version. Part of the results in this paper were previously circulated under the title “Enhancing Human Persuasion With Large Language Models.”



1 Introduction

Large language models (LLMs) have begun to impact a wide range of critical human activities,
from economic to medical fields (Eloundou, Manning, Mishkin, and Rock, 2023; Korinek, 2023;
Larsen and Proserpio, 2023; Ali, Dobbs, Hutchings, and Whitaker, 2023). This impact can be
attributed to one of the core attributes of LLMs — their inherent focus on language — and
the fact that language plays a vital role in human communication. As LLMs become
more integrated into our daily lives, understanding their impact on human communication,
particularly persuasion, has become increasingly important. We thus examine whether and
how the use of LLMs can change the message from a sender to a receiver, allowing the sender
to achieve their desired outcomes. Specifically, our research focuses on the use of LLMs in
writing persuasive complaints to financial firms to achieve desired outcomes (e.g., offer of
financial relief).
The use of LLMs may enhance or reduce persuasion in communication between humans.
On the one hand, LLMs can enhance the persuasiveness of messages by generating content
that is coherent, informative, and formal (Simon and Muise, 2022; Wu, Jiang, Li, Rabe, Staats,
Jamnik, and Szegedy, 2022). On the other hand, LLMs can reduce persuasiveness
by generating content that is verbose or redundant (Saito, Wachi, Wataoka, and Akimoto,
2023). Moreover, LLMs may fall short of capturing the subtle nuances of human emotion
that are important for effective interpersonal communication (Shahriar and Hayawi, 2023).
In addition, given that a majority of LLMs are not trained for the sole purpose of improving
interpersonal communication, it is uncertain whether LLMs can indeed improve human
communication. This uncertainty motivates our first research question: whether the use of
LLMs can indeed enhance the persuasiveness of messages and increase the chance of achieving
desired outcomes for senders.
Persuasion can be enhanced through peripheral routes by leveraging issue-irrelevant factors
(Chaiken, 1980; Dewatripont and Tirole, 2005), and linguistic feature alignment is one such
factor. For instance, receivers may be influenced by various aspects of the message rather



than its content per se, such as the credibility of the source (Shin and Wang, 2024) or the
presentation order of issue-irrelevant cues (Kiesler and Kiesler, 1964; Dewatripont and Tirole,
2005). Among the various peripheral routes of persuasion, we argue that linguistic feature
alignment between senders’ messages and receivers can be a persuasion-enhancing mechanism
for LLMs.1 Specifically, LLMs can induce linguistic feature alignment by improving an
array of linguistic features, some of which the receiver values.2 By generating messages that
resonate with the receiver’s preferred linguistic style, LLMs can render the communication
more relatable and persuasive. In the absence of LLMs, achieving such linguistic congruence
is often a costly and uncertain endeavor because the receiver’s preference is usually unknown
to senders. However, LLMs usually improve many linguistic features simultaneously (as
demonstrated by our results in Section 3.4), thereby inducing linguistic feature alignment. In
this paper, we empirically test this conjecture on how the use of LLMs can enhance message
persuasiveness.
For our investigation, we examined the consumer complaint database provided by the
Consumer Financial Protection Bureau (CFPB) for two primary reasons. First, the CFPB’s
dataset enables inference of LLMs’ impact on message persuasiveness by providing a historical
and systematic record on both sides of the communication: senders’ messages (i.e., consumers’
complaints) and receivers’ responses (i.e., financial firms’ compensation). By examining the
firms’ decisions on whether to offer relief, we can indirectly gauge the persuasive effectiveness
of each complaint. Second, consumer finance, with its inherent communication challenges,
serves as an appropriate domain to study the impact of LLMs on communication. Given the
problems of financial illiteracy (Lusardi and Mitchell, 2014), as well as unfamiliarity with
industry jargon and regulatory language, consumers may struggle with articulating their
1 This linguistic feature alignment can be considered a subfield of language style matching. Although language style matching has historically centered on the use of “function” words, recent studies have started to encompass the broader concept of language style matching beyond function words, such as various linguistic elements (e.g., Liu, Xie, and Zhang 2019; Packard and Berger 2021; Chen, Fan, Wei, and Liu 2022; Packard, Berger, and Boghrati 2023).
2 According to communication accommodation theory (Giles, 1979), this alignment is crucial because it can lead to increased rapport and trust, and subsequently, higher likelihoods of persuasion.



problems or concerns to financial firms. Moreover, because miscommunication in the financial
domain has financial consequences, people often have extra incentives to use tools like LLMs
for effective communication. Hence, the CFPB dataset and consumer finance domain offer
an ideal testbed for examining the use and impact of LLMs in persuasive communication.
With the CFPB consumer complaint dataset, we conducted our investigation in two parts:
observational and experimental. First, in the observational part, we analyzed more
than 820K complaints with three questions in mind: (1) whether there is evidence of people’s
use of LLMs, (2) whether LLM usage leads to better outcomes for senders, and (3) whether
messages assisted by LLMs exhibit any differences in linguistic features. Second, in the
experimental part, we conducted two experiments: First, we tested for a causal relationship
between LLM usage and message persuasiveness; and second, we examined whether linguistic
feature alignment could be a mechanism underlying the enhancement in persuasion. Together,
the observational and experimental studies provide a thorough inquiry into whether and how
LLM usage enhances persuasion in human communication.

2 Hypothesis

Linguistic Feature Alignment as a Mechanism We hypothesize that linguistic feature alignment is a mechanism through which LLMs enhance message persuasiveness. When
a message sender writes a persuasive message without using an LLM, the message can
show prominence in certain linguistic feature(s) that the message receiver does not value.
This would be considered a case of linguistic feature misalignment (between the message’s
prominent linguistic feature[s] and a receiver’s preferred linguistic feature[s]). In contrast,
when the message sender does use an LLM to write the message, then the LLM will likely
improve the message on a broad array of linguistic features (Lingard, 2023; Tang, Chuang, and
Hu, 2023; Guo, Zhang, Wang, Jiang, Nie, Ding, Yue, and Wu, 2023; Herbold, Hautli-Janisz,
Heuer, Kikteva, and Trautsch, 2023), especially including those features that are correlated



with the feature(s) the message sender seeks to improve. This means that the LLM will
improve the message not only on (1) the linguistic feature(s) the sender intended to improve
with the LLM but also (2) the feature(s) the sender did not intend to improve but the
receiver values. This would be considered a case of linguistic feature alignment (between the
LLM-edited message’s linguistic feature[s] and the receivers’ preferred linguistic feature[s]).
The receiver will be more likely to be persuaded by the message when it contains the linguistic
feature(s) they value. Thus, using an LLM will comprehensively and simultaneously improve
various linguistic features of a persuasive message, which increases the likelihood of linguistic
feature alignment, which in turn can enhance the persuasiveness of the message.
Figure 1 illustrates the mechanism of linguistic feature alignment. Panel (a) of Figure 1
depicts a consumer who does not rely on an LLM; in this scenario, they (i.e., the complaint
sender) would write a complaint that may reflect their own linguistic preference or their
assumption about which linguistic features would appeal to the complaint handlers in financial
companies. For example, the consumer might write a complaint that uses excessive jargon
to demonstrate their understanding of the problem. Although their complaint might look
professional, it may lack clarity concerning the issue and, consequently, may face a low
likelihood of obtaining relief when reviewed by a complaint handler who highly values clarity.
In contrast, Panel (b) of Figure 1 depicts a consumer who uses an LLM to edit their
complaint to be more professional. This consumer would likely receive an edited complaint
that is not only more professional but also clearer, because the LLM would improve multiple
linguistic features simultaneously. As in the case of Panel (a), the complaint is received by a
complaint handler who highly values clarity in complaints. Even though the message sender
aimed to improve their message on a linguistic feature not necessarily highly valued by a
message receiver (e.g., professionalism), the message was incidentally improved on multiple
linguistic features, including those highly valued by the receiver (e.g., clarity). Therefore, the
receiver (complaint handler) would be more likely to be persuaded by the complaint than
they would be by a complaint not edited with an LLM.



(a) Persuasion without LLMs (Misalignment) (b) Persuasion with LLMs (Alignment)

Figure 1: The figure illustrates the mechanism of linguistic feature alignment. In Panel (a), a
sender focuses on professionalism when writing their complaint, which is later handled by a
receiver who prioritizes clarity. Due to this linguistic feature misalignment, the complaint is
relatively less likely to result in the sender’s desired outcome. In Panel (b), by contrast, another
sender who also focuses on professionalism while using an LLM to write their complaint
ends up sending a complaint that is improved on multiple linguistic features — including
clarity — which is then handled by the same receiver prioritizing clarity. Due to this linguistic
feature alignment, the complaint is relatively more likely to result in the sender’s desired
outcome.

3 Observational Studies with the CFPB Dataset

3.1 Data & Tool

The CFPB Consumer Complaint Database serves as a pivotal platform that captures the
persuasive appeals by consumers directed toward financial firms. Each entry in this com-
prehensive database details the consumer’s issue with a financial product or service, the
company involved, and the geographic location, among other details. Most importantly, each
entry includes both the narrative of the complaint and the firm’s response. These consumer
complaints submitted to the CFPB do more than just seek individual solutions; they often
aim for both monetary outcomes, such as obtaining refunds or financial compensation, and
non-monetary outcomes, including the correction of inaccuracies in credit reports or the
resolution of transaction errors.
In our analyses, the responses by financial firms to consumer complaints serve as the main
dependent variable of our investigations. Firms can classify their responses as “Closed with



monetary relief” if they provide financial compensation, or “Closed with non-monetary relief”
if they offer a resolution that does not involve financial compensation but still resolves the
complaint, such as by changing account terms or amending contract details. If the response
is categorized as “Closed with explanation,” the firm has only provided an explanation to the
consumer without providing any relief. A complaint marked as “Closed” indicates no relief or
explanation was given to the consumer. We consider a persuaded outcome to be one where either monetary or non-monetary relief is achieved, a definition consistent with previous studies (Dou and Roh, 2019; Dou, Hung, She, and Wang, 2023). Thus, the outcome, coded as a binary variable (whether relief was received or not), serves as our study’s primary dependent variable.
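
For concreteness, this outcome coding can be sketched in Python as follows. The file name and workflow are illustrative rather than our actual pipeline, though the response-category labels follow the database’s published values.

    import pandas as pd

    # Load a CFPB consumer complaint export (file and column names assumed).
    df = pd.read_csv("cfpb_complaints.csv")

    # A complaint counts as persuaded if it obtained monetary or non-monetary
    # relief, per the definition above.
    RELIEF_OUTCOMES = {"Closed with monetary relief", "Closed with non-monetary relief"}
    df["relief"] = df["Company response to consumer"].isin(RELIEF_OUTCOMES).astype(int)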
To identify which complaints were likely written with LLMs, we analyzed complaints
with Winston.ai (Winston AI, 2023), one of the leading commercial AI detection tools, to
calculate an AI Score for each complaint text in the CFPB database. This AI detection
tool computes a “Human Score,” a metric specifically designed to estimate the probability that a piece of text was written by a human rather than generated by an AI system, and it has performed well in independent evaluations (Prillaman, 2023). For our investigation, we introduce the term AI Score,
calculated as (100 - Human Score), to denote the computed probability that a complaint was
written with help from AI. More details about the dataset and the tool are provided in the
SI Appendix, sections 1A and 1B. We use AI scores in the following section to analyze the
prevalence of complaints likely written with an LLM and to study their distinct linguistic
features and the persuasion outcome.
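
The scoring step can be sketched as follows. The endpoint URL, request payload, and response fields below are illustrative assumptions, not the provider’s documented interface.

    import requests

    API_URL = "https://api.gowinston.ai/v1/predict"  # hypothetical endpoint
    API_KEY = "YOUR_API_KEY"

    def ai_score(text: str) -> float:
        """Return AI Score = 100 - Human Score for one complaint narrative."""
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"text": text},  # assumed payload field
            timeout=30,
        )
        response.raise_for_status()
        human_score = response.json()["score"]  # assumed: human-likeness, 0-100
        return 100.0 - human_score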

3.2 Observation 1: The Use of LLMs

Figure 2 presents a time trend of the monthly proportion of complaints that the AI detection
tool identifies as likely written by an LLM (i.e., those with an AI Score ≥ 99%, henceforth
referred to as “Likely-AI” complaints). In the months immediately following the initial release
of ChatGPT (on November 30, 2022), we observe a sharp increase in the proportion of



Figure 2: Constructed from 824,525 complaints, the graph shows the sudden surge in Likely-
AI complaints beginning in 2023, reflecting the (growing popularity and) likely use of LLMs
in consumer complaint writing.

Likely-AI complaints, from near 0% up to 5%. This surge is substantial, particularly in light
of the recent survey by the Pew Research Center (conducted from July 17-23, 2023), which
revealed that only a small fraction of Americans (18%) had used ChatGPT at the time (Park
and Gelles-Watnick, 2023).
It is notable that before the release of ChatGPT, the proportion of Likely-AI complaints
was very close to zero. Although there were some instances where the AI detection tool
(almost certainly) incorrectly labeled a complaint as likely written by LLMs before ChatGPT
was available (i.e., false positive), the rate was consistently negligible, near 0%. When we
consider less stringent thresholds for the AI Score (e.g., 80% or 70%) to classify Likely-AI
complaints, the rate of false positives increases, but more importantly, the surge in the
share of likely-AI complaints is still prominent; see SI Appendix, section 1C. Therefore, in
the subsequent analyses, we apply the conservative threshold of 99% AI Score to reliably
determine which complaints can be considered as having been generated with or assisted by
LLMs.
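
A minimal sketch of this classification and of the monthly trend shown in Figure 2, assuming a dataframe of complaints with their AI Scores (file and column names are illustrative):

    import pandas as pd
    import matplotlib.pyplot as plt

    scored = pd.read_csv("cfpb_complaints_scored.csv", parse_dates=["date_received"])

    # Conservative threshold: AI Score >= 99 marks a Likely-AI complaint.
    scored["likely_ai"] = scored["ai_score"] >= 99

    # Monthly share of Likely-AI complaints (the quantity plotted in Figure 2).
    monthly_share = scored.set_index("date_received")["likely_ai"].resample("M").mean()
    monthly_share.plot(ylabel="Share of Likely-AI complaints")
    plt.show()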



3.3 Observation 2: Positive Association Between LLM Usage and the Persuasion Outcome

Having identified the Likely-AI complaints, we tested whether firms would respond more
positively to those complaints. We first selected two groups of complaints: Likely-AI
complaints (AI scores ≥ 99%) and Likely-Human complaints (AI scores ≤ 1%). We then
conducted logistic regression analyses with the dependent variable being whether the complaint
received any form of relief — monetary or non-monetary — from firms, and the independent
variable being whether the complaint was classified as Likely-AI. The first column of Table
1 reveals a significant difference in the probability of receiving relief (Likely-AI: 56.3% vs.
Likely-Human: 36.5%). Even when controlling for more than 150 sub-topics of complaints,
as indicated in the second column of Table 1, the positive association remains. Furthermore,
this positive association persists even after controlling for more than 6,700 zip codes, as
shown in the third column of Table 1. Results presented in Table 1 suggest that LLM usage
can increase the likelihood of positive outcomes for message senders.

Table 1: Firm Responses (Likely-AI vs. Likely-Human)

Dependent variable: Whether relief was offered or not

                            (1)           (2)           (3)
Likely-AI dummy           0.297∗∗∗      0.196∗∗∗      0.179∗∗∗
                         (0.029)       (0.033)       (0.038)
Constant                 −0.551∗∗∗
                         (0.005)
Sub-topics controlled        N             Y             Y
Zip-code controlled          N             N             Y
Log-likelihood          −99,222.3     −91,477.45    −87,574.33
Observations              150,908       150,908       150,908
Unit of observation      Each complaint
Sample period            Dec 2022 – Oct 2023

Note. ∗∗∗ p < .001
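
The regressions in Table 1 can be sketched as follows, building on the scored dataframe (with the relief outcome coded as in Section 3.1); column names are illustrative, and this is an illustration rather than the authors’ own estimation code.

    import statsmodels.formula.api as smf

    # Keep the two comparison groups: Likely-AI (>= 99) vs. Likely-Human (<= 1).
    sample = scored[(scored["ai_score"] >= 99) | (scored["ai_score"] <= 1)].copy()
    sample["likely_ai"] = (sample["ai_score"] >= 99).astype(int)

    # Column (1): relief regressed on the Likely-AI dummy only.
    m1 = smf.logit("relief ~ likely_ai", data=sample).fit()

    # Columns (2) and (3): add sub-topic and zip-code controls. With 150+
    # sub-topics and 6,700+ zip codes, dummy-variable logit is slow; dedicated
    # fixed-effects estimators may be preferable in practice.
    m2 = smf.logit("relief ~ likely_ai + C(sub_topic)", data=sample).fit()
    m3 = smf.logit("relief ~ likely_ai + C(sub_topic) + C(zip_code)", data=sample).fit()
    print(m1.params["likely_ai"])  # Table 1 reports 0.297 for column (1)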



3.4 Observation 3: Linguistic Feature Improvements

To better understand the differences between Likely-AI complaints and Likely-Human com-
plaints, we conducted a comprehensive analysis of the complaints’ distinct linguistic features.
We compared these two groups in terms of linguistic features proposed by prior computational
linguistics research (Kincaid, Fishburne, Rogers, and Chissom, 1975; Jwalapuram, Joty, and
Lin, 2022; Danescu-Niculescu-Mizil, Sudhof, Jurafsky, Leskovec, and Potts, 2013; Huang,
Wang, and Yang, 2023). Each column in Table 2 presents results from a regression analysis
in which the dependent variable was each computational linguistic metric estimated for the
complaints and the independent variable was the binary variable of whether the complaint
was Likely-AI (1) or Likely-Human (0) as in Table 1.

Table 2: Linguistic Features (Likely-AI vs. Likely-Human)

Dependent variable:
                         Coherence     Readability    Politeness    Sentiment
Likely-AI dummy          18.349∗∗∗     −1.961∗∗∗      0.233∗∗∗      −0.008∗∗∗
                         (0.133)       (0.164)        (0.020)       (0.002)
Sub-topics controlled    Yes           Yes            Yes           Yes
Zip-code controlled      Yes           Yes            Yes           Yes
R²                       0.351         0.170          0.137         0.160
Adjusted R²              0.320         0.130          0.096         0.119
Observations             150,908       150,908        150,908       148,122
Unit of observation      Each complaint
Sample period            Dec 2022 – Oct 2023

Note. ∗∗∗ p < .001

Our analyses reveal that Likely-AI complaints were significantly different from Likely-
Human complaints across various linguistic features. The first column of Table 2 shows that
Likely-AI complaints were more coherent (i.e., show higher coherence scores) than Likely-
Human complaints based on the recently developed coherence metric (Jwalapuram, Joty,
and Lin, 2022). Moreover, as shown in the second column of Table 2, Likely-AI complaints
were easier to read than Likely-Human complaints, as lower readability scores (Kincaid,



Fishburne, Rogers, and Chissom, 1975) indicate texts that are easier to read. In addition,
Likely-AI complaints were more polite than Likely-Human complaints based on the politeness
metric (Danescu-Niculescu-Mizil, Sudhof, Jurafsky, Leskovec, and Potts, 2013), as shown
in the third column of Table 2. Finally, the fourth column of Table 2 shows that Likely-AI
complaints exhibited a more negative sentiment3 , as determined by FinBERT (Huang, Wang,
and Yang, 2023). These results show that, in general, complaints edited with an LLM show
improvements across various linguistic features.
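
Two of these metrics can be sketched concretely: the Flesch-Kincaid grade level (Kincaid, Fishburne, Rogers, and Chissom, 1975), where lower scores indicate easier reading, via the textstat package, and financial-domain sentiment via a FinBERT model. The Hugging Face checkpoint name below is an assumed example standing in for the model of Huang, Wang, and Yang (2023).

    import textstat
    from transformers import pipeline

    text = "I was charged a fee that was never disclosed in my account agreement."

    # Flesch-Kincaid grade level: lower values indicate easier-to-read text.
    readability = textstat.flesch_kincaid_grade(text)

    # FinBERT-style sentiment classifier; inputs beyond the 512-token limit
    # are truncated here (cf. footnote 3).
    finbert = pipeline("text-classification", model="yiyanghkust/finbert-tone")
    sentiment = finbert(text, truncation=True)

    print(readability, sentiment)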
Notably, the findings from the observational studies may be simply correlational and
not necessarily causal. To establish a causal relationship, we must compare two identical
complaints: a complaint not edited with an LLM and the same complaint edited with an
LLM, while ensuring that other factors are held constant. However, such observations are
unlikely to be observed in the data, and even if they are observed, there still remains the
possibility that unobserved factors influenced the persuasion outcomes differently. Because
of such limitations, the findings from the observational studies encouraged us to test our
hypotheses with experiments. First, we sought to test our hypothesis that LLM usage enhances
persuasiveness of complaints (i.e., increases the likelihood of obtaining financial relief). Second,
if indeed LLM usage does enhance persuasion (as suggested by the observational studies’
results), we then aimed to test the hypothesis that linguistic feature alignment underlies this
effect. Below, we detail two experiments testing these two hypotheses.

3 Due to the maximum token limit of 512, longer text in the sample is skipped, resulting in a lower number of observations in the fourth column compared to the other columns. The qualitative result remains similar when we apply a procedure of truncation and aggregation to the longer text. In addition, while a more negative tone in a complaint can underscore the severity of an issue and convey a sense of urgency, it is debatable whether increased negativity can always be considered an improvement.



4 Experimental Studies with the CFPB Dataset

4.1 Overview of the Experiments

Although the observational studies’ results suggest that message senders’ use of LLMs may
increase the persuasiveness of their messages (i.e., increase the likelihood of obtaining relief
from the message receiver), the evidence may be merely correlational and not necessarily
causal. To test whether and how LLM usage can indeed cause an increase in message
persuasiveness, we conducted two experiments. Experiment 1 presented its participants
with messages that were either edited or not edited with an LLM (ChatGPT with GPT-4;
hereafter ChatGPT) to test whether the participants would be more likely to be persuaded
by edited messages. Then, after finding evidence that the messages edited with an LLM
were indeed more likely to bring about the message senders’ desired outcomes, we conducted
Experiment 2 to not only replicate these results, but also to test our proposed mechanism.

Pilot Study Before the main experiments, we conducted a pilot study for two purposes.
First, we sought to test our assumption that people have varied linguistic preferences such
that they would focus on different linguistic features when editing persuasive messages with
an LLM. Second, we sought to determine which linguistic features people would be more
likely to focus on when editing persuasive messages with an LLM.
We recruited 212 participants from Prolific and asked them to imagine that they wrote
complaints “to obtain monetary relief from a financial institution.”4 We then presented each
participant with a series of 5 complaints randomly selected from a set of 150 complaints from
the CFPB data set (see SI Appendix, section 2B, for details on selecting the 150 complaints).
For each complaint, participants expressed which linguistic features they would focus more
or less on by completing the following 10 items in a random order on a 5-point scale (1 =
Strongly Disagree, 5 = Strongly Agree): “For this complaint, I would ask ChatGPT to make
4 When describing the goal of complaints, we excluded non-monetary relief because non-monetary relief is often complex for participants to understand and requires detailed information about each financial contract or transaction.



the complaint... (1) more polite, (2) more coherent, (3) more fact-based, (4) more persuasive,
(5) more clear, (6) more professional, (7) more assertive, (8) more concise, (9) less repetitive,
(10) less emotional” (see SI Appendix, section 2A, for details on selecting these 10 linguistic
features). Afterwards, participants finished the study by completing exploratory and demographic measures.
The results from the pilot study supported our hypothesis that there would be a significant
variation in participants’ preferences for linguistic features when editing complaints with
LLMs. To quantify this variation, we calculated the Euclidean distance between participants’
response vectors for the same complaint. For example, if one participant’s response vector
indicated no need for improvement on any feature (a vector of ten 1s), while another’s indicated the maximum improvement on all features (a vector of ten 5s), the Euclidean distance between these vectors is √((5 − 1)² × 10) ≈ 12.65. We used this distance as a measure of the heterogeneity in participants’ linguistic preferences. We computed this distance for every possible pair of response vectors for each complaint. A histogram of these distances is shown in Figure 3. It is worth noting that when there is a discrepancy of at least 3 points on at least two linguistic features between any two response vectors (that is, when one participant strongly agrees and the other moderately disagrees on the need for improving at least two of the 10 features), the minimum Euclidean distance would be √((5 − 2)² × 2) ≈ 4.24. Fig. 3
shows that 67.3% of the total 3,259 response-vector pairs have a Euclidean distance greater
than 4.24. These results suggest that there is considerable heterogeneity across participants
in their preferences regarding which linguistic features to focus on when editing persuasive
messages with an LLM.
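
A minimal sketch of this heterogeneity measure, using hypothetical response vectors:

    import numpy as np
    from scipy.spatial.distance import pdist

    # Three hypothetical participants' ratings of one complaint on the 10 features.
    responses = np.array([
        [1] * 10,                         # no improvement needed on any feature
        [5] * 10,                         # maximum improvement on all features
        [3, 4, 2, 5, 3, 3, 4, 2, 3, 3],   # a mixed preference profile
    ])

    # Pairwise Euclidean distances; the first pair yields
    # sqrt((5 - 1)**2 * 10), approximately 12.65, as in the example above.
    distances = pdist(responses, metric="euclidean")
    print(distances)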
The pilot study also helped us determine which linguistic features people would focus
relatively more on when using LLMs to edit their messages. Specifically, the three features
that participants expressed greatest intent to focus on were (1) clarity, (2) coherence, and (3)
professionalism (see Table 3). We thus centered our investigation on these three as the linguistic features that people are likely to focus on when editing their persuasive



messages with LLMs. See SI Appendix, section 2A for more details on this pilot study.

Linguistic Feature    Intent-to-Focus Rating: Mean [95% CI]
More clear            3.78 [3.71, 3.86]
More coherent         3.71 [3.64, 3.79]
More professional     3.71 [3.63, 3.79]
More concise          3.62 [3.54, 3.70]
More persuasive       3.54 [3.47, 3.61]
More fact-based       3.21 [3.13, 3.29]
More assertive        3.05 [2.97, 3.13]
More polite           2.86 [2.79, 2.94]
Less repetitive       2.82 [2.74, 2.90]
Less emotional        2.82 [2.74, 2.90]

Figure 3: Pairwise Euclidean Distances on the Intent to Modify Linguistic Features

Table 3: Participant Ratings on the Intent to Modify Linguistic Features

4.2 Experiment 1: Whether LLMs Can Enhance Human Persuasion

The primary goal of Experiment 1 (preregistration: https://aspredicted.org/MM3_17T) was to test whether LLM usage can cause an increase in message persuasiveness. To test this
causal relationship, we carefully selected complaints from the CFPB dataset, created edited
versions of these complaints with a popular LLM (ChatGPT), and tested whether these two
types of messages (i.e., those unedited vs. those edited with the LLM, which contain the
same content but are written differently) would result in a difference in message senders’
desired outcomes (i.e., monetary compensation from the message receivers).
Stimuli. To create stimuli for Experiment 1, we selected actual Likely-Human consumer
complaints (AI scores ≤ 1%) in the CFPB dataset and modified them on various linguistic
features using ChatGPT. Specifically, we first selected a set of 20 consumer complaints from
the CFPB dataset based on several criteria, such as complaints’ uniqueness and length, to
encompass a variety of financial issues (see SI Appendix, section 2B for detailed selection steps).
We then presented these complaints to ChatGPT and asked it to improve them on three



linguistic features: clarity, coherence, and professionalism. We thus created experimental
stimuli by giving each of the 20 complaints to ChatGPT one at a time and asked ChatGPT to
edit each complaint to be (1) more clear, (2) more coherent, and (3) more professional (with
the order of these three instructions randomized; see SI Appendix, Fig. S3 for an example of
one such editing instruction). This procedure resulted in a total of 80 complaints consisting
of 20 unedited complaints (Control ) and 60 edited complaints (20 Coherent, 20 Clear, and 20
Professional ). Importantly, the edited complaints retained the same complaint content as the
Control complaints; they simply enhanced the linguistic features. That is, for each original
Control complaint (e.g., A), we created three edited versions of the same complaint (i.e., A′, A′′, A′′′). This allows us to control for the effects of complaint content on the likelihood of
A′′ , A ). This allows us to control for the effects of complaint content on the likelihood of
compensation, the estimation of which was not possible in our observational study.
Methods. We presented 5 distinct complaints randomly chosen from the set of 80
complaints to each of the experiment participants5 (N = 210), using the randomization
process described in SI Appendix, section 2D. For each complaint, we asked participants to
rate the likelihood of providing monetary compensation regarding the complaint, which was
the main dependent measure of the experiment: “If you were handling the complaint above,
how likely would you be to offer monetary compensation to address the consumer’s complaint?
(1 = Not likely at all, 7 = Extremely likely).” Participants then rated the complaint on three
linguistic features (clarity, coherence, and professionalism) by answering the three questions
placed right below the main dependent measure: “How clear [coherent / professional] is the
complaint above? (1 = Not clear [coherent / professional] at all, 7 = Extremely clear [coherent / professional]).” Participants then finished the study by completing exploratory and demographic measures.
Results and Discussion. The results (and procedure) of Experiment 1 are summarized
5 These participants were relatively well-educated, as they had a bachelor’s degree or higher. Furthermore, we replicated the findings with participants with more relevant backgrounds (i.e., financial industry work experience) in Experiment 2. Although the compensation decision was not directly assessed by the official complaint handler at the financial institution in these experiments, the public evaluation can still influence the final decision, as the handler is likely to consider the potential impact on their reputation. This is a primary reason why the CFPB makes both the complaint and the firm’s response openly available.



in Figure 4. Consistent with the findings from the observational studies, the complaints
edited with ChatGPT (Coherent, Clear, and Professional complaints combined) were more
likely to receive hypothetical monetary compensation than the unedited complaints (Control
complaints only) in the preregistered linear fixed effects model, b = 0.74, SE = 0.093, t =
7.96, p < .001.

Figure 4: The figure shows the procedure and results of Experiment 1. From the CFPB
dataset, 20 complaints were selected (20 Control complaints) and each was edited with
ChatGPT to be more clear, coherent, and professional (20 Clear, 20 Coherent, and 20
Professional complaints). The edited complaints were rated by Experiment 1 participants
as more likely to receive hypothetical monetary compensation (M = 3.52, SD = 1.91) than
unedited complaints (M = 2.78, SD = 1.78), p < .001, Cohen’s d = 0.39. SI Appendix,
Table S1 presents more results from linear fixed effects model estimations and robustness
tests which show the same results both in terms of direction and statistical significance.

Thus, Experiment 1 yielded results consistent with the results from the observational
study (Observation 2 in Section 3.3). Specifically, Experiment 1 shows that LLM usage does
increase message persuasiveness, as the complaints edited with an LLM were more likely
to result in the desired outcome (receiving hypothetical monetary compensation) than the
complaints not edited with an LLM.
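
The preregistered model can be sketched as follows, assuming a long-format dataframe with one row per participant-complaint rating; the file and column names are illustrative, not the preregistered specification itself.

    import pandas as pd
    import statsmodels.formula.api as smf

    exp1 = pd.read_csv("experiment1_long.csv")  # one row per participant-complaint rating

    # Compensation likelihood regressed on an edited-vs-unedited dummy, with
    # fixed effects for complaint content and participant.
    model = smf.ols(
        "compensation ~ edited + C(complaint_id) + C(participant_id)",
        data=exp1,
    ).fit()
    print(model.params["edited"])  # the paper reports b = 0.74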



Figure 5: This figure shows the estimated increase in clarity (left panel), coherence (middle
panel), and professionalism (right panel) of the edited complaints (Clear, Coherent, or
Professional complaints), relative to the unedited complaints (i.e., Control complaints).
Within each panel, the increases were estimated in a fixed effects linear model controlling for
the fixed effects of complaint contents and participants. Error bars indicate 95% Confidence
Intervals for the respective increase estimates.

Experiment 1 also showed results consistent with the computational linguistic analysis
results (Observation 3 in Section 3.4). Specifically, complaints edited to improve on each
linguistic feature (i.e., Clear, Coherent, and Professional complaints) were perceived as
improved in all other linguistic features as well (see Fig. 5). For example, Clear complaints
(i.e., those edited with ChatGPT to be more clear) were not only more clear than Control
(unedited) complaints (b = 1.53, SE = 0.10, t = 14.94, p < .001), but they were also more
coherent (b = 1.61, SE = 0.10, t = 15.60, p < .001) and more professional (b = 2.23, SE
= 0.11, t = 20.48, p < .001) than Control complaints (see the circle in each of the three
panels of Fig. 5). Likewise, Coherent and Professional complaints were improved on all
three linguistic features relative to Control complaints (squares and triangles, respectively, in
the three panels of Fig. 5). These results further support the notion that editing messages
with an LLM improves the messages on multiple linguistic features simultaneously. This
assumption — supported by results from both Observation 3 (Table 2) and Experiment
1 — forms the basis of our linguistic feature alignment hypothesis, which we test in the next
experiment.



4.3 Experiment 2: How LLMs Can Enhance Human Persuasion

In Experiment 2, we test whether the mechanism of linguistic feature alignment, as described in the Hypothesis section (Section 2), may underlie the enhancement in persuasion through
LLM usage. In addition, we test whether the results of Experiment 1 (i.e., LLM usage
enhancing persuasiveness of complaints) can be replicated in Experiment 2.
To test the proposed mechanism of linguistic feature alignment, we experimentally induced
a misalignment between (1) the linguistic feature the message sender aimed to improve and
(2) the linguistic feature highly valued by the message receiver. We did so by adapting
the procedure of Experiment 1 and asking Experiment 2 participants to focus on either
clarity or professionalism as the main judgment criterion when deciding on their likelihood of
providing hypothetical compensation in response to the complaints. That is, we experimentally
manipulated half of the participants to highly value one linguistic feature (clarity) and the
other half to highly value another feature (professionalism) while presenting them with the
same set of diverse messages that were improved on various linguistic features (i.e., the 60 complaints that were improved on clarity, professionalism, or coherence, plus the 20 control complaints). This meant that all participants would decide on compensation likelihood
for complaints that were aligned on linguistic feature (e.g., participants focusing on clarity
deciding on Clear complaints) as well as for complaints that were misaligned on linguistic
feature (e.g., participants focusing on clarity deciding on Professional or Coherent complaints).
We hypothesized that if the LLM improved messages on various linguistic features and
therefore linguistic feature alignment were to be achieved, then the induced misalignment
(between the linguistic feature on which a given message was improved and the linguistic
feature that the message receiver was induced to value) should have little effect on the likeli-
hood of desired outcomes (hypothetical monetary compensations). For example, participants
experimentally induced to value clarity should favorably view not only the complaints that
were improved on clarity (i.e., those aligned on linguistic feature), but also the complaints
that were improved on professionalism or those improved on coherence (i.e., those misaligned



on linguistic feature). This is because editing messages with an LLM would have improved
the messages on such unintended linguistic features in addition to the intended features, and
there would therefore be an alignment between the unintentionally improved linguistic feature
and the feature that the receiver values. However, if the LLM did not improve messages on
various linguistic features and therefore linguistic feature alignment could not be achieved,
then the misalignment induced by the experimenters should decrease the likelihood of desired
outcomes.
Methods. Using Prolific, we recruited 400 individuals who had previously worked in the finance sector. We began the study by giving the participants a general idea of the study’s purpose (“We’re interested in learning about how consumer complaints might be
handled by financial institutions”) and informing them of why they were recruited (“We
sought to recruit participants like you who. . . have past work experience in the finance sector”).
Participants then learned that they would encounter four different consumer complaints and
were asked to imagine that they were in charge of handling each of the complaints.
The key manipulation in Experiment 2 was asking participants to focus on one linguistic
feature — clarity or professionalism — and asking them to use it as the main judgment
criterion to decide their likelihood of providing compensation. Specifically, for each complaint,
we asked participants to rate it on the focal linguistic feature they were assigned (either clarity
or professionalism) using the following 7-point scale: “How clear [professional] is the complaint
above? (1 = Not clear [professional] at all, 7 = Extremely clear [professional] ).” Participants
then rated how likely they would be to provide compensation for the complaint: “If you were
handling the complaint above, how likely would you be to offer monetary compensation to
address the consumer’s complaint? (1 = Not likely at all, 7 = Extremely likely).” Participants
provided these two ratings for each of the four randomly assigned complaints. They then
provided demographic information to complete the study.
Results. Results were consistent with our hypotheses. First, we found further support
for our assumption that LLM usage simultaneously improves multiple linguistic features,



such that an LLM improves messages not only on the intended linguistic feature but also on
the unintended linguistic feature. Specifically, fixed effects linear model analyses revealed
that the complaints edited to be more professional were also more clear than the unedited
complaints (b = 1.61, SE = 0.13, t = 12.86, p < .001), just as the complaints edited to be
more clear were also more professional than the unedited complaints (b = 2.13, SE = 0.11, t
= 19.75, p < .001).
Second, and more importantly, we found evidence for enhanced persuasion through
linguistic feature alignment. That is, complaints edited with ChatGPT to improve on a
focal linguistic feature were more likely to result in monetary compensation (as compared
with complaints not edited with ChatGPT), even when the complaint handlers were induced
to focus on a different linguistic feature; see SI Appendix, section 2G for tabulated results.
For each manipulation of the linguistic feature to focus on (Focus on Clarity vs. Focus on
Professionalism), there were two possible cases of misalignment (see the solid arrows in Figure
6). In each of these four possible cases of misalignment, we found evidence for enhanced
persuasion through linguistic feature alignment. Specifically, when the complaint handlers
focused on clarity of the complaints, we found the following results:

• The complaints edited to be more professional were more likely to receive monetary
compensation than the unedited complaints (b = 0.90, SE = 0.16, t = 5.73, p < .001).

• The complaints edited to be more coherent were also more likely to receive monetary
compensation than the unedited complaints (b = 1.08, SE = 0.16, t = 6.93, p < .001).

Similarly, when the complaint handlers focused on the professionalism of the complaints,
we found the following results:

• The complaints edited to be more clear were more likely to receive monetary compen-
sation than the unedited complaints (b = 0.92, SE = 0.15, t = 6.33, p < .001).

• The complaints edited to be more coherent were also more likely to receive monetary
compensation than the unedited complaints (b = 1.17, SE = 0.14, t = 8.25, p < .001).



Figure 6: This figure shows the procedure and results of Experiment 2. Each participant
viewed each of the four types of complaints. They focused on either clarity or professionalism
as the main judgment criterion when deciding on their likelihood of compensation for each
complaint. Complaints edited with ChatGPT to improve on linguistic features different from the one the complaint handler focused on were more likely to result in monetary compensation
(the italicized mean ratings in yellow boxes) than the complaints not edited with ChatGPT
(Control). Because complaints edited with ChatGPT to improve on a single linguistic feature
(e.g., clarity) were incidentally improved on multiple linguistic features (e.g., clarity and
professionalism), this resulted in an alignment between the incidentally improved feature (e.g.,
professionalism) and the feature that the complaint handler valued (e.g., professionalism).
Such linguistic feature alignment thus underlies the enhancement in persuasion through LLM usage.

These results are qualitatively the same as the results presented in Fig. 6, which show mean
compensation likelihood ratings by experimental condition (Focus on Clarity vs. Focus on
Professionalism) and complaint type. The results suggest that enhanced persuasion through
LLM usage in the observational study (Observation 2 in Section 3.3) and Experiment 1 can
be driven by linguistic feature alignment. That is, editing messages with LLMs improves the
messages on multiple linguistic features simultaneously, such that there is a better alignment
between the (intentionally or unintentionally) improved features and features that message
receivers highly value. This linguistic feature alignment, in turn, enhances persuasiveness
of the messages, increasing the likelihood of obtaining the outcomes desired by the message
sender.



4.4 A Series of Further Tests of the Linguistic Feature Alignment Hypothesis

We conducted additional tests of the linguistic feature alignment hypothesis. Specifically, we examined whether the complaints that were edited with an LLM (ChatGPT) to be
improved on one linguistic feature (e.g., clarity) were also more likely to be compensated by
the complaint handlers who focused on a different linguistic feature (e.g., professionalism),
to the extent that the complaints were (incidentally) improved on the different linguistic
feature (e.g., professionalism). For example, a complaint edited with ChatGPT to be more
clear would not only be clearer but also more professional. If such a complaint was presented
to the complaint handler who used professionalism as their primary criterion for deciding
whether to provide compensation, then the increase in the complaint’s persuasiveness would
be a function of the level of professionalism in the complaint that was intended to be more clear. We thus tested for the presence of such relationships implied by the linguistic feature
alignment hypothesis.
We estimated four separate linear fixed effects models using four different subsets of data
from Experiment 2. In the first model, we examined linguistic feature alignment between (1)
complaints that were edited to be more clear and (2) complaint handlers who focused on
professionalism. We restricted our analysis to the Clear complaints (20 in total) and to the
participants who were induced to use professionalism as their main criterion for compensation
likelihood judgments (n = 200). Using this subset of data, we regressed likelihood of
compensation on perceived professionalism (of these Clear complaints), controlling for fixed
effects of complaint content and the order position in which a given complaint was presented.6
As shown in Model 1 of Table 4, compensation likelihood for the Clear complaints was
positively associated with perceived professionalism of the complaints when the complaint
handlers focused on professionalism, b = 0.49, SE = 0.13, t = 3.88, p < .001. The significant
6 It should be noted that we did not control for the fixed effects of participants because each participant in Experiment 2 saw only one Clear complaint. The same principle was applied when estimating the other models.



positive coefficient indicates that even in this case of misalignment (between Clear complaints
and complaint handlers focusing on professionalism), an incidental improvement in the
unintended linguistic feature (professionalism) predicted the complaints’ persuasiveness,
further supporting the linguistic feature alignment hypothesis.

Table 4: Further Tests of the Linguistic Feature Alignment Hypothesis

Model   # of Obs.   Complaint Type     Receiver Condition   Predictor                      b      SE      t      p
                    (Edited for ...)   (Focus on ...)       (Linguistic feature rating)
1       200         Clarity            Professionalism      Professionalism                0.49   0.13    3.88   < .001
2       200         Professionalism    Clarity              Clarity                        0.41   0.093   4.40   < .001
3       200         Coherence          Professionalism      Professionalism                0.64   0.10    6.14   < .001
4       200         Coherence          Clarity              Clarity                        0.78   0.087   9.00   < .001

We conducted three additional analyses the same way, taking three other subsets of data
from Experiment 2 but each time examining a different misalignment in linguistic feature.
The second linear fixed effects model (Model 2 of Table 4) focused on Professional complaints
(20 in total) and complaint handlers who focused on clarity as their main criterion for
compensation judgments (n = 200), controlling for the same fixed effects as in Model 1. As
expected, compensation likelihood for the Professional complaints was positively associated
with perceived clarity of the complaints when the complaint handlers focused on clarity, b = 0.41, SE = 0.093, t = 4.40, p < .001. Again, the significant positive coefficient shows that in
this opposite case of misalignment — between Professional complaints and complaint handlers
focusing on Clarity — an improvement in the unintended linguistic feature (Clarity) predicted
the message’s persuasiveness. The remaining two models (Models 3 and 4) focused on two
other misalignment cases involving Coherent complaints, namely, the misalignment between
Coherent complaints and complaint handlers focusing on either professionalism or clarity. In
each of these models, we find the same significant positive association between the incidental
improvement on the linguistic feature valued by the complaint handler (professionalism or
clarity, respectively) and compensation likelihood, bs > 0.63, SEs < 0.11, ts > 6.1, ps < .001.
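
As a concrete illustration, Model 1 of Table 4 can be sketched as follows, assuming the Experiment 2 data in long format; the file and column names are illustrative.

    import pandas as pd
    import statsmodels.formula.api as smf

    exp2 = pd.read_csv("experiment2_long.csv")  # one row per participant-complaint rating

    # Restrict to Clear complaints rated in the Focus-on-Professionalism condition.
    subset = exp2[(exp2["complaint_type"] == "Clear")
                  & (exp2["condition"] == "professionalism")]

    # Compensation likelihood on perceived professionalism, with fixed effects
    # for complaint content and presentation order (see footnote 6).
    model = smf.ols(
        "compensation ~ professionalism_rating + C(complaint_id) + C(order_position)",
        data=subset,
    ).fit()
    print(model.params["professionalism_rating"])  # the paper reports b = 0.49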



Each of the models estimated here essentially constitutes a further test of linguistic feature
alignment. Without exception, the models reveal that the complaints were more likely to
result in the desired outcome (compensation), the more they were incidentally improved on
the linguistic feature that the complaint handler values. These results lend further credence
to linguistic feature alignment as the mechanism underlying the enhancement in persuasion
through the use of LLMs.

5 Discussion

We investigate the role of LLMs in enhancing persuasion in human communication by examining consumer complaints in the financial industry. Our findings indicate that consumers
(message senders) began to use LLMs to write more persuasive complaints (messages) to the
firms (message receivers), and that such LLM usage likely increased the likelihood of obtaining
desired outcomes (offer of relief). This increase in message persuasiveness through LLM usage can
be explained by linguistic feature alignment between messages and receivers, which in turn
is facilitated by the LLMs’ ability to improve multiple linguistic features simultaneously in
a given message. Unlike other recent works on the impact of LLMs, which rely solely on
experimental evidence (Noy and Zhang, 2023; Matz, Teeny, Vaid, Harari, and Cerf, 2023;
Liu, Yen, Marjieh, Griffiths, and Krishna, 2023), we offer one of the first observational
studies using a large dataset to examine the adoption and impact of LLMs in persuasive
communication with supporting experimental evidence.
Although this research demonstrates LLMs’ potential to enhance persuasion in human
communication, LLMs’ effectiveness in doing so can depend on the communication context. In
informal contexts, such as social media, where a conversational and direct form of persuasion
is predominant, the sophisticated linguistic features introduced by LLMs might not only
lack impact but could also inadvertently introduce biases, reflecting biases in their training
datasets. Consider another context, online dating, where a sender can use an



LLM to edit messages to persuade a receiver to go on a date. It is possible that the LLM’s
messages intended to persuade might backfire by giving the receiver an impression that the
sender is not charismatic or lacks human qualities. Such possible misalignments between a
communication context and LLMs’ typical outputs underscore the importance of considering
the context of communication to ensure the efficacy of LLMs in persuasion. Our research,
however, demonstrates one of many appropriate contexts in which persuasion can be enhanced
effectively through the use of LLMs.
Our findings encourage many directions for future research. One possible direction is
to examine geographic or sociodemographic heterogeneity in a relevant variable and its
association with LLM adoption rates, as well as the resulting social implications. For example,
it may be worth examining whether (and how) different levels of financial literacy across
ZIP codes (Bumcrot, Lin, and Lusardi, 2011) are associated with rates of LLM usage, and
what implications such an association may have for the digital divide (Lythreatis, Singh, and
El-Kassar, 2022). Such examinations may reveal that LLMs can level the playing field for
people with different levels of financial literacy and/or different levels of communication skills.
By enhancing the persuasiveness of complaints for all individuals, LLMs could ensure that
outcomes hinge more directly on the substantive merits of an issue rather than on individuals’
ability to articulate their case.
Another direction is to explore the nature and extent of LLMs’ other impacts on people’s
behavior in the financial industry. Just as other AI systems had multifaceted impacts on a
field (Shin, Kim, van Opheusden, and Griffiths, 2023), LLMs like ChatGPT will likely affect
people’s behavior in the financial industry in various ways. For example, people may use
ChatGPT also as a knowledge acquisition tool, rather than merely as a writing assistant. This
in turn may significantly increase their propensity to write complaints about or communicate
with financial firms in general, as they become more informed about their rights as consumers
and the regulations or intricacies of the financial industry.
Yet another direction is to explore which specific linguistic features or



dimensions of persuasive messages LLMs can enhance. Prior studies have identified a range of linguistic features
influencing communication (Toubia, Berger, and Eliashberg, 2021; Packard and Berger, 2021;
Reisenbichler, Reutterer, Schweidel, and Dan, 2022; Packard, Berger, and Boghrati, 2023).
Identifying the specific features that LLMs can or cannot effectively enhance would deepen our
understanding of the reach and limits of LLMs’ impact on persuasive message construction.
Lastly, future research may examine a completely different mechanism through which
LLMs can enhance message persuasiveness. It is possible that when a persuasive message
is edited with an LLM, it comes to signal something about the message sender, which the
message sender may or may not have intended. For example, if a consumer without much
knowledge of the law edits their complaint with an LLM, and the LLM produces a message
which looks like it was written by a lawyer, then the complaint handler at a financial firm
who receives such a message may give a more favorable response to shield the firm from any
potential legal troubles. Thus, editing a persuasive message with an LLM may cause the
message to carry certain signals about the type of message sender (regardless of whether
those signals reflect truth or not), and such signaling (intentional or unintentional) could
make messages edited with LLMs more persuasive. Future research may investigate such
alternative mechanisms that might explain our findings.
These are only some of the many possible avenues for future research. We hope that
our findings will open many more avenues and encourage more work on understanding
LLMs’ impacts on human communication. LLMs have the potential to revolutionize human
communication, and we may be witnessing the first waves of a great tide of change approaching.



Supporting Information (SI) Appendix

1 Observational Studies

We investigate whether and how large language models (LLMs), such as ChatGPT, can
be used to enhance persuasion in human communication. Although many of the LLMs widely
available today (e.g., ChatGPT) were not specifically developed to improve interpersonal
communication, their extensive training on a wide array of texts from human interactions
could help refine the quality of the messages people produce. Given these considerations,
we focus on the following questions:
(1) Whether people use ChatGPT to write persuasive messages
(2) Whether ChatGPT usage can increase message persuasiveness
(3) Whether linguistic feature alignment can be a mechanism underlying any such increase
In investigating these questions, however, we faced two major challenges. The first
challenge was determining whether a given message was written with an LLM. In general,
the messages themselves, let alone whether LLMs were used to write those messages, are not
recorded in large-scale datasets. The second challenge was determining whether messages’
persuasiveness changed (i.e., increased). The change in persuasiveness (i.e., whether a given
message was more or less likely to result in an outcome desired by the message sender) is
difficult to determine because it is rare for a large-scale dataset to systematically document
both senders’ persuasive messages and the outcomes of those messages (receivers’ responses
to those messages).
We sought to overcome these challenges by using a recently developed LLM usage detection
tool (for the first challenge) and by using a unique and publicly available dataset (for the
second challenge). Specifically, to identify potential usage of LLMs like ChatGPT in senders’
messages, we utilized Winston.ai (Winston AI, 2023), a state-of-the-art AI detection program
designed to estimate the probability that a given text was generated by an LLM. With
this tool, we were able to identify which messages were likely to have been written with
an LLM like ChatGPT or written solely by humans. This identification further allowed us
to compare the linguistic features of messages written with an LLM and those of messages
written without an LLM.
More importantly, we analyzed data from the Consumer Financial Protection Bureau
(CFPB) consumer complaint database. This dataset provides a well-documented historical
record of consumers’ persuasive complaints and the corresponding responses from financial
firms. Investigating these real-world examples affords us a unique opportunity to answer our
research questions. Below, we first discuss the details of the dataset and then the details of
the AI (LLM usage) detection tool.

1A Data: CFPB Complaint Database

We take advantage of the CFPB Consumer Complaint Database to study the impact of
ChatGPT on the persuasiveness of consumer complaints. In this context, consumers act as
message senders: their messages take the form of complaints aimed at persuading financial
companies, which act as message receivers. Below, we provide more details
about the communication steps involved in this process.
The communication begins when a consumer encounters a problem with a financial
product or service provided by a company, leading them to file a complaint with the CFPB. To
ensure effective communication between consumers and companies, the CFPB checks the
complaint for completeness before forwarding it to the company in question. Upon receipt,
the company is expected to review the complaint, engage with the consumer as necessary,
and determine the appropriate actions. The company is given a 15-day window to provide an
initial response. If this response does not conclusively address the complaint, the company is
required to inform the CFPB. Following the initial response, the CFPB gives the company
60 days to deliver a final response.
The final response is required to provide details of the steps taken to address the complaint,
such as any planned follow-up actions. The company also selects, from several options,
the category that best describes its response to the
complaint. One option is "Closed with monetary relief," which indicates that the company's
response included providing objective, measurable, and verifiable monetary relief to the
consumer. Another category is “Closed with non-monetary relief,” which indicates that the
company’s actions did not result in monetary relief, but may have addressed some or all of
the consumer’s complaints beyond mere explanations (e.g., changing account terms or coming
up with a foreclosure alternative). A third category is "Closed with explanation," which
indicates that the company simply explained the issue without providing either monetary or
non-monetary relief. The last category is "Closed," which indicates that the company closed
the case without providing any form of relief or explanation. In the
present research, we define the successful persuasion outcome to be receiving either monetary
or non-monetary relief, consistent with the definitions used in previous research (Dou and
Roh, 2019; Dou, Hung, She, and Wang, 2023). This definition is used to code the main
dependent variable in our analyses.
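
To make this outcome coding concrete, it can be expressed in a few lines of R. This is a minimal sketch of our own, not the authors' actual code; it assumes a data frame cfpb with a company_response column holding the category labels described above (both names are hypothetical):

```r
# Minimal sketch (hypothetical data frame and column names).
relief_categories <- c("Closed with monetary relief",
                       "Closed with non-monetary relief")

# Binary persuasion outcome: 1 if the company granted monetary or non-monetary
# relief, 0 if it only provided an explanation or simply closed the case.
cfpb$relief <- as.integer(cfpb$company_response %in% relief_categories)
```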
The CFPB complaint database, which is publicly accessible, contains complaints from
December 2011 through October 2023 (as of the time of writing). However, not every
complaint includes a "narrative," the primary text of the complaint: if a consumer
does not consent to making their complaint public, we are unable to observe the text
of that complaint. Additionally, the AI detection tool we utilized has specific
requirements concerning the length of the input text. We finalized our dataset by taking
these factors into consideration, and the dataset is available on the project's Open
Science Framework page (Click Here). Results
from additional and related analyses, which are discussed in the following sections, can also
be found on the page.



1B Method: AI Detection Tool

In our research, we utilized Winston AI (Winston AI, 2023), one of the most effective AI
detection programs, to determine whether a complaint was written with an LLM. This tool’s
performance has been tested and verified by many users and researchers (Prillaman, 2023;
Weber-Wulff, Anohina-Naumeca, Bjelobaba, Foltỳnek, Guerrero-Dib, Popoola, Šigut, and
Waddington, 2023), who have deemed it one of the most effective AI detection tools, on par
with Originality.ai (Gillham, 2023). Notably, this program is designed and trained to detect
outputs by a wide range of popular LLMs including GPT-3, GPT-4, Bard, and many others.
To verify the accuracy of Winston AI, we conducted our own tests using the CFPB
database. It is safe to assume that prior to December 2022, very few, if any,
complaints were written with the assistance of currently available LLMs such
as ChatGPT.7 Thus, we can confidently categorize these complaints as human-authored,
serving as ground truth for our test. We randomly selected 100 complaints from this group.
We then created counterfactual, LLM-edited versions of these complaints using ChatGPT
and categorized them as LLM-authored. The two groups of complaints (human-authored
and LLM-authored) were processed through Winston AI as a test of the tool's detection
accuracy.
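
For concreteness, the validation design can be sketched in R as follows. This is a minimal sketch of our own, not the authors' code; edit_with_chatgpt() and winston_score() are hypothetical placeholder wrappers, not actual API signatures, and pre_chatgpt_pool is an assumed vector of pre-December-2022 complaint texts:

```r
# Minimal sketch (hypothetical wrappers; not actual API signatures).
edit_with_chatgpt <- function(text) {
  # Placeholder: send `text` to ChatGPT with an editing prompt; return the edit.
  stop("Replace with an actual ChatGPT API call.")
}
winston_score <- function(text) {
  # Placeholder: send `text` to Winston AI; return its 0-100 AI score.
  stop("Replace with an actual Winston AI API call.")
}

# `pre_chatgpt_pool` (hypothetical) holds complaints filed before December 2022,
# which can safely be treated as human-authored ground truth.
human_texts <- sample(pre_chatgpt_pool, 100)
llm_edited  <- vapply(human_texts, edit_with_chatgpt, character(1))

human_scores <- vapply(human_texts, winston_score, numeric(1))
ai_scores    <- vapply(llm_edited,  winston_score, numeric(1))

summary(human_scores)  # expected to be low for human-authored complaints
summary(ai_scores)     # expected to be high for LLM-edited complaints
```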
The results effectively illustrate the tool’s capability to differentiate between human-
authored and AI-authored complaints. For human-authored complaints, the AI scores are
significantly lower (first quartile: 0.010, median: 0.380, mean: 5.097, third quartile: 2.042).
On the other hand, AI-authored complaints have much higher scores (first quartile: 95.65,
median: 99.69, mean: 93.42, third quartile: 99.94). The stark difference in these sets of AI
scores underscores the tool’s effectiveness in distinguishing between complaints authored by
humans and those authored by LLMs.
7
Before the end of 2022, when ChatGPT was released, other GPT-series LLMs (e.g., text-ada-001
or text-curie-001) had been released but were not widely used, primarily due to limited
access and a general lack of public understanding of them.



1C Testing Various Thresholds for Determining Likely-AI Complaints

Here we present our rationale behind selecting 99% as the threshold for defining Likely-AI
complaints. Figure S1 illustrates two key observations: first, a sharp increase in Likely-AI
complaints is evident immediately following the release of ChatGPT, irrespective of the
threshold level. Second, the use of a less stringent threshold results in a higher incidence of
false positives in the classification of Likely-AI complaints.
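
For concreteness, the threshold comparison can be sketched in R as follows. This is a minimal sketch of our own, not the authors' code; it assumes a data frame cfpb with a numeric ai_score column (Winston AI score, 0 to 100) and a date_received column of class Date (all names hypothetical):

```r
library(dplyr)

# Minimal sketch (hypothetical data frame and column names).
monthly_trends <- cfpb %>%
  mutate(month = format(date_received, "%Y-%m")) %>%
  group_by(month) %>%
  summarise(
    likely_ai_99 = mean(ai_score >= 99),  # main threshold used in the paper
    likely_ai_90 = mean(ai_score >= 90),  # looser thresholds shown in Figure S1
    likely_ai_80 = mean(ai_score >= 80),
    likely_ai_70 = mean(ai_score >= 70)
  )
```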

Figure S1: Each panel displays the temporal trend of "Likely-AI" complaints
defined using various threshold levels. The top panel represents cases classified as Likely-AI
when the score is greater than or equal to 90%. The middle panel corresponds to cases with
a score threshold of greater than or equal to 80%, and the bottom panel includes cases with
scores greater than or equal to 70%. More false positive cases are observed with a lower
threshold.



2 Experimental Studies

As mentioned in the main text, we conducted experiments to test whether the relationship
between LLM usage and message persuasiveness found in the observational studies can
be causal. That is, we test whether LLM usage can cause an increase in persuasiveness
of messages. This section of the Appendix provides more detailed information about these
experiments and a pilot study. We first conducted a pilot study to develop stimuli to use in
Experiments 1 and 2. We then conducted Experiments 1 and 2 to test the causal relationship.

2A Procedure of the Pilot Study

We conducted a pilot study to develop stimuli to use in the main experiments. The entire
procedure of stimuli development is discussed in the next section. Here, we simply provide
the detailed procedure of the pilot study.
We recruited 210 participants (Mage = 40; 49% men, 49% women, and 2% other) from
Prolific. Participants began the study by learning that consumers can report complaints
regarding their financial institutions to the CFPB and that such complaints can later be
accessed by the public. Participants were then told that they would see five different complaints
and that for each complaint, their task was to (1) imagine that they wrote the complaint
to “obtain monetary relief from a financial institution” and (2) indicate how they would
use ChatGPT in editing their complaint “to improve [their] chances of obtaining monetary
compensation.”
Participants then proceeded to review a set of five random complaints from a set of
150 complaints selected from the CFPB dataset (details on selecting these 150 complaints
are provided in the next section). Participants read one complaint at a time, and for
each complaint, they answered the study’s main dependent measure, consisting of 10 items,
each with a 5-point scale of disagreement or agreement: “For this complaint, I would ask
ChatGPT to make the complaint...(1) more polite, (2) more coherent, (3) more fact-based, (4)
more persuasive, (5) more clear, (6) more professional, (7) more assertive, (8) more concise,
(9) less repetitive, and (10) less emotional (1 = Strongly Disagree, 5 = Strongly Agree).”
We selected these 10 linguistic features based on our conversation with ChatGPT (e.g.,
https://chat.openai.com/share/f1c6e563-3b98-4364-9315-f87d62103dda) and the discussion
between the two authors. After completing the 10-item dependent measure for each of the
five complaints, participants answered an open-ended question, “If you were to use ChatGPT
to write (draft or edit) a complaint regarding an issue with your financial institution, what
kind of requests would you make to ChatGPT?” Participants then provided demographic
information (age, gender, race or ethnicity, education, household income, and political
ideology) to complete the study.
Results of the pilot study are presented in the main text (e.g., the summary of mean
ratings in Table 3).

2B Stimuli Development

As mentioned in the main text, the stimuli used in Experiments 1 and 2 were 80 consumer
complaints, 20 of which were the original, unedited complaints from the CFPB data set
(Control complaints), and 60 of which were three different edited versions of these 20
complaints — namely, 20 Clear, 20 Coherent, and 20 Professional complaints. These stimuli were
carefully developed in three stages with the following steps:

Selection of Complaints for a Pilot Study

1. From the CFPB dataset, we identified consumer complaints that the AI detection tool
classified as completely written by humans (i.e., those with AI scores of 0).
2. From this set of Likely-Human complaints, we randomly selected a set of 300 consumer
complaints (see the R code in the project’s OSF page, Click Here).
3. From the above set of 300 complaints, we eliminated 84 complaints that were exact
duplicates of other complaints.8
4. Upon examining the remaining 216 complaints, we observed groups of very similar
complaints. To avoid presenting similar complaints to the experiment participants and
to represent a diverse range of issues in consumer complaints, we sought to select a set
of unique complaints. To do so, we established a uniqueness score for each complaint
based on the Levenshtein distance (Levenshtein et al., 1966), a measure of dissimilarity
between two texts.
5. Specifically, for each of the 216 complaints, we calculated its Levenshtein distance
with every other complaint in the set, recording the smallest of these distances as
the complaint’s uniqueness score. For instance, to determine the first complaint’s
uniqueness score, we calculated the Levenshtein distance between it and each of the
remaining 215 complaints; then, we took the minimum of the 215 Levenshtein distances
and recorded this value as the first complaint's uniqueness score. A high uniqueness
score (i.e., a high minimum Levenshtein distance) means that the given complaint is
highly different even from the complaint most similar to it, and more different still
from all the other complaints. Figure S2 shows a histogram of the uniqueness scores for
all 216 complaints (a minimal code sketch of this computation appears after this list).
6. After examining the uniqueness scores, we discarded complaints with uniqueness scores
of 30 or lower, as they were relatively similar to others. This reduced the set of
complaints to 169.
7. From the set of 169 complaints, we randomly selected 150 complaints (see the R code in
the project’s OSF page, Click Here) to use in a pilot study to determine the linguistic
features to focus on in the main experiments (see the pilot study described previously
and in the main text).

8
Interestingly, we encountered duplicated complaints that may have been written by legal agents
representing their clients (e.g., those that cited precise legal code like "Fair Credit Reporting Act - 15
U.S.C. 1681 section 602"). Investigating the potential influence of legal representatives (i.e., the effect of
"lawyering up" on persuasive messages) could be an interesting direction for future research. More
importantly for our purposes, however, inclusion or exclusion of the duplicated messages (including those
possibly written by legal representatives) did not change our results: reanalyses of the data after excluding
the duplicated messages did not qualitatively change the direction or statistical significance of the results
from our observational studies.
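
As referenced in Step 5 above, the uniqueness-score computation can be sketched in a few lines of R. This is a minimal sketch of our own, assuming the 216 deduplicated complaint narratives are stored in a character vector named complaints (a hypothetical name):

```r
# Minimal sketch (hypothetical object name `complaints`).
dist_matrix <- adist(complaints)   # pairwise (generalized) Levenshtein distances
diag(dist_matrix) <- NA            # ignore each complaint's distance to itself

# Uniqueness score = distance to the most similar other complaint.
uniqueness <- apply(dist_matrix, 1, min, na.rm = TRUE)

# Step 6: keep complaints whose nearest neighbor is more than 30 edits away.
unique_complaints <- complaints[uniqueness > 30]
```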

Pilot Study to Determine the Focal Linguistic Features of the Main Experiment

8. In the pilot study (N = 212; described in the previous section), we asked participants
how they would use ChatGPT to edit their hypothetical complaints to obtain monetary
relief from financial institutions. The results suggested that people would be likely to
use ChatGPT to edit their complaints to be more clear, coherent, and professional. We
thus concluded that clarity, coherence, and professionalism are three linguistic features
that people are relatively more likely to focus on when using an LLM like ChatGPT to
edit persuasive messages. Accordingly, we focused on these three linguistic features in
Experiments 1 and 2.

Final Selection of Complaints for the Main Experiment

9. We proceeded to select consumer complaints for the main experiments by returning to
the set of 150 complaints (see Step 7 above). We sorted the 150 complaints by their length
(i.e., the number of characters in each complaint) and selected the 20 shortest complaints while
excluding complaints that seemed to have been written by legal experts (because we were
more interested in how laypeople, rather than legal experts, would use ChatGPT to edit
complaints; see Footnote 8 of this SI Appendix). We selected the shortest complaints to
minimize the possibility that the main experiments’ participants could be overwhelmed
or disengaged by reading long complaints. The 20 complaints thus selected were used
in the main experiment as the 20 unedited complaints (i.e., Control complaints).
10. We proceeded to create edited versions of the 20 unedited (Control) complaints.
Specifically, we gave each of these 20 complaints to ChatGPT one at a time and asked
ChatGPT to edit each complaint to be (1) more clear, (2) more coherent, and (3) more
professional (with the order of these instructions randomized; see Figure S3 for an
example of one such instruction).



11. This resulted in a final set of 80 complaints: 20 unedited complaints
(which we refer to as Control complaints) and 60 edited complaints (20 Clear, 20
Coherent, and 20 Professional complaints).

Figure S2: This figure shows the counts of consumer complaints for given ranges of uniqueness
scores for the 216 nonduplicated complaints that the AI detection tool determined were
completely written by humans. We randomly selected 150 complaints from the set of complaints
with uniqueness scores larger than 30.



Figure S3: The figure shows the process through which the 60 edited complaints were
generated. Specifically, we took the 20 unedited complaints written by actual consumers
(from the CFPB data), presented each complaint to ChatGPT (one complaint at a time),
and asked it to edit the complaint to be more clear, coherent, and professional as shown in
the figure. The order of the three instructions (i.e., “...edit it to be more clear / coherent /
professional”) was randomized for each of the 20 complaints.



2C Stimuli Verification

We verified our stimuli (the 80 complaints) in Experiment 1 (described in the main text
and in the next section) by asking the experiment’s participants to rate them on the three
intended linguistic features. Specifically, we asked participants about each linguistic feature
(i.e., clarity, coherence, and professionalism) for a given complaint. Figure 5 in the main text
shows the results on these linguistic feature ratings.

2D Detailed Procedure of Experiment 1

We recruited 210 participants (Mage = 44; 49.5% men, 50% women, and 0.5% other) from
Prolific. After thanking participants for their participation, we presented each participant
with 5 distinct complaints randomly chosen from the set of 80 complaints as follows:

1. A complaint from the pool of 20 unedited complaints (1 of 20 Control Complaints)


2. A complaint from the pool of 20 complaints edited with ChatGPT to be more clear (1
of 20 Clear Complaints)
3. A complaint from the pool of 20 complaints edited with ChatGPT to be more coherent
(1 of 20 Coherent Complaints)
4. A complaint from the pool of 20 complaints edited with ChatGPT to be more professional
(1 of 20 Professional Complaints)
5. An additional complaint randomly chosen from the four types of complaints above

We ensured that each of the five randomly chosen complaints featured different complaint
content; that is, no two complaints were identical or originated from the
same unedited complaint. For each complaint, we asked participants to rate the likelihood
of providing monetary compensation regarding each complaint (“If you were handling the
complaint above, how likely would you be to offer monetary compensation to address the
consumer’s complaint? [1 = Not likely at all, 7 = Extremely likely]”). Following this
dependent measure, we also asked participants to rate each complaint on clarity, coherence,
and professionalism in that fixed order with the question, “How clear [coherent / professional]
was the complaint above?” (1 = Not clear [coherent / professional] at all, 7 = Extremely
clear [coherent / professional] ). After answering these four questions for each of the five
complaints presented sequentially, participants made suggestions (if they had any) on how to
improve the study and provided demographic information to complete the study.
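
For concreteness, the assignment scheme can be sketched in R as follows. This is a minimal sketch of our own, not the authors' code; the version labels and content indices are our illustrative names:

```r
# Minimal sketch: 20 complaint contents x 4 versions. Each participant sees one
# complaint of each version plus a fifth complaint of a randomly chosen version,
# all five drawn from distinct complaint contents.
versions <- c("Control", "Clear", "Coherent", "Professional")

assign_complaints <- function() {
  shown_versions <- c(versions, sample(versions, 1))  # four versions + one extra
  shown_contents <- sample(1:20, 5)                   # five distinct contents
  shown <- data.frame(content = shown_contents, version = shown_versions)
  shown[sample(nrow(shown)), ]                        # randomize presentation order
}

set.seed(1)  # illustrative seed only
assign_complaints()
```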

2E Note on the Design of Experiment 1

For Experiment 1, we could have chosen to manipulate linguistic features using a study design
in which each participant would be randomly assigned to receive one complaint out of the 80
(Alternative Design). However, we had a few justifications for opting for our preregistered
study design instead.
First, presenting multiple complaints, as in our design, likely led participants to compare
complaints to one another and thereby form their own standards for what makes a good
complaint, which in turn would have made them better judges. Such a process would be
unlikely to occur if participants saw only one complaint out of the 80 (Alternative Design),
as is typical of between-subjects experiments in which a single stimulus, rather than
multiple stimuli, is presented.
Second, our design more closely matched the real-world setting, in which evaluators of
consumer complaints encounter many complaints of varied types rather than a single
complaint of one specific type. Thus, exposing participants to multiple complaints of
different types made our study more ecologically valid than exposing them to a single
complaint (Alternative Design).
In sum, our study design likely enhanced the robustness and real-world applicability of our
study’s findings.



2F Results of Experiment 1

We analyzed data exactly as specified in the preregistration (https://aspredicted.org/MM3_17T).


Specifically, we fit a linear fixed effects model with the likelihood of offering (hypothetical)
monetary compensation as the dependent variable and whether a given complaint
was edited (to be more clear, coherent, or professional) as the independent variable,
controlling for (1) the fixed effects of each of the 20 complaint contents and (2) the fixed
effects of each of the 210 individual participants (see Model 2 in Table S1).
Consistent with our hypothesis (and as mentioned in the main text), the complaints
edited with ChatGPT were more likely to receive (hypothetical) monetary compensation than
the unedited complaints, b = 0.743, SE = 0.093, t = 7.96, p < .001 (Model 2 in Table S1).
Although not preregistered, we fit another linear fixed effects model, which also controlled for
the presentation order of complaints (Model 3 in Table S1). We find essentially the same results, b =
0.742, SE = 0.093, t = 7.94, p < .001. Lastly, we examined which type of edit ("more clear"
vs. "more coherent" vs. "more professional") resulted in the largest increase in the likelihood
of receiving (hypothetical) monetary compensation by fitting yet another linear fixed effects
model (Model 4 in Table S1), adding the three edit types as three independent dummy
variables (unedited = 0, given edit type = 1) and removing the binary variable indicating
whether a given complaint was edited. We find that the complaints edited to be more
clear showed the directionally greatest increase in likelihood of monetary compensation (b
= 0.810, SE = 0.116), as compared with those edited to be more coherent (b = 0.720, SE
= 0.115), or those edited to be more professional (b = 0.698, SE = 0.116); see Model 4 in
Table S1.
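
For concreteness, these specifications can be expressed in R as follows. This is a minimal sketch of our own, assuming a long-format data frame d with one row per participant-complaint observation; the fixest package and all column names are our illustrative choices, not necessarily the authors':

```r
library(fixest)

# Minimal sketch (hypothetical data frame `d` with columns `compensation`,
# `edited`, `edit_type`, `content`, `participant`, and `position`).

# Model 2: edited vs. unedited, with content and participant fixed effects.
m2 <- feols(compensation ~ edited | content + participant, data = d)

# Model 3: additionally controls for presentation-order fixed effects.
m3 <- feols(compensation ~ edited | content + participant + position, data = d)

# Model 4: separate dummies for each edit type (unedited Control as baseline).
m4 <- feols(compensation ~ i(edit_type, ref = "Control") |
              content + participant + position, data = d)
```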



Table S1: Increase in Compensation Likelihood for Edited Complaints Relative to Unedited
Complaints

Dependent variable: Compensation likelihood

                                                          Model 1    Model 2    Model 3    Model 4
Constant                                                  2.779***
                                                          (0.113)
Unedited vs. Edited (dummy variable)                      0.737***   0.743***   0.742***
                                                          (0.132)    (0.093)    (0.093)
Unedited vs. More clear (dummy variable)                                                   0.810***
                                                                                           (0.116)
Unedited vs. More coherent (dummy variable)                                                0.720***
                                                                                           (0.115)
Unedited vs. More professional (dummy variable)                                            0.698***
                                                                                           (0.116)
Complaint content (20 different contents) fixed effects   N          Y          Y          Y
Participant fixed effects                                 N          Y          Y          Y
Complaint presentation position (1-5) fixed effects       N          N          Y          Y
Observations                                              1,050      1,050      1,050      1,050

Note. *** p < .001

2G Results of Experiment 2

Table S2: Tests of Linguistic Feature Alignment

Dependent variable: Compensation likelihood

                                                          Model 1                          Model 2
Unedited vs. More professional (dummy variable)           0.90***
                                                          (0.16)
Unedited vs. More clear (dummy variable)                                                   0.92***
                                                                                           (0.15)
Unedited vs. More coherent (dummy variable)               1.08***                          1.17***
                                                          (0.14)                           (0.14)
Condition included in model                               Focus on Clarity                 Focus on Professionalism
Complaints included in model                              Professional, Coherent, Control  Clear, Coherent, Control
Complaint content (20 different contents) fixed effects   Y                                Y
Participant fixed effects                                 Y                                Y
Complaint presentation position (1-5) fixed effects       Y                                Y
Observations                                              600                              600

Note. *** p < .001



References

Ali, S. R., T. D. Dobbs, H. A. Hutchings, and I. S. Whitaker (2023): "Using ChatGPT to write patient clinic letters," The Lancet Digital Health, 5(4), e179–e181.

Bumcrot, C. B., J. Lin, and A. Lusardi (2011): "The geography of financial literacy," RAND Working Paper Series No. WR-893-SSA.

Chaiken, S. (1980): "Heuristic versus systematic information processing and the use of source versus message cues in persuasion," Journal of Personality and Social Psychology, 39(5), 752.

Chen, J., W. Fan, J. Wei, and Z. Liu (2022): "Effects of linguistic style on persuasiveness of word-of-mouth messages with anonymous vs. identifiable sources," Marketing Letters, 33(4), 593–605.

Danescu-Niculescu-Mizil, C., M. Sudhof, D. Jurafsky, J. Leskovec, and C. Potts (2013): "A computational approach to politeness with application to social factors," in Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:12383721.

Dewatripont, M., and J. Tirole (2005): "Modes of communication," Journal of Political Economy, 113(6), 1217–1238.

Dou, Y., M. Hung, G. She, and L. L. Wang (2023): "Learning from peers: Evidence from disclosure of consumer complaints," Journal of Accounting and Economics, p. 101620.

Dou, Y., and Y. Roh (2019): "Public disclosure and consumer financial protection," Journal of Financial and Quantitative Analysis, pp. 1–56.

Eloundou, T., S. Manning, P. Mishkin, and D. Rock (2023): "GPTs are GPTs: An early look at the labor market impact potential of large language models," arXiv preprint arXiv:2303.10130.

Giles, H. (1979): "Accommodation theory: Optimal levels of convergence," Language and Social Psychology, pp. 45–65.

Gillham, J. (2023): "AI Content Detector Accuracy Review + Open Source Dataset and Research Tool," https://originality.ai/blog/ai-content-detection-accuracy, Accessed: 2023-11-26.

Guo, B., X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu (2023): "How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection," arXiv preprint arXiv:2301.07597.

Herbold, S., A. Hautli-Janisz, U. Heuer, Z. Kikteva, and A. Trautsch (2023): "A large-scale comparison of human-written versus ChatGPT-generated essays," Scientific Reports, 13(1), 18617.

Huang, A. H., H. Wang, and Y. Yang (2023): "FinBERT: A large language model for extracting information from financial text," Contemporary Accounting Research, 40(2), 806–841.

Jwalapuram, P., S. Joty, and X. Lin (2022): "Rethinking Self-Supervision Objectives for Generalizable Coherence Modeling," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), pp. 6044–6059.

Kiesler, C. A., and S. B. Kiesler (1964): "Role of forewarning in persuasive communications," The Journal of Abnormal and Social Psychology, 68(5), 547.

Kincaid, J., R. J. Fishburne, R. Rogers, and B. Chissom (1975): "Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel," Tech. Report 8-75, Naval Technical Training, U.S. Naval Air Station.

Korinek, A. (2023): "Generative AI for economic research: Use cases and implications for economists," Journal of Economic Literature, 61(4), 1281–1317.

Larsen, P., and D. Proserpio (2023): "The Impact of Large Language Models on Search Advertising: Evidence from Google's BERT," USC Marshall School of Business Research Paper Sponsored by iORB.

Levenshtein, V. I., et al. (1966): "Binary codes capable of correcting deletions, insertions, and reversals," in Soviet Physics Doklady, vol. 10, pp. 707–710. Soviet Union.

Lingard, L. (2023): "Writing with ChatGPT: An illustration of its capacity, limitations & implications for academic writers," Perspectives on Medical Education, 12(1), 261.

Liu, A. X., Y. Xie, and J. Zhang (2019): "It's not just what you say, but how you say it: The effect of language style matching on perceived quality of consumer reviews," Journal of Interactive Marketing, 46(1), 70–86.

Liu, R., H. Yen, R. Marjieh, T. L. Griffiths, and R. Krishna (2023): "Improving Interpersonal Communication by Simulating Audiences with Language Models," arXiv preprint arXiv:2311.00687.

Lusardi, A., and O. S. Mitchell (2014): "The economic importance of financial literacy: Theory and evidence," Journal of Economic Literature, 52(1), 5–44.

Lythreatis, S., S. K. Singh, and A.-N. El-Kassar (2022): "The digital divide: A review and future research agenda," Technological Forecasting and Social Change, 175, 121359.

Matz, S., J. Teeny, S. S. Vaid, G. M. Harari, and M. Cerf (2023): "The Potential of Generative AI for Personalized Persuasion at Scale," PsyArXiv.

Noy, S., and W. Zhang (2023): "Experimental evidence on the productivity effects of generative artificial intelligence," Science, 381, 187–192.

Packard, G., and J. Berger (2021): "How concrete language shapes customer satisfaction," Journal of Consumer Research, 47(5), 787–806.

Packard, G., J. Berger, and R. Boghrati (2023): "How Verb Tense Shapes Persuasion," Journal of Consumer Research, p. ucad006.

Park, E., and R. Gelles-Watnick (2023): "Most Americans haven't used ChatGPT; few think it will have a major impact on their job," Pew Research Center, https://pewrsr.ch/3SCeWX9, Accessed: 2023-11-15.

Prillaman, M. (2023): "'ChatGPT detector' catches AI-generated papers with unprecedented accuracy," Nature.

Reisenbichler, M., T. Reutterer, D. A. Schweidel, and D. Dan (2022): "Frontiers: Supporting content marketing with natural language generation," Marketing Science, 41(3), 441–452.

Saito, K., A. Wachi, K. Wataoka, and Y. Akimoto (2023): "Verbosity bias in preference labeling by large language models," arXiv preprint arXiv:2310.10076.

Shahriar, S., and K. Hayawi (2023): "Let's have a chat! A Conversation with ChatGPT: Technology, Applications, and Limitations," arXiv preprint arXiv:2302.13817.

Shin, J., and C.-Y. Wang (2024): "The Role of Messenger in Advertising Content: Bayesian Persuasion Perspective," Marketing Science.

Shin, M., J. Kim, B. van Opheusden, and T. L. Griffiths (2023): "Superhuman artificial intelligence can improve human decision-making by increasing novelty," Proceedings of the National Academy of Sciences, 120(12), e2214840120.

Simon, N., and C. Muise (2022): "TattleTale: Storytelling with Planning and Large Language Models," in ICAPS Workshop on Scheduling and Planning Applications.

Tang, R., Y.-N. Chuang, and X. Hu (2023): "The science of detecting LLM-generated texts," arXiv preprint arXiv:2303.07205.

Toubia, O., J. Berger, and J. Eliashberg (2021): "How quantifying the shape of stories predicts their success," Proceedings of the National Academy of Sciences, 118(26), e2011695118.

Weber-Wulff, D., A. Anohina-Naumeca, S. Bjelobaba, T. Foltýnek, J. Guerrero-Dib, O. Popoola, P. Šigut, and L. Waddington (2023): "Testing of detection tools for AI-generated text," arXiv preprint arXiv:2306.15666.

Winston AI (2023): "Best AI Detectors in 2023 Compared," https://bit.ly/3SMYXWq, Accessed: 2023-11-18.

Wu, Y., A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy (2022): "Autoformalization with large language models," Advances in Neural Information Processing Systems, 35, 32353–32368.