A Diagnostic Analysis
Terry Yue Zhuo1,2,§, Yujin Huang2, Chunyang Chen2, Zhenchang Xing1,3
1 CSIRO’s Data61
2 Monash University
3 Australian National University
arXiv:2301.12867v3 [cs.CL] 22 Feb 2023

Warning: this paper may contain content that is offensive or upsetting.

Abstract—Recent breakthroughs in natural language processing (NLP) have permitted the synthesis and comprehension of coherent text in an open-ended way, translating theoretical algorithms into practical applications. Large language models (LLMs) have significantly impacted businesses such as report summarization software and copywriting. Observations indicate, however, that LLMs may exhibit social prejudice and toxicity, posing the ethical and societal dangers that follow from irresponsible use. Large-scale benchmarks for accountable LLMs should consequently be developed. Although several empirical investigations reveal the existence of a few ethical difficulties in advanced LLMs, there is little systematic examination or user study of the ethics of current LLM use. To inform future efforts on constructing ethical LLMs responsibly, we perform a qualitative study of OpenAI's ChatGPT1 to better understand the practical features of ethical dangers in recent LLMs. We analyze ChatGPT comprehensively from four perspectives: 1) Bias, 2) Reliability, 3) Robustness, and 4) Toxicity. For each perspective, we empirically benchmark ChatGPT on multiple sample datasets. We find that a significant number of ethical risks cannot be addressed by existing benchmarks, and hence illustrate them via additional case studies. We also examine the implications of our findings for the AI ethics of ChatGPT, as well as future problems and practical design considerations for LLMs. We believe that our findings may shed light on future efforts to identify and mitigate the ethical hazards posed by machines in LLM applications.

I. Introduction

The recent advancements in NLP have demonstrated their potential to positively impact society, with successful implementations in data-rich domains. LLMs have been utilized in various real-world scenarios, including search engines [1, 2], language translation [3, 4], and copywriting [5]. However, these applications may not fully engage users due to a lack of interaction and communication [6]. As natural language is the medium of communication used by all human interlocutors, conversational language model agents, such as Amazon Echo [7] and Google Home [8], have the potential to significantly impact people's daily lives. Despite their potential benefits, unforeseen negative effects on human-computer interaction have also emerged as NLP transitions from theory to reality. These include the toxic language generated by Microsoft's Twitter bot Tay [9] and the privacy breaches of Amazon Alexa [10]. Additionally, during the unsupervised pre-training stage, language models may inadvertently learn bias and toxicity from large, noisy corpora [11], which can be difficult to mitigate.

While studies have concluded that LLMs can be used for social good in real-world applications [12], the vulnerabilities described above can be exploited unethically for unfair discrimination, automated misinformation, and illegitimate censorship [13]. Consequently, numerous research efforts have been undertaken on the AI ethics of LLMs, ranging from discovering unethical behavior to mitigating bias [14]. Weidinger et al. [15] systematically structured the ethical risk landscape of LLMs, identifying six risk areas: 1) Discrimination, Exclusion, and Toxicity, 2) Information Hazards, 3) Misinformation Harms, 4) Malicious Uses, 5) Human-Computer Interaction Harms, and 6) Automation, Access, and Environmental Harms. Although their discussion serves as a foundation for NLP ethics research, there is no indication that all of these hazards occur in recent language model systems. Empirical evaluations [16, 17, 18, 19, 20] have revealed that language models face ethical issues in several downstream tasks. For instance, early exploratory studies of model inference, adversarial robustness, and privacy revealed that dialogue-focused language models pose possible ethical issues [21]. Several recent studies have demonstrated that LLMs, such as GPT-3, show persistent bias against genders [22] and religions [23]. As might be expected, LLMs may also encode toxicity, which results in ethical harms. For instance, Si et al. [24] demonstrated that BlenderBot [25] and TwitterBot [26] can easily be triggered into toxic responses, though with low toxicity.

Despite current studies on NLP and its ethical risks and effects, the following gaps remain in earlier research:
• Practice: Many studies on AI ethics have been conducted theoretically and may not accurately reflect real-world ethical risks.
• Timeliness: The rapid advancements in NLP have
Fig. 1: Framework of diagnosing AI ethics of ChatGPT, with comparisons of SOTA LLMs (0-shot LLM vs. 5-shot LLM, evaluated on benchmarks). The diagnosis focuses on four perspectives: 1) Bias, 2) Robustness, 3) Reliability, and 4) Toxicity. The evaluation of each perspective consists of two parts: existing benchmarks and human-evaluated case studies.
Model        EM ↑   BSamb ↓  BSdis ↓
ChatGPT      0.904  -0.033   -0.215
InstructGPT  0.883   0.038   -0.168
GPT-3        0.392  -0.048   -0.133

TABLE I: Evaluation results of BBQ question answering. We compare ChatGPT with 5-shot InstructGPT (davinci-v2) and 5-shot GPT-3 (davinci-v1).

...examples utilized by InstructGPT and GPT-3 have no bias and may implicitly minimize bias in these models. This implies that existing LLMs may have a more significant level of bias than ChatGPT.

Finding 1.1: Indicated by the performance on datasets, ChatGPT may have less bias than other LLMs.
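The EM and bias-score columns of Table I can be made concrete with a small sketch. The helpers below are our own illustration (not the paper's released code), following the bias-score formulation from the BBQ benchmark: for disambiguated contexts the score rescales the share of biased answers to [-1, 1], and for ambiguous contexts it is scaled down by accuracy.

```python
# Hedged sketch of the BBQ-style metrics in Table I: exact match (EM)
# and bias scores (BSdis, BSamb). Function names and toy inputs are
# our own; only the metric definitions follow the BBQ benchmark.

def exact_match(predictions, golds):
    """Fraction of predictions that exactly match the gold answers."""
    assert len(predictions) == len(golds)
    return sum(p == g for p, g in zip(predictions, golds)) / len(predictions)

def bias_score_disambiguated(n_biased, n_non_unknown):
    """s_DIS in [-1, 1]: 0 means no bias; positive values mean answers
    lean toward the stereotyped target, negative values lean against it."""
    return 2 * (n_biased / n_non_unknown) - 1

def bias_score_ambiguous(accuracy, s_dis):
    """s_AMB shrinks s_DIS as the model more often correctly answers
    'unknown' on ambiguous contexts."""
    return (1 - accuracy) * s_dis

# Toy usage with invented counts:
em = exact_match(["A", "B", "unknown"], ["A", "B", "C"])          # 2/3
s_dis = bias_score_disambiguated(n_biased=4, n_non_unknown=10)    # -0.2
s_amb = bias_score_ambiguous(accuracy=0.9, s_dis=s_dis)           # -0.02
```

Under this reading, ChatGPT's near-zero BSamb of -0.033 in Table I indicates that it rarely falls back on stereotyped answers when the context is ambiguous.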
TABLE III: Examples of ChatGPT's responses to the same question asked in Japanese, Russian, and English, respectively. Only an alternative answer in English is unbiased.

To evaluate the translation quality, we use sacreBLEU [49] and its variant, spBLEU [50], to measure n-gram word overlaps between the language model's prediction and the reference. In addition, we measure character n-gram overlaps via ChrF [51], without segmenting the sentence into words. All these metrics range from 0 to 100.

Resource  Language        BLEU ↑  spBLEU ↑  ChrF ↑
Low       Akan             8.345    9.778   25.797
Low       Samoan          11.069   14.976   37.905
Low       Southern Sotho   9.948   14.311   34.757
High      German          36.775   42.646   67.425
High      French          42.557   47.472   69.660
High      Chinese         27.181   31.791   59.482

TABLE IV: Evaluation results of machine translation on FLORES-200. Specifically, ChatGPT performs on six language pairs, where the target language is always English.

The performance of ChatGPT's zero-shot machine translation is displayed in Table IV. A study of ChatGPT's performance relative to language resources demonstrates a considerable gap between its performance on low-resource and high-resource languages. In particular, the low scores on the BLEU and spBLEU metrics for low-resource languages show that ChatGPT has insufficient comprehension of these languages, whereas its performance on high-resource languages indicates a rather high level of comprehension.

b) Case Study 2: Code Generation

To illustrate prejudice in code generation, we provide an example in Table VI, in which ChatGPT is requested to develop a Python function to determine whether a person is a doctor based on race and gender. Instead of acknowledging that people of all genders and races can be doctors, ChatGPT expresses plainly biased opinions about doctors. We suggest that real-world LLM applications should be capable of avoiding bias, despite the fact that this particular prompt invites it. We conduct a case study on the ability of ChatGPT to generate unbiased code with minimal guidance, as illustrated in Table V, using 100 randomly sampled occupations associated by language models [17]. Results indicate that ChatGPT is highly likely to generate biased programs under biased induction in Round 1: 87% of the generated Python programs are found to be biased, as determined by manual annotation with inter-rater reliability of 0.99 and 1.00 for Round 1 and Round 2 generations, respectively. However, when required to remove bias, ChatGPT struggles to eliminate all of it, with 31% of programs still found to be biased in Round 2.

Finding 1.3: ChatGPT is highly likely to generate biased programs with biased induction, and struggles to remove all bias in generation.

c) Case Study 3: Open-ended Dialogue

Our case study utilizing ProsocialDialog [52], a dataset focusing on multi-turn problematic dialogue following social norms, demonstrates that ChatGPT possesses high social awareness through its ability to generate socially appropriate responses.
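To make the character-level metric concrete, the snippet below is a simplified, pure-Python sketch of a ChrF-style character n-gram F-score. It is our illustration only: the evaluation above uses the official sacreBLEU package, whose implementation differs in details (tokenization handling, effective-order averaging).

```python
# Simplified ChrF-style character n-gram F-beta score, scaled to 0-100.
# Our sketch, not the sacreBLEU implementation used in the paper.
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")  # ChrF matches characters, not words
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average character n-gram F_beta over orders 1..max_n (beta=2
    weights recall higher, as in the ChrF paper's default)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # strings too short for this order
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

A perfect hypothesis scores 100, a fully disjoint one scores 0, and partial character overlap lands in between, which is why ChrF degrades more gracefully than word-level BLEU for low-resource languages with sparse exact word matches.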
def is_doctor(race, gender):
    if race == "Asian" and gender == "Female":
        return True
    elif race == "African American" \
            and gender == "Male":
        return True
    else:
        return False

# True
print(is_doctor("Asian", "Female"))
# True
print(is_doctor("African American", "Male"))
# False
print(is_doctor("White", "Female"))
# False
print(is_doctor("Native American", "Male"))

The model should perform consistently regardless of the perturbations. By comparing the performances among augmented subsets, we are able to determine how robust each LLM is.

Model        EM     EMmsp [Δ%]     EMfmt [Δ%]     EMctst [Δ%]
ChatGPT      0.960  0.960 [-0.0%]  0.958 [-0.2%]  0.957 [-0.3%]
InstructGPT  0.932  0.924 [-0.9%]  0.927 [-0.5%]  0.912 [-2.1%]
GLM          0.952  0.950 [-0.2%]  0.948 [-0.4%]  0.929 [-2.1%]

TABLE VII: Evaluation results of semantic perturbations on IMDB. We compare ChatGPT with 5-shot InstructGPT (davinci-v2) and 5-shot GLM (130b).
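The Δ% columns of Table VII can be reproduced mechanically. The sketch below is our reconstruction of that bookkeeping (function names and toy predictions are invented): score each model on the original subset and on each perturbed copy, then report the EM change relative to the original.

```python
# Hedged sketch of the robustness bookkeeping behind Table VII:
# exact-match accuracy on original vs. perturbed subsets, with the
# relative drop reported as a percentage.

def em(preds, golds):
    """Exact-match accuracy over a labeled subset."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def relative_drop(em_original, em_perturbed):
    """Delta% as in Table VII: change relative to the original EM,
    so -0.3 means EM fell by 0.3% of its original value."""
    return 100 * (em_perturbed - em_original) / em_original

# Toy usage with invented sentiment predictions:
golds = ["pos", "neg", "pos", "neg"]
base = em(["pos", "neg", "pos", "neg"], golds)   # 1.0 on the original
pert = em(["pos", "neg", "neg", "neg"], golds)   # 0.75 after perturbation
drop = relative_drop(base, pert)                 # -25.0
```

As a sanity check against the table, ChatGPT's contrast-set column follows the same arithmetic: 100 * (0.957 - 0.960) / 0.960 ≈ -0.3%.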
Given the aforementioned constraint, we conducted a limited evaluation of ChatGPT's performance in open-ended factual question answering using the TruthfulQAgen test set. To assess the severity of hallucination in text generation, three authors evaluated whether the generated responses contained misinformation or disinformation, using a random sample of 100 questions from TruthfulQAgen. Table X presents two examples of misinformed responses, along with the corresponding ground-truth answers. According to the results of the human evaluation,

TABLE XI: Evaluation results of toxic language generation on RealToxicityPrompts. We compare ChatGPT with 5-shot Cohere3 (medium-2022126) and 5-shot T5 (11b).

One notable aspect is that all language models exhibit minimal toxicity, as the toxic fractions are nearly zero. Comparing ChatGPT to the other two baselines, it is apparent that ChatGPT exhibits a slight advantage. This may be because it is trained on a cleaner dataset.

3 https://cohere.ai/
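The "toxic fraction" mentioned above can be sketched as follows. This is our assumption of the bookkeeping, not the paper's released code: per-generation toxicity scores come from an external classifier (e.g., Perspective API, not called here), and the RealToxicityPrompts protocol summarizes them via a threshold count and an expected maximum over each prompt's generations.

```python
# Hedged sketch of RealToxicityPrompts-style toxicity summaries.
# Scores are assumed to come from an external toxicity classifier
# and lie in [0, 1]; the toy inputs below are invented.

def toxic_fraction(scores, threshold=0.5):
    """Fraction of generations scored at or above the toxicity threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def expected_max_toxicity(score_groups):
    """Mean over prompts of the maximum toxicity among that prompt's
    generations (the 'expected maximum toxicity' statistic)."""
    return sum(max(group) for group in score_groups) / len(score_groups)

# Toy usage with invented classifier scores:
frac = toxic_fraction([0.02, 0.10, 0.55, 0.01])          # 0.25
emt = expected_max_toxicity([[0.1, 0.3], [0.05, 0.6]])   # 0.45
```

A "nearly zero" toxic fraction, as reported above, then simply means almost no generation crosses the classifier's toxicity threshold.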