intent and provide reasonable responses has made them extremely popular lately. In this paper, we focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks. Specifically, we present a systematic analysis by measuring ChatGPT's performance, explainability, calibration, and faithfulness, resulting in 15 keys from either ChatGPT or domain experts. Our findings reveal that ChatGPT's performance in the Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting, as evidenced by human evaluation. In addition, our research indicates that ChatGPT provides high-quality and trustworthy explanations for its decisions. However, ChatGPT is overconfident in its predictions, which results in low calibration. Furthermore, ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. We manually annotate and release the test sets of the 7 fine-grained IE tasks, covering 14 datasets, to further promote the research. The datasets and code are available at this url.1

1 https://github.com/pkuserc/ChatGPT_for_IE
2 https://chat.openai.com/

1 Introduction

Large Language Models (LLMs) (e.g., GPT-3 (Brown et al., 2020), LaMDA (Thoppilan et al., 2022) and PaLM (Chowdhery et al., 2022), etc.) have greatly promoted the development of the Natural Language Processing (NLP) community. With a proper instruction (often the task definition) (Ouyang et al., 2022; Kojima et al., 2022; Chung et al., 2022; Wang et al., 2022) and chain-of-thought (CoT) prompting (Wei et al., 2022b), LLMs achieve surprisingly good performance when dealing with unseen tasks.

ChatGPT2 is trained on the GPT family (Brown et al., 2020; Artetxe et al., 2022; Ouyang et al., 2022) using reinforcement learning from human feedback (RLHF) (Christiano et al., 2017) and high-quality conversational-style datasets. Apart from its surprising dialogue ability, ChatGPT has many other aspects that attract researchers to explore. Some researchers have delved into the potential impacts of ChatGPT on human life (Haque et al., 2022; Zhuo et al., 2023; Susnjak, 2022; Basic et al., 2023). Other researchers are interested in exploring the capabilities of ChatGPT for various NLP tasks (Zhang et al., 2022a; Qin et al., 2023; Mitrovic et al., 2023; Guo et al., 2023). Through the above research, the capabilities of ChatGPT have been preliminarily explored and valuable conclusions have been drawn.

Given that ChatGPT is a closed model that does not provide information about its training details, any response from the model encodes an opinion. The response can significantly impact the user's experience and shape their beliefs going forward (Aiyappa et al., 2023; Santurkar et al., 2023; Deshpande et al., 2023; Huang et al., 2023). Consequently, evaluating ChatGPT should involve not only assessing its ability to achieve high performance but also measuring the reliability of the answers it provides. To help users better understand the overall quality of ChatGPT's responses and enable systematic measurement of its capabilities, we design the following four metric dimensions. The first dimension we consider is Performance, which reflects ChatGPT's overall performance on various IE tasks from multiple perspectives. The second metric dimension, Explainability (Rajani et al., 2019; Aghajanyan et al., 2021; Zini and Awad, 2023), evaluates whether ChatGPT can give a justified reason for its prediction, thereby providing insights into ChatGPT's decision-making process. The third one is Calibration (Guo et al., 2017; Kumar et al., 2019; Thulasidasan et al., 2019; Minderer et al., 2021), which measures the predictive uncertainty of a model; we use this metric to assess whether ChatGPT is overconfident in its predictions. The last dimension is Faithfulness (Maynez et al., 2020; Koto et al., 2022; Creswell and Shanahan, 2022; He et al., 2023), which is frequently employed in the summarization task to determine whether the summary accurately reflects the input. In our research, we adopt faithfulness as a measure of whether the explanations given by ChatGPT are truthful to the input, or whether they are spurious. In summary, according to the above four dimensions, we collect 15 keys from either ChatGPT or domain experts for the evaluation (§ 3).

In this research, we aim to perform a comprehensive study and detailed analysis of ChatGPT's capabilities through various information extraction (IE) tasks. IE involves heterogeneous structure extraction, factual knowledge usage, and diversified targets (Yamada et al., 2020; Paolini et al., 2021; Lu et al., 2022), making it an ideal scenario for evaluating ChatGPT's capabilities. Overall, we conduct our experiments and analysis based on 14 datasets belonging to 7 fine-grained IE tasks (§ 4). Additionally, we assess the explainability, calibration, and faithfulness of ChatGPT's responses through both self-check and human-check (§ 5). To sum up, our main contributions are summarized as follows:

• To assess the overall ability of ChatGPT, we employ a comprehensive and systematic evaluation along four dimensions: 1) performance, 2) explainability, 3) calibration, and 4) faithfulness. We then collected 15 keys belonging to the above dimensions from either ChatGPT or domain experts for the research. All the manually annotated datasets and code are made publicly available for future research.

• We comprehensively evaluate the overall performance of ChatGPT on various tasks in both the Standard-IE and OpenIE settings and compare it with other popular models. Our research indicates that ChatGPT's performance is not satisfactory in the Standard-IE setting. However, we show that it provides surprisingly good results in the OpenIE setting, as confirmed by human evaluation. Furthermore, we also discover that ChatGPT provides high-quality and trustworthy explanations for its decisions. However, it displays overconfidence in its predictions, leading to low calibration. Besides, ChatGPT is largely faithful to the original text in most cases.

2 Related Work

2.1 Large Language Models

Large Language Models (LLMs) typically contain more than a hundred billion parameters; examples include GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), LaMDA (Thoppilan et al., 2022), Megatron-Turing NLG (Smith et al., 2022), and PaLM (Chowdhery et al., 2022), among others. Scaling up the model size brings impressive abilities in few-shot and zero-shot learning scenarios, such as producing reasonable results with very few samples or task descriptions (Brown et al., 2020; Chowdhery et al., 2022). Moreover, scaling up the model size unlocks emergent abilities that were not observed in smaller models, enabling LLMs to exhibit strong generalizability on unseen tasks (Wei et al., 2022a; Fu et al., 2022; Mahowald et al., 2023).

With its impressive ability to understand user intent and generate human-like responses, ChatGPT has become the most popular language model currently. It is trained on the GPT family (Brown et al., 2020; Artetxe et al., 2022; Ouyang et al., 2022) and high-quality conversational-style datasets using reinforcement learning from human feedback (RLHF) (Christiano et al., 2017).

Along with its dialogue ability, ChatGPT has other aspects that researchers are exploring. Some researchers study the potential impacts of ChatGPT on human life, such as ethical risks (Haque et al., 2022; Zhuo et al., 2023; Krügel et al., 2023), the education sector (Susnjak, 2022; Basic et al., 2023; Kortemeyer, 2023) and the medical scenario (Tu et al., 2023; Nov et al., 2023; Jeblick et al., 2022). Additionally, some researchers are keen to examine the potential of ChatGPT in addressing various natural language processing tasks. For instance, some works have examined ChatGPT's performance in stance detection (Zhang et al., 2022a), linguistic and sentiment analysis (Susnjak, 2023; Ortega-Martín et al., 2023), general NLP tasks (Qin et al., 2023; Bian et al., 2023; Zhong et al., 2023; Wang et al., 2023a,b), and machine translation (Jiao et al.,
2023). Frieder et al. (2023) explore the mathematical capabilities of ChatGPT, while Bang et al. (2023) propose an evaluation of ChatGPT on reasoning and other aspects. Additionally, Mitrovic et al. (2023) and Guo et al. (2023) investigate the differences between human-written and ChatGPT-generated texts.

2.2 Information Extraction

Information Extraction (IE) is a long-standing research topic that aims to extract structured factual information from unstructured texts (Andersen et al., 1992; Crowe, 1995; Chieu et al., 2003; Wu and Weld, 2010; Khot et al., 2017; Lu et al., 2022). Typically, IE involves a wide range of tasks, such as named entity recognition (NER) (Gregoric et al., 2018; Martins et al., 2019; Li et al., 2020; Das et al., 2022), entity typing (ET) (Choi et al., 2018; Dai et al., 2021; Pang et al., 2022; Chen et al., 2022), relation extraction (RE) (Li et al., 2019; Fu et al., 2019; Bian et al., 2021; Ye et al., 2022), relation classification (RC) (Zeng et al., 2015; Ye et al., 2019; Zhou and Chen, 2021; Li et al., 2022b), event detection (ED) (Veyseh et al., 2021; Lou et al., 2021; Liu et al., 2022a; Zhao et al., 2022), event argument extraction (EAE) (Zhang et al., 2022b; Du and Ji, 2022; Ma et al., 2022), and event extraction (EE) (Wadden et al., 2019; Du and Cardie, 2020; Liu et al., 2022b; Hsu et al., 2022), among others. These tasks automatically generate structured factual outputs related to entities, relations, and events, and have greatly boosted the development of the NLP community.

3 ChatGPT for Information Extraction

In this section, we first briefly introduce the 7 fine-grained IE tasks; we then present how we collect 15 keys from ChatGPT and domain experts.

3.1 Information Extraction

IE involves a wide range of tasks that extract structured factual information, such as entities, relations, and events, from unstructured texts. In this research, we conduct our analysis on the following 7 fine-grained IE tasks:3 1) Entity Typing (ET) (Choi et al., 2018; Dai et al., 2021; Pang et al., 2022; Chen et al., 2022) aims to classify the type of a target entity under a given input; 2) Named Entity Recognition (NER) (Gregoric et al., 2018; Martins et al., 2019; Li et al., 2020; Das et al., 2022) aims to first identify the candidate entities and then classify their types; 3) Relation Classification (RC) (Zeng et al., 2015; Ye et al., 2019; Zhou and Chen, 2021; Li et al., 2022b) requires classifying the relation between two target entities; 4) Relation Extraction (RE) (Li et al., 2019; Fu et al., 2019; Bian et al., 2021; Ye et al., 2022) is a task to identify the target entities and the relation jointly; 5) Event Detection (ED) (Veyseh et al., 2021; Lou et al., 2021; Liu et al., 2022a; Zhao et al., 2022) identifies event triggers and their types; 6) Event Argument Extraction (EAE) (Zhang et al., 2022b; Du and Ji, 2022; Ma et al., 2022) distinguishes arguments and categorizes their roles with respect to the target event; and 7) Event Extraction (EE) (Wadden et al., 2019; Du and Cardie, 2020; Liu et al., 2022b; Hsu et al., 2022) performs event detection and argument extraction jointly. Note that although some of these tasks are subsets of others, every task requires unique abilities from LLMs to perform well, and it is worthwhile to explore the performance on these fine-grained IE tasks.

3.2 Standard-IE Setting and OpenIE Setting

To comprehensively evaluate the overall performance of ChatGPT on IE tasks, we ask ChatGPT to generate responses in both the Standard-IE setting and the OpenIE setting. The Standard-IE setting is commonly used in previous works, which use a task-specific dataset with a supervised learning paradigm to fine-tune a model. For ChatGPT, as we cannot directly fine-tune the parameters, we instead evaluate ChatGPT's ability to select the most appropriate answer from a set of candidate labels. Specifically, this setting is based on an instruction that includes the task description, the input text, the prompt, and the label set, where the task description describes the specific IE task, the prompt involves the utterances that guide ChatGPT to output the required keys (which will be introduced in § 3.3), and the label set contains all candidate labels for each dataset. The OpenIE setting is a more advanced and challenging scenario than the Standard-IE setting. In this setting, we do not provide any candidate labels to ChatGPT and rely solely on its ability to comprehend the task description, the prompt, and the input text to generate predictions. Our goal is to assess ChatGPT's ability to produce reasonable factual knowledge.

3 We introduce these tasks briefly due to the space limitation. Please refer to the task-specific papers for more details.
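The instruction construction described above can be sketched as follows. This is an illustrative assumption about the prompt layout (the template wording and entity markers are ours), not the paper's exact prompts: the only structural difference between the two settings is whether the candidate label set is included.

```python
# Illustrative sketch of the Standard-IE vs. OpenIE instruction format.
# The template wording is an assumption; the paper's exact prompts may differ.

def build_prompt(task_description: str, text: str, label_set=None) -> str:
    """Compose an instruction from the task description, the input text,
    a guiding prompt, and (Standard-IE only) the candidate label set."""
    parts = [
        f"Task: {task_description}",
        f"Input: {text}",
    ]
    if label_set is not None:
        # Standard-IE setting: ChatGPT selects from the candidate labels.
        parts.append(f"Label set: {', '.join(sorted(label_set))}")
        parts.append("Choose the most appropriate label and explain your choice.")
    else:
        # OpenIE setting: no candidate labels are provided.
        parts.append("Predict the label directly and explain your choice.")
    return "\n".join(parts)

standard = build_prompt(
    "Classify the relation between the two marked entities.",
    "[E1] Bill Gates [/E1] founded [E2] Microsoft [/E2].",
    label_set={"founder_of", "employee_of", "no_relation"},
)
open_ie = build_prompt(
    "Classify the relation between the two marked entities.",
    "[E1] Bill Gates [/E1] founded [E2] Microsoft [/E2].",
)
```

In the OpenIE variant, the model must produce a label purely from its own factual knowledge, which is why the paper treats it as the more challenging scenario.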
Keys and Explanations

Performance
• Open: Directly ask ChatGPT to predict the class without the label set.
• Standard: ChatGPT's most likely correct class given a label set.
• Top3: The three most likely classes of the given label set from ChatGPT.
• Top5: The five most likely classes of the given label set from ChatGPT.
• ifOpen_Correct(Manual): Manually annotate whether the "Open" prediction is reasonable.

Explainability
• Reason_Open: The reason why ChatGPT chooses the class in "Open".
• Reason_Standard: The reason why ChatGPT chooses the class in "Standard".
• ifR_Open: Does ChatGPT think that "Reason_Open" is reasonable?
• ifR_Standard: Does ChatGPT think that "Reason_Standard" is reasonable?
• ifR_Open(Manual): Manually annotate whether "Reason_Open" is reasonable.
• ifR_Standard(Manual): Manually annotate whether "Reason_Standard" is reasonable.

Calibration
• Confidence_Open: The confidence of ChatGPT in predicting "Open".
• Confidence_Standard: The confidence of ChatGPT in predicting "Standard".

Faithfulness
• FicR_Open(Manual): Manually annotate whether "Reason_Open" is fictitious.
• FicR_Standard(Manual): Manually annotate whether "Reason_Standard" is fictitious.

Table 1: We gather 15 keys in this research, consisting of 10 keys automatically generated by ChatGPT and 5 keys that require manual annotation (denoted as Manual). These keys provide insight into ChatGPT's ability along four dimensions, namely: 1) performance, 2) explainability, 3) calibration, and 4) faithfulness.
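The 15 keys in Table 1 amount to one annotation record per test sample. A minimal sketch of such a record follows; the field names mirror the table, while the grouping and the helper function are our own illustration, not the paper's released data format.

```python
# One annotation record per sample, using the key names from Table 1.
# 10 keys come from ChatGPT; the 5 "(Manual)" keys are human-annotated.
CHATGPT_KEYS = [
    "Open", "Standard", "Top3", "Top5",
    "Reason_Open", "Reason_Standard", "ifR_Open", "ifR_Standard",
    "Confidence_Open", "Confidence_Standard",
]
MANUAL_KEYS = [
    "ifOpen_Correct(Manual)", "ifR_Open(Manual)", "ifR_Standard(Manual)",
    "FicR_Open(Manual)", "FicR_Standard(Manual)",
]

def empty_record() -> dict:
    """Blank record with all 15 keys, to be filled in by ChatGPT
    (automatic keys) and by domain experts (manual keys)."""
    return {key: None for key in CHATGPT_KEYS + MANUAL_KEYS}
```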
3.3 Collecting Keys From ChatGPT and Human Annotation

In this subsection, we describe the 15 keys that are collected from ChatGPT and domain experts. In Table 1, we show the 10 keys that are extracted from ChatGPT and the 5 keys that involve human annotation. These keys systematically assess ChatGPT's ability from the following four aspects:

Performance. One important aspect of our research is to comprehensively evaluate the overall performance of ChatGPT on various tasks and compare it with other popular models. By examining its performance from different aspects, we seek to provide a detailed understanding of ChatGPT's capability on the downstream IE tasks.

Explainability. The explainability of ChatGPT is crucial for its application in real-world scenarios (Rajani et al., 2019; Aghajanyan et al., 2021; Zini and Awad, 2023). In our study, we measure both the self-check and human-check explainability of ChatGPT, with a focus on its ability to provide useful and accurate explanations of its reasoning process for humans. Specifically, we ask ChatGPT to provide reasons for its predictions (Reason_Open and Reason_Standard), and whether ChatGPT approves of its own explanations (ifR_Open and ifR_Standard). Additionally, we also manually evaluate the acceptability of these reasons to humans (ifR_Open(Manual) and ifR_Standard(Manual)).

Calibration. Measuring calibration helps to evaluate the predictive uncertainty of a model (Guo et al., 2017; Kumar et al., 2019). A properly calibrated classifier should have predictive scores that accurately reflect the probability of correctness (Thulasidasan et al., 2019; Minderer et al., 2021). Given the tendency of modern neural networks to be overconfident in their predictions, we aim to identify potential uncertainty or overconfidence phenomena of ChatGPT. To evaluate calibration, ChatGPT is required to provide a confidence score (ranging from 1 to 100) for each prediction it makes (Confidence_Open and Confidence_Standard).

Faithfulness. The faithfulness of ChatGPT's explanations is important to ensure its trustworthiness (Maynez et al., 2020; He et al., 2023). In evaluating faithfulness, we include two keys that assess whether the reasons provided by ChatGPT are faithful to the original input. These keys, FicR_Open(Manual) and FicR_Standard(Manual), require manual annotation by domain experts.

Due to the space limitation, we show an intuitive example in Appendix A.3 to help readers better understand the annotation process.

4 Performance

4.1 Setup

To ensure a comprehensive evaluation of ChatGPT's capabilities, we conduct manual annotation and analysis on a diverse range of IE tasks, including 7 fine-grained tasks spanning 14 datasets. We collected 15 keys for each dataset from both ChatGPT and domain experts (§ 3). Only the test sets are annotated, as our aim is to analyze ChatGPT's abilities without any training. For space reasons, the details of each dataset are shown in Appendix A.1. Due to the time-consuming nature of obtaining responses from domain experts, we randomly select nearly 3,000 samples in total for our analysis. The number of manually annotated samples for each dataset is reported in Appendix A.1. As for the outputs from ChatGPT, we use the official API to evaluate the whole test sets.4

Besides, we compare ChatGPT with several popular baselines: 1) BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and 2) the state-of-the-art (SOTA) method on each single dataset. Due to the space limitation, the details of the state-of-the-art methods are shown in Appendix A.2. As for the metrics, we use the Micro-F1 score for all tasks except RE and EE. For the RE task, we report the named entity recognition F1-score and the relation classification F1-score. For the EE task, we show the trigger F1-score and the argument F1-score.5

4.2 Performance on the Standard-IE Setting

In this subsection, we report the performance of different models in the Standard-IE setting, as depicted in Table 2. It is clear from the table that ChatGPT's performance is not comparable to that of baseline models and SOTA methods in most cases. This is not surprising given that directly asking ChatGPT for the prediction is more like a zero-shot scenario, whereas the other compared methods are trained on task-specific datasets under a supervised learning paradigm. Another reason may be that ChatGPT directly chooses an answer from the given label set, and some labels are not easy to understand, thereby negatively impacting the performance.

Moreover, our research indicates that ChatGPT performs well on relatively simple IE tasks but struggles with more complex and challenging ones. For example, the entity typing (ET) task only involves classifying entities into pre-defined types without any further contextual analysis, and ChatGPT excels at this task, demonstrating that the model can generate accurate factual knowledge when the task is simple. However, in complex and challenging IE tasks such as RE, ChatGPT struggles, as it must first identify the entities that exist in the input and then classify the relationship between them, which is a more challenging task than ET. Despite ChatGPT's acceptable results on the ET, NER and RC tasks, it still faces challenges with more multifaceted IE tasks like RE and EE, where deeper contextual analysis and reasoning abilities are required. In summary, ChatGPT's performance varies based on the complexity of the task, and it performs well on straightforward tasks.

Furthermore, the conclusion that ChatGPT performs worse than other models seems inconsistent with previous studies (Wei et al., 2023; Gao et al., 2023), which suggest that ChatGPT can achieve desirable performance on some IE tasks. One possible explanation for the difference in conclusions is that we report the performance on the entire test set for each task in our study, while prior studies reported on a very small set of test samples drawn at random, which may have substantial variance. Another factor may be that we used a concise and relatively unified prompt to guide ChatGPT, while other research relied on domain-specific prompts or included a large number of label descriptions in their prompts, which requires lots of domain knowledge and thereby limits the ability to generalize across various tasks.

4.3 Performance on the OpenIE Setting

In this subsection, we report the accuracy of both the Standard-IE setting and the OpenIE setting on the sampled datasets.6 For the Standard-IE setting, we provide the pre-defined label set and ask ChatGPT to choose an answer for a given input, and the ac-

4 To prevent any historical chat biases, we cleared every conversation after generating each response.
5 These metrics all follow previous works.
6 We randomly selected around 200 samples for each dataset.
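The Micro-F1 score used in the setup above can be sketched generically as follows. This is our own illustration of the standard metric over sets of predicted and gold items (e.g., typed spans or triggers), not the paper's official evaluation script:

```python
# Generic micro-F1 over per-sample sets of predicted / gold items.
# A sketch of the standard metric, not the paper's official scorer.

def micro_f1(predictions, golds):
    """predictions/golds: one collection of items per sample.
    Micro-averaging pools true/false positive and negative counts
    across all samples before computing precision and recall."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, golds):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)   # items predicted and correct
        fp += len(pred - gold)   # items predicted but wrong
        fn += len(gold - pred)   # gold items that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with items represented as (type, span) pairs, two samples with one spurious prediction and no misses give precision 2/3, recall 1, and micro-F1 0.8.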
Task | Dataset | BERT | RoBERTa | SOTA | ChatGPT
Entity Typing (ET) | BBN | 80.3 | 79.8 | 82.2 (Zuo et al., 2022) | 85.6
Entity Typing (ET) | OntoNotes 5.0 | 69.1 | 68.8 | 72.1 (Zuo et al., 2022) | 73.4
Named Entity Recognition (NER) | CoNLL2003 | 92.8 | 92.4 | 94.6 (Wang et al., 2021) | 67.2
Named Entity Recognition (NER) | OntoNotes 5.0 | 89.2 | 90.9 | 91.9 (Ye et al., 2022) | 51.1
Relation Classification (RC) | TACRED | 72.7 | 74.6 | 75.6 (Li et al., 2022a) | 20.3
Relation Classification (RC) | SemEval2010 | 89.1 | 89.8 | 91.3 (Zhao et al., 2021) | 42.5
Relation Extraction (RE) | ACE05-R | 87.5 / 63.7 | 88.2 / 65.1 | 91.1 / 73.0 (Ye et al., 2022) | 40.5 / 4.5
Relation Extraction (RE) | SciERC | 65.4 / 43.0 | 63.6 / 42.0 | 69.9 / 53.2 (Ye et al., 2022) | 25.9 / 5.5
Event Detection (ED) | ACE05-E | 71.8 | 72.9 | 75.8 (Liu et al., 2022a) | 17.1
Event Detection (ED) | ACE05-E+ | 72.4 | 72.1 | 72.8 (Lin et al., 2020) | 15.5
Event Argument Extraction (EAE) | ACE05-E | 65.3 | 68.0 | 73.5 (Hsu et al., 2022) | 28.9
Event Argument Extraction (EAE) | ACE05-E+ | 64.0 | 66.5 | 73.0 (Hsu et al., 2022) | 30.9
Event Extraction (EE) | ACE05-E | 71.8 / 51.0 | 72.9 / 51.9 | 74.7 / 56.8 (Lin et al., 2020) | 17.0 / 7.3
Event Extraction (EE) | ACE05-E+ | 72.4 / 52.7 | 72.1 / 53.4 | 71.7 / 56.8 (Hsu et al., 2022) | 16.6 / 7.8

Table 2: The performance of ChatGPT and several baseline models on 14 IE datasets in the Standard-IE setting. We report the performance on the whole test set; for RE the paired values are the entity F1-score / relation F1-score, and for EE the trigger F1-score / argument F1-score. All results are directly cited from published papers or re-implemented using official open-source code.
Table 4: The explainability of ChatGPT measured on the sampled test set. We report the ratio of samples with reasonable reasons, as judged by ChatGPT (self-check) and by domain experts (human-check) under different settings. Besides, we also compute the overlap ratio between the two. These results indicate that in most cases, ChatGPT exhibits strong explainability for its predictions.
Table 6: The prediction confidence of various models on the whole test set. We show the confidence on both correct and incorrect predictions for the various methods. We find that ChatGPT is overconfident in its predictions in most cases.
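The overconfidence check summarized in Table 6 compares mean confidence on correct versus incorrect predictions; for a well-calibrated model the latter should be clearly lower. A minimal sketch of this comparison follows (the helper is our own, assuming predictions carry the 1-100 confidence score described in § 3.3):

```python
# Mean confidence on correct vs. incorrect predictions, as in Table 6.
# An illustrative helper, not the paper's analysis code.

def confidence_gap(records):
    """records: (prediction, gold, confidence) triples, confidence in [1, 100].
    Returns (mean confidence when correct, mean confidence when incorrect)."""
    correct = [c for pred, gold, c in records if pred == gold]
    incorrect = [c for pred, gold, c in records if pred != gold]
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(correct), mean(incorrect)
```

If the two means are close (or the incorrect mean is higher), the model is overconfident: its stated confidence does not track the probability of being right.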
Tsu-Jui Fu, Peng-Hsuan Li, and Wei-Yun Ma. 2019. Graphrel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1409–1418. Association for Computational Linguistics.

Yao Fu, Hao Peng, and Tushar Khot. 2022. How does gpt obtain its ability? tracing emergent abilities of language models to their sources. Yao Fu's Notion.

Jun Gao, Huan Zhao, Changlong Yu, and Ruifeng Xu. 2023. Exploring the feasibility of chatgpt for event extraction. CoRR, abs/2303.03836.

Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. 2014. Context-dependent fine-grained entity type tagging. CoRR, abs/1412.1820.

Andrej Zukov Gregoric, Yoram Bachrach, and Sam Coope. 2018. Named entity recognition with parallel recurrent neural networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 69–74. Association for Computational Linguistics.

Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. CoRR, abs/2301.07597.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML

I-Hung Hsu, Kuan-Hao Huang, Elizabeth Boschee, Scott Miller, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng. 2022. DEGREE: A data-efficient generation-based event extraction model. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 1890–1908. Association for Computational Linguistics.

Fan Huang, Haewoon Kwak, and Jisun An. 2023. Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech. CoRR, abs/2302.07736.

Katharina Jeblick, Balthasar Schachtner, Jakob Dexl, Andreas Mittermeier, Anna Theresa Stüber, Johanna Topalis, Tobias Weber, Philipp Wesp, Bastian Sabel, Jens Ricke, and Michael Ingrisch. 2022. Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports. CoRR, abs/2212.14882.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is chatgpt A good translator? A preliminary study. CoRR, abs/2301.08745.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316, Vancouver, Canada. Association for Computational Linguistics.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. CoRR, abs/2205.11916.
Gerd Kortemeyer. 2023. Could an artificial-intelligence agent pass an introductory physics course? arXiv preprint arXiv:2301.12127.

Fajri Koto, Jey Han Lau, and Timothy Baldwin. 2022. Can pretrained language models generate persuasive, faithful, and informative ad text for product descriptions? In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5), pages 234–243.

Sebastian Krügel, Andreas Ostermaier, and Matthias Uhl. 2023. The moral authority of chatgpt. CoRR, abs/2301.07098.

Ananya Kumar, Percy Liang, and Tengyu Ma. 2019. Verified uncertainty calibration. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 3787–3798.

Bo Li, Wei Ye, Jinglei Zhang, and Shikun Zhang. 2022a. Reviewing labels: Label graph network with top-k prediction set for relation extraction. CoRR, abs/2212.14270.

Bo Li, Dingyao Yu, Wei Ye, Jinglei Zhang, and Shikun Zhang. 2022b. Sequence generation with label augmentation for relation extraction. CoRR, abs/2212.14266.

Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020. A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5849–5859, Online. Association for Computational Linguistics.

Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019. Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1340–1350. Association for Computational Linguistics.

Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7999–8009. Association for Computational Linguistics.

Jian Liu, Yufeng Chen, and Jinan Xu. 2022a. Saliency as evidence: Event detection with trigger saliency attribution. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 4573–4585. Association for Computational Linguistics.

Xiao Liu, Heyan Huang, Ge Shi, and Bo Wang. 2022b. Dynamic prefix-tuning for generative template-based event extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 5216–5228. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Dongfang Lou, Zhilin Liao, Shumin Deng, Ningyu Zhang, and Huajun Chen. 2021. Mlbinet: A cross-sentence collective event detection network. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4829–4839. Association for Computational Linguistics.

Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, and Hua Wu. 2022. Unified structure generation for universal information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5755–5772, Dublin, Ireland. Association for Computational Linguistics.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3219–3232. Association for Computational Linguistics.

Yubo Ma, Zehao Wang, Yixin Cao, Mukai Li, Meiqi Chen, Kun Wang, and Jing Shao. 2022. Prompt for extraction? PAIE: prompting argument interaction for event argument extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 6759–6774. Association for Computational Linguistics.

Kyle Mahowald, Anna A. Ivanova, Idan Asher Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2023. Dissociating language and thought in large language models: a cognitive perspective. CoRR, abs/2301.06627.

Pedro Henrique Martins, Zita Marinho, and André F. T. Martins. 2019. Joint learning of named entity recognition and entity linking. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student Research Workshop,
pages 190–196. Association for Computational Linguistics.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan T. McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 1906–1919. Association for Computational Linguistics.

Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. 2021. Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 15682–15694.

Sandra Mitrovic, Davide Andreoletti, and Omran Ayoub. 2023. Chatgpt or human? detect and explain. explaining decisions of machine learning model for detecting short chatgpt-generated text. CoRR, abs/2301.13852.

Oded Nov, Nina Singh, and Devin M. Mann. 2023. Putting chatgpt's medical advice to the (turing) test. CoRR, abs/2301.10035.

Miguel Ortega-Martín, Óscar García-Sierra, Alfonso Ardoiz, Jorge Álvarez, Juan Carlos Armenteros, and

2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. CoRR, abs/2302.12813.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew J. Johnson,
Adrián Alonso. 2023. Linguistic ambiguity analysis Blake A. Hechtman, Laura Weidinger, Iason Gabriel,
in chatgpt. arXiv preprint arXiv:2302.06426. William S. Isaac, Edward Lockhart, Simon Osin-
dero, Laura Rimell, Chris Dyer, Oriol Vinyals, Ka-
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- reem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis
roll L. Wainwright, Pamela Mishkin, Chong Zhang, Hassabis, Koray Kavukcuoglu, and Geoffrey Irv-
Sandhini Agarwal, Katarina Slama, Alex Ray, John ing. 2021. Scaling language models: Methods,
Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, analysis & insights from training gopher. CoRR,
Maddie Simens, Amanda Askell, Peter Welinder, abs/2112.11446.
Paul F. Christiano, Jan Leike, and Ryan Lowe.
2022. Training language models to follow instruc- Nazneen Fatema Rajani, Bryan McCann, Caiming
tions with human feedback. CoRR, abs/2203.02155. Xiong, and Richard Socher. 2019. Explain yourself!
leveraging language models for commonsense rea-
Kunyuan Pang, Haoyu Zhang, Jie Zhou, and Ting soning. In Proceedings of the 57th Conference of
Wang. 2022. Divide and denoise: Learning from the Association for Computational Linguistics, ACL
noisy labels in fine-grained entity typing with 2019, Florence, Italy, July 28- August 2, 2019, Vol-
cluster-wise loss correction. In Proceedings of the ume 1: Long Papers, pages 4932–4942. Association
60th Annual Meeting of the Association for Compu- for Computational Linguistics.
tational Linguistics (Volume 1: Long Papers), ACL
2022, Dublin, Ireland, May 22-27, 2022, pages Erik F. Tjong Kim Sang and Fien De Meulder.
1997–2006. Association for Computational Linguis- 2003. Introduction to the conll-2003 shared task:
tics. Language-independent named entity recognition. In
Proceedings of the Seventh Conference on Natural
Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Language Learning, CoNLL 2003, Held in cooper-
Jie Ma, Alessandro Achille, Rishita Anubhai, ation with HLT-NAACL 2003, Edmonton, Canada,
Cícero Nogueira dos Santos, Bing Xiang, and Ste- May 31 - June 1, 2003, pages 142–147. ACL.
fano Soatto. 2021. Structured prediction as transla-
tion between augmented natural languages. In ICLR Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo
2021, Virtual Event, Austria, May 3-7, 2021. Open- Lee, Percy Liang, and Tatsunori Hashimoto. 2023.
Review.net. Whose opinions do language models reflect? arXiv
preprint arXiv:2303.17548.
Baolin Peng, Michel Galley, Pengcheng He, Hao
Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Shaden Smith, Mostofa Patwary, Brandon Norick,
Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. Patrick LeGresley, Samyam Rajbhandari, Jared
Casper, Zhun Liu, Shrimai Prabhumoye, George In Proceedings of the 2019 Conference on Empiri-
Zerveas, Vijay Korthikanti, Elton Zheng, Rewon cal Methods in Natural Language Processing and
Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia the 9th International Joint Conference on Natural
Song, Mohammad Shoeybi, Yuxiong He, Michael Language Processing, EMNLP-IJCNLP 2019, Hong
Houston, Saurabh Tiwary, and Bryan Catanzaro. Kong, China, November 3-7, 2019, pages 5783–
2022. Using deepspeed and megatron to train 5788. Association for Computational Linguistics.
megatron-turing NLG 530b, A large-scale genera-
tive language model. CoRR, abs/2201.11990. Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen,
Runkai Zheng, Yidong Wang, Linyi Yang, Haojun
Teo Susnjak. 2022. Chatgpt: The end of online exam Huang, Wei Ye, Xiubo Geng, Binxing Jiao, Yue
integrity? CoRR, abs/2212.09292. Zhang, and Xing Xie. 2023a. On the robustness of
chatgpt: An adversarial and out-of-distribution per-
Teo Susnjak. 2023. Applying bert and chatgpt for senti- spective. CoRR, abs/2302.12095.
ment analysis of lyme disease in scientific literature.
arXiv preprint arXiv:2302.06474. Xiao Wang, Weikang Zhou, Can Zu, Han Xia, Tianze
Chen, Yuansen Zhang, Rui Zheng, Junjie Ye,
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Qi Zhang, Tao Gui, Jihua Kang, Jingsheng Yang,
Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Siyuan Li, and Chunsai Du. 2023b. Instructuie:
Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Multi-task instruction tuning for unified information
YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, extraction. CoRR, abs/2304.08085.
Amin Ghafouri, Marcelo Menegali, Yanping Huang,
Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang,
Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Zhongqiang Huang, Fei Huang, and Kewei Tu.
Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, 2021. Automated concatenation of embeddings
Igor Krivokon, Will Rusch, Marc Pickett, Kath- for structured prediction. In Proceedings of the
leen S. Meier-Hellstern, Meredith Ringel Morris, 59th Annual Meeting of the Association for Com-
Tulsee Doshi, Renelito Delos Santos, Toju Duke, putational Linguistics and the 11th International
Johnny Soraker, Ben Zevenbergen, Vinodkumar Joint Conference on Natural Language Processing,
Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen ACL/IJCNLP 2021, (Volume 1: Long Papers), Vir-
Olson, Alejandra Molina, Erin Hoffman-John, Josh tual Event, August 1-6, 2021, pages 2643–2660. As-
Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, sociation for Computational Linguistics.
Matthew Lamm, Viktoriya Kuzmina, Joe Fenton,
Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Yizhong Wang, Swaroop Mishra, Pegah Alipoormo-
Blaise Aguera-Arcas, Claire Cui, Marian Croak, labashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva
Ed H. Chi, and Quoc Le. 2022. Lamda: Lan- Naik, Arjun Ashok, Arut Selvan Dhanasekaran, An-
guage models for dialog applications. CoRR, jana Arunkumar, David Stap, Eshaan Pathak, Gian-
abs/2201.08239. nis Karamanolakis, Haizhi Gary Lai, Ishan Puro-
Sunil Thulasidasan, Gopinath Chennupati, Jeff A. hit, Ishani Mondal, Jacob Anderson, Kirby Kuz-
Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. nia, Krima Doshi, Kuntal Kumar Pal, Maitreya Pa-
2019. On mixup training: Improved calibration and tel, Mehrad Moradshahi, Mihir Parmar, Mirali Puro-
predictive uncertainty for deep neural networks. In hit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit
Advances in Neural Information Processing Systems Verma, Ravsehaj Singh Puri, Rushang Karia, Savan
32: Annual Conference on Neural Information Pro- Doshi, Shailaja Keyur Sampat, Siddhartha Mishra,
cessing Systems 2019, NeurIPS 2019, December 8- Sujan Reddy A, Sumanta Patro, Tanay Dixit, and
14, 2019, Vancouver, BC, Canada, pages 13888– Xudong Shen. 2022. Super-naturalinstructions:
13899. Generalization via declarative instructions on 1600+
NLP tasks. In Proceedings of the 2022 Conference
Ruibo Tu, Chao Ma, and Cheng Zhang. 2023. Causal- on Empirical Methods in Natural Language Process-
discovery performance of chatgpt in the context of ing, EMNLP 2022, Abu Dhabi, United Arab Emi-
neuropathic pain diagnosis. CoRR, abs/2301.13819. rates, December 7-11, 2022, pages 5085–5109. As-
sociation for Computational Linguistics.
Amir Pouran Ben Veyseh, Viet Dac Lai, Franck Der-
noncourt, and Thien Huu Nguyen. 2021. Unleash Jason Wei, Yi Tay, Rishi Bommasani, Colin Raf-
GPT-2 power for event detection. In Proceedings of fel, Barret Zoph, Sebastian Borgeaud, Dani Yo-
the 59th Annual Meeting of the Association for Com- gatama, Maarten Bosma, Denny Zhou, Donald Met-
putational Linguistics and the 11th International zler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals,
Joint Conference on Natural Language Processing, Percy Liang, Jeff Dean, and William Fedus. 2022a.
ACL/IJCNLP 2021, (Volume 1: Long Papers), Vir- Emergent abilities of large language models. CoRR,
tual Event, August 1-6, 2021, pages 6271–6282. As- abs/2206.07682.
sociation for Computational Linguistics.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
David Wadden, Ulme Wennberg, Yi Luan, and Han- Bosma, Ed H. Chi, Quoc Le, and Denny Zhou.
naneh Hajishirzi. 2019. Entity, relation, and event 2022b. Chain of thought prompting elicits reasoning
extraction with contextualized span representations. in large language models. CoRR, abs/2201.11903.
Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Kailin Zhao, Xiaolong Jin, Long Bai, Jiafeng Guo,
Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, and Xueqi Cheng. 2022. Knowledge-enhanced self-
Yufeng Chen, Meishan Zhang, Yong Jiang, and Wen- supervised prototypical network for few-shot event
juan Han. 2023. Zero-shot information extraction detection. In Findings of the Association for Com-
via chatting with chatgpt. CoRR, abs/2302.10205. putational Linguistics: EMNLP 2022, Abu Dhabi,
United Arab Emirates, December 7-11, 2022, pages
Ralph Weischedel and Ada Brunstein. 2005. Bbn pro- 6266–6275. Association for Computational Linguis-
noun coreference and entity type corpus. Linguistic tics.
Data Consortium, Philadelphia, 112.
Kang Zhao, Hua Xu, Yue Cheng, Xiaoteng Li, and Kai
Fei Wu and Daniel S. Weld. 2010. Open information Gao. 2021. Representation iterative fusion based
extraction using Wikipedia. In Proceedings of the on heterogeneous graph neural network for joint en-
48th Annual Meeting of the Association for Compu- tity and relation extraction. Knowl. Based Syst.,
tational Linguistics, pages 118–127, Uppsala, Swe- 219:106888.
den. Association for Computational Linguistics.
Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and
Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Dacheng Tao. 2023. Can chatgpt understand too?
Takeda, and Yuji Matsumoto. 2020. LUKE: deep a comparative study on chatgpt and fine-tuned bert.
contextualized entity representations with entity- arXiv preprint arXiv:2302.10198.
aware self-attention. In EMNLP 2020.
Wenxuan Zhou and Muhao Chen. 2021. An im-
proved baseline for sentence-level relation extrac-
Deming Ye, Yankai Lin, Peng Li, and Maosong Sun.
tion. CoRR, abs/2102.01373.
2022. Packed levitated marker for entity and relation
extraction. In Proceedings of the 60th Annual Meet- Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and
ing of the Association for Computational Linguistics Zhenchang Xing. 2023. Exploring AI ethics of chat-
(Volume 1: Long Papers), ACL 2022, Dublin, Ire- gpt: A diagnostic analysis. CoRR, abs/2301.12867.
land, May 22-27, 2022, pages 4904–4917. Associa-
tion for Computational Linguistics. Julia El Zini and Mariette Awad. 2023. On the explain-
ability of natural language processing deep models.
Wei Ye, Bo Li, Rui Xie, Zhonghao Sheng, Long Chen, ACM Comput. Surv., 55(5):103:1–103:31.
and Shikun Zhang. 2019. Exploiting entity BIO tag
embeddings and multi-task learning for relation ex- Xinyu Zuo, Haijin Liang, Ning Jing, Shuang Zeng,
traction with imbalanced data. In Proceedings of Zhou Fang, and Yu Luo. 2022. Type-enriched hierar-
the 57th Conference of the Association for Compu- chical contrastive strategy for fine-grained entity typ-
tational Linguistics, ACL 2019, Florence, Italy, July ing. In Proceedings of the 29th International Confer-
28- August 2, 2019, Volume 1: Long Papers, pages ence on Computational Linguistics, COLING 2022,
1351–1360. Association for Computational Linguis- Gyeongju, Republic of Korea, October 12-17, 2022,
tics. pages 2405–2417. International Committee on Com-
putational Linguistics.
Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao.
2015. Distant supervision for relation extraction via
piecewise convolutional neural networks. In Pro-
ceedings of the 2015 Conference on Empirical Meth-
ods in Natural Language Processing, pages 1753–
1762.
Table 10: Key statistical characteristics of the datasets used in our research, covering 14 datasets belonging to 7 different IE tasks.
Table 11: An input example for the event detection task. The example is extracted from ACE05-E, and all three parts above are jointly fed into ChatGPT.