
Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness

Bo Li 1,2, Gexiang Fang 1,2, Yang Yang 1,2, Quansen Wang 3, Wei Ye 1, Wen Zhao 1, and Shikun Zhang 1
1 National Engineering Research Center for Software Engineering, Peking University
2 School of Software and Microelectronics, Peking University
3 Boston University
{deepblue.lb, fanggx, yangy}@stu.pku.edu.cn, quansenw@bu.edu
{wye, zhaowen, zhangsk}@pku.edu.cn
Abstract

The capability of Large Language Models (LLMs) like ChatGPT to comprehend user intent and provide reasonable responses has made them extremely popular lately. In this paper, we focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks. Specifically, we present a systematic analysis measuring ChatGPT's performance, explainability, calibration, and faithfulness, resulting in 15 keys collected from either ChatGPT or domain experts. Our findings reveal that ChatGPT's performance in the Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting, as evidenced by human evaluation. In addition, our research indicates that ChatGPT provides high-quality and trustworthy explanations for its decisions. However, ChatGPT is overconfident in its predictions, which results in low calibration. Furthermore, ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. We manually annotate and release the test sets of the 7 fine-grained IE tasks, covering 14 datasets, to further promote the research. The datasets and code are available at https://github.com/pkuserc/ChatGPT_for_IE.

1 Introduction

Large Language Models (LLMs) such as GPT-3 (Brown et al., 2020), LaMDA (Thoppilan et al., 2022), and PaLM (Chowdhery et al., 2022) have greatly promoted the development of the Natural Language Processing (NLP) community. With a proper instruction (often the task definition) (Ouyang et al., 2022; Kojima et al., 2022; Chung et al., 2022; Wang et al., 2022) and chain-of-thought (CoT) prompting (Wei et al., 2022b), LLMs achieve surprisingly good performance when dealing with unseen tasks.

ChatGPT (https://chat.openai.com/) is currently the most popular LLM, known for its impressive ability to understand user intent and generate human-like responses. ChatGPT is trained on the GPT family (Brown et al., 2020; Artetxe et al., 2022; Ouyang et al., 2022) using reinforcement learning from human feedback (RLHF) (Christiano et al., 2017) and high-quality conversational-style datasets. Apart from its surprising dialogue ability, ChatGPT has many other aspects that attract researchers to explore. Some researchers have delved into the potential impacts of ChatGPT on human life (Haque et al., 2022; Zhuo et al., 2023; Susnjak, 2022; Basic et al., 2023). Other researchers are interested in exploring the capabilities of ChatGPT for various NLP tasks (Zhang et al., 2022a; Qin et al., 2023; Mitrovic et al., 2023; Guo et al., 2023). These studies have preliminarily explored ChatGPT's capabilities and drawn valuable conclusions.

ChatGPT is a closed model that does not disclose its training details, and any response from the model encodes an opinion. A response can significantly impact the user's experience and shape their beliefs going forward (Aiyappa et al., 2023; Santurkar et al., 2023; Deshpande et al., 2023; Huang et al., 2023). Consequently, evaluating ChatGPT should involve not only assessing its ability to achieve high performance but also measuring the reliability of the answers it provides. To help users better understand the overall quality of ChatGPT's responses and enable systematic measurement of its capabilities, we design the following four metric dimensions. The first dimension is Performance, which reflects ChatGPT's overall performance on various IE tasks from multiple perspectives. The second dimension, Explainability (Rajani et al., 2019; Aghajanyan et al., 2021; Zini and Awad, 2023), evaluates whether ChatGPT can give a justified reason for its prediction, thereby providing insight into its decision-making process. The third is Calibration (Guo et al., 2017; Kumar et al., 2019; Thulasidasan et al., 2019; Minderer et al., 2021), which measures the predictive uncertainty of a model; we use this metric to assess whether ChatGPT is overconfident in its predictions. The last dimension is Faithfulness (Maynez et al., 2020; Koto et al., 2022; Creswell and Shanahan, 2022; He et al., 2023), which is frequently employed in summarization to determine whether a summary accurately reflects its input. In our research, we adopt faithfulness as a measure of whether the explanations given by ChatGPT are truthful to the input or are spurious. In summary, according to the above four dimensions, we collect 15 keys from either ChatGPT or domain experts for the evaluation (§ 3).

In this research, we aim to perform a comprehensive study and detailed analysis of ChatGPT's capabilities through various information extraction (IE) tasks. IE involves heterogeneous structure extraction, factual knowledge usage, and diversified targets (Yamada et al., 2020; Paolini et al., 2021; Lu et al., 2022), making it an ideal scenario for evaluating ChatGPT's capabilities. Overall, we conduct our experiments and analysis on 14 datasets belonging to 7 fine-grained IE tasks (§ 4). Additionally, we assess the explainability, calibration, and faithfulness of ChatGPT's responses through both self-check and human-check (§ 5). To sum up, our main contributions are summarized as follows:

• To assess the overall ability of ChatGPT, we employ a comprehensive and systematic evaluation along four dimensions: 1) performance, 2) explainability, 3) calibration, and 4) faithfulness. We collect 15 keys belonging to these dimensions from either ChatGPT or domain experts. All the manually annotated datasets and code are made publicly available for future research.

• We comprehensively evaluate the overall performance of ChatGPT on various tasks in both the Standard-IE and OpenIE settings and compare it with other popular models. Our research indicates that ChatGPT's performance is not satisfactory in the Standard-IE setting. However, it provides surprisingly good results in the OpenIE setting, as confirmed by human evaluation. Furthermore, we discover that ChatGPT provides high-quality and trustworthy explanations for its decisions, although it displays overconfidence in its predictions, leading to low calibration. Besides, ChatGPT is largely faithful to the original text in most cases.

2 Related Work

2.1 Large Language Models

Large Language Models (LLMs) typically contain more than a hundred billion parameters; examples include GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), LaMDA (Thoppilan et al., 2022), Megatron-Turing NLG (Smith et al., 2022), and PaLM (Chowdhery et al., 2022), among others. Scaling up the model size brings impressive abilities in few-shot and zero-shot learning scenarios, such as producing reasonable results with very few samples or only task descriptions (Brown et al., 2020; Chowdhery et al., 2022). Moreover, scaling up the model size unlocks emergent abilities that were not observed in smaller models, enabling LLMs to exhibit strong generalizability on unseen tasks (Wei et al., 2022a; Fu et al., 2022; Mahowald et al., 2023).

With its impressive ability to understand user intent and generate human-like responses, ChatGPT has become the most popular language model currently. It is trained on the GPT family (Brown et al., 2020; Artetxe et al., 2022; Ouyang et al., 2022) and high-quality conversational-style datasets using reinforcement learning from human feedback (RLHF) (Christiano et al., 2017).

Along with its dialogue ability, ChatGPT has other aspects that researchers are exploring. Some researchers study the potential impacts of ChatGPT on human life, such as ethical risks (Haque et al., 2022; Zhuo et al., 2023; Krügel et al., 2023), the education sector (Susnjak, 2022; Basic et al., 2023; Kortemeyer, 2023), and the medical scenario (Tu et al., 2023; Nov et al., 2023; Jeblick et al., 2022). Additionally, some researchers are keen to examine the potential of ChatGPT in addressing various natural language processing tasks. For instance, prior works have examined ChatGPT's performance in stance detection (Zhang et al., 2022a), linguistic and sentiment analysis (Susnjak, 2023; Ortega-Martín et al., 2023), general NLP tasks (Qin et al., 2023; Bian et al., 2023; Zhong et al., 2023; Wang et al., 2023a,b), and machine translation (Jiao et al., 2023). Frieder et al. (2023) explore the mathematical capabilities of ChatGPT, while Bang et al. (2023) propose an evaluation of ChatGPT on reasoning and other aspects. Additionally, Mitrovic et al. (2023) and Guo et al. (2023) investigate the differences between human-written and ChatGPT-generated text.

2.2 Information Extraction

Information Extraction (IE) is a long-standing research topic that aims to extract structured factual information from unstructured texts (Andersen et al., 1992; Crowe, 1995; Chieu et al., 2003; Wu and Weld, 2010; Khot et al., 2017; Lu et al., 2022). Typically, IE involves a wide range of tasks, such as named entity recognition (NER) (Gregoric et al., 2018; Martins et al., 2019; Li et al., 2020; Das et al., 2022), entity typing (ET) (Choi et al., 2018; Dai et al., 2021; Pang et al., 2022; Chen et al., 2022), relation extraction (RE) (Li et al., 2019; Fu et al., 2019; Bian et al., 2021; Ye et al., 2022), relation classification (RC) (Zeng et al., 2015; Ye et al., 2019; Zhou and Chen, 2021; Li et al., 2022b), event detection (ED) (Veyseh et al., 2021; Lou et al., 2021; Liu et al., 2022a; Zhao et al., 2022), event argument extraction (EAE) (Zhang et al., 2022b; Du and Ji, 2022; Ma et al., 2022), and event extraction (EE) (Wadden et al., 2019; Du and Cardie, 2020; Liu et al., 2022b; Hsu et al., 2022), among others. These tasks automatically generate structured factual outputs about entities, relations, and events, and greatly boost the development of the NLP community.

3 ChatGPT for Information Extraction

In this section, we first briefly introduce the 7 fine-grained IE tasks, and then present how we collect the 15 keys from ChatGPT and domain experts.

3.1 Information Extraction

IE involves a wide range of tasks which need to extract structured factual information, such as entities, relations, and events, from unstructured texts. In this research, we conduct our analysis on the following 7 fine-grained IE tasks (we introduce these tasks briefly due to the space limitation; please refer to the task-specific papers for more details): 1) Entity Typing (ET) (Choi et al., 2018; Dai et al., 2021; Pang et al., 2022; Chen et al., 2022) aims to classify the type of a target entity in a given input; 2) Named Entity Recognition (NER) (Gregoric et al., 2018; Martins et al., 2019; Li et al., 2020; Das et al., 2022) aims to first identify the candidate entities and then classify their types; 3) Relation Classification (RC) (Zeng et al., 2015; Ye et al., 2019; Zhou and Chen, 2021; Li et al., 2022b) requires classifying the relation between two target entities; 4) Relation Extraction (RE) (Li et al., 2019; Fu et al., 2019; Bian et al., 2021; Ye et al., 2022) identifies the target entities and the relation jointly; 5) Event Detection (ED) (Veyseh et al., 2021; Lou et al., 2021; Liu et al., 2022a; Zhao et al., 2022) identifies event triggers and their types; 6) Event Argument Extraction (EAE) (Zhang et al., 2022b; Du and Ji, 2022; Ma et al., 2022) distinguishes arguments and categorizes their roles with respect to the target event; and 7) Event Extraction (EE) (Wadden et al., 2019; Du and Cardie, 2020; Liu et al., 2022b; Hsu et al., 2022) performs event detection and argument extraction jointly. Note that although some of these tasks are subsets of others, every task needs LLMs' unique abilities to perform well, and it is worth exploring their performance on these fine-grained IE tasks.

3.2 Standard-IE Setting and OpenIE Setting

To comprehensively evaluate the overall performance of ChatGPT on IE tasks, we ask ChatGPT to generate responses in the Standard-IE setting and the OpenIE setting. The Standard-IE setting is commonly used in previous works, which fine-tune a model on the task-specific dataset with a supervised learning paradigm. Since we cannot directly fine-tune ChatGPT's parameters, we instead evaluate its ability to select the most appropriate answer from a set of candidate labels. Specifically, this setting is based on an instruction that includes the task description, the input text, the prompt, and the label set, where the task description describes the specific IE task, the prompt involves the utterances that guide ChatGPT to output the required keys (introduced in § 3.3), and the label set contains all candidate labels of each dataset. The OpenIE setting is a more advanced and challenging scenario than the Standard-IE setting. In this setting, we do not provide any candidate labels to ChatGPT and rely solely on its ability to comprehend the task description, the prompt, and the input text to generate predictions. Our goal is to assess ChatGPT's ability to produce reasonable factual knowledge.
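To make the two settings concrete, the snippet below is a minimal sketch of how such instructions can be assembled. It is illustrative only: the exact task descriptions, prompt wording, and label sets used in this paper are not reproduced here, and the helper names are our own.

```python
# Illustrative sketch of the two query settings; the wording below is a
# placeholder, not the exact prompt used in the paper.

def build_standard_ie_instruction(task_description: str, text: str, labels: list) -> str:
    """Standard-IE setting: the candidate label set is given to the model."""
    return (
        f"{task_description}\n"
        f"Input: {text}\n"
        f"Candidate labels: {', '.join(labels)}\n"
        "Choose the most likely label, also give the top-3 and top-5 labels, "
        "a confidence score from 1 to 100, and explain your reason."
    )

def build_open_ie_instruction(task_description: str, text: str) -> str:
    """OpenIE setting: no label set is provided; the model answers freely."""
    return (
        f"{task_description}\n"
        f"Input: {text}\n"
        "Predict the answer without any candidate labels, give a confidence "
        "score from 1 to 100, and explain your reason."
    )

if __name__ == "__main__":
    desc = "Classify the type of the marked entity in the sentence."  # hypothetical task description
    sent = "ChatGPT was released by OpenAI in November 2022."
    print(build_standard_ie_instruction(desc, sent, ["organization", "person", "location"]))
    print(build_open_ie_instruction(desc, sent))
```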
Keys Explanation
Performance
Open Directly ask ChatGPT to predict the class without the label set.
Standard ChatGPT’s most likely correct class with a given label set.
Top3 The three most likely classes of the given label set from ChatGPT.
Top5 The five most likely classes of the given label set from ChatGPT.
ifOpen_Correct(Manual) Manually annotate whether the "Open" is reasonable.
Explainability
Reason_Open The reason why ChatGPT chooses the class in "Open".
Reason_Standard The reason why ChatGPT chooses the class in "Standard".
ifR_Open Does ChatGPT think that "Reason_Open" is reasonable?
ifR_Standard Does ChatGPT think that "Reason_Standard" is reasonable?
ifR_Open(Manual) Manually annotate whether the "Reason_Open" is reasonable.
ifR_Standard(Manual) Manually annotate whether the "Reason_Standard" is reasonable.
Calibration
Confidence_Open The confidence of ChatGPT in predicting "Open".
Confidence_Standard The confidence of ChatGPT in predicting "Standard".
Faithfulness
FicR_Open(Manual) Manually annotate whether the "Reason_Open" is fictitious.
FicR_Standard(Manual) Manually annotate whether the "Reason_Standard" is fictitious.

Table 1: We gather 15 keys in this research, consisting of 10 keys automatically generated by ChatGPT and 5 keys
that required manual annotation (denoted as Manual). These keys provide insight into ChatGPT’s ability in four
dimensions, namely: 1) performance, 2) explainability, 3) calibration, and 4) faithfulness.
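One convenient way to hold the 15 keys for a single annotated sample is a flat record; the sketch below simply mirrors the keys in Table 1 (the field names, types, and grouping are our own convention rather than a format prescribed by the released data).

```python
# One record per annotated sample, mirroring the keys in Table 1.
# Fields ending in "_manual" are filled by domain experts; the rest
# are parsed from ChatGPT's responses.
from dataclasses import dataclass, field

@dataclass
class AnnotatedSample:
    # Performance
    open: str = ""                       # "Open": prediction without the label set
    standard: str = ""                   # "Standard": prediction with the label set
    top3: list = field(default_factory=list)
    top5: list = field(default_factory=list)
    if_open_correct_manual: bool = False
    # Explainability
    reason_open: str = ""
    reason_standard: str = ""
    if_r_open: bool = False              # ChatGPT's self-check of Reason_Open
    if_r_standard: bool = False
    if_r_open_manual: bool = False       # human check of Reason_Open
    if_r_standard_manual: bool = False
    # Calibration (1-100 confidence reported by ChatGPT)
    confidence_open: int = 0
    confidence_standard: int = 0
    # Faithfulness (human judgement of whether the reason is fictitious)
    fic_r_open_manual: bool = False
    fic_r_standard_manual: bool = False
```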

3.3 Collecting Keys From ChatGPT and Human Annotation

In this subsection, we describe the 15 keys collected from ChatGPT and domain experts. Table 1 shows the 10 keys extracted from ChatGPT and the 5 keys that involve human annotation. These keys systematically assess ChatGPT's ability from the following four aspects:

Performance. One important aspect of our research is to comprehensively evaluate the overall performance of ChatGPT on various tasks and compare it with other popular models. By examining its performance from different aspects, we seek to provide a detailed understanding of ChatGPT's capability on downstream IE tasks.

Explainability. The explainability of ChatGPT is crucial for its application in real-world scenarios (Rajani et al., 2019; Aghajanyan et al., 2021; Zini and Awad, 2023). In our study, we measure both the self-check and human-check explainability of ChatGPT, with a focus on its ability to provide useful and accurate explanations of its reasoning process for humans. Specifically, we ask ChatGPT to provide reasons for its predictions (Reason_Open and Reason_Standard) and to state whether it approves of its own explanations (ifR_Open and ifR_Standard). Additionally, we manually evaluate whether these reasons are acceptable to humans (ifR_Open(Manual) and ifR_Standard(Manual)).

Calibration. Measuring calibration helps to evaluate the predictive uncertainty of a model (Guo et al., 2017; Kumar et al., 2019). A properly calibrated classifier should produce predictive scores that accurately reflect the probability of correctness (Thulasidasan et al., 2019; Minderer et al., 2021). Given the tendency of modern neural networks to be overconfident in their predictions, we aim to identify potential uncertainty or overconfidence phenomena in ChatGPT. To evaluate calibration, ChatGPT is required to provide a confidence score (ranging from 1 to 100) for each prediction it makes (Confidence_Open and Confidence_Standard).

Faithfulness. The faithfulness of ChatGPT's explanations is important to ensure its trustworthiness (Maynez et al., 2020; He et al., 2023). To evaluate faithfulness, we include two keys that assess whether the reasons provided by ChatGPT are faithful to the original input. These keys, FicR_Open(Manual) and FicR_Standard(Manual), require manual annotation by domain experts.

Due to the space limitation, we show an intuitive example in Appendix A.3 to help readers better understand the annotation process.

4 Performance

4.1 Setup

To ensure a comprehensive evaluation of ChatGPT's capabilities, we conduct manual annotation and analysis on a diverse range of IE tasks, including 7 fine-grained tasks spanning 14 datasets. We collect 15 keys for each dataset from both ChatGPT and domain experts (§ 3). Only the test sets are annotated, as our aim is to analyze ChatGPT's abilities without any training. For space reasons, the details of each dataset are given in Appendix A.1. Due to the time-consuming nature of obtaining responses from domain experts, we randomly select nearly 3,000 samples in total for our analysis; the number of manually annotated samples for each dataset is reported in Appendix A.1. As for the outputs from ChatGPT, we use the official API to evaluate the whole test sets (to prevent any historical chat biases, we cleared every conversation after generating each response).

Besides, we compare ChatGPT with several popular baselines: 1) BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and 2) the state-of-the-art (SOTA) method on each dataset. Due to the space limitation, the details of the state-of-the-art methods are given in Appendix A.2. As for the metrics, we follow previous works and use the micro-F1 score for all tasks except RE and EE. For the RE task, we report the named entity recognition F1-score and the relation classification F1-score; for the EE task, we report the trigger F1-score and the argument F1-score.

4.2 Performance on the Standard-IE Setting

In this subsection, we report the performance of different models in the Standard-IE setting, as depicted in Table 2. It is clear from the table that ChatGPT's performance is not comparable to that of the baseline models and SOTA methods in most cases. This is not surprising, given that directly asking ChatGPT for the prediction is more like a zero-shot scenario, whereas the other compared methods are trained on task-specific datasets under a supervised learning paradigm. Another reason may be that ChatGPT directly chooses an answer from the given label set, and some labels are not easy to understand, which negatively impacts the performance.

Moreover, our research indicates that ChatGPT performs well on relatively simple IE tasks but struggles with more complex and challenging tasks. For example, the entity typing (ET) task only involves classifying entities into pre-defined types without any further contextual analysis, and ChatGPT excels at this task, demonstrating that the model can generate accurate factual knowledge when the task is simple. However, in complex and challenging IE tasks such as RE, ChatGPT struggles, as it must first identify the entities that exist in the input and then classify the relationship between them, which is a more challenging task than ET. Despite ChatGPT's acceptable results on the ET, NER, and RC tasks, it still faces challenges with more multifaceted IE tasks like RE and EE, where deeper contextual analysis and reasoning abilities are required. In summary, ChatGPT's performance varies with the complexity of the task, and it performs well on straightforward tasks.

Furthermore, the conclusion that ChatGPT performs worse than other models seems inconsistent with previous studies (Wei et al., 2023; Gao et al., 2023), which suggest that ChatGPT can achieve desirable performance on some IE tasks. One possible explanation for the difference is that we report the performance on the entire test set for each task, while prior studies reported on a very small set of randomly drawn test samples, which may have substantial variance. Another factor may be that we used a concise and relatively unified prompt to guide ChatGPT, while other research relied on domain-specific prompts or included a large number of label descriptions in their prompts, which requires substantial domain knowledge and thereby limits the ability to generalize across various tasks.
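For reference, the span-level micro-F1 used for most tasks in Table 2 can be computed as in the minimal sketch below, assuming exact match between predicted and gold items (e.g., (span, type) pairs for NER); it is not the official scorer of any particular dataset.

```python
# Minimal micro-F1 over extracted items, assuming exact match.
def micro_f1(gold: list, pred: list) -> float:
    """gold/pred: one set of extracted items per test sentence."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # items predicted and present in the gold annotation
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed gold items
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# toy example with (span, type) pairs
gold = [{("Peking University", "ORG"), ("Boston", "LOC")}]
pred = [{("Peking University", "ORG")}]
print(round(micro_f1(gold, pred), 3))  # 0.667
```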
Task                             Dataset        BERT         RoBERTa      SOTA                            ChatGPT
Entity Typing (ET)               BBN            80.3         79.8         82.2 (Zuo et al., 2022)         85.6
                                 OntoNotes 5.0  69.1         68.8         72.1 (Zuo et al., 2022)         73.4
Named Entity Recognition (NER)   CoNLL2003      92.8         92.4         94.6 (Wang et al., 2021)        67.2
                                 OntoNotes 5.0  89.2         90.9         91.9 (Ye et al., 2022)          51.1
Relation Classification (RC)     TACRED         72.7         74.6         75.6 (Li et al., 2022a)         20.3
                                 SemEval2010    89.1         89.8         91.3 (Zhao et al., 2021)        42.5
Relation Extraction (RE)         ACE05-R        87.5 | 63.7  88.2 | 65.1  91.1 | 73.0 (Ye et al., 2022)   40.5 | 4.5
                                 SciERC         65.4 | 43.0  63.6 | 42.0  69.9 | 53.2 (Ye et al., 2022)   25.9 | 5.5
Event Detection (ED)             ACE05-E        71.8         72.9         75.8 (Liu et al., 2022a)        17.1
                                 ACE05-E+       72.4         72.1         72.8 (Lin et al., 2020)         15.5
Event Argument Extraction (EAE)  ACE05-E        65.3         68.0         73.5 (Hsu et al., 2022)         28.9
                                 ACE05-E+       64.0         66.5         73.0 (Hsu et al., 2022)         30.9
Event Extraction (EE)            ACE05-E        71.8 | 51.0  72.9 | 51.9  74.7 | 56.8 (Lin et al., 2020)  17.0 | 7.3
                                 ACE05-E+       72.4 | 52.7  72.1 | 53.4  71.7 | 56.8 (Hsu et al., 2022)  16.6 | 7.8

Table 2: The performance of ChatGPT and several baseline models on 14 IE datasets in the Standard-IE setting. We report the performance on the whole test set. All results are directly cited from published papers or re-implemented using official open-source code. For RE, scores are entity F1 | relation F1; for EE, trigger F1 | argument F1.

4.3 Performance on the OpenIE Setting

In this subsection, we report the accuracy of both the Standard-IE setting and the OpenIE setting on a sampled test set (we randomly selected around 200 samples for each dataset). For the Standard-IE setting, we provide the pre-defined label set and ask ChatGPT to choose an answer for a given input, and the accuracy is calculated by matching the predictions to the ground-truth labels. The OpenIE setting, on the other hand, asks ChatGPT to make predictions without the pre-defined label set (Open in § 3.3). Three domain experts evaluate these predictions and vote on whether they are reasonable in light of the input and background knowledge (ifOpen_Correct in § 3.3). Our main goal is to determine whether ChatGPT can produce logical and reasonable predictions without being given the pre-defined label set, so we do not require the prediction to match the ground truth.

                  Standard-IE  OpenIE
BBN (ET)          86.8%        97.2%
CoNLL (NER)       69.0%        93.3%
SemEval2010 (RC)  43.3%        84.3%
ACE05-R (RE)      14.9%        23.9%
ACE05-E (ED)      12.4%        42.6%
ACE05-E (EAE)     17.3%        65.3%
ACE05-E (EE)       4.9%        28.8%

Table 3: The accuracy of the Standard-IE setting and the OpenIE setting on the sampled test set. Our results show that ChatGPT can generate reasonable outputs in the OpenIE setting.

The results presented in Table 3 indicate that ChatGPT's performance is somewhat inspiring under the OpenIE setting. For example, more than 84% of the predictions are considered reasonable by the domain experts in the ET, NER, and RC tasks. However, the performance is relatively poorer for more challenging tasks, such as RE and EE. Overall, compared with the Standard-IE setting, ChatGPT's performance in the OpenIE setting is exciting. Our findings suggest that under the OpenIE setting, ChatGPT can generate reliable factual knowledge and reasonable output.

4.4 The Top-k Recall Analysis

While generating the most likely prediction may be unsatisfactory in the Standard-IE setting, we seek to investigate whether ChatGPT could be a useful advisor. Therefore, we examine the recall of its top-k predictions, with k = 1, 3, or 5. As shown in Table 5, the results indicate that, compared with the top-1 recall, the top-3 recall increases significantly, e.g., the improvement is 19.6% on SemEval2010. Moreover, the top-5 recall reaches an impressive 94.9% on BBN and 76.0% on SemEval2010, demonstrating a favorable outcome. Our findings suggest that ChatGPT is a competent answer-candidate generator for a given task under the Standard-IE setting, which could help users select the most probable prediction from the top-5 predictions.
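The top-k recall reported in Table 5 can be computed directly from the Top3/Top5 keys of Table 1; a minimal sketch, assuming the gold label counts as recalled whenever it appears anywhere in ChatGPT's top-k list:

```python
# Top-k recall over ChatGPT's ranked label lists.
def top_k_recall(gold_labels: list, topk_predictions: list, k: int) -> float:
    hits = sum(1 for gold, preds in zip(gold_labels, topk_predictions) if gold in preds[:k])
    return hits / len(gold_labels)

gold = ["ORG", "PER", "LOC"]
topk = [["LOC", "ORG", "GPE"], ["PER", "ORG", "LOC"], ["ORG", "GPE", "FAC"]]
print(round(top_k_recall(gold, topk, 1), 3))  # 0.333
print(round(top_k_recall(gold, topk, 3), 3))  # 0.667
```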
                 Standard-IE Setting                    OpenIE Setting
                 Self-check  Human-check  Overlap       Self-check  Human-check  Overlap
BBN (ET)         100.0%      99.2%        99.2%         100.0%      99.5%        99.5%
CoNLL (NER)      100.0%      99.3%        99.3%         100.0%      99.7%        99.7%
SemEval (RC)     100.0%      100.0%       100.0%        100.0%      99.7%        99.7%
ACE05-R (RE)     100.0%      90.0%        90.0%         100.0%      100.0%       100.0%
ACE05-E (ED)     100.0%      96.3%        96.3%         100.0%      90.2%        90.2%
ACE05-E (EAE)    100.0%      74.1%        74.1%         100.0%      90.4%        90.4%
ACE05-E (EE)     100.0%      47.1%        47.1%         94.0%       78.0%        74.0%

Table 4: The explainability of ChatGPT measured on the sampled test set. We report the ratio of samples with reasonable reasons as judged by ChatGPT (self-check) and by domain experts (human-check) under different settings, as well as the overlap ratio between the two. These results indicate that in most cases, ChatGPT exhibits strong explainability for its predictions.

5 Explainability, Calibration and Faithfulness

While ChatGPT's performance in the above evaluations is noteworthy, it is equally important to evaluate its ability along diverse dimensions that can offer important insights for future research directions. In this section, we analyze several relevant factors, including explainability, calibration, and faithfulness, to comprehensively evaluate ChatGPT's abilities. Overall, our findings suggest that ChatGPT can provide high-quality and reliable explanations for its predictions, but it tends to display overconfidence in most cases, leading to low calibration. Additionally, ChatGPT displays high faithfulness to the original text, making it a reliable tool for users.

             top-1    top-3    top-5
BBN          85.6%    92.7%    94.9% (+9.3%)
SemEval2010  42.5%    62.1%    76.0% (+33.5%)

Table 5: The top-k recall analysis on the whole test set. We report two datasets due to the space limitation; other datasets show similar observations. The results show that ChatGPT can serve as a good advisor.

5.1 Explainability

Explainability is a critical requirement for LLMs, as it allows users to understand how the model arrives at its predictions (Peng et al., 2023). In this study, we investigate whether ChatGPT can provide a reasonable explanation for its output. To be specific, we request ChatGPT to provide reasons for its predictions in the Standard-IE and OpenIE settings. The corresponding keys are denoted Reason_Standard and Reason_Open, as explained in § 3.3. These reasons are then evaluated for their reasonableness by both ChatGPT and three domain experts, with the resulting evaluations referred to as self-check and human-check, respectively. To ensure a robust evaluation of ChatGPT's explainability, we only consider the samples with correct predictions in the Standard-IE setting (we randomly select around 200 samples from each dataset for human annotation), because evaluating the reasons provided by ChatGPT for incorrect predictions is less valuable.

The ratio of samples with reasonable explanations (termed the reasonable score) is summarized in Table 4, from which we can derive the following conclusions. Firstly, both ChatGPT and the domain experts highly approve of the reasons given by ChatGPT, with the majority of datasets achieving a reasonable score of over 90% in the Standard-IE and OpenIE settings. These results demonstrate that ChatGPT gives very high-quality explanations for its predictions. Secondly, we observe that ChatGPT displays a high level of confidence in the reasons provided for its predictions when compared with human evaluation. In fact, ChatGPT achieves nearly a 100% reasonable score on almost all datasets, suggesting that it is very confident in its ability to provide reasonable explanations. Thirdly, we find that when ChatGPT provides a reasonable explanation for a prediction, there is a high level of agreement between ChatGPT and human evaluations, which suggests that ChatGPT may have a similar understanding of explanations as humans. Overall, our findings suggest that ChatGPT is capable of providing high-quality and reliable explanations for its predictions. This is a crucial step towards developing trustworthy and reliable LLMs.
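Under our reading of this setup, the three quantities in Table 4 can be computed per dataset as in the sketch below (self-check from ChatGPT's ifR_* keys, human-check from the experts' vote, and overlap as the fraction judged reasonable by both); this is an illustrative reconstruction rather than the exact evaluation script.

```python
# Reasonable score and overlap ratio for one dataset.
def explanation_scores(self_check: list, human_check: list) -> dict:
    """self_check/human_check: per-sample booleans for whether the reason is reasonable."""
    n = len(self_check)
    return {
        "self_check": sum(self_check) / n,
        "human_check": sum(human_check) / n,
        "overlap": sum(s and h for s, h in zip(self_check, human_check)) / n,
    }

print(explanation_scores([True, True, True, True], [True, True, False, True]))
# {'self_check': 1.0, 'human_check': 0.75, 'overlap': 0.75}
```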
Correct Confidence Incorrect Confidence
BERT RoBERTa ChatGPT BERT RoBERTa ChatGPT
BBN(ET) 0.971 0.968 0.888 0.904 0.885 0.828
CoNLL(NER) 0.990 0.991 0.864 0.866 0.886 0.785
SemEval(RC) 0.983 0.989 0.868 0.871 0.852 0.839
ACE05-R(RE) 0.995 0.991 0.760 0.883 0.810 0.764
ACE05-E(ED) 0.882 0.944 0.852 0.770 0.871 0.737
ACE05-E(EAE) 0.762 0.785 0.956 0.525 0.555 0.910
ACE05-E(EE) 0.763 0.782 0.845 0.612 0.628 0.764

Table 6: The prediction confidence of various models on the whole test set. We show both the correct confidence and the incorrect confidence for each method. We find that ChatGPT is overconfident in its predictions in most cases.

5.2 Calibration

In this subsection, we first investigate the level of confidence for both the correct and incorrect samples. Confidence is typically described as a probability value indicating the likelihood of belonging to a specific category. To obtain prediction probabilities from ChatGPT, we ask it to output the probability (Confidence_Standard and Confidence_Open), as discussed in § 3.3. Our aim is to investigate whether ChatGPT can provide reasonable confidence scores for its predictions, thus reducing the risk of misinterpretation. In Table 6, we present the confidence scores of correct and incorrect predictions from different models, referred to as correct confidence and incorrect confidence, respectively. Our observations reveal that all the models exhibit high confidence levels in their predictions, which is consistent with previous research on large models (Guo et al., 2017). Although ChatGPT performs worse than its BERT-based counterparts in the Standard-IE setting, it displays overconfidence in both correct and incorrect predictions. Consequently, this overconfidence could mislead users. Furthermore, we note a significant confidence gap between correct and incorrect predictions, indicating the need for careful evaluation when ChatGPT's prediction has relatively low confidence.

We then focus on calibration, a critical property of LLMs, as it estimates the predictive uncertainty needed for the secure application of LLMs. A well-calibrated model not only produces accurate predictions but also provides reliable and informative uncertainty estimates, which are necessary for sound decision-making. In this research, we evaluate calibration using the Expected Calibration Error (ECE) metric, which measures the deviation between predicted confidence and accuracy (we set the bin size to 50, dividing the prediction probabilities into 50 equally spaced bins for analysis). The results are shown in Table 7, from which we can observe that ChatGPT shows much poorer calibration than the BERT-based methods, which indicates that ChatGPT tends to produce confidences that do not represent true probabilities. Furthermore, although ChatGPT displays low ECE in tasks such as ET and NER, the miscalibration phenomenon dominates most cases. These findings suggest that ChatGPT needs improvement in terms of calibration, especially for IE tasks.

              BERT    RoBERTa  ChatGPT
BBN (ET)      0.012   0.012    0.026
CoNLL (NER)   0.052   0.044    0.204
SemEval (RC)  0.023   0.031    0.460
ACE05-R (RE)  0.020   0.014    0.745
ACE05-E (ED)  0.161   0.226    0.656
ACE05-E (EAE) 0.154   0.168    0.699
ACE05-E (EE)  0.211   0.288    0.699

Table 7: The expected calibration error (ECE), used to measure the calibration of a given model (lower is better). Results are calculated on the whole test set.
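For completeness, the ECE described above can be computed as in the minimal sketch below, using the 50 equally spaced bins mentioned earlier and assuming ChatGPT's 1-100 confidence scores have been rescaled to [0, 1].

```python
# Expected Calibration Error with equally spaced confidence bins.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 50) -> float:
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

conf = np.array([0.95, 0.90, 0.80, 0.60])  # rescaled Confidence_Standard values (illustrative)
acc = np.array([1.0, 0.0, 1.0, 0.0])       # 1 if the prediction matched the gold label
print(round(expected_calibration_error(conf, acc), 3))
```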
5.3 Faithfulness

Recent works show that ChatGPT may provide false information to users, potentially affecting their decision-making (Huang et al., 2023). Therefore, assessing the faithfulness of ChatGPT to the original text is a crucial measurement in developing a trustworthy information extraction model. Our study uses faithfulness to evaluate whether the explanation provided by ChatGPT aligns with the original text when its prediction is correct, as the original text is the most important source for extracting information. We collect two keys from the domain experts, namely FicR_Standard(Manual) and FicR_Open(Manual), as mentioned in § 3.3.

              Standard-IE  OpenIE
BBN (ET)      98.3%        99.3%
CoNLL (NER)   100.0%       98.7%
SemEval (RC)  100.0%       99.1%
ACE05-R (RE)  90.0%        93.8%
ACE05-E (ED)  100.0%       100.0%
ACE05-E (EAE) 100.0%       96.5%
ACE05-E (EE)  100.0%       97.0%

Table 8: The evaluation of faithfulness for ChatGPT. Faithfulness refers to whether ChatGPT's explanation aligns with the original text. Experimental results show that ChatGPT's explanations maintain a very high degree of faithfulness to the original text and contain nearly no false explanations.

Our results are shown in Table 8, which indicates a high degree of faithfulness between ChatGPT's explanations and the original text, with rare false explanations, i.e., over 95% of samples are considered faithful in nearly all datasets under both settings. We can conclude that ChatGPT's decision-making process primarily relies on the input of the original text, leading to the majority of its explanations being regarded as truthful and reliable.

6 Conclusion

In this paper, we systematically analyze ChatGPT's performance, explainability, calibration, and faithfulness. Specifically, based on 7 fine-grained information extraction tasks covering 14 datasets, we collect 15 keys identified by either ChatGPT or domain experts. Our findings reveal that ChatGPT's performance in the Standard-IE setting is not as good as that of BERT-based models in most cases. However, ChatGPT achieves excellent accuracy scores in the OpenIE setting, as evaluated by human annotators. Furthermore, ChatGPT can provide high-quality and trustworthy explanations for its predictions. One of the key issues we identified is its tendency towards overconfidence, resulting in low calibration. Our analysis also shows that ChatGPT exhibits a high level of faithfulness to the original text, indicating that its predictions are grounded in the input text. Given these findings, we hope that our research can inspire more work on using ChatGPT for information extraction.

References

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. 2021. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 7319–7328. Association for Computational Linguistics.

Rachith Aiyappa, Jisun An, Haewoon Kwak, and Yong-Yeol Ahn. 2023. Can we trust the evaluation on chatgpt? arXiv preprint arXiv:2303.12767.

Peggy M. Andersen, Philip J. Hayes, Steven P. Weinstein, Alison K. Huettner, Linda M. Schmandt, and Irene B. Nirenburg. 1992. Automatic extraction of facts from press releases to generate news stories. In 3rd Applied Natural Language Processing Conference, ANLP 1992, Trento, Italy, March 31 - April 3, 1992, pages 170–177. ACL.

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giridharan Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeffrey Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, and Veselin Stoyanov. 2022. Efficient large scale language modeling with mixtures of experts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 11699–11732. Association for Computational Linguistics.

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. CoRR, abs/2302.04023.

Zeljana Basic, Ana Banovac, Ivana Kruzic, and Ivan Jerkovic. 2023. Better by you, better than me, chatgpt3 as writing assistance in students essays. CoRR, abs/2302.04536.

Junyi Bian, Li Huang, Xiaodi Huang, Hong Zhou, and Shanfeng Zhu. 2021. Grantrel: Grant information extraction via joint entity and relation extraction. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 2674–2685. Association for Computational Linguistics.
Ning Bian, Xianpei Han, Le Sun, Hongyu Lin, Yaojie Lu, and Ben He. 2023. Chatgpt is a knowledgeable but inexperienced solver: An investigation of commonsense problem in large language models. arXiv preprint arXiv:2303.16421.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Yi Chen, Jiayang Cheng, Haiyun Jiang, Lemao Liu, Haisong Zhang, Shuming Shi, and Ruifeng Xu. 2022. Learning from sibling mentions with scalable graph inference in fine-grained entity typing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2076–2087. Association for Computational Linguistics.

Hai Leong Chieu, Hwee Tou Ng, and Yoong Keok Lee. 2003. Closing the gap: Learning-based information extraction rivaling knowledge-engineering methods. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 216–223, Sapporo, Japan. Association for Computational Linguistics.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 87–96. Association for Computational Linguistics.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311.

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4299–4307.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Antonia Creswell and Murray Shanahan. 2022. Faithful reasoning using large language models. CoRR, abs/2208.14271.

Jeremy Crowe. 1995. Constraint-based event recognition for information extraction. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 296–298, Cambridge, Massachusetts, USA. Association for Computational Linguistics.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 1790–1799. Association for Computational Linguistics.

Sarkar Snigdha Sarathi Das, Arzoo Katiyar, Rebecca J. Passonneau, and Rui Zhang. 2022. Container: Few-shot named entity recognition via contrastive learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 6338–6353. Association for Computational Linguistics.

Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pages 4171–4186. Association for Computational Linguistics.

Xinya Du and Claire Cardie. 2020. Event extraction by answering (almost) natural questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 671–683. Association for Computational Linguistics.

Xinya Du and Heng Ji. 2022. Retrieval-augmented generative question answering for event argument extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 4649–4666. Association for Computational Linguistics.

Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and Julius Berner. 2023. Mathematical capabilities of chatgpt. CoRR, abs/2301.13867.

Tsu-Jui Fu, Peng-Hsuan Li, and Wei-Yun Ma. 2019. Graphrel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1409–1418. Association for Computational Linguistics.

Yao Fu, Hao Peng, and Tushar Khot. 2022. How does gpt obtain its ability? tracing emergent abilities of language models to their sources. Yao Fu's Notion.

Jun Gao, Huan Zhao, Changlong Yu, and Ruifeng Xu. 2023. Exploring the feasibility of chatgpt for event extraction. CoRR, abs/2303.03836.

Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. 2014. Context-dependent fine-grained entity type tagging. CoRR, abs/1412.1820.

Andrej Zukov Gregoric, Yoram Bachrach, and Sam Coope. 2018. Named entity recognition with parallel recurrent neural networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 69–74. Association for Computational Linguistics.

Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. CoRR, abs/2301.07597.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR.

Mubin Ul Haque, Isuru Dharmadasa, Zarrin Tasnim Sworna, Roshan Namal Rajapakse, and Hussain Ahmad. 2022. "i think this is the most disruptive technology": Exploring sentiments of chatgpt early adopters using twitter data. CoRR, abs/2212.05856.

Hangfeng He, Hongming Zhang, and Dan Roth. 2023. Rethinking with retrieval: Faithful large language model inference. CoRR, abs/2301.00303.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval@ACL 2010, Uppsala University, Uppsala, Sweden, July 15-16, 2010, pages 33–38. The Association for Computer Linguistics.

I-Hung Hsu, Kuan-Hao Huang, Elizabeth Boschee, Scott Miller, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng. 2022. DEGREE: A data-efficient generation-based event extraction model. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 1890–1908. Association for Computational Linguistics.

Fan Huang, Haewoon Kwak, and Jisun An. 2023. Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech. CoRR, abs/2302.07736.

Katharina Jeblick, Balthasar Schachtner, Jakob Dexl, Andreas Mittermeier, Anna Theresa Stüber, Johanna Topalis, Tobias Weber, Philipp Wesp, Bastian Sabel, Jens Ricke, and Michael Ingrisch. 2022. Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports. CoRR, abs/2212.14882.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is chatgpt A good translator? A preliminary study. CoRR, abs/2301.08745.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316, Vancouver, Canada. Association for Computational Linguistics.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. CoRR, abs/2205.11916.

Gerd Kortemeyer. 2023. Could an artificial-intelligence agent pass an introductory physics course? arXiv preprint arXiv:2301.12127.

Fajri Koto, Jey Han Lau, and Timothy Baldwin. 2022. Can pretrained language models generate persuasive, faithful, and informative ad text for product descriptions? In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5), pages 234–243.

Sebastian Krügel, Andreas Ostermaier, and Matthias Uhl. 2023. The moral authority of chatgpt. CoRR, abs/2301.07098.

Ananya Kumar, Percy Liang, and Tengyu Ma. 2019. Verified uncertainty calibration. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 3787–3798.

Bo Li, Wei Ye, Jinglei Zhang, and Shikun Zhang. 2022a. Reviewing labels: Label graph network with top-k prediction set for relation extraction. CoRR, abs/2212.14270.

Bo Li, Dingyao Yu, Wei Ye, Jinglei Zhang, and Shikun Zhang. 2022b. Sequence generation with label augmentation for relation extraction. CoRR, abs/2212.14266.

Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020. A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5849–5859, Online. Association for Computational Linguistics.

Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019. Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1340–1350. Association for Computational Linguistics.

Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7999–8009. Association for Computational Linguistics.

Jian Liu, Yufeng Chen, and Jinan Xu. 2022a. Saliency as evidence: Event detection with trigger saliency attribution. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 4573–4585. Association for Computational Linguistics.

Xiao Liu, Heyan Huang, Ge Shi, and Bo Wang. 2022b. Dynamic prefix-tuning for generative template-based event extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 5216–5228. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Dongfang Lou, Zhilin Liao, Shumin Deng, Ningyu Zhang, and Huajun Chen. 2021. Mlbinet: A cross-sentence collective event detection network. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4829–4839. Association for Computational Linguistics.

Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, and Hua Wu. 2022. Unified structure generation for universal information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5755–5772, Dublin, Ireland. Association for Computational Linguistics.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3219–3232. Association for Computational Linguistics.

Yubo Ma, Zehao Wang, Yixin Cao, Mukai Li, Meiqi Chen, Kun Wang, and Jing Shao. 2022. Prompt for extraction? PAIE: prompting argument interaction for event argument extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 6759–6774. Association for Computational Linguistics.

Kyle Mahowald, Anna A. Ivanova, Idan Asher Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2023. Dissociating language and thought in large language models: a cognitive perspective. CoRR, abs/2301.06627.

Pedro Henrique Martins, Zita Marinho, and André F. T. Martins. 2019. Joint learning of named entity recognition and entity linking. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student Research Workshop, pages 190–196. Association for Computational Linguistics.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan T. McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 1906–1919. Association for Computational Linguistics.

Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. 2021. Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 15682–15694.

Sandra Mitrovic, Davide Andreoletti, and Omran Ayoub. 2023. Chatgpt or human? detect and explain. explaining decisions of machine learning model for detecting short chatgpt-generated text. CoRR, abs/2301.13852.

Oded Nov, Nina Singh, and Devin M. Mann. 2023. Putting chatgpt's medical advice to the (turing) test. CoRR, abs/2301.10035.

Miguel Ortega-Martín, Óscar García-Sierra, Alfonso Ardoiz, Jorge Álvarez, Juan Carlos Armenteros, and Adrián Alonso. 2023. Linguistic ambiguity analysis in chatgpt. arXiv preprint arXiv:2302.06426.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. CoRR, abs/2203.02155.

Kunyuan Pang, Haoyu Zhang, Jie Zhou, and Ting Wang. 2022. Divide and denoise: Learning from noisy labels in fine-grained entity typing with cluster-wise loss correction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 1997–2006. Association for Computational Linguistics.

Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cícero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. 2021. Structured prediction as translation between augmented natural languages. In ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. CoRR, abs/2302.12813.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew J. Johnson, Blake A. Hechtman, Laura Weidinger, Iason Gabriel, William S. Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. 2021. Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4932–4942. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 142–147. ACL.

Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. Whose opinions do language models reflect? arXiv preprint arXiv:2303.17548.

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zheng, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using deepspeed and megatron to train megatron-turing NLG 530b, A large-scale generative language model. CoRR, abs/2201.11990.

Teo Susnjak. 2022. Chatgpt: The end of online exam integrity? CoRR, abs/2212.09292.

Teo Susnjak. 2023. Applying bert and chatgpt for sentiment analysis of lyme disease in scientific literature. arXiv preprint arXiv:2302.06474.

[This excerpt of the reference list is truncated; the following entries are incomplete.]

In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5783–5788. Association for Computational Linguistics.

Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang, Wei Ye, Xiubo Geng, Binxing Jiao, Yue Zhang, and Xing Xie. 2023a. On the robustness of chatgpt: An adversarial and out-of-distribution perspective. CoRR, abs/2302.12095.

Xiao Wang, Weikang Zhou, Can Zu, Han Xia, Tianze Chen, Yuansen Zhang, Rui Zheng, Junjie Ye,
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Qi Zhang, Tao Gui, Jihua Kang, Jingsheng Yang,
Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Siyuan Li, and Chunsai Du. 2023b. Instructuie:
Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Multi-task instruction tuning for unified information
YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, extraction. CoRR, abs/2304.08085.
Amin Ghafouri, Marcelo Menegali, Yanping Huang,
Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang,
Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Zhongqiang Huang, Fei Huang, and Kewei Tu.
Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, 2021. Automated concatenation of embeddings
Igor Krivokon, Will Rusch, Marc Pickett, Kath- for structured prediction. In Proceedings of the
leen S. Meier-Hellstern, Meredith Ringel Morris, 59th Annual Meeting of the Association for Com-
Tulsee Doshi, Renelito Delos Santos, Toju Duke, putational Linguistics and the 11th International
Johnny Soraker, Ben Zevenbergen, Vinodkumar Joint Conference on Natural Language Processing,
Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen ACL/IJCNLP 2021, (Volume 1: Long Papers), Vir-
Olson, Alejandra Molina, Erin Hoffman-John, Josh tual Event, August 1-6, 2021, pages 2643–2660. As-
Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, sociation for Computational Linguistics.
Matthew Lamm, Viktoriya Kuzmina, Joe Fenton,
Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Yizhong Wang, Swaroop Mishra, Pegah Alipoormo-
Blaise Aguera-Arcas, Claire Cui, Marian Croak, labashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva
Ed H. Chi, and Quoc Le. 2022. Lamda: Lan- Naik, Arjun Ashok, Arut Selvan Dhanasekaran, An-
guage models for dialog applications. CoRR, jana Arunkumar, David Stap, Eshaan Pathak, Gian-
abs/2201.08239. nis Karamanolakis, Haizhi Gary Lai, Ishan Puro-
Sunil Thulasidasan, Gopinath Chennupati, Jeff A. hit, Ishani Mondal, Jacob Anderson, Kirby Kuz-
Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. nia, Krima Doshi, Kuntal Kumar Pal, Maitreya Pa-
2019. On mixup training: Improved calibration and tel, Mehrad Moradshahi, Mihir Parmar, Mirali Puro-
predictive uncertainty for deep neural networks. In hit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit
Advances in Neural Information Processing Systems Verma, Ravsehaj Singh Puri, Rushang Karia, Savan
32: Annual Conference on Neural Information Pro- Doshi, Shailaja Keyur Sampat, Siddhartha Mishra,
cessing Systems 2019, NeurIPS 2019, December 8- Sujan Reddy A, Sumanta Patro, Tanay Dixit, and
14, 2019, Vancouver, BC, Canada, pages 13888– Xudong Shen. 2022. Super-naturalinstructions:
13899. Generalization via declarative instructions on 1600+
NLP tasks. In Proceedings of the 2022 Conference
Ruibo Tu, Chao Ma, and Cheng Zhang. 2023. Causal- on Empirical Methods in Natural Language Process-
discovery performance of chatgpt in the context of ing, EMNLP 2022, Abu Dhabi, United Arab Emi-
neuropathic pain diagnosis. CoRR, abs/2301.13819. rates, December 7-11, 2022, pages 5085–5109. As-
sociation for Computational Linguistics.
Amir Pouran Ben Veyseh, Viet Dac Lai, Franck Der-
noncourt, and Thien Huu Nguyen. 2021. Unleash Jason Wei, Yi Tay, Rishi Bommasani, Colin Raf-
GPT-2 power for event detection. In Proceedings of fel, Barret Zoph, Sebastian Borgeaud, Dani Yo-
the 59th Annual Meeting of the Association for Com- gatama, Maarten Bosma, Denny Zhou, Donald Met-
putational Linguistics and the 11th International zler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals,
Joint Conference on Natural Language Processing, Percy Liang, Jeff Dean, and William Fedus. 2022a.
ACL/IJCNLP 2021, (Volume 1: Long Papers), Vir- Emergent abilities of large language models. CoRR,
tual Event, August 1-6, 2021, pages 6271–6282. As- abs/2206.07682.
sociation for Computational Linguistics.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
David Wadden, Ulme Wennberg, Yi Luan, and Han- Bosma, Ed H. Chi, Quoc Le, and Denny Zhou.
naneh Hajishirzi. 2019. Entity, relation, and event 2022b. Chain of thought prompting elicits reasoning
extraction with contextualized span representations. in large language models. CoRR, abs/2201.11903.
Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Kailin Zhao, Xiaolong Jin, Long Bai, Jiafeng Guo,
Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, and Xueqi Cheng. 2022. Knowledge-enhanced self-
Yufeng Chen, Meishan Zhang, Yong Jiang, and Wen- supervised prototypical network for few-shot event
juan Han. 2023. Zero-shot information extraction detection. In Findings of the Association for Com-
via chatting with chatgpt. CoRR, abs/2302.10205. putational Linguistics: EMNLP 2022, Abu Dhabi,
United Arab Emirates, December 7-11, 2022, pages
Ralph Weischedel and Ada Brunstein. 2005. Bbn pro- 6266–6275. Association for Computational Linguis-
noun coreference and entity type corpus. Linguistic tics.
Data Consortium, Philadelphia, 112.
Kang Zhao, Hua Xu, Yue Cheng, Xiaoteng Li, and Kai
Fei Wu and Daniel S. Weld. 2010. Open information Gao. 2021. Representation iterative fusion based
extraction using Wikipedia. In Proceedings of the on heterogeneous graph neural network for joint en-
48th Annual Meeting of the Association for Compu- tity and relation extraction. Knowl. Based Syst.,
tational Linguistics, pages 118–127, Uppsala, Swe- 219:106888.
den. Association for Computational Linguistics.
Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and
Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Dacheng Tao. 2023. Can chatgpt understand too?
Takeda, and Yuji Matsumoto. 2020. LUKE: deep a comparative study on chatgpt and fine-tuned bert.
contextualized entity representations with entity- arXiv preprint arXiv:2302.10198.
aware self-attention. In EMNLP 2020.
Wenxuan Zhou and Muhao Chen. 2021. An im-
proved baseline for sentence-level relation extrac-
Deming Ye, Yankai Lin, Peng Li, and Maosong Sun.
tion. CoRR, abs/2102.01373.
2022. Packed levitated marker for entity and relation
extraction. In Proceedings of the 60th Annual Meet- Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and
ing of the Association for Computational Linguistics Zhenchang Xing. 2023. Exploring AI ethics of chat-
(Volume 1: Long Papers), ACL 2022, Dublin, Ire- gpt: A diagnostic analysis. CoRR, abs/2301.12867.
land, May 22-27, 2022, pages 4904–4917. Associa-
tion for Computational Linguistics. Julia El Zini and Mariette Awad. 2023. On the explain-
ability of natural language processing deep models.
Wei Ye, Bo Li, Rui Xie, Zhonghao Sheng, Long Chen, ACM Comput. Surv., 55(5):103:1–103:31.
and Shikun Zhang. 2019. Exploiting entity BIO tag
embeddings and multi-task learning for relation ex- Xinyu Zuo, Haijin Liang, Ning Jing, Shuang Zeng,
traction with imbalanced data. In Proceedings of Zhou Fang, and Yu Luo. 2022. Type-enriched hierar-
the 57th Conference of the Association for Compu- chical contrastive strategy for fine-grained entity typ-
tational Linguistics, ACL 2019, Florence, Italy, July ing. In Proceedings of the 29th International Confer-
28- August 2, 2019, Volume 1: Long Papers, pages ence on Computational Linguistics, COLING 2022,
1351–1360. Association for Computational Linguis- Gyeongju, Republic of Korea, October 12-17, 2022,
tics. pages 2405–2417. International Committee on Com-
putational Linguistics.
Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao.
2015. Distant supervision for relation extraction via
piecewise convolutional neural networks. In Pro-
ceedings of the 2015 Conference on Empirical Meth-
ods in Natural Language Processing, pages 1753–
1762.

Bowen Zhang, Daijun Ding, and Liwen Jing. 2022a.


How would stance detection techniques evolve after
the launch of chatgpt? CoRR, abs/2212.14548.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor An-


geli, and Christopher D. Manning. 2017. Position-
aware attention and supervised data improve slot fill-
ing. In EMNLP 2017.

Zhisong Zhang, Emma Strubell, and Eduard H. Hovy.


2022b. Transfer learning from semantic role la-
beling to event argument extraction with template-
based slot querying. In Proceedings of the
2022 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2022, Abu Dhabi,
United Arab Emirates, December 7-11, 2022, pages
2627–2647. Association for Computational Linguis-
tics.
A Appendix

A.1 Dataset

We report the dataset used in each task in this subsection. For each task, we use two commonly used datasets for the evaluation. Table 10 shows the detailed statistical information.

Besides, we also report the number of manually annotated samples for each dataset, denoted as #Ann., as shown in Table 9.

Task    Dataset    #Ann.
ET      BBN        385
NER     CoNLL      300
RC      SemEval    400
RE      ACE05-R     66
ED      ACE05-E    218
EAE     ACE05-E    313
EE      ACE05-E    552

Table 9: The number of manually annotated samples for each dataset.

A.2 The State-of-the-Art Methods on Single Dataset

In this section, we introduce the state-of-the-art method on each dataset:

Entity Typing (ET): Zuo et al. (2022) proposed a type-enriched hierarchical contrastive strategy for the entity typing task, named PICOT. PICOT models differences between hierarchical types to distinguish similar types at different levels of granularity. It also embeds type information into entity contexts and employs a constrained contrastive strategy on the hierarchical structure. This method achieves SOTA results on the BBN and OntoNotes 5.0 datasets.

Named Entity Recognition (NER): Wang et al. (2021) proposed a model named ACE, which automates finding better embeddings for structured prediction tasks. It uses a neural architecture search-inspired formulation where a controller updates its belief based on a reward. The reward is the accuracy of a task model trained on a task dataset with the concatenated embeddings as input. This method achieves SOTA results on the CoNLL2003 dataset.

Relation Classification (RC): Li et al. (2022a) proposed the Label Graph Network with Top-k Prediction Set (KLG) to effectively utilize the Top-k prediction set. KLG builds a label graph for a given sample to review candidate labels in the Top-k prediction set and learns the connections between them. It also includes a dynamic k-selection mechanism to learn more powerful and discriminative relation representations. This method sets SOTA results on the TACRED dataset. Zhao et al. (2021) proposed RIFRE, which models relations and words as nodes on a graph and iteratively fuses the two types of semantic nodes using message passing. This approach obtains node representations that are better suited for relation extraction tasks. The model then performs relation extraction on the updated node representations. This method sets SOTA results on the SemEval2010 dataset.

Relation Extraction (RE): Ye et al. (2022) proposed PL-Marker, a novel span representation approach that considers the interrelation between span pairs by packing markers in the encoder. To better model entity boundary information, PL-Marker proposes a neighborhood-oriented packing strategy that considers neighbor spans integrally. For more complicated span pair classification tasks, this paper also designs a subject-oriented packing strategy, which packs each subject and its objects to model the interrelation between same-subject span pairs. This method sets SOTA results on the ACE05-R and SciERC datasets. It also achieves the best performance on the OntoNotes 5.0 dataset of the NER task.

Event Detection (ED): Liu et al. (2022a) proposed SaliencyED, a novel training mechanism for ED, which can distinguish between trigger-dependent and context-dependent types, and achieves promising results on the ACE05-E dataset. Lin et al. (2020) proposed the ONEIE neural framework, which globally optimizes information extraction as a graph from an input sentence, capturing cross-subtask and cross-instance interdependencies, and achieves promising results on the ACE05-E+ dataset.

Event Argument Extraction (EAE): Hsu et al. (2022) proposed DEGREE, which formulates event extraction as a conditional generation problem, summarizing events mentioned in a passage into a natural sentence following a predefined pattern, as learned from a prompt. Extracted event predictions are then obtained from the generated sentence using a deterministic algorithm. DEGREE sets the best results on both the ACE05-E and ACE05-E+ datasets.
Task                              Dataset                                   #class   #test
Entity Typing (ET)                BBN (Weischedel and Brunstein, 2005)      17       23542
                                  OntoNotes (Gillick et al., 2014)          41       13393
Named Entity Recognition (NER)    CoNLL 2003 (Sang and Meulder, 2003)       4        3453
                                  OntoNotes                                 18       8233
Relation Classification (RC)      TACRED (Zhang et al., 2017)               42       15517
                                  SemEval2010 (Hendrickx et al., 2010)      10       2717
Relation Extraction (RE)          ACE05-R                                   7/6      2050
                                  SciERC (Luan et al., 2018)                6/7      551
Event Detection (ED)              ACE05-E (Wadden et al., 2019)             33       832
                                  ACE05-E+ (Lin et al., 2020)               33       676
Event Argument Extraction (EAE)   ACE05-E (Wadden et al., 2019)             22/33    403
                                  ACE05-E+ (Lin et al., 2020)               22/33    424
Event Extraction (EE)             ACE05-E (Wadden et al., 2019)             22/33    832
                                  ACE05-E+ (Lin et al., 2020)               22/33    676

Table 10: The table presents several key statistical characteristics of the datasets used in our research, including 14 datasets belonging to 7 different IE tasks.

Event Extraction (EE): the SOTA methods are ONEIE for ACE05-E, and DEGREE for ACE05-E+.

A.3 Exemplar of the Input

In this section, we show an input example for the event detection task to help readers understand our implementation, as shown in Table 11.
Input of Event Detection (ED)

Task Description: Given an input list of words, identify all triggers in the list, and categorize each of them into the predefined set of event types. A trigger is the main word that most clearly expresses the occurrence of an event in the predefined set of event types.

Pre-defined Label Set: The predefined set of event types includes: [Life.Be-Born, Life.Marry, Life.Divorce, Life.Injure, Life.Die, Movement.Transport, Transaction.Transfer-Ownership, Transaction.Transfer-Money, Business.Start-Org, Business.Merge-Org, Business.Declare-Bankruptcy, Business.End-Org, Conflict.Attack, Conflict.Demonstrate, Contact.Meet, Contact.Phone-Write, Personnel.Start-Position, Personnel.End-Position, Personnel.Nominate, Personnel.Elect, Justice.Arrest-Jail, Justice.Release-Parole, Justice.Trial-Hearing, Justice.Charge-Indict, Justice.Sue, Justice.Convict, Justice.Sentence, Justice.Fine, Justice.Execute, Justice.Extradite, Justice.Acquit, Justice.Appeal, Justice.Pardon].

Input and Task Requirement: Perform ED task for the following input list, and print the output: [’Putin’, ’concluded’, ’his’, ’two’, ’days’, ’of’, ’talks’, ’in’, ’Saint’, ’Petersburg’, ’with’, ’Jacques’, ’Chirac’, ’of’, ’France’, ’and’, ’German’, ’Chancellor’, ’Gerhard’, ’Schroeder’, ’on’, ’Saturday’, ’still’, ’urging’, ’for’, ’a’, ’central’, ’role’, ’for’, ’the’, ’United’, ’Nations’, ’in’, ’a’, ’post’, ’-’, ’war’, ’revival’, ’of’, ’Iraq’, ’.’] The output of ED task should be a list of dictionaries following json format. Each dictionary corresponds to the occurrence of an event in the input list and should consists of "trigger", "word_index", "event_type", "top3_event_type", "top5_event_type", "confidence", "if_context_dependent", "reason" and "if_reasonable" nine keys. The value of "word_index" key is an integer indicating the index (start from zero) of the "trigger" in the input list. The value of "confidence" key is an integer ranging from 0 to 100, indicating how confident you are that the "trigger" expresses the "event_type" event. The value of "if_context_dependent" key is either 0 (indicating the event semantic is primarily expressed by the trigger rather than contexts) or 1 (indicating the event semantic is primarily expressed by contexts rather than the trigger). The value of "reason" key is a string describing the reason why the "trigger" expresses the "event_type", and do not use any " mark in this string. The value of "if_reasonable" key is either 0 (indicating the reason given in the "reason" field is not reasonable) or 1 (indicating the reason given in the "reason" field is reasonable). Note that your answer should only contain the json string and nothing else.

Table 11: The input example of the event detection task. This example is extracted from ACE05-E, and all three parts above are jointly fed into ChatGPT as a single input.
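To make the pipeline behind Table 11 more concrete, the sketch below shows one possible way to assemble this three-part prompt and to validate a returned JSON answer against the nine required keys. It is only an illustration under our own assumptions: the helper names (build_ed_prompt, parse_ed_response), the truncated label list, and the abbreviated requirement text are hypothetical and are not taken from the released code.

import json

# Hypothetical helpers for the ED prompt of Table 11. The event-type list is
# truncated here; the actual prompt enumerates all 33 ACE event types.
EVENT_TYPES = [
    "Life.Be-Born", "Life.Marry", "Life.Die", "Movement.Transport",
    "Conflict.Attack", "Contact.Meet", "Justice.Arrest-Jail",
]

# The nine keys that each predicted event dictionary is asked to contain.
REQUIRED_KEYS = {
    "trigger", "word_index", "event_type", "top3_event_type",
    "top5_event_type", "confidence", "if_context_dependent",
    "reason", "if_reasonable",
}


def build_ed_prompt(words):
    """Assemble the three parts of the ED prompt: task description,
    predefined label set, and input plus task requirement."""
    task_description = (
        "Task Description: Given an input list of words, identify all "
        "triggers in the list, and categorize each of them into the "
        "predefined set of event types."
    )
    label_set = (
        "Pre-defined Label Set: The predefined set of event types "
        "includes: [" + ", ".join(EVENT_TYPES) + "]."
    )
    requirement = (
        "Input and Task Requirement: Perform ED task for the following "
        "input list, and print the output: " + str(words) + " The output "
        "of ED task should be a list of dictionaries following json format."
    )
    return "\n".join([task_description, label_set, requirement])


def parse_ed_response(response_text):
    """Parse the model's raw answer and keep only well-formed predictions."""
    predictions = json.loads(response_text)  # the prompt requests pure JSON
    valid = []
    for pred in predictions:
        if not REQUIRED_KEYS.issubset(pred):
            continue  # drop answers missing any of the nine required keys
        if pred["event_type"] not in EVENT_TYPES:
            continue  # drop labels outside the predefined set
        valid.append(pred)
    return valid

Since the prompt instructs ChatGPT to return only the JSON string, json.loads is assumed to be sufficient here; a more tolerant parser may be needed when the model wraps its answer in extra text.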
