arXiv:2305.12212v2 [cs.CL] 18 Oct 2023

Abstract

Multimodal Named Entity Recognition (MNER) on social media aims to enhance textual entity prediction by incorporating …

[Figure residue: overview diagram. Recoverable labels: Heuristics; Implicit knowledge; Multimodal Similar Example Awareness; Text + Caption; Predefined artificial samples; Frozen ChatGPT; Transformer-based Embeddings; P(y | T, Z); example outputs such as "CP3 is likely an abbreviation or nickname for Chris Paul.", "Tim Duncan is a retired NBA player who is featured in the photo with Chris Paul.", and "Reasoning: The sentence mentions … the Seahawks, Answer: Named entities: 1. Connor Cook…".]
ternal testing stage and has not yet been opened for public use. In addition, compared with ChatGPT, GPT-4 has higher costs and slower API request speeds. To enhance the reproducibility of PGIM, we still choose ChatGPT as the main research object of our method; the paradigm provided by PGIM can also be used with GPT-4. To enable ChatGPT to complete the image-text multimodal task, we use an advanced multimodal pre-trained model to convert images into image captions. Inspired by PICa (Yang et al., 2022) and Prophet (Shao et al., 2023) in knowledge-based VQA, PGIM formulates the testing input x as the following template:

Text: t \n Image: p \n Question: q \n Answer:

where t, p and q represent specific test inputs, and \n stands for a line break in the template. Similarly, each in-context example c_i is defined with a similar template:

Text: t_i \n Image: p_i \n Question: q \n Answer: a_i

where t_i, p_i, q and a_i refer to a text-image-question-answer quadruple retrieved from the predefined artificial samples. The complete MNER prompt, consisting of a fixed prompt head, some in-context examples, and a test input, is fed to ChatGPT for auxiliary knowledge generation.

3.2 Stage-1. Auxiliary Refined Knowledge Heuristic Generation

Predefined artificial samples The key to making ChatGPT perform better in MNER is to choose suitable in-context examples. Acquiring accurately annotated in-context examples that precisely reflect the annotation style of the dataset and provide a means to expand auxiliary knowledge poses a significant challenge, and directly acquiring such examples from the original dataset is not feasible. To address this issue, we employ a random sampling approach to select a small subset of samples from the training set for manual annotation. Specifically, for the Twitter-2017 dataset we randomly sample 200 samples from the training set for manual labeling, and for the Twitter-2015 dataset the number is 120. The annotation process comprises two main components. The first part involves identifying the named entities within the sentences, and the second part involves providing comprehensive justification by considering the image and text content, as well as relevant knowledge. For the many possibilities encountered in the labeling process, the annotator needs to judge and interpret the sample correctly from a human perspective. For samples where image and text are related, we directly state which entities in the text are emphasized by the image. For samples where the image and text are unrelated, we directly declare that the image description is unrelated to the text. Through this manual annotation process, we emphasize the entities and their corresponding categories within the sentences. Furthermore, we incorporate relevant auxiliary knowledge to support these judgments. This meticulous annotation process serves as a guide for ChatGPT, enabling it to generate highly relevant and valuable responses.

Multimodal Similar Example Awareness Module Since the few-shot learning ability of GPT largely depends on the selection of in-context examples (Liu et al., 2021; Yang et al., 2022), we design a Multimodal Similar Example Awareness (MSEA) module to select appropriate in-context examples. As a classic multimodal task, the prediction of MNER relies on the integration of both textual and visual information. Accordingly, PGIM leverages the fused features of text and image as the fundamental criterion for assessing similar examples. This multimodal fusion feature can be obtained from various previous vanilla MNER models.

Denote the MNER dataset D and the predefined artificial samples G as:

D = \{(t_i, p_i, y_i)\}_{i=1}^{M}, \quad G = \{(t_j, p_j, y_j)\}_{j=1}^{N}

where t_i, p_i, y_i refer to the text, image, and gold labels. The vanilla MNER model M trained on D mainly consists of a backbone encoder M_b and a CRF decoder M_c. The input multimodal image-text pair is encoded by the encoder M_b to obtain the multimodal fusion features H:

H = M_b(t, p)

In previous studies, the fusion feature H, after cross-attention projection into the high-dimensional latent space, was directly input to the decoder layer for prediction. Unlike them, PGIM chooses H as the judgment basis for similar examples, because examples that are close in the high-dimensional latent space are more likely to share the same mapping and entity types. PGIM calculates the cosine similarity of the fused feature H between the test input and each predefined artificial sample, and the top-N most similar predefined artificial samples are selected as in-context examples to enlighten ChatGPT to generate auxiliary refined knowledge:

I = \operatorname{argTopN}_{j \in \{1,2,\ldots,N\}} \frac{H^{\top} H_j}{\lVert H \rVert_2 \lVert H_j \rVert_2}

I is the index set of the top-N similar samples in G. The in-context examples C are defined as follows:

C = \{(t_j, p_j, y_j) \mid j \in I\}

In order to realize the awareness of similar examples efficiently, all the multimodal fusion features can be calculated and stored in advance.

Heuristics-enhanced Prompt Generation After obtaining the in-context examples C, PGIM builds a complete heuristics-enhanced prompt to exploit the few-shot learning ability of ChatGPT in MNER. A prompt head, a set of in-context examples, and a testing input together form a complete prompt. The prompt head describes the MNER task in natural language according to the requirements. Given that the input image and text may not always have a direct correlation, PGIM encourages ChatGPT to exercise its own discretion. The in-context examples are constructed from the results C = \{c_1, \ldots, c_n\} of the MSEA module. For the testing input, the answer slot is left blank for ChatGPT to generate. The complete format of the prompt template is shown in Appendix A.4.

3.3 Stage-2. Entity Prediction based on Auxiliary Refined Knowledge

Define the auxiliary knowledge generated by ChatGPT after in-context learning as Z = \{z_1, \ldots, z_m\}, where m is the length of Z. PGIM concatenates the original text T = \{t_1, \ldots, t_n\} with the obtained auxiliary refined knowledge Z as [T; Z] and feeds it to the transformer-based encoder:

\{h_1, \ldots, h_n, \ldots, h_{n+m}\} = \operatorname{embed}([T; Z])

Due to the attention mechanism employed by the transformer-based encoder, the token representations H = \{h_1, \ldots, h_n\} encompass pertinent cues from the auxiliary knowledge Z. Similar to previous studies, PGIM feeds H to a standard linear-chain CRF layer, which defines the probability of the label sequence y given the input sentence T:

P(y \mid T, Z) = \frac{\prod_{i=1}^{n} \psi(y_{i-1}, y_i, h_i)}{\sum_{y' \in Y} \prod_{i=1}^{n} \psi(y'_{i-1}, y'_i, h_i)}

where \psi(y_{i-1}, y_i, h_i) and \psi(y'_{i-1}, y'_i, h_i) are potential functions. Finally, PGIM uses the negative log-likelihood (NLL) as the loss function for the input sequence with gold labels y^*:

\mathcal{L}_{NLL}(\theta) = -\log P_{\theta}(y^* \mid T, Z)

4 Experiments

4.1 Settings

Datasets We conduct experiments on two public MNER datasets: Twitter-2015 (Zhang et al., 2018) and Twitter-2017 (Lu et al., 2018). These two classic MNER datasets contain 4000/1000/3257 and 3373/723/723 (train/development/test) image-text pairs posted by users on Twitter.

Model Configuration PGIM chooses the backbone of UMT (Yu et al., 2020) as the vanilla MNER model to extract multimodal fusion features; this backbone completes multimodal fusion without much modification. BLIP-2 (Li et al., 2023), an advanced multimodal pre-trained model, is used to convert images into image captions. The version of ChatGPT used in the experiments is gpt-3.5-turbo, with the sampling temperature set to 0. For a fair comparison, PGIM uses the same text encoder, XLM-RoBERTa-large (Conneau et al., 2019), as ITA (Wang et al., 2021a), PromptMNER (Wang et al., 2022b), CAT-MNER (Wang et al., 2022c) and MoRe (Wang et al., 2022a).

Implementation Details PGIM is trained with PyTorch on a single NVIDIA RTX 3090 GPU. During training, we use the AdamW optimizer (Loshchilov and Hutter, 2017) to minimize the loss function. We use grid search to find the learning rate for the embeddings within [1 × 10^-6, 5 × 10^-5]. Due to the different labeling styles of the two datasets, the learning rates for Twitter-2015 and Twitter-2017 are finally set to 5 × 10^-6 and 7 × 10^-6, respectively. We also use a warmup linear scheduler to control the learning rate. The maximum sentence input length is set to 256, and the mini-batch size is set to 4. The model is trained for 25 epochs, and the checkpoint with the highest F1-score on the development set is selected for evaluation on the test set. The number of in-context examples N in PGIM is set to 5. All results are averaged over 3 runs with different random seeds.

4.2 Main Results

We compare PGIM with previous state-of-the-art approaches to MNER in Table 1. The first group of methods includes BiLSTM-CRF (Huang et al., 2015) and BERT-CRF (Devlin et al., 2018), as well as the span-based NER models (e.g., BERT-span, RoBERTa-span (Yamada et al., 2020)), which only consider the original text. The second group of methods includes several recent multimodal approaches for the MNER task: UMT (Yu et al., 2020), UMGF (Zhang et al., 2021), MNER-QG (Jia et al., 2022), R-GCN (Zhao et al., 2022), ITA (Wang et al., 2021a), PromptMNER (Wang et al., 2022b), CAT-MNER (Wang et al., 2022c) and MoRe (Wang et al., 2022a), which consider both the text and the corresponding images.

The experimental results demonstrate the superiority of PGIM over previous methods: PGIM surpasses the previous state-of-the-art method MoRe (Wang et al., 2022a). This suggests that, compared with the auxiliary knowledge MoRe retrieves from Wikipedia, our refined auxiliary knowledge offers more substantial support. Furthermore, PGIM exhibits a more significant improvement on Twitter-2017 than on Twitter-2015. This can be attributed to the more complete and standardized labeling of Twitter-2017, in contrast to Twitter-2015. Apparently, the quality of dataset annotation has some influence on the accuracy of an MNER model. In cases where the dataset annotation deviates from the ground truth, accurate and refined auxiliary knowledge leads the model to prioritize predicting the truly correct entities, since the process by which ChatGPT heuristically generates auxiliary knowledge is not affected by mislabeling. This phenomenon highlights the robustness of PGIM. The ultimate objective of MNER is to support downstream tasks effectively, and downstream tasks of MNER expect model outputs that are unaffected by irregular labeling in the training dataset. We further demonstrate this argument through a case study, detailed in Appendix A.3.

4.3 Detailed Analysis

Impact of different text encoders on performance As shown in Table 2, we perform experiments by replacing the XLM-RoBERTa-large (Conneau et al., 2019) encoder of all MNER methods with BERT-base (Kenton and Toutanova, 2019). Baseline(BERT) represents inputting the original samples into BERT-CRF. All results are averaged over 3 runs with different random seeds. The marker * refers to a significance-test p-value < 0.05.
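The similar-example selection of the MSEA module (Section 3.2) can be sketched in a few lines of NumPy. This is a minimal illustration of the cosine-similarity top-N rule with precomputed fusion features; function and variable names are ours, not from the PGIM release:

```python
import numpy as np

def top_n_similar(h_test: np.ndarray, h_pool: np.ndarray, n: int = 5) -> np.ndarray:
    """Indices of the top-n pool samples by cosine similarity to the test feature.

    h_test: (d,) fused feature H of the test input.
    h_pool: (N, d) fused features of the predefined samples G, computed and stored in advance.
    """
    # Cosine similarity: (H^T H_j) / (||H||_2 ||H_j||_2)
    sims = h_pool @ h_test / (
        np.linalg.norm(h_pool, axis=1) * np.linalg.norm(h_test) + 1e-12
    )
    # argTopN: indices of the n largest similarities, most similar first
    return np.argsort(-sims)[:n]

# Toy usage: 4 pool samples in a 3-dimensional latent space
pool = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
test = np.array([1.0, 0.05, 0.0])
idx = top_n_similar(test, pool, n=2)
print(idx)  # the two pool rows closest in direction to `test`
```

Because the pool features depend only on G, precomputing `h_pool` once makes each test-time selection a single matrix-vector product, which is what makes the module efficient in practice.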
Methods | Twitter-2015: Single Type (F1) PER LOC ORG OTH., Overall Pre. Rec. F1 | Twitter-2017: Single Type (F1) PER LOC ORG OTH., Overall Pre. Rec. F1
Text
BiLSTM-CRF† 76.77 72.56 41.33 26.80 68.14 61.09 64.42 85.12 72.68 72.50 52.56 79.42 73.43 76.31
BERT-CRF‡ 85.37 81.82 63.26 44.13 75.56 73.88 74.71 90.66 84.89 83.71 66.86 86.10 83.85 84.96
BERT-SPAN‡ 85.35 81.88 62.06 43.23 75.52 73.83 74.76 90.84 85.55 81.99 69.77 85.68 84.60 85.14
RoBERTa-SPAN‡ 87.20 83.58 66.33 50.66 77.48 77.43 77.45 94.27 86.23 87.22 74.94 88.71 89.44 89.06
Text+Image
UMT 85.24 81.58 63.03 39.45 71.67 75.23 73.41 91.56 84.73 82.24 70.10 85.28 85.34 85.31
UMGF 84.26 83.17 62.45 42.42 74.49 75.21 74.85 91.92 85.22 83.13 69.83 86.54 84.50 85.51
MNER-QG 85.68 81.42 63.62 41.53 77.76 72.31 74.94 93.17 86.02 84.64 71.83 88.57 85.96 87.25
R-GCN 86.36 82.08 60.78 41.56 73.95 76.18 75.00 92.86 86.10 84.05 72.38 86.72 87.53 87.11
ITA - - - - - - 78.03 - - - - - - 89.75
PromptMNER - - - - 78.03 79.17 78.60 - - - - 89.93 90.60 90.27
CAT-MNER 88.04 84.70 68.04 52.33 78.75 78.69 78.72 94.61 88.40 88.14 80.50 90.27 90.67 90.47
MoReText - - - - - - 77.79 - - - - - - 89.49
MoReImage - - - - - - 77.57 - - - - - - 90.28
MoReMoE - - - - - - 79.21 - - - - - - 90.67
PGIM(Ours) 88.34 84.22 70.15 52.34 79.21 79.45 79.33* 96.46 89.89 89.03 79.62 90.86 92.01 91.43*
±0.02 ±0.12 ±0.36 ±0.98 ±0.63 ±0.22 ±0.06 ±0.02 ±0.68 ±0.53 ±2.25 ±0.16 ±0.07 ±0.09
Table 1: Performance comparison on the Twitter-15 and Twitter-17 datasets. For the baseline models, results of methods marked † come from Yu et al. (2020), and results marked ‡ come from Wang et al. (2022c). The results of the multimodal methods are all taken from the corresponding original papers. The marker * indicates a significance-test p-value < 0.05 when comparing with CAT-MNER and the MoRe Text/Image variants.
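The linear-chain CRF objective of Stage-2 (Section 3.3) can be checked numerically. Below is a minimal sketch of the label-sequence probability and its NLL loss, using log-potentials and the forward algorithm with an assumed fixed start state; it is an illustration, not the PGIM implementation:

```python
import numpy as np

def crf_nll(log_psi: np.ndarray, labels) -> float:
    """Negative log-likelihood -log P(y*|T,Z) of a linear-chain CRF.

    log_psi: (n, K, K) array; log_psi[i, a, b] = log psi(y_{i-1}=a, y_i=b, h_i).
    labels:  gold label sequence y* of length n (y_0 is taken as a fixed start state 0).
    """
    n, K, _ = log_psi.shape
    # Score of the gold path: sum_i log psi(y*_{i-1}, y*_i, h_i)
    prev, gold = 0, 0.0
    for i, y in enumerate(labels):
        gold += log_psi[i, prev, y]
        prev = y
    # Log partition (denominator): forward algorithm, log-sum-exp over all paths
    alpha = log_psi[0, 0, :]
    for i in range(1, n):
        alpha = np.logaddexp.reduce(alpha[:, None] + log_psi[i], axis=0)
    log_z = np.logaddexp.reduce(alpha)
    return float(log_z - gold)

# Sanity check: with uniform potentials every path is equally likely,
# so P(y*) = 1 / K^n and the NLL equals n * log(K).
n, K = 3, 4
nll = crf_nll(np.zeros((n, K, K)), [1, 2, 0])
print(nll)
```

The log-sum-exp recursion makes the normalization over all K^n label sequences tractable in O(n K^2) time, which is why the CRF layer adds little cost on top of the encoder.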
[Figure content: case examples showing, for each post, the original text, the generated caption, the named entities, and ChatGPT's reasoning; e.g., the Idris Elba/Maxim cover case and the Mumbai BJP protest case (entities: Sootradhar (person/handle), Mumbai (location), BJP (political party), Police (law enforcement), Mantralaya (government building)).]

Figure 3: Four case studies of how auxiliary refined knowledge can help model predictions.
and "Mumbai BJP" are all entities that were not accurately predicted by past methods. Because our auxiliary refined knowledge provides explicit explanations for such entities, PGIM makes the correct prediction.

Limitations

In this paper, PGIM enables the integration of multimodal tasks into large language models by converting images into image captions. While PGIM achieves impressive results, we consider this Text-Text paradigm a transitional phase in the development of MNER rather than the ultimate solution, because image captions are inherently limited in their ability to fully capture all the details of an image. This issue may potentially be further resolved in conjunction with the advancement of multimodal capabilities in language and vision models (e.g., GPT-4).

References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Meihuizi Jia, Lei Shen, Xin Shen, Lejian Liao, Meng Chen, Xiaodong He, Zhendong Chen, and Jiaqi Li. 2022. MNER-QG: An end-to-end MRC framework for multimodal named entity recognition with query grounding. arXiv preprint arXiv:2211.14739.

Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, page 2.

John Lafferty, Andrew McCallum, and Fernando C.N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Dong-Ho Lee, Akshen Kadakia, Kangmin Tan, Mahak Agarwal, Xinyu Feng, Takashi Shibuya, Ryosuke Mitani, Toshiyuki Sekiya, Jay Pujara, and Xiang Ren. 2021. Good examples make a faster learner: Simple demonstration-based learning for low-resource NER. arXiv preprint arXiv:2110.08454.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, pages 740–755. Springer.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for GPT-3? arXiv preprint arXiv:2101.06804.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Di Lu, Leonardo Neves, Vitor Carvalho, Ning Zhang, and Heng Ji. 2018. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1990–1999.

Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018. Multimodal named entity recognition for short social media posts. arXiv preprint arXiv:1802.07862.

Yasmin Moslem, Rejwanul Haque, and Andy Way. 2023. Adaptive machine translation with large language models. arXiv preprint arXiv:2301.13294.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2021. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633.

Erik F. Sang and Jorn Veenstra. 1999. Representing text chunks. arXiv preprint cs/9907006.

Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. 2023. Prompting large language models with answer heuristics for knowledge-based visual question answering. arXiv preprint arXiv:2303.01903.

Lin Sun, Jiquan Wang, Kai Zhang, Yindu Su, and Fangsheng Weng. 2021. RpBERT: A text-image relation propagation-based BERT model for multimodal NER. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13860–13868.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2022. Prompting PaLM for translation: Assessing strategies and performance. arXiv preprint arXiv:2211.09102.

Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang, Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. 2023a. GPT-NER: Named entity recognition via large language models. arXiv preprint arXiv:2304.10428.

Xinyi Wang, Wanrong Zhu, and William Yang Wang. 2023b. Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning. arXiv preprint arXiv:2301.11916.

Xinyu Wang, Jiong Cai, Yong Jiang, Pengjun Xie, Kewei Tu, and Wei Lu. 2022a. Named entity and relation extraction with multi-modal retrieval. arXiv preprint arXiv:2212.01612.

Xinyu Wang, Min Gui, Yong Jiang, Zixia Jia, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021a. ITA: Image-text alignments for multi-modal named entity recognition. arXiv preprint arXiv:2112.06482.

Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021b. Improving named entity recognition by external context retrieving and cooperative learning. arXiv preprint arXiv:2105.03654.

Xuwu Wang, Junfeng Tian, Min Gui, Zhixu Li, Jiabo Ye, Ming Yan, and Yanghua Xiao. 2022b. PromptMNER: Prompt-based entity-related visual clue extraction and integration for multimodal named entity recognition. In Database Systems for Advanced Applications: 27th International Conference, DASFAA 2022, Virtual Event, April 11–14, 2022, Proceedings, Part III, pages 297–305. Springer.

Xuwu Wang, Jiabo Ye, Zhixu Li, Junfeng Tian, Yong Jiang, Ming Yan, Ji Zhang, and Yanghua Xiao. 2022c. CAT-MNER: Multimodal named entity recognition with knowledge-refined cross-modal attention. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE.

Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, et al. 2023. Zero-shot information extraction via chatting with ChatGPT. arXiv preprint arXiv:2302.10205.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. arXiv preprint arXiv:2010.01057.

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. 2022. An empirical study of GPT-3 for few-shot knowledge-based VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089.

Jianfei Yu, Jing Jiang, Li Yang, and Rui Xia. 2020. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. Association for Computational Linguistics.

Dong Zhang, Suzhong Wei, Shoushan Li, Hanqian Wu, Qiaoming Zhu, and Guodong Zhou. 2021. Multi-modal graph fusion for named entity recognition with targeted visual guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14347–14355.

Qi Zhang, Jinlan Fu, Xiaoyu Liu, and Xuanjing Huang. 2018. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Xin Zhang, Yong Jiang, Xiaobin Wang, Xuming Hu, Yueheng Sun, Pengjun Xie, and Meishan Zhang. 2022. Domain-specific NER via retrieving correlated samples. arXiv preprint arXiv:2208.12995.

Fei Zhao, Chunhui Li, Zhen Wu, Shangyu Xing, and Xinyu Dai. 2022. Learning from different text-image pairs: A relation-enhanced graph convolutional network for multimodal NER. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3983–3992.

Changmeng Zheng, Zhiwei Wu, Tao Wang, Yi Cai, and Qing Li. 2020. Object-aware multimodal named entity recognition in social media posts with adversarial learning. IEEE Transactions on Multimedia, 23:2520–2532.

Baohang Zhou, Ying Zhang, Kehui Song, Wenya Guo, Guoqing Zhao, Hongbin Wang, and Xiaojie Yuan. 2022. A span-based multimodal variational autoencoder for semi-supervised multimodal named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6293–6302.

A Appendix

A.1 Generalization Analysis

Due to the distinctive underlying logic of PGIM in incorporating auxiliary knowledge to enhance entity recognition, PGIM exhibits a stronger generalization capability that is not heavily reliant on specific datasets. Twitter-2015→2017 denotes that the model is trained on Twitter-2015 and tested on Twitter-2017, and vice versa. The results in Table 6 show that the generalization ability of PGIM is significantly improved compared with previous methods. This further validates the efficacy and superiority of our auxiliary refined knowledge in enhancing model performance.

A.2 Comparison with MoRe

As the previous state-of-the-art method, MoRe retrieves relevant knowledge from Wikipedia to assist entity prediction. We experimentally compare the quality of the auxiliary knowledge of MoRe and PGIM; the results are shown in Table 5. The baseline method relies solely on the original text without any incorporation of auxiliary information. MoRe(Text) and MoRe(Image) denote the relevant knowledge of the input text and image retrieved using the text retriever and the image retriever, respectively. Ave. length represents the average length of the auxiliary knowledge over the entire dataset. Memory indicates the GPU memory required for training the model. Ave. Improve represents the average result
Methods | Single Type (F1): PER LOC ORG OTH. | Overall: Pre. Rec. F1 | Ave. length | Memory (MB) | Ave. Improve
Twitter-2015
BaseLine 87.04 83.49 67.34 50.16 76.45 78.22 77.32 - 11865 -
MoReText 86.92 83.08 68.20 49.15 77.12 77.77 77.45 227.41 16759 ↓ 0.05
MoReImage 87.38 83.78 67.75 49.38 77.44 78.06 77.75 203.00 16711 ↑ 0.28
PGIM(Ours) 88.34 84.22 70.15 52.34 79.21 79.45 79.33 104.56 13901 ↑ 1.86
Twitter-2017
BaseLine 95.07 87.22 85.82 78.66 88.46 90.23 89.34 - 11801 -
MoReText 95.16 88.77 87.00 77.71 89.33 90.45 89.89 241.47 16695 ↑ 0.50
MoReImage 94.43 87.43 86.22 74.77 88.06 89.49 88.77 192.00 16447 ↓ 0.80
PGIM(Ours) 96.46 89.89 89.03 79.62 90.86 92.01 91.43 94.52 13279 ↑ 2.07
Table 5: Comparison of MoRe with PGIM. Since the original paper of MoRe (Wang et al., 2022a) did not report its
Single Type (F1) on the Twitter-2015 and Twitter-2017 datasets, we run its released code and count the results. All
of the results are averaged from 3 runs with different random seeds.
[Figure content: for two test cases (the Mumbai BJP protest and "Bush 41, 43 Do Not Plan to Endorse Donald Trump"), the lengthy, largely off-topic Wikipedia passages retrieved by MoRe's text retriever are contrasted with PGIM's concise auxiliary refined knowledge (named-entity lists plus reasoning), with PGIM's correct and incorrect entity predictions marked.]

Figure 4: Two case studies on how information retrieved from Wikipedia by MoRe and information retrieved by PGIM from ChatGPT affects model predictions.
[Figure content: four further case studies (a crash in Inverurie reported by the Press Journal, concerts featuring Iiro Rantala with Peter Erskine and Johannes Weidemuller, South Africa's slide into technical recession, and a Robin Sharma quote photo), each showing the original text, the generated image caption, and the auxiliary refined knowledge, i.e., the named entities with types and ChatGPT's reasoning about whether the image is related to the text.]
A.4 Prompt Templates

Prompt template asking ChatGPT to generate auxiliary refined knowledge:

Here are some content that people post on Twitter, and these content are composed of original text and image descriptions of the original text. Please note that the text and image descriptions here may or may not be relevant, so make your own judgment. Please follow the data annotation style and method reflected in the example I provided, comprehensively analyze the image description and the original text, determine which named entities and their corresponding types are included in the original text, and explain the reason for your judgment. Notice: just in 'Text', not include 'Image descriptions', don't change the writing style and format of entity names, and words after the @ sign are not counted.

Text: #Sports Connor Cook could not satisfy NFL teams' questions about leadership.
Image descriptions: Michigan state football player in uniform.
Question: Comprehensively analyze the Text and the Image description, which named entities and their corresponding types are included in the Text? Explain the reason for your judgment.
Answer: Named entities: 1. Connor Cook (person/player) 2. NFL (league/organization) Reasoning: The sentence mentions Connor Cook, a former Michigan State football player who was drafted by the NFL. The NFL is the highest-level professional football league in the world. The image of a ……

Prompt template asking ChatGPT to output entity-type pairs directly:

Here are some content that people post on Twitter, and these content are composed of original text and image descriptions of the original text. Please note that the text and image descriptions here may or may not be relevant, so make your own judgment. Please follow the data annotation style and method reflected in the example I provided, comprehensively analyze the image description and the original text, determine which named entities and their corresponding types are included in the original text. There will only be 4 types of entities: ['LOC', 'MISC', 'ORG', 'PER']. Make the answer format like: ['entity name1', 'entity type1'], ['entity name2', 'entity type2']...... \n Notice: just in 'Text', not include 'Image descriptions', don't change the writing style and format of entity names in the original Text, and the words beginning with the @ sign are not counted.

Text: #Sports Connor Cook could not satisfy NFL teams' questions about leadership.
Image descriptions: Michigan state football player in uniform.
Question: Comprehensively analyze the Text and the Image description, which named entities and their corresponding types are included in the Text?
Answer: ['Connor Cook', 'PER'], ['NFL', 'ORG']
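The templates above, together with the in-context format from Section 3.1, can be assembled programmatically. The following is a minimal sketch of that assembly: the prompt head is truncated here, and the function and field names are ours, not from the PGIM release:

```python
def build_prompt(head: str, examples: list, test: dict) -> str:
    """Assemble a heuristics-enhanced prompt: head + in-context examples + test input.

    Each example dict holds 'text', 'image' (its generated caption), 'question',
    and 'answer'; the test input's answer slot is left blank for ChatGPT to fill.
    """
    parts = [head]
    for ex in examples:
        parts.append(f"Text: {ex['text']}\nImage: {ex['image']}\n"
                     f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    # Test input: identical layout, but the answer slot stays empty
    parts.append(f"Text: {test['text']}\nImage: {test['image']}\n"
                 f"Question: {test['question']}\nAnswer:")
    return "\n\n".join(parts)

# Toy usage with placeholder contents
q = ("Comprehensively analyze the Text and the Image description, which named "
     "entities and their corresponding types are included in the Text?")
prompt = build_prompt(
    head="Here are some content that people post on Twitter, ...",
    examples=[{"text": "t1", "image": "p1", "question": q, "answer": "a1"}],
    test={"text": "t", "image": "p", "question": q},
)
print(prompt.endswith("Answer:"))  # the trailing answer slot is left blank
```

Keeping the example and test blocks byte-for-byte identical in layout matters in practice: few-shot completion quality degrades when the model cannot pattern-match the final block against the demonstrations.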