
Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text

Hasin Rehana1,*, Nur Bengisu Çam2,*, Mert Basmaci2, Yongqun He3,4, Arzucan Özgür2,§, and Junguk Hur5,§

1Computer Science Graduate Program, University of North Dakota, Grand Forks, North Dakota, 58202, USA
2Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
3Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, 48109, USA
4Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, 48109, USA
5Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, North Dakota, 58202, USA

*These authors contributed equally to this work.

§Corresponding authors
Arzucan Özgür: arzucan.ozgur@boun.edu.tr
Junguk Hur: junguk.hur@med.und.edu
Abstract

Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of the biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated the PPI identification performance of various GPT and BERT models using a manually curated benchmark corpus of 164 PPIs in 77 sentences from the Learning Language in Logic (LLL) challenge. BERT-based models achieved the best overall performance, with PubMedBERT achieving the highest precision (85.17%) and F1-score (86.47%) and BioM-ALBERT achieving the highest recall (93.83%). Despite not being explicitly trained on biomedical texts, GPT-4 achieved performance comparable to the best BERT models, with 83.34% precision, 76.57% recall, and a 79.18% F1-score. These findings suggest that GPT models can effectively detect PPIs from text and have potential for use in biomedical literature mining tasks.

Introduction
Protein-protein interactions (PPIs) are essential for numerous biological functions, including DNA replication and transcription, signaling pathways, cell metabolism, and the conversion of genotype to phenotype. Understanding these interactions enhances our understanding of the biological processes, pathways, and networks underlying healthy and diseased states. Various public PPI databases exist [1-3], collecting PPI data from low-to-mid-throughput experiments, such as yeast two-hybrid and immunoprecipitation pull-down assays, as well as from high-throughput screening assays. However, these resources remain incomplete and do not cover all potential PPIs. Moreover, novel PPIs are typically first reported in biomedical texts and still need to be added to the existing PPI databases. Due to the rapid growth of the scientific literature, manual extraction of PPIs has become increasingly challenging, necessitating automated text mining approaches that do not require human participation.

Natural language processing (NLP) is a focal area in computer science and is increasingly applied in various domains, including biomedical research, which has experienced massive growth in recent years. Relation extraction, a widely used NLP method, aims to identify relationships between two or more entities in biomedical text, supporting the automatic analysis of documents in this domain. Advances in deep learning (DL), such as convolutional neural networks (CNNs) [4, 5] and recurrent neural networks (RNNs) [6, 7], as well as in NLP, have enabled success in biomedical text mining to discover interactions between protein entities. Pre-training large neural language models has led to substantial improvements in numerous NLP problems [8]. Following the publication of "Attention Is All You Need" [9], transformer architectures have achieved state-of-the-art results in various NLP tasks, including relation extraction in the biomedical domain [10].

After the development of the transformer architecture [9], transformer-based models such as the bidirectional encoder representations from transformers (BERT) [11], a type of masked language model, emerged. These models, known as large language models (LLMs), focus on understanding language and semantics. LLMs are pre-trained on vast amounts of data and can be fine-tuned for various tasks. Recent studies suggest that LLMs excel at in-context zero-shot and few-shot learning [12], analyzing, producing, and comprehending human language. LLMs' massive data processing capabilities can be employed to identify connections and trends among textual elements. Another type of LLM is the autoregressive language model, which includes the generative pre-trained transformer (GPT), an advanced AI language model that generates human-like text by acquiring linguistic patterns and structures. GPT-3, developed by OpenAI, offers benefits such as large-scale pre-training, zero- and few-shot learning, context awareness, creativity, and adaptability. OpenAI's ChatGPT, based on the GPT-3.5 series, demonstrates remarkable potential for analyzing and processing textual data. Recently, GPT-4 [13] was introduced, capable of producing and modifying text and collaborating with users on various creative and technical writing tasks, such as songwriting, screenplay creation, and imitating a user's writing style. The advancements in GPT models, from GPT-3 to GPT-4, showcase the rapid progress in NLP, opening up a wide range of applications.

Several studies have evaluated the performance of GPT models in problem-solving on various standardized tests [14, 15], showing that they can achieve performance comparable to, or even better than, humans and can pass high-level professional standardized tests such as the Bar exam [14]. However, to our knowledge, no study has evaluated how well GPT models can extract PPIs from biomedical texts. Here, we present a thorough evaluation of the PPI identification performance of multiple GPT models and compare it with that of state-of-the-art BERT-based NLP models for relation extraction.

Methods

Dataset

We used the LLL corpus [16], which contains 164 protein-protein interactions (PPIs) in 77 sentences. The LLL corpus is a shared dataset created for the Learning Language in Logic 2005 (LLL05) challenge. It contains manually curated gene/protein interactions in Bacillus subtilis, with the sentences provided in XML files. The LLL dataset does not contain non-interacting entity pairs, which are essential for training our BERT-based models. To address this issue, we generated all possible entity pair combinations using the entities identified in each sentence; a total of C(n,2) entity pairs can be generated from n protein entities in a sentence. We then labeled the interacting pairs, as reported in the LLL data files, as positive samples and the remaining ones as negative samples.
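As a concrete illustration of this step, the following is a minimal Python sketch of the candidate-pair generation; the input structures (sentence_entities, known_interactions) are our own hypothetical stand-ins for the data parsed from the LLL XML files, not the authors' code.

import itertools

def build_labeled_pairs(sentence_entities, known_interactions):
    # sentence_entities: dict mapping sentence ID -> list of protein entities
    # known_interactions: set of (sentence_id, agent, target) tuples from LLL
    samples = []
    for sent_id, entities in sentence_entities.items():
        # All C(n,2) unordered pairs of the n entities in this sentence.
        for p1, p2 in itertools.combinations(entities, 2):
            positive = ((sent_id, p1, p2) in known_interactions or
                        (sent_id, p2, p1) in known_interactions)
            samples.append((sent_id, p1, p2, int(positive)))
    return samples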

We also applied basic preprocessing steps to ensure that all entities were captured: removing punctuation marks, digit-only strings, and blank spaces, and converting all letters to lowercase, resulting in normalized protein names. For the BERT-based models, similar to prior work [17], we replaced the names of the entity pair with the keywords PROTEIN1 and PROTEIN2. Entity names other than the pair were replaced with the keyword PROTEIN.
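A minimal sketch of these normalization and masking steps follows; the exact regular expression and string replacement are our own assumptions, not the authors' code.

import re

def normalize(name):
    # Lowercase and drop punctuation, digit-only strings, and blank spaces;
    # characters such as '~', '*', and '-' that occur inside LLL protein
    # names (e.g., spo0a~p, pbp4*) are kept.
    name = re.sub(r"[^\w~*-]", "", name.lower())
    return "" if name.isdigit() else name

def mask_sentence(sentence, pair, all_entities):
    # Replace the candidate pair with PROTEIN1/PROTEIN2 and every other
    # entity with PROTEIN, following prior work [17].
    sentence = sentence.replace(pair[0], "PROTEIN1").replace(pair[1], "PROTEIN2")
    for entity in all_entities:
        if entity not in pair:
            sentence = sentence.replace(entity, "PROTEIN")
    return sentence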

Language Models

We evaluated three autoregressive language models (GPT-3, GPT-3.5 via ChatGPT, and GPT-4 via ChatGPT), each with three prompt variations, and seven masked language models (Bio_ClinicalBERT, BioBERT, BioM-ALBERT-xxlarge, BioM-BERT-PubMed-PMC-Large, PubMedBERT, SciBERT_scivocab_cased, and SciBERT_scivocab_uncased).

Autoregressive Language Models

GPT [18] is a series of language models developed by OpenAI, starting in 2018, based on the transformer architecture [9]. The transformer model consists of an encoder that generates hidden representations and a decoder that produces output sequences using multi-head attention, which prioritizes data over inductive biases, facilitating large-scale pre-training and parallelization. The self-attention mechanism enables neural networks to determine the importance of input elements, making the architecture well suited to language translation, text classification, and text generation. The GPT architecture includes layers with self-attention mechanisms, fully connected layers, and layer normalization, reducing computational time and preventing overfitting during training [18].

Figure 1 illustrates the history of GPT models released by OpenAI over the past years. The first version, GPT-1, had 117 million parameters. It was trained on a large corpus of text data, including Wikipedia (https://en.wikipedia.org/), Common Crawl (https://commoncrawl.org/the-data/), and OpenWebText (https://skylion007.github.io/OpenWebTextCorpus/). GPT-2 [19], with 1.5 billion parameters, significantly improved over its predecessor GPT-1. It was trained on a larger corpus of text data, including web pages and books, and can generate more coherent and convincing language responses. GPT-3 [20] was trained with 175 billion parameters on an enormous corpus of text data, including web pages, books, and academic articles. GPT-3 has demonstrated outstanding performance in a wide range of NLP tasks, such as language translation, chatbot development, and content generation.

Figure 1. Evolution of GPT models. GPT: generative pre-trained transformer. API: application programming interface.
On November 30, 2022, OpenAI released ChatGPT, a natural and engaging conversation tool capable of producing contextually relevant responses from text data. ChatGPT was fine-tuned on the GPT-3.5 series, which includes the models gpt-3.5-turbo-0301, code-davinci-002, text-davinci-002, and text-davinci-003. In this study, the latest model, gpt-3.5-turbo-0301, was used for GPT-3.5. On March 14, 2023, OpenAI introduced its most advanced system to date, GPT-4, which surpasses its predecessors by producing more dependable outcomes. The architectures and numbers of parameters of the GPT models are summarized in Table 1, including GPT-3, ChatGPT, and GPT-4, which were included in the current study.

Table 1: Specifications of GPT models

Model Name | Year Released | Architecture | Number of Parameters
GPT-1 | 2018 | Decoder architecture of transformer with 12 layers | 117 million
GPT-2 | 2019 | Decoder architecture of transformer with 48 layers | 1.5 billion
GPT-3 | 2020 | Same model and architecture as GPT-2, with 96 layers. Variations include Davinci#, Babbage, Curie, and Ada. | 175 billion
ChatGPT (GPT-3.5) | 2022 | A combination of three models: code-davinci-002, text-davinci-002, and text-davinci-003. | 1.3 billion, 6 billion, and 175 billion
GPT-4 | 2023 | Fine-tuned using reinforcement learning from human feedback. | Supposedly 100 trillion

#Used in the current study.

Masked Language Models

Six different BERT-based models were included in the current study (Table 2); SciBERT was evaluated in both its cased and uncased variants, yielding the seven masked language models listed above.

● BioBERT [10]: a BERT model pre-trained on PubMed abstracts and PubMed Central (PMC) full-text articles and evaluated on different NLP tasks. The initial version, BioBERT v1.0, used >200K abstracts and >270K PMC articles. An expanded version, BioBERT v1.1, was pre-trained on >1M PubMed abstracts and was included in the current study.

● SciBERT [21]: a BERT model pre-trained on randomly selected Semantic Scholar articles [22], using their entire text. The researchers created SCIVOCAB from scientific articles, with the same size as BASEVOCAB, the BERT-base model's vocabulary. Both the cased and uncased SCIVOCAB versions were used in the current study.

● Bio_ClinicalBERT [23]: a BioBERT v1.0 model (PubMed 200K + PMC 270K) fine-tuned on all notes from MIMIC-III v1.4 [24], a database of electronic health records containing ~880M words.

● PubMedBERT [8]: a BERT model pre-trained from scratch on biomedical text and evaluated on the BLURB (Biomedical Language Understanding & Reasoning Benchmark) benchmark.

● BioM-ALBERTxxlarge [25]: a model pre-trained on PubMed abstracts with the same architecture as ALBERTxxlarge [26].

● BioM-BERTLarge [25]: a model with the same architecture as BERTLarge [11], trained with the ELECTRA implementation of BERT.

Table 2: Specifications of BERT-based models

Model Name | Year Released | Architecture | Number of Parameters | F1-Score on ChemProt [27]
BioBERT | 2019 | Encoder architecture of transformer with 12 layers and hidden size of 768 | 108 million | 76.46
SciBERT | 2019 | Encoder architecture of transformer with 12 layers and hidden size of 768 | 108 million | 83.64
Bio_ClinicalBERT | 2019 | Encoder architecture of transformer with 12 layers and hidden size of 768 | 108 million | Not Available
PubMedBERT | 2020 | Encoder architecture of transformer with 12 layers and hidden size of 768 | 108 million | 77.24
BioM-ALBERTxxlarge | 2021 | Encoder architecture of transformer with 12 layers and hidden size of 4096 | 225 million | 75.81
BioM-BERTLarge | 2021 | Encoder architecture of transformer with 24 layers and hidden size of 1024 | 334 million | 80.00

Formulating queries for GPT models

To extract PPIs from the LLL sentences, we leveraged OpenAI's application programming interface (API) for GPT-3, while GPT-3.5 and GPT-4 were accessed through the web interface of ChatGPT Plus. We carefully designed the API and web interface prompts to generate well-structured and stable interactions with minimal post-processing. The LLL data comprise 77 sentences from 44 publications with a total of 164 PPIs. We extracted the necessary information from the dataset and divided it into ten folds using document-level folding, as previously introduced [28] (a sketch of this split follows Table 3). For each fold, we provided the sentence IDs and sentences as input along with a query, as shown in Table 3. To evaluate the impact of providing a dictionary of the biomedical entities covering these 77 sentences, two additional queries, one with the original protein names and one with the normalized protein names created by the preprocessing introduced above, were also performed.

Table 3: Queries incorporated in the prompts for GPT-3 (API), GPT-3.5 (ChatGPT), and GPT-4 (ChatGPT)

Query type: Base (without protein names)
Query: Find all possible Protein-Protein interactions from the given sentences and provide the result in a tabular format with columns "Sentence ID | Protein 1 | Protein 2 | Protein-Protein Interaction" for identified interaction Protein pair. And make sure each row will contain one pair of Protein-Protein interactions, even though multiple pairs are identified from a single sentence. Remember, Protein and Gene are the same things.

Query type: With protein names
Query: Find all possible Protein-Protein interactions from the given sentences and provide the result in a tabular format with columns "Sentence ID | Protein 1 | Protein 2 | Protein-Protein Interaction" for identified interaction Protein pair. And make sure each row will contain one pair of Protein-Protein interactions, even though multiple pairs are identified from a single sentence. Remember, Protein and Gene are the same things. Here are the protein names for your reference ['KinC' 'KinD' 'sigma(A)' 'Spo0A' 'SigE' 'SigK' 'GerE' 'sigma(F)' 'sigma(G)' 'SpoIIE' 'FtsZ' 'sigma(H)' 'sigma(K)' 'gerE' 'EsigmaF' 'sigmaB' 'sigmaF' 'SpoIIAB' 'SpoIIAA' 'SigL' 'RocR' 'sigma(54)' 'E sigma E' 'YfhP' 'SpoIIAA-P' 'sigmaK' 'sigmaG' 'ComK' 'FlgM' 'sigma X' 'sigma B' 'sigma(B)' 'sigmaD' 'SpoIIID' 'sigmaW' 'PhoP~P' 'AraR' 'sigmaH' 'yvyD' 'ClpX' 'Spo0' 'RbsW' 'DnaK' 'sigmaE' 'sigma W' 'sigmaA' 'sigma(X)' 'CtsR' 'Spo0A~P' 'spoIIG' 'ydhD' 'ykuD' 'ykvP' 'ywhE' 'spo0A' 'spoVG' 'rsfA' 'cwlH' 'KatX' 'katX' 'rocG' 'yfhS' 'yfhQ' 'yfhR' 'sspE' 'yfhP' 'bmrUR' 'ydaP' 'ydaE' 'ydaG' 'yfkM' 'sigma F' 'cot' 'sigK' 'cotD' 'sspG' 'sspJ' 'hag' 'comF' 'flgM' 'ykzA' 'CsbB' 'nadE' 'YtxH' 'YvyD' 'bkd' 'degR' 'cotC' 'cotX' 'cotB' 'sigW' 'tagA' 'tagD' 'tuaA' 'araE' 'sigmaL' 'spo0H' 'sigma G' 'sigma 28' 'sigma 32' 'spoIVA' 'PBP4*' 'RacX' 'YteI' 'YuaG' 'YknXYZ' 'YdjP' 'YfhM' 'phrC' 'sigE' 'ald' 'kdgR' 'sigX' 'ypuN' 'clpC' 'ftsY' 'gsiB' 'sigB' 'sspH' 'sspL' 'sspN' 'tlp']

Query type: With normalized protein names
Query: Find all possible Protein-Protein interactions from the given sentences and provide the result in a tabular format with columns 'Sentence ID | Protein 1 | Protein 2 | Protein-Protein Interaction' for identified interaction Protein pair. And make sure each row will contain one pair of Protein-Protein interactions, even though multiple pairs are identified from a single sentence. Remember, Protein and Gene are the same things. Here are the protein names for your reference ['kinc' 'kind' 'sigmaa' 'spo0a' 'sige' 'sigk' 'gere' 'sigmaf' 'sigmag' 'spoiie' 'ftsz' 'sigmah' 'sigmak' 'esigmaf' 'sigmab' 'spoiiab' 'spoiiaa' 'sigl' 'rocr' 'sigma54' 'esigmae' 'yfhp' 'spoiiaa-p' 'comk' 'flgm' 'sigmax' 'sigmad' 'spoiiid' 'sigmaw' 'phop~p' 'arar' 'yvyd' 'clpx' 'spo0' 'rbsw' 'dnak' 'sigmae' 'ctsr' 'spo0a~p' 'spoiig' 'ydhd' 'ykud' 'ykvp' 'ywhe' 'spovg' 'rsfa' 'cwlh' 'katx' 'rocg' 'yfhs' 'yfhq' 'yfhr' 'sspe' 'bmrur' 'ydap' 'ydae' 'ydag' 'yfkm' 'cot' 'cotd' 'sspg' 'sspj' 'hag' 'comf' 'ykza' 'csbb' 'nade' 'ytxh' 'bkd' 'degr' 'cotc' 'cotx' 'cotb' 'sigw' 'taga' 'tagd' 'tuaa' 'arae' 'sigmal' 'spo0h' 'sigma28' 'sigma32' 'spoiva' 'pbp4*' 'racx' 'ytei' 'yuag' 'yknxyz' 'ydjp' 'yfhm' 'phrc' 'ald' 'kdgr' 'sigx' 'ypun' 'clpc' 'ftsy' 'gsib' 'sigb' 'ssph' 'sspl' 'sspn' 'tlp']
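The document-level split referenced above can be sketched as follows; the round-robin assignment of documents to folds is our own assumption, since the exact fold construction follows [28].

from collections import defaultdict

def document_level_folds(sentences, n_folds=10):
    # sentences: list of (sentence_id, document_id, text) tuples.
    by_doc = defaultdict(list)
    for sentence in sentences:
        by_doc[sentence[1]].append(sentence)
    # Assign whole documents to folds so that no publication contributes
    # sentences to more than one fold.
    folds = [[] for _ in range(n_folds)]
    for i, doc_id in enumerate(sorted(by_doc)):
        folds[i % n_folds].extend(by_doc[doc_id])
    return folds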

Temperature parameter optimization in GPT-3

OpenAI's API allows modulation of the 'temperature' parameter, which determines how greedy or creative the generative model is. The parameter ranges between 0 (the least creative) and 1 (the most creative). We explored the impact of this parameter on PPI identification using the OpenAI API and 11 temperatures (minimum = 0, maximum = 1, increment = 0.1). A temperature of 0.1 yielded the highest overall performance for GPT-3 and was therefore used in the present study.
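A minimal sketch of this sweep, reusing the call_GPT_API helper shown later in Figure 2; the model name, token limit, and the base_query string (holding the base query from Table 3) are illustrative assumptions.

# Evaluate 11 temperatures from 0.0 to 1.0 in increments of 0.1.
for step in range(11):
    temperature = round(0.1 * step, 1)
    output = call_GPT_API(model="text-davinci-003",  # assumed GPT-3 model
                          temperature=temperature,
                          query=base_query,
                          max_token=1024,
                          top_p=1)
    # Score the parsed predictions at this temperature against the gold PPIs.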

Performance evaluation

To ensure consistency, for each fold we obtained the outputs of GPT-3, GPT-3.5, and GPT-4 from three separate runs and averaged their evaluation performance. We refreshed the browser after each prompt to keep ChatGPT from retaining the previous prompts in its conversational context.
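A minimal sketch of this per-fold scoring and three-run averaging, assuming predictions and gold annotations are represented as sets of (sentence ID, protein 1, protein 2) tuples:

def precision_recall_f1(predicted, gold):
    # predicted, gold: sets of (sentence_id, protein_1, protein_2) tuples.
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def average_runs(run_scores):
    # run_scores: [(p, r, f1), ...] for the three runs of one fold.
    return [sum(values) / len(values) for values in zip(*run_scores)]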
For the BERT-based models, we performed fine-tuning in a 10-fold cross-validation setting, where the folds were created at the document level, as introduced above. This document-level fold splitting ensured that the sentences from one document were used only in either the training or the testing set, avoiding overfitting [29]. The hyperparameters used in this study are shown in Table 4. We used the slow tokenizer for tokenization with the BioM-ALBERT-xxlarge model.

Table 4: Hyperparameters used for k-fold cross-validation of the BERT and ALBERT models

Hyperparameter | Value
Optimizer | Adam
Learning Rate | 5e-5
Batch Size | 16
Weight Decay | 1e-1
Epochs per Fold | 6
Number of Folds | 10
Max Sentence Length | 128
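For illustration, a minimal fine-tuning sketch with the Table 4 hyperparameters using the Hugging Face transformers library; the checkpoint name and the train_dataset/eval_dataset objects are assumptions, not the authors' exact code.

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "dmis-lab/biobert-v1.1"  # assumed BioBERT v1.1 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

args = TrainingArguments(
    output_dir="ppi_fold",
    learning_rate=5e-5,              # Table 4; Trainer uses an Adam-based optimizer by default
    per_device_train_batch_size=16,
    weight_decay=0.1,                # 1e-1
    num_train_epochs=6,              # epochs per fold
)

# train_dataset and eval_dataset would hold the masked, tokenized entity-pair
# sentences (max length 128) for one of the ten document-level folds.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()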

Results and Discussion


We conducted a thorough comparison of GPT and BERT-based models on the LLL dataset using identical 10-fold settings to maintain consistency across all models. The queries for the GPT models were issued in three separate runs, which were then averaged (Supplementary Tables S1-S16).
Interacting with GPT API and web interface

Figure 2 illustrates a Python code segment for accessing the GPT-3 API, together with its output. The predicted interaction pairs were returned with their corresponding sentence IDs. As shown in Figure 3, GPT-3 achieved its highest performance on all measures with the temperature parameter set to 0.1, which was therefore used for all GPT-3 analyses.

import openai  # assumes openai.api_key has been set beforehand

def call_GPT_API(model, temperature, query, max_token, top_p):
    # Read one fold's sentences and prepend the query from Table 3.
    with open('input.txt') as f:
        prompt = query + '\n' + f.read()
    print(prompt)
    # Legacy Completion endpoint, as used at the time of the study.
    completions = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_token,
        top_p=top_p,
        frequency_penalty=0,
        presence_penalty=0
    )
    # Return the generated table of predicted interaction pairs.
    return completions.choices[0].text

Figure 2. GPT API code and output. (A) Python code segment for accessing the OpenAI API. (B) An example output of GPT-3 for fold 9.
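A minimal sketch, under our own assumptions about the output format, of converting the returned table (Figure 2B) into scoreable predictions:

def parse_gpt_output(message):
    # Convert rows of "Sentence ID | Protein 1 | Protein 2 | Interaction"
    # into (sentence_id, protein_1, protein_2) tuples, skipping the header.
    predictions = set()
    for line in message.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) == 4 and cells[0].lower() != "sentence id":
            sent_id, p1, p2, _ = cells
            predictions.add((sent_id,) + tuple(sorted((p1, p2))))
    return predictions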


Figure 3. Performance evaluation of the temperature parameter in GPT-3.

Unlike GPT-3, GPT-3.5 and GPT-4 were accessed via the ChatGPT Plus web interface due to limited API access at the time of this study. Figure 4 depicts an example input and output for GPT-4.

Figure 4: An example input and output of GPT-4 (via the ChatGPT Plus web interface).
PPI identification performance

Table 5 summarizes the PPI identification performance of the 16 models, including three variations of each GPT version. In general, the BERT-based models outperformed the GPTs; however, when protein names were provided, the GPTs, particularly GPT-4, demonstrated performance quite comparable to the best-performing BERT-based models (Supplementary Figure 1).

Overall, GPT-4 performed the best among all versions of the GPT models, whether or not protein names were provided. However, in terms of precision on the base queries, GPT-3.5 achieved higher performance than GPT-4, with a score of 79.11% compared to GPT-4's 73.97%. The base GPT models had lower precision than most of the BERT-based models; when provided with protein names, however, the precision of the GPT models improved substantially, approaching that of the best-performing PubMedBERT model, which achieved 85.17% precision. Specifically, GPT-4 with protein names achieved 83.71% precision.

Moreover, GPT-4 outperformed BioM-ALBERT-xxlarge and SciBERT_scivocab_cased in terms of F1-score. While PubMedBERT had the best precision (1.46% higher than GPT-4 with protein names) and F1-score, BioM-ALBERT-xxlarge achieved the highest recall, albeit with lower precision and F1-score. It is worth noting that although the BERT-based models demonstrate impressive performance, they require fine-tuning with supervised learning, which takes considerable time and technical expertise. In comparison, zero-shot models such as GPT-3, ChatGPT, and GPT-4 do not require such extensive fine-tuning, making them more accessible and practical for specific use cases.
The BioM-ALBERT-xxlarge model had the highest recall among all models, at 93.83% (4.13% higher than the second-best, BioM-BERT-PubMed-PMC-Large). However, its precision and F1 scores were not as good as those of the BERT models. There could be several reasons for this discrepancy. One possibility is that the BioM-ALBERT-xxlarge model had a larger hidden layer size (4096) than the BERT models we experimented with (1024). Another potential reason is that the BERT models use the WordPiece tokenization technique, whereas ALBERT uses the SentencePiece tokenization technique.
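The tokenization difference can be inspected directly; a minimal sketch, assuming the models' public Hugging Face checkpoint names:

from transformers import AutoTokenizer

wordpiece = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
sentencepiece = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge",
                                              use_fast=False)  # slow tokenizer, as noted above

print(wordpiece.tokenize("PROTEIN1 phosphorylates PROTEIN2"))
print(sentencepiece.tokenize("PROTEIN1 phosphorylates PROTEIN2"))
# WordPiece marks word-internal pieces with '##', whereas SentencePiece marks
# word boundaries with '▁', so the two model families segment rare protein
# names differently.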

Table 5: Evaluation results for PPI identification on the LLL dataset for the BERT- and GPT-based models.

Model | Precision | Recall | F1-Score

Autoregressive Language Models:
GPT-3 | 70.34% | 56.47% | 61.48%
GPT-3 with protein names | 78.62% | 61.73% | 68.12%
GPT-3 with normalized protein names | 75.31% | 56.71% | 63.83%
GPT-3.5 (ChatGPT) | 79.11% | 58.24% | 65.75%
GPT-3.5 (ChatGPT) with protein names | 79.16% | 62.19% | 68.37%
GPT-3.5 (ChatGPT) with normalized protein names | 83.06% | 63.96% | 71.05%
GPT-4 (ChatGPT) | 73.97% | 62.41% | 66.61%
GPT-4 (ChatGPT) with protein names | 83.71% | 72.77% | 76.89%
GPT-4 (ChatGPT) with normalized protein names | 83.34% | 76.57% | 79.18%

Masked Language Models:
Bio_ClinicalBERT | 78.39% | 88.63% | 82.72%
BioBERT | 83.28% | 89.36% | 85.87%
BioM-ALBERT-xxlarge | 60.86% | 93.83% | 71.72%
BioM-BERT-PubMed-PMC-Large | 84.69% | 89.70% | 86.22%
PubMedBERT | 85.17% | 88.73% | 86.47%
SciBERT_scivocab_cased | 65.66% | 88.53% | 74.93%
SciBERT_scivocab_uncased | 84.45% | 85.98% | 84.74%

In the original table, a color gradient indicates relative performance in each measure, with brighter red indicating higher performance, and the top-performing model of each type is highlighted in bold.


Despite being primarily designed for text generation, GPT-3 demonstrated a remarkable ability to identify interactions from the biomedical literature, particularly in NLP tasks such as protein-protein interaction or relation identification. However, further development of language model-based methods is needed to address sensitive tasks in these areas. Nevertheless, improving GPT-4 with biomedical corpora, such as PubMed and PMC, for PPI identification is warranted. With additional information, such as a dictionary of protein names, GPT showed decent performance and can achieve substantially improved performance comparable to that of BERT-based models, indicating the potential of GPT for these NLP tasks. Further research is needed to explore and enhance the capabilities of GPT-based models in the biomedical domain.

One potential area of future research is exploring the use of ontologies to improve literature mining for PPI identification. For instance, previous work has utilized the Interaction Network Ontology (INO) to facilitate the mining and accuracy of gene-gene or protein-protein interactions [30-33]. In addition, INO's hierarchical structure and semantic relationships among different interaction types provide a foundation for deeper analysis of mined interactions. Although ontologies were not utilized in the present study, we aim to investigate how they can be combined with existing literature mining tools to further enhance performance in future studies.

Acknowledgments
The study was supported by the U.S. National Institute of Allergy and Infectious Diseases (U24AI171008 to Y.H. and J.H.). The GEBIP Award of the Turkish Academy of Sciences (to A.Ö.) is gratefully acknowledged.
References
1. Alonso-Lopez, D., et al., APID interactomes: providing proteome-based interactomes
with controlled quality for multiple species and derived networks. Nucleic acids research,
2016. 44(W1): p. W529-W535.
2. Szklarczyk, D., et al., STRING v11: protein–protein association networks with increased
coverage, supporting functional discovery in genome-wide experimental datasets.
Nucleic acids research, 2019. 47(D1): p. D607-D613.
3. Oughtred, R., et al., The BioGRID database: A comprehensive biomedical resource of
curated protein, genetic, and chemical interactions. Protein Science, 2021. 30(1): p. 187-
200.
4. Choi, S.-P., Extraction of protein–protein interactions (PPIs) from the literature by deep
convolutional neural networks with various feature embeddings. Journal of Information
Science, 2018. 44(1): p. 60-73.
5. Peng, Y. and Z. Lu, Deep learning for extracting protein-protein interactions from
biomedical literature. arXiv preprint arXiv:1706.01556, 2017.
6. Hsieh, Y.-L., et al. Identifying protein-protein interactions in biomedical literature using
recurrent neural networks with long short-term memory. in Proceedings of the eighth
international joint conference on natural language processing (volume 2: short papers).
2017.
7. Ahmed, M., et al. Identifying protein-protein interaction using tree LSTM and structured
attention. in 2019 IEEE 13th international conference on semantic computing (ICSC).
2019. IEEE.
8. Gu, Y., et al., Domain-specific language model pretraining for biomedical natural
language processing. ACM Transactions on Computing for Healthcare (HEALTH), 2021.
3(1): p. 1-23.
9. Vaswani, A., et al., Attention is all you need. Advances in neural information processing
systems, 2017. 30.
10. Lee, J., et al., BioBERT: a pre-trained biomedical language representation model for
biomedical text mining. Bioinformatics, 2020. 36(4): p. 1234-1240.
11. Devlin, J., et al., Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805, 2018.
12. Kojima, T., et al., Large language models are zero-shot reasoners. arXiv preprint
arXiv:2205.11916, 2022.
13. OpenAI, GPT-4 Technical Report. ArXiv, 2023. abs/2303.08774.
14. Katz, D.M., et al., GPT-4 Passes the Bar Exam. Available at SSRN 4389233, 2023.
15. Nori, H., et al., Capabilities of GPT-4 on Medical Challenge Problems. arXiv preprint
arXiv:2303.13375, 2023.
16. Nédellec, C. Learning language in logic-genic interaction extraction challenge. in 4.
Learning language in logic workshop (LLL05). 2005. ACM-Association for Computing
Machinery.
17. Erkan, G., A. Özgür, and D. Radev. Semi-supervised classification for extracting protein
interaction sentences using dependency parsing. in Proceedings of the 2007 Joint
Conference on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning (EMNLP-CoNLL). 2007.
18. Radford, A., et al., Improving language understanding by generative pre-training. 2018.
19. Radford, A., et al., Language models are unsupervised multitask learners. OpenAI blog,
2019. 1(8): p. 9.
20. Brown, T., et al., Language models are few-shot learners. Advances in neural
information processing systems, 2020. 33: p. 1877-1901.
21. Beltagy, I., K. Lo, and A. Cohan, SciBERT: A pretrained language model for scientific
text. arXiv preprint arXiv:1903.10676, 2019.
22. Ammar, W., et al., Construction of the literature graph in semantic scholar. arXiv preprint
arXiv:1805.02262, 2018.
23. Alsentzer, E., et al., Publicly available clinical BERT embeddings. arXiv preprint
arXiv:1904.03323, 2019.
24. Johnson, A., T. Pollard, and R. Mark, MIMIC-III clinical database (version 1.4).
PhysioNet, 2016. 10(C2XW26): p. 2.
25. Alrowili, S. and K. Vijay-Shanker. BioM-transformers: building large biomedical language
models with BERT, ALBERT and ELECTRA. in Proceedings of the 20th Workshop on
Biomedical Language Processing. 2021.
26. Lan, Z., et al., Albert: A lite bert for self-supervised learning of language representations.
arXiv preprint arXiv:1909.11942, 2019.
27. Krallinger, M., et al. Overview of the BioCreative VI chemical-protein interaction Track. in
Proceedings of the sixth BioCreative challenge evaluation workshop. 2017.
28. Airola, A., et al., All-paths graph kernel for protein-protein interaction extraction with
evaluation of cross-corpus learning. BMC bioinformatics, 2008. 9: p. 1-12.
29. Mooney, R. and R. Bunescu, Subsequence kernels for relation extraction. Advances in
neural information processing systems, 2005. 18.
30. Hur, J., A. Ozgur, and Y. He, Ontology-based literature mining of E. coli vaccine-
associated gene interaction networks. J Biomed Semantics, 2017. 8(1): p. 12.
31. Ozgur, A., J. Hur, and Y. He, The Interaction Network Ontology-supported modeling and
mining of complex interactions represented with multiple keywords in biomedical
literature. BioData Min, 2016. 9: p. 41.
32. Karadeniz, I., et al., Literature Mining and Ontology based Analysis of Host-Brucella
Gene-Gene Interaction Network. Front Microbiol, 2015. 6: p. 1386.
33. Hur, J., et al., Development and application of an interaction network ontology for
literature mining of vaccine-associated gene-gene interactions. J Biomed Semantics,
2015. 6: p. 2.
Appendix

Supplementary Figure 1: Evaluation results for PPI identification on the LLL dataset for the BERT- and GPT-based models.
Supplementary Table S1. PPI identification performance of GPT-3.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 38.09% 70.00% 100.00% 31.81% 83.56% 100.00% 52.33% 71.43% 89.53% 66.67% 70.34%

Recall 17.78% 70.00% 100.00% 29.63% 44.44% 100.00% 46.67% 34.48% 75.55% 46.15% 56.47%

F1 Score 24.24% 70.00% 100.00% 30.57% 57.93% 100.00% 49.06% 46.51% 81.92% 54.55% 61.48%

Supplementary Table S2. PPI identification performance of GPT-3 with protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 33.33% 88.89% 100.00% 93.75% 89.58% 100.00% 70.00% 74.53% 91.67% 44.44% 78.62%

Recall 13.33% 80.00% 100.00% 83.33% 53.09% 100.00% 46.67% 36.78% 73.33% 30.77% 61.73%

F1 Score 19.05% 84.21% 100.00% 88.24% 66.67% 100.00% 56.00% 49.21% 81.48% 36.36% 68.12%

Supplementary Table S3. PPI identification performance of GPT-3 with normalized protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 48.61% 88.89% 87.50% 78.41% 87.50% 100.00% 63.89% 76.67% 77.22% 44.44% 75.31%

Recall 24.44% 80.00% 77.78% 74.07% 51.85% 83.33% 44.44% 44.83% 55.56% 30.77% 56.71%

F1 Score 32.44% 84.21% 82.35% 76.11% 65.12% 90.47% 51.21% 55.51% 64.49% 36.36% 63.83%

Supplementary Table S4. PPI identification performance of GPT-3.5 (via ChatGPT Plus web interface).

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 41.67% 89.26% 100.00% 75.32% 87.91% 48.15% 80.48% 82.71% 88.89% 96.67% 79.11%

Recall 24.44% 83.33% 100.00% 33.33% 44.44% 54.17% 48.89% 58.62% 71.11% 64.10% 58.24%
F1 Score 30.73% 86.14% 100.00% 45.79% 59.03% 50.98% 60.48% 68.46% 79.01% 76.88% 65.75%

Supplementary Table S5. PPI identification performance of GPT-3.5 (via ChatGPT Plus web interface) with protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 79.85% 82.96% 100.00% 88.85% 86.72% 67.62% 70.91% 65.66% 84.19% 64.81% 79.16%

Recall 62.22% 80.00% 100.00% 50.00% 67.90% 66.67% 48.89% 26.44% 71.11% 48.72% 62.19%

F1 Score 69.62% 81.40% 100.00% 62.23% 75.57% 66.67% 57.85% 37.68% 77.07% 55.60% 68.37%

Supplementary Table S6. PPI identification performance of GPT-3.5 (via ChatGPT Plus web interface) with normalized protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 66.88% 89.63% 100.00% 70.58% 86.58% 85.71% 83.33% 60.41% 87.52% 100.00% 83.06%

Recall 40.00% 86.67% 100.00% 33.33% 64.20% 70.83% 62.22% 29.89% 75.55% 76.92% 63.96%

F1 Score 49.87% 88.07% 100.00% 43.77% 73.32% 77.46% 71.11% 38.92% 81.02% 86.96% 71.05%

Supplementary Table S7. PPI identification performance of GPT-4 (via ChatGPT Plus web interface).

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 46.11% 83.33% 100.00% 65.34% 69.68% 100.00% 52.96% 76.11% 79.49% 66.67% 73.97%

Recall 35.55% 83.33% 96.30% 25.92% 55.56% 100.00% 55.56% 51.72% 68.89% 51.28% 62.41%

F1 Score 39.57% 83.33% 98.04% 36.48% 61.77% 100.00% 54.14% 61.30% 73.81% 57.70% 66.61%

Supplementary Table S8. PPI identification performance of GPT-4 (via ChatGPT Plus web interface) with protein names.


Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 67.18% 80.00% 100.00% 83.82% 92.48% 100.00% 82.15% 74.74% 85.67% 71.03% 83.71%

Recall 64.44% 80.00% 100.00% 62.96% 60.49% 95.83% 73.34% 41.38% 80.00% 69.23% 72.77%

F1 Score 65.44% 80.00% 100.00% 70.16% 73.13% 97.78% 76.59% 53.14% 82.67% 70.02% 76.89%

Supplementary Table S9. PPI identification performance of GPT-4 (via ChatGPT Plus web interface) with normalized protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 65.94% 69.80% 100.00% 89.08% 92.48% 100.00% 63.84% 87.16% 83.86% 81.19% 83.34%

Recall 60.00% 70.00% 100.00% 75.92% 60.49% 100.00% 71.11% 71.26% 80.00% 76.92% 76.57%

F1 Score 62.81% 69.78% 100.00% 80.01% 73.13% 100.00% 67.01% 78.36% 81.73% 78.97% 79.18%

Supplementary Table S10. PPI identification performance of BioBERT.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 88.24% 90.91% 100.00% 84.21% 72.22% 88.89% 70.00% 78.13% 83.33% 76.92% 83.28%
Recall 100.00% 100.00% 100.00% 88.89% 89.66% 100.00% 93.33% 78.13% 66.67% 76.92% 89.36%
F1 Score 93.75% 95.24% 100.00% 86.49% 80.00% 94.12% 80.00% 78.13% 74.07% 76.92% 85.87%

Supplementary Table S11. PPI identification performance of PubMedBERT.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 88.24% 90.00% 100.00% 81.25% 75.68% 100.00% 77.78% 78.38% 81.82% 78.58% 85.17%
Recall 100.00% 90.00% 100.00% 72.22% 96.56% 100.00% 93.33% 90.63% 60.00% 84.62% 88.73%
F1 Score 93.75% 90.00% 100.00% 76.47% 84.85% 100.00% 84.85% 84.06% 69.23% 81.48% 86.47%
Supplementary Table S12. PPI identification performance of Bio_ClinicalBERT.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 75.00% 90.91% 100.00% 75.00% 61.90% 88.89% 70.00% 72.97% 69.23% 80.00% 78.39%
Recall 100.00% 100.00% 100.00% 66.67% 89.66% 100.00% 93.33% 84.38% 60.00% 92.31% 88.63%
F1 Score 85.71% 95.24% 100.00% 70.59% 73.24% 94.12% 80.00% 78.26% 64.29% 85.71% 82.72%

Supplementary Table S13. PPI identification performance of BioM-BERT-PubMed-PMC-Large.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 75.00% 90.91% 100.00% 77.27% 65.91% 100.00% 83.33% 89.29% 81.82% 83.33% 84.69%
Recall 100.00% 100.00% 100.00% 94.44% 100.00% 87.50% 100.00% 78.13% 60.00% 76.92% 89.70%
F1 Score 85.71% 95.24% 100.00% 85.00% 79.45% 93.33% 90.91% 83.33% 69.23% 80.00% 86.22%

Supplementary Table S14. PPI identification performance of BioM-ALBERT-xxlarge.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 75.00% 58.82% 100.00% 34.62% 54.00% 72.73% 66.67% 65.71% 26.92% 54.17% 60.86%
Recall 100.00% 100.00% 100.00% 100.00% 93.10% 100.00% 80.00% 71.88% 93.33% 100.00% 93.83%
F1 Score 85.71% 74.07% 100.00% 51.43% 68.35% 84.21% 72.73% 68.66% 41.80% 70.27% 71.72%

Supplementary Table S15. PPI identification performance of SciBERT_scivocab_cased.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 78.95% 58.82% 75.00% 58.06% 60.00% 80.00% 63.64% 67.74% 58.82% 55.56% 65.66%
Recall 100.00% 100.00% 100.00% 100.00% 82.76% 100.00% 93.33% 65.63% 66.67% 76.92% 88.53%
F1 Score 88.24% 74.07% 85.71% 73.47% 69.57% 88.89% 75.68% 66.67% 65.50% 64.52% 74.93%

Supplementary Table S16. PPI identification performance of SciBERT_scivocab_uncased.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 82.35% 90.00% 100.00% 61.90% 76.47% 100.00% 82.35% 90.00% 90.00% 71.43% 84.45%
Recall 93.33% 90.00% 100.00% 72.22% 89.66% 100.00% 93.33% 84.38% 60.00% 76.92% 85.98%
F1 Score 87.50% 90.00% 100.00% 66.67% 82.54% 100.00% 87.50% 87.10% 72.00% 74.07% 84.74%
