
Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text

Hasin Rehana1,*, Nur Bengisu Çam2,*, Mert Basmaci2, Yongqun He3,4, Arzucan Özgür2,§, and Junguk Hur5,§

1Computer Science Graduate Program, University of North Dakota, Grand Forks, North Dakota, 58202, USA
2Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
3Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, 48109, USA
4Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, 48109, USA
5Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, North Dakota, 58202, USA

*These authors contributed equally to this work.

§Corresponding authors
Arzucan Özgür: arzucan.ozgur@boun.edu.tr
Junguk Hur: junguk.hur@med.und.edu
Abstract

Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of the biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated the PPI identification performance of various GPT and BERT models using a manually curated benchmark corpus of 164 PPIs in 77 sentences from the Learning Language in Logic (LLL) challenge. BERT-based models achieved the best overall performance, with PubMedBERT achieving the highest precision (85.17%) and F1-score (86.47%) and BioM-ALBERT achieving the highest recall (93.83%). Despite not being explicitly trained on biomedical texts, GPT-4 achieved performance comparable to the best BERT models, with 83.34% precision, 76.57% recall, and a 79.18% F1-score. These findings suggest that GPT models can effectively detect PPIs from text and have potential for use in biomedical literature mining tasks.

Introduction
Protein-protein interactions (PPIs) are essential for numerous biological functions, including DNA replication and transcription, signaling pathways, cell metabolism, and the conversion of genotype to phenotype. Understanding these interactions enhances our understanding of the biological processes, pathways, and networks underlying healthy and diseased states. Various public PPI databases exist [1-3], collecting PPI data from low-to-mid-throughput experiments, such as yeast two-hybrid and immunoprecipitation pull-down assays, as well as from high-throughput screening assays. However, these resources remain incomplete and do not cover all potential PPIs. Moreover, novel PPIs are typically first reported in biomedical texts and still need to be added to the existing PPI databases. Due to the rapid growth of the scientific literature, manual extraction of PPIs has become increasingly challenging, necessitating automated text mining approaches that do not require human participation.

Natural language processing (NLP) is a focal area in computer science and is increasingly applied in various domains, including biomedical research, which has experienced massive growth in recent years. Relation extraction, a widely used NLP method, aims to identify relationships between two or more entities in biomedical text, supporting the automatic analysis of documents in this domain. Advances in deep learning (DL), such as convolutional neural networks (CNNs) [4, 5] and recurrent neural networks (RNNs) [6, 7], as well as in NLP, have enabled success in biomedical text mining to discover interactions between protein entities. Pre-training large neural language models has led to substantial improvements in numerous NLP problems [8]. Following the publication of "Attention Is All You Need" [9], transformer architectures have achieved state-of-the-art results in various NLP tasks, including relation extraction in the biomedical domain [10].

After the development of the transformer architecture [9], transformer-based models such as the bidirectional encoder representations from transformers (BERT) [11], a type of masked language model, emerged. These models, known as large language models (LLMs), focus on understanding language and semantics. LLMs are pre-trained on vast amounts of data and can be fine-tuned for various tasks. Recent studies suggest that LLMs excel at in-context zero-shot and few-shot learning [12], analyzing, producing, and comprehending human language. LLMs' massive data processing capabilities can be employed to identify connections and trends among textual elements. Another type of LLM is the autoregressive language model, which includes the generative pre-trained transformer (GPT), an advanced AI language model that generates human-like text by acquiring linguistic patterns and structures. GPT-3, developed by OpenAI, offers benefits such as large-scale pre-training, zero- and few-shot learning, context awareness, creativity, and adaptability. OpenAI's ChatGPT, based on the GPT-3.5 series, demonstrates remarkable potential for analyzing and processing textual data. Recently, GPT-4 [13] was introduced, capable of producing and modifying text and collaborating with users on various creative and technical writing tasks, such as songwriting, screenplay creation, and imitating a user's writing style. The advancements in GPT models, from GPT-3 to GPT-4, showcase the rapid progress in NLP, opening up a wide range of applications.

Several studies have evaluated the performance of GPT models in problem-solving on various standardized tests [14, 15], showing that they can achieve performance comparable to, or even better than, humans and can pass high-level professional standardized tests such as the Bar exam [14]. However, to our knowledge, no study has evaluated how well GPT models can extract PPIs from biomedical texts. Here, we present a thorough evaluation of the PPI identification performance of multiple GPT models and compare it with that of state-of-the-art BERT-based NLP models for relation extraction.

Methods

Dataset

We used the LLL corpus [16], which contains 164 protein-protein interactions (PPIs) in 77 sentences. The LLL corpus is a shared dataset created for the Learning Language in Logic 2005 (LLL05) challenge. It contains manually curated gene/protein interactions in Bacillus subtilis, with the sentences provided in XML files. The LLL dataset does not contain non-interacting entity pairs, which are essential for training our BERT-based models. To address this issue, we generated all possible entity pair combinations using the entities identified in each sentence; a total of C(n,2) entity pairs can be generated from n protein entities in a sentence. We then labeled the interacting pairs, as reported in the LLL data files, as positive samples and the remaining ones as negative samples.
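As a concrete illustration of this step, the following is a minimal Python sketch of the candidate-pair generation; the input structures (sentence_entities, known_interactions) are our own hypothetical stand-ins for the data parsed from the LLL XML files, not the authors' code.

import itertools

def build_labeled_pairs(sentence_entities, known_interactions):
    # sentence_entities: dict mapping sentence ID -> list of protein entities
    # known_interactions: set of (sentence_id, agent, target) tuples from LLL
    samples = []
    for sent_id, entities in sentence_entities.items():
        # All C(n,2) unordered pairs of the n entities in this sentence.
        for p1, p2 in itertools.combinations(entities, 2):
            positive = ((sent_id, p1, p2) in known_interactions or
                        (sent_id, p2, p1) in known_interactions)
            samples.append((sent_id, p1, p2, int(positive)))
    return samples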

We also applied basic preprocessing steps to ensure that all entities were captured: removing punctuation marks, digit-only strings, and blank spaces, and converting all letters to lowercase, resulting in normalized protein names. For the BERT-based models, similar to prior work [17], we replaced the names of the entity pair with the keywords PROTEIN1 and PROTEIN2. Entity names other than the pair were replaced with the keyword PROTEIN.
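A minimal sketch of these normalization and masking steps follows; the exact regular expression and string replacement are our own assumptions, not the authors' code.

import re

def normalize(name):
    # Lowercase and drop punctuation, digit-only strings, and blank spaces;
    # characters such as '~', '*', and '-' that occur inside LLL protein
    # names (e.g., spo0a~p, pbp4*) are kept.
    name = re.sub(r"[^\w~*-]", "", name.lower())
    return "" if name.isdigit() else name

def mask_sentence(sentence, pair, all_entities):
    # Replace the candidate pair with PROTEIN1/PROTEIN2 and every other
    # entity with PROTEIN, following prior work [17].
    sentence = sentence.replace(pair[0], "PROTEIN1").replace(pair[1], "PROTEIN2")
    for entity in all_entities:
        if entity not in pair:
            sentence = sentence.replace(entity, "PROTEIN")
    return sentence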

Language Models

We evaluated three autoregressive language models (GPT-3, GPT-3.5 via ChatGPT, and GPT-4 via ChatGPT), each with three prompt variations, and seven masked language models (Bio_ClinicalBERT, BioBERT, BioM-ALBERT-xxlarge, BioM-BERT-PubMed-PMC-Large, PubMedBERT, SciBERT_scivocab_cased, and SciBERT_scivocab_uncased).

Autoregressive Language Models

GPT [18] is a series of language models developed by OpenAI, starting in 2018, based on the transformer architecture [9]. The transformer model consists of an encoder that generates hidden representations and a decoder that produces output sequences using multi-head attention, which prioritizes data over inductive biases, facilitating large-scale pre-training and parallelization. The self-attention mechanism enables neural networks to determine the importance of input elements, making the architecture well suited to language translation, text classification, and text generation. The GPT architecture includes layers with self-attention mechanisms, fully connected layers, and layer normalization, reducing computational time and preventing overfitting during training [18].

Figure 1 illustrates the history of GPT models released by OpenAI over the past years. The first version, GPT-1, had 117 million parameters. It was trained on a large corpus of text data, including Wikipedia (https://en.wikipedia.org/), Common Crawl (https://commoncrawl.org/the-data/), and OpenWebText (https://skylion007.github.io/OpenWebTextCorpus/). GPT-2 [19], with 1.5 billion parameters, significantly improved over its predecessor GPT-1. It was trained on a larger corpus of text data, including web pages and books, and can generate more coherent and convincing language responses. GPT-3 [20] was trained with 175 billion parameters on an enormous corpus of text data, including web pages, books, and academic articles. GPT-3 has demonstrated outstanding performance in a wide range of NLP tasks, such as language translation, chatbot development, and content generation.

Figure 1. Evolution of GPT models. GPT: generative pre-trained transformer. API: application programming interface.
On November 30, 2022, OpenAI released ChatGPT, a natural and engaging conversation tool capable of producing contextually relevant responses from text data. ChatGPT was fine-tuned on the GPT-3.5 series, which includes the models gpt-3.5-turbo-0301, code-davinci-002, text-davinci-002, and text-davinci-003. In this study, the latest model, gpt-3.5-turbo-0301, was used for GPT-3.5. On March 14, 2023, OpenAI introduced its most advanced system to date, GPT-4, which surpasses its predecessors by producing more dependable outcomes. The architectures and numbers of parameters of the GPT models are summarized in Table 1, including GPT-3, ChatGPT, and GPT-4, which were included in the current study.

Table 1: Specifications of GPT models

Model Name | Year Released | Architecture | Number of Parameters
GPT-1 | 2018 | Decoder architecture of transformer with 12 layers | 117 million
GPT-2 | 2019 | Decoder architecture of transformer with 48 layers | 1.5 billion
GPT-3 | 2020 | Same model and architecture as GPT-2, with 96 layers. Variations include Davinci#, Babbage, Curie, and Ada. | 175 billion
ChatGPT (GPT-3.5) | 2022 | A combination of three models: code-davinci-002, text-davinci-002, and text-davinci-003. | 1.3 billion, 6 billion, and 175 billion
GPT-4 | 2023 | Fine-tuned using reinforcement learning from human feedback. | Supposedly 100 trillion

#Used in the current study.

Masked Language Models

Six different BERT-based models were included in the current study (Table 2); SciBERT was evaluated in both its cased and uncased variants, yielding the seven masked language models listed above.

● BioBERT [10]: a BERT model pre-trained on PubMed abstracts and PubMed Central (PMC) full-text articles and evaluated on different NLP tasks. The initial version, BioBERT v1.0, used >200K abstracts and >270K PMC articles. An expanded version, BioBERT v1.1, was pre-trained on >1M PubMed abstracts and was included in the current study.

● SciBERT [21]: a BERT model pre-trained on randomly selected Semantic Scholar articles [22], using their entire text. The researchers created SCIVOCAB from scientific articles, with the same size as BASEVOCAB, the BERT-base model's vocabulary. Both the cased and uncased SCIVOCAB versions were used in the current study.

● Bio_ClinicalBERT [23]: a BioBERT v1.0 model (PubMed 200K + PMC 270K) fine-tuned on all notes from MIMIC-III v1.4 [24], a database of electronic health records containing ~880M words.

● PubMedBERT [8]: a BERT model pre-trained from scratch on biomedical text and evaluated on the BLURB (Biomedical Language Understanding & Reasoning Benchmark) benchmark.

● BioM-ALBERTxxlarge [25]: a model pre-trained on PubMed abstracts with the same architecture as ALBERTxxlarge [26].

● BioM-BERTLarge [25]: a model with the same architecture as BERTLarge [11], trained with the ELECTRA implementation of BERT.

Table 2: Specifications of BERT-based models

Model Name | Year Released | Architecture | Number of Parameters | F1-Score on ChemProt [27]
BioBERT | 2019 | Encoder architecture of transformer with 12 layers and hidden size of 768 | 108 million | 76.46
SciBERT | 2019 | Encoder architecture of transformer with 12 layers and hidden size of 768 | 108 million | 83.64
Bio_ClinicalBERT | 2019 | Encoder architecture of transformer with 12 layers and hidden size of 768 | 108 million | Not Available
PubMedBERT | 2020 | Encoder architecture of transformer with 12 layers and hidden size of 768 | 108 million | 77.24
BioM-ALBERTxxlarge | 2021 | Encoder architecture of transformer with 12 layers and hidden size of 4096 | 225 million | 75.81
BioM-BERTLarge | 2021 | Encoder architecture of transformer with 24 layers and hidden size of 1024 | 334 million | 80.00

Formulating queries for GPT models

To extract PPIs from the LLL sentences, we leveraged OpenAI's application programming interface (API) for GPT-3, while GPT-3.5 and GPT-4 were accessed through the web interface of ChatGPT Plus. We carefully designed the API and web interface prompts to generate well-structured and stable interactions with minimal post-processing. The LLL data comprise 77 sentences from 44 publications with a total of 164 PPIs. We extracted the necessary information from the dataset and divided it into ten folds using document-level folding, as previously introduced [28] (a sketch of this split follows Table 3). For each fold, we provided the sentence IDs and sentences as input along with a query, as shown in Table 3. To evaluate the impact of providing a dictionary of the biomedical entities covering these 77 sentences, two additional queries, one with the original protein names and one with the normalized protein names created by the preprocessing introduced above, were also performed.

Table 3: Queries incorporated in the prompts for GPT-3 (API), GPT-3.5 (ChatGPT), and GPT-4 (ChatGPT)

Query type: Base (without protein names)
Query: Find all possible Protein-Protein interactions from the given sentences and provide the result in a tabular format with columns "Sentence ID | Protein 1 | Protein 2 | Protein-Protein Interaction" for identified interaction Protein pair. And make sure each row will contain one pair of Protein-Protein interactions, even though multiple pairs are identified from a single sentence. Remember, Protein and Gene are the same things.

Query type: With protein names
Query: Find all possible Protein-Protein interactions from the given sentences and provide the result in a tabular format with columns "Sentence ID | Protein 1 | Protein 2 | Protein-Protein Interaction" for identified interaction Protein pair. And make sure each row will contain one pair of Protein-Protein interactions, even though multiple pairs are identified from a single sentence. Remember, Protein and Gene are the same things. Here are the protein names for your reference ['KinC' 'KinD' 'sigma(A)' 'Spo0A' 'SigE' 'SigK' 'GerE' 'sigma(F)' 'sigma(G)' 'SpoIIE' 'FtsZ' 'sigma(H)' 'sigma(K)' 'gerE' 'EsigmaF' 'sigmaB' 'sigmaF' 'SpoIIAB' 'SpoIIAA' 'SigL' 'RocR' 'sigma(54)' 'E sigma E' 'YfhP' 'SpoIIAA-P' 'sigmaK' 'sigmaG' 'ComK' 'FlgM' 'sigma X' 'sigma B' 'sigma(B)' 'sigmaD' 'SpoIIID' 'sigmaW' 'PhoP~P' 'AraR' 'sigmaH' 'yvyD' 'ClpX' 'Spo0' 'RbsW' 'DnaK' 'sigmaE' 'sigma W' 'sigmaA' 'sigma(X)' 'CtsR' 'Spo0A~P' 'spoIIG' 'ydhD' 'ykuD' 'ykvP' 'ywhE' 'spo0A' 'spoVG' 'rsfA' 'cwlH' 'KatX' 'katX' 'rocG' 'yfhS' 'yfhQ' 'yfhR' 'sspE' 'yfhP' 'bmrUR' 'ydaP' 'ydaE' 'ydaG' 'yfkM' 'sigma F' 'cot' 'sigK' 'cotD' 'sspG' 'sspJ' 'hag' 'comF' 'flgM' 'ykzA' 'CsbB' 'nadE' 'YtxH' 'YvyD' 'bkd' 'degR' 'cotC' 'cotX' 'cotB' 'sigW' 'tagA' 'tagD' 'tuaA' 'araE' 'sigmaL' 'spo0H' 'sigma G' 'sigma 28' 'sigma 32' 'spoIVA' 'PBP4*' 'RacX' 'YteI' 'YuaG' 'YknXYZ' 'YdjP' 'YfhM' 'phrC' 'sigE' 'ald' 'kdgR' 'sigX' 'ypuN' 'clpC' 'ftsY' 'gsiB' 'sigB' 'sspH' 'sspL' 'sspN' 'tlp']

Query type: With normalized protein names
Query: Find all possible Protein-Protein interactions from the given sentences and provide the result in a tabular format with columns 'Sentence ID | Protein 1 | Protein 2 | Protein-Protein Interaction' for identified interaction Protein pair. And make sure each row will contain one pair of Protein-Protein interactions, even though multiple pairs are identified from a single sentence. Remember, Protein and Gene are the same things. Here are the protein names for your reference ['kinc' 'kind' 'sigmaa' 'spo0a' 'sige' 'sigk' 'gere' 'sigmaf' 'sigmag' 'spoiie' 'ftsz' 'sigmah' 'sigmak' 'esigmaf' 'sigmab' 'spoiiab' 'spoiiaa' 'sigl' 'rocr' 'sigma54' 'esigmae' 'yfhp' 'spoiiaa-p' 'comk' 'flgm' 'sigmax' 'sigmad' 'spoiiid' 'sigmaw' 'phop~p' 'arar' 'yvyd' 'clpx' 'spo0' 'rbsw' 'dnak' 'sigmae' 'ctsr' 'spo0a~p' 'spoiig' 'ydhd' 'ykud' 'ykvp' 'ywhe' 'spovg' 'rsfa' 'cwlh' 'katx' 'rocg' 'yfhs' 'yfhq' 'yfhr' 'sspe' 'bmrur' 'ydap' 'ydae' 'ydag' 'yfkm' 'cot' 'cotd' 'sspg' 'sspj' 'hag' 'comf' 'ykza' 'csbb' 'nade' 'ytxh' 'bkd' 'degr' 'cotc' 'cotx' 'cotb' 'sigw' 'taga' 'tagd' 'tuaa' 'arae' 'sigmal' 'spo0h' 'sigma28' 'sigma32' 'spoiva' 'pbp4*' 'racx' 'ytei' 'yuag' 'yknxyz' 'ydjp' 'yfhm' 'phrc' 'ald' 'kdgr' 'sigx' 'ypun' 'clpc' 'ftsy' 'gsib' 'sigb' 'ssph' 'sspl' 'sspn' 'tlp']
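The document-level split referenced above can be sketched as follows; the round-robin assignment of documents to folds is our own assumption, since the exact fold construction follows [28].

from collections import defaultdict

def document_level_folds(sentences, n_folds=10):
    # sentences: list of (sentence_id, document_id, text) tuples.
    by_doc = defaultdict(list)
    for sentence in sentences:
        by_doc[sentence[1]].append(sentence)
    # Assign whole documents to folds so that no publication contributes
    # sentences to more than one fold.
    folds = [[] for _ in range(n_folds)]
    for i, doc_id in enumerate(sorted(by_doc)):
        folds[i % n_folds].extend(by_doc[doc_id])
    return folds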

Temperature parameter optimization in GPT-3

OpenAI's API allows modulation of the 'temperature' parameter, which determines how greedy or creative the generative model is. The parameter ranges between 0 (the least creative) and 1 (the most creative). We explored the impact of this parameter on PPI identification using the OpenAI API and 11 temperatures (minimum = 0, maximum = 1, increment = 0.1). A temperature of 0.1 yielded the highest overall performance for GPT-3 and was therefore used in the present study.
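A minimal sketch of this sweep, reusing the call_GPT_API helper shown later in Figure 2; the model name, token limit, and the base_query string (holding the base query from Table 3) are illustrative assumptions.

# Evaluate 11 temperatures from 0.0 to 1.0 in increments of 0.1.
for step in range(11):
    temperature = round(0.1 * step, 1)
    output = call_GPT_API(model="text-davinci-003",  # assumed GPT-3 model
                          temperature=temperature,
                          query=base_query,
                          max_token=1024,
                          top_p=1)
    # Score the parsed predictions at this temperature against the gold PPIs.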

Performance evaluation

To ensure consistency, for each fold we obtained the outputs of GPT-3, GPT-3.5, and GPT-4 from three separate runs and averaged their evaluation performance. We refreshed the browser after each prompt to keep ChatGPT from retaining the previous prompts in its conversational context.
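A minimal sketch of this per-fold scoring and three-run averaging, assuming predictions and gold annotations are represented as sets of (sentence ID, protein 1, protein 2) tuples:

def precision_recall_f1(predicted, gold):
    # predicted, gold: sets of (sentence_id, protein_1, protein_2) tuples.
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def average_runs(run_scores):
    # run_scores: [(p, r, f1), ...] for the three runs of one fold.
    return [sum(values) / len(values) for values in zip(*run_scores)]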
For the BERT-based models, we performed fine-tuning in a 10-fold cross-validation setting, where the folds were created at the document level, as introduced above. This document-level fold splitting ensured that the sentences from one document were used only in either the training or the testing set, avoiding overfitting [29]. The hyperparameters used in this study are shown in Table 4. We used the slow tokenizer for tokenization with the BioM-ALBERT-xxlarge model.

Table 4: Hyperparameters used for k-fold cross-validation of the BERT and ALBERT models

Hyperparameter | Value
Optimizer | Adam
Learning Rate | 5e-5
Batch Size | 16
Weight Decay | 1e-1
Epochs per Fold | 6
Number of Folds | 10
Max Sentence Length | 128
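For illustration, a minimal fine-tuning sketch with the Table 4 hyperparameters using the Hugging Face transformers library; the checkpoint name and the train_dataset/eval_dataset objects are assumptions, not the authors' exact code.

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "dmis-lab/biobert-v1.1"  # assumed BioBERT v1.1 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

args = TrainingArguments(
    output_dir="ppi_fold",
    learning_rate=5e-5,              # Table 4; Trainer uses an Adam-based optimizer by default
    per_device_train_batch_size=16,
    weight_decay=0.1,                # 1e-1
    num_train_epochs=6,              # epochs per fold
)

# train_dataset and eval_dataset would hold the masked, tokenized entity-pair
# sentences (max length 128) for one of the ten document-level folds.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()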

Results and Discussion


We conducted a thorough comparison of GPT and BERT-based models on the LLL dataset using identical 10-fold settings to maintain consistency across all models. The queries for the GPT models were issued in three separate runs, which were then averaged (Supplementary Tables S1-S16).
Interacting with GPT API and web interface

Figure 2 illustrates a Python code segment for accessing the GPT-3 API, together with its output. The predicted interaction pairs were returned with their corresponding sentence IDs. As shown in Figure 3, GPT-3 achieved its highest performance on all measures with the temperature parameter set to 0.1, which was therefore used for all GPT-3 analyses.

import openai  # assumes openai.api_key has been set beforehand

def call_GPT_API(model, temperature, query, max_token, top_p):
    # Read one fold's sentences and prepend the query from Table 3.
    with open('input.txt') as f:
        prompt = query + '\n' + f.read()
    print(prompt)
    # Legacy Completion endpoint, as used at the time of the study.
    completions = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_token,
        top_p=top_p,
        frequency_penalty=0,
        presence_penalty=0
    )
    # Return the generated table of predicted interaction pairs.
    return completions.choices[0].text

Figure 2. GPT API code and output. (A) Python code segment for accessing the OpenAI API. (B) An example output of GPT-3 for fold 9.
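A minimal sketch, under our own assumptions about the output format, of converting the returned table (Figure 2B) into scoreable predictions:

def parse_gpt_output(message):
    # Convert rows of "Sentence ID | Protein 1 | Protein 2 | Interaction"
    # into (sentence_id, protein_1, protein_2) tuples, skipping the header.
    predictions = set()
    for line in message.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) == 4 and cells[0].lower() != "sentence id":
            sent_id, p1, p2, _ = cells
            predictions.add((sent_id,) + tuple(sorted((p1, p2))))
    return predictions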


Figure 3. Performance evaluation of the temperature parameter in GPT-3.

Unlike GPT-3, GPT-3.5 and GPT-4 were accessed via the ChatGPT Plus web interface due to limited API access at the time of this study. Figure 4 depicts an example input and output for GPT-4.

Figure 4: An example input and output of GPT-4 (via the ChatGPT Plus web interface).
PPI identification performance

Table 5 summarizes the PPI identification performance of the 16 models, including three variations of each GPT version. In general, the BERT-based models outperformed the GPTs; however, when protein names were provided, the GPTs, particularly GPT-4, demonstrated performance quite comparable to the best-performing BERT-based models (Supplementary Figure 1).

Overall, GPT-4 performed the best among all versions of the GPT models, whether or not protein names were provided. However, in terms of precision on the base queries, GPT-3.5 achieved higher performance than GPT-4, with a score of 79.11% compared to GPT-4's 73.97%. The base GPT models had lower precision than most of the BERT-based models; when provided with protein names, however, the precision of the GPT models improved substantially, approaching that of the best-performing PubMedBERT model, which achieved 85.17% precision. Specifically, GPT-4 with protein names achieved 83.71% precision.

Moreover, GPT-4 outperformed BioM-ALBERT-xxlarge and SciBERT_scivocab_cased in terms of F1-score. While PubMedBERT had the best precision (1.46% higher than GPT-4 with protein names) and F1-score, BioM-ALBERT-xxlarge achieved the highest recall, albeit with lower precision and F1-score. It is worth noting that although the BERT-based models demonstrate impressive performance, they require fine-tuning with supervised learning, which takes considerable time and technical expertise. In comparison, zero-shot models such as GPT-3, ChatGPT, and GPT-4 do not require such extensive fine-tuning, making them more accessible and practical for specific use cases.
The BioM-ALBERT-xxlarge model had the highest recall among all models, at 93.83% (4.13% higher than the second-best, BioM-BERT-PubMed-PMC-Large). However, its precision and F1 scores were not as good as those of the BERT models. There could be several reasons for this discrepancy. One possibility is that the BioM-ALBERT-xxlarge model had a larger hidden layer size (4096) than the BERT models we experimented with (1024). Another potential reason is that the BERT models use the WordPiece tokenization technique, whereas ALBERT uses the SentencePiece tokenization technique.
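The tokenization difference can be inspected directly; a minimal sketch, assuming the models' public Hugging Face checkpoint names:

from transformers import AutoTokenizer

wordpiece = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
sentencepiece = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge",
                                              use_fast=False)  # slow tokenizer, as noted above

print(wordpiece.tokenize("PROTEIN1 phosphorylates PROTEIN2"))
print(sentencepiece.tokenize("PROTEIN1 phosphorylates PROTEIN2"))
# WordPiece marks word-internal pieces with '##', whereas SentencePiece marks
# word boundaries with '▁', so the two model families segment rare protein
# names differently.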

Table 5: Evaluation results for PPI identification on the LLL dataset for the BERT- and GPT-based models.

Model | Precision | Recall | F1-Score

Autoregressive Language Models:
GPT-3 | 70.34% | 56.47% | 61.48%
GPT-3 with protein names | 78.62% | 61.73% | 68.12%
GPT-3 with normalized protein names | 75.31% | 56.71% | 63.83%
GPT-3.5 (ChatGPT) | 79.11% | 58.24% | 65.75%
GPT-3.5 (ChatGPT) with protein names | 79.16% | 62.19% | 68.37%
GPT-3.5 (ChatGPT) with normalized protein names | 83.06% | 63.96% | 71.05%
GPT-4 (ChatGPT) | 73.97% | 62.41% | 66.61%
GPT-4 (ChatGPT) with protein names | 83.71% | 72.77% | 76.89%
GPT-4 (ChatGPT) with normalized protein names | 83.34% | 76.57% | 79.18%

Masked Language Models:
Bio_ClinicalBERT | 78.39% | 88.63% | 82.72%
BioBERT | 83.28% | 89.36% | 85.87%
BioM-ALBERT-xxlarge | 60.86% | 93.83% | 71.72%
BioM-BERT-PubMed-PMC-Large | 84.69% | 89.70% | 86.22%
PubMedBERT | 85.17% | 88.73% | 86.47%
SciBERT_scivocab_cased | 65.66% | 88.53% | 74.93%
SciBERT_scivocab_uncased | 84.45% | 85.98% | 84.74%

In the original table, a color gradient indicates relative performance in each measure, with brighter red indicating higher performance, and the top-performing model of each type is highlighted in bold.


Despite being primarily designed for text generation, GPT-3 demonstrated a remarkable ability to identify interactions from the biomedical literature, particularly in NLP tasks such as protein-protein interaction or relation identification. However, further development of language model-based methods is needed to address sensitive tasks in these areas. Nevertheless, improving GPT-4 with biomedical corpora, such as PubMed and PMC, for PPI identification is warranted. With additional information, such as a dictionary of protein names, GPT showed decent performance and can achieve substantially improved performance comparable to that of BERT-based models, indicating the potential of GPT for these NLP tasks. Further research is needed to explore and enhance the capabilities of GPT-based models in the biomedical domain.

One potential area of future research is exploring the use of ontologies to improve literature mining for PPI identification. For instance, previous work has utilized the Interaction Network Ontology (INO) to facilitate the mining and accuracy of gene-gene or protein-protein interactions [30-33]. In addition, INO's hierarchical structure and semantic relationships among different interaction types provide a foundation for deeper analysis of mined interactions. Although ontologies were not utilized in the present study, we aim to investigate how they can be combined with existing literature mining tools to further enhance performance in future studies.

Acknowledgments
The study was supported by the U.S. National Institute of Allergy and Infectious Diseases (U24AI171008 to Y.H. and J.H.). The GEBIP Award of the Turkish Academy of Sciences (to A.Ö.) is gratefully acknowledged.
References
1. Alonso-Lopez, D., et al., APID interactomes: providing proteome-based interactomes
with controlled quality for multiple species and derived networks. Nucleic acids research,
2016. 44(W1): p. W529-W535.
2. Szklarczyk, D., et al., STRING v11: protein–protein association networks with increased
coverage, supporting functional discovery in genome-wide experimental datasets.
Nucleic acids research, 2019. 47(D1): p. D607-D613.
3. Oughtred, R., et al., The BioGRID database: A comprehensive biomedical resource of
curated protein, genetic, and chemical interactions. Protein Science, 2021. 30(1): p. 187-
200.
4. Choi, S.-P., Extraction of protein–protein interactions (PPIs) from the literature by deep
convolutional neural networks with various feature embeddings. Journal of Information
Science, 2018. 44(1): p. 60-73.
5. Peng, Y. and Z. Lu, Deep learning for extracting protein-protein interactions from
biomedical literature. arXiv preprint arXiv:1706.01556, 2017.
6. Hsieh, Y.-L., et al. Identifying protein-protein interactions in biomedical literature using
recurrent neural networks with long short-term memory. in Proceedings of the eighth
international joint conference on natural language processing (volume 2: short papers).
2017.
7. Ahmed, M., et al. Identifying protein-protein interaction using tree LSTM and structured
attention. in 2019 IEEE 13th international conference on semantic computing (ICSC).
2019. IEEE.
8. Gu, Y., et al., Domain-specific language model pretraining for biomedical natural
language processing. ACM Transactions on Computing for Healthcare (HEALTH), 2021.
3(1): p. 1-23.
9. Vaswani, A., et al., Attention is all you need. Advances in neural information processing
systems, 2017. 30.
10. Lee, J., et al., BioBERT: a pre-trained biomedical language representation model for
biomedical text mining. Bioinformatics, 2020. 36(4): p. 1234-1240.
11. Devlin, J., et al., Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805, 2018.
12. Kojima, T., et al., Large language models are zero-shot reasoners. arXiv preprint
arXiv:2205.11916, 2022.
13. OpenAI, GPT-4 Technical Report. ArXiv, 2023. abs/2303.08774.
14. Katz, D.M., et al., GPT-4 Passes the Bar Exam. Available at SSRN 4389233, 2023.
15. Nori, H., et al., Capabilities of GPT-4 on Medical Challenge Problems. arXiv preprint
arXiv:2303.13375, 2023.
16. Nédellec, C. Learning language in logic-genic interaction extraction challenge. in 4.
Learning language in logic workshop (LLL05). 2005. ACM-Association for Computing
Machinery.
17. Erkan, G., A. Özgür, and D. Radev. Semi-supervised classification for extracting protein
interaction sentences using dependency parsing. in Proceedings of the 2007 Joint
Conference on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning (EMNLP-CoNLL). 2007.
18. Radford, A., et al., Improving language understanding by generative pre-training. 2018.
19. Radford, A., et al., Language models are unsupervised multitask learners. OpenAI blog,
2019. 1(8): p. 9.
20. Brown, T., et al., Language models are few-shot learners. Advances in neural
information processing systems, 2020. 33: p. 1877-1901.
21. Beltagy, I., K. Lo, and A. Cohan, SciBERT: A pretrained language model for scientific
text. arXiv preprint arXiv:1903.10676, 2019.
22. Ammar, W., et al., Construction of the literature graph in semantic scholar. arXiv preprint
arXiv:1805.02262, 2018.
23. Alsentzer, E., et al., Publicly available clinical BERT embeddings. arXiv preprint
arXiv:1904.03323, 2019.
24. Johnson, A., T. Pollard, and R. Mark, MIMIC-III clinical database (version 1.4).
PhysioNet, 2016. 10(C2XW26): p. 2.
25. Alrowili, S. and K. Vijay-Shanker. BioM-transformers: building large biomedical language
models with BERT, ALBERT and ELECTRA. in Proceedings of the 20th Workshop on
Biomedical Language Processing. 2021.
26. Lan, Z., et al., Albert: A lite bert for self-supervised learning of language representations.
arXiv preprint arXiv:1909.11942, 2019.
27. Krallinger, M., et al. Overview of the BioCreative VI chemical-protein interaction Track. in
Proceedings of the sixth BioCreative challenge evaluation workshop. 2017.
28. Airola, A., et al., All-paths graph kernel for protein-protein interaction extraction with
evaluation of cross-corpus learning. BMC bioinformatics, 2008. 9: p. 1-12.
29. Mooney, R. and R. Bunescu, Subsequence kernels for relation extraction. Advances in
neural information processing systems, 2005. 18.
30. Hur, J., A. Ozgur, and Y. He, Ontology-based literature mining of E. coli vaccine-
associated gene interaction networks. J Biomed Semantics, 2017. 8(1): p. 12.
31. Ozgur, A., J. Hur, and Y. He, The Interaction Network Ontology-supported modeling and
mining of complex interactions represented with multiple keywords in biomedical
literature. BioData Min, 2016. 9: p. 41.
32. Karadeniz, I., et al., Literature Mining and Ontology based Analysis of Host-Brucella
Gene-Gene Interaction Network. Front Microbiol, 2015. 6: p. 1386.
33. Hur, J., et al., Development and application of an interaction network ontology for
literature mining of vaccine-associated gene-gene interactions. J Biomed Semantics,
2015. 6: p. 2.
Appendix

Supplementary Figure 1: Evaluation results for PPI identification on the LLL dataset for the BERT- and GPT-based models.
Supplementary Table S1. PPI identification performance of GPT-3.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 38.09% 70.00% 100.00% 31.81% 83.56% 100.00% 52.33% 71.43% 89.53% 66.67% 70.34%

Recall 17.78% 70.00% 100.00% 29.63% 44.44% 100.00% 46.67% 34.48% 75.55% 46.15% 56.47%

F1 Score 24.24% 70.00% 100.00% 30.57% 57.93% 100.00% 49.06% 46.51% 81.92% 54.55% 61.48%

Supplementary Table S2. PPI identification performance of GPT-3 with protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 33.33% 88.89% 100.00% 93.75% 89.58% 100.00% 70.00% 74.53% 91.67% 44.44% 78.62%

Recall 13.33% 80.00% 100.00% 83.33% 53.09% 100.00% 46.67% 36.78% 73.33% 30.77% 61.73%

F1 Score 19.05% 84.21% 100.00% 88.24% 66.67% 100.00% 56.00% 49.21% 81.48% 36.36% 68.12%

Supplementary Table S3. PPI identification performance of GPT-3 with normalized protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 48.61% 88.89% 87.50% 78.41% 87.50% 100.00% 63.89% 76.67% 77.22% 44.44% 75.31%

Recall 24.44% 80.00% 77.78% 74.07% 51.85% 83.33% 44.44% 44.83% 55.56% 30.77% 56.71%

F1 Score 32.44% 84.21% 82.35% 76.11% 65.12% 90.47% 51.21% 55.51% 64.49% 36.36% 63.83%

Supplementary Table S4. PPI identification performance of GPT-3.5 (via ChatGPT Plus web interface).

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 41.67% 89.26% 100.00% 75.32% 87.91% 48.15% 80.48% 82.71% 88.89% 96.67% 79.11%

Recall 24.44% 83.33% 100.00% 33.33% 44.44% 54.17% 48.89% 58.62% 71.11% 64.10% 58.24%
F1 Score 30.73% 86.14% 100.00% 45.79% 59.03% 50.98% 60.48% 68.46% 79.01% 76.88% 65.75%

Supplementary Table S5. PPI identification performance of GPT-3.5 (via ChatGPT Plus web interface) with protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 79.85% 82.96% 100.00% 88.85% 86.72% 67.62% 70.91% 65.66% 84.19% 64.81% 79.16%

Recall 62.22% 80.00% 100.00% 50.00% 67.90% 66.67% 48.89% 26.44% 71.11% 48.72% 62.19%

F1 Score 69.62% 81.40% 100.00% 62.23% 75.57% 66.67% 57.85% 37.68% 77.07% 55.60% 68.37%

Supplementary Table S6. PPI identification performance of GPT-3.5 (via ChatGPT Plus web interface) with normalized protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 66.88% 89.63% 100.00% 70.58% 86.58% 85.71% 83.33% 60.41% 87.52% 100.00% 83.06%

Recall 40.00% 86.67% 100.00% 33.33% 64.20% 70.83% 62.22% 29.89% 75.55% 76.92% 63.96%

F1 Score 49.87% 88.07% 100.00% 43.77% 73.32% 77.46% 71.11% 38.92% 81.02% 86.96% 71.05%

Supplementary Table S7. PPI identification performance of GPT-4 (via ChatGPT Plus web interface).

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 46.11% 83.33% 100.00% 65.34% 69.68% 100.00% 52.96% 76.11% 79.49% 66.67% 73.97%

Recall 35.55% 83.33% 96.30% 25.92% 55.56% 100.00% 55.56% 51.72% 68.89% 51.28% 62.41%

F1 Score 39.57% 83.33% 98.04% 36.48% 61.77% 100.00% 54.14% 61.30% 73.81% 57.70% 66.61%

Supplementary Table S8. PPI identification performance of GPT-4 (via ChatGPT Plus web interface) with protein names.


Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 67.18% 80.00% 100.00% 83.82% 92.48% 100.00% 82.15% 74.74% 85.67% 71.03% 83.71%

Recall 64.44% 80.00% 100.00% 62.96% 60.49% 95.83% 73.34% 41.38% 80.00% 69.23% 72.77%

F1 Score 65.44% 80.00% 100.00% 70.16% 73.13% 97.78% 76.59% 53.14% 82.67% 70.02% 76.89%

Supplementary Table S9. PPI identification performance of GPT-4 (via ChatGPT Plus web interface) with normalized protein names.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average

Precision 65.94% 69.80% 100.00% 89.08% 92.48% 100.00% 63.84% 87.16% 83.86% 81.19% 83.34%

Recall 60.00% 70.00% 100.00% 75.92% 60.49% 100.00% 71.11% 71.26% 80.00% 76.92% 76.57%

F1 Score 62.81% 69.78% 100.00% 80.01% 73.13% 100.00% 67.01% 78.36% 81.73% 78.97% 79.18%

Supplementary Table S10. PPI identification performance of BioBERT.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 88.24% 90.91% 100.00% 84.21% 72.22% 88.89% 70.00% 78.13% 83.33% 76.92% 83.28%
Recall 100.00% 100.00% 100.00% 88.89% 89.66% 100.00% 93.33% 78.13% 66.67% 76.92% 89.36%
F1 Score 93.75% 95.24% 100.00% 86.49% 80.00% 94.12% 80.00% 78.13% 74.07% 76.92% 85.87%

Supplementary Table S11. PPI identification performance of PubMedBERT.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 88.24% 90.00% 100.00% 81.25% 75.68% 100.00% 77.78% 78.38% 81.82% 78.58% 85.17%
Recall 100.00% 90.00% 100.00% 72.22% 96.56% 100.00% 93.33% 90.63% 60.00% 84.62% 88.73%
F1 Score 93.75% 90.00% 100.00% 76.47% 84.85% 100.00% 84.85% 84.06% 69.23% 81.48% 86.47%
Supplementary Table S12. PPI identification performance of Bio_ClinicalBERT.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 75.00% 90.91% 100.00% 75.00% 61.90% 88.89% 70.00% 72.97% 69.23% 80.00% 78.39%
Recall 100.00% 100.00% 100.00% 66.67% 89.66% 100.00% 93.33% 84.38% 60.00% 92.31% 88.63%
F1 Score 85.71% 95.24% 100.00% 70.59% 73.24% 94.12% 80.00% 78.26% 64.29% 85.71% 82.72%

Supplementary Table S13. PPI identification performance of BioM-BERT-PubMed-PMC-Large.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 75.00% 90.91% 100.00% 77.27% 65.91% 100.00% 83.33% 89.29% 81.82% 83.33% 84.69%
Recall 100.00% 100.00% 100.00% 94.44% 100.00% 87.50% 100.00% 78.13% 60.00% 76.92% 89.70%
F1 Score 85.71% 95.24% 100.00% 85.00% 79.45% 93.33% 90.91% 83.33% 69.23% 80.00% 86.22%

Supplementary Table S14. PPI identification performance of BioM-ALBERT-xxlarge.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 75.00% 58.82% 100.00% 34.62% 54.00% 72.73% 66.67% 65.71% 26.92% 54.17% 60.86%
Recall 100.00% 100.00% 100.00% 100.00% 93.10% 100.00% 80.00% 71.88% 93.33% 100.00% 93.83%
F1 Score 85.71% 74.07% 100.00% 51.43% 68.35% 84.21% 72.73% 68.66% 41.80% 70.27% 71.72%

Supplementary Table S15. PPI identification performance of SciBERT_scivocab_cased.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 78.95% 58.82% 75.00% 58.06% 60.00% 80.00% 63.64% 67.74% 58.82% 55.56% 65.66%
Recall 100.00% 100.00% 100.00% 100.00% 82.76% 100.00% 93.33% 65.63% 66.67% 76.92% 88.53%
F1 Score 88.24% 74.07% 85.71% 73.47% 69.57% 88.89% 75.68% 66.67% 65.50% 64.52% 74.93%

Supplementary Table S16. PPI identification performance of SciBERT_scivocab_uncased.

Fold No. 1 2 3 4 5 6 7 8 9 10 Average
Precision 82.35% 90.00% 100.00% 61.90% 76.47% 100.00% 82.35% 90.00% 90.00% 71.43% 84.45%
Recall 93.33% 90.00% 100.00% 72.22% 89.66% 100.00% 93.33% 84.38% 60.00% 76.92% 85.98%
F1 Score 87.50% 90.00% 100.00% 66.67% 82.54% 100.00% 87.50% 87.10% 72.00% 74.07% 84.74%
