Hasin Rehana1,*, Nur Bengisu Çam2,*, Mert Basmaci2, Yongqun He3,4, Arzucan Özgür2,§, and
Junguk Hur2,§
1 Computer Science Graduate Program, University of North Dakota, Grand Forks, North Dakota, 58202, USA
2 Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
3 Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, University
disease pathogenesis, and drug design. However, with the fast-paced growth of biomedical
literature, there is a growing need for automated and accurate extraction of PPIs to facilitate
transformer (GPT) and bidirectional encoder representations from transformers (BERT), have
shown promising results in natural language processing (NLP) tasks. We evaluated the PPI
identification performance of various GPT and BERT models using a manually curated
benchmark corpus of 164 PPIs in 77 sentences from learning language in logic (LLL). BERT-
based models achieved the best overall performance, with PubMedBERT achieving the highest
precision (85.17%) and F1-score (86.47%) and BioM-ALBERT achieving the highest recall
(93.83%). Despite not being explicitly trained for biomedical texts, GPT-4 achieved comparable
performance to the best BERT models with 83.34% precision, 76.57% recall, and 79.18% F1-
score. These findings suggest that GPT models can effectively detect PPIs from text data and
Introduction
Protein-Protein Interactions (PPIs) are essential for numerous biological functions, especially
DNA replication and transcription, signal pathways, cell metabolism, and converting genotype to
processes, pathways, and networks underlying healthy and diseased states. Various public PPI
databases exist [1-3], including the PPI data collected from low-to-mid throughput experiments
assays. However, these resources are still incomplete, not covering all potential PPIs. Besides,
novel PPIs are typically reported in biomedical texts, which need to be added to the existing PPI
databases. Due to the rapid growth of scientific literature, manual extraction of PPIs has
become increasingly challenging, which necessitates automated text mining approaches that do
Natural language processing (NLP) is a focal area in computer science, increasingly applied in
various domains, including biomedical research, which has experienced massive growth in
recent years. Relation extraction, a widely used NLP method, aims to identify relationships
between two or more entities in biomedical text, supporting the automatic analysis of documents
in this domain. Advances in deep learning (DL), such as convolutional neural networks (CNNs)
[4, 5] and recurrent neural networks (RNNs) [6, 7], as well as in NLP have enabled success in
biomedical text mining to discover interactions between protein entities. Pre-training large neural
language models has led to substantial improvements in numerous NLP problems [8]. Following
the publication of "Attention Is All You Need" [9], transformer architectures have achieved state-
of-the-art results in various NLP tasks, including relation extraction in the biomedical domain
[10].
After the development of the transformer architecture [9], transformer-based models like
model, emerged. These models, known as large language models (LLMs), focus on
understanding language and semantics. LLMs are pre-trained on vast amounts of data and can
be fine-tuned for various tasks. Recent studies suggest that LLMs excel at context zero-shot
and few-shot learning [12], analyzing, producing, and comprehending human languages. LLMs'
massive data processing capabilities can be employed to identify connections and trends
among textual elements. Another type of LLM is autoregressive language models, including
boasts benefits like large-scale pre-training, zero and few-shot learning, context awareness,
creativity, and adaptability. OpenAI's ChatGPT, built on the GPT-3.5 series, demonstrates
remarkable potential for analyzing and processing textual data. Recently, GPT-4 [13] has been
introduced, capable of producing, modifying, and cooperating with users in various creative and
technical writing tasks, such as songwriting, screenplay creation, and imitating user writing
styles. The advancements in GPT models, from GPT-3 to GPT-4, showcase the rapid progress
Several studies have evaluated the performance of GPT models for problem-solving on
various standardized tests [14, 15] and have shown that they can achieve performance
comparable to or even better than that of humans and can pass high-level professional
standardized tests such as the bar exam [14]. However, to our knowledge, no study has
evaluated how well GPT models can extract PPIs from biomedical texts.
Here, we present a thorough evaluation of the PPI identification performance of multiple GPT
models and compare these with the state-of-the-art BERT-based NLP models for relation
extraction.
Methods
Dataset
We used the LLL corpus [16], which contains 164 protein-protein interactions (PPIs) in 77
sentences. The LLL corpus is a shared dataset created for the Learning Language in Logic
subtilis, and the sentences are provided in XML files. The LLL dataset does not contain non-
interacting entity pairs, which are essential for our BERT-based models for training. To address
this issue, we generated all possible entity pair combinations using the entities identified in each
sentence. A total of C(n,2) entity pairs could be generated with n protein entities in a sentence.
Then, we labeled the interacting pairs, if reported in the LLL data files, as positive samples and
We also applied basic preprocessing steps to ensure capturing all entities by removing
punctuation marks, digit-only strings, and blank spaces, and converting all the letters into
lowercase, resulting in normalized protein names. For the BERT-based models, similar to the
prior work [17], we replaced the entity pair names with the PROTEIN1 and PROTEIN2
keywords. The entity names, other than the pair, were replaced with the ‘PROTEIN’ keyword.
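The pair generation and entity masking described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `make_pair_samples` and `mask_entities` are hypothetical, interactions are treated as unordered pairs (an assumption), and the example sentence uses made-up protein names rather than text from the LLL corpus.

```python
import re
from itertools import combinations
from math import comb


def make_pair_samples(entities, interacting_pairs):
    """Enumerate all C(n, 2) entity pairs of one sentence and label each pair.

    entities: protein names found in the sentence (assumed unique).
    interacting_pairs: set of frozensets, one per annotated interaction
    (assumption: direction of the interaction is ignored).
    """
    samples = []
    for a, b in combinations(entities, 2):
        label = 1 if frozenset((a, b)) in interacting_pairs else 0
        samples.append((a, b, label))
    return samples


def mask_entities(sentence, entities, first, second):
    """Replace the candidate pair with PROTEIN1/PROTEIN2 and other entities with PROTEIN."""
    replacement = {name: "PROTEIN" for name in entities}
    replacement[first] = "PROTEIN1"
    replacement[second] = "PROTEIN2"
    # Try longer names first so a short name cannot shadow a longer one.
    names = sorted(replacement, key=len, reverse=True)
    # Lookarounds keep a name from matching inside a longer word.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
    )
    return pattern.sub(lambda m: replacement[m.group(1)], sentence)


# Made-up example sentence with three entities and two annotated interactions.
sentence = "ProtA activates ProtB and represses ProtC."
entities = ["ProtA", "ProtB", "ProtC"]
gold = {frozenset(("ProtA", "ProtB")), frozenset(("ProtA", "ProtC"))}

samples = make_pair_samples(entities, gold)
print(len(samples), comb(len(entities), 2))  # both counts equal C(3, 2)
for a, b, label in samples:
    print(label, mask_entities(sentence, entities, a, b))
```

Each sentence with n entities yields C(n,2) masked copies, one per candidate pair, so a classifier sees the same sentence several times with a different pair highlighted.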
Language Models
We evaluated three autoregressive language models (GPT-3, GPT-3.5 via ChatGPT, and GPT-
4 via ChatGPT), each with three variations, and seven masked language models
GPT [18] is a series of language models first released by OpenAI in 2018 and based on the
transformer architecture [9]. The transformer model consists of an encoder that generates
hidden representations and a decoder that produces output sequences using multi-head
attention, which prioritizes data over inductive biases, facilitating large-scale pre-training and
element importance, making it ideal for language translation, text classification, and text
generation. The GPT architecture includes layers with self-attention mechanisms, fully
connected layers, and layer normalization, reducing computational time and preventing
Figure 1 illustrates the history of GPT models released by OpenAI over the past years. The first
version of GPT, GPT-1, had 117 million parameters. It was trained using a large corpus of text
predecessor GPT-1, with 1.5 billion parameters. It was trained on a larger corpus of text data,
including web pages and books, and can generate more coherent and convincing language
responses. GPT-3 [20] has 175 billion parameters and was trained on an enormous corpus
of text data, including web pages, books, and academic articles. GPT-3 has demonstrated
outstanding performance in a wide range of NLP tasks, such as language translation, chatbot
Figure 1. Evolution of GPT Models. GPT: generative pre-trained transformer. API: application
programming interface.
On November 30, 2022, OpenAI released ChatGPT, a natural and engaging conversation tool
capable of producing contextually relevant responses based on text data. ChatGPT was fine-
tuned on the GPT-3.5 series, which included the models: gpt-3.5-turbo-0301, code-davinci-002,
text-davinci-002, and text-davinci-003. In this study, the latest gpt-3.5-turbo-0301 was used for
GPT-3.5. On March 14, 2023, OpenAI introduced its most advanced and cutting-edge system to
date, GPT-4, which has surpassed its predecessors by producing more dependable outcomes.
The architecture and number of parameters for GPT models are summarized in Table 1,
including the GPT-3, ChatGPT, and GPT-4, which were included in the current study.
Model | Year | Architecture | Parameters
GPT-3 | 2020 | Same model and architecture as GPT-2 with 96 layers. Variations include Davinci#, Babbage, Curie, and Ada. | 175 billion
ChatGPT (GPT-3.5) | 2022 | A combination of three models: code-davinci-002, text-davinci-002, and text-davinci-003. | 1.3 billion, 6 billion, and 175 billion
GPT-4 | 2023 | Fine-tuned using reinforcement learning from human feedback. | Supposedly 100 trillion
# Used in the current study.
Six different BERT-based models were included in the current study (Table 2).
● BioBERT [10]: a BERT model pre-trained on PubMed abstracts and PubMed Central
(PMC) full-text articles for different NLP tasks to measure performance. The initial
version, BioBERT v1.0, used >200K abstracts and >270K PMC articles. An expanded
version BioBERT v1.1 was fine-tuned using > 1M PubMed abstracts and was included in
● SciBERT [21]: a BERT model pre-trained on random Semantic Scholar articles [22].
While pre-training with the articles, the entire text was used. The researchers created the
SCIVOCAB from scientific articles of the same size as BASEVOCAB, the BERT-base
model's vocabulary. Both cased and uncased SCIVOCAB were used in the current
study.
● Bio-ClinicalBERT [23]: a fine-tuned BioBERT v1.0 model (PubMed 200K + PMC 270K)
with all notes from MIMIC-III v1.4 [24], a database of the electronic health records,
● PubMedBERT [8]: a BERT model trained explicitly on the BLURB (Biomedical Language
● BioM-BERTLarge [25]: a BERT model with the same architecture as BERTLarge [11],
To extract PPIs from the LLL sentences, we leveraged OpenAI’s application programming
interface (API) access for GPT-3, while GPT-3.5 and GPT-4 were accessed through the web
interface of ChatGPT Plus. We carefully designed the API and web interface prompts to
generate well-structured and stable interactions with minimal post-processing steps. The LLL
data comprised 77 sentences from 44 publications with a total of 164 PPIs. We extracted the
necessary information from the dataset and divided it into ten folds using document-level
folding, as previously introduced [28]. For each fold, we provided the sentence IDs and
sentences as input along with a query, as shown in Table 3. In order to evaluate the impact of
providing the dictionary of biomedical entities covering these 77 sentences, two additional
queries with the original protein names and normalized protein names, created after the
Table 3: Queries incorporated in the prompts for GPT-3 (API), GPT-3.5 (ChatGPT), and
GPT-4 (ChatGPT)
OpenAI’s API allows modulation of the ‘temperature’ parameter in GPTs, which determines how
greedy or creative the generative model is. The parameter ranges between 0 (the least creative)
and 1 (the most creative). We explored the impact of this parameter in PPI identification using
0.1 demonstrated the highest overall performance of GPT-3 and was thus used in the present study.
Performance evaluation
To ensure consistency for each fold, we obtained the outputs of GPT-3, GPT-3.5, and
GPT-4 from three separate runs and averaged their evaluation scores. We refreshed the
browser after each prompt to keep ChatGPT from carrying over the previous prompts.
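As a rough sketch of the scoring described above, the snippet below computes precision, recall, and F1-score for one fold from sets of predicted and annotated interaction pairs, then averages the scores over several runs. The function names `prf` and `average_runs`, the tuple layout (sentence ID, protein, protein), and the exact-match criterion are assumptions made for illustration; the toy pairs are not taken from the LLL data.

```python
def prf(predicted, gold):
    """Return (precision, recall, F1) for sets of predicted and gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_runs(runs, gold):
    """Average each metric over several runs on the same fold."""
    scores = [prf(run, gold) for run in runs]
    return tuple(sum(metric) / len(scores) for metric in zip(*scores))


# Toy fold with four annotated pairs and two runs (illustrative values only).
gold = {("s1", "a", "b"), ("s1", "a", "c"), ("s2", "d", "e"), ("s2", "d", "f")}
run_1 = {("s1", "a", "b"), ("s1", "a", "c"), ("s2", "e", "f")}  # 2 of 3 correct
run_2 = set(gold)                                               # all correct

precision, recall, f1 = average_runs([run_1, run_2], gold)
print(f"{precision:.2%} {recall:.2%} {f1:.2%}")
```

The same averaging applies across folds: the "Average" column in the supplementary tables matches the arithmetic mean of the ten per-fold scores.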
We fine-tuned the BERT-based models in a 10-fold cross-validation setting with folds
created at the document level, as introduced above. Document-level splitting ensures that
the sentences from one document are used in either the training or the testing set, but
never both, to avoid overfitting [29]. The hyperparameters we
used in this study are shown in Table 4. We used the slow tokenizer for the tokenization with
Table 4: Hyperparameters used for K-Fold Cross Validation on BERT and ALBERT
models
Hyperparameter Value
Optimizer Adam
Batch Size 16
Number of Folds 10
with the identical 10-fold settings to maintain consistency across all models. The queries for
GPT models were done in three separate runs, which were averaged later (Supplementary
Tables S1-S16).
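The document-level fold splitting used for both model families can be illustrated as follows. The study reused previously introduced folds [28]; the greedy balancing shown here is only one assumed way to build such folds, and the names `document_level_folds` and `split_fold` as well as the toy sentence and document IDs are hypothetical.

```python
from collections import defaultdict


def document_level_folds(sentence_to_doc, n_folds=10):
    """Assign sentences to folds so that each document's sentences share one fold.

    sentence_to_doc: dict mapping a sentence ID to its source document ID.
    Returns a list of n_folds lists of sentence IDs.
    """
    by_doc = defaultdict(list)
    for sentence_id, doc_id in sentence_to_doc.items():
        by_doc[doc_id].append(sentence_id)

    folds = [[] for _ in range(n_folds)]
    # Place larger documents first, each into the currently smallest fold,
    # so fold sizes stay roughly balanced (an assumed heuristic).
    for doc_id in sorted(by_doc, key=lambda d: (-len(by_doc[d]), d)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_doc[doc_id])
    return folds


def split_fold(folds, test_index):
    """Return (train, test) sentence IDs with one fold held out for testing."""
    test = list(folds[test_index])
    train = [s for i, fold in enumerate(folds) if i != test_index for s in fold]
    return train, test


# Toy example: six sentences from three documents, split into three folds.
sentence_to_doc = {"s1": "d1", "s2": "d1", "s3": "d2",
                   "s4": "d3", "s5": "d3", "s6": "d3"}
folds = document_level_folds(sentence_to_doc, n_folds=3)
train, test = split_fold(folds, test_index=0)
print(folds)
print(train, test)
```

Because a document never straddles the train/test boundary, a model cannot score well simply by recognizing wording it already saw in another sentence of the same abstract.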
Interacting with GPT API and web interface
Figure 2 illustrates a Python code segment to access GPT-3 API and its output. The predicted
interaction pairs were returned with corresponding Sentence IDs. As shown in Figure 3, GPT-3
achieved the highest performance in all measures with the temperature parameter set to 0.1,
(A) (B)
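import openai  # legacy (pre-1.0) openai package, which provides openai.Completion

def call_GPT_API(model, temperature, query, max_token, top_p):
    # Read the input sentences and prepend the query to build the prompt.
    with open('input.txt') as f:
        prompt = f.read()
    prompt = query + '\n' + prompt
    print(prompt)
    completions = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_token,
        top_p=top_p,
        frequency_penalty=0,
        presence_penalty=0
    )
    # Return the text of the first completion.
    message = completions.choices[0].text
    return message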
Figure 2. GPT API code and output. (A) Python code segment for accessing OpenAI API. (B)
Unlike GPT-3, GPT-3.5 and GPT-4 were accessed via the web interface named ChatGPT Plus
due to the limited API access at the time of this study. Figure 4 depicts an example input and
Figure 4: An example input and output of GPT-4 (via ChatGPT Plus web interface).
PPI identification performance
Table 5 summarizes the PPI identification performance of the 16 models, including three
variations of each GPT version. Generally, BERT-based models outperformed GPTs; however,
when protein names were provided, GPTs, particularly GPT-4, demonstrated quite comparable
Overall, GPT-4 performed the best among all versions of the GPT models, whether or not
protein names were provided. However, in terms of precision, GPT-3.5 achieved higher
performance than GPT-4, with a score of 79.11% compared to GPT-4's score of 73.97%.
Initially, the base GPT models had lower precision than most of the BERT-based models.
However, when provided with protein names, the precision of the GPT models improved
85.17% precision. Specifically, the GPT-4 model with protein names provided achieved 83.71%
precision.
of F1-score. While PubMedBERT had the best evaluation score in terms of precision (1.46
percentage points higher than GPT-4) and F1-score, another BERT-based model, BioM-ALBERT,
achieved the highest recall score, albeit with relatively lower precision and F1-score. It is worth noting that
with supervised learning, which takes considerable time and technical expertise. In comparison,
zero-shot learning models such as GPT-3, ChatGPT, and GPT-4 do not require such extensive
fine-tuning, making them more accessible and practical for specific use cases.
The BioM-ALBERT-xxlarge model had the highest recall score among all models, scoring
precision and F1 scores were not as good as the BERT models. There could be several
reasons for this discrepancy. One possibility is that the BioM-ALBERT-xxlarge model had a
larger hidden layer size (4096) than the BERT models (1024) we experimented with. Another
potential reason is that the BERT models use the WordPiece tokenization technique, whereas
Table 5: Evaluation result of PPI on the LLL dataset for BERT and GPT-based models.
higher performance. In the two types of models evaluated in this study, the performance of the top-
ability in identifying interactions from biomedical literature, particularly in NLP tasks such as
improving GPT-4 with biomedical corpora like PubMed and PMC for PPI identification is
warranted. With additional information, such as a dictionary, GPT has shown decent
based models, indicating the potential use of GPT for these NLP tasks. Further research is
needed to explore and enhance the capabilities of GPT-based models in the biomedical
domain.
One potential area of future research is exploring the use of ontology to improve literature
mining for PPI identification. For instance, previous work has utilized the Interaction Network
Ontology (INO) to facilitate the mining and accuracy of gene-gene or protein-protein interactions
[30-33]. In addition, INO's hierarchical structure and semantic relationships among different
interactions provided a foundation for deeper analysis of mined interactions. Although ontology
was not utilized in the present study, we aim to investigate how it can be combined with existing
Acknowledgments
The study was supported by the U.S. National Institute of Allergy and Infectious Diseases
(U24AI171008 to Y.H. and J.H.). GEBIP Award of the Turkish Academy of Sciences (to A.Ö.) is
gratefully acknowledged.
References
1. Alonso-Lopez, D., et al., APID interactomes: providing proteome-based interactomes
with controlled quality for multiple species and derived networks. Nucleic acids research,
2016. 44(W1): p. W529-W535.
2. Szklarczyk, D., et al., STRING v11: protein–protein association networks with increased
coverage, supporting functional discovery in genome-wide experimental datasets.
Nucleic acids research, 2019. 47(D1): p. D607-D613.
3. Oughtred, R., et al., The BioGRID database: A comprehensive biomedical resource of
curated protein, genetic, and chemical interactions. Protein Science, 2021. 30(1): p. 187-
200.
4. Choi, S.-P., Extraction of protein–protein interactions (PPIs) from the literature by deep
convolutional neural networks with various feature embeddings. Journal of Information
Science, 2018. 44(1): p. 60-73.
5. Peng, Y. and Z. Lu, Deep learning for extracting protein-protein interactions from
biomedical literature. arXiv preprint arXiv:1706.01556, 2017.
6. Hsieh, Y.-L., et al. Identifying protein-protein interactions in biomedical literature using
recurrent neural networks with long short-term memory. in Proceedings of the eighth
international joint conference on natural language processing (volume 2: short papers).
2017.
7. Ahmed, M., et al. Identifying protein-protein interaction using tree LSTM and structured
attention. in 2019 IEEE 13th international conference on semantic computing (ICSC).
2019. IEEE.
8. Gu, Y., et al., Domain-specific language model pretraining for biomedical natural
language processing. ACM Transactions on Computing for Healthcare (HEALTH), 2021.
3(1): p. 1-23.
9. Vaswani, A., et al., Attention is all you need. Advances in neural information processing
systems, 2017. 30.
10. Lee, J., et al., BioBERT: a pre-trained biomedical language representation model for
biomedical text mining. Bioinformatics, 2020. 36(4): p. 1234-1240.
11. Devlin, J., et al., Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805, 2018.
12. Kojima, T., et al., Large language models are zero-shot reasoners. arXiv preprint
arXiv:2205.11916, 2022.
13. OpenAI, GPT-4 Technical Report. ArXiv, 2023. abs/2303.08774.
14. Katz, D.M., et al., GPT-4 Passes the Bar Exam. Available at SSRN 4389233, 2023.
15. Nori, H., et al., Capabilities of GPT-4 on Medical Challenge Problems. arXiv preprint
arXiv:2303.13375, 2023.
16. Nédellec, C. Learning language in logic-genic interaction extraction challenge. in 4.
Learning language in logic workshop (LLL05). 2005. ACM-Association for Computing
Machinery.
17. Erkan, G., A. Özgür, and D. Radev. Semi-supervised classification for extracting protein
interaction sentences using dependency parsing. in Proceedings of the 2007 Joint
Conference on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning (EMNLP-CoNLL). 2007.
18. Radford, A., et al., Improving language understanding by generative pre-training. 2018.
19. Radford, A., et al., Language models are unsupervised multitask learners. OpenAI blog,
2019. 1(8): p. 9.
20. Brown, T., et al., Language models are few-shot learners. Advances in neural
information processing systems, 2020. 33: p. 1877-1901.
21. Beltagy, I., K. Lo, and A. Cohan, SciBERT: A pretrained language model for scientific
text. arXiv preprint arXiv:1903.10676, 2019.
22. Ammar, W., et al., Construction of the literature graph in semantic scholar. arXiv preprint
arXiv:1805.02262, 2018.
23. Alsentzer, E., et al., Publicly available clinical BERT embeddings. arXiv preprint
arXiv:1904.03323, 2019.
24. Johnson, A., T. Pollard, and R. Mark, MIMIC-III clinical database (version 1.4).
PhysioNet, 2016. 10(C2XW26): p. 2.
25. Alrowili, S. and K. Vijay-Shanker. BioM-transformers: building large biomedical language
models with BERT, ALBERT and ELECTRA. in Proceedings of the 20th Workshop on
Biomedical Language Processing. 2021.
26. Lan, Z., et al., Albert: A lite bert for self-supervised learning of language representations.
arXiv preprint arXiv:1909.11942, 2019.
27. Krallinger, M., et al. Overview of the BioCreative VI chemical-protein interaction Track. in
Proceedings of the sixth BioCreative challenge evaluation workshop. 2017.
28. Airola, A., et al., All-paths graph kernel for protein-protein interaction extraction with
evaluation of cross-corpus learning. BMC bioinformatics, 2008. 9: p. 1-12.
29. Mooney, R. and R. Bunescu, Subsequence kernels for relation extraction. Advances in
neural information processing systems, 2005. 18.
30. Hur, J., A. Ozgur, and Y. He, Ontology-based literature mining of E. coli vaccine-
associated gene interaction networks. J Biomed Semantics, 2017. 8(1): p. 12.
31. Ozgur, A., J. Hur, and Y. He, The Interaction Network Ontology-supported modeling and
mining of complex interactions represented with multiple keywords in biomedical
literature. BioData Min, 2016. 9: p. 41.
32. Karadeniz, I., et al., Literature Mining and Ontology based Analysis of Host-Brucella
Gene-Gene Interaction Network. Front Microbiol, 2015. 6: p. 1386.
33. Hur, J., et al., Development and application of an interaction network ontology for
literature mining of vaccine-associated gene-gene interactions. J Biomed Semantics,
2015. 6: p. 2.
Appendix
Supplementary Figure 1: Evaluation result of PPI on LLL dataset for BERT and GPT-
based models.
Supplementary Table S1. PPI identification performance of GPT-3.
Precision 38.09% 70.00% 100.00% 31.81% 83.56% 100.00% 52.33% 71.43% 89.53% 66.67% 70.34%
Recall 17.78% 70.00% 100.00% 29.63% 44.44% 100.00% 46.67% 34.48% 75.55% 46.15% 56.47%
F1 Score 24.24% 70.00% 100.00% 30.57% 57.93% 100.00% 49.06% 46.51% 81.92% 54.55% 61.48%
protein names.
Precision 33.33% 88.89% 100.00% 93.75% 89.58% 100.00% 70.00% 74.53% 91.67% 44.44% 78.62%
Recall 13.33% 80.00% 100.00% 83.33% 53.09% 100.00% 46.67% 36.78% 73.33% 30.77% 61.73%
F1 Score 19.05% 84.21% 100.00% 88.24% 66.67% 100.00% 56.00% 49.21% 81.48% 36.36% 68.12%
Precision 48.61% 88.89% 87.50% 78.41% 87.50% 100.00% 63.89% 76.67% 77.22% 44.44% 75.31%
Recall 24.44% 80.00% 77.78% 74.07% 51.85% 83.33% 44.44% 44.83% 55.56% 30.77% 56.71%
F1 Score 32.44% 84.21% 82.35% 76.11% 65.12% 90.47% 51.21% 55.51% 64.49% 36.36% 63.83%
Precision 41.67% 89.26% 100.00% 75.32% 87.91% 48.15% 80.48% 82.71% 88.89% 96.67% 79.11%
Recall 24.44% 83.33% 100.00% 33.33% 44.44% 54.17% 48.89% 58.62% 71.11% 64.10% 58.24%
F1 Score 30.73% 86.14% 100.00% 45.79% 59.03% 50.98% 60.48% 68.46% 79.01% 76.88% 65.75%
Precision 79.85% 82.96% 100.00% 88.85% 86.72% 67.62% 70.91% 65.66% 84.19% 64.81% 79.16%
Recall 62.22% 80.00% 100.00% 50.00% 67.90% 66.67% 48.89% 26.44% 71.11% 48.72% 62.19%
F1 Score 69.62% 81.40% 100.00% 62.23% 75.57% 66.67% 57.85% 37.68% 77.07% 55.60% 68.37%
Precision 66.88% 89.63% 100.00% 70.58% 86.58% 85.71% 83.33% 60.41% 87.52% 100.00% 83.06%
Recall 40.00% 86.67% 100.00% 33.33% 64.20% 70.83% 62.22% 29.89% 75.55% 76.92% 63.96%
F1 Score 49.87% 88.07% 100.00% 43.77% 73.32% 77.46% 71.11% 38.92% 81.02% 86.96% 71.05%
Precision 46.11% 83.33% 100.00% 65.34% 69.68% 100.00% 52.96% 76.11% 79.49% 66.67% 73.97%
Recall 35.55% 83.33% 96.30% 25.92% 55.56% 100.00% 55.56% 51.72% 68.89% 51.28% 62.41%
F1 Score 39.57% 83.33% 98.04% 36.48% 61.77% 100.00% 54.14% 61.30% 73.81% 57.70% 66.61%
Precision 67.18% 80.00% 100.00% 83.82% 92.48% 100.00% 82.15% 74.74% 85.67% 71.03% 83.71%
Recall 64.44% 80.00% 100.00% 62.96% 60.49% 95.83% 73.34% 41.38% 80.00% 69.23% 72.77%
F1 Score 65.44% 80.00% 100.00% 70.16% 73.13% 97.78% 76.59% 53.14% 82.67% 70.02% 76.89%
Precision 65.94% 69.80% 100.00% 89.08% 92.48% 100.00% 63.84% 87.16% 83.86% 81.19% 83.34%
Recall 60.00% 70.00% 100.00% 75.92% 60.49% 100.00% 71.11% 71.26% 80.00% 76.92% 76.57%
F1 Score 62.81% 69.78% 100.00% 80.01% 73.13% 100.00% 67.01% 78.36% 81.73% 78.97% 79.18%
Fold 1 2 3 4 5 6 7 8 9 10 Average
Precision 88.24% 90.91% 100.00% 84.21% 72.22% 88.89% 70.00% 78.13% 83.33% 76.92% 83.28%
Recall 100.00% 100.00% 100.00% 88.89% 89.66% 100.00% 93.33% 78.13% 66.67% 76.92% 89.36%
F1 Score 93.75% 95.24% 100.00% 86.49% 80.00% 94.12% 80.00% 78.13% 74.07% 76.92% 85.87%
Fold 1 2 3 4 5 6 7 8 9 10 Average
Precision 88.24% 90.00% 100.00% 81.25% 75.68% 100.00% 77.78% 78.38% 81.82% 78.58% 85.17%
Recall 100.00% 90.00% 100.00% 72.22% 96.56% 100.00% 93.33% 90.63% 60.00% 84.62% 88.73%
F1 Score 93.75% 90.00% 100.00% 76.47% 84.85% 100.00% 84.85% 84.06% 69.23% 81.48% 86.47%
Supplementary Table S12. PPI identification performance of
Bio_ClinicalBert.
Fold 1 2 3 4 5 6 7 8 9 10 Average
Precision 75.00% 90.91% 100.00% 75.00% 61.90% 88.89% 70.00% 72.97% 69.23% 80.00% 78.39%
Recall 100.00% 100.00% 100.00% 66.67% 89.66% 100.00% 93.33% 84.38% 60.00% 92.31% 88.63%
F1 Score 85.71% 95.24% 100.00% 70.59% 73.24% 94.12% 80.00% 78.26% 64.29% 85.71% 82.72%
PubMed-PMC-Large.
Fold 1 2 3 4 5 6 7 8 9 10 Average
Precision 75.00% 90.91% 100.00% 77.27% 65.91% 100.00% 83.33% 89.29% 81.82% 83.33% 84.69%
F1 Score 85.71% 95.24% 100.00% 85.00% 79.45% 93.33% 90.91% 83.33% 69.23% 80.00% 86.22%
BioM-ALBERT-xxlarge.
Fold 1 2 3 4 5 6 7 8 9 10 Average
Precision 75.00% 58.82% 100.00% 34.62% 54.00% 72.73% 66.67% 65.71% 26.92% 54.17% 60.86%
F1 Score 85.71% 74.07% 100.00% 51.43% 68.35% 84.21% 72.73% % 41.80% 70.27% 71.72%
SciBERT_scivocab_cased.
Fold 1 2 3 4 5 6 7 8 9 10 Average
Precision 78.95% 58.82% 75.00% 58.06% 60.00% 80.00% 63.64% 67.74% 58.82% 55.56% 65.66%
F1 Score 88.24% 74.07% 85.71% 73.47% 69.57% 88.89% 75.68% 66.67% 65.50% 64.52% 74.93%
SciBERT_scivocab_uncased.
Fold 1 2 3 4 5 6 7 8 9 10 Average
Precision 82.35% 90.00% 100.00% 61.90% 76.47% 100.00% 82.35% 90.00% 90.00% 71.43% 84.45%
Recall 93.33% 90.00% 100.00% 72.22% 89.66% 100.00% 93.33% 84.38% 60.00% 76.92% 85.98%
F1 Score 87.50% 90.00% 100.00% 66.67% 82.54% 100.00% 87.50% 87.10% 72.00% 74.07% 84.74%