Article
Gradient Boosting over Linguistic-Pattern-Structured Trees for
Learning Protein–Protein Interaction in the Biomedical Literature
Neha Warikoo 1,2 , Yung-Chun Chang 3,4, * and Shang-Pin Ma 5, *
Abstract: Protein-based studies contribute significantly to gathering functional information about bio-
logical systems; therefore, the protein–protein interaction detection task is one of the most researched
topics in the biomedical literature. To this end, many state-of-the-art systems using syntactic tree
kernels (TK) and deep learning have been developed. However, these models are computationally
complex and have limited learning interpretability. In this paper, we introduce a linguistic-pattern-
representation-based Gradient-Tree Boosting model, i.e., LpGBoost. It uses linguistic patterns to
optimize and generate semantically relevant representation vectors for learning over the gradient-tree
boosting. The patterns are learned via unsupervised modeling by clustering invariant semantic
features. These linguistic representations are semi-interpretable with rich semantic knowledge, and
owing to their shallow representation, they are also computationally less expensive. Our
experiments with six protein–protein interaction (PPI) corpora demonstrate that LpGBoost
outperforms the SOTA tree-kernel models, as well as the CNN-based interaction detection
studies for BioInfer and AIMed corpora.
Keywords: protein–protein interaction; natural language processing; gradient-tree boosting;
linguistic patterns; bioinformatics
Citation: Warikoo, N.; Chang, Y.-C.; Ma, S.-P. Gradient Boosting over
Linguistic-Pattern-Structured Trees for Learning Protein–Protein Interaction in the Biomedical
Literature. Appl. Sci. 2022, 12, 10199. https://doi.org/10.3390/app122010199
based on deep-learning architectures have also generated interesting results, particularly for
larger corpora such as AIMed and BioInfer. They employ enriched embedding developed
with features such as shortest dependency path (sdpCNN), continuous bag of words
(CBOW) embedding (MCCNN), and dependency-tree embedding (McDepCNN) [7–9].
The syntactically relevant representation generated from these embeddings helps improve
the performance of convolutional neural networks (CNNs). These approaches draw a
contrast to the feature-engineering techniques used by TK-based PPI detection studies. In
separate studies conducted by Murugesan et al. [5] and Warikoo et al. [6], enriched syntax
and semantic features were used with tree kernels to achieve improved PPI detection.
In this paper, we introduce gradient-boosted learning on semantically optimized
linguistic-pattern representation. eXtreme Gradient Boosting (XGBoost) is a gradient-boosted
decision-tree method that can be used for classification or regression problems. Gradient
boosting adds weak learners sequentially, with each new learner fitted to correct the
residual errors of the learners added before it. The learners are then summed for the final
prediction, which is typically more accurate than that of any single learner. XGBoost has
been validated on real-life, large-scale, and imbalanced datasets, and it solves many
data-science problems quickly and accurately. Therefore, we developed our model based
on XGBoost in this research. The feature-optimization and -representation studies were
developed from our previous work, using unsupervised learning to extract semantically
invariant linguistic features. These invariant linguistic patterns generate a pattern-based
representation, which is adapted with XGBoost for gradient-boosted tree learning. The
invariant lexical patterns contribute to context-based representation for each token, in
contrast to long vector embeddings based on word co-distribution. We evaluated our work
on five benchmark PPI corpora, i.e., HPRD50, LLL, IEPA, AIMed, and BioInfer [10], and
a new PPI corpus, AIdea,
https://aidea-web.tw/file/d0cec130-b15d-4c4c-b7e8-bf95a0c34dd8-553658198_train_test_dataset_1___Sample_data.7z
(accessed on 13 May 2022). The results from our study show improvement in interaction
detection over both tree-kernel and CNN-based studies, particularly for larger corpora.
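The residual-correction idea behind gradient boosting can be illustrated with a minimal sketch. It is not the XGBoost implementation used in this study: it assumes a squared-error loss, one-dimensional inputs, and decision stumps as weak learners, and the function names (`fit_stump`, `boost`, `mse`) and the toy data are hypothetical.

```python
"""Toy gradient boosting with decision stumps and squared-error loss."""


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) that best fits the residuals."""
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Add one stump per round, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        # Shrink each correction by the learning rate before adding it.
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
base, stumps, preds = boost(xs, ys)
initial_error = mse(ys, [base] * len(ys))
final_error = mse(ys, preds)
print(f"MSE before boosting: {initial_error:.4f}, after: {final_error:.6f}")
```

Each round reduces the remaining residual by a fraction set by the learning rate, which is why the summed ensemble fits better than any single stump.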
The remainder of this paper is organized as follows. We first present a Related Works section,
followed by the Method section, which discusses the architecture of our model. Thereafter,
case studies and a result analysis are given in the Results and Discussion. Finally, the
study’s conclusion is given in the last section.
2. Related Works
In the last decade, many noteworthy studies have been conducted to improve relation
detection in PPI datasets by experimenting with feature representation, classification strate-
gies, and ensemble learning architectures, as shown in Table 1. A majority of these studies
have been developed on supervised tree-kernel classifiers, using semantic, lexical, or de-
pendency graph-based features. The tree-kernel-based studies exploit dependency-graph
substructures (i.e., convolved sub-graphs) for classification. The tree/graph representations
are often enriched with knowledge-rich feature vectors, such as lexical, syntactic, and entity
co-occurrence, as implemented in studies by Airola et al. [11] and Landeghem et al. [12].
These consolidated semantic-based features perform relatively well on smaller datasets,
such as HPRD50, LLL, and IEPA. However, developing selective knowledge-rich features
is laborious and time-consuming. Alternatively, some studies have experimented with
enriched dependency-graph structures. Erkan et al. and Satre et al. used syntactic sub-
structures from dependency trees to develop tree-feature labels, using a dot product [13,14].
The similarity score from tree features is used with maximum-margin classifiers for bi-
nary classification. Studies by Chang et al. [15] and Warikoo et al. [6] have experimented
with features such as interaction pattern trees or linguistic patterns to generate reduced
dependency-tree representations for tree-kernel classifiers. Although graph/tree-kernel-
based architectures can produce SOTA results, the overall model runtime is known to be
computationally intensive, as discussed by Warikoo et al. [6].
More recently, an increasing number of PPI studies have been developed on deep-
learning models, primarily using convolutional neural network (CNN) and long short-
term memory (LSTM) architectures. Peng et al. [8] and Hua et al. [7] both experimented
with multichannel CNN models, each using a different channel size and embedding
features for a PPI detection task. While Hua et al. chose PMC, PubMed, Medline, and
Wikipedia to develop the five-channel embedding layer, Peng et al. used semantic and
lexical feature information, along with word embedding layers, to initialize seven-channel
embedding. The use of multichannel input captures separate views of data based on each
channel representation, a methodology similar to the use of RGB channels for image
processing. Yadav et al. [16] used composite embedding layers with bidirectional LSTM
architecture to deliver promising results for AIMed and BioInfer corpora. Embedding for
each instance was structured around the shortest-dependency path, enriched by parts-
of-speech (POS) and word-position embeddings. Single-channel word-embedding deep-
learning models have limited task interpretability, and they are therefore often augmented
with semantic features, such as POS, position, or chunk vectors, to improve classification,
using multiple data views. In addition, with the development of pre-trained language
models, transformer-based BERT models have yielded promising results in the biomedical
domain. Su et al. [17] and Su et al. [18] both experimented with the fine-tuning process of the
BERT model. While Su et al. [17] applied either an LSTM or an attention mechanism to the
last layer, Su et al. [18] summarized the outputs of the last layer by using a BiLSTM and an
attention mechanism. Moreover, Su et al. [19] drew on an external knowledge base and built
a BERT model with contrastive pre-training. Based on the pre-trained BERT model,
Warikoo et al. [20] also proposed a lexically aware transformer-based BERT model that
explores both local and global context representations for sentence-level classification
tasks. Wu et al. [21] used cross-entropy as the loss function and applied dropout before
each fully connected layer during training.
As evident from this brief overview, representation enriched with semantic/syntactic
features has been consistently used with different learning models to improve performance
in bio-entity relation tasks. From the graph-based SOTA tree-kernel models to the gener-
ation of enriched embeddings for CNN, these features have been adapted with different
models to incorporate syntactic and semantic information into the feature representation [5].
Semantic features have also been developed into patterns for dependency-tree pruning
as part of tree-kernel studies [6,7]. However, not many studies have implemented seman-
tic pattern representations with decision-tree classifiers [3]. In this study, we propose a
linguistic-pattern-based gradient-boosted decision-tree model which employs a semantic
representation for each bio-entity interaction, consolidated and optimized with linguistic
Figure 1. System architecture for LpGBoost.
3.1. Preprocessing
3.1.1. Entity-Interaction Representation
We limit the input instance size to 2n (n = sliding window size), covering prefix and
suffix “n-features” around each bio-entity interaction target pair. Earlier works in PPI
have also adapted reduced representation either using shortest-path dependency graphs or
entity-labeled linguistic patterns [7,14]. In order to develop an even semantic representation,
the original instances are first converted to POS-tag sequences, using Genia Tagger [23,24].
Then a sliding window is used to generate a candidate instance representation of size 4n
(2n for each target-entity). The resulting reduced sequence is called the candidate sequence.
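As a rough illustration of the reduction described in this subsection, the sketch below collects n tags before and n tags after each of the two target entities, giving 4n tags in total. The function name `candidate_sequence`, the `PAD` filler, the choice to exclude the entity token itself, and the example tag sequence are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
def candidate_sequence(pos_tags, entity_positions, n=3, pad="PAD"):
    """Collect n tags before and n tags after each target entity (2n per entity)."""
    candidate = []
    for pos in entity_positions:
        prefix = pos_tags[max(0, pos - n):pos]
        suffix = pos_tags[pos + 1:pos + 1 + n]
        # Pad short contexts so every entity contributes exactly 2n tags.
        prefix = [pad] * (n - len(prefix)) + prefix
        suffix = suffix + [pad] * (n - len(suffix))
        candidate.extend(prefix + suffix)
    return candidate


# Hypothetical POS-tag sequence with two normalized protein mentions.
tags = ["NN", "IN", "PROTEINT", "IN", "NN", "IN", "PROTEINT", "VBD", "RB", "JJ"]
sequence = candidate_sequence(tags, entity_positions=[2, 6], n=3)
print(sequence)
```

With n = 3 and two target entities, the result always has 4n = 12 tags, regardless of the length of the original sentence.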
3.1.2. Semantic Feature Convolution
A sliding window is moved over candidate sequences to extract local semantic subsequence
features. This concept of feature detection is similar to convolution in neural
networks, where a window of size “n” is moved over an input tensor to maximize the
optimal representation; however, here we do not use a max-pool to obtain max feature
vector [25]. Multi-width variant semantic features are extracted from the candidate sequence
of length “m” given by Equation (1):
$$\mathbb{F} = \frac{m - (i \ast n) + n}{s} + 1 \tag{1}$$
where n represents sliding window size, s indicates stride size, i represents feature variant,
and $\mathbb{F}$ is the features/direction. Bidirectionally generated 2n semantic features are
concatenated to generate candidate features.
where x and y represent n and n-1 length semantic components of original n-gram sequence,
respectively. Coefficients $p_{20}$, $p_{11}$, and $p_{02}$ evaluate joint conditional
probabilities for each sequence associated with the respective components. The invariance
for P(x,y) is calculated by Equation (3):
$$I(P) \equiv I(p_{n0} \cdots p_{0n}) = \left[ p_{20}^{2} + \left( \frac{p_{11}^{2}}{2} \right) + p_{02}^{2} \right] \tag{3}$$
where I(P) is the invariant score of P(x,y). The distribution similarity between P(x,y) and
Q(u,v) projections of lexical patterns p and q is concluded if they exhibit the following
property described in Equation (4):
$$I(P) = \Delta^{W} \times I(Q) \tag{4}$$
where ∆ = 1, and W = 1; and I(P) and I(Q) represent the invariant function forms of p and
q, respectively. Using Equation (4), we consolidate all the n-gram semantic patterns into a
reduced ILP, where each cluster holds linguistic patterns that are semantically invariant.
The size of each cluster varies according to the distribution of the constituent patterns
within the corpora.
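A small sketch may help make the clustering step concrete. It assumes that Equation (3) reads I(P) = p20² + p11²/2 + p02² and that, with ∆ = 1 and W = 1, Equation (4) reduces to I(P) = I(Q). The coefficient values, the pattern names, and the rounding tolerance used to compare floating-point scores are hypothetical.

```python
import math
from collections import defaultdict


def invariant(p20, p11, p02):
    """Invariant score under the assumed reading of Equation (3)."""
    return p20 ** 2 + (p11 ** 2) / 2 + p02 ** 2


def cluster_by_invariant(patterns, digits=6):
    """Group patterns whose invariant scores agree after rounding."""
    clusters = defaultdict(list)
    for name, coeffs in patterns.items():
        clusters[round(invariant(*coeffs), digits)].append(name)
    return list(clusters.values())


# Hypothetical coefficient triples (p20, p11, p02) for three patterns.
patterns = {
    "NN-IN-PROTEINT": (0.5, 0.2, 0.3),
    "IN-NN-PROTEINT": (0.3, 0.2, 0.5),  # p20 and p02 swapped: same invariant
    "VBD-RB-JJ": (0.1, 0.4, 0.2),
}
clusters = cluster_by_invariant(patterns)
print(clusters)
```

The first two patterns share an invariant score and fall into one cluster, while the third forms its own.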
$$\theta(p_k^T) = e_k^T \times \prod_{k=1}^{l} \frac{p_k^T}{p_{k-1}^T} \tag{5}$$
$$p_k^T = \frac{\#P_k^T}{\sum_{T \in (-1,\,1)} \#P_k^T} \tag{6}$$
where $e_k^T$ = softmax-weighted score for frame width, k, and protein-interaction-class type,
T (i.e., “1” = protein–protein interaction pair and “−1” = not a protein–protein interaction
pair); $\#P_k^T$ = #features of frame width, k, and type T; and x = index size ∈ (1, l = length of
feature $p_k^T$). Joint conditional probability score for feature $p_k^T$ is weighted by category class
softmax score for lexical patterns of frame width, k. This normalized weighing over the
variable feature-size class adds differential representation among local and global features.
Thus, each lead feature is scored to generate a universal lead feature matrix.
$$\sum_{i=1}^{k} \left( 1 - \frac{I(h_k)_i}{I(p_k^T)_i} \right) < k \times n \tag{7}$$
where $I(h_k)_i$ = invariant feature score at index i, $I(p_k^T)_i$ = invariant lead feature score at
index i, and n = length of each n-gram linguistic pattern. In the event that the $h_k$ does not
match any lead feature, then a second-tier comparison is scored, as given in Equation (8):
$$\sum_{i=1}^{k} \left( 1 - \mathrm{sim}(h_k, p_k^T)_i \right) < k \times n \tag{8}$$
where $\mathrm{sim}(h_k, p_k^T)_i$ = index-wise pos-tag feature similarity between $h_k$ and $p_k^T$. Among all
optimized $p_k^T$ values, the one with maximum invariance to $h_k$ is adapted. Corresponding
$p_k^T$ lead matrix scores are substituted to develop training/test classification feature vectors,
as shown in Equation (9):
$$\theta(h_k) = \begin{cases} \theta(p_k^T), & p_k^T \;\; T \in (-1 \vee 1) \\ \dfrac{\sum_{T \in (-1,1)} \theta(p_k^T)}{\#T}, & p_k^T \;\; T \in (-1 \wedge 1) \end{cases} \tag{9}$$
where $\theta(h_k)$ = substituted matrix score for $h_k$, and #T = category-type size, i.e., −1 and 1.
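The substitution rule in Equation (9) can be sketched as a small function: when the matched lead feature belongs to only one class, its score is used directly, and when it occurs in both classes, the two scores are averaged over #T. The function name `substituted_score`, the dictionary layout, and the score values are hypothetical.

```python
import math


def substituted_score(lead_scores):
    """Map the class-wise lead scores of a matched feature to one value.

    `lead_scores` maps a class type T (-1 or 1) to the lead score of the
    matched feature for that class.
    """
    if not lead_scores:
        raise ValueError("the matched lead feature has no class scores")
    # One class present: the single score is returned unchanged.
    # Both classes present: the scores are averaged over the class count.
    return sum(lead_scores.values()) / len(lead_scores)


single = substituted_score({1: 0.8})
both = substituted_score({1: 0.8, -1: 0.2})
print(single, both)
```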
of Interacting Proteins (DIP) to have PPI-specific content in which the interactions were
annotated manually, in addition to 25 abstracts without PPI-specific content. BioInfer
has the maximum number of instances among the five corpora, with 1100 sentences,
as it contains annotations not only for PPI but also for other types of events. This is
achieved through retaining sentences with more than one pair of interacting entities after
querying the PubMed retrieval system with extracted interacting entities in pairs from
the DIP and keeping a random subset of sentences annotated for entities of protein, gene,
and RNA relationships, too. IEPA comprises 486 sentences with a specific pair of
co-occurring chemicals from PubMed abstracts in which the interactions were annotated,
and the majority of the entities were proteins. HPRD50 contained 145 sentences with
annotations and lists of positive/negative PPI and was constructed by taking 50 random
abstracts referenced by the Human Protein Reference Database (HPRD) [27], in which
human proteins and genes were identified by the ProMiner software, while direct physical
interactions, regulatory relations, and modifications were annotated by experts. The LLL
corpus was originally created for the Learning Language in Logic 2005 (LLL05) challenge,
a task focusing on the extraction of protein/gene interactions from biology abstracts in the
Medline bibliography database. It contains three types of Bacillus subtilis gene interactions
and has only 77 sentences, making it the smallest dataset among the five corpora. In
addition, we also evaluated our system using AIdea, a new corpus for PPI detection tasks
that was developed from a collection of PubMed abstracts.
Corpus     Sentences  #Positive  #Negative
HPRD50     145        163        270
LLL        77         164        166
IEPA       486        163        270
AIMed      1955       1000       4834
BioInfer   1100       2534       7132
AIdea      1273       789        2083
Figure 2. Distribution of PPI corpora used for performance evaluation.
4.2. Experimental Setup
In this research, we followed an evaluation setup adapted from the previous PPI
studies, using a 10-fold cross-validation (CV) for each PPI corpus and cross-corpora (CC)
studies between two large corpora, i.e., AIMed and BioInfer [28]. For the CC setup, each
data set was used for training while the other was used for testing. Common evaluation
metrics, i.e., precision (P), recall (R), and F1-score (F1), were used to evaluate the
performance of our model [29]. In addition, feature importance was also studied by using
the XGBoost library. For our experiments, we employed Genia Tagger [23,24] for POS
normalization of raw instances, as described in Section 3.1.1. The sliding window size n = 3
was set to extract n-gram subsequence semantic patterns from candidate instances. Experiments
were also conducted with n = 4, 5, and 6, but n = 3 was finally used owing to its better
performance in previous studies [6]. In Section 3.1.2, n = 3 and s = 1 were set for semantic
feature convolution. Paired bio-entity mentions were normalized as “TRIGGERPRI” [6,28].
The DrugBank corpus [30] was used as a pre-training dataset to develop universal lead features.
Ten-fold cross-validation was performed on the individual PPI corpora. The model uses a
binary logistic objective with XGBoost,
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.plotting,
with learning rate = 0.1 and maximum tree depth = 6.
Baseline experiments were developed for AIMed, BioInfer and AIdea data sets only.
features, as shown in Table 3; we also visualized their performance in Figure 3. The
comparative analysis of the HPRD50 corpus indicates that the use of linguistic features
improves decision-tree-based classification by approximately 29%. Table 3 illustrates a
comparative analysis of the results from our experiments. We developed a Naïve Bayes (NB)
baseline and included SOTA models from PPI studies based on tree kernel and deep learning
for this analysis. For an effective comparison, we limited our analysis to the studies
that explore semantic features for input representation. Among the tree-kernel studies,
graph-based feature kernel by Airola et al. [11] and Murugesan et al. [5] and
linguistic-pattern-tree representations by Chang et al. [14] and Warikoo et al. [6] were
referenced to draw a semantic-feature-based comparative analysis. The analysis shows a 3.64%
improvement in the F1-score when compared with the highest tree-kernel performer, i.e.,
DSTK [5].
We also compared the performance of our model with semantically enriched embeddings
used with deep-learning models. Both Hua et al. [7] and Peng et al. [8] used
semantically enriched embeddings with CNN. Quan et al. [9] and Yadav et al. [21] studied
CNN and Bi-LSTM architecture, using selective dependency graph features with embedding
input. LpGBoost shows 2% and 2.4% improvement in the F1-score for AIMed and
BioInfer, respectively, when compared with the MCCNN, the highest performing CNN-based
model. We also developed a cross-corpus study to further examine the adaptability
of our linguistic-feature representation into other cross-domain entity-association studies.
As shown in Table 4, we compared the best cross-corpus reported performances from each
tree-kernel and deep-learning-based study.
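The evaluation protocol from Section 4.2 can be sketched in plain Python. The fold splitter below is a simplified assumption (contiguous, unshuffled folds); the study does not state how folds were drawn, and practical setups usually shuffle or stratify. The names `k_fold_indices` and `precision_recall_f1` and the label vectors are hypothetical. The `xgb_params` dictionary only restates the reported settings with XGBoost's parameter names and is not passed to any model here.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous test folds of near-equal size."""
    folds, start = [], 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Reported settings, written with XGBoost's parameter names (not used below).
xgb_params = {"objective": "binary:logistic", "learning_rate": 0.1, "max_depth": 6}

# HPRD50 has 163 positive + 270 negative = 433 instances.
folds = k_fold_indices(433, k=10)

# Hypothetical gold labels and predictions (1 = interaction, -1 = none).
y_true = [1, 1, 1, -1, -1, -1, 1, -1]
y_pred = [1, 1, -1, -1, 1, -1, 1, -1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

For the cross-corpus setup, the same metric function would be applied after training on one corpus and predicting on the other.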
Figure 3. Comparative analysis on precision, recall, and F1-score (panels: Precision, Recall,
F1-score; corpora: HPRD50, LLL, IEPA, AIMed, BioInfer, AIdea; models: NB, LpGBoost, GK,
PIPE, LPTK, DSTK, DNN, McDepCNN, MCCNN, sdpCNN, sdpLSTM).
Table 3. XGBoost-based comparative feature analysis on HPRD50 corpus.
Features                                              Precision  Recall  F1-Score
Bag of words (BOW)                                    62.4       50.9    56.0
Term frequency–inverse-document frequency (Tf–IDF)    62.9       49.0    55.1
Hashing features (Hf)                                 63.0       50.3    55.9
Our features (linguistic pattern)                     86.1       84.0    85.0
Table 5. Ten-fold CV comparative performance evaluation on PPI corpora. A one-tailed t-test was
applied to determine whether our proposed model (LpGBoost) significantly improves the
performance of the F1-score of the comparisons, where * represents t-tests with alpha = 0.05.
Our discussion primarily focuses on results from AIMed and BioInfer, as these have
been extensively studied in several related works. The comparative analysis of the SOTA
tree-kernel model (Table 5) shows a 4.55% increase in F1-average. Our approach
outperforms semantically enriched CNN models; however, we do observe a 3.65% drop in
F1-average when compared to sdpLSTM. We also developed a new baseline for the AIdea
corpus, where we achieved 86.2% on the F1-score. The performance showcases a 53.4% increase
when compared to Naïve Bayes. LpGBoost exploits sparse semantic patterns (linguistic
patterns) to develop decision features for gradient-boosted trees. The original instance is
reduced to an entity-pair selective semantic representation; for example, “Activation of lacZ
upon interaction of p85 with IR beta (delta C-43) was 4-fold less as compared to IR beta” is
reduced to {[IN, NN, IN] → [NN, IN, TRIGGERPRI] → [PROTEINT, IN, TRIGGERPRI]
→ [PROTEINT, VBD, RB]} representation. Derived from it, the candidate features are
optimized over the lead feature matrix to generate the classification vector. Table 6 shows
the candidate features for the given example.
The above features combine aspects of both convolved and sequential representations,
making them robust descriptors for a tree-based decision model. The results from our
XGBoost-based feature-comparison study (Table 3) also highlight that the sparse, semantically
relevant representation performs better with gradient-boosted trees in comparison
to elaborate vectors based on term frequency and hashing. A detailed analysis of the
candidate features further revealed the consistency of features f1 and f4 in improving
tree-based decision-making among all the test corpora. Moreover, f1 and f4 represent semantic
patterns immediately adjacent to the target-entity mention, which indicates that shorter
semantic-context-based features are better descriptors in interaction detection studies than
longer sequences. The results in Table 5 show that our model performs better than
deep learning by 1.9% when studied for BioInfer, while the F1-score stagnates at 43.5%
with AIMed testing. Our analysis suggests that selective labeling from interaction pattern
trees used in PIPE can help improve the cross-corpus learning and general adaptability of
linguistic pattern representation.
5. Conclusions
In this paper, we introduced a linguistic-pattern-representation-based boosted decision
tree model to study interaction detection. The results from our extensive tests on six
PPI corpora show that LpGBoost can improve prediction performance in comparison to
SOTA tree-kernel models for protein interaction detection. Our model also outperformed
semantically enriched CNN systems. As a part of this study, we developed linguistic-
pattern-based Universal Lead Features, which were used for feature optimization in all
corpora. These universal features are semi-interpretable, adaptable with other models,
and can be used in cross-domain classification studies. Our study also discovered the
relative significance of positional semantic features in interaction detection studies. Semantic
patterns proximal to the target terms are effective in PPI detection tasks.
In the future, we would like to test the viability of our learning model on other known
interaction classification tasks in the biomedical literature, such as drug–drug interaction
(DDI) and chemical–protein relation (CPR) [10,21]. In addition, we also plan to conduct
feature-enrichment studies by using additional information from dependency graphs to
improve the cross-corpus testing and bring further interpretability to representations.
Author Contributions: Conceptualization, N.W. and Y.-C.C.; methodology, N.W. and Y.-C.C.; vali-
dation, N.W.; formal analysis, N.W.; investigation, N.W. and Y.-C.C.; resources, Y.-C.C. and S.-P.M.;
writing—original draft preparation N.W. and Y.-C.C.; writing—review and editing, Y.-C.C. and
S.-P.M.; visualization, N.W. and Y.-C.C.; supervision, Y.-C.C. and S.-P.M.; project administration,
Y.-C.C. and S.-P.M.; funding acquisition, Y.-C.C. and S.-P.M. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was funded by the National Science and Technology Council of Taiwan,
under grant NSTC 111-2221-E-038-025, and the University System of Taipei Joint Research Program
under grant USTP-NTOU-TMU-109-03.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Krallinger, M.; Vazquez, M.; Leitner, F.; Salgado, D.; Chatr-aryamontri, A.; Winter, A.; Valencia, A. The Protein-Protein Interaction
tasks of BioCreative III: Classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 2011,
12 (Suppl. 8), S3. [CrossRef]
2. Krallinger, M.; Rabal, O.; Akhondi, S.A.; Perez, M.P.; Santamaria, J.; Rodriguez, G.P.; Tsatsaronis, G.; Intxaurrondo, A.; Lopez, J.A.;
Nandal, U.; et al. Overview of the BioCreative VI chemical-protein interaction Track. In Proceedings of the 2017 BioCreative VI
Workshop, Bethesda, MD, USA, 18–20 October 2017.
3. Lung, P.Y.; He, Z.; Zhao, T.; Yu, D.; Zhang, J. Extracting chemical–protein interactions from literature using sentence structure
analysis and feature engineering. Database 2019, 2019, bay138. [CrossRef] [PubMed]
4. Pyysalo, S.; Airola, A.; Heimonen, J.; Björne, J.; Ginter, F.; Salakoski, T. Comparative analysis of five protein-protein interaction
corpora. BMC Bioinform. 2008, 9, S6. [CrossRef] [PubMed]
5. Murugesan, G.; Abdulkadhar, S.; Natarajan, J. Distributed smoothed tree kernel for protein-protein interaction extraction from
the biomedical literature. PLoS ONE 2017, 12, e0187379. [CrossRef] [PubMed]
6. Warikoo, N.; Chang, Y.C.; Hsu, W.L. LPTK: A linguistic pattern-aware dependency tree kernel approach for the BioCreative VI
CHEMPROT task. Database J. Biol. Databases Curation 2018, 2018, bay108. [CrossRef] [PubMed]
7. Hua, L.; Quan, C. A shortest dependency path based convolutional neural network for protein-protein relation extraction. BioMed.
Res. Int. 2016, 2016, 8479587. [CrossRef] [PubMed]
8. Peng, Y.; Lu, Z. Deep learning for extracting protein-protein interactions from biomedical literature. In Proceedings of the 2017
Workshop on Biomedical Natural Language Processing, Vancouver, BC, Canada, 4 August 2017; pp. 29–38.
9. Quan, C.; Hua, L.; Sun, X.; Bai, W. Multichannel convolutional neural network for biological relation extraction. BioMed Res. Int.
2016, 2016, 1850404. [CrossRef] [PubMed]
10. Stenetorp, P.; Topi, G.; Pyysalo, S.; Ohta, T.; Kim, J.D.; Tsujii, J. BioNLP Shared Task 2011: Supporting Resources. In Proceedings of the
BioNLP Shared Task 2011 Workshop Companion Volume for Shared Task; Association for Computational Linguistics (ACL): Stroudsburg, PA,
USA, 2011.
11. Airola, A.; Pyysalo, S.; Björne, J. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus
learning. BMC Bioinform. 2008, 9, S2. [CrossRef] [PubMed]
12. Van Landeghem, S.; Saeys, Y.; Van de Peer, Y.; De Baets, B. Extracting protein-protein interactions from text using rich feature vectors and
feature selection. In Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM) 2008, Turku,
Finland, 1–3 September 2008.
13. Erkan, G.; Özgür, A.; Radev, D.R. Semi-Supervised Classification for Extracting Protein Interaction Sentences using Dependency
Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing-Conference on Computational
Natural Language Learning (EMNLP-CoNLL) 2007, Prague, Czech Republic, 28–30 June 2007; pp. 228–237.
14. Satre, R.; Sagae, K.; Tsujii, J. Syntactic features for protein-protein interaction extraction. BMC Bioinform. 2007, 2016, 246. [CrossRef]
15. Chang, Y.C.; Chu, C.H.; Su, Y.C.; Chen, C.C.; Hsu, W.L. PIPE: A protein–protein interaction passage extraction module for
BioCreative challenge. Database J. Biol. Databases Curation 2016, 2016, baw101. [CrossRef] [PubMed]
16. Yadav, S.; Kumar, A.; Ekbal, A.; Saha, S.; Bhattacharyya, P. Feature Assisted bi-directional LSTM Model for Protein-Protein
Interaction Identification from Biomedical Texts. arXiv 2018, arXiv:1807.02162.
17. Su, P.; Vijay-Shanker, K. Investigation of BERT Model on Biomedical Relation Extraction Based on Revised Fine-tuning Mechanism.
In Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Korea, 16–19
December 2020; pp. 2522–2529. [CrossRef]
18. Su, P.; Vijay-Shanker, K. Investigation of improving the pre-training and fine-tuning of BERT model for biomedical relation
extraction. BMC Bioinform. 2022, 23, 120. [CrossRef] [PubMed]
19. Su, P.; Peng, Y.; Vijay-Shanker, K. Improving BERT Model Using Contrastive Learning for Biomedical Relation Extraction. arXiv
2021, arXiv:2104.13913.
20. Warikoo, N.; Chang, Y.-C.; Hsu, W.-L. LBERT: Lexically aware Transformer-based Bidirectional Encoder Representation model for
learning universal bio-entity relations. Bioinformatics 2021, 37, 404–412. [CrossRef] [PubMed]
21. Wu, S.; He, Y. Enriching Pre-trained Language Model with Entity Information for Relation Classification. arXiv 2019, arXiv:1905.08284.
22. Dickson, L.E. Algebraic Invariants; Mathematical Monographs, No. 14; John Wiley: New York, NY, USA, 1914.
23. Tsuruoka, Y.; Tateishi, Y.; Kim, J.D.; Ohta, T.; McNaught, J.; Ananiadou, S.; Tsujii, J. Developing a Robust Part-of-Speech Tagger for
Biomedical Text. In Advances in Informatics. PCI 2005. Lecture Notes in Computer Science; Bozanis, P., Houstis, E.N., Eds.; Springer:
Berlin/Heidelberg, Germany, 2005; Volume 3746.
24. Tsuruoka, Y.; Tsujii, J. Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. In Proceedings of the
Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT ’05). Association for
Computational Linguistics, Stroudsburg, PA, USA, 6–8 October 2005; pp. 467–474. [CrossRef]
25. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural Language Processing (Almost) from Scratch.
J. Mach. Learn. Res. 2011, 12, 2493–2537.
26. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD ’16), San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
[CrossRef]
27. Fundel, K.; Küffner, R.; Zimmer, R. RelEx—Relation extraction using dependency parse trees. Bioinformatics 2007, 23, 365–371.
[CrossRef] [PubMed]
28. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the
25th International Conference on Neural Information Processing Systems—Volume 1 (NIPS’12); Pereira, F., Burges, C.J.C., Bottou, L.,
Weinberger, K.Q., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2012; Volume 1, pp. 1097–1105.
29. Manning, C.D.; Schütze, H. Foundations of Statistical Natural Language Processing, 1st ed.; MIT Press: Cambridge, MA, USA, 1999.
30. Segura Bedmar, I.; Martínez, P.; Herrero Zazo, M. SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts
(DDIExtraction 2013); Association for Computational Linguistics: Stroudsburg, PA, USA, 2013.