
2010 IEEE International Conference on Bioinformatics and Biomedicine

Protein-protein Interaction Prediction via Collective Matrix Factorization

Qian Xu (Bioengineering Program, HKUST, Hong Kong, China; fleurxq@ust.hk)
Evan Wei Xiang (Dept. of Computer Science and Engineering, HKUST, Hong Kong, China; wxiang@cse.ust.hk)
Qiang Yang (Dept. of Computer Science and Engineering, HKUST, Hong Kong, China; qyang@cse.ust.hk)

Abstract—Protein-protein interactions (PPI) play an important role in cellular processes and metabolic processes within a cell. An important task is to determine the existence of interactions among proteins. Unfortunately, existing biological experimental techniques are expensive, time-consuming and labor-intensive. The network structures of many such networks are sparse, incomplete and noisy, containing many false positives and false negatives. Thus, state-of-the-art methods for link prediction in these networks often cannot give satisfactory prediction results, especially when some networks are extremely sparse. Noticing that we typically have more than one PPI network available, we naturally wonder whether it is possible to 'transfer' the linkage knowledge from some existing, relatively dense networks to a sparse network, to improve the prediction performance. Since a network structure can be modeled as a matrix, in this paper we introduce the well-known Collective Matrix Factorization (CMF) technique to 'transfer' usable linkage knowledge from a relatively dense interaction network to a sparse target network. Our approach is to establish the correspondence between a source and a target network via network similarities. We test this method on two real protein-protein interaction networks, Helicobacter pylori (as a target network) and Human (as a source network). Our experimental results show that our method can achieve higher and more robust performance as compared to several baseline methods.

Keywords—protein-protein interactions; transfer learning; Collective Matrix Factorization

I. INTRODUCTION

Protein-protein interactions (PPIs) can reveal insights into biological regulatory pathways and metabolic processes. A complete and reliable protein interaction map provides us with an opportunity to understand the basic biological processes within a cell. Global interaction patterns among proteins, for example, can suggest new drug targets and aid the design of new drugs by providing a clear picture of the biological pathways in the neighborhoods of the drug targets [1]. Therefore, considerable attention has been paid to the problem of protein-protein interaction prediction [2].

In biological research, various experimental approaches have been developed to map the interactions among proteins, such as mass spectrometry, yeast two-hybrid, tandem affinity purification and co-immunoprecipitation. As a result, large-scale maps of PPIs have become available. Unfortunately, the interactions detected among proteins are largely incomplete, because wet-lab experiments are time-consuming and labor-intensive, and the number of known interactions is small relative to the number of possible protein-protein interactions. Take the proteins encoded by the human genome as an example: there are about 312 million possible interactions among 25,000 proteins [1], but the actual known interactions are far fewer than this number. Moreover, the currently known interactions are noisy in that there exist high false positive and false negative rates [3]. As a result, the study of protein-protein interactions is still a very challenging topic. To address this problem, researchers have investigated a large number of computational methods, including many from the machine learning area, in order to make accurate and robust link predictions.

A main issue with the link prediction problem is the sparsity of PPI networks. Traditional supervised classification methods may fail since we cannot accumulate enough training data to build a stable and effective classifier. When the network is sparse, overfitting can easily happen, which causes significant performance degradation. To solve this problem, our approach is to invoke the knowledge we have about network linkage in other related PPI networks and transfer the shared knowledge to a sparse target network. In this 'transfer learning' setting, a relatively dense interaction network is assumed to exist as the source network, and the objective is to infer the PPI network links in a relatively sparse target network. Noticing that the network structure can be captured by a matrix model, we exploit the Collective Matrix Factorization (CMF) method [4], using similarities of proteins between two interaction networks as the correspondence knowledge. We show that when the source matrix is sufficiently dense and similar to the target PPI network, transfer learning is effective for predicting protein-protein interactions in a sparse network. In this method, similarities between protein entities are computed by taking both the protein sequences and the topological structures of the interaction networks into account. To evaluate our method, the Human protein interaction network, which is relatively dense, is used as the source network, and the Helicobacter pylori protein interaction network, which is very sparse, is used as the target network. Our experimental results demonstrate that our proposed method indeed leads to significant performance improvement of protein interaction prediction when applied to real-world protein-protein interaction datasets.



II. RELATED WORK

Due to the importance of understanding protein-protein interactions, a large number of computational methods have been developed. Among these methods, supervised learning is a dominant approach. State-of-the-art supervised learning methods include K-nearest neighbor (KNN), support vector machines (SVMs), random forests and so on. Supervised learning aims at training a classifier using positive examples of truly interacting protein pairs and negative examples of non-interacting protein pairs, in order to predict an unobserved relationship between two proteins. Each protein pair is encoded as a feature vector in the data. Thus, much effort has been spent on developing informative and effective feature representation methods for PPI prediction. Feature vectors may be extracted from protein sequences directly or may involve indirect evidence, including domain compositions, motif pairs and related mRNA expression [5], [6], [7], [8], [9], [10], among others. Bock and Gough [13] used an SVM method based on amino acid compositions and physicochemical descriptors. Urquiza et al. [3] extracted 26 genomic or proteomic features of yeast from diverse databases for each pair, such as information on protein domains, domain-domain interactions in proteins whose 3D structures are known, and high-quality gene ontology annotations. Espadaler et al. [14] considered protein structural similarities among domains found in databases of interacting proteins, combined with the conservation of pairs of sequence patches, based on the observation that interacting pairs of close homologs usually physically interact in the same way. Several methods infer protein interactions based on the conservation of gene neighborhood, conservation of gene order, gene fusion events, and the co-evolution of interacting protein pair sequences [15], [16]. Qi et al. [18] split indirect features into roughly homogeneous sets of feature experts, employed logistic regression to estimate prediction values for each expert, and combined the prediction results of the individual experts. Comprehensive reviews of these methods can be found in [19], [20].

There has also been work on exploiting useful knowledge of protein interaction networks across organisms. An intuitive idea is to use the interaction map of one organism as a template to predict interactions in another [1]. Wojcik and Schachter [21] applied the 'Interaction Domain Profile Pairs' (IDPP) method to the protein interaction datasets of Escherichia coli and Helicobacter pylori. They first converted the source dataset, Helicobacter pylori, into an abstract interaction map linking clusters of interaction domains. They then inferred unobserved interactions by establishing the correspondence between this abstract map and the target Escherichia coli proteome. Unfortunately, the IDPP method requires a high-quality protein interaction map to exist. However, in the real world, protein interaction datasets are often sparse, incomplete and noisy, which motivates our research.

In machine learning, researchers have begun to apply matrix factorization based methods for transfer learning. An important application is recommendation systems, where collaborative filtering is modeled using matrix factorization. Similar to our motivation, they also face the problem of network sparsity. To solve this problem, various matrix based transfer learning methods have been developed [22], [23], [24]. A main difference from our work is that these works aim at improving the prediction of the 'rating' of a product by a user. In our case, we are interested in predicting links between nodes, which correspond to binary connectivity, and the proteins have different semantics from users and products.

III. METHODS

Our objective is to predict the interaction links between nodes in a protein-protein interaction network G_t, where the subscript t stands for 'target.' To apply a supervised learning method, we first train a classification model based on known protein pairs, each of which is represented by a feature vector v = (x_1^v, x_2^v, ..., x_n^v). We then make predictions on unobserved interaction links among protein pairs. As mentioned above, we will apply Collective Matrix Factorization (CMF) [4] and exploit a known and relatively dense PPI network to help improve the target-domain prediction.
We first introduce matrix factorization (MF) [25] methods for an individual network. Matrix factorization is increasingly popular for link prediction in many domains, including social networks. We consider a network G = (V, E), where there are m = |V| nodes. The links can be represented by an m × n matrix X (for a single network, n = m). Our aim is to seek a low-rank approximation of X in the form X ≈ f(U V^T). In this way, the observed links in the matrix X are approximated by a product of two low-rank matrices, U ∈ R^{m×k} and V ∈ R^{n×k}, where k > 0 is the rank, and f is a possibly nonlinear link function. The information carried by the adjacency matrix X is encapsulated in the two factor matrices, and thus the missing values in X can also be recovered by X̂ ≈ f(U V^T). The goal of the matrix factorization process is to seek a factor matrix pair (U, V) minimizing the error measure ||X − U V^T||. Different choices of the function f and of the error measure result in different models. Unfortunately, X may not be factorized into U and V successfully when X is too sparse, since the learned factors U and V might be biased towards the few observed entries in the sparse X, causing overfitting.
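As an illustration of this single-network setting, the following is a minimal sketch of matrix factorization with a logistic link function, trained by gradient descent on the observed entries only; the function name, step size and regularization weight are our own choices and are not taken from the paper.

```python
import numpy as np

def mf_logistic(X, mask, k=10, lr=0.05, reg=0.01, iters=500, seed=0):
    """Factorize a binary link matrix X ~ sigmoid(U V^T) on observed entries.

    X    : (m, n) array of 0/1 link indicators
    mask : (m, n) array, 1 where the entry is observed
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))   # predicted link probabilities
        E = mask * (P - X)                     # logistic-loss gradient on observed cells
        U -= lr * (E @ V + reg * U)
        V -= lr * (E.T @ U + reg * V)
    return U, V

# Usage: recover missing links from the learned factors.
# X_hat = 1 / (1 + np.exp(-(U @ V.T)))
```

When X is very sparse, the few observed entries dominate the gradient, which is exactly the overfitting problem described above.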
In this paper, we consider two protein interaction networks G and P. G is our target protein interaction network, and P, which is relatively dense, is our source network. We will try to improve the performance of predicting links in G with the aid of P.

The links of networks G and P are represented by an m × m matrix L_{m×m} and an n × n matrix L_{n×n}, respectively. The rows and columns of L_{m×m} and L_{n×n} correspond to the protein entities in networks G and P, respectively. The elements of both matrices indicate the existence of interaction links.

Now we can combine the two matrices L_{m×m} and L_{n×n} to form a big matrix X^t of size (m + n) × (m + n):

X^t = [ L_{m×m}    0       ]
      [ 0          L_{n×n} ]

However, factorizing X^t is actually equivalent to factorizing L_{m×m} and L_{n×n} individually. Because we are not aware of the correspondence between the nodes in G and P, the information carried by the dense matrix L_{n×n} cannot be transferred to help with the factorization of the sparse matrix L_{m×m}. Thus, we still need some other information to serve as a bridge between G and P. This bridge is the similarity between the two networks.

Consider a similarity matrix S_{m×n} introduced as the correspondence between networks G and P. The rows and columns of S_{m×n} correspond to the proteins in networks G and P, respectively, and the element S_{ij} of S_{m×n} represents the similarity between node i in network G and node j in network P. The collective matrix factorization method reconstructs the matrices X^t ≈ f_1(Z V^T) and X^a ≈ f_2(U V^T) together by sharing the common factor V. The objective of collective matrix factorization is then to minimize the regularized loss:

L(X^t, X^a, U, V, Z) = D(X^t || Z V^T) + λs D(X^a || U V^T) + λU ||U||_F^2 + λV ||V||_F^2 + λZ ||Z||_F^2,   (1)

where

X^t = [ L_{m×m}    0       ]
      [ 0          L_{n×n} ]

and

X^a = [ 0            S_{m×n} ]
      [ S_{m×n}^T    0       ]

In Equation 1, the first and second terms are loss functions that measure the distance between each original matrix and its factorized reconstruction. The remaining terms are regularization terms used to prevent overfitting. The parameter λs controls the weight of the source network term relative to the target network term, while λU, λV and λZ control the model complexity. In our experimental setting, we focus on predicting the existence of links; therefore, a logistic loss function is adopted when computing D. Note that there are two underlying assumptions in our collective matrix factorization process: 1) the latent factors of two nodes in one network are similar if there exists a link between them, and 2) the latent factors of two nodes in different networks are similar if the similarity between them is high.
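To make the shared-factor idea concrete, here is a minimal sketch of collective matrix factorization with a logistic link and the loss of Equation 1, optimized by plain gradient descent. It is an illustration under our own choice of names, learning rate and stopping rule, not the authors' implementation.

```python
import numpy as np

def sigmoid(M):
    return 1.0 / (1.0 + np.exp(-M))

def cmf(Xt, Xa, mask_t, mask_a, k=10, lam_s=0.4, lam_reg=5.0,
        lr=0.01, iters=1000, seed=0):
    """Jointly factorize X^t ~ sigmoid(Z V^T) and X^a ~ sigmoid(U V^T),
    sharing the factor V, with logistic loss on observed entries."""
    rng = np.random.default_rng(seed)
    n_all = Xt.shape[0]                      # both matrices are (m+n) x (m+n)
    Z = 0.1 * rng.standard_normal((n_all, k))
    U = 0.1 * rng.standard_normal((n_all, k))
    V = 0.1 * rng.standard_normal((n_all, k))
    for _ in range(iters):
        Et = mask_t * (sigmoid(Z @ V.T) - Xt)        # residual on target links
        Ea = mask_a * (sigmoid(U @ V.T) - Xa)        # residual on similarity block matrix
        gZ = Et @ V + lam_reg * Z
        gU = lam_s * (Ea @ V) + lam_reg * U
        gV = Et.T @ Z + lam_s * (Ea.T @ U) + lam_reg * V
        Z -= lr * gZ
        U -= lr * gU
        V -= lr * gV
    return Z, U, V

# Predicted target links: sigmoid(Z @ V.T); unobserved cells of X^t are
# then ranked by this score.
```

Sharing V ties the reconstruction of the target link matrix to the reconstruction of the cross-network similarity matrix, which is how linkage knowledge flows from the dense source network to the sparse target network.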
We exploit the IsoRank [26] method to construct our similarity matrix S_{m×n}; IsoRank can be thought of as a particular spectral method for global graph alignment. It is based on the assumption that a protein in one PPI network is a good match for a protein in another network if their respective sequences and neighborhood topologies are a good match. Thus, IsoRank can give a good matching of protein nodes between two PPI networks by simultaneously considering node similarities and network similarities. The main idea of the IsoRank algorithm can be formalized as follows:

R(i, j) = Σ_{v ∈ N(i)} Σ_{u ∈ N(j)} [1 / (|N(u)| |N(v)|)] R(u, v),   i ∈ V_S, j ∈ V_T,   (2)

where N(i) denotes the set of neighbors of i, V_S denotes the set of vertices of graph S, and the element R(i, j) represents the similarity between a vertex i of graph S and a vertex j of graph T. In the case of PPI networks, R(i, j) represents the similarity between proteins i and j. The intuition behind this recursive formula is that the more similar the neighbors of i and j are, the greater the similarity measure between i and j will be.

Equation 2 can be rewritten from a matrix perspective:

R = A R,   (3)

where A is the N^2 × N^2 matrix defined as:

A(i, j)(u, v) = 1 / (|N(u)| |N(v)|)   if (i, u) ∈ E_S and (j, v) ∈ E_T,
A(i, j)(u, v) = 0                     otherwise,

where E_S and E_T denote the edge sets of graphs S and T, respectively. To estimate R, we observe that Equation 3 is an eigenvalue problem, where R is the principal eigenvector of A. A is a stochastic matrix, so the principal eigenvalue is 1. In our case, A and R are both sparse, although A is typically a very large matrix. Singh et al. [26] propose to update R efficiently using the power method:

R(k + 1) ← A R(k) / |A R(k)|.   (4)

To use other information to improve our prediction, we can also take protein sequence similarities into account. The eigenvalue equation 3 can then be rewritten as a convex combination of network and sequence similarity scores, which can be solved by techniques similar to those for Equation 3:

R = αAR + (1 − α)C,   (5)

In this equation, C is a normalized score matrix generated by a pairwise BLAST alignment method, and α ∈ [0, 1] controls the trade-off between the two objectives; e.g., α = 0 implies that no network data will be considered, whereas α = 1 indicates that only network data will be used. Tuning α allows us to find the optimal alignment.
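For illustration, the following is a small sketch of the power iteration corresponding to Equations 3-5, computing the cross-network similarity scores R from a sparse matrix A and a normalized sequence-similarity vector C. The function and parameter names are ours, and the construction of A is assumed to have been done elsewhere.

```python
import numpy as np

def isorank_scores(A, C, alpha=0.6, iters=100, tol=1e-9):
    """Power iteration for R = alpha * A R + (1 - alpha) * C (Eqs. 3-5).

    A : (nS*nT, nS*nT) array or scipy sparse matrix with
        A[(i,j),(u,v)] = 1/(|N(u)||N(v)|) when (i,u) is an edge of S
        and (j,v) is an edge of T, and 0 otherwise
    C : (nS*nT,) nonnegative sequence-similarity scores, flattened over (i, j)
    Returns the flattened similarity vector R.
    """
    C = C / C.sum()                        # normalize the sequence scores
    R = np.full(C.shape, 1.0 / C.size)     # uniform starting vector
    for _ in range(iters):
        R_new = alpha * (A @ R) + (1.0 - alpha) * C
        R_new = R_new / np.abs(R_new).sum()    # keep the iterate normalized
        if np.abs(R_new - R).sum() < tol:      # stop when the scores converge
            R = R_new
            break
        R = R_new
    return R
```

Reshaped to an m × n matrix over the protein pairs of the two networks, the converged R plays the role of the correspondence matrix S_{m×n} in Equation 1.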
For the pairwise alignment method, we adopt the well-known Smith-Waterman algorithm [27], which is based on dynamic programming and is implemented in a Matlab toolbox. The Smith-Waterman algorithm is a conventional local pairwise alignment method [27]: instead of aligning the total sequences, it attempts to align segments of all possible lengths and optimizes the similarity measure between two protein sequences. When applying the Smith-Waterman algorithm in the Matlab toolbox, we set BLOSUM50 as the scoring matrix and 8 as both the gap-open and gap-extension penalties. We input two proteins as a query and obtain a pairwise score reflecting their similarity. As a result, we get a score matrix C, each element of which indicates the similarity of two proteins.
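As a rough illustration of local alignment scoring, here is a compact Smith-Waterman dynamic-programming sketch. It uses a simple match/mismatch substitution score and a constant gap penalty rather than BLOSUM50 and the affine gap settings mentioned above, so the absolute scores will differ from the Matlab toolbox output.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-8):
    """Return the best local-alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]     # H[i][j] = best score of a local
    best = 0                                  # alignment ending at a[:i], b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                   # local alignment: never below 0
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# Example: pairwise score between two short peptide sequences.
# smith_waterman_score("MKLVAG", "MKIVAG")
```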
IV. MATERIALS AND RESULTS

A. Benchmark datasets

In this work, we used the Helicobacter pylori dataset as the target dataset and the Human dataset as the source dataset, both of which have also been used in [28], [29], [30], [31], [12]. The target Helicobacter pylori dataset consists of 1,458 positives (interacting pairs) and 1,458 negatives (non-interacting pairs). The Human dataset consists of 941 positive samples (interacting pairs) and 941 negative samples (non-interacting pairs).

The network densities of these datasets are given below:
• The Helicobacter pylori dataset: 0.12%
• The Human dataset: 0.27%

We randomly sample 9/10, 5/10, 1/3, 2/10 and 1/10 of the instances of the Helicobacter pylori dataset, respectively, in order to test the performance of the approach under different target network sparsities. Under these settings, the density of the target network becomes 0.12% × f, where f = 9/10, 5/10, 1/3, 2/10, 1/10, respectively. When varying the target network density, the source Human dataset is kept unchanged. We repeat each experiment for ten trials and report the average results.
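As a small illustration of this protocol, the sketch below keeps a random fraction f of the observed Helicobacter pylori pairs for training and holds out the rest; the names and the use of a plain index permutation are our own choices.

```python
import numpy as np

def subsample_links(pairs, f, seed=0):
    """Keep a random fraction f of the observed interaction pairs.

    pairs : list of (protein_i, protein_j, label) tuples
    Returns (train_pairs, heldout_pairs); repeating this over several seeds
    gives the averaged results reported for each sampling ratio.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    n_keep = int(round(f * len(pairs)))
    train = [pairs[i] for i in idx[:n_keep]]
    heldout = [pairs[i] for i in idx[n_keep:]]
    return train, heldout

# Example: a 1/10 sample reduces the 0.12% target density to about 0.012%.
# train, heldout = subsample_links(hp_pairs, f=1/10)   # hp_pairs: hypothetical input list
```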
B. Performance measurement

The area under the receiver operating characteristic (ROC) curve (AUC) is the statistic used as our performance measure. ROC plots, which have been used increasingly in the machine learning and data mining community [32], plot the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 − specificity), where

TPR = True positives / (True positives + False negatives)
FPR = False positives / (False positives + True negatives)
Specificity = 1 − FPR
Sensitivity = TPR

TPR and FPR depend on the classifier function h and the threshold θ used to convert h(x) into a binary prediction. Varying the threshold θ traces out the ROC curve. The area under the curve (AUC) indicates the performance of the classifier: the larger, the better [33].
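As an illustration of this measure, the following sketch computes the ROC points by sweeping the threshold over the classifier scores and integrates them with the trapezoidal rule; the function name and inputs are our own.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC from real-valued scores h(x) and binary labels (1 = interacting)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)              # sweep threshold from high to low
    labels = labels[order]
    tps = np.cumsum(labels)                  # true positives at each threshold
    fps = np.cumsum(1 - labels)              # false positives at each threshold
    tpr = np.concatenate(([0.0], tps / max(labels.sum(), 1)))
    fpr = np.concatenate(([0.0], fps / max((1 - labels).sum(), 1)))
    # trapezoidal integration of TPR over FPR
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# Example: roc_auc([0.9, 0.2, 0.75, 0.4], [1, 0, 1, 0]) -> 1.0
```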
C. Results

To evaluate our proposed method, we chose four baseline methods. We describe these methods in detail below and explain the reasons for choosing these classifiers as baselines.

The first baseline method is low-rank matrix factorization on the single target network, which is a special case of Equation 1 with λs = 0 and λU = 0. Comparing collective matrix factorization with single matrix factorization can help illustrate whether transferring knowledge from a dense interaction network to a sparse interaction network can indeed help improve the performance of inferring unobserved interactions. We use MF to denote this baseline method.

The second baseline method adopts the support vector machine (SVM) classifier, a state-of-the-art classification method for protein-protein interaction prediction. SVM requires the input samples to be represented by feature vectors with their corresponding labels. In our implementation, 2-gram amino acid compositions are extracted as feature vectors for protein instances to build the SVM model. This method is denoted by SVM2Gram.

Our third baseline, an ensemble of k-local hyperplanes, was proposed by Nanni and Lumini [12]. This method trains an ensemble of K-local hyperplane distance nearest neighbor classifiers on the same datasets as ours, first generating feature vectors from different physicochemical properties of the amino acids, which combine the amino acid indices with 2-gram amino acid compositions.

We chose a hybrid method introduced in [11] as our last baseline. It first encodes each protein pair as the sum and the absolute difference of the PseAA composition vectors of the two proteins, and then applies a hybrid feature selection system, mRMR-KNNs-wrapper, to obtain an optimized feature set by excluding poorly performing and redundant features. Based on the optimized feature subset, a prediction model is trained and tested with the k-nearest neighbors learning system. In the following, this method is denoted hybridFSKNN when feature selection is applied and non-hybridFSKNN when it is not.
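To make the SVM2Gram representation concrete, here is a small sketch of a 2-gram amino acid composition feature vector (the 400 normalized dipeptide frequencies of a protein sequence). This is our own illustrative reading of the description above, not the authors' code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 2-grams

def two_gram_composition(sequence):
    """Return the 400-dimensional normalized 2-gram composition of a protein."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        gram = sequence[i:i + 2].upper()
        if gram in counts:                 # skip ambiguous residues such as 'X'
            counts[gram] += 1
            total += 1
    return [counts[g] / total if total else 0.0 for g in DIPEPTIDES]

# A protein pair can then be encoded, e.g., by concatenating the two vectors
# before training the SVM classifier.
```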

In the experimental setting, we chose the parameters achieving the best results for the baseline methods. More specifically, λU = λV = λZ = 5 was set for CMF; an RBF kernel with γ = 0.0004 was chosen for the SVM classifier; and K = 5 was used for the KNN classifiers in the third and fourth baseline methods, leading to their best results. Note that we report the performance of the fourth baseline both with and without the hybrid feature selection procedure. The performances of the baseline methods and our proposed method are compared in Table I.

As shown in Table I, our suggested method, Collective Matrix Factorization (CMF), which transfers knowledge from an auxiliary Human interaction network to a sparse target Helicobacter pylori interaction network, achieves a 5-7% improvement over the best baseline results under different parameter settings. To build a correspondence between the auxiliary network and the target network, node similarities involving both protein sequences and network structures are computed using the graph matching method IsoRank. Most significantly, even when the density of the target Helicobacter pylori network is reduced to 0.012%, we can still get promising results using our CMF method. This result illustrates the power of our approach in solving the network sparsity problem in PPI prediction.

V. CONCLUSION

In this paper, we proposed a Collective Matrix Factorization (CMF) solution to the network sparsity problem in PPI prediction. CMF is an extension of classical matrix factorization, which we exploited under a novel transfer learning framework. Our aim is to infer interactions in a target protein-protein interaction network by transferring network connectivity knowledge from a source network with the help of a similarity matrix. CMF achieves significant performance improvement through the correspondence between the source network and the target network. To compute the similarity of the two networks, IsoRank is used to recursively compute the similarities of protein entities between the two interaction networks by considering both protein sequences and the topological structures of the interaction networks simultaneously. In our experiments, we use the Helicobacter pylori interaction dataset, which is sparse, as the target network, and the Human interaction dataset as the relatively dense source network. The experimental results show that the performance of uncovering interaction links in the target network can indeed be boosted by transferring useful knowledge from the source network.

In the future, we will involve multiple source networks to aid link prediction in the target network, and we will investigate methods to improve the computation of the similarity between networks.

ACKNOWLEDGMENT

We thank the support of N HKUST 624/09. We thank Nathan Nan Liu for discussion.

Table I
PERFORMANCE COMPARISON (AUC %)
• 1 - sample sizes of Helicobacter pylori for training
• Our CMF based Methods:
• 3 - CMF with λs = 0.2 and λU = λV = λZ = 5
• 4 - CMF with λs = 0.4 and λU = λV = λZ = 5
• 5 - CMF with λs = 0.6 and λU = λV = λZ = 5
• 6 - CMF with λs = 0.8 and λU = λV = λZ = 5
• Baselines:
• 2 - MF
• 7 - SVM2Gram
• 8 - Ensemble of k-local hyperplanes
• 9 - non-hybridFSKNN
• 10 - hybridFSKNN (number of selected features in parentheses)
1 2 3 4 5 6 7 8 9 10
1/10 80.81±0.0188 82.63±0.0149 82.50±0.0156 82.05±0.0161 76.55±0.0182 80.95 73.80 72.90 76.00 (82)
1/5 86.04±0.0126 89.27±0.0128 89.39±0.0181 87.27±0.0071 83.32±0.0118 83.36 80.10 76.30 75.90 (160)
1/3 88.20±0.0063 89.62±0.0085 90.93±0.0092 90.48±0.0090 88.35±0.0067 85.29 81.60 80.90 81.60 (151)
1/2 89.12±0.0112 90.32±0.0083 91.33±0.0059 91.97±0.0060 91.39±0.0109 86.62 85.40 84.50 85.40 (182)
9/10 89.31±0.0139 92.42±0.0154 94.47±0.0173 96.72±0.0192 95.41±0.0115 88.84 82.40 83.70 82.40 (162)

REFERENCES

[1] J. Yu and F. Fotouhi, "Computational approaches for predicting protein-protein interactions: a survey," Journal of Medical Systems, vol. 30, no. 1, pp. 39–44, 2006.

[2] R. Mrowka, A. Patzak, and H. Herzel, "Is there a bias in proteome research?" Genome Research, vol. 11, no. 12, pp. 1971–1973, 2001.

[3] J. M. Urquiza, I. Rojas, H. Pomares, J. P. Florido, G. Rubio, L. J. Herrera, J. C. Calvo, and J. Ortega, "Method for prediction of protein-protein interactions in yeast using genomics/proteomics information and feature selection," in Proceedings of the 10th International Work-Conference on Artificial Neural Networks, Part I: Bio-Inspired Systems: Computational and Ambient Intelligence, 2009, pp. 853–860.

[4] A. P. Singh and G. J. Gordon, "Relational learning via collective matrix factorization," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 650–658.

[5] S. Gomez, W. Noble, and A. Rzhetsky, "Learning to predict protein-protein interactions from protein sequences," Bioinformatics, vol. 19, no. 15, pp. 1875–1881, 2003.

[6] M. Deng, S. Mehta, F. Sun, and T. Chen, "Inferring domain-domain interactions from protein-protein interactions," Genome Research, vol. 12, no. 10, pp. 1540–1548, 2002.

[7] H. Wang, E. Segal, A. Ben-Hur, D. Koller, and D. Brutlag, "Identifying protein-protein interaction sites on a genome-wide scale," in Advances in Neural Information Processing Systems, vol. 17, pp. 1465–1472, 2005.

[8] H. Wang, E. Segal, A. Ben-Hur, Q. Li, M. Vidal, and D. Koller, "InSite: a computational method for identifying protein-protein interaction binding sites on a proteome-wide scale," Genome Biology, vol. 8, no. 9, pp. 1–18, 2007.

[9] M. Li, L. Lin, X. Wang, and T. Liu, "Protein-protein interaction site prediction based on conditional random fields," Bioinformatics, vol. 23, pp. 597–604, 2007.

[10] S. Wu and Y. Zhang, "A comprehensive assessment of sequence-based and template-based methods for protein contact prediction," Bioinformatics, vol. 24, pp. 924–931, 2008.

[11] L. Liu, Y. Cai, W. Lu, K. Feng, C. Peng, and B. Niu, "Prediction of protein-protein interactions based on PseAA composition and hybrid feature selection," Biochemical and Biophysical Research Communications, vol. 380, no. 2, pp. 318–322, 2009.

[12] L. Nanni and A. Lumini, "An ensemble of k-local hyperplanes for predicting protein-protein interactions," Bioinformatics, vol. 22, no. 10, pp. 1207–1210, 2006.

[13] J. Bock and D. Gough, "Predicting protein-protein interactions from primary structure," Bioinformatics, vol. 17, pp. 455–460, 2001.

[14] J. Espadaler, O. Romero-Isart, R. M. Jackson, and B. Oliva, "Prediction of protein-protein interactions using distant conservation of sequence patterns and structure relationships," Bioinformatics, vol. 21, no. 16, pp. 3360–3368, 2005.

[15] B. Shoemaker and A. Panchenko, "Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners," PLoS Computational Biology, vol. 3, no. 4, p. e43, 2007.

[16] C. von Mering, R. Krause, B. Snel, M. Cornell, S. Oliver, S. Fields, and P. Bork, "Comparative assessment of large-scale data sets of protein-protein interactions," Nature, vol. 417, no. 6887, pp. 399–403, 2002.

[17] A. Ben-Hur and W. S. Noble, "Kernel methods for predicting protein-protein interactions," Bioinformatics (Proceedings of the Intelligent Systems for Molecular Biology Conference), vol. 21, suppl. 1, pp. 38–46, 2005.

[18] Y. Qi, J. Klein-Seetharaman, and Z. Bar-Joseph, "A mixture of feature experts approach for protein-protein interaction prediction," BMC Bioinformatics, vol. 8, suppl. 10, p. 6, 2007.

[19] A. Arkin, "Synthetic cell biology," Current Opinion in Biotechnology, vol. 12, no. 6, pp. 638–644, 2001.

[20] B. Shoemaker and A. Panchenko, "Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners," PLoS Computational Biology, vol. 3, no. 4, pp. 595–601, 2007.

[21] J. Wojcik and V. Schächter, "Protein-protein interaction map inference using interacting domain profile pairs," Bioinformatics, vol. 17, suppl. 1, pp. 296–305, 2001.

[22] B. Cao, N. Liu, and Q. Yang, "Transfer learning for collective link prediction in multiple heterogeneous domains," in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, June 2010.

[23] B. Li, Q. Yang, and X. Xue, "Transfer learning in collaborative filtering for sparsity reduction," in Proceedings of the 24th AAAI Conference on Artificial Intelligence, Atlanta, Georgia, USA, July 2010.

[24] W. Pan, E. Xiang, N. Liu, and Q. Yang, "Can movies and books collaborate? Cross-domain collaborative filtering for sparsity reduction," in Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, Pasadena, CA, USA, July 2009.

[25] A. P. Singh and G. J. Gordon, "A unified view of matrix factorization models," in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part II, 2008, pp. 358–373.

[26] R. Singh, J. Xu, and B. Berger, "Global alignment of multiple protein interaction networks with application to functional orthology detection," PNAS, vol. 105, no. 35, pp. 12763–12768, 2008.

[27] T. Smith and M. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, pp. 195–197, 1981.

[28] J. Bock and D. Gough, "Whole-proteome interaction mining," Bioinformatics, vol. 19, no. 1, pp. 125–135, 2003.

[29] S. Martin, D. Roe, and J.-L. Faulon, "Predicting protein-protein interactions using signature products," Bioinformatics, vol. 21, no. 2, pp. 218–226, 2005.

[30] L. Nanni, "Fusion of classifiers for predicting protein-protein interactions," Neurocomputing, vol. 68, pp. 289–296, 2005.

[31] L. Nanni, "Hyperplanes for predicting protein-protein interactions," Neurocomputing, vol. 69, no. 1-3, pp. 257–263, 2005.

[32] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[33] M. G. Culver, "Active learning to maximize area under the ROC curve," in Proceedings of the 6th International Conference on Data Mining, 2006, pp. 149–158.

