Professional Documents
Culture Documents
Research Online
2017
Jun Shen
University of Wollongong, jshen@uow.edu.au
Lei Wang
University of Wollongong, leiw@uow.edu.au
Jiangning Song
Monash University, jiangning.song@monash.edu
Part of the Engineering Commons, and the Science and Technology Studies Commons
Recommended Citation
Chen, Huaming; Shen, Jun; Wang, Lei; and Song, Jiangning, "Leveraging Stacked Denoising Autoencoder
in Prediction of Pathogen-Host Protein-Protein Interactions" (2017). Faculty of Engineering and
Information Sciences - Papers: Part B. 383.
https://ro.uow.edu.au/eispapers1/383
Research Online is the open access institutional repository for the University of Wollongong. For further information
contact the UOW Library: research-pubs@uow.edu.au
Leveraging Stacked Denoising Autoencoder in Prediction of Pathogen-Host
Protein-Protein Interactions
Abstract
In big data research related to bioinformatics, one of the most critical areas is proteomics. In this paper,
we focus on the protein-protein interactions, especially on pathogen-host protein-protein interactions
(PHPPIs), which reveals the critical molecular process in biology. Conventionally, biologists apply in-lab
methods, including small-scale biochemical, biophysical, genetic experiments and large-scale experiment
methods (e.g. yeast-two-hybrid analysis), to identify the interactions. These in-lab methods are time
consuming and labor intensive. Since the interactions between proteins from different species play very
critical roles for both the infectious diseases and drug design, the motivation behind this study is to
provide a basic framework for biologists, which is based on big data analytics and deep learning models.
Our work contributes in leveraging unsupervised learning model, in which we focus on stacked denoising
autoencoders, to achieve a more efficient prediction performance on PHPPI. In this paper, we further
detail the framework based on unsupervised learning model for PHPPI researches, while curating a large
imbalanced PHPPI dataset. Our model demonstrates a better result with the unsupervised learning model
on PHPPI dataset.
Keywords
protein-protein, leveraging, interactions, stacked, denoising, autoencoder, prediction, pathogen-host
Disciplines
Engineering | Science and Technology Studies
Publication Details
Chen, H., Shen, J., Wang, L. & Song, J. (2017). Leveraging Stacked Denoising Autoencoder in Prediction of
Pathogen-Host Protein-Protein Interactions. IEEE 6th International Congress on Big Data (pp. 368-375).
United States: IEEE.
Abstract—In big data research related to bioinformatics, technologies. Identifying protein-protein interactions (PPI)
one of the most critical areas is proteomics. In this paper, with a comprehensive understanding is essential for the
we focus on the protein-protein interactions, especially on studies of biological functions, which are fundamental to
pathogen-host protein-protein interactions (PHPPIs), which
reveals the critical molecular process in biology. Convention- particular research, like drug design and infectious diseases.
ally, biologists apply in-lab methods, including small-scale Among these PPI research, the computational studies for
biochemical, biophysical, genetic experiments and large-scale PPI mostly focused on ‘intra-species’ PPI, which mainly
experiment methods (e.g. yeast-two-hybrid analysis), to identify concerned the interactions between two proteins from the
the interactions. These in-lab methods are time consuming same species.
and labor intensive. Since the interactions between proteins
from different species play very critical roles for both the Since identifying PPIs is essential for understanding the
infectious diseases and drug design, the motivation behind this whole biological functions and most diseases occurring be-
study is to provide a basic framework for biologists, which tween the hosts and pathogens, we focus on the interactions
is based on big data analytics and deep learning models. Our between proteins from two different species, namely ‘inter-
work contributes in leveraging unsupervised learning model, in species’ PPI. As pathogen-host PPI plays an important role
which we focus on stacked denoising autoencoders, to achieve
a more efficient prediction performance on PHPPI. In this in both the mechanisms of infection and medicine treatment,
paper, we further detail the framework based on unsupervised obtaining a better understanding and prediction of PHPPI
learning model for PHPPI researches, while curating a large will bring huge benefits for biologists.
imbalanced PHPPI dataset. Our model demonstrates a better On the other hand, deep learning is currently booming
result with the unsupervised learning model on PHPPI dataset. and it has achieved a lot of inspiring research outcomes
[7]. It is considered as the most effective model with its
Keywords-big data; PHPPI; denoising autoencoder; predic- ability to generate higher level representation features for
tion; machine learning model learning, when there is sufficient data [8–10]. Several
popular network models are being widely used, including
I. I NTRODUCTION convolutional neural network (CNN) and deep belief net
Immersed in many disciplines, including computer vision, (DBN).
online resources, health care, urban city and bioinformatics However, the literature research mostly focus on relatively
study, big data analytics technology is influencing every area small datasets with balanced ratio. Currently, there is little
of our life with the enormously available data [1–4]. The big research on PHPPI, especially while PHPPI datasets appear
data is considered as a data-driven research area in combina- to have two important characteristics, which normally ex-
tion with data mining and machine learning technologies, so hibit high skewed ratio and large data size, as reported in
the verified and well represented data are crucial. Through our earlier study [11]. Herein we propose to leverage unsu-
decades of efforts by the biologists, tremendous datasets are pervised learning model, especially the stacked denoising
being recorded in bioinformatics. Nowadays the biologists autoencoders (SdA) in our model, for PHPPI prediction.
and computer science researchers are working together to A general framework with unsupervised learning model is
tackle the challenges in knowledge discovery by building further built to tackle the unique challenges from PHPPI
computational models in both proteomics and genomics research, which need to deal with large data size and imbal-
disciplines [5, 6]. anced ratio between binary classes. Inside this framework,
In our study, we focus on the proteomics area. Proteomics we anticipate that SdA, as the primary unsupervised learning
research draws a lot of attention due to the huge emerging model, could automatically represent higher level features
‘omics’ data in recent years. These data are accumulated for further model classifier. In our experiments, We observed
in an extraordinary speed benefiting from high throughput that the higher level representation by unsupervised learning
could be the key to a better performance for prediction on Moore-Penrose generalized inverse of the neural network
in PHPPI. We demonstrate this observation by deploying output. Most of these models are built for yeast and human
our framework to two selected pathogen species. These two ‘intra-species’ PPI.
pathogen species are ‘Francisella tularensis’ and ‘Clostrid- Deep learning thrives on some benchmark datasets, for
ium difficile’, while the host is human. example, MNIST1 , ImageNet2 and so on. Especially, deep
The rest of this paper is organized as follows: Section II learning has shown its powerful performance on ‘Big Data’
reviews the related work; Section III presents our framework sets. This stimulates us to introduce the deep learning mech-
with deep learning methods; Section IV gives our detailed anisms into PHPPI research. Currently, more and more novel
PHPPI datasets and the evaluation metrics; Section V is machine learning models are proposed to imitate human
our result analysis and discussion. Finally we conclude this brain systems. Among these, CNN and DBN are two of the
paper and point out future work in Section VI. most popular models. In [22] the hierarchical data processing
in the human brain cortex is discussed, which is imitated in
II. R ELATED W ORK the computer vision research. It has been proved in [23]
More and more experiments are conducted to verify that CNN achieves a similar data representation approach
whether a pair of proteins interact with each other. Among as discussed in [22]. The visualization in [23] details the
these, only the positive PPIs, which means a pair of pro- construction of the corner and other edge information inside
teins has interacting relationship, are manually recorded. each image. Given that the models having the ability to learn
However, identifying these positive PPIs could be time con- higher degree features from data, deep learning is considered
suming and labor intensive in biological experiments, which as a more efficient and generalized way in other research
could also introduce high false positive rate. The uncertainty areas.
of experimental results for detection of interactions between These successes in data feature representation motivate
host and pathogen proteins is one of the main obstacles for us to introduce unsupervised learning model into our study
biologists [12]. How to provide highly confident prediction on PHPPI. In this paper, we leverage a deep stacked
for further experimental verification is our goal in this study. denoising autoencoders model to design an unsupervised
Yeast and human are considered as two special species learning model for PHPPI prediction. In next section, we will
since most of PPI research focus on these ‘intra-species’ detail this framework and present the unsupervised learning
PPIs. [13] gave a detailed and insightful analysis on yeast model for PHPPI research, and later we will conduct the
protein interactions, while [14] proposed an algorithm called experiments on several curated datasets. A brief explanation
PrePPI to yield over 30,000 high-confidence interactions of the procedure will also be presented in section IV.
for yeast and over 300,000 interactions for human. This
algorithm proposed to utilize Bayesian statistics to predict III. F RAMEWORK WITH U NSUPERVISED L EARNING F OR
interactions with 3D structural information. [15] presented PHPPI
a novel sequences feature representation method of proteins In this section, we will first introduce and go through a
to achieve a better result than other works. The PPI applied whole data life cycle, which includes data curation, feature
in [15] were extracted from the yeast protein-protein inter- representation and model learning for PHPPI prediction.
actions. In detail, as shown in Figure 1, this framework consists of
Among these research a balanced dataset was normally raw data curation, data feature representation, unsupervised
applied, which could be either from the available websites or learning model, and eventually classifier model learning.
built by researchers themselves. The balanced ratio between
the positive and negative PPI data is nearly 1:1. For PHPPI,
the ratio between positive and negative pairs is highly
skewed, which is normally 1:100 to 1:500. However, the
ratio 1:100 is generally considered to build a less-biased
classifier for prediction [11, 16, 17].
In addition to the dataset characteristics, the computa-
tional model plays another important role for prediction.
Several supervised machine learning methods have been
applied to build the corresponding model for balanced
PPI datasets prediction, including extreme learning machine Figure 1: The Framework with Unsupervised Learning
(ELM) model [18, 19] and support vector machines (SVM) Model for PHPPI
[15, 17, 20, 21]. SVM is the most utilized model in many
research disciplines as it associates basic structural risk
minimization theory to make predictions, while ELM model 1 http://yann.lecun.com/exdb/mnist/