Abstract—In this paper, we propose a computational framework to measure the preservation characteristics of modular structures between two biological networks. The preservation characteristics of co-expressed gene modules are identified by comparing the frequencies of a few predefined small substructures, called graphlets, in the coexpression networks of three HIV-1 infection stages: acute, nonprogressor, and chronic. A novel similarity measure is proposed based on the frequencies and significances of those graphlets occurring in the networks. A widely used tool, GtrieScanner, is utilized to find the frequencies and significance of those graphlets in the networks. Results confirm higher similarity of topological properties between co-expressed modules of the acute and chronic stages than between those of the acute and nonprogressor stages. Our method contributes to an important understanding of the preservation characteristics of the modular organization in two different biological networks.

Keywords—HIV-1 infection stages, HIV-1 modules, Graphlets, Normalized and weighted graphlet frequency distribution, Preservation score

I. INTRODUCTION

One vital task in HIV-1 research is to understand the pattern changes of HIV-1 infection progression. The first stage of HIV-1 infection is acute retroviral syndrome, which induces the production of virus in the body [1], [2]. In this stage, a large amount of virus is produced, which rapidly diminishes the CD4+ cell count. As an immediate response, the immune system reduces the virus count to a moderate level, called the ‘virus set point’, after which the CD4+ count begins to increase. The majority of HIV-1 infected individuals progress to the latent stage, called the ‘chronic’ stage, in which the CD4+ cell count begins to drop severely. This results in the progression of AIDS in the human body, which becomes vulnerable to opportunistic infections. A few infected people remain clinically stable by maintaining high CD4+ and CD8+ cell counts for a long time, which is known as the long-term nonprogressor stage [3], [4]. The transition from the acute infection stage to the other latent stages can be dissected by examining the topological pattern of the transcriptomal network for each infection stage. Systems biology approaches based on microarray data have been widely used to elucidate the pattern of the transcriptome across different stages of disease progression. Although there exist several studies that examine the preservation of modular structure in the progression of different diseases [5], [6], [7], to the best of our knowledge very few studies exist in the field of HIV-1 infection stages. In [2], the three stages of HIV-1 progression are analyzed based on the coregulation pattern of infected genes. In [8], the samples of the three stages of HIV-1 infection are distinguished based on differentially co-expressed interacting protein pairs (DEPs). These works mainly focus on the detection of coexpressed modules and differentially coexpressed protein pairs in the three stages of HIV-1 infection. In [9], a multi-objective model is proposed to detect perturbation patterns of coexpressed modules across the progression of HIV-1 infection. Later, in [10], the authors proposed an eigengene-based approach for analyzing microarray gene expression data of HIV-infected individuals and detecting preservation characteristics of coexpressed modules during HIV infection progression.

In [11], the authors first proposed the idea of comparing two biological networks using the frequency distribution of a set of predefined graphlets. One major drawback of that method is its lack of scalability: for two input networks of different sizes, the frequency of a particular graphlet will be higher for the larger network. The method thus lacks a suitable normalization technique that scales the metric uniformly for networks of different sizes. Here, we address this by using an appropriate normalization technique. We further extend the idea by including a ‘weight’ for each graphlet. As the predefined graphlets are not all equally important, comparing the frequencies of all of them is not required. Rather, it is important to know which graphlets are more significant than others in the network. To find the significant graphlets of a network we utilize the concept of a network motif. A network motif is a small graphlet that occurs in a network more frequently than in similar random networks [12]. Motifs are said to be the building blocks of a network. The z-score metric is utilized to determine whether a graphlet is statistically significant or not. We use this metric to determine the significance of a graphlet in a network and apply it as a weight. Finding frequencies and z-scores of all graphlets are
[Figure: plot panels with x-axis k-cutoff; recovered panel title: Nonprogressor]
Fig. 1. Correlation coefficient cutoff vs. average network density plot.

II. METHOD

A. Data preprocessing

We downloaded the gene expression dataset (GSE6740 series) from the GEO database [14], which contains expression values of CD4+ and CD8+ cells from untreated HIV-1 positive individuals in the three HIV-1 infection stages. It consists of 22284 genes and 10 samples at each progression stage: five samples are from CD4+ T cells and five are from CD8+ T cells. To obtain gene modules in the three stages of HIV progression we utilized the PROCOMOSS [15] algorithm. PROCOMOSS uses a multiobjective technique to cluster genes based on the interaction and functional similarity between gene pairs. For each stage of HIV infection, we collected interaction information of the expressed genes and compiled a functional similarity matrix using gene ontology data. These two metrics are utilized by PROCOMOSS to cluster genes into functionally homogeneous modules. We found 30 modules in the acute stage, 21 in the nonprogressor stage and 32 in the chronic stage. In each stage, a co-expression matrix is then formed for each of the modules using the microarray gene expression data. Sets of 30, 21 and 32 co-expression matrices are thus built for the acute, nonprogressor and chronic stages, respectively. Using a cut-off value, these matrices are turned into binary matrices, which we call adjacency

Fig. 2. The twenty-nine graphlets used to compare two networks.

B. Finding frequencies and z-scores of graphlets

GtrieScanner (http://www.dcc.fc.up.pt/gtries/) [13] is a network motif finding tool, which we use to find the frequencies and z-scores of all twenty-nine graphlets for the modules of the three stages of HIV-1 infection. It finds the frequencies of graphlets in the original network and creates similar random networks in order to compare the frequencies of those graphlets in the original and random networks. The z-score is used to assess the significance of the frequency of a graphlet and is defined as:

$$\text{z-score}(S^i) = \frac{F_G(S^i) - \mu_R(S^i)}{\sigma_R(S^i)}$$

where $F_G(S^i)$ is the frequency of subgraph $S^i$ in the original network, and $\mu_R(S^i)$ and $\sigma_R(S^i)$ represent the mean and standard deviation of the frequency of subgraph $S^i$ in the similar random networks. A positive z-score implies that the subgraph is more abundant in the original network than in the random networks, while a negative score signifies that it occurs less often than expected. As both positive and negative z-scores imply significance of a graphlet, we take the absolute value and assign it as the weight of the graphlet. Graphlets of a random network tend to have z-scores near zero.

C. Normalization of graphlet frequencies

As already mentioned, normalization was one of the key motivations of this work. To deal with normalization we introduce a concept called ‘reference network frequency’. For a graphlet G and a network m, its ‘reference network frequency’, $freq(G^{ref_m})$, is the maximum frequency of the graphlet G in a network of the same order as m, that is, the maximum number of times G can occur in a network with the same number of nodes as m. A reference network $m_{ref}$ of a network m can be formed by creating a fully connected network with the same number of nodes as m. Normalization is done by dividing the frequency of a graphlet in a network by its frequency in the reference network. Mathematically, the normalized frequency of a graphlet G in a network m is defined as:

$$F_{norm}(G^m) = \frac{freq(G^m)}{freq(G^{ref_m})} \qquad (1)$$

Another important aspect is how to prioritize the graphlets for a given network. For this, we have chosen the z-score, which determines the importance of the graphlets. To combine the normalized frequency and the significance of a graphlet with respect to a network, the ‘normalized weighted graphlet frequency’ is formulated as:

$$F_{nw}(G^m) = F_{norm}(G^m) \times |Z_m(G)| \qquad (2)$$

which gives the normalized and weighted frequency of a graphlet G in a network m, where $|Z_m(G)|$ is the absolute z-score of graphlet G in the network m.

D. Similarity between two modules using weighted graphlet frequency distribution

Studying the local structure of a network is an important step in understanding the topological characteristics that appear in the network. Comparing these local structures facilitates the understanding of structural similarity among networks.

To compare preservation among the three stages of HIV-1 infection, we first need to find the pairwise similarity between all modules of the three stages. Here, we use the measure for finding similarity between two graphs proposed in [11], which is based on graphlet frequency distribution. The only difference lies in normalizing the graphlet frequencies and adding weights to the graphlets before comparing them. Given two networks m and p, the structural similarity according to the normalized weighted graphlet frequency distribution is formulated as:

$$Sim(m, p) = \frac{1}{k} \sum_{i=1}^{k} \frac{\min\left(F_{nw}(G_i^m),\, F_{nw}(G_i^p)\right)}{\max\left(F_{nw}(G_i^m),\, F_{nw}(G_i^p)\right)} \qquad (3)$$

$F_{nw}(G_i^m)$ is the normalized and weighted frequency of graphlet $G_i$ in network m, and k is the number of graphlets whose normalized and weighted frequencies are being compared. For computational convenience, we consider a total of 29 graphlet structures (Fig. 2) having at most five nodes. If computational power permits, one can go beyond size five. The measure is commutative. It captures the similarity between two networks across all twenty-nine weighted graphlet frequency distributions. It lies between zero and one; a value near 1 signifies that the two input networks are topologically more similar.

E. Finding preservation score between two sets of modules

The similarity between two modules is given by the formula described in the above section. To find the preservation score between two sets of modules we apply this formula repeatedly. The preservation score between two sets of modules $M = \{m_1, m_2, \ldots, m_k\}$ and $P = \{p_1, p_2, \ldots, p_n\}$ is defined as follows:

$$Preserv(M, P) = \frac{\sum_{i=1}^{k} \max_{j \in \{1, \ldots, n\}} Sim(m_i, p_j)}{k} \qquad (4)$$

where the similarity between two modules $m_i$ and $p_j$ is defined as in the above section. If the value approaches one, the two groups of input networks are more similar. Note that the two groups need not contain the same number of modules.

F. Preservation in random networks

To validate our method we generated random networks following the well-known Erdős–Rényi model [16]. In this random graph model, all graphs with a fixed set of vertices and a fixed number of edges are equally likely. There are two variants of the Erdős–Rényi random graph model. In the G(n, M) model, a graph G is selected uniformly at random from the ensemble of all possible graphs which have n nodes and M edges. For example, in the G(3, 2) model, each of the three possible graphs on three vertices and two edges is included with probability 1/3. In the other model, G(n, p), a graph is constructed by connecting nodes randomly: each edge is included with probability p, independently of every other edge.

We used the R package “igraph” (http://igraph.org/r/) [17] to generate sets of random graphs using the G(n, p) variant of the Erdős–Rényi model. In total, five sets of modules are created, where each set consists of ten random networks with different n and p parameters. A random node count n (from 20 to 70) and a random p (from 0.3 to 0.6) are taken for the G(n, p) model. Table I shows the preservation scores between all sets of random networks. All diagonal entries are one due to self-similarity. For the rest of the entries, the score is roughly 0.5. This is due to the fact that all the networks are generated using the same random graph generation model and each set contains the same number (ten) of networks. Hence they possess some degree of similarity among themselves, which explains these low similarity values.

III. RESULT

A. Comparing individual modules in HIV-1

To compare two modules, the first step was to find the normalized frequency of all 29 graphlets in those modules. As mentioned earlier, we used the GtrieScanner tool [13] to find the frequencies of the graphlets. To create a visualization, we sorted all modules in each stage according to their mean graphlet frequency and took the top five modules from each stage. We have plotted the normalized frequencies of the top five graphlets of these modules in Fig. 3. Ten graphlets are observed in the acute stage, out of which three graphlets are of size 4 (g11, g12, g9) and the rest are
[Figure panel: Acute; y-axis: normalized_frequency; legend: graphlet]

TABLE I
PRESERVATION SCORES AMONG FIVE SETS OF RANDOM NETWORKS

                random set 1   random set 2   random set 3   random set 4   random set 5
random set 1    1.0000         0.4470         0.4440         0.4600         0.4370
random set 2    …              …              …              …              …
random set 3    0.4410         0.4250         1.0000         0.4400         0.4350
random set 4    0.4520         0.4550         0.4260         1.0000         0.4190
random set 5    0.4570         0.4250         0.4490         0.4000         1.0000
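The two Erdős–Rényi variants described in Section II-F, and the way the five random sets behind Table I could be assembled, can be sketched in Python. This is only an illustrative stand-in under stated assumptions: the paper used the R package igraph, whereas the helper names `gnp_graph`, `all_gnm_graphs` and `random_network_sets` and the fixed seed below are hypothetical and rely on the Python standard library alone.

```python
import itertools
import random


def gnp_graph(n, p, rng):
    """Sample a G(n, p) graph: keep each possible edge independently with probability p."""
    return {
        (u, v)
        for u, v in itertools.combinations(range(n), 2)
        if rng.random() < p
    }


def all_gnm_graphs(n, m):
    """Enumerate the G(n, M) ensemble: every graph on n nodes with exactly m edges."""
    possible_edges = list(itertools.combinations(range(n), 2))
    return [set(edges) for edges in itertools.combinations(possible_edges, m)]


def random_network_sets(num_sets=5, nets_per_set=10, seed=0):
    """Build sets of G(n, p) graphs with n in [20, 70] and p in [0.3, 0.6]."""
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    sets = []
    for _ in range(num_sets):
        current = []
        for _ in range(nets_per_set):
            n = rng.randint(20, 70)
            p = rng.uniform(0.3, 0.6)
            current.append((n, gnp_graph(n, p, rng)))
        sets.append(current)
    return sets


if __name__ == "__main__":
    # G(3, 2): the ensemble holds three graphs, each chosen with probability 1/3.
    print(len(all_gnm_graphs(3, 2)))  # prints 3
    sets = random_network_sets()
    print(len(sets), len(sets[0]))  # prints 5 10
```

Each graph is stored as a set of node pairs, which keeps the sketch free of external dependencies; computing graphlet frequencies and z-scores on these graphs would still require a motif tool such as GtrieScanner.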
graphlets are also present in all modules but these graphlets are occurring in high frequency.

4, acute and chronic modules share high sim scores. For

modules do have both high (about 0.5) and very low (around 0.05) sim scores. Fig. 5 shows three boxplots of all pairwise

gression module

[Figure panels: Chronic, Nonprogressor; y-axis: normalized_frequency]
[Figure: three panels; y-axis: Sim_score; x-axes: acute_modules, acute_modules, nonprogressor_modules; legends: chronic_modules, nonprogressor_modules, chronic_modules]
Fig. 4. Pairwise similarity scores between the top five modules in each of the acute-chronic, acute-nonprogressor and nonprogressor-chronic stage pairs.
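The pairwise scores plotted in Fig. 4 follow from the z-score weighting, Eqs. (1) and (2), and the similarity of Eq. (3). A minimal Python sketch of that chain is given below under several loudly stated assumptions: Eq. (3) is read as a mean of min/max ratios, the population standard deviation is used for the z-score, two zero entries are treated as identical, and all function names and inputs are hypothetical. In the paper the frequencies and z-scores come from GtrieScanner rather than from code like this.

```python
from statistics import mean, pstdev


def z_score(freq_original, freqs_random):
    """z-score of a graphlet: (F_G - mu_R) / sigma_R over the random networks."""
    sigma = pstdev(freqs_random)  # assumption: population standard deviation
    if sigma == 0:
        return 0.0  # assumption: no spread in the random ensemble means no signal
    return (freq_original - mean(freqs_random)) / sigma


def normalized_weighted(freq, ref_freq, z):
    """Eqs. (1) and (2): (freq / reference network frequency) * |z-score|."""
    return (freq / ref_freq) * abs(z)


def sim(fnw_m, fnw_p):
    """Eq. (3), read here as the mean min/max ratio over the k graphlets."""
    if len(fnw_m) != len(fnw_p) or not fnw_m:
        raise ValueError("both networks need the same non-empty graphlet list")
    ratios = []
    for a, b in zip(fnw_m, fnw_p):
        hi = max(a, b)
        # assumption: a graphlet absent from both networks counts as identical
        ratios.append(1.0 if hi == 0 else min(a, b) / hi)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical values for k = 2 graphlets in two modules.
    module_m = [normalized_weighted(2, 10, -1.5), normalized_weighted(4, 10, 2.0)]
    module_p = [normalized_weighted(1, 10, 1.5), normalized_weighted(4, 10, 2.0)]
    print(sim(module_m, module_p))
    print(sim(module_m, module_m))  # a module compared with itself scores 1.0
```

With non-negative inputs the score stays between zero and one and does not depend on the order of the two networks, which matches the properties stated for the measure in Section II-D.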
or two sets of networks.

ACKNOWLEDGEMENTS

This publication is an outcome of the R&D work undertaken project under the Visvesvaraya PhD Scheme of Ministry of Electronics & Information Technology, Government of India, being implemented by Digital India Corporation.

REFERENCES

[1] Bandyopadhyay, S., Ray, S., Mukhopadhyay, A., Maulik, U.: A review of in silico approaches for analysis and prediction of HIV-1-human protein-protein interactions. Briefings in Bioinformatics 16(5), 830–851 (2015)
[2] Ray, S., Bandyopadhyay, S.: Discovering condition specific topological pattern changes in coexpression network: an application to HIV1 progression. IEEE/ACM Trans Comput Biol Bioinform. 16(6), 1086–1099 (2016)
[3] Zeller, J., McCain, N., Swanson, B.: Immunological and virological markers of HIV-disease progression. J Assoc Nurses AIDS Care 7(15-27) (1996)
[4] Buchbinder, S., Katz, M., Hessol, N., O’Malley, P., Holmberg, S.: Long-term HIV-1 infection without immunologic progression. AIDS 8, 1123–1128 (1994)
[5] Cai, C., Langfelder, P., Fuller, T., Oldham, M., et al., R.L.: Is human blood a good surrogate for brain tissue in transcriptional studies? BMC Genomics 11(589), 1471–2164 (2010)
[6] Miller, J., Horvath, S., Geschwind, D.: Divergence of human and mouse brain transcriptome highlights Alzheimer disease pathways. Proc Natl Acad Sci U S A 107, 12698–12703 (2010)
[7] Oldham, M., Horvath, S., Geschwind, H.: Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proc Natl Acad Sci U S A 103, 17973–17978 (2006)
[8] Yoon, D., Kim, H., Suh-Kim, H., Park, R., Lee, K.: Differentially co-expressed interacting protein pairs discriminate samples under distinct stages of HIV-1 infection. BMC Systems Biology 5(Suppl 2), S1 (2011). doi:10.1186/1752-0509-5-S2-S1
[9] Ray, S., Biswas, S., Mukhopadhyay, A., Bandyopadhyay, S.: Detecting perturbation in co-expression modules associated with different stages of HIV-1 progression: A multi-objective evolutionary approach. In: Proceedings of the 2014 Fourth International Conference of Emerging Applications of Information Technology. EAIT ’14, pp. 15–20. IEEE Computer Society, Washington, DC, USA (2014). doi:10.1109/EAIT.2014.34. http://dx.doi.org/10.1109/EAIT.2014.34
[10] Ray, S., Hossain, M., Khatun, L.: Discovering preservation pattern from co-expression modules in progression of HIV-1 disease: An eigengene based approach. In: Proceedings of the IEEE, 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 10–110920167732146. IEEE Computer Society, ??? (2016)
[11] Bhattacharjee, D., Hossain, S.M.M., Sultana, R., Ray, S.: Topological inquisition into the PPI networks associated with human diseases through graphlet frequency distribution. In: International Conference on Pattern Recognition and Machine Intelligence, pp. 431–437 (2017). Springer
[12] Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D.,
[Figure: three box-plot panels; x-axes: acute_modules, acute_modules, nonprogressor_modules; recovered y-axis label: Sim_scores with chronic modules]
Fig. 5. Three box-plots showing pairwise similarity scores between all modules in each of the acute-chronic, acute-nonprogressor and nonprogressor-chronic stage pairs.
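Pairwise similarity scores such as those summarized in Fig. 5 are combined into a single value per pair of module sets by the preservation score of Eq. (4) in Section II-E. The following Python sketch shows one way this aggregation could be written. The function `preservation_score` and the stand-in similarity `toy_sim` are hypothetical names introduced for illustration, and the numeric "modules" are placeholders for real networks.

```python
def preservation_score(modules_m, modules_p, sim):
    """Eq. (4): average, over modules in M, of the best similarity found in P."""
    if not modules_m or not modules_p:
        raise ValueError("both module sets must be non-empty")
    best = [max(sim(m, p) for p in modules_p) for m in modules_m]
    return sum(best) / len(best)


def toy_sim(a, b):
    """Hypothetical stand-in for Sim(m, p): min/max ratio of two positive numbers."""
    return min(a, b) / max(a, b)


if __name__ == "__main__":
    set_m = [1.0, 2.0, 4.0]  # three placeholder modules
    set_p = [2.0, 4.0]       # two placeholder modules
    print(preservation_score(set_m, set_m, toy_sim))  # self-comparison gives 1.0
    print(preservation_score(set_m, set_p, toy_sim))
    print(preservation_score(set_p, set_m, toy_sim))
```

The two sets may differ in size, as the text notes. Because only the modules of the first set are averaged over, swapping the arguments can change the result, which is consistent with Table I not being symmetric.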