
2018 4th International Conference for Convergence in Technology (I2CT)

SDMIT Ujire, Mangalore, India. Oct 27-28, 2018

Analysis on Preservation Characteristics of Modular Structure during HIV-1 Progression using Weighted and Normalized Graphlet Frequency Distribution
Sourav Biswas
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Email: sourav8051@gmail.com

Sumanta Ray
Dept. of Computer Science, Aliah University
Email: sumantababai86@gmail.com

Sanghamitra Bandyopadhyay
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Email: sanghami@gmail.com

978-1-5386-5232-9/18/$31.00 ©2018 IEEE

Abstract—In this paper, we propose a computational framework to measure the preservation characteristics of modular structures between two biological networks. The preservation characteristics of co-expressed gene modules are identified by comparing the frequencies of a few predefined small substructures, called graphlets, in the coexpression networks of three HIV-1 infection stages: acute, nonprogressor, and chronic. A novel similarity measure is proposed based on the frequencies and significance of those graphlets in the networks. The widely used tool GtrieScanner is utilized to find the frequencies and significance of the graphlets. Results show that the topological properties of co-expressed modules are more similar between the acute and chronic stages than between the acute and nonprogressor stages. Our method contributes to the understanding of preservation characteristics of the modular organization in two different biological networks.

Keywords—HIV-1 infection stages, HIV-1 modules, Graphlets, Normalized and weighted graphlet frequency distribution, Preservation score

I. INTRODUCTION

One vital task in HIV-1 research is to understand the pattern changes during HIV-1 infection progression. The first stage of HIV-1 infection is acute retroviral syndrome, during which the virus is produced in the body [1], [2]. In this stage, a large amount of virus is produced, which rapidly diminishes the CD4+ cell count. As an immediate response, the immune system reduces the virus count to a moderate level, called the 'virus set point', after which the CD4+ count begins to increase. The majority of HIV-1 infected individuals progress to the latent stage called the 'chronic' stage, in which the CD4+ cell count begins to drop severely. This results in progression to AIDS, and the body becomes vulnerable to opportunistic infections. A few infected people remain clinically stable by maintaining a high amount of CD4+ and CD8+ cells for a long time, which is known as the long term nonprogressor stage [3], [4]. The transition from the acute infection stage to the other latent stages can be dissected by examining the topological pattern of the transcriptomal network for each infection stage. System biological approaches based on microarray data have been widely used to elucidate the pattern of the transcriptome across different stages of disease progression. Although there exist several studies that examine the preservation of modular structure in the progression of different diseases [5], [6], [7], to the best of our knowledge very few studies exist in the field of HIV-1 infection stages. In [2], the three stages of HIV-1 progression are analyzed based on the coregulation pattern of infected genes. In [8], the samples of three stages of HIV-1 infection are distinguished based on differentially co-expressed interacting protein pairs (DEPs). These works are mainly focused on the detection of coexpressed modules and differentially coexpressed protein pairs in the three stages of HIV-1 infection. In [9], a multi-objective model is proposed to detect perturbation patterns of coexpressed modules across the progression of HIV-1 infection. Later, in [10], the authors proposed an eigengene based approach for analyzing microarray gene expression data of HIV infected individuals and detecting preservation characteristics of coexpressed modules during HIV infection progression.

In [11], the authors first proposed the idea of comparing two biological networks using the frequency distribution of a set of predefined graphlets. One major drawback of the method is that it does not scale across network sizes: for two input networks of different sizes, the frequency of a particular graphlet will be higher for the larger network. The method therefore lacks a normalization technique that scales the metric uniformly for networks of different sizes. Here, we address this by using an appropriate normalization technique. We further extend the idea by including a 'weight' for each graphlet. As the predefined graphlets are not all equally important, comparing the frequencies of all of them on an equal footing is not required. Rather, it is important to know which graphlets are more significant than others in the network. To find the significant graphlets of a network we utilize the concept of the network motif. A network motif is a small graphlet that occurs in a network more frequently than in similar random networks [12]. Motifs are said to be the building blocks of a network. The z-score is utilized to determine whether a graphlet is statistically significant or not. We use this metric to determine the significance of a graphlet in a network, and take it as a weight. Finding the frequencies and z-scores of all graphlets is done using a motif finding tool called GtrieScanner [13].

Fig. 1. Correlation coefficient cutoff vs average network density plot (panels: Acute, Chronic, Nonprogressor; x-axis: k-cutoff, 0.2 to 1).

II. METHOD

A. Data preprocessing

We downloaded the gene expression dataset (GSE6740 series) from the GEO database [14], which has expression values of CD4+ and CD8+ cells for the three HIV-1 infection stages of some untreated HIV-1 positive people. It consists of 22284 genes and 10 samples at each progression stage, of which five are CD4+ T cell samples and five are CD8+ T cell samples. To obtain gene modules in the three stages of HIV progression we utilized the PROCOMOSS [15] algorithm. PROCOMOSS uses a multiobjective technique to cluster genes based on the interaction and functional similarity between gene pairs. For each stage of HIV infection, we collected interaction information of expressed genes and compiled a functional similarity matrix using gene ontology data. These two metrics are utilized by PROCOMOSS to cluster genes into functionally homogeneous modules. We found 30 modules in the acute stage, 21 in the nonprogressor stage and 32 in the chronic stage. In each stage, a co-expression matrix is then formed from the microarray gene expression data for each of the modules, giving sets of 30, 21 and 32 co-expression matrices for the acute, nonprogressor and chronic stages respectively. Using a cut-off value, these matrices are turned into binary matrices, which we call the adjacency matrices of the modules. We obtained a cut-off value of 0.8 for the acute and the chronic stages, while a cut-off of 0.6 was chosen for the nonprogressor stage. These cut-off values are adopted by finding the knee of the graph in Fig. 1, which plots the cut-off values against the corresponding average density of the network formed from the co-expression matrix. For the remaining part of this article, we use 'module' simply to denote the sub network constructed from the coexpression module.

Fig. 2. The twenty nine graphlets which are used to compare two networks.

B. Finding frequencies and z-scores of graphlets

GtrieScanner (http://www.dcc.fc.up.pt/gtries/) [13] is a network motif finding tool, which is used to find the frequencies and z-scores of all twenty nine graphlets for the modules of the three stages of HIV-1 infection. It finds the frequencies of graphlets in the original network and creates similar random networks in order to compare the frequencies of those graphlets in the original and the random networks. The z-score is used to assess the significance of the frequency of a graphlet and is defined as:

\[ z\text{-score} = \frac{F_G(S^i) - \mu_R(S^i)}{\sigma_R(S^i)} \]

where $F_G(S^i)$ is the frequency of subgraph $S^i$ in the original network, and $\mu_R(S^i)$ and $\sigma_R(S^i)$ represent the mean and standard deviation of the frequency of subgraph $S^i$ in the similar random networks. A positive z-score implies that the subgraph is more abundant in the original network than in the random networks, while a negative score signifies that it occurs less often than in the random networks. As both positive and negative z-scores imply significance of a graphlet, we take the absolute value and use it as the weight of the graphlet. Graphlets of a random network tend to have z-scores close to zero.

C. Normalization of graphlet frequencies

As already mentioned, normalization was one of the key motivations of this work. To deal with normalization we introduce a concept called 'reference network frequency'. For a graphlet G and a network m, its 'reference network frequency', $freq(G^{ref_m})$, is the maximum frequency of the graphlet G in a network of the same order as m. It is the number of times this graphlet G can occur in a network with
order similar to m. A reference network $m_{ref}$ of a network m can be formed by creating a fully connected network with the same number of nodes as m. Normalization is done by dividing the frequency of a graphlet in a network by its frequency in the reference network. Mathematically, the normalized frequency of a graphlet G in a network m is defined as:

\[ F_{norm}(G^m) = \frac{freq(G^m)}{freq(G^{ref_m})} \tag{1} \]

Another important aspect is how to prioritize the graphlets for a given network. For this, we have chosen the z-score, which determines the importance of the graphlets. Combining the normalized frequency and the significance of a graphlet with respect to a network, the 'normalized weighted graphlet frequency' is formulated as:

\[ F_{nw}(G^m) = F_{norm}(G^m) \times |Z_m(G)| \tag{2} \]

which gives the normalized and weighted frequency of a graphlet G in a network m, where $|Z_m(G)|$ is the absolute z-score of graphlet G in the network m.

D. Similarity between two modules using weighted graphlet frequency distribution

Studying the local structure of a network is an important step towards understanding the topological characteristics that appear in it. Comparing local structures facilitates the understanding of structural similarity among networks.

To compare preservation among the three stages of HIV-1 infection, we first need the pairwise similarity between all modules of the three stages. Here, we use the measure for the similarity between two graphs proposed in [11], which is based on the graphlet frequency distribution. The only difference is that the graphlet frequencies are normalized and weighted before they are compared. Given two networks m and p, the structural similarity according to the normalized weighted graphlet frequency distribution is formulated as:

\[ Sim(m, p) = \frac{1}{k} \sum_{i=1}^{k} \frac{F_{nw}(G_i^m) \;\square\; F_{nw}(G_i^p)}{F_{nw}(G_i^m) \;\square\; F_{nw}(G_i^p)} \tag{3} \]

(the binary operators in the numerator and in the denominator are not legible in the source and are shown as $\square$). $F_{nw}(G_i^m)$ is the normalized and weighted frequency of graphlet $G_i$ in network m, and k is the number of graphlets whose normalized and weighted frequencies are being compared. For computational convenience, we have considered a total of 29 graphlet structures (Fig. 2) with at most 5 nodes. If computation power permits, one can go beyond size five. The measure is commutative. It captures the similarity between two networks across all twenty nine weighted graphlet frequencies. The value lies between zero and one; the closer it is to 1, the more topological similarity exists between the two input networks.

E. Finding preservation score between two sets of modules

The similarity between two modules is given by the formula described in the section above. To find the preservation score between two sets of modules we apply this formula repeatedly. The preservation score between two sets of modules $M = \{m_1, m_2, \ldots, m_k\}$ and $P = \{p_1, p_2, \ldots, p_n\}$ is defined as follows:

\[ Preserv(M, P) = \frac{\sum_{i=1}^{k} \max_{1 \le j \le n} Sim(m_i, p_j)}{k} \tag{4} \]

where k is here the number of modules in M and the similarity between two modules $m_i$ and $p_j$ is defined as in the section above. The closer the value is to one, the more similar the two groups of input networks are. Note that the two groups need not contain the same number of modules.

F. Preservation in random networks

To validate our method we generated random networks following the Erdős–Rényi model [16]. In this random graph model, all graphs on a fixed vertex set with a fixed number of edges are equally likely. There are two variants of the Erdős–Rényi random graph model. In the G(n, M) model, a graph G is selected at random from the ensemble of all possible graphs which have n nodes and M edges. For example, in the G(3, 2) model, each of the three possible graphs on three vertices and two edges is included with probability 1/3. The other model, G(n, p), consists of a graph constructed by connecting nodes randomly: each edge is present with probability p, independently of every other edge.

We used an R package called "igraph" (http://igraph.org/r/) [17] to generate a set of random graphs using the G(n, p) variant of the Erdős–Rényi model. In total, five sets of modules are created, where each set consists of ten random networks with different n and p parameters. A random node size n (from 20 to 70) and a random p (from 0.3 to 0.6) are taken for the G(n, p) model. Table I shows the preservation scores between all sets of random networks. All diagonal entries are one due to self-similarity. For the rest of the entries, the score lies roughly between 0.40 and 0.46. This is due to the fact that all the networks are generated using the same random graph generation model and each set contains the same number (ten) of networks. Hence they possess some degree of similarity among themselves, which accounts for these low similarity values.

TABLE I
PRESERVATION SCORES AMONG FIVE SETS OF RANDOM NETWORKS

               random set 1   random set 2   random set 3   random set 4   random set 5
random set 1   1.0000         0.4470         0.4440         0.4600         0.4370
random set 2   0.4140         1.0000         0.4180         0.4460         0.4210
random set 3   0.4410         0.4250         1.0000         0.4400         0.4350
random set 4   0.4520         0.4550         0.4260         1.0000         0.4190
random set 5   0.4570         0.4250         0.4490         0.4000         1.0000

III. RESULT

A. Comparing individual modules in HIV-1

For comparing two modules, the first step was to find the normalized frequency of all 29 graphlets in those modules. As mentioned earlier, we used the GtrieScanner tool [13] to find the frequencies of the graphlets. To create a visualization, we sorted all modules in each stage according to their mean graphlet frequency and took the top five modules from each stage. The normalized frequencies of the top five graphlets of these modules are plotted in Fig. 3. Ten graphlets are observed in the acute stage, out of which three graphlets are of size 4 (g11, g12, g9) and the rest are of size 3. Similarly, the nonprogressor and the chronic stages have only seven and eight graphlets, respectively. This result suggests that only some structures/graphlets play important roles and that most of them are of size 3. Please note that other graphlets are also present in all modules, but the graphlets shown are the ones that occur with high frequency.
The next step is to calculate the weights of the graphlets in each module. Pairwise similarity scores among all modules in the three stages are computed using equation (3). In Fig. 4, the top five pairwise similarity scores among the three stages are shown using a point plot. In the first subplot of Fig. 4, acute and chronic modules share high similarity scores. For the acute and nonprogressor stages, individual similarity scores are substantially lower than in the other cases. From the last subplot of Fig. 4, nonprogressor and chronic modules have both high (about 0.5) and very low (around 0.05) similarity scores. Fig. 5 shows three boxplots of all pairwise similarity scores between modules of the acute-chronic, acute-nonprogressor and nonprogressor-chronic stages, respectively. This distribution gives an intuitive indication that the acute-chronic stages are more similar than the other two pairs.
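The chain from weighted frequencies to a pairwise similarity and then to a preservation score can be sketched in Python as follows. This is an illustration under a stated assumption, not the authors' implementation: the per-graphlet term of Eq. (3) is taken here to be the smaller of the two frequencies divided by the larger (a min/max ratio), which is consistent with the properties stated for the measure (commutative, bounded between zero and one, equal to one for identical inputs) but is not confirmed by the source. The function names and the example vectors are hypothetical.

```python
def weighted_frequency(f_norm, z_score):
    """Eq. (2): normalized frequency weighted by the absolute z-score."""
    return f_norm * abs(z_score)


def similarity(fnw_m, fnw_p):
    """Mean per-graphlet min/max ratio of two F_nw vectors (ASSUMED form of Eq. (3))."""
    if not fnw_m or len(fnw_m) != len(fnw_p):
        raise ValueError("both modules need the same, non-empty graphlet list")
    total = 0.0
    for a, b in zip(fnw_m, fnw_p):
        larger = max(a, b)
        # Assumption: a graphlet absent from both modules counts as a perfect match.
        total += 1.0 if larger == 0 else min(a, b) / larger
    return total / len(fnw_m)


def preservation(modules_m, modules_p):
    """Eq. (4): average, over modules in M, of the best similarity to any module in P."""
    best = [max(similarity(m, p) for p in modules_p) for m in modules_m]
    return sum(best) / len(best)


# Hypothetical F_nw vectors over two graphlets (the paper compares 29 per module).
acute = [[0.2, 0.4], [0.0, 0.3]]
chronic = [[0.1, 0.4], [0.0, 0.3], [0.5, 0.5]]
print(similarity(acute[0], chronic[0]))
print(preservation(acute, chronic))
```

Note that `preservation(M, P)` takes the best match in P for every module of M, so the two sets may differ in size and the score is not necessarily the same when the arguments are swapped, which matches the asymmetric off-diagonal entries of Table I.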
B. Preservation scores among three stages of HIV-1 progression

We found a preservation score of 0.3361 between the acute and nonprogressor modules, whereas between the acute and chronic modules it is 0.5582. This signifies more topological similarity between the acute and chronic stages than between the acute and nonprogressor stages. We also calculated the preservation score between the nonprogressor and chronic stages, which amounts to 0.3678. Both the acute and chronic stages are fatal, as HIV-1 infection progresses rapidly in these two stages. It is only during the nonprogressor stage that the infection is low and people can survive by maintaining a sufficient number of CD4+ cells.

Fig. 3. Normalized frequencies of the top five graphlets of five modules in each of the stages (note the different y-axis scaling for the nonprogressor stage). Panels: Acute (graphlets g1, g2, g3, g4, g5, g6, g8, g9, g11, g12), Chronic (g1, g2, g3, g4, g5, g6, g8, g9), Nonprogressor (g1, g2, g4, g5, g6, g8, g9); x-axis: module (Module1 to Module5); y-axis: normalized_frequency.

CONCLUSIONS

In this paper, we have proposed a novel framework to compare two different sets of biological networks based on graphlet frequency distribution. We have incorporated a weight for each graphlet and normalized its frequency before comparison. We have carried out a comprehensive analysis of three sets of coexpression networks formed by the gene modules of the HIV-1 infection stages. Comparing the individual modules in the three stages, it appears that the acute and chronic stages have a more similar pattern than acute-nonprogressor or nonprogressor-chronic. The proposed preservation measure, which combines the information of pairwise similarities between the two sets of networks, confirms this. This similarity reflects the actual medical conditions of people suffering from HIV-1 in real life. Hence it can be concluded that topological properties of biological networks reflect their functionality. In this respect, our method can be further utilized in any two networks
or two sets of networks.

Fig. 4. Pairwise similarity scores between the top five modules in each of the acute-chronic, acute-nonprogressor and nonprogressor-chronic stages. Panels: Sim_score vs acute_modules (points labelled by chronic_modules), Sim_score vs acute_modules (points labelled by nonprogressor_modules), Sim_score vs nonprogressor_modules (points labelled by chronic_modules).

Fig. 5. Three box-plots showing pairwise similarity scores between all modules in each of the acute-chronic, acute-nonprogressor and nonprogressor-chronic stages. Panels: Sim_scores with chronic modules vs acute_modules, Sim_scores with nonprogressor modules vs acute_modules, Sim_scores with chronic modules vs nonprogressor_modules.

ACKNOWLEDGEMENTS

This publication is an outcome of the R&D work undertaken in the project under the Visvesvaraya PhD Scheme of the Ministry of Electronics & Information Technology, Government of India, being implemented by Digital India Corporation.

REFERENCES

[1] Bandyopadhyay, S., Ray, S., Mukhopadhyay, A., Maulik, U.: A review of in silico approaches for analysis and prediction of HIV-1-human protein-protein interactions. Briefings in Bioinformatics 16(5), 830–851 (2015)
[2] Ray, S., Bandyopadhyay, S.: Discovering condition specific topological pattern changes in coexpression network: an application to HIV-1 progression. IEEE/ACM Trans Comput Biol Bioinform. 16(6), 1086–1099 (2016)
[3] Zeller, J., McCain, N., Swanson, B.: Immunological and virological markers of HIV-disease progression. J Assoc Nurses AIDS Care 7(15-27) (1996)
[4] Buchbinder, S., Katz, M., Hessol, N., O'Malley, P., Holmberg, S.: Long-term HIV-1 infection without immunologic progression. AIDS 8, 1123–1128 (1994)
[5] Cai, C., Langfelder, P., Fuller, T., Oldham, M., et al., R.L.: Is human blood a good surrogate for brain tissue in transcriptional studies? BMC Genomics 11(589), 1471–2164 (2010)
[6] Miller, J., Horvath, S., Geschwind, D.: Divergence of human and mouse brain transcriptome highlights Alzheimer disease pathways. Proc Natl Acad Sci U S A 107, 12698–12703 (2010)
[7] Oldham, M., Horvath, S., Geschwind, H.: Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proc Natl Acad Sci U S A 103, 17973–17978 (2006)
[8] Yoon, D., Kim, H., Suh-Kim, H., Park, R., Lee, K.: Differentially co-expressed interacting protein pairs discriminate samples under distinct stages of HIV-1 infection. BMC Systems Biology 5(Suppl 2), S1 (2011). DOI: 10.1186/1752-0509-5-S2-S1
[9] Ray, S., Biswas, S., Mukhopadhyay, A., Bandyopadhyay, S.: Detecting perturbation in co-expression modules associated with different stages of HIV-1 progression: A multi-objective evolutionary approach. In: Proceedings of the 2014 Fourth International Conference of Emerging Applications of Information Technology. EAIT '14, pp. 15–20. IEEE Computer Society, Washington, DC, USA (2014). doi:10.1109/EAIT.2014.34. http://dx.doi.org/10.1109/EAIT.2014.34
[10] Ray, S., Hossain, M., Khatun, L.: Discovering preservation pattern from co-expression modules in progression of HIV-1 disease: An eigengene based approach. In: Proceedings of the IEEE, 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 10–110920167732146. IEEE Computer Society, ??? (2016)
[11] Bhattacharjee, D., Hossain, S.M.M., Sultana, R., Ray, S.: Topological inquisition into the PPI networks associated with human diseases through graphlet frequency distribution. In: International Conference on Pattern Recognition and Machine Intelligence, pp. 431–437 (2017). Springer
[12] Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)
[13] Ribeiro, P., Silva, F.: G-tries: an efficient data structure for discovering network motifs. In: Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1559–1566 (2010). ACM
[14] Hyrcza, M.D., Kovacs, C., Loutfy, M., Halpenny, R., Heisler, L., Yang, S., Wilkins, O., Ostrowski, M., Der, S.D.: Distinct transcriptional profiles in ex vivo CD4+ and CD8+ T cells are established early in human immunodeficiency virus type 1 infection and are characterized by a chronic interferon response as well as extensive transcriptional changes in CD8+ T cells. Journal of Virology 81(7), 3477–3486 (2007)
[15] Mukhopadhyay, A., Ray, S., De, M.: Detecting protein complexes in a PPI network: a gene ontology based multi-objective evolutionary approach. Molecular BioSystems 8(11), 3036–3048 (2012)
[16] Erdős, P.: On random graphs. Publicationes Mathematicae 6, 290–297 (1959)
[17] Csardi, G., Nepusz, T.: The igraph software package for complex network research. InterJournal Complex Systems, 1695 (2006)
