
Bacterial Common Gene Finding Using Distributed Workload Management
High Performance Computing Research and Development Division,
National Electronics and Computer Technology Center,
112 Thailand Science Park,
Klong Luang, Pathumthani 12120

Abstract
Bioinformatics can be defined as the creation and development of advanced information and computational techniques for problems in biology. It consists of computing techniques used to manage and extract useful information from DNA, RNA, and protein sequence data. In many cases, the methods for analyzing genetic and protein data are extremely computationally intensive, which motivates the use of powerful computer systems such as supercomputers, cluster computing, and grid computing. We build a distributed workload management layer on top of cluster computing technology to partition the problem space and manage the comparison tasks. Our preliminary results show that distributed workload management works well in a bacterial common gene finding application.

Key-Words: Bioinformatics, Genome Comparison, Common Gene Finding

1. Introduction
Bio-data analysis is an important task in Bioinformatics. It is concerned with problem domains such as gene finding, the determination of protein function, the identification of co-occurring bio-sequences, and gene pathway analysis. For example, Vladimir Filkov, Steven Skiena and Jizu Zhi proposed an integrated analysis method for yeast gene expression data [1]. Jack Y. Yang, Okan K. Ersoy and Mary Qu Yang proposed a new two-stage approach to protein function determination and gene finding that uses both unsupervised and supervised learning techniques [4]. Lev A Soinov, Maria A Krestyaninova, and Alvis Brazma proposed a supervised learning approach to predict yeast gene expression data [5].
Recently, similarity search and comparison of gene sequences in bacterial genomic data have become important tasks in Bioinformatics [2],[3], because this research may lead to the discovery of new drug targets. Similarity search and comparison of bacterial genome data is often time-consuming because of the large number of both bacterial species and their genes. To reduce the processing time, high performance computing systems can be applied to the problem of comparing gene sequences of bacterial genomic data.
A high performance computing system can perform a specific task with much better performance than an ordinary computing system. The task can come from any large-scale field such as Bioinformatics, Computational Sciences, and Computational Fluid Dynamics. One popular way to build a high performance computing system is to cluster conventional personal computers together and then apply grid computing technology such as the Globus grid [6] or Sun Grid Engine [7] to manage resources in the cluster computing system.
However, most computing applications cannot be transferred directly from a traditional single-processor system to a cluster computing system. Several considerations need to be addressed, especially how to partition the problem space and how to manage a huge number of distributed computing tasks. We implement a distributed workload management technique to alleviate these problems in our bacterial common gene finding research; the details follow in the next sections.

2. Methodology
The methodology for the gene sequence comparison task can be divided into four parts: data selection, data extraction, comparison of gene sequences, and
distributed workload management.

2.1 Data Selection
Genomic data is publicly available from GenBank [8] at the National Center for Biotechnology Information (NCBI). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank.
The goal of our application is to find common genes using all available bacterial genomic sequences from GenBank. Initially, as a proof of concept, we use data from 5 bacteria:
1) Bacillus subtilis
2) Escherichia coli K12
3) Mycobacterium tuberculosis H37Rv
4) Pseudomonas aeruginosa PAO1
5) Staphylococcus aureus subsp. aureus MW2

Each bacterial species has about 4,000-6,000 genes. In GenBank, their accession numbers are {NC_000964, NC_000913, NC_000962, NC_002516, NC_003923}, respectively.

2.2 Data Extraction
Data extraction is the process in which relevant biological data is extracted from the source database (GenBank), cleansed of redundant and unneeded data, and then loaded into a biological database.
GenBank's genomic data is stored in the plain-text FASTA file format shown in Figure 1. From it we extract gene sequences, their reference indexes, and other useful information into our local database.

>gi|16077069|ref|NP_387882.1| alternate gene name: dnaH, dn
[Bacillus subtilis]

Figure 1. Example of gene sequence data in FASTA format

2.3 Comparison of gene sequences
In this comparison step, we use a gene sequence comparison tool called BLAST, which is widely used to search a sequence database for sequences similar to a given query sequence [8]. The query sequences can be protein sequences or DNA sequences.

Query= gi|16077069|ref|NP_387882.1| alternate gene nam
dnaK [Bacillus subtilis]
(446 letters)

Database: NC_000913.faa
4279 sequences; 1,361,003 total letters

                                                   Score    E
Sequences producing significant alignments:        (bits)   Value
ref|NP_418157.1| DNA biosynthesis; init...          377     e-105
ref|NP_416991.1| putative DNA replication...         71     1e-013
ref|NP_415878.1| putative DNA replication...         30     0.25

Figure 2. Some BLAST results

An example of pair-wise results is shown in Figure 2. All results are stored in the pair-wise gene similarity database for further mining, as shown in Figure 3.

Figure 3. Common Gene Finding Process

Suppose that we have n bacterial species, each with m genes. We can estimate the number of BLAST comparisons to be

$N = \frac{n(n-1)}{2}\, m^{2}$    (1)

where
N is the total number of jobs in the system,
n is the number of bacteria,
m is the number of genes in each bacterium.

For example, gene counts for the bacterial genomic data are shown in Table 1.

Table 1. Number of Genes
Accession Number    Number of Genes
NC_000962           3927
NC_002516           5567

If each bacterium has about 4,000 genes on average, the total number of comparison jobs for the five bacteria is $10 \times 4000^{2}$. If the number of bacterial genomes is 128 (the number sequenced), the total number of comparison jobs is

$N = \frac{(128)(127)}{2} \times 4000^{2}$

In other words, more than $10^{11}$ pair-wise comparisons are needed. If each job takes 0.001 second of processing time, the overall processing
time of comparison is $0.001 \times 10^{11} = 10^{8}$ seconds, which is 1,157 days or about 3 years. If more than one computer can perform comparison tasks in parallel, the processing time can be reduced drastically.

2.4 Distributed Workload Management
The distributed workload management concept is to manage the workload using cluster computing and grid computing technology [6],[7].
Our preliminary cluster computing system consists of 5 nodes. Each computing node is a dual Athlon MP1.8 (1533 MHz) with 512 MB of DDR RAM. To speed up data transfer, the nodes are connected by gigabit Ethernet. The system is shown in Figure 4.

Figure 4. Cluster Computing System

To manage the computing workload, we use a grid engine to manage distributed resources. The grid computing system is a collection of computing resources such as distributed CPU, disk I/O, and memory resources. It also provides queuing control for a large number of submitted jobs, as shown in Figure 5.
A job is submitted to the grid engine system through a host named frontend-0. When the grid engine receives the job, it assigns the job the status ‘qw’, which means queue-wait. If resources of the cluster computing system are available, the grid engine allocates them to the queued jobs, and the status of those jobs changes to ‘r’, which means running. The running jobs then use the resources of cluster computing nodes such as compute-0-8, compute-0-9, and compute-0-10. When a job is finished, its results are stored on the Bio-Database node, named frontend-1.
There are three compute nodes in the system. For optimal performance of a dual-processor system, each node is assigned two independent jobs. An available place in the queue is called a slot, so each compute node in the system contains two slots. Six jobs can therefore run at the same time, while the others remain queued.

$S = \sum_{i=1}^{k} s_i$    (2)

where
S is the total number of slots in the cluster computing system,
s_i is the number of slots in the i-th node,
k is the total number of cluster computing nodes.

We built a script that monitors the slots and submits a new job to the grid engine system whenever a slot becomes available. To monitor the slots, we keep a table that records the status of each submitted job.

STATUS = {‘Q’, ‘P’, ‘C’}
where
Q denotes the Queue process type,
P denotes the Progress process type,
C denotes the Complete process type.
When the status of a submitted job on the grid engine system is ‘qw’, its STATUS is updated to ‘Q’. If its status on the grid engine system is ‘r’, its STATUS is updated to ‘P’. If the job has completed on the grid engine system, its STATUS is updated to ‘C’.
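
The bookkeeping behind equations (1) and (2) and the STATUS table can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not the script used in this work: it makes no Grid Engine calls, every running job is assumed to finish within one monitoring pass, the wall-clock figure assumes perfectly even scaling, and the names comparison_jobs, total_slots, ideal_seconds, run_monitor and the label "done" (standing in for a completed job) are illustrative.

```python
from collections import deque

# Grid engine job state -> STATUS table entry. 'qw' and 'r' come from the text;
# "done" is a hypothetical label for a job the grid engine reports as completed.
GRID_TO_STATUS = {"qw": "Q", "r": "P", "done": "C"}


def comparison_jobs(n_species, genes_per_species):
    """Equation (1): N = n(n-1)/2 * m^2 pair-wise comparison jobs."""
    return n_species * (n_species - 1) // 2 * genes_per_species ** 2


def total_slots(slots_per_node):
    """Equation (2): S is the sum of the slots s_i over all compute nodes."""
    return sum(slots_per_node)


def ideal_seconds(n_jobs, seconds_per_job, slots):
    """Lower bound on wall-clock time, assuming perfectly even scaling."""
    return n_jobs * seconds_per_job / slots


def run_monitor(job_ids, slots):
    """Simulate the monitor: fill free slots from the queue until all jobs finish.

    Returns the final STATUS table and the largest number of jobs that were
    running at the same time.
    """
    table = {job: GRID_TO_STATUS["qw"] for job in job_ids}  # all jobs start queued
    waiting = deque(job_ids)
    running = []
    peak = 0
    while waiting or running:
        # Jobs that were running during the previous pass are treated as finished.
        for job in running:
            table[job] = GRID_TO_STATUS["done"]
        running = []
        # Submit queued jobs until every slot is occupied.
        while waiting and len(running) < slots:
            job = waiting.popleft()
            table[job] = GRID_TO_STATUS["r"]
            running.append(job)
        peak = max(peak, len(running))
    return table, peak


if __name__ == "__main__":
    slots = total_slots([2, 2, 2])       # three dual-processor compute nodes
    jobs = comparison_jobs(5, 4000)      # five bacteria, about 4,000 genes each
    print("slots:", slots, "jobs:", jobs)
    print("ideal seconds at 0.001 s/job:", ideal_seconds(jobs, 0.001, slots))
    table, peak = run_monitor([f"job-{i}" for i in range(10)], slots)
    print("peak running:", peak, "final:", sorted(set(table.values())))
```

The slot count caps how many jobs are in the ‘P’ state at once, which is why the monitor only submits work when a slot is free instead of handing the whole queue to the grid engine at the start.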

Figure 5. Grid Engine System

3. Experimental Results
The dataset used to test our system consists of three workloads, shown in Table 2:
Table 2. Three workloads
Workload      Bacterial species                      Bacterial comparisons
Workload 1    3 bacteria                             NC_000962<->NC_000913
              {NC_000962, NC_000913,                 NC_000962<->NC_000964
              NC_000964}                             NC_000913<->NC_000964
Workload 2    4 bacteria                             NC_000962<->NC_000913
              {NC_000962, NC_000913,                 NC_000962<->NC_000964
              NC_000964, NC_002516}                  NC_000962<->NC_002516
                                                     NC_000913<->NC_000964
                                                     NC_000913<->NC_002516
                                                     NC_000964<->NC_002516
Workload 3    5 bacteria                             NC_000962<->NC_000913
              {NC_000962, NC_000913,                 NC_000962<->NC_000964
              NC_000964, NC_002516,                  NC_000962<->NC_002516
              NC_003923}                             NC_000962<->NC_003923
                                                     NC_000913<->NC_000964
                                                     NC_000913<->NC_002516
                                                     NC_000913<->NC_003923
                                                     NC_000964<->NC_002516
                                                     NC_000964<->NC_003923
                                                     NC_002516<->NC_003923

The processing time of each pair-wise comparison on a one-cpu computer is shown in Table 3:

Table 3. Processing time of pair-wise comparison
Bacterial comparison       Processing time (s)
NC_000962<->NC_000913      1,240
NC_000962<->NC_000964      1,064
NC_000962<->NC_002516      2,021
NC_000962<->NC_003923      704
NC_000913<->NC_000964      1,115
NC_000913<->NC_002516      1,637
NC_000913<->NC_003923      798
NC_000964<->NC_002516      1,418
NC_000964<->NC_003923      863
NC_002516<->NC_003923      956

To compare the processing time of sequence similarity searches on a one-cpu computer with the processing time on the three-computer, 6-cpu cluster, we define the speed-up as follows:

$T_i = \frac{t_{i1}}{t_{ic}}, \quad i = 1, 2, 3$    (3)

where
T_i is the speed-up from t_{i1} to t_{ic} for the i-th workload,
t_{i1} is the processing time on the one-cpu computer for the i-th workload,
t_{ic} is the processing time on the 6-cpu cluster for the i-th workload.

The processing time of the three workloads on the 6-cpu system is shown in Table 4:

Table 4. Processing time by three workloads
Workload      t1 (s)     tc (s)     T
Workload 1    3,415      1,180      2.8941
Workload 2    8,491      1,955      4.3432
Workload 3    11,970     2,478      4.8305

The results in Table 4 are plotted in Figure 6 and Figure 7.

Figure 6. Processing time of tc and t1 (x-axis: number of bacterial comparisons, 3, 6, 10; y-axis: processing time in seconds; series: one-cpu and cluster system)

Figure 7. Speed-up from 1-cpu to a cluster system (x-axis: number of bacterial comparisons, 3, 6, 10; y-axis: speed-up)

4. Conclusion and Future Work
From the preliminary results, grid computing technology can reduce the overall processing time of the bacterial common gene finding application. For future work, experiments will be conducted on more bacterial species, and more computing nodes will be used. We also plan to apply other data analysis and data mining techniques to the bacterial comparative genomics application.
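
The size of this reduction can be checked directly from the Table 4 timings. The sketch below recomputes the speed-up of equation (3) and, as an addition that is not part of the original analysis, the parallel efficiency (speed-up divided by the number of CPUs). The names WORKLOADS, speed_up and efficiency are illustrative.

```python
# (t1, tc) in seconds for each workload, copied from Table 4.
WORKLOADS = {
    "Workload 1": (3415, 1180),
    "Workload 2": (8491, 1955),
    "Workload 3": (11970, 2478),
}
CPUS = 6  # three dual-processor compute nodes


def speed_up(t1, tc):
    """Equation (3): one-cpu time divided by cluster time (a unitless ratio)."""
    return t1 / tc


def efficiency(t1, tc, cpus=CPUS):
    """Fraction of the ideal speed-up achieved: 1.0 would mean perfect scaling."""
    return speed_up(t1, tc) / cpus


if __name__ == "__main__":
    for name, (t1, tc) in WORKLOADS.items():
        print(f"{name}: T = {speed_up(t1, tc):.4f}, "
              f"efficiency = {efficiency(t1, tc):.3f}")
```

For these timings the efficiency rises from roughly 0.48 for the smallest workload to roughly 0.81 for the largest, which suggests that the six slots are kept busier as the number of comparisons grows.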
5. Acknowledgement
This study has been funded by the RD-C2 section of
NECTEC. Special thanks to Shobhna S. of the RD-C4
section, who troubleshot many Sun grid

6. References
[1] Vladimir Filkov, Steven Skiena, Jizu Zhi, “Analysis
Techniques for Microarray Time-Series Data,”
International Journal of Computational Biology,
Vol. 9, No. 2, 2002, pp. 317-330
[2] Jiawei Han, Micheline Kamber, Data Mining: Concepts
and Techniques, Morgan Kaufmann, 2001
[3] Jiawei Han, How Can Data Mining Help Bio-Data
Analysis, Workshop on Data Mining in Bioinformatics
with SIGKDD02 Conference, 2002
[4] Jack Y. Yang, Okan K. Ersoy, Mary Qu Yang, Gene
Finding and Protein Function Determination Using Protein
Phylogenetic Profiles and Computational Intelligence,
Intelligent Engineering Systems Through Artificial Neural
Networks, Vol.12, 2002, pp 735-740
[5] Lev A Soinov, Maria A Krestyaninova, and Alvis
Brazma, Towards reconstruction of gene networks from
expression data by supervised learning, Genome Biology,
[6] Globus Project, University of Chicago,
[7] Grid Engine, Sun Microsystems,
[8] GenBank, National Center for Biotechnology