about 4,000-6,000. When downloaded from GenBank, they have the accession numbers {NC_000964, NC_000913, NC_000962, NC_002516, NC_003923}, respectively.

2.2 Data Extraction
Data extraction is the process in which relevant biological data is extracted from the source database (GenBank), cleansed of redundant and unneeded data, and then loaded into a biological database. From GenBank's genomic data, which is stored in the plain-text FASTA file format shown in figure 1, we extract the gene sequences, their reference indexes, and other useful information into our local database.

Figure 1. Example of gene sequence data in FASTA format

2.3 Comparison of gene sequences
In this comparison step, we use a gene sequence comparison tool called BLAST, which is popular for searching a sequence database for sequences similar to a given query sequence [8]. The query sequences can be protein sequences or DNA sequences.

Figure 3. Common Gene Finding Process

Suppose that we have n bacterial species, each of which has m genes. We can estimate the number of BLAST comparisons to be

$$ N = \frac{n(n-1)}{2}\, m^2 \tag{1} $$

Where
N is the total number of jobs in the system.
n is the number of bacteria.
m is the number of genes in a bacterium.

For example, the five bacterial genomes used here are shown in table 1. If the average number of genes per bacterium is about 4,000, the total number of comparison jobs is $10 \times 4000^2$. If the number of bacterial genomes is 128 (the number that have been sequenced), the total number of comparison jobs is

$$ \frac{(128)(127)}{2} \times 4000^2 $$

In other words, more than $10^{11}$ pair-wise comparisons may be required. If each job takes 0.001 second of processing time, the overall processing time of the comparison is $0.001 \times 10^{11} = 10^8$ seconds, or 1,157 days, or about 3 years. If more than one computer can perform comparison tasks in parallel, the processing time can be reduced drastically.

2.4 Distributed Workload Management
The distributed workload management concept is to manage workload using cluster computing and grid computing technology [6],[7].
Our preliminary cluster computing system consists of 5 nodes. Each computing node is a dual Athlon MP1.8 (1533 MHz) with 512 MB of DDR RAM. To speed up data transfer, we exploit the high bandwidth of gigabit Ethernet technology. The system is shown in figure 4.
A job is submitted to the grid engine system through a host named frontend-0. When the grid engine receives the job, it assigns the submitted job the status ‘qw’, which means queue-wait. If resources of the cluster computing system are available, the grid engine allocates them to the queued jobs, and the status of those jobs changes to ‘r’, which means running. The running jobs then use the resources of cluster computing nodes such as compute-0-8, compute-0-9, and compute-0-10. When a job is finished, its results are stored in the Bio-Database node, named frontend-1.
There are three cluster computing nodes in the system. For optimal performance on a dual-processor system, each node is assigned to run two independent jobs. An available unit of queue space is called a slot; thus each cluster computing node in the system contains two slots. Six jobs can therefore be active at once, while the others remain queued. The total number of slots is
$$ S = \sum_{i=1}^{k} s_i \tag{2} $$

Where
S is the total number of slots in the cluster computing system.
s_i is the number of slots in the ith node.
k is the total number of cluster computing nodes.
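
As a minimal sketch, not the implementation used in this study, the estimates of equations (1) and (2) can be reproduced in a few lines of Python. The function names and the idealized wall-clock formula (jobs times seconds per job, divided by slots, ignoring scheduling and data-transfer overhead) are assumptions made for illustration only. The inputs are the figures quoted in the text: 5 or 128 genomes, about 4,000 genes each, 0.001 second per job, and three nodes with two slots each.

```python
def comparison_jobs(n_species, genes_per_species):
    """Equation (1): N = n(n-1)/2 * m^2 pair-wise comparison jobs."""
    genome_pairs = n_species * (n_species - 1) // 2
    return genome_pairs * genes_per_species ** 2


def total_slots(slots_per_node):
    """Equation (2): S = sum of s_i over the k cluster computing nodes."""
    return sum(slots_per_node)


def ideal_wall_clock_seconds(jobs, seconds_per_job, slots):
    """Hypothetical best case: every slot stays busy and overhead is zero."""
    if slots <= 0:
        raise ValueError("at least one slot is required")
    return jobs * seconds_per_job / slots


if __name__ == "__main__":
    slots = total_slots([2, 2, 2])  # three dual-processor nodes, two slots each
    for n in (5, 128):
        jobs = comparison_jobs(n, 4000)
        serial = ideal_wall_clock_seconds(jobs, 0.001, 1)
        parallel = ideal_wall_clock_seconds(jobs, 0.001, slots)
        print(f"n={n}: {jobs} jobs, {serial / 86400:.1f} days on one slot, "
              f"{parallel / 86400:.1f} days on {slots} slots (idealized)")
```

For 128 genomes the exact job count from equation (1) is 130,048,000,000, which the text rounds to $10^{11}$. The figure for six slots is an upper bound on the benefit, since a real cluster also spends time on queueing and data transfer.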

3. Experimental Results
The dataset used to test our system is defined by three workloads, shown in table 2.
The processing time of pair-wise comparison on a one-cpu computer is shown in table 3:

Table 3. Processing time of pair-wise comparison
Genome pair                 Processing time
NC_000962<->NC_003923         704
NC_000913<->NC_000964       1,115
NC_000913<->NC_002516       1,637
NC_000913<->NC_003923         798
NC_000964<->NC_002516       1,418
NC_000964<->NC_003923         863
NC_002516<->NC_003923         956

To compare this with the processing time of the three-computer (6-cpu) cluster, we define the speed-up as follows:

$$ T = \frac{t_{i1}}{t_{ic}}, \quad i = 1, 2, 3 \tag{3} $$

Where
T is the speed-up time from t_{i1} to t_{ic}.
t_{i1} is the processing time of the one-cpu computer at the ith workload.
t_{ic} is the processing time of the 6-cpu cluster at the ith workload.

Figure 6. Processing time of tc and t1 (x-axis: number of bacterial comparisons; ticks 3, 6, 10)

Figure 7. Speed-up time from 1-cpu to a cluster system

4. Conclusion and Future Work
From the preliminary results, grid computing technology can reduce the overall processing time of the bacterial common gene finding application. For future work, experiments will be conducted on more bacterial species, and more computing nodes will be used. We also plan to apply other data analysis and data mining techniques to the bacterial comparative genomics application.
5. Acknowledgement
This study was funded by the RD-C2 section of NECTEC. Special thanks to Shobhna S. of the RD-C4 section, who troubleshot many Sun grid problems.
6. References
[1] Vladimir Filkov, Steven Skiena, Jizu Zhi, “Analysis
Techniques for Microarray Time-Series Data,”
International Journal of Computational Biology,
Vol. 9, No. 2, 2002, pp 317-330
[2] Jiawei Han, Micheline Kamber, Data Mining: Concepts
and Techniques, Morgan Kaufmann, 2001
[3] Jiawei Han, How Can Data Mining Help Bio-Data
Analysis, Workshop on Data Mining in Bioinformatics
with SIGKDD02 Conference, 2002
[4] Jack Y. Yang, Okan K. Ersoy, Mary Qu Yang, Gene
Finding and Protein Function Determination Using Protein
Phylogenetic Profiles and Computational Intelligence,
Intelligent Engineering Systems Through Artificial Neural
Networks, Vol.12, 2002, pp 735-740
[5] Lev A Soinov, Maria A Krestyaninova, and Alvis
Brazma, Towards reconstruction of gene networks from
expression data by supervised learning, Genome Biology,
2003, http://genomebiology.com/2003/4/1/R6
[6] Globus Project, University of Chicago,
http://www.globus.org/
[7] Grid Engine, Sun Microsystems,
http://gridengine.sunsource.net/
[8] GenBank, National Center for Biotechnology
Information, http://www.ncbi.nlm.nih.gov/