
International Journal of Advanced Computer Science, Vol. 1, No. 6, Pp. 233-239, Dec. 2011.

A Hybrid Distributed and Shared Memory Method for Fast HNGH Algorithm
Muhannad A. Abu-Hashem, NurAini Abdul Rashid, & Rosni Abdullah
Abstract Faster sequence comparison and alignment have become crucial due to the rapid growth of molecular databases. In this paper we present a parallel design for a dynamic programming algorithm called Hashing-N-Gram-Hirschberg (HNGH), an extension of the N-Gram-Hirschberg (NGH) algorithm, in order to speed up the construction of sequence alignments. We also examine the HNGH algorithm with longer word lengths, both to speed up the alignment and to study the effect on alignment accuracy. The parallel design is divided into two levels and applied to two different architectures. The first level uses a multi-processor architecture for database decomposition, while the second level uses a multi-core architecture for similarity matrix decomposition. The word lengths in the extended parameter examination range from 3 to 6 letters. In general, the HNGH algorithm outperforms the earlier NGH algorithm in running time. The parallel algorithm improves the execution time, but the speedup is rather low because of the high communication among processors and the high dependency among tasks.

Manuscript
Received: 14, Sep., 2011 Revised: 29, Oct., 2011 Accepted: 15, Dec., 2011 Published: 15, Jan., 2012

2. Related Works
The importance of sequence alignment and comparison has drawn many researchers to contribute to this field, and many approaches have been developed to solve the sequence alignment problem. This section presents the related work, classified into two groups.
A. Heuristic Approach
A heuristic approach searches within the space of possible solutions but does not guarantee finding the optimal one. Although it offers no guarantee of optimality, it finds solutions quickly. Since protein sequence databases are growing very fast, the need for faster pair-wise alignment methods arises. Researchers have applied this approach to the pair-wise alignment problem in order to reach a solution close to the optimal one in less run time. FASTA [1] is one of the methods that apply this approach to the pair-wise alignment problem. It divides the sequences into patterns and then looks for matches between the patterns. Each pattern, called a k-tuple, is used to count the matches between the two sequences compared [1]-[3]. The BLAST method was proposed later; it is similar to FASTA and uses the alignment of segment pairs [4].
B. Dynamic Programming Approach
Dynamic programming is a strategy for building a solution from the solutions of sub-problems, which are smaller instances of the same problem. In other words, the solution is built up using a bottom-up technique [5]-[6]. The Hirschberg algorithm [7] is one of the earliest algorithms that solve pair-wise alignment using the dynamic programming approach. It reduces the space complexity compared with the two well-known algorithms Smith-Waterman [8] and Needleman-Wunsch [9] without sacrificing execution time.
The Smith-Waterman algorithm is the most commonly used algorithm in the field of pair-wise sequence alignment [8], [10]-[13]. The Hirschberg algorithm helps to reduce the space complexity of protein sequence pair-wise alignment. Later, the N-Gram-Hirschberg (NGH) algorithm [14] was proposed as a further enhancement of the Hirschberg algorithm. This

Keywords
Sequence Alignment; Protein sequence Comparison; Protein sequence Similarity; Parallel computing; Dynamic Programming.

1. Introduction
The rapid growth of molecular databases creates the need for efficient and fast sequence comparison algorithms to manage the huge volume of data. Sequence alignment is the most basic operation in protein sequence comparison, as it rates the similarity between two or more sequences by aligning the primary sequences of the proteins. Two protein sequences are aligned by comparing them to find the series of characters that match between them. Biologists who discover new protein sequences need to compare them with the sequences in the databases to determine whether they are really new. Dynamic programming and many heuristic methods are examples of the approaches used for protein sequence alignment and comparison. The next section reviews the related work by presenting the important algorithms that have been proposed to solve the problem.


enhancement improves both time and space complexity: it speeds up the alignment and at the same time reduces the space required for it [14]. In the next section, we explain our proposed method.
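As a point of reference for the dynamic programming approach discussed in this section, the following is a minimal sketch of a Smith-Waterman local-alignment score built bottom-up. The scoring values (match 2, mismatch -1, gap -1) and the function name are illustrative assumptions, not values taken from the cited works, and the sketch returns only the score, not the alignment.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score of a and b (scoring values are assumed)."""
    best = 0
    prev = [0] * (len(b) + 1)          # row i-1 of the similarity matrix
    for i in range(1, len(a) + 1):
        curr = [0]                     # first column is always 0
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Each cell depends only on its upper, left and diagonal neighbours.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Keeping only two rows in memory is enough for the score; recovering the alignment itself in linear space is what the Hirschberg algorithm addresses.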

array is the data structure used when comparing the query sequence against each protein sequence in the database. Algorithm 2 shows the parallel pseudo code of HNGH.

Define protein sequences database DX;
Define the Query protein Q;
Begin
  Transform Q // using hashing-N-Gram method;
  For i = 0 to last sequence in DX
    Transform DXi // using hashing-N-Gram method;
  For i = 0 to last sequence in DX
  Begin
    Fill the similarity matrix of Q and DXi;
    Back track for the optimal alignment;
    Calculate the Similarity value;
  End for
End
Algorithm 1. HNGH algorithm pseudo code
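The "Transform" step of Algorithm 1 can be illustrated with a small sketch. The paper does not specify the hash function or whether the words overlap, so both choices here are assumptions: the sequence is cut into consecutive, non-overlapping words of length N, and each word is mapped to an integer with a base-27 positional code over uppercase letters. The names `to_words`, `hash_word` and `transform` are hypothetical.

```python
def to_words(seq, n):
    """Split seq into consecutive words of length n (assumed non-overlapping).
    The last word may be shorter than n."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

def hash_word(word):
    """Map an uppercase word to an integer (assumed base-27 positional code).
    Letters are numbered from 1, so words of different lengths never collide."""
    value = 0
    for ch in word:
        value = value * 27 + (ord(ch) - ord("A") + 1)
    return value

def transform(seq, n):
    """Hashing-N-Gram style transform: a sequence becomes a list of integers."""
    return [hash_word(w) for w in to_words(seq, n)]

# Example: the first six residues of the sequence in Fig. 1 with N = 3.
print(transform("MKFLIL", 3))  # [9780, 9003]
```

Comparing two integers replaces comparing two words letter by letter, which is the source of the speedup described for HNGH; larger N yields larger hashed numbers.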

3. Methods and Materials


A. Input Data
The data used to test the algorithm are protein sequences from the Swiss-Prot database, which was created in 1986. The protein sequences are in FASTA format. Fig. 1 below shows a sample protein sequence in FASTA format, and Fig. 2 shows the growth rate of the Swiss-Prot database.
>104K_THEPA (P15711) 104 kDa microneme-rhoptry antigen precursor (p104)
MKFLILLFNILCLFPVLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEMAGVKYL
QVQHGSNVNIHRLVEGNVVIWENASTPLYTGAIVTNNDGPYMAYVEVLGDPNLQFFIKSG
DAWVTLSEHEYLAKLQEIRQAVHIESVFSLNMAFQLENNKYEVETHAKNGANMVTFIPRN

Fig. 1 Protein sequence FASTA format

Fig. 2 Swiss-Prot database growths
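To make the FASTA layout of Fig. 1 concrete, here is a minimal sketch of a reader that turns FASTA text into (header, sequence) pairs. It is an illustration only, not the authors' input routine; the function name `parse_fasta` is hypothetical, and the sketch assumes each record starts with a line beginning with `>` followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            if header is not None:         # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)            # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Joining the wrapped sequence lines gives one string per protein, which is the form the N-Gram transform operates on.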

B. Hashing-N-Gram-Hirschberg (HNGH)
HNGH [15] is a further enhancement of the N-Gram-Hirschberg algorithm that embeds a hash function into N-Gram-Hirschberg. The hash function converts the words generated by the N-Gram method into numbers, which speeds up the comparison process as well as the alignment. HNGH has previously been tested with a word length of N = 2. Here we test the algorithm with word lengths ranging from 3 to 6. The results are compared with those of the original NGH algorithm. Algorithm 1 shows the pseudo code of HNGH.
C. Parallel Algorithm
To parallelize the HNGH algorithm, the problem is decomposed at two levels. The first level is the database decomposition, where the database is distributed to many processors. The second level is the similarity matrix decomposition, where the matrix is distributed to many threads. We use MPI and PThreads to implement our algorithm. Fig. 3 shows the general scheme for parallelizing the algorithm. The 2D

Define protein sequences database DX;
Define the Query protein Q;
Begin
  Transform Q // using hashing-N-Gram method;
  Parallel level 1: using Multi-Processor architecture to decompose the database
  For i = 0 to last sequence in DX
    Transform DXi // using hashing-N-Gram method;
  For i = 0 to last sequence in DX
  Begin
    Parallel level 2: using Multi-Core architecture to decompose the similarity matrix
    Fill the similarity matrix of Q and DXi;
    Back track for the optimal alignment;
    Calculate the Similarity value;
  End for
End
Algorithm 2. Parallel HNGH algorithm pseudo code

1) Architecture: Our parallel architecture is based on multiple processors, each of which has two or four cores. Fig. 4 shows the parallel architecture. We use the MIMD (Multiple Instruction, Multiple Data) model as the parallel architecture. 2) Decomposition Methods: To parallelize the algorithm, data decomposition is applied at two levels. The first level is the decomposition of the protein sequence database, which is spread over many processors. The second level is the decomposition of the similarity matrix, which is spread over different threads.
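The first decomposition level can be sketched as a static split of the database into contiguous blocks, one per processor. This is a minimal illustration under the assumption that blocks are balanced by sequence count; the paper does not state the exact partitioning rule, and the function name `partition` is hypothetical. In an MPI program each block would be handed to one rank.

```python
def partition(sequences, num_blocks):
    """Split a list of sequences into num_blocks contiguous blocks whose
    sizes differ by at most one (assumed static, count-based balancing)."""
    base, extra = divmod(len(sequences), num_blocks)
    blocks, start = [], 0
    for rank in range(num_blocks):
        size = base + (1 if rank < extra else 0)   # first `extra` blocks get one more
        blocks.append(sequences[start:start + size])
        start += size
    return blocks

# Example: ten database entries over four processors.
print(partition(list(range(10)), 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Balancing by count ignores differences in sequence length, which is one way a static scheme can leave some processors with more work than others.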
International Journal Publishers Group (IJPG)

Muhannad A. Abu-Hashem et al.: A Hybrid Distributed and Shared Memory Method for Fast HNGH Algorithm.


Fig. 3 General scheme for parallelizing HNGH (Level 1: database decomposition; Level 2: similarity matrix decomposition)

Data Decomposition at Database Level: For the database decomposition level, we partition the database into smaller blocks of data, which are then distributed to different processors. Several issues must be considered when choosing the number and size of the blocks. One of them is load balancing: increasing the number of blocks and decreasing the size of each block may affect the load balance. However, since all our processors are of the same type, the issue did not arise here. Fig. 5 shows the database partitioning process. Data Decomposition at Similarity Matrix Level: The similarity matrix is partitioned by dividing it into two blocks that are independent of each other. This step is a big challenge because of the data dependency in the similarity matrix, where the value of each cell depends on the values in the previous rows and columns. Fig. 6 shows how the similarity matrix is divided into two matrices and how the calculation is done.
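One way to obtain two independent blocks in a Hirschberg-style computation is to score the first half of the query forwards and the second half backwards, then combine the two result rows. The sketch below illustrates that idea with two threads; it is an assumption about how the split can work, not the authors' PThreads code. The scoring values and the names `last_row` and `split_score` are hypothetical, and in CPython the threads only demonstrate the independence of the two blocks, since the global interpreter lock prevents a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

MATCH, MISMATCH, GAP = 1, -1, -1   # assumed scoring scheme, not from the paper

def last_row(a, b):
    """Last row of the global-alignment score matrix of a against b,
    computed with only two rows in memory."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            curr.append(max(diag, prev[j] + GAP, curr[j - 1] + GAP))
        prev = curr
    return prev

def split_score(a, b):
    """Score the two halves of a in separate threads and combine them.
    Returns (best total score, split position in b)."""
    mid = len(a) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        fwd = pool.submit(last_row, a[:mid], b)               # top block
        bwd = pool.submit(last_row, a[mid:][::-1], b[::-1])   # bottom block, reversed
        top, bottom = fwd.result(), bwd.result()
    n = len(b)
    # top[j] scores a[:mid] vs b[:j]; bottom[n - j] scores a[mid:] vs b[j:].
    totals = [top[j] + bottom[n - j] for j in range(n + 1)]
    best_j = max(range(n + 1), key=totals.__getitem__)
    return totals[best_j], best_j
```

The combined maximum equals the score of the full matrix, so neither block needs values from the other until the final merge. The inputs can be plain strings or lists of hashed words.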

Fig. 4 Parallel architecture [14]

4. Results Analysis
This section is divided into two parts. The first part shows the results for the HNGH algorithm, obtained from experiments with different word lengths (N) in the N-Gram method. The word length ranges from 3 to 6, and the data set is the same as that used in [15]. The second part of this section covers the parallel algorithm


where the speedup, efficiency, and performance gain are shown and discussed.
A. Sequential Results Analysis
Time Evaluation: Our new algorithm HNGH performs better than the NGH algorithm in most cases. The best time overall is obtained when the word length is two letters (N = 2) [15], but in general the HNGH algorithm improves on the running time of the NGH algorithm. On some occasions HNGH performs worse than NGH because of the nature of the data set and the size of the hashed numbers (the numbers generated by the hash function): these large numbers become an overhead for the algorithm, which slows it down. As shown in Fig. 7, the running time of the HNGH algorithm improves as the length of the terms (words) increases. Within the range tested here, we obtained the best time at word length N = 6. However, increasing the term length N reduces the sensitivity of the alignment; that is, by increasing the word length we sacrifice sensitivity.

Fig. 7 Running Time Results for HNGH and NGH on 3, 4, 5 and 6 Grams

Fig. 5 Database partitioning

Results Evaluation: Comparing the results of the HNGH algorithm with those of the NGH algorithm shows a high match between the two. Since NGH performs the same as the Smith-Waterman algorithm, our algorithm performs almost the same as the Smith-Waterman algorithm. Table 1 shows the percentage of matches between the results of the HNGH and NGH algorithms.
Fig. 6 Calculation of similarity matrix for Hirschberg algorithm [14]

TABLE 1 RESULT MATCHING PERCENTAGE BETWEEN HNGH AND NGH

N-Gram    Match Percentage
3         91%
4         97%
5         97%
6         99%


Fig. 8 Parallel Hashing-N-Gram-Hirschberg Speedup on 4 Processors

Fig. 9 Parallel Hashing-N-Gram-Hirschberg Speedup on 8 Processors

B. Parallel Result Analysis
The parallel algorithm is evaluated by calculating the speedup, efficiency and performance gain from its results. The parallel method consists of two levels. The first level is the database decomposition, in which the database is distributed to many processors using MPI. The second level is the similarity matrix decomposition, in which the matrix is divided among many threads using PThreads. Speedup: We obtained a low speedup because of the communication overhead of the two levels of parallelization and the static load balancing between the processors. With static balancing, one processor sometimes takes more time than the others to return its results. Since we work with large data, the communication overhead of redistributing it would be high, so we avoid load balancing between the processors at run time. In general, the speedup is higher when the word length (N) is shorter and the sequence is longer. This means that if the number of

words inside the sequence increases, the speedup also increases. Fig. 8, Fig. 9 and Fig. 10 show the speedup for sequence lengths from 100 to 1000 in increments of 100 and N-Gram values from 2 to 6 in increments of one, on 4, 8 and 10 processors. As shown in Table IV, the speedup increases with the number of processors, and it also increases with the length of the sequence because of the similarity matrix decomposition level. Efficiency: In general, we achieved a low efficiency for the parallel Hashing-N-Gram-Hirschberg algorithm. This low efficiency follows from the low speedup. In some cases the efficiency is lowered further by load imbalance. In general, the efficiency is higher when the word length (N) is shorter and the sequence is longer. This means that if the number of words inside the sequence increases, the efficiency also increases. Fig. 11 shows the efficiency for sequence lengths from 100 to 1000 in increments of 100 and N-Gram values from 2 to 6 in increments of one, with various


Fig. 10 Parallel Hashing-N-Gram-Hirschberg Speedup on 10 Processors

Fig. 11 Parallel Hashing-N-Gram-Hirschberg Efficiency on 4, 8 and 10 Processors

numbers of processors (4, 8, and 10). The efficiencies for 4, 8 and 10 processors are generally very close, but the best efficiency was obtained with 4 processors. Performance Gain: The performance gain obtained by executing the parallel algorithm on 4 processors is low compared with that on 8 and 10 processors, because fewer resources are used. Using more resources to solve the problem gives better performance for the algorithm. As shown in Fig. 12, the performance gain increases with the number of processors.
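The three measures used in this analysis can be written down compactly. Speedup and efficiency follow their standard definitions (serial time over parallel time, and speedup divided by the number of processors). The paper does not define performance gain, so the sketch assumes the common form of the fraction of serial time saved; the function name and the timings in the example are made up for illustration.

```python
def parallel_metrics(t_serial, t_parallel, processors):
    """Return (speedup, efficiency, performance gain) for one measurement.
    The performance-gain formula is an assumption, not taken from the paper."""
    speedup = t_serial / t_parallel
    efficiency = speedup / processors              # 1.0 would be ideal
    gain = (t_serial - t_parallel) / t_serial      # assumed: fraction of time saved
    return speedup, efficiency, gain

# Hypothetical timings: 100 s serial, 40 s on 4 processors.
print(parallel_metrics(100.0, 40.0, 4))
```

Because efficiency divides the speedup by the processor count, a speedup that grows slowly with more processors can give the best efficiency at 4 processors while the performance gain is still highest at 10.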


Fig. 12 Parallel Hashing-N-Gram-Hirschberg Performance Gain on 4, 8 and 10 Processors

5. Conclusion
This paper presents a parallel design for a dynamic programming pair-wise sequence alignment method named HNGH, along with extended test results. The earlier algorithm (HNGH) has also been tested on wider datasets in order to examine its robustness and speed improvements. The examination shows that HNGH outperforms the NGH algorithm in terms of running time, but accuracy was sacrificed. On the other hand, the parallel design of HNGH adds an acceptable speedup and efficiency to the new algorithm.

Acknowledgment
This research is supported by Universiti Sains Malaysia and has been funded by the APEX Incentive Grant.

References
[1] D.J. Lipman & W.R. Pearson, Rapid and Sensitive Protein Similarity Searches, (1985) Science, 227, pp. 1435-1441.
[2] W.J. Wilbur & D.J. Lipman, Rapid Similarity Searches in Nucleic Acid and Protein Databanks, (1983) Proc. Natl. Acad. Sci. USA, 80, pp. 726-730.
[3] W.R. Pearson & D.J. Lipman, Improved Tools for Biological Sequence Comparisons, (1988) Proc. Natl. Acad. Sci. USA, 85, pp. 2444-2448.
[4] S.F. Altschul, W. Gish, W. Miller, E.W. Myers & D.J. Lipman, Basic Local Alignment Search Tool, (1990) J. Mol. Biol., 215, pp. 403-410.
[5] W. Pearson, Comparison of Methods for Searching Protein Sequence Databases, (1995) Protein Science, vol. 4, no. 6, pp. 1145-1160.


[6] K.A. Berman & J.L. Paul, Algorithms: Sequential, Parallel, and Distributed, (2005) University of Cincinnati: Thomson.
[7] D. Hirschberg, A Linear Space Algorithm for Computing Maximal Common Subsequences, (1975) Communications of the ACM, 18, pp. 341-343.
[8] T.F. Smith & M.S. Waterman, Identification of common molecular subsequences, (1981) Journal of Molecular Biology, vol. 147, pp. 195-197.
[9] S.B. Needleman & C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, (1970) Journal of Molecular Biology, vol. 48, pp. 443-453.
[10] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller & D.J. Lipman, Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs, (1997) Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402.
[11] F. Ahmed, Pruning algorithm to reduce the search space of the Smith-Waterman algorithm & Kernel extensions to the µC/OS-II Real-Time Operating System, (2005) Department of Electrical and Computer Engineering, Lafayette College, Easton, PA.
[12] A. Boukerche, A.C.M.A. de Melo, M. Ayala-Rincón & M.E.M. Telles Walter, Parallel strategies for the local biological sequence alignment in a cluster of workstations, (2006) School of Information Technology and Engineering, University of Ottawa, Canada.
[13] Y. Liu, W. Huang, J. Johnson & S. Vaidya, GPU Accelerated Smith-Waterman, (2006) Lawrence Livermore National Laboratory, DOE Joint Genome Institute, UCRL-CONF-218814, {liu24, whuang, jjohnson, vaidya1}@llnl.gov.
[14] N.A.B. AbdulRashid, Enhancement of Algorithm Using N-Gram and Parallel Methods for Fast Protein Homologous Search, (2008) PhD thesis, School of Computer Sciences, Universiti Sains Malaysia.
[15] M.A. Abu-Hashem & N.A.A. Rashid, Enhancing N-Gram-Hirschberg Algorithm by Using Hash Function, (2009) Asia International Conference on Modelling & Simulation, Indonesia, 3, pp. 282-286.

Muhannad A. Abu-Hashem received the BSc degree in Computer Information System (CIS) from Philadelphia University, Amman, Jordan in 2003, and the MSc in computer science from Universiti Sains Malaysia (USM) in 2008. Currently he is a PhD candidate under the supervision of Associate Professor Dr. Nur'Aini Abdul Rashid and Prof. Dr. Rosni Abdullah at Universiti Sains Malaysia.

Nur'Aini Abdul Rashid received a BSc from Mississippi State University, USA, and her MSc and PhD from Universiti Sains Malaysia, Malaysia, all in computer science. Her PhD research involved analysing and managing protein sequence data. Currently, she is a senior lecturer at the School of Computer Sciences at Universiti Sains Malaysia. She was promoted to Associate Professor in 2009. Nur'Aini's research interests include parallel algorithms, information retrieval methods and clustering algorithms.

Rosni Abdullah received her Bachelor's Degree in Computer Science and Applied Mathematics and Master's Degree in Computer Science from Western Michigan University, Kalamazoo, Michigan, U.S.A. in 1984 and 1986 respectively. She joined the School of Computer Sciences at Universiti Sains Malaysia in 1987 as a lecturer. She received an award from USM in 1993 to pursue her PhD at Loughborough University in the United Kingdom in the area of Parallel Algorithms. She was promoted to full Professor in 2007. She has held several administrative positions such as First Year Coordinator, Programme Chairman and Deputy Dean for Postgraduate Studies and Research. She is currently the Dean of the School of Computer Sciences and also Head of the Parallel and Distributed.
