Parallelizing Partial Digest Problem on Multicore System
Abstract. The partial digest problem, PDP, is one of the methods used in
restriction mapping to characterize a fragment of DNA. The main challenge of
PDP is the exponential time for the best exact sequential algorithm in the worst
case. In this paper, we reduce the running time for generating the solution of
PDP by designing an efficient parallel algorithm. The algorithm is based on
parallelizing the fastest sequential algorithm for PDP. The experimental study on
a multicore system shows that the running time of the proposed algorithm
decreases as the number of processors increases. Also, the speedup scales
well as the number of processors increases.
1 Introduction
Physical mapping of the genome is one of the fundamental steps in genome studies.
One of the methods used in physical mapping is the digestion of DNA with one
restriction enzyme; this is the partial digestion process. The enzyme cuts the
double-stranded DNA at specific short sequences of nucleotides called restriction sites.
After that, we measure the lengths of the obtained fragments and reconstruct the original
ordering of these fragments [1]. For example, the restriction enzyme TaqI cuts the
luciferase gene at the tcga sequence [2].
Several applications of genomic studies require genome mapping, such as
determining the order of genes, extracting distinctive short fragments of a DNA
sequence, and comparing the genomes of various species [3–7].
The combinatorial problem for partial digestion is called the partial digest problem,
PDP. Assume that the set of restriction site locations is represented as the set X = {x0,
x1, …, xn} and the multiset of lengths of DNA fragments is represented as the multiset
D = {d1, d2, …, dm}. The PDP is defined as follows [8].
Given a multiset D = {d1, d2, …, dm}, find the set X = {x0, x1, …, xn} such that
ΔX = {|xj − xi| : 0 ≤ i < j ≤ n} = D.
For example, the output of the partial digestion process when the restriction
enzyme TaqI, which cuts at tcga, is applied to the luciferase gene is D = {9, 30, 100, 170, 293, 302, 393, 402, 462,
562, 632, 732, 855, 864, 945, 954, 975, 984, 1025, 1034, 1247, 1277, 1347, 1377,
1809, 1839, 1979, 2009}. The goal of the PDP is to find the set of restriction site
locations, which is X = {0, 30, 975, 984, 1277, 1377, 1839, 2009} [2].
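The definition can be checked directly on this example. The following minimal sketch computes the multiset of pairwise distances of X and compares it with D; the function name `delta` and the variable names are illustrative choices, not notation from the cited works. It also checks the mirror image of X, which has the same pairwise distances.

```python
from collections import Counter
from itertools import combinations

def delta(X):
    """Multiset of all pairwise distances |xj - xi| for i < j."""
    return Counter(abs(b - a) for a, b in combinations(X, 2))

# Luciferase example from the text.
D = [9, 30, 100, 170, 293, 302, 393, 402, 462, 562, 632, 732, 855, 864,
     945, 954, 975, 984, 1025, 1034, 1247, 1277, 1347, 1377, 1809, 1839,
     1979, 2009]
X = [0, 30, 975, 984, 1277, 1377, 1839, 2009]

matches = delta(X) == Counter(D)           # X reproduces D as a multiset
mirror = sorted(max(X) - x for x in X)     # reflected site positions
mirror_matches = delta(mirror) == Counter(D)
print(matches, mirror_matches)
```

Since X has 8 points, it yields 8·7/2 = 28 distances, which is the size of D.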
The complexity of the exact solution of PDP is still an open problem
[9–11]. Many research papers have proposed exact and approximate
solutions for PDP [12–20]. The main challenge in finding the exact solution of PDP is
the exponential time required by the best known sequential algorithm in the
worst case. In [21], Zhang gave an example of worst-case instances. Before 2016,
the best practical sequential algorithm for PDP was the algorithm designed by Skiena,
Smith, and Lemke [18]. Recently, Fomin presented an algorithm for PDP [19]
that is faster than the Skiena, Smith, and Lemke algorithm in some cases. However, the
Skiena, Smith, and Lemke algorithm is still better than Fomin’s algorithm on Zhang’s
instances. In the same year, Abbas and Bahig [20] proposed the fastest exact sequential
algorithm for PDP. For Zhang’s data, the improvement is greater than 75% over the
Skiena, Smith, and Lemke algorithm.
The goal of this paper is to reduce the running time of the fastest exact
sequential algorithm [20] because, for large values of n, the running time of the
algorithm proposed by Abbas and Bahig is still high. The algorithm takes
approximately 19 h for n = 90, while the running time of the Skiena, Smith, and
Lemke algorithm for the same n is greater than one day. To achieve this goal, we
use high performance computing to reduce the running time of the fastest exact
sequential algorithm.
The rest of this paper is organized as follows. In Sect. 2, we briefly describe the fastest exact
sequential algorithm for PDP, which is the BBb2 algorithm. In Sect. 3, we introduce a new
parallel algorithm for PDP on a multicore system that is based on the BBb2 algorithm. In
Sect. 4, we study the proposed algorithm experimentally with respect to running time,
memory consumption, and scalability in the worst case. Section 5 contains the conclusion
of our work.
2 BBb2 Algorithm
The BBb2 algorithm is the algorithm proposed by Abbas and Bahig [20] to find the
exact solution of the PDP. The algorithm consists of two main stages. In the first stage,
the algorithm applies the breadth-first strategy while using the two bounding conditions
suggested by Skiena, Smith, and Lemke [18]. In addition, the BBb2 algorithm
deletes all repeated subproblems at the same level. For more details about the condition
Parallelizing Partial Digest Problem 97
for repeated subproblems, see Theorems 1 and 2 in [20]. The subroutine that is used to
traverse the tree level by level is called GenerateNextLevel [20]. In the first stage,
the search tree is traversed with the breadth-first strategy for a certain number of
levels. This number of levels is determined by a subroutine called Find_aM, see
[20]. In the second stage of the BBb2 algorithm, the subproblems at level aM are solved
individually, using the breadth-first strategy. The values of all elements at the current
level are represented by the two lists LD and LX. The steps of the BBb2 algorithm are as
follows.
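The full BBb2 listing is given in [20]. As a rough, simplified illustration of the level-by-level idea only, the sketch below expands all subproblems of one level at a time and drops repeated subproblems within a level. Several points are assumptions made for the sketch: the function names are hypothetical, the feasibility test is the basic multiset check (standing in for the bounding conditions of [18]), repeated subproblems are detected by plain equality of X (standing in for Theorems 1 and 2 of [20]), and the split into two stages at level aM is omitted.

```python
from collections import Counter

def generate_next_level(level):
    """Expand every subproblem (D, X) of one tree level by one placement."""
    next_level, seen = [], set()
    for D, X in level:
        width, y = max(X), max(D)        # y: largest distance not yet explained
        for p in {y, width - y}:         # new site measured from the left or right end
            need = Counter(abs(p - x) for x in X)
            if any(D[v] < c for v, c in need.items()):
                continue                 # a required distance is missing: prune
            child_X = tuple(sorted(X + (p,)))
            if child_X in seen:
                continue                 # repeated subproblem at this level
            seen.add(child_X)
            next_level.append((D - need, child_X))
    return next_level

def solve_pdp(distances):
    """Breadth-first search over all levels; returns every X found."""
    D = Counter(distances)
    width = max(D)
    level = [(D - Counter([width]), (0, width))]
    # All subproblems on a level hold the same number of remaining distances,
    # so checking the first one is enough to detect the leaf level.
    while level and level[0][0]:
        level = generate_next_level(level)
    return [X for _, X in level]

solutions = solve_pdp([2, 2, 3, 3, 4, 5, 6, 7, 8, 10])
print(sorted(solutions))
```

On this small input the search returns a solution together with its mirror image, since both have the same pairwise distances.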
In this section, we propose a parallel algorithm, PBBb2, for PDP based on the
BBb2 algorithm for multicore architectures.
In the PBBb2 algorithm, we parallelize the two main stages of the BBb2 algorithm.
In the parallelization of the first stage, we build the solution tree of PDP with the breadth-first
strategy sequentially until the number of subproblems at a level is greater than or
equal to the number of processors, P. After that, we assign the subproblems to the
processors, which work on them until level aM. We can summarize the main steps of the
parallelization of the first stage as follows.
98 H.M. Bahig et al.
1. Apply lines 1–6 of the BBb2 algorithm, where T is a list that contains the values of D
and X for each subproblem and is initially equal to (D, {0, maximum(D)}).
2. Repeat the following until reaching level aM.
(a) If the number of elements of T is less than P, then apply the procedure
GenerateNextLevel one or more times on T until the number of elements
of T is greater than or equal to P or until reaching level aM. If the algorithm
reaches level aM, terminate the first stage and go to
the second stage.
(b) If the number of elements of T is greater than or equal to P, then do the
following:
(i) Remove the first k*P elements from the list T and assign them to a new
temporary list R, where k is the integer k = ⌊|T|/P⌋.
(ii) Each processor, pi, works dynamically on one element, e, from R as
follows:
• Add the element e to a temporary list Wi.
• Call the procedure GenerateNextLevel until reaching level aM and
save the output in the list Ti.
(iii) The first processor (of the P processors working on the elements of
R) that finishes its work in (ii) goes to Step (a).
3. Add the elements of Ti, 0 ≤ i ≤ P − 1, to the list T.
4. Remove duplicates from the list T.
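The work distribution in these steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the PBBb2 implementation: `first_stage`, `expand`, and `toy_expand` are hypothetical names, `expand` stands in for GenerateNextLevel, all elements of T are handed to the pool at once instead of the first k*P, and Python threads are used only to show the scheduling (in CPython they do not give real multicore speedup for CPU-bound work). The toy expansion on integers is chosen so that different workers produce overlapping results, which is why the final duplicate removal is needed.

```python
from concurrent.futures import ThreadPoolExecutor

def dedupe(items):
    """Remove duplicates while keeping first-seen order (items must be hashable)."""
    return list(dict.fromkeys(items))

def first_stage(root, expand, a_m, P):
    """Expand sequentially until at least P subproblems exist, then in parallel up to level a_m."""
    T, level = [root], 0
    while level < a_m and 0 < len(T) < P:
        T, level = expand(T), level + 1       # sequential part, as in step 2(a)
    if level == a_m or not T:
        return T

    def work(e):
        W = [e]                               # temporary list of one worker
        for _ in range(a_m - level):          # continue down to level a_m
            W = expand(W)
        return W

    with ThreadPoolExecutor(max_workers=P) as pool:
        parts = list(pool.map(work, T))       # one task per subproblem
    return dedupe(s for part in parts for s in part)   # steps 3 and 4

def toy_expand(level):
    """Toy stand-in for GenerateNextLevel: each integer n branches to n + 1 and n + 2."""
    return dedupe(c for n in level for c in (n + 1, n + 2))

result = first_stage(0, toy_expand, a_m=4, P=2)
print(result)
```

With P larger than the number of subproblems ever available, the same function simply runs the sequential part down to level a_m.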
In the parallelization of the second stage, we assign the elements of the list T to the
processors, and each processor then works on its assigned element in a breadth-first
manner until it reaches a leaf of the search tree or the bounding conditions prune this element.
We can summarize the parallelization of the second stage in the following steps.
Repeat the following until the list T is empty.
1. If the number of elements of T is less than P, then apply the procedure
GenerateNextLevel one or more times on T until the number of elements of T is
greater than or equal to P or until the list T is empty. If the list T is empty,
terminate the second stage.
2. If the number of elements of T is greater than or equal to P, then do the
following:
(a) Remove the first k*P elements from the list T and assign them to a new temporary
list R, where k is the integer k = ⌊|T|/P⌋.
(b) Each processor, pi, works dynamically on each element e ∈ R by executing
lines 10 to 16 of the BBb2 algorithm. If the processor pi finds a solution,
say si, then si is added to the set of solutions S if it is not already in S.
(c) The first processor (of the P processors working on the elements of R) that finishes
its work in (b) goes to Step 1.
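A minimal sketch of the dynamic assignment in the second stage is given below. It is an illustration under assumptions rather than the PBBb2 code: `children`, `solve_subproblem`, and `second_stage` are hypothetical names, `solve_subproblem` is a plain breadth-first descent standing in for lines 10 to 16 of BBb2, every element of T is submitted as its own task so that an idle worker picks up the next one, and threads are used only to show the scheduling pattern.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

def children(D, X):
    """One branching step: explain the largest remaining distance from either end."""
    width, y = max(X), max(D)
    for p in {y, width - y}:
        need = Counter(abs(p - x) for x in X)
        if all(D[v] >= c for v, c in need.items()):
            yield D - need, tuple(sorted(X + (p,)))

def solve_subproblem(sub):
    """Breadth-first descent from one subproblem (D, X) down to the leaves."""
    level, found = [sub], []
    while level:
        nxt = []
        for D, X in level:
            if not D:
                found.append(X)          # all distances explained: a solution
            else:
                nxt.extend(children(D, X))
        level = nxt
    return found

def second_stage(T, P):
    """Solve the subproblems in T with P workers and collect distinct solutions."""
    S = []
    with ThreadPoolExecutor(max_workers=P) as pool:
        futures = [pool.submit(solve_subproblem, e) for e in T]
        for fut in as_completed(futures):
            for s in fut.result():
                if s not in S:           # add a solution only if it is new
                    S.append(s)
    return S

# Distances left after fixing X = (0, 10) for D = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10}.
remaining = Counter([2, 2, 3, 3, 4, 5, 6, 7, 8])
T = list(children(remaining, (0, 10)))
solutions = second_stage(T, P=2)
print(sorted(solutions))
```

Collecting the results in the main thread keeps the update of S free of races without an explicit lock.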
[Figure: running time versus number of processors (2 to 10); y-axes: time in seconds, time in minutes. Panels: a: 35 ≤ n ≤ 50, b: 55 ≤ n ≤ 75, c: 80 ≤ n ≤ 90.]
[Figure: speedup versus number of processors (2 to 10). Panels: a: 35 ≤ n ≤ 50, b: 55 ≤ n ≤ 70, c: 75 ≤ n ≤ 90.]
[Figure: memory (MByte) versus number of processors (1 to 10). Panels: a: 35 ≤ n ≤ 50, b: 55 ≤ n ≤ 70, c: 75 ≤ n ≤ 90.]
5 Conclusions
In this paper, we parallelized the fastest exact algorithm for the partial digest
problem, PDP. The main challenge of PDP is the exponential time of the best exact
sequential algorithm in the worst case. The proposed algorithm is based on working on
many independent subproblems at the same time and traversing the search tree with the
breadth-first strategy. The experimental results on a multicore system show that
the running time of the parallel algorithm decreases as the number of processors
increases. The average efficiency of the PBBb2 algorithm is 88.53%. Also, the speedup
scales well as the number of processors increases.
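For reference, and assuming the standard definitions of these measures (they are not spelled out in the text here), speedup and efficiency on P processors relate to the sequential time T_1 and the parallel time T_P as follows.

```latex
S_P = \frac{T_1}{T_P}, \qquad E_P = \frac{S_P}{P} = \frac{T_1}{P \, T_P}
```

Under these definitions, an efficiency of 0.8853 at P = 10, for instance, would correspond to a speedup of about 8.85.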
References
1. Pevzner, P.: DNA physical mapping and alternating eulerian cycles in colored graphs.
Algorithmica 13(1–2), 77–105 (1995)
2. Devine, J.H., Kutuzova, G.D., Green, V.A., Ugarova, N.N., Baldwin, T.O.: Luciferase from
the east European firefly Luciola mingrelica: cloning and nucleotide sequence of the cDNA,
overexpression in Escherichia coli and purification of the enzyme. Biochimica et Biophysica
Acta (BBA)-Gene Struct. Expr. 1173(2), 121–132 (1993)
3. Baker, M.: Gene-editing nucleases. Nat. Methods 9(1), 23–26 (2012)
4. Sambrook, J., Fritsch, E.F., Maniatis, T.: Molecular Cloning. A Laboratory Manual, 2nd
edn., pp. 1.63–1.70. Cold Spring Harbor Laboratory Press, Cold Spring Harbor (1989)
5. He, X., Hull, V., Thomas, J.A., Fu, X., Gidwani, S., Gupta, Y.K., Black, L.W., Xu, S.Y.:
Expression and purification of a single-chain Type IV restriction enzyme Eco94GmrSD and
determination of its substrate preference. Sci. Rep. 5, 9747 (2015)
6. Narayanan, P.: Bioinformatics: A Primer. New Age International (2005)
7. Dear, P.H.: Genome mapping. eLS (2001)
8. Jones, N.C., Pevzner, P.: An Introduction to Bioinformatics Algorithms. MIT Press,
Cambridge (2004)
9. Lemke, P., Werman, M.: On the complexity of inverting the autocorrelation function of a
finite integer sequence, and the problem of locating n points on a line, given the (nC2)
unlabelled distances between them. Preprint 453 (1988)
10. Daurat, A., Gérard, Y., Nivat, M.: Some necessary clarifications about the chords’ problem
and the partial digest problem. Theoret. Comput. Sci. 347(1–2), 432–436 (2005)
11. Cieliebak, M., Eidenbenz, S., Penna, P.: Noisy Data Make the Partial Digest Problem
NP-Hard. Springer, Heidelberg (2003)
12. Pandurangan, G., Ramesh, H.: The restriction mapping problem revisited. J. Comput. Syst.
Sci. 65(3), 526–544 (2002)
13. Błażewicz, J., Formanowicz, P., Kasprzak, M., Jaroszewski, M., Markiewicz, W.T.:
Construction of DNA restriction maps based on a simplified experiment. Bioinformatics
17(5), 398–404 (2001)
14. Błażewicz, J., Burke, E.K., Kasprzak, M., Kovalev, A., Kovalyov, M.Y.: Simplified partial
digest problem: enumerative and dynamic programming algorithms. IEEE/ACM Trans.
Comput. Biol. Bioinf. 4(4), 668–680 (2007)
15. Karp, R.M., Newberg, L.A.: An algorithm for analysing probed partial digestion
experiments. Comput. Appl. Biosci. 11(3), 229–235 (1995)
16. Nadimi, R., Fathabadi, H.S., Ganjtabesh, M.: A fast algorithm for the partial digest problem.
Jpn J. Ind. Appl. Math. 28(2), 315–325 (2011)
17. Ahrabian, H., Ganjtabesh, M., Nowzari-Dalini, A., Razaghi-Moghadam-Kashani, Z.:
Genetic algorithm solution for partial digest problem. Int. J. Bioinform. Res. Appl. 9(6),
584–594 (2013)
18. Skiena, S.S., Smith, W.D., Lemke, P.: Reconstructing sets from interpoint distances. In:
Proceedings of the Sixth Annual Symposium on Computational Geometry, pp. 332–339.
ACM (1990)
19. Fomin, E.: A simple approach to the reconstruction of a set of points from the multiset of n²
pairwise distances in n² steps for the sequencing problem: II. Algorithm. J. Comput. Biol. 23,
1–7 (2016)
20. Abbas, M.M., Bahig, H.M.: A fast exact sequential algorithm for the partial digest problem.
BMC Bioinform. 17, 1365 (2016)
21. Zhang, Z.: An exponential example for a partial digest mapping algorithm. J. Comput. Biol.
1(3), 235–239 (1994)