Multimodal Memetic Framework For Low-Resolution Protein Structure Prediction

Journal Pre-proof
Multimodal Memetic Framework for low-resolution protein structure prediction
Rumana Nazmul, Madhu Chetty, Ahsan Raja Chowdhury
PII: S2210-6502(18)30108-1
DOI: https://doi.org/10.1016/j.swevo.2019.100608
Reference: SWEVO 100608
To appear in: Swarm and Evolutionary Computation
Received Date: 10 March 2018
Revised Date: 22 September 2019
Accepted Date: 28 October 2019
Please cite this article as: R. Nazmul, M. Chetty, A.R. Chowdhury, Multimodal Memetic Framework for
low-resolution protein structure prediction, Swarm and Evolutionary Computation (2019),
doi: https://doi.org/10.1016/j.swevo.2019.100608.
© 2019 Published by Elsevier B.V.

Rumana Nazmul1,2,3, Madhu Chetty3, and Ahsan Raja Chowdhury2,3,∗
1 Faculty of Information Technology, Monash University, Australia
2 Dept of Computer Science and Engineering, University of Dhaka, Bangladesh
3 School of Engineering and Information Technology, Federation University Australia
Abstract
In this paper, we propose a systematic design of evolutionary optimization, namely the
Multimodal Memetic Framework (MMF), to effectively search the vast and complex energy
landscape of protein structure prediction. The proposed memetic framework is implemented in
hierarchical stages, with the optimization of each stage performed in parallel in three
different states: Exploratory, Exploitative and Central. Each state, with its own set of
sub-populations, either explores or exploits through beneficial mixing of potential solutions
to direct the search towards a global solution. Instead of implementing identical genetic
operators, the proposed approach employs different selection and survival criteria in each
state according to its designated task. The Exploratory state employs a knowledge-based
initial population generation technique with appropriately tuned genetic operators to guide
the search to the “nearest peak”. The Exploitative state fine-tunes the individuals
representing different regions by applying a building-block-based local search. Finally, by
utilizing the knowledge acquired from different peaks, the Central state carries out
information exchange among the highly fit solutions to explore the undiscovered regions. The
information exchange employs a novel non-random parental selection technique that distributes
the reproduction opportunity intelligently among the individuals to make cross-over more
effective. The method has been tested on a set of benchmark protein sequences for 2D and 3D
lattice models. The experimental results demonstrate the superiority of the proposed method
over other state-of-the-art algorithms.
Keywords: Protein Structure Prediction, Multimodality, Memetic Algorithm

∗ Corresponding author
Email addresses: r.nazmul@federation.edu.au (Rumana Nazmul), madhu.chetty@federation.edu.au
(Madhu Chetty), ahsan.chowdhury@federation.edu.au (Ahsan Raja Chowdhury)

1. Introduction
Protein Structure Prediction (PSP) is still a grand challenge problem in computational
biology due to the complex nature of the protein folding process. According to the Levinthal
paradox [1], the time needed to attain an accurate folded structure by exhaustive enumeration
grows approximately exponentially with the number of residues. This means that finding the
optimal conformation by exhaustive search is extremely time-consuming, even for a small
sequence. To reduce the complexity of modeling at a fine level of detail, simplified or
low-resolution models have been introduced that limit the range of lengths, angles, and
torsions [2]. The simplest class of models for PSP is the class of lattice models. One of the
important approximations made by lattices is the discretization of the space of conformations
[3]. In a simplified model, all the monomers have equal size and all the bonds have equal
length. A protein is thus modeled as a sequence of simple elements, representing the amino
acids, that are embedded in a lattice. The connection angles between them are restricted by
the lattice structure in the plane (i.e., 2D) or in space (i.e., 3D). In a valid protein
conformation, the embedding of the protein chain on the lattice satisfies two constraints:
adjacent amino acids of the sequence are also adjacent in the lattice (chain constraint), and
no point of the lattice is occupied by more than one amino acid (self-avoiding-walk or SAW
constraint). The free energy of a conformation is defined by a simplified energy function,
which includes contact potentials specifying the energy between pairs of amino acids that are
adjacent on the lattice.
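The two validity constraints can be illustrated with a minimal sketch, assuming a 2D square lattice and a conformation given as a list of integer coordinates. The function name `is_valid_conformation` and the coordinate representation are our own illustrative choices, not part of the original method.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def is_valid_conformation(coords: List[Coord]) -> bool:
    """Check the chain and SAW constraints on a 2D square lattice (illustrative)."""
    # Chain constraint: consecutive residues must sit on lattice points at unit distance.
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            return False
    # SAW constraint: no lattice point may hold more than one residue.
    return len(set(coords)) == len(coords)
```

For instance, the four-residue square `[(0, 0), (1, 0), (1, 1), (0, 1)]` satisfies both constraints, whereas a chain that steps back onto its own starting point violates the SAW constraint.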
While this discretization precludes a completely accurate model of protein structures, it
preserves behavioral equivalency with real-world proteins [2]. Moreover, such models are
computationally far cheaper than complex models. The discretization also allows the
enumeration of the entire conformational space and enables the study of the folding process.
Thus, by sacrificing the atomic details, lattice models can be used to extract important
characteristics and unify the understanding of many different properties of proteins [3].
Because of this, lattice models have proven to be extremely useful tools for coping with the
complexity of PSP problems.
There is a variety of lattice models, which are classified by the following properties [3]:
• The physical structure, specifying the level of detail at which the protein sequences are
  represented;
• The alphabet used for categorizing the amino acids for modeling purposes;
• The energy evaluation criteria, which specify the interaction between pairs of amino acids
  while computing the energy of a conformation;
• The lattice used for expressing the protein conformations, which determines the space of
  possible conformations for a given protein.
Realistic lattice models have emerged to provide valuable insight into the factors governing
structural stability and the essential principles of protein folding kinetics. Moreover,
low-resolution models make the interpretation of computer simulations feasible and as
unambiguous as possible [4]. By sacrificing atomic details and restricting the degrees of
conformational freedom, these models reduce the computational cost by orders of magnitude and
thus make the task of global minimization of the protein conformational energy tractable.
Notwithstanding, finding a minimum energy conformation, even with a simplified model, has
been proved to be NP-hard [5, 6]. Furthermore, the “ruggedness” of the energy landscape [7]
implies that the search process must deal with the difficulty of crossing numerous energy
traps to reach the global minimum [8]. Hence, to deal with complex problems like PSP, a
search algorithm must (i) explore the large conformational space of the protein and (ii) deal
with its multimodal nature, which is due to the presence of local minima in the energy
landscape [8]. To address this, the optimization process must identify promising solutions
belonging to different search regions, and then direct the search towards the global minimum
by exploring other regions through exploitation and beneficial recombination of these
solutions [9].
Deterministic search algorithms are less efficient for multimodal PSP optimization since they
do not allow any “downhill” movement or backtracking mechanism, which is required to escape
from local minima in a rugged landscape [9]. Population-based evolutionary algorithms,
starting with an initial population of distinct solutions, are prone to converge prematurely
to a single peak due to the stochastic errors induced by the genetic operators, especially
selection and cross-over. Different niching methods, e.g., fitness sharing [10], crowding
[11], multi-population GA [12], etc., focus on maintaining diversity and exploring various
regions of the search space simultaneously by forming separate subgroups within a population.
Although these methods are capable of converging to multiple good peaks, they lack the means
for beneficial mixing or information exchange [13]. The multi-population methods, on the
other hand, allow information sharing between the sub-populations through a periodic
migration technique. Even so, the beneficial mixing and merging of individuals still depends
on the migration and replacement policy that determines the spread of good solutions among
the sub-populations [14].
In this paper, we propose a Multimodal Memetic Framework (MMF) to address the PSP problem
using a simplified lattice model. The evolution is carried out in stages, each stage
consisting of a predefined number of generations. The population is split into three
different states: Exploratory, Exploitative and Central, each performing a different role in
the optimization process. In each stage, the sub-populations in the Exploratory state locate
the promising regions by employing a knowledge-based initial population generation technique.
This technique is based on the ‘maximal core’ formation concept and uses the hydrophobic
property to provide good-quality and diverse seeds. In the Exploitative state, on the other
hand, the sub-populations fine-tune the solutions, each representing a region discovered in
the previous stage. Each sub-population searches the neighborhood of a solution using a novel
building-block-based local search to cross the valley of lower fitness. Finally, the Central
state aggregates and exchanges the information representing different regions discovered in
the previous stages to enable exploration of the undiscovered regions and direct the search
towards the global peak. The genetic operators employed in the Central state ensure the
simultaneous exploitation and preservation of the information found in different regions. A
novel non-random mating strategy is used for parental selection; it distributes the
reproduction opportunity systematically among the individuals of the population and ensures
the transfer of significant information to the next generations. In addition, a new survival
selection technique implements the concept of “downhill movement” to maintain a sufficiently
diverse population by putting a constraint on the rapid flow of genetic material. By
employing different population initialization techniques and different selection and survival
criteria in each state according to its optimization objective, the proposed method allows
substantial selection pressure within each region for convergence to occur, whilst directing
the search towards the global minimum by exploitation and beneficial recombination of those
solutions to reach the unexplored regions. In other words, the proposed memetic framework
balances two competing goals of multimodal optimization: exploiting the best solutions
already found, and exploring the search space for new promising solutions.
The remainder of the paper is organized as follows. In Section 2, we present preliminaries on
PSP. This is followed by a brief discussion of the state-of-the-art methods for PSP in
Section 3. A detailed description of the proposed Multimodal Memetic Framework (MMF), along
with its component techniques, is given in Section 4. Next, Section 5 presents the test
suite, the experimental setup, and comparisons between the proposed and state-of-the-art
techniques, conducted on Hydrophobic-Polar (HP) benchmark sequences for different lattice
models. Finally, Section 6 concludes the paper.
2. Background
This section briefly introduces the HP and functional protein models, the encoding
techniques, methods for generating the initial population, and the pull move local search.
2.1. HP Lattice Model
In this paper, the protein models of choice are the hydrophobic-polar (HP) model [15] and the
“shifted” HP model (also known as functional model proteins) [16, 17, 18]. The
Hydrophobic-Polar (HP) model, the most widely used model for lattice simulation, captures the
hydrophobic effect as the main driving force for the formation of a protein structure
containing a hydrophobic core. Based on hydrophobicity, amino acids are classified either as
Hydrophobic/non-polar (H) or hydrophilic/Polar (P). The hydrophobic force causes hydrophobic
amino acids to minimize their contact with water. Consequently, the hydrophobic residues tend
to aggregate and form a hydrophobic core in the optimum conformation, shielded from the
surrounding solvent by hydrophilic amino acids. The contact potential value [20] for two
hydrophobic (H) residues that are topological neighbors is assigned as -1, whereas for PP or
HP the value is assigned as 0, implying no interaction. The energy matrix of the HP model may
be expressed as:
\[
\begin{array}{c|cc}
  & H  & P \\ \hline
H & -1 & 0 \\
P & 0  & 0
\end{array}
\]
The shifted HP model [16], on the other hand, is a variant of the HP model that includes
repulsion. Some of the native states of this model are not maximally compact; it therefore
allows cavities or potential binding sites, which are key properties required to investigate
ligand binding [17, 19]. The energy matrix for the shifted HP model is:
\[
\begin{array}{c|cc}
  & H  & P \\ \hline
H & -2 & 1 \\
P & 1  & 1
\end{array}
\]
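As a worked illustration of the two contact-energy matrices, the following minimal sketch evaluates the energy of a conformation on a 2D square lattice. The table names, the function `contact_energy` and the coordinate-list representation are illustrative assumptions; only the matrix entries come from the text.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]
EnergyTable = Dict[Tuple[str, str], int]

# Contact potentials taken from the HP and shifted HP energy matrices.
HP_ENERGY: EnergyTable = {("H", "H"): -1, ("H", "P"): 0, ("P", "H"): 0, ("P", "P"): 0}
SHIFTED_HP_ENERGY: EnergyTable = {("H", "H"): -2, ("H", "P"): 1, ("P", "H"): 1, ("P", "P"): 1}


def contact_energy(sequence: str, coords: List[Coord], table: EnergyTable) -> int:
    """Sum contact potentials over topological neighbors on a 2D square lattice."""
    index_at = {c: i for i, c in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        # Looking only right and up visits every lattice edge exactly once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            # Topological neighbors are lattice-adjacent but not consecutive in the chain.
            if j is not None and abs(i - j) > 1:
                total += table[(sequence[i], sequence[j])]
    return total
```

For the sequence `HPPH` folded into a unit square, the only topological contact is between the two H residues, giving an energy of -1 under the HP model and -2 under the shifted model.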
2.2. Encoding of Protein Conformation
Among the various encoding techniques used to represent the embedding of a protein structure
or conformation in a lattice, the most widely used is internal encoding, in which the
position of each residue in the conformation is specified relative to the lattice position of
its previous residue in the sequence. There are two variations of internal encoding:
i) Absolute Encoding (AE) and ii) Relative Encoding (RE) [21]. In absolute encoding, the
direction or move of each residue with respect to the position of its previous residue is
relative to the axes defined by the lattice. In relative encoding, by contrast, the direction
is relative to the direction of the previous residue. In addition, Non-isomorphic Encoding
(NIE) [22] reduces the search space by identifying the different orientations of the same
conformation and thereby eliminating multiple occurrences of identical conformations. More
details about these encoding schemes are discussed in the Supplementary Document with
appropriate examples.
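The difference between absolute and relative encoding can be sketched as follows for a 2D square lattice. The move alphabets (`U`, `D`, `L`, `R` for absolute; `F`, `L`, `R` for relative) and the convention of placing the second residue along the positive x-axis are assumptions made for this illustration; the Supplementary Document may use different symbols.

```python
from typing import List, Tuple

Coord = Tuple[int, int]

# Absolute moves are fixed directions along the lattice axes (assumed alphabet).
ABS_MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def decode_absolute(moves: str) -> List[Coord]:
    """Decode an absolute-encoded move string into lattice coordinates."""
    coords = [(0, 0)]
    for m in moves:
        dx, dy = ABS_MOVES[m]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def decode_relative(moves: str, heading: Coord = (1, 0)) -> List[Coord]:
    """Decode a relative-encoded string; each move turns relative to the previous step."""
    coords = [(0, 0), heading]  # the first bond fixes the initial heading
    dx, dy = heading
    for m in moves:
        if m == "L":
            dx, dy = -dy, dx    # turn 90 degrees counter-clockwise
        elif m == "R":
            dx, dy = dy, -dx    # turn 90 degrees clockwise
        elif m != "F":
            raise ValueError(f"unknown relative move: {m!r}")
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords
```

Under these conventions, the absolute string `RUL` and the relative string `LL` describe the same four-residue square, which shows why a relative string is one symbol shorter than the absolute string for the same chain.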
2.3. Initial Population Generation
In all population-based EAs, including those applied to PSP, the initial population has a
significant impact on the quality and the convergence speed of the search process. A diverse
initial population that provides good coverage of the search space aids the process of
information exchange and helps the search evolve quickly towards better solutions. In the
conventional random initialization technique, to generate an individual
$I_i = (m_{i1}, m_{i2}, \ldots, m_{iL})$ of length $L$, a particular move $m_{ik}$ for the
$k$-th position is randomly selected from a move set $M$, with all moves being equally
likely. As described earlier, the generation of an individual by any technique is subject to
the satisfaction of the self-avoiding walk (SAW) constraint [13]. However, the probability of
generating a non-SAW conformation increases with the number of possible moves in the lattice
model used as well as with the length of the individual, and hence generating a population of
$pop$ individuals can be computationally expensive even for a sequence of moderate length.
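The cost of purely random initialization can be illustrated by estimating how often a uniformly random walk happens to be self-avoiding. The sketch below assumes a 2D square lattice with four equally likely moves; the function names and the sampling approach are illustrative and not part of the original method.

```python
import random
from typing import List, Tuple

Coord = Tuple[int, int]
MOVES_2D = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def random_walk(num_residues: int, rng: random.Random) -> List[Coord]:
    """Place residues by picking every move uniformly at random (no SAW check)."""
    coords = [(0, 0)]
    for _ in range(num_residues - 1):
        dx, dy = rng.choice(MOVES_2D)
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def saw_fraction(num_residues: int, trials: int, seed: int = 0) -> float:
    """Estimate the fraction of random walks that satisfy the SAW constraint."""
    rng = random.Random(seed)
    valid = 0
    for _ in range(trials):
        walk = random_walk(num_residues, rng)
        if len(set(walk)) == len(walk):
            valid += 1
    return valid / trials
```

Calling `saw_fraction` for increasing chain lengths shows the fraction of usable individuals dropping quickly, which is the motivation for generation techniques that avoid infeasible conformations by construction.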
To overcome this, various methods have been introduced that focus on generating only feasible
solutions. For example, the Dynamic Individual Generation (DIG) technique [13] generates a
conformation by using the information of both the lattice and the moves. It places the first
residue at lattice position (0, 0, 0); the subsequent moves are then dynamically selected
from a set of possible moves by continually verifying the SAW constraint. Dotu et al. [23]
follow moves in a specified order for a random number of times in each iteration to generate
a compact initial solution that satisfies the SAW constraint, but this approach does not
ensure sufficient diversity. Other approaches, e.g., [24], generate straight-line
individuals; this is indisputably the fastest method, but it does not provide any diversity
at all.
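The idea of selecting moves dynamically while verifying the SAW constraint can be sketched as a simple chain-growth routine on the 3D cubic lattice. This is a loose illustration inspired by the description above, not the DIG algorithm of [13]; the restart-on-dead-end policy, the function name `grow_saw` and its parameters are our own assumptions.

```python
import random
from typing import List, Tuple

Coord3 = Tuple[int, int, int]
MOVES_3D = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def grow_saw(num_residues: int, rng: random.Random, max_restarts: int = 1000) -> List[Coord3]:
    """Grow a self-avoiding chain one residue at a time, choosing only free neighbors."""
    for _ in range(max_restarts):
        coords = [(0, 0, 0)]          # first residue at the lattice origin
        occupied = {coords[0]}
        while len(coords) < num_residues:
            x, y, z = coords[-1]
            free = [(x + dx, y + dy, z + dz) for dx, dy, dz in MOVES_3D
                    if (x + dx, y + dy, z + dz) not in occupied]
            if not free:
                break                  # dead end: discard and restart (assumed policy)
            nxt = rng.choice(free)
            coords.append(nxt)
            occupied.add(nxt)
        if len(coords) == num_residues:
            return coords
    raise RuntimeError("could not grow a self-avoiding chain")
```

Because every step is restricted to unoccupied neighbors, each returned chain is feasible by construction, at the price of a non-uniform distribution over conformations.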
2.4. Pull Move
The pull move [25] is very effective in exploiting the neighborhood of a solution without
causing extreme changes to its global structure. The authors of [25] proposed the GTabu
algorithm, which incorporates a new local search strategy called the pull move into a tabu
search algorithm for the 2D HP problem. It has been proved theoretically that pull moves are
complete, i.e., any valid configuration can be reached from any other valid configuration by
a sequence of pull moves. GTabu, employing the pull move operation to explore the
neighborhood of a potential solution, has shown its effectiveness by finding new lowest
energy configurations for three long benchmarks. The concept of pull moves has more recently
been extended to both the 2D triangular and the 3D FCC lattices and combined with the tabu
search strategy by Böckenhauer [26] for protein folding simulation.
To elaborate the concept of the pull move, let us consider three consecutive residues
$S_{i-1}$, $S_i$ and $S_{i+1}$ in the sequence that are positioned at $X_{i-1}$, $X_i$ and
$X_{i+1}$, respectively. According to [25], for a free location $L$ adjacent to $X_{i+1}$ (or
$X_{i-1}$) and diagonally adjacent to $X_i$, so that these three locations constitute three
corners of a square, a pull move can be implemented if the fourth corner location, denoted as
$C$, is either empty or occupied by $S_{i-1}$ (or $S_{i+1}$). If $C$ is occupied by $S_{i-1}$
(or $S_{i+1}$), moving $S_i$ to $L$ alone yields a valid conformation. If $C$ is empty, then
$S_i$ and $S_{i-1}$ (or $S_{i+1}$) are moved to locations $L$ and $C$, respectively, and the
remaining residues are subsequently pulled, one after another, into the vacated locations
until a valid conformation is reached.
As the FCC (face-centered cubic) lattice is free from the parity problem, the pull move can
be applied to a residue $S_i$ positioned at $X_i$ if there exists a free position in the
lattice which is adjacent to both the vertices $X_i$ and $X_{i-1}$ (or $X_{i+1}$), the latter
containing its predecessor (or successor) residue. Residue $S_i$ is then moved to this free
position, and the connectivity is maintained by pulling the chain such that the previous
position of each moved vertex is occupied by its successor (or predecessor) until a valid
conformation is reached. Thus, the pull move attempts to maintain the validity of a
conformation by moving the least number of vertices and without displacing any vertex very
far from its current position.
Hence, the pull move can be redefined generically for all lattice models. Let $X_i$ denote
the position of amino acid $i$ ($S_i$) in a lattice. A pull move is possible if the following
conditions are satisfied:
- There is a free location $L$ which is adjacent (a connected neighbor) to $X_{i+1}$ (or
  $X_{i-1}$), the position of the successor (or predecessor) residue.
- Either $L$ is also adjacent to $X_i$ (only possible in the FCC lattice), in which case
  $S_i$ is moved to $L$; or there is another location $C$, adjacent to both $L$ and $X_i$,
  which is either empty or occupied by $S_{i-1}$ (or $S_{i+1}$).
However, in the case of the end residues, i.e., $S_1$ or $S_L$, the second condition does not
need to be satisfied, and a pull move can occur if there is an empty location in the
neighborhood of $X_{i+1}$ or $X_{i-1}$, respectively.
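The square-lattice case described above can be sketched in code. The following is a minimal illustration for the 2D square lattice only, pulling an interior residue towards its successor and dragging the lower-index part of the chain; the mirrored direction, end residues and the FCC case are omitted. The function name `pull_move`, the 0-based indexing and the coordinate-list representation are assumptions of this sketch.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int]


def adjacent(a: Coord, b: Coord) -> bool:
    """True if a and b are unit-distance neighbors on the square lattice."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1


def pull_move(coords: List[Coord], i: int, L: Coord) -> Optional[List[Coord]]:
    """Pull interior residue i (0-based) to free location L; return None if not possible."""
    if not 0 < i < len(coords) - 1:
        return None
    occupied = set(coords)
    xi, nxt = coords[i], coords[i + 1]
    # L must be free, adjacent to X_{i+1} and diagonally adjacent to X_i.
    diagonal = abs(L[0] - xi[0]) == 1 and abs(L[1] - xi[1]) == 1
    if L in occupied or not adjacent(L, nxt) or not diagonal:
        return None
    # C is the fourth corner of the square formed by X_i, X_{i+1} and L.
    C = (xi[0] + L[0] - nxt[0], xi[1] + L[1] - nxt[1])
    new = list(coords)
    new[i] = L
    if C == coords[i - 1]:
        return new                    # S_{i-1} already sits on C: only S_i moves
    if C in occupied:
        return None                   # C is blocked by another residue
    new[i - 1] = C
    # Pull the remaining residues into vacated locations until the chain is connected.
    j = i - 2
    while j >= 0 and not adjacent(new[j], new[j + 1]):
        new[j] = coords[j + 2]
        j -= 1
    return new
```

For a straight four-residue chain `[(0, 0), (1, 0), (2, 0), (3, 0)]`, pulling residue 2 to `(3, 1)` moves residue 1 to the corner `(2, 1)` and drags residue 0 into the vacated `(2, 0)`, so only as many residues move as are needed to restore connectivity.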
3. Optimization Methods for PSP
A variety of optimization techniques [27, 28, 29], including exact methods [30, 31],
approximation algorithms, and non-deterministic algorithms [24, 32, 33], both evolutionary
and non-evolutionary, have been applied to the HP model of the PSP problem. Methods such as
Constrained Hydrophobic Core Construction (CHCC) [30], CDCG [33], Contact Interaction (CI)
[34], Hydrophobic Zipper (HZ) [30], etc. are based on heuristics that either approximate the
hydrophobic core or rely on the concept of cooperativity. CHCC [30] and CDCG [33] direct the
search for the optimal conformation using heuristic presumptions designed to approximate the
hydrophobic core, whereas HZ [30] and CI [34] are based on the concept of cooperativity.
Among the evolutionary approaches, a pioneering application of a genetic algorithm hybridized
with Monte Carlo to PSP [24], which employs a mutation operator following the conventional
Monte Carlo steps, has shown superior performance over traditional MC methods. A modified
version [35] of Goldberg’s Simple Genetic Algorithm [36] uses a hybrid objective function
with a penalty term to penalize conformations with collisions and has shown improvement over
MC and GA with MC [24]. Another method, called pioneer search [37], applies a systematic
cross-over at every possible cross-over point; it outperforms the GA [24] but fails to reach
optimal solutions for longer instances. Memetic algorithms (MA), which incorporate local
search procedures into a global search technique within an evolutionary framework [38], have
also been applied to PSP. An MA [39] with a self-adaptive strategy applies the local search
either for exploitation or for diversification, according to the convergence of the
population. Furthermore, Multimeme Algorithms (MMA) [38] applied to PSP adaptively choose
among multiple local searches to develop several different neighborhoods on which the search
proceeds. In Ant Colony Optimization (ACO) [40], simulated ants construct candidate solutions
based on heuristic information and a probability value that reflects the quality of the
solutions found in the previous iteration. Although this method captures information from
multiple peaks, it fails to exploit them efficiently to direct the search towards the global
minimum. Different variants of immune algorithms (IA) [41], inspired by the clonal selection
principle, have also been attempted for PSP. Although the success of early efforts was
limited to small instances, a recently reported IA [41], which employs special mutation
operators and an aging mechanism to promote diversity in the population, has exhibited better
performance on long protein sequences. The Estimation of Distribution Algorithm (EDA) [32]
captures domain knowledge by replacing the traditional cross-over and mutation operators with
a probabilistic model constructed from the selected best solutions. However, the performance
of EDA is not guaranteed in those cases where the optimal solution cannot be constructed by
combining the sub-structures residing in the other good solutions [32].
Custódio et al. [42] proposed a new methodology that employs a phenotype-based crowding
mechanism for the maintenance of useful diversity within the populations. The method resulted
in increased performance along with the capability of obtaining multiple solutions. Although
the method was successfully adapted to an all-atom protein model, its evaluation for the HP
model was limited to a small group of benchmark protein sequences with a maximum length of
136 monomers. Another genetic-algorithm-based method, namely $GA_{PSP}$, proposed by Bošković
and Brest [43], incorporates multiple tuning techniques, including crowding, clustering,
repair, local search and opposition-based mechanisms. Although the method obtained the best
results among the then available methods for protein sequences up to moderate length, it is
not capable of dealing with the true multimodality of the protein sequences. As highlighted
in the algorithm and in its description, the entire population is initialized only once, and
all the iterations work on the evolving population. The clustering technique used in this
method works on sub-populations of the main population. This technique can often get stuck in
local minima due to the lack of proper exploration and exploitation, even though a local
search technique is applied. A recently proposed method, AHEDA [44], is based on a new local
search heuristic and is not guaranteed to deal with multimodality.
To our knowledge, none of these methods has explicitly focused on the problems of multimodal
optimization. Nevertheless, the search process must be able to balance the required selection
pressure against diversity in order to identify sub-components of good solutions and generate
high-performance solutions by beneficial recombination and exploitation of those
sub-components [45]. Since the underlying global search technique of MA is based on genetic
operators such as cross-over, mutation, etc., the analysis based on the schemata theorem is
also applicable to MA.
4. The Method
To address the aforementioned issues, we propose a Multimodal Memetic Framework (MMF) that
decomposes the population into three different states. While the Exploratory and Exploitative
states encourage the growth of building blocks to converge to a nearby peak, the Central
state maintains and combines the potential sub-components of good solutions for discovering
unexplored regions. Thus, by segregating the population according to different objectives,
the proposed method can balance exploration and exploitation simultaneously. The evolution
occurs sequentially in stages, each having a pre-defined number of generations. The maximum
number of stages is determined by the maximum number of generations, fitness evaluations or
time set as the termination criterion. The total population of MMF is decomposed into
$2x + 1$ sub-populations: the Exploratory and Exploitative states each consist of $x$
sub-populations, and the Central state contains only one. Each of the $x$ sub-populations in
the Exploratory state, as well as the sub-population of the Central state, contains $p$
individuals. The number of individuals assigned to each of the $x$ sub-populations in the
Exploitative state, on the other hand, is half of that in the Exploratory state (i.e.,
$p/2$). This size is selected to give each sub-population in the Exploitative state enough
individuals to find a new peak by fine-tuning the already found solutions.
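The population bookkeeping described above can be summarized in a small sketch. The class name `MMFLayout`, the use of integer division for $p/2$, and the example values of $x$ and $p$ are illustrative assumptions; the paper does not prescribe them here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMFLayout:
    """Sub-population sizes implied by the (2x + 1) decomposition (illustrative)."""
    x: int  # sub-populations in each of the Exploratory and Exploitative states
    p: int  # individuals per Exploratory sub-population and in the Central one

    @property
    def num_subpopulations(self) -> int:
        return 2 * self.x + 1

    @property
    def exploitative_size(self) -> int:
        # Half the Exploratory size; rounding down for odd p is an assumption.
        return self.p // 2

    @property
    def total_individuals(self) -> int:
        exploratory = self.x * self.p
        exploitative = self.x * self.exploitative_size
        central = self.p
        return exploratory + exploitative + central
```

With the hypothetical setting `MMFLayout(x=4, p=50)`, there are 9 sub-populations holding 200 Exploratory, 100 Exploitative and 50 Central individuals, 350 in total.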
It is evident that the selection pressure required to accomplish the exploration and
exploitation objectives of the different states is not the same [46]. Hence, different
population initialization techniques and distinct genetic operators (parental selection,
survival selection, and the rates of cross-over, mutation and local search operations) are
employed in each state to implement distinct evolutionary environments according to the
optimization target of the state.
At the very first stage of evolution, the sub-populations in the Exploratory state, along
with that in the Central state, are initialized with new populations that are generated by
different initialization techniques (discussed later). The sub-populations in the
Exploitative state, however, are initialized in the second stage, based on the information
captured by the Exploratory state. Subsequently, the sub-populations in the Exploratory and
Exploitative states are initialized in every stage except the final stage. In each stage, the
Central state, on the other hand, imports new and possibly better genetic material from the
migrants traveling to it from the Exploratory and Exploitative states. The working principles
of the three different states are described in detail below, while the schematic diagram of
the proposed method is shown in Figure 1.
4.1. Exploratory State
In each stage, the Exploratory state explores different regions of the search space by
employing different sub-populations, each of which aims to capture a set of similar
sub-structures containing implicit knowledge about the basin of attraction it is approaching.
Hence, the genetic operators, the mating strategy and the survival selection scheme employed
in each sub-population of this state ensure the growth of the best building blocks
representing the region. Instead of using a conventional random technique, a new
knowledge-based method is employed to initialize the sub-populations, since the incorporation
of knowledge from the problem domain can improve the performance of the optimization [47].
Both the mating strategy and the survival selection scheme employed here favor the selection
of the fittest individuals, since individuals with higher fitness are more likely to contain
highly fit building blocks. The probability of the cross-over operation is set large enough
to allow sufficient exchange of building blocks among the individuals. However, the high
selection pressure induced by the selection scheme helps to overcome the disruption caused by
the cross-over operation, thereby ensuring the growth of building blocks as mandated by
Holland’s schemata theorem [48].
Figure 1: The schematic diagram of the proposed Multimodal Memetic Framework (MMF).
[Diagram not reproduced. Panels: Phase-1 to Phase-N, each showing the Exploratory State
(Explr_v, v = 1..x), the Exploitative State (Explv_v, v = 1..x, from Phase-2 onwards) and the
Central State. Box labels: initialization with CIPG; initialization with existing technique;
post-processing; capture/find BB; select representative; GenIndiv based on BB; best
individual from each Explv sub-population.]
4.1.1. Initialization of the Population
In each stage, the sub-populations are initialized with our Core-based Initial Population
Generation (CIPG) [49, 50, 46] technique, which incorporates domain knowledge based on the
concept of maximum hydrophobic core formation and helps the optimization process commence the
exploration with diverse, good-quality seeds. The concept of the core was stated earlier in
Section 2; the term hydrophobic core, or H-core, of a protein conformation denotes a dense
area of a particular shape (which depends on the lattice used) consisting of only hydrophobic
residues [33, 22, 30]. In our case, for all three considered models (i.e., 2D-Square,
3D-Cubic, 3D-FCC), we assume that the H-core is embedded in a rectangular-shaped area which
can be considered as a grid (stacked layers of grids in the case of 3D lattices) whose
intersection points correspond to Cartesian coordinates.
For the 2D-Square and 3D-Cubic models, all the points of the rectangular-shaped H-core will be occupied by H residues. In the FCC lattice model, in contrast, a core embedded in a rectangular-shaped structure with dimension X × Y × Z can accommodate at most half of X × Y × Z H residues inside the core, rather than X × Y × Z H residues.
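The halving for the FCC lattice can be checked numerically. The sketch below assumes the common convention that FCC sites are the integer points whose coordinates sum to an even number; the text does not state its coordinate convention, and the function name `fcc_sites_in_box` is purely illustrative.

```python
from itertools import product


def fcc_sites_in_box(x: int, y: int, z: int) -> int:
    """Count FCC lattice sites inside an x-by-y-by-z box of integer points.

    Assumption (not taken from the paper): FCC sites are the integer
    points (i, j, k) for which i + j + k is even.
    """
    return sum(
        1
        for i, j, k in product(range(x), range(y), range(z))
        if (i + j + k) % 2 == 0
    )


# A cubic-lattice core of the same box would hold x * y * z residues.
for dims in [(2, 2, 2), (4, 3, 2)]:
    volume = dims[0] * dims[1] * dims[2]
    print(dims, "cubic capacity:", volume, "FCC capacity:", fcc_sites_in_box(*dims))
```

Under this convention, a box with an even number of grid points holds exactly half as many FCC sites as cubic sites.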
344 Since our aim is to incorporate the knowledge of the H-core in the initial
345 population, unlike Constrained Hydrophobic Core Construction (CHCC) [30],
346 which determines core in an exhaustive manner, CIPG approximates the core
347 dimension which significantly reduces the computation time to generate good
348 quality seeds.
349 The CIPG algorithm works in two stages. The first stage starts with esti-
350 mating the number of Core-H (CH ), which is the number of hydrophobic (H)
351 residues forming the maximum possible H-core by analyzing the positions and
352 neighborhood of residues in the sequence. In the second stage, we construct the
353 set of possible cores of various dimensions with a fraction of CH that is used
354 while generating the initial population. Subsequently, we generate individuals
355 with various core sizes using a chain-growth algorithm (by adding one monomer
356 at a time) to fill the core with H residues and then place the remaining residues
357 with respect to the filled H-core.
4.1.2. Genetic Operators and Local search

In each sub-population of the Exploratory state, we apply a one-point cross-over operation to ensure exchange of information between the individuals, and a pull move [25] local search is applied to the solutions for further improvement. Thus, the combined effect of cross-over and local search ensures the exchange of genetic material between the individuals and the exploitation of the newly created offspring to carry the search process to the nearest peak.
4.1.3. Mating Strategy and Survival Selection Approach

As stated earlier, the strategies used for parental and survival selection must be able to impose the selection pressure that ensures the growth of good building blocks representing the region. Here we employ the binary tournament technique, which selects mates for the cross-over operation by favoring the participation of highly fit individuals in reproduction over the less-fit individuals. For survival selection, we employ the (µ+λ)-strategy [51, 52], which is based on the deterministic selection of the best µ individuals from the combined pool of µ parent and λ offspring individuals.
Thus, the cumulative selection pressure induced by both selection strategies leads the search within each sub-population to converge to a single peak.
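The two selection schemes described above can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: it assumes that fitness is an energy to be minimized (lower is better, as in the HP model), and the function names and the `energy` callback are hypothetical.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def binary_tournament(pop: Sequence[T], energy: Callable[[T], float],
                      rng: random.Random) -> T:
    """Draw two distinct individuals at random; return the lower-energy one."""
    a, b = rng.sample(list(pop), 2)
    return a if energy(a) <= energy(b) else b


def mu_plus_lambda(parents: Sequence[T], offspring: Sequence[T],
                   energy: Callable[[T], float]) -> List[T]:
    """Deterministically keep the best mu individuals of parents + offspring."""
    mu = len(parents)
    return sorted(list(parents) + list(offspring), key=energy)[:mu]


# Toy usage: individuals are represented directly by their energies.
rng = random.Random(1)
parents = [-3, -5, -1]
offspring = [-6, -2, -4]
mate = binary_tournament(parents, lambda e: e, rng)
survivors = mu_plus_lambda(parents, offspring, lambda e: e)
print(mate, survivors)
```

Because both operators favor low-energy individuals, repeated application concentrates the sub-population around a single peak, which is the behavior the Exploratory state relies on.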
4.1.4. Information Extraction

According to Holland's schemata theorem, the number of individuals containing highly fit schemata increases over time, which ensures the proliferation of the information of the highly fit regions of the search space from generation to generation [53]. When the population is close to converging, the conformations will contain the sub-strings representing the region. However, for the PSP problem, instead of the schema, we consider sub-structures similar to memes [13], which are a special case of schemata excluding the 'don't care' symbol. To understand the issues involved, let us consider the two conformations shown in Figures 2(a) and (b) with relative encodings FFLRLF and FLRRLL, respectively.
Figure 2: Relative encoding of two conformations containing the same sub-structure from
third to sixth residue.
In both conformations, although the sub-structures from the third to the sixth position are the same, their corresponding relative encodings are different (LRL and RRL, respectively). Here, to extract the region explored by a sub-population, we acquire knowledge as to whether a sub-structure of a particular pattern frequently occurs at the same location in different conformations. The identification technique of memes reported earlier [13] needs a verification process to ensure the maintenance of the SAW constraint and to handle the reflection problem regarding relative encoding (RE). Here, we propose a new technique for identifying a fixed-length sub-structure occurring at a particular position in different conformations with a probability greater than a specified threshold value. However, in both relative and non-isomorphic encodings, the representations of the same sub-structure occurring even in the same locations of different conformations are different. When representing the various orientations of the same structure or conformation, non-isomorphic encoding generates the same code for all its variations (as discussed in Section 2). To ensure that the codes for different orientations of the same sub-structure found in a particular position in different conformations are the same, we apply an intelligent mechanism based on non-isomorphic encoding.
For each fixed-length (l) sub-structure (SS) (here, we have considered sub-structures consisting of four residues) starting from a particular location in all the best ϑ% individuals (P_Best) of a sub-population in the Exploratory state, the technique generates a non-isomorphic code (NIC). The rationale behind considering sub-structures comprising four residues instead of three (which is the minimum possible length) is to generate distinct non-isomorphic codes between the sub-structures consisting of the same number of residues. Thus, for a conformation of length L, we consider (L-4+1) = (L-3) sub-structures, starting from the 1st position and continuing up to the (L-3)th position, and each sub-structure is denoted as SS_{i..i+3}, where i = 1, 2, ..., (L-3).
We start with the conformation having the best fitness value and generate the NIC for the sub-structure starting at position 1 (SS_{1..4}). The code found for SS_{1..4} is then stored in a list (PatternList) together with the identity of the individual and the starting position of the sub-structure. This process is repeated for all (L-3) sub-structures in all the P_Best individuals and, for each newly found sub-structure, a new entry is created in the list. However, if the code for a sub-structure already exists in the list, the identity of the individual is appended to the information list of the code and the frequency of the sub-structure is increased by one. After creating the PatternList, we select all the sub-structures appearing most frequently in a particular position of the P_Best individuals according to a pre-defined threshold value. Thus, the PatternList stores the code of each sub-structure, the starting location of the sub-structure, and the list of individuals containing the sub-structure in decreasing order of fitness value. For P_Best conformations of length L, the number of row and column entries in the list is (L-3) and P_Best, respectively. Finally, after identifying the most frequently occurring sub-structures in different locations of the conformations, for each of the sub-structures, we select the best individual (denoted as RepIndiv) containing that sub-structure.
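The extraction procedure above can be sketched for the 2D-Square lattice as follows. Several details are assumptions made only for illustration: the k-th symbol of a relative encoding is taken to give the bond between residues k and k+1, the first move is taken relative to an arbitrary initial heading, and the orientation-independent code is a stand-in for the paper's non-isomorphic code (it takes the smallest bond sequence over the eight rotations and reflections of the square lattice). All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, int]

# The eight rotations and reflections of the square lattice.
SYMMETRIES = [
    lambda x, y: (x, y), lambda x, y: (-y, x),
    lambda x, y: (-x, -y), lambda x, y: (y, -x),
    lambda x, y: (x, -y), lambda x, y: (y, x),
    lambda x, y: (-x, y), lambda x, y: (-y, -x),
]


def decode(moves: str, heading: Point = (0, 1)) -> List[Point]:
    """Turn a relative encoding (F, L, R) into square-lattice coordinates."""
    pos, path = (0, 0), [(0, 0)]
    for m in moves:
        dx, dy = heading
        if m == "L":
            heading = (-dy, dx)   # turn left
        elif m == "R":
            heading = (dy, -dx)   # turn right
        pos = (pos[0] + heading[0], pos[1] + heading[1])
        path.append(pos)
    return path


def canonical_code(window: Sequence[Point]) -> Tuple[Point, ...]:
    """Orientation-independent code of a sub-structure (stand-in for NIC)."""
    bonds = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(window, window[1:])]
    return min(tuple(s(dx, dy) for dx, dy in bonds) for s in SYMMETRIES)


def frequent_substructures(best: Sequence[List[Point]], length: int = 4,
                           threshold: float = 0.5) -> Dict[tuple, List[int]]:
    """Build the pattern list and keep entries whose frequency exceeds threshold.

    `best` must be ordered from best to worst fitness, so the first index
    stored for a pattern is its representative individual (RepIndiv).
    Keys are (0-based start position, code).
    """
    pattern_list: Dict[tuple, List[int]] = defaultdict(list)
    for idx, conf in enumerate(best):
        for start in range(len(conf) - length + 1):
            code = canonical_code(conf[start:start + length])
            pattern_list[(start, code)].append(idx)
    return {key: ids for key, ids in pattern_list.items()
            if len(ids) / len(best) > threshold}


c1, c2 = decode("FFLRLF"), decode("FLRRLL")
for (start, _code), ids in frequent_substructures([c1, c2]).items():
    print("shared sub-structure starting at residue", start + 1,
          "in individuals", ids, "RepIndiv:", ids[0])
```

With the two encodings from Figure 2, the only shared four-residue pattern under these assumptions is the one covering residues three to six, even though its relative encodings (LRL and RRL) differ.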
4.1.5. Migrants
434 After extracting the information about the regions explored, the represen-
435 tative individuals (RepIndiv) are selected as discussed above and used as the
436 migrants from the Exploratory state to the Exploitative and the Central states.
437 However, within the Exploitative state, the migrants are used to generate the
438 initial population for the corresponding sub-population in that state.
439 4.2. Exploitative State
440 Once the algorithm has guided the search to the basins of attraction by the
441 Exploratory sub-populations, the corresponding Exploitative sub-populations
442 dig deeper into the region for effective exploitation of the information. How-
443 ever, the global search operators lack the ability of fine-tuning the solutions.
444 Hence, the exploitation is accomplished by applying local search to the indi-
445 viduals representing the promising regions in the search by relocating several
446 residues in a better way to improve the overall energy of the structure [54]. Since
447 applying local search on compact conformation might not be always successful
448 due to SAW constraint, individuals are generated using a novel operation called
449 “stretch” along with the knowledge of building blocks representing the region.
450 This causes the fitness to be reduced temporarily (i.e., downhill movement) be-
451 fore new improvements occur subsequently and thus allows a significant amount
452 of backtracking to jump into other basins of attraction.
4.2.1. Initialization of Population

The initial individuals in the Exploitative sub-populations are generated by a technique inspired by the "meme-based individual generation" process [13]. However, ensuring the SAW constraint while generating an individual by incorporating the building blocks into a randomly generated individual [13] may require several attempts. Instead, while generating the individuals for an Exploitative sub-population, the proposed method works on the representative individuals (RepIndiv) obtained from the corresponding Exploratory sub-population (see Algorithm 1). The method generates Φ individuals from each RepIndiv, where the value of Φ depends on the population size of the Exploitative state and the number of representative individuals (RepIndiv). While generating the Φ individuals from each representative, a pre-defined ν% of the total building blocks represented by the RepIndiv are selected randomly. Then the new individual is generated by applying a local search operation called stretch to the entire representative individual other than the selected building blocks in it. Here, the stretch operation reconstructs the intermediate part between two non-consecutive building blocks as a chain of straight moves unless doing so would violate the SAW constraint. The steps of the initial population generation for the Exploitative state are shown in Algorithm 1.
Algorithm 1: CreateExploitPop (RepIndiv, Rep, ν)

Input: RepIndiv = Representative individuals, Rep = Number of representative individuals, R = ν% of BBs to be picked from an individual
Output: NewIndiv = Newly generated PopExplt individuals
1: k ← 1
2: For i = 1 to Rep Do
3:   For j = 1 to PopExplt/Rep Do
4:     BBList ← PickBBs(R)
5:     NewIndiv_k ← stretch(RepIndiv_i, BBList)
6:     k ← k + 1
7:   End For
8: End For
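One possible reading of Algorithm 1 and of the stretch operation is sketched below for the 2D-Square lattice with relative encoding. It is an interpretation rather than the authors' implementation: building blocks are given as sets of protected move indices, every unprotected turn is replaced by a straight move ('F') one at a time, and a replacement is kept only if the chain stays self-avoiding. The names `is_self_avoiding`, `stretch` and `create_exploit_pop` are hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple


def is_self_avoiding(moves: str) -> bool:
    """Check the SAW constraint for a relative encoding on the square lattice."""
    heading, pos = (0, 1), (0, 0)
    seen = {pos}
    for m in moves:
        dx, dy = heading
        if m == "L":
            heading = (-dy, dx)
        elif m == "R":
            heading = (dy, -dx)
        pos = (pos[0] + heading[0], pos[1] + heading[1])
        if pos in seen:
            return False
        seen.add(pos)
    return True


def stretch(moves: str, protected: Set[int]) -> str:
    """Straighten unprotected moves, skipping any change that breaks SAW.

    With relative encoding, straightening a move outside a building block
    may rotate the block as a whole but leaves its internal shape intact.
    """
    current = list(moves)
    for i in range(len(current)):
        if i in protected or current[i] == "F":
            continue
        trial = current[:i] + ["F"] + current[i + 1:]
        if is_self_avoiding("".join(trial)):
            current = trial
    return "".join(current)


def create_exploit_pop(reps: Sequence[Tuple[str, List[Set[int]]]],
                       pop_size: int, nu: float,
                       rng: random.Random) -> List[str]:
    """Generate pop_size // len(reps) stretched individuals per representative.

    Each entry of `reps` is (moves, building_blocks); `nu` is the fraction
    of building blocks kept intact in each new individual.
    """
    phi = pop_size // len(reps)
    new_pop = []
    for moves, blocks in reps:
        keep = min(len(blocks), max(1, round(nu * len(blocks))))
        for _ in range(phi):
            chosen = rng.sample(blocks, keep)
            new_pop.append(stretch(moves, set().union(*chosen)))
    return new_pop


print(stretch("FLRRLL", {2, 3, 4}))
```

In this toy call the protected moves (indices 2 to 4) survive unchanged while the turns around them are straightened, which reduces compactness and leaves room for the subsequent local search.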
The notion behind selecting building blocks randomly from RepIndiv and performing a stretch operation to generate individuals for the initial population of the Exploitative state can be explained with an example. Let us consider the individual in Figure 3(a) to be a representative individual. Although this representative individual is similar to the optimal solution shown in Figure 3(b) with regard to the building blocks at different positions (marked as 1, 2, ..., 7), it is still a sub-optimal solution. Since the representative individual is compact, it is difficult to reach the optimal solution by generating a new solution that keeps all of its building blocks intact and then applying local search for fine-tuning. Hence, applying local search to a newly created individual that contains the information of the RepIndiv in a less compact form might make it easier for the fine-tuning to reach other, more promising peaks.
Figure 3: (a) Sub-optimal and (b) Optimal solutions for the same sequence having multiple
common sub-structures in corresponding positions.
4.2.2. Genetic Operators and Local search

The exploitation is essentially based on a pull move local search operation that searches the neighborhood of an existing conformation. The proposed search is a hybridization of the persistent and non-persistent modes [13] of the pull move. In persistent mode, a changed move is kept (not replaced by the previous move) before local search continues on other randomly selected positions, whereas in non-persistent mode, the conformation is reverted to its previous state. Here, we consider both modes while applying the pull move to an individual. The hybrid pull move based local search is shown in Algorithm 2.
Algorithm 2: HybridLS (Indiv, Positions, TotalPos)

Input: Indiv = The individual selected for local search (LS), Positions = List of positions to apply LS, TotalPos = Number of positions in Indiv to apply LS
Output: Best = An individual with best energy after applying persistent and non-persistent LS
1: Best ← Indiv
2: Per ← Indiv
3: For i = 1 to TotalPos Do
4:   j ← Positions[i]
5:   Per ← PullMoveLS(Per, j)
6:   NonPer ← PullMoveLS(Indiv, j)
7:   If f(NonPer) < f(Best) Then
8:     Best ← NonPer
9:   End If
10: End For
11: If f(Per) < f(Best) Then
12:   Best ← Per
13: End If
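The control flow of Algorithm 2 can be sketched independently of the pull move itself. In the sketch below the pull move is not implemented; `local_move` is a placeholder callback standing in for it, `energy` plays the role of f, and both names are assumptions. Lower energy is taken to be better.

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")


def hybrid_ls(indiv: C, positions: Sequence[int],
              local_move: Callable[[C, int], C],
              energy: Callable[[C], float]) -> C:
    """Combine persistent and non-persistent local search.

    Persistent mode accumulates every change; non-persistent mode always
    restarts from the original individual. The best result seen is returned.
    """
    best = indiv
    persistent = indiv
    for j in positions:
        persistent = local_move(persistent, j)   # changes accumulate
        non_persistent = local_move(indiv, j)    # restart from the original
        if energy(non_persistent) < energy(best):
            best = non_persistent
    if energy(persistent) < energy(best):
        best = persistent
    return best


# Toy usage: a "conformation" is a tuple of integers, its energy is the sum,
# and the placeholder move lowers the entry at the chosen position by one.
def lower(conf, j):
    return conf[:j] + (conf[j] - 1,) + conf[j + 1:]


print(hybrid_ls((0, 0, 0), [0, 1, 2], lower, sum))
```

In the toy run the persistent chain wins because its improvements add up, whereas each non-persistent trial only ever applies a single move to the original individual.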
498 4.2.3. Survival Selection Approach

499 All the individuals generated by persistent and non-persistent mode compete
500 with the parents and the best individual survives in the next generation.
501 4.2.4. Migrants

502 At the end of each stage, the best individuals from each of the sub-populations
503 in the Exploitative state are selected as the migrants to the Central state.
4.3. Central State
The notion behind designing the Exploratory and the Exploitative states is to promote exploration of different areas of the search space, to further exploit the solutions found, and to provide the Central state with information about the regions already explored. The Central state aims to integrate the information of various basins of attraction in order to discover unexplored regions through beneficial exchange of potential genetic material. Therefore, the Central state needs to maintain several good solutions simultaneously to guide the search towards global minima instead of allowing the population to converge to a single peak. However, the conventional methods for both parental and survival selection (e.g., binary tournament, roulette-wheel) [55] are based on the law of "survival of the fittest". Hence, the selection pressure, as well as the stochastic error induced by these techniques, causes a loss in diversity resulting in premature convergence [52]. Therefore, in the Central state, the selection operators for parental selection (i.e., mating strategy) are set in a manner that distributes the reproduction opportunity among the individuals to make the cross-over more effective. On the other hand, the strategy for survival selection is implemented to control the rapid flow of genetic material from a particular region to ensure their concurrent existence in the population. That is, in this state, the selection pressure induced by the selection operators allows the innovation by cross-over operation to take place.
525 4.3.1. Initialization of Population

526 In the very first stage of evolution, the Central state starts with a single
527 population initialized by randomly generated individuals. Subsequently, at the
528 beginning of each stage, migrants travel from the Exploratory and Exploitative
529 states to the Central state to provide newly discovered genetic material. The
530 Central population with this new genetic information explores the regions where
531 it has not yet been, by favorable recombination of building blocks of potential
532 solutions.
4.3.2. Genetic Operators and Local Search

Employing a one-point cross-over operation with a high rate and pull move local search, this state allows the exchange of information between newly found solutions and those from previous stages. Furthermore, the rate of local search is set large enough to ensure the exploitation of the newly created individuals. However, to satisfy the SAW constraint, a SAW validation technique is applied to every cross-over operation. If the SAW constraint is violated, "repairing" is applied to the succeeding segment concatenated from the cut-point. It is important to note that we are interested in recombining the potential sub-structures of individuals representing different regions of the search space with the minimum possible changes to the parent individuals. Instead of arbitrarily applying the SAW repairing process, here we apply it to the shorter segment (either preceding or succeeding) to accomplish the following two objectives: i) reduce the repairing cost, since the success rates are higher on shorter lengths, and ii) exchange information through the cross-over operation with minimum disruption of the information provided by the parents.
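The cross-over, validation and segment choice described above can be illustrated on the 2D-Square lattice with relative encoding. This is only a sketch: the repair procedure itself is not specified in this section and is therefore not implemented, and the helper names `one_point_crossover`, `is_self_avoiding` and `segment_to_repair` are hypothetical.

```python
def is_self_avoiding(moves: str) -> bool:
    """Check the SAW constraint for a relative encoding on the square lattice."""
    heading, pos = (0, 1), (0, 0)
    seen = {pos}
    for m in moves:
        dx, dy = heading
        if m == "L":
            heading = (-dy, dx)
        elif m == "R":
            heading = (dy, -dx)
        pos = (pos[0] + heading[0], pos[1] + heading[1])
        if pos in seen:
            return False
        seen.add(pos)
    return True


def one_point_crossover(p1: str, p2: str, cut: int) -> str:
    """Join the first `cut` moves of p1 with the remaining moves of p2."""
    return p1[:cut] + p2[cut:]


def segment_to_repair(length: int, cut: int) -> str:
    """Pick the shorter side of the cut-point as the segment to repair."""
    return "preceding" if cut < length - cut else "succeeding"


p1, p2, cut = "LLFFF", "FFLLF", 2
child = one_point_crossover(p1, p2, cut)
if not is_self_avoiding(child):
    # The offspring collides with itself, so the shorter segment
    # would be handed to the repair step.
    print(child, "violates SAW; repair the",
          segment_to_repair(len(child), cut), "segment")
```

Here both parents are valid walks, yet the offspring revisits its starting point, which shows why every cross-over needs validation; repairing the two-move preceding segment would disturb less parental information than repairing the three-move succeeding one.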
549 4.3.3. Mating Strategy: Adaptive Strategy for Assortative Mating (ASAM)
550 In the Central state, we apply a new parental selection strategy namely
551 Adaptive Strategy for Assortative Mating (ASAM) [56] which is based on the
552 concept of a special form of non-random mating i.e., assortative mating [57].
553 The proposed mating strategy relies on the dependency between the two mates
554 and imbibes the advantages of both the positive and negative paradigms of
assortative mating. An asymmetric selection technique applies a deterministic selection for the first parent (Mate1) and a stochastic selection for the second one (Mate2) so as to benefit from both schemes. Using a deterministic scheme
558 minimizes the effect of sampling variance whereas stochastic selection increases
559 the robustness [52]. ASAM involves the Construction of Clusters phase (Phase-
560 1) and Construction of Mate Pools Phase (Phase-2). Phase-1 promotes non-
561 random mating by dividing the population into three clusters Gb (Best), Ga
562 (Average), and Gw (Worst) according to their fitness values. Phase-2 creates
563 mate pools and individuals are selected based on prioritizing the clusters and
564 then weighting the prioritized clusters for each pool. More details on ASAM
565 can be found in [56, 58] and a brief discussion is included in the Supplementary
566 Document.
4.3.4. Survival Selection Approach: Sib-based Survival Selection
568 For selecting the survivors in the next generation, we formulated a selection
569 strategy called, “Sib-based Survival Selection” (S3) in [59], which is inspired by
570 the principle of crowding method [11, 53, 60]. This strategy ensures the concur-
571 rent maintenance of several potential solutions by controlling the flow of genetic
572 materials among the members of the population. S3 pairs off the fittest offspring
573 amongst all the sibs (the offspring that inherit most of the genetic material from
574 the same ancestor) with the ancestor individual for survival competition. More-
575 over, by selecting the survivors in a hybridized manner of deterministic and
576 probabilistic selection techniques, it also allows the exploitation of less fit solu-
577 tions which might be beneficial while dealing with the multimodal problem. An
578 overview of the S3 method is included in the Supplementary Document.
Note that Ockham's Razor, proposed by Iacca et al. [61], is a three-stage optimal memetic exploration comprising three memes. According to the authors, the first meme is stochastic with a long search radius, the second is stochastic with a moderate search radius, and the third is deterministic with a short search radius. In contrast, our proposed Multimodal Memetic Framework is completely different: it is a multi-stage algorithm having multiple (three) states in each stage. As mentioned in Section 4, we call these states Exploratory, Exploitative and Central, and each of the states has its own features. Ockham's Razor is a memetic technique that exhibits the exploration feature of the memetic framework, whereas the proposed MMF includes two distinctive features of memetic computing with an additional state, namely Central.
5. Experimental Results and Discussions
To evaluate the performance of the proposed Multimodal Memetic Framework (MMF) algorithm experimentally, we compare it with the state-of-the-art algorithms for protein structure prediction in 2D and 3D HP lattice models. Experiments have been carried out on a test suite consisting of seven different datasets, described in Section 5.2, using three different lattice models, namely 2D-Square, 3D-Cubic, and 3D-FCC. The test suite contains standard benchmark sets that are extensively reported in the literature [40, 41, 32, 62, 63, 13], as well as benchmark sequences having unique ground-state conformations, complex and specially designed bio-inspired instances, or long HP sequences. The proposed MMF algorithm is also evaluated using a set of benchmark sequences for the functional model proteins found in [41, 32] and a set of biological sequences used in [13].
5.1. Setup
The evaluations of the proposed Multimodal Memetic Framework (MMF) are presented in the following sections (Section 5.3–Section 5.5). For evaluating the performance of the proposed MMF algorithm, which is shown in Section 5.3, we have maintained the population size pop=400, distributing the individuals among two Exploratory sub-populations each with 100 individuals, two Exploitative sub-populations each containing 50 individuals, and one sub-population in the Central state containing 100 individuals. Each sub-population in the Exploratory state works according to the following parameter settings: each Exploratory sub-population initializes 50% of its population by randomly selecting from the pool of individuals generated by the CIPG algorithm, whereas the remaining 50% of the initial population is created using the DIG algorithm [13]. The cross-over rate is set to 1.0 (i.e., CR=1.0), where the two parent individuals are selected using the binary tournament technique. The pull move local search is performed on 25% randomly selected individuals (LSProb=0.25) and on 10% of positions selected randomly for each selected individual (LSInten=0.10). Finally, the (µ+λ)-strategy is used for selecting survivors for the next generation. Based on the aforementioned parameter settings in each Exploratory sub-population, it was empirically observed in several runs that the population converges to a single peak in less than 50 generations. Hence, the execution limit of each stage is set as 50 generations. However, if the best fitness value in an Exploratory sub-population remains unchanged for a specified number of generations (here we consider 10 generations), we assume that it has reached a near-convergence state and we then reduce the cross-over and local search rates by a factor of 4. Once the execution of the sub-populations in the Exploratory state is over, we select the best 20% of individuals according to the fitness values from each to extract the information of the peaks to which it is converging. The representative individuals from each sub-population are used as migrants to the Central state, while the knowledge about the positions of building blocks is used to generate individuals using the stretch operation for the initial population in the corresponding Exploitative sub-population for the next stage.
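The near-convergence rule for the Exploratory sub-populations can be sketched as a small bookkeeping class. The sketch rests on assumptions not stated in the text: the "local search rate" is taken to be LSProb, the reduction is applied only once per stage, and lower energy counts as better. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExploratoryRates:
    """Track stagnation of the best energy and scale the operator rates."""
    crossover: float = 1.0      # CR
    ls_prob: float = 0.25       # LSProb
    best: float = float("inf")
    stall: int = 0
    reduced: bool = False

    def update(self, best_energy: float, patience: int = 10,
               factor: float = 4.0) -> None:
        """Call once per generation with the sub-population's best energy."""
        if best_energy < self.best:
            self.best, self.stall = best_energy, 0
        else:
            self.stall += 1
        if self.stall >= patience and not self.reduced:
            # Near-convergence: divide both rates by the given factor.
            self.crossover /= factor
            self.ls_prob /= factor
            self.reduced = True


rates = ExploratoryRates()
for generation_best in [-5] + [-5] * 10:   # no improvement for 10 generations
    rates.update(generation_best)
print(rates.crossover, rates.ls_prob)
```

With the settings quoted above, ten generations without improvement lower the cross-over rate from 1.0 to 0.25 and LSProb from 0.25 to 0.0625.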
As stated above, the Exploitative state applies local search to the already found solutions (from the Exploratory state) to exploit their neighborhood. Here, the values for LSProb and LSInten are set to 0.75 and 0.25, respectively, to ensure a hybrid local search. Each newly generated individual competes with its parent for survival in the next generation. At the end of the execution of this state, the individual with the best energy value is transferred as a migrant to the Central state.
Finally, for the Central state the cross-over rate (CR) is set to 1.0, and the local-search parameters LSProb and LSInten are set to 0.60 and 0.30, respectively. The newly proposed ASAM and S3 are employed as the parental and the survival selection techniques, respectively, in this state. No constant is used as the maximum number of stages; rather, a termination condition is used, which is the generation number at which the best energy value is observed to stop changing. Even though 2-3 additional runs (i.e., stages) are sufficient to draw this conclusion, we observe this for 5 generations for our MMF. That is, if the energy value remains constant for 5 consecutive generations, the algorithm terminates, and the best energy value is reported. Since the structure and nature of the conformations for the various lattice models are different, we observed that the proposed manner of termination is effective for convergence.
All the algorithms of our MMF method are implemented in C++. For each simulation experiment, each run is carried out for 12 h on the Monash Sun Grid (MSG) [64], which consists of a Dell R815 with four-socket AMD Opteron CPUs and 256 GB RAM and a Dell R820 with four-socket Intel Xeon CPUs and 256 GB RAM. However, each machine runs independently for each benchmark sequence in every run, which is considered as a separate job in the MSG, and 64 jobs were executed at a time. Thus, the computation time for a sequence is equivalent to the time required for running it on a single machine.
662 5.2. Test Suite
663 The test suite used for evaluating the proposed approach consists of five
664 datasets of benchmark sequences for HP model, a set of benchmark sequences for
665 the functional model of a protein, and 15 sequences from protein data bank. All
666 the data sets (with sequences and known best energies reported in the literature)
667 are shown in Tables 1-4 in the Supplementary Document.
Data Set-1 consists of 11 benchmark sequences (B1–B11) of 20–100 residues that have been widely used in previous research [13, 32] on PSP; we have used it for comparing the proposed MMF with other algorithms in the 2D-Square, 3D-Cubic, and 3D-FCC lattice models.
672 Data Set-2 contains 10 sequences each of 48 residues (i.e., known as Harvard
673 sequences) and is used to test the method on the 3D-Cubic and 3D-FCC lattice
674 models.
Further, we have evaluated our proposed method using a subset of 11 instances selected from the benchmarks for the functional model protein that have been used previously for assessing the optimization algorithms in [32] and [41]. Moreover, to test how the proposed approach performs on long biological sequences, a test set from [13] has been used. The set contains sequences with lengths from 200 to 250 residues that were taken from the Protein Data Bank and translated into HP strings by [65].
5.3. Performance of MMF on Standard Benchmark Instances
We evaluate the performance of the proposed method using the standard benchmark sets (i.e., Data Set-1) for all the three lattice models stated earlier and compare the performance against other methods with previously published results available in the literature. The results of the experiments conducted on Data Set-1 for all three lattice models are shown in Table 1. We have recorded the best and average energies with standard deviation obtained by the proposed method for all the instances in Table 1. Moreover, the worst results obtained by the proposed method are also juxtaposed to highlight the strength of the proposed method. In the 2D-Square lattice model, we observe that the method is successful in achieving the known optimal energies for all the instances, with good average energy values and a negligible difference between the best and worst energies, which indicates the robustness of the proposed method. For the 3D-Cubic lattice, the results obtained by the proposed method on Data Set-1 demonstrate its success not only in obtaining the known optimal energies in all the cases, but also in finding conformations with even better energies than the best energies found so far for the cases where known best energies are not reported in the literature. In the case of the 3D-FCC lattice model, for the benchmark sequences B5, B7, and B8, the energies obtained by the proposed method are very close to the best-known energy values reported by other methods.
The results of the comparison with other methods for the 2D-Square lattice are shown in Table 2. It is worth noting that, unlike other methods reported here, the proposed MMF is successful in achieving the known optimal energies for all the cases and performs on par with the methods that reach the best energy values. Although chain growth algorithms like PERM [66] have difficulty in folding the sequence B8, in which optimal core formation requires extensive interactions between the two terminals, it is evident from the average energy value that the proposed method reaches the optimum for this instance without any difficulty.
The comparative analysis on Data Set-1 for the 3D-Cubic lattice model is shown
in Table 3. The results show that the proposed method outperforms the other
methods, including DCN [13], which has reported the best energies among all the
existing methods considered here. Moreover, the proposed method finds
conformations with better energies than the best energies known so far for 5
out of 7 sequences.

Table 1: Performance of the proposed MMF algorithm for the benchmark sequences of Data
Set-1 in 2D-Square, 3D-Cubic, and 3D-FCC lattice models. E∗ denotes the known optimal
energy, Eb the best energy obtained by the MMF algorithm in 25 runs, and Ir the
improvement rate (in %) of MMF over the known best energy. A ‘-’ in Ir indicates no
improvement by MMF.

Seq    2D-Square                    3D-Cubic                     3D-FCC
Name   E∗    Eb    Avg±Std     Ir   E∗    Eb    Avg±Std     Ir   E∗    Eb    Avg±Std      Ir
B5     -23   -23   -23.0±0.00  -    -31   -31   -31.0±0.00  -    -74   -68   -67.4±0.28   -
B6     -21   -21   -21.0±0.00  -    -32   -33   -31.7±0.24  3.1  -71   -73   -69.9±0.66   2.8
B7     -36   -36   -35.5±0.21  -    -54   -54   -53.6±0.23  -    -130  -129  -126.1±0.41  -
B8     -42   -42   -41.5±0.34  -    -58   -59   -57.3±0.44  1.7  -132  -131  -127.4±0.69  -
B9     -53   -53   -52.0±0.22  -    -79   -81   -79.3±0.40  2.5  -185  -186  -181.0±0.80  1.2
B10    -48   -48   -46.7±0.19  -    -72   -77   -73.5±0.45  6.9  -169  -176  -168.7±1.03  4.1
B11    -50   -50   -47.5±0.34  -    -76   -78   -74.5±0.61  2.6  -172  -177  -170.8±1.07  3.9

Table 2: Comparison of the best energies obtained by the proposed and existing methods on
the benchmark sequences of Data Set-1 in the 2D-Square lattice model. A blank cell
indicates that the energy of a sequence is unavailable for a method.

Algorithm         Ref   B5   B6   B7   B8   B9   B10  B11
E∗                      -23  -21  -36  -42  -53  -48  -50
MMF (Proposed)          -23  -21  -36  -42  -53  -48  -50
CMA               [13]  -23  -21  -35  -42  -53  -47  -50
GGA               [22]  -23  -21  -36  -42
IA                [41]  -23  -21  -35  -39
EDA               [32]  -23  -21  -35  -42  -52  -47  -48
HTGA              [67]  -23  -21  -36  -42
PERM              [66]  -23  -21  -36  -39  -53  -48  -50
ACO HPPFP-3       [40]  -23  -21  -36  -42  -53  -47  -49
Tabu Search       [68]  -23  -21       -42  -51  -45  -48
GAOSS             [69]  -23  -21  -36  -42  -52
EMC               [70]  -23  -21  -35  -42
MMA               [38]  -23  -21  -36  -38
GA                [24]  -23  -21  -34  -37
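
The caption of Table 1 names the improvement rate Ir without spelling out how it is computed. A plausible definition, stated here as an assumption rather than as the authors' formula, is the relative gain of Eb over E∗. It reproduces the 3D-Cubic entries of Table 1, as the worked value for B10 shows:

```latex
I_r = \frac{E^{*} - E_b}{\lvert E^{*} \rvert} \times 100\%,
\qquad
\text{e.g., B10 (3D-Cubic): }
\frac{-72 - (-77)}{72} \times 100\% \approx 6.9\%.
```

Because energies are negative and lower is better, a positive Ir means that MMF found a lower energy than the previously known best.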
Table 3: Comparison of the best energies obtained by the proposed and existing methods on
the benchmark sequences of Data Set-1 in the 3D-Cubic lattice model. A blank cell
indicates that the energy of a sequence is unavailable for that method.

Algorithm         Ref   B5   B6   B7   B8   B9   B10  B11
E∗                      -31  -32  -54  -58  -79  -72  -76
MMF (Proposed)          -31  -33  -54  -59  -81  -77  -78
CMA               [13]  -31  -31  -54  -58
DCN               [13]  -31  -32  -54  -58  -79  -72  -76
R-EA              [71]  -28  -26  -49  -46
IA                [41]  -29  -23  -41  -42
HGAPSO            [72]  -29  -26  -49
EDA               [32]  -29  -31  -49  -52
CGA               [73]  -31  -31  -50  -55

The comparative analysis for the 3D-FCC lattice model is shown in Table 4. For
the benchmark sequences B5, B7, and B8, the proposed method does not reach the
best energies reported by ETS [26]; the obtained energies are within one unit
of those achieved by ETS for B7 and B8, and six units higher for B5. Compared
with the energies obtained by DCN [13], the proposed method performs on par or
better on all the instances in Data Set-1. From the results obtained for the
3D-FCC lattice on Data Set-1, it is evident that the performance of the
proposed method is very competitive with the methods described in the
literature.
Table 4: Comparison of the best energies obtained by the proposed and existing methods on
the benchmark sequences of Data Set-1 in the 3D-FCC lattice model. A blank cell
indicates that the energy of a sequence is unavailable for a method.

Algorithm         Ref   B5    B6    B7    B8    B9    B10   B11
E∗                      -74   -71   -130  -132  -185  -169  -172
MMF (Proposed)          -68   -73   -129  -131  -186  -176  -177
CMA               [13]  -68   -71   -128  -128
DCN               [13]  -68   -71   -129  -128  -185  -169  -172
HGA               [22]  -69   -59   -117  -103
ETS               [26]  -74         -130  -132
The performance of the proposed method on the 3D-Cubic and 3D-FCC lattice
models is further investigated on Data Set-2, which comprises 10 classical
benchmark sequences. According to the comparative results shown in Table 5,
the proposed method, like GAHP, ACO, and PERM, achieves the best energies in
all the instances. In addition, the average energies reached by the proposed
method, together with their low standard deviations, indicate its superiority
over GAHP.
Table 5: Comparison of the best energies obtained by the proposed and existing methods on
the benchmark sequences of Data Set-2 in the 3D-Cubic lattice model.

Seq  E∗   MMF (Proposed)    GAHP [42]         GAPSP [43]        AHEDA [44]   CI    SGA   MA    ACO   HZ    CG    PERM
Name      Best  Avg±Std     Best  Avg±Std     Best  Avg±Std     Best  Avg    [34]  [35]  [39]  [40]  [74]  [33]  [75]
H1   -32  -32   -31.8±0.34  -32   -30.7±0.67  -32   -31.8±0.38  -31   -29.5  -32   -24   -32   -32   -31   -32   -32
H2   -34  -34   -32.0±0.63  -34   -31.2±0.59  -34   -33.0±0.77  -34   -32.3  -33   -24   -34   -34   -32   -34   -34
H3   -34  -34   -32.8±0.54  -34   -32.0±0.80  -34   -33.2±0.44  -34   -32.4  -32   -23   -34   -34   -31   -34   -34
H4   -33  -33   -31.6±0.70  -33   -31.1±0.81  -33   -32.2±0.54  -33   -31.4  -32   -24   -33   -33   -30   -33   -33
H5   -32  -32   -31.5±0.50  -32   -30.5±0.73  -32   -31.5±0.49  -31   -29.8  -30   -28   -32   -32   -30   -32   -32
H6   -32  -32   -30.8±0.50  -32   -29.8±0.78  -32   -31.1±0.38  -32   -30.1  -30   -25   -32   -32   -29   -32   -32
H7   -32  -32   -30.4±0.51  -32   -29.8±0.56  -32   -30.6±0.56  -32   -30.3  -30   -27   -31   -32   -29   -32   -32
H8   -31  -31   -29.7±0.57  -31   -29.3±0.58  -31   -30.3±0.48  -30   -28.5  -30   -26   -31   -31   -29   -31   -31
H9   -34  -34   -32.7±0.61  -34   -31.9±0.66  -34   -33.0±0.37  -34   -32.3  -34   -27   -33   -34   -31   -33   -34
H10  -33  -33   -32.0±0.77  -33   -31.0±0.56  -33   -32.2±0.45  -31   -29.5  -33   -26   -33   -33   -33   -33   -33

The comparative results for the 3D-FCC lattice on Data Set-2 are shown in
Table 6. Here, the proposed method is compared with local-search-based
approaches: Tabu Search [76], Two Neighborhoods Tabu Search (LS-2N) [76], and
Large Neighborhood Search (LNS) [76]. In Table 6, LSBest denotes the best
results obtained by Tabu Search [76] with either randomized initialization (LS)
or the new initialization combined with constraint programming (LS-G);
LS-2NBest denotes the best results obtained by Two Neighborhoods Tabu Search
[76] with either random initialization (LS-2N) or the new initialization
combined with constraint programming (LS-2N-G); and LNSBest denotes the best
results obtained by either of the hybrid Large Neighborhood Search-based
algorithms [76], i.e., Multiple Sequence Reoptimized LNS (LNS-MULT) and 3D
Structure Reoptimized LNS (LNS-3D). The results show that, among all the
methods considered in this analysis, only the proposed method and the hybrid
Large Neighborhood Search-based algorithms (LNS) reach all the best energies.
However, the average energies for LNS are inferior to the average energies
reported by the proposed MMF algorithm for 9 of the 10 sequences (all except
H1).
Table 6: Comparison of the best (and average) energies obtained by the proposed and
existing methods on the benchmark sequences of Data Set-2 in the 3D-FCC lattice model.
The average energy indicates the mean of the best energies obtained by each algorithm in
100 runs.

Seq    MMF (Proposed)      Existing methods
Name                       LSBest        LS-2NBest     LNSBest
       Best (Avg±Std)      Best (Avg)    Best (Avg)    Best (Avg)
H1     -69 (-67.1±0.88)    -65 (-57.5)   -68 (-64.7)   -69 (-67.6)
H2     -69 (-67.6±0.84)    -64 (-56.6)   -69 (-64.3)   -69 (-66.7)
H3     -72 (-69.0±1.02)    -66 (-56.7)   -68 (-62.0)   -72 (-68.0)
H4     -71 (-69.2±0.81)    -65 (-58.0)   -68 (-63.1)   -71 (-67.6)
H5     -70 (-67.8±1.06)    -64 (-57.0)   -68 (-63.8)   -70 (-67.0)
H6     -70 (-68.0±0.84)    -63 (-56.5)   -69 (-63.4)   -70 (-67.5)
H7     -70 (-68.1±1.12)    -63 (-58.1)   -68 (-63.3)   -70 (-66.5)
H8     -69 (-66.2±0.72)    -63 (-55.3)   -67 (-62.2)   -69 (-65.8)
H9     -71 (-69.5±0.90)    -67 (-58.9)   -69 (-64.9)   -71 (-67.9)
H10    -68 (-66.8±0.69)    -64 (-57.5)   -67 (-63.9)   -68 (-65.7)
5.4. Performance of MMF on Functional Protein Model
We use the functional model protein (FMP) to evaluate the optimization
competence of the proposed MMF method. There are more than 15000 FMP instances
in the literature [17, 38]; however, 11 of them are the ones most commonly used
by other methods for experiments. Following the three recently reported
approaches [41, 32, 13], and for the sake of comparison, we present an analysis
with the functional model protein on the same set of benchmark sequences
(BF1–BF11) in a 2D-Square lattice. Table 7 reports the success rate of the
proposed MMF in achieving the best energies over 25 separate runs, along with
the number of fitness evaluations and the average time. The results show that,
although the proposed method and the three existing methods (i.e., IA, EDA,
and CMA) all achieve the optimal energies, only the proposed method attains a
100% success rate for all 11 FMP sequences. Comparing the average time of the
proposed method with that reported for CMA, we notice that the time of MMF is
much lower than that of CMA for 10 of the 11 sequences (all except BF11).
Furthermore, we present FEa and FEb, which indicate the average and best
number of fitness evaluations, respectively, reported by each method. MMF
achieved a better FEb than the existing methods for 6 FMP sequences.
Similarly, the FEa reported by the proposed method is better than that of the
existing methods for 9 FMP sequences.
Table 7: Comparison of the proposed MMF algorithm with three existing techniques for
evaluating functional proteins in the 2D-Square HP lattice model. FEa and FEb indicate
the average and best fitness evaluations, respectively, reported by each method. Ta
indicates the average time and SR the success rate (in percentage).

Seq   IA [41]                     EDA [32]             CMA             MMF (Proposed)
      SR     (FEa, FEb)           SR     (FEa, FEb)    SR   Ta (h:m:s) SR   (FEa, FEb)        Ta (h:m:s)
BF1   100    (32647.7, 3372)      100    (25799.3, -)  100  0:0:8.8    100  (12544.3, 1971)   0:0:1.83
BF2   100    (17526.7, 578)       100    (20593.1, -)  100  0:0:11.9   100  (33380.5, 2189)   0:0:4.27
BF3   56.67  (2403985.3, 100234)  100    (140802, -)   80   5:12:27.2  100  (66137.6, 701)    0:0:4.66
BF4   100    (128015.1, 4955)     100    (108118, -)   100  0:6:54.3   100  (66587.1, 1854)   0:0:8.53
BF5   100    (12095.3, 1047)      100    (55772.6, -)  100  0:0:23.0   100  (7244.3, 2206)    0:0:3.1
BF6   100    (332938.5, 2828)     100    (61577.6, -)  100  0:0:24.4   100  (89646.1, 2278)   0:0:6.5
BF7   100    (584179.8, 10061)    86.67  (3566200, -)  100  0:2:33.1   100  (81673.0, 16081)  0:0:9.3
BF8   100    (38262.6, 1818)      100    (147938, -)   100  0:0:10.3   100  (62766.3, 2815)   0:0:7.8
BF9   100    (281720.8, 3845)     100    (155722, -)   100  0:0:18.3   100  (37732.4, 2454)   0:0:11.5
BF10  100    (100085.43, 2847)    100    (57652.1, -)  100  0:1:54.8   100  (47271.8, 2505)   0:0:5.8
BF11  100    (27743.7, 1007)      100    (11927.1, -)  100  0:0:3.7    100  (46758.17, 2347)  0:0:4.9

Instead of fitness evaluations, CMA [13] reported the generation count.
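
The per-sequence statistics used in this section (best energy, Avg±Std, and success rate over independent runs) can be derived from the best energy of each run. The following is a minimal sketch under stated assumptions: the function name `summarize_runs` and the sample data are hypothetical, a run is counted as successful when it reaches the target energy E∗, and the population standard deviation is used because the paper does not state which variant it reports.

```python
from statistics import mean, pstdev


def summarize_runs(best_energies, target_energy):
    """Summarize independent runs of a minimizer on one sequence.

    best_energies: best (lowest) energy found in each run.
    target_energy: known optimal energy E* of the sequence.
    """
    best = min(best_energies)      # lower energy is better in the HP model
    avg = mean(best_energies)
    std = pstdev(best_energies)    # assumption: population standard deviation
    # A run succeeds if it reaches (or beats) the target energy.
    hits = sum(1 for e in best_energies if e <= target_energy)
    success_rate = 100.0 * hits / len(best_energies)
    return {"best": best, "avg": avg, "std": std, "success_rate": success_rate}


# Hypothetical data: 25 runs, 20 of which reach the optimum of -36.
runs = [-36] * 20 + [-35] * 5
summary = summarize_runs(runs, target_energy=-36)
print(summary)
```

With the hypothetical data above, the summary gives a best energy of -36, an average of -35.8, and a success rate of 80%.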
5.5. Performance of MMF on Long Protein Data Bank (PDB) Sequences
To test the efficacy and the scaling ability of the proposed MMF on long
biological sequences, a set of 10 long Protein Data Bank (PDB) sequences is
employed for evaluation. The results for the 2D-Square, 3D-Cubic, and 3D-FCC
lattice models are shown in Table 8, where the results for the first five
sequences are compared with a recently published state-of-the-art algorithm
[13].
Analyzing the table, we observe that the proposed MMF algorithm obtains better
best energies than those reported in [13] for all five PDB sequences in all
lattice models. The same observation holds for the average energies reported
by [13] and the MMF algorithm. Furthermore, for the 3D-FCC lattice model, we
include five more sequences collected from [77]. The results achieved by the
proposed method for these five sequences are very competitive with those of
[77]. Therefore, the comparative study presented in this section provides
clear evidence of the successful application of the proposed method to long
biological sequences. Since the best energies are known for the first 5 PDB
sequences of Table 8, we have plotted the improvement rate (in %) in Figure 4
to show the effectiveness of the proposed method. Figure 4 shows that the
improvement of MMF mostly varies from 15% to 35%, while for only one PDB
sequence the improvement is nominal.
Table 8: Performance of the proposed MMF algorithm on long Protein Data Bank sequences
in 2D-Square, 3D-Cubic, and 3D-FCC lattice models. A blank cell indicates that the
energy of a sequence is unavailable for a method.

PDB         2D-Square                           3D-Cubic                            3D-FCC
Seq         [13]              MMF (Proposed)    [13]              MMF (Proposed)    [13]               MMF (Proposed)
ID    Len   Best  Avg±Std     Best  Avg±Std     Best  Avg±Std     Best  Avg±Std     Best  Avg±Std      Best  Avg±Std
1B0F  218   -95   -90.6±3.0   -114  -109.0±1.5  -162  -153.4±6.8  -194  -185.0±4.0  -385  -382.4±2.0   -451  -438.2±6.8
1BQS  209   -86   -78.2±4.3   -101  -96.6±2.1   -135  -131.6±2.1  -174  -164.7±3.7  -334  -329.2±4.8   -400  -378.7±9.5
1CWR  211   -56   -54.0±1.4   -75   -72.5±1.5   -94   -89.2±3.1   -127  -123.7±2.7  -224  -222.4±1.6   -291  -273.8±6.0
1NQC  217   -64   -61.2±1.6   -81   -79.3±0.9   -112  -107.8±3.3  -144  -135.9±4.4  -274  -265.8±6.3   -335  -308.2±7.2
1RTG  210   -73   -70.2±1.9   -94   -89.5±1.9   -128  -122.4±4.0  -160  -153.5±2.9  -359  -312.8±25.9  -372  -356.8±8.1
1BEC  238                     -89   -83.7±1.7                     -144  -138.9±1.8                     -331  -316.0±7.0
1BPB  248                     -142  -136.4±3.5                    -142  -136.4±3.5                     -314  -302.0±6.9
1DUA  242                     -85   -81.0±1.7                     -143  -136.7±2.7                     -321  -306.2±6.1
1FBN  230                     -95   -92.6±1.5                     -163  -157.1±3.4                     -379  -364.8±5.8
2GPQ  217                     -77   -73.0±1.7                     -125  -121.2±1.9                     -301  -282.2±7.1
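
The improvement rates plotted in Figure 4 can be approximated from the best energies in Table 8. The sketch below is illustrative only: it assumes the improvement rate is the relative gain of the MMF best energy over the best energy reported in [13], and the names `TABLE8_BEST`, `improvement_rate`, and `improvement_table` are hypothetical helpers, not part of the authors' implementation.

```python
# Best energies from Table 8 as (best reported in [13], best found by MMF).
TABLE8_BEST = {
    "1B0F": {"2D-Square": (-95, -114), "3D-Cubic": (-162, -194), "3D-FCC": (-385, -451)},
    "1BQS": {"2D-Square": (-86, -101), "3D-Cubic": (-135, -174), "3D-FCC": (-334, -400)},
    "1CWR": {"2D-Square": (-56, -75), "3D-Cubic": (-94, -127), "3D-FCC": (-224, -291)},
    "1NQC": {"2D-Square": (-64, -81), "3D-Cubic": (-112, -144), "3D-FCC": (-274, -335)},
    "1RTG": {"2D-Square": (-73, -94), "3D-Cubic": (-128, -160), "3D-FCC": (-359, -372)},
}


def improvement_rate(known_best, new_best):
    """Relative gain (in %) of new_best over known_best.

    Energies are negative and lower is better, so a positive value means
    new_best improves on known_best. The formula itself is an assumption.
    """
    return 100.0 * (known_best - new_best) / abs(known_best)


def improvement_table(best_energies):
    """Map each sequence and lattice to its improvement rate, rounded to 0.1%."""
    return {
        seq: {lattice: round(improvement_rate(old, new), 1)
              for lattice, (old, new) in per_lattice.items()}
        for seq, per_lattice in best_energies.items()
    }


rates = improvement_table(TABLE8_BEST)
for seq, per_lattice in rates.items():
    print(seq, per_lattice)
```

Under this assumed formula, the rates lie between about 17% and 35% for all entries except 1RTG on the 3D-FCC lattice (about 3.6%), which agrees with the description of Figure 4 in the text.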
Figure 4: Improvement rate in MMF over existing best energies for 5 PDB sequences
(1B0F, 1BQS, 1CWR, 1NQC, 1RTG) in the 2D-Square, 3D-Cubic, and 3D-FCC lattice models.
The vertical axis shows the improvement rate (in %) on a scale from 0 to 40.

Analyzing all the above results (i.e., Tables 1–8 and Figure 4), we observe
that the proposed method exhibits superior performance on all or most of the
benchmark sequences. While the results presented here are based on a common
and popular scoring metric, other global similarity measures for structure
evaluation (e.g., SP-Score, TM-Score) or even comparison with native
structures can be used. This is beyond the scope of the current research and
can form part of future research work.
6. Conclusion
In this paper, we have proposed a novel memetic framework to deal with the
multimodal search space of the protein structure prediction problem with
simplified models. The proposed method partitions the population into three
different states, where each state aims to achieve a specific objective and
balances two important factors, i.e., diversity and selection pressure,
according to the designated target of that state. Thus, the proposed MMF
accomplishes the mutually exclusive goals of multimodal optimization:
“climbing up the peaks” and “passing through the valleys of reduced fitness”
to climb a more promising peak. The proposed MMF has an intrinsic ability for
parallelism and is generic in the sense that it can be applied to any
multimodal problem. To assess the efficacy of the proposed method, we have
applied it to different lattice models and compared the results with other
state-of-the-art algorithms. The experimental results on different lattice
models for various instances, as well as for long biological sequences, show
that the proposed method performs better than the other algorithms, thereby
establishing its ability to deal with this complex multimodal problem.
As part of further research, we have undertaken to develop appropriate
theoretical underpinnings to establish the effectiveness of exploration in a
multimodal optimization problem, namely, the protein structure prediction
problem.
References
[1] C. Levinthal, Are there pathways for protein folding?, Journal de Chimie Physique 65 (1968) 44–45.
[2] H. S. Lopes, Evolutionary algorithms for the protein folding problem: A review and current trends, Computational Intelligence in Biomedicine and Bioinformatics 151 (2008) 297–315.
[3] W. E. Hart, A. Newman, Protein structure prediction with lattice models, Methods (2001) 1–24.
[4] A. Kolinski, Protein modeling and structure prediction with a reduced representation, Acta Biochimica Polonica (English Edition) 51 (2004) 349–372.
[5] P. Crescenzi, D. Goldman, C. H. Papadimitriou, A. Piccolboni, M. Yannakakis, On the complexity of protein folding, in: Research in Computational Molecular Biology, 1998, pp. 61–62.
[6] B. Berger, T. Leighton, Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete, Journal of Computational Biology 5 (1) (1998) 27–40.
[7] J. D. Bryngelson, J. N. Onuchic, N. D. Socci, P. G. Wolynes, Funnels, pathways, and the energy landscape of protein folding: a synthesis, Proteins: Structure, Function, and Bioinformatics 21 (3) (1995) 167–195.
[8] D. Hinds, M. Levitt, A lattice model for protein structure prediction at low resolution, Proceedings of the National Academy of Sciences 89 (7) (1992) 2536–2540.
[9] R. K. Ursem, Models for evolutionary algorithms and their applications in system identification and control optimization, no. DS-03-6, Basic Research in Computer Science Series, 2003.
[10] D. E. Goldberg, J. Richardson, Genetic algorithms with sharing for multimodal function optimization, in: International Conference on Genetic Algorithms, 1987, pp. 41–49.
[11] K. A. De Jong, An analysis of the behavior of a class of genetic adaptive systems, Ph.D. thesis, Department of Computer and Communication Sciences, University of Michigan (1975).
[12] R. K. Ursem, Multinational GAs: Multimodal optimization techniques in dynamic environments, in: Genetic and Evolutionary Computation Conference, 2000, pp. 19–26.
[13] M. K. Islam, M. Chetty, Clustered memetic algorithm with local heuristics for ab initio protein structure prediction, IEEE TEC 17 (4) (2013) 558–576.
[14] T. Park, K. R. Ryu, A dual-population genetic algorithm for adaptive diversity control, IEEE TEC 14 (6) (2010) 865–884.
[15] K. Lau, K. A. Dill, A lattice statistical mechanics model of the conformational and sequence spaces of proteins, Macromolecules 22 (10) (1989) 3986–3997.
[16] H. S. Chan, K. A. Dill, Comparing folding codes for proteins and polymers, Proteins: Structure, Function, and Genetics 24 (3) (1996) 335–344.
[17] J. D. Hirst, The evolutionary landscape of functional model proteins, Protein Engineering 12 (9) (1999) 721–726.
[18] B. P. Blackburne, J. D. Hirst, Evolution of functional model proteins, The Journal of Chemical Physics 115 (4) (2001) 1935–1942.
[19] D. W. Miller, K. A. Dill, Ligand binding to proteins: the binding landscape model, Protein Science 6 (10) (1997) 2166–2179.
[20] I. Dotu, M. Cebrián, P. Van Hentenryck, P. Clote, Protein structure prediction with large neighborhood constraint programming search, in: Principles and Practice of Constraint Programming, 2008, pp. 82–96.
[21] N. Krasnogor, W. Hart, J. Smith, D. Pelta, Protein structure prediction with evolutionary algorithms, in: Genetic and Evolutionary Computation Conference, 1999, pp. 1596–1601.
[22] M. T. Hoque, Genetic algorithm for ab initio protein structure prediction based on low resolution models, Ph.D. thesis, Faculty of IT, Monash University, Australia (2007).
[23] M. Cebrián, I. Dotu, P. Van Hentenryck, P. Clote, Protein structure prediction on the face centered cubic lattice by local search, in: AAAI Conference on Artificial Intelligence, Vol. 8, 2008, pp. 241–246.
[24] R. Unger, J. Moult, Genetic algorithms for protein folding simulations, Journal of Molecular Biology 231 (1) (1993) 75–81.
[25] N. Lesh, M. Mitzenmacher, S. Whitesides, A complete and effective move set for simplified protein folding, in: Research in Computational Molecular Biology, 2003, pp. 188–195.
[26] H.-J. Böckenhauer, A. Z. M. D. Ullah, L. Kapsokalivas, K. Steinhöfel, A local move set for protein folding in triangular lattice models, Algorithms in Bioinformatics 5251 (2008) 369–381.
[27] J. Del Ser, E. Osaba, D. Molina, X.-S. Yang, S. Salcedo-Sanz, D. Camacho, S. Das, P. N. Suganthan, C. A. Coello Coello, F. Herrera, Bio-inspired computation: Where we stand and what’s next, Swarm and Evolutionary Computation 48 (2019) 220–250.
[28] N. X. Vinh, M. Chetty, R. Coppel, P. P. Wangikar, Polynomial time algorithm for learning globally optimal dynamic Bayesian network, in: B.-L. Lu, L. Zhang, J. Kwok (Eds.), Neural Information Processing, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 719–729.
[29] N. X. Vinh, M. Chetty, R. L. Coppel, P. P. Wangikar, Gene regulatory network modeling via global optimization of high-order dynamic Bayesian network, BMC Bioinformatics 13 (2012) 131.
[30] K. Yue, K. A. Dill, Sequence-structure relationships in proteins and copolymers, Physical Review E 48 (3) (1993) 2267–2278.
[31] S. Will, Exact, constraint-based protein structure prediction in simple models, Ph.D. thesis, Friedrich-Schiller-University Jena, Germany (2005).
[32] R. Santana, P. Larrañaga, J. A. Lozano, Protein folding in simplified models with estimation of distribution algorithms, IEEE TEC 12 (4) (2008) 418–438.
[33] T. C. Beutler, K. A. Dill, A fast conformational search strategy for finding low energy structures of model proteins, Protein Science 5 (10) (1996) 2037–2043.
[34] L. Toma, S. Toma, Contact interactions method: A new algorithm for protein folding simulations, Protein Science 5 (1996) 147–153.
[35] M. M. Khimasia, P. V. Coveney, Protein structure prediction as a hard optimization problem: the genetic algorithm approach, Molecular Simulation 19 (4) (1997) 205–226.
[36] D. E. Goldberg, Genetic algorithms in search, optimization, and machine learning, Vol. 412, Addison-Wesley, Reading, 1989.
[37] R. König, T. Dandekar, Improving genetic algorithms for protein folding simulations by systematic crossover, BioSystems 50 (1) (1999) 17–25.
[38] N. Krasnogor, B. Blackburne, E. K. Burke, J. D. Hirst, Multimeme algorithms for protein structure prediction, in: International Conference on Parallel Problem Solving from Nature, 2002, pp. 769–778.
[39] A. Bazzoli, G. B. Tettamanzi, A memetic algorithm for protein structure prediction in a 3D lattice HP model, EvoWorkshop 3005 (2004) 1–10.
[40] A. Shmygelska, H. H. Hoos, An ant colony optimization algorithm for the 2D and 3D hydrophobic polar protein folding problem, BMC Bioinformatics 6 (30) (2005) 1–22.
[41] V. Cutello, G. Nicosia, M. Pavone, J. Timmis, An immune algorithm for protein structure prediction on lattice models, IEEE TEC 11 (1) (2007) 101–117.
[42] F. L. Custódio, H. J. Barbosa, L. E. Dardenne, A multiple minima genetic algorithm for protein structure prediction, Applied Soft Computing 15 (2014) 88–99.
[43] B. Boskovic, J. Brest, Genetic algorithm with advanced mechanisms applied to the protein structure prediction in a hydrophobic-polar model and cubic lattice, Applied Soft Computing 45 (2016) 61–70.
[44] A. Morshedian, J. Razmara, S. Lotfi, A novel approach for protein structure prediction based on an estimation of distribution algorithm, Soft Computing (2018) 1–12.
[45] D. E. Goldberg, The design of innovation: Lessons from and for competent genetic algorithms, Kluwer Academic Publishers, 2002.
[46] R. Nazmul, Multi-modal memetic framework using h-core for low resolution protein structure prediction (2015).
[47] P. P. Bonissone, R. Subbu, N. Eklund, R. T. Kiehl, Evolutionary algorithms + domain knowledge = real-world evolutionary computation, IEEE TEC 10 (3) (2006) 256–280.
[48] J. H. Holland, Adaptation in natural and artificial systems, The MIT Press, 1992.
[49] R. Nazmul, M. Chetty, R. Samudrala, D. Chalmers, Protein structure prediction based on optimal hydrophobic core formation, in: IEEE Congress on Evolutionary Computation, 2012, pp. 1–9.
[50] R. Nazmul, M. Chetty, A knowledge-based initial population generation in memetic algorithm for protein structure prediction, in: International Conference on Neural Information Processing, Vol. 8227, 2013, pp. 546–553.
[51] N. Noman, H. Iba, A new generation alternation model for differential evolution, in: Genetic and Evolutionary Computation Conference, 2006, pp. 1265–1272.
[52] K. A. De Jong, Evolutionary computation: a unified approach, The MIT Press, 2006.
[53] S. W. Mahfoud, Niching methods for genetic algorithms, Ph.D. thesis, University of Illinois at Urbana-Champaign, IlliGAL Report No. 95001 (1995).
[54] I. Berenboym, M. Avigal, Genetic algorithms with local search optimization for protein structure prediction problem, in: Genetic and Evolutionary Computation Conference, 2008, pp. 1097–1098.
[55] T. Blickle, L. Thiele, A comparison of selection schemes used in genetic algorithms, in: TIK-Report, 1995.
[56] R. Nazmul, M. Chetty, An adaptive strategy for assortative mating in genetic algorithm, in: IEEE Congress on Evolutionary Computation, 2013, pp. 2237–2244.
[57] C. M. Fernandes, R. Tavares, C. Munteanu, A. C. Rosa, Using assortative mating in genetic algorithms for vector quantization problems, in: ACM Symposium on Applied Computing, 2001, pp. 361–365.
[58] R. Nazmul, M. Chetty, A priority based parental selection method for genetic algorithm, in: Genetic and Evolutionary Computation Conference, 2013, pp. 125–126.
[59] R. Nazmul, M. Chetty, Sib-based survival selection technique for protein structure prediction, in: International Conference on Neural Information Processing, Vol. 8835, 2014, pp. 470–478.
[60] O. J. Mengshoel, D. E. Goldberg, Probabilistic crowding: Deterministic crowding with probabilistic replacement, in: Genetic and Evolutionary Computation Conference, 1999, pp. 409–416.
[61] G. Iacca, F. Neri, E. Mininno, Y. Ong, M. Lim, Ockham’s razor in memetic computing: Three stage optimal memetic exploration, CoRR abs/1810.08669.
[62] M. T. Hoque, M. Chetty, L. S. Dooley, Generalized schemata theorem incorporating twin removal for protein structure prediction, in: J. C. Rajapakse, B. Schmidt, G. Volkert (Eds.), Pattern Recognition in Bioinformatics, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 84–97.
[63] M. K. Islam, M. Chetty, Clustered memetic algorithm for protein structure prediction, in: IEEE Congress on Evolutionary Computation, 2010, pp. 1–8.
[64] [Weblink: Monash SunGrid] https://confluence-vre.its.monash.edu.au/display/mcgwiki/Monash+Sun+Grid+Overview [Last accessed on 20-06-2017].
[65] C. Thachuk, A. Shmygelska, H. H. Hoos, A replica exchange Monte Carlo algorithm for protein folding in the HP model, BMC Bioinformatics 8 (1) (2007) 342.
[66] H.-P. Hsu, V. Mehra, W. Nadler, P. Grassberger, Growth algorithms for lattice heteropolymers at low temperatures, The Journal of Chemical Physics 118 (1) (2003) 444–451.
[67] C.-J. Lin, M.-H. Hsieh, An efficient hybrid Taguchi-genetic algorithm for protein folding simulation, Expert Systems with Applications 36 (10) (2009) 12446–12453.
[68] M. Milostan, P. Lukasiak, K. A. Dill, J. Blazewicz, A tabu search strategy for finding low energy structures of proteins in HP-model, in: Computational Methods in Science and Technology, Vol. 10, 2004, pp. 7–19.
[69] C. Huang, X. Yang, Z. He, Protein folding simulations of 2D HP model by the genetic algorithm based on optimal substructures, Computational Biology and Chemistry 34 (2010) 137–142.
[70] F. Liang, W. H. Wong, Evolutionary Monte Carlo for protein folding simulations, The Journal of Chemical Physics 115 (7) (2001) 3374–3380.
[71] C. Cotta, Protein structure prediction using evolutionary algorithms hybridized with backtracking, Lecture Notes in Computer Science 2687 (2003) 321–328.
[72] C.-J. Lin, S.-C. Su, Protein 3D HP model folding simulation using a hybrid of genetic algorithm and particle swarm optimization, International Journal of Fuzzy Systems 13 (2).
[73] K.-C. Wong, K.-S. Leung, M.-H. Wong, Protein structure prediction on a lattice model via multimodal optimization techniques, in: Genetic and Evolutionary Computation Conference, 2010, pp. 155–162.
[74] K. M. Fiebig, K. A. Dill, Protein core assembly processes, The Journal of Chemical Physics 98 (4) (1993) 3475–3487.
[75] H.-P. Hsu, V. Mehra, W. Nadler, P. Grassberger, Growth-based optimization algorithm for lattice heteropolymers, Physical Review E 68 (2) (2003) 021113.
[76] I. Dotu, M. Cebrián, P. Van Hentenryck, P. Clote, On lattice protein structure prediction revisited, IEEE TCBB 8 (6) (2011) 1620–1632.
[77] J.-J. Tsay, S.-C. Su, An effective evolutionary algorithm for protein folding on 3D FCC HP model by lattice rotation and generalized move sets, Proteome Science 11 (Suppl 1) (2013) S19.
23-Jun-19
Dear Sir/Madam,
We would like to mention that we have no conflicts of interest to disclose.
Please address all correspondence concerning this manuscript to me at

ahsan.chowdhury@federation.edu.au
Thank you for your consideration of this manuscript.
Yours sincerely,
Ahsan Raja Chowdhury

Faculty of Science and Technology
Federation University
Mount Helen, Vic-3350, Australia
