
Information Sciences 466 (2018) 25–43

Contents lists available at ScienceDirect

Information Sciences
journal homepage: www.elsevier.com/locate/ins

A comprehensive survey on genetic algorithms for DNA motif prediction
Nung Kion Lee a, Xi Li b, Dianhui Wang c,∗
a Faculty of Cognitive Sciences and Human Development, Universiti Malaysia Sarawak, Sarawak, Malaysia
b John Curtin School of Medical Research, Australia National University, Canberra, Australia
c Department of Computer Science and Information Technology, La Trobe University Melbourne, Australia

Article info

Article history:
Received 3 August 2017
Revised 18 June 2018
Accepted 1 July 2018
Available online 19 July 2018

Keywords:
Genetic algorithm
DNA motif prediction

Abstract

Computational DNA motif discovery is important because it allows for speedy and cost-effective analysis of
sequences enriched with DNA motifs, large-scale comparative studies, and the testing of hypotheses on
biological problems. In this work, we provide a comprehensive survey on DNA motif discovery using genetic
algorithm (GA). According to how the solution domain is encoded, we categorize existing GA-based motif
discovery techniques into search for consensus and search by position (matrix). Within each category, we
make distinctive algorithmic comparisons based on model representations, fitness functions, genetic
operators, data post-processing, as well as the experimental results. Moreover, we discuss the strengths and
weaknesses of different approaches with recommendations for practical use. This survey paper is useful as a
guideline for practitioners who would like to design GA solutions for DNA motif prediction in the future.
© 2018 Elsevier Inc. All rights reserved.

1. Introduction

Computational identification of functional signals in the genome has proven valuable for medicine and for the
advancement of the molecular and biological sciences [64]. These functional signals include, but are not limited
to, DNA coding regions (or genes), gene expression regulatory elements, transcription start sites (TSS), cleavage
sites, splicing junction sites, and protein functional sites [25,26,41,95]. Among them, short DNA sequence segments
(10–30 bp), named regulatory elements or transcription factor binding sites (binding sites for short), have been
extensively studied due to their importance in gene regulation. Those subtle yet conserved sites are bound by
particular transcription factor (TF) proteins to control the spatial and temporal expression of genes. A
transcription factor has binding preferences for a set of DNA sequences, with distinct affinities. Due to
conservation, those binding sequences share a recurring pattern that indicates the TF's specificity. A DNA motif
refers to the set of binding sequences of a TF [8], while the “motif pattern” is its preferred sequence pattern.
The aim of computational DNA motif discovery is to identify the motifs and their instances in input sequences
enriched with binding sites of transcription factors.
Traditional experimental methods, such as DNase footprinting [10] and gel-shift assay [35], give accurate
identification results but are costly and time-consuming for genome-wide motif analysis. Recently, driven by
next generation sequencing techniques that produce massive amounts of genomic sequences and gene expression profiles,


Corresponding author.
E-mail addresses: nklee@unimas.my (N.K. Lee), sean.li@anu.edu.au (X. Li), dh.wang@latrobe.edu.au (D. Wang).

https://doi.org/10.1016/j.ins.2018.07.004
0020-0255/© 2018 Elsevier Inc. All rights reserved.
26 N.K. Lee et al. / Information Sciences 466 (2018) 25–43

numerous computational techniques have been proposed for de novo motif analysis and demonstrate good potential for
problem solving, such as pattern enumeration, stochastic optimization, maximum likelihood, and Gibbs sampling (see
reviews [30,41]). Those methods are feasible for small- to medium-sized datasets or for prediction of short motif
patterns. Wet-lab techniques such as chromatin immunoprecipitation (ChIP) followed by hybridization to an array
(ChIP-chip), sequencing (ChIP-seq), or exonuclease digestion (ChIP-exo) are effective for genome-wide binding site
identification of proteins of interest [61]. Computational approaches are needed that can identify binding motif
patterns at a relatively large scale and can utilize ChIP sequences for discovery. Supervised techniques such as
support vector machines are popular for modeling large-scale ChIP-Seq, enhancer, and histone datasets [24,29], while
deep neural networks and convolutional neural networks [43] have recently become popular for modeling histone and
chromatin sequences. Stochastic search techniques such as genetic algorithm (GA) [36], genetic programming, and tabu
search have established a niche of their own in DNA motif discovery. A recent large-scale motif discovery evaluation
using real datasets of 12 transcription factors [39] demonstrated that the genetic algorithm tool rGADEM performed
the best among the four evaluated tools. We believe the existing works are worth reviewing and analysing owing to
GA's distinctive search capability and its future potential in solving motif discovery or problems of a similar
nature. While there are many review papers on DNA motif discovery (see [11,21,41,84]), no comprehensive survey has
focused on GA techniques. This survey aims to close that gap in the literature by surveying some prominent works
that use GA techniques for DNA motif discovery.
This paper is organized as follows. Section 2 presents the DNA motif prediction problem and two motif pattern
representation models. Section 3 reviews the existing GAs for DNA motif prediction; existing works are categorized
and reviewed based on some characteristics. Section 4 presents benchmark results of four GA tools using 8 real DNA
datasets. Section 5 summarizes the key points with regard to the use of GA techniques for DNA motif prediction.
Other non-GA tools are briefly discussed in Section 6. The last section concludes this review and offers some
directions for future work.

2. Background

DNA motif discovery can be formulated as a pattern search problem as follows: given a set of input sequences with
embedded binding sites, look for motif patterns that optimize an objective function. Such a motif pattern represents
the commonality of nucleotides shared among the binding sites. A motif pattern of length l is an instance of the
enumeration on the character set Σ = {A, C, G, T}. If motif instances are conserved and match each other exactly,
the problem can be easily solved by enumerating all 4^l possible consensus patterns of length l using efficient data
structures, such as a suffix tree [31]. Unfortunately, exact conservation is rare due to frequently observed DNA
sequence-level variations, i.e., nucleotide mutations, insertions, and deletions [70]. Hence, we consider DNA motifs
to be subtle. Subtle motifs are defined in [47] as sequences [that] have been subjected to extensive change. Later,
Pevzner and Sze [68] formulated the discovery of subtle motifs as the planted-(l, d) problem: a search for motif
patterns of length l with at most d mismatches to at least one l-mer in every input sequence. An l-mer is a
contiguous nucleotide segment of length l in a DNA sequence. In addition, there are gapped motif patterns [34] in
which two conserved short patterns are separated by a fixed number of “do not care” nucleotides. Besides being
formulated as a consensus pattern search, the motif discovery problem can be defined as a multiple local alignment
problem that identifies a set of k-mers, one from every input sequence [3,82]. Both problem formulations are NP-hard
[51], which means that for sufficiently large motif lengths (e.g., k > 15 bp) they cannot be solved to optimality
within polynomially bounded computation time. Furthermore, the size of the search space grows exponentially with
problem size.
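To make the planted-(l, d) formulation concrete, the following minimal sketch enumerates all 4^l patterns and keeps those that occur, with at most d mismatches, in every input sequence. It is a brute-force illustration only (no suffix tree); the function names and the toy sequences are assumptions made for this example, not part of any surveyed tool.

```python
from itertools import product

ALPHABET = "ACGT"

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurs_with_mismatches(pattern, sequence, d):
    """True if some l-mer of `sequence` is within d mismatches of `pattern`."""
    l = len(pattern)
    return any(hamming(pattern, sequence[i:i + l]) <= d
               for i in range(len(sequence) - l + 1))

def planted_motif_search(sequences, l, d):
    """Enumerate all 4^l patterns; keep those found in every sequence."""
    hits = []
    for letters in product(ALPHABET, repeat=l):
        pattern = "".join(letters)
        if all(occurs_with_mismatches(pattern, s, d) for s in sequences):
            hits.append(pattern)
    return hits

# Toy data (hypothetical): each sequence carries a variant of ACGT
seqs = ["TTACGTTT", "GGACCTGG", "CCAGGTCC"]
print(planted_motif_search(seqs, l=4, d=1))
```

The outer loop visits 4^l patterns, which is why this approach is only workable for short motifs.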
While they differ in terms of search space size, pattern search and alignment methods both aim at discovering
interesting motifs that are over-represented in the input sequences. The notion of over-representation is
algorithm-specific, but generally interesting motifs appear many times in a set of related sequences (i.e., upstream
sequences of co-expressed genes, or sequences with evidence of being bound by a TF) and are reasonably conserved
[70]. On the other hand, interesting motifs do not necessarily occur frequently [70], and a simple motif discovery
method based on frequency counts will not be able to discriminate true motifs from false ones. Therefore,
statistical over-representation methods are often used to confirm the significance of putative motifs returned by
motif discovery tools [76,82]. For that, background sequences (sequences that do not contain motifs) are needed to
establish background probabilities, perform cross-tests, or evaluate the objective function.
Searching for motif patterns is clearly an optimization problem whose goal is to identify motif patterns that
minimize or maximize an objective function. Motif patterns can be searched using deterministic [33], stochastic [3],
or heuristic [44] search methods. In addition, clustering approaches have been proposed [81,90]. Deterministic
methods perform brute-force searches by enumerating all possible patterns of specified motif lengths [33,67] and
are guaranteed to produce identical solutions given the same input dataset and parameters. An attractive feature of
exhaustive search is that its running time scales linearly with dataset size, making it suitable for genome-scale
motif analysis. However, the running time increases exponentially with motif length, which makes it practical only
for discovering short patterns, e.g., a maximum length of 12 bp in Weeder [67] and 8 bp in Oligo-analysis [33].
Clustering approaches require defining a similarity metric between l-mers in input sequences and a cluster model for
cluster assignments. In [90], the MISCORE-based motif score (MMS) with a localized conservation measure is used as
the similarity metric between an l-mer and the cluster models, while Tapan and Wang [81] employed a composite of MMS
and a B-SCORE-based score [85] as the similarity metric for l-mer cluster assignments. After iterative updates of
the cluster models using either a crisp or a fuzzy clustering implementation, the l-mers assigned to a cluster are
considered the motif's instances.

Heuristic algorithms search for solutions that are locally rather than globally optimal in order to remain
computationally tractable. They sacrifice the best solutions for reduced space and time complexity. Instead of exact
search, heuristic algorithms employ approximation methods, which return near-optimal solutions in a relatively short
time. The genetic algorithm [36] belongs to a class of derivative-free global heuristic search algorithms that
perform hill-climbing searches guided by a fitness function that the problem seeks to optimize. The working
principle of GA is rooted in Darwin's theory of natural selection and in genetic mechanisms (reproduction, mutation,
recombination). The search starts with an initial guess of possible solutions of a problem, encoded as a population
of individuals. The principle of survival is then applied, in which better solutions, as quantified by the fitness
function, produce offspring and survive to the next generation after the selection process. Offspring are produced
by applying genetic operators such as mutation and crossover. The mutation operator changes the allele value of
selected genes, whereas the crossover operator exchanges genes between two individuals. These steps are iterated
until convergence or until a fixed number of generations is reached.

2.1. Motif representation

Essential to a motif discovery algorithm is how it represents the information encoded in a set of binding sites. The
motif representation is a model that represents the specificity of DNA binding proteins [78]. Suitable motif
representations maximize the description of information (nucleotide contexts) shared among the binding sites that
are recognized by the same transcription factor. The design of an effective motif representation is the first and
key step towards the success of a search. It should effectively describe and abstract the possible appearance of
motifs in the input datasets while also defining the complexity of the solution space. Nevertheless, since motifs
differ in some of their characteristics, no single representation method best suits all motifs. Two common motif
representations are the consensus sequence and the profile. Another way of classifying motif models is based on
whether they are probabilistic or deterministic. There are three types of deterministic representation: oligos,
regular expressions, and mismatch expressions [9].
A deterministic motif pattern gives a boolean answer of a match or mismatch to a sequence segment. It is necessary
to define a pattern grammar that determines the valid tokens in each position of a motif pattern. Such a token could
be a gap/wildcard, a nucleotide in Σ, or a degenerate character [9]. A degenerate character indicates weaker
conservation of the nucleotide distribution at a given position. A mismatch mode is also used in many pattern
discovery tools [31], where a sequence segment is called a “match” to a pattern if the number of mismatches is lower
than a predefined threshold. A probabilistic pattern model returns a probabilistic score in [0, 1] for a sequence
segment in a DNA sequence. A match is determined by a threshold score value. For example, the average mismatch motif
model by Wang and Li [87] returns the average mismatch score between a sequence segment and a motif's instances. The
probabilistic model can be interpreted as a learning model, such as a classification system. In this case, the
learned model produces a predicted score for an input DNA sequence segment.

2.1.1. Consensus
A consensus pattern (also termed monad) represents a motif as a string (s_1 s_2 s_3 ... s_l), where each s_i ∈ Σ
represents the most frequent base(s) that appear in position i of a set of aligned binding sites. For example, the
motif consensus of the ABF1 TF of Saccharomyces cerevisiae is TATCGTATTGCATGAT. In real cases, the TF could bind to
more than one base with the same affinity in certain positions of its binding sites. For example, the consensus of
the MEF2 motif is CT(A/T)(A/T)AAATAG, which indicates that it mostly binds to either base A or T in the third and
fourth positions of the aligned sites. To account for such variability, the degenerate IUPAC codes can be used. The
IUPAC character set defines a more flexible representation of motif consensus by combining the four bases with the
logical operators OR or NOT. By using the IUPAC nomenclature, the MEF2 motif consensus can be represented as
CTWWAAATAG. Fig. 1b illustrates an example of a motif consensus pattern obtained from the multiple alignment of
binding sites.
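As a small illustration of matching against a degenerate consensus, the sketch below maps each IUPAC code to the bases it stands for and scans a sequence for exact (deterministic) matches. The helper names and the test sequence are assumptions for this example; only the MEF2 consensus CTWWAAATAG comes from the text.

```python
# IUPAC nucleotide codes mapped to the bases they stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches(consensus, segment):
    """True if every base of `segment` is permitted by the code at that position."""
    return len(consensus) == len(segment) and all(
        base in IUPAC[code] for code, base in zip(consensus, segment))

def find_sites(consensus, sequence):
    """Start positions (0-based) of all segments matching the consensus."""
    l = len(consensus)
    return [i for i in range(len(sequence) - l + 1)
            if matches(consensus, sequence[i:i + l])]

# Hypothetical sequence containing one MEF2-like site
print(find_sites("CTWWAAATAG", "GGCTTAAAATAGCC"))
```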
A specific consensus motif pattern called spaced-dyad is a string pattern with two monads separated by a fixed or
variable number of “do not care” characters. This type of motif is common in bacteria, especially for the
helix-turn-helix transcription factor proteins [34]. For example, the yeast Gal4p motif pattern is CGG-X(11)-CCG,
where X(11) represents the 11 “do not care” characters that are allowed between the two short consensus patterns.
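A spaced-dyad pattern of this kind maps naturally onto a regular expression. The sketch below builds one from two monads and a gap range; the helper name `dyad_regex` and the test sequence are hypothetical, and the lookahead is used only so that overlapping occurrences are also reported.

```python
import re

def dyad_regex(left, right, min_gap, max_gap=None):
    """Compile a pattern for two monads separated by min_gap..max_gap bases."""
    max_gap = min_gap if max_gap is None else max_gap
    # The lookahead lets overlapping occurrences be found as well
    return re.compile(f"(?=({left}[ACGT]{{{min_gap},{max_gap}}}{right}))")

gal4 = dyad_regex("CGG", "CCG", 11)          # CGG-X(11)-CCG from the text
seq = "AA" + "CGG" + "ACTGACTGACT" + "CCG" + "TT"   # toy sequence
print([m.start() for m in gal4.finditer(seq)])
```

Passing both `min_gap` and `max_gap` covers the variable-gap case mentioned above.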
Consensus representation is simple and easy to compute and has been widely accepted as a method to represent or
visualize motifs. However, it has several weaknesses. The consensus representation of a motif is not unique, since
it depends on the voting method used to produce it [22]. The voting method selects the representative bases in each
position of a motif, which requires defining one or more threshold values. The threshold values are usually chosen
arbitrarily; hence, different authors might produce different consensus patterns. In addition, consensus
representation fails to depict the quantitative information of the binding sites [71]. For instance, the third
position of the MEF2 motif consensus, W (A or T), suggests that bases A and T occur with the same frequency, whereas,
in fact, the occurrence of base A is 80% more likely than base T. Meanwhile, the 80% occurrence of A in the sixth
position is treated the same as base A in the seventh position. When the consensus is used for motif detection, both
bases would give a similar match score, which is inaccurate in some sense.
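The dependence on voting thresholds can be seen in a few lines of code. The rule below is a hypothetical one chosen for illustration (a single base if its frequency reaches `single`, otherwise a two-base IUPAC code if the top two bases together reach `double`, otherwise N); it is not the rule of [22], and the aligned sites are toy data.

```python
from collections import Counter

# IUPAC codes for the six two-base combinations
TWO_BASE_CODE = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
                 frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M"}

def consensus(sites, single=0.7, double=0.9):
    """Vote a consensus column by column using two assumed thresholds."""
    out = []
    for column in zip(*sites):
        ranked = Counter(column).most_common()
        n = len(column)
        if ranked[0][1] / n >= single:
            out.append(ranked[0][0])                      # one dominant base
        elif len(ranked) > 1 and (ranked[0][1] + ranked[1][1]) / n >= double:
            out.append(TWO_BASE_CODE[frozenset(ranked[0][0] + ranked[1][0])])
        else:
            out.append("N")                               # no clear preference
    return "".join(out)

# Toy alignment: A occurs in 80% of sites at position 3 and 60% at position 4
sites = ["CTAAAAATAG", "CTATAAATAG", "CTAAAAATAG", "CTTAAAATAG", "CTATAAATAG"]
print(consensus(sites, single=0.7))
print(consensus(sites, single=0.9))
```

The same alignment yields two different consensus strings, and in either case the 80%/20% split at the third position is lost.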

2.1.2. Position weight matrix


A position frequency matrix (PFM) M of a motif of length l consists of |Σ| rows and l columns, where each row
represents a distinct character b ∈ Σ, and the columns represent the distinct positions i ∈ [1, l] of a multiple
alignment of a motif. Each

Fig. 1. (a) A multiple alignment of eight binding sites located in sequences S1–S8. Their positions in the sequences are indicated; (b) Degenerated consensus
pattern of the eight binding sites in (a); (c) Frequencies of nucleotides in (a).

f(b, i) entry of matrix M represents the frequency of the character b in position i of the aligned ungapped
segments. Taking the logarithm of the entries produces the position weight matrix (PWM) [77]. The PFM is a
probabilistic model that assigns a probabilistic score (frequency) to each base at each position of the multiple
alignment of a set of binding sites (see Fig. 1c). The score of a sequence segment of length k against a PWM is
given by $\sum_{i=1}^{k} f(b_i, i)$, where b_i is the base at position i of the segment. This model corresponds to
an estimation of the binding energy and specificity of binding sites [79]. It is assumed that the energy
contribution of each binding position is additive, and that the genomic sequences where the binding sites occur are
random [79]. These two assumptions simplify the computation of the average binding energy for the collection of
binding sites. According to Berg and von Hippel [6] (as cited in [80]), given a collection of binding sequences, the
logarithm of a base frequency is proportional to the binding energy contribution of the base to the binding
affinity, which is given by log2(f(b, i)). To account for the genomic base probability, the relative binding energy
of a base becomes log2(f(b, i)/p(b)), where p(b) is the random sequence probability of base b. The average binding
energy of a motif is given by the information content (IC), defined as

$$\sum_{i} \sum_{b \in \Sigma} f(b, i) \log_2 \frac{f(b, i)}{p(b)}. \qquad (1)$$

This formula is also referred to as the relative entropy.
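The quantities in this subsection can be computed directly from a set of aligned sites. The sketch below builds f(b, i), evaluates Eq. (1), and scores a segment with the relative binding energy log2(f(b, i)/p(b)). The pseudocount used to avoid log(0), the function names, and the toy sites are assumptions for this example and are not prescribed by the text.

```python
import math

ALPHABET = "ACGT"

def pfm(sites, pseudocount=0.5):
    """Per-column base frequencies f(b, i), smoothed by an assumed pseudocount."""
    n = len(sites)
    return [{b: (col.count(b) + pseudocount) / (n + 4 * pseudocount)
             for b in ALPHABET}
            for col in zip(*sites)]

def information_content(freqs, background):
    """Eq. (1): sum over positions and bases of f * log2(f / p); 0*log(0) taken as 0."""
    return sum(f[b] * math.log2(f[b] / background[b])
               for f in freqs for b in ALPHABET if f[b] > 0)

def log_odds_score(segment, freqs, background):
    """Sum of log2(f(b_i, i) / p(b_i)) along the segment."""
    return sum(math.log2(freqs[i][b] / background[b])
               for i, b in enumerate(segment))

uniform = {b: 0.25 for b in ALPHABET}
sites = ["ACGT", "ACGT", "ACGA", "ACGT"]   # toy aligned binding sites
f = pfm(sites)
print(information_content(f, uniform))
print(log_odds_score("ACGT", f, uniform), log_odds_score("TTTT", f, uniform))
```

With a uniform background, a perfectly conserved column contributes 2 bits, the maximum for a four-letter alphabet.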

3. Evolutionary approaches for motif prediction

Some key GA terminology is introduced first. In the context of GA, these terms are used in the spirit of analogy
with real biology, but in a much simpler way. Solving the DNA motif discovery problem using GA involves encoding the
motif information into a particular model representation termed individuals (or chromosomes). Each individual
contains the complete information of a motif in an input dataset. For instance, an individual may contain the
positions of a putative motif's instances in the input dataset. Genes are the basic building blocks of an
individual; like the genes in cells, they store “genetic” information that is used for producing offspring through
genetic operators such as selection, crossover, and mutation. In biology, a gene encodes trait information, for
instance eye colour (blue, brown, amber). The different possible traits for a gene are called alleles. In GA, the
alleles are the possible parameter values of an element of a candidate solution to be searched for. For example, the
elements could be consensus pattern characters or entry values of a PFM. The alleles of a gene that encodes a
consensus pattern character are the symbols in Σ = {A, C, G, T} or the IUPAC symbols, whereas if a gene encodes PFM
entries, its alleles are real values in [0, 1]. The idea of GA is to efficiently search for a solution to a problem
in a large space of candidates. Searching for solutions requires encoding the solution parameters as individuals in
a population. Since it is not possible to examine all candidate solutions, GA starts with a small fraction of the
possible candidates and examines other solutions through the iterative evolutionary process. The direction of the
search is mainly controlled by the genetic operators used as well as by how the candidates are evaluated. Since the
initial candidates are constructed heuristically, the solutions found by GA might be sub-optimal. The iterative
steps of GA are given below:

Fig. 2. An illustrative example of how GA is used to search for consensus motif patterns. The individuals represent
candidate solutions, i.e., consensus patterns. Each gene's allele value takes one of the letters in Σ or the IUPAC
letters. Individuals in the current generation are evaluated for their fitness scores. They are then ranked by their
scores. The fitter individuals have higher probabilities of producing offspring through the genetic operators
applied during reproduction. A selection mechanism is in place to select the individuals that will be forwarded to
the next generation. This process is iterated until the population converges or a maximum number of generations is
reached. Finally, solutions with higher fitness scores are reported as putative consensus patterns.

Step 1: Initial population. Population initialization is the starting point of GA. It generates a set of individuals
at random as the initial population. Domain or prior knowledge, if available, can be used instead of randomness to
guide the initialization of gene values in individuals. The size of the population is fixed in advance and is
problem dependent. While a large population is more likely to contain close-to-target individuals, it increases the
cost of evaluation and reproduction.
Step 2: Fitness evaluation. As a crucial part of any GA, the fitness function (objective function) is employed to
evaluate the fitness of an individual by accessing the information it encodes. The fitness function should indicate
the goodness of a candidate solution. By differentiating strong individuals from weak ones, the selection and
reproduction operations can be executed effectively. The computation of a fitness function usually requires
decoding the information encoded in the genes of a chromosome.
Step 3: Selection. The purpose of selection is to choose appropriate parents for reproduction. By analogy with
natural selection, good individuals (as indicated by their fitness scores) have more chances to reproduce and to
survive elimination. The three common selection schemes are roulette-wheel selection, tournament selection, and
winner-take-all selection.
Step 4: Reproduction. During reproduction, genetic operators are applied to the individuals selected in the previous
step to produce new offspring. Reproduction aims to ensure that the population converges at a reasonable speed while
avoiding local optima. Crossover and mutation are the two commonly used operators.
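The three selection schemes named in Step 3 can be sketched as follows. This is a minimal illustration assuming non-negative fitness scores held in a list parallel to the population; the function names are hypothetical, and reading winner-take-all as "always pick the single fittest individual" is an assumption, since the text does not define it.

```python
import random

def roulette_wheel(population, fitness, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    if sum(fitness) == 0:
        return rng.choice(population)      # all scores zero: pick uniformly
    return rng.choices(population, weights=fitness, k=1)[0]

def tournament(population, fitness, size=3, rng=random):
    """Draw `size` distinct contenders at random and return the fittest."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitness[i])]

def winner_take_all(population, fitness):
    """Return the single fittest individual (assumed interpretation)."""
    return population[max(range(len(population)), key=lambda i: fitness[i])]

pop = ["ACGTAC", "TTGACA", "GGGCCC", "TATAAT"]   # toy consensus individuals
fit = [1.0, 4.0, 0.5, 2.5]                        # toy fitness scores
print(roulette_wheel(pop, fit), tournament(pop, fit), winner_take_all(pop, fit))
```

Roulette-wheel and tournament selection leave weaker individuals some chance of reproducing, which helps preserve diversity; winner-take-all does not.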

The GA iteration terminates when a fixed number of evolutionary generations is reached or when the population
converges. Often, convergence is indicated by small changes in the average fitness score of the population.
To provide an illustrative example, Fig. 2 shows the GA evolutionary process for DNA motif discovery using the
consensus pattern representation. In the figure, the consensus motif patterns represent individuals in a population.
There are nine genes in every individual, with allele values taking one of the letters in Σ or the IUPAC set. Shown
is a population of eight individuals, which represent candidate motifs in an input dataset (rectangular boxes).
Initially, the consensus motif patterns are randomly generated. In generation i of the evolutionary process,
candidate solutions are evaluated to obtain fitness scores for ranking purposes. A possible fitness function is the
relative abundance of a consensus's hits in the input dataset versus a background dataset [28]. Selection is then
performed on those individuals based on their fitness scores for reproduction, to produce offspring (new solutions)
in generation i+1. The new population consists of offspring produced by applying genetic operators to existing
solutions and some selected good solutions from the previous generation. Poor solutions are discarded. The heuristic
is that the offspring in the new generation represent better motifs. These steps are iterated until the maximum
generation is reached or the population converges. One of the key factors in GA is to ensure population diversity,
which keeps the evolutionary process away from locally optimal solutions and premature convergence [57].
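The process described above can be condensed into a short, self-contained sketch. It is not the implementation of any surveyed tool: the smoothed hit ratio used as fitness is only one assumed reading of "relative abundance versus background", individuals use Σ only (no IUPAC letters), and all names, parameter values, and toy sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"

def has_hit(pattern, sequence, d):
    """True if some window of `sequence` is within d mismatches of `pattern`."""
    l = len(pattern)
    return any(sum(a != b for a, b in zip(pattern, sequence[i:i + l])) <= d
               for i in range(len(sequence) - l + 1))

def fitness(pattern, inputs, background, d=1):
    """Assumed scoring: add-one smoothed ratio of input hits to background hits."""
    fg = sum(has_hit(pattern, s, d) for s in inputs)
    bg = sum(has_hit(pattern, s, d) for s in background)
    return (fg + 1) / (bg + 1)

def crossover(p1, p2, rng):
    """One-point crossover: prefix of p1 joined to suffix of p2."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(pattern, rng, rate=0.1):
    """Replace each gene by a random base with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else c
                   for c in pattern)

def run_ga(inputs, background, length, pop_size=30, generations=20,
           elite=2, seed=0):
    rng = random.Random(seed)
    score = lambda p: fitness(p, inputs, background)
    # Random initial population of consensus strings
    pop = ["".join(rng.choice(ALPHABET) for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)    # evaluate and rank
        nxt = ranked[:elite]                             # keep the best unchanged
        while len(nxt) < pop_size:
            p1 = max(rng.sample(ranked, 3), key=score)   # tournament selection
            p2 = max(rng.sample(ranked, 3), key=score)
            nxt.append(mutate(crossover(p1, p2, rng), rng))
        pop = nxt
    return max(pop, key=score)

inputs = ["TTGACGTCAA", "CCGACGTCGG", "AAGACGTCTT"]     # toy data
background = ["TTTTTTTTTT", "CCCCCCCCCC"]
best = run_ga(inputs, background, length=6)
print(best, fitness(best, inputs, background))
```

Because the elite individuals are copied unchanged, the best fitness in the population never decreases from one generation to the next; the mutation rate is what keeps diversity from collapsing.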
The main distinction between GA and other motif discovery algorithms is that users only need to focus on the design
of the solution domain, which defines how a candidate motif gets encoded into an individual. The search mechanism of
GA then explores the search space guided by the fitness function landscape. Fig. 3 illustrates the typical workflow
of designing a GA solution for DNA motif discovery. It consists of five consecutive steps: (1) Prepare input DNA
sequences enriched with potential motifs of interest. Various wet-lab techniques are available for genome-wide
identification of candidate motif regions [61]. Background sequences that do not contain any binding sites
recognized by the studied transcription factors are used to compute the fitness score of candidate solutions.
Studies suggest that using real sequences instead of artificial ones as background sequences could produce better
discriminative scores [69,74]; (2) Decide how the candidate motifs are encoded into individuals. Since each
individual is a candidate solution (motif), its design directly affects the time and space complexity of GA. There
are three broadly employed encoding methods: consensus pattern, positions, and position frequency matrix; (3) Design
the fitness function used to evaluate individuals and the genetic operators used for reproduction. In addition to
the standard operators (mutation and crossover), customized operators are sometimes developed to diversify the
individuals in a population; (4) Run the iterative evolving step, for which several key parameters need to be set,
including population size, stopping criteria, and the genetic operators' parameters; (5) Select and filter the most
promising candidate solutions. Furthermore, post-processing operations such as motif refinement and site selection
can be performed to improve the quality of the final selected motifs.

Fig. 3. Pipeline of designing and developing a GA solution for DNA motif discovery.

Fig. 4. Classification of existing GA tools for DNA motif prediction.
This review summarizes existing GA works based on the individual encoding method, the applied genetic operators, and
the fitness function. Most of the existing GA works can be classified into two main groups based on the type of
motif representation they search for. As shown in Fig. 4, the two main categories are search for consensus and
search for position frequency matrices.

1. Search for consensus. Individuals in a GA population represent the consensus patterns of motifs, in which the
alleles of a gene are nucleotides in Σ, gaps, or symbols from the degenerate IUPAC nucleotide set. If we consider
whether gaps are permitted in a consensus, existing works can be further categorized into two sub-groups: (a)
string, a consensus without gaps; and (b) spaced-dyad, short consensuses separated by one or more gaps.
Over-representation is usually employed to score a consensus pattern.
2. Search for matrix. Individuals in GA encode either the PFM or the starting positions of motif instances in the
input sequences. Though these are two different representations, a PFM model is derived from a given vector of
positions during the evaluation step. The works that fall into this category most often apply probabilistic
measures, e.g., information content (IC), to evaluate the PFM models and rank them based on their conservation
scores.
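For the position-based encoding in the second category, the decoding step that turns an individual into a PFM can be sketched as below. The function names and toy sequences are assumptions for this example; the individual is simply a list with one start position per input sequence.

```python
ALPHABET = "ACGT"

def decode_sites(sequences, positions, length):
    """Cut one candidate site of `length` bp out of every input sequence."""
    return [seq[p:p + length] for seq, p in zip(sequences, positions)]

def pfm_from_positions(sequences, positions, length):
    """Derive the PFM (base -> per-column frequency) implied by a position vector."""
    sites = decode_sites(sequences, positions, length)
    n = len(sites)
    return {b: [col.count(b) / n for col in zip(*sites)] for b in ALPHABET}

sequences = ["TTACGT", "ACGTTT", "GACGTG"]   # toy input sequences
individual = [2, 0, 1]                        # one start position per sequence
print(decode_sites(sequences, individual, 4))
print(pfm_from_positions(sequences, individual, 4))
```

The resulting matrix can then be scored with a conservation measure such as the information content of Eq. (1).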

Two further sub-groups are possible based on whether prior knowledge is employed or local search is used. Prior
knowledge refers to information about the problem domain that is known beforehand. Several existing works employed
prior knowledge about the characteristics of potential motifs in a dataset to guide the choice of initial points in
the exploration of the search space [52,87,88]. Some GA approaches employed a hybrid of GA and local stochastic
search techniques; both expectation maximization and Gibbs sampling have been employed. The aim of local search is
to improve the sub-optimal solutions obtained by GAs. Existing works that employed local search are labeled under
the node “hybrid” in Fig. 4.

Fig. 5. Illustration of how different motif representations are encoded in GA. Each coloured box indicates a gene.
(a) A population of individuals encodes degenerate motif consensuses 7 bp long; (b) A population of individuals
encodes positions of candidate binding sites in the input sequences. Each gene's integer value represents the
location of a candidate site in an input sequence. There are 10 positions encoded, one for every input sequence. (c)
A population of individuals encodes PFM entries. The motifs are 7 bp long.

3.1. Search for consensus

The consensus-based GA approach encodes the motif consensus models as individuals (Fig. 5a). A gene's allele values
can be the characters in the set Σ, a degenerate IUPAC character, or a gap, depending on how restrictively a
consensus pattern is defined.
The aim of consensus search methods is to find consensus patterns that are over-represented in a given input set in
contrast to the background set. It is a permutation problem that searches for the right order in which to place the
permissible symbols. Individuals in a population are putative motif consensus patterns in the given input dataset.
Scoring a consensus pattern involves counting its occurrences in the input dataset, where mismatches are allowed to
account for variations in true motifs. Counting methods based on exact or approximate [29] counting have been
employed for that purpose. The candidates are then evolved and scored based on an evaluation method. Most
consensus-based GA methods require fixing the pattern length prior to the evolutionary process. GAs that search for
consensus strings without gaps are FMGA [53], GAMI [19,20], and the method of Paul and Iba [66], while those that
search for spaced-dyads are GADEM [49], GASMEN [15], and GA-DAF [97].

3.1.1. Consensus string


FMGA aims to locate motif patterns from −2,000 bp upstream to +1,000 bp downstream of several co-expressed gene
groups. Its consensus patterns take symbols from both the nucleotide alphabet {A, C, G, T} and the IUPAC codes. The length
of the motif pattern stays fixed until GA termination. FMGA uses a distance-based fitness function that is built on a
matching function. The number of allowed mismatches between a consensus and an input sequence segment is proportional to
the consensus length (20% of the consensus length). Furthermore, in the computation of the fitness score, a match to a
nucleotide symbol is given a higher score than a match to an IUPAC symbol. A set of initial individuals in a population is
randomly generated. After evaluation, an elitist competition is activated so that individuals with the best fitness values
automatically qualify for the next generation. FMGA implements three genetic operators in the evolutionary process:
mutation, crossover, and rearrangement. For mutation, it first selects an individual and builds a PFM from its best
matching k-mers (one from each input sequence). According to the matrix, IUPAC codes are randomly assigned to less
conserved positions. The random assignment is performed twice to produce a pair of new patterns, which serve as parents in
a single-point crossover. After crossover, the one with the best fitness score survives to the next generation. To avoid
local optima, rearrangement is triggered when a consensus pattern remains unchanged for a certain number of generations.
At that stage, the IUPAC symbols in a consensus are replaced by the more dominant nucleotides according to its PFM. FMGA
is tested with 3 gene sets (E2F, TGFB, and tumor suppressor genes) to predict motifs of length 7 and 13 bp. The results
indicate that, in comparison to MEME and GibbsSampler, FMGA predicts consensus patterns with better match scores.
Nevertheless, the predicted patterns are not validated against real motif patterns. We have observed from the reported
results that the fitness function lacks the power to filter out random background patterns (simple repetitive patterns) in
a dataset. Some random background patterns with a high percentage of matching scores can still be regarded as true motifs.
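
A matching function of the kind FMGA relies on might look like the following sketch. The IUPAC table is the standard nucleotide ambiguity code; the two weights (1.0 for a plain nucleotide match, 0.5 for a match through a degenerate code), the rounding down of the 20% mismatch allowance, and all function names are assumptions for illustration, since the text above only states that nucleotide matches score higher than IUPAC matches.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

EXACT_WEIGHT = 1.0       # assumed score for matching a plain nucleotide
DEGENERATE_WEIGHT = 0.5  # assumed (lower) score for matching via an IUPAC code


def match_score(consensus, segment):
    """Return (score, mismatches) of a segment against an IUPAC consensus."""
    score, mismatches = 0.0, 0
    for code, base in zip(consensus, segment):
        if base not in IUPAC[code]:
            mismatches += 1
        elif len(IUPAC[code]) == 1:
            score += EXACT_WEIGHT
        else:
            score += DEGENERATE_WEIGHT
    return score, mismatches


def best_match(consensus, sequence, mismatch_fraction=0.2):
    """Best-scoring window within the mismatch allowance, as (score, start) or None."""
    k = len(consensus)
    allowed = int(mismatch_fraction * k)  # rounding down is an assumption
    best = None
    for i in range(len(sequence) - k + 1):
        score, mismatches = match_score(consensus, sequence[i:i + k])
        if mismatches <= allowed and (best is None or score > best[0]):
            best = (score, i)
    return best
```

Collecting `best_match` once per input sequence yields the k-mers from which a PFM can be built for the mutation and rearrangement steps.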
GAMI attempts to find the conserved motifs of orthologous genes from divergent species [19,20]. As in FMGA, an individual
in GAMI is represented by a consensus; however, only the nucleotide symbols {A, C, G, T} are allowed. Each symbol is
encoded using two bits in an individual. GAMI uses a population size of 1000 by default. A mismatch-based function named
MC is developed to score the fitness of each individual in a population. Given a consensus, the MC score is the sum, over
the input sequences, of the number of matched bases in its best instance in each sequence (searching both the forward and
reverse-complement directions). A standard evolutionary process is employed in GAMI. The mutation operator simply adds a
random base at one end of the consensus and deletes the base at the other end, while the crossover operator performs a
one-point crossover between two consensus patterns. Congdon and co-workers have conducted a comprehensive investigation of
different fitness functions. According to their reported results, IC shows no superior performance on motifs from highly
conserved regions in SOX21 and favors returning CG-rich patterns from CFTR and GSTM1. Furthermore, IC cannot distinguish
the goodness of distinct consensus patterns due to the synonyms problem. Six variations of fitness functions are
constructed by modifying or combining IC and MC. An IC using a uniform background is also proposed. It was found that the
fitness functions Mix (IC + MC), ICM (IC with motif), and MixM (MC + ICM) performed well on true motif patterns.
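
The MC score and the sliding mutation described above can be sketched as follows. This is an illustrative reading of the description, not GAMI's code: the function names and the 50/50 choice of which end receives the new base are assumptions.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def best_match_count(consensus, sequence):
    """Largest number of matching bases over all windows on both strands."""
    k = len(consensus)
    best = 0
    for strand in (sequence, reverse_complement(sequence)):
        for i in range(len(strand) - k + 1):
            matches = sum(a == b for a, b in zip(consensus, strand[i:i + k]))
            best = max(best, matches)
    return best


def mc_score(consensus, sequences):
    """Sum of the best per-sequence match counts."""
    return sum(best_match_count(consensus, s) for s in sequences)


def slide_mutation(consensus, rng=random):
    """Add a random base at one end and drop the base at the other end."""
    base = rng.choice("ACGT")
    if rng.random() < 0.5:  # assumed: either direction is equally likely
        return base + consensus[:-1]
    return consensus[1:] + base
```

Because the mutation keeps the length fixed, it effectively slides the consensus window by one position along the underlying sequence.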

3.1.2. Spaced-dyad
GADEM [49] is a popular GA-based tool for motif discovery in ChIP datasets. It is effective in discovering long (6–42 bp
by default) and degenerate spaced-dyad motifs. GADEM combines GA, expectation maximization, and heuristics to search for
spaced-dyad motifs. The novelties of GADEM are the initialization of the population using over-represented short k-mers
and the use of expectation maximization to optimize PWMs derived from candidate consensuses. Hence, it is a hybrid of
global and local search methods. The spaced dyad is encoded as three parts in one individual: a1-x-a2, where a1 and a2 are
monads and x is an integer that specifies the number of spaces allowed between them. Each cycle of GADEM consists of three
main parts: formation of the spaced dyads of a population, GA evolution, and motif declaration. Multiple motifs can be
produced in every cycle. First, individuals in a population are initialized using the top-ranked (based on z-score) short
k-mers of length 3–6 bp from the input dataset. A pair of top-ranked k-mers is randomly selected and a space value in
[1, 10] is chosen to form each individual. Second, the individuals go through genetic operator manipulation, optimization,
and scoring. For the calculation of motif fitness scores, the spaced dyads in a population are first converted into their
respective PWMs. These are optimized by the expectation maximization algorithm using a small subset (25–50%) of the input
dataset for efficiency. The optimized PWMs are used to find matched sequence segments. If the maximum number of
generations has not been reached, the spaced dyads whose PWMs have an E-value smaller than a pre-specified threshold are
automatically retained for the next generation. The remaining individuals are generated by applying customized mutation
and crossover. If the maximum number of generations has been reached, the third part of the GADEM cycle, motif
declaration, is activated. The PWMs derived from spaced dyads with an E-value below the pre-specified threshold are saved
as final motifs. Three heuristic rules are introduced to find the optimal length of the final motifs.
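
The a1-x-a2 encoding and its conversion into a starting matrix can be illustrated with the sketch below. The text does not give GADEM's conversion rule, so the probability of 0.7 on the consensus base, the uniform spacer columns, and the function names are all assumptions; the result is only meant as a seed that a subsequent EM step would refine.

```python
import random

BASES = "ACGT"


def make_dyad(top_kmers, rng=random, min_gap=1, max_gap=10):
    """Form one individual (a1, x, a2) from two randomly chosen top-ranked k-mers."""
    a1 = rng.choice(top_kmers)
    a2 = rng.choice(top_kmers)
    x = rng.randint(min_gap, max_gap)  # inclusive on both ends
    return a1, x, a2


def dyad_to_pwm(dyad, strong=0.7):
    """Convert a spaced dyad into a starting probability matrix (list of columns).

    Monad positions put `strong` on the consensus base and split the remainder
    evenly; spacer positions are uniform. The value 0.7 is an assumption.
    """
    a1, x, a2 = dyad
    weak = (1.0 - strong) / 3
    columns = []
    for symbol in a1 + "N" * x + a2:
        if symbol == "N":
            columns.append({b: 0.25 for b in BASES})
        else:
            columns.append({b: strong if b == symbol else weak for b in BASES})
    return columns
```

For example, `dyad_to_pwm(("ACG", 2, "TT"))` yields a seven-column matrix whose two middle columns are uniform.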
GADEM is evaluated with six human ChIP datasets (between 500 and 14,000 sequences each) and with simulated datasets.
According to the authors, 5–10 generations are sufficient for the population to converge. Evaluation results show that
GADEM finds all six motif consensus patterns in the ChIP datasets. Almost identical primary motif consensuses are returned
from different runs. On 500 simulated datasets, GADEM has performance comparable to MEME. Nevertheless, GADEM only
performs better than GAME in terms of positive predictive value. In predicting long motifs from 542 ChIP sequences of p53,
GADEM successfully predicts the p53 motif (20 bp after trimming) and several SINE/ALU retroelement motifs 67–101 bp long.
This performance is attributed to (a) the use of seed k-mers for initialization; (b) the hybrid of local and global
search; and (c) heuristic post-processing.
GASMEN [15] evolves a population of generic spaced patterns using GA. A generic spaced pattern allows don't-care
characters to be placed at any position within a consensus pattern, so the number of monads can vary. For example, the
pattern CCGNNGTNNAANNNT has four monads separated by don't-care characters. GASMEN searches for motif patterns of length
l = 4–25 bp that occur at least 4 times in the input sequences. It uses an indexing technique to speed up the search for
substrings that match a monad. The indexed substrings also facilitate the initialization of the GA population. In the
first initialization method, an individual with two parts p-q is constructed. Part p is a randomly chosen monad of length
w. Part q is taken from a string r of length in [w, l] whose prefix r[1 . . . w] matches p; q is the substring
r[w + 1 . . . l]. The second initialization method selects a monad of length w and extends it to the right with don't-care
characters or random substrings. Customized mutation and crossover are employed to produce the individuals of a new
population. GASMEN's fitness function consists of two parts: (a) the log relative frequency of a motif pattern in the
input sequences in contrast to the background sequences; (b) the log relative frequency ratio between all best occurrences
of the motif and the background. GASMEN also includes a motif pattern refinement step every ten generations during the
evolutionary iteration, in which a motif pattern is refined based on its PWM. GASMEN outperforms SPACE on the LexA and
PurR datasets. Furthermore, GASMEN has better results in terms of f-measure and performance coefficient on 7 of the 8
tested datasets.
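
The notion of a generic spaced pattern can be made concrete with a small sketch that splits a pattern into its monads and counts exact matches in which N matches any base. The helper names are hypothetical, and plain scanning stands in for the indexing technique used by GASMEN.

```python
import re


def monads(pattern):
    """Split a generic spaced pattern into its monads (maximal runs of A/C/G/T)."""
    return re.findall(r"[ACGT]+", pattern)


def matches_at(pattern, sequence, start):
    """True if the pattern matches at `start`, with N matching any base."""
    window = sequence[start:start + len(pattern)]
    if len(window) < len(pattern):
        return False
    return all(p == "N" or p == s for p, s in zip(pattern, window))


def count_hits(pattern, sequences):
    """Total number of matching positions over all sequences."""
    return sum(
        matches_at(pattern, seq, i)
        for seq in sequences
        for i in range(len(seq) - len(pattern) + 1)
    )
```

Applied to the example in the text, `monads("CCGNNGTNNAANNNT")` returns the four monads CCG, GT, AA, and T.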
Zare-Mirakabad et al. [97] propose a novel GA method, named GA-DPAF, to search for dyad motifs. A dyad motif is a
consensus that consists of two monads. A segment of an input sequence is considered a match to a dyad motif if it matches
the two monads (with mismatches allowed) separated by a variable number of gaps. The minimum and maximum numbers of gaps
are user-specified input parameters. As a distinctive feature, GA-DPAF makes heavy use of the Gibbs sampling method to
optimize the dyad motifs in the initial population and those obtained during the evolutionary process. Similar to GADEM,
the dyad motifs are transformed into their corresponding PFMs for fitness score calculation. A dyad-motif fitness score is
the sum of three functions: (a) mutual matches between motif instances of the dyad motif; (b) relative entropy; (c) number
of matches between the dyad motif and its instances. GA-DPAF is evaluated on several datasets using the nucleotide-level
performance coefficient (nPC). On the SCPD datasets (27 regulons, ranging from 3 to 18 sequences), GA-DPAF performs best
on 18 of the datasets. Datasets from six yeast regulons are chosen to test GA-DPAF's capability of discovering dyad
motifs, and it obtains the highest nPC values on three of the six. While the evaluation shows promising results for
GA-DPAF in comparison to MEME, YMF, AlignACE, and MITRA, the selected benchmark datasets are relatively small (< 20
sequences) and come from low-complexity species.

3.1.3. Summary
We summarize some key points of search for consensus methods:

1. Mismatch function. Most methods (GADEM, GASMEN, and GAMI) only allow the nucleotide alphabet {A, C, G, T} in forming
the consensus patterns for the sake of speed. Nevertheless, since motifs are degenerate, mismatches are permitted when
finding hits in the input sequences. Often, only the best matching instance is collected per input sequence [20,53,97].
The consensus-based approach is reported to be sensitive to the mismatch parameter setting in the fitness function
calculation; e.g., 19 true motifs are discovered successfully when two mismatches are allowed, while none is predicted
with a mismatch value of one [66].
2. Repetitive bases. Consensus-based methods often return low-complexity patterns, such as repetitive A-rich sequences
[20,56]. The low-complexity patterns can either be filtered out after GA convergence or penalized by introducing a
low-complexity measure [27] as part of the fitness function. Another common solution is to mask the repetitive regions in
the input sequences, using a tool such as RepeatMasker, prior to motif discovery.
3. Expected length of motifs. The expected length of motifs has great influence on the success of GA. For instance, FMGA
works better for short motifs. GAMI discovers all the conserved motifs from the octamer-binding factor (Oct) and nuclear
factor kappa B (NF-κB) motif datasets when the pattern length is set to 8 bp [20] but performs poorly when the length is
set to 20 bp. On the other hand, for the discovery of spaced-dyad patterns, the lengths of the monads and gaps can be
automatically optimized during the search.
4. Customized genetic operators. Operators such as crossover and mutation are applied in all of the consensus-based
algorithms, with or without modification. Given the challenges of the permutation problem, these two operators alone are
not enough. For instance, when a consensus ACCCTTTT undergoes a single-point crossover with CCTCCGTT at position 3 to
produce the offspring ACTCCGTT and CCCCTTTT, the two offspring most of the time cannot improve on their parents, because
the parents are very distinct and crossover does not preserve the good partial patterns. Therefore, custom operators and
heuristics are necessary. In FMGA, the mutation operator is applied selectively to consensus positions that are not fully
conserved. Furthermore, when there seems to be no learning progress, a rearrangement heuristic is used to convert the
degenerate character at each position of the motif into a basic letter (i.e., one of {A, C, G, T}) using a majority voting
scheme. GAMI introduces truncation and extension operators to slide the motif along the sequence. Another widely used
technique is to optimize the consensus model using local optimization techniques such as expectation maximization and
Gibbs sampling, which can alleviate some of the issues mentioned.
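
The crossover example in point 4 can be reproduced with a few lines of Python. The sketch assumes that "position 3" is 1-based and marks the first symbol of the swapped tails, which is the reading that yields the two offspring quoted in the text; the function name is hypothetical.

```python
def single_point_crossover(parent_a, parent_b, position):
    """Swap the tails of two equal-length strings from `position` (1-based) onward."""
    cut = position - 1
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


# Reproduces the offspring given in the text.
children = single_point_crossover("ACCCTTTT", "CCTCCGTT", 3)
print(children)  # ('ACTCCGTT', 'CCCCTTTT')
```

Neither offspring keeps a parent's conserved core intact, which is the weakness that the customized operators above are meant to address.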

3.2. Search for matrix

GAs in this category search for PFMs of putative motifs in the input dataset. The GAs can either explore the matrix space
directly or explore the positions of possible binding sites in the input dataset. In the former, the individuals in a
population represent the PFMs of possible motifs in the input dataset; hence, the encoded information is independent of
the input dataset. In the latter, an individual represents the set of locations of possible sites in the input dataset,
and the sites in an individual form a PFM. Fig. 5b and c illustrate the two distinct encoding methods employed. Typically,
the OOPS (one occurrence of a motif per sequence) assumption is applied to initialize the matrix model. The main objective
of GA is to identify a set of putative binding site positions in all the input sequences whose motif model (PFM/PWM)
optimizes a fitness function. The earliest work on matrix-based GAs was proposed by Fogel et al. [27]. Subsequently,
several works in this category have been published, including MDGA [17], GALF [13], GALF-P [14], GALF-G [16], GAME [92],
GEMFA [7], GAPK [87], iGAPK [88], and IGAMD [52].
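
The position-based encoding can be sketched as follows: an individual is a list of start positions (one per sequence under OOPS), the sites at those positions form a PFM, and the PFM is scored. Information content against a uniform background is used here as an example fitness; the uniform background and the function names are assumptions for illustration.

```python
import math

BASES = "ACGT"


def sites_from_positions(sequences, positions, width):
    """Extract one site per sequence (OOPS) from the encoded start positions."""
    return [seq[p:p + width] for seq, p in zip(sequences, positions)]


def pfm(sites):
    """Position frequency matrix: one {base: count} dict per motif column."""
    width = len(sites[0])
    return [{b: sum(s[j] == b for s in sites) for b in BASES} for j in range(width)]


def information_content(matrix, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / q_b); q is uniform by default."""
    background = background or {b: 0.25 for b in BASES}
    total = 0.0
    for column in matrix:
        n = sum(column.values())
        for b in BASES:
            f = column[b] / n
            if f > 0:
                total += f * math.log2(f / background[b])
    return total
```

A perfectly conserved column contributes 2 bits under the uniform background, so a fully conserved motif of width 4 scores 8 bits.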

3.2.1. Position
The GA proposed in GAME [92] aims to produce globally optimized motifs by incorporating a PWM-based Bayesian model from
BioOptimizer [40]. GAME assumes zero or one occurrence of a motif per sequence (ZOOPS), which ensures that the most
conserved motif in the sequences is located. Since ZOOPS has limitations in modeling sequences with more than one binding
site, GAME applies a simple PWM-Scan procedure to iteratively add to the predicted motif model extra instances that were
missed during the GA evolutionary cycle. The log-posterior distribution proposed in [40] is applied as the fitness
function. Besides standard single-point crossover and mutation, GAME contributes two novel genetic operators named ADJUST
and SHIFT to further optimize the models predicted by the evolutionary process and to avoid premature convergence caused
by local alignment. In total, 200 simulated datasets are built to cover four factors in motif discovery: (1) number of
sequences (small and large); (2) motif width (8 bp and 16 bp); (3) level of motif conservation (low and high); (4) number
of binding sites per sequence (exactly one or 10% probability of none). Results from the simulated datasets indicate that
GAME outperforms MEME [3], BioProspector [55], and BioOptimizer [40] in locating highly conserved motifs. GAME is further
tested on eight real datasets collected from public repositories and the literature. Leaving running time aside, it gives
superior prediction accuracy over the three compared tools. By default, crossover is triggered for each pair of chosen
individuals. An extremely low mutation probability (0.001) is applied in GAME. Since ADJUST and SHIFT are only applied
after the evolutionary process, GAME mainly relies on the crossover operator to produce diverse solutions. Fast
convergence is ensured; however, the search can become trapped in local optima and lacks the ability to move away,
especially when the search space is large and the population size is small.
GEMFA is a hybrid of GA and the expectation maximization (EM) method that aims to overcome the local minimum limitation
and the requirement of multiple restarts in the EM motif search approach. It is hypothesized that a GA search strategy,
which searches multiple solutions simultaneously and optimizes them by EM, would be more effective than the
multiple-restart strategy employed by EM for finding globally optimal solutions. GEMFA minimizes a minimum description
length objective function, which is a discriminative measure of a candidate solution with respect to the input and
background sequence models. In GEMFA, the solutions are encoded as positions of candidate motif sites in the input
sequences, with the assumption of
OOPS. Individuals in a population are used as seeds of the initial PFM models for the EM optimization. Therefore, GEMFA's
performance depends on the quality of the PFMs produced by the genetic operations. For that, it employs two-point
crossover and mutation operators to produce offspring. GEMFA was evaluated on simulated datasets with different
characteristics: planted motifs with high, medium, or low conservation, and background sequences that are uniform,
GC-rich, or AT-rich. Evaluation results on the simulated datasets showed that GEMFA performed comparably to MEME and
better on datasets with low-conservation motifs. GEMFA was also evaluated on three additional real datasets (CRP, ERE,
E2F). It obtained the best site-level precision on two of the datasets (0.85–0.92) when compared with MEME, GAME, and
BioProspector. The results demonstrated that a hybrid of GA and expectation maximization search has the advantage of
obtaining optimal motif models while requiring only a small population size for DNA motif prediction.
GALF encodes the positions of motif instances as individuals in a population and employs the IC as the fitness function.
The unique feature of GALF is its two new genetic operators: the local filter and the shift operator. The local filter
operator replaces a motif instance with another one from the same input sequence if the new instance gives a better
similarity score towards the PFM of the motif. It is applied to individuals in a new population to reduce false positives.
The shift operator, in turn, shifts all motif instances to the left and right to determine whether doing so improves the
motif's fitness score. It is applied once the best individual has stagnated for a certain number of generations. In a
subsequent work, the same authors further enhance the GALF framework by adding a post-GA process to improve the prediction
accuracy, named GALF-P [14]. The motif in GALF-P is represented by a PFM. As a complement, a consensus is used to assess
the degree of similarity between a single k-mer and the predicted model. Besides using IC as the fitness function and
conventional genetic operators during the evolutionary process, the local filtering operator iteratively optimizes the PFM
by adding conserved k-mers to replace weak companions based on their pattern similarities until no further improvement is
obtained. Furthermore, the post-processing stage in GALF-P, adding and removing, aims to pick up potential true positives
missed during GA and, simultaneously, to erase possible false positives without affecting the final motif quality. The
authors investigated 54 combinations of crossover and mutation rates to find the optimal settings. A high mutation rate
(0.9) and a relatively low crossover rate (0.3) produce the best GA performance. Based on our observation, local solutions
are easy to find in a motif search, since there are many similar but random k-mers that produce high fitness scores.
Compared to mutation, single-point crossover lacks the ability to refine locally optimized solutions. Moreover, local
filtering results in a nearly locally optimized solution every 10 generations. With the aid of a frequently triggered
mutation, GALF-P can effectively move close to global solutions. With respect to system complexity and efficiency, GALF-P
outperforms GAME. The local filtering operator serves the same purpose as the ADJUST and SHIFT operators in GAME. However,
it is triggered within the GA iteration, which means the mechanism can have a more direct influence on GA performance. The
adaptive post-processing further increases the reliability of GALF-P, which indicates the potential benefit of employing
post-processing in GA.
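
The two GALF operators can be sketched in Python as below. This is an illustrative interpretation of the description above rather than the published implementation: the uniform background in the IC, the similarity measure (sum of column frequencies of a window's bases), the shift range of two positions, and all function names are assumptions.

```python
import math

BASES = "ACGT"


def information_content(sites):
    """IC of the aligned sites against a uniform background (an assumption)."""
    n = len(sites)
    total = 0.0
    for j in range(len(sites[0])):
        for b in BASES:
            f = sum(s[j] == b for s in sites) / n
            if f > 0:
                total += f * math.log2(f / 0.25)
    return total


def fitness(sequences, positions, width):
    """IC of the sites selected by one individual (one start per sequence)."""
    return information_content([s[p:p + width] for s, p in zip(sequences, positions)])


def shift_operator(sequences, positions, width, max_shift=2):
    """Shift every site by the same offset and keep the best-scoring offset."""
    best_positions, best_score = positions, fitness(sequences, positions, width)
    for offset in range(-max_shift, max_shift + 1):
        shifted = [p + offset for p in positions]
        # Skip offsets that would push any site off its sequence.
        if any(p < 0 or p + width > len(s) for s, p in zip(sequences, shifted)):
            continue
        score = fitness(sequences, shifted, width)
        if score > best_score:
            best_positions, best_score = shifted, score
    return best_positions, best_score


def local_filter(sequences, positions, width):
    """Per sequence, swap in the window most similar to the current PFM."""
    sites = [s[p:p + width] for s, p in zip(sequences, positions)]
    n = len(sites)
    # Column-wise base frequencies of the current alignment.
    freqs = [{b: sum(s[j] == b for s in sites) / n for b in BASES} for j in range(width)]

    def similarity(window):
        return sum(freqs[j][window[j]] for j in range(width))

    new_positions = []
    for seq, p in zip(sequences, positions):
        candidates = range(len(seq) - width + 1)
        # On ties, prefer the position the individual already holds.
        best = max(candidates, key=lambda i: (similarity(seq[i:i + width]), i == p))
        new_positions.append(best)
    return new_positions
```

The shift operator corrects alignments that are consistently off by a few bases, while the local filter repairs individual sites that disagree with the rest of the alignment.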
The incorporation of available resources or prior knowledge (PK) into algorithm development has drawn much attention.
Building on such knowledge, novel GA-based frameworks, such as GAPK [87] and iGAPK [88], have been developed, which aim to
improve algorithm performance in terms of prediction accuracy and system robustness. In GAPK, a model mismatch score (MMS)
[86,89] is derived as the fitness function to indicate the level of motif conservation. To address both model conservation
and rareness, the relative model mismatch score (RMMS), which takes the background model into consideration, is introduced
in the later work iGAPK. A set of experimentally verified binding sites is collected beforehand and used to construct PK
models in the form of PFMs. iGAPK is composed of three parts: (1) search space reduction in data pre-processing; (2) GA
for motif prediction; (3) final model refinement in data post-processing. A rule-based filtering operation is developed
with the assistance of the PK models to mask out non-functional sequence regions during the search space reduction. The
authors aim to reduce the search space while minimizing the possibility of false dismissals. Besides crossover, iGAPK
introduces a new genetic operator named replacement, which replaces genes of an individual based on their Hamming
distances. The purpose of replacement is to induce possible mutants when the evolution becomes stable and premature. Like
GAME and GALF-P, iGAPK also includes a data post-processing procedure, but with a different mechanism, which has three
steps: “Merging”, “Most One-In”, and “Most One-Out” (MOIO). First, the “Merging” step groups similar motif models together
based on their motif characteristics. Then, MOIO is triggered to further refine the merged models. “Most One-In” intends
to collect weak but true binding sites missed during GA, and “Most One-Out” reduces the number of false positives. MOIO
proceeds iteratively until there is no further increase in the IC of the refined motif model. iGAPK demonstrates better
prediction accuracy as well as more robust performance than GAME and GALF-P over eight datasets.

3.2.2. Position frequency matrix


In this approach, entries (i.e., real values) of a PFM are designated to be the allele values of genes in a chromosome
(Fig. 5e). Approaches in this category are the Population Clustering Evolutionary Algorithm (PCEA) [57], GAPWM [50], and
kmerGA [94]. Designed to target very long sequences and multiple motifs, PCEA works on the principle of preserving the
diversity of candidate solutions in a population during the evolutionary process, which is achieved by reducing the
selective pressure on solutions that are less fit. The candidate solutions are clustered into sub-populations, and mating
can only be performed between candidate solutions in the same cluster. The number of offspring produced by a cluster is
proportional to the overall fitness of its member solutions. The advantages of this strategy are that very distinct
solutions are not mated to produce meaningless offspring and that weak solutions, which possibly carry useful genetic
information, are retained for the next generation.
The fitness function in PCEA is computed from a solution's (i.e., a matrix's) ability to discriminate training sequences
from background sequences. The scores on the two sets are each averaged, and the difference between the two averages
serves as the fitness score.
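
The two PCEA ideas described above, a fitness that contrasts training and background scores and an offspring budget split in proportion to cluster fitness, can be sketched as follows. The largest-remainder rounding, the assumption of positive cluster fitness values, and the function names are illustrative choices that are not specified in the text.

```python
def discrimination_fitness(train_scores, background_scores):
    """Mean matrix score on training sequences minus mean score on background."""
    mean_train = sum(train_scores) / len(train_scores)
    mean_background = sum(background_scores) / len(background_scores)
    return mean_train - mean_background


def offspring_quota(cluster_fitness, total_offspring):
    """Split an offspring budget across clusters in proportion to cluster fitness.

    Assumes all cluster fitness values are positive.
    """
    total = sum(cluster_fitness)
    raw = [total_offspring * f / total for f in cluster_fitness]
    quota = [int(r) for r in raw]
    # Hand leftover slots to the clusters with the largest fractional parts.
    leftovers = sorted(range(len(raw)), key=lambda i: raw[i] - quota[i], reverse=True)
    for i in leftovers[: total_offspring - sum(quota)]:
        quota[i] += 1
    return quota
```

Because every cluster with positive fitness receives a share, weaker clusters still contribute offspring, which reflects the reduced selective pressure described above.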
The majority of verified binding sites are usually stored as PWMs in public repositories. Since the number of known
binding sites is still quite small, PWMs of poor quality are likely to be produced. GAPWM is proposed to optimize known
PWM models using ChIP sequences [50]. Three ChIP datasets, human Oct4, mouse Oct4, and human p53, are collected from the
UCSC genome browser. Two “seed” PWMs of Oct4 and p53 are formed from known binding sites and the consensus pattern,
respectively. A scoring function from MATCH [42] is used to calculate the similarity between a k-mer m and a PWM model.
The optimization performance of GAPWM has been analyzed extensively. Poor PWMs of the Oct4 and p53 motifs are improved
after the GA process. A relatively large population size (1,000) and number of generations (1500) are chosen to run the GA
on both the training and background datasets. Usually, the global optimum appears at around 500 generations, after which
the solution becomes stable. For human p53, the PWM is optimized after only 50 generations, which indicates that nearly
all human p53 ChIP sequences contain binding patterns that closely match the known p53 motif model. The evidence suggests
that GAPWM can converge to the expected global optimum and demonstrates its robustness on different sets of ChIP
sequences. The study further compares GAPWM with two EM approaches, MEME and NMICA. ROC curves show that the proposed GA
can provide optimized PWMs with higher sensitivity than both MEME and NMICA on all the testing data. To produce accurate
ROC curves, the quality of the background sequences has to be guaranteed. Because of their extremely low chance of
containing regulatory elements, coding regions and background sequences generated from ChIP experiments are considered the
two most suitable choices in GAPWM. Unlike the previous GA approaches, GAPWM targets motif model optimization and is
capable of improving poorly constructed PWMs with the support of a ChIP dataset. Alternatively, GAPWM can be used during
post-processing to optimize motifs predicted by other search algorithms.

3.2.3. Summary
Some key points are highlighted for search for matrix methods:

1. Motif representation. Some works choose a different motif representation for the set of binding sites of an individual
for the purpose of fitness calculation. For example, in GA-DPAF, the selected putative sites (i.e., positions) are used to
construct a string consensus, which is then used to find matching sites in the input sequences. The matched sites are
subsequently converted to a PFM for the calculation of the fitness score. Likewise, GALF converts the sequence segments in
an individual into a PFM and uses it to score the segments from which it was constructed.
2. Fitness function. The positions are treated as missing values in the fitness functions used by GAME [92] and GEMFA [7].
GAME maximizes the log-posterior distribution, whereas GEMFA maximizes the log-likelihood function. On the other hand,
Fogel et al. [27] combined similarity and complexity scores as the fitness function. The complexity score is introduced to
avoid being trapped in a local minimum solution. The simpler fitness function used by Fogel et al. [27] seems to work well
despite not being as statistically rigorous as those used by GAME and GEMFA. A distinct scoring function named MISCORE
[89], whose early versions were used in GAPK [87] and iGAPK [88], later introduced a compositional motif complexity metric
into the model. That work also carried out a very comprehensive performance comparison of MISCORE against widely used
scoring functions, i.e., the Maximum a Posteriori (MAP) score and Information Content (IC). The results show that MISCORE
further improves the capability of distinguishing motifs from background and is computationally efficient. Instead of
choosing IC or probabilistic models, future GA works on motif discovery may consider using MISCORE as the fitness function.
3. Heuristics. Numerous customized operators and heuristics are introduced to inject randomness and to escape local
minimum solutions. Fogel et al. [27] introduce window shift, window recombination, and G+C% slide operators. Most
search-by-position algorithms, in fact, have a shift operator. The shift operator in [27] generates a new offspring by
shifting a segment position from its parent to a new random position. GALF shifts all segment positions within a locally
constrained range in a greedy manner, and the best shifted positions are kept. The shift operator is therefore useful for
generating randomness in the population. In GAME, three heuristics are introduced in post-processing: ADJUST, SHIFT, and
PWM-Scan. PWM-Scan is used to scan for potential sites in other individuals and adds them to the solution if the fitness
score improves. These heuristics are reported to greatly improve the final solutions.

4. Evaluation

To compare the performances of GA-based motif discovery tools, an evaluation study was performed using 8 real datasets.
GA tools that are available for download are employed for the benchmark. As most of those tools were proposed before the
ChIP era, they were designed for small to medium sized datasets. The datasets consist of the known binding sites of 8 TFs
and were originally constructed by Wei and Jensen [92]. The datasets are from various transcription factors: CRP (18),
CREB (17), SRF (20), ERE (25), MEF2 (17), MYOD (17), TBP (39), and E2F (25). We ran GAPK, GAME, GALF-P, and GADEM 20 times
on each dataset to obtain a reliable estimate of their actual performances. The tools' parameters are varied across runs
to allow a fair comparison of the performances.
The binding site locations in the datasets are known; therefore, the precision, recall, and f-measure rates can be
computed precisely. The precision (positive predictive value) is defined as TP/(TP + FP), while the recall (sensitivity)
is defined as TP/(TP + FN), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives,
respectively [63,65]. We also compute the f-measure = 2/(1/precision + 1/recall), which is the harmonic mean of the
precision and recall rates. A prediction is considered a TP if it overlaps the true binding site location for at least 25%
of its width on either strand.

Table 1
Precision (P), recall (R), and f-measure (F) rates of GAPK, GAME, GALF-P, and GADEM on the 8 real datasets.

Datasets  GAPK                                      GAME                                      GALF-P                                    GADEM
          P             R             F             P             R             F             P             R             F             P             R             F
CREB      0.68 ± 0.06   0.65 ± 0.06   0.66 ± 0.06   0.44 ± 0.31   0.43 ± 0.30   0.43 ± 0.32   0.47 ± 0.24   0.60 ± 0.29   0.53 ± 0.26   0.82 ± 0.07   0.56 ± 0.07   0.67 ± 0.05
CRP       0.90 ± 0.05   0.84 ± 0.03   0.87 ± 0.02   0.93 ± 0.05   0.84 ± 0.03   0.88 ± 0.03   0.95 ± 0.02   0.88 ± 0.05   0.91 ± 0.04   0.95 ± 0.05   0.59 ± 0.09   0.72 ± 0.06
ERE       0.73 ± 0.15   0.88 ± 0.03   0.79 ± 0.11   0.63 ± 0.07   0.84 ± 0.06   0.72 ± 0.06   0.65 ± 0.15   0.84 ± 0.04   0.72 ± 0.10   0.79 ± 0.11   0.641 ± 0.08  0.70 ± 0.08
E2F       0.69 ± 0.02   0.83 ± 0.06   0.75 ± 0.03   0.62 ± 0.05   0.86 ± 0.09   0.72 ± 0.06   0.67 ± 0.08   0.93 ± 0.05   0.78 ± 0.07   0.70 ± 0.04   0.68 ± 0.03   0.69 ± 0.02
MEF       0.87 ± 0.10   0.92 ± 0.04   0.89 ± 0.06   0.90 ± 0.05   0.96 ± 0.06   0.93 ± 0.04   0.85 ± 0.16   0.94 ± 0.06   0.89 ± 0.11   0.97 ± 0.05   0.84 ± 0.06   0.90 ± 0.04
MYOD      0.83 ± 0.06   0.92 ± 0.08   0.87 ± 0.05   0.24 ± 0.17   0.24 ± 0.16   0.24 ± 0.16   0.28 ± 0.24   0.51 ± 0.45   0.36 ± 0.32   0.58 ± 0.08   0.40 ± 0.09   0.47 ± 0.08
SRF       0.75 ± 0.04   0.81 ± 0.05   0.78 ± 0.03   0.67 ± 0.06   0.92 ± 0.06   0.78 ± 0.06   0.68 ± 0.12   0.88 ± 0.06   0.76 ± 0.09   0.76 ± 0.06   0.86 ± 0.09   0.80 ± 0.05
TBP       0.73 ± 0.10   0.83 ± 0.04   0.77 ± 0.06   0.67 ± 0.28   0.58 ± 0.24   0.62 ± 0.25   0.74 ± 0.12   0.86 ± 0.02   0.80 ± 0.09   0.51 ± 0.13   0.46 ± 0.11   0.48 ± 0.11

Table 2
ANOVA 1 test on the precision, recall, and f-measure rates at significance level α = 0.05.

Tool 1    Tool 2    Precision    Recall    F-measure
GAPK      GAME      0.133        0.216     0.146
GAPK      GALF-P    0.184        0.647     0.287
GAPK      GADEM     0.849        0.007     0.061
GALF-P    GADEM     0.310        0.047     0.640
GALF-P    GAME      0.830        0.389     0.616
GAME      GADEM     0.231        0.476     0.889
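
The precision, recall, and f-measure definitions, together with the 25% overlap rule, can be sketched in Python as follows. Several details are assumptions: sites are given as (sequence id, start, end) on forward-strand coordinates with an exclusive end, "its width" is read as the width of the true site, TP is counted as the number of known sites that are hit, and the function names are hypothetical.

```python
def overlaps(pred, true, min_fraction=0.25):
    """True if `pred` covers at least `min_fraction` of the true site's width.

    Sites are (sequence_id, start, end) tuples with `end` exclusive.
    """
    if pred[0] != true[0]:
        return False
    shared = min(pred[2], true[2]) - max(pred[1], true[1])
    return shared >= min_fraction * (true[2] - true[1])


def evaluate(predicted, known, min_fraction=0.25):
    """Site-level (precision, recall, f-measure)."""
    # TP: known sites hit by a prediction; FN: known sites that are missed.
    tp = sum(any(overlaps(p, t, min_fraction) for p in predicted) for t in known)
    fn = len(known) - tp
    # FP: predictions that hit no known site.
    fp = sum(not any(overlaps(p, t, min_fraction) for t in known) for p in predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 / (1 / precision + 1 / recall)
```

With two known sites, one of which is hit, and three predictions, two of which hit nothing, this gives a precision of 1/3, a recall of 1/2, and an f-measure of 0.4.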
For each tool, key parameters need to be adjusted to suit specific motif characteristics. The default parameter settings of both GAME and GALF-P are used for the first five runs. Then, based on the published works, some optimal GA settings are determined. In GAME, the number of generations is set to 4000, the population size is set to 1000, and the mutation rate is varied from 0.001 to 0.015 in steps of 0.001. In GALF-P, we apply the same number of generations (4000) and population size (1000) as in GAME, together with 15 different combinations of 5 mutation rates (from 0.2 to 0.4 in steps of 0.05) and 3 crossover rates (0.8, 0.85, and 0.9). For each dataset, the motif width parameter is set to the expected motif width, which remains unchanged over the twenty runs. In GAPK, the number of generations is set to 1000, along with 20 combinations of 10 replacement rates (0.1 to 1 in steps of 0.1) and 2 population sizes (500 and 800). All other parameters of the three tools are kept at their defaults. For GADEM, we employed a population size of either 50 or 100 with the default 5 cycles. The minimum and maximum gaps were set to [0, 8], as most of the expected motif lengths are short (< 15 bp). The numbers of top-ranked trimers, tetramers, and pentamers are set according to the expected lengths of the motifs in the datasets; in typical cases, they are set to 20, 40, or 50. The minN parameter is set to about 50–80% of the number of sequences in a dataset.
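The parameter combinations described above can be enumerated as follows. This is only an illustrative sketch of the grids stated in the text; the variable names are hypothetical and do not correspond to command-line options of the tools.

```python
from itertools import product

# GAME: 15 mutation rates, 0.001 to 0.015 in steps of 0.001.
game_mutation = [round(0.001 * i, 3) for i in range(1, 16)]

# GALF-P: 5 mutation rates (0.2 to 0.4, step 0.05) x 3 crossover rates.
galfp_mutation = [round(0.2 + 0.05 * i, 2) for i in range(5)]
galfp_crossover = [0.8, 0.85, 0.9]
galfp_grid = list(product(galfp_mutation, galfp_crossover))

# GAPK: 10 replacement rates (0.1 to 1.0, step 0.1) x 2 population sizes.
gapk_replacement = [round(0.1 * i, 1) for i in range(1, 11)]
gapk_population = [500, 800]
gapk_grid = list(product(gapk_replacement, gapk_population))
```

Building the rates from integer multiples and rounding avoids the accumulated floating-point drift that repeated addition of a step would introduce.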
Table 1 shows the results of the tools in terms of precision, recall, and f-measure. The results are averaged over 20 runs, and the standard deviations are indicated after the ± symbol. The results are mixed: GADEM achieved the best average precision rates for six of the eight datasets but has low average recall rates compared with the other tools. None of the tools consistently obtained the best recall rates; GAPK, GAME, and GALF-P share a fair number (three) of the best recall rates from the eight datasets. In terms of f-measure, it is inconclusive which tool is better, as the best f-measure for different datasets is achieved by different tools. For instance, GAPK is best for two datasets, GAME for one, GALF-P for three, and GADEM for two. This implies that different tools perform better on some datasets than on others. All of the tools were able to discover the consensus patterns of the primary motifs in the datasets. Nevertheless, it is essential to select the correct parameters for each dataset, especially the population size, threshold parameters, and expected motif lengths.
The ANOVA 1 (one-way analysis of variance) test was conducted to determine whether the differences in means (the classifiers' performances) between different groups (i.e., over different datasets) are statistically significant [5]. It is assumed that the datasets are independently drawn and the performance scores are normally distributed. The null hypothesis H0 of the ANOVA 1 test is that the population means from which the samples are selected are equal.
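As an illustration of the test statistic, the following sketch computes the one-way ANOVA F value for two tools scored on eight datasets each, which yields the degrees of freedom (1, 14). The function name and the scores are hypothetical and purely illustrative; they are not values from Table 1. The decision uses the tabulated critical value F(1,14) ≈ 4.60 at α = 0.05 instead of an exact p-value, and the sketch assumes the within-group variance is not zero.

```python
def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the given groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical recall rates of two tools on eight datasets (made-up numbers).
tool_a = [0.90, 0.80, 0.85, 0.95, 0.75, 0.88, 0.82, 0.91]
tool_b = [0.70, 0.45, 0.80, 0.50, 0.65, 0.72, 0.58, 0.60]

f_stat, df1, df2 = one_way_anova_f(tool_a, tool_b)
F_CRITICAL_1_14 = 4.60  # tabulated critical value at alpha = 0.05
reject_h0 = f_stat > F_CRITICAL_1_14
```

With two groups, the one-way ANOVA is equivalent to a two-sample t-test with pooled variance, since F equals the square of the t statistic.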
Table 2 shows the p-values obtained from the pair-wise ANOVA 1 tests on the tools' precision, recall, and f-measure rates. We failed to reject the null hypothesis at significance level α = 0.05 for all the compared tools and performance measures, except for the recall rates between GAPK and GADEM (F(1,14) = 9.955, p = 0.007) and between GADEM and GALF-P (F(1,14) = 4.749, p = 0.047). Referring to Table 1, GALF-P and GAPK have consistently much better recall rates than GADEM. Based on our limited evaluation results, it can be noted that the insignificance of the p-values for most of the test cases implies

Table 3
Comparisons of the three GA individual encoding methods.

Encoding method Evolvability Representability Reproducibility

Consensus / spaced-dyad High Average Average
Positions High Very good Average
Position frequency matrix Low Very good High

that while the tools have different implementations (i.e., operators and encodings), their performances are comparable. It cannot be inferred whether similar results would be obtained for large-scale datasets.

5. Discussion

It is noteworthy that the term motif is also used in the study of complicated yet unknown interactions among bio-molecules, i.e., biological networks in systems biology [91]. Computational approaches have been proposed to generate mathematical graphs to characterize such complex interaction networks [1]. Clusters of unique, non-randomly occurring, and inter-connected sub-graphs within a large bio-network are referred to as network motifs [46]. Nodes within the graphs usually are genes/proteins, and the edges that link the nodes represent the interactions. The study of DNA motifs or binding sites mines similar combinations of nucleotides to help identify the same transcription factor that possibly regulates different genes at the sequence level. As such data accumulate along with knowledge of multiple bio-molecules, such as miRNAs, DNA motif discovery provides reliable annotations of gene regulation that contribute to the development of an entire regulatory network that classifies genes into different regulation pathways [62].
Table 4 summarizes the key works reviewed in this paper along several key dimensions. GA is attractive for motif discovery because of its global search capability and solution-centred design [57]. The global search ability is crucial due to the large search space of motif discovery. However, GA is still unable to fully solve the local minimum problem due to the imperfect understanding of binding site specificity. The solution-centred principle allows a researcher to focus on the solution space rather than on the algorithmic aspect [57].
In the next few sub-sections, we provide some discussion on performances, design, and implementation issues of GA for
DNA motif discovery.

5.1. Performance comparisons

In the reviewed works, GA-based motif prediction tools have shown promising results. Much of the reported evidence demonstrates that GA techniques perform comparably to or better than local search techniques such as Gibbs sampling (AlignACE, MotifSampler, BioProspector) and expectation maximization (MEME). For example, GAME significantly outperforms MEME and BioProspector in 8 of the real datasets in terms of f-measure [92]. GA is also reported to perform better especially for low conservation motifs [7,92]. However, PCEA only performed at an average level on low conservation motif datasets [57]. Such evidence suggests that the position encoding method may be more suitable for the discovery of subtle motifs. GA can recognize motif signals in low SNR sequences, as it is capable of detecting planted motif signals in sequences about three times longer than those handled by MEME and NestedMICA [57].
Two studies, Li [49] and Jayaram et al. [39], have employed large-scale ChIP datasets in their evaluations. Jayaram et al. [39] evaluated four discovery tools using large-scale ChIP datasets: rGADEM (an R implementation of GADEM), HOMER, ChIPMunk, and MEME-ChIP. The four tools were evaluated using 12 TFs with validated binding sites from the PAZAR motif database. It was reported that the PWMs predicted by rGADEM achieved the best average results in four of the evaluation metrics. The results demonstrate that, while most of the early works focused on small-scale datasets, GA is scalable in terms of performance to large-scale datasets. While there are many reported successes in using GA methods, some authors have reserved their conclusions. For example, GAME only manages to discover partial real motifs in the test datasets, while Fogel et al. [27] state that multiple local optimum solutions are one step ahead of finding a global motif. Overall, it is surprising to see that a simple GA without algorithmic complications performs as well as other state-of-the-art algorithms.

5.2. Comparisons of solution encoding

The choice of solution encoding method determines the complexity of the solution search space. For the position encoding method, suppose there is no assumption on the number of motif instances in each input sequence. The total number of possible candidate motifs is then $\prod_i 2^{l_i - w + 1}$, where $l_i$ is the length of the $i$-th sequence and $w$ is the expected motif length [92]. However, if we assume only one site in each input sequence, the number of possible motif candidates becomes $(l_i - w + 1)^N$, where $N$ is the total number of input sequences. In contrast, for search by consensus, there are only $4^l$ patterns of length $l$ to search, assuming the pattern is made up of only the four bases. The position encoding method clearly has a bigger search space than the consensus encoding method. A larger population is needed for position encoding, which leads to a longer convergence time. For the PFM encoding method, there are infinitely many possible matrices to search because of its real-valued entries. It is noted that PCEA and GAPWM, which use the matrix encoding method, perform well in the reported results.
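To give a feel for these magnitudes, the sketch below evaluates the three counts for a given set of sequence lengths. The function names are hypothetical; the formulas follow the expressions in the text, with the one-site-per-sequence count written as a product so that sequences of unequal length are also covered (for equal lengths it reduces to $(l - w + 1)^N$).

```python
from math import prod


def positions_any_count(lengths, w):
    """Position encoding, no assumption on sites per sequence:
    each of the l_i - w + 1 start positions is either a site or not."""
    return prod(2 ** (l - w + 1) for l in lengths)


def positions_one_per_sequence(lengths, w):
    """Position encoding, exactly one site per sequence."""
    return prod(l - w + 1 for l in lengths)


def consensus_patterns(pattern_length, alphabet_size=4):
    """Consensus encoding over the four bases."""
    return alphabet_size ** pattern_length


# Example: 10 sequences of 100 bp and a motif of width 8.
lengths = [100] * 10
one_site = positions_one_per_sequence(lengths, 8)   # 93 ** 10
any_sites = positions_any_count(lengths, 8)         # 2 ** 930
consensus = consensus_patterns(8)                   # 4 ** 8 = 65536
```

Even for this small example, the one-site-per-sequence space is many orders of magnitude larger than the consensus space, which is consistent with the need for a larger population under position encoding.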

Table 3 summarizes the comparisons of different solution encoding methods in terms of evolvability, representability, and reproducibility. (1) Evolvability is the ease of applying the standard genetic operators (crossover and mutation) to produce offspring. The standard genetic operators can be applied with ease to the consensus and position encodings. The allele values between two individuals can be exchanged freely with a one- or two-point crossover operator. However, for matrix encoding, while the standard operators are applicable, the exchange of allele values between two matrices can only be performed column-wise because of the unit sum constraint for each position of a matrix. Also, once a matrix entry is mutated, the affected column has to be renormalized. (2) Representability refers to the ability of the encoding method to capture TF specificity. The matrix encoding (as well as position encoding) is relatively more powerful than the consensus encoding in representing motifs. Nevertheless, it can be observed that in some studies, the consensus patterns are transformed into PFMs to take advantage of their expressive power in detecting potential sites. Such an approach is used by GADEM, GA-DPAF, and GASMEN. Further, the spaced-dyad can effectively model motifs with conserved parts separated by low conservation bases. We feel the choice of encoding method depends much on the characteristics of the motifs. While that is elusive considering the motif characteristics are often unknown, it is practically wise to run with different solution encodings for the same dataset. (3) Reproducibility refers to the ability of GA to identify the same motifs in different runs. Because of the exponential number of combinations of parameter values, a motif prediction tool is advised to perform multiple runs before concluding the final motifs. There is evidence that the results of GA approaches are reproducible across runs. For example, PCEA discovers known motifs in more than 90% of successful runs [57]. Furthermore, GAPWM converges to nearly the same solutions in different runs. GADEM is also reported to produce the same primary motifs (despite having different numbers of sites) in different runs for the six test datasets. A somewhat contradictory finding is mentioned for GAMI, namely that most of the runs with raw data lead to relatively uninteresting motifs [20]. This is likely due to limitations of the consensus encoding it used.
The three solution encoding methods have their own strengths and weaknesses. Encoding by position has the most advantages of all, which come from an intuitive representation of the solution space and the ready applicability of existing genetic operators. The consensus-based method is unable to capture the inherent diversity of binding sites or to take effective advantage of the genetic operators.
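The constraints that matrix encoding places on the standard operators can be illustrated with a minimal sketch. It assumes, for illustration only, that a PFM is stored as a list of columns, each a list of four probabilities (A, C, G, T) summing to one; the function names are hypothetical and do not reproduce the operators of PCEA, GAPWM, or kmerGA.

```python
def column_crossover(pfm_a, pfm_b, point):
    """One-point crossover that exchanges whole columns.

    Because columns are never split, every column of both children
    still satisfies the unit sum constraint.
    """
    child_a = pfm_a[:point] + pfm_b[point:]
    child_b = pfm_b[:point] + pfm_a[point:]
    return child_a, child_b


def mutate_entry(pfm, col, row, delta):
    """Perturb one entry, then renormalize the affected column to unit sum."""
    column = list(pfm[col])
    column[row] = max(column[row] + delta, 0.0)  # keep the entry non-negative
    total = sum(column)
    if total == 0.0:
        # Degenerate case: fall back to a uniform column.
        new_column = [1.0 / len(column)] * len(column)
    else:
        new_column = [value / total for value in column]
    return pfm[:col] + [new_column] + pfm[col + 1:]
```

Exchanging individual entries between two matrices, as a standard point crossover on a flattened chromosome would do, could break the unit sum of a column, which is why the exchange is restricted to column boundaries here.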

5.3. Customized genetic operators

Designing effective genetic operators is one of the determinants of the success of GA for motif prediction. Besides aiming to improve the average fitness score of the new individuals in a population, the genetic operators also aim to diversify the population to prevent premature convergence. Diversification is crucial for the motif discovery problem because there are potentially multiple true motifs in a dataset. It is not surprising that many customized operators were proposed in addition to the standard operators (see Table 4), most notably the shift operator [92]. The shift operator prevents solutions from being trapped in a local minimum due to the phase shift problem in multiple alignment [47]. A phase shift problem occurs when some sites in a motif are misaligned and require realignment. In GAME, the shift operator is applied to the best individuals after convergence, while in GALF, it is applied once the best individual shows signs of stagnation for a certain number of generations. From the reviewed studies, custom operators are a necessity for producing a good solution. However, the effectiveness of most novel operators was not evaluated in terms of their contribution to the final solution. Therefore, a fitness function should be able to succinctly evaluate each solution produced after applying genetic operators so that the good solutions have a high chance of being selected for reproduction.
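The idea behind a shift operator can be sketched as follows for a position-encoded individual. This is a simplified illustration under stated assumptions, not the operator as implemented in GAME or GALF: all sites are shifted by the same offset, starts are clamped to the valid range, and the caller supplies the fitness function. The names `shift_individual` and `best_shift` are hypothetical.

```python
def shift_individual(positions, seq_lengths, width, offset):
    """Shift every site start by the same offset, clamped to valid start positions."""
    shifted = []
    for pos, length in zip(positions, seq_lengths):
        last_start = length - width
        shifted.append(min(max(pos + offset, 0), last_start))
    return shifted


def best_shift(positions, seq_lengths, width, fitness, max_offset=2):
    """Try all offsets in [-max_offset, max_offset] and keep the fittest variant.

    Offset 0 is included, so the result is never worse than the input.
    """
    candidates = [
        shift_individual(positions, seq_lengths, width, offset)
        for offset in range(-max_offset, max_offset + 1)
    ]
    return max(candidates, key=fitness)
```

A usage example with a toy fitness that rewards starts close to position 2: `best_shift([0, 0], [10, 10], 4, lambda p: -sum(abs(x - 2) for x in p))` returns `[2, 2]`.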

5.4. Local-search

Hybridizing local search with GA is beneficial. The aim of local search is to optimize the sub-optimal motifs (i.e., solutions) during the evolutionary process. It is usually activated right after the motifs are produced by genetic operations (e.g., in GASMEN, GA-DPAF, GADEM, and GEMFA). GASMEN and GA-DPAF use Gibbs sampling for the local search, while GADEM and GEMFA employ the expectation-maximization algorithm. When triggered during the evolutionary process, local search reduces the number of generations needed for convergence. As an example, GADEM only requires 5–10 generations to converge for the large-scale ChIP datasets. The sub-optimal motifs are used as seeds to initialize the local search algorithms. Because of the high complexity of the local search, it adds time to the GA search. A strategy employed in GADEM is to use only a subset of the dataset for the optimization, which has been demonstrated to be sufficient to achieve good performance. The hybrid of GA and local search is a prominent strategy that should be considered in the design of GA-based solutions.
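The role of a local search step can be illustrated with a deliberately simple stand-in. The sketch below uses greedy coordinate-wise hill climbing on a position-encoded individual; it is neither Gibbs sampling nor expectation maximization, and it is not taken from any of the reviewed tools. The function name and the caller-supplied `fitness` are assumptions made for illustration.

```python
def greedy_refine(positions, seq_lengths, width, fitness):
    """Re-place one site at a time at its best start until no move improves fitness.

    The loop terminates because every accepted move strictly increases
    the fitness and the number of candidate solutions is finite.
    """
    current = list(positions)
    improved = True
    while improved:
        improved = False
        for i, length in enumerate(seq_lengths):
            best_pos, best_fit = current[i], fitness(current)
            for start in range(length - width + 1):
                trial = current[:i] + [start] + current[i + 1:]
                trial_fit = fitness(trial)
                if trial_fit > best_fit:
                    best_pos, best_fit = start, trial_fit
            if best_pos != current[i]:
                current[i] = best_pos
                improved = True
    return current
```

In a hybrid scheme, such a refinement would be seeded with an individual produced by the genetic operators, and the refined individual would be returned to the population. Each sweep costs one fitness evaluation per candidate start in every sequence, which reflects the additional running time noted above.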

5.5. Saving the solutions

A practical challenge of using GA is deciding when to save the solutions during the GA evolution and how many to save. GAMI stores the best individual in every generation [20], while PCEA and GEMFA save a list of high-score individuals after convergence [7,57]. A different strategy is to save only the fittest solutions after convergence [7,94,97]. During the evolutionary process, it is hard to determine when a good solution should be saved and whether such a solution needs to be retained until convergence because of the stochastic nature of GA. Some authors have suspected that good solutions vanish during the iterative evolution [20,66]. As indicated in GAMI, “... thus it is likely that some GAMI runs would discover this solution” [20]. A possible remedy is to apply elitism selection. Elitism is a technique where the best solutions in the

Table 4
Summary of GA motif discovery tools.

| Tool name | Initialization/representation | Fitness function | Notes on non-standard genetic operator | Datasets |
| --- | --- | --- | --- | --- |
| PCEA [57] | Individuals are randomly initialized | The difference between the mean best score of motif in the input and background dataset. | − | Simulated dataset and 100 sequences randomly picked from EPD database |
| GAPWM [50] | Individuals are initialized with existing PFMs | ROC | − | 603 sequences from human Oct4, 367 sequences from mouse Oct4 and 542 sequences from p53 motif. |
| kmerGA [94] | Random initialization of PFM values | Spearman rank correlation coefficient | Block crossover, sequence context crossover, block mutation, column mutation | PBM dataset of Cbf1, Ceh22, Oct1, Rap1, and Zif268 |
| GAMI [19,20] | Individuals are randomly initialized | Match score & IC | Truncation & extension | Tissue specific CFTR, GSTM1, and SOX21 dataset |
| FGMA [53] | Random initialization | Pair-wise match with mismatch & degenerated codes, ROC | − | Three sets of sequences with 6, 9, and 18 sequences from Saccharomyces cerevisiae |
| Paul and Iba [66] | Sub-sequences (consensus) are randomly picked from input sequences | a) IC for weak motif; b) Rewarding scheme for ordinary motif | − | CRP and ArcA motif of Escherichia coli. LEU3 and MCB motifs of Saccharomyces cerevisiae |
| GADEM [49] | Each individual has three parts, a1, x(n), a2, where a1 and a2 are picked randomly from the selected 3- to 6-mers, while n is a randomly picked value in [1,10] | Relative entropy | Customized crossover and mutation operators | Six ChIP datasets totaling 542 to 13,721 sequences. Simulated datasets |
| GA-DPAF [97] | Individuals in a population are encoded as positions of the best matches in each input sequence. | IC + sum-of-pairs score + matching of spaced-dyad instances to its consensus pattern | Customized crossover and mutation operators are proposed with optimization by Gibbs sampling. | Various motifs from SCPD (Saccharomyces cerevisiae) and Saccharomyces cerevisiae motifs from various authors. |
| GASMEN [15] | Individuals are randomly initialized from indexed sequence segments | Over-representation + conserved and high occurrences in the input sequences | Customized crossover and mutation operator | LexA (9 sequences) with sequence lengths from 80 to 580 bp, and PurR (12 sequences) with sequence lengths from 100 to 600 bp. Also used the 8 real datasets from GAME. |
| MDGA [17] | Positions are randomly generated | IC | Shift | 18 sequences of CRP motif, 15 sequences of genes regulated by YDR02c protein. |
| GALF [13], GALF-P [14] | Individuals are randomly initialized with positions in input sequences | IC | Shift, Local filter | 800 simulated datasets with planted binding sites, 300 bp each sequence. 8 real datasets as used by GAME. |
| GALF-G [16] | Individuals are randomly initialized | IC | GALF-G is an extended version of GALF with shift filter and modified local filter | 970 synthetic, real and benchmark datasets composed of Eukaryotic, Escherichia coli, liver-specific and Myod datasets |
| GAME [92] | Randomly generated motif positions | Bayesian | Adjust, Shift | 8 real datasets from human and Escherichia coli, from 17 to 95 sequences. 200 artificial datasets with different characteristics |
| Fogel et al. [27] | One position from each sequence is randomly selected for initialization of individuals in the population | Motif complexity score + sum of similarity score | Window shift G+C% slide, Window recombination | 7 sequences of Oct and 9 sequences of NF-κB |
| GEMFA [7] | Positions in individuals are randomly generated with assumption one position per sequence | Minimum Description Length | − | 25 ERE, 25 E2F, and 18 CRP sequences |
| GAPK [87] | A k-mer is obtained from each input DNA sequence to form an individual | MMS (MISCORE) | Replacement | 8 real datasets as used by GAME |
| iGAPK [88] | A k-mer is obtained from each input DNA sequence to form an individual | RMMS (MISCORE) | Replacement | 8 real datasets as used by GAME |

current generation are automatically carried forward to the next without going through the selection process [49,53]. This strategy is effective in retaining the most prominent solutions until the GA has converged.
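A minimal sketch of elitism is given below. The function and parameter names are hypothetical, the parents are drawn uniformly at random as a placeholder for whatever selection scheme a tool actually uses, and `breed` stands for any combination of crossover and mutation supplied by the caller.

```python
import random


def next_generation(population, fitness, n_elite, breed, rng=random):
    """Build the next generation, copying the n_elite fittest individuals unchanged."""
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:n_elite]  # bypass selection and variation entirely
    offspring = []
    while len(elites) + len(offspring) < len(population):
        parent_a, parent_b = rng.sample(ranked, 2)  # placeholder selection
        offspring.append(breed(parent_a, parent_b))
    return elites + offspring
```

Because the elites are copied unchanged, the best fitness in the population can never decrease from one generation to the next, which addresses the concern that good solutions vanish during the evolution.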

6. Other methods

Given the two main streams of motif representations, we classify non-GA works into pattern-based and matrix-based approaches. The basic concept of pattern-based algorithms comes from enumeration. Suppose the motif length $k$ and an alphabet of $n$ characters are given in advance; the overall search space is then $n^k$. Every possible string can be considered as a candidate motif model. Tools such as Weeder [67] and Trawler [23] return the most over-represented consensus model from the entire search space. On the other hand, AMD [72] and Hegma [38] were developed to find significant motifs in large-scale datasets, e.g., ChIP-Seq. The matrix-based approaches optimize a given motif model (PFM/PWM) by iteratively adjusting model parameters to maximize the probabilistic likelihood. Expectation-maximization (EM) and Gibbs sampling are the two most commonly used stochastic local search methods. One of the EM approaches, MEME [4], implements three different assumptions about motif occurrences (OOPS, ZOOPS, and TCM) and can detect distinct motif models with different widths through multiple EM iterations. Gibbs sampling has been applied by tools like AlignACE [37] and BioProspector [55]. Unsupervised learning approaches using self-organizing map (SOM) neural networks aim to optimize a PWM (SOMBRERO [60]) or a hybrid motif model (SOMEA [48]) at each node. Other machine learning algorithms, such as anchor-based sequence clustering (ASC) [54] and the high-order Markov model [18], focus on improving prediction in terms of running time and scalability across motifs with various lengths and input data sizes. As prior knowledge of DNA motifs accumulates, tools that employ supervised learning show their potential, such as the Support Vector Machine (SVM) approach gkm-SVM [29] and DeepSEA [98], which is based on deep learning neural networks. Moreover, in order to increase discovery reliability, ensemble approaches have been developed that employ a number of different motif discovery tools together as a computational pipeline [45], such as MotifVoter [93] and GimmeMotifs [32]. Some of the tools have been further developed to provide web access with additional features, such as DMINDA [59] and MEME Suite [2]. Comprehensive reviews have been published along with the development of in silico motif discovery. Tompa et al. [83] evaluated the performance of 13 of the most well-known prediction tools, including MEME, AlignACE, and Weeder. Later, Das and Dai [21] further categorized tools depending on where the targeted promoter regions were collected from, i.e., co-regulated genes of a single species, orthologous genes of multiple species, or both. Zambelli et al. [96] emphasized the importance of computational approaches for working with large-scale datasets, such as ChIP-chip and ChIP-Seq. For a detailed summary of non-GA approaches, refer to the supplementary material.
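The enumeration idea behind pattern-based methods can be illustrated with a short sketch that builds all $n^k$ candidate strings and counts their exact occurrences. It is only a toy: raw counts stand in for the statistical over-representation scores used by real tools such as Weeder, mismatches are not allowed, and the function name is hypothetical.

```python
from itertools import product


def count_kmers(sequences, k, alphabet="ACGT"):
    """Count exact occurrences of every possible k-mer over the alphabet."""
    # Enumerate the full search space of len(alphabet) ** k candidate patterns.
    counts = {"".join(chars): 0 for chars in product(alphabet, repeat=k)}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows with characters outside the alphabet
                counts[kmer] += 1
    return counts


counts = count_kmers(["ACGTACGT", "TTACGA"], 3)
most_frequent = max(counts, key=counts.get)
```

The dictionary grows as $4^k$ for DNA, which is why exhaustive enumeration is practical only for short patterns.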

7. Conclusions and future prospects

This paper presented a comprehensive survey of the research literature on how GAs have been used for DNA motif discovery. Within the scope of this survey, a total of 18 GA tools have been reviewed with a focus on solution encoding, genetic operators, and practical issues.
Some conclusions can be drawn from this review. Firstly, the solution encoding method determines the size of the search space, the characteristics of the motifs to be discovered, and the settings of standard GA parameters (population size, number of generations, etc.). From this review, the position-based encoding method is favourable compared to other methods due to its intuitive representation, the applicability of existing genetic operators, and its easy implementation, though it has a larger search space than consensus encoding. Meanwhile, some works combine different encoding methods to capture different characteristics of motifs and improve the search performance. Secondly, a post-processing step after the GA iterations is necessary to improve the quality of the discovered motifs. It is especially necessary for the position encoding method, where it reduces the inaccuracies due to misaligned sites and removes falsely included ones. Thirdly, customized or new genetic operators are needed because the standard genetic operators are unable to diversify the individuals in a population. A GA developer should consider allowing users to choose which genetic operators to use for a particular dataset, as well as when and which solutions should be saved in a GA run. Many new genetic operators have been proposed for different representation methods, but their benefit toward the final solutions is unknown. Therefore, it is good practice to use a different set of operators in different runs. A feature employed in several GA tools is to allow the operators to be selected probabilistically using the roulette wheel mechanism. Fourthly, while the information content (IC) is the most popular choice of fitness function, care must be taken. When the population size gets larger, IC sometimes gives higher scores to false motif models, such as those with low complexity patterns. IC also does not work well for distinguishing multiple motifs in the same dataset simultaneously. Lastly, the main bottleneck of the existing GA approaches is not prediction performance but time complexity. Long computational times are required for large-scale datasets. The high complexity comes from the computation of the fitness function and the genetic operators used.
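The caveat about IC and low complexity patterns can be made concrete with a small sketch. It assumes the common relative-entropy form of IC, a uniform background of 0.25 per base, and no pseudocounts; the function name and the example sites are hypothetical and chosen only for illustration.

```python
from math import log2


def information_content(sites, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / p_b) for aligned, equal-width sites."""
    alphabet = "ACGT"
    background = background or {base: 0.25 for base in alphabet}
    width = len(sites[0])
    total = 0.0
    for j in range(width):
        column = [site[j] for site in sites]
        for base in alphabet:
            freq = column.count(base) / len(column)
            if freq > 0:  # 0 * log(0) is taken as 0
                total += freq * log2(freq / background[base])
    return total


low_complexity = ["AAAA", "AAAA", "AAAA", "AAAA"]
plausible_motif = ["ACGT", "ACGT", "ACGA", "ACGC"]

ic_low = information_content(low_complexity)    # 4 columns x 2 bits
ic_motif = information_content(plausible_motif)
```

Under these assumptions the poly-A alignment reaches the maximum of 2 bits per column, so it outscores the more plausible motif with one variable position. A background model estimated from the input sequences, or an explicit low-complexity filter, is a common way to mitigate this.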
We put forward some recommendations for potential future work. Firstly, most existing GA-based motif discovery tools have only been evaluated on small to medium datasets (see Table 4), except GADEM. The time complexity depends on the computation of the fitness function, the implementation of the genetic operators, the population size, the number of generations, and the encoding method used. The PWM and position encoding methods require less computation time for evaluating fitness functions than the consensus encoding method. In consensus encoding, it is necessary to search the input sequences for segments that match each pattern in order to compute the over-representation scores. The challenges with large-scale motif analysis using GA

are not only computational resources but also performance scale-up when the sequence search space increases. There have been several efficient implementations of scalable GA using Hadoop MapReduce [58] and GPU CUDA [12,75]. Secondly, the current DNA prediction landscape is focusing on enhancer prediction using various data sources, such as histone marks, chromatin marks, and co-factor marks, that can be used to infer enhancer locations [73]. However, prediction of enhancers is difficult because the features associated with them are not fully understood. Furthermore, with various data sources available from different cell lines, how to integrate them to construct an effective classification model is a great challenge. In that respect, GA can be used to search for and select the optimal set of features from different sources to be integrated for classifier construction. Thirdly, all the approaches presented in this review are motif model driven and unsupervised. However, as more labeled data sources become available, such as histone marks and chromatin signatures that are associated with enhancer regions, supervised feature-driven methods would be more useful. GA can be used to generate or select discriminative features using those datasets for building classifiers. Novel solution representations are necessary to represent the features.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.ins.2018.07.004.

References

[1] T. Aittokallio, B. Schwikowski, Graph-based methods for analysing networks in cell biology, Brief. Bioinform. 7 (3) (2006) 243–255, doi:10.1093/bib/bbl022.
[2] T.L. Bailey, DREME: motif discovery in transcription factor ChIP-Seq data., Bioinformatics 27 (12) (2011) 1653–1659, doi:10.1093/bioinformatics/btr261.
[3] T.L. Bailey, C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Proceedings of the International Confer-
ence on Intelligent Systems for Molecular Biology 2(6) (1994) 28–36.
[4] T.L. Bailey, C. Elkan, Unsupervised learning of multiple motifs in biopolymers using expectation maximization, Mach. Learn. 21 (1) (1995) 51–80, doi:10.1007/bf00993379.
[5] S. Bandyopadhyay, S. Mallik, A. Mukhopadhyay, A survey and comparative study of statistical tests for identifying differential expression from microar-
ray data, IEEE/ACM Trans. Comput. Biol. Bioinf. 11 (1) (2014) 95–115, doi:10.1109/TCBB.2013.147.
[6] O.G. Berg, P.H. von Hippel, Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters, J. Mol. Biol. 193 (4) (1987) 723–750, doi:10.1016/0022-2836(87)90354-8.
[7] C. Bi, A Genetic-based EM motif-finding algorithm for biological sequence analysis, in: Proceedings of the IEEE Symposium on Computational Intelli-
gence and Bioinformatics and Computational Biology, IEEE, 2007, pp. 275–282, doi:10.1109/CIBCB.2007.4221233.
[8] V. Boeva, Analysis of genomic sequence motifs for deciphering transcription factor binding and transcriptional regulation in eukaryotic cells, Front.
Genet. 7 (2016) 24.
[9] A. Brazma, I. Jonassen, I. Eidhammer, D. Gilbert, Approaches to the automatic discovery of patterns in biosequences, J. Comput. Biol. 5 (2) (1998)
277–304, doi:10.1089/cmb.1998.5.279.
[10] M. Brenowitz, D.F. Senear, M.A. Shea, G.K. Ackers, Quantitative DNase footprint titration: a method for studying protein-DNA interactions, Meth. Enzymol. 130 (1986) 132–181, doi:10.1016/0076-6879(86)30011-9.
[11] Q. Cao, K. Yip, A Survey of the Computational Methods for Enhancers and Enhancer-target Predictions, in: Proceedings of the Computational Biology
and Bioinformatics, CRC Press, 2016, pp. 3–27, doi:10.1201/b20026-3.
[12] S. Cavuoti, M. Garofalo, M. Brescia, A. Pescape’, G. Longo, G. Ventre, Neural Nets and Surroundings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 29–39. doi:10.1007/978-3-642-35467-0_4.
[13] T.-M. Chan, K.-S. Leung, K.-H. Lee, TFBS identification by position- and consensus-led genetic algorithm with local filtering, in: Proceedings of the Ninth
Annual Conference on Genetic and Evolutionary Computation, in: GECCO ’07, ACM, New York, USA, 2007, pp. 377–384, doi:10.1145/1276958.1277037.
[14] T.-M. Chan, K.-S. Leung, K.-H. Lee, TFBS Identification based on genetic algorithm with combined representations and adaptive post-processing, Bioin-
formatics 24 (3) (2008) 341–349, doi:10.1093/bioinformatics/btm606.
[15] T.-M. Chan, K.-S. Leung, K.-H. Lee, P. Lio’, Generic spaced DNA motif discovery using genetic algorithm, in: Proceedings of the IEEE Congress on
Evolutionary Computation, IEEE, 2010, pp. 1–8, doi:10.1109/CEC.2010.5585924.
[16] T.-M. Chan, G. Li, K.-S. Leung, K.-H. Lee, Discovering multiple realistic TFBS motifs based on a generalized model, BMC Bioinform. 10 (2009) 321, doi:10.1186/1471-2105-10-321.
[17] D. Che, Y. Song, K. Rasheed, MDGA: motif discovery using a genetic algorithm, in: Proceedings of the Seventh Annual Conference on Genetic and
Evolutionary Computation, in: GECCO ’05, ACM, NY, USA, 2005, pp. 447–452, doi:10.1145/1068009.1068080.
[18] R. Chen, Y. Peng, B. Choi, X. Jianliang, H. Haibo, A private dna motif finding algorithm, J. Biomed. Inform. 50 (2014) 122–132. Special Issue on Infor-
matics Methods in Medical Privacy. doi:10.1016/j.jbi.2013.12.016.
[19] C.B. Congdon, J.C. Aman, G.M. Nava, H.R. Gaskins, C.J. Mattingly, An evaluation of information content as a metric for the inference of putative
conserved noncoding regions in DNA sequences using a genetic algorithms approach, IEEE/ACM Trans. Comput. Biol. Bioinf. 5 (1) (2008) 1–14,
doi:10.1109/tcbb.2007.1059.
[20] C.B. Congdon, C.W. Fizer, N.W. Smith, H.R. Gaskins, J. Aman, G.M. Nava, C. Mattingly, Preliminary results for GAMI: a genetic algorithms approach
to motif inference, Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (2005) 1–8.
doi:10.1109/CIBCB.2005.1594904.
[21] M. Das, H.-K. Dai, A survey of DNA motif finding algorithms, BMC Bioinform. 8 (Suppl 7) (2007) S21, doi:10.1186/1471-2105-8-S7-S21.
[22] W.H.E. Day, F.R. McMorris, Critical comparison of consensus methods for molecular sequences, Nucleic Acids Res. 20 (5) (1992) 1093–1099, doi:10.1093/nar/20.5.1093.
[23] L. Ettwiller, B. Paten, M. Ramialison, E. Birney, J. Wittbrodt, TRAWLER:de novo regulatory motif discovery pipeline for chromatin immunoprecipitation,
Nat. Biotechnol. 4 (7) (2007) 563–565.
[24] M. Fernández, D. Miranda-Saavedra, Genome-wide enhancer prediction from epigenetic signatures using genetic algorithm-optimized support vector
machines, Nucleic Acids Res. 40 (10) (2012) e77, doi:10.1093/nar/gks149.
[25] J.W. Fickett, The gene identification problem: an overview for developers, Comput. Chem. 20 (1) (1996) 103–118, doi:10.1016/s0097-8485(96)80012-x.
[26] J.W. Fickett, A.G. Hatzigeorgiou, Eukaryotic promoter recognition, Genome Res. 7 (9) (1997) 861–878, doi:10.1101/gr.7.9.861.
[27] G.B. Fogel, D.G. Weekes, G. Varga, E.R. Dow, H.B. Harlow, J.E. Onyia, C. Su, Discovery of sequence motifs related to coexpression of genes using evolu-
tionary computation, Nucleic Acids Res. 32 (13) (2004) 3826–3835, doi:10.1093/nar/gkh713.
[28] M.T. Friberg, P. von Rohr, G.H. Gonnet, Scoring functions for transcription factor binding site prediction, BMC Bioinform. 6 (2004) 84–84.
[29] M. Ghandi, D. Lee, M. Mohammad-Noori, M.A. Beer, O. Winther, Enhanced regulatory sequence prediction using gapped k-mers features, PLoS Comput.
Biol. 10 (7) (2014) e1003711, doi:10.1371/journal.pcbi.1003711.
[30] D. GuhaThakurta, Computational identification of transcriptional regulatory elements in DNA sequence, Nucleic Acids Res. 34 (12) (2006) 3585–3598,
doi:10.1093/nar/gkl372.
[31] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, New York, USA,
1997.
[32] S.J. van Heeringen, G.J.C. Veenstra, GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments, Bioinformatics 27 (2) (2011)
270–271.
[33] J. van Helden, B. Andre, J. Collado-Vides, Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonu-
cleotide frequencies, J. Mol. Biol. 281 (5) (1998) 827–842, doi:10.1006/jmbi.1998.1947.
[34] J. van Helden, A.F. Rios, J. Collado-Vides, Discovering regulatory elements in non-coding sequences by analysis of spaced dyads, Nucleic Acids Res. 28
(8) (2000) 1808–1818, doi:10.1093/nar/28.8.1808.
[35] L.M. Hellman, M.G. Fried, Electrophoretic mobility shift assay (EMSA) for detecting protein-nucleic acid interactions, Nat. Protoc. 2 (8) (2007) 1849–
1861, doi:10.1038/nprot.2007.249.
[36] J.H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence,
2nd ed., MIT Press, Cambridge, MA, 1992.
[37] J.D. Hughes, P.W. Estep, S. Tavazoie, G.M. Church, Computational identification of Cis-regulatory elements associated with groups of functionally related
genes in Saccharomyces cerevisiae, J. Mol. Biol. 296 (5) (2000) 1205–1214, doi:10.1006/jmbi.2000.3519.
[38] N. Ichinose, T. Yada, O. Gotoh, Large-scale motif discovery using DNA gray code and equiprobable oligomers, Bioinformatics 28 (1) (2012) 25–31.
[39] N. Jayaram, D. Usvyat, A.C.R. Martin, Evaluating tools for transcription factor binding site prediction, BMC Bioinform. (2016),
doi:10.1186/s12859-016-1298-9.
[40] S.T. Jensen, J.S. Liu, Biooptimizer: a Bayesian scoring function approach to motif discovery, Bioinformatics 20 (10) (2004) 1557,
doi:10.1093/bioinformatics/bth127.
[41] S.J.M. Jones, Prediction of genomic functional elements, Annu. Rev. Genomics Hum. Genet. 7 (2006) 315–338,
doi:10.1146/annurev.genom.7.080505.115745.
[42] A.E. Kel, E. Gössling, I. Reuter, E. Cheremushkin, O.V. Kel-Margoulis, E. Wingender, MATCH: a tool for searching transcription factor binding sites in
DNA sequences, Nucleic Acids Res. 31 (13) (2003) 3576–3579, doi:10.1093/nar/gkg585.
[43] D.R. Kelley, J. Snoek, J. Rinn, Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks, Genome Res.
(2016), doi:10.1101/gr.200535.115.
[44] J.F. Kennedy, R.C. Eberhart, Y. Shi, Swarm Intelligence, Morgan Kaufmann Publishers, 2001.
[45] J. Kim, S. Yu, S. Yoon, Ensemble algorithms for DNA motif finding, in: Proceedings of the International Conference on Electronics, Information and
Communications (ICEIC), 2014, pp. 1–2, doi:10.1109/ELINFOCOM.2014.6914361.
[46] W. Kim, M. Li, J. Wang, Y. Pan, Biological network motif detection and evaluation, BMC Syst. Biol. 5 (Suppl 3) (2011) S5.
[47] C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald, J.C. Wootton, Detecting subtle sequence signals: a Gibbs sampling strategy for multiple
alignment, Science 262 (1993) 208–214.
[48] N.K. Lee, D. Wang, SOMEA: self-organizing map based extraction algorithm for DNA motif identification with heterogeneous model, BMC Bioinform.
12 (Suppl 1) (2011) S16, doi:10.1186/1471-2105-12-s1-s16.
[49] L. Li, GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery, J. Comput. Biol. 16 (2) (2009)
317–329, doi:10.1089/cmb.2008.16tt.
[50] L. Li, Y. Liang, R.L. Bass, GAPWM: a genetic algorithm method for optimizing a position weight matrix, Bioinformatics 23 (10) (2007) 1188–1194,
doi:10.1093/bioinformatics/btm080.
[51] M. Li, B. Ma, L. Wang, Finding similar regions in many strings, J. Comput. Syst. Sci. 65 (1) (2002) 473–482, doi:10.1006/jcss.2002.1823.
[52] X. Li, D. Wang, An improved genetic algorithm for DNA motif discovery with public domain information, in: M. Köppen, N. Kasabov, G. Coghill (Eds.),
Proceedings of the Fifteenth International Conference on Advances in Neuro-Information Processing: ICONIP 2008, Auckland, New Zealand, November
25–28, 2008, Revised Selected Papers, Part I, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 521–528, doi:10.1007/978-3-642-02490-0_64.
[53] F.F.M. Liu, J.J.P. Tsai, R.M. Chen, S.N. Chen, S.H. Shih, FMGA: finding motifs by genetic algorithm, in: Proceedings of the Fourth IEEE Symposium on
Bioinformatics and Bioengineering (BIBE’04), IEEE Computer Society, Washington, DC, USA, 2004, pp. 459–466.
[54] H. Liu, F. Han, H. Zhou, X. Yan, K. Kosik, Fast motif discovery in short sequences, in: Proceedings of the IEEE Thirty-Second International Conference
on Data Engineering (ICDE), 2016, pp. 1158–1169, doi:10.1109/ICDE.2016.7498321.
[55] X.S. Liu, D.L. Brutlag, J.S. Liu, BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes, in: Proceedings
of the Pacific Symposium on Biocomputing, 6, 2001, pp. 127–138.
[56] X.S. Liu, D.L. Brutlag, J.S. Liu, An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray
experiments, Nat. Biotechnol. 20 (8) (2002) 835–839, doi:10.1038/nbt717.
[57] M. Lones, A. Tyrrell, Regulatory motif discovery using a population clustering evolutionary algorithm, IEEE/ACM Trans. Comput. Biol. Bioinform. 4 (3)
(2007) 403–414, doi:10.1109/tcbb.2007.1044.
[58] Q. Lu, S. Li, W. Zhang, L. Zhang, A genetic algorithm-based job scheduling model for big data analytics, EURASIP J. Wirel. Commun. Netw. 2016 (2016)
152.
[59] Q. Ma, H. Zhang, X. Mao, C. Zhou, B. Liu, X. Chen, Y. Xu, DMINDA: an integrated web server for DNA motif identification and analyses, Nucleic Acids
Res. 42 (W1) (2014) W12–W19, doi:10.1093/nar/gku315.
[60] S. Mahony, D. Hendrix, A. Golden, T.J. Smith, D.S. Rokhsar, Transcription factor binding site identification using the self-organizing map, Bioinformatics
21 (9) (2005) 1807–1814, doi:10.1093/bioinformatics/bti256.
[61] S. Mahony, B.F. Pugh, Protein-DNA binding in high-resolution, Crit. Rev. Biochem. Mol. Biol. 50 (4) (2015) 269–283, doi:10.3109/10409238.2015.1051505.
[62] S. Mallik, U. Maulik, MiRNA-TF-gene network analysis through ranking of biomolecules for multi-informative uterine leiomyoma dataset, J. Biomed. Inform.
57 (2015) 308–319, doi:10.1016/j.jbi.2015.08.014.
[63] S. Mallik, A. Mukhopadhyay, U. Maulik, S. Bandyopadhyay, Integrated analysis of gene expression and genome-wide DNA methylation for tumor prediction:
an association rule mining-based approach, in: Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and
Computational Biology (CIBCB), 2013, pp. 120–127, doi:10.1109/CIBCB.2013.6595397.
[64] G.A. Maston, S.K. Evans, M.R. Green, Transcriptional regulatory elements in the human genome, Annu. Rev. Genomics Hum. Genet. 7 (1) (2006) 29–59,
doi:10.1146/annurev.genom.7.080505.115623.
[65] R. Parikh, A. Mathai, S. Parikh, G. Chandra Sekhar, R. Thomas, Understanding and using sensitivity, specificity and predictive values, Indian J. Ophthal-
mol. 56 (1) (2008) 45–50.
[66] T.K. Paul, H. Iba, Identification of weak motifs in multiple biological sequences using genetic algorithm, in: Proceedings of the Eighth Annual Confer-
ence on Genetic and Evolutionary Computation, in: GECCO ’06, ACM, New York, USA, 2006, pp. 271–278, doi:10.1145/1143997.1144044.
[67] G. Pavesi, G. Mauri, G. Pesole, An algorithm for finding signals of unknown length in DNA sequences, Bioinformatics 17 (Suppl 1) (2001) S207–S214,
doi:10.1093/bioinformatics/17.suppl_1.s207.
[68] P.A. Pevzner, S.-H. Sze, Combinatorial approaches to finding subtle signals in DNA sequences, in: Proceedings of the Eighth International Conference
on Intelligent Systems for Molecular Biology, AAAI Press, 2000, pp. 269–278.
[69] E. Redhead, T.L. Bailey, Discriminative motif discovery in DNA and protein sequences using the DEME algorithm, BMC Bioinform. 8 (385) (2007),
doi:10.1186/1471-2105-8-385.
[70] M.F. Sagot, A. Viari, A double combinatorial approach to discovering patterns in biological sequences, in: D. Hirschberg, G. Myers (Eds.),
Proceedings of the Combinatorial Pattern Matching. CPM, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1996, pp. 186–208,
doi:10.1007/3-540-61258-0_15.
[71] T.D. Schneider, Consensus sequence zen, Appl. Bioinform. 1 (3) (2002) 111–119, doi:10.1002/9780471650126.dob0135.pub2.
[72] J. Shi, W. Yang, M. Chen, Y. Du, J. Zhang, K. Wang, AMD, an automated motif discovery tool using stepwise refinement of gapped consensuses, PLoS
ONE 6 (9) (2011) e24576, doi:10.1371/journal.pone.0024576.
[73] D. Shlyueva, G. Stampfel, A. Stark, Transcriptional enhancers: from properties to genome-wide predictions, Nat. Rev. Genet. 15 (4) (2014) 272–286,
doi:10.1038/nrg3682.
[74] D. Simcha, N.D. Price, D. Geman, The limits of de novo DNA motif discovery, PLoS ONE 7 (11) (2012) 1–9, doi:10.1371/journal.pone.0047836.
[75] R.S. Sinha, S. Singh, S. Singh, V.K. Banga, Speedup Genetic Algorithm Using C-CUDA, in: Proceedings of the Fifth International Conference on Commu-
nication Systems and Network Technologies, IEEE, 2015, pp. 1355–1359, doi:10.1109/CSNT.2015.148.
[76] S. Sinha, M. Tompa, YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation, Nucleic Acids Res. 31
(13) (2003) 3586–3588, doi:10.1093/nar/gkg618.
[77] R. Staden, Computer methods to locate signals in nucleic acid sequences, Nucleic Acids Res. 12 (1 Pt 2) (1984) 505–519, doi:10.1093/nar/12.1part2.505.
[78] G.D. Stormo, Modeling the specificity of protein-DNA interactions, Quant. Biol. 1 (2) (2013) 115–130, doi:10.1007/s40484-013-0012-4.
[79] G.D. Stormo, D.S. Fields, Specificity, free energy and information content in protein-DNA interactions, Trends Biochem. Sci. 23 (3) (1998) 109–113,
doi:10.1016/S0968-0004(98)01187-6.
[80] G.D. Stormo, Y. Zhao, Determining the specificity of protein-DNA interactions, Nat. Rev. Genet. 11 (11) (2010) 751, doi:10.1038/nrg2845.
[81] S. Tapan, D. Wang, A further study on mining DNA motifs using fuzzy self-organizing maps, IEEE Trans. Neural Netw. Learn. Syst. 27 (1) (2016) 113–124,
doi:10.1109/TNNLS.2015.2435155.
[82] G. Thijs, K. Marchal, M. Lescot, S. Rombauts, B. De Moor, P. Rouzé, Y. Moreau, Gibbs sampling method to detect
overrepresented motifs in the upstream regions of coexpressed genes, J. Comput. Biol. 9 (2) (2002) 447–464, doi:10.1089/10665270252935566.
[83] M. Tompa, N. Li, T.L. Bailey, G.M. Church, B. De Moor, E. Eskin, A.V. Favorov, M.C. Frith, Y. Fu, W.J. Kent, V.J. Makeev, A.A. Mironov, W.S. Noble, G. Pavesi,
G. Pesole, M. Regnier, N. Simonis, S. Sinha, G. Thijs, J. van Helden, M. Vandenbogaert, Z. Weng, C. Workman, C. Ye, Z. Zhu, Assessing computational
tools for the discovery of transcription factor binding sites, Nat. Biotechnol. 23 (1) (2005) 144.
[84] N.T.L. Tran, C.-H. Huang, A survey of motif finding web tools for detecting binding site motifs in ChIP-Seq data, Biol. Direct 9 (1) (2014) 4,
doi:10.1186/1745-6150-9-4.
[85] D. Wang, B-MISCORE: a new similarity metric for self-organization of DNA k-mers, Technical Report, La Trobe University, June 2013.
http://www.homepage.cs.latrobe.edu.au/dwang/BMISCORE.pdf.
[86] D. Wang, N.K. Lee, MISCORE: mismatch-based matrix similarity scores for DNA motif detection, in: M. Köppen, N. Kasabov, G. Coghill (Eds.),
Advances in Neuro-Information Processing, Lecture Notes in Computer Science, 5506, Springer Berlin / Heidelberg, 2009, pp. 478–485,
doi:10.1007/978-3-642-02490-0_59.
[87] D. Wang, X. Li, GAPK: Genetic algorithms with prior knowledge for motif discovery in DNA sequences, in: Proceedings of the IEEE Congress on
Evolutionary Computation (CEC), 2009, pp. 277–284, doi:10.1109/cec.2009.4982959.
[88] D. Wang, X. Li, iGAPK: improved GAPK algorithm for regulatory DNA motif discovery, in: K. Wong, B. Mendis, A. Bouzerdoum (Eds.),
Neural Information Processing. Models and Applications, Lecture Notes in Computer Science, 6444, Springer Berlin / Heidelberg, 2010, pp. 217–225,
doi:10.1007/978-3-642-17534-3_27.
[89] D. Wang, S. Tapan, MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences, BMC Syst. Biol. 6 (2) (2012) S4,
doi:10.1186/1752-0509-6-S2-S4.
[90] D. Wang, S. Tapan, A robust elicitation algorithm for discovering DNA motifs using fuzzy self-organizing maps, IEEE Trans. Neural Netw. Learn. Syst. 24
(10) (2013) 1677–1688, doi:10.1109/TNNLS.2013.2275733.
[91] P. Wang, J. Lü, X. Yu, Identification of important nodes in directed biological networks: a network motif approach, PLoS ONE 9 (8) (2014) 1–15,
doi:10.1371/journal.pone.0106132.
[92] Z. Wei, S.T. Jensen, GAME: detecting cis-regulatory elements using a genetic algorithm, Bioinformatics 22 (13) (2006) 1577–1584,
doi:10.1093/bioinformatics/btl147.
[93] E. Wijaya, S.-M. Yiu, N.T. Son, R. Kanagasabai, W.-K. Sung, Motifvoter: a novel ensemble method for fine-grained integration of generic motif finders,
Bioinformatics 24 (20) (2008) 2288–2295.
[94] K.C. Wong, C. Peng, Y. Li, Evolving transcription factor binding site models from protein binding microarray data, IEEE Trans. Cybern. 47 (2) (2017)
415–424, doi:10.1109/TCYB.2016.2519380.
[95] M.M. Yin, J.T. Wang, Effective hidden Markov models for detecting splicing junction sites in DNA sequences, Inf. Sci. (NY) 139 (1) (2001) 139–163.
Bioinformatics. doi:10.1016/S0020-0255(01)00160-8.
[96] F. Zambelli, G. Pesole, G. Pavesi, Motif discovery and transcription factor binding sites before and after the next-generation sequencing era, Brief.
Bioinform. 14 (2) (2013) 225–237, doi:10.1093/bib/bbs016.
[97] F. Zare-Mirakabad, H. Ahrabian, M. Sadeghi, S. Hashemifar, A. Nowzari-Dalini, B. Goliaei, Genetic algorithm for dyad pattern finding in DNA sequences,
Genes Genet. Syst. 84 (1) (2009) 81–93, doi:10.1266/ggs.84.81.
[98] J. Zhou, O.G. Troyanskaya, Predicting effects of noncoding variants with deep learning–based sequence model, Nat. Methods 12 (2015) 931–934.