
Vol. 23 no. 10 2007, pages 1181–1187


BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm097

Genome analysis

IMEx: Imperfect Microsatellite Extractor


Suresh B. Mudunuri and Hampapathalu A. Nagarajaram
Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics (CDFD), ECIL Road, Nacharam,
Hyderabad 500 076, India
Received on December 15, 2006; revised on February 26, 2007; accepted on March 7, 2007
Advance Access publication March 22, 2007
Associate Editor: Alex Bateman

Downloaded from https://academic.oup.com/bioinformatics/article/23/10/1181/197745 by guest on 16 April 2023


ABSTRACT
Motivation: Microsatellites, also known as simple sequence repeats, are tandem repeats of nucleotide motifs of size 1–6 bp found in every genome known so far. Their importance in genomes is well known. Microsatellites are associated with various disease genes, have been used as molecular markers in linkage analysis and DNA fingerprinting studies, and also seem to play an important role in genome evolution. Therefore, it is important to study the distribution, enrichment and polymorphism of microsatellites in the genomes of interest. The prerequisite for this is a computational tool for extraction of microsatellites (perfect as well as imperfect) and their related information from whole genome sequences. Examination of available tools revealed certain lacunae in them and prompted us to develop a new tool.
Results: In order to efficiently screen genome sequences for microsatellites (perfect as well as imperfect), we developed a new tool called IMEx (Imperfect Microsatellite Extractor). IMEx uses a simple string-matching algorithm with a sliding window approach to screen DNA sequences for microsatellites and reports the motif, copy number, genomic location, nearby genes, mutational events and many other features useful for in-depth studies. IMEx is more sensitive, efficient and useful than the available widely used tools. IMEx is available in the form of a stand-alone program as well as in the form of a web server.
Availability: A World Wide Web server and the stand-alone program are available for free access at http://203.197.254.154/IMEX/ or http://www.cdfd.org.in/imex
Contact: han@cdfd.org.in

*To whom correspondence should be addressed.
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

1 INTRODUCTION
Microsatellites or simple sequence repeats (SSRs) are nucleotide sequences arising out of tandem repetition of short sequence motifs of size 1–6 bp (Schlotterer, 2000). Microsatellites have been found in all the known genomes so far and are widely distributed both in coding and non-coding regions (Sreenu et al., 2006, 2007; Toth et al., 2000). They are known to be highly polymorphic as a result of a high rate of mutations in the form of increase/decrease of their repeat copy numbers (Jarne and Lagoda, 1996). Increase/decrease of repeat copy numbers in microsatellites in coding regions often leads to shifts in reading frames, thereby causing changes in protein products (Li et al., 2004; Sreenu et al., 2006) and, in non-coding regions, is known to affect gene regulation (Martin et al., 2005). Mutations occurring at microsatellite loci within or near certain genes have been implicated in some human neurodegenerative diseases (Tautz and Schlotterer, 1994). Furthermore, microsatellite instability has also been implicated in the induction of cancer (Thibodeau et al., 1993). Owing to their high mutability, microsatellites are thought to be one of the sources of genetic diversity (Kashi and King, 2006). In recent times, efforts have also been made to study the possible functional roles of microsatellites in giving rise to a certain amount of plasticity and also in the evolution of genomes (Sreenu et al., 2006).
Apart from repeat copy number variation, a microsatellite tract (e.g. GCGCGCGCGC) also suffers from substitutions and indels of nucleotides, thereby becoming an ‘Imperfect’ tract (e.g. GCGCGCAGCGC: GC repeat with an insertion of A). Genomes harbor a significant number of imperfect microsatellites (Brinkmann et al., 1998; Sreenu and Nagarajaram, unpublished data). Imperfect microsatellites are more stable than perfect microsatellites as they are less prone to slippage mutations (Sturzeneker et al., 1998) and are known to play a role in gene regulation (Meloni et al., 1998).
Most of the studies reported in the literature on microsatellites have focused on their frequencies, abundance and polymorphisms in various genomes, both prokaryotic and eukaryotic. A few hypotheses have also been proposed on the life cycle (‘birth’ and ‘death’) of microsatellites in eukaryotes (Buschiazzo and Gemmell, 2006; Chambers and MacAvoy, 2000). Some studies have also revealed the role of point mutations (indels and substitutions of nucleotides) in the genesis/annihilation and evolution of microsatellites (Messier et al., 1996; Sreenu and Nagarajaram, unpublished data). However, a large body of microsatellite data from several genome sequences still remains unexplored. Studies pertaining to the distribution, enrichment and mutational dynamics of microsatellites, along with their role in gene function and expression, are essential to understand the processes that underpin the evolution and diversity of genomes.
In the course of our studies on microsatellites, we made a survey of existing software tools for identification and extraction of microsatellites from nucleotide sequences. All
these tools can be broadly grouped into two categories: those which can identify only perfect microsatellites (e.g. SSRF (Sreenu et al., 2003), Poly (Bizzaro and Marx, 2003), SSRIT (Temnykh et al., 2001)) and the others which can identify perfect as well as imperfect microsatellites (e.g. TRF (Benson, 1999), ATR Hunter (Wexler et al., 2004) and Sputnik (Abajian, 1994)). Our survey also revealed certain ‘lacunae’ in the tools. Programs such as ‘mreps’ (Kolpakov et al., 2003) and TandemSWAN (Boeva et al., 2006) consider only substitutions but not indels. TROLL (Castelo et al., 2002), STAR (Delgrange and Rivals, 2004) and SSRscanner (Anwar and Khan, 2006) use a predefined set of motifs to search for microsatellites in genomes and are therefore not very convenient for global automated searches. The algorithms of TRF (Benson, 1999), ATR Hunter (Wexler et al., 2004) and STRING (Parisi et al., 2003) have been designed to find tandem repeats of large-size motifs, as large as 2000 bases, and hence large numbers of microsatellites go unidentified by these methods. Many of these programs do not generate alignments between imperfect microsatellites and their expected perfect counterparts, and therefore require additional post-processing in order to study the mutational events in microsatellites. In view of these lacunae and to aid our systematic analysis of imperfect microsatellites, we developed a program called IMEx (Imperfect Microsatellite Extractor) with a number of discovery-friendly features. IMEx is fast, highly sensitive and flexible: the user can set the limits for imperfection (thus it can be used for both perfect and imperfect microsatellites). The output comprises a list of microsatellites, each with information such as its total imperfection content, point mutations, sequence alignment with its perfect counterpart, and whether the locus lies in a coding or non-coding region along with corresponding known details. The IMEx program is available in two modes: as a stand-alone program and also in the form of a web server. The stand-alone program as well as the web server are available from the web site http://203.197.254.154/IMEX/ or http://www.cdfd.org.in/imex.

2 ALGORITHM
We define a sequence at a given locus as a microsatellite if that sequence can be expressed as a tandem repeat of a motif of 1–6 bp size. The repeating motif at every iteration can harbor up to ‘k’ point mutations (substitutions or indels of nucleotides). For example, the sequence ATATGTAGAT is a tandem repeat of the motif ‘AT’ with two substitutions, A→G and T→G, at the third and fourth iterations, respectively. The IMEx algorithm uses this definition and employs a simple string-searching algorithm with a ‘sliding window’ approach. Conceptually, IMEx may be described as a two-step procedure: (a) identification of microsatellite nucleation sites, which are the loci where a repeat motif is repeated twice either tandemly (type I nucleation site) (Fig. 1) or after certain intervening nucleotides (type II nucleation sites) (Fig. 2); in both cases the repeat motif does not contain any imperfection (i.e. k = 0); and (b) extension of the nucleation sites on both sides in steps of the motif (with imperfections not exceeding the ‘k’ value) until one of the termination criteria is satisfied: (i) the number of imperfections (inclusive of substitutions and a maximum of one indel) between the individual repeat copy and the perfect repeat motif is more than the limit (denoted by the ‘k’ parameter set by the user) and (ii) the percentage of imperfection is more than the limit set by the user (denoted by the ‘p’ parameter). The percentage imperfection is calculated as follows:

\[ \text{Percentage imperfection} = \frac{\text{number of point mutations in the observed tract}}{\text{total number of bases in the equivalent perfect tract}} \times 100 \]

The user can set a value for ‘k’ between 0 and m, where m = repeat motif size. In the case of mononucleotide repeats, a little post-processing is done by the program to make sure that the trailing nucleotides on both sides of the tract are perfect. For example, the sequence AAAAAAAAAATGAAAAAAAA is picked up as a single tract of (A)20 with two imperfections whereas TAAAAAAAAAAAAAAAAATG is picked up as (A)17. Once the termination criteria are satisfied, only those candidate microsatellites that reach the minimum repeat number of that repeat size set by the user (denoted by the ‘n’ parameter) are reported.
While identifying the repeating copy, IMEx treats substitutions and indels on par. However, in certain instances, substitutions may have to take precedence over indels or vice versa. For example, the sequence ATGATGATATGATG can be viewed as ATG ATG ATA -TG ATG with G→A at the third iteration followed by a deletion of A at the fourth iteration. The same tract can be expressed as ATG ATG AT- ATG ATG with one deletion at the third iteration. IMEx chooses the longest repeating tract with the least edit distance. In this example, the latter is reported.
The flowchart of the IMEx algorithm is shown in Figure 3. IMEx progressively scans for nucleation sites starting from the longest repeat unit, i.e. hexanucleotide, to the shortest repeat unit, i.e. mononucleotide, at a given locus. In other words, for each position i in the sequence, it first looks for hexanucleotide repeat nucleation sites. If no hexanucleotide repeat tract is detected, then it looks for a pentanucleotide repeat nucleation site (m = 5) and so on. IMEx automatically removes redundancies. For example, ATGCCCATGCCC is identified as (ATGCCC)2 only and the internal repeat of C within the hexanucleotide motif is ignored.
While detecting the microsatellite tract as a tandem repeat of a motif, IMEx also simultaneously stores the edit operations (indels and substitutions). Pairwise alignment between the identified tract and its perfect counterpart is, nevertheless, produced to indicate the matches, mismatches and gaps. A sample alignment produced by IMEx is shown in Figure 4. Along with the alignments, the details of the repeat tract such as consensus (repeating unit), number of iterations, tract length, imperfection percentage, nucleotide composition and coding region (if it is in the coding region) or flanking coding regions (if it is in the non-coding region) are written to a file in the form of a table. IMEx uses a .ptt file (NCBI’s protein table file) for protein-coding region information.

Fig. 1. An illustration of type I nucleation site. The site is characterized by two consecutive occurrences of the motif ‘ATGC’ with the number of edit operations equal to 0 (i.e. k = 0). As can be seen here, the nucleation site can be extended on either side in steps of the motif with some edit operations like substitutions and indels.

Fig. 2. Type II nucleation sites characterized by two identical motifs intervened by four nucleotides. Note that the intervening sequence ATGG is an iteration of ATGC with C→G operation (k = 1).

Fig. 3. Flowchart of IMEx algorithm.

Fig. 4. A sample output of IMEx showing alignment of the microsatellite of GT unit (sequence at the top of the alignment) with its perfect counterpart (sequence at the bottom of the alignment). Please note that the observed microsatellite tract can be written as 12 iterations of GT with GT undergoing substitution at the second iteration and an insertion of T after the fifth iteration.

3 IMPLEMENTATION
IMEx has been developed in standard C language and has undergone extensive preliminary testing and comparison with other existing tools, yielding satisfactory results. The program can read a sequence of any length as memory is dynamically allocated. However, the size limit is subject to the system configuration. The program has been successfully tested on the Human X chromosome of size 147MB, on a system with an Intel Xeon processor and 2GB RAM. A web server has also been created and can be accessed from http://203.197.254.154/IMEX/ or http://www.cdfd.org.in/imex. The web server has been developed using CGI-Perl. HTML forms have been created to get the input sequences and parameters used by the C program and to display the results in the browser. The stand-alone program can be downloaded from the ‘downloads’ section of the web server homepage.
Input to the program consists of a sequence file and the following parameters: (a) number of edit operations/motif (k); (b) percentage imperfection for the entire tract (p); (c) minimum repeat number (n); (d) coding information file. The web version offers three different modes of access to the program: basic, intermediate and advanced. The basic mode contains very few options to be set by the user. It runs with default values, except for an option to select either perfect or imperfect microsatellites. The default parameters of IMEx are as follows: imperfection percentage (p) is 10% for all repeat sizes; imperfection limit/repeat unit (k) of each repeat size: (Mono: 1, Di: 1, Tri: 1, Tetra: 2, Penta: 2, Hexa: 3) and the minimum number of repeat units (n) is set to 2 for all repeat sizes, i.e. any repeat unit that is repeated at least twice is reported. The intermediate mode offers a few options where the user can adjust the p value for all repeat tracts, the k value for each repeat unit size and other options. The advanced mode offers all the options available for this program and lets the user adjust all the available parameters: it can set the flanking sequences’ size limit, switch on text outputs, search for a particular pattern, etc. The interface has been designed for the convenience of the users. Using IMEx, the user can also search for a particular pattern (such as CAG repeats), search for repeats of a particular size (di or tetra), or search only perfect repeats or a combination of perfect and imperfect repeats.
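The nucleation step of Section 2 (two imperfection-free copies of a motif, either adjacent or separated by intervening bases, searched from hexanucleotide down to mononucleotide) can be illustrated with a minimal Python sketch. This is not the IMEx C source: the function names, the rule that skips motifs that are themselves repeats of a shorter unit, the restriction of type II sites to motifs longer than 1 bp with exactly m intervening bases, and the jump past a detected site are all assumptions made here for illustration.

```python
def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (assumed rule)."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d)
                   for d in range(1, n))


def find_nucleation_sites(seq, max_motif=6):
    """Return (position, motif, kind) tuples, kind being 'I' or 'II'.

    Type I: two perfect tandem copies of the motif.
    Type II: two perfect copies separated by m intervening bases
             (assumed here to apply only for m > 1).
    """
    seq = seq.upper()
    sites = []
    i = 0
    while i < len(seq):
        hit = None
        for m in range(max_motif, 0, -1):        # hexanucleotide first, mono last
            motif = seq[i:i + m]
            if len(motif) < m or not is_primitive(motif):
                continue
            if seq[i + m:i + 2 * m] == motif:     # type I nucleation site
                hit = (i, motif, "I", 2 * m)
                break
            if m > 1 and seq[i + 2 * m:i + 3 * m] == motif:   # type II site
                hit = (i, motif, "II", 3 * m)
                break
        if hit:
            sites.append(hit[:3])
            i += hit[3]                           # assumed: jump past the site
        else:
            i += 1
    return sites


print(find_nucleation_sites("ATGCATGGATGC"))      # [(0, 'ATGC', 'II')]
print(find_nucleation_sites("ATGCCCATGCCC"))      # [(0, 'ATGCCC', 'I')]
```

Scanning the longest motif first is what makes ATGCCCATGCCC come out as one hexanucleotide site rather than as a run of C inside it. The extension step with the ‘k’ and ‘p’ limits is deliberately left out of this sketch.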

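The percentage imperfection defined in Section 2 can be checked on the example tracts from the text with a short Python sketch. As an assumption made for illustration, the number of point mutations is taken to be the Levenshtein distance between the observed tract and its perfect counterpart; the paper states that IMEx records edit operations while extending a tract, so this is a stand-in rather than the program's own procedure, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions and indels each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete a base from a
                           cur[j - 1] + 1,             # insert a base into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def percent_imperfection(tract, motif, iterations):
    """Point mutations per 100 bases of the equivalent perfect tract."""
    perfect = motif * iterations
    return 100.0 * edit_distance(tract, perfect) / len(perfect)


# Example tracts taken from the text
print(percent_imperfection("ATATGTAGAT", "AT", 5))             # 20.0
print(percent_imperfection("GCGCGCAGCGC", "GC", 5))            # 10.0
print(percent_imperfection("AAAAAAAAAATGAAAAAAAA", "A", 20))   # 10.0
```

Under this reading, ATGATGATATGATG differs from (ATG)5 by a single deletion, which agrees with the alignment the text says is reported, and the (A)20 example sits exactly at the default limit of p = 10%.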
The program generates two files, one of which gives a summary table describing the microsatellite tracts along with their information, which includes tract size, number of iterations, percentage imperfection, nucleotide composition and coding/non-coding information. The second file contains the alignment of each repeat with its consensus sequence. These two files are produced both in HTML and in text formats. The text files can be downloaded and used for further studies. In the HTML outputs, the files are linked so that clicking a repeat displays its corresponding alignment in a separate HTML page. A link has also been provided to show the function of the coding region near which the microsatellite is located. Figure 5 gives a partial extract of the program output.
Primer3, a primer design software (Rozen and Skaletsky, 2000), has been linked to the web version of IMEx. In the summary table HTML output, the user can select a microsatellite tract for which primers are to be designed and the interface automatically prepares the input for the Primer3 software. The user can also modify the input to Primer3.

Fig. 5. A partial output of IMEx showing the results of the Escherichia coli K12 genome when run in advanced mode. The figure is a snapshot of the IMEx result, which includes a summary table, a sample alignment and protein information page. The summary table shows the microsatellite repeat information such as repeat unit, number of iterations, tract size, start and end values, imperfection percentage of the tract, number of indels in the tract and nucleotide composition. When the hyperlink of the motif is clicked, its corresponding alignment with its perfect counterpart is displayed in a separate HTML page. When the hyperlink on the coding/non-coding region is clicked, its corresponding protein information is displayed.

4 RESULTS AND DISCUSSION
To demonstrate the capabilities of our program, we analyzed the human atrophin1 gene (BC051795) and compared the results obtained with those obtained using tandem repeat finder (TRF) and Sputnik. TRF was initially tested with the parameters used in the earlier studies (Archak et al., 2007; Boby et al., 2005; Ross et al., 2003), which yielded very few microsatellites. Hence, we used the most relaxed set of parameters (Match: +2, Substitution: 7, Indel: 7, Min Score: 2), which yielded a substantial number of microsatellites. This is because the length of microsatellite detected by TRF is dependent on the value of Min Score. For Sputnik also, we used the least stringent parameters (Match: +1, Mismatch: 3, Min Score: 5). For IMEx, we set the ‘p’ value of all tracts to 10% and the ‘k’ value for each pattern size to Mono: 1, Di: 1, Tri: 1, Tetra: 2, Penta: 2, Hexa: 3, and further restricted it to report only those microsatellites with minimum repeat copy number (Mono: 5, Di: 3, Tri: 2, Tetra: 2, Penta: 2, Hexa: 2) to match those reported by TRF and Sputnik. TRF and Sputnik identified 50 and 19 repeats respectively, whereas IMEx identified 146 microsatellite tracts (Table 1). In fact, IMEx picks up a total of 876 repeats (if the minimum repeat number of all repeat sizes is set to 2) which are, needless to mention, useful for studies concerning the evolution of microsatellites, especially when one is making cross-genome comparisons. Table 1 gives the list of microsatellites picked up by IMEx, TRF and Sputnik.
As can be seen from the results, IMEx reports many more tracts which are missed by the other two programs. It is important to mention that Sputnik does not report mononucleotide and hexanucleotide tracts. Out of the 146 microsatellites reported by IMEx, 94 correspond to di–penta tracts, of which Sputnik reports only 19.
We also ran the three programs on four whole genome sequences: Plasmodium falciparum chromosome IV (NC_004318.1), yeast chromosome IV (NC_001136.8), Mycobacterium tuberculosis H37Rv genome (NC_000962.2) and E.coli K12 genome (NC_000913.2). The sequences were downloaded from ftp://ftp.ncbi.nih.gov/genomes. The three programs were run with the same parameters that we used in the above analysis. Table 2 shows the execution times and number of repeat tracts extracted from each sequence by each of the three programs.
From the results shown in Table 2, it is clear that IMEx outperforms TRF and Sputnik in terms of its ability to identify and report microsatellite tracts in relatively shorter time. It is also clear from the table that the execution time of IMEx is linear in (directly proportional to) the sequence length rather than the number of repeats detected, whereas the execution time of TRF is correlated with the number of repeats detected. TRF uses a probabilistic algorithm which includes a ‘detection’ step to identify the candidate repeats and an ‘analysis’ step that uses different statistical criteria to filter the candidate repeats. Therefore, the more candidate repeats are detected, the more time TRF takes to execute. Sputnik uses a recursive algorithm and its performance depends on the recursion depth of the program. Hence, Sputnik’s execution time seems to be dependent on the sequence composition. On the other hand, IMEx uses a simple string-matching algorithm that scans the entire sequence using a sliding window approach and reports the results in a single run. Hence, the processing time of IMEx is dependent on the length of the DNA sequence and not on the number of microsatellites.
In essence, IMEx embodies all the features required for a systematic analysis of microsatellites which are not readily available in the other tools, as IMEx has been designed keeping in view the limitations we encountered with the other available tools. Using IMEx, the users can: (i) search only


Table 1. Microsatellites identified by IMEx in the human atrophin1 gene (4382 bp). Tracts identified by TRF and Sputnik are given in bold; those identified by TRF are given in italics (see text for details)

Locus in bp Microsatellite tract Motif
1500–1505 CT CT CT CT


1518–1523 CCCCCC C
35–40 GGA GGA GGA 1539–1544 TC TC TC TC
243–253 TGAA –GAA TGAA TGAA 1556–1561 TG TG TG TG
284–289 GAG GAG GAG 1572–1576 CCCCC C
296–301 GAA GAA GAA 1577–1582 ACC ACC ACC
321–326 GAA GAA GAA 1583–1588 TCC TCC TCC
338–347 GGGCC GGGCC GGGCC 1594–1599 GCC GCC GCC
371–376 CAG CAG CAG 1633–1640 TCCC TCCC TCCC



411–416 AAG AAG AAG 1663–1670 CCCA CCCA CCCA
449–454 CAA CAA CAA 1683–1694 CATCAC CATCAC CATCAC
468–473 GAG GAG GAG 1698–1721 CAGCAA CAGCAA CAGCAA
479–484 AG AG AG AG CAGCAG CAGCAG
$
492–497 GAG GAG GAG 1689–1763 CAT CAC CAC CAG CAA CAG CAG
509–514 AAAAAA A CAA CAG CAG CAG CAG CAG
589–597 ATG ATG ATG ATG CAG CAG CAG CAG CAG CAG
598–603 GCA GCA GCA CAG CAG CAG CAG CAG CAT
643–647 CCCCC C CAC
677–688 TGACTC TGACTC TGACTC 1776–1780 CCCCC C
690–695 TCT TCT TCT ^1781–1786 TCC TCC TCC
709–721 GCCC A GCCC GCCC GCCC 1798–1805 CCCA CCCA CCCA
1
724–729 ACC ACC ACC 1823–1828 CCA CCA CCA
846–851 CCCCCC C 1850–1857 TCCC TCCC TCCC
871–876 CTC CTC CTC 1899–1903 CCCCC C

883–894 CCCCTC CCCCTC CCCCTC 1956–1967 TCTTCC TCTTCC TCTTCC

912–916 GGGGG G 1970–1987 CTCTTC CTCTTC CTCTTC ATCTTC CTCTTC
920–925 TGG TGG TGG 2012–2021 CCCCT CCCCT CCCCT
939–943 CCCCC C 2072–2077 CAC CAC CAC
^956–961 GGGGGG G 2125–2130 CAG CAG CAG
972–977 TCA TCA TCA 2157–2161 CCCCC C
980–985 GGGGGG G 2173–2178 AG AG AG AG
993–997 GGGGG G 2189–2193 GGGGG G
1000–1005 AGC AGC AGC 2206–2215 CCACC CCACC CCACC
1007–1012 CCCCCC C 2235–2246 CCTC CCTC CTTC CCTC
1017–1022 ACT ACT ACT 2376–2387 CCA CCA CCA CCT CCA
1051–1056 GTG GTG GTG 2405–2410 GCC GCC GCC
1059–1063 CCCCC C 2445–2450 GAG GAG GAG
1070–1075 GCC GCC GCC 2458–2462 CCCCC C
1086–1093 GTGG GTGG GTGG 2475–2479 CCCCC C
1099–1106 ACCT ACCT ACCT 2496–2500 CCCCC C
$
1113–1125 CCA CCA CCA G CCA CCA 2507–2512 GGT GGT GGT
1130–1134 CCCCC C 2531–2538 CAGT CAGT CAGT
2
1155–1159 CCCCC C 2549–2554 CAA CAA CAA
$
1178–1183 CAA CAA CAA 2581–2591 GC GC GC GC A GC GC
1197–1201 CCCCC C 2631–2636 AAG AAG AAG
1208–1212 GGGGG G 2647–2652 TGG TGG TGG
1274–1279 TCC TCC TCC 2660–2665 GCG GCG GCG
1326–1331 CCT CCT CCT 2677–2688 AGCG AGCG CGCG AGCG

1353–1358 CCCCCC C 2697–2720 GAGCGC GAGCGC GAGCGC
1382–1390 TAG TAG TAG TAG GAGCGG GAACGC

1395–1400 GCA GCA GCA 2728–2739 AGCGCG AGCGCG AGCGCG
1402–1407 CCT CCT CCT 2779–2784 AGG AGG AGG
1415–1420 TTC TTC TTC 2868–2873 CCCCCC C
^
1421–1426 CTC CTC CTC 2927–2936 TCATG TCATG TCATG
1438–1442 CCCCC C 2954–2961 CCAT CCAT CCAT
1472–1476 CCCCC C 2975–2979 GGGGG G
1483–1490 TCCC TCCC TCCC 3023–3028 CAG CAG CAG
1488–1492 CCCCC C 3174–3178 GGGGG G

(Continued)

Table 1. Continued

Locus in bp Microsatellite tract Motif

3344–3349 GGC GGC GGC
3402–3407 CAC CAC CAC
3428–3445 GCACCT GCACCT GCACCA GCACCT
3494–3498 CCCCC C
3554–3558 CCCCC C
3589–3594 TTC TTC TTC
3608–3613 TGC TGC TGC
3655–3660 CAG CAG CAG
3662–3667 TCA TCA TCA
3680–3687 GCAC GCAC GCAC
3692–3701 AGCTG AGCTG AGCTG
3720–3728 CAG CAG CAG CAG
3775–3780 AGG AGG AGG
3781–3786 ACT ACT ACT
3797–3802 GAA GAA GAA
3837–3842 AG AG AG AG
3874–3878 CCCCC C
,#
3884–3889 CCCCC C
3926–3931 TGC TGC TGC
3959–3964 AG AG AG AG
3966–3970 GGGGG G
3967–3978 GGGG AGGG AGGG AGGG
3979–3986 ACAG ACAG ACAG
3987–3998 AAGGCC AAGGCC AAGGCC
4082–4087 CCCCCC C
4096–4101 TCC TCC TCC
4100–4104 CCCCC C
$
4121–4131 GCC GCC GCC –CC GCC
4152–4157 ATT ATT ATT
4221–4226 TG TG TG TG
4231–4235 CCCCC C
4270–4275 TAA TAA TAA
^4278–4291 TA TA TA TA TA AA TA TA
4308–4314 AAAAAAA A
3
4321–4332 AACCAA AACCAA AACCAA
$
4330–4341 CAAC CAAA CAAA CAAA
4339–4343 AAAAA A
4361–4382 AAAAAAAAAAAAAAAAAAAAAA A

$ IMEx picked up larger microsatellites than TRF.
^ TRF picked up larger microsatellites than IMEx.
* TRF picked both of them as a single microsatellite.
1* TRF reported the tract as (CCAC)2.
2* TRF reported the tract as (AACA)2.
# Sputnik reported the tract as tetranucleotide repeat tract.
3* TRF reported the tract as (AACCAA)2 and also as (CCAA)5.

Table 2. Comparison of execution times (in seconds) of TRF, Sputnik and IMEx. The programs were run on an Intel Xeon Dual Processor 3.2 GHz Linux server

Sequence                     TRF                   Sputnik               IMEx
                             Repeats   Time (s)    Repeats   Time (s)    Repeats   Time (s)
Plasmodium Chr4 (1204 Kb)    25 601    69.8        10 810    89.1        54 232    2.9
Yeast Chr4 (1531 Kb)         7308      4.4         2831      287.2       39 759    4.0
MTB H37Rv (4411 Kb)          16 439    25.5        9412      17.7        111 113   11.6
E.coli K12 (4639 Kb)         12 043    8.8         5387      8.5         105 392   12.3

TRF: Match: +2 Subs: 8 Indel: 8 Min. Score: 20 pM: 0.80 pI: 0.10 Max. Period: 6.
Sputnik: Match: +2 Mismatch: 6 Min. Score: 8.
IMEx: ‘k’ value: Mono: 1, Di: 1, Tri: 1, Tetra: 2, Penta: 2, Hexa: 3; ‘p’ value: 10% for all repeat sizes; repeat length: 10 bases or more.

perfect as well as imperfect microsatellites; (ii) get the coding/non-coding information of the microsatellite tracts; (iii) generate alignments with their perfect counterparts to know about substitutions and indels; (iv) restrict the imperfection limit for the repeat unit of each size; (v) set the imperfection percentage threshold of the entire tract of each repeat size; (vi) restrict the minimum number of repeat units of a tract of each size; (vii) search for repeats of a particular size or for all sizes; (viii) search for microsatellite tracts of a particular pattern; (ix) set the flanking sequence size limit and (x) design primers seamlessly.
It is clear from the results that IMEx is attractive in terms of speed and sensitivity in identifying microsatellites, and has discovery-friendly inputs and outputs.

5 CONCLUSION
In this article, we have presented a new tool for extracting imperfect microsatellites from genomic sequences according to the requirements of the user. It uses a simple algorithm, which scans the entire DNA sequence and reports the microsatellites in a single run. Information about the microsatellites, such as coding/non-coding information, nucleotide composition, number of iterations, imperfections, etc., is also generated along with the alignments. The tool is extremely sensitive and fast. We have demonstrated the speed and accuracy of the tool by comparing it with other existing tools. This tool can serve as a valuable medium for studying the evolution of microsatellites.

ACKNOWLEDGEMENTS
The authors would like to thank Mr Pankaj Kumar, Mr Mohammad Anwaruddin, Dr V.B.Sreenu and Mr Suprabhat Reddy for their valuable suggestions and assistance. A grant from the Department of Biotechnology (DBT), India is gratefully acknowledged. The authors also thank the anonymous referees for their critical and constructive comments.

Conflict of Interest: none declared.

REFERENCES

Abajian,C. (1994) Sputnik http://espressosoftware.com/pages/sputnik.jsp
Anwar,T. and Khan,A.U. (2006) SSRscanner: a program for reporting distribution and exact location of simple sequence repeats. Bioinformation, 1, 89–91.
Archak,S. et al. (2007) InSatDb: a microsatellite database of fully sequenced insect genomes. Nucleic Acids Res., 35, D36–D39.
Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580.
Bizzaro,J.W. and Marx,K.A. (2003) Poly: a quantitative analysis tool for simple sequence repeat (SSR) tracts in DNA. BMC Bioinformatics, 4, 22.
Boby,T. et al. (2005) TRbase: a database relating tandem repeats to disease genes in the human genome. Bioinformatics, 21, 811–816.
Boeva,V. et al. (2006) Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics, 22, 676–684.
Brinkmann,B. et al. (1998) Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am. J. Hum. Genet., 62, 1408–1415.
Buschiazzo,E. and Gemmell,N.J. (2006) The rise, fall and renaissance of microsatellites in eukaryotic genomes. Bioessays, 28, 1040–1050.
Castelo,A. et al. (2002) TROLL – Tandem repeat occurrence locator. Bioinformatics, 18, 634–636.
Chambers,G.K. and MacAvoy,E.S. (2000) Microsatellites: consensus and controversy. Comp. Biochem. Physiol. B-Biochem. Mol. Biol., 126, 455–476.
Delgrange,O. and Rivals,E. (2004) STAR: an algorithm to search for tandem approximate repeats. Bioinformatics, 20, 2812–2820.
Jarne,P. and Lagoda,P.J.L. (1996) Microsatellites, from molecules to populations and back. Trends Ecol. Evol., 11, 424–429.
Kashi,Y. and King,D.G. (2006) Simple sequence repeats as advantageous mutators in evolution. Trends Genet., 22, 253–259.
Kolpakov,R. et al. (2003) mreps: efficient and flexible detection of tandem repeats in DNA sequences. Nucleic Acids Res., 31, 3672–3678.
Li,Y.C. et al. (2004) Microsatellites within genes: structure, function, and evolution. Mol. Biol. Evol., 21, 991–1007.
Martin,P. et al. (2005) Microsatellite instability regulates transcription factor binding and gene expression. PNAS, 102, 3800–3804.
Meloni,R. et al. (1998) A tetranucleotide polymorphic microsatellite, located in the first intron of the tyrosine hydroxylase gene, acts as a transcription regulatory element in vitro. Hum. Mol. Genet., 7, 423–428.
Messier,W. et al. (1996) The birth of microsatellites. Nature, 381, 483.
Parisi,V. et al. (2003) STRING: finding tandem repeats in DNA sequences. Bioinformatics, 19, 1733–1738.
Ross,C.L. et al. (2003) Rapid divergence of microsatellite abundance among species of Drosophila. Mol. Biol. Evol., 20, 1143–1157.
Rozen,S. and Skaletsky,H.J. (2000) Primer3 on the WWW for general users and for biologist programmers. In Krawetz,S. and Misener,S. (eds) Bioinformatics Methods and Protocols: Methods in Molecular Biology. Humana Press, Totowa, NJ, pp. 365–386.
Schlotterer,C. (2000) Evolutionary dynamics of microsatellite DNA. Chromosoma, 109, 365–371.
Sreenu,V.B. et al. (2003) MICAS: a fully automated web server for microsatellite extraction and analysis from prokaryote and viral genomic sequences. Appl. Bioinformatics, 2, 165–168.
Sreenu,V.B. et al. (2006) Microsatellite polymorphism across the M. tuberculosis and M. bovis genomes: implications on genome evolution and plasticity. BMC Genomics, 7, 78–88.
Sreenu,V.B. et al. (2007) Simple sequence repeats in mycobacterial genomes. J. Biosci., 32, 3–15.
Sturzeneker,R. et al. (1998) Polarity of mutation in tumor-associated microsatellite instability. Hum. Genet., 102, 231–235.
Tautz,D. and Schlotterer,C. (1994) Simple sequences. Curr. Opin. Genet. Dev., 4, 832–837.
Temnykh,S. et al. (2001) Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Res., 11, 1441–1452.
Thibodeau,S.N. et al. (1993) Microsatellite instability in cancer of the proximal colon. Science, 260, 816–819.
Toth,G. et al. (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res., 10, 967–981.
Wexler,Y. et al. (2004) Finding approximate tandem repeats in genomic sequences. RECOMB 2004.