

Expressed Sequence Tags (ESTs) provide researchers with a quick and inexpensive
route for discovering new genes, for obtaining data on gene expression and regulation, and for
the construction of genome maps. A single nucleotide polymorphism (SNP) is a DNA sequence
variation occurring when a single nucleotide (A, T, C, or G) in the genome differs between
members of a species, or between paired chromosomes in an individual. It would be worthwhile
to inventory SNP and INDEL variation, as it could become the raw material for future
high-throughput genotyping technologies. Such information is presently being collected in
humans, and a dedicated NCBI database, dbSNP, stores such data. In the crop industry, the
use of SNPs and INDELs as marker tools will be challenging because the high ploidy level may
prevent the straightforward application of emerging technologies developed in diploid model
organisms. However, it is hoped that with the rapid evolution of technology and the increasing
knowledge of molecular variation patterns in crops and other organisms, specific tools will
emerge that could improve the economics of future breeding programs.


Expressed Sequence Tags (ESTs) are 200-500 bp sequences that are obtained as part of a
3’ or 5’ single-pass read of individual clones. These clones are derived from cDNA libraries that
may be specific to a tissue and/or developmental stage of an organism, so they can be used to
identify expressed genes that are considered ‘rare’ when exploring the overall expression of an
organism’s transcriptome. These rare transcripts may have splice variants and/or alternative
polyadenylation. ESTs are popular primarily because they can be generated in high volume
and at low cost using high-throughput methods. Although they are a rich source of
sequence information for surveying a transcriptome, ESTs must be used with an understanding
of their limitations. The principal problems are that EST sequences contain errors, represent only
a portion of a gene product, and are found in vast, highly redundant datasets.

ESTs contain sequencing errors (sequence compression and frame-shift errors) at high
rates (3%) because they are generated from single-pass reads. These errors are not
evenly distributed along the length of the sequence but are concentrated toward its
start and end, so EST base positions 100 to 300 are the most accurate
part of the EST.

Each EST represents only a portion of a gene product. A length of 200-500 bp means
that there are potentially several thousand base pairs of the underlying transcript that are
not represented in the EST sequence. Because the sequencing reaction used to generate the EST
attenuates, the EST is shorter than the cDNA clone it was derived from.
EST single-pass reads can be generated from the 5’ or 3’ end. ESTs can also be generated using
random primers, which may result in ESTs of ambiguous orientation that come from different,
non-overlapping parts of the same RNA.

Databases that contain EST data also have two primary limitations.

1) ESTs, as a whole, are poorly annotated, both in terms of source and sequence
quality, which makes it difficult to determine what gene product a given EST represents.
2) EST databases contain a huge number of sequences with a high rate of redundancy,
which makes it difficult for a researcher to navigate them and extract concise information.


Expressed Sequence Tags (ESTs) provide researchers with a quick and inexpensive
route for discovering new genes, for obtaining data on gene expression and regulation, and for
the construction of genome maps. A single nucleotide polymorphism, or SNP (pronounced
‘snip’), is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the
genome differs between members of a species, or between paired chromosomes in an individual.
SNPs may fall within coding sequences of genes, non-coding
regions of genes, or intergenic regions. SNPs within a coding sequence will not
necessarily change the amino acid sequence of the protein that is produced, due to the degeneracy of
the genetic code. A SNP in which both alleles lead to the same polypeptide sequence is
termed synonymous (sometimes called a silent mutation); if a different polypeptide
sequence is produced, the SNP is non-synonymous. SNPs that
are not in protein-coding regions may still have consequences for gene splicing, transcription
factor binding, or the sequence of non-coding RNA. The study of single nucleotide
polymorphisms is also important in crop and livestock breeding programs. ESTs
are an important resource for identifying polymorphisms in transcribed regions.
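The distinction between synonymous and non-synonymous SNPs can be illustrated with a small sketch. It assumes the standard genetic code and that the reading frame of the codon is already known; the function name `classify_coding_snp` and its arguments are hypothetical and are not part of any tool discussed here.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, alt_base):
    """Classify a SNP inside a codon as synonymous or non-synonymous.

    codon    -- reference codon (three DNA letters, in frame; an assumption)
    position -- 0-based position of the SNP within the codon
    alt_base -- alternative base observed at that position
    """
    codon = codon.upper()
    alt_base = alt_base.upper()
    if len(codon) != 3 or position not in (0, 1, 2):
        raise ValueError("expected a 3-base codon and a position of 0, 1 or 2")
    alt_codon = codon[:position] + alt_base + codon[position + 1:]
    if alt_codon == codon:
        raise ValueError("alternative base equals the reference base")
    if CODON_TABLE[codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"


# GGA and GGC both encode glycine; GAA (Glu) to GTA (Val) changes the protein.
print(classify_coding_snp("GGA", 2, "C"))
print(classify_coding_snp("GAA", 1, "T"))
```

Third-position changes are often synonymous because of the degeneracy of the code, which is why the position within the codon matters.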

Single nucleotide polymorphisms (SNPs) have been shown to be the most abundant
source of DNA polymorphism in human, animal and plant genomes. SNPs are the most common
type of allelic variation found within and between varieties of a crop species. SNPs
possess desirable properties as molecular markers: because they are biallelic, they are
easy to score in high-throughput genotyping assays. Molecular genetic markers developed
from ESTs can be used to examine a group of individuals or populations to estimate various
diversity measures and genetic distances, infer genetic structure and clustering patterns, test for
Hardy-Weinberg equilibrium and multi-locus equilibrium, and test polymorphic loci for
evidence of selective neutrality. They are useful to plant breeders, germplasm managers, and
population geneticists. The use of EST sequence data for the identification of SNPs has many
advantages that can be exploited to facilitate the development of dense genetic maps and
marker-assisted breeding programs. SNPs can be used to saturate genetic maps in plants.

Expressed sequence tag (EST) sequencing programs have provided a wealth of
information, identifying novel genes from a broad range of organisms and providing an
indication of gene expression level in particular tissues. EST sequence data may provide the
richest source of biologically useful SNPs due to the relatively high redundancy of gene
sequence, the diversity of genotypes represented within databases, and the fact that each SNP
would be associated with an expressed gene.

Working With EST For Polymorphism Search:

The SNP detection Perl script AutoSNP version 1.0 is used mainly to find SNP site
information and for transition versus transversion analysis. EST-SNPs can also be detected with other
programs or servers such as SEAN, PolyPhred, PolyBayes, TRACE_DIFF, HaploSNPer and
HarvEST, but AutoSNP provides a user-friendly approach and reports interpretable results as an HTML file.
SNPs can be classified by nucleotide substitution as either transitions (G↔A or C↔T)
or transversions (C↔G, A↔T, C↔A or T↔G). Indel sites can be classified into four groups based on
the nucleotide involved (A/T/C/G). Thus ten kinds of SNP/indel are possible in genomes: two types of
transition, four types of transversion and four groups of indels.
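The transition/transversion classification above follows directly from base chemistry: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any purine-pyrimidine exchange is a transversion. A minimal sketch, with a hypothetical function name:

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
VALID_BASES = PURINES | PYRIMIDINES


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in VALID_BASES or alt not in VALID_BASES:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternative bases are identical")
    pair = {ref, alt}
    # Both bases in the same chemical class -> transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


for ref, alt in [("G", "A"), ("C", "T"), ("C", "G"), ("A", "T")]:
    print(ref, alt, classify_substitution(ref, alt))
```

Treating each pair as unordered gives the two transition types and four transversion types listed in the text.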

There are several strategies, both experimental and computational, for SNP discovery.
Experimental SNP discovery often consists of a number of laborious steps that make this process
complex and expensive. The computational approach makes use of the large sequence datasets
present in public databases. Over the last few years, a number of pipelines have been developed
that automatically detect SNPs in such databases. One type of pipeline detects SNPs using trace
files or quality files, for example the PHRED/PHRAP/PolyBayes system. The other type of
pipeline uses only EST redundancy in text-based sequence files to detect SNPs; these include
autoSNP and SNiPpER. Both autoSNP and SNiPpER are based on sequence redundancy for the
initial detection of SNPs, and sequencing errors are detected and filtered out by analyzing SNP

Genetic maps are usually constructed with DNA markers such as restriction fragment
length polymorphisms (RFLP), amplified fragment length polymorphisms (AFLP) or simple
sequence repeats (SSR) [1, 2]. With the advancement of sequencing technology, genetic
variation at the DNA sequence level can be easily analyzed. In maize, the frequency of
single nucleotide polymorphisms (SNPs) is quite high, on average 1 per 31 base
pairs (bp) in the 3’ untranslated regions (UTRs) and 1 per 124 bp in coding regions.
Insertion-deletion polymorphisms (indels) are also very frequent. The SNPs within an amplicon are
physically linked and therefore form a distinct haplotype. The abundance of SNPs makes them
highly useful for placing ESTs or candidate genes onto a genetic map that has previously been
constructed with other markers. These polymorphisms can either be drawn from public data
or discovered by re-sequencing the locus of interest in the mapping parents. For example, one
would expect to find polymorphisms in the 3’-UTR between B73 and Mo17 in more
than half of the cases. This was demonstrated by a re-sequencing effort covering 502 maize loci across
8 maize inbred lines, including the parents of the IBM population. The loci that are
monomorphic between B73 and Mo17 can be mapped using other populations. Once
polymorphisms are known, a number of SNP analysis methods are available for scoring
the polymorphisms associated with each locus in a mapping population.

A simple polymorphism search flow diagram.

Single-pass sequencing of the 5' and/or 3' ends of randomly selected cDNA clones is an
effective approach to obtaining genetic information about an organism. These sequences can
serve as markers or tags for transcripts, and have been used in the development of SNP markers
for reference genetic maps and in the recovery of full-length cDNA and genomic sequences.
Expressed sequence tags (ESTs) are also useful for the discovery of novel genes, investigation of genes of
unknown function, comparative genomic studies, and recognition of exon/intron
boundaries. Very large numbers of sequences are currently being uploaded every day, and the
majority of these are ESTs deposited at NCBI (dbEST). The lack of sequence information has limited the progress
of gene discovery and characterization, global transcript profiling, probe design for development
of gene arrays, and generation of molecular markers for.

Single nucleotide polymorphisms (SNPs) are a second class of genetic markers that can
be mined from sequence data and are useful for characterizing allelic variation, for genome-wide
mapping, and as a tool for marker-assisted selection. In the field of human genetics, SNPs are a
major focus of efforts to increase the efficiency of mapping and are already being used for
the detection and mapping of a variety of diseases. In many crop plants, SNPs are present with
sufficient frequency to offer an alternative for genetic mapping and marker-assisted selection.
Although SNPs can be identified by sequencing selected DNA fragments, a practical limitation
of this approach follows from the fact that the sequencing error rate is often higher than the
polymorphism rate. The cost of SNP discovery through sequencing amplified fragments is
therefore high even with reductions in the cost of sequencing. The SNP detection Perl script
AutoSNP version 1.0 is used to identify SNP and indel polymorphisms and to analyse DNA
substitutions (transitions versus transversions) and indels.

First, extract the required sequences from the EST database, dbEST. The CAP3 program is
normally used to assemble the EST sequences into contigs. The SNP detection tool AutoSNP version 1.0 is
then used to find candidate SNPs in these libraries. AutoSNP requires input in ACE or FASTA
format, and the Perl script is edited manually to analyse one format or the other. The sequence assembly
program CAP3 is integrated into AutoSNP to assemble FASTA files into contigs. The
transition (Ts) versus transversion (Tv) ratio across all the libraries for the genome of
interest is also calculated.
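The idea of finding candidate SNPs from EST redundancy within a contig can be sketched as follows. This is not AutoSNP's code or its exact algorithm: it is a simplified illustration that assumes the reads of one contig are already aligned to equal length (gaps written as `-`), ignores indels, and uses a hypothetical threshold `min_minor_count` so that a variant seen in only one read is treated as a likely sequencing error.

```python
from collections import Counter


def candidate_snps(aligned_reads, min_minor_count=2):
    """Report alignment columns where at least two bases are each
    supported by min_minor_count or more reads.

    aligned_reads   -- equal-length strings from one contig alignment
    min_minor_count -- hypothetical redundancy threshold (an assumption)
    Returns a list of (0-based column, {base: read count}) tuples.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    candidates = []
    for pos in range(length):
        # Count only unambiguous bases; gaps and N are skipped.
        counts = Counter(
            read[pos].upper()
            for read in aligned_reads
            if read[pos].upper() in "ACGT"
        )
        supported = {b: n for b, n in counts.items() if n >= min_minor_count}
        if len(supported) >= 2:
            candidates.append((pos, supported))
    return candidates


reads = ["ACGTAC", "ACGTAC", "ACTTAC", "ACTTAC", "ACGTAA"]
# Column 2 has G in three reads and T in two reads, so it is reported.
# Column 5 has a single A among four C, so it is treated as a likely error.
print(candidate_snps(reads))
```

Raising the threshold trades sensitivity for fewer false positives, which matters given the error rate of single-pass reads.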

The software computes its results using its own integrated algorithms, and the
output is easy to interpret. It reports SNPs per 100 bp or per
kb, the transversion (Tv) and transition (Ts) percentages and their ratio. It also reports the indel ratio in
the genome of the given species. The Ts/Tv ratio is important for understanding the pattern of
DNA evolution. It has been repeatedly noted that at low levels of genetic divergence Ts/Tv
appears to be high, and at high levels of genetic divergence Ts/Tv appears to be low. EST
analysis gives an insight into intra-specific molecular genetic variation within a species. The
ratio of transitions to transversions reflects the molecular evolution taking place and is important
for defining phylogenetic trees.
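The summary statistics described above are simple ratios. The sketch below shows one way they could be computed from counts; the function name, argument names and the example numbers are hypothetical and are not output from AutoSNP.

```python
def snp_summary(n_transitions, n_transversions, n_indels, assembled_bp):
    """Compute SNP density, Ts/Tv percentages and ratio, and indel density.

    assembled_bp is the total length of the assembled contigs in base pairs.
    """
    if assembled_bp <= 0:
        raise ValueError("assembled_bp must be positive")
    n_snps = n_transitions + n_transversions
    return {
        "snps_per_kb": 1000 * n_snps / assembled_bp,
        "indels_per_kb": 1000 * n_indels / assembled_bp,
        # Percentages are undefined when no substitutions were found.
        "transition_pct": 100 * n_transitions / n_snps if n_snps else None,
        "transversion_pct": 100 * n_transversions / n_snps if n_snps else None,
        # Ts/Tv is undefined when there are no transversions.
        "ts_tv_ratio": n_transitions / n_transversions if n_transversions else None,
    }


# Made-up counts for illustration only.
summary = snp_summary(n_transitions=60, n_transversions=40,
                      n_indels=10, assembled_bp=50_000)
for name, value in summary.items():
    print(f"{name}: {value}")
```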

EXAMPLE RESULTS for Ginger (Zingiber officinale):