Upstream

I. Aim: ex-2
• To retrieve nucleic acid and protein sequences from GenBank and UniProt, respectively, and to study their entry formats.
• To perform local and global sequence alignment between two nucleotide or amino acid sequences using dynamic programming and heuristic methods.
II. Theory:
A. Sequence Databases: Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. These databases are important tools in assisting scientists to analyse and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge helps in the fight against diseases, assists in the development of medications, helps in predicting certain genetic diseases and in discovering basic relationships among species in the history of life. The important sequence databases available for nucleic acids are GenBank, EMBL-EBI, and DDBJ. UniProt and PIR are the important databases for protein sequences.
GenBank: The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. The National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), National Institutes of Health (NIH) is responsible for producing and distributing the GenBank Sequence Database as part of the International Nucleotide Sequence Database Collaboration (INSDC).
UniProt: The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).
B. Sequence Alignment: The most commonly asked question in molecular biology is whether two given sequences are related or not, in order to identify their structure or function. The simplest way to answer this question is to compare their sequences. A sequence is a collection of nucleotides or amino acid residues which are connected with each other. Biologically speaking, a typical DNA/RNA sequence consists of nucleotides while a protein sequence consists of amino acids. Sequencing is the process of determining the nucleotide or amino acid sequence of a DNA fragment or a protein. There are different experimental methods for sequencing, and the obtained sequence is submitted to databases like GenBank. Sequence alignment, or sequence comparison, lies at the heart of bioinformatics. It describes a way of arranging DNA/RNA or protein sequences to identify the regions of similarity among them. It is used to infer structural, functional and evolutionary relationships between the sequences. Alignment finds the level of similarity between a query sequence and different database sequences. When a new sequence is found, its structure and function can often be predicted by sequence alignment, since sequences sharing a common ancestor are believed to exhibit similar structure or function. The greater the sequence similarity, the greater the chance that the sequences share similar structure or function. There are mainly two methods of sequence alignment:
Global Alignment: Closely related sequences of about the same length are well suited to global alignment. Here, the alignment is carried out from the beginning to the end of the sequences to find the best possible alignment.
Local Alignment: Sequences which are suspected to have similarity, or even dissimilar sequences, can be compared with the local alignment method. It finds the local regions with a high level of similarity.
Dynamic programming: It is a method that determines the optimal alignment by matching two sequences for all possible pairs of characters between the two sequences. It is fundamentally similar to the dot matrix method in that it also creates a two-dimensional alignment grid. However, it finds the alignment in a more quantitative way by converting a dot matrix into a scoring matrix to account for matches and mismatches between sequences. By searching for the set of highest scores in this matrix, the best alignment can be accurately obtained.
Heuristic Algorithm: Searching a large database using dynamic programming methods, such as the Smith–Waterman algorithm, although accurate and reliable, is too slow and impractical when computational resources are limited. Heuristic algorithms perform faster searches because they examine only a fraction of the possible alignments examined in regular dynamic programming. Currently, there are two major heuristic algorithms for performing database searches: BLAST and FASTA.
IV. Procedure:
A. Sequence Retrieval and Entry Formats:
1. Log on to the GenBank and UniProt databases using the links given above.
2. Type the given accession number in the search box and retrieve the sequence information.
3. Navigate through the format and obtain the results as a print screen.
B. Pairwise Sequence Alignment with EMBOSS:
1. Go to the ‘Pairwise sequence alignment’ tools page by typing http://www.ebi.ac.uk/Tools/psa in your browser.
2. Under Global Alignment/Local Alignment choose the appropriate program (protein or nucleotide).
3. Input your first and second sequence by pasting them in the text boxes provided in Step 1. The input sequence can be in GCG, FASTA, EMBL, GenBank, PIR, NBRF, Phylip or UniProtKB/Swiss-Prot format. Note: Avoid using data directly from word processors as it may yield unpredictable results due to the presence of hidden/control characters. Alternatively, files containing valid sequences in any format (GCG, FASTA, EMBL, GenBank, PIR, NBRF, Phylip or UniProtKB/Swiss-Prot) can be uploaded using the ‘Browse’ options (provided below the text box) as input. Note: Avoid using word processor files. Preferably generate a file with the sequence in FASTA format and save the file with the extension .fas or .fasta.
4. Set the alignment options in Step 2. Generally, use the default parameters, or change them and observe how this affects the result.
C. BLAST:
1. Prepare the query nucleotide or protein sequences in raw format or in FASTA format.
2. Go to the NCBI home page available at http://www.ncbi.nlm.nih.gov/ and choose the BLAST program.
3. Choose the ‘nucleotide BLAST’ or ‘protein BLAST’ program.
4. Select ‘Align two or more sequences’.
5. Copy and paste the sequences in the sequence box or alternatively browse and upload the sequence in FASTA format from your computer.
6. Adjust any parameter, if required. Any parameter changed is highlighted in yellow.
7. Click the ‘BLAST’ button at the end of the page.
D. FASTA:
1. Go to the FASTA home page available at fasta.bioch.virginia.edu/ and choose the Protein-protein FASTA or Nucleotide-Nucleotide FASTA program.
2. Select ‘Align two sequences’.
3. Copy and paste the sequences in the sequence box or alternatively browse and upload the sequence in FASTA format from your computer.
4. Adjust any parameter, if required, and click compare sequences.

I. Aim: ex-4
To perform the multiple sequence alignment of given sequences using various alignment algorithms and study their efficiency.
II. Theory:
A natural extension of pairwise alignment is multiple sequence alignment, which is to align multiple related sequences to achieve optimal matching of the sequences. Related sequences are identified through database similarity searching with tools like BLAST or FASTA. As the process generates multiple matching sequence pairs, it is often necessary to convert the numerous pairwise alignments into a single alignment, which arranges sequences in such a way that evolutionarily equivalent positions across all sequences are matched. Multiple sequence alignment has a unique advantage because it reveals more biological information than many pairwise alignments can. For example, it allows the identification of conserved sequence patterns and motifs in the whole sequence family, which are not obvious when comparing only two sequences. Many conserved and functionally critical amino acid residues can be identified in protein multiple alignments. Multiple sequence alignment is also an essential prerequisite to carrying out phylogenetic analysis of sequence families and prediction of protein secondary and tertiary structures. Multiple sequence alignment also has applications in designing degenerate polymerase chain reaction (PCR) primers based on multiple related sequences.
IV. Procedure:
1. Go to the multiple sequence alignment page by typing http://www.ebi.ac.uk/Tools/msa/ in the browser.
2. Input sequences by pasting them in the text boxes provided in Step 1. Alternatively, a file containing three or more valid sequences in any format (GCG, FASTA, EMBL, GenBank, PIR, NBRF or UniProtKB/Swiss-Prot) can be uploaded using the ‘Choose File’ option as input.
3. Click on the ‘more options’ button to set the alignment options in Step 2. Generally work with the default parameters, or change the parameters if required.
4. Click the ‘Submit’ button to get the results interactively in Step 3. To receive the results as a link via e-mail, check the box before ‘Be notified by email’.
V. Results and Discussion:
Amyloid beta A4 proteins, Tissue-type plasminogen activator and Globin proteins were aligned using four multiple sequence alignment tools: tcoffee, muscle, clustalw and clustal omega. The figure shows the alignment in color schemes based on their physicochemical properties.

I. Aim: ex-5
To analyse the evolutionary relationship of given sequences and construct the phylogenetic tree.
II. Theory:
Biological sequence analysis is founded on solid evolutionary principles. Similarities and divergence among related biological sequences revealed by sequence alignment often have to be rationalized and visualized in the context of phylogenetic trees. Evolution can be defined as the development of a biological form from other preexisting forms, or from its origin to the currently existing form, through natural selection and modification. The underlying mechanism of evolution is genetic mutations that occur spontaneously. Genetic diversity thus provides the raw material for natural selection to act on. Phylogenetics is the study of the evolutionary history of living organisms using tree-like diagrams to represent the pedigrees of these organisms. The tree branching patterns representing the evolutionary divergence are referred to as phylogeny. Molecular data in the form of DNA or protein sequences can also provide very useful evolutionary perspectives on existing organisms because, as organisms evolve, the genetic materials accumulate mutations over time, causing phenotypic changes. Because genes are the medium for recording the accumulated mutations, they can serve as molecular fossils.
The bifurcating point at the very bottom of the tree is the root node, which represents the common ancestor of all members of the tree. The topology of branches in a tree defines the relationships between the taxa. Trees can be drawn in different ways, such as a cladogram or a phylogram. In a phylogram, the branch lengths represent the amount of evolutionary divergence. Such trees are said to be scaled. Scaled trees have the advantage of showing both the evolutionary relationships and information about the relative divergence time of the branches. In a cladogram, however, the external taxa line up neatly in a row or column. Their branch lengths are not proportional to the number of evolutionary changes and thus have no phylogenetic meaning. In such an unscaled tree, only the topology of the tree matters, which shows the relative ordering of the taxa.
Molecular phylogenetic tree construction can be divided into five steps:
1. Choosing molecular markers
2. Performing multiple sequence alignment
3. Choosing a model of evolution
4. Determining a tree building method
5. Assessing tree reliability
Phylogenetic tree reconstruction is not a trivial task. Although there are numerous phylogenetic programs available, knowing the theoretical background, capabilities, and limitations of each is very important.
IV. Procedure:
1. Collection of protein sequences:
(a) A group of myoglobin orthologs: P02144 (Homo sapiens), P04247 (Mus musculus), P02192 (Bos taurus), Q6VN46 (Danio rerio), P68082 (Equus caballus), P02190 (Ovis aries), P02197 (Gallus gallus), P02189 (Sus scrofa), P02186 (Elephas maximus), P02147 (Gorilla gorilla beringei).
(b) Globin paralogs (all human): P68871 (Hemoglobin subunit beta), P69905 (Hemoglobin subunit alpha), P69892 (Hemoglobin subunit gamma-2), P69891 (Hemoglobin subunit gamma-1), P02008 (Hemoglobin subunit zeta), P02100 (Hemoglobin subunit epsilon), Q9NPG2 (Neuroglobin), Q8WWM9 (Cytoglobin), P02144 (Myoglobin).
2. Construction of phylogenetic tree:
1. Log on to www.phylogeny.fr or CLUSTAL Phylogeny.
2. Input the given sequences and select the appropriate parameters.
3. Run the program and analyze the results.
V. Results and Discussions:
A phylogeny is a tree containing nodes that are connected by branches. Each branch represents the persistence of a genetic lineage through time, and each node represents the birth of a new lineage. If the tree represents the relationship among a group of species, then the nodes represent speciation events. Phylogenetic trees are not directly observed and are instead inferred from sequence or other data. Phylogeny reconstruction methods are either distance-based or character-based. In distance matrix methods, the distance between every pair of sequences is calculated, and the resulting distance matrix is used for tree reconstruction. For instance, neighbour joining applies a cluster algorithm to the distance matrix to arrive at a fully resolved phylogeny. Character-based methods include maximum parsimony, maximum likelihood and Bayesian inference methods. These approaches simultaneously compare all sequences in the alignment, considering one character (a site in the alignment) at a time to calculate a score for each tree. In the present experiment, a set of myoglobin orthologs and a set of globin paralogs were used to study speciation and gene duplication, respectively.
1. Phylogeny.fr
The molecular phylogenetic analysis was performed on the Phylogeny.fr platform and comprised four steps. Combinations of tools were used in Phylogeny.fr to study the effect of the programs on the phylogenetic tree.
Step 1: Multiple sequence alignment of the given sequences was performed using the MUSCLE, ProbCons, T-Coffee and ClustalW programs. The MUSCLE program is the fastest and provides good alignment accuracy. ProbCons and T-Coffee align sequences with high accuracy but require lengthy computational time. ClustalW is less accurate than modern programs.
Step 2: Alignment curation was performed with Gblocks after multiple sequence alignment to remove ambiguous regions (i.e. regions containing gaps and/or poorly aligned).
Step 3: The phylogenetic tree was reconstructed using the PhyML (maximum likelihood method), TNT (Tree analysis using New Technology, maximum parsimony method) and BioNJ (improved version of the Neighbour Joining method) programs.
Step 4: Tree visualisation: Graphical representation and editing of the phylogenetic tree were performed with TreeDyn, Drawgram and Drawtree.
Phylogenetic trees generated for the myoglobin orthologs and globin paralogs using various combinations of tools are given in Tables 5.1 and 5.2. The results indicate that phylogenetic tree construction depends heavily on the multiple sequence alignment. Hence, one should take utmost care in performing MSA.
2. Simple Phylogeny: Another well-known program, Simple Phylogeny, was used to construct the phylogenetic tree. This tool provides access to phylogenetic tree generation methods from the ClustalW2 package. One can perform multiple sequence alignment using programs like MUSCLE, MAFFT, T-Coffee, Clustal Omega, Kalign, ProbCons etc. and submit the alignment to the Simple Phylogeny tool to construct the phylogenetic tree. Simple Phylogeny uses clustering-based methods such as Neighbour Joining (NJ) or Unweighted Pair Group Method with Arithmetic Mean (UPGMA) for tree construction. The output includes a cladogram/phylogram and the Newick format of the tree. A phylogram and a cladogram are branching diagrams (trees) that are assumed to be an estimate of a phylogeny. In a phylogram, the length of a branch is proportional to the amount of inferred evolutionary change, whereas a cladogram does not show the amount of evolutionary “time” separating taxa: its branch lengths are equal and it shows only common ancestry. The results shown in Tables 5.3 and 5.4 explain the evolution of the myoglobin orthologs and globin paralogs.
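The distance-based clustering mentioned above (UPGMA with Newick output) can be illustrated with a minimal sketch. This is not the ClustalW2 implementation behind Simple Phylogeny; it assumes the sequences are already aligned, uses a simple p-distance (fraction of differing sites), and the function names `p_distance` and `upgma` as well as the toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, dist):
    """Cluster taxa with UPGMA and return a Newick string with branch lengths.

    names: list of taxon labels
    dist:  dict mapping frozenset({name_i, name_j}) -> pairwise distance
    """
    # cluster key -> (Newick text, number of leaves, height of the node)
    clusters = {n: (n, 1, 0.0) for n in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        nwk_a, size_a, h_a = clusters.pop(a)
        nwk_b, size_b, h_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2.0
        merged = f"({a},{b})"
        newick = f"({nwk_a}:{height - h_a:.3f},{nwk_b}:{height - h_b:.3f})"
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean of the two old distances (the "arithmetic mean").
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = (newick, size_a + size_b, height)
    (newick, _, _), = clusters.values()
    return newick + ";"


# Toy aligned sequences, invented for illustration only.
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]])
        for p in combinations(seqs, 2)}
print(upgma(list(seqs), dist))
```

Because UPGMA places every leaf at the same distance from the root, it implicitly assumes a constant rate of change; Neighbour Joining drops that assumption, which is one reason the two methods can give different trees for the same alignment.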
I. Aim: ex-7
To identify functional sites in a given DNA sequence using
online gene prediction tools.
II. Theory:
With the rapid accumulation of genomic sequence information,
there is a pressing need to use computational approaches to
accurately predict gene structure. Computational gene
prediction is a prerequisite for detailed functional annotation of
genes and genomes. The process includes detection of the
location of open reading frames (ORFs) and delineation of the
structures of introns as well as exons if the genes of interest are
of eukaryotic origin. The ultimate goal is to describe all the
genes computationally with near 100% accuracy. The ability to
accurately predict genes can significantly reduce the amount of
experimental verification work required.
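The ORF detection step mentioned above can be sketched in its simplest form: scan all six reading frames for a start codon followed in frame by a stop codon. This is only an illustration under stated assumptions (standard stop codons, ATG as the only start, no introns), so it corresponds to the prokaryotic case rather than to what the eukaryotic gene finders used in this exercise do. The function name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30):
    """List ORFs as (strand, frame, start, end) tuples.

    Coordinates are 0-based, end-exclusive, and refer to the strand that
    was scanned (for '-' they index into the reverse complement).
    The length check counts the stop codon as one codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


# Short made-up sequence; a low threshold is used so the ORF is reported.
print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

For eukaryotic genes the coding region is interrupted by introns, so a plain ORF scan like this one is not sufficient, which is why the tools below also model splice sites.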
The current gene prediction methods can be
classified into two major categories, ab initio-based
and homology-based approaches. The ab initio-based
approach predicts genes based on the given sequence
alone. The homology-based method makes predictions based
on significant matches of the query sequence with sequences
of known genes. The output of gene prediction is a structure
assigned to a raw DNA sequence, as given below:
IV. Procedure:
a. Gene Prediction:
1. Log on to the respective gene prediction tools using the URLs given.
2. Input the given nucleotide sequence in FASTA format and
set the parameters.
3. Submit the data and tabulate the results.
b. Retrieving correct gene information from EBI:
1. Log on to www.ebi.ac.uk
2. Search for AY589041.1 and obtain the text format of the
sequence.
3. Find the mRNA information in the FT lines.
c. Prediction of promoter region:
1. Log on to http://www.fruitfly.org/
2. In the analysis tools, select Promoter Prediction, a neural
network based program to find possible transcription
promoters.
3. Select Eukaryote as the type of organism, paste the
sequence in the given box and submit.
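Step b.3 above (finding the mRNA information in the FT lines) can also be done programmatically once the text format has been saved. The sketch below assumes the usual EMBL flat-file layout, in which the feature key starts in column 6 and the location in column 22 of each FT line. The helper names `ft` and `mrna_segments` are hypothetical, and the sample entry is invented for illustration; its coordinates are not those of AY589041.1.

```python
import re


def mrna_segments(embl_text: str):
    """Return (start, end) pairs of the first mRNA feature in an EMBL entry.

    Handles simple ranges and join(...) locations, including partial
    markers '<' and '>'; single-base locations are not handled.
    """
    location = []
    in_mrna = False
    for line in embl_text.splitlines():
        if not line.startswith("FT"):
            continue
        key, value = line[5:20].strip(), line[21:].strip()
        if key:                       # a non-empty key starts a new feature
            if in_mrna:
                break                 # the mRNA feature has ended
            in_mrna = key == "mRNA"
        if in_mrna:
            if value.startswith("/"):
                break                 # qualifiers follow the location
            location.append(value)    # location may continue over lines
    joined = "".join(location)
    return [(int(a), int(b))
            for a, b in re.findall(r"(\d+)\.\.[<>]?(\d+)", joined)]


def ft(key: str, value: str) -> str:
    """Build one FT line with the key in column 6 and the value in column 22."""
    return f"FT   {key:<16}{value}"


# Made-up feature table, for illustration only.
sample = "\n".join([
    ft("source", "1..900"),
    ft("", '/mol_type="genomic DNA"'),
    ft("mRNA", "join(<1..120,300..450,"),
    ft("", "700..>900)"),
    ft("", '/gene="example"'),
    ft("CDS", "join(<1..120,300..450,700..>900)"),
])
print(mrna_segments(sample))
```

The exon coordinates obtained this way are the reference against which the exons predicted by each tool are tabulated.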

V. Results and Discussions:
The GENSCAN, Geneid, HMMgene and FGENESH gene prediction
tools were used to predict functional sites in the Rhesus macaque
(Macaca mulatta) breast cancer type 1 (BRCA1) complete
coding sequence (CDS), and the predicted results were compared
with experimental gene data available at NCBI GenBank/EBI-ENA
to select the best tool for eukaryotic gene prediction.
GENSCAN is a program used to identify
complete gene structures in genomic DNA. It is a GHMM-based
program that can be used to predict the location of genes
and their exon-intron boundaries in genomic sequences from a
variety of organisms. The GENSCAN Web server is hosted
at the Massachusetts Institute of Technology. Geneid is a program
to predict genes in anonymous genomic sequences, designed
with a hierarchical structure. In the first step, splice sites, start
and stop codons are predicted and scored along the sequence
using Position Weight Arrays (PWAs). In the second step,
exons are built from the sites. Exons are scored as the sum of
the scores of the defining sites, plus the log-likelihood ratio
of a Markov Model for coding DNA. Finally, from the set of
predicted exons, the gene structure is assembled, maximizing
the sum of the scores of the assembled exons. FGENESH is
described by its developers as the fastest (50-100 times faster
than GENSCAN) and most accurate gene finder available.
HMMgene is a program for
prediction of genes in anonymous DNA. The program predicts
whole genes, so the predicted exons always splice correctly. It
can predict several whole or partial genes in one sequence, so
it can be used on whole cosmids or even longer sequences.
HMMgene can also be used to predict splice sites and
start/stop codons.
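The final assembly step described for Geneid (choosing, from the predicted exons, the set whose scores sum to a maximum) can be illustrated as a weighted interval scheduling problem solved by dynamic programming. This is a simplified sketch, not Geneid's actual algorithm: it only requires that the chosen exons do not overlap, whereas the real program also enforces reading-frame and gene-model constraints. The function name `best_exon_chain` and the example scores are hypothetical.

```python
from bisect import bisect_left


def best_exon_chain(exons):
    """Pick non-overlapping exons with the maximum total score.

    exons: iterable of (start, end, score) with 1-based inclusive coordinates.
    Returns (total_score, chosen_exons) with exons in genomic order.
    """
    exons = sorted(exons, key=lambda e: e[1])      # order by end coordinate
    ends = [e[1] for e in exons]
    # best[k] is the optimum using only the first k exons of the sorted list.
    best = [(0.0, [])]
    for k, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_left(ends, start, 0, k)
        take_total = best[j][0] + score
        if take_total > best[k][0]:
            best.append((take_total, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[k])                   # skipping this exon is better
    return best[-1]


# Candidate exons with made-up coordinates and scores.
candidates = [(1, 100, 5.0), (50, 150, 8.0), (160, 250, 4.0), (120, 300, 6.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)
```

Sorting by end coordinate lets each exon look up its best compatible predecessor with one binary search, so the chain is found in O(n log n) time for n candidate exons.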
