EXPERIMENT – 06

MULTIPLE SEQUENCE ALIGNMENT AND PHYLOGENETIC ANALYSIS

EB3233: BIOINFORMATICS LABORATORY

ABSTRACT

Multiple sequence alignment is discussed in light of homology assessment in phylogenetic study. Pairwise and multiple alignment techniques are reviewed as exact and heuristic procedures. Because the purpose of alignment is to make the most successful statement of initial homology, it is important to favor methods that reduce non-homology. Furthermore, of all possible alignments, the one that satisfies the phylogenetic optimality criterion should be regarded as the best alignment. Consistency of optimality criteria is desirable, as all homology statements then become subject to the same testing and explanation. This consistency rests on treating alignment gaps as character information and on applying one cost function (e.g., for insertion-deletion, transversion, and transition) throughout the analysis, from alignment to phylogeny reconstruction. Multiple sequence alignment (MSA) is an extremely valuable tool for molecular and evolutionary biology, and many programs and algorithms are available for it. While the alignment accuracy of different MSA programs has been compared in previous research, their computing time and memory use have not been consistently evaluated. Given the unprecedented quantities of data generated by next-generation deep sequencing systems and the increasing need for large-scale data analysis, efficient use of software is imperative. Choosing the most effective MSA software therefore becomes a trade-off between alignment accuracy and computational cost. We compared the accuracy and cost of nine common MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the BAliBASE benchmark alignment dataset, and discussed the relevance of the implementation choices embedded in the algorithm of each program.

INTRODUCTION
Multiple sequence alignment (MSA) is generally the alignment of three or more biological sequences (protein or nucleic acid) of similar length. Homology and the evolutionary relationships between the sequences analyzed can be inferred from the output. Pairwise sequence alignment methods, on the other hand, are used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences.

Multiple sequence alignment is an important and commonly used computational technique for biological sequence analysis in molecular biology, computational biology, and bioinformatics. MSA is carried out on homologous sequences in order to perform phylogenetic reconstruction, protein secondary and tertiary structure analysis, and protein function prediction.

Multiple sequence alignments can also be used to build a phylogenetic tree. A phylogenetic tree, or evolutionary tree, is a branching diagram showing the inferred evolutionary relationships (the phylogeny) between different biological species or other entities, based on similarities and differences in their physical or genetic traits. The taxa joined together in the tree are implied to have descended from a common ancestor. Phylogenetic trees are central to the field of phylogenetics. In a rooted phylogenetic tree, each node with descendants represents the inferred most recent common ancestor of those descendants, and in some trees the edge lengths can be interpreted as time estimates. Each node is called a taxonomic unit. Internal nodes are generally called hypothetical taxonomic units, since they cannot be directly observed. Trees are useful in fields of biology such as bioinformatics, systematics, and phylogenetic comparative methods (Clustal W2 Introduction, 2020).

ClustalW is perhaps the most widely used multiple sequence alignment algorithm and is integrated, as a so-called black box, into a variety of commercially available bioinformatics packages such as DNASTAR, whereas the more recently developed Clustal Omega is reported to be the most accurate and scalable MSA algorithm currently available. Clustal Omega is the current MSA algorithm of the Clustal family. According to the cited source, this algorithm was used to align protein sequences only (though support for nucleotide sequences was expected to be introduced in time). The accuracy of Clustal Omega is similar to that of other high-quality aligners on small numbers of sequences; on large sequence sets, however, Clustal Omega outperforms other MSA algorithms in terms of completion time and overall alignment quality (Bioinformatics Tools for Multiple Sequence Alignment < EMBL-EBI, 2020).

Figure 1_Clustal Omega Algorithm

OBJECTIVES

• To identify conserved sequence regions using multiple sequence alignment with CLUSTAL OMEGA and to determine the evolutionary relationships between sequences from phylogenetic trees.

MATERIALS

1. Computer
2. Internet connection
3. CLUSTAL OMEGA program

METHODS

1. The file DHFR.txt was downloaded from the following link:
http://theory.bio.uu.nl/BPA/Data/DHFR.txt

• This file contains, in FASTA format, the protein sequences of dihydrofolate reductase from chicken, human, Pneumocystis (a fungus) and Pseudomonas (a bacterium).

Figure 2_The file DHFR.txt

2. Then using the following link, the CLUSTAL OMEGA server (at EBI) homepage
was accessed.
https://www.ebi.ac.uk/Tools/msa/clustalo/

Figure 3_The CLUSTAL OMEGA server (at EBI) homepage

3. After that, the contents of the DHFR.txt file were pasted into the query box.

Figure 4_The CLUSTAL OMEGA server homepage

4. The database was set at its default and the submit button was clicked.

Figure 5_The CLUSTAL OMEGA server homepage

5. After that, the results of the alignment job were obtained.

• The results page contained the sequence alignment and links to several other output files.

Figure 6_The sequence alignment

Figure 7_The colors of sequence alignment

Figure 8_Meaning of the colors on protein alignments

6. As the next step, ‘Phylogenetic Tree’ was clicked on the results page.

7. The phylogenetic tree results were displayed as shown below. CLUSTAL OMEGA
will show a Neighbor-joining tree without distance corrections.

Figure 9_The phylogenetic tree - Cladogram

Figure 10_The phylogenetic tree - Real
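
The web-server steps above could also be reproduced with a local Clustal Omega installation. The following is a minimal sketch, assuming the `clustalo` executable is installed and available on the PATH; the output file names are hypothetical and are not part of the procedure that was actually followed.

```python
import subprocess


def build_clustalo_command(fasta_in, aln_out, tree_out):
    """Build a Clustal Omega command line (assumes a local `clustalo` binary)."""
    return [
        "clustalo",
        "-i", fasta_in,                  # input sequences in FASTA format
        "-o", aln_out,                   # output alignment file
        "--outfmt=clu",                  # Clustal-style text alignment
        f"--guidetree-out={tree_out}",   # guide tree written in Newick format
        "--force",                       # overwrite existing output files
    ]


def run_clustalo(fasta_in, aln_out, tree_out):
    """Run the alignment; raises CalledProcessError if clustalo fails."""
    subprocess.run(build_clustalo_command(fasta_in, aln_out, tree_out), check=True)


# Hypothetical output names, shown only to illustrate the call.
command = build_clustalo_command("DHFR.txt", "DHFR.aln", "DHFR.dnd")
print(" ".join(command))
```

The guide tree written by `--guidetree-out` is the tree used to order the progressive alignment, so it should not be assumed to be identical to the neighbor-joining tree displayed on the EBI results page.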

DISCUSSION

Multiple sequence alignment is an important and commonly used computational technique for biological sequence analysis in molecular biology, computational biology, and bioinformatics. MSA is carried out on homologous sequences in order to perform phylogenetic reconstruction, protein secondary and tertiary structure analysis, and protein function prediction. In this experiment, we identified conserved sequence regions using multiple sequence alignment with CLUSTAL OMEGA and determined the evolutionary relationships between the sequences from phylogenetic trees (Clustal W2 Introduction, 2020).

One of the common tasks in bioinformatics is the simultaneous alignment of several DNA or protein sequences. This is useful for:

• Phylogenetic analysis, such as inferring a tree or estimating rates of substitution.
• Determination of a consensus sequence, for example in assembly.
• Protein structure prediction.
• Demonstration of homology in multigene families.
• Identification of homology between a newly sequenced gene and an existing family of genes.

Figure 6
In this capture, the chicken and human sequences clearly align well with each other, and gaps are introduced mainly to match them with the longer Pneumocystis sequence. The vertebrate sequences align most closely, but there are also distinct 'islands' where all sequences align, indicating the existence of conserved motifs and showing that the alignment is biologically meaningful.

* (Asterisk) - Positions which have a single, fully conserved residue.

: (Colon) - Conservation between groups of strongly similar properties, scoring > 0.5 in the Gonnet PAM 250 matrix.

. (Period) - Conservation between groups of weakly similar properties, scoring ≤ 0.5 in the Gonnet PAM 250 matrix.

Figure 7 and 8

The colors of the alignment are presented in Figure 7. CLUSTAL OMEGA colors the residues according to their physicochemical properties. The meaning of the colors on protein alignments is shown in Figure 8.

Figure 9 and 10
Both figures show the phylogenetic tree. Figure 9 shows it as a cladogram, and Figure 10 shows it with real branch lengths.
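
The difference between the two views can be illustrated on a tree written in Newick format: the view with real branch lengths uses the numbers after each colon, while a cladogram keeps only the branching order. The following is a minimal sketch; the tree string and its branch lengths are hypothetical and do not come from the results of this experiment.

```python
import re


def strip_branch_lengths(newick):
    """Remove ':<number>' branch lengths from a Newick string, keeping topology."""
    return re.sub(r":[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", "", newick)


# Hypothetical tree with made-up branch lengths.
tree_with_lengths = "((A:0.12,B:0.11):0.20,(C:0.35,D:0.40):0.05);"
print(strip_branch_lengths(tree_with_lengths))
```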

Multiple alignments of protein sequences are essential tools for studying sequences. The basic information they provide is the identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting protein function and structure, and in identifying new members of protein families. Sequences may be aligned over their whole length (global alignment) or only in certain regions (local alignment). This holds for both pairwise and multiple alignments. The Clustal software generates a phylogenetic tree that gives an indication of the degree of similarity between the aligned sequences (Bioinformatics Tools for Multiple Sequence Alignment < EMBL-EBI, 2020). A phylogeny, or evolutionary tree, describes the evolutionary relationships between a set of species or groups of organisms, called taxa. The branching diagram of the phylogenetic tree illustrates the evolutionary relationships between the various species. A cladogram is a repeatedly branching, dichotomous phylogenetic tree that shows the order in which molecules or species diverged as the evolutionary branches arose. In phylogenetic trees, evolutionary distance is the length along the tree that separates organisms; the greater this distance, the more distant the evolutionary relationship (Multiple sequence alignment, 2020).
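
The Methods mention a neighbor-joining tree "without distance corrections". One common uncorrected measure is the p-distance, the proportion of aligned positions at which two sequences differ. The following is a minimal sketch of that measure, assuming columns containing a gap in either sequence are skipped; it is not claimed to be the exact calculation used by the server, and the toy sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue                    # skip gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping sequence name to aligned sequence."""
    names = list(aligned)
    return {(x, y): p_distance(aligned[x], aligned[y]) for x in names for y in names}


# Hypothetical toy alignment, for illustration only.
toy = {"s1": "MKVLA-", "s2": "MKILAG", "s3": "MRI-AG"}
for pair, dist in distance_matrix(toy).items():
    print(pair, round(dist, 2))
```

A distance matrix of this kind is the usual input to a neighbor-joining procedure, which repeatedly joins the pair of taxa that minimizes the total branch length of the tree.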

REFERENCES

1. Abacus.bates.edu. 2020. Clustal W2 Introduction. [online] Available at: <https://abacus.bates.edu/bioinformatics1/clustalw.html> [Accessed 29 November 2020].

2. C. Kemena and C. Notredame, “Upcoming challenges for multiple sequence alignment methods in the high-throughput era,” Bioinformatics, vol. 25, no. 19, pp. 2455–2465, 2009.

3. Ebi.ac.uk. 2020. Bioinformatics Tools For Multiple Sequence Alignment < EMBL-EBI. [online] Available at: <https://www.ebi.ac.uk/Tools/msa/> [Accessed 27 November 2020].

4. En.wikipedia.org. 2020. Multiple Sequence Alignment. [online] Available at: <https://en.wikipedia.org/wiki/Multiple_sequence_alignment> [Accessed 26 November 2020].

5. People.brunel.ac.uk. 2020. [online] Available at:
<http://people.brunel.ac.uk/~csstdrg/courses/glasgow_courses/website_bioinformatics
HM/slides/phylo.pdf> [Accessed 29 November 2020].

6. Phillips A, Janies D, Wheeler W. Multiple sequence alignment in phylogenetic analysis. Mol Phylogenet Evol. 2000 Sep;16(3):317-30. doi: 10.1006/mpev.2000.0785. PMID: 10991785.

POST-LAB QUESTIONS

ACC oxidase catalyzes the last step in the biosynthesis of the plant hormone ethylene. This simple gas molecule is involved at several stages of plant growth (including germination and senescence) and controls fruit ripening.
Search for ACC oxidase protein sequences in 5 different plants and get the FASTA files. Do a multiple sequence alignment using CLUSTAL OMEGA and generate a tree from the multiple sequence alignment. The accession numbers (AC) of those sequences are:

Organism            Accession number
Zea mays            AAR25565.1
Musa acuminata      ABV21759.1
Hordeum vulgare     AFO63022.1
Trifolium repens    ABB97396.1
Carica papaya       AAC98808.1

Show the FASTA sequences from 5 different plants, multiple sequence alignment results and
phylogenetic tree in the lab report.
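
The FASTA records could also be retrieved programmatically rather than by manual search. The following is a minimal sketch, assuming the accession numbers above are NCBI protein records and that the NCBI E-utilities `efetch` endpoint is used; the report does not state which route was taken, so this is only one possible approach. The network call is wrapped in a function and is not executed here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (assumed retrieval route, not stated in the report).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

ACCESSIONS = {
    "Zea mays": "AAR25565.1",
    "Musa acuminata": "ABV21759.1",
    "Hordeum vulgare": "AFO63022.1",
    "Trifolium repens": "ABB97396.1",
    "Carica papaya": "AAC98808.1",
}


def efetch_url(accession):
    """Build the efetch URL that requests one protein record as FASTA text."""
    params = {"db": "protein", "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accession):
    """Download one FASTA record (requires network access)."""
    with urlopen(efetch_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


for organism, accession in ACCESSIONS.items():
    print(organism, efetch_url(accession))
```

Concatenating the five returned records gives a single multi-FASTA input that can be pasted into the CLUSTAL OMEGA query box.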

STEP 01 – FASTA SEQUENCES

Figure 11_FASTA sequence of Zea mays

Figure 12_FASTA sequence of Musa acuminata

Figure 13_FASTA sequence of Hordeum vulgare

Figure 14_FASTA sequence of Trifolium repens

Figure 15_FASTA sequence of Carica papaya
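
Each FASTA record shown in the figures above consists of a header line starting with ">" followed by one or more lines of sequence. The following is a minimal sketch of how such text can be read into a dictionary; the example records are hypothetical and are not the ACC oxidase sequences.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                    # ignore blank lines
        if line.startswith(">"):
            header = line[1:]           # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical records, for illustration only.
example = ">seq1 example protein\nMKVLA\nGGS\n\n>seq2\nMKILA\n"
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```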

STEP 02

Figure 16_The CLUSTAL OMEGA server homepage

MULTIPLE SEQUENCE ALIGNMENT RESULTS

Figure 17_Multiple sequence alignment results

PHYLOGENETIC TREE

Figure 18_The phylogenetic tree - Cladogram

Figure 19_The phylogenetic tree - Real

Answer the following:


1. Can you identify conserved amino acids in multiple sequence alignment results?
Write down all the non-redundant conserved amino acids that you can identify.

 Pro P Proline
 Leu L Leucine
 Arg R Arginine
 Met M Methionine
 Cys C Cysteine
 Trp W Tryptophan
 Gly G Glycine
 Phe F Phenylalanine
 His H Histidine
 Val V Valine
 Tyr Y Tyrosine
 Asp D Aspartic acid
 Glu E Glutamic acid
 Asn N Asparagine
 Gln Q Glutamine

2. Why do you think these amino acids are being conserved?

• These amino acids are conserved because they are needed to maintain the structure or function of the protein or domain. Conserved proteins undergo fewer amino acid substitutions, or are more likely to substitute amino acids with similar biochemical properties.
3. a) In your MSA results, identify and mark the following residues that have a critical role in ACC oxidase function:

• Phe(P)-103
• Glycine(G)-158
• Arg(R)-177
Note: Please refer to the numbering of the first sequence in the MSA for the positions of the above amino acids.

Phe (P)-103

Glycine (G)-158

Arg (R)-177
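
Because the positions are counted along the first sequence only, gaps in that sequence have to be skipped when locating a residue in the alignment. The following is a minimal sketch of that mapping; the toy alignment is hypothetical and much shorter than the ACC oxidase alignment.

```python
def column_of_residue(aligned_seq, residue_number):
    """Return the 0-based alignment column of a 1-based residue number, skipping gaps."""
    seen = 0
    for column, symbol in enumerate(aligned_seq):
        if symbol != "-":
            seen += 1
            if seen == residue_number:
                return column
    raise ValueError("residue number is outside the sequence")


def column_residues(aligned_sequences, column):
    """Collect the residue (or gap) of every sequence at one alignment column."""
    return "".join(seq[column] for seq in aligned_sequences)


# Hypothetical toy alignment; the first sequence is the reference for numbering.
toy_alignment = ["MK--VLA", "MKGSVIA", "M-GSVLA"]
col = column_of_residue(toy_alignment[0], 3)
print(col, column_residues(toy_alignment, col))
```

If every sequence shows the same residue in the returned column, that position would carry an asterisk in the CLUSTAL OMEGA output.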

b) Based on your observation from question 3(a), does this software produce the best multiple sequence alignment? Justify your answer.

• The above residues have a critical role in ACC oxidase function. Proline (P) and Arginine (R) are marked with an asterisk, which indicates positions with a single, fully conserved residue. The alignment therefore places these functionally important residues in conserved columns. However, Glycine is not a fully conserved residue. Since two of the three critical residues are fully conserved, we can say that the software produces the best multiple sequence alignment.

4. From your observation of the phylogenetic tree, describe the relatedness of this protein between the 5 different plants. Use the following terms in your description:
• Clade
• Out group
• Sister group

Evolutionary trees depict clades. A clade is a group of organisms that includes an ancestor and all descendants of that ancestor; it can be thought of as a branch on the tree of life. Musa acuminata and Hordeum vulgare are sister groups, as are Trifolium repens and Carica papaya, meaning that each pair shares the closest evolutionary relationship because its members stem from the same most recent common ancestor. Zea mays is the out group in this tree. Taxa outside the common ancestor of the other taxa are referred to as out groups, as they are more evolutionarily distant than sister taxa are to one another because they share only a more distant common ancestor. With each successive speciation event, a new clade is formed within the tree, allowing scientists to identify common ancestors between evolutionarily distant taxa.
