Difficult due to insertions or deletions (indels).
Homology: Definition
Homology: similarity that is the result of inheritance from a common ancestor. Identification and analysis of homologies is central to phylogenetic systematics. An alignment is a hypothesis of positional homology between bases or amino acids.
<---------------(--------------------HELIX 19---------------------)
<---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            **        ***        * ** ** *                 **
Dynamic programming
Two methods: dynamic programming and progressive alignment.
Consider two protein sequences, each 100 amino acids long. If it takes 100^2 seconds to align these two sequences exhaustively, then it will take 100^3 seconds to align three sequences, 100^4 seconds to align four sequences, and so on. Aligning 20 sequences exhaustively would take more time than the universe has existed.
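The growth described above can be checked with a few lines of arithmetic. This is a minimal sketch that takes the slide's cost model at face value (100^n seconds for n sequences of length 100); the function name and the approximate age of the universe used for comparison are assumptions added for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate figure, used only for comparison


def exhaustive_seconds(n_sequences, length=100):
    """Cost model from the slide: length ** n_sequences seconds."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    years = exhaustive_seconds(n) / SECONDS_PER_YEAR
    print(f"{n:>2} sequences: {years:.3g} years")
```

Under this model, 20 sequences need 100^20 = 10^40 seconds, which is many orders of magnitude longer than the age of the universe.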
Progressive Alignment
Devised by Feng and Doolittle in 1987.
Essentially a heuristic method, and as such not guaranteed to find the optimal alignment.
Requires (n-1) + (n-2) + ... + 1 = n(n-1)/2 pairwise alignments as a starting point.
The most successful implementation is Clustal (Des Higgins).
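The number of starting pairwise alignments can be illustrated by enumerating the pairs directly. This is a small sketch; the function name is hypothetical and the five species names are borrowed from the distance-matrix example elsewhere in these slides.

```python
from itertools import combinations


def pairwise_count(n):
    """Number of pairwise alignments needed for n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
pairs = list(combinations(names, 2))  # every unordered pair, once
print(len(pairs), pairwise_count(len(names)))
```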
CLUSTAL W
[Figure: CLUSTAL W example showing pairwise distances (.60 .59 .77 .13 .75 .75), a guide tree, and an alignment with alpha-helices marked]
1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
Possible alignment
Sequences: GATTC and GAATTC
Scoring scheme: match +1, mismatch 0, indel -1
Column scores along this path: 1 1 0 1 0 -1
Score for this path = 2
Optimal Alignment 1
Column scores along this path: 1 1 -1 1 1 1
Alignment using this path:
GA-TTC
GAATTC
Alignment score: 4
Optimal Alignment 2
Column scores along this path: 1 -1 1 1 1 1
Alignment using this path:
G-ATTC
GAATTC
Alignment score: 4
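The scores above can be reproduced with a short dynamic-programming sketch. It uses the standard global alignment recurrence (Needleman-Wunsch style) with the slide's scoring scheme of match +1, mismatch 0, indel -1. The function names are hypothetical, and the code returns only the optimal score, not the paths.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=-1):
    """Sum the column scores of an alignment that is already gapped."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def global_alignment_score(a, b, match=1, mismatch=0, indel=-1):
    """Best global alignment score, filled in row by row."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * indel          # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + indel, score[i][j - 1] + indel)
    return score[-1][-1]


print(global_alignment_score("GATTC", "GAATTC"))   # optimal score
print(score_alignment("GA-TTC", "GAATTC"))         # Optimal Alignment 1
print(score_alignment("G-ATTC", "GAATTC"))         # Optimal Alignment 2
```

Both optimal alignments reach the same score because the single gap can be placed opposite either of the two adjacent A residues.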
Distance Matrix
PAM        Spinach    Rice  Mosquito  Monkey   Human
Spinach        0.0    84.9     105.6    90.8    86.3
Rice          84.9     0.0     117.8   122.4   122.6
Mosquito     105.6   117.8       0.0    84.7    80.8
Monkey        90.8   122.4      84.7     0.0     3.3
Human         86.3   122.6      80.8     3.3     0.0
First Step
The PAM distance of 3.3 (Human-Monkey) is the minimum, so we join Human and Monkey into MonHum and calculate the new distances.
[Figure: tree diagrams showing Monkey and Human joined as Mon-Hum, alongside Mosquito, Spinach, and Rice]
Next Cycle
PAM        Spinach    Rice  Mosquito  MonHum
Spinach        0.0    84.9     105.6    88.6
Rice          84.9     0.0     117.8   122.5
Mosquito     105.6   117.8       0.0    82.8
MonHum        88.6   122.5      82.8     0.0
Penultimate Cycle
PAM         Spinach    Rice  MosMonHum
Spinach         0.0    84.9       97.1
Rice           84.9     0.0      120.2
MosMonHum      97.1   120.2        0.0
[Figure: tree diagram with Rice and Spinach]
Last Joining
PAM         SpinRice  MosMonHum
SpinRice         0.0      108.7
MosMonHum      108.7        0.0
(Spin-Rice)-(Mos-(Mon-Hum))
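The joining cycles above can be sketched as a small clustering loop. The slides do not name the method; the tabulated values appear consistent with taking the plain average of the two merged clusters' distances at each step (often called WPGMA), so that rule is an assumption here. The function name and the nested-parenthesis labels are hypothetical.

```python
def average_linkage_tree(names, matrix):
    """Repeatedly join the closest pair; new distance = mean of the two old ones."""
    d = {frozenset((a, b)): matrix[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names) if i < j}
    clusters = list(names)
    joins = []
    while len(clusters) > 1:
        a, b = min(((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
                   key=lambda pair: d[frozenset(pair)])
        merged = f"({a},{b})"
        joins.append((a, b, d[frozenset((a, b))]))
        for c in clusters:
            if c not in (a, b):
                # Assumed rule: simple average of the distances to a and to b.
                d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return clusters[0], joins


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
matrix = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
tree, joins = average_linkage_tree(names, matrix)
print(tree)
for a, b, dist in joins:
    print(f"join {a} + {b} at {dist:.2f}")
```

The join order matches the cycles shown: Monkey with Human, then Mosquito, then Spinach with Rice, then the two groups. Because the slide tables round to one decimal at every cycle, the later distances printed here can differ from the tabulated ones in the last digit.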
[Figure: Option 1 and Option 2]
ClustalW- Alternative 1
If a third sequence is aligned to the first two, then whenever a gap has to be introduced to improve the alignment, the two entities (the existing pairwise alignment and the new sequence) are treated as two single sequences.
ClustalW- Alternative 2
If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.
ClustalW- Progression
The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a pair having more than one sequence.
Advantages: Speed.
Disadvantages: No objective function. No way of quantifying whether or not the alignment is good. No way of knowing if the alignment is correct.
ClustalW-Local Minimum
Potential problems: Local minimum problem. If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure. Arbitrary alignment.
ClustalW- Caveats
Sequence weighting.
Varying substitution matrices.
Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions.
Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments.
Sequence weighting
First we must be able to categorise sequences according to whether they have close relatives or if they are distantly-related to the other sequences (calculated directly from the guide tree). Weights are normalised, so that the largest weight is 1. Closely-related sequences have a large amount of the same information, so they are downweighted. These weights are multiplication factors.
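One way to picture the weighting is the sketch below. The slides do not give the formula, so the rule used here is an assumption: each branch length on the path from the root to a sequence is shared equally among the sequences below that branch, and the totals are then normalised so the largest weight is 1. The tree encoding, branch lengths, and function names are all hypothetical.

```python
def count_leaves(node):
    """A leaf is a name (str); an internal node is a list of (child, branch_length)."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child, _ in node)


def sequence_weights(tree):
    """Assumed rule: split each branch length among the leaves beneath it."""
    raw = {}

    def visit(node, inherited):
        if isinstance(node, str):
            raw[node] = inherited
            return
        for child, length in node:
            visit(child, inherited + length / count_leaves(child))

    visit(tree, 0.0)
    largest = max(raw.values())
    return {name: value / largest for name, value in raw.items()}


# Hypothetical guide tree: A and B are close relatives, C is distant.
guide_tree = [([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.5)]
print(sequence_weights(guide_tree))
```

In this toy tree the two close relatives A and B share their common branch and end up with smaller weights than the isolated sequence C, which receives the maximum weight of 1.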
ClustalW
Dependence on the length of the sequences:
The program uses the formula:
GOP -> (GOP + log(MIN(N, M))) * (average residue mismatch score) * (percent identity scaling factor)
where N and M are the lengths of the two sequences. The logarithm of the length of the shorter sequence is used as a scaling factor to increase the GOP with increasing length.
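The length adjustment can be written out as a small function. This is a sketch of the formula as stated on the slide, not of the program's actual code: the function and parameter names are hypothetical, the natural logarithm is an assumption (the slide does not give the base), and the numeric inputs are made up.

```python
import math


def scaled_gop(gop, n, m, avg_mismatch_score, identity_scaling):
    """(GOP + log(min(N, M))) * mismatch score * identity scaling factor."""
    return (gop + math.log(min(n, m))) * avg_mismatch_score * identity_scaling


# Illustrative values only: the shorter sequence length drives the increase.
for shorter in (50, 100, 400):
    print(shorter, round(scaled_gop(10.0, shorter, 500, 1.0, 1.0), 3))
```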
Divergent Sequences
The most divergent sequences (those most different, on average, from all of the other sequences) are usually the most difficult to align. It is sometimes better to delay their alignment until later, when the easier sequences have already been aligned. The user can set a cutoff (the default is 40% identity); sequences below it are not aligned until the others have been aligned.
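The delay rule can be sketched as a simple filter over pairwise percent identities. This follows the slide's description (average identity to all other sequences compared against a cutoff); the exact criterion used by the program may differ, and the function name and identity values below are hypothetical.

```python
def split_by_divergence(identity, cutoff=40.0):
    """Return (align_first, delayed) based on mean percent identity to the others."""
    align_first, delayed = [], []
    for name, row in identity.items():
        others = [value for other, value in row.items() if other != name]
        mean_identity = sum(others) / len(others)
        (delayed if mean_identity < cutoff else align_first).append(name)
    return align_first, delayed


# Made-up pairwise percent identities for three sequences.
identity = {
    "A": {"B": 80.0, "C": 30.0},
    "B": {"A": 80.0, "C": 25.0},
    "C": {"A": 30.0, "B": 25.0},
}
print(split_by_divergence(identity))
```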
ATGCTGTTAGGG
ATGCTCGTAGGG

ATGCT-GTTAGGG
ATGCTCGTA-GGG
The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the DNA sequences into their corresponding amino acid sequences, align the protein sequences, and then place the gaps in the DNA sequences according to where they are found in the amino acid alignment.
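The last step, carrying gaps from the protein alignment back to the DNA, can be sketched as follows. It assumes the protein has already been aligned by some other tool and that the DNA is in frame, with one codon per residue; the function name and the example sequence are hypothetical.

```python
def thread_gaps_into_dna(aligned_protein, dna, gap="-"):
    """Insert one three-base gap into the DNA for every gap in the protein row."""
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for symbol in aligned_protein if symbol != gap)
    if residues != len(codons):
        raise ValueError("number of residues does not match number of codons")
    remaining = iter(codons)
    return "".join(gap * 3 if symbol == gap else next(remaining)
                   for symbol in aligned_protein)


# Hypothetical example: ATG CTG GGG encodes M L G; the protein row has one gap.
print(thread_gaps_into_dna("ML-G", "ATGCTGGGG"))
```

Because gaps are only ever inserted in whole codons, the reading frame of the DNA alignment is preserved.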