
Sequence Analysis

DNA SEQUENCING & SEQUENCE ALIGNMENT

Sequence Alignment

Study Goal

• What is a sequence alignment?
• The difference between a global and a local alignment, and what each is used for.
• How to use dot matrix methods to analyze genes and chromosomes.
• The steps performed by the Needleman-Wunsch and Smith-Waterman algorithms to produce a sequence alignment.
• How to use scoring matrix values and gap penalties to produce a sequence alignment.

Sequence Alignment

• Definition: the comparison of two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.
• Sequence alignment score: the sum of the individual log odds scores for each pair of aligned sequence characters in an alignment, less a penalty for each gap of one or more positions.


Pair-wise Sequence Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2…xM, y = y1y2…yN, an alignment is an assignment of gaps to positions 0, …, M in x, and 0, …, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence.

What is a good alignment?

Sequences: AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-
AGCGAAGTTT
6 matches, 3 mismatches, 1 gap

AGGCTA-GTT-
AG-CGAAGTTT
7 matches, 1 mismatch, 3 gaps

AGGC-TA-GTT-
AG-CG-AAGTTT
7 matches, 0 mismatches, 5 gaps

Evolution at the DNA level

SEQUENCE EDITS
• Mutation
• Deletion

…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…

REARRANGEMENTS
• Inversion
• Translocation
• Duplication

Evolutionary

[Figure: sequence variants passed to the next generation, marked OK or X, with the question "Still OK?"]

Scoring Function

• Sequence edits (starting from AGGCCTC):
  • Mutations: AGGACTC
  • Insertions: AGGGCCTC
  • Deletions: AGG.CTC

Scoring function:
  Match: +m
  Mismatch: -s
  Gap: -d

Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d

Alternative definition: minimal edit distance. "Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other."
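The scoring function above can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and the default values m = 1, s = 1, d = 2 are assumptions, not values given on the slides.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=2):
    """Score two gapped rows of equal length.

    F = (# matches) * m - (# mismatches) * s - (# gaps) * d
    The defaults for m, s and d are illustrative assumptions.
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1          # a letter lined up with a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * m - mismatches * s - gaps * d
    return score, (matches, mismatches, gaps)


if __name__ == "__main__":
    # The three candidate alignments of AGGCTAGTT and AGCGAAGTTT
    for top, bottom in [("AGGCTAGTT-", "AGCGAAGTTT"),
                        ("AGGCTA-GTT-", "AG-CGAAGTTT"),
                        ("AGGC-TA-GTT-", "AG-CG-AAGTTT")]:
        print(top, bottom, score_alignment(top, bottom))
```

With these assumed penalties the first alignment scores highest, even though it has the fewest matches; a cheaper gap penalty would change the ranking.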

Pair-wise Alignment

• The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
• There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the same score.

How do we compute the best alignment?

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Too many possible alignments: >> 2^N
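To get a feel for how fast the number of alignments grows, the sketch below counts them with a simple recurrence. It assumes that every alignment column is one of three kinds (letter/letter, letter/gap, gap/letter) and that two alignments are different when their column sequences differ; the function name `count_alignments` is hypothetical.

```python
def count_alignments(m, n):
    """Count gapped alignments of a length-m and a length-n sequence.

    Assumption: the last column is letter/letter, letter/gap or gap/letter,
    so count(m, n) = count(m-1, n) + count(m, n-1) + count(m-1, n-1).
    """
    prev = [1] * (n + 1)               # aligning against an empty prefix: one way
    for _ in range(m):
        cur = [1]
        for j in range(1, n + 1):
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[n]


if __name__ == "__main__":
    for size in (5, 10, 20, 40):
        print(size, count_alignments(size, size), 2 ** size)
```

Enumerating all of these is hopeless for sequences of realistic length, which is why a more structured search is needed.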

Local Alignment vs. Global Alignment

• Global alignment: used to align the sequences over their entire length. Best for sequences of the same length.
• Local alignment: used to align the similar region of the sequences along a certain length. Best for sequences sharing a conserved region or domain.

What do we want alignment for?

[Figure labels: Orthologous genes; Xenologous (transferred) genes]

Matches to similar sequences

Sequence conservation, structure conservation, function conservation

• What is conserved in a gene [protein] family is functionally important
  • Due to purifying selection driven by functional constraints, observable in a background described by the theory of neutral evolution
  • Fast enough that pseudogenes rapidly deteriorate over evolutionary timescales
  • In any prokaryotic genome, homologs from more than one distantly related species are detectable for 70 – 80 % of proteins
• Application: Comparison of sequences/structures can identify homologous relationships, allowing inference of function based on that relationship

Methods of Alignment

• By hand: slide sequences on two lines of a word processor
• Dot plot
  • with windows
• Rigorous mathematical approach
  • Dynamic programming (slow, optimal)
• Heuristic methods (fast, approximate)
  • BLAST and FASTA
  • Word matching and hash tables

Align by Hand

GATCGCCTA_TTACGTCCTGGAC
<--->
AGGCATACGTA_GCCCTTTCGC

You still need some kind of scoring system to find the best alignment.

Dotplot

A dotplot gives an overview of all possible alignments. Each diagonal in a dotplot corresponds to a possible (ungapped) alignment.

[Figure: dot matrix of Sequence 1 against Sequence 2]

One possible alignment:
T A C A T T A C G T A C
A T A C A C T T A

Insertions / Deletions in a Dotplot

[Figure: dot matrix of the sequences T A C T G T C A T and T A C T G T T C A T]

T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T
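A minimal sketch of the plain dot matrix, without windows, is shown here for the two sequences of the insertion/deletion example. The function names `dot_matrix` and `render` are illustrative and do not come from any of the programs mentioned in these slides.

```python
def dot_matrix(seq1, seq2):
    """Return a matrix of booleans: True where seq1[i] == seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(seq1, seq2):
    """Draw the dot matrix as text, seq1 down the side and seq2 across the top."""
    lines = ["  " + " ".join(seq2)]
    for a, row in zip(seq1, dot_matrix(seq1, seq2)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("TACTGTCAT", "TACTGTTCAT"))
```

In the printed matrix the run of matches follows the main diagonal for the first five bases and then continues one column to the right, which is how the extra T in the longer sequence shows up.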

Dotplot (Window = 130 / Stringency = 9)

[Figure: dot matrix comparing two hemoglobin chains]

Word Size Algorithm

Word Size = 3

[Figure: the sequence T A C G G T A T G A C A G T A T C, repeated on four lines, with the fragments C T A T G A C A and T A C G G T A T G]
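One common way to implement word matching is to index every word of one sequence in a hash table and then look up the words of the other sequence. The sketch below does this for word size 3, using the two fragments from the figure; the function name `word_matches` is an assumption, and real programs add further filtering steps on top of this.

```python
from collections import defaultdict


def word_matches(seq1, seq2, word_size=3):
    """Return (i, j) pairs where seq1[i:i+word_size] == seq2[j:j+word_size]."""
    index = defaultdict(list)                  # word -> start positions in seq2
    for j in range(len(seq2) - word_size + 1):
        index[seq2[j:j + word_size]].append(j)
    hits = []
    for i in range(len(seq1) - word_size + 1):
        for j in index.get(seq1[i:i + word_size], []):
            hits.append((i, j))                # one dot per identical word
    return hits


if __name__ == "__main__":
    print(word_matches("TACGGTATG", "CTATGACA"))
```

Only exact word identities produce a dot, so this method does not tolerate mismatches inside a word.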

Window / Stringency

The same pair of sequences is scored at three window positions (Score = 11, Score = 11, Score = 7):

PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM

Scoring Matrix Filtering
  Matrix: PAM250
  Window = 12
  Stringency = 9

Typical window size for DNA sequences is 15 bases, and the stringency, or match requirement, in this window is 10. This means that if there are 10-15 matches within the window, a dot is printed at the first base in the window.

Dotplot (Window = 18 / Stringency = 10)

[Figure: dot matrix comparing two hemoglobin chains]

Considerations

• The window/stringency method is more sensitive than the wordsize method (ambiguities are permitted).
• The smaller the window, the larger the weight of statistical (unspecific) matches.
• With large windows the sensitivity for short sequences is reduced.
• Insertions/deletions are not treated explicitly.

Dot Matrix

• Unless two sequences are known to be very much alike, the dot matrix method should be considered as a first choice for pair-wise sequence alignment
• Readily reveals the presence of insertions/deletions and direct and inverted repeats
• DNA sequence dot matrix comparison: long windows and high stringencies (7/11, 11/15)
• Protein sequences: short windows and stringencies (1/1), except for a short domain of partial similarity in otherwise dissimilar sequences (15/5)

• Programs:
  • DOTTER: http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
  • DOTPLOT: http://helix.nih.gov/docs/gcg/dotplot.html
  • PLALIGN in FASTA
  • EMBOSS: http://emboss.sourceforge.net/download/


Alignment is additive

Observation: The score of aligning
  x1……xM
  y1……yN
is additive.

Say that
  x1…xi    xi+1…xM
aligns to
  y1…yj    yj+1…yN

The two scores add up:
  F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])


Basic principles of dynamic programming

• Creation of an alignment path matrix
• Stepwise calculation of score values
• Backtracking (evaluation of the optimal path)

"It is applicable when a large search space can be structured into a succession of stages, such that the initial stage contains trivial solutions to sub-problems, each partial solution in a later stage can be calculated by recurring a fixed number of partial solutions in an earlier stage, the final stage contains the overall solution" (Mount)

Creation of an alignment path matrix

Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences
• Construct matrix F indexed by i and j (one index for each sequence)
• F(i,j) is the score of the best alignment between the initial segment x1...i of x up to xi and the initial segment y1...j of y up to yj
• Build F(i,j) recursively beginning with F(0,0) = 0

Creation of an alignment path matrix

• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)
• Three possibilities:
  • xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi, yj)
  • xi is aligned to a gap, F(i,j) = F(i-1,j) - d
  • yj is aligned to a gap, F(i,j) = F(i,j-1) - d
• The best score up to (i,j) will be the largest of the three options

Dynamic Programming

• There are only a polynomial number of subproblems
  • Align x1…xi to y1…yj
• Original problem is one of the subproblems
  • Align x1…xM to y1…yN
• Each subproblem is easily solved from smaller subproblems
  • ???

Then, we can apply Dynamic Programming!!!

Let F(i,j) = optimal score of aligning
  x1……xi
  y1……yj

Example of Dynamic programming algorithm

• Sequences a1, a2, a3, a4 and b1, b2, b3, b4
• Align sequences in a global alignment

Types of Scores

Scoring Matrix / Substitution Matrix

Scoring Matrix: Example

      A   R   N   K
A     5
R    -2   7
N    -1  -1   7
K    -1   3   0   6

• Notice that although R and K are different amino acids, they have a positive score.
• Why? They are both positively charged amino acids, so exchanging one for the other will not greatly change the function of the protein.
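A substitution matrix is simply a lookup table of pair scores. The sketch below stores the four-residue example above in a dictionary and scores an ungapped pair of peptides; the names `SCORES`, `pair_score` and `ungapped_score`, and the two short peptides, are made up for illustration.

```python
# Lower triangle of the example matrix above (residues A, R, N, K only).
SCORES = {
    ("A", "A"): 5,
    ("R", "A"): -2, ("R", "R"): 7,
    ("N", "A"): -1, ("N", "R"): -1, ("N", "N"): 7,
    ("K", "A"): -1, ("K", "R"): 3, ("K", "N"): 0, ("K", "K"): 6,
}


def pair_score(a, b):
    """Look up s(a, b); the matrix is symmetric, so try both orders."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def ungapped_score(x, y):
    """Sum the pair scores of two equal-length, ungapped sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


if __name__ == "__main__":
    print(pair_score("R", "K"))            # conservative exchange, positive score
    print(ungapped_score("ARNK", "AKNR"))
```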

Conservation

• Amino acid changes that tend to preserve the physicochemical properties of the original residue
  • Polar to polar: aspartate → glutamate
  • Nonpolar to nonpolar: alanine → valine
  • Similarly behaving residues: leucine to isoleucine

Scoring Examples

Dynamic Programming


Scoring systems

DNA Scoring Systems - very simple

Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

      A   G   C   T
A     1   0   0   0
G     0   1   0   0
C     0   0   1   0
T     0   0   0   1

Match: 1
Mismatch: 0
Score = 5

Protein Scoring Systems

Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM

Scoring matrix:

      C   S   T   P   A   G   N   D  . . .
C     9
S    -1   4
T    -1   1   5
P    -3  -1  -1   7
A     0   1   0  -1   4
G    -3   0  -2  -2   0   6
N    -3   1   0  -2  -2   0   5
D    -3   0  -1  -1  -2  -1   1   6
. . .

T:G = -2
T:T = 5
Score = 48

Protein Scoring Systems

• Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

[Figure: Venn diagram grouping the amino acids by property: tiny, small, aliphatic, hydrophobic, aromatic, polar, charged, positive]

Protein Scoring Systems

• Scoring matrices reflect:
  • # of mutations to convert one to another
  • chemical similarity
  • observed mutation frequencies
  • the probability of occurrence of each amino acid
• Widely used scoring matrices:
  • PAM
  • BLOSUM

PAM 250

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6

Highlighted on the slide: C:W = -8, W:W = 17

PAM

Dayhoff Matrix

BLOSUM Matrix

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

Conserved blocks in alignments

TIPS on choosing a scoring matrix

• Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).
• When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices.
• For database searching the commonly used matrix is BLOSUM62.

Significance of alignment

Database Searching

Local vs. Global Alignment

• Global Alignment

--T---CC-C-AGT---TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C

• Local Alignment: better alignment to find a conserved segment

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Global vs. Local Alignment

Global Alignment – Needleman and Wunsch (1970)
Local Alignment – Smith and Waterman (1981a)

Global Alignment

Two closely related sequences: needle (Needleman & Wunsch) creates an end-to-end alignment.

Two sequences sharing several regions of local similarity:

1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67
  |||||||||||||| | | | |||| || | | | ||
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG ... 70

The Needleman-Wunsch Matrix

x1 ……………………………… xM along one axis, y1 ……………………………… yN along the other

Every non-decreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences.

An optimal alignment is composed of optimal subalignments.

The Needleman-Wunsch Algorithm

1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = - j × d
   c. F(i, 0) = - i × d

2. Main Iteration. Filling-in partial alignments
   For each i = 1……M
     For each j = 1……N
       F(i, j) = max of:
         F(i-1, j-1) + s(xi, yj)   [case 1]
         F(i-1, j) - d             [case 2]
         F(i, j-1) - d             [case 3]

       Ptr(i, j) = DIAG, if [case 1]
                   LEFT, if [case 2]
                   UP,   if [case 3]

3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment.
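The three steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch`, the simple match/mismatch score standing in for s(xi, yj), and the default penalties are assumptions. Ties between the three cases are broken in favour of DIAG.

```python
def needleman_wunsch(x, y, match=1, mismatch=1, d=2):
    """Global alignment with a linear gap penalty d.

    s(xi, yj) is assumed to be +match for identical letters and
    -mismatch otherwise; a substitution matrix could be used instead.
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: leading gaps cost d each.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"

    # 2. Main iteration: take the best of the three cases.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else -mismatch
            options = [(F[i - 1][j - 1] + s, "DIAG"),   # case 1
                       (F[i - 1][j] - d, "LEFT"),       # case 2: xi against a gap
                       (F[i][j - 1] - d, "UP")]         # case 3: yj against a gap
            F[i][j], ptr[i][j] = max(options, key=lambda t: t[0])

    # 3. Termination: follow the pointers back from (M, N).
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("AGGCTAGTT", "AGCGAAGTTT")
    print(score)
    print(top)
    print(bottom)
```

The matrix has (M+1) × (N+1) cells and each cell looks at three neighbours, so time and memory both grow with M × N.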

Global Alignment (Needleman-Wunsch)

• The Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle)
• Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.
• Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
• Global methods are useful when you want to force two sequences to align over their entire length

Local Alignment (Smith-Waterman)

• Local alignment
  • Identify the most similar sub-region shared between two sequences
  • Smith-Waterman

The local alignment problem

Given two strings x = x1……xM, y = y1……yN

Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum

x = aaaacccccggggtta
y = ttcccgggaaccaacc

Why local alignment – examples

• Genes are shuffled between genomes
• Portions of proteins (domains) are often conserved

Cross-species genome similarity

• 98% of genes are conserved between any two mammals
• >70% average similarity in protein sequence

hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174

hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174

hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174

“atoh” enhancer in human, mouse, rat, fugu fish

The Smith-Waterman algorithm

Termination:

1. If we want the best local alignment…
   F_OPT = max over i, j of F(i, j)
   Find F_OPT and trace back

2. If we want all local alignments scoring > t
   For all i, j find F(i, j) > t, and trace back
   Complicated by overlapping local alignments
   (Waterman–Eggert ’87: find all non-overlapping local alignments with minimal recalculation of the DP matrix)
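A minimal sketch of the first termination case (best local alignment only) follows. Only the termination step appears in the text above; the initialization with zeros and the floor of 0 in the recurrence follow the standard formulation of Smith-Waterman, and the function name `smith_waterman` and the default scores are assumptions.

```python
def smith_waterman(x, y, match=1, mismatch=1, d=2):
    """Best local alignment with a linear gap penalty d.

    Standard formulation (assumed here): F starts at 0 on the borders and
    each cell is max(0, case 1, case 2, case 3).
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else -mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:              # track F_OPT = max over i, j
                best, best_pos = F[i][j], (i, j)

    # Trace back from F_OPT until a cell with score 0 is reached.
    i, j = best_pos
    row_x, row_y = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else -mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


if __name__ == "__main__":
    x = "TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC"
    y = "AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC"
    print(smith_waterman(x, y))
```

Finding all local alignments above a threshold t, the second case, needs extra bookkeeping to avoid reporting overlapping paths and is not attempted in this sketch.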

Smith-Waterman algorithm

Basic Local Alignment Search Tool

BLAST Algorithm

Basic BLAST Algorithms