P. 1
Mole Vol Class 05

|Views: 4|Likes:

See more
See less

09/25/2010

pdf

text

original

# CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

Distance Matrix Methods
Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark gorm@cbs.dtu.dk

Distance Matrix Methods
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

Gorilla : ACGTCGTA Human : ACGTTCCT Chimpanzee: ACGTTTCG

1.

Construct multiple alignment of sequences

Go Hu Ch Go Hu Ch 4 4 2 2. Construct table listing all pairwise differences (distance matrix)

1 1 2 1

Ch Hu Go 3. Construct tree from pairwise distances

Finding optimal branch lengths
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

S1 S1 S2 S3 S4 -

S2 -

S3

S4

D12 D13 D14 D23 D24 D34

S1

S2 a b d e S4 c

Observed distance

S3

Distance along tree (patristic distance)

Goal:

D12 D13 D14 D23 D24 D34

≈ ≈ ≈ ≈ ≈ ≈

d12 d13 d14 d23 d24 d34

= = = = = =

a+b+c a+d a+b+e d+b+c c+e d+b+e

Exercise (handout)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

• •

Construct distance matrix (count different positions) Reconstruct tree and find best set of branch lengths

Optimal Branch Lengths for a Given Tree: Least Squares CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS S1 S2 a b d e S4 c • S3 Fit between given tree and observed distances can be expressed as “sum of squared differences”: Distance along tree Q = j>i D12 D13 D14 D23 D24 D34 ≈ ≈ ≈ ≈ ≈ ≈ d12 d13 d14 d23 d24 d34 = = = = = = a+b+c a+d a+b+e d+b+c c+e d+b+e Σ (Dij .dij)2 • Goal: Find branch lengths that minimize Q .this is the optimal set of branch lengths for this tree. .

dij)2 Dij n • Power (n) is typically 1 or 2 .Optimal Branch Lengths: Least Squares CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS S1 S2 a b d e S4 c • Longer distances associated with larger errors Squared deviation may be weighted so longer branches contribute less to Q: S3 • Distance along tree Goal: D12 D13 D14 D23 D24 D34 ≈ ≈ ≈ ≈ ≈ ≈ d12 d13 d14 d23 d24 d34 = = = = = = a+b+c a+d a+b+e d+b+c c+e d+b+e Q= Σ j>i (Dij .

dAC ) + ( DBC .dBC ) 2 2 2 Finding Optimal Branch Lengths CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A A B C - B - C DBC - A v1 DAB DAC v2 B v3 C Observed distance Distance along tree Goal: DAB ≈ dAB = v1 + v2 DAC ≈ dAC = v1 + v3 DBC ≈ dBC = v2 + v3 .Q = ( DAB .dAB ) + ( DAC .

Q = (DAB .dAC ) + ( DBC .dBC ) 2 2 2 Finding Optimal Branch Lengths CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS .dAB ) + ( DAC .

2DBC dBC CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS 2 2 2 2 2 2 2 2 2 Finding Optimal Branch Lengths .dAB ) + ( DAC .Q = ( DAB .2DAC dAC +DBC + dBC .2DAB dAB +DAC + dAC .dAC ) + ( DBC .dBC ) c Q = DAB + dAB .

2DBC (v 2 + v 3 ) 2 2 2 2 2 2 .2DBC dBC CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS 2 2 2 2 2 2 2 2 2 Finding Optimal Branch Lengths c Q = DAB + (v1 + v 2 ) .2DAB (v1 + v 2 ) +DAC + (v1 + v 3 ) .dAC ) + ( DBC .Q = ( DAB .dAB ) + ( DAC .2DAC (v1 + v 3 ) +DBC + (v 2 + v 3 ) .2DAB dAB +DAC + dAC .2DAC dAC +DBC + dBC .dBC ) c Q = DAB + dAB .

2DAC (v1 + v 3 ) +DBC + (v 2 + v 3 ) .dBC ) c Q = DAB 2 + dAB 2 .2DBC v 21 .2DBC (v 2 + v 3 ) c Q = DAB 2 + v12 + v 2 2 + 2v1v 2 .dAB ) + ( DAC .2DAC v 3 +DBC + v 2 + v 3 + 2v 2v 3 .2DAB v1 .2DAB (v1 + v 2 ) +DAC + (v1 + v 3 ) .2DBC v 3 2 2 2 2 2 2 2 2 2 2 2 2 .dAC ) + ( DBC .2DAB v 2 +DAC + v1 + v 3 + 2v1v 3 .2DAC v1 .2DBC dBC CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS 2 2 2 2 2 Finding Optimal Branch Lengths c Q = DAB + (v1 + v 2 ) .Q = ( DAB .2DAC dAC +DBC 2 + dBC 2 .2DAB dAB +DAC + dAC .

Q = DAB + v1 + v 2 + 2v1v 2 .2DAB .2DAC v1 .2DAC v 3 +DBC + v 2 + v 3 + 2v 2v 3 .2DBC v 2 .2DBC v 3 c Q = 2v1 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS 2 2 2 2 2 2 2 2 2 2 2 Finding Optimal Branch Lengths +(2v 2 + 2v 3 .2DABv 2 .2DBC v 21 .2DAC v 3 .2DAC )v1 +2v 2 + 2v 3 + 2v 2v 3 .2DAB v1 .2DBC v 3 + DAB + DAC + DBC 2 2 2 2 .2DABv 2 +DAC + v1 + v 3 + 2v1v 3 .

2DBC v 3 + DAB + DAC + DBC c Q = 2v1 + C1v1 + C2 2 2 2 2 2 .2DAC v 3 .Q = DAB + v1 + v 2 + 2v1v 2 .2DAC v1 .2DABv 2 .2DAC )v1 +2v 2 + 2v 3 + 2v 2v 3 .2DABv1 .2DAC v 3 +DBC + v 2 + v 3 + 2v 2v 3 .2DABv 2 +DAC + v1 + v 3 + 2v1v 3 .2DBC v 3 c Q = 2v1 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS 2 2 2 2 2 2 2 2 2 2 2 Finding Optimal Branch Lengths +(2v 2 + 2v 3 .2DBC v 21 .2DBC v 2 .2DAB .

2DAC dv1 .2DAB .2DAC = 0 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS dQ = 4v1 + C1 = 4v1 + 2v 2 + 2v 3 .Q = 2v1 + C1v1 + C2 dQ =0 dv1 2 min: Finding Optimal Branch Lengths ß 4v1 + 2v 2 + 2v 3 .2DAB .

Finding Optimal Branch Lengths CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS • System of n linear equations with n unknowns • Can be solved using substitution method or matrix-based methods .

the best tree is the one with the smallest sum of squared errors.Least Squares Optimality Criterion CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS • • Search through all (or many) tree topologies For each investigated tree. Least squares criterion used both for finding branch lengths on individual trees. and for finding best tree. find best branch lengths using least squares criterion (solve N equations with N unknowns) Among all investigated trees. • • .

Minimum Evolution Optimality Criterion CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS • • Search through all (or many) tree topologies For each investigated tree. the best tree is the one with the smallest sum of branch lengths (the shortest tree). minimum tree length used for finding best tree. • • . find best branch lengths using least squares criterion (solve N equations with N unknowns) Among all investigated trees. Least squares criterion used for finding branch lengths on individual trees.

Superimposed Substitutions CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS ACGGTGC C T • Actual number of evolutionary events: Observed number of differences: 5 • GCGGTGA 2 • Distance is (almost) always underestimated .

Model-based correction for superimposed substitutions CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS • Goal: try to infer the real number of evolutionary events (the real distance) based on 1. Observed data (sequence alignment) 2. A model of how evolution occurs .

33 x DOBS) • • A A C G T -3α α α α C α -3α α α G α α -3α α T α α α -3α • For instance: DOBS=0.Jukes and Cantor Model CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS • Four nucleotides assumed to be equally frequent (f=0.43 => DJC=0.25) All 12 substitution rates assumed to be equal Under this model the corrected distance is: DJC = -0.75 x ln(1-1.64 .

.Clustering Algorithms CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS • • Starting point: Distance matrix Cluster the two nearest nodes: – Tree: connect pair of nodes to common ancestral node. by finding distance from new node to all other nodes – • • Repeat until all nodes are linked in tree Results in only one tree. Compute new distance matrix. compute branch lengths from ancestral node to both descendants Distance matrix: replace the two joined nodes with the new (ancestral) node. there is no measure of tree-goodness.

where Dij-ui-uj is smallest Connect the tips i and j. i and j. .5 Dij + 0. except the denominator is n-2 instead of n-1) CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS • • Find the pair of tips.5 Dij + 0.5 (uj-ui) • Update the distance matrix: Compute distance between new node and each remaining tip as follows: Dij.5 (ui-uj) vj = 0. The branch lengths from the ancestral node to i and j are: vi = 0.k = (Dik+Djk-Dij)/2 • • Replace tips i and j by the new node which is now treated as a tip Repeat until only two nodes remain.Neighbor Joining Algorithm • For each tip compute ui = Σj Dij/(n-2) (essentially the average distance to all other tips. forming a new ancestral node.

Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - C 21 12 - D 27 18 14 - A B C D .

5 (17+12+18)/2=23.5 (21+12+14)/2=23.Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - C 21 12 - D 27 18 14 - i A B C D ui (17+21+27)/2=32.5 (27+18+14)/2=29.5 A B C D .

5 (27+18+14)/2=29.5 (21+12+14)/2=23.5 (17+12+18)/2=23.Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - C 21 12 - D 27 18 14 - i A B C D ui (17+21+27)/2=32.5 A B C D A A B C D - B -39 - C -35 -35 - D -35 -35 -39 - Dij -ui-uj .

5 A B C D A A B C D - B -39 - C -35 -35 - D -35 -35 -39 - Dij -ui-uj .5 (17+12+18)/2=23.5 (21+12+14)/2=23.5 (27+18+14)/2=29.Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - C 21 12 - D 27 18 14 - i A B C D ui (17+21+27)/2=32.

5 (21+12+14)/2=23.5 (27+18+14)/2=29.Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - C 21 12 - D 27 18 14 - i A B C D ui (17+21+27)/2=32.5 A B C D A A B C D - B -39 - C -35 -35 - D -35 -35 -39 X C D Dij -ui-uj .5 (17+12+18)/2=23.

5-23.5-29.5 (27+18+14)/2=29.5 x 14 + 0.Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - C 21 12 - D 27 18 14 - i A B C D ui (17+21+27)/2=32.5 x 14 + 0.5 (17+12+18)/2=23.5 A B C D A A B C D - B -39 - C -35 -35 - D -35 -35 -39 C 4 X D 10 Dij -ui-uj vC = 0.5 x (29.5) = 10 .5) = 4 vD = 0.5 (21+12+14)/2=23.5 x (23.

Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - C 21 12 - D 27 18 14 - X A B C D X - C 4 X D 10 .

DCD )/2 = (21 + 27 .Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - C 21 12 - D 27 18 14 - X DXA = (DCA + DDA .14)/2 = 17 DXB = (DCB + DDB .14)/2 =8 A B C D X - C 4 X D 10 .DCD )/2 = (12 + 18 .

DCD )/2 = (21 + 27 .DCD )/2 = (12 + 18 .Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - C 21 12 - D 27 18 14 - X 17 8 DXA = (DCA + DDA .14)/2 = 17 DXB = (DCB + DDB .14)/2 =8 A B C D X - C 4 X D 10 .

DCD )/2 = (12 + 18 .DCD )/2 = (21 + 27 .14)/2 =8 DXA = (DCA + DDA .Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - X 17 8 DXB = (DCB + DDB .14)/2 = 17 A B X C 4 X D 10 .

Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - X 17 8 - i A B X ui (17+17)/1 = 34 (17+8)/1 = 25 (17+8)/1 = 25 A B X C 4 X D 10 .

Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - X 17 8 - i A B X ui (17+17)/1 = 34 (17+8)/1 = 25 (17+8)/1 = 25 A B X A A B X - B -42 - X -28 -28 C 4 X D 10 Dij -ui-uj .

Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - X 17 8 - i A B X ui (17+17)/1 = 34 (17+8)/1 = 25 (17+8)/1 = 25 A B X A A B X - B -42 - X -28 -28 C 4 X D 10 Dij -ui-uj .

5 x (34-25) = 13 vD = 0.5 x 17 + 0.Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - X 17 8 - i A B X ui (17+17)/1 = 34 (17+8)/1 = 25 (17+8)/1 = 25 A B X A A B X - B -42 - X -28 -28 C 4 X D 10 A 13 Y 4 B Dij -ui-uj vA = 0.5 x (25-34) = 4 .5 x 17 + 0.

Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - X 17 8 - Y A B X Y C 4 X D 10 A 13 Y 4 B .

DAB )/2 = (17 + 8 .17)/2 =4 A B X Y 4 C 4 X D 10 A 13 Y 4 B .Neighbor Joining Algorithm A CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS B 17 - X 17 8 - Y DYX = (DAX + DBX .

Neighbor Joining Algorithm X CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Y 4 DYX = (DAX + DBX .17)/2 =4 X Y C 4 X D 10 A 13 Y 4 B .DAB )/2 = (17 + 8 .

DAB )/2 = (17 + 8 .Neighbor Joining Algorithm X CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Y 4 DYX = (DAX + DBX .17)/2 =4 X Y C 4 D 10 4 A 13 4 B .

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A A B C D B 17 - C 21 12 - D 27 18 14 D C 4 4 4 B 10 13 A .

scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->