New Approaches for Analyzing Biological Sequences

Prof. Rushen Chahal

Contributions
Developed methods for:
  Identifying new genes
  Constructing evolutionary trees
  Comparing phylogenetic solution space

Biological Sequences
DNA (gene), RNA, protein

DNA sequence: cgttaacaaagc...
Protein sequence: MAEKPKLH...

Main Tasks 

Finding new genes:
  Polymerase Chain Reaction (PCR)
  Cloning
  Genomic sequencing

Determining gene functions:
  Lab work
  Related genes (homology)
  Genome comparison

Evolutionary Trees

[Figure: evolutionary tree drawn along a time axis]

Putting it together
[Figure: diagram connecting new sequences, a database, related sequences, and their relationships (evolution)]

Primer Selection for Polymerase Chain Reactions (PCR)  

Goal: Discover previously unknown genes
Strategy: Design PCR primers for a large set of known gene family members
Unknown genes will (hopefully) be amplified

Gene Family
[Figure: tree of gene family members]
herpesEC crnvHH2 cmvHH3 humfMLF humIL8 ratANG ratG10d bovLOR1 chkGPCR RBS11 humSSR1 gpPAF dogRDC1 ratODOR musdelto musP2u humC5a chkP2y ratBK2 humTHR ratRTA humMRG ratLH bovOP humMAS humEDG1 ratCGPCR ratNPYY1 ratPOT ratNK1 humACTH flyNK humMSH flyNPY musEP3 musGIR humTXA2 ratCCKA dogAd1 ratNTR musEP2 musTRH humD2 musGnRH dogCCKB humA2a musGRP ratV1a hamA1a bovETA hamB2 ratD1 hum5HT1a bovH1 humM1 humRSC

Polymerase Chain Reaction


Primers

Common region => common primer


Primer Group
[Figure: gene family members partitioned into primer groups; "?" marks appear beside three of the groups]

herpesEC

crnvHH2 humRSC cmvHH3 humfMLF humIL8 ratG10d ratANG bovLOR1 chkGPCR RBS11 humSSR1 gpPAF dogRDC1 musdelto musP2u humC5a chkP2y ratBK2

ratODOR

ratLH bovOP humEDG1 ratCGPCR ratPOT humACTH humMSH musEP3 humTXA2 musEP2

humTHR ratRTA humMRG humMAS

ratCCKA dogAd1 humD2 humA2a hamA1a ratD1 hamB2 hum5HT1a bovH1 humM1

ratNPYY1 ratNK1 flyNK flyNPY musGIR ratNTR musTRH musGnRH musGRP ratV1a bovETA

dogCCKB

Primer Selection Problem
Optimal Primer Selection Problem:
  input: set of DNA sequences
  output: optimal set of primers
Theorem: NP-complete
Proof: reduction from set cover

Approaches
Exact algorithms:
  exhaustive brute-force
  branch-and-bound

Provably-good heuristics:
  solution quality: within log(# sequences) · OPT
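The logarithmic bound is the one associated with the standard greedy set-cover heuristic, so a minimal sketch of that heuristic applied to primers may help. It assumes a primer is simply an exact length-k substring shared by sequences and ignores strand complementarity; the function names and the toy sequences are hypothetical, and this is not claimed to be the exact heuristic used in the talk.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return every length-k substring of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_primer_cover(seqs: Dict[str, str], k: int) -> List[str]:
    """Greedy set cover: keep picking the k-mer found in the most uncovered sequences."""
    covers: Dict[str, Set[str]] = {}
    for name, seq in seqs.items():
        for p in kmers(seq, k):
            covers.setdefault(p, set()).add(name)
    # Sequences shorter than k cannot contain any primer, so they are skipped.
    uncovered = {name for name, seq in seqs.items() if len(seq) >= k}
    chosen: List[str] = []
    while uncovered:
        # Ties are broken alphabetically so the result is deterministic.
        best = max(sorted(covers), key=lambda p: len(covers[p] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


# Toy input (made up for illustration): three sequences share "TAC".
toy = {"a": "ACGTAC", "b": "TTACGT", "c": "GGGTAC", "d": "CCCCCC"}
primers = greedy_primer_cover(toy, 3)
print(primers)
```

On the toy input the first pick covers three sequences at once and the last sequence needs its own primer, which mirrors the "common region => common primer" idea.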

Extension: Inexact Primers

Goal: Jointly optimize the number of mismatches and the number of primers
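One way to make "inexact" concrete is to let a primer match a site when the two differ in at most a fixed number of positions. The sketch below assumes Hamming distance as the mismatch measure, which the slides do not specify; the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def matches_with_mismatches(primer: str, seq: str, max_mismatches: int) -> bool:
    """True if some window of seq differs from primer in at most max_mismatches positions."""
    k = len(primer)
    return any(
        hamming(primer, seq[i:i + k]) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


# "ACGT" does not occur exactly in the sequence, but "AGGT" is one mismatch away.
print(matches_with_mismatches("ACGT", "TTAGGTAA", 0))
print(matches_with_mismatches("ACGT", "TTAGGTAA", 1))
```

Allowing more mismatches lets one primer cover more sequences, which is the tradeoff the goal refers to.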

Sample Output
herpesEC

crnvHH2 humRSC cmvHH3 humfMLF humIL8 ratG10d ratANG bovLOR1 chkGPCR RBS11 humSSR1 gpPAF dogRDC1 musdelto musP2u humC5a chkP2y ratBK2

ratODOR

humTHR ratRTA humMRG humMAS

ratLH

bovOP

humEDG1 ratCGPCR ratPOT humACTH humMSH musEP3 humTXA2 musEP2

ratCCKA

dogCCKB

dogAd1 humD2 humA2a hamA1a ratD1 hamB2 hum5HT1a bovH1 humM1

ratNPYY1 ratNK1 flyNK flyNPY musGIR ratNTR musTRH musGnRH musGRP ratV1a bovETA

The Big Picture

[Figure: overview diagram (Evolution)]

Identifying new genes using PCR

Evolutionary Tree Reconstruction
Optimality criteria (tree cost):
  Least-Squares
  Minimum-Evolution
  Maximum-Parsimony
  Maximum-Likelihood

NP-complete [Foulds & Graham 1982, Day 1987]
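As an illustration of a tree cost, the sketch below computes a least-squares cost: the sum of squared differences between observed pairwise distances and path lengths in the tree. It assumes the unweighted form of the criterion (weighted variants exist), and the tree representation and names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]


def path_lengths(edges: List[Edge], source: str) -> Dict[str, float]:
    """Path length from source to every node of a tree given as weighted edges."""
    adj: Dict[str, List[Tuple[str, float]]] = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {source: 0.0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:          # a tree has one path to each node
                dist[v] = dist[u] + w
                stack.append(v)
    return dist


def least_squares_cost(edges: List[Edge], leaves: List[str],
                       observed: Dict[Tuple[str, str], float]) -> float:
    """Sum over leaf pairs of (observed distance - tree path length) squared."""
    tree = {a: path_lengths(edges, a) for a in leaves}
    return sum((observed[(a, b)] - tree[a][b]) ** 2
               for a, b in combinations(leaves, 2))


# Toy three-leaf tree around an internal node "x" (values made up).
toy_edges = [("A", "x", 1.0), ("B", "x", 2.0), ("C", "x", 3.0)]
toy_observed = {("A", "B"): 3.0, ("A", "C"): 4.5, ("B", "C"): 5.0}
toy_cost = least_squares_cost(toy_edges, ["A", "B", "C"], toy_observed)
print(toy_cost)
```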

Previous Approaches 
   

Fitch-Margoliash [1967]
Neighbor-Joining [1987]
Quartet-Puzzling [1997]
Split-Decomposition [1995]
PAUP [1998], PHYLIP [1993]

All use greedy heuristics and target a single best solution

However...
Topologically distant solutions may exist
[Figure: three different tree topologies over leaves 1-6 with nearly equal costs (0.2560, 0.2561, 0.2562)]

Detect diverse low-cost solutions

Random Starting Trees + Heuristics?
[Maddison 1991, Penny 1995, Swofford 1997]


Neighbor-Joining Method

[Figure: neighbor-joining illustrated step by step on five leaves (1-5)]
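A single neighbor-joining step can be sketched as follows, using the standard textbook criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to all other nodes. The function names are hypothetical, and the distance matrix is a small made-up example rather than data from the talk.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Dist = Dict[FrozenSet[str], float]


def nj_pair_scores(nodes: List[str], d: Dist) -> Dict[Tuple[str, str], float]:
    """Neighbor-joining criterion for every pair; the lowest score is joined."""
    n = len(nodes)
    r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
    return {(i, j): (n - 2) * d[frozenset((i, j))] - r[i] - r[j]
            for i, j in combinations(nodes, 2)}


def nj_join(nodes: List[str], d: Dist, i: str, j: str,
            new: str) -> Tuple[List[str], Dist]:
    """Replace i and j by a new internal node and update the distances."""
    rest = [k for k in nodes if k not in (i, j)]
    out = {frozenset((a, b)): d[frozenset((a, b))] for a, b in combinations(rest, 2)}
    dij = d[frozenset((i, j))]
    for k in rest:
        out[frozenset((new, k))] = 0.5 * (d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
    return rest + [new], out


# Made-up five-leaf distance matrix.
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8, ("b", "c"): 10,
       ("b", "d"): 10, ("b", "e"): 9, ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(p): float(v) for p, v in raw.items()}
leaves = ["a", "b", "c", "d", "e"]

scores = nj_pair_scores(leaves, dist)
best_pair = min(scores, key=scores.get)
new_nodes, new_dist = nj_join(leaves, dist, best_pair[0], best_pair[1], "u")
print(best_pair, scores[best_pair], new_nodes)
```

Plain neighbor-joining keeps only the single best pair at each step and repeats until the tree is resolved.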

Generalized Neighbor-Joining

[Figure: a five-leaf star tree (leaves 1-5) and several alternative next-step trees, each joining a different pair of neighbors]

Generalized Neighbor-Joining
Controlling solution space sampling:
  K: max # partial solutions maintained
  Q (quality): # candidates selected for low cost
  D (diversity): # candidates selected for variety

Tradeoff between quality & topological diversity: K = Q + D
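The K = Q + D split can be sketched as a two-stage selection: first take the Q cheapest candidates, then add D candidates chosen for variety. How the diversity picks are made is not stated on the slide; the greedy "farthest from everything already chosen" rule below is an assumption, as are the function name, the toy costs, and the split-based distance.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_candidates(cands: Sequence[T], cost: Callable[[T], float],
                      dist: Callable[[T, T], float], q: int, d: int) -> List[T]:
    """Keep the q cheapest candidates, then d more that are far from those kept."""
    ranked = sorted(cands, key=cost)
    chosen = list(ranked[:q])
    pool = list(ranked[q:])
    for _ in range(d):
        if not pool:
            break
        if chosen:
            # Assumed rule: maximize the distance to the nearest chosen candidate.
            pick = max(pool, key=lambda c: min(dist(c, s) for s in chosen))
        else:
            pick = pool[0]  # nothing chosen yet (q == 0): start from the cheapest
        chosen.append(pick)
        pool.remove(pick)
    return chosen


# Toy candidates: name -> (cost, set of splits). All values are made up.
trees = {
    "t1": (1.0, frozenset({"12|345", "123|45"})),
    "t2": (1.1, frozenset({"12|345", "124|35"})),
    "t3": (1.2, frozenset({"13|245", "135|24"})),
    "t4": (1.3, frozenset({"12|345", "125|34"})),
}


def tree_cost(name: str) -> float:
    return trees[name][0]


def split_distance(a: str, b: str) -> int:
    """Number of splits present in exactly one of the two trees."""
    return len(trees[a][1] ^ trees[b][1])


quality_only = select_candidates(list(trees), tree_cost, split_distance, q=2, d=0)
mixed = select_candidates(list(trees), tree_cost, split_distance, q=1, d=1)
print(quality_only, mixed)
```

With Q = 2 and D = 0 the two cheapest trees are kept; with Q = 1 and D = 1 the second slot goes to the tree that shares no split with the cheapest one.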

GNJ Performance (8 leaves)
[Figure: two panels plotted against least-squares cost (log scale, 0.001 to 1): number of solutions (log scale, 1 to 100) and topological distance (0 to 6). Series: exhaustive search, and GNJ with (Q, D) = (50, 0), (45, 5), (25, 25), (5, 45), (0, 50)]

Solution Cost (16 leaves)
[Figure: solution cost (log scale, 10^-8 to 10^-2) by method, in panels for K=1, K=20, and K=100; series: LS, ME]

Solution Diversity (16 leaves)
[Figure: topological distance (0 to 12) by method, in panels for K=1, K=20, and K=100; series: LS-max, ME-max, LS-ave, ME-ave]

GNJ Time Complexity

Generate candidates: O(K N^2)
Select candidates: O(K N^2 (lg K + lg N))
Overall: O(K · N^3 · (lg K + lg N))

GNJ run times (seconds)
K       20     50     100    200    500
N=8     0.08   0.2    0.5    1.1    3.1
N=16    0.8    2.1    4.4    8.8    24.2
N=32    9.8    25.1   52.1   103.7  262.7
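As a rough check of the N^3 term, the reported run times can be compared each time N doubles. The snippet below only divides the run-time figures given on this slide; pure cubic growth would give a factor of 8 per doubling, and the logarithmic term would push that somewhat higher. The helper name is hypothetical.

```python
# Run times in seconds, indexed by N; columns correspond to K = 20, 50, 100, 200, 500.
run_times = {
    8: [0.08, 0.2, 0.5, 1.1, 3.1],
    16: [0.8, 2.1, 4.4, 8.8, 24.2],
    32: [9.8, 25.1, 52.1, 103.7, 262.7],
}


def doubling_ratios(small_n: int, large_n: int) -> list:
    """Ratio of run times between two values of N, one ratio per value of K."""
    return [round(b / a, 2) for a, b in zip(run_times[small_n], run_times[large_n])]


ratios_8_to_16 = doubling_ratios(8, 16)
ratios_16_to_32 = doubling_ratios(16, 32)
print(ratios_8_to_16)
print(ratios_16_to_32)
```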

Summary

[Figure: overview diagram (Evolution)]

Identifying new genes using PCR
Detecting low-cost diverse trees

Refereed Publications 

Pearson, W. R., Robins, G., and Zhang, T., Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction, to appear in Journal of Molecular Biology and Evolution.
Pearson, W. R., Robins, G., Wrege, D. E., and Zhang, T., On the Primer Selection Problem for Polymerase Chain Reaction Experiments, Discrete and Applied Mathematics, Vol. 71, 1996, pp. 231-246.
Pearson, W. R., Robins, G., Wrege, D. E., and Zhang, T., A New Approach to Primer Selection in Polymerase Chain Reaction Experiments, Proc. International Conference on Intelligent Systems for Molecular Biology, Cambridge, England, July, 1995, pp. 285-291.

Refereed Publications (cont.) 

Griffith, J., Robins, G., Salowe, J. S., and Zhang, T., Closing the Gap: Near-Optimal Steiner Trees in Polynomial Time, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 13, No. 11, November 1994, pp. 1351-1365.
Barrera, T., Griffith, J., McKee, S. A., Robins, G., and Zhang, T., Toward a Steiner Engine: Enhanced Serial and Parallel Implementations of the Iterated 1-Steiner MRST Algorithm, Proc. Great Lakes Symposium on VLSI, Kalamazoo, MI, March 1993, pp. 90-94.
Barrera, T., Griffith, J., Robins, G., and Zhang, T., Narrowing the Gap: Near-Optimal Steiner Trees in Polynomial Time, Proc. IEEE International ASIC Conference, Rochester, September 1993, pp. 87-90.

Generalization
Generate Partial Solutions: n-i & i
Evaluate Partial Solutions: Parsimony, Least-Squares
Select Partial Solutions: Prefer distant trees

Future Work 

Generalize GNJ
  other optimality criteria
  other solution space sampling
  alternative topological distance metrics

Examine solution space using GNJ
Identify pathological data sets

Generalized Neighbor-Joining
Input: a set of leaves S, the distance matrix over S
Output: a set of possible phylogenetic trees for S
1. T = {t}, where t is the star-tree over S
2. Repeat
     T* = all next-step trees derived from T
     T = select up to K trees from T*
   Until (all trees in T are fully resolved)
3. Output T
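The loop above can be sketched as a small beam search. Several choices here are assumptions made only to keep the example short: candidates are ranked by the neighbor-joining criterion of their latest join (the talk evaluates partial solutions with criteria such as least-squares), selection keeps the K lowest scores with no separate diversity quota, and a tree is identified by its set of splits so that duplicates are dropped. All names and the distance matrix are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Node = FrozenSet[str]                      # a node is the set of leaves below it
Dist = Dict[FrozenSet[Node], float]
State = Tuple[Tuple[Node, ...], Dist, FrozenSet[Node], float]  # nodes, distances, splits, score


def expand(state: State, leaves: Node) -> List[State]:
    """All next-step trees: join every pair of current nodes once."""
    nodes, d, splits, _ = state
    n = len(nodes)
    r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
    ref = min(leaves)
    out = []
    for i, j in combinations(nodes, 2):
        dij = d[frozenset((i, j))]
        q = (n - 2) * dij - r[i] - r[j]    # neighbor-joining criterion (assumed score)
        new = i | j
        rest = [k for k in nodes if k not in (i, j)]
        nd = {frozenset((a, b)): d[frozenset((a, b))] for a, b in combinations(rest, 2)}
        for k in rest:
            nd[frozenset((new, k))] = 0.5 * (d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        side = leaves - new if ref in new else new   # canonical side of the split
        out.append((tuple(rest) + (new,), nd, splits | {side}, q))
    return out


def gnj(leaves: List[str], dist: Dict[Tuple[str, str], float], k: int) -> List[FrozenSet[Node]]:
    """Return up to k fully resolved trees, each as a set of splits."""
    all_leaves = frozenset(leaves)
    nodes = tuple(frozenset([x]) for x in leaves)
    d = {frozenset((frozenset([a]), frozenset([b]))): float(v) for (a, b), v in dist.items()}
    states: List[State] = [(nodes, d, frozenset(), 0.0)]   # step 1: the star tree
    while len(states[0][0]) > 3:                           # step 2: until fully resolved
        best: Dict[FrozenSet[Node], State] = {}
        for s in states:
            for c in expand(s, all_leaves):
                if c[2] not in best or c[3] < best[c[2]][3]:
                    best[c[2]] = c                         # drop duplicate topologies
        states = sorted(best.values(), key=lambda c: c[3])[:k]
    return [s[2] for s in states]                          # step 3: output


# Made-up five-leaf distance matrix.
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8, ("b", "c"): 10,
       ("b", "d"): 10, ("b", "e"): 9, ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
single = gnj(["a", "b", "c", "d", "e"], raw, k=1)
several = gnj(["a", "b", "c", "d", "e"], raw, k=3)
print(len(single), len(several))
```

With K = 1 this sketch follows the same greedy path as plain neighbor-joining; larger K keeps alternative topologies alive until the end.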