
Life Sciences Society

**COMPUTATIONAL SYSTEMS BIOINFORMATICS**

CSB2006 CONFERENCE PROCEEDINGS

Stanford CA, 14-18 August 2006

Editors

**Peter Markstein and Ying Xu**


Imperial College Press

Life Sciences Society

COMPUTATIONAL SYSTEMS BIOINFORMATICS

SERIES ON ADVANCES IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY

Series Editors: Ying Xu (University of Georgia, USA) and Limsoon Wong (National University of Singapore, Singapore)

Associate Editors: Ruth Nussinov (NCI, USA), Rolf Apweiler (EBI, UK), Ed Wingender (BioBase, Germany), See-Kiong Ng (Inst for Infocomm Res, Singapore), Kenta Nakai (Univ of Tokyo, Japan), Mark Ragan (Univ of Queensland, Australia)

ISSN: 1751-6404

Published

Vol. 1: Proceedings of the 3rd Asia-Pacific Bioinformatics Conference
Eds: Yi-Ping Phoebe Chen and Limsoon Wong

Vol. 2: Information Processing and Living Systems
Eds: Vladimir B. Bajic and Tan Tin Wee

Vol. 3: Proceedings of the 4th Asia-Pacific Bioinformatics Conference
Eds: Tao Jiang, Ueng-Cheng Yang, Yi-Ping Phoebe Chen and Limsoon Wong

Vol. 4: Computational Systems Bioinformatics
Eds: Peter Markstein and Ying Xu


Published by
Imperial College Press
57 Shelton Street, Covent Garden
London WC2H 9HE

Distributed by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Series on Advances in Bioinformatics and Computational Biology, Vol. 4
COMPUTATIONAL SYSTEMS BIOINFORMATICS
Proceedings of the Conference CSB 2006

Copyright © 2006 by Imperial College Press

All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN 1-86094-700-X

Printed in Singapore by B & JO Enterprise

**Life Sciences Society**

THANK YOU

LSS Corporate Members and CSB2006 Platinum Sponsors!

The Life Sciences Society, the LSS Directors, together with the CSB2006 program committee and conference organizers, are extremely grateful to the Hewlett-Packard Company and Microsoft Research for their LSS Corporate Membership and for their Platinum Sponsorship of the Fifth Annual Computational Systems Bioinformatics Conference, CSB2006, at Stanford University, California, August 14-18, 2006.



COMMITTEES

Steering Committee

Phil Bourne - University of California, San Diego
Eric Davidson - California Institute of Technology
Steven Salzberg - The Institute for Genomic Research
John Wooley - University of California, San Diego, San Diego Supercomputer Center

Organizing Committee

Russ Altman - Stanford University, Faculty Sponsor (CSB2005)
Serafim Batzoglou - Stanford University, Faculty Sponsor (CSB2002-CSB2004)
Pat Blauvelt - Communications
Ed Buckingham - Local Arrangements Chair
Kass Goldfein - Finance Consultant
Karen Hauge - Local Arrangements - Food
VK Holtzendorf - Sponsorship
Robert Lashley - Sun Microsystems Inc, Co-Chair
Steve Madden - Agilent Technologies
Alexia Marcous - CEI Systems Inc, Sponsorship
Vicky Markstein - Life Sciences Society, Co-Chair, LSS President
Yogi Patel - Stanford University, Communications
Gene Ren - Finance Chair
Jean Tsukamoto - Graphics Design
Bill Wang - Sun Microsystems Inc, Registration Chair
Peggy Yao - Stanford University, Sponsorship
Dan Zuras - Group 70, Recorder

Program Committee

Tatsuya Akutsu - Kyoto University
Vineet Bafna - University of California, San Diego
Serafim Batzoglou - Stanford University
Chris Bystroff - Rensselaer Polytechnic Institute
Jake Chen - Indiana University
Amar Das - Stanford University
David Dixon - University of Alabama
Terry Gaasterland - University of California, San Diego
Robert Giegerich - Universitat Bielefeld
Eran Halperin - University of California Berkeley
Wolfgang R. Hess - University of Freiburg

Ivo Hofacker - University of Vienna
Wen-Lian Hsu - Academia Sinica
Daniel Huson - Tubingen University
Tao Jiang - University of California Riverside
Sun-Yuan Kung - Princeton University
Dong Yup Lee - Singapore
Cheng Li - Harvard School of Public Health
Jie Liang - University of Illinois at Chicago
Ann Loraine - University of Alabama
Bin Ma - University of Western Ontario
Peter Markstein - Hewlett-Packard Co., Co-chair
Satoru Miyano - University of Tokyo
Sean Mooney - Indiana University
Ruth Nussinov - National Cancer Institute
Mihai Pop - University of Maryland
Isidore Rigoutsos - IBM TJ Watson Research Center
Marie-France Sagot - Universite Claude Bernard
Mona Singh - Princeton University
Victor Solovyev - Royal Holloway, University of London
Chao Tang - University of California at San Francisco
Olga Troyanskaya - Princeton University
Limsoon Wong - Institute for Infocomm Research
Ying Xu - University of Georgia, Co-chair

Assistants to the Program Co-Chairs

Misty Hice - Hewlett-Packard Labs
Joan Yantko - University of Georgia

Tutorial Committee

Carol Cain - Agency for Healthcare Research and Quality, US Department of Health and Human Services
Betty Cheng - Stanford University, Chair

Demonstrations Committee

AJ Chen - Five Prime Therapeutics, Inc., Chair
Rong Chen - Stanford University Biomedical Informatics Training Program

Poster Committee

Dick Carter - Hewlett-Packard Labs
Robert Marinelli - Stanford University
Nigam Shah - Stanford University, Chair
Al Shpuntoff
Kathleen Sullivan
Ann Terka

Workshop Committee

Will Bridewell - Stanford University, Chair

Referees

Larisa Adamian, Tatsuya Akutsu, Doi Atsushi, Vineet Bafna, Purushotham Bangalore, Serafim Batzoglou, Sebastian Boecker, Chris Bystroff, Jake Chen, Shihyen Chen, Zhong Chen, Amar Das, Eugene Davydov, Tobias Dezulian, David A. Dixon, Chuong B. Do, Kelsey Forsythe, Ana Teresa Freitas, Terry Gaasterland, Irene Gabashvili, Robert Giegerich, Samuel S. Gross, Juntao Guo, Eran Halperin, Wolfgang Hess, Ivo Hofacker, Daniel Huson, Wen-Lian Hsu, Seiya Imoto, Tao Jiang, Uri Keich, Gad Kimmel, Bonnie Kirkpatrick, S. Y. Kung, Vincent Lacroix, Dong Yup Lee, Xin Lei, Cheng Li, Guojun Li, Xiang Li, Jie Liang, Huiqing Liu, Jingyuan Liu, Nianjun Liu, Ann Loraine, Bin Ma, Man-Wai Mak, Fenglou Mao, Peter Markstein, Alice C. McHardy, Satoru Miyano, Sean Mooney, Jose Carlos Nacher, Rei-ichiro Nakamichi, Brian Naughton, Kay Nieselt, Ruth Nussinov, Victor Olman, Daniel Platt, Mihai Pop, Vibin Ramakrishnan, Isidore Rigoutsos, Marie-France Sagot, Nigam Shah, Baozhen Shan, Daniel Shriner, Mona Singh, Sagi Snir, Victor Solovyev, Andreas Sundquist, Ting-Yi Sung, Chao Tang, Eric Tannier, Olga Troyanskaya, Aristotelis Tsirigos, Adelinde Uhrmacher, Raj Vadigepalli, Gabriel Valiente, Limsoon Wong, Hongwei Wu, Lei Xin, Ying Xu, Rui Yamaguchi, Will York, Hiroshi Yoshida, Ryo Yoshida, Noah Zaitlen, Stanislav O. Zakharkin


PREFACE

The Life Sciences Society, LSS, was launched at the CSB2005 conference. Its goal is to pull together the power available from computer science, the engineering capability to design complex automated instruments, together with the weight of centuries of accumulated knowledge from the biosciences. The LSS directors, organizing committee and members have dedicated time and talent to make CSB2006 one of the premier life sciences conferences in the world.

Once again the Program Committee co-chaired by Peter Markstein and Ying Xu has orchestrated a stellar selection of thirty-eight bioinformatics papers for the plenary sessions and for publication in the Proceedings. We also want to thank the CSB2006 authors who have trusted us with the results of their research. In return LSS has arranged to have the CSB2006 Proceedings distributed to libraries as a volume in the "Advances in Bioinformatics and Computational Biology" book series. CSB proceedings are indexed in Medline; Ann Loraine's work with PubMed has been instrumental in getting CSB proceedings indexed in Medline.

Selection of the ten tutorial classes was conducted by Betty Cheng, Tutorial Chair. The selection of the best posters was done under the supervision of Nigam Shah, Poster Chair, and of the seven workshops by Will Bridewell, Workshop Chair. Robert Lashley, the general conference Co-Chair for CSB2006, has done a phenomenal job in his first year with LSS. Ed Buckingham as Local Arrangements Chair continues to provide for the 4th continuous year outstanding professional leadership for CSB. Kirindi Choi is again Chair of Volunteers. Bill Wang is Registration Chair. Pat Blauvelt is LSS membership Chair. Beside the huge volunteer effort for CSB it is important that this conference be properly financed, and Gene Ren is Finance Chair. A very big thank you to John Wooley, CSB steering committee member, who was there to help whenever needed. Together with the above committee members, all CSB committee members deserve a special thank you. This has been an incredibly dedicated CSB organizing committee, a volunteer organization par excellence!

LSS and CSB are thankful for the continuous and generous support from Hewlett Packard and from Microsoft Research. If you believe that Sharing Matters, you are invited to join our drive for successful knowledge transfer and persuade a colleague to join LSS. Thank you for participating in CSB2006.

Vicky Markstein
President, Life Sciences Society


CONTENTS

Committees vii
Referees ix
Preface xi

Keynote Addresses

Exploring the Ocean's Microbes: Sequencing the Seven Seas
  Marvin E. Frazier et al. 1

Don't Know Much About Philosophy: The Confusion Over Bio-Ontologies
  Mark A. Musen 3

Invited Talks

Biomedical Informatics Research Network (BIRN): Building a National Collaboratory for BioMedical and Brain Research
  Mark H. Ellisman 5

Protein Network Comparative Genomics
  Trey Ideker 7

Systems Biology in Two Dimensions: Understanding and Engineering Membranes as Dynamical Systems
  Erik Jakobsson 9

Bioinformatics at Microsoft Research
  Simon Mercer 11

Movie Crunching in Biological Dynamic Imaging
  Jean-Christophe Olivo-Marin 13

Engineering Nucleic Acid-Based Molecular Sensors for Probing and Programming Cellular Systems
  Christina D. Smolke 15

Reactome: A Knowledgebase of Biological Pathways
  Lincoln Stein et al. 17

Structural Bioinformatics

Effective Optimization Algorithms for Fragment-Assembly Based Protein Structure Prediction
  Kevin W. DeRonne and George Karypis 19

Transmembrane Helix and Topology Prediction Using Hierarchical SVM Classifiers and an Alternating Geometric Scoring Function
  Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung and Wen-Lian Hsu 31

Protein Fold Recognition Using the Gradient Boost Algorithm
  Feng Mao, Jinbo Xu, Libo Yu and Dale Schuurmans 43

A Graph-Based Automated NMR Backbone Resonance Sequential Assignment
  Xiang Wan and Guohui Lin 55

A Data-Driven, Systematic Search Algorithm for Structure Determination of Denatured or Disordered Proteins
  Lincong Wang and Bruce Randall Donald 67

Multiple Structure Alignment by Optimal RMSD Implies that the Average Structure is a Consensus
  Xueyi Wang and Jack Snoeyink 79

Identification of α-Helices from Low Resolution Protein Density Maps
  Alessandro Dal Palu, Enrico Pontelli, Jing He and Yonggang Lu 89

Efficient Annotation of Non-Coding RNA Structures Including Pseudoknots via Automated Filters
  Chunmei Liu, Yinglei Song, Russell L. Malmberg and Liming Cai 99

Thermodynamic Matchers: Strengthening the Significance of RNA Folding Energies
  Thomas Hochsmann, Matthias Hochsmann and Robert Giegerich 111

Microarray Data Analysis and Applications

PEM: A General Statistical Approach for Identifying Differentially Expressed Genes in Time-Course cDNA Microarray Experiment without Replicate
  Xu Han, Wing-Kin Sung and Lin Feng 123

Efficient Generalized Matrix Approximations for Biomarker Discovery and Visualization in Gene Expression Data
  Wenyuan Li, Yanxiong Peng, Hung-Chung Huang and Ying Liu 133

Computational Genomics and Genetics

Efficient Computation of Minimum Recombination with Genotypes (Not Haplotypes)
  Yufeng Wu and Dan Gusfield 145

Sorting Genomes by Translocations and Deletions
  Xingqin Qi, Guojun Li, Shuguang Li and Ying Xu 157

Turning Repeats to Advantage: Scaffolding Genomic Contigs Using LTR Retrotransposons
  Ananth Kalyanaraman, Srinivas Aluru and Patrick S. Schnable 167

Whole Genome Composition Distance for HIV-1 Genotyping
  Xiaomeng Wu, Randy Goebel, Xiu-Feng Wan and Guohui Lin 179

Efficient Recursive Linking Algorithm for Computing the Likelihood of an Order of a Large Number of Genetic Markers
  S. Tewari, M. Bhandarkar and J. Arnold 191

Optimal Imperfect Phylogeny Reconstruction and Haplotyping (IPPH)
  Srinath Sridhar, Guy E. Blelloch, R. Ravi and Russell Schwartz 199

Toward an Algebraic Understanding of Haplotype Inference by Pure Parsimony
  Daniel G. Brown and Ian M. Harrower 211

Global Correlation Analysis Between Redundant Probe Sets Using a Large Collection of Arabidopsis ATH1 Expression Profiling Data
  Xiangqin Cui and Ann Loraine 223

Motif Sequence Identification

Distance-Based Identification of Structure Motifs in Proteins Using Constrained Frequent Subgraph Mining
  Jun Huan, Deepak Bandyopadhyay, Jan Prins, Jack Snoeyink, Alexander Tropsha and Wei Wang 227

An Improved Gibbs Sampling Method for Motif Discovery via Sequence Weighting
  Xin Chen and Tao Jiang 239

Detection of Cleavage Sites for HIV-1 Protease in Native Proteins
  Liwen You 249

A Methodology for Motif Discovery Employing Iterated Cluster Re-Assignment
  Osman Abul, Finn Drablos and Geir Kjetil Sandve 257

Biological Pathways and Systems

Identifying Biological Pathways via Phase Decomposition and Profile Extraction
  Yi Zhang and Zhidong Deng 269

Expectation-Maximization Algorithms for Fuzzy Assignment of Genes to Cellular Pathways
  Liviu Popescu and Golan Yona 281

Classification of Drosophila Embryonic Developmental Stage Range Based on Gene Expression Pattern Images
  Jieping Ye, Jianhui Chen, Qi Li and Sudhir Kumar 293

Evolution versus "Intelligent Design": Comparing the Topology of Protein-Protein Interaction Networks to the Internet
  Qiaofeng Yang, Georgos Siganos, Michalis Faloutsos and Stefano Lonardi 299

Protein Functions and Computational Proteomics

Cavity-Aware Motifs Reduce False Positives in Protein Function Prediction
  Brian Y. Chen, Drew H. Bryant, Amanda E. Cruess, Viacheslav Y. Fofanov, David M. Kristensen, Marek Kimmel, Olivier Lichtarge and Lydia E. Kavraki

Protein Subcellular Localization Prediction Based on Compartment-Specific Biological Features
  Chia-Yu Su, Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung and Wen-Lian Hsu

Predicting the Binding Affinity of MHC Class II Peptides
  Fatih Altiparmak, Altuna Akalin and Hakan Ferhatosmanoglu

Codon-Based Detection of Positive Selection Can Be Biased by Heterogeneous Distribution of Polar Amino Acids Along Protein Sequences
  Xuhua Xia and Sudhir Kumar

Bayesian Data Integration: A Functional Perspective
  Curtis Huttenhower and Olga G. Troyanskaya

An Iterative Algorithm to Quantify the Factors Influencing Peptide Fragmentation for MS/MS Spectra
  Chungong Yu, Yu Lin, Shiwei Sun, Jinjin Cai, Jingfen Zhang, Zhuo Zhang, Runsheng Chen and Dongbo Bu

Complexity and Scoring Function of MS/MS Peptide De Novo Sequencing
  Changjiang Xu and Bin Ma

Biomedical Applications

Expectation-Maximization Method for Reconstructing Tumor Phylogenies from Single-Cell Data
  Gregory Pennington, Charles A. Smith, Stanley Shackney and Russell Schwartz

Simulating In Vitro Epithelial Morphogenesis in Multiple Environments
  Mark R. Grant, Sean H. J. Kim and C. Anthony Hunt

A Combined Data Mining Approach for Infrequent Events: Analyzing HIV Mutation Changes Based on Treatment History
  Ray S. Lin, Soo-Yon Rhee, Robert W. Shafer and Amar K. Das

A Systems Biology Case Study of Ovarian Cancer Drug Resistance
  Jake Y. Chen, Changyu Shen, Zhong Yan, Dawn P. G. Brown and Mu Wang

Author Index

EXPLORING THE OCEAN'S MICROBES: SEQUENCING THE SEVEN SEAS

Marvin E. Frazier1, Douglas B. Rusch1, Aaron L. Halpern1, Karla B. Heidelberg1, Granger Sutton1, Shannon Williamson1, Shibu Yooseph1, Karin Remington1, Karen Beeson1, Cyrus Foote1, Brooke A. Dill1, Charles H. Howard1, Terry Utterback1, Bao Tran1, Hamilton Smith1, Holly Baden-Tillson1, Clare Stewart1, Joyce Thorpe1, Jason Freemen1, Cindy Pfannkoch1, Joseph E. Venter1, Jeff Hoffman1, Yu-Hui Rogers1, Robert Friedman1, Robert Strausberg1, Dongying Wu2, Jonathan A. Eisen2, John Heidelberg2, Vineet Bafna3, Shaojie Zhang3, Luisa Falcon4, Valeria Souza4, Luis E. Eguiarte4, German Bonilla4, David M. Karl5, Ken Nealson6, Shubha Sathyendranath7, Trevor Platt7, Eldredge Bermingham8, Victor Gallardo9, Giselle Tamayo10, J. Craig Venter1

1 J. Craig Venter Institute, Rockville, Maryland, United States of America
2 The Institute for Genomic Research, Rockville, Maryland, United States of America
3 Department of Computer Science, University of California San Diego, United States of America
4 Instituto de Ecologia, Dept. Ecologia Evolutiva, National Autonomous University of Mexico, Mexico City, 04510 Distrito Federal, Mexico
5 University of Hawaii, Honolulu, United States of America
6 Dept. of Earth Sciences, University of Southern California, Los Angeles, California, United States of America
7 Dalhousie University, Halifax, Nova Scotia, Canada
8 Smithsonian Tropical Research Institute, Balboa, Ancon, Republic of Panama
9 University of Concepcion, Concepcion, Chile
10 University of Costa Rica, San Pedro, San Jose, Republic of Costa Rica

The J. Craig Venter Institute's (JCVI) environmental genomics group has collected ocean and soil samples from around the world. We have begun shotgun sequencing of microbial samples from more than 100 open-ocean and coastal sites across the Pacific, Indian and Atlantic Oceans. The goals of this Global Ocean Survey are to better understand microbial biodiversity, to discover new genes of ecological importance, including those involved in carbon cycling, to discover new genes that may be useful for biological energy production, and to establish a freely shared, global environmental genomics database that can be used by scientists around the world.

Using newly developed metagenomic methods, we are able to examine not only the community of microorganisms, but the community of genes that enable them to capture energy from the sun, take up organic carbon, remove carbon dioxide from the air, and cycle nitrogen in its various forms through the ecosystem. The goal of this new science, however, is not to merely catalog sequences, genes and gene families, and species for their own sake. Each individual sequence is no longer just a piece of a genome. It is a piece of an entire biological community. Environmental metagenomics examines the interplay of perhaps thousands of species present and functioning at a point in space and time. We are attempting to use these new data to better understand the functioning of natural ecosystems.

To date, we have discovered many thousands of new microbial species and millions of new genes, with no apparent slowing of the rate of discovery. These data are being augmented with deep sequencing of 16S and 18S rRNA and the draft sequencing of ~150 cultured marine microbial species. This is a resource that can be mined by microbial ecologists worldwide to better understand biogeochemical cycling. Moreover, within this data set is a huge diversity of previously unknown, energy-related genes that may be useful for developing new methods of biological energy production. This data will be of great value for the study of protein function and protein evolution. The JCVI is also developing and refining bioinformatics tools to assemble, annotate, and analyze large-scale metagenomic data, along with the appropriate database infrastructure to enable directed analyses.

We acknowledge the DOE, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel and the J. Craig Venter Science Foundation for funding to undertake this study.

We are also indebted to a large group of individuals and groups for facilitating our sampling and analysis. Martin Wilkalski (Princeton) and Rod Mackie (University of Illinois) provided advice for target regions in the Galapagos to sample. We thank Matthew Charette (Woods Hole Oceanographic Institute) and Dave Karl (University of Hawaii) for nutrient analysis work and advice. John Glass (JCVI) provided valuable assistance in methods development. Tyler Osgood (JCVI) facilitated many of the vessel related technical needs. We thank the Governments of Canada, Mexico, Honduras, Costa Rica, Panama, and Ecuador and French Polynesia/France for facilitating sampling activities. All sequencing data collected from waters of the above named countries remain part of the genetic patrimony of the country from which they were obtained.

Canada's Bedford Institute of Oceanography provided a vessel and logistical support for sampling in Bedford basin. The Universidad Nacional Autonoma de Mexico (UNAM) facilitated permitting and logistical arrangements and identified a team of scientists for collaboration. Representatives from Costa Rica's Organization for Tropical Studies (Jorge Arturo Jimenez and Francisco Campos Rivera), the University of Costa Rica (Jorge Cortes) and the National Biodiversity Institute (INBio) provided assistance with planning, logistical arrangements and scientific analysis. The scientists and staff of the Smithsonian Tropical Research Institute (STRI) hosted our visit in Panama. Our visit to the Galapagos Islands was facilitated by assistance from the Galapagos National Park Service Director, Washington Tapia, the Charles Darwin Research Institute, especially Howard Snell and Eva Danulat, Hector Chauz Campo (Institute of Oceanography of the Ecuador Navy) and a National Park Representative. We especially thank Greg Estes (guide) and Simon Ricardo Villemar Tigrero for field assistance while in the Galapagos Islands. The U.S. Department of State facilitated Governmental communications on multiple occasions. We gratefully acknowledge Dr. Michael Sauri, who oversaw medical related issues for the crew of the Sorcerer II. We also acknowledge the help of Michael Ferrari and Jennifer Clark for assistance in acquiring the satellite images. Finally, special thanks also to the captain and crew of the S/V Sorcerer II.

DON'T KNOW MUCH ABOUT PHILOSOPHY: THE CONFUSION OVER BIO-ONTOLOGIES

Mark A. Musen, M.D., Ph.D.
The National Center for Biomedical Ontology
Stanford University
251 Campus Drive, X-215
Stanford, CA 94305 USA

Abstract: For the past decade, there has been increasing interest in ontologies in the biomedical community. As interest has peaked, so has the confusion. Biomedical scientists use ontologies for multiple purposes, from annotation of experimental data, to data integration, to natural-language processing, to construction of decision-support systems. Each of these purposes imposes different requirements concerning which entities ontologies should encode and how those entities should be encoded. Although the biomedical informatics community remains excited about ontologies, exactly what an ontology is and how it should be represented within a computer are points about which, with considerable questioning, we can see little uniformity of opinion. The confusion stems from the multiple knowledge-representation languages used to encode ontologies (e.g., frame-based systems, Semantic Web standards such as RDF(S) and OWL, and languages created specifically by the bioinformatics community, such as OBO), where each language has explicit strengths and weaknesses. The confusion will persist until we can understand that different developers have very different requirements for ontologies, and therefore those developers will make very different assumptions about how ontologies should be created and structured. We will review those assumptions and the corresponding implications for ontology construction.

Our National Center for Biomedical Ontology (http://bioontology.org) is one of the seven national centers for biomedical computing formed under the NIH Roadmap. The Center takes a broad perspective on what ontologies are and how they should be developed and put to use. The Center recognizes the importance of ontologies for use in a wide range of biomedical applications, and is developing new technology to make all relevant ontologies widely accessible, searchable, alignable, and useable within software systems. Ultimately, the Center will support the publication of biomedical ontologies online, much as we publish scientific knowledge in print media. The advent of biomedical knowledge that is widely available in machine-processable form will alter the way that we think about science and perform scientific experiments. The biomedical community soon will enter an era in which scientific knowledge will become more accessible, more useable, and more precise, and in which new methods will be needed to support a radically different kind of scientific publishing. Our goal, simply put, is to help to eliminate much of the current confusion.


BIOMEDICAL INFORMATICS RESEARCH NETWORK (BIRN): BUILDING A NATIONAL COLLABORATORY FOR BIOMEDICAL AND BRAIN RESEARCH

Mark H. Ellisman, Ph.D.
Professor, UCSD Department of Neurosciences, and Director of the BIRN Coordinating Center (www.nbirn.net)
The Center for Research on Biological Systems (CRBS) at UCSD

The Biomedical Informatics Research Network (BIRN) is an initiative within the National Institutes of Health (US) that fosters large-scale collaborations in biomedical science by utilizing the capabilities of the emerging national cyberinfrastructure (high-speed networks, distributed high-performance computing and the necessary software and data integration capabilities). Currently, the BIRN involves a consortium of 20 universities and 30 research groups participating in three test bed projects centered around brain imaging of human neuropsychiatric disease and associated animal models. These groups are working on large scale, cross-institutional imaging studies on Alzheimer's disease, depression, and schizophrenia using structural and functional magnetic resonance imaging (MRI). Others are studying animal models relevant to multiple sclerosis, attention deficit disorder, and Parkinson's disease through MRI, whole brain histology, and high-resolution light and electron microscopy. These test bed projects present practical and immediate requirements for performing large-scale bioinformatics studies and provide a multitude of usage cases for distributed computation and the handling of heterogeneous data. The promise of the BIRN is the ability to test new hypotheses through the analysis of larger patient populations and unique multi-resolution views of animal models through data sharing and the integration of site independent resources for collaborative data refinement. BIRN intertwines concurrent revolutions occurring in biomedicine and information technology.

The BIRN Coordinating Center (BIRN-CC) is orchestrating the development and deployment of key infrastructure components for immediate and long-range support of the scientific goals pursued by these test bed scientists. These components include high bandwidth inter-institutional connectivity via Internet2, a uniformly consistent security model, grid-based file management and computational services, software and techniques to federate data and databases, data caching and replication techniques to improve performance and resiliency, visualization and analysis environments, and shared processing. As a core component of the BIRN infrastructure, Internet2 provides a solid foundation for the future expansion of the BIRN as well as the stable high performance network required by researchers in a national collaboratory.

Researchers within BIRN are also benefiting directly from the connectivity to high performance computing resources, such as TeraGrid. Currently researchers are performing advanced shape analyses of anatomical structures to gain a better understanding of diseases and disorders. These analyses run on TeraGrid have produced over 10TB of resultant data which were then transferred back to the BIRN Data Grid. As the requirements of the biomedical community become better specified through projects like the BIRN, the national cyberinfrastructure being assembled to enable large-scale science projects, mostly supported by other government agencies (e.g. NSF, DOE, NASA, DARPA) and industry, will also evolve. As these technologies mature, the BIRN is uniquely situated to serve as a major conduit between the biomedical research community of NIH-sponsored programs and the information technology development programs.


and Ideker. S. http://www. 5. Nature Biotechnology 23(5):561-566 (2005). 4. and the Packard Foundation. Unilever.. we are constructing network models to explain the physiological response of yeast to DNA damaging agents. R. and Ideker. T. S. McCuine.org 6. M. Validation and refinement of gene regulatory pathways on a network of physical interactions. and Ideker. Suthram. T. Genome Biology 6(7): R62 (2005). M. Uetz. The Plasmodium network diverges from those of other species. Jaakkola. PLC. Kelley. T. it will be possible to organize the network into modules representing the repertoire of distinct functional processes in the cell. Sittler. Karp. Kelley. T.cytoscape. C.org Acknowledgements We gratefully acknowledge funding through NIH/NIGMS grant GM070743-01. Proc Natl Acad Sci USA. Sharan. 2. P. T.2005). 8:102(6): 1974-79 (2005). Workman. R. and Ideker. http://www... Sittler...C. S.. R. by comparing the molecular interaction network with other biological data sets. R. including those to identify: (1) Protein interaction networks that are conserved across species (2) Networks in control of gene expression changes (3) Networks correlating with systematic phenotypes and synthetic lethals Using these computational modeling and query tools. methods are needed for constructing cellular pathway models using interaction data as the central framework. S. .pathblast.... NSF grant CCF0425926. T.. McCuine..7 PROTEIN NETWORK COMPARATIVE GENOMICS Trey Ideker University of California San Diego With the appearance of large networks of proteinprotein and protein-DNA interactions as a new type of biological measurement. Suthram. T. T. 3. Nature 437: (November 3. C . Three distinct types of network comparisons will be discussed. Relevant articles and links 1. H. Mak. Kuhn. The key idea is that... Yeang.. Systematic interpretation of genetic interactions using protein networks. Conserved patterns of protein interaction in multiple species.H.


SYSTEMS BIOLOGY IN TWO DIMENSIONS: UNDERSTANDING AND ENGINEERING MEMBRANES AS DYNAMICAL SYSTEMS

Erik Jakobsson
University of Illinois at Urbana-Champaign
Director, National Center for the Design of Biomimetic Nanoconductors

Theme: The theme of our NIH Nanomedicine Development Center is the design of biomimetic nanoconductors and devices utilizing nanoconductors. The ultimate goal is to understand how biomimetic nanoscale design can be utilized in devices to achieve the functions that membrane systems accomplish in biological systems: a) electrical and electrochemical signaling; b) generation of osmotic pressures and flows; c) generation of electrical power; and d) energy transduction.

Broad Goals: Our Center's broad goals are:

1. To advance theoretical, computational, and experimental methods for understanding and quantitatively characterizing biomembrane and other nanoscale transport processes. The model theoretical systems are native and mutant biological channels and other ion transport proteins, and synthetic channels. The model experimental systems are engineered protein channels and synthetic channels in isolation, in self-assembled membranes supported on nanoporous silicon scaffolds, and in heterogenous membranes containing channels and transporters.

2. To disseminate enhanced methods and tools for: theory and computation related to transport; experimental characterization of membrane function; theoretical and experimental characterization of nanoscale fluid flow; experimental functional characterization; and nanotransport aspects of device design.

3. To use our knowledge and technical capabilities to design useful biomimetic devices and technologies that utilize membrane and nanopore transport.

4. To interact synergistically with other workers in the areas of membrane processes, membrane structure, transport processes, biomolecular theory and computation, the study of membranes as systems, biomolecular design, and nanoscale device design, through interactive teams doing collaborative macromolecular design and synthesis and computation/theory.

Initial Design Target: A biocompatible biomimetic battery (the "biobattery") to power an implantable artificial retina, extendable to other neural prostheses. The potential advantages of the biomimetic battery are lack of toxic materials and the ability to be regenerated by the body's metabolism. Broad design principles are suggested by the electrocyte of the electric eel, which generates large voltages and current densities by stacking large areas of electrically excitable membranes in series.

Major Emergent Reality Constraints: The development and maintenance of the electrocyte in the eel are guided by elaborate and adaptive pathways under genetic control, which we cannot realistically hope to include in a device.

Our approach will include replacing the developmental machinery with a nanoporous silicon scaffold, on which membranes will self-assemble. The lack of maintenance machinery will be compensated for by making the functional components of the biobattery from more durable, less degradable molecules.

Initial Specific Activities:

1. Making a detailed dynamical model of the eel electrocyte, including electrical and osmotic phenomena and incorporating specific geometry.

2. Search for more durable functional analogues of the membranes and transporters of the electrocyte. Approaches being pursued include designing beta-barrel functional analogues for helix-bundle proteins, mining extremophile genomes for appropriate transporters, and design of durable synthetic polymer membranes that can incorporate transport molecules by self-assembly.

3. Fabrication of nanoporous silicon supports for heterogenous membranes in complex geometries, with chemically functionalized silicon pores on which membranes will self-assemble.

4. Initial design of a biomimetic battery that is potentially capable of fabrication/self-assembly.

These approaches combine information technology, computer modeling, and simulation with experiment.

Organizational Principles of Center: Our core team is supported by the NIH Roadmap grant, but we welcome collaborations with all workers with relevant technologies, skills, and aligned interests.

BIOINFORMATICS AT MICROSOFT RESEARCH

Simon Mercer
Microsoft Research
One Microsoft Way
Redmond, WA 98052, USA

The advancement of the life sciences in the last twenty years has been in part the story of increasing integration of computing with scientific research, a trend that is set to transform the practice of science in our lifetimes. In addition to supporting academic research in the life sciences, Microsoft Research is a source of tools and technologies well suited to the needs of basic scientific research: current projects include new languages to simplify data extraction and processing, tools for scientific workflows, and biological visualization. Computer science researchers also bring new perspectives to problems in biology, such as the use of schema-matching techniques in merging ontologies, machine learning in vaccine design, and process algebra in understanding metabolic pathways. Conversely, biological systems are a rich source of ideas that will transform the future of computing.


MOVIE CRUNCHING IN BIOLOGICAL DYNAMIC IMAGING

Jean-Christophe Olivo-Marin
Quantitative Image Analysis Unit
Institut Pasteur, CNRS URA 2582
25 rue du Dr Roux, 75724 Paris, France

Recent advances in biological imaging technologies have enabled the observation of living cells with high resolution during extended periods of time, and are impacting biological research in such different areas as high-throughput image-based drug screening, cellular therapies, cell and developmental biology, and gene expression studies. However, understanding the wealth of data generated by multidimensional microscopy depends critically on decoding the visual information contained therein and on the availability of the tools to do so. I will present methods we have recently developed to perform the computational analysis of image sequences coming from multidimensional microscopy, with particular emphasis on tracking and motion analysis for 3D+t image sequences using active contours and multiple particle tracking.

1. INTRODUCTION

The advent of multidimensional microscopy (real-time optical sectioning and confocal, FRET, FRAP, TIRF, FLIM) has enabled biologists to visualize cells, tissues and organs in their intrinsic 3D and 3D+t geometry, in contrast to the limited 2D representations that were available until recently. These new technologies are already impacting biological research, as they are putting at hand the imaging of the inner workings of living cells in their natural context. Expectations are high for breakthroughs in areas such as cell response and motility modification by drugs, assessment of pathogen routing into the cell, spatial-temporal organization of the cell and its changes with time or under infection, interaction between proteins, spatio-temporal regulation of gene expression, pathogen motility and interaction with host cells, control of targeted sequence incorporation into the chromatin for cell therapy, and sanitary control of pathogen evolution, to name but a few. Deciphering the complex machinery of cell functions and dysfunction necessitates large-scale multidimensional image-based assays to cover the wide range of highly variable and intricate properties of biological material. Innovative automatic techniques to extract quantitative data from image sequences are therefore of major interest. Within the wide interdisciplinary field of biological imaging, I will concentrate on work developed in our laboratory on two aspects central to cell biology: particle tracking and cell shape and motility analysis, which have many applications in the important field of infectious diseases.

2. PARTICLE TRACKING

Molecular dynamics in living cells is a central topic in cell biology, as it opens the possibility to study molecular diffusion with submicron resolution. For example, it is possible, after labelling with specific fluorochromes, to record the movement of organelles like phagosomes or endosomes in the cell,6 the movement of different mutants of bacteria or parasites,2 or the positioning of telomeres in nuclei (Galy et al., 2000).3 I will describe the methods we have developed to perform the detection and the tracking of microscopic spots directly on four-dimensional (3D+t) image data.4,5 These methods are able to detect with high accuracy multiple biological objects moving in three-dimensional space.

They also incorporate the possibility to follow moving spots switching between different types of dynamics. Our methods decouple the detection and the tracking processes and are based on a two-step procedure: first, the objects are detected in the image stacks thanks to a procedure based on a three-dimensional wavelet transform; then the tracking is performed within a Bayesian framework where each object is represented by a state vector evolving according to biologically realistic dynamic models.

3. CELL TRACKING

Another important project of our laboratory is motivated by the problem of cell motility. The ability of cells to move and change their shape is important in many areas of biology, including cancer, development, infection and immunity.7 We have developed algorithms to automatically segment and track moving cells in dynamic 2D or 3D microscopy.1,8 For this purpose, we have adopted the framework of active contours and deformable models that is widely employed in the computer vision community. The segmentation proceeds by evolving the front according to evolution equations that minimize an energy functional (usually by gradient descent). This energy contains both data attachment terms and terms encoding prior information about the boundaries to be extracted, e.g. smoothness constraints. Tracking, i.e. linking segmented objects between time points, is simply achieved by initializing front evolutions using the segmentation result of the previous frame, under the assumption that inter-frame motions are modest. I will describe some of our work on adapting these methods to the needs of cellular imaging in biological research.

References

1. A. Dufour, V. Shinin, S. Tajbakhsh, N. Guillen, J.-C. Olivo-Marin and C. Zimmer. Segmenting and tracking fluorescent cells in dynamic 3-D microscopy with coupled active surfaces. IEEE Trans. Image Processing, vol. 14, no. 9, pp. 1396-1410, 2005.
2. F. Frischknecht, P. Baldacci, B. Martin, C. Zimmer, S. Thiberge, J.-C. Olivo-Marin, S. Shorte and R. Menard. Imaging movement of malaria parasites during transmission by Anopheles mosquitoes. Cell Microbiol., vol. 6, no. 7, pp. 687-94, 2004.
3. V. Galy, J.-C. Olivo-Marin, H. Scherthan, V. Doye, N. Rascalou and U. Nehrbass. Nuclear pore complexes in the organization of silent telomeric chromatin. Nature, vol. 403, pp. 108-112, 2000.
4. A. Genovesio, T. Liedl, V. Emiliani, W. Parak, M. Coppey-Moisan and J.-C. Olivo-Marin. Multiple particle tracking in 3D+t microscopy: method and application to the tracking of endocytozed Quantum Dots. IEEE Trans. Image Processing, vol. 15, no. 5, pp. 1062-1070, 2006.
5. A. Genovesio, B. Zhang and J.-C. Olivo-Marin. Tracking of multiple fluorescent biological objects in three dimensional video microscopy. IEEE International Conference on Image Processing ICIP 2003, Barcelona, Spain, pp. 1105-1108, September 2003.
6. C. Murphy, R. Saffrich, J.-C. Olivo-Marin, A. Giner, W. Ansorge, T. Fotsis and M. Zerial. Dual function of RhoD in vesicular movement and cell motility. Eur. Journal of Cell Biology, vol. 80, pp. 391-398, 2001.
7. C. Zimmer, E. Labruyere, V. Meas-Yedid, N. Guillen and J.-C. Olivo-Marin. Segmentation and tracking of migrating cells in videomicroscopy with parametric active contours: a tool for cell-based drug testing. IEEE Trans. Medical Imaging, vol. 21, pp. 1212-1221, 2002.
8. C. Zimmer and J.-C. Olivo-Marin. Coupled parametric active contours. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, pp. 1838-1842, 2005.

ENGINEERING NUCLEIC ACID-BASED MOLECULAR SENSORS FOR PROBING AND PROGRAMMING CELLULAR SYSTEMS

Professor Christina D. Smolke
California Institute of Technology, Department of Chemical Engineering

Information flow through cellular networks is responsible for regulating cellular function at both the single cell and multi-cellular systems levels. One of the key limitations to understanding dynamic fluctuations in intracellular biomolecule concentrations is the lack of enabling technologies that allow for user-specified probing and programming of these cellular events. I will discuss our work in developing the molecular design and cellular engineering strategies for the construction of tailor-made sensor platforms that can temporally and spatially monitor and regulate information flow through diverse cellular networks. The construction of sensor platforms based on allosteric regulation of non-coding RNA (ncRNA) activity will be presented, where molecular recognition of a ligand-binding event is coupled to a conformational change in the RNA molecule. This regulated conformational change may be linked to an appropriate readout signal by controlling a diverse set of ncRNA gene regulatory activities. Our research has demonstrated the modularity, design predictability, and specificity inherent in these molecules for cellular control. In addition, the flexibility of these sensor platforms enables these molecules to be incorporated into larger circuits based on molecular computation strategies to construct sensor sets that will perform higher-level signal processing toward complex systems analysis and cellular programming strategies. In particular, the application of these molecular sensors to the following downstream research areas will be discussed: metabolic engineering of microbial alkaloid synthesis and 'intelligent' therapeutic strategies.


REACTOME: A KNOWLEDGEBASE OF BIOLOGICAL PATHWAYS

Lincoln Stein, Peter D'Eustachio, Marc Gillespie, Gopal Gopinathrao, Lisa Matthews, Guanming Wu
Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA

Imre Vastrik, David Croft, Bernard de Bono, Bijay Jassal, Esther Schmidt, Ewan Birney
European Bioinformatics Institute, Hinxton, UK

Suzanna Lewis
Lawrence Berkeley National Laboratory, Berkeley, CA, USA

Reactome, located at http://www.reactome.org, is a curated, peer-reviewed resource of human biological processes. Given the genetic makeup of an organism, the complete set of possible reactions constitutes its reactome. The basic unit of the Reactome database is a reaction; reactions are then grouped into causal chains to form pathways. The Reactome data model allows us to represent many diverse processes in the human system, including the pathways of intermediary metabolism, regulatory pathways, and signal transduction, and high-level processes, such as the cell cycle. Reactome provides a qualitative framework, on which quantitative data can be superimposed. Tools have been developed to facilitate custom data entry and annotation by expert biologists, and to allow visualization and exploration of the finished dataset as an interactive process map. Although our primary curational domain is pathways from Homo sapiens, we regularly create electronic projections of human pathways onto other organisms via putative orthologs, thus making Reactome relevant to model organism research communities. The database is publicly available under open source terms, which allows both its content and its software infrastructure to be freely used and redistributed.


EFFECTIVE OPTIMIZATION ALGORITHMS FOR FRAGMENT-ASSEMBLY BASED PROTEIN STRUCTURE PREDICTION

Kevin W. DeRonne* and George Karypis

Department of Computer Science & Engineering, Digital Technology Center, Army HPC Research Center
University of Minnesota, Minneapolis, MN 55455
*Corresponding author. Email: {deronne, karypis}@cs.umn.edu

Despite recent developments in protein structure prediction, an accurate new fold prediction algorithm remains elusive. One of the challenges facing current techniques is the size and complexity of the space containing possible structures for a query sequence. Traditionally, to explore this space, fragment assembly approaches to new fold prediction have used stochastic optimization techniques. Here we examine deterministic algorithms for optimizing scoring functions in protein structure prediction. Two previously unused techniques are applied to the problem, called the Greedy algorithm and the Hill-climbing algorithm. The main difference between the two is that the latter implements a technique to overcome local minima. Experiments on a diverse set of 276 proteins show that the Hill-climbing algorithms consistently outperform existing approaches based on Simulated Annealing optimization (a traditional stochastic technique) in optimizing the root mean squared deviation (RMSD) between native and working structures.

1. INTRODUCTION

Reliably predicting protein structure from amino acid sequence remains a challenge in bioinformatics. The conditional existence of a known structural homolog to a query sequence commonly delineates a set of subproblems within the greater arena of protein structure prediction. For example, the biennial CASP competition (http://predictioncenter.org) breaks down structure prediction as follows. In homologous fold recognition the structure of the query sequence is similar to a known structure for some other sequence; however, these two sequences have only a low (though detectable) similarity. In analogous fold recognition there exists a known structure similar to the correct structure of the query, but the sequence of that structure has no detectable similarity to the query sequence. Still more challenging is the problem of predicting the structure of a query sequence lacking a known structural relative, which is called new fold (NF) prediction.

Within the context of the NF problem, knowledge-based methods have attracted increasing attention over the last decade. Approaches that assemble fragments of known structures into a candidate structure18,7,10 have consistently outperformed alternative methods, such as those based largely on explicit modeling of physical forces. Although the number of known structures continues to grow, many new sequences still lack a known homolog in the PDB2, which makes it harder to predict structures for these sequences.

Fragment assembly for a query protein begins with the selection of structural fragments based on sequence information. These fragments are then successively inserted into the query protein's structure, replacing the coordinates of the query with those of the fragment. The quality of this new structure is assessed by a scoring function. If the scoring function is a reliable measure of how close the working structure is to the native fold of the protein, then optimizing the function through fragment insertions will produce a good structure prediction. Thus, building a structure in this manner breaks down into three main components: a fragment selection technique, an optimizer for the scoring function, and the scoring function itself.

To optimize the scoring function, all the leading assembly-based approaches use an algorithm involving a stochastic search (e.g., Simulated Annealing18, genetic algorithms7, or conformational space annealing10). One potential drawback of such techniques is that they can require extensive parameter tuning before producing good solutions.

In this paper we wish to examine the relative performance of deterministic and stochastic techniques to optimize a scoring function. The Greedy approach examines all possible fragment insertions at a given point and chooses the best one available. The Hill-climbing algorithm follows a similar strategy but allows for moves that reduce the score locally, provided that they lead to a better global score. Both algorithms are deterministic and do not depend on a random element. The new algorithms presented below are inspired by techniques originally developed in the context of graph partitioning4.

Our experiments test these algorithms on a diverse set of 276 protein domains derived from SCOP 1.6914. The results of these experiments show that the Hill-climbing-based approaches are very effective in producing high-quality structures in a moderate amount of time, and that they generally outperform Simulated Annealing. On the average, Hill-climbing is able to produce structures that are 6% to 20% better (as measured by the root mean square deviation (RMSD) between the computed and the actual structure), and the relative advantage of the Hill-climbing-based approaches improves with the length of the proteins.

2. MATERIALS AND METHODS

2.1. Data

The performance of the optimization algorithms studied in this paper was evaluated using a set of proteins with known structure that was derived from SCOP 1.6914 as follows. Starting from the set of domains in SCOP, we first removed all membrane and cell surface proteins, and then used Astral's tools3 to construct a set of proteins with less than 25% sequence identity. This set was further reduced by keeping only the structures that were determined by X-ray crystallography, filtering out any proteins with a resolution greater than 2.5Å, and removing any proteins with a Cα-Cα distance greater than 3.8Å times their sequential separation (no bond lengths were modified to fit this constraint; proteins not satisfying it were simply removed from consideration). The above steps resulted in a set of 2817 proteins.

From this set, we selected a subset of 276 proteins (roughly 10%) to be used in evaluating the performance of the various optimization algorithms (i.e., a test set), whereas the remaining 2541 sequences were used as the database from whence to derive the structural fragments (i.e., a training set). The test sequences were selected to be diverse in length and secondary structure composition; their characteristics are summarized in Table 1. This dataset is available at http://www.cs.umn.edu/~deronne/supplement/optimize.

Table 1. Number of sequences at various length intervals and SCOP class.

                          Sequence Length
  SCOP Class     < 100    100-200    > 200    total
  alpha            23        40         6       69
  beta             23        27        18       69
  alpha/beta        4        26        39       69
  alpha+beta       15        36        17       69

2.2. Neighbor Lists

As the search space for fragment assembly is much too vast, fragment-based ab initio structure prediction approaches must reduce the number of possible structures that they consider. They accomplish this primarily by restricting the number of structural fragments that can be used to replace each k-mer of the query sequence. Several variables can affect the performance of optimization algorithms in this context: how long the fragments are, whether they should be of multiple sizes used at different stages18 or of all different sizes used together7, how many fragments per position are available to the optimizer, and other parameters specific to the optimizer can all influence the quality of the resulting structures. Taking the above into account, we varied fragment length and the number of fragments per position when comparing the performance of our optimization algorithms to that of a tuned Simulated Annealing approach. In evaluating the various optimization algorithms developed in this work, we followed a methodology for identifying these structural fragments that is similar in spirit to that used by the Rosetta18 system.
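The neighbor-list construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: `score` stands in for the profile-based k-mer similarity defined below, and the training set is abstracted to an iterable of candidate fragments.

```python
import heapq

def build_neighbor_lists(query_len, k, n, training_kmers, score):
    """For each k-mer start i in the query (0 <= i < query_len - k + 1),
    keep the n training fragments whose sequences score highest against
    the query k-mer under score(i, fragment)."""
    lists = []
    for i in range(query_len - k + 1):
        # nlargest keeps the n highest-scoring fragments for position i
        best = heapq.nlargest(n, training_kmers, key=lambda f: score(i, f))
        lists.append(best)
    return lists
```

In this sketch the list returned for position i plays the role of the neighbor list Li; any scoring callable can be plugged in.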

Consider a query sequence X of length l. For each position i, we identify a list (Li) of n structural fragments by comparing the query sequence against the sequences of the proteins in the training set. For fragments of length k, these comparisons involve the k-mer of X starting at position i (0 <= i < l - k + 1) and all k-mers in the training set. The n structural fragments are selected so that their corresponding sequences have the highest profile-based score with the query sequence's k-mer. Throughout the rest of this paper, we will refer to the list Li as the neighbor list of position i.

In our study we used neighbor lists containing fragments of a single length as well as neighbor lists containing fragments of different lengths. In the latter case we consider two different approaches to leveraging the varied length fragments. The first, referred to as scan, uses the fragment lengths in decreasing order. For example, if the neighbor lists contain structural fragments of lengths three, six, and nine, the algorithm starts by first optimizing the structure using only fragments of length nine, then fragments of length six, and finally fragments of length three. Each one of these optimization phases terminates when the algorithm has finished (i.e., reached a local optimum or performed a predetermined number of iterations), and the resulting structure becomes the input to the subsequent optimization phase. The second approach, referred to as pool, selects fragments from any available length and optimizes the structure once. Using any single fragment length in isolation, or using either scan or pool, will be referred to as a fragment selection scheme.

2.2.1. Sequence Profiles

The comparisons between the query and the training sequences take advantage of evolutionary information by utilizing PSI-BLAST1 generated sequence profiles. The profile of a sequence X of length l is represented by two l x 20 matrices. The rows of these matrices correspond to the various positions in X, while the columns correspond to the 20 distinct amino acids. The first is the position-specific scoring matrix PSSMx that is computed directly by PSI-BLAST. The second matrix is the position-specific frequency matrix PSFMx that contains the frequencies used by PSI-BLAST to derive PSSMx. These frequencies (also referred to as target frequencies13) contain both the sequence-weighted observed frequencies (also referred to as effective frequencies13) and the BLOSUM626 derived pseudocounts1. For each row of a PSFM, the frequencies are scaled so that they add up to one. In the cases where PSI-BLAST could not produce meaningful alignments for a given position of X, the corresponding rows of the two matrices are derived from the scores and frequencies of BLOSUM62.

For our study, we used the version of the PSI-BLAST algorithm available in NCBI's blast release 2.2.10 to generate profiles for both the test and training sequences. These profiles were derived from the multiple sequence alignment constructed after five iterations using an e-value of 10^-2. The PSI-BLAST search was performed against NCBI's nr database that was downloaded in November of 2004 and which contained 2,171,938 sequences.

2.2.2. Profile-to-Profile Scoring Method

The similarity score between a pair of k-mers (one from the query sequence and one from a sequence in the training set) was computed as the ungapped alignment score of the two k-mers, whose aligned positions were scored using profile information. Many different schemes have been developed for determining the similarity between profiles that combine information from the original sequence, the position-specific scoring matrix, or the position-specific target and/or effective frequencies13,21. In our work we use a scheme that is derived from PICASSO5,13 and that was recently used in developing effective remote homology prediction and fold recognition algorithms16. Specifically, the similarity score between the ith position of protein X's profile and the jth position of protein Y's profile is given by

  S_{X,Y}(i,j) = sum_{l=1..20} PSFM_X(i,l) * PSSM_Y(j,l)
               + sum_{l=1..20} PSFM_Y(j,l) * PSSM_X(i,l),        (1)

where PSFM_X(i,l) and PSSM_X(i,l) are the values corresponding to the lth amino acid at the ith position of X's position-specific frequency and scoring matrices, and PSFM_Y(j,l) and PSSM_Y(j,l) are defined in a similar fashion. Equation 1 determines the similarity between two profile positions by weighting the position-specific scores of the first sequence according to the frequency at which the corresponding amino acid occurs in the second sequence's profile. The key difference between Equation 1 and the corresponding scheme used in 13 (therein referred to as PICASSO3) is that our measure uses the target frequencies, whereas the scheme of 13 is based on effective frequencies.
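Equation 1 and the k-mer score built on top of it can be sketched directly in code. This is a minimal illustration under assumed data layouts (each PSSM and PSFM is a list of rows, each row holding 20 numbers, one per amino acid); the function names are ours, not the authors'.

```python
def position_score(psfm_x_row, pssm_y_row, psfm_y_row, pssm_x_row):
    """S_{X,Y}(i,j) of Equation 1: each protein's position-specific scores
    are weighted by the other protein's target frequencies, summed over
    the 20 amino acids."""
    return sum(psfm_x_row[l] * pssm_y_row[l] + psfm_y_row[l] * pssm_x_row[l]
               for l in range(20))

def kmer_score(pssm_x, psfm_x, i, pssm_y, psfm_y, j, k):
    """Ungapped alignment score of the k-mer of X starting at i against
    the k-mer of Y starting at j, scored position by position."""
    return sum(position_score(psfm_x[i + t], pssm_y[j + t],
                              psfm_y[j + t], pssm_x[i + t])
               for t in range(k))
```

Ranking all training k-mers by `kmer_score` against a query k-mer and keeping the n best yields the neighbor list for that position.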

2.3. Protein Structure Representation

Internally, we consider only the positions of the Cα atoms, and we use a vector representation of the protein in lieu of φ and ψ backbone angles. Fragments are taken directly from known structures, and are chosen from the training dataset using the above profile-profile scoring methods. Our protein construction approach uses the actual coordinates of the atoms in each fragment, rotated and translated into the reference frame of the working structure.

2.4. Scoring Function

As the focus of this work is to develop and evaluate new optimization techniques, we assume an ideal scoring function in order to test them. That is, we use the RMSD between the predicted and native structure of a protein as the scoring function. Although such a function cannot serve as a predictive measure, we believe that using it as the scoring function allows for a clearer differentiation between the optimization process and the scoring function.

2.5. Optimization Algorithms

In this study we compare the performance of three different optimization algorithms in the context of fragment assembly-based approaches for ab initio structure prediction. One of these algorithms, Simulated Annealing8, is currently a widely used method to solve such problems, whereas the other two algorithms, Greedy and Hill-climbing, are newly developed for this work.

The key operation in all three of these algorithms is the replacement of the k-mer starting at a particular position i with that of a neighbor structure. We will refer to this operation as a move. A move can either improve or degrade the quality of the structure; for each valid move, its gain is defined as the improvement that the move produces in the value of the scoring function (the RMSD between the working structure and the native structure of the protein). A move is considered valid if, after inserting the fragment, it does not create any steric conflicts. A structure is considered to have a steric conflict if it contains a pair of Cα atoms within 2.5Å of one another.

2.5.1. Simulated Annealing (SA)

Simulated Annealing8 is a generalization of the Monte Carlo12 method for discrete optimization problems. This optimization approach is designed to mimic the process by which a material such as metal or glass cools. At high temperatures, the atoms of a metal can adopt configurations not available to them at lower temperatures (e.g., at high temperature a metal can be a liquid rather than a solid). As the system cools, the atoms arrange themselves into more stable states, forming a stronger substance.

The Simulated Annealing algorithm proceeds in a series of discrete steps. In each step it randomly selects a valid move and performs it (i.e., inserts the selected fragment into the structure). If the move improves the quality of the structure, then the move is accepted. If it degrades the quality, then the move will still be accepted with probability

  p = exp(-(q_new - q_old) / T),        (2)

where T is the current temperature of the system, q_old is the score of the last state, and q_new is the score of the state in question. In effect, the optimizer will accept a very bad move with a higher probability if the temperature is high than if the temperature is low. From Equation 2 we see that the likelihood of accepting a bad move is inversely related to the temperature and to how much worse the new structure is than the current one. The algorithm begins with a high system temperature, which it progressively decreases according to an annealing schedule. As the optimization must use finite steps, the cooling of the system cannot be continuous.
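The acceptance rule around Equation 2 can be written as a small helper. This is a generic Metropolis-style sketch, not the authors' implementation; because the score here is an RMSD, lower values are better, so a "bad" move is one that increases the score.

```python
import math
import random

def accept_move(q_old, q_new, T, rng=random.random):
    """Equation 2: improving moves (q_new <= q_old) are always accepted;
    worsening moves are accepted with probability exp(-(q_new - q_old) / T)."""
    if q_new <= q_old:
        return True
    if T <= 0.0:
        return False  # zero temperature: only improvements are accepted
    return rng() < math.exp(-(q_new - q_old) / T)
```

For example, a move that worsens the score by 1.0 at temperature T = 1.0 is accepted with probability exp(-1), roughly 0.37, and ever more rarely as T decreases.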

However, the annealing schedule can be modified to increase its smoothness. Simulated Annealing is a highly tunable optimization framework: the starting temperature and the annealing schedule can be varied to improve performance, and the performance of the algorithm depends greatly on these parameters. The annealing schedule depends on a combination of the number of total allowed moves and the number of steps in which to make those moves. Our implementation of Simulated Annealing, following the general framework employed in Rosetta18, uses an annealing schedule that linearly decreases the temperature of the system to zero over a fixed number of cycles. Section 3.1 describes how we arrive at the values for these parameters of SA as implemented in this study.

2.5.2. The Greedy Algorithm (G)

One of the characteristics of the Simulated Annealing algorithm is that it considers moves for insertion at random, irrespective of their gains. The Greedy algorithm that we present here instead selects maximum gain moves. Specifically, the algorithm consists of two phases. In the first phase, called initial structure generation, the algorithm starts from a structure corresponding to a fully extended chain and attempts to make a valid move at each position of the protein. This is achieved by scoring all neighbors in each neighbor list and inserting the best neighbor (i.e., the neighbor with the highest gain) from each list. If some positions have no valid moves on the first pass, the algorithm attempts to make moves at these positions after trying all positions once. This ensures that the algorithm makes moves at nearly every position down a chain, and also provides a good starting point for the next phase. In the second phase, called progressive refinement, the algorithm repeatedly finds the maximum gain valid move over all positions of the chain, and if this move leads to a positive gain (i.e., it improves the value of the scoring function), the algorithm makes the move. This progressive refinement phase terminates upon failing to find any such move, so the Greedy algorithm is guaranteed to finish the progressive refinement phase in at least a local optimum.

2.5.3. Hill-Climbing (HC)

The Hill-climbing algorithm was developed to allow the Greedy algorithm to effectively climb out of locally optimal solutions. The key idea behind Hill-climbing is to not stop after achieving a local optimum, but to continue performing valid moves in the hope of finding a better local or a (hopefully) global optimum.

Specifically, the Hill-climbing algorithm works as follows. It begins by applying the Greedy algorithm in order to reach a local optimum. At this point, it begins a sequence of iterations consisting of a hill-climbing phase followed by a progressive refinement phase (as in the Greedy approach). In the hill-climbing phase, the algorithm performs a series of moves, each time selecting the highest gain valid move irrespective of whether or not it leads to a positive gain, thus allowing the algorithm to climb out of locally optimal solutions. Since the hill-climbing phase starts at a local optimum, its initial set of moves will lead to a structure whose quality (as measured by the scoring function) is worse than that at the beginning of the phase. However, subsequent moves can potentially lead to improvements that outweigh the initial quality degradation. If at any point during this series of moves the working structure achieves a score that is better than that of the structure at the beginning of the hill-climbing phase, this phase terminates and the algorithm enters the progressive refinement phase. The above sequence of iterations terminates when the hill-climbing phase is unable to produce a better structure after successively performing all best scoring valid moves.

Move Locking. As Hill-climbing allows negative gain moves, the algorithm can potentially oscillate between a local optimum and a non-optimal solution. To prevent this from happening, we implement a notion of move locking. After each move, a lock is placed on the move to prevent the algorithm from making this move again within the same phase. By doing so, we ensure that the algorithm does not repeatedly perform the same sequence of moves, thus guaranteeing its termination after a finite number of moves. All locks are cleared at the end of a hill-climbing phase.

These results also show that the search algorithms embedded in Greedy. we impose a threedimensional grid over the structure being built with boxes 2. Recall that a valid move brings no two Ca atoms within 2. HCC. preventing any further insertions at that position. leading to a faster optimization algorithm. The second. However. the advantage of coarse-grain locking is that each successive fragment insertion significantly reduces the set of fragments that need to be considered for future insertions.4. fine-grained locking yields a 21. its atoms are added to the grid.1%.6% improvement for Greedy. and 3 individually.4%. We further decrease the required time by sequentially checking neighbors at each position down the amino acid chain. allowing the search maximum freedom to proceed.5A of each other. EXPERIMENTAL EVALUATION 3. However. For example. we see that the Hill-climbing algorithm consistently outperforms the Greedy algorithm. This improvement also comes at the cost of an increase in run time. Results from the pool and scan settings clearly indicate that Greedy 2.24 climbing phase. coarse locking locks moves of all sizes. This proves necessary because each insertion can potentially introduce new proximity conflicts. All atoms upstream of the insertion point must be internally valid. we need only examine those atoms at or downstream from the insertion. 6. In an attempt to assuage the time requirement for this process. as well as using the scan and pool techniques to combine them. We investigate two different locking methods. Both schemes seem to take advantage of the increased flexibility of smaller fragments and greater numbers of fragments per position. 3. referred to as coarse-grain locking. thus. Since fine-grain locking is less restrictive. For example. This saves on computation time within one iteration of checking all possible moves. Thus.5% better than the corresponding 9-mer results for Greedy. As Hill-climbing includes running Greedy to convergence. 
In this fashion we limit the number of actual distances that must be computed. HC C . Comparing the performance of the scan and pooling methods to combine variable length fc-mers we see that pool performs consistently better than scan by an average of 4. a less restrictive finegrained approach generally yields better results than a coarse-grained scheme. To quickly determine if this proximity constraint holds. respectively. and HC/. increasing the neighbor lists from 25 to 100 yields a 23. Hill-climbing (coarse) (hereafter HCC) and Hill-climbing (fine) (hereafter H C / ) .1. as they have previously passed proximity checks. and neither is the increased run-time that Hill-climbing requires.2% improvement over coarse-grained locking. and HC/ are progressively more powerful as the size of the overall search space increases. locks the single move made. locks the position of the query sequence itself. we expect it to lead to better quality solutions. Efficient Checking of Steric Conflicts One characteristic of the Greedy and Hill-climbing algorithms is their need to evaluate the validity of every available move after every insertion. on the average the 3-mer results are 9.1% on the average. this increased performance comes at the cost of an increase in run-time of 1128% on the average. Table 2 shows results for the Greedy and Hill-climbing optimization techniques using fc-mer sizes of 9. The first. Average times are also reported for each of these five fragment selection schemes. averaging over all experiments. the result is not surprising.6%. With respect to locking. respectively. In the case of pooling. we report results from a series of experiments in which we vary a number of parameters.0%. The algorithm can subsequently select a different neighbor for insertion at this position. Examining Table 2. we have developed an efficient formulation for validity checking. 
Performance of the Greedy and Hill-climbing Algorithms To compare the effectiveness of the Greedy and Hillclimbing optimization techniques. Similarly. 12. which in this case is 131.5. and 8. . and for each addition the surrounding 26 boxes are checked for atoms violating the proximity constraint. and 43.4%. 31. referred to as fine-grain locking. As each move is made.5A on each side.
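The hill-climbing loop with fine-grain move locking described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a "move" (i, c) simply sets position i of a small state vector to choice c, standing in for a fragment insertion, and `score` is an arbitrary function to be maximized.

```python
def hill_climb(score, n_pos, choices, state, max_iters=50):
    """Greedy progressive refinement plus hill-climbing phases with
    fine-grain move locking; `score` is the function being maximized."""
    def apply(s, i, c):
        return s[:i] + [c] + s[i + 1:]

    def best_move(s, locked):
        best = None
        for i in range(n_pos):
            for c in choices:
                if c == s[i] or (i, c) in locked:
                    continue
                gain = score(apply(s, i, c)) - score(s)
                if best is None or gain > best[0]:
                    best = (gain, i, c)
        return best

    def refine(s):
        # Progressive refinement: take the best move only while it helps.
        while True:
            m = best_move(s, set())
            if m is None or m[0] <= 0:
                return s
            s = apply(s, m[1], m[2])

    state = refine(state)
    for _ in range(max_iters):
        start, locked, s, improved = score(state), set(), state, False
        while True:
            # Hill-climbing phase: best unlocked move, even with negative gain.
            m = best_move(s, locked)
            if m is None:
                break                      # every move is locked: give up the phase
            _, i, c = m
            locked.add((i, c))             # fine-grain lock on the move just made
            s = apply(s, i, c)
            if score(s) > start:           # climbed out of the local optimum
                state, improved = refine(s), True
                break
        if not improved:
            break
    return state
```

On a landscape that traps plain refinement in a local optimum, the negative-gain moves let the search escape, while the accumulating locks guarantee each phase terminates.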
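For comparison, the Simulated Annealing variant described above can be sketched as follows. The Metropolis test implements Equation 2, the temperature falls linearly to zero over the cycles, the move budget is spread evenly across cycles, and the temperature is temporarily raised after 150 consecutive rejections, as in the text. The `propose` callback and the default constants are illustrative stand-ins, not the paper's tuned configuration.

```python
import math
import random

def simulated_annealing(score, propose, state, t0=0.1, cycles=3500,
                        total_moves=35000, seed=0):
    """Metropolis SA sketch: a move of gain g < 0 is accepted with
    probability exp(g / T) (Eq. 2); T decreases linearly over `cycles`."""
    rng = random.Random(seed)
    per_cycle = max(1, total_moves // cycles)
    best, rejected = state, 0
    for c in range(cycles):
        T = t0 * (1.0 - c / cycles)       # linear cooling; T > 0 until the end
        for _ in range(per_cycle):
            cand = propose(rng, state)
            g = score(cand) - score(state)
            if g > 0 or rng.random() < math.exp(g / T):
                state, rejected = cand, 0
                if score(state) > score(best):
                    best = state
            else:
                rejected += 1
                if rejected >= 150:       # temporary reheat after many rejections
                    T, rejected = t0, 0
    return best
```

The tunable knobs (t0, cycles, total_moves) make the sensitivity discussed in Section 3.2.1 concrete: unlike Greedy and Hill-climbing, SA's behavior depends directly on these hand-set parameters.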
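The grid-based steric check can be sketched as below. Because each box is one clash distance (2.5 Å) on a side, a newly placed atom can only conflict with atoms stored in its own box or the 26 surrounding ones; atom coordinates are assumed to be plain (x, y, z) tuples.

```python
from math import floor

class ClashGrid:
    """Uniform 3-D grid for proximity checks with boxes `clash` Å on a side."""
    def __init__(self, clash=2.5):
        self.clash = clash
        self.cells = {}                    # (ix, iy, iz) -> list of points

    def _cell(self, p):
        return tuple(floor(c / self.clash) for c in p)

    def clashes(self, p):
        """True if p comes within `clash` of any stored atom; only the
        27 boxes around p need to be examined."""
        cx, cy, cz = self._cell(p)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for q in self.cells.get((cx + dx, cy + dy, cz + dz), ()):
                        d2 = sum((a - b) ** 2 for a, b in zip(p, q))
                        if d2 < self.clash ** 2:
                            return True
        return False

    def add(self, p):
        self.cells.setdefault(self._cell(p), []).append(p)
```

Only atoms in the neighboring boxes ever reach the distance computation, which is what limits the number of actual distances that must be evaluated per insertion.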

Table 2. Average values over 276 proteins optimized using the Greedy and Hill-climbing schemes with different locking schemes (k = 9, 6, 3, scan, and pool; n = 25, 50, 75, and 100 neighbors per position). Lower is better in both cases. Times are in seconds and scores are in Å.

3.2. Comparison with Simulated Annealing

3.2.1. Tuning the Performance of SA

Due to the sensitivity of Simulated Annealing to specific values for various parameters, we performed a search on a subset of the test proteins in an attempt to maximize the ability of SA to optimize the test structures. Specifically, we attempted to find values for two governing factors: the initial temperature T0 and the number of moves nm. To this end, we selected ten medium length proteins of diverse secondary structural classification (see Table 3) and optimized them over various initial temperatures. The initial temperature that yielded the best average optimized RMSD was T0 = 0.1, and we used this value in all subsequent experiments.

In addition to an initial temperature, when using Simulated Annealing one must select an appropriate annealing schedule. Our annealing schedule decreases the temperature linearly over 3500 cycles, which allows for a smooth cooling of the system. Over the course of these cycles, the algorithm attempts a × (l × n) moves, where a is an empirically determined scaling factor, l is the number of amino acids in the query protein, and n is the number of neighbors per position. In this study, a values of 20, 50, and 100 are employed. Note that for the scan and pool techniques (see Section 2.2), we allow SA three times the number of attempted moves, because the total number of neighbors is that much larger. Finally, following recent work 17, we allowed for a temporary increase in the temperature after 150 consecutive rejected moves.

Table 3. SCOP identifiers, SCOP classes, and lengths for the tuning set.

SCOP identifier   Length   SCOP class
dljiwi.           105      beta
dlnbca.           111      alpha+beta
dlkpf_            112      beta
d2mcm__           116      alpha
dlbea_            121      beta
dlcal.            146      alpha
dla8d.            155      beta
dlyaca.           204      alpha/beta
dljiga.2          205      beta
dlaoza2           209      beta

3.2.2. Results

The Simulated Annealing results are summarized in Table 4.

Table 4. Average values over 276 proteins optimized using Simulated Annealing. Lower is better in both cases. Times are in seconds and scores are in Å. The values of a in the table scale the number of moves Simulated Annealing is allowed to make: the total number of moves is a × (l × n), where l is the length of the protein being optimized and n is the number of neighbors per position.

As we see in this table, Simulated Annealing consistently outperforms the Greedy scheme: the average performance of SA with a = 20 is 15.1% better than that obtained by G. In contrast, both the fine and coarse locking versions of Hill-climbing outperform SA. More concretely, on the average HCc performs 22.0% better than SA with a = 50, and HCf performs 46.3% better than SA with a = 100. These performance comparisons are obtained by averaging the ratios of the corresponding RMSDs between the two schemes over all fragment selection schemes and values of n. The superior performance of Simulated Annealing over Greedy is to be expected, as Greedy lacks any sort of hill-climbing ability, whereas the stochastic nature of Simulated Annealing allows it a chance of overcoming locally optimal solutions.

Analyzing the performance of Simulated Annealing with respect to the value of a, we see that while Simulated Annealing shows an average improvement of 1.7% when a is increased from 20 to 50, the performance deteriorates by an average of 0.07% when a is increased from 50 to 100. This indicates that further increasing the value of a may not lead to performance comparable to that of the Greedy and Hill-climbing schemes. Also note that in some of the results shown in Table 4, the performance occasionally decreases as the a value increases. This ostensibly strange result comes from the dependence of the cooling process on the number of allowed moves: for all entries in Table 4, the annealing schedule cools the system over a fixed number of steps, but the number of moves made varies greatly, and in order to keep the cooling of the system linear we vary the number of moves allowed before the system reduces its temperature. As a result, different values of a can lead to different randomly chosen optimization paths. We are currently investigating the source of this behavior.

Comparing the performance of the various optimization schemes with respect to the various fragment selection schemes, we see an interesting trend: the performance of SA deteriorates (by 9.4% on average) when the different length k-mers are used via the pool method, whereas the performance of HCf improves (by 4.6% on average). One possible explanation is that Simulated Annealing has a bias towards smaller fragments. This bias might result because an insertion of a bad 3-mer will degrade the structure less than that of a bad 9-mer, and as a result, the likelihood of accepting the former move will be higher (Equation 2). This may reduce the optimizer's ability to effectively utilize the variable length k-mers.

3.3. Performance on Longest Sequences

In order to gain a better understanding of how the optimization schemes perform in the context of a larger search space, we focus on the longer half of the test proteins. Average RMSDs and times for the Greedy and Hill-climbing schemes are shown in Table 5, and average RMSDs and times for Simulated Annealing are shown in Table 6.

Table 5. Average values over the longest 138 proteins optimized using Hill-climbing and different locking schemes. Lower is better in both cases. Times are in seconds and scores are in Å.

Table 6. Average values over the longest 138 proteins optimized using Simulated Annealing. Lower is better in both cases. Times are in seconds and scores are in Å. The values of a scale the number of moves Simulated Annealing is allowed to make, as in Table 4.

In general, the trends in these tables agree with the trends in the average values over all the proteins. However, one key difference is that the relative improvement of the Hill-climbing schemes over Simulated Annealing is higher. For example, comparing G and SA for a = 20, SA performs 15.7% better, as opposed to 15.1% for the full average. Comparing with SA for a = 50, HCc performs 27.0% better, as opposed to 22.0% for the full average. Finally, comparing with SA for a = 100, HCf is 54.6% better, as opposed to 46.3% for the full average. These results suggest that, in the context of a larger search space, a hill-climbing ability is important, and that the hill-climbing abilities of HCc and HCf are better than those of SA.

4. DISCUSSION AND CONCLUSIONS

This paper presents two new techniques for optimizing scoring functions for protein structure prediction. One of these approaches, HCc using the scan technique, reaches better solutions than Simulated Annealing in comparable time, while HCf finds the best solutions of all the examined algorithms. Simulated Annealing requires the hand-tuning of several parameters, including the total number of moves, the initial temperature, and the annealing schedule. One of the main advantages of schemes like Greedy and Hill-climbing is that they do not rely on such parameters. Furthermore, experiments with variations on the number of moves available to the optimizer demonstrate that the Hill-climbing approach makes better use of an expanded search space than Simulated Annealing: the performance of SA seems to saturate beyond a = 50, whereas HCf will make use of an increased time allowance.

Recently, greedy techniques have been applied to problems similar to the one this paper addresses. The first problem is to determine a set of representative fragments for use in decoy structure construction 15. The second problem is to reconstruct a native protein fold given such a set of representative fragments 19, 20. The greedy approaches used for both these problems traverse the query sequence in order, inserting the best found fragment for each position. The techniques this paper describes could be modified to solve either of these problems. To build a representative set of fragments, one could track the frequency of fragment use within multiple Hill-climbing optimizations of different proteins. This would yield a large set of fragments, which could serve as input to a clustering algorithm, and the centroids of these clusters could then be used in decoy construction. In order to construct a native fold from such fragments, one need only restrict the move options of Hill-climbing to the representative set. We are currently working on adapting our algorithms to solve these problems.

Additionally, some optimization approaches build multiple structures simultaneously in the search for a better structure. While such approaches have the ability to avoid local minima, they lack an explicit notion of hill-climbing. As an extension, our algorithms could be modified to build multiple structures simultaneously.

ACKNOWLEDGMENTS

This work was supported in part by NSF EIA-9986042, ACI-0133464, IIS-0431135, and NIH RLM008713A; by the Digital Technology Center at the University of Minnesota; and by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory (ARL), under Cooperative Agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute.

References

1. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402, 1997.
2. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28:235-242, 2000.
3. J. M. Chandonia, G. Hon, N. S. Walker, L. Lo Conte, P. Koehl, M. Levitt, and S. E. Brenner. The ASTRAL compendium in 2004. Nucleic Acids Research, 2004.
4. C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. In Proceedings of the 19th Design Automation Conference, 1982.
5. A. Heger and L. Holm. Picasso: generating a covering set of protein family profiles. Bioinformatics, 2001.
6. S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 89:10915-10919, 1992.
7. K. Karplus, R. Karchin, J. Draper, J. Casper, Y. Mandel-Gutfreund, M. Diekhans, and R. Hughey. Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. PROTEINS: Structure, Function and Genetics, 2003.
8. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983.
9. R. Kolodny, P. Koehl, L. Guibas, and M. Levitt. Small libraries of protein fragments model native protein structures accurately. Journal of Molecular Biology, 323:297-307, 2002.
10. J. Lee, S. Y. Kim, K. Joo, I. Kim, and J. Lee. Prediction of protein tertiary structure using PROFESY, a novel method based on fragment assembly and conformational space annealing. PROTEINS: Structure, Function and Bioinformatics, 2004.
11. M. A. Marti-Renom, M. S. Madhusudhan, and A. Sali. Alignment of protein sequences by their profiles. Protein Science, 13:1071-1087, 2004.
12. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087-1092, 1953.
13. D. Mittelman, R. Sadreyev, and N. Grishin. Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics, 19(12):1531-1539, 2003.
14. A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536-540, 1995.
15. B. Park and M. Levitt. The complexity and accuracy of discrete state models of protein structure. Journal of Molecular Biology, 249:493-507, 1995.
16. H. Rangwala and G. Karypis. Profile based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21:4239-4247, 2005.
17. C. A. Rohl, C. E. M. Strauss, K. M. S. Misura, and D. Baker. Protein structure prediction using Rosetta. Methods in Enzymology, 2004.
18. K. T. Simons, C. Kooperberg, E. Huang, and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. Journal of Molecular Biology, 1997.
19. P. Tuffery, F. Guyon, and P. Derreumaux. Improved greedy algorithm for protein structure reconstruction. Journal of Computational Chemistry, 26:506-513, 2005.
20. P. Tuffery and P. Derreumaux. Dependency between consecutive local conformations helps assemble protein structures from secondary structures using Go potential and greedy algorithm. PROTEINS: Structure, Function and Bioinformatics, 61:732-740, 2005.
21. G. Wang and R. L. Dunbrack Jr. Scoring profile-to-profile sequence alignments. Protein Science, 13:1612-1626, 2004.


TRANSMEMBRANE HELIX AND TOPOLOGY PREDICTION USING HIERARCHICAL SVM CLASSIFIERS AND AN ALTERNATING GEOMETRIC SCORING FUNCTION

Allan Lo 1,2, Hua-Sheng Chiu 3, Ting-Yi Sung 3, Wen-Lian Hsu 3,*

1 Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
2 Department of Life Sciences, National Tsing Hua University, Hsinchu, Taiwan
3 Bioinformatics Lab, Institute of Information Science, Academia Sinica, Taipei, Taiwan
Email: {allanlo, huasheng, tsung, hsu}@iis.sinica.edu.tw
* Corresponding author.

Motivation: A key class of membrane proteins contains one or more transmembrane (TM) helices, traversing the membrane lipid bilayer. Various properties, such as the length, arrangement, and topology or orientation of TM helices, are closely related to a protein's functions. Although a range of methods have been developed to predict TM helices and their topologies, no single method consistently outperforms the others. Furthermore, topology prediction has much lower accuracy than helix prediction and thus requires continuous improvement.

Results: We develop a method based on support vector machines (SVM) in a hierarchical framework to predict TM helices first, followed by their topology. By partitioning the prediction problem into two steps, specific input features can be selected and integrated in each step. We also propose a novel scoring function for topology models based on the membrane protein folding process. When benchmarked against other methods in terms of performance, our approach achieves the highest scores, at 86% in helix prediction (Q2) and 91% in topology prediction (TOPO) for the high-resolution data set, resulting in an improvement of 6% and 14% in the respective categories over the second best method. In addition, we demonstrate the ability of our method to discriminate between membrane and non-membrane proteins with higher than 99% accuracy. When tested on a small set of newly solved structures of membrane proteins, our method overcomes some of the difficulties in predicting TM helices by incorporating multiple biological input features.

1. INTRODUCTION

Integral membrane proteins constitute a wide and important class of biological entities that are crucial for life, representing about 25% of the proteins encoded by several genomes 1-3. They play a key role in various cellular processes, including signal and energy transduction, cell-cell interactions, and transport of solutes and macromolecules across membranes 4. Despite their biological importance, the proportion of available high-resolution structures is exceedingly limited, at about 0.5% of all solved structures 5, compared to that of globular proteins deposited in the Protein Data Bank (PDB) 6, because experimental approaches for identifying membrane protein structural models are time-consuming 7. Therefore, bioinformatics development in sequence-based prediction methods is valuable for elucidating the structural genomics of membrane proteins.

A membrane protein structural model defines the number and location of transmembrane helices (TMHs) and the orientation or topology of the protein relative to the lipid bilayer. In the absence of a high-resolution structure, an accurate structural model is important for the functional annotation of membrane proteins. Many different methods have been developed to predict structural models of transmembrane helix (TMH) proteins. Earlier approaches relied on physico-chemical properties such as hydrophobicity 8-10 to identify TMH regions, and topology prediction remained a major challenge. Recently, more advanced methods using hidden Markov models 3,11 and neural networks 12 have been developed, and they have achieved significant improvements in prediction accuracy. However, an evaluation study 13 concluded that current accuracies were over-estimated, and, although several methods are available, none of them have integrated multiple biological input features in a machine-learning framework.

In this paper, we propose a machine-learning approach called SVMtmh (SVM for transmembrane helix prediction) in a hierarchical classification framework to predict membrane protein structure. We divide the prediction task into two successive steps by using a tertiary classifier consisting of two hierarchical binary classifiers. The number and location of TMHs are predicted in the first step, followed by the prediction of the topology in the second step.

Our key contributions are as follows: 1) By decomposing the prediction into two steps, we reduce the complexity involved in each step, and biological input features relevant to each classifier can be applied. 2) We select multiple input features, including those based on different structural parts of a TMH protein, and integrate them to predict helices. 3) For topology prediction, we propose a novel topology scoring function based on the current understanding of membrane protein insertion. To the best of our knowledge, the proposed topology scoring function is the first model to capture the relationship between topogenic factors and topology formation. The performance of SVMtmh is compared with other methods across several benchmark data sets, and SVMtmh achieves a marked improvement in both helix and topology prediction. Specifically, SVMtmh achieves the highest score at 91% for topology prediction (TOPO) and 86% for helix prediction (Q2) in the high-resolution data set, an improvement of 14% and 6%, respectively, compared to the second highest score. In addition, SVMtmh yields the lowest false positive rate, at 0.5%, when tested for discrimination between membrane and non-membrane proteins. Finally, we apply SVMtmh to analyze a newly solved structure of bacteriorhodopsin (bR) and show that our method can provide the correct structural model, in close agreement with the structure obtained through X-ray crystallography. We also provide a detailed analysis of the comparison with other methods and conclude with a summary and directions for future work.

2. METHODS

2.1. System architecture

The proposed approach uses hierarchical binary classifiers to predict the helices and topology of an integral membrane protein. Each residue of a TMH protein can be regarded as belonging to one of three classes defined by its position with respect to the membrane: inner (i) loop, transmembrane helix (H), and outer (o) loop. The aim of predicting membrane protein structures is to identify the correct class of each residue. Since there are three classes for a protein sequence, we design a tertiary classifier, which consists of two binary classifiers in a hierarchical structure, and the classification framework is performed in two steps. An overview of the system architecture is shown in Fig. 1.

Fig. 1. Overview of the SVMtmh system architecture. Step 1 (helix prediction): a sliding window w1 partitions the query protein into peptides; the input features (1. amino acid composition (AA), 2. di-peptide composition (DP), 3. hydrophobicity scale (HS), 4. amphiphilicity (AM)) are encoded and scaled; H/~H residues are predicted by two different feature sets and combined into a consensus of predicted TMHs. Step 2 (topology prediction): non-helical segments are identified; a sliding window w2 with amino acid composition (AA) as the input feature predicts i/o residues in the non-helical segments; TMH candidates are determined; and the alternating geometric scoring function yields the final TMH and topology prediction.
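The sliding-window extraction in Step 1 can be sketched as follows. This is a minimal illustration: the 'X' padding at the chain ends and the per-position one-hot encoding are assumptions for the sketch, not the paper's actual feature encodings (which use composition, hydrophobicity, and amphiphilicity features).

```python
AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def windows(seq, w):
    """Center a length-w sliding window (w odd) on each residue; each
    window becomes one SVM example for the residue at its center.
    Ends are padded with 'X' (an assumption of this sketch)."""
    pad = w // 2
    padded = "X" * pad + seq + "X" * pad
    return [padded[i:i + w] for i in range(len(seq))]

def one_hot(window):
    """Illustrative per-position one-hot encoding of a window; values
    already lie in [0, 1], matching the paper's feature scaling range."""
    vec = []
    for ch in window:
        col = [0.0] * len(AA)
        if ch in AA:
            col[AA.index(ch)] = 1.0
        vec.extend(col)
    return vec
```

With the first classifier's tuned window size of w1 = 21, each residue would be represented by a 21 x 20 = 420-dimensional vector under this encoding.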

2. METHODS

2.1. Helix prediction

2.1.1. Feature selection and extraction

The choice of relevant features is critical in any prediction model. TMH proteins are subject to the global constraints of the lipid bilayer since they contain membrane-spanning helices15. TM helices can further be divided into distinct local structural parts, including the core and end regions, based on the propensity of amino acids3. The helix core region is surrounded by an aliphatic hydrocarbon layer about 30 A in thickness, while the helix end regions are embedded in the water-membrane interface of about 15 A. Loops connect the adjacent TM helices. Thus, in the present study we select global and local input features that capture important relationships between a sequence and the structure of a TM helix.

[Fig. 1. Transmembrane (TM) helix structure in the lipid bilayer: the helix core region (~30 A hydrocarbon layer), the helix end regions (~7.5 A) embedded in the interfacial layers (~15 A), and loops of flexible length connecting adjacent helices.]

Fig. 2 shows the selection of features to capture both the global and local information of a TM helix. Global input features: amino acid (AA) and di-peptide (DP) compositions. Local input features: a hydrophobicity scale (HS)16 and amphiphilicity (AM)17. We use sliding windows to partition a protein sequence into peptides, and the values in the feature vectors are scaled to the range [0, 1].

[Fig. 2. Selection of input features: global features (amino acid and di-peptide compositions) and local features (hydrophobicity scale for the helix core region and amphiphilicity for the helix end regions). Sliding windows of sizes w1 and w2 are used for feature encoding and scaling, and H/~H residues are predicted by two different feature sets.]

2.1.2. Training and testing

We train our classifiers with the LIBSVM package14, choosing the Radial Basis Function (RBF) as the kernel function. Briefly, the classification framework is performed in two steps, each of which uses an associated binary classifier (H/~H, i/o). In Step 1, TM and non-TM residues (H/~H) are predicted. In Step 2, the remaining non-helix residues (~H) from Step 1 are classified as either inner or outer (i/o) residues, to determine whether the topology of the protein begins with an inner (i) or an outer (o) loop. We use different feature sets (Section 3.3) to train our SVM classifiers and then combine the results from the best two combinations into a consensus prediction.

The optimal length of the sliding window, w1 for the first classifier and w2 for the second classifier, is searched incrementally from 3 to 41 for both classifiers; the optimal window sizes are found to be 21 and 29, respectively. The associated parameters (C, γ) are optimized at (1, 0.1250). The cost weight is adjusted to avoid under-prediction in unbalanced data sets: since the helix and non-helix classes make up about 30% and 70% of the data set, respectively, we set the cost weight at 7/3 for the first classifier. Similarly, we set the cost weight at 1/1 for the second classifier to reflect the proportion of the inner and outer loop classes in the data set.

Ten-fold cross-validation is used to evaluate our method. The data set is first divided into ten subsets of equal size, and each subset is in turn tested using the classifier trained on the remaining nine subsets. Since each residue of the whole data set is predicted exactly once, the overall prediction accuracy is the percentage of correctly predicted residues.
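As a concrete illustration of the sliding-window scheme described above, the sketch below one-hot encodes the window around a residue for a per-residue classifier. The helper names are ours, not the paper's; the actual system feeds such vectors (together with the other feature encodings) to LIBSVM with an RBF kernel.

```python
# Minimal sketch, assuming a simple one-hot amino acid encoding; positions
# falling outside the sequence are encoded as all-zero vectors.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(seq, center, w):
    """Encode the window of width w centered on position `center`.

    Each residue contributes a length-20 vector with a 1 at the index of
    its amino acid type, and 0 elsewhere.
    """
    half = w // 2
    vec = []
    for i in range(center - half, center + half + 1):
        one_hot = [0.0] * len(AMINO_ACIDS)
        if 0 <= i < len(seq) and seq[i] in AMINO_ACIDS:
            one_hot[AMINO_ACIDS.index(seq[i])] = 1.0
        vec.extend(one_hot)
    return vec

# A window of width 21 yields a 21 * 20 = 420-dimensional feature vector.
features = encode_window("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 10, 21)
```

One such vector is built for every residue in the sequence, so the classifier makes an independent H/~H decision per position.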

The representation of each feature is described below.

Amino acid composition (AA): This basic feature enables us to capture the global information of a TM helix. Each residue of a peptide is represented by a vector of length 20, indicated by 1 in the position corresponding to the amino acid type of the residue, and 0 otherwise.

Di-peptide composition (DP): We consider the coupling effect of two adjacent residues, which contains global information along the sequence. This feature is represented by the pair-residue occurrence probability P(X,Y), where (X,Y) is an ordered pair of amino acids, X followed by Y. The vector space of this feature comprises 400 dimensions.

Helix core feature: Hydrophobicity is used to capture local information within the core region of a TM helix, where it is a major stabilizing factor16. We select a hydrophobicity scale (HS) recently determined by membrane insertion experiments16. Each residue is represented by a vector of length 20 that has a real value corresponding to its hydrophobicity.

Helix ends feature: The end regions of a TM helix near the membrane-water interface exhibit a preference for aromatic and polar residues, as shown in amino acid propensity studies16,17. We select an amphiphilicity (AM) index17 as an input feature to capture the local information contained in the helix-capping ends. Each residue is represented by a vector of length 20 that has a real value corresponding to its amphiphilicity.

2.2. Topology prediction

The purpose of predicting the topology of a TMH protein is to determine the orientation of the protein with respect to the membrane. A TMH protein follows special constraints on its topology: it always starts with either an inner (i) loop or an outer (o) loop, and the loop types must alternate in order to connect the TM helices. Therefore, the problem of predicting the topology of a TMH protein is reduced to predicting the topology of the first loop, located at the N-terminus.

There is a growing body of evidence that the final topology is influenced by multiple signals distributed along the entire protein in the loop segments, including the charge bias, loop size, and folding of the N-terminal loop domain20. Furthermore, the widely accepted two-stage model for membrane protein folding suggests that the final topology of a membrane protein is established in the early stages of membrane insertion21. These biological phenomena form the basis of our assumptions about topology models: we assume that topology formation is a result of contributing signals present in the various loop segments.

2.2.1. Input feature

Using the second classifier, we predict the topology label (i/o) of each non-helix residue from the results of the first classifier (H/~H). Amino acid composition is employed as the input feature, and the encoding scheme follows the same procedure outlined in the helix prediction section.

2.3. Determination of TMH candidates

To identify potential TMH regions, it is necessary to determine whether there are any TMH candidates among our initial prediction results. We do this by modifying the algorithm proposed in the THUMBUP program18 to determine TMH candidates and assemble them into physical TMH segments. Step 1 screens for TMH candidates; Steps 2 and 3 describe the assembly of a TMH candidate into a physical TMH segment.

Step 1: Filtering. We define a cut-off value, lmin, as the minimal length for a TMH candidate. A predicted helix segment is a TMH candidate if its length is at least lmin; otherwise, it is converted to a non-helix segment.

Step 2: Extension. An optimal TMH length, lopt, is set at 21 to reflect the thickness of the hydrocarbon core of a lipid bilayer19. If the length of a TMH candidate is between lmin and lopt, it is extended to lopt from its N- and C-termini. Two or more TMH candidates are merged if they overlap after the extension.

Step 3: Splitting. We define lmax as the cut-off value for the length of a TMH candidate to be split. A TMH candidate whose length is greater than or equal to lmax is split into two helices, starting from its N- and C-termini with the loop in the center.

We optimize lmin and lmax on the training data set (Section 3.1). The optimized values of lmin and lmax for the best prediction performance are 9 and 38, respectively.
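The three assembly steps can be sketched as follows. This is a minimal re-implementation under our own assumptions, not the paper's exact procedure: in particular, the precise rule for splitting an over-long candidate is not fully specified above, so we split it at the center and leave a one-residue loop there.

```python
# Sketch of the three-step TMH candidate assembly (filter, extend/merge,
# split), with the paper's optimized lengths; segments are (start, end)
# index pairs with inclusive ends, assumed sorted by start position.
L_MIN, L_OPT, L_MAX = 9, 21, 38

def assemble_tmh(segments, seq_len):
    # Step 1: filtering -- drop predicted helix runs shorter than L_MIN.
    cands = [(s, e) for s, e in segments if e - s + 1 >= L_MIN]
    # Step 2: extension -- grow candidates shorter than L_OPT from both
    # termini, then merge any candidates that overlap after extension.
    extended = []
    for s, e in cands:
        need = L_OPT - (e - s + 1)
        if need > 0:
            s = max(0, s - need // 2)
            e = min(seq_len - 1, e + (need - need // 2))
        if extended and s <= extended[-1][1]:
            extended[-1] = (extended[-1][0], max(extended[-1][1], e))
        else:
            extended.append((s, e))
    # Step 3: splitting -- cut candidates of length >= L_MAX into two
    # helices with a loop in the center (center split is our assumption).
    final = []
    for s, e in extended:
        if e - s + 1 >= L_MAX:
            mid = (s + e) // 2
            final.append((s, mid - 1))
            final.append((mid + 1, e))
        else:
            final.append((s, e))
    return final
```

For example, a 6-residue run is filtered out, a 15-residue run is extended to 21 residues, and a 51-residue run is split into two helices.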

2.4. Alternating geometric scoring function

Two further assumptions follow from the biology described above. First, signals embedded in the loop segments near the N-terminus are more likely to be a factor in the formation of topology, since they are inserted into the membrane at an earlier time. Second, the contribution of signals in the loop segments varies inversely with their distance from the N-terminus, in a geometric series. Based on these assumptions, we develop a novel topology scoring function that considers the topogenic contribution from all loop segments, diminishing with distance from the N-terminus.

Given a transmembrane protein that has n non-helical segments Sj (1 <= j <= n, j in N) predicted in the first step (Section 2.1), a predicted loop segment can contain residues of more than one topology label. For each Sj of length |Sj|, we therefore define two ratios, Ri and Ro, to represent the predicted ratios of topology labels i and o:

    Ri(j) = (# of "inside" residues / |Sj|) x 100%     (1)
    Ro(j) = (# of "outside" residues / |Sj|) x 100%    (2)

with Ri(j) + Ro(j) = 100%. To determine the protein topology, we define two topology scores, TSi and TSo, where TSi is the score for the N-terminal loop being on the inside of the membrane and TSo for the outside:

    TSi = SUM(1<=j<=n) W(j) x [ a Ri(j) + (1 - a) Ro(j) ]    (3)
    TSo = SUM(1<=j<=n) W(j) x [ (1 - a) Ri(j) + a Ro(j) ]    (4)

    a = 1 if j is odd, 0 if j is even    (5)
    W(j) = 1 / b^((j-1) x EI)            (6)

where b and EI denote the base and the exponent increment, respectively. W(j) is a geometric function which assigns weights to the Ri(j) and Ro(j) terms, and a alternates between the inner (i) and outer (o) loops to take into account the alternating nature of the connecting loops. If TSi > TSo, the topology of the N-terminal loop is inside; otherwise, the topology is outside.

Fig. 3 illustrates the calculation of the alternating geometric scoring function for an example protein with three predicted loop segments, where Ri(1) = 2/5, Ro(1) = 3/5, Ri(2) = 4/5, Ro(2) = 1/5, Ri(3) = 3/5 and Ro(3) = 2/5. Given the set of optimal values (b, EI) = (1.6, 1.0) (Section 3.6), TSi = 1 x Ri(1) + 1/(1.6^1.0) x Ro(2) + 1/(1.6^2.0) x Ri(3) = 0.7594, and TSo = 1 x Ro(1) + 1/(1.6^1.0) x Ri(2) + 1/(1.6^2.0) x Ro(3) = 1.2563. Since TSo > TSi, the final topology for the N-terminal loop is outside (o).

[Fig. 3. An example of evaluating a TMH protein's topology with the alternating geometric scoring function, showing the protein sequence, the two predicted helices, and the predicted i/o labels of the three loop segments from the N-terminus to the C-terminus.]

3. RESULTS AND DISCUSSION

3.1. Data sets

1. Low-resolution TMH proteins: We train and perform ten-fold cross-validation on a low-resolution data set compiled by Moller et al.22 We select 145 proteins of good reliability from a set of 148 non-redundant proteins, validate this data set using annotations from SWISS-PROT release 49.023, and further remove two proteins because they have no membrane protein annotations. The final data set contains 143 proteins for which low-resolution topology models are available. This entire data set is also used to train our model for testing on the following three data sets.

2. High-resolution TMH proteins: We use a collection of 36 high-resolution TMH proteins from PDB compiled by Chen et al.13 We manually validate this data set using annotations from SWISS-PROT release 49.023, obtain topology information for 35 out of 36 proteins, and update the topologies of two proteins.

3. Newly solved TMH proteins: Four newly solved high-resolution TMH proteins24 are used as an independent test set.

4. Soluble proteins: A collection of 616 high-resolution soluble proteins from PDB compiled by Chen et al.13 is used to test for discrimination between membrane and soluble proteins.
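The alternating geometric scoring function of Section 2.4 can be sketched directly from Eqs. (3)-(6); the function name and the use of fractions rather than percentages are our own conventions. The sketch reproduces the worked example of Fig. 3.

```python
# Sketch of the alternating geometric scoring function with the paper's
# chosen parameters (b, EI) = (1.6, 1.0).
def topology_scores(loop_ratios, b=1.6, ei=1.0):
    """loop_ratios: per-loop (R_i, R_o) fractions of residues predicted
    inside/outside, ordered from the N-terminus (j = 1, 2, ...)."""
    ts_i = ts_o = 0.0
    for j, (r_i, r_o) in enumerate(loop_ratios, start=1):
        w = 1.0 / b ** ((j - 1) * ei)        # geometric weight W(j), Eq. (6)
        alpha = 1.0 if j % 2 == 1 else 0.0   # alternation selector, Eq. (5)
        ts_i += w * (alpha * r_i + (1 - alpha) * r_o)
        ts_o += w * ((1 - alpha) * r_i + alpha * r_o)
    return ts_i, ts_o

# Fig. 3 example: three loops with R_i = 2/5, 4/5 and 3/5.
ts_i, ts_o = topology_scores([(0.4, 0.6), (0.8, 0.2), (0.6, 0.4)])
topology = "in" if ts_i > ts_o else "out"
# TS_o = 1.2563 > TS_i = 0.7594, so the N-terminal loop is outside.
```

Note how TS_i picks up R_o for even-numbered loops: if the first loop is inside, every even loop must be outside, so the even loops' outside ratios support the inside hypothesis.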

3.2. Evaluation metrics

There are two sets of evaluation measures for TMH prediction: per-segment and per-residue accuracies13. Per-segment scores indicate how accurately the location of a TMH region is predicted, and per-residue scores report how well each residue is predicted. Per-segment metrics include Qok, Qhtm%obs, Qhtm%prd, and TOPO; per-residue metrics include Q2, Q2T%obs, and Q2T%prd. Table 1 lists the per-segment and per-residue metrics used in this paper. We follow the same performance measures proposed by Chen et al.13

[Table 1. Evaluation metrics used in this work. Per-segment: Qok (percentage of proteins in which all TMH segments are predicted correctly), Qhtm%obs (TMH segment recall: correctly predicted TM segments / observed TM segments in the data set), Qhtm%prd (TMH segment precision: correctly predicted TM segments / predicted TM segments), and TOPO (percentage of proteins with correctly predicted topology). Per-residue: Q2 (percentage of correctly predicted residues, averaged over the Nprt proteins in the data set), Q2T%obs (TMH residue recall: residues correctly predicted in TM helices / residues observed in TM helices), and Q2T%prd (TMH residue precision: residues correctly predicted in TM helices / residues predicted in TM helices).]

In the calculation of per-segment scores, a minimal overlap with the observed helix segments must be defined, and two issues must be addressed when counting a helix as correctly predicted. First, an evaluation study by Chen et al.13 used a relaxed minimal overlap of only 3 residues; we use a stricter criterion, which requires at least 9 overlapping residues. Second, we do not allow an overlapping observed helix to be counted twice. The following examples, comparing an observation of two helices with three alternative predictions, illustrate these two issues. Prediction 1 achieves 100% accuracy if the minimal overlap is 3 residues, but only 50% if the minimal overlap is 9 residues. Prediction 2 achieves 50% accuracy: its first predicted helix already overlaps the first observed helix, so its second predicted helix is an over-prediction, since we count an overlapping observed helix only once, and thus it is not counted. Prediction 3 achieves 100% accuracy if the minimal overlap is 3 residues, but 50% if the minimal overlap is 9 residues, because its first predicted helix does not satisfy the minimal overlap requirement.

3.3. Performance of input feature combinations for helix prediction

We test the performance of different input feature combinations for the first classifier. The following combinations are considered: 1) AA only; 2) AA and any one of DP, HS, and AM; 3) AA and any two of DP, HS, and AM; and 4) all four features. We also construct a consensus prediction from the two top-performing combinations through probability estimation using LIBSVM25. The value of the estimated probability for each residue corresponds to the confidence given for its predicted class; in the case of disagreement between the predicted classes, the consensus prediction takes the result of the prediction that has the higher probability.
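The per-segment counting rules of Section 3.2 (a minimal overlap, with each observed helix counted at most once) can be sketched as follows; the helper name and the greedy first-match order are our assumptions, not the paper's specification.

```python
# Sketch of per-segment helix counting under a minimal-overlap criterion.
def count_correct(observed, predicted, min_overlap=9):
    """observed, predicted: lists of (start, end) helix segments with
    inclusive ends. Returns the number of predicted helices counted
    as correct; each observed helix can be matched at most once."""
    used = set()
    correct = 0
    for ps, pe in predicted:
        for k, (os_, oe) in enumerate(observed):
            overlap = min(pe, oe) - max(ps, os_) + 1
            if overlap >= min_overlap and k not in used:
                used.add(k)   # an observed helix is only counted once
                correct += 1
                break
    return correct
```

For instance, with one observed helix and two predicted helices that both touch it, only one prediction can be counted; the other is an over-prediction regardless of the overlap threshold.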

Table 2 shows the performance of the combinations of input features and of the consensus prediction. Combination 5 (AA+DP+HS) achieves the highest score for Qok and performs consistently well in the other per-segment and per-residue measures, while Combination 6 (AA+DP+AM) has a strikingly high Qhtm%obs score. The purpose of the consensus prediction is to maximize the benefits of both combinations: the consensus approach increases the Qok score relative to Combination 6, and compared to Combination 5 it trades a slight decrease in Qok for an increase in per-residue precision, with only a marginal decrease in segment recall. In addition, the consensus approach scores the highest for Q2. The consensus approach is therefore selected as our best model for comparison with other approaches.

[Table 2. Performance of input feature combinations and the consensus method in per-segment and per-residue measures: 1) AA; 2) AA+DP; 3) AA+HS; 4) AA+AM; 5) AA+DP+HS; 6) AA+DP+AM; 7) AA+HS+AM; 8) AA+DP+HS+AM; 9) Consensus (5+6). Input features: AA (amino acid composition), DP (di-peptide composition), HS (hydrophobicity scale)16, and AM (amphiphilicity)17.]

3.4. Performance on high- and low-resolution data sets

SVMtmh is compared with other methods for the high- and low-resolution data sets in Table 3. Per-segment and per-residue scores of all compared methods are taken from an evaluation by Chen et al.13, and methods are sorted by their Qok values for the low-resolution data set. Note that we do not have cross-validation results for the other methods, and therefore their accuracies might be over-estimated. In addition, we use a minimal overlap of 9 residues whereas Chen et al.13 used only 3 residues, and TOPO scores for the high-resolution data set are re-evaluated due to the update of topology information.

For the low-resolution set, SVMtmh ranks the highest among all compared methods for the per-segment measures TOPO, Qok, and Qhtm%prd, at 84%, 71%, and 95%, respectively; SVMtmh improves TOPO by 5% over the second best method. For the high-resolution set, SVMtmh has the highest score at 91% for TOPO, a 14% improvement over the second best method. Another marked improvement is observed for the high-resolution set in the per-residue scores.

[Table 3. Performance of prediction methods for the low- and high-resolution data sets: SVMtmh, HMMTOP2, TMHMM2, PHDpsiHtm08, PHDhtm08, PHDhtm07, PRED-TMR, TopPred2, DAS, SOSUI, and hydrophobicity-scale methods (Ben-Tal, Wolfenden, WW, GES, Eisenberg, KD, Heijne, Hopp-Woods, Sweet, Av-Cid, Roseman, Levitt, Nakashima, A-Cid, Lawson, Radzicka, Bull-Breese, EM, Fauchere). The shaded area outlines the four top-performing methods.]

On the high-resolution set, SVMtmh obtains the highest per-residue score for Q2 at 86%, compared to the second best methods at 80%. SVMtmh performs 3% to 12% better on the high-resolution set than on the low-resolution set in terms of per-segment scores, while for per-residue scores the accuracy on the two data sets is similar, in the range of 81% to 90%. In general, advanced methods such as TMHMM2 and PHDpsiHtm0812 achieve better accuracies than simple hydrophobicity scale methods such as Kyte-Doolittle (KD)8 and White-Wimley (WW)10. The shaded area in Table 3 denotes the four top-performing approaches, which are selected to further predict newly solved membrane protein structures (Section 3.7).

3.5. Discrimination between soluble and membrane proteins

To assess our method's ability to discriminate between soluble and membrane proteins, we apply SVMtmh to the soluble protein data set. Discrimination is based on a chosen cut-off length, defined as the minimum TMH length: any protein that does not have at least one predicted TMH exceeding this minimum length is classified as a soluble protein. We calculate the false positive (FP) rate for the soluble protein set, where a false positive is a soluble protein falsely classified as a membrane protein. Similarly, we calculate the false negative (FN) rates for both the high-resolution (FNhigh) and low-resolution (FNlow) membrane protein sets using the chosen cut-off length.

Fig. 4 shows the FP and FN rates as a function of cut-off length. Clearly, the cut-off length is a trade-off between the FP and FN rates, so the cut-off length selected must minimize FP + FNhigh + FNlow. The cut-off length of 18, which minimizes the sum of all errors, is used to discriminate between soluble and membrane proteins.

[Fig. 4. False positive and false negative rates (%) as a function of cut-off length. The cut-off length at 18 (dashed line) is chosen to minimize the sum of all three error rates (FP + FNlow + FNhigh).]

Table 4 shows the results of our method compared with the other methods; the results of all compared methods are taken from Chen et al.13, and methods are sorted by false positive rates. SVMtmh is capable of distinguishing soluble and membrane proteins with FP and FNlow rates of less than 1% and an FNhigh rate of 5.6%, the lowest false positive rate among the compared methods.

[Table 4. Confusion between soluble and membrane proteins: false positive rates (%) on the soluble protein set, and false negative rates (%) on the low- and high-resolution membrane protein sets, for SVMtmh, TMHMM2, SOSUI, PHDpsiHtm08, PHDhtm08, and the remaining methods of Table 3.]

3.6. Effect of the alternating geometric scoring function on topology accuracy

We characterize the dependency of topology accuracy (TOPO) on the values of the base (b) and the exponent increment (EI) used in the alternating geometric scoring function for the low-resolution data set. Fig. 5 shows the relationship between topology accuracy, coded by colours, and the two variables in the scoring function. The white circles indicate the highest topology accuracy, at about 84%, and their corresponding values of b and EI. The region in which half of the white circles (8/16) occur falls in intermediate ranges of b and EI, beginning at about 1.5 and 0.5, respectively.
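The discrimination rule of Section 3.5 reduces to a one-line test; the function below is a hypothetical sketch using the cut-off length of 18 chosen above.

```python
# Sketch of soluble/membrane discrimination by minimum predicted-TMH
# length: a protein with no predicted helix of at least `cutoff` residues
# is classified as soluble (cutoff = 18 minimizes FP + FN_low + FN_high).
def is_membrane_protein(predicted_helix_lengths, cutoff=18):
    return any(length >= cutoff for length in predicted_helix_lengths)
```

A protein whose longest predicted helix is 21 residues is therefore called a membrane protein, while one whose predictions never reach 18 residues is called soluble.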

The set of values for (b, EI) we choose for the scoring function is (1.6, 1.0).

An interesting observation is that low topology accuracy (80%: blue and 79%: navy) occurs in the vertical-left, lower-horizontal, and upper-right regions of Fig. 5. In the vertical-left (b = 1) and lower-horizontal (EI = 0) regions, the scoring function is simplified to assigning an equal weight of 1 to all loop signals regardless of their distance from the N-terminus, and the poor accuracy there is a result of considering the contribution of every signal in the loop segments equally. Conversely, when both b and EI are large, in the upper-right region, the scoring function assigns very small weights to the loop signals downstream of the N-terminus, and the poor performance is due to the contribution from downstream signals being made negligible. Our analysis therefore supports the assumptions we have made about our scoring function: 1) topology formation is a result of contributing signals distributed along the protein sequence, particularly in the loop regions, and 2) the contribution of each downstream loop segment to the first loop segment is not equal and diminishes as a function of distance away from the N-terminus. Our results suggest that the inclusion of both assumptions in modeling membrane protein topology is a key factor in achieving the best topology accuracy.

3.7. Performance on newly solved structures and analysis of bacteriorhodopsin

To illustrate the performance of the top four methods on the high- and low-resolution data sets as shown in Table 3, we test four recently solved membrane protein structures not included in the training set. The results are shown in Table 5. In terms of topology prediction, most methods predict correctly for all four proteins. The best predicted protein is a photosynthetic reaction center protein (PDB ID: 1umx_L), for which all methods predict all helices correctly (Qok = 100%). However, only two methods are capable of predicting all the helices of a bacteriorhodopsin (bR) structure (PDB ID: 1tn0_A) correctly (Qok = 100%).

[Table 5. Performance of the four top-performing approaches shaded in Table 3 (SVMtmh, TMHMM2, PHDpsiHtm08, HMMTOP2) on newly solved membrane proteins 1xfh_A (Nin), 1vfp_A (Nin), 1umx_L (Nin), and 1tn0_A (Nout), in per-segment and per-residue measures. Proteins are indicated by their PDB codes and observed topologies. Nin: N-terminal loop on the inside of the membrane; Nout: N-terminal loop on the outside. PRED_TOPO: predicted topology.]

We devote our analysis to bR to illustrate that TMH prediction is by no means a trivial task, and that continuous development in this area is indispensable in advancing our understanding of membrane protein structures. Bacteriorhodopsin is a member of the rhodopsin family and is characterized by seven distinct transmembrane helices, indexed from Helix A to G. Fig. 6(a) displays the high-resolution structure of bR from PDB. Studies of synthetic peptides of each of the seven TM helices of bR have shown that Helix A to Helix E can form independently stable helices when inserted into a lipid bilayer26. Helix G does

[Fig. 5. The relationship between base (b, x-axis) and exponent increment (EI, y-axis) in the alternating geometric scoring function and topology accuracy (TOPO) for the low-resolution data set, divided into 8 colour-coded levels. The best accuracy (84%) and its associated (b, EI) values occur within the white circles.]

[Fig. 6(a). The structure of bacteriorhodopsin (bR) (PDB ID: 1tn0_A), with the N-terminus on the extracellular side and the C-terminus on the cytoplasmic side. Each helix is coloured and indexed from A to G. Figure prepared with ViewerLite29.]

[Fig. 6(b). Prediction results for bR by the top four methods (* = predicted helix), aligned against the sequence with the observed helices indicated by colour boxes. The region of Helix G and its predictions are highlighted in grey.]

not form a stable helix in detergent micelles27 and exhibits structural irregularity at Lys216 by forming a pi-bulge28.

The results of the predictions by all four approaches are shown in Fig. 6(b). Interestingly, all approaches are successful in identifying the first six helices with good accuracy. However, most methods do not predict Helix G with the same level of success. The poor prediction results may be due to the intrinsic structural irregularity described earlier. TMHMM2 misses Helix G entirely, and PHDpsiHtm08 merges its predictions for Helix F and Helix G into one long helix. HMMTOP2 over-predicts by 3 residues at the N-terminus and severely under-predicts by 9 residues at the C-terminus, whereas SVMtmh only under-predicts by 2 residues at the N-terminus of Helix G. Despite the difficulties involved in predicting the correct location of Helix G, SVMtmh and HMMTOP211 are the only two of the four methods that correctly identify the presence of Helix G, despite its atypical structure. Helix G is important in the function of bR, as it binds retinal and undergoes a conformational change during the photosynthetic cycle28.

One possible reason for our success in this case could be the integration of multiple biological input features that encompass both global and local information for TMH prediction. TMHMM2 and HMMTOP2 rely solely on amino acid composition as sequence information, while PHDpsiHtm08 only uses sequence information from multiple sequence alignments. In contrast, SVMtmh incorporates a combination of both physico-chemical and sequence-based input features for helix prediction.

4. CONCLUSION

We have proposed an approach based on SVMs in a hierarchical framework to predict transmembrane helices and topology in two successive steps. We demonstrate that by separating the prediction problem between two classifiers, specific biological input features associated with the individual classifiers can be applied more effectively. By integrating both sequence and structural input features and using a novel topology scoring function, SVMtmh achieves comparable or better per-segment and topology accuracy for both the high- and low-resolution data sets. When tested for confusion between membrane and soluble proteins, SVMtmh discriminates between them with the lowest false positive rate compared to the other methods. We further analyze a set of newly solved structures and show that SVMtmh is capable of predicting the correct helices and topology of bacteriorhodopsin, in close agreement with those derived from a high-resolution experiment.

With regard to future work, we will continue to enhance the performance of our approach by incorporating more relevant features in both stages of helix and topology prediction. We will also consider some complexities of TM helices, including helix lengths, tilts, and structural motifs, which add another level of complexity to the TMH prediction problem. While obtaining high-resolution structures for membrane proteins remains a major challenge in the field of structural biology, accurate prediction methods are in high demand. Supported by the results we achieved, our approach could prove valuable for genome-wide predictions to identify potential integral membrane proteins and their topologies. We believe that the continuous development of computational methods with the integration of biological knowledge in this area will be immensely fruitful.

Acknowledgments

We gratefully thank Jia-Ming Chang, Hsin-Nan Lin, Wei-Neng Hung, and Wen-Chi Chou for providing helpful discussions and computational assistance. This work was supported in part by the thematic program of Academia Sinica under grants AS94B003 and AS95ASIA02.

References

1. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001; 305: 567-580.
2. Wallin E, von Heijne G. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 1998; 7: 1029-1038.
3. Stevens TJ, Arkin IT. The effect of nucleotide bias upon the composition and prediction of transmembrane helices. Protein Sci 2000; 9: 505-511.
4. Cao B, Porollo A, Adamczak R, Jarrell M, Meller J. Enhanced recognition of protein transmembrane domains with prediction-based structural profiles. Bioinformatics 2006; 22: 303-309.
5. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000; 28: 235-242.
6. Goder V, Spiess M. Topogenesis of membrane proteins: determinants and dynamics. FEBS Letters 2001; 504: 87-93.
7. von Heijne G. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 1992; 225: 487-494.
8. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982; 157: 105-132.
9. Eisenberg D, Weiss RM, Terwilliger TC. The hydrophobic moment detects periodicity in protein hydrophobicity. Proc Natl Acad Sci USA 1984; 81: 140-144.
10. White SH, Wimley WC. Membrane protein folding and stability: physical principles. Annu Rev Biophys Biomol Struct 1999; 28: 319-365.
11. Tusnady GE, Simon I. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 1998; 283: 489-506.
12. Rost B, Fariselli P, Casadio R. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 1996; 5: 1704-1718.
13. Chen CP, Kernytsky A, Rost B. Transmembrane helix predictions revisited. Protein Sci 2002; 11: 2774-2791.
14. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
15. Ubarretxena-Belandia I, Engelman DM. Helical membrane proteins: diversity of functions in the context of simple architecture. Curr Opin Struct Biol 2001; 11: 370-376.
16. Hessa T, Kim H, Bihlmaier K, Lundin C, Boekel J, Andersson H, Nilsson I, White SH, von Heijne G. Recognition of transmembrane helices by the endoplasmic reticulum translocon. Nature 2005; 433: 377-381.
17. Mitaku S, Hirokawa T, Tsuji T. Amphiphilicity index of polar amino acids as an aid in the characterization of amino acid preference at membrane-water interfaces. Bioinformatics 2002; 18: 608-616.
18. Zhou H, Zhou Y. Predicting the topology of transmembrane helical proteins using mean burial propensity and a hidden-Markov-model-based method. Protein Sci 2003; 12: 1547-1555.
19. Jayasinghe S, Hristova K, White SH. Energetics, stability, and prediction of transmembrane helices. J Mol Biol 2001; 312: 927-934.
20. van Geest M, Lolkema JS. Membrane topology and insertion of membrane proteins: search for topogenic signals. Microbiol Mol Biol Rev 2000; 64: 13-33.
21. Popot JL, Engelman DM. Membrane protein folding and oligomerization: the two-stage model. Biochemistry 1990; 29: 4031-4037.
22. Moller S, Kriventseva EV, Apweiler R. A collection of well characterised integral membrane proteins. Bioinformatics 2000; 16: 1159-1160.
23. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database: its relevance to human molecular medical research. J Mol Med 1997; 75: 312-316.
24. White SH. The progress of membrane protein structure determination. Protein Sci 2004; 13: 1948-1949.
25. Wu TF, Lin CJ, Weng RC. Probability estimates for multi-class classification by pairwise coupling. JMLR 2004; 5: 975-1005.
26. Hunt JF, Earnest TN, Bousche O, Kalghatgi K, Reilly K, Horvath C, Rothschild KJ, Engelman DM. A biophysical study of integral membrane protein folding. Biochemistry 1997; 36: 15156-15176.
27. Booth PJ. Unravelling the folding of bacteriorhodopsin. Biochim Biophys Acta 2000; 1460: 4-14.
28. Luecke H, Schobert B, Richter HT, Cartailler JP, Lanyi JK. Structure of bacteriorhodopsin at 1.55 A resolution. J Mol Biol 1999; 291: 899-911.
29. ViewerLite for molecular visualization. Software available at http://www.iaici.or.jp/sci/viewer.htm.

PROTEIN FOLD RECOGNITION USING THE GRADIENT BOOST ALGORITHM

Feng Jiao*
School of Computer Science, University of Waterloo, Waterloo, Canada
fjiao@cs.uwaterloo.ca

Jinbo Xu†
Toyota Technological Institute at Chicago, USA
j3xu@tti-c.org

Libo Yu
Bioinformatics Solutions Inc., Waterloo, Canada
libo@bioinformaticssolutions.com

Dale Schuurmans
Department of Computing Science, University of Alberta, Canada
dale@cs.ualberta.ca

Protein structure prediction is one of the most important and difficult problems in computational molecular biology. Protein threading represents one of the most promising techniques for this problem. One of the critical steps in protein threading, called fold recognition, is to choose the best-fit template for the query protein with the structure to be predicted. The standard method for template selection is to rank candidates according to the z-score of the sequence-template alignment. However, the z-score calculation is time-consuming, which greatly hinders structure prediction at a genome scale. In this paper, we present a machine learning approach that treats the fold recognition problem as a regression task and uses a least-squares boosting algorithm (LS_Boost) to solve it efficiently. We test our method on Lindahl's benchmark and compare it with other methods. According to our experimental results we can draw the following conclusions: (1) Machine learning techniques offer an effective way to solve the fold recognition problem. (2) Formulating protein fold recognition as a regression rather than a classification problem leads to a more effective outcome. (3) Importantly, the LS_Boost algorithm does not require the calculation of the z-score as an input, and therefore can obtain significant computational savings over standard approaches. (4) The LS_Boost algorithm obtains superior accuracy, with less computation for both training and testing, than alternative machine learning approaches such as SVMs and neural networks, which also need not calculate the z-score. Finally, by using the LS_Boost algorithm, one can identify important features in the fold recognition protocol, something that cannot be done using a straightforward SVM approach.

* Work performed at the Alberta Ingenuity Centre for Machine Learning.
† Contact author.

1. INTRODUCTION

In the post-genomic era, understanding protein function has become a key step toward modelling complete biological systems. It has been established that the functions of a protein are directly linked to its three-dimensional structure. Unfortunately, current "wet-lab" methods used to determine the three-dimensional structure of a protein are costly, time-consuming and sometimes unfeasible. The ability to predict a protein's structure directly from its sequence is urgently needed in the post-genomic era, where protein sequences are becoming available at a far greater rate than the corresponding structure information. Protein structure prediction is one of the most important and difficult problems in computational molecular biology. In recent years, protein threading has turned out to be one of the most successful approaches to this problem 7, 14, 15. Protein threading predicts protein structures by using statistical knowledge of the relationship between protein sequences and structures. The prediction is made by aligning each amino acid in the target sequence to a position in a template structure and evaluating how well the target fits the template.

After aligning the sequence to each template in the structural template database, the next step is to separate the correct templates from the incorrect templates for the target sequence, a step we refer to as template selection or fold recognition. The traditional fold recognition technique is based on calculating the z-score, which statistically tests the possibility of the target sequence folding into a structure very similar to the template 3. In this technique, the z-score is calculated for each sequence-template alignment by first determining the distribution of alignment scores among random re-shufflings of the sequence, and then comparing the alignment score of the correct sequence (in standard deviation units) to the average alignment score over random sequences. Note that the z-score calculation requires the alignment score distribution to be determined by randomly shuffling the sequence many times (approx. 100 times), meaning that the shuffled sequence has to be threaded to the template repeatedly. Thus, the entire process of calculating the z-score is very time-consuming.

In this paper, instead of using the traditional z-score technique, we propose to solve the fold recognition problem by treating it as a machine learning problem. Several research groups have already proposed machine learning methods, such as neural networks 9, 23 and support vector machines (SVMs) 20, 22, for fold recognition. In this general framework, one generates a set of features to describe each sequence-template alignment, treats the extracted features as input data, and uses the alignment accuracy or similarity level as a response variable. Thus, the fold recognition problem can be expressed as a standard prediction problem that can be solved by supervised machine learning techniques for regression or classification.

We investigate a new approach that proves to be simpler to implement, more accurate and more computationally efficient than alternatives such as SVM regression, SVM classification, neural networks and Bayes classification. We combine the gradient boosting algorithm of Friedman 5 with a least-squares loss criterion to obtain a least-squares boosting algorithm, LS_Boost. We use LS_Boost to estimate the alignment accuracy of each sequence-template alignment and employ this as part of our fold recognition technique. It is also a much easier algorithm to implement.

The remainder of the paper is organized as follows. We first briefly introduce the idea of using protein threading for protein structure prediction. We then show how to generate features from each sequence-template alignment and convert protein threading into a standard prediction problem, making it amenable to supervised machine learning techniques. We discuss how to design the least-squares boosting algorithm by combining gradient boosting with a least-squares loss criterion, and then describe how to use our algorithm to solve the fold recognition problem. To evaluate our approach, we describe our experimental set-up and test the algorithm on Lindahl's benchmark 12, comparing the resulting performance with other fold recognition methods and leading to the conclusions we present in the end. Our experimental results demonstrate that the LS_Boost method outperforms the other techniques in terms of both prediction accuracy and computational efficiency.

2. PROTEIN THREADING AND FOLD RECOGNITION

2.1. The threading method for protein structure prediction

The idea of protein threading originated from the observation that the number of different structural folds in nature may be quite small, perhaps two orders of magnitude fewer than the number of known protein sequences 11. Thus, the structure prediction problem can potentially be reduced to a problem of recognition: choosing a known structure into which the target sequence will fold. Or, put another way, protein threading is in fact a database search technique, where given a query sequence of unknown structure, one searches a structure (template) database and finds the best-fit structure for the given sequence. After the best-fit template is chosen, the structural model of the sequence is built based on the alignment between the sequence and the chosen template.

In general, protein threading typically consists of the following four steps:

(1) Build a template database of representative three-dimensional protein structures, which usually involves removing highly redundant structures.
(2) Design a scoring function to measure the fitness between the target sequence and the template, based on the knowledge of the known relationship between structures and sequences. Usually, the minimum value of the scoring function corresponds to the optimal sequence-template alignment.
(3) Find the best alignment between the target sequence and the template by minimizing the scoring function.
(4) Choose the best-fit template for the sequence according to some criterion.

In this paper, we only focus on the final step, i.e., choosing the best template for the sequence, which is called fold recognition. Currently, there are two different approaches: the z-score method 3 and the machine learning method 9, 23.

2.2. The z-score method for fold recognition

The z-score is defined to be the "distance" (in standard deviation units) between the optimal alignment score and the mean alignment score obtained by randomly shuffling the target sequence. An accurate z-score can cancel out the sequence composition bias and offset the mismatch between the sequence size and the template length. Bryant et al. 3 proposed the following procedure to calculate the z-score:

(1) Shuffle the aligned sequence residues randomly.
(2) Find the optimal alignment between the shuffled sequence and the template.
(3) Repeat the above two steps N times, where N is on the order of one hundred.

After the N alignment scores are obtained, calculate the distribution of these N alignment scores and the deviation of the optimal alignment score from this distribution. We can see from the above that in order to calculate the z-score for each sequence-template alignment, we need to shuffle and re-thread the target sequence many times, which takes a significant amount of time and essentially prevents this technique from being applied to genome-scale structure prediction.

2.3. Machine learning methods for fold recognition

Another approach to the fold recognition problem is to use machine learning methods, such as neural networks, as in the GenTHREADER 9 and PROSPECT-I 23 systems, or SVMs, as in the RAPTOR system 22. Current machine learning methods generally treat the fold recognition problem as a classification problem, whose output is then used to differentiate the similarity level between proteins. However, there is a limitation to the classification approach that arises when one realizes that there are three levels of similarity one can draw between two proteins: fold-level similarity, superfamily-level similarity and family-level similarity. Classification-based methods treat the three different similarity levels as a single level, and thus are unable to effectively differentiate one similarity level from another while maintaining a hierarchical relationship between the three levels. Even a multi-class classifier cannot deal with this limitation very well, since the three levels are in a hierarchical relationship.

In our approach, we instead use regression, which simply uses the alignment accuracy as the response value. That is, we reformulate the fold recognition problem as predicting the alignment accuracy of a threading pair. We use our existing protein threading server RAPTOR 21, 22 to generate all the sequence-structure alignments, and we use SARF 2 to generate the correct alignment between the target protein and the template protein. The alignment accuracy of a threading pair is defined to be the number of correctly aligned positions, based on the correct alignment generated by SARF. A position is correctly aligned only if its alignment position is no more than four position shifts away from its correct alignment. Usually, the higher the similarity level between two proteins, the higher the value of the alignment accuracy will be. Thus alignment accuracy can help to effectively differentiate the three similarity levels. Below we will show in our experiments that the regression approach obtains much better results than the standard classification approach.
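The within-four-shifts count above can be made concrete with a minimal sketch. The map-based alignment representation and the function name are our illustrative assumptions; SARF itself is not reimplemented here.

```python
def alignment_accuracy(predicted, reference, max_shift=4):
    """Count correctly aligned positions: a sequence position counts as
    correct if its predicted template position is within max_shift of the
    position in the reference (structurally correct) alignment."""
    correct = 0
    for pos, ref_tpos in reference.items():
        pred_tpos = predicted.get(pos)
        if pred_tpos is not None and abs(pred_tpos - ref_tpos) <= max_shift:
            correct += 1
    return correct
```

Under this representation, unaligned positions simply contribute nothing to the count.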

3. FEATURE EXTRACTION

One of the key steps in the machine learning approach is to choose a set of proper features to be used as inputs for predicting the similarity between two proteins. We generate the following features from each threading pair:

(1) Sequence size, which is the number of residues in the sequence.
(2) Template size, which is the number of residues in the template.
(3) Alignment length, which is the number of aligned residues. If the alignment length is considerably smaller than the sequence size or the template size, it might indicate that the sequence is aligned to an incomplete domain of the template, which is not good since the sequence should fold into a complete structure, and therefore the two proteins may not be at the same similarity level.
(4) Sequence identity. Although a low sequence identity does not imply that two proteins are not similar, a high sequence identity can indicate that two proteins should be considered as similar.
(5) Number of contacts with both ends being aligned to the sequence. There is a contact between two residues if their spatial distance is within a given cutoff. Usually, a longer protein should have more contacts.
(6) Number of contacts with only one end being aligned to the sequence. If this number is big, then it might indicate that the quality of the alignment is bad.
(7) Total alignment score.
(8) Mutation score, which measures the sequence similarity between the target protein and the template protein.
(9) Environment fitness score, which measures how well a residue fits into a specific environment.
(10) Alignment gap penalty. When aligning a sequence and a template, some gaps are allowed; however, if there are too many gaps, then it indicates that this threading pair is unlikely to be in the same SCOP class.
(11) Secondary structure compatibility score, which measures the secondary structure difference between the template and the sequence in all positions.
(12) Pairwise potential score, which characterizes the capability of a residue to make a contact with another residue.
(13) The z-score of the total alignment score and the z-scores of single score items such as the mutation score, environment fitness score, secondary structure score and pairwise potential score. Notice that here we still take the traditional z-score into consideration for the sake of performance comparison. But later we will show that we can obtain nearly the same performance without using the z-score, which means it is unnecessary to calculate the z-score as one of the features.

We use the alignment accuracy, calculated between the target protein and the template protein with the structure comparison program SARF, as the response variable. After optimally threading a given sequence to each template in the database, we can estimate the alignment accuracy for each sequence-template alignment. Then all the sequence-template alignments can be ranked based on the predicted alignment accuracy, and the first-ranked one is chosen as the best alignment for the sequence. Thus we have converted the protein structure prediction problem into a function estimation problem: given a training set with input feature vectors and response variables, we need to find a prediction function that maps the features to the response variable. Let x denote the feature vector and y the alignment accuracy. In the next section, we will show how to design our LS_Boost algorithm by combining the gradient boosting algorithm of Friedman 5 with a least-squares loss criterion.
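Feature (13) is the only expensive one. As an illustration of its cost, the shuffling procedure behind the z-score can be sketched as below; `alignment_score` stands in for a real threading scoring routine and is an assumption of this sketch, not the scoring function used by RAPTOR.

```python
import random

def z_score(sequence, template, alignment_score, n_shuffles=100, seed=0):
    """Monte Carlo z-score: deviation of the optimal alignment score from
    the score distribution over randomly shuffled sequences.  Every shuffle
    requires re-threading, which is what makes the z-score expensive."""
    rng = random.Random(seed)
    optimal = alignment_score(sequence, template)
    scores = []
    for _ in range(n_shuffles):
        residues = list(sequence)
        rng.shuffle(residues)                 # shuffle the aligned residues
        scores.append(alignment_score("".join(residues), template))
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = var ** 0.5 or 1.0                   # guard against zero spread
    return (optimal - mean) / std
```

The n_shuffles calls to the scoring routine per sequence-template pair are exactly the overhead the regression approach avoids.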

4. LEAST-SQUARES BOOSTING ALGORITHM FOR FOLD RECOGNITION

The problem can be formulated as follows. Given an input variable x, a response variable y and some samples {y_i, x_i}, i = 1, ..., N, we want to find a function F*(x) that can predict y from x such that, over the joint distribution of {y, x} values, the expected value of a specific loss function L(y, F(x)) is minimized 5:

    F*(x) = argmin_{F(x)} E_{y,x} L(y, F(x)) = argmin_{F(x)} E_x [ E_y L(y, F(x)) | x ].    (1)

The loss function is used to measure the deviation between the real y value and the predicted y value. Normally F(x) is a member of a parameterized class of functions F(x; P), where P is a set of parameters; when we wish to estimate F(x) non-parametrically the task becomes more difficult. In general, we can choose a parameterized model F(x; P) and change the function optimization problem to a parameter optimization problem. We use the form of the "additive" expansions to design the function as follows:

    F(x; P) = sum_{m=0}^{M} beta_m h(x; alpha_m),    (2)

where P = {beta_m, alpha_m}, m = 0, ..., M. The functions h(x; alpha) are usually simple functions of x with parameters alpha = {alpha_1, alpha_2, ..., alpha_M}. A typical parameter optimization method is a "greedy-stagewise" approach: we optimize {beta_m, alpha_m} after all of the {beta_i, alpha_i} (i = 0, 1, ..., m-1) have been optimized,

    (beta_m, alpha_m) = argmin_{beta, alpha} sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + beta h(x_i; alpha)),    (3)

    F_m(x) = F_{m-1}(x) + beta_m h(x; alpha_m).    (4)

Friedman proposed a steepest-descent method to solve the optimization problem described in Equation 2 5. This algorithm is called the Gradient Boosting algorithm and its entire procedure is given in Figure 1.

Algorithm 1: Gradient_Boost
  • Initialize F_0(x) = argmin_{rho} sum_{i=1}^{N} L(y_i, rho)
  • For m = 1 to M do:
      Step 1. Compute the negative gradient:
          ~y_i = -[ dL(y_i, F(x_i)) / dF(x_i) ] evaluated at F(x) = F_{m-1}(x),  i = 1, ..., N
      Step 2. Fit a model:
          alpha_m = argmin_{alpha, beta} sum_{i=1}^{N} [ ~y_i - beta h(x_i; alpha) ]^2
      Step 3. Choose a gradient descent step size:
          rho_m = argmin_{rho} sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + rho h(x_i; alpha_m))
      Step 4. Update the estimation of F(x):
          F_m(x) = F_{m-1}(x) + rho_m h(x; alpha_m)
  • end for
  • Output the final regression function F_M(x)

Fig. 1. Gradient boosting algorithm.

By employing the least-squares loss function, L(y, F) = (y - F)^2 / 2, we obtain the least-squares boosting algorithm shown in Figure 2.

Algorithm 2: LS_Boost
  • Initialize F_0 = y-bar = (1/N) sum_{i=1}^{N} y_i
  • For m = 1 to M do:
      ~y_i = y_i - F_{m-1}(x_i),  i = 1, ..., N
      (rho_m, alpha_m) = argmin_{rho, alpha} sum_{i=1}^{N} [ ~y_i - rho h(x_i; alpha) ]^2
      F_m(x) = F_{m-1}(x) + rho_m h(x; alpha_m)
  • end for
  • Output the final regression function F_M(x)

Fig. 2. LS_Boost algorithm.
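Algorithm 2 can be rendered as a short runnable sketch. Following the paper's choice of single-feature simple functions, each round fits h(x) = a x + b on one feature of the current residuals and keeps the best-fitting feature; all names here are our illustrative assumptions, not the authors' implementation.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    lxx = n * sum(x * x for x in xs) - sx * sx
    lxy = n * sum(x * y for x, y in zip(xs, ys)) - sx * sy
    a = lxy / lxx if lxx else 0.0
    b = (sy - a * sx) / n
    return a, b

def ls_boost(X, y, n_rounds=50):
    """LS_Boost sketch: start from the mean response, then repeatedly fit a
    one-feature linear model to the current residuals, keep the feature with
    the smallest squared error, and add its prediction to the ensemble."""
    n, d = len(X), len(X[0])
    pred = [sum(y) / n] * n
    ensemble = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]   # negative gradient
        best = None
        for j in range(d):
            xj = [row[j] for row in X]
            a, b = fit_line(xj, resid)
            err = sum((r - (a * x + b)) ** 2 for x, r in zip(xj, resid))
            if best is None or err < best[0]:
                best = (err, j, a, b)
        _, j, a, b = best
        ensemble.append((j, a, b))
        pred = [p + a * row[j] + b for p, row in zip(pred, X)]
    return ensemble, pred
```

On data with an exact linear relation between one feature and the response, a single round already recovers the fit, which is the behavior the convergence discussion in Section 5 relies on.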

The parameter rho_m is calculated as

    (rho_m, alpha_m) = argmin_{rho, alpha} sum_{i=1}^{N} [ ~y_i - rho h(x_i; alpha_m) ]^2.    (5)

The simple function h(x; alpha) can have any form that can be conveniently optimized over alpha. In our application, for each round we choose one feature and obtain the simple function h(x; alpha) with the minimum least-squares error. The underlying reasons for choosing a single feature at each round are: i) we would like to see the role of each feature in fold recognition, and ii) we notice that the alignment accuracy is proportional to some features. For example, Figure 3 shows the relation between alignment accuracy and mutation score: the lower the mutation score, the higher the alignment accuracy.

[Fig. 3. The relation between alignment accuracy and mutation score (x-axis: alignment accuracy, 100-800).]

The simplest function to use here is the linear regression function

    y = a x + b,    (6)

where x is the input feature and y is the alignment accuracy. The parameters of the linear regression function can be solved easily by

    a = l_xy / l_xx,    b = y-bar - a x-bar,

where

    l_xx = n sum_{i=1}^{n} x_i^2 - (sum_{i=1}^{n} x_i)^2,
    l_xy = n sum_{i=1}^{n} x_i y_i - (sum_{i=1}^{n} x_i)(sum_{i=1}^{n} y_i).

There are many other simple functions one can use, such as an exponential function y = a + e^{bx}, a quadratic function y = a x^2 + b x + c, a logarithmic function y = a + b ln x, or a hyperbolic function y = a + b/x. In this paper, for considerations of speed, we choose functions for which it is easy to obtain alpha. In terms of boosting, optimizing over alpha to fit the training data is called weak learning. In the end, we combine these simple functions to form the final regression function.

Assume the initial predicted alignment accuracy is the average alignment accuracy of the training data. Algorithm 2 then translates to the following procedure:

(1) Calculate the difference between the real alignment accuracy and the predicted alignment accuracy. We call this difference the alignment accuracy residual.
(2) Choose the single feature which correlates best with the alignment accuracy residual. The alignment accuracy residual is then predicted using this chosen feature, with the parameter rho calculated by using Equation 5.
(3) Update the predicted alignment accuracy by adding the predicted alignment accuracy residual.

Repeat the above steps until the predicted alignment accuracy does not change significantly.

5. EXPERIMENTAL RESULTS

When one protein structure is to be predicted, we thread its sequence to each template in the database and obtain the predicted alignment accuracy using the LS_Boost algorithm. We choose the template with the highest alignment accuracy as the basis to build the structure of the target sequence. We can describe the relationship between two proteins at three different levels: the family level, the superfamily level and the fold level. If two proteins are similar at the family level, then these two proteins have evolved from a common ancestor and usually share more than 30% sequence identity. If two proteins are similar only at the fold level, then their structures are similar even though their sequences are not similar. Superfamily-level similarity is something in between family level and fold level. If the target sequence has a template that is in the same family as the sequence, then it is easier to predict the structure of the sequence. If two proteins are similar only at the fold level, it means they share less sequence similarity and it is harder to predict their relationship. We use the SCOP database 16 to judge the similarity between two proteins and evaluate our predicted results at the different levels. If the predicted template is similar to the target sequence at the family level according to the SCOP database, we treat it as a correct prediction at the family level. If the predicted template is similar at the superfamily level but not at the family level, then we assess this prediction as being correct at the superfamily level. Similarly, if the predicted template is similar at the fold level but not at the other two levels, we assess the prediction as correct at the fold level. When we say a prediction is correct according to the Top K criterion, we mean that there are no more than K - 1 incorrect predictions ranked before this prediction. The fold-level relationship is the hardest to predict because two proteins share very little sequence similarity in this case. To train the parameters in our algorithm, we randomly choose 300 templates from the FSSP list 1 and 200 sequences from Holm's test set 6. By threading each sequence to all the templates, we obtain a set of 60,000 training examples. To test the algorithm, we use Lindahl's benchmark, which contains 976 proteins, each pair of which shares at most 40% sequence identity.
By threading each one against all the others, we obtain a set of 976 x 975 threading pairs. Since the training set is chosen randomly from a set of non-redundant proteins, the overlap between the training set and Lindahl's benchmark is fairly small, no more than 0.4 percent of the whole test set. To ensure the complete separation of training and testing sets, these overlapping pairs are removed from the test data. We calculate the recognition rate of each method at the three similarity levels.

5.1. Sensitivity

Figure 4 shows the sensitivity of our algorithm at each round. We can see that the LS_Boost algorithm nearly converges within 100 rounds, although we train the algorithm further to obtain higher performance.
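The Top-K criterion defined above can be stated directly in code; representing each target's ranked template list as boolean correctness flags is our assumption.

```python
def top_k_correct(ranked_flags, k):
    """ranked_flags: for one target, correctness flags of its templates
    sorted by predicted alignment accuracy (best first).  The prediction is
    correct under Top-K if a correct template appears with no more than
    k - 1 incorrect templates ranked before it."""
    incorrect_before = 0
    for is_correct in ranked_flags:
        if is_correct:
            return incorrect_before <= k - 1
        incorrect_before += 1
    return False

def top_k_sensitivity(all_rankings, k):
    """Fraction of targets whose prediction is correct under Top-K."""
    return sum(top_k_correct(r, k) for r in all_rankings) / len(all_rankings)
```

Top-1 then reduces to "the first-ranked template is correct", which matches how the best-fit template is chosen.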

Fig. 4. Sensitivity curves during the training process, according to the Top 1 and Top 5 criteria at the family, superfamily and fold levels (x-axis: number of training rounds, 0-500).

Table 1 lists the results of our algorithm against several other algorithms. PROSPECT II uses the z-score method, and its results are taken from Kim et al.'s paper 10. We can see that the LS_Boost algorithm is better than PROSPECT II at all three levels. The results for the other methods are taken from Shi et al.'s paper 18. Here we can see that our method apparently outperforms the other methods. However, since we use different sequence-structure alignment methods, this disparity may be partially due to different threading techniques. Nevertheless, we can see that the machine learning approaches normally perform much better than the other methods. Table 2 shows the results of our algorithm against several other popular machine learning methods. Here we will not describe the details of each method. In this experiment, we use RAPTOR to generate all the sequence-template alignments. For each different method, we tune the parameters on the training set and test the model on the test set. In total we test the following six other machine learning methods.


Table 1. Sensitivity of the LS_Boost method compared with other structure prediction servers.

                       Family Level        Superfamily Level   Fold Level
Method                 Top 1     Top 5     Top 1     Top 5     Top 1     Top 5
RAPTOR (LS_Boost)      86.5%     89.2%     60.2%     74.4%     38.8%     61.7%
PROSPECT II            84.1%     88.2%     52.6%     64.8%     27.7%     50.3%
FUGUE                  82.3%     85.8%     41.9%     53.2%     12.5%     26.8%
PSI-BLAST              71.2%     72.3%     27.4%     27.9%      4.0%      4.7%
HMMER-PSIBLAST         67.7%     73.5%     20.7%     31.3%      4.4%     14.6%
SAMT98-PSIBLAST        70.1%     75.4%     28.3%     38.9%      3.4%     18.7%
BLASTLINK              74.6%     78.9%     29.3%     40.6%      6.9%     16.5%
SSEARCH                68.6%     75.7%     20.7%     32.5%      5.6%     15.6%
THREADER               49.2%     58.9%     10.8%     24.7%     14.6%     37.7%

Table 2. Performance comparison of seven machine learning methods. The sequence-template alignments are generated by RAPTOR.

                         Family Level        Superfamily Level   Fold Level
Method                   Top 1     Top 5     Top 1     Top 5     Top 1     Top 5
LS_Boost                 86.5%     89.2%     60.2%     74.4%     38.8%     61.7%
SVM (regression)         85.0%     89.1%     55.4%     71.8%     38.6%     60.6%
SVM (classification)     82.6%     83.6%     45.7%     58.8%     30.4%     52.6%
AdaBoost                 82.8%     84.1%     50.7%     61.1%     32.2%     53.3%
Neural Networks          81.1%     83.2%     47.4%     58.3%     30.1%     54.8%
Bayes classifier         69.9%     72.5%     29.2%     42.6%     13.6%     40.0%
Naive Bayes classifier   68.0%     70.8%     31.0%     41.7%     15.1%     37.4%

(1) SVM regression. Support vector machines are based on the concept of structural risk minimization from statistical learning theory 19. The fold recognition problem is treated as a regression problem, therefore we consider SVMs used for regression. Here we use the SVM-light software package 8 and an RBF kernel to obtain the best performance. As shown in Table 2, LS_Boost performs slightly better than SVM regression. (2) SVM classification. The fold recognition problem is treated as a classification problem, and we consider an SVM for classification. The software and kernel we consider are the same as for SVM regression. In this case, one can see that SVM classification performs worse than SVM regression, especially at the superfamily level and the fold level. (3) AdaBoost. Boosting is a procedure that combines the outputs of many "weak" classifiers to produce a powerful "committee". We use the standard AdaBoost algorithm 4 for classification, which is similar to LS_Boost except that it performs classification rather than regression and uses the exponential instead of the least-squares loss function. The AdaBoost algorithm achieves a comparable result to SVM classification but is

worse than both of the regression approaches, LS_Boost and SVM regression. (4) Neural networks. Neural networks are one of the most popular methods used in machine learning 17. Here we use a multi-layer perceptron for classification, based on the Matlab neural network toolbox. The performance of the neural network is similar to SVM classification and AdaBoost. (5) Bayesian classifier. A Bayesian classifier is a probability-based classifier which assigns a sample to a class based on the probability that it belongs to the class 13. (6) Naive Bayesian classifier. The Naive Bayesian classifier is similar to the Bayesian classifier except that it assumes that the features of each class are independent, which greatly decreases the computation 13. We can see that both the Bayesian classifier and the Naive Bayesian classifier obtain poor performance. Our experimental results show clearly that: (1) The regression-based approaches demonstrate better performance than the classification-based approaches. (2) LS_Boost performs slightly better than SVM regression and significantly better than the other methods. (3) The computational efficiency of

LS_Boost is much better than SVM regression, SVM classification and the neural network. One of the advantages of our boosting approach over SVM regression is its ability to identify important features, since at each round LS_Boost only chooses a single feature to approximate the alignment accuracy residual. The following are the top five features chosen by our algorithm: (1) sequence identity; (2) total alignment score; (3) fitness score; (4) mutation score; (5) pairwise potential score. The corresponding simple functions associated with each feature are all linear regression functions y = ax + b, showing that there is a strong linear relation between the features and the alignment accuracy. For example, from Figure 3 we can see that the linear regression function is the best fit.

5.2. Specificity

We further examine the specificity of the LS_Boost method with Lindahl's benchmark. All threading pairs are ranked by their confidence score (i.e., the predicted alignment accuracy, or the classification score if an SVM classifier is used) and the sensitivity-specificity curves are drawn in Figures 5, 6 and 7. Figure 6 demonstrates that at the superfamily level, the LS_Boost method is consistently better than SVM regression and classification within the whole spectrum of sensitivity. At both the family level and the fold level, LS_Boost is a little better when the specificity is high but worse when the specificity is low. At the family level, LS_Boost achieves a sensitivity of 55.0% and 64.0% at 99% and 50% specificity, respectively, whereas SVM regression achieves a sensitivity of 44.2% and 71.3%, and SVM classification achieves a sensitivity of 27.0% and 70.9%, respectively. At the superfamily level, LS_Boost has a sensitivity of 8.2% and 20.8% at 99% and 50% specificity, respectively. In contrast, SVM regression has a sensitivity of 3.6% and 17.8%, and SVM classification has a sensitivity of 2.0% and 16.1%, respectively. Figure 7 shows that at the fold level, there is no big difference between the LS_Boost method, SVM regression and SVM classification.
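A curve of this kind can be traced by sweeping a threshold down the ranked list of all threading pairs. The sketch below assumes one common convention, specificity as the fraction of predictions above the threshold that are correct; both the names and the convention are our assumptions about the evaluation.

```python
def sensitivity_specificity_curve(scored_pairs, total_correct):
    """scored_pairs: (confidence, is_correct) for every threading pair.
    Walking a threshold down the ranked list, report at each prefix
    (specificity, sensitivity), where sensitivity = TP / total_correct and
    specificity = TP / predictions made so far."""
    ranked = sorted(scored_pairs, key=lambda p: -p[0])
    curve, tp = [], 0
    for made, (_, is_correct) in enumerate(ranked, start=1):
        tp += is_correct
        curve.append((tp / made, tp / total_correct))
    return curve
```

Reading a point such as "sensitivity 55.0% at 99% specificity" off this curve means finding the largest prefix whose specificity is still at least 0.99.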

It may seem surprising that the widely used z-score is not chosen as one of the most important features, which suggests that the z-score may be redundant. To confirm this hypothesis, we re-trained our model using all the features except the z-scores; that is, we conducted the same training and test procedures as before, but with the reduced feature set. The results in Table 3 show that for LS-Boost there is almost no difference between using the z-score as an additional feature and leaving it out. Thus, we conclude that with the LS-Boost approach it is unnecessary to calculate the z-score to obtain the best performance. This means that we can greatly improve the computational efficiency of protein threading without sacrificing accuracy, by completely avoiding the expensive z-score calculation. To quantify the margin of superiority of LS-Boost over the other machine learning methods, we use the bootstrap method for error analysis. After training the model, we randomly sample 600 sequences from Lindahl's benchmark and calculate the sensitivity as before. We repeat the sampling 1000 times and obtain the mean and standard deviation of the sensitivity of each method, as listed in Table 4. We can see that the LS-Boost method is slightly better than SVM regression and much better than the other methods.


Fig. 5. Family-level specificity-sensitivity curves on Lindahl's benchmark set. The three methods LS-Boost, SVM regression and SVM classification are compared.


Table 3. Comparison of fold recognition performance with z-score and without z-score.

                              Family Level      Superfamily Level   Fold Level
                              Top 1    Top 5    Top 1    Top 5      Top 1    Top 5
LS-Boost with z-score         86.5%    89.2%    60.2%    74.4%      38.8%    61.7%
LS-Boost without z-score      85.8%    89.2%    60.2%    73.9%      38.3%    62.9%

Table 4. Error analysis of seven machine learning methods. The sequence-template alignments are generated by RAPTOR. Each entry is the mean sensitivity, with the standard deviation in parentheses.

              Family Level                    Superfamily Level               Fold Level
              Top 1           Top 5           Top 1           Top 5           Top 1           Top 5
LS-Boost      86.6% (0.029)   89.2% (0.031)   60.2% (0.029)   74.3% (0.034)   38.9% (0.027)   61.8% (0.036)
SVM (R)       85.2% (0.031)   89.2% (0.031)   55.6% (0.029)   72.0% (0.033)   38.7% (0.027)   60.7% (0.035)
SVM (C)       82.5% (0.028)   83.8% (0.030)   45.8% (0.026)   58.9% (0.030)   30.4% (0.024)   52.8% (0.032)
Ada-Boost     82.9% (0.030)   84.2% (0.029)   50.7% (0.028)   61.2% (0.031)   32.1% (0.025)   53.4% (0.034)
NN            81.8% (0.029)   83.5% (0.030)   47.5% (0.027)   58.4% (0.031)   30.2% (0.024)   55.0% (0.033)
BC            70.0% (0.027)   72.6% (0.027)   29.1% (0.021)   42.6% (0.026)   13.7% (0.016)   40.1% (0.028)
NBC           68.8% (0.026)   71.0% (0.028)   31.1% (0.022)   41.9% (0.025)   15.1% (0.017)   37.3% (0.027)
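The error analysis in Table 4 can be reproduced in outline as follows. This is a sketch of the stated procedure (sample 600 sequences, recompute sensitivity, repeat 1000 times); sampling with replacement and the seeding are assumptions, since the text only says the sequences are randomly sampled.

```python
import numpy as np

def bootstrap_sensitivity(correct_flags, n_sample=600, n_rep=1000, seed=0):
    """Bootstrap mean and std of Top-1 sensitivity.

    correct_flags[i] is 1 if query sequence i was recognized correctly,
    else 0. Each replicate draws n_sample sequences (with replacement,
    an assumption) and recomputes the fraction recognized correctly."""
    rng = np.random.default_rng(seed)
    flags = np.asarray(correct_flags, dtype=float)
    sens = np.empty(n_rep)
    for r in range(n_rep):
        idx = rng.integers(0, len(flags), size=n_sample)
        sens[r] = flags[idx].mean()
    return sens.mean(), sens.std()
```

Running this once per method on that method's per-sequence correctness flags yields the mean/std pairs of the table; overlapping or disjoint intervals then indicate whether a gap between two methods is within the sampling noise.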


5.3. Computational Efficiency

Overall, the LS-Boost procedure achieves superior computational efficiency during both training and testing. Running our program on a 2.53 GHz Pentium IV processor, after extracting the features the training time is less than thirty seconds and the total test time is approximately two seconds. Our technique is thus very fast compared to other approaches, in particular machine learning approaches such as neural networks and SVMs, which require much more time to train. Table 5 lists the running times of several fold recognition methods. From this table, we can see that the boosting approach is more efficient than the SVM regression method, which is desirable for genome-scale structure prediction. The running times in this table do not include the computation of the sequence-template alignments.

Fig. 6 . Superfamily-level specificity-sensitivity curves on Lindahl's benchmark set. The three methods LS-Boost, SVM regression and SVM classification are compared.


6. CONCLUSION

In this paper, we propose a new machine learning approach, LS-Boost, to solve the protein fold recognition problem. We use a regression approach, which proves to be both more accurate and more efficient than classification-based approaches. One of the most significant conclusions of our experimental evaluation is that we do not need to calculate the standard z-score, and can thereby achieve substantial computational savings without sacrificing prediction accuracy. Our algorithm achieves strong


Fig. 7. Fold-level specificity-sensitivity curves on Lindahl's benchmark set. The three methods LS-Boost, SVM regression and SVM classification are compared.


Table 5. Running time of different machine learning approaches.

Method                     Training time    Testing time
LS-Boost                   30 seconds       2 seconds
SVM classification         19 mins          26 mins
SVM regression             1 hour           4.3 hours
Neural Network             2.3 hours        2 mins
Naive Bayes Classifier     1.8 hours        2 mins
Bayes Classifier           1.9 hours        2 mins

sensitivity results compared to other fold recognition methods, including both machine learning methods and z-score based methods. Moreover, our approach is significantly more efficient in both the training and testing phases, which may enable genome-scale structure prediction.

References

1. T. Akutsu and S. Miyano. On the approximation of protein threading. Theoretical Computer Science, 210:261-275, 1999.
2. N.N. Alexandrov. SARFing the PDB. Protein Engineering, 9:727-732, 1996.
3. S.H. Bryant and S.F. Altschul. Statistics of sequence-structure threading. Current Opinion in Structural Biology, 5:236-244, 1995.
4. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23-37, 1995.
5. J.H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), October 2001.
6. L. Holm and C. Sander. Decision support system for the evolutionary classification of protein structures. 5:140-146, 1997.
7. J. Moult, T. Hubbard, K. Fidelis, and J. Pedersen. Critical assessment of methods of protein structure prediction (CASP)-round III. Proteins: Structure, Function and Genetics, 37(S3):2-6, December 1999.
8. T. Joachims. Making Large-Scale SVM Learning Practical. MIT Press, 1999.
9. D.T. Jones. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology, 287:797-815, 1999.
10. D. Kim, D. Xu, J. Guo, K. Ellrott, and Y. Xu. PROSPECT II: Protein structure prediction method for genome-scale applications. Protein Engineering, 16(9):641-650, 2003.
11. H. Li, R. Helling, C. Tang, and N. Wingreen. Emergence of preferred structures in a simple model of protein folding. Science, 273:666-669, 1996.
12. E. Lindahl and A. Elofsson. Identification of related proteins on family, superfamily and fold level. Journal of Molecular Biology, 295:613-625, 2000.
13. D. Michie, D.J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural and Statistical Classification (edited collection). Ellis Horwood, 1994.
14. J. Moult, K. Fidelis, A. Zemla, and T. Hubbard. Critical assessment of methods of protein structure prediction (CASP)-round IV. Proteins: Structure, Function and Genetics, 45(S5):2-7, December 2001.
15. J. Moult, K. Fidelis, A. Zemla, and T. Hubbard. Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins: Structure, Function and Genetics, 53(S6):334-339, October 2003.
16. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536-540, 1995.
17. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
18. J. Shi, T. Blundell, and K. Mizuguchi. FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. Journal of Molecular Biology, 310:243-257, 2001.
19. V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
20. J. Xu. Protein fold recognition by predicted alignment accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2:157-165, 2005.
21. J. Xu, M. Li, D. Kim, and Y. Xu. RAPTOR: Optimal protein threading by linear programming. Journal of Bioinformatics and Computational Biology, 1(1):95-117, 2003.
22. J. Xu, M. Li, G. Lin, D. Kim, and Y. Xu. Protein threading by linear programming. In Biocomputing: Proceedings of the 2003 Pacific Symposium, pages 264-275, Hawaii, USA, 2003.
23. Y. Xu, D. Xu, and V. Olman. A practical method for interpretation of threading scores: an application of neural networks. Statistica Sinica Special Issue on Bioinformatics, 12:159-177, 2002.



A GRAPH-BASED AUTOMATED NMR BACKBONE RESONANCE SEQUENTIAL ASSIGNMENT

Xiang Wan and Guohui Lin*
Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
*Email: ghlin@cs.ualberta.ca

The success of backbone resonance sequential assignment is fundamental to protein three-dimensional structure determination via NMR spectroscopy. Such a sequential assignment can roughly be partitioned into three separate steps: grouping resonance peaks in multiple spectra into spin systems, chaining the resultant spin systems into strings, and assigning strings of spin systems to non-overlapping consecutive amino acid residues in the target protein. Dealing with these three steps separately has been adopted in many existing assignment programs; it works well on protein NMR data of close to ideal quality, but only moderately or even poorly on most real protein datasets, where noise and data degeneracy occur frequently. We propose in this work to partition the sequential assignment not into physical steps but only virtual steps, and to use their outputs to cross-validate each other. The novelty lies in that the ambiguities in the grouping step are resolved by finding the highly confident strings in the chaining step, and the ambiguities in the chaining step are resolved by examining the mappings of strings in the assignment step. In this way, all ambiguities in the sequential assignment are resolved globally and optimally. The resultant assignment program is called GASA, which was compared to several recent similar developments: RIBRA, MARS, PACES and a random graph approach. The performance comparisons with these works demonstrate that GASA might be more promising for practical use. Keywords: Protein NMR backbone resonance sequential assignment, chemical shift, spin system, connectivity graph.

1. INTRODUCTION

Nuclear Magnetic Resonance (NMR) spectroscopy has been increasingly used for protein three-dimensional structure determination. Although it has not been able to achieve the same accuracy as X-ray crystallography, enormous technological advances have brought NMR to the forefront of structural biology 1 since the publication of the first complete solution structure of a protein (bull seminal trypsin inhibitor) determined by NMR in 1985 2 . The underlying mathematical principle of protein NMR structure determination is to employ NMR spectroscopy to obtain local structural restraints, such as the distances between hydrogen atoms and the ranges of dihedral angles, and then to calculate the three-dimensional structure. Local structural restraint extraction is mostly guided by the backbone resonance sequential assignment, which is therefore crucial to accurate three-dimensional structure calculation. The resonance sequential assignment maps the identified resonance peaks from multiple NMR spectra to their corresponding nuclei in the target protein, where every peak captures a nuclear

*To whom correspondence should be addressed.

magnetic interaction among a set of nuclei, and its coordinates are the chemical shift values of the interacting nuclei. Normally, such an assignment procedure is roughly partitioned into three main steps: grouping resonance peaks from multiple spectra into spin systems, chaining the resultant spin systems into strings, and assigning the strings of spin systems to non-overlapping consecutive amino acid residues in the target protein, as illustrated in Figure 1, where the scoring scheme quantifies the residual signature information of the peaks and spin systems. Dealing with these three steps separately has been adopted in many existing assignment programs 3-10 . Furthermore, depending on the availability of the NMR spectral data, different programs may have different starting points. To name a few automated assignment programs: PACES 6 , a random graph approach 8 (abbreviated as RANDOM in the rest of the paper) and MARS 10 assume the availability of spin systems and focus on chaining the spin systems and their subsequent assignment; AutoAssign 3 and RIBRA 9 can start with the multiple spectral peak lists and automate the whole sequential

Fig. 1. The flow chart of the NMR resonance sequential assignment: peak lists are processed by the grouping, chaining and assignment steps, guided by a scoring scheme, to produce assignment candidates.

assignment process. In terms of computational techniques, PACES uses exhaustive search algorithms to enumerate all possible strings and then performs the string assignment; RANDOM 8 avoids exhaustive enumeration through multiple calls to Hamiltonian path/cycle generation in a randomized way; MARS 10 first searches all possible strings of length 5 and then uses their mapping positions to filter out the correct strings; AutoAssign 3 uses a best-first search algorithm with constraint propagation to look for assignments; RIBRA 9 applies a weighted maximum independent set algorithm for assignments. The above-mentioned sequential assignment programs all work well on high quality NMR data, but most of them remain unsatisfactory in practice and even fail when the spectral data is of low resolution. Through a thorough investigation, we identified that the bottleneck of automated sequential assignment is resonance peak grouping. Essentially, a good grouping output gives well-organized, high-quality spin systems, for which the correct strings can be fairly easily determined and the subsequent string assignment also becomes easy. In AutoAssign and RIBRA, the grouping is done through a binary decision model that considers the HSQC peaks as anchor peaks and subsequently maps the peaks from other spectra to these anchor peaks. For such a mapping, the HN and N chemical shift values in the other peaks are required to fall within the pre-specified HN and N chemical shift tolerance thresholds of the anchor peaks. However, this binary-decision model in the peak grouping inevitably suffers from its sensitivity to the tolerance thresholds. In practice, chemical shift thresholds vary from one protein dataset to another due to the experimental conditions and the structure complexity. Large tolerance thresholds could create too many ambiguities in the resultant spin systems, and consequently in the later chaining and assignment, leading to a dramatic decrease in assignment accuracy; on the other hand, small tolerance thresholds would produce too few spin systems when the spectral data resolution is low, hardly leading to a useful assignment. Secondly, we found that in the traditional three-step procedure, which is the basis of many automated sequential assignment programs, each step is executed separately, without consideration of inter-step effects. Basically, the input to each step is assumed to contain enough information to produce meaningful output. However, for low resolution spectral data, the ambiguities appearing in the input of one step seem very hard to resolve internally. Though it is possible to generate multiple sets of outputs, the uncertainties contained in one input might cause more ambiguities in the outputs, which are taken as inputs to the succeeding steps. Consequently, the whole process would fail to produce a meaningful resonance sequential assignment, which might be possible if the outputs of succeeding steps were used to validate the input to the current step. In this paper, we propose a two-phase Graph-based Approach for Sequential Assignment (GASA) that uses the spin system chaining results to validate the peak grouping and uses the string assignment results to validate the spin system chaining. Therefore, GASA not only addresses the chemical shift tolerance threshold issue in the grouping step but also presents a new model to automate the sequential assignment. In more detail, we propose a two-way nearest neighbor search approach in the first phase to eliminate the requirement of user-specified HN and N chemical shift tolerance thresholds. The output of the first phase consists of two lists of spin systems. One list contains the perfect spin systems, which are regarded as of high quality, and the other the imperfect spin systems, in which some ambiguities have to be resolved to produce legal spin systems. In the second phase, the spin system chaining is performed to re-

Phase 1: Filtering For ease of exposition and fair comparison with RANDOM. guided by the quality of the string mapping to the target protein. Given a center C = ( H N c . We conclude the paper in Section 4. For ease of presentation. all existing peak grouping models require the manually set chemical shift tolerance thresholds in order to decide whether two resonance peaks should be grouped into the same spin system or not.Np). using a set of common chemical shift tolerance thresholds results in more troublesome centers. In Section 2. some centers may have less than 6 or even 0 peaks. wise. we introduce the detailed steps of operations in GASA. GASA firstly conducts a bidirectional nearest neighbor search to generate the perfect spin systems and the imperfect spin systems with ambiguities. and a carbon alpha/beta from the preceding amino acid residue. each center should have 6 peaks distributed to it in total. In the other case. assuming the grouping is done. GASA does not separate the sequential assignment into physical steps but only virtual steps. Figure 2 illustrates a simple scenario where 3 . otherwise an inter-peak. Due to the high quality of HSQC spectrum. GASA can accept other combinations of spectra. However. different tolerance thresholds clearly produce different sets of possible spin systems. In fact. and every peak in CBCA(CO)NH and HNCACB is distributed to the closest center. the normalized Euclidean distance between them is defined as D= 2. and all ambiguities in the whole assignment process are resolved globally and optimally. detailed as follows using the triple spectra as an example. In the case of a given list of spin systems.1. MARS and RIBRA. The reasons for this is that the peaks should be associated with these centers might turn out closer to other centers.Cp ). due to the chemical shift degeneracy. Therefore. Therefore. One typical example would be the triple spectra containing HSQC. 
a minor change of tolerance thresholds would lead to huge difference in the formed spin systems and subsequently the final sequential assignment. the peaks in HSQC are considered as centers. and the HSQC peak list. the ambiguities in the imperfect spin systems are resolved through finding the highly confident strings in the chaining step. Section 3 presents our experimental results and discussion. a 3D peak containing a chemical shift of the intra-residue carbon alpha is referred to as an intra-peak. and the ambiguities in the chaining step are resolved through examining the mappings of strings in the assignment step. The goal of filtering is to identify all perfect spin systems without asking for the chemical shift tolerance thresholds. a list of spin systems. the directly adjacent amide proton. and a carbon CTHN / where OHN i CN are the standard deviations of HN and N chemical shifts that are collected from BioMagResBank ( h t t p : //www. 4 from HNCACB spectrum and 2 from CBCA(CO)NH spectrum. In other words. the proper tolerance thresholds are normally dataset dependent and how to choose them is a very challenging issue in the automated resonance assignment. //HNp-HNcy V \ an< / N P . GASA skips the first phase and directly invokes the second phase to conduct the spin system chaining and the assignment. An CBCA(CO)NH spectrum contains 3D peaks each of which is a triple of chemical shifts for a nitrogen. CBCA(CO)NH and HNCACB. we assume the availability of spectral peaks containing chemical shifts for C<* and C0. the directly adjacent amide proton. Consequently. THE GASA ALGORITHM The input data to GASA could be a set of peak lists or. edu). An HSQC spectrum contains 2D peaks each of which corresponds to a pair of chemical shifts for an amide proton and the directly attached nitrogen. N c ) and a peak P = (HNp. bmrb. PACES. using the normalized Euclidean distance. We propose to use the nearest neighbor approach. In the ideal case. Nevertheless. 
Note that to the best of our knowledge.N ^ \ ^N J 2.57 solve the ambiguities contained in the imperfect spin systems and the string assignment step is included as a subroutine to identify the confident strings. The rest of the paper is organized as follows. It then invokes the second phase which applies a heuristic search. and for the low resolution spectral data. alpha/beta from the same or the preceding amino acid residue. An HNCACB spectrum contains 3D peaks each of which is a triple of chemical shifts for a nitrogen. to perform the chaining and assignment for resolving the ambiguities in the imperfect spin systems and meanwhile complete the assignment.

k = 9. though we have noticed that minor differences in these maximal chemical shift tolerance thresholds would not really affect the performance of this bidirectional search. then the spin system is formed and the associated peaks are removed. Nevertheless. it is very difficult to distinguish the true peaks from the fake peaks when every imperfect spin system is individually examined. Every peak is distributed to the closest center. both C\ and C%findtheir 6 closest peaks to form perfect spin systems. Phase 2: Resolving (b) Using the center specific tolerance thresholds. In Figure 2(a). A A (Ci) A ' 2. HNCACB and CBCA(CO)NH). In the Inviting step. perfect and imperfect.C2. A sample scenario in the peak grouping: (a) There are 3 HSQC peaks as 3 centers Ci. These already associated peaks are then removed from the nearest neighbor model for further consideration. Using the common tolerance thresholds. which is known ahead of resonance assignment. a closer look that center C\ reveals that the two peaks that should belong to it but are closer to center Ci are among the 6 most closest peaks. A typical value of k is set as 1. we associated each peak in CBCA(CO)NH and HNCACB spectra to their respective closest HSQC peak.2. only C3 forms a perfect spin system (with exactly 6 associated peaks).5 times the number of peaks in a perfect spin system. which essentially applies the center specific tolerance thresholds. to have two steps of operations: Residing and Inviting. we have found that in most cases. (b) Using center specific tolerance thresholds. In general. only one perfect spin system with center C3 is formed because the two peaks that should belong to center C\ are closer to center C2. In the Residing step. each remaining peak in HSQC spectrum looks for the k closest peaks in CBCA(CO)NH and HNCACB spectra. In the triple spectra case (HSQC. the user could specify maximal HN and N chemical shift tolerance thresholds to speed up the process. 
The parameter k is related to the number of peaks contained in a perfect spin system. 2. ^~A A ( * ( Ci) A A | ( I® * \ * ! * / "\ (£) * ! * * /' (a) Using the common tolerance thresholds. the spin system with center C\ can be formed by adding these two peaks (see Figure 2(b)). That is. respectively. Note that this bidirectional nearest neighbor model essentially applies the center specific tolerance thresholds. The aforementioned two steps will be iteratively executed until no more perfect spin systems can be found and two lists of spin systems. We designed a bidirectional nearest neighbor model. using the common tolerance thresholds.C3. F i g . using the center specific tolerance thresholds. If the HSQC peak and its associated peaks in CBCA(CO)NH and HNCACB spectra form a perfect spin system. and thus it does not require any chemical shift tolerance thresholds. Nonetheless. The goal of Resolving is to identify the true peaks contained in the imperfect spin systems and then to conduct the spin system chaining and string assignment. are constructed.58 centers present. which create ambiguities in both spin systems. measured by the normalized Euclidean distance. Similarly. using the center specific tolerance thresholds. and if a perfect spin system can be formed using some of these k peaks. but using the common tolerance thresholds C\ has only 4 peaks associated while Ci has 8. then the resultant spin system is inserted into the list of perfect spin systems. the spin system with center C2 becomes another perfect spin system. those spin systems containing true peaks enable more confident string finding than those containing fake . During our development.
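The normalized Euclidean distance and the Residing step of the bidirectional nearest-neighbor grouping can be sketched as follows. This is an illustrative sketch, not the authors' code: the σ values, tuple layouts and function names are assumptions, and only the HN/N coordinates of each 3D peak are used for the distance, as in the paper's definition.

```python
import math

# Assumed standard deviations of HN and N chemical shifts (the paper
# takes them from BioMagResBank); the exact values here are illustrative.
SIGMA_HN, SIGMA_N = 0.6, 4.0

def norm_dist(center, peak_hn_n):
    """Normalized Euclidean distance between an HSQC center (HN, N)
    and the (HN, N) coordinates of a 3D peak."""
    (hc, nc), (hp, np_) = center, peak_hn_n
    return math.sqrt(((hp - hc) / SIGMA_HN) ** 2 + ((np_ - nc) / SIGMA_N) ** 2)

def residing(centers, peaks3d, peaks_per_spin=6):
    """Residing step (sketch): distribute each 3D peak -- a (HN, N, C)
    triple from CBCA(CO)NH or HNCACB -- to its closest HSQC center.
    Centers that receive exactly `peaks_per_spin` peaks form perfect
    spin systems; the others are left for the Inviting step."""
    assigned = {i: [] for i in range(len(centers))}
    for p in peaks3d:
        i = min(range(len(centers)),
                key=lambda i: norm_dist(centers[i], p[:2]))
        assigned[i].append(p)
    return {i: ps for i, ps in assigned.items() if len(ps) == peaks_per_spin}
```

Because each center keeps only the peaks for which it is the nearest center, this behaves like a center-specific tolerance threshold rather than a single global one, which is the point of the bidirectional search.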

ss) is then computed as F(cs aa. for every type of chemical shift.Cjq\ £ (3) If both |C?. Nj. C j j .m] and p. where the likelihood of a spin system at a position is estimated by the histogram-based scoring scheme developed in 12 . . Those carbon chemical shifts with subscription I. Cf. For an observed chemical shift value cs.4ppm. cf. and they are used to order the edges coming out of the ending spin system in the current string to provide the candidate spin systems for the current string to grow to. the number of chemical shift values in BioMagResBank that fall in the range (cs — e. Nj. „/ . guided by the mapping quality of the generated string of spin systems in the target protein. .C^l < Sa and |Cf — Cj | < dp hold. It has been observed that a sufficiently long string itself is able to detect the succeeding spin system by taking advantage of the discerning power of the scoring scheme.k £ [l. C £ . Given two perfect spin systems vt = (HN*.edu) as prior distributions and estimates for every observed chemical shift value the probability that it is associated with an amino acid residue residing in certain secondary structure. Summing up the individual intra-residue chemical shift mapping scores in a spin system gives for the spin system its mapping score to every amino acid residue in the target protein. Cf_j) and another imperfect spin system Vj = (HNj. Given one perfect spin system vt = (HNj. Cfp. the edges in the connectivity graph are weighted by the scoring scheme. one vertex corresponds to a spin system. which are typically set to 0. Cft. In each iteration of GASA. Nj.Cf | < 5p hold. |cffc-cfr da + h (4) Note that it is possible that there are multiple edges between one perfect spin system and one imperfect spin system. . • • •.Cf | < Sa and |Cjfc . Cjk. The probability . Cf_1. More precisely. namely. With this observation. there is a tolerance window of length e. both 6a and 5p are pre-determined chemical shift tolerance thresholds. 
its mapping quality in the target protein is measured by the average likelihood of spin systems at the best mapping position for the string. Another list. Subsequently. v N{cs | aa.bmrb.q £ [l. The relationships between spin systems are formulated into a connectivity graph similar to what we have proposed in another sequential assignment program CIS A n .. no connection is allowed for two imperfect spin systems. Cf.wisc. ss) is the total number of the same kind of chemical shift values collected in BioMagResBank. Given a string. Cf.n]. GASA proceeds essentially the same as CISA u to apply a local heuristic search algorithm. Cf2. to accept those that result in spin systems having highly confident mapping positions in the target protein. we check each legal combination v'j = (HNj. but at most one of them could be true.CJLil < 5a and |Cf . Once the connectivity graph has been constructed. Therefore. Cf. is counted for every combination of amino acid type aa and secondary structure type ss. C% C%. and it uses the chemical shift values collected in BioMagResBank (http://www. though minor adjustments are sometimes necessary to ensure a sufficient number of connectivities. if both |Cf . then there is an edge from Vi to v'j with its weight calculated as 1 f\C?-Cfp\ + . Cf_v C f ^ ) . C?_lt C f ^ ) and VJ = (HNj. respectively. the search algorithm starts with an Open List (OL) of strings and seeks to expand the one with the best mapping score. Cjq) where l. In the 'icg-qui | icf-cf. • • •. q represent the inter-residue chemical shifts. This scoring scheme is essentially a naive Bayesian learning. . . we propose to extract true peaks from the imperfect spin systems through spin system chaining and the resultant string assignment.59 peaks.2ppm and 0. denoted as N(cs \ aa. N<. | C f .j h (2) In Equation (2). ss) = —— r—^. In the connectivity graph. then there is an edge from v^ to Vj with its weight calculated as In GASA. 
The scoring scheme then takes the absolute logarithm of the probability as the mapping score. Nj. k represent the intra-residue chemical shifts and those with subscription p. if both |Cf .ss). Cf.cs + e). Cfm. then there is an edge from v'j to Vi with its weight calculated as 'iqi-c?| .C ^ l < 8p hold. Complete List (CL). is used in the algorithm to save those completed strings. ss) where N(aa. N(aa.

In the following, we briefly describe the GASA algorithm for resolving the ambiguities in imperfect spin systems, through spin system chaining into strings and the subsequent string assignment. GASA maintains two path lists: an open list OL of paths under expansion and a closed list CL of closed paths.

OL Initialization: Let G denote the constructed connectivity graph. GASA first searches for all unambiguous edges in G, which are edges whose starting vertex has out-degree 1 and whose ending vertex has in-degree 1. It then expands these edges into simple paths of a pre-defined length L by tracing their starting vertices backward and their ending vertices forward. The tracing stops if either of the following conditions is satisfied: (1) the newly reached vertices are already in the paths; (2) the length of each path reaches L. For every path, GASA finds its best mapping position in the target protein and calculates its mapping score. These paths are stored in OL in non-increasing order of their mapping scores. The size of OL is fixed at S, and thus only the top S paths are kept in OL.

Path Growing: In this step, GASA tries to bidirectionally expand the top-ranked path stored in OL. Denote this path as P, its starting vertex as h, and its ending vertex as t. All the directed edges incident to h and incident from t are considered candidate edges to potentially expand P, and the resultant expanded paths are called child paths of P. For every potential child path, GASA finds its best mapping position in the target protein and calculates its mapping score. If the mapping score is higher than that of some path already stored in OL, then this child path makes it into OL (and accordingly the path with the least mapping score is removed from OL). When none of the potential child paths of P is actually added into OL, or P is not expandable in either direction (that is, there is no edge incident to h, nor any edge incident from t), path P is closed for further expansion and is subsequently added into CL. GASA proceeds to consider the top-ranked path in OL iteratively, and this growing process terminates when OL becomes empty. Note that both L and S are set so as to obtain the best trade-off between computing time and performance.

CL Finalizing: Let P denote the path of the highest mapping score in CL (ties are broken in favor of the longest path). All paths in CL with both their lengths and their scores less than 90% of the length and the score of path P are discarded from further consideration, as they are considered to be of low quality compared to path P. Among the remaining paths, GASA performs the following filtering: only those edges occurring in at least 90% of the paths in CL are regarded as reliable, while the other edges are removed, which might break the paths into shorter ones. The resultant paths are the final candidate paths, and all of them are considered to be reliable strings.

Ambiguities Resolving: GASA scans through the final candidate paths for the longest one, which is the confident string built in the current iteration. Note that the spin systems assigned in this iteration might not necessarily form a single string. Moreover, for the imperfect spin systems assigned in the current iteration, it could be that the mapping position in the target protein for this string conflicts with mappings from previous iterations. In this case, GASA respects the previous mappings, and the current string has to be broken down by removing the spin systems that have the conflicts. The peaks that are used to build the assigned spin systems and edges are considered true peaks, while the others are considered fake peaks and are subsequently removed. The assigned spin systems, as well as the edges incident to and from them, are then removed from the connectivity graph G. If the remaining connectivity graph G is still non-empty, GASA proceeds to the next iteration. When GASA terminates, all the assigned spin systems and their mapping positions are reported as the output assignment.

Implementation: All components of GASA are written in the C/C++ programming language and can be compiled on both Linux and Windows systems. They can be obtained separately or as a whole package through the corresponding author.

3. EXPERIMENTAL RESULTS

We evaluated the performance of GASA through three comparison experiments with several recent works, including RANDOM, PACES, MARS and RIBRA. We note that there is another recent work, GANA 13, which uses a genetic algorithm to automatically perform backbone resonance assignment with a high degree of precision and recall; however, due to time constraints we were not able to include it in the comparison in the current work.

The first experiment compares GASA with RANDOM, PACES and MARS, all of which work well when assuming the availability of spin systems, and whose original design focus is on chaining the spin systems into strings and the subsequent string assignment. Such a comparison is interesting since the experimental results will show the validity of combining the spin system chaining with the resultant string assignment in order to resolve the ambiguities in the adjacencies between spin systems. The other two experiments are used for comparison with RIBRA only, to judge the value of combining peak grouping into spin systems, spin system chaining, and string assignment all together.
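The OL/CL procedure above behaves like a score-guided beam search over the connectivity graph. The following sketch is only an illustration of that control flow, not GASA's actual implementation (which is in C/C++): the mapping score is abstracted as a caller-supplied `score` function, and the initial backward/forward tracing only follows unique extensions.

```python
def grow_paths(graph, score, L=8, S=50):
    """Sketch of GASA's OL/CL path growing (simplified, hypothetical).

    graph: dict mapping vertex -> set of successor vertices
           (a spin-system connectivity graph).
    score: function(path_tuple) -> float; stands in for the best
           mapping score of the path against the target sequence.
    L:     pre-defined initial path length; S: capacity of the open list OL.
    """
    preds = {}
    for u, vs in graph.items():
        for v in vs:
            preds.setdefault(v, set()).add(u)

    # OL Initialization: seed with unambiguous edges (u, v), i.e.
    # out-degree(u) == 1 and in-degree(v) == 1, traced out to length L.
    seeds = set()
    for u, vs in graph.items():
        if len(vs) != 1:
            continue
        (v,) = vs
        if len(preds.get(v, ())) != 1:
            continue
        path = [u, v]
        while len(path) < L and len(preds.get(path[0], ())) == 1:
            (p,) = preds[path[0]]
            if p in path:
                break
            path.insert(0, p)
        while len(path) < L:
            nxt = [w for w in graph.get(path[-1], ()) if w not in path]
            if len(nxt) != 1:
                break
            path.append(nxt[0])
        seeds.add(tuple(path))

    OL = sorted(seeds, key=score, reverse=True)[:S]
    CL = []
    # Path Growing: expand the top-ranked path bidirectionally.
    while OL:
        P = OL[0]
        children = [(p,) + P for p in preds.get(P[0], ()) if p not in P]
        children += [P + (n,) for n in graph.get(P[-1], ()) if n not in P]
        new = [c for c in children if c not in OL and c not in CL]
        if len(OL) < S:
            qualified = new
        else:
            worst = score(OL[-1])
            qualified = [c for c in new if score(c) > worst]
        if qualified:
            OL = sorted(set(OL) | set(qualified), key=score, reverse=True)[:S]
        else:
            # P has no child that beats OL, or is unexpandable: close it.
            OL.pop(0)
            CL.append(P)
    return CL
```

With `score=len`, for instance, the sketch simply prefers longer paths; GASA's real score instead reflects how well a path maps onto the target protein sequence.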

3.1. Experiment 1

The dataset in Experiment 1 is simulated on the basis of 12 proteins in 14, whose lengths range from 66 to 215. The dataset construction is detailed as follows. For each of these 12 proteins, we extracted its data entry from BioMagResBank to obtain all the chemical shift values for the amide proton HN, the directly attached nitrogen N, the carbon alpha Cα, and the carbon beta Cβ. For each amino acid residue, its four chemical shifts together with the Cα and Cβ chemical shifts from the preceding residue formed the initial spin system. Subsequently, the chemical shifts for the intra-residue Cα and Cβ were perturbed by adding to them random errors that follow independent normal distributions with 0 means and constant standard deviations, which were set to 0.08 ppm and 0.16 ppm, respectively. The achieved spin system is called a final spin system.

We adopted the widely accepted tolerance thresholds for Cα and Cβ chemical shifts, namely δCα = 0.2 ppm and δCβ = 0.4 ppm, respectively 3, 6, 8, 10. These 12 instances, with suffix 1, are summarized in Table 1 (the left half), where the complexity of an instance can be measured by the average out-degree of the vertices in the connectivity graph. In order to test the robustness of all four programs, we generated another set of 12 instances by doubling the tolerance thresholds (that is, from δCα = 0.2 ppm and δCβ = 0.4 ppm to δCα = 0.4 ppm and δCβ = 0.8 ppm). These instances, having suffix 2, are also summarized in Table 1 (the right half). Obviously, Table 1 tells that the instances in the second set are much harder than the corresponding ones in the first set.

All four programs, RANDOM, PACES, MARS and GASA, were called to run on both sets of instances. In this work, the performance of an assignment program is measured by the assignment accuracy, which is defined as the percentage of correctly assigned spin systems among all the simulated spin systems. RIBRA explicitly defines two criteria, namely precision and recall: precision is defined as the percentage of correctly assigned amino acids among all the assigned amino acids, and recall is defined as the percentage of correctly assigned amino acids among the amino acids that should be assigned spin systems. In Experiment 1, the datasets are simulated such that there is no fake spin system; in fact, in this case, accuracy = precision = recall. Nevertheless, we use the same criteria in the second and the third experiments to facilitate the comparison.

The performance results of RANDOM, PACES, MARS and GASA on both sets of instances are collected in Table 2, and their assignment accuracies on the two sets are also plotted in Figure 3. RANDOM achieved on average 50% assignment accuracy (we followed the exact way of determining accuracy as described in 8, where 1000 iterations for each instance have been run), which is roughly the same as that claimed in its original paper 8. PACES performed better than RANDOM, but it failed on seven instances where the connectivity graphs were too complex (computer memory ran out; see Discussion for more information). The collected results for PACES on these seven instances were obtained by manually reducing the tolerance thresholds to remove a significant portion of edges from the connectivity graph: we implemented the scheme that if PACES did not finish an instance in 8 hours, then the tolerance thresholds would be reduced by 25%.

We remark that the performance of PACES in this experiment is a bit lower than that claimed in its original paper 6. There are at least three reasons for this. (1) The datasets tested in 6 are different from ours. We have done a test using the datasets in 6 to compare RANDOM, PACES, MARS and CISA, a predecessor of GASA 11, and the result tendency is very much the same as what we have seen in this experiment.
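The three evaluation criteria above differ only in their denominators. A small helper makes the relationship explicit (the numbers used are illustrative, not taken from the tables):

```python
def assignment_metrics(correct, assigned, simulated, should_assign):
    """Restate the three criteria used in the comparisons.

    accuracy  = correctly assigned spin systems / all simulated spin systems
    precision = correctly assigned amino acids / all assigned amino acids
    recall    = correctly assigned amino acids / amino acids that should
                be assigned spin systems
    """
    return {
        "accuracy": correct / simulated,
        "precision": correct / assigned,
        "recall": correct / should_assign,
    }

# With no fake and no missing spin systems (as in Experiment 1),
# assigned == simulated == should_assign, so the three values coincide:
same = assignment_metrics(correct=45, assigned=50, simulated=50,
                          should_assign=50)
assert same["accuracy"] == same["precision"] == same["recall"]
```

When fake or missing spin systems are present, the denominators diverge and the three metrics separate, which is why the later experiments report precision and recall individually.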

(2) PACES is only semi-automated, in the sense that it needs manual adjustment after each iteration to iteratively improve the assignment. One could run it for several iterations for an improved assignment; however, in the current work we were unable to manually adjust it fairly well and decided not to do so. Instead, PACES was taken as fully automated and was run for only one iteration. (3) PACES is designed to take in better spin systems that additionally contain carbonyl chemical shifts, since the extra CO chemical shifts provide extra information for resolving ambiguities. With the current combination PACES was expected to perform a bit lower. Again, we have done a similar test using the combination (HN, N, Cα, Cβ, CO) of chemical shifts in 6 to compare RANDOM, PACES, MARS and CISA, and the result tendency is very much the same as what we have seen in this experiment.

Table 1. Partial information of the two sets of instances, each having 12 ones, in the first experiment. 'Length' denotes the length of a protein, measured by the number of amino acid residues therein; '#CE' records the number of correct edges in the connectivity graph; '#WE' records the number of wrong edges; 'Avg.OD' records the average out-degree of the connectivity graph. The left half lists the suffix-1 instances (δCα = 0.2 ppm, δCβ = 0.4 ppm) and the right half the suffix-2 instances (δCα = 0.4 ppm, δCβ = 0.8 ppm).

Table 2. Assignment accuracies of RANDOM, PACES, MARS and GASA in the first experiment, on the first set of instances (δCα = 0.2 ppm, δCβ = 0.4 ppm) and on the second set (δCα = 0.4 ppm, δCβ = 0.8 ppm). *PACES performance on the marked datasets was obtained by reducing the tolerance thresholds to 75% (δCα = 0.15 ppm and δCβ = 0.3 ppm, or δCα = 0.3 ppm and δCβ = 0.6 ppm) or to 50% (δCα = 0.2 ppm and δCβ = 0.4 ppm).

As shown in Table 2, MARS and GASA performed equally very well; they both outperformed PACES and RANDOM in all instances, and even more significantly on the second set of more difficult instances, which indicates that combining the chaining and the assignment together does effectively resolve the ambiguities and thus makes better assignments.

Fig. 3. Plots of the assignment accuracies of RANDOM, PACES, MARS and GASA on the two sets of instances with different tolerance thresholds for connectivity: (a) assignment accuracies on the 1st set of instances; (b) assignment accuracies on the 2nd set of instances.

3.2. Experiment 2

In RIBRA, 5 sets of datasets were simulated from the data entries deposited in BioMagResBank, with lengths ranging from 69 to 186. Among them, one is the perfect dataset, which is simulated from BioMagResBank without adding any errors; the other four datasets contain four different types of errors, respectively. The false positive dataset is generated by respectively adding 5% carbon fake peaks into the perfect CBCA(CO)NH and HNCACB peak lists. The false negative dataset is generated by randomly removing a small portion of inter carbon peaks from the perfect CBCA(CO)NH and HNCACB peak lists. The grouping error dataset is generated by adding HN, N, Cα and Cβ perturbations into the inter peaks in the perfect CBCA(CO)NH peak list. The linking error dataset is generated by adding Cα and Cβ perturbations into the inter peaks in the perfect HNCACB peak list. Both RIBRA and GASA were tested on them.

Table 3 collects the average performances of RIBRA and GASA on these 5 sets of datasets. As shown, there is no significant difference between the performances on the perfect, false positive and linking error datasets. GASA shows more robustness on the dataset with missing data, while RIBRA performs better on the grouping error dataset, on which RIBRA achieved 87.5% precision and 79.4% recall and the performance of GASA was not as good.

Through a detailed investigation, we found that these 5 sets of datasets contain Cβ inter and intra peaks with 0 Cβ chemical shifts for Glycine, indicating that in the RIBRA simulation, RIBRA kept the perfect HSQC and HNCACB peak lists untouched and only added some perturbations to the inter peaks in the CBCA(CO)NH peak list. It is noticed that grouping is much easier on datasets with such simulated Cβ peaks for Glycine; however, this is not the case in the real NMR spectral data, in which there are no peaks with 0 carbon chemical shifts. In fact, using Cα and Cβ chemical shift inference, Glycine would have two inter peaks and two intra peaks in HNCACB, and the amino acid residues after Glycine would have two inter peaks in CBCA(CO)NH; a huge amount of ambiguity in the sequential assignment therefore results from Glycine, because it produces various legal combinations in grouping and thus makes the identification of perfect spin systems harder. Moreover, the spin systems containing 3, 4 and 5 peaks have the same chance to be perfect spin systems as those containing 6 peaks, and meanwhile they could be considered as spin systems with missing peaks, which causes more ambiguities in the peak grouping. To verify our thoughts, we randomly selected 14 proteins among the grouping error dataset and removed all the peaks of 0 Cβ chemical shift; the resulting performances are the percentages shown in parentheses in Table 3.

We believe that to simulate a real NMR spectral dataset, perturbing the chemical shifts in all simulated peaks is necessary and would be closer to the reality, because the chemical shifts deposited in BioMagResBank have been manually adjusted across multiple spectra. Even though HSQC is a very reliable experiment, the deposited HN and N chemical shifts in BioMagResBank are still slightly different from the measured values in the real HSQC spectra (http://bmrb.wisc.edu/). Since GASA is designed to deal with the real spectral data, its performance on the grouping error dataset is not as good as RIBRA's.
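The simulation strategy argued for here (perturb the chemical shifts in every peak, and simulate no Cβ peaks for Glycine) can be sketched as follows; the noise levels in `sigma` are placeholders, not the thresholds used in the paper:

```python
import random

def simulate_peaks(shifts, sigma=None):
    """Sketch of the advocated simulation: build HSQC, HNCACB and
    CBCA(CO)NH peak lists from deposited chemical shifts and perturb
    every peak with independent zero-mean Gaussian reading errors.

    shifts: list of per-residue dicts with keys 'HN', 'N', 'CA', 'CB';
            'CB' is None for Glycine, so no CB peak is simulated for it.
    sigma:  per-dimension noise levels (placeholder values only).
    """
    sigma = sigma or {"HN": 0.006, "N": 0.08, "C": 0.16}
    g = lambda value, s: value + random.gauss(0.0, s)
    hsqc, hncacb, cbcaconh = [], [], []
    for i, res in enumerate(shifts):
        hn, n = g(res["HN"], sigma["HN"]), g(res["N"], sigma["N"])
        hsqc.append((hn, n))
        # intra peaks in HNCACB: the residue's own CA/CB
        for c in ("CA", "CB"):
            if res[c] is not None:
                hncacb.append((hn, n, g(res[c], sigma["C"])))
        # inter peaks: CA/CB of the preceding residue, perturbed
        # independently in each spectrum
        if i > 0:
            for c in ("CA", "CB"):
                prev = shifts[i - 1][c]
                if prev is not None:
                    hncacb.append((hn, n, g(prev, sigma["C"])))
                    cbcaconh.append((hn, n, g(prev, sigma["C"])))
    return hsqc, hncacb, cbcaconh
```

Because the perturbations are drawn independently for every peak, the same deposited shift appears with slightly different values across spectra, mimicking real reading errors rather than the manually reconciled values stored in BioMagResBank.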

Table 3. Comparison results for RIBRA and GASA in Experiment 2: the average precision and recall of each program on the perfect, false positive, false negative, grouping error and linking error datasets. Percentages in parentheses were obtained on the 14 randomly chosen proteins with the Cβ peaks for Glycine removed.

3.3. Experiment 3

The purpose of Experiment 3 is to provide more convincing comparison results between GASA and RIBRA, based on the better data simulation. For this purpose, we used the same 12 proteins as in Experiment 1, and the simulation is detailed as follows. For each of these 12 proteins, we extracted its data entry from BioMagResBank to obtain all the chemical shift values for HN, N, Cα and Cβ. For each amino acid residue in the protein, except Proline, its HN and N chemical shifts formed a peak in the HSQC peak list; its HN and N chemical shifts, with its own Cα and Cβ chemical shifts and with the Cα and Cβ chemical shifts from the preceding residue, formed two intra peaks and two inter peaks respectively in the HNCACB peak list; and its HN and N chemical shifts, with the Cα and Cβ chemical shifts from the preceding residue, formed two inter peaks in the CBCA(CO)NH peak list. Note that there is no Cβ peak for Glycine in either the CBCA(CO)NH or the HNCACB peak list. Subsequently, the chemical shifts in every peak in all HSQC, HNCACB and CBCA(CO)NH peak lists were perturbed with random reading errors; that is, the contained HN, N, Cα or Cβ chemical shifts were perturbed by adding to them random errors that follow independent normal distributions with 0 means and constant standard deviations. We chose the same tolerance thresholds as those used in RIBRA, which were δHN = 0.06 ppm for HN, δN = 0.8 ppm for N, δCα = 0.2 ppm for Cα, and δCβ = 0.4 ppm for Cβ; the standard deviations of the normal distributions were set to 0.06/2.5 = 0.024 ppm, 0.8/2.5 = 0.32 ppm, 0.2/2.5 = 0.08 ppm, and 0.4/2.5 = 0.16 ppm, respectively. The detailed datasets are available through the link http://www.cs.ualberta.ca/~ghlin/src/WebTools/gasa.php.

Partial information of, and the performances of RIBRA and GASA on, these 12 proteins are summarized in Table 4, and the detailed precision and recall are also plotted in Figure 4. As shown in Table 4, GASA outperformed RIBRA in all instances, and RIBRA failed to solve three instances, which are bmr4316, bmr4288 and bmr4929. On average, RIBRA achieved only 65.23% precision and 42.1% recall, noticeably worse than what is claimed in 9, while GASA performed significantly better (precision 86.72% versus 65.23%, recall 74.18% versus 42.10%). From the table, we can also see that GASA formed many more spin systems than RIBRA did on every dataset, and from the assignment precision we can conclude that most of these spin systems are true spin systems.

The possible explanations for RIBRA not doing well on these 12 instances are: (1) the simulation procedure in Experiment 3 did not generate Cβ peaks with 0 chemical shift for Glycines, which causes more ambiguities in the peak grouping; (2) in the 12 simulated datasets in Experiment 3, the chemical shifts in every peak in all HSQC, HNCACB and CBCA(CO)NH peak lists were perturbed with random reading errors, which generated more uncertainties in every step of operation in the sequential assignment.

4. CONCLUSIONS

In this paper, we proposed a novel two-stage graph-based algorithm, called GASA, for protein NMR backbone resonance sequential assignment. The input to GASA can be spin systems or raw spectral peak lists.

Table 4. Partial information of, and the performance of, RIBRA and GASA on the 12 protein NMR datasets in Experiment 3. 'Length' denotes the length of a protein, measured by the number of amino acid residues therein; 'Missing' records the number of true spin systems that are not simulated in the dataset; 'Grouped' records the number of spin systems that were actually formed by RIBRA and GASA, respectively; 'N/A' marks the three instances that RIBRA failed to solve.

Fig. 4. Plots of detailed assignment (a) precision and (b) recall on each of the 12 protein datasets in Experiment 3, by RIBRA and GASA.

GASA is based on an assignment model that separates the whole assignment process only into virtual steps and uses the outputs from these virtual steps to cross-validate each other. The novelty of GASA lies in that all ambiguities in the assignment process are resolved globally and optimally. The extensive comparison experiments with several recent works, including RANDOM, PACES, MARS and RIBRA, showed that GASA is more effective in dealing with NMR spectral data degeneracy and thereby provides a more promising solution to automated resonance sequential assignment.

We have also proposed a spectral dataset simulation method that generates datasets closer to reality. One of the reasons for doing this is that, though BioMagResBank as a repository has collected all known protein NMR data, somehow there are no benchmark testing datasets in the literature. As a preliminary effort, the 12 simulated protein NMR datasets, in the format of the triple spectra HSQC, HNCACB and CBCA(CO)NH, are available through the link http://www.cs.ualberta.ca/~ghlin/src/WebTools/gasa.php. One of our future works is to formalize this simulation method to produce a large number of protein NMR datasets for common comparison purposes.

ACKNOWLEDGMENTS

This research is supported in part by AICML, CFI and NSERC. The authors would like to thank the authors of RIBRA for providing access to their datasets and for their prompt responses to our inquiries.

References

1. D. E. Zimmerman, C. A. Kulikowski, Y. Huang, W. Feng, M. Tashiro, S. Shimotakahara, C. Chien, R. Powers, and G. T. Montelione. Automated analysis of protein NMR assignments using methods from artificial intelligence. Journal of Molecular Biology, 269:592-610, 1997.
2. T. K. Hitchens, J. A. Lukin, Y. Zhan, S. A. McCallum, and G. S. Rule. MONTE: An automated Monte Carlo based approach to nuclear magnetic resonance assignment of proteins. Journal of Biomolecular NMR, 25:1-9, 2003.
3. Y.-S. Jung and M. Zweckstetter. Mars - robust automatic backbone assignment of proteins. Journal of Biomolecular NMR, 30:11-23, 2004.
4. A. E. Ferentz and G. Wagner. NMR spectroscopy: a multifaceted approach to macromolecular structure. Quarterly Reviews of Biophysics, 33:29-65, 2000.
5. P. Güntert, M. Salzmann, D. Braun, and K. Wüthrich. Sequence-specific NMR assignment of proteins by global fragment mapping with the program Mapper. Journal of Biomolecular NMR, 18:129-137, 2000.
6. B. E. Coggins and P. Zhou. PACES: Protein sequential assignment by computer-assisted exhaustive search. Journal of Biomolecular NMR, 26:93-111, 2003.
7. X. Wan, T. Tegos, and G. Lin. Histogram-based scoring schemes for protein NMR resonance assignment. Journal of Bioinformatics and Computational Biology, 2:747-764, 2004.
8. C. Bailey-Kellogg, S. Chainraj, and G. Pandurangan. A random graph approach to NMR sequential assignment. In Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), pages 58-67, 2004.
9. K.-P. Wu, J.-M. Chang, J.-B. Chen, C.-F. Chang, W.-J. Wu, T.-H. Huang, T.-Y. Sung, and W.-L. Hsu. RIBRA - an error-tolerant algorithm for the NMR backbone assignment problem. In Proceedings of the 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2005), pages 103-117, 2005.
10. G. Lin, D. Xu, Z.-Z. Chen, T. Jiang, J. J. Wen, and Y. Xu. An efficient branch-and-bound algorithm for the assignment of protein backbone NMR peaks. In Proceedings of the First IEEE Computer Society Bioinformatics Conference (CSB 2002), pages 165-174. IEEE Computer Society Press, 2002.
11. X. Wan and G. Lin. CISA: Combined NMR resonance connectivity information determination and sequential assignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Submitted.
12. Y. Xu, D. Xu, D. Kim, V. Olman, J. Razumovskaya, and T. Jiang. Automated assignment of backbone NMR peaks using constrained bipartite matching. IEEE Computing in Science & Engineering, 4:50-62, 2002.
13. H.-N. Lin, K.-P. Wu, J.-M. Chang, T.-Y. Sung, and W.-L. Hsu. GANA - a genetic algorithm for NMR backbone resonance assignment. Nucleic Acids Research, 33:4593-4601, 2005.
14. M. P. Williamson, T. F. Havel, and K. Wüthrich. Solution conformation of proteinase inhibitor IIA from bull seminal plasma by proton NMR and distance geometry. Journal of Molecular Biology, 182:295-315, 1985.

A DATA-DRIVEN, SYSTEMATIC SEARCH ALGORITHM FOR STRUCTURE DETERMINATION OF DENATURED OR DISORDERED PROTEINS

Lincong Wang
Dartmouth Computer Science Department, Hanover, NH 03755, USA
Email: wlincong@cs.dartmouth.edu

Bruce Randall Donald*†
Dartmouth Computer Science Department, Dartmouth Chemistry Department, Dartmouth Department of Biological Sciences, Hanover, NH 03755, USA
Email: brd@cs.dartmouth.edu

Traditional algorithms for the structure determination of native proteins by solution nuclear magnetic resonance (NMR) spectroscopy require a large number of experimental restraints. These algorithms formulate the structure determination problem as the computation of a structure, or a set of similar structures, that best fit the restraints. However, for both denatured and disordered proteins, the number of restraints measured by the current NMR techniques is well below that required by traditional algorithms, and there presumably exists a heterogeneous set of structures in either the denatured or the disordered state. In this paper, we formulate the structure determination problem for such proteins as the computation of an ensemble of structures from the restraints, where each experimental restraint is a distribution. We present a data-driven algorithm capable of computing a set of structures (an ensemble) directly from sparse experimental restraints. In our algorithm, all the backbone conformations consistent with the data are computed by solving a series of low-degree monomials (yielding exact solutions in closed form) and by systematic search with pruning. The algorithm has been successfully applied to determine, in less than a second, the structural ensembles of two denatured proteins, acyl-coenzyme A binding protein (ACBP) and eglin C, using real experimental NMR data. Compared with previous algorithms, our algorithm can extract more structural information from the experimental data.

*Corresponding author. †This work is supported by the following grants to B.R.D.: National Institutes of Health (R01 GM 65982) and National Science Foundation (EIA-0305444).

a Abbreviations used: NMR, nuclear magnetic resonance; NOE, nuclear Overhauser effect; RDC, residual dipolar coupling; PRE, paramagnetic relaxation enhancement; POF, principal order frame; SVD, singular value decomposition; MD, molecular dynamics; SA, simulated annealing; vdW, van der Waals; RMSD, root-mean squared deviation; ACBP, acyl-coenzyme A binding protein; NH, the bond vector between backbone amide nitrogen and amide proton; CαHα, the bond vector between backbone Cα and Hα; NC', the bond vector between backbone amide nitrogen and C'; CαC', the bond vector between backbone Cα and C'.

1. INTRODUCTION

The protein folding problem is fundamental in structural biology. It can be stated as the problem of elucidating how a protein can fold, from the denatured state to the native state. One challenge to solving the folding problem is the lack of knowledge about the structures of proteins in the denatured state b, which are presumably heterogeneous in solution 19, 24. Furthermore, it has been estimated that about one-third of eukaryotic proteins are disordered or partially-disordered in their native state in solution. Such natively-disordered proteins play key roles in signal transduction and genetic regulation, as well as in human diseases such as Alzheimer's and Parkinson's diseases. Although much progress has been made in understanding how structure defines the biological function of native proteins, it is not well known how the structures of disordered proteins determine their function. For denatured proteins, an accurate, quantitative structural distribution is key to solving the protein-folding problem 27, while for disordered proteins, an accurate distribution of the structures is critical for establishing the structure-function relationship. To quantify the structural distribution, it is necessary to compute the ensemble of structures directly from experimental data. At present, NMR a is the only available technique that can measure many individual structural restraints for these proteins. However, even using the most advanced NMR techniques, the number of measured restraints is well below that required by traditional

b In this paper, "denatured state" means the state in which the backbone NH groups have little protection against 1H/2H-exchange.

z) by 2 2 . Section 5 presents the mathematical basis of the algorithm. The biological significance of our results and the use of the computed ensembles to understand protein folding will be addressed in detail in another paper. we present the physical basis for the formulation. y and z are. We then present a datadriven algorithm capable of accurately-computing denatured backbone structures directly from sparse restraints. b 1. biological NMR data. respectively. which are presumably heterogeneous in solution 19' 24 . we first formulate the structure determination problem of both denatured and disordered proteins as the determination of a heterogeneous ensemble of structures. we only present the algorithm and application to denatured proteins. In this formulation. rather than randomly as in previous approaches 14 ' 4 ' 1 6 . experimental NMR data. (3) Successful application of the algorithm to compute the structure ensembles of two denatured proteins from real. A PROBABILISTIC INTERPRETATION OF RESTRAINTS IN THE DENATURED STATE Our algorithm first computes both the backbone dihedral angles and the orientation of each structural fragment15 independently using the orientational restraints from RDCs. The RDC. r = Sxxx2 + Syyy2 + Szzz2 (1) where Sxx. the restraints are distributions. Organization of the paper We begin with a probabilistic interpretation of NMR data in the denatured state in terms of equilibrium statistical physics. acyl coenzyme A binding protein (ACBP) and eglin C. The algorithm can be applied to natively-disordered proteins as well. in section 8 we analyze the complexity of the algorithm and describe its performance in practice.1. (2) A data-driven. Our algorithm is based on a new formulation for structure determination in which each experimental restraint is converted to a distribution. systematic search algorithm for computing an ensemble of all-atom backbone structures for denatured proteins directly from experimental data. 6 . 2. 
NMR structure determination methods 5, 11. Such a formulation is appropriate for the structure determination of a native protein having a single dominant conformation. However, in either the denatured or the disordered state there presumably exists a heterogeneous set of structures in solution 19, 24, so a new formulation is necessary for computing the structures of either denatured or disordered proteins.

In this paper, we first formulate the structure determination problem of both denatured and disordered proteins as the determination of a heterogeneous ensemble of structures. In this formulation, the restraints are distributions. We then present a data-driven algorithm capable of accurately computing denatured backbone structures directly from sparse restraints. Our contributions are:

(1) A new formulation of the structure determination problem for denatured proteins.
(2) A data-driven, systematic search algorithm for computing an ensemble of all-atom backbone structures for denatured proteins directly from experimental data.
(3) Successful application of the algorithm to compute the structure ensembles of two denatured proteins, acyl coenzyme A binding protein (ACBP) and eglin C, from real, biological NMR data.

The algorithm uses considerably more experimental data than previous approaches for characterizing the denatured state from experimental data 16, 14, 4. The larger amount of data, together with the systematic search, significantly increases the accuracy of the computed ensembles. Furthermore, in our algorithm the conformational space consistent with the data is searched systematically, rather than randomly as in previous approaches 14, 4, 16.

1.1. Organization of the paper

We begin with a probabilistic interpretation of NMR data in the denatured state in terms of equilibrium statistical physics. Section 3 presents a formulation of the structure determination problem of denatured proteins using experimental NMR data, such as the orientational restraints from residual dipolar coupling (RDC) 25, 12 and the distance restraints from paramagnetic relaxation enhancement (PRE) 1 experiments. Section 4 reviews existing approaches. Section 5 presents the mathematical basis of the algorithm. Section 6 describes our algorithm for computing an ensemble of structures, for both laboratory-denatured and natively-disordered proteins, from sparse experimental restraints measured in either the denatured or the disordered state. Section 7 briefly presents the results of applying our algorithm to compute the structural ensembles of two denatured proteins. Finally, in Section 8 we analyze the complexity of the algorithm and describe its performance in practice. This paper concentrates on the computer science aspects of the algorithm; the biological significance of our results, and the use of the computed ensembles to understand protein folding, will be addressed in detail in another paper. In the following, we only present the algorithm and its application to denatured proteins; the algorithm can be applied to natively-disordered proteins as well.

2. A PROBABILISTIC INTERPRETATION OF RESTRAINTS IN THE DENATURED STATE

Our algorithm first computes both the backbone dihedral angles and the orientation of each structural fragment c independently, using the orientational restraints from RDCs, and then assembles the computed structural fragments into a complete structure using the distance restraints from PREs. RDCs can be measured on proteins weakly aligned in a dilute liquid crystal medium 25, 26. The RDC, r, between two nuclear spins is related to the direction of the corresponding internuclear unit vector v = (x, y, z) by 22

r = Sxx x² + Syy y² + Szz z²   (1)

where Sxx, Syy and Szz are the three diagonal elements of the diagonalized Saupe matrix S (the alignment tensor), which specifies the ensemble-averaged anisotropic orientation of a molecule in the laboratory frame, and x, y and z are, respectively, the x-, y- and z-components of v in a principal

c A structural fragment consists of m consecutive residues, typically m ≈ 10.

is required to interpret the RDCs measured in the denatured state. 4. the sines and cosines of the 2n ((f>. S. d. However. Before diagonalization. Traditional NMR structure determination methods 5 ' n . can be used to interpret all the experimental RDCs by Eq. For a folded. Note that x2 + y2 + z2 = 1 and Sxx + Syy + Szz = 0. given both the RDC r and tensor S. PREVIOUS WORK Solution NMR spectroscopy is the only experimental technique currently capable of measuring geometric restraints for individual residues of a denatured protein at the atomic level. The structure determination problem for a denatured protein is to compute an ensemble of presumably heterogeneous structures that are consistent with the experimental data within a relative large range for the data. c n = (</>i. Each tensor in the set Q represents a cluster of denatured structures that have similar structures and align similarly in the medium. Traditional NOE-based approaches cannot be used since long-range NOEs. Under the isolated two spin assumption. Recently. In fact. Tensor S must be known first in order to extract orientational restraints from RDC data. long-range NOEs. However. S is a 3 x 3 symmetric. the backbone conformation of an nresidue protein is completely determined by a 2n-tuple of backbone dihedral angles. the structure determination problem for denatured proteins can be formulated as the computation of a set of conformation vectors. The reason is that PRE is almost 2. which are critical for applying the traditional approaches to determine NMR structures. cn.5' n are generally too weak to be detected on denatured proteins. THE STRUCTURE DETERMINATION PROBLEM FOR DENATURED PROTEINS As is well known. PRE can be observed even in the denatured state between an electron spin and a nuclear spin as far as 20 A away. (1). 4>n. a set of tensors. . The set of RDCs corresponding to each tensor in the set Q can be sampled from the individual distributions associated with each measured RDC. 
V>i) are the dihedral angles of residue i. The c PRE-derived distance. This 2n-tuple will be called a conformation vector. the observed intensity of cross-peaks in either a PRE or NOE experiment) are proportional to r . The different tensors in the set Q represent different conformations in the denatured state that are oriented differently in the aligning medium. developed for computing structures in the native state. bond angle and peptide plane UJ angle. in the denatured state. Eq.000-fold stronger than the NOE. a single global tensor. The tensor S is also a random variable. • . denatured proteins 23. require more than 10 restraints per residue. (1) represents the projection onto a 2sphere of an ellipse of solutions for the orientation of the vector v with respect to a global frame (POF) common to all the RDCs measured on the same aligned protein. it has been shown that RDCs can also be measured accurately on weakly-aligned. where (<fo. while no NOE between two nuclear spins can be observed at such a distance. is also a random variable. V'n). . both PRE and NOE (that is. 3. i/'i. 2' 18 ' 9 and disordered proteins 4 . according to equilibrium statistical physics 15 . More precisely. Thus. The distribution for each RDC can be defined by an RDC random variable that has as its sampling space the RDCs of all the orientations of the corresponding vector v in different structures that exist in the denatured state. The experimentally-measured RDC value is the expectation. Q. derived mainly from NOE experiments. Recently developed RDC-based approaches for computing native structures rely on either heuristic approaches such as restrained molecular dynamics (MD) and simulated annealing (SA) 1 0 ' 1 3 or a structural database 8 ' 2 1 . 
are usually too weak to be detected in the denatured The main difference between PRE and NOE is that PRE results from the dipole-dipole interaction between an electron and a nucleus while the physical basis of NOE is the dipole-dipole interaction between two nuclei. traceless matrix with five independent elements. where the measured value is an average over all the possible structures in the denatured state. given the distributions for all the RDCs r and for all the PREs d. native n-residue protein. I/J) angles are sufficient to determine a backbone conformation. .6 where r is the distance between two spins. It is not clear how to extend these native structure determination approaches to compute the desired denatured structures. which are critical for computing structures using traditional methods. c n . to compute a well-defined native structure. RDCs alone or in combination with other NMR-measured geometric restraints have been used extensively to determine and refine the solution structures of native proteins 7 ' 29 . Paramagnetic relaxation enhancement (PRE) is similar to the nuclear Overhauser effect (NOE) 30 in terms of physics0.69 order frame (POF) which diagonalizes S. In fact. given bond length.
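In the diagonalized POF, Eq. (1) reduces to the bilinear form r = Sxx·x² + Syy·y² + Szz·z². The sketch below illustrates that relationship and the traceless, symmetric structure of S; it is an illustration only, and the function name is ours, not the paper's.

```python
import numpy as np

def back_computed_rdc(S, v):
    """Back-compute an RDC from an alignment tensor S and an internuclear
    vector v via the bilinear form v^T S v (the diagonalized-POF reading of
    Eq. (1): r = Sxx*x^2 + Syy*y^2 + Szz*z^2)."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)          # enforce x^2 + y^2 + z^2 = 1
    return float(v @ S @ v)

# A Saupe-style tensor is symmetric and traceless, so it has five
# independent elements: Sxx + Syy + Szz = 0 fixes one diagonal entry.
S = np.diag([-2e-4, -3e-4, 5e-4])      # hypothetical principal values

r = back_computed_rdc(S, [1.0, 1.0, 1.0])
print(abs(np.trace(S)) < 1e-12)        # prints True: the tensor is traceless
```

A vector lying along a principal axis simply returns the corresponding principal value, e.g. v = (0, 0, 1) gives r = Szz.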

Recently, approaches [14, 4] have been developed to build structural models for the denatured state using one RDC per residue. These approaches are generate-and-test. They begin with the construction of a library of backbone (φ, ψ) angles, using only the (φ, ψ) angles occurring in the loops of the native proteins deposited in the PDB. They then randomly select (φ, ψ) angles from the library to build an ensemble of backbone models, and the models are tested by comparing the experimental RDCs with the average RDCs back-computed from the ensemble of backbone structures.

There are three problems with these methods. First, the (φ, ψ) angle library is biased, since only the (φ, ψ) angles from the loops of the native proteins are used; the models constructed from the library may therefore be biased towards the native conformations in the PDB. Second, random selection may miss valid conformations. Third, the agreement of the experimental RDCs with the average RDCs back-computed from the ensemble of structures may result from over-fitting. Over-fitting is likely since one RDC per residue is not enough to restrain the orientation of an internuclear vector (such as the NH bond vector) to a finite set: an infinite number of backbone conformations can agree with one RDC per residue, while only a finite number of conformations agree with two RDCs per residue [29]. Due to the data sparsity and large experimental errors, the generated models have large uncertainty.

In the database-based approaches, RDCs are employed to select structural fragments mined from the protein databank (PDB) [3], a database of experimentally-determined native structures. A backbone structure for a native protein is then constructed by linking together the RDC-selected fragments using a heuristic method. Compared with the MD/SA approaches, the database-based approaches require fewer RDCs; previous RDC-based MD/SA approaches typically require either more than 5 RDCs per residue, or at least 3 RDCs and 1 NOE per residue (most of them long-range), to compute a well-defined native structure. However, these database-based approaches have not been extended to compute structures for denatured proteins.

A generate-and-test approach [6], using mainly NOE distance restraints, has been developed to determine the ensemble of all-atom structures of an SH3 domain in the unfolded state in equilibrium with a folded state. An unfolded state in equilibrium with a folded state [6] differs from the denatured state in this paper; the denatured state in this paper (see section 1) has been called the "unfolded state" [20]. In [6], the observed NOEs result from the equilibrium between the folded and unfolded states, not from the unfolded state alone, and the relatively large experimental errors as well as the sparsity and locality of NOEs similarly introduce large uncertainty in the resulting ensemble of structures. PREs alone are, in general, insufficient to define precisely even the backbone Cα-trace, although all-atom models for the denatured state have been computed previously in a generate-and-test manner in [16] by using PREs to select structures from all-atom MD simulation at high temperature. In summary, neither the traditional NOE-based methods nor the above RDC-based approaches can be applied to compute all-atom backbone structures in the denatured state at this time.

5. THE MATHEMATICAL BASIS OF OUR ALGORITHM

Our algorithm uses a set of low-degree (≤ 4) monomials for computing, exactly and in constant time, the sine and cosine of the backbone φ angle from a CH RDC and those of the ψ angle from an NH RDC [28]. These monomials have been derived from the RDC equation (1) and protein backbone kinematics, and have been described in detail elsewhere [29, 28, 31]. In the following, for ease of exposition, we state the monomials for computing, given an alignment tensor S, the sines and cosines of individual backbone dihedral (φ, ψ) angles in the POF. NH and CH RDCs denote, respectively, the RDCs measured on NH and CH bond vectors. Starting with peptide plane i, we can compute the sines and cosines of the φi, ψi angles from the CH RDC of residue i and the NH RDC of residue i + 1 using the following two Propositions:

Proposition 5.1 [28] Given the orientation of peptide plane i in the POF (see section 2) of RDCs, the x-component of the CH unit vector u of residue i can be computed from the CH RDC of residue i by solving a quartic monomial in x. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x² + y² + z² = 1. Given u, the sine and cosine of the φi angle can be computed by solving linear equations.
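The quartic step of Proposition 5.1 can be carried out with any standard root finder; only the real roots with magnitude at most 1 are candidate components of a unit vector. The sketch below uses a generic quartic as a stand-in — the actual coefficients would come from the RDC equation and backbone kinematics, and the function name is ours.

```python
import numpy as np

def real_roots_in_unit_interval(coeffs):
    """Solve a quartic (or lower-degree) monomial given its coefficients
    (highest degree first) and keep the real roots with |x| <= 1, the only
    candidates for a component of a unit vector.  Generic stand-in for the
    quartic step of Propositions 5.1/5.2."""
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-9].real
    return sorted(float(x) for x in real if -1.0 <= x <= 1.0)

# x^4 - x^2 = x^2 (x - 1)(x + 1): all four roots are valid components.
print(real_roots_in_unit_interval([1.0, 0.0, -1.0, 0.0, 0.0]))
# -> [-1.0, 0.0, 0.0, 1.0]
```

Each surviving root then feeds the linear equations that give the sine and cosine of the corresponding dihedral angle, which is what makes the per-residue solution set finite.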

Proposition 5.2 [28] Given the orientation of peptide plane i in the POF of RDCs, the x-component of the NH unit vector v of residue i + 1 can be computed from the NH RDC by solving a quartic monomial in x. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x² + y² + z² = 1. Given v, the sine and cosine of the ψi angle can be computed by solving linear equations.

Furthermore, given the orientation of peptide plane i (plane i stands for the peptide plane for residue i in the protein sequence) and the sines and cosines of the intervening φi, ψi angles, the orientation of the peptide plane for residue i + 1 can be computed. According to Propositions 5.1–5.2, given a tensor S, the orientation of the peptide plane for residue 1 (the first peptide plane) of the protein sequence, and CH and NH RDCs, all the sines and cosines of the backbone (φ, ψ) angles can be computed from RDCs by solving a series of quartic and linear equations. Thus, given bond lengths, bond angles, the peptide-plane ω angle and two RDCs per residue, a finite and algebraic set of backbone conformations can be determined exactly, in closed form: the set of conformations consistent with two RDCs per residue is finite and algebraic. Since the maximum number of solutions for the (φ, ψ) angles of a single residue is bounded [28, 29], this set of conformations can be computed by a systematic search such as a depth-first search over a k-ary tree, where k ≤ 64. Taken together, an ensemble of denatured structures can be computed exactly by solving a series of monomials, each with degree ≤ 4, using different sets of two RDCs per residue sampled from their distributions and the corresponding tensors S from the set Q. In summary, we have stated the mathematical basis of our algorithm.

For the native state, given a structure and the experimental RDCs, a tensor S can be computed by SVD [17] to minimize the RDC RMSD, Er = √((1/u) Σi (ri − r′i)²), where u is the total number of RDCs and ri and r′i are, respectively, the experimental RDC for residue i and the RDC back-computed from the structure using the tensor S by Eq. (1); r′i is a function of S, so by minimizing Er a tensor S can be computed. For native proteins, this single tensor can be determined during structure computation (if secondary structure elements are known [29, 28]). However, it is physically infeasible to use a single tensor to interpret all the experimental RDCs on a denatured protein (see section 2). Rather, one should use a set, Q, of different tensors to compute all the possible different conformations in the denatured state; this set of tensors Q is updated continuously during the structure computation.

The goal of the present algorithm is therefore to compute a presumably heterogeneous ensemble of structures that are consistent with the experimental data within a large range, rather than a single structure or a set of similar structures that best fits the data (as in the native state).

6. AN ALGORITHM FOR STRUCTURE DETERMINATION OF DENATURED PROTEINS

Our algorithm for computing the structure ensemble of a denatured protein extends, but differs substantially from, our previous algorithms [29, 28, 31] for computing the backbone structures of native proteins. The algorithm computes the ensemble using a divide-and-conquer strategy for efficiency.

6.1. Divide-and-conquer strategy

The algorithm first divides the entire protein sequence into p fragments, F1, ..., Fp, and p − 1 linkers, L1, ..., Lp−1 (Fig. 1). A linker consists of the residues between two neighboring fragments. Next, the algorithm computes, independently for each fragment Fi, i = 1, ..., p, an ensemble of structures, Wi, using the experimental RDCs for Fi and sets of two RDCs per residue sampled from the RDC distributions (see section 2). This step is called Fragment computation (Fig. 2) and will be detailed in Section 6.2. Next, for each structure in ensemble Wi, we compute the corresponding tensor ti by singular value decomposition (SVD) [17] and save it into a set Ti. The algorithm then merges all the tensors in the sets Ti, i = 1, ..., p, into p-tuples, (t1, ..., tp), such that ti is from the set Ti and all p tensors in a p-tuple have their Syy and Szz values agreeing with one another up to the ranges defined by [Syy − δyy, Syy + δyy] and [Szz − δzz, Szz + δzz], where δyy and δzz are thresholds.
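The SVD/tensor-update step above is a linear least-squares problem: with Sxx eliminated by tracelessness, each measured RDC is linear in the five remaining tensor elements. A minimal sketch of that standard order-matrix fit follows; the variable names are ours, not the paper's.

```python
import numpy as np

def fit_alignment_tensor(vectors, rdcs):
    """Least-squares (SVD-based) fit of the five independent elements of a
    symmetric, traceless alignment tensor S from unit bond vectors and the
    RDCs measured on them -- the standard order-matrix/SVD step."""
    v = np.asarray(vectors, float)
    x, y, z = v[:, 0], v[:, 1], v[:, 2]
    # r = Syy*(y^2 - x^2) + Szz*(z^2 - x^2) + 2*Sxy*x*y + 2*Sxz*x*z + 2*Syz*y*z
    A = np.column_stack([y * y - x * x, z * z - x * x,
                         2 * x * y, 2 * x * z, 2 * y * z])
    sol, *_ = np.linalg.lstsq(A, np.asarray(rdcs, float), rcond=None)
    syy, szz, sxy, sxz, syz = sol
    sxx = -syy - szz                      # tracelessness fixes Sxx
    return np.array([[sxx, sxy, sxz], [sxy, syy, syz], [sxz, syz, szz]])

# Round-trip check on synthetic data: recover a known traceless tensor.
rng = np.random.default_rng(0)
S_true = 1e-4 * np.array([[1.0, 2.0, 3.0],
                          [2.0, 4.0, 5.0],
                          [3.0, 5.0, -5.0]])      # trace = 0
vs = rng.normal(size=(50, 3))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)
rdcs = np.einsum("ni,ij,nj->n", vs, S_true, vs)   # exact Eq. (1) values
print(np.allclose(fit_alignment_tensor(vs, rdcs), S_true))  # prints True
```

Diagonalizing the fitted tensor then yields the principal values Sxx, Syy, Szz and the fragment's orientation in the POF, exactly as used by the merge step.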

For each merged p-tuple, the algorithm then computes a common tensor by SVD, using the corresponding structures in Wi, i = 1, ..., p, and all the experimental RDCs for F1, ..., Fp, and saves the common tensors into a set Q. Finally, the algorithm computes the linkers, L1, ..., Lp−1, using every common tensor in Q, and assembles the corresponding fragments and linkers into complete backbone structures. This step is called Linker computation and assembly (Fig. 3) and will be detailed in Section 6.3.

6.2. Fragment computation

A structure ensemble, Wi, of an m-residue fragment Fi is computed as follows (Fig. 2). First, the algorithm estimates an initial tensor S0,i by SVD, using the experimental RDCs and a model built with the backbone (φ, ψ) angles for polyproline II. The algorithm then selects b different sets of RDCs, Rt, t = 1, ..., b, for the fragment by randomly sampling CH and NH RDC values from their respective normal distributions. For each sampled set Rt, the algorithm computes an optimal conformation vector by systematically searching over all the possible conformation vectors, cm, of 2m-tuples (φ1, ψ1, ..., φm, ψm) computed from Rt, where the φk angle for residue k is computed according to Proposition 5.1 from the sampled CH RDC for residue k, and the ψk angle is computed according to Proposition 5.2 from the sampled NH RDC for residue k + 1. For additional RDCs (CC or NC RDCs), the interested reader can see Propositions 10.1 and 10.2 (section 10 of the APPENDIX) for the detail. An optimal conformation vector is a vector which has the minimum score under a scoring function TF defined as

TF = Er² + wv Ev²,    (2)

where Er = √((1/u) Σ (rjk − r′jk)²) is the RDC RMSD, u is the number of RDCs, and rjk and r′jk are, respectively, the experimental RDC j of residue k and the corresponding RDC back-computed from the structure. Since the (φ, ψ) angles are computed from the sampled CH and NH RDCs by exact solution, the back-computed NH and CH RDCs are in fact the same as their sampled values; Er is minimized as cross-validation using Eq. (2). The variables wv and Ev are, respectively, the relative weight and score for van der Waals (vdW) repulsion. Ev is computed with respect to a quasi-polyalanine model built with cm: if a residue is neither a glycine nor a proline in the protein sequence, it is replaced with an alanine residue, so the quasi-polyalanine model consists of alanine, glycine and proline residues with proton coordinates. If the vdW distance between two atoms computed from the model is larger than the minimum vdW distance between the two atoms, the contribution of this pair of atoms to Ev is set to zero.

The search step is followed by an SVD step to update tensors, and the algorithm repeats the cycle of systematic search followed by SVD (systematic-search/tensor-update) to compute a new ensemble of structures using each of the newly-computed tensors. The output of the fragment computation for a fragment i is a set of conformation vectors ch,w, where h is the number of cycles of systematic search/tensor-update and w indexes the structures. The diagonalization of the tensor returned from SVD gives not only the diagonal elements, Sxx, Syy and Szz, but also the orientation for each fragment in the common POF; the orientations of all the peptide planes in the POF are returned from SVD, where the first and last peptide planes are used for computing (φ, ψ) angles from RDCs by Propositions 5.1–5.2.

6.3. Linker computation and assembly

Given a common tensor S in set Q and the orientations of two fragments F1 and F2 in the POF for S, an m-residue linker L1 between them is computed as shown in Fig. 3. The computation of a linker can start from either its N-terminus, as detailed in Fig. 3, or from its C-terminus, depending on the availability of experimental data. The scoring function for the linker computation, TL, is computed similarly to TF, but has two new terms, Ev and Ep:

TL = Er² + wv Ev² + wp Ep²,    (3)

where Er is the RDC RMSD as in Eq. (2); TL lacks the term used in [29, 28, 31] for restraining (φ, ψ) angles to the favorable Ramachandran region for a typical α-helix or β-strand. The main difference is that Ev for a linker is computed with respect to an individual structure composed of all the previously-computed and linked fragments together with the current linker built with the backbone (φ, ψ) angles computed from RDCs. The PRE violation, Ep, which is essentially the PRE RMSD, is computed as Ep = √((1/o) Σi (di − d′i)²), where di and d′i are, respectively, the experimental PRE distance and the distance between the two Cα atoms back-computed from the model, and o is the number of PRE restraints; if d′i < di, the contribution of PRE restraint i to Ep is set to zero. An experimental PRE distance restraint is between two Cα atoms and is computed from the PRE peak intensity [16]. Every two consecutive fragments are assembled (combined) into a single fragment, recursively, and the process stops when all the fragments have been assembled.

Fig. 1. Divide-and-conquer strategy. The terms ci denote conformation vectors for the complete backbone structure. Please see the text for the definition of terms and an explanation of the algorithm.

Fig. 2. Fragment computation: the computation of a structure ensemble of a fragment. The figure shows only two cycles of systematic search followed by SVD. Please see the text for the definitions of other terms and an explanation of the algorithm.
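Eqs. (2)–(3) combine an RDC RMSD term with weighted vdW-repulsion and PRE-violation terms. The sketch below shows that arithmetic; the function names are ours, and the one-sided convention in `pre_violation` (only back-computed distances exceeding the experimental distance contribute) is our reading of the zeroing rule in the text.

```python
import math

def rdc_rmsd(r_exp, r_calc):
    """E_r: RMSD between experimental and back-computed RDCs."""
    n = len(r_exp)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r_exp, r_calc)) / n)

def pre_violation(d_exp, d_calc):
    """E_p: RMSD over PRE restraints, counting only violations; a restraint
    whose back-computed Calpha-Calpha distance does not exceed the
    experimental PRE distance contributes zero (assumed sign convention)."""
    terms = [max(0.0, dc - de) ** 2 for de, dc in zip(d_exp, d_calc)]
    return math.sqrt(sum(terms) / len(terms))

def linker_score(r_exp, r_calc, e_vdw, d_exp, d_calc, w_v=1.0, w_p=1.0):
    """T_L = E_r^2 + w_v*E_v^2 + w_p*E_p^2 (Eq. (3)); dropping the PRE term
    gives T_F of Eq. (2)."""
    return (rdc_rmsd(r_exp, r_calc) ** 2
            + w_v * e_vdw ** 2
            + w_p * pre_violation(d_exp, d_calc) ** 2)

print(linker_score([1.0], [2.0], 0.0, [5.0], [7.0]))  # prints 5.0
```

With one RDC off by 1 Hz and one PRE over-distance by 2 Å, the score is 1² + 2² = 5, matching the quadratic form of Eq. (3).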

7. ALGORITHMIC COMPLEXITY AND PRACTICAL PERFORMANCE

The complexity of the algorithm (Fig. 1) can be analyzed as follows. Let the protein sequence be divided into p m-residue fragments and p − 1 m-residue linkers, and let the size of samplings be b. A single SVD step in Fig. 2 takes m·5² + 5³ = O(m) time. The systematic-search step in Fragment computation takes O(b p fᵐ) time to compute all the p ensembles for the p fragments (Fig. 2), where f is the number of (φ, ψ) pairs for each residue computed from two quartic equations (Propositions 5.1–5.2) and pruned using a real-solution filter, as described in [28], and also a vdW (repulsion) filter. The largest possible value for f is 16, but on average f is about 2. Thus, h cycles of systematic-search/SVD take tF time in the worst case, where tF = Σt p bᵗ (fᵐ + m) = O(p bʰ⁺¹(fᵐ + m)) = O(p bʰ⁺¹ fᵐ), since fᵐ is much larger than m. In implementation, only a small number (about 100) of structures out of all the possible bʰ computed structures for fragment i (section 6.2 and Fig. 2) are selected and saved in Wi; the selected structures have TF < Tmax or TL < Tmax, where TF and TL are computed, respectively, by Eq. (2) and Eq. (3), and Tmax is a threshold. This search step is similar to our previous systematic searches, as detailed in [29, 28, 31].

The Merge step generates q p-tuples of alignment tensors according to the ranges for Syy and Szz (section 6.1), where q = γwᵖ, w = |Wi| is the number of structures in Wi, and γ is the percentage of p-tuples selected from the Cartesian product of the sets Ti, i = 1, ..., p. The largest possible value for γ is 1, but in practice it is very small. The Merge step takes O(p wᵖ log w) time, and the SVD step for computing q common tensors from p m-residue fragments takes q(mp·5² + 5³) = O(mpq) time. The Linker computation and assembly step then takes tL time: the linkers are computed and assembled top-down using a binary tree, where at depth k the terms Er, vdW repulsion and PRE violation are computed for the assembled fragment consisting of 2ᵏ m-residue fragments and an m-residue linker (Fig. 1). Although the worst-case complexity is exponential in O(h), the parameters m, b, h and p are rather small constants in practice, with typical values of m = 10, h = 2, p = 6 for a 100-residue protein, b = 8 × 1024, and q = 10³ with w = 100 (see section 11 of the APPENDIX). In practice, on a Linux cluster with 32 2.8 GHz Xeon processors, 20 days are required for computing an ensemble of 231 structures for ACBP, and 7 days for computing an ensemble of 160 structures for eglin C.

8. APPLICATION TO REAL BIOLOGICAL SYSTEMS

We have applied our algorithm to compute the structure ensembles of two proteins from real experimental NMR data: an acid-denatured ACBP and a urea-denatured eglin C.

Application to acid-denatured ACBP. The experimental NMR data [9] has both PREs and four backbone RDCs per residue: NH, CH, NC and CC. An ensemble of 231 structures has been computed for ACBP denatured at pH 2. These 231 structures satisfy all the experimental RDCs (CH, NH, NC and CC) much better than the native structure; the native structure also has very different Saupe elements, Sxx, Syy and Szz. All the 231 structures have no vdW repulsion larger than 0.1 Å, except for a few vdW violations as large as 0.35 Å between a proline and its two nearest neighbors, and have PRE violations, Ep, in the range of 4–7 Å. Further analysis of the computed ensemble shows that the acid-denatured ACBP is neither random coil nor native-like.

Application to urea-denatured eglin C. An ensemble of 160 structures was computed for eglin C denatured at 8 M urea. The computed structures satisfy the experimental CH and NH RDCs much better than the native structure; again, the native structure has very different Saupe elements, Syy and Szz. No structures in the ensemble have a vdW violation larger than 0.1 Å, except for a few vdW violations as large as 0.30 Å. Further analysis of the computed ensemble also shows that the urea-denatured eglin C is neither random coil nor native-like.
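The Merge step described above is a filter over the Cartesian product of the per-fragment tensor sets, keeping only tuples whose Syy and Szz principal values agree within the thresholds δyy and δzz. A minimal sketch (names and data are ours, for illustration):

```python
from itertools import product

def merge_tensor_tuples(principal_values, delta_yy, delta_zz):
    """Merge-step sketch: given, per fragment, a list of (Syy, Szz)
    principal values of candidate alignment tensors, keep the p-tuples from
    the Cartesian product whose Syy values (and Szz values) all agree
    within delta_yy (delta_zz) of one another."""
    kept = []
    for tup in product(*principal_values):
        syy = [t[0] for t in tup]
        szz = [t[1] for t in tup]
        if (max(syy) - min(syy) <= delta_yy
                and max(szz) - min(szz) <= delta_zz):
            kept.append(tup)
    return kept

# Two fragments, two candidate tensors each: only one compatible pair survives.
frags = [[(1.0, -2.0), (5.0, -2.0)],
         [(1.1, -2.05), (9.0, 3.0)]]
print(len(merge_tensor_tuples(frags, 0.5, 0.5)))  # prints 1
```

In the paper's terms, the surviving fraction of the wᵖ product is the factor γ, which is what keeps q = γwᵖ small in practice.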

Z. Compute <£m_i by Proposition 5. step (f). M. (3) for the assembled fragment FiU LiU F2. Ramgopal Mettu.3 (section 10 of APPENDIX). Compared with the previous approaches for characterizing the denatured state from experimental data. 9. In this paper. M. The Principles of Nuclear Magnetism. even while the traditional structure determination methods require a large number of them.0A. 41:3089-3095.i *~ (<f>i> ^l> • • • > 4>m-2. We have shown that the ensemble of denatured structures can be computed using the distributions for the orientational restraints from RDCs by solving a series of low-degree monomials. Acknowledgments We would like to thank Drs. G. Linker computation and Assembly. The Link step. Rj. // see figure caption for an explanation Compute Ep and a new score T'L by Eq. Ackerman and D. Kresten Lindorff-Larsen. Biochemistry. ip) angles for the polyproline II model. Dr. 3. 2002. Andrei Alexandrescu. and all members of Donald lab for helpful discussions and critical reading of the manuscript. 6 is the number of sampling cycles. 4>m. i'm-2) by systematic search. is to translate first the N-terminal of Li to the C-terminal of Fi. T. H. 2.. References 1. There exists an intrinsic 4-fold degeneracy in the relative orientation between two fragments computed using RDCs measured in a single medium. and Dr. Clarendon Press. an ensemble of 160 structures for eglin C. Tony Yan and Drs. Our algorithm is based on the formulation of structure determination of denatured or disordered proteins as the computation of a set of heterogeneous structures from the distributions for the sparse experimental restraints. then translate the C-terminal of the fragment F1UL1 to the N-terminal of F2. Gilliland. Molecular alignment of denatured states of staphylococcal nuclease with strained polyacrylamide gels and surfactant liquid crystalline phases. Westbrook. The main reason is that the current experimental techniques can only provide a sparse number of restraints. 
4>m and Vm by Proposition 10. from the normal distributions for RDCs. Compute VVi-i. More restraints were used in our algorithm. and most importantly. Jane Dyson for communicating to us the values of backbone (cf>. CONCLUSION AND BIOLOGICAL SIGNIFICANCE At present. J. Build a polyalanine model for linker Li using the vector c ' m i . Compute an optimal conformation vector c'' m -2.• •. the ensemble of structures computed by our algorithm is substantially more accurate. Kresten Lindorff-Larsen and Hemming Poulsen for NMR data on acid-denatured ACBP. A. 1961. 3. systematic search algorithm capable of computing the ensemble of denatured solution structures directly from sparse experimental restraints. Berman. N. Abragam. . David Shortle for NMR data on ureadenatured eglin C. i>m) — Link Li to Fi and F2. Oxford.1 using CH RDC for residue m — 1.ipi. We would like to thank Mr. we presented and demonstrated a data-driven. Shortle. (h) HT'L <TLwdEp< TL+-T'L/ (4) Return cm>i // tfie optimal conformation vector Fig.75 For i < 1 to 4 — (1) TL «. The accuratelycomputed structure ensemble makes it possible to answer two key questions in protein folding: (a) are the structures in the denatured state random coils? and (b) are the denatured structures native-like? Our quantitative analysis concludes that the denatured states of both ACBP and eglin C are neither random nor native-like. < (<£i . we have only very limited knowledge of the structural distribution of either laboratory-denatured or natively-disordered proteins. S. exact algebraic solutions in combination with systematic search guarantee that all the valid conformations consistent with the experimental restraints are computed.oo (2) c m ) i < 0 — (3) For j <— 1 to b (a) (b) (c) (d) (e) (f) (g) // 4-fold degeneracy in relative orientation // initialize the conformation vector II sampling cycle Sample a set of RDCs. Feng. Mehmet Apaydin and Chris Bailey-Kellogg. Pmax is the maximum PRE violation allowed and set to be 7.

Natl. M. C. R. 1970. J. M. Part C. Determination of protein backbone using only residual dipolar couplings. J. J. USA. NMR. Blanchard. Bax. Giintert. 12. Flanagan. R. 2005. Sci. 2000. Donald. 2001. 1995. L. Mettu. In IEEE Computer Society Bioinformatics Conference. Tjandra and A. Chem. the sine and cosine of the backbone 4> angle from an N H RDC and those of the ip angle from a CH RDC. Bax. P. W. Soc. By comparison. 24. A. and M. Donald. S. A. Landau and E. Biomol. Mayor. and A. 126:3291-3299. USA. Andrec.. Proc. M. Annual Reports on NMR Spectroscopy (In Press). Chem. Timmins. Am. M. /. 7. Fieber. Determination of protein global folds using backbone residual dipolar coupling and long-range NOE restraints. Shortle and M. C. and P. 2001. 2003. The proof for these two propositions is very 17. A. D. Yale University Press: New Haven. D. Proc. Chem. A. 340:1131-1142. ip) angles from RDCs starting with the Nterminus. 2004. Acad. Cai. Sci. Blackledge. K. 15. CA. J. and J.. Lifshitz. Fersht. J. Int. Bourne. Clore. 138(2):334-342. Proc. Bewley. Weissig. 1999. M. 14. I. 1997. 2005. 21. P. 273:283-298. Wiithrich. 2006. Freund. J. Tolman. Biol. Kristjansdottir. L. S. Tanford.2 of the main text compute the ((j>. Daura. Wang. and J. M. and J. CA. K. J. L. Mol. M. J. 102:13099-13104. C. M. Briinger. G. K. W. Statistical coil model of the unfolded state: Resolving the reconciliation problem. 29:223-242. 30. 1999. Donald. Marion. 278:1111-1114. J. 23. A. M. Protein structure determination using molecular fragment replacement and NMR dipolar couplings. H. G. D. 122(9):21422143. and P. 11.. L. Btirgi. Biomol. Mohana-Borges. respectively. W. L. In IEEE Computer Society Bioinformatics Conference. M. Soc. 2004. 92:9279-9283. D. Protein Chem. 2005. 31. van Gunsteren. Chem. J. Choy and J. J. long-range and transition state interactions in the denatured state of ACBP from residual dipolar couplings. 2000. R. APPENDIX In this appendix. and M. C. G. Starich. J. 
Solution structure of a protein denatured state and folding intermediate. Mol. D. 2001. Wang and B. Nucl. Am. Wang and B. Wright. In IEEE Computer Society Bioinformatics Conference. starting with the C-terminus of a fragment. W. De Novo determination of protein backbone structure from residual dipolar couplings using Rosetta. M. 26. R. 28:235-242. Chem. Kristjansdottir. Chem. 2004. N. R. H. Soc. Mumenfhaler. L. J.76 Bhat. Mol. Stanford University. T.. Chem. XPLOR: A system for X-ray crystallography andNMR. 18. Shortrange. and X. 124(11):2723-2729. H. Determination of an ensemble of structures representing the denatured state of the bovine acyl-coenzyme a binding protein. Hu and L. R. 20. 437:1053-1056. Statistical Physics. Oxford. we first state the polynomials for computing. Am. The key to solving the protein-folding problem lies in an accurate description of the denatured state. F. 2001. Direct measurement of distances and angles in biomolecules by NMR in a dilute liquid crystalline medium. Fischer. Oxford University Press. C. Science. U. J. S. and A. Poulsen. 2002. Natl. Reson. Forman-Kay. Rohl and D. L. 1997. 25:63-71. Sci. 2005. 123:1541-1542. and T R. 102:17002-17007. 10. Sosnick. Adv. CA. V. Religa. Hus. Saupe. A structural model for unfolded proteins from residual dipolar couplings and small-angle x-ray scattering. Protein denaturation. 16. Torsion angle dynamics for NMR structure calculation with the new program DYANA. Stanford University. Kontaxis. C. T. Nuclear magnetic dipole interactions in fieldoriented proteins: Information for structure determination in solution. 19. 28. Persistence of nativelike topology in a denatured protein in 8 M urea. W. Y. P. Freed. Bernado. Peter. 13. and F.1-5. R. Soc. Statistical Mechanics of Chain Molecules. Kroon. A. W. Wang. S. J. Vendruscolo. and M. P. An efficient and accurate algorithm for assigning nuclear overhauser effect restraints using a rotamer library ensemble and residual dipolar couplings. Magn. R. 
Exact solutions for internuclear vectors and backbone dihedral angles from NH residual dipolar couplings in two media. 6. and J. M. Ackerman. Giesen. pages 235-246. Propositions 5. Marion. R. 121(27):6513-6514. Homans. 339:1191-1199. NMR. S. 25. The protein data bank. Residual dipolar couplings: Measurements and applications to biomolecular studies. 9. N. Markson. A. 7:97-112. Lindorff-Larsen. Brown. N. W. Teilum. J. F. R. Calculation of ensembles of structures representing the unfolded state of an SH3 domain. K. Order matrix analysis of residual dipolar couplings using singular value decomposition. and K. Donald. An algebraic geometry approach to backbone structure determination from NMR data. J. Dobson. W. M. 29. Delaglio. L. Biol. Dyson. 308:1011-1023. 22. C. Science. Fieber. New York. 1993. W. 1988. Am. 1980. Prestegard. E. Soc. and B. Stanford University. E. Colubri. A. S. Poulsen. Pergamon Press. Analysis of a systematic search-based algorithm for determining protein backbone structure from a minimal number of residual dipolar couplings. R. F. Shindyalov. Acad. Acids Res. Structural characterization of unfolded states of apomyoglobin using residual dipolar couplings. J. . USA. J. Mol. Blackledge. Theoretical models for the mechanism of denaturation. A. 1968. Impact of residual dipolar couplings on the accuracy of NMR structures determined from a minimal number of NOE restraints. K. 2004. Kuszewski. Ruigrok. 2005. Plory. pages 319-330. and their application in a systematic search algorithm for determining protein backbone structure. 8. 40:351-355. Angew. H. 24:1-95. Angew. Prestegard. Goto. Wang and B. Natl. 293:487-489. Acad. Kennedy. Losonczi. Recent results in the field of liquid crystals. Jha. Ed. Baker. Biol. Biol. 27. Nature. 5. 2004. 4. Am. pages 189-202.

10. LOW-DEGREE POLYNOMIALS FOR COMPUTING BACKBONE DIHEDRAL ANGLES

The following two Propositions, 10.1 and 10.2, are a generalization of Propositions 5.1-5.4 of Ref. 28: they compute backbone structure starting with the C-terminus, rather than the N-terminus, of a fragment. We first state the polynomials for computing, exactly and in constant time, the backbone (φ, ψ) angles from RDCs. We then present a proof for a new proposition for computing the backbone (φ, ψ) angles of the last two residues linking two oriented fragments. In the following, small and capital bold letters denote, respectively, column vectors and matrices; all the vectors are 3D vectors and all the matrices are 3D rotation matrices.

Proposition 10.1. Given the orientation of peptide plane i+1 in the POF of RDCs, the x-component of the NH unit vector v of residue i, in the POF, can be computed from the NH RDC for residue i by solving a quartic monomial in x describing the intersection of two ellipses. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x^2 + y^2 + z^2 = 1. Starting with the peptide plane i+1, the NH vector ellipse equation is a function of the φi angle; the ellipse equation has been described in detail in Ref. 28. Given φi, the sine and cosine of the ψi angle can be computed by solving a linear equation.

Proposition 10.2. Given the orientation of peptide plane i+1 in the POF of RDCs and the rotation matrix from the POF of RDCs to a coordinate frame defined in the peptide plane i, the x-component of the CH unit vector u of residue i, in the POF, can be computed from the CH RDC for residue i by solving a quartic monomial in x describing the intersection of two ellipses. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x^2 + y^2 + z^2 = 1. The CH vector ellipse equation is a function of the ψi angle; given ψi, the sine and cosine of the φi angle can be computed by solving a linear equation. Thus, given v, u, the NH RDC of residue i and the CH RDC of residue i, we can compute the backbone φi, ψi angles.

The backbone (φ, ψ) angles of the last two residues linking two oriented fragments can be computed by the following Proposition.

Proposition 10.3. Given the orientation of peptide planes i and i+2 and the backbone dihedral angle φi, the sines and cosines of the backbone dihedral angles ψi, φi+1 and ψi+1 can be computed exactly and in constant time.

Proof. The proof is similar to the proof for Lemmas 5.1-5.2 of Ref. 28. Let v1, v3 and w1, w3 denote, respectively, the NH and Cα vectors of peptide planes i and i+2 in the POF. From protein backbone kinematics we have

  G1 v3 = Rz(ψi) Ry(φi+1) Rx(θ3) Rz(ψi+1) cv,
  G1 w3 = Rz(ψi) Ry(φi+1) Rx(θ3) Rz(ψi+1) cw,

where the matrix G1 is known from the given orientations of the two peptide planes, cv and cw are two constant vectors, and θ3 is a constant angle. Through algebraic manipulation we can derive the following three simple trigonometric equations satisfied by the six variables, the sines and cosines of ψi, φi+1 and ψi+1:

  a1 sin φi+1 + b1 cos φi+1 = c1,
  a2 sin ψi+1 + b2 cos ψi+1 = c2,
  a3 sin ψi + b3 cos ψi = c3,

where a1, b1, c1 are constants derived from a constant matrix R, and a2, b2, c2 and a3, b3, c3 are simple trigonometric functions of the φi+1 angle. Solving the first equation for φi+1 reduces the other two to equations in single angles, so all six variables can be computed exactly and in constant time. □

11. PARAMETERS AND IMPLEMENTATION OF THE ALGORITHM

Our algorithm (Figs. 1-3 of the main text) is built upon (a) exact solutions for backbone (φ, ψ) angles from RDCs, and (b) a systematic search for exploring all the possible solutions consistent with the experimental restraints and biophysical properties (minimum vdW repulsion). However, several parameters must be chosen to ensure the correctness and convergence of the algorithm. The parameters include: (1) the division of the protein sequence into fragments and linkers; (2) the initial estimation of alignment tensors; (3) the standard deviations of the probability distributions for convolving the experimental RDCs; (4) the size of sampling, b; and (5) the number of systematic-search/SVD cycles, h. We explored via computational experiments the spaces of these parameters to find the proper values that ensure the computed ensembles are stable. In the following, we describe the parameters and implementation of the algorithm.
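Each trigonometric equation of the form a sin θ + b cos θ = c has a closed-form solution, which is what makes this step constant-time. The following is an illustrative sketch only (the function name and normalization are our own, not the authors' implementation):

```python
import math

def solve_sin_cos(a, b, c):
    """Solve a*sin(t) + b*cos(t) = c for t in (-pi, pi].

    Writing a*sin(t) + b*cos(t) = r*sin(t + phi), with r = hypot(a, b) and
    phi = atan2(b, a), gives t = asin(c/r) - phi or t = pi - asin(c/r) - phi
    whenever |c| <= r; otherwise there is no real solution.
    """
    r = math.hypot(a, b)
    # degenerate case a = b = 0 is not handled (every t or no t solves it)
    if r == 0.0 or abs(c) > r:
        return []
    base, phi = math.asin(c / r), math.atan2(b, a)
    # normalize each root back into (-pi, pi]
    return [math.atan2(math.sin(t), math.cos(t))
            for t in (base - phi, math.pi - base - phi)]
```

Once the sine and cosine of one angle are fixed this way, each remaining angle follows from a linear solve, so the whole chain of dihedral computations runs in constant time per residue.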

In general, the division into fragments and linkers is based primarily on the availability of experimental RDCs: the protein sequence is divided where there are missing RDCs, and linkers between two fragments have more missing RDCs than the fragments. Our choice for division emphasizes the experimental data. However, if we exchange a fragment with a linker, and if the linker has many missing RDCs, the computed ensemble differs, to some extent, from the original one. In the implementation, the alignment tensor used to compute the linkers is computed from the structures of the fragments.

In order to see their effects on the computed ensembles, we have run the algorithm with different initial tensors computed by SVD using either an ideal α-helix (φ = -64.3°, ψ = -39.4°), a β-strand (φ = -120.0°, ψ = 138.0°), or a polyproline II model (φ = -80.0°, ψ = 135.0°). The standard deviations for the RDC random variables are, respectively, 8.0 Hz (Hertz) for CH RDCs and 4.0 Hz for NH RDCs; both are much larger than the real experimental errors, which are estimated to be less than 1.0 and 2.0 Hz, respectively. Thus, the probability distributions used to convolve the experimental RDCs are rather broad relative to the experimental values. We have also tested the algorithm using smaller standard deviations (… Hz for CH RDCs and 0.50 Hz for NH RDCs). If no experimental data are available for either CH or NH RDCs, the corresponding φ and ψ angles are selected randomly in the range [-π, π].

We have also tested the algorithm using different sizes b of sampling and different numbers h of systematic-search/SVD cycles. Our computational experiments showed that with b = 8 × 1024 and h = 2, the computed ensemble has already achieved a stable state, since a further increase in either b or h does not change the distributions of backbone (φ, ψ) angles of structures in the denatured state, or the pairwise backbone RMSDs between the structures in the ensembles; the sampling already covers most of the ranges of all the experimental CH and NH RDC values in about one-half of the systematic-search/SVD cycles.

The functional forms for both Ev and Ep in Eq. (2) and Eq. (3) of the main text are flat-bottom harmonic walls, and PRE violation is implemented by the requirement that all the final structures have no RMSD in PREs larger than 7.0 Å. The effects on the final ensembles of the relative weights wv and wp in Eq. (3) are minimal, since vdW repulsion is very small in the final structures.

MULTIPLE STRUCTURE ALIGNMENT BY OPTIMAL RMSD IMPLIES THAT THE AVERAGE STRUCTURE IS A CONSENSUS

Xueyi Wang* and Jack Snoeyink

Department of Computer Science, University of North Carolina at Chapel Hill
Chapel Hill, NC 27599-3175, USA
Email: {xwang, snoeyink}@cs.unc.edu

* Corresponding author.

Root mean square deviation (RMSD) is often used to measure the difference between structures. Pairwise comparison can be extended to multiple structure alignment in several ways; examples from the literature include the sum of all pairwise squared distances7,8 or the average RMSD per aligned position9. In this paper we look at ways to extend RMSD (weighted at aligned positions or unweighted) after a correspondence between atoms has already been chosen. We show mathematically that, for multiple structure alignment, the minimum RMSD (weighted at aligned positions or unweighted) for all pairs is the same as the RMSD to the average of the structures. Thus, using RMSD implies that the average is a consensus structure. We use this property to validate and improve algorithms for multiple structure alignment, and show that an iterative algorithm proposed by Sutcliffe and co-authors can find it efficiently: each iteration takes linear time and the number of iterations is small. We explore the residuals after alignment and assign weights to positions to identify aligned cores of structures. Observing this property also calls into question whether global RMSD is the right way to compare multiple protein structures.

1. INTRODUCTION

Although protein structures are uniquely determined by their sequences1, protein structures are better conserved through evolution than the sequences2. Proteins with similar 3D structures may have similar functions and often evolved from common ancestors3. Multiple structure alignment is an important tool to identify structurally conserved regions, to provide clues for building evolutionary trees and finding common ancestors, and to determine consensus structures for protein families.

As structural biologists classify proteins, how should they compare structures? Pairwise comparisons are commonly performed by measuring the root mean squared deviation (RMSD) between corresponding atoms in two structures, once a suitable correspondence has been chosen and the molecules have been translated and rotated as rigid bodies to the best match4-6. Corresponding atoms may also be given weights so that core atoms have the greatest influence on the matching and on the weighted RMSD score.

For multiple structure alignment, we first need to choose a score function to measure the goodness of the alignment. If we consider the protein structures as rigid bodies, the problem of multiple structure alignment is then to translate and rotate these structures to minimize the score function. Many algorithms have been presented to solve this multiple structure alignment problem. Some first do pairwise structure alignments and then use heuristic methods to integrate the structures: Gerstein and Levitt10 choose the structure that has minimum total RMSD to all other structures as the consensus structure and align the other structures to it; Ochagavia and Wodak9 and Lupyan et al.25 present progressive algorithms that choose one structure at a time and minimize the total RMSD to all the already aligned structures until all the structures are aligned. Other algorithms align all the structures together instead of aligning each pair separately. Two iterative algorithms by Sutcliffe et al.7,8 align protein structures to their average structure, as also done by Verboon and Gabriel13 and Pennec14; we will focus most of our attention on this approach. Other researchers use nondeterministic methods: Sali and Blundell11 use simulated annealing to determine the optimal structure alignments, and Guda et al.12 use Monte Carlo optimization. Several methods also choose a consensus structure to represent the whole alignment. MUSTA15 uses geometric

hashing and finds a consensus structure of Cα atoms. MultiProt16 iteratively chooses each structure as a consensus structure, aligns all other structures to the consensus structure, and detects the largest core among aligned molecules. MASS17 and CBA18 first align secondary structure and then align tertiary structure.

In this paper, we establish the properties of the average structure. In particular, we show that if you use the root of total squared deviation to score multiple structure alignment, then mathematically you obtain the same result by taking the average structure as a consensus structure and doing pairwise alignment to this consensus, whether we recognize it or not. It is better computationally to recognize this, because it reduces the number of pairs of structures that must be compared from n(n-1)/2 to n. We use this property to establish properties of the Sutcliffe et al. algorithms, including a better stopping condition. By modeling deviations from the average positions as 3-dimensional Gaussian distributions, we can also determine weights for well-aligned positions that can determine the aligned core. We also raise the question: "If the average is not the right consensus structure, then what scoring function should replace wRMSD?"

2. METHODS

We define the average of structures and weighted RMSD for multiple structures with position weights, and then establish the properties of wRMSD.

2.1. Average structure and weighted root mean square deviation

We assume there are n structures, each having m points (atoms), so that structure Si for (1 ≤ i ≤ n) has points pi1, ..., pim. For any aligned position k, the n points pik for (1 ≤ i ≤ n) are assumed to correspond. We define the average structure S̄ to have points p̄k = (1/n) Σ_{i=1}^n pik for (1 ≤ k ≤ m).

We may assign a position weight wk ≥ 0 to each aligned position k and define the weighted root mean squared deviation (wRMSD) from the weighted sum of all squared pairwise distances between structures. Note that there are n(n-1)/2 structure pairs and each structure pair has m squared distances, so

  wRMSD = sqrt( Σ_{i=2}^n Σ_{j=1}^{i-1} Σ_{k=1}^m wk ||pik - pjk||^2 / (m n(n-1)/2) ).

We obtain the standard RMSD by setting wk = 1 for (1 ≤ k ≤ m). Weights allow us to emphasize some positions in the alignment (e.g., an aligned core) and reduce or eliminate the influence of other positions. If we want to transform the atom positions to minimize wRMSD, we can instead minimize the weighted sum of all squared pairwise distances Σ_{i=2}^n Σ_{j=1}^{i-1} Σ_{k=1}^m wk ||pik - pjk||^2, because m and n are fixed and the square root function is monotone increasing.

The following technical lemma on weighted sums of squares allows us to make several observations about the average structure under wRMSD.

Lemma 1. For a fixed position k, the total squared distance from p1k, p2k, ..., pnk to any point qk equals the total to the average point p̄k plus n times the squared distance from p̄k to qk:

  Σ_{i=1}^n ||pik - qk||^2 = Σ_{i=1}^n ||pik - p̄k||^2 + n ||p̄k - qk||^2.

Proof. To establish the Lemma, subtract the second term from both sides and expand the difference of squares:

  Σ_{i=1}^n (||pik - qk||^2 - ||pik - p̄k||^2) = Σ_{i=1}^n (p̄k - qk)·(2 pik - qk - p̄k) = n (p̄k - qk)·(2 p̄k - qk - p̄k) = n ||p̄k - qk||^2,

where we apply the definition of p̄k in the penultimate step. □

Our first theorem says that if wRMSD is used to compare multiple structures, then what is really happening is that all structures are being compared to the average structure: the average structure S̄ is a consensus.

Theorem 1. The weighted sum of squared distances for all pairs equals n times the weighted sum of squared distances to the average structure S̄:

  Σ_{i=2}^n Σ_{j=1}^{i-1} Σ_{k=1}^m wk ||pik - pjk||^2 = n Σ_{i=1}^n Σ_{k=1}^m wk ||pik - p̄k||^2.

Proof. In Lemma 1, replace qk by pjk, multiply by the weight wk, and sum over all j and k to obtain

  Σ_{j=1}^n Σ_{k=1}^m Σ_{i=1}^n wk ||pik - pjk||^2 = 2n Σ_{j=1}^n Σ_{k=1}^m wk ||pjk - p̄k||^2.

We can re-arrange the order of summation on the left, noticing that terms with i = j vanish and every other term appears twice. Dividing out the extra factor of two gives the desired result. □
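Theorem 1 is easy to check numerically. The following sketch (our own illustration, with arbitrary synthetic coordinates and weights) verifies that the weighted sum over all structure pairs equals n times the weighted sum of squared distances to the average structure:

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# hypothetical data: n structures of m corresponding 3D points, plus position weights
random.seed(0)
n, m = 5, 7
pts = [[[random.uniform(-10.0, 10.0) for _ in range(3)] for _ in range(m)]
       for _ in range(n)]
w = [random.uniform(0.0, 2.0) for _ in range(m)]

# average structure: pbar_k = (1/n) * sum_i p_ik
avg = [[sum(pts[i][k][d] for i in range(n)) / n for d in range(3)] for k in range(m)]

# weighted sum of squared distances over all unordered structure pairs
pairwise = sum(w[k] * sq_dist(pts[i][k], pts[j][k])
               for i in range(n) for j in range(i) for k in range(m))

# n times the weighted sum of squared distances to the average structure
to_avg = n * sum(w[k] * sq_dist(pts[i][k], avg[k])
                 for i in range(n) for k in range(m))

assert abs(pairwise - to_avg) < 1e-8 * pairwise  # Theorem 1
```

The identity holds for any coordinates and any nonnegative weights, which is why comparing every structure to the average is equivalent to, and cheaper than, comparing all pairs.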

Two more theorems suggest how to choose the structure closest to a given set of structures: if you can choose any structure, then choose the average S̄; if you must choose from a limited set, then choose the structure closest to the average S̄.

Theorem 2. The average structure S̄ minimizes the weighted sum of squared distances from all the structures; i.e., for any structure Q with points q1, ..., qm,

  Σ_{i=1}^n Σ_{k=1}^m wk ||pik - qk||^2 ≥ Σ_{i=1}^n Σ_{k=1}^m wk ||pik - p̄k||^2,

and equality holds if and only if qk = p̄k for all positions with wk > 0.

Proof. This follows immediately from Lemma 1, since n wk ||qk - p̄k||^2 ≥ 0, with equality if and only if qk = p̄k or wk = 0. □

Theorem 3. The structure from a set Q1, ..., Qt that minimizes the weighted sum of squared distances from all the structures Si is the one closest, in wRMSD, to S̄.

Proof. In Lemma 1, Σ_{i=1}^n Σ_{k=1}^m wk ||pik - p̄k||^2 is fixed, so it is both necessary and sufficient to minimize Σ_{k=1}^m wk ||qk - p̄k||^2. □

2.2. Minimizing wRMSD

In structure alignment, we translate and rotate structures in 3D space to minimize wRMSD. We define Ri as a 3×3 rotation matrix and Ti as a 3×1 translation vector for structure Si. The target function is

  min_{Ri, Ti} Σ_{i=2}^n Σ_{j=1}^{i-1} Σ_{k=1}^m wk ||(Ri pik + Ti) - (Rj pjk + Tj)||^2.

We can fix one of the rotations to be the identity and one of the translations to be zero. When there are only two structures, Horn5 showed that the optimal translation and rotation can be found separately: the minimum wRMSD for the pair can be found by translating each structure so its origin is the weighted center of mass, then applying an optimum rotation found with quaternions, so the minimization reduces to a simple computation in R and T. Finding optimum rotations for several structures is harder than for a pair because the minimization problem no longer reduces in this way.

To minimize wRMSD with more than two structures, we can use the fact that the average is the best consensus (Theorem 1) and modify a simple iterative algorithm of Sutcliffe et al.8 to converge to the minimum wRMSD. Instead of directly finding the optimal rotation matrices, we align each structure to the average structure separately to minimize wRMSD; because rotating the structures also changes the average structure, we repeat until the algorithm converges to a local minimum of wRMSD. We can combine Theorem 1 with Horn's analysis to show that wRMSD is minimized when the centroid (weighted center of mass) of each structure coincides with the centroid of the average structure.

Algorithm. Given n structures with m points (atoms) each and weights wk at each position, minimize wRMSD to within a threshold value ε (e.g., ε = 1.0×10⁻⁵):

1. Translate the weighted centroid of each structure Si, for (1 ≤ i ≤ n), to the origin, i.e. Σ_{k=1}^m wk pik = 0. (Optionally, align each structure to a randomly chosen St for a good initial average.)
2. Calculate the average S̄ with points p̄k = (1/n) Σ_{i=1}^n pik, and SD = Σ_{i=1}^n Σ_{k=1}^m wk ||pik - p̄k||^2.
3. For each structure Si, align Si to S̄ using Horn's method to calculate the optimum rotation matrix Ri that minimizes Σ_{k=1}^m wk ||Ri pik - p̄k||^2, and replace each pik by Ri pik.
4. Calculate the new average S̄new and the deviation SDnew = Σ_{i=1}^n Σ_{k=1}^m wk ||pik - p̄new,k||^2. If SD - SDnew < ε, then the algorithm terminates; otherwise, set SD = SDnew and S̄ = S̄new and go to step 3.

Horn's method and our theorems imply that the deviation SD decreases monotonically in each iteration: in step 3 we have

  Σ_{i=1}^n Σ_{k=1}^m wk ||Ri pik - p̄k||^2 ≤ Σ_{i=1}^n Σ_{k=1}^m wk ||pik - p̄k||^2 = SD,

and, from Theorem 2, in step 4 we have

  SDnew = Σ_{i=1}^n Σ_{k=1}^m wk ||pik - p̄new,k||^2 ≤ Σ_{i=1}^n Σ_{k=1}^m wk ||pik - p̄k||^2.

So SDnew ≤ SD, and SD decreases in each iteration. From Theorem 1, we know that minimizing the deviation SD to the average minimizes the global wRMSD. We stop when this decrease is less than the threshold ε; this will be a local minimum of SD. From Horn5, the optimal rotation matrix for two m-atom structures can be calculated in O(m) operations, so initialization and each iteration take O(n m) operations. Because the lower bound for aligning n structures with m points per structure is O(n m), this algorithm is close to the optimum.

We must make two remarks about the paper of Sutcliffe et al.8, which proposed the algorithm above. First, their termination condition was when the deviation between two average structures was small, which is actually testing only the second inequality on the decrease of SD above; it is a stronger condition to terminate based on the decrease of SD. Second, although presented as position-weighted, they actually give different weights to individual atoms, which they change during the minimization. We believe that this may explain why Sutcliffe's algorithm can take many iterations for convergence: the weights are not well-grounded in mathematics.

We can establish analogues of Theorems 1-3 for individual atom weights if the weight of a corresponding pair of atoms is the half-normalized product of the individual weights. To minimize wRMSD for such weights, however, we have observed that it is no longer sufficient to translate the structure centroids to the origin. We plan to explore atom weights more thoroughly in a subsequent paper.

While preparing the final version of this paper, we found two papers with similar iterative algorithms13,14. Verboon and Gabriel13 presented their iterative algorithm as minimizing wRMSD with atom weights (different atoms having different weights), but in fact it works only for position weights, because the optimization of translation and of rotation cannot be separated with atom weights. Pennec14 presented an iterative algorithm for unweighted multiple structure alignment, and our work can be regarded as an extension of his work. Both algorithms use singular value decomposition (SVD) as the subroutine for finding an optimal rotation matrix; quaternions should be used instead because they preserve chirality.

3. RESULTS AND DISCUSSION

3.1. Performance

We test the performance of our algorithm by minimizing the RMSD for 23 protein families from HOMSTRAD19, which are all the families that contain more than 10 structures with total aligned length longer than 100. We run our algorithm 5,000 times for each protein family. Each time we begin by randomly rotating each structure in 3D space and then minimize the RMSD. We expect that the changes in RMSD will be small, since these proteins were carefully aligned with a combination of tools, but we want to make sure that our algorithm does not become stuck in local minima that are not the global minimum. We set ε = 1.0×10⁻⁵ and run the experiment on a 1.8GHz Pentium M laptop with 768M memory. The code is written in MATLAB and is downloadable at http://www.cs.unc.edu/~xwang/.

The results are shown in Table 1. For each protein family's 5,000 tests, the difference between maximum RMSD and minimum RMSD is less than 1.0×10⁻⁵, so they converge to the same local minimum. Our experiments show that, for any start positions of all n structures, the algorithm converges in a maximum of 4-6 iterations; the number of iterations is one fewer when the proteins start with a preliminary alignment from the optional initialization in step 1. Moreover, the optimal RMSD values found by our algorithm are less than the original RMSD from the alignments in HOMSTRAD in all cases. In three cases the relative difference is greater than 3%; in each of these cases there is an aligned core for all proteins in the family, but some disordered regions allow our algorithm to find alignments with better RMSD. These cases clearly call for weighted alignment.

Table 1. Performance of the algorithm on different protein families from HOMSTRAD. We report n, the number of proteins; m, the number of atoms aligned; the RMSD from the HOMSTRAD Alignment (HA); the RMSD for the optimal alignment from our algorithm; the percent relative difference; and statistics on iterations (average, maximum) and running time in milliseconds (average, median, maximum) for 5,000 runs of each alignment. The 23 families are: immunoglobulin domain V set heavy chain; globin; phospholipase A2; ubiquitin conjugating enzyme; lipocalin family; glycosyl hydrolase family 22 (lysozyme); fatty acid binding protein-like; proteasome A-type and B-type; phycocyanin; short-chain dehydrogenases/reductases; serine proteinase (eukaryotic); papain family cysteine proteinase; glutathione S-transferase; alpha amylase (catalytic and C-terminal domains); triose phosphate isomerase; pyridine nucleotide-disulphide oxidoreductases class-I; lactate/malate dehydrogenase; cytochrome p450; aspartic proteinase; legume lectin; serine/threonine protein kinases (catalytic domain); subtilase; and alpha amylase (catalytic domain). [Numeric table entries not reproduced.]

Fig. 1. Average running time vs. (a) the number of atoms (n×m) and (b) the number of structures (n).

The maximum number of iterations is 6, and the average and median number of iterations is around 4, so the number of iterations is a small constant and the algorithm achieves the lower bound of multiple structure alignment, which is Θ(n m). All of the average running times are less than 25 milliseconds and all of the maximum running times are less than 40 milliseconds, which means our algorithm is highly efficient. Figures 1a and 1b show the relationship between the average running time and the number of atoms (n×m) and the number of structures (n) in each protein family. The average running time shows a linear relation with the number of structures but not with the number of atoms, because the most time-consuming operation is computing the eigenvectors and eigenvalues of a 4×4 matrix in Horn's method, which takes O(n) time in each iteration.

3.2. Consensus structure

For a given protein family, one problem is to find a consensus structure to summarize the structure information. Several consensus structures have been proposed, including the proposal of Gerstein and Levitt10 to use the true protein structure that has the minimum RMSD to all other structures. Altman and Gerstein20 and Chew and Kedem21 propose to use the average structure of the conserved core as the consensus structure. Some other proposed consensus structures are even more schematic: Taylor et al.22, Chew and Kedem21, and Ye and Janardan23 use vectors between neighboring Cα atoms to represent protein structures and define a consensus structure as a collection of average vectors from aligned columns, as does POSA of Ye and Godzik24, which builds a consensus structure by rearranging input structures based on alignments of partial order graphs.

By Theorems 1 and 2, the wRMSD is minimized by aligning to the average structure, and no other structure has better wRMSD with all structures. Thus, we claim that the average structure is the natural candidate for the consensus structure. One objection to this claim is that the average structure is not a true protein structure: it may have physically unrealizable distances or angles due to the averaging. Whether this matters depends on the intended use for the consensus structure. But a more significant answer comes from Theorem 3: if you do have a set of structures from which you wish to choose a consensus, then you should choose from this set the structure with minimum wRMSD to the average.

Fig. 2. Multiple structure alignment for pyridine nucleotide-disulphide oxidoreductases class-I: (a) all 11 aligned proteins; (b) the consensus structure; (c) the structure with minimum RMSD; (d) the structure with maximum RMSD.

3.3. Statistical analysis of deviation from consensus in aligned structures

Deriving a statistical description of aligned protein structures is an intriguing question that has significant theoretical and practical implications. As a first step, we investigate the following question concerning the spatial distribution of aligned positions in a protein family. More specifically, we want to test the null hypothesis that, at a fixed position k, the distances of the n atoms from the average p̄k are consistent with distances from a 3D Gaussian distribution. We chose the Gaussian not only because it is the most widely used distribution function, due to the central limit theorem of statistics, but also because previous studies hint that the Gaussian is the best model to describe aligned structures25. If, by checking our data, we can establish the fact that aligned positions are distributed according to the Gaussian distribution in 3D, then the set of aligned protein structures can be conveniently described by a concise model composed of the average structure and the covariance matrix specifying the distribution of the positions, which can help biologists compare protein structures, especially those in the "core" area of protein structures. Figure 2 shows the alignment of the conserved core of the protein family pyridine nucleotide-disulphide oxidoreductases class-I.

To test the fitness of our data to the hypothesized 3D Gaussian model, we adopted the quantile-quantile plot (q-q plot) procedure26, which is commonly used to determine whether two data sets come from a common distribution. Figure 4a shows the q-q plot for the best aligned position: the y-axis is the distances from each structure to the average structure for the aligned position, and the x-axis is the quantile data from the 3D Gaussian. The correlation coefficient R² is 0.9632, which suggests that the data fit the 3D Gaussian model pretty well. We carried out the same experiments for all the aligned positions, and the collected histogram of the correlation coefficients R² is shown in Figure 4b. We identify that more than 79% of the positions we check have R² > 0.9, which suggests that the data fit the 3D Gaussian model pretty well.

The types of curves in q-q plots reveal information that can be used to classify whether a position should be deemed part of the core. Most positions produce curves with all or almost all points on a line through the origin; low slope indicates that they align well. A few plots begin above the line and come down, which indicates that the two corresponding structures have larger errors in this position than would be predicted by a Gaussian distribution, or stay on a line of higher slope, indicating that such positions are disordered and should not be considered part of the core. The illustrated q-q plot has the last two curves above the line.

Fig. 4. 3D Gaussian distribution analysis of the distances from each atom to corresponding points on the average structure: (a) distribution of the best aligned position; (b) histogram of R² for all aligned positions.

3.4. Determining and weighting the core for aligned structures

There are many ways in which we can potentially use this model of the alignment in a family to determine the structurally conserved core of the family. Due to space constraints, we briefly demonstrate one heuristic for determining position weights to identify and align the conserved core of two of our structure families.
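The q-q fit statistic can be sketched as follows (our own illustration, not the authors' code: the reference quantiles of the 3D Gaussian distance, a chi distribution with three degrees of freedom, are approximated here by Monte Carlo sampling rather than by an inverse CDF, and `qq_r2` is a hypothetical name):

```python
import math
import random

def qq_r2(distances, trials=100_000, seed=42):
    """Quantile-quantile correlation R^2 between sorted observed distances and
    reference quantiles of the distance from the mean of an isotropic 3D
    Gaussian, approximated by sampling."""
    rng = random.Random(seed)
    n = len(distances)
    # reference sample of |N(0, I_3)| distances
    ref = sorted(
        math.sqrt(rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2)
        for _ in range(trials)
    )
    # reference quantile for rank (i + 0.5)/n
    qs = [ref[int((i + 0.5) * trials / n)] for i in range(n)]
    ys = sorted(distances)
    mx, my = sum(qs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(qs, ys))
    sxx = sum((x - mx) ** 2 for x in qs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)
```

R² close to 1 indicates the per-position deviations are plausibly 3D Gaussian (up to scale, since correlation is scale-invariant); heavy-tailed or bimodal deviation patterns, as at disordered positions, pull R² down.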

We use the following iterative algorithm to assign weights:

1. Align the protein structures by RMSD using the algorithm of Section 2.2.
2. For each aligned position k, calculate the distances dik = ||pik - p̄k|| for (1 ≤ i ≤ n) and the average squared distance ak = (1/n) Σ_{i=1}^n dik². Then calculate the mean ā and standard deviation σ of the ak, and the correlation coefficient Rk² by assuming that deviations have a 3D Gaussian distribution.
3. If all ak < ā + 3σ, then exit the algorithm. Otherwise, set all weights

  wk = Rk²/ak if ak < ā + 3σ; 0 otherwise,

align the structures by wRMSD, and go to step 2.

The term 1/ak in the weights encourages the alignment in the positions where the average squared deviations are small, and the term Rk² encourages those positions where the distances to the average structure are close to a 3D Gaussian distribution.

We tested the algorithm on the protein families from the HOMSTRAD database that have more than 10 proteins with total aligned length longer than 100 atoms. Figure 3 shows two examples of alignments, where black is the core and gray are portions that are eliminated by being given weight zero: the black colored positions satisfy ak < ā + σ, dark-gray colored atoms satisfy ā + σ ≤ ak < ā + 2σ, gray colored atoms satisfy ā + 2σ ≤ ak < ā + 3σ, and light-gray colored atoms satisfy ak ≥ ā + 3σ, often due to divergence of some or all members in the family.

Figure 3. Aligned protein families using position weights: (a) pyridine nucleotide-disulphide oxidoreductases class-I; (b) proteasome A-type and B-type.

4. CONCLUSION

In this paper, we analyzed the problem of multiple structure alignment using weighted RMSD with weights at aligned positions, which includes RMSD as a special case. We show that this problem is the same as minimizing the wRMSD to the average structure; thus, the average structure is the natural choice for a consensus structure. Based on this property, we create an efficient iterative algorithm for minimizing the wRMSD and prove its convergence and other properties. Each iteration takes time proportional to the number of atoms in the structures, so the algorithm achieves the linear lower bound for multiple structure alignment. Regardless of the starting positions of the structures, the tests show that the algorithm converges to the same local minimum, which is most probably the global minimum. The tests also show that the number of iterations is a small constant whenever the input does not have near symmetry. The results show that our algorithm minimizes the wRMSD in less than 50 milliseconds in MATLAB for every protein family tested.

The algorithm in this paper is for aligning protein structures after a sequence alignment. We plan to extend our work to weighted multiple structure alignment with atom weights at different atoms (which includes gapped structure alignment as a special case). We also plan to devise new algorithms that achieve better aligned structures by combining sequence and structure alignment, and to build 3D Hidden Markov Models for protein structure classification.

Acknowledgments

We thank Jun Huan for helpful discussion and advice during this work, and the reviewers for their comments. This work is partially supported by NIH grant 127666 and a UNC Bioinformatics and Computational Biology training grant.

References

1. Alexandrov V, Gerstein M. Using 3D Hidden Markov Models that explicitly represent spatial coordinates to model and compare protein structures. BMC Bioinformatics 2004; 5: 2.
2. Anfinsen CB, Sela M, White FH Jr. Reductive cleavage of disulfide bridges in ribonuclease. Science 1957; 125: 691-692.
3. Branden C, Tooze J. Introduction to Protein Structure. 2nd ed. Garland Publishing, New York, 1999.
4. Chew LP, Kedem K. Finding the consensus shape for a protein family. Algorithmica 2002; 38(1): 115-129.
5. Coutsias EA, Seok C, Dill KA. Using quaternions to calculate RMSD. Journal of Computational Chemistry 2004; 25(15): 1849-1857.
6. Dror O, Benyamini H, Nussinov R, Wolfson HJ. Multiple structural alignment by secondary structures: algorithm and applications. Protein Science 2003; 12: 2492-2507.
7. Ebert J, Brutlag D. Development and validation of a consistency based multiple structure alignment algorithm. Bioinformatics 2006; 22(9): 1080-1087.
8. Gerstein M, Levitt M. Comprehensive assessment of automatic structural alignment against a manual standard, the SCOP classification of proteins. Protein Science 1998; 7: 445-456.
9. Guda C, Scheeff ED, Bourne PE, Shindyalov IN. A new algorithm for the alignment of multiple protein structures using Monte Carlo optimization. Proceedings of Pacific Symposium on Biocomputing 2001: 275-286.
10. Holm L, Sander C. Mapping the protein universe. Science 1996; 273: 595-602.
11. Horn BKP. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A: Optics, Image Science, and Vision 1987; 4(4): 629-642.
12. Kabsch W. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica A 1978; 34: 827-828.
13. Leibowitz N, Nussinov R, Wolfson HJ. MUSTA — a general, efficient, automated method for multiple structure alignment and detection of common motifs: application to proteins. Journal of Computational Biology 2001; 8(2): 93-121.
14. Lupyan D, Leo-Macias A, Ortiz AR. A new progressive-iterative algorithm for multiple structure alignment. Bioinformatics 2005; 21(15): 3255-3263.
15. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 1998; 7: 2469-2471.
16. Ochagavia ME, Wodak S. Progressive combinatorial algorithm for multiple structural alignments: application to distantly related proteins. Proteins 2004; 55(2): 436-454.
17. Pennec X. Multiple registration and mean rigid shapes: application to the 3D case. Proceedings of the 16th Leeds Annual Statistical Workshop 1996: 178-185.
18. Sali A, Blundell TL. The definition of general topological equivalence in protein structures: a procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. Journal of Molecular Biology 1990; 212: 403-428.
19. Shatsky M, Nussinov R, Wolfson HJ. A method for simultaneous alignment of multiple protein structures. Proteins 2004; 56(1): 143-156.
20. Sutcliffe MJ, Haneef I, Carney D, Blundell TL. Knowledge based modelling of homologous proteins. Part I: Three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Engineering 1987; 1(5): 377-384.
21. Taylor WR, Flores TP, Orengo CA. Multiple protein structure alignment. Protein Science 1994; 3: 1858-1870.
22. Verboon P, Gabriel KR. Generalized Procrustes analysis with iterative weighting to achieve resistance. British Journal of Mathematical and Statistical Psychology 1995; 48: 57-74.
23. Ye J, Janardan R. Approximate multiple protein structure alignment using the sum-of-pairs distance. Journal of Computational Biology 2004; 11(5): 986-1000.
24. Ye Y, Godzik A. Multiple flexible structure alignment using partial order graphs. Bioinformatics 2005; 21(10): 2362-2369.
25. Gerstein M, Altman RB. Finding an average core structure: application to the globins. Proceedings of Intelligent Systems for Molecular Biology 1994; 2: 19-27.
26. Evans M, Hastings N, Peacock B. Statistical Distributions. 3rd ed. Wiley, New York, 2000.


IDENTIFICATION OF α-HELICES FROM LOW RESOLUTION PROTEIN DENSITY MAPS

A. Dal Palù
Dip. Matematica, Università di Parma
dalpalu@unipr.it

E. Pontelli*, J. He, Y. Lu
Dept. Computer Science, New Mexico State University
{epontell, jinghe, ylu}@cs.nmsu.edu

*Corresponding author.

This paper presents a novel methodology to analyze low resolution (e.g., 6Å to 10Å) protein density maps, such as those that can be obtained through electron cryomicroscopy. At such resolutions it is often not possible to recognize the backbone chain of the protein, but it is possible to identify individual structural elements (e.g., α-helices and β-sheets). In this paper, the focus is on the reliable identification of α-helices. The methodology relies on a novel representation of α-helices, where helices are modeled as general cylinder-like shapes, defined by a central axial line (i.e., a spline); it performs gradient analysis to recognize volumes in the density map and to classify them. The methodology has been implemented in a tool, called Helix Tracer, and successfully tested with simulated structures, modeled from the Protein Data Bank at 10Å resolution. The results of the study have been compared with the only other known tool with similar capabilities (Helixhunter), denoting significant improvements in recognition and precision.

1. INTRODUCTION

Three-dimensional (3D) protein structure information is essential in understanding the mechanisms of biological processes and of protein function, and this information has become more and more important in rational drug design. A protein can be thought of as a chain of beads that adopts a certain conformation in the 3D space (the native conformation); the building blocks of the chain are 20 kinds of amino acids. Both experimental techniques and prediction techniques have been devised to generate 3D protein structures. The most commonly used experimental techniques for protein structure determination are X-ray crystallography and Nuclear Magnetic Resonance (NMR); both can determine structures at atomic resolution (usually better than 3Å), and X-ray crystallography has produced more than 80% of the known protein structures currently present in the Protein Data Bank (PDB). Although these two techniques are successful in targeting soluble proteins, they are seriously limited for non-soluble proteins, such as membrane-bound proteins: X-ray crystallography is limited by the availability of suitable crystals of the protein, and large protein complexes cannot easily produce crystals.

Electron cryomicroscopy is an experimental technique that has the potential to allow structure determination for large protein complexes 15, 12. Using the cutting edge techniques in this field, 3D structures of large complexes, such as the Herpes virus, have been successfully generated at 8.5Å resolution 15. Although it is not possible to determine the backbone chain of the protein at the resolution range of 6Å to 10Å — current methods to solve protein structures require a density map of much higher resolution, such as 3Å or 4Å 14 — this resolution range allows the visualization of various secondary structure elements, such as α-helices and β-sheets 15, 6.

In this paper, we present a new methodology to aid in the identification of α-helices in a low resolution density map, where helices are modeled as general cylinder-like shapes, defined by a central axial line (i.e., a spline).

The spline is a continuous line (possibly not straight), controlled by a finite number of control points, and the helix itself is defined as the set of points whose minimal distance from the spline is at most 2.5Å. This feature allows the model to better fit real helices: helices in nature are often not straight, and most helices have a certain degree of curvature, particularly long helices, thus making a perfect cylinder not an accurate template.

Since α-helices and β-sheets are the major components of a structure, these secondary structure components help in producing important geometrical constraints on the tertiary structure. Such constraints can be employed to effectively guide a protein prediction method, as well as to reduce the search space in the context of molecular dynamics applications; for example, the knowledge of this information helps in discovering the fold of a protein. In 4, we present an effective method, based on constraint satisfaction, to combine information about α-helices obtained from Helix Tracer with results from helix prediction (obtained from PHD 13), with the aim of determining the most likely mappings of the α-helices on the primary sequence, significantly improving precision and efficiency of the prediction 4.

To the best of our knowledge, only one other approach has been proposed to identify helices in low resolution density maps: in various previous proposals, such as Helixhunter 7, each α-helix is described as a straight cylinder with a 2.5Å radius (a 5Å diameter). The cylinder is characterized by three parameters (see Figure 1 on the left): (a) the starting point s = (sx, sy, sz), located on one extremity of the central axis of the cylinder; (b) the axis orientation vector d = (dx, dy, dz); and (c) the length of the axis l. Helixhunter identifies helices by searching cylindrically shaped areas in the density map using the second moment tensor. However, a straight cylinder can be inadequate for specific regions of the density map, since actual helices in nature are not straight, but tend to bend and curve to a certain degree.

Following this concern, we introduce a more general representation: we describe the central axial line of the α-helix in terms of a quadratic spline 1, described by a set of control points S1, ..., Sn, where Si = (six, siy, siz) — see Figure 1 on the right. The central axial spline is generated using a standard spline function, based on the identified control points, and thus provides smaller errors. Preliminary experiments show very promising results: Helix Tracer is on average capable of identifying over 75% of the helices, with very low RMSD errors, and with greater accuracy than related systems (the Helixhunter system 7).

[Figure 1. Left: the cylinder representation of a helix (starting point s, orientation vector d, axis length l). Right: the spline-based representation with control points S1, ..., Sn.]

2. METHOD

The input to our analysis algorithm is a density map, encoded as a 3-dimensional array. Each element, called a voxel, corresponds to a cubic volume, and each voxel is associated to the mean electron density of the protein in that volume. For the sake of simplicity, the density is normalized w.r.t. a maximal density in the map.

2.1. Overall approach

The algorithm for helix extraction is based on processing the discrete density map. Our analysis method relies on the observation that, at the resolution range of 6Å-10Å 15, the density distribution of a helix resembles a cylinder: the cylindrical area presents the local maximum density value roughly on the central axis of the corresponding α-helix, and the density gradually decreases as the distance from the central axis increases.

The essential idea used in the helix detection process is to segment the density map into volumes satisfying certain properties, based on the notion of gradient segmentation. The strength of this segmentation method is its threshold independence, which allows the segmentation of volumes present in the density map without the drawback of using generic thresholds. The segmentation we propose is general, and can be potentially used for the extraction of other features, e.g., β-sheets and coils. The actual identification of the helices makes use of a new type of analysis of the density maps; the outcome of this analysis is a description of the identified helices.
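As an illustration of the spline-based helix model, the following sketch (our own, not the authors' code; the paper does not specify the spline basis beyond "quadratic", so quadratic Bézier arcs over consecutive control-point triples are assumed here) discretizes the axial line and tests membership in the 2.5Å-radius helix volume:

```python
import math

def quad_bezier(p0, p1, p2, t):
    """Point at parameter t in [0, 1] on a quadratic Bezier arc."""
    u = 1.0 - t
    return tuple(u * u * a + 2.0 * u * t * b + t * t * c
                 for a, b, c in zip(p0, p1, p2))

def axis_points(control, samples_per_arc=20):
    """Discretize the axial spline: one quadratic arc per consecutive
    triple of control points (a simplification of the paper's spline)."""
    pts = []
    for i in range(0, len(control) - 2, 2):
        for s in range(samples_per_arc + 1):
            pts.append(quad_bezier(control[i], control[i + 1],
                                   control[i + 2], s / samples_per_arc))
    return pts

def dist_to_axis(p, axis_pts):
    """Minimal distance from a point to the sampled axial line."""
    return min(math.dist(p, q) for q in axis_pts)

def in_helix(p, axis_pts, radius=2.5):
    """The helix is the set of points within 2.5 A of the axial spline."""
    return dist_to_axis(p, axis_pts) <= radius
```

With collinear control points the arc degenerates to a straight axis, recovering the cylinder model as a special case; curved control points yield the bent helices the representation is designed for.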

The complete process is articulated in the following phases, described in the next subsections: (i) gradients calculation; (ii) graph construction and processing; (iii) detection of helices.

2.2. Gradients determination

The density map is processed in order to build the map of gradients. Considering the density map as the discretization of a continuous function from R³ to density values, the gradient corresponds to the first derivative of such density function. The gradient is a vectorial information, expressed in terms of a 3D direction and an intensity; intuitively, it shows the direction that points to the locally maximal increase of density.

For processing purposes, the gradient associated to each voxel is approximated using a discrete and local operator over the original density map: Sobel-like convolution masks (3 × 3 × 3) 5, shown in Figure 2. A 3D convolution is performed using the three masks: each mask is overlapped on the density map, and the summation of a point-by-point product is performed in order to collect the intensity of the gradient component for each dimension; the addition of the three resulting vectors generates the gradient associated to the voxel. For example, the component of the gradient along the X axis can be calculated by using the three matrices in the first row of Figure 2. The gradient information is computed for each voxel.

Helix models

The intuition behind the segmentation process is that each local maximal density voxel can be related to the presence of a packed set of atoms. This situation arises when amino acids are arranged into specific patterns that provide a high local density contribution. In particular, helices are arranged so that the side chains of the amino acids involved show an average increase of local density w.r.t. a normal coil, due to the helical packing of the backbone. At low resolution, this is characterized by a clear increase of local density that reflects the helical three dimensional shape of the helix. Thus, the problem boils down to the one of recognizing such clusters of locally higher density.

Low resolution density maps witness the presence of a helix as a dense cylinder-like shape, where the density increases gradually towards its axis. When close to the axis (e.g., closer than 5Å), the gradient points towards such axis; this means that the high density voxels of the trees identified on the gradient paths can be employed to characterize the location of the helix axis.

Using the gradient direction as a pointer, we can follow these directions from voxel to voxel, until a maximal density voxel is found. The paths generated touch every voxel, and can be partitioned according to the ending points reached; paths that share the same ending point form a tree structure. Every maximal density voxel v is the representative of a volume, defined as the set of voxels that can be reached from v without increasing the density along the path followed. Observe that we use gradient trees to segment volumes — by collecting in a single volume all the nodes whose gradient paths lead to the same maximal density voxel — and this process generates in output the segmentation we require for helix detection. The method involves gradient analysis, and it is substantially different from simple thresholding of density values (as used in previous proposals 7, 8). Each volume is a maximal set of voxels and it contains, in general, small parts of individual helices; thus, the problem boils down to the one of correctly merging some of these volumes in order to reconstruct the identified helices, and further analysis is required to study the properties of the volumes (and the relationships between their maximal density voxels) to determine whether different volumes actually belong to the same helix.
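A gradient computation of this kind can be sketched as follows (our illustration; the exact mask coefficients and sign conventions of the paper's Figure 2 may differ — here the vector points towards increasing density). The 3 × 3 × 3 masks are separable: a derivative stencil along the differentiated axis, combined with smoothing weights along the two transverse axes.

```python
SMOOTH = (1, 2, 1)   # smoothing weights across the two transverse axes
DERIV = (-1, 0, 1)   # derivative weights along the differentiated axis

def sobel_gradient(density, x, y, z):
    """Gradient at interior voxel (x, y, z), approximated with separable
    3x3x3 Sobel-like masks applied to the density map."""
    g = [0.0, 0.0, 0.0]
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                v = density[x + dx][y + dy][z + dz]
                g[0] += DERIV[dx + 1] * SMOOTH[dy + 1] * SMOOTH[dz + 1] * v
                g[1] += SMOOTH[dx + 1] * DERIV[dy + 1] * SMOOTH[dz + 1] * v
                g[2] += SMOOTH[dx + 1] * SMOOTH[dy + 1] * DERIV[dz + 1] * v
    return g
```

On a density that grows linearly along one axis, the resulting vector points along that axis and the transverse components cancel, which is the property the segmentation relies on.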

2.3. Construction of the graph

The next step of the algorithm involves the construction of a graph describing the structure of the density map. Nodes represent voxels of interest (as described below), while edges connect voxels that are "adjacent" in the density map. Let us consider two voxels v1 = (x1, y1, z1) and v2 = (x2, y2, z2). The voxels are considered to be neighbors if and only if the following relationship is satisfied: max{|x1 − x2|, |y1 − y2|, |z1 − z2|} ≤ 1. In other words, two voxels are neighbors if they differ by at most one unit in each coordinate — which leads to 26 possible neighbors per voxel.

The process starts with a coarse thresholding (cropping) of the density map: we retain only the voxels with a density value greater than 0.5 (50% of the maximum value of the map). The purpose of this step is to discard grossly irrelevant voxels, to improve the efficiency of the successive analysis steps. This choice arises from the practical observation that the voxels of interest have an average density larger than 0.65, so the cropping does not incur any relevant loss of information.

After this coarse thresholding, the directed graph G = (N, E) is used to summarize the gradient properties, where N is the set of nodes, formed by the remaining voxels, and E ⊆ N × N is the set of edges. Edges are introduced only if the two nodes involved are neighbors. For each node n1, we add a directed edge (n1, n2) that starts from n1 and points to the neighbor n2 ∈ N of n1 (the arrowed lines in Figure 4) if the following two conditions are satisfied:

• the density at n2 is higher than that at n1;
• the directed edge is the best approximation of the gradient direction (the non-arrowed lines in Figure 4).

In Fig. 3(a) we show a slice of a density map; Fig. 3(b) indicates the corresponding z-projection of the gradient for each point, and Fig. 3(c) is the overlay of Fig. 3(a) and Fig. 3(b). Observe how the gradient lines are "pointing" towards the denser regions of the density map (shown in darker color).

The resulting graph is a directed acyclic graph, and each node has at most one incoming edge (Figure 4(c)). The graph is actually a forest of trees, since it is not necessarily connected; the key property is that every path in each tree represents a decreasing density sequence of neighboring voxels. The root of a tree is the representative of the associated volume, and it is the voxel with the highest density in the volume — e.g., the double-circled node in Figure 4(c). As a last step in the construction, the direction of each edge in the graph is inverted.

The trees recognized in this graph construction provide a segmentation of the density map into distinct volumes: each volume contains the voxels that belong to the same tree. After the trees have been generated, small and spatially close trees are merged to simplify the forest, in order to reduce volume fragmentation and to facilitate the detection of volume borders: when the distance between the roots of two trees is less than 3.8Å, the two trees are merged. This is important because the typical distance between consecutive amino acids is 3.8Å; in the cases we select, the roots are close enough to ensure that the merged volumes describe two areas that are consecutive according to the direction defined by the backbone. Note that in this version we did not apply any image preprocessing, e.g., smoothing and/or filtering. Due to the low resolution scale (6-12Å), some noise could be present and split the gradient segmentation into small volumes; we can correct this problem by manually merging volumes that are close to each other, and in the future we plan to introduce a more robust image preprocessing (e.g., a smoothing phase) to cope with this problem.

The last operation in the graph processing phase is to mark the border of each volume induced by the trees. A voxel is on the border if at least one of its neighbors is not in the same volume; voxels belonging to a volume border are properly marked using a flag. The border voxels are used later for merging volumes that are determined to belong to the same helix.
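The tree-based segmentation induced by the gradients can be sketched as a steepest-ascent labeling (a simplified illustration of ours: it moves directly to the densest 26-neighbor instead of building the explicit edge-inverted graph, which yields the same voxel-to-root assignment on clean data):

```python
import itertools

def neighbors26(v, shape):
    """The up-to-26 voxels differing by at most one unit per coordinate."""
    for d in itertools.product((-1, 0, 1), repeat=3):
        if d != (0, 0, 0):
            n = tuple(v[i] + d[i] for i in range(3))
            if all(0 <= n[i] < shape[i] for i in range(3)):
                yield n

def ascend(v, density, shape):
    """Follow the discrete gradient (densest neighbor) up to the attracting
    maximal-density voxel, i.e. the root of the tree containing v."""
    while True:
        best = max(neighbors26(v, shape),
                   key=lambda n: density[n[0]][n[1]][n[2]])
        if density[best[0]][best[1]][best[2]] <= density[v[0]][v[1]][v[2]]:
            return v
        v = best

def segment(density):
    """Label every voxel with the root of its gradient tree."""
    shape = (len(density), len(density[0]), len(density[0][0]))
    return {v: ascend(v, density, shape)
            for v in itertools.product(*(range(s) for s in shape))}
```

Voxels whose ascent paths share the same ending point receive the same label, producing one volume per local density maximum.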

[Figure 2. Sobel's convolution masks: the three 3 × 3 × 3 Sobel-like masks for the x, y and z components of the gradient.]

[Figure 3. (a) A slice of a density map; (b) the corresponding gradients; (c) overlay of the density map and gradients.]

[Figure 4. Obtaining the trees from the gradients.]

2.4. Detection of helices

The final phase in our computation is to define the location of the α-helices. This phase involves two steps. The first step is to detect and merge the volumes that belong to the same helix: it has been observed that a helix often contains one or more neighboring volumes. The second step is to construct a description of each identified helix, by defining the location of the control points that constitute the central axial line.

The border voxels of each volume are scanned for the satisfaction of two rules. One is to see if there is a neighboring border voxel that belongs to a different volume; the other is to see if the border voxel has a high density, greater than a threshold helixTHR — a typical threshold is 85% of the maximum density value of the map. Volumes that satisfy the above two rules are merged to generate the volume of a helix.
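The two merging rules can be sketched as follows (our illustration, with invented names: `labels` maps voxel coordinates to volume identifiers, and for simplicity every labeled voxel is inspected rather than only the flagged border voxels):

```python
import itertools

def helix_merge_pairs(labels, density, helix_thr):
    """Volumes to merge: two neighboring voxels carrying different volume
    labels, both with density above helixTHR (the two merging rules)."""
    pairs = set()
    for v, lab in labels.items():
        if density[v[0]][v[1]][v[2]] < helix_thr:
            continue
        for d in itertools.product((-1, 0, 1), repeat=3):
            n = (v[0] + d[0], v[1] + d[1], v[2] + d[2])
            if n != v and n in labels and labels[n] != lab \
                    and density[n[0]][n[1]][n[2]] >= helix_thr:
                pairs.add(frozenset((lab, labels[n])))
    return pairs
```

Raising the threshold suppresses merges across low-density contacts, which is what prevents separate helices from being erroneously joined.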

The rationale of this merging process is the following. Recall that the density map for a helix decreases quickly when moving orthogonally away from the axis, and that each gradient associated to a voxel that belongs to a helix points towards the axis of the helix. If two volumes belong to the same helix, the gradients on the border will point into the volumes; in practice, this implies that the contact points between the two volumes lie on a plane which is orthogonal to the helix axis, and that the high density region on the border between two volumes is very close to the axis of the α-helix. Due to the discretization of the density in voxels, the contact surface may not be a regular plane, and it may contain some irregularities; nevertheless, this does not constitute a problem — for example, if the contact surface is not orthogonal to the axis, one volume would simply be a subset of the other. Moreover, we have experimentally observed that only helices show a high density (above the helixTHR threshold) close to their axis, thus we can safely merge two volumes that present this characteristic. This guarantees that separate helices are not erroneously merged, that a helix is not incorrectly broken into separate structures, and that the analysis does not take us to volumes that do not belong to the same helix. Note that this thresholding is applied after the segmentation phase, is performed on volumes, and is thus used simply as a selection tool for the relevant volumes.

The second step constructs the description of each identified helix. The identification of the control points relies on the fact that the central axial line of the helix is often located at the high density voxels of the volume. To define the search space of the control points, a subset of the helix volume, called the trace volume, is generated using a threshold (helixselectTHR), set by default to 2% less than helixTHR; this choice is suggested by the practical observation that helixTHR is too strict when used for the construction of the axis.

The central axial line is generated using a greedy algorithm. We estimate the initial search direction by means of a least square fitting of the trace volume. The idea is to start from the highest density voxel close to the barycentre of the trace volume (in the neighborhood of 3Å), and to launch two searches along the initial search direction, returning two half axes, one for each side analyzed. From the starting voxel, we move along the axis towards the ends of the helix, while building a path that contains the maximal values detected: during this exploration, the traversal moves to a neighbor that contains the locally highest density available, in order to reduce scattered jumps between neighbors. The union of the two paths gives the set of control points associated to the axis. Finally, the control points are smoothed with a single pass of Gaussian smoothing (σ = 8), and the smoothed, real-coordinate points are used as the actual control points for the second-order spline that is consequently generated. At the end of the process, a validation step is launched, in order to discard the helices that show an extreme and unnatural curvature.

3. EXPERIMENTAL RESULTS

3.1. Helix Tracer results

In order to test Helix Tracer, we generated density maps for 29 proteins with known structures in the PDB, using the program pdb2mrc from the EMAN suite 11; the density maps have been generated at 10Å resolution. The proteins have been randomly chosen, and they offer a good representation of proteins with varying structural configurations: in particular, we include representatives from the major SCOP families 10 — all α, α/β, α+β, and proteins with separate α and β domains (e.g., 1A06, 1CC5, 1B0M, 1BVP). All experiments have been conducted using Linux workstations (a Pentium 4, 3.39GHz, and an AMD 64-bit, 2.1GHz).

For example, the density map of protein 1BM1, at 10Å resolution, is shown in Figure 5; the helices identified by Helix Tracer are shown as sticks that are overlayed on the density map and on the backbone of the protein. Notice that the helices identified are not straight. Table 2 provides the following information: the PDB Id of each protein (ID), the threshold used to detect volumes belonging to the same helix, the execution time in seconds (Time), and the number of helices present in the PDB annotation (PDB Helices).
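The least-squares estimate of the initial search direction can be sketched as a principal-axis computation (our illustration, not the authors' code): the fitting direction of the trace volume is the dominant eigenvector of the covariance matrix of its voxel coordinates, obtained here by power iteration.

```python
import math

def principal_axis(points, iters=100):
    """Direction of the least-squares fitting line of a point cloud:
    dominant eigenvector of the covariance matrix, by power iteration."""
    n = len(points)
    c = [sum(p[i] for p in points) / n for i in range(3)]        # centroid
    q = [[p[i] - c[i] for i in range(3)] for p in points]        # centered
    cov = [[sum(r[i] * r[j] for r in q) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

The two half-axis searches would then proceed from the barycentre along +v and −v respectively.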

Table 2 also reports the number of helices recognized by Helix Tracer (Identified), the number of identified helices that are correct (Correct), the number of false positives (False), and the number of helices of adequate length present in the PDB annotation and missed by Helix Tracer (Missed). The last two columns report the number of Cα present in the helices of length greater than 8.1Å — the minimal length of helices detected by default by Helix Tracer — and the corresponding number of Cα correctly identified by Helix Tracer.

[Figure 5. Helices identified for 1BM1 (PDB Id).]

A helix is correctly identified by Helix Tracer if it can be paired with a PDB helix in the protein; in particular, there should be at least one Cα on the PDB helix that is within a 4Å distance from the central axial line of the identified helix. Although the pairing process is simple, Helix Tracer recognizes, on average, 80% of the helices longer than 8.1Å in the PDB annotation, and often the ratio of helices correctly identified is larger than 88% (shown in Table 2 and in Figure 6 on the left for helices longer than 5 amino acids). Observe also that, despite the lack of optimization in the current implementation, the execution times are very reasonable (Table 2, column Time).

The accuracy of the identified helices can also be seen from the number of Cα that are recognized by Helix Tracer. A Cα is a correctly recognized atom if it can be projected internally on the identified helix axis. Since we use splines as axis representation (see Figure 1), a method to project a Cα on the axial spline needs to be developed: we subdivide the continuous spline into a set of 40 contiguous segments — a sufficient number to approximate the spline — and approximate the problem of computing the projection on a continuous spline with the problem of finding the smallest projection vector over this set of segments. The lengths of these segments are not necessarily identical, as they depend on the spatial distribution of the control points. Cα atoms that cannot be projected on the axis — i.e., atoms that could be projected only on the prolongation of the axis outside the helix — are not counted as correctly identified. Overall, Helix Tracer recognized 75% of the total Cα atoms that are on the PDB helices.

In order to further evaluate the accuracy, the RMSD (Root Mean Square Deviation) is calculated for the correctly identified Cα atoms. The theoretical distance between a Cα atom and the central axial line of the helix is 2.5Å; the RMSD we calculated is the deviation from 2.5Å of the distance between the correctly identified Cα atoms and the central axial line. RMSD values for the selected proteins are plotted in Figure 6 (on the right).

Finally, let us underline that the quality of the results is dependent on the resolution of the density maps employed: we tested the program on density maps at 8Å, 10Å, and 12Å resolutions.

[Table 1. Helix Tracer at different resolutions (8Å, 10Å, 12Å) for the proteins 1BVP, 1Q16, and 1TCD: percentage of correctly identified helices (Correct), number of missed helices (Missed), and percentage of correctly identified Cα atoms.]
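The projection test and the RMSD measure can be sketched as follows (our illustration, with invented names; the axis is given as the contiguous segments approximating the spline):

```python
import math

def project_on_segment(p, a, b):
    """Parameter t of the orthogonal projection of p on segment a-b and the
    distance to the projection; t in [0, 1] iff the foot is internal."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    t = sum(u * w for u, w in zip(ap, ab)) / sum(u * u for u in ab)
    foot = [a[i] + t * ab[i] for i in range(3)]
    return t, math.dist(p, foot)

def ca_deviation(ca, segments):
    """Smallest distance from a Ca to any segment on which it projects
    internally; None if it only projects on the prolongation of the axis."""
    best = None
    for a, b in segments:
        t, d = project_on_segment(ca, a, b)
        if 0.0 <= t <= 1.0 and (best is None or d < best):
            best = d
    return best

def rmsd_from_ideal(distances, ideal=2.5):
    """RMSD of the Ca-to-axis distances from the theoretical 2.5 A."""
    return math.sqrt(sum((d - ideal) ** 2 for d in distances)
                     / len(distances))
```

A Cα returning None is excluded from the count of correctly identified atoms, exactly as atoms projecting outside the helix are excluded in the evaluation above.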

Comparison with Helixhunter

In this subsection we report a comparison between Helix Tracer and Helixhunter 7. We employ release 1.7 of the Helixhunter software, and the comparison is performed on the same set of 29 proteins used in the previous subsection. The density threshold used in Helixhunter is 0.85, which is the same (helixTHR) used in Helix Tracer; although we also tested different density thresholds for Helixhunter, 0.85 appears to be the threshold that generates the best overall results for these 29 proteins.

The comparison starts with the evaluation of the following two parameters (for definitions, see the previous subsection):

• RMSD: we compare the RMSD values for those Cα atoms that have been correctly identified by both systems.
• Cα: we compare the number of Cα atoms that have been correctly identified by either system.

Note that the method adopted in Helix Tracer always performs better than Helixhunter in terms of both the number of correctly identified helices and the RMSD. Helix Tracer correctly identified 75.0% of the total Cα atoms on the helices longer than 8.1Å in the PDB annotation, while Helixhunter identified 58.4% of such Cα atoms; for the protein 1CC5 we reach a peak of improvement of 64% in Cα recognition. For the correctly identified Cα atoms, the RMSD from Helix Tracer is on average 0.086Å lower than that of Helixhunter. These results are depicted in the two diagrams of Figure 6. Figure 7 compares the performance of the two systems in terms of the number of helices that are longer than 5 amino acids: the diagram on the left compares the number of correctly identified helices (relating them to the number of helices in the PDB annotation), while the diagram on the right shows the number of helices present in the PDB annotation and not recognized by either of the systems. We can observe that Helix Tracer provides significantly better results (up to 37% more helices correctly identified) and it never performs worse than Helixhunter. We also performed experiments at lower resolutions using the proteins 1BVP, 1Q16, and 1TCD; as expected, the accuracy of Helix Tracer degrades as the quality of the density map degrades.

4. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a novel methodology to analyze low resolution density maps (e.g., 6Å to 10Å) of proteins. This is the resolution level that can be obtained for large protein complexes using techniques such as electron cryomicroscopy. At this level of resolution, it is often impossible to recognize the actual backbone directly from the protein density map, but the resolution is sufficient to visually recognize structural features, such as α-helices and β-sheets. In particular, we focused on the problem of recognizing α-helices.

The proposed methodology makes use of gradient information, extracted from the density map, to perform volume segmentation and to guide the analysis of volumes towards the identification of secondary structure components. This gradient-based technique is a general approach, and can be effectively used to recognize other secondary structure traits of the protein. The resulting technique has been implemented in the Helix Tracer system, and we performed a test using 29 proteins, with very positive results. Helix Tracer describes the central axial line of each helix with a spline, approximated from a sufficient number of points; this provides a more flexible representation of helices than a straight cylinder, leading to a more accurate identification.

The outcome of the analysis performed by Helix Tracer can be applied to aid in the reconstruction of a tentative atomic model of the entire protein complex, and to aid in threading the protein sequence in the 3D structure. For example, the information about α-helices can be employed as constraints to guide and/or filter ab-initio secondary structure prediction. In this direction, we have proposed a framework to map the secondary structures identified from the density map to their locations on the primary sequence of the protein 4. In particular, the framework computes mappings that best satisfy both the constraints obtained from the density map and the results obtained using secondary structure prediction tools (e.g., PHD). The framework relies on encoding all the components of the problem as a constraint satisfaction problem 9 and makes use of Constraint Logic Programming techniques to solve it.

Future work will include the development of heuristics to further improve the quality of α-helix identification and to reduce the number of false positives. Work is also in progress to extend this methodology to a more comprehensive analysis, which will include the recognition of coils and β-sheets. Finally, work is in progress to apply the proposed technique to real data obtained from electron cryomicroscopy.

[Table 2. Helix Tracer vs. Helixhunter: number of amino acids identified and RMSD. Per-protein results for the 29 test proteins (1A06, 1AGW, 1AXG, 1B0M, 1BM1, 1BRX, 1BVP, 1CC5, 1CI1, 1DZE, 1FIY, 1GIH, 1JQN, 1KE8, 1KGB, 1M52, 1NVS, 1POC, 1P14, 1P8I, 1Q16, 1QHD, 1TCD, 1TPB, 2BRD, 2YPI, 3PRK, 7TIM, 8TIM): execution time, helices longer than 8.1Å in the PDB annotation, helices correctly identified, false positives, missed helices, and Cα counts (PDB vs. Helix Tracer) and helices identified by each system.]

[Fig. 6. Comparison of Helix Cα recognition, and RMSD comparison (Cα correctly identified): Helixhunter vs. Helix Tracer vs. PDB data, per protein.]

[Fig. 7. Correctly identified helices and missed helices: Helix Tracer, Helixhunter, and PDB helices, per protein.]

Acknowledgments

The research has been partially supported by NSF grants CNS-0454066, HRD-0420407, and CNS-0220590.

References

1. R.H. Bartels, J.C. Beatty, and B.A. Barsky. An Introduction to Splines for Use in Computer Graphics and Geometric Modeling. Morgan Kaufmann Publishers, Los Altos, 1987.
2. B. Bottcher, S.A. Wynne, and R.A. Crowther. Determination of the Fold of the Core Protein of Hepatitis B Virus by Electron Cryomicroscopy. Nature, 386:88-91, 1997.
3. W. Chiu et al. Deriving Folds of Macromolecular Complexes through Electron Cryomicroscopy and Bioinformatics Approaches. Curr. Opin. Struct. Biol., 12:263-269, 2002.
4. A. Dal Palu, E. Pontelli, J. He, and Y. Lu. A Constraint Logic Programming Approach to 3D Structure Determination of Large Protein Complexes. In ACM Symposium on Applied Computing. ACM Press, 2006.
5. R.C. Gonzalez and R.E. Woods. Digital Image Processing. Addison Wesley, 1992.
6. J. Greer. Automated Interpretation of Electron Density Maps of Proteins: Derivation of Atomic Coordinates for the Main Chain. J. Mol. Biol., 100:427-458.
7. W. Jiang, M.L. Baker, S.J. Ludtke, and W. Chiu. Bridging the Information Gap: Computational Tools for Intermediate Resolution Structure Interpretation. J. Mol. Biol., 308, 2001.
8. Y. Kong and J. Ma. A Structural-informatics Approach for Mining Beta-sheets: Locating Sheets in Intermediate Resolution Density Maps. J. Mol. Biol., 332(2):399-413, 2003.
9. V. Kumar. Algorithms for CSP: a Survey. AI Magazine, 32-44, 1992.
10. L. Lo Conte, B. Ailey, T.J. Hubbard, S.E. Brenner, A.G. Murzin, and C. Chothia. SCOP: a Structural Classification of Proteins Database. Nucl. Acids Res., 28:257-259, 2000.
11. S.J. Ludtke, P.R. Baldwin, and W. Chiu. EMAN: Semi-automated Software for High Resolution Single Particle Reconstructions. J. Struct. Biol., 128:82-97, 1999.
12. E.J. Mancini, M. Clarke, B.E. Gowen, T. Rutten, and S.D. Fuller. Cryo-electron Microscopy Reveals the Functional Organization of an Enveloped Virus. Mol. Cell, 5:255-266, 2000.
13. B. Rost. Protein Secondary Structure Prediction Continues to Rise. J. Struct. Biol., 134:204-218, 2001.
14. Wang. ConfMatch: Automating Electron-density Map Interpretation by Matching Conformations. Acta Crystallogr. D Biol. Crystallogr., 56:1591-1611, 2000.
15. Z.H. Zhou, M. Dougherty, J. Jakana, J. He, F.J. Rixon, and W. Chiu. Seeing the Herpesvirus Capsid at 8.5Å. Science, 288:877-880, 2000.

EFFICIENT ANNOTATION OF NON-CODING RNA STRUCTURES INCLUDING PSEUDOKNOTS VIA AUTOMATED FILTERS

Chunmei Liu* and Yinglei Song
Department of Computer Science, University of Georgia, Athens, GA 30602, USA
Email: {chunmei, song}@cs.uga.edu

Liming Cai*
Department of Computer Science, University of Georgia, Athens, GA 30602, USA
Email: cai@cs.uga.edu

Russell L. Malmberg
Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
Email: russell@plantbio.uga.edu

Ping Hu
Department of Genetics, University of Georgia, Athens, GA 30602, USA
Email: huping@uga.edu

Computational search of genomes for RNA secondary structure is an important approach to the annotation of non-coding RNAs. The bottleneck of the search is sequence-structure alignment, which is often computationally intensive; for pseudoknots, which contain at least one pair of crossing stems, the sequence-structure alignment is computationally intractable. A plausible solution is to devise effective filters that can efficiently remove segments unlikely to contain the desired structure patterns in the genome, and to apply the search only on the remaining portions. In this paper, based upon the authors' earlier work on a tree-decomposable graph model for RNA pseudoknots, a new effective filtration scheme is introduced to filter RNA pseudoknots. Since filters can be substructures of the RNA to be searched, the strategy used to select which substructures to use as filters is critical to the overall search speed up; the new scheme can automatically derive a set of filters with the overall optimal filtration ratio. Search experiments on both synthetic and biological genomes showed that, with this filtration approach, RNA structure search can be sped up 11 to 60 fold while maintaining the same search sensitivity and specificity as without the filtration. In some cases, the filtration even improves the specificity that is already high.

*To whom correspondence should be addressed.

1. INTRODUCTION

Non-coding RNAs (ncRNAs) do not encode proteins, yet they play fundamental roles in many biological processes including chromosome replication, RNA modification, and gene regulation 7, 19, 21, 28. Due to the explosive growth of fully sequenced genome data, homologous searching using computational methods has recently become an important approach to annotating genomes and identifying new ncRNAs 16, 22. In general, a computational searching tool scans through a genome and aligns its sequence segments to an RNA profile. Since secondary structure
generally determines the biological functions of an ncRNA and is preserved across its homologs, a profile needs to include both sequence conservation and secondary structure information. Compared with profiling models based on Hidden Markov Models (HMMs) 14, covariance models (CMs) 6 contain additional emission states that can emit base pairs to generate stems; CMs can thus be used as structural profiles to model RNA families.
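The difference between pair emission and per-base emission can be illustrated with a toy stem scorer; the log-odds values below are purely illustrative and not from any real CM:

```python
from math import log

# Toy pair-emission table: a paired column rewards complementary (and
# wobble) base pairs jointly, which a per-base HMM profile cannot express.
PAIR_SCORE = {p: log(4.0) for p in ("AU", "UA", "CG", "GC", "GU", "UG")}
MISMATCH = log(0.1)  # assumed penalty for non-pairing columns

def stem_score(left, right):
    """Score a candidate stem: sum pair-emission log-odds over the paired
    columns, reading the right strand in reverse (3'->5')."""
    return sum(PAIR_SCORE.get(l + r, MISMATCH)
               for l, r in zip(left, reversed(right)))
```

A complementary candidate such as ("GC", "GC") scores positively, while a non-pairing one such as ("AA", "AA") scores negatively.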

RNA structure search in genomes or large databases thus remains difficult. Most of the existing searching tools 3, 13, 16 use CMs to profile the secondary structure of an ncRNA. While CM based searching tools can achieve high accuracy, the time complexity for optimally aligning a sequence segment to a CM profile is too high for a thorough search of a genome 13. Such an issue becomes more involved when the structure contains pseudoknots. A few models 4, 20, 23, 26 based on stochastic grammar systems have been proposed to profile pseudoknot structures; however, for all these models, the computation time and memory space costs needed for optimal structure-sequence alignment are O(N^5) and O(N^4) respectively. Heuristic approaches 3, 8, 15 can significantly improve the search efficiency for pseudoknots, but these approaches either cannot guarantee the search accuracy 8 or have the same drawback in computation efficiency as CM based approaches 3, 15.

A tree decomposable graph model has been introduced in our previous work 25. In this model, the secondary structure of an RNA is modeled as a conformational graph, while a queried sequence segment is modeled with an image graph with valued vertices and edges. The sequence-structure alignment can be determined by finding in the image graph the maximum valued subgraph that is isomorphic to the conformational graph. The tree width t of the RNA conformational graph is very small: t = 2 for pseudoknot-free RNAs, and t can only increase slightly for pseudoknots. Based on a tree decomposition of the conformational graph with tree width t, a sequence-structure alignment can be accomplished in time O(k^t N^2) 25, where k is a small parameter (practically k <= 7) and N is the size of the conformational graph. Experiments have shown that this approach is significantly faster than CM based searching approaches while achieving an accuracy comparable with that of CM.

Search on genomes can be further sped up with filtration methods. A few filtration methods 16, 27 have been developed to improve the search efficiency. For example, in tRNAscan-SE 16, two efficient tRNA detection algorithms are used as filters to preprocess a genome and remove most parts that are unlikely to contain the searched tRNA structure; the remaining part of the genome is then scanned with a CM to identify the tRNA. FastR 2 considers the structural units of an RNA structure; it evaluates the specificity of each structural unit and constructs filters based on the specificities of these structural units. In 27, an algorithm is developed to safely "break" the base pairs in an RNA structure and automatically select filters from the resulting HMM models. With simpler sequence or structural models, it is possible to efficiently remove genome segments unlikely to contain the desired pattern, and these approaches have significantly improved the computational efficiency of genome annotation. However, none of them has yet been applied to search for structures that contain pseudoknots. Filters, like the structure to be searched, need to be profiled with appropriate models.

In this paper, we develop a novel approach of filtration based on the profiling model in our previous work. In a tree decomposition of the conformational graph, a subtree formed by tree nodes containing either of the two vertices that form a stem can be used as a filter; a filter can thus be constructed for each stem in the conformational graph. Based on the intersection relationship among the subtrees of filters, we are able to construct a filter graph: in this graph, each vertex represents a maximal subtree, and two vertices are connected with an edge if the corresponding subtrees intersect. We associate with every vertex in the filter graph a weight, which is the filtration ratio of the corresponding filter and can be measured based on randomly generated sequences. We then select the filters that correspond to the maximum weighted independent set in the filter graph. A filter graph is a chordal graph, and we thus are able to compute its maximum weighted independent set in time O(n^2), where n is its number of vertices; filters can thus be selected in time O(n^2).

We have implemented this filter selection algorithm and combined it with the original tree decomposition based searching program to improve its computational efficiency. To test its accuracy and computational efficiency, we used this combined search tool to search for RNA structures inserted into randomly generated sequences.
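The overall filter-then-scan strategy described above (a cheap filter discards most windows, and only survivors are passed to the expensive structural aligner) can be sketched as follows; the function names and thresholds are illustrative assumptions, not the paper's implementation:

```python
def two_stage_search(genome, cheap_score, full_score,
                     window, step, z_cutoff, hit_cutoff):
    """Illustrative filtration pipeline: score every window with a cheap
    filter, keep only windows whose filter Z-score exceeds z_cutoff, then
    run the expensive full structural alignment on the survivors only."""
    starts = range(0, len(genome) - window + 1, step)
    windows = [(i, genome[i:i + window]) for i in starts]
    scores = [cheap_score(w) for _, w in windows]
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5 or 1.0
    hits = []
    for (pos, w), s in zip(windows, scores):
        if (s - mean) / sd > z_cutoff:       # cheap filter passed
            full = full_score(w)             # expensive step, survivors only
            if full >= hit_cutoff:
                hits.append((pos, full))
    return hits
```

Only the windows that pass the cheap screening pay the cost of the full alignment, which is the source of the reported speed up.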

Our testing results showed that, compared with the original searching program, the filtration based search achieved 20 to 60 fold speed up for pseudoknot-free RNAs and 11 to 45 fold speed up for RNAs containing pseudoknots; in addition, for some tested structures, this approach is able to achieve an improvement in specificity from about 80% to 92%. We then used this combined searching tool to search a few biological genomes for ncRNAs, where it achieved 6 to 142 fold speed up for genome searches for pseudoknots; the combined program can accurately determine the locations of these ncRNAs with significantly reduced computational time.

2. ALGORITHMS AND MODELS

2.1. Tree Decomposable Graph Model

In our previous work 25, the consensus secondary structure of an RNA family was modeled as a topological relation among stems and loops. In practice, we can construct a consensus structure from the multiple structural alignment of a family of RNAs. The model consists of two components: a conformational graph that describes the relationship among all stems and loops, and a set of simple statistical profiles that model individual stems and loops. In particular, individual stems are profiled with the Covariance Model (CM) 6, and loops are profiled with HMMs 14.

In the conformational graph, each vertex defines one of the base pairing regions of a stem. The two vertices for the base pairing regions of a stem are connected with an undirected edge; the vertices for two base regions are connected with a directed edge (from 5' to 3') if the sequence part between them is a loop, and, in general, a directed edge connects the vertices for two non-overlapping base regions (5' to 3'). In addition, two additional vertices s (called source) and t (called sink) are included in the graph. The graph thus contains both directed and undirected edges. Figure 1(a) and (b) show the consensus structure of an RNA family and its conformational graph.

To align the structure model to a target sequence, we first preprocess the target sequence to identify all possible matches to each individual stem profile. All pairs of regions with a statistically significant alignment score, called the images of the stem, are identified. Then an image graph is constructed from the set of images for all stems in the structure; in the image graph, each vertex represents an image for one pairing region of a stem. To reduce the complexity of the graph, a parameter k is used to define the maximum number of images that a stem can map to. Figure 1(c) and (d) illustrate the mapping from stems to their images and the corresponding image graph. The optimal structure-sequence alignment between the structure model and the target sequence thus corresponds to finding in the image graph a maximum weighted subgraph that is isomorphic to the conformational graph, where the weight is defined by the alignment scores between vertices (stems) and edges (loops) in the conformational graph and their counterparts in the image graph. The subgraph isomorphism problem is NP-hard in general; however, since the conformational graph for an RNA secondary structure is tree decomposable, efficient isomorphism algorithms are possible.

[Fig. 1. (a) An RNA structure that contains both nested and parallel stems. (b) The corresponding conformational graph. (c) A secondary structure (top) and the mapped regions and images for its stems on the target sequence (bottom), with dashed lines specifying the possible mappings between stems and their images. (d) The image graph formed by the images of its stems on a target sequence.]

Tree decomposition is a technique rooted in the deep graph minor theorems 24. It provides a topological view on graphs, and the tree width of a graph measures how much the graph is "tree-like".

Definition 2.1 (24). Let G = (V, E) be a graph, where V is the set of vertices in G and E denotes the set of edges in G. A pair (T, X) is a tree decomposition of graph G if it satisfies the following conditions:

(1) T = (I, F) defines a tree, whose sets of vertices and edges are I and F respectively;
(2) X = {X_i | i ∈ I, X_i ⊆ V}, and for every u ∈ V there exists i ∈ I such that u ∈ X_i;
(3) for every (u, v) ∈ E, there exists i ∈ I such that u ∈ X_i and v ∈ X_i;
(4) for all i, j, k ∈ I, if k is on the path that connects i and j in tree T, then X_i ∩ X_j ⊆ X_k.

The tree width of the tree decomposition (T, X) is defined as max over i ∈ I of |X_i| - 1. The tree width of the graph G is the minimum tree width over all possible tree decompositions of G.

Figure 2 provides an example of a tree decomposition of a given graph.

[Fig. 2. (a) An example of a graph. (b) A tree decomposition for the graph in (a).]

We showed in our previous work 25 that, given a tree decomposition of the conformational graph with tree width t, the maximum weighted subgraph isomorphism can be efficiently found in time O(k^t N^2), where N is the length of the structure model and k is the maximum number of images that a stem can map to. Conformational graphs for RNA secondary structures have small tree width: the tree width is 2 for the graph of any pseudoknot-free RNA, and it can only increase slightly for all known pseudoknot structures 25. For instance, the conformational graph shown in Figure 5 for the sophisticated bacterial tmRNAs has tree width 5.

2.2. Automated Structure Filter

We observe that any subtree in a tree decomposition of a conformational graph induces a substructure and is thus a structure profile of smaller size; it can be used as a filter to preprocess a genome to be annotated. In particular, the left and right pairing regions of any stem s_i in an RNA structure have two corresponding vertices v_i^l and v_i^r in its conformational graph. In the tree decomposition of the conformational graph, these two vertices induce a maximal connected subtree T_i, in which every node contains either of the two vertices. We choose subtrees with this maximal property since each of them contains the maximum amount of structural information associated with the stem; this also ensures that, when the RNA structure contains a simple pseudoknot, the pseudoknot will be included in some filter. For an RNA secondary structure that contains n stems, we can thus obtain up to n such subtrees. For each filter, the vertices contained in the tree bags of its corresponding subtree induce a subgraph in the conformational graph; such an induced subgraph is its filter conformational graph.

However, the subtrees may intersect, and it is more desirable to select a set of disjoint subtrees to preprocess the genome. This way, we construct a filter graph as follows: each vertex in the filter graph represents a maximal subtree defined above, and two vertices are connected with an edge if the corresponding subtrees intersect. Figure 3 shows an example of the filter graph of a given RNA structure.

[Fig. 3. (a) The conformational graph for a secondary structure that includes a pseudoknot. (b) A tree decomposition for the graph in (a). (c) A filter graph for the secondary structure in (a). (d) Substructures of the filters.]

We associate with every vertex in the filter graph a weight: for a filter with filtration ratio f, we assign a weight of -ln f to its corresponding vertex. To achieve a minimum overall filtration ratio, we then need to find the maximum weighted independent set in the filter graph. We show in the following that this independent set can be found easily. According to 10, the filter graph constructed from a tree decomposition is actually a chordal graph, in which any cycle with length larger than 3 contains a chord. For any chordal graph, there exists a tree decomposition of the graph such that the vertices contained in every tree node induce a clique, and such a tree decomposition can be found in time O(|V|^2), where V is the vertex set of the chordal graph 9. Given such a tree decomposition, a simple dynamic programming algorithm can be developed to find the maximum weighted independent set.

Theorem 2.1. For a filter graph with n vertices, there exists an algorithm of time O(n^2) that can select a set of disjoint filters with the optimal overall filtration ratio.
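The selection objective above (disjoint filters maximizing the sum of -ln f, i.e., minimizing the product of filtration ratios) can be illustrated with a small brute-force sketch; the paper exploits the chordality of the filter graph to achieve O(n^2), while this exhaustive version is only meant to make the objective concrete:

```python
from itertools import combinations
from math import log

def best_filters(ratios, conflicts):
    """Pick a set of pairwise non-intersecting filters minimizing the
    product of their filtration ratios (maximizing the sum of -ln f).
    `ratios[i]` is the filtration ratio of filter i; `conflicts` lists
    pairs of filters whose subtrees intersect. Brute force, for
    illustration only."""
    n = len(ratios)
    conflict = set(map(frozenset, conflicts))
    best, best_w = (), 0.0
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            if any(frozenset(p) in conflict
                   for p in combinations(subset, 2)):
                continue  # two chosen subtrees intersect: not independent
            w = sum(-log(ratios[i]) for i in subset)
            if w > best_w:
                best, best_w = subset, w
    return set(best)
```

For example, a single very selective filter can beat several weaker disjoint ones, because the weights -ln f directly compare the products of ratios.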
Figure 3 shows an example for the filter graph of a given RNA structure. there exists an algorithm of time 0(n2) that can select a set of disjoint filters with the maximum filtration ratio. a simple dynamic programming algorithm can be developed to find the maximum weight independent set. a combination of their images is valid if the corresponding mapping satisfies the first two conditions for isomorphism (see section 2) and the partial optimal alignment score for a valid combination is the sum of the alignment scores of loops and stems induced by images of vertices that are only contained in the node. For an internal node Xi in the tree. the algorithm checks the first two conditions for isomorphism (section 2 in 2 5 ) and sets a to be in- . We show in the following that this independent set can be found easiiyAccording to 10 .I n / to its corresponding vertex. We associate every vertex in the filter graph a weight. we thus can obtain up to O(N) such subtrees. To achieve a minimum filtration ratio. To find such an isomorphism. we assign a weight of . In a bottom up fashion.103 tional graph. For a given combination e$ of images of vertices in Xi. the filter graph constructed from a tree decomposition is actually a chordal graph. The dynamic programming over the tree decomposition to find an optimal alignment is based on the maintenance of a dynamic programming table for each node in the tree. However. in which any cycle with length larger than 3 contains a chord. we introduce some additional techniques to solve the problem in our setting. For a filter with filtration ratio / . such an induced subgraph is its filter conformational graph. An alignment between a structural filter profile and a target sequence is essentially an isomorphism between its filter conformational graph H and some subgraph of the image graph G for the target sequence. Theorem 2. we need to find the maximum weighted independent set in the filter graph. 
The filtration ratio of a filter is defined as the percentage of nucleotides that pass the corresponding filtration process, and it is obtained as follows: we randomly generate a sequence of sufficient length, with the same base composition as that of the sequence to be searched, and compute the distribution of the scores of alignments between the filter profile and all the sequence segments in the generated sequence. The filtration ratio can then be computed based on a statistical cut-off value, and its value is generally small in practice.

2.3. Filter-Sequence Alignment

For a given filter F, an alignment between its structural profile and a target sequence is essentially an isomorphism between its filter conformational graph H and some subgraph of the image graph G for the target sequence. To find such an isomorphism, we adopt the general dynamic programming technique 1 over the tree decomposition of H. However, the general technique can only be directly applied to a subgraph isomorphism with a small fixed graph H and a graph G of small tree width 17; we therefore introduce some additional techniques to solve the problem in our setting. We present a summary and some details of the new optimal alignment algorithm in the following.

The dynamic programming over the tree decomposition is based on the maintenance of a dynamic programming table for each node in the tree. An entry in a table includes a possible combination of images of the vertices in the corresponding tree node, together with the validity and the partial optimal alignment score associated with that combination; the table thus contains a column allocated for each vertex in the node and two additional columns V and S that maintain validity and partial optimal alignment scores respectively.

Proceeding in a bottom-up fashion, the algorithm first fills the entries in the tables of all leaf nodes. For the vertices in a leaf node, a combination of their images is valid if the corresponding mapping satisfies the first two conditions for isomorphism (see section 2 in 25), and the partial optimal alignment score for a valid combination is the sum of the alignment scores of the loops and stems induced by images of vertices that are contained only in that node.

For an internal node X_i in the tree, assume without loss of generality that X_j and X_k are its children nodes. For a given combination e_i of images of the vertices in X_i, the algorithm queries the tables for X_j and X_k: e_i is set to be valid if and only if there exist valid entries e_j and e_k in the tables of X_j and X_k such that e_j and e_k have the same assignments of images as e_i for the vertices in X_i ∩ X_j and X_i ∩ X_k respectively; the algorithm also checks the first two conditions for isomorphism (section 2 in 25) and sets e_i to be invalid if one of them is not satisfied. In every such combination, only one locally best combination (computed in the children tables) is used for the vertices that occur in the children nodes but not in the parent node. The partial optimal alignment score for a valid entry e_i includes the alignment scores of stems and loops induced by images of vertices only in X_i, plus the maximum partial alignment scores over all valid entries e_j and e_k (in the tables for X_j and X_k) whose assignments of images for the vertices in X_i ∩ X_j and X_i ∩ X_k agree with e_i. Figure 4 provides an example of the overall algorithm.

[Fig. 4. A sketch of the dynamic programming approach for optimal alignments: the algorithm maintains a dynamic programming table in each tree node and, starting with the leaves of the tree, proceeds bottom-up; in computing the table for a parent node, only combinations of the images of the vertices in that node are considered.]
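The parent-table computation above can be sketched as follows. Tables are modeled as dicts from an assignment of images (a sorted tuple of (vertex, image) pairs) to the best partial score; all names are illustrative, not the paper's code:

```python
from itertools import product

def merge_parent(parent_vertices, local_score, child_tables, images):
    """Fill the DP table of a parent bag. For every combination of images
    of the parent's vertices, the entry's score is the bag-local score plus,
    for each child table, the best score among child entries that agree
    with the parent on the vertices they share; combinations with no
    consistent child entry are invalid and omitted."""
    table = {}
    for combo in product(*(images[v] for v in parent_vertices)):
        assign = dict(zip(parent_vertices, combo))
        total, ok = local_score(assign), True
        for child in child_tables:
            best = None
            for key, score in child.items():
                child_assign = dict(key)
                if all(child_assign[v] == assign[v]
                       for v in assign if v in child_assign):
                    best = score if best is None else max(best, score)
            if best is None:          # no valid child entry is consistent
                ok = False
                break
            total += best
        if ok:
            table[tuple(sorted(assign.items()))] = total
    return table
```

Vertices present only in a child contribute through the child's best consistent entry, mirroring the "one locally best combination" rule above.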

The alignment score of a filter on a sequence segment is the sum of the scores for aligning the individual stems and loops in its structure profile: the alignment score for a stem is calculated between the stem profile and a chosen image of the stem in the target sequence and, since any loop in the structure lies between some two stems, the alignment score for a loop is calculated between its profile and the sequence segment in the target delimited by the chosen images of the two flanking stems. The time complexity of this dynamic programming approach is O(k^t N^2), where k is the number of images allowed for each vertex in the conformational graph, t is the tree width of its tree decomposition, and N is its number of vertices.

The statistical distribution of alignment scores, for each filter and for the overall structural profile, is determined on a random sequence generated with the same base composition as the sequence to be searched, using a method similar to that used by RSEARCH 13.

The order in which the selected filters are applied is critical to the performance of searching. The number of possible orders for l selected filters is up to l!, so we are unable to exhaustively search through all possible orders for the best one. In practice, we developed a heuristic method that considers both the filtration ratio and the computation time of each filter: with each selected filter we associate a value that combines its measured filtration ratio f and the computation time T needed for the filter to scan a testing sequence, and we apply the structural profiles of the filters to scan the target sequence in increasing order of this value.

3. EXPERIMENTAL RESULTS

We performed experiments to test the accuracy and efficiency of this filtration based approach, and compared it with that of the original tree decomposition based program. The training data was obtained from the Rfam database 12. For each family, we chose up to 60 sequences with pair-wise identities lower than 80% from the structural alignment of the seed sequences; for training data representing distant homologs of an RNA family with structural variability, we can effectively divide the data into groups, so that a different but related profile can be built for each group and used for searches. For each family, we computed the filtration ratio of each selected filter with a random sequence of 10000 nucleotides.

To improve the computational efficiency, we constrain the images of a stem to lie within a certain region of the target sequence, called the constrained image region of the stem; this ensures a small value for the parameter k in the models. In particular, to obtain a reasonably small value for the parameter k, for each stem we compute the mean and standard deviation of the distances from its two pairing regions to the 3' end of the sequence,
We assume that for homologous sequences. In practice. the distances from the pairing region of a given stem to the 3' end follow a Gaussian distribution. we compute the mean and standard deviation of distances from its two pairing regions to the 3' end of the sequence respectively. in the target sequence. We then used this filtration based approach and the original tree decomposition based program to search for the inserted sequences. we tested the performance of our approach by searching for non-coding RNA genes in real biological genomes. . 3. To improve the computational efficiency. The partial optimal alignment score for a valid entry e. we inserted several RNA sequences from the same family into a random background generated with the same base composition as the sequences in the family. The training data was obtained from the Rfam database 12 . we can effectively divide data into groups so that a different but related profile can be built for each group and used for searches. We compared the sensitivity and specificity of both approaches on several different RNA families. For each selected filter. The time complexity for this dynamic programming approach is 0(klN2). which is generated with the same base composition as that of the sequence to be searched. On Pseudoknot-Free Structures We implemented this filter selection algorithm and combined it with our tree decomposition based searching tool to improve searching efficiency. 3. Since any loop in the structure is between some two stems. includes the alignment scores of stems and loops induced by images of vertices only in Xi and the maximum partial alignment scores over all valid entries e / s and e^'s with the same assignments of images for vertices in Xi n Xj and Xi n Xk as that of ti in tables for Xj and Xk respectively. the number of possible orders for / selected filters is up to l\ and we thus are unable to exhaustively search through all possible orders and find the best one. 
evaluated over all training sequences. We assume that, for homologous sequences, the distances from the pairing regions of a given stem to the 3' end follow a Gaussian distribution; while the corresponding filter is used for searching, a window with a size about 1.2 times this value is used to constrain the images of the stem. For each stem, the algorithm then selects the k images with the maximum alignment scores within the constrained image region of the stem.

3.1. On Pseudoknot-Free Structures

We implemented the filter selection algorithm and combined it with our tree decomposition based searching tool to improve its searching efficiency. As a first profiling and searching experiment, we used this program to search for about 30 pseudoknot-free RNA structures inserted in a random background of 10^5 nucleotides generated with the same base composition as the RNA structures. We then used both the filtration based approach and the original tree decomposition based program to search for the inserted sequences, and compared the sensitivity and specificity of the two approaches on several different RNA families. Finally, we tested the performance of our approach by searching for non-coding RNA genes in real biological genomes.
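The constrained image region above can be derived from training data as in the following sketch. The function name is our own, and the exact window rule is an assumption: we read "about 1.2 times" as a scaling factor on the spread around the mean of the assumed Gaussian.

```python
from math import sqrt

def constrained_region(distances, factor=1.2):
    """Estimate a constrained image region for a stem. `distances` are the
    training-set distances from the stem's pairing region to the 3' end,
    assumed roughly Gaussian across homologs. Returns the interval
    mean +/- factor * std (clamped at 0); the precise window rule used in
    the paper is an assumption here."""
    n = len(distances)
    mean = sum(distances) / n
    std = sqrt(sum((d - mean) ** 2 for d in distances) / n)
    half = factor * std
    return max(0.0, mean - half), mean + half
```

Stems whose position is tightly conserved across training sequences get a narrow window, which keeps the number of candidate images, and hence the parameter k, small.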

106 A sequence segment passes the screening of a filter if its corresponding alignment Z-score is larger than 2. the filtration based approach is able to achieve the same accuracy with a . In particular. The parameter k used in the tree decomposition based algorithm for searching all genomes is 7. a significant improvement on specificity is observed on a few tested families. and telomerase RNAs. the filtration based approach achieves the same or better searching accuracy than that of the original approach. Telomerase RNA is responsible for the addition of some specific simple sequences onto the chromosome ends 5 . the tmRNA is essential for the trans-translation process and is responsible for adding a new C-terminal peptide tag to the incomplete protein product of a broken mRNA 18 . we carried out the same searching experiments for each given k. Among the bacteria containing tmRNAs. The table clearly shows that compared with the original approach. Table 1 shows the number of filters selected for each tested structure and the filtration ratio for the one that is first applied to scan the genome. This pseudoknot was recently shown to play important roles in the replication of the viruses in the family n . For each tested pseudoknot structure. the algorithm selects k images with the maximum alignment scores within the constrained image region of the stem. Table 2 shows that on the tested RNA families.2. both of which contain more than 107 nucleotides. Figure 5 provides a sketch of the stems that constitute the secondary structure of a tmRNA. Table 7 provides the real locations of the searched patterns and the identified location offsets deviating from the real locations annotated by the filtration based and the original approaches respectively. Haemophilus influenzae and Neisseria meningitidis. For example. To test the accuracy and efficiency of the algorithm on genomes with a significantly larger size. 3. 
Table 4 also shows the filtration ratio of the first applied filter obtained on different values of k for each pseudoknot structure. Saccharomyces cerevisiae and Saccharomyces bay anus. Tables 5 and 6 compare the searching accuracy and efficiency between the filtration based approach and the original one. We selected four genomes from the corona virus family and used the algorithm to search for this pseudoknot. In our experiments. the filtration based approach consumes a significantly reduced amount of computation time. An alignment Z-score larger than 5. compared to the original approach.0 times faster than our original approach. for each stem. 3. we use the original tree decomposition based algorithm to process the remaining sequence segments. the filtration ratio for the first filter that is applied to scan the genome is shown in Table 4.0. It is evident that on families with pseudoknots. the filtration based approach is more than 20 times faster than the original approach on most of the tested pseudoknot structures. The training data was also obtained from the Rfam database 12 where we selected up to 40 sequences with pair wise identity lower than 80% from the seed alignment for each family. tmRNA. For each family. On most of the tested families.0 is reported as a hit. In order to evaluate the impact of the parameter k on the accuracy of the algorithm. the filtration based algorithm achieves the same accuracy as that of the CM based algorithm when parameter k reaches a value of 7. these two are relatively distant from each other evolutionarily. On Biological Genomes We used the program to search biological genomes for structural patterns that contain pseudoknots: corona virus genomes. the filtration based searching is more than 30. we used the algorithm to search for the telomerase RNA gene in the genomes of two yeast organisms. From Table 3. 
we inserted about 30 structures that contain pseudoknots into a background randomly generated with the same base composition as that of the inserted sequences. For final processing. In addition. The tree decomposition based algorithm was also used to search for tmRNA genes on the genomes of two bacteria organisms.3. the secondary structure formed by nucleotides in the 3' untranslated region in the genomes of the corona virus family contains a pseudoknot structure. On Pseudoknot Structures We also performed searching experiments on several RNA families that contain pseudoknot structures. Both of the genomes contain more than 106 nucleotides. For bacteria. The secondary structure of tmRNA contains four pseudoknots.
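The screening procedure described above can be sketched in a few lines. This is a minimal illustration of the control flow only: the `Filter` class, the `screen` function, and the toy Z-score below are our own stand-ins for the paper's structural-profile alignment; only the thresholds (a segment passes a filter at Z > 2.0, and a Z-score above 5.0 is reported as a hit) are taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Filter:
    name: str
    ratio: float                  # filtration ratio measured on a random sequence
    z: Callable[[str], float]     # alignment Z-score of this filter on a window

PASS_Z, HIT_Z = 2.0, 5.0          # thresholds stated in the text

def screen(genome: str, window: int, filters: List[Filter]):
    """Slide a window over the genome; keep segments that pass every filter.

    Filters are applied in increasing order of filtration ratio, so the most
    selective filter is tried first. Survivors would then be handed to the
    full tree decomposition based search."""
    ordered = sorted(filters, key=lambda f: f.ratio)
    survivors, hits = [], []
    for start in range(len(genome) - window + 1):
        seg = genome[start:start + window]
        if all(f.z(seg) > PASS_Z for f in ordered):
            survivors.append(start)
            if any(f.z(seg) > HIT_Z for f in ordered):
                hits.append(start)
    return survivors, hits

# Toy filter whose "Z-score" is simply the number of G/C bases in the window.
gc = Filter(name="gc", ratio=0.05,
            z=lambda seg: float(seg.count("G") + seg.count("C")))
survivors, hits = screen("AAAAAAGCGCGCAAAAAA", 6, [gc])
```

In a real search the surviving window positions, a small fraction of the genome when the filtration ratio is low, are the only ones processed by the expensive tree decomposition alignment.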


Table 1. The number of filters selected on the tested pseudoknot-free structures. For each structure, the filtration ratio for the first filter used to scan the genome is also shown.

  RNA       Selected    Filtration ratio
            filters     k=6      k=7      k=8
  EC        1           0.147    0.084    0.084
  EO        1           0.082    0.049    0.049
  Let-7     2           0.110    0.074    0.055
  Lin-4     3           0.045    0.030    0.030
  Purine    1           0.042    0.042    0.021
  SECIS     3           0.089    0.036    0.036
  S-box     2           0.189    0.189    0.189
  TTL       1           0.093    0.056    0.056

EC, EO and TTL represent Entero_CRE, Entero_OriR and Tymo_tRNA-like respectively.

Table 2. A comparison of the searching accuracy of the filtration based approach and the original tree decomposition based program in terms of sensitivity and specificity.

  RNA       Without filtration                            With filtration
            k=6          k=7          k=8                 k=6          k=7          k=8
            SE     SP    SE     SP    SE     SP           SE     SP    SE     SP    SE     SP
  EC        100    80.65 100    80.65 100    80.65        100    91.18 100    93.93 100    96.87
  EO        100    100   100    100   100    100          100    100   100    100   100    100
  Let-7     95.8   100   100    100   100    100          95.8   100   100    100   100    100
  Lin-4     100    94.11 100    94.11 100    94.11        100    100   100    100   100    100
  Purine    93.10  96.43 93.10  96.43 93.10  96.43        93.10  96.43 93.10  100   93.10  100
  SECIS     100    97.30 100    97.30 100    97.30        100    97.30 100    97.30 100    97.30
  S-box     100    92.86 100    96.30 100    96.30        100    96.30 100    100   100    100
  TTL       100    96.67 100    96.67 100    96.67        100    96.67 100    96.67 100    96.67

SE and SP are sensitivity and specificity in percentage respectively.

Table 3. The computation time for both approaches on all pseudoknot-free RNA families.

  RNA       Without filtration        With filtration
            k=6     k=7     k=8       k=6              k=7              k=8
            RT      RT      RT        RT     SU        RT     SU        RT     SU
  EC        2.85    3.21    3.38      0.07   40.71x    0.08   40.13x    0.11   30.73x
  EO        4.91    5.26    5.42      0.17   28.88x    0.23   22.87x    0.27   20.07x
  Let-7     14.97   16.38   16.92     0.24   62.38x    0.31   52.84x    0.34   49.76x
  Lin-4     3.22    4.25    5.10      0.11   29.27x    0.14   30.36x    0.16   31.87x
  Purine    7.09    8.49    9.61      0.25   28.36x    0.33   25.72x    0.38   25.29x
  SECIS     9.14    10.23   10.89     0.15   60.94x    0.20   51.15x    0.23   39.73x
  S-box     29.76   34.76   41.01     1.22   24.39x    1.71   20.33x    1.81   22.65x
  TTL       5.01    6.10    7.07      0.20   25.05x    0.24   25.42x    0.30   23.57x

RT is the computation time in minutes; SU is the amount of speed up compared to the original approach.

Both programs achieve 100% sensitivity and specificity for the searches in genomes, and the filtration based approach again achieves this accuracy with a significantly reduced amount of computation time. The table also shows that on real biological genomes, the selected filter sets can effectively screen out the parts of the genome that do not contain the desired structures and thus improve the searching efficiency.

4. CONCLUSIONS

In this paper, we develop a new approach to improve the computational efficiency of annotating non-coding RNAs in biological genomes. Based on the graph theoretical profiling model proposed in our previous work, we develop a new filtration model that uses subtrees in a tree decomposition of the conformational graph as filters. This new filtering approach can be used to search genomes for structures


Table 4. The number of filters selected on the tested pseudoknot structures. For each structure, the filtration ratio for the first filter used to scan the genome is also shown.

  RNA             Selected    Filtration ratio
                  filters     k=6      k=7      k=8
  Alpha-RBS       3           0.095    0.071    0.071
  Antizyme-FSE    1           0.078    0.066    0.042
  HDV_ribozyme    3           0.030    0.030    0.010
  IFN-gamma       5           0.069    0.035    0.035
  Tombus_3_IV     3           0.067    0.048    0.048
  corona_pk3      1           0.028    0.014    0.014
  PK3             1           0.027    0.013    0.013
  tmRNA           11          0.220    0.220    0.070
  Telomerase      2           0.130    0.130    0.130

Table 5. The search sensitivity (SE) and specificity (SP) for both the filtration based and original approaches on RNA sequences containing pseudoknots.

  RNA             Without filtration                            With filtration
                  k=6          k=7          k=8                 k=6          k=7          k=8
                  SE     SP    SE     SP    SE     SP           SE     SP    SE     SP    SE     SP
  Alpha-RBS       95.80  92.00 100    96.00 100    96.00        95.80  96.00 100    96.00 100    96.00
  Antizyme-FSE    96.43  100   100    100   100    100          92.86  100   100    100   100    100
  HDV_ribozyme    100    97.37 100    97.37 100    97.37        100    97.37 100    97.37 100    97.37
  IFN-gamma       100    100   100    100   100    100          90     100   100    100   100    100
  Tombus_3_IV     100    100   100    100   100    100          100    100   100    100   100    100
  corona_pk3      100    97.37 100    97.37 100    97.37        97.30  100   100    100   100    100

Table 6. The computation performance of both searching algorithms on the RNA families that contain pseudoknots.

  RNA             Without filtration     With filtration
                  k=6    k=7    k=8      k=6              k=7              k=8
                  RT     RT     RT       RT      SU       RT      SU       RT      SU
  Alpha-RBS       0.31   0.42   0.55     0.02    15.50x   0.03    14.00x   0.05    11.00x
  Antizyme-FSE    0.13   0.18   0.23     0.003   43.33x   0.004   45.00x   0.006   38.33x
  HDV_ribozyme    0.34   0.52   0.79     0.01    34.00x   0.02    26.00x   0.03    26.33x
  IFN-gamma       0.72   1.07   1.52     0.04    18.00x   0.05    21.40x   0.06    25.33x
  Tombus_3_IV     0.27   0.40   0.57     0.01    27.00x   0.03    13.33x   0.05    11.40x
  corona_pk3      0.15   0.20   0.26     0.005   30.00x   0.007   28.57x   0.01    26.00x

RT is the computation time in hours; SU is the amount of speed up compared to the original approach.


Fig. 5. Diagram of stems in the secondary structure of a tmRNA. Upper case letters indicate base regions that pair with the corresponding lower case letters. The four pseudoknots constitute the central part of the tmRNA gene and are labeled as Pk1, Pk2, Pk3, and Pk4 respectively.

containing pseudoknots with high accuracy. Compared to the original method, a significant amount of speed up is also achieved. More importantly, this

filtering method allows us to apply more sophisticated sequence-structure alignment algorithms on the remaining portions of the genome. For example, we

Table 7. A comparison of the accuracy and efficiency of both algorithms on searching biological genomes.

  OR    ncRNA   Without filtration    With filtration               Real location           GL
        	L    R    RT            L    R    RT     SU           Left       Right
  BCV   3'PK    0    0    0.053       0    0    0.008  6.63x        30798      30859        0.31
  MHV   3'PK    0    0    0.053       0    0    0.007  7.57x        31092      31153        0.31
  PDV   3'PK    0    0    0.048       0    0    0.004  12.00x       27802      27882        0.28
  HCV   3'PK    0    0    0.047       0    0    0.006  7.83x        27063      27125        0.27
  HI    tmRNA   -1   -1   44.0        -1   -1   0.32   137.50x      472210     472575       18.3
  NM    tmRNA   0    0    52.9        0    0    0.37   142.97x      1241197    1241559      22.0
  SC    TLRNA   -3   -1   492.3       -3   -1   8.74   56.33x       307691     308430       103.3
  SB    TLRNA   -3   2    550.2       -3   2    9.28   59.29x       7121532    7122282      114.8

OR is the name of the organism; GL is the length of the genome in multiples of 10^5 nucleotides. BCV is Bovine corona virus; MHV is Murine hepatitis virus; PDV is Porcine diarrhea virus; HCV is Human corona virus; HI and NM represent Haemophilus influenzae and Neisseria meningitidis respectively; and SC and SB represent Saccharomyces cerevisiae and Saccharomyces bayanus respectively. L and R are the left and right offsets of the resulting locations, compared to the real locations. RT is the single CPU time needed to identify the ncRNA in hours. For tmRNA and telomerase RNA searches, RT was estimated from the time needed by a parallel search with 16 processors. SU is the amount of speed up compared to the original approach.

are able to search remote homologs of a sequence family using a few alternative profiling models for each stem or loop. This approach can be used to find remote homologs with unknown secondary structure.

References

1. S. Arnborg and A. Proskurowski, "Linear time algorithms for NP-hard problems restricted to partial k-trees", Discrete Applied Mathematics, 23: 11-24, 1989.
2. V. Bafna and S. Zhang, "FastR: Fast database search tool for non-coding RNA", Proceedings of the 3rd IEEE Computational Systems Bioinformatics Conference, 52-61, 2004.
3. M. Brown and C. Wilson, "RNA Pseudoknot Modeling Using Intersections of Stochastic Context Free Grammars with Applications to Database Search", Pacific Symposium on Biocomputing, 109-125, 1995.
4. L. Cai, R. Malmberg, and Y. Wu, "Stochastic Modeling of Pseudoknot Structures: A Grammatical Approach", Bioinformatics, 19: i66-i73, 2003.
5. A. T. Dandjinou, N. Levesque, S. Larose, J. Lucier, S. A. Elela, and R. J. Wellinger, "A Phylogenetically Based Secondary Structure for the Yeast Telomerase RNA", Current Biology, 14: 1148-1158, 2004.
6. S. Eddy and R. Durbin, "RNA sequence analysis using covariance models", Nucleic Acids Research, 22: 2079-2088, 1994.
7. D. N. Frank and N. R. Pace, "Ribonuclease P: unity and diversity in a tRNA processing ribozyme", Annual Review of Biochemistry, 67: 153-180, 1998.
8. D. Gautheret and A. Lambert, "Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles", Journal of Molecular Biology, 313: 1003-1011, 2001.
9. F. Gavril, "Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph", SIAM Journal on Computing, 1: 180-187, 1972.
10. F. Gavril, "The intersection graphs of subtrees in trees are exactly the chordal graphs", Journal of Combinatorial Theory Series B, 16: 47-56, 1974.
11. S. J. Goebel, B. Hsue, T. F. Dombrowski, and P. S. Masters, "Characterization of the RNA components of a Putative Molecular Switch in the 3' Untranslated Region of the Murine Coronavirus Genome", Journal of Virology, 78: 669-682, 2004.
12. S. Griffiths-Jones, A. Bateman, M. Marshall, A. Khanna, and S. R. Eddy, "Rfam: an RNA family database", Nucleic Acids Research, 31: 439-441, 2003.
13. R. J. Klein and S. R. Eddy, "RSEARCH: Finding Homologs of Single Structured RNA Sequences", BMC Bioinformatics, 4: 44, 2003.
14. A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, "Hidden Markov models in computational biology: Applications to protein modeling", Journal of Molecular Biology, 235: 1501-1531, 1994.
15. C. Liu, Y. Song, R. Malmberg, and L. Cai, "Profiling and Searching for RNA Pseudoknot Structures in Genomes", Lecture Notes in Computer Science, 3515: 968-975, 2005.
16. T. M. Lowe and S. R. Eddy, "tRNAscan-SE: A Program for Improved Detection of Transfer RNA genes in Genomic Sequence", Nucleic Acids Research, 25: 955-964, 1997.
17. J. Matousek and R. Thomas, "On the complexity of finding iso- and other morphisms for partial k-trees", Discrete Mathematics, 108: 343-364, 1992.
18. N. Nameki, B. Felden, J. F. Atkins, R. F. Gesteland, H. Himeno, and A. Muto, "Functional and structural analysis of a pseudoknot upstream of the tag-encoded sequence in E. coli tmRNA", Journal of Molecular Biology, 286(3): 733-744, 1999.
19. V. T. Nguyen, T. Kiss, A. A. Michels, and O. Bensaude, "7SK small nuclear RNA binds to and inhibits the activity of CDK9/cyclin T complexes", Nature, 414: 322-325, 2001.
20. E. Rivas and S. Eddy, "The language of RNA: a formal grammar that includes pseudoknots", Bioinformatics, 16: 334-340, 2000.
21. E. Rivas and S. R. Eddy, "Noncoding RNA gene detection using comparative sequence analysis", BMC Bioinformatics, 2: 8, 2001.
22. E. Rivas, R. J. Klein, T. A. Jones, and S. R. Eddy, "Computational identification of noncoding RNAs in E. coli by comparative genomics", Current Biology, 11: 1369-1373, 2001.
23. E. Rivas and S. R. Eddy, "A dynamic programming algorithm for RNA structure prediction including pseudoknots", Journal of Molecular Biology, 285: 2053-2068, 1999.
24. N. Robertson and P. D. Seymour, "Graph Minors II. Algorithmic aspects of tree-width", Journal of Algorithms, 7: 309-322, 1986.
25. Y. Song, C. Liu, R. L. Malmberg, F. Pan, and L. Cai, "Tree decomposition based fast search of RNA structures including pseudoknots in genomes", Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, 223-224, 2005.
26. Y. Uemura, A. Hasegawa, Y. Kobayashi, and T. Yokomori, "Tree adjoining grammars for RNA structure prediction", Theoretical Computer Science, 210: 277-303, 1999.
27. Z. Weinberg and W. L. Ruzzo, "Faster genome annotation of non-coding RNA families without loss of accuracy", Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, 243-251, 2004.
28. Z. Yang, Q. Zhu, K. Luo, and Q. Zhou, "The 7SK small nuclear RNA inhibits the Cdk9/cyclin T1 kinase to control transcription", Nature, 414: 317-322, 2001.



THERMODYNAMIC MATCHERS: STRENGTHENING THE SIGNIFICANCE OF RNA FOLDING ENERGIES

T. Höchsmann*, M. Höchsmann and R. Giegerich

Faculty of Technology, Bielefeld University, Bielefeld, Germany

Email: {thoechsm, mhoechsm, robert}@techfak.uni-bielefeld.de

Thermodynamic RNA secondary structure prediction is an important recipe for the latest generation of functional non-coding RNA finding tools. However, the predicted energy is not strong enough by itself to distinguish a single functional non-coding RNA from other RNA. Here, we analyze how well an RNA molecule folds into a particular structural class with a restricted folding algorithm called Thermodynamic Matcher (TDM). We compare this energy value to that of randomized sequences. We construct and apply TDMs for the non-coding RNA families RNA I and hammerhead ribozyme type III, and our results show that using TDMs rather than universal minimum free energy folding allows for highly significant predictions.

1. INTRODUCTION

In this section, we briefly discuss the state of the art in RNA gene prediction and classification, and give an outline of the new approach presented here.

**1.1. RNA gene prediction and classification
**

The term "RNA genes" is defined, for the purpose of this article, as those RNA transcripts that are not translated to protein, but carry out some cellular function by themselves. Recent increased interest in the manifold regulatory functions of RNA has led to the characterization of close to 100 classes of functional RNA 1, 2. These RNA regulators mostly exert their function via their tertiary structure. RNA genes are more difficult to predict than protein coding genes for two reasons: (1) There is no signal such as an open reading frame, which would be a first necessary indicator of a coding region. (2) Comparative gene prediction approaches are difficult to apply, because sequence need not be preserved in order to preserve a functional structure. In fact, structure preservation in the presence of sequence variation is the best indicator of a potentially interesting piece of RNA 3, 4. This means that, in one way or another, structure must play an essential role in RNA gene prediction. Whereas the full 3D structure of an RNA

* Corresponding author.

molecule currently cannot be computed, its 2D structure, the particular pattern of base pairs that form helices, bulges, hairpins etc., can be determined by dynamic programming algorithms based on an elaborate thermodynamic model 5-7. Unfortunately, the minimum free energy (MFE) structure as defined and computed by this model is often weakly determined, and does not necessarily correspond to the functional structure in vivo. And of course, every single stranded RNA molecule, be it functional or not, attains some structure. However, if there is a functional structure, preserved by evolution, it should be well-defined, according to two criteria:

• Energy Criterion: The energy level of the MFE structure should be relatively low, to ensure that the structure is stable enough to execute a specific function.

• Uniqueness Criterion: The determined MFE structure should not be challenged by alternative foldings with similar free energy.

Much work has been invested in the Energy Criterion: Can we move a window along an RNA sequence, determine the MFE of the best local folding, and where it is significantly lower than for a random sequence, may we hope for an RNA gene, because evolution has selected for a well-defined structure? Surprising first results were reported by Seffens &

Digby 8, indicating that mRNAs (where one would not even expect such an effect) had lower energies than random sequences of the same nucleotide composition. However, this finding was refuted by Workman & Krogh 9, who showed that this effect goes away when considering randomized sequences with conserved dinucleotide composition. Rivas & Eddy 10 studied the significance of local folding energies in detail, reporting two further caveats: First, local inhomogeneity of CG content can produce a seemingly strong signal. Second, variance in MFE values is high, and a value of at least 4 standard deviations from the mean (a Z-score of 4) should be required before a particular value is considered an indicator of structural conservation. In most recent work, Clote et al. 11 studied several functional RNA families, comparing their MFE values against sequences of the same dinucleotide composition. They found that, on the one hand, there is a signal of smaller-than-random free energy, but on the other hand, it is not significant enough to be used for RNA gene prediction. A weak signal can be amplified by using a comparative approach. Washietl et al. 12 suggest that, by scanning several well-aligned sequences, significant Z-scores can be obtained. The tool RNAz 3 is based on this idea. Of course, a good sequence alignment is not always available. All in all, it has been determined that the Energy Criterion is not useless, but also not strong enough by itself to distinguish functional RNA genes from other RNA. A first move to incorporate the Uniqueness Criterion has been suggested by Le et al. 13. They compute scores based on energy differences: They compare the MFE value to the folding energy of a "restrained structure", which is defined by forbidding all base pairs observed in the MFE structure. This essentially partitions the folding space into two parts, taking the MFE structure within each part as the representative structure.
This can be seen as a binary version of the shape representative structures defined by Giegerich et al. 14. Just recently, the complete probabilistic analysis of abstract shapes of RNA has become possible 15, which would allow us to base the Le et al. approach on probabilities derived from Boltzmann statistics. This appears to be a promising route to follow. Here, however, we take yet another road in the same direction.

1.2. Outline of the new approach

After gene prediction via the Energy Criterion, the next step is to analyze the candidate structure, in order to decide whether it is a potential member of a known functional class. The structural family models provided in Rfam 16, 17 are typically used for this purpose. We suggest combining the second step with the first one: We ask how well the molecule folds into a particular structural class, and compare this energy value to that of randomized sequences. We shall show that in this way we can obtain significant Z-scores. Note that this approach contains the earlier one as a special case: If the "particular class" holds all feasible structures, we are back with simple MFE folding. The Le et al. approach, by contrast, is not subsumed by this idea, as their partitioning is derived from the sequence at hand, while ours is set a priori. The term Thermodynamic Matcher (TDM) has been suggested by Reeder et al. 18 for an algorithm that folds an RNA sequence into a particular type of structure in the energetically most favorable way. This is similar to using covariance models based on stochastic context free grammars, but uses thermodynamics rather than statistics. A first example of a TDM was the program pknotsRG-enf, which folds an RNA sequence into the energetically best structure containing at least one pseudoknot somewhere. Although the idea of specialized thermodynamic folding appears to be an attractive supplement to covariance models 16, to our knowledge, no other TDMs have been reported.
This is most likely due to the substantial programming effort incurred when implementing such specialized folding algorithms under the full energy model. However, these efforts are reduced by the technique of algebraic dynamic programming 19, 20, which allows one to produce such a folding program (at least an executable draft) in one afternoon of work. Subsequent experimentation may be required to make the draft more specific, as explicated below. By this technique, we have been able to produce TDMs for nine RNA families so far, and our results show that using TDMs rather than universal MFE folding allows for highly significant

predictions.

The same results as with our TDMs in this paper could be computed with RNAmotif 21, by using the free energy as score function. However, our motifs would result in exponentially many structures for an input sequence, and since the energy of every structure would have to be computed separately, this results in exponential runtime.

1.3. Tree grammars

RNA secondary structure, excluding pseudoknots, can be formally defined with regular tree grammars 15. Similar to context free string grammars, a set of rules, called productions, transforms non-terminal symbols into trees labeled with terminal and non-terminal symbols. Formally, a tree grammar G is a tuple (Σ, V, P, A), where Σ is a set of terminal symbols, V is a set of variables with Σ ∩ V = ∅, P is a production set, and A is a designated variable called the axiom. The language L(G) of a tree grammar G is the set of trees that do not contain variables, which can be derived by iteratively applying productions starting with the axiom. Figure 1 shows the tree grammar G_GF for RNA secondary structures. G_GF is a simplified version of the base grammar our TDMs are derived from, which is more complex and takes into account the latest energy model for RNA folding. We use G_GF to illustrate the basic concepts of TDMs. Note that the sequence of leaf nodes (in left-to-right order) for a tree T in L(G) is the primary sequence for T. RNA structure prediction, as well as stochastic context free grammar approaches to aligning RNA structures, are problems of computing an optimal derivation for a primary sequence.

Fig. 1. General folding grammar G_GF: The terminal symbol "base" denotes one of the nucleotides A, C, G, U, and "region" is a sequence of nucleotides. struct and comp are non-terminal symbols, and the corresponding productions are shown in the figure. These productions can be read as follows: An RNA secondary structure can be a single component or a component next to some other struct. A component is either a single stranded region (SS), or it is composed (AD) from stacking regions (SR) and loops (BR, BL, IL, ML), which can be arbitrarily nested and terminated by a hairpin loop (HL).

Fig. 2. One possible derivation of the grammar G_GF for the sequence "CUCCGGCGCAG". Note that this is just one of many possible trees/structures.
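The leaf-sequence property mentioned above (the left-to-right leaves of a derivation tree spell the primary sequence) can be illustrated with a small sketch. The tuple encoding and the particular tree below are our own illustration, not the actual derivation shown in Figure 2.

```python
# A derivation tree encoded as nested tuples: (label, child, ...).
# "base" and "region" are terminal leaves carrying sequence text.

def leaves(tree):
    """Concatenate the leaf nodes in left-to-right order."""
    label, *children = tree
    if label in ("base", "region"):
        return children[0]
    return "".join(leaves(c) for c in children)

# One hypothetical tree whose leaves spell the example sequence from Fig. 2:
tree = ("AD",
        ("base", "C"),
        ("HL", ("base", "U"), ("region", "CCGGCGC"), ("base", "A")),
        ("base", "G"))

primary = leaves(tree)
```

Reading the primary sequence back from a tree in this way is exactly the inverse of parsing; folding algorithms search among all trees with a given leaf sequence for the energetically best one.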

2. THERMODYNAMIC MATCHERS

The RNA folding problem means finding the energetically best folding for a given sequence under a certain model. Throughout this article, we consider the Zuker & Stiegler model, which describes the structure space and energy contributions for RNA

secondary structures and is used in a wide range of folding routines 7, 15, 6. As indicated above, the structure space for an RNA molecule can be defined with a tree grammar, and the folding problem becomes a parsing problem 19, 20. We use this view and express (or restrict) folding spaces in terms of tree grammars, thereby obtaining thermodynamic matchers. The informal notion of a structural motif is formally modeled by a specialized tree grammar. Let G be a grammar that describes the folding space for some structural motif, e.g. only those structures that have a tRNA-like hairpin structure. G typically differs from G_GF by the absence of some rules, while other rules may be duplicated and specialized. F_G denotes the structure space for the grammar G, in other words: all possible trees that can be derived from the grammar's axiom. A thermodynamic matcher TDM_G(s) is an algorithm that calculates the minimum free energy and the corresponding structure from the structure space F_G for some nucleotide sequence s. MFE_G(s) is the minimum free energy calculated by TDM_G(s). Since the same energy model is used, the minimal free energy of the restricted folding cannot be lower than the minimal free energy of the general folding; we always have MFE_G(s) >= MFE_GF(s). Note that it is not always possible to fold a sequence into a particular motif. In this case, the TDM returns an empty result.

2.1. Z-scores

A Z-score is the distance from the mean of a distribution normalized by the standard deviation. Mathematically: Z(x) = (x − μ)/δ, with μ being the mean and δ the standard deviation. Z-scores are useful for quantifying how different from normal a recorded value is. This concept has been applied to eliminate an effect that is well known for minimum free energy folding: The energy distribution is biased by the G/C content of a sequence as well as its length and dinucleotide composition.
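The inequality MFE_G(s) >= MFE_GF(s) can be demonstrated with a toy model. The sketch below replaces the full Zuker & Stiegler energy model with simple base-pair maximization (energy = minus the number of pairs, minimum hairpin loop of 3 nucleotides) and restricts the "motif" space to a single outside-in stem; both choices are our own simplifications, not the paper's implementation.

```python
from functools import lru_cache

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def mfe_general(s):
    """Toy general folding: minus the maximum number of pairs (Nussinov-style DP)."""
    @lru_cache(maxsize=None)
    def dp(i, j):
        if j - i < 4:          # enforce a hairpin loop of at least 3 nucleotides
            return 0
        best = dp(i + 1, j)    # s[i] left unpaired
        for k in range(i + 4, j + 1):
            if (s[i], s[k]) in PAIRS:   # s[i] paired with s[k]
                best = max(best, 1 + dp(i + 1, k - 1) + dp(k + 1, j))
        return best
    return -dp(0, len(s) - 1)

def mfe_restricted(s):
    """Toy 'matcher': only a single stem closing the whole sequence is allowed."""
    n, k = len(s), 0
    while (n - 1 - k) - k >= 4 and (s[k], s[n - 1 - k]) in PAIRS:
        k += 1
    return -k

seq = "GGGAAACCCGGGAAACCC"      # folds into two hairpins under general folding
general, restricted = mfe_general(seq), mfe_restricted(seq)
# The restricted folding can never be more stable than the general one.
assert restricted >= general
```

Here the general folding finds both hairpins while the restricted matcher only sees the outer stem, so the restricted "energy" is strictly higher; with the full energy model the same ordering holds for any grammar restriction.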
To calculate the Z-score for a particular sequence, the distribution of MFE values for random sequences with the same dinucleotide composition must be known. The lower the Z-score, the lower is the energy compared to energies from random sequences. Clote et al. 11 observed that Z-score distributions for RNA genes are lower than Z-score distributions for random RNA. However, this difference is fairly small and only significant if the whole distribution is considered. It is not sufficient to distinguish an individual RNA gene from random RNA 10. The reason for the insufficient significance of Z-scores is the combinatorics of RNA folding: there is often some structure in the complete search space that obtains a low energy.
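A sampling-based Z-score computation can be sketched as follows. The `mfe` and `shuffle` arguments are placeholders: a real application would plug in an energy-model folder (for example a TDM) and a dinucleotide-composition-preserving shuffle such as the Altschul-Erickson procedure, neither of which is implemented here.

```python
def z_score(x, samples):
    """Z(x) = (x - mu) / delta over a sample of the background distribution."""
    n = len(samples)
    mu = sum(samples) / n
    delta = (sum((e - mu) ** 2 for e in samples) / n) ** 0.5
    return (x - mu) / delta

def sampled_z(seq, mfe, shuffle, n=1000):
    # Background: MFE values of n randomized versions of seq.
    background = [mfe(shuffle(seq)) for _ in range(n)]
    return z_score(mfe(seq), background)
```

With a TDM-restricted folder in place of `mfe`, a family member should yield a strongly negative value, while random sequences cluster around zero.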


Fig. 3. Z-score histogram for 10000 random sequences with a length of 100 nucleotides, for two TDMs (RNAI, HH) and the general folding (G_GF).

Here, our aim is not the general prediction of non-coding RNA, but the detection of new members of a known, or at least defined, RNA family. By restricting the folding space, we can, as we demonstrate in Section 3, shift Z-scores for family members into a significant zone. Structures with MFE_GF = MFE_G for a grammar G get a lower Z-score, since the distribution of MFE_G for random RNA is shifted to higher energies. Even if this seems to be right for the grammars used in this paper, the effect of a folding space restriction on the energy distribution is not obvious. Clearly, the mean is shifted to more positive values, but the effect on the variance is not yet understood mathematically. Therefore, our applications must provide evidence that the Z-scores are affected in the desired way. Let D_G(s) be the frequency distribution of MFE values for random sequences with the same dinucleotide frequency as s, i.e. the minimum free energy versus the fraction of sequences s' obtaining that energy with TDM_G(s'). Z_G(s) is the Z-score for a sequence s with respect to the distribution D_G(s).

the Rfam database.115 The value-mean and the standard deviation can be determined by a sampling procedure. 5. a restriction of the folding space does not affect the Z-score distribution. we are interested in stable secondary structures that consist of three hairpin loops separated by single stranded regions. we generate 1000 random sequences preserving the dinucleotide frequencies of s. Instead of an axiom t h a t derives arbitrary RNA structures. the consensus shown there is a good starting point. The distribution of Z-scores for random RNA sequences is shown in Figure 3.98% of random RNAs have Z-scores greater then -4. a Z-score of lower than -4 is needed 10 . At least this holds for the TDMs shown in this paper. t h e axiom motif derives three hairpin loops {hloop) connected by single stranded regions. region hloop IL base hloop base BL region hloop I region hloop region BR hloop region A-e I HL region f. A specialized grammar for RNAI must only allow structures compatible with this motif. a threshold should be set to a Z-value such that the number of false predictions is trackable. Our experiments showed that over 99. at least the structural part of it. Interestingly. For our experiments. Design and implementation Designing a thermodynamic matcher means defining its structure space. A simplified version of the grammar QRNAU which abstracts from length constraints for stems and loops. motif AD hloop region hloop AD SS hloop 2. 4 . is given in Figure 5. The design of a TDM for an RNA gene requires a consensus structure. F i g . On the one hand it must be large enough to support good sensitivity. To distinguish RNA genes from other RNA on a genomic scale. For a reliable detection of RNA genes. If an RNA family is listed in We now exemplify the design of a TDM. like PMmulti22 and RNAcast23. like the structures of RNAI genes as shown in Figure 4. Simplified version of t h e grammar QRNAIReconsider the grammar in Figure 1. E • '-6 *&*• F i g .2. 
Since we want to demonstrate that with a search . and on the other hand it must be small enough to provide good specificity. A systematic analysis of the relation between structure space restriction and its effect on specificity and sensitivity of MFE based Z-scores is subject of our current research. Alternatively. Consensus structure for RNAI genes taken from the Rfam database. For instance. the consensus of known sequences can be obtained with programs that predict a common structure.

In the context of ADP.06 -6.6 . We use the algebraic dynamic programming (ADP) framework19 to turn RNA secondary structure space grammars into thermodynamic matchers. the plasmid encoded RNAII transcript. Structures for this consensus are described by the grammar QRNAI (Figure 5).1 Z Z GCF QRNAI -6. it would still obtain a good energy resulting in a low Z-score. which constitutes the control structure of an unrestricted folding algorithm. and FN is the number of The Rfam consensus structure consists of three adjacent hairpin loops connected by single stranded regions (Figure 4). To assess if TDMs can be used to find candidates for an RNA family.53 -6. Clearly. RESULTS We constructed TDMs for the non-coding RNA families RNAI and hammerhead type III ribozyme (hammerheadlll) taken from the Rfam database Version 7. which forms a hybrid with its template DNA.20 -6. This has an undesired effect: It would be possible to fold a sequence.1. these structures do not really resemble the structures of RNAI genes. There is no need to implement the search routines. These restrictions are compatible with the consensus of RNAI and increase the sensitivity .63 -5. and specificity.84 -5. that folds (with general folding) into a single hairpin with low energy. All TDMs share these rules. EMBL Accession number AF156893. into a structure with one long and two very short hairpins. In refinement. we searched for known members in genomic data. where T P is the number of true positives.31 -6. We apply our TDMs to genomes containing the seed sequences and measure the relation between Z-score threshold. FP is the number of false positives. This approach is similar to using an engine for searching with regular expressions. RNA I Replication of ColEl and related bacterial plasmids is initiated by a primer. sensitivity.41 -5.1 U80803.71 3. If we allow for arbitrary stem lengths in our motif.1 U65460.1 Y17716. 
All TDMs used in this section utilize the complete energy model for RNA folding⁶ and therefore have more complex grammars than the grammars presented to explain our method; only the grammar changes. The time complexity of a TDM depends on the motif complexity. Without multiloops, the time complexity is O(n²) if the size of bulges and loops is bounded by a constant. If multiloops are included, the runtime is O(n³), where n is the length of the sequence that is folded.

Sequences coding for RNAI fold into stable secondary structures, with Z-scores reaching from -3.33 downward (Table 1).

Table 1. Z-scores for the RNAI seed sequences computed with TDM𝒢GF and TDM𝒢RNAI (seed accessions include U80803.1, U65460.1, Y17716.1, S42973.1, D21263.1, X80302.2, X63534.1, AJ132618.1, Y17846.1).
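The energy based Z-score used throughout normalises a folding energy against the energy distribution of shuffled sequences. Below is a minimal sketch; the folding-energy routine is deliberately left abstract (a real application would call an MFE folder or a TDM), and the mononucleotide shuffle is a simplification of the dinucleotide-preserving shuffles preferred in the literature:

```python
import random
from statistics import mean, stdev

def z_score(energy, null_energies):
    # Z = (E - mean(null)) / std(null); a strongly negative Z means the
    # sequence folds more stably than its shuffled versions
    return (energy - mean(null_energies)) / stdev(null_energies)

def shuffled(seq, rng):
    # mononucleotide shuffle (preserves base composition only)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def mfe_z_score(seq, fold_energy, n_shuffles=100, seed=0):
    # fold_energy is an assumed external routine (e.g. an MFE folder)
    rng = random.Random(seed)
    null = [fold_energy(shuffled(seq, rng)) for _ in range(n_shuffles)]
    return z_score(fold_energy(seq), null)
```

The TDM variant differs only in which folding routine is plugged in: restricted folding yields the Z𝒢RNAI-style scores, general folding the Z𝒢GF-style scores.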
In both cases the memory consumption scales with O(n²).

Sequences from the seed obtain Z𝒢RNAI values between -5.33 and -7.41 (Table 1). For random RNA, the frequency distribution of Z𝒢RNAI is similar to that of Z𝒢GF (see Figure 3). The Z𝒢RNAI score difference is large enough to distinguish RNAI genes from random RNA.

To verify whether RNAI genes can also be distinguished from genomic RNA, we applied our matcher to 10 plasmids that contain the seed sequences (one in each of them). The plasmid lengths range from 108 to 8193 nucleotides in this experiment; all plasmids together have a length of ~27500 nucleotides. For each plasmid, a 100 nucleotides long window was slid from 5' to 3' with a successive offset of 5, and Z𝒢RNAI was computed for every window. RNA I can be located on both strands of the plasmid; therefore, TDM𝒢RNAI was also applied to the reverse complement. Overall, this results in ~11000 Z𝒢RNAI scores. An RNAI sequence was counted as a positive hit if a Z-score in the range of 5 nucleotides to the left or right of the starting position of an RNAI gene is equal to or lower than the current threshold. In this region, no negative hits are counted.

Figure 6 shows the result for a plasmid of Klebsiella pneumoniae. With general folding (a), the position where the known RNAI gene starts achieves a low Z-score, but there is another position with a lower Z-score (position ~1450) and positions with nearly as low scores (around position 750). With restricted folding (b), the RNAI gene clearly separates from all other positions. Sequences that fold into some unrelated stable structure are penalized because they cannot fold into a stable RNAI structure.

Fig. 6. TDM scan for RNAI in a plasmid of Klebsiella pneumoniae (EMBL Accession number AF156893). (a) In steps of 5 nucleotides, the score Z𝒢GF is shown for the following 100 nucleotides and for their reverse complement; the Z-scores for both directions are drawn versus the same sequence position. (b) shows the corresponding values for Z𝒢RNAI. The known RNAI gene is located at position 4498, indicated by the dotted vertical line.

Sensitivity and specificity versus the Z-score threshold are shown in Figure 7. A threshold of -3.5 is required to find all RNAI genes of the seed with TDM𝒢GF; the specificity in this case is 96.71%, resulting in 362 false positives. If we set the Z-score threshold to -5, we obtain for TDM𝒢RNAI a sensitivity of 100% and a specificity of 99.89%, which means 10 true positives and 12 false positives (for all plasmids). For TDM𝒢GF we obtain only a sensitivity of 80% and a specificity of 99.10%, which means 8 true positives and 99 false positives. Although the difference in specificity appears small, it makes a big difference in the number of false positives for genome wide applications. Overall, TDM𝒢RNAI improves sensitivity and specificity compared to TDM𝒢GF.

Fig. 7. Sensitivity and specificity versus the Z-value threshold for TDM𝒢GF and TDM𝒢RNAI.

It is also possible to use a complete sequence as input for a TDM; this will return the best substructure (or substructures) in terms of energy, which does not always correspond to the substructure with the lowest Z-score.
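The scanning procedure described above (fixed-length windows, small offset, both strands) can be sketched generically. `score_window` stands in for the TDM Z-score computation and is our assumption, not the authors' interface:

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def scan(seq, score_window, window=100, offset=5):
    """Slide a window 5'->3' over seq and over its reverse complement,
    returning (position, strand, score) triples."""
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for pos in range(0, len(s) - window + 1, offset):
            hits.append((pos, strand, score_window(s[pos:pos + window])))
    return hits

def candidates(hits, threshold):
    # keep windows whose Z-score is at or below the threshold
    return [h for h in hits if h[2] <= threshold]
```

With window=100 and offset=5, a scan of both strands of the ten plasmids (~27500 nt) produces on the order of the ~11000 scores reported in the text.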
3.2. Hammerhead ribozyme (type III)

The hammerhead ribozyme was originally discovered as a self-cleaving motif in viroids and satellite RNAs. These RNAs replicate using the rolling circle mechanism, which generates long multimeric replication intermediates. They use the cleavage reaction to resolve the multimeric intermediates into monomeric forms. The region able to self-cleave has three base paired helices connected by two conserved single stranded regions and a bulged nucleotide.

Fig. 8. Consensus structure for hammerhead ribozyme type III genes taken from the Rfam database.

Hammerhead type III ribozymes (HammerheadIII) form stable secondary structures, with Z-scores varying from -6 to -2 for general folding. The seed sequences from the Rfam database vary in their length: 6 sequences have a length of around 80 nucleotides, while all other seed sequences are around 55 nucleotides long. To be able to use length constraints which are not too vague, we removed the 6 long sequences from our experiment. If a family has many members, it might be necessary to separately consider subfamilies.

Grammar 𝒢HH describes the folding space for the consensus structure shown in Figure 8. The single stranded region between the two stem loops in the multiloop has to be between 5 and 6 nucleotides long. The stem lengths are not explicitly restricted. The maximal length of our motif is 60 nucleotides; TDM𝒢HH is not designed to search for HammerheadIII candidates with a sequence length larger than 60 nucleotides.

TDM𝒢HH improves the distribution of Z-scores for the seed sequences (Figure 9). Most sequences now obtain a Z-score smaller than -4, but some obtain a higher score. These sequences are only about 45 nucleotides long; they fold into two adjacent hairpin loops and do not form a multiloop with TDM𝒢GF. They are forced into our HammerheadIII motif with considerably higher free energy.

Fig. 9. Z-score distributions for 68 hammerhead ribozyme type III sequences.

We applied TDM𝒢HH to 59 viroid sequences with lengths of 290 to 475 nucleotides. Each sequence contains one or two HammerheadIII genes. A 60 nucleotides long window was slid from 5' to 3' with a successive offset of 2, and Z𝒢HH was computed for each window. HammerheadIII can be located on both strands of the DNA; for each sequence (and for its reverse complement), this resulted in ~19500 scores overall. A HammerheadIII sequence was counted as a positive hit if a Z-score in the range of 3 nucleotides to the left or right of the starting position of a HammerheadIII gene is equal to or lower than the current threshold. In this region, no negative hits are counted.

The sensitivity and specificity depending on the Z-score threshold are shown in Figure 10. The sensitivity is improved significantly compared to TDM𝒢GF. However, the specificity is lower for Z-score thresholds smaller than -3. This is a consequence of the combinatorics of the RNA folding space, which provides many "good" foldings. It turned out that many false positives with Z-values smaller than -4 may be true positives: these are sequences which are not part of the Rfam seed, but are predicted as new candidate genes in Rfam. Figure 11 shows sensitivity and specificity if false negatives that are candidate genes in Rfam are counted as true positives. All RNA candidate genes that are provided in Rfam achieve low Z-scores, as shown in Figure 12. This gives further and independent evidence for the correctness of both predictions. Overall, TDM𝒢HH improves sensitivity and specificity compared to TDM𝒢GF.

Fig. 10. Sensitivity and specificity versus the Z-value threshold. TDM𝒢HH improves sensitivity and specificity compared to TDM𝒢GF.

Fig. 11. Sensitivity and specificity versus the Z-value threshold, with Rfam candidate genes counted as true positives. TDM𝒢HH improves sensitivity and specificity compared to TDM𝒢GF.

Fig. 12. Distribution of Z-scores for all 274 HammerheadIII gene and gene candidate sequences taken from the Rfam database.

4. DISCUSSION

The current debate about the quality of thermodynamic prediction of RNA secondary structures is extended by our observations regarding specialized folding spaces. It is well known that the MFE structure from predictions in most cases shares only a small number of base-pairs with the structure determined by more reliable sources than MFE, such as compensational base mutations. Thus, MFE on its own can not be used to discriminate non-coding RNAs. We demonstrated that, given a consensus structure for a family of non-coding RNA, a restriction of the folding space to this family, called thermodynamic matching, prunes low energy foldings for non-coding RNA that do not belong to this family. These matchers can be fine tuned and can also include sequence restrictions, which could further increase their sensitivity and specificity, or, if this is not necessary, allow "looser" motif definitions. Unlike Infernal¹⁶, which is used for the prediction of candidate family members in Rfam, we use pure thermodynamics rather than a covariance based optimization. In our experiments for RNA I and the hammerhead type III ribozyme, we did not include other restrictions than size restrictions for parts of the structure. It is also possible to include H-type pseudoknots in the motif using techniques presented in Ref. 18.
The overlap of the distributions of MFE based Z-scores for family members and non-family members can be reduced by our technique, resulting in a search technique with high sensitivity and specificity.

We demonstrated that a TDM can detect members of RNA families by scanning single sequences. Besides the two RNA families shown here, we have implemented TDMs for 7 other non-coding RNA families, including transfer RNA, the microRNA precursor and the Nanos 3' UTR translation control element. The results were consistent with our observations for RNAI and the hammerhead ribozyme given here. In a current research project, we focus on a systematic generation of TDMs for known RNA families from the Rfam database, which will be used to analyze further the predictive power of thermodynamic matchers. We are also working on a graphical user interface to facilitate biologists in creating their own TDMs, without requiring knowledge of the underlying algebraic dynamic programming technique.

It seems promising to extend the TDM approach to scan aligned sequences, using a combined energy and covariance scoring in the spirit of RNAalifold¹². This should further increase selectivity. A question that arises from our observations is: can our TDM approach be incorporated in a gene prediction strategy? If we would guess a certain motif and find stable structures with significant Z-scores, they might be biologically relevant.

ACKNOWLEDGEMENTS

We thank Marc Rehmsmeier for helpful discussions and Michael Beckstette for comments on the manuscript.

References
1. S. R. Eddy, "Non-coding RNA genes and the modern RNA world," Nature Reviews Genetics, vol. 2, no. 12, pp. 919-929, 2001.
2. A. F. Bompfünewerer, C. Flamm, C. Fried, G. Fritzsch, I. L. Hofacker, J. Lehmann, K. Missal, A. Mosig, B. Müller, S. J. Prohaska, P. F. Stadler, A. Tanzer, S. Washietl and C. Witwer, "Evolutionary patterns of non-coding RNAs," Theor. Biosci., vol. 123, pp. 301-369, 2005.
3. E. Rivas and S. R. Eddy, "Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs," Bioinformatics, vol. 16, no. 7, pp. 583-605, 2000.
4. S.-Y. Le, J.-H. Chen and Jacob V. Maizel, "Discovering well-ordered folding patterns in nucleotide sequences."
5. D. Turner, N. Sugimoto and S. Freier, "RNA Structure Prediction," Annual Review of Biophysics and Biophysical Chemistry, vol. 17, pp. 167-192, 1988.
6. M. Zuker, "Mfold web server for nucleic acid folding and hybridization prediction," Nucl. Acids Res., vol. 31, no. 13, pp. 3406-3415, 2003.
7. I. L. Hofacker, "Vienna RNA secondary structure server," Nucl. Acids Res., vol. 31, no. 13, pp. 3429-3431, 2003.
8. W. Seffens and D. Digby, "mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences," Nucl. Acids Res., vol. 27, no. 7, pp. 1578-1584, 1999.
9. C. Workman and A. Krogh, "No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution," Nucl. Acids Res., vol. 27, no. 24, pp. 4816-4822, 1999.
10. P. Clote, F. Ferre, E. Kranakis and D. Krizanc, "Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency," RNA, vol. 11, no. 5, pp. 578-591, 2005.
11. S. Washietl, I. L. Hofacker and P. F. Stadler, "Fast and reliable prediction of noncoding RNAs," PNAS, vol. 102, no. 7, pp. 2454-2459, 2005.
12. S. Washietl and I. L. Hofacker, "Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics," J Mol Biol, vol. 342, no. 1, pp. 19-30, 2004.
13. E. Rivas and S. R. Eddy, "Noncoding RNA gene detection using comparative sequence analysis," BMC Bioinformatics, vol. 2, p. 8, 2001.
14. I. L. Hofacker, S. H. Bernhart and P. F. Stadler, "Alignment of RNA Base Pairing Probability Matrices," Bioinformatics, vol. 20, no. 14, pp. 2222-2227, 2004.
15. T. Macke, D. Ecker, R. Gutell, D. Gautheret, D. Case and R. Sampath, "RNAMotif, an RNA secondary structure definition and search algorithm," Nucl. Acids Res., vol. 29, no. 22, pp. 4724-4735, 2001.
16. S. Griffiths-Jones, A. Bateman, M. Marshall, A. Khanna and S. R. Eddy, "Rfam: an RNA family database," Nucl. Acids Res., vol. 31, no. 1, pp. 439-441, 2003.
17. S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy and A. Bateman, "Rfam: annotating non-coding RNAs in complete genomes," Nucl. Acids Res., vol. 33, suppl 1, pp. D121-124, 2005.
18. J. Reeder and R. Giegerich, "Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics," BMC Bioinformatics, vol. 5, p. 104, 2004.
19. R. Giegerich, C. Meyer and P. Steffen, "A discipline of dynamic programming over sequence data," Science of Computer Programming, vol. 51, no. 3, pp. 215-263, 2004.
20. P. Steffen and R. Giegerich, "Versatile and declarative dynamic programming using pair algebras," BMC Bioinformatics, vol. 6, p. 224, 2005.
21. J. Reeder and R. Giegerich, "Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction," Bioinformatics, vol. 21, no. 17, pp. 3516-3523, 2005.
22. R. Giegerich, B. Voss and M. Rehmsmeier, "Abstract Shapes of RNA," Nucl. Acids Res., vol. 32, no. 16, pp. 4843-4851, 2004.
23. B. Voss, R. Giegerich and M. Rehmsmeier, "Complete probabilistic analysis of RNA shapes," BMC Biology, vol. 4, p. 5, 2006.
24. Y. Eguchi, T. Itoh and J. Tomizawa, "Antisense RNA," Annu. Rev. Biochem., vol. 60, pp. 631-652, 1991.


PEM: A GENERAL STATISTICAL APPROACH FOR IDENTIFYING DIFFERENTIALLY EXPRESSED GENES IN TIME-COURSE CDNA MICROARRAY EXPERIMENT WITHOUT REPLICATE

Xu Han*
Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672
*Email: hanxu@gis.a-star.edu.sg

Wing-Kin Sung
Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672
School of Computing, National University of Singapore, Singapore 117543
Email: sungk@gis.a-star.edu.sg, ksung@comp.nus.edu.sg

Lin Feng
School of Computer Engineering, Nanyang Technological University, Singapore 637553
Email: asflin@ntu.edu.sg

* Corresponding author.

Replication of time series in microarray experiments is costly, and many published time course datasets have no replicate. To analyze time series data with no replicate, we propose a method called PEM (Partial Energy ratio for Microarray) for the analysis of time course cDNA microarray data. In the PEM method, we assume the gene expressions vary smoothly in the temporal domain. This assumption is comparatively weak and hence the method is general enough to identify genes expressed in unexpected patterns. To identify the differentially expressed genes, a new statistic is developed by comparing the energies of two convoluted profiles. We further improve the statistic for microarray analysis by introducing the concept of partial energy. The PEM statistic is incorporated into the permutation based SAM framework for significance analysis. We evaluated the PEM method with an artificial dataset and two published time course cDNA microarray datasets on yeast. The experimental results show the robustness and the generality of the PEM method. It outperforms the previous versions of SAM and the spline based EDGE approaches in identifying genes of interest.

Keywords: Time course, cDNA microarray, differentially expressed gene, PEM.

1. INTRODUCTION

Time-course cDNA microarray experiments are widely used to study the cell dynamics from a genomic perspective and to discover the associated gene regulatory relationships. Identifying differentially expressed genes is an important step in time course microarray data analysis to select the biologically significant portion from the genes available in the dataset. A number of solutions have been proposed in the literature for this purpose. When replicated time course microarray data is available, various statistical approaches, like ANOVA and its modifications, are employed (Lonnstedt & Speed, 2002; Park et al., 2003; Smyth, 2003). This category of approaches has been extended to recent work on longitudinally sampled data, where the microarray measurements span a multi-dimensional space with the coordinates being gene index, individual donor, and time point (Guo et al., 2003; Storey et al., 2005). However, replication of time series or longitudinal sampling is costly if the number of time points is comparatively large. Besides, many published time course datasets have no replicate. Moreover, modeling the temporal expression patterns is difficult when the dynamics of gene expression in the experiment is poorly understood; although many model-specific approaches have been proposed, they fail to identify the genes whose expression patterns do not fit the pre-defined models.

When replicated time course is not available, clustering based approaches and model-specific approaches are widely used. Clustering based approaches select genes whose patterns are similar to each other, and are advantageous in finding co-expressed genes. A famous example of clustering software is Eisen's Cluster (Eisen et al., 1998). The drawback is that clustering does not provide a ranking for the individual genes, and it is difficult to determine a cut-off threshold based on confidence analysis. Moreover, cluster analysis may fail to detect changing genes that belong to clusters for which most genes do not change (Bar-Joseph et al., 2003).

Model-specific approaches identify differentially expressed genes based on prior knowledge of their temporal patterns. Spellman et al. (1998) used Fourier transform to identify cell-cycle regulated genes. Xu et al. (2002) developed a regression-based approach to identify the genes induced in a Huntington's disease transgenic model. Bar-Joseph et al. (2002) proposed a spline based approach, and Peddada et al. (2003) proposed an order-restricted model to select responsive genes. The software of EDGE (Storey et al., 2005) implemented natural cubic spline and polynomial spline for testing the statistical significance of genes. In spline based approaches, the dimension of the spline needs to be chosen carefully to balance the robustness and the diversity of gene patterns, and an empirical setting of the dimension may not be applicable for some applications. More generally, the assumption underlying the model-specific approaches is too strong, and some biologically informative genes that do not fit the predefined model may be ignored. Additionally, in the recent versions of SAM (Tusher et al., 2001), two alternative methods, slope based and signed area based, are provided for analyzing single time course data.

The goal of this paper is to propose a new statistical method called PEM (Partial Energy ratio for Microarray) for the analysis of time course cDNA microarray data, which is established on comparatively weaker assumptions. In time-course experiments, the measurements are sampled from continuously varying gene expressions. Thus it is often observed that the log-ratio expression profiles of the differentially expressed genes are featured with "smooth" patterns, of which the energies mainly concentrate in low frequency. To utilize this feature, we employ two simple convolution kernels, namely the smoothing kernel and the differential kernel, that function as a low-pass filter and a high-pass filter, respectively. The basic statistic for testing the smoothness of a temporal pattern is represented by the energy ratio of the convoluted profiles. We further improve the performance of the statistic for microarray analysis by introducing a concept called partial energy, to solve the problem caused by "steep edges", which refer to rapid increasing or decreasing of gene expression level. Following the recent versions of SAM, a small positive constant called "relative difference" is added to the denominator of the ratio, which efficiently stabilizes the variance of the proposed statistic. The proposed ratio statistic is incorporated into the permutation based SAM (Tusher et al., 2001) framework for determining the confidence interval and the false discovery rate (Benjamini and Hochberg, 1995).

An artificial dataset and two published cDNA microarray datasets are employed to evaluate our approach. The published datasets include the yeast environment response dataset (Gasch et al., 2000) and the yeast cell cycle dataset (Spellman et al., 1998). In the experiment with the yeast cell cycle dataset, the PEM method not only identified the periodically expressed genes, but also identified a set of non-periodically expressed genes, which are verified to be biologically informative. The experimental results showed the robustness and generality of the proposed PEM method.

2. METHOD

2.1 Signal/noise model for cDNA microarray data

Consider a two-channel cDNA time-course microarray experiment over m genes, g1, g2, ..., gm, and n time points, t1, t2, ..., tn. The log-ratio expression profile of the gene gi (i = 1 to m) can be represented by Xi = [Xi(t1), Xi(t2), ..., Xi(tn)]T, where Xi(tj) (j = 1 to n) represents the log-ratio expression value of gi at the j-th time point.
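As a minimal illustration of the two-channel log-ratio representation (channel normalisation, e.g. by Lowess regression, is omitted for brevity):

```python
import math

def log_ratio_profile(red, green):
    # X_i(t_j) = log2(R_j / G_j) for one gene across the n time points;
    # red/green are the two channel intensities (hypothetical inputs)
    return [math.log2(r / g) for r, g in zip(red, green)]
```

A gene measured at n time points thus yields an n-dimensional profile Xi, the object all later statistics operate on.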

We model the log-ratio expression profile Xi as the sum of its signal component Si and its noise component εi, i.e. Xi = Si + εi, where Si = [Si(t1), Si(t2), ..., Si(tn)]T and εi = [εi(t1), εi(t2), ..., εi(tn)]T. For a non-differentially expressed gene gi, we assume its expression signals in the two channels are identical at all the time points. In this case, the signal component Si is constantly zero, and the log-ratio expression profile Xi consists only of the noise component. Thus the null hypothesis is defined as follows:

H0: Xi = εi.

We have the following assumption on the noise component:

Assumption of noise: εi(t1), εi(t2), ..., εi(tn) are independent random variables following a symmetric distribution with the mean equal to zero.

Note that the noise distribution in our assumption is not necessarily normal; this gives a better model of the heavily tailed symmetrical noise distribution that is often observed in microarray log-ratio data. Due to the variation of populations in cDNA microarray experiments, there is a bias between the expression signals in the two channels. Thus the assumption underlying the null hypothesis may not be established if the log-ratios are calculated directly from the raw data. We suggest using pre-processing approaches such as Lowess regression to compensate the global bias (Yang et al., 2001). To further overcome the influence of the gene-specific bias, we adopted the SAM framework, in which a small positive constant called "relative difference" was introduced to stabilize the variance of the statistic (Tusher et al., 2001). Nevertheless, the null hypothesis provides a mathematical foundation for the demonstration of our method.

2.2 Smoothing convolution and differential convolution

In time-course experiments, the measurements are sampled from continuously varying gene expressions. If there is an adequate number of sampled time points, the temporal pattern of the signal Si will be comparatively smooth, so that the energy of Si will concentrate in low frequency. To utilize this feature, we introduce two simple convolution kernels for time series data analysis, namely the smoothing kernel and the differential kernel. The smoothing kernel is represented by a sliding window Ws = [1, 1], and the differential kernel is represented by Wd = [-1, 1]. Given a vector V = [V(t1), V(t2), ..., V(tn)]T representing a time series, the smoothed profile and the differential profile of V are represented by

V * Ws = [V(t1) + V(t2), V(t2) + V(t3), ..., V(tn-1) + V(tn)]T and
V * Wd = [V(t1) - V(t2), V(t2) - V(t3), ..., V(tn-1) - V(tn)]T,

respectively, where * is the convolution operator. In signal processing, the smoothing kernel and the differential kernel function as a low-pass filter and a high-pass filter, respectively.

Since the energy of the signal component Si is likely to concentrate in low frequency, we have:

Assumption of signal: If Si is a non-zero signal vector, then E(|Si * Ws|²) > E(|Si * Wd|²), where E(|Si * Ws|²) and E(|Si * Wd|²) represent the expected energies of the corresponding smoothed profile and differential profile.

Next, we derive two propositions from the Assumption of noise and the Assumption of signal, as follows:

Proposition 1: If the noise component εi satisfies the Assumption of noise, then E(|εi * Ws|²) = E(|εi * Wd|²). (1)

Proposition 2: If the signal component Si satisfies the Assumption of signal, and the noise component εi satisfies the Assumption of noise, then E(|(Si + εi) * Ws|²) > E(|(Si + εi) * Wd|²). (2)

Propositions 1 and 2 can be proven based on the symmetry of the noise distribution and the linear decomposability of the convolution operation.
According to Eq. (1) and Eq. (2), we define a statistic called energy ratio (ER) for testing the null hypothesis, as follows:

ER(Xi) = |Xi * Ws|² / |Xi * Wd|²   (3)

The numerically estimated distributions of the logarithm of ER(εi) are shown in Fig. 1; we take the logarithm simply for the convenience of visualization. The logarithm of ER(εi) follows a symmetric distribution highly peaked around a zero mean. When n → ∞, the logarithm of ER(εi) is asymptotically independent of the distribution of εi, which can be easily proven based on the central limit theorem. The distribution is two-tailed, but we are only interested in the positive tail when testing the null hypothesis. This is because the negative tail implies that the energy concentrates in the high frequency; according to the Nyquist sampling criterion, the high frequency component is not adequately sampled, thus the expression profile may not be reliable.

Fig. 1. The numerically estimated distribution of the logarithm of ER(εi), where the number of time points n varies from 5 to 15 and the distribution of εi is multivariate normal.

2.3 Partial energy

In most time-course microarray experiments, the number of time points is limited. When the number of time points is limited, the smoothness of the signal component Si is not guaranteed at all the time points. A steep edge refers to rapid increasing or decreasing of the gene expression level at certain time points. Fig. 2 shows an example of a responsive gene expression profile in which a steep up-slope edge occurs between the 3rd and the 4th time points. Obviously, the steep edge adds a large value to the denominator in Eq. (3), hence reduces the statistical significance of the ER score. We call this a "steep edge" problem.

Fig. 2. An example of a responsive gene expression profile where a "steep edge" occurs between the 3rd and the 4th time points.

To solve the "steep edge" problem, we propose a new concept called partial energy. The basic idea of partial energy is to exclude the steep edges in calculating the energy of a differential profile. For most responsive patterns in microarray data, the number of steep edges is much smaller than the number of time points. Let Y = [y1, y2, ..., yn]T be a vector representing a profile, and let y(i)² represent the i-th biggest value of y1², y2², ..., yn². The k-order partial energy of Y is defined as:

PE_k(Y) = Σ_{i=1}^{n} y_i² − Σ_{i=1}^{k} y_(i)²,   where k < n.

For example, let Y = [1, -4, 3, -1, 2]T; its 2-order partial energy is PE2(Y) = 1² + (-1)² + 2² = 6, where -4 and 3 are excluded in calculating the partial energy. We assume there are no more than 2 steep edges in the gene expression profile, and modify the statistic to be the ratio of the 2-order partial energies (PER2) of the smoothed profile and the differential profile:

PER2(Xi) = PE2(Xi * Ws) / PE2(Xi * Wd)   (4)
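The k-order partial energy, dropping the k largest squared entries (the presumed steep-edge contributions), can be sketched directly from the definition and checked against the worked example:

```python
def smooth(v):
    return [a + b for a, b in zip(v, v[1:])]

def diff(v):
    return [a - b for a, b in zip(v, v[1:])]

def partial_energy(y, k):
    # total energy minus the k largest squared entries
    squares = sorted((u * u for u in y), reverse=True)
    return sum(squares[k:])

def per2(x):
    # PER2(X) = PE2(X * Ws) / PE2(X * Wd)   (Eq. 4)
    return partial_energy(smooth(x), 2) / partial_energy(diff(x), 2)
```

For the example Y = [1, -4, 3, -1, 2], the two largest squares (16 and 9) are dropped and PE2(Y) = 6, matching the text; a smooth profile yields PER2 well above 1.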

3(a). which gives a reasonable measurement of sensitivity vs.. (4). They are monotonous decreasing pattern defined by linear function. (4) takes the form of a ratio. 2001) for significance analysis. Secondly. which are the SAM (Tusher. respectively. In the EDGE software. The dimension of splines is empirically optimized to be 4 in the simulation and in the yeast environment response experiments.2. 2001). 2005). (McNeil and Hanley. For each parameter setting. (5) is called PEM (Partial Energy ratio for Microarray). In the first step.noise ratio (SNR). They are slope based and signed area based. For the detail of the algorithm. and periodic pattern defined by sine function.. 2000) and . one can refer to the SAM manual available at the website: http://www-stat. There are two free parameters in our simulation test: the number of time points and the signal. The statistic defined in Eq. the signal component is one of the three frequently observed signal patterns in time course microarray data. a "relative difference" s0 is added to the denominator in Eq. the signs of the measurements are randomly flipped. and Ws and Wd are the smoothing kernel and the differential kernel defined in section 2. The noise component follows normal distribution with zero mean. peaked responsive pattern defined by Gaussian function.edu /--tibs/SAM/. 2001) and the spline based EDGE (Storey et al.1 Simulation In the simulation experiment. The evaluation is based on relative operating characteristic (ROC) score.127 where * is the convolution operator. and the signed area based SAM is an improved version of paired t-test. The missing values in the published datasets are filled in using KNN-Impute (Troyanskaya et al. Recent version of SAM provides two alternative approaches for the analysis of single time course data. 2. and is set to be 5 for polynomial spline to avoid singular matrix in calculation.stanford. in the second step. 
2.4. Significance analysis

Since the PEM statistic defined in Eq. (5) takes the form of a ratio, it can be easily incorporated into the SAM framework (Tusher et al., 2001) for significance analysis, in which the variance of the statistic is stabilized by introducing the "relative difference". The constant s_0 is chosen to be the 5th percentile of PE_2(X_i * W_d) over all the genes (i = 1 to m). This efficiently reduces the influence of channel bias and stabilizes the variance of the statistic; by introducing the relative difference, the genes with small fold-changes are excluded from the top-ranking list. With the PEM statistic, we employ the algorithm of SAM for determining the confidence interval and the false discovery rate (sometimes called the q-value). Here, we briefly describe our strategy of randomized permutation, which consists of two steps. In the first step, the order of the log-ratio measurements in the expression profile is randomly permutated for each gene; in the second step, the signs of the measurements are randomly flipped.

3. EXPERIMENTS

The robustness and generality of the proposed PEM method are evaluated with both a simulation dataset and published microarray datasets, which include the yeast environment response dataset (Gasch et al., 2000). In the simulation, each log-ratio expression profile is generated by summing its signal component and its noise component; for non-differentially expressed genes, the intensity of the signal component is constantly zero. In the yeast cell cycle experiment, the dimension of the splines is optimized to be 8 for the cubic spline, and is set to be 5 for the polynomial spline to avoid a singular matrix in the calculation.
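The ratio of Eq. (5) can be sketched as below. The kernel shapes `w_s` and `w_d` are illustrative placeholders rather than the kernels actually defined in Section 2.2, and `pem_scores` is a hypothetical name:

```python
import numpy as np

def pe2(y):
    # 2-order partial energy: drop the two largest-magnitude values
    y = np.asarray(y, dtype=float)
    keep = np.argsort(np.abs(y))[::-1][2:]
    return float(np.sum(y[keep] ** 2))

def pem_scores(X, w_s=None, w_d=None):
    # Eq. (5): smoothed signal energy over differentiated (noise-like)
    # energy, stabilized by the 5th-percentile constant s0.
    if w_s is None:
        w_s = np.array([0.25, 0.5, 0.25])  # assumed smoothing kernel
    if w_d is None:
        w_d = np.array([0.5, -0.5])        # assumed differential kernel
    num = np.array([pe2(np.convolve(x, w_s, mode='same')) for x in X])
    den = np.array([pe2(np.convolve(x, w_d, mode='same')) for x in X])
    s0 = np.percentile(den, 5)             # the "relative difference" constant
    return num / (den + s0)
```

With these placeholder kernels, a smooth ramp profile scores far higher than an i.i.d. noise profile, which is the intended behavior of the ratio.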
The sign flipping in the second step is valid because the distributions of the measurements of non-differentially expressed genes are assumed to be symmetric with zero mean. The published datasets in our evaluation also include the yeast cell cycle dataset (Spellman et al., 1998).
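The two-step null permutation (random time-order shuffling, then random sign flips, the latter valid under the symmetric zero-mean assumption) can be sketched as:

```python
import numpy as np

def null_permutation(X, rng=None):
    # Step 1: shuffle the time order of each profile (destroys smoothness).
    # Step 2: flip signs at random, valid because non-differentially
    # expressed log-ratios are assumed symmetric around zero.
    rng = np.random.default_rng() if rng is None else rng
    shuffled = np.array([rng.permutation(row) for row in X])
    signs = rng.choice([-1.0, 1.0], size=shuffled.shape)
    return shuffled * signs
```

Recomputing the PEM statistic on many such permuted matrices yields the null distribution from which the q-values are estimated.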

Of the 6000 artificial profiles, 5400 (90%) belong to non-differentially expressed genes and 600 (10%) belong to differentially expressed genes. The 600 profiles of differentially expressed genes are equally divided into 3 portions, corresponding to the monotonic decreasing pattern, the peaked responsive pattern, and the periodic pattern. First, we generate artificial datasets by setting the SNR to be 1.0 and the numbers of time points to be 5, 7, 10 and 15; the ROC scores are plotted against the number of time points in Fig. 3(b). Next, we fix the number of time points to be 10 and set the SNR to be 0.5, 1 and 2; the ROC scores are plotted against the SNR in Fig. 3(c).

The result of the simulation experiment demonstrates that PEM achieves the best overall performance among the methods in the evaluation. The signed area based SAM is the most robust when the number of time points is 5. However, as the number of time points increases or the SNR becomes larger, the PEM and EDGE approaches achieve much higher ROC scores than SAM. This is because the SAM approaches are modeled based on specific patterns, while the models underlying PEM and EDGE are more general.

Fig. 3. (a) Three basic patterns of gene expression profile defined in the artificial dataset. (b) ROC scores under variant numbers of time points (slope based SAM, signed area based SAM, cubic spline EDGE, polynomial spline EDGE, and PEM). (c) ROC score vs. the signal-to-noise ratio.

3.2. Evaluation with yeast environment response dataset

The yeast environment response dataset consists of the measurements in 173 arrays published in (Gasch et al., 2000) and (DeRisi et al., 1997). The dataset is used to discover the way in which budding yeast S. cerevisiae cells adapt to changing environments. Among the arrays available in the dataset, we selected 79 arrays based on two criteria: (i) populations from wild-type cells; (ii) at least 7 time points sampled under each condition. These arrays fall into 10 individual experiments:

• Heat shock from 25°C to 37°C, consisting of 8 time points.
• Hydrogen peroxide treatment, consisting of 10 time points.
• Menadione exposure, consisting of 9 time points.
• DTT exposure, consisting of 8 time points.
• Diamide treatment, consisting of 8 time points.
• Hyper-osmotic shock, consisting of 7 time points.
• Nitrogen source depletion, consisting of 10 time points.
• Diauxic shift, consisting of 7 time points.
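The artificial-profile generation of Section 3.1 can be sketched as follows. The exact pattern parameters (slope, Gaussian width, sine period) are assumptions, since they are not listed in the text:

```python
import numpy as np

def make_profiles(n_time=10, snr=1.0, n_null=5400, n_de=600, rng=None):
    # 90% null profiles (signal constantly zero) + 10% differentially
    # expressed profiles split equally among three signal patterns;
    # noise is standard normal, signals scaled to the target SNR.
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, n_time)
    patterns = [
        1.0 - 2.0 * t,                           # monotonic decreasing (linear)
        np.exp(-0.5 * ((t - 0.5) / 0.15) ** 2),  # peaked responsive (Gaussian)
        np.sin(2.0 * np.pi * t),                 # periodic (sine)
    ]
    signals = np.zeros((n_null + n_de, n_time))
    third = n_de // 3
    for i, p in enumerate(patterns):
        scaled = snr * p / p.std()               # set the signal-to-noise ratio
        signals[n_null + i * third : n_null + (i + 1) * third] = scaled
    labels = np.r_[np.zeros(n_null), np.ones(n_de)]
    return signals + rng.normal(size=signals.shape), labels
```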

The remaining arrays cover the stationary phase, including two nearly-identical experiments consisting of 10 and 12 time points. We assess the approaches by applying them to the 10 time course experiments individually. To evaluate the sensitivity and specificity of the methods, we use a list of 270 genes available at the website of Chen et al. (2003). This list is the intersection of (i) around 800 Environment Stress Response (ESR) genes, which were identified by Gasch et al. (2000) using hierarchical clustering on multiple experiments, and (ii) a list of ortholog genes in the fission yeast S. pombe which are differentially expressed under environment stress. Figure 4 shows that these evolutionarily conserved ESR genes are expressed with various expression patterns in different experiments, so that they provide a good testbed to evaluate the robustness and the generality of the methods.

The ROC scores for the methods are summarized in Table 1. The PEM method outperforms the other methods in 7 out of 10 experiments. It achieves reasonably good ROC scores (>0.7) in most experiments, except for the Menadione exposure experiment, in which all the methods do not perform well.

Table 1. ROC scores of the evaluated methods on the environment response experiments. The bold fonts correspond to the highest scores in the rows.

Fig. 4. Average expression patterns of ESR genes in variant experiments: (a) heat shock; (b) diauxic shift; (c) nitrogen depletion. Up-regulated and down-regulated genes are plotted separately.
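The ROC score used throughout the evaluation can be computed directly from the gene rankings and a reference gene list. The rank-based (Wilcoxon/AUC) formulation below is a standard equivalent, not code from the paper:

```python
import numpy as np

def roc_score(scores, is_reference):
    # Rank-based ROC (AUC): probability that a randomly chosen reference
    # gene outranks a randomly chosen non-reference gene (ties count 1/2).
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(is_reference, dtype=bool)
    pos, neg = scores[mask], scores[~mask]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A score of 1.0 means every reference gene is ranked above every other gene; 0.5 corresponds to a random ranking.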

To further show the superiority of PEM, we averaged the ROC scores over all experiments for each method, and used a paired t-test to compare the performance of PEM with that of the other methods. The p-values of the paired t-test demonstrate the significance of the improvement made by PEM.

3.3. Evaluation with yeast cell cycle dataset

The yeast cell cycle dataset (Spellman et al., 1998) consists of the measurements in three experiments (alpha factor, CDC15, CDC28) on cell cycle synchronized yeast S. cerevisiae cells. We employed a reference list containing 104 cell cycle regulated genes determined by traditional biological experiments. In addition to SAM and EDGE, we also include the method of Fourier transform (Spellman et al., 1998) in our evaluation; the Fourier transform (FT) method was introduced specifically for identifying periodically expressed genes.

The ROC scores are shown in Table 2. The PEM method outperforms the SAM approaches and the spline based EDGE approaches in all experiments. The FT method performs slightly better than PEM in identifying periodically expressed genes. However, the PEM method also identified a number of non-periodically expressed genes, which account for considerable false positives in calculating the ROC scores; note that the non-periodic portion of the differentially expressed genes is not significant with the Fourier transform approach, as mentioned in the original paper.

Table 2. ROC scores for the evaluation of the methods in identifying periodically expressed cell cycle regulated genes.

To show this, we clustered the top 706 differentially expressed genes identified by PEM in the alpha factor experiment; these genes are selected based on a fixed false discovery rate. We applied K-means clustering using Eisen's Cluster software (Eisen et al., 1998) and came up with eight clusters, as shown in Fig. 5. Five of the clusters are periodic and the remaining three are non-periodic. The non-periodic clusters are mapped to gene ontology clusters using GO Term Finder in the SGD database (http://db.yeastgenome.org/cgi-bin/GO/goTermFinder/). We selected four significant gene ontology terms corresponding to the non-periodic clusters, as listed in Table 3.

Fig. 5. The clustering result shows the periodic and non-periodic patterns of the differentially expressed genes identified by PEM in the alpha factor experiment.

Table 3. Selected significant gene ontology terms mapped to the non-periodic clusters. The GO terms and cluster IDs are retrieved from the SGD database.
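The clustering step can be approximated with a minimal K-means sketch in place of Eisen's Cluster software; the farthest-point initialization and Euclidean distance here are simplifying assumptions, not the settings of the original tool:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    # Farthest-point initialization followed by Lloyd's iterations.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])    # pick the point farthest from all centers
    centers = np.array(centers, dtype=float)
    labels = None
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)     # assign each profile to nearest center
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):           # leave empty clusters' centers in place
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```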

The Bonferroni corrected hypergeometric P-values show that these non-periodic clusters are biologically meaningful.

4. CONCLUSION AND DISCUSSION

Replications in time course microarray experiments are costly. In this paper, we propose a general statistical method, called PEM (Partial Energy ratio for Microarray), for identifying differentially expressed genes in time course cDNA microarray experiments without replicates. In the PEM method, we assume that gene expression varies smoothly in the time series. This assumption is comparatively weak, and hence the PEM method is more general in identifying genes expressed in unexpected patterns. To identify the differentially expressed genes, we employed convolution kernels in our statistic and introduced the concept of partial energy. The proposed statistic can be easily incorporated into the SAM framework for significance analysis, in which the variance of the statistic is stabilized by introducing the "relative difference".

Experimental results show the robustness and generality of PEM when the number of time points is comparatively large (>6). The evaluation with the yeast S. cerevisiae cell cycle dataset clearly indicates the ability of the PEM method to identify genes with either periodic or non-periodic patterns. In comparison to model-specific approaches like the Fourier transform, the PEM method is more general and leads to a better overview of the dynamics of gene expression changes. Another advantage of PEM is that its parameters can be fixed across different experiments, although automatic determination of the optimal parameters may slightly improve the performance.

In selecting methods for analysis without replicates, one usually has to make a tradeoff between robustness and generality. If the assumption underlying a method is too strong, the method may fail to identify the genes whose expression patterns do not fit the pre-defined model. The main limitation of the PEM method is that the assumption of signal smoothness may not be satisfied if the measurements are not adequately sampled. In this case, replication of the time series is necessary; one possible solution is then to integrate the PEM statistic and the ANOVA F-score using a permutation based strategy, which will be implemented and tested in the near future. We will also explore the possibility of modifying the PEM method for applications where replicates are available.

Acknowledgments

The authors thank Dr. Vladimir Andreevich Kuznetsov, Dr. Karuturi Radha Krishna, Mr. Juntao Li, and Mr. Vinsensius Berlian Vega for the valuable discussions on the topics related to this paper.

References
1. Bar-Joseph Z, Gerber G, Gifford D, Jaakkola T, and Simon I. A new approach to analyzing gene expression time series data. RECOMB 2002: 39-48.
2. Bar-Joseph Z, Gerber G, Simon I, Gifford D, and Jaakkola T. Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes. Proc. Natl Acad. Sci. USA 2003, 100: 10146-10151.
3. Benjamini Y and Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 1995, 57: 289-300.
4. Chen D, Toone WM, Mata J, Lyne R, Burns G, Kivinen K, Brazma A, Jones N, and Bahler J. Global transcriptional responses of fission yeast to environmental stress. Mol. Biol. Cell 2003, 14: 214-229.
5. DeRisi JL, Iyer VR, and Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278: 680-686.
6. Eisen MB, Spellman PT, Brown PO, and Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA 1998, 95: 14863-14868.
7. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, and Brown PO. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 2000, 11: 4241-4257.

8. Guo X, Qi H, Verfaillie CM, and Pan W. Statistical significance analysis of longitudinal gene expression data. Bioinformatics 2003, 19: 1628-1635.
9. Lonnstedt I and Speed TP. Replicated microarray data. Statistica Sinica 2002, 12: 31-46.
10. McNeil BJ and Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Med. Decis. Mak. 1984, 4: 137-150.
11. Park T, Yi S, Lee S, Lee SY, Yoo D, Ahn J, and Lee Y. Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics 2003, 19: 694-703.
12. Peddada SD, Lobenhofer EK, Li L, Afshari CA, Weinberg CR, and Umbach DM. Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 2003, 19: 834-841.
13. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004, 3: article 3.
14. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, and Futcher B. Comprehensive identification of cell-cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998, 9: 3273-3297.
15. Storey JD, Xiao W, Leek JT, Tompkins RG, and Davis RW. Significance analysis of time course microarray experiments. Proc. Natl Acad. Sci. USA 2005, 102: 12837-12842.
16. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, and Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17: 520-525.
17. Tusher V, Tibshirani R, and Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 2001, 98: 5116-5121.
18. Xu XL, Olson JM, and Zhao LP. A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model. Human Molecular Genetics 2002, 11: 1977-1985.
19. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, and Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002, 30: e15.

EFFICIENT GENERALIZED MATRIX APPROXIMATIONS FOR BIOMARKER DISCOVERY AND VISUALIZATION IN GENE EXPRESSION DATA

Wenyuan Li, Yanxiong Peng, Hung-Chung Huang and Ying Liu*

Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, U.S.A.

*Corresponding author. Email: ying.liu@utdallas.edu

In most real-life gene expression data sets, there are often multiple sample classes with ordinals, which are categorized into the normal or diseased type. The traditional feature or attribute selection methods consider the multiple classes equally, without paying attention to the up/down regulation across the normal and diseased types of classes, while the specific gene selection methods particularly consider the differential expressions across the normal and diseased but ignore the existence of multiple classes. In this paper, we propose to make the best use of these two aspects, the differential expressions (which can be viewed as the domain knowledge of gene expression data) and the multiple classes (which can be viewed as a kind of data set characteristic), for improving the biomarker discovery. We simultaneously take both aspects into account by employing the 1-rank generalized matrix approximations (GMA). Based on the GMA mechanism, we further propose an algorithm for obtaining a compact biomarker by reducing the redundancy. Our results show that the consideration of both aspects can not only improve the accuracy of classifying the samples, but also provide a visualization method to effectively analyze the gene expression data on both genes and samples.

1. INTRODUCTION

With the rapid advances of microarray technologies, massive amounts of gene expression data are generated in experiments. Analysis of these high-throughput data poses both opportunities and challenges to biologists, statisticians, and computer scientists. One of the most important features of microarray data is its very high dimensionality with a small number of samples: there are over thousands of genes and at most several hundreds of samples in a data set. Such a characteristic, which has never existed in any other type of data, has made the traditional data mining and analysis methods not effective. Therefore, a crucial approach is to select a small portion of informative genes for further analysis, such as disease classification and the discovery of the structure of the genetic network 18. Due to the drastic size difference between genes and samples, the step of gene selection is also needed to solve the well-known "curse of dimensionality" problem in statistics, data mining and machine learning 5, and it has therefore attracted the focus of recent research. Moreover, the samples are categorized into the normal or diseased type, e.g., healthy/diseased, and there are often multiple sample classes with ordinals. These two characteristics distinguish the task of discovering the "biomarker" from the common feature selection tasks, quite different from the traditional feature selection in other data sets such as text 22. The final goal of gene selection is to discover the "biomarker", a minimal subset of genes that contains the most relevant genes without redundancy.

Recent gene selection methods fall into two categories: filter methods and wrapper methods 18. The filter methods analyze the data by investigating their domain-specific targets: (1) the differential expression across classes, and (2) the redundancies induced by the relevant genes. Most existing filter methods follow the methodologies of statistics 9 and information theory 4, 23, 18 to rank the genes and reduce the redundancy, such as t-like-statistics and mutual information or information gain based methods. The basic goals of these filter methods are to obtain a subset of genes with maximum relevance and minimum redundancy 9, 23, 4. These methods are computationally efficient; they are independent of the sample classification and are efficient in analyzing the functions of genes. The wrapper methods 3, in contrast, are closely "embedded" in the classifier and are thus often time-consuming.

i. Moreover. it is a reinforcement mechanism simulating the resonance phenomenon. The GMA simultaneously takes into account the global between-class data distribution (differentially expression) and local withinclass data distribution (collection of low or high values). GMA-1 provides the simultaneous ranking of genes and samples c . the labels of multiple classes show the ordinal scales according to the degree of their membership to the positive or negative type. GMA-1 is quite efficient. and therefore may lose the accuracy of discovering the biomarker with maximal relevance and minimal redundancy. where top genes are differentially expressed across sets are available at http://www. One of the efficient techniques for getting the 1-rank matrix is to employ the discrete dynamical system to quickly converge to the local optima. patients who show the early symptoms. They are normal ones. positive and negative. . Therefore. through the low-rank matrix approximation. the particular trends or the meaningful dimensions of the high-dimensional data implicate that the overall structure inherent can be easily discovered. Due to the quick convergence and efficient matrix-vector multiplications. most existing filtering methods e. information gain and t-statistics. 1:4 or 3:4. We generalized it as a novel discrete dynamical system.gov/nctr/science/ In MAQC project. b T h e 1-rank matrix is a matrix whose rank is 1. but ignore the special characteristic of gene selection.. most general feature selection methods. in this paper. four classes of persons are considered.. and recent greedy matrix approximation for machine learning 17 reveal this implication.2). consider the multiple classes. For example.. e. Also some gene expression experiments a consider classes of samples that are the composite of normal and disease ingredients with different scale. c T h e samples are ranked within each class. 
The GMA follows the framework of the traditional 1-rank matrix approximation in linear algebra and generalizes it by partitioning the matrix. However.g. e.134 while the sample classes in the observed experiments are often ordinal with the gradually changing tendency 3 . we propose a class of 1-rank Generalized Matrix Approximation (GMA) b filter method to simultaneously rank the genes and samples to identify the biomarker in the data sets with multiple classes. relatives of patient. 1-rank matrix approximation is essential for analyzing the highdimensional data 12. In nature. in the Lupus experiment (see Subsection 5. it is like a black-box screening the user out of the analysis a process. As a filter method. Such analysis may ignore the characteristics of the expression data within each single class. it is a wrapper method by using the leave-one-out error and forward selection and therefore is not efficient.e. We followed the framework of the resonance model introduced in our previous work of visually analyzing the high-dimensional data 12 . combine all classes in positive type into a positive class and similarly combine all classes in negative type into a negative class. we can visually observe the overall distribution (see Fig. such as Gaussian process model based method 3 .4) of the values. 'normal'. Among these techniques. ReliefF 10 .g. i. the description of the data centers/toxicoinformatics/maqc. There have been few works in the wrapper methods on investigating the biomarker on these data sets.. By rearranging the gene expression matrix with GMA-1 rankings. however. and then do the filtering process on the two combined classes.. when dealing with the data sets with such multiple classes and two types. up and down regulations.g. On the other hand. n ' 13 . This is the second "blessing of dimensionality" stated by Donoho 5 . and patients whose symptoms are complete. As pointed out by Achlioptas and McSherry 1.. In these gene expression data.fda. 
a particular overall structure of the matrix can be approximated and observed by the tendency of x y T . 'intermediate-grade' tumor and 'high-grade' tumor 16 . Therefore. GMA-1. Latent semantic indexing 14 to understand text data.g. However. Therefore. e. It is formally expressed as a multiplication x y T of two vectors x and y .e. the success of HITS u and PageRank 13 algorithms to understand the huge WWW graph adjacency matrix. they are not specific to the task of gene selection as well. the original and intuitive objective of biomarker discovery is that the user can visually select the differentially expressed genes without redundancy. although there are two types of classes. which is particularly designed for approximating the gene expression matrix with the multiple classes. which has been widely used and studied 12> 20> 7 ' n . According to this objective. 'low-grade' tumor.
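The 1-rank approximation x y^T referred to above can be computed with simple alternating power-iteration updates. This is a generic numerical sketch of the idea, not the paper's GMA-1 (which additionally partitions the matrix by class); `rank_one_approx` is an illustrative name:

```python
import numpy as np

def rank_one_approx(W, n_iter=200):
    # Alternating (power-iteration style) updates converge to the dominant
    # singular pair of W; sigma * x y^T is then the best 1-rank
    # approximation of W in the least-squares sense.
    rng = np.random.default_rng(0)
    y = rng.random(W.shape[1])
    for _ in range(n_iter):
        x = W @ y
        x /= np.linalg.norm(x)
        y = W.T @ x
        y /= np.linalg.norm(y)
    sigma = x @ W @ y
    return sigma * np.outer(x, y)
```

For a non-negative matrix, the dominant singular vectors are themselves non-negative, which is what makes their entries usable as rankings of rows and columns.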

1(d). BASIC RESONANCE MODEL FOR APPROXIMATING MATRIX In this section. We followed the idea of Jaeger et al. Wi\p\)Likewise. We defined this function abstractly to support different measures of resonance strength.. Therefore. is explained by two theorems. 1(d) to approximate the real matrix in Fig. • •. such that when an appropriate response function r is applied. x n ) and y = ( j / i . to make sure W is a non-negative matrix.} who resonated with o when r was applied. 1(c). • •. This phenomenon implies that. if two objects of the same 'natural frequency' resonate. one existing measure to compare two terrains is the well-known rearrangement inequality theorem. o will resonate to elicit those objects {o^.• R. y) = ICiLi xiVi *s maximized when the two positive sequences x = ( x i . In the context of the weighted bipartite graph G = (0. denoted as r o T w W. 9 by using the representative of the dense cluster in the gene correlation matrix of the biomarker to reduce the redundancy. we scale the values of W to t h e range of [0. The target is to rearrange the matrix for collecting the large values to the left-top corner of the sorted matrix (called "mountain").r| d where O and T are two subsets of vertices.. Furthermore. It simulates the resonance phenomenon by introducing a forcing object 5. Wi2. GMA-1 can satisfy the biomarker discovery. 2.The components of the graph G are clearly shown in Fig. The evaluation of resonance strength between objects Oi and Oj is given by the response function r(oj.E.. GMA-2 can be employed to remove redundant genes. from a yeast gene correlation data 19 in Fig. . while leaving the small values to the right-bottom corner (called "plains"). 1S2. In this way. i. GMA-2 is particularly customized for finding clusters with the fixed density. the data terrain (showing where the "mountains" and "plains" are) can be used to visually analyze the high-dimensional data sets. . 
where min and max can be the minimum and maximum of each rows or of the whole matrix. we can get Fig. In d I n the gene expression data. As observed and proved in GMA-2. a generalized matrix approximation for the task of biomarker discovery shall be introduced in the next section. the sorted matrix according to the ranking of both genes and samples can be visually shown to the user for further analysis. Process of Resonance Model This resonance model has been introduced in 12 for visually analyzing the high-dimensional data set. Therefore. we firstly introduce the resonance model for the purpose of revealing the terrain of the high-dimensional data set. if the user needs to refine the biomarker for obtaining the compact biomarker. } c O. In nature..1] by —^-^—:—. the resonance model is an iterative reinforcement learning process of the matrix. it is able to find clusters with fixed density. yn) are or- 2.W) and W = (wy) |c >|x|.135 classes and top samples are important to the class. wj^|). . Through the iterative reinforcement process. . the dynamic 'frequency' of the forcing object 5 is defined as 5 = (w\. W is supposed to be a matrix whose range is in [0.Oj) : 1 " x R" . the underlying dominant terrain and structure is revealed. the resonance model indirectly does the work of the 1-rank matrix approximation by using the matrix in Fig.2(a). This "natural frequency" represents the makeup of both o and the objects {oi.1.1] if we do not mention. . the "frequency" of the forcing object 5 and the ranking values of the objects Oi G O are updated and converged until 5 is similar to those objects with the largest ranking values.. Through this matrix approximation process. . a real-world example. Simply put. the 1-rank matrix approximation. whose "natural frequency" is similar to o. we multiplied the ranking value vectors of rows and columns. In this way. In the rest of the paper. Fig. . Different from a general clustering algorithm used by Jaeger. 
The static "natural frequency" of each object o_i ∈ O is its row vector ō_i = (w_i1, ..., w_i|F|), while the weights of the j-th frequency are given in the column vector f_j = (w_1j, w_2j, ..., w_|O|j). To find the "mountains" and "plains", the forcing object b evaluates the resonance strength of every object o_i against itself to locate a "best fit" based on the contour of its terrain. By running this iteratively, those objects that resonate with b are discovered and placed together to form the "mountains" within the 2-dimensional matrix W; likewise, the "plains" are discovered by combining those objects that resonate only weakly with b. This iterative learning process between b and O is outlined below.

Initialization. Set up b with a uniform distribution, b = (1, ..., 1), normalize it as b = norm(b), and record this as b^(0) = b; then let k = 0. Here norm(x) = x/||x||_2, where ||x||_2 = (Σ_{i=1}^n x_i^2)^{1/2} is the 2-norm of the vector x = (x_1, ..., x_n).

Apply Response Function. For each object o_i ∈ O, compute the resonance strength r(b, o_i), store the results in the vector r = (r(b, o_1), ..., r(b, o_|O|)), and then normalize it: r = norm(r).

Adjust Forcing Object. Using r from the previous step, adjust the terrain of b for all o_i ∈ O (i.e., forcing resonance). To do this, we define the adjustment function c(r, f_j): R^|O| × R^|O| → R, which can be materialized using the inner product c(r, f_j) = r · f_j = Σ_i w_ij r_i. For each frequency f_j, c(r, f_j) integrates the weights from f_j into b by evaluating the resonance strengths recorded in r, i.e., b_j = c(r, f_j). Finally, we compute b = norm(b) and record it as b^(k+1) = b.

Test Convergence. Compare b^(k+1) against b^(k). If the result converges, go to the next step; else let k = k + 1 and apply b on O again.

Matrix Rearrangement. Sort the objects o_i ∈ O by the coordinates of r in descending order, and sort the frequencies f_j ∈ F by the coordinates of b in descending order.

To clearly state the whole process above, we further express it in the following formulas:

r^(k+1) = norm(r(W b^(k)))    (1)
b^(k+1) = norm(c(W^T r^(k+1)))    (2)

Since two objects that resonate strongly with the same forcing object should have a similar terrain, the response function must characterize the similarity of the terrains of two objects, and the linear response I(x, y) = Σ_{i=1}^n x_i y_i is a suitable candidate. Notice that if two vectors maximizing I(x, y) are put together to form M = [x, y] (in MATLAB format), I(x, y) is maximized when the two vectors are ordered in the same way (i.e., x_1 ≥ x_2 ≥ ... ≥ x_n and y_1 ≥ y_2 ≥ ... ≥ y_n) and is minimized when they are ordered in the opposite way (i.e., x_1 ≥ x_2 ≥ ... ≥ x_n and y_1 ≤ y_2 ≤ ... ≤ y_n). E(x, y) = exp(Σ_{i=1}^n x_i y_i) is also an effective response function, with the effect of magnifying the roles of the "mountains".

To illustrate how the matrix is sorted, consider a small example matrix with 4 rows and 5 columns. After running the resonance model, we obtain the terrain with the "mountain" on the left side and the "plain" on the right side; such "mountains" and "plains" help analyze the data in both rows and columns. Fig. 1(a) and (b) clearly indicate how this process can be used to visually analyze the matrix by comparing it before and after the rearranging process. For a real-life example, we take a yeast gene expression data set 19 and compute the symmetric gene correlation matrix S by the Pearson correlation measure. Through this explanation of the basic mechanism for approximating the matrix, the underlying rationale of the resonance model is that the "frequency" vector b of the forcing object and the ranking value vector r of the object set can approximate the matrix W by the matrix r b^T.

Fig. 1. Matrix approximation by the basic linear resonance model (r = c = I): (a) 3D matrix before sorting; (b) 3D matrix after sorting; (c) symmetric matrix S sorted by r* and b*; (d) approximation matrix r*b*^T of S.
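With the linear response and adjustment functions, Eqns. (1)-(2) reduce to alternating normalized matrix-vector products, essentially power iteration toward the leading singular vectors of W. The following is a minimal sketch in Python/NumPy (the paper's own experiments used MATLAB); the function name `resonance_sort` and the 4x5 toy matrix are our own illustration, not the paper's example.

```python
import numpy as np

def resonance_sort(W, iters=100, tol=1e-10):
    """Basic linear resonance model (r = c = identity), Eqns (1)-(2):
    alternately update r = norm(W b) and b = norm(W^T r) until b converges,
    then sort rows by r and columns by b in descending order."""
    n_rows, n_cols = W.shape
    b = np.ones(n_cols) / np.sqrt(n_cols)      # uniform forcing object, normalized
    for _ in range(iters):
        r = W @ b
        r /= np.linalg.norm(r)                 # Eqn (1)
        b_new = W.T @ r
        b_new /= np.linalg.norm(b_new)         # Eqn (2)
        if np.linalg.norm(b_new - b) < tol:    # convergence test
            b = b_new
            break
        b = b_new
    row_order = np.argsort(-r)                 # objects sorted by resonance strength
    col_order = np.argsort(-b)                 # frequencies sorted by b
    return W[np.ix_(row_order, col_order)], r, b

# toy 4x5 matrix: rows 0 and 2 form a "mountain" on columns 0-2
W = np.array([[5., 5., 4., 0., 0.],
              [0., 1., 0., 1., 0.],
              [4., 5., 5., 0., 1.],
              [1., 0., 0., 0., 1.]])
W_sorted, r, b = resonance_sort(W)
```

After sorting, the high values cluster in the left-top corner of `W_sorted`, matching the "mountain" of Fig. 1(b).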

Running the basic model on the yeast example, we obtained the converged r* and b*, and sorted the objects o_i ∈ O and the frequencies f_j ∈ F accordingly; certainly, the rows and columns of the matrix S are rearranged with the same orders of o_i and f_j. The sorted S is shown in Fig. 1(c). We also draw its corresponding rank-1 approximation matrix r*b*^T, with r* and b* in decreasing order, in Fig. 1(d). This example in Fig. 1(c) and (d) illustrates two observations: (1) the function of the resonance model is to collect the large values into the left-top corner of the rearranged matrix and leave the small values to the right-bottom corner; (2) the underlying rationale is to employ the rank-1 matrix r*b*^T to approximate S, so it is essential that the value distribution of r*b*^T determines how the values of the sorted S are distributed.

3. TWO GENERALIZED MATRIX APPROXIMATIONS BY EXTENDING THE RESONANCE MODEL FOR GENE SELECTION

In this section, we extend and generalize the basic mechanism of the resonance model in Section 2 for the purpose of gene selection in two aspects. The first is to rank genes and samples for selecting the differentially expressed genes Q = {g_1, ..., g_k}. The second is to discover the very dense clusters in the correlation matrix computed from Q, and to remove the redundant genes in Q by selecting only one or two representative genes from each dense cluster. To meet these two goals, we particularly designed two extended resonance models, called GMA-1 and GMA-2; they are, in fact, two generalized matrix approximation methods based on the basic resonance model.

Fig. 2. The resonance models for approximating the matrix for different purposes: (a) basic resonance model, collecting the high values into the left-top corner; (b) GMA-1 (extended resonance model 1), simultaneously collecting high/low values into the left-top corners of the k classes or submatrices W_i^- and W_i^+; (c) GMA-2 (extended resonance model 2), collecting the extremely high similarity/correlation values into the left-top corner to form a dense cluster.

3.1. GMA-1 for Ranking Differentially Expressed Genes

Consider the general case of gene expression data. Without loss of generality, suppose the data set consists of m genes and n samples with k classes, whose numbers of samples are n_1, ..., n_k respectively, with n_1 + ... + n_k = n. We suppose the first k_- classes are negative and the following k_+ classes are positive, where k_- + k_+ = k. The structure of the general gene-sample matrix W_{m×n} = [W_1^-, ..., W_{k_-}^-, W_1^+, ..., W_{k_+}^+] is shown with submatrix blocks in Fig. 3(a).

Because the target of analyzing differentially expressed genes is to find up-regulated or down-regulated genes between the negative and positive sample classes, the basic resonance model should be changed, from collecting the high values into the left-top corner of W, to: (1) collecting the low values of each W_i^- into its left-top corner and simultaneously collecting the high values of each W_i^+ into its left-top corner; (2) controlling the differences of the left-top corners between the negative classes W_i^- and the positive classes W_i^+. An example of such a matrix approximation is illustrated in Fig. 2(b). Therefore, we extended the basic resonance model according to this task as follows.

(1) Transformation of W: before doing GMA-1, we need to transform the original gene-sample matrix W to W'. In the case of finding up-regulated differentially expressed genes, since we need to collect the low values of W_i^- into the left-top corner, we need to reverse the values of W_i^- so that low values become high and vice versa; we do the transformation by W_i'^- = 1 - W_i^-. Similarly, we can transform W by W_i'^+ = 1 - W_i^+ in the case of finding down-regulated differentially expressed genes. We can also use other reverse functions instead of the simple 1 - x function used in Fig. 3. In this way, collecting the high values of every W_i'^- and W_i'^+ into their own left-top corners naturally leads to collecting the low values of W_i^- and the high values of W_i^+ into the left-top corners.

Fig. 3. Transformation of the matrix W: (a) original matrix W = [W_1^-, ..., W_{k_-}^-, W_1^+, ..., W_{k_+}^+], with the submatrix blocks of the negative and positive classes; (b) transformed matrix W', which has the same structure of submatrix blocks but with different submatrices W_i'^- and W_i'^+ (up regulation: W_i'^- = 1 - W_i^-, W_i'^+ = W_i^+; down regulation: W_i'^- = W_i^-, W_i'^+ = 1 - W_i^+).

(2) The k partitions of the forcing object b: an implicit requirement in the first goal is that the relative order of each class (submatrix W_i^- or W_i^+) should be kept the same after doing GMA-1 and sorting W'. In other words, although we can change the order of the columns (samples) within W_i^- or W_i^+, it is required that all columns of the submatrix W_i^+ appear after all columns of W_i^-. To satisfy this requirement, we partition the original forcing object's frequency vector b into k parts corresponding to the k classes or submatrices, each of which is normalized independently. Formally, b = (b_1, ..., b_k)^T, where each b_i corresponds to a sample class and the concatenation of the k = k_- + k_+ vectors is expressed in MATLAB format. This is an essential step to meet the first goal aforementioned.

(3) The factor α for controlling the differentiation between the negative and positive classes: the frequency vector b is divided into k = k_- + k_+ parts. By magnifying the resonance strength vectors r_i^+ = norm(W_i'^+ b_i) of the k_+ positive classes, or minifying the resonance strength vectors r_i^- = norm(W_i'^- b_i) of the k_- negative classes, we can control the differentiation between the negative and positive classes. Specifically, we separately normalize each b_i and then sum their resonance strength vectors together with the factor α:

r = norm(r_1^- + ... + r_{k_-}^- + α r_1^+ + ... + α r_{k_+}^+)    (3)

where α ≥ 1 is a scaling factor multiplied with the normalized resonance strength vectors of the positive classes. As α increases, the proportions of the positive classes in the resonance strength vector r increase, and thus result in increasingly large differences in the left-top corners between the positive and negative classes. To get a suitable differential contrast between the two types of classes, the user can tune α.
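The class-wise reversal of Fig. 3 can be sketched as follows. This is a hypothetical helper in Python/NumPy: `transform_matrix`, its arguments, and the toy class sizes are our own naming, not the authors' code.

```python
import numpy as np

def transform_matrix(W, class_sizes, k_neg, regulation="up"):
    """Build W' from W (values in [0,1]) per the Fig. 3 transformation:
    for up regulation, reverse the negative-class blocks (W_i'^- = 1 - W_i^-);
    for down regulation, reverse the positive-class blocks instead."""
    Wp = W.astype(float).copy()
    bounds = np.cumsum([0] + list(class_sizes))     # column ranges of the k class blocks
    for i in range(len(class_sizes)):
        lo, hi = bounds[i], bounds[i + 1]
        is_negative = i < k_neg
        if (regulation == "up" and is_negative) or (regulation == "down" and not is_negative):
            Wp[:, lo:hi] = 1.0 - Wp[:, lo:hi]       # the simple 1 - x reverse function
    return Wp

# toy example: two negative classes (2 and 3 samples) and one positive class (2 samples)
rng = np.random.default_rng(42)
W = rng.random((5, 7))
Wp = transform_matrix(W, [2, 3, 2], k_neg=2, regulation="up")
```

With `regulation="up"`, only the first two class blocks (columns 0-4) are reversed, so the low raw values of the negative classes become high and are collected together with the high values of the positive class.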

To summarize the above changes of the resonance model, we draw the architecture of GMA-1 in Fig. 2(b) and express its process in the following formulas:

r_i^-(k+1) = norm(W_i'^- b_i^-(k)),  i = 1, ..., k_-
r_i^+(k+1) = norm(W_i'^+ b_i^+(k)),  i = 1, ..., k_+
r^(k+1) = norm(Σ_{i=1}^{k_-} r_i^-(k+1) + α Σ_{i=1}^{k_+} r_i^+(k+1))    (4)
b_i^-(k+1) = norm((W_i'^-)^T r^(k+1)),  i = 1, ..., k_-
b_i^+(k+1) = norm((W_i'^+)^T r^(k+1)),  i = 1, ..., k_+

where r_i^-, r^(k+1) ∈ R^{m×1} and b_i^- ∈ R^{n_i×1}. Comparing Eqn. (1) and (2) with Eqn. (4), the two equations of the basic resonance model are expanded to (2k + 1) equations in GMA-1, because we partitioned the matrix W' into k submatrix blocks and divided the frequency vector b into k subvectors. We formally summarize the method as Algorithm 3.1.

Algorithm 3.1 (GMA-1): Biomarker Discovery.
Input: (1) W_{m×n}, expression matrix from the m-gene set G and the n-sample set S; (2) (n_1, ..., n_k)^T, sizes of the k sample classes with the submatrix structure as in Fig. 3(a); (3) (k_-, k_+)^T, numbers of negative and positive classes; (4) regulation option, down or up; (5) α, differentiation factor.
Output: (1) (g_1, ..., g_m), ranking sequence of the m genes; (2) (s_1, ..., s_n), ranking sequence of the n samples.
1: preprocess W so that the values of W are in [0, 1];
2: transform W to W' according to the formulas in Fig. 3(b), with the knowledge of the matrix structure given by (n_1, ..., n_k)^T and the regulation option;
3: iteratively run the equations in Eqn. (4) to obtain the converged r* and b_i* (i = 1, ..., k);
4: sort r* in decreasing order to get the ranking gene sequence (g_1, ..., g_m);
5: sort each b_i* in decreasing order to get the sorted sample sequence. {comment: Because the positions of all sample classes in W' keep unchanged, as shown in Fig. 3(a), each sorting of b_i* can only change the order of the samples within the i-th sample class.}

In practice, GMA-1 converges quickly. Considering that GMA-1 is a generalized resonance model obtained by partitioning the matrix into k submatrices, its computational complexity is the same as that of the resonance model on the whole matrix, i.e., O(mn). A real-life example of the overall process of Algorithm GMA-1 is visually shown in Fig. 4.

3.2. GMA-2 for Reducing Redundancy by Finding Dense Clusters

It has been recognized that the top-ranked genes may not be the minimum subset of genes for biomarker discovery and classification 9, 4, 23, because there are correlations among the top-ranked genes, which induces the problem of reducing "redundancy" in the top-ranked gene subsets. One effective strategy is to take the gene-to-gene correlation into account and remove redundant genes through pairwise correlation analysis among genes 9, 4, 21. In this section, we propose to use GMA-2, a special instance of the basic resonance model, to reduce the redundancy of the top-ranked genes selected by GMA-1.

GMA-2 is a clustering method for finding high-density clusters. Given a pairwise correlation or similarity matrix of a set of genes^g, GMA-2 outputs the largest cluster with a fixed density. The underlying rationale is that "members of a very homogeneous and dense cluster are highly correlated and with more redundancy, while a heterogeneous and loose cluster means bigger variety in genes". Then we can simply select one or more representative genes from each cluster and therefore reduce the redundancy. To find more clusters with the fixed density, GMA-2 can be iteratively run on the remaining matrix by removing the rows and columns of the genes in the clusters already found. Although similar work has been done by Jaeger et al. 9, the authors used a fuzzy clustering algorithm, which is not a suitable algorithm for controlling the density of the clusters. Compared with the fuzzy clustering algorithm, GMA-2 can not only find clusters with different densities, but also provide, for each gene, the membership degree for a cluster.

^g In our context, this set of genes is the top-ranked m' genes selected by GMA-1.
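The (2k + 1)-equation iteration of Eqn. (4) might be sketched as below in Python/NumPy; `gma1` and the toy matrix are illustrative assumptions, not the authors' code, and the matrix is taken to be already transformed per Fig. 3.

```python
import numpy as np

def norm(x):
    return x / np.linalg.norm(x)

def gma1(Wp, class_sizes, k_neg, alpha=2.0, iters=200, tol=1e-10):
    """Sketch of the GMA-1 iteration (Eqn (4)) on a transformed matrix W':
    per-class forcing vectors b_i are updated independently, and the global
    gene ranking vector r sums the class resonance strengths, scaling the
    positive classes by alpha."""
    m = Wp.shape[0]
    bounds = np.cumsum([0] + list(class_sizes))
    blocks = [Wp[:, bounds[i]:bounds[i + 1]] for i in range(len(class_sizes))]
    bs = [norm(np.ones(B.shape[1])) for B in blocks]   # uniform b_i^(0)
    r = np.zeros(m)
    for _ in range(iters):
        rs = [norm(B @ b) for B, b in zip(blocks, bs)]  # r_i = norm(W_i' b_i)
        weights = [1.0 if i < k_neg else alpha for i in range(len(blocks))]
        r_new = norm(sum(w * ri for w, ri in zip(weights, rs)))
        bs = [norm(B.T @ r_new) for B in blocks]        # b_i = norm((W_i')^T r)
        if np.linalg.norm(r_new - r) < tol:
            r = r_new
            break
        r = r_new
    gene_rank = np.argsort(-r)                          # genes in decreasing resonance
    sample_orders = [np.argsort(-b) for b in bs]        # within-class sample orders
    return gene_rank, sample_orders, r

# toy: 6 genes, one negative class (3 samples) + one positive class (3 samples);
# gene 0 has uniformly high transformed values, so it should rank first
Wp = np.array([[0.9, 0.8, 0.9, 0.9, 0.95, 0.9],
               [0.5, 0.4, 0.5, 0.5, 0.5, 0.4],
               [0.2, 0.1, 0.2, 0.3, 0.2, 0.1],
               [0.5, 0.5, 0.6, 0.4, 0.5, 0.5],
               [0.1, 0.2, 0.1, 0.2, 0.1, 0.2],
               [0.4, 0.3, 0.4, 0.3, 0.4, 0.3]])
gene_rank, sample_orders, r = gma1(Wp, [3, 3], k_neg=1, alpha=2.0)
```

Because each b_i is sorted separately, the relative order of the class blocks is preserved, which is exactly requirement (2) above.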

Unlike GMA-1, which is a generalization of the basic resonance model, GMA-2 is actually a special instance of it. Observing Fig. 1(c) and (d), the linear basic resonance model is able to collect the high values of a symmetric matrix into the left-top corner of the sorted matrix; this means that it can approximate a high-density cluster. Therefore, we customized the basic resonance model to find dense clusters by setting the response and adjustment functions to be I or E. When r = c = I, we call this linear resonance model RML; when r = c = E, the non-linear resonance model is called RME. The overall architecture of RML and RME is illustrated in Fig. 2(c). With these settings and S = S^T, the two equations of the basic resonance model (i.e., Eqn. (1) and (2)) can be combined by removing b, so that RML and RME are represented by Eqn. (5) and Eqn. (6) respectively:

r^(k+1) = norm(S r^(k))    (5)
r^(k+1) = norm(E(S r^(k)))    (6)

A theoretical analysis is given in the following to show how RML works. Given a non-negative gene correlation matrix S = (s_ij)_{n×n} ∈ R^{n×n}, a non-negative membership vector x = (x_1, ..., x_n)^T ∈ {0,1}^{n×1} is supposed to indicate the membership of each gene in the dense and largest cluster. Then

D(x) = Σ_{i=1}^n Σ_{j=1}^n s_ij x_i x_j = x^T S x    (7)

means the density of the cluster formed by those genes whose corresponding x_i is 1. When the values of x are restricted to 0 or 1, maximizing D(x) is equivalent to the problem of finding the densest subgraph^h, which is known to be NP-hard 6. A typical strategy in approximation algorithms is to relax the integer constraints (i.e., x taking the binary values 0 or 1) to continuous real numbers. Hence we let x ∈ [0,1]^{n×1} and normalize it so that ||x||_2 = (Σ_{i=1}^n x_i^2)^{1/2} = 1; in this way, the membership degree x changes from a binary number to a continuous number. According to matrix computation theory 8, we have the following theorem.

Theorem 3.1 (Rayleigh-Ritz). Let S ∈ R^{n×n} be a real symmetric matrix and λ_max(S) be the largest eigenvalue of S. Then

λ_max(S) = max_{x ∈ R^n, ||x||_2 = 1} x^T S x    (8)

and the eigenvector x* corresponding to λ_max(S) is the solution on which the maximum is attained.

Theorem 3.1 indicates that the first eigenvector x* of S is the solution of the relaxed D(x) and therefore reveals a dense cluster. According to linear algebra, the iterative running of Eqn. (5) in RML leads to the convergence of r to the first eigenvector of S, i.e., r* = x*. Therefore, RML can reveal the dense cluster. In practice, we found that the non-linear resonance model RME works better than the linear RML, by using the exponential function to magnify the roles of the high values in the dense cluster. Therefore, based on RME, GMA-2 is formally stated in Algorithm 3.2.

Algorithm 3.2 (GMA-2): Find a δ-Dense Cluster.
Input: (1) S_{n×n}, a non-negative gene correlation matrix from a set of n genes G; (2) δ, a fixed density threshold.
Output: G' = {g_1, ..., g_k} ⊆ G, a sequence of k genes which forms a dense cluster.
1: run RME on S to get the converged r*;
2: sort r* in decreasing order and get the sequence of genes (g_1, ..., g_n) according to this order;
3: set G_2 = {g_1, g_2} and k = 2;
4: while D(S(G_k)) ≥ δ do
5:   k = k + 1;
6:   set G_k = {g_1, ..., g_k} as the top k genes;
7: end while
8: if there is no k satisfying D(S(G_k)) ≥ δ then
9:   return ∅;
10: end if
11: return G_{k-1}.

^h Considering a non-negative symmetric matrix as the adjacency matrix of an undirected weighted graph, a dense cluster becomes a dense subgraph in this graph.

4. ALGORITHM FOR COMPACT BIOMARKER DISCOVERY

In some cases, the user is more interested in the biomarker with the minimal number of genes that can classify the samples. Hence, in this section, we discover the compact biomarker by combining GMA-1 and GMA-2 in a method called CBioMarker, outlined in Algorithm 4.1.
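Eqn. (5) is exactly power iteration on the symmetric non-negative matrix S, so the claim of Theorem 3.1 — that r converges to the leading eigenvector x* — can be checked numerically. A small sketch in Python/NumPy; the 5x5 correlation matrix is a made-up illustration in which genes 0-2 form the dense cluster.

```python
import numpy as np

def rml(S, iters=500, tol=1e-12):
    """Linear resonance model RML, Eqn (5): r <- norm(S r).  For a non-negative
    symmetric S this is power iteration, converging to the eigenvector of the
    largest eigenvalue, i.e., the maximizer of the relaxed density x^T S x (Eqn (8))."""
    r = np.ones(S.shape[0]) / np.sqrt(S.shape[0])
    for _ in range(iters):
        r_new = S @ r
        r_new /= np.linalg.norm(r_new)
        if np.linalg.norm(r_new - r) < tol:
            return r_new
        r = r_new
    return r

# block-structured correlation matrix: genes 0-2 are mutually highly correlated
S = np.array([[1.0, 0.9, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.9, 0.1, 0.2],
              [0.9, 0.9, 1.0, 0.2, 0.1],
              [0.1, 0.1, 0.2, 1.0, 0.3],
              [0.1, 0.2, 0.1, 0.3, 1.0]])
r_star = rml(S)

# r* agrees (up to sign) with the leading eigenvector from a direct eigensolver
evals, evecs = np.linalg.eigh(S)
x_star = evecs[:, -1] * np.sign(evecs[:, -1].sum())
```

Sorting `r_star` in decreasing order puts the three cluster genes first, which is exactly step 2 of Algorithm 3.2.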

Similar to that of the basic resonance model, the computational complexity of GMA-2 is O(n^2). The computational complexity of CBioMarker is at most O(mn + h m'^2), where h is the iteration number in Algorithm CBioMarker, depending on the number of dense clusters found in S' (this bound takes the size of S' in each iteration to be m'; in fact, S' is always smaller than in the previous iteration after removing the gene dense cluster already found). Therefore, Algorithm 4.1 CBioMarker is efficient as well: our empirical run on the large Leukemia data set of size 12582×72 in Subsection 5.1 took about 3 seconds in the MATLAB environment on a Pentium IV PC with 512MB RAM.

Algorithm 4.1 (CBioMarker): Outline of Compact Biomarker Discovery with GMA-1 and GMA-2.
Input: (1) W_{m×n}, a gene expression matrix from a set of m genes G; (2) (n_1, ..., n_k)^T, sizes of the k sample classes with the submatrix structure as shown in Fig. 3(a); (3) (k_-, k_+)^T, numbers of negative and positive classes; (4) δ, a fixed density threshold of the cluster.
Output: G' = {g_1, ..., g_q} ⊆ G, a subset of q genes which forms a biomarker.
1: run GMA-1 on W to get the gene ranking sequence (g_1, ..., g_m);
2: select the first m' genes from the ranking sequence (g_1, ..., g_m) and compute their correlation matrix S;
3: set G' = {} and S' = S;
4: repeat
5:   run GMA-2 on S' with δ to get the highly correlated gene cluster G'';
6:   if G'' is not empty then
7:     select the first representative gene of G'' and add it to G'; {comment: the number of representative genes selected depends on δ; if δ is high, one representative gene is enough, otherwise select several more}
8:   end if
9:   remove the genes of G'' from the remaining gene set and set S' to the submatrix of S on the remaining genes;
10: until G'' is empty {comment: this indicates there are no δ-dense clusters any more}
11: add the rest of the genes, which are not clustered and found by GMA-2, to G';
12: return G'.

5. EMPIRICAL STUDY

In this section, we conducted experiments on two data sets and compared our method with three of the most popular filter methods: T-statistics (T), Information Gain (IG) and ReliefF 10. Each resulting feature subset was used to train an SVM classifier^j with the linear kernel function. Because of the small number of samples, Leave-One-Out Cross Validation (LOOCV), a popular performance validation procedure adopted by many researchers, was performed to assess the classification performance.

5.1. Leukemia Data

We used the Leukemia gene expression data 2, where besides the classes "ALL" and "AML", a new class "MLL" of samples is identified. It contains 12,582 genes and 72 samples with these 3 sample classes. We performed three experiments to test our method by using one class versus the rest as positive versus negative: (1) ALL versus MLL&AML; (2) MLL versus ALL&AML; and (3) AML versus ALL&MLL. In each experiment, the gene expression matrix partition for our method is W = [W_1^-, W_2^-, W^+], with one positive and two negative classes. In all three experiments, α was set to 2 for GMA-1. We first used GMA-1^i, T and IG to rank the genes and compared them over different feature sizes k = 2, 4, 10, 20, 50, 100, 200, 500, 1000. The results are shown in Tables 1, 2 and 3. In the experiment "MLL versus ALL&AML", where the class MLL is hard to distinguish, we also ran CBioMarker to get the compact biomarker and similarly used the LOOCV accuracy to evaluate it.

^i Because GMA-1 can rank genes in terms of up and down regulation respectively, in this comparison of the k top-ranking genes we selected the 0.5k top-ranking genes in up regulation and the 0.5k top-ranking genes in down regulation to form the k top-ranking genes given by GMA-1.
^j SVM^light was used.
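The repeat-loop of Algorithm 4.1 can be sketched as follows. This is a simplified illustration in Python/NumPy: the within-loop ranking uses a row-sum proxy instead of RME, and the density D is normalized to the mean off-diagonal correlation so that a threshold δ in [0, 1] is comparable across cluster sizes — both are our assumptions for the sketch, not the paper's exact definitions.

```python
import numpy as np

def density(S, idx):
    """Mean off-diagonal correlation of the genes in idx (a size-normalized
    variant of Eqn (7); the paper's D is the raw quadratic form x^T S x)."""
    sub = S[np.ix_(idx, idx)]
    k = len(idx)
    return (sub.sum() - np.trace(sub)) / (k * (k - 1))

def gma2_cluster(S, order, delta):
    """Algorithm 3.2 shape: grow the top-ranked prefix while it stays delta-dense."""
    k = 2
    while k < len(order) and density(S, order[:k + 1]) >= delta:
        k += 1
    return list(order[:k]) if density(S, order[:k]) >= delta else []

def cbiomarker(gene_names, S, delta=0.7):
    """CBioMarker outline: repeatedly extract a delta-dense cluster from the
    remaining correlation matrix, keep one representative per cluster, and
    finally keep all unclustered genes."""
    remaining = list(range(len(gene_names)))
    biomarker = []
    while len(remaining) >= 2:
        order = sorted(remaining, key=lambda g: -S[g].sum())  # proxy ranking (not RME)
        cluster = gma2_cluster(S, order, delta)
        if not cluster:
            break
        biomarker.append(gene_names[cluster[0]])              # first gene as representative
        remaining = [g for g in remaining if g not in cluster]
    biomarker += [gene_names[g] for g in remaining]           # unclustered genes kept
    return biomarker

# example: genes 0-2 form one redundant cluster, genes 3-4 are independent
S = np.array([[1.0, 0.9, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.9, 0.1, 0.2],
              [0.9, 0.9, 1.0, 0.2, 0.1],
              [0.1, 0.1, 0.2, 1.0, 0.3],
              [0.1, 0.2, 0.1, 0.3, 1.0]])
compact = cbiomarker(["g0", "g1", "g2", "g3", "g4"], S, delta=0.7)
```

The redundant trio collapses to a single representative, while the two uncorrelated genes are kept, yielding a 3-gene compact biomarker.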

Table 1. LOOCV accuracy rate (%) of ALL versus MLL&AML (methods T, IG, ReliefF and GMA-1; k = 2 to 1000).

Table 2. LOOCV accuracy rate (%) of MLL versus ALL&AML (methods T, IG, ReliefF and GMA-1; k = 2 to 1000).

Table 3. LOOCV accuracy rate (%) of AML versus ALL&MLL (methods T, IG, ReliefF and GMA-1; k = 2 to 1000).

As shown in the three tables:

• High accuracy: in all three experiments, our method GMA-1 outperforms the other methods, especially in Table 2, where the samples are hard to distinguish. GMA-1 maintains very high accuracies across different k.

• Stability: not only does the small set of selected genes have higher accuracy than with the other methods, but the large subsets of selected genes also maintain high accuracy. This is a stable property as k increases. In contrast, the method T, for instance, is not stable.

• Compact biomarker: observing the accuracies of the three methods from small k to large, GMA-1 is able to quickly obtain high accuracy at very small k, while the methods T and IG require a larger k to arrive at the same accuracy (the numbers in bold in the three tables show the minimum k each method requires to reach its highest accuracy). For example, in Table 1 the top 2 ranking genes found by GMA-1 yield an accuracy of 100%, while the accuracy of the other two methods' top 2 ranking genes is less than 80%. Similar cases also appear in Tables 2 and 3. This means that GMA-1 outperforms the other methods in terms of discovering the compact or minimal biomarker.

An important factor, which enables GMA-1 to perform well, is that the matrix approximation has the global searching ability to take into account the value distribution of the whole matrix and of multiple classes in a macro view; in this way, the within-class data distributions and the between-class distribution are fully considered. This is different from the way of individually considering genes, samples, or gene-to-gene correlations.

To test whether the biomarkers found by our methods are effective, we checked the two genes found by GMA-1 in Table 1 with Entrez Gene on the NCBI Website (http://www.ncbi.nlm.nih.gov/entrez)^k. The two genes are MME, which is underexpressed, and LGALS1, which is overexpressed. MME is a common acute lymphocytic leukemia antigen that is an important cell surface marker in the diagnosis of human acute lymphocytic leukemia (ALL), while LGALS1 was also reported to be highly correlated with ALL 15. In the result of Armstrong et al. 2, these two genes were also ranked as the first genes among the underexpressed and overexpressed genes respectively, and they may be interesting to biologists trying to analyze more relevant genes contributing to the diseases.

There is no need to find the compact biomarker in the experiments other than "MLL versus ALL&AML", because in them GMA-1 already found 2 genes with an accuracy greater than 90%. To obtain a minimum biomarker while keeping a relatively high accuracy (e.g., greater than 90%) in the second experiment, we performed the algorithm CBioMarker with δ = 0.7 for GMA-2; CBioMarker, combining GMA-1 and GMA-2, yields higher accuracy: we found 4 genes with an accuracy of 93.1%. This result is better than any other method in Table 2 when k = 4.

^k The GenBank No. of MME is J03779 and the GenBank No. of LGALS1 is AI535946.

5.2. Lupus Data

In this experiment, we used the unpublished data set taken from the Lupus gene expression experiments of the Microarray core facility at UT Southwestern Medical Center. This data set contains 1,022 genes and 84
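The LOOCV protocol used above can be sketched as follows; a nearest-centroid classifier stands in for the linear-kernel SVM^light of the experiments, purely for illustration, and the toy data is our own.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Predict the class whose training centroid is closest to x
    (an illustrative stand-in for the linear-kernel SVM)."""
    labels = np.unique(y_train)
    cents = np.array([X_train[y_train == c].mean(axis=0) for c in labels])
    return labels[np.argmin(np.linalg.norm(cents - x, axis=1))]

def loocv_accuracy(X, y):
    """Leave-One-Out Cross Validation: hold out each sample once, train on the
    rest, and report the fraction of held-out samples classified correctly."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        hits += nearest_centroid_predict(X[mask], y[mask], X[i]) == y[i]
    return hits / len(y)

# toy data: feature 0 separates the two classes, features 1-2 are uninformative
rng = np.random.default_rng(0)
X = np.zeros((20, 3))
X[:10, 0] = rng.normal(0.0, 0.1, 10)
X[10:, 0] = rng.normal(1.0, 0.1, 10)
X[:, 1:] = rng.normal(0.5, 0.1, (20, 2))
y = np.array([0] * 10 + [1] * 10)
acc = loocv_accuracy(X, y)
```

With n samples, LOOCV trains n classifiers, which is why it is practical here only because the sample counts (72 and 84) are small.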

samples with 4 sample classes: "NC" (Normal Control), "FDR" (First-Degree Relative), "ILE" (Incomplete Lupus Erythematosus) and "SLE" (Systemic Lupus Erythematosus). Among these classes, "NC" and "FDR" are from normal persons while "ILE" and "SLE" are from patients.

To illustrate the process of GMA-1, we performed GMA-1 with α = 5 on this data and demonstrate the visualization ability of our method for facilitating the user to analyze both the genes and the samples simultaneously. The sorted matrix W with the up regulation setting (see Fig. 3(b)) is visualized by the grey scale image in Fig. 4(a). We also drew the grey scale images of the sorted transformed matrix W' for up regulation and of the final approximation matrix r*b*^T given by the converged resonance strength vector r* and the frequency distribution vector b*, where b* is the concatenation of the k vectors: b* = (b_1*, ..., b_k*)^T.

Fig. 4. Visualization of the sorted matrix W, the sorted transformation matrix W' = [1 - W_NC, 1 - W_FDR, W_ILE, W_SLE], and the sorted approximation matrix r*b*^T, with the classes NC, FDR, ILE, SLE: (a) sorted W; (b) sorted W'; (c) sorted r*b*^T.

In Fig. 4(a), the original sorted gene expression matrix, the most differentially expressed genes (rows) are placed at the top of W. Meanwhile, the low values of the first two classes "NC" and "FDR" are collected into the left-top corners of the submatrices W_NC and W_FDR, and the high values of the last two classes "ILE" and "SLE" are collected into the left-top corners of the submatrices W_ILE and W_SLE. By observing the grey scale image of the approximation matrix r*b*^T in Fig. 4(c), the dominant tendency within each class can be clearly observed; besides providing the top-ranking relevant genes, the method thus provides an overall tendency of the whole matrix for visualizing and analyzing the data. From this redistribution of the whole matrix, we also found that the outlier samples of each class are put in the rightmost place of the corresponding class submatrix. For example, the colors of the rightmost sample (the 13th column) in the class "NC" are significantly different from the colors of all the other samples to its left, which indicates that this sample may be an outlier of the class "NC". Similar cases occur in the other classes as well. After analyzing this visualization, the user can also draw the conclusion that some normal persons may be early-stage, undetected patients.

6. CONCLUSIONS

In this work, we have introduced a novel perspective of matrix approximation for filtering the genes in multiple-class data sets. It comprehensively considers the global between-class data distribution and the local within-class data distribution, and therefore improves the accuracy of biomarker discovery. Experiments on gene expression data have demonstrated its efficiency and effectiveness for both biomarker discovery and visualization.

ACKNOWLEDGMENT

The authors would like to thank Dr. Quan Li from the Microarray core facility at UT Southwestern Medical Center for sharing his unpublished data set with us.

References
1. Achlioptas D, McSherry F. Fast computation of low rank matrix approximations. In Proc. of STOC 2001.
2. Armstrong SA, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 2002; 30(1): 41-47.
3. Chu W, Ghahramani Z, Falciani F, Wild D. Biomarker discovery in microarray gene expression data with Gaussian processes. Bioinformatics 2005; 21(16): 3385-3393.
4. Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. In Proc. of CSB 2003.
5. Donoho DL. High-dimensional data analysis: The curses and blessings of dimensionality. In Math Challenges of the 21st Century 2000.
6. Feige U, Kortsarz G, Peleg D. The dense k-subgraph problem. Algorithmica 2001; 29(3): 410-421.

7. Gibson D, Kleinberg J, Raghavan P. Clustering categorical data: An approach based on dynamical systems. VLDB Journal 2000; 8(3-4): 222-236.
8. Golub G, Loan CV. Matrix Computations. The Johns Hopkins University Press, 1996.
9. Jaeger J, Sengupta R, Ruzzo WL. Improved gene selection for classification of microarrays. In Proc. of PSB 2003.
10. Kira K, Rendell L. A practical approach to feature selection. In Proc. of ICML 1992.
11. Kleinberg J. Authoritative sources in a hyperlinked environment. J. of the ACM 1999; 46(5): 604-632.
12. Ong KL, Ng WK. Visual terrain analysis of high-dimensional datasets. In Proc. of PKDD 2005.
13. Page L, Brin S, Motwani R, Winograd T. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
14. Papadimitriou CH, Raghavan P, Tamaki H, Vempala S. Latent semantic indexing: A probabilistic analysis. In Proc. of PODS 1998.
15. Rozovskaia T, et al. Expression profiles of acute lymphoblastic and myeloblastic leukemias with all-1 rearrangements. Proc. of National Academy of Sciences USA 2003; 100(13): 7853-7858.
16. Singh D, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002; 1: 203-209.
17. Smola AJ, Scholkopf B. Sparse greedy matrix approximation for machine learning. In Proc. of ICML 2000.
18. Tabus I, Astola J. Gene feature selection. Technical report, http://www.cs.tut.fi/~tabus/course/GSP/Chapter2GSPSR.pdf.
19. Tavazoie S, Hughes J, Campbell M, Cho R, Church G. Yeast microarray data set.
20. Tsaparas P. Using non-linear dynamical systems for web searching and ranking. In Proc. of PODS 2004.
21. Xing E, Jordan M, Karp R. Feature selection for high-dimensional genomic microarray data. In Proc. of ICML 2001.
22. Yang Y, Pedersen JP. A comparative study on feature selection in text categorization. In Proc. of ICML 1997.
23. Yu L, Liu H. Redundancy based feature selection for microarray data. In Proc. of SIGKDD 2004, Seattle, Washington.

EFFICIENT COMPUTATION OF MINIMUM RECOMBINATION WITH GENOTYPES (NOT HAPLOTYPES)

Yufeng Wu* and Dan Gusfield

Department of Computer Science, University of California, Davis, Davis, CA 95616, U.S.A.
*Corresponding author. Email: {wuyu, gusfield}@cs.ucdavis.edu

A current major focus in genomics is the large-scale collection of genotype data in populations in order to detect variations in the population. The variation data are sought in order to address fundamental and applied questions in genetics that concern the haplotypes in the population. Since almost all the collected data is in the form of genotypes, but the downstream genetics questions concern haplotypes, the standard approach has been to first infer haplotypes from the genotypes, and then answer the downstream questions using the inferred haplotypes. That two-stage approach has potential deficiencies, giving rise to the general question of how well one can answer the downstream questions using genotype data without first inferring haplotypes, and to the goal of computing the range of downstream answers that would be obtained over the range of possible inferred haplotype solutions. This paper provides some tools for the study of those issues, and some partial answers. We present algorithms to solve downstream questions concerning the minimum amount of recombination needed to derive given genotypic data, without first fixing a choice of haplotypes. We apply these algorithms to the goal of finding recombination hotspots, obtaining results as good as a published method that first infers haplotypes, and to the problem of estimating the minimum amount of recombination needed to derive the true haplotypes underlying the genotypic data, obtaining weaker results compared to first inferring haplotypes with the program PHASE. Hence our tools allow an initial study of the two-stage versus one-stage issue, but our experiments certainly do not fully resolve that issue.

1. INTRODUCTION

The field of genomics is now in a phase where large-scale data in populations is collected in order to study population-level variations [8]. Variations between individuals are used to provide insight into basic biological processes such as meiotic recombination, to help locate the genes that influence genetic disease or economic traits (through a technique called "association mapping" or "LD mapping" [15]), or to locate genes that are currently under natural selection [36]. Algorithms and computation play a central role in all of these efforts, and there is a growing literature on several key problems involved in both the acquisition and the downstream analysis of population variation data.

In discussing acquisition and analysis problems, concepts and issues, and in order to introduce the theme of this paper, we must first define some basic terms. In diploid organisms (such as humans) there are two (not completely identical) "copies" of each chromosome, and hence of each region of interest. A description of the data from a single copy is called a haplotype, while a description of the conflated (mixed) data on the two copies is called a genotype. Today, the underlying data that form a haplotype are usually a vector of values of m single nucleotide polymorphisms (SNPs). A SNP is a single nucleotide site where exactly two (of four) different nucleotides occur in a large percentage of the population. The International Haplotype Map Project [7, 8] is focused on determining the common haplotypes in several diverse human populations.

Genotype data is represented as an n by m 0-1-2 (ternary) matrix G; genotype data is also called "unphased data". Each row of G is a genotype. A pair of binary vectors of length m (haplotypes) generate a row i of G if for every position c, both entries in the haplotypes are 0 (or 1) if and only if G(i,c) is 0 (or 1) respectively, and exactly one entry is 1 and one is 0 if and only if G(i,c) = 2.

The Key Issue: The key technological fact is that it is very difficult and costly to collect large-scale haplotype data, both molecularly and computationally, but relatively easy and cheap to collect genotype data.
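The generation rule above is easy to state in code. The following sketch (ours, for illustration only, not from the paper) checks whether a pair of binary haplotype vectors generates a genotype row under the 0-1-2 encoding, and conflates a haplotype pair into its genotype:

```python
def generates(h1, h2, g):
    """Check whether haplotype pair (h1, h2) generates genotype row g.

    Encoding: g[c] = 0 or 1 means both haplotypes carry that value at
    site c; g[c] = 2 means the site is heterozygous (one 0, one 1).
    """
    if not (len(h1) == len(h2) == len(g)):
        return False
    for a, b, c in zip(h1, h2, g):
        if c == 2:
            if a + b != 1:          # need exactly one 0 and one 1
                return False
        elif not (a == b == c):     # homozygous site must match both
            return False
    return True

def conflate(h1, h2):
    """Produce the genotype row generated by a haplotype pair."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]
```

For example, conflate([0,1,1], [0,0,1]) returns [0,2,1], and generates([0,1,1], [0,0,1], [0,2,1]) is True.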

So, we have the ability to gather large-scale genotypes in populations, but we have the need to ask and answer questions about the underlying haplotypes in those populations. To date, the resolution of this issue has overwhelmingly involved two independent stages: first, try to infer the "correct" haplotypes from the genotypes; second, do the downstream analysis using those inferred haplotypes. There is a very large literature on haplotype inference (HI), either inferring a pair of haplotypes for each genotype in the sample, or inferring just the frequencies of the haplotypes in the sample. In particular, the underlying haplotypes can be inferred with remarkable fidelity [25], although problems remain and the field of haplotype inference is still very active (for example, even the developer of the most widely used program, PHASE [35], has recently introduced a totally different approach in order to address much larger data than PHASE can handle [33]).

Indeed, the two-stage approach may in the end be the best way to address many of the downstream biological questions of interest, but this particular approach has both problems and missed opportunities. The main problem is that the haplotype inferences are likely to be incorrect to some extent, and it is not clear what effect those inaccuracies will have on the downstream analysis. The missed opportunity inherent in the two-stage approach is that by choosing just one set of haplotypes, we do not address questions about the range of possible answers to the downstream questions that the collected genotype data support. Range questions are of interest because they provide a kind of "sensitivity analysis" for any particular chosen answer (for example, an answer derived from the two-stage approach), and they address the general question of "how much does it really help to know the underlying haplotypes that give rise to the genotypes" [6, 27, 28]? The answer to that question helps determine how much effort or money one should be willing to spend to determine the correct haplotypes (by molecular means or by gathering more genotype data). Mutation, recombination, selection, etc. all operate on haplotypes, not genotypes, and therefore the "downstream" biological questions that we want to answer using population variation data (for example, questions about recombination hotspots, association mapping, linkage disequilibrium, natural selection, evolution, phylogenetic networks, etc.) are all most naturally framed in terms of haplotypes. In general, then, there is some (potential) loss of information in any two-stage approach.

The Main Theme of This Paper: Motivated by the above discussion, this paper concerns the solution of certain downstream biological questions using genotypic data, without first fixing a choice of haplotypes. In particular, we are concerned with estimating, in the context of specific problems involving recombination, the minimum amount of recombination needed to derive haplotypes that can pair to form the observed genotypes, bracketing the range of that minimum over all possible haplotype solutions, and with problems of inferring an explicit evolutionary history of haplotypes that can pair to form the observed genotypes. We develop polynomial-time algorithms for some problems, non-polynomial but practical algorithms for other problems, and show the results of applying these methods to simulated and real biological data. Our methods provide some tools to study the two-stage versus one-stage issue, in the context of specific downstream questions; however, we do not claim that our experimental results resolve the issue of which approach is best. As a byproduct, we can use some of these computations to solve the haplotype inference problem itself, turning the two-stage process on its head. Indeed, we are seeing results in this general direction; for example, Halperin et al. developed a method for tag SNP selection with genotype data [14].

2. ADDITIONAL DEFINITIONS

Before discussing our results in detail, we need some additional definitions. Given an input set (or matrix) of n genotype vectors G of length m, the Haplotype Inference (HI) Problem is to find a set (or matrix) H of n pairs of binary vectors (with values 0 and 1), one pair for each genotype vector, such that each genotype vector in G is generated by the associated pair of haplotypes in H. H is called an "HI solution for G". The decision on whether to expand a 2 entry in G to 0/1 or to 1/0 in H is called a "phasing" of that entry, and the way that all the 2's in a column are expanded is called the phasing of the column.
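A genotype row with k sites of value 2 admits 2^(k-1) distinct phasings (fixing the expansion of the first 2 removes the symmetry between the two haplotypes of a pair), so an n-row matrix can have exponentially many HI solutions. A brute-force enumeration, workable only for tiny matrices and written by us purely for illustration, is:

```python
from itertools import product

def row_phasings(g):
    """Return all distinct haplotype pairs generating genotype row g.

    Each 2-entry is expanded to 0/1 in one haplotype and 1/0 in the
    other; fixing the first heterozygous site breaks pair symmetry.
    """
    amb = [c for c, v in enumerate(g) if v == 2]
    if not amb:
        return [(list(g), list(g))]
    pairs = []
    # first ambiguous site fixed to 0 in h1 to avoid mirrored duplicates
    for bits in product([0, 1], repeat=len(amb) - 1):
        h1, h2 = list(g), list(g)
        for c, b in zip(amb, (0,) + bits):
            h1[c], h2[c] = b, 1 - b
        pairs.append((h1, h2))
    return pairs

def hi_solutions(G):
    """Yield every HI solution (one haplotype pair per row) for G."""
    for choice in product(*(row_phasings(row) for row in G)):
        yield list(choice)
```

Given such an enumeration, the quantities MinL(G) and MaxL(G) defined in the next section can be brute-forced for any haplotype-based bound L by applying L to every HI solution.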

The standard assumption in population genetics is that at most one mutation has occurred in any sampled site in the evolution of the haplotypes; this is called the "infinite sites model". In most of the results in this paper the concept of site incompatibility is fundamental. Given a haplotype matrix H, two sites (columns) p and q in H are said to be incompatible if and only if there are four rows in H where columns p and q contain all four of the ordered pairs 0,0; 0,1; 1,0; and 1,1. The test for the existence of all four pairs is called the "four-gamete test" in the population genetics literature. The classic Perfect Phylogeny theorem is that, under the infinite sites model, there is a phylogenetic network without recombination (and hence a tree) that derives haplotypes H if and only if there is no incompatible pair of sites in H. An HI solution H for G is called a "PPH solution" if no pair of sites in matrix H is incompatible. The problem of determining if there is a PPH solution can be solved in linear time [9, 32].

In addition to mutation, haplotypes may evolve due to recombination between haplotype sequences. Meiotic recombination takes two equal-length sequences and produces a third sequence of the same length consisting of some prefix of one of the sequences, followed by a suffix of the other sequence. Meiotic recombination is one of the principal evolutionary forces responsible for shaping genetic variation within species, and recombination is at the heart of the logic of association mapping, a technique that is widely hoped to help locate genes influencing genetic diseases and important traits [15]. Efforts to deduce patterns of historical recombination, or to estimate the frequency or the location of recombination, are central to modern-day genetics [26].

The evolutionary history of a set of haplotypes H, which evolve by site mutations (assuming the infinite sites model) and recombination, is displayed on a directed acyclic graph called a "Phylogenetic Network" or an "Ancestral Recombination Graph (ARG)". For a formal definition of these graphs, see Gusfield et al. [12] or Gusfield [11]. For a matrix of haplotypes H, we define Rmin(H) as the minimum number of recombination events needed in a derivation of H from some unknown (or sometimes known) ancestral haplotype. For a given set of haplotypes, computing the minimum number of recombinations needed to explain their evolution (under the infinite sites model) is a standard question of interest. The problem of computing Rmin(H) exactly is NP-hard, but there is a growing literature on polynomial-time methods that work on problems of special structure, on practical heuristics that are exact on small data, and on efficient methods to compute close bounds on Rmin(H).

3. RECOMBINATION LOWER BOUNDS OVER GENOTYPES

Let L denote a particular recombination lower bound method that works on haplotype data, and let L(H) be the lower bound given by L when applied to haplotype matrix H; that is, L(H) ≤ Rmin(H). Given a genotype matrix G, we define MinL(G) as the minimum value of L(H) taken over every HI solution H for G. Similarly, we define MaxL(G) by changing "minimum" to "maximum" in the definition. The two quantities, MinL(G) and MaxL(G), precisely define the range of results that method L will produce over all haplotype matrices that generate G.

The motivation for wanting to know MaxL(G) may be a bit unintuitive. One situation is where we are interested in the amount of recombination that must have occurred in the generation of the true haplotypes underlying the observed genotypes G, and the available tool for studying recombination levels is the ability to compute L(H) given haplotypes H. Note that MinL(G) ≤ L(H*) ≤ Rmin(H*), where H* is the true (but unknown) set of haplotypes that gives rise to G. Similarly, L(H*) ≤ MaxL(G), but it is not true that Rmin(H*) ≤ MaxL(G). An obvious question in this situation is whether it is valuable to expend additional resources to better determine the true H (in the laboratory or by collecting more data). The difference MaxL(G) - MinL(G) indicates the most that can be learned about recombination (through the use of L(H)), even if the true H were known: if the difference is small, determination of the true H has little value in this context, and if the difference is large, MaxL(G) still bounds the most that can be learned through L, even with additional efforts to learn the true H.
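The four-gamete test, and the resulting perfect-phylogeny check for a haplotype matrix, can be sketched as follows (our illustrative code, assuming H is a list of 0/1 rows):

```python
def incompatible(H, p, q):
    """Four-gamete test: sites p and q of haplotype matrix H are
    incompatible iff all four ordered pairs 00, 01, 10, 11 occur."""
    return {(row[p], row[q]) for row in H} == {(0, 0), (0, 1), (1, 0), (1, 1)}

def has_perfect_phylogeny(H):
    """Perfect Phylogeny theorem: under the infinite sites model, a tree
    (no recombination) derives H iff no pair of sites is incompatible."""
    m = len(H[0])
    return not any(incompatible(H, p, q)
                   for p in range(m) for q in range(p + 1, m))
```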

In the next three sections we develop lower bound methods that work directly on a genotype matrix G. These methods will also be useful in Section 4, where we develop a method to build a minimum ARG for a set of genotypes.

3.1. The case of the Hudson-Kaplan (HK) lower bound

The first and best-known lower bound on Rmin(H) is the HK bound [19]. When L is the HK method, we use MinHK(G) and MaxHK(G) in place of MinL(G) and MaxL(G). Previously, Wiuf [38] showed that MinHK(G) can be computed in polynomial time. In this section, we show that MaxHK(G) can also be computed in polynomial time.

We first have to define the incompatibility graph and briefly describe the HK bound and method. The "incompatibility graph" IG(H) for H is a graph containing one node for each site in H, and an edge connecting two nodes p and q if and only if sites p and q are incompatible. We will refer to a node of IG(H) and to the site of G it corresponds to interchangeably. The HK lower bound on Rmin(H), denoted HK(H), can be described and computed as follows: arrange the nodes of IG(H) on a line in the order that the corresponding sites appear in the underlying chromosome; then compute the maximum number of non-overlapping edges in the embedded graph IG(H). Two edges that only share a single node are still non-overlapping. The computed number is the HK bound. It is easy to establish that HK(H) ≤ Rmin(H), and that HK(H) can be computed in time that is linear in the number of edges of IG(H).

3.1.1. Efficient algorithm for MaxHK(G)

Given a genotype matrix G, we define the maximal incompatibility graph for G, denoted MIG(G), as follows: each node in MIG(G) corresponds to a site in G, and there is an edge between nodes p and q if there exists an HI solution H for G in which the pair p, q is incompatible. Note that the existence of an edge (p,q) is determined independently of all other pairs of sites: we only need to look at sites p and q to determine if edge (p,q) is in MIG(G), and the HI solution that is used for one pair can be completely different from the HI solution used for another pair. For example, suppose sites p and q in G are

00
01
10
22

Then there is an edge (p,q) in MIG(G), because we can phase the 2's in row four so that one haplotype gets value 1 at both sites and the other gets value 0 at both sites, making p, q incompatible. Graph MIG(G) is a supergraph of every incompatibility graph IG(H) where H is an HI solution for G.

We now describe the algorithm.

Algorithm MaxHK
1. Construct MIG(G) for input data G.
2. Arrange the nodes of MIG(G) on a line, in the order that the sites appear in the underlying chromosome. Then find a maximum-size set, EG, of non-overlapping edges in MIG(G).

We claim that |EG| = MaxHK(G).

Time analysis: The first step takes O(nm^2) time and the second step takes O(m^2) time, so the algorithm runs in O(nm^2) time.

Correctness: For every HI solution H for G, |EG| ≥ HK(H), because IG(H) is a subgraph of MIG(G); thus |EG| ≥ MaxHK(G). To show the converse, it is sufficient to show that |EG| ≤ HK(H) for some HI solution H for G. This is not immediate, because it is not necessarily true that MIG(G) = IG(H) for some HI solution H for G. But if we can find an HI solution H for G where all the edges of EG are in IG(H) (where they will be non-overlapping), then |EG| ≤ HK(H).

To construct the desired H, consider the graph induced by the edges in EG, and consider one of its connected components, C. Because the edges in EG are non-overlapping and C is a connected component, the edges in C form a simple connected path along the nodes in C, ordered from left to right in the embedded MIG(G). Let s1, s2, ..., sk denote the ordered nodes in C. We first phase sites s1, s2 to make the pair s1, s2 incompatible (that is possible since edge (s1,s2) is in MIG(G)).
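Algorithm MaxHK as just stated can be prototyped directly. In the sketch below (ours, not the paper's implementation), mig_edge decides whether columns p and q of G can be phased to be incompatible by examining only those two columns, and the maximum-size set of non-overlapping edges is found with the classic earliest-right-endpoint greedy rule, with edges that merely share an endpoint counted as non-overlapping:

```python
def mig_edge(G, p, q):
    """Is there an HI solution making sites p and q incompatible?
    Only columns p and q matter, so examine each row's value pair."""
    gametes, free = set(), 0          # forced gametes; count of 2-2 rows
    for row in G:
        a, b = row[p], row[q]
        if a == 2 and b == 2:
            free += 1                  # phased later, to best advantage
        elif a == 2:
            gametes |= {(0, b), (1, b)}
        elif b == 2:
            gametes |= {(a, 0), (a, 1)}
        else:
            gametes.add((a, b))
    missing = {(0, 0), (0, 1), (1, 0), (1, 1)} - gametes
    if not missing:
        return True
    if free >= 2:
        return True                    # two 2-2 rows cover both diagonals
    if free == 1:                      # one 2-2 row covers one diagonal
        return missing <= {(0, 0), (1, 1)} or missing <= {(0, 1), (1, 0)}
    return False

def max_hk(G):
    """MaxHK(G): maximum number of non-overlapping MIG(G) edges."""
    m = len(G[0])
    edges = sorted((q, p) for p in range(m) for q in range(p + 1, m)
                   if mig_edge(G, p, q))          # sorted by right end
    count, last_right = 0, -1
    for right, left in edges:
        if left >= last_right:         # sharing an endpoint is allowed
            count += 1
            last_right = right
    return count
```

On the two-site example above (rows 00, 01, 10, 22), mig_edge returns True and MaxHK is 1.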

Now we move to site s3. We want to make the pair s2, s3 incompatible, but we have already chosen how s2 will be phased with respect to s1. The critical observation is that this prior decision does not constrain the ability to make pair s2, s3 incompatible. Note that the only rows in G where a phasing choice has any effect on whether pair s2, s3 will be incompatible are the rows where both of those sites have value 2 in the genotype matrix G. Suppose that, in order to make pair s2, s3 incompatible, we need to phase sites s2, s3 to produce the pair 0,0 and/or 1,1 in some such row k. (The case where we need the pair 0,1 and/or the pair 1,0 is similar and omitted.) If column s2 (for row k) has been phased so that its 0 goes to the first haplotype, we phase s3 (for row k) so that its 0 also goes to the first haplotype; otherwise, we phase s3 the opposite way. In either case, we produce the needed binary pairs in sites s2, s3, although one has to pay attention to how s2 was phased. Following the same approach, we can phase sites s4, ..., sk, making each consecutive pair of sites incompatible. In this way, we can construct a haplotyping solution H for G where all the edges of EG (and possibly more) appear in IG(H), and hence |EG| ≤ HK(H) ≤ MaxHK(G). But since |EG| ≥ MaxHK(G), we conclude that |EG| = MaxHK(G), completing the proof of the correctness of Algorithm MaxHK.

Note that we can phase the sites in each connected component of MIG(G) separately, assured that no pair of sites in different components will be made incompatible. This is due to the maximality of connected components and the definition of MIG(G).

3.2. The case of the connected-component lower bound

A "non-trivial" connected component of a graph is a connected component that contains at least one edge; a trivial connected component has only one node and no edges. For a graph I, we use cc(I) to denote the number of non-trivial connected components in I. It has previously been established [13] that for a haplotype matrix H, cc(IG(H)) ≤ Rmin(H), and that this lower bound can be, but is not always, superior to the HK bound when applied to specific haplotype matrices.

Given genotype matrix G, we define MinCC(G) and MaxCC(G) respectively as the minimum and maximum values of cc(IG(H)) over every HI solution H for G, for the same reasons we want to compute MinHK(G) and MaxHK(G). In this section we show that MinCC(G) can be computed in polynomial time by Algorithm MinCC, using an idea similar to one used for MaxHK(G). The problem of efficiently computing MaxCC(G) is currently open.

Algorithm MinCC
1. Given genotype matrix G, construct graph MIG(G) and remove all trivial components.
2. For each remaining component C, let G(C) be the matrix G restricted to the sites in C. Determine if there is a PPH solution for G(C), and remove component C if there is a PPH solution for G(C).
3. Let Kc be the number of remaining connected components.

We claim that Kc = MinCC(G).

Time analysis: Constructing MIG(G) takes O(nm^2) time, finding all components takes O(m) time, and checking all components for PPH solutions takes O(nm) time. Thus, the entire algorithm takes O(nm^2) time.

Correctness: We first argue that cc(IG(H)) ≥ Kc for every HI solution H for G. Let H be an arbitrary HI solution for G. Since IG(H) is a subgraph of MIG(G), every connected component of IG(H) must be completely contained in a connected component of MIG(G). Consider one of the Kc remaining connected components, C. Since G(C) does not have a PPH solution, there must be at least one incompatible pair of sites of C in H, and so at least one edge in C must also be in IG(H). Therefore, there must be at least one non-trivial connected component of IG(H) contained in C, and so cc(IG(H)) ≥ Kc.

To finish the proof of correctness, it suffices to find an HI solution H' for G where cc(IG(H')) = Kc. To begin the construction of H', for each non-trivial component C of MIG(G) where G(C) has a PPH solution, we phase the sites in C to create a PPH solution. As a result, none of those sites will be incompatible with any other sites in G.
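Algorithm MinCC as stated above can be prototyped on tiny matrices. This is our illustrative sketch only: the PPH test here is brute force over all phasings (exponential), standing in for the linear-time PPH algorithms cited earlier, and the MIG edge test restates the two-column check used for MaxHK:

```python
from itertools import product

FOUR = {(0, 0), (0, 1), (1, 0), (1, 1)}

def mig_edge(G, p, q):
    """Can some phasing of columns p, q make them incompatible?"""
    gam, free = set(), 0
    for row in G:
        a, b = row[p], row[q]
        if a == 2 and b == 2:
            free += 1
        elif a == 2:
            gam |= {(0, b), (1, b)}
        elif b == 2:
            gam |= {(a, 0), (a, 1)}
        else:
            gam.add((a, b))
    miss = FOUR - gam
    return (not miss or free >= 2 or
            (free == 1 and (miss <= {(0, 0), (1, 1)} or
                            miss <= {(0, 1), (1, 0)})))

def components(G):
    """Non-trivial connected components of MIG(G), as site lists."""
    m = len(G[0])
    parent = list(range(m))
    def find(x):
        while parent[x] != x:
            parent[x] = x = parent[parent[x]]
        return x
    for p in range(m):
        for q in range(p + 1, m):
            if mig_edge(G, p, q):
                parent[find(p)] = find(q)
    comps = {}
    for s in range(m):
        comps.setdefault(find(s), []).append(s)
    return [c for c in comps.values() if len(c) > 1]

def has_pph(G):
    """Brute force: does some HI solution for G have no incompatible
    site pair (i.e., admit a perfect phylogeny)?"""
    def phasings(g):
        amb = [c for c, v in enumerate(g) if v == 2]
        for bits in product([0, 1], repeat=max(len(amb) - 1, 0)):
            h1, h2 = list(g), list(g)
            for c, b in zip(amb, (0,) + bits):
                h1[c], h2[c] = b, 1 - b
            yield h1, h2
    m = len(G[0])
    for choice in product(*(list(phasings(r)) for r in G)):
        H = [h for pair in choice for h in pair]
        if all({(r[p], r[q]) for r in H} != FOUR
               for p in range(m) for q in range(p + 1, m)):
            return True
    return False

def min_cc(G):
    """MinCC(G): count components of MIG(G) whose restriction of G
    has no PPH solution."""
    return sum(1 for c in components(G)
               if not has_pph([[row[s] for s in c] for row in G]))
```

For the earlier two-site example (rows 00, 01, 10, 22), MIG(G) has an edge, but phasing the 2-2 row as 01/10 yields a PPH solution, so MinCC is 0 while MaxHK is 1.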

Next we phase the sites of one of the Kc remaining components, C, so that in H' the nodes of C form a connected component of IG(H'). As in the proof of correctness for Algorithm MaxHK, first find an arbitrary rooted, directed spanning tree T of C. Then phase the site at the root and one of its children in T so that those two sites are made incompatible. Any other site can be phased as soon as its unique parent site has been phased, and because each node has a unique parent, each site can be phased to be made incompatible with its parent site, no matter how that parent site was phased. The result is that all the sites of C will be in a single connected component of IG(H'). Therefore Kc ≥ cc(IG(H')). But cc(IG(H)) ≥ Kc for every HI solution H for G, so MinCC(G) = Kc, and the correctness of Algorithm MinCC is proved.

3.3. Final comments on the polynomial-time methods

Above, we developed polynomial-time methods to compute MaxHK(G) and MinCC(G). These are two specific cases of our interest in efficiently computing MinL(G) and MaxL(G) for different lower bounding methods L that work on haplotypes. The HK and the CC lower bounds are not the best bounds available, but they are of interest because they allow provably polynomial-time methods to compute MinHK(G), MaxHK(G) and MinCC(G). Those results contribute to the theoretical study of lower bound methods, and may help to obtain polynomial-time or practical methods for better lower bound methods. Of course, for the best application of such numerical values, we would like to compute MinL(G) and MaxL(G) for the lower bound methods L that obtain the highest lower bounds on Rmin(H) when given haplotypes H. Unfortunately, we do not have a polynomial-time method for that problem. In the next section we discuss a practical method (on moderate-size data) to compute better lower bounds given genotypes.

3.4. Parsimony-based lower bound

One of the most effective methods to compute lower bounds on Rmin(H) was developed in Myers et al. [30], further studied in Bafna et al. [2], and optimized in Song et al. [34]. All of the methods in those papers produce lower bounds on Rmin(H) that are much superior to HK(H) and cc(IG(H)). They work by first finding (local) lower bounds for (selected) intervals or subsets of sites in H, and then combining those local bounds to form a composite lower bound on Rmin(H). The composition method was developed in Myers et al. [30] and is the same for all of the methods; what differs between the methods is the way the local bounds are computed. All the local bounds are computed with some variation of the following idea [30]: let Hap(H) be the number of distinct rows of H, minus the number of distinct columns, minus 1. Then Hap(H) ≤ Rmin(H); Hap(H) is called the Haplotype lower bound. When applied to the entire matrix H, Hap(H) is often a very poor lower bound, but when used to compute many local lower bounds in small intervals, which are then combined with the composition method, the overall lower bound on Rmin(H) is generally quite good.

We have developed a lower bounding method that works on genotype matrices of moderate size, using an idea related to the cited methods. Given genotypes G, we compute relaxed Haplotype lower bounds for many small intervals, and then use the composition method to create an overall number Ghap(G), which is a lower bound on the minimum of Rmin(H) over every HI solution H for G. We do not have space to fully detail the methods, but we now explain how we compute the local bounds in G that combine to create Ghap(G).
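The Haplotype bound and the composition method are simple to state in code. In this minimal sketch (ours, for illustration), Hap is applied to every interval of a haplotype matrix and the local values are composed by dynamic programming, using the fact that lower bounds on disjoint site intervals add; the paper's Ghap(G) instead uses pure-parsimony local bounds computed on genotype intervals:

```python
def hap_bound(H):
    """Haplotype lower bound: distinct rows - distinct columns - 1
    (clamped at 0 so it is usable as a local bound)."""
    rows = {tuple(r) for r in H}
    cols = {tuple(r[c] for r in H) for c in range(len(H[0]))}
    return max(len(rows) - len(cols) - 1, 0)

def composite_bound(H, local=hap_bound):
    """Composition method: local bounds on disjoint site intervals
    add, so take the best chain of intervals via a simple DP."""
    m = len(H[0])
    best = [0] * (m + 1)               # best[j]: bound using sites < j
    for j in range(1, m + 1):
        best[j] = best[j - 1]
        for i in range(j - 1):         # interval of sites [i, j)
            sub = [r[i:j] for r in H]
            best[j] = max(best[j], best[i] + local(sub))
    return best[m]
```

For the four haplotypes 00, 01, 10, 11 the bound is 4 - 2 - 1 = 1, matching the single recombination actually required.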

When restricted to the sites in an interval, we have a submatrix G' of G. For G', we compute a lower bound on the minimum number of recombinations needed to derive an HI solution for G', over all HI solutions for G'; we call this bound Par(G'). An HI solution H' for a genotype matrix G' is called a "pure parsimony" solution if it minimizes the number of distinct haplotypes used, over all HI solutions for G'. If the number of distinct haplotypes in a pure parsimony HI solution for G' is p(G'), and G' has m' sites, it is easy to show that p(G') - m' - 1 ≤ Rmin(H') for any HI solution H' for G'. To compute Ghap(G), we compute the local bound Par(G') for each submatrix of G defined by an interval of sites of G, and then combine those local bounds using the composition method from Myers et al. [30].

The problem of computing a pure parsimony haplotyping solution is known to be NP-hard [17, 22], so computing Par(G') is also NP-hard. However, a pure parsimony HI solution can be found relatively efficiently in practice on datasets of moderate size by using integer linear programming [10], and other papers have shown how to solve the problem on larger datasets [4, 5]. Each local Par(G') bound can therefore be computed in practice when the size of G' is moderate, and so Ghap(G) can be computed in practice for a wide range of data.

It is easy to show that Ghap(G) ≤ Rmin(H) for every HI solution H for G. Clearly, to be of value, it must be that Ghap(G) is larger than MinHK(G) and MinCC(G) for a large range of data. Our experiments show that Ghap(G) is often smaller than MinHK(G) or MinCC(G) when n < m and when the recombination rate is low, but that Ghap(G) becomes higher than MinHK(G) or MinCC(G) when n increases; our simulations show that for datasets with 20 genotypes and 20 sites, Ghap(G) is larger than MinHK(G) or MinCC(G) for over 80% of the data. As an example, a real biological dataset (from Orzack et al. [31]) has 80 rows and 9 sites; for that data, MinHK(G) = MinCC(G) = 2, while Ghap(G) is 5 (which is equal to Rmin(G), as shown in Section 5.3).

4. CONSTRUCTING A MINIMUM ARG FOR GENOTYPE DATA USING BRANCH AND BOUND

In this section, we consider the problem of constructing an ancestral recombination graph (ARG) that derives an HI solution H for the genotype matrix G and uses the fewest number of recombinations. As usual, we assume the infinite sites model of mutations. Formally, Haplotyping on a minimum ARG is the following problem: given genotype data G, find an HI solution H for G such that H can be derived on an ARG with the fewest number of recombinations, i.e. the ARG that derives an HI solution for G and uses the smallest number of recombinations. We call such an ARG a minimum ARG for G, and denote the number of recombinations in this ARG Rmin(G).

It is easy to see that this problem is difficult. After all, there is no known efficient algorithm for constructing the minimum ARG for haplotype data [37, 3], and haplotype data can be considered to be a subset of genotype data. We develop a practical method using branch and bound. The intuition for our method comes from the concept of the hypercube of length-m binary sequences; note that there are up to 2^m possible sequences in the hypercube that can be on an ARG that derives an HI solution for G.

Conceptually, we can build the ARG as follows. We start building the ARG from a sequence as the root, and we maintain a set of sequences that have been derived so far. The ARG grows when we derive new sequences: each time, we try all possible ways of deriving a new sequence by (1) an (unused) mutation from a derived sequence, or (2) a recombination of two derived sequences. We can find the minimum ARG by searching through all possible ways of deriving new sequences, starting from every sequence node in the hypercube as the root of the ARG, and finding the ARG with the smallest number of recombinations. We check whether the current ARG derives an HI solution. If so, we have found an ARG that is potentially the solution, given the choices we made on the search path, and we store this solution if this ARG uses fewer recombinations than the best found so far, Rmin.
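The two derivation moves in this search can be sketched as follows (our illustration, not the paper's implementation; sequences are tuples, and used_sites records which sites have already mutated, since under the infinite sites model each site mutates only once on the ARG):

```python
def mutations(seq, used_sites):
    """Yield (sequence, site) pairs derivable from seq by one unused
    site mutation (each site mutates at most once on the ARG)."""
    for c in range(len(seq)):
        if c not in used_sites:
            yield seq[:c] + (1 - seq[c],) + seq[c + 1:], c

def recombinants(s1, s2):
    """All sequences derivable by a single crossover: a prefix of one
    parent followed by a suffix of the other."""
    return ({s1[:k] + s2[k:] for k in range(1, len(s1))} |
            {s2[:k] + s1[k:] for k in range(1, len(s1))})
```

A branch-and-bound search would repeatedly pick one of these moves, counting a recombination whenever recombinants is used, and prune the path once the count plus a genotype lower bound reaches the best total found so far.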

If not, we continue to derive more sequences from the current derived sequences. Throughout, we maintain a variable Rmin holding the currently known minimum number of recombinations, and we also maintain the best ARG found so far. At each step, we compute a lower bound on the minimum number of recombinations needed to derive an HI solution, given the choices made on the current search path. If the lower bound is not smaller than Rmin, we know the current partially built ARG cannot lead to a better solution, and we stop this search path. We illustrate the basic procedure of the branch and bound method in Algorithm GenoMinARG.

Algorithm GenoMinARG
1. Root: Maintain a set of sequences called the derived set, containing the sequences that are part of the ARG built so far. Initialize the derived set with a binary sequence sr as the root of the ARG. Initialize Rmin to be infinity (or some pre-computed upper bound).
2. Deriving sequences: Repeat until all search paths are explored or terminated:
2.1 Grow the derived set by deriving a new sequence, through either a recombination of two sequences in the derived set or an (unused) mutation from a sequence in the derived set.
2.2 Check whether the derived set contains an HI solution. If so, denote the number of recombinations in this ARG Rminc; if Rminc < Rmin, set Rmin <- Rminc. Stop this search path and continue with the next search.
2.3 If the recombination lower bound (with the current derived haplotypes) is at least Rmin, stop this search path and continue with the next branch. Otherwise, follow this branch and continue at step 2.
3. Return to Step 1 if there are more root sequences to try; stop the algorithm otherwise.

Remarks: The key to the success of the branch and bound method is the use of the genotype recombination lower bounds presented earlier. We use the lower bound methods in Section 3, and also improve them with some additional ideas which speed up the method significantly; we omit the details due to the space limit. The branch and bound method seems to work for many datasets with up to 8 sites. It is still useful because there are real biological data that contain a small number of sites and many rows; we provide such an example in Section 5.3.

5. APPLICATIONS

5.1. Detecting recombination hotspots using genotype data

Recombination rates are often believed to vary significantly across a genome. A recombination hotspot refers to a genomic region where the recombination rate is much higher than in its neighboring regions. Detecting recombination hotspots is important for many applications, for example association mapping, and has been actively studied recently, e.g. in Myers et al. [29]. Bafna and Bansal [2] applied recombination lower bounds based on haplotypes to reveal recombination hotspots. Their results on the MHC data [20] and MS32 data [21] indicate that recombination lower bounds may be useful in identifying the locations and intensity of recombination hotspots. However, computing recombination lower bounds on haplotype data has a potential problem: the real biological data (such as the MHC [20] and MS32 [21] data) are genotypes, and it is not clear what effect haplotyping error has on recombination hotspot detection. Here, we compute the exact minimum number of recombinations for small intervals of the genotype data, and thus effectively remove the haplotyping uncertainty from our results.

Given a set of genotypes G, we move a sliding window containing a small number of (say 6) SNPs along the genome, each time moving the window by half of its width. We denote the submatrix within a window by G. For each window, we compute Rmin(G), and we use Rmin(G) on these intervals to calculate the average minimum number of recombinations per kb along the genome. For a few intervals, the computation of the exact minimum number of recombinations is slow; we simply time out and use an efficiently computed recombination lower bound instead for those intervals.

We first analyze the MHC data. After removing missing values, there are 277 (out of 296) SNPs and 50 genotypes. On a Pentium 2.0 GHz machine, the computation takes 242 seconds when the window size is 5, and 2865 seconds when the sliding window size is set to 6. Figure 1 plots the average minimum number of recombinations per kb across the region of interest. The results we obtained by computing over genotypes match the results over haplotype data quite well.
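The sliding-window scan can be sketched as follows (ours, for illustration). The per-window scoring function is a parameter: in the paper it is Rmin of the window's genotype submatrix, falling back to a fast lower bound on timeout, which we do not reimplement here, so the default below is a placeholder:

```python
def hotspot_scan(G, positions, window=6, score=None):
    """Slide a window of `window` SNPs by half its width and report
    (start_bp, end_bp, score per kb) for each window.

    `score(Gw)` should return the minimum (or a bound on the) number
    of recombinations for sub-matrix Gw; the default is a placeholder
    that must be replaced by a real method.
    """
    if score is None:
        score = lambda Gw: 0           # placeholder scoring function
    out = []
    step = max(window // 2, 1)
    m = len(positions)                 # physical SNP positions in bp
    for lo in range(0, m - window + 1, step):
        hi = lo + window
        Gw = [row[lo:hi] for row in G]
        span_kb = (positions[hi - 1] - positions[lo]) / 1000.0
        out.append((positions[lo], positions[hi - 1],
                    score(Gw) / span_kb if span_kb > 0 else 0.0))
    return out
```

Plotting the third element of each tuple against position reproduces the kind of per-kb profile shown in Figures 1 and 2.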

Overall, the results we obtain by computing the minimum number of recombinations on genotype data match the results in Bafna, et al. 2 quite well. The DNA1/2, DNA3, DMB1/2 and TAP2 hotspots are clearly identifiable and match the results in Bafna, et al. 2.

Fig. 1. Detecting recombination hotspots using genotype data for the MHC data. Use a sliding window of 6 SNPs. (The plot shows the minimum number of recombinations per kb against physical distance, in kb, from the leftmost SNP.)

We also tested our method on the MS32 data 21. Again, we see good matches with the results reported in Bafna, et al. 2. Five hotspots are identifiable: NID1, NID2, MS32, MSTM1 and MSTM2; NID3 is not significant and is not detected. The result is shown in Figure 2.

5.2. Comparing minimum recombination on genotypes and haplotypes

In Section 4, we demonstrated that for certain genotype data we can build a minimum ARG that derives an HI solution explaining the given genotypes. This allows us to compare the minimum recombination on genotypes, on the original haplotypes, and on HI solutions found by program PHASE, thus allowing one to study this issue.

We generated simulated genotype data as follows. We first run Hudson's program MS 18 to generate 2n haplotypes (denoted H0). Genotype matrix G with n rows is then generated by pairing haplotype 2i with haplotype 2i - 1. We choose various scaled recombination rates (denoted p) when running MS. For each data size and scaled recombination rate, we generate 100 datasets. For each dataset, we compute Rmin(G), the minimum number of recombinations on G. We also compute Rmin(H0) using program beagle 24. As a comparison, we run program PHASE on G, compare the PHASE HI solution (denoted Hp) with H0, and compute Rmin(Hp). Note that Rmin(G) <= Rmin(H0) and Rmin(G) <= Rmin(Hp).

Another interesting observation is on the performance of PHASE. PHASE is known to be quite accurate for many data. Our simulation results show that often, but not always, Rmin(Hp) is between Rmin(G) and Rmin(H0); for some data Rmin(H0) is closer to Rmin(Hp) than to Rmin(G), but on average Rmin(H0) is much closer to Rmin(G) than to Rmin(Hp). This shows that PHASE tends to minimize the number of recombinations to some extent, at least implicitly. Overall, these experiments suggest that for the downstream question of computing the minimum number of recombinations, the two-stage approach of first using PHASE to obtain an HI solution and then computing the minimum number of recombinations on it is an effective approach for the range of data we tested.

5.3. Haplotyping on a minimum ARG

Since we can build a minimum ARG for the input genotypes, we can construct an HI solution from this minimum ARG, since the ARG gives a set of haplotypes that can explain the genotypes. Naturally, we want to study the haplotyping accuracy of this minimum-ARG method. Here, we use the same simulated data as in Section 5.2. We run our method on G, obtain an HI solution H, and compare H with the original haplotypes H0. We will see that haplotyping on a minimum ARG may be an effective approach for the range of data we tested.
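The data-generation and window-scan steps above can be sketched as follows. This is an illustrative sketch: `rmin` is a stand-in for the exact genotype Rmin computation, and the usual 0/1/2 genotype encoding (2 = heterozygous site) is assumed:

```python
# Pairing simulated haplotypes into genotype rows, and scanning a
# sliding window moved by half of its width (as in Section 5.1).

def pair_haplotypes(haps):
    """Pair haplotype 2i-1 with haplotype 2i into one genotype row."""
    genos = []
    for i in range(0, len(haps), 2):
        a, b = haps[i], haps[i + 1]
        genos.append([x if x == y else 2 for x, y in zip(a, b)])
    return genos

def window_profile(genotypes, positions, width, rmin):
    """Per-window Rmin per kb, moving the window by half its width."""
    profile, start = [], 0
    while start + width <= len(positions):
        window = [row[start:start + width] for row in genotypes]
        span_kb = positions[start + width - 1] - positions[start]
        profile.append((positions[start], rmin(window) / max(span_kb, 1)))
        start += max(width // 2, 1)           # slide by half the width
    return profile
```

Plotting the second coordinate of the returned pairs against the first gives profiles of the kind shown in Figures 1 and 2.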

We use the following three haplotyping accuracy measures: (a) the standard error 35, the percentage of incorrectly phased genotypes; (b) the switching error 23, which counts incorrectly phased neighboring heterozygotes; and (c) the percentage of mis-phased 2s. Table 2 shows the haplotyping accuracy of our method, and Table 3 shows the result of program PHASE on the same datasets.

The results show that our method is comparable to the solutions by PHASE in accuracy. Sometimes our method produces an HI solution better than PHASE, although statistically PHASE is still slightly more accurate for the range of data we tested. This indicates that finding a single minimum ARG that derives an HI solution may not be enough to produce more accurate HI solutions than PHASE. Finding an ARG, either a minimum one or a near-minimum one, that gives the best haplotyping accuracy remains an interesting research problem.

Fig. 2. Detecting recombination hotspots using genotype data for the MS32 data. Use a sliding window of 6 SNPs. (The plot shows the minimum number of recombinations per kb against physical distance, in kb, from the leftmost SNP; the NID1, NID2, MS32, MSTM1 and MSTM2 hotspots are marked.)

Table 1. Comparison of Rmin(H0), Rmin(G) and Rmin(Hp). The data size is the number of genotype rows by the number of sites; p is the scaled recombination rate in Hudson's program MS. G stands for Rmin(G), H for Rmin(H0) and P for Rmin(Hp). The next three columns display the percentage of datasets where the two compared minimum numbers of recombinations are equal. The two columns on the right show the average difference between Rmin(G) (resp. Rmin(Hp)) and Rmin(H0), averaged over the 100 datasets.

Data size   p    G=H   G=P   P=H   H-G    H-P
15 by 7    20    68%   63%   56%   0.20   0.62
20 by 7    20    60%   72%   45%   0.32   0.68
15 by 7    30    60%   56%   38%   0.34   0.79
20 by 7    30    56%   68%   44%   0.34   0.80
15 by 7    40    46%   54%   43%   0.45   0.87
20 by 7    40    47%   76%   46%   0.48   0.99

One can see that for a large portion of the data we simulated, Rmin(G) = Rmin(H0); for example, for data of size 15 by 7, 43% of the datasets have equal Rmin(G) and Rmin(H0). Thus, for the downstream question of computing the minimum number of recombinations, the two-stage approach of first using PHASE to obtain an HI solution Hp and then computing Rmin(Hp) is effective for the range of data we tested. We want to point out, however, that this conclusion would not be possible without our method of computing Rmin(G).
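The three measures (a)-(c) can be made concrete as follows, assuming haplotypes are 0/1 lists and every genotype is phased into an ordered pair of haplotypes. The exact definitions in Refs. 35 and 23 differ in details, so this is only an illustrative sketch:

```python
# Accuracy measures for phased genotypes. truth_pairs and inferred_pairs
# are lists of (haplotype1, haplotype2) pairs, one per genotype.

def phase_errors(truth_pairs, inferred_pairs):
    """Return (standard error, switch error, fraction of mis-phased 2s)."""
    wrong_geno = switches = opps = bad2 = all2 = 0
    for (t1, t2), (h1, h2) in zip(truth_pairs, inferred_pairs):
        het = [i for i in range(len(t1)) if t1[i] != t2[i]]  # the "2" sites
        orient = [0 if h1[i] == t1[i] else 1 for i in het]   # which truth h1 tracks
        if len(set(orient)) > 1:
            wrong_geno += 1                      # (a) genotype phased incorrectly
        switches += sum(a != b for a, b in zip(orient, orient[1:]))
        opps += max(len(orient) - 1, 0)          # (b) neighboring-het switch chances
        bad2 += min(orient.count(0), orient.count(1))
        all2 += len(orient)                      # (c) mis-phased heterozygous sites
    n = len(truth_pairs)
    return (wrong_geno / n,
            switches / opps if opps else 0.0,
            bad2 / all2 if all2 else 0.0)
```

A perfectly phased dataset scores (0, 0, 0); a single switch in the middle of one genotype raises all three measures at once, which is why the three columns of Tables 2 and 3 tend to move together.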

Finally, we test our method with the APOE data from Orzack, et al 31. This genotype data has a real solution (i.e., experimentally determined phases). The data has 47 non-trivial genotypes (i.e., genotypes containing more than one 2) and 9 sites. For this data, Rmin(G) = 5. Our method takes about 2 minutes to find an HI solution with 5 incorrectly phased genotypes; program PHASE produces a solution within 16 seconds with 4 incorrectly phased genotypes. One can see that, compared to PHASE, the min-ARG approach is comparable but performs slightly worse. We also note that Rmin(Hp) = 6 and, for the real solution, Rmin(H0) = 7. This is a small indication that haplotyping on a minimum (or near-minimum) ARG may be useful, i.e., real data may be derived on a minimum or near-minimum ARG. A benefit of our method is that it computes the minimum number of recombinations for the given genotype data (under the infinite sites model).

Table 2. Accuracy of haplotyping on a minimum ARG. The accuracy measures include the standard error, the switch error and the percentage of mis-phased 2s, for data sizes 15 by 7 and 20 by 7 and p = 20, 30, 40. The results are averaged over 100 datasets for each parameter setting.

Table 3. Accuracy of program PHASE on the same datasets. The results are averaged over 100 datasets for each parameter setting.

Acknowledgments. Work supported by grants CCF-0515278 and IIS-0513910 from the National Science Foundation. The authors wish to thank Yun S. Song for sharing simulation scripts.

References

1. V. Bafna and V. Bansal. The number of recombination events in a sample history: conflict graph and lower bounds. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1:78-90, 2004.
2. V. Bafna and V. Bansal. Improved recombination lower bounds for haplotype data. In Proceedings of RECOMB 2005, pages 569-584.
3. M. Bordewich and C. Semple. On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics, 8:409-423, 2005.
4. D. Brown and I. Harrower. A new integer programming formulation for the pure parsimony problem in haplotype analysis. In Proc. of 2004 Workshop on Algorithms in Bioinformatics, 2004.
5. D. Brown and I. Harrower. A new formulation for haplotype inference by pure parsimony. Technical report cs2005-03, School of Computer Science, University of Waterloo, 2005.
6. A. Clark. The role of haplotypes in candidate gene studies. Genetic Epidemiology, 27:321-333, 2004.
7. The International HapMap Consortium. The HapMap project. Nature, 426:789-796, 2003.
8. The International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299-1320, 2005.
9. Z. Ding, V. Filkov and D. Gusfield. A linear-time algorithm for the perfect phylogeny haplotyping problem. In Proceedings of RECOMB 2005.
10. D. Gusfield. Haplotype inference by pure parsimony. In R. Baeza-Yates, E. Chavez and M. Chrochemore, editors, 14th Annual Symposium on Combinatorial Pattern Matching (CPM'03), volume 2676 of Springer LNCS, pages 144-155, 2003.
11. D. Gusfield, S. Eddhu and C. Langley. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. Journal of Bioinformatics and Computational Biology, 2(1):173-213, 2004.
12. D. Gusfield. Optimal, efficient reconstruction of Root-Unknown phylogenetic networks with constrained and structured recombination. JCSS, 70:381-398, 2005.
13. D. Gusfield, D. Hickerson and S. Eddhu. An efficiently-computed lower bound on the number of recombinations in phylogenetic networks: theory and empirical study. To appear in Discrete Applied Math, special issue on Computational Biology.
14. E. Halperin, G. Kimmel and R. Shamir. Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics, 21:i195-i203, 2005 (Proceedings of ISMB 2005).
15. J. Hein, M. Schierup and C. Wiuf. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, 2005.
16. D. Hinds, L. Stuve, G. Nilsen, E. Halperin, E. Eskin, D. Ballinger, K. Frazer and D. Cox. Whole-genome patterns of common DNA variation in three human populations. Science, 307:1072-1079, 2005.
17. E. Hubbell. Personal communication, August 2000.
18. R. Hudson. Generating samples under the Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337-338, 2002.
19. R. Hudson and N. Kaplan. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111:147-164, 1985.
20. A. Jeffreys, L. Kauppi and R. Neumann. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nature Genetics, 29:217-222, 2001.
21. A. Jeffreys, R. Neumann, M. Panayi, S. Myers and P. Donnelly. Human recombination hot spots hidden in regions of strong marker association. Nature Genetics, 37:601-606, 2005.
22. G. Lancia, M. Pinotti and R. Rizzi. Haplotyping populations by pure parsimony: complexity, exact and approximation algorithms. INFORMS Journal on Computing, 16:348-359, 2004.
23. S. Lin, D. Cutler, M. Zwick and A. Chakravarti. Haplotype inference in random population samples. Am. J. of Human Genetics, 71:1129-1137, 2002.
24. R. Lyngso, Y. S. Song and J. Hein. Minimum recombination histories by branch and bound. In Proceedings of Workshop on Algorithms in Bioinformatics (WABI) 2005, volume 3692, pages 239-250.
25. J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z. Qin, H. Munro, G. Abecasis and P. Donnelly. A comparison of phasing algorithms for trios and unrelated individuals. Am. J. of Human Genetics, 78:437-450, 2006.
26. G. McVean, S. Myers, S. Hunt, P. Deloukas, D. Bentley and P. Donnelly. The fine-scale structure of recombination rate variation in the human genome. Science, 304:581-584, 2004.
27. A. Morris, J. Whittaker and D. Balding. Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data. Am. J. of Human Genetics, 74:945-953, 2004.
28. R. Morris and N. Kaplan. On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genetic Epidemiology, 23:221-233, 2002.
29. S. Myers and R. Griffiths. Bounds on the minimum number of recombination events in a sample history. Genetics, 163:375-394, 2003.
30. S. Myers, L. Bottolo, C. Freeman, G. McVean and P. Donnelly. A fine-scale map of recombination rates and hotspots across the human genome. Science, 310:321-324, 2005.
31. S. Orzack, D. Gusfield, J. Olson, S. Nesbitt, L. Subrahmanyan and V. Stanton. Analysis and exploration of the use of rule-based algorithms and consensus methods for the inferral of haplotypes. Genetics, 165:915-928, 2003.
32. R. V. Satya and A. Mukherjee. An optimal algorithm for perfect phylogeny haplotyping. In Proceedings of 4th CSB Bioinformatics Conference, IEEE Press, Los Alamitos, CA, 2005.
33. P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. of Human Genetics, 78:629-644, 2006.
34. Y. S. Song, Y. Wu and D. Gusfield. Efficient computation of close lower and upper bounds on the minimum number of needed recombinations in the evolution of biological sequences. Bioinformatics, 21:i413-i422, 2005 (Proceedings of ISMB 2005).
35. M. Stephens, N. Smith and P. Donnelly. A new statistical method for haplotype reconstruction from population data. Am. J. of Human Genetics, 68:978-989, 2001.
36. B. Voight, S. Kudaravalli, X. Wen and J. Pritchard. A map of recent positive selection in the human genome. PLoS Biology, 4:e72, 2006.
37. L. Wang and Y. Xu. Haplotype inference by maximum parsimony. Bioinformatics, 19:1773-1780, 2003.
38. C. Wiuf. Inference of recombination and block structure using unphased data. Genetics, 166:537-545, 2004.

SORTING GENOMES BY TRANSLOCATIONS AND DELETIONS

Xingqin Qi
Department of Applied Mathematics, Shandong University at Weihai, Weihai 264213, China
Email: qixingqin@163.com

Guojun Li
School of Mathematics and System Sciences, Shandong University, Jinan 250100, China, and
Department of Biochemistry and Molecular Biology, University of Georgia, Athens, Georgia 30602, USA
Email: guojun@csbl.bmb.uga.edu

Shuguang Li
School of Mathematics and Information Science, Yantai University, Yantai 264005, China
Email: sgliytu@hotmail.com

Ying Xu
Department of Biochemistry and Molecular Biology, University of Georgia, Athens, Georgia 30602, USA
Email: xyn@csbl.bmb.uga.edu

Given two signed multi-chromosomal genomes Π and Γ with the same gene set, the problem of sorting by translocations (SBT) is to find a shortest sequence of translocations transforming Π into Γ. In this paper we consider a more general case, in which the gene set of Γ is a subset of the gene set of Π; in such a case, "deletions" are needed. We show how to extend Anne Bergeron's algorithm for SBT to include deletions, and present an asymptotically optimal algorithm for transforming Π into Γ by translocations and deletions, providing a feasible sequence with length at most OPT + 2, where OPT is the minimum number of translocations and deletions transforming Π into Γ. This analysis can also be used to approximate the minimum number of translocations and insertions transforming one genome into another, which allows us to compare genomes containing different genes.

1. INTRODUCTION

A translocation considered here is always reciprocal: it exchanges non-empty tails between two chromosomes. Given two multi-chromosomal genomes Π and Γ, the problem of sorting by translocations (abbreviated as SBT) is to find a shortest translocation sequence that transforms Π into Γ, where the length of the sequence is called the translocation distance between Π and Γ. SBT was first introduced by Kececioglu and Ravi 1 and was given a polynomial time algorithm by Hannenhalli 2. In 1996, Hannenhalli gave the formula of the translocation distance for the first time, based on which an O(n3) algorithm for SBT was given. Bergeron, Mixtacki and Stoye 3 pointed out an error in Hannenhalli's sorting strategy, revisited this problem and gave an elementary proof for the formula of the translocation distance, which leads to a new O(n3) algorithm for SBT. Wang et al. 4 gave a linear time algorithm for computing the translocation distance (without producing a shortest sequence). In 2005, Li et al. 5 presented an O(n2) algorithm for SBT.

Note that all the above algorithms assume that the two genomes have the same gene content. Such is of course rarely the case in biological practice. In this paper we consider a more general case: when the gene set of Γ is a subset of the gene set of Π; in such a case, "deletions" are needed.

An edge (u. hence the graph can be uniquely decomposed into a number of disjoint cycles. A chromosome is a sequence of genes and does not have an orientation. We refer to left vertex corresponding to a. ( 1 . (Xi. A chromosome is orientation-less. we represent a gene by a positive integer and an associated sign ("+" or "—") reflecting the direction of the gene. The vertex set V contains the pairs of vertices xt and xh for every gene x in II. v) in Q(H. v) with u € IN(I). We will try to computer the minimum number of translocations and deletions transforming II to T. (6. The necessary preliminaries are given in Section 2 and Section 3.l....Right(I)}. Therefore. A cycle is interchromosomal if it contains at least one interchromosomal gray edge.xf) of vertices. Each vertex has degree either 2 or 0. The bicolored cycle graph Q{J1. which provides a feasible sequence with length at most dtd(II.. v) in G(H.3). ( 6 . i < k < j}. V={u: u is either xl or xh. Vertices u and v are connected by a black edge iff they are neighbors in II. 5 ) . where x is a gene in II}. Let X = (X\. vertices xt and xh are always neighbors and for simplicity.. . which is denoted as dtd(H.3). therefore flipping a chromosome X = xi. T = {(1. T) + 2.X2). A sub-permutation(SP) is an interval I = Xi. The paper is organized as follows. replace every positive element +x. A gray edge (u. . 2.Xj within a chromosome X = xi. otherwise is intrachromosomal.. a chromosome X is said to be identical to a chromosome Y iff either X = Y or X = —Y.2. Yj) be two chromosomes. X2) and Y = (Yj. 2..1.-. —x\ does not affect the chromosome it represents. r ) . As a convention we illustrate a chromosome horizontally and read it from left to right.T) of II with respect to T which have the same gene content is defined as follows.£&).. Hence. T) is interchromosomal if u.X2. SBT is limited to genomes that are cotailed.Y2 are sequences of genes. The Sub-permutation A segment is an interval I = Xi. i.5). 
otherwise is intrachromosomal. In Section 5 and Section 6 we give the approximation algorithm and its analysis respectively.e.8.Xk). The Cycle Graph For a chromosome X = (xi. In the following. For a chromosome X = (xi. The set of tails of all the chromosomes in II is denoted by T(II).. write An for those in IT only. (4.. We say that vertices u and v are neighbors in a genome if they are neighbors in some chromosome in this genome. .Xk into —X = —Xfe. let n = {(4. A prefixsuffix translocation switches X\ with Yi resulting in {—Y2. Vertices u and v are neighbors in X if they are adjacent in the ordered list constructed in aforementioned manner.e. (Yi.. A prefix-prefix translocation switches X\ with Y\ resulting in ( V i .. A genome is a set of chromosomes. We assume that each gene in A appears exactly once in each genome.. For gene x.o. For example.. We present an asymptotically optimal algorithm.. PRELIMINARIES As usual. Let Vi be the set of vertices induced by genes in / . —X\). Vertices u and v are connected by a gray edge iff they are neighbors in V. v belong to different chromosomes. T) is given in Section 4.. where X\. Reader(s) are assumed to have a thoughtful understanding of Refs. Genomes II and T are co-tailed if T(II) = ^"(r). 9 ) } and 2. ^ ) . and right vertex corresponding to Left (I) and Right(I) respectively. Genomes II and T are said to be identical if their sets of chromosomes are the same. . we assume that the elements in each chromosome of the target genome r are positive and in increasing order. w.v£lN(I)..x^) of vertices and replace every negative element — Xi by ordered pair (x^. V/ ={u: u is either x\ or x\.. i. the numbers +xi and — Xk are called tails of X. we exclude them from the definition of "neighbors" in the following discussion.9)}. A lower bound on dtd(II.Yi. Xi+i.X2.. 2 and 3.158 A for the set of genes in both II and T. Define IN{I)=Vi \ {Left(I). 
The resulting genome after applying a translocation p on genome II is denoted as II • p. 2 . T) is said to be inside the interval J if u.7.Xj within a chromosome X of II such that (i) there exists no edge (u.v £ IN(I) and (ii) that is not the union of smaller such ..xm. Y^). Conclusions are given in Section 7... and the corresponding element is said to be positive element or negative element.g.2...8 . by ordered pair (x'.7 ..
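The two translocation operations can be sketched directly on the list representation of chromosomes. This is an illustrative sketch (the function names are ours, not from the paper); cut indices give the lengths of the exchanged prefixes:

```python
# Translocations on chromosomes represented as lists of signed integers.

def prefix_prefix(x, y, i, j):
    """Switch prefix x[:i] with prefix y[:j]: yields (Y1, X2) and (X1, Y2)."""
    return y[:j] + x[i:], x[:i] + y[j:]

def flip(seg):
    """Chromosomes are orientation-less: -X reverses order and signs."""
    return [-g for g in reversed(seg)]

def prefix_suffix(x, y, i, j):
    """Switch prefix x[:i] with suffix y[j:]: yields (-Y2, X2) and (Y1, -X1)."""
    return flip(y[j:]) + x[i:], y[:j] + flip(x[:i])
```

Note that both operations preserve the multiset of genes and, because the exchanged tails are non-empty for 0 < i <= len(x) and 0 < j < len(y), they model the reciprocal translocations considered in this paper.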

A SP is trivial if it is of the form (i, i+1) or (-(i+1), -i); otherwise it is non-trivial. A minimal sub-permutation (MSP) is a SP not containing any other SPs. We refer to a SP by giving its first and last element, such as (xi, xj). Two different SPs of a chromosome are either disjoint, nested with different endpoints, or overlapping on one element. When two SPs overlap on one element, we say that they are linked. Successive linked SPs form a chain; a chain that cannot be extended to the left or right is called maximal. The nesting and linking relations of SPs on a chromosome can be shown in the following way.

2.3. The Forest of SPs

Definition 2.1. Given a chromosome X and its SPs, define the forest F_X by the following construction:
1. Each non-trivial SP is represented by a round node.
2. Each maximal chain that contains a non-trivial SP is represented by a square node whose (ordered) children are the round nodes that represent the non-trivial SPs of this chain.
3. A square node is the child of the smallest SP that contains this chain.

The above definition can be extended to a forest of a genome by combining the forests of all chromosomes.

Definition 2.2. Given a genome Π consisting of chromosomes {X1, X2, ..., XN}, the forest F_Π is the union of the forests F_X1, ..., F_XN.

Note that a leaf of F_Π corresponds to an MSP of Π; we will refer to an MSP that is a leaf in F_Π simply as a leaf. Denote the number of leaves and trees of F_Π by L and T respectively.

2.4. The Translocation Distance

Let ρ(X, Y, i, j) be a translocation acting on chromosomes X = (x1, ..., xp) and Y = (y1, ..., yq), where the cleavages occur in X between xi-1 and xi and in Y between yj-1 and yj. Let f ∈ {xi-1^t, xi-1^h} and g ∈ {xi^t, xi^h} be such that f and g are neighbors in X, and let u ∈ {yj-1^t, yj-1^h} and v ∈ {yj^t, yj^h} be such that u and v are neighbors in Y. Then ρ acts on the black edges (f, g) and (u, v).

Let Δc denote the change in the number of cycles after performing a translocation on Π; then Δc ∈ {-1, 0, 1}. A translocation is proper if Δc = 1, improper if Δc = 0 and bad if Δc = -1. It is easy to see that each interchromosomal gray edge (u, v) in G(Π, Γ) determines a proper (prefix-prefix or prefix-suffix) translocation ρ of Π, obtained by cutting the two black edges incident on u and v respectively. In the following, we only consider proper translocations determined by interchromosomal gray edges.

We say that a translocation destroys a SP C if C is not a SP in the resulting genome. The only way to destroy a SP with translocations is to apply a translocation with one cleavage in the SP and one cleavage in another chromosome; such translocations always merge cycles and thus are always bad. A translocation may destroy more than one SP at the same time: at most two MSPs on two different chromosomes, plus all SPs containing these two MSPs. If a translocation destroys two MSPs on different chromosomes at the same time, we say the translocation merges the two MSPs.

Given two genomes Π and Γ with the same gene set, assume, as in Ref. 3, that there are N chromosomes and n genes in both Π and Γ, and let c be the number of cycles in G(Π, Γ). If T = 1 and L is even, genome Π has an even-isolation. The minimum number of translocations for transforming Π into Γ is dt(Π, Γ) = n - N - c + t, where

t = L + 2, if L is even and T = 1;
t = L + 1, if L is odd;
t = L,     if L is even and T ≠ 1.

Lemma 2.2 (Ref. 3). If a chromosome X of genome Π contains more than one tree, then the trees can be separated onto different chromosomes by proper translocations without modifying F_Π.

Bergeron et al. 3 also proved that it is possible to eventually merge two MSPs that initially belong to two different trees of the same chromosome.
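For genomes with the same gene content, the cycle decomposition that the distance formula depends on can be computed by a simple union-find pass over the black and gray edges. The sketch below is our own illustration (not the paper's code); it counts c only, so n - N - c gives just the cycle part of the formula, omitting the forest term t:

```python
# Cycle count c of the bicolored graph G(Pi, Gamma) for two co-tailed
# genomes with the same gene set. Genomes are lists of chromosomes,
# each a list of signed integers.

def ends(gene):
    """Ordered (left, right) vertex of a signed gene occurrence."""
    x = abs(gene)
    return ((x, "t"), (x, "h")) if gene > 0 else ((x, "h"), (x, "t"))

def neighbor_edges(genome):
    """Vertex pairs that are neighbors (adjacent gene ends) in the genome."""
    edges = []
    for chrom in genome:
        for a, b in zip(chrom, chrom[1:]):
            edges.append((ends(a)[1], ends(b)[0]))  # right end of a, left end of b
    return edges

def cycle_count(pi, gamma):
    parent = {}
    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]    # path halving
            v = parent[v]
        return v
    touched = set()
    for u, v in neighbor_edges(pi) + neighbor_edges(gamma):  # black + gray edges
        touched.update((u, v))
        parent[find(u)] = find(v)
    # every vertex incident to an edge has degree 2, so each component is a cycle
    return len({find(v) for v in touched})
```

For example, Π = {(1, 4), (3, 2)} versus Γ = {(1, 2), (3, 4)} yields a single cycle, so n - N - c = 4 - 2 - 1 = 1, matching the one suffix-swapping translocation that sorts Π.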

. Color the leaves corresponding to indirect MSPs red.5(3*). then S can be rewritten as +l.. a. T) = 1. 13.+4. • Gray edges link adjacent vertices in T.l. 3 .2 else perform a proper translocation such that T and L remain unchanged. R e m a r k 2 . then 6(a).We always assume genomes n and r are co-tailed. 16)}. c.up. d.ii2 • • • up-i. We represent genomes n and T by the redefined cycle graph G(H. i. where n = {(1. Bergeron et al. How to select a valid MSP to destroy has been described in the proof of Theorem 2 in Ref.. Let II be a genome which is induced from n by deleting from n . 6).1 2 . These three sets are defined as follows: •v = {x°yx%{At} • The black edges pertain to genome n . For any segment of form S = Ui. Thus one can use Algorithm I to transform II toT. and for all i. Indirect black edges are indicated by thick lines. 8 . +4.160 A translocation p is valid if dt(H. that implies II and T are co-tailed too. destroying it will not create an even-isolation in the resulting genome. where u\. T). . to find the minimum of translocations and deletions required to transform genome n to T. An indirect cycle (or indirect SP) is one containing at least one indirect black edge. or V > 1 and V = L . b. Through Algorithm I. 4 . the segment of elements in An between a and b.14. Ui corresponds to a gene of An. -3. we always try to destroy a valid MSP on some chromosome. —15. E x a m p l e 3 . ON SORTING BY TRANSLOCATIONS AND DELETIONS Returning to the problem at hand. we always try to merge a pair of valid MSPs on different chromosomes. An example is given in the following Fig. +b.5 . and could be replaced with any symbols different from those used in A. There are two kinds of black edges: direct black edges which link two adjacent vertices in n . e.5(lh).e. all the genes in An. . or T > 1 and V = L . T) — dt(U • p.3 .9. How to select a pair of valid MSPs to merge has been introduced in the proof of Lemma 4 in Ref. 
A translocation ρ is valid if dt(Π, Γ) - dt(Π · ρ, Γ) = 1. A translocation is safe if it does not create any new non-trivial SPs. Based on the distance formula, Bergeron et al. 3 gave an efficient O(n3) algorithm for SBT (hereafter called Algorithm I). For completeness, we describe it in the following.

Algorithm I
while Π is not sorted do
    if there exist MSPs on different chromosomes then
        perform a bad translocation (merging two MSPs) such that T' = 0, or T' > 1 and L' = L - 2
    else if L is odd then
        perform a bad translocation such that T' = 0, or T' > 1 and L' = L - 1
    else if L is even and T = 1 then
        destroy one leaf such that L' = L - 1
    else
        perform a proper translocation such that T and L remain unchanged
end while

For the above algorithm, we have two points to remark.

Remark 2.1. Through Algorithm I, we always try to merge a pair of valid MSPs on different chromosomes, i.e., merging them will not create an even-isolation in the resulting genome. How to select a pair of valid MSPs to merge has been introduced in the proof of Lemma 4 in Ref. 3.

Remark 2.2. Through Algorithm I, we always try to destroy a valid MSP on some chromosome, i.e., destroying it will not create an even-isolation in the resulting genome. How to select a valid MSP to destroy has been described in the proof of Theorem 2 in Ref. 3.

3. ON SORTING BY TRANSLOCATIONS AND DELETIONS

We now return to the problem at hand: to find the minimum number of translocations and deletions required to transform genome Π into Γ when the gene set of Γ is a subset of the gene set of Π.

3.1. The New Definition for the Cycle Graph

Given that the genes of Δ_Π are destined to be deleted, their identities and signs are irrelevant, and they could be replaced with any symbols different from those used in A. For any segment of the form S = u1, u2, ..., up, where u1 and up correspond to two elements of A and, for all 2 <= i <= p - 1, ui corresponds to a gene of Δ_Π, we replace S by S' = u1, δ(u1), up, where δ(u1) is the segment of Δ_Π between u1 and up.

Example 3.1. Let S = +1, -a, -3, +b, -c, +2, +4, -5 be a segment on a chromosome of Π, where a, b and c are genes of Δ_Π. Then S can be rewritten as +1, δ(1^h), -3, δ(3^t), +2, +4, -5, where δ(1^h) = -a and δ(3^t) = +b, -c.

We represent genomes Π and Γ by the redefined cycle graph G(Π, Γ) = (V, B, D), where V is the set of vertices, B is the set of black edges and D is the set of gray edges. These three sets are defined as follows:
- V = {x^t, x^h : x is a gene in A};
- the black edges pertain to genome Π; there are two kinds of black edges: direct black edges, which link two adjacent vertices in Π, and indirect black edges, which link two vertices separated by a δ;
- the gray edges link adjacent vertices in Γ.

For an indirect black edge e = (a, b), the segment of elements of Δ_Π between a and b, i.e., δ(a), is called the label of e. An indirect cycle (or indirect SP) is one containing at least one indirect black edge; otherwise it is direct. In the figures, indirect black edges are indicated by thick lines. Color the leaves of F_Π corresponding to indirect MSPs red, and the leaves corresponding to direct MSPs blue. An example is given in Fig. 1.

Let Π' be the genome induced from Π by deleting from Π all the genes in Δ_Π. We always assume that genomes Π and Γ are co-tailed, which implies that Π' and Γ are co-tailed too. Thus one can use Algorithm I to transform Π' into Γ.
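The relabeling step of Section 3.1 — collapsing each run of to-be-deleted genes into a label attached to the adjacency that becomes an indirect black edge — can be sketched as follows. The representation (signed integers, a dict of labels keyed by the position of the run's left neighbor) is our own illustrative choice:

```python
# Collapse runs of genes absent from the target gene set into labels.
# Under the co-tailed assumption, runs sit strictly between common genes.

def collapse_deleted(chromosome, target_genes):
    """Return (collapsed chromosome, {position: deleted-run label})."""
    collapsed, labels, run = [], {}, []
    for gene in chromosome:
        if abs(gene) in target_genes:
            if run:
                # the adjacency (collapsed[-1], gene) becomes an indirect edge
                labels[len(collapsed) - 1] = run
                run = []
            collapsed.append(gene)
        else:
            run.append(gene)                  # gene destined for deletion
    return collapsed, labels
```

On the segment of Example 3.1 style, a run like -7, +8, -9 between -3 and +2 ends up as the label of the indirect adjacency, while the collapsed chromosome is what Algorithm I operates on.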

.::-: f'"" 8' '© O 6 4' *' 4' O 5" 0 5' & 6' 6 6fc <> ^ O——O O 16 » 7' 7k 9' »* 10' 10' l l l l l ' 12" 12' 13' 13k 141 i 4 » i 5 » 15' 16' Cycle Graph (1-3) (4.-S'. T). The segment [x. An interchromosomal size k is transformed by Algorithm I translocations. Definition 3. The New Definition for a Translocation In 0(n. d) exchanges the segment [xi.. yq) of Y. o] of X with the segment [yi... Assume the two black edges e = (a.c) exchanges the segment [x\. these cycles is indirect..d.yq}oiY.b) be one indirect edge in Q(U. The segment [(5(a).S) (9.. b) and / = (c.. (1) The translocation determined by g = (a. T). This gets rid of the two MSPs and creates an interchromosomal cycle.T). at least one of which is of size 1. At the end. (5)The translocation determined by e and / exchanges the segment [xi... 5(a)] of X with the segment [yuc]oiY. we will save one step of deletion. This gets rid of M and creates an interchromosomal cycle. The segment [5(a). x] designates the interval bounded on the left by the element of An adjacent to a and on the right by x. T) by ci(LT. Thus we get the optimal sorting scheme: So that there are as few deletions and translocations as possible. (4) The translocation determined by g — (b.c) By Corollary 4. say C\.11) 6 (11. The merging of two MSPs is achieved by combining two intrachromosomal cycles C\ and C2 one from each of the two MSPs.5(c))oiY. if both C\ and C2 are indirect. r ) Algorithm I can be generalized to graphs containing direct and indirect edges by making use of the new definition of a translocation. Let e = (a.yq. the k cycles are all direct. (2) The translocation determined by g = (b.c. rect. To give Definition 3. T).1 simply. (3) The translocation determined by g = (a.6(a). 5(a)] designates the interval bounded on the left by x and on the right by a.161 exchanges the segment [xi.5(c). an indirect black edge determines not an adjacency of genome II but an interval containing only genes to be deleted. 
3.2. The New Definition for a Translocation

In G(Π, Γ), an indirect black edge determines not an adjacency of genome Π but an interval containing only genes to be deleted. We thus have to redefine what we mean by "the bad translocation acting on two black edges" and "the proper translocation determined by an interchromosomal gray edge". With the new definition of a translocation, Algorithm I can be generalized to graphs containing direct and indirect edges.

Assume the two black edges e = (a, b) and f = (c, d) are on two different chromosomes X = x1, ..., xp and Y = y1, ..., yq, where the xi (1 <= i <= p) and yj (1 <= j <= q) are vertices of G(Π, Γ). For a direct black edge e = (a, b) we define δ(a) = ∅. The segment [x, a] designates the interval bounded on the left by x and on the right by a; the segment [x, δ(a)] designates the interval bounded on the left by x and on the right by the element of Δ_Π adjacent to b; the segment [δ(a), x] designates the interval bounded on the left by the element of Δ_Π adjacent to a and on the right by x.

Definition 3.1.
(1) The translocation determined by the gray edge g = (a, c) exchanges the segment [x1, a] of X with the segment [y1, c] of Y.
(2) The translocation determined by g = (b, c) exchanges the segment [x1, δ(a)] of X with the segment [y1, c] of Y.
(3) The translocation determined by g = (a, d) exchanges the segment [x1, a] of X with the segment [y1, δ(c)] of Y.
(4) The translocation determined by g = (b, d) exchanges the segment [x1, δ(a)] of X with the segment [y1, δ(c)] of Y.
(5) The bad translocation determined by e and f exchanges the segment [x1, δ(a)] of X with the segment [y1, c] of Y.
The prefix-suffix variants, exchanging with the segment [d, yq] or [δ(c), yq] of Y, are defined analogously.

Fig. 1. The cycle graph G(Π, Γ) and the forest F_Π for an example; indirect black edges are indicated by thick lines. The non-trivial SPs shown are (1-3), (4-8), (9-11), (11-13) and (14-16).

4. A LOWER BOUND ON dtd(Π, Γ)

Given two genomes Π and Γ with different genes, denote the number of red leaves in F_Π by r(Π, Γ) and the number of indirect cycles in G(Π, Γ) by ci(Π, Γ).

A proper translocation determined by an interchromosomal gray edge of a cycle C transforms C into two cycles C1 and C2, at least one of which, say C1, is of size 1.

Corollary 4.1. The proper translocation above can be chosen such that the black edge of C1 is direct.

Lemma 4.1. An interchromosomal cycle C of size k is transformed by Algorithm I with k - 1 translocations into k cycles of size 1. If C is direct, the k cycles are all direct; if C is indirect, only one of these cycles is indirect.

By Corollary 4.1, Algorithm I gathers all the genes to be deleted in an interchromosomal indirect cycle into a single segment; at the end, a single deletion is required for each interchromosomal indirect cycle.

Now consider the intrachromosomal indirect cycles, which form MSPs. The merging of two MSPs is achieved by combining two intrachromosomal cycles C1 and C2, one from each of the two MSPs; this gets rid of the two MSPs and creates an interchromosomal cycle, and if both C1 and C2 are indirect, we save one step of deletion. The destroying of one MSP M is achieved by combining one intrachromosomal cycle C1 ∈ M with another cycle C2 which is not in any MSP; this gets rid of M and creates an interchromosomal cycle. The resulting interchromosomal cycles can be resolved as described above. Thus we get the optimal sorting scheme: so that there are as few deletions and translocations as possible, we just need to merge as many pairs of indirect cycles as possible through Algorithm I.

Theorem 4.1. dtd(Π, Γ) >= dt(Π', Γ) + ci(Π, Γ) - ⌊r(Π, Γ)/2⌋ - 1, where Π' is the genome induced by deleting from Π all the genes in Δ_Π, and dt(Π', Γ) is the translocation distance between Π' and Γ.

Lemma 4.2. The number of mergings of pairs of indirect cycles through Algorithm I is at most ⌊r(Π, Γ)/2⌋ + 1.

Proof. Since an MSP destroying may also merge a pair of indirect cycles, we consider three subcases.

Subcase 1: if L is even and T = 1, two MSPs M1 and M2 will be destroyed in steps 1 and 2 respectively, and the remaining MSPs are paired to merge in step 3. If both M1 and M2 are indirect, there are at most ⌊(r(Π, Γ) - 2)/2⌋ + 2 = ⌊r(Π, Γ)/2⌋ + 1 mergings of pairs of indirect cycles through Algorithm I; otherwise there are at most ⌊r(Π, Γ)/2⌋ mergings.

Subcase 2: if L is odd, step 1 is not executed and one MSP M will be destroyed in step 2; the remaining MSPs are paired to merge in step 3. If M is indirect, there are at most ⌊(r(Π, Γ) - 1)/2⌋ + 1 <= ⌊r(Π, Γ)/2⌋ + 1 mergings; if M is direct, it is impossible to merge a pair of indirect cycles in step 2, so there are at most ⌊r(Π, Γ)/2⌋ mergings in step 3.

Subcase 3: if L is even and T ≠ 1, steps 1 and 2 are not executed and the MSPs are paired to merge in step 3, so there are at most ⌊r(Π, Γ)/2⌋ mergings of pairs of indirect cycles.

In summary, the number of mergings of pairs of indirect cycles through Algorithm I is at most ⌊r(Π, Γ)/2⌋ + 1. □
choose an indirect black edge e in Mi and an indirect black edge / in M 2 .e. subcase 1: if L is even and T = 1.e. thus there are at most [ r ' 2' '] mergings of pairs of indirect cycles through Algorithm I. All leaves of the even-tree have color i. otherwise. having the same color if possible. We list the following six cases. If both Mi and M 2 are indirect. a Since MSP destroying may merge a pair of indirect cycles. But in some cases. thus there are at most L r ( n 'P~ 2 J + 2 = L ^ ^ J + 1 mergings of pairs of indirect cycles through Algorithm I. .1. it is called an even-tree. through Algorithm I. If M is indirect.P " 1 ] + 1 < L 1 ^ ] + 1 mergings of pairs of indirect cycles through Algorithm I. Sub-procedure 1: Ignore the color of leaves and apply Algorithm I on Q(H. it is impossible to merge a pair of indirect cycles in steps 1 and 2. The leaf of the 1-tree has color i. our strategy prefers to destroy a direct MSP. MSP destroying 1. we denote it by x-tree. L^FJ + I. If x is even. where i ^ j . In such case. subcase 3: if L is even and T ^ 1. We will prove it in the following subcases. one is an even-tree. and a black edge / on a different chromosome not contained in any MSP. The number of merging of pairs of indirect cycles through Algorithm I is at most Our sorting scheme requires careful consideration of MSP choice and cycle (i. thus there are at most [ r ( n . when some MSP must be destroyed. For a tree with x leaves. where i ^ j . 2. 2. Apply the (prefix-prefix) translocation determined by e and / by Definition 3. In such case.r) > dt(fl. Proof.r). Apply the (prefix-prefix) translocation determined by e and / by Definition 3. One MSP M will be destroyed in step 2. then there are at most [ '2 J mergings of pairs of indirect cycles in step 3. two MSPs M\ and M% will be destroyed in step 1 and 2 respectively. If both Mi and M 2 are indirect. the other two are 1-trees.
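Both merging and destroying end by applying the (prefix-prefix) translocation determined by the chosen black edges e and f. On toy signed chromosomes the two standard translocation types can be sketched as follows; the cut indices i and j stand in for the positions of the chosen black edges, and this parameterization (one common convention, not the paper's exact Definition 3.x, which is garbled in this scan) is our own.

```python
from typing import List, Tuple

def prefix_prefix_translocation(X: List[int], Y: List[int],
                                i: int, j: int) -> Tuple[List[int], List[int]]:
    """Exchange the prefix X[:i] with the prefix Y[:j]; i and j mark the
    cut points corresponding to the chosen black edges e and f."""
    return Y[:j] + X[i:], X[:i] + Y[j:]

def prefix_suffix_translocation(X: List[int], Y: List[int],
                                i: int, j: int) -> Tuple[List[int], List[int]]:
    """Exchange the prefix X[:i] with the suffix Y[j:], reversing and
    negating the moved segments to preserve signed gene orientation."""
    def flip(seg: List[int]) -> List[int]:
        return [-g for g in reversed(seg)]
    return flip(Y[j:]) + X[i:], Y[:j] + flip(X[:i])
```

For example, `prefix_prefix_translocation([1, 2, 3, 4], [5, 6, 7], 2, 1)` swaps the prefixes `[1, 2]` and `[5]` between the two chromosomes.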

Lemma 5. If i = j =blue.i^j. Case 3: T = 2. Sub-procedure 6: Ignore the color of leaves and apply Algorithm I on Q(Il.3. if some MSP must be destroyed. Ignore the color of leaves and apply Algorithm i on <?(n. / / there are same color leaves on different chromosomes.y > 2 and x + y is even. Case 4: T = 1. then 1. 3. If k = I = i.-tree and y-tree.2 or 3. has color i.2. Case 6: T = 1. and the leaf of the 1-tree has color blue. the other is a 1-tree.2.5 or 6. Lemma 5. 5. the leftmost leaf C has color j . 2.163 1. the two trees must be on different chromosomes. where x + y > 4. Keep merging the pair of leftmost leaves of color j on different chromosomes until at most one leaf of color j is left. Sub-procedure 5: Ignore the color of leaves and apply Algorithm I on Q(Tl. then there exists a proper translocation involving X that does not modify FnLemma 5. 0 3. . Let L fee a genome whose forest Fu T has L > 4 leaves and T > 2 trees. will make Tt and C be on different chromosomes.1. All leaves of the re-tree have color i. Keep merging the pair of rightmost leaves of color i on different chromosomes until at most one leaf of color i is left. In the process of Algorithm I. Proof. one is an x-tree. L is even and all leaves of Fu are red.T). When x > 2 This This d I t is e This will make 71 and C be on different chromosomes. All leaves of the even-tree have color red. L is odd and L > 3. 6 1. then there exists a bad translocation merging a pair of same color leaves such that the forest Fu< has L' = L — 2 leaves and T" ^ 1 trees. b 2. Sub-procedure 3: 1. 3 If a chromosome X of genome II contains more than one tree. we can always destroy a "valid" "direct" MSP as long as it is not the Case 4. 2.2.1. Ignore the color of leaves and apply Algorithm I on £(ILT). subcase 1: T = 2. Assume the two trees are a. 4. The rightmost leaf "R.1 destroy the middle (red) leaf of the tree. 2. Clearly. 3 / / o chromosome X of genome II contains more than one tree.5.T). 
all leaves of the y-tree have color j.r). then there exists a proper translocation p in II such that II • p does not have any new non-trivial MSP. feasible since we always use a prefix-prefix translocation to merge MSPs.1. r ) .4. Apply a sequence of proper translocations without changing Fu until the two trees are on different chromosomes by Lemma 2. 2 If there exists a proper translocation in II. We will prove it by discussing on T. Ignore the color of leaves and apply Algorithm I o n £ ( I I . then there exists a proper translocation involving chromosome X. Merge the middle leaves of the two trees. where L is even and Fu is not the Case 1. Sub-procedure 4: 1.d 5. Flip the chromosome on which the tree of color j is on the right of the tree of color i. Apply a sequence of proper translocations without changing Fu until the two trees are on different chromosomes by Lemma 2. merge the leaf of the 1-tree with the middle leaf of the odd-tree. 6. Lemma 5. will create four trees. then 2. one is an even-tree. Main Lemmas The following lemmas will be central in providing an invariant for the sorting algorithm. and no other chromosome of II contains non-trivial MSP. and the other internal leaves all have color red. merge 71 with C. Lemma 5.2 merge 1Z with £.1. where x. b c Case 5: T = 2. the other is a y-tree.

else. there exist a pair of same color leaves l\ and I2 on different chromosomes such that either l\ or I2 is a leaf of some z-tree. • Lemma 5.4. Let U be a genome whose forest Fu has L > 4 leaves and T > 2 trees. there exists a valid proper translocation without modifying FuOtherwise.6. there exists a safe proper translocation on IT. When x = l . where x > 2. and all leaves on X have color i while all leaves on Y have color j . We will prove it by discussing on T. either X or Y has more than one tree. respectively).2.2 leaves and T' ^ 1 trees. Since Fn is not the Case 1. Since T > 3. 5. by Lemma 5. (b) perform a bad translocation merging this two leaves. Since there are only two kind of colors. Merging l\ and li will result in a genome IT with L' = L . destroy a valid blue leaf by Lemma 5.3.2 or 3. (b) if it is Case 4 or 5. merging any pair of same color leaves on different chromosomes will result in a genome II' with V = L — 2 leaves and V ± 1 trees.2 or 3 respectively. (b) if there exist same color leaves on different chromosomes then perform a valid bad translocation merging a pair of same color leaves by Lemma 5. respectively). then there always exists a safe T proper translocation on LT.T. assume the two chromosomes are X and Y. Perform safe proper translocations on L unT til L is sorted by Lemma 5.7.3.5. subcase 1: T = 2. while L > 4 do (a) if it is Case 1. go to subprocedure 1.2. 4. subcase 3: T > 4. which determines a proper (prefix-prefix or prefix-suffix) translocation. Clearly. where L is even and Fu is not the Case 1.7. Let IT be a genome having no leaves. If the trees are on one chromosome. there exists a proper translocation without changing L. j / > 3 ( j / = l . Merging this pair of same color leaves will result in a genome II' with L' = L — 2 leaves and T" ^ 1 trees. else. a . (a) if L = 1.1. Proof. 2.3. • . since Fu is not the Case 2. > 3 . The resulting genome must have more than two trees. subcase 2: T = 3. Algorithm II 1. 
there must exist a pair of same color leaves between the two trees. destroy this leaf. if L = 2 ( comment: there must be T = 2) (a) perform proper translocations without changing Fu until the two leaves are on different chromosomes by Lemma 2. By Lemma 5.6. all trees in Fu must be on at most two chromosomes. T Proof. go to sub-procedure 4 or 5 respectively. Then by Lemma 5. merging Zi and I2 will result in a genome II' with L' = L — 2 leaves and T" ^ 1 trees. The Approximation Algorithm Our extended algorithm for merging as many as possible of pairs of indirect MSPs is given in the following Algorithm II. there exists a valid proper translocation without changing Fusubcase 2: T > 3. respectively) such that l\ has the same color with the only leaf 1% of x-tree (y-tree. If there does not exist any pair of same color leaves on different chromosomes.2 or 3. Then by Lemmas 5. where i ^ j .164 and y > 2. there exists an internal leaf li of y-tree (a:-tree. In this case. If L is not sorted. Since L is not sorted and Fu = 0. 3. then there exists a valid proper translocation of II without changing L.1 and 5. thus this proper translocation is valid. go to sub-procedure 6. destroy a valid blue leaf by Lemma 5. The two trees must be on the same chromosome. if L is even and T = 1 if it is Case 6. if L is odd Lemma 5. else. since Fu is not Case 3. • 5.4. there must T be an interchromosomal gray edge in Gn. perform a valid proper translocation by Lemma 5.
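The overall control flow of Algorithm II can be summarized in a runnable skeleton. Every method on the forest object below is a stub standing in for a sub-procedure or lemma invoked above (the real operations act on the cycle graph and leaf colors); only the case dispatch and the leaf-count arithmetic follow the listing, so this is a control-flow sketch under stated assumptions, not an implementation.

```python
class ToyForest:
    """Stub forest: tracks only the leaf count L and the operations applied."""
    def __init__(self, leaves: int):
        self.L, self.ops = leaves, []
    def special_case(self):                  # which of Cases 1..6 applies
        return None                          # toy forest: never a special case
    def destroy_valid_blue_leaf(self):       # Lemma on destroying a leaf
        self.L -= 1; self.ops.append("destroy")
    def merge_same_color_pair(self):         # a valid bad translocation
        self.L -= 2; self.ops.append("merge")
    def sub_procedure(self, k: int):         # sub-procedures 1..6
        self.L -= 2; self.ops.append(f"sub{k}")
    def separate_two_leaves(self):           # proper translocations
        self.ops.append("separate")
    def finish_with_safe_proper_translocations(self):
        self.ops.append("sort")

def algorithm_ii(f: ToyForest) -> None:
    """Control-flow skeleton of Algorithm II."""
    if f.L % 2 == 1:                         # L odd: destroy a valid blue leaf
        f.destroy_valid_blue_leaf()
    while f.L >= 4:                          # main merging loop
        k = f.special_case()
        if k in (1, 2, 3, 4, 5, 6):
            f.sub_procedure(k)               # go to sub-procedure k
        else:
            f.merge_same_color_pair()        # merge same-color leaves
    if f.L == 2:
        f.separate_two_leaves()              # put the two leaves on
        f.merge_same_color_pair()            # different chromosomes, merge
    elif f.L == 1:
        f.destroy_valid_blue_leaf()
    f.finish_with_safe_proper_translocations()
```

Running the skeleton on a toy forest with seven leaves drives L to zero: one destroy, a sequence of merges, then the final safe proper translocations.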

Procedure Deletion For each indirect edge e = (a.. Note that a red leaf is always merged with another red leaf in step 3 until one of the Cases 1. if rSi is even.3 appears in Algorithm II.T) be the number of mergings of pairs of indirect cycles in Algorithm II. i = 1. T) and a(U. In sub-procedure 4. in sub-procedure 3. In sub-procedure 1..6 appears in Algorithm II. If k = i.3..!• If k = l=j.5. one red leaf is destroyed. then r S4 is even. in sub-procedure 6.1. In sub-procedure 4. Assume there are rSi red leaves when sub-procedure i happens and there will be j/i pairs of red MSPs merged in subprocedure i.5. the other red leaves are paired to merge. the other red leaves are paired to merge. Assume that there are x» pairs of indirect MSPs which have been merged in step 3 when sub-procedure i happens.r). rSl is even. b) of Q\ (II. Case 6 appears.6.l = j or k = j. T). so ye = /?(n.1 . r ) is even. If i = j =blue. and a(TI. Let P(H. • Since when Cases i = 4.2. Theorem 5. Lemma 6. Proof.3. In subprocedure 4.l = i. then r S4 is odd. the other red leaves are paired to merge.If o r u y o n e 0I" the rightmost and leftmost leaf is red. r ) . so y2 = L 1 " 1 ^] = L ^ J . r s t e p3 = r(H.T) used to merge with a blue leaf. We discuss the six cases respectively. Case 1 appears.2.1. a red internal leaf will be destroyed. in sub-procedure 3.r). the other red leaves are paired to merge. two red leaves are used to merge with two blue leaves respectively.2. the other red leaves if exist are paired to merge. and the other red leaves are paired to merge. so yi = [ "\ J = L-^-J. the other red leaves if exist are paired to merge. In sub-procedure 2. so yx = L!:£V~J = L ^ J ~ lCase 2 appears.3 appears.r) is even.r) > a(II. If all leaves of the odd-tree are red. labeled by 6(a).4. Lemma 6. T) be the graph containing only length 1 cycles obtained by applying Algorithm II to the graph £/(II. Case 5 appears.165 Let Q\ (II. It is clear that the algorithm transforms genome II to genome T. 
T) is odd. and a(U. L ^ J = L^J . one red leaf is cut. so y5 == [ra52~ J = [^-J — 1. The number of translocations and deletions obtained by the Translocation-Deletion algorithm is Apxtd(U. so j/4 = [r"42~ J = L^-J.T) = L 1 ^ ^ ] ifr(II. two red leaves are destroyed respectively. then r S2 is odd. then rS2 is even. rS6 is even.. another red leaf will be used to merge with the only one blue leaf.T). ANALYSIS OF TRANSLOCATION-DELETION ALGORITHM Let a(II. Case 3 appears. the other red leaves are paired to merge. the middle red leaf is used to merge with the middle blue leaf. Obviously. If rS3 is even.T) = yi} so we have Lemma 6. one red leaf is . then rS4 is odd. so 2/3 = [rs32~ j = [^-J — 1. rSi = r(Il.r) = l^^1} .2. Clearly.1 tf r ( I I .T) is odd. the other red leaves if exist are paired to merge. Proof..1 J/r(IT.2. rSs is even. the other red leaves if exist are paired to merge.3. 6. two red leaves are used to merge with two blue leaves respectively. Fori = 1. thena(U. so 2/4 = \J-^2—J = L ^ J .T) = dt(n. T) = L 1 1 ^ ^ ] if r(U. Now we apply the following Procedure Deletion on Si (II. T) Delete the gene segment between a and b. so 2/3 = Lr's2~ j = L^"JCase 4 appears. If r33 is odd. / / one of Cases 1. Since rSi + 2xt = rstep3 = r ( I I . If k — I = i. / / one of Cases 4. We call Algorithm II augmented by Procedure Deletion the Translocation-Deletion algorithm. /3(II. In sub-procedure 2. the other red leaves are paired to merge. then rS2 is odd. In sub-procedure 2.T) + ci(U. one red leaf is used to merge with a blue leaf. one red leaf is destroyed and another red leaf is used to merge with the blue leaf.2. T). T) be the number of mergings of pairs of indirect MSPs during Algorithm II.T) = L 1 ^ ] . where i = 1. then a(IT. Assume that there are ratep3 red leaves at the beginning of step 3. so yi = [r'22~ j = [^-J. in sub-procedure 5. 6. two red leaves are used to merge with two blue leaves respectively.5 or 6 appears in Algorithm II.
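Procedure Deletion, stated above, removes for each indirect edge (a, b) the gene segment lying between a and b (Algorithm I has already gathered the genes to be deleted into a single segment, so one deletion suffices per indirect cycle). A minimal list-based sketch — the real procedure operates on the reduced cycle graph Q1(Π, Γ); the flat gene labels and the assumption that a precedes b are ours:

```python
from typing import List, Tuple

def procedure_deletion(chromosome: List[int],
                       indirect_pairs: List[Tuple[int, int]]) -> List[int]:
    """For each indirect edge (a, b), delete the genes lying strictly
    between a and b on the chromosome (one deletion per edge)."""
    for a, b in indirect_pairs:
        i, j = chromosome.index(a), chromosome.index(b)
        assert i < j, "a is expected to precede b on the chromosome"
        chromosome = chromosome[:i + 1] + chromosome[j:]
    return chromosome
```

For instance, deleting the segment between genes 1 and 2 in `[1, 9, 8, 2, 3]` removes the gathered genes 9 and 8 in a single deletion.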

1 and 5. 9th Annual International Conference On Research in Computional Molecular Biology (RECOMB'05).dtd(n. 6. Zhu. 1995: 604-613.. In: Proc.T) . if (11. By Theorems 4. 2004: 323-332.11.2 — 6. there are two indirect cycles in Q(Il.7 + 6 = 13. Wang. T) is even.4. So by Lemma 6. Of mice and men: Algorithms for evolutionary distance between genomes with translocation. 4.16) and (1.1 ACKNOWLEDGMENTS The work is supported. 70(3): 284-299.. L. Apxtd(Il...2 . . Cambridge.166 rSi and r(II.r) = x{ + L ^ J .11).. Ma.13. we give an asymptotically optimal algorithm when the gene set of T is a subset of the gene set of LT. In: Proc. IT = {(1. 3 .. We have a(II. o t h e r w i s e . Springer-Verlag. Wang. J.5. 1.3) is chosen to destroy. a(U.r) = 13 + 2 . 8 ..13) is chosen to destroy.6) and (14. Springer-Verlag. References Theorem 6. Journal of Computer and System Sciences 2005.. T) < L^fQj + 1 . R. the problem of transforming II to r with a minimum of translocations and insertions can be approximated by the translocation-deletion analysis.-12. X.. the resulting forest will be Case 2. if r (IT. Ravi.r)<2. a red leaf is always merged with another red leaf.9.. 15th Annual symposiun on Combinaotrial Pattern Matching (CPM'04).P(U. Since none of the six cases appears in Algorithm II. S. a(II. Hannenhalli. 3.. 11th Annual symposiun on Combinaotrial Pattern Matching (CPM'00). ^Knj^j _ 1. where IT takes the role of T. To the best of our knowledge. 71: 137-151. 60373025. (11. . except one red leaf is merged with a blue leaf.1.2 .T) dtd(U..5 . If in step 2 of Algorithm II. Qi. T) have the same parity.T) = 16 .T) and five non-trivial MSPs: (1.T) > [!^E1\ ..1.3). and vice versa.T) < 2. Note that 14 is the shortest number of translocations and deletions transforming II to T.. J. S. 4 . in part. MA.1 = [^^-\ ~ * = 7. (7. (4.1. See Fig. 2.4. so Apxtd(n. T) is odd. if r(II.1. we have Lemma 6. -10. (4.3) will merge with (9. NSF/ITR-IIS0407204). then /3(IT.1 5 . 
by National Science Foundation (NSF/DBI-0354771. A linear time algorithm for computing translocation distance between signed genomes. 2005: 615629.. T) = 2/i + a. Liu. D. D. (9. (1. (14. 2000: 222-234.. o ( n . If none of the six cases appears in Algorithm ii.16) are indirect. J. 6th ACM-SIAM Symposium on Discrete Algorithm. In fact. then /3(n. where (4.1 = 14. Proof.14. a (n..\. Polynomial algorithm for computing translocation distance between genomes.. N.0 = 15. An 0(n2) algorithm for signed translocation.. Discrete Applied Mathematics 1996.. Apxtd{U. if r(II.„16). Mixtacki.11) respectively. B. 6 ) . G. the others are paired to merge. X..r) = Xi + L^f-J = Lemma 6.. Genome rearrangement by reversals and insertions/deletions of contiguous segments. so ApxtdQl. r ) = 13 + 2 . T) < L ^ J + 1 a(n. Guojun Li's work is also supported by NSFC under Grant No. Kececioglu.6) will merge with (14.r) = 0.13).. El-Mabrouk. In: Proc.. 5. 16)} and dt(U.. so a(II. T) = | / ' 2 ]• ^ By Lemmas 6. In: Proc. Bergeron.. CONCLUSIONS In this paper.. On sorting by translocations. . Stoye. this is the first time to consider SBT when the genomes have different gene sets. Proof.l. Li. . D Example 6. X. r ) = \?&D.6)... A. Zhu. T) = 1. T) is even.

the idea is to generate fragments such that each genomic position is expected to be covered (or sampled) by at least one fragment — and also ensuring that there is sufficient computable evidence in the form of "overlaps" between fragments 4 to carry out the assembly. The need for scaffolding arises from the fact that there could be gaps in sequencing. a combination of these BACs that provide a minimum tiling path based on their locations along the genome is determined. 1. Each selected BAC is then individually sequenced using a shotgun approach that generates numerous short (~500-l. The advantages of retroscaffolding are two fold: (i) it allows detection of regions containing LTR retrotransposons within the unfinished portions of a genome and can therefore guide the process of finishing.S. IA 50011.e. current methods generate shotgun fragments in "pairs" — each BAC is first bro- * Corresponding author. We also report on the on-going development of an algorithmic framework to perform retroscaffolding. the identified gaps between contigs can be filled through targeted experimental procedures called pre-finishing and finishing. Regardless of the coverage specified. Development and Cell Biology Iowa State University. a genome is first broken into numerous smaller clones of size up to 200 kbp each called a Bacterial Artificial Chromosome (or BAC). we use the term "finishing" to collectively refer to both these procedures.000 bp long) fragments. gaps invariably occur during sequencing. For simplicity. and Genetics. INTRODUCTION Hierarchical sequencing is being used to sequence the maize genome 18 . aims at determining the order and orientation of the contigs relative to one another. This approach is not meant to supplant but rather to complement other scaffolding approaches. it cannot be guaranteed that every position is covered by at least one fragment. Kalyanaraman* 1 . while a high coverage results in fewer and shorter gaps. 
and (ii) it provides a mechanism to lower sequencing coverage without impacting the quality of the final assembled genie portions.aluru. which is a new variant of the well known problem of scaffolding that orders and orients a set of assembled contigs in a genome assembly project. assembling a set of fragments sequenced from a BAC typically results in not one but many assembled sequences called contigs that represent the set of all contiguous genomic stretches sampled. S. Coverage affects the nature of gaps — a low coverage typically results in several long gaps. The next step. The key feature of this new formulation is that it takes advantage of the structural characteristics and abundance of a particular type of retrotransposons called the Long Terminal Repeat (LTR) retrotransposons. we introduce a problem called retroscaffolding.schnable}Qiastate. The main focus of this paper is the scaffolding step. The fragments generated by a shotgun experiment approximately represent a collection of sequences originating from positions distributed uniformly at random over each BAC. Once scaffolded. Ames. Schnable 2 2 Department of Electrical and Computer Engineering Departments of Agronomy. As with a jigsaw puzzle. In this paper.. To be able to identify a pair of contigs corresponding to adjacent genomic stretches. Results of preliminary studies on maize genomic data validate the utility of our approach. i. In this approach. Sequencing and finishing costs dominate the expenditures in whole genome projects. Retrotransposons alone are estimated to constitute at least 50% of the genome. The retroscaffolding technique provides a viable mechanism to this effect. Aluru 1 and P. USA Email: {ananthk. however.edu 1 The abundance of repeat elements in the maize genome complicates its assembly. Next. scaffolding. 
The problem of assembling the target genome is thereby reduced to the problem of computationally assembling each BAC from its fragments.167 TURNING REPEATS TO ADVANTAGE: SCAFFOLDING GENOMIC CONTIGS USING LTR RETROTRANSPOSONS A. . and it is often desired in the interest of saving cost to reduce such efforts spent on repetitive regions of a genome. Because of gaps.
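The coverage trade-off described above (fragments placed uniformly at random, yet no guarantee that every position is covered) is commonly quantified with the classical Lander-Waterman approximation. The sketch below is an idealized estimate — uniform reads, no repeat-induced complications, no minimum-overlap requirement — and the BAC and read lengths are chosen only for illustration, not taken from the maize project.

```python
import math

def expected_gaps(genome_len: float, read_len: float, coverage: float) -> float:
    """Lander-Waterman estimate: with N = c * G / L reads placed uniformly
    at random on a genome of length G, the expected number of contigs
    (hence, up to edge effects, of sequencing gaps) is N * exp(-c)."""
    n_reads = coverage * genome_len / read_len
    return n_reads * math.exp(-coverage)

# Illustrative BAC-sized example: gap counts fall sharply with coverage.
for c in (1, 3, 5, 8):
    print(c, round(expected_gaps(150_000, 700, c), 1))
```

The estimate makes the qualitative claim in the text concrete: beyond 1X, raising coverage multiplies the read count linearly but suppresses gaps exponentially.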

To this effect. Thus. most of the sequencing and finishing efforts are expected to be spent on repeat rich regions. we introduce a new variant of the scaffolding problem called the retro scaffolding problem. Figure 1 illustrates an example of scaffolding contigs based on clone mate information. An example showing 6 pairs of clone mate fragments (shown connected in dotted lines) sequenced from a given BAC. called physical gaps.168 i gapi i BAC Clone mate pairs e Contigs Cl . 19 . In Section 2. ken into smaller clones of length ~ 5 kbp. and discuss the various factors that affect the ability to retroscaffold. The retroscaffolding technique provides a mechanism to reduce sequencing coverage without affecting the quality of the genie portion of the final assembly. and each such clone is sequenced from both ends thereby producing two fragments which are referred to as clone mates (or a clone pair). 1. • In genome projects of highly repetitive genomes. i i i i i i i i z gap2 Physical gap pw! i i i i i i I ** J |4 Pig. we are developing an algorithmic framework to perform retroscaffolding as described in Section 4. not sufficient to determine the scaffolding information between all pairs of contigs in this example. however. however. formulate it as a problem. as is the case with the maize genome project 18 . This approach has the following advantages: • It does not require clone mate information. As part of the NSF/DOE/USDA maize genome project 18 . we describe the retroscaffolding idea. we conducted experiments on previously sequenced maize BAC data. The approach proposed in this paper provides an alternative mechanism to scaffold around physical gaps as well. . • It can be used to identify LTR retrotransposon-rich portions within the un- finished genomic regions. subject to their repeat content. we are working on applying the retroscaffolding technique to the maize data as it becomes available. the importance of our approach is further emphasized. 
our approach complements existing scaffolding approaches for genomes with significant LTR retrotransposon content. Also. For obtaining a proof of concept. Performing a higher coverage sequencing is an effective but expensive approach to reduce the occurrences of gaps. at least 50% of which is expected to be retrotransposons. Such information can be useful if it is decided to not finish repetitive regions in the interest of saving costs. The supplied clone mate information is . with the advent of newer sequencing technologies13 that do not generate clone mate information. The relative order and orientation between contigs c\ and ci (also. These and other experimental results assessing the effects of various factors on retroscaffolding are presented in Section 3. In this paper. 9 ' 10 ' 17. This technique is not. This is one of the main concerns in the on-going efforts to sequence the maize genome. thereby providing a means to reduce the sequencing costs. The results show that (i) 3X/4X coverage sequencing is suited for exploiting the data's repeat content towards retroscaffolding. sufficient to link contigs surrounding gaps without a flanking pair of clone mates (gap2 in Figure 1). (ii) retroscaffolding can yield over 30% savings in finishing costs. the fact that a pair of clone mates originated from the same ~5 kbp clone can be used to impose distance and orientation constraints for linking contigs that span the corresponding fragments1. The problem is to order and orient contigs based on their span of LTR retrotransposon-rich regions of the genome. and involve costly finishing efforts. are typically harder to "close". and (iii) with retroscaffolding it is possible to opt for a lower sequencing coverage. During scaffolding. between 03 and C4) can be inferred from the clone mates. Such gaps.
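The distance and orientation constraint imposed by a pair of clone mates can be made concrete with a small sketch. The inward-facing orientation convention, the offset arithmetic, and the tolerance parameter below are our own assumptions for illustration, not the exact model of the cited scaffolders.

```python
from typing import Optional

def mate_gap(pos1: int, strand1: str, len1: int,
             pos2: int, strand2: str,
             clone_len: int = 5000, tol: int = 1000) -> Optional[int]:
    """Infer the gap between two contigs linked by a clone-mate pair.

    Mate 1 sits at offset pos1 on strand1 ('+'/'-') of a contig of length
    len1; mate 2 at offset pos2 on strand2 of the following contig.  With
    mates sequenced inward from the two ends of a ~clone_len bp clone, a
    consistent link needs opposite strands, and the clone length fixes
    the gap:  gap = clone_len - (len1 - pos1) - pos2.
    Returns the gap estimate, or None if orientations conflict or the
    implied gap falls outside [-tol, clone_len].
    """
    if strand1 == strand2:          # mates must face each other
        return None
    gap = clone_len - (len1 - pos1) - pos2
    return gap if -tol <= gap <= clone_len else None
```

A mate 1,000 bp from the right end of one contig and 1,500 bp into the next implies a ~2,500 bp gap under a 5 kbp clone, which is exactly the kind of constraint used to order and orient c1/c2 in Figure 1.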

Given that these retrotransposons are typically 10-15 kbp long. Yet. wheat. and up to 90% in wheat 7 . This repeat is referred to as a Target Site Duplication (TSD) because it corresponds to the 5 (or 6) bp duplicated in the host genome at the time and site of the retrotransposon's insertion. let ss=s. the capability of retroscaffolding to exploit this repeat content can provide a significant means to reduce sequencing and finishing costs. we present the results of our experiments to assess the effect of applying both clone mate based scaffolding and retroscaffolding on maize genomic data. their flanking LTRs can also be expected to be separated by as many bps along the genome*. Long Terminal Repeat (LTR) retrotransposons constitute one of the most abundant classes of retrotransposons. a "good quality" alignment that spans a sufficiently "long" suffix or prefix of the latter sequence.169 In Section 5. LTRs start with TG and end in CA. For a sequence s. LTR retrotransposons are distinctly characterized in their structure by two terminal repeat sequences — one each at the 5' and 3' ends of a retrotransposon inserted in a host genome. Let an LTR pair (l&. As their name suggests. 2. sary (but not sufficient) indication that the contigs sample the flanking regions of an inserted retrotransposon. and have been studied in relation to genome evolution. the detection of two identical or highly similar LTR-like sequences in two contigs is a neces- "Sometimes. accordingly affecting the distances between t h e 5' and 3' LTRs. . If this indication can be further validated to sufficiency by searching for other structural signals of an LTR retrotransposon (described below). the LTRs flanking most retrotransposons are similar enough for detection. In addition. Moreover. then the contigs can be relatively ordered and oriented (because LTRs are directed repeats).. as explained below. a subsequent assembly can be expected to have two contigs each spanning one of the LTRs. 
Various strengths and limitations of the retroscaffolding technique are discussed in Section 6. sorghum. >50% in maize 15. Therefore. the LTR sequences are identical at the time a retrotransposon inserts itself into a host genome. RETROSCAFFOLDING Retrotransposons are DNA repeat elements abundant in several eukaryotic genomes — occupying at least 45% of the human genome6. A sequence c is said to contain a sequence I if there exists between c and either V or V. 2 0 . etc. retrotransposon genes (gag. and gradually diverge over time due to mutations. and sr denote its reverse complement. Low coverage sequencing of a genome with significant LTR retrotransposon content is likely to result in a proportionately large number of gaps that span these repetitive regions. • L4 The 5 (or 6) bp immediately to the left of the 5' LTR are "highly similar" (if not identical) to the 5 (or 6) bp immediately to the right of the 3' LTR. this implies that the intervening region between two consecutively ordered contigs contains retrotransposon related sequences — an information that can be used to prioritize the gaps for finishing. If it so happens that the sequencing covers only the two LTRs of a given retrotransposon. • L3 Typically. • L5 The intervening region between the 5' and 3' LTRs contains several signals that correspond to an inserted retrotransposon. • L2 The starting positions of the 5' and 3' LTRs are at least Dmin bp and at most Dmax bp apart along the genome. Given that retrotransposons are abundant in genomes of numerous plant crops yet to be sequenced (e.). LTR retrotransposons can be nested within one another. These include a primer binding site (PBS).g. and potentially reduce efforts spent on finishing repetitive regions. The structure of a full-length LTR retrotransposon (illustrated in Figure 2a) is characterized by the following key attributes: • L I The 5' and 3' LTRs share a "high" sequence identity. pol. 
These properties form the basis of our retroscaffolding idea. barley. genomic rearrangements and retroviral transposition mechanisms 2 ' 3 . ly) denote the two LTRs of a given LTR retrotransposon. and a poly-purine tract (PPT). if so desired. and env).
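A candidate pair of LTR-like sequences can be screened against attributes L1-L4 using the parameter values later listed in Table 2 (Dmin/Dmax = 600/15,000 bp, 70% identity cutoff, LTR length 100-2,000 bp). The sketch below substitutes a crude ratio-based identity for LTR_par's alignment scoring, and omits the internal L5 signals (PBS, PPT, and the gag/pol/env genes); the helper names are ours.

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    """Crude percent identity (stands in for the match/mismatch/gap
    alignment scoring that LTR_par actually uses)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def plausible_ltr_pair(genome: str, s5: int, e5: int, s3: int, e3: int,
                       dmin: int = 600, dmax: int = 15_000,
                       lmin: int = 100, lmax: int = 2_000,
                       min_ident: float = 70.0, tsd_len: int = 5) -> bool:
    """Check structural attributes L1-L4 of a candidate LTR pair at
    [s5, e5) and [s3, e3): L1 high 5'/3' identity, L2 start positions
    dmin..dmax apart, L3 LTRs begin with TG and end in CA, L4 highly
    similar TSDs flanking the whole element."""
    ltr5, ltr3 = genome[s5:e5], genome[s3:e3]
    if not (lmin <= len(ltr5) <= lmax and lmin <= len(ltr3) <= lmax):
        return False
    if not (dmin <= s3 - s5 <= dmax):                          # L2
        return False
    if not all(l.startswith("TG") and l.endswith("CA")
               for l in (ltr5, ltr3)):                         # L3
        return False
    if identity(ltr5, ltr3) < min_ident:                       # L1
        return False
    tsd_left = genome[s5 - tsd_len:s5]
    tsd_right = genome[e3:e3 + tsd_len]
    return identity(tsd_left, tsd_right) >= min_ident          # L4
```

Passing this screen is the "necessary but not sufficient" evidence the text describes: two contigs each containing one LTR of a pair that survives these checks become candidates for a retro-link.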

it can be expected that the TSDs are different because they correspond to the host genomic sequence at the site of insertion. two contigs C. Therefore. which is limited by the number of detectable LTR retrotransposons in the genome. not all retro-links may be used in the final ordering and orientation. Presence of distinguishable LTRs: LTRs from different retrotransposons but from the same "family" may share substantial sequence similarity. the above situation can be alleviated by applying retroscaffolding to contigs corresponding to the same BAC (instead of across BACs)..63) < Ln PBS. It may still happen that a target genome contains the same family retrotransposons in abundant quantities.. Definition of a Retro-link: Given a set L of n LTR pairs. and other structural attributes become less distinguishable as well. Details are omitted for brevity. it is essential to take into account other structural evidence specific to an insertion before establishing a retro-link between two contigs. (e 3 .. In the latter... called the Contig Scaffolding Problem that is NPcomplete 9 . a retro-link accounts for other structural attributes of an LTR retrotransposon. (b) An example showing two contigs ci and ci with a retro-link between them. and • every pair of consecutive contigs in each subset is retro-linked and there is no contig that participates in two retro-links in opposite orientations.PPT *-' 3'LTR ' 3'TSD a—>. partition C such that: • each subset is an ordered set of contigs.65). the input is a set of contigs and a set of clone mates.. This is because the likelihood of the same family occurring multiple times at a BAC level is much smaller than . and Cj are said to be retro-linked if 3 (ly..CA}CZ^> 5'TSD 5'LTR Retrotransposon (a) PBS. the retroscaffolding problem can be formulated as on optimization problem.ga9.PPT e5 Host Genome 4^>(TG. 2.env. The retroscaffolding problem can be viewed as a variant of the standard scaffolding problem. L4 and L5. 
and Cj contain l§< or I31 or both. like in the original scaffolding problem.pol.... In addition to the LTRs....."^G..CA T Dmax— -*-: fe Dmin < distance < (b) Fig. Even if the same LTR retrotransposon is present in two different locations of a genome.170 Dmin < (63 — 65) < Dmax '••min < (e5 ... This is similar to the distance constraint imposed by a retro-link between the two contigs containing two LTRs of the same retrotransposon. where each clone mate pair is a pair of fragments sequenced from the same clone of a known approximate length.cQ. The Retroscaffolding Problem: Given a set C of m contigs and a set L of n LTR pairs. Note that this approach of exploiting the abundance in retrotransposons offers a respite from the traditional view that these are a source of complication in genome projects. If BAC-by-BAC sequencing is used.. The effectiveness of retroscaffolding on a genome is dictated by the following factors: LTR retrotransposon abundance: The ability to retroscaffold depends on the number of retrolinks that can be established. ly) € L such that both c.CAl e3 XTG. As shown. :**= C2 (TG. Similar to the contig scaffolding problem. the above definition is extended to account for additional structural attributes such as L3. (a) Structure of a full-length LTR retrotransposon.. to ensure that a retro-link indeed spans the same full-length LTR retrotransposon. Also. retro-link . An example of a retro-link between two contigs is shown in Figure 2b..
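The retro-link definition and the chaining constraint of the retroscaffolding problem can be sketched as follows. Since the problem, like contig scaffolding, is NP-complete, the chaining step here is a deliberately greedy simplification that drops conflicting links rather than optimizing over them; the data encoding is our own.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def retro_links(contig_ltrs: Dict[str, Set[Tuple[int, str]]]) -> Set[Tuple[str, str]]:
    """contig_ltrs maps each contig to the LTRs it contains, each given
    as (retrotransposon pair id, '5' or '3').  Two different contigs
    holding the 5' and 3' LTR of the same pair get a directed retro-link
    5' -> 3' (LTRs are directed repeats, fixing order and orientation)."""
    holder: Dict[int, Dict[str, str]] = defaultdict(dict)
    for contig, ltrs in contig_ltrs.items():
        for pair_id, which in ltrs:
            holder[pair_id][which] = contig
    return {(ends['5'], ends['3']) for ends in holder.values()
            if '5' in ends and '3' in ends and ends['5'] != ends['3']}

def chain(links: Set[Tuple[str, str]]) -> List[List[str]]:
    """Greedily order contigs into chains of consecutive retro-links,
    skipping any link that would give a contig two successors or two
    predecessors (no contig may be retro-linked in conflicting ways)."""
    nxt: Dict[str, str] = {}
    prev: Dict[str, str] = {}
    for a, b in sorted(links):
        if a not in nxt and b not in prev:
            nxt[a], prev[b] = b, a
    chains = []
    for c in [c for c in nxt if c not in prev]:   # chain start points
        ch = [c]
        while ch[-1] in nxt:
            ch.append(nxt[ch[-1]])
        chains.append(ch)
    return chains
```

With contig c2 holding the 3' LTR of one element and the 5' LTR of another, the two retro-links chain c1, c2, c3 into a single ordered, oriented subset.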

Sequencing coverage: Retroscaffolding targets each sequencing gap that spans an inserted retrotransposon such that its flanking LTRs are represented in two different contigs. Henceforth, we will refer to such gaps as retro-gaps. Given that the length of such an insert ranges from 10-15 kbp (greater, if it is a nested retrotransposon), the coverage at which the genome is sequenced is a key factor affecting the ability to retroscaffold. Whereas at very low coverage (e.g., 1X) long sequencing gaps that may span entire LTR retrotransposons are likely to prevail, if the sequencing coverage is too high (e.g., 10X), then there are likely to be so few (short) sequencing gaps that the need for any scaffolding technique diminishes.

3. PROOF OF CONCEPT OF RETROSCAFFOLDING ON MAIZE GENOMIC DATA

In this section, we provide a proof of concept for retroscaffolding. For this purpose, four finished maize BACs (listed in Table 1) were acquired from Cold Spring Harbor Laboratory 14.

The first step was to determine the LTR retrotransposon content of these BACs. LTR-par 11, which is a program for the de novo identification of LTR retrotransposons, was used to analyze each BAC with the parameters specified in Table 2. Table 1 summarizes the findings. As can be observed, the fraction of LTR retrotransposons in these BACs averages 42%, consistent with the latter's estimated abundance in the genome.

Table 1. Summary of the LTR retrotransposons identified in 4 maize BACs using LTR-par.

         GenBank     BAC Length   Number of LTR      Retrotransposons in BAC
         Accession   (in bp)      retrotransposons   Length in bp   % bp
  BAC1   AC157977    107,631      3                  29,391         27%
  BAC2   AC160211    132,578      6                  60,099         46%
  BAC3   AC157776    147,470      8                  73,549         50%
  BAC4   AC157487    136,932      6                  57,783         42%

Table 2. LTR-par parameter settings.

  Parameter Name    Default Value    Description
  Dmin/Dmax         600/15,000 bp    Distance constraints between 5' and 3' LTRs (L2)
  T                 70%              % identity cutoff between 5' and 3' LTRs (L1)
  Lmin/Lmax         100/2,000 bp     Minimum/maximum allowed length of an LTR
  Match/mismatch    2/-5             Match and mismatch scores
  Gap penalties     6/1              Gap opening and continuation penalties

The effect of sequencing at different coverages was assessed as follows. A program that "simulates" a random shotgun sequencing over an arbitrary input sequence at a user-specified coverage was provided by Scott Emrich at Iowa State University 5. We ran this program on each BAC for coverages 1X through 10X, and for each coverage 10 samples were collected to simulate sequencing 10 such BACs. Each run of the program produces a set (or sample) of fragments, along with the information of their originating positions. For each sample, using the knowledge of the fragments' originating positions, the set of all contiguous genomic stretches covered (and thereby the set of sequencing gaps) was determined. Ideally, assembling the sample would produce a contig for each contiguous stretch. Based on the placement information of the contigs on the BAC and that of the LTR pairs (Table 1) on the BAC, each LTR pair was classified into one of these three classes (see Figure 3):

• CgC: both LTRs are contained in two different contigs;
• C_C: both LTRs are contained in the same contig; and
• GgX: at least one LTR is not contained by any contig (i.e., it is located in a gap).
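The coverage experiment described above can be sketched as follows. This is an illustrative stand-in for the simulator provided by Scott Emrich, not that program itself; the function and parameter names are ours:

```python
import random

def simulate_shotgun(genome_len, coverage, read_len=700, seed=0):
    """Sample uniformly placed reads until the requested coverage is
    reached, then return the merged covered intervals; the complement
    of these intervals is the set of sequencing gaps."""
    rng = random.Random(seed)
    n_reads = int(coverage * genome_len / read_len)
    starts = sorted(rng.randrange(genome_len - read_len + 1)
                    for _ in range(n_reads))
    covered = []
    for start in starts:
        end = start + read_len
        if covered and start <= covered[-1][1]:  # overlaps previous stretch
            covered[-1] = (covered[-1][0], max(covered[-1][1], end))
        else:
            covered.append((start, end))
    return covered

stretches = simulate_shotgun(150_000, 3.0)
print(len(stretches))  # number of contiguous stretches (ideal contigs)
```

Running this for coverages 1X through 10X and intersecting the gap intervals with known LTR coordinates reproduces the kind of classification counts reported below.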

Fig. 3. Classification of LTR pairs (CgC, C_C, GgX) based on the location of sequencing gaps, illustrating the genome, the fragments, the contigs, and the 5' and 3' LTRs. Dotted lines denote sequencing gaps. Retro-links correspond to the class CgC.

In this classification scheme, it is easy to see that retro-links can be expected to be established only for CgC LTR pairs. Therefore, the ratio of the number of CgC LTR pairs to the total number of LTR pairs is indicative of the maximal value of retroscaffolding at a given coverage. We computed this ratio for each of the 4 BACs used in our experiments, by considering one coverage at a time, and counting the LTR pairs in each of the three classes over all 10 samples. From Table 3, we observe that the ratio is maximum for a 3X coverage for 3 out of the 4 BACs, and 4X for the other BAC. This implies that a 3X/4X coverage project is expected to best benefit from the retroscaffolding approach.

Table 3. Classification of the LTR pairs in 4 BACs, with respect to a set of 10 shotgun samples obtained from each BAC at different coverages.

             BAC1                        BAC2    BAC3    BAC4
  Coverage   CgC   C_C   GgX   CgC%     CgC%    CgC%    CgC%
  1X         16    1     13    53       83      63      63
  2X         26    0     4     87       95      77      92
  3X         25    3     2     83       100     97      100
  4X         27    3     0     90       100     88      100
  5X         24    6     0     80       95      93      95
  6X         22    8     0     73       83      76      98
  7X         19    11    0     63       83      61      100
  8X         18    12    0     60       77      64      67
  9X         16    14    0     53       48      50      60
  10X        7     23    0     23       37      31      43

To understand the above results intuitively, observe that a very high coverage has a high likelihood of sequencing an LTR retrotransposon region in its entirety, making retroscaffolding unnecessary, while a very low coverage results in a high likelihood of LTRs falling in gaps, making retroscaffolding ineffective. Both these expectations are corroborated in our experiments — in Table 3, note the gradual increase in C_C and the decrease in GgX with increasing coverage. The C_C increase with coverage also indicates the amount of effort spent in sequencing retrotransposon-rich regions.

In our next experiment, we assess the potential savings that can be achieved at the finishing step through the information provided by retroscaffolding on gap content. As each retro-gap corresponds to a potential region of the genome that may not necessitate finishing, the ratio of the number of retro-gaps to the total number of sequencing gaps indicates the potential savings achievable at the finishing step because of retroscaffolding. Table 4 shows the number of gaps generated at various sequencing coverages, and the number of which can be detected using retroscaffolding (i.e., retro-gaps). While the results are shown only for two BACs (due to lack of space), we observed a similar pattern in all four BACs. From the table we observe this ratio ranges from 23%-40% for BAC2, and 24%-49% for BAC4, averaging over 34% savings for both BACs. Table 4 also shows that sequencing BAC2 at a 6X coverage is expected to result in ~37 sequencing gaps, while sequencing at a 4X coverage and subsequently applying retroscaffolding is expected to result in an effective 39 gaps (as 65.7 - 26.6). This implies that through retroscaffolding it is possible to reduce the coverage from 6X to 4X without much loss of scaffolding information. As retroscaf-
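The three-way classification used above follows directly from interval containment, once contig placements and LTR locations are expressed as coordinates on the BAC. A minimal sketch (names are ours):

```python
def contains(contig, ltr):
    """True if the LTR interval lies entirely within the contig interval."""
    return contig[0] <= ltr[0] and ltr[1] <= contig[1]

def classify_ltr_pair(contigs, ltr5, ltr3):
    """Classify an LTR pair by where its two LTRs fall relative to the
    assembled contigs: CgC (different contigs), C_C (same contig),
    GgX (at least one LTR in a sequencing gap)."""
    c5 = next((i for i, c in enumerate(contigs) if contains(c, ltr5)), None)
    c3 = next((i for i, c in enumerate(contigs) if contains(c, ltr3)), None)
    if c5 is None or c3 is None:
        return "GgX"
    return "C_C" if c5 == c3 else "CgC"

contigs = [(0, 5000), (12000, 20000)]
print(classify_ltr_pair(contigs, (100, 500), (12500, 12900)))  # CgC
print(classify_ltr_pair(contigs, (100, 500), (700, 1100)))     # C_C
print(classify_ltr_pair(contigs, (100, 500), (6000, 6400)))    # GgX
```

Only the first case can yield a retro-link, which is why the CgC fraction bounds the value of retroscaffolding at each coverage.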

7 50. Once retro-links are established.7 BACA %Retro-gaps 37. the second class of retro-links that are based on a de novo detection of LTR sequences in the contig data is preferable. additional validation will be necessary to ensure the correctness of such retrolinks.4 37.9 38.7 84.1.6 19.5 Retro-gaps 24.173 Table 4.8 35.6 24.1 11. In what follows.2 9. The first class of retro-links can be established by building a database of known LTR retrotransposons and detecting contigs that overlap with LTR sequences of the same retrotransposon. we describe our approach to establish retro-links.0 2.0 All gaps 78. all sequencing gaps. In the first phase.5 6.2 26. Coverage All gaps IX 2X 3X 4X 5X 6X 7X 8X 9X 10X 70.3 13.7 13. the first step in our approach is to build a database of maize LTR pairs from previously sequenced maize genomic data. 4.4 31.1 6.4 33.2 33.2 folding can be used independent of clone mate information.5 26.3 18.8 23.3 5.2 36. retro-links are established between contigs that show "sufficient" evidence of spanning two ends of the same LTR retrotransposon. then retroscaffolding can be used to advocate for a low coverage sequencing. If similar results can be shown at a much larger scale of experimental data for a target genome.6 32.0 19. However.5 84.8 33.0 93. In addition. each retro-link can be treated equivalent to a clone mate pair that imposes distance and orientation constraints appropriate for LTR retrotransposon inserts. Measurements are averaged over all 10 samples of each of the two BACs.. A FRAMEWORK FOR RETRO-LINKING We developed the following two-phase approach to retroscaffolding. any of the programs developed for the conventional contig scaffolding problem 1.5 16.4 34.3 %Retro-gaps 31. in principle.9 30. 4.0 33. in which only either a 5' or a 3' LTR (or a part of it) survives.1 36.6 37.7 9.9 2.5 38. 19 can be used to achieve retroscaffolding from the retro-linked contigs b . i. 
such a database of already known LTR sequences of a target genome may hardly be complete in practice.3 BAd Retro-gaps 26. and (ii) those that are de novo found in the contig data. In what follows.1 40.5 3. c .0 9. There are two types of retro-links that can be established among contig data: (i) those that corb respond to LTR retrotransposons that are already known to exist in the genome of the target organism or closely related species. 9 ' 10.4 39.6 19. Because the information about the LTR sequences within the full-length retrotransposons and BACs For our experiments.6 65. A set of 560 known full-length LTR retrotransposons and 149 solo LTRs c was acquired from San Miguel 16 . a set of 470 maize BACs were downloaded from GenBank 5 .6 34. the process of scaffolding the contigs is the same as scaffolding them based on clone mate information. we used the Bambus19 program.e.6 33. we describe the algorithmic framework we developed to establish retro-links based on already known LTR retrotransposons.0 64.4 28. Number of retro-gaps vs.0 49.5 46. Building a Database of LTR Pairs Given that the entire genome of maize has not yet been assembled. we are working on evaluating the collective effectiveness of both clone mate-based scaffolding and retroscaffolding approaches. Therefore. directly impacting the sequencing costs of repetitive genomes. 17. For this reason.7 13.1 29. Solo LTRs are typically the result of a deletion/recombination event at a site of an inserted LTR retrotransposon.9 9.5 88. However. and the results of applying it on maize genomic data.7 36.

was not available, we used the LTR-par program to identify LTR retrotransposons and their location information. Given a set of sequences, LTR-par identifies subsequences within each sequence that bear structural semblance to full-length LTR retrotransposons. A prediction is made only if the identified region satisfies the LTR sequence similarity (L1) and LTR distance (L2) conditions. As part of each prediction, the locations of both the 5' and 3' LTRs are output. Based on the presence of other signals such as the TG..CA motif (L3) and TSDs (L4), each prediction is also associated with a "confidence level": a confidence level of 1 implies presence of both L3 and L4, 0.5 implies either L3 or L4 but not both, and 0 implies only L1 and L2. Desired values for these structural attributes can be input as parameters; we used the values shown in Table 2. In this paper, we use level 1 predictions, although we are currently evaluating other combinations of LTR pairs from across confidence levels. Table 5 shows the statistics over the resultant total of 1,939 LTR pairs. We did not include the LTRs identified in the four maize BACs listed in Table 1, so that they can be used as benchmark data for validating retroscaffolding.

Table 5. Summary of LTR pairs predicted by LTR-par.

  Input                       Number of    Number of full-      Number of
                              sequences    length predictions   LTR pairs
  LTR retrotransposons 16     560          556                  556
  Solo-LTRs 16                149          -                    149
  Maize BACs 5                470          1,234                1,234
  Total                                                         1,939

4.2. An Algorithm to Establish Retro-links

Let C denote a set of m contigs generated through an assembly of maize fragments corresponding to one BAC, and let L denote the set of n LTR pairs (n = 1,939 in Table 5). Our algorithmic framework performs the following steps:

• S1 Compute P = {(c, (l5', l3')) | c in C, (l5', l3') in L, and c contains l5' or l3' or both}.
• S2 Construct the set G = {G1, G2, ..., Gn} such that for each Gi, c in Gi if and only if (c, (l5'i, l3'i)) in P. We call each Gi a contig group; note that the groups in G need not form a partition of C.
• S3 For each Gi in G and each ci, cj in Gi, compute Ri = {(ci, cj, (l5'i, l3'i)) | ci and cj are retro-linked by (l5'i, l3'i)}.

Steps S1 and S2 ensure that two contigs are paired if and only if they contain LTRs from the same LTR pair. A naive way to perform step S1 is by evaluating each of the m x n pairs of the form (contig, LTR pair): to check if a contig contains one of the LTRs, an optimal semi-global alignment is computed. The check can be performed through standard dynamic programming techniques for computing semi-global alignments that take time proportional to the product of the lengths of the sequences being aligned. As reverse complemented forms also need to be considered, this approach involves 4 x m x n alignments in the worst case. We developed a run-time efficient method based on the observation that if two sequences align significantly, then they also have a "long" exact match between them (although the converse need not hold). Thus it is sufficient to evaluate only pairs of the form (contig, LTR) that have an exact match of a minimum cutoff length. To perform S1 this way, we adapted a parallel algorithm for detecting maximal matches across DNA sequences that we had originally developed for a clustering problem 12. For each generated pair, an optimal semi-global alignment is computed, and only pairs that have alignments satisfying a specified criteria are output. As pairs are output, the set G is computed as well in constant time per pair (step S2). The algorithm runs in linear space and run-time proportional to the number of the output pairs.

Because the LTR sequence similarity between the contigs of a group is already ensured, to establish a retro-link it is therefore necessary only to establish additional structural evidence, such as the presence of TSDs, PBS, PPT, and retrotransposon genes. The attributes to look for, however, depend on the location of the subsequences corresponding to the LTRs within the contigs — for example, it may not be possible to look for retrotrans-
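The containment check underlying step S1 — does a contig contain a given LTR anywhere inside it? — can be sketched as a semi-global alignment in which the contig's leading and trailing bases are free of gap penalties. The sketch below uses the match/mismatch scores of Table 2 but simplifies the gap model to linear penalties to stay short (LTR-par's parameters specify separate gap opening and continuation penalties); the function name is ours:

```python
def semi_global_score(contig, ltr, match=2, mismatch=-5,
                      gap_open=-6, gap_ext=-1):
    """Best score of aligning the whole LTR against any region of the
    contig; flanking contig bases are unpenalized (semi-global)."""
    gap = gap_open + gap_ext  # simplified: every gap symbol pays open+ext
    n, m = len(contig), len(ltr)
    # prev[j]: best score of aligning ltr[:j] ending at the current row;
    # row 0 is free, so the LTR may start anywhere in the contig.
    prev = [j * gap for j in range(m + 1)]
    best = float("-inf")
    for i in range(1, n + 1):
        cur = [0]  # leading contig bases are not penalized
        for j in range(1, m + 1):
            s = match if contig[i - 1] == ltr[j - 1] else mismatch
            # diagonal: (mis)match; up: contig base vs gap; left: LTR base vs gap
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, cur[m])  # trailing contig bases are also free
        prev = cur
    return best

print(semi_global_score("AAATGCGTACCC", "TGCGTA"))  # 12: exact containment
```

The exact-match filter described in the text prunes most of the m x n candidate pairs before this quadratic-time check is ever run.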

While this suggests that either of the techniques can be applied independent of one another.. The experiment resulted in 44 contig groups (= \G\). Note that the 1. (assembled) Truth: BACi (assembled) J!_LTR r \ >C16 J!_LTR r ~\ >c]•41 . and contigs CA\ and C24. Vertically aligned ovals denote overlapping regions. and (c 24 -> C41) with the arrows implying the order in which the contigs can be expected to occur along the "unknown" BAC sequence (BACi) in the specified orientations.e. and were assembled5 using the CAP3 assembler 8 .par (Table 1) — that way. Shotgun fragments were experimentally sequenced at a 3X coverage of the BAC 14 . We verified the predictions by aligning each of these 4 contigs directly against the known sequence of BACi and found that the retroscaffolding prediction is correct (see Figure 4). This step resulted in only two retro-linked pairs: (cj 0 -¥ c 1 6 ). the corresponding LTR pairs share a significant sequence identity (> 95%). The equivalent groups were merged. The subsequent step was to evaluate each contig pair of a merged group for a valid retro-link.e.939 LTR pairs (in Table 5) to our retro-linking program. and upon investigation we found that most of the groups were "equivalent".175 copia element >c 7 10 fc gypsy element > C41 > C24 £ 3- i > C16" >h -t: >/l t >h — Prediction: BACi (unknown) >h 3 Prediction: BACi (unknown) >cl 10 ><=16 retro-fink (OR) > Cl6 > C24 > c\x retro-fink > en retro-link > <^4 £LTR retror retro-link > Cio J/LTR retror > cio Truth: BACi. and squares denote retrotransposon hit through tblastx against the GenBank nr database. the output may themselves be not mutually exclusive — i. Ideally. While such redundancies in output can be used as additional supporting evidence for bolstering the validity of scaffolding. Validation of two retro-links — between contigs cio and cie. we queried the contigs against the GenBank nr database using the tblastx program. 
We perform S3 as follows: we concatenate each pair of contigs under consideration in each of the 4 possible orientation combinations.. SCAFFOLDING WITH CLONE MATES AND RETRO-LINKS Retroscaffolding differs from conventional contig scaffolding as it relies on the presence of LTR retrotransposons instead of the clone mate information. and run LTR-par on the concatenated sequence. we would hope that these two outputs to . 4 . the actual value added by either of these two techniques is dictated by its respective unique share in output scaffolding. Other structural attributes were detected using LTR-par. the validation reflects an assessment of retro-linking under practical settings in which a target BAC sequence and its LTR pairs are unknown prior to the retroscaffolding step. Preliminary Validations: We validated the retro-linking algorithm on BACi of Table 1 as follows.& > ^ 4 P i g . it is possible that the relative ordering and orientation between the same two contigs are implied by both the techniques. A retro-link is established between a pair only if sufficient structural evidence is detected. poson genie sequences if the LTR regions within the contigs are a suffix of one contig and a prefix of another (see Figure 2b). 5. For detecting retrotransposon genie sequences in contigs.939 LTR pairs do not include the 3 LTR pairs in BACi as identified by LTR-. The resulting 45 contigs were input along with the 1. i.
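Trying each contig pair in all 4 relative orientation combinations, as in the S3 procedure above, only requires reverse complementing one or both contigs before forming the concatenation handed to the LTR detector. A minimal sketch (LTR-par itself is not reproduced here; the spacer length is an arbitrary illustrative choice):

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMP)[::-1]

def orientation_combinations(c1, c2, spacer="N" * 100):
    """Yield the 4 concatenations (c1+c2, c1+rc(c2), rc(c1)+c2,
    rc(c1)+rc(c2)); each would be scanned for a full-length LTR
    element spanning the spacer."""
    for a in (c1, revcomp(c1)):
        for b in (c2, revcomp(c2)):
            yield a + spacer + b

combos = list(orientation_combinations("AACG", "GGTA"))
print(len(combos))  # 4
```

A retro-link is then recorded only for the orientation combination(s) in which the detector finds a structurally valid full-length element across the junction.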

356 bp) than was achieved by just clone mate scaffolding — implying that retroscaffolding provides added information that is not provided by clone mate information. For each of the three scaffolding strategies (i. While retroscaffolding resulted in many fewer scaffolds (5). 6. We assessed the effect of a combined application of retroscaffolding and clone mate based scaffolding on real maize genomic contig data as follows: 62 contigs were generated by performing a CAP3 assembly over a 3X coverage set of fragments sequenced from BACi. retroscaffolding and combined).) We then assessed the scaffolding achieved by retroscaffolding the contig data — retro-links were first established using the framework described in Section 4 and the output was transformed as input to Bambus. the total span was smaller (65. This resulted in 32 scaffolds spanning an estimated total of 120. and (iii) combined scaffolding using both clone mate and retro-link information.. We also assessed the individual effect of these scaffolding techniques on "assembly gaps": Each of the 62 contigs was individually aligned to the assembled BAC4 sequence and the stretch along which each has a maximum alignment score was selected to be its locus on the BAC. it is necessary that the LTR retrotransposons in the genome are both abundant and distinguishable. we input both the retro-link and clone mate information with their respective distance and orientation constraints to Bambus. and repeat-rich genomes (e.3800]). Ideally.457 71 28 5 65. There were a total of 42 such gaps. The scaffolding achievable from just the clone mate information was first assessed by running the Bambus 19 program on the contigs.g. A maximal stretch along the BAC not covered by any of the 62 contigs was considered an "assembly gap".605 6. the number of covered assembly gaps was 22 for clone mate scaffolding. does not exist any) pair of scaffolded contigs spanning the gap. 
The above results are summarized in Table 6.15000]) than that of clone mate links ([2200. the average span of each scaffold was almost twice as large in retroscaffolding.760 bp. In the next step. However. Results of (i) scaffolding contig data for BAC* (136. maize) could . LTR sequences within the same family of LTR retrotransposons are harder to distinguish.605 bp) when compared to clone mate scaffolding. "not covered") if there exists a (alternatively.350 bp and each with an average span of 3. the more inclusive scaffolding is on the contigs — ideally. The table also shows the number of contig pairs scaffolded as a result of the respective scaffolding strategies.246 10 17 complement one another. we would expect all contigs to be in one scaffold thereby implying (622) contigs pairs. This further demonstrates the value added by retroscaffolding. This is as expected because the distance constraint used for each retro-link was longer ([5000. the higher this number is. Based on this definition.. This combination resulted in fewer scaffolds (27) and a longer total span (138. Clone mate scaffolding Number of scaffolds Total span of scaffolds (bp) Average span of scaffold (bp) Number of contig pairs scaffolded Number of assembly gaps covered 32 120. clone mate based. For retroscaffolding to be effective in a genome project. and 28 for the combined scaffolding. DISCUSSION Our preliminary studies on maize genomic (Section 3) and the experimental results on maize contig data (Section 5) demonstrate a proof of concept and the value added by retroscaffolding in genome assembly projects. because it includes the size estimated for sequencing gaps between the scaffolded contigs.e. 17 for retroscaffolding. an assembly gap is said to be "covered" (alternatively. (ii) retroscaffolding.350 3.932 bp) using clone mate information. (Note that the "span" of a scaffold output by Bambus is only an estimate. 
all 62 contigs would be part of just one "scaffold" if the contigs were all to be ordered along the target BAC.176 Table 6.356 4.760 42 22 Retroscaffolding Combined scaffolding 27 138.

Further validation of the framework on sequenced genomes and at much larger scales are essential to ensure an effective and high-quality application of our methodology in forthcoming complex genome projects. While reducing the sequencing coverage as low as 3X may expose more gaps to span LTR retrotransposons in a target genome. we propose the following iterative approach to sequencing and assembly: first. this procedure can be repeated until sufficient information is gathered to completely assemble and scaffold each BAC. and reassemble them with the fragments sequenced from all sequencing phases. sorghum. It is for this reason that retroscaffolding is more suited for genome projects involving hierarchical (e. and retroscaffolding could make the task of carrying out the assembly in such projects practically feasible. then perform additional coverage sequencing selectively on these BACs. CONCLUSIONS Genome projects of several economically important plant crops such as maize. wheat. The algorithmic framework for retroscaffolding is still at an early stage of development. If a subsequent retroscaffolding reveals the low repeat content in a subset of the input BACs. To circumvent this issue in a hierarchical sequencing project. we can prioritize the gaps to finish based on this information. New sequencing technologies such as the 454 sequencing13 that do not generate clone mate information are increasingly becoming popular due to their high throughput and cost-effectiveness. In practice. Therefore. we plan to evaluate the collective effectiveness of retroscaffolding and clone mate based scaffolding at a larger scale. BAC-by-BAC) sequencing.. applying retroscaffolding at a genome level may cause several spurious retro-links to be established. In the worst case. especially of those contigs corresponding to the non-repetitive regions of the genome. etc. Such sequencing technologies may be an appropriate choice for low-budget sequencing projects. 
the scaffolding information derived from retroscaffolding may in part be already provided by clone mates. Therefore. thereby confounding the process of scaffolding. it also implies that there is less redundancy in fragment data. This might affect the quality of the output assembly. Retroscaffolding can also be used to order and orient BACs. we can benefit from the scaffolding information provided by retroscaffolding in two ways: (i) we will have information about not only the genomic loci but also the composition of the assembly gaps covered by retroscaffolding. continued research in developing this new methodology further could have a high impact. Most of these plant genomes contain an enormous number of retrotransposons that are not only expected to confound the assembly process. sequence all the BACs at a low coverage and assemble them. even if no new scaffolding information is provided by retroscaffolding. if the overlapping ends of two consecutive BACs along a tiling path span an LTR retrotransposon. Specifically. To . the retroscaffolding approach proposed in this paper offers the possibility of exploiting the abundance of LTR retrotransposons. barley. Retroscaffolding also provides a mechanism to explore the feasibility of a lower coverage sequencing in genome projects. (ii) to guide the process of finishing by providing information on the unfinished regions of the genome. and (iii) to introduce the possibility of reducing sequencing coverage without loss of information regarding the sequenced genes and their relative ordering. but are also expected to consume the bulk of the sequencing and finishing budget. are either already underway or are likely to be initiated over the next few years.177 have numerous copies of the same family scattered across the genome. Retroscaffolding will be useful in projects which do not generate clone mate information. as they are expected to contain sequences corresponding to a retrotransposon insert. 7.g. 
Several developments have been planned as future work on this research. In genome projects which generate clone mate information. Given that sequencing and finishing account for most of the expenditures in genome projects. In contrast to this perspective. and (ii) the scaffolding output by retroscaffolding can be used to as supporting evidence to validate the output of clone mate information. This approach provides a cost-effective mechanism to sequence repeat-rich genomes without compromising on the quality of the output assembly. thus serving three main purposes: (i) to scaffold contigs that are output by an assembler..

C.J. 409:860-921. Initial sequencing and analysis of the human genome. Mullikin and Z. K. Miguel. A. Nature Genetics. 2003. distribution. et al. The paleontology of intergene retrotransposons of maize. Gnerre. 20. CAP3: A DNA sequence assembly program. ARACHNE a whole-genome shotgun assembler. International Conference on Research in Computational Biology (RECOMB). 2003. Proc. X. Y. 13:81-90. 2005. A. IEEE Transactions on Parallel and Distributed Systems. Philosophical Transactions of the Royal Society of London. and J. Personal Communication. IEEE Computational Systems Bioinformatics Conference. 19.C. McCombie. Meyers. . In Proc. D. Pop. The contributions of retroelements to plant genome organization.gov/ news/news_summ. Margulies. S. E. Stanley. 2005. References 1. Huang and A. 18. S. S. P. Altman.C. S. 2005. 7. 2005. Huson. http://www. Nusbaum. Jaffe.B. Myers. Science.S. Morgante.E. Hierarchical scaffolding with Bambus. 1997. International Human Genome Sequencing Consortium. 2005.M. Tingey. 1998. Hughes. t h e application of retroscaffolding on the on-going maize genome project will provide a good starting point.V.178 this effect.M. 9. Butler et al. et al. 2001.H. 2005. 2002. 312:227-242.S. 1998. S. NSF. 6. L. B. Gaut. Kalyanaraman. 2001.H. 4. Bennetzen. and M. Personal Communication. 13. S. USDA and DOE Award $32 Million to Sequence Corn Genome. 2004. D. Brendel. 8. E.B. Kalyanaraman. Plantview. V. Egholm. and transcriptional activity of repetitive elements in the maize genome. D. NSF. 17. function and evolution. Salzberg. 11. Repetitive DNA and chromosome evolution in plants. D. Mauceli et al. R. 437:376-380. Nature. and E. Tikhonov. 5. SanMiguel. Nakajima. pages 5664. P. Press Release 05-197. Batzoglou. 13:91-96. B. Abundance. 14:149159.B. 9(9):868-877. W. 2001. 2003. M. Attiya. 12(1):177-189. Ning. K. Aluru. 274:765-768. 1996.. The phusion assembler. Whole-genome sequence assembly for mammalian genomes: Arachne 2. 14. 
Efficient Algorithms and Software for Detection of Full-Length LTR Retrotransposons. Nature. Genome Research. and S. Genome sequencing in microfabricated highdensity picolitre reactors.nsf.E. 20:4345. Personal Communication. J. Genome Research. Kosack. Kalyanaraman and S. J. Lander. Varmus. Fellowship to A. This research was supported by the NSF award DBI #0527192 and an IBM Ph.L. B. The greedy path—merging algorithm for sequence assembly. Flavell. B. and Phillip San Miguel for the maize retrotransposon data. Madan. 3. 1999. 10.D. R. Linton. 4(9):347-353.L. Genome Research. and S. Trends in Microbiology. 1986. M. ACKNOWLEDGMENTS We thank the reviewers of an earlier version of this manuscript for providing several useful insights. 2. M. Retroviruses. J. Jaffe.S.L.jsp?cntn_id=104608\&org=BI0\ &from=news. pages 157-163. We are grateful to Scott Emrich for the scripts to simulate shotgun sequencing and for discussions. Nature. J. 15. A. Genome Research. Emrich. Aluru. Coffin. Genome Research. We thank Richard McCombie for the maize BAC data. Bennetzen. and H. Butler. J. 12. Reinert. 14(12):1209-1221. Space and time efficient parallel algorithms and software for EST clustering.S. Kothari. S. Birren. Initial sequencing and analysis of the human genome. 16. 409:860-921.

WHOLE GENOME COMPOSITION DISTANCE FOR HIV-1 GENOTYPING

Xiaomeng Wu, Randy Goebel
Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
Email: xiaomeng.ca

Xiu-Feng Wan
Systems Biology Laboratory, Department of Microbiology, Miami University, Oxford, OH 45056, USA
E-mail: wanx@muohio.edu

Guohui Lin*
Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
Email: ghlin@cs.ualberta.ca

Existing HIV-1 genotyping systems require a computationally expensive phase of multiple sequence alignments, and the alignments must have a sufficiently high quality for accurate genotyping. This is particularly a challenge when the number of strains is large. Here we propose a whole genome composition distance (WGCD) to measure the evolutionary closeness between two HIV-1 whole genomic RNA sequences, and use that measure to develop an HIV-1 genotyping system. Such a WGCD-based genotyping system avoids multiple sequence alignments and does not require any pre-knowledge about the evolutionary rates. Experimental results showed that the system is able to correctly identify the known subtypes, sub-subtypes, and individual circulating recombinant forms.

Keywords: String composition, Whole genome phylogenetic analysis, HIV-1 genotyping, Neighbor joining, Circulating recombinant form.

1. INTRODUCTION

Acquired Immune Deficiency Syndrome (AIDS) is caused by a virus known as human immunodeficiency virus (HIV). The first case of AIDS was reported in the United States in 1981 and has since become a major worldwide epidemic. By the end of 2005, over 900,000 people had been diagnosed with HIV infection, and more than 2.3 million were estimated to be HIV positive (http://www.who.int). There are two types of HIV: HIV-1 and HIV-2. HIV-1 is more pathogenic than HIV-2, and most of HIV infection is caused by HIV-1 1. HIV is a retrovirus, which has one RNA genome segment encoding 9 genes, including env, gag, pol, nef, rev, tat, vif, vpr, and vpu (for HIV-1) or vpx (for HIV-2). Similar to other RNA viruses, HIV is notorious for its fast mutation and recombination.

In the last two decades, many different genotypes of HIV-1 have been reported, largely consisting of three major groups: M, O and N 3. Further analyses have characterized the group M of HIV-1 into 9 subtypes (A-D, F-H, J, and K), and at least 16 circulating recombinant forms (CRFs) (http://hiv-web.lanl.gov). These subtypes represent different lineages of HIV-1, and have some geographical associations (Figure 1). For example, due to HIV mutation and recombination, previously placed subtypes E and I were later discovered to be recombinants. The identification of emerging genotypes presents a major challenge for the development of HIV vaccines and anti-HIV medicines 2. Improved genotyping information will not only enhance the development of anti-HIV drugs and HIV vaccine, but also help us understand the epidemics of HIV infection. So it is clear that an efficient and effective genotyping system for HIV will be essential for HIV study. Currently however, most genotyping methods are complicated and laborious. For a genotyping process that uses HIV-1 whole genomic sequences, there are two main challenges: (1) The mutation rates be-

*To whom correspondence should be addressed.

tween HIV-1 genomes are not equal, even though the aligned length could vary for different regions of a genome. For example, the genetic distance between genotypes is found to be greater in more polymorphic genes (e.g., the envelope gene) than in others. As a result, a phylogenetic relation based on a single gene may not be accurate with respect to the evolutionary patterns of HIV-1. (2) As the number of HIV-1 genomes increases, genotyping methods must be faster and more robust.

Fig. 1. This diagram illustrates the different levels of HIV-1 classification: HIV-1 is divided into three groups (M, N, O), and group M is divided into 9 subtypes (A, B, C, D, F, G, H, J, K).

Almost all of the currently available traditional genotyping methods are based on multiple sequence alignments 2, 5, 4. Some recent methods propose the use of partial genome or whole genome sequences to conduct the phylogenetic analysis 4, which can then provide enough sequence information. However, most of them rely on an at least 300-500 character long alignment. The importance of alignment length was emphasized in RevML 6, where it is reported that alignments with fewer than 400 characters generated trees with problems. Furthermore, we believe that the performance of multiple sequence alignments will decrease as the number of sequences increases.

There are a number of genotyping systems especially designed for anti-retrovirus drug resistance studies 7, 2. These genotyping methods either employ rule-based algorithms or are based on a genotype-phenotype database, because of the updating sequence information in the database 2. It is expected that recombination will complicate these genotype interpretation systems. As a result, these methods may generate some confusing instead of confirmatory information, such as mixing of sub-subtypes between A1 and A2, F1 and F2 and sometimes K, as well as mixing of B and D.

In this paper we propose a novel method called the whole genome composition distance (WGCD), that avoids multiple sequence alignment, to measure the evolutionary distance between two HIV-1 whole genomic sequences. Subsequently, a distance-based phylogenetic construction method, Neighbor-Joining (NJ) 8, is adapted to build the phylogenetic clades. Our results show that the proposed approach can efficiently construct the phylogenetic clades for a set of 42 HIV-1 strains exactly the same as the one published by Los Alamos National Laboratory (LANL) with intensive computation and human curation 6.

The rest of the paper is organized as follows: In Section 2, we introduce in detail the whole genome composition distance computation using complete genomic sequences. Section 3 outlines the flow of operations in this novel HIV-1 genotyping system. An HIV-1 dataset that includes in total 42 whole genomic sequences is introduced in Section 4, as are the experimental results and our discussion. Experimental results and discussion on the individual CRF identification are also included in Section 4. We conclude the paper in Section 5.

2. WHOLE GENOME COMPOSITION DISTANCE

There are several existing HIV-1 genotyping systems, including the Stanford HIV-Seq program (http://hivdb.stanford.edu), the NCBI Genotyping Program (http://www.ncbi.nih.gov/projects/genotyping), the Los Alamos Recombinant Identification Program (http://hivweb.lanl.gov/RIP/RIPsubmit.html), the European-based Subtype Analyzer Program (http://pgv19.virol.ucl.ac.uk/download/star.linux.tar) 9, and a recently developed system (http://www.bioafrica.net/subtypetool/html) described in 4. All these systems employ a computationally intensive phase

However, the performance of these systems is highly dependent on accumulated knowledge about HIV-1 strains — for example, that the HIV-1 pol gene is highly conserved, so that at most two gaps can be introduced into its alignment — with priority given to fragments known to be more conserved than others. Nonetheless, these systems all perform acceptably well when limited to single genes in the HIV-1 strains. On the other hand, different levels of conservation within different HIV-1 genes pose additional difficulties in the phylogenetic analysis: analyses using different genes might produce inconsistent and even erroneous results (the same situation happens in numerous HIV databases) [10], and the multiple sequence alignments become more challenging, to the point that MSA-based phylogenetic analysis using multiple genes is computationally impractical.

For these reasons, and based on our previous research on whole genome phylogenetic analysis for Avian Influenza Viruses (AIV), we explore the possibility of HIV-1 phylogenetic analysis using complete genomic sequences, thereby avoiding the computationally intensive phase of multiple sequence alignments. We note that there is a rich literature on whole genome phylogenetic analysis, where many approaches have been proposed for estimating the pairwise distance between two whole genomes. To name a few, there are approaches based on string composition [11-14], including the single amino acid composition, the dipeptide composition [11], the complete information set (CIS) [12], an SVD-based measure using tripeptide (and tetrapeptide) compositions [13, 23], and the composition vector (CV) [14]; approaches based on text compression [15-18]; and approaches based on gene content [19-22]. We adapt some ideas for whole genome phylogeny construction from this literature [11, 12] and propose a novel composition distance to measure the evolutionary closeness between two HIV-1 whole genomic sequences. We use the whole genomic RNA sequences of HIV-1 to validate our method.

Given an RNA sequence R, for each type of nucleotide nu, the frequency of nu in R is the number of occurrences of nu in R divided by the total number of nucleotides in R. The single nucleotide composition of R is the vector of these 4 nucleotide frequencies. Likewise, the dinucleotide composition of R is the vector of 4^2 = 16 dinucleotide frequencies in R. In general, for each length-k nucleotide segment (there are possibly 4^k of them), its frequency in R is calculated as the number of its occurrences in R divided by the total number of overlapping length-k nucleotide segments in R. The length-k nucleotide segment composition of R, a 4^k-dimensional vector, is called the k-th composition of R and denoted C_k(R). The k-th composition can be used as a signature for strain R — in our case, to represent an HIV-1 strain.

Given two strains R and S, we propose to use the Euclidean distance between C_k(R) and C_k(S) to measure their evolutionary distance. Namely, assuming that C_k(R) = (f_1, f_2, ..., f_{4^k}) and C_k(S) = (g_1, g_2, ..., g_{4^k}), where f_i and g_i are the frequencies of a common length-k nucleotide segment in R and S, respectively, the Euclidean distance d_k(R, S) is

    d_k(R, S) = sqrt( Σ_{i=1}^{4^k} (f_i - g_i)^2 ).    (1)

There are several similar measures in the literature for whole genome phylogenetic analysis; for example, the CIS method defines an information discrepancy between C_k(R) and C_k(S) and uses its normalized version to measure the evolutionary distance between R and S [12].

Note that in general C_k(R) and C_{k+1}(R) both contain evolutionary information of R, but some information hidden in one composition is not necessarily revealed by the other. The single nucleotide composition is one of the most information-rich compositions; the dinucleotide composition reveals the single nucleotide composition plus some extra evolutionary information not included in it. Likewise, the (k+1)-th composition is expected to contain some additional evolutionary information not included in the k-th composition, and this additional information is expected to decrease with increasing k (see also the Experimental Results and Discussion). After the distance matrix on the set of HIV-1 whole genomic sequences is thus computed, the Neighbor-Joining method [8] is utilized to build the phylogenetic clades.
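To make Equation (1) concrete, here is a minimal sketch (our illustrative code, not part of the original system; the function names are our own) that builds the k-th composition of a sequence and evaluates d_k. Only segments that actually occur are stored; absent segments are implicitly at frequency 0:

```python
from math import sqrt

def kth_composition(seq, k):
    """k-th composition C_k(seq): frequency of every overlapping
    length-k segment; segments that never occur are implicitly 0."""
    total = len(seq) - k + 1
    counts = {}
    for i in range(total):
        seg = seq[i:i + k]
        counts[seg] = counts.get(seg, 0) + 1
    return {seg: c / total for seg, c in counts.items()}

def d_k(r, s, k):
    """Equation (1): Euclidean distance between C_k(r) and C_k(s)."""
    cr, cs = kth_composition(r, k), kth_composition(s, k)
    segs = set(cr) | set(cs)
    return sqrt(sum((cr.get(x, 0.0) - cs.get(x, 0.0)) ** 2 for x in segs))
```

For k = 1 this reduces to comparing the four nucleotide frequencies, and identical sequences give d_k = 0.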

In this study, for a sufficiently large k, we use the sequence of compositions (C_1(R), C_2(R), ..., C_k(R)) to represent strain R. Given two strains R and S so represented, their evolutionary distance is defined as

    d(R, S) = sqrt( Σ_{i=1}^{k} d_i^2(R, S) ),    (2)

where d_i(R, S) is defined in Equation (1). Such a distance is called the whole genome composition distance (WGCD) between R and S.

3. WGCD-BASED HIV-1 GENOTYPING

The WGCD-based HIV-1 genotyping system can be used as a tool in addition to existing systems, typically in the cases where the whole strains are available. In particular, the genotyping system described here has been developed for analyzing whole strains so as to avoid the computationally intensive phase of multiple sequence alignments.

The method can be roughly partitioned into four steps. In the first step, a fixed maximum nucleotide segment length k is chosen, in our case k = 80. In the second step, for every strain S, we calculate the composition vectors (C_1(S), C_2(S), ..., C_80(S)). Note that it is impossible to count the frequency of every length-80 nucleotide segment, as there would be 4^80 ≈ 1.5 x 10^48 of them. Instead, the counting is done only for those length-80 nucleotide segments that actually occur in S, and the frequency of a non-occurring segment is automatically treated as 0; the frequencies of segments of the same length are kept in alphabetical order. This counting is computed by several linear scans of the whole strain S. Given a pair of strains S and T, their evolutionary distance d(S, T) is then computed using Equation (2), with a single linear scan of the composition vectors of S and T at the same time. The resultant pairwise evolutionary distance matrix M = (d(S, T)) is fed to the Neighbor-Joining method in the third step to build a phylogenetic tree on the strains. The fourth step is the standard bootstrapping used in phylogeny construction methods [24]: at each iteration, 30% of the genetic sequences are randomly mutated and a bootstrapping tree is built using exactly the same method as in steps one to three. In total, 200 such bootstrapping trees are built and their consensus is produced as the final phylogenetic clades. Bootstrapping is used to test whether the output phylogeny is stable (or confident) with respect to the input sequences, given that there might be sequencing errors and that some strains are incomplete sequences.
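Steps one and two above can be sketched as follows (illustrative code, not the authors' implementation): for each k only the segments that actually occur are stored, since enumerating all 4^80 length-80 segments is impossible.

```python
from math import sqrt

def composition(seq, k):
    """Sparse k-th composition: only segments occurring in seq are stored."""
    total = len(seq) - k + 1
    counts = {}
    for i in range(total):
        counts[seq[i:i + k]] = counts.get(seq[i:i + k], 0) + 1
    return {seg: c / total for seg, c in counts.items()}

def wgcd(r, s, kmax=80):
    """Equation (2): combine the per-k Euclidean distances d_1..d_kmax."""
    total = 0.0
    for k in range(1, kmax + 1):
        cr, cs = composition(r, k), composition(s, k)
        total += sum((cr.get(x, 0.0) - cs.get(x, 0.0)) ** 2
                     for x in set(cr) | set(cs))  # d_k squared
    return sqrt(total)
```

The pairwise matrix M = (wgcd(S, T)) computed this way is what step three hands to Neighbor-Joining.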
4. EXPERIMENTAL RESULTS AND DISCUSSION

4.1. Datasets

We downloaded a set of 42 HIV-1 strains based on a review paper [6]; these were used as reference sequences in genotyping with an enhanced version of the fastDNAml maximum likelihood tree fitting code (RevML) and site rate estimation code (RevRates) [6]. The trees created in [6] required several executions of RevML and RevRates with different initial site mutation rates, followed by modified global and local site mutation rates. Both trees on single genes and on full-length env, pol, and gag were constructed and, as stated in the review paper [6], the single gene trees showed problems when alignments of fewer than 400 characters occurred. In addition to these 42 HIV-1 strains, 2 CPZ strains (CPZ is a subtype of Simian Immunodeficiency Virus (SIV), which is believed to share a common ancestor with HIV-1 and HIV-2) were included for outgrouping purposes, and 15 recombinant strains were also downloaded to test the recombinant identification ability of the WGCD-based HIV-1 genotyping system. The average length of these 44 whole genomic RNA sequences is 9,019 bp, with a maximum length of 9,829 bp and a minimum length of 8,349 bp.

4.2. Results on Maximum Nucleotide Segment Length Determination

As discussed earlier, the (k+1)-th composition is expected to contain some additional evolutionary information not included in the k-th composition, yet this additional amount decreases with increasing k. We set up experiments to validate this expectation, and subsequently to determine the maximum segment length k we should use in the WGCD-based genotyping.

Fig. 2. The average distance contribution percentages of individual compositions to the WGCD, for (a) k = 10, (b) k = 15, (c) k = 20, (d) k = 30, (e) k = 50, and (f) k = 80. Red lines plot the averages, green lines plot 5 randomly chosen distance contribution percentages, and blue lines plot the minimum and maximum distance contribution percentages.

Fig. 3. The bootstrapping Neighbor-Joining phylogenies on the WGCD for (a) k = 10, (b) k = 15, (c) k = 20, (d) k = 30, (e) k = 50, and (f) k = 80. Leaves are labeled by subtype and GenBank accession number (e.g., A1+U51190, CPZ+AF447763).

We set k = 10, 15, 20, 30, 50, and 80. For each pair of strains S and T and each i ≤ k, we compute d_i(S, T) using Equation (1) and d(S, T) using Equation (2); then c_i(S, T) = d_i^2(S, T) / d^2(S, T) is the distance contribution percentage of the i-th composition to the WGCD. We take the average of c_i(S, T) over all pairs of strains, denoted c̄_i; these c̄_i are plotted as red lines in Figures 2(a)-2(f). Note that we might think of c̄_{i+1} as the extra contribution of the (i+1)-th composition compared to the i-th composition. Comparing the contribution plots for the individual compositions, one can see that the tail portion of the line corresponding to k = 80 is approximately horizontal, which indicates that we may neglect the evolutionary information carried by nucleotide segments longer than 80.
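The contribution analysis can be sketched as follows (illustrative code with our own function names): for one pair of strains, c_i = d_i^2 / d^2, so the contributions over i = 1..k sum to one; averaging over all pairs yields the c̄_i curves of Figure 2.

```python
def composition(seq, k):
    """Sparse k-th composition of seq (only occurring segments stored)."""
    total = len(seq) - k + 1
    counts = {}
    for i in range(total):
        counts[seq[i:i + k]] = counts.get(seq[i:i + k], 0) + 1
    return {seg: c / total for seg, c in counts.items()}

def contribution_percentages(r, s, kmax):
    """c_i(r, s) = d_i^2(r, s) / d^2(r, s) for i = 1..kmax; sums to 1."""
    sq = []
    for k in range(1, kmax + 1):
        cr, cs = composition(r, k), composition(s, k)
        sq.append(sum((cr.get(x, 0.0) - cs.get(x, 0.0)) ** 2
                      for x in set(cr) | set(cs)))  # d_k squared
    total = sum(sq)
    return [d2 / total for d2 in sq]
```

A flat (near-zero) tail at large i in the averaged curves is what justifies capping the maximum segment length at k = 80.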
4.3. Phylogenetic Analysis Results

For the set of the above 42 pure subtype strains and 2 CPZ strains, we constructed phylogenies using k = 10, 15, 20, 30, 50, and 80, respectively; these trees are illustrated in Figures 3(a)-3(f). For k = 10 and 15, in Figures 3(a) and 3(b), there is a problem with the outgrouping CPZ SIV strains, and subtype C is misplaced. This problem is resolved in Figures 3(c) and 3(d): when k increases to 20 and 30, the strains are classified better into the phylogenetic clades. Nonetheless, subtype K is mis-inserted into subtype F in both of the phylogenies in Figures 3(c) and 3(d). When k = 80, a phylogeny that maps to all known evolutionary relationships is obtained (Figure 3(f)), which clearly demonstrates 11 clades corresponding to 13 subtypes: sub-subtypes A1 and A2 are adjacent to each other, as are F1 and F2; subtypes B and D are closer to each other than to other subtypes, and G and A2 are closer than the others; and groups M, N, and O are well separated. One exception is AJ249238, which belongs to sub-subtype F1 but is misclassified into F2; this misplacement is consistent across all values of k we tested.

For comparison, we also conducted a control experiment using the multiple sequence alignments produced by ClustalW (http://www.ebi.ac.uk/clustalw/): we uploaded all 44 genomic sequences to the ClustalW webserver, and the resulting guide tree is shown in Figure 4 (the ClustalW guide tree on the 42 HIV-1 strains and two CPZ SIV strains).

We also borrowed the software Biolayout [25, 26] (http://www.biolayout.org/) to display the phylogenetic clades graphically. For this purpose, we removed the outgrouping CPZ SIV strains and sorted the pairwise distances between all 42 HIV-1 strains in increasing order. Scanning through this order, we selected the 84 minimum distances, which involve all 42 strains (84 is the minimum number of smallest distances that involves all 42 strains), and sent them to Biolayout. Figure 5 shows the graphical view of these distances (the Biolayout graphical view of the phylogenetic clades of the 42 HIV-1 strains, using the 84 smallest distances computed by the WGCD method with nucleotide strings of length 1 to 80). The resulting bootstrapping phylogenetic tree for k = 80 is shown in Figure 6 (the Neighbor-Joining phylogeny with 200 bootstrapping iterations on the WGCD using k = 80).
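The Neighbor-Joining step behind all of the trees above follows Saitou and Nei [8]. A compact, topology-only sketch (our illustration; branch lengths and the 200-iteration bootstrapping are omitted):

```python
def neighbor_joining(labels, D):
    """Minimal Neighbor-Joining returning a nested-tuple tree topology.

    labels: list of taxon names; D: symmetric distance matrix (list of lists).
    """
    labels = list(labels)
    D = [row[:] for row in D]
    while len(labels) > 2:
        n = len(labels)
        r = [sum(row) for row in D]
        # Find the pair (i, j) minimizing the Q-criterion.
        best, bi, bj = None, 0, 1
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * D[i][j] - r[i] - r[j]
                if best is None or q < best:
                    best, bi, bj = q, i, j
        # Join i and j into a new internal node; update distances to it.
        keep = [k for k in range(n) if k not in (bi, bj)]
        new = [(D[bi][k] + D[bj][k] - D[bi][bj]) / 2 for k in keep]
        labels = [labels[k] for k in keep] + [(labels[bi], labels[bj])]
        D = [[D[a][b] for b in keep] + [new[ai]] for ai, a in enumerate(keep)]
        D.append(new + [0.0])
    return (labels[0], labels[1])
```

On an additive four-taxon matrix this recovers the expected ((a,b),(c,d)) split.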

4.4. Discussion

4.4.1. Strains as Concatenations of Gene Sequences

We have also used the concatenation of all 9 gene sequences to represent a strain, and then used these concatenations to construct a phylogeny. For this purpose, we set k = 10, 15, 20, 25, and 40 in the WGCD distance computation. We again obtained the distance contribution percentages of all the compositions, and the plots (as in Figure 2) are very "horizontal" when k > 30. The resultant bootstrapping Neighbor-Joining phylogeny for k = 40, where strains are represented as concatenations of the 9 gene sequences each, is shown in Figure 7. An interesting observation is that this phylogeny is different from the one constructed using complete genomic sequences: it contains several misplaced strains in subtypes A2 and C. This observation might indicate that the inter-gene regions in the complete genomic sequences of HIV-1 strains also contain evolutionary information that can be used for genotyping purposes.

4.4.2. Strains as Concatenations of Protein Sequences

The gene products were also used for phylogenetic analysis, with every strain represented as a concatenation of 9 protein sequences; note that there are 20 types of amino acid residues rather than 4 nucleotides. The distance contribution percentages were calculated again, and their plots (similar to Figure 2) are very "horizontal" when k = 40. In the corresponding bootstrapping Neighbor-Joining phylogeny (200 bootstrapping iterations on the WGCD using k = 40, with strains represented as concatenations of 9 protein sequences each), one can see more misplaced phylogenetic clades than in the phylogeny using only gene sequences: several subtypes (e.g., A1, C, D, and G) are split into 2 parts each, and sub-subtypes A1 and A2 become disconnected. We note that this may result from silent mutations, which contribute to the genotyping diversity at the nucleotide level but not at the protein level; indeed, standard HIV-1 genotyping is done mostly at the nucleotide level. Another possible reason that gene products are inferior may be that the evolutionary changes in the intergenic regions are not considered when genotyping uses protein coding regions only. Based on these results, we might conclude that using gene products for phylogenetic analysis on such fast evolving viruses is the least appropriate choice.

4.4.3. Identification of Circulating Recombinant Forms

Group M HIV-1 strains are currently classified into 9 subtypes and 16 circulating recombinant forms (CRFs), and there is intensive activity in the area of discovering new HIV-1 CRFs: evidence suggests many sub-subtypes and even a potential new subtype, although many of the new CRFs have not yet been published [6]. The classification of new HIV-1 sequences follows the proposed HIV nomenclature guidelines [3, 27] and uses the HIV-1 subtyping reference set. Typically, each of the known CRFs is described by four representatives, with the sequences selected to show how the CRF is composed of the included subtypes. It is recognized that, with the large increase in reported CRFs, the reference set would grow to an extent that would cause problems in some analyses if four sequences were included for each CRF; for this reason, previous studies have limited the CRF section to one sequence per CRF.

In our previous experiments, only non-CRF strains were included. We have also performed experiments to test whether the WGCD-based genotyping system is able to recognize CRFs; again for this purpose, the outgrouping CPZ strains were removed from the dataset. We found that when the number of CRFs added is limited to one or two at a time (where CRFs are represented as complete genomic RNA sequences), the recombinant subtype information can always be correctly identified. For example, Figures 8(a) and 8(b) show that both L39106 (NCBI accession number, strain from Nigeria) and AF063224 (NCBI accession number, strain from Djibouti) are recognized as A/G recombinants, and both of them are closer to subtype A than to subtype G. Neither L39106 nor AF063224 can be classified into the A1, A2, or G pure subtypes, yet both have close relationships with the A1, A2, and G strains — an obvious indication of recombinants. When we added both L39106 and AF063224 into the phylogenetic analysis together, we found (Figure 9) that they were grouped together and recognized as A/G recombinants (a little closer to subtype A than to subtype G), conforming with previous studies on these two recombinants. We have also used Biolayout to display graphically the phylogenetic closeness of L39106 and AF063224 to the pure subtype strains; for this dataset, which consists of the 42 HIV-1 pure subtype strains and the 2 CRFs, a total of 92 smallest distances were used. Figure 10 clearly shows the pure subtype clades and that both L39106 and AF063224 have edges connecting to all A1, A2, and G strains. In addition, we have tested the other downloaded CRFs individually and as a whole set, and in each case their recombinant form(s) was correctly identified (data not shown).

Therefore, we are confident in claiming that, in terms of single CRF detection, our WGCD-based HIV-1 genotyping system performs well, and that it is superior to existing methods in the sense that it requires neither pre-selection of a CRF section nor that each CRF be described by four representatives. Compared with other recombination detection systems based on multiple sequence alignments, the current version of our method has the limitation that it cannot detect the breaking points of the recombination. In the future, we propose to check the motifs contributing the most to the recombination, and thus to find the breaking points within the recombination; we will also study how to detect the recombination interactively. This is one of our ongoing research focuses, and we have obtained promising partial results.

Fig. 8. The Neighbor-Joining phylogenies with 200 bootstrapping iterations on the WGCD using k = 80, with one CRF added at a time: (a) L39016; (b) AF063224. In each case the added strain is recognized as an A/G recombinant, closer to subtype A than to subtype G.

4.4.4. Subtype Signature Segment Nucleotide Identification

In the current work, all nucleotide (or amino acid) segments of length up to k are used in the WGCD distance calculation; they are sufficient for identifying the subtypes and CRFs. Looking at the resultant distance matrices, however, we found that for a fixed k the pairwise distances are very close to each other. Therefore, one interesting research subject would be to identify those nucleotide segments that signify subtypes. With such signature segments, it could be easier to identify the subtype of a newly sequenced strain, by simply looking at the occurrences of the signature segments; moreover, occurrences of signature segments associated with multiple subtypes would be an indication of a CRF.
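One way to make the signature-segment idea concrete (our toy sketch with made-up sequences, not the authors' method): call a length-k segment a signature of a subtype if it occurs in every strain of that subtype but in no strain of any other subtype.

```python
def kmers(seq, k):
    """All distinct overlapping length-k segments of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def signature_segments(subtypes, k):
    """subtypes: dict mapping subtype name -> list of strain sequences.
    Returns, per subtype, the k-mers present in all of its strains and
    absent from every strain of the other subtypes."""
    shared = {name: set.intersection(*(kmers(s, k) for s in seqs))
              for name, seqs in subtypes.items()}
    sigs = {}
    for name in subtypes:
        others = set().union(*(kmers(s, k)
                               for o, seqs in subtypes.items() if o != name
                               for s in seqs))
        sigs[name] = shared[name] - others
    return sigs
```

A newly sequenced strain hitting the signatures of several subtypes would then be flagged as a possible CRF.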
Fig. 9. The Neighbor-Joining phylogeny with 200 bootstrapping iterations on the WGCD using k = 80, where L39106 and AF063224 are grouped together and recognized as A/G recombinants (a little closer to subtype A than to subtype G).

5. CONCLUSIONS

We have proposed a whole genome composition distance based HIV-1 genotyping system that uses nucleotide segment frequencies to represent the genomes for phylogenetic analysis. Such a system avoids the computationally intensive phase of multiple sequence alignments. Experiments show that every composition makes its unique contribution to the evolutionary distance computation, and that the contribution decreases as the nucleotide segment length increases. Setting the maximum segment length k to 80, the Neighbor-Joining method applied to the WGCD-based distances constructs phylogenetic clades identical to the gold standard ones, which were determined by running several MSA-based genotyping systems together with manual parameter adjustments. Experiments on single CRF identification also confirm that the new system is capable of CRF discovery using only the whole strains, without any extra requirements on the CRFs. Lastly, our experiments confirm that, for phylogenetic analysis of fast evolving viruses such as HIV, complete genomic sequences can be a better source than gene sequences or gene products. We are also constructing a web server based on the WGCD for HIV-1 genotyping, to be released to the public after extensive testing, which will be able to find the recombinations from original subtypes and even between recombinant genotypes. The full set of detailed experimental results will be reported in a succeeding paper.

Fig. 10. The Biolayout graphical view of the phylogenetic clades of the 42 HIV-1 strains and the 2 A/G CRFs, using the 92 smallest distances computed by the WGCD method with nucleotide strings of length 1 to 80. The figure shows that subtypes A (including A1 and A2) and G are fully connected by the two A/G CRFs.

ACKNOWLEDGMENTS

This research is supported in part by AICML, CFI and NSERC.

References

1. Popper, S. J., Sarr, A. D., Travers, K. U., Gueye-Ndiaye, A., Mboup, S., Essex, M., Kanki, P. J. Lower human immunodeficiency virus (HIV) type 2 viral load reflects the difference in pathogenicity of HIV-1 and HIV-2. Journal of Infectious Diseases 180: 1116-1121, 1999.
2. Sturmer, M., Doerr, H. W., Preiser, W. Variety of interpretation systems for human immunodeficiency virus type 1 genotyping: confirmatory information or additional confusion? Current Drug Targets Infectious Disorders 3: 373-382, 2003.
3. Robertson, D. L., Anderson, J. P., Bradac, J. A., et al. HIV-1 nomenclature proposal. Science 288: 55-56, 2000.
4. de Oliveira, T., Deforche, K., Cassol, S., et al. An automated genotyping system for analysis of HIV-1 and other microbial sequences. Bioinformatics 21: 3797-3800, 2005.
5. Rozanov, M., Plikat, U., Chappey, C., Kochergin, V., Tatusova, T. A web-based genotyping resource for viral sequences. Nucleic Acids Research 32: W654-W659, 2004.
6. Leitner, T., Korber, B., Daniels, M., Calef, C., Foley, B. HIV-1 subtype and circulating recombinant form (CRF) reference sequences, 2005. Accessible through http://www.hiv.lanl.gov/content/hiv-db/REVIEWS/RefSeqs2005/RefSeqs05.html.
7. Identification of biased amino acid substitution patterns in human immunodeficiency virus type 1 isolates from patients treated with protease inhibitors. Journal of Virology 73: 6197-6202, 1999.
8. Saitou, N., Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425, 1987.
9. Myers, R. E., Gale, C. V., Harrison, A., Takeuchi, Y., Kellam, P. A statistical model for HIV-1 sequence classification using the subtype analyser (STAR). Bioinformatics 21: 3535-3540, 2005.
10. The HIV Sequence Database, Los Alamos National Laboratory, Los Alamos, NM. Accessible through http://www.hiv.lanl.gov/content/hiv-db/mainpage.html.
11. Karlin, S., Burge, C. Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics 11: 283-290, 1995.
12. Phylogeny based on whole genome as inferred from complete information set analysis. Journal of Biological Physics 28: 439-447, 2002.
13. Stuart, G. W., Moffet, K., Baker, S. Integrated gene and species phylogenies from unaligned whole genome sequence. Bioinformatics 18: 100-108, 2002.
14. Hao, B., Qi, J. Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance. 2004.
15. Grumbach, S., Tahi, F. Compression of DNA sequences. In Proceedings of the Data Compression Conference, 1993.
16. Rivals, E., Delahaye, J.-P., Dauchet, M., Delgrange, O. Compression and genetic sequences analysis. Biochimie 78: 315-322, 1996.
17. Chen, X., Kwong, S., Li, M. A compression algorithm for DNA sequences and its applications in genome comparison. In Proceedings of RECOMB, pages 107-117, 2000.
18. Benedetto, D., Caglioti, E., Loreto, V. Language trees and zipping. Physical Review Letters 88: 048702, 2002.
19. Snel, B., Bork, P., Huynen, M. A. Genome phylogeny based on gene content. Nature Genetics 21: 108-110, 1999.
20. Snel, B., Bork, P., Huynen, M. A. Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Research 12: 17-25, 2002.
21. House, C. H., Fitz-Gibbon, S. T. Using homolog groups to create a whole-genomic tree of free-living organisms: an update. Journal of Molecular Evolution 54: 539-547, 2002.
22. Herniou, E. A., Luque, T., Chen, X., Vlak, J. M., Winstanley, D., Cory, J. S., O'Reilly, D. R. Use of whole genome sequence data to infer baculovirus phylogeny. Journal of Virology 75: 8117-8126, 2001.
23. Stuart, G. W., Moffet, K., Leader, J. J. A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Molecular Biology and Evolution 19: 554-562, 2002.
24. Felsenstein, J. PHYLIP (Phylogeny Inference Package). Accessible through http://evolution.genetics.washington.edu/phylip.html.
25. Enright, A. J., Ouzounis, C. A. BioLayout: an automatic graph layout algorithm for similarity visualization. Bioinformatics 17: 853-854, 2001.
26. Goldovsky, L., Cases, I., Enright, A. J., Ouzounis, C. A. BioLayout (Java): versatile network visualisation of structural and functional relationships. Applied Bioinformatics 4: 71-74, 2005.
27. Robertson, D. L., et al. HIV-1 nomenclature proposal: a reference guide to HIV-1 classification. In Human Retroviruses and AIDS 1999: a compilation and analysis of nucleic acid and amino acid sequences. Los Alamos National Laboratory, Los Alamos, NM, 1999.

EFFICIENT RECURSIVE LINKING ALGORITHM FOR COMPUTING THE LIKELIHOOD OF AN ORDER OF A LARGE NUMBER OF GENETIC MARKERS

S. Tewari
Dept. of Statistics, University of Georgia, Athens, GA 30605-1952, USA. Email: statsusant@yahoo.com

S. M. Bhandarkar*
Dept. of Computer Science, University of Georgia, Athens, GA 30605-7404, USA. Email: suchi@cs.uga.edu

J. Arnold
Dept. of Genetics, University of Georgia, Athens, GA 30605-7223, USA. Email: arnold@uga.edu

*Corresponding author.

Assuming no interference, a multi-locus genetic likelihood is implemented based on a mathematical model of the recombination process in meiosis that accounts for events up to double crossovers in the genetic interval for any specified order of genetic markers. The mathematical model is realized with a straightforward algorithm that implements the likelihood computation process. The time complexity of the straightforward algorithm is exponential without bound in the number of genetic markers, and implementation of the model for more than 7 genetic markers is not feasible, motivating the need for a novel algorithm. A recursive linking algorithm is proposed that decomposes the pool of genetic markers into segments and renders the model implementable for a large number of genetic markers. The recursive algorithm is shown to reduce the order of time complexity from exponential to linear. The improvement in time complexity is shown theoretically by a worst-case analysis of the algorithm and supported by run time results using data on linkage group-I of the fungal genome Neurospora crassa.

Keywords: Crossover, recursive linking, EM algorithm, MLE, time complexity.

1. INTRODUCTION

High density linkage maps are an essential tool for characterizing genes in many systems. Since genetic maps are most often the critical link between phenotype (what a gene or its product does) and the genetic material, genetic maps can be exploited to address how the genetic material controls a particular trait2 controlled by one or more genes, fundamental genetic processes such as genetic exchange between chromosomes, as well as the analysis of traits controlled by more than one gene (i.e., complex traits)1, such as human disease. One approach to understanding the genetic basis of a complex trait is to follow its segregation in offspring along with an array of genetic markers. These markers in essence allow a triangulation on loci in the DNA affecting the complex trait. Part of this triangulation process involves the construction of a genetic map with many markers. This is a computationally challenging problem5 and at the heart of understanding complex traits.

Most model systems possess high density linkage maps that can assist in the analysis of complex traits. The bread mold, Neurospora crassa3, which gave us the biochemical function of genes, is no exception4. We focus on map construction for the model system N. crassa6, where there is a wealth of published information about how markers segregate, because the genetic makeup of gametes can be identified. One class of genetic markers frequently used are restriction fragment length polymorphisms (RFLPs), markers in the DNA itself. In this paper we address the problem of genetic map reconstruction from a large number of RFLP markers. Previous attempts7, 8 at genetic map

construction of a large number of genetic markers are mostly based on pairwise genotypic information. Here we focus on solving the computational challenge posed by our probabilistic modelling of the genetic map, not on likelihood computation.

2. MULTILOCUS GENETIC LIKELIHOOD FOR A SPECIFIED ORDER OF GENETIC MARKERS

Let S be the sample space of an exchange (crossover) between any two non-sister strands (chromatids) of the tetrad in a single meiosis:

    S = {0, 1, 2, 3, 4}

The element 0 in S indicates the absence of a crossover event. Elements 1, 2, 3 and 4 indicate that the non-sister chromatids (1,3), (2,3), (2,4) and (4,1) took part in the exchange, respectively. Let c denote the probability of exchange of genetic material between any two non-sister strands in the tetrad at meiosis. Then

    P(i) = c/4, i in S \ {0};    P(0) = 1 - c    (1)

2.1. Probability distribution on S_i

Let us define the following functions on a = (a1, a2, a3, a4)', a_j = 0, 1:

    f^0(a) = (a1, a2, a3, a4)'
    f^1(a) = (a3, a2, a1, a4)'
    f^2(a) = (a1, a3, a2, a4)'
    f^3(a) = (a1, a4, a3, a2)'
    f^4(a) = (a4, a2, a3, a1)'

The function f^{ij}(a) = f^j(f^i(a)) corresponds to the events in S_i. Let S_i (= S x S) be the set whose elements denote crossover events, covering events up to double crossovers, between locus A_i and A_{i+1} (i = 1, 2, ..., l-1), where l = total number of loci being studied. Using equation (1), the probability distribution on S_i is given by:

    P({i,j}) = (c_i^2/16) I{i != 0, j != 0}
             + (c_i(1 - c_i)/4) {I{i = 0, j != 0} + I{i != 0, j = 0}}
             + (1 - c_i)^2 I{i = j = 0}    (2)

where {i,j} is an element of S_i. Let phi_k denote a unique crossover event on S' as described below:

    phi_k = i_1 x i_2 x ... x i_{l-1},  i_j in S_j,  phi_k in S' = Prod_{i=1}^{l-1} S_i    (3)

Let f_k denote a multi-locus genotype with l loci:

    f_k = i_1 x i_2 x ... x i_l,  i_j = 0, 1    (4)

The indices i_j = 1 and i_j = 0 indicate the paternal and maternal alleles, respectively. The progeny are obtained by crossover between homozygous parents. For a particular crossover phi_k we can generate a model tetrad at meiosis using the functions f^{ij}. The following matrix R_k of size 4 x l defines the simulated tetrad as follows:

    R_k = (R_0 R_1 ... R_{l-1}),  R_0 = (1 1 0 0)'    (5)

where R_i = f^{jk}(R_{i-1}) for all i = 1, ..., l-1, and the i-th genetic interval S_i observed the crossover event {j, k} in S_i.

The conditional distribution of a single spore f_i for a given phi_k is

    P(f_i | phi_k) = (1/4) Sum_{j=1}^{4} I{R_k(j, .) = f_i}    (6)

where R_k(j, .) is the j-th row of R_k. The marginal density of a single spore f_i is given by

    P(f_i) = Sum_k P(f_i | phi_k) x P_k = C x P    (7)

The observed data set can be represented as:

    D = {n_j, for all j = 1, ..., 2^l}

where n_j is the observed frequency of f_j.
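The strand-exchange model of equations (1)-(6) is compact enough to sketch directly. The following is an illustrative reconstruction, not the authors' code; the names are our own. It builds the tetrad matrix R_k column by column from per-interval events {i, j} and scores a spore genotype against the four strands:

```python
# Hypothetical sketch of the tetrad model (names are ours, not the paper's).
# f_i swaps the two non-sister chromatids taking part in exchange i; 0 means
# no exchange. Pairings: 1<->(1,3), 2<->(2,3), 3<->(2,4), 4<->(4,1),
# written 0-based below.
SWAPS = {0: None, 1: (0, 2), 2: (1, 2), 3: (1, 3), 4: (3, 0)}

def f(i, a):
    """Apply strand-exchange function f_i to a 4-vector a."""
    a = list(a)
    if SWAPS[i] is not None:
        p, q = SWAPS[i]
        a[p], a[q] = a[q], a[p]
    return a

def tetrad_matrix(crossovers):
    """Columns R_0 ... R_{l-1} of R_k, where R_0 = (1 1 0 0)' and
    R_m = f_j(f_i(R_{m-1})) for the event {i, j} in interval m."""
    cols = [[1, 1, 0, 0]]
    for i, j in crossovers:
        cols.append(f(j, f(i, cols[-1])))
    return cols

def spore_probability(genotype, crossovers):
    """P(f | phi_k): the fraction of the 4 strands (rows of R_k) that
    match the observed single-spore genotype, as in equation (6)."""
    strands = list(zip(*tetrad_matrix(crossovers)))
    return sum(tuple(genotype) == s for s in strands) / 4.0
```

For two loci with no exchange, the event (0,0), the strands are (1,1), (1,1), (0,0), (0,0): a parental spore has conditional probability 1/2 and a recombinant spore probability 0.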

Let Theta = (c_1, c_2, ..., c_{l-1})' denote the unknown parameter vector in the model. The log-likelihood of f viewed as a function of Theta is given by:

    ln L(Theta | D) = Sum_j n_j log ( Sum_k P(f_j | phi_k) P_k )    (10)

where C is the conditional probability matrix given by C = ((C_{jk})), with C_{jk} = P(f_j | phi_k) from equation (6), and

    P_k = P(phi_k) = Prod_{j=1}^{l-1} P(I_j = i_{kj})    (9)

where i_{k,j} in S_j and the probability distribution P(I_j = i_{kj}) is as defined in equation (2). The following two theorems solve equation (10) using a set of recurrence relations obtained via the Expectation-Maximization (EM) algorithm9. The proofs are not given in the interest of brevity.

Theorem 1. The EM-iterative equations are given below:

    n_k^{(j)} = n_j pi_{k|j},  pi_{k|j} = P(x_{kj} = 1 | f_j) = C_{jk} P_k / P_j,  P_j = Sum_k C_{jk} P_k    (8)

    c_m^{(t+1)} = (N_{1,m} + 2 N_{2,m}) / (2N)    (11)

where, with n_k = Sum_j n_k^{(j)} and N = Sum_j n_j,

    N_{0,m} = Sum_k n_k I{ i_{k,m} = (0, 0) }
    N_{1,m} = Sum_k n_k I{ i_{k,m} = (i1, i2) : i1 = 0 (strict) OR i2 = 0 }
    N_{2,m} = Sum_k n_k I{ i_{k,m} = (i1, i2) : i1 != 0 AND i2 != 0 }

Note that i_{k,m} denotes the event in S_m for the crossover phi_k.

Theorem 2. Let f = (f_1, f_2, f_3, f_4)' be the observed frequency vector corresponding to all possible meiotic products for parental genes M and O for two markers. The genotype vector for f is (MM MO OM OO)'. The maximum likelihood estimator10 of the exchange probability c under the model represented by equation (1) is unique and is given as follows:

(1) If f_1 + f_4 <= f_2 + f_3 then c_mle = 1.
(2) If f_1 + f_4 > f_2 + f_3 then c_mle is given by the unique solution (in the interval [0, 1]) of the following equation:

    f(c) = c^2 - 2c + D = 0,  where D = 2(f_2 + f_3)/N and N = Sum_{i=1}^{4} f_i    (12)

This theorem is used to obtain the starting values of c_m for the EM-iterative equations in Theorem 1.
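Theorem 2 admits a closed form: the root of c^2 - 2c + D in [0, 1] is 1 - sqrt(1 - D). A minimal sketch (our own helper, not from the paper):

```python
import math

def two_marker_mle(f1, f2, f3, f4):
    """Closed-form starting value for c, following Theorem 2 (illustrative).

    (f1, f2, f3, f4) are observed counts for genotypes (MM, MO, OM, OO);
    f2 + f3 are the recombinants. Solves c^2 - 2c + D = 0 with
    D = 2(f2 + f3)/N, taking the root in [0, 1]; returns 1 when
    recombinants are at least as frequent as parentals.
    """
    n = f1 + f2 + f3 + f4
    if f1 + f4 <= f2 + f3:
        return 1.0
    d = 2.0 * (f2 + f3) / n
    return 1.0 - math.sqrt(1.0 - d)  # unique root of c^2 - 2c + d in [0, 1]
```

With 20% recombinants the estimate is c = 1 - sqrt(0.6), consistent with the recombination fraction r = c - c^2/2 implied by the up-to-double-crossover model.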

3. THE STRAIGHTFORWARD ALGORITHM

In the pseudocode below, k denotes a particular crossover phi_k as defined in equation (3). The function getRMatrix implements equation (5) to create the R_k matrix corresponding to the crossover phi_k. The function kProb computes the marginal probability P_k due to crossover phi_k as defined in equation (9). The vector freq has the observed counts corresponding to the distinct genotypes (d_g), totalObs is the total sample size, and probFOld stores the marginal probability of each distinct genotype using equation (7). The conditional probability is computed with the help of the function countMatch, which implements equation (6) by counting the number of strands (out of 4) in R_k that match the observed genotype. The matrix R_k and the probability P_k in turn create the matrices C and P progressively during the course of the recursive loop to calculate the marginal probability of each observed distinct genotype as defined in equation (7). These marginal probabilities, along with the counts for the distinct genotypes, are used to compute the log-likelihood in equation (10).

The array pCount implements the EM algorithm via equation (11) by recategorizing the vector sum based on the crossover values along the chromosome. The first dimension of pCount is of magnitude 3, to account for N_{0,m}, N_{1,m} and N_{2,m} in equation (11), and the second dimension runs along m, accounting for the (l-1) genetic intervals. The function CellSpecial() maps the crossover values from 0 to 24 onto the events of the set S_i. In the pseudocode the vector sum computes conditional probabilities across all distinct genotypes, and the addition is a componentwise vector addition.

A particular crossover phi_k does not enter into the computation of the likelihood so long as it does not have positive probability for at least one distinct genotype. The elimination of such crossovers is achieved with the function kIsWorthy. In the recursive algorithm we propose in this paper, this feature is handled more efficiently: a large number of crossovers are eliminated by performing checks on a few.

To compute n_k^{(j)} in equation (11) we need to add up the inverse probabilities pi_{k|j} across the distinct genotypes, but as their marginal probabilities are not computed yet it is not possible to do so. Note that the denominator of pi_{k|j}, the marginal probability due to the j-th distinct genotype, is left out of the pi_{k|j} computation, as that requires going through all the crossovers and is currently being progressively computed by probFOld. We work around this problem by adding another dimension, along the number of distinct genotypes, to the structure pCount. Once all the crossovers are processed and the marginal probabilities computed, the elements in the third dimension are divided by their corresponding marginal probabilities and then added up across the dimension. This gives us the two dimensional structure postCount (not shown in the pseudocode) containing the values of N_{0,m}, N_{1,m} and N_{2,m} for all the genetic intervals. Then the new value of c_j for each genetic interval is computed using equation (11), and the process iterates until convergence.

Despite being a recursive algorithm (crossovers are generated recursively), it suffers from the computational bottleneck of processing a huge number of crossovers (25^(l-1) for l loci), which implies that at least some amount of computation cannot be avoided for each crossover. This problem is overcome using the proposed recursive linking algorithm. For brevity we show the pseudocode of only the most important part of the straightforward algorithm.

{Pseudocode of the Straightforward Algorithm}
loop {This recursive loop dynamically generates l-1 FOR loops, one per genetic interval; k denotes the current crossover phi_k}
    r = getRMatrix(k)
    if kIsWorthy() == 1 then
        prob = kProb(cProbOld, k)
        for i = 0; i < dg; i++ do
            sum[i] = countMatch() * prob * freq[i]/(4.0 * totalObs)
            probFOld[i] += countMatch()/4.0 * prob
        end for
        for j = 0; j < loci-1; j++ do
            pCount[CellSpecial(k[j])][j] += sum
        end for
    end if
end loop
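The exhaustive loop can be sketched end to end for a toy number of loci. The following is our own illustrative code, not the authors' implementation; for simplicity it uses a single shared exchange probability c for all intervals. It enumerates all 25^(l-1) crossovers, weights each by P_k as in equation (9), and accumulates the marginal spore probabilities of equation (7):

```python
from itertools import product

# Strand-exchange bookkeeping as in the model: 0 = no exchange, 1..4 swap a
# non-sister chromatid pair (0-based index pairs below).
SWAPS = {0: None, 1: (0, 2), 2: (1, 2), 3: (1, 3), 4: (3, 0)}

def apply_exchange(i, a):
    a = list(a)
    if SWAPS[i]:
        p, q = SWAPS[i]
        a[p], a[q] = a[q], a[p]
    return a

def event_prob(i, j, c):
    # Equation (2): up to a double crossover per interval.
    if i == 0 and j == 0:
        return (1 - c) ** 2
    if i == 0 or j == 0:
        return c * (1 - c) / 4.0
    return c * c / 16.0

def marginal_spore_probs(l, c):
    """Return {genotype: P(f)} for all 2^l single-spore genotypes by brute
    force over the 25^(l-1) crossovers (feasible only for tiny l)."""
    probs = {g: 0.0 for g in product((0, 1), repeat=l)}
    events = list(product(range(5), repeat=2))      # 25 events per interval
    for phi in product(events, repeat=l - 1):       # 25^(l-1) crossovers
        p_k = 1.0
        cols = [[1, 1, 0, 0]]                       # R_0
        for i, j in phi:
            p_k *= event_prob(i, j, c)
            cols.append(apply_exchange(j, apply_exchange(i, cols[-1])))
        for strand in zip(*cols):                   # 4 spores per tetrad
            probs[strand] += 0.25 * p_k             # equation (7)
    return probs

probs = marginal_spore_probs(3, 0.1)   # l = 3 already costs 25^2 iterations
```

The marginals sum to one, and the parental genotypes dominate the recombinants for small c, which is the sanity check the authors' likelihood also satisfies.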

4. THE PROPOSED RECURSIVE LINKING ALGORITHM

Let the entire order of genetic markers be broken into equal segments of width h, such that all the intervals are covered. For l genetic markers the number of segments s is given by the following equation:

    s = (l - 1)/(h - 1)    (13)

Each crossover in the first segment branches out to 25^(h-1) crossovers in the following segment and creates 25^{2(h-1)} combined crossovers. This continues till the last segment is accounted for.

The first segment has an associated array called kArrayFirst that has all the crossovers for the segment. The array cArrayFirst checks, for each particular crossover and each observed genotype of the first segment, which strands of the simulated tetrad (obtained by R, based on the model described in equation (1)) match the genotype. The matching status forms the last dimension of the array, with length 4, and consists of the symbols 1 and 0 indicating a match (1) and a mismatch (0) respectively. For example, a matching status 1 0 0 1 for the first distinct genotype corresponding to crossover pattern 0 0 2 3 4 in the first segment indicates that, among the 4 strands of the tetrad in meiosis generated by the crossover pattern 0 0 2 3 4 in the first segment, the observed genotype in question was found only on the first and the fourth strand. The array linkInfoFirst stores the last column generated by the R matrix for each crossover of the segment. Note that R depends on crossover values on all intervals of the segment, and its columns are sequentially dependent on each other with a lag of one. In order to move along the segments following the model described in equation (1), we need to know the last column (the tetrad pattern at the last locus of the segment) generated by the R matrix of the linked crossover of the previous segment corresponding to each combined crossover of the two segments.

The following lemma states that only certain tetrad patterns are possible at the end locus of the adjoining segments; we create arrays similar to kArray, linkInfo and cArray for all the following segments except the last one and call them kArrayTemp[], linkInfoTemp[] and cArrayTemp[] respectively.

Lemma 4.1. Under the model described by equation (1), at any particular locus only one of the tetrad patterns 1 1 0 0, 1 0 1 0, 1 0 0 1, 0 1 1 0, 0 1 0 1 and 0 0 1 1 can occur.

Consider a combined crossover for all the segments. To compute equation (6), i.e. to count the matches for the entire crossover, we have to examine the matching statuses of all the segments and update them. To be considered a match for the whole segment at a particular position (out of 4 possible positions) one must have a match for all the segments at that position; only a match at the same tetrad position will ensure a match for the combined segment. Hence when the matching statuses of two segments are updated, the resulting matching status is 1 at a position if and only if both segments have 1 at that position, and 0 otherwise. This updated matching status is termed a spore in this paper. The distinction between a spore and a matching status is that while the matching status is the original status of the segment, the spore is the matching status obtained after updating the matching statuses of all the previous segments. The following lemma restricts the number of possible spore patterns.

Lemma 4.2. Under the model described by equation (1), for any crossover on any observed genotype the spore patterns 0 1 1 1, 1 0 1 1, 1 1 0 1, 1 1 1 0 and 1 1 1 1 are not possible.

Analogous to the function kIsWorthy in the straightforward algorithm, we implement the concept of counting active crossovers for each segment. We call a crossover of a particular segment active if it has positive probability for at least one distinct genotype. Note that active crossovers will be different across segments, as an active crossover depends both on the observed genotypes (which vary across segments) and on the tetrad pattern used in the generation of the R matrix. It is important to emphasize the huge computational gain achieved by eliminating crossovers on a segment-wise basis in the proposed algorithm, compared to the straightforward algorithm, which
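The linking step combines matching statuses positionwise: a strand matches the combined segment only if it matches both parts, which is a bitwise AND. Because each column of R carries exactly two 1s and two 0s, at most two strands can ever match a genotype, which is why only 16 - 5 = 11 spore patterns survive (the five patterns with three or four matches are excluded). A small sketch with our own names:

```python
from itertools import product

def update_status(left, right):
    """Positionwise AND of two segments' matching statuses: a strand
    matches the combined segment only if it matches both segments."""
    return tuple(a & b for a, b in zip(left, right))

# Example in the spirit of the text: spore 1 0 0 1 from the earlier
# segments combined with a segment whose matching status is 1 1 0 0
# leaves only the first strand matching.
combined = update_status((1, 0, 0, 1), (1, 1, 0, 0))

# Spore patterns with more than two matches never occur, so only
# 11 patterns need to be stored per spore dimension.
possible_spores = [s for s in product((0, 1), repeat=4) if sum(s) <= 2]
```

This is why the recursive-structure arrays below carry a spore dimension of exactly 11 (and a tetrad-pattern dimension of 6), which keeps storage small.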

eliminates crossovers one at a time. A single elimination of a crossover in the first segment has the effect of 25^{l-2} eliminations of crossovers in the straightforward algorithm. This very feature shows how we geometrically increase the information base (in terms of crossovers) of the "table look-up" procedure and avoid traversing all the crossovers one at a time.

Next we implement equation (7) for the last segment for all possible spores, denoted by the array tempSum. Corresponding to all possible spores (the dimension is restricted by Lemma 4.2) the matching status of the last segment is updated and then summed (along its dimension) to compute equation (8) for the last segment, except that the matching status is computed for all possible tetrad patterns of Lemma 4.1. To update Theta using equation (11) we use another array called tempPCount, which computes N_{0,m}, N_{1,m} and N_{2,m} via index manipulation as shown in the pseudocode. For the last segment we create an array called tempSumIndex. Note that Lemma 4.1 and Lemma 4.2 restrict the array sizes and ensure storage economy.

The arrays tempSum and tempPCount taken together are termed a recursive structure for the last segment. This structure has the property that its 25^{(h-1)} crossovers have already been processed in a form such that equation (11) can be implemented, and it can handle any spore generated from the previous segment. Note that when a crossover from the previous segment (allowing for all possible spore patterns of its previous segment) is processed, the recursive structure identifies and uses an appropriate spore from the last segment and thus processes simultaneously 25^{2(h-1)} combined crossovers of these two segments. After all the crossovers from the previous segment are processed, the proposed algorithm generates a recursive structure that is exactly the same as that of the last segment, without adding to the storage requirement. The recursive structure for the last-but-one segment now has 25^{2(h-1)} crossovers processed within it, with provision for all possible spores from the previous segment. One of the reasons this procedure works is that, if we look into the combined crossovers of two segments, the crossover values for the left segment change only once for every 25^{(h-1)} combined crossovers. This lets us delay the probability value updates when linking the segments. The process is best understood by looking at the pseudocode below and noticing how the arrays tempSum and tempPCount are updated in the recursive linking process.

{Pseudocode for the recursive structure}
tempSum[6][dg][11]
tempPCount[6][dg][11][3][loci-1]
loop {i1, i2, i3; 0 <= i4 < activeKLast[i1]}
    tempDouble = kProb(lastCProb, kArrayLast[i1][i4])
    condProb = tempSumIndex[i1][i2][i3][i4]/4.0
    tempSum[i1][i2][i3] += condProb x tempDouble
    for j = loci-h; j < loci-1; j++ do
        tempPCount[i1][i2][i3][CellSpecial(kArrayLast[i1][i4][j-loci+h])][j] += condProb x tempDouble
    end for
end loop

Recursive Linking:
temp1Sum = tempSum
temp1PCount = tempPCount
int startPos = 0
int endPos = loci - h
for i = 0; i < segments-2; i++ do
    startPos = endPos - h + 2
    firstCProb <- cProbOld from startPos to endPos
    spores[11][4] <- generate spores
    for j0 = 0; j0 < 6; j0++ do
        for j1 = 0; j1 < dg; j1++ do
            for j2 = 0; j2 < 11; j2++ do
                for jk = 0; jk < activeKTemp[i][j0]; jk++ do
                    k = kArrayTemp[i][j0][jk]
                    prob = kProb(firstCProb, k)
                    int j01 = linkInfoTemp[i][j0][jk]
                    int spike = SporeMatch(UpdateScore(cArrayTemp[i][j0][jk][j1], spores[j2]))
                    temp2Sum[j0][j1][j2] += temp1Sum[j01][j1][spike] x prob
                    for m = startPos-1; m < endPos; m++ do
                        temp2PCount[j0][j1][j2][CellSpecial(k[m-startPos+1])][m] += temp1Sum[j01][j1][spike] x prob
                    end for

                    for n = 0; n < 3; n++ do
                        for index = endPos; index < loci-1; index++ do
                            temp2PCount[j0][j1][j2][n][index] += temp1PCount[j01][j1][spike][n][index] x prob
                        end for
                    end for
                end for
            end for
        end for
    end for
    temp1Sum = temp2Sum
    temp1PCount = temp2PCount
    endPos = startPos - 1
end for

Linking with the first segment is the last step of the likelihood computation process. In this phase we do not have any previous matching status vectors, and instead of tempPCount we have the structure pCount as in the straightforward algorithm; hence we do not mention the details.

5. TIME COMPLEXITY COMPARISON OF THE ALGORITHMS

In the analysis of time complexity of the algorithms11 the running variable is l, the number of genetic markers. In both the straightforward and the proposed algorithm we provide run time complexity analysis for the main computationally intensive phase, namely processing all the crossovers for a single iteration. The core function of the algorithm is to process a large number of crossovers in real time. This phase would be required even if one wanted to compute just the likelihood for the initial probabilities and not use the subsequent EM iterations.

The loop in the straightforward algorithm runs for 25^(l-1) iterations. In each iteration the computation time of matrix R is O(l), since R has l columns and the computation time for each column is fixed. Since the observed genotypes are not known in advance, we do a worst-case analysis for the computation of the vector sum. The worst case occurs when there is a match between an observed genotype and any of the 4 strands of the matrix R, resulting in run time complexity of order O(l). The vector sum has d_g elements, which does not vary with l. The vector addition involved in computing pCount entails d_g elements over the (l-1) genetic intervals and thus accounts for run time complexity of O(l). Hence the total run time complexity of the straightforward algorithm is

    R_st = O(l 25^(l-1))    (14)

In the recursive linking algorithm the run time for each of the s segments is O((h-1) 25^(h-1)). So the total run time for the recursive algorithm is

    R_rl = s O((h-1) 25^(h-1))    (15)

which is minimized at h = 3 for any odd number of loci l. Hence for h = 3 the run time complexity of the recursive linking algorithm is of order O(l), using equation (13). In the interest of clarity of description of the algorithm we have assumed all the segments to be of equal length; hence both even and odd numbers of markers cannot be handled without first changing the length of at least one segment, preferably that of the first one. That entails a trivial modification of the algorithm.

6. RUN TIME RESULTS

We ran several jobs on a DELL PC (Model DM051, Pentium(R) 4 CPU 3.40GHz, 4GB of RAM) for different numbers of genetic markers in their natural order of precedence, on a data set from linkage group-I of Neurospora crassa4, for both algorithms. The run time corresponds to the average time a single EM iteration takes before convergence, over multiple starts. It was verified that both algorithms provide the same likelihood for the same order of markers, as they essentially solve the same problem in different ways. The resulting speedup is clearly evident from the table below.
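The speedup is easy to see numerically. A back-of-the-envelope sketch of the two operation counts (our own helper functions, using the worst-case expressions above with s = (l-1)/(h-1)):

```python
def straightforward_ops(l):
    """Order-of-magnitude work of the straightforward algorithm:
    25^(l-1) crossovers, each touching an l-column R matrix."""
    return l * 25 ** (l - 1)

def recursive_ops(l, h=3):
    """Recursive linking: s = (l-1)/(h-1) segments, each processing
    on the order of (h-1) * 25^(h-1) crossovers."""
    s = (l - 1) // (h - 1)
    return s * (h - 1) * 25 ** (h - 1)
```

At l = 9 the straightforward count is already above 10^12 column operations, while recursive linking with h = 3 needs only 4 x 2 x 625 = 5000, and the latter grows linearly in l, matching the run times reported in Table 1.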

Table 1. Run time (in seconds) comparison of the straightforward and recursive linking algorithms.

    Loci (l)    Straightforward Algorithm    Recursive Linking Algorithm
    5           2.073                        0.03
    7           1336.298                     0.45
    9           > 43200.0                    0.49
    21          oo                           5.66
    41          oo                           8.93
    61          oo                           13.61

7. CONCLUSIONS

Assuming no interference, a multi-locus genetic likelihood was implemented based on a mathematical model of the recombination process in meiosis that accounted for events up to double crossovers in the genetic interval for any specified order of genetic markers. The mathematical model was realized with a straightforward algorithm that implemented the likelihood computation process. The time complexity of the straightforward algorithm was exponential without bound in the number of genetic markers, and implementation of the model for more than 7 genetic markers turned out to be not feasible, motivating the need for a novel algorithm. The proposed recursive linking algorithm decomposed the pool of genetic markers into segments and rendered the model implementable for a large number of genetic markers. The recursive algorithm has been shown to reduce the order of time complexity from exponential to linear. The improvement in time complexity has been shown theoretically by a worst-case analysis of the algorithm and supported by run time results using data on linkage group-I of the fungal genome Neurospora crassa.

ACKNOWLEDGEMENTS

The research was supported in part by a grant from the U.S. Dept. of Agriculture under the NRI Competitive Grants Program (Award No: GEO-2002-03590) to Drs. Bhandarkar and Arnold. We thank the College of Agricultural and Environmental Sciences, University of Georgia, for their support.

References
1. Lander ES, Schork NJ. Genetic dissection of complex traits. Science 1994; 265: 2037-2048.
2. Mester D, Romin Y, Minkov D, Nevo E, Korol A. Constructing large-scale genetic maps using an evolutionary strategy algorithm. Genetics 2003; 165: 2269-2282.
3. Raju NB. Meiosis and ascospore genesis in Neurospora. Eur. J. Cell Biol. 1980; 23: 208-223.
4. Nelson MA, Crawford ME, Natvig DO. Restriction polymorphism maps of Neurospora crassa: 1998 update. http://www.fgsc.net/fgn45/45rflp.html 1998.
5. Doerge RW, Zeng ZB, Weir BS. Statistical issues in the search for genes affecting quantitative traits in experimental populations. Statistical Science 1997; 12: 195-219.
6. Barratt RW, Newmeyer D, Perkins DD, Garnjobst L. Map construction in Neurospora crassa. Advances in Genetics 1954; 6: 1-93.
7. Lander ES, Green P. Construction of multi-locus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 1987; 84: 2363-2367.
8. Cuticchia AJ, Arnold J, Timberlake WE. The use of simulated annealing in chromosome reconstruction experiments based on binary scoring. Genetics 1992; 132: 591-601.
9. Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977; 39(1): 1-38.
10. Rao CR. Linear Statistical Inference and Its Applications. 2nd ed. Wiley-Interscience, 2002.
11. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. 2nd ed. The MIT Press, 2001.

OPTIMAL IMPERFECT PHYLOGENY RECONSTRUCTION AND HAPLOTYPING (IPPH)

Srinath Sridhar
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Email: srinath@cs.cmu.edu

Guy E. Blelloch
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Email: guyb@cs.cmu.edu

R. Ravi
Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Email: ravi@cmu.edu

Russell Schwartz
Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Email: russells@andrew.cmu.edu

The production of large quantities of diploid genotype data has created a need for computational methods for large-scale inference of haplotypes from genotypes. One promising approach to the problem has been to infer possible phylogenies explaining the observed genotypes in terms of putative descendants of some common ancestral haplotype. The first attempts at this problem proceeded on the restrictive assumption that observed sequences could be explained by a perfect phylogeny, in which each variant locus is presumed to have mutated exactly once over the sampled population's history. Recently, the perfect phylogeny model was relaxed and the problem of reconstructing an imperfect phylogeny (IPPH) from genotype data was considered. A polynomial time algorithm was developed for the case when a single site is allowed to mutate twice, but the general problem remained open. In this work, we solve the general IPPH problem and show for the first time that it is possible to infer optimal q-near-perfect phylogenies from diploid genotype data in polynomial time for any constant q, where q is the number of "extra" mutations required in the phylogeny beyond what would be present in a perfect phylogeny. This work has application to the haplotype phasing problem as well as to various related problems in phylogenetic inference. Empirical studies on human data of known phase show this method to be competitive with the leading phasing methods and provide strong support for the value of continued research into algorithms for general phylogeny construction from diploid data.

1. INTRODUCTION

Sophisticated computational methods for data refinement and interpretation have become a core component of modern studies in human genetics. Computational methods have long been central to the study of phylogenetics, or evolutionary tree building, particularly the parsimony variants best suited to inferences of relationships over short time scales. The more specialized problem of haplotype reconstruction has also benefited tremendously from contributions from the fields of discrete algorithms and combinatorial optimization. In haplotype reconstruction, also called phasing, one seeks to separate the allelic contributions of the two chromosomes observed together in a diploid genotype assay. If we use the symbols 0 and 1 to represent homozygous alleles and 2 to represent heterozygous alleles, then '0221' is a genotype typed on four loci. The two pairs of haplotypes 0001, 0111 and 0011, 0101 are both consistent with this genotype, and the goal of phasing is to correctly infer the true haplotypes given the genotypes. The problem has relevance to basic research into population histories as well as to applied problems such as statistically linking haplotypes to disease phenotypes, analysis of sequence variability in populations, and association study design.
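The consistency condition behind the '0221' example is mechanical to check. A minimal sketch with our own helper (not from the paper): a haplotype pair explains a genotype exactly when the two haplotypes agree with every homozygous site and differ at every heterozygous site.

```python
def explains(h1, h2, genotype):
    """Check whether the haplotype pair (h1, h2) is consistent with a
    genotype in the 0/1 homozygous, 2 heterozygous encoding (strings)."""
    for a, b, g in zip(h1, h2, genotype):
        if g == "2":
            if a == b:            # heterozygous site needs differing alleles
                return False
        elif not (a == b == g):   # homozygous site fixes both alleles
            return False
    return True
```

Both phasings named in the text pass, which is exactly the ambiguity that phasing algorithms must resolve.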

and q is the imperfectness of the solution. In such an approach. 14. which can assemble hybrid chromosomes in which different segments have different phylogenies. This problem is referred to as Perfect Phylogeny Haplotyping (PPH). a new avenue towards haplotype reconstruction was developed based on phylogenetic methods. although still poor scaling in problem size. one seeks to identify the ancestral history that could have produced a set of haplotype sequences from a common ancestor such that the inferred haplotypes could give rise to the observed genotypes. meaning that each character mutates only once from the common ancestor over the entire phylogeny. under a practical assumption on the input data 3 3 . 21.200 The field of computational haplotype reconstruction began with a fast and simple iterative method based on the idea that a haplotype that is observed in one individual is likely to be found in other individuals as well8. Imperfect phylogeny haplotyping has proven to be very fast and competitive with the best prior methods in accuracy. . Our Contributions: In this work. Recently. But the general IPPH problem remained open. Our algorithm reconstructs a q-near-perfect phylogeny in time nm°^ where m is the number of characters (variant sites typed). Heuristic methods have allowed some tolerance to recurrent mutations (for example. We find that our method is efficient and more accurate on blocks with small q. 33 that solves the 1-near-perfect phylogeny haplotyping problem relied on an empirically but not theoretically supported assumption that an embedded perfect phylogeny problem will have a unique solution. Our approach builds on both the prior theory on phylogeny construction from diploid data 13. The advent of large-scale genotyping created a need for new methods designed to scale to chromosome-sized data sets. 
Data inconsistent with perfect phylogenies can arise from multiple mutations of a base over the history of a species, or through the processes of recombination or gene conversion. Approaches that model recombination directly have all been intractable in theory and practice; it appears for the moment intractable to infer recombinational histories in non-trivial cases. Gusfield^{19} showed that it is possible to directly and efficiently infer phylogenies that could explain an observed set of diploid genotypes, provided the phylogenies are perfect. Several subsequent results simplified and improved on this original method^{2,3,12,13}. The work produced a fast, practical method for large-scale haplotyping, in which one breaks a large sequence into blocks consistent with perfect phylogenies and uses the phylogenies to phase those sequences. The perfect phylogeny assumption is quite restrictive, though, and several approaches have been taken to adapt the perfect phylogeny method to more realistic models of molecular evolution. Heuristic methods have allowed some tolerance to recurrent mutations (for example, the work by Halperin and Eskin^{23}), resulting in the generation of imperfect phylogenies. Various statistically motivated algorithms based on heuristic optimization techniques, such as expectation maximization^{28} and Markov chain Monte Carlo methods^{35}, have since been developed, and several combinatorial optimization methods have also been formulated based on Clark's original method^{20}. Such algorithms provide significantly improved robustness and accuracy. Some recent work has been directed at provably optimal methods for imperfect phylogeny haplotyping (IPPH), defined below, resulting in a polynomial-time algorithm for the case when a single site is allowed to mutate twice (or a single recombination is present)^{31}.

Our Contributions: In this work, we solve the general IPPH problem and show for the first time that it is possible to infer imperfect phylogenies with any constant number q of recurrent mutations in polynomial time in the number of sequences and number of sites typed. We also relax the prior method's assumption that an embedded perfect phylogeny problem will have a unique solution: we allow a polynomial number of such solutions and show that the relaxed assumption holds with high probability if the population obeys a common assumption of population genetics theory called Hardy-Weinberg equilibrium.
We demonstrate and validate our algorithm by comparing with leading phasing methods using a collection of large-scale genotype data of known phase^{29}. We thus provide a theoretical basis for Song et al.'s empirical observation, and we further provide strong empirical support for the value of continuing research into accelerated algorithms for phylogeny construction from diploid data.

2. PRELIMINARIES

We begin by introducing definitions for parsimony-based phylogeny reconstruction. In this section, we wish to study the relationship between a set of n taxa, each of which is defined over a set of m binary characters. Input I will be represented by a matrix, where the row set R(I) represents taxa and the column set C(I) represents characters. In a haplotype matrix all taxa r_i ∈ {0,1}^m, and in a genotype matrix r_i ∈ {0,1,2}^m. We assume that the input matrix has no duplicate rows, since such rows do not change the optimal solution, and that both states 0 and 1 are present in all characters.

Definition 2.1. A phylogeny for an n x m binary input matrix I is a tree T(V, E) together with a label function l : V(T) -> {0,1}^m such that R(I) is contained in l(V(T)) and, for all (u, v) ∈ E(T), d(l(u), l(v)) = 1, where d is the Hamming distance. That is, every input taxon appears in T and the Hamming distance between adjacent vertices is 1. For simplicity, we will drop the label function l(v) and use v to refer to both a vertex and the taxon it represents.

We define the following terms for a phylogeny T on input I:
• character mu(e) represents the mutation on edge e = (u, v), i.e. l(u)[mu(e)] != l(v)[mu(e)]; we say that a character i mutates on edge e if mu(e) = i;
• a vertex v ∈ V(T) is terminal if l(v) ∈ R(I) and Steiner otherwise;
• length(T) = |E(T)|;
• penalty(T) = length(T) - m;
• phylogeny T is optimal if length(T) is minimized;
• phylogeny T is q-near-perfect if penalty(T) = q and perfect if penalty(T) = 0.

Since both states 0 and 1 are present in all characters, the length of an optimum phylogeny is at least m. This provides a natural motivation for the penalty of a phylogeny as defined above.

Definition 2.2. For a set of binary-state taxa S and a set of characters C, the set of gametes GAM(S, C) is the projection of S on the characters C. In other words, (x_1, ..., x_|C|) ∈ GAM(S, C) if and only if there exists s ∈ S with s[c_i] = x_i for all c_i ∈ C. A set of characters C shares k gametes if |GAM(S, C)| = k.

Definition 2.3. A pair of characters i, j conflict if |GAM(S, {i, j})| = 4, i.e. the projection on i, j contains all four gametes (0,0), (0,1), (1,0) and (1,1).

The IPPH problem: The input to the problem is an n x m matrix G, where each row g_i ∈ {0,1,2}^m represents a (genotype) taxon and each column represents a character. The output is a 2n x m matrix H in which each row h_i ∈ {0,1}^m represents a haplotype. Corresponding to every taxon g_i ∈ R(G), there are two taxa h_{2i-1}, h_{2i} ∈ R(H) with the following properties:
• if g_i[c] != 2 then h_{2i-1}[c] = h_{2i}[c] = g_i[c];
• if g_i[c] = 2 then h_{2i-1}[c] != h_{2i}[c].
The objective of the IPPH problem is to find an output matrix H such that the length of the optimum phylogeny on H is minimized. This problem is clearly NP-hard, since if matrix G contains no 2s then it is equivalent to reconstructing the most parsimonious phylogenetic tree^{15}. We therefore consider the following parameterized version: given matrix G and parameter q, return a matrix H such that there exists an optimal phylogeny T* on H with penalty(T*) <= q, under the assumption that such a matrix H exists (since otherwise the input does not have a solution to the IPPH problem). Note that the PPH problem is the restriction of IPPH to q = 0, and that such a matrix H can have at most q + 1 more distinct taxa than characters, since an optimal phylogeny with penalty at most q has at most m + q + 1 vertices.

3. ALGORITHM

At a high level, the algorithm guesses the characters that mutate more than once and removes them from the input matrix. It then solves the perfect phylogeny haplotyping problem on the remainder of the matrix. Finally, it adds the removed characters back and performs haplotyping on the full matrix.

Pre-processing: We perform a well-established pre-processing step that ensures that, for any optimal output matrix H, (0,0) ∈ GAM(H, {i, j}) for all i, j ∈ C(H) (see the work of Eskin et al.^{13}).
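Definitions 2.2 and 2.3 translate directly into code. The sketch below (helper names are ours) also includes the classical four-gamete feasibility check: for binary matrices satisfying the pre-processing condition above, a perfect phylogeny exists if and only if no pair of characters conflicts.

```python
from itertools import combinations

def gametes(S, cols):
    """GAM(S, C): the set of distinct projections of taxa in S onto characters C."""
    return {tuple(row[c] for c in cols) for row in S}

def conflict(S, i, j):
    """Characters i and j conflict iff they share all four gametes (Def. 2.3)."""
    return len(gametes(S, (i, j))) == 4

def has_perfect_phylogeny(S):
    """Four-gamete test: no conflicting pair of characters may exist."""
    m = len(S[0])
    return not any(conflict(S, i, j) for i, j in combinations(range(m), 2))

# A small haplotype matrix over 3 characters.
H = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
print(conflict(H, 0, 1))         # True: columns 0,1 induce 00, 01, 10 and 11
print(has_perfect_phylogeny(H))  # False: the conflicting pair rules it out
```

The pairwise test costs O(n m^2), which is why the linear-time PPH algorithms cited in the text are a meaningful improvement for chromosome-scale data.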
Following Song et al.^{33}, we will use the same assumption on the input data: for any subset of characters of the input matrix, if a solution to perfect phylogeny haplotyping (PPH) exists, then it is unique.
In this respect, our algorithm has the same spirit as the algorithm of Song et al.^{33}.

Throughout the paper, we fix an arbitrary optimal phylogeny T*. Let t(T*) be the set of all taxa of T* (Steiner vertices included), and let Q be the set of characters that mutate more than once in T*. Since |Q| <= q, we can find Q by brute force in time O(sum_{i <= q} C(m, i)) = O(q m^q). After removing the characters Q, we use a prior method to solve the perfect phylogeny haplotyping (PPH) problem on the resulting matrix M in time O(nm)^{12}. The solution matrix M' contains 2n taxa and m - |Q| characters. We can now add the characters Q back to M' to obtain H'; note that matrix H' contains state 2 only on the characters Q (see Fig. 3(b) for an example). In Section 4, we show that the number of PPH solutions is polynomial with high probability.

Definition 3.1. A pair of taxa h'_{2i-1}, h'_{2i} ∈ R(H'), corresponding to a genotype g_i ∈ R(G), is defined as a couple. For any c ∈ C(H'), a couple is a (2,c)-couple if h'_{2i-1}[c] = 2 (= h'_{2i}[c]) or h'_{2i-1}[c] != h'_{2i}[c].

function solveIPPH(input matrix G)
(1) guess Q = {i ∈ C(G) | there exist e1, e2 ∈ E(T*), e1 != e2, mu(e1) = mu(e2) = i}
(2) let M be matrix G restricted to characters C(G) \ Q
(3) let M' be the unique solution to perfect phylogeny haplotyping of M
(4) let H' be matrix M' defined on characters C(G), s.t. for all h'_{2i-1}, h'_{2i} ∈ R(H') and corresponding taxa g_i ∈ R(G), for all c ∈ Q: h'_{2i-1}[c] = h'_{2i}[c] = g_i[c]
(5) guess K = {i ∈ C(G) | there exists j ∈ C(G), |GAM(t(T*), {i, j})| = 4}
(6) guess GAM(t(T*), K)
(7) loop for c ∈ C(H') \ K
    (a) H' := processMatrix(H', c, K)
(8) for all couples h'_{2i-1}, h'_{2i} s.t. for some c ∈ Q, h'_{2i-1}[c] = 2 (= h'_{2i}[c])
    (a) resolve state 2 on h'_{2i-1}, h'_{2i} s.t. GAM({h'_{2i-1}, h'_{2i}}, K) is contained in GAM(t(T*), K)

function processMatrix(matrix H', c, K)
(1) initialize the set A := {c}
(2) while |A| > 0 do
    (a) extract a character c~ from A
    (b) for all (2, c~)-couples h'_{2i-1}, h'_{2i} in H'
        i. for all c ∈ Q with h'_{2i-1}[c] = 2 (= h'_{2i}[c]):
           A. if |IND(H', {c~, c})| = 3 then G(c~, c) := IND(H', {c~, c}), else guess three gametes G(c~, c)
           B. resolve state 2 in c on h'_{2i-1}, h'_{2i} based on G(c~, c)
           C. A := A ∪ {c' not in K | h'_{2i-1}[c'] != h'_{2i}[c']}, i.e. add to A the characters c' not in K for which the current couple is also a (2, c')-couple
    (c) if there exists c ∈ Q s.t. c~, c conflict, or the gametes of the set H_2 of all (2, c~)-couples fall outside G(c~, c), return no solution
    (d) remove c~ from C(H')
(3) return H'

Fig. 1. Algorithm to solve IPPH.
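The couple and (2,c)-couple notions can be made concrete with a small helper. This is a sketch under our own storage convention (H' as a list of rows over {0,1,2}, consecutive rows forming a couple); the names are ours:

```python
def couples(H):
    """Pair consecutive rows of H into couples (h_{2i-1}, h_{2i})."""
    return [(H[k], H[k + 1]) for k in range(0, len(H), 2)]

def is_2c_couple(couple, c):
    """A couple is a (2,c)-couple if both rows carry state 2 at c,
    or if the two rows disagree at c (Definition 3.1)."""
    r1, r2 = couple
    return (r1[c] == 2 and r2[c] == 2) or r1[c] != r2[c]

H = [[0, 2, 2], [0, 2, 2],
     [0, 1, 0], [0, 0, 0]]
print([i for i, cp in enumerate(couples(H)) if is_2c_couple(cp, 1)])  # [0, 1]
```

In the first couple character 1 is still unresolved (both rows carry 2); in the second the rows disagree at character 1, recording a heterozygous site already phased by the PPH step.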
Once Q is guessed, we remove these characters to obtain matrix M with character set C(G) \ Q.

A pair of characters can share four gametes in H only if both are in K. The goal now is to convert the {0,1,2} matrix H' into a {0,1} matrix H such that the following correctness conditions are satisfied:
(1) for every (2,c)-couple in H', one of the two taxa contains state 1 and the other state 0 on character c in H;
(2) GAM(H, K) is contained in GAM(t(T*), K), i.e. the set of gametes on K in H is a subset of the set of gametes on K in t(T*);
(3) |GAM(H, {c, c'})| = 4 implies c, c' ∈ K.
Here K = {i ∈ C(G) | there exists j ∈ C(G), |GAM(t(T*), {i, j})| = 4} is the set of characters that conflict with some other character in t(T*); note that Q is contained in K. The algorithm to solve IPPH is summarized in Figure 1; we now go into the details of the steps. The following lemma shows that Steps 5 and 6 of function solveIPPH can be implemented efficiently.

Lemma 3.1. Sets K and GAM(t(T*), K) can be found in time m^{2q} q^{O(q)} + O(nm).
Proof. We could identify K by brute force in time O(m^{|K|}); however, since we do not know |K|, in the worst case this could take time O(2^m). We can do better by performing the enumeration over phylogenies, as illustrated in Figure 2. First we construct the unique perfect phylogeny T for the matrix H' restricted to C(G) \ Q, which contains m - |Q| + 1 vertices, in time O(nm)^{12}. We then obtain T* by adding the edges e with mu(e) ∈ Q. There are m - |Q| + 1 locations at which to add an edge that mutates a character in Q, so all possible edge assignments can be enumerated in time O((m - |Q| + 1)^{|Q|+q}) = O(m^{2q}). Each enumeration assigns a set of edges (a multiset on characters) Q_v to each vertex v of T. We next enumerate all possible rooted trees T_v with edge labels in Q_v, for all vertices v ∈ T, in time O((|Q| + q)^{|Q|+q}) = O((2q)^{2q}), which can be improved to O(q^q) (since this is equivalent to enumerating all tree structures with edges mutating Q). For every root of T_v, we guess the states on all Q characters in time O(2^q). Since the mutations in C(G) \ Q are already fixed by the perfect phylogeny, and since we guessed the states at the root of each T_v, the states on all characters are known for every vertex in every T_v; for any two roots of T_v, T_v', we know the states in all the characters for every vertex on the path connecting them as well. We can now find K, since for all i ∈ K \ Q there exist j ∈ Q and edges e1, e3 ∈ T* with mu(e1) = mu(e3) = j such that an edge e2 with mu(e2) = i lies on the path connecting e1 and e3. Note that contracting the edges e ∈ T* with mu(e) ∈ Q results in tree T (see Figure 2(b)), so we can identify the edges that lie between two mutations of a character in Q: for two such mutations at v, v', the path that connects them is given by the path connecting v, v' in the perfect phylogeny T (see Figure 2(c)). Now consider the set V_K of all vertices that lie on a path connecting two mutations of the same character in T*. It is easy to see that GAM(V_K, K) is contained in GAM(t(T*), K). We can therefore identify K and GAM(t(T*), K) in time m^{2q} q^{O(q)} + O(nm).
To complete the description and analysis of the algorithm, we borrow the following definition from prior work^{13}.

Definition 3.2. We say that an (unordered) couple r1, r2 induces the gamete (x, y) at characters (c, c') if r1[c] = x and r1[c'] = y (or the same holds for r2), or r1[c] = r2[c] = 2 and r1[c'] = r2[c'] = y, or r1[c] = r2[c] = x and r1[c'] = r2[c'] = 2. We define IND(H', {c, c'}) to denote the set of gametes induced by the couples of H' at c, c'. Note that if a couple of H' contains both 0 and 1 at a character c, then the couple contained state 2 at character c in the input (before perfect phylogeny haplotyping).

For reference, Figures 3(a) and 3(b) show the input matrix G and the matrix H' found in Step 4 of solveIPPH; Figure 3 illustrates how the algorithm performs haplotyping given K and GAM(t(T*), K).

Lemma 3.1 shows that, given the correct Q, we can identify K and GAM(t(T*), K) efficiently. Notice that a character c is removed from C(H') only when all the (2,c)-couples have been completely resolved.

Lemma 3.2. For characters c ∈ C(H') and c~ ∈ Q, if every (2,c)-couple is in {0,1}^m (i.e. fully resolved), then GAM(H, {c, c~}) is fixed, where H is any matrix obtained from H' by resolution of 2s.

Proof. Let r_{2i-1}, r_{2i} ∈ R(H') be a (2,c)-couple, so r_{2i-1}[c] != r_{2i}[c]. If r_{2i-1}[c~] = r_{2i}[c~] = 2, then the resolution will either create GAM({r_{2i-1}, r_{2i}}, {c, c~}) = {(0,0),(1,1)} or {(0,1),(1,0)}, and only one of the two can be contained in a set of three gametes G(c, c~).

Lemma 3.3. Given a set of three gametes G(c, c~), there exists a unique resolution of the states 2 in character c~ for any (2,c)-couple such that the gametes of the couple on (c, c~) are contained in G(c, c~).

As an example of the haplotyping phase (Figure 3), function solveIPPH selects c = 2 (Step 7). Function processMatrix determines G(2,6) = IND(H', {2,6}) and guesses three gametes G(2,7) (Step 2(b)iA). Based on these two sets of three gametes, the (2,2)-couples (rows 9 to 12) are resolved (Step 2(b)iB). Since the pair of rows (7, 8) is also a (2,1)-couple, character 1 is added to A (Step 2(b)iC), and character 2 is removed (Step 2d). Function solveIPPH then selects c = 3 (Step 7); the (2,3)-couples are resolved, character 3 is removed (Step 2d), and c = 1 is extracted from A (Step 2a). Since there are no remaining (2,1)-couples, character 1 is removed (Step 2d), exhausting all c ∈ C(H') \ K. Finally, solveIPPH resolves the states 2 in the first couple, resulting in 0000 and 0111, and in the third couple, resulting in 0101 and 0010, all of which are in GAM(t(T*), K) (Step 8a); the next couple has no state 2 and is left unchanged.
We now return to the proof of the main theorem.

Theorem 3.1. If penalty(T*) <= q, then the algorithm described in Figure 1 returns a solution matrix H that obeys all correctness conditions, in time nm^{O(q)}.

To prove this theorem, we first need three simple lemmas.

Lemma 3.4. For a pair of characters c not in K and c~ ∈ Q, if c and c~ do not conflict at Step 2c, then the final matrix obtained by the algorithm will not have a conflict between c and c~.

Proof. Although c can contain state 2 in matrix H', the resolution of every (2,c)-couple is fixed by the set of three gametes G(c, c~) established between c and c~ (Lemma 3.3), and by Lemma 3.2 the set GAM(H, {c, c~}) is then fixed. Since c is not in K, the characters c and c~ cannot share four gametes; therefore, if they do not conflict at Step 2c, they do not conflict in the final matrix.

Fig. 2. Determining K and GAM(t(T*), K). (a) An optimal phylogeny T* with Q = {6, 7}. (b) The perfect phylogeny T on characters C(G) \ Q, obtained by contracting the edges of T* that mutate Q. (c) The four edges mutating characters 6 and 7 are assigned to three vertices of T; filled vertices form V_K. Given the states on the three roots (0000000, 0000010 and 0000011), the edges {4, 5} that lie between two mutations of character 6 are determined, and from them K and GAM(t(T*), K) are found.

We now prove the main lemma that bounds the running time of our algorithm.

Fig. 3. A run of the haplotyping phase given Q = {6, 7}, K and GAM(t(T*), K); the matrix panels of the original figure are not reproduced here. Panels (a) and (b) show the input matrix G and matrix H'; shaded regions represent deleted (or ignored) characters and couples completely resolved by the algorithm.

Lemma 3.5. Consider any one call to function processMatrix. The if condition at Step 2(b)iA fails, so that three gametes must be guessed, at most q times.

Proof. Let the c_i's represent the characters added into A during the execution of the call, and let c_j be the character extracted from A and used in the loop of Step 2b during which c_k was added into A. We show that if G(c_j, c) is guessed at Step 2(b)iA, then for all future c_k encountered during the execution of Step 2(b)iA, |IND(H', {c_k, c})| = 3, so the gametes need not be guessed again.
When c_k was added into A, there existed a couple h'_{2i-1}, h'_{2i} that is both a (2, c_j)-couple and a (2, c_k)-couple: by the condition on Step 2(b)i, h'_{2i-1}[c_j] != h'_{2i}[c_j], and by the definition of (2,c)-couples, h'_{2i-1}[c] = h'_{2i}[c] = 2 held before the resolution. Checking the remaining values x, y ∈ {0,1} case by case, the couples of H' induce three gametes on the characters (c_k, c), i.e. |IND(H', {c_k, c})| = 3 (note that c_k and c cannot be identical characters). This shows that the if condition in Step 2(b)iA fails at most q times for any call to processMatrix.

We now bound the run time of function processMatrix. First notice that once a character c~ is removed from A at Step 2a, it is never added back throughout the execution of the algorithm; this bounds the number of times Step 2a is executed throughout the algorithm by O(m). The number of (2, c~)-couples is O(n), and therefore Step 2b loops O(n) times. The cardinality of Q is at most q, and therefore the loop in Step 2(b)i executes O(|Q|) = O(q) times. The time bound for guessing the sets of gametes G(c~, c) is the hardest to analyze: by Lemma 3.5 at most q guesses are made per call, so the probability of all guesses performed at Step 2(b)iA being correct for any single call is at least 2^{-q}; equivalently, we suffer a multiplicative factor of 2^q in the run-time because of this step. Lemma 3.3 shows that there is a unique resolution of the states 2 in each (2, c_k)-couple, performed in Step 2(b)iB, and the check performed at Step 2c takes time O(nm). Using Corollary 3.1, the time for executing Steps 2a through 2d is O(2^q nm), and the total run-time for all calls to function processMatrix combined is O(2^q n m^2).

After exhausting all characters c ∈ C(G) \ K, the algorithm considers the remaining couples and iteratively resolves their states 2 such that the couples are in GAM(t(T*), K). In each such couple, function processMatrix has already resolved all states 2 on characters outside Q, and each such character was subsequently deleted (or ignored) for the rest of the algorithm in Step 2d. For each character c not in K, we know that c shares exactly three gametes with all characters c~ ∈ Q in t(T*); given the set of three gametes, we know that c and c~ cannot conflict because of the resolution of any of the remaining 2s (ensuring correctness condition 3).
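The resolution step (Step 2(b)iB) picks, for a couple carrying state 2 at a character, the unique assignment whose induced gametes stay inside the three allowed gametes. A minimal sketch under our own conventions (couples as pairs of mutable rows; `allowed` maps an already-resolved character k to the permitted gametes on the pair (j, k); names are ours):

```python
def resolve(couple, j, allowed):
    """Resolve state 2 at character j in a couple (two rows as lists),
    choosing the unique assignment whose gametes at (j, k) fall inside
    the allowed gametes for every already-resolved character k.

    Returns True on success; on failure the 2s are restored.
    """
    r1, r2 = couple
    for a, b in ((0, 1), (1, 0)):          # the two possible resolutions
        r1[j], r2[j] = a, b
        ok = all(
            {(r1[j], r1[k]), (r2[j], r2[k])} <= gam
            for k, gam in allowed.items()
        )
        if ok:
            return True
    r1[j] = r2[j] = 2                      # no consistent resolution exists
    return False
```

When `allowed[k]` holds only three of the four gametes, at most one of the two assignments can pass the check, which is exactly the uniqueness argument of Lemma 3.3.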

Since only the characters in Q can contain state 2 at this point, Step 8 of function solveIPPH iterates over the n couples and, in the worst case, performs a brute-force step of computing all possible ways of resolving the 2s; Step 8a takes O(m 2^q) time. This step also checks that the resulting gametes on characters in K are in the predicted set GAM(t(T*), K), ensuring correctness condition 2; correctness condition 1 holds by the definition of resolution, and by Lemma 3.4 a conflict in the final matrix can only occur between characters that are both in K, ensuring correctness condition 3. This shows that any matrix found by the algorithm obeys the correctness conditions, and that the running time is nm^{O(q)}.

Theorem 3.2. Any solution matrix H obeying all the correctness conditions is optimal.

Proof. The proof is constructive and demonstrates the procedure to construct a q-near-perfect phylogeny for H. Matrix H satisfies the following two properties: if a pair of characters c, c' conflict in H, then c, c' ∈ K (third correctness condition); and GAM(H, K) is contained in GAM(t(T*), K) (second correctness condition). Using these two observations, a q-near-perfect phylogeny is obtained by constructing the q-near-perfect phylogeny on GAM(H, K), contracting that phylogeny to a vertex, and constructing a perfect phylogeny on the remaining characters. It can be shown that a q-near-perfect phylogeny can be constructed for any matrix that satisfies the above two properties (see Section 7 of Gusfield and Bansal^{21}). The phylogeny, along with correctness condition 1, guarantees that the returned matrix H is an optimal solution.

Theorem 3.3. The algorithm of Figure 1 returns an optimal solution H to the IPPH problem in time nm^{O(q)}.

The proof follows directly from Theorems 3.1 and 3.2.
4. SOLUTIONS TO PPH

We assumed in the preceding sections, following prior work^{33}, that the perfect phylogeny stage of the algorithm will find a unique solution; this assumption does not necessarily hold. Consider G, an n x m input genotype matrix to the PPH problem, and for any fixed character let p be the minor allele frequency and (1 - p) the major allele frequency. Prior work showed that the number of PPH solutions is at most 2^k, where k is the number of characters of G that do not contain the homozygous minor allele^{13,19}. In the worst case k could be as large as m, so the number of PPH solutions can be 2^m. This should not occur in practice, however, since the underlying population typically follows a random mating model. Under Hardy-Weinberg equilibrium, the probability that none of the n taxa contains the homozygous minor allele is (1 - p^2)^n. This probability could be large for very small values of p, but it is reasonable to assume that p > c for some constant c, since otherwise SNPs cannot be detected; in that case, with high probability in n (at least 1 - (1 - c^2)^n), a specific character contains the homozygous minor allele, and in expectation the number of characters without a homozygous minor allele is at most m(1 - c^2)^n, which tends to zero exponentially with n.

We now consider a more general setting in which the value of p is assumed to be uniformly distributed in [0, 0.5]. In this case, the probability that a specific character does not contain the homozygous minor allele is

2 ∫_0^{1/2} (1 - p^2)^n dp = 2 ∫_0^{1/sqrt(n)} (1 - p^2)^n dp + 2 ∫_{1/sqrt(n)}^{1/2} (1 - p^2)^n dp <= 2/sqrt(n) + 2 ∫_{1/sqrt(n)}^{1/2} e^{-n p^2} dp <= 4/sqrt(n),

using (1 - p^2)^n <= e^{-n p^2}, i.e. O(1/sqrt(n)). To guarantee optimality, the algorithm would need to enumerate over all solutions to the PPH sub-problem, increasing its run-time proportionally.
Hardy-Weinberg equilibrium states that the two haplotypes of any given individual are selected independently of one another at random.
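The quantity (1 - p^2)^n used above is easy to explore numerically; a small sketch (the function name is ours):

```python
def p_no_homozygous_minor(p, n):
    """Probability that none of n Hardy-Weinberg genotypes is homozygous
    for the minor allele of frequency p: each genotype is homozygous minor
    with probability p^2, so avoiding it in all n draws has probability
    (1 - p^2)^n."""
    return (1.0 - p * p) ** n

# With minor-allele frequency bounded below (here p = 0.1), the
# probability decays exponentially in the number of sampled genotypes:
for n in (50, 100, 500):
    print(n, p_no_homozygous_minor(0.1, n))
```

This is the mechanism behind the bound in the text: with p bounded away from zero, characters lacking the homozygous minor allele, and hence ambiguous PPH solutions, become rare as n grows.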

Since PPH is always performed on SNP blocks with low diversity, it is not unreasonable to assume n >> m. For instance, if n = Omega(m^2), then using Chernoff bounds we can show that with high probability the number of characters that lack a homozygous minor allele is bounded by 2 log m, and therefore the number of solutions to the PPH problem is bounded by m^2. Since the number of solutions returned by PPH is then O(m^2), the IPPH algorithm described above with high probability suffers only an O(m^2) overhead for finding the optimal extension of each PPH solution. This discussion answers the question raised by Gusfield on the theoretical estimate for the number of PPH solutions to expect under a coalescent model^{19}.

5. EMPIRICAL VALIDATION

We demonstrate and validate our algorithm by comparing with leading phasing methods using a collection of large-scale genotype data of known phase from a high-resolution scan of human chromosome 21^{29}. The study identified common single nucleotide polymorphisms (SNPs) and typed them on 20 sequences through a method that directly identifies haplotypes rather than genotypes. The SNPs were partitioned into 4135 haplotype blocks by the authors of that study using a diversity test. We extracted haplotypes from the 'haplotype pattern' file provided in the supplementary material, which identifies distinct haplotypes in each block and provides some inference of missing data when it can be done unambiguously. We then randomly paired haplotypes to produce 10 diploid genotypes per block as our final test set.

The haplotype blocks were divided into four sets based on their imperfectness (q = 0, 1, 2, 3+) using a prior method^{31}. Figure 4 shows that the length of the blocks is related to q, the imperfectness: the tails are heavier for larger values of q, as expected, and a significant fraction of the q = 2 blocks have a large number of characters when compared with q = 1. A large majority of blocks (98%) are 0-, 1- or 2-near-perfect. We do not solve optimally for the 2% with imperfectness 3 or greater, because the run-time for optimal solutions would be prohibitive. Such blocks can be solved non-optimally in practice by subdividing them into smaller blocks of lower imperfectness, but such data would not be comparable to those solved optimally and are therefore omitted from our analysis.
We compared our method with two popular phasing packages. We ran the haplotyper^{28} package, which uses an expectation-maximization heuristic, with 20 rounds (the recommended value). We also ran the PHASE^{35} package, which employs a Markov chain Monte Carlo method. We ignored haplotypes which still had significant missing data, replacing them with haplotypes randomly selected from the multinomial distribution of fully-resolved haplotypes in the corresponding block, in order to maintain 20 chromosomes per block.

Fig. 4. Distribution of the fraction of blocks as a function of the number of characters in the block (data for more than 50 characters not shown).
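Accuracy in the comparison that follows is reported in switch errors. A minimal sketch of the metric under simplifying assumptions (haplotypes as 0/1 strings; we count flips of phase between consecutive heterozygous sites, a count that does not depend on which inferred haplotype is aligned first):

```python
def switch_errors(true_pair, inferred_pair):
    """Count switch errors: the number of adjacent heterozygous-site pairs
    at which the inferred phase flips relative to the true phase.

    Each argument is a pair of equal-length haplotype strings over {0,1},
    assumed consistent with the same genotype.
    """
    t1, t2 = true_pair
    i1 = inferred_pair[0]
    het = [k for k in range(len(t1)) if t1[k] != t2[k]]
    # phase[k]: does inferred haplotype 1 match true haplotype 1 at site k?
    phase = [i1[k] == t1[k] for k in het]
    return sum(1 for a, b in zip(phase, phase[1:]) if a != b)

# True phase 0101/1010; inferred 0110/1001 flips once, between sites 1 and 2.
print(switch_errors(("0101", "1010"), ("0110", "1001")))  # 1
```

Production evaluations normally also handle missing data and multi-individual aggregation; the sketch covers only the per-pair count used conceptually here.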

Fig. 5. Empirical results on Chromosome 21 (error-rate and run-time cells not reproduced from the scanned table). The three test sets: perfectness 0 with 3497 blocks and 20816 SNPs; perfectness 1 with 461 blocks and 4211 SNPs; perfectness 2 with 93 blocks and 1266 SNPs. Error rate is the number of switch errors divided by the number of blocks; total run time (secs) is the sum over all blocks, for PHASE, haplotyper and our algorithm.

The table of Figure 5 summarizes the test results*. Blocks with 0-, 1- and 2-near-perfectness (accounting for 4051 out of 4135 blocks) were analyzed separately using the three algorithms. All methods provide comparable accuracy when blocks are perfect; we attribute the slightly worse performance of our method on perfect input to the fact that we do not use any procedure to select a maximum-likelihood output matrix among all perfect matrices, as is done for prior perfect phylogeny methods^{13}. All three methods degrade in accuracy as imperfectness increases. Our method, however, scales much better with imperfectness than do the other two, clearly outperforming them on 1-near-perfect and 2-near-perfect inputs. Imperfectness can result from recurrent mutations, recombinations or incorrect SNP inferences, and we attribute our method's superior performance on imperfect data to the fact that it explicitly handles one of these factors while the others suffer from all three. Our method is extremely fast for q = 0, where it reduces to the perfect phylogeny algorithm of Eskin et al.^{13}; its run time rapidly increases with larger q, as expected from the theoretical bounds, and this observation suggests a need for further research in improving theoretical and practical bounds for general q. Both PHASE and haplotyper return confidence scores on their results; where multiple solutions were obtained, we selected the one that minimized the entropy of the haplotype frequencies.

* Haplotyper crashes on 18 blocks, which were ignored in the calculation of its accuracy.

6. DISCUSSION AND CONCLUSIONS

We have developed a theory for reconstructing phylogenetic trees directly from genotypes that is optimal in the number of mutations, and we demonstrate practical results that show great promise in accurately inferring phase from real data sets. Our results suggest that imperfect phylogeny approaches can lead to significant improvements in accuracy over other leading phasing methods. Run time, while very fast for perfect and almost-perfect data, remains an obstacle for even modest q.

Our new method has several immediate applications in computational genetics:

• Phasing: At present, the method is competitive with the most widely used tools in accuracy and, with optimizations, should become competitive in run-time for larger q. Faster tools may remain preferable for routine use, which might allow a slower, high-accuracy method such as ours to function as a fall-back for regions that those methods cannot infer with confidence.
• Phylogeny Construction: Our run times are competitive with typical times to be expected from other optimal phylogeny reconstruction algorithms, even when the input consists of haplotypes. Our approach may thus be considered preferable to the standard practice of inferring haplotypes and then fitting them to phylogenies.

• Haplotype Blocks Inference: Our method could serve as an improved means of identifying recombination-free haplotype blocks for purposes of association study design, by more accurately distinguishing recurrent mutation from recombination. The blocks might thus be useful in improving statistical power in haplotype-based association testing.

We note that, for simplicity, we do not implement the algorithm exactly as described: function processMatrix does not add characters into A to be processed iteratively; rather, we enumerated all possible phylogenies that are at most q-near-perfect. Note that this does not guarantee finding all possible output matrices that are consistent with a q-near-perfect phylogeny. For the PPH sub-problem we implemented a fast prior algorithm^{13}. The implementation, however, always returns a haplotype output matrix, and we report the switch errors^{33} for the output. PHASE can make use of additional information such as SNP positions; for our own method, we provided it only the genotypes as input.

• Haplotype Blocks Inference: Our method could serve as an improved means of identifying recombination-free haplotype blocks for purposes of association study design, by more accurately distinguishing recurrent mutation from recombination. The blocks might thus be useful in improving statistical power in haplotype-based association testing.

Future empirical studies, enabled by our method, will be needed to better establish the nature of imperfectness in real genotype data and the degree to which better handling of recurrent mutations will be of use in practice.

Acknowledgments

This work is supported in part by NSF ITR grant CCR-0122581 (The ALADDIN project) and NSF grant CCF-043075.

References

R. Agarwala and D. Fernandez-Baca. A Polynomial-Time Algorithm for the Perfect Phylogeny Problem when the Number of Character States is Fixed. SIAM Journal on Computing, 23 (1994).
V. Bafna, D. Gusfield, G. Lancia and S. Yooseph. Haplotyping as Perfect Phylogeny: A Direct Approach. Journal of Computational Biology, 10 (2003).
V. Bafna, D. Gusfield, S. Hannenhalli and S. Yooseph. A Note on Efficient Computation of Haplotypes via Perfect Phylogeny. Journal of Computational Biology, 11 (2004).
H. Bodlaender, M. Fellows and T. Warnow. Two Strikes Against Perfect Phylogeny. In proc International Colloquium on Automata, Languages and Programming (1992).
H. Bodlaender, M. Fellows, M. Hallett, H. Wareham and T. Warnow. The Hardness of Perfect Phylogeny, Feasible Register Assignment and Other Problems on Thin Colored Graphs. Theoretical Computer Science (2000).
M. Bonet, M. Steel, T. Warnow and S. Yooseph. Better Methods for Solving Parsimony and Compatibility. Journal of Computational Biology, 5(3).
A. G. Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution 7:111-122 (1990).
P. Damaschke. Parameterized Enumeration, Transversals, and Imperfect Phylogeny Reconstruction. In proc International Workshop on Parameterized and Exact Computation (2004).
W. H. E. Day and D. Sankoff. Computational Complexity of Inferring Phylogenies by Compatibility. Systematic Zoology (1986).
X. Ding, V. Filkov and D. Gusfield. A Linear Time Algorithm for Perfect Phylogeny Haplotyping. In proc Research in Computational Molecular Biology (2005).
R. G. Downey and M. R. Fellows. Parameterized Complexity. Monographs in Computer Science (1999).
E. Eskin, E. Halperin and R. M. Karp. Efficient Reconstruction of Haplotype Structure via Perfect Phylogeny. Journal of Bioinformatics and Computational Biology (2003).
D. Fernandez-Baca and J. Lagergren. A Polynomial-Time Algorithm for Near-Perfect Phylogeny. SIAM Journal on Computing, 32 (2003).
L. R. Foulds and R. L. Graham. The Steiner Problem in Phylogeny is NP-complete. Advances in Applied Mathematics (3) (1982).
G. Ganapathy, V. Ramachandran and T. Warnow. Better Hill-Climbing Searches for Parsimony. In proc Workshop on Algorithms in Bioinformatics (2003).
D. Gusfield. Efficient Algorithms for Inferring Evolutionary Trees. Networks, 21 (1991).
D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press (1999).
D. Gusfield. Haplotyping as a Perfect Phylogeny: Conceptual Framework and Efficient Solutions. In proc Research in Computational Molecular Biology (2002).
D. Gusfield. An Overview of Combinatorial Methods for Haplotype Inference. Lecture Notes in Computer Science, 2983 (2004).
D. Gusfield and V. Bansal. A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters. In proc Research in Computational Molecular Biology (2005).
D. Gusfield, S. Eddhu and C. Langley. Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination. In proc IEEE Computer Society Bioinformatics Conference (2003).
E. Halperin and E. Eskin. Haplotype Reconstruction from Genotype Data using Imperfect Phylogeny. Bioinformatics (2004).
D. Hinds, L. Stuve, G. Nilsen, E. Halperin, E. Eskin, D. Ballinger, K. Frazer and D. Cox. Whole Genome Patterns of Common DNA Variation in Three Human Populations. Science (2005). www.perlegen.com.
D. Huson, T. Klopper, P. Lockhart and M. Steel. Reconstruction of Reticulate Networks from Gene Trees. In proc Research in Computational Molecular Biology (2005).
S. Kannan and T. Warnow. A Fast Algorithm for the Computation and Enumeration of Perfect Phylogenies. SIAM Journal on Computing, 26 (1997).
T. Niu, Z. Qin, X. Xu and J. Liu. Bayesian Haplotype Inference for Multiple Linked Single Nucleotide Polymorphisms. American Journal of Human Genetics (2002).
N. Patil et al. Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. Science (2001).
H. J. Promel and A. Steger. The Steiner Tree Problem: A Tour Through Graphs, Algorithms and Complexity. Vieweg Verlag (2002).
C. Semple and M. Steel. Phylogenetics. Oxford University Press (2003).
Y. Song, Y. Wu and D. Gusfield. Algorithms for Imperfect Phylogeny Haplotyping with a Single Homoplasy or Recombination Event. In proc Workshop on Algorithms in Bioinformatics (2005).
S. Sridhar, K. Dhamdhere, G. E. Blelloch, E. Halperin, R. Ravi and R. Schwartz. Simple Reconstruction of Binary Near-Perfect Phylogenetic Trees. In proc International Workshop on Bioinformatics Research and Applications (2006).
S. Sridhar, K. Dhamdhere, G. E. Blelloch, E. Halperin, R. Ravi and R. Schwartz. Fixed Parameter Tractability of Binary Near-Perfect Phylogenetic Tree Reconstruction. To appear in proc International Colloquium on Automata, Languages and Programming (2006).
M. Steel. The Complexity of Reconstructing Trees from Qualitative Characters and Subtrees. Journal of Classification, 9 (1992).
M. Stephens, N. Smith and P. Donnelly. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics (2001).
The International HapMap Consortium. The International HapMap Project. Nature 426 (2003). www.hapmap.org.

TOWARD AN ALGEBRAIC UNDERSTANDING OF HAPLOTYPE INFERENCE BY PURE PARSIMONY

Daniel G. Brown and Ian M. Harrower

David R. Cheriton School of Computer Science, University of Waterloo,
200 University Avenue W., Waterloo, Ontario, Canada N2L 3G1
Email: {browndg, imharrow}@cs.uwaterloo.ca

Haplotype inference by pure parsimony (HIPP) is known to be NP-hard. Despite this, many algorithms successfully solve HIPP instances on simulated and real data, especially when applied to synthetically generated test instances arising from standard evolution models. In this paper, we explore the algebraic structure of the problem, to help identify easy and hard instances of the problem. Our results focus on the algebraic rank of the genotype matrix. The rank of the input matrix is known to be a lower bound on the size of an optimal HIPP solution. The rank lower bound is important: when tight, it verifies for branch-and-bound procedures that they have found an optimum. We show this bound is almost surely tight for data generated by ((1+ε)/2) p log p random pairings of p haplotypes from a perfect phylogeny (for positive ε). Moreover, we show that for haplotypes with few recombinations and a common mutation, we can almost surely recover an optimal set of haplotypes in polynomial time, with only a constant multiple more population members. We examine the algebraic effect of allowing recombination, and bound the effect recombination has on rank. We also give a complete classification of the rank of a haplotype matrix derived from a galled tree, the simplest kind of phylogenetic network that includes recombination, and we show in particular the conditions under which the rank bound is exact for galled trees. This classification identifies a set of problem instances with recombination for which the rank lower bound is also tight for the HIPP problem. In addition, this study of algebraic rank allows us to prove a new variation of the "haplotype lower bound" 9 on the number of recombinations required to explain a haplotype matrix.

Keywords: Haplotype inference, probabilistic analysis of algorithms, phylogenetic networks, galled tree.

1. INTRODUCTION

Haplotype inference is the process of attempting to identify the chromosomal sequences that have given rise to a diploid population. Recently, this problem has become increasingly important, as researchers attempt to connect variations to inherited diseases. The simplest haplotype inference problem to describe is haplotype inference by pure parsimony (HIPP), introduced by Gusfield 1. The goal is to identify a smallest set of haplotypes to explain a set of genotypes. This objective is partly justified by the observation, now made in several species 2, 3, that in genomic regions under strong linkage disequilibrium, few ancestral haplotypes are found. The problem is also interesting purely combinatorially: computing the size of the HIPP solution is NP-hard in general 1, and its only known polynomial-time approximation algorithms have exponentially bad performance guarantees 4. However, in practice, it is surprisingly easy to solve. Several authors have developed integer programming or branch-and-bound algorithms that have shown fast performance 1, 5-7, and some real-world problems have been readily solved exactly or near-exactly 7.

To help resolve this conflict between theoretical complexity and practical ease of solution, we explore the connection between algebraic rank and the HIPP problem. We focus on problem instances resulting from random pairing of haplotypes generated by standard evolution models. The rank of the input matrix is known to be a lower bound on the optimal solution size of the HIPP problem 8. We show that this bound is almost surely tight for data generated by randomly pairing p haplotypes derived from a perfect phylogeny when the number of distinct population members is more than ((1+ε)/2) p log p (for some positive ε). In the process, we prove a stronger version of the standard haplotype lower bound. Moreover, we can go one step further for some HIPP instances, to compute an optimal set of haplotypes in polynomial time. In Section 5, we strengthen our results by exploring population models that allow for recombination. Our second theorem, in Section 4, shows a constructive algorithm that almost surely runs in polynomial time on instances derived from perfect phylogenies with a constant factor more genotypes than for our first theorem.
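The pure-parsimony objective just described — find the smallest set of haplotypes that explains every genotype — can be made concrete with a tiny brute-force sketch. This is an illustration only (exponential time, hypothetical function names); it uses the paper's 0/1/2 genotype coding, where 1 marks a heterozygous site and two 0/1 haplotypes explain a genotype when they sum to it componentwise.

```python
from itertools import combinations, combinations_with_replacement, product

def explains(h1, h2, g):
    """Two 0/1 haplotypes explain a genotype row iff they sum to it
    componentwise: 0+0=0, 1+1=2, and a mixed pair gives 1."""
    return all(a + b == x for a, b, x in zip(h1, h2, g))

def compatible_haplotypes(g):
    """All 0/1 vectors that could serve as one parent of genotype g."""
    sites = [(0,) if x == 0 else (1,) if x == 2 else (0, 1) for x in g]
    return list(product(*sites))

def hipp_brute_force(G):
    """Smallest haplotype set explaining every genotype in G."""
    candidates = sorted({h for g in G for h in compatible_haplotypes(g)})
    for size in range(1, len(candidates) + 1):
        for H in combinations(candidates, size):
            if all(any(explains(h1, h2, g)
                       for h1, h2 in combinations_with_replacement(H, 2))
                   for g in G):
                return list(H)
    return []

G = [(1, 1), (0, 2)]      # two genotypes over two sites
H = hipp_brute_force(G)   # a smallest explaining set: {(0,1), (1,0)}
```

Here genotype (0, 2) forces the haplotype (0, 1), and (1, 1) then additionally requires (1, 0), so the optimum is two haplotypes.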

Our result requires that there be a mutation found in at least an α fraction of the genotypes. We connect this theorem to the standard coalescent model from population genetics in Section 4, where we show that for haplotypes generated by this model, the optimal solution can be found easily: we can guarantee such a common mutation exists with probability at least 1 − ε when the mutation parameter θ of the coalescent model is at least a constant depending on ε and α.

Although we do not show why previous haplotype inference algorithms often run surprisingly quickly, we do show that many of the instances upon which they are run have special structure. Our results identify structure in HIPP instances and show how to easily solve HIPP in such cases. Why are HIPP instances often solvable in practice?

2. BACKGROUND AND RELATED WORK

We begin by briefly reviewing existing work on haplotype inference, and on a phase transition for haplotyping problems.

2.1. Haplotype inference and notation

The input to a haplotype inference algorithm is a genotype matrix, G. Each of its n rows represents the genotype gi of population member pi; the m columns represent sites in the genome. We assume there are two alleles, 0 and 1, at each site, and thus three choices for G(i,j): G(i,j) = 0 if both parent chromosomes of pi have allele 0 at sj, G(i,j) = 2 if both have allele 1, and G(i,j) = 1 at positions where one parent has each allele. (Note: this is not the standard notation, which exchanges the meanings of 1 and 2. The simplicity of this formulation explains our notational choice.)

Haplotype inference consists of explaining G with a 0/1 haplotype matrix, H. The k rows of H represent possible chromosome choices and their alleles at the m sites. Genotype gi is explained by H if there exist two rows (possibly the same) of H whose sum is gi. We can represent this pairing by an n × k pairing matrix, Π, where row ri of Π has value 1 in the two columns corresponding to the two parent haplotypes of gi, and 0 elsewhere. (If two copies of the same haplotype explain genotype gi, the row of Π has a 2 in that column and 0 elsewhere.) In our formulation, haplotype inference consists of finding a haplotype matrix H and a valid pairing matrix Π such that G = Π · H.

2.2. Pure parsimony

In haplotype inference by pure parsimony (HIPP), introduced by Gusfield 1, we seek the smallest set of haplotypes to explain G. This corresponds to finding the smallest H such that a proper pairing matrix Π exists where G = Π · H. It is NP-hard 1, and its known approximation algorithms have only exponential approximation guarantees 4. Still, despite its theoretical hardness, many instances of this problem have been easy to solve, particularly with the parsimony objective. Gusfield 1 gave an integer linear programming formulation for the problem that, though theoretically exponential in size, often solved very quickly. Halldorsson et al. found a polynomial-sized IP formulation for the problem 11, which reduces the complexity of the problem; it is inspired by a formulation of He and Zelikovsky 10. We independently identified and extended it with further inequalities 5, 7. Our experiments demonstrated this even on large instances.

2.3. Haplotypes and genotypes

To answer this question, we examine some models for generating HIPP instances: where do haplotypes come from, and how are they paired to make genotypes? The simplest generative model for haplotypes is perfect phylogeny. In this model, all m sites evolve according to a rooted phylogenetic tree, with the root at the top of the tree. Without loss of generality, we assume that at every sampled site sj, the common ancestor of all haplotypes had allele 0. Each site is assigned to a single edge of the tree, which represents when the unique mutation of that site occurred.
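The matrix formulation G = Π · H above can be checked numerically on a toy instance. The matrices below are hypothetical illustrations (not from the paper), using the self-pairing convention of a single 2 in a row of Π:

```python
def matmul(A, B):
    """Integer matrix product, enough to form G = Pi . H."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Haplotype matrix H: k = 3 haplotypes (rows) over m = 4 sites, alleles 0/1.
H = [[0, 0, 1, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 0]]

# Pairing matrix Pi: each row marks the two parent haplotypes of a genotype.
Pi = [[1, 1, 0],   # genotype 0 = haplotypes 0 and 1
      [0, 1, 1],   # genotype 1 = haplotypes 1 and 2
      [2, 0, 0]]   # genotype 2 = haplotype 0 paired with itself

G = matmul(Pi, H)  # entries: 0/2 homozygous, 1 heterozygous
```

Row 0 of G, for instance, is [1, 0, 2, 1]: heterozygous at sites 0 and 3, homozygous for allele 1 at site 2, exactly the sum of haplotypes 0 and 1.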

Leaves descendant from that point have allele 1 at site sj; other leaves have allele 0. For ease of calculation, we also include a single non-polymorphic site s0, where all haplotypes have value 1; since any haplotype pairing gives the same genotype at that site, this has no effect on any solution. (We can also relax the all-zero ancestor requirement.) A matrix H compatible with this framework is a PPH matrix (for perfect phylogeny haplotype).

Haplotypes pair to form genotypes. The simplest process is random pairing: each genotype results from pairing two haplotypes sampled with replacement. A haplotype or genotype may occur in multiple copies. We can also allow a probabilistic process for generation of the haplotype matrix. More complex generative models for haplotypes allow for recombination, breaking the rule that every site derives from the same single phylogenetic tree. We discuss recombination in Section 5. In Section 4, we consider model parameter settings that make the conditions of our second major theorem (Theorem 4.1) likely to hold.

2.4. A phase transition for haplotyping?

In 2003, Chung and Gusfield 12 examined the number of distinct PPH solutions to a genotype matrix obtained by randomly pairing 2n haplotypes from a perfect phylogeny without replacement n times. They showed that if there are o(k log k) population members (where k is the number of distinct haplotypes), with high probability there are multiple PPH solutions. They also show experimentally, but do not prove, that above this bound, there is usually a unique PPH solution. Cleary and St. John 13 then studied the structure of random pairing graphs. There seems to be a phase transition: when n ≪ k log k, there are many PPH solutions; when n ≫ k log k, there is typically only one. Note that the number of sites and the number of distinct haplotypes are directly correlated in the coalescent model; this observation explains the dependence on the length of the sequences observed in their experiments. This model of data generation is not the same as the model studied in this paper, but their observations are clearly related.

3. LINEAR ALGEBRAIC STRUCTURE AND A FIRST BOUND

Our work focuses on the rank of the genotype matrix, which Kalpakis and Namjoshi 8 have previously noted is a lower bound on the size of the solution to the HIPP instance.

Lemma 3.1. The number of haplotypes in the solution to the HIPP instance G is at least k* = rank(G).

Proof. Since H is a valid set of haplotypes for G only if there exists a pairing matrix such that G = Π · H, H must have rank at least rank(G), and must have at least that many rows. □

Corollary 3.1. If Π is an n × k* pairing matrix of rank k*, H is a k* × m haplotype matrix of rank k*, and G = Π · H, then k* = rank(G) and (Π, H) is an optimal HIPP solution for G.

Proof. This follows since if both Π and H are full rank, so is their product. If there exist a k* × m haplotype matrix H and an n × k* pairing matrix Π such that G = Π · H, and H has exactly k* rows, then H matches the lower bound and forms an optimal set of haplotypes for G. □

3.1. The rank of the pairing and haplotype matrices

We now consider when we can expect G to be created from full-rank matrices H and Π. Let k be the number of distinct haplotypes. We first consider the rank of the pairing matrix.

Lemma 3.2. If Π is a random pairing matrix for k haplotypes with more than ((1+ε)/2) k log k pairings, for some constant ε > 0, then rank(Π) = k almost surely as k → ∞.

Proof. Π is the node-edge incidence matrix of the pairing multigraph. This corresponds to n edges being picked from a random multigraph model, where every edge (i, j) has probability 2/k², and every loop (i, i) has probability 1/k². A standard graph theory result shows that Π is full rank if all connected components of the graph are non-bipartite 14.
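The graph-theory fact used here — a connected component contributes full rank to the incidence matrix exactly when it is non-bipartite — can be checked on tiny examples. This sketch is illustrative only (the graphs are hypothetical, and NumPy is assumed for the rank computation):

```python
import numpy as np

def incidence(k, edges):
    """Node-edge incidence matrix (one row per edge), with loops counted
    twice, matching the pairing-matrix convention used in the text."""
    M = np.zeros((len(edges), k), dtype=int)
    for r, (i, j) in enumerate(edges):
        M[r, i] += 1
        M[r, j] += 1
    return M

# A triangle is connected and non-bipartite: its incidence matrix has rank 3.
tri = incidence(3, [(0, 1), (1, 2), (2, 0)])
# A 4-cycle is connected but bipartite: rank only 3 out of 4.
cyc4 = incidence(4, [(0, 1), (1, 2), (2, 3), (3, 0)])

rank_tri = int(np.linalg.matrix_rank(tri))   # 3 (full rank)
rank_cyc = int(np.linalg.matrix_rank(cyc4))  # 3 (rank-deficient)
```

The alternating ±1 combination of the 4-cycle's edge rows sums to zero, which is exactly the bipartite rank deficiency the proof must rule out.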

If the graph has ((1+ε)/2) k log k pairings, then almost surely as k → ∞ it has at least ℓ = ((1+ε')/2) k log k distinct non-self pairings, for some constant ε' > 0. The classic Erdős-Rényi Theorem 15 shows that for such ℓ, with k nodes and ℓ edges, and each possible edge equally likely, a graph G' from the random graph model G(k, ℓ) is connected almost surely, and our pairing graph contains such a subgraph G'. A standard textbook exercise (see, for example, Bollobás's textbook 16) gives that such a graph contains a triangle almost surely. Since the graph is connected and contains a triangle, Π is rank k almost surely as k → ∞. □

3.2. The haplotype matrix

Now we consider the haplotype matrix. It is of rank k when it comes from a perfect phylogeny.

Lemma 3.3. If H is a p × m haplotype matrix with k distinct haplotypes and can be realized by a perfect phylogeny, then H is rank k.

Proof. Each distinct haplotype hi corresponds to a different leaf in the phylogenetic tree, and has value 1 at positions corresponding to mutations on the path from leaf to root. Consider two neighbouring leaves i and j in the subtree induced by the haplotypes of H. Their sequences are distinct, so at least one of i or j has a mutation on the path from their common ancestor to it; suppose it is i. Then hi has a one found in no other haplotype, and is thus linearly independent of all other haplotypes. Remove hi and repeat this process for all k haplotypes, ensuring the matrix is of rank k. The column s0 prevents a row of all zeros being the last row left. Elementary column operations allow us to extend to the case where the root has allele 1 at sites other than s0. □

3.3. Pairing in non-uniform haplotype pools

We may also be sampling from a non-uniform population, where some haplotypes are more common than others. If there is a pool of p different haplotypes being randomly paired to form genotypes, we can assume that H, the haplotype matrix used to create G, is a p × m matrix, with one column s0 of all ones, and that Π, the pairing matrix, is n × p. When considering the random pairings to make genotypes, we may require substantially more genotypes in order to be confident that we have paired all of the haplotype kinds.

Lemma 3.4. If G has at least ((1+ε)/2) p log p pairings, then Π will be full rank (of rank p), almost surely as p → ∞.

Proof. The Erdős-Rényi Theorem shows that as long as the number of distinct non-self pairings is at least ((1+ε')/2) p log p (which is true, almost surely, if the number of genotypes is ((1+ε)/2) p log p, for some ε' > 0), the pairing graph on the p nodes represented by the pool of p haplotypes is connected; again, it will also contain a triangle as a subgraph as before. Hence, the pairing matrix is almost surely of full rank. □

3.4. Putting it together: a first bound

If the data are generated by a perfect phylogeny, and there are enough population members, then the rank of the genotype matrix is the optimal number of haplotypes: if the matrix Π is rank p and the matrix H is rank k, then their product G = Π · H is also of rank k.

Theorem 3.1. Let G be a genotype matrix produced by random pairing of a pool of p haplotypes (not necessarily unique) that are generated by a perfect phylogeny (with a column of all ones). Then the size of the optimal solution to the HIPP instance G is rank(G), almost surely, if the number of genotypes is at least ((1+ε)/2) p log p.

Proof. The pairing matrix is almost surely of full rank p, by Lemma 3.4, and the rank of the haplotype matrix equals its number of distinct haplotypes, k, by Lemma 3.3. Hence G is of rank k, and by Lemma 3.1 and Corollary 3.1, the optimal solution size is rank(G) = k. □

Thus, for many instances, the general NP-hardness of HIPP is partly ameliorated: we can easily compute the size of the optimum, if not the actual haplotypes. In the next section, we see that PPH instances with a few times more pairings can be exactly solved.

4. WHEN CAN WE FIND THE ACTUAL HAPLOTYPES?

For instances of the problem with a constant factor more genotypes than the bound of Theorem 3.1, we can almost surely identify the optimal haplotypes for PPH instances.
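Theorem 3.1's conclusion — with enough random pairings, rank(G) typically equals the number of distinct haplotypes — can be illustrated end-to-end by simulation. The haplotype matrix below is a hypothetical perfect-phylogeny example with an all-ones column s0; sizes and constants are illustrative, and NumPy is assumed:

```python
import random
import numpy as np

# k = 4 distinct haplotypes realizable by a perfect phylogeny
# (columns nested or disjoint), with a final all-ones column s0.
H = np.array([[0, 0, 0, 1],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 1]])
k = H.shape[0]

# Random pairing matrix Pi: n pairings with replacement.
rng = random.Random(1)
n = 60  # comfortably above (1/2) k log k for this tiny k
Pi = np.zeros((n, k), dtype=int)
for r in range(n):
    i, j = rng.randrange(k), rng.randrange(k)
    Pi[r, i] += 1
    Pi[r, j] += 1

G = Pi @ H
# With high probability rank(G) = k, matching the HIPP optimum.
rank_G = int(np.linalg.matrix_rank(G))
```

Because the s0 column of H is all ones, every genotype is homozygous for allele 1 there (value 2), as the lemma's construction requires.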

Our main result is the following constructive theorem.

Theorem 4.1. Let G be derived from randomly pairing p haplotypes compatible with a perfect phylogeny, with k distinct haplotypes, represented by the haplotype matrix H, with a "common" mutation: there exists a column of H (without loss of generality, s1) with at least αp of both zeros and ones, for some α > 0, and the number of distinct population members is greater than ((1+ε)/2) p log p. If G arises from at least max(((1+ε)/2) p log p, (8+ε) k log k) pairings of members of H, then with high probability, rank(G) = k, and the unique haplotypes of H form an optimal HIPP solution for G, which can be found almost surely in polynomial time as k → ∞ (and consequently as p → ∞).

Theorem 4.1 says that if its conditions hold and the number of rows of G is at least (8 + ε) k log k, then we can solve HIPP in polynomial time almost surely. For example, if p = k and there is a site with minor allele frequency at least 25%, then Theorem 4.1 applies.

We prove Theorem 4.1 through several steps. Our first lemma notes that we can restrict ourselves to PPH matrices H. We do this by connecting to a variant of the HIPP problem: the Min-PPH problem, studied by Bafna et al. 17, is the HIPP problem where we restrict the haplotype matrix to be a PPH matrix. (Bafna et al. have shown that Min-PPH is NP-hard.)

Lemma 4.1. Given a genotype matrix G satisfying the conditions of Theorem 4.1, the set of unique haplotypes in H is an optimal HIPP solution, and also an optimal solution to the Min-PPH instance G.

Proof. Theorem 3.1 shows that the initial set of haplotypes found in H is an optimal set for this HIPP instance. Since they satisfy a perfect phylogeny, they are also a smallest PPH solution. □

Lemma 4.1 allows us to restrict our search to PPH solutions, so the possible space of PPH solutions is restricted. We next show that if G satisfies the conditions of Theorem 4.1, then almost surely there exists only one set of haplotypes that satisfies a perfect phylogeny and can generate G; this will complete the proof of Theorem 4.1.

Lemma 4.2. If the conditions of Theorem 4.1 hold, then almost surely there exists only one PPH solution, and it can be found in polynomial time.

We prove the lemma via a property of the DPPH algorithm of Bafna et al. 18. This algorithm constructs a graph, D(G), with one vertex for each distinct column in genotype matrix G. For any pair of sites s and s', we connect s and s' if we can find three genotypes g1, g2 and g3 such that the 3 × 2 submatrix of G induced by these genotypes and sites has one of the forms in Figure 1. In each of these forms, one possible resolution of sites s and s' violates the perfect phylogeny condition, so the resolution of g at sites s and s' is restricted by the perfect phylogeny condition. The main result we use is that the total number of PPH solutions for G equals 2^(c-1), where c is the number of connected components of a specific subgraph of D(G) 18.

a)  1 1      b)  1 1      c)  1 1
    x y          0 0          2 0
                 2 2          0 2

Fig. 1. Patterns of 3 × 2 submatrices which cause an edge to be added between the vertices representing the sites in D(G). The values x and y are each either 0 or 2. (Recall from Section 2 that in our notation, 1 represents a heterozygous site.)

We show that if G satisfies the conditions of Theorem 4.1, then almost surely D(G) is connected, so that c = 1 and there is only one PPH solution. To show D(G) is almost surely connected, we show that, with high probability, there is an edge between each node and the node for s1, the site with the common mutation. For any site sj, let ej be the tree edge containing the mutation at site sj, let Aj be the set of haplotypes below ej, and let Bj be all other haplotypes. In a random pairing, we use (X, Y) to denote a pairing of a haplotype from class X with a haplotype from class Y.
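The component count that drives the 2^(c-1) formula can be sketched in a few lines. This sketch uses a deliberately simplified edge rule — link two sites whenever some genotype row is heterozygous at both — rather than the full 3 × 2 pattern construction of DPPH, so it only illustrates the bookkeeping; the matrix is hypothetical:

```python
import numpy as np

def count_components_D(G):
    """Build a simplified D(G): one vertex per distinct column of G, an
    edge between sites when some row has value 1 at both. Returns the
    number of connected components c (union-find over column classes)."""
    G = np.asarray(G)
    cols = sorted({tuple(G[:, j]) for j in range(G.shape[1])})
    index = {c: i for i, c in enumerate(cols)}
    parent = list(range(len(cols)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for row in G:
        het = sorted({index[tuple(G[:, j])]
                      for j in range(G.shape[1]) if row[j] == 1})
        for a, b in zip(het, het[1:]):       # chain all het sites of this row
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
    return len({find(i) for i in range(len(cols))})

G = [[1, 1, 0],
     [0, 1, 1],
     [2, 0, 0]]
c = count_components_D(G)   # rows 0 and 1 link all three distinct columns
num_pph = 2 ** (c - 1)      # one connected component => a unique PPH count
```

With one component (c = 1), the count 2^(c-1) = 1: the situation the connectivity argument above is engineering.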

Proof. Consider a site s. We consider a few cases on s, depending on whether es is below e1 or not, and on the size of As. First, suppose es is not below e1. Depending on the size of As, either the events (As, A1) and (As, B1) produce a submatrix of type (a) in Figure 1, or the events (As, B1 \ As) and (B1 \ As, A1) give a submatrix of form (b); we will have an edge between s and s1 in D(G) if these events occur. Suppose instead es is below e1. If As is small, the events (As, A1 \ As) and (B1, A1 \ As) give a submatrix of form (a), while if As is larger, the events (As, A1) and (B1, A1) produce a submatrix of type (c) in Figure 1. In each case, we require at most three events, each with probability bounded below by a constant depending on α. This totals less than 6k events, since there are at most 2k − 2 distinct columns in a perfect phylogeny with k distinct leaves. By the coupon-collector lemma 19, after (8 + ε) k log k random pairings, the probability that a needed event has not yet occurred is less than 6 k^(−ε/2). Thus, if the conditions of Theorem 4.1 are satisfied, then almost surely D(G) is connected, and the DPPH results of Bafna et al. 18 show that there exists a unique PPH solution for G, and it can be recovered in polynomial time. □

Corollary 4.1. A Min-PPH instance G satisfying the conditions of Theorem 4.1 can be solved in polynomial time with high probability.

Proof. Lemma 4.2 shows that there is a unique PPH solution almost surely; hence c = 1, and there is only one PPH solution. The algorithm DPPH of Bafna et al. 18 gives a representation of all PPH solutions for a given genotype matrix G in polynomial time, and allows their enumeration in time polynomial in the input matrix size and proportional to the number of PPH solutions. □

4.1. A bound on finding a common mutation in a coalescent model

Our theorems show that we will likely be able to solve HIPP if the number of genotypes is above a relatively small bound, provided there exists a mutation with minor allele frequency at least α in our data. How often does such a common mutation occur? One would likely not study a population if all mutations were rare. The question of how many mutations with minor allele frequency α can be expected to exist has been studied by theoretical population geneticists; see, for example, Fu 20. However, the work of these authors has mostly concerned the expected number of mutations with minor allele frequency α, but we can also give a partial answer to this question probabilistically. In particular, we need to determine how large θ must be, in a simple version of the coalescent model from population genetics, in order to be confident that a mutation of the type we desire exists almost surely. Our results show that to guarantee that with probability 1 − ε there is a site with minor allele frequency at least α, one needs to set the parameter θ in the coalescent process to a constant depending only on α and ε; notably, the number of polymorphic sites needs only increase as the logarithm of the population sample's size. Our bounds are coarse, but again prove that one can have provably high success in solving synthetic HIPP instances.

4.2. An introduction to the coalescent model

The coalescent model describes the descent of a population under neutral evolution. Infinite-site constant-population coalescent models are a standard population genetics model to produce haplotypes, and have been used by several groups to generate HIPP problem test instances, by randomly pairing the resultant haplotypes 1, 5, 7. The coalescent approach is used, for example, in the program ms 23; see Hudson 22 for justification of this approach. We describe the model briefly, focusing on details we need; for full detail, see, for example, Hein et al. 21.

We use the model to generate rooted trees with p leaves, where each leaf represents a haplotype. We will describe the model going backward in time: we begin with the p leaves, and coalesce them to their common ancestor. Two kinds of events can occur as we move backwards in time: mutations and coalescences; the parameter θ governs which of these is more likely to happen. For our purposes, we focus on the event order, not the time between events.
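The event-order process just described can be sketched as a small simulation. The function name is hypothetical, and the transition probabilities used — a coalescence with probability (k−1)/(k−1+θ) when k lineages are active, otherwise a mutation at a brand-new site — are the standard neutral-coalescent ones:

```python
import random

def coalescent_haplotypes(p, theta, seed=0):
    """Event-order simulation of an infinite-sites coalescent: going
    backward in time from p lineages, each step is either a coalescence
    (probability (k-1)/(k-1+theta), k = active lineages) or a mutation
    at a new polymorphic site on a uniformly chosen lineage."""
    rng = random.Random(seed)
    # members[i] = set of leaf haplotypes below active lineage i
    members = [{i} for i in range(p)]
    mutations = []   # one set of carrier leaves per polymorphic site
    while len(members) > 1:
        k = len(members)
        if rng.random() < (k - 1) / (k - 1 + theta):
            i, j = sorted(rng.sample(range(k), 2))   # coalescence
            members[i] |= members[j]
            members.pop(j)
        else:
            mutations.append(set(rng.choice(members)))  # new site mutates
    # Haplotype = 0/1 vector over the sites generated along the way.
    return [[1 if leaf in carriers else 0 for carriers in mutations]
            for leaf in range(p)]

haps = coalescent_haplotypes(p=6, theta=3.0, seed=42)
```

Randomly pairing rows of the result (with replacement) then yields synthetic HIPP genotype instances of exactly the kind analyzed in this section.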

one of the active lineages is uniformly chosen. with the other new lineage resulting form the bifurcation having i minus that many descendants. Suppose that we produce p haplotypes by the coalescent model of Section J^. .2 by focusing on finding one edge with between ap and (1 — a)p descendants (a "good" edge) with a mutation.1. depends on both /x. we will have chosen the topology of our coalescent tree. it is actually easier to use a non-standard forward-in-time way to sample from this distribution. 6 = 2Nfj. The standard way to do this is as a branching process. a coalescence is the next event with probability k^~[}s. . A fact about the branching process is that if we pick one of the two lineages that result from the initial bifurcation. n — 1} (see. but a lineage is only active if it has more than one eventual descendant. After n — 1 such bifurcations. . are themselves chosen independently from the coalescent distribution with those number of nodes. the initial lineage is annotated to have n eventual descendants.l. . .3.3. Lemma 4.l. the tree we choose by this procedure is chosen from the same probabilistic distribution as for the classic model. we can apply Theorem 4.) If k lineages are active at a point in time. (In population genetics. For any e > 0. we then switch to the backward coalescent version of the process. Common mutations We now connect the coalescent model to our HIPP theorems.217 kinds of events can occur as we move backwards in time: mutations and coalescences. e. those with only one descendant will never bifurcate again. Since we are successively conditioning according to the probabilities of the traditional forward coalescent process. one with i eventual descendants and the other with n — i. but we can ignore this here. not backward. We can sample from the same distribution of tree topologies by thinking about the coalescent process by moving forward in time. this process is continued until there are n lineages present. 
by annotating each lineage with the number of eventual descendants that it will have. if we choose the parameter 6 of the coalescent model to be at least ^ ^ ( e 1 / ^ 2 0 . and a mutation with probability k_\+e. the effective population size. as in Section 2. First. When we choose a lineage with i eventual descendants to bifurcate. Our bounds are likely coarse as a consequence. When a coalescence is indicated. conditioned on having a lineage of our desired type. we sample the number of descendants that one of the new lineages will have uniformly from { 1 . the parameter 6 governs which of these is more likely to happen. we will have a site with minor allele frequency at least a. then with probability at least 1 — 2e. and assume 0 < a < 1/3 . All random choices are independent. Also. . 4. i — 1}. and N. however. a lineage is chosen from all of the i lineages. Ref 22.) For our purposes. The process continues until only one lineage remains. which is the common ancestor for all sites. If we use the model to generate p haplotypes. if 6 is w(l) as a function of p. Consider a coalescent tree with p leaves. such a site is chosen almost surely as p —* oo. two lineages are uniformly chosen and joined into a common ancestor. a mutation at a new polymorphic site is assigned to it. When a mutation is indicated.1 ) — 1). The probability that the top of a good edge exists in the tree at or before the ith bifurcation from the top is at least . We prove Theorem 4. its number of eventual descendants in the n-member population is uniformly distributed over { 1 .g.) Moreover. the structure of the two trees that result from this bifurcation. where we start with one lineage. Theorem 4. .2. We still pick our branching lineage uniformly at random from the active lineages.1. We establish haplotypes for the sequences. 
We will use the forward branching process model only to estimate the number of lineages at a time when a mutation occurring on one lineage in particular would be sufficient to create a mutation of our desired type. and then whenever a divergence event is indicated.2. we can generate trees from the coalescent distribution in a somewhat different-appearing formulation that is still equivalent. As such. we show that there is likely a good edge reasonably high in the tree. the mutation rate. if we have a polymorphic site with minor allele found in at least ap haplotypes. (Mutations occur in this process as in the backward conception of the coalescent. and it bifurcates to produce i +1 lineages.
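The forward-in-time branching process described above can be sketched in a few lines. This is an illustrative reimplementation (not the authors' code) that keeps only the descendant-count annotations: each lineage is active while it has more than one eventual descendant, and a uniformly chosen active lineage with i descendants splits into parts of size drawn uniformly from {1, ..., i−1} and the remainder.

```python
import random

def forward_topology(n, seed=0):
    """Sample a coalescent tree topology forward in time.

    Each lineage is annotated with its number of eventual descendants;
    a lineage stays active while that number exceeds one.  At each step
    an active lineage with i descendants is chosen uniformly and
    bifurcates; one new lineage gets a uniform draw from {1, ..., i-1}
    descendants and the other gets the remainder.  Returns the n-1
    bifurcations as (parent_count, left_count, right_count) tuples."""
    rng = random.Random(seed)
    lineages = [n]              # descendant counts of current lineages
    splits = []
    while any(c > 1 for c in lineages):
        active = [j for j, c in enumerate(lineages) if c > 1]
        j = rng.choice(active)
        i = lineages.pop(j)
        left = rng.randint(1, i - 1)
        lineages += [left, i - left]
        splits.append((i, left, i - left))
    return splits

splits = forward_topology(8)
```

After n − 1 bifurcations every lineage carries exactly one descendant, matching the claim that the process fixes the whole topology before the backward coalescent phase begins.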

Proof. At step i in the coalescent process, there exists a lineage with the most descendants, and we can concern ourselves with bifurcations on that lineage; this lineage has probability at least 1/i of being chosen to bifurcate. If it has fewer than (1 − a)p descendants, consider the first time this happens: we divided a value greater than (1 − a)p into two parts, both smaller than (1 − a)p. One of the parts is at least ap, so we have already seen a good edge. Otherwise, if it bifurcates, the probability of a good edge being produced is at least 1 − 2a (since a < 1/3), and this depends solely on a. Since all bifurcations are independent, we can upper bound the probability of no good edge occurring by level ℓ by

∏_{i=1}^{ℓ} (1 − (1−2a)/i) ≤ exp(−(1−2a) Σ_{i=1}^{ℓ} 1/i) ≤ exp((2a−1) log ℓ) = ℓ^{2a−1}. □

Thus, given ε > 0, with probability at least 1 − ε there is such an edge with start at or above the ℓ*_{a,ε}-th bifurcation, for ℓ*_{a,ε} = ε^{1/(2a−1)}. The bound as a function of the probability 1 − ε of a good edge at or above level ℓ*_{a,ε} is easily shown by arithmetic. Our bounds, while perhaps odd, are constant as a function of p. They indicate how far down in the tree one must look in order to be guaranteed a high probability of having found a good edge.

By Lemma 4.1, with probability at least 1 − ε there is a period in the coalescent history of the sequences during which there are fewer than ℓ*_{a,ε} lineages and where a mutation on one lineage would be a good mutation. (It may be more if there are lineages with only one descendant haplotype.) Now, working backward in time, conditioned on having a lineage of our desired type, the next event on that lineage (going forward) is a mutation with probability at least θ/(ℓ*_{a,ε} − 1 + θ); if there are more coalescences of other parts of the tree before an event on our good lineage, it only increases the probability of mutation preceding coalescence on the lineage of interest. Setting this probability equal to 1 − ε and solving for θ gives that if θ ≥ ((1 − ε)/ε)(ℓ*_{a,ε} − 1), then the probability of a mutation of minor allele frequency at least a is at least 1 − 2ε, and we can finish the proof of Theorem 4.2. □

4.3. Applicability to small populations

The previous theorems are asymptotic results depending on the value of p. However, a small experiment shows that they apply for small p as well. For a variety of values of p, we used Hudson's program ms 23 to generate a PPH matrix with varying values of the mutation parameter θ, and paired the haplotypes randomly to generate n distinct genotypes. Shown in Table 1 are the number of genotypes needed so that in 200 experiments, the generating set of haplotypes (after removing duplicates) was always optimal for HIPP and was the unique Min-PPH solution. (We verified the number of PPH solutions using Ding, Filkov and Gusfield's LPPH 24.) Even for moderate values of p and θ, 1.1 p log p genotypes satisfied these conditions.

Table 1. The smallest number of genotypes n for which all 200 trials passed the rank and PPH tests, as a function of the number of haplotypes p (30 to 200) and the mutation parameter θ (5 to 40).

5. THE ALGEBRAIC RANK OF NON-PPH INSTANCES

For PPH instances, rank(H) exactly equals its number of unique haplotypes. Sticking to such instances, if H is full rank, then the unique rows of H form an optimal solution to the HIPP instance G = H • H with exactly rank(G) haplotypes. For instances generated by models that include recombination, however, the situation is more complicated: H may not be full rank, and the rank of G may not equal the size of its optimal solution. We now study ranks of haplotype matrices in such models. We use the rank of H to prove a lower bound on the number of recombinations in a phylogenetic network that ex-
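As a quick numerical sanity check on these levels (our own sketch; the function names and the particular a and ε values are illustrative, and the product bound is the reconstructed one from the proof), one can evaluate the probability of missing a good edge directly and confirm that looking down to level ℓ*_{a,ε} = ε^{1/(2a−1)} suffices:

```python
import math

def no_good_edge_bound(a, levels):
    """Upper bound on the probability that no good edge (one with between
    a*p and (1-a)*p descendants) appears among the first `levels`
    bifurcations: each bifurcation of the largest lineage produces a good
    edge with probability at least 1 - 2a."""
    prod = 1.0
    for i in range(1, levels + 1):
        prod *= 1.0 - (1.0 - 2.0 * a) / i
    return prod

def level_for_confidence(a, eps):
    """Smallest level l with l**(2a-1) <= eps, i.e. l = eps**(1/(2a-1))."""
    return math.ceil(eps ** (1.0 / (2.0 * a - 1.0)))

a, eps = 0.3, 0.05
l_star = level_for_confidence(a, eps)
```

Since ∏(1 − (1−2a)/i) ≤ ℓ^{2a−1}, the product at level ℓ*_{a,ε} is at most ε, so the check is guaranteed to pass for any 0 < a < 1/3.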

Let H be a haplotype matrix with k unique rows derived from a phylogenetic network with r recombinations. r >k— rank(i?). the first k sites of the haplotype come from the parent corresponding to the prefix edge into the node and the remainder comes from the other parent. For any network. it is a tree and has a column of all ones. We will focus our attention on the rank of data coming from a galled tree.2. in which every edge is found in at most one recombination cycle. we assume that at the top of the network. Stated another way. but use of the bound may still be interesting to explore. We also note that this bound applies for unknown ancestral sequence. we remove that recombination node and its descendants. and thus may be stronger. a prefix of one lineage followed by a suffix of the second lineage. One interesting feature of our findings is that estimating the number of recombinations and performing haplotype inference by pure parsimony seem opposed to each other.4. We assume that the haplotype at any position in the network identifies the allele found at that network position for every site with a mutation in the network.219 plains a set of haplotypes. The leaves (indegree 1 and outdegree 0) correspond to current haplotypes. 1 .3 applies. we give a full characterization of the rank. which equals k — c + 1. the haplotype is the haplotype at the parent of the node. when the recombination node of one is a descendant of another. In either case. in the network. We discuss this more in Section 5. If there is a mutation found only in all p haplotypes. then the lemma applies and removing the p haplotypes drops the rank by p. At recombination nodes. so Lemma 3. Then rank(ff) > k — r. Mutations are assigned to edges of the network. The rank of data from phylogenetic networks We can easily relate the data rank to the number of recombinations in the network. the haplotype is all zeros. 5. we can give a partial order over recombination cycles.1. 
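The bound r ≥ k − rank(H) is cheap to evaluate. The following sketch is our illustration (not code from the paper): it assumes the rank is taken over the rationals and uses exact arithmetic, and it tries the bound on a toy matrix in which one haplotype is a recombinant (prefix of one row, suffix of another), which is exactly the situation that can create a linear dependency.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a 0/1 matrix over the rationals, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def recombination_lower_bound(haplotypes):
    """Theorem: a network explaining k distinct haplotypes with r
    recombinations satisfies rank(H) >= k - r, so r >= k - rank(H)."""
    unique = [list(h) for h in {tuple(h) for h in haplotypes}]
    return len(unique) - matrix_rank(unique)

H = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 1, 1, 1],   # recombinant: prefix of row 1, suffix of row 2
     [1, 0, 0, 0]]
lb = recombination_lower_bound(H)
```

Here the recombinant row is the sum of the first two rows, so rank(H) = 3 and the bound certifies at least one recombination; k − rank(H) is always non-negative, matching the remark in the text.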
but first give a general rank bound. removing the p haplotypes drops the rank by at least p — 1. A phylogenetic network is a rooted directed acyclic graph with edges pointing away ("down") from the root. from allele 0 to allele 1. 5. which is provably at least as strong as the commonly used "haplotype lower bound" 9 . and then identify an order from the bottom of the network to the top. where c is the number of unique columns in the matrix H. because there may be many different columns. The node is labelled to indicate which parental lineage takes each role and the discrete recombination breakpoint where the recombination occurs. If no such mutation exists. with zeros changed to ones at positions corresponding to any mutations on the edge separating them. consider a lowest recombination node. but not necessarily p. rank(JJ) is slower to compute than the number of distinct columns of H. An important element of a phylogenetic network are the recombination cycles between recombination nodes and coalescent nodes. a class of recombination networks. but k — rank(-ff) is always non-negative. Of course. Without loss of generality. At a coalescent node or a leaf. suppose its leaves have p unique haplotypes. labelled with position k. The haplotype bound is often negative. Below it is a tree. Recombination nodes (indegree 2 and outdegree 1) correspond to when two incoming lineages recombine to form a single lineage. as it can be adjusted in the standard way to apply to the case of a known ancestral sequence. For galled trees.25 defined a simple kind of recombination network. If the network has no recombinations. Proof. except for a one in a special site SQ that never mutates. galled trees. and have a network with one fewer recombination. . Gusfield et al. We prove this by induction on the number of recombinations in the network. Theorem 5 . it is similar in spirit to the "haplotype lower bound" 9 on the number of recombinations required to explain a haplotype matrix. 
Phylogenetic networks with recombination We first give a combinatorial description of this domain. Coalescent nodes (indegree 1 and outdegree 2) correspond to the most recent common ancestor of their descendant haplotypes. • This rank bound may be surprisingly useful. If not. which mutates once. Each mutation has a chromosomal position.

whereas when 0 is large. or between any two consecutive included nodes on the cycle are found two mutations on either side of the recombination breakpoint of the cycle (a "rankmaintaining pair"). many mutations will make it likely that the haplotype at a recombination node is linearly independent of the other haplotypes. Let hc be the haplotype at the coalescent node and hr be the haplotype at the recombination node. If there is a rankmaintaining pair. We begin with the rank-maintaining cases. so the recombination node is independent. In a recombination cycle. We must add hc to our set of haplotypes find out. The bounds from Theorems 5. or rank-confounding. except hr. • Going through each cycle obeying the partial order. a node is included if it represents a haplotype in H. and thus compute the overall rank of H. and the coalescent node of the cycle is included. In the case of galled trees. we can identify for each recombination whether it actually does reduce the rank or not. respectively. all mutations on the cycle can be individually isolated by subtracting haplotypes at consecutive included nodes. but there may be a dependency between them and the other haplotypes in the tree. all haplotypes in the cycle are independent. or • between the coalescent node and the first included cycle node on either side of the cycle. Proof. recombination is common. There are three types of node in the recombination cycle. When rank is high. or • the recombination node is included and has a mutation found in no other included node in the cycle. it may or may not be independent. the rank of H (and consequently G = II • H. so hr is independent. the most parsimonious set of haplotypes to produce the genotype matrix G is likely to • the recombination node is not included. Consecutive included nodes are nodes on either side of the cycle that have only unincluded nodes between them. If hc is already in H.4. Theorem 5. but they also possess other mutations not found in hr.2. 
The coalescent node and recombination node that together define the recombination cycle will simply be referred to as the coalescent node and recombination node. rank-maintaining. Consequences of the rank bounds for phylogenetic networks In the standard variation of the coalescent process that includes recombination. The sides of the cycle are the two directed paths from the coalescent node to the recombination node. recombination is rare. other haplotypes may possess mutation j . A recombination cycle is rank-decreasing if it is not rank-maintaining. the relative rates of recombination and mutation are given by two parameters p and 6. if II is full rank) is likely to be close to its number of unique haplotypes.2 can be read to say that if mutation is common relative to recombination. Elementary row operations can thus transform hr into hc. then every included haplotype that possesses one mutation from the pair also possesses the other. For rank-confounding cycles. . The rank of H will be the same as that of H with hr removed and hc added. that column is independent of all other columns. When p is large relative to 9. so we add a new haplotype to determine this. and no higher node on the cycle also represents that haplotype. A recombination maintaining if: cycle is rankA recombination cycle is rank-confounding if it is neither rank-maintaining nor rankdecreasing. then hr is not independent of the other haplotypes in H\ if it is not. 5. The case where hr is not included is trivial. We will inspect recombination cycles from the bottom of the network up. The other nodes in the cycle (although marking coalescence events) will be referred to as cycle nodes. If there is no rank-maintaining pair.1 and 5.3.220 5. and identify them as rankdecreasing. The algebraic rank of galled trees The rank of the haplotype matrix can decrease from full rank by at most the number of recombinations in the network that gives rise to the haplotypes. we can identify its effect on the rank. 
We need to detect whether the haplotype hr at the recombination node is independent of all other haplotypes. If there is a mutation j unique to hr among cycle nodes.
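The independence test for the recombination-node haplotype h_r reduces to a rank comparison: h_r is independent of the other haplotypes exactly when adding it raises the rank by one. A minimal sketch, assuming rank over the rationals (our illustration, not the paper's implementation):

```python
from fractions import Fraction

def rank(rows):
    """Exact rank of a 0/1 matrix over the rationals (Gaussian elimination)."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    for col in range(len(m[0]) if m else 0):
        piv = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                f = m[i][col] / m[rk][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

def is_independent(h_r, others):
    """True iff the recombination-node haplotype h_r is linearly
    independent of the other haplotypes: appending it raises the rank."""
    return rank(others + [h_r]) == rank(others) + 1
```

For example, with the haplotypes at the two sides of a cycle as the "others", a recombinant equal to their sum is reported dependent, while a haplotype carrying a private mutation is reported independent.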

2005. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 10. Harrower. R. Griffiths. Myers and R. interestingly. For data generated by a perfect phylogeny on p haplotypes. 4. Brown and I. In Proceedings of WABI 2004. Haplotype phasing using semidefinite programming. but may get some information about the number of recombinations. for the haplotype matrix H. the rank bound on the HIPP solution may be far from optimal. and Dan Gusfield. to guarantee a common mutation with probability 1-e.C. The International HapMap Consortium. Second. Integer programming approaches to haplotype inference by pure parsimony. 2004. S. Lindblad-Toh et al. 437(7063):1299-1300. we get little information about the minimum number of recombinations. Harrower. Brown and I. J. Genome sequence. 2. 2003. 16:348-359. B. 2003. Pinotti. 438(7069):803-809.G. In Proceedings of WABI 2004. Kalpakis and P. pages 254-265. Halldorsson. This may mean that for the genotype matrix G. John for sending us a copy of her paper with Sean Cleary 13 . But. We would like to thank Katherine St. just from the rank of G. Edwards. He and A. 2004. 2005. we can recover the haplotypes in polynomial time. for data generated with Hudson's popular ms package23. 5. and S. when recombination is more common than mutation. Unfortunately. 2003. 3. 7. Here we derive two interesting results. Nature. recombination. Haplotyping populations by pure parsimony: Complexity of exact and approximation algorithms. M. pages 144-155. S. through a Discovery Grant to D. 2005. 163:375-394. we may start to accumulate many rank-decreasing cycles. CONCLUSION We have presented several results about algebraic rank and HIPP instances. Lancia. Xu. Finally. L. We studied more closely data generated by the coalescent model often studied in population genetics. V. D. Lippert. with only a few times more genotypes and a common mutation. 8. Haplotype inference by pure parsimony. we cannot identify which of these bounds we have. 
2006.221 have close to the same number of distinct haplotypes as does H. we provide an interesting variant of the "haplotype lower bound". but may obtain a close match on the minimum number of haplotypes. Yun Song and Brian Golding for helpful conversations. Acknowledgements This research has been supported by the Natural Sciences and Engineering Research Council of Canada. A survey of computational methods for determining haplotypes. since this bound goes up as the rank goes down. 9. In Proceedings of BIBE 2005. 11. The algebraic structure of HIPP instances will likely have other fruitful consequences as well. we completely classify the algebraic rank of haplotype matrices derived from galled trees.R. M. and R. D. K. N. Haplotype inference by maximum parsimony. and Postgraduate and Canada Graduate Scholarships to I. we get no information about the minimum number of haplotypes. Wang and Y. A haplotype map of the human genome. the lower bound on the minimum number of recombinations to explain H will be increasingly accurate. comparative analysis and haplotype structure of the domestic dog. Nature. Gusfield. References 1. INFORMS Journal on Computing. Istrail. Bafna. Zelikovsky. This suggests a tension between the HIPP problem and estimating the number of recombinations: for instances with few recombinations. Rizzi. 6. 3(2):141-154.M. Moreover. when the number of distinct population members is more than (^)plogp. This shows that rank is still a close bound when the model allows . Bioinformatics. In Proceedings of CPM 2003. this is relevant. we examined the effect of adding recombination to generative model. Yooseph. In Com- 6. G. studying HIPP instances generated by randomly pairing haplotypes generated from two important models from population genetics.H. pages 242-253. the simplest type of phylogenetic network with recombination. C. A new integer programming formulation for the pure parsimony problem in haplotype analysis. For instances with lots of recombinations. 
D. for example. We showed that the constant value for the mutation parameter 9. V. Linear reduction for haplotype inference. First. the size of the optimal solution equals the rank of the genotype matrix. Genetics.B. Bounds on the minimum number of recombination events in a sample history. 19(14):1773-1780. Namjoshi. By contrast. G. K. pages 145-152. 2004.

putational Methods for SNPs and Haplotype Inference: DIMACS/RECOMB Satellite Workshop, volume 2983 of LNCS, pages 26-47, 2004.
12. R. H. Chung and D. Gusfield. Empirical exploration of perfect phylogeny haplotyping and haplotypers. In Proceedings of COCOON 2003, pages 5-19, 2003.
13. S. Cleary and K. St. John. Analyses of haplotype inference algorithms. Manuscript under review.
14. P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae Debrecen, 6:290-297, 1959.
15. B. Bollobás. Random Graphs. Cambridge University Press, 2nd edition, 2001.
16. R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
17. C. Van Nuffelen. On the incidence matrix of a graph. IEEE Transactions on Circuits and Systems, 23(9):572-572, Sep 1976.
18. V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. Haplotyping as perfect phylogeny: A direct approach. Journal of Computational Biology, 10:323-340, 2003.
19. Y. Fu. Statistical properties of segregating sites. Theoretical Population Biology, 48(2):172-177, 1995.
20. V. Bafna, D. Gusfield, S. Hannenhalli, and S. Yooseph. A note on efficient computation of haplotypes via perfect phylogeny. Journal of Computational Biology, 11:858-866, 2004.
21. J. Hein, M. Schierup, and C. Wiuf. Gene Genealogies, Variation and Evolution. Oxford University Press, 2005.
22. R. Hudson. Gene genealogies and the coalescent process. Oxford Surveys of Evolutionary Biology, 7:1-44, 1990.
23. R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337-338, 2002.
24. Z. Ding, V. Filkov, and D. Gusfield. A linear-time algorithm for the perfect phylogeny haplotyping (PPH) problem. In Proceedings of RECOMB 2005, pages 585-600, 2005.
25. D. Gusfield, S. Eddhu, and C. Langley. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. Journal of Bioinformatics and Computational Biology, 2(1):173-213, 2004.

the majority of which are from the ATH1 microarray2.

GLOBAL CORRELATION ANALYSIS BETWEEN REDUNDANT PROBE SETS USING A LARGE COLLECTION OF ARABIDOPSIS ATH1 EXPRESSION PROFILING DATA

Xiangqin Cui and Ann Loraine

1 Section on Statistical Genetics, Department of Biostatistics; 2 Department of Genetics; 3 Department of Medicine; 4 corresponding author
University of Alabama at Birmingham, Birmingham, AL 35294
{aloraine, xcui}@uab.edu

Oligo-based expression microarrays from Affymetrix typically contain thousands of redundant probe sets that match different regions of the same gene. We used linear regression and correlation to survey redundant probe set behavior across nearly 500 quality-screened experiments from the Arabidopsis ATH1 array manufactured by Affymetrix. Results are on-line at http://www.transvar.org/exp_cor/analysis.

1. INTRODUCTION

Affymetrix microarrays contain thousands of probes which are grouped into probe sets, collections of probes that (typically) hybridize to 300-500 bp sequence segments near the three prime ends of target transcripts. Due to the high frequency of alternative mRNA processing (splicing and polyadenylation) in many eukaryotic genomes, Affymetrix arrays typically include multiple probe sets that match predicted or known mRNA variants produced by the same gene. Because these probe sets measure different regions (or transcripts) of the same gene, we designate these as "redundant probe sets." Thanks to new public resources that archive and distribute expression data from hundreds of microarray experiments, it is now possible to survey the behavior of individual probe sets across many different experimental conditions and laboratory settings. The Nottingham Arabidopsis Stock Centre's NASCArrays is perhaps the acme of such services1. For a nominal fee, users can subscribe to the NASC AffyWatch service, which delivers quarterly DVDs bearing expression data in the form of array 'CEL' files, which contain numeric probe intensity data from microarray scans. These CEL files are contributed by users who enjoy discounted array processing service from NASC in exchange for donating their data for public use. Because the ATH1 array is based on a solved genome, data from NASC provide an unprecedented opportunity to investigate the long-range behavior of redundant probe sets. Toward this end, we analyzed coexpression patterns among redundant probe sets using a database that contained data from nearly 500 quality-screened ATH1 array hybridizations.

2. METHODS

We obtained probe set to gene mappings and gene structure annotations (version 6) from the Arabidopsis Information Resource (TAIR)3. To simplify the analysis, we purged all probe sets that mapped to multiple genes. Using methods described previously4, we created an expression database containing quality-screened data from 486 array hybridizations compiled from AffyWatch Release 1. Array data were processed using RMA5. We also processed the CEL files using MAS5.06 and generated Present, Absent, and Marginal
We found that expression values from redundant probe set pairs were often poorly correlated. sometimes thousands. followed by quantile-quantile normalization. and Marginal . we designate these as "redundant probe sets. 1.transvar. but neighboring loci. Others may reflect differential regulation of alternative 3-prime end processing. Array data were processed using RMA5. We used linear regression and correlation to survey redundant probe set behavior across nearly 500 quality-screened experiments from the Arabidopsis ATH1 array manufactured by Affymetrix. AL 35294 {aloraine. Due to the high frequency of alternative mRNA processing (splicing and polyadenylation) in many eukaryotic genomes. poor correlation was still observed for a substantial number of probe set pairs. we purged all probe sets that mapped to multiple genes. 2. Absent. These CEL files. Affymetrix arrays typically include multiple probe sets that match predicted or known mRNA variants produced by the same gene. Because the ATH1 array is based on a solved genome. which delivers quarterly DVDs bearing expression data in the form of array 'CEL' files. which contain numeric. Pre-filtering expression results using MAS5. Because these probe sets measure different regions (or transcripts) of the same gene. probe intensity data from microarray scans. data from NASC provide an unprecedented opportunity to investigate the long-range behavior of redundant probe sets. collections of probes that (typically) hybridize to 300-500 bp sequence segments near the three prime ends of target transcripts. are contributed by users who enjoy discounted array processing service from NASC in exchange for donating their data for public use.

including scatter plots showing regression results, are posted as Supplementary Files at our Web site http://www.transvar.org/exp_cor/analysis. All array processing was done using the BioConductor software under default settings7. We manually inspected gene models using the Integrated Genome Browser (http://genoviz.sourceforge.net) and the IGB Quickload site http://www.transvar.org/data/quickload, which serves probe set-to-genome alignments generated by Affymetrix and Arabidopsis gene annotations (versions 5 and 6). To assess cDNA evidence, we used the Sequence Viewer tool at the TAIR Web site.

3. RESULTS

The ATH1 array contains 21,148 probe sets that uniquely map to 20,987 protein-coding genes in the Arabidopsis genome as determined by extensive sequence analysis performed at TAIR. Of these 21,148 probe sets, 309 are redundant probe sets measuring 148 genes (Table 1). To simplify the analysis, we focused on the 142 genes interrogated by two probe sets each.

Table 1. Breakdown of redundant probe sets per gene on the ATH1 expression microarray.

  Probe sets per gene    Genes
  1                      20,839
  2                      142
  3                      4
  >3                     2

We hypothesized that if redundant probe sets measure related molecular entities, i.e., transcripts whose synthesis is driven by the same promoter, they should exhibit a high degree of correlation across a broad range of conditions. We used R to perform linear regression and compute Pearson's correlation coefficient for each pair of redundant probe sets that measure the same gene. Interestingly, we found that many redundant probe sets are not well-correlated (Figure 1A).

Figure 1. Correlation (r) computed using RMA expression values before (A) and after (B) PA filtering.

For some of the poorly correlated probe sets, the low degree of correlation may be the result of including low expression readings whose target gene was not actually expressed. It is commonly believed that less than half of the genes in a genome are simultaneously expressed8,9. If true, the readings from these probe sets might represent random noise and, therefore, exhibit low correlation. To reduce the influence of the probe set readings not derived from bona fide expression of their respective target genes, we ran the "present-absent" call procedure in MAS5.0 for each probe set on each array, generating Present, Absent, and Marginal "calls" for each probe set, and eliminated probe set readings called as "Absent" from the analysis. We then re-computed linear regression and Pearson's correlation coefficient (r) for the redundant probe set pairs in which both partners were called as "Present" in at least 20 chips. This filtering step removed 55 genes, leaving 87 for further correlation analysis.

This PA filtering followed by correlation analysis generated two notable results. First, we found that eliminating probe set readings called as "Absent" by MAS5.0 removed many of the genes that were found to be poorly-correlated according to the first (no PA filtering) analysis (Figure 1B). Second, we found surprisingly small correspondence in P versus A calls between redundant probe sets (Supplementary File 4).

We next explored the possibility that for some of the poorly correlated probe sets, the annotated structure of the target gene may have inappropriately merged adjacent or overlapping genes into a single gene model. If this were true, then we might observe a negative correlation between putative transcript size and
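The per-pair computation this analysis rests on (Pearson's r over the arrays where both probe sets are called Present) can be sketched with the standard library alone. This is our illustrative sketch, not the authors' R code; the 20-chip cutoff is taken from the text, and the function names are our own:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def filtered_r(vals_a, vals_b, present_a, present_b, min_chips=20):
    """Correlate one redundant probe-set pair, keeping only arrays where
    BOTH probe sets were called Present (the PA-filtering step); returns
    None when fewer than `min_chips` arrays survive the filter."""
    kept = [(a, b)
            for a, b, pa, pb in zip(vals_a, vals_b, present_a, present_b)
            if pa and pb]
    if len(kept) < min_chips:
        return None
    xs, ys = zip(*kept)
    return pearson(xs, ys)
```

The R-squared used later in the paper is simply the square of this r for a simple linear regression of one probe set on the other.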

expression correspondence between redundant probe sets, since inappropriately merged gene models would likely be unusually large. To test this, we computed Pearson's correlation coefficient comparing average transcript size per gene (log scale) and R-squared from the linear regression, which is the percentage of variation in one probe set that can be explained by variation in the other. (Note that transcript sizes are approximately log-normally distributed; see Supplemental File 2.) We found that there was indeed a weak negative correlation (r = -0.28) between average transcript size per gene and R-squared.

Many genes currently included in the Arabidopsis version 6 annotations are based originally on the results of computational analysis and manual curation. For many of these gene models, however, some additional evidence is needed before they can be accepted as accurate. Currently, the gold standard for assessing the correctness of a gene model is the existence of one or more full-length or partial cDNA sequences that cover the gene region in question. Using the Integrated Genome Browser to visualize probe sets and the TAIR Web site Sequence Viewer to visualize gene structures and cDNA alignments, we investigated cDNA support for genes whose redundant probe sets generated non-correlated expression values. Of nine genes with non-correlated redundant probe sets (r < 0.3), only one was supported by cDNA evidence covering the entire gene model, suggesting that some fraction of the genes with non-correlated, redundant probe sets might represent gene models that should be split.

However, visualization with the Integrated Genome Browser revealed that the probe sets associated with one of these genes (AT5G04440) appear to interrogate opposite strands of the chromosome, which explains why expression readings from these two probe sets were uncorrelated. No similarly trivial explanation could explain lack of correspondence (r = 0.22) between the two redundant probe sets interrogating AT4G12640. This lack of correspondence suggests that gene model AT4G12640 represents two independent transcriptional units.

Figure 2. A. Alignment of ATH1 redundant probe sets (254825_at and 254826_at) to Arabidopsis chromosome 4, at the AT4G12640 gene model; probes appear as vertical bars above rectangles representing the genomic alignment of probe set design sequences, from which the probe sequences were selected. B. Scatter diagram showing expression readings from the probe sets in A (R² = 0.0567). C. TAIR Sequence Viewer showing lack of full cDNA support for this gene model.

References Craigon DJ. Dudoit S. Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets. Yang JY. 6. Chen G. Reiser L. We suggest that this lack of correspondence can be used to improve annotation. A gene atlas of the mouse and human protein-encoding transcriptomes. Maechler M. Irizarry R. Hogenesch JB. Robust estimators for expression analysis. 4. Genome Res 2005. first as a means of checking probe set to gene mappings (as with AT5G04440) and second as a way to flag gene models that require further validation through cDNA sequencing or other means. 15: 1007-1014. Yoo D. Tanimoto G. Xu I. Ge Y. Irizarry RA. 1. Ching KA. Dettling M. 32: D575-577. Some discordance between redundant probe sets may arise from differential regulation of alternative splicing or polyadenylation.226 4.38:545-561. Strausberg RL. 4: 249-264. 3. Okyere J. Liu WM. Stevenson BJ. Zhou D. Gentry J. Yoon J. Wei H. Wu Y. Smyth G. Genome Biol 2004. Zhang P. 9. James N. DISCUSSION & CONCLUSIONS We found that large-scale analysis of redundant probe sets reveals a surprising lack of correspondence of expression values between probe sets annotated as interrogating the same gene. Nucleic Acids Res 2003. Miller N. Milne J. Huber W. Nucleic Acids Res 2004. Zhang J. Jongeneel CV. Berardini TZ. Persson S. In many cases. Lapp H. Acknowledgments We thank Tapan Mehta and Vinodh Srinivasasainagendra for superb programming support. 5. Scherf U. however. Mueller LA. Delorenzi M. Garcia-Hernandez M. Cooke MP. Iseli C. Page GP. Simpson AJ. Bolstad B. Vasicek TJ. Beavis W. Somerville CR. . Hobbs B. Kreiman G. Weems DC. Block D. Development and evaluation of an Arabidopsis whole genome Affymetrix probe array. Higgins J. Hayakawa M. 31: 224-228. Wiltshire T. research materials and community. Smith C. Batalov S. Speed TP. 2. Gautier L. Hornik K. Dixon D. Zhang J. Town CD. Antonellis KJ. Mundodi S. 5: R80. Biostatistics 2003. Kuznetsov D. Soden R. Haudenschild CD. 
Hubbell E. Funds from NSF grant 0217651 (PI David Allison) supported this work. Carey VJ. Haas BJ. Hothorn T. Beazer-Barclay YD. Khrebtukova I. 18: 15851592. Plant J 2004. Bioinformatics 2002. NASCArrays: a repository for microarray data generated by NASC's transcriptomics service. 101: 6062-6067. Sawitzki G. Redman JC. Leisch F. normalization. 8. Iacus S. and summaries of high density oligonucleotide array probe level data. Li C. 102: 8633-8638. Tierney L. Bioconductor: open software development for computational biology and bioinformatics. May S. Lander G. Rossini AJ. Proc Natl Acad Sci USA 2004. Tacklind J. Doyle A. Walker JR. Proc Natl Acad Sci USA 2005. Rhee SY. Montoya M. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized. Collin F. Jotham J. Gentleman RC. Ellis B. 7. Exploration. Mei R. Huala E. it is more likely to result from incorrect gene models. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Su AI. Bates DM. curated gateway to Arabidopsis biology.

DISTANCE-BASED IDENTIFICATION OF STRUCTURE MOTIFS IN PROTEINS USING CONSTRAINED FREQUENT SUBGRAPH MINING

Jun Huan, Deepak Bandyopadhyay, Jan Prins, Jack Snoeyink, Alexander Tropsha, Wei Wang
Computer Science Department and The Laboratory for Molecular Modeling, School of Pharmacy, University of North Carolina at Chapel Hill
Email: {huan, snoeyink, prins, weiwang}@cs.unc.edu, tropsha@email.unc.edu

Structure motifs are amino acid packing patterns that occur frequently within a set of protein structures. Using a graph representation of proteins, we formalize the structure motif identification problem as a frequent clique mining problem in a set of graphs Q and present a novel constrained clique mining algorithm to obtain recurring cliques from Q that satisfy certain additional constraints. The constraints are encoded in the graph representation of protein structure as pairwise amino acid residue distances; the pairwise distance constraints on each edge in a clique serve to limit the variation in geometry among different occurrences of a structure motif. Compared with contact graph representations, the number of spurious structure motifs is greatly reduced. Using this algorithm, structure motifs were located for several SCOP families including the Eukaryotic Serine Proteases, Papain-like Cysteine Proteases, Nuclear Binding Domains, and FAD/NAD-linked Reductases. For each family, we typically obtain a handful of motifs within seconds of processing time. The motifs were found to cover functionally important sites such as the catalytic triad for Serine Proteases and co-factor binding sites for Nuclear Binding Domains. The occurrences of these motifs throughout the PDB were strongly associated with the original SCOP family, as measured using a hyper-geometric distribution. The fact that many motifs are highly family-specific can be used to classify new proteins or to provide functional annotation in Structural Genomics Projects.

Keywords: protein structure comparison, structure motif, graph mining, clique

1. INTRODUCTION

This paper studies the following structural comparison problem: given a set Q of three-dimensional (3D) protein structures, identify all structure motifs that occur with sufficient frequency among the proteins in Q. Using a graph representation of proteins, a structure motif corresponds to a labeled clique that occurs frequently among the graphs representing the protein structures.

Our study is motivated by the large number of (> 35,000) 3D protein structures stored in public repositories such as the Protein Data Bank (PDB [4]). The recent Structural Genomics projects [28] aim to generate many new protein structures in a high-throughput fashion, which may further increase the available protein structures significantly. With fast growing structure data, automatic and effective knowledge discovery tools are needed to gain insights from the available structure data in order to generate testable hypotheses about the functional role of proteins and the evolutionary relationship among proteins.

Our study is also motivated by the complex relationship between protein structure and protein function [8]. It is well known that global structure similarity does not necessarily imply similar function. For example, the TIM barrels are a large group of proteins with remarkably similar global structures, yet widely varying functions. Conversely, similar function does not necessarily imply similar global structure: the most versatile enzymes, the hydro-lyases and the O-glycosyl glucosidases, are associated with 7 different global structural families [11]. Many globally dissimilar structures show convergent evolution of biological function. Because of the puzzling relationship between global protein structure and function, recent research effort in protein structure comparison has shifted to identifying local structural features (referred to as structure motifs) responsible for biological functions including protein-protein interaction, ligand binding, and catalysis [3, 5, 30-33]. A recent review of methods and applications involved in protein structure motif identification can be found in [19].

We present an efficient constrained subgraph mining algorithm to discover structure motifs in this setting. We define a labeled graph representation of protein structure in which vertices correspond to amino acid residues and edges connect pairs of residues; each edge is labeled by (1) the Euclidian distance between the Ca atoms of the two residues and (2) a boolean indicating whether the two residues are in physical/chemical contact.

Compared to other methods, our method offers the following advantages. First, by requiring structure motifs to recur among a group of proteins, rather than in just two proteins, we significantly reduce spurious patterns without losing structure motifs that have clear biological relevance. With a quantitative definition of significance based on the hyper-geometric distribution, our results are specific: as we show in our experimental study, the structure motifs we identify are specifically associated with the original family. This association may significantly improve the accuracy of feature-based functional annotation of structures from structural genomics projects. Second, our method is efficient: it usually takes only a few seconds to process a group of proteins of moderate size (ca. 30 proteins), which makes it suitable for processing protein families defined by various classifications such as SCOP or EC (Enzyme Commission).

The rest of this paper is organized as follows. In Section 1.1, we review recent progress in discovering protein structure motifs. In Section 2, we review definitions related to graphs and introduce the constrained subgraph mining problem. In Section 3, we discuss our graph representation of protein structures. In Section 4, we present a detailed description of our method, including a practical implementation of the algorithm that supports the experimental study in Section 5. Section 6 concludes with a brief discussion of future work.

1.1. Related work

There is an extensive body of literature on comparing and classifying proteins using multiple sequence or structure alignment, such as VAST [9] and DALI [12]. Here we focus on the recent algorithmic techniques for discovering structure motifs from protein structures, based on pair-wise amino acid residue interactions and the physical/chemical properties of the amino acid residues and their interactions in a protein structure. The methods can be classified into the following five types:

• Depth-first search, starting from simple geometric patterns such as triangles and progressively finding larger patterns [5, 25, 30].
• Delaunay tessellation (DT) [6, 20, 33], partitioning the structure into an aggregate of non-overlapping, irregular tetrahedra, thus identifying all unique nearest-neighbor residue quadruplets for any protein [33].
• String pattern matching methods that encode the local structure and sequence information of a protein as a string and apply string search algorithms to derive motifs [17, 18, 32].
• Geometric hashing, originally developed in computer vision, applied pairwise between protein structures to identify structure motifs [3, 24, 35].
• Graph matching methods comparing protein structures modeled as graphs and discovering structure motifs by finding recurring subgraphs [1, 10, 14, 22, 30].

Geometric hashing [21] and graph matching [38] methods have been extended for inferring recurring structure motifs from multiple structures, but both methods have exponential running time in the number of structures in a data set.

2. CONSTRAINED FREQUENT CLIQUE MINING

2.1. Labeled graphs

We define a labeled graph G as a four-element tuple G = (V, E, Σ, λ), where V is a set of vertices or nodes, E ⊆ V × V is a set of undirected edges, Σ is a set of (disjoint) vertex and edge labels, and λ: V ∪ E → Σ is a function that assigns labels to vertices and edges. We assume that a total ordering is defined on the labels in Σ. G' = (V', E') is a subgraph of G, denoted by G' ⊆ G, if V' ⊆ V and E' ⊆ (E ∩ (V' × V')); that is, E' is a subset of the edges of G that join vertices in V'.

Fig. 1. A database of three labeled graphs P, Q, and S. The mapping (isomorphism) q1 → p2, q2 → p1, and q3 → p3 demonstrates that clique Q is isomorphic to a subgraph of P, and so we say that Q occurs in P; the set {p1, p2, p3} is an embedding of Q in P. Similarly, graph S (a non-clique) occurs in both graph P and graph Q.
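As an illustration of these definitions, the labeled graph G = (V, E, Σ, λ) and a clique test can be sketched in a few lines of Python. This is our own illustrative sketch, not the authors' implementation; the class, node identifiers, and method names are ours, and the example graph mirrors the clique Q of Figure 1 (nodes labeled b, b, a with edge labels y, y, x).

```python
from itertools import combinations

class LabeledGraph:
    """A labeled graph G = (V, E, Sigma, lambda): nodes and edges carry labels."""
    def __init__(self):
        self.node_label = {}   # vertex -> node label
        self.edge_label = {}   # frozenset({u, v}) -> edge label (undirected)

    def add_node(self, v, label):
        self.node_label[v] = label

    def add_edge(self, u, v, label):
        self.edge_label[frozenset((u, v))] = label

    def is_clique(self, nodes):
        """True if every pair of distinct nodes is joined by an edge."""
        return all(frozenset(pair) in self.edge_label
                   for pair in combinations(nodes, 2))

# The clique Q of Figure 1: nodes b, b, a with edge labels y, y, x.
Q = LabeledGraph()
Q.add_node(1, "b"); Q.add_node(2, "b"); Q.add_node(3, "a")
Q.add_edge(1, 2, "y"); Q.add_edge(1, 3, "y"); Q.add_edge(2, 3, "x")
print(Q.is_clique([1, 2, 3]))   # -> True
```

A vertex set that is missing any pairwise edge fails the same test, which is exactly the distinction between Q (a clique) and S (a non-clique) in Figure 1.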

For example, the graph Q in Figure 1 is a clique, since each pair of (distinct) nodes is connected by an edge in Q, while S is not.

2.2. Graph matching

A fundamental part of our constrained subgraph mining method is to find an occurrence of a graph H within another graph G. More precisely, we say that graph H occurs in G if we can find an isomorphism between graph H = (VH, EH, Σ, λH) and some subgraph of G = (VG, EG, Σ, λG). An isomorphism from H to the subgraph of G defined by vertices V' ⊆ VG is a 1-1 mapping between vertices f: VH → V' that preserves edges and edge/node labels. The set V' is an embedding of H in G. This definition is illustrated in Figure 1. As part of our computational concern, we restrict ourselves to matching cliques, i.e. fully connected subgraphs.

2.3. Constraints on structure motifs

A constraint in our discussion is a function that assigns a boolean value to a subgraph such that true implies that the subgraph has some desired property and false indicates otherwise. For example, the statement "each amino acid residue in a structure motif must have a solvent accessible surface of sufficient size" is a constraint; it selects only those structure motifs that are close to the surface of proteins. The task of formulating the right constraint(s) is left for domain experts. In this paper, we answer the following two questions: (1) what types of constraints can be efficiently incorporated into a subgraph mining algorithm, and (2) how to incorporate a constraint if it can be efficiently incorporated. The answer to these two questions is the major contribution of this paper and is discussed in detail in Section 4.

2.4. The constrained frequent clique mining problem

Given a set of graphs Q, we choose a support threshold 0 < σ ≤ 1 and define the support of a clique C, denoted by s(C), as the fraction of graphs in Q in which C occurs; C is frequent if it occurs in at least fraction σ of the graphs in Q. Note that while C may occur many times within a single graph, for the purpose of measuring support these count as only one occurrence. Given a constraint p, the problem of Constrained Frequent Clique Mining is to identify all frequent cliques C in a graph database Q such that p(C) is true.

Figure 2 shows all cliques (without any constraint) which appear in at least two graphs in the graph database shown in Figure 1. If we use support threshold σ = 2/3 without any constraint, all six cliques A1 to A6 will be reported to users. If we increase σ to 3/3, only the cliques occurring in all three graphs will be reported. If we use support threshold σ = 2/3 together with the constraint that each clique should contain at least one node with label "a", only the frequent cliques containing such a node are reported.

Fig. 2. All (non-empty) frequent cliques A1-A6 with support ≥ 2/3 in the graph database of Figure 1.

3. HYBRID GRAPH REPRESENTATION OF PROTEIN STRUCTURES

3.1. Graph representation overview

We model a protein structure as a labeled undirected graph by incorporating pairwise amino acid residue distances and the contact relation in the following way. The nodes of our protein graphs represent the Ca atoms of each amino acid residue. We create edges connecting each and every pair of (distinct) residues, labeled by two types of information: (1) the Euclidian distance between the two related Ca atoms and (2) a boolean indicating whether the two residues have physical/chemical contact. More precisely, a protein in our study is a labeled graph P = (V, E, Σ, λ) where

• V is a set of nodes that represents the set of amino acid residues in the protein,
• E = V × V − {(u, u) | u ∈ V},
• Σ = ΣV ∪ ΣE is the set of disjoint node labels (ΣV) and edge labels (ΣE),
• ΣV is the set of 20 amino acid types,
• ΣE = R+ × {true, false}, where R+ is the set of positive real numbers, and
• λ assigns labels to nodes and edges.

In protein structure graphs, a clique corresponds to a structure motif with all pairwise inter-residue distances specified.
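Building the labeled edges of this representation can be sketched as below. This is an illustrative sketch, not the authors' code: the function name, the residue input format, and the use of a plain distance cutoff for the contact flag are our assumptions (the paper derives contacts from almost-Delaunay edges, as described in the experimental setup).

```python
from math import dist  # Euclidean distance between coordinate tuples

def protein_graph(residues, contact_cutoff=8.5):
    """Build the labeled protein graph of Section 3.1 from a list of
    (amino_acid_type, c_alpha_xyz) pairs. Every pair of distinct residues
    gets an edge labeled (distance, contact); the simple distance cutoff
    for 'contact' is a stand-in for the paper's almost-Delaunay contacts."""
    node_label = {i: aa for i, (aa, _) in enumerate(residues)}
    edge_label = {}
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            d = dist(residues[i][1], residues[j][1])
            edge_label[frozenset((i, j))] = (d, d <= contact_cutoff)
    return node_label, edge_label

# Three residues with invented coordinates, purely for illustration.
nl, el = protein_graph([("H", (0.0, 0.0, 0.0)),
                        ("D", (6.0, 0.0, 0.0)),
                        ("S", (0.0, 9.0, 0.0))])
print(el[frozenset((0, 1))])   # -> (6.0, True)
```

The edge label is exactly the ΣE pair (distance, boolean) defined above.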

Our graph representation can be viewed as a hybrid of two popular representations of proteins: the distance matrix representation [8] and the contact map representation [13, 38]. The major difference is that in our representation, geometric constraints take the form of pairwise distance constraints and are embedded into the graph representation to model geometrically conserved patterns; the absence of geometric constraints can lead to many spurious matches, as noticed in [25, 38]. Another difference is that we explicitly specify the "contact" relation; the contact relation enables us to incorporate various constraints into the subsequent graph mining process and further reduce irrelevant patterns. In practice we are not concerned with interactions over long distances (say > 13 Å), so proteins need not be represented by complete graphs; since each amino acid occupies a real volume, the number of edges per vertex in the graph representation can be bounded by a small constant. The graph representation presented here is similar to those used by other groups [25, 38]. In the following, we discuss how to discretize distances into distance bins.

3.2. Distance discretization

To map continuous distances to discrete values, we discretize distances into bins. The width of such bins is commonly referred to as the distance tolerance, and popular choices are 1 Å [22], 1.5 Å [5], and 2 Å [26]. In our system, we choose the median number 1.5 Å, which empirically delivers patterns with good geometric conservation. The mapping of distances to bins is shown in Figure 3.

Fig. 3. Mapping distances l to bins (unit: Å): l < 4 → 1; 4 < l < 5.5 → 2; 5.5 < l < 7 → 3; 7 < l < 8.5 → 4; 8.5 < l < 10 → 5; 10 < l < 11.5 → 6; 11.5 < l → 7.

3.3. A synthetic example of constraints

The following constraint is our driving example for constrained clique mining. The constraint states that we should only report frequent cliques that contain at least one edge with label "y". The symbol "y" is selected to make the constraint work best with the graph example we show in Figure 1. We name this simple constraint an edge label constraint, and we will show a specific graph normalization function that is prefix-preserving for it. Applying this constraint to all the frequent cliques shown in Figure 2, we find that only three cliques satisfy the constraint.

4. THE CONSTRAINED CLIQUE MINING ALGORITHM

In this section, we present a detailed discussion of (1) what types of constraints can be incorporated efficiently into a subgraph mining algorithm and (2) how to incorporate them. Our strategy relies on designing graph normalization functions that map cliques to one-dimensional sequences of labels. A graph normalization function is a 1-1 mapping N such that N(G) = N(G') if and only if G = G'; in other words, a graph normalization function always assigns a unique string to each unique graph. The string N(G) is the canonical code (code in short) of the graph G with respect to the function N.

Many graph normalization functions have a very desirable property: prefix-preservation. A graph normalization function is prefix-preserving if for every graph G there always exists a subgraph G' ⊆ G such that the code of G' is a prefix of the code of G. Examples of prefix-preserving graph normalization functions include the DFS code [39] and the CAM code [15]. As we prove later (Theorem 4.8), a prefix-preserving graph normalization function guarantees that no frequent constrained patterns can be missed with a generic depth-first search algorithm. The design challenge here is to construct a graph normalization function that is prefix-preserving in the presence of constraints.

4.1. A graph normalization function that does not support constraints

Before constructing a constraint-aware normalization, we introduce a normalization that does not support any constraints; our final solution will adapt this constraint-unaware graph normalization function. We use our previous canonical code [15] for graph normalization, outlined below for completeness.
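The discretization of Section 3.2 (Fig. 3) can be sketched as a small lookup. This sketch is ours, not the paper's code; in particular, how a distance falling exactly on a bin boundary is assigned is a convention choice, and we use strict upper bounds here.

```python
def distance_bin(l):
    """Map an inter-residue C-alpha distance l (in Angstroms) to one of the
    seven bins of Fig. 3; the bin width (distance tolerance) is 1.5 A.
    Boundary handling (strict '<' at each upper edge) is a convention choice."""
    boundaries = [4.0, 5.5, 7.0, 8.5, 10.0, 11.5]   # upper edges of bins 1-6
    for bin_id, upper in enumerate(boundaries, start=1):
        if l < upper:
            return bin_id
    return 7                                         # 11.5 < l

print([distance_bin(d) for d in (3.2, 5.0, 8.4, 12.9)])   # -> [1, 2, 4, 7]
```

With overlapping bins (Section 4.6), a distance near a boundary would simply receive both adjacent bin labels instead of one.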

Given an n × n adjacency matrix M of a graph G with n nodes, we define the code of M, denoted by code(M), as the sequence of lower triangular entries of M (including the node labels as diagonal entries) in the order M1,1 M2,1 M2,2 M3,1 M3,2 M3,3 ... Mn,1 ... Mn,n-1 Mn,n, where Mi,j represents the entry at the ith row and jth column in M. Since edges are undirected, adjacency matrices are symmetric and we are concerned only with the lower triangular entries of M. The lexicographic order of sequences defines a total order over adjacency matrix codes. Given a graph G, its canonical code, denoted by F(G), is the maximal code among all its possible codes.

Figure 4 shows examples of adjacency matrices and codes for the labeled graph Q shown in the same figure. Using the lexicographic order of symbols y > x > d > c > b > a > 0, we have code(M1) = "bybyxa" > code(M2) = "bybxya" > code(M3) = "aybxyb" > code(M4). Hence code(M1) = "bybyxa" is the canonical code F(Q) of the graph Q.

Fig. 4. All possible adjacency matrices M1-M4 for the clique Q. Since adjacency matrices of undirected graphs are symmetric, only the lower triangular part of the matrices is shown. F(Q) = code(M1) = "bybyxa" is the canonical code of the graph.

4.2. A graph normalization function that supports constraints

In this section, we introduce the definition of a generic graph normalization function ψ. In Section 4.3, we show that several widely used constraints for protein structure motifs lead to a well-defined function ψ. In Section 4.5, we show that the function ψ can be used in a depth-first constrained clique search procedure to make sure that each constrained frequent clique is searched once and exactly once.

Definition 4.1. Given the graph normalization function F with its codomain Γ and a constraint p, a generic graph normalization function ψ maps a graph G to Γ* recursively as:

    ψ(G) = F(G)                                     if p(G) is false,
    ψ(G) = F(G)                                     if |V(G)| = 1,
    ψ(G) = max {ψ(G')$F(G) : G' ⊂ G}                otherwise,

where G' ranges over subgraphs of G with size one less than G, and ψ(G')$F(G) is the concatenation of the code produced by ψ(G'), the symbol $, and the code F(G). We assume that the symbol $ is not in the set Γ and use this symbol to separate different parts of the generic graph normalization. The total ordering on strings (max) is defined by lexicographic order, with the underlying symbol ordering assumed from Γ.

Example 4.1. We apply the generic graph normalization to the graph Q shown in Figure 4, with the graph normalization F being the canonical code discussed in Section 4.1 and p being the edge label constraint; ψ(Q) is b$byb$bybyxa. In a search for ψ(Q), two subgraphs of Q that satisfy the edge label constraint are searched: one is a single edge connecting nodes with labels "b" and "b" with an edge label "y", and the other is a single edge connecting nodes with labels "a" and "b" with an edge label "y". Since the canonical code of the first ("byb") is greater than that of the second ("bya"), we put the string "byb" before the string "bybyxa" and obtain "byb$bybyxa". Finally, we add a single "b" before the string "byb$bybyxa" at the last step of the recursive definition, and we have ψ(Q) = b$byb$bybyxa. Note that "bybyxa" is a suffix of the code, since it is the canonical code of graph Q.

Theorem 4.1. ψ(G) exists for every graph G with the edge label constraint.

Proof. Let us first assume that a graph G contains an edge label "y". We claim that there always exists a subgraph G' of G that also contains the same label. This observation suggests that we can always find at least one G' in the recursive definition, and hence ψ(G) is defined. If the original G does not contain any edge label "y", its code is F(G), which is also defined.

Theorem 4.2. ψ is a 1-1 mapping and thus a graph normalization function.

Proof. If two graphs are isomorphic, they must give the same string. To prove that two graphs with the same string are isomorphic, notice that the last element of a label sequence produced by ψ is F(G), where F is a graph normalization procedure; as guaranteed by F, two identical sequences must imply the same graph.

Theorem 4.3. For all G such that p(G) is true, there exists a subgraph G' ⊂ G with size one less than G such that p(G') is true and ψ(G') is a prefix of ψ(G).

Proof. This property is a direct result of the recursive definition 4.1.

4.3. More examples related to protein structure motifs

Let us first view a real-world example of a constraint that is widely used in structure motif discovery. The connected component constraint (CC constraint for short) asserts that in a structure motif, each amino acid residue is connected to at least one other amino acid residue by a contact relation and that the motif is a connected component with respect to the contact relation. The intuition of the CC constraint is that a structure motif should be compact and hence has no isolated amino acid residue. To be formal, the CC constraint is a function cc that assigns value true to a graph if it is a connected component according to the contact relation and false otherwise.

As another example, the contact density constraint asserts that the ratio of the number of contacts to the total number of edges in a structure motif should be greater than a predefined threshold. Such a ratio is referred to as the contact density (density) of the motif, and the constraint is referred to as the density constraint. To be formal, the density constraint is a function d that assigns value true to a graph if its contact density is at least some predefined threshold and false otherwise. The intuition of the density constraint is that a structure motif should be compact and the amino acid residues in the motif should interact well with each other; it may be viewed as a stricter version of the CC constraint, which only requires a motif to be a connected component.

In the following, we show that the generic graph normalization function ψ is well defined for these two constraints.

Theorem 4.4. ψ(G) exists for every graph G with respect to the CC constraint or the density constraint.

Proof. We only show the proof for the CC constraint; that for the density constraint can be proved similarly. The key observation is that for every graph G of size n that is a connected component with respect to the node contact relation, there exists a subgraph G' ⊂ G of size n − 1 such that G' is a connected component according to the same contact relation. The observation is a well-known result from graph theory, and a detailed proof can be found in [15].

Following Theorem 4.4, we have the following theorem:

Theorem 4.5. ψ is a 1-1 mapping and prefix-preserving for the CC constraint or the density constraint.

Proof. This is a direct result of the recursive definition 4.1.

4.4. A sufficient and necessary condition

After working through several example constraints, we study the sufficient and necessary condition under which our graph normalization function ψ is well defined for a constraint p. It would be an awkward situation if we needed to define a new graph normalization procedure for each of the constraints discussed above. Fortunately, this is not the case. The following theorem formalizes the answer.

Theorem 4.6. ψ(G) exists for every graph G with respect to the constraint p if and only if for each graph G of size n such that p(G) is true, there exists a subgraph G' ⊂ G of size n − 1 such that p(G') is also true.

Proof. (if) For a graph G such that p(G) is true, if there exists a G' ⊂ G such that p(G') is also true, then by the definition of ψ, ψ(G) exists. (only if) If ψ(G) exists for every graph G with respect to a constraint p, then for a graph G such that p(G) is true, by the definition of ψ, we always have at least one G' ⊂ G such that p(G') is also true.

We notice that in proving Theorems 4.2 and 4.3, we do not use the definition of the constraint p. Therefore, we have the following theorem:

Theorem 4.7. If ψ is defined for every graph with respect to a given constraint p, then ψ is 1-1 and prefix-preserving.

Proof. The proofs of Theorems 4.2 and 4.3 carry over unchanged as long as ψ is defined for every graph with respect to p.

4.5. CliqueHashing

We have designed an efficient algorithm, CliqueHashing, for identifying frequent cliques from a labeled graph database with constraints, as described below. At the beginning of the algorithm, we scan the graph database and find all frequent node types (lines 1-4 of Figure 5). The node types and their occurrences are kept in a hash table counter. At a subsequent step, a frequent clique of size n ≥ 1 is picked from the hash table and extended to all possible (n+1)-cliques by attaching one additional node to its occurrences in all possible ways. These newly discovered cliques and their occurrences are again indexed in a separate hash table and enumerated recursively. The algorithm backtracks to the parent of a clique if no further extension from the clique is possible. The overall algorithm stops when all frequent node types have been enumerated.

CliqueHashing(Q, σ, p)
begin
1.   for each graph G ∈ Q, v ∈ V[G] do
2.       for each node label t ∈ λ(v) do
3.           counter[t] ← counter[t] ∪ {v}
4.       end for
5.   end for
6.   for each t ∈ counter do
7.       if s(t) ≥ σ and p(t) is true then
8.           T ← T ∪ {t} ∪ backtrack_search(t, counter[t])
9.       end if
10.  end for
11.  return T
end

backtrack_search(t0, O)
begin
1.   C ← ∅
2.   for each clique h ∈ O do
3.       C ← C ∪ {f | f = h ∪ {v}, v ∈ V[G] − h}
4.   end for
5.   for each occurrence of a clique f ∈ C do
6.       t ← ψ(f); counter[t] ← counter[t] ∪ {f}
7.   end for
8.   for each t ∈ counter do
9.       if s(t) ≥ σ and t0 ⊑ t and p(t) is true then
10.          T ← T ∪ {t} ∪ backtrack_search(t, counter[t])
11.      end if
12.  end for
13.  return T
end

Fig. 5. The CliqueHashing algorithm, which reports frequent cliques from a group of graphs Q with support at least σ and with a constraint p. Here s(G) denotes the support of a graph G, ψ is the graph normalization function defined in Definition 4.1, and x ⊑ y denotes that string x is a prefix of string y.

Fig. 6. The contents of the hash table counter after applying the CliqueHashing algorithm to the data set shown in Figure 1 with the edge label constraint: step 1 indexes the node labels and their occurrences, step 2 the frequent edges "bya" and "byb", and step 3 the frequent clique "bybyxa".

Theorem 4.8. If ψ is well defined for all possible graphs with the constraint p, the CliqueHashing algorithm identifies all frequent constrained cliques from a graph database exactly once.

Proof. The prefix-preserving property of Definition 4.1 implies that at least one subclique of a frequent clique will pass the IF statement of line 9 of the backtrack_search procedure; therefore the algorithm will not miss any frequent cliques in the presence of a constraint p. That the algorithm discovers every constrained frequent clique exactly once may not be obvious at first glance. The key observation is that, for a clique G of size n, there is only one subclique of size n − 1 whose code matches a prefix of ψ(G). To prove the observation, assume to the contrary that there are at least two such subcliques of the same size, both giving codes that are prefixes of ψ(G). One of the two codes must then be a prefix of the other (by the definition of prefix), which leads to the conclusion that one of the two subcliques must be a subclique of the other (by the definition of ψ). This contradicts the assumption that the two subcliques have the same size.

We illustrate the CliqueHashing algorithm, with the edge label constraint, in Figure 6.
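The hash-and-extend loop of Figure 5 can be sketched in Python as below. This is a simplified re-implementation for illustration only, not the authors' code: it runs level by level instead of recursively, uses a plain label-multiset signature in place of the prefix-preserving normalization ψ (adequate only for small examples), and omits the constraint check p.

```python
from collections import defaultdict
from itertools import combinations

def clique_key(graph, nodes):
    """A simple label-based signature for the induced clique: sorted node
    labels plus sorted (label, label, edge-label) triples. A stand-in for
    the normalization psi of Definition 4.1."""
    node_label, edge_label = graph
    vlabels = tuple(sorted(node_label[v] for v in nodes))
    elabels = tuple(sorted(
        tuple(sorted((node_label[u], node_label[v]))) + (edge_label[frozenset((u, v))],)
        for u, v in combinations(nodes, 2)))
    return (vlabels, elabels)

def clique_hashing(db, sigma):
    """Seed the hash table with node labels, then repeatedly extend every
    occurrence of a frequent clique by one node connected to all members."""
    n = len(db)
    frequent = []
    table = defaultdict(list)               # signature -> [(graph id, nodes)]
    for gi, (node_label, _) in enumerate(db):
        for v in node_label:
            table[((node_label[v],), ())].append((gi, (v,)))
    while table:
        next_table = defaultdict(list)
        for key, occs in table.items():
            if len({gi for gi, _ in occs}) / n < sigma:
                continue                    # support s(C) below threshold
            frequent.append(key)
            for gi, nodes in occs:          # extend each occurrence by one node
                node_label, edge_label = db[gi]
                for v in node_label:
                    if v in nodes:
                        continue
                    if all(frozenset((v, u)) in edge_label for u in nodes):
                        new = tuple(sorted(nodes + (v,)))
                        k = clique_key(db[gi], new)
                        if (gi, new) not in next_table[k]:
                            next_table[k].append((gi, new))
        table = next_table
    return frequent

# Two toy graphs, each containing the clique of Fig. 1 (nodes b, b, a with
# edge labels y, y, x); with sigma = 1.0 the triangle is reported as frequent.
g = ({1: "b", 2: "b", 3: "a"},
     {frozenset((1, 2)): "y", frozenset((1, 3)): "y", frozenset((2, 3)): "x"})
found = clique_hashing([g, g], sigma=1.0)
print(len(found))   # -> 6 (2 node labels + 3 labeled edges + 1 triangle)
```

The paper's algorithm additionally prunes with the prefix check of line 9, which is what guarantees each constrained clique is enumerated exactly once.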

234

% x / y Ps

Pi

**5. EXPERIMENTAL STUDY 5.1. Experimental setup
**

'*\x/y

%yj

^

J

,x/y

(P)

-'cu ft

(Q)

(S)

Fig. 7. A graph database of three graphs with multiple labels.

"d" "c" "b"

(PJ

(Ps) (s 4 l

<Pj>

"byb" "bya" (q,.q2) (q3.q2) {s,,s2} {s,,s,} step2

"bybyxa" (P2. P3. Pi) fq t . q 3 . q2) step3

(P,l {q,(

M

(s,) {sj "a" {p,l <q?} {s2} stepl

Fig. 8. the contents of the hash table counter after applying the CliqueHashing algorithm to the data set shown left.

4.6. CliqueHashing on Multi-labeled Graphs

A multi-labeled graph is a graph in which two or more labels may be associated with a single edge. The CliqueHashing algorithm can be applied to multi-labeled graphs directly, without major modification. The key observation is that our enumeration is based on occurrences of cliques (line 3 in function backtrack_search). In Figure 7, we show a graph database with three multi-labeled graphs, and in Figure 8 we show (pictorially) how the CliqueHashing algorithm applies to graphs with multiple labels. In the context of structure motif detection, handling multi-labeled graphs is important for the following reasons. First, because of the imprecision of 3D coordinate data, motif discovery must tolerate distance variations between different instances of the same motif. Second, partitioning the 1D distance space into distance bins is not a perfect solution, since distance variations cannot be handled well at the bin boundaries; in our application, distance bins may cause a significant number of motifs to be missed. Using a multi-labeled graph, we can solve this boundary problem by using "overlapping" bins to take care of the boundary effect.

5.1. Data sets

To exclude redundant structures from our analysis, we used the culled PDB list (http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html) with a sequence similarity cutoff of 90% (resolution = 3.0, R factor = 1.0). This list contains about one quarter of all protein structures in PDB; the remaining structures are regarded as duplicates of proteins in the list. We study four SCOP families: Eukaryotic Serine Protease (ESP), Papain-like Cysteine Protease (PCP), Nuclear Binding Domains (NB), and FAD/NAD-linked reductase (FAD). Each protein structure in a SCOP family was converted to its graph representation as outlined in Section 3. The pairwise amino acid residue contacts were obtained by computing the almost-Delaunay edges 2 with ε = 0.1 and with length up to 8.5 Å, as was also done in 14. Structure motifs from a SCOP family were identified using the CliqueHashing algorithm with the CC constraint, which states that "each amino acid residue in a motif should contact at least one other residue, and the motif should be a connected component with respect to the contact relation". Timings of the search algorithm were reported using the same hardware configuration as in 14. In Table 1, we document the four families, including their SCOP ID, the total number of proteins in the family (N), the support threshold used to retrieve structure motifs (σ), and the processing time (T, in seconds). In the same table, we also record all the structure motifs identified, giving each motif's composition (a sequence of one-letter residue codes), actual support value (K), number of occurrences outside the family among the representative structures in PDB (referred to hereafter as the background frequency) (S), and statistical significance in the family (P). The statistical significance is computed from a hyper-geometric distribution, specified in Appendix 7.1.

Images of protein structures were produced using VMD 16, and residues in the images were colored by residue identity using default VMD settings.
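The overlapping-bin idea can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the bin width and overlap values are made-up parameters, and only the 8.5 Å cutoff comes from the contact definition used in our experiments.

```python
def bin_labels(distance, bin_width=1.5, overlap=0.25, max_dist=8.5):
    """Return every distance-bin label whose widened interval contains
    `distance`. Each nominal bin [k*w, (k+1)*w) is widened by `overlap`
    on both sides, so a distance near a bin boundary receives both
    neighboring labels, yielding a multi-labeled edge."""
    labels = []
    k = 0
    while k * bin_width < max_dist:
        lo = k * bin_width - overlap
        hi = (k + 1) * bin_width + overlap
        if lo <= distance < hi:
            labels.append(k)
        k += 1
    return labels

# A distance well inside a bin gets a single label ...
print(bin_labels(2.0))   # -> [1]
# ... while a distance near the 3.0 boundary gets both adjacent labels,
# so two motif instances straddling the boundary can still match.
print(bin_labels(3.1))   # -> [1, 2]
```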

**5.2. Eukaryotic serine protease
**

The structure motifs identified from the ESP family are documented in the top part of Table 1. The data indicate that the motifs we found are highly specific to the ESP family, as measured by P-value < 10^-82. We have investigated the spatial distribution of the residues covered by


Table 1. Structure motifs identified in the four SCOP families. For each family we list its SCOP ID, the number of proteins N, the support threshold σ, and the processing time T (in seconds); for each motif, its composition (one-letter residue codes), actual support K, background frequency S, and −log(P). Families: Eukaryotic Serine Protease (ID: 50514), N = 56, σ = 48/56, T = 31.5 (38 motifs, e.g. DHAC, DHSC, DHSA, DSAC, DHSAC, with −log(P) between 76 and 100); Papain-like Cysteine Protease (ID: 54002), N = 24, σ = 18/24, T = 18.4 (five motifs: HCQS, HCQG, WWGS, WGNS, WGSG, each with K = 18); Nuclear Receptor Ligand-Binding Domain (ID: 48509), N = 23, σ = 17/23, T = 15.3 (four motifs, including DLQF, FQLL, and LQLL, with K = 17); FAD/NAD-linked Reductase (ID: 51943), N = 20, σ = 15/20, T = 90.0 (two motifs: AGGA and AGGG, each with K = 17).

those motifs by plotting all residues covered by at least one motif in the structure of a trypsin (1HJ9), shown in Figure 9. Interestingly, we found that all these residues are confined to the vicinity of the catalytic triad of 1HJ9, namely HIS57-ASP102-SER195, confirming the known fact that the geometry of the catalytic triad and its spatially adjacent residues is rigid, which is probably responsible for the functional specificity of the enzyme.

We found that there are five motifs that occur significantly (P-value < 10^-7) in another SCOP family: Prokaryotic Serine Protease (details not shown). This is not surprising, since prokaryotic and eukaryotic serine proteases are quite similar at both the structural and functional levels, and they share the same SCOP superfamily classification. None of the motifs has a significant presence outside these two families. Compared to our own previous study, which used a generic subgraph mining algorithm (without constraints and without utilizing pairwise amino acid residue distance information), and to the pairwise structural comparisons performed by other groups 1, 10, 22, 31, 29, 38, we report a significant improvement in the "precision" of structure motifs. For example, rather than reporting thousands of motifs for a small data set such as serine proteases 38, 14, we report a handful of structure motifs that are highly specific to the serine protease family (as measured by low P-values) and highly specific to the catalytic sites of the proteins (as shown in Figure 9). To further evaluate our algorithm, we randomly sampled two proteins from the ESP family and searched for common structure motifs. We obtained an average of 2300 motifs per experiment over a total of one thousand runs. Such motifs are characterized by poor statistical significance and were not specific to known functional sites in the ESP family. If we instead require a structure motif to appear in at least 24 of 31 randomly selected ESP proteins and repeat the same experiment, we obtain an average of 65 motifs per experiment, with improved statistical significance. This experiment demonstrates that comparing a group of proteins improves the quality of the motifs, as observed by 38.

Fig. 9. Left: Spatial distribution of residues found in 38 common structure motifs within protein 1HJ9. The residues of the catalytic triad, HIS57-ASP102-SER195, are connected by white dotted lines. Right: Performance comparison of graph mining (GM) and geometric hashing (GH) for structure motif identification.

Besides the improved quality of structure motifs, we observe a significant speed-up of our structure motif comparison algorithm compared to other methods such as geometric hashing. In the right part of Figure 9, we show a performance comparison of graph mining (GM) and geometric hashing (GH) 21 (executable downloaded from the companion website) for serine proteases. We notice a general trend: as the number of protein structures increases, the running time of graph mining decreases (since there are fewer common structure motifs) while the running time of geometric hashing increases. The two techniques have different sets of parameters that make any direct comparison of running times difficult; however, the trend is clear that graph mining scales better than geometric hashing for data sets containing a large number of protein structures.

**5.3. Papain-like cysteine protease and nuclear binding domain
**

We applied our approach to two additional SCOP families: Papain-Like Cysteine Protease (PCP, ID: 54002) and Nuclear Receptor Ligand-Binding Domain (NB, ID: 48509). The results are documented in the middle part of Table 1. For the PCP family, we identified five structure motifs which cover the catalytic CYS-HIS dyad and the nearby residues ASN and SER, which are known to interact with the dyad 7, as shown in Figure 10. For the NB family, we identified four motifs (a) which map to the cofactor binding sites 37, shown in the same figure. In addition, three members missed by SCOP (1srv, 1khq, and 1o0e) were identified for the PCP family, and six members (1sj0, 1rkg, 1osh, 1nq7, 1pq9, 1nrl) were identified for the NB family.

(a) Structure motifs 2 and 3 have the same residue composition, but they have different residue contact patterns and are therefore regarded as two patterns. They do not map to the same set of residues.

Fig. 10. Left: Residues included in the motifs from the PCP family in protein 1CQD. The residues in the catalytic dyad CYS27-HIS161 are connected by a white dotted line, and two important surrounding residues, ASN181 and SER182, are labeled. Right: Residues included in motifs from the NB family in protein 1OVL. The labeled residue GLN435 has a direct interaction with the cofactor of the protein.

**5.4. FAD/NAD binding proteins
**

In the SCOP database, there are two superfamilies of NADPH binding proteins, the FAD/NAD(P)-binding domains and the NAD(P)-binding Rossmann-fold domains, which share no sequence or fold similarity with each other. This presents a challenging test case for our system: can it find biologically significant patterns across the two groups? To address the question, we applied our algorithm to the largest family in the SCOP FAD/NAD(P)-binding domain superfamily: FAD/NAD-linked reductases (SCOP ID: 51943). With support threshold 15/20, we obtained two recurring structure motifs from the family, and both showed strong statistical significance in the NAD(P)-binding Rossmann-fold superfamily, as shown in the bottom part of Table 1. In Figure 11, we show a motif that is statistically enriched in both families; it has conserved geometry and interacts with the NADPH molecule in two proteins belonging to the two families. Notice that we did not include any information from the NADPH molecule during our search; we identified this motif because of its strong structural conservation among proteins in a SCOP superfamily. The two proteins have only 16% sequence similarity and adopt different folds (DALI z-score 4.5). The result suggests that significant common features can be inferred from proteins with no apparent sequence or fold similarity.

Fig. 11. The motif appears in two proteins, 1LVL (belonging to the FAD/NAD-linked reductase family, without a Rossmann fold) and 1JAY (belonging to the 6-phosphogluconate dehydrogenase-like, N-terminal domain family, with a Rossmann fold), with conserved geometry.


5.5. Random proteins

Our last case study is a control experiment to empirically evaluate the statistical significance of the structure motifs, independent of the P-value definition. To that end, 20 proteins were randomly sampled from the culled PDB list and searched for common motifs with support ≥ 15. The parameters 20 and 15 were chosen to mimic the size of a typical SCOP family. We repeated the experiment a million times and did not find a single recurring structure motif. Limited by the available computational resources, we did not test the system further; however, we are convinced that the chance of observing a random structure motif in our system is rather small.

6. CONCLUSION

We present a method to identify recurring structure motifs in a protein family with high statistical significance. The method was applied to selected SCOP families to demonstrate its applicability to finding biologically significant motifs with statistical significance. In future studies, we will apply this approach to all families in SCOP, as well as to families from other classification systems such as Gene Ontology and Enzyme Classification. The accumulated catalog of significant motifs characteristic of known protein functional and structural families will aid the annotation of protein structures resulting from structural genomics projects.

References

1. Peter J. Artymiuk, Andrew R. Poirrette, Helen M. Grindley, David W. Rice, and Peter Willett. A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. Journal of Molecular Biology, 243:327-344, 1994. 2. D. Bandyopadhyay and J. Snoeyink. Almost-Delaunay simplices: Nearest neighbor relations for imprecise points. In ACM-SIAM Symposium on Discrete Algorithms, pages 403-412, 2004. 3. J. A. Barker and J. M. Thornton. An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics, 19(13):1644-9, 2003. 4. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28:235-242, 2000. 5. Philip Bradley, Peter S. Kim, and Bonnie Berger. TRILOGY: Discovery of sequence-structure patterns across diverse proteins. Proceedings of the National Academy of Sciences, 99(13):8500-8505, June 2002. 6. S. A. Cammer, C. W. Carter, and A. Tropsha. Identification of sequence-specific tertiary packing motifs in protein structures using Delaunay tessellation. Lecture Notes in Computational Science and Engineering, 24:477-494, 2002.

7. K. H. Choi, R. A. Laursen, and K. N. Allen. The 2.1 angstrom structure of a cysteine protease with proline specificity from ginger rhizome, Zingiber officinale. Biochemistry, 38(36):11624-33, 1999. 8. I. Eidhammer, I. Jonassen, and W. R. Taylor. Protein Bioinformatics: An Algorithmic Approach to Sequence and Structure Analysis. John Wiley & Sons, Ltd, 2004. 9. J. F. Gibrat, T. Madej, and S. H. Bryant. Surprising similarities in structure comparison. Curr Opin Struct Biol, 6(3):683-92, 1996. 10. H. M. Grindley, P. J. Artymiuk, D. W. Rice, and P. Willett. Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J. Mol. Biol., 229:707-721, 1993. 11. H. Hegyi and M. Gerstein. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol, 288:147-164, 1999. 12. L. Holm and C. Sander. Mapping the protein universe. Science, 273:595-602, 1996. 13. J. Hu, X. Shen, Y. Shao, C. Bystroff, and M. J. Zaki. Mining protein contact maps. 2nd BIOKDD Workshop on Data Mining in Bioinformatics, 2002. 14. J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. Mining protein family specific residue packing patterns from protein structure graphs. In Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pages 308-315, 2004. 15. J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pages 549-552, 2003. 16. William Humphrey, Andrew Dalke, and Klaus Schulten. VMD - Visual Molecular Dynamics. Journal of Molecular Graphics, 14:33-38, 1996. 17. I. Jonassen, I. Eidhammer, D. Conklin, and W. R. Taylor. Structure motif discovery and mining the PDB. Bioinformatics, 18:362-367, 2002. 18. I. Jonassen, I. Eidhammer, and W. R. Taylor. Discovery of local packing motifs in protein structures. Proteins, 34:206-219, 1999. 19. Susan Jones and Janet M. Thornton. Searching for functional sites in protein structures. Current Opinion in Chemical Biology, 8:3-7, 2004. 20. Bala Krishnamoorthy and Alexander Tropsha. Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics, 19(12):1540-48, 2003. 21. N. Leibowitz, Z. Y. Fligelman, R. Nussinov, and H. J. Wolfson. Automated multiple structure alignment and detection of a common substructural motif. Proteins, 43(3):235-45, May 2001. 22. M. Milik, S. Szalma, and K. A. Olszewski. Common structural cliques: a tool for protein structure and function analysis. Protein Eng., 16(8):543-52, 2003. 23. N. Nagano, C. A. Orengo, and J. M. Thornton. One fold with many functions: the evolutionary relationships between

TIM barrel families based on their sequences, structures and functions. Journal of Molecular Biology, 321:741-765, 2002. 24. Ruth Nussinov and Haim J. Wolfson. Efficient detection of three-dimensional structural motifs in biological macromolecules by computer vision techniques. PNAS, 88:10495-99, 1991. 25. Robert B. Russell. Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. Journal of Molecular Biology, 279:1211-1227, 1998. 26. S. Schmitt, D. Kuhn, and G. Klebe. A new method to detect related function among proteins independent of sequence and fold homology. J. Mol. Biol., 323(2):387-406, 2002. 27. J. P. Shaffer. Multiple hypothesis testing. Ann. Rev. Psych., pages 561-584, 1995. 28. Jeffrey Skolnick, Jacquelyn S. Fetrow, and Andrzej Kolinski. Structural genomics and its importance for gene function analysis. Nature Biotechnology, 18:283-287, 2000. 29. R. V. Spriggs, P. J. Artymiuk, and P. Willett. Searching for patterns of amino acids in 3D protein structures. J Chem Inf Comput Sci, 43:412-421, 2003. 30. A. Stark and R. B. Russell. Annotation in three dimensions. PINTS: Patterns in non-homologous tertiary structures. Nucleic Acids Res, 31(13):3341-4, 2003. 31. A. Stark, A. Shkumatov, and R. B. Russell. Finding functional sites in structural genomics proteins. Structure (Camb), 12:1405-1412, 2004. 32. William R. Taylor and Inge Jonassen. A method for evaluating structural models using structural patterns. Proteins, July 2004. 33. A. Tropsha, C. W. Carter, S. Cammer, and I. I. Vaisman. Simplicial neighborhood analysis of protein packing (SNAPP): a computational geometry approach to studying proteins. Methods Enzymol, 374:509-544, 2003. 34. J. R. Ullman. An algorithm for subgraph isomorphism. Journal of the Association for Computing Machinery, 23:31-42, 1976. 35. A. C. Wallace, N. Borkakoti, and J. M. Thornton. TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci, 6(11):2308-23, 1997. 36. G. Wang and R. L. Dunbrack. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003. http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html. 37. Z. Wang, G. Benoit, J. Liu, S. Prasad, P. Aarnisalo, X. Liu, H. Xu, N. R. Walker, and T. Perlmann. Structure and function of Nurr1 identifies a class of ligand-independent nuclear receptors. Nature, 423(3):555-60, 2003. 38. P. P. Wangikar, A. V. Tendulkar, S. Ramya, D. N. Mali, and S. Sarawagi. Functional sites in protein families uncovered via an objective and automated graph theoretic approach. J Mol Biol, 326(3):955-78, 2003. 39. X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. International Conference on Data Mining, pages 721-724, 2002.

**7. APPENDIX
**

**7.1. Statistical significance of structure motifs
**

Any clique that is frequent in a SCOP family is checked against a data set of 6500 representative proteins from CulledPDB 36, selected from all proteins in the Protein Data Bank. For each clique c, we use Ullman's subgraph isomorphism algorithm 34 to search for its occurrence(s) and record the search result in an occurrence vector V = v_1, v_2, ..., v_n, where v_i is 1 if c occurs in the protein p_i, and 0 otherwise. Such cliques are referred to as structure motifs. We determine the statistical significance of a structure motif by computing the related P-value, defined by a hyper-geometric distribution 5. There are three parameters in our statistical significance formula: a collection of representative proteins M, which stands for all known structures in PDB; a subset of proteins T ⊆ M in which a structure motif m occurs; and a subset of proteins F ⊆ M, the family for which we would like to establish the statistical significance. The probability of observing a set K = F ∩ T of motif-containing family proteins with size at least k is given by the following formula:

    P-value = 1 - Σ_{i=0}^{k-1} [ C(|F|, i) · C(|M| - |F|, |T| - i) ] / C(|M|, |T|),    (1)

with C(n, r) denoting the binomial coefficient "n choose r",
and where |X| is the cardinality of a set X. For example, if a motif m occurs in every member of a family F and in no proteins outside F (i.e., K = F = T) for a large family F, we would estimate that this motif is specifically associated with the family; the statistical significance of such a case is measured by a P-value close to zero. We adopt the Bonferroni correction for multiple independent hypotheses 27: 0.001/|C|, where |C| is the number of categories, is used as the default threshold to measure the significance of the P-value of an individual test. Since the total number of SCOP families is 2327, a good starting point for the P-value upper bound is 10^-7.
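Equation (1) can be transcribed directly with exact binomial coefficients (`math.comb`, Python ≥ 3.8); summing the upper tail is numerically safer than computing 1 minus the lower tail. The numbers in the usage example are made-up toy values (only the 6500 representatives and the 0.001/2327 Bonferroni threshold come from this appendix), not results from the paper.

```python
from math import comb

def motif_p_value(M, F, T, k):
    """P-value that at least k of the |T| motif-containing proteins fall
    inside the family, under the hyper-geometric null of Eq. (1).

    M: number of representative proteins, F: family size |F|,
    T: number of proteins containing the motif |T|, k: observed |F ∩ T|.
    """
    total = comb(M, T)
    # upper tail: sum over all overlap sizes i >= k
    upper = sum(comb(F, i) * comb(M - F, T - i)
                for i in range(k, min(F, T) + 1))
    return upper / total

# Toy example: 6500 representatives, a family of 56, a motif seen in 60
# proteins, 50 of them inside the family (hypothetical numbers).
p = motif_p_value(6500, 56, 60, 50)
# Bonferroni-style threshold for 2327 families: 0.001 / 2327 ≈ 4.3e-7
print(p < 0.001 / 2327)
```

With k = 0 the sum covers the whole distribution and the P-value is exactly 1, a quick sanity check on the implementation.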

7.2. Background frequency

Using the culled PDB list (http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html) as discussed in Section 5.1, we obtain around 6000 proteins as the "representative proteins" in PDB. We treat these proteins as a sample from PDB and, for each motif, estimate its background frequency (the number of proteins in which the motif occurs) using graph matching. Specifically, each sample protein is transformed to its graph representation using the procedure outlined in Section 3, and we use subgraph isomorphism testing to obtain the total number of proteins in which the motif occurs.
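The counting step above can be sketched with plain backtracking over small node-labeled graphs; this is only an illustrative stand-in for Ullman's algorithm 34, which adds refinement and pruning on top of the same search, and the residue graphs at the end are toy data, not structures from the paper.

```python
def occurs_in(pattern, target):
    """True if the labeled graph `pattern` is subgraph-isomorphic to
    `target`. A graph is (labels, edges): labels is a list of node
    labels, edges a set of frozensets {u, v} of node indices."""
    p_labels, p_edges = pattern
    t_labels, t_edges = target

    def extend(mapping):
        i = len(mapping)                      # next pattern node to place
        if i == len(p_labels):
            return True
        for v in range(len(t_labels)):
            if v in mapping or t_labels[v] != p_labels[i]:
                continue
            # every pattern edge incident to i must map to a target edge
            if all(frozenset((mapping[u], v)) in t_edges
                   for u in range(i) if frozenset((u, i)) in p_edges):
                if extend(mapping + [v]):
                    return True
        return False

    return extend([])

def background_frequency(motif, proteins):
    """Number of protein graphs in which the motif occurs at least once."""
    return sum(1 for g in proteins if occurs_in(motif, g))

# Toy residue graphs: a D-H-S triangle motif searched in two targets.
motif = (["D", "H", "S"],
         {frozenset(e) for e in [(0, 1), (1, 2), (0, 2)]})
g1 = (["D", "H", "S", "A"],
      {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (2, 3)]})
g2 = (["D", "H", "A"],
      {frozenset(e) for e in [(0, 1), (1, 2)]})
print(background_frequency(motif, [g1, g2]))  # -> 1
```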


AN IMPROVED GIBBS SAMPLING METHOD FOR MOTIF DISCOVERY VIA SEQUENCE WEIGHTING

Xin Chen*, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore. *Email: chenxin@ntu.edu.sg

Tao Jiang, Department of Computer Science and Engineering, University of California at Riverside, USA. Currently visiting at Tsinghua University, Beijing, China. Email: jiang@cs.ucr.edu

The discovery of motifs in DNA sequences remains a fundamental and challenging problem in computational molecular biology and regulatory genomics, although a large number of computational methods have been proposed in the past decade. Among these methods, the Gibbs sampling strategy has shown great promise and is routinely used for finding regulatory motif elements in the promoter regions of co-expressed genes. In this paper, we present an enhancement to the Gibbs sampling method for the case where expression data of the concerned genes is given. A sequence weighting scheme is proposed that explicitly takes gene expression variation into account in Gibbs sampling. That is, every putative motif element is assigned a weight proportional to the fold change in the expression level of its downstream gene under a single experimental condition, and a position specific scoring matrix (PSSM) is estimated from these weighted putative motif elements. Such an estimated PSSM might represent a more accurate motif model, since motif elements with dramatic fold changes in gene expression are more likely to represent true motifs. This weighted Gibbs sampling method has been implemented and successfully tested on both simulated and biological sequence data. Our experimental results demonstrate that the use of sequence weighting has a profound impact on the performance of a Gibbs motif sampling algorithm.

1. INTRODUCTION

Discovering motifs in DNA sequences remains a fundamental and challenging problem in computational molecular biology and regulatory genomics 19, although a large number of computational methods have been proposed in the past decade. The motif finding problem can be simply formalized as the problem of looking for short segments that are overrepresented among a set of long DNA sequences. Previously proposed methods for finding motifs broadly fall into two categories: (a) deterministic combinatorial approaches based on word statistics 15, 14, 2, 6; and (b) probabilistic approaches based on local multiple sequence alignment 10, 1, 8, 11. Typical methods in the first category exhaustively search the promoter sequences of co-regulated genes for motifs of various sizes and then evaluate their significance by a statistical method, whereas methods in the second category rely on local search techniques such as

* Corresponding author.

expectation maximization and Gibbs sampling. The latter methods also usually represent a motif as a position specific scoring matrix (PSSM), which is also commonly referred to as a position weight matrix. Gibbs sampling has been shown to be a very promising strategy for motif discovery. The original implementation of Gibbs sampling was done in the site sampling mode, which assumes that there is exactly one motif element (notably a transcription factor binding site) located in each (promoter) sequence. Since its first application to finding conserved DNA motifs in the early 90's 10, quite a few improvements have been made in the literature to improve its effectiveness. These improvements include: (a) motif sampling allowing zero or multiple motif elements in each sequence 13; (b) incorporation of a higher-order Markov background model 18, 11; (c) column sampling allowing gaps within a motif 13; (d) incorporation of phylogeny information 17; and so on. Besides these enhancements, we observe below two important aspects common to all previous implementations of Gibbs sampling (and also common to most other motif finding algorithms).

First, the promoter DNA sequences upstream of a collection of co-expressed genes are often taken as the input to a motif finding algorithm. This is because co-expression is usually taken as evidence of co-regulation, which is in turn assumed to be controlled by a common motif. Molecular biology has been revolutionized by cDNA microarray and chromatin immunoprecipitation (ChIP) techniques, which allow us to simultaneously monitor the mRNA expression levels of many genes under various conditions. With gene expression data in hand, one generally applies a selected threshold on fold changes in the expression level under a single experimental condition relative to some control condition in order to retrieve a set of co-expressed genes. Here, a question naturally arises: how should the threshold be selected so that the motif can be found more easily and reliably? Notice that motif elements with large fold changes in expression are in general more likely to represent a true motif 12. However, the statistical significance (e.g., p-values) of these motif elements may not increase as the threshold increases, because raising the threshold may also simultaneously reduce the number of co-expressed genes. On the other hand, lowering the threshold may cluster more genes that are less likely to be co-regulated, and thus also decrease the statistical significance. This is a dilemma that was not addressed by any previous Gibbs sampling algorithm.

Second, a PSSM is commonly used to represent a probabilistic motif model by taking into account the base variation of motif elements at different positions. Specifically, given a PSSM Q of a motif, each component q_ij describes the probability of observing base j at position i in the motif.
In all previous implementations of Gibbs sampling, a PSSM is estimated from a set of putative motif elements, which are sampled from the input promoter sequences, with components proportional to the observed frequencies of each base at different positions. This relies on an implicit assumption that has never been questioned before: all motif elements, regardless of the downstream genes that they regulate, should contribute equally to the components of the motif PSSM. However, we know that expression levels (and fold changes) vary over a large range even among co-regulated genes, which suggests that the above assumption might not be fair. In other words, treating every motif element equally could result in an inaccurate PSSM. Note, however, that an accurate motif model for a transcription factor is essential to differentiate its true binding sites from spurious ones. In this paper, we address the above two problems together by one scheme, referred to as sequence weighting. It is natural to assume that motif elements with dramatic fold changes in expression are more likely to represent a true motif. Therefore, we want to estimate a PSSM such that it explicitly reflects such a (nonuniform) likelihood distribution over motif elements. One way to achieve this is to assign each motif element a weight, e.g., proportional to the fold change in expression, and then to estimate each component q_ij of the PSSM as the weighted frequency of base j at position i among all motif elements. One can see that a weighted PSSM favors putative motif elements showing large fold changes in expression. On the other hand, a putative motif element with small fold changes, which is less likely to represent the true motif, will not affect a weighted PSSM as much. The use of fold changes in expression as weights to estimate PSSMs implicitly assumes that the DNA sequences of motif elements exhibiting higher fold changes are more similar to the motif consensus pattern.
This is plausible, since such motif elements are more likely to represent the true motif. Moreover, since the binding energy of a transcription factor (TF) protein to a DNA site can be approximated as the sum of pairwise contact energies between the individual nucleotides and the protein 20, different binding sites may indeed have different affinities for their cognate transcription factors. In evolution, there is not only a selection force for TF binding sites to remain recognized by their TFs, but also a selection force for preserving the strength of binding sites 17, especially those showing dramatic fold changes in expression. We have incorporated the sequence weighting scheme into the Gibbs sampling algorithm originally

developed in 10, 7. In real applications on a set of co-expressed genes, we can assign each input promoter sequence a weight proportional to the fold change in gene expression obtained from a cDNA microarray experiment, or proportional to the so-called binding ratio determined by a genome-wide location analysis, a popular approach that combines a modified chromatin immunoprecipitation procedure with DNA microarray analysis to study genome-wide protein-DNA interactions and transcriptional regulation 16. In a genome-wide location analysis, a binding ratio is calculated by taking the average of the fold changes in expression over three independent microarray experiments. Our implementation of Gibbs sampling via sequence weighting has been successfully tested on both simulated and real biological sequence data. We considered two sets of genes regulated by the transcriptional activators Gal4 and Ste12, respectively, whose expression levels and binding ratios were determined by genome-wide location analysis 16. The test results show that the use of sequence weighting has a profound impact on the performance of the Gibbs motif sampling algorithm.

The rest of the paper is organized as follows. The next section introduces the basic Gibbs sampling algorithm and the proposed sequence weighting scheme. Preliminary experiments on simulated and real data are presented in Section 3. Section 4 gives some concluding remarks.

**2. GIBBS SAMPLING THROUGH SEQUENCE WEIGHTING
**

In this section, we start with a description of the basic Gibbs sampling algorithm, and then introduce the new method of estimating position specific scoring matrices (PSSMs) via sequence weighting.

**2.1. The Motif Model
**

A DNA motif is usually represented by a set of short sequences that are all binding sites of some transcription factor protein. Due to base variation at binding sites, the pattern of a DNA motif is conveniently described by a probabilistic model of base frequencies at each position, for which a common mathematical representation is the so-called position specific scoring matrix (PSSM). A PSSM Q consists of entries q_ij, which give the probabilities of observing base j at position i of a binding site. The main assumption underlying this motif model is that the bases occurring at different positions of a DNA motif are probabilistically independent.

Assume that we are given a set of N binding sites, s_1, s_2, ..., s_N, of width W each. Let J = 4 be the number of bases in the alphabet {A, C, G, T}, and let c_ij be the observed count/frequency of base j at position i. A widely used method to estimate a PSSM from these binding sites is simply given by (a)

    q_ij = c_ij / c_i,

where c_i is the sum of c_ij over the alphabet; that is, c_i = Σ_{j=1}^{J} c_ij. With the PSSM Q, we are able to estimate the probability P(s|Q) of an arbitrary sequence s being generated by the motif model as

    P(s|Q) = Π_{i=1}^{W} q_{i,s_i},

where s_i is the base of s at position i. On the other hand, a background sequence model V is estimated to depict the probabilities p_j of base j occurring in the background sequence. The probability of the sequence s being generated by V is given by

    P(s|V) = Π_{i=1}^{W} p_{s_i}.

Therefore, the likelihood that s is a true binding site of the motif of interest under the motif model Q versus the background model V is given by the ratio P(s|Q) / P(s|V). The most useful quantity characterizing the quality of a PSSM Q is its information content I, defined as

    I = (1/W) Σ_{i=1}^{W} Σ_{j=1}^{J} q_ij log(q_ij / p_j),

(a) In an actual implementation, a "pseudocount" should be added to each c_ij in order to avoid a zero frequency for any base not actually observed.

where the logarithm is often taken with base 2 to express the information content in bits. The information content thus ranges from 0 to 2, reflecting the weakest to the strongest motifs.

2.1. Basic Gibbs Sampling Algorithm

Gibbs sampling has proven to be a powerful strategy for finding DNA motifs. It is particularly promising in the discovery of weak (i.e., short) motifs. The basic motif finding problem is, given a set of DNA sequences, to look for short sequence segments of specified width W that are overrepresented among the input sequences. In practice, W ranges from 5 bp to 16 bp. In real biological applications, the input sequences (of a typical size of 800 bps) are usually taken from the upstream regions of co-expressed genes, and the output segments are putative binding sites. The most basic implementation of Gibbs sampling, known as a site sampler, assumes that there is exactly one binding site located in each input sequence. In Figure 1, we briefly summarize it in order to show how sequence weighting is incorporated into Gibbs sampling.

2.2. Estimating the PSSM via Sequence Weighting

To start, we introduce two more notations in order to incorporate sequence weights into the computation of a PSSM. First, let w_k be the weight associated with the input sequence S_k, reflecting in some way the contribution of the sequence S_k to the PSSM as discussed above. The sequence weights can be normalized so that they sum up to N. Second, we define a binary function δ(i, j, k) as

    δ(i, j, k) = 1 if S_k(i) = j, and 0 otherwise,

where S_k(i) is the base at position i of sequence k. Note that S_k[i, i + W - 1] denotes the substring of width W starting at i and ending at i + W - 1 in sequence S_k.

Given a set of N binding sites S_1, S_2, ..., S_N, of width W each, we propose to compute c_ij as the weighted count of base j at position i:

    c_ij = Σ_{k=1}^{N} w_k δ(i, j, k).

Then, we estimate q_ij as before, but using the weighted counts c_ij, i.e.,

    q_ij = c_ij / Σ_{j'=1}^{J} c_ij',    1 ≤ i ≤ W, 1 ≤ j ≤ J.

One can easily see that the above is a natural extension of the original construction of the PSSM, where the weights for all sequences involved were assumed to be equal. It is also easy to see that the extra running time caused by sequence weighting is negligible.

We have implemented the above sequence weighting scheme into the Gibbs motif sampler software developed in 10, 7. The details of this implementation are described in 10. Only the necessary parts of the source code have been modified so that we could make a fair comparison between the original Gibbs sampler and this modified version.

3. EXPERIMENTAL RESULTS

In order to test the performance of the above weighted Gibbs sampler, we have applied it to both simulated and real sequence data, and compared its results with the original Gibbs sampler 10, 7. The simulated data sets allow us to compare the performance of the algorithms in an idealized situation that does not involve the complexities of real data. For our tests on real data, we use two sets of genes in Saccharomyces cerevisiae (yeast) that were determined by ChIP-array experiments 16 to be co-regulated by two proteins, Ste12 and Gal4.

3.1. Simulated Data

To start, a motif model was created as follows. The seed transcription factor binding site is described by a consensus pattern, and 20 short DNA sequences (of width W) were randomly generated for binding sites of a common transcription factor with varying degrees of conservation.
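The weighted-count construction above can be sketched in a few lines (an illustrative Python fragment; the function name, the pseudocount value, and the dict-based PSSM representation are our choices, not the authors' code):

```python
def weighted_pssm(sites, weights, alphabet="ACGT", pseudo=0.5):
    """Estimate q_ij from aligned binding sites and sequence weights.

    c_ij = sum over k of w_k * delta(i, j, k), the weighted count of
    base j at position i; q_ij normalizes c_ij over the alphabet.
    A small pseudocount keeps unseen bases at a nonzero frequency.
    """
    W = len(sites[0])
    pssm = []
    for i in range(W):
        c = {b: pseudo for b in alphabet}
        for s_k, w_k in zip(sites, weights):
            c[s_k[i]] += w_k      # delta(i, j, k) = 1 only for j = s_k[i]
        total = sum(c.values())
        pssm.append({b: c[b] / total for b in alphabet})
    return pssm
```

Setting all w_k = 1 recovers the ordinary unweighted counts, which matches the observation that the scheme is a natural extension of the original PSSM construction.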

Input: A set of DNA sequences S_1, S_2, ..., S_N and the motif width W
Output: The starting position a_k of the motif in each sequence S_k; a PSSM Q = [q_ij] for the putative motif model

begin
  Initialization:
    Randomly select a position a_k for the motif in each sequence S_k
    Estimate the background base frequencies P_j, for j from 1 to J, to obtain P
  Repeat until convergence:
    Predictive update step:
      Randomly select a sequence S_z from the input sequences
      Take the set of putative binding sites {S_k[a_k, a_k + W - 1] | 1 ≤ k ≤ N, k ≠ z}
      Estimate the PSSM Q from {S_k[a_k, a_k + W - 1] | 1 ≤ k ≤ N, k ≠ z}
    Sampling step:
      Estimate P(S_z[n, n + W - 1] | Q) for every position n in sequence S_z
      Estimate P(S_z[n, n + W - 1] | P) for every position n in sequence S_z
      Randomly select a new position a_z in S_z according to L(S_z[n, n + W - 1] | P, Q)
end

Fig. 1. The basic Gibbs sampling algorithm.

In our simulation studies, we chose five different motif widths (W = 8, 10, 12, 14, 16), reflecting different levels of difficulty for motif finding. For each motif width, 100 test data sets were generated, giving rise to a total of 500 data sets. In each test data set, a set of 20 promoter sequences of 800 bases long were randomly generated, each with a binding site implanted at a randomly selected position. In the test, each promoter sequence was assigned a weight as the degree of conservation of the implanted binding site, based on the observation that binding sites with dramatic fold changes in gene expression are more likely to represent true motifs 12. The degree of conservation is measured by the Hamming distance to the consensus pattern, and the weakest binding sites have one half of their bases different from those at the corresponding positions of the consensus.

Both the original Gibbs sampler and the weighted version were applied to search for motifs in each data set, and the top three motifs were reported from each program. Each program was run twice on each test data set with the option of column sampling 13 turned on or turned off. A found motif is considered correct if its consensus sequence differs from the planted motif consensus pattern by at most two bases. We are interested in the number of times each program successfully detects the motif inserted in the 100 tests for each motif width, and the average rank of the correct motif if it comes up in the top three.

The results are summarized in Table 1. We can see that the weighted Gibbs sampling method was able to find more correct motifs than the original Gibbs motif sampler in all the tests; twice as many correct motifs were found by the weighted method when W = 8 (without column sampling) than by the original Gibbs sampling method.

3.2. Real Biological Data

3.2.1. Ste12

The transcription activator Ste12 is a DNA-bound protein that directly controls the expression of genes in response of haploid yeast to mating pheromones 16. We will use it to demonstrate how the sequence weighting scheme could boost the prediction accuracy of the Gibbs sampling method.

The genome-wide location analysis is a promising approach to monitor protein-DNA interactions across a whole genome 16. It combines a modified chromatin immunoprecipitation (ChIP) procedure with DNA microarray analysis in order to provide a relative binding of the protein of interest to a DNA sequence. Such an analysis on epitope-tagged Ste12 has determined that 29 pheromone-induced genes in yeast are likely to be directly regulated by Ste12 16.
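The weighted site sampler summarized in Figure 1 can be sketched end-to-end as a small Python toy (our own simplification: a uniform background, fixed pseudocounts, a fixed iteration count in place of a convergence test, and no column sampling; this is not the authors' modified sampler code):

```python
import random

BASES = "ACGT"

def site_sampler(seqs, weights, W, iters=2000, seed=0):
    """Weighted one-site-per-sequence Gibbs site sampler."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - W + 1) for s in seqs]   # initialization

    def pssm_from(ids):
        # Weighted counts c_ij with a pseudocount, normalized to q_ij.
        cols = []
        for i in range(W):
            c = {b: 0.5 for b in BASES}
            for k in ids:
                c[seqs[k][pos[k] + i]] += weights[k]
            t = sum(c.values())
            cols.append({b: c[b] / t for b in BASES})
        return cols

    for _ in range(iters):
        z = rng.randrange(len(seqs))                      # hold one sequence out
        q = pssm_from([k for k in range(len(seqs)) if k != z])
        # Score every width-W window of S_z by its likelihood ratio
        # against a uniform background, then sample a new start position.
        scores = []
        for n in range(len(seqs[z]) - W + 1):
            lr = 1.0
            for i, b in enumerate(seqs[z][n:n + W]):
                lr *= q[i][b] / 0.25
            scores.append(lr)
        pos[z] = rng.choices(range(len(scores)), weights=scores)[0]
    return pos
```

With equal weights this reduces to the original site sampler; a larger w_k makes sequence S_k pull the PSSM more strongly toward its current putative site.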

Table 1. Simulation results on 20 sequences of 800 bases. For each motif width (W = 8, 10, 12, 14, 16), the number of times the correct motif was found and its average rank are reported for Gibbs sampling via sequence weighting and for the original Gibbs motif sampler 7, 10, each with and without column sampling.

Of great interest is to find the sites bound by Ste12 in the promoter sequences of these 29 genes. For this purpose, we extracted up to 800 bp upstream regions of each gene from the Saccharomyces genome database, and assigned each sequence a weight as the relative binding ratio obtained from the genome-wide location analysis. Figure 2 lists these genes and their binding ratios extracted from 16. Both the original Gibbs sampling algorithm and our weighted version were run on all 29 sequences. Due to the stochastic nature of Gibbs sampling, we ran both programs 10 times with different random seeds, and each time the top ten putative motifs were reported. One can see that the same motif might be reported in different runs.

Of the 100 putative motifs, the original Gibbs sampling algorithm did not find any motif resembling the known Ste12 consensus pattern TGAAACA 5, but this was not surprising to us. Our algorithm, however, found the correct Ste12 motif six times in 10 runs, and ranked the Ste12 motif the second among all the putative motifs in terms of information content. Figure 2 lists the putative binding sites found by the weighted Gibbs sampling method upstream of the 29 genes regulated by Ste12, and Figure 3 shows the corresponding weighted PSSM. Note that, in the figure, names with all capital letters, such as STE12, are used to represent genes, and names that begin with a capital letter, such as Ste12, represent DNA binding motifs. The information content of this PSSM is 1.09, indicating a very strong motif that has been detected by our algorithm.

One can see from Figure 2 that, roughly speaking, the higher a relative binding ratio is, the closer the concerned binding site is to the known motif consensus pattern (in terms of sequence Hamming distance). In particular, each of the top six sequences in the table contains a binding site exactly matching the Ste12 motif consensus pattern. With sequence weighting, once some of these binding sites have been selected by chance, they are strongly favored in the construction of the (weighted) PSSM due to their large weights. This process tends to recruit more correct sites, which in turn further improve the specificity of the PSSM. This clearly shows the advantage of sequence weighting that we have implemented in the Gibbs sampling algorithm.

3.2.2. Gal4

Gal4 is among the most characterized transcriptional activators, which activates genes necessary for galactose metabolism 16. The genome-wide location analysis 16 found 10 genes to be regulated by Gal4 and induced in galactose, with varying relative binding ratios (see Figure 4). We performed the same experiment on Gal4 as we did on Ste12, and their experimental results were then compared.

Of the 100 putative motifs, the original Gibbs sampling algorithm once again failed to find any motif similar to the known Gal4 consensus pattern CGGN11CCG. This result was unexpected by us because the binding sites of Gal4 are actually well conserved among the input sequences (shown below), although we suspect that the algorithms might have found or missed the Gal4 motif by chance due to its very low statistical significance and the stochastic nature of Gibbs sampling. With sequence weighting, our algorithm successfully discovered the exact Gal4 motif with the highest information content (1.80) among the 100 putative motifs. These putative binding sites are listed in Figure 4, while the weighted PSSM is given in Figure 5.

The above experimental results are very encouraging. It provides another test showing that sequence weighting really improves the performance of the original Gibbs sampling algorithm.
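The information-content score used above to rank putative motifs can be sketched as follows (illustrative Python under two assumptions of ours: a uniform 0.25 background and a per-column average, which matches the stated 0-to-2 range; the function names are ours):

```python
import math

def information_content(pssm, background=0.25):
    """Average per-column relative entropy (in bits) of a PSSM.

    pssm: list of columns, each a dict mapping a base to its
    frequency q_ij.  With a uniform DNA background of 0.25 the
    score ranges from 0 (uninformative) to 2 (fully conserved).
    """
    total = 0.0
    for column in pssm:
        for q in column.values():
            if q > 0.0:                       # 0 * log(0) is taken as 0
                total += q * math.log2(q / background)
    return total / len(pssm)

def rank_motifs(candidate_pssms):
    """Order putative motifs from strongest to weakest."""
    return sorted(candidate_pssms, key=information_content, reverse=True)
```

Ranking the pooled motifs from all runs this way mirrors how the exact Gal4 motif surfaced as the top-scoring candidate.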

Fig. 2. Putative binding sites for the transcription activator Ste12 found by the weighted Gibbs sampling method. The column under Ratio lists the relative binding ratios obtained from the genome-wide location analysis, and the column under Prob lists the probabilities of the binding sites given by the motif model represented by the PSSM in Figure 3. The 29 genes are FUS1, STE12, FUS3, PEP1, PCL2, ERG24, PRH1, FIG2, PGH1, YIL169C, AGA1, HYH1, GIC2, YER019W, SPC25, FAR1, KAR5, FIG1, YIL037C, YOR343C, AFR1, PHO81, YOL155C, YIL083C, CIK1, YPL192C, SCH9, CHS1 and YOR129C.

Fig. 3. The weighted PSSM calculated from the putative binding sites in Figure 2 (base frequencies for A, C, G and T at positions 1-7).

The complete results of both tests are available at http://www.ntu.edu.sg/home/ChenXin/Gibbs.

4. DISCUSSION AND FUTURE RESEARCH

The selection of a suitable threshold value on expression level (or binding ratio) in order to retrieve a set of co-regulated genes and the construction of an accurate PSSM from a set of promoter sequences to

represent the true motif model are two delicate problems usually ignored by a Gibbs sampling strategy in motif discovery. Here, we try to tackle these problems by a sequence weighting scheme in order to improve the prediction accuracy of the basic Gibbs sampling algorithm. This was achieved by estimating a PSSM from the promoter sequences weighted proportionally to the fold changes in the expression of their downstream genes. Our preliminary experiments on simulated and real biological data have clearly shown the advantage of this sequence weighting scheme in a Gibbs sampling.

Fig. 4. Putative binding sites for the transcription activator Gal4 found by the weighted Gibbs sampling. The relative binding ratios and probabilities of the binding sites are displayed as in Figure 2. The ten genes are GAL1, GAL10, GAL3, GAL2, MTH1, GAL7, GAL80, GCY1, FUR4 and PCL10.

Fig. 5. The weighted PSSM calculated from the putative binding sites in Figure 4 (base frequencies for A, C, G and T at positions 1-17).

As we have noticed before, many computational methods have been proposed to identify motifs in the promoter regions of genes that exhibit similar expression patterns across a variety of experimental conditions 3. Such an approach treats the retrieved promoter sequences equally, regardless of their expression variations, and also treats all putative binding sites equally when representing a motif model. On the other hand, several computational methods that take advantage of gene expression variation have been developed 3, 4, 12, and they differ from ours in various aspects. For example, MDscan 12 uses a word-enumeration strategy to exhaustively search for motifs, and is thus a deterministic combinatorial approach. Moreover, it needs a threshold value on expression level in order to extract highly expressed genes. Our method does not require a preset threshold value.

Previous studies 9 showed that focusing on a single experimental condition is crucial for identifying experiment specific regulatory motifs. One reason for this is that averaging across experiments may destroy the significant relationship between the expression of genes and their regulatory motifs present only in a single experiment. Accordingly, our proposed method focuses on a single experimental condition (relative to a control condition).

To summarize, we have proposed in this paper a sequence weighting scheme for enhancing the motif finding accuracy of the basic Gibbs sampling algorithm. Gibbs sampling via sequence weighting can be effectively applied to find motifs when gene expression data is available. In the future, we would like to test this method on more real data sets with gene expression profiles and extend the method to gene expression data across multiple experimental conditions.

ACKNOWLEDGMENTS

Research supported in part by NSF grant CCR 0309902, NIH grant LM008991-01, NSFC grant 60528001, National Key Project for Basic Research (973) grant 2002CB512801, and a fellowship from the Center for Advanced Study, Tsinghua University.

References

1. T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., 2, 28-36, 1994.
2. H. J. Bussemaker, H. Li, and E. D. Siggia. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl. Acad. Sci. USA, 97, 10096-10100, 2000.
3. H. J. Bussemaker, H. Li, and E. D. Siggia. Regulatory element detection using correlation with expression. Nature Genetics, 27, 167-171, 2001.
4. E. M. Conlon, X. S. Liu, J. D. Lieb, and J. S. Liu. Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA, 100, 3339-3344, 2003.
5. M. Djordjevic, A. M. Sengupta, and B. I. Shraiman. A biophysical approach to transcription factor binding site discovery. Genome Res., 13, 2381-2390, 2003.
6. J. W. Dolan, C. Kirkman, and S. Fields. The yeast STE12 protein binds to the DNA sequence mediating pheromone induction. Proc. Natl. Acad. Sci. USA, 86, 5703-5707, 1989.
7. M. Gupta and J. S. Liu. Discovery of conserved sequence patterns using a stochastic dictionary model. J. Am. Stat. Assoc., 98, 55-66, 2003.
8. J. D. Hughes, P. W. Estep, S. Tavazoie, and G. M. Church. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205-1214, 2000.
9. S. Keles, M. van der Laan, and M. B. Eisen. Identification of regulatory elements using a feature selection method. Bioinformatics, 18, 1167-1175, 2002.
10. C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-214, 1993.
11. The Gibbs motif sampler software, http://www.fas.harvard.edu/~junliu/Software/gibbs9_95.tar
12. X. Liu, D. L. Brutlag, and J. S. Liu. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 127-138, 2001.
13. X. S. Liu, D. L. Brutlag, and J. S. Liu. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology, 20, 835-839, 2002.
14. A. F. Neuwald, J. S. Liu, and C. E. Lawrence. Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci., 4, 1618-1632, 1995.
15. P. A. Pevzner and S.-H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 269-278, 2000.
16. B. Ren et al. Genome-wide location and function of DNA binding proteins. Science, 290, 2306-2309, 2000.
17. I. Rigoutsos and A. Floratos. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics, 14, 55-67, 1998.
18. R. Siddharthan, E. D. Siggia, and E. van Nimwegen. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Computational Biology, 1, e67, 2005.
19. G. Thijs, M. Lescot, K. Marchal, S. Rombauts, B. De Moor, P. Rouze, and Y. Moreau. A higher order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics, 17, 1113-1122, 2001.
20. M. Tompa et al. Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology, 23, 137-144, 2005.


DETECTION OF CLEAVAGE SITES FOR HIV-1 PROTEASE IN NATIVE PROTEINS

Liwen You*

Computational Biology and Biological Physics Group, Department of Theoretical Physics, Lund University, Solvegatan 14 A, SE-22362 Lund, Sweden

Intelligent Systems Lab, School of Information Science, Computer and Electrical Engineering, Halmstad University, Box 823, SE-30118 Halmstad, Sweden

*Corresponding author. Email: liwen@thep.lu.se

Predicting novel cleavage sites for HIV-1 protease in non-viral proteins is a difficult task because of the scarcity of previous cleavage data on proteins in a native state. We introduce a three-level hierarchical classifier which combines information from experimentally verified short oligopeptides with secondary structure and solvent accessibility information from prediction servers to predict potential cleavage sites in non-viral proteins. By using this level of classification, the false positive ratio was reduced by more than half compared to the first level classifier using only the oligopeptide cleavage information. The best classifier using secondary structure information on the second level of the hierarchical classifier is the one using logistic regression. The method can be applied to other protease specificity problems too.

1. INTRODUCTION

Within the HIV-1 virion genome, gag and pol are two main genes. It is known that the gag gene encodes four separate proteins which form the building blocks for the viral core (i.e., matrix protein, capsid protein, and nucleocapsid protein) and the pol gene encodes four replication related proteins (i.e., protease, reverse transcriptase and integrase). Translation of the gag and gag/pol transcripts results in the Gag and GagPol polyproteins. During the HIV-1 virion maturation process, HIV-1 protease cleaves the viral Gag and GagPol polyproteins into structural and other replication proteins and makes it possible to assemble an infectious virion. Therefore, the cleavage of the polyproteins by HIV-1 protease plays an important role in the final stage of the HIV virion maturation process. Efficiently hindering the cleavage process is one way of blocking the viral life cycle. HIV-1 protease inhibitors are therefore part of the therapy arsenal against HIV/AIDS today. Efficiently cleaved substrates are excellent templates for the synthesis of tightly binding chemically modified inhibitors 1.

In the last two decades, several studies, including wet-lab experiments on HIV-1 protease cleavage of oligopeptides, have been performed to study cleavage specificity 2-5. It is known that the protease has an active site with eight subsites, where eight corresponding residues can be bound. However, the difficulty is that the protease cleaves at different sites with little or no sequence similarities. Moreover, it has been discovered that the protease acts on more than 20 variant non-viral proteins, such as Actin 6 and Vimentin 7. It raises questions with regard to the involvement of the protease in the breakdown of host proteins related to the immune system. On the other hand, little is known about what happens to the protease after its mutation and the postmaturation phases of the viral life cycle, and there is a lack of comprehensive information about the interaction between the protease and non-viral proteins. Therefore, the study of the susceptibility of host proteins in native states to hydrolysis by the protease is important to understand the role of HIV-1 protease in its host cell, the protein synthesis process, gene regulatory pathways and so on.

The two cleavage problems, cleaving of short oligopeptides and cleaving of native proteins, are related but different. Short oligopeptides or denatured proteins do not have folded structures. There are quite a lot of oligopeptides that have been experimentally verified as substrates of

HIV-1 protease, and they are more abundant in the literature than experiments on native proteins. We have in our previous work collected an extended data set with 746 octamers 8 and built a predictor with 92% sensitivity and specificity for predicting cleavage of short oligopeptides. Fortunately, there are much more data available for short oligopeptides.

In contrast to short oligopeptides, proteins in their native states are folded with complex structures. Native protein cleavage sites are rarely observed. There are only around 20 tested native protein substrates reported in the literature; in total, there are around 42 cleaved sites in those native proteins. On average, a protein with a length of about 400 amino acids has only one or two cleavage sites, which implies that the cleavage sites are in a tiny region of the whole protein sequence and structure space. The cleavage specificity of the protease is both sensitive to its context and broad. Due to little information about the cleavage of native proteins and insufficient structure information on proteins, this cleavage problem is much harder to attack than the one on short oligopeptides.

The two problems are related in the sense that a cleavage site in a short oligopeptide is very likely to be cleaved in a native protein if it is located at a surface exposed region. On the other hand, such sites might not be cleaved in native proteins, since the local environment makes the protease recognize some specific structures. In other words, a predictor based only on short oligopeptides does not work well on native proteins, as the example with Bcl2 shows. Taking the Bcl2 protein 9 as an example, it has 205 amino acids, but only one cleavage site. The predictor predicts 55 cleavage sites including the true one. This is not surprising, since some predicted cleavage sites might not be exposed to the protease or their local secondary structures may prevent the binding with the protease. A predictor based on short oligopeptides should discover all true cleavage sites but with lots of false positive ones. Therefore it is not possible to build a predictor for native proteins based on short oligopeptides alone.

However, lack of experimental structure information is a problem. Is it possible to get structure information for proteins? We know that, as far as ligands go, secondary structure cannot be readily defined unless they are peptidic compounds; a peptidic compound's structure can be defined by its φ and ψ angles. Although there are lots of PDB files describing ligand structure information, due to the rarity and complex structure it is almost impossible to find structure information for a whole protein, except a short part of it. Tyndall et al. 10, 11 have targeted the recognition of substrates and ligands by proteases based on PDB files. They found that proteases generally recognize the extended beta strand conformation in the active sites.

Fortunately, many research groups have developed secondary structure predictors. Today, some predictors can reach around 80% correct prediction performance. We use structure predictors to get secondary structure information; in this way, secondary structure information can be accessed. The same goes for solvent accessibility information. The risk here is that it contains noise. However, if it contains more information than noise, then it should still improve the prediction. By accessing some prediction servers to get secondary structure and solvent accessibility information, we can combine them with the information from short oligopeptides to build a predictor to determine the cleavage of native proteins.

The aim of the present work was to predict cleavage sites in native proteins by combining information from short oligopeptides and native proteins. This is complicated, since the information from short oligopeptides is difficult to transfer to native proteins. Because cleavage data on native proteins are so scarce, it would be hard to directly work on the native protein level, but this is important to do since experiments on oligopeptides are much easier to perform.

2. SYSTEMS AND METHODS

There are about 42 experimentally verified cleavage sites within 21 proteins with a total length of 8212 amino acids.
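The first-level scan illustrated by the Bcl2 example above, sliding the oligopeptide-trained predictor along a native protein sequence, can be sketched as follows (Python; `predict_octamer` is a hypothetical stand-in for the trained octamer classifier 8, and the toy rule below is ours):

```python
def candidate_sites(sequence, predict_octamer, width=8):
    """Slide a window of `width` residues (4 on each side of a potential
    scissile bond) along the protein and keep positions predicted cleaved.

    predict_octamer: a function mapping an 8-residue peptide to True/False,
    standing in for the oligopeptide-trained classifier.
    Returns 0-based indices of the residue to the left of each candidate bond.
    """
    half = width // 2
    hits = []
    for i in range(half, len(sequence) - half + 1):
        window = sequence[i - half:i + half]
        if predict_octamer(window):
            hits.append(i)
    return hits

# Toy stand-in predictor: flag windows whose central pair is "FP"
# (a Phe-Pro bond, one bond type HIV-1 protease is known to cleave).
toy = lambda pep: pep[3:5] == "FP"
```

On a real protein this first level deliberately over-predicts (as with the 55 candidate sites on Bcl2); the later structure and accessibility levels are what prune the false positives.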

With regard to the first classification level, our previous work 8, 13 has discussed how to build a classifier based on short oligopeptides. Boyd et al. 12 have built a publicly accessible bioinformatics tool to build computational models of protease specificity, which could be based on amino acid sequences, expert knowledge, and secondary/tertiary structure of substrates. However, their way to build prediction models was mainly based on protease specificity profiles, instead of a data-driven one. Furthermore, they used a rule based method to use secondary structure information, which is too flexible to tune, and they extracted accessibility surface area information from PDB files, which might not be available for interesting proteins.

2.1. Hierarchical classifier

Here we used a three-level hierarchical classifier to combine the information from oligopeptides and native proteins. Figure 1 illustrates the structure of the hierarchical classifier.

(1) At the first classification level, the predictor trained using short oligopeptides and denatured proteins, with a window size of 8 amino acids (meaning 4 residues at both sides of the cleavage site), moves along a protein sequence and predicts all possible cleavage sites. The first classification level should never miss a true cleavage site, in the sense that it should be tuned to never produce false negatives, at the cost of some more false positives. The predictor works like a filter on the protein sequence level that only removes a part of the true non-cleaved sites. Only sites predicted to be cleaved are collected, with true cleavage indicators (class labels).

(2) At the second level, secondary structure information around these predicted cleavage sites is collected with a larger window size to include residue interaction; the window does not need to be symmetric around its cleavage site. A predictor trained using the secondary structure information is used to check the cleaving at those sites.

(3) At the final step, solvent accessibility information with the same window size as the second step is collected around the remaining cleaved sites, in order to measure the fraction of exposed volume of all residues inside that window. If the residues inside the window are not exposed to the protease, the site should not be cleaved. A conservative measurement is taken in such a way that if 90% of the residues inside the window are buried, then the whole fragment is not accessible to the protease, and the site is thus removed from the cleaving list. Only those claimed to be cleaved at this step are classified as to be cleaved by the whole hierarchical classifier.

Fig. 1. A three-level hierarchical classifier: a predictor trained on short oligopeptides is applied along the native protein sequence, then secondary structure and solvent accessibility checks filter the candidate sites; sites surviving all three levels of prediction are reported as cleaved.

2.2. Data

Cleavage sites in 21 native proteins were collected from the literature 6, 7, 9, 14-23. Similar sequences with minor mutation sites were not included, since they contain redundant information. These protein sequences were submitted to a structure and solvent accessibility prediction server 24 (http://www.predictprotein.org/), where PROFsec

3. Algorithms

Two generative models, a naive Bayes classifier and a Bayesian inference model, and two discriminative models, logistic regression and support vector machines (SVM), were tested for the cleavage prediction. For a generative model, the data distribution is either known or assumed to be close to a well-known distribution; for a discriminative model, density estimation is not needed, as it works directly on the model to find optimum values for its parameters. PHDsec and PROFacc were used to get the secondary structure and solvent accessibility information, respectively.

3.1. Rare case detection

The cleavage site prediction problem is a rare case detection problem: there is a majority class and a minority class. The data set is very imbalanced, and a classifier that always predicts non-cleaved is correct in more than 97% of the cases. Although classification accuracy is a common measure for evaluating model performance, the accuracy metric is therefore not a suitable one for this problem. Good metrics for this problem are sensitivity, specificity, the geometric mean (G), which is the square root of the product of sensitivity and specificity, and the area under the ROC (receiver operating characteristic) curve; we use all of them to evaluate and compare models. Classifiers tend to be biased towards the majority class, but sampling methods (under-sampling of the majority class and over-sampling of the minority class) can compensate for this to some extent. We use the synthetic minority over-sampling technique (SMOTE) introduced by Chawla et al. 25, which introduces new data by randomly choosing a sample and interpolating new samples between it and its K nearest neighbors.

3.2. Naive Bayes

The secondary structure predictor outputs probabilities. We use the notation πj^i = (πEj^i, πHj^i, πLj^i) for the secondary structure probabilities at position j of sample i, where i = 1, ..., N (the number of cleaved samples) and the index j runs from 1 to J (the size of the input window). The numbers are provided by the secondary structure predictor and are normalized so that πEj^i + πHj^i + πLj^i = 1. The helix, strand and loop probabilities at each position can be interpreted as the probabilities of observing H, E and L at that position if we randomly draw new samples from the unseen but possible structure character sequence space around cleaved sites. We assume that all positions inside the window are independent; in total there are then 3 × J parameters for each class, and we use maximum likelihood to estimate those parameters. The posterior probability needed for the classification decision is computed using Bayes' theorem.

3.3. Bayesian inference

Each amino acid residue has a specific structure in native proteins, and HIV-1 protease recognizes specific structures. From the structure probability data set we can draw new samples representing possible structure patterns (e.g. HHHHL, LLL). We assume that the data πj^i for the cleaved class follow a Dirichlet distribution at each position j, and the same holds for the non-cleaved class. Using the drawn structure character sequences, we can estimate the parameters of the Dirichlet distributions at the different positions; we used a Gibbs-sampling method to implement this, and the Fastfit MATLAB toolbox was used to estimate the Dirichlet distribution parameters. When we predict for new structure probability data, we draw a set of structure sequences from it, use the Dirichlet distributions to calculate the probability of observing those structure character sequences, and average them to get the posterior probability.

3.4. Logistic regression

Logistic regression has the form

log [ P(θ = 1 | π) / P(θ = 0 | π) ] = w · π + b,

where π denotes the secondary structure probabilities for the residues inside a window and θ denotes the class label. The parameters, w and b, are fitted using maximum likelihood with the MATLAB StatBox toolbox (version 4.2).
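The logistic decision amounts to a sigmoid over the linear log-odds score. A minimal sketch follows; the weights, bias and window values are hand-picked toy numbers, not the fitted StatBox parameters.

```python
import math

def predict_cleaved(pi, w, b, cutoff=0.5):
    """P(theta = 1 | pi) from the logistic model with log-odds w . pi + b."""
    z = sum(wj * pj for wj, pj in zip(w, pi)) + b
    p = 1.0 / (1.0 + math.exp(-z))      # sigmoid of the log-odds
    return p, p >= cutoff

# toy window of J = 3 positions, flattened (H, E, L) probabilities per position
pi = [0.1, 0.8, 0.1,  0.2, 0.7, 0.1,  0.1, 0.6, 0.3]
w = [0.0, 1.0, -1.0] * 3                # hypothetical weights favoring strand (E)
p, cleaved = predict_cleaved(pi, w, b=-0.5)
print(round(p, 2), cleaved)             # -> 0.75 True
```

A strand-rich window scores above the cutoff here, while an all-loop window would fall well below it.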
3.5. SVM

Training an SVM with a very imbalanced data set makes the decision boundary biased towards the majority class. Random under-sampling of the majority class and the SMOTE over-sampling method were therefore used in our experiments to remove and add secondary structure probability data, respectively. The problem with this method is that there are quite a lot of parameters to tune (constraints, kernel parameters, sampling rate, the ratio between the two classes after sampling, and the number of nearest neighbors in SMOTE). Cross-validation was used to find optimum values for good generalization performance. We used the libSVM 26 MATLAB toolbox to train the SVM.

4. Experiments and results

4.1. Exploring the data set

We explored the structure sequence data from the secondary structure predictor output to see if the structure data set contains any information that separates the cleaved and non-cleaved classes. We use the window (15,15) to demonstrate it. Figure 2 displays the probabilities of observing L, H and E at each position for the non-cleaved (upper part) and cleaved (bottom part) classes.

Fig. 2. The probabilities of observing loop, helix and strand structures at each position for the non-cleaved (upper part) and cleaved (bottom part) class; the x-axes show the residue position inside the window.

For the non-cleaved class there is almost no structural difference between positions: each structure probability is uniformly distributed inside the window. For the cleaved class, strands are more likely and loops less likely to be observed in the vicinity of the cleavage site, which agrees with Tyndall's conclusion that an extended conformation is preferred at active sites. It is worth noting that the probability of observing helix structure also increases somewhat closer to the active site. This is probably due to the structure prediction performance on helix: the secondary structure prediction server states that "PHD as well as other methods focus on predicting hydrogen bonds", and occasionally strongly predicted (high reliability index) helices are observed as strands and vice versa (expected accuracy of PHDsec).

4.2. Experiments

After using the first level predictor on the 21 proteins having 42 true cleavage sites, the prediction results are as shown in Table 1.

Table 1. Prediction results after using the first level predictor on the 21 native protein sequences. Sensitivity = 100%, false positive rate = 16%, precision = 2.5%.

                            Predicted to be cleaved    Predicted to be non-cleaved
True cleavage sites                   42                            0
True non-cleavage sites             1613                         6557

We can see that after this step all true cleavage sites are kept, giving 100% sensitivity (TP/(TP+FN)), while 16% of the non-cleaved sites (false positive rate = FP/(FP+TN)) are predicted to be cleaved, at the cost of some more false positives. The precision (TP/(TP+FP)) is 2.5%, which means that for every true cleavage site the predictor predicts 38.4 non-cleaved sites to be cleaved. In total, 1655 sites are predicted to be cleaved and fed into the second level predictor.

The next experiment was to try the four different classifiers, two generative and two discriminative, with different window sizes.
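The rates quoted for the first-level predictor follow directly from the confusion counts in Table 1. A quick plain-Python check (counts taken from the table):

```python
import math

def rates(tp, fp, fn, tn):
    """Confusion-matrix metrics used in the paper."""
    sensitivity = tp / (tp + fn)          # TP / (TP + FN)
    precision = tp / (tp + fp)            # TP / (TP + FP)
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(sensitivity * specificity)  # geometric mean G
    return sensitivity, precision, g_mean

# counts for the true cleavage / non-cleavage sites (first-level predictor)
sens, prec, g = rates(tp=42, fp=1613, fn=0, tn=6557)
print(round(sens, 2), round(prec, 3), round(1613 / 42, 1))
# -> 1.0 0.025 38.4
```

The last printed number is the average count of false positives per true cleavage site, matching the 38.4 quoted in the text.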

The generalization performance of each classifier on the 1655 sites passed to the second level was estimated using the secondary structure information, with cross-validation done in the following way. The whole data set was randomly divided into two parts: 80% of the data set was used to train the classifier and the remaining 20% was used to test its prediction performance. The process was repeated 100 times for each window size. For SMOTE over-sampling, 5 nearest neighbors were used to interpolate new samples; the cleaved samples were over-sampled 3 times, and the uncleaved samples were under-sampled to reach the same number as the cleaved samples after over-sampling.

AUC (area under the ROC curve) was used to compare the four classifiers. Table 2 lists the three largest AUC values for each classifier, reporting mean values and standard deviations. The best classifier is the one with the logistic regression method, and SVM with sampling methods performs as well as logistic regression; the naive Bayes classifier is a little inferior to them, and Bayesian inference is almost the same as naive Bayes.

Table 2. The three best classification generalization performances (mean AUC, standard deviations in parentheses) for each of the four classifiers. The largest mean values are around 0.70 (e.g. 0.706 (0.075)), with standard deviations around 0.08 for all classifiers.

There are no exact criteria to choose the best cutoff value on the outputs of a classifier. For each classifier, cross-validation was therefore used during training to tune the cutoff value in such a way that it gives the best geometric mean value, and the generalization performance was estimated by applying this tuned cutoff value to the held-out test data. Generally, the ROC curve gives a good idea of how to pick the cutoff value for a specific requirement: if it is required to reach 90% sensitivity, for example, we can lower the cutoff value, but then get more false positives.

The third experiment was to estimate the influence of the cutoff value on sensitivity, specificity and precision with the logistic regression classifier. Table 3 lists the sensitivity, specificity and precision obtained with the tuned cutoff value on the test data set, with mean values and standard deviations together with the average TP, FP, FN and TN counts. Figure 3 displays the ROC curve of the best classifier (logistic regression); the upper curve is measured on the training data set and the lower curve on the test data set, and error bars with standard deviations are also displayed.

Although the performance variance is around 8%, we can see that, on average, after using the second predictor based on secondary structure information the precision increases to 4.4%, i.e. by a factor of 1.7 over the roughly 2.5% obtained if only the predictor based on short oligopeptides is used. In other words, while the first level classifier leaves 38 false sites for each true cleavage site, the full classifier predicts on average 22 false cleavage sites for each correct cleavage site.
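The SMOTE interpolation step used for over-sampling has a simple core: a synthetic sample is placed at a random point on the segment between a minority sample and one of its K nearest minority-class neighbors. A pure-Python sketch (toy data, not the Chawla et al. implementation):

```python
import random

def smote_like_sample(x, neighbors, rng):
    """Create one synthetic minority sample by interpolating between x and a
    randomly chosen one of its K nearest minority-class neighbors."""
    nb = rng.choice(neighbors)
    t = rng.random()                     # interpolation factor in [0, 1)
    return [xi + t * (ni - xi) for xi, ni in zip(x, nb)]

rng = random.Random(0)
x = [0.2, 0.7, 0.1]                      # a minority-class sample (toy data)
neighbors = [[0.3, 0.6, 0.2],            # its K = 2 nearest minority samples
             [0.1, 0.8, 0.0]]
print(smote_like_sample(x, neighbors, rng))
```

Each coordinate of the synthetic sample lies between the original sample and the chosen neighbor, so the new point stays inside the minority-class region.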

5. Discussion

Two discriminative models (logistic regression and SVM) and two generative models (naive Bayes and Bayesian inference) were used to build classifiers from secondary structure information. From our experiments there is no major difference between them, but the logistic regression classifier is the best among them on average, and discriminative approaches are better than generative ones in general. Logistic regression has few parameters and is fast to train, so it is the method of choice in this case. SVM with sampling methods has around 2 × J + 5 parameters in the model, and for the Bayesian method there are 2 × 3 × J parameters to estimate for the Dirichlet distributions; for generative models, a bad assumption about the data and model parameter distributions could affect performance quite a lot.

During the experiments, the secondary structure and solvent accessibility information was predicted by only one prediction server, and it has not been tested how sensitive the classifiers are to the predicted structure and accessibility information. Due to the lack of data, the hierarchical classifier also does not consider the cleaving order in a protein if there is more than one cleavage site: if a protein is cleaved at the first cleavage site, it is cleaved into two fragments, their secondary structures might change, and previously buried parts can be exposed to the protease. Another useful source of information is the absolute positions of the predicted cleavage sites. Normally it is impossible to have cleavage sites at the very end of a native protein, and we can use this rule to directly rule out some falsely predicted cleaved sites.

To conclude, the hierarchical classifier, which combines protein sequences, experimentally tested short oligopeptides, and protein secondary structure and solvent accessibility information, can be used to detect cleavage sites on native proteins. By using the second-level classification based on secondary structure information, the false positive ratio is more than halved compared to the classifier using only short oligopeptide information on the first level. Structure and solvent accessibility data therefore provide information to predict protease-substrate interactions. This method can also be used for other cleavage problems on native proteins.

Acknowledgments

The work was supported by the National Research School in Genomics and Bioinformatics hosted by Göteborg University, Sweden. The author also thanks Dr Martin Sköld, Dr Joel Tyndall, Prof Thorsteinn Rögnvaldsson and Dr Daniel Garwicz for discussions and pointers to important references.

References

1. Tomasselli AG, Hui JO, Adams L, Chosay J, Lowery D, Greenberg B, Yem A, Deibel MR, Zurcher-Neely H, and Heinrikson RL. Actin, troponin C, Alzheimer amyloid precursor protein and pro-interleukin 1 beta as substrates of the protease from human immunodeficiency virus. J Biol Chem 1991; 266(22):14548-53.
2. Shoeman RL, Honer B, Stoller TJ, Kesselmeier C, Miedel MC, Traub P, and Graves MC. Human immunodeficiency virus type 1 protease cleaves the intermediate filament proteins vimentin, desmin, and glial fibrillary acidic protein. Proc Natl Acad Sci USA 1990; 87(16):6336-6340.
3. Tozser J, Bagossi P, Weber IT, Copeland TD, and Oroszlan S. Studies on the symmetry and sequence context dependence of the HIV-1 proteinase specificity. J Biol Chem 1997; 272:16807-16814.
4. Ridky TW, Cameron CE, Cameron J, Leis J, Copeland T, Wlodawer A, Weber IT, and Skalka AM. Human immunodeficiency virus, type 1 protease substrate specificity is limited by interactions between substrate amino acids bound in adjacent enzyme subsites. J Biol Chem 1996; 271:4709-4717.
5. Cameron CE, Ridky TW, Bizub-Bender D, Skalka AM, Wlodawer A, Weber IT, Copeland T, Harrison RW, and Leis J. Programming the Rous sarcoma virus protease to cleave new substrate sequences. J Biol Chem 1996; 271:10538-10544.
6. Beck ZQ, Hervio L, Dawson PE, Elder JE, and Madison EL. Identification of efficiently cleaved substrates for HIV-1 protease using a phage display library and use in inhibitor development. Virology 2000; 274:391-401.
7. Tozser J, Zahuczky G, Bagossi P, Louis JM, Copeland TD, Oroszlan S, Harrison RW, and Weber IT. Comparison of the substrate specificity of the human T-cell leukemia virus and human immunodeficiency virus proteinases. Eur J Biochem 2000; 267:6287-6295.
8. You L, Garwicz D, and Rognvaldsson T. Comprehensive bioinformatic analysis of the specificity of human immunodeficiency virus type 1 protease. J Virol 2005; 79(19):12477-86.

9. Riviere Y, Blank V, Kourilsky P, and Israel A. Processing of the precursor of NF-kappa B by the HIV-1 protease during acute infection. Nature 1991; 350(6319):625-6.
10. Strack PR, Frey MW, Rizzo CJ, et al. Apoptosis mediated by HIV protease is preceded by cleavage of Bcl-2. Proc Natl Acad Sci USA 1996; 93(18):9571-6.
11. Oswald M and von der Helm K. Fibronectin is a nonviral substrate for the HIV proteinase. FEBS Lett 1991; 292(1-2):298-300.
12. Meier UC, Billich A, Mann K, Schramm HJ, and Schramm W. Alpha 2-macroglobulin is cleaved by HIV-1 protease in the bait region but not in the C-terminal inter-domain region. Biol Chem Hoppe Seyler 1991; 372(12):1051-6.
13. Freund J, Kellner R, Konvalinka J, Wolber V, Krausslich HG, and Kalbitzer HR. A possible regulation of negative factor (Nef) activity of human immunodeficiency virus type 1 by the viral protease. Eur J Biochem 1994; 223(2):589-93.
14. Tomasselli AG, Howe WJ, Hui JO, et al. Calcium-free calmodulin is a substrate of proteases from human immunodeficiency viruses 1 and 2. J Biol Chem 1991.
15. Proteolysis of an active site peptide of lactate dehydrogenase by human immunodeficiency virus type 1 protease. Biochemistry 1992; 31(42):10153-68.
16. Production of chemokines CTAPIII and NAP/2 by digestion of recombinant ubiquitin-CTAPIII with yeast ubiquitin C-terminal hydrolase and human immunodeficiency virus protease. Protein Expr Purif 1997; 10(1):1-9.
17. Chattopadhyay D, Evans DB, Deibel MR Jr, Vosters AF, Eckenrode FM, Einspahr HM, Hui JO, Tomasselli AG, Zurcher-Neely HA, Heinrikson RL, and Sharma SK. Purification and characterization of heterodimeric human immunodeficiency virus type 1 (HIV-1) reverse transcriptase produced by in vitro processing of p66 with recombinant HIV-1 protease. J Biol Chem 1992; 267(20):14227-32.
18. Alvarez E, Menendez-Arias L, and Carrasco L. The eukaryotic translation initiation factor 4GI is cleaved by different retroviral proteases. J Virol 2003; 77:12392-12400.
19. Alvarez E, Castello A, Menendez-Arias L, and Carrasco L. Human immunodeficiency virus protease cleaves poly(A) binding protein. Biochem J 2006; Immediate Publication, doi:10.1042/BJ20060108.
20. Tyndall JD, Nall T, and Fairlie DP. Proteases universally recognize beta strands in their active sites. Chem Rev 2005; 105(3):973-99.
21. Fairlie DP, Tyndall JD, Reid RC, Wong AK, Abbenante G, Scanlon MJ, March DR, Bergman DA, Chai CL, and Burkett BA. Conformational selection of inhibitors and substrates by proteolytic enzymes: implications for drug design and polypeptide processing. J Med Chem 2000; 43(7):1271-81.
22. Rognvaldsson T and You L. Why neural networks should not be used for HIV-1 protease cleavage site prediction. Bioinformatics 2004; 20(11):1702-1709.
23. Boyd SE, Garcia de la Banda M, Pike RN, Whisstock JC, and Rudy GB. PoPS: a computational tool for modeling and predicting protease specificity. Proceedings of the IEEE Computer Society Bioinformatics Conference, Stanford, CA, page 372, 2004.
24. Rost B, Yachdav G, and Liu J. The PredictProtein server. Nucleic Acids Research 2004; 32(Web Server issue):W321-W326.
25. Chawla NV, Bowyer KW, Hall LO, and Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 2002; 16:321-357.
26. Chang CC and Lin CJ. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

257

A METHODOLOGY FOR MOTIF DISCOVERY EMPLOYING ITERATED CLUSTER RE-ASSIGNMENT

Osman Abul* and Finn Drabløs
Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
*Email: osman.abul@ntnu.no, finn.drablos@ntnu.no

Geir Kjetil Sandve
Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
Email: sandve@idi.ntnu.no

Motif discovery is a crucial part of regulatory network identification, and is therefore widely studied in the literature. Motif discovery programs search for statistically significant, well-conserved and over-represented patterns in given promoter sequences. When gene expression data is available, there are mainly three paradigms for motif discovery: cluster-first, regression, and joint probabilistic. The success of motif discovery depends highly on the homogeneity of the input sequences, regardless of the paradigm employed. Other problems include condition-specific binding, measurement noise, the choice of language in which motifs are expressed, and the difficulty of finding an optimal consensus motif. In this work, we propose a methodology for getting homogeneous subsets from input sequences for increased motif discovery performance. It is a unification of the cluster-first and regression paradigms based on iterative cluster re-assignment. The experimental results show the effectiveness of the methodology.

*Corresponding author. This work was carried out during the tenure of an ERCIM fellowship.

1. INTRODUCTION

Transcription Factors (TFs) are proteins that selectively bind to short pieces (5-25 nt long) of DNA, so-called Transcription Factor Binding Sites (TFBS). Although TFs bind in a selective way, they allow some degeneracy in their binding sites, forming Transcription Factor Binding Motifs (TFBMs), or just motifs. This property creates the TFBS representation problem; the most common representations are motif consensus over IUPAC codes, mismatch strings and position-specific weight matrices (PSWMs). TFBMs are functional elements of genes and are preserved throughout evolution, since these regions have accumulated very few mutations compared to non-functional parts. Finding TFBMs is an important step in the elucidation of genetic regulatory networks^a.

There are basically two methods for finding TFBMs: experimental and computational. Lee et al. 19 have conducted experiments for over 100 TFs for experimental identification of the regulatory networks of Saccharomyces cerevisiae. ChIP-chip experiments can analyze the genome-wide binding of a specific TF; unfortunately their resolution (≈1K nt) is not enough to exactly identify binding locations. Since TFBMs are preserved throughout evolution, it is possible to find them computationally by just exploiting their statistical over-representation. This property, together with the available completed genetic maps of many species, has made possible computational identification based solely on sequence data.

^a Regulatory network identification methods are also studied without TFBMs 33, 27, 36, 28, 24; in this paper we do not cover these approaches, explicitly focusing on the use and discovery of TFBMs.

Computational approaches built around this fact include MEME 2, 4, among many others, as well as their variants and specializations. The success of motif discovery programs depends on the quality of the input data: regardless of the paradigm employed, they typically give high false positive/negative rates if the input genes are heterogeneous with respect to regulation. Although a number of algorithms and programs have been developed for motif discovery, little attention has been paid to the selection of homogeneous subsets from heterogeneous gene sets of interest, and little has been done on designing a methodology for optimal usage. Motivated by this, we here study the generation of homogeneous clusters using both sequence and expression data, and we address the issue of a methodology for motif discovery. Here, the sequence data employed is the inter-genic promoter regions upstream of transcription start sites, while the functional data is obtained from microarray experiments under various conditions. Other useful sources of data for motif (and module) discovery include ChIP-chip experiments (e.g. a boosting approach also employing ChIP-chip data, Hong et al. 26), TFBM databases, and phylogenetic relations (e.g. 14); these are not covered here.

In Eukaryotes, each gene can have a number of TFBSs for several different TFs in its promoter sequence, and TFBSs are organized in modules, i.e. sets of TFBSs for a number of TFs; intra-module couplings are much stronger than inter-module couplings. TFs bind to their respective TFBSs in the promoter regions of their target genes. Each TF can function as an inducer or a repressor, and this process is combinatorial: gene expression depends on the combinatorial binding of TFs on TFBMs, and this combinatorial behavior can cause non-additive expression behavior for their common targets. Expression behavior also depends on genome-wide global conditions. To understand the governing rules for gene expression, we need to know 1) all TFs, 2) their abundance and activity under varying conditions, 3) their binding sites, and 4) their combinatorial joint regulation of target expression 35. From this it is clear that, to induce regulatory networks computationally, we need both sequence and functional data.

We define an iterative procedure (a methodology) for the motif discovery process, a unification of the cluster-first and regression paradigms based on iterative cluster re-assignment. Briefly, we start with an initial clustering of gene sets from gene expression data and find motifs in these clusters. We then (optionally) refine these motifs by filtering out irrelevant ones. After that, we screen all the genes by the motif profiles of each cluster and refine the clusters by re-assignment based on the screening score. Following this, we restart motif discovery on the new gene clusters and iterate this procedure until convergence. Finally, we output the set of motifs found in the last iteration. The idea behind this approach is to remove non-relevant motifs and thereby reduce the number of false positives.

2. POWERING MOTIF DISCOVERY USING GENE EXPRESSION DATA

The three main paradigms for incorporating gene expression data into motif discovery are cluster-first, regression, and joint probabilistic. In the cluster-first paradigm, co-expressed genes are assumed to be co-regulated, and genes are therefore clustered based on their expression profile similarity over a course of microarray experiments before they are presented to motif discovery programs. In practice, what an experimenter does is: 1) cluster the gene sets of interest (using a clustering program like k-means, self-organizing maps, hierarchical clustering, etc.); 2) input each cluster (in which sequences are highly probable to contain homogeneous TFBMs) to one or a few motif finding programs (MEME, AlignACE 12, Consensus 10, BioProspector 20, MDScan 21, etc.); and finally 3) decide on the true motifs among all the candidates, either by further analysis (like regression) or manually. An alternative to the cluster-first approach is to start from a large set of putative motifs and filter them by regressing on expression data; examples of this approach include Reduce 7 and Motif Regressor 8, 21. Typically, simple filtering or filtering employing regression analysis is applied. Though clustering before motif discovery improves homogeneity compared to random subsets, it might fail in finding the true clusters.
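The iterative procedure described above (cluster, discover motifs, screen all genes, re-assign, repeat until convergence) can be outlined in code. This is an illustrative skeleton only: `discover_motifs` and `motif_score` are hypothetical placeholders standing in for a real motif finder (e.g. MDScan) and motif-profile screening, here reduced to shared k-mer counting on toy sequences.

```python
# Skeleton of the iterated cluster re-assignment methodology.
# discover_motifs() and motif_score() are simplified placeholders for a real
# motif finder (e.g. MDScan) and motif-profile screening, respectively.

def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def discover_motifs(sequences):
    # placeholder "motif profile": k-mers present in more than half the cluster
    counts = {}
    for seq in sequences:
        for m in kmers(seq):
            counts[m] = counts.get(m, 0) + 1
    return {m for m, c in counts.items() if c > len(sequences) / 2}

def motif_score(seq, profile):
    # placeholder screening score: number of profile motifs found in the gene
    return sum(1 for m in profile if m in seq)

def iterate_clusters(genes, clusters, max_iter=10):
    """genes: {name: promoter sequence}; clusters: list of lists of gene names."""
    for _ in range(max_iter):
        profiles = [discover_motifs([genes[g] for g in c]) for c in clusters]
        new = [[] for _ in clusters]
        for g, seq in genes.items():
            best = max(range(len(profiles)),
                       key=lambda c: motif_score(seq, profiles[c]))
            new[best].append(g)   # re-assign gene to its best-scoring profile
        if new == clusters:       # converged: assignments are stable
            break
        clusters = new
    return clusters

genes = {"g1": "TATATATAT", "g2": "ATATATATA",
         "g3": "CGCGCGCGC", "g4": "GCGCGCGCG"}
print(iterate_clusters(genes, [["g1", "g2", "g3"], ["g4"]]))
# -> [['g1', 'g2'], ['g3', 'g4']]
```

Starting from an imperfect clustering, the gene whose sequence disagrees with its cluster's motif profile (g3 here) is moved to the cluster whose profile it matches, after which the assignments are stable.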

Brazma et al. 6 presented one of the earliest methods within the cluster-first paradigm. They look for over-represented oligos with limited degeneracy, both genome-wide and for clusters generated from gene expression clustering based on the time series data. Hvidsten et al. 13 find genes similar to a selected gene using the expression data, and construct logical rules (in the form of if-then rules) in terms of the absence/presence of a priori given motifs. A variant of the cluster-first approach is TFCC (Transcription Factor Centric Clustering) of Zhu et al. 37: the idea is to find a set of genes showing expression profiles similar to the expression profile of a particular TF over a set of expression experiments, and then look for motifs in that cluster using AlignACE 12. An indirect cluster-first approach is presented in Tamada et al. 34, where the objective is finding regulatory networks. They construct a regulatory network from gene expression data, and from the induced network they identify TFs and search for motifs in the sequence data of the subtrees of those TFs; motif discovery is an intermediate step used to refine the network. The approach taken by Beer et al. 5 also uses a cluster-first approach: the genes are clustered using expression data with k-means clustering, and AlignACE 12 is used for motif discovery. A very similar approach using a custom clustering algorithm is presented in 23.

One of the earliest works using regression on gene expression data for motif discovery is the Reduce method of Bussemaker et al. 7. In the first step, the set of all over-represented oligos (allowing limited degeneracy) in the input sequences is identified as candidate motifs; the method then uses single gene expression experiments over which oligo scores are linearly regressed. The model is fit in an iterative manner, starting with an empty set and adding the most significant motif to the model in each iteration until there is no statistical improvement. Conlon et al. 8 introduced a linear regression method called Motif Regressor. They employ MDScan 21 to extract features, i.e. sets of candidate motifs, from the sequence data; from the resulting large number of putative motifs the insignificant ones are eliminated through regression. The LogicMotif approach of Keles et al. 17 uses two-step logistic regression on a single gene expression experiment: for each sequence a binary score vector (serving as a covariate vector) is constructed, in which each entry corresponds to the existence of a motif type (or a logical function of a subset of all motif types, a so-called logic tree), and this vector is regressed on the expression data. The objective is to find the best minimal set of motifs (k-mers) capable of explaining the gene expression data. The Rim-Finder system of Zilberstein et al. 38 is another method using the regression approach. Methods for binary regression (classification) have also been developed. A large-margin classification approach, called Medusa, using boosting together with alternating decision trees is given in 22. Similarly, the recent study by Hong et al. 15 presents a boosting approach for motif discovery; they formulate the problem of motif discovery as a classification of ChIP-chip data. Identification of synergistic effects of pairs of motifs using co-expression has also been studied 26.

The idea of using a joint probabilistic paradigm was first proposed by Holmes and Bruno 11. The idea is to model probabilistic interactions between sequence features (motifs) and expression data. The approach has been extensively studied by Segal et al. 29, 30, 31, 32, 3, 4 on a few model variants, all employing Bayesian reasoning. The basic variant assumes that transcriptional modules are dependent on sequence information and that they in turn determine gene expression. The approach learns transcriptional module motif profiles using the Expectation-Maximization (EM) algorithm, where each cluster represents a transcriptional module which in turn determines the motif profile and gene expression of the genes in the module. The objective in this approach is not to find novel motifs but motif modules; an initial set of motifs is assumed given a priori, and motif profiles are assigned to modules after clustering finishes. Another similar probabilistic clustering algorithm, jointly using sequence and time series expression data, is presented in 18.

3. A MOTIF DISCOVERY METHODOLOGY

In the cluster-first approaches, clustering based on gene expression data is assumed to represent true functional clusters.

the discovered motifs can be regarded as motif profiles of their respective clusters. Clustering The input to the clustering step is Y. clustering (using gene expression) before motif discovery improves the quality of discovered motifs. Note that. . To explore the claims above we have conducted experiments on some subsets of S. we have used MDScan for motif discovery and we have selected random subsets with 500 genes from over 6000 genes of the Gasch et al. | G | . To get the advantages of gene expression clustering. define the gene expression matrix asY = {Yge : g = l . we do not force the use of any particular clustering. The methodology is illustrated in Figure 3.|5 S |} where Sgi e {A. Let G = {GgYgZf1 be the set of genes and E = {Ee}eez[El be the set of gene expression experiments (either time series or various treatment conditions). For instance. C. In this way. genes are re-assigned to clusters.. | £ | } where Yge is the pre-processed gene expression value for gene g under experiment e. only the significant motifs are retained. Following this. These steps (motif discovery. e = 1 . T} is the nucleotide in the fth position and \Sg\ is the length of the sequence for g. a motif discovery algorithm is used to find candidate motifs for each cluster.e. More information is provided for these datasets in the Experiments section. Motifs from the motif discovery algorithm are assumed to be putative and further refined by filtering. The number of clusters is 5. modified k-means algorithms with Pearson corre- .1. Based on the motif scores. . It starts with the initial clustering from gene expression data. Finally. Figure 2:b scores are higher than Figure 2:a scores. Note also that. filtering. dataset. It is also the case that genes with TFBSs for the same TF are not necessarily co-expressed during a specific time-course as gene expression is combinatorial and therefore depends on several factors. 
In the literature a number of clustering algorithms for gene expression data have been designed and employed.. these motifs are optionally regressed over the gene expression values for motif filtering. Mahalanobis distance. respectively. screening and cluster re-assignment) are iterated until the clusters converge and a final set of motifs are output as motif profiles for each cluster. i. The results agree with our claims that genes having similar motifs need not be co-expressed. and that co-expression clustering therefore can be deceptive for motif discovery. Euclidean distance. This makes sense as selection of homogeneous clusters instead of random clusters gives better candidates for motif discovery.g. We will now define a basic vocabulary to be used in the remaining parts of this section. only significant and relevant motifs are kept. ranging -1 to 1. The task is to partition G into a given number of partitions based on similarity computed from Yg. in 3 7 and 5 . uncertainty of the number of clusters and lack of true knowledge of optimal distance measures. On the other hand. measures how similar a point is to points in its cluster compared to points in other clusters. . G. Then. motif discovery. . filtering or screening algorithm. . Since motif discovery is applied to each cluster. then its cluster membership should be changed based on this new evidence. In Figure 1 we show the Silhouette index of true clusterings and clustering induced by k-means clustering for two subsets. After that. self-organizing maps and hierarchical clustering are more common) and similarity measure (e. Define S = {Sg : g S G} as the sequence data such that Sg = {Sgi : I = 1. as already discussed.. while at the same time avoiding its potential deceptiveness. 25 . .260 functional clusters. A crucial point in clustering is to decide on the clustering method (k-means. . The idea here is that if a gene is closer to the motif profile of a different cluster rather than its current one. . 
In the analysis, our input data is DNA sequence data extracted from the regions upstream of the transcription start sites, together with gene expression data. For convenience, we also define Y_g as the expression vector for gene g over all experiments, and Y_e as the expression vector for experiment e over all genes. The experiments mentioned above were conducted on some subsets of the S. cerevisiae gene clusters reported by Harbison et al. 9 and on the genome-wide gene expression data by Gasch et al. 25. Larger Silhouette index values indicate good cluster separation; due to the noise in the data, the k-means results only partially represent the true clusters. In summary, we propose a methodology for discovering regulatory motifs using both gene expression and upstream sequence data.
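The Silhouette index described above, s(i) = (b - a) / max(a, b) averaged over all points, can be sketched in pure Python (an illustrative sketch; the distance function and data are hypothetical placeholders):

```python
from collections import defaultdict

def silhouette_index(points, labels, dist):
    """Mean silhouette over all points: for point i, a is the mean
    distance to its own cluster and b the smallest mean distance to any
    other cluster; s(i) = (b - a) / max(a, b), in [-1, 1]."""
    members = defaultdict(list)
    for i, label in enumerate(labels):
        members[label].append(i)
    scores = []
    for i in range(len(points)):
        own = [j for j in members[labels[i]] if j != i]
        if not own:                      # singleton cluster: define s(i) = 0
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[i], points[j]) for j in members[m]) / len(members[m])
                for m in members if m != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

On well-separated clusters the mean silhouette approaches 1, while a clustering that splits true groups yields values near or below 0, which is how the index exposes deceptive clusterings.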

Some methods estimate the number of clusters by applying model selection, e.g. using cross-validation 18 or adaptive quality-based clustering 23. Since hierarchical clusterings are exploratory and flexible, they are usually the preferred choice. Since the clustering step is done only once in our approach, and serves only as a source of good initial clusters, we leave the selection of the clustering algorithm to the user. We denote the initial clustering results as {C1_c}, c = 1, ..., |C1|, where |C1| is the number of clusters.

Fig. 1. Silhouette index scores of the true clustering vs. k-means clustering: a) gene cluster {CBF1, MBP1, FHL1, INO4, MBP1}; b) gene cluster {CBF1, MSN2, BAS1, REB1}.

Fig. 2. MDScan scores of random clustering vs. k-means clustering: a) random 500 genes; b) gene cluster {CBF1, MSN2, BAS1, REB1}.

3.2. Motif Discovery

Motif discovery methods basically differ in their motif representation (e.g. regular expressions, IUPAC codes, PSWMs), their search algorithm (e.g. Gibbs sampling, expectation-maximization, word counting), and their exploitation of biological knowledge (e.g. palindromic motifs, fixed/flexible gaps, bi-modality, motifs in modules, inter-dependence of motif locations). We apply the motif discovery algorithm to each cluster separately and independently. For our purpose, any PSWM-based motif finding method, such as AlignACE, MEME, MDScan or BioProspector, can be used in this step. Let us denote the clusters at the i-th step by C^i and the motif set output from cluster C^i_c by M^i_c. The reason we introduce a filtering step is that statistical over-representation does not necessarily imply biological significance: some statistically over-represented motifs may be ei-
3.3. Motif Filtering

Given the resulting motifs M^i of the motif discovery step, the filtering step outputs a subset of the motifs, denoted by M^i'.
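The paper later scores k-mers against PSWMs with the matrix similarity score of Kel et al. 16, which weights each position by its information content and scales the result between the minimum and maximum possible match. The following is a rough pure-Python sketch in that spirit (an assumption-laden illustration; the exact weighting used by the MATCH tool may differ):

```python
import math

def match_score(pssm, kmer, alphabet="ACGT"):
    """Information-content-weighted PSWM score for a k-mer, min-max
    scaled to [0, 1]. `pssm` is a list of per-position nucleotide
    frequency columns in A, C, G, T order."""
    def info(col):
        # information content of one PSWM column
        return sum(p * math.log(4 * p) for p in col if p > 0)
    current = minimum = maximum = 0.0
    for col, base in zip(pssm, kmer):
        w = info(col)
        current += w * col[alphabet.index(base)]
        minimum += w * min(col)
        maximum += w * max(col)
    return (current - minimum) / (maximum - minimum)
```

A perfect consensus match scores 1.0 and the worst possible match scores 0.0, so a fixed cutoff on this score can serve as a simple expression-free filter.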

Such statistically over-represented motifs may be either artifacts of the motif discovery program or simple tandem repeats. As those artifact motifs are generally not consistent with expression, this filtering step has the potential of eliminating most of them. Since in regression-based approaches success depends highly on the initial putative motifs, feeding these programs with the output of statistically over-represented motifs usually gives better results. Note that it is also possible to have a simple filter that does not employ expression data, e.g. filtering nothing, or filtering putative motifs based only on their motif scores. We use the matrix similarity score metric reported in Kel et al. 16; the metric basically uses the information content of the PSWM and scales each k-mer between the maximum possible and minimum possible match.

Fig. 3. Iterative Motif Discovery Approach: expression data is clustered into initial clusters; motif discovery, filtering, screening against the sequence data, and cluster re-assignment are then iterated.

3.4. Screening and Cluster Re-assignment

Given the motif profile M^i' of each cluster, we score all the genes, including the genes in other clusters, measuring their conformance to the motif profile. The genes are assigned to the cluster to which they have the highest conformance, thus creating a cluster re-assignment. If the cluster re-assignment is the same as, or very similar to, the previous clustering, the set of motifs for each cluster is output; otherwise the iteration continues with the gene clusters C^(i+1).

4. EXPERIMENTS

To assess the merit and relevance of the methodology presented, we conduct several experiments on real datasets for S. cerevisiae. The dataset contains over 150 gene expression arrays (measured under several conditions, with repetitions) for 6371 ORFs of S. cerevisiae. We pre-process the dataset by log-transforming the background-corrected intensities. Since the dataset contains missing values, we eliminate arrays and genes with a considerable number of missing entries. This gives 149 arrays and 6107 ORFs, which can be considered as a 6107 x 149 matrix.
This creates a vector of cluster motif profile conformance measures for each gene. For the experiments we use the gene expression dataset by Gasch et al. 25.
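The iterate-until-convergence scheme of Sections 3.1-3.4 can be sketched with the concrete algorithms left as pluggable callables. Here `discover`, `filter_motifs` and `score` are hypothetical placeholders standing in for the user's choice of motif discovery (e.g. MDScan), filtering and conformance scoring, not the paper's actual components:

```python
def iterative_motif_discovery(genes, init_clusters, discover, filter_motifs,
                              score, max_iter=10):
    """Iterate motif discovery, filtering, screening and cluster
    re-assignment until the clustering stops changing. `genes` is a list
    of sequences; clusters are lists of gene indices."""
    clusters = init_clusters
    for _ in range(max_iter):
        # 1) motif discovery + filtering per cluster -> motif profiles
        profiles = [filter_motifs(discover([genes[g] for g in c]))
                    for c in clusters]
        # 2) screening: every gene is scored against every profile and
        #    assigned to the cluster with the highest conformance
        new_clusters = [[] for _ in profiles]
        for g in range(len(genes)):
            best = max(range(len(profiles)),
                       key=lambda k: score(genes[g], profiles[k]))
            new_clusters[best].append(g)
        # 3) convergence: stop when the re-assignment no longer changes
        if new_clusters == clusters:
            break
        clusters = new_clusters
    return clusters, profiles
```

With toy single-character "motifs" (most frequent base per cluster as the profile, per-gene counts as the score), a deliberately shuffled initial clustering is repaired in one re-assignment round, mirroring how the method corrects a deceptive initial clustering.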

There are still missing values in this 6107 x 149 matrix, and we impute them with a k-nn imputation method; after this we get a complete expression matrix. As for the sequence data, we use at most 500 bases (-500 to -1) upstream of the transcription start site for each of the 6107 genes.

4.1. Clusters

Harbison et al. 9 identified S. cerevisiae target genes for a number of TFs by collecting results from the following resources: ChIP-chip data, published data from the literature, and phylogenetic conservation. They also confirmed the results by applying several motif discovery programs (MEME, AlignACE, MDScan, etc.). As a result, they defined several dozens of (overlapping) gene clusters, one for each binding motif. We therefore assume these clusters to be true clusters for our purpose. We conduct experiments with the following three gene subsets drawn from Table 1.

• Subset 1: {CBF1, FHL1, BAS1, INO4, MBP1} (5 clusters)

As performance measures we use MDScan scores, convergence, and the Jaccard and Silhouette indices. MDScan score is used to quantify the strength of motifs within clusters. Convergence is defined as the number of re-assigned genes, so it is a natural metric for our methodology. We use the Silhouette index and the Jaccard index as cluster separation and cluster similarity metrics, respectively. In cases where experimentally determined binding sites for motifs are available, the correspondence between predicted and known sites could have been used as a performance measure; we rather preferred the MDScan score, as it is more objective and more general. Although MDScan was originally designed for ChIP-chip experiments, it also works well without ChIP-chip data. For each cluster we use the same parameter setting for MDScan, as follows (and default values for the other parameters):

• motif width = 8
• number of motifs to report = 2
• number of top sequences to seed = 20
• number of motifs to be kept for refinement = 4 x number of motifs to report

The order of genes presented to MDScan is relevant. In our experiments we have used random orders to avoid any bias (this is because we do not use ChIP-chip data). On the other hand, we conjecture that the genes could have been sorted by distance to their cluster centroids, thereby possibly improving motif discovery.
There are many alternative methods and tools that can be used in the different steps of our methodology. Since our objective here is to show the effectiveness of the methodology, we do not experiment with an extensive set of methods and tools, but rather with a few practical ones. In all of the experiments conducted we have selected k-means as the clustering algorithm and MDScan as the motif discovery algorithm; they have been chosen mainly because they are fast. This is particularly important for the choice of motif discovery algorithm, as it is run for each cluster in every iteration.

4.2. Random Clusters

In this experiment, we randomly select 500 genes among the 6107 genes and cluster them into 5 clusters by k-means clustering. In all of these runs we employ k-means as the initial clustering algorithm and use trivial identity filters; in all of the runs we start from converged k-means results at iteration number 1. Figure 4 gives the results for 20 runs (iteration 1 is the result of the initial k-means clustering). From the figure we see that the number of re-assigned genes decreases along the iterations, suggesting convergence; the MDScan scores of the clusters also increase with the iterations. We have also tested the approach by changing the number of random genes, and we have tested how sensitive our methodology is to the initial clustering by running it with random initial clusterings.
In all of these cases, results similar to those reported in Figure 4 are observed. This means that the improvement achieved by the methodology is not dependent on particular settings of the initial clusters or motif discovery tools. It is clear from both figures that our approach is able to correct some of the deceptiveness of the initial clustering. The k-nn imputation method mentioned above works as follows: for each missing-valued gene, it identifies the closest k genes over the non-missing entries and then imputes the missing value with the average of the column values for those k genes.
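The k-nn imputation just described can be sketched in pure Python, with `None` marking missing entries (an illustrative sketch under those assumptions, not the paper's implementation):

```python
def knn_impute(matrix, k=3, missing=None):
    """For each row with a missing entry, find the k closest rows using
    only columns where both rows are observed, then fill the gap with
    the average of those neighbours' values in that column."""
    def dist(r1, r2):
        # mean squared difference over commonly observed columns
        common = [(a, b) for a, b in zip(r1, r2)
                  if a is not missing and b is not missing]
        if not common:
            return float("inf")
        return sum((a - b) ** 2 for a, b in common) / len(common)

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v is missing:
                neighbours = sorted(
                    (dist(row, other), idx)
                    for idx, other in enumerate(matrix)
                    if idx != i and other[j] is not missing)
                nearest = [matrix[idx][j] for _, idx in neighbours[:k]]
                filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

Restricting the distance to commonly observed columns is what lets the neighbours be ranked even when the query gene itself has gaps.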

This index, ranging from 0 to 1, is an external cluster validation method measuring how similar one clustering is to another; a high index value indicates high cluster similarity.

Fig. 4. Performance results on random clusters: a) MDScan scores; b) convergence (number of re-assigned genes) over the iteration number.

With the same parameters, the true clustering for subset 1 (2) has an MDScan score of 4.00 (4.14) and a number of re-assigned genes of 230 (248). MDScan parameters used are the same as given in Section 4.1, and a trivial identity filter is employed. For subset 1 the method finds specific binding sites for the true clusters of CBF1, FHL1, MBP1 and REB1, while it fails to find specific binding sites for the true BAS1 and MSN2 clusters. The results shown in Figure 5:g-h are inconclusive. This might be due to similarity of the sequences (like the TATA box) in different clusters of the true clustering, and to the failure of MDScan to find specific motifs for the considered clusters. Figure 5:e-f shows the Silhouette index of the clusterings for subsets 1 and 2 through the iterations, and also for the true clustering. We reason from the figure that our method achieves a Silhouette index similar to that of the true clusters, while k-means clustering (iteration number 1) destroys the original clustering.

Table 1. Clusters used in experiments.

Regulator   # of genes in the cluster
CBF1        195
FHL1        131
BAS1        17
INO4        32
MBP1        92
MSN2        74
REB1        99

We show the general utility of the methodology on subsets 1 and 2, and test the effect of using different filters on subset 3. Filter 1 is a trivial identity filter, while Filters 2 and 3 are stepwise multiple linear regression filters; Filter 2 uses the scoring function defined in 16 and Filter 3 uses the one employed in Motif Regressor. The only difference between Filters 2 and 3 is the scoring function used for computing the covariate vector. The number of clusters parameter for k-means is set to 2 for all filters on subset 3. For the case of Filter 1 the number of motifs reported by MDScan is set to 2, while for Filters 2 and 3 it is set to 15; the other parameters are the same as in Section 4.1.
• Subset 2: {CBF1, MSN2, BAS1, REB1} (5 clusters)
• Subset 3: {CBF1, MBP1, FHL1, INO4, REB1} (2 clusters)

As a pre-processing step, we remove genes not contained within the 6107 genes of the Gasch et al. dataset, and we also remove genes found in more than one cluster. This way we ensure that each gene belongs to exactly one cluster. As a result, subsets 1, 2 and 3 contain 399, 403 and 253 genes, respectively. The number of clusters parameter for k-means is set to 5 for subsets 1 and 2.

In Figure 5:a-d, the average MDScan scores and convergence performance are shown for subsets 1 and 2 over 20 runs. The results clearly indicate improved MDScan scores and convergence over the iterations. This additionally shows that, even though the performance of the true clusters is better than that of k-means clustering, there is a potential to increase performance further by cluster re-assignment. To measure the cluster similarity between the true clustering and the clusterings over the iterations, we measure the Jaccard index.
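The pair-counting Jaccard index used here as an external validation measure can be sketched as follows (a standard formulation, n11 / (n11 + n10 + n01) over gene pairs, shown for illustration):

```python
from itertools import combinations

def jaccard_index(labels_a, labels_b):
    """External cluster-validation Jaccard index between two partitions
    of the same genes: n11 counts gene pairs co-clustered in both
    partitions, n10/n01 pairs co-clustered in only one. Ranges from 0
    to 1; 1 means the partitions group pairs identically."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
    return n11 / (n11 + n10 + n01)
```

Because it compares co-membership of pairs rather than cluster labels, the index is insensitive to label permutations, which makes it suitable for comparing an inferred clustering against the true one.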


For Filters 2 and 3, since we have 149 expression experiments, we regress once for each experiment and count the motifs selected in the stepwise regression; we then use the 2 most frequently selected motifs out of 15 (i.e. 13 motifs are filtered out) as the cluster motif signature, and use them for the cluster conformance computation in the re-assignment step. Recall that this last issue was inconclusive from subsets 1 and 2.

Fig. 6. Performance results for subset 3: a) convergence; b) MDScan score; c) Jaccard index, for Filters 1, 2 and 3.

Figure 6 gives the averaged results over 20 runs. It is clear that all the performance parameters show increased performance over the k-means clustering for all of the filters used. Note that in Figure 6:a there is full convergence, and from Figure 6:c the Jaccard index increases, suggesting recovery of the true clustering for all filters. It is also possible to compare filters based on the performance parameters: for the average MDScan scores, Filter 1 scores best, while in terms of the Jaccard index Filter 2 scores slightly better. This clearly shows the trade-off involved in selecting among multiple filters. We conclude from the results that the methodology works well for all of these filters.

5. CONCLUSION

In this work, we have addressed the problem of developing a methodology for motif discovery. The analysis and experimental results show that clustering based on gene expression is a better basis for motif discovery than random clustering, but not a perfect one; it is also shown that it might mislead. Our method is developed to compensate for these two issues and thereby improve the quality of motif discovery. We do this by screening all genes and re-assigning clusters over several iterations; this is also why we have introduced a filtering step in our methodology. The conducted experiments clearly suggest the utility of our approach. The methodology is quite flexible, not developed around a particular motif discovery, filtering, screening or clustering algorithm, e.g. a
It is organized around the idea of getting highly homogeneous gene clusters using both sequence and expression data. we regress once for each experiment and count the motifs selected in the stepwise regression. the Jaccard index increases suggesting recovery of true clustering for all filters.• — Fitter #1 • e .Filter #2 • • * • • Filter #3 .• . Note that in Figure 6:a.31 4.7r g4.8 C O [0 § . We conclude from the results that the methodology works well for all of these filters. 6.e.

broad range of algorithms developed in the field can be used in our methodology. It has been shown that, regardless of the actual filtering method used, the methodology works well and improves over the initial clustering; we have also shown the importance of the filtering step. The methodology presented here can also be considered as a unification of the cluster-first and regression-based motif discovery paradigms into a single framework. Our approach is similar to the joint probabilistic approaches, especially to Tamada et al. 34, where the main motivation, however, is finding regulatory networks rather than discovering motifs. It is in general different from these approaches, in that our approach does not establish any probabilistic relationships between gene expression and sequence information. Future work will focus on assessment of the general utility and performance of our methodology as compared to joint probabilistic modeling.

References

1. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, and Young RA. Transcriptional regulatory networks in Saccharomyces cerevisiae. Nature, 431(7004):99-104, 2004. A. Kel, E. Gössling, I. Reuter, E. Cheremushkin, O. Kel-Margoulis, and E. Wingender. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Research, 31(13):3576-3579, 2003. Timothy L. Bailey and Charles Elkan. Unsupervised Learning of Multiple Motifs in Biopolymers using Expectation Maximization. Machine Learning, (21):51-80, 1995. Timothy L. Bailey and Charles Elkan. The Value of Prior Knowledge in Discovering Motifs with MEME. In Proc. of the ISMB'95, pages 202-210, Menlo Park, CA, 1995. Harmen J. Bussemaker, Hao Li, and Eric D. Siggia. Regulatory Element Detection using Correlation with Expression. Nature Genetics, 27:167-171, February 2001. Erin M. Conlon, X. Shirley Liu, Jason D. Lieb, and Jun S. Liu. Integrating Regulatory Motif Discovery and Genome-wide Expression Analysis. PNAS, 100(6):3339-3344, 2003. Eran Segal, Yoseph Barash, Itamar Simon, Nir Friedman, and Daphne Koller. From Promoter Sequence to Expression: A Probabilistic Framework. In Proc. of the 6th International Conf. on Research in Computational Molecular Biology, 2002. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, and Young RA. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, (298):799-804, 2002. Siindiiz Keles, Mark J. van der Laan, and Chris Vulpe. Regulatory Motif Finding by Logic Regression. Berkeley Biostatistics Working Paper Series, (145), 2004. Erik van Zwet, Alvis Brazma, Jaak Vilo, Jan Komorowski, Inge Jonassen, Esko Ukkonen, and Pierre Rouze. Predicting Gene Regulatory Elements in Silico on a Genomic Scale. Genome Research, 8:1202-1215, 1998. Gal Elidan, Yoseph Barash, Lei Shen, and others as extracted; Hvidsten T, Komorowski J, et al. Discovering Regulatory Binding-site Modules using Rule-based Learning. Genome Research, (15):856-866, 2005.
Modeling Dependencies in ProteinDNA Binding Sites. 14. Reynolds DB. Integrating Regulatory Motif Discovery and Genome-wide Expression Analysis. Science. 21(ll):2636-2643. and Krzysztof Fidelis. Zeitlinger. 2005. especially to T a m a d a et al. 15. Nature Genetics. I. Gobling. 431(7004) :99-104. Machine Learning. and Jun S. 6. Reuter. Torgeir R. Saeed Tavazoie. The Value of Prior Knowledge in Discovering Motifs with MEME. Hao Li. 2005. 15(7/8) :563-577. 2(3):194-202. Odom. Combining sequence and time series expression data to learn transcriptional modules. 117:185-198. Bruno.E. (15):856-866. Identifying DNA and Protein Patterns with Statistically Significant Alignments of Multiple Sequences. G. Takusagawa KT. Siindiiz Kele§. 4. Jason D. Regulatory Motif finding by Logic Regression. E. Discovering Regulatory Bindingsite Modules using Rule-based Learning. and Tommy Kaplan. 11. 9. Z. of 9th RECOMB. In Proc.267 broad range of algorithms developed in the field can b e used in our methodology. (296):1205-1214. 1999. (298): 799-804. Cell. Kel-Margoulis. Jennings EG. Gordon. C. Germany. B. A. J. 2004. Eran Segal. and Esko Ukkonen. of the ISMB'95. Fraenkel. Predicting Gene Regulatory Elements in Silico on a Genomic Scale. Robert. PNAS. 21(20):3832-3839. Probabilistic Discovery of Overlapping Cellular Processes and Their Regulation. and Michael B. Zeitlinger J. However. T. F. Lander ES. C. van der Laan. Computational Identification of Cis-regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces Cerevisiae. Shane T. Fraenkel E. of the 7th International Conf. and Christina Leslie. 2005. Genome Research. Rinaldi NJ. 1995. and Eric D. Cheremushkin. Church. Siggia. Genome Research. and R. Murray. In Proc. 2002. A Boosting Approach for Motif Modeling using ChlP-chip Data. Nucleic Acids Research. Liu. Anshul Kundaje. D. and Daphne Koller. Gerald Z. 100(6) :3339-3344. T. and George M. Predicting gene expression from sequence. 31(13):3576-3579. 
Nir Friedman. Combining Phylogenetic Motif Discovery and Motif Clustering to Predict Co-regulated Genes. Harmen J. BarJoseph. E. It has been shown t h a t regardless of the actual filtering method used. E. Jason D. 1998. Gerber. (21):51-80. (145). Volkert. 19. and E. Kechris. Future work will focus on assessment of general utility and performance of our methodology as compared to joint probabilistic modeling. Lieb. 34 where their main motivation is finding regulatory networks rather t h a n discovering motifs. Simon. Jaak Vilo. O u r approach is similar t o the joint probabilistic approaches. 2005.C. Bailey and Charles Elkan. Regulatory Element Detection using Correlation with Expression. 1995. Preston W. Hughes. Beer and Saeed Tavazoie. and Young RA. Bioinformatics. Bioinformatics. San Diego. Bickel. Manuel Middendorf. Jerzy Tiuryn. Menlo Park. Bartosz Wilczynski. Timothy L. i. U. Erin M. and Jun S. and Chris Vulpe. 2. Hvidsten.

E. Dana Peer. and George M. 2003. 2001. Pierre Rouze. and J. 4(1):R6. Daphne Roller. Brown. 23. and Satoru Miyano. Eran Segal. 34(2):166-176. 2005. Douglas L. 22.268 20. Reconstructing Biological Networks using Conditional Correlation Analysis. Anshul Kundaje. X. Rousuke Tashiro. A Causal Inference Approach for Constructing Transcriptional Regulatory Networks. Krestyaninova. (144). Liu. Washington. David Botstein. 2002. Nature Genetics. Berkeley Biostatistics Working Paper Series. Seiya Imoto. Su Yong Rim. Eran Segal. Dana Peer. 2002. In Proc. 2003. Naoki Nariai. 11:4241-4257. In Proc. 20:835-839. Aviv Regev. 2005. Module Networks: Identifying Regulatory Modules and Their Condition-specific Regulators from Gene Expression Data. Genomewide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression. 2004. (318):71-81. 31. D. Luscombe. Roller. Brutlag. 36. of Pacific Symposium on Biocomputing. Kathleen Marchal. 2001. and Nir Friedman.1-R6. Shirley Liu. R. (CS2003-09). 19(2):227-236. and Satoru Miyano. JOBIM 2002. Technical Report. X. 38. Lev A. Michael B. U. Journal of Machine Learning Research. Paul T. 2003. Estimating Gene Regulatory Networks and Protein-protein Interactions of Saccharomyces Cerevisiae from Multiple Genome-wide Data. 2003. 37.L. 2003. Michael Shapira. Sequence Motifs in Ranked Expression Data. Bioinformatics. 2005. Biao Xing and Mark J. Bioinformatics. 33. and Gustavo Stolovitzky.10. Segal. Chris H. and Daphne Roller. 2005. Nicholas M. 28. and Jun S. Biao Xing and Mark J.S. 21. Itamar Simon. Yoshinori Tamada. 21(6):765-773. Motif Discovery through Predictive Modeling of Gene Regulation. 2003. MA. Bioinformatics. Eleazar Eskin. John Jeremy Rice. Bioinformatics. A Statistical Method for Constructing Transcriptional Regulatory Networks using Gene Expression and Sequence Data. Gert Thijs. 2002. 2001. Yitzhak Pilpel. Estimating Gene Networks from Gene Expression Data by Combining Bayesian Network Model with Prometer Element Detection. 
of 6th RECOMB. Janick Mathys. Technion CS Dept. Brutlag. Church. Prediction of Regulatory Networks: Genome-wide Identification of Transcription Factor Targets from Gene Expression Data. 35. Audrey P. 27.Gasch. Liu. Bioinformatics. Nir Friedman. 2000. Daphne Roller. 6:557-588. Seiya Imoto. Frank De Smet. and Patrick O. Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements. of 9th RECOMB. Zhou Zhu. and Zohar Yakhini. DC. David Botstein. . 29:153-159. Journal of Molecular Biology. 30. Yuhai Tu. Computational Identification of Transcription Factor Binding Sites via a Transcription-factor-centric Clustering (TFCC) Algorithm. Eisen. 25. and Alvis Brazma. Yoav Freund. Maria A. and Bart De Moor. Cambridge. Eran Segal. and Nir Friedman. Nature Biotechnology. From Promoter Sequence to Expression: A Probabilistic Framework. Mihir Shah. Yelensky. Church. 24. Genomic expression programs in the response of yeast cells to environmental changes. Nature Genetics. Integrating Quality-based Clustering of Microarray Data with Gibbs Sampling for the Discovery of Regulatory Motifs. 19(15):1917-1926. 2005. Hideo Bannai. 21(21):40074013. Towards Reconstruction of Gene Networks from Expression Data by Supervised Learning. 32. Learning Module Networks. Yitzhak Pilpel. Wiggings. Gisela Storz. 26. and D. Magali Lescot. Soinov. 21(2):206-212. Jiang Qian. Kao. pages 75-79. Satoru Ruhara. Chaya Ben-Zaken Zilberstein. Yves Moreau. 19(l):273-282. 34. Molecular Biology of the Cell. 29. Spellman. and Mark Gerstein. Liu. An Algorithm for Finding Protein-DNA Binding Sites with Applications to ChromatinImmunoprecipitation Microarray Experiments. Jimmy Lin. In Proc. van der Laan. Yoseph Barash.C. Yoshinori Tamada. Stephane Rombauts. and Christina Leslie. Haiyuan Yu. Genome Biology. van der Laan. and George M. Bioinformatics. Camilla M. Orna Carmel-Harel. Aviv Regev. Bioprospector: Discovering Conserved DNA Motifs in Ppstream Regulatory Regions of Co-expressed Genes. 
Priya Sudarsanam. Manuel Middendorf.

IDENTIFYING BIOLOGICAL PATHWAYS VIA PHASE DECOMPOSITION AND PROFILE EXTRACTION

Yi Zhang and Zhidong Deng*
Department of Computer Science, Tsinghua University, Beijing, 100084, China
Email: michael@tsinghua.edu.cn

* Corresponding author.

Biological processes are always carried out through large numbers of genes (and their products), and these activities are often organized into different cellular pathways: sets of genes that cooperate to finish specific biological functions. In this paper, we propose a novel approach to identify overlapping pathways by extracting partial expression profiles from coherent cliques of clusters scattered over different conditions. We firstly decompose the gene expression data into highly overlapping segments and partition the genes into clusters in each segment; then we organize all the resulting clusters as a cluster graph and search for coherent cliques of clusters; finally, we extract expression profiles from the coherent cliques and shape biological pathways as the genes consistent with these profiles. We compare our algorithm with several recent models, and the experimental results justify the superiorities of our approach: robustly identifying overlapping pathways on arbitrary sets of conditions and, consequently, discovering more biologically significant pathways in terms of enrichment of gene functions.

1. INTRODUCTION

The rapid development of high-throughput techniques such as oligonucleotide and cDNA microarrays [5] enables measuring the expression of thousands of genes simultaneously. This possibility offers an unprecedented opportunity to characterize the underlying mechanisms of a living cell. Recently, researchers have made tremendous efforts to identify coherent gene groups [10]: identifying cellular pathways, i.e. sets of coherent genes that coordinate in biological processes to achieve specific functions, plays a considerable role in gaining an insight into the cell's activities. Admittedly, applying traditional clustering algorithms to gene expression data can provide us with new perspectives on cellular processes; pioneering work includes the agglomerative algorithm for hierarchical clustering [7], K-means clustering of genes [17] and some graph-theoretical approaches for gene-based clustering [15]. However, several problems in this process should be highlighted: (1) biological processes are active only on partial conditions, and this fact renders clustering genes on the entire set of conditions ineffective; (2) extremely high noise exists in microarrays, which calls for robust models for pathway identification; (3) partitioning genes into mutually exclusive clusters is unreasonable, in that biological pathways are always overlapping. Managing high noise in gene expression data is also indispensable for successfully determining coherent genes.

On the one hand, biclustering algorithms [14] are designed to capture biological processes active on part of the conditions. Different from traditional clustering methods, these models perform simultaneous clustering on both rows and columns and thus finally discover coherent submatrices, where rows refer to genes and columns correspond to relevant conditions. One challenge to biclustering is that the number of possible combinations of various genes and conditions is almost infinite. On the other hand, the overlapping property of cellular pathways is also mentioned by recent work. Some biclustering algorithms discover submatrices one after another and thus naturally yield non-exclusive biclusters: Cheng et al [6] mask the previous biclusters with random numbers and find other ones; similarly, in [12] each bicluster merely deals with the "residual" expression of the previous biclusters. Furthermore, algorithms that aim to discover overlapping pathways simultaneously also exist: Battle et al [4] proposed a probabilistic model to discover overlapping biological processes concurrently. Owing to the development of microarray technology and its ability to simultaneously measure the expression of thousands of genes,
effective algorithms to reveal biologically significant pathways have become possible.
Activities of a living cell are so complex that different sets of genes participate in diverse biological processes to perform various cell functions. Owing to the development of microarray technology and its ability to simultaneously measure the expression of thousands of genes. in [12] each bicluster merely deals with the "residual" expression of previous biclusters. plays a considerable role in gaining an insight into the cell's activities. (3) Partitioning genes into mutually exclusive clusters is unreasonable in that biological pathways are always overlapping.

In fact, an algorithm addressing all of these open problems is highly desirable. In this paper, we propose a strategy that satisfies all these demands: robustly discovering overlapping pathways on partial conditions. The key ideas of our approach are: (1) decompose the entire set of conditions (i.e. columns) into highly overlapping segments and cluster the genes over each segment; (2) manage all the resulting clusters in a cluster graph and discover coherent cliques on the cluster graph; (3) extract expression profiles over the coherent cliques and shape overlapping pathways according to these profiles. The underlying ideas are related to ensemble learning: the consensus clustering algorithm in [9] combines different clustering results to form a clustering ensemble, the idea being that integrating the opinions of different "experts" can yield a robust estimation; and in [2] the author engages a robust similarity measure based on the rank of expression in each condition, rather than the accurate expression level.

2. METHOD

Our pathway discovery algorithm consists of three steps: (1) decomposing the conditions into overlapping segments and performing gene clustering on each segment; (2) constructing a cluster graph from the resulting clusters over all segments and discovering coherent cliques on the graph; (3) extracting an expression profile from each coherent clique and identifying a biological pathway according to each profile. In the rest of this section, we examine these steps in Sections 2.1-2.3 and analyze the properties of this algorithm in Section 2.4.

2.1. Phase Decomposition

The first step is to decompose the gene expression matrix into many segments. Note that the term "segment" refers to a submatrix of the gene expression matrix in which all rows and a subset of the columns are included; each segment contains all the rows and a few columns of the matrix. Each segment covers a fixed number of conditions and advances a small step beyond the previous segment. Two parameters are involved: (1) the segment length L, the number of conditions covered by a segment; and (2) the step length ΔL. In Figure 1 we set ΔL = 1 and L = 4, the latter only for illustration; s1 covers {c1, c2, c3, c4} and s2 contains {c2, c3, c4, c5}. The goal of this decomposition is to ensure that a biological process active on any partial set of conditions can be discovered by combining some segments. As long as the segments are highly overlapping, combinations of different segments can approximately represent any period longer than L; thus any biological process whose life-span is larger than L can be obtained by combining certain segments. One may choose a larger step length in order to reduce the total number of segments. Note that using a too large L loses the possibility of discovering pathways active over short periods, while engaging a too small L makes clustering on each segment ineffective: co-expression over a very short period often appears by chance, and such short segments will not be used in the experiments. The decomposition strategy is shown in Figure 1. Then we discover co-expressed genes by gene-based clustering on each segment; this second procedure yields the co-expression groups, and large clusters are retained for later processing. Here we use hierarchical clustering, with average link and Pearson correlation, to model the similarity between the expression profiles of genes and to group the genes on each segment.
34 Sj | Si r — — at . period c2 ~ C can be captured 6 by integrating two segments s2 and s3. large clusters are retained for later processing. 1. an algorithm address all of these open problems is highly desirable. Finally.3 and analyze the properties of this algorithm in section 2. Cj C. c3. Phase Decomposition In order to capture biological processes in partial conditions. our algorithm identifies cellular pathways by robustly searching expression profiles over arbitrary set of conditions. columns) into highly overlapping segments. The key ideas of our approach are: (1) decompose the entire conditions into highly overlapping segments and clustering genes over each segment. C. we propose a strategy that satisfies all these demands: robustly discovering overlapping pathways on partial conditions. we set AL = 1. Note the term "segment" refers to a submatrix in gene expression matrix where all rows and a subset of columns are included. In this paper. Cj C. As a result. Here we use hierarchical clustering. combinations of different segments can approximately represent any period larger than L. Fig. On each segment. The underlying idea of this ensemble learning approach is that integrating opinions of different "experts" can yield a robust estimation. we divide the entire conditions (i. (3) extract expression profiles over coherent cliques and shape overlapping pathways according to the these profiles. C. METHOD Our pathway discovery algorithm consists of three steps: (1) decomposing conditions into overlapping segments and performing gene clustering on each segment. C.4. the author engages robust similarity measure based on the rank of expression in each condition. cutting the hierarchical tree at specific threshold \-c will produce many sets of co- . (2) manage all resultant clusters into a cluster graph and discover coherent cliques on cluster graph. 
This sort of measures is robust in the sense that they mainly focus on the rough shape of expression profile and will not be affected by disturbances on accurate expression level. C. For instance. Si covers {c b c2. C. In [2]. For instance.1.. this algorithm is capable of robustly recognizing overlapping molecular pathways on partial conditions and thus furnishing biologically significant sets of genes in terms of enrichment of gene functions. c 5 }. (3) extracting expression profile from each coherent clique and identifying biological pathway according to each profile.
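The decomposition and per-segment clustering described above can be sketched as follows (an illustrative Python sketch, not the authors' code: the average-link clustering on 1 - Pearson correlation, the cut at 1 - c, and the discard threshold of 5 genes follow the text, while the function and parameter names are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def make_segments(n_conditions, L, dL):
    """Highly overlapping windows of L condition indices, advancing by dL."""
    return [list(range(s, s + L)) for s in range(0, n_conditions - L + 1, dL)]

def cluster_segment(X, cols, c=0.7, min_size=5):
    """Average-link hierarchical clustering of all genes on one segment,
    using 1 - Pearson correlation as the distance; cutting the tree at
    1 - c yields sets of co-expressed genes, and clusters with fewer
    than min_size genes are discarded as outliers."""
    D = pdist(X[:, cols], metric="correlation")   # 1 - Pearson r
    labels = fcluster(linkage(D, method="average"),
                      t=1 - c, criterion="distance")
    groups = {}
    for gene, lab in enumerate(labels):
        groups.setdefault(lab, []).append(gene)
    return [set(g) for g in groups.values() if len(g) >= min_size]
```

For example, `make_segments(76, 10, 2)` reproduces the sliding windows used later for the cell cycle dataset, and calling `cluster_segment` on every segment yields the clusters that feed into the cluster graph.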

2.2. Coherent Clique on Cluster Graph

After clustering on each segment as discussed in Section 2.1, we obtain many co-expressed gene clusters; clusters which contain fewer than 5 genes are discarded, since too-small clusters are considered outliers or biologically insignificant groups. Clusters on the same segment are mutually exclusive, while clusters computed from different segments may be highly overlapping. The fact that several clusters from diverse segments share a large proportion of common co-expressed genes indicates the existence of a biological process which is active on the period composed of these segments. In this section, we address the problem of how to utilize these clusters to discover biological processes active on arbitrary periods. For this purpose, we propose the concepts of cluster graph and coherent clique.

To begin with, given two gene clusters C and C', we define the overlapping degree and use this definition to offer a distance measure between clusters:

Overlap(C, C') = |C ∩ C'| / |C ∪ C'|,   Distance(C, C') = 1 - Overlap(C, C'),

where |C| is the number of genes in cluster C. After defining the distance between clusters, all clusters obtained from the procedure of Section 2.1 constitute a large cluster graph, which furnishes us with a global view of the relationships among genes over different segments:

• A cluster graph G(V, E) is a complete graph where each node v ∈ V refers to a cluster C, and the weight of an edge e = (v1, v2) ∈ E is the distance between the two clusters corresponding to v1 and v2.

Given a cluster graph, we want to discover β-coherent cliques on this graph, defined as follows:

• A β-coherent clique Q(V', E') in a cluster graph G(V, E) is a complete subgraph of G satisfying: (1) any edge in E' has a weight less than β; (2) V' contains at least two nodes.

A β-coherent clique is biologically meaningful: (1) any two clusters in a coherent clique have a distance smaller than β, i.e., an overlapping degree larger than 1 - β; (2) clusters in a β-coherent clique must come from different segments, since clusters from the same segment are mutually exclusive; (3) as noted above, clusters from diverse segments sharing many common genes indicate a biological process active on the period composed of these segments. Note that searching for coherent cliques is a way of finding possible biological processes, i.e., sets of coordinated genes. An effective algorithm to attain this goal is hierarchical clustering with complete link [11]: using this algorithm, we build a hierarchical tree over the cluster graph and then cut the tree according to a chosen β; the definition of complete link ensures that the resulting groups are β-coherent cliques. In this way we partition the entire cluster graph into many β-coherent cliques.

2.3. Profile and Pathway Extraction

In Section 2.2 we discussed how to discover coherent cliques; in this section, we discuss how to robustly extract the expression profile of the biological process underlying each coherent clique, and how to discover cellular pathways from these expression profiles. Firstly, recall that a coherent clique is composed of a set of nodes in the cluster graph, and each node refers to a cluster from a segment. Each cluster covers certain conditions — the conditions covered by the segment where the cluster was generated — so we can define the active period of a coherent clique:

• The active period P(Q) of a coherent clique Q is the set of all conditions covered by at least one cluster in Q.

See Figure 1 for an illustration: suppose that coherent clique Q is composed of three clusters generated on segments s1, s2 and s4, respectively; then the active period of Q is {c1, c2, c3, c4, c5, c6, c7}. Another important notion is that of core genes:

• Gene g is a core gene of coherent clique Q if and only if g is a member of all clusters in Q.

Since these clusters are located in different segments and obtained by clustering each segment independently, it is unlikely that an outlier gene will belong to all of them. Secondly, the expression profile conveying the underlying biological process of coherent clique Q is:

• The expression profile of coherent clique Q is defined on Q's active period and computed as the mean expression of all the core genes of Q.

Thirdly, we identify the cellular pathway corresponding to Q based on its expression profile:

• A gene g belongs to the cellular pathway of coherent clique Q if and only if the Pearson correlation between g's expression and Q's expression profile on Q's active period exceeds c.

Note that c is a key parameter of our algorithm: two gene expression profiles are considered coherent when their Pearson correlation is larger than c, i.e., their distance is smaller than 1 - c. Also note that g's expression outside Q's active period is not considered.
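The cluster-graph construction of Section 2.2 and the profile/pathway extraction of Section 2.3 can be sketched as follows (an illustrative Python sketch under the definitions above; the function names are ours, and a clique is represented here as a list of (condition-columns, gene-set) pairs):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def overlap(C1, C2):
    """Overlapping degree Overlap(C, C') = |C n C'| / |C u C'|."""
    return len(C1 & C2) / len(C1 | C2)

def coherent_cliques(clusters, beta=0.7):
    """Partition the cluster graph into beta-coherent cliques by
    complete-link hierarchical clustering on Distance = 1 - Overlap;
    cutting the tree at beta guarantees that every pair of clusters
    inside a clique has distance at most beta."""
    n = len(clusters)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = 1.0 - overlap(clusters[i], clusters[j])
    labels = fcluster(linkage(squareform(D), method="complete"),
                      t=beta, criterion="distance")
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    # a coherent clique must contain at least two clusters
    return [g for g in groups.values() if len(g) >= 2]

def extract_pathway(X, clique, c=0.7):
    """From a clique given as (condition-columns, gene-set) pairs:
    active period = union of covered conditions; core genes =
    intersection of the member clusters; profile = mean core-gene
    expression on the active period; pathway = all genes whose Pearson
    correlation with the profile on that period exceeds c."""
    period = sorted(set().union(*(list(cols) for cols, _ in clique)))
    core = sorted(set.intersection(*(set(g) for _, g in clique)))
    profile = X[np.ix_(core, period)].mean(axis=0)
    pathway = {g for g in range(X.shape[0])
               if np.corrcoef(X[g, period], profile)[0, 1] > c}
    return period, set(core), profile, pathway
```

Note that `extract_pathway` never looks at expression outside the active period, which is what lets one gene join several pathways whose periods differ.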

2.4. Further Analysis

In this section we mainly discuss three properties of our algorithm: (1) the ability to discover pathways on partial conditions; (2) the ability to identify overlapping pathways; (3) robustness.

Firstly, if the active period of a biological process P can be obtained by combining different segments produced in Section 2.1, then a corresponding coherent clique can be identified from the cluster graph, because P is active on these segments and the genes participating in P should co-express on them. See Figure 1 for an illustration: suppose a biological process P is active on conditions c1~c7. Ideally speaking, the corresponding coherent clique Q ought to consist of four clusters located in segments s1, s2, s3 and s4, respectively, with an active period of c1~c7. Admittedly, owing to high noise in microarrays, some clusters may be missed, and Q may consist of only three clusters, on s1, s2 and s4; even in this case, the active period of Q will still be judged properly as c1~c7, based on s1, s2 and s4. This is based on two assumptions: (1) segments are defined at a suitable granularity and are highly overlapping; (2) genes cooperating in P co-express during P's active period.

Secondly, the overlapping property of pathways is ensured: (1) the active periods of diverse coherent cliques Q1 and Q2 differ; (2) even on the same period, the expression level of a gene g can be consistent with different expression profiles. Thus a gene g can be consistent with both Q1's expression profile in Q1's active period and Q2's expression profile in Q2's active period. Note that these pathways are generated independently.

Thirdly, the algorithm in this paper is robust, from four perspectives. (1) The definition of coherent clique is robust in that any two clusters in the clique have a high overlapping degree, so it is unlikely that an "outlier" cluster can be accommodated. (2) The computation of the active period of each coherent clique is robust: segment overlapping ensures the robustness of the active period estimation, especially when the step length is small and adjacent segments present similar structure in gene expression. (3) The choice of core genes in a coherent clique Q is robust in that each core gene must belong to all the clusters in Q. Admittedly, this selection strategy is so "cautious" that it may miss some core genes; however, because core genes are supposed to co-express well in their biological process, even a subset of the core genes is sufficient to extract the expression profile of the underlying process. (4) Finally, the quality of core gene selection and active period estimation ensures the quality of the expression profile extraction and of the resulting pathway.

3. EXPERIMENTAL RESULTS

In this section we present empirical results.

3.1. Dataset and Preprocessing

Two well-known datasets are used in our experiments: the yeast cell cycle data [16] and the yeast stress data [8]. For preprocessing, we remove genes with more than 5% missing values and estimate the remaining missing values with KNNimpute [18]; then genes with small variance are removed. These steps result in 526 genes in the cell cycle dataset and 659 genes in the stress condition dataset.

3.2. Rival Methods

In this part, we introduce the rival algorithms and their parameters. (1) HClust [7]: hierarchical clustering with average link and Pearson correlation; 30 clusters are formed on both datasets. (2) Plaid [12] is designed to discover biclusters one by one independently; we stop at 100 biclusters on both datasets, using the default parameters. (3) OP [4] is a probabilistic model that searches for overlapping pathways simultaneously; the number of pathways is set to 30. (4) PIPE (Pathway Identification by Profile Extraction): our algorithm. On the cell cycle dataset, which contains 76 conditions, we set the segment length L to 10 and the step length ΔL to 2; for the yeast stress dataset, which contains 173 conditions, L is 20 and ΔL is 3. The coherence threshold c is 0.7, and the parameter β used to define β-coherent cliques is 0.7.

3.3. Results on Cell Cycle

Running our algorithm on the 526 genes and 76 conditions results in 162 coherent cliques, and thus finally we obtain 162 cellular pathways.
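The preprocessing of Section 3.1 can be approximated as follows (a minimal sketch: the real KNNimpute [18] weights neighbors by distance and handles edge cases more carefully, and keeping a fixed number of top-variance genes via `n_keep` is our simplification of the paper's variance filter):

```python
import numpy as np

def preprocess(X, max_missing=0.05, n_neighbors=10, n_keep=500):
    """Remove genes (rows) with too many missing values, impute the
    remaining gaps from the K nearest genes (a simplified stand-in for
    KNNimpute), then keep the n_keep highest-variance genes."""
    X = X[np.isnan(X).mean(axis=1) <= max_missing].copy()
    # initialize imputed matrix with per-condition means
    filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    for g in np.where(np.isnan(X).any(axis=1))[0]:
        miss = np.isnan(X[g])
        # distance to every other gene over the observed entries
        d = np.sqrt(np.nansum((X - X[g]) ** 2, axis=1))
        d[g] = np.inf
        neighbors = np.argsort(d)[:n_neighbors]
        filled[g, miss] = filled[neighbors][:, miss].mean(axis=0)
    order = np.argsort(filled.var(axis=1))[::-1]
    return filled[order[:n_keep]]
```

With the paper's 5% cutoff, a gene missing in more than 5% of the conditions is dropped outright rather than imputed.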

The fact that 162 pathways are produced from 526 genes does not indicate that the average size of the pathways is small. The distribution of pathway size is shown in Figure 2, where the X-axis is the pathway size n and the Y-axis is the proportion of pathways larger than n. More than 80% of the pathways contain more than 10 genes, while only about 20% contain more than 40 genes; the smallest pathway contains 4 genes and the largest one includes 101 genes. This result shows that the majority of pathways have moderate size.

Fig. 2. Distribution of Pathway Size.

Next, we measure the overlap among pathways. Figure 3 shows the distribution of the number of pathways each gene participates in: the X-axis is the number of pathways a gene joins, and the Y-axis is the proportion of genes involved in more than that number of pathways. According to much research concentrating on the scale-free topology of biological networks [3], and especially of genetic regulatory networks [13]: (1) there should be a few "hub" genes connected with many other genes, which therefore join many biological processes; (2) most genes in the network should not have large degrees, and thus participate in only a few biological processes. The four algorithms present quite different behavior in Figure 3. (1) Since HClust merely produces mutually exclusive pathways, all genes take part in exactly one pathway. (2) For the OP model, a single gene takes part in at most 5 pathways. (3) The Plaid pathways are excessively overlapping: almost all genes participate in more than 30 pathways. (4) For our PIPE method, the result in Figure 3 shows a natural distribution in which only a few genes throw themselves into more than 15 pathways, and only 19 out of 526 genes participate in more than three pathways. In short, HClust and OP cannot produce "hub" genes, Plaid produces too many "hubs", and only our algorithm generates results consistent with the above conclusions.

Fig. 3. Distribution of Gene Participation.

Finally, to justify the biological significance of the pathways generated by these models, we test the enrichment of gene functions in GO categories [1]. For each GO category, the enrichment is represented by a p-value — the smaller the p-value, the better the enrichment — computed with Genomica [19]; for each category, we focus on the p-value of the pathway with the best enrichment. A p-value larger than 0.001 is considered a failure to find enrichment and is labeled "—" in the table. The enrichment of all four algorithms is listed in Table 1. Among the 117 GO categories listed in Table 1: (1) PIPE won 70 times, while the OP, HClust and Plaid models won 25, 21 and 1 times, respectively; (2) PIPE failed 23 times, while OP, HClust and Plaid failed 41, 58 and 90 times, respectively. Examining the results listed in Table 1 naturally yields the conclusion that PIPE has identified more biologically significant pathways.

3.4. Results on Stress Condition

To further justify the superiority of PIPE, we use the yeast stress condition dataset [8] for another experiment. Running our algorithm on the 659 genes and 173 conditions brings about 174 pathways. The pathway size distribution of PIPE is shown in Figure 4, where a few pathways contain more than 100 genes and the majority are of moderate size.
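The enrichment p-values in Tables 1 and 2 are computed with Genomica [19]; the standard hypergeometric tail test behind such GO enrichment scores can be sketched as follows (an assumption about the scoring, shown only to make the evaluation concrete; the function name is ours):

```python
from scipy.stats import hypergeom

def enrichment_pvalue(pathway_genes, category_genes, n_background):
    """Probability that a random gene set of the pathway's size overlaps
    the GO category at least as much as observed (hypergeometric tail);
    values above 0.001 would be reported as a dash in the tables."""
    k = len(pathway_genes & category_genes)      # observed overlap
    return hypergeom.sf(k - 1, n_background,
                        len(category_genes), len(pathway_genes))
```

A pathway that reproduces a 10-gene category exactly out of a 100-gene background scores far below the 0.001 cutoff, while a pathway disjoint from the category scores 1.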

Fig. 4. Distribution of Pathway Size. Fig. 5. Distribution of Gene Participation.

In addition, Figure 5 shows results similar to those observed in Figure 3: HClust and OP cannot discover any "hub" gene, while Plaid considers most of the 659 genes to be hubs that join many pathways. We also test the enrichment of GO categories and list the results in Table 2. From Table 2 we can summarize that, over the 128 GO categories listed: (1) PIPE finds the best enrichment for 84 categories, while OP, HClust and Plaid do so 18, 13 and 19 times, respectively; (2) PIPE fails to find enrichment for only 11 GO terms, while OP, HClust and Plaid fail 61, 81 and 68 times, respectively.

Another interesting fact is that the Plaid algorithm performs much better on the stress condition dataset than on the cell cycle dataset. One explanation for this result is the difference in regulation mechanisms between endogenous phases (e.g., cell cycle and sporulation) and exogenous phases (e.g., DNA damage and diauxic shift) [13]. In an exogenous phase such as stress response, genes are often regulated by more transcription factors and participate in more processes than in an endogenous phase such as cell cycle; the Plaid model, which tends to produce excessively overlapping pathways, therefore yields more accurate results there. In a word, PIPE has its own advantage in discovering biologically meaningful pathways.

4. CONCLUSION

In this paper, we presented a new approach to discovering cellular pathways. We first decompose the gene expression matrix into highly overlapping segments and partition the genes into clusters on each segment; then we organize all the resulting clusters into a cluster graph and identify coherent cliques; finally, we extract the expression profiles of the coherent cliques and shape biological pathways from these profiles. We compare our algorithm with several recent models, and the experimental results justify the superiorities of our approach: robustly identifying overlapping pathways on partial conditions and consequently discovering biologically significant pathways.

Acknowledgments

This work was supported in part by the National Science Foundation of China under Grant 60321002 and the Teaching and Research Award Program for Outstanding Young Teachers in Higher Education Institutions of MOE (TRAPOYT), China.

References
1. M. Ashburner et al. Gene Ontology: tool for the unification of biology. Nat. Genet., 25: 25-29, 2000.
2. R. Balasubramaniyan, E. Hullermeier, N. Weskamp and J. Kamper. Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics, 21(7): 1069-1077, 2005.
3. A.-L. Barabasi and Z.N. Oltvai. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5: 101, 2004.
4. A. Battle, E. Segal and D. Koller. Probabilistic discovery of overlapping cellular processes and their regulation. Journal of Computational Biology, 12(7): 909-927, 2005.
5. P.O. Brown and D. Botstein. Exploring the new world of the genome with DNA microarrays. Nat. Genet., 21: 33-37, 1999.
6. Y. Cheng and G.M. Church. Biclustering of expression data. ISMB, 2000.
7. M.B. Eisen, P.T. Spellman, P.O. Brown and D. Botstein. Cluster analysis and display of genome-

wide expression patterns. PNAS, 95: 14863-14868, 1998.
8. A.P. Gasch, P.T. Spellman, C.M. Kao, O. Carmel-Harel, M.B. Eisen, G. Storz, D. Botstein and P.O. Brown. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11: 4241-4257, 2000.
9. T. Grotkjaer, O. Winther, B. Regenberg, J. Nielsen and L.K. Hansen. Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm. Bioinformatics, 22(1): 58-67, 2006.
10. D. Jiang, C. Tang and A. Zhang. Cluster analysis for gene expression data: a survey. IEEE Trans. on Knowledge and Data Engineering, 16(11): 1370-1386, 2004.
11. B. King. Step-wise clustering procedures. J. Am. Stat. Assoc., pages 86-101, 1967.
12. L. Lazzeroni and A. Owen. Plaid models for gene expression data. Technical report, Stanford Univ., 2000.
13. N.M. Luscombe, M.M. Babu, H. Yu, M. Snyder, S.A. Teichmann and M. Gerstein. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature, 431(7006): 308-312, 2004.
14. S.C. Madeira and A.L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE Trans. on Computational Biology and Bioinformatics, 1(1): 24-45, 2004.
15. R. Sharan and R. Shamir. CLICK: a clustering algorithm with applications to gene expression analysis. ISMB, 2000.
16. P.T. Spellman et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9: 3273-3297, 1998.
17. S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho and G.M. Church. Systematic determination of genetic network architecture. Nat. Genet., 22: 281-285, 1999.
18. O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein and R.B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17: 520-525, 2001.
19. http://genomica.weizmann.ac.il/index.html

Table 1. GO Categories Enrichment based on Cell Cycle Dataset GO term 35S Primary Transcript Processing amine biosynthesis amine metabolism amine transport aspartate family amino acid metabolism ATP dependent DNA helicase activity ATP dependent helicase activity beta-glucosidase activity bud neck carbohydrate metabolism carbohydrate transport carrier activity cation transport cation transporter activity cell communication cell wall chromatin assembly or disassembly chromatin binding conjugation contractile ring HClust 4.

84E-6 2.36E-4 — 2.71 E-4 1.74E-5 2.24E-8 4.71E-4 4.55E-7 2.62E-6 — — — 1.33E-4 — 9.29E-6 9.33E-7 — — 1.12E-4 2.98E-4 — — — 1.11E-8 2.95E-5 2.28E-5 5.49E-5 2.92E-6 8.26E-7 1.53E-4 1.05E-4 1.82E-5 1.63E-6 5.95E-5 — 8.12E-4 1.51E-6 7.13E-6 3.21E-6 2.86E-7 — 1.97E-5 — 1.02E-6 — 1.45E-10 8.77E-4 — 4.12E-4 2.42E-6 — — 1.99E-6 1.76E-4 7.80E-4 5.49E-6 9.04E-6 3.42E-4 — — — — — — .10E-4 — — — — 5. of RNA primer DNA strand elongation DNA unwinding DNA-dependent ATPase activity DNA-directed DNA polymerase activity electron transporter activity endoplasmic reticulum endosome energy pathways glucosidase activity glutamate metabolism glutamine family amino acid biosynthesis glutamine family amino acid metabolism helicase activity hexose transport hydrolase activity ion transporter activity iron ion transport iron ion transporter activity iron-siderochrome transport kinase inhibitor activity lagging strand elongation large ribosomal subunit leading strand elongation main pathways of carbohydrate metabolism mannose transporter activity MCM complex metal ion transporter activity methionine metabolism mismatch repair — 2.12E-4 6.59E-6 1.12E-6 2.00E-5 2.39E-5 1.16E-5 2.87E-6 — 1.09E-5 4.82E-5 7.19E-5 — 1.20E-5 2.20E-4 — — 3.cyclin-dependent protein kinase regular activity cytokinesis cytokinesis. completion of separation cytoplasmic vesicle cytoskeleton organization and biogenesis cytosolic large ribosomal subnit (sensu Eukaryota) cytosolic small ribosomal subnit (sensu Eukaryota) development DNA binding DNA helicase activity DNA metabolism DNA packaging DNA recombination DNA repair DNA replication DNA replication initiation DNA replication.82E-7 1.84E-5 — 1.26E-6 3.36E-9 3.26E-4 — — 2.52E-5 — 2.82E-5 — 6.96E-4 — 1.71E-5 — — — — — — — — 2.10E-8 — — — — — — — 1.13E-6 4.29E-5 — 2.86E-4 — 2.10E-8 3.07E-4 — — — — — — — — — — — — — — — — — — — — 2. 
synthesis.14E-5 2.84E-6 — 4.37E-8 1.20E-10 2.25E-8 2.74E-5 7.45E-8 1.56E-5 — 5.05E-5 7.84E-5 2.37E-4 2.26E-10 3.20E-4 1.59E-5 8.86E-10 2.92E-6 9.12E-4 2.26E-9 — 2.18E-4 2.63E-5 3.26E-7 3.16E-4 3.18E-10 1.82E-5 3.30E-5 — — 2.75E-9 — 4.63 E-7 1.88E-5 — — 1.49E-6 2.02E-6 4.22E-8 4.

18E-4 3.36E-4 1.80E-5 — 3.84E-6 1.61E-11 1.17E-12 — — — — — — — — 4.50E-12 1.67E-5 9.73E-11 1.00E-5 6.37E-6 1.08E-4 3.53E-4 — — 2.34E-9 1.46E-12 2.63E-14 2.71E-05 2..12E-14 8.22E-8 1..57E-7 7.90E-9 — — — — — — — — 1.89E-6 1.98E-6 1.00E-5 — — — 1.54E-5 5.66E-5 4.20E-5 — — — 8.01 E-23 5.21E-9 8.46E-4 2.69E-5 1.46E-5 1.29E-5 4.86E-4 9.49E-5 2.62E-10 3.96E-9 8.42E-4 6.65E-6 — 2.52E-5 2.29E-6 — — 8.40E-7 1.04E-4 — — — — — 1.19E-5 1.42E-18 3.04E-4 .78E-4 2.53E-4 2.18E-7 5.94E-5 2.80E-13 2.54E-5 — — 2.15E-15 — — — 5.31E-4 8.10E-20 1.54E-9 2.46E-6 1.07E-4 1.04E-5 9.59E-5 7.39E-5 2.12E-5 — — — — — — 1.13E-13 2.54E-14 1.98E-6 — 2.95E-13 9.76E-5 3.46E-09 — — — — — — 1.16E-10 2.18E-4 1.76E-4 — — 1.13E-5 — — 2.46E-6 — — — — — — — — — 2.15E-5 6.43E-9 2.42E-4 1.76E-5 9.88E-8 — 3.51E-5 1.07E-10 — — — — — — — 4.24E-7 — — — — — — — — .13E-5 3.40E-7 — 4.71E-5 — 4.70E-10 3.35E-13 1.93E-6 4.10E-12 8. — 4.46E-5 4.90E-9 — — 1.41E-7 2.48E-7 1.63E-4 8.277 mitochondrion mitotic recombination nuclear organization and biogenesis nucleic acid binding nucleolus nucleosome nucleotide-excision repair nucleotidyltransferase activity organic acid metabolism polyamine transport pre-replicative complex pre-replicative complex formation and maintenance protein amino acid glycosylation protein binding protein biosynthesis protein folding pyrophosphatase activity regulation of cell cycle regulation of cyciin dependent protein kinase activity regulation of enzyme activity regulation of metabolism regulation of transcription regulation of transcription from Pol 2 promoter replication fork response to DNA damage stimulus response to stimulus response to stress ribosomal large subunit biogenesis ribosomal small subunit assemble and maintenance ribosomal subunit assembly RNA binding RNA metabolism RNA processing rRNA processing siderochrome transport site of polarized growth small ribosomal sununit structural constituent of ribosome sulfur amino acid biosynthesis sulfur amino acid metabolism sulfur 
metabolism telomerase-independent telomere maintenance telomere maintenance transcription regulator activity transferase activity and phosphorus-contain group transcription metal ion transporter activity — 8.36E-12 1.18E-7 2.06E-5 — 7.

38E-4 5.43E-4 1.55E-4 — — — 8.49E-6 1.78E-7 5.58E-6 — — — 6.28E-10 — — — 1.73E-12 — 1.50E-7 2.39E-4 — — .95E-4 4.74E-6 1.26E-4 7.94E-5 ..57E-5 1.41E-4 — 9.94E-4 2.65E-19 2.26E-4 4.72E-4 1.27E-10 1.44E-5 1.11E-8 — — — — — — — — 4.13E-5 — 3.02E-10 2.31E-5 3.26E-5 1.17E-5 4.56E-10 1.55E-4 2.27E-9 2.32E-6 1.46E-9 — — 8.17E-4 — 9.translational elongation tricarboxylic acid cycle tricarboxylic acid cycle intermediate metabolism unfolded protein binding vesicle-mediated transport 1.86E-5 1.34E-5 5.33E-12 — — — — 9.67E-5 — 3..08E-4 3.11E-8 6.36E-21 4.42E-4 9.70E-7 — 1.14E-8 — — — — 2.10E-4 — 1.53E-5 — — 1.26E-4 — — — 1..03E-5 5.54E-4 4.18E-6 3.11E-7 — — 7.89E-7 1.02E-11 — — — — 1.48E-5 8.23E-4 4. GO Categories Enrichment based on Stress Condition Dataset GO term 35S Primary Transcript Processing acid phosphatase activity aerobic respiration alcohol metabolism amine biosynthesis amine metabolism amine transport activity aspartate family amino acid biosynthesis aspartate family amino acid metabolism ATP dependent DNA helicase activity carbohydrate catabolism carbohydrate kinase activity carbohydrate metabolism carbohydrate transport cell communication cell wall cytosolic large ribosomal subnit (sensu Eukaryota) cytosolic small ribosomal subnit (sensu Eukaryota) disaccharide metabolism DNA binding DNA-directed RNA polymerase 1 complex endoplasmic reticulum energy pathways energy reserve metabolism enzyme regulator activity galactose metabolism glucan metabolism glucose metabolism glucosyltransferase activity glutamate metabolism glutamine family amino acid biosynthesis glutamine family amino acid metabolism glutathione peroxidase activity glycogen metabolism helicase activity heterocycle metabolism hexose metabolism HClust 3.70E-5 — 4.21E-7 — 7.77E-6 1.67E-9 6.64E-10 3.67E-5 9.21E-7 1.49E-14 3.12E-5 — 2.01E-4 — 9.72E-7 1.11E-7 5.34E-5 3.35E-18 6.71E-10 — — 4.68E-9 .35E-7 4.67E-7 2.93E-4 3. 9.92E-4 2.30E-14 3..39E-4 OP 6.74E-4 2. 
Table 2.53E-5 — — — — — — 6.70E-5 2.64E-4 — — 2.87E-12 9.50E-32 2.17E-7 1.41 E-15 — 5.11E-4 2.08E-5 — — 2.06E-4 1.58E-6 — 2.43E-5 PIPE 4.18E-20 4.55E-6 1.43E-5 Plaid 3.03E-5 — 2.56E-11 — — — 2.68E-4 4.54E-6 2.

[Table: GO functional categories (e.g. hexose transport, ion transporter activity, ribosome biogenesis, rRNA processing, tRNA modification, structural constituent of ribosome, unfolded protein binding) and their associated enrichment p-values.]

EXPECTATION-MAXIMIZATION ALGORITHMS FOR FUZZY ASSIGNMENT OF GENES TO CELLULAR PATHWAYS

Liviu Popescu
Department of Computer Science, Cornell University, Ithaca, NY 14853, US

Golan Yona*
Computer Science Department, Technion - Israel Institute of Technology
* Corresponding author. Email: golan@cs.technion.ac.il

Cellular pathways are composed of multiple reactions and interactions mediated by genes. Many of these reactions are common to multiple pathways, and each reaction might potentially be mediated by multiple genes in the same genome, leading to a many-to-many mapping of genes to pathways. Different genes might participate with different affinities in different pathways; however, existing pathway reconstruction procedures assign a gene to all pathways in which it might catalyze a reaction, producing assignments that are ambiguous. In a previous work we presented a deterministic algorithm for pathway assignment that reduces the ambiguity by using expression data and selectively assigning genes to pathways such that co-expression within pathways is maximized and conflicts among pathways (due to shared assignments) are minimized; however, it assigns a single gene to each reaction. In this paper we present a variation over a known EM algorithm that addresses this problem, now assuming that the same gene can participate in multiple pathways, and estimates pathway assignment probabilities from expression data and sequence similarities. We apply the algorithm to metabolic pathways in Yeast and compare the results to assignments that were experimentally verified. As our tests showed, our algorithm works well on a set of test pathways. Our framework can be extended to include interaction data and other high-throughput data sets.

1. INTRODUCTION

In the last decade an increasingly large number of genomes were sequenced and analyzed. However, many genes have unknown function, and therefore the cellular processes in which they participate remain largely unknown. The wealth of experimental data about genes initiated many studies in search for larger complexes, patterns and regularities, in addition to functional characterization of genes. Of broad interest are studies that attempt to compile the network of cellular pathways in a given genome 1-4; in some the analysis is experimental, while in others the analysis is mostly computational. Due to the complexity of these studies, these pathways have been verified and studied extensively only in a few organisms.

To propagate the experimental knowledge to other organisms, several groups developed procedures that extrapolate pathways (mostly metabolic pathways) based on the known association of genes with reactions in these pathways. However, assignments that rely just on the general functional characterization of genes are not refined enough and tend to introduce ambiguity by creating many-to-many mappings between genes and pathways: some reactions can be catalyzed by multiple genes and are associated with multiple pathways. This ambiguity characterizes popular procedures for the assignment of genes to pathways 5, 6. Yet it is unlikely that all genes that are capable of mediating a certain reaction are involved in all the pathways that contain it. Rather, it is more likely that each gene is optimized to function in specific pathway(s). While our results generally support this assumption, it is not always the case in reality. Hence, a more reasonable approach would be to assign a probabilistic measure to indicate the level of association of a given gene with a given pathway.

Here we present a probabilistic algorithm for the assignment of genes to pathways that addresses this problem and reduces this ambiguity. Our algorithm uses expression data, database annotations and similarity data to infer the most likely assignments and to estimate the affinity of each gene with the known cellular pathways. Furthermore, to complement the set of known enzymes (which is usually incomplete), our algorithm considers other genes in the subject genome that might possess catalytic capabilities based on similarity.

2. A PROBABILISTIC FRAMEWORK FOR ASSIGNING GENES TO PATHWAYS

In this paper we focus on metabolic pathways. In Ref. 8 a metabolic pathway is defined as "a sequence of consecutive enzymatic reactions that brings about the synthesis, breakdown, or transformation of a metabolite from a key intermediate to some terminal compound." This definition is used in many studies and by most biochemical textbooks, and underlies literature-curated databases such as MetaCyc9. We adopt this definition in our algorithm, and we adhere to the set of known pathways, as our algorithm is concerned with pathway assignments rather than pathway discovery (although this can be easily changed). Our goal is to predict which genes take part in each pathway a.

a This definition is not always clear-cut, for example when there are multiple branches.

2.1. Preliminaries

In the next sections we use bold characters to represent sets or vectors and non-bolded characters to represent individual entities or measurements. The input to our algorithm is a genome G with N genes, enzyme families F1, F2, ..., FM and pathways P1, P2, ..., PK. Each pathway P contains a set of enzymatic reactions, and each reaction is associated with an enzyme family F whose member genes can catalyze the reaction. We denote by F(P) the set of protein families that are associated with the reactions of pathway P, and by F(i) the set of enzymatic reactions (families) associated with gene i. We use G(Fj) to represent the set of genes that can be assigned to enzyme family Fj based on their database records, or based on their similarity with known enzymes of family Fj (as described in section 1.2.1 of the online Supplementary Material). Each gene i is associated with a feature vector xi = {ei, ii, ...}, where ei is the expression profile of gene i, ii is the interaction profile, and so on.

We assume that each cellular process (in our case a metabolic pathway) can be modeled as a statistical source3 generating measurable observations over genes, each one providing information on different aspects of the same cellular process. The observations can be characterized in terms of different types of data (such as expression profiles, interactions, etc.) that reflect different aspects of the pathway. Our goal is to compute the probability p(i|Pk) of gene i participating in pathway Pk, as well as the posterior probability p(Pk|i), which we refer to as the affinity of gene i with pathway Pk. Computing the probabilities p(i|Pk) and p(Pk|i) is difficult, since they refer to biological entities (genes) that are not observed directly but only through measurements (e.g. expression level). The conditional probability p(xi|Pk) denotes the probability of the k-th source to emit xi; assuming independence between the features, we can decompose p(xi|Pk) = p(ei|Pk) p(ii|Pk) ···. In this work we use only expression profiles (that are generated from multiple experiments); however, the algorithm can be easily generalized to include other types of data. That is, we estimate p(xi|Pk) ~ p(ei|Pk), where p(e|Pk) is the probability to observe expression vector e in pathway Pk.

This approximation is based on the assumption that genes participating in the same biological process are similarly expressed; a similar assumption was employed in other studies on pathway reconstruction (see section 4). Our initial assumption is that the expression profiles of genes assigned to the same pathway tend to be similar, which suggests that each pathway has a characteristic expression profile. Therefore a pathway can be modeled as a probabilistic source for the expression profiles of the participating genes, having as centroid the pathway characteristic profile. We model each pathway as a source that generates expression profiles for the pathway genes such that p(e|Pk) follows a Gaussian distribution N(μk, Σk). Each pathway can also be modeled as a mixture of Gaussian sources (assuming there are several underlying processes, intermingled together). This is clearly an approximation; however, it produces good results, as we demonstrate later on.

2.2. The EM algorithm

Our algorithm is based on the fuzzy EM clustering algorithm that assumes a mixture of Gaussian sources10, with several modifications that are discussed in the next section. We initialize the pathway assignment probabilities based on prior knowledge of metabolic reactions, and this process is constrained so as to maintain the prior information. We then revisit these estimates and recompute the probabilities based on experimental observations, until convergence to maximum likelihood solutions.
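To make the probabilistic setup concrete, the computation of the affinities p(Pk|ei) under per-pathway Gaussian sources with diagonal covariance can be sketched as follows. This is our own minimal illustration, not code from the paper: the function names, array layout (genes as rows, experiments as columns) and the log-space normalization are assumptions, and the sketch omits the paper's knowledge-based initialization and constraints.

```python
import numpy as np

def log_gaussian_diag(e, mu, sigma):
    """Log density of expression vector e under N(mu, diag(sigma^2))."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * sigma**2) + ((e - mu) / sigma)**2)

def affinities(E, mus, sigmas, priors):
    """E-step: posterior affinity p(P_k | e_i) for every gene i and pathway P_k.

    E: (N, d) expression profiles; mus, sigmas: (K, d) per-pathway
    centroids and per-experiment standard deviations; priors: (K,) p(P_k).
    """
    N, K = E.shape[0], mus.shape[0]
    logp = np.empty((N, K))
    for k in range(K):
        for i in range(N):
            logp[i, k] = np.log(priors[k]) + log_gaussian_diag(E[i], mus[k], sigmas[k])
    # Normalize in log space for numerical stability, then exponentiate.
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)
    return post
```

A gene whose profile sits near a pathway's characteristic profile receives an affinity close to 1 for that pathway; the rows of the returned matrix sum to one, matching the interpretation of p(Pk|ei) as a posterior over pathways.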

Each pathway has a prior p(Pk). We assume that the microarray experiments are independent of each other b, such that the expression vector e is composed of d independent measurements {e1, e2, ..., ed}, i.e. p(e|Pk) = ∏l p(el|Pk), where each component is distributed as a one-dimensional normal distribution N(μkl, σkl). Hence the covariance matrix Σk is actually a diagonal matrix, whose non-zero elements are denoted by σk. We seek the parameters that will maximize the likelihood of the data, using an EM algorithm similar to the one described in Ref. 10. This algorithm converges to parameters that maximize (locally) the likelihood of the data p(E|Θ) = ∏i p(ei|Θ) = ∏i Σk p(ei|μk, σk) p(Pk), where Θ = {(μ1, σ1), (μ2, σ2), ..., (μK, σK)}. These parameters, as well as the probabilities p(Pk|ei), are modified iteratively until convergence.

b Expression profiles are typically composed of measurements taken from a set of independent experiments. For example, in time-series datasets the measurements are collected at different time-points, usually spaced at relatively large time intervals during which the cell has undergone significant changes, so the correlation between consecutive time points is relatively weak. Other datasets (e.g. Rosetta) are generated from experiments that are conducted practically independently under different conditions.

2.3. Knowledge-based clustering

Our algorithm is a variation of the EM algorithm described above, in several ways. First, instead of random initialization of μk and σk, we use the prior information that is available from database annotations and similarity searches: we initialize the parameters of the pathway models μk, σk and p(Pk) based on database annotations and similarity data, as described in section 1.2.2 of the online Supplementary Material. Second, we replace the Euclidean metric with a new metric, the mass-distance measure, that is more effective for detecting similarity between expression profiles. Third, we employ constrained clustering so as to minimize the number of pathways that end up with an incomplete assignment. Thus, our algorithm utilizes any prior information that might help to obtain more accurate assignments. Due to space limitation, the details of these three elements are described in section 1 of the online Supplementary Material.

2.4. Gene sets

Using the prior knowledge we can restrict the set of genes we consider in our algorithm. Proteins that are linked with enzyme families based on their annotation are referred to as annotated enzymes; proteins assigned to reactions based on similarity with known enzymes are referred to as predicted enzymes. The most constrained set of genes is pathway genes (PG): the genes that can be assigned to at least one of the pathways based on database annotations and by prediction, PG = ∪k ∪Fj∈F(Pk) G(Fj). The intermediate set consists of all enzymes (AE) in the genome, annotated or predicted, including enzymes that are not associated with any of the reactions in the pathways we considered: AE = ∪j G(Fj). The third set of genes we consider is the entire genome, or all genes (AG): AG = G.

3. RESULTS

To evaluate the performance of our method we test the influence of different settings and parameters on pathway assignments, and show that the algorithm produces results of biological significance. We first provide quantitative measures of performance by comparing the results we get to experimentally validated assignments, using several performance measures as discussed in section 3 of the Supplementary Material. We then take a look at particular examples to illustrate the strengths of our algorithm.

Our model organism is the Yeast genome. Genes are mapped to enzymatic reactions using Biozon14 at biozon.org, and pathway blueprints are obtained from the MetaCyc database9. We used a subset of 52 metabolic pathways for which we could assign Yeast genes to all the reactions in the pathway; 23 of these were experimentally verified to exist in Yeast in SGD11, and this set of 23 pathways serves as our test set. To assign genes to pathways we test two different expression datasets: the Cell Cycle data set of Ref. 12 and the Rosetta Inpharmatics Yeast compendium data set13. For more information on the datasets used in this study see section 2 of the Supplementary Material.

To explore the influence of different options on our algorithm we ran a total of 12 experiments. We compare performance across different models (the Gaussian model vs the mass-distance model), different data sets (Cell Cycle vs. Rosetta) and different subsets of genes from the Yeast genome, as outlined below.
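The construction of the three nested gene sets PG, AE and AG can be sketched as follows. This is an illustrative implementation of the set definitions above, under our own assumption that G(Fj) and F(Pk) are available as dictionaries of sets; the function and variable names are ours, not the paper's.

```python
def gene_sets(genome, family_genes, pathway_families):
    """Build the three gene sets of section 2.4.

    genome: set of all gene ids (AG).
    family_genes: dict F_j -> set of genes assignable to family F_j, i.e. G(F_j).
    pathway_families: dict P_k -> set of enzyme families F(P_k) in pathway P_k.
    """
    # PG: genes assignable to at least one reaction of at least one pathway,
    # PG = union over k, over F_j in F(P_k), of G(F_j).
    PG = set()
    for families in pathway_families.values():
        for fj in families:
            PG |= family_genes.get(fj, set())
    # AE: all annotated or predicted enzymes, whether or not their
    # families occur in the pathways under consideration.
    AE = set().union(*family_genes.values()) if family_genes else set()
    # AG: the entire genome.
    AG = set(genome)
    return PG, AE, AG
```

By construction PG is a subset of AE, which is a subset of AG, which matches the description of the sets as increasingly constrained.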

Table 1. Comparative results for different experimental settings. Results are reported over the test set of 23 pathways. The first column lists the experimental setup. Codes used: Cell Cycle data (TS), Rosetta (ROS), pathway genes (PG), all enzymes (AE), entire genome (AG), mass distance (MD), Gaussian model (GM), deterministic algorithm (DET); e.g. "ROS:AE:MD" is the experiment using Rosetta, clustering only enzymes and using the mass distance model. In the second column we show the number of pathways with a verified (EV) assignment in the top position. The third column shows the number of pathways with violated constraints (the number in parenthesis is over the entire set of 52 pathways used in clustering). The fourth and the fifth columns show the precision and recall with respect to the verified genes (where genes are assigned to a pathway if the posterior probability is greater than a threshold θ = 0.1). In the last two columns we show the MAP with respect to the ranking of genes based on their affinity, and with respect to the ranking of all possible deterministic assignments based on their score (see section 3 of the Supplementary Material for details). The last line in the table represents the results of a model that gives a random ordering of the genes and assignments.

[Table 1 rows: TS:AG:MD, TS:AE:MD, TS:PG:MD, TS:PG:GM, TS:AG:GM, TS:AE:GM, TS:PG:DET, ROS:PG:MD, ROS:PG:GM, ROS:AE:MD, ROS:AG:MD, ROS:AE:GM, ROS:AG:GM, ROS:PG:DET and the random model, with the values for each of the columns described above.]

3.1. Summary of results

Our method is conceived as an extension of current pathway reconstruction methods like Pathway Tools15. These methods do not attempt to assign genes to pathways selectively, and hence cannot be compared to ours; therefore we need some other baseline to compare our results to. We consider a random model that generates random permutations over the set of all possible assignments (this setting is similar to that of KEGG or Pathway Tools, where there is no ranking over assignments and all assignments are equally probable). For each pathway we generate 100k random permutations and compute the average MAP over the results (see section 3 in the Supplementary Material for details). We also compare the results with those of the deterministic algorithm c of our previous work7.

c To compare the results, we ran the deterministic algorithm of Ref. 7 but skipped the last step, which attempts to minimize shared assignments by looking at near-optimal assignments, since it explicitly drives the assignments towards solutions that assign a single gene to each reaction.

As Table 1 shows, our algorithm is able to improve significantly over the random model under all settings: our model exploits the information in the expression data to rank the genes effectively. It also improves over the deterministic algorithm. When comparing the different settings, our general conclusions are:

• Clearly the choice of the expression data set is important. The performance of our algorithm on the Rosetta set is significantly and consistently better than on the Cell Cycle data. The difference between the two datasets is caused by a few pathways for which the Rosetta data set is more informative.

• There are no significant differences between the different model variations within each expression dataset, all other being equal, but there are some noticeable trends. The mass-distance model has a slight advantage compared to the Gaussian model. The good performance even with the Euclidean metric reflects the strong correlation between pathways and expression patterns. This confirms that pathways tend to have unique characteristic expression profiles.

• Most of the pathways have zero or one violated constraints (see section 1.2 of the Supplementary Material) in all settings. However, there are a few pathways in which consistently most of the reactions are not satisfied (such as arginine biosynthesis, cysteine biosynthesis II, arginine degradation I and trehalose biosynthesis III). To some extent this reflects the capacity (or lack thereof) of our model to cover certain pathways; however, it can also suggest that these pathways might not exist in Yeast, or that they exist in a different configuration than the pathway blueprint. This conclusion is also reinforced by the fact that the average number of violated constraints does not seem to depend on the expression dataset.

• Interestingly, the performance does not decline significantly when we use a larger set of genes.

• Finally, most pathways tend to have similar performance across all experiments within a data set, which explains the small difference in performance between experiments within each dataset. The number of related genes (genes that have significant affinity with a pathway although they were not initially assigned to it) loosely correlates with the number of reactions in the pathway, as well as with the number of genes initially assigned to the pathway (data not shown).

3.2. Example - Homoserine methionine biosynthesis

In this section we present an individual case which illustrates our method, revealing interesting dependencies between pathways. Due to the page limit we focus only on one example; two additional examples are given in the Supplementary Material (section 4). The examples are discussed in the context of the ROS:AE:MD experiment. This setting is one of the best ones according to our performance measures, and it is chosen because the clustering is done with all enzymes.

Methionine is bio-synthesized in this pathway from homoserine. The pathway is part of the superpathway Threonine and methionine biosynthesis, which consists of three pathways: homoserine biosynthesis, homoserine methionine biosynthesis and threonine biosynthesis from homoserine, as depicted in Figure 1a. Homoserine is synthesized from aspartate in the first pathway, and methionine and threonine are synthesized in the second and third pathways respectively, both starting from homoserine; therefore the super-pathway forks at homoserine.

Fig. 1. (a) The relation between homoserine biosynthesis, homoserine methionine biosynthesis and threonine biosynthesis from homoserine (intermediates shown in the diagram: L-aspartate, homoserine, homocysteine, tetrahydropteroyltri-L-glutamate, L-methionine, L-threonine). (b) The characteristic profiles of the three pathways: homoserine biosynthesis and threonine biosynthesis from homoserine are correlated, while homoserine methionine biosynthesis is just loosely correlated.
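The evaluation measures used in Table 1 can be sketched as follows. The threshold θ = 0.1 comes from the table caption; the helper names and the illustrative numbers are ours, and since the paper's exact MAP definition is given only in its Supplementary Material, the average-precision function below is one standard reading of that measure, not necessarily the authors' implementation.

```python
def precision_recall_at_threshold(posteriors, verified, theta=0.1):
    """posteriors: dict gene -> affinity p(P_k | gene) for one pathway.
    verified: set of experimentally verified genes for that pathway.
    A gene is assigned to the pathway when its posterior exceeds theta.
    """
    assigned = {g for g, p in posteriors.items() if p > theta}
    tp = len(assigned & verified)
    precision = tp / len(assigned) if assigned else 0.0
    recall = tp / len(verified) if verified else 0.0
    return precision, recall

def average_precision(ranked_genes, verified):
    """Average precision of a ranking: the mean of the precision values
    observed at the rank of each verified gene."""
    hits, precisions = 0, []
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in verified:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Averaging the second function over the 23 test pathways would give a MAP-style score; a random ordering of the genes, as in the baseline of Table 1, yields low average precision because verified genes land at arbitrary ranks.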

Though related, these pathways have different characteristic expression profiles for most of the experiments.

There are seven genes that can be initially assigned to this pathway based on database annotations and function prediction (by similarity). However, only three are experimentally verified assignments: MET2, MET6 and MET17. This pathway has three reactions, and our algorithm assigns the genes such that each reaction is associated with exactly one gene; no other genes are associated with the pathway. Indeed, MET2, MET6 and MET17 are the only genes whose affinity with the pathway (posterior probability) at the end of the run is significant (1 in this case), as is shown in Table 2. The other four unverified genes have insignificant affinity to the pathway. Note that all the unverified genes, as well as MET17, have similar functional assignments initially (to 6 different reactions), only with different e-values; yet only MET17 makes it to the final round and is assigned to the pathway by our algorithm. Thus, our algorithm consistently recovers the experimentally verified genes. We have excluded reaction 4.2.99.10 from the pathway model because it has an incomplete EC number d.

d Recently, this reaction was revised and assigned a new number, 2.5.1.49.

Table 2. The probabilistic assignment of the Homoserine methionine biosynthesis pathway. The table lists all the genes that are potentially assigned to this pathway. In the first column we show the name of the gene or the systematic name. The second column shows the Biozon ID14. The third column shows the affinity to the pathway (p(Pk|xi)). The fourth column shows the EC family membership (the number in parenthesis is the weight, which reflects the confidence in this assignment); if the EC number is in bold, the gene was annotated in the database as capable of catalyzing this reaction, otherwise it was predicted. In the last column we list the MetaCyc ID of alternate pathways to which this gene was also assigned (the number in parenthesis is the affinity to that pathway). The double line separates the genes that were assigned to this pathway from the ones that were rejected.

Gene Name | Biozon ID      | Verification | Pathway affinity | EC Numbers | Alternative pathway affinity
MET2      | 004860000048   | verified     | 1.00             | …          | …
MET6      | 007670000142   | verified     | 1.00             | …          | …
MET17     | 004440000819   | verified     | 1.00             | …          | …
YHR112C   | 003780000158   | unverified   | 0.00             | …          | …
Str3      | 004650000171   | unverified   | 0.00             | …          | …
Cys3      | 003940001012   | unverified   | 0.00             | …          | …
YFR055W   | 003400000153   | unverified   | N/A (no profile) | …          | N/A (no profile)

The alternate pathway assignments recorded for the rejected genes involve the MetaCyc pathways HOMOCYSDEGR-PWY, PWY-801 and GLYOXYLATE-BYPASS. However, the three pathways seem to be similar in certain experiments (Figure 1b), which suggests that th