September – December 2007 – Student Research Rotations

1) Title: Viral Bioinformatics
Lab: Dr. Chris Upton, Biochemistry and Microbiology, University of Victoria
Some additional funding may be available to help with the student's relocation and accommodation.
Description: A visiting rotation student would have the opportunity to take part in several bioinformatics projects tailored to the student's interests and background. These projects would expose the student to both software development and genome analysis techniques. My group is 11 strong and consists of a mix of computer scientists, biochemists/microbiologists and several members who have training in both areas; the overall focus of the lab is viral bioinformatics. There are 3 graduate students and 1 postdoctoral fellow (PDF) in the lab, and there are usually 1 or 2 Co-op students from UVic. Students choosing to come to the lab could have a primary interest in either computer science or biochemistry.
The lab has 4 major research areas, funded by NIH and NSERC (2 grants). All are heavily dependent on bioinformatics. 1) Design and production of databases and software to analyze the large number of completely sequenced virus genomes (poxviruses [smallpox], coronaviruses [SARS], herpesviruses and baculoviruses). We are also now part of a large collaborative effort to produce a Bioinformatics Resource for a series of emerging virus pathogens. 2) Analysis of genome structure (repeats) and gene/promoter prediction using bioinformatics tools. 3) Novel techniques for genome annotation and identification of distant homologs. 4) Lab validation of protein function predicted by bioinformatics tools. We have multiple collaborations with labs around the world.
See http://www.virology.ca or email cupton@uvic.ca for more information.
Poxvirus Background
Orthopoxviruses comprise a diverse family of large viruses that cause disease in vertebrates. Among them, variola virus, the causative agent of smallpox, was a major threat to humanity until its eradication around 1980.
Recent bioterrorism events have brought poxviruses and smallpox to the forefront of attention once again, and there is an urgent need to increase our knowledge of their pathogenic mechanisms and to identify new targets for therapeutics. At present there are no satisfactory drugs that can be used to treat smallpox infection and, owing to serious vaccine complications, a return to mass vaccination with the current live vaccine is not desirable. There is a wealth of poxvirus genomic information available (more than 50 complete genomes); however, software is needed to permit rigorous and novel analyses on a genome scale.

2) Title: Identification of genes involved in obesity, retinal degeneration and polycystic kidney disease
Lab: Michel Leroux, Dept. of Mol. Biol. and Biochem., Simon Fraser University
Description: We are taking bioinformatics, genomic and genetic approaches to uncovering candidate genes that are associated with the function of cilia, which are slender organelles that protrude from most cell types in humans and are involved in various sensory processes (1-3). Genes linked to cilia function are associated with a plethora of human ailments, including blindness, kidney and heart problems, obesity, and diabetes. This makes the identification of ciliary components, and the subsequent analysis of their functions on a global scale, useful not only for understanding fundamental biological processes but also for providing important insights into some disorders that are prevalent in human populations. Our organism of choice for these studies is the nematode C. elegans, which has a completely sequenced genome and a short lifespan that facilitates genomic and genetic analyses. The student will be exposed to a variety of in silico and other experimental techniques to identify and subsequently characterise novel ciliary genes. For the best candidate genes, we collaborate with other groups to determine whether the gene of interest may be implicated in a disease such as Bardet-Biedl syndrome (BBS) or a retinopathy.
Evaluation: Laboratory performance and a final report will be used to evaluate the project.
1. Ansley et al. (2003) Basal body dysfunction is a likely cause of pleiotropic Bardet-Biedl syndrome. Nature 425, 628-633.
2. Blacque et al. (2005) Functional genomics of the cilium, a sensory organelle. Curr. Biol. 15, 935-941.
3. Fan et al. (2004) Mutations in a member of the Ras superfamily of small GTP-binding proteins causes Bardet-Biedl syndrome. Nat. Genet. 36, 989-993.

3) Title: Development of a statistical and visualization platform for ChIP-on-Chip
Advisors: Michael Kobor (Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, UBC) and Raphael Gottardo (Department of Statistics, UBC)
Description: DNA microarrays have been widely used to measure gene expression. However, to fully understand transcriptional circuitry, it is necessary to determine the direct binding of regulatory proteins to genome regions. To address this, a new and innovative technology has been developed that combines chromatin immunoprecipitation (ChIP) with DNA microarray technology. Reflecting the integration of these two approaches, this technology is frequently called “ChIP-on-Chip”. It is rapidly becoming a key technology in different fields of molecular biology. ChIP-on-Chip permits the unbiased mapping of in vivo binding sites of transcription factors at gene-regulating elements and the localization of region-specific posttranslational modifications in chromatin proteins. Affymetrix recently developed high-density oligonucleotide arrays that tile all nonrepetitive sequences of different organisms, from yeast to human. Using these Affymetrix arrays, we have performed initial experiments to map specific chromatin modifications in budding yeast. A huge challenge is to analyze the massive amount of data (about one hundred thousand probe pairs for the whole genome in yeast and almost one million probe pairs for one chromosome in human). The primary goal of this project is to develop statistical methods and software for the analysis of Affymetrix tiling arrays. The platform should enable the analysis and interpretation of our datasets and also allow cross-comparison with ChIP-on-Chip data from other research groups. A secondary goal is to develop software that allows the visualization of binding sites on chromosome maps.
The student's work could involve some of the following: exploratory data analysis, normalization, model development and programming. Furthermore, the student will have the opportunity to participate hands-on in the planning and execution of ChIP-on-Chip experiments. For this project, a basic knowledge of statistics and R/S-plus would be helpful but is not strictly necessary. The work done is likely to lead to publication(s) in excellent journals.

4) Title: Designing a Synthetic Bacterial Genome
Lab: Dr. Rob Holt, Genome Sciences Centre, Dept. of Psychiatry, UBC
Description: With a relatively small genome consisting of 1.8M base pairs, Haemophilus became one of the first free-living organisms with its entire genome sequenced. This small, well-annotated genome, plus the close physiological and phylogenetic relationship with E. coli, makes Haemophilus an ideal candidate to use as a template in the design and eventual construction of an entirely synthetic, yet functional, genome. The design of a synthetic genome will consist of first identifying essential and nonessential components of the existing genome in an effort to reduce the genome to a more manageable size. Non-essential genes will be identified through homology and interaction network analysis, annotation of non-regulatory, non-coding, and redundant regions, and review of published mutagenesis experiments. Where possible, these non-essential genes will be removed from the design. To further reduce the size of the genome, pathways will be simplified and genes will be overlapped on different reading frames wherever possible. The final genome will then be organized such that it can be fragmented into approximately 40K base pair segments, which are amenable to synthesis and could in principle be joined in a host cell to form a complete engineered genome. Advances have already led to the construction of a synthetic virus, and creating a simplified bacterial (i.e., cellular) genome is the next logical step. The construction of a synthetic genome would provide valuable insight into the basics of genome organization, composition, essential pathways, and protein function. Establishing procedures for designing, building and activating synthetic genomes is a key milestone in the emerging field of synthetic biology.

5) Title: Strand specificity of putative guanine quadruplex forming sequences in relation to published gene expression patterns and regulatory regions of stem cell genes
Lab: Dr. Peter Lansdorp, Terry Fox Lab, BCCA, Dept of Medicine, UBC
Description: Based on a new theory of stem cell self-renewal (1), published stem cell gene expression studies need to be analyzed to examine expression of “stem cell” genes in relation to the promoter sequences of those genes. Specifically, we need to analyze strands carrying guanine-rich DNA capable of forming G-quadruplex or G4 DNA, a stable, four-stranded DNA structure formed from guanine-rich regions. The hypothesis is that G4 DNA participates in the epigenetic regulation of transcription (2, 3).
1. Lansdorp PM. Immortal strands? Give me a break. Cell 129:1244-7, 2007
2. Du Z, Kong P, Gao Y, Li N. Enrichment of G4 DNA motif in transcriptional regulatory region of chicken genome. Biochem Biophys Res Commun 354:1067-70, 2007
3. Fernando H, Reszka AP, Huppert J, Ladame S, Rankin S, Venkitaraman AR, Neidle S, Balasubramanian S. A conserved quadruplex motif located in a transcription activation site of the human c-kit oncogene. Biochemistry 45:7854-60, 2006
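As a rough illustration of the strand-specific search described in project 5, the sketch below scans a sequence for putative G4-forming motifs on both strands. It assumes the widely used pattern of four runs of three or more guanines separated by loops of 1 to 7 nucleotides; the pattern, the function name `find_g4` and the coordinate convention are illustrative assumptions, not the lab's actual method.

```python
import re

# Assumed motif: four G-runs (>= 3 G each) separated by loops of 1-7 nt.
G4_PLUS = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")
# A G4 motif on the opposite strand shows up as C-runs on the given strand.
G4_MINUS = re.compile(r"C{3,}[ACGT]{1,7}C{3,}[ACGT]{1,7}C{3,}[ACGT]{1,7}C{3,}")


def find_g4(seq):
    """Return sorted (start, end, strand) tuples for putative G4 motifs.

    Coordinates are 0-based, end-exclusive, and always refer to the
    given (plus) strand, even for minus-strand hits.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", G4_PLUS), ("-", G4_MINUS)):
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)


if __name__ == "__main__":
    example = "AT" + "GGGAGGGTGGGAGGG" + "TTTT" + "CCCTCCCACCCTCCC" + "AT"
    for start, end, strand in find_g4(example):
        print(start, end, strand)
```

Reporting the strand alongside each hit is what would allow motif positions to be compared with the transcribed strand of each gene; how hits are then related to expression data is left open here.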

6) Title: Clustering algorithms and software tools for the automated analysis of flow cytometry data.
Lab: Dr. Ryan Brinkman, BC Cancer Research Centre
Description: Flow cytometry is a technique that allows precise and high-throughput measurements of the presence or absence of marker proteins on a cell-by-cell basis and is widely used for both HIV and cancer research and treatment. Datasets produced in this way often contain measurements of multiple markers for many hundreds of thousands of cells. Current analysis methods involve human experts separating subpopulations in the data through the use of polygonal “gates”. However, this method is undesirable because it is often very subjective and fails to take into account the multidimensional nature of the dataset. We have been developing a software tool to automate the identification of cell subpopulations. We are investigating several clustering algorithms, such as fuzzy k-means and self-organizing maps, to see which, in combination, can identify the wide variety of shapes of cell populations. This work term would allow the student to gain experience in clustering algorithms, computer programming (Java, C++ or R; we currently use all of these, though R is preferred) as well as high-performance parallel computing.
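To make the idea of automated gating in project 6 concrete, here is a minimal sketch of plain (hard) k-means applied to per-cell marker measurements. It is a stand-in for illustration only: the project text mentions fuzzy k-means and self-organizing maps, which are not implemented here, and the function name, parameters and toy data are assumptions.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster points (tuples of marker intensities) into k groups.

    Returns (labels, centroids). Plain k-means, not the fuzzy variant.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each cell goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


if __name__ == "__main__":
    # Toy two-marker data: one marker-low and one marker-high population.
    cells = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
             (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
    labels, centroids = kmeans(cells, k=2)
    print(labels)
```

Unlike a hand-drawn polygonal gate, the assignment uses all marker dimensions at once. Plain k-means favours roughly spherical populations, which is one reason combinations of algorithms are being investigated.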

7) Title: International flow cytometry database development.
Lab: Dr. Ryan Brinkman, BC Cancer Research Centre
Description: Flow cytometry is a technique that allows precise and high-throughput measurements of the presence or absence of marker proteins on a cell-by-cell basis and is widely used, for example, in HIV/AIDS and cancer research and treatment. Our group is leading an international effort to develop a public flow cytometry database under the auspices of the International Society for Analytical Cytology (http://www.isac-net.org/). It is expected that flowDB would become a primary repository for flow cytometry data and metadata related to publications, much as GEO and ArrayExpress are for microarrays. Our current focus is on developing the requirements and extending FuGE (http://fuge.sourceforge.net/) in a compliant manner (http://wiki.ficcs.org/ficcs/FuGEFlow). This work term would allow the student to gain experience in UML and standards-based database design.

8) Title: Sex determination in Atlantic salmon
Lab: Willie Davidson, Department of Molecular Biology and Biochemistry, Simon Fraser University
Description: Identifying the sex-determining gene (SEX) in Atlantic salmon has been a Holy Grail for salmonid biologists for many years. With the availability of genomic data, we are now in a good position to achieve this goal. By integrating the linkage map with the physical map and the karyotype, it was recently possible to show that the long arm of chromosome 2 is where SEX resides. As part of cGRASP (Consortium for Genomics Research on All Salmonids Project), we are in the process of constructing a BAC minimum tiling path for chromosome 2. There are two main regions that have been extensively characterized, and 23 BACs that cover approximately 4 Mb have been sequenced. Some of the BAC sequences have been completely finished, whereas others are in the latter stages and comprise several contigs. This project will involve the annotation of these BACs with a view to identifying candidate sex-determining genes. There is an extensive EST database, with >460,000 reads that have been placed in contigs and annotated. There is also a well-developed repeat masking database specifically for Atlantic salmon. The primary goal will be to identify segments of genomic DNA corresponding to coding regions whose gene products can be identified. The successful applicant will join an ambitious team that is at the forefront of salmonid genomics research, and is leading the consortium that will sequence the Atlantic salmon genome. For more information contact wdavidso@sfu.ca (604-291-5637).

9) Title: Automation of a SNP discovery analysis pipeline
Lab: Dr. Angie Brooks-Wilson, Genome Sciences Centre, BCCA
Rotation Project Description: My laboratory investigates genetic susceptibility to cancer at the population level. We also study resistance to cancer and other diseases associated with aging in the Healthy Aging project (also known as the Super-Seniors project). In our projects, we discover genetic variants by re-sequencing candidate genes, and then assess those variants by genotyping DNA from hundreds or thousands of cases (individuals with the phenotype of interest) and controls (unaffected individuals). We then correlate the presence of specific variants and haplotypes (combinations of genetic variants) with disease or, in the case of the Healthy Aging project, with unusual long-term good health.

One challenge of our current approach is processing, quality-checking and interpreting data from high-volume gene re-sequencing experiments and comparing it to data in public databases. At present, we use two main software tools, PolyPhred/Consed (1) and Mutation Surveyor (www.softgenetics.com), to process raw DNA traces off the ABI 3730XL sequencers and to discover single nucleotide polymorphisms (SNPs), insertions, deletions and other variants. Both tools require substantial formatting of the input data and time-consuming processing of the program output. Much of this manipulation involves time-consuming manual checking of DNA sequence traces. Genetic variants are then further formatted and annotated before being imported into the lab’s Progeny database. The goal of the Bioinformatics Student Rotation Project is to develop an innovative and user-friendly system to link these software tools to facilitate discovery of genetic variants. The Rotation Student will be encouraged to interact with other members of the lab to expand his or her knowledge of genetics, and to interact with other bioinformatics experts associated with the Bioinformatics Training Program.
The specific aims of the project are to:
1) Automate formatting of input data for PolyPhred/Consed and Mutation Surveyor
2) Automate optimal formatting of output data from these analysis tools
4) Cross check detected variants with dbSNP, HapMap and other public databases
5) Retrieve allele frequencies and other data for variants found in public databases
6) Format variant reports for import into the lab’s Progeny database
7) Research and assess other tools for variant discovery
The bioinformatics rotation student will interact with many lab members to gain an understanding of the importance of genetic variation in health and disease, and to help develop our pipeline for optimally handling genetic data.
(1) Stephens M, Sloan JS, Robertson PD, Scheet P, Nickerson DA. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat Genet. 2006 Mar;38(3):375-81.

10) Title: Epigenomics and gene regulation.
Supervisor: Dr. Steven Jones, Genome Sciences Centre.
Description: Using the next generation of DNA sequencing machines, we have developed methodology to map epigenetic modifications of chromatin, specifically histones, genome-wide. We have also developed sequence-based technology to map transcription factors genome-wide (Nature Methods, 2007, PMID: 17558387). We would like to understand more fully the role that mono-methylation of the histone tail plays in gene regulation or in the marking of regions involved in gene regulation. In the human HeLa cell line, we have mapped regions associated with mono-methylated histone as well as the binding sites for the transcription factor STAT1. We have also mapped positions of acetylated histone protein. For this we have generated more than 80 million DNA sequence reads. These data now pose a number of questions. Are the epigenetic marks strongly indicative of gene activity, or do they poise the cell for future activity? Are the mono-methylated regions useful in identifying regions functioning in gene regulation? Are there other clues as to why these regions can accrue epigenetic markings? In this project, you will work with the bioinformatics team at the Genome Sciences Centre to attempt to answer some of these questions using computational approaches.
Evaluation: The assessment of this rotation will be a presentation of your findings to the laboratory.

11) Title: Functional annotation of the mammalian genome
Lab: Dr. Dan Goldowitz, Medical Genetics, CMMT (Children’s Hospital)
Co-supervisor: Paul Pavlidis, Dept of Psychiatry, UBC
Description: The functional annotation of the mammalian genome is one of the more exciting projects in biomedical research, and the success of this effort firmly resides with collaborations between bench scientists and bioinformaticians. The idea behind a set of rotation projects is to explore three different neurological mutant mice whose causative gene has yet to be identified, and to work along with the bench scientists to identify and validate possible candidate genes in hopes of discovering the mutant gene and the gene-function relationship that exists for each of the mutants. Each mutant mouse has varying sets of biological and genetic data to help start the search. The data include microarray expression, gene mapping, developmental and cellular analyses, and experimental mouse chimeras. The task would be to understand and use these various sources of data, together with the extensive knowledge of the mouse genome and various bioinformatics tools, to arrive at suspected genes that can be tested with simple approaches such as quantitative RT-PCR, sequencing, or in situ hybridization. The mutants available for discovery are: 1) meander tail, which has a very specific loss of anterior lobe granule cells in the cerebellum, 2) infantile gliosis, which shows an exuberant gliotic response that largely consumes the neuraxis by 2 weeks after birth, 3) 5TND, which has an accumulation of ectopic cells in the cerebellar molecular layer, and 4) 153TNP, whose cerebellum is very small due to the deletion of major populations of cells early in development.

12) Title: Cerebellar Development in Time and Space
Lab: Dr. Dan Goldowitz, Medical Genetics, CMMT (Children’s Hospital)
Co-supervisors: Dr. Steven Jones, GSC and Dr. Wyeth Wasserman, CMMT
Description: In this project we are obtaining Illumina full genome expression data from the mouse over a series of developmental times. This data collection is happening in three contexts:
1. Mutants that have primary defects in cerebellar development (i.e., Math1 null, Pax6 null, and the yet-to-be-identified meander tail),
2. C57BL/6J and DBA inbred strains, and
3. 35 recombinant inbred lines made between C57BL/6J and DBA inbred strains.
This dataset is complemented by the phenotypic analysis of the cerebellum for several parameters at the different developmental time points. This very rich set of data should provide nice bite-size projects for informatics students who would like to test microarray sets for the application of algorithms to identify gene regulatory networks and single genes to be further explored with in situ hybridization, qRT-PCR and shRNA knockdown. Furthermore, there is the chance to make more quantitative phenotype-to-genotype correlations with these data. The tools in WebQTL (now known as GeneNetwork) would be one excellent starting point. Currently we have a mini-consortium of PIs who are involved in this project, and students could use them as resources. They provide expertise in graph theory, latent semantic indexing, Bayesian analysis, and higher-order statistical approaches to examining quantitative trait loci.

13) Title: Design, analysis and visualization of high-density oligonucleotide arrays.
Lab: Stephane Flibotte, Genome Sciences Centre, BC Cancer Research Centre.
Description: High-density oligonucleotide arrays are now routinely used to perform various genomic experiments: gene expression, comparative genomic hybridization (CGH), ChIP-chip studies, etc.
The Maskless Array Synthesis technology developed and used by NimbleGen Systems Incorporated allows the researcher to customize each individual array for his/her own specific experimental needs. We have developed a pipeline to select oligonucleotides taking into account various criteria such as melting temperature, self-folding energy, and the presence of homopolymers or homology with other regions of the genome. This pipeline has produced high-quality arrays for CGH experiments in C. elegans (J.S. Maydan et al. “Efficient high-resolution deletion discovery in Caenorhabditis elegans by array comparative genomic hybridization”, Genome Research 2007; 17: 337-347) and several groups have now asked if we could collaborate with them to help design arrays for their experiments. Specific experimental conditions and differences between genomes make it difficult to build an efficient pipeline usable for any experiment without modifications. For example, one of our new collaborators wants to study a parasite with very large families of genes. In such situations one will have to consider designing oligonucleotides that simultaneously target a small number of genes instead of insisting on the usual uniqueness. Development work would be needed to implement a computationally efficient way of designing such oligonucleotides, and new algorithms would have to be developed to analyze the resulting arrays. Another important aspect where software development is still needed is the visualization of the processed data. During this rotation the student will learn about array design and the subsequent statistical analysis and visualization of array data, and potentially about some machine learning techniques.
Evaluation: A final report will be used to evaluate the project.

14) Title: Improved identification of bacterial membrane proteins with fewer than three membrane-spanning domains, which represent primary drug targets or vaccine component candidates.
Lab: Dr. Fiona Brinkman, MBB, SFU
Contact: Direct any questions to Fiona Brinkman at brinkman@sfu.ca.
Summary: Adapt our current PSORT software to improve its ability to identify membrane proteins with fewer than three membrane-spanning domains.
More information: Computational prediction of the subcellular localization of proteins is a valuable tool for genome analysis and annotation. The prediction of membrane proteins and/or proteins on the cell surface is of particular interest due to the potential of such proteins to be primary diagnostics, drug targets or, in the case of microbial pathogens, vaccine components.
In environmental microbiology, such sequences are also of interest for their potential in microbe detection and environmental analysis. A protein’s subcellular localization is influenced by several features present within the protein’s primary structure, such as the presence of a signal peptide or membrane-spanning alpha-helices. Although several algorithms have been developed to analyze single features such as these, only PSORT analyzes several features at once, using information obtained from each analysis to generate an overall prediction of localization site. We have recently developed a new version of PSORT, named PSORTb, which is the most precise method to date for prediction of protein subcellular localization for bacteria (http://www.psort.org). This program reflects the many advances in both knowledge of protein sorting and computational analysis techniques (primarily machine learning and data mining) made in the last decade. Programs are written in Perl, for conversion into BioPerl modules, and are open source (GNU GPL).

For this proposed project, you would expand on previous PSORT software development to create a module that would help improve the identification of membrane proteins that have fewer than three transmembrane helices. Such membrane proteins are disproportionately under-predicted by our current method and are of high interest. You would examine a dataset of known membrane proteins that have two or fewer membrane helices, examining the potential of support vector machine-based or sequence similarity-based approaches to improve their classification.
Evaluation: You will give a small talk to the Brinkman Lab on your research toward the end of the term, with additional rough documentation as an “appendix” stored electronically in a Brinkman server directory for future reference by others.

15) Title: Characterization of unusually large intergenic regions that likely contain novel functional genes.
Lab: Dr. Fiona Brinkman, MBB, SFU
Contact: Direct any questions to Fiona Brinkman at brinkman@sfu.ca.
Summary: We have identified regions of bacterial genomes between genes that are unusually large and are statistically likely to contain functional coding or non-coding genes. This project will involve using a diverse array of novel bioinformatics approaches to identify whether there is a possible gene in such regions and to characterize the features of such genes that were missed by earlier annotation efforts. Aspects of this analysis would be automated for running on a larger scale across multiple genomes.
More information: A large project called “Bioinformatics for Combating Infectious Diseases” (BCID) is starting this year at SFU. The BCID project involves 12 faculty from computing science, MBB, biology, chemistry, physics, public health and medicine who will use an interdisciplinary approach to improve aspects of the computational pipeline for identifying novel antimicrobial drug targets and drugs.
As a part of this project, we are examining regions of sequence between known bacterial genes which are considered to be essential or involved in virulence based on high-throughput laboratory screens. We are focusing on Pseudomonas aeruginosa as a model organism initially, with plans to also examine Staphylococcus aureus and Mycobacterium tuberculosis and then to ramp up to a large-scale analysis of all bacteria for which there is suitable data. Some of these intergenic sequences are unusually large and statistically likely to contain a novel gene that is essential or involved in virulence and therefore represents a potential novel drug target of high interest. For this rotation project you would become a member of this BCID research group and use a series of bioinformatics methods to further investigate features of the sequence regions of primary interest to determine if they may contain novel genes. Novel bioinformatics tools will be used, including recently developed methods for identifying non-coding RNA secondary structure, utilizing evolutionary conservation, and predicting function based on subcellular localization, presence of regulatory elements upstream, and co-transcription with flanking genes. This project has the potential to identify some truly novel genes that represent previously undiscovered drug targets that will be further examined through our interdisciplinary drug discovery pipeline. You will interact with a large group of researchers and gain insight into how interdisciplinary team-based research is initiated.
Evaluation: You will give a small talk to the Brinkman Lab on your research toward the end of the term, with additional rough documentation as an “appendix” stored electronically in a Brinkman server directory for future reference by others.

16) Title: Improving the identification and characterization of orthologs.
Lab: Dr. Fiona Brinkman, MBB, SFU
Contact: Direct any questions to Fiona Brinkman at brinkman@sfu.ca.
Summary: Expand upon existing Ortholuge software developed by our group to improve the identification of orthologs most likely to be functionally similar (primarily incorporating gene order and regulatory element data). This can be used to aid cross-species analyses of a wide variety of Pathogenomics Project data, leading to new insights regarding gene function conservation.
More information: The Pathogenomics of Innate Immunity (PI2) program, or Genome Canada Pathogenomics Project, aims to increase current understanding of how mucosal immunity to infectious agents operates, and how it may be enhanced to enable the rational development of new and effective strategies for improving human health, animal productivity and welfare, and food safety.
Unlike other efforts in this field, the PI2 program is investigating mucosal immunity using genomics approaches in a wide range of hosts, including humans, mice, chickens, and cattle, and thus straddles the fields of agriculture and health. This permits very broad conclusions to be made about the mechanisms of immunity in these hosts, as well as measures that will enhance immunity. It also provides access to data for performing unique comparative genomics studies involving the characterization of orthologs and features of orthologs. The Brinkman Laboratory is heading Bioinformatics for this project. A large database of all gene expression data generated from the project is to be analyzed, which includes gene expression data from both the host and pathogen under a variety of infection conditions for a number of different hosts and pathogens. A number of questions are emerging from this work, but currently one of the most notable bioinformatics challenges we face involves improving the precise identification of orthologous genes between the species, to permit high-quality comparative analyses to be made. Orthologs (genes that diverged due to speciation) are of primary interest as they represent genes of common descent in two organisms. However, they are frequently identified by simplistic “reciprocal best BLAST hit” searches that are fraught with false-positive ortholog identification. We have recently developed a more phylogenetically based approach called “Ortholuge”, which now needs to be refined. Ortholuge’s approach, which uses an out-group organism to root relationships between the two species being compared, currently uses just sequence comparisons between three species to determine which genes are orthologs and which are paralogs. We need to better analyze orthologous relationships between a select set of species, and incorporate additional metadata for the identification of orthologs using our approach. Such high-quality ortholog identification is of course not only of use for this project, but is applicable to any bioinformatics/genomics project that involves comparisons between species. For this project, you would further investigate adding additional features to our method of ortholog identification, such as conservation of regulatory binding sites and domains. Selected analyses of the microarray data we obtain for comparative genomics purposes for this Pathogenomics Project could be pursued, investigating the fundamental issue of the degree to which orthologs do maintain similar function versus paralogs. For this project you would collaborate with a team of other bioinformaticists, including the larger Pathogenomics Project group, involving researchers at SFU, UBC, U Sask, the Sanger Centre, and in Singapore. This project is suitable for those who have an interest in improving a fundamental bioinformatics analysis problem (identifying and characterizing orthologs) and learning more about evolutionary theory and how it may be applied to comparative genomics.
No strong knowledge of evolutionary theory is required. Evaluation: You will give a short talk to the Brinkman Lab on your research toward the end of the term, with additional rough documentation as an “appendix” stored electronically in a Brinkman server directory for future reference by others.
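
As a rough illustration of the “reciprocal best BLAST hit” baseline mentioned in the description above, the following minimal Python sketch pairs genes from two species when each is the other's top-scoring hit. It is not Ortholuge and does not use an out-group; the function names, gene names and scores are hypothetical, and the scores are assumed to have been parsed already from two BLAST runs (species A against B, and B against A).

```python
def best_hits(scores):
    """Map each query gene to its single highest-scoring subject gene.

    `scores` is a dict {(query, subject): score}; higher scores are better.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical toy scores (not real data).
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150,
          ("a2", "b2"): 180, ("a3", "b2"): 120}
b_vs_a = {("b1", "a1"): 195, ("b2", "a2"): 175,
          ("b2", "a3"): 110, ("b3", "a3"): 90}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data a3's best hit is b2, but b2's best hit is a2, so a3 is left unpaired. A pair that does pass this test may still be a pair of paralogs (for example after gene loss in one species), which is the kind of false positive that rooting with an out-group is meant to help detect.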

17) Title: Linking Function and Phylogeny Within Metagenomic Sequence Space Lab: Steven Hallam, UBC, Microbiology and Immunology Description: Metagenomic data sets are complex snapshots of the genetic potential of a microbial community. The development of tools for evaluation and comparison of metagenomic data sets has great potential to enhance our understanding of how ecological parameters drive selection and loss of particular metabolic subsystems among and between microbial groups, and how such genomic differences feed back on the physical and chemical properties of ecosystems. The goal of this project is the development of a workflow and interactive Java-based visualization tool that integrates the phylogenetic pipeline PhyloGenie (http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15459293) with a database of predicted functional roles defined in the Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/). Metagenomic data will be mapped onto KEGG and the resulting metabolic profiles will be used as seed sequences for comprehensive phylogenetic analysis. This information will be used in the construction of distance matrices that define the taxonomic diversity of metabolic pathways from one environmental setting to the next. Results will be visualized in the form of metabolic pathway maps similar to those generated by the KEGG Automatic Annotation Server (http://www.genome.jp/kegg/kaas/) and in the form of heatmaps and dendrograms. Evaluation: A final report and a functional workflow will be used to evaluate the project.

18) Title: Refining design of artificial tubular proteins and its verification with existing software Lab: Arvind Gupta, SFU, School of Computing Science Description: The Hydrophobic-Polar (HP) model for protein folding was introduced by Ken Dill [1] in 1985. In this model, a protein is folded onto a 2D or 3D lattice as a path (self-avoiding walk), and only hydrophobic interactions between neighbouring hydrophobic amino acids in the lattice contribute to the energy of the model. Thus, instead of considering 20 different types of amino acids, only 2 types are considered: hydrophobic and polar. In our recent paper [2], we designed sequences that fold uniquely into artificial tubular structures when folded onto a 3D hexagonal prism lattice under the HP model. The tasks of the student in this rotation are to: 1. refine the design of our sequences (going back from the two letters H and P to the 20 different amino acids); 2. identify existing protein folding software and test how the refined sequences fold using this software; 3. return to step 1 to improve the results or, alternatively, revisit together with us the initial design of tubular or other structures in the 3D HP model. Necessary Background: Some background in genetics would be useful although this can be learned during the rotation. The ability to program is a definite asset. Evaluation: A final report will be used to evaluate the project.
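
To make the energy rule of the HP model described above concrete, here is a minimal Python sketch that scores one fold on a 2D square lattice. The square lattice, the function name `hp_energy`, and the convention of counting -1 per hydrophobic contact are assumptions chosen for illustration; the tubular designs referred to in the description use a 3D hexagonal prism lattice, which this sketch does not model.

```python
def hp_energy(sequence, coords):
    """Energy of an HP sequence folded along `coords` on a 2D square lattice.

    `sequence` is a string over {"H", "P"}; `coords` lists the (x, y) lattice
    point of each residue in chain order. Assumed convention: each pair of H
    residues that are lattice neighbours but not chain neighbours adds -1.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("fold is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")

    index_at = {point: i for i, point in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every neighbouring pair is seen once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


# The same 4-residue chain folded into a square and stretched into a line.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(hp_energy("HPPH", square), hp_energy("HPPH", line))
```

In the square fold the two H residues at the chain ends become lattice neighbours, so that fold has lower energy than the straight line, where no H-H contact exists.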

[1] Dill, K.A., Theory for the folding and stability of globular proteins, Biochemistry 24(6), 1501-1509 (1985).
[2] Gupta, A., Karimi, M., Khodabakhshi, A.H., Manuch, J., Rafiey, A., Design of artificial tubular protein structures in 3D hexagonal prism lattice under HP model, Proc. of BIOCOMP 2007.

19) Title: Molecular self-assembly in stages Lab: Arvind Gupta, SFU, School of Computing Science Description: Molecular self-assembly gives rise to a great diversity of complex forms, from crystals and DNA helices to microtubules and holoenzymes. The formal study of pseudocrystalline self-assembly, in what is called the Tile Assembly (TA) model, started with a paper by Paul Rothemund and Erik Winfree [1] in 2000. In this model, we put together a collection of square tiles (each type of tile occurring in a large number of copies) and observe what kinds of structures are assembled from the tiles. Each type of tile is characterized by the types of glues on its four sides. For a given shape, the goal is to find a set of tiles with the minimum number of tile types that uniquely assembles into the given shape. Interestingly, a recent paper [2] by D. Soloveichik and E. Winfree showed a strong connection between Kolmogorov (descriptional) complexity and the minimum number of tile types. Ken Dill suggested the following variation of the TA model. Instead of combining all tiles together at once, we put tiles together in stages. For example, in stage 1, we combine tiles of type A and B; in stage 2, types C and D; and in stage 3, we combine the products from stage 1 and stage 2 with tiles of type E to obtain the final product. Of course, if we mixed all 5 types of tiles together, we would still be able to obtain the final product, but interactions between A and C tiles, for instance, could result in unwanted assembled shapes. Thus, under Dill’s variation we might need a smaller number of distinct tiles to assemble a given shape compared to the original TA model. The task of the student in this rotation is to formalize Ken Dill’s variation of the TA model and study its properties. Necessary Background: Students should have an appreciation of mathematical arguments and some programming skills. Evaluation: A final report will be used to evaluate the project.

[1] Rothemund, P.W.K., Winfree, E., The program-size complexity of self-assembled squares, STOC 2000.
[2] Soloveichik, D., Winfree, E., Complexity of self-assembled shapes, SIAM Journal on Computing 36(6), 1544-1569, 2007.
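
The intuition behind assembling in stages can be sketched in a few lines of Python. The sketch below is a deliberately simplified toy, not a formalization of the TA model or of Dill's variation: tiles are not rotated, glue strengths and temperature are ignored, and only pairwise glue matches between free tiles are listed. The four tiles, their glue labels and all function names are hypothetical. It shows one way staging might help: the same glue can be reused in two separate stages without the tiles that carry it ever meeting as free tiles.

```python
from itertools import product


def tile(n=None, e=None, s=None, w=None):
    """A square tile described by the glue label (or None) on each side."""
    return {"N": n, "E": e, "S": s, "W": w}


def possible_bonds(pool):
    """List every pairwise glue match between free tiles mixed in one pool.

    Returns a set of (left_or_top_tile, kind, right_or_bottom_tile) triples,
    where kind is "E-W" (side by side) or "S-N" (one above the other).
    """
    bonds = set()
    for (name_a, a), (name_b, b) in product(pool.items(), repeat=2):
        if a["E"] is not None and a["E"] == b["W"]:
            bonds.add((name_a, "E-W", name_b))
        if a["S"] is not None and a["S"] == b["N"]:
            bonds.add((name_a, "S-N", name_b))
    return bonds


# Hypothetical target: a 2x2 square with A, B on top and C, D below.
# Glue "g" is reused for both horizontal bonds.
tiles = {
    "A": tile(e="g", s="q"),
    "B": tile(w="g", s="p"),
    "C": tile(e="g", n="q"),
    "D": tile(w="g", n="p"),
}

stage1 = possible_bonds({k: tiles[k] for k in ("A", "B")})
stage2 = possible_bonds({k: tiles[k] for k in ("C", "D")})
mixed = possible_bonds(tiles)

# Horizontal bonds that can form only when all four tiles are mixed at once.
unwanted = {b for b in mixed if b[1] == "E-W"} - stage1 - stage2
print(sorted(unwanted))
```

In the one-pot mixture, A can bind D and C can bind B through the shared glue "g". When A, B and C, D are combined in separate stages, those glues are already used inside the A-B and C-D products before the two products are brought together.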