CONTENTS

Gene finding tools
Serial analysis of gene expression
Paralogues and gene displacement

GENE FINDING TOOLS
Gene finding software tools are listed below:
1. Glimmer, a system for finding genes in microbial DNA, especially the genomes of bacteria and archaea.
2. TWAIN, a new syntenic gene finder that employs a Generalized Pair Hidden Markov Model (GPHMM) to predict genes in two closely related eukaryotic genomes simultaneously.
3. GlimmerHMM, a Generalized Hidden Markov Model (GHMM) gene finder that makes use of the techniques implemented previously by GlimmerM: splice site models and Interpolated Markov Models.
4. GeneZilla, a gene finder based on the GHMM framework, similar to Genscan and Genie.
5. GeneSplicer, a fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes.
6. ExAlt, a Phylogenetic Generalized Hidden Markov Model for finding alternatively spliced exons.
7. JIGSAW, a program that predicts gene models using the output from other annotation software; it uses a statistical algorithm to identify patterns of evidence corresponding to gene models.
8. RBSfinder, a Perl script that implements an algorithm to find ribosome binding sites for genes in bacterial and archaeal genomes.

GLIMMER
Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. Glimmer (Gene Locator and Interpolated Markov ModelER) uses interpolated Markov models (IMMs) to identify coding regions and distinguish them from non-coding DNA. The IMM approach uses a combination of Markov models from 1st through 8th order, weighting each model according to its predictive power. Glimmer uses 3-periodic non-homogeneous Markov models in its IMMs. Glimmer is the primary microbial gene finder used at The Institute for Genomic Research (TIGR), where it was first developed, and has been used to annotate the complete genomes of over 80 bacterial species from TIGR and dozens (possibly hundreds) from other labs. Its analyses of some of these genomes are available at the Comprehensive Microbial Resource Site.

TWAIN
TWAIN is a new syntenic gene finder that employs a Generalized Pair Hidden Markov Model (GPHMM) to predict genes in two closely related eukaryotic genomes simultaneously. It uses the MUMmer package to perform approximate alignment before applying a GPHMM based on an enhanced version of the TigrScan gene finder. TWAIN consists of two components: (1) ROSE, the Region Of Synteny Extractor, which identifies contiguous regions likely to contain one or more syntenic genes, and (2) OASIS, a GPHMM for predicting genes in the regions identified by ROSE. The approximate alignments are constructed by the PROmer and NUCmer programs in the MUMmer package, which allows approximate alignment scores to be assessed efficiently.

GLIMMERHMM
GlimmerHMM is a new gene finder based on a Generalized Hidden Markov Model (GHMM). Although the gene finder conforms to the overall mathematical framework of a GHMM, it additionally incorporates splice site models adapted from the GeneSplicer program and a decision tree adapted from GlimmerM. It also uses Interpolated Markov Models for the coding and noncoding models. Currently, GlimmerHMM's GHMM structure includes introns of each phase, intergenic regions, and four types of exons (initial, internal, final, and single).
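The interpolation idea behind the IMMs used by Glimmer and GlimmerHMM can be illustrated with a small sketch. This is not Glimmer's implementation: the class name `SimpleIMM`, the `threshold` parameter, and the rule of trusting a context in proportion to how often it was seen are simplifying assumptions made for illustration, and the 3-periodic structure is ignored.

```python
from collections import defaultdict
import math

ALPHABET = "ACGT"


class SimpleIMM:
    """Toy interpolated Markov model (illustrative only; not Glimmer's weighting scheme)."""

    def __init__(self, max_order=3, threshold=20):
        self.max_order = max_order
        # Hypothetical parameter: number of observations at which a context is fully trusted.
        self.threshold = threshold
        # counts[k][context][base] = times `base` followed the k-long `context`
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            for i, base in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[k][seq[i - k:i]][base] += 1

    def prob(self, context, base):
        """Probability of `base` given the preceding `context`, interpolated over orders."""
        zero = self.counts[0].get("", {})
        total = sum(zero.values())
        # Order-0 estimate with add-one smoothing, so the result is never zero.
        p = (zero.get(base, 0) + 1) / (total + len(ALPHABET))
        for k in range(1, min(len(context), self.max_order) + 1):
            seen = self.counts[k].get(context[-k:])
            if not seen:
                break  # longer contexts were never observed either
            total = sum(seen.values())
            weight = min(1.0, total / self.threshold)
            estimate = (seen.get(base, 0) + 1) / (total + len(ALPHABET))
            # Blend the higher-order estimate with the lower-order result.
            p = weight * estimate + (1 - weight) * p
        return p

    def log_score(self, seq):
        """Log-probability of `seq` under the model."""
        return sum(
            math.log(self.prob(seq[max(0, i - self.max_order):i], seq[i]))
            for i in range(len(seq))
        )
```

In a gene finder, one such model would be trained on known coding sequence and another on non-coding sequence, and a candidate region would be scored by comparing the two log-scores.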

GENEZILLA
GeneZilla is a state-of-the-art program for computational prediction of protein-coding genes in eukaryotic DNA. It is based on the Generalized Hidden Markov Model (GHMM) framework, similar to GENSCAN and GENIE. It is highly reconfigurable and includes software for retraining by the end user. It is written in highly optimized C++ and runs under most UNIX/Linux platforms. The run time and memory requirements are linear in the sequence length and are in general much better than those of competing systems, owing to GeneZilla's novel decoding algorithm. Graph-theoretic representations of the high-scoring open reading frames are provided, allowing exploration of suboptimal gene models. GeneZilla uses Interpolated Markov Models (IMMs) and Maximal Dependence Decomposition (MDD), includes states for signal peptides, branch points, TATA boxes, and CAP sites, and will soon model CpG islands as well.

GeneZilla Architecture
GeneZilla's state-transition diagram is essentially the same as that of GENSCAN. Unlike many GHMM-based gene finders, GeneZilla can model different types of exons (initial, internal, final, and single) using different content sensors.

GENESPLICER
GeneSplicer is a fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. The system has been trained and tested successfully on Plasmodium falciparum (malaria), Arabidopsis thaliana, human, Drosophila, and rice. Training data sets for human and Arabidopsis thaliana are included. GeneSplicer can be run directly through the GeneSplicer Web Interface, or the complete system, including source code, can be downloaded.

ExAlt
ExAlt is a software program designed to predict alternatively spliced overlapping exons in genomic sequence. The program works in several ways depending on the available input. ExAlt can use information about existing gene structure as well as sequence conservation to improve the precision of its predictions. ExAlt can also make predictions when only a single genomic sequence is available. ExAlt has been extensively tested on Drosophila melanogaster, but can be adapted to run on other species.

JIGSAW
JIGSAW is a program designed to use the output from gene finders, splice site prediction programs, and sequence alignments to predict gene models. The program provides an automated way to take advantage of the many successful methods for computational gene prediction and can provide substantial improvements in accuracy over an individual gene prediction program. JIGSAW is available for all species.

Using JIGSAW
JIGSAW is given a training set that consists of example output from an automated gene structure annotation pipeline along with the sequence coordinates of known genes. JIGSAW compares the pipeline's predicted genes to the known genes and records the prediction accuracy of each combination of evidence. A non-linear model is then built to estimate the accuracy of the different combinations of evidence found in new data. For a user-supplied genomic sequence, JIGSAW pieces together the gene structure models most likely to be accurate based on the statistics collected from the training set. The main interface is a simple "evidence list" file, which lists the file name of each prediction program's output, the file format, and the type of evidence. JIGSAW reads several coordinate-based file formats, including GFF.
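The idea of recording the accuracy of each combination of evidence can be sketched as follows. This is a deliberately simplified illustration, not JIGSAW's algorithm: a plain lookup table stands in for the non-linear model, and the function names and evidence source names are hypothetical.

```python
from collections import defaultdict


def train_combination_accuracy(examples):
    """Tally how often each evidence combination agreed with the known genes.

    `examples` is an iterable of (combo, is_correct) pairs, where `combo` is the
    set of evidence sources (hypothetical names) supporting a predicted feature.
    """
    tallies = defaultdict(lambda: [0, 0])  # combo -> [correct, total]
    for combo, is_correct in examples:
        key = frozenset(combo)
        tallies[key][0] += int(is_correct)
        tallies[key][1] += 1
    return {combo: correct / total for combo, (correct, total) in tallies.items()}


def pick_best(candidates, accuracy, default=0.0):
    """Return the candidate feature whose evidence combination was most accurate in training.

    `candidates` is a list of (feature, combo) pairs; unseen combinations score `default`.
    """
    best = max(candidates, key=lambda c: accuracy.get(frozenset(c[1]), default))
    return best[0]
```

For example, if features supported by both a gene finder and an alignment were always correct in training, while features supported by the gene finder alone were correct half the time, `pick_best` prefers the candidate with both kinds of support.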

RBSfinder
RBSfinder is also a program from TIGR. It searches for probable ribosome binding sites in the vicinity of the start of genes. Based on its findings, RBSfinder sometimes proposes a different start coordinate for the ORF. In most cases RBSfinder appears to improve on the results from Glimmer. When RBSfinder proposes a different start, both the start found by Glimmer and the alternative start proposed by RBSfinder are included in the results.
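A minimal sketch of searching upstream of a start codon for a ribosome binding site is shown below. It is not RBSfinder's algorithm: the Shine-Dalgarno-like consensus `AGGAGG`, the 20-base window, simple mismatch counting, and the function names are all assumptions made for illustration, and only the forward strand is handled.

```python
def best_rbs_match(genome, start, motif="AGGAGG", window=20):
    """Find the best ungapped match to `motif` in the `window` bases upstream of `start`.

    `start` is the 0-based index of the first base of the start codon.
    Returns (position, matches); position is None if the window is too short.
    """
    lo = max(0, start - window)
    best = (None, -1)
    for pos in range(lo, start - len(motif) + 1):
        site = genome[pos:pos + len(motif)]
        matches = sum(a == b for a, b in zip(site, motif))
        if matches > best[1]:
            best = (pos, matches)
    return best


def choose_start(genome, candidate_starts, **kwargs):
    """Among candidate start codons of one ORF, prefer the one with the strongest upstream match."""
    return max(candidate_starts, key=lambda s: best_rbs_match(genome, s, **kwargs)[1])
```

Given two in-frame candidate start codons, `choose_start` returns the one preceded by the better motif match, which mirrors how an alternative start coordinate might be proposed.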

SERIAL ANALYSIS OF GENE EXPRESSION
SAGE
Serial analysis of gene expression (SAGE) is a method for comprehensive analysis of gene expression patterns. Three principles underlie the SAGE methodology:
1. A short sequence tag (10-14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript.
2. Sequence tags can be linked together to form long serial molecules that can be cloned and sequenced.
3. Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript.
SAGE is not array-based but instead relies on compiling large cDNA libraries of expressed sequences and obtaining sequence information for short segments, or tags, located at the 3' end of each cDNA. This approach provides qualitative information on the identity of the genes expressed. Moreover, quantitative information can be obtained from SAGE by counting how many times the same sequence appears. However, because SAGE relies on sequences present at the 3' end of genes, the technique cannot discriminate the relative representation of alternatively spliced forms of RNAs that share the same 3' end. Other disadvantages of SAGE include a need for larger amounts of good-quality RNA and lower sensitivity than microarrays. Modifications to the SAGE protocol resolve these problems to a large extent and allow application to primary cell populations available in small numbers (fewer than 5 million cells). Researchers can therefore use SAGE to obtain both a complete picture of gene expression and a quantitative measure. Bert Vogelstein, Ken Kinzler, and their colleagues at Johns Hopkins University developed this technique and licensed it to Genzyme Molecular Oncology.

Process of SAGE
This intricate technique starts with a tissue sample. Investigators extract the sample's mRNA and use it to make cDNA captured on beads.
Restriction enzymes then cut the cDNA and leave a fragment of it attached to the beads. A linker, which contains the recognition sequence for a second restriction enzyme, is added to the exposed ends of the retained cDNA fragments. The second enzyme liberates a short sequence of the original cDNA, which is 14 base pairs in length and is called a SAGE tag. Tags are harvested, polymerized, and sequenced. The sequence of a SAGE tag can uniquely identify a transcript, and counting how often a tag appears gives a measurement of the gene's expression. SAGE experiments proceed as follows:
1. Isolate the mRNA of an input sample (e.g. a tumour).
2. Extract a small chunk of sequence from a defined position of each mRNA molecule.
3. Link these small pieces of sequence together to form a long chain (or concatemer).
4. Clone these chains into a vector that can be taken up by bacteria.
5. Sequence these chains using modern high-throughput DNA sequencers.

Comparison to DNA microarrays
The general goal of the technique is similar to that of the DNA microarray. However, SAGE is a sequence-based sampling technique. Because observations are based on counting sequenced tags rather than on hybridization, the resulting values are digital and more quantitative. In addition, the mRNA sequences do not need to be known a priori, so genes or gene variants that are not yet known can be discovered. Microarray experiments are much cheaper to perform, however, so large-scale studies do not typically use SAGE.

Applications
1. Although SAGE was originally conceived for use in cancer studies, it has been successfully used to describe the transcriptome of other diseases and in a wide variety of organisms.
2. One of the major strengths of SAGE is the electronic nature of the database, which allows different investigators to compare libraries directly in silico. For example, a normal human heart SAGE library is available on the CGAP Web site for gene expression queries, and a normal adult mouse heart SAGE library gene expression profile has recently been reported.
Therefore, if both heart SAGE libraries were available on an internet platform similar to the CGAP Web site, investigators might be able to determine species similarities or differences in heart gene expression profiles. Because there is no such SAGE cardiovascular Web site, in some instances individual authors have made their SAGE tags available for download and analysis.
3. There are a number of areas in cardiovascular biology where the SAGE technique may be useful. These areas include stem cell biology, cardiovascular development, angiogenesis, atherosclerosis, and lipid regulation. Some exploratory SAGE studies have already been reported for human hematopoietic stem cells, hyperlipidemic ApoE3-Leiden mice, and endothelial cells exposed to atherogenic stimuli. In the future, the SAGE technique could assist in finding new targets of important transcription factors such as Nkx2-5 in cardiogenesis, where the number of cells may be limiting. With the burgeoning population of congestive heart failure (CHF) patients, more insight is needed into the pathogenetic mechanisms of CHF. Potentially, SAGE libraries could be made from human endomyocardial biopsy specimens, but tissue heterogeneity may undermine the gene expression signals. It may be more informative to study the temporal changes in gene expression using controlled animal models of CHF, where more tissue material is available for processing. A refined candidate gene list could then be used in the diagnosis and prognosis of larger numbers of patient samples in a microarray format.
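The tag extraction and counting steps described in this section can be sketched in a few lines. The sketch makes assumptions that the text does not state: the anchoring site is taken to be `CATG` (the recognition site of NlaIII, a commonly used anchoring enzyme), the tag is the 10 bases immediately 3' of the last such site, and the cDNA sequences are given in the sense orientation. The function names are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme site (NlaIII); the text names no enzyme
TAG_LENGTH = 10       # assumed tag length, within the 10-14 bp range given in the text


def extract_tag(cdna, anchor=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag next to the 3'-most anchor site, or None if no full tag exists.

    If the 3'-most site lies too close to the end for a full-length tag,
    this sketch returns None rather than falling back to an earlier site.
    """
    site = cdna.rfind(anchor)
    if site == -1:
        return None
    begin = site + len(anchor)
    tag = cdna[begin:begin + tag_length]
    return tag if len(tag) == tag_length else None


def count_tags(cdnas):
    """Count how often each tag occurs; the count estimates the transcript's expression level."""
    tags = (extract_tag(c) for c in cdnas)
    return Counter(t for t in tags if t is not None)
```

Because every tag comes from the same defined position near the 3' end, two splice forms that share a 3' end produce the same tag, which is the limitation noted earlier.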

Figure: Schematic of the SAGE method.

PARALOGUES AND GENE DISPLACEMENT
PARALOGY
Homologous sequences are paralogous if they were separated by a gene duplication event: if a gene in an organism is duplicated to occupy two different positions in the same genome, then the two copies are paralogous. Sequences that are paralogous are called paralogs of each other. Paralogs typically have the same or a similar function, but sometimes do not: because the original selective pressure no longer acts on one copy of the duplicated gene, this copy is free to mutate and acquire new functions. Paralogous sequences provide useful insight into the way genomes evolve. The genes encoding myoglobin and hemoglobin are considered to be ancient paralogs. Similarly, the globin genes that distinguish the different classes of hemoglobin (such as hemoglobin A, hemoglobin A2, and hemoglobin F) are paralogs of each other. While each of these genes serves the same basic function of oxygen transport, they have already diverged slightly in function: fetal hemoglobin (hemoglobin F) has a higher affinity for oxygen than adult hemoglobin. Another example can be found in rodents such as rats and mice. Rodents have a pair of paralogous insulin genes, although it is unclear whether any divergence in function has occurred. Paralogous genes often belong to the same species, but this is not necessary: for example, the hemoglobin gene of humans and the myoglobin gene of chimpanzees are paralogs. Paralogy is a common problem in bioinformatics: when the genomes of different species have been sequenced and homologous genes have been found, one cannot immediately conclude that these genes have the same or a similar function, as they could be paralogs whose functions have diverged.
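The definition above can be made concrete with a gene tree whose internal nodes are labelled by the event that produced them. In the sketch below, two genes are reported as paralogs when their most recent common ancestor node is a duplication and as orthologs when it is a speciation. The tuple representation, the function names, and the simplified globin tree used in the usage note are assumptions for illustration.

```python
def leaves(node):
    """Return the set of gene names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relation(tree, gene_a, gene_b):
    """Classify two distinct genes by the event at their most recent common ancestor.

    An internal node is a tuple (event, left, right), where event is
    'duplication' or 'speciation'.
    """
    if isinstance(tree, str):
        raise ValueError("need two distinct genes from the tree")
    event, left, right = tree
    pair = {gene_a, gene_b}
    for child in (left, right):
        if pair <= leaves(child):
            # Both genes sit in the same subtree, so their common ancestor is deeper.
            return relation(child, gene_a, gene_b)
    if not pair <= leaves(tree):
        raise ValueError("gene not found in tree")
    return "paralogs" if event == "duplication" else "orthologs"
```

With the tree `("duplication", ("speciation", "human_hemoglobin", "chimp_hemoglobin"), ("speciation", "human_myoglobin", "chimp_myoglobin"))`, the pair `human_hemoglobin` and `chimp_myoglobin` is classified as paralogs even though the genes come from different species, matching the example in the text.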

GENE DISPLACEMENT
Comparative genomics has revealed many examples in which the same function is performed by unrelated or distantly related proteins in different cellular lineages. In some cases, this has been explained by the replacement of the original gene by a paralogue or non-homologue, a phenomenon known as non-orthologous gene displacement. Such gene displacement probably occurred early in the history of proteins involved in DNA replication, repair, recombination, and transcription (DNA informational proteins), i.e. just after the divergence of archaea, bacteria, and eukarya from the last universal cellular ancestor (LUCA). This would explain why many DNA informational proteins are not orthologues between the three domains of life. However, in many cases the origin of the displacing genes is obscure, as they do not even have detectable homologues in another domain. It has been suggested that the original cellular DNA informational proteins have often been replaced by proteins of viral or plasmid origin. As viral and plasmid-encoded proteins are usually very divergent from their cellular counterparts, this would explain the puzzling phylogenies and distribution of many DNA informational proteins between the three domains of life.
