
Using gene prediction models in order to improve the quality of multiple sequence alignments of homologous genes

VF Onuchic, AM Durham
Abstract
Gene prediction and the alignment of multiple gene sequences are usually treated as separate problems. However, when aligning multiple homologous genes, information about the intron/exon structure of the genes can improve the quality of the alignment. Conversely, the alignment of these sequences can help determine their structures when these are unknown and are being predicted. In this project we aim to use standard models for alignment (pair hidden Markov models, pairHMMs) and gene prediction (generalized hidden Markov models, GHMMs) to calculate the posterior probabilities of pairs of bases being aligned and of belonging to the same gene structure class. These posterior probabilities are then combined through a series of consistency transformations and used to construct a final multiple alignment that is consistent with the possible gene structures of each of the sequences being aligned.

Methods
Calculating posterior probabilities

The posterior probability of a base in one sequence being aligned to a base in another sequence, or to a gap, under a pairHMM can be computed directly with the forward-backward algorithm (reference). Computing the posterior probability that each base in a sequence belongs to a given gene structure class under a GHMM is not as straightforward. The difficulty comes from the explicit-duration states in this kind of model, which can emit an arbitrary number of bases at once. As a result, we can only calculate the posterior probability of a whole emission of such a state, for example a whole exon (reference). However, if we used only the probabilities of specific exons, introns, or signals, we would have to commit to one particular predicted gene structure. This could cause problems in the alignment when the predictions for some of the genes being aligned are incorrect, or when the predictions for different genes reflect different splice alternatives. We therefore modified the forward-backward algorithm to calculate the probability of every possible structure that can be present in the sequence. This can be done in reasonable time because the number of possible exons, which are the states with an explicit duration distribution, is limited by the start, stop, and splice signals required at their boundaries. By combining these probabilities, we obtain for each base the probability that it belongs to each structure class.

Since we want to use these prediction probabilities to help align the sequences, the next step is to calculate, for each pair of sequences, the probability that a base in one sequence and a base in the other belong to the same structure class. This is done by multiplying the probabilities of the two bases belonging to a given class and summing these products over all classes.
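The same-class calculation described above can be illustrated with a minimal sketch. It assumes the per-base class posteriors have already been obtained from the GHMM; the function name `same_class_probability`, the dictionary representation, and the class labels `"exon"` and `"intron"` are hypothetical choices made for illustration and are not taken from the project's implementation.

```python
from typing import Dict, List

# One dictionary per base: structure class -> posterior probability.
# The class labels used below are illustrative assumptions.
ClassPosterior = Dict[str, float]


def same_class_probability(
    post_x: List[ClassPosterior], post_y: List[ClassPosterior]
) -> List[List[float]]:
    """Return a matrix whose entry [i][j] is the probability that base i
    of sequence x and base j of sequence y belong to the same class.

    For each pair of bases, the per-class probabilities are multiplied
    and the products are summed over all classes.
    """
    return [
        [sum(p * py.get(cls, 0.0) for cls, p in px.items()) for py in post_y]
        for px in post_x
    ]


# Toy input: two bases in sequence x, one base in sequence y.
post_x = [{"exon": 0.9, "intron": 0.1}, {"exon": 0.2, "intron": 0.8}]
post_y = [{"exon": 0.7, "intron": 0.3}]
same = same_class_probability(post_x, post_y)
```

In this toy input the first base of x is probably exonic, as is the only base of y, so `same[0][0]` is larger than `same[1][0]`.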
Consistency transformations

The cost of finding the optimal multiple alignment of n sequences grows exponentially with the number of sequences, so approximations have to be made. The usual approach is to compute the alignment between every pair of sequences and then combine these pairwise alignments into a multiple alignment. However, this treats the pairwise alignments as independent of each other, which can lead to errors that could be avoided if information from the other alignments were used. For this reason, most current multiple alignment techniques apply a consistency transformation, which makes the alignments between pairs of sequences consistent with each other (reference). The workflow shows how this is done with probabilistic models that allow us to calculate the posterior probability of base alignments. In addition to the transformation that makes alignments consistent with each other, we intend in this project to use a similar approach to make the probability of two bases belonging to the same structural class consistent with the probability of those bases being aligned, and to make the alignment of each pair of sequences consistent with the gene predictions for the two sequences. These transformations are also shown in the workflow.

Multiple alignment

With the transformed posterior probabilities of the pairwise alignments in hand, the next step is to build the multiple sequence alignment. The most widely used technique for this is progressive alignment, in which the alignment is built by aligning sequences or alignment profiles in the order specified by a guide tree. In this project we intend to implement this technique as well as two others: sequence annealing (reference) and a greedy algorithm for multiple alignment construction (reference).
Sequence annealing and the greedy algorithm are very similar: both build the multiple alignment by inserting one pair of aligned bases at a time and checking that the alignment remains consistent. The main difference is that in sequence annealing the next pair to be inserted is the one that maximizes a weight function based on the posterior probabilities, and this weight function is recalculated every time a new pair is inserted, whereas in the greedy approach the next pair is always the one with the highest posterior probability of being aligned.

Iterative refinement

Many multiple alignment tools include a final refinement step in which the alignment is repeatedly split into two groups, and the profiles of the two groups are then realigned with each other (reference). We intend to implement two variants of this technique. In the first, the alignment is split into two groups at random (reference). In the second, it is split into the sequences most closely related and those most distantly related to one specific sequence in the alignment (reference). In the first variant, the process is repeated until there is no further improvement or a set number of rounds is reached. In the second, the process stops once every sequence has been used to define the split.
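As an illustration of the alignment-to-alignment consistency transformation discussed under Consistency transformations, the sketch below re-estimates each pairwise posterior matrix by routing it through every other sequence, in the style of the probabilistic consistency transformation popularized by ProbCons. This is an assumption about the general form of such a transformation, not the project's own code: the function names, the dictionary layout keyed by sequence-index pairs, and the toy numbers are all hypothetical. The additional transformations involving gene structure probabilities are not covered here.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Posteriors = Dict[Tuple[int, int], Matrix]  # key (a, b) with a < b


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in cols] for row in a]


def get_posterior(post: Posteriors, a: int, b: int) -> Matrix:
    """Posterior matrix with rows indexed by sequence a, columns by b."""
    return post[(a, b)] if (a, b) in post else transpose(post[(b, a)])


def consistency_transform(post: Posteriors, n_seqs: int) -> Posteriors:
    """One round of the transformation
    P'(x_i ~ y_j) = (1 / n) * sum_z sum_k P(x_i ~ z_k) * P(z_k ~ y_j),
    where the terms z = x and z = y each contribute P(x_i ~ y_j) itself.
    """
    new_post: Posteriors = {}
    for (x, y), pxy in post.items():
        acc = [[2.0 * v for v in row] for row in pxy]  # z = x and z = y
        for z in range(n_seqs):
            if z in (x, y):
                continue
            via_z = matmul(get_posterior(post, x, z), get_posterior(post, z, y))
            for i, row in enumerate(via_z):
                for j, v in enumerate(row):
                    acc[i][j] += v
        new_post[(x, y)] = [[v / n_seqs for v in row] for row in acc]
    return new_post


# Toy input: sequences 0 and 1 have two bases, sequence 2 has one base.
post = {
    (0, 1): [[0.6, 0.1], [0.1, 0.6]],
    (0, 2): [[0.8], [0.1]],
    (1, 2): [[0.9], [0.05]],
}
new_post = consistency_transform(post, n_seqs=3)
```

In the toy input, the first bases of sequences 0 and 1 both align strongly to the single base of sequence 2, so the transformed posterior for aligning them to each other rises slightly above its original value of 0.6.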

Conclusion
This project is still in the implementation phase. The prediction and alignment architectures, the posterior probability calculations, and the consistency transformations are ready. The implementation uses a flexible framework, and we expect to be able to test each component separately. This would allow us to compare different consistency transformations, alignment techniques, and refinement techniques individually, as well as different prediction and alignment model architectures.

References
[1] Burge C (1997) Identification of genes in human genomic DNA. Ph.D. thesis, Stanford University.
[2] Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast Statistical Alignment. PLoS Computational Biology 5:e1000392.
[3] Durbin R et al. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.
[4] Kemena C, Notredame C (2009) Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 25:2455-2465.
[5] Sahraeian SM, Yoon BJ (2010) PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Research 38:4917-4928.
