Bioinformatics is an interdisciplinary research area at the interface between computer science and biological science. Bioinformatics involves the technology that uses computers for storage, retrieval, manipulation, and distribution of information related to biological macromolecules. The ultimate goal of bioinformatics is to better understand a living cell and how it functions at the molecular level.
Bioinformatics is an interdisciplinary research area at the interface between computer science and biological science. A variety of definitions exist in the literature and on the World Wide Web; some are more inclusive than others. Here, we adopt the definition proposed by Luscombe et al., which treats bioinformatics as a union of biology and informatics: bioinformatics involves the technology that uses computers for storage, retrieval, manipulation, and distribution of information related to biological macromolecules such as DNA, RNA, and proteins. The emphasis here is on the use of computers because most of the tasks in genomic data analysis are highly repetitive or mathematically complex. The use of computers is indispensable in mining genomes for information gathering and knowledge building.

Bioinformatics differs from a related field known as computational biology. Bioinformatics is limited to sequence, structural, and functional analysis of genes and genomes and their corresponding products, and is often considered computational molecular biology. Computational biology, by contrast, encompasses all biological areas that involve computation. For example, mathematical modeling of ecosystems, population dynamics, the application of game theory in behavioral studies, and phylogenetic construction using fossil records all employ computational tools, but do not necessarily involve biological macromolecules. Besides this distinction, it is worth noting that there are other views of how the two terms relate. For example, one version defines bioinformatics as the development and application of computational tools in managing all kinds of biological data, whereas computational biology is more confined to the theoretical development of algorithms used for bioinformatics. The present confusion over definitions may partly reflect the nature of this vibrant and quickly evolving new field.

The ultimate goal of bioinformatics is to better understand a living cell and how it functions at the molecular level.
By analyzing raw molecular sequence and structural data, bioinformatics research can generate new insights and provide a global perspective of the cell. The functions of a cell can be better understood by analyzing sequence data because the flow of genetic information is dictated by the central dogma of biology, in which DNA is transcribed to RNA, which is translated to proteins. Cellular functions are mainly performed by proteins, whose capabilities are ultimately determined by their sequences. Therefore, solving functional problems using sequence and sometimes structural approaches has proved to be a fruitful endeavor.

Gene Prediction

The current gene prediction methods can be classified into two major categories: ab initio-based and homology-based approaches. The ab initio-based approach predicts genes based on the given sequence alone. It does so by relying on two major features associated with genes. The first is the existence of gene signals, which include start and stop codons, intron splice signals, transcription factor binding sites, ribosomal binding sites, and polyadenylation (poly-A) sites. In addition, the triplet codon structure limits the coding frame length to multiples of three, which can be used as a condition for gene prediction. The second feature used by ab initio algorithms is gene content, which is a statistical description of coding regions. It has been observed that nucleotide composition and statistical patterns of the coding regions tend to vary significantly from those of the noncoding regions. These unique features can be detected by employing probabilistic models such as Markov models or hidden Markov models to help distinguish coding from noncoding regions.

The homology-based method makes predictions based on significant matches of the query sequence with sequences of known genes. For instance, if a translated DNA sequence is found to be similar to a known protein or protein family from a database search, this can be strong evidence that the region codes for a protein. Alternatively, when possible exons of a genomic DNA region match a sequenced cDNA, this also provides experimental evidence for the existence of a coding region. Some algorithms make use of both gene-finding strategies. There are also a number of programs that combine prediction results from multiple individual programs to derive a consensus prediction. This type of algorithm can therefore be considered consensus based.

Sequence Homology and Sequence Similarity

An important concept in sequence analysis is sequence homology. When two sequences are descended from a common evolutionary origin, they are said to have a homologous relationship or share homology. A related but different term is sequence similarity, which is the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity. It is important to distinguish sequence homology from sequence similarity because the two terms are often confused by researchers who use them interchangeably in the scientific literature. To be clear, sequence homology is an inference or a conclusion about a common ancestral relationship drawn from sequence similarity comparison when the two sequences share a high enough degree of similarity. Similarity, on the other hand, is a direct result of observation from the sequence alignment. Sequence similarity can be quantified using percentages; homology is a qualitative statement. For example, one may say that two sequences share 40% similarity.
It is incorrect to say that the two sequences share 40% homology. They are either homologous or nonhomologous. Generally, if the sequence similarity level is high enough, a common evolutionary relationship can be inferred. In dealing with real research problems, the question of at what similarity level one can infer homologous relationships is not always clear. The answer depends on the type of sequences being examined and on sequence lengths. Nucleotide sequences consist of only four characters, and therefore unrelated sequences have at least a 25% chance of being identical at any aligned position. For protein sequences, there are twenty possible amino acid residues, and so two unrelated sequences can match at 5% of the residues by random chance. If gaps are allowed, the percentage could increase to 10-20%. Sequence length is also a crucial factor. The shorter the sequence, the higher the chance that some alignment is attributable to random chance. The longer the sequence, the less likely it is that matching at the same level of similarity is attributable to random chance. There are two ways to calculate the sequence similarity/identity. One involves the use of the overall sequence lengths of both sequences; the other normalizes by the size of the shorter sequence. The first method uses the following formula:
$$S = \frac{L_s \times 2}{L_a + L_b} \times 100$$
where $S$ is the percentage sequence similarity, $L_s$ is the number of aligned residues with similar characteristics, and $L_a$ and $L_b$ are the total lengths of each individual sequence.

Phylogenetics Basics

Biological sequence analysis is founded on solid evolutionary principles. Similarities and divergence among related biological sequences revealed by sequence alignment often have to be rationalized and visualized in the context of phylogenetic trees. Thus, molecular phylogenetics is a fundamental aspect of bioinformatics. Evolution can be defined in various ways under different contexts. In the biological context, evolution can be defined as the development of a biological form from other preexisting forms, or from its origin to the current existing form, through natural selection and modification. The driving force behind evolution is natural selection, in which unfit forms are eliminated through changes of environmental conditions or sexual selection so that only the fittest are selected. The underlying mechanism of evolution is genetic mutations that occur spontaneously. The mutations on the genetic material provide the biological diversity within a population, and hence the variability in how successfully individuals within the population survive in a given environment. Genetic diversity thus provides the raw material for natural selection to act on.

Phylogenetics is the study of the evolutionary history of living organisms using treelike diagrams to represent pedigrees of these organisms. The tree branching patterns representing the evolutionary divergence are referred to as phylogeny. Phylogenetics can be studied in various ways. It is often studied using fossil records, which contain morphological information about ancestors of current species and the timeline of divergence. However, fossil records have many limitations; they may be available only for certain species. Existing fossil data can be fragmentary, and their collection is often limited by abundance, habitat, geographic range, and other factors.
The descriptions of morphological traits are often ambiguous, which is due to multiple genetic factors. Thus, using fossil records to determine phylogenetic relationships can often be biased. For microorganisms, fossils are essentially nonexistent, which makes it impossible to study phylogeny with this approach.
Fortunately, molecular data in the form of DNA or protein sequences can also provide very useful evolutionary perspectives on existing organisms because, as organisms evolve, the genetic materials accumulate mutations over time, causing phenotypic changes. Because genes are the medium for recording the accumulated mutations, they can serve as molecular fossils. Through comparative analysis of the molecular fossils from a number of related organisms, the evolutionary history of the genes and even the organisms can be revealed.

The advantage of using molecular data is obvious. Molecular data are more numerous than fossil records and easier to obtain. There is no sampling bias involved, which helps to mend the gaps in real fossil records. More clear-cut and robust phylogenetic trees can be constructed with the molecular data. Therefore, they have become the favored, and sometimes the only, information available for researchers to reconstruct evolutionary history. The advent of the genomic era, with tremendous amounts of molecular sequence data, has led to the rapid development of molecular phylogenetics.

The field of molecular phylogenetics can be defined as the study of evolutionary relationships of genes and other biological macromolecules by analyzing mutations at various positions in their sequences and developing hypotheses about the evolutionary relatedness of the biomolecules. Based on the sequence similarity of the molecules, evolutionary relationships between the organisms can often be inferred.

Using molecular data to reconstruct evolutionary history requires making a number of reasonable assumptions. The first is that the molecular sequences used in phylogenetic construction are homologous, meaning that they share a common origin and subsequently diverged through time. Phylogenetic divergence is assumed to be bifurcating, meaning that a parent branch splits into two daughter branches at any given point. Another assumption in phylogenetics is that each position in a sequence evolved independently. It is also assumed that the variability among sequences is sufficiently informative for constructing unambiguous phylogenetic trees.

(Xiong, J. Essential Bioinformatics. Cambridge: Cambridge University Press; 2006. pp. 4, 5, 32, 33, 97, 127-128)
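
As an illustration of the similarity calculation described under Sequence Homology and Sequence Similarity, the following minimal Python sketch computes the percentage similarity of two already-aligned sequences. It is only a sketch under stated assumptions, not a reference implementation: the function names `count_similar` and `percent_similarity` are hypothetical, gaps are assumed to be written as "-", and residues are treated as "similar" only when identical unless the caller supplies its own rule. The "both" method follows the formula given in the text; the "shorter" method is one plausible reading of "normalizes by the size of the shorter sequence", since the text does not spell out that formula.

```python
def count_similar(aligned_a, aligned_b, similar=None):
    """Count aligned, non-gap positions whose residues are judged similar (Ls)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if similar is None:
        # Assumption: with no similarity rule supplied, fall back to identity.
        similar = lambda x, y: x == y
    return sum(
        1
        for x, y in zip(aligned_a, aligned_b)
        if x != "-" and y != "-" and similar(x, y)
    )


def percent_similarity(aligned_a, aligned_b, similar=None, method="both"):
    """Percentage similarity S of two aligned sequences.

    method="both":    S = (Ls * 2) / (La + Lb) * 100   (formula from the text)
    method="shorter": S = Ls / min(La, Lb) * 100       (assumed form)
    """
    ls = count_similar(aligned_a, aligned_b, similar)
    la = len(aligned_a.replace("-", ""))  # ungapped length of sequence a
    lb = len(aligned_b.replace("-", ""))  # ungapped length of sequence b
    if la == 0 or lb == 0:
        raise ValueError("sequences must contain at least one residue")
    if method == "both":
        return ls * 2 / (la + lb) * 100
    if method == "shorter":
        return ls / min(la, lb) * 100
    raise ValueError("method must be 'both' or 'shorter'")


# Toy alignment: 5 matching positions, La = 8, Lb = 6.
a = "ACGTACGT"
b = "ACG--CGA"
print(round(percent_similarity(a, b), 1))                    # 71.4
print(round(percent_similarity(a, b, method="shorter"), 1))  # 83.3
```

The two normalizations give different values for the same alignment whenever the sequences differ in length, which is one reason a reported similarity percentage is best stated together with the method used to compute it.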