Bioinformatics – An Overview

Research Scholar, Dept. of Computer Science, S.V.K.P & Dr. K.S. Raju Arts & Science College, Penugonda-534320, India

ABSTRACT: This paper gives an overview of bioinformatics, covering the major databases available online as well as at major research centers. The major databases, called mother databases, are the nucleic acid databases and the protein sequence databases. Bioinformatics has been visualized as an interface between biological information and the information technology employed for protein sequencing, DNA sequencing, etc. The transcription and translation processes are explained by the central dogma of molecular biology, which states that the base sequence of a strand of DNA corresponds to the amino acid sequence of a protein. Two or more sequences can be compared by alignment methods such as pairwise and multiple alignment. Database search tools such as BLAST and FASTA perform intensive pairwise alignment of a query sequence against all the sequence entries in a database and report the sequences with the best scores. Phylogenetic methods are used to reconstruct the relationships between macromolecular sequences, finding the genetic connections and relationships between species. The paper also explains the application of bioinformatics in various industries, e.g., food, pharmaceutical, agricultural and medical, and the technologies that have enabled the analysis of biological problems in multiple dimensions.

Keywords: Protein, DNA, FASTA, BLAST, Phylogenetic Tree, Orthologous

• Bioinformatics is the application of computational techniques to the management and analysis of biological information.

Bioinformatics involves using computational techniques to access, analyze, and interpret the biological information held in any of the available biological databases.

1. DATABASES:

1.1. Primary Databases

Sequences obtained by various sequencing techniques, such as
• EST: Expressed Sequence Tags
• GSS: Genome Survey Sequences
• STS: Sequence Tagged Sites
• HTG: High Throughput Genomic sequences
have been deposited in different nucleic acid and protein databases, which can be accessed by people all over the world through the World Wide Web. The major databases, called mother databases, are the nucleic acid and protein sequence databases.

1.1.1. Nucleic Acid Databases: The nucleic acid sequence databases hold fully annotated nucleic acid sequences (DNA and RNA), with information such as the source organism and region, the date on which the sequence was determined, etc. The major nucleic acid databases are:
• European Molecular Biology Laboratory (EMBL)
• GenBank (National Center for Biotechnology Information, NCBI)
• DNA Data Bank of Japan (DDBJ)
These three databases work in mutual collaboration and exchange data with one another every day.

1.1.2. Protein Sequence Databases: A protein sequence database holds information on all the proteins that have been translated from RNA sequences and on the proteins sequenced directly by methods like N-terminal sequencing. The major protein sequence databases are:
• Protein Information Resource (PIR)
• Swiss-Prot

1.2. Secondary Databases: Databases derived from the sequence information available in the primary databases are called secondary databases. Fine examples of secondary databases are:
• CUTG: Codon Usage Database of Japan
• COGs: Clusters of Orthologous Groups of proteins, from NCBI
• PROSITE, which describes motifs as regular expressions
• PRINTS, which holds aligned motifs
• BLOCKS, which holds aligned motifs as blocks

1.3. Structure Databases: The major structure databases hold the structural data of proteins or DNA whose structure has been determined by either X-ray crystallography or NMR (Nuclear Magnetic Resonance). The Protein Data Bank gives details such as the coordinates, bond angles and torsion angles of various proteins, and the Nucleic Acid Database gives the same details for DNA and its forms, i.e., A-DNA, B-DNA, etc. The major structure databases are:
• Protein Data Bank (PDB)
• The Nucleic Acid Database (NDB)
• Cambridge Structural Database (CSD)

These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely ‘mirror’ one another.
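As an illustration of how a PROSITE-style pattern relates to an ordinary regular expression, the following is a minimal sketch, not PROSITE's own software. The function name prosite_to_regex is hypothetical, and only the basic pattern syntax is handled: "-" between positions, "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden residues, "(n)" for repetition, and "<" / ">" for sequence ends. The example pattern N-{P}-[ST]-{P} is the N-glycosylation site motif.

```python
import re

def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "any residue except P"  ->  [^P]
        element = element.replace("{", "[^").replace("}", "]")
        # x(2) means "repeat twice"        ->  .{2}
        element = element.replace("(", "{").replace(")", "}")
        # x means "any residue"            ->  .
        element = element.replace("x", ".")
        # < and > anchor the pattern to the sequence ends
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}")
match = re.search(regex, "AANGSAA")
print(regex, match.start() if match else None)
```

Searching a protein sequence with the converted expression then reports where, if anywhere, the motif occurs.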

2. The Central Dogma of Biology:

Central Dogma: Flow of Information

The flow of information from DNA to protein is explained by the central dogma of molecular biology, which states that the base sequence of a strand of DNA corresponds to the amino acid sequence of a protein.

2.1. Transcription

Transcription is the process in which messenger RNA (mRNA) molecules are synthesized from DNA molecules. Transcription takes place in the nucleus. During transcription, only one of the strands of DNA corresponding to a gene (the template strand) is copied into mRNA. This mRNA molecule will be complementary to the bases that compose the template strand. The mRNA molecules have short lives. They travel out to the cytoplasm, where they direct the synthesis of a protein, and then they are destroyed.
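The copying of a template strand into a complementary mRNA can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name transcribe and the PAIRING table are hypothetical, and the template strand is assumed to be written 3' to 5' so that the mRNA comes out 5' to 3'.

```python
# DNA template base -> complementary RNA base
PAIRING = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template_strand):
    """Return the mRNA complementary to a DNA template strand.

    Assumes the template is written 3'->5', so the result reads 5'->3'.
    Raises KeyError for characters other than A, T, C, G.
    """
    return "".join(PAIRING[base] for base in template_strand.upper())

print(transcribe("TACCGGATT"))
```

For the template TACCGGATT, each base is replaced by its partner, which yields an mRNA that begins with the start codon AUG.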

Transcription depends on complementary base pairing between the DNA template and the new RNA: A pairs with U, T with A, C with G and G with C. Only one of the DNA strands is transcribed, and therefore the resulting mRNA molecule is single stranded. The amount of transcription of any given gene can be directly controlled by the cell. Once the mRNA molecules leave the nucleus and enter the cytoplasm, they are loaded onto the ribosomes. It is at the ribosomes that protein synthesis occurs, by a process called translation. The ribosomes are composed of ribosomal RNA (rRNA) and ribosomal proteins.

2.2. Translation

Translation is the process in which mRNA molecules are translated into proteins at the ribosome. The nucleotides of the mRNA molecule are read by the ribosome so that each set of three nucleotides, called a codon, specifies a single amino acid. Therefore, the first three nucleotides of the coding sequence encode the first amino acid, the next three bases the second amino acid, and so on. The rules by which the base sequence of the mRNA molecule is translated into the primary amino acid sequence of a protein are called the genetic code. There are 64 different possible codons (there are 4 bases, A, U, C and G, and each codon has 3 bases, so 4^3 = 64) and 20 amino acids. Most amino acids are coded for by more than one codon, and therefore the genetic code is said to be degenerate. No codon codes for more than one amino acid. Three of the codons do not specify the incorporation of any amino acid. These are known as the stop codons: UAA, UAG and UGA. They are found at the end of the mRNA coding sequence, and they tell the ribosome to stop translating the message and release the protein. The mRNA is translated from the 5' end and read one codon at a time towards the 3' end. Translation usually starts at a start codon (AUG), which codes for methionine. Each successive codon is read and the corresponding amino acid is incorporated into the protein chain until a stop codon is encountered.
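The reading of codons described above can be sketched as follows. This is a minimal illustration, not a production translator: the names CODON_TABLE and translate are hypothetical, the table is the standard genetic code written with one-letter amino acid codes ("*" marks a stop codon), and the sketch simply starts at the first AUG it finds.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the standard genetic code, in the order
# obtained by varying the first, then second, then third base over "UCAG".
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# 4 bases in 3 positions give 4**3 = 64 codons.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate an mRNA (5'->3') from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG or UGA: stop translating
            break
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE), translate("AUGGCCUAA"))
```

The table makes the degeneracy visible: 61 codons are shared among 20 amino acids, while each codon maps to exactly one entry.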
The codons in an mRNA molecule do not directly recognize the amino acids that must be incorporated. Instead, this process is directed by a group of adapter molecules called transfer RNAs (tRNAs). Every codon, except the stop codons, is recognized by a tRNA molecule. A tRNA molecule has an anti-codon end, which is made of a set of three bases. These bases can pair with the complementary codon in the mRNA. The 3' end of a tRNA molecule is attached to an amino acid. In the translation process, a ribosome reads an mRNA molecule codon by codon. At each codon, a tRNA molecule with an anti-codon complementary to that codon attaches to the mRNA. It brings with it the appropriate amino acid, which is then incorporated into the growing polypeptide chain. Once the amino acid has been added, the tRNA molecule is released and the ribosome moves on to reading the next codon in the mRNA chain. This process continues until the ribosome reads a stop codon. At this point the ribosome releases the mRNA molecule and the completed protein. The tRNA molecule functions as an interpreter, reading codons in the mRNA molecule and translating them into amino acids. In this way, the sequence of base pairs in a given gene determines the amino acid sequence of the protein.

3. Alignment: An alignment is a representation of two or more protein or nucleotide sequences in which homologous amino acids or nucleotides are placed in the same columns, while missing amino acids or nucleotides are replaced with gaps.

3.1. Pairwise Alignment: In pairwise alignment, only two sequences are compared. Two sequences can be compared either by global alignment or by local alignment. In global alignment, the sequences are aligned over their entire length to get the maximum number of matches and the minimum number of gaps. In local alignment, the alignment is restricted to the regions with the highest similarity. Local alignment uses the Smith-Waterman algorithm, and global alignment uses the Needleman-Wunsch algorithm. The best alignment is the one with the maximum score, where matches receive positive scores and gaps and mismatches receive negative scores. Pairwise alignment is used to infer the function of unknown genes or proteins by finding similar sequences of known function. This is done by comparing the unknown sequence with the whole of the nucleic acid or protein databases.
Database search tools such as BLAST and FASTA perform intensive pairwise alignment of a query sequence against all the sequence entries in a database and report the sequences with the best scores.
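The difference between global and local alignment scoring can be sketched with two small dynamic-programming functions. This is a minimal illustration of the Needleman-Wunsch and Smith-Waterman recurrences that returns only the best score, not the alignment itself. The function names and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example; real tools use substitution matrices and more elaborate gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best Needleman-Wunsch (global) alignment score of sequences a and b."""
    # Row 0: aligning a prefix of b against nothing costs one gap per base.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best Smith-Waterman (local) alignment score of sequences a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below 0, so an alignment can start anywhere.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

print(global_score("ACGT", "AGT"), local_score("TTTACGTTT", "GGACGGG"))
```

The global score penalizes every unmatched position across the full length, whereas the local score reports only the best-matching region (here the shared stretch ACG).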

3.2. Multiple Alignment: Multiple alignment, in which more than two sequences are compared, is used for finding conserved regions among gene sequences and protein sequences and for studying the phylogenetic relationships of macromolecular sequences, i.e., for finding evolutionarily related organisms. The major multiple alignment programs are ClustalW, ClustalX and T-Coffee.

ClustalW: ClustalW is a general-purpose multiple sequence alignment program for DNA or protein sequences. It gives biologically meaningful multiple sequence alignments of divergent sequences, calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. The cladograms or phylograms obtained are used to see the evolutionary relationships between species. ClustalW can be either downloaded or used online.

ClustalX: ClustalX is the X-window-based, user-friendly version of ClustalW, which can be downloaded and used locally on our machine.

T-Coffee: T-Coffee is more accurate than ClustalW for sequences with less than 30% identity, but it is slower.

Basic Local Alignment Search Tool (BLAST): BLAST is a heuristic search algorithm for sequence similarity searching, for example to identify homologs of a query sequence. When a sequence is submitted to the BLAST program, it is searched against all the sequences of the database of the user's choice, and the result lists those sequences whose percent identity exceeds a particular threshold value. The threshold value is set by the user.

BLASTing Protein Sequences: BLASTing protein sequences is what we want to do if we already have a protein sequence and want to find other similar protein sequences in a sequence database. Two flavors of BLAST that deal with proteins are:
• blastp: compares a protein sequence with a protein database.
• tblastn: compares a protein sequence with a nucleotide database.

FASTA: FASTA was the first widely used program for database similarity searching.
For nucleotide searches, FASTA may be more sensitive than BLAST. FASTA can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. The FASTA submission form is available online.
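The idea of reporting only hits whose percent identity exceeds a user-chosen threshold can be sketched as below. This is a simplified illustration, not how BLAST or FASTA compute their reports. The function names, the hits dictionary and its sequence names are hypothetical, and percent identity is assumed here to mean identical columns divided by the total alignment length, which is only one of several definitions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue ('-' is a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)

def filter_hits(hits, threshold):
    """Keep the names of hits whose percent identity exceeds the threshold.

    hits maps a hit name to a pair (aligned query, aligned database sequence).
    """
    return [
        name
        for name, (query_aln, hit_aln) in hits.items()
        if percent_identity(query_aln, hit_aln) > threshold
    ]

# Hypothetical, hand-made alignments used only for demonstration.
hits = {"seq1": ("ACGT", "ACGT"), "seq2": ("ACGT", "A--T")}
print(filter_hits(hits, 70))
```

Raising or lowering the threshold trades the number of reported sequences against how closely they resemble the query.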

4. Phylogenetic Analysis: Phylogenetic methods are used to reconstruct the relationships between macromolecular sequences, finding the genetic connections and relationships between species. The results of phylogenetic analysis may be depicted as a hierarchical branching diagram, a ‘cladogram’ or ‘phylogenetic tree’. Programs for phylogenetic analysis, such as PHYLIP, can be downloaded free of cost and used locally, or they can be used online. TreeView and PhyloDraw are the major user-friendly programs for showing the hierarchical clustering in different formats, for publishing and for easy analysis. Besides PHYLIP, other popular programs for phylogenetic analysis include PAUP, MEGA, TreeconW and Winboot.

5. Applications of Bioinformatics

5.1. Food Industry: Functional genomics is playing a major role in the food biotechnology industry. The complete genome sequence information available in different databases can be used for finding metabolic pathways and various digestive enzymes, improving cell factories, and developing novel preservation methods. Information about the various microbes that assist in food digestion, like E. coli, also plays a vital role in the major achievements of the food industry using bioinformatics.

5.2. Agriculture: Crops are improved by producing plants that carry genes for resistance to pathogens like fungi and bacteria. Homology searches, the finding of conserved motifs, and molecular modeling are useful in identifying disease resistance genes. Pesticides and insecticides that can efficiently kill pathogens and pests are designed by molecular modeling.

5.3. Pharmaceutical Industry and Medical Science: Bioinformatics, computational biology and cheminformatics are playing a key role in the pharmaceutical industry in identifying new drug targets from genomic data at a much faster rate. Disease-causing genes are identified using the tools of genomics and proteomics.
Drug lead identification and drug optimization have become easier using the tools of genomics and proteomics. Beyond drugs, the pharmaceutical industry is using sequence information in the production of vaccines and therapeutic proteins. In the process of designing a new drug, bioinformatics tools have been of great help in identifying the target disease and interesting lead compounds and, through docking studies, in finding the effective interaction between the drug and its target. Pharmacoinformatics is the area of medical informatics concerned with modeling and simulating the behavior of drugs, and with controlling that behavior through individualized dosage regimens for each patient to achieve explicitly chosen therapeutic goals. The credibility of serum concentration data is a major factor in such modeling. Medical informatics is a scientific discipline concerned with the systematic processing of data, information and knowledge in medicine and health care. Computerization of the patient record is expected to resolve long-standing problems with the current paper-based system.

6. Bioinformatics in India

In India, various research and development units, centers and sub-centers, and pharmaceutical companies are doing research on different aspects of bioinformatics, such as proteomics, genomics, the development of sequence analysis tools, molecular modeling and drug design. The Department of Biotechnology (DBT), New Delhi, has emphasized starting bioinformatics centers with the help of BTISnet (Biotechnology Information System) for the proper application of bioinformatics in various sectors of science and technology, for the benefit of researchers. DBT has sponsored various Bioinformatics Distributed Information Centers (DICs) and Distributed Information Sub-Centers (Sub-DICs) all over India. The list of the DICs and the Sub-DICs can be seen on the following websites.
