
BIOINFORMATICS

Bioinformatics is an interdisciplinary field mainly involving molecular
biology and genetics, computer science, mathematics, and statistics.
Data intensive, large-scale biological problems are addressed from a
computational point of view. Bioinformatics is the application of tools of
computation and analysis to the capture and interpretation of biological
data. Bioinformatics is essential for management of data in modern
biology and medicine. Analysis of genome sequence data, particularly
the data produced by the Human Genome Project, is one of the main
achievements of bioinformatics to date. Prospects in the field of
bioinformatics include its future contribution to functional understanding
of the human genome, leading to enhanced discovery of drug targets
and individualised therapy.
WHY IS BIOINFORMATICS NECESSARY?
● The need for bioinformatics has arisen from the recent explosion
of publicly available genomic information, such as results from the
Human Genome Project.
● It gives a better understanding of gene analysis, taxonomy and
evolution.
● It supports efficient rational drug design and reduces the time
taken to develop drugs manually.
● Bioinformatics allows existing datasets to be reused and
amplified.

AIMS OF BIOINFORMATICS
● Development of a database containing all biological information.
● Development of better tools for data designing, annotation and
mining.
● Design and development of drugs by using simulation software.
● Design and development of software tools for protein structure
prediction, function annotation and docking analysis.
● Creation and development of software to improve tools for
analysing sequences for their function and similarity with other
sequences.
WHAT IS A SEQUENCE?
A biological sequence is a single, continuous molecule of nucleic acid or
protein. It can be thought of as a multiple inheritance class hierarchy.
One hierarchy is that of the underlying molecule type: DNA, RNA, or
protein.

What is a DNA sequence?

•DNA sequence: sequential arrangement of deoxyribonucleotides
(A, T, G, C) in a DNA molecule
>NG_007114.1:4986-6416 Homo sapiens insulin (INS), RefSeqGene on chromosome 11
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGG
TGGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGC
TCGTGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGC
CCTGCCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGC
TGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGG
TGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAG
AGGACCTGCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGGCCGCCCCCAGCCACCCCCTGCTCCTGGC
GCTCCCACCCAGCATGGGCAGAAGGGGGCAGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTT
TTTTAAAAAGAAGTTCTCTTGGTCACGTCCTAAAAGTGACCAGCTCCCTGTGGCCCAGTCAGAATCTCAG
CCTGAGGACGGTGTTGGCTTCGGCAGCCCCGAGATACATCAGAGGGTGGGCACGCTCCTCCCTCCACTC
GCCCCTCAAACAAATGCCCCGCAGCCCATTTCTCCACCCTCATTTGATGACCGCAGATTCAAGTGTTTTG
TTAAGTAAAGTCCTGGGTGACCTGGGGTCACAGGGTGCCCCACGCTGCCTGCCTCTGGGCGAACACCCC
ATCACGCCCGGAGGAGGGCGTGGCTGCCTGCCTGAGTGGGCCAGACCCCTGTCGCCAGGCCTCACGGC
AGCTCCATAGTCAGGAGATGGGGAAGATGCTGGGGACAGGCCCTGGGGAGAAGTACTGGGATCACCTGT
TCAGGCTCCCACTGTGACGCTGCCCCGGGGCGGGGGAAGGAGGTGGGACATGTGGGCGTTGGGGCCTG
TAGGTCCACACCCAGTGTGGGTGACCCTCCCTCTAACCTGGGTCCAGCCCGGCTGGAGATGGGTGGGAG
TGCGACCTAGGGCTGGCGGGCAGGCGGGCACTGTGTCTCCCTGACTGTGTCCTCCTGTGTCCCTCTGCC
TCGCCGCTGTTCCGGAACCTGCTCTGCGCGGCACGTCCTGGCAGTGGGGCAGGTGGAGCTGGGCGGGG
GCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAAC
AATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCA
GCCCCACACCCGCCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC

What is an RNA sequence?

•RNA sequence: sequential arrangement of ribonucleotides (A, U, G, C)
in an RNA molecule
>NM_000207.3 Homo sapiens insulin (INS), transcript variant 1, mRNA
AGCCCUCCAGGACAGGCUGCAUCAGAAGAGGCCAUCAAGCAGAUCACUGUCCUUCUGCCA
UGGCCCUGUGGAUGCGCCUCCUGCCCCUGCUGGCGCUGCUGGCCCUCUGGGGACCUGAC
CCAGCCGCAGCCUUUGUGAACCAACACCUGUGCGGCUCACACCUGGUGGAAGCUCUCUAC
CUAGUGUGCGGGGAACGAGGCUUCUUCUACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CUGCAGGUGGGGCAGGUGGAGCUGGGCGGGGGCCCUGGUGCAGGCAGCCUGCAGCCCU
UGGCCCUGGAGGGGUCCCUGCAGAAGCGUGGCAGUGGAACAAUGCUGUACCAGCAUCUG
CUCCCUCUACCAGCUGGAGAACUACUGCAACUAGACGCAGCCCGCAGGCAGCCCCACACC
CGCCGCCUCCUGCACCGAGAGAGAUGGAAUAAAGCCCUUGAACCAGC

What is a protein sequence?

•Protein sequence: sequential arrangement of amino acids in a
protein
>NP_000198.1 insulin preproprotein [Homo sapiens]
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG
GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

HISTORICAL TIMELINE OF SEQUENCING

WHY IS SEQUENCE IMPORTANT?


•Sequence Information - identify changes in genes, associations with
diseases and phenotypes, and identify potential drug targets.
•Diagnosis of emerging viral infections, molecular epidemiology of viral
pathogens, and drug-resistance testing.
•Sequencing genes from patients or foetuses to determine whether there
is a risk of genetic disease.
•DNA sequencing may be used along with other DNA profiling methods
for forensic identification and paternity testing.

SEQUENCING A GENOME
Most genomes are enormous (e.g. about 3 x 10^9 base pairs in the case
of humans). Current sequencing technology, on the other hand, only
allows biologists to determine ~10^3 base pairs at a time.
This leads to some very interesting problems in bioinformatics.

Genomes can also be determined using a technique known as shotgun
sequencing. Computer scientists have played an important role in
developing algorithms for assembling such data.
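The assembly problem can be illustrated with a toy example. The sketch below is a minimal greedy overlap merger, not a production assembler: it assumes error-free reads from one strand, and the function names `overlap` and `greedy_assemble` and the `min_len` parameter are illustrative choices rather than anything defined in these notes. The three reads are taken from the first 20 bases of the insulin gene record shown earlier.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads covering a 20-base stretch
reads = ["AGCCCTCCAG", "TCCAGGACAG", "GACAGGCTGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

Real assemblers must also cope with sequencing errors, repeats and reads from both strands, which is where most of the algorithmic difficulty lies.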
NCBI GenBank
The National Center for Biotechnology Information (NCBI) develops and
maintains molecular and bibliographic databases as a part of the
National Library of Medicine (NLM). NCBI does not generate its own
data; it receives data submissions from researchers.
GenBank is a comprehensive public database of nucleotide and protein
sequences with supporting bibliographic and biological annotation, built
and distributed by the National Center for Biotechnology Information
(NCBI), a division of the National Library of Medicine (NLM), located on
the campus of the US National Institutes of Health (NIH) in Bethesda,
MD.
BIOLOGICAL DATABASES
Why do we need databases?
I. Genomic research makes it possible to look at biological phenomena
on a scale not previously possible: all genes in a genome, all transcripts
in a cell, all metabolic processes in a tissue.
II. One feature that all of these approaches share is the production of
massive quantities of data.
III. GenBank, for example, now accommodates >10^10 nucleotides of
nucleic acid sequence data and continues to more than double in size
every year.
IV. New technologies for assaying gene expression patterns, protein
structure, protein-protein interactions, etc., will provide even more data.
V. How to handle these data, make sense of them, and render them
accessible to biologists working on a wide variety of problems is the
challenge facing bioinformatics—an emerging field that seeks to
integrate computer science with applications derived from molecular
biology.

What is a database?
• A collection of related data elements
• tables
• columns (fields)
• rows (records)
• Records retrieved using a query language
• Database technology is well established
For example, a protein database would have protein entries as records
and protein properties as fields (e.g., name of protein, length, amino-acid
sequence).
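The table, field and record vocabulary above can be made concrete with a small relational example. This is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database; the table name `protein` and its column names are assumptions chosen for illustration and do not reflect how any real sequence database is organised. The single record uses the insulin preproprotein entry shown earlier.

```python
import sqlite3

# In-memory database: nothing is written to disk
conn = sqlite3.connect(":memory:")

# One table; each column is a field, each row is a record
conn.execute(
    "CREATE TABLE protein ("
    "accession TEXT PRIMARY KEY, name TEXT, length INTEGER, sequence TEXT)"
)

insulin = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
    "GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)
conn.execute(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    ("NP_000198.1", "insulin preproprotein", len(insulin), insulin),
)

# Records are retrieved with a query language (SQL here)
rows = conn.execute(
    "SELECT accession, name, length FROM protein WHERE length > ?", (100,)
).fetchall()
print(rows)
```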

The ‘perfect’ database


• Comprehensive, but easy to search.
• Annotated, but not “too annotated”.
• A simple, easy to understand structure.
• Cross-referenced.
• Minimum redundancy.
• Easy retrieval of data

The purposes of these biological databases are:

1. Make information available globally
2. Systematic results from experiments and analysis
3. Non-redundancy and redundancy deduction
4. Accuracy
5. Reference to literature
6. Bioinformatics issues
7. Database design and implementations
8. Consistency
9. Cross-references
10. Tools for analysing, querying and visualisation
11. Data mining
BROAD CLASSIFICATION OF BIOLOGICAL DATABASE

Biological databases are classified as


1. A primary database holds experimental results.
2. A secondary database holds the results of analysis of the primary
databases.
3. A composite database combines several primary database sources,
which removes the need to search multiple resources.
Sequence databases at NCBI
• Primary
  • GenBank: NCBI’s primary sequence database
  • Trace Archive: reads from capillary sequencers
  • Sequence Read Archive: next generation data
• Derivative
  • GenPept (GenBank translations)
  • Outside Protein (UniProt/Swiss-Prot, PDB)
  • NCBI Reference Sequences (RefSeq)

● GenBank - primary sequence database


• GenBank is the NIH genetic sequence database of all publicly
available DNA and derived protein sequences, with annotations
describing the biological information these records contain.
• Nucleotide-only sequence database
• Archival in nature
  • Historical
  • Reflective of submitter point of view (subjective)
  • Redundant
• Data
  • Direct submissions (traditional records)
  • Batch submissions
  • FTP accounts (genome data)

● Nucleotide - primary sequence database


• Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database

Derivative database
• Expressed Sequences
• dbSNP
• Structure
• Gene

Format
• ASN.1
• Flat Files
  • DNA
  • Protein
• FASTA
  • DNA
  • Protein
FASTA format
In bioinformatics and biochemistry, the FASTA format is a text-based
format for representing either nucleotide sequences or amino acid
(protein) sequences, in which nucleotides or amino acids are
represented using single-letter codes. The format also allows for
sequence names and comments to precede the sequences. The format
originates from the FASTA software package, but has now become a
near universal standard in the field of bioinformatics.

>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
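Because FASTA is a simple line-based format, it is easy to read programmatically. The following is a minimal sketch of a parser, assuming well-formed input in which every record starts with a `>` header line; the function name `parse_fasta` is an illustrative choice, and established libraries provide more robust readers.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 demo nucleotide record
ACGT
TTGA
>seq2 demo protein record
MKV
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```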

Abstract Syntax Notation (ASN.1)


ASN.1, or Abstract Syntax Notation One, is an International Organization
for Standardization (ISO) data representation format used to achieve
interoperability between platforms. NCBI uses ASN.1 for the storage and
retrieval of data such as nucleotide and protein sequences, structures,
genomes, PubMed records, and more. It permits computers and
software systems of all types to reliably exchange both the data
structure and content.
GenBank Flat File (GBFF)
SEQUENCES TO PUBLIC DATABASES
• Sequences are no longer published in journals
• Electronic format is the most useful
• Allows validation testing of data
• Best way to move science forward
• Sequences sent to DDBJ/EMBL/GenBank are exchanged daily
• Best way to exchange new data and updates
Sequence alignment
Alignment: The process of lining up two or more sequences to achieve
maximum levels of identity (and conservation, in the case of amino acid
sequences) for the purpose of assessing the degree of similarity and the
possibility of homology.

Match : align the same residue in both sequences
Mismatch (similar) : align two different but similar residues
Mismatch (dissimilar) : align two different, dissimilar residues
Gap : place a gap (nothing) opposite a residue in one of the sequences
•sequences may have diverged from a common ancestor through
various types of mutations:
– substitutions (ACGA -> AGGA)
– insertions (ACGA -> ACCGGAGA)
– deletions (ACGGAGA -> AGA)
•the latter two will result in gaps in alignments

Percentage identity (PI) = (number of matches / total length of the alignment) x 100 = (14 / 25) x 100 = 56%

Percentage similarity (PS) = (number of matches, counting similar residues / total length of the alignment) x 100 = (15 / 25) x 100 = 60%
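The percentage identity calculation can be sketched in a few lines. This example assumes the two sequences are already aligned, have equal length, and use `-` for gaps; the function name `percent_identity` is illustrative. Percentage similarity would need an extra table saying which residue pairs count as similar, which is not given here.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as a match only if both residues are equal and not gaps
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# 4 matching columns out of 5 (one gap column)
print(percent_identity("AC-TA", "ACGTA"))
```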

DIFFERENT TYPES OF SEQUENCE ALIGNMENT


Based on the number of sequences
•Pairwise sequence alignment
•Multiple sequence alignment
Based on the length and coverage
•global alignment: find best match of both sequences in their entirety
•local alignment: find best subsequence match
•semi-global alignment: find best match without penalising gaps on
the ends of the alignment

● What do we want to align?


•some possible alignments for ELV and VIS
simple approach: compute & score all possible alignments and
find the alignment having best score

● How do we “score” an alignment?


SCORING FUNCTION (LINEAR)
Match : align the same residue in both sequences
Mismatch : align two different residues
Gap : place a gap (nothing) opposite a residue in one of the sequences
Match score = 2 (reward by giving a positive value)
Mismatch score = -1 (penalty by giving a negative score)
Gap penalty = -2 (penalty by giving a negative score)
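As a small illustration, the linear scoring function can be applied column by column to an existing alignment. The sketch below uses the values given above (match 2, mismatch -1, gap -2) and assumes `-` marks a gap; the function name `score_alignment` is an illustrative choice.

```python
MATCH, MISMATCH, GAP = 2, -1, -2


def score_alignment(aligned_a, aligned_b):
    """Sum the linear score over every column of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += GAP        # a residue is aligned against nothing
        elif x == y:
            total += MATCH      # identical residues
        else:
            total += MISMATCH   # two different residues
    return total


# Two of the possible alignments of ELV and VIS
print(score_alignment("ELV", "VIS"))      # three mismatches
print(score_alignment("ELV--", "--VIS"))  # one match, four gaps
```

Under this scheme the ungapped alignment scores -3 and the gapped one -6, so the ungapped alignment is preferred even though it contains no matches.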

● How Do We Find the Best Alignment fast?


• simple approach: compute & score all possible alignments

possible global alignments for 2 sequences of length n


•Dynamic programming
•Instead of evaluating every possible alignment, create a table
of partial scores by breaking the alignment problem into
subproblems.
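The table of partial scores can be sketched as follows. This is a minimal version of the standard global alignment recurrence (commonly attributed to Needleman and Wunsch) using the linear scores from these notes; it returns only the best score and does no traceback, and the function name `global_alignment_score` is an illustrative choice.

```python
def global_alignment_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best global alignment score of s and t by dynamic programming."""
    rows, cols = len(s) + 1, len(t) + 1
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # s[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # t[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align s[i-1] with t[j-1]
                score[i - 1][j] + gap,       # gap in t
                score[i][j - 1] + gap,       # gap in s
            )
    return score[-1][-1]


print(global_alignment_score("ELV", "VIS"))
print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(s) + 1) x (len(t) + 1) cells and each is filled in constant time, so the work grows with the product of the sequence lengths rather than with the number of possible alignments.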

APPLICATIONS OF SEQUENCE ALIGNMENT


•Predicting and determining function of newly discovered genes or
proteins
•Identification of functional patterns or domains

Applications of sequence alignment: evolutionary relationship and
homology
•Determination of evolutionary relationships among genes, proteins and
entire species.
•Two genes, proteins or organisms are evolutionarily related, or
homologous, when they share ancestry, i.e. evolved from a common
ancestral gene, protein or organism.
•Homology among DNA, RNA or proteins is inferred from nucleotide or
amino acid sequence similarity.

Inferring homology from sequence identity


•Homology is inferred from sequence identity
•Protein sequences with >= 30% identity are considered to be homologues
•Protein sequences with ≈ 25% identity are considered to be in the twilight zone
•Protein sequences with < 25% identity cannot be inferred to be homologues from identity alone

DATABASE HOMOLOGY SEARCHING


• Use algorithms to increase efficiency and provide a mathematical basis
for searching which can be translated into statistical significance
• Assumes that sequence, structure and function are interrelated.
• BLAST (Basic Local Alignment Search Tool) and FASTA (Fast-All)
• These are heuristic methods approximating the Smith-Waterman algorithm

BASIC LOCAL ALIGNMENT SEARCH TOOL


• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance

The BLAST Algorithm


• The BLAST programs (Basic Local Alignment Search Tools) are a set of
sequence comparison algorithms introduced in 1990 that are used to
search sequence databases for optimal local alignments to a query.
• Scoring of matches done using scoring matrices
• Sequences are split into words (default n=3)
• Speed, computational efficiency
• BLAST algorithm extends the initial “seed” hit into an HSP
• HSP = high scoring segment pair = Local optimal alignment
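The word-splitting and seeding steps can be sketched as follows. This is a deliberately simplified illustration, not BLAST itself: it finds only exact word matches, whereas BLAST for proteins also accepts similar words scored with a scoring matrix, and it stops before the extension step that produces an HSP. The function names `words` and `seed_hits` are illustrative.

```python
def words(seq, k=3):
    """All overlapping words of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def seed_hits(query, subject, k=3):
    """Positions (query_pos, subject_pos) where a query word occurs exactly."""
    # Index every word of the subject by its start position
    index = {}
    for pos, word in enumerate(words(subject, k)):
        index.setdefault(word, []).append(pos)
    # Look up each query word in the index
    return [
        (qpos, spos)
        for qpos, word in enumerate(words(query, k))
        for spos in index.get(word, [])
    ]


print(words("MALWM"))
print(seed_hits("ALWMR", "MALWMRLL"))
```

Hits that fall on the same diagonal (constant difference between subject and query position) are the natural candidates for extension into a longer local alignment.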

BLAST compares sequences


• BLAST takes a query sequence
• Compares it with millions of sequences in the sequence databases
• By constructing local alignments
• Lists those that appear to be similar to the query sequence
• The “hit list”
• Tells you why it thinks they are homologs
• BLAST makes suggestions
• YOU make the conclusions

What BLAST tells you


• BLAST reports surprising alignments
- Different than chance
• Conclusions
- Surprising similarities imply evolutionary homology

Program   Description

blastp    Compares an amino acid query sequence against a protein
          sequence database.

blastn    Compares a nucleotide query sequence against a nucleotide
          sequence database.

blastx    Compares a nucleotide query sequence translated in all
          reading frames against a protein sequence database. You
          could use this option to find potential translation
          products of an unknown nucleotide sequence.

tblastn   Compares a protein query sequence against a nucleotide
          sequence database dynamically translated in all reading
          frames.

tblastx   Compares the six-frame translations of a nucleotide query
          sequence against the six-frame translations of a
          nucleotide sequence database.
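The six-frame translation used by the translated searches can be illustrated with a short sketch. It assumes the standard genetic code and an input containing only A, C, G and T; codons containing any other letter are written as `X`, and stop codons as `*`. The names `CODON_TABLE`, `reverse_complement`, `translate` and `six_frames` are illustrative and are not part of any BLAST program.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


# Start of the insulin coding sequence shown earlier
for frame in six_frames("ATGGCCCTGTGGATG"):
    print(frame)
```

The first forward frame reproduces the start of the insulin preproprotein sequence (MALWM) shown earlier in these notes.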

BLAST ALGORITHM
• Scoring of matches done using scoring matrices
• Sequences are split into words (default n=3)
• Speed, computational efficiency
• BLAST algorithm extends the initial “seed” hit into an HSP. Word hits
are extended in either direction in an attempt to generate an alignment
with a score exceeding the threshold of "S"
• HSP = high scoring segment pair = Local optimal alignment
Why BLAST is great
• Very fast and can be used to search extremely large databases
• Sufficiently sensitive and selective for most purposes
• Robust - the default parameters can usually be used
MULTIPLE SEQUENCE ALIGNMENT (MSA)

What is a multiple sequence alignment?


● A representation of a set of sequences, where equivalent residues
(e.g. functional, structural) are aligned in rows and columns
Example: part of an alignment of SH2 domains from 14 sequences
Why we need MSA
•The result of searching databases is the establishment of a list of
sequences, either protein or nucleotide, which exhibit significant
similarity and are inferred to be homologous.
•These sequences can then be subjected to multiple sequence
alignment.
•The process attempts to place in the same column residues that
derive from a common ancestral residue by substitution.
•The most successful alignment is the one that most closely represents
the evolutionary history of the sequences.

Applications of MSA
Some of the applications of MSA are:
1. A preliminary step in molecular evolution analysis using phylogenetic
methods for constructing phylogenetic trees.
2. Helping predict the secondary and tertiary structures of new
sequences.
3. Identifying modules, domains or motifs in multimodular proteins.
4. Characterising protein families by identifying shared regions of
homology in a multiple sequence alignment (this generally happens
when a sequence search reveals homologies to several sequences).
5. Determining the consensus sequence of several aligned sequences.
6. Detecting weak similarities in databases using profiles.
7. Designing PCR primers for related genes.
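One of the applications listed above, determining a consensus sequence, can be sketched directly. The example assumes the sequences are already aligned to equal length and simply takes the most frequent symbol in each column; the function name `consensus` is illustrative, and ties are resolved in favour of the symbol seen first in the column, which is an arbitrary choice.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Most frequent symbol in each column of a multiple sequence alignment."""
    if len({len(seq) for seq in aligned_sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*...) walks the alignment column by column
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_sequences)
    )


alignment = ["ACGT", "ACGA", "TCGT"]
print(consensus(alignment))
```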

APPLICATIONS OF BIOINFORMATICS
Bioinformatics is widely applied in genomics, proteomics, 3D structure
modelling of proteins, image analysis, drug design and much more. A
significant application of bioinformatics can be found in the fields of
precision and preventive medicine, which are mainly focused on
developing measures to prevent, control and cure serious infectious
diseases.
The main aim of Bioinformatics is to increase the understanding of
biological processes.

Listed below are a few applications of Bioinformatics.

● In Gene therapy.
● In Evolutionary studies.
● In Microbial applications.
● In Prediction of Protein Structure.
● For the Storage and Retrieval of Data.
● In the field of medicine, used in the discovery of new drugs.
● In Biometrical Analysis for identification and access control for
improving crop management, crop production and pest control.
REFERENCES

• http://www.ncbi.nlm.nih.gov

• http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html

• http://www.ncbi.nlm.nih.gov/Genbank/

• Attawood, T., Smith, P.J., 1999, Introduction to Bioinformatics,
Longman Publishers.

• Grant, G.R., Ewens, W.J., 2005, Statistical Methods in Bioinformatics:
An Introduction, Springer.

• Higgins, D., Taylor, W., 2000, Bioinformatics – Sequence, Structure and
Databanks, Oxford University Press.

• Lacroix, Z., Critchlow, T., 2003, Bioinformatics – Managing Scientific
Data, Elsevier.
