
BIOINFORMATICS

Exploring the Diverse Branches of Life


Definition of Bioinformatics

 The science of collecting and analysing complex biological
data such as genetic codes.
 The use of computers to collect and analyze biological
information, especially for the field of genetics and genomics.
 Bioinformatics is the application of tools of computation and
analysis to the capture and interpretation of biological data.
 The application of computational tools to organize, analyze,
understand, visualize and store information associated with
biological macromolecules.
Introduction to Bioinformatics

 Bioinformatics is the use of computers for the acquisition,
management and analysis of biological information.
 It is an interdisciplinary field, which includes computer
science, mathematics, physics, and biology.
 It uses computer programs for a variety of applications,
including determining gene and protein functions, establishing
evolutionary relationships, and predicting the
three-dimensional shapes of proteins.
 Bioinformatics has also developed approaches for storing and
maintaining biological data and knowledge in databases.
History
 1960s : Margaret Dayhoff collected protein sequences
in a database.
 1970 : Term "bioinformatics" coined by Paulien Hogeweg and
Ben Hesper.
 1970 : Needleman-Wunsch algorithm
 1973 : Multiple sequence alignment
 1979 : First DNA database
 1980 : First complete gene sequence for an organism was
published
 1981 : Smith-Waterman algorithm
 1990 : Human Genome Project started
 1995 : First complete genome sequence of a free-living organism
was published (Haemophilus influenzae)
 1971 : PDB
 1980 : EMBL
 1982 : GenBank
 1984 : DDBJ
 1986 : Swiss-Prot
 1988 : FASTA by Pearson and Lipman
 1990 : BLAST by Altschul, Gish, Miller, Myers and Lipman
Scope of Bioinformatics

 It deals with methods for retrieving and analysing
biological data.
 It provides genome level data for understanding normal
biological processes and explains the malfunctioning of
genes.
 It helps to create databases on genomes and protein
sequences.
Where does bioinformatics help?

 In experimental molecular biology
 In genetics and genomics
 In generating biological data
 Analysis of gene and protein expression
 Comparison of genomic data
 Understanding evolutionary relationships
 Understanding biological pathways and networks in systems
biology
 In simulation and modeling of DNA, RNA and protein
Tools and techniques in Bioinformatics
 One important tool is sequence alignment, which compares DNA or
protein sequences to identify similarities and differences.
 Another key technique is gene prediction, which uses computational
algorithms to identify the location and structure of genes within a
genome.
 Protein structure prediction is another important tool in
bioinformatics, as it allows researchers to predict the three-dimensional
structure of proteins based on their amino acid sequence.
 Other tools and techniques used in bioinformatics include phylogenetic
analysis, which reconstructs evolutionary relationships among
organisms based on genetic data, and network analysis, which identifies
patterns of interaction between genes or proteins.
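The sequence alignment idea described above can be sketched with the Needleman-Wunsch global alignment algorithm named in the History slide. This is a minimal illustration only: the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for readability, not values taken from these slides, and real tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1] (1-based matrix indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, needleman_wunsch("ACGT", "AGT") aligns the shorter sequence by inserting one gap opposite the C. When several alignments share the best score, this sketch returns only one of them.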
Branches of Bioinformatics

1. Structural bioinformatics
 Prediction of structure from sequence
 Analysis of 3D Structure
 Structural databases
 Analysis and comparison of biomolecular structures
2. Genomics
 Study of genes at mRNA, protein and DNA level
 Includes structural genomics, functional genomics, comparative genomics.
 Genome sequencing
 Genome annotation
 Genome assembly
 Study of gene mutations
 Techniques used are PCR amplification, DNA sequencing, DNA
hybridisation.
3. Proteomics
 Study of proteome using various technologies.
 Assess activities, modifications, localisation and interactions of
proteins.
 Identification and characterization of proteins.
 Protein separation
 Consists of structural proteomics, functional proteomics and
expression proteomics.
 Various proteomics techniques are SDS PAGE, mass spectrometry
etc.
4. Cheminformatics
 The use of physical chemistry theory with computer and
information science techniques.
 Encompasses the design, creation, organisation, management,
retrieval, analysis and use of chemical information.
 Molecular similarity
 QSAR
5. Animal Bioinformatics
 Deals with computer-aided study of genomics, proteomics and
metabolomics in various animal species.
 The study includes gene mapping, gene sequencing,
animal breeds etc.
 Illustration of the relationships within the animal data and
also between animals.
6. Plant bioinformatics
 Deals with computer-aided study of plant species.
 It is further divided into agriculture bioinformatics,
horticulture bioinformatics, medicinal plant
bioinformatics and forest plant bioinformatics.
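Molecular similarity, listed under cheminformatics above, is often quantified with the Tanimoto (Jaccard) coefficient between two molecular fingerprints. The sketch below assumes each fingerprint is given as a collection of "on" feature identifiers; the function name and the convention for two empty fingerprints are illustrative assumptions, and real toolkits compute the fingerprints themselves from chemical structures.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared features divided by all features present."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(a | b)
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no features. For example, two fingerprints with four features each, two of them shared, give 2 / 6.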
Major research fields
 Sequence alignment
 Gene finding
 Genome assembly
 Drug design
 Drug discovery
 Protein structure alignment
 Protein structure prediction
 Gene expression
 Protein-protein interaction
 Next-generation sequencing (NGS)
Biological databases

 They are collections of files containing records of biological
data in machine-readable form.
 The data are structured, searchable, periodically updated and
cross-referenced.
 The main purpose of a biological database is to store and
manage biological data and information.
Basic terminologies

 DATA - Raw, unorganised facts that need to be
processed.
 INFORMATION - Data that has been processed, organised and
structured in a given context so as to make it useful.
 DATABASE - A collection of data that is
organized in a specific way.
 DBMS - A Database Management System is software that
stores and manages data according to sets of programmed
rules.
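The terms above can be illustrated with a small sketch that uses SQLite, through Python's built-in sqlite3 module, as the DBMS. The table layout, accession identifiers and sequences are invented for illustration and do not correspond to real database entries. The stored rows are the data; the grouped summary returned by the query is information derived from them.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding a few made-up sequence records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences ("
        " accession TEXT PRIMARY KEY,"
        " organism TEXT NOT NULL,"
        " residues TEXT NOT NULL)"
    )
    rows = [
        # Hypothetical records, not real accessions.
        ("DEMO0001", "Homo sapiens", "ATGGCC"),
        ("DEMO0002", "Homo sapiens", "ATGGCCTT"),
        ("DEMO0003", "Mus musculus", "ATGA"),
    ]
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def mean_length_by_organism(conn):
    """Turn raw records into information: mean sequence length per organism."""
    cursor = conn.execute(
        "SELECT organism, AVG(LENGTH(residues))"
        " FROM sequences GROUP BY organism ORDER BY organism"
    )
    return dict(cursor.fetchall())
```

The PRIMARY KEY constraint is one example of a rule enforced by the DBMS: inserting a second record with the same accession would be rejected.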
Why databases

 Means to handle and share large volumes of biological
data.
 Support large scale efforts.
 Make data access easy and updated.
 Link knowledge obtained from various fields of biology
and medicine.
 Security of data
Features of biological databases

 Heterogeneity
 High volume data
 Data integration
 Data curation
 Data sharing
Types of biological databases

 Primary databases consist of experimentally derived
data such as nucleotide sequences, protein sequences etc.
Some of the primary databases are GenBank, DDBJ etc.
 Secondary databases consist of data derived from
analyzing the primary data. Examples of secondary
databases are PROSITE, BLOCKS etc.
 Derived databases are derived from other resources.
They are further classified into structural databases and
specialized databases.
Primary databases

 Primary databases contain primary sequence information
(nucleotide or protein) and accompanying annotation
information regarding function, bibliographies, cross
references to other databases, and so forth.
 They contain biomolecular data in its original form.
 Experimental results are submitted directly into the database
by researchers and the data are essentially archival in nature.
 Once a database accession number has been assigned to an entry
in a primary database, the data form part of the scientific
record and are not changed further.
GenBank

 It is a database from NCBI that includes sequences from publicly
available resources.
 It is a genetic sequence database, an annotated collection of all
publicly available DNA sequences.
 Queries of the GenBank database are carried out via the NCBI
Entrez system [entrez], which is used to query all
NCBI-associated databases.
 GenBank is part of the International Nucleotide Sequence
Database Collaboration (INSDC).
 Sequences can be submitted to GenBank by anyone via a Web page,
or by e-mail when working with larger sequence sets.
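Entrez queries like the one described above can also be issued programmatically through NCBI's E-utilities web service. The sketch below only builds an ESearch request URL and does not send it; the helper name, the default database and the example search term are illustrative assumptions, and NCBI's usage guidelines (rate limits, API keys) apply to any real request.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nucleotide", retmax=5):
    """Build (but do not send) an ESearch URL for an Entrez query."""
    params = {
        "db": db,          # Entrez database to search
        "term": term,      # query text, same syntax as the Entrez web search
        "retmax": retmax,  # maximum number of record IDs to return
        "retmode": "json",
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)
```

Calling esearch_url("Homo sapiens[Organism]") yields a URL whose response, once fetched, lists the identifiers of matching nucleotide records.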
EMBL

 The EMBL database, from the European Molecular Biology
Laboratory, is a nucleic acid database that comes under the EBI.
 The EMBL-EBI is a hub for bioinformatics research and
services, developing and maintaining a large number of
scientific databases that are free of charge.
 It is maintained in collaboration with DDBJ and GenBank.
DDBJ

 The DNA Data Bank of Japan is a biological database that
collects DNA sequences.
 It is located at the National Institute of Genetics in the
Shizuoka prefecture of Japan.
 It is also a member of the International Nucleotide Sequence
Database Collaboration.
 The DDBJ Center provides sharing and analysis services for data
from life science research, in support of scientific progress.
SWISSPROT

 Swiss-Prot is a curated protein sequence database which strives
to provide a high level of annotation.
 It was created in 1986 by Amos Bairoch and later developed with
the Swiss Institute of Bioinformatics.
PIR
 The Protein Information Resource (PIR) is an integrated public
resource of protein informatics that supports genomic and
proteomic research and scientific discovery.
 PIR maintains the protein sequence database, an annotated protein
database containing over 283,000 sequences covering the entire
taxonomic range.
UniProt

 UniProt is a freely accessible database of protein sequence and functional
information, many entries being derived from genome sequencing projects.
 UniProt consists of three parts, the UniProt Knowledgebase (UniProtKB), the
UniProt Reference Clusters database (UniRef), and the UniProt Archive
(UniParc).
 Protein sequences and their annotations are stored in the UniProt
Knowledgebase (UniProtKB), which is divided into two sections. The first is
UniProtKB/TrEMBL, which contains automatically annotated
sequences, and the second is UniProtKB/Swiss-Prot, where manually
curated and annotated sequences are stored.
 The UniProt Reference Clusters (UniRef) provide clustered sets of sequences
from the UniProt Knowledgebase.
 The UniProt Archive (UniParc) is a collection of protein sequences and their
history.
Secondary databases

 Secondary databases contain data derived from the results
of analysing primary data.
 Secondary biological databases summarize the results
from analyses of primary protein sequence databases.
 The data in a secondary database are either manually
curated or generated automatically.
 They contain valuable information, such as
mutations or evolutionary relationships.
PDB

 The Protein Data Bank is a database for the
three-dimensional structural data of large biological molecules,
such as proteins.
 The data are obtained by X-ray crystallography, NMR
spectroscopy, or, increasingly, cryo-electron microscopy,
and submitted by biologists and biochemists from around
the world.
 The data are freely accessible on the Internet via the
websites of its member organisations.
 The mission is to maintain a single protein data bank
archive of macromolecular structural data.
PROSITE

 It is a protein database.
 It consists of entries describing protein families, domains
and functional sites, as well as amino acid patterns.
 Entries are manually curated by a team at the Swiss Institute of
Bioinformatics.
 Classification of proteins in PROSITE is determined using single
conserved motifs, i.e., short sequence regions (10–20 amino
acids) that are conserved in related proteins.
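PROSITE motifs of this kind are written in a compact pattern syntax that maps closely onto regular expressions. The sketch below is a simplified converter: it handles the common elements (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) repeats, and the < and > terminus anchors) but not every corner of the syntax, such as > inside square brackets. The function name is an illustrative assumption.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # database entries end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                        # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        # A single letter or a [..] set is already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)
```

For example, the N-glycosylation site pattern N-{P}-[ST]-{P} becomes N[^P][ST][^P], which can then be passed to re.finditer to scan a protein sequence.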
Pfam

 Pfam is a database of protein families that includes their
annotations and multiple sequence alignments generated using
hidden Markov models.
 The resulting alignments represent functionally interesting
structures and contain evolutionarily related sequences.
 The general purpose of Pfam is to provide a complete and
accurate classification of protein families and domains.
PRINTS

 The PRINTS database is a collection of protein motif
fingerprints.
 PRINTS provides a detailed annotation resource for protein
families and a diagnostic tool for newly determined
sequences.
 PRINTS is a founding partner of the integrated resource
InterPro, a widely used database of protein families,
domains and functional sites.
 PRINTS allows multiple motif searches.
BLOCKS

 The Blocks Database is a collection of blocks representing known
protein families that can be used to compare a protein or DNA
sequence with documented families of proteins.
 Searches of the Blocks Database are carried out using protein or
DNA sequence queries, and results are returned with measures of
significance for both single and multiple block hits.
 The database has also proved useful for derivation of amino acid
substitution matrices (the BLOSUM series) and other sets of
parameters.
 WWW and E-mail servers provide access to the database and
associated functions, including a block maker for sequences
provided by the user.
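The derivation of substitution matrices from blocks, mentioned above, rests on log-odds scores: how often two residues are observed aligned in a column compared with how often that would be expected by chance. The sketch below follows that general scheme in half-bit units on a toy block. It is a simplified illustration, not the published BLOSUM procedure: it omits the sequence clustering step that distinguishes, for instance, BLOSUM62 from BLOSUM80, and it only scores residue pairs that actually occur in the input.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Log-odds scores (half-bits) from a block of equal-length, ungapped sequences."""
    # Count every pair of residues that share a column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background probability of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Unordered pairs of different residues can arise in two ways.
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores
```

Positive scores mark pairs seen more often than chance (typically identical or similar residues) and negative scores mark pairs seen less often. The toy block ["AAS", "AAS", "AAS", "ASS"] gives positive scores for A with A and S with S and a negative score for A with S.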
Derived databases

 A database derived from other resources but including
relationships or data not found in those resources.
 They are further classified into structural databases and
specialised databases.
 Some of the structural databases are PDB, CATH, SCOP,
PubChem.
 Specialized databases are collections of focused
information on one or more specific fields of study.
 KEGG is a specialised database.
CATH

 The CATH database provides hierarchical classification of
protein domains based on their folding patterns.
 CATH is a free, publicly available online resource that
provides information on the evolutionary relationships of
protein domains.
 Experimentally-determined protein three-dimensional
structures are obtained from the Protein Data Bank and
split into their consecutive polypeptide chains, where
applicable.
SCOP

 The Structural Classification of Proteins database is a
largely manual classification of protein structural domains
based on similarities of their structures and amino acid
sequences.
 A motivation for this classification is to determine the
evolutionary relationship between proteins.
 The three main classifications are families, superfamilies,
and folds.
PubChem

 The PubChem database at the NCBI [pubchem] stores small chemical
molecules and information about their biological activities.
 It consists of three components, PubChem Compound, PubChem
Substance, and PubChem BioAssay.
 PubChem Compound contains approx. 91 million molecules together
with their two-dimensional (2D) molecular structures.
 PubChem Substance permits the search for substances produced by
various manufacturers, samples of unknown composition, and natural
substances of unknown 2D molecular structure.
 PubChem BioAssay consists of deposited bioactivity data and
descriptions of bioactivity assays used to screen the chemical substances
contained in the PubChem Substance database, including descriptions of
the conditions and the readouts (bioactivity levels) specific to the
screening procedure.
KEGG

 KEGG is a collection of databases dealing with genomes,
biological pathways, diseases, drugs, and chemical
substances.
 KEGG is a database resource for understanding high-level
functions and utilities of the biological system, such as the
cell, the organism and the ecosystem, from molecular-level
information, especially large-scale molecular
datasets generated by genome sequencing and other
high-throughput experimental technologies.
