
Syllabus

Bioinfo@AmS 1
Introduction: Bioinformatics

Bioinformatics = Biology + Information Technology

[Diagram: bioinformatics at the intersection of computer science, information technology, mathematics and statistics, and biology]

Bioinformatics is an interdisciplinary field that combines computer science, information technology,
mathematics/statistics, and biology to organize, analyze, and interpret biological data and to draw new insight
from it.

Introduction: Bioinformatics

✓ Bioinformatics deals with large biological datasets (macromolecular structures, genome sequences, expression
data) to improve understanding and generate new hypotheses

✓ Acquiring, organizing, and managing biological data, and deriving new insights from it

✓ Development of algorithms and tools for the analysis of biological data

Various aspects of Bioinformatics

✓ Development of organized databases/databank

✓ Open source web services (tools and platform)

✓ Virtual screening of the drug molecules

✓ Drug design

✓ Biological Network analysis

✓ Algorithms for big data analysis (NGS, RNA Seq)

✓ Meta data analysis

✓ Generation of new hypothesis via data mining

Bioinformatics

Foundation of Bioinformatics is Molecular Biology, specifically Genomics and Proteomics

Basic Form of Life : Cells

Introduction: Complexity in Biology

[Diagram: levels of biological complexity: organism → organs → tissue → cell → cell signaling → protein-protein interaction → protein structure → protein sequence]

Length and Time Scale

Introduction: Exploration of Biology

[Diagram: cycle linking observation or results, conclusion or hypothesis, experiments, and knowledge]

Introduction: Exploration of Biology

Charles Darwin and the voyage of the HMS Beagle (1831-36)

• Over the course of his travels in the Galápagos Islands off the
coast of Ecuador, Darwin began to see intriguing patterns in the
distribution and features of organisms.

Charles Darwin

Darwin's Theory of Evolution by “Natural Selection”


• Individuals with heritable traits (i.e., phenotypic variation) better suited to the environment will
survive and leave more offspring than their peers, causing those traits to increase in frequency over
generations.

Introduction: Exploration of Biology

What is the molecular mechanism that causes this variation?

How does a trait pass from one generation to the next?

How do species become extinct?

Introduction: Exploration of Biology

Gregor Johann Mendel (1822-1884)


• Studied the inheritance of traits in garden pea plants
• Developed the laws of inheritance
• Established the relationship between genotype and phenotype

Mendel is regarded as the father of genetics. Genes come in pairs (alleles) and are inherited as
distinct units, one from each parent. Mendel tracked the segregation of parental genes and their
appearance in the offspring as dominant or recessive traits.

Introduction: Exploration of Biology

Genes are the basic units of heredity. Each chromosome contains many genes, which are responsible for
different traits.

[Figure: a pair of chromosomes, one from the father and one from the mother, carrying alleles B and b of one gene and T and T of another]

• Every gene is present in two copies, each commonly called an allele, i.e., a variant form of the gene; one is received
from the father and one from the mother.
• If the two alleles that form the pair for a trait are identical, the individual is said to be homozygous for the trait;
if the two alleles are different, the individual is heterozygous for the trait.
• If the alleles of a gene are different, one allele will be expressed; it is the dominant allele. The effect of the
other allele, called recessive, is masked.

Introduction: Exploration of Biology

Cross 1: Black Hair (BB) × Brown Hair (bb)

      B    B
b     Bb   Bb
b     Bb   Bb

Genotype: 4 Bb
Phenotype: all Black

Cross 2: Black Hair (Bb) × Black Hair (Bb)

      B    b
B     BB   Bb
b     Bb   bb

Genotype: 1 BB : 2 Bb : 1 bb
Phenotype: 3 Black : 1 Brown
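
The two crosses on this slide (BB × bb and Bb × Bb) can be reproduced with a short script. This is a minimal sketch, assuming a single gene with two alleles in which the uppercase letter is dominant; the function names `punnett` and `phenotype` are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import product


def punnett(parent1, parent2):
    """Count offspring genotypes for a one-gene cross, e.g. punnett("Bb", "Bb")."""
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Write the dominant (uppercase) allele first, so "bB" becomes "Bb".
        pair = sorted(allele1 + allele2, key=lambda a: (a.lower(), a.islower()))
        offspring["".join(pair)] += 1
    return offspring


def phenotype(genotype, dominant="Black", recessive="Brown"):
    """One uppercase allele is enough to show the dominant trait."""
    return dominant if any(a.isupper() for a in genotype) else recessive


for cross in [("BB", "bb"), ("Bb", "Bb")]:
    genotypes = punnett(*cross)
    phenotypes = Counter(phenotype(g) for g in genotypes.elements())
    print(cross, dict(genotypes), dict(phenotypes))
```

Each parent contributes one allele per offspring, so the four cells of the square are simply all pairings of the two parental allele strings.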

Introduction: Exploration of Biology

Genotype – Deals with GENE CODE. Genotype is the particular set of genes present in an organism’s cell.
In other words, the genotype is the genetic constitution of an organism.

Phenotype – Deals with looks you can take a PHOTO with. All the observable characteristics of an
organism, such as shape, size, color and behavior are called phenotype.

AA

aa

Introduction: Exploration of Biology

What is the chemical nature and structure of a gene?

How do genes store genetic information?

How does information flow from a gene to an observable characteristic?

Central Dogma

DNA (storage unit; genotype)
  ↓ Transcription
mRNA (carrier unit)
  ↓ Translation
Protein (functional unit; phenotype)

Genetic mapping provides the first evidence that a disease or trait (i.e., a characteristic) is linked to the
gene(s) inherited from one’s parents.

Central Dogma

Flow of biological information:

DNA ↶ DNA: coding (DNA replication)
DNA → Protein: decoding

• The pattern of the DNA sequence stores fixed information about the protein sequence.

• A single change in the order of DNA nucleotides can lead to a defective protein or to no protein synthesis at all.

• Coding and decoding of information from the DNA level to the protein level are highly regulated biological events.

Introduction: Human Genome Project

Objective: Determine the sequence of the human genome (i.e., the haploid set of chromosomes plus X/Y).

Introduction: Human Genome Project

• The human genome is about 3.3 × 10^9 base pairs long.

• It contains approximately 25,000-30,000 genes.

• On average, a gene is made up of 3,000 nucleotides.

• The function of more than 50 percent of the genes is yet to be discovered.

• Less than 2 percent of the genome codes for proteins.

Essence of Bioinformatics:

• Store all the sequence information in a database (printed, it would fill about 3,300 books of 1,000 pages each, with 1,000 bp per page).

• Develop programs for data retrieval and indexing

Gene

GENE: A gene is a segment of chromosomal DNA with a specific ordered sequence of nucleotides (the building
blocks of DNA) that codes for a functional protein or RNA (rRNA, tRNA, mRNA).

Characteristics of Gene

• Genes are the basic physical and functional units of heredity, transferred from parent to offspring.

• Each gene is located in a particular region of chromosomal DNA.

• Prokaryotic genes contain no gaps (non-coding regions) within them.

• Eukaryotic genes are not continuous: their coding regions (exons) are separated by non-coding sequences
(introns).

[Figure: prokaryotic gene vs. eukaryotic gene]

Prokaryotic Gene Expression
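[Figure: prokaryotic gene expression]

Gene with a single ORF: Promoter | ORF | Terminator (DNA) → Transcription → mRNA → Translation → Protein

Gene with several ORFs: Promoter | ORF I | ORF II | ORF III | Terminator (DNA) → Transcription → mRNA (polycistronic) → Translation → Protein I, Protein II, Protein III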

• ORF: Open Reading Frame (i.e., the coding region of a gene), which contains the code for a protein
• Gene: ORF + regulatory regions (e.g., promoter and terminator)

Eukaryotic Gene Expression
[Figure: eukaryotic gene expression]

Gene: Promoter | ORF (Exon, Intron, Exon, Intron, Exon) | Terminator (DNA)
  ↓ Transcription
hnRNA
  ↓ Processing
mRNA
  ↓ Translation
Protein

• The ORF in eukaryotic cells is divided into exons and introns; however, only the exons carry the code for the protein

Exon and Intron

Exons: Eukaryotic genes contain stretches of coding sequence called exons, which are interrupted by non-coding
segments called introns. Only the exons carry the code for the protein.

Introns: These are intervening sequences of DNA that do not code for protein. Introns are common
in higher eukaryotic genomes but rare in prokaryotes.
• When genomic DNA is transcribed (gene expression), introns are transcribed along with exons. Once the
entire primary transcript has been made, the introns are removed before the mature mRNA reaches the ribosomes for protein
synthesis.

Characteristics of Gene

Pseudogenes
Pseudogenes are copies of genes that have lost their function. This can happen through mutations such as premature stop
codons or frameshifts within the coding sequence. Pseudogenes have often been regarded as junk DNA, i.e., DNA with no
function in the genome.

Gene families
In prokaryotes, a gene typically occurs as a single copy per genome. In eukaryotes, however, it is not uncommon to find genes that
are present in several copies. Such groups of genes are called gene families. A higher copy number enables production
of larger quantities of gene products.

Repetitive DNA sequence


Repeated non-coding sequences form a significant proportion of eukaryotic genomes. They are sometimes present in
thousands of copies.

Number of genes in a genome

✓ Unlike in eukaryotic genomes, most of the DNA in bacterial (prokaryotic) genomes encodes proteins.
✓ The genome of the bacterium E. coli contains 4288 genes, with nearly 90% of the genome coding for proteins.
✓ The yeast genome is about 2.5 times larger, comprising about 6000 genes, with 70% used for coding proteins.
Only 4% of the yeast genome is reported to be made of introns.
✓ The human genome consists of about 25,000-30,000 genes, with only about 2% of the DNA used as
protein-coding sequence.

Non-coding DNA

Complex genomes have roughly 10x to 30x more DNA than is required to encode all the proteins or RNAs
in the organism.

Contributors to the non-coding DNA include:


✓ Introns in genes
✓ Regulatory elements of genes
✓ Multiple copies of genes, including pseudogenes
✓ Intergenic sequences

Genome

Genome: All the information that is encoded in DNA and is capable of being passed on to offspring.

✓ Includes all genes, intergenic sequences, and repeats

✓ The genome is all of the nuclear DNA in a haploid cell (sperm or egg), i.e., the DNA that is inherited by the
next generation. A normal human somatic cell therefore has two genomes, a maternal genome and a
paternal genome.

✓ All the DNA on all the chromosomes of a haploid cell.

✓ The term is also applied to all the DNA in an organelle.

Tutorial/ Practical: Finding Exon and Introns

DNA Sequence:
CTCGAGGGGCCTA ATGCATTGCCC TCCAGAGAGA GCACCCAA CACCCTCCAGGCTT GACTAA CCAGGGTGT

Segments, left to right: 5′ flanking sequence, EXON 1, INTRON 1, EXON 2, INTRON 2, EXON 3, 3′ flanking sequence

Genome

[Schematic: a gene with exons separated by introns]

Length of a gene (in base pairs) = (Number of exons × Length of an exon) + (Number of introns × Length of an
intron)

Coding region of a genome = Total length of exons (i.e., Number of exons per gene × Length of an exon × Number of genes in the
genome)

Total non-coding region of a genome = Total length of introns (intragenic part, i.e., the gaps between exons) + Total
length of the intergenic part (the space between consecutive genes)
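
The three formulas on this slide translate directly into code. The sketch below is only an illustration: the function names are made up for this example, the toy numbers are hypothetical and deliberately differ from those in the assignment, and it assumes that every gene has the same number of equally sized exons.

```python
def gene_length(n_exons, exon_len, n_introns, intron_len):
    """Length of one gene in base pairs: exons plus introns."""
    return n_exons * exon_len + n_introns * intron_len


def coding_length(n_genes, exons_per_gene, exon_len):
    """Total length of all exons in the genome."""
    return n_genes * exons_per_gene * exon_len


def non_coding_length(genome_len, n_genes, exons_per_gene, exon_len):
    """Everything that is not exon: introns plus intergenic sequence."""
    return genome_len - coding_length(n_genes, exons_per_gene, exon_len)


# Hypothetical toy genome: 100,000 bp, 20 genes, 4 exons of 150 bp per gene.
coding = coding_length(n_genes=20, exons_per_gene=4, exon_len=150)
non_coding = non_coding_length(100_000, n_genes=20, exons_per_gene=4, exon_len=150)
print("Gene length:", gene_length(4, 150, 3, 300), "bp")
print("Coding:", coding, "bp; non-coding:", non_coding, "bp")
print("Coding : non-coding ratio =", round(coding / non_coding, 3))
```

A gene with n exons has n - 1 introns between them, which is why the example pairs 4 exons with 3 introns.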

Assignment

1. The whole genome of an E. coli bacterium is 2 × 10^6 base pairs in size, and sequencing has shown that it has 600 genes,
each 2 × 10^3 base pairs long. Each gene comprises three coding regions (exons) with an average size
of 200 base pairs per exon. A representative schematic is given below. How many introns are there in the
whole genome of E. coli? What is the ratio of the total length of the coding region to that of the non-coding region of the whole genome?

[Schematic: a gene with three exons separated by two introns]

Tutorial/ Practical: Finding Exon and Introns

http://hollywood.mit.edu/GENSCAN.html

Tutorial/ Practical: Finding Exon and Introns

Write a program: Given two DNA sequences, compare them character by character and output
i) the number of introns and their lengths and ii) the number of exons and their lengths

DNA Sequence:
CTCGAGGGGCCTA ATGCATTGCCC TCCAGAGAGA GCACCCAA CACCCTCCAGGCTT GACTAA CCAGGGTGT

DNA Sequence:
CTCGAGGGGCCTA ATGCATTGCCC GCACCCAA GACTAA CCAGGGTGT

Tutorial/ Practical: Finding Exon and Introns

Write a program: Given a DNA sequence and its corresponding mature mRNA sequence, compare the two
sequences character by character and output i) the number of introns and their lengths and ii) the number of exons and their
lengths

DNA Sequence: {A, T, G, C}

CTCGAGGGGCCTA ATGCATTGCCC TCCAGAGAGA GCACCCAA CACCCTCCAGGCTT GACTAA CCAGGGTGT

mRNA Sequence (mature): {A, U, G, C}

CUCGAGGGGCCUA AUGCAUUGCCC GCACCCAA GACUAA CCAGGGUGU

Notes: The mRNA is synthesized from DNA. The mRNA sequence does not contain any intron parts, and every ‘T’
of the DNA becomes ‘U’ in the RNA. All other characters remain the same in both DNA and RNA
(mapping: A → A, T → U, G → G, C → C).
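
One possible starting point for this exercise is sketched below. It is not the expected solution. It assumes that the sequences are supplied as the space-separated blocks shown on the slide and compares whole blocks rather than single characters; a true character-by-character comparison needs extra care, because an intron can begin with the same base as the next exon. The name `classify_blocks` is illustrative.

```python
def classify_blocks(dna_blocks, mrna_blocks):
    """Split DNA blocks into those kept in the mature mRNA and those spliced out."""
    # Rewrite the mRNA in the DNA alphabet so blocks can be compared directly.
    mrna_as_dna = [block.replace("U", "T") for block in mrna_blocks]
    kept, removed = [], []
    next_mrna = 0  # index of the next mRNA block expected in the DNA
    for block in dna_blocks:
        if next_mrna < len(mrna_as_dna) and block == mrna_as_dna[next_mrna]:
            kept.append(block)
            next_mrna += 1
        else:
            removed.append(block)  # absent from the mRNA, so treated as an intron
    if next_mrna != len(mrna_as_dna):
        raise ValueError("mRNA blocks do not appear in order within the DNA blocks")
    return kept, removed


dna = "CTCGAGGGGCCTA ATGCATTGCCC TCCAGAGAGA GCACCCAA CACCCTCCAGGCTT GACTAA CCAGGGTGT".split()
mrna = "CUCGAGGGGCCUA AUGCAUUGCCC GCACCCAA GACUAA CCAGGGUGU".split()

kept, introns = classify_blocks(dna, mrna)
print("Blocks retained in mRNA:", len(kept), "with lengths", [len(b) for b in kept])
print("Introns:", len(introns), "with lengths", [len(b) for b in introns])
```

In this toy input the first and last retained blocks flank the ATG...TAA region, so whether they are counted as exons depends on how the exercise defines an exon.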

Tutorial/ Practical: Finding Exon and Introns

https://asia.ensembl.org/index.html

https://asia.ensembl.org/Multi/Search/Results?q=BRCA2;site=ensembl

https://asia.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000139618;r=13:32315086-32400268

https://asia.ensembl.org/Homo_sapiens/Gene/Sequence?db=core;g=ENSG00000139618;r=13:32315086-32400268

Central Dogma: Flow of Biological Information

[Figure: DNA (base pairs, sugar-phosphate backbone) → Transcription → Translation → Protein]

Nucleic Acids (DNA and RNA)

DeoxyRibo Nucleic Acid (DNA)
• Double stranded
• Located in the nucleus
• Stores genetic information
• Deoxyribose sugar
• A, T, G, C

Ribo Nucleic Acid (RNA)
• Single stranded
• Located mainly outside the nucleus (cytoplasm)
• Mainly transfers genetic information (although in a few cases it stores genetic information)
• Ribose sugar
• A, U, G, C

Both DNA and RNA are polymers of Nucleotides

Nucleic Acids (DNA and RNA)

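[Figure: nucleotide structure. In both DNA and RNA the phosphate (P) is attached to the 5′ carbon of the sugar (S) and the nitrogenous base (NB) to the 1′ carbon. The 3′ carbon carries OH in both; the 2′ carbon carries H in DNA and OH in RNA.]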

Nitrogenous Bases
Purine
• Double ring
• Shorter name :: larger structure
• Atoms of the six-membered ring are numbered anti-clockwise
• Adenine (A) and Guanine (G)

Pyrimidine
• Single ring
• Longer name :: smaller structure
• Atoms of the six-membered ring are numbered clockwise
• Cytosine (C), Thymine (T) and Uracil (U)

[Figure: ring numbering of a purine (positions 1-9) and a pyrimidine (positions 1-6)]

Nitrogenous Bases

[Figure: nitrogenous base (NB): purine or pyrimidine]

Nucleoside = Sugar + NB

Nucleotide = Sugar + NB + Phosphate group (e.g., GMP)

DNA

Chemical Composition of DNA

Deoxyribonucleic Acid (DNA): Polymer of nucleotides

Nucleotides = Deoxy-Sugar (5C) + Nitrogen Base +Phosphate Group

Chemical Composition of Nucleotides
Nucleotide

Nucleoside

Base

Deoxyribose pentose sugar


Chemical Composition of Nucleotides
dNTP = Deoxynucleoside triphosphate

Nucleoside

Base

Deoxyribose (pentose) sugar


Base pair (BP) complementary

The rules of base pairing (or nucleotide pairing) are:


• A with T: the purine adenine (A) always pairs with the pyrimidine thymine (T)
• C with G: the pyrimidine cytosine (C) always pairs with the purine guanine (G)

Purines have a double ring; pyrimidines have a single ring.

Base pair (BP) complementary: One letter DNA string

One letter double stranded DNA string

5′ TCCTACTAGTGTG 3′
3′ AGGATGATCACAC 5′

[Figure: complementary strands joined by base pairs between two sugar-phosphate backbones]

Double helical-dsDNA

• The strands are anti-parallel (opposite polarity).
• The two strands coil around each other to form a spiral structure known as the DNA double helix.
• The diameter of DNA is 20 Å, which is uniform along the DNA.
• Each turn of the helix (360°) rises 34 Å (or 3.4 nm).
• Each turn comprises 10 base pairs.
• The rise between two consecutive base pairs is 3.4 Å.
• The DNA helix has a shallow groove called the minor groove (~1.2 nm across) and a deep groove called the major groove (~2.2 nm
across).
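
The helix parameters on this slide (10 base pairs and 3.4 nm per turn, hence 0.34 nm per base pair) allow the contour length of a DNA molecule to be estimated from its size in base pairs. The sketch below assumes ideal, fully extended B-form DNA; the function names are illustrative, and the genome size of 3.3 × 10^9 bp is the figure quoted earlier in these slides.

```python
RISE_PER_BP_NM = 0.34  # rise between consecutive base pairs (3.4 Å)
BP_PER_TURN = 10       # base pairs per helical turn


def helix_length_nm(n_bp):
    """Contour length in nm of an ideal double helix of n_bp base pairs."""
    return n_bp * RISE_PER_BP_NM


def helix_turns(n_bp):
    """Number of helical turns in n_bp base pairs."""
    return n_bp / BP_PER_TURN


human_genome_bp = 3_300_000_000
length_m = helix_length_nm(human_genome_bp) * 1e-9  # convert nm to m
print(f"One turn: {helix_length_nm(BP_PER_TURN):.1f} nm")
print(f"Haploid human genome: about {length_m:.2f} m, {helix_turns(human_genome_bp):.2e} turns")
```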

DNA Replication

Semi-conservative mode of DNA replication

One parent dsDNA:
5’–TCAGCTCGCTGCTAATGGCC–3’
3’–AGTCGAGCGACGATTACCGG–5’

Two daughter dsDNA, each with one old strand and one new strand:

5’–TCAGCTCGCTGCTAATGGCC–3’ (old strand)
3’–AGTCGAGCGACGATTACCGG–5’ (new strand)

5’–TCAGCTCGCTGCTAATGGCC–3’ (new strand)
3’–AGTCGAGCGACGATTACCGG–5’ (old strand)

DNA Replication
(DNA)n + dNTP → (DNA)n+1 + PPi

[Figure: DNA polymerase III adds an incoming dNTP to the free 3′-OH of the growing strand]

• Growing strand: synthesized in the 5′ → 3′ direction
• Template strand: DNA polymerase reads it in the 3′ → 5′ direction only

DNA string

DNA Sequence:
5’ AGGATCATGATCATGAATT 3’

Complement Sequence:
3’ TCCTAGTACTAGTACTTAA 5’

Reverse Sequence:
3’ TTAAGTACTAGTACTAGGA 5’

Reverse Complement sequence:
5’ AATTCATGATCATGATCCT 3’

Palindromic sequence:
5’ AATTAATT 3’
3’ TTAATTAA 5’

Repetitive sequence:
5’ TTAGTTAG 3’

https://www.bioinformatics.org/sms/rev_comp.html
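
The string operations on this slide can be written in a few lines. This is a minimal sketch using only the Python standard library; the helper names are illustrative, and a sequence is called palindromic here when it equals its own reverse complement, as in the AATTAATT example.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")


def complement(seq):
    """Base-by-base complement, read in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse(seq):
    return seq[::-1]


def reverse_complement(seq):
    """Complementary strand written 5' to 3'."""
    return complement(seq)[::-1]


def is_palindromic(seq):
    """True when both strands read the same in the 5' to 3' direction."""
    return seq == reverse_complement(seq)


dna = "AGGATCATGATCATGAATT"
print("Complement:        ", complement(dna))
print("Reverse:           ", reverse(dna))
print("Reverse complement:", reverse_complement(dna))
print("AATTAATT palindromic?", is_palindromic("AATTAATT"))
print("TTAGTTAG palindromic?", is_palindromic("TTAGTTAG"))
```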

Chargaff's rules: dsDNA

Example dsDNA:
5’ AGGATCATGATCATGAATT 3’
3’ TCCTAGTACTAGTACTTAA 5’

i) #A = #T
=> A/T = 1

ii) #C = #G
=> C/G = 1

iii) #(A+C) = #(G+T)

iv) #(A+T)/#(G+C) varies between organisms (range 0.4 to 1.9)
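
The first three rules can be checked by counting bases over both strands of the example duplex. The sketch below builds the second strand by complementing the first, so the equalities hold by construction; it is meant only to show the counting, not to serve as a test of real sequencing data.

```python
from collections import Counter

top = "AGGATCATGATCATGAATT"
bottom = top.translate(str.maketrans("ATGC", "TACG"))  # complementary strand

counts = Counter(top + bottom)  # base counts over the whole duplex
print(dict(counts))
print("#A == #T:", counts["A"] == counts["T"])
print("#C == #G:", counts["C"] == counts["G"])
print("#(A+C) == #(G+T):", counts["A"] + counts["C"] == counts["G"] + counts["T"])
print("(A+T)/(G+C) =", round((counts["A"] + counts["T"]) / (counts["G"] + counts["C"]), 2))
```

A short synthetic example like this one need not fall inside the range quoted for the last ratio, which describes whole genomes.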

Next generation storage unit: DNA

Adenine (A) = 00    Guanine (G) = 10

Cytosine (C) = 01   Thymine (T) = 11
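
With the two-bit mapping above, any DNA string can be converted to a bit string and back. The following is a small sketch of that idea only; the function names `encode` and `decode` are illustrative, and practical DNA storage schemes add error correction and other constraints that are not shown here.

```python
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(seq):
    """DNA string to bit string, two bits per base."""
    return "".join(BASE_TO_BITS[base] for base in seq)


def decode(bits):
    """Bit string (even length) back to a DNA string."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


bits = encode("GATC")
print(bits)          # two bits per base, so 8 bits for 4 bases
print(decode(bits))  # round trip back to the original sequence
```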

Melting temperature of DNA (Tm)

dsDNA (double helical) → heating (denaturation) → ssDNA (separated strands)
ssDNA (separated strands) → cooling (renaturation) → dsDNA (double helical)

Example duplexes:

5’ AATTAATTAATT 3’
3’ TTAATTAATTAA 5’

5’ GGCCGGCCGGCC 3’
3’ CCGGCCGGCCGG 5’

If a DNA solution is heated (> 90°C), there is enough kinetic energy to denature the DNA completely, causing it
to separate into single strands. Tm is defined as the temperature at which 50% of the double-stranded DNA has changed to
single-stranded DNA. The greater the guanine-cytosine (GC) content of the DNA, the higher the melting temperature.

Melting temperature of DNA (Tm)
Example:

ATGCTGAATGC
TACGACTTACG

DNA length = 11 bp

Total bases = 11 × 2 = 22

A = 6, T = 6, G = 5, C = 5

Total number of H-bonds = 2 × (number of A-T pairs) + 3 × (number of G-C pairs) = 2 × 6 + 3 × 5 = 12 + 15 = 27

Number of H-bonds broken at Tm = ½ × total number of H-bonds = 13.5
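
The hydrogen-bond count in this example can be reproduced from one strand alone, because every A or T on it marks an A-T pair (2 H-bonds) and every G or C marks a G-C pair (3 H-bonds). The sketch below follows the slide's simplified convention of treating Tm as the point where half of the H-bonds are broken; the function name is illustrative.

```python
def hydrogen_bonds(strand):
    """Total H-bonds in a duplex, computed from one of its strands."""
    at_pairs = strand.count("A") + strand.count("T")
    gc_pairs = strand.count("G") + strand.count("C")
    return 2 * at_pairs + 3 * gc_pairs


strand = "ATGCTGAATGC"
total = hydrogen_bonds(strand)
print("Total H-bonds:", total)
print("H-bonds broken at Tm (half of the total):", total / 2)
```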

Chargaff's rules: dsDNA

A 3D cubical DNA structure is given below, with an edge length of 85 nm. For double helical DNA, the pitch
length is 3.4 nm, which comprises 10 bp. How many bases are required to form the 3D structure? If thymine (T)
makes up 15 percent of the bases in this DNA sample, what are the numbers of the other bases and of sugar
molecules? Also, calculate the amount of energy required to attain the melting temperature (Tm). Assume the energy
of H-bonds is 5 kcal/mole. One mole of H-bonds means Avogadro's number of H-bonds (taken here as 6.03 × 10^23).

[Figure: cube with edge length 85 nm]

Chargaff's rules: dsDNA
Total edge length of a cube = 12 × edge length = 12 × 85 nm

A DNA length of 3.4 nm contains 10 bp

So, total number of base pairs in this double-stranded structure = 12 × (85/3.4) × 10 = 3000 bp

Total number of bases = 3000 × 2 = 6000

Total number of sugars (each nucleotide contains one sugar) in 3000 bp of double-stranded DNA = 3000 × 2 = 6000

T = 15%, A = 15%, G = 35%, C = 35%

T = (15 × 6000)/100 = 900

A = 900

G = (35 × 6000)/100 = 2100

C = 2100

Total bases (A+T+G+C) = 6000

Total number of H-bonds = 3 × (G-C pairs) + 2 × (A-T pairs) = (3 × 2100) + (2 × 900) = 6300 + 1800 = 8100

Number of H-bonds to break to reach Tm = ½ × 8100 = 4050

Energy for Tm = [(5 kcal/mole) × 4050] / (6.03 × 10^23 per mole) = 3,358.20 × 10^-23 kcal ≈ 3.36 × 10^-20 kcal
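
The arithmetic of this worked solution can be checked step by step in code. The sketch below reuses the slide's own assumptions: DNA runs along all 12 edges of the cube, half of the H-bonds must be broken to reach Tm, and Avogadro's number is taken as 6.03 × 10^23 as on the slide (the accepted value is closer to 6.022 × 10^23). Variable names are illustrative.

```python
AVOGADRO_SLIDE = 6.03e23    # value used on the slide
EDGE_NM = 85                # edge length of the cube
PITCH_NM = 3.4              # length of one helical turn
BP_PER_TURN = 10
HBOND_KCAL_PER_MOL = 5.0

total_bp = round(12 * EDGE_NM / PITCH_NM * BP_PER_TURN)  # 12 edges in a cube
total_bases = 2 * total_bp                               # two strands
sugars = total_bases                                     # one sugar per nucleotide

t = total_bases * 15 // 100        # thymine is 15% of all bases
a = t                              # Chargaff: #A = #T
g = c = (total_bases - a - t) // 2 # remaining bases split equally, #G = #C

h_bonds = 2 * a + 3 * g            # a A-T pairs and g G-C pairs
broken_at_tm = h_bonds / 2
energy_kcal = HBOND_KCAL_PER_MOL * broken_at_tm / AVOGADRO_SLIDE

print(total_bp, total_bases, sugars, (a, t, g, c), h_bonds, broken_at_tm)
print(f"Energy to reach Tm: {energy_kcal:.3e} kcal")
```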

Assignment

✓ Why is DNA double stranded and RNA single stranded?

✓ Why does the cell use DNA as its genetic material?

✓ How does DNA store genetic information?

RNA

Ribonucleic Acid (RNA): Polymer of nucleotides

Nucleotides = Sugar (5C) + Nitrogen Base +Phosphate

Transcription: DNA to RNA String

DNA template strand = antisense (-) strand (3' → 5')
DNA coding strand = sense (+) strand = non-template strand (5' → 3')
  ↓ Transcription (RNA polymerase)
mRNA (5' → 3')

Splicing: Generation of mRNA

[Figure: dsDNA → hnRNA → mRNA]

Splicing: Generation of mRNA

• The whole sequence (introns and exons) of a gene is transcribed into RNA by the enzyme RNA polymerase.
• The total transcribed RNA is known as heterogeneous nuclear RNA (hnRNA), which undergoes a process called
RNA splicing.
• Spliceosomes, made of proteins and small nuclear RNA (snRNA), catalyze the splicing: they remove the intron parts from the
hnRNA and join the exons to form mature mRNA.
• Thus, in the next step, only the exons of a gene enter the translation process (protein synthesis). Introns play
no role in protein synthesis.

Essentially, the code for the protein is carried in the exons of the gene.

DNA to RNA string
Example 1 (dsDNA → mRNA)

3’ TAC AGC AGA CGA CGC 5’   Template strand
5’ ATG TCG TCT GCT GCG 3’   Coding strand (complementary strand)
5’ AUG UCG UCU GCU GCG 3’   mRNA

Example 2 (template strand → mRNA)

3’ ATA CTG TCG TGA CGT CGT 5’   Template strand
5’ UAU GAC AGC ACU GCA GCA 3’   mRNA
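
Both examples on this slide can be reproduced with a simple character mapping. In this sketch the template strand is assumed to be written 3’ to 5’, as on the slide, so complementing it position by position yields the mRNA already in 5’ to 3’ order; the function names are illustrative.

```python
TEMPLATE_TO_MRNA = str.maketrans("ATGC", "UACG")


def transcribe_template(template_3to5):
    """mRNA (5' to 3') from a template strand written 3' to 5'."""
    return template_3to5.replace(" ", "").translate(TEMPLATE_TO_MRNA)


def mrna_from_coding(coding_5to3):
    """mRNA equals the coding strand with U in place of T."""
    return coding_5to3.replace(" ", "").replace("T", "U")


print(transcribe_template("TAC AGC AGA CGA CGC"))
print(mrna_from_coding("ATG TCG TCT GCT GCG"))
print(transcribe_template("ATA CTG TCG TGA CGT CGT"))
```

The first two calls print the same string, which illustrates that the mRNA matches the coding strand apart from the T to U substitution.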

DNA to RNA String

• Are both strands of DNA transcribed into mRNA, or only one?

• If only one, which one?

• Why can’t both strands of DNA be transcribed at the same time?

• Why is RNA less stable than DNA?

DNA to RNA String

• Are both strands of DNA transcribed into mRNA, or only one?
• If only one, which one?
Only one strand is transcribed at a time: the 3’ to 5’ strand is used as the template for transcription. RNA synthesis
follows the normal base pairing rules, except that the base thymine (T) is replaced with the base uracil (U). Because the
base sequence of the synthesized RNA is complementary to the template strand, it
exactly matches the other strand of DNA (i.e., the complementary 5’-3’ coding strand), with U in place of T.

DNA to RNA String

• Why can’t both strands of DNA be transcribed at the same time?
✓ The two strands of DNA carry different information, which means that the two strands of a dsDNA would code for two
different proteins. For a particular protein, therefore, the sequence of one strand of DNA is
sufficient. If both were transcribed, two different mRNAs would be synthesized, leading to the
production of two different proteins.
✓ Moreover, if both strands were transcribed, the resulting mRNAs would be complementary and would pair with
each other to produce double-stranded RNA, which could not be translated into protein.

DNA to RNA String

Two factors make RNA less stable than DNA:

• The presence of a hydroxyl group on the 2’ carbon of the sugar.
• The presence of uracil in place of thymine.

mRNA

Messenger RNA (mRNA)


• mRNA carries the genetic code from DNA to the ribosomes for protein synthesis.
• The mRNA sequence also has 5’ to 3’ polarity.
• It carries exactly the base sequence of the coding strand (5’ to 3’) of the dsDNA, with U instead of T.
• Thus, it carries copies of the instructions for assembling amino acids into proteins from DNA to the rest of the
cell (serving as a “messenger”).

rRNA

Ribosomal RNA (rRNA)


• Makes up the major part of ribosomes, where proteins are made.
• The ribosome is the organelle where protein synthesis takes place; it consists of ribosomal RNA (65%) and proteins (35%).
• It has two subunits, a large one and a small one; both are required for protein synthesis.

tRNA

Transfer RNA (tRNA)


• It reads the code carried by mRNA.
• It transfers amino acids to ribosomes during protein synthesis.
• Each amino acid is recognized by one or more specific tRNAs.
• tRNA has an L-shaped tertiary structure.
• One end attaches to the amino acid and the other binds to the mRNA through a complementary 3-base sequence (the anticodon).

Assignment

1. A double stranded DNA has 20% guanine. What are the percentages of the other bases in the DNA?

2. Let the four nitrogenous bases Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) be represented by the two-bit binary
numbers 00, 01, 10 and 11 respectively. S1 = 5′ AGTCATGGCCAA 3′ and S2 = 5′ AGTCCTGCCCAC 3′. Find the output DNA
sequences obtained by applying the AND, XOR and OR operations.

3. Find the mRNA sequence of the reverse complement of the given dsDNA

3’ TAC AGC AGA CGA CGC 5’
5’ ATG TCG TCT GCT GCG 3’

4. Write a program that, given a DNA strand sequence as input, finds the complementary sequence, the mRNA sequence and the number
of ATG repeats.

Nucleic acid information Resources

https://www.ncbi.nlm.nih.gov/

Protein

Proteins: Proteins are linear polymers built of monomer units called amino acids. They are the most versatile
macromolecules in living systems and serve crucial functions in essentially all biological processes.

“Workhorse in Cells”

Functions:
• They provide mechanical support.
• They transport and store other molecules such as oxygen.
• They help in immune protection.
• They control cell growth and differentiation.
• They function as biocatalysts such as enzymes.

Protein

“Workhorse in Cells”

[Figure: examples of proteins: enzyme, antibody, hormone (insulin), actin-myosin, actin filament]

Chemical Nature of Protein

[Figure: general structure of an amino acid: amino group, carboxyl group and variable side chain (R) attached to the central Cα]

- Cα is at the heart of the amino acid.
- Cα, C, N and O are called backbone atoms.
- R can be any of the 20 side chains.

Characteristics of amino acids:

• Amino acids are the basic units (i.e., monomers) of proteins.

• All amino acids have at least one acidic carboxylic acid (-COOH) group and one basic amino (-NH2) group.

• Only 20 amino acids are standard and present in proteins, because these are the ones coded for by genes.

Three and One Letter Code of Amino acids

Letters not assigned to any of the 20 standard amino acids: B, J, O, U, X, Z
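
The code table for this slide did not survive extraction, so the sketch below restates the standard three-letter and one-letter codes of the 20 amino acids as a lookup table. It also shows which letters of the alphabet are left unused by those 20 codes; the helper name `three_to_one` and the hyphen-separated input format are assumptions made for this illustration.

```python
from string import ascii_uppercase

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(peptide):
    """Convert e.g. "Met-Ala-Gly" to the one-letter string "MAG"."""
    return "".join(THREE_TO_ONE[residue] for residue in peptide.split("-"))


unused = sorted(set(ascii_uppercase) - set(THREE_TO_ONE.values()))
print(three_to_one("Met-Ala-Gly"))
print("Letters without a standard amino acid:", unused)
```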

Characteristic of Amino acids

[Figure: amino acids classified as non-polar, polar, positively charged and negatively charged]

Characteristic of Amino acids

Non-polar: These amino acids are hydrophobic (water hating) in nature and come closer together in a biological
aqueous medium, e.g., aliphatic and aromatic side chains.

Polar: These amino acids are hydrophilic (water loving) in nature and interact with water in a biological
aqueous medium, e.g., alcohols, acids, amines, etc.

Order of polarity: Acid > Alcohol > Amine > Ether > Alkane

Characteristic of Amino acids

✓ Which amino acid is more non-polar, Isoleucine or Alanine?

✓ Arrange the following amino acids (D, G, F, S) in order of polarity.

D > S > G > F

Amide Bonds

Amino acids are linked together through the formation of amide bonds (peptide bonds) between the amino group of one
residue and the carboxyl group of a second residue.

Peptide (< 50 amino acids)
Protein (> 50 amino acids)

[Figure: polypeptide chain running from the N-terminal end to the C-terminal end]

Structure Function Relationship of Protein

• The function of a protein depends on its structure, which in turn depends on the protein sequence.
• The function of a protein therefore depends on the order of its amino acids.

Protein sequence (amino acid order) → 3D structure → Function

Protein Folding

Primary structure (amino acid sequence)
→ Secondary structure (local folding into α-helix, β-sheet)
→ Tertiary structure (3D structure formed by assembly of secondary structures)
→ Quaternary structure (structure formed by more than one polypeptide chain)

Protein Folding

• The primary structure of a protein: a sequence of amino acids linked together by peptide bonds (covalent bonds)

• The secondary structure of a protein: polypeptide folding into α helix, β sheet, or random coil (H-bonds involved)

• The tertiary structure of a protein: 3-D folding of a single polypeptide chain

• The quaternary structure of a protein: association of two or more folded polypeptides (subunits) to form a
multi-subunit protein (bonds and interactions similar to tertiary structure)

Secondary Structure: α-Helix

• Secondary structure refers to local folded structures that form within a polypeptide (intra-chain) due to interactions
between atoms of the backbone.
• In an α helix, the carbonyl (C=O) of one amino acid is hydrogen bonded to the amino H (N-H) of an amino acid that
is four residues down the chain (n+4).
• Each turn of the helix contains 3.6 amino acids.
• The pitch of the helix is 0.54 nm.
• R groups are not involved in this hydrogen bonding.
• Example of proteins: α-keratin, abundant in skin, hair, nails and horns

H-bond: Gln (residue n) with Arg (residue n+4)
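
The two helix parameters given above (3.6 residues per turn and a pitch of 0.54 nm) imply a rise of 0.54 / 3.6 = 0.15 nm per residue. The sketch below uses them to estimate the length of an ideal α-helix and to list backbone H-bond partners (n, n+4); the function names and the 1-based residue numbering are assumptions made for this illustration.

```python
RESIDUES_PER_TURN = 3.6
PITCH_NM = 0.54
RISE_PER_RESIDUE_NM = PITCH_NM / RESIDUES_PER_TURN  # about 0.15 nm


def helix_length_nm(n_residues):
    """Length along the helix axis of an ideal alpha helix."""
    return n_residues * RISE_PER_RESIDUE_NM


def backbone_hbond_pairs(n_residues):
    """Pairs (n, n+4): C=O of residue n bonded to N-H of residue n+4."""
    return [(n, n + 4) for n in range(1, n_residues - 3)]


print(f"Rise per residue: {RISE_PER_RESIDUE_NM:.2f} nm")
print(f"36-residue helix: {helix_length_nm(36):.1f} nm, {36 / RESIDUES_PER_TURN:.0f} turns")
print(backbone_hbond_pairs(8))
```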


Secondary structure: β-pleated sheet

• Polypeptide chains are held together by H bonds between N-H group of one polypeptide chain and C=O group of
the other chain.

• Two or more segments of a polypeptide chain line up next to each other, forming a sheet-like structure held
together by hydrogen bonds.

• The strands of a β pleated sheet may be parallel, pointing in the same direction (meaning that their N- and C-
termini match up), or antiparallel, pointing in opposite directions (meaning that the N-terminus of one strand is
positioned next to the C-terminus of the other).

• Example of protein: Fibroin - abundant in silk

Tertiary Structure

• The 3D conformation or shape. Proteins fold spontaneously or with the help of molecular chaperone proteins present inside
the cell. The fold is stabilized by various kinds of interactions (i.e., covalent and non-covalent bonds) between the side
chains (R groups).

• The interactions responsible for protein tertiary structure depend on the properties of
the R groups of the amino acid residues; as these vary, the folding pattern and the 3D conformation change.

Tertiary Structure

• Ionic bonds (between charged amino acid side chains): For example, Lys has a positively charged side chain due to NH3+
and Asp has a negatively charged side chain due to COO-; these may form an ionic interaction known as a salt bridge.

• Hydrogen bonds between R groups: For example, uncharged polar amino acids can form H-bonds; Ser has a side chain
carrying an OH group that serves as a donor/acceptor for H-bonds.

• Covalent bonds: The protein chain forms intra-/inter-chain disulfide bonds between cysteine residues to stabilize the tertiary conformation.

• Hydrophobic interactions: Amino acids with non-polar side chains associate in the interior of the protein molecule and
exclude water via hydrophobic interactions.

• van der Waals interactions: 'van der Waals forces' is a general term for the attractive intermolecular forces
between molecules. There are two kinds of van der Waals forces: weak London dispersion forces and stronger dipole-dipole
forces.

Quaternary Structure

Many proteins are made up of more than one polypeptide chain, called subunits. Two or more
folded polypeptides (subunits) associate to form a multimeric protein (bonds and interactions
similar to tertiary structure).

The quaternary structure refers to how these protein subunits interact with each other and arrange
themselves to form a larger aggregate protein complex. The final shape of the protein complex is
once again stabilized by various interactions, including hydrogen bonding, disulfide bridges and
salt bridges.

Hemoglobin, a protein in red blood cells, has four subunits (two copies each of α- and β-globin),
each containing a heme molecule.

Protein Folding

Primary
sequence

Hierarchy of the Protein structure


Protein Folding

Protein domain: A segment (100 – 250 aa) of a polypeptide chain that folds independently into a stable structure
and performs a particular biological function. Domains are independently folded moieties of a protein that can be
differentiated both structurally and functionally.

Protein motif: A short stretch of polypeptide chain composed of secondary structure elements. It cannot hold an independent
structure outside the protein and is not biologically functional on its own. Put very simply, a domain can be made up of one or
more well characterized motifs, which usually occur together and suggest a putative function.

Protein Folding: 3D structure

[Figure: unfolded vs. folded protein]
•Hydrophobic amino acids (green)
•Hydrophilic amino acids (pink)
•Water molecules (surrounding)

•Details of protein folding depend on the pattern of the amino acid sequence
•Protein folding is reversible and reproducible
•Protein folding is thermodynamically favorable
•Among the numerous possible conformations, only one protein structure is biologically relevant and forms inside the cell

Protein Folding

ΔG = ΔH – TΔS

G = Gibbs free energy

Enthalpy (H) = Total energy, including chemical potential energy

Entropy (S) = Number of ways to arrange something (randomness)

Temperature (T) = Contributes to kinetic energy

[Figure: folding energy landscape with semi-stable states (local minima) and the native conformation (most stable)]

Entropy of Protein Folding

S = k_B ln W

k_B = Boltzmann's constant = 1.38 × 10⁻²³ J/K
W = Number of ways the system/molecules can be arranged

Unfolded: S = high
Folded: S = low (folding decreases the entropy of the chain)
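
The Boltzmann relation above can be tried out numerically. The following is only a sketch: the assumption of 3 conformations per residue for a 100-residue chain is a made-up toy figure, not a value from these slides.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def boltzmann_entropy(w):
    """Return S = k_B * ln(W) in J/K for W possible arrangements."""
    if w < 1:
        raise ValueError("W must be at least 1")
    return K_B * math.log(w)


# Hypothetical toy numbers: 3 conformations per residue, 100 residues.
w_unfolded = 3 ** 100
w_folded = 1  # a single native conformation

delta_s = boltzmann_entropy(w_folded) - boltzmann_entropy(w_unfolded)
print(f"Conformational entropy change on folding: {delta_s:.3e} J/K per molecule")
```

The sign is the point here: going from many arrangements to one gives a negative conformational entropy change.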

Entropy of Protein Folding

Unfolded → Folded: conformational entropy decreases

Unfolded: conformational entropy is HIGH

Folded: conformational entropy is LOW

Entropy of Protein Folding

Unfolded: entropy of surrounding water is LOW
• Water molecules form a cage-like structure around the exposed hydrophobic amino acids of the unfolded protein
and are therefore not free to move.

Folded: entropy of surrounding water is HIGH
• Water molecules are free to move around the folded protein because only hydrophilic amino acids are exposed.
• All hydrophobic amino acids are buried inside the core of the protein structure and therefore do not interact
with water molecules.
Entropy of Protein Folding

• Entropy of the water surrounding the protein increases on folding
• Conformational entropy of the protein decreases on folding
• Generally, the entropy gained by the water is larger than the conformational entropy lost by the protein.
• Overall, the protein folding process results in an increase in the entropy of the system.

Enthalpy of Protein Folding
Energy absorbed (bond breaking)
Energy released (bond formation)

H_unfolded → H_folded

∆H = H_folded – H_unfolded

∆H = (Energy required to break bonds – Energy released by bond formation)

Protein Folding

Covalent Bonds: Disulfide bond (S-S)

Ionic interactions/salt bridge: COO- … NH3+

Dipole-Dipole interactions(H-bonds) : HOH …. NH2

Dipole-Induced dipole : ROH …. CH3

Induced dipole-Induced dipole : CH3…. CH3

Protein Folding

Parameter                                Negative (< 0)                          Positive (> 0)
∆G = ∆H – T∆S                            Spontaneous                             Non-spontaneous
∆H = (H bonds broken – H bonds formed)   Exothermic; more stable bonds formed    Endothermic; fewer stable bonds formed
∆S = (S folded – S unfolded)             Less freedom of motion                  High degree of freedom of motion

*The decrease in enthalpy outweighs the decrease in entropy (which by itself makes a positive contribution to ∆G), so that
overall there is a negative ∆G that facilitates protein folding.
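
A minimal sketch of how the signs in the table combine. The ∆H and ∆S values below are hypothetical round numbers chosen only to illustrate the arithmetic, not measured folding data.

```python
def gibbs_free_energy(delta_h, temperature, delta_s):
    """Return dG = dH - T*dS (dH in kJ/mol, T in K, dS in kJ/(mol*K))."""
    return delta_h - temperature * delta_s


def is_spontaneous(delta_g):
    """A process is spontaneous when dG is negative."""
    return delta_g < 0


# Hypothetical values: favorable enthalpy, unfavorable entropy.
delta_h = -200.0  # kJ/mol
delta_s = -0.5    # kJ/(mol*K)

for temperature in (310.0, 450.0):
    dg = gibbs_free_energy(delta_h, temperature, delta_s)
    print(f"T = {temperature} K: dG = {dg} kJ/mol, spontaneous = {is_spontaneous(dg)}")
```

With these toy numbers the enthalpy term wins at 310 K, while at the higher temperature the T∆S term dominates and ∆G becomes positive.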

Genetic Codes

Transcription

Translation

Genetic Codes

3’ GGTCTCCTCACGCCA 5’     DNA (template strand)

5’ CCAGAGGAGUGCGGU 3’     mRNA (codons)

Pro-Glu-Glu-Cys-Gly       Protein (chain of amino acids)

The mRNA is complementary and antiparallel to the DNA template strand, so the template is written 3’ to 5’ here.

The genetic code is a set of three-letter combinations of nucleotides called codons, each of which corresponds
to a specific amino acid or stop signal. The concept of codons was first described by Francis Crick and his
colleagues in 1961.
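
The example above can be reproduced in a few lines. This is a sketch, not a reference implementation: the function names are illustrative, the codon table is the standard genetic code built from a compact 64-letter string, and the DNA string is treated as the template strand read 3’ to 5’.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(template_dna):
    """Pair every template base with its RNA partner (template 3'->5' gives mRNA 5'->3')."""
    pairing = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairing[base] for base in template_dna)


def translate(mrna):
    """Translate an mRNA codon by codon (one-letter amino acids), stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("GGTCTCCTCACGCCA")
print(mrna)             # CCAGAGGAGUGCGGU
print(translate(mrna))  # PEECG, i.e. Pro-Glu-Glu-Cys-Gly
```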
Genetic Codes

• A codon is a set of three consecutive bases found on the coding strand of double-stranded DNA and
subsequently transcribed into the (single-stranded) mRNA.

• DNA/RNA consists of four different bases, and there are three bases in a codon; hence there are
4 × 4 × 4 = 64 possible codons.

✓ Total 64 codons.

✓ AUG is the START codon (it also codes for methionine).

✓ UAA, UGA and UAG are STOP codons.

✓ The 61 non-stop codons (including AUG) code for the 20 amino acids.

Codon Degeneracy


• The genetic code is degenerate: in many instances different codons specify the same amino acid. This property,
in which an amino acid may be encoded by more than one codon, is known as codon degeneracy or codon
redundancy.

• There are 20 amino acids, which build the wide variety of proteins in a living organism, but 64 codons,
so more than one codon must specify the same amino acid. Codon degeneracy means that
several code words have the same meaning.

• Degeneracy makes the DNA more tolerant of point mutations. A point mutation in a codon does not necessarily
change the peptide: the mutated codon may still code for the same amino acid.
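
Degeneracy can be made concrete by grouping the 64 codons by the amino acid they encode. The sketch below assumes the standard genetic code; the variable names are illustrative.

```python
from collections import defaultdict

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid (one-letter code) they specify.
codons_by_aa = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    codons_by_aa[amino_acid].append(codon)

stop_codons = sorted(codons_by_aa.pop("*"))
sense_codon_count = sum(len(codons) for codons in codons_by_aa.values())

print("Stop codons:", stop_codons)
print("Sense codons:", sense_codon_count, "for", len(codons_by_aa), "amino acids")
for amino_acid in "LMW":  # leucine, methionine, tryptophan
    print(amino_acid, codons_by_aa[amino_acid])
```

Leucine has six synonymous codons, whereas methionine and tryptophan have one each, so a point mutation in a leucine codon has a better chance of being silent.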

Codon Bias

In most species, certain synonymous codons are preferred over others for a particular amino acid; this is known as codon bias.

Open Reading Frame(ORF)

ORF
5’ AUGAUACUCACAAUCUGA3’

ORF
5’AUACUC AUGAUACCCACAAUUCAACACCUCUAG3’

ORF
5’AUACGCCA AUGAUACUCACAAUCUAAACUCACACUCUC3’

• A reading frame (RF) is a way of dividing a DNA or RNA sequence into consecutive, non-overlapping three-nucleotide
codons (triplets); it may start anywhere within the sequence.
• Open Reading Frame (ORF) means the sequence is ‘open’ for the ribosome to keep reading during protein synthesis.
• An ORF is a segment of DNA that begins with a START codon and ends with a STOP codon and can encode a functional
protein.
• A gene also codes for a functional protein; however, it contains regulatory regions along with the ORF.

Reading Frame(RF)

• Three RFs are possible, depending on the starting point of reading of the mRNA string.
• The three RFs code for three different protein strings.
• The RF that contains the desired functional protein information (gene product), starting with a start codon
and ending with a stop codon, is called the ORF.

5’CUCAGCGUUACCAU3’

RF 1: CUC AGC GUU ACC AU
RF 2: C UCA GCG UUA CCA U
RF 3: CU CAG CGU UAC CAU

Reading Frame(ORF)

A segment of double-stranded DNA has six possible reading frames (RFs), three in each direction. However, a reading
frame leads to a protein only if it starts with a start codon and ends with a stop codon.

5’ ATGCTCTCATCTCG 3’
3’ TACGAGAGTAGAGC 5’

• dsDNA may produce two different mRNA transcripts, one from each strand.
• The six RFs could code for six different protein strings; however, only the two RFs that are ORFs contain the
desired protein information.
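
The six-frame idea can be sketched as a small scanner. This is a simplified illustration, not the algorithm used by NCBI ORFfinder: it reports only ATG-to-stop stretches, ignores minimum-length filters and alternative start codons, and the test strings are the mRNA examples from the ORF slide rewritten as DNA (U replaced by T).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the opposite strand, written 5'->3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna):
    """Return (strand, frame, start, end, sequence) for every ATG...stop ORF.

    frame is 1, 2 or 3; start/end are 0-based, end-exclusive positions
    on the strand being read.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    orfs.append((strand, frame + 1, start, i + 3, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


for orf in find_orfs("ATACTCATGATACCCACAATTCAACACCTCTAG"):
    print(orf)
```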

Assignment

5’TTAGATGTGTGTAAATGTGTGTGATGATCGTGATATCATAGTAGTCAATGATCGTAATATTATCTATTTATAACCG3’

https://www.ncbi.nlm.nih.gov/orffinder/

Mutation

• Mutation: Any sudden change in the genetic makeup is known as a mutation.

• The external or internal agents that cause mutations are known as mutagens.

• A mutation may change the amino acids in a protein, which in turn can change the phenotype.

Mutation

Gene Mutation / Point Mutation
• Substitution mutation
• Inversion mutation
• Frameshift mutation

Mutation

• Gene mutation = Any change in the nucleotide/base sequence of DNA, which may occur due to
errors in DNA replication or due to the effects of chemicals or radiation on the DNA molecule.

• A mutation may change the amino acids in the resulting protein.

[Figure: no mutation vs. mutation]

Mutation

• If no changes to genomes occurred over time, there would be no evolution

• Too much change in the DNA is harmful

• Too little change has no effect

• A balance exists between the amount of new variation and the overall health (adaptiveness) of the
new variant individual

• Closely related organisms have closely matched DNA sequences that diverged at some past time, with
differences that were adaptive for a given environment.

Mutation

1. Substitution: Replacement of one base by another; this may or may not affect the resulting protein. For example, an
A:T base pair could be mutated into a G:C base pair or even a T:A base pair. There are three subtypes:

(i) Silent mutation

(ii) Missense mutation

(iii) Nonsense mutation

2. Inversion: Sometimes a run of consecutive bases is rotated by 180 degrees, so that its orientation is reversed. Such an
event creates a mutation called an inversion, which may produce particular abnormalities at the phenotypic level.

Mutation

3. Frameshift mutation

- Insertion: An insertion changes the number of DNA bases in a gene by adding a piece of DNA. As a result,
the protein made by the gene may not function properly.

- Deletion: A deletion changes the number of DNA bases by removing a piece of DNA. Small deletions may
remove one or a few base pairs within a gene, while larger deletions can remove an entire gene or several
neighboring genes. The deleted DNA may alter the function of the resulting protein(s).

Silent Mutation

Normal Mutation
DNA

mRNA

Protein

Due to the redundancy of the genetic code (i.e., different codons can code for the same amino acid), a silent mutation
produces no change in the amino acid sequence.

Missense Mutation

Normal Mutation

DNA

mRNA

Protein

A missense mutation produces a change in the amino acid sequence of the protein product (here, histidine in place of
arginine); this may or may not change the function of the protein.

Nonsense Mutation

Normal Mutation

DNA

mRNA

Protein

Bad news! A nonsense mutation produces a STOP codon within the mRNA transcript,
leading to a truncated protein (incomplete protein sequence). How short the protein
product is depends on where the STOP codon was produced within the mRNA transcript.
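
The three substitution subtypes can be told apart by comparing the amino acids encoded before and after the change. The sketch below assumes the standard genetic code; `classify_substitution` is an illustrative name, and the example codons are chosen for demonstration (the slide figures themselves are not reproduced here).

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon, mut_codon):
    """Classify a base substitution within one mRNA codon of a coding sequence."""
    ref_aa = CODON_TABLE[ref_codon]
    mut_aa = CODON_TABLE[mut_codon]
    if ref_aa == mut_aa:
        return "silent"      # same amino acid (or stop) as before
    if mut_aa == "*":
        return "nonsense"    # premature stop codon
    return "missense"        # a different amino acid


print(classify_substitution("CGU", "CGC"))  # Arg -> Arg
print(classify_substitution("CGU", "CAU"))  # Arg -> His
print(classify_substitution("UGC", "UGA"))  # Cys -> stop
```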

Mutations: Insertion

A frameshift mutation

                      Normal gene               Addition mutation
DNA (template)        3’ GGTCTCCTCACGCCA 5’     3’ GGTGCTCCTCACGCCA 5’
                               ↓                         ↓
mRNA (codons)         5’ CCAGAGGAGUGCGGU 3’     5’ CCACGAGGAGUGCGGU 3’
                               ↓                         ↓
Amino acids           Pro-Glu-Glu-Cys-Gly       Pro-Arg-Gly-Val-Arg

Mutations: Deletion

A frameshift mutation

                      Normal gene               Deletion mutation
DNA (template)        3’ GGTCTCCTCACGCCA 5’     3’ GGTC_CCTCACGCCA 5’
                               ↓                         ↓
mRNA (codons)         5’ CCAGAGGAGUGCGGU 3’     5’ CCAGGGAGUGCGGU 3’
                               ↓                         ↓
Amino acids           Pro-Glu-Glu-Cys-Gly       Pro-Gly-Ser-Ala-
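
The insertion and deletion examples can be checked by translating the three mRNA strings from these two slides. This is a sketch that assumes the standard genetic code and reads each mRNA from its first base; the `translate` helper is illustrative.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate complete codons from position 0 (one-letter amino acid codes)."""
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3))


normal = "CCAGAGGAGUGCGGU"
insertion = "CCACGAGGAGUGCGGU"  # one extra base after the first codon
deletion = "CCAGGGAGUGCGGU"     # one base missing from the second codon

for label, mrna in (("normal", normal), ("insertion", insertion), ("deletion", deletion)):
    print(f"{label:>9}: {translate(mrna)}")
```

Every codon downstream of the inserted or deleted base is read in a shifted frame, which is why all amino acids after the first one change.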

Mutations: Inversion

An inversion mutation results from the removal of a length of DNA (a segment of DNA), which is
then reinserted facing in the opposite direction.

Assignment

1. Write a program to find the ORFs in a DNA sequence.

2. Write a program that, given two DNA sequences, finds the number of mutations and the sites of mutation by
comparing the two DNA strings. Comment on the types of point mutation found.

Database

Biological Databanks and Databases

Biological Databank: Databank is a generic term meaning any collection of data in any form. It is the
largest form of data repository/archive and keeps all the information safe for the long term.

Biological Database: A collection of a particular type of data arranged in a computer-readable format
suitable for easy storing, searching and analysis by users. Generally, database is a technical term
denoting a collection of data managed by software called a Database Management System (DBMS).

Types of Biological Data

• Nucleic Acids:
✓ DNA/RNA Sequences, Double helical structure
✓ RNA Sequences, mRNA, tRNA, rRNA, secondary RNA structure, Interactions
• Genomics: Gene, ORF, Genetic disorder, Mutation
• Proteomics:
✓ Protein sequences, secondary, tertiary , quaternary protein structure
✓ Protein expression profile, Protein-ligand interactions
• Transcriptomics: The study of the transcriptome, i.e., the complete set of RNA transcripts that are
produced by the genome
• Metabolomics: The study of the small molecules involved in cellular metabolism and their interactions
within a biological system

Biological Databases

• Nucleotide Sequence Databases (Genome and Gene)


• Protein Sequences Databases (Sequence information)
• Protein Structure Databases (3D structure )

Complexity and size of Biological data

Database
•Biological experimental data published in the literature are generally
scattered and unorganized.

•It is hard to find and access the related and relevant data required
for various analyses and applications.

•Biological data are very complex in nature: signaling pathways, for example, are so
interconnected that studying all these kinds of data makes it a major challenge
to draw conclusive inferences.

[Figure: scattered and random data → organized and indexed data]

A database is an organized form of information, collected and stored in computer-readable form.

How to design a biological database

•Contents
•Perspective of end users
•Data must be easily understandable for users
•Reliable and required information
•Faster responses to get the search results
•Less redundancy in the information

Collection of data in the related format


✓Structured/Organized -> Flat file
✓Searchable (index) -> Table of contents
✓Updated periodically (release) -> New edition
✓Cross-referenced (hyperlinks) -> Links with other DB
Characteristics of biological database

•Unique contents: Structure, sequence
•Ontology: Notations, symbols and biological conditions need to be explained briefly so that users can
understand them easily
•Schema: The logical structure of the database design
•Format of the data: The format must be uniform and compatible, so that the data can be used directly by other
programs and applications
•Search engine: The search engine should return relevant and non-redundant data.
•Link to the source of the data: Provide the source of the data, such as literature, details of the researchers
and links to other websites.

History of Biological Database

• Margaret Belle (Oakley) Dayhoff (March 11, 1925 – February 5, 1983) was an American physical
chemist and a pioneer in the field of bioinformatics. She created the first biological database, based on
protein sequences, published as the "Atlas of Protein Sequence and Structure" in 1965 at the National
Biomedical Research Foundation (NBRF), USA.

• The Protein Data Bank (PDB) was created in 1971 as a collection of X-ray crystallographic protein
structures.

• The SwissProt protein sequence database began in 1986.

• 1988: The National Center for Biotechnology Information (NCBI) is established as a division of the National
Library of Medicine (NLM) at the National Institutes of Health (NIH).
Classification of Biological Database

Biological Database

On the basis of source:
• Primary Database
• Secondary Database
• Composite Database

On the basis of nature of data:
• Sequences
• Structure
• Function
• Interaction
• Literature

Classification of Biological Database

Primary databases: Experimental results (raw data) are submitted directly to the database. Once given a
database accession number, the data in primary databases are not changed and become a permanent
scientific record.

Secondary databases: These databases comprise data derived from the analysis of primary data.

Composite databases: Many databases hold both raw and derived data, i.e., they have characteristics of both
primary and secondary databases; these are known as composite/aggregate databases.

Primary Database
• Primary databases are highly organized, user-friendly gateways to the huge amount of biological data
directly produced by researchers around the world.

• The primary databases were first developed for the storage of experimentally determined DNA and protein
sequences in the 1980s and 90s.

• These databases contain the raw nucleic acid/protein sequence data that are produced and submitted directly by
researchers worldwide.

• Primary databases contain information for sequence or structure only. Once data are deposited in primary
databases, they can be accessed freely by anyone around the world.

Example

✓ GenBank, DDBJ, EMBL for nucleotide (i.e., DNA/RNA) sequences

✓ Protein Databank (PDB) for protein structure


Secondary Database

• Secondary databases contain information derived from primary databases, i.e., data obtained by
curation, annotation and analysis of the entries of primary databases.
• Secondary databases store information such as conserved sequences, active site residues, and
signature sequences.
• A structural secondary database contains the entries of the PDB in an organized, classified way.

Examples
•TrEMBL for protein sequences
•SCOP at Cambridge University
•CATH at University College London
•PROSITE of the Swiss Institute of Bioinformatics

Composite Database

Composite databases contain both primary and derived data, which eliminates the need to search
each source separately. Each composite database has its own search algorithms and data structures.

Example:
NCBI is a composite database that links to the Online Mendelian Inheritance in Man (OMIM) and other
databases.

Classification of Database
Biological Database

Nucleotides
• INSDC (International Nucleotide Sequence Database Collaboration): EMBL, GenBank, DDBJ
• ENSEMBL (whole-genome database)

Protein
• Sequences: UniProt, PIR, SwissProt
• Structure: PDB, CATH, SCOP
• Interaction: BioGRID, STRING

Specialized
• OMIM (Online Mendelian Inheritance in Man)
• Gene Expression Omnibus (GEO) database

Nucleotide sequence databases

There are a small number of bioinformatics centers of excellence worldwide that have taken on the
responsibility to collect, catalogue and provide open access to published biological data.

Example:
1. The EMBL-European Bioinformatics Institute (EMBL-EBI)
2. The US National Center for Biotechnology Information (NCBI)
3. The National Institute of Genetics in Japan (NIG)

The role of bioinformatics centers of excellence in making biological data available for the research
community

Nucleotide sequence databases

• EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases

• EMBL www.ebi.ac.uk/embl/

• GenBank www.ncbi.nlm.nih.gov/Genbank/

• DDBJ www.ddbj.nig.ac.jp

Together they constitute the International Nucleotide Sequence Database Collaboration (INSDC).

NCBI

How to Use the NCBI’s Bioinformatics Tools and Databases

National Center for Biotechnology Information (NCBI)

• The National Center for Biotechnology Information (NCBI) was established in 1988 as a division of the National
Library of Medicine (NLM) at the National Institutes of Health (NIH), Maryland, USA.
• Website: https://www.ncbi.nlm.nih.gov/
• Aim: To create and maintain public databases, develop software such as sequence analysis tools (BLAST,
iCn3D), and provide resources for biomedical information
• Maintains several databases, such as GenBank (nucleic acid sequence database) since 1992. It is a superset of
various databases including Gene, Genome, Protein, literature, etc., and acts as an interface connecting these
databases.
• The databases present in NCBI are generally classified as primary and derived databases.
• It provides a database retrieval system known as Entrez.

Entrez

Entrez: An advanced search interface that facilitates constructing sophisticated queries across biological
databases. It is a molecular biology database retrieval system developed by NCBI. Entrez is the primary text search and
retrieval system of NCBI and integrates the PubMed database of biomedical literature with 38 other literature and
molecular databases.

• Easy to use and convenient; low redundancy
• Searches are mainly text-based; Boolean operators are used for text search.
• Integrated and cross-referenced: The Entrez Global Query Cross-Database Search System is a federated search engine,
or web portal, that allows users to search many discrete biological databases. Specialized search fields are available
for each database and can be browsed and selected in the Search Builder section of the Advanced Search interface.
• Other useful Entrez features include Search History, with access to recent results, and a Clipboard where search
results can be saved temporarily.
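
Entrez can also be queried programmatically. The sketch below relies on NCBI's E-utilities web API (`esearch.fcgi` and `efetch.fcgi`), which is an assumption on top of these slides since they only show the web interface. It builds request URLs only and does not contact the server; the helper names and the example query are illustrative.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build a text-search URL; Boolean operators such as AND go inside `term`."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(db, record_id, rettype="fasta"):
    """Build a URL that retrieves one record as plain text (e.g. FASTA or GenBank flat file)."""
    query = urlencode({"db": db, "id": record_id, "rettype": rettype, "retmode": "text"})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Example query and accession chosen for illustration only.
print(esearch_url("nuccore", "insulin[Title] AND Homo sapiens[Organism]"))
print(efetch_url("nuccore", "AH002844.2"))
```

The resulting URLs can be opened in a browser or fetched with any HTTP client; NCBI's usage guidelines on request rates apply.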

GenBank (Maintained by NCBI)

• GenBank is the most complete collection of raw nucleic acid sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high-throughput raw sequence data, and
sequence polymorphisms.
• There is also a GenPept database for protein sequences, the majority of which are conceptual
translations from DNA sequences.
• There are two ways to search for sequences in GenBank. One is using text-based keywords, similar to a
PubMed search. The other is using molecular sequences to search by sequence similarity using BLAST.
• The search output for sequence files is produced as flat files for easy reading. The resulting flat files contain
three sections: Header, Features, and Sequence entry.

GenBank (Maintained by NCBI)

✓ LOCUS: Contains a unique database identifier for a sequence location in the database (not a chromosome locus). The
identifier is followed by the sequence length and molecule type (e.g., DNA or RNA), and then by a three-letter code
for the GenBank division.
✓ DEFINITION: Provides the summary information for the sequence record, including the name of the sequence, the
name and taxonomy of the source organism if known, and whether the sequence is complete or partial.
✓ ACCESSION NUMBER: An accession number is given in the form of a string of alphanumeric characters (e.g., two
letters followed by six digits, or three letters followed by five digits). In addition to the accession number, there is also
a version number and a GenInfo Identifier (GI) number. The purpose of these numbers is to identify the current version of
the sequence. If the sequence annotation is revised at a later date, the accession number remains the same, but the version
number is incremented, as is the GI number.
✓ ORGANISM: Gives the source organism with the scientific name of the species.
✓ FEATURES: Includes annotation information about the gene and gene product. The “gene” field is the
information about the nucleotide coding sequence and its name. For DNA entries, there is a “CDS” field, which gives
the boundaries of the sequence that can be translated into amino acids.
GenBank (Maintained by NCBI)

✓ If there are multiple sequences with the same accession number, take the sequence with the highest GI number,
which is the most recently updated one.

NCBI
https://www.ncbi.nlm.nih.gov/

NCBI-GenBank

https://www.ncbi.nlm.nih.gov/genbank/
Search of Nucleotide Sequence

https://www.ncbi.nlm.nih.gov/nuccore

https://www.ncbi.nlm.nih.gov/nuccore/?term=Insulin
Search of Nucleotide Sequence
PLN: plant, fungal, and algal Seq.
PRI: Primate Seq.
MAM: Non-primate mammalian Seq.
BCT: Bacterial Seq.
EST: Expressed sequence Tag.

Header Features

https://www.ncbi.nlm.nih.gov/nuccore/AH002844.2

Search of Nucleotide Sequence

Sequence

Search of Nucleotide Sequence

https://www.ncbi.nlm.nih.gov/nuccore/AH002844.2?report=fasta
Search of Protein Sequence

https://www.ncbi.nlm.nih.gov/protein

Search of Protein Sequence

https://www.ncbi.nlm.nih.gov/protein/AAA59172.1

Search of Protein Sequence

https://www.ncbi.nlm.nih.gov/protein/AAA59172.1?report=fasta

Search of Gene Sequence

https://www.ncbi.nlm.nih.gov/gene

Search of Gene Sequence

https://www.ncbi.nlm.nih.gov/gene/3630
Search of Gene Sequence

https://www.ncbi.nlm.nih.gov/variation/view/

Nucleotide sequence databases

DDBJ www.ddbj.nig.ac.jp

Nucleotide sequence databases

https://www.ddbj.nig.ac.jp/services/indexe.html?tag=search,DDBJ

http://ddbj.nig.ac.jp/arsa/

Nucleotide sequence databases

http://ddbj.nig.ac.jp/arsa/

http://ddbj.nig.ac.jp/arsa/search?lang=en&cond=quick_search&query=Human+insulin+gene&operator=AND

Nucleotide sequence databases

http://getentry.ddbj.nig.ac.jp/getentry

Nucleotide sequence databases

https://www.ebi.ac.uk/

Nucleotide sequence databases

https://www.ebi.ac.uk/ebisearch/search?db=emblstandard&query=human%20insulin%20gene&size=15

https://www.ebi.ac.uk/ebisearch/search?db=nucleotideSequences&query=human%20insulin%20gene

Assignment- Practical

1. Visit NCBI: (i) Download the flat files and FASTA files of the nucleotide sequences of human (Homo sapiens) hemoglobin subunit alpha 1 (HBA1), Mus musculus hemoglobin alpha,
adult chain 1, and Bos taurus hemoglobin, beta (HBB).

2. Visit DDBJ and EMBL: (i) Download the flat file of the complete human CFTR nucleotide sequence. (ii) Download the FASTA file of the complete human CFTR nucleotide sequence.

3. Visit NCBI: (i) Download the flat files of the complete human CFTR gene sequence, nucleotide sequence and protein sequence. (ii) Download the FASTA files of the complete human CFTR
gene sequence, nucleotide sequence and protein sequence. (iii) Show the variants of the gene with missense mutations. (iv) Show the CDS.

Protein Databases

Protein
• Sequences: UniProt, PIR, SwissProt, TrEMBL
• Structure: PDB, CATH, SCOP

Protein Sequence Databases

• NCBI -Protein Sequence Databases

• PIR (Protein Information Resource): at Georgetown University Medical Center


• SWISS-PROT: Expertly curated protein sequence entries maintained by the Swiss Institute of Bioinformatics.
• TrEMBL: Computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL
nucleotide sequence entries not yet integrated in SWISS-PROT.

• UniProt: Universal Protein Resource (Knowledgebase)

The Protein Information Resource (PIR)

• PIR: The Atlas of Protein Sequence and Structure (Dayhoff, 1965) was the first sequence database
(pre-bioinformatics). It is currently known as the Protein Information Resource (PIR).
• PIR has provided many protein databases and analysis tools to the scientific community, including the
PIR-International Protein Sequence Database (PSD) of functionally annotated protein sequences.
• PIR was established in 1984 by the National Biomedical Research Foundation (NBRF), USA, as a
resource to assist researchers in the identification and interpretation of protein sequence information.
• Maintained by Georgetown University Medical Center
• PIR format

https://proteininformationresource.org/

Protein Information Resource (PIR)

PRO: Protein family classifications

iPTMnet: Post-translational modifications (PTMs)

iProLINK: Literature and research articles

iProClass: Contains sequences; integrated protein knowledgebase

https://proteininformationresource.org/cgi-bin/textsearch.pl

https://proteininformationresource.org/cgi-bin/ipcEntry?id=A0A6J4EE43

FASTA File format

https://proteininformationresource.org/cgi-bin/comp_mw.pl?ids=A0A6J4EE43

Universal Protein Resource (UniProt)

Universal Protein Resource (UniProt): The United Protein Databases (UniProt, 2003) is a central
database of protein sequence and function created by joining the forces of the SWISS-PROT, TrEMBL
and PIR protein database activities.
The centerpiece of the UniProt databases is the UniProt Knowledgebase (UniProtKB), which
comprises two sections: the manually annotated UniProtKB/Swiss-Prot and the automatically
(computer-)annotated UniProtKB/TrEMBL.

SWISS-PROT/UniProtKB

• SWISS-PROT is an annotated protein sequence database established in 1986 and maintained
collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva
and the EMBL Data Library.
• UniProtKB/SWISS-PROT is an expertly curated protein sequence database within UniProtKB (produced by
the UniProt consortium), which provides a high level of annotation and a minimal level of redundancy.
• It also has a high level of integration with other databases.
• Highest level of accuracy
• It contains protein sequences and descriptions, including function, domain structure, subcellular location,
post-translational modifications and functionally characterized variants.
• Similar format to EMBL

TrEMBL/UniProtKB

• TrEMBL is a computer-annotated supplement of SWISS-PROT that contains the translations of all coding
sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated in SWISS-PROT.
• TrEMBL can be considered as a preliminary section of SWISS-PROT. For all TrEMBL entries which
should finally be upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.
• Currently, SWISS-PROT and TrEMBL have 0.5 and 7.6 million sequences, respectively.

UniProtKB
https://www.uniprot.org/

UniProtKB

https://www.uniprot.org/uniprotkb?query=insulin

UniProtKB

https://www.uniprot.org/uniprotkb/P01308/entry

FASTA File format

https://rest.uniprot.org/uniprotkb/P01308.fasta
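
A FASTA file is plain text: each record starts with a header line beginning with ">" followed by one or more lines of sequence. A minimal parser might look like the sketch below; the two records are made-up placeholders, not the content of the UniProt file linked above.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]      # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Sequence lines of one record are joined into a single string.
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical toy input for illustration only.
toy_fasta = """>seq1 hypothetical example record
MKTAYIAK
QRQISFVK
>seq2 another made-up record
ACGT
"""

for name, sequence in parse_fasta(toy_fasta).items():
    print(name, len(sequence), sequence)
```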

Protein Structure Databases

PDB

http://www.rcsb.org/structure/4HHB
https://files.rcsb.org/view/4HHB.pdb
Protein Data Bank (PDB)

• PDB was established in 1971 at Brookhaven National Laboratory (BNL)
• Sole international repository of protein structure data
• X-ray crystallographic data and NMR data are stored for each protein

Process of submission of entry/data files to the PDB

1. The user submits the raw data directly to the PDB as an mmCIF (macromolecular Crystallographic
Information File)

2. RCSB checks the format of the coordinates and structures

3. Validation tests are run on the structure on deposition to the database

4. The structure is accepted and receives a unique PDB ID

PDB file format

http://www.rcsb.org/structure/4HHB
https://files.rcsb.org/view/4HHB.pdb
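
In the legacy PDB text format, atomic coordinates are stored in fixed-width ATOM records, so fields are read by column position rather than by splitting on spaces. The sketch below slices the main fields; the example line is assembled in code with the standard column widths and its values are illustrative, not checked against the 4HHB file.

```python
def parse_atom_line(line):
    """Slice one fixed-width ATOM/HETATM record of a PDB file into its main fields."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }


# Illustrative record built field by field so that the column widths are explicit.
example_line = (
    "ATOM  " + f"{1:>5}" + " " + " N  " + " " + "VAL" + " " + "A" + f"{1:>4}" + "    "
    + f"{6.204:8.3f}{16.869:8.3f}{4.854:8.3f}"
    + f"{1.00:6.2f}{49.05:6.2f}"
)

atom = parse_atom_line(example_line)
print(atom)
```

For real work, an established parser such as the one in Biopython is usually a safer choice than hand-written slicing.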

Protein Structure Databases

CATH (Class, Architecture, Topology and Homology):

• Classification of protein domain structures based on their folding patterns
• Each protein is chopped into individual domains, which are assigned to homologous superfamilies.
• Hierarchical domain classification of PDB entries.

Protein Structure Databases

Class: Derived from secondary structure content; assigned automatically
Architecture: Describes the gross orientation of secondary
structures, independent of connectivity
Topology: Clusters structures according to their topological
connections and numbers of secondary structures
Homologous superfamily: Groups together protein domains that are thought to share a
common ancestor and can therefore be described as homologous

Protein Structure Databases

https://www.cathdb.info/
