Bioinfo PPTs

Syllabus
Bioinfo@AmS 1
Introduction: Bioinformatics
Bioinformatics = Biology + Information Technology
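[Figure: Venn diagram placing Bioinformatics at the intersection of Computer Science, Information Technology, Mathematics & Statistics, and Biology]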
Bioinformatics is an interdisciplinary field that combines computer science, information technology,
mathematics/statistics, and biology to organize, analyze, and interpret biological data and to draw new insights
from it.
Bioinfo@AmS 2
Introduction: Bioinformatics
✓ Bioinformatics deals with large biological datasets (macromolecular structures, genome sequences, expression
data) for better understanding and for the generation of new hypotheses
✓ Acquiring, organizing, and managing biological data, and deriving new insights from it
✓ Development of algorithms and tools for the analysis of biological data
Bioinfo@AmS 3
Various aspects of Bioinformatics
✓ Development of organized databases/databank
✓ Open source web services (tools and platform)
✓ Virtual screening of the drug molecules
✓ Drug design
✓ Biological Network analysis
✓ Algorithms for big data analysis (NGS, RNA Seq)
✓ Meta data analysis
✓ Generation of new hypotheses via data mining
Bioinfo@AmS 4
Bioinformatics
Foundation of Bioinformatics is Molecular Biology, specifically Genomics and Proteomics
Bioinfo@AmS 5
Basic Form of Life : Cells
Bioinfo@AmS 6
Introduction: Complexity in Biology
[Figure: levels of biological complexity: organism, organs, tissue, cell, cell signaling, protein-protein interaction, protein structure, protein sequence]
Bioinfo@AmS 7
Length and Time Scale
Bioinfo@AmS 8
Introduction: Exploration of Biology
[Figure: cycle of scientific exploration: Observation, Hypothesis, Experiments, Results, Conclusion, Knowledge]
Bioinfo@AmS 9
Charles Darwin and the voyage of the HMS Beagle (1831-36)
• Over the course of his travels in the Galápagos Islands off the
coast of Ecuador, Darwin began to see intriguing patterns in the
distribution and features of organisms.
Charles Darwin
Darwin's Theory of Evolution by “Natural Selection”

• Those individuals with heritable traits (i.e., phenotypic variation) better suited to the environment will
survive and leave more offspring than their peers, causing the traits to increase in frequency over
generations
Bioinfo@AmS 10
What is the molecular mechanism that causes this variation?
How are traits passed from one generation to the next?
How do species become extinct?
Bioinfo@AmS 11
Gregor Johann Mendel (1822-1884)

• Studied the inheritance of traits in garden pea plants
• Developed the laws of inheritance
• Established the role of genotype in determining phenotype
He is regarded as the father and founder of genetics. Genes come in pairs (alleles) and are inherited as
distinct units, one from each parent. Mendel tracked the segregation of parental genes and their
appearance in the offspring as dominant or recessive traits.
Bioinfo@AmS 12
Genes are the basic units of heredity. Each chromosome contains many genes that are responsible for
different traits.
[Figure: a pair of homologous chromosomes, one from the father (alleles B, T) and one from the mother (alleles b, T)]
• There are two copies of every gene. Each copy is commonly called an allele, i.e., a variant form of a gene,
one received from the father and one from the mother.
• If the two alleles that form the pair for a trait are identical, the individual is said to be homozygous for
the trait; if the two alleles are different, the individual is heterozygous for the trait.
• If the alleles of a gene are different, one allele will be expressed; it is the dominant allele. The effect of the
other allele, called recessive, is masked.
Bioinfo@AmS 13
Cross 1: Black hair (BB) × Brown hair (bb)
        B    B
   b    Bb   Bb
   b    Bb   Bb
Genotype: 4 Bb
Phenotype: all Black

Cross 2: Black hair (Bb) × Black hair (Bb)
        B    b
   B    BB   Bb
   b    Bb   bb
Genotype: BB, 2 Bb, bb
Phenotype: Black, Black, Brown (3 Black : 1 Brown)
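The two crosses can be reproduced with a short Python sketch. The function names `punnett` and `phenotype` are illustrative choices, and the code assumes a single gene with complete dominance, with the dominant allele written in uppercase.

```python
from collections import Counter
from itertools import product


def punnett(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes for a single-gene cross, e.g. 'Bb' x 'Bb'."""
    offspring = Counter()
    # Each parent contributes one of its two alleles to each offspring.
    for allele1, allele2 in product(parent1, parent2):
        # Sorting puts the uppercase (dominant) allele first, so 'bB' is counted as 'Bb'.
        genotype = "".join(sorted(allele1 + allele2))
        offspring[genotype] += 1
    return offspring


def phenotype(genotype: str, dominant: str = "Black", recessive: str = "Brown") -> str:
    """One dominant (uppercase) allele is enough to show the dominant trait."""
    return dominant if any(allele.isupper() for allele in genotype) else recessive


for cross in (("BB", "bb"), ("Bb", "Bb")):
    counts = punnett(*cross)
    print(cross, dict(counts), {g: phenotype(g) for g in counts})
```

The same functions work for any two-letter genotype pair, which makes it easy to try other crosses such as Bb × bb.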
Bioinfo@AmS 14
Bioinfo@AmS 15
Genotype – Deals with GENE CODE. Genotype is the particular set of genes present in an organism’s cell.
In other words, the genotype is the genetic constitution of an organism.
Phenotype – Deals with looks you can take a PHOTO with. All the observable characteristics of an
organism, such as shape, size, color and behavior are called phenotype.
AA
aa
Bioinfo@AmS 16
What is the chemical nature and structure of a gene?
How do genes store genetic information?
How does information flow from genes to characteristics?
Bioinfo@AmS 17
Central Dogma
DNA: storage unit (genotype)
   ↓ Transcription
mRNA: carrier unit
   ↓ Translation
Protein: functional unit (phenotype)
Genetic mapping provides the first evidence that a disease or trait (i.e., a characteristic) is linked to the
gene(s) inherited from one’s parents.
Bioinfo@AmS 18
Central Dogma
Flow of biological information:
DNA → DNA: coding (DNA replication)
DNA → Protein: decoding
• The pattern of the DNA sequence stores fixed information about the protein sequence.
• A single change in the order of the DNA nucleotides can lead to a defective protein or to no protein synthesis.
• Coding and decoding of the information from the DNA level to the protein level are highly regulated biological events.
Bioinfo@AmS 19
Introduction: Human Genome Project
Objective: Determine the nucleotide sequence of the human genome (i.e., the haploid set of chromosomes, including X and Y).
Bioinfo@AmS 20
Introduction: Human Genome Project
• The human genome is 3.3 × 10^9 base pairs long.
• It contains approximately 25,000-30,000 genes.
• On average, a gene is made up of 3000 nucleotides.
• The function of more than 50 percent of the genes is yet to be discovered.
• Proteins are coded by less than 2 percent of the genome.
Essence of Bioinformatics:
• Store all the sequence information in a database (equivalent to 3,300 books of 1,000 pages, with 1,000 bp per page, since 3.3 × 10^9 / 10^6 = 3,300).
• Develop programs for data retrieval and indexing.
Bioinfo@AmS 21
Gene
GENE: A gene is a segment of chromosomal DNA with a specific ordered sequence of nucleotides (the building
blocks of DNA) that codes for a functional protein or RNA (rRNA, tRNA, mRNA).
Bioinfo@AmS 1
Characteristics of Gene
• Genes are the basic physical and functional units of heredity, transferred from parent to offspring.
• Each gene is located in a particular region of chromosomal DNA.
• Prokaryotic genes generally contain no non-coding regions within them.
• Eukaryotic genes are not continuous: the coding regions (exons) are separated by non-coding sequences
(introns).
Prokaryotic Gene Eukaryotic Gene
Bioinfo@AmS 3
Prokaryotic Gene Expression
Single gene:
DNA: Promoter | ORF | Terminator
   ↓ Transcription
mRNA
   ↓ Translation
Protein

Several genes under one promoter:
DNA: Promoter | ORF I | ORF II | ORF III | Terminator
   ↓ Transcription
mRNA (polycistronic)
   ↓ Translation
Protein I, Protein II, Protein III
• ORF: Open Reading Frame (i.e., the coding region of a gene), which contains the code for a protein
• Gene: ORF + regulatory regions (e.g., promoter and terminator)
Bioinfo@AmS 4
Eukaryotic Gene Expression
DNA (gene): Promoter | Exon | Intron | Exon | Intron | Exon | Terminator (the exons and introns together form the ORF)
   ↓ Transcription
hnRNA
   ↓ Processing
mRNA
   ↓ Translation
Protein
• In eukaryotic cells the ORF is divided into exons and introns; only the exons carry the code for the protein.
Bioinfo@AmS 5
Exon and Intron
Exon: Eukaryotic genes contain stretches of coding sequence called exons, which are interrupted by non-coding
segments called introns. Exons often code for functionally distinct parts of a protein.
Introns: These are intervening sequences of DNA that do not code for protein. Introns are common in the genomes
of higher eukaryotes but rare in prokaryotes.
• When genomic DNA is transcribed (gene expression), the introns are transcribed too. Once the entire primary
transcript has been made, the introns are removed before the mRNA reaches the ribosomes for protein
synthesis.
Bioinfo@AmS 6
Characteristics of Gene
Pseudogenes
Pseudogenes are copies of genes that have lost their function, for example through mutations that introduce stop
codons or frameshifts within the coding sequence. Pseudogenes have often been regarded as non-functional "junk DNA".
Gene families
In prokaryotes, a gene usually occurs as a single copy per genome. In eukaryotes, however, it is not uncommon to find genes
that are present in several copies. Such groups of genes are called gene families. A higher copy number enables production
of larger quantities of gene products.
Repetitive DNA sequence

Repeated non-coding sequences form a significant proportion of eukaryotic genomes. They are sometimes present in
thousands of copies.
Bioinfo@AmS 7
Number of genes in a genome
✓ Unlike eukaryotic genomes, most of the DNA in bacterial genomes (prokaryote) encodes proteins.
✓ The genome of E. coli bacteria is made of 4288 genes, with nearly 90% of the genome coding for proteins.
✓ The yeast genome is about 2.5 times larger, comprising about 6000 genes, with 70% used for coding proteins.
Only 4% of the yeast genome is reported to be made of introns.
✓ The genome of humans consists of about 25,000-30,000 genes, with only about 2% of DNA used as protein
coding sequence.
Bioinfo@AmS 8
Non-coding DNA
Complex genomes have roughly 10x to 30x more DNA than is required to encode all the proteins or RNAs
in the organism.
Contributors to the non-coding DNA include:

✓ Introns in genes
✓ Regulatory elements of genes
✓ Multiple copies of genes, including pseudogenes
✓ Intergenic sequences
Bioinfo@AmS 9
Genome
Genome: All the information that is encoded in DNA and can be passed on to offspring.
✓ Includes all genes, intergenic sequences, and repeats.
✓ For a nuclear genome, it is all the DNA on all the chromosomes of a haploid cell (sperm or egg), i.e., the DNA
that is inherited by the next generation. A normal human somatic cell therefore has two genomes, a maternal
genome and a paternal genome.
✓ For an organelle, it is all the DNA in that organelle.
Bioinfo@AmS 10
Tutorial/ Practical: Finding Exon and Introns
DNA Sequence:
CTCGAGGGGCCTA | ATGCATTGCCC | TCCAGAGAGA | GCACCCAA | CACCCTCCAGGCTT | GACTAA | CCAGGGTGT
(flanking)    | EXON 1      | INTRON 1   | EXON 2   | INTRON 2       | EXON 3 | (flanking)
Bioinfo@AmS 11
Genome
Intron Intron
Length of a gene (in base pairs) = (Number of exons × Length of an exon) + (Number of introns × Length of an
intron)
Coding region of a genome = Total length of exons (i.e., Number of exons per gene × Length of an exon × Number of genes in the
genome)
Total non-coding region of a genome = Total length of introns (the intragenic part, i.e., the gaps between exons) + Total
length of the intergenic part (the space between consecutive genes)
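These formulas translate directly into code. The sketch below is illustrative only: the function names are hypothetical, every gene is assumed to have the same number and length of exons and introns (as the formulas imply), and the numbers in the example are made up rather than taken from a real genome.

```python
def gene_length(n_exons: int, exon_len: int, n_introns: int, intron_len: int) -> int:
    """Length of one gene in base pairs: all exons plus all introns."""
    return n_exons * exon_len + n_introns * intron_len


def genome_regions(genome_size: int, n_genes: int, n_exons: int, exon_len: int,
                   n_introns: int, intron_len: int) -> dict:
    """Split a genome into coding, intronic and intergenic base pairs."""
    coding = n_genes * n_exons * exon_len
    intronic = n_genes * n_introns * intron_len
    # Whatever is not inside a gene is intergenic.
    intergenic = genome_size - n_genes * gene_length(n_exons, exon_len, n_introns, intron_len)
    return {
        "coding": coding,
        "intronic": intronic,
        "intergenic": intergenic,
        "non_coding": intronic + intergenic,
    }


# Made-up toy genome: 50,000 bp, 10 genes, each with 3 exons of 200 bp and 2 introns of 700 bp.
regions = genome_regions(50_000, 10, 3, 200, 2, 700)
print(regions)
print("coding : non-coding =", regions["coding"] / regions["non_coding"])
```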
Bioinfo@AmS 12
Assignment
1. The whole genome of an E. coli bacterium is 2 × 10^6 base pairs in size, and sequencing has shown that it has 600 genes,
each 2 × 10^3 base pairs long. Each gene comprises three coding regions (exons) with an average size
of 200 base pairs each. A representative schematic is given below. How many introns are there in the
whole genome of E. coli? What is the ratio of the total length of the coding region to the total length of the non-coding region of the whole genome?
[Schematic: gene with three exons separated by two introns]
Bioinfo@AmS 13
http://hollywood.mit.edu/GENSCAN.html
Bioinfo@AmS 14
Write a program: Given two DNA sequences, compare them character by character and output
i) the number of introns and their lengths and ii) the number of exons and their lengths.
DNA Sequence:
DNA Sequence:
CTCGAGGGGCCTA ATGCATTGCCC GCACCCAA GACTAA CCAGGGTGT
Bioinfo@AmS 15
Write a program: Given a DNA sequence and its corresponding mature mRNA sequence, compare the two
sequences character by character and output i) the number of introns and their lengths and ii) the number of exons and their
lengths.
DNA Sequence: { A, T, G, C}

mRNA Sequence (mature): { A, U, G, C}

CUCGAGGGGCCUA AUGCAUUGCCC GCACCCAA GACUAA CCAGGGUGU
Notes: The mRNA is synthesized from DNA. The mRNA sequence does not contain any intron parts, and every 'T'
character of the DNA changes to 'U' in the RNA. All other characters remain the same in both DNA and RNA
(A → A, T → U, G → G, C → C).
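One possible starting point for this exercise is sketched below. It is not a full solution: it assumes the segment boundaries are already marked by spaces, as on the tutorial slide, and simply checks which DNA segments reappear in the mature mRNA. The function name `classify_segments` is a hypothetical choice. Finding the boundaries in an unsegmented sequence is a harder spliced-alignment problem that this sketch does not attempt. The two flanking segments are retained in the mRNA, so they are counted together with the exons here.

```python
def classify_segments(dna_segments, mrna_segments):
    """Split DNA segments into those retained in the mRNA (exons) and those removed (introns)."""
    # RNA uses U where DNA uses T; convert so the two can be compared directly.
    retained = [segment.replace("U", "T") for segment in mrna_segments]
    exons, introns = [], []
    next_retained = 0
    for segment in dna_segments:
        if next_retained < len(retained) and segment == retained[next_retained]:
            exons.append(segment)
            next_retained += 1
        else:
            introns.append(segment)
    if next_retained != len(retained):
        raise ValueError("mRNA segments do not match the DNA segments in order")
    return exons, introns


dna = "CTCGAGGGGCCTA ATGCATTGCCC TCCAGAGAGA GCACCCAA CACCCTCCAGGCTT GACTAA CCAGGGTGT".split()
mrna = "CUCGAGGGGCCUA AUGCAUUGCCC GCACCCAA GACUAA CCAGGGUGU".split()

exons, introns = classify_segments(dna, mrna)
print("Retained segments:", len(exons), [len(s) for s in exons])
print("Introns:", len(introns), [len(s) for s in introns])
```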
Bioinfo@AmS 16
https://asia.ensembl.org/index.html
https://asia.ensembl.org/Multi/Search/Results?q=BRCA2;site=ensembl
https://asia.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000139618;r=13:32315086-32400268
https://asia.ensembl.org/Homo_sapiens/Gene/Sequence?db=core;g=ENSG00000139618;r=13:32315086-32400268
Bioinfo@AmS 17
Central Dogma: Flow of Biological Information
[Figure: DNA double helix (base pairs, sugar-phosphate backbone) → Transcription → RNA → Translation → Protein]
Bioinfo@AmS 1
Nucleic Acids (DNA and RNA)
Feature  | Deoxyribonucleic Acid (DNA) | Ribonucleic Acid (RNA)
Strands  | Double stranded | Single stranded
Location | Located in the nucleus | Located mainly outside the nucleus (cytoplasm)
Function | Stores genetic information | Mainly transfers genetic information (although in a few cases it stores genetic information)
Sugar    | Deoxyribose sugar | Ribose sugar
Bases    | A, T, G, C | A, U, G, C
Both DNA and RNA are polymers of Nucleotides
Bioinfo@AmS 2
Nucleic Acids (DNA and RNA)
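[Figure: nucleotide structure of DNA and RNA. In both, the phosphate (P) is attached to the 5' carbon of the sugar (S) and the nitrogenous base (NB) to the 1' carbon. DNA has OH at 3' and H at 2'; RNA has OH at both 3' and 2'.]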
Bioinfo@AmS 3
Nitrogenous Bases
Property | Purine | Pyrimidine
Rings | Double ring | Single ring
Mnemonic | Shorter name, larger structure | Longer name, smaller structure
Numbering of the 6-atom ring | Anticlockwise | Clockwise
Bases | Adenine (A) and Guanine (G) | Cytosine (C), Thymine (T) and Uracil (U)
[Figure: ring-atom numbering of purine (positions 1-9) and pyrimidine (positions 1-6)]
Bioinfo@AmS 4
Nitrogenous Bases
Purine Pyrimidine
NB = nitrogenous base
Nucleoside = Sugar + NB
Nucleotide = Sugar + NB + Phosphate group (e.g., GMP)
Bioinfo@AmS 5
DNA
Bioinfo@AmS 6
Chemical Composition of DNA
Deoxyribonucleic Acid (DNA): Polymer of nucleotides
Nucleotides = Deoxy-Sugar (5C) + Nitrogen Base +Phosphate Group
Bioinfo@AmS 7
Chemical Composition of Nucleotides
Nucleotide
Nucleoside
Base
Deoxyribose pentose sugar

Bioinfo@AmS 8
Chemical Composition of Nucleotides
dNTP = Deoxynucleotide Triphosphates
Nucleoside
Base
Deoxyribose (pentose) sugar

Bioinfo@AmS 9
Base-pair (bp) complementarity
The rules of base pairing (or nucleotide pairing) are:

• A with T: the purine adenine (A) always pairs with the pyrimidine thymine (T)
• C with G: the pyrimidine cytosine (C) always pairs with the purine guanine (G)
A purine (double ring) always pairs with a pyrimidine (single ring).
Bioinfo@AmS 10
Base-pair (bp) complementarity: One-letter DNA string
One-letter double-stranded DNA string (complementary strands):
5' TCCTACTAGTGTG 3'
3' AGGATGATCACAC 5'
[Figure: the two strands shown with their base pairs and sugar-phosphate backbones, running 5'→3' and 3'→5']
Bioinfo@AmS 11
Double helical-dsDNA
• The strands are antiparallel (opposite polarity: one runs 5'→3', the other 3'→5').
• The two strands coil around each other to form a spiral structure known as the DNA double helix.
• The diameter of the helix is 20 Å and is uniform along the DNA.
• Each full turn of the helix (360°) rises 34 Å (3.4 nm).
• Each turn comprises 10 base pairs.
• The rise between two consecutive base pairs is 3.4 Å.
• The helix has a shallow groove called the minor groove (~1.2 nm across) and a deep groove called the major
groove (~2.2 nm across).
Bioinfo@AmS 12
DNA Replication
Semi conservative mode of DNA Replication

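One parent dsDNA:
5’–TCAGCTCGCTGCTAATGGCC–3’ (old strand)
3’–AGTCGAGCGACGATTACCGG–5’ (old strand)
Two daughter dsDNA molecules, each with one old strand and one new strand:
Daughter 1: 5’–TCAGCTCGCTGCTAATGGCC–3’ (old strand) paired with a new complementary strand
Daughter 2: 5’–TCAGCTCGCTGCTAATGGCC–3’ (new strand) paired with the old complementary strand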
Bioinfo@AmS 13
DNA Replication
(DNA)_n + dNTP → (DNA)_(n+1) + PPi
Growing strand: synthesized in the 5'→3' direction; DNA polymerase III adds each new nucleotide to the free 3'-OH end.
Template strand: DNA polymerase reads it in the 3'→5' direction only.
Bioinfo@AmS 14
DNA string
DNA sequence:
5’ AGGATCATGATCATGAATT 3’
Complement sequence:
3’ TCCTAGTACTAGTACTTAA 5’
Reverse sequence:
3’ TTAAGTACTAGTACTAGGA 5’
Reverse complement sequence:
5’ AATTCATGATCATGATCCT 3’
Palindromic sequence:
5’ AATTAATT 3’
3’ TTAATTAA 5’
Repetitive sequence:
5’ TTAGTTAG 3’
https://www.bioinformatics.org/sms/rev_comp.html
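The string operations on this slide are easy to express in Python. The following is a minimal sketch with illustrative function names. Sequences are plain uppercase strings, and the caller has to keep track of strand direction: `complement` returns the partner strand read 3'→5', while `reverse_complement` returns it read 5'→3'.

```python
# Translation table implementing the base-pairing rules A<->T and G<->C.
PAIRING = str.maketrans("ATGC", "TACG")


def complement(seq: str) -> str:
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.translate(PAIRING)


def reverse(seq: str) -> str:
    """The same strand written backwards."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """The complementary strand written 5'->3'."""
    return complement(seq)[::-1]


def is_palindromic(seq: str) -> bool:
    """A DNA palindrome reads the same 5'->3' on both strands."""
    return seq == reverse_complement(seq)


dna = "AGGATCATGATCATGAATT"
print(complement(dna))
print(reverse(dna))
print(reverse_complement(dna))
print(is_palindromic("AATTAATT"), is_palindromic("TTAGTTAG"))
```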
Bioinfo@AmS 15
Chargaff's rules: dsDNA
5’ AGGATCATGATCATGAATT 3’
3’ TCCTAGTACTAGTACTTAA 5’
i) #A = #T => A/T = 1
ii) #C = #G => C/G = 1
iii) #(A+C) = #(G+T)
iv) #(A+T)/#(G+C) is not fixed (range 0.4 to 1.9)
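Rules i) to iii) can be checked by counting bases over both strands of the example duplex. This is a small sketch: the helper name `dsdna_base_counts` is hypothetical, and the second strand is derived from the first by base pairing rather than read from the slide.

```python
from collections import Counter


def dsdna_base_counts(strand: str) -> Counter:
    """Count A, T, G and C over both strands of a duplex, given one strand."""
    partner = strand.translate(str.maketrans("ATGC", "TACG"))
    return Counter(strand) + Counter(partner)


counts = dsdna_base_counts("AGGATCATGATCATGAATT")
print(dict(counts))
print("A == T:", counts["A"] == counts["T"])
print("C == G:", counts["C"] == counts["G"])
print("A + C == G + T:", counts["A"] + counts["C"] == counts["G"] + counts["T"])
print("(A+T)/(G+C):", (counts["A"] + counts["T"]) / (counts["G"] + counts["C"]))
```

The first three rules hold for any duplex because they follow from base pairing itself; the (A+T)/(G+C) ratio is the only value that depends on the sequence.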
Bioinfo@AmS 16
Next generation storage unit: DNA
Adenine (A) = 00    Guanine (G) = 10
Cytosine (C) = 01   Thymine (T) = 11
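With this two-bits-per-base mapping, any binary data can be written as a DNA string and read back. The sketch below only illustrates the mapping on the slide; the function names `encode` and `decode` are illustrative, and real DNA storage schemes add error correction and avoid problematic sequences, which is not attempted here.

```python
# Two bits per base, as on the slide: A=00, C=01, G=10, T=11.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a DNA string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Turn a DNA string produced by encode() back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"Hi"
stored = encode(message)
print(stored)          # 8 bases for 2 bytes
print(decode(stored))  # round-trips to the original bytes
```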
Bioinfo@AmS 18
Next generation storage unit: DNA
Bioinfo@AmS 19
Melting temperature of DNA (Tm)
dsDNA (double helical) ⇌ ssDNA (separated strands)
Denaturation: heating. Renaturation: cooling.
Example duplexes:
5’ AATTAATTAATT 3’
3’ TTAATTAATTAA 5’

5’ GGCCGGCCGGCC 3’
3’ CCGGCCGGCCGG 5’
If a DNA solution is heated (> 90°C), there is enough kinetic energy to denature the DNA completely, causing it
to separate into single strands. Tm is defined as the temperature at which 50% of the double-stranded DNA has changed to
single-stranded DNA. The greater the guanine-cytosine (GC) content of the DNA, the higher its melting temperature,
because each G-C pair has three hydrogen bonds and each A-T pair only two.
Bioinfo@AmS 20
Melting temperature of DNA (Tm)
Example:
ATGCTGAATGC
TACGACTTACG
DNA length = 11 bp
Total bases = 11 × 2 = 22
A = 6
T = 6
G = 5
C = 5
Total number of H-bonds = 2 × (number of A-T pairs) + 3 × (number of G-C pairs) = (2 × 6) + (3 × 5) = 12 + 15 = 27
At Tm, half of the H-bonds are broken = ½ × total number of H-bonds
= 13.5
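The hydrogen-bond count used in this example can be written as a small function. This sketch follows the slide's simplified model (2 bonds per A-T pair, 3 per G-C pair, half of them broken at Tm); it does not compute a melting temperature in degrees, and the function name is an illustrative choice.

```python
def hydrogen_bonds(strand: str) -> int:
    """Total H-bonds in a duplex, given one of its strands."""
    # Every A or T on this strand sits in an A-T pair (2 bonds),
    # every G or C sits in a G-C pair (3 bonds).
    at_pairs = strand.count("A") + strand.count("T")
    gc_pairs = strand.count("G") + strand.count("C")
    return 2 * at_pairs + 3 * gc_pairs


total = hydrogen_bonds("ATGCTGAATGC")
print(total, total / 2)  # total bonds, and bonds broken at Tm in this model

# The two 12-bp duplexes from the melting slide: more G-C pairs means more bonds to break.
print(hydrogen_bonds("AATTAATTAATT"), hydrogen_bonds("GGCCGGCCGGCC"))
```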
Bioinfo@AmS 21
A 3D cubical DNA structure is given below, with an edge length of 85 nm. Assume that the double-helical DNA has a pitch
of 3.4 nm comprising 10 bp. How many bases are required to form the 3D structure? If thymine (T)
makes up 15 percent of the bases in this DNA sample, what are the numbers of the other bases and of sugar
molecules? Also, calculate the amount of energy required to reach the melting temperature (Tm). Assume the energy
of H-bonds is 5 kcal/mole. One mole of H-bonds means Avogadro's number of H-bonds (i.e., 6.022 × 10^23).
[Figure: DNA cube with edge length 85 nm]
Bioinfo@AmS 22
Total edge length of a cube = 12 × edge length = 12 × 85 nm = 1020 nm
A DNA length of 3.4 nm corresponds to 10 bp
So, total number of base pairs present in this double-stranded structure = 12 × (85/3.4) × 10 = 3000 bp
Total number of bases = 3000 × 2 = 6000
Total number of sugars (each nucleotide contains one sugar) in 3000 bp of double-stranded DNA = 3000 × 2 = 6000
T = 15%, A = 15%, G = 35%, C = 35%
T = (15 × 6000)/100 = 900
A = 900
G = (35 × 6000)/100 = 2100
C = 2100
Total bases (A+T+G+C) = 6000
Total number of H-bonds = 3 × (G-C pairs) + 2 × (A-T pairs) = (3 × 2100) + (2 × 900) = 6300 + 1800 = 8100
At Tm, half of the H-bonds are broken = 8100/2 = 4050
Energy for Tm = [(5 kcal/mole) × 4050] / (6.022 × 10^23 per mole) ≈ 3.36 × 10^-20 kcal
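The same calculation can be scripted so that the edge length or base composition is easy to change. This is a sketch of the arithmetic above, assuming (as the slide does) that DNA runs along the 12 edges of the cube and that half of the H-bonds are broken at Tm. The variable names are illustrative, and Avogadro's number is taken as 6.022 × 10^23.

```python
AVOGADRO = 6.022e23          # H-bonds per mole
EDGE_NM = 85                 # edge length of the cube
PITCH_NM = 3.4               # helix pitch
BP_PER_TURN = 10             # base pairs per pitch
THYMINE_FRACTION = 0.15      # share of T among all bases
HBOND_KCAL_PER_MOLE = 5      # energy per mole of H-bonds

# A cube has 12 edges; each 3.4 nm of DNA holds 10 bp.
base_pairs = round(12 * EDGE_NM / PITCH_NM * BP_PER_TURN)
bases = 2 * base_pairs
sugars = bases               # one sugar per nucleotide

# Chargaff: A = T, G = C.
t_count = round(THYMINE_FRACTION * bases)
a_count = t_count
g_count = c_count = (bases - a_count - t_count) // 2

# Each A-T pair has 2 H-bonds, each G-C pair has 3.
h_bonds = 2 * t_count + 3 * g_count
broken_at_tm = h_bonds // 2
energy_kcal = HBOND_KCAL_PER_MOLE * broken_at_tm / AVOGADRO

print(base_pairs, bases, sugars)
print(a_count, t_count, g_count, c_count)
print(h_bonds, broken_at_tm, energy_kcal)
```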
Bioinfo@AmS 23
Assignment
✓ Why is DNA double stranded and RNA single stranded?
✓ Why do cells use DNA as the genetic material?
✓ How does DNA store genetic information?
Bioinfo@AmS 24
RNA
Ribonucleic Acid (RNA): Polymer of nucleotides
Nucleotides = Sugar (5C) + Nitrogen Base +Phosphate
Bioinfo@AmS 25
RNA
Bioinfo@AmS 26
Transcription: DNA to RNA String
3' ---------------- 5'   DNA template strand: antisense (-) strand
5' ---------------- 3'   DNA coding strand: sense (+) strand, non-template strand
        ↓ Transcription (RNA polymerase)
5' ---------------- 3'   mRNA
Bioinfo@AmS 27
Splicing: Generation of mRNA
ds DNA
hnRNA
mRNA
Bioinfo@AmS 28
Splicing: Generation of mRNA
• The whole sequence of a gene (introns and exons) is transcribed into RNA by the enzyme RNA polymerase.
• The total transcribed RNA is known as heterogeneous nuclear RNA (hnRNA), which undergoes a process called
RNA splicing.
• The spliceosome, made of proteins and small nuclear RNAs (snRNAs), catalyzes splicing: it removes the intron parts from the
hnRNA and joins the exons to form mature mRNA.
• Thus, in the next step, only the exon parts of a gene enter the translation process (protein synthesis). The
introns play no role in protein synthesis.
Essentially, the code for the protein is contained in the exon parts of the gene.
Bioinfo@AmS 29
DNA to RNA string
Example 1 (dsDNA and its mRNA):
Template strand:                       3’ TAC AGC AGA CGA CGC 5’
Coding strand (complementary strand):  5’ ATG TCG TCT GCT GCG 3’
mRNA:                                  5’ AUG UCG UCU GCU GCG 3’

Example 2:
Template strand:                       3’ ATA CTG TCG TGA CGT CGT 5’
mRNA:                                  5’ UAU GAC AGC ACU GCA GCA 3’
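Both routes to the mRNA sequence shown here can be sketched in a few lines of Python. The function names are illustrative. `transcribe_template` expects the template strand written 3'→5' (as on the slide) and returns the mRNA written 5'→3'; `transcribe_coding` takes the coding strand written 5'→3'.

```python
# Base pairing between a DNA template and the RNA built on it: A-U, T-A, G-C, C-G.
TEMPLATE_TO_RNA = str.maketrans("ATGC", "UACG")


def transcribe_template(template_3_to_5: str) -> str:
    """mRNA (5'->3') from the template strand written 3'->5'."""
    return template_3_to_5.replace(" ", "").translate(TEMPLATE_TO_RNA)


def transcribe_coding(coding_5_to_3: str) -> str:
    """mRNA (5'->3') from the coding strand: same sequence with U in place of T."""
    return coding_5_to_3.replace(" ", "").replace("T", "U")


print(transcribe_template("TAC AGC AGA CGA CGC"))
print(transcribe_coding("ATG TCG TCT GCT GCG"))
print(transcribe_template("ATA CTG TCG TGA CGT CGT"))
```

Both functions give the same mRNA for the first example, which is the point of the slide: the mRNA is complementary to the template strand and identical to the coding strand apart from U replacing T.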
Bioinfo@AmS 30
DNA to RNA String
• Are both strands of DNA transcribed into mRNA, or only one?
• If only one, which one?
• Why can’t both strands of DNA be transcribed at the same time?
• Why is RNA less stable than DNA?
Bioinfo@AmS 31
DNA to RNA String
• Are both strands of DNA transcribed into mRNA, or only one?
• If only one, which one?
Only one strand is transcribed for a given gene: the 3’ to 5’ strand is used as the template for transcription. RNA synthesis
follows the normal base-pairing rules, except that the base uracil (U) is used in place of thymine (T). Because the
synthesized RNA is complementary to the template strand, its base sequence exactly matches the other strand of the DNA
(the coding strand, 5’ to 3’), with U in place of T.
Bioinfo@AmS 32
DNA to RNA String
• Why can’t both strands of DNA be transcribed at the same time?
✓ The two strands of DNA carry different information, so a dsDNA region could encode two different proteins, one per
strand. For a particular protein, the sequence of one strand of DNA is sufficient. If both strands were transcribed,
two different mRNAs would be synthesized, leading to the production of two different proteins.
✓ Moreover, if both strands were transcribed, the resulting mRNAs would be complementary and would pair with
each other to form double-stranded RNA, which would interfere with translation.
Bioinfo@AmS 33
DNA to RNA String
Two factors make RNAs less stable than DNAs

• The presence of the hydroxyl group on the 2’ Carbon.
• Presence of Uracil in place of Thymine.
Bioinfo@AmS 34
mRNA
Messenger RNA (mRNA)

• The mRNA carries the genetic code from DNA to the ribosomes for protein synthesis.
• The mRNA sequence also has 5’ to 3’ polarity.
• It carries exactly the base sequence of the coding strand (5’ to 3’) of the dsDNA, with U instead of T.
• Thus, it carries copies of the instructions for assembling amino acids into proteins from the DNA to the rest of the
cell (it serves as a “messenger”).
Bioinfo@AmS 35
rRNA
Ribosomal RNA (rRNA)

• Makes up the major part of ribosomes, which is where proteins are made.
• The ribosome, the site of protein synthesis, consists of ribosomal RNA (65%) and proteins (35%).
• It has two subunits, a large one and a small one; both are required for protein synthesis.
Bioinfo@AmS 36
tRNA
Transfer RNA (tRNA)

• It can read the code carried by mRNA.
• Transfers amino acids to ribosomes during protein synthesis.
• Each amino acid is recognized by one or more specific tRNAs.
• tRNA has a tertiary structure that is L-shaped.
• One end attaches to the amino acid and the other binds to the mRNA through a 3-base complementary sequence (the anticodon).
Bioinfo@AmS 37
Assignment
1. A double-stranded DNA has 20% guanine. What are the percentages of the other bases in the DNA?
2. Let the four nitrogenous bases adenine (A), cytosine (C), guanine (G) and thymine (T) be represented by the two-bit binary
numbers 00, 01, 10 and 11, respectively. S1 = 5′ AGTCATGGCCAA 3′ and S2 = 5′ AGTCCTGCCCAC 3′. Find the output DNA
sequences obtained by applying the AND, XOR and OR operations.
3. Find the mRNA sequence of the reverse complement of the given dsDNA:
3’ TAC AGC AGA CGA CGC 5’

5’ ATG TCG TCT GCT GCG 3’
4. Write a program that takes a DNA strand sequence as input and finds its complementary sequence, its mRNA sequence, and the
number of ATG repeats.
Bioinfo@AmS 38
Nucleic acid information Resources
https://www.ncbi.nlm.nih.gov/
Bioinfo@AmS 39
Nucleic acid information Resources
Bioinfo@AmS 40
Protein
Proteins: Proteins are linear polymers built of monomer units called amino acids. They are the most versatile
macromolecules in living systems and serve crucial functions in essentially all biological processes.
“Workhorse in Cells”
Functions:
• They provide mechanical support.
• They transport and store other molecules such as oxygen.
• They help in immune protection.
• They control cell growth and differentiation.
• They function as biocatalysts such as enzymes.
Bioinfo@AmS 1
Protein
“Workhorse in Cells”
[Figure: examples of proteins: enzyme, antibody, hormone (insulin), actin filament, actin-myosin]
Bioinfo@AmS 2
Chemical Nature of Protein
- Cα is at the heart of the amino acid.
- Cα, C, N and O are called backbone atoms.
- R can be any of the 20 side chains.
[Figure: amino acid structure with the amino group, the carboxyl group and the variable side chain (R) attached to Cα]
Characteristics of amino acids:
• Amino acids are the basic units (i.e., monomers) of proteins.
• All amino acids have at least one acidic carboxylic acid (-COOH) group and one basic amino (-NH2) group.
• Only 20 amino acids are standard constituents of proteins, because they are the ones coded by genes.
Bioinfo@AmS 3
Three and One Letter Code of Amino acids
Letters not assigned to any of the 20 standard amino acids: B, J, O, U, X, Z
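The three-letter and one-letter codes of the 20 standard amino acids can be kept in a simple lookup table. The sketch below uses illustrative names (`THREE_TO_ONE`, `to_one_letter`) and also derives which letters of the alphabet are left unused.

```python
from string import ascii_uppercase

# Three-letter to one-letter codes of the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(peptide: str) -> str:
    """Convert a hyphen-separated three-letter peptide, e.g. 'Pro-Glu', to one-letter code."""
    return "".join(THREE_TO_ONE[residue] for residue in peptide.split("-"))


unused_letters = sorted(set(ascii_uppercase) - set(THREE_TO_ONE.values()))

print(to_one_letter("Pro-Glu-Glu-Cys-Gly"))
print(unused_letters)
```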
Bioinfo@AmS 4
Characteristic of Amino acids
Polar
Non-Polar
Positively Charged, Negatively Charged
Bioinfo@AmS 5
Non-polar: These amino acids are hydrophobic (water-hating) and come closer together in the biological
aqueous medium, e.g., aliphatic and aromatic side chains.
Polar: These amino acids are hydrophilic (water-loving) and interact with water in the biological
aqueous medium, e.g., alcohols, acids, amines, etc.
Order of polarity: Acid > Alcohol > Amine > Ether > Alkane
Bioinfo@AmS 6
✓ Which amino acid is more non-polar: isoleucine or alanine?
✓ Arrange the following amino acids (D, G, F, S) in order of polarity.
Answer: D > S > G > F
Bioinfo@AmS 7
Amide Bonds
Amino acids are linked together through the formation of amide bonds (peptide bonds) between the amino group of one
residue and the carboxylate of a second residue.
Peptide (< 50 amino acids)
Protein (> 50 amino acids)
[Figure: polypeptide chain drawn from the N-terminal end to the C-terminal end, with side chain R1 labelled]
Bioinfo@AmS 8
Structure Function Relationship of Protein
• The function of a protein depends on its structure, which in turn depends on the protein sequence.
• The function of a protein therefore depends on the order of its amino acids.
Protein sequence (amino acid order) → 3D structure → Function
Bioinfo@AmS 9
Protein Folding
Primary structure: amino acid sequence
Secondary structure: local folding into α-helix, β-sheet
Tertiary structure: 3D structure formed by assembly of secondary structures
Quaternary structure: structure formed by more than one polypeptide chain
Bioinfo@AmS 10
Protein Folding
• The primary structure of protein: a sequence of amino acids linked together by peptide bonds (covalent bond)
• The secondary structure of protein: Polypeptide folding into α helix, β sheet, or random coil (H bonds involved)
• The tertiary structure of protein: 3-D folding of a single polypeptide chain
• The quaternary structure of proteins: Association of two or more folded polypeptides (sub units) to form a
multi-subunits protein (bonds and interactions similar to tertiary structure)
Bioinfo@AmS 11
Secondary structure
Bioinfo@AmS 12
Secondary Structure: α-Helix
• Secondary structure refers to local folded structures that form within a polypeptide (intra-chain) due to interactions
between atoms of the backbone.
• In an α-helix, the carbonyl (C=O) of one amino acid is hydrogen bonded to the amino hydrogen (N-H) of the amino acid
four residues further down the chain (n+4).
• Each turn of the helix contains 3.6 amino acids.
• The pitch of the helix is 0.54 nm.
• R groups are not involved.
• Example of proteins: α-keratin, abundant in skin, hair, nails and horns
H-bond: Gln (residue n) with Arg (residue n+4)

Bioinfo@AmS 13
Secondary structure: β-pleated sheet
• Polypeptide chains are held together by H bonds between N-H group of one polypeptide chain and C=O group of
the other chain.
• Two or more segments of a polypeptide chain line up next to each other, forming a sheet-like structure held
together by hydrogen bonds.
• The strands of a β pleated sheet may be parallel, pointing in the same direction (meaning that their N- and C-
termini match up), or antiparallel, pointing in opposite directions (meaning that the N-terminus of one strand is
positioned next to the C-terminus of the other).
• Example of protein: Fibroin - abundant in silk
Bioinfo@AmS 14
Secondary structure: β-pleated sheet
Bioinfo@AmS 15
Tertiary Structure
• The tertiary structure is the 3D conformation or shape of the polypeptide. Proteins fold spontaneously or with the help
of molecular chaperone proteins present inside the cell. The fold is stabilized by various kinds of interactions
(i.e., covalent and non-covalent bonds) involving the side chains (R groups).
• Because these interactions depend on the properties of the R groups of the amino acid residues, the folding pattern
and the 3D conformation vary with the sequence.
Bioinfo@AmS 16
Tertiary Structure
• Ionic bonds (between charged amino acid side chains): For example, Lys has a positively charged side chain due to NH3+
and Asp has a negatively charged side chain due to COO-; these may form an ionic interaction known as a salt bridge.
• Hydrogen bonds between R groups: For example, uncharged polar amino acids can form H-bonds; Ser has a side chain
carrying an OH group that serves as a donor/acceptor for H-bonds.
• Covalent bonds: Protein chains form intra-chain or inter-chain disulfide bonds between cysteines to form the tertiary conformation.
• Hydrophobic interactions: Amino acids with non-polar side chains associate in the interior of the peptide molecule and
exclude water via hydrophobic interactions.
• van der Waals interactions: "van der Waals forces" is a general term for the attractive intermolecular forces
between molecules. There are two kinds of van der Waals forces: weak London dispersion forces and stronger dipole-dipole
forces.
Bioinfo@AmS 17
Quaternary Structure
Many proteins are made up of more than one polypeptide chain; the chains are called subunits. Two or more
folded polypeptides (subunits) associate to form a multimeric protein (bonds and interactions
similar to tertiary structure).
The quaternary structure refers to how these protein subunits interact with each other and arrange
themselves to form a larger aggregate protein complex. The final shape of the protein complex is
once again stabilized by various interactions, including hydrogen bonding, disulfide bridges and
salt bridges.
Hemoglobin, a protein in red blood cells, has four subunits (two copies each of α- and β-globin),
each containing a heme molecule.
Bioinfo@AmS 18
Protein Folding
Primary
sequence
Hierarchy of the Protein structure

Bioinfo@AmS 19
Protein Folding
Protein domain: A segment (100 – 250 aa) of a polypeptide chain that folds independently into a stable structure
and performs a particular biological function. Domains are independently folded parts of a protein that can be
differentiated both structurally and functionally.
Protein motif: A short stretch of the polypeptide chain made up of secondary structure elements. It cannot hold its
structure independently outside the protein and is not biologically functional on its own. Put very simply, a domain
can be made up of one or more well-characterized motifs which usually occur together and suggest a putative function.
Bioinfo@AmS 20
Protein Folding: 3D structure
[Figure: unfolded and folded protein surrounded by water]
•Hydrophobic amino acids (green)
•Hydrophilic amino acids (pink)
•Water molecules (surrounding)
•The details of protein folding depend on the pattern of the amino acid sequence
•Protein folding is reversible and reproducible
•Protein folding is thermodynamically favorable
•Among the numerous possible conformations, only one protein structure is biologically relevant and forms inside the cell
Bioinfo@AmS 21
Protein Folding
[Figure: folding energy landscape showing semi-stable states (local minima) and the native conformation (most stable)]
ΔG = ΔH – TΔS
G = Gibbs free energy
Enthalpy (H) = Total energy, including chemical potential energy
Entropy (S) = Number of ways to arrange something (randomness)
Temperature (T) = Contributes to kinetic energy
Bioinfo@AmS 22
Entropy of Protein Folding
S = kB ln W
kB = Boltzmann's constant = 1.38 × 10^-23 J/K
W = Number of ways the system/molecules can be arranged
Unfolded: S is high
Folded: S is low
Folding decreases the conformational entropy.
Bioinfo@AmS 23
Unfolded → Folded: conformational entropy decreases
Unfolded: conformational entropy is HIGH
Folded: conformational entropy is LOW
Bioinfo@AmS 24
Unfolded: entropy of the surrounding water is LOW
• Water molecules form a cage-like structure around the exposed hydrophobic amino acids of the unfolded protein
and are therefore not free to move.
Folded: entropy of the surrounding water is HIGH
• Water molecules are free to move around the folded protein because only hydrophilic amino acids are exposed.
• All hydrophobic amino acids are buried inside the core of the protein structure and therefore do not interact
with water molecules.
Bioinfo@AmS 25
• Entropy of the water surrounding a protein increases on folding

• Conformational entropy of the protein decreases on folding
• Generally, the entropy gained by the water is larger than the conformational entropy lost by the protein.
• Overall, the protein folding process results in an increase in the entropy of the system.
Bioinfo@AmS 26
Enthalpy of Protein Folding
Energy absorbed (bond breaking)
Energy released (bond formation)
Hunfolded → Hfolded
∆H = Hfolded – Hunfolded
∆H = (Energy required to break bonds – Energy released by bond formation)
Bioinfo@AmS 27
Protein Folding
Covalent Bonds: Disulfide bond (S-S)
Ionic interactions/salt bridge: COO- … NH3+
Dipole-Dipole interactions(H-bonds) : HOH …. NH2
Dipole-Induced dipole : ROH …. CH3
Induced dipole-Induced dipole : CH3…. CH3
Bioinfo@AmS 28
Protein Folding
Parameter                                 Negative (< 0)                            Positive (> 0)
∆G = ∆H – T∆S                             Spontaneous                               Non-spontaneous
∆H = (H bonds broken – H bonds formed)    Exothermic: more stable bonds formed      Endothermic: fewer stable bonds formed
∆S = (SFolded – SUnfolded)                Less freedom of motion                    High degree of freedom of motion

*The decrease in enthalpy outweighs the decrease in conformational entropy (which by itself makes a positive contribution to ∆G), so that
overall there is a negative ∆G that favors protein folding.
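The relation ∆G = ∆H – T∆S can be checked numerically. The sketch below is a minimal example with made-up values for ∆H and ∆S (they are not measurements for any real protein); the function names are assumptions chosen for illustration.

```python
def gibbs_free_energy(delta_h, temperature, delta_s):
    """Return dG = dH - T*dS (units must be consistent, e.g. kJ/mol and kJ/(mol*K))."""
    return delta_h - temperature * delta_s

def is_spontaneous(delta_g):
    """A process is spontaneous when dG is negative."""
    return delta_g < 0

# Hypothetical folding values, for illustration only
delta_h = -200.0   # kJ/mol, more stable bonds formed (exothermic)
delta_s = -0.5     # kJ/(mol*K), less freedom of motion
temperature = 310  # K

delta_g = gibbs_free_energy(delta_h, temperature, delta_s)
print(delta_g, is_spontaneous(delta_g))
```

Here the favorable enthalpy term (-200) is larger in magnitude than the unfavorable –T∆S term (+155), so ∆G comes out negative.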
Bioinfo@AmS 29
Genetic Codes
Transcription
Translation
Bioinfo@AmS 1
Genetic Codes
3’ GGTCTCCTCACGCCA 5’ DNA (template strand)
↓
5’ CCAGAGGAGUGCGGU 3’
mRNA
Codons
↓
Pro-Glu-Glu-Cys-Gly
Protein
Chain of Amino acids
The genetic code is a set of three-letter combinations of nucleotides called codons, each of which corresponds
to a specific amino acid or stop signal. The concept of codons was first described by Francis Crick and his
colleagues in 1961.
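The DNA → mRNA → protein example on this slide can be reproduced with a short script. This is a minimal sketch that assumes the DNA string is the template strand (read 3’ to 5’), so that base-by-base complementing gives the mRNA in the 5’ to 3’ direction. The function names are illustrative, and amino acids are reported in one-letter code (P = Pro, E = Glu, C = Cys, G = Gly).

```python
from itertools import product

# Standard genetic code, codons ordered over the bases U, C, A, G ('*' marks a stop codon)
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template):
    """Complement a DNA template strand base by base to obtain the mRNA."""
    return "".join(DNA_TO_RNA[base] for base in template)

def translate(mrna):
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("GGTCTCCTCACGCCA")
print(mrna, translate(mrna))
```

Note that this sketch reads from the first base, as the slide does; it does not search for a start codon.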
Bioinfo@AmS 2
Genetic Codes
• A codon is a set of three consecutive bases found on the coding strand of double-stranded DNA that
subsequently appears in the transcribed (single-stranded) mRNA.
• DNA/RNA consists of four different bases, and there are three bases in a codon; hence, there are
at most 64 (4 * 4 * 4 = 64) possible patterns that can act as a codon.
✓ Total 64 codons.
✓ AUG is the START codon (it also codes for methionine).
✓ UAA, UGA, UAG are STOP codons.
✓ The 61 codons other than the STOP codons code for the 20 amino acids
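The codon counts above can be verified by enumerating every triplet. The following is a small sketch; the variable names are chosen for illustration.

```python
from itertools import product

BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Every ordered triplet of the four bases: 4 * 4 * 4 = 64
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Codons that code for an amino acid (AUG is both START and methionine)
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons), len(STOP_CODONS), len(sense_codons))
```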
Bioinfo@AmS 3
Codon Degeneracy
Bioinfo@AmS 4
Codon Degeneracy
• The genetic code is degenerate because there are many instances in which different codons specify the same
amino acid. This property, in which an amino acid may be encoded by more than one codon, is known
as codon degeneracy or codon redundancy.
• There are 20 amino acids in total, which make up a wide variety of proteins in a living organism. Since there are 64
codons, it is obvious that more than one codon may specify the same amino acid. Codon degeneracy means
several code words have the same meaning.
• Degeneracy makes the DNA more tolerant to point mutations. A point mutation in a codon does not necessarily
change the amino acid sequence of the peptide; the mutated codon may still code for the same amino acid.
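Degeneracy can be made concrete by grouping the codons of the standard genetic code by the amino acid they specify. The sketch below builds the table from a compact string in one-letter amino acid code; the variable names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons ordered over the bases U, C, A, G ('*' marks a stop codon)
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group synonymous codons under the amino acid they encode
synonymous = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    if amino_acid != "*":
        synonymous[amino_acid].append(codon)

degeneracy = {amino_acid: len(codons) for amino_acid, codons in synonymous.items()}
print(degeneracy["L"], synonymous["L"])  # leucine
print(degeneracy["M"], synonymous["M"])  # methionine
```

Leucine (L) has six synonymous codons, while methionine (M) and tryptophan (W) have only one each, so a point mutation is far more likely to be silent in a leucine codon.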
Bioinfo@AmS 5
Codon Bias
In most species, particular synonymous codons are preferred for a given amino acid; this preference is known as codon bias.
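Codon bias is usually examined by counting how often each codon occurs in coding sequences. A minimal sketch follows; the short mRNA string is a made-up example, not a real gene, and the function name is an assumption.

```python
from collections import Counter

def codon_usage(coding_seq):
    """Count the codons of a coding sequence, reading from its first base."""
    codons = [coding_seq[i:i + 3] for i in range(0, len(coding_seq) - 2, 3)]
    return Counter(codons)

# Hypothetical coding sequence: AUG CUG CUG CUA CUG UAA
usage = codon_usage("AUGCUGCUGCUACUGUAA")
print(usage["CUG"], usage["CUA"])
```

In this toy sequence the leucine codon CUG is used three times and its synonym CUA once, which is the kind of uneven usage that codon bias describes.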
Bioinfo@AmS 6
Open Reading Frame(ORF)
ORF
5’ AUGAUACUCACAAUCUGA3’
ORF
5’AUACUC AUGAUACCCACAAUUCAACACCUCUAG3’
ORF
5’AUACGCCA AUGAUACUCACAAUCUAAACUCACACUCUC3’
• A reading frame (RF) is a non-overlapping set of three-nucleotide codons (triplets) in DNA or RNA; it may be present
anywhere within a DNA sequence
• Open Reading Frame (ORF) means the sequence is ‘open’ for the ribosome to keep reading for protein synthesis.
• An ORF is a segment of DNA that begins with a START codon and ends with a STOP codon, and can code for a functional
protein.
• A gene also codes for a functional protein; however, it contains a regulatory region along with the ORF
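The ORF definition above (START codon, then triplets, then STOP codon) translates directly into a scan over the three forward reading frames. This is a simplified sketch on an mRNA string; it ignores the complementary strand and nested start codons, and the function name is an assumption.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_orfs(mrna):
    """Return (start, end, sequence) for each AUG...STOP stretch in the three forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "AUG":
                start = i  # open a candidate ORF at the first AUG
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, mrna[start:i + 3]))
                start = None  # close it at the first in-frame STOP
    return orfs

# Second example sequence from the slide
print(find_orfs("AUACUCAUGAUACCCACAAUUCAACACCUCUAG"))
```

For the slide's second sequence the scan finds a single ORF starting at the AUG in position 7 (index 6) and ending at UAG.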
Bioinfo@AmS 7
Reading Frame(RF)
• Three RFs are possible, depending on the starting point of reading of the mRNA string.
5’CUCAGCGUUACCAU3’
• The three RFs code for three different protein strings
• One of the RFs is called the ORF: it initiates with a start codon, ends with a stop codon, and is the only one that
contains the desired functional protein information (gene product)
RF 1
RF 2
RF 3
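The three reading frames of the example string can be listed by starting the triplet split at the first, second and third base. A minimal sketch, with an illustrative function name:

```python
def reading_frames(mrna):
    """Split an mRNA string into codons for each of the three forward reading frames."""
    frames = []
    for offset in range(3):
        codons = [mrna[i:i + 3] for i in range(offset, len(mrna) - 2, 3)]
        frames.append(codons)
    return frames

for number, codons in enumerate(reading_frames("CUCAGCGUUACCAU"), start=1):
    print(f"RF {number}:", " ".join(codons))
```

Leftover bases at the end of a frame that do not fill a complete triplet are dropped.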
Bioinfo@AmS 8
Reading Frame(ORF)
A segment of double-stranded DNA has six possible reading frames (RF), three in each direction. However, a reading frame
does not lead to a protein unless it starts with a start codon and ends with a stop codon.
5’ ATGCTCTCATCTCG 3’
3’ TACGAGAGTAGAGC 5’
• ds DNA may produce two different mRNA transcripts
• There are six RFs, which could code for six different proteins;
however, only two RFs, called ORFs, contain the desired protein
information.
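The six reading frames come from reading each strand 5’ to 3’ with three different starting points. The sketch below uses the double-stranded example from this slide; the second strand is obtained as the reverse complement of the first. Function names are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def split_codons(seq, offset):
    """Split a sequence into complete triplets starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(dna):
    """Three frames on the given strand followed by three on the reverse complement."""
    strands = (dna, reverse_complement(dna))
    return [split_codons(strand, offset) for strand in strands for offset in range(3)]

for number, codons in enumerate(six_frames("ATGCTCTCATCTCG"), start=1):
    print(f"RF {number}:", " ".join(codons))
```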
Bioinfo@AmS 9
Assignment
5’TTAGATGTGTGTAAATGTGTGTGATGATCGTGATATCATAGTAGTCAATGATCGTAATATTATCTATTTATAACCG3’
https://www.ncbi.nlm.nih.gov/orffinder/
Bioinfo@AmS 10
Mutation
• Mutation: Any sudden change in the genetic makeup is known as a mutation.
• The external/internal agents that cause mutations are known as mutagens.
• A mutation may result in a change in the amino acids of a protein, which can be responsible for a change in phenotype.
Bioinfo@AmS 11
Mutation
Gene Mutation / Point Mutation
• Substitution Mutation
• Inversion Mutation
• Frameshift Mutation
Bioinfo@AmS 12
Mutation
• Gene Mutation = Any change in the nucleotide/base sequence of DNA, which may occur due to
errors in DNA replication or due to the effects of chemicals or radiation on the DNA molecule.
• A mutation may result in a change in the amino acids of a protein.
No Mutation vs. Mutation
Bioinfo@AmS 13
Mutation
Bioinfo@AmS 14
Mutation
• If no changes to genomes occur over time, there would be no evolution
• Too much change in the DNA is harmful
• Too little does nothing
• A balance exists between the amount of new variation and the overall health (adaptiveness) of the
new variant individual
• Closely related organisms have closely matched DNA sequences; the differences between them reflect a divergence at
some past time that was adaptive for a given environment.
Bioinfo@AmS 15
Mutation
1. Substitution: Replacement of one base by another; this may or may not affect the resulting protein. For example, an
A:T base pair could be mutated into a G:C base pair or even a T:A base pair. There are three subtypes:
(i) Silent Mutation
(ii) Missense Mutation
(iii) Nonsense Mutation
2. Inversion: Sometimes a run of consecutive bases is rotated by 180 degrees (its orientation is reversed). Such an event creates
a mutation called an inversion. This may show particular abnormalities at the phenotypic level.
Bioinfo@AmS 16
Mutation
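3. Frameshift Mutation (an insertion or deletion of a number of bases that is not a multiple of three shifts the reading frame)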
- Insertion: An insertion changes the number of DNA bases in a gene by adding a piece of DNA. As a result,
the protein made by the gene may not function properly.
- Deletion: A deletion changes the number of DNA bases by removing a piece of DNA. Small deletions may
remove one or a few base pairs within a gene, while larger deletions can remove an entire gene or several
neighboring genes. The deleted DNA may alter the function of the resulting protein(s).
Bioinfo@AmS 17
Silent Mutation
Normal Mutation
DNA
mRNA
Protein
Due to redundancy of Genetic Code (i.e., different codons can code same amino acid), no change in amino acid
sequence is produced!!
Bioinfo@AmS 18
Missense Mutation
Normal Mutation
DNA
mRNA
Protein
A missense mutation produces a change in the amino acid sequence of the protein product (here, histidine in place of arginine); it may
or may not change the function of the protein!
Bioinfo@AmS 19
Nonsense Mutation
Normal Mutation
DNA
mRNA
Protein
Bad news! – a nonsense mutation produces a STOP codon within the mRNA transcript,
leading to a truncated protein (incomplete protein sequence). How short the protein
product is depends on where the STOP codon was produced within the mRNA transcript.
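The three substitution subtypes (silent, missense, nonsense) can be told apart by comparing what the codon encodes before and after the change. The sketch below is a simplified classifier for a single mRNA codon using the standard genetic code; the function name and the example codons are chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered over the bases U, C, A, G ('*' marks a stop codon)
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitution(codon_before, codon_after):
    """Classify a base substitution within one codon by its effect on the protein."""
    before = CODON_TABLE[codon_before]
    after = CODON_TABLE[codon_after]
    if before == after:
        return "silent"      # same amino acid, thanks to degeneracy
    if after == "*":
        return "nonsense"    # a premature STOP codon is created
    return "missense"        # a different amino acid is inserted

print(classify_substitution("CGU", "CGC"))  # Arg -> Arg
print(classify_substitution("CGU", "CAU"))  # Arg -> His
print(classify_substitution("UGC", "UGA"))  # Cys -> STOP
```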
Bioinfo@AmS 20
Mutations: Insertion
A frame shift mutation
Normal gene Addition mutation

3’ GGTCTCCTCACGCCA 5’ 3’ GGTGCTCCTCACGCCA 5’ (DNA template strands)
↓ ↓
5’ CCAGAGGAGUGCGGU 3’ 5’ CCACGAGGAGUGCGGU 3’
Codons
↓ ↓
Pro-Glu-Glu-Cys-Gly Pro-Arg-Gly-Val-Arg
Amino acids
Bioinfo@AmS 21
Mutations: Deletion
A frame shift mutation
Normal gene Deletion mutation

3’ GGTCTCCTCACGCCA 5’ 3’ GGTC_CCTCACGCCA 5’ (DNA template strands)
↓ ↓
5’ CCAGAGGAGUGCGGU 3’ 5’ CCAGGGAGUGCGGU 3’
Codons
↓ ↓
Pro-Glu-Glu-Cys-Gly Pro-Gly-Ser-Ala-
Amino acids
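The effect of the insertion and deletion examples on the reading frame can be reproduced by translating the three mRNA strings shown on these slides. This is a minimal sketch using the standard genetic code in one-letter amino acid code (P = Pro, E = Glu, C = Cys, G = Gly, R = Arg, V = Val, S = Ser, A = Ala); the function name is an assumption.

```python
from itertools import product

# Standard genetic code, codons ordered over the bases U, C, A, G ('*' marks a stop codon)
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna):
    """Translate complete codons from the first base, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

normal = "CCAGAGGAGUGCGGU"
insertion = "CCACGAGGAGUGCGGU"  # one extra base after the first codon
deletion = "CCAGGGAGUGCGGU"     # one base missing after the first codon

for label, mrna in (("normal", normal), ("insertion", insertion), ("deletion", deletion)):
    print(label, translate(mrna))
```

Only the first codon (Pro) is shared by all three products; every codon downstream of the inserted or deleted base is read differently.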
Bioinfo@AmS 22
Mutations: Inversion
An inversion mutation results from the removal of a length of DNA (a segment of DNA), which is
then reinserted facing in the opposite direction.
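On a single strand written 5’ to 3’, reinserting a double-stranded segment in the opposite direction can be modeled as replacing that segment by its reverse complement. The sketch below makes that modeling assumption explicit; the sequence and the function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def invert_segment(dna, start, end):
    """Model an inversion of dna[start:end] (0-based, end exclusive)."""
    return dna[:start] + reverse_complement(dna[start:end]) + dna[end:]

# Hypothetical sequence: the segment GAAC is inverted
print(invert_segment("ATGAACTA", 2, 6))
```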
Bioinfo@AmS 23
Assignment
1. Write a program to find the ORFs in a given sequence
2. Write a program that, given two DNA sequences, finds the number and sites of mutations by
comparing the two DNA strings. Comment on the types of point mutation.
Bioinfo@AmS 24
Database
Bioinfo@AmS 1
Biological Databanks and Databases
Biological Databank: Databank is a generic term meaning any collection of data in any form. It is the
largest form of repository/archive of data and keeps all the information safe for the long term.
Biological Database: A collection of a particular type of data arranged in a computer-readable format
suitable for easy storage, searching and analysis by users. Generally, database is a technical term
denoting a collection of data managed by software called a Database Management System (DBMS).
Bioinfo@AmS 2
Various Kinds of Biological Data
• Nucleic Acids:
✓ DNA sequences, double helical structure
✓ RNA sequences, mRNA, tRNA, rRNA, secondary RNA structure, interactions
• Genomics: Gene, ORF, genetic disorder, mutation
• Proteomics:
✓ Protein sequences; secondary, tertiary and quaternary protein structure
✓ Protein expression profile, protein-ligand interactions
• Transcriptomics: The study of the transcriptome, the complete set of RNA transcripts
produced by the genome
• Metabolomics: The study of the small molecules involved in cellular metabolism and their interactions
within a biological system.
Bioinfo@AmS 3
Biological Databases
• Nucleotide Sequence Databases (Genome and Gene)

• Protein Sequences Databases (Sequence information)
• Protein Structure Databases (3D structure )
Bioinfo@AmS 4
Complexity and size of Biological data
Bioinfo@AmS 5
Bioinfo@AmS 6
Bioinfo@AmS 7
Database
•Biological experimental data published in the literature are generally
scattered and random.
•It is hard to find and access the related and relevant data required
for various analyses and applications.
•Biological data are very complex in nature: for example, signaling
pathways are so interconnected that studying all such data together
makes it a major challenge to draw conclusive inferences.
Scattered and random data → Organized and indexed data
A database is the organized form of information, collected and stored in a computer-readable form.
Bioinfo@AmS 8
How to design a biological database
•Contents
•Perspective of end users
•Data must be easily understandable for users
•Reliable and required information
•Faster responses to get the search results
•Less redundancy in the information
Collection of data in the related format

✓Structured/Organized -> Flat file
✓Searchable (index) -> Table of contents
✓Updated periodically (release) -> New edition
✓Cross-referenced (hyperlinks) -> Links with other DB
Bioinfo@AmS 9
Characteristics of biological database
•Unique contents: Structure, sequence

•Ontology: Notations, symbols and biological conditions need to be explained briefly so that users
can understand them easily
•Schema: The logical structure of the database design
•Format of the data: The format must be uniform and compatible, so that the data can be used directly by other
programs and applications
•Search engine: The search engine should return relevant and minimally redundant data.
•Link to the source of the data: Provide the source of the data, such as literature, details of the researchers
and links to other websites.
Bioinfo@AmS 10
History of Biological Database
• Margaret Belle (Oakley) Dayhoff (March 11, 1925 – February 5, 1983) was an American physical
chemist and a pioneer in the field of bioinformatics. She created the first biological database, based on
protein sequences and named the ‘Atlas of Protein Sequence and Structure’, in 1965 at the National
Biomedical Research Foundation (NBRF), USA.
• The Protein Data Bank (PDB) was created in 1971 as a collection of X-ray crystallographic protein
structures.
• The SwissProt protein sequence database began in 1986.
• 1988 - The National Center for Biotechnology Information (NCBI) was established at the National Library
of Medicine (NLM), National Institutes of Health (NIH).
Bioinfo@AmS 11
Classification of Biological Database
Biological Database
On the basis of source:
• Primary Database
• Secondary Database
• Composite Database
On the basis of nature of data:
• Sequences
• Structure
• Function
• Interaction
• Literature
Bioinfo@AmS 12
Classification of Biological Database
Primary databases: Experimental results (raw data) submitted directly to the database. Once given a
database accession number, the data in primary databases are not changed and become part of the permanent
scientific record.
Secondary databases: These databases comprise data derived from the results of analyzing primary data.
Composite databases: Many databases hold both raw data and derived data, i.e., they have the characteristics of
both primary and secondary databases; these are known as composite/aggregate databases.
Bioinfo@AmS 13
Primary Database
• Primary databases are highly organized, user-friendly gateways to the huge amount of biological data
directly produced by researchers around the world.
• The primary databases were first developed for the storage of experimentally determined DNA and protein
sequences in the 1980s and 90s.
• These databases contain the raw nucleic acid/protein sequence data that are produced and submitted directly by
researchers worldwide.
• Primary databases contain information on sequence or structure only. Once data are deposited in primary
databases, they can be accessed freely by anyone around the world.
Example
✓ GenBank, DDBJ, EMBL for nucleotide (DNA/RNA) sequences
✓ Protein Data Bank (PDB) for protein structure

Bioinfo@AmS 14
Secondary Database
• Secondary databases contain information derived from primary databases, that is, data obtained by
curation, annotation and analysis of primary database records.
• Secondary databases store information such as conserved sequences, active site residues, and
signature sequences.
• A secondary structural database contains the entries of the PDB in an organized (classified) way.
Examples
•TrEMBL for protein sequences
•SCOP at Cambridge University
•CATH at University College London
•PROSITE of the Swiss Institute of Bioinformatics
Bioinfo@AmS 15
Composite Database
Composite databases contain both primary and derived data, which eliminates the need to search
each source separately. Each composite database has different search algorithms and data structures.
Example:
NCBI is a composite database that links to Online Mendelian Inheritance in Man (OMIM) and other
databases.
Bioinfo@AmS 16
Classification of Database
Biological Database
Nucleotides
• INSDC (International Nucleotide Sequence Database Collaboration): EMBL, GenBank, DDBJ
• ENSEMBL (Whole Genome Database)
Protein
• Sequences: UniProt, PIR, SwissProt
• Structure: PDB, CATH, SCOP
• Interaction: BioGRID, STRING
Specialized
• OMIM (Online Mendelian Inheritance in Man)
• Gene Expression Omnibus (GEO) database
Bioinfo@AmS 17
Nucleotide sequence databases
There are a small number of bioinformatics centers of excellence worldwide that have taken on the
responsibility to collect, catalogue and provide open access to published biological data.
Example:
1. The EMBL-European Bioinformatics Institute (EMBL-EBI)
2. The US National Center for Biotechnology Information (NCBI)
3. The National Institute of Genetics in Japan (NIG)
The role of bioinformatics centers of excellence in making biological data available for the research
community
Bioinfo@AmS 18
• EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases
• EMBL www.ebi.ac.uk/embl/
• GenBank www.ncbi.nlm.nih.gov/Genbank/
• DDBJ www.ddbj.nig.ac.jp
Together they constitute the International Nucleotide Sequence Database Collaboration (INSDC).
Bioinfo@AmS 19
NCBI
How to Use the NCBI’s Bioinformatics Tools and Databases
Bioinfo@AmS 20
National Center for Biotechnology Information (NCBI)
• The National Center for Biotechnology Information (NCBI) was established in 1988 as a division of the National
Library of Medicine (NLM) at the National Institutes of Health (NIH), Maryland, USA.
• Website: https://www.ncbi.nlm.nih.gov/
• Aim: To create and maintain public databases, develop software tools such as BLAST (sequence analysis) and
iCn3D (structure viewing), and provide resources for biomedical information
• Maintains several databases, such as GenBank (nucleic acid sequence database), since 1992. It is a superset of
various databases including Gene, Genome, Protein, literature, etc., and acts as an interface connecting these
databases.
• Databases present in the NCBI are generally classified as primary and derived databases.
• It provides a ‘database retrieval system’ known as Entrez
Bioinfo@AmS 21
Entrez
Entrez: An advanced search interface that facilitates constructing sophisticated queries against biological
databases. It is a molecular biology database retrieval system developed by NCBI. Entrez is NCBI's primary text search
and retrieval system; it integrates the PubMed database of biomedical literature with 38 other literature and
molecular databases.
• Easy to use and convenient. Low redundancy

• Searches are mainly text based. Boolean operators are used for text search.
• Integrated and cross-referenced: Entrez Global Query Cross-Database Search System is a federated search engine,
or web portal that allows users to search many discrete biological databases. Specialized search fields are available
for each database and can be browsed and selected in the Search Builder section of the Advanced Search interface.
• Other useful Entrez features include Search History with access to recent results and a Clipboard where search
results can be saved temporarily.
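A Boolean Entrez text query can also be sent programmatically through NCBI's E-utilities. The sketch below only builds an `esearch` URL and decodes it again; it does not contact the server. The search terms, the choice of the `nuccore` database and the helper name `build_esearch_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(db, terms, operator="AND", retmax=20):
    """Join field-tagged terms with a Boolean operator and build an esearch URL."""
    term = f" {operator} ".join(terms)
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

# Hypothetical query: records with 'insulin' in the title, restricted to human
url = build_esearch_url("nuccore", ["insulin[Title]", "Homo sapiens[Organism]"])
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["term"][0])
```

Swapping `operator` for `"OR"` or `"NOT"` changes how the terms are combined, in the same way as in the Advanced Search builder.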
Bioinfo@AmS 22
GenBank (Maintained by NCBI)
• GenBank is the most complete collection of raw nucleic acid sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high throughput raw sequence data, and
sequence polymorphisms.
• There is also a GenPept database for protein sequences, the majority of which are conceptual
translations from DNA sequences.
• There are two ways to search for sequences in GenBank. One is using text-based keywords similar to a
PubMed search. The other is using molecular sequences to search by sequence similarity using BLAST.
• The search output for sequence files is produced as flat files for easy reading. The resulting flat files contain
three sections – Header, Features, and Sequence entry
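The three sections of a flat file can be separated by looking for the `FEATURES` and `ORIGIN` keywords and the closing `//` line. The sketch below works on a tiny, entirely hypothetical record written in the GenBank flat-file layout; it is a simplified illustration, not a full parser, and the function name is an assumption.

```python
def split_flat_file(text):
    """Split a GenBank-style flat file into header lines, feature lines and the sequence."""
    header, features, sequence_lines = [], [], []
    section = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN keyword line itself carries no bases
        elif line.startswith("//"):
            break     # end of record
        if section == "header":
            header.append(line)
        elif section == "features":
            features.append(line)
        else:
            sequence_lines.append(line)
    # Sequence lines carry position numbers and spaces; keep letters only
    sequence = "".join(ch for line in sequence_lines for ch in line if ch.isalpha())
    return header, features, sequence

# Hypothetical record for illustration only
RECORD = "\n".join([
    "LOCUS       EXAMPLE01                 12 bp    DNA     linear   PRI 01-JAN-2000",
    "DEFINITION  Hypothetical record for illustration.",
    "ACCESSION   EXAMPLE01",
    "FEATURES             Location/Qualifiers",
    "     source          1..12",
    "ORIGIN",
    "        1 atgctctcat ct",
    "//",
])

header, features, sequence = split_flat_file(RECORD)
print(len(header), len(features), sequence)
```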
Bioinfo@AmS 23
✓ LOCUS: It contains a unique database identifier for a sequence location in the database (not a chromosome locus). The
identifier is followed by sequence length and molecule type (e.g., DNA or RNA). This is followed by a three-letter code
for GenBank divisions.
✓ DEFINITION: This provides the summary information for the sequence record including the name of the sequence, the
name and taxonomy of the source organism if known, and whether the sequence is complete or partial.
✓ ACCESSION NUMBER: A new accession number is given in the form of a string of alphanumeric characters (for nucleotide
records, typically one letter and five digits or two letters and six digits; for protein records, three letters and five
digits). In addition to the accession number, there is also a version number and a GenInfo Identifier (GI) number. The
purpose of these numbers is to identify the current version of the
sequence. If the sequence annotation is revised at a later date, the accession number remains the same, but the version
number is incremented, as is the GI number.
✓ ORGANISM: This field gives the source organism with the scientific name of the species.
✓ FEATURES: This section includes annotation information about the gene and gene product. The “gene” field gives
information about the nucleotide coding sequence and its name. For DNA entries, there is a “CDS” field, which gives
the boundaries of the sequence that can be translated into amino acids.
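Identifiers such as AH002844.2 combine the accession number and the version number with a dot. The sketch below separates the two parts with a regular expression; the pattern is a simplification (letters followed by digits) and the function name is an assumption.

```python
import re

# Simplified pattern: letters, then digits, then '.', then the version number
ACCESSION_VERSION = re.compile(r"^([A-Z]+\d+)\.(\d+)$")

def split_accession(identifier):
    """Return (accession, version) for an 'ACCESSION.VERSION' identifier."""
    match = ACCESSION_VERSION.match(identifier)
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return match.group(1), int(match.group(2))

print(split_accession("AH002844.2"))
print(split_accession("AAA59172.1"))
```

When several versions of a record exist, the one with the highest version number under the same accession is the most recent.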
Bioinfo@AmS 24
✓ If there are multiple sequences with the same accession number, take the sequence with the highest GI number,
which is the most recently updated one.
Bioinfo@AmS 25
NCBI
https://www.ncbi.nlm.nih.gov/
Bioinfo@AmS 26
NCBI-GenBank
https://www.ncbi.nlm.nih.gov/genbank/
Bioinfo@AmS 27
Search of Nucleotide Sequence
https://www.ncbi.nlm.nih.gov/nuccore
https://www.ncbi.nlm.nih.gov/nuccore/?term=Insulin
Bioinfo@AmS 28
GenBank division codes:
PLN: Plant, fungal, and algal sequences
PRI: Primate sequences
MAM: Non-primate mammalian sequences
BCT: Bacterial sequences
EST: Expressed sequence tags
Header Features
https://www.ncbi.nlm.nih.gov/nuccore/AH002844.2
Bioinfo@AmS 29
Sequence
Bioinfo@AmS 30
https://www.ncbi.nlm.nih.gov/nuccore/AH002844.2?report=fasta
Bioinfo@AmS 31
Search of Protein Sequence
https://www.ncbi.nlm.nih.gov/protein
Bioinfo@AmS 32
https://www.ncbi.nlm.nih.gov/protein/AAA59172.1
Bioinfo@AmS 33
https://www.ncbi.nlm.nih.gov/protein/AAA59172.1?report=fasta
Bioinfo@AmS 34
Search of Gene Sequence
https://www.ncbi.nlm.nih.gov/gene
Bioinfo@AmS 35
https://www.ncbi.nlm.nih.gov/gene/3630
Bioinfo@AmS 36
Bioinfo@AmS 37
Bioinfo@AmS 38
https://www.ncbi.nlm.nih.gov/variation/view/
Bioinfo@AmS 39
https://www.ncbi.nlm.nih.gov/variation/view/
Bioinfo@AmS 40
DDBJ www.ddbj.nig.ac.jp
Bioinfo@AmS 41
https://www.ddbj.nig.ac.jp/services/indexe.html?tag=search,DDBJ
http://ddbj.nig.ac.jp/arsa/
Bioinfo@AmS 42
http://ddbj.nig.ac.jp/arsa/
http://ddbj.nig.ac.jp/arsa/search?lang=en&cond=quick_search&query=Human+insulin+gene&operator=AND
Bioinfo@AmS 43
http://getentry.ddbj.nig.ac.jp/getentry
Bioinfo@AmS 44
https://www.ebi.ac.uk/
Bioinfo@AmS 45
https://www.ebi.ac.uk/ebisearch/search?db=emblstandard&query=human%20insulin%20gene&size=15
https://www.ebi.ac.uk/ebisearch/search?db=nucleotideSequences&query=human%20insulin%20gene
Bioinfo@AmS 46
Bioinfo@AmS 47
Assignment- Practical
1. Visit NCBI: (i) Download the flat files and FASTA files of the nucleotide sequences of human (Homo sapiens) hemoglobin subunit alpha 1 (HBA1), Mus musculus hemoglobin alpha,
adult chain-1 and Bos taurus hemoglobin, beta (HBB).
2. Visit DDBJ and EMBL: (i) Download the flat file of the complete human CFTR nucleotide sequence (ii) Download the FASTA file of the complete human CFTR nucleotide sequence
3. Visit NCBI: (i) Download the flat files of the complete human CFTR gene sequence, nucleotide sequence and protein sequence (ii) Download the FASTA files of the complete human CFTR
gene sequence, nucleotide sequence and protein sequence (iii) Show the variants of the gene with missense mutations (iv) Show the CDS
Bioinfo@AmS 48
Protein Databases
Protein
Sequences
• UniProt
• PIR
• SwissProt
• TrEMBL
Structure
• PDB
• CATH
• SCOP
Bioinfo@AmS 1
Protein Sequence Databases
• NCBI -Protein Sequence Databases
• PIR (Protein Information Resource): at Georgetown University Medical Center

• SWISS-PROT: Expertly curated protein sequence entries maintained by the Swiss Institute of Bioinformatics.
• TrEMBL: Computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL
nucleotide sequence entries not yet integrated into SWISS-PROT.
• UniProt: Universal Protein Resource (Knowledgebase)
Bioinfo@AmS 2
The Protein Information Resource (PIR)
• PIR: Atlas of Protein Sequence and Structure – Dayhoff (1965), the first sequence database
(pre-bioinformatics). Currently known as the Protein Information Resource (PIR)
• PIR has provided many protein databases and analysis tools to the scientific community, including the
PIR-International Protein Sequence Database (PSD) of functionally annotated protein sequences.
• PIR was established in 1984 by the National Biomedical Research Foundation (NBRF), USA as a
resource to assist researchers in the identification and interpretation of protein sequence information.
• Maintained by Georgetown University Medical Center
• PIR format
https://proteininformationresource.org/
Bioinfo@AmS 3
Protein Information Resource (PIR)
PRO: Protein family classifications
iPTMnet: Post-translational modifications (PTMs)
iProLINK: Literature and research articles
iProClass: Integrated protein knowledgebase; contains sequences
Bioinfo@AmS 4
https://proteininformationresource.org/cgi-bin/textsearch.pl
Bioinfo@AmS 5
https://proteininformationresource.org/cgi-bin/ipcEntry?id=A0A6J4EE43
Bioinfo@AmS 6
https://proteininformationresource.org/cgi-bin/ipcEntry?id=A0A6J4EE43
Bioinfo@AmS 7
FASTA File format
https://proteininformationresource.org/cgi-bin/comp_mw.pl?ids=A0A6J4EE43
Bioinfo@AmS 8
Universal Protein Resource (UniProt)
Universal Protein Resource (UniProt): The United Protein Databases (UniProt, 2003) is a central
database of protein sequence and function, created by joining the forces of the SWISS-PROT, TrEMBL
and PIR protein database activities.
The centerpiece of the UniProt databases is the UniProt Knowledgebase (UniProtKB), which
comprises two sections: the manually annotated UniProtKB/Swiss-Prot and the automatically (computer)
annotated UniProtKB/TrEMBL.
Bioinfo@AmS 9
Universal Protein Resource (UniProt)
Bioinfo@AmS 10
SWISS-PROT/UniProtKB
• SWISS-PROT is an annotated protein sequence database established in 1986 and maintained
collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva
and the EMBL Data Library.
• UniProtKB/SWISS-PROT is the expertly curated protein sequence section of UniProtKB (produced by
the UniProt consortium), which provides a high level of annotation and a minimal level of redundancy.
• It also has a high level of integration with other databases.
• Highest level of accuracy
• It contains protein sequences and descriptions, including function, domain structure, subcellular location,
post-translational modifications and functionally characterized variants.
• Format similar to EMBL
Bioinfo@AmS 11
TrEMBL/UniProtKB
• TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of
EMBL nucleotide sequence entries that are not yet integrated into SWISS-PROT.
• TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide
Sequence Database that are not yet integrated into SWISS-PROT.
• TrEMBL can be considered a preliminary section of SWISS-PROT. SWISS-PROT accession numbers
have been assigned to all TrEMBL entries that should eventually be upgraded to the standard
SWISS-PROT quality.
• Currently, SWISS-PROT and TrEMBL have 0.5 and 7.6 million sequences, respectively.
Bioinfo@AmS 12
UniProtKB
https://www.uniprot.org/
Bioinfo@AmS 13
UniProtKB
https://www.uniprot.org/uniprotkb?query=insulin
Bioinfo@AmS 14
UniProtKB
https://www.uniprot.org/uniprotkb/P01308/entry
Bioinfo@AmS 15
UniProtKB
Bioinfo@AmS 16
UniProtKB
Bioinfo@AmS 17
FASTA File format
https://rest.uniprot.org/uniprotkb/P01308.fasta
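A FASTA file consists of a header line starting with `>` followed by one or more lines of sequence. The sketch below parses FASTA text into a dictionary; the record shown is a short made-up example rather than the actual content of the URL above, and the function name is an assumption.

```python
def parse_fasta(text):
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Hypothetical FASTA text for illustration only
EXAMPLE = """>example_protein hypothetical record
MKTAYIAK
QRQISFVK
"""

print(parse_fasta(EXAMPLE))
```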
Bioinfo@AmS 18
Protein Structure Databases
Bioinfo@AmS 19
PDB
http://www.rcsb.org/structure/4HHB
https://files.rcsb.org/view/4HHB.pdb
Bioinfo@AmS 20
Protein Data Bank (PDB)
• PDB was established in 1971 at Brookhaven National Laboratory (BNL)

• Sole international repository of protein structure data
• X-ray crystallographic data and NMR data are stored for each protein
Process of submission of entry/data files to the PDB
1. The user directly submits the raw data as an mmCIF (macromolecular Crystallographic Information File) to the PDB
2. RCSB checks the format of the coordinates and structures
3. Validation tests are run on the structure on deposition to the database
4. The structure is accepted and receives a unique PDB ID
Bioinfo@AmS 21
PDB file format
http://www.rcsb.org/structure/4HHB
https://files.rcsb.org/view/4HHB.pdb
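In the PDB file format, each atom is stored on an `ATOM` line with fixed column positions. The sketch below slices a few of those columns; the example line is assembled by hand for illustration and is not copied from the 4HHB entry, and the function name is an assumption.

```python
def parse_atom_line(line):
    """Slice one ATOM record using the fixed PDB column positions."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }

# Hypothetical ATOM record, assembled field by field so the column widths are explicit
EXAMPLE_LINE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "VAL"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "   6.204"    # columns 31-38: x coordinate
    "  16.869"    # columns 39-46: y coordinate
    "   4.854"    # columns 47-54: z coordinate
)

print(parse_atom_line(EXAMPLE_LINE))
```

Because the format is column based, splitting on whitespace is unreliable; slicing by position is the safer choice.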
Bioinfo@AmS 22
CATH (Class, Architecture, Topology and Homology):

• Classification of proteins by domain structure, based on their folding patterns
• Each protein is chopped into individual domains, which are assigned to homologous superfamilies.
• Hierarchical domain classification of PDB entries.
Bioinfo@AmS 23
Class: Derived from secondary structure content; assigned automatically
Architecture: Describes the gross orientation of secondary
structures, independent of connectivity
Topology: Clusters structures according to their topological
connections and numbers of secondary structures
Homologous superfamily: This level groups together
protein domains which are thought to share a
common ancestor and can therefore be described as
homologous
Bioinfo@AmS 24
https://www.cathdb.info/
Bioinfo@AmS 25
