Bioinfo@AmS 1
Introduction: Bioinformatics
[Figure: Bioinformatics at the intersection of Computer Science, IT, Mathematics & Statistics, and Biology]
Bioinformatics is an interdisciplinary field that combines computer science, information technology,
mathematics/statistics, and biology to organize, analyze and interpret biological data and to draw new
insight from it.
Bioinfo@AmS 2
Introduction: Bioinformatics
✓ Bioinformatics deals with large biological datasets (macromolecular structures, genome sequences, expression
data) for better understanding and for generating new hypotheses
Bioinfo@AmS 3
Various aspects of Bioinformatics
✓ Drug design
Bioinfo@AmS 4
Bioinformatics
Bioinfo@AmS 5
Basic Form of Life : Cells
Bioinfo@AmS 6
Introduction: Complexity in Biology
[Figure: levels of biological complexity: organism, organs, tissue, cell, cell signaling, protein-protein interaction, protein structure, protein sequence]
Bioinfo@AmS 7
Length and Time Scale
Bioinfo@AmS 8
Introduction: Exploration of Biology
[Figure: cycle linking Observation, Hypothesis, Experiments, Results, Conclusion and Knowledge]
Bioinfo@AmS 9
Introduction: Exploration of Biology
• Over the course of his travels in the Galápagos Islands off the
coast of Ecuador, Darwin began to see intriguing patterns in the
distribution and features of organisms.
Charles Darwin
How does this trait pass through from one generation to next?
Bioinfo@AmS 11
Introduction: Exploration of Biology
Gregor Mendel is regarded as the father and founder of genetics. Genes come in pairs (alleles) and are inherited as
distinct units, one from each parent. Mendel tracked the segregation of parental genes and their
appearance in the offspring as dominant or recessive traits.
Bioinfo@AmS 12
Introduction: Exploration of Biology
Genes are the basic unit of heredity. Each chromosome contains many genes that are responsible for
different traits.
[Figure: alleles inherited from father and mother (B/b and T/T)]
• There are two copies of every gene. Each copy is commonly called an allele, i.e., a variant form of the gene; one is
received from the father and one from the mother.
• If the two alleles that form the pair for a trait are identical, the individual is said to be homozygous for the trait;
if the two alleles are different, the individual is heterozygous for the trait.
• If the alleles of a gene are different, one allele will be expressed; it is the dominant allele. The effect of the
other allele, called recessive, is masked.
Bioinfo@AmS 13
Introduction: Exploration of Biology
Cross BB × bb:            Cross Bb × Bb:
      B     B                   B     b
b     Bb    Bb            B     BB    Bb
b     Bb    Bb            b     Bb    bb
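The two Punnett squares above can be reproduced with a few lines of Python. This is a minimal sketch for a single gene with two alleles; the function name `punnett` is only illustrative and is not part of the slides.

```python
from collections import Counter
from itertools import product


def punnett(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes for a one-gene cross, e.g. 'Bb' x 'Bb'."""
    offspring = Counter()
    # Each parent contributes one allele; enumerate all four combinations.
    for a, b in product(parent1, parent2):
        # Write the dominant (uppercase) allele first, as in 'Bb'.
        genotype = "".join(sorted(a + b, key=lambda x: (x.lower(), x.islower())))
        offspring[genotype] += 1
    return offspring


print(dict(punnett("BB", "bb")))  # all four offspring are Bb
print(dict(punnett("Bb", "Bb")))  # 1 BB : 2 Bb : 1 bb
```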
Bioinfo@AmS 14
Introduction: Exploration of Biology
Bioinfo@AmS 15
Introduction: Exploration of Biology
Genotype – Deals with GENE CODE. Genotype is the particular set of genes present in an organism’s cell.
In other words, the genotype is the genetic constitution of an organism.
Phenotype – Deals with looks you can take a PHOTO with. All the observable characteristics of an
organism, such as shape, size, color and behavior are called phenotype.
AA
aa
Bioinfo@AmS 16
Introduction: Exploration of Biology
Bioinfo@AmS 17
Central Dogma
Transcription
Protein
Functional Unit
Phenotype
Genetic mapping provides the first evidence that a disease or trait (i.e., a characteristic) is linked to the
gene(s) inherited from one’s parents.
Bioinfo@AmS 18
Central Dogma
↶ Decoding
DNA Protein
• A single change in the nucleotide sequence of DNA can lead to a defective protein or to no protein synthesis.
• Coding and decoding of information from the DNA level to protein are highly regulated biological events.
Bioinfo@AmS 19
Introduction: Human Genome Project
Objective: Determine the nucleotide sequence of the human genome (i.e., the haploid chromosome set + X/Y).
Bioinfo@AmS 20
Introduction: Human Genome Project
Essence of Bioinformatics:
• Store all the sequence information in a database (about 3,300 books of 1000 pages each, with 1000 bp per page, for roughly 3.3 billion bp).
Bioinfo@AmS 21
Gene
GENE: A gene is a segment of chromosomal DNA with a specific ordered sequence of nucleotides (the building
blocks of DNA) that codes for a functional protein or RNA (rRNA, tRNA, mRNA).
Bioinfo@AmS 2
Characteristics of Gene
• Genes are the basic physical and functional units of heredity, transferred from parent to offspring.
• Eukaryotic genes are not continuous coding regions: the coding regions (exons) are separated by non-coding sequences
(introns).
Bioinfo@AmS 3
Prokaryotic Gene Expression
Gene
Transcription Transcription
mRNA mRNA (Polycistronic)
Translation Translation
Protein
Protein I Protein II Protein III
• ORF: Open Reading Frame (i.e., the coding region of a gene), which contains the code for a protein
• Gene: ORF + regulatory regions (e.g., promoter and terminator)
Bioinfo@AmS 4
Eukaryotic Gene Expression
Gene
ORF
Promoter Terminator
Transcription
hnRNA
Processing
mRNA
Translation
Protein
• The ORF in eukaryotic cells is divided into exons and introns; however, only the exons carry the code for the protein
Bioinfo@AmS 5
Exon and Intron
Exons: Eukaryotic genes contain stretches of coding sequences called exons, which are interrupted by non-coding
segments called introns. Exons code for functionally distinct proteins.
Introns: These are intervening sequences of DNA that do not code for protein. Introns are generally present
in higher eukaryotic genomes but are rarely present in prokaryotes.
• When genomic DNA is transcribed (gene expression), introns are also transcribed. Once the
entire primary transcript has been made, the introns are removed before the mature mRNA reaches the ribosomes for protein
synthesis.
Bioinfo@AmS 6
Characteristics of Gene
Pseudogenes
Pseudogenes are copies of genes that have lost their function. This can happen through mutations such as premature stop
codons or frameshifts within the coding sequence. Pseudogenes have often been regarded as junk DNA that serves no
purpose in the genome.
Gene families
In prokaryotes, a gene occurs in single copy per genome. However, it is not uncommon in eukaryotes to find genes that
are present in several copies. Such groups of genes are called gene families. A higher copy number enables production
of larger quantities of gene products.
Bioinfo@AmS 7
Number of genes in a genome
✓ Unlike eukaryotic genomes, most of the DNA in bacterial genomes (prokaryote) encodes proteins.
✓ The genome of E. coli bacteria is made of 4288 genes, with nearly 90% of the genome coding for proteins.
✓ The yeast genome, is about 2.5 times larger, comprising about 6000 genes with 70% used for coding proteins.
Only 4% of the yeast genome is reported to be made of introns.
✓ The genome of humans consists of about 25,000-30,000 genes, with only about 2% of DNA used as protein
coding sequence.
Bioinfo@AmS 8
Non-coding DNA
Complex genomes have roughly 10x to 30x more DNA than is required to encode all the proteins or RNAs
in the organism.
Bioinfo@AmS 9
Genome
Genome: All the information that is encoded in DNA and is capable of being passed on to an offspring.
✓The genome is all of the nuclear DNA in a haploid cell (sperm or egg) i.e., DNA that is inherited by the
next generation. So a normal somatic cell in humans actually has two genomes, a maternal genome and a
paternal genome.
Bioinfo@AmS 10
Tutorial/ Practical: Finding Exon and Introns
DNA Sequence:
CTCGAGGGGCCTA ATGCATTGCCC TCCAGAGAGA GCACCCAA CACCCTCCAGGCTT GACTAA CCAGGGTGT
Bioinfo@AmS 11
Genome
Intron Intron
Length of a gene (in base pairs) = (Number of exons × Length of an exon) + (Number of introns × Length of an
intron)
Coding region of a genome = Total length of exons (i.e., Number of exons per gene × Length of an exon × Number of genes in the
genome)
Total non-coding region of a genome = Total length of introns (intragenic part, i.e., the gaps between exons) + Total
length of the intergenic part (the space between consecutive genes)
Bioinfo@AmS 12
Assignment
1. The whole genome of an E. coli bacterium is 2×10^6 base pairs in size, and sequencing has shown that it has 600 genes,
each 2×10^3 base pairs long. Each gene comprises three coding regions (exons) with an average size
of 200 base pairs each. A representative schematic is given below. How many introns are there within the
whole genome of E. coli? What is the ratio of the total length of coding regions to the total length of non-coding regions in the whole genome?
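The gene-length and coding/non-coding formulas can be turned into a small calculator. The sketch below is one possible reading of the assignment and assumes that introns lie only between consecutive exons (so a gene with three exons has two introns); the function and parameter names are illustrative.

```python
def genome_composition(genome_bp, n_genes, gene_bp, exons_per_gene, exon_bp):
    """Apply the coding / non-coding formulas to a simplified genome."""
    # Assumption: introns sit only between consecutive exons.
    total_introns = n_genes * (exons_per_gene - 1)
    coding_bp = n_genes * exons_per_gene * exon_bp
    # Intragenic non-coding part: everything inside genes that is not exon.
    intronic_bp = n_genes * gene_bp - coding_bp
    # Intergenic part: everything outside genes.
    intergenic_bp = genome_bp - n_genes * gene_bp
    noncoding_bp = intronic_bp + intergenic_bp
    return total_introns, coding_bp, noncoding_bp


introns, coding, noncoding = genome_composition(
    genome_bp=2_000_000, n_genes=600, gene_bp=2_000, exons_per_gene=3, exon_bp=200
)
print(introns, coding, noncoding, round(coding / noncoding, 4))
```

With the numbers from the assignment this gives 1200 introns, 360,000 coding bp and 1,640,000 non-coding bp.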
Intron Intron
Bioinfo@AmS 13
Tutorial/ Practical: Finding Exon and Introns
http://hollywood.mit.edu/GENSCAN.html
Bioinfo@AmS 14
Tutorial/ Practical: Finding Exon and Introns
Write a program: Given two DNA sequences, compare the two sequences character by character and output
i) the number of introns and their lengths ii) the number of exons and their lengths
DNA Sequence:
CTCGAGGGGCCTA ATGCATTGCCC TCCAGAGAGA GCACCCAA CACCCTCCAGGCTT GACTAA CCAGGGTGT
DNA Sequence:
CTCGAGGGGCCTA ATGCATTGCCC GCACCCAA GACTAA CCAGGGTGT
Bioinfo@AmS 15
Tutorial/ Practical: Finding Exon and Introns
Write a program: Given a DNA sequence and its corresponding mature mRNA sequence, compare the two
sequences character by character and output i) the number of introns and their lengths ii) the number of exons and their
lengths
Notes: The mRNA is synthesized from DNA. The mRNA sequence does not contain any introns, and every ‘T’
character of the DNA changes to ‘U’ in the RNA. All other characters remain the same in both DNA and RNA, as shown in the
mapping given below.
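One simple way to approach these two tutorial programs is sketched below. It assumes, as in the example sequences on the slides, that both sequences are written as whitespace-separated segments and that the mature sequence is an ordered subset of the DNA segments; a real exon finder such as GENSCAN works on unsegmented sequence and is far more involved. The function name `classify_segments` is illustrative.

```python
def classify_segments(dna: str, mature: str):
    """Label each whitespace-delimited DNA segment as 'exon' or 'intron'.

    `mature` may be spliced DNA or mature mRNA; 'U' is mapped back to 'T'.
    """
    mature_segments = mature.upper().replace("U", "T").split()
    labels = []
    j = 0  # index of the next mature segment we expect to see
    for segment in dna.upper().split():
        if j < len(mature_segments) and segment == mature_segments[j]:
            labels.append((segment, "exon"))    # retained in the mature sequence
            j += 1
        else:
            labels.append((segment, "intron"))  # removed by splicing
    if j != len(mature_segments):
        raise ValueError("mature sequence is not an ordered subset of the DNA segments")
    return labels


dna = ("CTCGAGGGGCCTA ATGCATTGCCC TCCAGAGAGA GCACCCAA "
       "CACCCTCCAGGCTT GACTAA CCAGGGTGT")
mature = "CTCGAGGGGCCTA ATGCATTGCCC GCACCCAA GACTAA CCAGGGTGT"

labels = classify_segments(dna, mature)
intron_lengths = [len(s) for s, kind in labels if kind == "intron"]
exon_lengths = [len(s) for s, kind in labels if kind == "exon"]
print("introns:", len(intron_lengths), intron_lengths)
print("exons:", len(exon_lengths), exon_lengths)
```

Here every retained segment is reported as an exon, including the first and last segments, which on a real gene may be untranslated flanking regions.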
Bioinfo@AmS 16
Tutorial/ Practical: Finding Exon and Introns
https://asia.ensembl.org/index.html
https://asia.ensembl.org/Multi/Search/Results?q=BRCA2;site=ensembl
https://asia.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000139618;r=13:32315086-32400268
https://asia.ensembl.org/Homo_sapiens/Gene/Sequence?db=core;g=ENSG00000139618;r=13:32315086-32400268
Bioinfo@AmS 17
Central Dogma: Flow of Biological Information
Base Pair
Transcription Translation
Sugar-
phosphate
Protein
Bioinfo@AmS 1
Nucleic Acids (DNA and RNA)
• Storage genetic information • Mainly transfer genetic information ( although few cases
Bioinfo@AmS 2
Nucleic Acids (DNA and RNA)
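[Figure: nucleotide structures of DNA and RNA. Each consists of a phosphate (P), a five-carbon sugar (S, carbons numbered 1 to 5) and a nitrogenous base (NB); carbon 3 carries OH in both, while carbon 2 carries H in DNA and OH in RNA]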
Bioinfo@AmS 3
Nitrogenous Bases
Purine Pyrimidine
• Adenine (A) and Guanine (G) • Cytosine (C), Thymine (T) and Uracil (U)
[Figure: ring atom numbering of purine (positions 1 to 9) and pyrimidine (positions 1 to 6)]
Bioinfo@AmS 4
Nitrogenous Bases
Purine Pyrimidine
Nucleoside = Sugar + nitrogenous base (NB)
Nucleotide = Sugar + NB + Phosphate group, e.g., GMP
Bioinfo@AmS 5
DNA
Bioinfo@AmS 6
Chemical Composition of DNA
Bioinfo@AmS 7
Chemical Composition of Nucleotides
Nucleotide
Nucleoside
Base
Nucleoside
Base
Bioinfo@AmS 10
Base pair (BP) complementary: One letter DNA string
5’ TCCTACTAGTGTG 3’
3’ AGGATGATCACAC 5’
[Figure: two complementary strands, each with a sugar-phosphate backbone, joined by base pairs]
Bioinfo@AmS 11
Double helical-dsDNA
across.
Bioinfo@AmS 12
DNA Replication
Parent dsDNA:
5’–TCAGCTCGCTGCTAATGGCC–3’ (old strand)
3’–AGTCGAGCGACGATTACCGG–5’ (old strand)
Daughter dsDNA 1:
5’–TCAGCTCGCTGCTAATGGCC–3’ (old strand)
3’–AGTCGAGCGACGATTACCGG–5’ (new strand)
Daughter dsDNA 2:
5’–TCAGCTCGCTGCTAATGGCC–3’ (new strand)
3’–AGTCGAGCGACGATTACCGG–5’ (old strand)
Bioinfo@AmS 13
DNA Replication
(DNA)n + dNTP → (DNA)n+1 + PPi
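[Figure: a dNTP (three phosphates) is added to the 3’ end of the growing strand, which is paired with the template strand]
(DNA polymerase reads the template strand in the 3’→5’ direction only)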
Bioinfo@AmS 14
DNA string
DNA Sequence:
5’ AGGATCATGATCATGAATT 3’
Reverse Sequence:
3’ TTAAGTACTAGTACTAGGA 5’
Repetitive sequence:
5’ TTAGTTAG 3’
Reverse Complement sequence:
5’ AATTCATGATCATGATCCT 3’
https://www.bioinformatics.org/sms/rev_comp.html
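The reverse and reverse-complement strings shown above can be computed directly. This is a minimal sketch that assumes an uppercase DNA string containing only A, C, G and T; the function names are illustrative.

```python
# Watson-Crick pairing: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(seq: str) -> str:
    """Return the sequence read backwards."""
    return seq[::-1]


def complement(seq: str) -> str:
    """Return the base-paired partner of every position, in the same order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the opposite strand written 5' to 3'."""
    return complement(seq)[::-1]


seq = "AGGATCATGATCATGAATT"
print(reverse(seq))             # TTAAGTACTAGTACTAGGA
print(reverse_complement(seq))  # AATTCATGATCATGATCCT
```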
Bioinfo@AmS 15
Chargaff's rules: dsDNA
i) # A = # T          5’ AGGATCATGATCATGAATT 3’
ii) # C = # G
=> C/G =1
Bioinfo@AmS 16
Chargaff's rules: dsDNA
i) # A = # T          5’ AGGATCATGATCATGAATT 3’
=> A/T = 1            3’ TCCTAGTACTAGTACTTAA 5’
ii) # C = # G
=> C/G =1
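Chargaff's rules can be checked on the example duplex by building the complementary strand and counting bases over both strands. A minimal sketch, assuming an uppercase A/C/G/T string; `duplex_base_counts` is an illustrative name.

```python
from collections import Counter


def duplex_base_counts(strand: str) -> Counter:
    """Count bases over both strands of the dsDNA formed by `strand`."""
    partner = strand.translate(str.maketrans("ACGT", "TGCA"))
    return Counter(strand + partner)


counts = duplex_base_counts("AGGATCATGATCATGAATT")
print(dict(counts))
print("A/T =", counts["A"] / counts["T"])
print("C/G =", counts["C"] / counts["G"])
```

Note that the rule holds for the duplex as a whole; a single strand on its own (here 7 A but 6 T) need not satisfy it.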
Bioinfo@AmS 17
Next generation storage unit: DNA
Bioinfo@AmS 18
Next generation storage unit: DNA
Bioinfo@AmS 19
Melting temperature of DNA (Tm)
[Figure: dsDNA (double helical) separates into ssDNA (separated strands) on heating (denaturation) and re-forms on cooling (renaturation)]
AT-rich duplex:
5’ AATTAATTAATT 3’
3’ TTAATTAATTAA 5’
GC-rich duplex:
5’ GGCCGGCCGGCC 3’
3’ CCGGCCGGCCGG 5’
If a DNA solution is heated (> 90°C), there is enough kinetic energy to denature the DNA completely, causing it
to separate into single strands. Tm is defined as the temperature at which 50% of the double-stranded DNA has changed to
single-stranded DNA. The greater the guanine-cytosine (GC) content of the DNA, the higher the melting temperature,
because each G-C pair is held by three hydrogen bonds and each A-T pair by only two.
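The effect of GC content on Tm can be illustrated with the two 12-mers above. The sketch below uses the Wallace rule, Tm ≈ 2 × (A+T) + 4 × (G+C) in °C, a common rule of thumb for short oligonucleotides. It is an assumption brought in for illustration and is not necessarily the formula used elsewhere in these slides.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in one strand."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rule-of-thumb Tm in degrees C for a short oligo (Wallace rule)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


for oligo in ("AATTAATTAATT", "GGCCGGCCGGCC"):
    print(oligo, gc_fraction(oligo), wallace_tm(oligo))
```

Under this rule the GC-only 12-mer has twice the estimated Tm of the AT-only 12-mer.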
Bioinfo@AmS 20
Melting temperature of DNA (Tm)
Example:
5’ ATGCTGAATGC 3’
3’ TACGACTTACG 5’
A=6
T=6
G=5
C =5
= 13.5
Bioinfo@AmS 21
Chargaff's rules: dsDNA
A 3D cubical DNA structure is given below, where the edge length is 85 nm. Assume that for double helical DNA the pitch
length is 3.4 nm, which comprises 10 bp. How many bases are required to form the 3D structure? If thymine (T)
makes up 15 percent of the bases in this DNA sample, what is the number of each of the other bases and of sugar
molecules? Also, calculate the amount of energy required to attain the melting temperature (Tm). Assume the energy
of H-bonds is 5 kcal/mole. One mole of H-bonds means Avogadro's number of H-bonds (taken here as 6.03 × 10^23).
85 nm
Bioinfo@AmS 22
Chargaff's rules: dsDNA
Total edge length of a cube = 12 × edge length = 12 × 85 nm
So, total number of base pairs in this double-stranded structure = 12 × (85/3.4) × 10 = 3000 bp
Total number of bases = 3000 × 2 = 6000; each nucleotide contains one sugar, so total number of sugars = 6000
T = (15 × 6000)/100 = 900
A = 900
G = (35 × 6000)/100 = 2100
C = 2100
Total H-bonds = (900 A-T pairs × 2) + (2100 G-C pairs × 3) = 8100; at Tm, 50% of them (4050) are broken
Energy for Tm = [(5 kcal/mole) × 4050] / (6.03 × 10^23) = 3,358.20 × 10^-23 kcal
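The worked solution can be re-checked numerically. The sketch below follows the same steps and uses the same constants as the slide (including 6.03 × 10^23 for Avogadro's number); it assumes 2 H-bonds per A-T pair and 3 per G-C pair, with half of all H-bonds broken at Tm. Variable names are illustrative.

```python
EDGES_IN_CUBE = 12
edge_nm, pitch_nm, bp_per_turn = 85, 3.4, 10
thymine_percent = 15
kcal_per_mole = 5
avogadro = 6.03e23  # value used in the slides

# Each edge is a DNA duplex; count helical turns per edge, then base pairs.
base_pairs = round(EDGES_IN_CUBE * edge_nm / pitch_nm * bp_per_turn)
bases = 2 * base_pairs
sugars = bases  # one sugar per nucleotide

# Chargaff: A = T and G = C across the duplex.
t = a = bases * thymine_percent // 100
g = c = (bases - a - t) // 2

# Number of A-T pairs equals the A count; number of G-C pairs equals the G count.
h_bonds = 2 * a + 3 * g
broken_at_tm = h_bonds // 2  # 50% denatured at Tm
energy_kcal = kcal_per_mole * broken_at_tm / avogadro

print(base_pairs, bases, sugars, a, t, g, c, h_bonds, broken_at_tm)
print(f"{energy_kcal:.4e} kcal")
```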
Bioinfo@AmS 23
Assignment
Bioinfo@AmS 24
RNA
Bioinfo@AmS 25
RNA
Bioinfo@AmS 26
Transcription: DNA to RNA String
3' 5'
5' 3'
Sense (+) strand
Transcription DNA coding strand
RNA polymerase
Non-template strand
5' 3'
mRNA
Bioinfo@AmS 27
Splicing: Generation of mRNA
ds DNA
hnRNA
mRNA
Bioinfo@AmS 28
Splicing: Generation of mRNA
• The whole sequence of a gene (introns and exons) is transcribed into RNA by the RNA polymerase enzyme.
• The total transcribed RNA is known as heterogeneous nuclear RNA (hnRNA), which undergoes a process called
RNA splicing.
• The spliceosome, made of proteins and small nuclear RNAs (snRNAs), catalyzes splicing: it removes the introns from the
hnRNA and joins the exons to form mature mRNA.
• Thus, in the next step, only the exons of a gene enter the translation process (protein synthesis). Introns
play no role in protein synthesis.
Essentially, the code for the protein is found in the exons of the gene.
Bioinfo@AmS 29
DNA to RNA string
dsDNA
mRNA
3’ TAC AGC AGA CGA CGC 5’
5’ ATG TCG TCT GCT GCG 3’ 5’ AUG UCG UCU GCU GCG 3’
3’ ATA CTG TCG TGA CGT CGT 5’ 5’ UAU GAC AGC ACU GCA GCA 3’
Coding strand
(Complementary strand)
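The DNA-to-mRNA mapping in the examples above can be written as a one-step translation table. This sketch assumes the template strand is given in the 3’→5’ direction, exactly as on the slide, so the resulting string is already the mRNA in 5’→3’ order; `transcribe` is an illustrative name.

```python
def transcribe(template_3to5: str) -> str:
    """Return the 5'->3' mRNA made from a template strand written 3'->5'."""
    # Base pairing against the template, with U used in place of T.
    pairing = str.maketrans("ACGT", "UGCA")
    return template_3to5.upper().translate(pairing)


template = "TACAGCAGACGACGC"  # 3' TAC AGC AGA CGA CGC 5'
coding = "ATGTCGTCTGCTGCG"    # 5' ATG TCG TCT GCT GCG 3'

mrna = transcribe(template)
print(mrna)                             # AUGUCGUCUGCUGCG
print(mrna == coding.replace("T", "U"))  # True: mRNA matches the coding strand
```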
Bioinfo@AmS 30
DNA to RNA String
• Do both strands of DNA get transcribed into mRNA, or only one?
• Why can’t both strands of DNA get transcribed at the same time?
Bioinfo@AmS 31
DNA to RNA String
• Do both strands of DNA get transcribed into mRNA, or only one?
• If so, which one?
Only one strand is transcribed at a time: the 3’ to 5’ strand is used as the template for transcription. RNA synthesis
follows the normal base pairing rules, except that the base thymine (T) is replaced with the base uracil (U). The
base sequence of the synthesized RNA is complementary to the template strand; hence the base sequence of the RNA
exactly matches the other strand of the DNA (i.e., the complementary 5’-3’ coding strand), with U in place of T.
Bioinfo@AmS 32
DNA to RNA String
• Why can’t both strands of DNA get transcribed at the same time?
✓ The two strands of DNA carry two different kinds of information, which means that the dsDNA would carry information about two
different proteins on its two strands. For a particular protein, the sequence of one strand of DNA is therefore
sufficient. If both were transcribed, two different mRNAs would be synthesized, and this would lead to the
production of two different proteins.
✓ Moreover, if both strands were transcribed, the resulting mRNAs would be complementary and would pair with
each other to produce double-stranded RNA, which would interfere with translation.
Bioinfo@AmS 33
DNA to RNA String
Bioinfo@AmS 34
mRNA
Bioinfo@AmS 35
rRNA
Bioinfo@AmS 36
tRNA
Bioinfo@AmS 37
Assignment
1. A double-stranded DNA has 20% guanine. What are the percentages of the other bases in the DNA?
2. Let the four nitrogenous bases Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) be represented by the two-bit binary
numbers 00, 01, 10 and 11 respectively. S1= 5′ AGTCATGGCCAA 3′ and S2= 5′ AGTCCTGCCCAC 3′. Find the output DNA
sequences obtained by applying the AND, XOR and OR operations.
3. Find the mRNA sequence of the reverse complement of the given dsDNA
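For question 2, the two-bit encoding lends itself to a short program. The sketch below assumes the operation is applied position by position to two sequences of equal length; the helper name `combine` is illustrative.

```python
import operator

# Two-bit encoding given in the assignment.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {bits: base for base, bits in ENCODE.items()}


def combine(s1: str, s2: str, op) -> str:
    """Apply a bitwise operator position by position to two DNA strings."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return "".join(DECODE[op(ENCODE[x], ENCODE[y])] for x, y in zip(s1, s2))


s1 = "AGTCATGGCCAA"
s2 = "AGTCCTGCCCAC"
print("AND:", combine(s1, s2, operator.and_))
print("XOR:", combine(s1, s2, operator.xor))
print("OR: ", combine(s1, s2, operator.or_))
```

Wherever the two sequences agree, AND and OR return the shared base and XOR returns A (00), so the XOR output highlights the positions that differ.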
Bioinfo@AmS 38
Nucleic acid information Resources
https://www.ncbi.nlm.nih.gov/
Bioinfo@AmS 39
Nucleic acid information Resources
Bioinfo@AmS 40
Protein
Proteins: Proteins are linear polymers built of monomer units called amino acids. They are the most versatile
macromolecules in living systems and serve crucial functions in essentially all biological processes.
“Workhorse in Cells”
Functions:
• They provide mechanical support.
• They transport and store other molecules such as oxygen.
• They help in immune protection.
• They control cell growth and differentiation.
• They function as biocatalysts such as enzymes.
Bioinfo@AmS 1
Protein
“Workhorse in Cells”
Actin -myosin
Antibody Hormone
Enzyme (Insulin) Actin
Filament
Bioinfo@AmS 2
Chemical Nature of Protein
Amino
Side Chain Carboxyl
Group
(Variable) Group
• All amino acids have at least one acidic carboxylic acid (-COOH) group and one basic amino (-NH2) group.
• Only 20 amino acids are standard and present in proteins, because these are the ones encoded by genes.
Bioinfo@AmS 3
Three and One Letter Code of Amino acids
Letters not assigned to any of the 20 standard amino acids: B, J, O, U, X, Z
Bioinfo@AmS 4
Characteristic of Amino acids
Polar
Non-Polar
Bioinfo@AmS 5
Characteristic of Amino acids
Polar
Non-Polar
Non-polar: These amino acids are hydrophobic (water hating) in nature and come closer together in biological
aqueous medium, e.g., aliphatic and aromatic side chains.
Polar: These amino acids are hydrophilic (water loving) in nature and interact with water in biological
aqueous medium, e.g., alcohols, acids, amines, etc.
Bioinfo@AmS 6
Characteristic of Amino acids
✓ Which amino acid is more non-polar, isoleucine or alanine?
D >S>G>F
Bioinfo@AmS 7
Amide Bonds
Amino acids are linked together through the formation of amide bonds (peptide bonds) between the amino group of one
residue and the carboxylate of a second residue.
Bioinfo@AmS 8
Structure Function Relationship of Protein
• Function of a protein depends on its structure that indeed depends on protein Sequence.
• Function of proteins depends on the amino acids Order
Bioinfo@AmS 9
Protein Folding
Bioinfo@AmS 10
Protein Folding
• The primary structure of protein: a sequence of amino acids linked together by peptide bonds (covalent bond)
• The secondary structure of protein: Polypeptide folding into α helix, β sheet, or random coil (H bonds involved)
• The quaternary structure of proteins: Association of two or more folded polypeptides (sub units) to form a
Bioinfo@AmS 11
Secondary structure
Bioinfo@AmS 12
Secondary Structure: α-Helix
• Secondary structure refers to local folded structures that form within a polypeptide (intra-chain) due to interactions
between atoms of the backbone.
• In an α helix, the carbonyl (C=O) of one amino acid is hydrogen bonded to the amino H (N-H) of an amino acid that
is four residues down the chain (n+4).
• Each turn of the helix contains 3.6 amino acids.
• The pitch of the helix is 0.54 nm.
• R groups are not involved.
• Example of proteins: α-keratin, abundant in skin, hair, nails and horns
• Polypeptide chains are held together by H bonds between N-H group of one polypeptide chain and C=O group of
the other chain.
• Two or more segments of a polypeptide chain line up next to each other, forming a sheet-like structure held
together by hydrogen bonds.
• The strands of a β pleated sheet may be parallel, pointing in the same direction (meaning that their N- and C-
termini match up), or antiparallel, pointing in opposite directions (meaning that the N-terminus of one strand is
positioned next to the C-terminus of the other).
Bioinfo@AmS 14
Secondary structure: β-pleated sheet
Bioinfo@AmS 15
Tertiary Structure
• 3D conformation or shape. Proteins fold spontaneously or with the help of molecular chaperone proteins present inside
the cell. The fold is stabilized by various kinds of interactions (i.e., covalent and non-covalent bonds) involving the side
chains (R groups).
• The interactions responsible for protein tertiary structure depend on the properties of
the R groups of the amino acid residues, so the folding pattern or 3D conformation varies with them.
Bioinfo@AmS 16
Tertiary Structure
• Ionic bonds (between charged amino acid side chains): For example, Lys has a positively charged side chain due to NH3+
and Asp has a negatively charged side chain due to COO-; these may form an ionic interaction known as a salt bridge.
• Hydrogen bonds between R groups: For example, uncharged polar amino acids can form H-bonds; Ser has a side chain
carrying an OH group that serves as donor/acceptor for H-bonds.
• Covalent bonds: Protein chains form intra-chain or inter-chain disulfide bonds between cysteine residues to form the tertiary conformation.
• Hydrophobic interactions: Amino acids with non-polar side chains associate in the interior of the protein molecule and
exclude water via hydrophobic interactions.
• van der Waals interactions: 'van der Waals forces' is a general term for attractive intermolecular forces
between molecules. There are two kinds of van der Waals forces: weak London dispersion forces and stronger dipole-dipole
forces.
Bioinfo@AmS 17
Quaternary Structure
Many proteins are made up of more than one polypeptide chain; these chains are called subunits. Quaternary structure is the association of
two or more folded polypeptides (subunits) to form a multimeric protein (with bonds and interactions
similar to those of tertiary structure).
The quaternary structure refers to how these protein subunits interact with each other and arrange
themselves to form a larger aggregate protein complex. The final shape of the protein complex is
once again stabilized by various interactions, including hydrogen bonding, disulfide bridges and
salt bridges.
Hemoglobin, a protein in red blood cells, has four subunits (two copies each of α- and β-globin),
each containing a heme molecule.
Bioinfo@AmS 18
Protein Folding
Primary
sequence
Protein domain: A segment (100 – 250 aa) of a polypeptide chain that folds independently into a stable structure
and performs a particular biological function. Domains are independently folded parts of a protein that can be
distinguished both structurally and functionally.
Protein motif: A short stretch of polypeptide chain made up of secondary structure elements. It cannot hold its
structure independently outside the protein and is not biologically functional on its own. Put very simply, a domain can be made up of one or
more well-characterized motifs which usually occur together and suggest a putative function.
Bioinfo@AmS 20
Protein Folding: 3D structure
Unfolded Folded
Bioinfo@AmS 21
Protein Folding
Bioinfo@AmS 22
Entropy of Protein Folding
S = Less
S = High
Bioinfo@AmS 23
Entropy of Protein Folding
Conformational
Unfolded Unfolded Entropy Decreases
Folded
Bioinfo@AmS 24
Entropy of Protein Folding
Unfolded: entropy of surrounding water is LOW
• Water molecules form a cage-like structure around the hydrophobic amino acids of the unfolded protein and are thus not free
to move.
Folded: entropy of surrounding water is HIGH
• Water molecules are free to move around the folded protein because only hydrophilic amino acids are exposed.
• All hydrophobic amino acids are buried inside the core of the protein structure and thus do not interact with water
molecules.
Bioinfo@AmS 25
Entropy of Protein Folding
Bioinfo@AmS 26
Enthalpy of Protein Folding
Energy Absorbed
Energy released
(bonds breaking)
(bond formation)
HUnfolded Hfolded
∆H = Hfolded- Hunfolded
Bioinfo@AmS 27
Protein Folding
Bioinfo@AmS 28
Protein Folding
*The decrease in enthalpy (negative ∆H) also outweighs the decrease in conformational entropy (which on its own makes ∆G more positive), so that
overall there is a negative ∆G, which favors protein folding.
Bioinfo@AmS 29
Genetic Codes
Transcription
Translation
Bioinfo@AmS 1
Genetic Codes
3’ GGTCTCCTCACGCCA 5’ DNA (template strand)
↓
5’ CCAGAGGAGUGCGGU 3’
mRNA
Codons
↓
Pro-Glu-Glu-Cys-Gly
Protein
Chain of Amino acids
The genetic code is a set of three-letter combinations of nucleotides called codons, each of which corresponds
to a specific amino acid or stop signal. The concept of codons was first described by Francis Crick and his
colleagues in 1961.
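Decoding the mRNA above into its amino acid chain can be sketched as a codon-table lookup. The table below is the standard genetic code, written compactly in UCAG order with one-letter amino acid codes ('*' marks a stop codon). The sketch assumes translation starts at the first base of the given string; the names are illustrative.

```python
BASES = "UCAG"
# Standard genetic code, first/second/third base each iterating over U, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop signal
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("CCAGAGGAGUGCGGU"))  # PEECG, i.e. Pro-Glu-Glu-Cys-Gly
```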
Bioinfo@AmS 2
Genetic Codes
• A codon is three consecutive bases found on the coding strand of double-stranded DNA
• DNA/RNA consists of four different bases, and there are three bases in a codon; hence, there are
✓ 4 × 4 × 4 = 64 codons in total.
Bioinfo@AmS 3
Codon Degeneracy
Bioinfo@AmS 4
Codon Degeneracy
• The genetic code is degenerate because there are many instances in which different codons specify the same
amino acid. The property that some amino acids may each be encoded by more than one codon is known
as codon degeneracy or codon redundancy.
• There are 20 amino acids in total, which make up the wide variety of proteins in a living organism. There are 64
codons, so it is obvious that more than one codon may specify the same amino acid. Codon degeneracy means
that several code words have the same meaning.
• Degeneracy makes the DNA more tolerant to point mutations. A point mutation in a codon does not necessarily
lead to a change in the peptide: the mutated codon may still encode the same amino acid.
Bioinfo@AmS 5
Codon Bias
In most species, certain synonymous codons are preferred for a particular amino acid; this is known as codon bias
Bioinfo@AmS 6
Open Reading Frame(ORF)
ORF
5’ AUGAUACUCACAAUCUGA3’
ORF
5’AUACUC AUGAUACCCACAAUUCAACACCUCUAG3’
ORF
5’AUACGCCA AUGAUACUCACAAUCUAAACUCACACUCUC3’
• A reading frame (RF) is a non-overlapping set of three-nucleotide codons (triplets) in DNA or RNA; it may start
anywhere within the sequence
• Open Reading Frame (ORF) means that the sequence is ‘open’ for the ribosome to keep reading for protein synthesis.
• An ORF is a segment of sequence that begins with a START codon and ends with a STOP codon and can encode a functional
protein.
• A gene also codes for a functional protein; however, it contains regulatory regions along with the ORF
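The three mRNA examples above can be scanned for their ORF with a small function. This is a deliberately simple sketch: it assumes the first AUG is the start codon, reads in that frame only, and returns nothing if no in-frame stop codon follows. `find_orf` is an illustrative name, not a tool referenced in the slides.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orf(mrna: str):
    """Return the first AUG...STOP stretch in the frame of the first AUG."""
    start = mrna.find("AUG")
    if start == -1:
        return None  # no start codon at all
    # Walk codon by codon from the start codon.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]
    return None  # start codon found, but no in-frame stop codon


for rna in (
    "AUGAUACUCACAAUCUGA",
    "AUACUCAUGAUACCCACAAUUCAACACCUCUAG",
    "AUACGCCAAUGAUACUCACAAUCUAAACUCACACUCUC",
):
    print(find_orf(rna))
```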
Bioinfo@AmS 7
Reading Frame(RF)
RF 2
RF 3
Bioinfo@AmS 8
Reading Frame(ORF)
A segment of double-stranded DNA has six possible reading frames (RF), three in each direction. However, a reading frame does not lead
to a protein unless it starts with a start codon and ends with a stop codon.
5’ ATGCTCTCATCTCG 3’
3’ TACGAGAGTAGAGC 5’
• There are six RFs, which could code for six different proteins;
however, only two RFs, called ORFs, contain the desired protein
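The six reading frames of the example duplex can be listed explicitly. The sketch below reads the given strand at offsets 0, 1 and 2, then does the same on the reverse complement; the frame labels (+1 to +3, -1 to -3) and the function name are illustrative conventions.

```python
def six_reading_frames(seq: str) -> dict:
    """Split a DNA strand and its reverse complement into codons, 3 frames each."""
    reverse_complement = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement)):
        for offset in range(3):
            # Keep only complete triplets; leftover bases at the end are dropped.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


for name, codons in six_reading_frames("ATGCTCTCATCTCG").items():
    print(name, " ".join(codons))
```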
Bioinfo@AmS 9
Assignment
5’TTAGATGTGTGTAAATGTGTGTGATGATCGTGATATCATAGTAGTCAATGATCGTAATATTATCTATTTATAACCG3’
https://www.ncbi.nlm.nih.gov/orffinder/
Bioinfo@AmS 10
Mutation
• A mutation may change the amino acids in a protein, which can be responsible for a change in phenotype.
Bioinfo@AmS 11
Mutation
Gene Mutation/
Point Mutation
Bioinfo@AmS 12
Mutation
• Gene Mutation = Any change(s) in the nucleotide/base sequence of DNA which may occur due to
errors in DNA replication or due to the impacts of chemicals or radiation to the DNA molecule.
Bioinfo@AmS 13
Mutation
Bioinfo@AmS 14
Mutation
• A balance exists between the amount of new variation and the overall health (adaptiveness) of the
new variant individual
• Differences between closely related organisms show closely matched DNA sequences that diverged at
some past time and that was adaptive for a given environment.
Bioinfo@AmS 15
Mutation
1. Substitution: Replacement of one base by another; this may or may not affect the resulting protein. For example, an
A:T base pair could be mutated into a G:C base pair or even a T:A base pair. Three subtypes such as
2. Inversion: Sometimes a run of consecutive bases is rotated by 180 degrees (the bases mutually change position). Such an event creates
a mutation called an inversion. This may show particular abnormalities at the phenotypic level.
Bioinfo@AmS 16
Mutation
- Insertion: An insertion changes the number of DNA bases in a gene by adding a piece of DNA. As a result,
- Deletion: A deletion changes the number of DNA bases by removing a piece of DNA. Small deletions may
remove one or a few base pairs within a gene, while larger deletions can remove an entire gene or several
neighboring genes. The deleted DNA may alter the function of the resulting protein(s).
Bioinfo@AmS 17
Silent Mutation
Normal Mutation
DNA
mRNA
Protein
Due to the redundancy of the genetic code (i.e., different codons can code for the same amino acid), no change in the amino acid
sequence is produced.
Bioinfo@AmS 18
Missense Mutation
Normal Mutation
DNA
mRNA
Protein
A missense mutation produces a change in the amino acid sequence of the protein product (here, histidine in place of arginine); it may or may not
change the function of the protein.
Bioinfo@AmS 19
Nonsense Mutation
Normal Mutation
DNA
mRNA
Protein
Bad news! A nonsense mutation produces a STOP codon within the mRNA transcript,
leading to a truncated protein (incomplete protein sequence). How short the protein
product is depends on where the STOP codon was produced within the mRNA transcript.
Bioinfo@AmS 20
Mutations: Insertion
Bioinfo@AmS 21
Mutations: Deletion
Bioinfo@AmS 22
Mutations: Inversion
An inversion is a mutation resulting from the removal of a length of DNA (a segment of DNA), which is
then reinserted facing in the opposite direction.
Bioinfo@AmS 23
Assignment
2. Write a program that, given two DNA sequences, finds the number of mutations and the sites of mutation by
comparing the two DNA strings. Comment on the types of point mutation.
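A possible starting point for this assignment is sketched below. It handles substitutions only and assumes that both inputs are uppercase coding-strand DNA of equal length, in frame from the first base; insertions and deletions would need an alignment step first. The names `point_mutations` and `CODON_TABLE` are illustrative.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def point_mutations(ref: str, alt: str):
    """List (position, ref_base, alt_base, type) for every substituted base."""
    if len(ref) != len(alt):
        raise ValueError("this sketch compares equal-length sequences only")
    found = []
    for pos, (r, a) in enumerate(zip(ref, alt)):
        if r == a:
            continue
        start = pos - pos % 3  # first base of the codon containing this site
        ref_codon, alt_codon = ref[start:start + 3], alt[start:start + 3]
        if len(ref_codon) < 3:
            kind = "unclassified (incomplete codon)"
        elif CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
            kind = "silent"
        elif CODON_TABLE[alt_codon] == "*":
            kind = "nonsense"
        else:
            kind = "missense"
        found.append((pos + 1, r, a, kind))  # 1-based site of mutation
    return found


reference = "ATGCGTGAA"  # hypothetical example: Met-Arg-Glu
for mutant in ("ATGCGCGAA", "ATGCATGAA", "ATGCGTTAA"):
    print(mutant, point_mutations(reference, mutant))
```

The number of mutations is simply the length of the returned list.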
Bioinfo@AmS 24
Database
Bioinfo@AmS 1
Biological Databanks and Databases
Biological Databank: Databank is a generic term meaning any collection of data in any form. It is the
largest form of data repository/archive and keeps all the information safe for the long term.
Biological Databases: A database is a collection of a particular type of data arranged in a computer-readable format
suitable for easy storing, searching and analysis by users. Generally, database is a technical term
denoting a collection of data managed by software called a Database Management System (DBMS).
Bioinfo@AmS 2
Types of Various Kinds Biological Data
• Nucleic Acids:
✓ DNA/RNA Sequences, Double helical structure
✓ RNA Sequences, mRNA, tRNA, rRNA, secondary RNA structure, Interactions
• Genomics: Gene, ORF, Genetic disorder, Mutation
• Proteomics:
✓ Protein sequences, secondary, tertiary , quaternary protein structure
✓ Protein expression profile, Protein-ligand interactions
• Transcriptomics: This is the study of the transcriptome—the complete set of RNA transcripts that are
produced by the genome
• Metabolomics: The study of small molecules involved in cellular metabolism and their interactions
within a biological system are known as the metabolomics.
Bioinfo@AmS 3
Biological Databases
Bioinfo@AmS 4
Complexity and size of Biological data
Bioinfo@AmS 5
Complexity and size of Biological data
Bioinfo@AmS 6
Complexity and size of Biological data
Bioinfo@AmS 7
Database
[Figure: scattered and random data versus organized and indexed data]
•Biological experimental data published in the literature are generally
scattered and random.
•It is hard to find and access the related and relevant data required
for various analyses and applications.
•Biological data are very complex in nature; for example, signaling
pathways are so interconnected that studying all these kinds of
data makes it very difficult to draw conclusive inferences.
A database is the organized form of information, collected and stored in computer-readable form.
Bioinfo@AmS 8
How to design a biological database
•Contents
•Perspective of end users
•Data must be easily understandable for users
•Reliable and required information
•Faster responses to get the search results
•Less redundancy in the information
Bioinfo@AmS 10
History of Biological Database
• Margaret Belle (Oakley) Dayhoff (March 11, 1925 – February 5, 1983) was an American physical
chemist and a pioneer in the field of bioinformatics. She created the first biological database, a
collection of protein sequences named the "Atlas of Protein Sequence and Structure", in 1965 at the
National Biomedical Research Foundation (NBRF), USA.
• The Protein Data Bank (PDB) was created in 1971 as a collection of X-ray crystallographic protein
structures.
• 1988 - The National Center for Biotechnology Information (NCBI) was established at the National Library
of Medicine (NLM), National Institutes of Health (NIH).
Bioinfo@AmS 11
Classification of Biological Database
Primary databases: Experimental results (raw data) submitted directly to the database. Once assigned a
database accession number, the data in a primary database become part of the permanent scientific
record.
Secondary databases: These databases comprise data derived from the results of analyzing primary data.
Composite databases: Many databases hold both raw data and derived data, i.e. they have characteristics of
both primary and secondary databases. These are known as composite/aggregate databases.
Bioinfo@AmS 13
Primary Database
• Primary databases are highly organized, user-friendly gateways to the huge amount of biological data
produced directly by researchers around the world.
• The primary databases were first developed in the 1980s and 1990s for the storage of experimentally
determined DNA and protein sequences.
• These databases contain the raw nucleic acid/protein sequence data produced and submitted directly by
researchers worldwide.
• Primary databases contain sequence or structure information only. Once data are deposited in a primary
database, they can be accessed freely by anyone around the world.
Examples: EMBL, GenBank and DDBJ (nucleotide sequences)
Secondary Database
• Secondary databases contain information derived from primary databases, i.e. data obtained by
curation, annotation and analysis of the data in primary databases.
• Secondary databases store information such as conserved sequences, active site residues, and
signature sequences.
• A secondary database of structures contains entries of the PDB arranged in an organized (classified) way.
Examples
• TrEMBL for protein sequences
• SCOP at Cambridge University
• CATH at University College London
• PROSITE of the Swiss Institute of Bioinformatics
Bioinfo@AmS 15
Composite Database
Composite databases combine primary and derived data, which eliminates the need to search
each source separately. Each composite database has its own search algorithms and data structures.
Example:
NCBI is a composite database that links to Online Mendelian Inheritance in Man (OMIM) and other
databases.
Bioinfo@AmS 16
Classification of Database
Biological Database
• Nucleotides
✓ INSDC (International Nucleotide Sequence Database Collaboration): EMBL, GenBank, DDBJ
✓ ENSEMBL (whole-genome database)
• Protein
✓ Sequences: UniProt, PIR, SwissProt
✓ Structure: PDB, CATH, SCOP
✓ Interaction: BioGRID, STRING
• Specialized
✓ OMIM (Online Mendelian Inheritance in Man)
✓ Gene Expression Omnibus (GEO) database
Bioinfo@AmS 17
Nucleotide sequence databases
There are a small number of bioinformatics centers of excellence worldwide that have taken on the
responsibility to collect, catalogue and provide open access to published biological data.
Example:
1. The EMBL-European Bioinformatics Institute (EMBL-EBI)
2. The US National Center for Biotechnology Information (NCBI)
3. The National Institute of Genetics in Japan (NIG)
The role of bioinformatics centers of excellence in making biological data available for the research
community
Bioinfo@AmS 18
Nucleotide sequence databases
• EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases
• EMBL www.ebi.ac.uk/embl/
• GenBank www.ncbi.nlm.nih.gov/Genbank/
• DDBJ www.ddbj.nig.ac.jp
Together they constitute the International Nucleotide Sequence Database Collaboration (INSDC).
Bioinfo@AmS 19
NCBI
Bioinfo@AmS 20
National Center for Biotechnology Information (NCBI)
• The National Center for Biotechnology Information (NCBI) was established in 1988 as a division of the National
Library of Medicine (NLM) at the National Institutes of Health (NIH), Maryland, USA.
• Website: https://www.ncbi.nlm.nih.gov/
• Aim: To create and maintain public databases, develop software such as sequence analysis and structure
viewing tools (BLAST, iCn3D), and provide resources for biomedical information.
• It has maintained several databases, such as GenBank (nucleic acid sequence database), since 1992. It is a
superset of various databases including Gene, Genome, Protein and literature, and acts as an interface
connecting them.
• Databases in the NCBI are generally classified as primary and derived databases.
• It provides a database retrieval system known as Entrez.
Bioinfo@AmS 21
Entrez
Entrez: The primary text search and retrieval system of the National Center for Biotechnology Information (NCBI).
It integrates the PubMed database of biomedical literature with 38 other literature and molecular databases
maintained by NCBI. Its Advanced Search interface facilitates constructing more sophisticated queries against
these biological databases.
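Entrez can also be queried programmatically through NCBI's E-utilities. The sketch below only builds an ESearch URL and makes no network request. The helper name `esearch_url` and the example query are assumptions chosen for illustration; `[Title]` and `[Organism]` are Entrez field tags.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (the programmatic interface to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL asking Entrez for record IDs that match `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Boolean operators (AND, OR, NOT) and field tags narrow the query.
query = "insulin[Title] AND Homo sapiens[Organism]"
url = esearch_url("nuccore", query, retmax=5)
print(url)
```

Opening the printed URL in a browser should return a list of matching record IDs from the Nucleotide (nuccore) database.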
Bioinfo@AmS 22
GenBank (Maintained by NCBI)
• GenBank is the most complete collection of raw nucleic acid sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high throughput raw sequence data, and
sequence polymorphisms.
• There is also a GenPept database for protein sequences, the majority of which are conceptual
translations from DNA sequences.
• There are two ways to search for sequences in GenBank. One is using text-based keywords similar to a
PubMed search. The other is using molecular sequences to search by sequence similarity using BLAST.
• The search output for sequence files is produced as flat files for easy reading. The resulting flat files
contain three sections: Header, Features, and Sequence entry.
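The three-section layout of a flat file can be sketched as follows. The record is a made-up toy entry, not a real GenBank record, and the splitting rule (a `FEATURES` line starts the features section, an `ORIGIN` line starts the sequence, `//` ends the record) is a simplification that ignores less common keywords.

```python
# Toy flat file, invented for illustration only.
TOY_RECORD = """\
LOCUS       TOY00001                  24 bp    DNA     linear   PRI 01-JAN-2000
DEFINITION  Toy record for illustration only.
ACCESSION   TOY00001
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
ORIGIN
        1 atggccctgt ggatgcgcct ctga
//
"""


def split_flat_file(text):
    """Split one flat-file record into header, features and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        elif line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    return sections


parts = split_flat_file(TOY_RECORD)
# Sequence lines carry position numbers and spaces; keep the letters only.
sequence = "".join(
    ch for line in parts["sequence"][1:] for ch in line if ch.isalpha()
)
print(len(parts["header"]), len(parts["features"]), sequence)
```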
Bioinfo@AmS 23
GenBank (Maintained by NCBI)
✓ LOCUS: A unique database identifier for the sequence's location in the database (not a chromosome locus). The
identifier is followed by the sequence length and molecule type (e.g., DNA or RNA), and then by a three-letter code
for the GenBank division.
✓ DEFINITION: Summary information for the sequence record, including the name of the sequence, the
name and taxonomy of the source organism if known, and whether the sequence is complete or partial.
✓ ACCESSION NUMBER: An accession number is assigned as a string of alphanumeric characters (two
letters and six digits, or three letters and five digits). In addition to the accession number, there is also
a version number and a GenInfo Identifier (GI) number. The purpose of these numbers is to identify the current
version of the sequence. If the sequence is revised at a later date, the accession number remains the same, but
the version number is incremented and a new GI number is assigned.
✓ ORGANISM: The source organism, given by the scientific name of the species.
✓ FEATURES: Annotation information about the gene and gene product. The "gene" field gives
information about the nucleotide coding sequence and its name. For DNA entries, there is a "CDS" field, which
gives the boundaries of the sequence that can be translated into amino acids.
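The accession and version conventions can be illustrated with a small pattern check. This is a sketch limited to the two formats mentioned in the text (two letters plus six digits, three letters plus five digits); real databases accept further formats, and the function name `parse_accession` is hypothetical.

```python
import re

# Two letters + six digits, or three letters + five digits,
# optionally followed by ".<version>".
ACCESSION_RE = re.compile(r"^([A-Z]{2}\d{6}|[A-Z]{3}\d{5})(?:\.(\d+))?$")


def parse_accession(identifier):
    """Return (accession, version) or None if the pattern does not match."""
    match = ACCESSION_RE.match(identifier)
    if match is None:
        return None
    accession, version = match.group(1), match.group(2)
    return (accession, int(version) if version is not None else None)


# Identifiers taken from the URLs used elsewhere in these slides.
print(parse_accession("AH002844.2"))   # nucleotide record, version 2
print(parse_accession("AAA59172.1"))   # protein record, version 1
print(parse_accession("AH002844"))     # accession without a version suffix
```

Because the accession stays fixed while the version changes, citing `AH002844.2` pins down one exact sequence, whereas `AH002844` refers to the record in whatever version is current.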
Bioinfo@AmS 24
NCBI
https://www.ncbi.nlm.nih.gov/
Bioinfo@AmS 26
NCBI-GenBank
https://www.ncbi.nlm.nih.gov/genbank/
Bioinfo@AmS 27
Search of Nucleotide Sequence
https://www.ncbi.nlm.nih.gov/nuccore
https://www.ncbi.nlm.nih.gov/nuccore/?term=Insulin
Bioinfo@AmS 28
Search of Nucleotide Sequence
PLN: plant, fungal, and algal Seq.
PRI: Primate Seq.
MAM: Non-primate mammalian Seq.
BCT: Bacterial Seq.
EST: Expressed sequence Tag.
Header Features
https://www.ncbi.nlm.nih.gov/nuccore/AH002844.2
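A LOCUS line and its division code can be decoded with a few lines of code. The sketch assumes the common whitespace-separated layout with eight fields (older or unusual records may differ), and the sample line is hypothetical, not copied from a real record.

```python
# Division codes listed on this slide.
DIVISIONS = {
    "PLN": "plant, fungal, and algal sequences",
    "PRI": "primate sequences",
    "MAM": "non-primate mammalian sequences",
    "BCT": "bacterial sequences",
    "EST": "expressed sequence tags",
}


def parse_locus(line):
    """Parse a LOCUS line of the form:
    LOCUS <name> <length> bp <molecule> <topology> <division> <date>
    """
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS layout")
    division = tokens[6]
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "molecule": tokens[4],
        "topology": tokens[5],
        "division": division,
        "division_name": DIVISIONS.get(division, "unknown division"),
        "date": tokens[7],
    }


# Hypothetical LOCUS line for illustration.
sample = "LOCUS       TOY00001                4992 bp    DNA     linear   PRI 01-JAN-2000"
locus = parse_locus(sample)
print(locus["length"], locus["molecule"], locus["division_name"])
```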
Bioinfo@AmS 29
Search of Nucleotide Sequence
Sequence
Bioinfo@AmS 30
Search of Nucleotide Sequence
https://www.ncbi.nlm.nih.gov/nuccore/AH002844.2?report=fasta
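A FASTA file is much simpler than a flat file: each record is a header line starting with ">" followed by one or more sequence lines. The following is a minimal parser sketch; the two records are toy data invented for illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    chunks = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is not None:
            chunks[header].append(line)  # a sequence may span several lines
    return {name: "".join(parts) for name, parts in chunks.items()}


# Toy FASTA text (hypothetical records).
toy_fasta = """\
>toy_seq_1 hypothetical example
ATGGCC
CTGTGG
>toy_seq_2 another hypothetical example
GGATCC
"""

sequences = parse_fasta(toy_fasta)
for name, seq in sequences.items():
    print(name, len(seq))
```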
Bioinfo@AmS 31
Search of Protein Sequence
https://www.ncbi.nlm.nih.gov/protein
Bioinfo@AmS 32
Search of Protein Sequence
https://www.ncbi.nlm.nih.gov/protein/AAA59172.1
Bioinfo@AmS 33
Search of Protein Sequence
https://www.ncbi.nlm.nih.gov/protein/AAA59172.1?report=fasta
Bioinfo@AmS 34
Search of Gene Sequence
https://www.ncbi.nlm.nih.gov/gene
Bioinfo@AmS 35
Search of Gene Sequence
https://www.ncbi.nlm.nih.gov/gene/3630
Bioinfo@AmS 36
Search of Gene Sequence
https://www.ncbi.nlm.nih.gov/variation/view/
Bioinfo@AmS 39
Nucleotide sequence databases
DDBJ www.ddbj.nig.ac.jp
Bioinfo@AmS 41
Nucleotide sequence databases
https://www.ddbj.nig.ac.jp/services/indexe.html?tag=search,DDBJ
http://ddbj.nig.ac.jp/arsa/
Bioinfo@AmS 42
Nucleotide sequence databases
http://ddbj.nig.ac.jp/arsa/
http://ddbj.nig.ac.jp/arsa/search?lang=en&cond=quick_search&query=Human+insulin+gene&operator=AND
Bioinfo@AmS 43
Nucleotide sequence databases
http://getentry.ddbj.nig.ac.jp/getentry
Bioinfo@AmS 44
Nucleotide sequence databases
https://www.ebi.ac.uk/
Bioinfo@AmS 45
Nucleotide sequence databases
https://www.ebi.ac.uk/ebisearch/search?db=emblstandard&query=human%20insulin%20gene&size=15
https://www.ebi.ac.uk/ebisearch/search?db=nucleotideSequences&query=human%20insulin%20gene
Bioinfo@AmS 46
Assignment- Practical
1. Visit NCBI:
(i) Download the flat files and FASTA files of the nucleotide sequences of human (Homo sapiens) hemoglobin subunit alpha 1 (HBA1), Mus musculus hemoglobin alpha, adult chain 1, and Bos taurus hemoglobin, beta (HBB).
2. Visit DDBJ and EMBL:
(i) Download the flat file of the complete human CFTR nucleotide sequence.
(ii) Download the FASTA file of the complete human CFTR nucleotide sequence.
3. Visit NCBI:
(i) Download the flat files of the complete human CFTR gene sequence, nucleotide sequence and protein sequence.
(ii) Download the FASTA files of the complete human CFTR gene sequence, nucleotide sequence and protein sequence.
(iii) Show the variants of the gene with missense mutations.
(iv) Show the CDS.
Bioinfo@AmS 48
Protein Databases
Protein
• Sequences: UniProt, PIR, SwissProt, TrEMBL
• Structure: PDB, CATH, SCOP
Bioinfo@AmS 1
Protein Sequence Databases
Bioinfo@AmS 2
The Protein Information Resource (PIR)
• PIR: The Atlas of Protein Sequence and Structure (Dayhoff, 1965) was the first sequence database
(pre-bioinformatics). It is currently known as the Protein Information Resource (PIR).
• PIR has provided many protein databases and analysis tools to the scientific community, including the
PIR-International Protein Sequence Database (PSD) of functionally annotated protein sequences.
• PIR was established in 1984 by the National Biomedical Research Foundation (NBRF), USA, as a
resource to assist researchers in the identification and interpretation of protein sequence information.
• Maintained by Georgetown University Medical Center
• PIR format
https://proteininformationresource.org/
Bioinfo@AmS 3
Protein Information Resource (PIR)
Bioinfo@AmS 4
https://proteininformationresource.org/cgi-bin/textsearch.pl
Bioinfo@AmS 5
https://proteininformationresource.org/cgi-bin/ipcEntry?id=A0A6J4EE43
Bioinfo@AmS 7
FASTA File format
https://proteininformationresource.org/cgi-bin/comp_mw.pl?ids=A0A6J4EE43
Bioinfo@AmS 8
Universal Protein Resource (UniProt)
Universal Protein Resource (UniProt): UniProt (2003) is a central database of protein sequence and
function, created by joining the forces of the SWISS-PROT, TrEMBL and PIR protein database
activities.
The centerpiece of the UniProt databases is the UniProt Knowledgebase (UniProtKB), which
comprises two sections: the manually annotated UniProtKB/Swiss-Prot and the automatically
(computationally) annotated UniProtKB/TrEMBL.
Bioinfo@AmS 9
Universal Protein Resource (UniProt)
Bioinfo@AmS 10
SWISS-PROT/UniProtKB
Bioinfo@AmS 11
TrEMBL/UniProtKB
Bioinfo@AmS 12
UniProtKB
https://www.uniprot.org/
Bioinfo@AmS 13
UniProtKB
https://www.uniprot.org/uniprotkb?query=insulin
Bioinfo@AmS 14
UniProtKB
https://www.uniprot.org/uniprotkb/P01308/entry
Bioinfo@AmS 15
FASTA File format
https://rest.uniprot.org/uniprotkb/P01308.fasta
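The header line of a UniProtKB FASTA file encodes which section of UniProtKB an entry comes from: `sp` for Swiss-Prot and `tr` for TrEMBL. The sketch below parses such a header. The example header for P01308 is written out from the documented header layout and should be checked against the downloaded file; the function name `parse_uniprot_header` is hypothetical.

```python
import re

DB_NAMES = {
    "sp": "UniProtKB/Swiss-Prot (manually annotated)",
    "tr": "UniProtKB/TrEMBL (automatically annotated)",
}

# Layout: >db|Accession|EntryName ProteinName OS=... OX=... GN=... PE=... SV=...
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
FIELD_KEYS = ("OS=", "OX=", "GN=", "PE=", "SV=")


def parse_uniprot_header(header):
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError("not a UniProtKB-style FASTA header")
    db, accession, entry_name, description = match.groups()

    name_words, fields, key = [], {}, None
    for word in description.split():
        if word[:3] in FIELD_KEYS:
            key = word[:2]            # start of a new KEY=value field
            fields[key] = word[3:]
        elif key is None:
            name_words.append(word)   # still inside the protein name
        else:
            fields[key] += " " + word  # multi-word value, e.g. "Homo sapiens"

    return {
        "database": DB_NAMES[db],
        "accession": accession,
        "entry_name": entry_name,
        "protein_name": " ".join(name_words),
        "fields": fields,
    }


# Assumed header for human insulin; verify against the real file.
info = parse_uniprot_header(
    ">sp|P01308|INS_HUMAN Insulin OS=Homo sapiens OX=9606 GN=INS PE=1 SV=1"
)
print(info["database"], info["accession"], info["fields"]["OS"])
```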
Bioinfo@AmS 18
Protein Structure Databases
Bioinfo@AmS 19
PDB
http://www.rcsb.org/structure/4HHB
https://files.rcsb.org/view/4HHB.pdb
Bioinfo@AmS 20
Protein Data Bank (PDB)
Bioinfo@AmS 21
PDB file format
http://www.rcsb.org/structure/4HHB
https://files.rcsb.org/view/4HHB.pdb
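The PDB format is a fixed-column text format: each field of an ATOM record occupies specific character positions. The sketch below assembles one ATOM line field by field so that the column layout is visible, then reads it back by slicing. The atom values are illustrative and are not guaranteed to match any line of 4HHB.

```python
# One ATOM record assembled field by field (columns are 1-based in the comments).
ATOM_LINE = "".join([
    "ATOM".ljust(6),    # columns  1-6 : record name
    "1".rjust(5),       # columns  7-11: atom serial number
    " ",                # column  12   : blank
    " N".ljust(4),      # columns 13-16: atom name
    " ",                # column  17   : alternate location indicator
    "VAL",              # columns 18-20: residue name
    " ",                # column  21   : blank
    "A",                # column  22   : chain identifier
    "1".rjust(4),       # columns 23-26: residue sequence number
    " " * 4,            # columns 27-30: insertion code and blanks
    "6.204".rjust(8),   # columns 31-38: x coordinate (angstroms)
    "16.869".rjust(8),  # columns 39-46: y coordinate
    "4.854".rjust(8),   # columns 47-54: z coordinate
    "1.00".rjust(6),    # columns 55-60: occupancy
    "49.05".rjust(6),   # columns 61-66: temperature factor
    " " * 10,           # columns 67-76: blanks
    "N".rjust(2),       # columns 77-78: element symbol
])


def parse_atom(line):
    """Read the main fields of an ATOM record by column position."""
    if line[0:6].strip() != "ATOM":
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


atom = parse_atom(ATOM_LINE)
print(atom)
```

Slicing by position rather than splitting on whitespace matters here, because adjacent fields in real PDB files can touch each other without any space between them.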
Bioinfo@AmS 22
Protein Structure Databases
https://www.cathdb.info/
Bioinfo@AmS 25