
Module 3 and 4

“I’m not going into too much detail...” - AJ

Lecture 1

(Sense strand/coding strand → runs 5’→3’
Anti-sense strand/non-coding strand/template strand → runs 3’→5’
The mRNA that is eventually formed is identical to the sense strand except for the usual
DNA/RNA differences, e.g. U in place of T.)
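A minimal sketch of the strand relationship above, assuming a short made-up sense strand: the template strand is the base-by-base complement, and transcribing the template gives an mRNA that matches the sense strand with U in place of T. The function names and the example sequence are illustrative only.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

def template_of(sense_5to3):
    """Return the template strand, written 3'->5', position by position."""
    return "".join(DNA_COMPLEMENT[base] for base in sense_5to3)

def transcribe(template_3to5):
    """Return the mRNA (5'->3') made by pairing RNA bases against the template."""
    return "".join(RNA_COMPLEMENT[base] for base in template_3to5)

sense = "ATGGCCTAA"             # hypothetical sense strand, 5'->3'
template = template_of(sense)   # antisense strand, 3'->5'
mrna = transcribe(template)
print(template, mrna)
```

The mRNA comes out as the sense strand with every T swapped for U, which is why sequence databases can store the coding strand and still describe the transcript.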

The Central Dogma is as usual; the new thing is reverse transcription, i.e. RNA can go back to
DNA.
Reverse transcription → a positive strand of RNA is converted into DNA with the help of
reverse transcriptase (we get these enzymes from retroviruses).
A positive RNA strand has the same sense as mRNA (5’→3’) and can be easily changed to DNA,
whereas a negative RNA strand is its complement (3’→5’), so it first needs to be converted to a
positive strand and then to DNA.

X-dependent Y polymerase: uses a template of X to synthesize Y.
So, in the case of DNA replication, it’s a DNA-dependent DNA polymerase.
When converting negative to positive RNA in RNA replication, for example, the
RNA-dependent RNA polymerase kicks in, i.e. it uses an RNA template to synthesize
RNA segments.
In transcription, we have a DNA-dependent RNA polymerase, e.g. RNA pol II.

In FASTA format, we can store both DNA and protein sequences

Note that despite this being an example of an mRNA, we use T instead of U as a
convention.
FASTA header line → IDs | accession numbers | description | length in bp
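To make the FASTA layout concrete, here is a small sketch of a parser. It only assumes the basic convention that a record starts with a ">" header line followed by one or more sequence lines; the IDs, accession numbers and sequences in the example are made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>seq1|ACC0001|hypothetical mRNA|12 bp
ATGGCC
TTTTAA
>seq2|ACC0002|hypothetical peptide
MAF
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header.split("|"), len(sequence))
```

The same function handles DNA and protein records, since FASTA itself does not distinguish between them.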

cDNA normally refers to a sequence that codes for a protein, since it is formed
by converting mRNA into DNA (and that mRNA is expected to have undergone
splicing, so it only contains exons).

ORFs are the regions from the start to the stop codon.

For Eukaryotes → In DNA, the coding gene from the start to the stop codon, as we know
it, is present in chunks, i.e. each chunk (exon) is separated by non-coding parts (introns).

These introns are removed by a process called splicing, leaving only exons. The exons
can then combine in different ways, which leads to different mRNA variants of the same gene.

It is even possible for an exon to be skipped in the process of constructing the final
mRNA, a process called exon skipping.
The combination of exons is decided during splicing itself; it is not the case that the introns
are removed first and the arrangement of exons decided afterwards.
Given a double-stranded DNA, there are 6 reading frames (3 on each strand), but only one true ORF.
Another way of saying this is that there are 6 candidate ORFs, but only one
true ORF.

ExPASy Translate tool → give it an mRNA sequence and it will return a protein sequence.
If something is transcribed but not translated, it won’t have an ORF. For instance,
tRNA and rRNA don’t have any ORFs. By ORF we mean the region
from the start codon to the stop codon. So just because something is transcribed, it does
not necessarily contain an ORF.
All protein-coding genes are ORFs, but not all ORFs are genes → for this course, we assume the
longest of the possible ORFs to usually be the gene.

The process of finding an ORF

The higher the number of codons in an ORF, the higher the probability that the ORF is a
gene.
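The ORF search described above can be sketched as follows. This is a simplified illustration, not the method of any particular tool: it assumes ATG as the only start codon, the three standard stop codons, and a made-up input sequence, and it scans the three frames of each strand, returning the ORFs longest first.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, offset):
    """Yield ORFs (ATG through the first in-frame stop codon) in one reading frame."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "ATG" and start is None:
            start = i
        elif codon in STOP_CODONS and start is not None:
            yield seq[start:i + 3]
            start = None

def find_orfs(dna):
    """Return every ORF found in the six reading frames, longest first."""
    found = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            found.extend(orfs_in_frame(strand, offset))
    return sorted(found, key=len, reverse=True)

dna = "CCATGAAATTTGGGTAGCCATGCCCTAA"  # hypothetical sequence
orfs = find_orfs(dna)
print(orfs)
```

Under the course's rule of thumb, `orfs[0]` (the longest ORF) would be the first candidate for the gene.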

It is important to remember that promoters are
i. Cis-regulatory in nature, i.e. present right next to the genes they control
ii. Present upstream of our gene
iii. Present on the same strand as our gene

(Just to revise a bit: RNA polymerase reads the antisense strand, i.e. the template
strand, so the promoter has to direct it onto that strand.)

Restriction sites → specific sequences 4 to 8 base pairs in length, recognized by
restriction enzymes.

Restriction endonucleases → detect certain palindromic sequences in the DNA and
cut them.
Palindromic sequences are sequences that read the same 5’→3’ on both strands, for
example GAATTC, whose complementary strand also reads GAATTC in the 5’→3’ direction.

If the palindromic sequence is methylated, however, the restriction enzymes do not act
upon it. This is the very mechanism bacteria use to cut only the
palindromic sequences in viral DNA and not those in their own genome, since those are
methylated.

To find restriction sites, just look for palindromic sequences.
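A small sketch of "look for palindromic sequences": a window is a candidate site when it equals its own reverse complement. The 4 to 8 bp range comes from the notes above; the function name and the example sequence are assumptions for illustration. Real restriction enzymes only recognize particular palindromes, so this finds candidates, not confirmed sites.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def palindromic_sites(dna, min_len=4, max_len=8):
    """Return (position, site) for every window equal to its reverse complement."""
    hits = []
    # Only even lengths are checked: in an odd-length window the middle base
    # would have to be its own complement, which is impossible.
    for size in range(min_len, max_len + 1, 2):
        for i in range(len(dna) - size + 1):
            window = dna[i:i + size]
            if window == reverse_complement(window):
                hits.append((i, window))
    return hits

sites = palindromic_sites("TTGAATTCAA")  # hypothetical sequence containing GAATTC
print(sites)
```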

Expressed Sequence Tags

ESTs are short sequenced segments of cDNA, i.e. of expressed mRNA; here we work with the protein segments they encode.

So let’s say we have a small ORF X in our genome, and when we
compare it with a protein sequence, we find that the amino acids coded by X match
a small segment of that protein (i.e. an EST) with high stringency → in this case it is
highly likely that segment X is part of a gene.

As we discussed earlier, there are exons and introns present. So, if we know an EST,
and find a comparable DNA sequence which codes for this EST with high stringency, by
comparing them both closely we can also figure out the exon boundaries in the genome
which are actually coding for the EST.
In short, knowing the ESTs and their gene can help us know where each exon starts
and stops.
Single, di- and trinucleotide signals are present at exon boundaries, and these guide the
splicing process.

Finding the gene sequences of prokaryotes is easy → just check the ORFs, and the
longest is probably a gene.
For eukaryotes it’s not so simple, since they have introns, exons etc. So in this
case, it is necessary to know the ORFs, the direction of transcription (which is
determined by the promoter and signal predictors), and lastly the splice sites, found by
splice site predictors, which guide the splicing process.

Histones have positively charged amino acids, i.e. arginine and lysine, to which the negatively
charged DNA can attach and so be wrapped around the histones. The part of the DNA that wraps
around the histone octamer is repetitive.
Prokaryotes don’t have introns or histones.
If a certain sequence is highly conserved over generations, it usually means that
sequence is critical to the functioning of that species.
(Just to revise, DNA wraps around the histone octamer twice, and the histone proteins
are post-translationally modified, which is essential to their functioning.)
Lastly, the possible codons for a single amino acid are not all used with equal
probability.

Introns are generally much larger than exons.
TFIID (which contains the TATA-binding protein) binds to the promoter and then recruits other transcription
factors and RNA pol II to start transcription.

Remember that in prokaryotes, the TATAAT (TATA) box is present at position -10 from the TSS,
and at -35 from the TSS we have the sequence TTGACA. Eukaryotes only have the
TATA box, roughly 25 to 30 bp upstream of the TSS.
The Shine-Dalgarno (SD) sequence is present in prokaryotic mRNA and is the site
where the small subunit of the ribosome recognizes the mRNA. It lies upstream of the
translational start site, i.e. the start codon.

Transcription factors have certain binding motifs. For instance, the GAGA factor only
binds to regions where the GAGA sequence is present.
UTRs are the transcribed but untranslated regions: upstream of the translation start site (i.e. the 5’ UTR) and
downstream of the translation stop site (i.e. the 3’ UTR).

We found that our software couldn’t predict a lot of genes that were experimentally shown to be
expressed. Why? Because our algorithms couldn’t take into account alternative splicing
and post-translational modifications.

As the length of a DNA sequence increases, the probability of it occurring by chance decreases, i.e.
there are 4^n possible sequences of length n, so a given one occurs with probability 1/4^n.
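As a quick numeric check of the 4^n argument, the sketch below assumes all four bases are equally likely and independent, which is a simplification of real genomes. The function name is illustrative.

```python
def random_match_probability(n):
    """Probability that a random position matches one specific sequence of length n."""
    return 1 / 4 ** n

for n in (4, 6, 8):
    # Expected spacing between chance occurrences is 4**n bases.
    print(n, 4 ** n, random_match_probability(n))
```

Under this assumption a 4 bp site turns up about once every 256 bases by chance, while an 8 bp site turns up about once every 65,536 bases.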
Tryptophan and Methionine are only coded by one codon

Genomics → genetics/genome/DNA content
Transcriptomics → mRNAs
Proteomics → study of all the proteins present (the proteome)

i. Sequence of the protein

ii. Structure of the protein → this is what determines its function, and the structure can
have conformational changes, so the function of a protein can change.

Codon usage patterns can be measured, but not predicted.
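Measuring codon usage, as opposed to predicting it, can be sketched by simply counting codons in a known coding sequence. The sequence below is made up; in it leucine is encoded twice by CTG and once by CTA, so the two synonymous codons are not used equally.

```python
from collections import Counter

def codon_usage(orf):
    """Count codons in an in-frame coding sequence and return relative frequencies."""
    usable = len(orf) - len(orf) % 3          # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: count / total for codon, count in counts.items()}

usage = codon_usage("ATGCTGCTGCTATAA")  # hypothetical ORF: ATG CTG CTG CTA TAA
print(usage)
```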

Lecture 2

Percentage of genome which codes is VERY low → 1%

45-50% DNA is repetitive → region of the heterochromatin (not transcribed)


We do not know the total number of proteins in our body, since their number is much
bigger than the number of genes present. Why?

1. Alternative splicing → different exons can join in different combinations and form different
kinds of mRNAs and hence proteins.

2. The proteins made from these different mRNAs are then post-translationally modified in many
different ways, leading to further variation.

3. Proteoforms → some peptides can be added, or some amino acids cleaved, at
the translation level etc.

Hence, a gene sequence alone is not enough to find the function of the protein.
Phosphorylation is a very common modification, since ATP, which donates the
gamma phosphate to be attached, is widely available. Once again, the gamma phosphate
is added to serine, threonine, or tyrosine,
and the negative charge of the phosphate changes the structure of that protein, which
can change how it works.

Benefits of Comparative Genomics → Important

Websites
GenBank → stores DNA sequences

Uniprot → stores Protein sequences
PDB → stores Protein structures

BLAST → used for aligning and comparing two DNA/protein sequences

Alignment goes something like this

BLAST breaks the query into short words, looks these words up in the database, and extends the
matches into alignments with the target sequences.
BLOSUM62
A substitution matrix that scores the chances of one amino acid converting into
another, based on the substitutions observed in alignments of related proteins.

0 means the conversion of one amino acid to the other occurs as often as expected by chance
Negative score means the conversion is less likely than by chance (i.e.
evolution discourages it)
Positive score means the conversion is more likely than by chance, i.e. it is
something that was tolerated/might have some kind of advantage

Do know how to calculate scores for conversions!
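One way to see where the sign of a score comes from is the log-odds form sketched below. It assumes the half-bit convention (2 × log2 of observed over expected frequency, rounded to an integer); the frequencies are invented for illustration, and the exact procedure expected in the course may differ.

```python
import math

def log_odds_score(observed, expected):
    """Half-bit log-odds score: 2 * log2(observed / expected), rounded to an integer."""
    return round(2 * math.log2(observed / expected))

# Hypothetical pair frequencies: observed in alignments vs expected by chance.
print(log_odds_score(0.01, 0.01))    # seen exactly as often as chance predicts
print(log_odds_score(0.04, 0.01))    # seen more often than chance
print(log_odds_score(0.0025, 0.01))  # seen less often than chance
```

A pair seen as often as chance predicts scores 0, a pair seen more often scores positive, and a pair seen less often scores negative, which matches the three cases listed above.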

There is a threshold score; sequences scoring at or above it are accepted. We
can vary this threshold value as per our requirements.

BLAST
blastn:

Takes in nucleotide sequences and compares them with other nucleotide sequences
from the database.
blastp
Compares an amino acid query sequence against amino acid sequences from the database.
blastx

Takes a nucleotide sequence, translates it in all 6 possible reading frames, and compares all
the resulting translations with protein sequences in the database.
So it might be better to use than blastn when we have an unknown nucleotide
sequence.

tblastn
Compares a protein query sequence against a nucleotide sequence database translated
in all reading frames.
Note that we can’t predict the nucleotide sequence simply from the protein sequence,
since one amino acid can be encoded by multiple codons, and there could be several
introns that were removed along the way.

So what tblastn does to deal with this is that it takes a nucleotide sequence from the
database, translates it in all 6 reading frames, and compares these
translations with our query protein sequence!

tblastx
Compares the six-frame translations of a nucleotide query sequence against the six-
frame translations of a nucleotide sequence database
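The six-frame translation that blastx, tblastn and tblastx rely on can be sketched like this. It uses the standard genetic code written as a 64-letter string in TCAG order; the function names and the input sequence are illustrative, and this is not BLAST's own implementation.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated TTT, TTC, TTA, TTG, TCT, ... GGG.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frame_translation(dna):
    """Return translations of the 3 forward frames, then the 3 reverse-strand frames."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]

frames = six_frame_translation("ATGGCCTAA")  # hypothetical query
print(frames)
```

Each of the six strings would then be compared against protein sequences, which is how a nucleotide sequence can be matched at the protein level without knowing its true frame in advance.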

Homologs
Features, including DNA and protein sequences, in the species being compared that are
similar because they are ancestrally related.
Orthologs
Homologous genes or any DNA sequences that separated because of a speciation
event.
Derived from the same gene in the last common ancestor.
e.g. insulin producing genes in different species.
Paralogs
Homologous genes that separated because of gene duplication events within the same
species.
e.g. different insulin producing genes in humans

Lecture 3
Remember the positive (arginine and lysine) and negative (aspartate and glutamate) amino acids.
The rest are polar or non-polar but don’t have a net charge.

Proteins are 3D→changes in protein structure can cause a change in its function.
Dihedral Angles:

Angles between the two planes defined by four consecutive atoms (atoms 1-2-3 and atoms 2-3-4)

Distance between 1 and 4 is decided by the rotation about the bond between 2 and 3.

1. Phi (ϕ) → controls the C’-C’ distance
(rotation about the N-Cα bond controls the distance)

2. Psi (ψ) → controls the N-N distance
(rotation about the Cα-C’ bond controls the distance)

3. Omega (ω) → controls the Cα-Cα distance
(rotation about the C’-N bond controls the distance)

(C’ is the carbon to which the oxygen is attached, and Cα is the carbon which is chiral.)

In short, assume the bond between the 2nd and the 3rd as a rod, which can rotate in a
way to bring the 1st and the 4th atom closer or farther from each other.
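The "rod" picture can be checked numerically. The sketch below computes a dihedral angle from four 3D points using the usual atan2 formula; the coordinates are made up so that the 2-3 bond lies along the z axis, and the sign convention is an assumption.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle in degrees for four consecutive atoms."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals of planes 1-2-3 and 2-3-4
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

p1, p2, p3 = (1, 0, 0), (0, 0, 0), (0, 0, 1)
cis = dihedral(p1, p2, p3, (1, 0, 1))     # atom 4 on the same side as atom 1
trans = dihedral(p1, p2, p3, (-1, 0, 1))  # atom 4 rotated to the opposite side
print(cis, trans, math.dist(p1, (1, 0, 1)), math.dist(p1, (-1, 0, 1)))
```

Rotating atom 4 about the 2-3 bond from 0° to 180° moves it from 1 unit to √5 units away from atom 1, without changing any bond length.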

Ramachandran → proposed that only particular psi and phi angles can exist.
His plot showed that there are only a few permissible phi and psi angles.
(Omega is nearly fixed, usually at about 180°, because the peptide bond is planar.)

Edman Degradation (outdated technique) → removes amino acids from the N
terminus one at a time, which can be used to determine the protein sequence. Restricted to 60
amino acids, and laborious, i.e. only 50 amino acids can be determined per day.

Mass Spectrometry Proteomics

1. Sample is ionized

2. There is a detector which detects the ionized samples

3. We get a mass spectrum which gives us the different m/z ratios of the ions and
their relative abundances

Heavier fragments → Less Deflection
Lighter fragments→ More deflection

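Mass-to-charge ratio → find using m/z = (M + z)/z, where M is the mass of the neutral molecule and each of the z charges comes from an added proton (about 1 Da)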
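A small sketch of the mass-to-charge relation, assuming positive ions formed by adding z protons to a neutral molecule of mass M, so that m/z = (M + z × proton mass)/z. The 1000 Da example is made up.

```python
PROTON_MASS = 1.007276  # Da, mass added per positive charge

def mass_to_charge(neutral_mass, charge):
    """m/z of a molecule carrying `charge` extra protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge

def neutral_mass_from(mz, charge):
    """Invert the relation to recover the neutral mass from an observed m/z."""
    return mz * charge - charge * PROTON_MASS

for z in (1, 2, 3):
    print(z, round(mass_to_charge(1000.0, z), 4))
```

The same molecule therefore shows up at several m/z values, one per charge state, and each of them maps back to the same neutral mass.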


Algorithm of searching Protein spectra

1. Take a protein sample

2. Separate it into single proteins

Protein separation is necessary before the protein is further analyzed. Having
multiple proteins in the same sample can mess up our graphs, as
there can be a lot of overlap between the signals of the multiple proteins.

3. Put that single protein into a mass spectrometer after ionizing it.

4. The mass spectrometer gives us the intact protein mass, i.e. MS1 data, which is the mass
of the entire protein.

5. From this intact protein mass, different fragments are created. The masses of these
fragments are called the MS2 data.

6. That’s all you need to know over here (AJ king)

Predicting protein sequences using the mass spectrum

Now that we have MS2 data, we use them to predict the amino acids by figuring out the
difference between the fragments in the hope that it equals that of an amino acid. Thus,
if the difference in the m/z of two fragments is equal to an amino acid, it just tells us that
that amino acid is present in our protein sequence.
On the intensity-m/z graph, subtract each peak on the left from the peak to its right. If the
difference corresponds to an amino acid, well and good, note it down. If not, keep
moving to the left until the subtraction gives us an amino acid.
As an example:
For instance, we start from A. A-B gives us the mass of Tryptophan (W). We note this as
the first amino acid from the N terminus of the protein. Now we do B-C and find out that it
doesn’t correspond to any amino acid, so we move on and do B-D. Since this gives
us the mass of Proline (P), we write it down as the 2nd amino acid from the N
terminus of the protein. Moving on, we do D-E (NOT C-D, since B-C didn’t give us
anything, so we just ignore C) and it gives us the mass of Glycine (G), so it’s the 3rd
amino acid in order. Hence, our order of amino acids from the N-terminus is WPG.
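The A, B, C, D, E walk-through can be sketched in code. The residue masses below are the monoisotopic masses of tryptophan, proline and glycine; the peak values, the tolerance and the function names are invented to mirror the example, with C playing the role of the peak that matches nothing.

```python
RESIDUE_MASS = {"G": 57.02146, "P": 97.05276, "W": 186.07931}  # monoisotopic, Da

def match_residue(delta, tolerance=0.02):
    """Return the amino acid whose mass equals delta within tolerance, else None."""
    for amino_acid, mass in RESIDUE_MASS.items():
        if abs(delta - mass) <= tolerance:
            return amino_acid
    return None

def extract_tag(peaks):
    """Walk from the rightmost peak leftwards, keeping differences that match a residue."""
    peaks = sorted(peaks, reverse=True)
    tag, current = [], peaks[0]
    for candidate in peaks[1:]:
        amino_acid = match_residue(current - candidate)
        if amino_acid is not None:
            tag.append(amino_acid)
            current = candidate   # continue from the matched peak
        # otherwise the candidate is skipped, like peak C in the example
    return "".join(tag)

# Hypothetical peaks A, B, C, D, E (m/z, singly charged fragments assumed).
peaks = [1000.0, 813.92069, 780.0, 716.86793, 659.84647]
tag = extract_tag(peaks)
print(tag)
```

Only three residues are listed to keep the sketch short; a fuller version would carry the masses of all twenty amino acids and would have to cope with several candidate paths.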

Scoring Sequence Tags
The longer the tag, the smaller the RMSE, and the higher the abundance, the better.

Predicting post-translational modifications


Phosphorylation can happen on serine, threonine, and tyrosine because these three
have an OH group to which the phosphate group can attach.
The phosphorylation patterns of certain residues can be predicted using
in-silico models. Similarly, we can predict the patterns of methylation, acetylation, and
ubiquitylation of lysine.
Lecture 4 and 5
It is possible for proteins to have a very low sequence similarity but a high structural
similarity. If two proteins have the same structure, there’s a good chance for them to
have a similar function as well.
Proteins can form alpha helices and beta sheets. The coils in the following diagram
represent alpha-helices and the arrows represent Beta-sheets

We compare protein structures by superimposing known structures with our query
structure.
So, the protein structure tells us:

1. The function of the protein

2. Its partners which assist the protein in performing the specific function i.e. finding if
the protein is working in a complex with another protein.

PDB File Format

Gives

1. Name of the protein

2. References to the scientists etc. who discovered it

3. Its function (broadly)

4. The sequence of the Protein

5. ATOM/HETATM → Most importantly, the x,y,z coordinates of particular atoms in the
protein structure
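For the ATOM/HETATM part of the file, here is a sketch of reading coordinates. It assumes the fixed-column PDB layout, in which x, y and z occupy columns 31-38, 39-46 and 47-54 of the record; the example record is hypothetical and is built with explicit column widths so the layout is visible.

```python
def parse_atom_line(line):
    """Extract atom name, residue name and x, y, z from a fixed-column ATOM/HETATM record."""
    return {
        "record": line[0:6].strip(),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

# Hypothetical record: name(6) serial(5) gap atom(4) altLoc(1) residue(3) gap
# chain(1) resSeq(4) iCode(1) gap(3), then x, y, z at 8 characters each.
line = (
    f"{'ATOM':<6}{1:>5} {'CA':<4}{'':1}{'ALA':>3} {'A'}{1:>4}{'':1}{'':3}"
    f"{11.104:>8.3f}{6.134:>8.3f}{-6.504:>8.3f}"
)
atom = parse_atom_line(line)
print(atom)
```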

To quantify the difference between two protein structures, we find the
RMSE (usually called RMSD) between the coordinates of their corresponding atoms. The smaller the value of
RMSE, the more similar the two structures are.
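The RMSE between two coordinate sets can be sketched as below. It assumes the two structures are already superimposed and that atoms are listed in the same order in both; the coordinates are made up.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))

structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # every atom shifted by 1 unit in z
print(rmsd(structure_a, structure_a), rmsd(structure_a, structure_b))
```

Real comparison tools first find the rotation and translation that best superimpose the two structures; this sketch only covers the scoring step that follows.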

Experimentally, we use X-ray crystallography and NMR spectroscopy to determine the
structures of proteins, which are then entered into the database.
Resolution → refers to the quality with which we can look at the protein structure.
The higher the resolution, the more accurately we can visualize protein structures.
Resolution is measured in Angstroms (Å), where 1 Å = 10^-10 m. The lower the value in Angstroms, the higher the resolution.

Homology Modelling
Takes a template and a target sequence. The target sequence is the one which we give (e.g. from
UniProt), and the PDB database is searched for templates, i.e. similar sequences whose
structure is known. Similar sequences are
found and the algorithm moves on.
After aligning the two, the alignment of our target sequence is corrected a bit to get a higher
similarity, and a backbone is generated (i.e. the -C’-N-Cα- chain),
primarily with the help of the ϕ and ψ angles of the target. Then loops are modelled
separately from the backbone, since they are more flexible than the backbone. Moving on,
the side-chain modelling occurs.

Homology Modelling Limitations

1. Large bias towards the structure of the template.
Our target protein might very well have a new or unique structure, which
we won’t be able to reach due to the bias towards the already known template.

2. Cannot study conformational changes → protein interactions can change the
structure of the protein, while homology modelling only gives us a single fixed
structure. For instance, a transcription factor binding, or a substrate
binding to the active site of an enzyme, might change the shape of the
protein a bit; this we can’t figure out through homology modelling.

3. Cannot reveal new catalytic/binding sites, e.g. we cannot predict which site of an
enzyme acts as the active site.

Module 4
Lecture 1

Recombinant DNA technology → Any technique with which we can manipulate the
central dogma for our benefit.
There are several organisms nowadays that are genetically engineered.

Restriction Endonucleases :

Detect certain restriction sites, break the phosphodiester bonds of the
nucleotide chain there, and so separate it into fragments.
DNA Ligases :
Act as a glue; can form phosphodiester bonds between two separate nucleotide
chains.

Plasmids:
Extra-chromosomal DNA in bacteria. The following properties of plasmids are important
to remember

1. Circular DNA

2. Extra-chromosomal (separate from the genome of the bacteria)

3. Easily exchangeable between bacteria

4. Ori site (origin of replication) - plasmid replication is independent of the bacterial
genome

5. Selectable marker - specific plasmids contain specific antibiotic resistance genes
which can be used to detect the bacteria containing that plasmid.

6. Multiple Cloning sites (regions having restriction sites, where genes can be
inserted)

Introducing foreign DNA (e.g. a plasmid) into bacteria is called transformation.
Introducing foreign DNA into eukaryotic cells is called transfection.
How to ensure our recombinant plasmid has been introduced into the bacteria?
Take the plasmid, which has some specific antibiotic resistance gene, and introduce it
into bacteria. Some bacteria will accept it while the rest won’t. Assuming our
plasmid carries resistance to, let’s say, penicillin, when we grow our culture of
bacteria in penicillin, which is an antibiotic, all the bacteria not containing the resistance
gene will die. The remaining bacteria will thus be the ones which successfully accepted our
plasmid, and so have penicillin resistance.

Generally, bacteria accept foreign DNA at a very low rate. So bacterial cells are
made competent by forming holes in their cell walls, which
greatly increases their ability to accept foreign DNA. The following methods are used to
produce competent bacteria or, more generally, to deliver DNA into cells:

1. Use chemicals like calcium chloride to make those holes

2. Use electroporation - electric shocks

3. Use Agrobacterium, which can easily move into plant cells and integrate the
plasmids into the genome of plants!

4. Inject the gene at the embryonic stage (microinjection)

Cutting DNA into smaller fragments


All restriction enzymes come from bacteria.
Whenever a bacteriophage attacks a bacterium and releases its DNA inside, the
restriction enzymes of the bacterium cut the viral DNA at restriction sites into
little fragments, while its own DNA, which has those restriction sites too, is protected by
methylation.

Cutting a plasmid at one point results in linear DNA.
Moreover, the way the sequence is cut can vary; it can leave a sticky end with a 5’
protruding end or a 3’ protruding end, or even a blunt end (no protruding ends). The way it is
cut depends on the type of restriction enzyme used.
Sticky ends have a certain affinity for each other, i.e. are kinda ‘sticky’, since they can
form hydrogen bonds with each other before being fully patched up with the help of
DNA ligase.
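A rough sketch of a restriction digest on one strand of linear DNA, using EcoRI as the example enzyme: it recognizes GAATTC and cuts after the G, which leaves a 5’ AATT overhang. Only the top strand is modelled, and the input sequence and function name are made up.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut a linear top strand `cut_offset` bases into every occurrence of `site`."""
    fragments, start = [], 0
    position = dna.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[start:])   # whatever remains after the last cut
    return fragments

fragments = digest("AAAGAATTCTTTTGAATTCCC")  # hypothetical sequence with two EcoRI sites
print(fragments)
print(sorted(len(fragment) for fragment in fragments))
```

Every fragment produced by a cut begins with AATTC on this strand, so fragments cut by the same enzyme carry matching ends and can pair with each other before ligation.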

So, to get matching sticky ends, our plasmid and the gene sequence that we want
to add into it should both be cut by the same restriction enzyme, such that
they have complementary sticky ends which can pair up.

For blunt ends, however, it doesn’t matter even if we use different enzymes to cut the insert
gene and the plasmid.
Bacteria almost never take up linear DNA; if they do, they degrade it.
Selection for Recombinant DNA
The beta-galactosidase (LacZ) gene codes for an enzyme that converts X-gal (an artificial
substrate) into a blue product. Luckily, there’s a restriction site within the LacZ gene. So if
we insert our gene into the LacZ region, the LacZ gene is disrupted and no longer codes for
the enzyme which would convert X-gal into a blue product.
Thus we can identify which of the bacteria in a culture accepted our gene if we add X-gal.
All the bacteria that didn’t accept the gene still have LacZ and so
show a blue colour, while those containing our inserted gene appear colourless.

Lecture 2
Cloning vs Expression vectors

Cloning vectors → a gene is inserted into a plasmid and we want lots of copies of that gene.
For that we need a cloning vector, in which we don’t add a promoter upstream of our
gene, so the gene isn’t expressed. Thus, in this case the plasmid keeps replicating and
we get several copies of our gene without it being translated.

Expression vectors → a promoter is added upstream of the gene, and often additional
stop codons are added in case the inserted gene did not have stop codons. The gene
added in this case is expressed as protein at a very high rate, especially since the
plasmids are replicating too.

The locations of the ampicillin resistance gene and the origin of replication don’t matter, as the
ampicillin resistance gene has its own promoter anyway.

Plasminogen is inactive originally and is converted into plasmin by TPA. Plasmin
dissolves clotting proteins, and hence reduces the probability of heart attacks, strokes and
other clotting problems.

To deal with this, we decided to use streptokinase from bacteria, but it had severe
negative effects as well:
1. Streptokinase goes into overdrive, causing excessive blood thinning, which means that
blood cannot clot when it needs to (Sir Jawad Saeed)
2. It triggered an immune response

Hence, the final solution was to genetically engineer TPA, which
activates plasminogen to plasmin.

Genomic Libraries
Contain introns, exons, everything

cDNA Libraries
Contain only the coding DNA of a given cell.
Different cells of a body do not have the same cDNA library, since not all genes are expressed in every cell.

PCR

1. Strands separated with heat

2. Primers are added to the separated strands so that the DNA polymerase can act on them

3. DNA polymerase along with nucleotides is added into the solution so the DNA
replicates. Note that Taq DNA polymerase is used, which has a very high optimum
temperature.

4. Repeat
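Because every repeat of the cycle copies each existing molecule, the number of copies grows exponentially. The sketch below assumes ideal doubling per cycle by default; the `efficiency` parameter is an illustrative addition for less-than-perfect cycles and is not something the notes specify.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Copies after `cycles` rounds; efficiency=1.0 means perfect doubling each cycle."""
    return initial_copies * (1 + efficiency) ** cycles

for cycles in (10, 20, 30):
    print(cycles, pcr_copies(1, cycles))
```

With perfect doubling, one starting molecule becomes about a thousand copies after 10 cycles and about a billion after 30.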

Gel Electrophoresis
A technique to visualize DNA; the process uses an agarose gel.
When agarose is cooling, pores are formed as the gel sets. A lower
concentration of agarose makes larger pores, while a higher concentration makes smaller
pores.
Our DNA travels through these pores, so larger DNA fragments face more resistance and so
move slower than the smaller DNA fragments.

PCR detection of infectious viruses

Reverse transcription PCR (RT-PCR) is used in this case.
RNA extracted from the virus is converted into cDNA using reverse transcriptase, which
is then amplified in much the same way as discussed before.

Genome Editing → changes in the genome without using restriction enzymes
Homologous recombination is a process through which foreign DNA can be
incorporated into the host genome. There are certain ‘sites’ for recombination where the
content between the foreign and host DNA can be exchanged, much like crossing over in
meiosis.

Targeted Nucleases for genome editing


Zinc finger Nucleases (ZFN)

Each zinc finger is specific to 3 base pairs → different zinc fingers detect different 3 bp sequences. Several
zinc fingers can be joined together to bind the specific sequence we want, and
this array can be fused to a FokI endonuclease, which then causes a
double-stranded DNA break.

double-stranded break (very disastrous if a mutation):

Transcription activator-like effector nuclease (TALEN)

Very similar to ZFNs, except that each TALE repeat detects one nucleotide at a time, while a
single zinc finger detects 3 nucleotides. We can, again, just attach a FokI endonuclease to
the TALE array, which can cause a double-stranded break at our target.
CRISPR genome-editing technology

Bacteria respond to bacteriophage attacks, as we saw earlier, through
restriction enzymes. Other than that, they also use CRISPR-Cas9.
The bacterium takes short fragments of the virus (which it has after its restriction nucleases have acted
upon the viral DNA) and incorporates these fragments into a region of its genome
known as the CRISPR array, to form a sort of immunity. These fragments in the bacterium’s
genome are called spacers.
When the CRISPR sequence which contains the spacers is transcribed, the resulting
CRISPR RNA is able to detect the same virus if it invades again; the RNA guides Cas9 to the
matching viral DNA, which Cas9 then cuts.

Now that the double-stranded break has been caused, there are several mechanisms to
join the DNA fragments back.

Non-homologous end joining
The double-stranded fragments are simply joined back without restoring the lost
nucleotides. This process causes deletions, and hence mutations, in the
sequence.
Homology-directed repair
In this case, a homologous template is used for the repair, so we can also precisely insert
additional genes at the double-stranded break made by Cas9, or get the original sequence back if we want to.
