Functional genomics notes

DISCLAIMER

These are just my personal edited notes about the course. Before using them as study material, please
consider that:

• These are not lecture transcripts. I always paraphrase what professors say into a sentence I can
easily understand. In addition, as I’m writing I often integrate the concepts explained during
lectures with previous knowledge I acquired in other courses or during my bachelor’s
• These notes contain conceptual mistakes. Given there is no clear reference textbook it is difficult to
check that what I wrote is correct
• There are grammatical mistakes. I’m Italian, I have never practiced English as extensively as during
this master’s degree. Do not expect perfection from these notes
• These notes were taken during the academic year 2021-22. Professors usually change their course
program every year and information you find here might be obsolete in the future.

FOR ALL THE ABOVE REASONS, IT IS ALWAYS BEST TO ATTEND LECTURES DIRECTLY

If you want me to correct a mistake you found in this document, please send a message to
mcdelloca.lab@gmail.com

Brief guide to symbols I often use: […] means a part of the lecture is missing; (?) means I didn’t understand
that part or I’m not sure that what I wrote is correct; text in red can indicate either a possible exam question
or something I added to the document and that was not explained in class.

Contents
Introduction to the course
Genome sequencing
Early sequencing technologies
Next generation sequencing
454-sequencing
Solexa sequencing by Illumina
SOLiD sequencing chemistry
De novo genome sequencing
Past de novo genome sequencing strategies
De novo genome sequencing with SMRT (PacBio)
Nanopore sequencing
Genome sequencing projects
Example: Arabidopsis genome sequence
Gene families
Example: MADS-box gene family
Discovering gene function
Forward vs reverse genetics approaches
Functional genomics in bacteria
Functional genomics in yeast
Mice models
Plants
Drosophila melanogaster
Targeting induced local lesions in genomes (TILLING)
Small RNAs
siRNA
miRNA
Example: P uptake by plants
RNA interference
RNAi construct and gateway cloning
Advantages and applications
RNAi in C. elegans
Genome editing
ZFN & TALENs
CRISPR systems
Base editing systems
Prime editing
Ethics and regulations of genome editing
Transcriptome analysis
In situ hybridisation analysis
Microarray
RT PCR
Tiling array and RNA-seq
Regulatory pathways analysis
Identification of TF gene targets
Single-cell analysis
Identification of direct TF targets & TF consensus sequences
Proteomics
Mass spectrometry for protein identification
Ribosome footprinting with Ribo-seq
Protein-protein interactions
Yeast n-hybrid systems
Fluorescence Resonance Energy Transfer (FRET)
Other fluorescence-based techniques: FRAP, FLIP and BiFC
Protein complexes analysis
Protein structure determination: cryo-EM
Lecture by Chiara Paleni - Genomics for biodiversity conservation

7/10/21

Introduction to the course


Prof Martin Kater (Dutch)

Functional genomics = the study of the function of genes and other parts of the genome in cells, organs and
organisms. Information gathered in model systems can usually be applied to other organisms through
comparative genomics.

Omics -> large datasets, expensive infrastructures, interdisciplinary approaches to analyze and interpret the
data.

• Genome = complete set of genes of an organism or its organelles.
• Transcriptome = complete set of mRNA molecules in a cell, tissue or organ. It depends on the
context (mainly tissue type and developmental stage)
• Proteome = complete collection of proteins in a cell, tissue or organ
• Metabolome = complete set of metabolites in a cell, tissue or organ (mass spectrometry, NMR)
• Phenome = set of all phenotypes a cell, tissue, organ or organism can exhibit.

Gene expression is regulated at multiple steps, usually through more complex mechanisms in eukaryotic cells
(transcription, splicing, nuclear export, mRNA degradation, translation, post-translational modification,
protein degradation, protein transport through the cell to its final destination, modulation by small
molecules, interaction with other proteins).

Bioinformatic approaches allow phylogenetics analyses, deep sequencing, modelling, …

Course is regulatory pathways-oriented. No official textbook. Attendance is important. Exam: 6 questions
(6.5 points each), answer only 5 of those. No oral exam.

Phenotyping. To understand gene function the simplest strategy is analysing the mutant phenotype. In
large, automated facilities it’s possible to analyse plant phenotypic characteristics in controlled conditions
and over a long period of time.

Genome sequencing
Sequencing a genome to:

• Gather data on all its genes
• Study its structure (centromeres, intergenic regions, repetitiveness)
• Annotate the genome (the positions of genes and regulatory structures are mapped on the genome)
• Identify non-coding genes (sRNA, …)
• Find new genes
• Design primers for PCR and other experiments
• Study its evolution and genetic diversity within the same species
• Identify SNPs and connect them to specific conditions in disease models

Early sequencing technologies


1977 – first generation sequencing by Sanger and Maxam & Gilbert (2 different methods). The former is
simpler and largely used even today because it allows higher throughput. This technique allowed for
sequencing of genomes from different species. 2001 human genome project results are published.

In Maxam and Gilbert sequencing, DNA fragments are treated with different chemicals, each corresponding
to one of the nucleotide bases. The small fragments are then separated by polyacrylamide gel
electrophoresis. It is difficult to execute: the concentration of the reagents must be just right.

Dideoxy sequencing by Sanger relies on ddNTPs, which lack the 3’ OH group and cannot form a
phosphodiester bond. A mix of 4 ddNTPs (one for each DNA base), each labelled in a different way, is added
to DNA fragments that undergo replication. Once one of the ddNTPs is incorporated, replication cannot
continue (no 3’-OH). The resulting mixture of DNA fragments is then separated by polyacrylamide gel
electrophoresis. To obtain a good quality signal, adding the correct amount of ddNTP is crucial: too
much and DNA replication will stop prematurely, too little and ddNTPs aren’t effectively incorporated
in the DNA strands. With this technology genome sequencing is long and laborious (2 years to sequence
a plasmid).
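The chain-termination logic above can be illustrated with a small sketch. It is a toy model under stated assumptions: the template is written 3’ to 5’, termination is assumed to happen once at every position, and the function names are made up for illustration. It only shows why sorting the terminated fragments by length gives back the sequence.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sanger_fragments(template_3_to_5: str) -> list[tuple[int, str]]:
    """Return (fragment length, terminal ddNTP) for each possible terminated product.

    Assumption: the template is given 3'->5', so the new strand (5'->3')
    is simply the base-by-base complement.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template_3_to_5)
    # A ddNTP can be incorporated at any position, ending the fragment there.
    return [(i + 1, new_strand[i]) for i in range(len(new_strand))]

def read_gel(fragments: list[tuple[int, str]]) -> str:
    """Shorter fragments migrate further; reading them by increasing length
    gives the sequence of the newly synthesised strand."""
    return "".join(base for _, base in sorted(fragments))

fragments = sanger_fragments("TACGGA")
random.shuffle(fragments)      # the reaction tube holds the fragments in no order
print(read_gel(fragments))     # electrophoresis restores the order by size
```

Each fragment length appears only once here, so the sort is unambiguous. In a real reaction every length is represented by many molecules, which is what makes the band visible.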

Next, the ddNTPs were labelled with 4 different fluorophores with different emission spectra. This allowed
pooling all the fragments in a single electrophoresis lane, with each band emitting light at a specific
wavelength to signal the type of ddNTP that was last incorporated. Then a machine capable of reading the
fluorescence signal was developed. These advances allowed for faster sequencing; however, sequencing a
large genome was still too laborious because of gel preparation.

Next a capillary gel was developed with the same separating capacity as traditional gels. Machines with 96
capillary tubes were able to increase throughput dramatically. Companies bought multiple copies of these
machines to sequence the human genome for the first time. There was a competition between academic
groups and private companies, however both relied on this technology. Once the human genome was
successfully sequenced, other large scale sequencing projects drove the search for alternative methods to
reduce time and costs (the human genome project was super expensive).

8/10/21

Next generation sequencing


454-sequencing
2005 – new technology. Fibre-optic slide of individual wells. One run of about 4 hours reads 25 million
bases at 99% accuracy. Based on pyrosequencing = enzymatic DNA synthesis releases pyrophosphate for each
incorporated dNTP.

1. A sequencing primer is hybridized to a ssDNA template from PCR. Incubated with all necessary
enzymes and reagents
2. First dNTP (one of 4 bases) added to the new strands. PPi is released. The quantity of released PPi is
equimolar to the amount of incorporated dNTP.
3. ATP sulfurylase converts PPi into ATP
4. Luciferase converts luciferin to oxyluciferin by consuming 1 ATP molecule. The reaction releases
photons. Light is measured by detectors. The intensity of generated light is proportional to the
number of nucleotides originally incorporated.
5. To erase the light signal ATP and dNTPs must be removed. The enzyme apyrase removes free
dNTPs and ATP. When degradation is complete, another dNTP is added.

In 454-sequencing:

1. Genomic DNA of interest is isolated, fragmented, attached to adaptors and denatured.
2. ssDNA fragments are bound to beads, one ssDNA for each bead (beads are used at a 10X excess).
3. Beads are captured in droplets of PCR reaction mixture in oil. This way, each of the beads will be
segregated inside a bubble of PCR mixture, where only the ssDNA it is attached to will be amplified.
The result is one bead covered by a high quantity of ssDNA amplified from the same template,
meaning ssDNAs with the same sequence.
4. After DNA denaturation, each of the beads is placed in one well of the fibre-optic slide and then is
immobilized. This ensures that each well in the slide contains multiple copies of the same template
to be sequenced.
5. A machine pumps one type of dNTP at a time over all the wells, then the detector identifies light
signals coming from specific wells and the software records light signals at each cycle.
6. After each cycle, wells are washed with buffer containing apyrase.
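The flow-by-flow readout described above can be sketched as follows. This is a simplified model for a single well, not the instrument's software: the flow order `TACG`, the function names and the idealised signal (exactly one light unit per incorporated nucleotide, no noise) are assumptions made for illustration.

```python
def pyro_flowgram(new_strand: str, flow_order: str = "TACG", n_flows: int = 8):
    """Simulate the light signal recorded in one well.

    new_strand: the strand being synthesised (complementary to the template).
    For each flow, one dNTP type is pumped in; the signal is proportional to
    how many consecutive positions incorporate that dNTP (homopolymer length).
    """
    position = 0
    signals = []
    for flow in range(n_flows):
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        while position < len(new_strand) and new_strand[position] == nucleotide:
            incorporated += 1
            position += 1
        # Apyrase wash happens here in the real protocol: signal is reset.
        signals.append((nucleotide, incorporated))
    return signals

def call_bases(signals) -> str:
    """Turn the flowgram back into a sequence: signal 0 means no base, 3 means e.g. GGG."""
    return "".join(nucleotide * count for nucleotide, count in signals)

flows = pyro_flowgram("TTACGGGA")
print(flows)
print(call_bases(flows))
```

The sketch makes one design point visible: homopolymers (such as `GGG`) are read in a single flow from the signal intensity, which is why accurate intensity measurement matters in pyrosequencing.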

Solexa sequencing by Illumina


1 billion bases per run, high quality, <1% of the cost of capillary based methods. Faster and cheaper
genome sequencing. Up to 200-bp long DNA fragments can be sequenced by Illumina.

Watch video. 4 steps: sample preparation, cluster generation, sequencing and data
analysis.

Sample preparation. Purified DNA molecules are processed to attach adapters at their ends. Various library
preparation protocols are available. For example, in tagmentation engineered transposons simultaneously
cut dsDNA and insert adapter sequences at each overhang. Transposomes have free DNA ends and insert
randomly into DNA in a “cut and paste” reaction. Because the DNA ends are free, this effectively fragments
the DNA while adding on the sequences required for PCR amplification and sequencing. Two transposomes
are mixed in equimolar ratios and each carries one of the two sequences required for library PCR. The
resulting dsDNA fragments carry 2 different adapters, one at each end. Once adapters have been attached,
a PCR reaction can be performed to add new sequences at the fragment ends. Special primer pairs anneal to
the adapter sequences and carry the new sequences on their 5’ tail. The resulting amplified dsDNA
fragments contain 2 modified adapter sequences, one at each end, organised as follows:

• The binding site specific for one of the 2 sequencing primers
• An index sequence to track and distinguish the origin of the different DNA fragments (from different
cells or individuals) that are pooled and sequenced together
• A region complementary to one of the flow cell oligos.

Cluster generation. Clustering is a process where each fragment is isothermally amplified in a specific
cluster of the same flow cell. The flow cell is a glass slide with lanes. Each lane is a channel with 2
different oligos attached to its surface. Once the sample is loaded on the flow cell, oligos will initiate
annealing with their complementary sequences in the adapter region of each ssDNA. Then, a DNA
polymerase will elongate the oligo 3’ end using the DNA fragment as template. Once replication is
completed, the original DNA fragment is washed off and its antisense copy remains attached to the flow
cell. Then, the strands are clonally amplified through bridge amplification. The strand folds over and the
free adapter sequence hybridises to the second type of oligo attached to the flow cell, thus forming a
“bridge”. Once again, DNA pol elongates the second oligo using the strand as template, forming a dsDNA
bridge. The bridge structure is then denatured, yielding a ssDNA forward strand and a ssDNA reverse strand
tethered to the surface of the flow cell. Bridge amplification is repeated for a defined number of cycles to
perform clonal amplification of the original strand. The process occurs in parallel in all clusters of the same
flow cell, resulting in the amplification of all fragments. After bridge amplification has been completed, the
reverse strands are cleaved from the flow cell and washed off, leaving only multiple copies of the forward
strand at each cluster. The 3’ end of each strand (the one that is not attached to the flow cell) is protected
to prevent unwanted elongation reactions in the next phase.

Sequencing by synthesis. The first sequencing primer is added and it hybridises to the free end adapter of
each strand in the flow cell. Then, dNTPs are added to the elongating strand following complementarity
rules. Each incorporated dNTP emits light at a characteristic wavelength. Light emission at each cluster is
detected by the optical component of the Illumina machine, which then determines the corresponding base
of the read. In cyclic reversible termination sequencing, each dNTP is fluorescently labelled and has its
3’-OH group reversibly blocked, so only one nucleotide is incorporated per cycle. Unincorporated dNTPs are
washed away to leave at each cluster only the light emission corresponding to the incorporated dNTP.
After light emission has been detected, the fluorescent dye and terminator group are cleaved, thus
restoring the 3’-OH and allowing another cycle to be performed. The number of dNTP incorporation cycles
determines the length of the read. For a given cluster, multiple copies of the same fragment are used as
template for read synthesis simultaneously, ensuring proper signal intensity. The first read product is
washed away once sequencing has been completed. Then, an index-specific primer hybridises to the same
template strand to initiate strand elongation in the region corresponding to the first index. Base calling of
this region is used to sequence the index, which carries important information about the origin of the
template.

After the first read and index have been sequenced, the 3’ end of the template is deprotected to allow
annealing to the complementary oligo on the flow cell surface. This structure allows elongation of the oligo
to sequence the second index of the template.

If paired-end sequencing is performed, then the same oligo is elongated forming a dsDNA bridge. After
linearisation, the original forward strand is cleaved and washed off, leaving the newly-synthesised reverse
strand. After annealing of the read 2 sequencing primer, the same sequencing by synthesis steps are
repeated to obtain the second read. Each read corresponds to antiparallel sequences from opposite ends of
the same DNA fragment. Paired-end sequencing is optimal to resolve repeated or longer DNA fragments.
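A minimal sketch of what a read pair looks like relative to the original fragment, assuming error-free reads and a fragment written 5’ to 3’ on the forward strand. The function names are hypothetical and chosen only for this illustration.

```python
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse, to read the opposite strand 5'->3'."""
    return seq.translate(COMPLEMENT_TABLE)[::-1]

def paired_end_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Read 1 starts at the 5' end of the forward strand; read 2 starts at the
    5' end of the reverse strand, i.e. at the opposite end of the fragment."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2

read1, read2 = paired_end_reads("AACCGGTTAG", 4)
print(read1, read2)
# Read 2, reverse-complemented, matches the last bases of the fragment.
print(reverse_complement(read2))
```

Because the two reads come from the same fragment, their expected distance and orientation are known, and this is the information that helps place reads falling in repeated regions.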

Data analysis. The billions of reads generated in a single sequencing experiment must be reorganised to be
properly analysed. First, reads are pooled based on their respective indexes introduced in the library
preparation step. Then, for each group, reads with similar stretches of base calls are locally clustered.
Forward and reverse read pairs are put together to yield contiguous sequences, which are then aligned back
to the reference genome.

The flow cell carries oligos that are complementary to the adaptors fused to the ssDNA fragments. Normally
the 2 different adaptors have a small complementary region but different DNA tails. PCR is then performed
by adding primers that bind only to the adaptor tails. It is also possible to add an index to one adaptor to
track and distinguish the origin of the different DNA fragments (from different cells or individuals) that are
pooled and sequenced together. A DNA library can also be prepared using a transposon that cuts genomic
DNA and adds adaptor sequences. The cut and paste reaction occurs at random positions in the genome.
However, the transposase is bound to 2 different DNA fragments (2 adaptors of the same type), so that the
product of the reaction is 2 genomic fragments with the same type of adaptor at the extremes. This way
PCR is avoided. After adding the adaptors, the ssDNA to be sequenced attaches to the surface of the flow
cell via complementarity between the adaptor and an oligo attached to the surface. Both ends of the ssDNA
may adhere to the surface. Amplification is then initiated to generate dsDNA, which is then denatured to
form more copies of the original ssDNA (bridge amplification). After amplification, sequencing is initiated.
A mixture of the necessary enzymes and reversibly terminated dNTPs labelled with 4 different fluorophores
is added. The machine reads the fluorescence signal of each cluster to determine which dNTP was
incorporated (sequencing by cyclic reversible termination).

The first Solexa sequencing machine provided 120 million 50-bp long reads in one experiment. It proved a
useful tool for deep sequencing as an alternative to microarrays. It’s also used to perform RNA-seq
experiments. Extract RNA, make cDNA, sequence 50 bases by Illumina to identify the type of mRNA and
quantify it in a cell -> study gene expression across different cell types (differential gene expression
studies).

In an Illumina HiSeq machine, the sequencing capacity is so high that sequencing only one sample per run
yields far more reads per fragment than needed (10k instead of 40-50). For this reason, it is preferable
to multiplex = sequence many different sample libraries in the same run. Either samples with very
divergent sequences (distant taxa) are mixed together, so that their reads can be separated based on
homology without mixing up the results, or the different samples carry adaptors with different indexes to
identify and distinguish them. Many different nucleotide barcodes can be used in a single experiment. After
sequencing, all the data collected can be sorted based on barcode sequences.
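Sorting reads by barcode (demultiplexing) can be sketched as below. It is a minimal illustration assuming exact barcode matches and reads already split into an index sequence and an insert sequence. The sample names and barcodes are invented, and real demultiplexing tools usually tolerate some sequencing errors in the index.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample):
    """Group reads by sample according to their index (barcode) sequence.

    reads: iterable of (index_sequence, read_sequence) pairs.
    barcode_to_sample: dict mapping each known barcode to a sample name.
    Reads with an unknown barcode are collected under 'undetermined'.
    """
    bins = defaultdict(list)
    for index_sequence, read_sequence in reads:
        sample = barcode_to_sample.get(index_sequence, "undetermined")
        bins[sample].append(read_sequence)
    return dict(bins)

# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGTAC": "leaf", "TGCATG": "root"}
reads = [
    ("ACGTAC", "GGATTCA"),
    ("TGCATG", "CCTTAGA"),
    ("ACGTAC", "TTGACCA"),
    ("NNNNNN", "AAAAAAA"),
]
print(demultiplex(reads, barcodes))
```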

SOLiD sequencing chemistry


Library preparation. SOLiD sequencing can be performed on two types of libraries: fragment library or
mate-paired. The choice of library preparation protocol depends on the application you're performing and
the information you desire from your experiments. A fragment library is prepared as described in Illumina
sequencing: purify and fragment gDNA, attach 2 adapters, one at each end. In mate-paired libraries,
purified gDNA is sheared and each fragment is ligated with internal adapters at both ends. These adapter
sequences are designed to promote circularisation of the fragment. Then, specific restriction enzymes are
used to cut 27 bp from the circularised adapter sequences, yielding a DNA fragment that contains an
internal adapter and one gDNA fragment at each of its ends. Lastly, sequencing adapters are ligated to the
extremities of this fragment. After sequencing, mapping of mate-paired reads is useful to identify structural
variants such as deletions, insertions and local inversions.

Emulsion PCR and bead enrichment. ssDNA fragments from the library are attached each to its own bead
(10X bead concentration) thanks to surface oligos complementary to the P1 adapter sequence. Then, PCR is
performed to create millions of ssDNA copies attached to the surface of each loaded bead. Empty beads
(the vast majority) must be removed. Polystyrene beads coated with P2-complementary oligos are added to
specifically attach to loaded beads. Bead complexes are separated from empty beads by centrifugation
(complexes with polystyrene beads are less dense than empty beads, so they are found in the supernatant
of the centrifuged solution, while empty beads have precipitated). Finally, the template on the selected
beads undergoes a 3’ modification to allow covalent attachment to the slide.

Bead deposition. 3’ modified beads are deposited onto a glass slide. Deposition chambers enable each slide
to be segmented into one, four, or eight sections. This system allows the accommodation of increasing
bead densities per slide, thus increasing the throughput of one experiment.

Sequencing by ligation.

1. Primers hybridize to the universal P1 adapter sequence on the templated beads.
2. One of the 16 available probes anneals right next to the primer. A probe is an 8-nt long oligo
organised as follows from 3’ to 5’: a pair of specific nucleotides (16 possible combinations), 3
universal bases that can bind to any nucleotide on the template, 3 universal bases attached to
one of four possible fluorescent dyes. The ligase performs ligation of the probe to the primer.
3. Fluorescence signal acquisition. A laser excites the fluorescent dye, which then emits light that is
detected by the machine.
4. The 3 nucleotides of the probe that are attached to the dye are cleaved. This leaves only the
nucleotide dimer and the internal 3 universal bases of the original probe. A free 5’-P end is made
available for the next round of probe ligation.

After one round of sequencing by ligation, only the first 2 bases have been determined. Subsequent rounds
yield fluorescence measurements for every 5th base (since the internal universal bases are not removed
from the ligated probe). For this reason, after one ligation round has been completed the process is
repeated with a new sequencing primer that is offset by 1 base on the adapter. The combination of
sequencing data from ligation rounds that used 5 different starting primers (n through n-4 offset) yields
sequencing data for the entire template strand. Five rounds of primer reset are completed for each
sequence tag. Through the primer reset process, virtually every base is interrogated in two independent
ligation reactions by two different primers. Although each dye corresponds to 4 different base
combinations, fluorescence data from all these ligation rounds is used to decode the actual sequence.
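How a string of colours can be decoded into bases, even though each dye stands for 4 dinucleotides, can be sketched as below. The colour assignment used here (colour = XOR of the two bases with A=0, C=1, G=2, T=3) is the commonly described SOLiD two-base encoding. It is not given in these notes, so treat it as an assumption. Decoding also assumes the first base is known (for example the last base of the adapter).

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3

def encode_colors(sequence: str) -> list[int]:
    """Each adjacent pair of bases gives one colour (0-3).
    With this scheme, colour 0 = AA, CC, GG, TT: 4 dinucleotides per dye."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(sequence, sequence[1:])]

def decode_colors(first_base: str, colors: list[int]) -> str:
    """Knowing one base and the colour of the dinucleotide, the second base is
    determined; repeating this along the read recovers the whole sequence."""
    sequence = [first_base]
    for color in colors:
        previous = BASES.index(sequence[-1])
        sequence.append(BASES[previous ^ color])
    return "".join(sequence)

colors = encode_colors("ATGGC")
print(colors)
print(decode_colors("A", colors))
```

Since every base contributes to two overlapping dinucleotides, a true SNP changes two adjacent colours while a single measurement error changes only one, which fits the statement above that each base is interrogated in two independent ligation reactions.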

A probe with a combination of 2 nucleotides at one end (16 total combinations) is added. If those 2 nts are
complementary to the DNA fragment, the probe is successfully ligated and a fluorophore is cleaved to
generate a signal. This way the 2 bases on the DNA fragment are determined based on the fluorescence
signal. After that, another probe is added and ligated to the free 5’-P end of the previous one. After
washing and primer removal, a new primer is added, which binds with a -1 shift relative to the previous
primer binding sequence.

14/10/21

De novo genome sequencing


Past de novo genome sequencing strategies
In all techniques previously discussed, the read length is quite short.

In the past with Sanger sequencing, the genome or chromosome to be studied was randomly sheared into
fragments to insert into plasmids. This way, the correct order of the fragments is lost. The flanking
sequences in the cloning vector (plasmid) are known, therefore it’s possible to design primers to amplify
the DNA fragment and then sequence it.

Conventional genomic libraries are divided into 3 vector categories: yeast artificial chromosomes (YAC),
bacterial artificial chromosomes (BAC) and phage P1 artificial chromosomes (PAC). YAC and BAC libraries
were the most common. To construct BAC libraries, high molecular weight DNA was purified from nuclei and
partially digested. After size selection, gDNA fragments were ligated into a BAC vector. The resulting
circularised DNA molecule was electroporated into bacteria to promote transformation. The result is an
arrayed storage of bacterial colonies, each containing a BAC with a specific fragment from the original
genome to be sequenced.

After sequencing, all reads from the same genomic library must be correctly aligned using overlapping
sequences. Multiple reads are assembled to form a contig, which corresponds to the longest continuous
sequence obtained by read overlapping and alignment(?). To find overlapping clones, for example, RFLP
analysis was performed on the whole BAC library to identify the clones sharing the same pattern.
Alternatively, it’s possible to design primers specific to a certain read sequence and use them on the
whole BAC library to identify which clones undergo amplification (because they contain the same sequence).
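Assembling reads into a contig through their overlaps can be sketched with a toy greedy algorithm: repeatedly merge the two sequences with the longest suffix-prefix overlap. This is only an illustration under strong assumptions (error-free reads, all from the same strand, no repeats longer than the overlaps). It is not the method used by real assemblers, and the function names are made up.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b
    (0 if it is shorter than min_length)."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str], min_length: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain.
    Returns the resulting contigs (one contig if everything could be joined)."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_length)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)

reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
print(greedy_assemble(reads))
```

If two reads share no sufficiently long overlap they stay in separate contigs, which is the situation where additional information (such as clone fingerprints or read pairs) is needed to order them.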

2 different sequencing approaches were used: hierarchical and whole genome shotgun. The former
consists of dividing the gDNA to be sequenced into smaller, more manageable fragments, each of those to
be further sheared and cloned into a BAC or YAC library. This way, each large fragment is reassembled
independently to reduce contig reconstruction complexity. Finally, the different fragment contigs are
reassembled to form the original genome. This approach was used by academic groups to sequence the
different chromosomes of the human genome. On the other hand, private companies employed the
whole-genome shotgun approach, which skips the initial fragmentation (all short fragments are cloned in
the same library). This approach was faster but potentially more error prone.

In de novo genome assembly, alignment of reads into contigs and subsequent ordering of contigs is
performed to obtain the whole genome sequence. The process requires that each base in the original
genome is sequenced at least 4 times (it should appear in 4 independent reads). Genomic coverage of a
sequencing experiment is usually higher than 4 and it is calculated as:
$$C = \frac{n \cdot I}{L}$$
Where 𝑛 is the number of total reads, 𝐼 is the length of each read (determined by the type of sequencing
equipment used) and 𝐿 is the length of the genomic segment that is being considered. To ensure the
appropriate accuracy, a coverage factor of 20 is needed (every nucleotide in the genome is sequenced
independently 20 times). This however increases time and money spent on larger genome projects.
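As a worked example of the coverage formula, the small sketch below computes the coverage of an experiment and the number of reads needed to reach a target coverage. The function names and the numbers (a 130 Mbp genome, 150 bp reads) are only illustrative assumptions.

```python
import math


def coverage(n_reads: int, read_length: int, genome_length: int) -> float:
    """C = n * I / L: average number of times each base is sequenced."""
    return n_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads that gives at least the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical experiment: 130 Mbp genome sequenced with 150 bp reads
print(round(coverage(10_000_000, 150, 130_000_000), 1))  # 11.5
print(reads_needed(20, 150, 130_000_000))                # 17333334
```

Coverage is an average: some regions are sequenced more often and others less, which is one reason why a coverage well above the minimum is requested.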

Another major factor that increases the technical complexity of de novo mammalian genome sequencing is
the presence of repeated sequences. In mammals, repeated sequences make up about 50% of the total genome
(only 5% in bacterial genomes). This significantly complicates genome assembly, as it is difficult to
reconstruct the correct order of reads that contain repeated sequences without any flanking unique
sequence. In fact, NGS platforms that yield short reads (such as Illumina, up to 150 bp) are poorly suited
to the de novo assembly of complex genomes. Read length must be higher than the length of repeated
sequences in the genome to sample both flanking unique sequences and total repeat length. This achieves
reliable assembly of contigs with repeats of the proper length. The requirement for long reads is
negligible in resequencing projects because short reads are mapped on a previously assembled reference
genome sequence. In reference genomes, the position and length of repeated sequences are already known, so
short read assembly into contigs is not needed.
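The read-length requirement can be written as a simple check: a read resolves a repeat only if it covers the whole repeat plus some unique sequence on both sides. The function name, the 20-base flank and the 6 kb repeat length below are hypothetical values chosen for illustration.

```python
def read_can_resolve_repeat(read_length: int, repeat_length: int, min_flank: int = 20) -> bool:
    """True if one read can span the repeat plus `min_flank` unique bases on each side."""
    return read_length >= repeat_length + 2 * min_flank


# Hypothetical 6 kb repeat
print(read_can_resolve_repeat(150, 6_000))     # False: a short read falls entirely inside the repeat
print(read_can_resolve_repeat(10_000, 6_000))  # True: a long read reaches unique sequence on both sides
```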

De novo genome sequencing with SMRT (PacBio)

Nowadays, the SMRT system by Pacific Biosciences is used to sequence new complex genomes thanks to
longer reads, short run time, high quality sequencing (although sequence quality is lower than Illumina’s)
and lower costs. The core sequencing strategy is based on sequencing by synthesis without amplification
steps. This system detects incorporation of dNTPs into a single DNA molecule. One major technical challenge
of the SMRT system is the removal of background interference (sequencing by synthesis on one molecule
yields a low signal-to-noise ratio) as well as the distinction between signal generated by two different but
close DNA templates. In fact, for proper functioning DNA polymerase requires a high concentration of
labelled nucleotides, which creates a fluorescent background thousands of times brighter than the signal of
a single incorporation event.

The SMRT chip contains thousands of zero-mode waveguides, which are cylindrical holes perforating a thin
metal sheet placed on a glass substrate. Zero-mode waveguide cylinders provide the smallest detection
volume to date. Essentially, this minute grid allows only a fraction of the intensity of the exciting
laser light to enter the well (attenuated light), thus incident light only reaches the bottom of the well
(detection volume of 10^-21 L).

A single DNA polymerase is fixed on the bottom of each hole. A long DNA fragment is put inside this space
together with fluorescently tagged dNTPs (the fluorophore is linked to the phosphate groups). While free
dNTPs diffuse throughout the solution, only one at a time can be added to the elongating strand at the
bottom of the well. Since this part of the well is the only one that the laser can reach, as the labelled
dNTP is kept on the elongating strand by the DNA polymerase for several milliseconds the fluorophore is
excited and emits fluorescent light. This tiny fluorescence signal is detected by the machine. As seen
before, the machine can detect different fluorescence signals from every well at the same time. As the
dNTP is incorporated into the elongating strand, the pyrophosphate linked to the fluorophore is released
into the solution, where it cannot be excited by the laser. These steps are repeated until the template
has been completely sequenced.

This system can be employed to study base modifications (methylation) thanks to DNA pol kinetics
measurements (the duration of a fluorescence signal). The incorporation behaviour of DNA polymerase
changes based on the type of modification on the template nucleotide that is being read. This way, it’s
possible to sequence the genome and its epigenetic modifications at the same time. For example, when the
polymerase encounters a methylated adenine on the template it takes more time to resume dNTP incorporation
compared to non-methylated adenine.

To reduce sequencing mistakes, the de novo genome sequencing consists of:

1. Fragment the genome into large fragments (10k nt is ok)
2. Ligate hairpin adapters to both fragment ends to obtain a circular template (SMRTbell). This way the
same fragment can be sequenced several times
3. After obtaining the fragment sequence, the long reads are easy to assemble into contigs and
eventually the whole genome
4. To polish the genomic sequence, the genome is re-sequenced with Illumina to identify and correct
mistakes.

To summarise, the advantages of SMRT system are:

• Fast runs (from 30 min to 3 hours)
• Longer reads (>10 kb) ->
o full sequence of a genetic locus, it’s easier to identify splicing consensus sequences
o de novo complex genome sequencing
• Lower cost
• Information about the rate of nucleotide incorporation, which can be used to determine the
modification status of the template nucleotide.
• The error model in the SMRT sequencing approach is stochastic (random mistakes throughout the
read, not in specific places). Therefore, sequencing the same fragment multiple times (SMRTbell)
makes it easy to identify those mistakes
• The DNA polymerase used in the machine is a specific mutant that is highly stable to laser light
• Can sequence previously inaccessible regions (high GC content, centromeres, ...), leading to
uniform genomic coverage.
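Why random errors are easy to remove with repeated passes can be shown with a per-position majority vote. This is a minimal sketch that assumes the passes are already aligned and have the same length (only substitution errors); real SMRT errors also include insertions and deletions, so real consensus callers align the passes first. The function name and the example passes are made up.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Per-position majority vote over equal-length passes of the same insert."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this sketch assumes already-aligned, equal-length passes")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Five hypothetical passes over the same SMRTbell insert, each error at a different position
passes = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA", "TCGTTGCA", "ACGTTGGA"]
print(consensus(passes))  # ACGTTGCA
```

Because each mistake falls at a different position, it is always outvoted by the other passes; a systematic error in the same place would survive the vote.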

The disadvantage of this method is its lower output compared to Illumina platforms such as HiSeq. Illumina
is the best choice to identify SNPs on a genome that has already been sequenced and studied.

From 1990 to 2003 the Human Genome Project cost 3 billion dollars to sequence the 3 billion bases. This
result was a breakthrough at the time, but now cheaper and faster sequencing systems are needed to sequence
genomes from multiple organisms of many different species or at a single-cell level to identify and map
variations. In 2015 there were 500,000 human genome sequences available.

The bottleneck in current sequencing methods is optics. Fluorescently tagged dNTP are detected by their
fluorescence emission. Since this emitted light must be measured, the machines need precise and
complicated lasers, mirrors and detectors. To simplify the process, nanopore sequencing was developed to
once again revolutionize the sequencing world.

15/10/21

Nanopore sequencing
Commercially available for 2 years. Based on naturally occurring
proteins that form a pore on membranes to allow ion flows. Those
channel proteins were engineered to become sequencing
machines. [from Wikipedia] The biological membrane, where the
nanopore is found, is surrounded by electrolyte solution. The
membrane splits the solution into two chambers. A bias voltage is
applied across the membrane inducing an electric field that drives
ions into motion inside the nanopore. When a molecule occupies
a volume that partially restricts the flow of ions, an ionic current
drop is produced. Based on various factors such as geometry, size
and chemical composition, the change in magnitude of the ionic current and the duration of the
translocation will vary. Different molecules can then be sensed and potentially identified based on this
modulation in ionic current.

An upside of this strategy is the fact that the precise optics machinery isn’t needed. The machine size and
cost dramatically decrease (they are the size of a USB key and cost around a thousand $). One optical-based
sequencer can be replaced by several nanopore sequencers to increase throughput and cut costs. No
particular infrastructure is needed. Extremely fast sequencing (>400 bases per second). Long reads (42 kb).

The critical step in this case is purifying enough DNA from the sample. High molecular weight DNA samples
must be used if extensive sequencing is the goal. With time, nanopore sequencing quality has increased.
Also, this method allows the identification of base modifications (methylation) while sequencing. Average
fragment length is 43k nt -> easier de novo genome assembly.

Many species are diploid organisms, which means that every organism has 2 copies of any given genomic
locus. Heterozygosity (2 different alleles for the same locus) in genomes is a critical parameter for an
individual’s genetic diversity. Unfortunately, the identification of alleles is difficult with standard
sequencing techniques(?). A thorough genomic analysis requires not only the list of carried alleles but also
the arrangement of such alleles on a chromosome (the haplotype, which usually indicates which alleles will
co-segregate). With long reads, different alleles (as little as SNPs) can be identified and put together into
the same haplotype.

Genome sequencing projects


Human genome = about 3 Gbp. 2 approaches were employed: clone-by-clone (hierarchical) and shotgun
whole-genome sequencing (Craig Venter @ Celera Genomics). As much as 50% of the human genome is made of
repetitive sequences.

An important step in genome studies is the annotation of genes, a process that requires the identification
of ORFs. In less complex genomes, a species-specific software is sufficient to scan the genomic sequence
and find candidate ORFs (the strategy is the same, but genomic characteristics may differ between species).
One standard method for the identification of ORFs in extremely large genomes such as plant ones (wheat)
is exome sequencing:

1. Fragmented gDNA is hybridised to bait probes attached to a solid surface or to biotinylated probes.
Probe sequences have been previously determined with RNA-seq. Only gDNA fragments that contain
transcribed ORFs are selected, while non-annealed fragments are washed away.
2. Elution of selected gDNA fragments from the solid surface or selection of biotinylated complexes with
streptavidin-loaded magnetic beads.
3. After library enrichment, selected gDNA fragments are sequenced and reads are aligned back on
the reference genome.

Limitations:

• Not all exons are captured
• Exon sequences that were flipped or translocated (structural variants) aren’t easily detected
• No information about regulatory sequences that can be distant from coding sequences, e.g. miRNAs
aren’t detected because they aren’t converted into cDNA.
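For the software-based ORF scan mentioned above for less complex genomes, a very reduced sketch is shown below. It assumes an intron-less sequence and only scans the three forward reading frames for ATG...stop stretches; the function name, the minimum length and the example sequence are illustrative, and real annotation software uses species-specific models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) of ATG...stop ORFs in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with one short ORF in the third frame
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```

Coordinates are 0-based with an exclusive end. In eukaryotic genomes introns interrupt the ORF, so a scan like this only works on intron-poor genomes or on cDNA.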
Example: Arabidopsis genome sequence
Model plant organism because:

• easy to grow
• Lots of seeds produced
• complete life cycle in 8 weeks
• relatively tiny genome (130 Mbp), one of the
smallest within plants. 5 chromosomes.
Straightforward genome sequencing and annotation
were expected, however the genome’s complexity made that
impossible.

At the moment, there are still gaps in the genome sequence. It displays synteny with other agriculturally
important species, i.e. the order of genes in the genome is conserved between species, although the actual
gene-to-gene distances may vary. This characteristic denotes close evolutionary links between such species.
Also, knowing the order of genes in one species, the order of homologous genes in another closely related
species can be predicted. The genome contains internal duplications and families of repeated sequences.
Many SNPs were identified. About 32k genes were identified thanks to annotation software. Amongst these,
sequences encoding non-coding RNAs, pseudogenes and transposable elements are also counted. Annotating =
connecting a specific genomic sequence with the function of its product (or its own function if it is a
regulatory sequence). A large proportion of annotated genes still have unknown functions.

Duplicated genes complicate functional genomics analyses. Duplications result in genetic redundancy.
A. thaliana went through a complete genomic duplication, and throughout the stabilization period the 2
copies of a duplicated gene may diverge in function. Duplication drives evolution.

Gene families
= a group of genes that have a certain sequence in common, originally derived from
duplication. Example: transcription factors are grouped into families based on the
type of DNA binding domain. (TFs are trans-acting regulators that must bind to
cis-acting regulatory sequences).

Gene families evolve by gene duplication or recruitment of conserved domain-coding sequences by
unrelated genes thanks to unequal crossovers and recombination events. To study a particular gene
function, the gene is mutated and inactivated to study the mutant phenotype. However, if the target gene
has recently duplicated (the gene copy doesn’t have divergent function), the inactivation of the target
gene has no effect on the organism because its function is complemented by the gene copy. Sometimes the 2
gene copies diverge in function, normally because the regulatory sequences accumulated mutations that
dictate different tissue-specific expression of such genes.

Neofunctionalization. A tissue-specific gene is duplicated. One copy retains the original tissue-specific
expression, while the other accumulates mutations and its expression pattern may change.

Subfunctionalization. A tissue-specific gene is duplicated. Each of
the copies retains part of the original tissue-specific expression
pattern. For example, one copy is expressed in only one of the
original tissues, while the other is expressed only in another
tissue amongst the original ones.

In plants, TFs are highly amplified. For example, the myb gene family has >100 members. There are 107
MADS-box genes, while in mammals and yeast there are fewer than 10. The CAAT-binding protein NF-Y has 10
members for each subunit, while in mammals and yeast there’s only one copy each. This happens because
plant genomes are more flexible (?) and can duplicate more easily. Also, since plants are sessile they are
subjected to stronger selective pressures. To survive and reproduce, plants must resist a wide range of
adverse environmental events. In fact, plants possess a higher number of responses to environmental
stimuli, therefore they usually require additional levels of gene expression regulation compared to
animal genomes to cope with environmental issues.

Gene duplications can be recognized thanks to high sequence homology and conserved gene structure
(exon and intron borders).
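Recognising candidate duplicates by sequence similarity can be sketched as a percent identity calculation on already-aligned sequences. The function names, the 90% threshold and the three toy sequences are assumptions made for illustration; real analyses use proper alignment tools and also compare exon/intron structure.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Identity over aligned columns, ignoring columns where either sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def similar_pairs(aligned: dict[str, str], threshold: float = 90.0) -> list[tuple[str, str]]:
    """Pairs of genes whose identity is at least `threshold` percent."""
    return [
        (n1, n2)
        for (n1, s1), (n2, s2) in combinations(aligned.items(), 2)
        if percent_identity(s1, s2) >= threshold
    ]


# Hypothetical aligned sequences
aligned = {
    "gene1": "ATGGCTAGGAAGCTT",
    "gene2": "ATGGCTAGGAAGCTA",
    "gene3": "ATGCCAAGTTTGGTT",
}
print(similar_pairs(aligned))  # [('gene1', 'gene2')]
```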

Example: MADS-box gene family


Important homeotic genes that regulate organ identity during plant development; MADS-box genes
are also found in yeast and animals. (Antennapedia, the classic animal homeotic gene, is a
homeobox gene and not a MADS-box gene.) The first types that were identified in
plants are MIKC MADS-box genes, which have a C-terminal
activating domain.

40 genes were expected in the MADS-box family, based on Southern blots used to estimate copy number. By
sequencing, as many as 100 genes were identified. About 70% of MADS-box genes were eventually identified
through genome sequencing.

A phylogenetic tree can be drawn by analysing sequence conservation and
alignment. Recent duplications suggest that those genes still have redundant
functions. This information is highly predictive for functional genomics studies.

21/10/21

Other subfamilies of MADS-box genes were largely discovered by genome sequencing. 70% of MADS-box genes
had never had their cDNA isolated. Sequence conservation analysis on the DNA binding domain of MADS-box
genes reveals similarities between different genes. Some of those with high sequence conservation will
likely have redundant function. At the time, to confirm this hypothesis an RT-qPCR experiment was
performed to analyse gene expression in different plant tissues. Some of those genes with high sequence
identity are still expressed in different tissues, therefore they aren’t considered redundant. Obviously,
there’s always the possibility that a protein travels between tissues, but this fact is largely ignored
in such studies. In RT-qPCR, genomic DNA must normally be removed to avoid false positives (only cDNA
should be analysed); primers that span exon-exon junctions also help, because they cannot anneal to
intron-containing gDNA.

The MIKC subfamily is the most studied. All genes contain the same MADS-box binding domain (yellow
line), however the rest of the gene structure is highly variable. Mostly those genes are characterized by
long introns. From their position on the genome, some information can be acquired. MIKC genes are evenly
distributed on all A. thaliana chromosomes, while the other subfamilies concentrate mainly on
chromosomes I and V. This even distribution suggests that the MIKC subfamily is evolutionarily old, while
newer families usually duplicate in tandem and are therefore concentrated on one or a few chromosomes.

Genome sequencing, annotation and
phylogenetic trees of gene families are
foundational analyses for future reverse
genetic experiments. In forward genetics,
the gene responsible for a certain
phenotype is unknown and its search is
often complicated. In reverse genetics, the function of a certain gene in a sequenced genome is unknown.
Usually, gene function can be deduced by studying the phenotype of a mutant strain. However, in plant
genomes the high rate of gene duplication causes single mutant lines to exhibit a wild-type phenotype
because gene copies often have redundant functions. Thanks to phylogenetic analysis, it is possible to
predict which genes might be redundant.

Among MIKC genes, the SEP1, 2, 3 and 4 genes are all very closely related with high sequence identity
(96-100%). To study which phenotype was controlled by each of those genes, single mutant lines were
produced. Single mutants exhibited a wild-type phenotype. The same happened for double mutants. Finally,
in sep1-2-3 triple mutants a mutant phenotype was displayed: the 4 petals of a flower become sepals and a
new flower is formed. Since organ identity was changed, SEP genes are indeed homeotic genes. Since only
the triple mutant exhibited an altered phenotype, the SEP1-2-3 genes are thought to be redundant.
Moreover, the sep1-2-3-4 quadruple mutant has floral organs transformed into leaf-like organs. This
confirmed that the original identity of floral organs is that of a leaf.

shp1-2 double mutants do not display dehiscence zones: the mutant plant is unable to shatter its seeds
and disperse them in the environment (shatter-proof mutant). By reducing seed shattering more seeds can
be easily collected by farmers. Moreover, the shatter-proof genes are expressed in other tissues. A
stk shp1-2 triple mutant displays arrested ovule development, therefore those genes determine ovule
identity and are redundant. Redundancy is based on sequence similarity and the same expression pattern.

These types of mutants couldn’t be found by forward genetics, because the chance of crossing plants to
obtain triple and quadruple mutants is really small. Reverse genetics advanced our knowledge of gene
function, especially of redundant genes.

Discovering gene function


Forward vs reverse genetics approaches
In forward genetics (phenotype to gene), a spontaneous mutant of interest is isolated. To identify the gene
that causes it, the mutant phenotype is reproduced under certain controlled conditions of interest
(screenings).

Insertional mutagenesis | Chemical mutagenesis
------------------------|---------------------
An active transposon is inserted inside a coding sequence, knocking out the corresponding gene | Mutagenic agents induce random mutations in the genome that disrupt or alter gene sequence and function
Mainly knockout mutations, not much diversity in the type of introduced mutations | There’s more diversity between mutants depending on the expression level of the mutated gene. The position and type of mutation determines its outcome on gene function (complete loss-of-function, gain-of-function, aberrant splicing, …)
Expression of mutated gene is prevented (complete knockout) | Usually leaky gene expression, so more difficult to study
Limited number of insertion events in each organism | High number of mutations in each organism, it is harder to identify the one that causes the mutant phenotype
Mutated gene is easier to clone since it is tagged by transposon sequences | Difficult to clone the gene responsible for the mutant phenotype. Unrelated random mutations must be discarded through crossing with wild-type plants and recrossing with other mutant plants

Then, the genes that cause the mutant phenotype are cloned with different approaches.

• For example, if an active transposon is involved it’s possible to map its flanking sequences on the
genome to identify nearby genes that could be affected by the transposon. There are different
methods to do this.
• On the other hand, to identify specific point mutations map-based methods are used. Phenotype-linked
or molecular markers on the genome and the mutation of interest are followed through different
generations to determine linkage. If the mutant phenotype co-segregates with a marker more often than
expected for independent assortment, the mutation that causes the phenotype is close to that marker.
• It’s also possible to sequence a region of the genome of a model system to look for new mutations
on known genes. Nowadays, whole genome sequencing has sped up this process, however this
approach isn’t possible outside of model organisms because there isn’t a reference genome
sequence.
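The map-based approach above boils down to comparing recombination frequencies between the mutation and each marker: the lower the frequency, the closer the marker. A minimal sketch, where the function names, marker names and progeny counts are all hypothetical:

```python
def recombination_frequency(recombinants: int, total: int) -> float:
    """Fraction of progeny in which marker and mutant phenotype were separated by recombination."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinants / total


def closest_marker(counts: dict[str, tuple[int, int]]) -> str:
    """Marker with the lowest recombination frequency, i.e. the one most tightly linked."""
    return min(counts, key=lambda marker: recombination_frequency(*counts[marker]))


# Hypothetical (recombinants, total progeny scored) for three markers
counts = {"marker_A": (48, 100), "marker_B": (12, 100), "marker_C": (3, 100)}
for marker, (rec, total) in counts.items():
    print(marker, recombination_frequency(rec, total))
print(closest_marker(counts))  # marker_C
```

A frequency close to 0.5 (marker_A here) means independent assortment, so that marker gives no information about the position of the mutation.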

Example of forward genetic screen: the glucose insensitive mutant. A glucose insensitive A. thaliana mutant
doesn’t respond to high glucose concentration in growth medium.

Genetic interactions predict physical interactions. Genetics is fundamental in functional genomics studies.

The power of reverse genetics is based on genome sequencing. In reverse genetics approaches, the
function of a gene or a family of genes is investigated by introducing targeted (site-directed) mutations
and analysing the resulting mutant phenotype. Specific mutagenesis protocols have been developed and
validated for each of the model organisms.
Functional genomics in bacteria
In bacteria, it’s possible to easily select even the rarest of events because a lot of colonies can be grown in a
lab. Specific knockouts can be introduced in genetic screenings by using either transposons or homologous
recombination.

Transposons or transposable genetic elements are segments of DNA that can move to different locations
on a DNA molecule. The integration point is random. They can’t replicate autonomously, therefore they
need to be integrated into a host DNA molecule. Transposase is the enzyme responsible for excision and
reintegration on the host DNA. Transposition may be followed by duplication. In bacteria, transposons can
carry genes other than those involved in transposition. They’re called Tn followed by a number. In
nature Tns carry a resistance gene, which is a useful selection marker
for the cultivation of cells that have successfully integrated a
transposon in their genome.

In vitro transposition involves a piece of DNA, a transposon and a
transposase in the same solution. The transposon will integrate in
random positions of the DNA molecule. Then the DNA molecule is
fragmented and such fragments are exposed to bacteria. The bacteria
cells can integrate the transposon by homologous recombination
thanks to the flanking sequences of the original DNA molecule. Those
cells that have integrated the transposon carry the antibiotic resistance
gene and can be selected. This way, the bacterial genome can be
randomly mutagenized.

After that, the GAMBIT method (Genomic Analysis and Mapping By In vitro Transposition) can be applied to
identify essential genes. With random transposition, transposons integrate at a range of positions
relative to any given gene. After selecting the transposed colonies, a PCR reaction is performed using a
transposon-specific primer and a chromosomal primer (it binds to a specific sequence on the bacterial
chromosome). If the transposon has inserted close to the sequence recognized by the chromosomal primer,
the PCR amplified fragments are short. However, if the transposon insertion site is far from that
sequence, the PCR amplified fragments are long. If a transposon is inserted into an essential gene, the
knockout cells won’t grow even in optimal conditions. Therefore, in the PCR reaction no fragments are
amplified that correspond to transposons inserted in an essential gene. By separating the PCR products
with electrophoresis, a gap is identified that corresponds to the missing fragments. Some of the
neighbouring bands can be sequenced to identify the essential gene. GAMBIT is always performed on a
specific genomic region, not genome-wide.
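The reading of a GAMBIT gel can be mimicked in a few lines: each PCR product length is the distance between the chromosomal primer and one insertion, and a stretch without products suggests an essential gene. The function name, the product lengths and the gap threshold below are hypothetical; on a real gel the resolution is much lower.

```python
def find_gaps(product_lengths: list[int], min_gap: int) -> list[tuple[int, int]]:
    """Intervals between consecutive PCR product lengths that are at least `min_gap` bp wide."""
    lengths = sorted(set(product_lengths))
    return [(a, b) for a, b in zip(lengths, lengths[1:]) if b - a >= min_gap]


# Hypothetical product lengths (bp) from the surviving insertion mutants
product_lengths = [120, 310, 480, 650, 1900, 2050, 2230]
print(find_gaps(product_lengths, min_gap=1000))  # [(650, 1900)]
```

Here no insertion is recovered between 650 and 1900 bp from the chromosomal primer, so that interval is the candidate position of an essential gene.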

A genome-wide approach is needed to perform functional genomics studies on a large number of poorly
characterised genes in a newly-sequenced bacterial genome. High-throughput strategies that combine NGS
with transposon insertional mutagenesis reveal genotype-phenotype relationships in a wide range of
bacterial species. An example of these methods is Transposon-Directed Insertion site Sequencing (TraDIS),
which uses a derivative of Tn5 as a transposable element that is active in different species.

1. Construction of a dense transposon insertion library. Each
bacterial cell in the library carries a transposon insertion in one locus on
the genome. Cells that have successfully incorporated the transposon can
be selected thanks to an antibiotic resistance gene inside the transposon.
Cells where the transposon has inserted into essential genes do not
survive.
2. gDNA is extracted and randomly sheared into fragments.
Adapters are ligated to each end.
3. Fragments are PCR amplified and Illumina sequenced to identify
the flanking sequences of transposons.
4. Reads are mapped on the reference genome to highlight
transposon insertion sites in the various clones of the library.
5. Normalised read counts mapped to a specific locus are
proportional to the frequency of the insertion mutant in the population.
Sequencing data is used to estimate or calculate fitness of mutants.
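Steps 4 and 5 can be sketched as counting mapped insertion sites inside each annotated gene. This is a simplified illustration: gene names, coordinates (0-based, end-exclusive) and insertion sites are invented, and real pipelines normalise by gene length and use statistical thresholds instead of a strict zero.

```python
def insertions_per_gene(genes: dict[str, tuple[int, int]], insertion_sites: list[int]) -> dict[str, int]:
    """Count how many mapped insertion sites fall inside each gene."""
    counts = {name: 0 for name in genes}
    for site in insertion_sites:
        for name, (start, end) in genes.items():
            if start <= site < end:
                counts[name] += 1
                break  # this sketch assumes non-overlapping genes
    return counts


def candidate_essential(genes: dict[str, tuple[int, int]], insertion_sites: list[int]) -> list[str]:
    """Genes without any recovered insertion are candidate essential genes."""
    counts = insertions_per_gene(genes, insertion_sites)
    return [name for name, n in counts.items() if n == 0]


# Hypothetical annotation and insertion sites obtained from mapped reads
genes = {"geneA": (0, 1000), "geneB": (1000, 2200), "geneC": (2200, 3000)}
sites = [15, 140, 388, 702, 950, 2210, 2450, 2800, 2999]
print(insertions_per_gene(genes, sites))  # {'geneA': 5, 'geneB': 0, 'geneC': 4}
print(candidate_essential(genes, sites))  # ['geneB']
```

The denser the library, the more reliable the call: with few insertions a short gene can be missed by chance.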

TraDIS was first used to assay gene function in Salmonella enterica typhi
genome-wide. A library of 370k Salmonella mutants was generated,
averaging a transposon insertion site every 13 bp of the genome.
Sequencing data revealed that approximately 8% of Salmonella genes are
essential, since mutants for those genes did not survive in rich growth
medium. Other transposon insertion methods (Tn-seq) use transposons
that contain a type IIS restriction enzyme recognition site. Thus, gDNA
fragmentation is performed by restriction enzyme-mediated digestion,
whereby the enzyme cuts 20 bp downstream of the recognition site.

Transposon insertion sequencing techniques are also performed to
quantitatively determine the fitness of each gene knockout genome-wide. To
do this, gDNA is extracted both before and some time after selection is
performed. The comparison between normalised mapped read counts of
these two conditions should reveal a change in the frequency of a
specific insertion mutant in the population.

• No change in insertion frequency on gene X -> gene X is neutral, it does not contribute to fitness in
the culture condition tested.
• Frequency reduction or disappearance of insertions on gene Y -> gene Y is essential for growth
under the experimental conditions.
• Frequency increase of insertions on gene Z -> disruption of gene Z is beneficial for growth under
the experimental conditions.
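The three outcomes above can be sketched as a comparison of normalised read counts before and after selection. This is a deliberately simplified sketch: the function name, the log2 threshold of 1, the pseudocount and the read counts are assumptions, and published Tn-seq fitness calculations also take the expansion of the population into account.

```python
import math


def classify_genes(before: dict[str, int], after: dict[str, int],
                   threshold: float = 1.0, pseudocount: float = 1.0) -> dict[str, str]:
    """Label each gene by the log2 change in the frequency of its insertion mutants."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    calls = {}
    for gene, count_before in before.items():
        freq_before = (count_before + pseudocount) / total_before
        freq_after = (after.get(gene, 0) + pseudocount) / total_after
        log2_fc = math.log2(freq_after / freq_before)
        if log2_fc <= -threshold:
            calls[gene] = "depleted"   # gene needed in this condition
        elif log2_fc >= threshold:
            calls[gene] = "enriched"   # disruption is beneficial
        else:
            calls[gene] = "neutral"
    return calls


# Hypothetical read counts per gene before and after selection
before = {"gene_X": 1000, "gene_Y": 1000, "gene_Z": 1000}
after = {"gene_X": 1000, "gene_Y": 50, "gene_Z": 4000}
print(classify_genes(before, after))
# {'gene_X': 'neutral', 'gene_Y': 'depleted', 'gene_Z': 'enriched'}
```

Because frequencies are relative, a strongly enriched mutant lowers the frequency of all the others, which is why a threshold is used instead of requiring exactly no change.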

For example, the following graph shows the fitness of individual transposon insertion mutants of S.
pneumoniae. Yellow lines correspond to insertion frequencies and fitness of a wild-type background with
transposon insertions. Blue lines represent insertion frequencies and fitness of a mutant strain with a
SP2193 deletion (a response regulator). The latter genetic background exhibits higher fitness levels and
transposon insertion frequencies in three loci involved in pyrimidine biosynthesis, suggesting that the
loss of SP2193 results in a suppression of the fitness defects that result from insertion mutations in
each gene involved in the pyrimidine biosynthesis pathway [1].
It’s possible to study bacterial colony fitness based on insertional mutagenesis with transposons. Cf. Transposon insertion
sequencing (article). 370k transposon mutants of Salmonella enterica typhi were generated to completely cover the genome with
transposon insertions. This revealed that 8% of the genes are essential for growth in a rich medium. This was discovered by
analysing the flanking sequences of the transposon mutant library. If the transposon has inserted into an essential gene, the
corresponding cells won’t be in the library (because they’re dead). If none of the flanking sequences in the library corresponds to
a specific Salmonella locus, then that locus likely contains an essential gene.

22/10/21
Genomic DNA is fragmented, then a restriction enzyme cuts the fragments at sites that are only present inside the transposon
sequence. This way some fragments contain a portion of the original transposon and its flanking sequence. Adaptors are ligated at
the flanking sequence end, then fragments are PCR amplified using transposon-specific and adaptor-specific primers. Finally the
fragments are sequenced to identify the flanking sequences. Also, the number of such reads may decrease if the experiment is
repeated at different time points. If that happens, then the transposon insertion has a negative impact on bacterial fitness in
that particular condition.

This method can be repeated without a restriction enzyme simply by shearing DNA in random parts.

This system allows fitness studies to be performed on a bacterial population in specific growth conditions. A lot of genes are
simultaneously analyzed and their impact on fitness is determined. Among all genes, the insertional mutagenesis can have an effect
on fitness that varies between neutral (if the gene isn’t essential for growth and survival) and negative.

The same experiment can be repeated in a population of bacteria previously mutagenized to inactivate a specific gene. The results
may show that a specific gene knockout can change the impact on fitness of insertional mutagenesis of other genes. For example,
in a SP2193 knockout strain the fitness impact of insertional mutagenesis in 3 different loci is mostly neutral, while in wild-type
strains the fitness greatly decreases if the same 3 loci are inactivated. Therefore, SP2193 is a repressor of such genes(?).

Functional genomics in yeast


Yeast cells can be cultured and selected quite easily, which makes yeast the perfect model organism to
perform basic molecular studies in eukaryotic systems. 2 unicellular yeasts are used: budding yeast (S.
cerevisiae) and fission yeast (S. pombe).

• S. cerevisiae was the first eukaryote to be transformed by plasmids, the first to be completely
sequenced in 1996 and the first on which precise gene knockouts were performed. Yeast mutants
are easy to select and the mutation can be complemented by expressing a homologous gene from
other eukaryotes. S. cerevisiae has >6000 genes, and the function of many of them remains unknown.
• S. pombe shows no synteny with S. cerevisiae, therefore the 2 yeasts are evolutionarily very
different.

Both yeasts grow and divide as haploid organisms, so essential gene knockouts aren’t viable. However, a
diploid zygote can be formed by the fusion of 2 haploid cells of opposite mating types (either a or
alpha). Both types of yeast have high rates of homologous recombination, which allows easy knock-in
protocols to inactivate or change the allele of specific genes.

Usually the ORF of a gene of interest is disrupted by inserting a selectable marker with homologous
recombination. Linear fragments with free ends within the region of
homology give max. frequency of integration. This way specific gene
mutants can be selected and their phenotype easily compared to the
wild-type cells. This approach requires cloning.

1
See Transposon insertion sequencing: a new tool for systems-level analysis of microorganisms, van Opijnen and
Camilli, Nat Rev Microbiol. 2013 July ; 11(7)
Nowadays, the donor DNA fragment is synthesized by PCR amplification of the
selection marker with primers that contain 40 bp long 5’ tails that are
homologous to the site of mutagenesis. The PCR product contains both the
selectable marker and the 40bp of homologous sequence at both ends. In S.
pombe 60 bp homologous sequences are needed because homologous
recombination isn’t as efficient as in S. cerevisiae. This approach can also be
used to add a tag to a specific protein to create and study a specific chimeric
protein. The tag can be a fluorescent protein (GFP) or a His tag depending on
the purpose of the experiment.
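
The primer design described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 20 nt annealing length and the example sequences are assumptions made for the sketch, not part of the lecture. The 40 bp tail length comes from the notes (for S. pombe one would pass tail_len=60).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def knockout_primers(upstream, downstream, marker, tail_len=40, anneal_len=20):
    """Return (forward, reverse) primers, both written 5'->3'.

    upstream / downstream: genomic sequence flanking the target ORF (top strand).
    marker: selectable marker cassette (top strand).
    Each primer = 5' homology tail + 3' part that anneals to the marker.
    """
    if len(upstream) < tail_len or len(downstream) < tail_len:
        raise ValueError("flanking sequences are shorter than the requested tail")
    forward = upstream[-tail_len:] + marker[:anneal_len]
    reverse = reverse_complement(downstream[:tail_len]) + reverse_complement(marker[-anneal_len:])
    return forward, reverse


# Made-up sequences, only to show the call.
upstream = "ACGT" * 15
downstream = "TTGCA" * 12
marker = "GATTACA" * 10
fwd, rev = knockout_primers(upstream, downstream, marker)
print(len(fwd), len(rev))  # 60 60
```

The PCR product obtained with these primers is the marker flanked by the two homology tails, which is the linear donor fragment used for recombination.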

In both yeasts illegitimate recombination may occur: the marker is
incorporated at different sites without sequence homology. Therefore,
the insertion site must always be checked.

With insertional mutagenesis, essential gene functions cannot be analysed because haploid mutants are
not viable. To overcome this problem, diploid heterozygotes are used instead. The S. cerevisiae community
uses a collection of strains in which all the ORFs have been knocked out. In the Saccharomyces Genome Deletion
Project it was discovered that about 20% of genes are essential. However, not all viable heterozygotes in
this collection are normal. In fact, several display abnormal chromosome contents. Loss of one gene might
impose a growth selection that leads to increased copy number of the chromosome that contains another
version of the same gene, because it confers increased growth rate.

To solve the essential gene problem, some mutants are
designed as conditional mutants, so the mutant
phenotype is displayed only under certain conditions. For
example, temperature-sensitive mutants are wild-type
below a certain T, while at higher T the proteins unfold or
misfold and they display the mutant phenotype. To isolate
such mutants, yeast colonies are plated in specific
positions of 2 replica plates that are grown at different T.
The spots where no growth is observed in the high T plate
indicate the presence of temperature-sensitive mutants. These kinds of mutants are important to study cell
cycle defects. However, conditional mutants are always mutants and they might have mild defects even in
permissive growth conditions.

A selection assay is more powerful than a screening assay(?).

Suppressors are capable of (partially) restoring the activity of mutated proteins and likely intervene in the
same cellular process. Overexpression of suppressor proteins may cover the effects of a mutation on
another gene and rescue the mutant phenotype or increase the fitness of the mutant. There are different
types of suppressors, some examples:

a) Dosage suppressor. It stabilises a target protein. If overexpressed it is able to rescue
mutant proteins that display lower stability.
b) Interaction suppressor. A target protein needs to interact with its partner to properly
function. If it is mutated it might not be able to interact with its partner. However, if the
partner is mutated too their interaction might be restored.
c) Bypass suppressor. Rescues null alleles by allowing an alternative biochemical pathway to
obtain the same products of the wild-type pathway that is blocked.
d) Nonsense suppressor. Rescues nonsense mutations that cause the production of
truncated proteins. During translation, a suppressor tRNA recognises the nonsense
codon introduced by the original mutation and prevents a protein synthesis block.

Suppressor mutants can be cloned with the following process.

1. A temperature-sensitive (ts) mutant is the starting strain. It contains a mutated gene which functions
only at low T and a wild-type suppressor gene.
2. The ts mutant is mutagenized to obtain a mutation that suppresses the ts phenotype but at the
same time makes the mutant cold-sensitive (unable to grow at low T). This happens because the
suppressor gene is mutated to become cold-sensitive.
3. After selecting such mutants, another round of mutagenesis is performed to revert the ts mutation
in the original gene. The result is a cold-sensitive suppressor mutation in a wild-type background.
   a. If the mutation of the suppressor gene is lethal, the strain displays a wild-type phenotype at
   high T but dies at low T. To select only cells that carry a cold-sensitive mutation on a specific
   suppressor gene, the yeast population is transformed with a plasmid library containing wild-type
   yeast genes. Only the colonies with a wild-type copy of the suppressor gene can grow at low
   temperature because the function of the mutated suppressor is complemented by the
   wild-type copy of the gene on the plasmid.
   b. If the suppressor mutant has no phenotype, the double mutants can grow at any T. These
   mutants can be transformed with a wild-type gene plasmid library. The transformed mutants are
   replica plated and grown at high T. Yeast cells that have incorporated a wild-type copy of the
   suppressor gene are again temperature-sensitive because the function of the mutated suppressor
   is complemented by the wild-type gene copy. In this case the mutation on the suppressor is recessive.
   c. If the suppressor mutant has no independent phenotype and the mutation is dominant, it cannot
   be complemented as described above. A library of suppressor mutants is made and used to
   transform the original ts mutant strain. Transformed mutants that incorporated the
   dominant suppressor mutation can grow at high temperature.

Synthetic lethality is the opposite of mutation suppression. Instead of restoring the function of the original
mutant, a second mutation is introduced that, combined with the first one, generates a lethal phenotype.
This is useful to analyse duplicated genes with redundant functions. In S. cerevisiae 80% of genes aren’t
essential, therefore single mutants of such genes are able to grow. A matrix is constructed to cross 2 single
mutants (of opposite mating types) to obtain double mutants of all combinations of non-essential genes. This
requires a lot of work and is usually automated with synthetic genetic array (SGA) analysis, in which robots cross the strains and the
viability of the double mutants is scored. The results allow scientists to build a gene network map. Double mutant
phenotypes are scored to identify deficiencies in specific pathways or non-viable double mutants. The
genes mutated in a non-viable pair are likely redundant.
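
As a rough illustration of how double mutants can be scored, the sketch below compares the observed double-mutant fitness with the product of the two single-mutant fitness values. The multiplicative expectation is a commonly used model but it was not presented in class, and the function names and cutoff values are assumptions chosen for the example.

```python
def interaction_score(f_a: float, f_b: float, f_ab: float) -> float:
    """Observed double-mutant fitness minus the multiplicative expectation."""
    return f_ab - f_a * f_b


def classify_pair(f_a, f_b, f_ab, lethal_cutoff=0.1, tolerance=0.1):
    """Classify a gene pair from relative fitness values (wild type = 1.0)."""
    expected = f_a * f_b
    if f_ab <= lethal_cutoff and expected > lethal_cutoff:
        # Both single mutants grow, the double mutant does not.
        return "synthetic lethal"
    eps = interaction_score(f_a, f_b, f_ab)
    if abs(eps) <= tolerance:
        return "no interaction"
    return "negative interaction" if eps < 0 else "positive interaction"


print(classify_pair(0.9, 0.95, 0.0))  # synthetic lethal
```

Pairs classified as synthetic lethal are the ones that would become edges in the gene network map mentioned above.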

With this system it’s also possible to screen a
collection of compounds to find new drug targets. A
compound that inhibits the product of a certain gene
mimics a mutation in that gene. After generating a
library of viable single mutants, these cells are exposed
to different compounds. If the inhibited gene
interacts with the mutated gene, the treated mutant dies
because it’s in the same situation as a synthetic lethal double mutant.
The chemical genetic lethality profile of a certain
molecule can be matched with the lethality profiles obtained
from double mutant screening with SGA. From this
information, the gene targeted by the drug can be identified. The protocol was first published in a
proof-of-concept study. Cfr. slides.
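
The profile matching step can be sketched as a simple set comparison. Here each profile is the set of single mutants that die, either when treated with the compound or when combined with a deletion of a candidate gene. The Jaccard index is just one possible similarity measure chosen for the example, and the gene and mutant names are placeholders.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity between two collections of mutant names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidate_targets(drug_profile, sga_profiles):
    """Rank genes by how well their synthetic lethal partners match the drug profile.

    drug_profile: mutants killed by the compound.
    sga_profiles: {gene: mutants that are synthetic lethal with a deletion of gene}.
    """
    scores = {gene: jaccard(drug_profile, partners) for gene, partners in sga_profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder names, only to show the call.
sga_profiles = {"geneA": {"m1", "m2", "m3"}, "geneB": {"m3", "m4"}}
ranking = rank_candidate_targets({"m1", "m2", "m3"}, sga_profiles)
print(ranking[0][0])  # geneA
```

The best-matching gene is the most likely target of the compound, under the assumption that the drug behaves like a loss of function of that gene.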

No lectures on Thursday 4th november.

28/10/21

Mice models
Work on mice requires approval by an ethical committee and strictly regulated, specialized facilities. The mouse is a good mammalian model because it’s
small, has a relatively short gestation time and is docile and easy to handle. 19 chromosome pairs + XY,
30-40k genes. However, some gene functions aren’t conserved between mice and humans. For example,
mouse stem cells behave a little bit differently compared to human ones. Also, its large genome
makes it harder to work on.

Mouse breeding strategies. Inbred strains are normally employed in experiments. Outcross to obtain F1,
then inbreed for 20 generations. Inbred mice are essentially homozygous at all genetic loci (genetically
identical), therefore they’re very stable. With heterozygous mice, at each generation the gene of interest
segregates with different alleles that could potentially alter its function. With an inbred line the genetic
background of all mice is the same, therefore interactions between the gene of interest and the others
should be the same in all animals.
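
To get a feeling for why about 20 generations are used, the sketch below computes the expected inbreeding coefficient F (the probability that the two alleles at a locus are identical by descent) under repeated brother-sister mating. The notes do not say which mating scheme is used, so full-sib mating is an assumption here, and the recurrence is the standard textbook one rather than something shown in class.

```python
def inbreeding_coefficient(generations: int) -> float:
    """Inbreeding coefficient F after `generations` rounds of brother-sister mating.

    Uses the full-sib recurrence F[t] = (1 + 2*F[t-1] + F[t-2]) / 4,
    starting from unrelated, non-inbred founders (F = 0).
    """
    f_prev2, f_prev1 = 0.0, 0.0
    for _ in range(generations):
        f_prev2, f_prev1 = f_prev1, (1 + 2 * f_prev1 + f_prev2) / 4
    return f_prev1


for n in (1, 5, 10, 20):
    print(n, round(inbreeding_coefficient(n), 3))
```

Under these assumptions F rises above 0.98 after 20 generations, which is consistent with the statement that inbred mice are essentially homozygous at all loci.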

A mouse reference genome is available. The mouse has about the same estimated number of genes and almost the same genome size as
humans. To study the mouse genome, it’s possible to induce random or site-specific mutations that are
tissue- or development-specific, and to express exogenous genes. For example, a gene construct containing the
exogenous gene under the control of a tissue-specific promoter is injected into the blastocyst of a mouse.

The classic targeted knockout strategy consists of building a gene construct with a selection
marker (a neomycin resistance gene) flanked by sequences that are homologous to the site of insertion, together with a
ganciclovir-sensitivity gene (the herpes TK gene) placed outside the homologous sequences. Once such a construct is
introduced into an embryonic stem cell (ESC), it is integrated into the genome via homologous recombination.

1. If the construct is inserted into the correct site, the TK gene doesn’t integrate and the cell can
express the neomycin resistance gene
2. If the construct is inserted randomly into the genome, both the neomycin and the TK gene are
expressed
3. If the construct doesn’t integrate at all, neither the neomycin resistance gene nor the TK gene is expressed.

It’s possible to select only the knockout cells by plating the ESCs on a neomycin + ganciclovir selective
medium. Ganciclovir treatment kills any cell that contains a TK gene, while neomycin kills ESCs
that have not integrated the genetic construct. The selected cells are implanted into a wild-type blastocyst
to yield a chimeric mouse. The chimeric mouse is crossed with a wild-type mouse to obtain mice that carry the
genetic construct in all cells. These mice, heterozygous for the targeted mutation, are intercrossed to yield
mice homozygous for the gene construct of interest.
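
The positive-negative selection logic can be written as a tiny truth table. This is only a sketch of the reasoning above: the function and dictionary names are made up for the example.

```python
def survives_selection(has_neo: bool, has_tk: bool) -> bool:
    """True if an ESC survives on neomycin + ganciclovir medium.

    Neomycin kills cells without the resistance gene,
    ganciclovir kills cells that express the TK gene.
    """
    return has_neo and not has_tk


# (neomycin resistance gene present, TK gene present) for each integration outcome
OUTCOMES = {
    "targeted insertion by homologous recombination": (True, False),
    "random insertion": (True, True),
    "no integration": (False, False),
}

survivors = [name for name, (neo, tk) in OUTCOMES.items() if survives_selection(neo, tk)]
print(survivors)  # ['targeted insertion by homologous recombination']
```

Only the correctly targeted cells pass both selections, which is why the double selective medium enriches for knockout ESCs.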

It might be very useful to switch on the expression of an exogenous gene in a specific development stage
and only in specific tissues. There are several gene expression control systems. For example, in
tetracycline-based systems 2 genetically engineered mouse lines are
crossed. One parental line contains the cDNA of interest
under the control of a tet-inducible promoter (Ptet). The second
line expresses the transcriptional activator needed to switch
on Ptet-driven gene expression. Such proteins can be regulated
by adding doxycycline, which either activates or inhibits the
transcription factor (depending on the type that is expressed). By
conditionally activating or inactivating gene expression with
doxycycline it’s possible to perform complementation
studies of genes that may be fundamental for specific
development stages (the classic knockout would result in
dead mice).

Similarly, conditional gene expression can be achieved by
using the lac operon and the LacI repressor, which is inactivated by
allolactose.

Another example of an inducible system is the Gal4
transcriptional activator. Normally, Gal4 is expressed in
specific tissues thanks to regulatory sequences in the
promoter. The engineered Gal4 is normally bound by Hsp90,
therefore cannot translocate into the nucleus. However, by
adding RU486 Hsp90 binding is displaced, Gal4 moves into the
nucleus and activates target gene expression. This system is
much faster than the previous because the transcription
factor is always expressed in the tissue of interest and
immediately activates expression of the exogenous gene.

Finally, the Cre/lox system is used to irreversibly excise a DNA
fragment between lox recombination sequences. Lox sites on
DNA are recombination sites specifically recognized by Cre
proteins. Cre expression is regulated by a tissue- or
development-specific promoter so that only in specific tissues
or developmental stages Cre is expressed and excises the region flanked by the lox sites from the genome.
Since the DNA fragment is physically removed by Cre, the system is irreversible.
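
Excision between two lox sites can be mimicked on a plain string. This is a minimal sketch that assumes two sites in the same orientation on the same molecule; the function name is made up, and the site used is the commonly cited 34 bp loxP sequence (two 13 bp inverted repeats around an 8 bp spacer).

```python
# Commonly cited loxP site: 13 bp repeat + 8 bp spacer + 13 bp inverted repeat.
LOXP = "ATAACTTCGTATA" + "GCATACAT" + "TATACGAAGTTAT"


def cre_excise(seq: str, site: str = LOXP) -> str:
    """Remove the DNA between the first two direct copies of `site`.

    One copy of the site is left behind, as after Cre-mediated excision.
    If fewer than two sites are found, the sequence is returned unchanged.
    """
    first = seq.find(site)
    if first == -1:
        return seq
    second = seq.find(site, first + len(site))
    if second == -1:
        return seq
    return seq[: first + len(site)] + seq[second + len(site):]


floxed = "AAA" + LOXP + "GGGG" + LOXP + "TTT"
print(cre_excise(floxed) == "AAA" + LOXP + "TTT")  # True
```

Because the excised fragment is simply gone from the returned string, applying the function again changes nothing, which mirrors the irreversibility of the system.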

Plants
Transposon and T-DNA insertion lines have been widely used in the lab, however recently the field was
changed by the introduction of the CRISPR technology.

Even though homologous recombination exists in plants, it’s highly inefficient and it isn’t used for knockout
or knock-in studies. Therefore, plant mutants are usually obtained by T-DNA or transposon insertions. Rice
and A. thaliana are commonly used in reverse genetics studies. The cereal transposon system found in
maize is usually employed in A. thaliana because its sequences are unique in that genome, therefore they are
much easier to detect.

T-DNAs are inserted into the genome by A. tumefaciens. These bacteria normally cause plant
tumors (galls) but can be engineered to mutagenize a plant cell or deliver a CRISPR system. The bacterium
contains a Ti plasmid that allows for the insertion of one of its parts (the T-DNA) into the host genome. Since the T-DNA
contains an antibiotic resistance gene, the transformed plants can be selected by adding the antibiotic. Also,
the Ti plasmid is 200 kb long, much longer than other plasmids. Since T-DNA carries eukaryotic promoters,
gene expression is switched on in the host cell. The region of the Ti plasmid that will be integrated into the
host genome is flanked by specific sequences. The T-DNA randomly inserts into the host genome, however
it has a preference for AT-rich regions, which are usually located in intronic regions. Even though the ORF of
the target gene isn’t disrupted, an intronic T-DNA insertion usually affects RNA splicing. Normally exonic insertions
are preferred because there the T-DNA reliably blocks gene expression. Sometimes T-DNA inserts into the promoter
region with varying effects (either complete knockout or transcription deregulation), which makes it difficult to
select complete knockout lines.

T-DNA can be engineered by adding a reporter gene under the control of a tissue-specific or development-specific
promoter. However, since the integration site is random the reporter gene might not be expressed
because of positional effects (it integrates into a transcriptionally silent genomic region). Therefore, T-DNAs
are inserted at high coverage across the whole genome. Rice and A. thaliana
are usually employed to generate T-DNA-saturated lines thanks to a specific protocol that doesn’t require tissue
culture because it’s possible to directly select the seeds.

Alternatively, transposons can be used. The dissociation (Ds) and activator (Ac) system was first
discovered in maize and later used in A. thaliana. The Ds element is a non-autonomous transposon because
it doesn’t code for its own transposase, therefore it cannot jump by itself. The Ac element is an
autonomous transposon of the same family as Ds, therefore it can excise and reintegrate by itself thanks to
its own transposase. When both the Ds and Ac elements are present in the same genome, the Ac
transposase can recognize the Ds sequences and catalyse their transposition. Ds and Ac transposition
generates unstable phenotypes because between generations the transposons can move inside the genome,
inactivating or reactivating target genes. However, if the Ac element is crossed away it’s possible to obtain
a stable population that contains a fixed Ds element. It’s also easier to identify the transposon insertion site
by looking for its flanking short repeated sequences. This system allows for rapid mutant phenotype
generation because transposons aren’t in fixed positions.

A transposon tagging population is prepared as follows:

1. A T-DNA construct is set up. The LB and RB recombination sequences delimit a construct
containing:
   a. an Ac element without inverted repeats, or the Ac transposase gene alone
   b. a functioning Ds element with a herbicide resistance gene
   c. an S.M. gene. Expression of SU1 makes plants sensitive to sulfonylurea.
2. The T-DNA construct is inserted into a wild-type background plant strain of interest.
3. At each generation, active Ac transposase catalyses Ds excision and subsequent integration in a
random position in the genome.
4. Plants are crossed. If the T-DNA and Ds sequences are distant enough to avoid co-segregation, some of
the progeny of the crosses should only carry the Ds element while the T-DNA sequence is crossed
away. These plants thus carry a stable Ds insertion.
5. Seeds are collected and new plants are grown in a selective medium. The herbicide prevents growth of
plants that do not carry the Ds element. Sulfonylurea kills plants that carry the T-DNA sequence.
Only plants that carry exclusively the Ds element are selected. These plants are collected and their
seeds stored to generate the transposon tagged insertion library.

29/10/21
In a T-DNA mutagenized population the insertion sites are random but fixed. This is called stable insertion. To generate
different mutant lines to inactivate all genes, an efficient transformation method must be employed and a lot of transgenic plants
must be generated. With T-DNA there isn’t any revertant phenotype. To prove that a mutant phenotype is caused by a specific gene
insertion, the same phenotype must be displayed in 2 independent lines with different insertion sites in the same gene. This is
because in a single line the phenotype might be caused by a sequence that is close to the insertion and co-segregates with it
(linkage). Usually, a complementation test is performed. In a transposon mutagenized population of mutants, the same can be
achieved simply by reactivating transposon activity and studying the revertant phenotype.

T-DNA tagging population | Transposon tagging population
Stable insertion. T-DNA cannot be transferred | Non-stable insertion. As long as active transposase is expressed, a transposon can be excised and reintegrated into random genomic positions. The insertion phenotype is not stable. This can be resolved by crossing out the Ac sequence, yielding a population with stable Ds insertions.
Efficient transformation system is needed | New insertions are easily obtained by reactivating transposon mobility
Many transgenic plants are needed | Lower number of transgenic plants in the tagged population
Absence of revertant phenotypes | Revertants are obtained by reactivating transposon mobility
To validate that an insertion in a specific gene causes a specific phenotype, two independent T-DNA insertion events in two plants are needed. Only one event is not enough because the phenotype might be caused by other nearby sequences that co-segregate with T-DNA insertions. | Proof that an insertion in a gene causes a certain phenotype is easily obtained by reactivating transposon mobility and analysing the revertant phenotype.

Once a mutant phenotype of interest has been identified, the mutagenized gene must be identified. This is
possible because the population was mutagenized with transposons or T-DNA, which have known
sequences. One possible approach is PCR-based: primers are designed that bind to the T-DNA or transposon,
together with primers that are gene-specific. If DNA amplification is detected with gel electrophoresis and DNA
hybridization, then the target gene has been mutagenized and is likely involved in the generation of the
mutant phenotype. However, since the mutant population is huge, performing a PCR reaction for each plant
isn’t feasible. Therefore, seeds and leaves are collected from each
plant to obtain DNA samples. DNA samples from each mutant are
sequentially pooled to drastically reduce the number of PCR reactions
needed to screen the entire population. Once a positive hit has been
identified, each pool and sample that were part of such hit are
analysed separately. This strategy makes it possible to rapidly and cost-effectively
screen an entire T-DNA population where each line has 1-2
T-DNA insertions to identify a genotype-phenotype correlation.
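
The saving obtained by pooling can be estimated with simple arithmetic. The sketch below assumes a single level of pooling (every pool is tested once, then every sample of each positive pool is tested individually); the real pooling scheme may have more levels, and the numbers are only an example.

```python
import math


def pcr_reactions(n_plants: int, pool_size: int, positive_pools: int) -> int:
    """PCR reactions needed with one level of pooling.

    One reaction per pool, plus one reaction per sample in each positive pool.
    """
    n_pools = math.ceil(n_plants / pool_size)
    return n_pools + positive_pools * pool_size


# Example: 10000 plants, pools of 100 samples, a single positive pool.
print(pcr_reactions(10000, 100, 1))  # 200
```

In this example 200 reactions replace the 10000 that would be needed to test every plant individually.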

Alternatively, from the DNA samples of each plant line it’s possible to sequence
the insertion flanking site, map the reads on the reference genome and identify
the genes that had insertions. The flanking sequences can be isolated with inverse
PCR by using a restriction enzyme that cuts inside the insertion sequence (T-DNA
or transposon). The genome is digested with such enzyme, the DNA fragments are
diluted to prevent ligation between different genomic fragments (only one
restriction enzyme is used, so each DNA fragment has complementary sticky ends
that can ligate) and each DNA fragment is ligated into a plasmid to generate a
genomic library. Each plasmid can be analysed with PCR using insertion-specific
primers. If amplification is detected, the plasmid contains the transposon/T-DNA
and its flanking sequence. Selected plasmids are then sequenced.

The sequence of a gene of interest is submitted to a database that contains the genomic libraries to look for
a match between the gene of interest and a flanking sequence in the database. Normally there are many
mutant lines available, each differing only in the T-DNA or transposon insertion site. This is useful because a given
insertion site might not be favourable for specific applications.

Forward genetics can be performed: from a wild-type plant generate a transgenic plant population to select
the mutant phenotype of interest and through flanking sequence analysis identify the gene that causes
such mutant phenotype.

The T-DNA or transposon sequences that are used to generate transgenic plant lines can also contain
regulatory sequences to study different characteristics of a genome.

• Enhancer trap. The T-DNA or transposon contains a minimal promoter and a reporter gene. If it is
inserted next to a tissue or development-specific enhancer, then the reporter gene is expressed
accordingly. This can be used to study gene regulation in development.
• Activation tag lines. T-DNA or transposon sequence contains a 4 x 35S enhancer that strongly
activates transcription of nearby genes. The insertion site determines which genes are activated
and thus the phenotypic effects.

A point-mutation approach can be performed as an alternative to insertional mutagenesis. The position of
a point mutation that causes a particular phenotype can be identified with sequencing technologies. Cfr.
Fast-forward genetics enabled by new sequencing technologies.

1. Cross a homozygous mutant line with a wild-type line, then cross the F1 heterozygous offspring
and select the mutant plants in the F2 population. Such plants must have 2 copies of the
original mutation.
2. Genome sequencing of the F2 mutants. Since they aren’t inbred lines, their genomes are different.
3. By aligning and comparing the F2 mutant genomes, it’s possible to identify the common
homozygous region which likely contains the point mutation that causes the mutant
phenotype. Regions that don’t contain the mutation can correspond to either the wild-type or the
original parental mutant genome because of recombination and segregation in the F1 crosses.
(I’m not sure how to explain this in English.)
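
Step 3 can be illustrated with a toy example. Each F2 mutant is described by one genotype call per marker: "M" (homozygous for the mutant parent allele), "H" (heterozygous) or "W" (homozygous wild-type). The region that is "M" in every F2 mutant is the candidate interval. The encoding, the function name and the data are all invented for this sketch.

```python
def shared_homozygous_interval(positions, genotypes):
    """Return (start, end) of the longest run of markers that are 'M' in all plants.

    positions: sorted marker coordinates on one chromosome.
    genotypes: one string per F2 mutant, one character per marker.
    Returns None if no marker is 'M' in every plant.
    """
    shared = [all(g[i] == "M" for g in genotypes) for i in range(len(positions))]
    best = None   # (first_index, last_index) of the longest shared run
    start = None
    for i, ok in enumerate(shared + [False]):  # the sentinel closes a run at the end
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start, i - 1)
            start = None
    return None if best is None else (positions[best[0]], positions[best[1]])


positions = [100, 200, 300, 400, 500]
f2_mutants = ["HMMMW", "MMMHH", "WMMMM"]
print(shared_homozygous_interval(positions, f2_mutants))  # (200, 300)
```

Outside the shared interval each plant can be "M", "H" or "W", which is the effect of recombination and segregation described in step 3.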

Drosophila melanogaster
Good human genetics model, especially to study the central nervous system.
1,5k full or partial transposons. Relatively few pseudogenes (less sequence
redundancy), large number of overlapping genes. Large collection of
mutants has been studied. 14k genes, barely double the number of yeast
genes (low gene content). It’s possible to perform comparative genetics
analyses because some functions are conserved even among distant animal
relatives.

Gene constructs can be easily delivered into an early-stage embryo because
it’s a syncytium. Moreover, prior to cellularization pole cells bud off at the
posterior end. For germline transmission to occur, the transgenic DNA must
be taken up into the pole cells that are fated to become germ cells. F1
individuals that carry the transgene can be selected thanks to a marker
carried by the gene construct. For example, a white+ marker in a white- strain marks all flies with
dark-colored eyes as transgenic.

Random integration of a transgene into a genome can result in a wide range of expression levels despite
having the same regulatory sequences because of the positional effect. Comparing the effects of the
integration of two separate transgenes or gene variants with equal regulatory regions requires removal of
the positional effect. To do so, the 2 coding sequences are inserted into the same transposon but flanked
by different recombination sites. This way, both transgenes are inserted into the same genomic position
and depending on the expression of one type of recombinase only one gene is excised. This is achieved with
a particular transposon that consists of:

• Transposon inverted repeat ends
• The 2 gene variants (X and Y)
• loxP sites positioned to cut out gene X in the presence of Cre recombinase
• FRT recombination sites positioned to cut out gene Y in the presence of FLP recombinase
• Selection marker between the two gene constructs. Either of the recombination events results in
removal of the marker, thus enabling scientists to select individuals where recombination has
occurred.

By expressing either Cre or FLP, the transgenic line contains either one of the gene variants in the same
orientation and at the same genomic locus. The mutant phenotype is the result of the function of one of
the gene variants.

It is also possible to replace a transgene inserted with a transposon with another sequence. First, a donor
DNA molecule is synthesised to contain the following:

• Homologous sequences to the site of integration
• Inverted repeats of the same kind as the integrated transposon
• A selection marker different from the one in the integrated transposon
• The gene sequence of interest to be inserted replacing the other one.

In the presence of transposase, the previously integrated transposon is excised
leaving a DSB with ends that are homologous to the donor DNA. Thus,
homologous recombination is activated and the donor fragment is inserted at
the same site as the original transposon, effectively replacing a sequence with
another one.

Targeting induced local lesions in genomes (TILLING)


Classical mutagenesis method to introduce point mutations in genes of interest to perform reverse genetics
studies. Usually combined with genome editing studies to induce a mutation of interest that has been
previously studied in genome editing experiments.

Different mutagenic agents display a certain specificity for
the types of mutations they induce. Commonly EMS
mutagenesis is performed. Similarly to UV mutagenesis, EMS
converts a C into a T, therefore the C:G base pair is converted
into a T:A pair. Alternatively, NG mutagenesis is performed.
Both EMS and NG are alkylating agents that add an alkyl group (an ethyl group in the case of EMS)
to the target base to alter its ability to form hydrogen bonds with the complementary strand.

Given the random position of point mutations introduced with chemical treatment, the downstream effect
on the gene product largely depends on the position of the mutation relative to the reading frame as well
as the new base pair that is introduced. Different types of mutations generate various effects on the final
protein:

• Silent: a point mutation results in a substitution of a codon with one of its synonyms. The protein
product is not altered.
• Missense: a point mutation results in a residue change. The effects on the protein product depend
on the similarity of the new amino acid to the original one as well as the relevance of the affected residue
in the protein folding process.
• Nonsense: an in-frame stop codon causes the production of a truncated protein or a complete
absence of protein synthesis.
• Abnormal splicing: intron retention, exon skipping and activation of cryptic splice sites lead to large
scale alterations of the protein that can potentially block its function or trafficking.
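
The first three categories can be told apart by translating the codon before and after the change. The sketch below uses the standard genetic code; the function name is made up, splicing mutations are not handled, and a change inside a stop codon is not treated as a special case.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_point_mutation(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base change inside a codon (position is 0, 1 or 2)."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    old_aa, new_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if new_aa == old_aa:
        return "silent"
    if new_aa == "*":
        return "nonsense"
    return "missense"


print(classify_point_mutation("CAG", 0, "T"))  # nonsense (CAG Gln -> TAG stop)
print(classify_point_mutation("GAA", 2, "G"))  # silent   (GAA Glu -> GAG Glu)
print(classify_point_mutation("GAA", 0, "A"))  # missense (GAA Glu -> AAA Lys)
```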

To select the optimal mutagenesis site for the purpose of a certain
experiment (e.g. we want to introduce a nonsense mutation to
inactivate the gene), the ORF of the target gene is analysed to
identify the optimal position of the intended mutation. For example,
the CODDLE software is normally employed to scan the gene of
interest and identify the optimal position where a mutation induced
by a specific treatment should introduce a stop codon in frame. A
nonsense mutation is preferred because a truncated protein has
likely lost completely its original function, thus the phenotype of the mutant plant is equivalent to a
targeted knockout.
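
The idea behind this kind of scan can be shown with a short script. This is not how CODDLE itself is implemented; it is only a sketch that lists the codons of an ORF where a single EMS-type change (C to T, or G to A when the change happens on the opposite strand) creates an in-frame stop codon. The example ORF is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# EMS-type transitions as they appear on the coding strand.
EMS_CHANGES = {"C": "T", "G": "A"}


def ems_stop_sites(orf: str):
    """List (codon_index, base_index, codon, mutated_codon) for every single
    EMS-type change that turns a sense codon into a stop codon."""
    hits = []
    usable = len(orf) - len(orf) % 3  # ignore an incomplete trailing codon
    for start in range(0, usable, 3):
        codon = orf[start:start + 3]
        if codon in STOP_CODONS:
            continue
        for offset, base in enumerate(codon):
            new_base = EMS_CHANGES.get(base)
            if new_base is None:
                continue
            mutated = codon[:offset] + new_base + codon[offset + 1:]
            if mutated in STOP_CODONS:
                hits.append((start // 3, start + offset, codon, mutated))
    return hits


# Invented ORF: ATG CAG TGG AAA TAA
for hit in ems_stop_sites("ATGCAGTGGAAATAA"):
    print(hit)
```

With this rule only CAA, CAG, CGA and TGG codons can be turned into stop codons, so the PCR amplicon is best placed over a region that is rich in such codons.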

After treating the plant seeds with the mutagenic agent, the resulting M1 plants
are crossed to yield an M2 generation, where the mutations are usually
stable and carried in germline cells. Genomic DNA from M2 plants is
extracted for future analysis and plants are again crossed to obtain and
store M3 seeds. As seen previously, to cut costs of DNA analysis on a large
mutant population gDNA samples from 8 different M2 plants are
sequentially pooled into the same array.

11/11/21

gDNA samples in the same pool are PCR-amplified. Since PCR is
efficient at amplifying short fragments (up to 1,5 kbp), PCR primers
are designed to amplify a 0,5 to 1,5 kb-long portion of the gene that
contains the desired position of the mutation.

After amplification, dsDNA fragments in each pool are denatured
and then reannealed. Most of the M2 plants have no mutation on the
target gene, while a small number of those are heterozygous for
the mutation. ssDNA fragments that contain a mutation have a high
chance of annealing to wild-type ssDNAs, thus forming a
heteroduplex (no complete complementarity). On the other hand,
tilled ssDNAs that do not carry a mutation reanneal mostly to other
non-mutant complementary strands to form homoduplexes.

After that, another round of PCR is performed with fluorescently labelled forward and reverse primers.
After reannealing, each homo or heteroduplex dsDNA carries a 5’ fluorescent tail with two distinct
fluorescence dyes to distinguish between forward and reverse strand. CelI is added to the pooled sample to
specifically introduce a SSB at mismatched positions in heteroduplexes. Denaturing of the treated DNA
fragments yields a mix of full-length tilling fragments that did not originally carry the mutation and a
smaller number of shorter ssDNA fragments that have been cleaved.

Fluorescent dyes help visualise these fragments after gel
electrophoresis has been performed. Pools where a mutation has
been detected exhibit two or more DNA bands, one for full-length
tilling fragments and the others for shorter fragments. By
measuring the length of the shorter fragments against a reference it is
possible to identify the position of the mutation. Additionally, two-colour
imaging is used to measure the length of both the forward
and reverse strand fragments of each pool. If the sum of the lengths
of a pair of shorter bands is equal to the length of the full-length
band, then the 2 bands come from cleavage at the same mutation. If
the mutation has occurred in the desired position to yield a
nonsense mutation, all plants in the corresponding pool are
analysed individually to identify the organism that carries the
mutation.
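
The band-length check can be written as a one-line comparison. In this sketch the lengths are in bp, the tolerance stands for the measurement error on the gel, and all names and numbers are assumptions made for the example.

```python
def locate_mutation(amplicon_len: int, fwd_fragment: int, rev_fragment: int, tolerance: int = 5):
    """Return the mutation position (bp from the forward primer end) or None.

    The two labelled cleavage fragments come from the same mutation only if
    their lengths add up to the full amplicon length (within the tolerance).
    """
    if abs(fwd_fragment + rev_fragment - amplicon_len) > tolerance:
        return None
    return fwd_fragment


print(locate_mutation(1000, 300, 700))  # 300
print(locate_mutation(1000, 300, 500))  # None
```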

Once the proper mutant has been selected, its phenotype is thoroughly studied to determine the function
of the corresponding gene. Efficient method, only 10k Arabidopsis plants are needed.
A reverse genetics approach to find those mutations in the gene of interest that revert the mutant phenotype. After mutagenizing a
seed population and obtaining the M2 generation (the M1 is a chimeric generation), a screen is performed to identify mutations in
the gene of interest with a PCR-based protocol. Since PCR is efficient at amplifying short fragments (up to 1.5 kbp), PCR primers are
designed to amplify only specific positions on the exons. The mutation is searched at predetermined positions on the gene,
particularly those that should introduce an in-frame stop codon. The position depends on the mutagenic agent that has been
employed, as each of those introduces a specific mutation. Genomic samples from 8 plants of the M2 population are pooled in the
same microwell to perform PCR. Most M2 plants are wild-type (no mutation on the target gene) while a small number of them are
heterozygous for the mutation. In the pooled sample, after PCR has been performed dsDNA is denatured at high T and then
reannealed. ssDNA fragments that contain the mutation have a high chance of annealing to wild-type complementary ssDNAs, thus
forming a heteroduplex (no complete complementarity). Then, fluorescent primers are added to the sample and annealed to the
previous PCR products. This way, each dsDNA fragment contains two 5’ primers, each exhibiting a different fluorescence signal. After
that, the CelI enzyme is added to the sample to cut the DNA heteroduplexes. After another round of denaturation, a mixture of
ssDNA is obtained: wild-type ssDNAs are full-length and tagged with fluorescent markers, while mutated ssDNAs that are still tagged
are shorter. The length of the mutated fluorescent ssDNA can be measured to identify the position of the mutation.

The TILLING assay can be performed on multiple model organisms, however the mutant library is preserved
with a protocol that is specific for each species. C. elegans is a nematode that has been extensively studied
to understand embryonic development. In C. elegans TILLING experiments, each F1 mutagenized line is
frozen individually to preserve the animals for long periods of time. In Drosophila TILLING experiments,
the mutant population can be kept alive at 18°C. In zebrafish, the mutant population cannot be maintained, so
it can be accessed only for a limited amount of time.

A mutagenized plant isn’t legally considered a GMO, therefore useful mutants can be easily planted in
fields. TILLING isn’t transgenic because it doesn’t introduce exogenous genes in the plant genome. On the
other hand, genome editing that yields the same mutation as a TILLING approach generates mutants that
are unfortunately considered GMOs. Normally, a specific mutation is introduced and studied with genome
editing, then introduced with TILLING in plants to be cultivated.

TILLING & sequencing. Each mutagenized plant is assigned to a specific row and column (the population is
virtually organized into a 3D space). Then, genomic samples from plants of the same row are pooled and PCR
amplified with primers containing a specific index. The same is repeated for each column. PCR products are
then sequenced on an Illumina platform. Reads contain indexes that indicate the row or column of the
plant from which the original genomic sample was extracted. The mutation of interest is identified in a
specific row sample and a column sample. From the indexes it’s possible to identify the original mutant
plant of interest.
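
A minimal sketch of how the row and column indexes point back to a single plant, assuming a simple 2D grid (a third pooling dimension would work the same way). The grid layout, plant names and function name are hypothetical, added by me for illustration.

```python
from itertools import product


def candidate_plants(grid, rows_with_mutation, cols_with_mutation):
    """Return the plants sitting at the intersection of positive row and column pools.

    grid maps (row, column) -> plant identifier.
    """
    return [
        grid[(r, c)]
        for r, c in product(sorted(rows_with_mutation), sorted(cols_with_mutation))
        if (r, c) in grid
    ]


# Hypothetical 4 x 4 arrangement of mutagenized plants.
grid = {(r, c): f"plant_{r}{c}" for r in "ABCD" for c in range(1, 5)}

# The mutation was found in the reads indexed for row B and for column 3.
print(candidate_plants(grid, {"B"}, {3}))  # ['plant_B3']
```

With a single mutant the intersection is unique. If several rows and columns are positive at the same time, the intersection gives more candidates, which then have to be checked individually.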

Mutations introduced in TILLING can have various consequences on the plant’s phenotype. Some mutations
can be studied when the mutant is planted in a field. TILLING-mutagenized plants can be planted because
they aren’t considered GMOs.

Small RNAs
Mostly siRNA and miRNA. 2006 Nobel prize to Mello & Fire for the
discovery of RNA interference in C. elegans. Baulcombe was actually the first
to discover sRNAs in plants.

sRNAs are 21-24 nt long and are generally involved in gene silencing, by
both post-transcriptionally inhibiting mRNA translation and inducing DNA
methylation (transcriptional silencing). The core mechanism consists of the
maturation of sRNA by DICER. Dicer or Dicer-like proteins (DLP) process
long dsRNA molecules or hairpin RNA in the cytosol to produce short RNA
duplexes of the desired length. Dicer’s structure allows it to measure the
RNA it is cleaving, thus generating uniformly sized dsRNA molecules. One of
the 2 complementary strands then associates to Argonaute to perform
gene silencing. Argonaute proteins bind sRNAs and their targets. Argonaute
mutant plants resemble an octopus and generally can’t survive well.

2 main systems: siRNA and miRNA pathways. The siRNA pathway is triggered by dsRNA molecules, either
synthesized by the cell or exogenously introduced, and induces transcriptional silencing by DNA
methylation. The DNA region to be methylated is complementary to the sRNA bound to AGO. On the other hand,
miRNAs are encoded by specific genes (promoter, transcription stop sites) that produce palindromic
sequences that form dsRNAs. Dicer processes the precursor to yield the active miRNA, which binds to AGO
and directs it to specific sequences on mRNAs. Recognized mRNAs are either actively degraded or
not translated.

siRNA
siRNAs suppress invading viruses and silence sources of aberrant transcripts (when a
gene is overexpressed, it might be silenced), transposons and repetitive elements.
siRNAs globally regulate gene expression as a defence mechanism.

siRNA effects have been observed for ages. If a plant is inoculated with a virus and recovers from the
infection, it becomes resistant to other infections with the same virus. To understand the underlying
mechanism, plants that were infected at day 1 and recovered were again inoculated with the same virus at
day 22 on a different leaf. The inoculated leaf was picked at day 32 and its RNA content was extracted. The
comparison between the RNA content of first-contact leaves and second-exposure leaves in a Northern blot
showed that the viral RNA quantity had decreased in the leaves inoculated at day 22.

It was later discovered that a small RNA homologous to viral RNA is present in
inoculated leaves and distal, systemic leaves, but not in mock-infected leaves (inoculated
with water). In fact, a Northern blot shows the appearance of a 24 nt RNA band 10
days post infection in both inoculated and systemic leaves.

[…]

Since the inoculated leaves are different, the immunologic memory agent must travel
from the first infected leaf to other systemic leaves to grant them resistance. To study
the spreading of immunity, GFP was overexpressed in a transgenic plant. Its leaves have a green
fluorescence, while wild-type leaves display a red fluorescence signal (chlorophyll). When a dsRNA
complementary to the GFP transgene is inoculated and the fluorescence of the inoculated transgenic leaf is
recorded, the points of injection become red because GFP isn’t synthesized anymore (gene
silencing). In time the red dot can spread locally, thus demonstrating that dsRNA is
able to spread between cells. Moreover, the silencing pattern can spread
systemically through the plant phloem, thus reaching other non-inoculated
leaves. The same systemic spreading of sRNA silencing has been observed in C.
elegans.

Systemic silencing requires a strong signal amplification to increase the concentration of siRNAs to cover
the whole plant. This happens thanks to the RNA-dependent RNA polymerase.

Dcl2-dcl4 double mutants are far more susceptible to viral infection than wild-type because they’re unable
to process dsRNA to make defensive siRNA.

On the other hand, viruses code for specific siRNA-suppressing proteins that target specific components of
the siRNA pathway to overcome the plant’s defence mechanism. Those viral proteins have been employed
in scientific studies to block the siRNA pathway at each specific step to study its organization and action
mechanism.

In biotechnology studies, the overexpression of a gene of interest often leads to the gradual reduction of
its expression through several generations thanks to the siRNA system. Flower colours were one of the
phenotypes used to study the transgene-induced post-transcriptional silencing, especially in petunias.

For example, the chalcone synthase biochemical pathway was investigated for its ability to
biosynthesize anthocyanins. Different types of gene constructs were introduced in petunias. By inserting
sense ORFs for the enzyme, extra protein was translated and petal colours were darker. However, by
expressing an antisense ORF for the same enzyme, a complementary RNA is generated and enzyme
expression is silenced, thus making petals colourless (white). Unexpectedly, when sense transgenes were
overexpressed some of the plants also displayed white petals. This phenomenon is called co-suppression
because sense RNA overexpression induces the silencing of both exogenous and endogenous
genes coding for the protein. It was discovered that overexpressing the sense transgene generated dsRNAs
that could be processed by Dicer and act on mRNA thanks to AGO (post-transcriptional silencing).

Similar studies were performed on the C. elegans gene unc-22. Silencing of unc-22 causes loss of muscle
control.

Small RNAs can initiate gene silencing through covalent modifications of the DNA (cytosine methylation) or
its associated histone proteins, interfering with transcription. The precise mechanisms by which siRNAs
target DNA for silencing are not known, however two plant-specific RNA polymerase complexes, RNA
Polymerases IV and V, are involved in the process. siRNA production is catalysed by RNA pol IV, while RNA
pol V recruits AGO4 on the DNA.

12/11/21

Most cellular siRNAs that are sequenced and mapped back on the reference genome (specific protocols for
sRNA sequencing) have been found to be derived from transposons and repetitive DNA regions. Those same
regions are highly methylated.

miRNA
how can you study those?

miRNAs are important to regulate gene expression and function. It is thought they evolved from siRNAs, as
they are processed by the same enzymes (Dicer and AGO). Some miRNAs are highly conserved. Unlike
siRNAs, they are specifically produced from MIR genes (promoter sequence, …) as trans-acting regulators
(they act on other genes). In plants, miRNAs especially regulate developmental and physiological events.

A MIR gene encodes an RNA precursor that contains a hairpin structure of dsRNA, which is recognised by
Dicer and cut into smaller dsRNA fragments. The strand of the dsRNA fragment that is able to anneal to the
target mRNA is maintained in AGO, while the complementary strand is degraded. The AGO-miRNA complex
either inhibits the translation of target mRNA ([mRNA] stable) or actively degrades target mRNA
(lower [mRNA]).

In A. thaliana there are specific Dicer-like proteins that process miRNA, while a different subset only
participates in siRNA maturation. AGO1 binds both miRNA and siRNA and preferentially cleaves its targets.
AGO4 binds to siRNA and mediates methylation of source DNA.

MIR genes have a particular structure. Their primary transcript corresponds to the precursor pri-miRNA,
which contains other RNA sequences that aren’t active. Some MIR genes are highly conserved and
duplicated. Typically, they act on transcription factors.

MIR genes have probably evolved from siRNA by a gene duplication and inversion. Plant miRNAs are thought to
be derived from their target sequences following an inverted duplication event and divergence. Only some
miRNAs confer selective advantage and are retained and further duplicated. As a result, conserved MIR
gene families usually exist in multiple copies, while non-conserved miRNAs occur as single genes.

For example, within the plant kingdom miR156 is highly conserved and encoded by 5 different genes.
miR156 is a regulator of the SPL transcription factors, which control developmental timing.

Vegetative phase change = transition from juvenile to adult growth in plants. In some cases, adult phase
growth is visibly different from the juvenile phase, as exemplified by leaf shape, phyllotaxy and trichome
patterns. Arabidopsis displays a peculiar vegetative phase change: juvenile leaves are round-shaped, while
adult leaves become serrated and grow trichomes on the abaxial side.

Mutants in HASTY, a gene of the miRNA pathway, have a shorter juvenile phase and flower earlier than
wild-type. The HASTY protein is needed for the export of pre-miRNA from nucleus to cytoplasm, where
it can bind to AGO.

Loss-of-function zippy mutants prematurely express adult vegetative traits. ZIPPY encodes an ARGONAUTE
protein, AGO7.

Overexpression of miR156 prolongs the juvenile phase, and the plant never transitions into the adult phase.
In wild-type plants that display normal growth, miR156 levels decrease allowing SPL protein levels to
increase, triggering phase change. The SPL family mRNA has a sequence in its 3’-UTR that is complementary to
miR156. To prove that miR156 acts directly on SPL mRNA, its 3’-UTR sequence is either removed or
replaced with another lacking a miR156 binding site. In these genetically engineered plants, SPL is
expressed earlier to promote a premature transition to adult phase. The knockout of the seed sequence
isn’t proof that SPL is directly regulated by miR156. The seed sequence is mutated. Moreover, miR156
loss-of-function promotes precocious phase change.
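
To make the idea of a "binding site complementary to the miRNA" concrete, here is a small sketch I added that searches an mRNA for the reverse complement of a miRNA. The sequences are short invented toy strings, not the real miR156 or SPL sequences, and the optional `max_mismatches` parameter is my own assumption.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA sequence (5'->3')."""
    return rna.translate(COMPLEMENT)[::-1]


def find_binding_sites(mirna, mrna, max_mismatches=0):
    """Return the start positions (0-based) of sites on the mRNA that can pair with the miRNA."""
    site = reverse_complement(mirna)
    hits = []
    for i in range(len(mrna) - len(site) + 1):
        window = mrna[i:i + len(site)]
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


toy_mirna = "AUGGCUAC"                    # invented sequence
wild_type_utr = "CCCCGUAGCCAUAAAA"        # contains the complementary site
engineered_utr = "CCCCAAAA"               # site removed, as in the engineered plants

print(find_binding_sites(toy_mirna, wild_type_utr))   # [4]
print(find_binding_sites(toy_mirna, engineered_utr))  # []
```

The engineered UTR has no site left, which mirrors the experiment above: without the site the miRNA can no longer recognise the transcript.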

miR172 is another miRNA that regulates the transition from adult to
reproductive (flowering) phase. It acts on the TF AP2. Floral initiation may
occur when the level of AP2-like floral inhibitors drops below a certain level.
Overexpression of miR172 causes early transition to flowering phase. SPL
directly activates transcription of miR172.

miRNAs regulate developmental timing in other organisms. For example, in C. elegans lin-4 is required for
proper larval development. lin-4 is a miRNA that binds to multiple sequences on the mRNA of lin-14.
Lowering levels of lin-14 protein activates the transition into L2(?) larval phase.

miRNAs contribute to the formation of patterns during development thanks to the fact that their
concentration isn’t equal in all cells. The gradient can define polarity of cells. miRNAs can move between
adjacent cells to spatially restrict activity of their targets. In neighbouring cells their concentration is lower
than in the cell that is producing them.

Example: P uptake by plants


P uptake must adapt to the plant’s needs. miR399 genes are induced upon P starvation. Overexpression of
miR399 causes P overaccumulation.

miR399 binds to the mRNA of an E2 ligase called PHO2 that in turn inhibits P accumulation. miR399
knockout -> high E2 -> no P uptake. miR399 overexpression -> low E2 -> overaccumulation of P. PHO4 is an
F-box protein that binds to others in the SCF E3 ligase complex. Its targets are still unknown.

Grafts can be performed with A. thaliana: usually the roots of one phenotype are connected to the aerial
part (shoot) of a wild-type plant. This method can be used to study the effects in the aerial part of a
mutation that acts on the roots.

• Pho2 mutants overaccumulate P.
• Pho2 mutant shoots grafted on wild-type roots display wild-type levels of P uptake.
• Pho2 mutant roots grafted on a wild-type aerial part overaccumulate P.
• miR399 overexpression either in roots or shoots grafted on wild-type opposite plant parts causes P
overaccumulation regardless.

This suggests that Pho2 specifically acts in the roots and that miR399 is translocated from shoot to root.

Low phosphate in the soil activates overexpression of miR399, while high P concentration causes low
expression of miR399. In addition, the translocation system of miR399 is unidirectional, as it does not work
from root to shoot:

• miR399 overexpressed in the shoot -> high miR399 levels in the shoot
• miR399 overexpressed in the root -> low miR399 levels in the shoot
• miR399 overexpressed in both shoot and root -> high miR399 in the root.

This suggests that miR399 is translocated from shoot to root but not vice
versa. miR399 moves through the phloem.

miR399 activity is also regulated by a target mimic. miR399 can also anneal to the mRNA of IPS1, which
mimics the seed sequence on the PHO2 mRNA. IPS1 mRNA isn’t degraded by AGO complexes. IPS1 mRNA
irreversibly sequesters miR399 from the cytoplasm. This additional regulatory activity is necessary to
precisely regulate phosphate levels in the plant.

RNA interference
RNAi consists of expressing dsRNAs in an organism to activate the Dicer-AGO pathway and knock down specific
genes. It can also be restricted to specific tissues and applied to genes in linkage (so close that they
cannot independently segregate).

RNAi construct and gateway cloning


RNAi constructs consist of a sense and an antisense portion separated by a specific intron. The intron is
spliced away, so it doesn’t interfere with dsRNA formation. The cut and ligate system isn’t effective to
obtain these kinds of constructs. Gateway cloning is performed instead.

Gateway cloning is based on phage lambda integration into host bacterial DNA at specific recombination
sites (att sequences). Single mutations on such sites completely abolish phage integration. Those sites have
been engineered to become more specific, meaning that they only initiate recombination with their specific
partner sequence. attL1 recombines with attR1. attL2 recombines with attR2. This way, it is possible to
insert the desired sequence into a donor vector without using restriction enzymes, thus avoiding problems
with enzyme star activity and cut sites within the gene of interest.

1. The gene of interest is PCR-amplified to yield a fragment where the insert is flanked by the attB1
and B2 recombination sites.
2. The PCR product is incubated with a donor vector that contains attP1 and P2 recombination sites
where the gene should be inserted. After recombination, the gene is located into the entry clone
and flanked by attL1 and L2 recombination sequences. The entry clone also carries an antibiotic
resistance gene.
3. The entry clone is incubated with a destination vector that possesses attR1 and R2 recombination
sequences. After recombination with the appropriate enzymes, the gene is in the expression clone
flanked by new sequences. The final vector also carries a different antibiotic resistance gene.

4. The products of the last in vitro recombination reaction are used to transform bacteria. Cells that
successfully took up the expression clone are selected thanks to the antibiotic.
5. The entry clone (containing the gene sequence) is incubated with the destination vector and the products
are used to transform E. coli. Bacteria that contain the entry clone plasmid scaffold are kanamycin
resistant, while those with the destination vector scaffold are ampicillin resistant.

Compared to the classic cloning method that requires restriction enzymes and ligation, gateway cloning is
faster, avoids restriction enzymes, can be used for large scale cloning and multiple constructs can be
combined provided they possess the appropriate recombination sites. The RNAi construct is essentially a
binary vector where the same gene fragment is inserted into two sites in opposite directions on the same
plasmid thanks to inverted recombination sequences.

Plasmids 101: Gateway Cloning (addgene.org)

The BP Reaction takes place between the attB sites flanking the insert and the attP sites of the donor
vector. This reaction is catalyzed by the BP Clonase enzyme mix and generates the entry clone containing
the DNA of interest flanked by attL sites. As a byproduct of the reaction, the ccdB gene is excised from the
donor vector.

The LR Reaction takes place between the attL sites of the generated entry clone and the attR sites of the
destination vector. This reaction is catalyzed by the LR Clonase enzyme mix. As a result, an expression clone
with the DNA of interest flanked by attB sites is generated. As in the BP reaction, a DNA fragment
containing the ccdB gene is excised from the destination vector.

The next step is to transform competent E. coli cells and select the positive clones. The entry clone and
destination vector carry different antibiotic resistance markers (indicated here by plasmid color), allowing
you to easily select for the expression clone. You will also need to use an E. coli strain sensitive to CcdB (e.g.
DH5α, TOP10, Mach1). The ccdB gene is present in the donor vectors and the destination vectors prior to
recombination, and it is exchanged with the gene of interest during the BP or LR reactions. Since the CcdB
protein inhibits the growth of CcdB sensitive E. coli strains, most colonies should contain the desired,
recombined construct.
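
The bookkeeping of att sites and selection markers can be summarised with a toy model. This is just a sketch I wrote to keep track of which site becomes which: the function names and the dictionary layout are hypothetical, and it does not model the real enzymes or sequences. The antibiotic assignments follow the notes above (entry clone kanamycin, destination vector ampicillin).

```python
# Site pairs and the site type left flanking the insert after recombination:
# attB x attP -> attL (BP reaction), attL x attR -> attB (LR reaction).
PRODUCTS = {
    frozenset(("attB", "attP")): "attL",
    frozenset(("attL", "attR")): "attB",
}


def recombine(insert_site, vector_site):
    """Return the att site flanking the insert after recombination, e.g. 'attL1'."""
    kind_i, num_i = insert_site[:4], insert_site[4:]
    kind_v, num_v = vector_site[:4], vector_site[4:]
    if num_i != num_v:  # engineered specificity: site 1 only pairs with site 1, 2 with 2
        raise ValueError(f"{insert_site} does not recombine with {vector_site}")
    product = PRODUCTS.get(frozenset((kind_i, kind_v)))
    if product is None:
        raise ValueError(f"{insert_site} and {vector_site} are not partner sites")
    return product + num_i


def colony_grows(plasmid, antibiotic):
    """A CcdB-sensitive cell grows only without ccdB and with the right resistance."""
    return (not plasmid["ccdB"]) and plasmid["resistance"] == antibiotic


entry_sites = [recombine(b, p) for b, p in (("attB1", "attP1"), ("attB2", "attP2"))]
expression_sites = [recombine(l, r) for l, r in zip(entry_sites, ("attR1", "attR2"))]
print(entry_sites)       # ['attL1', 'attL2']
print(expression_sites)  # ['attB1', 'attB2']

plasmids = {
    "entry clone": {"resistance": "kanamycin", "ccdB": False},
    "unreacted destination vector": {"resistance": "ampicillin", "ccdB": True},
    "expression clone": {"resistance": "ampicillin", "ccdB": False},
}
survivors = [name for name, p in plasmids.items() if colony_grows(p, "ampicillin")]
print(survivors)  # ['expression clone']
```

On an ampicillin plate only the expression clone survives: the entry clone lacks the resistance and the unreacted destination vector still carries ccdB.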

18/11/21

Advantages and applications


RNAi is a powerful reverse genetics strategy to partially silence genes of interest and observe the resulting
mutant phenotype. Unlike classic mutagenesis studies (TILLING, T-DNA and transposon insertion libraries),
meaningful results do not require a high number of organisms in a population. In addition, the same
RNAi-mediating dsRNA can act on more than one mRNA target. Thus, RNAi is a powerful tool to overcome
redundant genes or genes in linkage disequilibrium.

However, one major disadvantage of this system is the possibility of off-target effects. The same transcript
from the RNAi construct that is processed by Dicer can yield more than one sRNA molecule, each with the
ability to induce RNAi. As the positions of Dicer cleavage are not fixed, the 5' ends of small RNAs (called
siRNAs) are not known, thus predicting their target is difficult. This problem can be fixed by using artificial
miRNA instead of RNAi constructs. The artificial miRNA construct is a PCR product from naturally existing
miR genes that target the gene of interest (?). The complete spectrum of artificial miRNA is easily
predictable.

RNAi and artificial miRNAs are commonly used to confirm mutant phenotypes, overcome gene
redundancy (target 2 genes at a time), overcome linkage of genes (?) and act on genes in a specific tissue or
developmental stage (it depends on the promoter of the interfering RNA construct).

RNAi in C. elegans
RNAi was first discovered in C. elegans by Fire. As soon as the complete genome sequence of the nematode
was obtained, this new method to specifically silence genes was immediately exploited to study gene
function.

C. elegans is the optimal genetics model organism: 1-mm long, transparent, free-living in a well, high brood
size, short development time, self-fertilising hermaphrodites. Its anatomy has been studied extensively to
precisely determine the total number of cells in each tissue. The complete cell lineages (the pattern of cell
divisions) from fertilized oocyte to adult are known, therefore the effects of a mutation can be predicted to
manifest in specific cells and tissues.

Despite its popularity, RNAi in C. elegans has some limitations:

• RNAi in the nematode has less efficiency and specificity against nervous system-specific genes.
• Also, the results display some degree of variability between different labs because environmental
inputs regulate gene expression.

To perform RNAi, the dsRNA molecule must be delivered using one of 4 methods:

a) microinjection
b) soaking in dsRNA rich solution
c) feeding. Bacteria that express dsRNA are
eaten by nematodes
d) transgenes. Transform the organism to
stably express the RNA construct.
Particularly used for neural cells.

To perform RNAi-mediated knockdown of multiple genes, injections and genetic transformations aren’t
feasible. Normally feeding libraries are prepared to rapidly screen the function of thousands of genes in
different worms.

RNAi has temporary effects and it doesn’t completely silence target gene expression. The effects of a
complete gene knockout must be further investigated with other approaches. However, RNAi can
specifically knockdown a target gene in a specific developmental stage. This is extremely useful to study the
function of genes that operate in 2 distinct developmental stages. Knockout of such genes that results in
embryonic lethality won’t allow the study of the function of the gene in later stages. To perform this study
with knockout approaches, a transgenic copy of the target gene must be co-expressed in early
developmental stages to prevent embryonic lethality (complementation tests).

Genome editing
Genome editing is based on the induction of DSB DNA repair and targeting nucleases to specific genomic
regions.

• In a non-homologous end joining approach, the target sequence is disrupted by short or large
insertions and deletions. Also, the ends of the DSBs can be used to ligate a new sequence, delete an
existing one or invert it.
• In homology-directed repair approaches, the presence of a donor DNA fragment induces insertions or
substitutions of sequences depending on sequence homology. The presence of a DSB in a specific region
significantly increases the odds of homologous recombination activation.

To perform genome editing, 3 systems have been developed: ZFN, TALENs and CRISPR-Cas9.

ZFN & TALENs
ZFN are the pioneers of genome editing. They were designed by engineering Zn-finger transcription factors
to obtain a fusion protein that binds to specific DNA sequences and carries a DNA endonuclease domain. Each
Zn-finger domain recognizes a specific base triplet on the DNA by contacting the major groove. By joining 6
Zn-finger domains, the recognition sequence on the DNA is lengthened to 18 bp. ZFN were specifically
designed to target a specific DNA sequence. However, the recognition sequence was a combination of the
triplets recognized by the Zn-finger domains, and not all genomic regions could be targeted efficiently
with this system.

The original ZFN was then optimized by introducing dimerization. Each monomer specifically binds to an
18-bp long target sequence, therefore a dimeric structure binds to a 36-bp long recognition sequence.
Specificity was increased, however not every possible DNA triplet could be targeted.

TALENs were designed as an evolution of ZFN. TALE domains were discovered in DNA binding proteins
from bacteria that can bind to eukaryotic sequences during plant infection. Each functional targeted
nuclease is a heterodimer. In each subunit, the DNA binding portion consists of a precise sequence of TALE
domains, each able to specifically recognize one base. A TALE amino acid sequence consists of highly
conserved residues and a couple of variable residues that determine the base specificity of the protein
domain (Repeat-Variable Di-residues, RVDs). By carefully designing the TALE sequence, a specific DNA region
can be targeted.
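
As an illustration of the "one TALE domain per base" logic, the sketch below turns a target DNA sequence into the ordered list of RVDs. The RVD-to-base code used here (NI = A, HD = C, NN = G, NG = T) is the commonly cited one and was not given in class, so treat it as something I added; NN is also reported to tolerate A. The function name and example target are hypothetical.

```python
# Commonly cited RVD code (added by me, not from the lecture).
RVD_FOR_BASE = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def design_tale_repeats(target_dna):
    """Return the ordered RVDs, one TALE repeat per base of the target sequence."""
    return [RVD_FOR_BASE[base] for base in target_dna.upper()]


print(design_tale_repeats("GATC"))  # ['NN', 'NI', 'NG', 'HD']
```

A real TALEN monomer would need one repeat for every base of its binding site, which is why so many modules have to be assembled in the right order.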

However, to construct the final fusion protein 18 different TALE modules had to be cloned in the same
construct, which made the technology extremely laborious. Some strategies were developed to simplify
and streamline the TALEN construction process. For example, all 18 TALE modules needed to design a
specific TALEN monomer were first cloned inside vectors that are specific for the final order of the TALE
domains. Then, each TALE sequence is cut with a specific restriction enzyme that generates sticky ends. The
restriction enzyme is specifically chosen to generate sticky ends that can anneal to the restriction fragment
of another TALE sequence. Once the restriction fragments are ligated, the resulting sequence contains the
TALE coding sequences in the correct order to generate the appropriate DNA-binding domain of the
engineered nuclease. Despite these advances, the TALEN coding construct could induce homologous
recombination when inserted into a plant (?).

CRISPR systems
Another discovery allowed the design of the third genome editing tool: CRISPR-Cas9. The CRISPR locus was
first identified in Archaea as blocks of repeated sequences separated by same-length spacers. By searching
for sequence homology between the spacer and other known sequences in a BLAST analysis, it was
discovered that they were identical to those of bacteriophages. Later, Doudna and Charpentier
reengineered the bacterial system to produce customizable nucleases for genome editing. At the same
time, Zhang first used and optimized the system to genetically engineer mammalian cells.

There are 3 main types of CRISPR system. Type II is the one actually used for genome editing because it
requires only one nuclease protein (Cas9), while type I and III require a protein complex. If the CRISPR
system has to be expressed in the cells to be genetically engineered, the expression of only one protein is
more feasible.

The CRISPR locus consists of spacers flanked by palindromic repeats. When transcribed, the resulting RNA
molecule that contains all spacers is processed together with the tracrRNA and cleaved to form single
cr:tracrRNA complexes. Each RNA complex binds to the nuclease Cas9 and directs it towards target sequences
that match the guide sequence.

In the immunity mechanism, bacteria that survive a phage infection recognise and cleave a protospacer
from viral DNA. This viral sequence is then inserted in the CRISPR locus as a new spacer, together with a
new repeat. The protospacer sequence must be adjacent to a protospacer adjacent motif (PAM), which is only
present in viral DNA. This will later prevent the Cas9 system from cleaving bacterial DNA.

The CRISPR locus consists of the actual CRISPR sequences, cas genes and tracrRNA coding sequence.
tracrRNA (trans-activating crRNA) is a small noncoding RNA that takes part in the processing of the
pre-crRNA together with RNase III and Cas9. tracrRNA is complementary to the repeat sequence that flanks the
spacer on the crRNA. The processing involves the trimming of the ends of the crRNA and its binding to
tracrRNA.

Cas9 bound to the cr:tracrRNA complex is activated. It introduces DSBs in target DNA that binds to the
spacer sequence on the crRNA and that is located next to a PAM sequence.

This system was optimised for genome editing. The modified guide RNA is a
single molecule that replaces the tracr:crRNA complex, thus removing the
need of an additional tracrRNA coding sequence. This system introduces
DSBs, that are immediately recognised and repaired by the cell to preserve
chromosome integrity.

The original Cas9 protein that was used in genome editing was isolated from S. pyogenes and it selectively
cuts target DNA flanked by an NGG PAM. This restricts the actual genomic sites that can be engineered by
this system. The nuclease is composed of 2 DNA-cleaving domains: HNH (cleaves the gRNA complementary
strand) and RuvC (cuts the non-complementary DNA strand).
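
The NGG requirement can be shown with a small scan for candidate target sites. This is a simplified sketch I added: it only looks at the forward strand, and the 20-nt protospacer length is the value usually quoted for S. pyogenes Cas9, which is my assumption since it was not stated in class. The function name and the example sequence are made up.

```python
def find_cas9_targets(dna, spacer_len=20):
    """List (start, protospacer, PAM) for every protospacer followed by an NGG PAM.

    Only the given (forward) strand is scanned; the other strand could be
    scanned by passing its reverse complement.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - spacer_len - 2):
        pam = dna[i + spacer_len:i + spacer_len + 3]
        if pam[1:] == "GG":  # N can be any base, then two Gs
            hits.append((i, dna[i:i + spacer_len], pam))
    return hits


toy_sequence = "A" * 20 + "TGG" + "CCC"
print(find_cas9_targets(toy_sequence))
# [(0, 'AAAAAAAAAAAAAAAAAAAA', 'TGG')]
```

A sequence without any GG dinucleotide in the right place returns an empty list, which is the practical meaning of "this restricts the genomic sites that can be engineered".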

The CRISPR system has similarities with the eukaryotic RNAi. Both can be used as a defence from virus
infection. In both cases an RNA molecule is involved and it directs the activity of proteins thanks to
sequence complementarity to the target.

Genome editing allowed the replacement of mouse genes with human homologous genes to study the
differences between those and obtain more predictive animal models.

New CRISPR systems have been later discovered. In Prevotella and Francisella, CRISPR system type V Cpf1
doesn’t need the tracrRNA and recognises T-rich PAMs. This system also introduces sticky ends on the
broken DNA molecules. This system can be used to target regions previously inaccessible with the classic
type II. It also facilitates NHEJ gene insertions.

CRISPR is unable to completely abrogate off-target events, despite the sgRNA complementarity
requirement. sgRNAs are specifically designed to reduce the probability of off-target events. Alternatively,
off-target cuts can be prevented using nickase Cas9, which has only one DNA nicking domain active. By
combining two Cas9 nickases that bind different sgRNAs designed to target adjacent regions, a DSB is still
introduced in the target sequence. The odds of that happening in other sites are much lower than the ones
associated with a wild-type Cas9.

CRISPR-mediated gene editing can in principle be applied in medicine to potentially cure genetic diseases
and other malignancies. Ethical regulations prohibit the editing of germline cells to prevent the
inheritance of edited DNA by the next generation. Gene editing performed on embryos is in principle simpler
than performing the same task on fully grown adults because there’s no need to target the CRISPR system to
specific cell types. In adult treatment, in vivo gene therapy requires the delivery of CRISPR-Cas9 to specific
cells. Cas9 is too big to be loaded in AAV vectors. Zhang discovered smaller Cas9 variants that could fit into
AAV vectors.

Base editing systems


Classic CRISPR gene correction replaces a mutated sequence with the wild-type variant carried by a DNA
donor molecule, by activating homologous recombination after a DSB. This system is inefficient: the
frequency of homologous recombination is low and there is a significant risk of introducing new mutations.
For clinical applications, base editing technology has been developed to skip the DSB induction part. It is
defined as the direct, irreversible conversion of one target DNA base into another in a programmable
manner, without requiring dsDNA backbone cleavage or a donor template.

Cytidine can be converted into uridine (read as thymine) by a specific deaminase enzyme -> C->T
conversion. The enzyme is fused to an inactivated Cas9, called dCas9, that retains the ability to bind to
DNA in a gRNA-dependent manner but is unable to introduce DSBs. dCas9 then guides the cytidine deaminase
to a specific genomic locus and directs it to specific bases to edit. Deaminase enzymes only work on ssDNA,
which is exposed only at the site where dCas9 binds, because binding requires the unwinding of the dsDNA
and the annealing of one strand to the gRNA. This reduces the likelihood of off-target deamination on sites
where dCas9 doesn’t bind.

Rat APOBEC1 cytidine deaminase was selected for its high catalytic activity. The enzyme is linked to dCas9
thanks to a specific linker peptide (XTEN) whose length is optimised to allow APOBEC1 to act at specific
positions in a small editing window on target DNA.
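
To make the idea of an editing window concrete, here is a minimal sketch of what a cytosine base editor does to a protospacer. The window positions (4 to 8, counted from the PAM-distal end), the function name and the example sequence are my own illustrative assumptions, not values given in the lecture.

```python
def cytosine_base_edit(protospacer, window=(4, 8)):
    """Return the protospacer after C->T conversion inside the editing window.

    Positions are 1-based and counted from the PAM-distal end.
    The default window (4-8) is an illustrative assumption.
    """
    start, end = window
    edited = []
    for pos, base in enumerate(protospacer.upper(), start=1):
        if base == "C" and start <= pos <= end:
            edited.append("T")  # deaminated C (U) is eventually read as T
        else:
            edited.append(base)
    return "".join(edited)


example = "GACCCAGCTTCGGATCCTAG"  # hypothetical 20-nt protospacer
print(cytosine_base_edit(example))
```

Note that in this sketch every C inside the window is converted, not only the one of interest, which is why the position of the window relative to the target base matters.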

When tested in human cell lines, this system showed low editing efficiency. This happens because cells
contain uracil DNA glycosylases (UDG) that specifically recognise uracil in DNA and excise it; base excision
repair then restores the original cytosine. However, a UDG inhibitor (UGI, from a bacteriophage) had been
previously identified and was fused to the base editor to increase editing efficiency in cells.

Moreover, the result of a successful cytidine to uracil conversion is a mismatched U:G base pair, which is
recognised by the mismatch repair mechanism. To retain the U and convert C:G into U:A (and eventually
T:A), the strand containing G is nicked. The mismatch repair mechanism treats the non-nicked strand as the
original template, therefore G is removed and replaced with A. To perform this, dCas9 is partially
reactivated (it becomes a nickase) to nick the strand containing G. The result of this optimisation was
Base Editor 3 (BE3), which showed improved efficiency in human cells.

However, BE3 cannot be used in gene therapies to correct every pathogenic gene variant. A significant
fraction of disease-causing mutations arises from the spontaneous deamination of cytosine, which converts
C:G into T:A. To correct these mutations by reinstating the C:G pair, adenine can be converted into
inosine, a modified base that pairs with cytosine and is read as guanine. The problem was that no known
adenine deaminase worked on DNA (they use free adenine, RNA or DNA:RNA heteroduplexes as substrate).

The original BE3 was modified to replace APOBEC1 with natural adenine deaminases from E. coli, mouse and
human. The resulting fusion proteins were tested in live cells, but no base editing occurred. To induce the
evolution of adenine deaminase into a new form that would accept DNA as substrate, a mutation that could
be repaired by base editors was introduced in a bacterial selection marker. The bacteria used for the
selection lack the eukaryotic mismatch repair machinery. The selection marker had to be restored to allow
the survival of bacterial clones. Clones that express natural adenine deaminase domains weren’t viable in
the selective medium because those enzymes cannot accept DNA as a substrate. However, clones that express
a randomly mutated adenine deaminase can survive if those mutations allow the mutant deaminase to act on
DNA. Several rounds of this experiment were performed to further improve the efficiency of the adenine
deaminase.

25/11/2021

Unbiased libraries of the adenine deaminase enzyme ecTadA linked to dCas9 were generated, each member
carrying unique mutations in the ecTadA coding sequence. The library was then expressed in
antibiotic-sensitive bacteria growing in selective medium. The antibiotic resistance gene contained a
loss-of-function mutation that required adenine deamination to be reversed, therefore bacteria expressing
an ecTadA variant able to act on a DNA substrate became antibiotic-resistant and could be easily selected.
An early variant from this selection (ABE1.2) was linked to dCas9 and expressed in human cells. Low but
detectable levels of adenine deamination were observed, proving that the modified enzyme works on DNA
substrates. Several further rounds of mutation and selection were performed on this initial lead to
optimise enzyme activity. The final optimised enzyme displayed a 40 to 70% base editing efficiency, while
conventional HDR-directed CRISPR-Cas9 gene editing has about a 2% efficiency in human cells. Moreover,
off-target effects and indels are rare with base editing.

Currently, the base editing toolbox contains several enzymes able to modify single bases in different ways
without introducing a double strand break.

Recently, the type VI CRISPR system (CRISPR-Cas13) has been reprogrammed to generate an RNA-guided
RNase and an mRNA base editor. In particular, the REPAIR system can replace adenosine with inosine (I) in
mRNA. These changes don’t affect DNA, therefore they aren’t permanent. dCas13 is linked to the deaminase
domain of an ADAR enzyme, which specifically targets RNA. Any adenosine in the target mRNA can be targeted
because Cas13 doesn’t require a PAM sequence.

Engineered Cas9. Wild-type Cas9 has low nuclease efficiency when no NGG PAM sequence is next to the
target sequence. Cas9 was engineered to recognise the more relaxed NGN PAM sequence, thus increasing the
number of possible target sequences.
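
To see how much a relaxed PAM widens the choice of targets, the following sketch counts candidate sites on one strand of a sequence for both PAM types. The function name, the regular expressions and the toy sequence are my own assumptions; a real guide design tool would also scan the reverse strand and score off-targets.

```python
import re

# PAM patterns written as regular expressions (N = any base).
PAM_PATTERNS = {"NGG": "[ACGT]GG", "NGN": "[ACGT]G[ACGT]"}


def find_target_sites(seq, pam="NGG", spacer_len=20):
    """Return (spacer_start, spacer, pam_seq) for every site on the given strand
    where a spacer of length spacer_len is immediately followed by the PAM."""
    seq = seq.upper()
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile("(?=([ACGT]{%d})(%s))" % (spacer_len, PAM_PATTERNS[pam]))
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


toy = "ACGTACGTAGGTTTTTCGA"  # hypothetical sequence, short spacer for readability
print(len(find_target_sites(toy, "NGG", spacer_len=5)))
print(len(find_target_sites(toy, "NGN", spacer_len=5)))
```

Every NGG site is also an NGN site, so the second count can never be smaller than the first.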

Prime editing
Despite recent advancements in Cas9 engineering and base editing, most disease-causing mutations
cannot be corrected with the available genome editing methods. For example, the A to T transversion that
causes sickle cell anemia can be corrected by neither an adenosine deaminase nor a cytidine deaminase
moiety. A possible new strategy involves the introduction of deletions, insertions or multiple mutations on
a target site without a DSB. In 2019, the prime editing system was put together to theoretically generate
any type of gene editing imaginable: targeted deletions, insertions and all 12 possible base-to-base
conversions.
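
The count of 12 conversions, and the reason why base editors cover only some of them, can be checked with a few lines. This is a minimal sketch; the variable names are mine, and the set of conversions assigned to base editors simply restates the C->T and A->G activities described above, read on either strand.

```python
from itertools import permutations

PURINES = {"A", "G"}


def classify(ref, alt):
    """Transition = purine<->purine or pyrimidine<->pyrimidine, otherwise transversion."""
    return "transition" if (ref in PURINES) == (alt in PURINES) else "transversion"


# Cytosine base editors give C->T (G->A when read on the other strand),
# adenine base editors give A->G (T->C when read on the other strand).
BASE_EDITOR_CONVERSIONS = {("C", "T"), ("G", "A"), ("A", "G"), ("T", "C")}

conversions = list(permutations("ACGT", 2))  # all ordered pairs of different bases
needs_other_method = [c for c in conversions if c not in BASE_EDITOR_CONVERSIONS]

for ref, alt in conversions:
    editor = "base editor" if (ref, alt) in BASE_EDITOR_CONVERSIONS else "other method"
    print(f"{ref}->{alt}: {classify(ref, alt)} ({editor})")
```

The four conversions reachable by base editors are exactly the four transitions; the eight transversions, including the A->T change of sickle cell anemia, are the ones that need a different approach.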

A prime editor (PE) consists of:

• A guide RNA called pegRNA
• A Cas9 nickase domain
• A reverse transcriptase domain.

The pegRNA guides nCas9 to the target sequence next to its PAM. The nickase domain introduces a ssDNA
break. Then, the 3’ terminal sequence of the pegRNA, which is longer than conventional sgRNAs, anneals to
the ssDNA strand right upstream of the nicked site. The pegRNA is specifically designed to be
complementary to the nicked DNA strand. The result is the formation of the optimal substrate for DNA
elongation on the RNA template by reverse transcriptase. The pegRNA template sequence corresponds to the
specific genome editing result to be achieved, thus the newly-synthesized DNA (3’ flap) contains the
editing mutation of interest (insertion, deletion, base conversion).

After prime editor dissociation, the target genomic region reanneals. Two alternative protruding ssDNA
flaps can form on the originally nicked strand:

A) A 3’ flap that contains the edited sequence. When the 3’ flap isn’t involved in the annealing, no base
mismatches are present in the dsDNA. This conformation is the most stable, however the 3’ flap cannot
be easily removed by exonucleases.
B) A 5’ flap that lacks the edited sequence. When the 5’ flap is in ssDNA conformation, the 3’ flap
anneals to the opposite DNA strand and a mismatched region is created (since only the 3’ flap
contains the edited sequence). This conformation is less stable than A).

However, the protruding 5’ flap is degraded by FEN1, an endonuclease that specifically removes 5’
protruding ssDNA. Since 5’ flap is preferentially removed, the DNA then retains the edited strand, the nick
and the mismatched region. As previously analysed for base editors, the endogenous mismatch repair
mechanism recognises the template strand as the one without any nicks. As a consequence, the edited
strand is preferentially removed because it carries a nick, thus lowering editing efficiency.

By employing the third generation of prime editors (PE3), editing efficiency was greatly improved.

• PE3 uses a classic sgRNA as well as the pegRNA. The sgRNA redirects the system to induce nicking
on the non-edited strand 14 to 116 nts away from the edited site. Editing efficiency trebled as a
result, however a higher chance of indels was observed.
• PE3b uses a sgRNA containing a spacer sequence that only binds to the edited sequence. Because
of nCas9 molecular mechanism, only the unedited strand can be nicked. The second nick is
introduced specifically after the resolution of the edited strand, as sgRNA is unable to bind to any
other sequence. This way, the edited strand can first be ligated. The sequence of events in PE3b
prime editing results in higher editing efficiency and a 13-fold decrease in indel frequency.

Ethics and regulations of genome editing


In 2016, UK scientists had official permission to edit genes in human embryos which would not be
reimplanted in a uterus (strictly in vitro experiments).

Moreover, CRISPR was first tested in an ex vivo gene therapy to target the PD-1 gene in immune cells. PD-1
is an immune checkpoint receptor (not an oncogene) that normally dampens the immune response, including
the one against cancer cells; therefore its inactivation is predicted to stimulate the immune system
against the tumour cells. However, compared with standard mAb therapy, gene editing technology is more
expensive as it requires collecting cells from the patient’s blood, culturing them, editing them with
low-efficiency systems and injecting the edited cells back into the patient. CRISPR technology is
beneficial only when its editing efficiency is much higher.

Recently, in China twin embryos were edited to become HIV-resistant and then reimplanted in the
mother’s uterus. The twins were born healthy. This practice is still banned in most countries because edited
genes can be passed down to the next generations. Moreover, the long-term effects of any edited gene in
all cells of an organism are still not well understood. From the ethical point of view, this embryonic editing
technology can potentially be performed to generate designer babies (i.e. individuals with specific physical
characteristics) and be only affordable for the very rich.

The targeted modification of a gene with a transgene yields a transgenic plant. This is considered a GMO,
whose cultivation is banned or highly discouraged in some European countries. An alternative strategy is
random mutagenesis via radiation or chemical treatment, followed by screening (e.g. TILLING) and selection
of relevant phenotypes, which yields crops that aren’t legally considered GMOs.

Currently, biotech methods involve the introduction of a CRISPR cassette, which codes for a target specific
Cas9 enzyme. After the target gene has been edited, the CRISPR cassette is removed by crossing. Legally,
plants obtained with this method are considered GMOs, even though the same result can be obtained by
chemical treatment. Scientifically there shouldn’t be any difference between the 2, since in both cases
there isn’t any trace of a transgene. Thanks Bruxelles!

26/11/2021


Transcriptome analysis
Transcriptome = collection of all mRNAs in a cell, tissue or organ. Its analysis allows the identification
of tissue- or developmental stage-specific genes to complete their functional analysis. The expression
profile of a gene suggests its function. However, high mRNA levels don’t necessarily correspond to high
protein levels because of RNA interference, mRNA degradation or protein cleavage.

It indicates genome wide expression levels. Gene expression can be studied in different ways and at a
single-gene level, while transcriptome analysis indicates all genes that are simultaneously expressed.
Different gene expression analysis methods are used:

• Northern blot, which also allows the measurement of mRNA length
• In situ hybridisation analysis
• Microarrays
• RNA-seq. Typically, Illumina short-read sequencing is enough to identify a specific mRNA. In some
cases, the whole mRNA sequence is obtained with other sequencing methods.

In situ hybridisation analysis


The tissue of interest is fixed, dehydrated and embedded in a paraffin or polymer block. The block is
sliced with a microtome into sections about 7 µm thick, and each section can be analysed by hybridisation
with a probe.

Probes are designed to hybridise to an mRNA of interest and can be visualised on the slide thanks to
labelling. Normally, one of two possible types of non-radioactive labelling is used:
• Direct labelling with a fluorophore. Some nucleotides in the probe are linked to a fluorescent dye.
• Indirect labelling. The probe is covalently linked to a reporter molecule that is bound with high
affinity by another ligand, which allows visualisation of the probe. Probes are synthesised in a
PCR-based protocol starting from the cDNA of the transcript of interest. A T7 promoter sequence is added
through the tail of the reverse primer, so that T7 RNA polymerase transcribes the antisense strand and
incorporates modified nucleotides, thus creating a labelled antisense probe. Examples of modified
nucleotides:
o Biotin labelling. Biotin is a naturally occurring vitamin which binds with very high affinity (Kd of
about 10^-14 M) to the protein streptavidin. Biotin is linked to the base moiety of a modified
nucleotide. Incubation with fluorescently tagged streptavidin identifies the probe in the tissue slide.
o Digoxigenin labelling. Digoxigenin is a plant steroid that is recognised and bound by a highly
specific antibody.

This method allows staining of a specific mRNA in specific tissues. Spatial and temporal expression profiles
can be easily discovered by analysing the same tissue in different developmental stages. Compared with
the Northern blot, in situ hybridisation has higher resolution, as it identifies specific subcompartments of an
organ where the gene is expressed (in Northern blotting signals simply identify the organ). Moreover,
multiple labelling can be performed with probes tagged with different reporters or fluorophores.
Sequential slides can be each stained with a specific probe and all the signals are then merged into a single
picture. In some cases, all probes can be used on the same slide. The downsides of this method are the
complexity of the protocol, the time it takes, the need to optimise each probe and the fact that the
analysis is gene-specific (this isn’t strictly transcriptomic analysis).

Microarray
= glass slides on which specific probes can be synthesized in specific spots. Without any prior information
on gene expression in the sample, DNA probes are synthesized on the slide (photolithography) or attached
to it (printing). Usually, DNA probes on the slide are complementary to one exon of a gene. Then the sample
mRNAs are fluorescently labelled and incubated with the slide, where they anneal to sequence-complementary
probes. The procedure can be easily automated and miniaturised.

The system is designed to allow hybridisation between DNA probe and mRNA only with perfect
complementarity. As little as a single base difference is enough to prevent annealing. Each DNA probe spot
has a mismatch control, which is identical to its perfect-match partner except for a single base
difference in a central position. Signals in the mismatch controls identify non-specific hybridisation,
therefore the row or column is eliminated.

Microarrays allow genome-wide analysis of gene expression. For expression profiling they have been largely
replaced by RNA-seq. However, the data obtained in microarray experiments are still used today.

Co-expression analyses are commonly performed on microarray data. Genes are clustered (put in the
same group) when their expression pattern is similar or identical across different tissue samples. This is
the guilt-by-association principle: genes with similar expression profiles are likely to be involved in
the same process.
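
A minimal sketch of how "similar expression pattern" can be turned into a number: the Pearson correlation between two expression profiles, with pairs above a cut-off reported as co-expressed. The gene names, expression values and the 0.9 threshold are hypothetical, and real co-expression analyses use many more samples and proper clustering methods.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression information
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold."""
    genes = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            if pearson(profiles[g1], profiles[g2]) >= threshold:
                pairs.append((g1, g2))
    return pairs


# Hypothetical expression levels across four tissue samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(coexpressed_pairs(profiles))
```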

Microarrays are still used for SNP detection, diagnostics, chromosome structure analysis...

Apart from repeat-rich regions (transposons, centromeres, …), the whole genome is covered by a set of
probes complementary to all ORFs and arranged in the correct order. Chromosome structure analysis
identifies correlations between the position of genes and their expression levels, as a result of
epigenetic marks.

SNP identification. DNA sequence variation causes most of the visible differences between individuals of a
species. Many kinds of SNPs can be analysed: some are linked to disease, some others to ancestry, some are
used in forensics and some for functional genomics studies. Primers are loaded on the slide. Their
sequence corresponds to the sequence immediately upstream or downstream of the SNP. Sample DNA then
hybridises to the primer regardless of its SNP variant. After that, minisequencing is performed by adding
only fluorescently-labelled ddNTPs and DNA polymerase, so that a single labelled base is added. This way,
the base at the SNP position is identified by the optics system and the specific SNP allele carried by the
sample genome is determined.

RT PCR
Real-time PCR quantifies the amount of PCR amplification product at each cycle by its fluorescence signal. It
is commonly used to confirm gene expression data obtained in microarrays or RNA-seq. It is a powerful
technique to quantify the relative abundance of a DNA molecule of interest from a sample without running
gel electrophoresis, since it employs threshold detection to compare PCR product abundance between
samples.

In a classic PCR amplification reaction, the number of DNA molecules as defined by the pair of primers
doubles at each cycle. This is true especially for the earliest PCR cycles, where amplification is exponential.
The quantity of PCR product after X cycles depends on the number of template copies in the starting
reaction mixture, thus wells that differ on the amount of starting DNA of interest will yield different PCR
product signals. In other words, for twice as much initial template (2T), there is twice as much PCR product
in exponential phase.
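
The doubling argument can be written down as a small calculation. This is a minimal sketch assuming ideal exponential amplification; the function names, the efficiency parameter and the copy numbers are my own illustrative choices.

```python
from math import log2


def product_after_cycles(n0, cycles, efficiency=1.0):
    """Copies after `cycles` of exponential amplification.

    efficiency = 1.0 means perfect doubling at every cycle.
    """
    return n0 * (1 + efficiency) ** cycles


def threshold_cycle(n0, threshold, efficiency=1.0):
    """Fractional cycle at which the product reaches the detection threshold."""
    return log2(threshold / n0) / log2(1 + efficiency)


# Twice as much starting template gives twice as much product at any cycle...
print(product_after_cycles(100, 10), product_after_cycles(200, 10))
# ...and reaches the same threshold one cycle earlier.
print(threshold_cycle(100, 1e9) - threshold_cycle(200, 1e9))
```

This is why samples are compared by the cycle at which they cross a fluorescence threshold: with perfect doubling, a difference of one cycle corresponds to a two-fold difference in starting template.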

However, a PCR reaction is never 100% efficient. In later PCR cycles, not all PCR product molecules are
amplified successfully for different reasons. Short DNA molecules are usually amplified with greater
efficiency compared with longer dsDNA. Moreover, the concentration of PCR products in the final PCR
cycles plateaus because of the depletion of reagents (primers, dNTPs, …), a decrease in Taq polymerase
activity or the reannealing of template strands. For this reason, all measurements are usually performed in
exponential PCR cycles, where the number of PCR products is proportional to the initial concentration of
template.

A real-time PCR machine is designed to record fluorescence at the end of each cycle in each well. The
fluorescence signal is proportional to PCR product quantity. 3 types of dyes are commonly used:

• SYBR Green is a non-sequence-specific DNA-binding dye that becomes fluorescent once bound to dsDNA. It
is easy to use with conventional primers, cost-effective and it reliably detects the presence or absence
of a specific template in the initial sample. Unfortunately, it detects all dsDNA products, including
primer dimers and nonspecific amplification products, therefore it may lead to false positives. Its use
for quantitative measurement is discouraged and the reliability of the protocol for a specific project
must be checked first.
• Molecular beacons are oligonucleotide probes that carry a fluorophore and a quencher on opposite ends
of the oligo. When the probe is not annealed to the template, it forms a stem-loop structure that brings
the dye close to the quencher, thus preventing any fluorescence emission (0 fluorescence). Once the probe
is annealed, the quencher is separated from the fluorophore and a fluorescence signal (proportional to
the number of template molecules) is detected. Despite their sequence specificity and low background
noise, it is difficult to design the probe so that it preferentially binds the target sequence rather
than reannealing to itself. Moreover, probes should not be consumed during strand elongation, so they are
designed to denature from the template during the elongation phase to prevent Taq pol from degrading
them.
• TaqMan probes are modified beacon probes. The oligo does not form a secondary structure when not
annealed, and the dye stays quenched as long as the probe is intact. During strand elongation, the 5’-3’
exonuclease activity of Taq polymerase cuts the probe and releases the reporter dye into the solution.
The fluorescent dye is no longer quenched and a signal is detected. After each cycle, a higher number of
dyes are active in the solution as they accumulate proportionally to PCR product amplification. TaqMan
probes are sequence-specific and easier to design, but significantly more expensive than SYBR Green.

Digital PCR is another quantitative method where the sample is diluted and split into many partitions
(vials) so that individual target molecules end up in different partitions. PCR is performed in each
partition for a defined number of cycles. PCR-positive partitions originally contained the template of
interest, and the number of PCR-positive partitions corresponds to the number of single molecules derived
from the mRNAs of the same gene. The method offers single-molecule sensitivity and absolute quantification
of gene expression.
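
Counting positive vials equals counting molecules only if no vial received more than one template. As an addition that was not covered in class, here is a minimal sketch of the Poisson correction that is commonly applied for this reason; the function names and the numbers are my own.

```python
from math import log


def copies_per_partition(positive, total):
    """Average number of template copies per partition, from the fraction of
    PCR-positive partitions, assuming templates are distributed at random (Poisson)."""
    if not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total (some partitions must be negative)")
    return -log(1 - positive / total)


def estimated_copies(positive, total):
    """Estimated number of template molecules in the whole partitioned sample."""
    return copies_per_partition(positive, total) * total


# With few positives the estimate is close to the simple count...
print(estimated_copies(10, 1000))
# ...with many positives some partitions held several copies, so it is higher.
print(estimated_copies(500, 1000))
```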

Tiling array and RNA-seq
To identify completely new genes from transcriptome
analysis, tiling array experiments are performed. Tiling
microarrays are designed to assay transcription at regular
intervals of the genome using regularly spaced probes.
Probes are selected to be complementary to one or both
strands. In microarray-based experiments, probes are
attached or synthesised on the microarray and then
incubated with fluorescently-labelled cDNA or RNA from
cell samples. A fluorescence signal identifies the
corresponding genomic region as actively transcribed in
the cell sample analysed. This method can be used to
identify transcripts of non-annotated sequences in a genome or from a non-annotated genome. RT-PCR
and standard microarray technology only analyse expression profiles of known genes since complementary
probes or primers are required.
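
A minimal sketch of the tiling idea: probes are taken at regular intervals along a sequence, without using any annotation, and the probes with a signal above background mark transcribed regions. Probe length, spacing, threshold and signal values are hypothetical, and the function names are mine.

```python
def tile_probes(sequence, probe_len=25, step=10):
    """Return (start, probe) pairs spaced `step` bases apart along one strand."""
    last_start = len(sequence) - probe_len
    return [(s, sequence[s:s + probe_len]) for s in range(0, last_start + 1, step)]


def transcribed_starts(signals, threshold):
    """Start positions of the probes whose fluorescence signal exceeds the threshold."""
    return [start for start, value in sorted(signals.items()) if value > threshold]


genome = "ACGT" * 10  # hypothetical 40-nt region
probes = tile_probes(genome, probe_len=10, step=10)
print([start for start, _ in probes])

# Hypothetical fluorescence signal measured for each probe start position.
signals = {0: 5.0, 10: 120.0, 20: 98.0, 30: 3.0}
print(transcribed_starts(signals, threshold=50.0))
```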

RNA-seq is now widely used to analyse the transcriptome (mRNAs, small noncoding RNAs and others).
Illumina sequencing is usually performed.

03/12/21

Regulatory pathways analysis


The expression of specific genes is regulated at multiple levels by multiple different mechanisms. At the
transcription level, transcription factors regulate multiple gene targets and are also in turn regulated by
other TFs. The identification of a gene regulatory network is essential to understand when, where, how
much and why specific genes are expressed and to potentially engineer such systems for biotech purposes.
Multiple genes that are co-regulated (expressed with the same profile) usually possess promoter elements
that are bound by the same TFs. TFs are trans-acting factors that bind to cis-regulatory sequences to
establish short or long-range interactions with target promoters. Some regulatory sequences (e.g.
enhancers) aren’t located close to the transcription start site.

Generally, modules of gene networks have evolved to work as switches and generate a cascade of
transcription networks that regulate each other. The same modules can be duplicated or manipulated to
generate different gene regulatory pathways for specific tissues. For example, the same switches that
regulate flower development are also essential to root development. Target genes usually co-evolve
together with their regulators. Many types of external or internal stimuli (light, heat, ...) can be
detected and trigger an appropriate response. The first genes in the regulatory network to be activated
are those that interact directly with a TF. Later, more genes are indirectly activated to generate an
increasingly complex transcriptional program.

Identification of TF gene targets


To study the regulatory network of a specific TF, two approaches can be used:

a) Overexpress the TF and check which genes are upregulated as a result. This condition doesn’t reflect
physiological TF activity.
b) Complement TF mutants with a wild-type TF gene copy and check which genes change their
transcription pattern. This approach allows the study of the TF at its physiological concentrations
and conditions.

To identify target genes of a TF, inducible systems are commonly used: temperature-sensitive mutants,
inducible promoters (heat shock promoters), inducible protein activity and environmental stimuli (light).
In all cases, the TF of interest is expressed or activated only in specific and controlled conditions and
for a specific amount of time. In a mutant complementation experiment, the inducible TF restores the
specific pathway that is being investigated. RNA samples are collected at different time points for
downstream analysis.

An example of an inducible TF expression strategy is the alc system in plants. The alcA promoter responds
to the ALCR TF, which is activated by ethanol (applied by spraying, evaporation, addition to growth media,
root drenching). In the transgenic plant, alcR is constitutively expressed. The gene construct contains
the alcA sequence, a minimal promoter and the transgene. Without ethanol the system isn’t induced and the
transgene is almost never transcribed, while ethanol treatment activates the ALCR TF and the transgene is
expressed. A problem of the alc system is that ethanol is a good carbon source for bacteria and fungi,
thus long induction periods cause fungal growth and plant death. Normally, ethanol induction lasts 8 h
maximum (?).

TF activity can also be induced thanks to the fusion with a glucocorticoid receptor (GR) domain. In the
absence of steroids, the TF-GR fusion protein is retained in the cytosol because the GR domain is bound to
Hsp90. In the presence of steroids, TF-GR is released and transported into the nucleus. In plants, the
synthetic glucocorticoid dexamethasone (DEX) is used as the inducer, because it has rapid uptake kinetics
and induces TF activity minutes after incubation. In fact, glucocorticoids can easily permeate plant cells
and cause no adverse effects. This system displays fast induction because the TF of interest has already
been synthesized and accumulates in the cytosol, ready to be activated by DEX. However, the fusion of the
TF with a GR domain can influence TF activity, which can lead to incomplete complementation of the TF
mutant background.

In TF-GR approaches, the inducer is normally administered together with a translation blocker
(cycloheximide). Thus, the mRNAs of the direct targets are not translated and the downstream gene cascade
cannot be initiated. This way, the direct targets of the TF of interest are induced the most and can be
easily detected in RNA-seq. Some indirectly-induced genes are still transcribed because the block of
protein synthesis isn’t 100% effective. Nevertheless, indirectly-induced transcripts are less abundant,
thus transcriptome analysis is less complex.

Transcriptome analysis can also be performed in a mutant vs wild-type approach, comparing mutant tissues
with wild-type ones. This strategy is most common in developmental studies, where differentially expressed
genes are measured in a specific early development stage. Transcriptome analysis = RNA-seq.
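
A minimal sketch of what "differentially expressed" means in practice when two conditions are compared (induced vs uninduced, mutant vs wild-type): the log2 fold change of each gene is computed and genes beyond a cut-off are kept. Gene names, read counts, the pseudocount and the cut-off are hypothetical; a real analysis also needs replicates, normalisation and a statistical test.

```python
from math import log2


def log2_fold_changes(control, treated, pseudocount=1.0):
    """log2(treated / control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }


def differentially_expressed(control, treated, min_abs_lfc=1.0):
    """Genes whose expression changes at least 2-fold (|log2 FC| >= 1 by default)."""
    lfc = log2_fold_changes(control, treated)
    return {gene: value for gene, value in lfc.items() if abs(value) >= min_abs_lfc}


# Hypothetical normalised read counts.
wild_type = {"g1": 99, "g2": 49, "g3": 0}
mutant = {"g1": 399, "g2": 51, "g3": 0}
print(differentially_expressed(wild_type, mutant))
```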

Single-cell analysis
In expression profiling studies, a relatively large amount of tissue is used. Since a tissue contains
different cell types, the differences among those are not detected. Microgenomics can be used to record
differential gene expression in small numbers of cells of interest. To perform this, the tissue is cut by
a laser (laser microdissection) to isolate small areas or single cells from it. On a slide observed under
the microscope, a specific portion of the tissue is selected, then a laser cuts it accordingly and the
portion of interest is collected in a specific tube. From the same slide, different sections can be cut
and isolated in different containers to be analysed separately. The process is automated and
contamination-free. Since the amount of RNA is very low, there are specific kits to linearly amplify it.

Other techniques to isolate and analyse single cells:

FACS. 90-95% purity in sorted cells. Separates phenotypically different cells from a sample as well as
counts them and measures a reporter (fluorescent dye) signal correlated to a protein or nucleic acid.
After cell sorting, subpopulations can be analysed (RNA-seq, …).

INTACT method. It avoids any effect of the stress response that is activated once the cell wall is broken
to extract RNA. Usually, the transcriptome inside a nucleus resembles the one found in the cytoplasm. The
nuclei of the cell types of interest are labelled, then cells are lysed and intact nuclei are isolated.
Labelled nuclei can then be purified with specific Ab-linked or streptavidin beads, which are separated by
using a magnet. Watch video or read paper for more details.

Identification of direct TF targets & TF consensus sequences


After RNA-seq has been performed, the targets directly regulated by the TF of interest are identified. The
expression profiles are compared before and after induction or in mutant vs wild-type backgrounds. The
direct proof that a TF or a specific protein binds to a specific sequence in vivo is obtained in a
ChIP-seq experiment:

1. Cells are crosslinked with formaldehyde to fix proteins on DNA
2. Nuclei are harvested from specific cell types or in specific developmental stages of interest.
3. DNA is sheared by sonication to obtain different protein-DNA complexes
4. Immunoprecipitation is performed with Abs specific for the TF of interest. This way, an enrichment
of DNA fragments bound to a specific protein is performed.
5. Proteins are removed and DNA fragments purified.
6. DNA fragments are PCR amplified, sequenced and aligned to the reference genome.

Negative controls: same procedure without Abs, or mutants lacking the TF of interest. Not all proteins can
be analysed in ChIP: sometimes no ChIP-compatible Abs are available for a protein. In this case, the
protein can be tagged (GFP or YFP tags) so that it is easily immunoprecipitated with validated Abs that
bind to the tag. The fusion protein must be expressed in a mutant strain that lacks the native protein:
restoration of the wild-type phenotype confirms that the fusion is biologically active, and the absence of
the native protein avoids competition for the binding to regulatory sequences.

9/12/21

In regulatory pathway analysis, the experiment is set up to identify all differentially expressed genes
between two conditions (induced-uninduced, wild-type vs mutant, 2 different cell types, healthy cells vs
tumour cells), which differ from the point of view of the activity of a certain transcription factor. The list of
differentially expressed genes contains both direct and indirect gene targets of the transcription factor.

The determination of the regulatory pathway requires the identification of all genes to which the
transcription factor of interest directly binds to regulate expression. These genes are direct targets of
the TF and are identified with ChIP. ChIP enriches the fragment library for sequences that are physically
associated with the transcription factor of interest, meaning that direct targets of that TF will be in
the library. The fragment library is PCR amplified with primers that bind to putative binding sites of the
TF of interest (?). With real-time PCR, the enrichment of TF binding sites can be quantified (?). Every
ChIP experiment requires relevant negative controls. Commonly, the first negative control is the same
experiment performed without adding antibody. The second negative control is the same experiment
performed on cells or tissues that do not contain the tagged protein or the TF of interest. In both cases,
ChIP should yield a negative result, because without the antibody or without its target protein no DNA
fragment should be specifically immunoprecipitated.
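
The logic that separates direct from indirect targets can be sketched with sets: a direct target is both differentially expressed and bound by the TF in ChIP. The gene names are hypothetical, and the third category (bound but not differentially expressed) is my own addition to make the sketch complete.

```python
def classify_targets(differentially_expressed_genes, chip_bound_genes):
    """Split genes into direct and indirect targets of a TF.

    direct   = differentially expressed AND bound by the TF in ChIP
    indirect = differentially expressed but not bound
    """
    de = set(differentially_expressed_genes)
    bound = set(chip_bound_genes)
    return {
        "direct": sorted(de & bound),
        "indirect": sorted(de - bound),
        "bound_not_regulated": sorted(bound - de),
    }


# Hypothetical gene lists from an RNA-seq and a ChIP experiment.
de_genes = ["geneA", "geneB", "geneC", "geneD"]
chip_genes = ["geneB", "geneD", "geneE"]
print(classify_targets(de_genes, chip_genes))
```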

One limitation of a ChIP experiment is that the cis-element binding site of the TF of interest must already be
known. For TFs with unknown consensus sequences it is not possible to design primers for PCR experiments. To
overcome this problem, Systematic Evolution of Ligands by EXponential enrichment (SELEX) is performed to
determine the binding sequence of the TF. The protocol of SELEX is as follows:

1) Generate an initial DNA fragment library which
contains all possible sequences of a given length (for
example all possible 10-mers). Each of the sequences
can potentially be the binding site of the TF of
interest. Each possible sequence is flanked by known
sequences that can be used for amplification.
Commercially available libraries are commonly used.
2) The pool of possible sequences is incubated with the
TF of interest, which binds to its consensus
sequences if in vitro conditions correspond to
physiological conditions.
3) Washing is performed to discard all non-binding DNA fragments.
4) Elution is performed to remove the protein from DNA fragments.
5) Selected DNA fragments are then PCR amplified using
primer sets that anneal to the flanking sequences on
the DNA library.
6) Amplified fragments are then used to repeat the
cycle 10 to 15 times. The result is the enrichment of
DNA fragments that bind to the protein of interest. Every cycle increases binding specificity.
7) Resulting DNA fragments are sequenced to identify the consensus sequence of the TF.
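
The enrichment produced by repeated SELEX cycles can be illustrated with a toy model. This is only a sketch: the function name `selex_rounds`, the two-sequence library and all the numbers (starting fractions, retention probabilities) are invented for illustration and are not taken from the lecture.

```python
def selex_rounds(pool, retention, rounds):
    """Toy SELEX model.

    pool: dict sequence -> fraction of the library (fractions sum to 1)
    retention: dict sequence -> probability of staying bound through the wash
    """
    pool = dict(pool)
    for _ in range(rounds):
        # Binding + washing: each sequence survives in proportion to its retention
        selected = {seq: frac * retention[seq] for seq, frac in pool.items()}
        total = sum(selected.values())
        # PCR amplification: the selected pool is brought back to full size
        pool = {seq: amount / total for seq, amount in selected.items()}
    return pool


# Invented numbers: a rare binding site against a large excess of background DNA
start = {"binding_site": 0.001, "background": 0.999}
retention = {"binding_site": 0.5, "background": 0.05}

for n in (1, 5, 10):
    enriched = selex_rounds(start, retention, n)
    print(n, round(enriched["binding_site"], 4))
```

In this model each cycle multiplies the binder-to-background ratio by the ratio of the two retention probabilities, which is why a few cycles are enough for a rare binding site to dominate the pool.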

In a consensus sequence, each position of the binding site is associated with the relative frequency of each
of the 4 bases as observed from read analysis. The more frequent a specific base is, the higher the
likelihood that the TF recognises that specific base in that position.
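
As a minimal sketch of how such a position-by-position frequency table can be computed, the snippet below counts bases in a set of already aligned binding sites. The reads are invented for illustration, and the helper names `position_frequencies` and `consensus` are hypothetical.

```python
from collections import Counter


def position_frequencies(sites):
    """Return, for each position, the relative frequency of A, C, G and T."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        matrix.append({base: counts[base] / len(sites) for base in "ACGT"})
    return matrix


def consensus(matrix):
    """Pick the most frequent base at each position."""
    return "".join(max(column, key=column.get) for column in matrix)


# Hypothetical aligned reads from the last SELEX round
sites = ["CACGTG", "CACGTG", "CACGTT", "CATGTG", "CACGTG"]
pfm = position_frequencies(sites)
print(consensus(pfm))
print(pfm[2])  # base frequencies at the third position
```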

Example: multiplexed massively parallel SELEX for characterisation of human transcription factor binding
specificities

The initial DNA library contains dsDNA oligonucleotides containing all possible 14-mer sequences flanked by
forward and reverse primer-complementary sequences as well as a barcoded sequencing primer. Human
transcription factors of interest were replaced by fusion proteins that contain the DNA binding region of
the protein of interest, a luciferase domain for protein quantification and a streptavidin-binding peptide for
microwell attachment. Many different human TF coding sequences, either complete or limited to
the DNA binding region, were cloned in the same cassette to produce fusion proteins. Each unique
fusion protein was quantified in a luciferase assay. Protein quantity is an important parameter when comparing
binding affinities, because apparent differences in affinity may simply reflect different protein levels. After
quantification, SELEX results are normalised for the protein
level in each well.

Each different TF is attached to the bottom
of a microwell, where the initial DNA library
is loaded. Each DNA library carries a unique
barcode sequence that identifies the TF it
was incubated with. After several rounds of
SELEX enrichment for DNA sequences that
bind to target proteins, resulting DNA from
each well is pooled together and sequenced.
The resulting reads can be divided into the
original microwell groups thanks to the
barcode sequence. Finally, reads are
analysed to determine the consensus
sequence of each TF.
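
The read-sorting step can be sketched as follows. This assumes, for simplicity, that the barcode sits at the start of each read and must match exactly; the barcodes, TF names and reads are invented, and `demultiplex` is a hypothetical helper.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_tf, barcode_len):
    """Group reads by the TF whose barcode they start with."""
    groups = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads with an unknown barcode are kept apart instead of being guessed
        tf = barcode_to_tf.get(barcode, "unassigned")
        groups[tf].append(insert)
    return dict(groups)


# Invented barcodes and reads
barcode_to_tf = {"AACC": "TF_1", "GGTT": "TF_2"}
reads = ["AACCCACGTGAA", "GGTTTGACTCAT", "AACCCACGTGTT", "TTTTACGTACGT"]

groups = demultiplex(reads, barcode_to_tf, barcode_len=4)
for tf, inserts in groups.items():
    print(tf, inserts)
```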

The quality of the results can be deduced from consensus sequences:

A) In cases where flanking sequences do not interfere with binding, a very uniform distribution of
sequences is observed.
B) In cases where a part of the binding sequence for a TF is found in the constant region or barcode, a
strong positional bias is observed. These consensus sequences are discarded.

Example: universal protein binding microarrays for the comprehensive characterisation of the DNA binding
specificities of transcription factors

Protein binding microarray (PBM) technology provides a rapid, high-throughput means of characterizing
the in vitro DNA binding specificities of transcription factors (TFs).

1. ssDNA oligonucleotides are attached to the array. Each oligo
contains a unique 36-mer joined to a common 24-nt primer
sequence, resulting in a 60 nt long ssDNA. Each 36-mer
contains 27 overlapping 10-mers, thus each spot on the
microarray contains more than one possible sequence where
a TF can bind.
2. The oligos are incubated with appropriate primers and DNA
synthesis is performed to produce dsDNA molecules. The synthesis
reaction requires an appropriate substrate and enzymatic
mixture. It also contains a low amount of fluorescent Cy3-
labelled dUTP, which is incorporated in the new strand at low
levels. The Cy3 fluorescence signal indicates dsDNA quantity in
each spot of the microarray. This parameter is useful to
monitor the quality of DNA synthesis. If differences are
detected, they can be normalised in the results.
3. The GST-tagged TF of interest is incubated on the microarray.
The fusion protein binds to its preferred sequences on the
microarray.
4. A fluorescently labelled anti-GST antibody (alpha-GST) is also
incubated on the microarray. It specifically binds to the GST tag of
the TF, thus it is only located on the spots where the TF binds
to specific sequences. The fluorescent label used is Alexa488.
In general, a fluorophore that emits fluorescence at a
different wavelength than Cy3 is used. This way, the
detectors can differentiate between the 2 signals. The
more strongly the TF binds to a specific sequence on a spot, the
higher the Alexa488 fluorescence signal is.
5. Since each spot corresponds to 27 different 10-mers and a TF
normally recognises only one of them, the 36-mer sequence
from all fluorescence positive spots is determined. By
comparing 36-mer sequences from different spots, the
unique 10-mer where the TF binds is determined.
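
The arithmetic behind steps 1 and 5 can be checked with a short sketch: a 36-mer contains 36 - 10 + 1 = 27 overlapping 10-mers, and the 10-mer shared by all positive spots is the candidate binding site. The spot sequences below are invented, and `kmers` and `shared_kmers` are hypothetical helper names; real PBM analysis is statistical rather than a plain set intersection.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_kmers(spot_sequences, k):
    """k-mers present in every spot sequence."""
    common = set(kmers(spot_sequences[0], k))
    for seq in spot_sequences[1:]:
        common &= set(kmers(seq, k))
    return common


# Two invented 36-mers from fluorescence-positive spots, both containing the same 10-mer
site = "TGACGTCATC"
spot_a = "A" * 13 + site + "C" * 13
spot_b = "G" * 5 + site + "T" * 21

print(len(kmers(spot_a, 10)))               # number of 10-mers in one 36-mer
print(shared_kmers([spot_a, spot_b], 10))   # candidate binding site
```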

Once the consensus sequence of a TF of interest has been determined with SELEX, the annotated genome
is scanned to identify all occurrences of the consensus sequence in the 3 kb upstream of the transcription start site
and the 1 kb downstream of the 3’ end of each gene.
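
A minimal sketch of such a scan is shown below. It assumes the upstream or downstream window has already been extracted as a string and only looks for exact matches of the consensus on both strands; the motif, the region and the function names are invented for illustration.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(region, motif):
    """Return (position, strand) for every exact match of the motif in the region."""
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        # The lookahead makes overlapping matches visible as well
        for match in re.finditer(f"(?={re.escape(pattern)})", region):
            hits.append((match.start(), strand))
    return sorted(hits)


# Invented consensus and invented regulatory region
motif = "TGACG"
region = "AATGACGTTCGTCAGG"
print(find_sites(region, motif))
```

Positions are 0-based offsets inside the window, so they still have to be converted to genomic coordinates relative to the gene.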

The binding sequence of a TF can also be determined with an alternative method that detects genome-wide
binding sites: DNA affinity purification sequencing (DAP-seq).

Example: cistrome and epicistrome features shape the regulatory DNA landscape

Cistrome = complete collection of cis elements (accessible or not) that TFs can recognise on the genome.
Epicistrome = collection of accessible cis elements TFs can bind to. Some cis elements are not recognised by
TFs because of epigenetic modifications such as DNA methylation or a highly condensed chromatin state.

DAP-seq is an in vitro assay that identifies TF binding sites.

1) Genomic DNA is extracted and fragmented into 200 bp
long molecules. Adaptors are ligated at each end of
the fragments, then PCR amplification is performed
to obtain an ampDAP library.
2) The TF coding sequence is cloned inside a cassette
for the expression of a fusion protein that contains
an affinity tag. The gene cassette is expressed in
vitro to produce the fusion protein. The fusion
protein is purified in affinity chromatography by
exploiting the affinity tag properties. The fusion
protein is immobilised on the solid phase of the
column.
3) The ampDAP library is loaded onto the same
column. Fusion proteins in the solid phase capture
DNA fragments that contain their binding site, while
other DNA molecules flow through immediately. This
way, by washing the column and then eluting the
fragments that bind to the TF, DNA molecules that
contain the binding sites of interest are selected.
4) Eluted DNA fragments are sequenced. Reads are mapped on the reference genome. Each DAP peak
contains the sequence motifs recognised by a specific TF.

DAP-seq is an entirely in vitro approach to determine the consensus sequence of a specific TF and it requires a
PCR amplification step. In contrast, in ChIP experiments the DNA library is obtained directly from cell nuclei
and does not require amplification. Despite this difference, DAP-seq yields a similar consensus sequence
compared to ChIP experiments and it also indicates the position of these sequences on the reference
genome. Crucially, DAP-seq mapping data identify candidate TF binding sites on the genome (they do not
measure physical binding of the protein to those sequences in vivo), while ChIP enriches for sequences that are
bound by the protein of interest in the cell. Proof that the TF binds to those sites requires a ChIP experiment.

Analysis of all sequence motifs recognised by a family of TFs in Arabidopsis reveals that those consensus
sequences share common features. Despite the diversity in the functions of all TFs analysed, their binding
sites are almost the same. However, each TF binds to a unique sequence in vivo thanks to its dimerization
properties and the features of nearby sequences that can influence TF binding. The 3D structure of the
genome also influences TF binding to specific genomic sites.

The genome-wide TF binding site mapping can also be performed in vivo with ChIP-seq. Before NGS was
widely available, ChIP-on-chip was performed to obtain the same type of results. DNA fragments enriched
by ChIP were incubated with a microarray containing genomic fragments of intergenic regions (not cDNA
because TF binding sites are mostly in non-coding regions). Hybridisation occurs between complementary
DNA sequences, thus sequence enrichment in the ChIP library is determined. Tiling arrays are built to
associate the fold enrichment in the experiment to each genomic position. Peaks of enrichment correspond
to the position of TF binding sites.

ChIP-seq has replaced ChIP-on-chip for its precision, improved resolution, limited cost and higher
throughput. Moreover, it eliminates the need to know the type of cis regulatory region (promoter,
enhancer or RNA-coding) the TF binds to, since ChIP-on-chip requires
the preparation of a microarray containing appropriate DNA
libraries. After fragment enrichment by ChIP, DNA molecules are
ligated to adaptors and sequenced with Illumina. Reads are
mapped on the reference genome to identify TF binding sites. The
challenges of ChIP-seq data analysis lie in read processing and
mapping (bioinformatic problems).

Results from ChIP-seq peaks allow the identification of putative
direct gene targets of a TF of interest. However, these results do
not directly prove that the TF of interest regulates the expression
of the putative target genes. It may be that the TF is located at that binding
site but it does not act on that particular gene or is inactivated by
the interaction with other proteins. To prove that the gene is regulated by the TF, an RNA-seq experiment is
performed in the same conditions on either mutant cells that do not produce functional TF or on cells
where TF expression is induced. If the putative target gene is deregulated compared to wild-type cells, then
the ChIP-seq result is confirmed.
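
Combining the two datasets amounts to intersecting two gene lists. The sketch below uses invented gene names and a hypothetical helper `classify_targets`; it only illustrates the logic, not a real analysis pipeline.

```python
def classify_targets(chip_bound, deregulated):
    """Split genes according to ChIP-seq binding and RNA-seq deregulation."""
    chip_bound, deregulated = set(chip_bound), set(deregulated)
    return {
        # Bound by the TF and deregulated when the TF is missing or induced
        "direct": chip_bound & deregulated,
        # Bound by the TF, but expression does not change
        "bound_not_regulated": chip_bound - deregulated,
        # Deregulated without a binding site: indirect targets
        "indirect": deregulated - chip_bound,
    }


# Invented gene identifiers
chip_bound = ["geneA", "geneB", "geneC"]
deregulated = ["geneB", "geneC", "geneD"]

for group, genes in classify_targets(chip_bound, deregulated).items():
    print(group, sorted(genes))
```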

Example: identification of pathways directly regulated by SHORT VEGETATIVE PHASE during vegetative and
reproductive development in Arabidopsis

TF binding was assessed from a limited number of cells. As a result, amplification biases were introduced
and ChIP-seq peaks on the reference genome appeared distorted. Nevertheless, sequencing data was used
to design appropriate primers and to perform a classic ChIP experiment to confirm the results.

The TF of interest has an early and a late function in two different tissues. In both tissues the identified
sequence motif is almost the same. Small changes in the consensus sequence between the 2 tissues can be
attributed to the interaction of the TF with a partner protein. After the binding sites were mapped on the
genome and the direct TF targets were confirmed by RNA-seq, an overview of the biological network
regulated by the TF in the 2 tissues was obtained. A regulatory pathway of a specific TF can be compared
and overlapped with the pathways regulated by candidate or confirmed partner proteins and other TFs.

Proteomics
Protein expression profile – protein modification –
protein folding – protein-protein interaction –
protein localisation and trafficking. Post-translational
protein modifications are extensive and significantly
impact on protein folding and interactions.

Single-cell proteomics is still not possible or highly
limited, therefore an overview of characteristics and
activity of every protein in a specific cell has not been achieved. Proteins must be separated and identified
to obtain a complete proteome. Proteomic analysis is a more reliable indicator of gene expression than
transcriptome analysis. In fact, there is little correlation between mRNA and actual protein levels in a cell.
Transcriptome analysis is still the method of choice to monitor tissue-specific gene expression. Moreover,
one gene is transcribed into multiple differentially spliced mRNAs, each coding for a different protein
isoform that can be post-translationally modified further. The result is a high protein diversity even from a
limited number of genes. As a result, proteome analysis more accurately reflects a given “genome plan”
than transcriptome does.

Since it’s impossible to amplify a specific protein in a sample, detection technologies must be sensitive enough
to detect even small amounts of it. Protein concentration in specific cell subcompartments is highly variable. As few
as tens of copies of a protein can be present in a cell, a condition that is challenging to detect. Moreover, a gel
electrophoresis of a whole protein extract is impossible to analyse given the number of different proteins
that form a smear. To solve this issue, the complexity of the cell extract is reduced by pre-fractionation.
Generally, cell lysate is centrifuged before analysis to select a specific cell fraction of interest. Different
types of centrifugation techniques and different centrifugal forces allow the precipitation of a specific
subset of cell lysate components, while others remain in the supernatant. The fraction of interest is
collected after different rounds of centrifugation. Since each fraction usually corresponds to cellular
components from specific organelles, they contain organelle-specific proteins that are used as markers. A
protein fraction from a supposed organelle is validated before any subsequent analysis by quantifying the
amount of marker proteins (each specific to a certain fraction). Highly purified organelles can also be
obtained thanks to specific antibodies, which are captured with protein A or G (bacterial proteins that selectively bind to Igs).

Selected protein fractions are then separated in 1D or 2D gel electrophoresis. 2D gels allow the
simultaneous separation of proteins according to two distinct protein properties (molecular weight and
charge or pI for example). The result is a specific 2D spot pattern where each spot corresponds to a protein
with specific characteristics. 2D gels are often too complex to be analysed by eye, but fortunately
dedicated software has been developed to perform gel analysis. Each spot can be extracted and further
analysed to identify the protein. Two or more 2D gels in the same conditions can be compared to identify
differential protein expression. To visualise protein spots and bands, different staining techniques with
specific sensitivity can be used according to the specific experiment. 2D gels analysis and spot extraction
can be fully automated.

10/12/21

Mass spectrometry for protein identification


Protein spots of a 2D gel can be reliably picked up by robotic arms to automate the process. If the protein
that generates that particular spot is unknown, mass spectrometry is performed to identify it. Different
types of spectrometers are available, but the most common in biological studies is the MALDI-TOF. MALDI is
a solid phase technique, therefore the protein sample must be solidified before analysis. Spectrometers
with an electrospray ionisation (ESI) source can analyse liquid phase samples.

MS is an analytical technique that determines the molecular mass of neutral molecules thanks to the
separation of gas-phase ions in the mass analyser based on their mass-to-charge ratio (m/z). All MS
machines require 3 basic components: an ion source, a mass analyser and a detector. The accuracy of
the calculated molecular weight is higher than 99.9%, much higher than the 90-95% from SDS-PAGE experiments
(protein mobility in a gel can be affected by post-translational modifications). In MS, phosphorylation and
glycosylation do not interfere with mass determination: each modification adds a defined mass, therefore it is
possible to determine the type of modifications carried by the protein sample.

MALDI-TOF or Matrix-Assisted Laser Desorption/Ionisation – Time Of Flight is an MS technique that requires
the ionisation of the solid-phase analyte thanks to a laser. The mass-to-charge ratio of the analyte is
determined from the total time of flight of charged particles. The machine works as follows:

1) Analyte solution is mixed with a matrix that absorbs UV light
and applied to a metal probe tip
2) Matrix material and peptide solution co-precipitate
3) Irradiation with a UV pulse sublimates matrix and
peptides, which are ejected into vacuum
4) Peptide ions are formed mainly by protonation
5) Acceleration through a strong electric field: all ions with the same
charge acquire the same kinetic energy
6) Different velocities and arrival times after traversing a
fixed field-free distance (TOF). Peptides with lower mass
have a short time of flight before hitting the detector,
while larger peptides take longer to produce a signal.
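
Steps 5 and 6 follow from energy conservation: an ion of charge z accelerated through a voltage U gains the kinetic energy z·e·U = ½·m·v², so its flight time over a field-free length L is t = L·sqrt(m / (2·z·e·U)). The sketch below evaluates this idealised formula; the 20 kV voltage and the 1 m flight tube are assumed values chosen only for illustration.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, in coulomb
DALTON = 1.66053906660e-27   # atomic mass unit, in kg


def time_of_flight(mass_da, charge=1, voltage=20_000.0, length_m=1.0):
    """Idealised flight time (seconds) of an ion in a linear TOF analyser."""
    mass_kg = mass_da * DALTON
    # z*e*U = 1/2*m*v^2  ->  v = sqrt(2*z*e*U / m)
    velocity = math.sqrt(2 * charge * E_CHARGE * voltage / mass_kg)
    return length_m / velocity


for mass in (1000, 2000, 4000):
    print(mass, f"{time_of_flight(mass) * 1e6:.1f} microseconds")
```

Because the flight time grows with the square root of m/z, a peptide four times heavier takes twice as long to reach the detector.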

Normally, proteins are not analysed as whole
macromolecules, but they are fragmented into
peptides instead. The peptides are analysed in MS. A
protein spot is normally fragmented by
trypsinisation. Trypsin cuts on the C-terminal side of Lys and Arg residues.
Since the primary amino acid sequence of the
peptide chain is unique for each protein, the pattern
of trypsin cleavage is specific for each protein. MS
analysis of protein fragments is represented as a
pattern of different molecular weights associated
with specific peptide fragments (fingerprint). The fingerprint of a protein sample is then compared to all
known fingerprints of proteins expressed by the organism. The database only contains predicted protein
fingerprints from virtual trypsin digestion of all annotated proteins. Unfortunately, if the analysed protein
has not been annotated before it cannot be identified with this approach. This possibility occurs when not
all genes have been annotated.
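
The virtual digestion and the fingerprint comparison can be sketched as below. The code assumes the commonly used rule that trypsin does not cut when the next residue is Pro (not mentioned in the lecture), uses monoisotopic residue masses, and works on an invented two-protein database; all function names are hypothetical.

```python
import re

# Monoisotopic residue masses in Da
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def trypsin_digest(protein):
    """Cleave after K or R, except when the next residue is P (assumed rule)."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", protein) if pep]


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


def fingerprint(protein):
    return [peptide_mass(pep) for pep in trypsin_digest(protein)]


def shared_peaks(observed, predicted, tol=0.01):
    """Number of observed masses that match a predicted mass within the tolerance."""
    return sum(any(abs(o - p) <= tol for p in predicted) for o in observed)


# Invented database of annotated proteins and an "observed" spectrum
database = {"protein_X": "MKTAYIAKQR", "protein_Y": "GASPVKLLER"}
observed = fingerprint("MKTAYIAKQR")

best = max(database, key=lambda name: shared_peaks(observed, fingerprint(database[name])))
print(trypsin_digest("MKRPAK"))
print(best)
```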

In tandem MS or MS/MS, the peptide ion is accelerated before entering a collision cell where it is
further fragmented. While in MALDI-TOF all fragments resulting from spontaneous breakages reach the
detector at the same speed and time of flight regardless of their mass (they cannot be told apart), in MS/MS
they generate distinct signals. This occurs because peptide
fragments are decelerated and reaccelerated after their
fragmentation. Again, smaller fragments reach the detector
before larger ones.

The signal pattern of MS/MS indicates the amino acid sequence
of a peptide. By repeating the analysis on all peptides from the
same protein (in a single experiment), the whole amino acid sequence of a protein is determined. This is possible
because the amino acid sequence of a peptide determines all possible fragments that are produced. Fragmentation
occurs randomly at peptide (amide) bonds. Each fragment from the same peptide has a defined molecular weight
that is determined by the residues from the original peptide. The limited number of possible amino acid
combinations results in a contained collection of corresponding molecular weights. Therefore, the pattern
of molecular weights detected for fragments from the same peptide is analysed to identify the amino acid sequence
of that peptide.
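
The idea of reading a sequence from fragment masses can be illustrated with an idealised ladder of N-terminal fragments: the mass difference between two consecutive fragments equals the mass of one residue. The sketch ignores the charge-carrying proton, missing fragments and noise, and the helper names are hypothetical. Leu and Ile have the same mass, so they cannot be told apart this way.

```python
# Monoisotopic residue masses in Da
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def prefix_masses(peptide):
    """Idealised ladder: summed residue masses of every N-terminal fragment."""
    total, ladder = 0.0, []
    for aa in peptide:
        total += MONO[aa]
        ladder.append(total)
    return ladder


def read_sequence(ladder, tol=0.01):
    """Assign a residue to each step of the ladder from the mass difference."""
    sequence, previous = [], 0.0
    for mass in ladder:
        delta = mass - previous
        matches = sorted(aa for aa, m in MONO.items() if abs(m - delta) <= tol)
        sequence.append("/".join(matches) if matches else "?")
        previous = mass
    return sequence


print(read_sequence(prefix_masses("PEPTK")))
print(read_sequence(prefix_masses("GLK")))
```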

2D gel has limited sensitivity since it is impossible to identify low concentration protein spots. 2D gel
electrophoresis is a major bottleneck in proteomic analysis. Alternative strategies to increase sensitivity
replace the 2D gel separation with another separation method.

Multi-dimensional LC before MS (LC-MS). The protein sample is first separated according to charge or pI
with an ion exchange chromatography column. Eluted protein fractions are then run through a second
chromatographic column (reversed-phase chromatography?) that separates analytes according to molecular
weight. The double liquid chromatography step reduces the
complexity of the protein sample before MS analysis without losing
low concentration proteins. This process is automated.

A second method called Isotope Coded Affinity Tag (ICAT) is used to differentially label proteins from
different samples. ICAT allows the identification of differentially expressed proteins from different tissues
or treated cells (induced vs uninduced cells).

1. Proteins from one sample are labelled with a reagent made of a biotin affinity group,
a linker that incorporates deuterium (heavy tag) and a reactive group for linkage to Cys
residues. The other sample is labelled with the same reagent carrying
hydrogens instead of deuterium (light tag).
2. Labelled proteins from both samples are pooled and digested with
a protease. Peptides labelled with the ICAT reagent are purified
using avidin chromatography.
3. ICAT-labelled peptides can be analysed by MS, which can
distinguish between heavy and light ICAT tag. Double peaks whose
mass difference matches the difference between the heavy and the light tag
identify the same peptide that is present in both samples. Peak
ratios can be correlated to relative protein concentration. Differentially
expressed proteins are identified by sequencing the peptides in
MS/MS.
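
The peak-pairing logic of step 3 can be sketched as follows. The code assumes a heavy tag carrying 8 deuterium atoms in place of 8 hydrogens (as in the original d0/d8 ICAT reagent), singly charged peptides with one labelled Cys, and an invented peak list; `icat_pairs` is a hypothetical helper.

```python
# Assumed mass shift: 8 deuterium atoms replacing 8 hydrogen atoms
HEAVY_LIGHT_SHIFT = 8 * (2.014102 - 1.007825)


def icat_pairs(peaks, shift=HEAVY_LIGHT_SHIFT, tol=0.01):
    """peaks: list of (mass, intensity). Return (light mass, heavy/light ratio) pairs."""
    pairs = []
    for light_mass, light_intensity in peaks:
        for heavy_mass, heavy_intensity in peaks:
            if abs(heavy_mass - light_mass - shift) <= tol:
                pairs.append((light_mass, heavy_intensity / light_intensity))
    return pairs


# Invented peak list: two light/heavy doublets and one unpaired peak
peaks = [
    (1000.50, 200.0), (1008.55, 400.0),
    (1500.70, 300.0), (1508.75, 300.0),
    (1800.00, 50.0),
]

for light_mass, ratio in icat_pairs(peaks):
    print(light_mass, ratio)
```

A ratio far from 1 points to a peptide, and therefore a protein, that is more abundant in one of the two samples.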

Ribosome footprinting with Ribo-seq


Illumina sequencing can be used to monitor the translation of mRNA. The technique is used to monitor the
position of ribosomes on mRNAs. Deep sequencing of this ribosome footprint is called ribosome profiling.
Resulting reads are mapped to the reference genome to identify which sequences on specific genes are
translated and to quantify mRNA translation. Ribo-seq is also useful to uncover small ORFs in annotated
non-coding RNAs and pseudogenes.

Example: the tomato translational landscape revealed by transcriptome
assembly and ribosome profiling

Tissue of the root was analysed by performing both Ribo-seq and RNA-seq.
RNA-seq data identifies all mRNAs in these cells. Ribo-seq data only
identifies the sequences on RNA where ribosomes are located. The protocol is as follows:

1. Nuclease digestion. All RNA regions that are not protected inside a ribosome are susceptible to
degradation and will not be sequenced.
2. Monosome isolation. Isolation of ribosome-RNA complexes for further analysis.
3. RNA purification
4. Library construction. cDNA is synthesised from purified RNA fragments.
5. Deep sequencing with Illumina. Read length is 25 nt as it corresponds to the region that is
protected from nuclease degradation by the bound ribosome.

Once translation has been initiated, ribosomes move 3 nt (one codon) at a time to maintain the correct reading
frame. As a result, ribosome profiling data also identifies the reading frame that is being used on a specific
transcript. Moreover, Ribo-seq data should not contain UTRs and introns. The Ribo-seq profile of an
annotated gene should correspond to exon sequences of the annotated gene itself. Mapping of Ribo-seq
reads on the reference genome clearly identifies the translation start and stop sites.
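
The 3-nt periodicity can be checked by counting in which frame the footprints fall relative to the annotated start codon. In the sketch below each footprint is reduced to a single transcript coordinate (for example an inferred ribosome position); the coordinates and the helper `frame_counts` are invented for illustration.

```python
from collections import Counter


def frame_counts(footprint_positions, cds_start):
    """Count footprints in frame 0, 1 and 2 relative to the start codon."""
    counts = Counter((pos - cds_start) % 3 for pos in footprint_positions)
    return [counts[frame] for frame in range(3)]


# Invented footprint coordinates on a transcript whose CDS starts at position 100
positions = [100, 103, 106, 109, 112, 113, 118, 121, 124, 125]
counts = frame_counts(positions, cds_start=100)

print(counts)
print(counts[0] / sum(counts))  # fraction of footprints in the annotated frame
```

A strong excess of footprints in one frame indicates which reading frame is translated; a flat distribution suggests that the reads do not come from translating ribosomes.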

Ribo-seq data identifies possible upstream ORFs (uORFs), which usually repress the translation of any
downstream ORF. uORFs are short but they can also encode for stable protein products that can be
identified with MS. Normally, in gene annotation without Ribo-seq data uORFs are disregarded as non-
coding sequences because of the presence of a longer downstream ORF. This is true for some tissues, while
in others the main downstream ORF is repressed and only the uORFs are translated.

Protein-protein interactions
Interactions between proteins shape many of their properties and functions. For example, transcription
factors commonly work in complexes or dimers, enzymes are often assembled into complexes that regulate
their catalytic activity, signal transduction cascades often require physical association between proteins to
amplify the signal or activate the cellular response. Some protein complexes turned out to be bigger than
anticipated. For example, the RNA pol II complex in eukaryotes was first purified as a 12-subunit protein
assembly, but co-immunoprecipitation experiments revealed it actually contains 55 subunits. In addition,
proteins interact with nucleic acids.

Yeast n-hybrid systems


A first tool to study protein-protein interactions is an n-hybrid system in forward or reverse configuration.
The forward configuration simply detects the interaction between two or more proteins. This system is
used in screenings to identify new proteins that interact with a factor of interest.

The yeast two-hybrid system allows the study of protein interactions in vivo, thus results are more
predictive than in vitro experiments. The issue with in vivo protein interaction studies is the fact that the
cell contains many other proteins that can potentially alter the interaction between a protein of interest
and its partners. Some proteins specifically disrupt a complex. Despite this, the two-hybrid system has two
other major advantages, which are the possibility to assay a high number of coding sequences and protein-
protein interaction types in the same experiment (?).

Yeast 2-hybrid system uses selection to identify rare events. Historically, it was designed after the discovery
and characterization of the yeast transcription factor Gal4p, which is responsible for the transcriptional
activation of downstream genes by binding to Upstream Activator Sequences (UAS). Structural studies of
Gal4p revealed that the protein consisted of two distinct portions that function independently of one
another: a DNA binding domain (BD) and an activation domain (AD). The modular structure of transcription
factors is common in many species.

Further experiments tested the activity of a fusion protein that contained the DNA binding domain of the
bacterial repressor LexA and the activation domain of Gal4p. A reporter assay was designed with a vector
containing a reporter gene under the control of a minimal promoter and a LexA operator
sequence. Transformed yeast cells that express the fusion protein also produce the reporter protein,
indicating that Gal4p activation domain activates the transcription of the reporter gene even though the
binding domain and regulatory sequences came from bacteria.

Once the modular structure of transcription factors was discovered and tested, it was exploited to assess
protein-protein non-covalent interactions. Essentially, Gal4p BD and AD are separated so that each fragment
is unable to exert the original Gal4p function (AD cannot bind to UAS and BD has no transcriptional
activation property) even though each domain is still functional. Then, each domain is fused to one of two
proteins of interest to assay their interaction. One gene construct codes for the BD-protein #1 fusion
that binds to the UAS element upstream of a reporter gene. This fusion protein is called bait, since it is
unable to activate reporter expression but it is located where the AD would be able to do so. The strategy is
designed to recruit proteins that interact with protein #1 at the UAS element and activate reporter
transcription. The second gene construct codes for the fusion protein AD-protein #2. If proteins #1 and #2
bind to each other, AD is recruited on a UAS element and can activate
the transcription of a reporter gene. In absence of interaction, no
reporter signal is observed. In the past LacZ was used as a reporter
gene, but later it was replaced by genes coding for biosynthesis
enzymes (LEU2 and HIS3 for amino acids, ADE2 for adenine). The yeast strain
that is assayed lacks a functional copy of the reporter gene, so it
can only grow in a medium that contains the corresponding
nutrient (auxotroph). As a result, the expression of the reporter
gene acts as a selection marker, since only the strains where AD is
recruited to the UAS element (where protein #1 and #2 interact) can
synthesize the missing nutrient.

The protocol works as follows:

1. A yeast strain unable to synthesize His, Leu and Trp is maintained in a rich growth medium to
ensure its survival. To test the auxotrophy of the strain, cells are also plated in 3 different media
that lack one of the three essential amino acids. If the strain is devoid of His, Leu and Trp
biosynthesis ability, it would not grow on any of the three plates.
2. Three vectors are built as follows:
a. BD + protein #1 coding sequences in frame (fusion protein) and a Leu gene that
complements the inability of the yeast strain to synthesize Leu
b. AD + protein #2 coding sequences in frame (fusion protein) and a Trp gene that rescues the
Trp biosynthesis loss in the yeast strain
c. A His coding gene that restores the His biosynthesis capability of the strain. Its transcription
is under the control of a UAS element. His gene transcription requires the interaction
between the 2 fusion proteins.
3. Yeast cells are transformed with all three vectors and plated on a growth medium that contains His but
lacks Leu and Trp. Only those cells that contain and express vectors a) and b) can grow in absence of Leu and Trp.
Thus, only double transformants are selected.
4. Transformed yeast cells are plated on a medium lacking His. Only cells that express the His gene can
grow. Those cells express fusion proteins that physically interact, meaning that protein #1 and #2 interact.
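
The selection logic of the protocol can be written as a small truth table. This sketch assumes that the His reporter vector c) is always present and that there is no leaky reporter expression; the function `grows` and its parameter names are hypothetical.

```python
def grows(has_bd_vector, has_ad_vector, proteins_interact, medium_lacks):
    """Predict growth of the triple auxotroph on a plate lacking the given nutrients."""
    can_make = {
        "Leu": has_bd_vector,   # vector a) carries the Leu marker
        "Trp": has_ad_vector,   # vector b) carries the Trp marker
        # The His reporter is only transcribed when AD is recruited to the UAS
        "His": has_bd_vector and has_ad_vector and proteins_interact,
    }
    return all(can_make[nutrient] for nutrient in medium_lacks)


# Step 3: selection of double transformants (medium without Leu and Trp)
print(grows(True, True, False, {"Leu", "Trp"}))
print(grows(True, False, False, {"Leu", "Trp"}))

# Step 4: selection for interaction (medium also without His)
print(grows(True, True, False, {"Leu", "Trp", "His"}))
print(grows(True, True, True, {"Leu", "Trp", "His"}))
```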

The protocol can be used to screen the entire cDNA library of an organism or tissue to identify all proteins
that interact with a specific partner of interest. The cDNA library is cloned in one of the vectors. Each cDNA
is cloned in three different reading frames to ensure that the protein is synthesized correctly in one of these
cases. Generally, the transformation step can be replaced with yeast mating, where a strain of one
mating type contains the cDNA library vector. These cells conjugate
with yeast cells of the opposite mating type, which carry the
invariant vector. The resulting zygote contains both vectors. Zygotes
are plated on two plates, one with and the other without His. The
number of colonies on the His- plate corresponds to the number of
proteins that specifically interact with the factor of interest. Colony
numbers in the His+ plate are far more abundant and correspond to
the number of cDNAs or proteins that are being screened for
interaction with the factor of interest. The His+ plate is a fundamental control.

The original system has been optimized over the years. The level of two-hybrid proteins was reduced to
minimize non-specific interactions that cause false positives. The same fusion protein pair is tested with
different reporter genes to ensure that no artifact is produced. False negatives in 2-hybrid assays are
caused by incorrect folding or poor stability of the hybrid protein, the toxic effect of the fusion protein, the
absence of the gene in the cDNA library and the unidirectionality of the interactions (sometimes BD-X and
Y-AD interact but BD-Y and X-AD do not). Low expression of hybrid proteins also leads to false negatives
because the reporter gene is not activated enough. Sensitivity and specificity are important parameters in
every assay and often the optimization of one leads to a decrease in the other.

• Highly stringent assay. It only detects strong protein-protein interactions to reduce false positives.
Sensitivity is decreased because most low and medium affinity interactions are not counted
(false negatives).
• High sensitivity assay. Low energy interactions are detected. Specificity is lower because non-specific
interactions can generate false positives.

A 2-hybrid system assay at high sensitivity can also incorporate a selection for interaction strength by
adding 3-AT to the medium. 3-AT is a competitive inhibitor of the HIS3 enzyme, which is required for the biosynthesis of His.
3-AT is added at increasing concentrations in different plates. A high concentration of 3-AT selects
exclusively for strong interactions between fusion proteins (high stringency screening). A low concentration of 3-AT
selects for high, medium and low affinity interactions between fusion proteins (high sensitivity).
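
The logic of a 3-AT titration can be sketched with a toy model. Everything here is an assumption made for illustration: the activity values, the linear relationship between 3-AT concentration and the HIS3 activity required for growth, and the names surviving_pairs, units_blocked_per_mM and basal_need are all hypothetical.

```python
# Hypothetical HIS3 reporter activity (arbitrary units) for three bait-prey pairs.
interactions = {"strong": 10.0, "medium": 4.0, "weak": 1.5}


def surviving_pairs(activities, conc_3at_mM, units_blocked_per_mM=0.2, basal_need=1.0):
    """Return the pairs whose reporter activity is high enough to grow without His.

    Toy assumption: the activity required for growth rises linearly with
    the 3-AT concentration in the plate.
    """
    required = basal_need + units_blocked_per_mM * conc_3at_mM
    return sorted(name for name, act in activities.items() if act > required)


for conc in (0, 5, 25):
    print(f"{conc} mM 3-AT -> {surviving_pairs(interactions, conc)}")
```

In this sketch all three pairs grow without 3-AT, the weak pair is lost at the intermediate dose, and only the strong pair survives at the highest dose.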

ADE selection is more stringent than HIS selection (?).

2-hybrid assays are ideal for large scale screenings, while other methods only check interaction events one
by one.

16/12/21

The 3-hybrid assay is performed to test the indirect interaction between 2 proteins of interest that is
mediated by a third protein that binds to both. The bridge protein is expressed independently, while the
other 2 proteins are part of two separate fusion proteins, the bait and the reporter activator. The co-
expression of bait and reporter activator constructs should yield a
negative signal since the proteins of interest do not directly bind to
each other (control). When the bridge protein together with a nuclear
localization signal is also expressed in this background and a positive
reporter signal (survival) is detected, then the bridge protein facilitates
interaction between the two hybrid proteins to activate reporter gene
transcription.

Another alternative hybrid system in yeast is the 1-hybrid system to identify factors that regulate the
expression of a gene of interest. A cis regulatory element is first identified near an annotated gene, for
example by sequence conservation and homology with other species. The regulatory sequence is cloned
either alone or in tandem in a gene construct that contains a reporter gene. The construct is then
integrated in a precise locus on the yeast genome (commercial kits available). The yeast strain is then
transformed with a plasmid library that contains the AD fused to the cDNAs of interest (for example only TF
cDNAs or only tissue-specific cDNAs). A control is also established
by transforming a yeast strain without the cis regulatory element
upstream of the reporter gene with the same plasmid library.
Additional controls use different reporter genes to validate the results
from one assay. When a positive result is observed in all experiments
involving the AD fusion protein, the regulatory element and a reporter
gene, then it is confirmed that the protein factor binds to the specific
regulatory sequence.

Example: guilt-by-association experiments

When 2 factors interact with each other, then it is likely that they are involved in the same biological
process. When the biological function of one factor is unknown, by assaying its interaction with other
proteins with known functions the role of the factor can be determined. The 2
proteins that interact must be expressed in the same tissue and at the same
developmental stage to represent a biologically relevant interaction.

In 2000, all 6000 ORFs of yeast annotated genes were assayed in a 2-hybrid system
against all known yeast proteins. The large scale experiment tested 36 million
protein interactions to identify gene networks and biochemical pathways. Protocol:

1) All yeast ORFs were PCR amplified from the yeast genome. PCR primer sets
contain both an ORF-complementary sequence and a common tail sequence.
2) The tails of all ORF amplified fragments were elongated in a second round
of PCR. Primers contained a tail-complementary sequence and a 50 bp-long
common tail. The 50 bp overhang on each resulting fragment was
complementary to the BD and AD cassettes.
3) Each fragment was used to transform yeast cells together with a second
linear vector that contains either AD or BD and 70 bp-long sequences that
induce recombination with the ORF fragment directly inside the cell. The
recombination event must maintain the appropriate reading frame for both
the yeast ORF and the AD or BD domain. The recombination event often
results in a frameshift that yields a non-functional fusion protein and a negative result in the test,
thus the results of this large-scale screening likely represent only part of the full spectrum of
protein-protein interactions.
4) Standard 2-hybrid system protocol.

Example: evolutionary studies of 3 MADS-box transcription factors in rice, Petunia and Arabidopsis

Mutant phenotypes of these 3 TFs all display homeotic transformations in tissue identity. Rice is a monocot,
evolutionarily distant from the dicots Petunia and Arabidopsis. To test the conservation of the gene network of
these 3 TFs, the interactions between them and any of their protein partners and orthologs were
investigated.

Once an interaction between 2 proteins has been detected by one assay, it must be confirmed in a
completely different experimental setup to test the reproducibility of the interaction. After a positive result
in a yeast hybrid system, the interaction is tested in an affinity chromatography experiment.

1) An antibody against one of the proteins is attached to the solid phase. If a protein-specific Ab is
unavailable, the protein of interest is expressed with an affinity tag that is exploited for its fixation
on the solid phase.
2) A cell extract sample, one purified protein and both purified proteins are loaded onto three separate
columns. One protein binds stably to the antibody and also interacts with its partner.
3) The proteins attached to the antibodies are eluted from the column and collected, then run on a
denaturing gel together with the relevant controls.
4) A Western blot (immunoblot) is performed by incubating the nitrocellulose support with
antibodies specific for both proteins. If the proteins form a complex, they should form 2 distinct
bands in the same lane.

Yeast n-hybrid systems are limited by the number of selection markers available for each plasmid. Moreover,
results from hybrid systems do not identify the complete protein complex that is found in cells.

Fluorescence Resonance Energy Transfer (FRET)


FRET identifies in vivo interactions between proteins via live imaging of fluorescence signals. FRET cannot
be used to screen a protein library for interactions, but rather to confirm the interaction between 2
proteins and to identify the localization of the protein complex in the cell.

FRET is the non-radiative transfer of energy between a donor fluorophore and an acceptor fluorophore that are
close together (Förster resonance energy transfer). The result is the emission of fluorescence light by the
acceptor fluorophore after the excitation of the donor fluorophore. Fluorophores are attached to 2
proteins to produce 2 fusion proteins. If the 2 proteins interact, the fluorophores are brought together and
FRET is observed. Variations on the basic protocol have been developed to test for proteolysis of specific
peptide sequences and the conformational changes of protein domains that change the distance between
the 2 fluorophores.

Three conditions must be fulfilled for FRET to take place:

• Overlap of donor emission spectrum with acceptor excitation spectrum
• Molecules must be in close proximity on a nanometer (10⁻⁹ m) scale
• Molecules must have the appropriate relative orientation.
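
The proximity requirement can be made quantitative with the Förster equation, E = 1 / (1 + (r / R0)^6), which was not given in the lecture and is added here as background. The sketch below assumes a Förster radius R0 of 5 nm, an illustrative value in the range typical of fluorescent protein pairs; the function name fret_efficiency is hypothetical.

```python
def fret_efficiency(r_nm, r0_nm=5.0):
    """Förster equation: E = 1 / (1 + (r / R0)^6).

    r_nm  : donor-acceptor distance in nanometers
    r0_nm : Förster radius (distance at which E = 0.5); 5 nm is an assumed value
    """
    return 1.0 / (1.0 + (r_nm / r0_nm) ** 6)


for r in (2.5, 5.0, 10.0):
    print(f"r = {r:4.1f} nm  ->  E = {fret_efficiency(r):.3f}")
```

Because of the sixth-power dependence, doubling the distance from R0 to 2 R0 drops the efficiency from 50% to under 2%, which is why FRET reports only on fluorophores a few nanometers apart.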

The incident light should be set to only excite the donor fluorophore
with a specific wavelength while not activating the acceptor
fluorophore. The pair of fluorophores should possess clearly
distinct absorption and fluorescence emission spectra to avoid
emission from both in the absence of FRET. Usually, the experiment is
performed on confocal microscopes. FRET has also been combined
with other methods or protocols to establish different fluorescence
microscopy techniques.

Fluorescence Lifetime Imaging (FLIM) is an imaging technique based on
the differences in the exponential decay rate of the photon emission of a
fluorophore from a sample. After absorption of incident light with the
appropriate energy, a fluorescent molecule transitions from the
electronic ground state to its excited state. The time a molecule spends
in its excited state is known as the fluorescence lifetime and it is typically
in the order of nanoseconds (10⁻⁹ s). The absorbed energy is eventually
emitted either as fluorescence or as non-radiative energy transfer
(FRET), provided there is a close acceptor fluorophore. If a donor
fluorophore is in the condition of performing FRET, the rate of depletion
of molecules in the excited state increases compared to identical donor
fluorophores that do not transfer energy via FRET. In other words, FRET
shortens the donor fluorescence lifetime. In FLIM experiments,
fluorescence lifetime of the donor fluorophore is quantified to reveal its
exponential decay. Numerical curve fitting yields the fluorescence lifetime, which serves as an absolute
reference against which the FRET sample is analysed. The fluorescence lifetime is inherent to the dye, thus
invariant to photobleaching or varying concentrations of tagged protein. FLIM-based FRET measurements
allow researchers to measure the fraction of donor molecules that do not participate in FRET, which is
impossible in standard FRET experiments. FLIM is more sensitive than FRET alone because it requires only
a few photons of fluorescence emission. Source: FRET with FLIM | Science Lab | Leica Microsystems
(leica-microsystems.com)
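
A minimal sketch of how a lifetime could be extracted by curve fitting and turned into a FRET efficiency. It assumes a noise-free single-exponential decay I(t) = I0 · exp(-t / tau) and uses the standard relation E = 1 - tau_DA / tau_D (donor lifetime with and without acceptor). The lifetimes of 2.5 ns and 1.5 ns are invented, and real FLIM data are noisy and often multi-exponential, so dedicated fitting software is used in practice.

```python
import math


def fit_lifetime(times_ns, counts):
    """Estimate tau from I(t) = I0 * exp(-t / tau).

    Taking the logarithm gives a straight line, ln I = ln I0 - t / tau,
    so tau is obtained from the least-squares slope of ln(counts) versus time.
    """
    n = len(times_ns)
    logs = [math.log(c) for c in counts]
    mean_t = sum(times_ns) / n
    mean_y = sum(logs) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_ns, logs))
    var = sum((t - mean_t) ** 2 for t in times_ns)
    return -var / cov  # tau = -1 / slope


def simulate_decay(tau_ns, times_ns, i0=1000.0):
    """Synthetic, noise-free photon counts (illustration only)."""
    return [i0 * math.exp(-t / tau_ns) for t in times_ns]


times = [0.5 * i for i in range(20)]  # 0 to 9.5 ns
tau_donor = fit_lifetime(times, simulate_decay(2.5, times))  # donor alone
tau_fret = fit_lifetime(times, simulate_decay(1.5, times))   # donor next to acceptor

efficiency = 1.0 - tau_fret / tau_donor
print(f"tau_D = {tau_donor:.2f} ns, tau_DA = {tau_fret:.2f} ns, E = {efficiency:.2f}")
```

The shorter lifetime of the donor in the presence of the acceptor is what reveals FRET, independently of how much tagged protein is present.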

TFs usually interact to form functional complexes. Often one factor requires the interaction with the other
to enter the nucleus. The regulated nuclear localization of TF complexes is a safety mechanism that
prevents the activity of one TF when its partner is not expressed, to avoid nonspecific activation of gene
transcription.

Other fluorescence-based techniques: FRAP, FLIP and BiFC


FRAP = Fluorescence Recovery After Photobleaching. A specific region of the cell is briefly photobleached.
Photobleaching destroys fluorescence emission from the fluorophores that are located in that region.
After photobleaching, the cell gradually recovers the fluorescence intensity because of free diffusion of
tagged proteins. The fluorescence recovery plateaus at a lower
intensity than detected before photobleaching because some
fluorophores are irreversibly damaged by photobleaching.
FRAP recovery time indicates the dynamics (speed) of any type
of protein movement inside the cell. The speed indicates the type of transport, as higher speeds indicate
active transport.
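
One common way to summarise a FRAP curve is the half-recovery time, the time needed to recover half of the fluorescence between the post-bleach minimum and the final plateau. The sketch below is only an illustration: the recovery curves are synthetic single exponentials with invented time constants, and the function names are hypothetical.

```python
import math


def half_recovery_time(times_s, intensities):
    """Time at which fluorescence is half-way between the first post-bleach
    value and the final plateau, using linear interpolation between samples."""
    f0, plateau = intensities[0], intensities[-1]
    target = f0 + 0.5 * (plateau - f0)
    for i in range(len(times_s) - 1):
        f1, f2 = intensities[i], intensities[i + 1]
        if f1 <= target <= f2 and f2 > f1:
            t1, t2 = times_s[i], times_s[i + 1]
            return t1 + (target - f1) * (t2 - t1) / (f2 - f1)
    raise ValueError("recovery never reaches the half-way level")


def synthetic_recovery(tau_s, times_s, f0=0.2, plateau=0.8):
    """Assumed model: F(t) = F0 + (plateau - F0) * (1 - exp(-t / tau))."""
    return [f0 + (plateau - f0) * (1.0 - math.exp(-t / tau_s)) for t in times_s]


times = [0.1 * i for i in range(600)]  # 0 to 59.9 s after bleaching
fast = half_recovery_time(times, synthetic_recovery(2.0, times))
slow = half_recovery_time(times, synthetic_recovery(8.0, times))
print(f"fast protein: t_half = {fast:.2f} s, slow protein: t_half = {slow:.2f} s")
```

A shorter half-recovery time corresponds to faster movement of the tagged protein into the bleached region.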

FLIP = Fluorescence Loss In Photobleaching. Continuous photobleaching is performed on a specific area of
the cell while fluorescence intensity is monitored on another area of the same cell. Since proteins diffuse
through the cell, fluorescence intensity should gradually fall because fluorophores that enter the
photobleached area are bleached by the high energy laser. FLIP identifies the source or origin of the protein
transport as well as the speed of protein transport.

Both FRAP and FLIP can be performed on live tissue with a confocal microscope.

BiFC = Bimolecular Fluorescence Complementation. Protein interaction is detected in living cells with a
standard fluorescence microscope (FRET achieves the same but requires a confocal microscope). YFP has been
engineered to produce two protein fragments that do not emit fluorescence light on their own, but recover
their fluorescence emission properties when they are brought close
together. Each fragment is fused to a protein of interest. The 2
constructs are expressed or delivered inside cells of a live
tissue. If the 2 proteins interact, YFP fluorescence is restored.
BiFC or FRET are performed to confirm results from a 2-hybrid
assay. FRET is more reliable than BiFC, which yields more false
positives.

Protein complexes analysis


Protein complexes can be captured using a labelled subunit.

1) A TAP tag is fused to the target protein of a complex of interest.
2) The tagged protein is introduced in the host cell or organism.
Expression level of the tagged protein must reproduce physiological
conditions.
3) Cell extracts are processed with 2 specific affinity chromatography
steps in tandem to specifically select the tagged protein together
with its partners. Complexes must remain intact, therefore protein
denaturing steps are avoided. The elution of the proteins attached
to the solid phase is achieved thanks to the TEV site in the TAP tag
that is specifically cut by a protease.
4) Selected proteins are concentrated and separated by SDS-PAGE
5) MS on each protein band to identify all components of the protein
complex. The gel step can be eliminated as discussed above.

The TAP tag consists of two IgG binding domains of Staphylococcus aureus protein A and the calmodulin
binding peptide (CBP) separated by a TEV protease cleavage site. TEV cleavage after the first affinity
chromatography exposes the CaM binding peptide, which specifically binds to the solid phase of the second
affinity chromatography (loaded with CaM).

Large-scale studies of protein complexes reveal the different functions of all complexes that contain a specific
protein partner and the gene network of different TFs, and enable comparative studies of conserved complexes.
Overexpression of the tagged bait protein often leads to artifacts.

This protocol is efficient in yeast, but it can also be performed in plants. Protein complexes can also be
visualised in living tissues before extraction by adding a fluorophore to the tag.

Protein structure determination: cryo-EM
The three-dimensional structure of a protein or a protein complex indicates its function and interaction
partners.

In cryo-EM, the purified protein complexes are frozen in a thin layer of
vitreous (non-crystalline) ice (it does not require crystallisation). Then in a TEM an
electron beam is fired at the frozen protein solution. The resulting
scattered electrons create a magnified image of the proteins (like a
shadow). Scattering images from multiple angles are combined to
determine the 3D structure of the protein.

Cryo-EM only works with >50 kDa proteins or protein complexes. To
analyse small proteins, they can be attached to an imaging scaffold. Cryo-EM
removes the need for crystallization, a step that requires perfect and
protein-specific conditions.

5/11/21

Lecture by Chiara Paleni - Genomics for biodiversity conservation


Master’s thesis on genomics analysis of Salvia collection in the Brera botanical garden.

Biodiversity is biological variability of life on Earth. It can be impacted by human activities and climate
change. Biodiversity loss can compromise the ecosystem. Biodiversity can be protected either in situ by
monitoring natural habitats or ex situ whereby plants are kept in a different ecosystem.

Genetic plant diversity is correlated with the adaptive potential of the population, and therefore with its fitness.
Reduced population sizes result in lower genetic diversity and a less resilient population. Genetic diversity is
a target of biodiversity conservation, particularly for threatened species. Genomics approaches are
employed to study inter and intra-species genetic diversity. Genomics is used to study diversity across the
entire genome but it requires a reference genome.

First, genomic properties such as genome size are determined. Genome assemblies are generally haploid,
however heterozygosity and repeats complicate sequence assembly. Therefore, diploid and inbred plants
are preferentially sequenced to obtain a de novo genome assembly. Then, the DNA is extracted and prepared for
whole-genome sequencing. De novo genome sequencing normally requires high coverage, meaning that a
genomic region is independently sequenced up to 100 times in different reads. Usually in de novo
sequencing long reads are necessary to solve the problem of repeated sequences. Short-read sequencing
is also used to correct sequencing errors. After the reads have been obtained, the genome is assembled
with different tools. Usually many different assemblies are performed to find the best solution.
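
Average coverage is usually computed as (number of reads × read length) / genome size. The sketch below applies this relation to estimate how many reads a given coverage requires. The genome size of 800 Mbp and the read length of 150 bp are used only as illustrative inputs, and the function names are hypothetical.

```python
import math


def coverage(n_reads, read_len_bp, genome_size_bp):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_len_bp / genome_size_bp


def reads_needed(target_coverage, read_len_bp, genome_size_bp):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_len_bp)


genome = 800_000_000  # assumed genome size in bp
n = reads_needed(100, 150, genome)
print(f"{n:,} reads of 150 bp give {coverage(n, 150, genome):.1f}x coverage")
```

This is an average only: repeats and regions that are hard to sequence end up with higher or lower depth than the mean.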

Salvia is a group of officinal plants from the Lamiaceae family (quadrangular stem section, ...). The pollination
mechanism is peculiar because the stamen functions as a lever that presses onto the back of the pollinating insect. There are 18
species of Salvia in Italy, 4 of which are endemic (not present anywhere else). Salvia pratensis exhibits a high
phenotypic variability. The botanical garden collection of Salvia species has been previously studied.
Objective: obtain a de novo reference genome of Salvia.

Step #1: a plant collection was established from different botanical gardens. The seeds were vernalized
(kept in the dark and in the fridge) to make them germinate as soon as they are placed in soil. Unfortunately, not all
seeds germinated. Pollination was performed to obtain seeds and propagate the collection
for future studies.

Step #2: flow cytometry to measure Salvia genome size indirectly. Nuclei are stained with ethidium
bromide. The light scattered and emitted by each single nucleus is analyzed in the flow cytometer. One large peak
corresponds to the 2C DNA content. A smaller peak at 4C DNA content identifies cells that are dividing. The
genome size is determined by comparing the unknown sample to a standard of known genome size
(Nicotiana benthamiana and ?). Standards and unknown samples are analyzed together in flow cytometry to
yield different peaks on the same fluorescence histogram. The peaks of the 2C DNA content of the different
plants are compared to estimate the Salvia genome size. Nicotiana was found not to be an optimal standard
because the estimated genome size of Salvia was double the value reported in other scientific articles.
Another standard was used and yielded a better estimate of 800 Mbp.
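
The comparison with a standard amounts to a simple proportion, assuming that fluorescence intensity is proportional to DNA content and that the 2C peaks of both species are compared. The peak positions and the standard genome size below are invented placeholders, not the values measured in the thesis.

```python
def estimate_genome_size(sample_peak, standard_peak, standard_size_mbp):
    """Scale the known genome size of the standard by the ratio of the 2C peak positions.

    sample_peak, standard_peak : mean fluorescence of the 2C peaks (arbitrary units)
    standard_size_mbp          : known genome size of the standard, in Mbp
    """
    return standard_size_mbp * sample_peak / standard_peak


# Hypothetical measurement: the standard peak sits at 300 units, the sample at 80.
size = estimate_genome_size(sample_peak=80.0, standard_peak=300.0, standard_size_mbp=3000.0)
print(f"estimated genome size: {size:.0f} Mbp")
```

The estimate is only as good as the standard: an inaccurate reference size, or a standard peak far from the sample peak, shifts the result proportionally.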

Step #3: genome sequencing from different individual plants. Different protocols were tested to select the
best one. Sample requirements must be checked and the DNA is packaged to be sent for sequencing. Three 150 bp
paired-end libraries were prepared from 3 Salvia pratensis accessions.

Step #4: k-mer profile analysis to estimate heterozygosity and repeat rates. K-mers are all the
subsequences of length k that are present in the genome. When the genome is much longer than a k-mer,
the number of k-mers in the genome approximates the genome size. Sufficiently long k-mers (k = 19-20) should
correspond to unique sequences. Each distinct k-mer is present in many copies in the read set. The k-mer
multiplicity histogram is plotted and should yield a Poisson distribution profile.
$$\text{genome size} = \frac{\text{total } k\text{-mers}}{C_{kmer}}$$
Data was analyzed with statistical models.
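
The genome size formula can be tried on synthetic data. This sketch assumes that C_kmer is the k-mer coverage, read off as the multiplicity at the main peak of the histogram (the notes do not define it explicitly). The random 20 kbp "genome", the error-free reads and all function names are made up for illustration; real analyses use dedicated k-mer counters and statistical models that also handle sequencing errors, heterozygosity and repeats.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts):
    """Apply genome size = total k-mers / C_kmer.

    C_kmer is taken as the most frequent multiplicity in the histogram
    (a simplification that ignores errors, heterozygosity and repeats).
    """
    histogram = Counter(kmer_counts.values())  # multiplicity -> number of distinct k-mers
    c_kmer = max(histogram, key=histogram.get)
    total_kmers = sum(kmer_counts.values())
    return total_kmers / c_kmer, c_kmer


# Synthetic example: random genome, error-free reads sampled uniformly.
random.seed(0)
true_size, read_len, k, n_reads = 20_000, 100, 19, 6_000
genome = "".join(random.choice("ACGT") for _ in range(true_size))
reads = []
for _ in range(n_reads):
    start = random.randrange(true_size - read_len + 1)
    reads.append(genome[start:start + read_len])

size, c_kmer = estimate_genome_size(count_kmers(reads, k))
print(f"k-mer coverage peak: {c_kmer}, estimated size: {size:.0f} bp (true: {true_size} bp)")
```

The estimate is approximate because the position of the histogram peak fluctuates with random sampling; deeper coverage and model fitting give a more stable value.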

Step #5: de novo genome assembly. Short reads yield a high number of scaffolds (ideally the scaffold number
approximates the number of chromosomes), so the genome assembly is not accurate, because plant genomes
contain highly repetitive sequences.

The same was repeated on the chloroplast genome, which is about 150 kbp long and consists of 2 large inverted repeats + 2 single copy
regions. It was found that one of the accessions carried a long inversion on a portion of the single copy
region. This shouldn’t happen among plants of the same species.

Future steps: genome assembly with longer reads. The reference genome can be used to characterize other
Salvia species and to perform population-wide studies.
