
Genes, Genomics, and Chromosomes

Topics
• Eukaryotic Gene Structure
• Chromosomal Organization of Genes and Noncoding DNA
• Transposable (Mobile) DNA Elements

Goals
• Learn how genes
encoded by complex
transcription units are
expressed.
• Learn the origin,
types, and functions
of DNA in higher
organisms.
• Learn the properties
of transposons and
their roles in gene
evolution.
Figure: RxFISH-painted human chromosomes.
1. Genome structure – Double
helix
1. Genome structure - Chromatin
DNA is packed into chromosomes in a
hierarchical way:

DNA double helix is coiled around histone
octamers. About 147 bp wrap around a
single octamer (about 1.65 turns). These
“beads” are separated by linker DNA
roughly 50 bp long.

Histone beads pack into 30 nm fibers

Fibers are tied up into scaffolds

Condensed scaffolds make up the
macroscopic form of chromosomes.
Overview of Human Genes & Chromosomes
Human diploid genomic DNA contains ~10^9 bp divided among 22
autosomes and 2 sex chromosomes. The longest autosome (#1)
contains 280 x 10^6 bp. Only 1.5% of human DNA encodes
proteins or functional RNA products. The expressed, coding
segments of genes are called exons. Exons are highly conserved in
sequence. Noncoding DNA
consists of spacer DNA
between genes and intron
DNA within genes.
Noncoding DNA is not
strongly conserved and
accounts for most of the
variations in sequences
between individual humans.
As discussed later, DNA
is highly condensed
(overall ~10^5-fold in
mitotic chromosomes) by
protein-nucleic acid
complexes called
nucleosomes and other
higher-order structures
(Fig. 6.1).
1. Genome structure -
Chromosomes
Human cells have 46 chromosomes: 22
pairs of autosomes (one member of each
pair from the father, one from the mother)
and two sex chromosomes (X from
the mother, X or Y from the father).

In preparation for normal cell division
(mitosis), chromosomes are
replicated, but remain joined at their
centromere (prophase). This gives
the chromosomes their “X” shape.
Both “halves” of the X are called
(sister) chromatids.

When cells are not replicating, this is
http://biology.unm.edu/ccouncil/Biology_124/Summaries/Sex.html


1. Genome structure – DNA
methylation
DNA methylation is an epigenetic marker
that controls / regulates many biological
functions:
- Control of gene expression
- Control of DNA replication
- Control of the cell cycle
- and more

Cytosines are methylated by an enzyme
(C-methyltransferase) that targets CpG
dinucleotides (cytosine, phosphate, guanine).
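To make the CpG target concrete, here is a minimal sketch that lists candidate methylation sites in a DNA string. The function name `cpg_positions` is made up for this illustration, and it only finds the dinucleotide; it says nothing about whether a given site is actually methylated in a cell.

```python
def cpg_positions(seq):
    """Return 0-based indices of every C that is immediately followed
    by G when the strand is read 5' to 3'."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i] == "C" and s[i + 1] == "G"]


if __name__ == "__main__":
    # Hypothetical sequence, used only to show the call.
    print(cpg_positions("ACGTCGG"))
```

Because CG reads as CG on the complementary strand too, each site found this way has a partner cytosine on the opposite strand.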

Methylation patterns are established during
early development, and maintained over
many generations by maintenance
methyltransferases copying the methylation
status to a newly synthesized strand (note:
1. Genome structure – Histone
modifications
Histones have tails which can be modified in
various ways, and at several locations. Each
(combination of) modifications has a
different biological function (“Histone
code”).

Histones are involved in many essential
biological processes including
- Gene regulation
- DNA repair
- Chromosome condensation / mitosis

“Until the early 1990s, histones were
dismissed as merely packing material for
nuclear DNA” (Wikipedia). Extreme
conservation of histone proteins (found all
the way back to archaea) suggests that they

Nature 447, 433-440 (24 May 2007)
http://chemistry.gsu.edu/faculty/Zheng/
1. Genome structure – Sequence
structure
What does the human genome sequence consist of?

Total size: 2,858,160,000 bp


- Protein-coding genes: About 20,500
- Protein-coding exons: About 220,000; cover 1.2% of genome
- Transposable elements: About 45% of genome
- Tandem repetitive sequence: Few %
- Heterochromatin: Few %
- Unknown: About half.

Conserved: About 5%
Biologically functional: ? (>5%)
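As a rough consistency check, the rounded figures in this list can be turned into base pairs. The sketch below is illustrative arithmetic only; the constant and function names are made up for this example, and the inputs are the approximate values quoted above.

```python
# Back-of-the-envelope arithmetic from the rounded figures on this slide.
GENOME_BP = 2_858_160_000   # total size quoted above
EXON_FRACTION = 0.012       # protein-coding exons cover ~1.2% of the genome
N_EXONS = 220_000           # approximate number of protein-coding exons
N_GENES = 20_500            # approximate number of protein-coding genes


def exon_summary(genome_bp=GENOME_BP, exon_fraction=EXON_FRACTION,
                 n_exons=N_EXONS, n_genes=N_GENES):
    """Return (total exon bp, mean exon length in bp, mean exons per gene)."""
    exon_bp = genome_bp * exon_fraction
    return exon_bp, exon_bp / n_exons, n_exons / n_genes


if __name__ == "__main__":
    exon_bp, mean_len, per_gene = exon_summary()
    print(f"coding exon sequence: {exon_bp / 1e6:.1f} Mb")
    print(f"mean exon length:     {mean_len:.0f} bp")
    print(f"mean exons per gene:  {per_gene:.1f}")
```

With these inputs the coding exons add up to roughly 34 Mb, which works out to a mean of about 156 bp per exon and about 11 exons per gene.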
1. Genome structure - Genes
Darwin (1809-1882) used the term “gemmule” to denote a microscopic
unit of inheritance. Major problem in his day: why do traits not
“blend out” by mixing.

Mendel (1822-1884) was the first to suggest the existence of factors conveying
traits from parent to offspring, and the pattern of their inheritance
(e.g., two copies per individual, one from each parent; segregation
during gamete production; different traits segregate independently),
solving the problem of blending.

1889: Hugo de Vries coined the term “pangen”, later shortened to “gene”.
1910: Thomas Hunt Morgan: genes reside on specific chromosomes.
1941: Specific genes code for specific proteins (“one gene, one enzyme” hypothesis).
1977: Roberts and Sharp discover introns.
2003: Genes often overlap; single genes have multiple products.
1. Genome structure - Genes

Eukaryotic protein-coding genes consist of:
- Upstream region (with regulatory signals)
- Promoter region, with transcription
initiation site (e.g. TATA box)
- 5’ untranslated region (5’ UTR)
- Translation initiation site
(includes start codon)
- Alternating sequence of exons (protein-coding)
and introns
- Translation stop site (stop codon)
- 3’ UTR
- Polyadenylation (poly-A) signal
- Transcription stop site
Simple Transcription Units
Eukaryotic genes are monocistronic in that only one protein is
produced from a given mRNA. However, multiple forms of mRNAs,
and therefore proteins, are produced from many genes. Simple
gene transcription units produce only one type of mRNA and protein
(Fig. 6.3a). Mutations at sites a & b often reduce or prevent
transcription. Mutations at site c can change the amino acid
sequence of the protein and interfere with its function. Mutations
at site d affecting the selection of the exon 2/3 splice site can
result in an abnormally spliced mRNA and nonfunctional protein.
Complex Transcription Units
Complex gene transcription
units produce several species
of mRNAs, and thus proteins
(Fig. 6.3b). The exon content
of mRNAs and domain
composition of proteins are
varied by selection of
alternative splice sites (Top),
polyadenylation sites (Middle),
and even promoter sites
(Bottom). Site selection may
vary in different cell types and
during different stages of
development. The effects of
mutations (e.g., c & d) on the
gene products synthesized
from these transcription units
will be discussed in class.
About 60% of human genes
are contained in complex
transcription units.
Alternative Splicing & Gene Regulation
Protein domains can be encoded by a single exon or by a small
collection of exons within a larger gene. The coding regions for
domains can be spliced in or out of the primary transcript by the
process of alternative splicing. The resulting mRNAs encode
different forms of the protein, known as isoforms. Alternative
splicing is an important method for regulation of gene expression
in different tissues and different physiological states. It is
estimated that 60% of all human genes are expressed as
alternatively spliced mRNAs. Alternative splicing is illustrated in
Fig. 4.16 for the fibronectin gene. The fibroblast and
hepatocyte isoforms differ in their content of the EIIIA and
EIIIB domains which mediate cell surface binding. Twenty
different isoforms of fibronectin produced by alternative splicing
have been identified.
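The combinatorial effect of splicing domains in or out can be sketched with a toy model. The code below assumes a gene is an ordered list of exon names in which some exons are optional ("cassette") exons; the function name, the exon order, and the labels E1 to E3 are hypothetical. Only EIIIA and EIIIB come from the text, and this toy does not attempt to reproduce the twenty fibronectin isoforms.

```python
from itertools import product


def enumerate_isoforms(exons, optional):
    """Yield every exon combination obtained by including or skipping
    each optional (cassette) exon; all other exons are always kept."""
    cassette = [e for e in exons if e in optional]
    for choices in product([True, False], repeat=len(cassette)):
        included = dict(zip(cassette, choices))
        yield tuple(e for e in exons if included.get(e, True))


if __name__ == "__main__":
    # Hypothetical exon order for illustration.
    exons = ["E1", "EIIIB", "E2", "EIIIA", "E3"]
    for isoform in enumerate_isoforms(exons, optional={"EIIIA", "EIIIB"}):
        print("-".join(isoform))
```

In this model n independent cassette exons give 2^n possible mRNAs, which is why a handful of alternative exons is enough to generate many isoforms from one gene.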
Human Genomic DNA: Protein-coding Genes
Genomic DNA of higher eukaryotes contains 4 main classes of
DNA--1) protein-coding genes, 2) tandemly repeated genes, 3)
repetitious DNA, and 4) unclassified spacer DNA (Table 6.1).
Protein coding genes are grouped into the categories known as
solitary genes, and duplicated or diverged genes belonging to gene
families. In humans, roughly equal numbers of protein-coding genes
occur in these two categories. Groups of homologous duplicated
genes form gene and protein families, such as the β-globin family.

(25-30%)
The Human β-globin Gene Family
The β-globin gene cluster on chromosome 11 is shown in Fig. 6.4a.
The β-globin genes are expressed in different stages of life. ε, Aγ,
and Gγ are expressed during different trimesters of fetal
development (next slide). β expression begins around birth &
continues throughout adult life. Fetal hemoglobin molecules made
with the ε and Gγ or Aγ polypeptides have a higher affinity for
O2 than maternal hemoglobin, facilitating O2 transfer to the fetus.

The 5 β-globin genes are derived from an ancestral β-globin gene
via gene duplication. Over time, these genes accumulated adaptive
mutations via sequence drift resulting in the specialized species of
β-globin proteins. Genomic DNA also contains nonfunctional DNA
sequences called pseudogenes that are derived from gene
duplication or reverse transcription and integration of cDNA
sequences made from mRNA (covered below). β-globin pseudogenes
contain introns and thus were derived by gene duplication. Over
time these genes became nonfunctional also due to sequence drift.
Because they are not harmful, pseudogenes remain in the genome,
marking a gene duplication event in an earlier ancestor.
Exon and Gene Duplication from Unequal
Crossing Over
Fig. 6.2 illustrates how duplication of genes (e.g., the β-globins)
and exons can occur via unequal crossing over during meiosis and
formation of gametes. Exon duplication results in proteins
containing repeated domains (e.g., the EGF precursor, Fig. 3.11).
In the examples shown, recombination is shown to occur between
L1 retrotransposon sequences which are common in genomic DNA.
Modular Domain Structure of Proteins
Domains are independently folding and functionally specialized
tertiary structure units within a protein. The respective
globular and fibrous structural domains of the hemagglutinin
monomer (which happen to be individual polypeptide chains) are
illustrated above in Fig. 3.10a. Domains (such as the EGF
domain) also may be encoded within a single polypeptide chain,
as illustrated in Fig. 3.11. Domains still perform their
standard functions although fused together in a longer
polypeptide (e.g., DNA binding and ATPase domains of a
transcription factor). The modular domain structure of many
proteins has resulted from the shuffling and splicing together
of their coding sequences within longer genes.

Epidermal growth
factor (EGF) domain
Gene Density in Genomic DNA
Higher eukaryotes contain far more noncoding DNA between
genes than bacteria and simple eukaryotes (Fig. 6.4). The region
of human genomic DNA containing the β-globin gene cluster
shown in the figure actually is a relatively "gene-rich" region of
human DNA. Some regions known as gene-poor "deserts" also
occur. Higher eukaryotes also contain a larger amount of intron
DNA. Although one-third of human DNA is transcribed into pre-
mRNA, 95% ends up being degraded after RNA splicing
reactions. On average, the typical exon is 50-200 bp in length,
while the median length of introns is 3.3 kb in human genes.
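The exon and intron sizes above already imply that most of a primary transcript is discarded. The sketch below makes that arithmetic explicit under loud assumptions: every exon is taken to be 150 bp (a mid-range value), every intron is taken to be the 3.3 kb median, and the 10-exon gene is hypothetical. The function name is made up for this example.

```python
def exon_fraction(n_exons, exon_bp=150, intron_bp=3300):
    """Fraction of a primary transcript that is exon sequence, assuming
    n_exons equal-sized exons separated by n_exons - 1 equal-sized introns."""
    if n_exons < 1:
        raise ValueError("a gene needs at least one exon")
    exon_total = n_exons * exon_bp
    intron_total = (n_exons - 1) * intron_bp
    return exon_total / (exon_total + intron_total)


if __name__ == "__main__":
    frac = exon_fraction(10)  # hypothetical 10-exon gene
    print(f"exon sequence: {frac:.1%}, removed by splicing: {1 - frac:.1%}")
```

For the hypothetical 10-exon gene this gives about 5% exon sequence, which is in line with the statement that roughly 95% of pre-mRNA is degraded after splicing.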
Human Genomic DNA: Tandemly
Repeated Genes
Tandemly repeated genes also are derived by gene duplication.
Unlike gene families, the sequences of these duplicated genes
are identical or strongly conserved. In addition, they commonly
are arranged in a head-to-tail fashion in tandem arrays over a
long stretch of DNA. rRNAs and snRNAs (used in splicing
reactions, Chap. 8) are representative of this group (Table
6.1). Multiple copies of these genes are needed due to the
requirement for vast amounts of these RNAs in the cell. tRNA
and histone genes are included in this category, but these
genes typically occur in clusters and not true tandem arrays.
Nonprotein-coding Genes in Human
Genomic DNA
Thousands of genes in the human genome encode functional RNAs (Table
6.2). The functions of several of these are covered in later chapters.
Repetitious DNA
Two main categories of repetitious DNA--simple-sequence DNA
and interspersed repeats--occur in eukaryotic genomes (Table
6.1). Interspersed repeats are more common and are derived
largely from transposons. Simple-sequence DNA is less prevalent,
accounting for ~6% of human genomic DNA. Simple-sequence DNA
is also known as satellite DNA, due to its formation of satellite
bands during cesium chloride density gradient ultracentrifugation.
The function of this DNA is mostly obscure. It is commonly found
at the centromere and telomere regions of chromosomes.

Properties of Satellite DNA
Satellite DNA is classified into 3 types
based on length. True satellite DNA
consists of 14-500 bp sequence units
that tandemly repeat over 20-100 kb
lengths of genomic DNA. Minisatellite
DNA consists of 15-100 bp sequence
units that tandemly repeat over 1-5 kb
stretches of DNA. Microsatellite DNA
consists of 1-13 bp units that can
repeat up to 150 times. Microsatellite
DNA is thought to originate from
“backward slippage” of a growing
daughter strand on its template strand
during DNA replication (Fig. 6.5). The
sequences of repeat units are highly
conserved which suggests they perform
important functions. Each category of
satellite DNA contains a number of
different repeat sequences. Simple-
sequence DNAs can serve as DNA
markers due to variations in repeat
number. Satellite DNAs are exploited in
FISH (fluorescence in situ hybridization)
chromosome staining (Fig. 6.6).
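The unit-length ranges quoted for satellite (14-500 bp) and minisatellite (15-100 bp) DNA overlap, so a classifier needs the length of the whole tandem array as well. The following sketch encodes only the ranges given on this slide; the function name is invented for illustration and the hard cut-offs are a simplification.

```python
def classify_simple_sequence(unit_bp, array_bp):
    """Classify a tandem repeat using the ranges quoted on this slide.

    unit_bp  -- length of one repeat unit in bp
    array_bp -- total length of the tandem array in bp
    Returns a class name, or "unclassified" if no range matches.
    """
    if unit_bp <= 0 or array_bp <= 0:
        raise ValueError("lengths must be positive")
    copies = array_bp / unit_bp
    if unit_bp <= 13 and copies <= 150:
        return "microsatellite"
    if 15 <= unit_bp <= 100 and 1_000 <= array_bp <= 5_000:
        return "minisatellite"
    if 14 <= unit_bp <= 500 and 20_000 <= array_bp <= 100_000:
        return "satellite"
    return "unclassified"


if __name__ == "__main__":
    # Hypothetical repeats, chosen to hit each branch.
    for unit, array in [(2, 60), (30, 3_000), (30, 50_000), (30, 10_000)]:
        print(unit, array, classify_simple_sequence(unit, array))
```

A 30 bp unit is a minisatellite when the array spans 3 kb but falls in the satellite range when it spans 50 kb, which is the overlap the array-length test resolves.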
DNA Fingerprinting
DNA fingerprinting is a method for
identifying individuals based on their
minisatellite DNA (Fig. 6.7). It was
developed in the mid-1980s and is
widely used in forensics, paternity
analysis, and for research purposes.
In the method, minisatellite DNA
from a genomic DNA specimen is
amplified by PCR using primers that
bind to unique sequences flanking
minisatellite repeat units. Bands
corresponding to each minisatellite
locus then are separated on gels.
Although satellite DNA is highly
conserved in sequence, the number
of tandem copies at each locus is
highly variable between individuals.
This results from unequal crossing
over during formation of gametes in
meiosis. Due to the variation in the
number of repeats at each locus,
different individuals can be readily
distinguished based on banding
patterns.
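The link between repeat number and banding pattern can be sketched as simple arithmetic: each PCR product is the flanking sequence plus the repeat count times the unit length. Everything named below (locus names, flank sizes, unit sizes, genotypes) is hypothetical and chosen only to illustrate the idea.

```python
def amplicon_sizes(genotype, loci):
    """Predict PCR product sizes (bp) for one individual.

    genotype -- {locus: (repeat count on allele 1, repeat count on allele 2)}
    loci     -- {locus: (flank_bp, unit_bp)}, where flank_bp is the unique
                flanking sequence amplified together with the repeats
    Returns a sorted list of distinct band sizes.
    """
    bands = set()
    for locus, counts in genotype.items():
        flank_bp, unit_bp = loci[locus]
        for n in counts:
            bands.add(flank_bp + n * unit_bp)
    return sorted(bands)


# Hypothetical loci and genotypes.
LOCI = {"locus_A": (120, 30), "locus_B": (200, 16)}
person_1 = {"locus_A": (10, 14), "locus_B": (25, 25)}
person_2 = {"locus_A": (10, 12), "locus_B": (22, 31)}

if __name__ == "__main__":
    print(amplicon_sizes(person_1, LOCI))
    print(amplicon_sizes(person_2, LOCI))
```

The two made-up individuals share one band (420 bp) but differ in the rest, which is the kind of difference a gel would resolve.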
Chap. 6 Problem 3

Satellite DNA is classified into 3 categories based on
length. Satellite DNA consists of 14-500 bp sequence
units that tandemly repeat over 20-100 kb lengths of
genomic DNA. Minisatellite DNA consists of 15-100
bp sequence units that tandemly repeat over 1-5 kb
stretches of DNA. Microsatellite DNA consists of 1-
13 bp units that can repeat up to 150 times.
Although the sequences of satellite DNA are highly
conserved, the number of tandem copies at each locus
is highly variable between individuals. This originates
due to unequal crossing over during formation of
gametes in meiosis (Upper figure). DNA fingerprinting
is a method for identifying individuals based on
variations in minisatellite DNA (Fig. 6.7). In the
method, minisatellite DNA is amplified by PCR using
unique primers flanking repeat regions, and the
collection of fragments is run on a gel. Due to the
variation in the number of repeats at different loci,
different individuals can be readily distinguished.
Interspersed Repeats
Interspersed repeat DNA comprises the largest fraction of
repetitious DNA in eukaryotic genomes. This DNA, which is also
called moderately repeated DNA, makes up ~45% of human
genomic DNA. Interspersed repeat DNA is composed of partial
and complete transposon sequences or "mobile DNA". Mobile DNAs
were discovered by Barbara McClintock in the 1940s. These
sequences move by "transposition". Transpositions in germ line
cells are inheritable and occur at a rate of one transposition per
8 individuals. In somatic cells they can cause somatic cell
mutations. Mobile DNA has been very important in genome
evolution.

Mobile DNA Elements
Mobile DNA elements are
grouped into two classes,
DNA transposons and
retrotransposons (Fig. 6.8).
DNA transposons move
directly as DNA via a "cut-
and-paste" mechanism.
Retrotransposons move via an
RNA intermediate and a
"copy-and-paste" mechanism,
wherein the original copy of
the transposon is preserved.
Retroviruses, like HIV,
formally are a subclass of
retrotransposons that can
move between cells because
they encode viral coat
proteins. DNA transposons
predominate in bacteria;
retrotransposons are more
prevalent in eukaryotes.
Genome structure – Transposable elements

TEs are “selfish genes” which when activated can insert copies of themselves into
the genome. When this happens in the germline, these insertions are
transmitted to the next generation.

Vast majority of TEs can be classified into four families, based on the mechanism
by which they copy themselves:
- LINEs (Long Interspersed Nuclear Elements, autonomous)
- SINEs (Short Interspersed Nuclear Elements, use LINE proteins for life cycle)
- LTR elements (Long Terminal Repeats; derived from retroviruses)
- DNA transposons (replicate without RNA intermediary)
Genome structure – Transposable elements

TEs were discovered by Barbara McClintock in the 1940s, in
maize where they are very active.

In human somatic cells, TE insertions can cause disease.

TEs are mostly neutral or deleterious. Although most are not
useful (for us), so that there is no selection pressure to
keep them (in the human population), many have
remained just by chance. They
are useful as a proxy for neutrally evolving
sequence.

A small proportion of TE-derived sequence
has in fact been recruited into useful biological
roles, and is now highly conserved.
Genome structure – Transposable elements

Age of a TE can be determined (approximately) by counting the average number of
substitutions from the consensus sequence, which is assumed to be the ancestral state.

Histogram of TEs versus age shows the activity over time. Alus have been very
active, but recently things have quieted down in humans.
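The dating idea above can be sketched in a few lines. This is one possible way to do it, not the method used for the histogram: it assumes the copies are already aligned to the consensus, applies a Jukes-Cantor correction for multiple substitutions at one site (my choice, not stated on the slide), and takes the neutral substitution rate as a caller-supplied parameter because no rate is given here. All function names are hypothetical.

```python
import math


def divergence(copy, consensus):
    """Fraction of aligned positions where the copy differs from the
    consensus. Sequences must be pre-aligned to equal length; positions
    with a gap ('-') or 'N' in either sequence are skipped."""
    if len(copy) != len(consensus):
        raise ValueError("sequences must be aligned to equal length")
    compared = mismatches = 0
    for a, b in zip(copy.upper(), consensus.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared


def jukes_cantor(p):
    """Correct an observed divergence p for multiple hits (JC69 model)."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def estimate_age_years(copies, consensus, rate_per_site_per_year):
    """Mean corrected divergence of the copies divided by the assumed
    neutral substitution rate (substitutions per site per year)."""
    d = sum(jukes_cantor(divergence(c, consensus)) for c in copies) / len(copies)
    return d / rate_per_site_per_year
```

Because each copy is compared with the inferred ancestral sequence rather than with another living copy, the divergence is divided by the rate once, not twice.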
Mobile DNA in Prokaryotes
Bacteria contain DNA transposons called insertion sequences (Fig.
6.9). IS elements are 1-2 kb DNAs that transpose within the
bacterial genome to random locations. Transposition ("jumping") is
mediated by an encoded transposase protein. Insertion usually
causes gene inactivation and is harmful. Nonetheless, E. coli
encodes ~20 types of IS elements. They are tolerated in part
due to their low transposition rate (1 in 10^5 - 10^7 cells per
generation). This rate is set by the low rate of transcription of
the transposase gene. IS elements contain inverted repeat
sequences of ~50 bp at each end of the protein-coding region
that are crucial for transposition.
Mechanism of DNA Transposon Copy
Number Increase
About 3 x 10^5 copies of
full-length and truncated
DNA transposons occur in
human genomic DNA (3%
of DNA). Although DNA
transposons move via a
cut-and-paste mechanism,
their copy number in the
genome will increase if
they transpose during
DNA synthesis preceding
the first meiotic division
of gametogenesis (Fig.
6.11).
LTR Retrotransposons
Eukaryotic retrotransposons fall into two major groups--LTR
retrotransposons and non-LTR retrotransposons. Together, these
sequences account for 42% of human genomic DNA.
LTRs stand for long direct
terminal repeats. LTRs consist
of 250-600 bp direct repeat
sequences located at the ends
of the retrotransposon coding
region (Fig. 6.12). LTR
retrotransposons share many
features with retroviruses.
They both encode LTRs,
reverse transcriptase, and
DNA integrase. However, LTR
retrotransposons lack coat
proteins that allow
retroviruses to move between
cells. Transposition occurs via
an RNA intermediate that is
transcribed from a promoter
in the left LTR (Fig. 6.13).
The primary transcript is
polyadenylated, forming the
retroviral genomic RNA.
Non-LTR Retrotransposons
Even more abundant in human genomic DNA are non-LTR
retrotransposon sequences. There are two main classes of non-LTR
retrotransposons, known as long interspersed elements (LINEs, ~6
kb), and short interspersed elements (SINEs, ~300 bp). LINEs
encode a reverse transcriptase (ORF2) needed for transposition
(Fig. 6.16), whereas SINEs do not. Instead SINEs are thought to
rely on LINE-encoded enzymes for transposition. LINEs are
grouped into L1, L2, and L3 families, of which only L1 is active
today. LINE sequences occur at ~9 x 10^5 copies per human
genome. SINEs occur at ~1.6 x 10^6 copies. The most abundant
SINE is the Alu element, which is named based on the fact that it
contains an AluI restriction site. Alu elements were important for
gene duplications at the β-globin locus (Fig. 6.4).
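A quick calculation with the copy numbers and element lengths above is instructive. The sketch assumes a round haploid genome size of 3 x 10^9 bp and asks how much sequence each family would occupy if every copy were full length; the names are made up for this example.

```python
HAPLOID_GENOME_BP = 3.0e9   # assumed round figure for the haploid genome


def total_bp_if_full_length(copies, full_length_bp):
    """Sequence a repeat family would occupy if every copy were full length."""
    return copies * full_length_bp


# Copy numbers and lengths as quoted in the text.
line_bp = total_bp_if_full_length(9e5, 6_000)    # LINEs: ~9 x 10^5 copies, ~6 kb
sine_bp = total_bp_if_full_length(1.6e6, 300)    # SINEs: ~1.6 x 10^6 copies, ~300 bp

if __name__ == "__main__":
    print(f"LINEs: {line_bp / HAPLOID_GENOME_BP:.0%} of the genome")
    print(f"SINEs: {sine_bp / HAPLOID_GENOME_BP:.0%} of the genome")
```

The LINE total comes out larger than the genome itself, so under these figures most LINE copies must be much shorter than the ~6 kb full-length element.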
Almost all transposable elements in mammals
fall into one of four classes
Short interspersed repetitive elements: SINEs
• Example: Alu repeats
– Most abundant repeated DNA in primates
– Short, about 300 bp
– About 1 million copies
– Likely derived from the gene for 7SL RNA
– Cause new mutations in humans
• They are retrotransposons
– DNA segments that move via an RNA intermediate.
• MIRs: Mammalian interspersed repeats
– SINEs found in all mammals
• Analogous short retrotransposons found in genomes of
all vertebrates.
Long interspersed repetitive elements: LINEs
• Moderately abundant, long repeats
– LINE1 family: most abundant
– Up to 7000 bp long
– About 50,000 copies
• Retrotransposons
– Encode reverse transcriptase and other enzymes required for
transposition
– No long terminal repeats (LTRs)
• Cause new mutations in humans
• Homologous repeats found in all mammals and many
other animals
Other common interspersed
repeated sequences in humans
• LTR-containing retrotransposons
– MaLR: mammalian, LTR retrotransposons
– Endogenous retroviruses
– MER4 (MEdium Reiterated repeat, family 4)
• Repeats that resemble DNA transposons
– MER1 and MER2
– Mariner repeats
– Were active early in mammalian evolution but are
now inactive
Exon Shuffling via Recombination Between
Homologous Interspersed Repeats
We previously have noted that gene evolution has involved exon
shuffling between protein-coding genes in the genome. A large
amount of shuffling has occurred due to the prevalence of
interspersed repeats in the genome. Due to sequence conservation
within these regions, crossover events can take place at these
sites (Fig. 6.18). This results in exon shuffling between
nonhomologous genes and the formation of new genes with new
combinations of protein domains. As illustrated in Fig. 6.2, such
events also have been important in exon and gene duplications.
Exon Shuffling via Transposition
Exon shuffling can also occur via cut-and-paste transpositions
mediated by DNA transposons. The mechanism by which this
occurs is illustrated in Fig. 6.19a. It requires that two copies of
the transposon flank the target exon. Both DNA transposons and
the exon will move as one piece of DNA if the transposase
happens to cleave DNA at the left inverted repeat of the
upstream transposon and at the right inverted repeat of the
downstream transposon. Gene 1 ends up losing the exon, and Gene
2 acquires the exon.
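The cut-and-paste event described here can be mimicked with a toy list model. In the sketch below a gene is just an ordered list of element names and "Tn" marks a transposon copy; the function name and the gene contents are hypothetical, and the model ignores introns, repeat orientation, and how the insertion site is chosen.

```python
def shuffle_exon(donor, acceptor, insert_at, transposon="Tn"):
    """Move the first transposon-exon-transposon block from donor into
    acceptor at index insert_at. Genes are lists of element names.
    Returns (new_donor, new_acceptor); the inputs are not modified."""
    for i in range(len(donor) - 2):
        flanked = (donor[i] == transposon
                   and donor[i + 1] != transposon
                   and donor[i + 2] == transposon)
        if flanked:
            block = donor[i:i + 3]              # Tn + exon + Tn move as one piece
            new_donor = donor[:i] + donor[i + 3:]
            new_acceptor = acceptor[:insert_at] + block + acceptor[insert_at:]
            return new_donor, new_acceptor
    raise ValueError("no exon flanked by two transposon copies")


if __name__ == "__main__":
    gene_1 = ["exon1", "Tn", "exon2", "Tn", "exon3"]   # hypothetical donor
    gene_2 = ["exonA", "exonB"]                        # hypothetical acceptor
    print(shuffle_exon(gene_1, gene_2, insert_at=1))
```

The donor loses the flanked exon together with both transposon copies, while the acceptor gains all three elements, matching the outcome described for Gene 1 and Gene 2.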
Exon Shuffling via Transposition
Exons can move along with a LINE element when it transposes via
its copy-and-paste mechanism (Fig. 6.19b). When a LINE element
has a weak poly(A) signal, RNA polymerase II continues to
transcribe downstream, potentially through an exon. If this exon
has a strong poly(A) signal, then transcription stops and the RNA
is polyadenylated. Then following the mechanism in Fig. 6.17,
DNA encoding the exon and the LINE element can be incorporated
into another gene. The spliced mRNA produced from the acceptor
gene may contain the newly introduced exon. Exon shuffling is
supported by experimental evidence and the enormous amount of
interspersed repeat DNA in genomes. Over billions of years, it
has played a major role in evolution of genomes.
Genome
• The genome is all the DNA in a cell.
– All the DNA on all the chromosomes
– Includes genes, intergenic sequences, repeats
• Specifically, it is all the DNA in an organelle.
• Eukaryotes can have 2-3 genomes
– Nuclear genome
– Mitochondrial genome
– Plastid genome
• If not specified, “genome” usually refers to the
nuclear genome.
Genomics
• Genomics is the study of genomes, including large
chromosomal segments containing many genes.
• The initial phase of genomics aims to map and
sequence an initial set of entire genomes.
• Functional genomics aims to deduce information
about the function of DNA sequences.
– Should continue long after the initial genome sequences
have been completed.
Human genome
• 22 autosome pairs + 2
sex chromosomes
• 3 billion base pairs in the
haploid genome
• Where and what are the
30,000 to 40,000 genes?
• Is there anything else
interesting/important?
From NCBI web site, photo from T. Ried,
Natl Human Genome Research Institute, NIH
Components of the human
Genome
• Human genome has 3.2 billion base pairs of
DNA
• About 3% codes for proteins
• About 40-50% is repetitive, made by
(retro)transposition
• What is the function of the remaining 50%?
The Genomics Revolution
• Know (close to) all the genes in a genome, and
the sequence of the proteins they encode.
• BIOLOGY HAS BECOME A FINITE
SCIENCE
– Hypotheses have to conform to what is present, not
what you could imagine could happen.
• No longer look at just individual genes
– Examine whole genomes or systems of genes
Genomics, Genetics and
Biochemistry
• Genetics: study of inherited phenotypes
• Genomics: study of genomes
• Biochemistry: study of the chemistry of living
organisms and/or cells
• Revolution launched by full genome sequencing
– Many biological problems now have finite (albeit
complex) solutions.
– New era will see an even greater interaction among
these three disciplines
Finding the function of genes
• Genes were originally defined in terms of
phenotypes of mutants
• Now we have sequences of lots of DNA from
a variety of organisms, so ...
• Which portions of DNA actually do something?
• What do they do?
• code for protein or some other product?
• regulate expression?
• used in replication, etc?
Genome Structure

Distinct components of genomes

Abundance and complexity of mRNA

Normalized cDNA libraries and ESTs

Genome sequences: gene numbers

Comparative genomics
Much DNA in large genomes is non-coding
• Complex genomes have roughly 10x to 30x
more DNA than is required to encode all the
RNAs or proteins in the organism.
• Contributors to the non-coding DNA include:
– Introns in genes
– Regulatory elements of genes
– Multiple copies of genes, including pseudogenes
– Intergenic sequences
– Interspersed repeats
Distinct components in complex
genomes
• Highly repeated DNA
– R (repetition frequency) >100,000
– Almost no information, low complexity
• Moderately repeated DNA
– 10<R<10,000
– Little information, moderate complexity
• “Single copy” DNA
– R=1 or 2
– Much information, high complexity
Clustered repeated sequences
Figure: human chromosome ideograms (G-bands)
Tandem repeats on every chromosome:
Telomeres
Centromeres
Clusters of repeated rRNA genes:
Short arms of chromosomes 13, 14, 15, 21, 22
