
Genome Organization

Genomics is the development and application of new mapping, sequencing, and
computational procedures for the analysis of the entire genome of organisms. It deals with the
systematic molecular characterization of genomes. Some of the methods used are traditional
genetic-mapping procedures; in addition, specialized techniques have been developed for
manipulating the large amounts of DNA in a genome. Genomic analysis is important for two
reasons: (1) it represents a way of obtaining an overview of the genetic architecture of an
organism and (2) it forms a set of basic information that can be used to find new genes such
as those responsible for disease. Genomic analysis generally proceeds from low-resolution
analysis to techniques with higher resolution. Genomics is divided into three basic areas:
structural genomics, which characterizes the physical nature of whole genomes; functional
genomics, which characterizes the gene and non-gene sequences in an entire genome; and
comparative genomics, which gives a better understanding of function, including evolutionary relationships.
Structural Genomics:
As its name suggests, the aim of structural genomics is to characterize the structure of the
genome. Knowledge of the structure of an individual genome can be useful in manipulating
genes and DNA segments in that particular species. For example, genes can be cloned on the
basis of knowing where they are in the genome. When a number of genomes have been
characterized at the structural level, the hope is that, through comparative genomics, it will
become possible to deduce the general rules that govern the overall structural organization
of all genomes. Structural genomics proceeds through increasing levels of analytic resolution,
starting with the assignment of genes and markers to individual chromosomes, then the
mapping of these genes and markers within a chromosome, and finally the preparation of a
physical map culminating in sequencing.
Functional genomics:
Functional genomics uses a variety of approaches such as defining all ORFs, the use of gene
knockouts to probe gene function, the yeast two-hybrid system to look for gene interaction,
and DNA microarrays to determine which genes are transcribed. It attempts to understand the
broad sweep of genome function at different developmental stages and under different
environmental conditions.
Comparative genomics:
The basis of comparative genomics is that the genomes of related organisms are similar. The
argument is the same one that we considered when looking at homologous genes. Two organisms with a
relatively recent common ancestor will have genomes that display species-specific
differences built onto the common plan possessed by the ancestral genome. The closer two
organisms are on the evolutionary scale, the more related their genomes will be. Studies of
comparative genomics also offer a powerful opportunity to identify highly conserved and
therefore functionally important sequence motifs in coding and noncoding genomic DNA.
This identification helps researchers confirm predictions of protein-coding regions of the
genome and identify important regulatory elements within DNA.
 
Eukaryotic Genome Organization
The completed and on-going genome projects are revealing a great deal about how genomes
are organized, including a number of unexpected discoveries that have taken molecular
biologists by surprise. It is very important to survey the information that has arisen from
genome projects and to learn how the genome is organized in a eukaryotic organism. Every
organism possesses a genome that contains the biological information needed to construct
and maintain a living example of that organism. Most genomes, including the human genome
and those of all other cellular life forms, are made of DNA (deoxyribonucleic acid) but a few
viruses have RNA (ribonucleic acid) genomes. DNA and RNA are polymeric molecules
made up of chains of monomeric subunits called nucleotides. Humans are fairly typical
eukaryotes and the human genome is in many respects a good model for eukaryotic genomes
in general. All of the eukaryotic nuclear genomes that have been studied are, like the human
version, divided into two or more linear DNA molecules, each contained in a different
chromosome; all eukaryotes also possess smaller, usually circular, mitochondrial genomes.
The only general eukaryotic feature not illustrated by the human genome is the presence in
plants and other photosynthetic organisms of a third genome, located in the chloroplasts.
Although the basic physical structures of all eukaryotic nuclear genomes are similar, one
important feature is very different in different organisms. This is genome size, the smallest
eukaryotic genomes being less than 10 Mb in length and the largest over 100,000 Mb, as seen
in the following table. Here genome size is the total amount of DNA contained within one copy of a
genome. Genome size can be converted to mass using the relation 1 pg = 978 Mb =
978,000,000 bp.

Species                                        Genome size (Mb)
Fungi
  Saccharomyces cerevisiae                     12.1
  Aspergillus nidulans                         25.4
Protozoa
  Tetrahymena pyriformis                       190
Invertebrates
  Caenorhabditis elegans                       97
  Drosophila melanogaster                      180
  Bombyx mori (silkworm)                       490
  Strongylocentrotus purpuratus (sea urchin)   845
  Locusta migratoria (locust)                  5000
Vertebrates
  Takifugu rubripes (pufferfish)               400
  Homo sapiens                                 3200
  Mus musculus (mouse)                         3300
Plants
  Arabidopsis thaliana (thale cress)           125
  Oryza sativa (rice)                          430
  Zea mays (maize)                             2500
  Pisum sativum (pea)                          4800
  Triticum aestivum (wheat)                    16,000
  Fritillaria assyriaca (fritillary)           120,000
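
The relation between genome size and DNA mass quoted before the table (1 pg = 978 Mb) can be tried out with a minimal Python sketch. The function names and the small dictionary are illustrative choices, not part of the original notes; the genome sizes are copied from the table.

```python
# Conversion factor quoted in the text: 1 pg of DNA corresponds to 978 Mb.
MB_PER_PG = 978.0

def mb_to_pg(size_mb):
    """Convert a genome size in megabases (Mb) to a DNA mass in picograms (pg)."""
    return size_mb / MB_PER_PG

def pg_to_mb(mass_pg):
    """Convert a DNA mass in picograms (pg) to a genome size in megabases (Mb)."""
    return mass_pg * MB_PER_PG

# A few genome sizes (Mb) taken from the table above.
genome_sizes_mb = {
    "Saccharomyces cerevisiae": 12.1,
    "Drosophila melanogaster": 180,
    "Homo sapiens": 3200,
    "Fritillaria assyriaca": 120_000,
}

for species, size_mb in genome_sizes_mb.items():
    print(f"{species}: {size_mb} Mb is about {mb_to_pg(size_mb):.3f} pg")
```

With this factor the human genome of 3200 Mb comes out at a little over 3 pg of DNA per genome copy.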
 
Genome size range coincides to a certain extent with the complexity of the organism, the
simplest eukaryotes such as fungi having the smallest genomes, and higher eukaryotes such
as vertebrates and flowering plants having the largest ones. This might appear to make sense
as one would expect the complexity of an organism to be related to the number of genes in its
genome - higher eukaryotes need larger genomes to accommodate the extra genes. However,
the correlation is far from precise: if it were, then the nuclear genome of the yeast S. cerevisiae,
which at 12 Mb is 0.004 times the size of the human nuclear genome, would be expected to
contain 0.004 × 35 000 genes, which is just 140. In fact, the S. cerevisiae genome contains
about 5800 genes.
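
The arithmetic behind this comparison can be reproduced with a short Python sketch. All numbers are the ones used in the text (12 Mb, 3200 Mb, 35,000 and 5800 genes); the function names are illustrative only.

```python
HUMAN_GENOME_MB = 3200   # human nuclear genome size used in the text
YEAST_GENOME_MB = 12     # S. cerevisiae genome size as rounded in the text
HUMAN_GENES = 35_000     # human gene number assumed in the text
YEAST_GENES = 5800       # observed yeast gene number quoted in the text

def expected_genes(genome_mb, ref_genome_mb, ref_genes, ndigits=3):
    """Gene number predicted if gene count scaled directly with genome size.

    The size ratio is rounded to `ndigits` places, as the text does (0.004).
    """
    ratio = round(genome_mb / ref_genome_mb, ndigits)
    return ratio * ref_genes

def gene_density(genes, genome_mb):
    """Average number of genes per Mb."""
    return genes / genome_mb

predicted = expected_genes(YEAST_GENOME_MB, HUMAN_GENOME_MB, HUMAN_GENES)
print(f"Predicted yeast genes: {predicted:.0f}, observed: {YEAST_GENES}")
print(f"Yeast: {gene_density(YEAST_GENES, YEAST_GENOME_MB):.0f} genes/Mb")
print(f"Human: {gene_density(HUMAN_GENES, HUMAN_GENOME_MB):.0f} genes/Mb")
```

The gap between the predicted and observed yeast gene numbers is simply a difference in gene density: on these figures yeast carries several hundred genes per Mb, whereas the human genome carries roughly ten.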
For many years the lack of precise correlation between the complexity of an organism and the
size of its genome was looked on as a bit of a puzzle, the so-called C-value paradox. In fact,
the answer is quite simple: space is saved in the genomes of less complex organisms because
the genes are more closely packed together. This can be understood by comparing
50-kb segments of the genomes of humans, yeast, fruit flies, maize and Escherichia coli.
The yeast genome segment, which comes from chromosome III (the first eukaryotic
chromosome to be sequenced), has the following distinctive features:
• It contains more genes than the human segment.
• Relatively few of the yeast genes are discontinuous.
• There are fewer genome-wide repeats.

 
The picture that emerges is that the genetic organization of the yeast genome is much more
economical than that of the human version. The genes themselves are more compact, having
fewer introns, and the spaces between the genes are relatively short, with much less space
taken up by genome-wide repeats and other non-coding sequences.
The hypothesis that more complex organisms have less compact genomes holds when other
species are examined. Consider the fruit-fly segment. If we agree that a fruit fly is more
complex than a yeast cell but less complex than a human then we would expect the
organization of the fruit-fly genome to be intermediate between that of yeast and humans.
The gene density in the fruit-fly genome is intermediate between that of yeast and humans,
and the average fruit-fly gene has many more introns than the average yeast gene but still
three times fewer than the average human gene.
It is beginning to become clear that the genome-wide repeats play an intriguing role in
dictating the compactness or otherwise of a genome. This is strikingly illustrated by the
maize genome, which at about 2500 Mb is comparable in size to the human genome but still relatively small
for a flowering plant. Only a few limited regions of the maize genome have been sequenced,
but some remarkable results have been obtained, revealing a genome dominated by repetitive
elements. The only gene in the 50-kb segment is one member of a family of genes coding for the
alcohol dehydrogenase enzymes. Instead of genes, the dominant feature of this genome
segment is the genome-wide repeats. The majority of these are of the LTR element type,
which comprise virtually all of the non-coding part of the segment, and on their own are
estimated to make up approximately 50% of the maize genome. It is becoming clear that one
or more families of genome-wide repeats have undergone a massive proliferation in the
genomes of certain species. This may provide an explanation for the most puzzling aspect
of the C-value paradox, which is not the general increase in genome size that is seen in
increasingly complex organisms, but the fact that similar organisms can differ greatly in
genome size. A good example is provided by Amoeba dubia which, being a protozoan, might
be expected to have a genome of 100-500 Mb, similar to other protozoa such as
Tetrahymena pyriformis. In fact the Amoeba genome is over 200,000 Mb. Similarly, we
might guess that the genomes of crickets are similar in size to those of other insects, but these
insects have genomes of approximately 2000 Mb, 11 times that of the fruit fly.
 Nuclear genome: 

 Figure: Classification of nuclear genome into various categories


The nuclear genome is split into a set of linear DNA molecules, each contained in a
chromosome. No exceptions to this pattern are known: all eukaryotes that have been studied
have at least two chromosomes and the DNA molecules are always linear. The only
variability at this level of eukaryotic genome structure lies with chromosome number, which
appears to be unrelated to the biological features of the organism. For example, yeast has 16
chromosomes, four times as many as the fruit fly. Nor is chromosome number linked to
genome size: some salamanders have genomes 30 times bigger than the human version but
split into half the number of chromosomes.

Packaging of DNA into chromosomes: Chromosomes are much shorter than the DNA
molecules that they contain. A highly organized packaging system is therefore needed to fit a
DNA molecule into its chromosome.
In 1973-74 several groups carried out nuclease protection experiments on chromatin (DNA-
histone complexes) that had been gently extracted from nuclei by methods designed to retain
as much of the chromatin structure as possible. In a nuclease protection experiment the
complex is treated with an enzyme that cuts the DNA at positions that are not 'protected' by
attachment to a protein. The sizes of the resulting DNA fragments indicate the positioning
of the protein complexes on the original DNA molecule. After limited nuclease treatment
of purified chromatin, the bulk of the DNA fragments have lengths of approximately 200 bp
and multiples thereof, suggesting a regular spacing of histone proteins along the DNA.

Figure: Nuclease protection analysis of chromatin from human nuclei. Chromatin is gently
purified from nuclei and treated with a nuclease enzyme. On the left, the nuclease treatment
is carried out under limiting conditions so that the DNA is cut, on average, just once in each
of the linker regions between the bound proteins. After removal of the protein, the DNA
fragments are analyzed by agarose gel electrophoresis and found to be 200 bp in length, or
multiples thereof. On the right, the nuclease treatment proceeds to completion, so all the
DNA in the linker regions is digested. The remaining DNA fragments are all 146 bp in length.
The results show that in this form of chromatin, protein complexes are spaced along the DNA
at regular intervals, one for each 200 bp, with 146 bp of DNA closely attached to each protein
complex.
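
The fragment sizes expected from the two digestion regimes can be written down in a few lines of Python. This is a simplified sketch that uses only the 200 bp spacing and the 146 bp protected length from the text; the function names are hypothetical.

```python
REPEAT_BP = 200                   # one protein complex per 200 bp of DNA (from the text)
CORE_BP = 146                     # DNA closely attached to each complex (from the text)
LINKER_BP = REPEAT_BP - CORE_BP   # exposed DNA between neighbouring complexes

def limited_digest_ladder(max_units):
    """Fragment lengths expected when only some linkers are cut.

    A fragment spanning n complexes is n * 200 bp long, which gives the
    '200 bp and multiples thereof' pattern seen on the gel.
    """
    return [n * REPEAT_BP for n in range(1, max_units + 1)]

def complete_digest(n_complexes):
    """Fragment lengths when all linker DNA is digested away."""
    return [CORE_BP] * n_complexes

print("Linker length:", LINKER_BP, "bp")
print("Limited digestion ladder:", limited_digest_ladder(5))
print("Complete digestion:", complete_digest(5))
```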
DNA in the nucleus exists mainly in combination with histone proteins; the DNA–histone
complex is called “chromatin”. Chromatin can undergo changes in its structure in response to
various cellular metabolic demands. Chromatin can be envisioned as a repeat of structural
units called “nucleosomes”. The nucleosome core particle is composed of a histone octamer
plus the DNA that wraps around it. The histone octamer contains two molecules each of
histones H2A, H2B, H3, and H4. DNA wraps around the octamer in a left-handed supercoil
of about 1.75 turns, which encloses about 150 bp. Histone H1 is a linker histone that, along
with linker DNA (the DNA in between two nucleosome core particles), physically connects
the adjacent nucleosome core particles. The length of linker DNA varies with species and cell
type. Usually, a nucleosome core particle together with its flanking linker DNA
encompasses between 180 and 200 bp of DNA. Between the nucleosome unit structure and the
metaphase chromosome structure containing two chromatids, there are several levels of
organization and compaction of the chromatin. Each nucleosome has a diameter of 10 nm;
the nucleosomes are compacted into a solenoid fiber structure of 30 nm, called the 30 nm fiber;
the 30-nm solenoid fibers are compacted into a 300-nm filament; and finally, the 300-nm
filaments are further compacted into a 700-nm chromosome. During cell division, when the
chromosomes duplicate, a 1,400-nm metaphase chromosome is produced containing two
chromatids, each chromatid being 700 nm.

Figure: From nucleosome to chromosome.

 
The 30 nm fiber is probably the major type of chromatin in the nucleus during interphase, the
period between nuclear divisions. When the nucleus divides, the DNA adopts a more
compact form of packaging, resulting in the highly condensed metaphase chromosomes that
can be seen with the light microscope and which have the appearance generally associated
with the word 'chromosome'. The metaphase chromosomes form at a stage in the cell cycle
after DNA replication has taken place and so each one contains two copies of its
chromosomal DNA molecule. The two copies are held together at the centromere, which has
a specific position within each chromosome. Individual chromosomes can therefore be
recognized because of their size and the location of the centromere relative to the two ends.
Further distinguishing features are revealed when chromosomes are stained. There are a
number of different staining techniques, each resulting in a banding pattern that is
characteristic for a particular chromosome. This means that the set of chromosomes
possessed by an organism can be represented as a karyogram, in which the banded
appearance of each one is depicted.
An important part of the chromosome is the terminal region or telomere. Telomeres are
important because they mark the ends of chromosomes and therefore enable the cell to
distinguish a real end from an unnatural end caused by chromosome breakage – an essential
requirement because the cell must repair the latter but not the former. Telomeric DNA is
made up of hundreds of copies of a repeated motif, 5′-TTAGGG-3′ in humans, with a short
extension of the 3′ terminus of the double-stranded DNA molecule.
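
As an illustration of what "copies of a repeated motif" means at the sequence level, the following Python sketch counts uninterrupted TTAGGG units at the 3′ end of a sequence. The toy sequence and the function name are made up for the example; real telomeres carry hundreds of copies.

```python
TELOMERE_MOTIF = "TTAGGG"   # human telomeric repeat unit, read 5' to 3'

def count_terminal_repeats(seq, motif=TELOMERE_MOTIF):
    """Count uninterrupted copies of `motif` at the 3' end of `seq`."""
    seq = seq.upper()
    count = 0
    end = len(seq)
    # Walk backwards from the 3' end one motif length at a time.
    while end >= len(motif) and seq[end - len(motif):end] == motif:
        count += 1
        end -= len(motif)
    return count

# Toy chromosome end: some unique sequence followed by five repeat units.
toy_end = "ACGTAC" + TELOMERE_MOTIF * 5
print(count_terminal_repeats(toy_end))
```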
Functional DNA content of genome: This includes the coding and non-coding content of genes and
contributes 25% of the nuclear genome. The earlier comparison of genome
segments from different organisms makes one thing clear: genes are not arranged
in a regular pattern but are distributed unevenly throughout the genome. Two
lines of evidence point to this, one of which relates to the banding patterns that are produced when
chromosomes are stained. The dyes used in these procedures bind to DNA molecules, but in
most cases with preferences for certain base pairs. Giemsa, for example, has a greater affinity
for DNA regions that are rich in A and T nucleotides. The dark G-bands in the human
karyogram are therefore thought to be AT-rich regions of the genome. The base composition
of the genome as a whole is 59.7% A + T so the dark G-bands must have AT contents
substantially greater than 60%. Cytogeneticists therefore predicted that there would be fewer
genes in dark G-bands because genes generally have AT contents of 45-50%. This prediction
was confirmed when the draft genome sequence was compared with the human karyogram.
The second line of evidence pointing to uneven gene distribution derived from the isochore
model of genome organization. According to this model, the genomes of vertebrates and
plants (and possibly of other eukaryotes) are mosaics of segments of DNA, each at least
300 kb in length, with each segment having a uniform base composition that differs from that
of the adjacent segments. Support for the isochore model comes from experiments in which
genomic DNA is broken into fragments of approximately 100 kb, treated with dyes that bind
specifically to AT- or GC-rich regions, and the pieces separated by density gradient
centrifugation. When this experiment is carried out with human DNA, five fractions are seen,
each representing a different isochore type with a distinctive base composition: two AT-rich
isochores, called L1 and L2, and three GC-rich classes: H1, H2 and H3. The last of these, H3,
is the least abundant in the human genome, making up only 3% of the total, but contains over
25% of the genes. This is a clear indication that genes are not distributed evenly through the
human genome.
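
Two quantities from this discussion are easy to compute: the AT content of a DNA segment, and how over-represented genes are in an isochore class relative to its share of the genome. The Python sketch below assumes only the H3 figures quoted above (3% of the genome, 25% of the genes, taken here as a lower bound); the function names and the toy sequence are illustrative.

```python
def at_content(seq):
    """Fraction of bases in `seq` that are A or T."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "AT") / len(seq)

def gene_enrichment(genome_fraction, gene_fraction):
    """How many times more gene-dense a region is than the genome average."""
    return gene_fraction / genome_fraction

# H3 isochores: about 3% of the genome but over 25% of the genes.
h3_enrichment = gene_enrichment(0.03, 0.25)
print(f"H3 is at least {h3_enrichment:.1f} times as gene-dense as the genome average")

# AT content of a toy segment.
print(f"AT content: {at_content('AATTGCATAT'):.0%}")
```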
The genes present in an organism can be classified using two approaches: the first is based
on the function of the genes, and the other on the particular domains of the protein that each
gene codes for. The second approach is more informative because it shows that a
particular genome can specify a number of protein domains that are absent from the genomes of
other organisms. In vertebrates these domains include several involved in activities such as cell adhesion,
electric couplings, and growth of nerve cells. These functions are interesting because they are
ones that we look on as conferring the distinctive features of vertebrates compared with other
types of eukaryote.
Since the earliest days of DNA sequencing it has been known that multigene families, groups
of genes of identical or similar sequence, are common features of many genomes. The rRNA
genes are examples of 'simple' or 'classical' multigene families, in which all the members
have identical or nearly identical sequences. These families are believed to have arisen by
gene duplication, with the sequences of the individual members kept identical by an
evolutionary process. Other multigene families, more common in higher eukaryotes than in
lower eukaryotes, are called 'complex' because the individual members, although similar in
sequence, are sufficiently different for the gene products to have distinctive properties. One
of the best examples of this type of multigene family is the mammalian globin genes. The
globins are the blood proteins that combine to make hemoglobin, each molecule of
hemoglobin being made up of two α-type and two β-type globins. Why are the members
of the globin gene families so different from one another? The answer was revealed when the
expression patterns of the individual genes were studied. It was discovered that the genes are
expressed at different stages in human development: for example, in the β-type cluster ε is
expressed in the early embryo, Gγ and Aγ (whose protein products differ by just one amino
acid) in the fetus, and δ and β in the adult. The different biochemical properties of the
resulting globin proteins are thought to reflect slight changes in the physiological role that
hemoglobin plays during the course of human development.
In some multigene families, the individual members are clustered, as with the globin genes,
but in others the genes are dispersed around the genome. An example of a dispersed family
is the five human genes for aldolase, an enzyme involved in energy generation, which are
located on chromosomes 3, 9, 10, 16 and 17. The important point is that, even though
dispersed, the members of the multigene family have sequence similarities that point to a
common evolutionary origin.
The Repetitive DNA Content of Genomes
Repetitive DNA is found in all organisms and in some, including humans, it makes up a
substantial fraction of the entire genome. There are various types of repetitive DNA, and
several classification systems have been devised. The scheme that we will use begins by
dividing the repeats into those that are clustered into tandem arrays and those that are
dispersed around the genome.
a) Tandemly repeated DNA: Tandemly repeated DNA is a common feature of eukaryotic
genomes but is found much less frequently in prokaryotes. This type of repeat is also called
satellite DNA because DNA fragments containing tandemly repeated sequences form
'satellite' bands when genomic DNA is fractionated by density gradient centrifugation. The
satellite bands contain fragments of repetitive DNA, and hence have GC contents and
buoyant densities that are atypical of the genome as a whole. The satellite bands in density
gradients of eukaryotic DNA are made up of fragments composed of long series of tandem
repeats, possibly hundreds of kb in length. A single genome can contain several different
types of satellite DNA, each with a different repeat unit, these units being anything from < 5
to > 200 bp. The three satellite bands in human DNA include at least four different repeat
types.
One type of human satellite DNA is the alphoid DNA repeats found in the centromere
regions of chromosomes. Although some satellite DNA is scattered around the genome, most
is located in the centromeres, where it may play a structural role, possibly as binding sites for
one or more of the special centromeric proteins. Alternatively, the repetitive DNA content
of the centromere might be a reflection of the fact that this is the last region of the
chromosome to be replicated. In order to delay its replication until the very end of the cell
cycle, the centromere DNA must lack sequences that can act as origins of replication. The
repetitive nature of centromeric DNA may be a means of ensuring that such origins are
absent.
Although not appearing in satellite bands on density gradients, two other types of tandemly
repeated DNA are also classed as 'satellite' DNA. These are minisatellites and
microsatellites. Minisatellites form clusters up to 20 kb in length, with repeat units up to
25 bp; microsatellite clusters are shorter, usually < 150 bp, and the repeat unit is usually 13 bp
or less. We have already seen one type of minisatellite DNA: telomeric DNA. In addition
to telomeric minisatellites, some eukaryotic genomes contain various other clusters of
minisatellite DNA, many, although not all, near the ends of chromosomes. The functions
of these other minisatellite sequences have not been identified. The function of
microsatellites is equally mysterious. The typical microsatellite consists of a 1-, 2-, 3- or 4-bp
unit repeated 10 to 20 times, as illustrated by the microsatellites in the human β T-cell receptor
locus. Although each microsatellite is relatively short, there are many of them in the genome.
In humans, for example, microsatellites with a CA repeat make up 0.25% of the genome,
8 Mb in all. Single base-pair repeats such as (A)15 make up another 0.15%.
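
A microsatellite is simply a short unit repeated in tandem, so candidate arrays can be located with a regular expression. The Python sketch below is a minimal illustration, not a production repeat finder; the unit length limit (4 bp), the repeat threshold (10 copies), the function name and the toy sequence are all assumptions chosen for the example.

```python
import re

def find_microsatellites(seq, max_unit=4, min_repeats=10):
    """Return (start, unit, n_repeats) for tandem runs of a short unit.

    The lazy quantifier tries the shortest unit first, so a run of A's is
    reported with unit 'A' rather than 'AA'.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# Toy sequence containing a (CA)12 array and an (A)15 array.
toy = "GGTT" + "CA" * 12 + "TTGG" + "A" * 15 + "GT"
for start, unit, n in find_microsatellites(toy):
    print(f"position {start}: ({unit}){n}")

# Genome share of CA repeats quoted in the text: 0.25% of a 3200 Mb genome.
ca_repeat_mb = 0.0025 * 3200
print(f"CA repeats: about {ca_repeat_mb:.0f} Mb")
```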
Although their function, if any, is unknown, microsatellites have proved very useful to
geneticists. Many microsatellites are variable, meaning that the number of repeat units in the
array is different in different members of a species. This is because 'slippage' sometimes
occurs when a microsatellite is copied during DNA replication, leading to insertion or, less
frequently, deletion of one or more of the repeat units. No two individuals have exactly the
same combination of microsatellite length variants: if enough microsatellites are examined
then a unique genetic profile can be established for every individual. The only exceptions are
genetically identical twins. Genetic profiling is well known as a tool in forensic science, but
identification of criminals is a fairly trivial application of microsatellite variability. More
sophisticated methodology makes use of the fact that a person's genetic profile is inherited
partly from the mother and partly from the father. This means that microsatellites can be used
to establish kinship relationships and population affinities, not only for humans but also for
other animals, and for plants.
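
The kinship idea can be illustrated with a toy Python check. A genetic profile is represented here as the pair of repeat counts (one per chromosome copy) at each microsatellite locus, and a child is consistent with a proposed mother and father if, at every locus, one allele could have come from each parent. The locus names, repeat counts and function name are invented for the example, and the check ignores mutation.

```python
def consistent_with_parents(child, mother, father):
    """True if, at every locus, the child has one allele from each parent."""
    for locus, (a, b) in child.items():
        maternal = set(mother[locus])
        paternal = set(father[locus])
        if not ((a in maternal and b in paternal) or
                (b in maternal and a in paternal)):
            return False
    return True

# Hypothetical profiles: repeat-unit counts for the two alleles at each locus.
mother = {"locusA": (12, 15), "locusB": (8, 9)}
father = {"locusA": (11, 14), "locusB": (9, 10)}
child = {"locusA": (12, 14), "locusB": (9, 9)}
unrelated = {"locusA": (13, 16), "locusB": (7, 7)}

print(consistent_with_parents(child, mother, father))
print(consistent_with_parents(unrelated, mother, father))
```

In practice many more loci are examined, since a match at only a few loci can occur by chance.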
b) Interspersed genome-wide repeats: Tandemly repeated DNA sequences are thought to
have arisen either by replication slippage, as described for microsatellites, or by DNA
recombination processes. Both of these events are likely to result in a series of linked repeats,
rather than individual repeat units scattered around the genome. Interspersed repeats must
therefore have arisen by a different mechanism, one that can result in a copy of a repeat unit
appearing in the genome at a position distant from the location of the original sequence. The
most frequent way in which this occurs is by transposition, and most interspersed repeats
have inherent transpositional activity. There are two alternative modes of transposition, one
that involves an RNA intermediate and one that does not. The version that involves an RNA
intermediate is called retrotransposition. The basic mechanism involves three steps:
1. An RNA copy of the transposon is synthesized by the normal process of transcription.
2. The RNA transcript is copied into DNA. This conversion of RNA to DNA, the reverse of
the normal transcription process, requires a special enzyme called reverse transcriptase. Often
the reverse transcriptase is coded by a gene within the transposon and is translated from the
RNA copy synthesized in step 1.
3. The DNA copy of the transposon integrates into the genome, possibly back into the same
chromosome occupied by the original unit, or possibly into a different chromosome. The end
result is that there are now two copies of the transposon, at different points in the genome.
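
The three steps above can be mimicked on a toy sequence in Python. This is a deliberately simplified sketch: it treats the genome as a single string, ignores strand complementarity, and uses hypothetical function names and coordinates. Its only purpose is to show that the original copy stays in place while a new copy appears elsewhere.

```python
def transcribe(dna):
    """Step 1: make an RNA copy of the element (T is replaced by U)."""
    return dna.replace("T", "U")

def reverse_transcribe(rna):
    """Step 2: copy the RNA back into DNA, as reverse transcriptase does."""
    return rna.replace("U", "T")

def retrotranspose(genome, start, end, target):
    """Step 3: insert a DNA copy of genome[start:end] at position `target`.

    The original element is left untouched, so the copy number increases.
    """
    element = genome[start:end]
    dna_copy = reverse_transcribe(transcribe(element))
    return genome[:target] + dna_copy + genome[target:]

# Toy genome with one element (GTCGTC) flanked by unique sequence.
genome = "AAAA" + "GTCGTC" + "CCCC"
new_genome = retrotranspose(genome, start=4, end=10, target=12)
print(genome)
print(new_genome)
print("Copies of the element:", new_genome.count("GTCGTC"))
```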

RNA transposons, or retroelements, are features of eukaryotic genomes but have not so far
been discovered in prokaryotes.

Endogenous retroviruses (ERVs) are retroviral genomes integrated into vertebrate
chromosomes. Some are still active and might, at some stage in a cell's
lifetime, direct synthesis of exogenous viruses, but most are decayed relics that no longer
have the capacity to form viruses. These inactive sequences are genome-wide repeats but
they are not capable of additional proliferation.
Retrotransposons have sequences similar to ERVs but are features of nonvertebrate
eukaryotic genomes (i.e. plants, fungi, invertebrates and microbial eukaryotes) rather than
vertebrates. Retrotransposons have very high copy numbers in some genomes, with many
different types present. There are two types of retrotransposon: the Ty3/gypsy family (Ty3
and gypsy are examples of this class in yeast and fruit fly, respectively), whose members
possess the same set of genes as an ERV, and the Ty1/copia family, members of which lack
the env gene. Both types are able to transpose but the absence of the env gene means that the
Ty1/copia group cannot form infectious virus particles. In fact, despite the presence of env
in the Ty3/gypsy genome, it has only recently been recognized that some of these elements
can form viruses and hence should be looked upon as non-vertebrate retroviruses. Although
technically they are interspersed elements, retrotransposons are sometimes found in clusters
in a genome sequence as a result of the presence of preferred integration sites for transposing
elements. The three types of retroelement described so far are LTR elements, as they have
long terminal repeats at either end which play a role in the transposition process. Other
retroelements do not have LTRs. These are called retroposons and in mammals include the
following:
• LINEs (long interspersed nuclear elements) contain a reverse-transcriptase-like gene
probably involved in the retrotransposition process. An example is the human element
LINE-1, which is 6.1 kb and has a copy number of 516,000 in the human genome. A LINE contains
a pol II promoter and two open reading frames (ORFs), one encoding the endonuclease and
the other encoding the reverse transcriptase. LINE activity proceeds as follows: RNA pol II
transcribes the LINE DNA into LINE RNA; the LINE RNA is translated into proteins; the
proteins and RNA join together and re-enter the nucleus; the endonuclease cuts a strand of
the target genomic DNA, often in the intron of a gene; the reverse transcriptase copies the
LINE RNA into LINE DNA, which is inserted into the target DNA, forming a new LINE
element there. Three distantly related LINE families are found in the human genome: LINE-1,
LINE-2, and LINE-3. Only LINE-1 (L1) is still active.
• SINEs (short interspersed nuclear elements) do not have a reverse transcriptase gene but
can still transpose, probably by 'borrowing' reverse transcriptase enzymes that have been
synthesized by other retroelements. SINEs are short sequences (about 100–400 bp) and they
contain an internal pol III promoter but do not encode any proteins. All currently known
SINEs are derived from tRNA and 7SL RNA genes. Most nonautonomous SINEs share the
3′ end with a resident LINE. The only active SINE in the human genome is the Alu element,
which is the major SINE constituting about 11% of the genome (~1 million Alu elements).
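
As a quick consistency check on the Alu figures, the average length implied by a genome fraction and a copy number can be computed directly. The Python sketch below assumes a 3200 Mb human genome (the value in the genome-size table) and uses a hypothetical function name.

```python
HUMAN_GENOME_BP = 3200 * 10**6   # 3200 Mb, the genome size used in these notes

def mean_element_length(genome_fraction, copy_number, genome_bp=HUMAN_GENOME_BP):
    """Average length (bp) of a repeat family from its genome share and copy number."""
    return genome_fraction * genome_bp / copy_number

# Alu: about 11% of the genome in roughly 1 million copies.
alu_mean_bp = mean_element_length(0.11, 1_000_000)
print(f"Implied mean Alu length: about {alu_mean_bp:.0f} bp")
```

The implied average falls inside the 100–400 bp range given for SINEs, so the two figures are mutually consistent.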
Not all transposons require an RNA intermediate. Many are able to transpose in a more direct
DNA to DNA manner. In eukaryotes, DNA transposons are less common than
retrotransposons, but they have a special place in genetics because a family of plant DNA
transposons, the Ac/Ds elements of maize, were the first transposable elements to be
discovered, by Barbara McClintock in the 1950s. DNA transposons are a much more
important component of prokaryotic genome anatomies than the RNA transposons. The
insertion sequences, IS1 and IS186, are examples of DNA transposons, and a single E. coli
genome may contain as many as 20 of these of various types. Other kinds of DNA transposon
known in E. coli, and fairly typical of prokaryotes in general, include composite
transposons and Tn3-type transposons.
