

Genomics is the development and application of new mapping, sequencing, and
computational procedures for the analysis of the entire genome of organisms. It deals with the
systematic molecular characterization of genomes. Some of the methods used are traditional
genetic-mapping procedures; in addition, specialized techniques have been developed for
manipulating the large amounts of DNA in a genome. Genomic analysis is important for two
reasons: (1) it represents a way of obtaining an overview of the genetic architecture of an
organism and (2) it forms a set of basic information that can be used to find new genes such
as those responsible for disease. Genomic analysis generally proceeds from low-resolution
analysis to techniques with higher resolution.
Genomics is divided into three basic areas: structural genomics, which characterizes the
physical nature of whole genomes; functional genomics, which characterizes the gene and
non-gene sequences in an entire genome; and comparative genomics, which seeks a better
understanding of genome function, including evolutionary relationships.
Structural genomics:
As its name suggests, the aim of structural genomics is to characterize the structure of the
genome. Knowledge of the structure of an individual genome can be useful in manipulating
genes and DNA segments in that particular species. For example, genes can be cloned on the
basis of knowing where they are in the genome. When a number of genomes have been
characterized at the structural level, the hope is that, through comparative genomics, it will
become possible to deduce the general rules that govern the overall structural organization of
all genomes. Structural genomics proceeds through increasing levels of analytic resolution,
starting with the assignment of genes and markers to individual chromosomes, then the
mapping of these genes and markers within a chromosome, and finally the preparation of a
physical map culminating in sequencing.
Functional genomics:
Functional genomics uses a variety of approaches such as defining all ORFs, the use of gene
knockouts to probe gene function, the yeast two-hybrid system to look for gene interaction,
and DNA microarrays to determine which genes are transcribed. It attempts to understand the
broad sweep of genome function at different developmental stages and under different
environmental conditions.
Comparative genomics:
The basis of comparative genomics is that the genomes of related organisms are similar. The
argument is the same one that we considered when looking at homologous genes. Two
organisms with a relatively recent common ancestor will have genomes that display species-
specific differences built onto the common plan possessed by the ancestral genome. The
closer two organisms are on the evolutionary scale, the more related their genomes will be.
Studies of comparative genomics also offer a powerful opportunity to identify highly
conserved and therefore functionally important sequence motifs in coding and noncoding
genomic DNA. This identification helps researchers confirm predictions of protein-coding
regions of the genome and identify important regulatory elements within DNA.
Eukaryotic Genome Organization
The completed and on-going genome projects are revealing a great deal about how genomes
are organized, including a number of unexpected discoveries that have taken molecular
biologists by surprise. It is very important to survey the information that has arisen from
genome projects and to learn how the genome is organized in a eukaryotic organism.
Every organism possesses a genome that contains the biological information needed
to construct and maintain a living example of that organism. Most genomes, including the
human genome and those of all other cellular life forms, are made of DNA (deoxyribonucleic
acid) but a few viruses have RNA (ribonucleic acid) genomes. DNA and RNA are polymeric
molecules made up of chains of monomeric subunits called nucleotides.
Humans are fairly typical eukaryotes and the human genome is in many respects a
good model for eukaryotic genomes in general. All of the eukaryotic nuclear genomes that
have been studied are, like the human version, divided into two or more linear DNA
molecules, each contained in a different chromosome; all eukaryotes also possess smaller,
usually circular, mitochondrial genomes. The only general eukaryotic feature not illustrated
by the human genome is the presence in plants and other photosynthetic organisms of a third
genome, located in the chloroplasts.
Although the basic physical structures of all eukaryotic nuclear genomes are similar,
one important feature is very different in different organisms. This is genome size, the
smallest eukaryotic genomes being less than 10 Mb in length, and the largest over 100,000
Mb, as seen in the following table. Here genome size is the total amount of DNA contained
within one copy of a genome. Genome size in base pairs can be converted to and from DNA
mass using the relation 1 pg = 978 Mb = 978,000,000 bp.
Species                                        Genome size (Mb)
Saccharomyces cerevisiae                       12.1
Aspergillus nidulans                           25.4
Tetrahymena pyriformis                         190
Caenorhabditis elegans                         97
Drosophila melanogaster                        180
Bombyx mori (silkworm)                         490
Strongylocentrotus purpuratus (sea urchin)     845
Locusta migratoria (locust)                    5000
Takifugu rubripes (pufferfish)                 400
Homo sapiens                                   3200
Mus musculus (mouse)                           3300
Arabidopsis thaliana (thale cress)             125
Oryza sativa (rice)                            430
Zea mays (maize)                               2500
Pisum sativum (pea)                            4800
Triticum aestivum (wheat)                      16,000
Fritillaria assyriaca (fritillary)             120,000
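The relation 1 pg = 978 Mb can be applied directly to the values in the table. The following is a minimal sketch, assuming only that conversion factor; the function names are illustrative and not part of any standard library.

```python
# Conversion factor quoted in the text: 1 pg of DNA = 978 Mb.
MB_PER_PG = 978.0


def pg_to_mb(picograms: float) -> float:
    """Convert a DNA mass in picograms to a genome size in megabases."""
    return picograms * MB_PER_PG


def mb_to_pg(megabases: float) -> float:
    """Convert a genome size in megabases to a DNA mass in picograms."""
    return megabases / MB_PER_PG


# A few genome sizes (Mb) taken from the table above.
genome_sizes_mb = {
    "Saccharomyces cerevisiae": 12.1,
    "Homo sapiens": 3200,
    "Fritillaria assyriaca": 120_000,
}

for species, size_mb in genome_sizes_mb.items():
    print(f"{species}: {size_mb} Mb = {mb_to_pg(size_mb):.3f} pg")
```

On this scale the human genome corresponds to a little over 3 pg of DNA per genome copy, and the fritillary genome to more than 100 pg.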
Genome size range coincides to a certain extent with the complexity of the organism, the
simplest eukaryotes such as fungi having the smallest genomes, and higher eukaryotes such
as vertebrates and flowering plants having the largest ones. This might appear to make sense
as one would expect the complexity of an organism to be related to the number of genes in its
genome - higher eukaryotes need larger genomes to accommodate the extra genes. However,
the correlation is far from precise: if it were, then the nuclear genome of the yeast S.
cerevisiae, which at 12 Mb is 0.004 times the size of the human nuclear genome, would be
expected to contain 0.004 × 35 000 genes, which is just 140. In fact the S. cerevisiae genome
contains about 5800 genes.
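The arithmetic behind this comparison can be reproduced in a few lines. This is a sketch that uses only the figures quoted in the text and table (3200 Mb and 35,000 genes for humans, 12.1 Mb and about 5800 genes for yeast); the function name is illustrative.

```python
HUMAN_GENOME_MB = 3200      # from the table
HUMAN_GENES = 35_000        # gene number used in the text
YEAST_GENOME_MB = 12.1      # from the table
YEAST_GENES_OBSERVED = 5800 # approximate figure quoted in the text


def genes_if_proportional(genome_mb: float) -> float:
    """Gene number expected if gene count scaled strictly with genome size."""
    return HUMAN_GENES * genome_mb / HUMAN_GENOME_MB


predicted_yeast_genes = genes_if_proportional(YEAST_GENOME_MB)

# Gene density (genes per Mb) shows how much more tightly packed yeast genes are.
yeast_density = YEAST_GENES_OBSERVED / YEAST_GENOME_MB
human_density = HUMAN_GENES / HUMAN_GENOME_MB

print(f"Predicted yeast genes under strict proportionality: {predicted_yeast_genes:.0f}")
print(f"Observed yeast genes: {YEAST_GENES_OBSERVED}")
print(f"Gene density, yeast: {yeast_density:.0f} per Mb; human: {human_density:.1f} per Mb")
```

The unrounded ratio gives a prediction slightly below the 140 quoted in the text, which uses the rounded factor 0.004. Either way, the observed yeast gene number is roughly forty times higher than strict proportionality would predict.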
For many years the lack of precise correlation between the complexity of an organism and the
size of its genome was looked on as a bit of a puzzle, the so-called C-value paradox. In fact
the answer is quite simple: space is saved in the genomes of less complex organisms because
the genes are more closely packed together. This can be seen by comparing 50-kb segments
from the genomes of humans, yeast, fruit flies, maize and Escherichia coli. The
yeast genome segment, which comes from chromosome III (the first eukaryotic chromosome
to be sequenced), has the following distinctive features:
• It contains more genes than the human segment.
• Relatively few of the yeast genes are discontinuous.
• There are fewer genome-wide repeats.

The picture that emerges is that the genetic organization of the yeast genome is much more
economical than that of the human version. The genes themselves are more compact, having
fewer introns, and the spaces between the genes are relatively short, with much less space
taken up by genome-wide repeats and other non-coding sequences.
The hypothesis that more complex organisms have less compact genomes holds when other
species are examined. Consider the fruit-fly segment. If we agree that a fruit fly is more
complex than a yeast cell but less complex than a human, then we would expect the
organization of the fruit-fly genome to be intermediate between that of yeast and humans.
This is what is seen: the gene density in the fruit-fly genome is intermediate between that
of yeast and humans, and the average fruit-fly gene has many more introns than the average
yeast gene but still only about one-third as many as the average human gene.
It is beginning to become clear that the genome-wide repeats play an intriguing role in
dictating the compactness or otherwise of a genome. This is strikingly illustrated by the
maize genome, which at about 2500 Mb is comparable in size to the human genome and relatively small
for a flowering plant. Only a few limited regions of the maize genome have been sequenced,
but some remarkable results have been obtained, revealing a genome dominated by repetitive
elements. The only gene in the 50-kb segment is one member of a family of genes coding for the
alcohol dehydrogenase enzymes. Instead of genes, the dominant feature of this genome
segment is the genome-wide repeats. The majority of these are of the LTR element type,
which comprise virtually all of the non-coding part of the segment, and on their own are
estimated to make up approximately 50% of the maize genome. It is becoming clear that one
or more families of genome-wide repeats have undergone a massive proliferation in the
genomes of certain species. This may provide an explanation for the most puzzling aspect of
the C-value paradox, which is not the general increase in genome size that is seen in
increasingly complex organisms, but the fact that similar organisms can differ greatly in
genome size. A good example is provided by Amoeba dubia which, being a protozoan, might
be expected to have a genome of 100-500 Mb, similar to other protozoa such as Tetrahymena
pyriformis. In fact the Amoeba genome is over 200,000 Mb. Similarly, we might guess that
the genomes of crickets are similar in size to those of other insects, but crickets have
genomes of approximately 2000 Mb, 11 times that of the fruit fly.
Nuclear genome:

Figure: Classification of nuclear genome into various categories

The nuclear genome is split into a set of linear DNA molecules, each contained in a
chromosome. No exceptions to this pattern are known: all eukaryotes that have been studied
have at least two chromosomes and the DNA molecules are always linear. The only
variability at this level of eukaryotic genome structure lies with chromosome number, which
appears to be unrelated to the biological features of the organism. For example, yeast has 16
chromosomes, four times as many as the fruit fly. Nor is chromosome number linked to
genome size: some salamanders have genomes 30 times bigger than the human version but
split into half the number of chromosomes.
Packaging of DNA into chromosomes: Chromosomes are much shorter than the DNA
molecules that they contain. A highly organized packaging system is therefore needed to fit a
DNA molecule into its chromosome.
In 1973-74 several groups carried out nuclease protection experiments on chromatin (DNA-
histone complexes) that had been gently extracted from nuclei by methods designed to retain
as much of the chromatin structure as possible. In a nuclease protection experiment the
complex is treated with an enzyme that cuts the DNA at positions that are not 'protected' by
attachment to a protein. The sizes of the resulting DNA fragments indicate the positioning of
the protein complexes on the original DNA molecule. After limited nuclease treatment of
purified chromatin, the bulk of the DNA fragments have lengths of approximately 200 bp and
multiples thereof, suggesting a regular spacing of histone proteins along the DNA.

Figure: Nuclease protection analysis of chromatin from human nuclei. Chromatin is gently
purified from nuclei and treated with a nuclease enzyme. On the left, the nuclease treatment
is carried out under limiting conditions so that the DNA is cut, on average, just once in each
of the linker regions between the bound proteins. After removal of the protein, the DNA
fragments are analyzed by agarose gel electrophoresis and found to be 200 bp in length, or
multiples thereof. On the right, the nuclease treatment proceeds to completion, so all the
DNA in the linker regions is digested. The remaining DNA fragments are all 146 bp in length.
The results show that in this form of chromatin, protein complexes are spaced along the DNA
at regular intervals, one for each 200 bp, with 146 bp of DNA closely attached to each protein
complex.
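The logic of the nuclease protection experiment can be illustrated with a toy simulation. This is a simplified sketch, not a model of real digestion kinetics: it assumes a perfectly regular 200-bp repeat, treats a limited digest as cutting each linker independently with a fixed probability, and treats a complete digest as leaving only the 146 bp protected by each protein complex. The function names and the cut probability are hypothetical.

```python
import random

REPEAT_BP = 200  # spacing of protein complexes along the DNA (from the text)
CORE_BP = 146    # DNA closely attached to each protein complex (from the text)


def limited_digest(n_nucleosomes: int, cut_probability: float, rng: random.Random) -> list[int]:
    """Cut each linker at most once, with the given probability.

    Returns fragment lengths in bp; every fragment spans a whole number of repeats.
    """
    fragments = []
    current = 0
    for _ in range(n_nucleosomes):
        current += REPEAT_BP
        if rng.random() < cut_probability:
            fragments.append(current)
            current = 0
    if current:
        fragments.append(current)  # uncut remainder at the end of the molecule
    return fragments


def complete_digest(n_nucleosomes: int) -> list[int]:
    """All linker DNA is removed, leaving one protected core fragment per complex."""
    return [CORE_BP] * n_nucleosomes


rng = random.Random(0)  # fixed seed so the sketch is reproducible
partial = limited_digest(1000, cut_probability=0.5, rng=rng)
print("Distinct fragment sizes after limited digestion:", sorted(set(partial))[:5], "...")
print("Distinct fragment sizes after complete digestion:", set(complete_digest(1000)))

# With one complex per 200 bp, a 3200 Mb genome would carry about 16 million complexes.
print("Complexes per 3200 Mb genome:", 3200 * 1_000_000 // REPEAT_BP)
```

Under these assumptions the limited digest yields only fragments of 200 bp and its multiples, matching the ladder seen on the gel, whereas the complete digest yields a single 146-bp band.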

DNA in the nucleus exists mainly in combination with histone proteins; the DNA–
histone complex is called “chromatin”. Chromatin can undergo changes in its structure in
response to various cellular metabolic demands. Chromatin can be envisioned as a repeat of
structural units called “nucleosomes”. The nucleosome core particle is composed of histone
octamer plus the DNA that wraps around it. The histone octamer contains two molecules
each of histones H2A, H2B, H3, and H4. DNA wraps around the octamer in a left-handed
supercoil in about 1.75 turns which encloses about 150 bp. Histone H1 is a linker histone
that, along with linker DNA (the DNA in between two nucleosome core particles), physically
connects the adjacent nucleosome core particles. The length of linker DNA varies with
species and cell type. Usually, a nucleosome core particle together with the linker DNA on
both sides of the core encompasses between 180 and 200 bp of DNA. Between the nucleosome
unit structure and the metaphase chromosome structure containing two chromatids, there are
several levels of organization and compaction of the chromatin. Each nucleosome has a
diameter of 10 nm; the nucleosomes are compacted into a solenoid structure called the 30-nm
fiber; the 30-nm fibers are compacted into a 300-nm filament; and finally, the 300-nm
filaments are further compacted into a 700-nm chromosome. During cell division, when the
chromosomes duplicate, a 1,400-nm metaphase chromosome is produced containing two
chromatids, each chromatid being 700 nm.
Figure: From nucleosome to chromosome.

The 30 nm fiber is probably the major type of chromatin in the nucleus during
interphase, the period between nuclear divisions. When the nucleus divides, the DNA adopts
a more compact form of packaging, resulting in the highly condensed metaphase
chromosomes that can be seen with the light microscope and which have the appearance
generally associated with the word 'chromosome'. The metaphase chromosomes form at a
stage in the cell cycle after DNA replication has taken place and so each one contains two
copies of its chromosomal DNA molecule. The two copies are held together at the
centromere, which has a specific position within each chromosome. Individual chromosomes
can therefore be recognized because of their size and the location of the centromere relative
to the two ends. Further distinguishing features are revealed when chromosomes are stained.
There are a number of different staining techniques, each resulting in a banding pattern that is
characteristic for a particular chromosome. This means that the set of chromosomes
possessed by an organism can be represented as a karyogram, in which the banded
appearance of each one is depicted.
An important part of the chromosome is the terminal region or telomere. Telomeres
are important because they mark the ends of chromosomes and therefore enable the cell to
distinguish a real end from an unnatural end caused by chromosome breakage – an essential
requirement because the cell must repair the latter but not the former. Telomeric DNA is
made up of hundreds of copies of a repeated motif, 5′-TTAGGG-3′ in humans, with a short
extension of the 3′ terminus of the double-stranded DNA molecule.
Functional DNA content of genome: This comprises the genes, including both their coding and
non-coding parts, and accounts for about 25% of the nuclear genome. As the earlier comparison
of genome segments from different organisms shows, genes are not arranged in a regular
pattern but are distributed unevenly throughout the genome. Two lines of evidence had already
pointed to this uneven distribution, one of which relates to the banding patterns that are produced when
chromosomes are stained. The dyes used in these procedures bind to DNA molecules, but in
most cases with preferences for certain base pairs. Giemsa, for example, has a greater affinity
for DNA regions that are rich in A and T nucleotides. The dark G-bands in the human
karyogram are therefore thought to be AT-rich regions of the genome. The base composition
of the genome as a whole is 59.7% A + T so the dark G-bands must have AT contents
substantially greater than 60%. Cytogeneticists therefore predicted that there would be fewer
genes in dark G-bands because genes generally have AT contents of 45-50%. This prediction
was confirmed when the draft genome sequence was compared with the human karyogram.
The second line of evidence pointing to uneven gene distribution derived from the isochore
model of genome organization. According to this model, the genomes of vertebrates and
plants (and possibly of other eukaryotes) are mosaics of segments of DNA, each at least 300
kb in length, with each segment having a uniform base composition that differs from that of
the adjacent segments. Support for the isochore model comes from experiments in which
genomic DNA is broken into fragments of approximately 100 kb, treated with dyes that bind
specifically to AT- or GC-rich regions, and the pieces separated by density gradient
centrifugation. When this experiment is carried out with human DNA, five fractions are seen,
each representing a different isochore type with a distinctive base composition: two AT-rich
isochores, called L1 and L2, and three GC-rich classes: H1, H2 and H3. The last of these, H3,
is the least abundant in the human genome, making up only 3% of the total, but contains over
25% of the genes. This is a clear indication that genes are not distributed evenly through the
human genome.
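The idea of a genome as a mosaic of segments with different base compositions can be sketched by computing AT content in consecutive windows along a sequence. This is a minimal illustration only: the toy sequence and the 10-bp window are invented for the example, whereas real isochores are at least 300 kb long, and the function names are hypothetical.

```python
def at_content(seq: str) -> float:
    """Fraction of A and T nucleotides in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def windowed_at_content(seq: str, window: int) -> list[float]:
    """AT content of consecutive, non-overlapping windows of the given size."""
    return [
        at_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


# Invented toy sequence: an AT-rich block followed by a GC-rich block.
toy_sequence = "ATATATTAAT" * 3 + "GCGCGGCCGC" * 3

for index, value in enumerate(windowed_at_content(toy_sequence, window=10)):
    label = "AT-rich" if value > 0.5 else "GC-rich"
    print(f"window {index}: AT content {value:.2f} ({label})")
```

Applied to real genomic sequence with much larger windows, the same calculation would show abrupt shifts in base composition at the boundaries between segments of different composition.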
The genes present in an organism can be classified using two approaches: the first is
based on the function of each gene, and the second on the particular domains of the
protein that a gene codes for. The second approach is more informative because it can
show that a particular genome specifies a number of protein domains that are absent from the
genomes of other organisms, these domains including several involved in activities such as
cell adhesion, electric coupling, and growth of nerve cells. These functions are interesting
because they are ones that we look on as conferring the distinctive features of vertebrates
compared with other types of eukaryote.
Since the earliest days of DNA sequencing it has been known that multigene families -
groups of genes of identical or similar sequence - are common features of many genomes.
The rRNA genes are examples of 'simple' or 'classical' multigene families, in which all the
members have identical or nearly identical sequences. These families are believed to have
arisen by gene duplication, with the sequences of the individual members kept identical by an
evolutionary process. Other multigene families, more common in higher eukaryotes than in
lower eukaryotes, are called 'complex' because the individual members, although similar in
sequence, are sufficiently different for the gene products to have distinctive properties. One
of the best examples of this type of multigene family is the mammalian globin genes. The
globins are the blood proteins that combine to make hemoglobin, each molecule of
hemoglobin being made up of two α-type and two β-type globins. Why are the members of
the globin gene families so different from one another? The answer was revealed when the
expression patterns of the individual genes were studied. It was discovered that the genes are
expressed at different stages in human development: for example, in the β-type cluster ε is
expressed in the early embryo, Gγ and Aγ (whose protein products differ by just one amino
acid) in the fetus, and δ and β in the adult. The different biochemical properties of the
resulting globin proteins are thought to reflect slight changes in the physiological role that
hemoglobin plays during the course of human development.
In some multigene families, the individual members are clustered, as with the globin
genes, but in others the genes are dispersed around the genome. An example of a dispersed
family is the five human genes for aldolase, an enzyme involved in energy generation, which
are located on chromosomes 3, 9, 10, 16 and 17. The important point is that, even though
dispersed, the members of the multigene family have sequence similarities that point to a
common evolutionary origin.
The Repetitive DNA Content of Genomes
Repetitive DNA is found in all organisms, and in some, including humans, it
makes up a substantial fraction of the entire genome. There are various types of repetitive
DNA, and several classification systems have been devised. The scheme that we will use
begins by dividing the repeats into those that are clustered into tandem arrays and those that
are dispersed around the genome.
a) Tandemly repeated DNA: Tandemly repeated DNA is a common feature of eukaryotic
genomes but is found much less frequently in prokaryotes. This type of repeat is also called
satellite DNA because DNA fragments containing tandemly repeated sequences form
'satellite' bands when genomic DNA is fractionated by density gradient centrifugation. The
satellite bands contain fragments of repetitive DNA, and hence have GC contents and
buoyant densities that are atypical of the genome as a whole. The satellite bands in density
gradients of eukaryotic DNA are made up of fragments composed of long series of tandem
repeats, possibly hundreds of kb in length. A single genome can contain several different
types of satellite DNA, each with a different repeat unit, these units being anything from < 5
to > 200 bp. The three satellite bands in human DNA include at least four different repeat
types.
One type of human satellite DNA is the alphoid DNA repeats found in the centromere
regions of chromosomes. Although some satellite DNA is scattered around the genome, most
is located in the centromeres, where it may play a structural role, possibly as binding sites for
one or more of the special centromeric proteins. Alternatively, the repetitive DNA content of
the centromere might be a reflection of the fact that this is the last region of the chromosome
to be replicated. In order to delay its replication until the very end of the cell cycle, the
centromere DNA must lack sequences that can act as origins of replication. The repetitive
nature of centromeric DNA may be a means of ensuring that such origins are absent.
Although not appearing in satellite bands on density gradients, two other types of tandemly
repeated DNA are also classed as 'satellite' DNA. These are minisatellites and
microsatellites. Minisatellites form clusters up to 20 kb in length, with repeat units up to 25
bp; microsatellite clusters are shorter, usually < 150 bp, and the repeat unit is usually 13 bp or
less. We have already met one type of minisatellite: telomeric DNA. In addition to
telomeric minisatellites, some eukaryotic genomes contain various other clusters of
minisatellite DNA, many, although not all, near the ends of chromosomes. The functions of
these other minisatellite sequences have not been identified. The function of microsatellites is
equally mysterious. The typical microsatellite consists of a 1-, 2-, 3- or 4-bp unit repeated
10-20 times, as illustrated by the microsatellites in the human β T-cell receptor locus. Although
each microsatellite is relatively short, there are many of them in the genome. In humans, for
example, microsatellites with a CA repeat make up 0.25% of the genome, 8 Mb in all.
Single base-pair repeats such as (A)15 make up another 0.15%.
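The definition of a microsatellite given above (a 1- to 4-bp unit repeated in tandem about 10 or more times) translates naturally into a pattern search. The following is a minimal sketch using a regular expression with a backreference; the function name, the thresholds and the test sequence are assumptions made for illustration, and real repeat-finding tools handle imperfect repeats that this sketch ignores.

```python
import re


def find_microsatellites(seq: str, max_unit: int = 4, min_repeats: int = 10):
    """Return (start, repeat_unit, repeat_count) for perfect tandem repeats.

    The lazy quantifier makes the regex try the shortest repeat unit first,
    so a run of A's is reported with unit 'A' rather than 'AA'.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Invented test sequence containing a (CA)12 and an (A)15 microsatellite.
sequence = "GGTC" + "CA" * 12 + "TTGACG" + "A" * 15 + "CGT"

for start, unit, count in find_microsatellites(sequence):
    print(f"position {start}: ({unit}){count}")
```

Counting the repeat units at a given locus in different individuals is the measurement that underlies the use of microsatellites as genetic markers.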
Although their function, if any, is unknown, microsatellites have proved very useful to
geneticists. Many microsatellites are variable, meaning that the number of repeat units in the
array is different in different members of a species. This is because 'slippage' sometimes
occurs when a microsatellite is copied during DNA replication, leading to insertion or, less
frequently, deletion of one or more of the repeat units. No two individuals have exactly the
same combination of microsatellite length variants: if enough microsatellites are examined
then a unique genetic profile can be established for every individual. The only exceptions are
genetically identical twins. Genetic profiling is well known as a tool in forensic science, but
identification of criminals is a fairly trivial application of microsatellite variability. More
sophisticated methodology makes use of the fact that a person's genetic profile is inherited
partly from the mother and partly from the father. This means that microsatellites can be used
to establish kinship relationships and population affinities, not only for humans but also for
other animals, and for plants.
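The inheritance rule behind kinship testing can be sketched as follows: at each microsatellite locus a child carries two length variants, one of which must be present in the mother and the other in the father. This is a simplified illustration that ignores new slippage mutations; the locus names and repeat counts are invented, and the function name is hypothetical.

```python
# A profile maps each locus to the pair of repeat counts carried by an individual.
Profile = dict[str, tuple[int, int]]


def consistent_with_parents(child: Profile, mother: Profile, father: Profile) -> bool:
    """True if, at every locus, one child allele could come from each parent."""
    for locus, (allele_a, allele_b) in child.items():
        maternal = set(mother[locus])
        paternal = set(father[locus])
        from_each_parent = (
            (allele_a in maternal and allele_b in paternal)
            or (allele_a in paternal and allele_b in maternal)
        )
        if not from_each_parent:
            return False
    return True


# Invented example profiles (repeat counts at two hypothetical loci).
mother = {"locus1": (12, 14), "locus2": (9, 9)}
father = {"locus1": (10, 15), "locus2": (8, 11)}
child = {"locus1": (14, 10), "locus2": (9, 11)}
unrelated = {"locus1": (11, 13), "locus2": (9, 8)}

print("child consistent with parents:", consistent_with_parents(child, mother, father))
print("unrelated consistent with parents:", consistent_with_parents(unrelated, mother, father))
```

With only two loci a chance match is quite possible; in practice the discriminating power comes from examining many variable microsatellites at once.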
b) Interspersed genome-wide repeats: Tandemly repeated DNA sequences are thought to
have arisen either by replication slippage, as described for microsatellites, or by DNA
recombination processes. Both of these events are likely to result in a series of linked repeats,
rather than individual repeat units scattered around the genome. Interspersed repeats must
therefore have arisen by a different mechanism, one that can result in a copy of a repeat unit
appearing in the genome at a position distant from the location of the original sequence. The
most frequent way in which this occurs is by transposition, and most interspersed repeats
have inherent transpositional activity.
There are two alternative modes of transposition, one that involves an RNA
intermediate and one that does not. The version that involves an RNA intermediate is called
retrotransposition. The basic mechanism involves three steps:
1. An RNA copy of the transposon is synthesized by the normal process of transcription.
2. The RNA transcript is copied into DNA. This conversion of RNA to DNA, the reverse of the
normal transcription process, requires a special enzyme called reverse transcriptase. Often
the reverse transcriptase is coded by a gene within the transposon and is translated from the
RNA copy synthesized in step 1.
3. The DNA copy of the transposon integrates into the genome, possibly back into the same
chromosome occupied by the original unit, or possibly into a different chromosome.
The end result is that there are now two copies of the transposon, at different points in the
genome.
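The three steps of retrotransposition can be mimicked with simple string operations. This is a toy sketch under strong simplifications: the genome is a single string, the RNA copy is written as the sense strand with U in place of T, and the insertion site is chosen at random outside the original element. The function names and the example sequence are invented for illustration.

```python
import random


def transcribe(dna: str) -> str:
    """Step 1: make an RNA copy of the element (sense strand, T replaced by U)."""
    return dna.replace("T", "U")


def reverse_transcribe(rna: str) -> str:
    """Step 2: copy the RNA back into DNA (U replaced by T)."""
    return rna.replace("U", "T")


def retrotranspose(genome: str, start: int, end: int, rng: random.Random) -> str:
    """Step 3: integrate a DNA copy of genome[start:end] at another position.

    The original element stays in place, so the genome gains one extra copy.
    """
    dna_copy = reverse_transcribe(transcribe(genome[start:end]))
    # Allowed insertion sites: anywhere that does not interrupt the original element.
    sites = [i for i in range(len(genome) + 1) if i <= start or i >= end]
    site = rng.choice(sites)
    return genome[:site] + dna_copy + genome[site:]


# Invented toy genome with one element ("GATTACA") flanked by filler sequence.
genome = "AAAA" + "GATTACA" + "CCCC"
new_genome = retrotranspose(genome, start=4, end=11, rng=random.Random(1))

print("before:", genome)
print("after: ", new_genome)
print("copies of element:", new_genome.count("GATTACA"))
```

Because the original element is never excised, each round of this process increases the copy number, which is how such elements can come to make up a large fraction of a genome.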
RNA transposons or retroelements are features of eukaryotic genomes but have not so far
been discovered in prokaryotes.
Endogenous retroviruses (ERVs) are retroviral genomes integrated into vertebrate
chromosomes. Some are still active and might, at some stage in a cell's lifetime, direct
synthesis of exogenous viruses, but most are decayed relics that no longer have the capacity
to form viruses. These inactive sequences are genomewide repeats but they are not capable of
additional proliferation.
Retrotransposons have sequences similar to ERVs but are features of nonvertebrate
eukaryotic genomes (i.e. plants, fungi, invertebrates and microbial eukaryotes) rather than
vertebrates. Retrotransposons have very high copy numbers in some genomes, with many
different types present. There are two types of retrotransposon: the Ty3/gypsy family (Ty3 and
gypsy are examples of this class in yeast and fruit fly, respectively), whose members possess
the same set of genes as an ERV, and the Ty1/copia family, members of which lack the env
gene. Both types are able to transpose but the absence of the env gene means that the
Ty1/copia group cannot form infectious virus particles. In fact, despite the presence of env in
the Ty3/gypsy genome, it has only recently been recognized that some of these elements can
form viruses and hence should be looked upon as non-vertebrate retroviruses. Although
technically they are interspersed elements, retrotransposons are sometimes found in clusters
in a genome sequence as a result of the presence of preferred integration sites for transposing
elements.
The three types of retroelement described so far are LTR elements, as they have long terminal
repeats at either end which play a role in the transposition process. Other retroelements do not
have LTRs. These are called retroposons and in mammals include the following:
• LINEs (long interspersed nuclear elements) contain a reverse-transcriptase-like gene
probably involved in the retrotransposition process. An example is the human element
LINE-1, which is 6.1 kb and has a copy number of 516,000 in the human genome. A LINE contains
a pol II promoter and two open reading frames (ORFs), one encoding the endonuclease and
the other encoding the reverse transcriptase. LINE activity proceeds as follows: RNA pol II
transcribes the LINE DNA into LINE RNA; the LINE RNA is translated into proteins; the
proteins and RNA join together and reenter the nucleus; the endonuclease cuts a strand of the
target genomic DNA, often in the intron of a gene; the reverse transcriptase copies the LINE
RNA into LINE DNA, which is inserted into the target DNA, forming a new LINE element
there. Three distantly related LINE families are found in the human genome: LINE-1, LINE-2,
and LINE-3. Only LINE-1 (L1) is still active.
• SINEs (short interspersed nuclear elements) do not have a reverse transcriptase gene but
can still transpose, probably by 'borrowing' reverse transcriptase enzymes that have been
synthesized by other retroelements. SINEs are short sequences (about 100–400 bp) and they
contain an internal pol III promoter but do not encode any proteins. All currently known
SINEs are derived from tRNA and 7SL RNA genes. Most nonautonomous SINEs share the 3′
end with a resident LINE. The only active SINE in the human genome is the Alu element,
which is the major SINE constituting about 11% of the genome (~1 million Alu elements).
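The figures quoted for Alu can be checked against one another with a short calculation. This sketch assumes only the numbers given in the text and table (about 1 million copies, about 11% of a 3200 Mb genome, SINE lengths of about 100-400 bp); the function name is illustrative.

```python
GENOME_BP = 3200 * 1_000_000  # human genome size from the table (3200 Mb)
ALU_COPIES = 1_000_000        # approximate copy number quoted in the text
ALU_FRACTION = 0.11           # approximate genome fraction quoted in the text


def genome_fraction(copies: float, mean_length_bp: float, genome_bp: float = GENOME_BP) -> float:
    """Fraction of the genome occupied by a repeat family."""
    return copies * mean_length_bp / genome_bp


# Mean element length implied by the quoted copy number and genome fraction.
implied_mean_length_bp = ALU_FRACTION * GENOME_BP / ALU_COPIES

print(f"Implied mean Alu length: {implied_mean_length_bp:.0f} bp")
print(f"Fraction for 1 million copies of 300 bp: {genome_fraction(ALU_COPIES, 300):.3f}")
```

The implied mean length of roughly 350 bp falls within the 100-400 bp range given for SINEs, so the copy number and genome fraction quoted above are mutually consistent.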
Not all transposons require an RNA intermediate. Many are able to transpose in a
more direct DNA to DNA manner. In eukaryotes, DNA transposons are less common than
retrotransposons, but they have a special place in genetics because a family of plant DNA
transposons - the Ac/Ds elements of maize - were the first transposable elements to be
discovered, by Barbara McClintock in the 1950s. DNA transposons are a much more
important component of prokaryotic genome anatomies than the RNA transposons. The
insertion sequences, IS1 and IS186, are examples of DNA transposons, and a single E. coli
genome may contain as many as 20 of these of various types. Other kinds of DNA transposon
known in E. coli, and fairly typical of prokaryotes in general, include composite
transposons and Tn3-type transposons.

Dr Subhash Jakhesara