GENOME ORGANIZATION
The human haploid genome consists of about 3 × 10^9 base pairs of DNA. Genomic
DNA exists as single linear pieces of DNA that are associated with proteins to
form a nucleoprotein complex called chromatin. The DNA-protein complex is the
basis for the formation of chromosomes. Virtually all of the genomic DNA is
distributed among the 23 chromosomes that reside in the cell nucleus. A very
small fraction of the genome is also found in a circular piece of DNA of about
16,000 base pairs that is located in the mitochondria. Before cell division, the
double-helical DNA of the chromatin is replicated and the chromatin fiber
condenses into discrete bodies, the chromosomes, each consisting
of two identical chromatids. The two sister chromatids separate, one moving to
each pole of the cell, where they become part of the newly formed nucleus of each
daughter cell. The cells that make up most of the body of a multicellular organism,
the somatic cells, have two copies of each chromosome and are said to be diploid
(2n). Egg and sperm cells, which are produced by meiosis and have only one copy
of each chromosome, are haploid (n). The DNA of chromatin and chromosomes is
bound tightly to a family of positively charged proteins, the histones, which
associate strongly with the many negatively charged phosphate groups in DNA.
The histones and DNA associate in complexes called nucleosomes, in which the
DNA strand winds around a core of histone molecules.
Functional Elements and Distribution of DNA within the Genome
The major function of genomic DNA is to carry and store genetic information that
is expressed as RNA and then as functional proteins. For gene expression to
occur correctly, regulatory elements must be present in the genome, and the
genome must be faithfully replicated and segregated between daughter cells.
Based on studies with unicellular eukaryotes (yeast) at least three types of DNA
elements are required for replication and stable inheritance of chromosomes:
autonomously replicating sequences (ARS), centromeres and telomeres.
Autonomously Replicating Sequences (ARS) are the sites at which DNA
replication is initiated on the chromosomes. Centromeres are DNA sequences that
are required for segregation of replicated chromosomes to daughter cells.
Telomeres (see the "DNA Synthesis" lecture) are the tips of chromosomes and are
recognized by the enzyme telomerase. The DNA sequences of telomeres have been
determined in several organisms and consist of numerous repeats of a 6 to 8 base
long sequence, for example [TTGGGG]n. Yeast artificial chromosomes (YACs) can be
constructed by combining large segments of human DNA (50,000 base pairs or
longer) with a selectable marker and the three essential elements described above.
These artificial chromosomes can then be propagated and amplified in yeast cells.
This technology has been used in the sequencing of the human genome.
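The [TTGGGG]n notation above describes a short unit repeated back to back. As a
minimal sketch of what that means in sequence terms, the Python snippet below
looks for the longest uninterrupted run of a repeat unit in a DNA string. The
function name and the toy sequence are hypothetical and chosen only for
illustration; real telomeric repeats are imperfect and far longer.

```python
import re

def longest_tandem_run(sequence: str, unit: str) -> tuple[int, int]:
    """Return (start_index, copy_count) for the longest run of back-to-back
    copies of `unit` in `sequence`, or (-1, 0) if the unit never occurs."""
    best_start, best_copies = -1, 0
    # (?:UNIT)+ matches one or more consecutive copies of the repeat unit.
    pattern = re.compile(f"(?:{re.escape(unit.upper())})+")
    for match in pattern.finditer(sequence.upper()):
        copies = len(match.group(0)) // len(unit)
        if copies > best_copies:
            best_start, best_copies = match.start(), copies
    return best_start, best_copies

# Toy chromosome end: six unrelated bases, five full repeats, one partial repeat.
toy_chromosome_end = "ACGTAC" + "TTGGGG" * 5 + "TTG"
start, copies = longest_tandem_run(toy_chromosome_end, "TTGGGG")
print(f"run starts at index {start} and contains {copies} copies")
```

The partial copy at the very end is ignored because only complete repeat units
are counted.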
Unique Sequences
Greater than 50% of the eukaryotic genome consists of DNA that is unique in
sequence, and the human genome encodes about 100,000 proteins. The coding
portions of an average gene (the exons) consist of about 2,000 base pairs of DNA
that is unique in sequence. Together the exons represent less than 7% of the
total DNA of the human genome and less than 14% of the DNA that is unique. Most of
the coding sequences are interrupted by 1 to 50 noncoding sequences, or
introns. The total length of the introns that interrupt a gene generally far exceeds
the total length of the exons. Since sequences that regulate gene expression also
account for some of the unique sequences, the actual amount of DNA coding for
functional gene products is probably less than 3% of the total genomic DNA. The
spatial distribution of genes, exons, introns and regulatory sequences along each
chromosome is shown below.
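The percentages quoted in this section follow from simple arithmetic on the
figures given in the text. The sketch below reproduces that arithmetic in
Python. It uses only the numbers stated above and takes "greater than 50%"
unique DNA at its lower bound, which is an assumption made here for the
calculation; the function name is hypothetical.

```python
GENOME_BP = 3e9            # haploid genome size quoted in the text
GENE_COUNT = 100_000       # number of encoded proteins quoted in the text
EXON_BP_PER_GENE = 2_000   # average coding length per gene quoted in the text
UNIQUE_FRACTION = 0.50     # assumption: "greater than 50%" taken as exactly 50%

def coding_fractions(genome_bp, gene_count, exon_bp_per_gene, unique_fraction):
    """Return coding base pairs, their share of all DNA, and of unique DNA."""
    coding_bp = gene_count * exon_bp_per_gene
    share_of_total = coding_bp / genome_bp
    share_of_unique = coding_bp / (genome_bp * unique_fraction)
    return coding_bp, share_of_total, share_of_unique

coding_bp, of_total, of_unique = coding_fractions(
    GENOME_BP, GENE_COUNT, EXON_BP_PER_GENE, UNIQUE_FRACTION
)
print(f"coding DNA: {coding_bp:.1e} bp")
print(f"share of total genome: {of_total:.1%}")
print(f"share of unique DNA:   {of_unique:.1%}")
```

With these inputs the coding share comes to roughly 6.7% of the total genome
and roughly 13.3% of the unique DNA, consistent with the "less than 7%" and
"less than 14%" statements.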
Repetitive Sequences
There are multiple classes of repetitive DNA, including highly
repetitive and moderately repetitive DNA. The function of repetitive DNA is not
well understood, but approximately 30% of the human genome consists of repetitive
DNA.
Highly repetitive DNA consists of several different sets of short repeated
polynucleotides. The repeats generally range from 5 to 500 base pairs in length and
exist in tandem arrays. Highly repetitive DNA comprises about 10-15% of the total
genomic DNA, is present in over a million copies and is transcriptionally inactive.
Some of the highly repetitive DNA is clustered in structural regions of
chromosomes, particularly in the centromeric and telomeric regions.
Most types of moderately repetitive DNA are short (about 300 base pairs in length),
are interspersed with unique sequences, and are often transcribed but do not code
for a gene product.
Chromosomal Structure
Chromatin contains two classes of protein: histones and nonhistone proteins. The
overall purpose of histones is to condense the DNA, whereas many nonhistone
proteins are involved in transcription, DNA replication and maintenance of
chromatin structure.
Histones are the most abundant proteins found in chromatin. There are five major
types: H1, H2A, H2B, H3 and H4. The histones are small basic proteins rich in
lysine (Lys) and arginine (Arg). The positive charge (basicity) of the histones
allows the negatively charged DNA to "wrap" around them, forming a nucleosome.
The assembly of the nucleosome requires the nonhistone proteins N1, which binds to
a tetramer of H3 and H4, and nucleoplasmin, which binds to dimers of H2A and
H2B. The resulting (H3)2(H4)2 tetramer and H2A-H2B dimers associate with the
DNA while N1 and nucleoplasmin are released and recycled. H1 then adds to the
structure, forming a chromatosome.
Chromatin Dynamics
The higher order structure of chromatin varies and is determined by factors such as
tissue type, sex and the developmental state of the cell. If chromosomes are stained
with a dye and then analyzed microscopically, numerous dark bands are seen. The
dark bands correspond to the highly condensed and transcriptionally inactive
heterochromatin. Heterochromatin is generally found at or near the centromere and
telomeres and consists of highly repetitive DNA. The lighter bands are the less
condensed, transcriptionally active euchromatin.
Proteomics is the study of the entire set of proteins produced by a cell type in order
to understand its structure and function.
Key Terms
proteomics: the branch of molecular biology that studies the set of proteins
expressed by the genome of an organism
proteome: the complete set of proteins encoded by a particular genome
genomics: the study of the complete genome of an organism
Proteomics is a relatively recent field; the term was coined in 1994, while the
science itself had its origins in electrophoresis techniques of the 1970s and 1980s.
The study of proteins, however, has been a scientific focus for a much longer time.
Studying proteins generates insight into how they affect cell processes.
Conversely, this study also investigates how proteins themselves are affected by
cell processes or the external environment. Proteins provide intricate control of
cellular machinery; they are, in many cases, components of that same machinery.
They serve a variety of functions within the cell; there are thousands of distinct
proteins and peptides in almost every organism. The goal of proteomics is to
analyze the varying proteomes of an organism at different times in order to
highlight differences between them. Put more simply, proteomics analyzes the
structure and function of biological systems. For example, the protein content of a
cancerous cell is often different from that of a healthy cell. Certain proteins in the
cancerous cell may not be present in the healthy cell, making these unique proteins
good targets for anti-cancer drugs. The realization of this goal is difficult; both
purification and identification of proteins in any organism can be hindered by a
multitude of biological and environmental factors.
The basic techniques used to analyze proteins are mass spectrometry, x-ray
crystallography, NMR, and protein microarrays.
Key Terms
X-ray crystallography: a technique in which X-rays diffracted by the electrons of
atoms in a crystal are recorded on a detector.
Another protein imaging technique, nuclear magnetic resonance (NMR), uses the
magnetic properties of atoms to determine the three-dimensional structure of
proteins. NMR spectroscopy is unique in being able to reveal the atomic structure
of macromolecules in solution, provided that highly concentrated solutions can be
obtained. This technique depends on the fact that certain atomic nuclei are
intrinsically magnetic. The chemical shift of nuclei depends on their local
environment. The spins of neighboring nuclei interact with each other in ways that
provide definitive structural information that can be used to determine complete
three-dimensional structures of proteins.
Protein microarrays have also been used to study interactions between proteins.
These are large-scale adaptations of the basic two-hybrid screen. The premise
behind the two-hybrid screen is that most eukaryotic transcription factors have
modular activating and binding domains that can still activate transcription even
when split into two separate fragments, as long as the fragments are brought within
close proximity to each other. Generally, the transcription factor is split into a
DNA-binding domain (BD) and an activation domain (AD). One protein of interest
is genetically fused to the BD and another protein is fused to the AD. If the two
proteins of interest bind each other, then the BD and AD will also come together
and activate a reporter gene that signals interaction of the two hybrid proteins.
Cancer Proteomics
Proteomics, the analysis of proteins, plays a prominent role in the study and
treatment of cancer.
LEARNING OBJECTIVES
Explain the ways in which cancer proteomics may lead to better treatments
Genomes and proteomes of patients suffering from specific diseases are being
studied to understand the genetic basis of diseases. The most prominent set of
diseases being studied with proteomic approaches is cancer. Proteomic approaches
are being used to improve screening and early detection of cancer, which is
achieved by identifying proteins whose expression is affected by the disease
process.
Eavesdropping Methods:
Some of the more sophisticated bugs have a "burst" transmission. A device about
the size of a fingernail can record several hours of ordinary conversation and then
transmit it to a remote receiver in a burst that lasts only two seconds. An hour of
speech can be stored on a single chip. This is a passive system that records
information but emits signals only when interrogated.1 This makes detection very
difficult. Of course, some countermeasures systems are designed to try to activate
such systems so they can be detected.
The type of bug installed in a home or office setting depends in part upon the
length of time and the circumstances, if any, under which the installer has physical
access to the site.
A visitor seated in front of your desk may bend down to pick up a dropped pen,
using the few seconds when his hand is out of your sight to stick a bug under his
chair or under your desk. Or he may "forget" and leave behind a workable pen that
has a concealed microphone and transmitter. Any gift intended to be kept on your
desk or elsewhere in the open in your office is a potential concealment device for a
bug.
More than half of all eavesdropping attacks on U.S. offices, both foreign and
domestic, have exploited the common telephone.2 Telephones offer a variety of
eavesdropping options, as the telephone instrument has electrical power, a built-in
microphone, a speaker that can serve dual purposes, and ample room for hiding
bugs or taps.
Computers are similar to telephones, in that they have the essential parts for a
sophisticated surveillance system -- a microphone and a means of communicating
information outside the area in which they are located. Computers are vulnerable to
several types of eavesdropping operations. For example, a bug in your keyboard
could transmit every keystroke so that everything you write can be reproduced.
Even public areas are not immune to technical surveillance. Whenever your
presence in a public area is known or predictable in advance, an adversary or
competitor has time to plan the best way to exploit that knowledge.
Individuals who habitually frequent the same restaurant or café and hold sensitive
conversations over lunch or dinner are also vulnerable, especially if they usually sit
at the same table or the restaurant manager cooperates with the eavesdropper. A
short-term bug can simply be attached to the underside of the table. Longer term,
one could build the bug into the table or into a vase or other item on the table.
Although probably very rare, at least one highly-competitive, high-class restaurant
is known to have bugged its own tables to obtain unfiltered feedback on customer
reactions to the service and food.
https://books.google.mw/books?id=xYmcAQAAQBAJ&pg=PA64&lpg=PA64&dq=eavesdropping+on+transmission+of+genetic+information+in+bioinformatics&source=bl&ots=7h8gF7NYmH&sig=ACfU3U1zN-gK-JAvZmja2WFHwmkhDXfSZQ&hl=en&sa=X&ved=2ahUKEwi2gtqZs6bvAhXUURUIHaNVADEQ6AEwA3oECBYQAw#v=onepage&q&f=false
Genomes of prokaryotes:
Key Points
The genome of prokaryotic organisms generally is a circular, double-
stranded piece of DNA, multiple copies of which may exist at any time.
The length of a genome varies widely, but is generally at least a few million
base pairs.
A genophore is the DNA of a prokaryote. It is commonly referred to as a
prokaryotic chromosome.
Supercoiling
Key Terms
supercoiling: The coiling of the DNA helix upon itself; can cause disruption
to transcription and lead to cell death.
DNA: A biopolymer of deoxyribonucleotides (a type of nucleic acid) that
contains four different chemical groups, called bases: adenine, guanine, cytosine,
and thymine.
chromosome: A structure in the cell nucleus that contains DNA, histone
protein, and other structural proteins.
In a “relaxed” double-helical segment of B-DNA, the two strands twist around the
helical axis once every 10.4 to 10.5 base pairs of sequence. Adding or subtracting
twists, as some enzymes can do, imposes strain. If a DNA segment under twist
strain were closed into a circle by joining its two ends and then allowed to move
freely, the circular DNA would contort into a new shape, such as a simple figure-
eight. Such a contortion is a supercoil.
The simple figure eight is the simplest supercoil, and is the shape a circular DNA
assumes to accommodate one too many or one too few helical twists. The two
lobes of the figure eight will appear rotated either clockwise or counterclockwise
with respect to one another, depending on whether the helix is over or
underwound. For each additional helical twist being accommodated, the lobes will
show one more rotation about their axis.
The noun form “supercoil” is rarely used in the context of DNA topology. Instead,
global contortions of a circular DNA, such as the rotation of the figure-eight lobes
above, are referred to as writhe. The above example illustrates that twist and writhe
are interconvertible. “Supercoiling” is an abstract mathematical property
representing the sum of twist and writhe. The twist is the number of helical turns in
the DNA and the writhe is the number of times the double helix crosses over on
itself (these are the supercoils).
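The relationship between twist and writhe can be made concrete with a little
bookkeeping. The sketch below assumes a closed circular DNA in which the sum of
twist and writhe is fixed once the ends are joined (in DNA topology this sum is
usually called the linking number, a term not used in the text). It also assumes,
purely for illustration, that the helix relaxes fully to its preferred twist so
that all of the strain appears as writhe; real molecules share the strain between
the two. The function names and the 5,250 base pair example are hypothetical.

```python
BP_PER_TURN_RELAXED = 10.5  # the text gives 10.4 to 10.5 bp per helical turn

def relaxed_twist(length_bp: int, bp_per_turn: float = BP_PER_TURN_RELAXED) -> float:
    """Number of helical turns in a strain-free segment of the given length."""
    return length_bp / bp_per_turn

def partition_strain(length_bp: int, extra_turns: float) -> dict:
    """Bookkeeping for a circle closed with `extra_turns` added (positive)
    or removed (negative) relative to the relaxed state.

    Simplifying assumption: twist returns to its relaxed value, so the
    whole difference is taken up as writhe.
    """
    preferred_twist = relaxed_twist(length_bp)
    total = preferred_twist + extra_turns   # fixed after the ends are joined
    twist = preferred_twist                 # assumption: twist fully relaxes
    writhe = total - twist
    return {"total": total, "twist": twist, "writhe": writhe}

# A 5,250 bp circle closed after removing two helical turns (underwound).
state = partition_strain(5250, extra_turns=-2)
kind = "negative" if state["writhe"] < 0 else "positive"
print(state, f"-> {kind} supercoiling")
```

In this toy case the circle has 500 preferred turns, and the two missing turns
show up as a writhe of -2, which corresponds to negative supercoiling.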
Extra helical twists are positive and lead to positive supercoiling, while subtractive
twisting causes negative supercoiling. Many topoisomerase enzymes sense
supercoiling and either generate or dissipate it as they change DNA topology.
DNA of most organisms is negatively supercoiled.
In part because chromosomes may be very large, segments in the middle may act
as if their ends are anchored. As a result, they may be unable to distribute excess
twist to the rest of the chromosome or to absorb twist to recover from
underwinding—the segments may become supercoiled, in other words. In response
to supercoiling, they will assume an amount of writhe, just as if their ends were
joined.
DNA supercoiling is important for DNA packaging within all cells. Because the
length of DNA can be thousands of times that of a cell, packaging this genetic
material into the cell or nucleus (in eukaryotes) is a difficult feat. Supercoiling of
DNA reduces the space and allows for much more DNA to be packaged. In
prokaryotes, plectonemic supercoils are predominant, because of the circular
chromosome and relatively small amount of genetic material. In eukaryotes, DNA
supercoiling exists on many levels of both plectonemic and solenoidal supercoils,
with the solenoidal supercoiling proving the most effective in compacting the
DNA. Solenoidal supercoiling is achieved with histones to form a 10 nm fiber.
This fiber is further coiled into a 30 nm fiber, and further coiled upon itself
numerous times more.
DNA packaging is greatly increased during nuclear division events such as mitosis
or meiosis, where DNA must be compacted and segregated to daughter cells.
Condensins and cohesins are structural maintenance of chromosomes (SMC)
proteins that aid in the condensation of sister chromatids and the linkage of the
centromere in sister chromatids. These SMC proteins induce positive supercoils.
Supercoiling is also required for DNA and RNA synthesis. Because DNA must be
unwound for DNA and RNA polymerase action, supercoils will result. The region
ahead of the polymerase complex will be unwound; this stress is compensated with
positive supercoils ahead of the complex. Behind the complex, DNA is rewound
and there will be compensatory negative supercoils. It is important to note that
topoisomerases such as DNA gyrase (Type II Topoisomerase) play a role in
relieving some of the stress during DNA and RNA synthesis.
An open reading frame (ORF) is the part of a reading frame that contains no stop
codons; the number and content of ORFs vary among bacterial genomes.
Key Points
Open reading frames are used as one piece of evidence to assist in gene
prediction.
If a portion of a genome has been sequenced, ORFs can be located by
examining each of the three possible reading frames on each strand.
Bacterial genomes display variation in size, even among strains of the same
species.
In molecular genetics, an open reading frame (ORF) is the part of a reading frame
that contains no stop codons. The transcription termination pause site is located
after the ORF, beyond the translation stop codon, because if transcription were to
cease before the stop codon, an incomplete protein would be made during
translation.
Normally, insertions that interrupt the reading frame downstream of
the start codon cause a frameshift mutation and shift the positions
at which stop codons are read.
Open reading frames are used as one piece of evidence to assist in gene prediction.
Long ORFs are often used, along with other evidence, to initially identify
candidate protein coding regions in a DNA sequence. The presence of an ORF
does not necessarily mean that the region is ever translated. For example, in a
randomly generated DNA sequence with an equal percentage of each nucleotide, a
stop-codon would be expected once every 21 codons. A simple gene prediction
algorithm for prokaryotes might look for a start codon followed by an open reading
frame that is long enough to encode a typical protein, where the codon usage of
that region matches the frequency characteristic for the given organism's coding
regions. Even a long open reading frame by itself is not conclusive evidence for the
presence of a gene.
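Two ideas from this paragraph can be sketched in a few lines of Python: the
expected spacing of stop codons in random sequence, and a simple scan for a start
codon followed by a sufficiently long open reading frame. This is only an
illustrative sketch. It assumes ATG as the sole start codon, scans the given
strand only, and leaves out the codon-usage check described above; the function
names and the `min_codons` default are hypothetical.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: only the canonical start codon is considered

def expected_codons_between_stops() -> float:
    """With equal base frequencies, 3 of the 64 codons are stops."""
    all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
    p_stop = len(STOP_CODONS) / len(all_codons)
    return 1 / p_stop

def find_orfs(dna: str, min_codons: int = 100):
    """Yield (start, end, frame) for each ATG...stop stretch on the given
    strand that has at least `min_codons` codons before the stop codon.
    `end` is the index just past the stop codon; frame is 1, 2 or 3."""
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield (start, i + 3, frame + 1)
                start = None

print(f"one stop expected about every {expected_codons_between_stops():.1f} codons")
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(list(find_orfs(toy, min_codons=5)))
```

The first function returns 64/3, or about 21.3, which matches the "once every 21
codons" figure. A run of several hundred codons without a stop is therefore
unlikely to arise by chance, which is why long ORFs are treated as candidate
genes while still not being conclusive on their own.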
Open Reading Frames: Frame +1 is the ORF predicted in the database to encode a
protein. +2 and +3 are the other two potential ORFs in the same strand and -1, -2,
and -3 are the three potential ORFs in the antisense strand.
Possible stop codons in DNA are “TGA”, “TAA”, and “TAG”. Thus, the last
reading frame in this example contains a stop codon (TAA), unlike the first two.
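The six reading frames mentioned above (+1 to +3 on one strand, -1 to -3 on the
antisense strand) can be enumerated directly. The following Python sketch reports
where stop codons fall in each frame of a short sequence. The function names and
the toy sequence are hypothetical, and the sketch assumes the input contains only
the bases A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna: str) -> str:
    """Antisense strand written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def stop_positions_by_frame(dna: str) -> dict:
    """Map each frame label (+1, +2, +3, -1, -2, -3) to the list of codon
    indices (0-based, within that frame) at which a stop codon occurs."""
    frames = {}
    for sign, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = [
                n for n, codon in enumerate(codons) if codon in STOP_CODONS
            ]
    return frames

toy = "ATGGCTTAACCC"
for label, stops in stop_positions_by_frame(toy).items():
    status = f"stop at codon index {stops}" if stops else "open (no stop codon)"
    print(label, status)
```

A frame with an empty list is open across the whole sequence. In longer sequences,
each frame would be split at its stop codons to recover the individual ORFs.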
Bacterial genomes display variation in size, even among strains of the same
species. Because these microorganisms have very little noncoding or repetitive DNA,
the variation in their genome size usually reflects differences in gene repertoire.
Some species, particularly bacterial parasites and symbionts, have undergone
massive genome reduction and simply contain a subset of the genes present in their
ancestors.
However, in free-living bacteria, such gene loss cannot explain the observed
disparities in genome size because ancestral genomes would have had to contain
improbably large numbers of genes. Surprisingly, a substantial fraction of the
difference in gene contents in free-living bacteria is due to the presence of
ORFans, that is, open reading frames (ORFs) that have no known homologs and
are consequently of no known function.
The high numbers of ORFans in bacterial genomes indicate that, with the
exception of those species with highly reduced genomes, much of the observed
diversity in gene inventories does not result from either the loss of ancestral genes
or the transfer from well-characterized organisms (processes that result in a patchy
distribution of orthologs but not in unique genes) or from recent duplications
(which would likely yield homologs within the same or closely related genome).
In order to study how normal cellular activities are altered in different disease
states, the biological data must be combined to form a comprehensive picture of
these activities. Therefore, the field of bioinformatics has evolved such that the
most pressing task now involves the analysis and interpretation of various types of
data. This includes nucleotide and amino acid sequences, protein domains and
protein structures. The actual process of analyzing and interpreting data is referred
to as computational biology. Important sub-disciplines within bioinformatics and
computational biology include:
Figure 4.1
Genome size. The range of sizes of the genomes of representative groups of
organisms is shown on a logarithmic scale.
This apparent paradox was resolved by the discovery that the genomes of
most eukaryotic cells contain not only functional genes but also large amounts
of DNA sequences that do not code for proteins. The difference in the sizes of the
salamander and human genomes thus reflects larger amounts of non-coding DNA,
rather than more genes, in the genome of the salamander. The presence of large
amounts of noncoding sequences is a general property of the genomes of complex
eukaryotes. Thus, the thousandfold greater size of the human genome compared to
that of E. coli is not due solely to a larger number of human genes. The human
genome is thought to contain approximately 100,000 genes—only about 25 times
more than E. coli has. Much of the complexity of eukaryotic genomes thus results
from the abundance of several different types of noncoding sequences, which
constitute most of the DNA of higher eukaryotic cells.
Figure 4.2
The structure of eukaryotic genes. Most eukaryotic genes contain segments of
coding sequences (exons) interrupted by noncoding sequences (introns). Both
exons and introns are transcribed to yield a long primary RNA transcript. The
introns are then removed ...
Introns were first discovered in 1977, independently in the laboratories of Phillip
Sharp and Richard Roberts, during studies of the replication of adenovirus in
cultured human cells. Adenovirus is a useful model for studies of gene expression,
both because the viral genome is only about 3.5 × 10^4 base pairs long and because
adenovirus mRNAs are produced at high levels in infected cells. One approach
used to characterize the adenovirus mRNAs was to determine the locations of the
corresponding viral genes by examination of RNA-DNA hybrids in the electron
microscope. Because RNA-DNA hybrids are distinguishable from single-stranded
DNA, the positions of RNA transcripts on a DNA molecule can be determined.
Surprisingly, such experiments revealed that adenovirus mRNAs do not hybridize
to only a single region of viral DNA (Figure 4.3). Instead, a single mRNA
molecule hybridizes to several separated regions of the viral genome. Thus, the
adenovirus mRNA does not correspond to an uninterrupted transcript of the
template DNA; rather the mRNA is assembled from several distinct blocks of
sequences that originated from different parts of the viral DNA. This was
subsequently shown to occur by RNA splicing, which will be discussed in detail in
Chapter 6.
Figure 4.3
Identification of introns in adenovirus mRNA. (A) The gene encoding the
adenovirus hexon (a major structural protein of the viral particle) consists of four
exons, interrupted by three introns. (B) This tracing illustrates an electron
micrograph of a ...
Soon after the discovery of introns in adenovirus, similar observations were made
on cloned genes of eukaryotic cells. For example, electron microscopic analysis
of RNA-DNA hybrids and subsequent nucleotide sequencing of cloned genomic
DNAs and cDNAs indicated that the coding region of the mouse β-
globin gene (which encodes the β subunit of hemoglobin) is interrupted by two
introns that are removed from the mRNA by splicing (Figure 4.4). The intron-
exon structure of many eukaryotic genes is quite complicated, and the amount of
DNA in the intron sequences is often greater than that in the exons. The chicken
ovalbumin gene, for example, contains eight exons and seven introns distributed
over approximately 7700 base pairs (7.7 kilobases, or kb) of genomic DNA. The
exons total only about 1.9 kb, so approximately 75% of the gene consists of
introns. An extreme example is the human gene that encodes the blood clotting
protein factor VIII. This gene spans approximately 186 kb of DNA and is divided
into 26 exons. The mRNA is only about 9 kb long, so the gene contains introns
totaling more than 175 kb. On average, introns are estimated to account for about
ten times more DNA than exons in the genes of higher eukaryotes.
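The intron proportions quoted for the ovalbumin and factor VIII genes can be
checked with a short calculation. The sketch below uses only the lengths given in
the text and treats the factor VIII mRNA length as the total exon length, which is
the same simplification the text makes. The function name is hypothetical.

```python
def intron_share(gene_kb: float, exon_kb: float) -> tuple[float, float]:
    """Return (intron length in kb, intron fraction of the gene)."""
    intron_kb = gene_kb - exon_kb
    return intron_kb, intron_kb / gene_kb

examples = {
    "chicken ovalbumin": (7.7, 1.9),   # gene span and exon total from the text
    "human factor VIII": (186.0, 9.0), # gene span and mRNA length from the text
}
for name, (gene_kb, exon_kb) in examples.items():
    intron_kb, fraction = intron_share(gene_kb, exon_kb)
    print(f"{name}: {intron_kb:.1f} kb of introns, {fraction:.0%} of the gene")
```

The ovalbumin gene works out to about 5.8 kb of introns, or roughly 75% of the
gene, and factor VIII to about 177 kb, which agrees with "more than 175 kb".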
Figure 4.4
The mouse β-globin gene. This gene contains two introns, which divide the coding
region among three exons. Exon 1 encodes amino acids 1 to 30, exon 2 encodes
amino acids 31 to 104, and exon 3 encodes amino acids 105 to 146. Exons 1 and 3
also ...
Introns are present in most genes of complex eukaryotes, although they are not
universal. Almost all histone genes, for example, lack introns, so introns are clearly
not required for gene function in eukaryotic cells. In addition, introns are not found
in most genes of simple eukaryotes, such as yeasts. Conversely, introns are present
in rare genes of prokaryotes. The presence or absence of introns is therefore not an
absolute distinction between prokaryotic and eukaryotic genes, although introns
are much more prevalent in higher eukaryotes (both plants and animals), where
they account for a substantial amount of total genomic DNA.
Most introns have no known cellular function, although a few have been found to
encode functional RNAs or proteins. Introns are generally thought to represent
remnants of sequences that were important earlier in evolution. In particular,
introns may have helped accelerate evolution by
facilitating recombination between protein-coding regions (exons) of different
genes—a process known as exon shuffling. Exons frequently encode functionally
distinct protein domains, so recombination between introns of different genes
would result in new genes containing novel combinations of protein-coding
sequences. As predicted by this hypothesis, DNA sequencing studies have
demonstrated that some genes are chimeras of exons derived from several other
genes, providing direct evidence that new genes can be formed by recombination
between intron sequences.
It appears most likely that introns were present early in evolution, prior to the
divergence of prokaryotic and eukaryotic cells. According to this hypothesis,
introns played an important role in the initial assembly of protein-coding sequences
in the ancient ancestors of present-day cells. Introns were subsequently lost from
most genes of prokaryotes and simpler eukaryotes (e.g., yeasts) in response to
evolutionary selection for rapid replication, which led to streamlining the genomes
of these organisms. However, since rapid cell division is not an advantage to
higher eukaryotes, introns have been retained in their genomes. Alternatively,
introns may have arisen later in evolution as a result of the insertion
of DNA sequences into genes that had already been formed as continuous protein-
coding sequences. Exon shuffling would then have played an important role in the
further evolution of genes in higher eukaryotes but would not account for the initial
assembly of protein-coding sequences prior to the evolutionary divergence of
prokaryotic and eukaryotic cells.
Figure 4.5
Globin gene families. Members of the human α- and β-globin gene families are
clustered on chromosomes 16 and 11, respectively. Each family contains genes that
are specifically expressed in embryonic, fetal, and adult tissues, in
addition ...
Gene families are thought to have arisen by duplication of an original
ancestral gene, with different members of the family then diverging as a
consequence of mutations during evolution. Such divergence can lead to the
evolution of related proteins that are optimized to function in different tissues or at
different stages of development. For example, fetal globins have a higher affinity
for O2 than do adult globins, a difference that allows the fetus to obtain O2 from
the maternal circulation.
As might be expected, however, not all mutations enhance gene function. Some
gene copies have instead sustained mutations that result in their loss of ability to
produce a functional gene product. For example, the human α- and β-globin gene
families each contain two genes that have been inactivated by mutations. Such
nonfunctional gene copies (called pseudogenes) represent evolutionary relics that
significantly increase the size of eukaryotic genomes without making a functional
genetic contribution.
Figure 4.6
Identification of repetitive sequences by DNA reassociation. The kinetics of the
reassociation of fragments of E. coli and bovine DNAs are illustrated as a function
of C0t, which is the initial concentration of DNA multiplied by the time of
incubation.
Further analysis has identified several types of these highly repeated sequences.
One class (called simple-sequence DNA) contains tandem arrays of thousands of
copies of short sequences, ranging from 5 to 200 nucleotides. For example, one
type of simple-sequence DNA in Drosophila consists of tandem repeats of the
seven nucleotide unit ACAAACT. Because of their distinct base compositions,
many simple-sequence DNAs can be separated from the rest of the genomic DNA
by equilibrium centrifugation in CsCl density gradients. The density of DNA is
determined by its base composition, with AT-rich sequences being less dense than
GC-rich sequences. Therefore, an AT-rich simple-sequence DNA bands in CsCl
gradients at a lower density than the bulk of Drosophila genomic DNA (Figure
4.7). Since such repeat-sequence DNAs band as “satellites” separate from the main
band of DNA, they are frequently referred to as satellite DNAs. These sequences
are repeated millions of times per genome, accounting for 10 to 20% of the DNA
of most higher eukaryotes. Simple-sequence DNAs are not transcribed and do not
convey functional genetic information. Some, however, may play important roles
in chromosome structure.
Figure 4.7
Satellite DNA. Equilibrium centrifugation of Drosophila DNA in a CsCl gradient
separates satellite DNAs (designated I–IV) with buoyant densities (in g/cm^3) of
1.672, 1.687, and 1.705 from the main band of genomic DNA (buoyant density
1.701).
Other repetitive DNA sequences are scattered throughout the genome rather than
being clustered as tandem repeats. These sequences are classified as SINEs (short
interspersed elements) or LINEs (long interspersed elements). The major SINEs in
mammalian genomes are Alu sequences, so called because they usually contain a
single site for the restriction endonuclease AluI. Alu sequences are approximately
300 base pairs long, and about a million such sequences are dispersed throughout
the genome, accounting for nearly 10% of the total cellular DNA.
Although Alu sequences are transcribed into RNA, they do not encode proteins and
their function is unknown. The major human LINEs (which belong to the LINE 1,
or L1, family) are about 6000 base pairs long and repeat approximately 50,000
times in the genome. L1 sequences are transcribed and at least some encode
proteins, but like Alu sequences, they have no known function in cell physiology.
Both Alu and L1 sequences are examples of transposable elements, which are
capable of moving to different sites in genomic DNA (see Chapter 5). Some of
these sequences may help regulate gene expression, but most Alu and L1 sequences
appear not to make a useful contribution to the cell. They may, however, have
played important evolutionary roles by contributing to the generation of genetic
diversity.
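The share of the genome taken up by these interspersed repeats follows from
simple arithmetic on the figures quoted above. The Python sketch below
multiplies copy number by element length and divides by a human genome size of
3 × 10^9 base pairs. It assumes every copy is full length, which is a
simplification, and the function name is illustrative.

```python
GENOME_BP = 3e9  # human haploid genome size used in the text


def genome_fraction(copy_number: int, unit_length_bp: int,
                    genome_bp: float = GENOME_BP) -> float:
    """Fraction of the genome occupied by a repeat family.

    Simplifying assumption: every copy has the full unit length.
    """
    return copy_number * unit_length_bp / genome_bp


alu = genome_fraction(1_000_000, 300)  # about a million copies of ~300 bp
l1 = genome_fraction(50_000, 6000)     # about 50,000 copies of ~6000 bp

print(f"Alu: {alu:.0%} of the genome")
print(f"L1:  {l1:.0%} of the genome")
```

With the numbers in the text, the Alu result agrees with the stated figure of
nearly 10% of total cellular DNA.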
Table 4.1
The Numbers of Genes in Cellular Genomes.
The yeast genome, which consists of 12 × 10^6 base pairs, is about 2.5 times the size
of the genome of E. coli, but is still extremely compact. Only 4% of the genes
of Saccharomyces cerevisiae contain introns, and these usually have only a single
small intron near the start of the coding sequence. The average gene in yeast spans
about 2000 base pairs, and approximately 70% of the yeast genome is used as
protein-coding sequence, specifying a total of about 6000 proteins.
The genome of the nematode C. elegans, a relatively simple animal genome, is
intermediate in size and complexity between the genomes of yeast and mammals.
The C. elegans genome is 97 × 10^6 base pairs and contains approximately 19,000
protein-coding genes. Thus, while the genome of C. elegans is 8 times larger than
that of yeast, it contains only about three times the number of genes. This
correlates with the presence of a substantial number of introns in C. elegans.
Each gene in C. elegans contains an average of five introns and spans an average
of 5000 bases. Consistent with this, only about 25% of the C. elegans sequence
corresponds to exons, versus 70% protein-coding sequence in the yeast genome.
Although the genome of Drosophila is 180 × 10^6 base pairs, Drosophila contains
fewer genes than C. elegans (about 13,600). Protein-coding sequence thus
corresponds to only about 13% of the Drosophila genome.
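The comparisons among yeast, C. elegans, and Drosophila can be checked with a
few lines of Python. The sketch below uses only the genome sizes and gene
counts quoted above and derives the size ratio, the gene-count ratio, and the
gene density (genes per million base pairs). The dictionary layout and function
name are illustrative choices.

```python
# Genome sizes (base pairs) and gene counts as quoted in the text.
genomes = {
    "S. cerevisiae": {"size_bp": 12e6, "genes": 6000},
    "C. elegans": {"size_bp": 97e6, "genes": 19000},
    "Drosophila": {"size_bp": 180e6, "genes": 13600},
}


def genes_per_mb(size_bp: float, genes: int) -> float:
    """Gene density expressed as genes per million base pairs."""
    return genes / (size_bp / 1e6)


yeast = genomes["S. cerevisiae"]
worm = genomes["C. elegans"]

size_ratio = worm["size_bp"] / yeast["size_bp"]
gene_ratio = worm["genes"] / yeast["genes"]
print(f"C. elegans vs. yeast: {size_ratio:.1f}x the size, "
      f"{gene_ratio:.1f}x the genes")

for name, g in genomes.items():
    density = genes_per_mb(g["size_bp"], g["genes"])
    print(f"{name}: {density:.0f} genes per Mb")
```

Gene density falls as genome size rises across these three organisms, which is
consistent with the growing share of introns and other noncoding sequence
described above.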
The genomes of higher animals (such as humans) are still more complex and
contain large amounts of noncoding DNA. Thus, only a small fraction of the 3 ×
10^9 base pairs of the human genome is expected to correspond to protein-coding
sequence. Approximately one-third of the genome corresponds to highly repetitive
sequences, leaving an estimated 2 × 10^9 base pairs for functional genes,
pseudogenes, and nonrepetitive spacer sequences. If the average gene spans
10,000–20,000 base pairs (including introns), one might expect the human genome
to consist of about 100,000 genes, with protein-coding sequences corresponding to
only about 3% of human DNA. Although this estimate is generally accepted as
plausible, it remains to be verified or corrected by the final results of human
genome sequencing.
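The reasoning behind this estimate can be reproduced as a back-of-the-envelope
calculation. The Python sketch below divides the 2 × 10^9 base pairs of
nonrepetitive sequence by the assumed average gene span, then works out how
much protein-coding sequence per gene the 3% figure implies. All inputs come
from the paragraph above; the variable and function names are illustrative,
and the result is only as good as those assumptions.

```python
GENOME_BP = 3e9          # total human genome size quoted in the text
NONREPETITIVE_BP = 2e9   # sequence left after highly repetitive DNA
CODING_FRACTION = 0.03   # protein-coding share quoted in the text


def estimated_gene_count(available_bp: float, avg_gene_span_bp: float) -> float:
    """Upper-bound gene count if genes tiled the available sequence."""
    return available_bp / avg_gene_span_bp


low = estimated_gene_count(NONREPETITIVE_BP, 20_000)
high = estimated_gene_count(NONREPETITIVE_BP, 10_000)
print(f"Estimated gene count: {low:,.0f} to {high:,.0f}")

# Protein-coding sequence per gene implied by the 3% figure.
coding_bp = CODING_FRACTION * GENOME_BP
print(f"Coding sequence per gene at {low:,.0f} genes: {coding_bp / low:,.0f} bp")
```

The figure of about 100,000 genes corresponds to the larger assumed gene span
of 20,000 base pairs.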
Human Genome:
Bioinformatics: Introduction
During the month of January, 2003, 1.5 billion bases were sequenced. As the speed
of DNA sequencing increased, the cost decreased from 10 dollars per base in 1990
to 10 cents per base at the conclusion of the project in April 2003. Although the
Human Genome Project is officially over, improvements in DNA sequencing
continue to be made. Researchers are experimenting with new methods for
sequencing DNA that have the potential to sequence a human genome in just a
matter of weeks for a few thousand dollars.
Scientists enter their assembled sequences into genetic databases so that other
scientists may use the data. Since the sequences of the two DNA strands are
complementary, it is only necessary to enter the sequence of one DNA strand into a
database. By selecting an appropriate computer program, scientists can use
sequence data to look for genes, get clues to gene functions, examine genetic
variation, and explore evolutionary relationships. Bioinformatics is a young and
dynamic science. New bioinformatic software is being developed while existing
software is continually updated.
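Because the two strands are complementary, the sequence of the second strand
can always be derived from the stored one. The Python sketch below shows one
way to do this with the standard library only. The function name is
illustrative, and the code handles only the four standard bases.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'.

    Only A, C, G, and T are handled; any other character is left unchanged.
    """
    return seq.translate(_COMPLEMENT)[::-1]


stored = "ACAAACT"
print(reverse_complement(stored))
```

Applying the function twice returns the original sequence, which is why
storing one strand loses no information.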
SNPs:
Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”),
are the most common type of genetic variation among people. Each SNP represents
a difference in a single DNA building block, called a nucleotide. For example, a
SNP may replace the nucleotide cytosine (C) with the nucleotide thymine (T) in a
certain stretch of DNA.
SNPs occur normally throughout a person’s DNA. They occur almost once in
every 1,000 nucleotides on average, which means there are roughly 4 to 5 million
SNPs in a person's genome. These variations may be unique or occur in many
individuals; scientists have found more than 100 million SNPs in populations
around the world. Most commonly, these variations are found in the DNA between
genes. They can act as biological markers, helping scientists locate genes that are
associated with disease. When SNPs occur within a gene or in a regulatory region
near a gene, they may play a more direct role in disease by affecting the gene’s
function.
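As a simple illustration of what a SNP looks like in sequence data, the Python
sketch below compares two aligned sequences of equal length and reports each
position where a single nucleotide differs. It is a toy example: the sequences
are made up, the function name is illustrative, and real SNP calling works on
sequencing reads and must also handle alignment gaps and sequencing errors.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (position, reference_base, sample_base) for each mismatch.

    Positions are 0-based. Both sequences must already be aligned and of
    equal length; insertions and deletions are not handled.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


reference = "ATGCCGTA"  # hypothetical reference stretch
sample = "ATGTCGTA"     # same stretch with a C-to-T substitution
print(find_snps(reference, sample))
```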
Genetic Diversity:
Different breeds of dogs. Dogs are selectively bred to get the desired traits.
Different varieties of rose flower, wheat, etc.
There are more than 50,000 varieties of rice and more than a thousand
varieties of mangoes found in India.
Different varieties of medicinal plant Rauvolfia vomitoria present in
different Himalayan ranges differ in the amount of chemical reserpine
produced by them.
Genome Evolution
LEARNING OBJECTIVES
Explain the importance of genomic changes in an evolutionary context
Mutation Rates
Mutation rates differ between species and even between different regions of the
genome of a single species. Spontaneous mutations occur frequently and can
change the genome in various ways. Mutations can result in the addition or deletion
of one or more nucleotide bases. When the number of bases added or deleted is not a
multiple of three, the result is a frameshift mutation: every codon downstream of
the change is read in the wrong reading frame, which often makes the protein
non-functional. A mutation in a promoter region, enhancer region, or a region coding
for transcription factors can also result in either a loss of function or an
upregulation or downregulation in transcription of the affected gene. Mutations
occur constantly in an organism's genome and can have a negative effect, a positive
effect, or no effect at all.
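The effect of a deletion on the reading frame can be shown with a small Python
sketch. It splits a made-up coding sequence into codons, then compares a
one-base deletion with a three-base deletion. The sequence and the function
names are hypothetical and serve only as illustration.

```python
def codons(seq: str) -> list[str]:
    """Split a sequence into complete three-base codons from position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def delete_bases(seq: str, start: int, count: int) -> str:
    """Return seq with `count` bases removed beginning at index `start`."""
    return seq[:start] + seq[start + count:]


original = "ATGGAGTTCTGA"  # hypothetical sequence: ATG GAG TTC TGA

print("original:        ", codons(original))
# Deleting one base shifts the frame: every downstream codon changes.
print("1-base deletion: ", codons(delete_bases(original, 3, 1)))
# Deleting three bases removes one codon but keeps the frame intact.
print("3-base deletion: ", codons(delete_bases(original, 3, 3)))
```

In the one-base case the final stop codon TGA is no longer read in frame,
while the three-base deletion simply drops one codon and leaves the rest
unchanged.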
Transposable Elements
Transposable elements are DNA sequences that can move to, or be copied into, new
positions in the genome through one of two mechanisms. These mechanisms work
similarly to the “cut-and-paste” and “copy-and-paste” functions in word processing
programs. In the “cut-and-paste” mechanism, the element is excised from one place in
the genome and inserted at another location. In the “copy-and-paste” mechanism, one
or more copies of the element are made and inserted elsewhere in the genome, while
the original stays in place. The most common transposable element in the human
genome is the Alu sequence, which is present in the genome over one million times.
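The word-processing analogy can be taken literally in a short Python sketch
that treats the genome as a string. This is a toy model under simplifying
assumptions: it ignores the enzymes involved and any changes at the insertion
site, and the function names, coordinates, and example sequence are
hypothetical.

```python
def cut_and_paste(genome: str, start: int, end: int, target: int) -> str:
    """Excise genome[start:end] and reinsert it at index `target`.

    `target` is a position in the genome after the element has been removed.
    """
    element = genome[start:end]
    remainder = genome[:start] + genome[end:]
    return remainder[:target] + element + remainder[target:]


def copy_and_paste(genome: str, start: int, end: int, target: int) -> str:
    """Insert a copy of genome[start:end] at index `target`.

    The original element stays in place, so the genome grows.
    """
    element = genome[start:end]
    return genome[:target] + element + genome[target:]


genome = "aaaaTTGGcccc"  # the element TTGG sits at positions 4 to 8

moved = cut_and_paste(genome, 4, 8, 8)
copied = copy_and_paste(genome, 4, 8, 12)
print(moved, len(moved))
print(copied, len(copied))
```

Only the copy-and-paste version increases the length of the genome, which is
one way a single element can build up to very high copy numbers.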
Pseudogenes
Exon Shuffling
Exon shuffling is a mechanism by which new genes are created. This can occur
when two or more exons from different genes are combined together or when
exons are duplicated. Exon shuffling results in new genes by altering the current
intron-exon structure. This can occur by any of the following processes: transposon
mediated shuffling, sexual recombination or illegitimate recombination. Exon
shuffling may introduce new genes into the genome that can be either selected
against and deleted or selectively favored and conserved.
Genome Reduction
Many species exhibit genome reduction when subsets of their genes are no longer
needed. This typically happens when organisms adapt to a parasitic lifestyle, e.g.
when their nutrients are supplied by a host. As a consequence, they lose the genes
needed to produce these nutrients. In many cases, there are both free-living and
parasitic species that can be compared and their lost genes identified. Good
examples are the genomes of Mycobacterium tuberculosis and Mycobacterium
leprae, the latter of which has a dramatically reduced genome. Another good
example is provided by endosymbiont species. For instance, Polynucleobacter necessarius was
first described as a cytoplasmic endosymbiont of the ciliate Euplotes aediculatus.
The latter species dies soon after being cured of the endosymbiont. In the few cases
in which P. necessarius is not present, a different and rarer bacterium apparently
supplies the same function. No attempt to grow symbiotic P. necessarius outside
their hosts has yet been successful, strongly suggesting that the relationship is
obligate for both partners. Yet, closely related free-living relatives of P.
necessarius have been identified. The endosymbionts have a significantly reduced
genome when compared to their free-living relatives (1.56 Mbp vs. 2.16 Mbp).