
UNIT II

GENOME ORGANIZATION

Genomics and Proteomics, Eavesdropping on transmission of genetic
information, Genomes of prokaryotes, Genomes of Eukaryotes, Human
Genome, SNPs, Genetic Diversity, Evolution of Genomes

Genome Organization

The human haploid genome consists of about 3 x 10^9 base pairs of DNA. Genomic
DNA exists as single linear pieces of DNA that are associated with proteins to form
a nucleoprotein complex. The DNA-protein complex is the basis for the formation
of chromosomes; virtually all of the genomic DNA is distributed among the 23
chromosomes that reside in the cellular nucleus. A very small fraction of the
genome is also found in a circular piece of DNA of about 16,000 base pairs that is
located in the mitochondria. Before cell division, the double helical DNA of the
chromatin is replicated and the chromatin fiber condenses into discrete bodies,
the chromosomes, each consisting
of two identical chromatids. The two sister chromatids separate, one moving to
each pole of the cell, where they become part of the newly formed nucleus of each
daughter cell. The cells that make up most of the body of a multicellular organism,
the somatic cells, have two copies of each chromosome and are said to be diploid
(2n). Egg and sperm cells, for example, are produced by meiosis, have only one copy
of each chromosome, and are haploid (n). The DNA of chromatin and chromosomes is
bound tightly to a family of positively charged proteins, the histones, which
associate strongly with the many negatively charged phosphate groups in DNA.
The histones and DNA associate in complexes called nucleosomes in which the
DNA strand winds around a core of histone molecules.

Functional Elements and Distribution of DNA within the Genome

The major function of genomic DNA is to carry and store genetic information that
is expressed as RNA and then as functional proteins. For gene expression to
correctly occur there must be regulatory elements present on the genome and the
genome must be faithfully replicated and segregated between daughter cells.

DNA Elements Required for Replication and Segregation of the Genome

Based on studies with unicellular eukaryotes (yeast) at least three types of DNA
elements are required for replication and stable inheritance of chromosomes:
autonomously replicating sequences (ARS), centromeres and telomeres.
Autonomously Replicating Sequences (ARS) are the sites at which DNA
replication is initiated on the chromosomes. Centromeres are DNA sequences that
are required for segregation of replicated chromosomes to daughter cells.
Telomeres (see "DNA Synthesis" lecture) are the tips of chromosomes, which are
recognized by telomerase. The DNA sequences of telomeres have been
determined in several organisms and consist of numerous repeats of a 6 to 8 base
long sequence, for example [TTGGGG]n. Yeast artificial chromosomes, or YACs, can be
constructed by combining large segments of human DNA (50,000 base pairs or
longer) with a selectable marker and the three essential elements described above.
These artificial chromosomes can then be propagated and amplified in yeast cells.
This technology is being used in the sequencing of the human genome.
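
As a small illustration of the [TTGGGG]n notation, the sketch below counts how many tandem copies of a repeat unit sit at the end of a sequence. The function name and the toy sequence are made up for this example; real telomere analysis is considerably more involved.

```python
import re


def count_terminal_repeats(seq, unit="TTGGGG"):
    """Count the tandem copies of `unit` found at the very end of `seq`."""
    # (?:unit)+$ matches one or more back-to-back copies anchored at the end.
    match = re.search(f"(?:{re.escape(unit)})+$", seq)
    if match is None:
        return 0
    return len(match.group(0)) // len(unit)


# Hypothetical chromosome end: arbitrary unique sequence followed by [TTGGGG]5.
toy_chromosome_end = "ACGTACGGATC" + "TTGGGG" * 5
print(count_terminal_repeats(toy_chromosome_end))
```

Here n is simply the number of times the 6-base unit is repeated head to tail.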

Unique Sequences

Greater than 50% of the eukaryotic genome consists of DNA that is unique in
sequence, and the human genome encodes about 100,000 proteins. The coding
portion of an average gene (the exons) consists of about 2,000 base pairs of DNA
that is unique in sequence. In total, the exons represent less than 7% of the DNA
in the human genome and less than 14% of the DNA that is unique. Most of
the coding sequences are interrupted by 1 to 50 noncoding sequences, or
introns. The total length of the introns that interrupt a gene generally far exceeds
the total length of the exons. Since sequences that regulate gene expression also
account for some of the unique sequences the actual amount of DNA coding for
functional gene products is probably less than 3% of the total genomic DNA. The
spatial distribution of genes, exons, introns and regulatory sequences along each
chromosome is shown below.
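
The percentages in this section follow from simple arithmetic on the figures quoted above. The sketch below reproduces them. The numbers are the estimates used in these notes, and the variable names are chosen only for this example.

```python
# Figures quoted in the notes (estimates, not measurements).
GENOME_BP = 3e9             # haploid human genome size in base pairs
N_PROTEINS = 100_000        # number of proteins assumed in the text
EXON_BP_PER_GENE = 2_000    # average coding length per gene
UNIQUE_FRACTION = 0.5       # "greater than 50%" of the genome is unique

coding_bp = N_PROTEINS * EXON_BP_PER_GENE
coding_fraction = coding_bp / GENOME_BP
coding_fraction_of_unique = coding_bp / (UNIQUE_FRACTION * GENOME_BP)

print(f"Coding DNA: {coding_fraction:.1%} of the genome")
print(f"Coding DNA: {coding_fraction_of_unique:.1%} of the unique DNA")
```

With these inputs the coding DNA comes to just under 7% of the genome and just under 14% of the unique DNA, matching the limits stated in the text.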

Repetitive Sequences

There are multiple classes of repetitive DNA, including highly
repetitive and moderately repetitive DNA. The function of repetitive DNA is not
well understood, but approximately 30% of the human genome consists of repetitive
DNA.

Highly repetitive DNA consists of several different sets of short repeated
polynucleotides; the repeats generally range from 5 to 500 base pairs in length and
exist in tandem arrays. Highly repetitive DNA comprises about 10-15% of the total
genomic DNA, is present in over a million copies and is transcriptionally inactive.
Some of the highly repetitive DNA is clustered in structural regions of
chromosomes, particularly in the centromeric and telomeric regions.

Moderately repetitive DNA contains a large variety of repeated sequences ranging
from a few hundred to tens of thousands of base pairs with different characteristics.
Moderately repetitive DNA can be clustered at specific chromosomal locations or
distributed throughout the genome. One type of moderately repetitive human DNA
sequence is the rRNA precursor gene. Each rRNA precursor gene is contained in a
DNA segment of about 43,000 base pairs. The actual transcript is 13,400 bases long
and is processed into the mature 28S, 18S and 5.8S rRNAs (see "RNA
Synthesis and Processing" lecture). This means that about 30,000 base pairs are
not transcribed and apparently serve as spacer DNA. About 280 copies of the
rRNA precursor gene are distributed in clusters on five chromosomes and account
for about 0.4% of the genomic DNA.
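
The spacer length and the 0.4% figure can be checked directly from the numbers in the paragraph above. This is a minimal sketch; the constant names are chosen for illustration, and the 3 x 10^9 bp genome size is the value used earlier in these notes.

```python
GENOME_BP = 3e9                  # haploid genome size used in these notes
RRNA_SEGMENT_BP = 43_000         # DNA segment holding one rRNA precursor gene
RRNA_TRANSCRIPT_BASES = 13_400   # length of the precursor transcript
RRNA_COPIES = 280                # approximate copy number

# Untranscribed spacer per repeat unit.
spacer_bp = RRNA_SEGMENT_BP - RRNA_TRANSCRIPT_BASES

# Share of the genome taken up by all rRNA repeat units together.
rrna_fraction = RRNA_COPIES * RRNA_SEGMENT_BP / GENOME_BP

print(f"Spacer per repeat: {spacer_bp} bp")
print(f"rRNA gene clusters: {rrna_fraction:.1%} of the genome")
```

The spacer works out to 29,600 bp per repeat, roughly the 30,000 quoted, and the clusters together occupy about 0.4% of the genome.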

Most types of moderately repetitive DNA are short, about 300 base pairs in length,
are interspersed with unique sequences, and are often transcribed but do not code
for a gene product.

Chromosomal Structure

A typical human cellular nucleus is between 5 and 10 µm in diameter and the
diploid human genome is over 2 meters long! Obviously, to make the DNA fit into
the nucleus it must be compacted; think of it as trying to put a piece of thread 6
miles long into a ping-pong ball. Fully compacted DNA cannot be transcribed or
replicated, so the cell must be able to selectively expose genes for transcription
and ARS elements so that replication can be initiated at the correct time in the
cell cycle. To accomplish all of these tasks (compaction, transcription and
replication), the DNA is associated with a special set of structural proteins that
form a nucleoprotein (DNA-protein) complex called chromatin.
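
The "over 2 meters" figure can be reproduced with a short calculation. The sketch assumes B-form DNA with a rise of about 0.34 nm per base pair, a standard textbook value that is not stated in these notes, and a nuclear diameter of 10 micrometres.

```python
HAPLOID_GENOME_BP = 3e9       # from the notes
RISE_PER_BP_M = 0.34e-9       # assumption: B-form DNA, about 0.34 nm per base pair
NUCLEUS_DIAMETER_M = 10e-6    # assumption: upper end of the quoted size range

diploid_bp = 2 * HAPLOID_GENOME_BP
dna_length_m = diploid_bp * RISE_PER_BP_M
length_to_diameter = dna_length_m / NUCLEUS_DIAMETER_M

print(f"Stretched-out diploid DNA: {dna_length_m:.2f} m")
print(f"Length / nuclear diameter: {length_to_diameter:,.0f}")
```

Under these assumptions the DNA is roughly 200,000 times longer than the nucleus is wide, which is why several levels of compaction are needed.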

Composition and Structure of Chromatin

Chromatin contains two classes of protein: histones and nonhistone proteins. The
overall purpose of histones is to condense the DNA, whereas many nonhistone
proteins are involved in transcription, DNA replication and maintenance of
chromatin structure.

Histones are the most abundant proteins found in chromatin. There are five major
types: H1, H2A, H2B, H3 and H4. The histones are small basic proteins rich
in Lys and Arg. The positive charge (basicity) of the histones allows the
negatively charged DNA to "wrap" around them, forming a nucleosome.

Chromatin consists of a linear chain of nucleosomes, each linked to its neighbor by
a segment of DNA that is between 20 and 100 base pairs in length. Nucleosomes
that are bound to H1 are called chromatosomes. The assembly of nucleosomes is
believed to require the participation of the nonhistone proteins N1 and
nucleoplasmin.

Nucleosome Assembly

The assembly of the nucleosome requires the nonhistone proteins N1, which binds to a
tetramer of H3 and H4, and nucleoplasmin, which binds to dimers of H2A and
H2B. The resulting (H3)2(H4)2 tetramer and H2A-H2B dimers associate with the
DNA while N1 and nucleoplasmin are released and recycled. H1 then adds to the
structure, forming a chromatosome.

Chromatin can be further compacted into higher order structures including a
solenoidal coil with about six chromatosomes per turn and the resulting DNA
fibril. The fibril forms loops anchored to a nonhistone protein scaffold, the looped
structures forming the interphase chromosomes. During mitosis the looped
structures further condense by coiling upon themselves to form minibands. Each
miniband is composed of about 18 loops, each loop containing over a million base
pairs. The DNA in these minibands has been compacted about 10,000-fold! The
minibands are arranged along a central axis and form the arms of the mitotic
chromosome.

Treatment of mitotic chromosomes with dextran sulfate followed by special
detergents strips off the histones and most other proteins. Additional treatment
with restriction enzymes cuts most of the DNA, which can then be separated from
the scaffold. When the scaffold is then analyzed, short segments of DNA are
found attached to it between genes, not within regions of transcribed
DNA. These sequences are called scaffold associated regions or SARs.

The major scaffold protein is topoisomerase II, which regulates the extent of
supercoiling in the DNA. Supercoiling, seen in circular DNA (mitochondrial
DNA) and nucleosomes (DNA wrapped around something else), results when
double stranded DNA twists upon itself. Topoisomerase II maintains the level of
supercoiled DNA at a constant value because supercoiling can affect the efficiency
of transcription, DNA replication and the integrity of chromatin.

Chromatin Dynamics

The higher order structure of chromatin varies and is determined by factors such as
tissue type, sex and the developmental state of the cell. If chromosomes are stained
with a dye and then analyzed microscopically numerous dark bands are seen. The
dark bands correspond to the highly condensed and transcriptionally inactive
heterochromatin. Heterochromatin is generally found at or near the centromere and
telomeres and consists of highly repetitive DNA. The lighter bands are the less
condensed, transcriptionally active euchromatin.

In order for DNA replication to occur the chromatin must be dynamically
restructured or "decondensed" allowing the replication "machinery" to gain access
to the DNA. Transcriptionally active genes are sensitive to digestion by DNase
while inactive genes are insensitive to digestion. This suggests that the chromatin
has "decondensed" during transcription which also allows access to the DNase.

Numerous subtypes of histones have been identified. Analysis of these histones
indicates that histones are subject to chemical modification via acetylation,
phosphorylation, ADP-ribosylation and ubiquitination. Some of these modified
histones appear to be associated with actively transcribing genes suggesting that
the modifications may affect the structure of the nucleosome making the DNA
more accessible to the enzymes required for regulating and carrying out
transcription, replication and repair.

Genomics and Proteomics :

Proteomics is the study of the entire set of proteins produced by a cell type in order
to understand its structure and function.

LEARNING OBJECTIVES

Explain how the field of genomics led to the development of proteomics


KEY TAKEAWAYS

Key Points

• Proteomics investigates how proteins affect and are affected by cell
processes or the external environment.
• Within an individual organism, the genome is constant, but the proteome
varies and is dynamic.
• Every cell in an individual organism has the same set of genes, but the sets of
proteins produced in different tissues differ from one another and are
dependent on gene expression.

Key Terms
• proteomics: the branch of molecular biology that studies the set of proteins
expressed by the genome of an organism
• proteome: the complete set of proteins encoded by a particular genome
• genomics: the study of the complete genome of an organism

Proteomics is a relatively recent field; the term was coined in 1994 while the
science itself had its origins in electrophoresis techniques of the 1970s and 1980s.
The study of proteins, however, has been a scientific focus for a much longer time.
Studying proteins generates insight into how they affect cell processes.
Conversely, this study also investigates how proteins themselves are affected by
cell processes or the external environment. Proteins provide intricate control of
cellular machinery; they are, in many cases, components of that same machinery.
They serve a variety of functions within the cell; there are thousands of distinct
proteins and peptides in almost every organism. The goal of proteomics is to
analyze the varying proteomes of an organism at different times in order to
highlight differences between them. Put more simply, proteomics analyzes the
structure and function of biological systems. For example, the protein content of a
cancerous cell is often different from that of a healthy cell. Certain proteins in the
cancerous cell may not be present in the healthy cell, making these unique proteins
good targets for anti-cancer drugs. The realization of this goal is difficult; both
purification and identification of proteins in any organism can be hindered by a
multitude of biological and environmental factors.

The study of the function of proteomes is called proteomics. A proteome is the
entire set of proteins produced by a cell type. Genomics led to proteomics (via
transcriptomics) as a logical step. Proteomes can be studied using the knowledge of
genomes because genes code for mRNAs and the mRNAs encode proteins.
Although mRNA analysis is a step in the right direction, not all mRNAs are
translated into proteins. Proteomics complements genomics and is useful when
scientists want to test their hypotheses that were based on genes. Even though all
cells of a multicellular organism have the same set of genes, the set of proteins
produced in different tissues is different and dependent on gene expression. Thus,
the genome is constant, but the proteome varies and is dynamic within an
organism. In addition, RNAs can be alternately spliced (cut and pasted to create
novel combinations and novel proteins) and many proteins are modified after
translation by processes such as proteolytic cleavage, phosphorylation,
glycosylation, and ubiquitination. There are also protein-protein interactions,
which complicate the study of proteomes. Although the genome provides a
blueprint, the final architecture depends on several factors that can change the
progression of events that generate the proteome.

Large-scale proteomics machinery: This machine is preparing to do a proteomic
pattern analysis to identify specific cancers so that an accurate cancer prognosis
can be made.

Basic Techniques in Protein Analysis

The basic techniques used to analyze proteins are mass spectrometry, x-ray
crystallography, NMR, and protein microarrays.

LEARNING OBJECTIVES

Describe the techniques used in proteomics to analyze proteins


KEY TAKEAWAYS

Key Points

• Mass spectrometry is a technique that is useful for determining the size of a
protein or protein complex.
• X-ray crystallography and NMR are techniques useful for determining the
3-D structure of a protein or protein complex.
• Protein microarrays are useful for determining protein-protein interactions.

Key Terms

• microarray: any of several devices containing a two-dimensional array of
small quantities of biological material used for various types of assays
• reporter gene: a gene that researchers attach to a regulatory sequence of
another gene of interest and whose product is easily identifiable in assays

Basic Techniques in Protein Analysis

The ultimate goal of proteomics is to identify or compare the proteins expressed from
a given genome under specific conditions, study the interactions between the
proteins, and use the information to predict cell behavior or develop drug targets.
Just as the genome is analyzed using the basic technique of DNA sequencing,
proteomics requires techniques for protein analysis. The basic technique for protein
analysis, analogous to DNA sequencing, is mass spectrometry.

Mass Spectrometer: Matrix-Assisted Laser Desorption Ionisation – Time Of Flight
(MALDI-TOF) Mass Spectrometer. Mass spectrometry can be used in protein
analysis.

Mass Spectrometry

Mass spectrometry is used to identify and determine the characteristics of a
molecule. It is a technique in which gas phase molecules are ionized and their
mass-to-charge ratio is measured by observing acceleration differences of ions
when an electric field is applied. Lighter ions will accelerate faster and be detected
first. If the mass is measured with precision, then the composition of the molecule
can be identified. In the case of proteins, the sequence can be identified. The
challenge of techniques used for proteomic analyses is the difficulty in detecting
small quantities of proteins, but advances in spectrometry have allowed researchers
to analyze very small samples of protein. Variations in protein expression in
diseased states, however, can be difficult to discern. Proteins are naturally unstable
molecules, which makes proteomic analysis much more difficult than genomic
analysis.
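
The statement that lighter ions are detected first can be made concrete for a simple linear time-of-flight analyzer, the kind of instrument named in the MALDI-TOF caption. The sketch assumes an ideal instrument: the ion gains kinetic energy zeV in the accelerating field (zeV = 1/2 m v^2) and then drifts at constant speed over a field-free tube of length L, so t = L / sqrt(2zeV/m). The voltage, tube length and ion masses are illustrative values, not taken from the notes.

```python
import math

ELEMENTARY_CHARGE_C = 1.602176634e-19   # coulombs
DALTON_KG = 1.66053906660e-27           # kilograms per dalton


def flight_time_s(mass_da, charge, accel_voltage_v, drift_length_m):
    """Drift time of an ion in an idealized linear time-of-flight analyzer."""
    mass_kg = mass_da * DALTON_KG
    # Energy balance: z * e * V = 1/2 * m * v^2
    velocity = math.sqrt(
        2 * charge * ELEMENTARY_CHARGE_C * accel_voltage_v / mass_kg
    )
    return drift_length_m / velocity


# Two hypothetical singly charged ions, 20 kV acceleration, 1 m drift tube.
light = flight_time_s(1_000, 1, 20_000, 1.0)
heavy = flight_time_s(4_000, 1, 20_000, 1.0)

print(f"1,000 Da ion: {light * 1e6:.1f} microseconds")
print(f"4,000 Da ion: {heavy * 1e6:.1f} microseconds")
print(f"Ratio heavy/light: {heavy / light:.2f}")
```

Because flight time scales with the square root of the mass-to-charge ratio, an ion four times heavier takes twice as long to reach the detector.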

X-ray crystallography and Nuclear Magnetic Resonance

X-ray crystallography enables scientists to determine the three-dimensional
structure of a crystallized protein at atomic resolution. Crystallographers aim
high-powered X-rays at a tiny crystal containing trillions of identical molecules. The
crystal scatters the X-rays onto an electronic detector that is the same type used to
capture images in a digital camera. After each blast of X-rays, lasting from a few
seconds to several hours, the researchers precisely rotate the crystal by entering its
desired orientation into the computer that controls the X-ray apparatus. This
enables the scientists to capture in three dimensions how the crystal scatters, or
diffracts, X-rays. The intensity of each diffracted ray is fed into a computer, which
uses a mathematical equation to calculate the position of every atom in the
crystallized molecule. The result is a three-dimensional digital image of the
molecule.

X-ray crystallography: X-rays scattered by the electrons of the atoms in the
crystal are diffracted onto a detector.

Another protein imaging technique, nuclear magnetic resonance (NMR), uses the
magnetic properties of atomic nuclei to determine the three-dimensional structure of
proteins. NMR spectroscopy is unique in being able to reveal the atomic structure
of macromolecules in solution, provided that a highly concentrated solution can be
obtained. This technique depends on the fact that certain atomic nuclei are
intrinsically magnetic. The chemical shift of nuclei depends on their local
environment. The spins of neighboring nuclei interact with each other in ways that
provide definitive structural information that can be used to determine complete
three-dimensional structures of proteins.

Protein Microarrays and Two-Hybrid Screening

Protein microarrays have also been used to study interactions between proteins.
These are large-scale adaptations of the basic two-hybrid screen. The premise
behind the two-hybrid screen is that most eukaryotic transcription factors have
modular activating and binding domains that can still activate transcription even
when split into two separate fragments, as long as the fragments are brought within
close proximity to each other. Generally, the transcription factor is split into a
DNA-binding domain (BD) and an activation domain (AD). One protein of interest
is genetically fused to the BD and another protein is fused to the AD. If the two
proteins of interest bind each other, then the BD and AD will also come together
and activate a reporter gene that signals interaction of the two hybrid proteins.

Two-hybrid screening: Two-hybrid screening is used to determine whether two
proteins interact. In this method, a transcription factor is split into a DNA-binding
domain (BD) and an activation domain (AD). The binding domain is able to bind
the promoter in the absence of the activator domain, but it does not turn on
transcription. A protein called the bait is attached to the BD, and a protein called
the prey is attached to the AD. Transcription occurs only if the prey “catches” the
bait.
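
The logic of the two-hybrid screen described above can be written as a small truth-table model. This is a conceptual sketch only: the protein names and the set of known interactions are hypothetical, and the real assay is a wet-lab experiment rather than a lookup.

```python
def reporter_active(bait_binds_prey, bd_at_promoter=True):
    """Reporter gene is transcribed only if the BD sits on the promoter
    and the AD is brought next to it through bait-prey binding."""
    ad_recruited = bait_binds_prey
    return bd_at_promoter and ad_recruited


def screen(bait, preys, interactions):
    """Return the preys that switch on the reporter for a given bait."""
    return [
        prey for prey in preys
        if reporter_active((bait, prey) in interactions)
    ]


# Hypothetical interaction data used only to exercise the model.
known_interactions = {("baitX", "preyA")}
hits = screen("baitX", ["preyA", "preyB"], known_interactions)
print(hits)
```

The BD alone never activates the reporter in this model, which mirrors the statement that it binds the promoter but does not turn on transcription.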

Western Blot

The western blot, or protein immunoblot, is a technique that combines protein
electrophoresis and antibodies to detect proteins in a sample. A western blot is
fairly quick and simple compared to the above techniques and, thus, can serve as
an assay to validate results from other experiments. The protein sample is first
separated by gel electrophoresis, then transferred to a nitrocellulose or other type
of membrane, and finally stained with a primary antibody that specifically binds
the protein of interest. A fluorescently or radioactively labeled secondary antibody
binds to the primary antibody and provides a means of detection via either
photography or x-ray film, respectively.

Cancer Proteomics

Proteomics, the analysis of proteins, plays a prominent role in the study and
treatment of cancer.

LEARNING OBJECTIVES

Explain the ways in which cancer proteomics may lead to better treatments

KEY TAKEAWAYS

Key Points

• Identifying those proteins whose expression is affected by disease processes
can be used to improve screening and early detection of cancer.
• Different biomarkers and protein signatures are being used to analyze each
type of cancer.
• A future goal of cancer proteomics is to have a personalized treatment plan
for each individual.

Key Terms

• biomarker: a substance used as an indicator of a biological state, most
commonly disease

Cancer Proteomics

Genomes and proteomes of patients suffering from specific diseases are being
studied to understand the genetic basis of diseases. The most prominent set of
diseases being studied with proteomic approaches is cancer. Proteomic approaches
are being used to improve screening and early detection of cancer, which is
achieved by identifying proteins whose expression is affected by the disease
process.

An individual protein that indicates disease is called a biomarker, whereas a set of
proteins with altered expression levels is called a protein signature. For a
biomarker or protein signature to be useful as a candidate for early screening and
detection of a cancer, it must be secreted in body fluids (e.g. sweat, blood, or urine)
such that large-scale screenings can be performed in a non-invasive fashion. The
current problem with using biomarkers for the early detection of cancer is the high
rate of false-negative results. A false-negative is an incorrect test result that should
have been positive. In other words, many cases of cancer go undetected, which
makes biomarkers unreliable. Some examples of protein biomarkers used in cancer
detection are CA-125 for ovarian cancer and PSA for prostate cancer. Protein
signatures may be more reliable than single biomarkers for detecting cancer cells.

Questions that can be answered by biomarkers: In cancer research and medicine,
biomarkers are used in three primary ways: (A) Diagnostic – To help diagnose
conditions, as in the case of identifying early stage cancers. (B) Prognostic – To
forecast how aggressive a condition is, as in the case of determining a patient’s
ability to fare in the absence of treatment. (C) Predictive – To predict how well a
patient will respond to treatment.
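
The false-negative problem described above is usually quantified as a false-negative rate, or equivalently as sensitivity. The sketch below uses made-up screening counts purely to show the arithmetic; they are not results for CA-125, PSA or any real test.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual cancer cases that the test correctly flags."""
    return true_positives / (true_positives + false_negatives)


def false_negative_rate(true_positives, false_negatives):
    """Fraction of actual cancer cases that the test misses."""
    return false_negatives / (true_positives + false_negatives)


# Hypothetical screening outcome among 100 patients who do have cancer.
detected, missed = 60, 40

print(f"Sensitivity: {sensitivity(detected, missed):.0%}")
print(f"False-negative rate: {false_negative_rate(detected, missed):.0%}")
```

The two quantities always sum to 1, so a biomarker with a high false-negative rate necessarily has low sensitivity.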

Proteomics is also being used to develop individualized treatment plans, which
involves the prediction of whether or not an individual will respond to specific
drugs and the side effects that the individual may experience. In addition,
proteomics can be used to predict the possibility of disease recurrence. The
National Cancer Institute has developed programs to improve the detection and
treatment of cancer. The Clinical Proteomic Technologies for Cancer and the Early
Detection Research Network are efforts to identify protein signatures specific to
different types of cancers. The Biomedical Proteomics Program is designed to
identify protein signatures and design effective therapies for cancer patients.

Eavesdropping on transmission of genetic information :

Eavesdropping Methods :

Eavesdropping operations generally have three principal elements:

• Pickup Device: A microphone, video camera or other device picks up sound
or video images and converts them to electrical impulses. If the device can
be installed so that it uses electrical power already available in the target
room, this eliminates the need for periodic access to the room to replace
batteries. Some listening devices can store information digitally and transmit
it to a listening post at a predetermined time. Tiny microphones may be
coupled with miniature amplifiers that filter out background noise.

• Transmission Link: The electrical impulses created by the pickup device
must somehow be transmitted to a listening post. This may be done by a
radio frequency transmission or by wire. Available wires might include the
active telephone line, unused telephone or electrical wire, or ungrounded
electrical conduits. Transmitters may be linked to an existing power source
or be battery operated. The transmitter may operate continuously or, in more
sophisticated operations, be remotely activated.

• Listening Post: This is a secure area where the signals can be monitored,
recorded, or retransmitted to another area for processing. The listening post
may be as close as the next room or as far as several blocks. Voice-activated
equipment is available to record only when activity is present. A recorder
can record up to 12 hours of conversation between tape changes.

Eavesdropping equipment varies greatly in level of sophistication. Many
off-the-shelf spy shop devices are generally low-cost consumer electronic devices that
have been modified for covert surveillance. They are easy to use against
unsuspecting targets but can be detected by elementary electronic
countermeasures. Devices produced for law enforcement and industrial espionage
are more expensive, more sophisticated, and more difficult to find during a
technical security countermeasures (TSCM) inspection. Devices designed and built
for intelligence services are still more expensive and very difficult to find.

Some of the more sophisticated bugs have a "burst" transmission. A device about
the size of a fingernail can record several hours of ordinary conversation and then
transmit it to a remote receiver in a burst that lasts only two seconds. An hour of
speech can be stored on a single chip. This is a passive system that records
information but emits signals only when interrogated.1 This makes detection very
difficult. Of course, some countermeasures systems are designed to try to activate
such systems so they can be detected.

Some eavesdropping operations, as discussed below, don't require anything at all to
be planted in the target room. The eavesdropping can be done without ever having
direct physical access to the target area. Such operations exploit weaknesses in
the telephone system or computer system already in the target room, or they use a
laser beam aimed at the target room.1

Eavesdropping in Office or Home

The type of bug installed in a home or office setting depends in part upon the
length of time and the circumstances, if any, under which the installer has physical
access to the site.

A visitor seated in front of your desk may bend down to pick up a dropped pen,
using the few seconds when his hand is out of your sight to stick a bug under his
chair or under your desk. Or he may "forget" and leave behind a workable pen that
has a concealed microphone and transmitter. Any gift intended to be kept on your
desk or elsewhere in the open in your office is a potential concealment device for a
bug.

If the eavesdropper can gain a period of unsupervised access to your
office or home, it is possible to install more sophisticated devices that are more
difficult to detect. That is why physical security measures to protect the office
space from intruders or other unauthorized persons are so important.
Common hiding spots when time is available to plant a device include
electrical outlets in the wall, furniture, lamps, ceiling light fixtures, pictures on
the wall, books on your bookshelf, etc.

More than half of all eavesdropping attacks on U.S. offices, both foreign and
domestic, have exploited the common telephone.2 Telephones offer a variety of
eavesdropping options, as the telephone instrument has electrical power, a built-in
microphone, a speaker that can serve dual purposes, and ample room for hiding
bugs or taps.

The time it takes to install a bug in your telephone is measured in seconds, not
minutes. One type of telephone bug transmits all your telephone conversations to a
nearby listening post. Picking up your telephone to make or receive a call triggers a
recorder that can be placed in the trunk of a car parked up to four blocks away.
When you hang up, the recorder is turned off automatically.

Another type of telephone bug will pick up conversations in the room and transmit
them down the telephone line while your telephone remains on the hook. The
eavesdropper can monitor your room conversations from another telephone
anywhere in the world. Such telephone bugs are usually easy to detect by a
professional countermeasures technician who knows what to look for.

With some of today's computerized phone systems, it is possible to manipulate a
telephone electronically without ever having direct, physical access to the
telephone instrument. Signals can be sent down the telephone line to turn the
handset into a microphone that picks up and transmits conversations in the room
even when the handset is hung up. This risk can be greatly reduced by the selection
of an appropriate telephone system and implementation of available technical
security countermeasures. This type of penetration of telephone systems is
discussed in greater detail under Telephones in the Intercepting Your
Communications module.

Computers are similar to telephones, in that they have the essential parts for a
sophisticated surveillance system -- a microphone and a means of communicating
information outside the area in which they are located. Computers are vulnerable to
several types of eavesdropping operations. For example, a bug in your keyboard
could transmit every keystroke so that everything you write can be reproduced.

Standard computers emit faint electromagnetic radiation that a very sophisticated
eavesdropper can use to reconstruct the contents of the computer screen. These
signals can carry a distance of several hundred feet, and even further if exposed
cables or telephone lines act as inadvertent antennas. Security measures and
shielding are available to reduce the risk of such eavesdropping. It is possible to
buy TEMPEST-protected computers that block the unintended radiation.

Eavesdropping in Public Places

Even public areas are not immune to technical surveillance. Whenever your
presence in a public area is known or predictable in advance, an adversary or
competitor has time to plan the best way to exploit that knowledge.

One Western European intelligence service is known to bug selected
first-class seats of its national airline. This picks up conversations among U.S.
government officials or business executives traveling together for negotiations
in that Western European country.

Outdoors in a park, in a hotel lobby, or while sitting around a hotel swimming
pool, conversations may be monitored with a shotgun microphone. This is a
directional microphone (parabolic reflector) that may be concealed in a sleeve or a
folded newspaper and aimed at the target. Clarity of the recording may be
improved by programs that cancel out extraneous noise and that employ neural net
analysis to learn the target’s speech patterns.

Individuals who habitually frequent the same restaurant or café and hold sensitive
conversations over lunch or dinner are also vulnerable, especially if they usually sit
at the same table or the restaurant manager cooperates with the eavesdropper. A
short-term bug can simply be attached to the underside of the table. Longer term,
one could build the bug into the table or into a vase or other item on the table.
Although probably very rare, at least one highly-competitive, high-class restaurant
is known to have bugged its own tables to obtain unfiltered feedback on customer
reactions to the service and food.

Eavesdropping on transmission of genetic information

https://books.google.mw/books?id=xYmcAQAAQBAJ&pg=PA64&lpg=PA64&dq=eavesdropping+on+transmission+of+genetic+information+in+bioinformatics&source=bl&ots=7h8gF7NYmH&sig=ACfU3U1zN-gK-JAvZmja2WFHwmkhDXfSZQ&hl=en&sa=X&ved=2ahUKEwi2gtqZs6bvAhXUURUIHaNVADEQ6AEwA3oECBYQAw#v=onepage&q&f=false

Genomes of prokaryotes:

Bacterial Chromosomes in the Nucleoid

The nucleoid is an irregularly-shaped region within the cell of a prokaryote that
contains all or most of the genetic material.

LEARNING OBJECTIVES

Evaluate the nucleoid in prokaryotes


KEY TAKEAWAYS

Key Points
 The genome of prokaryotic organisms generally is a circular, double-
stranded piece of DNA, multiple copies of which may exist at any time.
 The length of a genome varies widely, but is generally at least a few million
base pairs.
 A genophore is the DNA of a prokaryote. It is commonly referred to as a
prokaryotic chromosome.

Key Terms

 nucleoid: The irregularly-shaped region within a prokaryote cell where the
genetic material is localized.
 prokaryote: An organism characterized by the absence of a nucleus or any
other membrane-bound organelles.
 genome: The complete genetic information (either DNA or, in some viruses,
RNA) of an organism, typically expressed in the number of basepairs.

The Nucleoid

The nucleoid (meaning nucleus-like) is an irregularly-shaped region within the cell
of a prokaryote that contains all or most of the genetic material. In contrast to the
nucleus of a eukaryotic cell, it is not surrounded by a nuclear membrane. The
genome of prokaryotic organisms generally is a circular, double-stranded piece of
DNA, of which multiple copies may exist at any time. The length of a genome
varies widely, but is generally at least a few million base pairs.

Prokaryote cell nucleoid: Prokaryote cell (right) showing the nucleoid in
comparison to a eukaryotic cell (left) showing the nucleus.

The nucleoid can be visualized on an electron micrograph at high
magnification, where it is clearly visible against the cytosol. Sometimes even
strands of what is thought to be DNA are visible. The nucleoid can also be seen
under a light microscope by staining it with the Feulgen stain, which specifically
stains DNA. The DNA-intercalating stains DAPI and ethidium bromide are widely
used for fluorescence microscopy of nucleoids.

Experimental evidence suggests that the nucleoid is composed of about
60% DNA, plus a small amount of RNA and protein. The latter two constituents
are likely to be mainly messenger RNA and the transcription factor proteins found
regulating the bacterial genome. Proteins helping to maintain the supercoiled
structure of the nucleic acid are known as nucleoid proteins or nucleoid-associated
proteins, and are distinct from histones of eukaryotic nuclei. In contrast to histones,
the DNA-binding proteins of the nucleoid do not form nucleosomes, in which
DNA is wrapped around a protein core. Instead, these proteins often use other
mechanisms, such as DNA looping, to promote compaction.

The Genophore

A genophore is the DNA of a prokaryote. It is commonly referred to as a
prokaryotic chromosome. The term “chromosome” is misleading, because the
genophore lacks chromatin. The genophore is compacted through a mechanism
known as supercoiling, but a chromosome is additionally compacted through the
use of chromatin. The genophore is circular in most prokaryotes, and linear in very
few. The circular nature of the genophore allows replication to occur without
telomeres. Genophores are generally much smaller than eukaryotic
chromosomes. A genophore can be as small as 580,073 base pairs (Mycoplasma
genitalium). Many eukaryotes (such as plants and animals) carry genophores in
organelles such as mitochondria and chloroplasts. These organelles are very similar
to true prokaryotes.

Supercoiling

DNA supercoiling refers to the over- or under-winding of a DNA strand, and is an
expression of the strain on that strand.

LEARNING OBJECTIVES

Assess the role of supercoiling in prokaryotic genomes


KEY TAKEAWAYS

Key Points

 As a general rule, the DNA of most organisms is negatively supercoiled.


 The simple figure eight is the simplest supercoil, and is the shape a circular
DNA assumes to accommodate one too many or one too few helical twists.
 DNA supercoiling is important for DNA packaging within all cells.

Key Terms
 supercoiling: The coiling of the DNA helix upon itself; can cause disruption
to transcription and lead to cell death.
 DNA: Deoxyribonucleic acid, a nucleic acid biopolymer built from nucleotides
that carry four different chemical groups, called bases: adenine, guanine,
cytosine, and thymine.
 chromosome: A structure in the cell nucleus that contains DNA, histone
protein, and other structural proteins.

DNA supercoiling refers to the over- or under-winding of a DNA strand, and is an
expression of the strain on that strand. Supercoiling is important in a number of
biological processes, such as compacting DNA. Additionally, certain enzymes
such as topoisomerases are able to change DNA topology to facilitate functions
such as DNA replication or transcription. Mathematical expressions are used to
describe supercoiling by comparing different coiled states to relaxed B-form DNA.
Supercoiled Structure of Circular DNA: This is a supercoiled structure of circular
DNA molecules with low writhe. Note that the helical nature of the DNA duplex is
omitted for clarity.

As a general rule, the DNA of most organisms is negatively supercoiled.

In a “relaxed” double-helical segment of B-DNA, the two strands twist around the
helical axis once every 10.4 to 10.5 base pairs of sequence. Adding or subtracting
twists, as some enzymes can do, imposes strain. If a DNA segment under twist
strain were closed into a circle by joining its two ends and then allowed to move
freely, the circular DNA would contort into a new shape, such as a simple figure-
eight. Such a contortion is a supercoil.
The simple figure eight is the simplest supercoil, and is the shape a circular DNA
assumes to accommodate one too many or one too few helical twists. The two
lobes of the figure eight will appear rotated either clockwise or counterclockwise
with respect to one another, depending on whether the helix is overwound or
underwound. For each additional helical twist being accommodated, the lobes will
show one more rotation about their axis.

The noun form “supercoil” is rarely used in the context of DNA topology. Instead,
global contortions of a circular DNA, such as the rotation of the figure-eight lobes
above, are referred to as writhe. The above example illustrates that twist and writhe
are interconvertible. “Supercoiling” is an abstract mathematical property
representing the sum of twist and writhe. The twist is the number of helical turns in
the DNA and the writhe is the number of times the double helix crosses over on
itself (these are the supercoils).
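
The relationship between twist and writhe described above can be sketched numerically. The following is a minimal illustration, assuming the standard topological bookkeeping in which the sum of twist and writhe is called the linking number (Lk) and the relative deviation from the relaxed state is called the superhelical density; those names, the function names, and the 5,250 bp example plasmid are assumptions introduced here and do not come from the text. The value of 10.5 base pairs per turn is the upper end of the range quoted for relaxed B-DNA.

```python
# Sketch only: names and the example plasmid are illustrative assumptions.
BP_PER_TURN = 10.5  # relaxed B-DNA: one helical turn per 10.4 to 10.5 bp


def relaxed_linking_number(n_bp, bp_per_turn=BP_PER_TURN):
    """Number of helical turns in a relaxed circular DNA of n_bp base pairs."""
    return n_bp / bp_per_turn


def writhe_from(linking_number, twist):
    """Writhe is whatever part of the linking number is not taken up by twist."""
    return linking_number - twist


def superhelical_density(linking_number, n_bp, bp_per_turn=BP_PER_TURN):
    """Relative over- or under-winding compared with the relaxed state."""
    lk0 = relaxed_linking_number(n_bp, bp_per_turn)
    return (linking_number - lk0) / lk0


# Hypothetical 5,250 bp plasmid that is underwound by 30 turns.
lk0 = relaxed_linking_number(5250)          # relaxed number of turns
sigma = superhelical_density(470, 5250)     # negative value: negative supercoiling
writhe = writhe_from(470, twist=500)        # twist kept relaxed, strain goes to writhe
print(lk0, sigma, writhe)
```

A negative superhelical density corresponds to the negative supercoiling that the text describes as typical for most organisms; the same deficit could equally be absorbed as reduced twist, which is the sense in which twist and writhe are interconvertible.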

Extra helical twists are positive and lead to positive supercoiling, while subtractive
twisting causes negative supercoiling. Many topoisomerase enzymes sense
supercoiling and either generate or dissipate it as they change DNA topology.
DNA of most organisms is negatively supercoiled.

In part because chromosomes may be very large, segments in the middle may act
as if their ends are anchored. As a result, they may be unable to distribute excess
twist to the rest of the chromosome or to absorb twist to recover from
underwinding—the segments may become supercoiled, in other words. In response
to supercoiling, they will assume an amount of writhe, just as if their ends were
joined.

Supercoiled DNA forms one of two structures: a plectoneme or a toroid, or a combination
of both. A negatively supercoiled DNA molecule will produce either a one-start
left-handed helix, the toroid, or a two-start right-handed helix with terminal loops,
the plectoneme. Plectonemes are typically more common in nature, and this is the
shape most bacterial plasmids will take. For larger molecules, it is common for
hybrid structures to form – a loop on a toroid can extend into a plectoneme. If all
the loops on a toroid extend, it becomes a branch point in the plectonemic
structure.

The Importance of DNA supercoiling

DNA supercoiling is important for DNA packaging within all cells. Because the
length of DNA can be thousands of times that of a cell, packaging this genetic
material into the cell or nucleus (in eukaryotes) is a difficult feat. Supercoiling of
DNA reduces the space and allows for much more DNA to be packaged. In
prokaryotes, plectonemic supercoils are predominant, because of the circular
chromosome and relatively small amount of genetic material. In eukaryotes, DNA
supercoiling exists on many levels of both plectonemic and solenoidal supercoils,
with the solenoidal supercoiling proving the most effective in compacting the
DNA. Solenoidal supercoiling is achieved with histones to form a 10 nm fiber.
This fiber is further coiled into a 30 nm fiber, and further coiled upon itself
numerous times more.

DNA packaging is greatly increased during nuclear division events such as mitosis
or meiosis, where DNA must be compacted and segregated to daughter cells.
Condensins and cohesins are structural maintenance of chromosome (SMC)
proteins that aid in the condensation of sister chromatids and the linkage of the
centromere in sister chromatids. These SMC proteins induce positive supercoils.

Supercoiling is also required for DNA and RNA synthesis. Because DNA must be
unwound for DNA and RNA polymerase action, supercoils will result. The region
ahead of the polymerase complex will be unwound; this stress is compensated with
positive supercoils ahead of the complex. Behind the complex, DNA is rewound
and there will be compensatory negative supercoils. It is important to note that
topoisomerases such as DNA gyrase (a type II topoisomerase) play a role in
relieving some of the stress during DNA and RNA synthesis.

Size Variation and ORF Contents in Genomes

An open reading frame (ORF) is the part of a reading frame that contains no stop
codons; ORF content and overall genome size both vary among bacterial genomes.

LEARNING OBJECTIVES

Explain prokaryotic genome size variation and ORFs


KEY TAKEAWAYS

Key Points

 Open reading frames are used as one piece of evidence to assist in gene
prediction.
 If a portion of a genome has been sequenced, ORFs can be located by
examining each of the three possible reading frames on each strand.
 Bacterial genomes display variation in size, even among strains of the same
species.

Key Terms

 gene: A unit of heredity; a segment of DNA or RNA that is transmitted from
one generation to the next. It carries genetic information such as the sequence
of amino acids for a protein.
 codons: The genetic code is the set of rules by which information encoded
within genetic material (DNA or mRNA sequences) is translated into proteins
(amino acid sequences) by living cells. Biological decoding is accomplished
by the ribosome, which links amino acids in an order specified by mRNA,
using transfer RNA (tRNA) molecules to carry amino acids and to read the
mRNA three nucleotides at a time. The genetic code is highly similar among
all organisms, and can be expressed in a simple table with 64 entries.
 open reading frame: A sequence of DNA triplets, between the initiator and
terminator codons, that can be transcribed into mRNA and later translated
into protein.

In molecular genetics, an open reading frame (ORF) is the part of a reading frame
that contains no stop codons. The transcription termination pause site is located
after the ORF, beyond the translation stop codon, because if transcription were to
cease before the stop codon, an incomplete protein would be made during
translation.

Insertions that interrupt the reading frame downstream of the start codon cause a
frameshift mutation and shift the positions at which stop codons are read.

Open reading frames are used as one piece of evidence to assist in gene prediction.
Long ORFs are often used, along with other evidence, to initially identify
candidate protein coding regions in a DNA sequence. The presence of an ORF
does not necessarily mean that the region is ever translated. For example, in a
randomly generated DNA sequence with an equal percentage of each nucleotide, a
stop-codon would be expected once every 21 codons. A simple gene prediction
algorithm for prokaryotes might look for a start codon followed by an open reading
frame that is long enough to encode a typical protein, where the codon usage of
that region matches the frequency characteristic for the given organism's coding
regions. Even a long open reading frame by itself is not conclusive evidence for the
presence of a gene.
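
The "once every 21 codons" figure and the simple prediction rule described above can be sketched as follows. This is a minimal illustration, not a real gene finder: it assumes ATG as the only start codon, ignores codon usage, scans only the given strand, and the function names and the `min_codons` threshold are hypothetical.

```python
# Sketch only: ATG-only starts, forward strand only, no codon-usage scoring.
STOP_CODONS = {"TAA", "TAG", "TGA"}
TOTAL_CODONS = 4 ** 3  # 64 possible triplets


def expected_codons_between_stops():
    """With equal base frequencies, 3 of 64 random codons are stops."""
    return TOTAL_CODONS / len(STOP_CODONS)


def prob_open_by_chance(n_codons):
    """Chance that n random codons in a row contain no stop codon."""
    return (1 - len(STOP_CODONS) / TOTAL_CODONS) ** n_codons


def find_orfs(seq, min_codons=3):
    """Return (start, end) index pairs for ATG...stop stretches, end exclusive.

    min_codons counts the codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != "ATG":
                i += 3
                continue
            j = i
            # Walk forward codon by codon until a stop or the end of the sequence.
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                j += 3
            found_stop = j + 3 <= len(seq)
            if found_stop and (j - i) // 3 >= min_codons:
                orfs.append((i, j + 3))
            i = j + 3
    return orfs


print(expected_codons_between_stops())   # about 21.3
print(prob_open_by_chance(100))          # below 1 percent
print(find_orfs("CCATGAAATTTGGGTAACC"))  # one ORF in the third frame
```

Because a stretch of 100 random codons stays open by chance less than 1% of the time, long ORFs are useful candidates, but as the text notes they remain evidence rather than proof of a gene.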

Open Reading Frames: Frame +1 is the ORF predicted in the database to encode a
protein. +2 and +3 are the other two potential ORFs in the same strand and -1, -2,
and -3 are the three potential ORFs in the antisense strand.

If a portion of a genome has been sequenced (e.g. 5′-ATCTAAAATGGGTGCC-3′),
ORFs can be located by examining each of the three possible reading frames
on each strand. In this sequence two out of three possible reading frames are
entirely open, meaning that they do not contain a stop codon:

…A TCT AAA ATG GGT GCC…

…AT CTA AAA TGG GTG CC…

…ATC TAA AAT GGG TGC C…

Possible stop codons in DNA are “TGA”, “TAA”, and “TAG”. Thus, the last
reading frame in this example contains a stop codon (TAA), unlike the first two.
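
The frame-by-frame check in this example can be reproduced with a short script. This is a minimal sketch; the helper names are illustrative, and a frame is called "open" here simply when none of its complete codons is a stop codon. The reverse complement is included so the same check can be run on the opposite strand.

```python
# Sketch only: helper names are illustrative, not from any library.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def codons_in_frame(seq, offset):
    """Complete codons read from the given offset (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def frame_is_open(seq, offset):
    """True when the frame contains no stop codon."""
    return not any(codon in STOP_CODONS for codon in codons_in_frame(seq, offset))


def open_frames(seq):
    """Offsets of all frames on this strand that contain no stop codon."""
    return [offset for offset in range(3) if frame_is_open(seq, offset)]


example = "ATCTAAAATGGGTGCC"
for offset in range(3):
    print(offset, codons_in_frame(example, offset), frame_is_open(example, offset))

forward_open = open_frames(example)
reverse_open = open_frames(reverse_complement(example))
print(forward_open, reverse_open)
```

On the given strand, the frames starting at offsets 1 and 2 are open, while the frame starting at offset 0 (ATC TAA ...) contains the stop codon TAA, in agreement with the three readings listed above.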

Bacterial genomes display variation in size, even among strains of the same
species. These microorganisms have very little noncoding or repetitive DNA, so
variation in their genome size usually reflects differences in gene repertoire.
Some species, particularly bacterial parasites and symbionts, have undergone
massive genome reduction and simply contain a subset of the genes present in their
ancestors.
However, in free-living bacteria, such gene loss cannot explain the observed
disparities in genome size because ancestral genomes would have had to contain
improbably large numbers of genes. Surprisingly, a substantial fraction of the
difference in gene contents in free-living bacteria is due to the presence of
ORFans, that is, open reading frames (ORFs) that have no known homologs and
are consequently of no known function.

The high numbers of ORFans in bacterial genomes indicate that, with the
exception of those species with highly reduced genomes, much of the observed
diversity in gene inventories does not result from either the loss of ancestral genes
or the transfer from well-characterized organisms (processes that result in a patchy
distribution of orthologs but not in unique genes) or from recent duplications
(which would likely yield homologs within the same or closely related genome).

Bioinformatic Analyses and Gene Distributions

Bioinformatics is the study of methods for storing, retrieving and analyzing
biological data.

LEARNING OBJECTIVES

Describe the purposes and applications of bioinformatics


KEY TAKEAWAYS

Key Points

 The primary goal of bioinformatics is to increase the understanding of
biological processes.
 Bioinformatics entails the creation and advancement of databases,
algorithms, computational and statistical techniques and theory to solve
problems arising from the management and analysis of biological data.
 Gene Ontology, or GO, is a major bioinformatics initiative to unify the
representation of gene and gene product attributes across all species.

Key Terms

 bioinformatics: Bioinformatics is a branch of biological science which deals
with the study of methods for storing, retrieving and analyzing biological data
like nucleic acid (DNA/RNA) and protein sequence, structure, function,
pathways and genetic interactions.
 algorithms: In mathematics and computer science, an algorithm is a step-by-
step procedure for calculations. Algorithms are used for calculation, data
processing, and automated reasoning.
 gene ontology: The Gene Ontology is a major bioinformatics initiative to
unify the representation of gene and gene product attributes across all species.

Bioinformatics is a branch of biological science dealing with the study of storing,
retrieving and analyzing biological data like nucleic acid (DNA/RNA) and protein
sequence, structure, function, pathways and genetic interactions. It generates new
knowledge that is useful in such fields as drug design and development of new
software tools. Bioinformatics also deals with algorithms, databases and
information systems, web technologies, artificial intelligence and soft computing,
information and computation theory, structural biology, software engineering, data
mining, image processing, modeling and simulation, discrete mathematics, control
and system theory, circuit theory, and statistics.
Map of the human X chromosome: Assembly of the human genome is one of the
greatest achievements of bioinformatics

At the beginning of the “genomic revolution,” the term bioinformatics referred to
the creation and maintenance of a database to store biological information like
nucleotide and amino acid sequences. Development of this type of database
involved not only design issues but the development of complex interfaces
whereby researchers could access existing data as well as submit new or revised
data.

In order to study how normal cellular activities are altered in different disease
states, the biological data must be combined to form a comprehensive picture of
these activities. Therefore, the field of bioinformatics has evolved such that the
most pressing task now involves the analysis and interpretation of various types of
data. This includes nucleotide and amino acid sequences, protein domains and
protein structures. The actual process of analyzing and interpreting data is referred
to as computational biology. Important sub-disciplines within bioinformatics and
computational biology include:

 the development of tools that enable efficient use of various types of
information
 the development of new algorithms (mathematical formulas) and statistics
with which to assess relationships among members of large data sets. For
example, methods to locate a gene within a sequence (gene distributions),
predict protein structure and/or function, and cluster protein sequences into
families of related sequences.

The primary goal of bioinformatics is to increase the understanding of biological
processes. What sets it apart from other approaches, however, is its focus on
developing and applying computationally intensive techniques to achieve this goal.
Examples include pattern recognition, data mining, machine learning algorithms,
and visualization. Major research efforts in the field include sequence alignment,
gene finding, genome assembly, drug design, drug discovery, protein structure
alignment, and the modeling of evolution.

Gene Ontology, or GO, is a major bioinformatics initiative to unify the
representation of gene and gene product attributes across all species. More
specifically, the project aims to:
 maintain and develop its controlled vocabulary of gene and gene product
attributes
 annotate genes and gene products and assimilate and disseminate annotation
data
 offer tools for easy access to all aspects of the data provided by the project

Organization of Eukaryotic Chromosome

Chromosome structure differs somewhat between eukaryotic and prokaryotic cells.
Eukaryotic chromosomes are typically linear, and eukaryotic cells contain multiple
distinct chromosomes. Many eukaryotic cells contain two copies of each
chromosome and, therefore, are diploid.
The length of a chromosome greatly exceeds the length of the cell, so a
chromosome needs to be packaged into a very small space to fit within the cell. For
example, the combined length of the roughly 3 billion base pairs [1] of DNA of the
haploid human genome would measure approximately 1 meter if completely stretched out
(about 2 meters for the two copies carried by a diploid cell),
and some eukaryotic genomes are many times larger than the human genome.
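
The stretched-out length can be checked with a short calculation. This sketch assumes a rise of about 0.34 nm per base pair for B-form DNA, a standard figure that is not stated in the text; on that assumption roughly 3 billion base pairs come to about 1 meter, and the roughly 6 billion base pairs of a diploid cell come to about 2 meters.

```python
# Sketch only: 0.34 nm per base pair is an assumed value for B-form DNA.
RISE_PER_BP_NM = 0.34


def contour_length_m(n_bp, rise_nm=RISE_PER_BP_NM):
    """Length in meters of n_bp base pairs of fully stretched B-form DNA."""
    return n_bp * rise_nm * 1e-9


haploid_m = contour_length_m(3e9)  # one copy of the human genome
diploid_m = contour_length_m(6e9)  # two copies, as in a somatic cell
print(round(haploid_m, 2), round(diploid_m, 2))
```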
DNA supercoiling refers to the process by which DNA is twisted to fit inside the
cell. Supercoiling may result in DNA that is either underwound (less than one turn
of the helix per 10 base pairs) or overwound (more than one turn per 10 base pairs)
from its normal relaxed state. Proteins known to be involved in supercoiling
include topoisomerases; these enzymes help maintain the structure of supercoiled
chromosomes, preventing overwinding of DNA during certain cellular processes
like DNA replication.
During DNA packaging, DNA-binding proteins called histones perform various
levels of DNA wrapping and attachment to scaffolding proteins. The combination
of DNA with these attached proteins is referred to as chromatin. In eukaryotes, the
packaging of DNA by histones may be influenced by environmental factors that
affect the presence of methyl groups on certain cytosine nucleotides of DNA. The
influence of environmental factors on DNA packaging is called epigenetics.
Epigenetics is another mechanism for regulating gene expression without altering
the sequence of nucleotides. Epigenetic changes can be maintained through
multiple rounds of cell division and, therefore, can be heritable.

The Complexity of Eukaryotic Genomes


The genomes of most eukaryotes are larger and more complex than those of
prokaryotes (Figure 4.1). This larger size of eukaryotic genomes is not inherently
surprising, since one would expect to find more genes in organisms that are more
complex. However, the genome size of many eukaryotes does not appear to be
related to genetic complexity. For example, the genomes of salamanders and lilies
contain more than ten times the amount of DNA that is in the human genome, yet
these organisms are clearly not ten times more complex than humans.

Figure 4.1
Genome size. The range of sizes of the genomes of representative groups of
organisms is shown on a logarithmic scale.
This apparent paradox was resolved by the discovery that the genomes of
most eukaryotic cells contain not only functional genes but also large amounts
of DNA sequences that do not code for proteins. The difference in the sizes of the
salamander and human genomes thus reflects larger amounts of non-coding DNA,
rather than more genes, in the genome of the salamander. The presence of large
amounts of noncoding sequences is a general property of the genomes of complex
eukaryotes. Thus, the thousandfold greater size of the human genome compared to
that of E. coli is not due solely to a larger number of human genes. The human
genome was once thought to contain approximately 100,000 genes, but current
estimates are close to 20,000 protein-coding genes, only about five times
more than E. coli has. Much of the complexity of eukaryotic genomes thus results
from the abundance of several different types of noncoding sequences, which
constitute most of the DNA of higher eukaryotic cells.

Introns and Exons


In molecular terms, a gene can be defined as a segment of DNA that is expressed
to yield a functional product, which may be either an RNA (e.g., ribosomal and
transfer RNAs) or a polypeptide. Some of the noncoding DNA in eukaryotes is
accounted for by long DNA sequences that lie between genes (spacer sequences).
However, large amounts of noncoding DNA are also found within most eukaryotic
genes. Such genes have a split structure in which segments of coding sequence
(called exons) are separated by noncoding sequences (intervening sequences,
or introns) (Figure 4.2). The entire gene is transcribed to yield a long RNA
molecule and the introns are then removed by splicing, so only exons are included
in the mRNA. Although most introns have no known function, they account for a
substantial fraction of DNA in the genomes of higher eukaryotes.

Figure 4.2
The structure of eukaryotic genes. Most eukaryotic genes contain segments of
coding sequences (exons) interrupted by noncoding sequences (introns). Both
exons and introns are transcribed to yield a long primary RNA transcript. The
introns are then removed (more...)
Introns were first discovered in 1977, independently in the laboratories of Phillip
Sharp and Richard Roberts, during studies of the replication of adenovirus in
cultured human cells. Adenovirus is a useful model for studies of gene expression,
both because the viral genome is only about 3.5 × 10^4 base pairs long and because
adenovirus mRNAs are produced at high levels in infected cells. One approach
used to characterize the adenovirus mRNAs was to determine the locations of the
corresponding viral genes by examination of RNA-DNA hybrids in the electron
microscope. Because RNA-DNA hybrids are distinguishable from single-stranded
DNA, the positions of RNA transcripts on a DNA molecule can be determined.
Surprisingly, such experiments revealed that adenovirus mRNAs do not hybridize
to only a single region of viral DNA (Figure 4.3). Instead, a single mRNA
molecule hybridizes to several separated regions of the viral genome. Thus, the
adenovirus mRNA does not correspond to an uninterrupted transcript of the
template DNA; rather the mRNA is assembled from several distinct blocks of
sequences that originated from different parts of the viral DNA. This was
subsequently shown to occur by RNA splicing, which will be discussed in detail in
Chapter 6.

Figure 4.3
Identification of introns in adenovirus mRNA. (A) The gene encoding the
adenovirus hexon (a major structural protein of the viral particle) consists of four
exons, interrupted by three introns. (B) This tracing illustrates an electron
micrograph of a (more...)
Soon after the discovery of introns in adenovirus, similar observations were made
on cloned genes of eukaryotic cells. For example, electron microscopic analysis
of RNA-DNA hybrids and subsequent nucleotide sequencing of cloned genomic
DNAs and cDNAs indicated that the coding region of the mouse β-
globin gene (which encodes the β subunit of hemoglobin) is interrupted by two
introns that are removed from the mRNA by splicing (Figure 4.4). The intron-
exon structure of many eukaryotic genes is quite complicated, and the amount of
DNA in the intron sequences is often greater than that in the exons. The chicken
ovalbumin gene, for example, contains eight exons and seven introns distributed
over approximately 7700 base pairs (7.7 kilobases, or kb) of genomic DNA. The
exons total only about 1.9 kb, so approximately 75% of the gene consists of
introns. An extreme example is the human gene that encodes the blood clotting
protein factor VIII. This gene spans approximately 186 kb of DNA and is divided
into 26 exons. The mRNA is only about 9 kb long, so the gene contains introns
totaling more than 175 kb. On average, introns are estimated to account for about
ten times more DNA than exons in the genes of higher eukaryotes.
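
The intron fractions quoted for these two genes follow directly from the sizes given above. A minimal sketch of the arithmetic, using only the figures in the text (the function name is illustrative, and the factor VIII exon total is taken to be the quoted mRNA length):

```python
# Sketch only: sizes in kilobases are the approximate values quoted in the text.
def intron_fraction(gene_kb, exon_kb):
    """Fraction of a gene's length that is made up of introns."""
    return (gene_kb - exon_kb) / gene_kb


ovalbumin = intron_fraction(gene_kb=7.7, exon_kb=1.9)
factor_viii = intron_fraction(gene_kb=186, exon_kb=9)
factor_viii_intron_kb = 186 - 9

print(round(ovalbumin, 2))      # about three quarters of the gene
print(round(factor_viii, 2))    # well over 90 percent of the gene
print(factor_viii_intron_kb)    # more than 175 kb of intron sequence
```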

Figure 4.4
The mouse β-globin gene. This gene contains two introns, which divide the coding
region among three exons. Exon 1 encodes amino acids 1 to 30, exon 2 encodes
amino acids 31 to 104, and exon 3 encodes amino acids 105 to 146. Exons 1 and 3
also (more...)
Introns are present in most genes of complex eukaryotes, although they are not
universal. Almost all histone genes, for example, lack introns, so introns are clearly
not required for gene function in eukaryotic cells. In addition, introns are not found
in most genes of simple eukaryotes, such as yeasts. Conversely, introns are present
in rare genes of prokaryotes. The presence or absence of introns is therefore not an
absolute distinction between prokaryotic and eukaryotic genes, although introns
are much more prevalent in higher eukaryotes (both plants and animals), where
they account for a substantial amount of total genomic DNA.
Most introns have no known cellular function, although a few have been found to
encode functional RNAs or proteins. Introns are generally thought to represent
remnants of sequences that were important earlier in evolution. In particular,
introns may have helped accelerate evolution by
facilitating recombination between protein-coding regions (exons) of different
genes—a process known as exon shuffling. Exons frequently encode functionally
distinct protein domains, so recombination between introns of different genes
would result in new genes containing novel combinations of protein-coding
sequences. As predicted by this hypothesis, DNA sequencing studies have
demonstrated that some genes are chimeras of exons derived from several other
genes, providing direct evidence that new genes can be formed by recombination
between intron sequences.
It appears most likely that introns were present early in evolution, prior to the
divergence of prokaryotic and eukaryotic cells. According to this hypothesis,
introns played an important role in the initial assembly of protein-coding sequences
in the ancient ancestors of present-day cells. Introns were subsequently lost from
most genes of prokaryotes and simpler eukaryotes (e.g., yeasts) in response to
evolutionary selection for rapid replication, which led to streamlining the genomes
of these organisms. However, since rapid cell division is not an advantage to
higher eukaryotes, introns have been retained in their genomes. Alternatively,
introns may have arisen later in evolution as a result of the insertion
of DNA sequences into genes that had already been formed as continuous protein-
coding sequences. Exon shuffling would then have played an important role in the
further evolution of genes in higher eukaryotes but would not account for the initial
assembly of protein-coding sequences prior to the evolutionary divergence of
prokaryotic and eukaryotic cells.

Gene Families and Pseudogenes


Another factor contributing to the large size of eukaryotic genomes is that some
genes are repeated many times. Whereas most prokaryotic genes are represented
only once in the genome, many eukaryotic genes are present in multiple copies,
called gene families. In some cases, multiple copies of genes are needed to
produce RNAs or proteins required in large quantities, such as ribosomal RNAs
or histones. In other cases, distinct members of a gene family may be transcribed in
different tissues or at different stages of development. For example, the α and β
subunits of hemoglobin are both encoded by gene families in the human genome,
with different members of these families being expressed in embryonic, fetal, and
adult tissues (Figure 4.5). Members of many gene families (e.g., the globin genes)
are clustered within a region of DNA; members of other gene families are
dispersed to different chromosomes.

Figure 4.5
Globin gene families. Members of the human α- and β-globin gene families are
clustered on chromosomes 16 and 11, respectively. Each family contains genes that
are specifically expressed in embryonic, fetal, and adult tissues, in
addition (more...)
Gene families are thought to have arisen by duplication of an original
ancestral gene, with different members of the family then diverging as a
consequence of mutations during evolution. Such divergence can lead to the
evolution of related proteins that are optimized to function in different tissues or at
different stages of development. For example, fetal globins have a higher affinity
for O2 than do adult globins, a difference that allows the fetus to obtain O2 from
the maternal circulation.
As might be expected, however, not all mutations enhance gene function. Some
gene copies have instead sustained mutations that result in their loss of ability to
produce a functional gene product. For example, the human α- and β-globin gene
families each contain two genes that have been inactivated by mutations. Such
nonfunctional gene copies (called pseudogenes) represent evolutionary relics that
significantly increase the size of eukaryotic genomes without making a functional
genetic contribution.

Repetitive DNA Sequences


A substantial portion of eukaryotic genomes consists of highly repeated
noncoding DNA sequences. These sequences, sometimes present in hundreds of
thousands of copies per genome, were first demonstrated by Roy Britten and David
Kohne during studies of the rates of reassociation of denatured fragments of
cellular DNAs (Figure 4.6). Denatured strands of DNA hybridize to each other
(reassociate), re-forming double-stranded molecules (see Figure 3.28). Since DNA
reassociation is a bimolecular reaction (two separated strands of denatured DNA
must collide with each other in order to hybridize), the rate of reassociation
depends on the concentration of DNA strands. When fragments of E. coli DNA
were denatured and allowed to hybridize with each other, all of the DNA
reassociated at the same rate, as expected if each DNA sequence were represented
once per genome. However, reassociation of fragments of DNA extracted from
mammalian cells showed a very different pattern. Approximately 60% of the DNA
fragments reassociated at the rate expected for sequences present once per genome,
but the remainder reassociated much more rapidly than expected. The
interpretation of these results was that some sequences were present in multiple
copies and therefore reassociated more rapidly than those sequences that were
represented only once per genome. In particular, these experiments indicated that
approximately 40% of mammalian DNA consists of highly repetitive sequences,
some of which are repeated 10^5 to 10^6 times.
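
The link between copy number and reassociation rate can be sketched numerically. The snippet below is a minimal illustration, assuming ideal second-order kinetics in which the fraction of DNA still single-stranded is 1 / (1 + k × C0t). The rate constant K_SINGLE_COPY is a made-up value in arbitrary units, not a measurement from the experiments described above.

```python
def fraction_single_stranded(c0t: float, k: float) -> float:
    """Ideal second-order reassociation: C/C0 = 1 / (1 + k * C0t)."""
    return 1.0 / (1.0 + k * c0t)


def half_c0t(k: float) -> float:
    """C0t value at which half of the strands have reassociated."""
    return 1.0 / k


# Hypothetical rate constant for a single-copy sequence (arbitrary units).
K_SINGLE_COPY = 1.0e-3

for copies in (1, 10**5, 10**6):
    # A sequence present in n copies is n times more concentrated,
    # so its effective rate constant is taken to be n times larger.
    k = K_SINGLE_COPY * copies
    print(f"{copies:>9} copies: half-reassociation at C0t = {half_c0t(k):.3g}")
```

Under this simplified model, a sequence repeated n times reaches half-reassociation at a C0t value n times lower than a single-copy sequence, which is why the repeated fraction of mammalian DNA stands out as a fast-reassociating component.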

Figure 4.6
Identification of repetitive sequences by DNA reassociation. The kinetics of the
reassociation of fragments of E. coli and bovine DNAs are illustrated as a function
of C0t, which is the initial concentration of DNA multiplied by the time of
incubation.
Further analysis has identified several types of these highly repeated sequences.
One class (called simple-sequence DNA) contains tandem arrays of thousands of
copies of short sequences, ranging from 5 to 200 nucleotides. For example, one
type of simple-sequence DNA in Drosophila consists of tandem repeats of the
seven nucleotide unit ACAAACT. Because of their distinct base compositions,
many simple-sequence DNAs can be separated from the rest of the genomic DNA
by equilibrium centrifugation in CsCl density gradients. The density of DNA is
determined by its base composition, with AT-rich sequences being less dense than
GC-rich sequences. Therefore, an AT-rich simple-sequence DNA bands in CsCl
gradients at a lower density than the bulk of Drosophila genomic DNA (Figure
4.7). Since such repeat-sequence DNAs band as “satellites” separate from the main
band of DNA, they are frequently referred to as satellite DNAs. These sequences
are repeated millions of times per genome, accounting for 10 to 20% of the DNA
of most higher eukaryotes. Simple-sequence DNAs are not transcribed and do not
convey functional genetic information. Some, however, may play important roles
in chromosome structure.
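
As a small illustration of the two properties just described (tandem repetition and skewed base composition), the sketch below counts back-to-back copies of the Drosophila unit ACAAACT in a toy sequence and computes the unit's AT fraction. The flanking bases in the toy sequence are invented for the example, and the function names are illustrative only.

```python
def at_fraction(seq: str) -> float:
    """Fraction of bases in seq that are A or T."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def tandem_copies(seq: str, unit: str) -> int:
    """Longest run of back-to-back copies of unit found anywhere in seq."""
    best = 0
    for start in range(len(seq)):
        count = 0
        pos = start
        while seq.startswith(unit, pos):
            count += 1
            pos += len(unit)
        best = max(best, count)
    return best


unit = "ACAAACT"
toy = "GGC" + unit * 4 + "TTG"  # made-up flanking bases around four repeats

print(round(at_fraction(unit), 2))  # 5 of the 7 bases are A or T
print(tandem_copies(toy, unit))
```

Five of the seven bases in the repeat unit are A or T, so long arrays of it are AT-rich, consistent with the lower buoyant density described above.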
Figure 4.7
Satellite DNA. Equilibrium centrifugation of Drosophila DNA in a CsCl gradient
separates satellite DNAs (designated I–IV) with buoyant densities (in g/cm^3) of
1.672, 1.687, and 1.705 from the main band of genomic DNA (buoyant density
1.701).
Other repetitive DNA sequences are scattered throughout the genome rather than
being clustered as tandem repeats. These sequences are classified as SINEs (short
interspersed elements) or LINEs (long interspersed elements). The major SINEs in
mammalian genomes are Alu sequences, so called because they usually contain a
single site for the restriction endonuclease AluI. Alu sequences are approximately
300 base pairs long, and about a million such sequences are dispersed throughout
the genome, accounting for nearly 10% of the total cellular DNA.
Although Alu sequences are transcribed into RNA, they do not encode proteins and
their function is unknown. The major human LINEs (which belong to the LINE 1,
or L1, family) are about 6000 base pairs long and repeat approximately 50,000
times in the genome. L1 sequences are transcribed and at least some encode
proteins, but like Alu sequences, they have no known function in cell physiology.
Both Alu and L1 sequences are examples of transposable elements, which are
capable of moving to different sites in genomic DNA (see Chapter 5). Some of
these sequences may help regulate gene expression, but most Alu and L1 sequences
appear not to make a useful contribution to the cell. They may, however, have
played important evolutionary roles by contributing to the generation of genetic
diversity.
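
The genome shares quoted for these elements follow from simple arithmetic. The sketch below is a rough check, assuming a human genome of about 3 × 10^9 base pairs and treating every copy as full length, which is a simplification.

```python
GENOME_BP = 3e9  # approximate human genome size, in base pairs


def genome_fraction(element_bp: float, copies: float,
                    genome_bp: float = GENOME_BP) -> float:
    """Share of the genome occupied by `copies` elements of length element_bp."""
    return element_bp * copies / genome_bp


alu = genome_fraction(300, 1_000_000)  # ~300 bp, about a million copies
l1 = genome_fraction(6000, 50_000)     # ~6000 bp, about 50,000 copies

print(f"Alu: {alu:.0%} of the genome")
print(f"L1:  {l1:.0%} of the genome")
```

With these round numbers, Alu and L1 sequences each come to roughly a tenth of the genome.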

The Number of Genes in Eukaryotic Cells


Having discussed several kinds of noncoding DNA that contribute to the genomic
complexity of higher eukaryotes, it is of interest to consider the total number of
genes in eukaryotic genomes (Table 4.1). Assuming that the average polypeptide is
approximately 400 amino acids long, the average size of the coding sequence of
a gene is 1200 base pairs. In bacterial genomes, most of the DNA encodes proteins.
For example, the genome of E. coli is approximately 4.6 × 10^6 base pairs long and
contains 4288 genes, with nearly 90% of the DNA used as protein-coding
sequence.

Table 4.1
The Numbers of Genes in Cellular Genomes.
The yeast genome, which consists of 12 × 10^6 base pairs, is about 2.5 times the size
of the genome of E. coli, but is still extremely compact. Only 4% of the genes
of Saccharomyces cerevisiae contain introns, and these usually have only a single
small intron near the start of the coding sequence. The average gene in yeast spans
about 2000 base pairs, and approximately 70% of the yeast genome is used as
protein-coding sequence, specifying a total of about 6000 proteins.
The genome of the nematode C. elegans, a relatively simple animal genome, is
intermediate in size and complexity between the genomes of yeast and mammals.
The C. elegans genome is 97 × 10^6 base pairs and contains approximately 19,000
protein-coding genes. Thus, while the genome of C. elegans is 8 times larger than
that of yeast, it contains only about three times the number of genes. This
correlates with the presence of a substantial number of introns in C. elegans.
Each gene in C. elegans contains an average of five introns and spans an average
of 5000 bases. Consistent with this, only about 25% of the C. elegans sequence
corresponds to exons, versus 70% protein-coding sequence in the yeast genome.
Although the genome of Drosophila is 180 × 10^6 base pairs, Drosophila contains
fewer genes than C. elegans (about 13,600). Protein-coding sequence thus
corresponds to only about 13% of the Drosophila genome.
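
One way to see the trend across these four genomes is to divide genome size by gene count. The sketch below does this using only the approximate figures quoted in the text. The result is the average amount of DNA per gene, including any noncoding sequence, not the length of the genes themselves.

```python
# (genome size in base pairs, approximate gene count) as quoted in the text
GENOMES = {
    "E. coli": (4.6e6, 4288),
    "S. cerevisiae": (12e6, 6000),
    "C. elegans": (97e6, 19000),
    "Drosophila": (180e6, 13600),
}


def bp_per_gene(genome_bp: float, genes: int) -> float:
    """Average stretch of genome available per gene."""
    return genome_bp / genes


for name, (size, genes) in GENOMES.items():
    print(f"{name:<14} {bp_per_gene(size, genes):>8,.0f} bp per gene")
```

The DNA available per gene rises steadily from E. coli to Drosophila, reflecting the growing share of introns and other noncoding sequence.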
The genomes of higher animals (such as humans) are still more complex and
contain large amounts of noncoding DNA. Thus, only a small fraction of the 3 ×
10^9 base pairs of the human genome is expected to correspond to protein-coding
sequence. Approximately one-third of the genome corresponds to highly repetitive
sequences, leaving an estimated 2 × 10^9 base pairs for functional genes,
pseudogenes, and nonrepetitive spacer sequences. If the average gene spans
10,000–20,000 base pairs (including introns), one might expect the human genome
to consist of about 100,000 genes, with protein-coding sequences corresponding to
only about 3% of human DNA. Although this estimate is generally accepted as
plausible, it remains to be verified or corrected by the final results of human
genome sequencing.
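
The reasoning behind this estimate can be reproduced as a back-of-the-envelope calculation. The sketch below uses only the round numbers given in the text (2 × 10^9 nonrepetitive base pairs, a gene span of 10,000 to 20,000 base pairs, and 1200 base pairs of coding sequence per gene). It is an illustration of the arithmetic, not a prediction.

```python
GENOME_BP = 3e9            # total human genome
NONREPETITIVE_BP = 2e9     # left after removing highly repetitive sequences
CODING_BP_PER_GENE = 1200  # 400 amino acids x 3 bases per codon


def estimated_genes(available_bp: float, gene_span_bp: float) -> float:
    """Number of genes that fit if each occupies gene_span_bp."""
    return available_bp / gene_span_bp


def coding_fraction(genes: float) -> float:
    """Share of the whole genome taken up by coding sequence."""
    return genes * CODING_BP_PER_GENE / GENOME_BP


for span in (10_000, 20_000):
    print(f"gene span {span:>6} bp -> {estimated_genes(NONREPETITIVE_BP, span):,.0f} genes")

print(f"coding share for 100,000 genes: {coding_fraction(100_000):.0%}")
```

The figure of about 100,000 genes corresponds to the upper end of the gene-span range, and the coding share comes out at a few percent, the same order as the value quoted above.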
Human Genome:
Bioinformatics: Introduction

When the Human Genome Project began in 1990, it was understood that to meet the
project's goals, the speed of DNA sequencing would have to increase and the cost
would have to come down. Over the life of the project virtually every aspect of
DNA sequencing was improved. It took the project approximately four years to
sequence its first one billion bases but just four months to sequence the second
billion bases.

During the month of January, 2003, 1.5 billion bases were sequenced. As the speed
of DNA sequencing increased, the cost decreased from 10 dollars per base in 1990
to 10 cents per base at the conclusion of the project in April 2003. Although the
Human Genome Project is officially over, improvements in DNA sequencing
continue to be made. Researchers are experimenting with new methods for
sequencing DNA that have the potential to sequence a human genome in just a
matter of weeks for a few thousand dollars.

DNA sequencing performed on an industrial scale has produced a vast amount of
data to analyze. In August 2005 it was announced that the three largest
public collections of DNA and RNA sequences together store one hundred billion
bases, representing over 165,000 different organisms. As sequence data began to
pile up, the need for new and better methods of sequence analysis was critical.
Bioinformatics is the branch of biology that is concerned with the acquisition,
storage, and analysis of the information found in nucleic acid and protein sequence
data. Computers and bioinformatics software are the tools of the trade.

Genetic data represent a treasure trove for researchers and companies interested
in how genes contribute to our health and well-being. Almost half of the genes
identified by the Human Genome Project have no known function. Researchers are
using bioinformatics to identify genes, establish their functions, and develop
gene-based strategies for preventing, diagnosing, and treating disease.

A DNA sequencing reaction produces a sequence that is several hundred bases long.
Gene sequences typically run for thousands of bases. The largest known gene is
that associated with Duchenne muscular dystrophy. It is approximately 2.4 million
bases in length. In order to study genes, scientists first assemble long DNA
sequences from series of shorter overlapping sequences.
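
The idea of assembling a long sequence from overlapping reads can be sketched in a few lines. The example below is a deliberately simplified illustration: it assumes the reads are error-free, all from the same strand, and already listed in left-to-right order, none of which holds for real sequencing data. The reads and function names are invented for the example.

```python
def overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(reads: list[str], min_len: int = 3) -> str:
    """Join reads that are already in order by collapsing their overlaps."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap_length(contig, read, min_len)
        if size == 0:
            raise ValueError(f"no overlap of at least {min_len} bases with {read!r}")
        contig += read[size:]  # append only the part not already covered
    return contig


reads = ["ATGGCGTAC", "GTACCTTAG", "TTAGGACCA"]  # toy reads with 4-base overlaps
print(merge_reads(reads))  # ATGGCGTACCTTAGGACCA
```

Real assembly programs must also work out the order and orientation of the reads and tolerate sequencing errors, which is where most of their complexity lies.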

Scientists enter their assembled sequences into genetic databases so that other
scientists may use the data. Since the sequences of the two DNA strands are
complementary, it is only necessary to enter the sequence of one DNA strand into a
database. By selecting an appropriate computer program, scientists can use
sequence data to look for genes, get clues to gene functions, examine genetic
variation, and explore evolutionary relationships. Bioinformatics is a young and
dynamic science. New bioinformatic software is being developed while existing
software is continually updated.
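
Because the two strands are complementary, the sequence of one can always be derived from the other. A minimal sketch of that derivation, with the second strand written 5' to 3' as is conventional, might look like this (the function name is illustrative):

```python
# Map each base to its Watson-Crick partner (A-T, C-G), preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGGCGTAC"))  # GTACGCCAT
```

Applying the function twice returns the original sequence, which is why storing a single strand loses no information.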

SNPs:
Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”),
are the most common type of genetic variation among people. Each SNP represents
a difference in a single DNA building block, called a nucleotide. For example, a
SNP may replace the nucleotide cytosine (C) with the nucleotide thymine (T) in a
certain stretch of DNA.
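
In code, finding single-nucleotide differences between two aligned sequences amounts to a position-by-position comparison. The sketch below is a toy illustration with invented sequences. It assumes the two sequences are already aligned and of equal length, so it does not handle insertions or deletions.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


reference = "AAGCCTA"
sample = "AAGCTTA"  # C replaced by T at position 5
print(find_snps(reference, sample))  # [(5, 'C', 'T')]
```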

SNPs occur normally throughout a person’s DNA. They occur almost once in
every 1,000 nucleotides on average, which means there are roughly 4 to 5 million
SNPs in a person's genome. These variations may be unique or occur in many
individuals; scientists have found more than 100 million SNPs in populations
around the world. Most commonly, these variations are found in the DNA between
genes. They can act as biological markers, helping scientists locate genes that are
associated with disease. When SNPs occur within a gene or in a regulatory region
near a gene, they may play a more direct role in disease by affecting the gene’s
function.

Most SNPs have no effect on health or development. Some of these genetic
differences, however, have proven to be very important in the study of human
health. Researchers have found SNPs that may help predict an individual’s
response to certain drugs, susceptibility to environmental factors such as toxins,
and risk of developing particular diseases. SNPs can also be used to track the
inheritance of disease genes within families. Future studies will work to identify
SNPs associated with complex diseases such as heart disease, diabetes, and cancer.

Genetic Diversity:

Biodiversity is the variety of plants and animals inhabiting an ecosystem. It
occurs at three different levels, namely species diversity, genetic diversity and
ecosystem diversity.
Each individual has a unique genetic architecture, which is determined by the
hereditary material, DNA. A wide variety of gene sets equips a population to
tolerate stress from a given environmental factor. Genetic diversity is one of the
driving forces of evolution: it provides the variation on which natural selection
acts, leading to survival of the fittest.

What is Genetic Diversity?


Genetic diversity is defined as genetic variability present within species. Genetic
diversity is the product of recombination of genetic material in the process of
inheritance. It changes with time and space.
Sexual reproduction is important in maintaining genetic diversity as it gives unique
offspring by combining genes of parents.
Mutation of genes, genetic drift and gene flow are also responsible for genetic
diversity.

Importance of Genetic Diversity

• Genetic diversity gives individuals different physical attributes and the
capacity to adapt to stress, disease and unfavourable environmental
conditions.
• Environmental changes, whether natural or caused by human intervention, lead
to natural selection and survival of the fittest. Because of genetic
diversity, susceptible varieties die while those that can adapt to the
changes survive.
• Genetic diversity keeps a population healthy by maintaining different gene
variants that might confer resistance to pests, diseases or other
conditions.
• New varieties of plants can be produced by cross-breeding different genetic
variants to obtain desirable traits such as disease resistance and
increased tolerance to stress.
• Genetic diversity reduces the recurrence of undesirable inherited traits.
• Genetic diversity makes it more likely that at least some members of a
species survive.

Genetic Diversity Examples

• Different breeds of dogs. Dogs are selectively bred to obtain desired traits.
• Different varieties of rose, wheat, etc.
• There are more than 50,000 varieties of rice and more than a thousand
varieties of mango found in India.
• Different varieties of the medicinal plant Rauvolfia vomitoria present in
different Himalayan ranges differ in the amount of the chemical reserpine
they produce.

Conservation of Genetic Diversity


Activities such as selective harvesting and the destruction of natural habitats
lead to a loss of diversity.
Genes that are lost may have had many benefits, so it is important to conserve
diversity for human well-being and to protect species from extinction.
If drought or a sudden outbreak of disease destroys a whole crop, conserved
diversity makes it possible to grow genetically diverse, disease-resistant
varieties in its place.
There are various methods to conserve biodiversity:

• In situ conservation: It is impossible to conserve the whole of biodiversity,
so certain “hotspots” are identified and conserved to protect species that are
endemic to a particular habitat and are threatened, endangered or at high risk
of extinction. E.g. wildlife sanctuaries, national parks.
• Ex situ conservation: Threatened plants and animals are taken out of
their natural habitat and kept in a special setting to give them special care
and protection. E.g. botanical gardens, zoos, wildlife safari parks.

  o Using cryopreservation techniques, gametes of threatened species are
  preserved in a viable and fertile condition for long periods.
  o Eggs can be fertilised in vitro and plants can be propagated through
  tissue culture.
  o Genomic libraries are a recent advance in conserving genetic
  diversity.

Genome Evolution

Processes such as mutations, duplications, exon shuffling, transposable elements
and pseudogenes have contributed to genomic evolution.

LEARNING OBJECTIVES
Explain the importance of genomic changes in an evolutionary context
KEY TAKEAWAYS

Key Points

• Gene and whole-genome duplications have produced accumulated changes that
have contributed to genome evolution.
• Mutations occur constantly in an organism’s genome and can have a negative
effect, a positive effect or no effect at all; in every case they change the
genome.
• Transposable elements are regions of DNA that can be inserted into the
genetic code and cause changes within the genome.
• Pseudogenes are dysfunctional genes derived from previously functional
gene relatives; a gene can become a pseudogene through the deletion or
insertion of one or more nucleotides.
• Exon shuffling occurs when two or more exons from different genes are
combined or when exons are duplicated, and results in new genes.
• Species can also exhibit genome reduction when subsets of their genes are
no longer needed.

Key Terms

• intron: a portion of a split gene that is included in pre-RNA transcripts but is
removed during RNA processing and rapidly degraded
• exon: a region of a transcribed gene present in the final functional RNA
molecule
• pseudogene: a segment of DNA that is part of the genome of an organism,
and which is similar to a gene but does not code for a gene product

Accumulating Changes Over Time

The evolution of the genome is characterized by the accumulation of changes. The
analysis of genomes and their changes in sequence or size over time involves
various fields. There are various mechanisms that have contributed to genome
evolution and these include gene and genome duplications, polyploidy, mutation
rates, transposable elements, pseudogenes, exon shuffling and genomic reduction
and gene loss. Gene and whole-genome duplication are discussed separately, so
the focus here is on the other mechanisms.

Mutation Rates

Mutation rates differ between species and even between different regions of the
genome of a single species. Spontaneous mutations occur often and can cause
various changes in the genome. Mutations can result in the addition or deletion of
one or more nucleotide bases. When the number of bases added or deleted is not a
multiple of three, the result is a frameshift mutation: every codon downstream of
the change is read in the wrong frame, which often makes the protein
non-functional. A mutation in a promoter region, enhancer region or a region
coding for transcription factors can also result in either a loss of function or
an upregulation or downregulation in transcription of that gene. Mutations are
constantly occurring in an organism’s genome and can cause either a negative
effect, positive effect or no effect at all.
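
The effect of a single-base insertion on the reading frame is easy to see when a sequence is split into codons. The short sketch below uses an invented 12-base sequence. It only shows how the triplets change and does not translate them into amino acids.

```python
def codons(seq: str) -> list[str]:
    """Split a coding sequence into complete triplets, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


original = "ATGGCATTTGGA"
inserted = original[:4] + "C" + original[4:]  # one extra base after the 4th base

print(codons(original))  # ['ATG', 'GCA', 'TTT', 'GGA']
print(codons(inserted))  # ['ATG', 'GCC', 'ATT', 'TGG']
```

Every codon from the insertion point onward is different, even though only one base was added.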

Chromosomal Mutations: Chromosomal mutations can accumulate over time and
promote diversity and evolution if a resulting trait is favorable.

Transposable Elements

Transposable elements are regions of DNA that can be inserted into the genetic
code through one of two mechanisms. These mechanisms work similarly to “cut-
and-paste” and “copy-and-paste” functionalities in word processing programs. The
“cut-and-paste” mechanism works by excising DNA from one place in the genome
and inserting itself into another location in the code. The “copy-and-paste”
mechanism works by making a genetic copy or copies of a specific region of DNA
and inserting these copies elsewhere in the code. The most common transposable
element in the human genome is the Alu sequence, which is present in the genome
over one million times.
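
The word-processing analogy can be taken literally with strings. In the toy sketch below the "genome" is a short string and the element is the substring "TE". The function names and index arguments are illustrative and say nothing about the enzymes that carry out real transposition.

```python
def cut_and_paste(genome: str, start: int, end: int, target: int) -> str:
    """Excise genome[start:end] and reinsert it at index target of the remainder."""
    element = genome[start:end]
    remainder = genome[:start] + genome[end:]
    return remainder[:target] + element + remainder[target:]


def copy_and_paste(genome: str, start: int, end: int, target: int) -> str:
    """Leave genome[start:end] in place and insert a copy at index target."""
    element = genome[start:end]
    return genome[:target] + element + genome[target:]


genome = "aaaTEbbbccc"
print(cut_and_paste(genome, 3, 5, 6))   # aaabbbTEccc
print(copy_and_paste(genome, 3, 5, 8))  # aaaTEbbbTEccc
```

Cut-and-paste leaves the genome the same length, while each copy-and-paste event makes it longer, which is how elements such as Alu can build up to very high copy numbers.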

Pseudogenes

Often a result of spontaneous mutation, pseudogenes are dysfunctional genes
derived from previously functional gene relatives. There are many mechanisms by
which a functional gene can become a pseudogene, including the deletion or
insertion of one or multiple nucleotides. Such changes can shift the reading
frame so that the gene no longer codes for the expected protein, introduce a
premature stop codon, or disrupt the promoter region. Often cited examples of
pseudogenes within the human genome include the once functional olfactory gene
families. Over time, many olfactory genes in the human genome became pseudogenes
and were no longer able to produce functional proteins, explaining the poor sense
of smell humans possess in comparison to their mammalian relatives.

Exon Shuffling

Exon shuffling is a mechanism by which new genes are created. This can occur
when two or more exons from different genes are combined together or when
exons are duplicated. Exon shuffling results in new genes by altering the current
intron-exon structure. This can occur by any of the following processes: transposon
mediated shuffling, sexual recombination or illegitimate recombination. Exon
shuffling may introduce new genes into the genome that can be either selected
against and deleted or selectively favored and conserved.

Genome Reduction and Gene Loss

Many species exhibit genome reduction when subsets of their genes are no longer
needed. This typically happens when organisms adapt to a parasitic lifestyle, e.g.
when their nutrients are supplied by a host. As a consequence, they lose the genes
needed to produce these nutrients. In many cases, there are both free-living and
parasitic species that can be compared and their lost genes identified. Good
examples are the genomes of Mycobacterium tuberculosis and Mycobacterium
leprae, the latter of which has a dramatically reduced genome. Another good
example is provided by endosymbiont species. For instance, Polynucleobacter
necessarius was
first described as a cytoplasmic endosymbiont of the ciliate Euplotes aediculatus.
The latter species dies soon after being cured of the endosymbiont. In the few cases
in which P. necessarius is not present, a different and rarer bacterium apparently
supplies the same function. No attempt to grow symbiotic P. necessarius outside
their hosts has yet been successful, strongly suggesting that the relationship is
obligate for both partners. Yet, closely related free-living relatives of P.
necessarius have been identified. The endosymbionts have a significantly reduced
genome when compared to their free-living relatives (1.56 Mbp vs. 2.16 Mbp).
