What is DNA?
DNA stands for deoxyribonucleic acid. It carries the genetic instructions required for the
development, functioning and reproduction of all known living organisms. In eukaryotic
organisms (like animals, plants, and fungi), most DNA is found in the nucleus of each cell. In
prokaryotic organisms (single-celled organisms like bacteria and archaea), DNA is found
in the cell's cytoplasm.
Back to the ladder analogy. "Nucleotide" is the type of molecule that makes up each half-rung
of our ladder. DNA has four kinds of nucleotide, written as the letters A, T, G and C. Because
they function as the units that encode genetic information, each letter is also called a base.
As an example of how we represent the sequence of DNA bases using these letters: if we took
the DNA of the Bacillus anthracis bacterium that causes anthrax, unfolded the double helix
into a ladder, and then split the ladder in two, the top half (the forward strand, which we'll
get to in a bit) would be written like this:
ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTT
The lines that connect the bases on either side denote a basepair relationship.
"Forward" and "reverse" are just labels. The choice of labeling is arbitrary and does not
depend on any inherent property of the DNA. The forward strand is not "special". Scientists
decide which to call "forward" and which to call "reverse" when they first analyze the DNA
of an organism. Even though the decision is arbitrary, it's important to maintain consistency
with that decision for the sake of clear communication.
The forward and reverse strands may also be denoted with different terms. For example, in
datasets you may find them labeled as + and -. They might also be called top and bottom
strands. In our opinion, these variations are needlessly confusing. Please avoid referring to
strands with any terms other than forward and reverse.
Most biological mechanisms (but not all) take place on a single strand of the DNA and in the
direction of the arrow. Hence sequences of the DNA above will be "seen" by the biochemical
machinery as either:
This latter sequence is called the reverse complement of the first and is formed by reversing
the letters, then interchanging A with T and C with G.
Hence a DNA sequence AAACT may need to be considered:
in reverse TCAAA
as a complement TTTGA
as a reverse-complement AGTTT
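The three transformations above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sequence contains only the uppercase letters A, C, G and T; the function names are our own and not taken from any particular library.

```python
# Translation table that swaps each base with its pairing partner (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse(seq):
    """Return the sequence read backwards."""
    return seq[::-1]

def complement(seq):
    """Return the sequence with A/T and C/G interchanged."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complement of the sequence, read backwards."""
    return complement(seq)[::-1]

seq = "AAACT"
print(reverse(seq))             # TCAAA
print(complement(seq))          # TTTGA
print(reverse_complement(seq))  # AGTTT
```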
What is sense/antisense?
When a process occurs in the expected direction, its directionality may be called sense;
if it goes against the expected direction, its directionality may be called anti-sense. It is
very important not to conflate the concepts of forward/reverse with sense/anti-sense, as
these are completely unrelated. Sense/anti-sense is relative to a sequence's direction;
the sequence in turn may come from a forward or reverse strand.
What is a genome?
A genome is all of an organism's DNA sequence. Each cell typically contains a copy of the
entire genome. More accurately, each cell has one or more nearly identical copies of the
genome. Replication is the process of duplicating the genome when a cell divides.
Due to complementarity, the number of A nucleotides equals the number of T nucleotides, and
the number of C nucleotides equals the number of G nucleotides. The relative ratio of AT to
CG pairs, however, may be very different: some genomes contain more AT pairs while others
contain more CG pairs.
All genomes are subject to evolutionary principles, hence some (or even substantial) regions
of a healthy genome may be non-functional and may no longer serve any purpose. Some
percentage of a genome may consist of copies of various kinds of interspersed repeat
sequences. At some point these regions were labelled "junk DNA", a term that has since
become a lightning rod of controversy.
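The AT versus CG balance of a sequence is often summarized as its GC content, the fraction of bases that are G or C. The sketch below is a minimal illustration with a function name of our own choosing; it assumes the input is a plain string of A, C, G and T and does no special handling of ambiguous bases such as N.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("cannot compute GC content of an empty sequence")
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq)

# A balanced toy sequence, and one made only of G and C.
print(gc_content("ATGC"))  # 0.5
print(gc_content("GGCC"))  # 1.0
```

Because the two strands are complementary, the forward and reverse strands of the same region give the same GC content.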
What is RNA?
Whereas DNA is the blueprint, RNA is a copy of a smaller interval of this blueprint,
transcribed into a molecule similar to DNA except that the base T (thymine) is replaced by
U (uracil) and the molecule carries other chemical differences that change its properties
relative to DNA. RNA is a polymeric, typically single-stranded molecule that usually performs
some type of function and is believed to exist transiently. Unlike DNA, there are many classes
of RNA: mRNA, tRNA, rRNA and many others. DNA is continuously present in the cell, whereas
RNA degrades quickly (often within minutes).
The cell begins by transcribing a "gene" (see later) into an RNA molecule. Then pieces of the
RNA are cut out and discarded, in a process called splicing. Each discarded piece is called an
intron. Each piece between consecutive introns is called an exon, and the RNA molecule
with the introns removed is known as messenger RNA, or mRNA. Most human genes with
multiple exons (over 90% by current estimates) are alternatively spliced, meaning that under
different circumstances, different combinations of exons are selected.
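Splicing can be pictured as cutting intervals out of a string and joining what remains. The sketch below is a toy illustration only: the sequence and the intron coordinates are invented, and the function name is our own. It assumes introns are given as non-overlapping, 0-based, end-exclusive intervals.

```python
def splice(pre_mrna, introns):
    """Remove the intron intervals and join the remaining exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)

# Hypothetical transcript: exons in uppercase, one intron in lowercase.
pre_mrna = "AUGGCC" + "guaaguuuag" + "UCAUGA"
introns = [(6, 16)]
print(splice(pre_mrna, introns))  # AUGGCCUCAUGA
```

Alternative splicing would correspond to calling the same function with a different set of intervals, so that a different combination of exons is kept.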
What is a protein?
A protein is a three-dimensional macromolecule built from a series of so-called amino acid
molecules that can form a 3D structure. There are 20 kinds of amino acids that can form a
protein. They are labelled as letters in the "alphabet" of protein sequences, where each
letter is an amino acid. Proteins can be described by their sequence, though in our current
state of understanding the sequence alone is typically insufficient to fully determine a
protein's 3D structure or function. Whereas DNA and mRNA typically carry information, proteins
are the actual physical building blocks of life. Every living organism is built out of proteins
and functions via the interaction of proteins that are being continuously produced. A short
series (fewer than 40) of amino acids without a well-defined 3D structure is called a
polypeptide (peptide).
To perform the translation, the mRNA is partitioned into units of three consecutive letters,
each called a codon. A codon is then translated via a translation table into an amino acid. A
protein is a sequence of amino acids.
There is a genetic code for translating a codon into an amino acid. For example, the codon
TCA (or UCA if we describe it as an RNA sequence) codes for S, the amino acid serine. The
translation process begins with the so-called start codon ATG, which corresponds to M (the
amino acid methionine). Hence all protein sequences start with M.
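The codon-by-codon lookup described above can be sketched as follows. This is a minimal illustration, assuming the standard genetic code, an input that already starts at the start codon, and only the letters A, C, G, T (or U); the names are our own. The 64-letter string lists the amino acid for every codon in T, C, A, G order, with * marking stop codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(seq):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(CODON_TABLE["TCA"])      # S
print(translate("ATGTCATGA"))  # MS
```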
What is a gene?
The "official" definition for the term gene in the Sequence Ontology is a region (or regions)
that includes all of the sequence elements necessary to encode a functional transcript. A
gene may include regulatory regions, transcribed regions and/or other functional sequence
regions.
Untranslated regions
The region of the mRNA before the start codon (or the corresponding genomic region) is
called the 5' UTR (five prime UTR), or untranslated region; the portion from the stop codon to
the start of the poly-A tail is the 3' UTR (three prime UTR).
Promoter regions
The genomic region just before the 5' UTR may contain patterns of nucleotides, called the
promoter, that are used to position the molecular machinery that performs the transcription
to RNA. Other patterns in the DNA tell the cell when (how frequently and in what tissues) to
transcribe the gene; that is, they regulate transcription. A pattern that increases the
frequency of transcription operations is an enhancer, while one that decreases the
frequency is a silencer.
CpG islands
A CpG site is a position in DNA where a C (cytosine) nucleotide is followed by a G (guanine)
nucleotide in the linear sequence of bases along its 5' -> 3' direction. CpG islands are
regions of DNA where such sites occur at high frequency. Cytosines in CpG dinucleotides can
be methylated. In turn, methylated cytosines within a gene may change its expression, a
mechanism that is part of epigenetics, a larger field of science studying gene regulation.
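Counting CpG dinucleotides in a sequence amounts to counting occurrences of the two-letter string CG when the sequence is written 5' -> 3'. The sketch below is a minimal illustration with a function name of our own; it only counts sites and does not attempt to decide whether a region qualifies as a CpG island.

```python
def count_cpg(seq):
    """Count CpG dinucleotides (a C immediately followed by a G) in seq."""
    return seq.upper().count("CG")

print(count_cpg("ACGTCGCGGA"))  # 3
print(count_cpg("GCGCTTGC"))    # 1
```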
What is homology?
Two regions of DNA that are descended from the same sequence (through processes of
duplication of genomic regions and/or separation of two species) are homologous, or
homologs of one another.
How is bioinformatics practiced?
Bioinformatics requires a broad skillset. The diverse tasks undertaken by bioinformaticians
can be roughly divided into three tiers:
1) Data management:
Data management requires accessing, combining, converting, manipulating, storing, and
annotating data. It requires routine data quality checks, summarizing large amounts of
information, and automating existing methods.
3) Data interpretation:
Data management and analysis are meaningless without accurate and insightful
interpretation. Bioinformaticians discover or support biological hypotheses via the results of
their primary analysis, and so they must be able to interpret their findings in the context of
ongoing scientific discourse.