Lecture2 - Background

LECTURE 2
BACKGROUND: MOLECULAR
BIOLOGY
D R . Z O YA K H A L I D
Z O YA . K H A L I D @ N U . E D U . P K
WHAT IS DNA ?
● Alphabet of 4 basic characters (nucleotides):
● Adenine (A) ● Cytosine (C) ● Guanine (G) ● Thymine (T)
A DNA sequence:
AACGTTTGA
In RNA data: T is replaced by Uracil ●

A RNA sequence: AACGUUUGA
● We will see what RNA is later
● If we use T or U does usually not matter, computationally
PROTEINS
• Polymer made of monomers – Amino Acids
• 20 Naturally occurring Amino Acids
• There are 20 different amino acids. Proteins are created by linking amino acids together
in a linear fashion to form polypeptide chains.
• Protein polypeptide chains fold into three-dimensional structures that can associate
with other protein structures to perform specific functions.
GENES
• Genes are the basic physical and functional units of heredity. Each gene is located on a
particular region of a chromosome and has a specific ordered sequence of nucleotides (the
building blocks of DNA).
CENTRAL DOGMA OF MOLECULAR BIOLOGY
1. Genetic information is stored in DNA.

2. Segments of DNA that encode proteins or other functional products are called genes.
3. Gene sequences are transcribed into messenger RNA intermediates (mRNA).
4. mRNA intermediates are translated into proteins that perform most life functions.
image from the DOE Human Genome Program
http://www.ornl.gov/hgmis
HUMAN CHROMOSOMES
Down syndrome
WHAT IS A LOCUS?
• A locus describes the region of a
chromosome where a gene is
located. 11p15.5 is the locus for the
human insulin gene. 11 is the
chromosome number, p indicates the
short arm of the chromosome, and
15.5 is the number assigned to a
particular region on a chromosome.
When chromosomes are stained in
the lab, light and dark bands appear,
and each band is numbered. The
higher the number, the farther away
the band is from the centromere.
IMPORTANT
• Both DNA and RNA have a direction: one end is the 3’ the other is the 5’
end.
• Thus, codons are read in one direction only.
GENOME
• The full genetic information of an organism

• Contains all chromosomes
• Comprises the coding & non-coding sequence data of the organism
• Coding sequence data → part of the genome that encodes proteins
• Non-coding (in earlier days: junk) DNA → part of the genome that does not encode proteins
but still has a function –
• The function of non-coding DNA is only partially known
• Non-coding DNA regulates protein processes
GENETIC CODE CHARACTERISTICS
• Genetic code is specific (Unambiguous)
• A specific codon always codes for same amino acids
• Example : UAA codes for phenylalanine
• Genetic code is Redundant

• Also called degenerate
• Although each codon specifies a single amino acid but a single amino acid can have multiple
codons
• All codons are independent sets of three codons- no overlapping
CONTID…..
• Codon is read from the fixed starting point called as Reading Frame
• Codon is read as a continuous sequences of bases read three at at time
• Three codons out of 64 are called stop codons UAA, UAG, UGA which do not encode for any
amino acid
• AUG is the start codon
Amino Acids
EXAMPLES OF PROTEINS
AMINO ACID SEQUENCE OF HEXOKINASE
1 A A S X D X S L V E V H X X V F I V P P X I L Q A V V S I A
31 T T R X D D X D S A A A S I P M V P G W V L K Q V X G S Q A
61 G S F L A I V M G G G D L E V I L I X L A G Y Q E S S I X A
91 S R S L A A S M X T T A I P S D L W G N X A X S N A A F S S
121 X E F S S X A G S V P L G F T F X E A G A K E X V I K G Q I
151 T X Q A X A F S L A X L X K L I S A M X N A X F P A G D X X
181 X X V A D I X D S H G I L X X V N Y T D A X I K M G I I F G
211 S G V N A A Y W C D S T X I A D A A D A G X X G G A G X M X
241 V C C X Q D S F R K A F P S L P Q I X Y X X T L N X X S P X
271 A X K T F E K N S X A K N X G Q S L R D V L M X Y K X X G Q
301 X H X X X A X D F X A A N V E N S S Y P A K I Q K L P H F D
331 L R X X X D L F X G D Q G I A X K T X M K X V V R R X L F L
361 I A A Y A F R L V V C X I X A I C Q K K G Y S S G H I A A X
391 G S X R D Y S G F S X N S A T X N X N I Y G W P Q S A X X S
421 K P I X I T P A I D G E G A A X X V I X S I A S S Q X X X A
451 X X S A X X A
SPACE-FILLING MODEL OF
HEXOKINASE
ECORI – RESTRICTION ENZYME
HEMOGLOBIN
• protein built from 4
polypeptides
• responsible for carrying

oxygen in red blood cells
HEMOGLOBIN: CARRIER OF OXYGEN
WHEN THINGS GO WRONG…
• Mutations: changes in the DNA sequence, that may be passed along to future
generations.
• Point mutations: a single base substitution
THE CAT SAW THE RAT
THE CAT SAW THE HAT
• Deletion: a small DNA segment is lost
• Insertion: a segment of DNA is added
MUTANT Β-GLOBIN → SICKLE BLOOD
CELLS
Fiber of sickle Sickle and normal

hemoglobin blood cells
NORMAL BLOOD FLOW
SICKLE CELL COMPLICATIONS
THE DYNAMICS OF CELLS
• all cells in an organism have the same genomic data, but the genes expressed in each
vary according to cell type, time, and environmental factors
• there are networks of interactions among various biochemical entities in a cell (DNA,
RNA, protein, small molecules) that carry out processes such as
– metabolism
– intra-cellular and inter-cellular signaling
– regulation of gene expression

SELECTED MILESTONES
Common # of Size (base
Year Species
Name Chromosomes pairs)
Haemophilus
1995 Bacterium 1 1.8 x 106
influenzae
Saccharomyces
1996 Yeast 16 1.2 x 107
cerevisiae
Caenorhabditis
1998 Worm 6 1.0 x 108
elegans
Bigger genome than
Drosophila Humans
1999 Fruit Fly 4 1.3 x 108
melanogaster
2000 Human Homo sapiens 23 3.1 x 109 SIZE DOES NOT MATTE
J
2002 Mouse Mus musculus 20 2.6 x 109
2004 Rat Rattus norvegicus 21 2.8 x 109
2005 Chimpanzee Pan troglodytes 24 3.1 x 109
Sequence is freely available NCBI - http://www.ncbi.nlm.nih.gov

UCSC - http://genome.ucsc.edu
BUT WAIT, THERE’S MORE…
• > 1000 other publicly available databases pertaining to molecular biology

• GenBank
> 196 million sequence entries
> 217 billion bases
• UniProtKB / Swiss-Prot
> 65 million protein sequence entries
> 21 billion amino acids
• Protein Data Bank
> 135,000 protein (and related) structures
BIOINFORMATICS REVISITED
Representation/storage/retrieval/ analysis of biological data concerning
– sequences (DNA, protein, RNA)
– structures (protein, RNA)
– functions (protein, sequence signals)
– activity levels (mRNA, protein, metabolites)
– networks of interactions (metabolic pathways, regulatory pathways, signaling pathways)
of/among biomolecules
DATA AND TIMELINE
• 12,000 enzymes
– Purified and characterized
– Function and effect of mutations has been understood
~100 years
• 95,000 protein structures
~50 years
LETS TRY TO CURE CANCER
• Have access only to a limited amount of information
• Focus on parameters (e.g. genes) with dominant effects
• Design experimental strategies that help to sort out the dominant parameters before
information is extracted - (information is gained only about “important” parameter)
• Find somehow an/the important parameter (eg. Gene)

• Limitations
– Many phenotype (eg. cancer) are due to multiple factors
– Each factor alone may not have any effects e.g. non-dorminant oncogenes
– Only specific combination of such cooperating factors lead to phenotype
• Suppose 3 cooperative genes are responsible for cancer.

• Assume that there are 15,000 human genes.
• Need:
– 15,000*14,999*14998=3.3 x 1013 experiments (or 104N for N cooperating genes).
• The exponential increase in the number of samples to be tested impose a practical and
conceptual limitation.
Molecular Biology:
– Often deep understanding of the function of one or several gene/protein,
– But: low ratio of
Information extracted
Potentially relevant information
Genomics, Gene expression, Bioinformatics:

– Often shallower understanding of the functions
– But: much higher ratio of
Information extracted
Potentially relevant information
REFERENCES
• Lecture notes of Colin Dewey @ University of Wisconsin-Madison

• Lecture notes of Arne Elofsson @ Stockholm University
• Lecture notes of Yuzhen Ye @ Indiana University
• http://www.ornl.gov/hgmis
THE SEQUENCING PROBLEM
• We want to determine the identity of the base pairs (ACTG) that make up:
– A single large molecule of DNA
– The genome of a single cell
– The genome of an individual organism
– The genome of a species
• But we can’t (currently) “read” off the sequence of an entire molecule all at once
SEQUENCE ASSEMBLY
• Resolve a puzzle and rebuild a DNA sequence from its pieces
DNA SEQUENCING
The process of reading the nucleotide bases in a DNA molecule

● There exist various sequencing technologies
● Properties
● Cost
● Speed
● Amount of data/Number of Sequences
● Sequence length
THE STRATEGY: SUBSTRINGS
• We do have the ability to read or detect short pieces (substrings) of DNA

– Sanger sequencing: 500-700 bp/read
– Hybridization arrays: 8-30bp/probe
– Latest technologies:
• 454 Genome Sequencer FLX: 250-600 bp/read
• Illumina Genome Analyzer: 35-300 bp/read
• Pacific Biosciences: ~10,000 bp/read
• bp = base pair = letter

SEQUENCING GELS
• Run replication in four separate test tubes
– Each with one of some concentration of either ddATP, ddTTP, ddGTP, or
ddCTP
• Depending on when ddNTP is included, different length
fragments are synthesized
• Fragments separated by length with electrophoresis
gel
• Sequence can be read from bands on gel
SHOTGUN SEQUENCING
FRAGMENT ASSEMBLY
Multiple copies of sample DNA
Randomly fragment DNA
Sequence sample of fragments
Assemble reads
ASSIGNMENT
• Posted online

Lecture2 - Background

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture2 - Background

Uploaded by

Copyright:

Available Formats

LECTURE 2

In RNA data: T is replaced by Uracil ●

1. Genetic information is stored in DNA.

• The full genetic information of an organism

• Genetic code is Redundant

• responsible for carrying

Fiber of sickle Sickle and normal

– intra-cellular and inter-cellular signaling

– regulation of gene expression

2004 Rat Rattus norvegicus 21 2.8 x 109

2005 Chimpanzee Pan troglodytes 24 3.1 x 109

Sequence is freely available NCBI - http://www.ncbi.nlm.nih.gov

• > 1000 other publicly available databases pertaining to molecular biology

Representation/storage/retrieval/ analysis of biological data concerning

– sequences (DNA, protein, RNA)

– structures (protein, RNA)

– functions (protein, sequence signals)

– activity levels (mRNA, protein, metabolites)

– networks of interactions (metabolic pathways, regulatory pathways, signaling pathways)

• Have access only to a limited amount of information

• Focus on parameters (e.g. genes) with dominant effects

• Find somehow an/the important parameter (eg. Gene)

• Suppose 3 cooperative genes are responsible for cancer.

Genomics, Gene expression, Bioinformatics:

• Lecture notes of Colin Dewey @ University of Wisconsin-Madison

The process of reading the nucleotide bases in a DNA molecule

• We do have the ability to read or detect short pieces (substrings) of DNA

• bp = base pair = letter

Randomly fragment DNA

Sequence sample of fragments

You might also like