
Computational Molecular Biology

Biochem 218 BioMedical Informatics 231


http://biochem218.stanford.edu/

Genomics and Bioinformatics

Doug Brutlag
Professor Emeritus
Biochemistry & Medicine (by courtesy)

Faculty, TAs and Staff


Doug Brutlag

Lee Kozar

Maeve O'Huallachain

Dan Davison

Course and Video Availability

Alway M114
Tuesdays & Thursdays 2:15-3:30 PM

Course Web Site
http://biochem218.stanford.edu/

Stanford Center for Professional Development
http://scpd.stanford.edu/

Videos available 24 hours/day, 7 days/week
Course offered Autumn, Winter and Spring quarters

Course Requirements
Lectures
Theoretical background of current methods
Strengths and weaknesses of current approaches
Future directions for improvements

Demonstrations
Applications (Mac, PC, Unix, Web)
Web applications
Illustrate homework

All homework and questions must be submitted by email to homework218@cmgm.stanford.edu
Several homework assignments (35%)
Due one week after assigned

Final project (Due March 12th)
A critical or comparative review of computational approaches to any problem in computational molecular biology
Propose new approach
Implement a new approach
Examples of previous projects for the class can be found at http://biochem218.stanford.edu/Projects.html

David Mount
Bioinformatics: Sequence and Genome Analysis 2nd Edition

Jin Xiong
Essential Bioinformatics

Richard Durbin et al.
Biological Sequence Analysis

Jones & Pevzner
Bioinformatics Algorithms

Dan Gusfield
Algorithms on Strings, Trees & Sequences

Baldi & Brunak
Bioinformatics: The Machine Learning Approach

Higgins & Taylor
Bioinformatics: Sequence, Structure & Databanks

NCBI Handbook
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook

EMBL-EBI Home Page
http://www.ebi.ac.uk/

Berg, Tymoczko & Stryer
Biochemistry, Fifth Edition

Benjamin Lewin
Genes IX

Genomics, Bioinformatics & Computational Biology

Genomics
Structural Genomics
Proteomics
Bioinformatics
Systems Biology
Computational Molecular Biology
Computational Biology

Machine Learning
Artificial Intelligence
Robotics
Databases
Statistics & Probability
Algorithms
Information Theory
Graph Theory

What is Bioinformatics?

[Figure: Biological Information. Labels: DNA, RNA, Protein, Phenotype, Individuals, Populations, Selection, Evolution]

Computational Goals of Bioinformatics

Learn & Generalize: Discover conserved patterns (models) of sequences, structures, interactions, metabolism & chemistries from well-studied examples.
Prediction: Infer function or structure of newly sequenced genes, genomes, proteins or proteomes from these generalizations.
Organize & Integrate: Develop a systematic and genomic approach to molecular interactions, metabolism, cell signaling and gene expression.
Simulate: Model gene expression, gene regulation, protein folding, protein-protein interaction, protein-ligand binding, catalytic function and metabolism.
Engineer: Construct novel organisms, novel functions or novel regulation of genes and proteins.
Gene Therapy: Target specific genes or mutations, or use RNAi, to change a disease phenotype.

Central Paradigm of Molecular Biology

DNA -> RNA -> Protein -> Phenotype (Symptoms)

Molecular Biology of the Gene, 1965

Central Paradigm of Bioinformatics

Genetic Information -> Molecular Structure -> Biochemical Function -> Phenotype (Symptoms)

MVHLTPEEKT AVNALWGKVN VDAVGGEALG RLLVVYPWTQ RFFESFGDLS
SPDAVMGNPK VKAHGKKVLG AFSDGLAHLD NLKGTFSQLS ELHCDKLHVD
PENFRLLGNV LVCVLARNFG KEFTPQMQAA YQKVVAGVAN ALAHKYH

Challenges Understanding Genetic Information

Genetic Information -> Molecular Structure -> Biochemical Function -> Phenotype

Genetic information is redundant
Structural information is redundant
Genes and proteins are meta-stable
Single genes have multiple functions
Genes are one-dimensional but function depends on three-dimensional structure

Redundancy in Genomic & Protein Sequences
DNA is double-stranded
Genetic code
Acceptable amino-acid replacements
Intron-exon variation
Alternative splicing
Strain variations (SNPs)
Sequencing errors
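
Two of these sources of redundancy are easy to make concrete. The sketch below is a minimal illustration, not course code: `reverse_complement` and `same_site_either_strand` are hypothetical helper names, and an unambiguous ACGT alphabet is assumed. Because DNA is double-stranded, a site read on one strand appears as its reverse complement on the other; because the genetic code is degenerate, six different codons all encode leucine.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def same_site_either_strand(site_a, site_b):
    """True if the two sites are identical or reverse complements of each other."""
    return site_a == site_b or site_a == reverse_complement(site_b)

# The six standard codons for leucine (DNA alphabet): one amino acid,
# six different spellings at the nucleotide level.
LEUCINE_CODONS = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}

print(reverse_complement("GATTACA"))
print(same_site_either_strand("GATTACA", "TGTAATC"))
print(len(LEUCINE_CODONS))
```

A search for a DNA site therefore usually needs to scan both the sequence and its reverse complement, unless the site is its own reverse complement.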

Using A Controlled Vocabulary for Literature Search
http://www.ncbi.nlm.nih.gov/sites/entrez?db=mesh

Gene Ontology Database
http://www.geneontology.org/

UCSC Genome Browser
http://genome.ucsc.edu/

ExPASy Proteomics Server
http://www.expasy.ch/doc.html

Inferring Biological Function from Protein Sequence

Consensus Sequences or Sequence Motifs
Zinc Finger (C2H2 type)
C x {2,4} C x {12} H x {3,5} H

Sequences of Common Structure or Function

Sequence Similarity
              10        20        30        40              50
Query VLSPADKTNVKAAWGKVGAHAGEVGAEALERMFLSFPTTKTYFPHF------DLSHGS
       |:| :|: | |:||||     | |:||| |: : :|:|  :|  |      |:  |
Match HLTPEEKSAVTALWGKV--NVDEYGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN
              10          20        30        40        50
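
One simple way to quantify the similarity shown in this alignment is percent identity. The following is a minimal sketch, assuming that columns containing a gap in either row are skipped (other conventions exist); `percent_identity` is a hypothetical helper name, and the two strings are the Query and Match rows from the slide.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (identical, compared, percent) over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are not counted
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return identical, compared, 100.0 * identical / compared

# Aligned rows copied from the slide.
QUERY = "VLSPADKTNVKAAWGKVGAHAGEVGAEALERMFLSFPTTKTYFPHF------DLSHGS"
MATCH = "HLTPEEKSAVTALWGKV--NVDEYGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN"

print(percent_identity(QUERY, MATCH))
```

For these two rows the function counts 21 identical residues over 50 ungapped columns (42%). Percent identity ignores the conservative replacements that the alignment marks with ":", so it understates similarity relative to a substitution-matrix score.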

A Typical Motif:
Zinc Finger DNA Binding Motif

C..C............H....H
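
The C2H2 pattern maps directly onto a regular expression. A minimal sketch in Python, assuming that `x {m,n}` in the slide's notation means any m to n residues; the function name and the test peptide are hypothetical illustrations, not course material.

```python
import re

# C x{2,4} C x{12} H x{3,5} H written as a regular expression.
# "." stands for any residue (the "x" in the slide's notation).
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

def find_c2h2_motifs(protein):
    """Return (start, matched_text) for each non-overlapping C2H2-like match."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein)]

# Hypothetical peptide constructed to fit the pattern; not a real protein.
example = "MK" + "C" + "AA" + "C" + "G" * 12 + "H" + "TTT" + "H" + "LL"
print(find_c2h2_motifs(example))
```

A match only says that the spacing of cysteines and histidines is compatible with a zinc finger. Patterns this short can also occur by chance, which is one reason weight matrices and profiles are used as more sensitive descriptions.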

Inferring Biological Function from Protein Sequence

Weight Matrices or Position-Specific Scoring Matrices

    1   2   3   4   5   6   7   8   9  10  11  12
A   2   1   3  13  10  12  67   4  13   9   1   2
R   7   5   8   9   4   0   1  16   7   0   1   0
N   0   8   0   1   0   0   0   2   1   1  10   0
D   0   1   0   1  13   0   0  12   1   0   4   0
C   0   0   1   0   0   0   0   0   0   2   2   1
Q   1   1  21   8  10   0   0   7   6   0   0   2
E   2   0   0   9  21   0   0  15   7   3   3   0
G   9   7   1   4   0   0   8   0   0   0  46   0
H   4   3   1   1   2   0   0   2   2   0   5   0
I  10   0  11   1   2  10   0   4   9   3   0  16
L  16   1  17   0   1  31   0   3  11  24   0  14
K   3   4   5  10  11   1   1  13  10   0   5   2
M   7   1   1   0   0   0   0   0   5   7   1   8
F   4   0   3   0   0   4   0   0   0  10   0   0
P   0   6   0   1   0   0   0   0   0   0   0   0
S   1  17   0   8   3   1   3   0   2   2   2   0
T   5  22   3  11   1   5   0   2   2   2   0   5
W   2   0   0   0   0   0   0   0   0   1   0   1
Y   1   0   4   2   0   1   0   0   2   4   0   1
V   6   3   1   1   2  15   0   0   2  12   0  28
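
A count matrix like this one is usually converted to log-odds scores before it is used to scan a sequence. The sketch below shows one common recipe under stated assumptions: a uniform background frequency of 1/20, a pseudocount of 1 per residue, and a small three-position toy matrix rather than the slide's table. The names `log_odds_matrix` and `best_window` are hypothetical.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def log_odds_matrix(counts, pseudocount=1.0, background=1 / 20):
    """Convert per-position residue counts into log2 odds scores.

    counts: list with one dict per position, mapping residue -> observed count.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(aa, 0) for aa in AMINO_ACIDS)
        total += pseudocount * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log2(((column.get(aa, 0) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix

def best_window(sequence, matrix):
    """Slide the matrix along the sequence; return (best_score, start_index)."""
    width = len(matrix)
    scored = [
        (sum(matrix[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Toy counts for a three-residue motif (hypothetical data, 20 sequences).
toy_counts = [{"C": 18, "A": 2}, {"G": 10, "S": 10}, {"H": 20}]
toy_matrix = log_odds_matrix(toy_counts)
print(best_window("MACGHKL", toy_matrix))
```

Unlike a consensus pattern, the matrix gives partial credit to residues that were seen less often at a position, so near-matches still receive a graded score.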

Profiles, PSI-BLAST
Hidden Markov Models

[Figure: profile hidden Markov model with states labeled AA1-AA6, I1-I5 and D2-D5]

Buried Treasure

Clustal Globin Alignment

Consensus Sequence From a Multiple Sequence Alignment
ClustalW Insulin Alignments

[Figure: ClustalW alignment of sequences labeled IPGP, IPDK, IPDG, IPCH, IPCA, IPBO and IPAF, alignment positions 10 to 120, with a consensus row]
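
A consensus row can be derived from a multiple alignment column by column. The sketch below is a minimal illustration under stated assumptions: the four aligned rows are hypothetical toy data (not the insulin alignment), a residue is reported only when it occupies at least 75% of a column, and "." marks columns without a clear majority. `consensus` is a hypothetical helper name.

```python
from collections import Counter

def consensus(alignment, min_fraction=0.75, unknown="."):
    """Return a consensus string for equal-length aligned rows."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        # Report a residue only if it is not a gap and is frequent enough.
        if residue != "-" and count / len(column) >= min_fraction:
            result.append(residue)
        else:
            result.append(unknown)
    return "".join(result)

# Hypothetical toy alignment, four rows of six columns.
toy_alignment = ["MALWMR", "MALWIR", "MAVWIQ", "MA-WLQ"]
print(consensus(toy_alignment))
```

A consensus string discards the frequencies of the minority residues, which is the information that weight matrices and profile HMMs retain.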

HMM Model of Hemoglobins
http://decypher.stanford.edu/

GrowTree VegF Neighbor Joining Tree

Human Gene Expression Signatures


T Cells Signaling

DNA Damage
Fibroblast Stimulation
B Cells Signaling
CMV Infection
Anoxia
Polio Infection
Monocytes Signaling IL4
Hormone

Clustering Gene Expression Profiles: Comparison of Methods

D'haeseleer P (2005). Nat Biotechnol. 23,1499-501.

TAMO:
Tools for the Analysis of Motifs

Finding Transcription Factor Binding Sites

Co-expressed Genes   Upstream Regions
Pho 5    GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC
Pho 8    CACATCGCATCACGTGACCAGT...GACATGGACGGC
Pho 81   GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
Pho 84   TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
Pho      CGCTAGCCCACGTGGATCTTGA...AGAATGACTGGC

Transcription Start

Finding Transcription Factor Binding Sites

Upstream Regions

Co-expressed Genes

GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTGGATCTTGT...AGAATGGCCTAT

Finding Transcription Factor Binding Sites

Upstream Regions

Co-expressed Genes

ATGGCTGCACCACGTTTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTA
TTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTTGATCTTGT...AGAATGGCCTAT
Pho4 binding
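
The idea behind these slides, that co-expressed genes may share a short upstream sequence, can be illustrated with a naive exact-match search. This is a sketch for illustration only and is not the algorithm used by TAMO or any other motif finder: it looks for k-mers present in every upstream region, on one strand, with no mismatches allowed. The sequences are the portions shown before the "..." on the first Pho slide; `shared_kmers` is a hypothetical helper name.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def shared_kmers(sequences, k):
    """Return the k-mers that occur in every sequence (exact matches only)."""
    common = kmers(sequences[0], k)
    for sequence in sequences[1:]:
        common &= kmers(sequence, k)
    return common

# Upstream fragments as shown on the slide, up to the elision.
upstream = [
    "GATGGCTGCACCACGTGTATGC",  # Pho 5
    "CACATCGCATCACGTGACCAGT",  # Pho 8
    "GCCTCGCACGTGGTGGTACAGT",  # Pho 81
    "TCTCGTTAGGACCATCACGTGA",  # Pho 84
    "CGCTAGCCCACGTGGATCTTGA",  # fifth gene (label truncated on the slide)
]
print(shared_kmers(upstream, 6))
```

On these fragments the only shared 6-mer is CACGTG. Real binding sites tolerate mismatches, as the variant sequences on the last slide suggest, so practical tools score degenerate motifs rather than requiring exact matches.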

Metabolic Networks: BioCyc
http://biocyc.org/

C. crescentus Cell Cycle Gene Expression

Genome Wide Associations in Rheumatoid Arthritis

Pearson, T. A. et al. JAMA 2008;299:1335-1344

Leveraging Genomic Information in Medicine

Novel Diagnostics
Microchips & Microarrays - DNA
Gene Expression - RNA
Proteomics - Protein

Novel Therapeutics
Drug Target Discovery
Rational Drug Design
Molecular Docking
Gene Therapy
Stem Cell Therapy

Understanding Metabolism

Understanding Disease
Inherited Diseases - OMIM
Infectious Diseases
Pathogenic Bacteria
Viruses
