
Intro… Bioinformatics

Basic Concepts of Modern Information Technology and Computer Science

Simple Definition (Biology): Application of information technology to the storage, management and analysis of biological information.
Simple Definition (IT): Bioinformatics is the electronic infrastructure of molecular biology.
Classical Definition: The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information.
Primary sequence databases

List of primary sequence databases and their locations.
There are several problems with databases today:
- Databases are regulated by users rather than by a central body (except for Swiss-Prot).
- Only the owner of the data can change it.
- Sequences are not up to date.
- Large degree of redundancy in databases and between databases.
- Lack of standard for fields or annotation.
Largest databases: GenBank (US), EMBL (Europe, UK), DDBJ (Japan).

DNA Databases
EMBL: URL http://www.ebi.ac.uk/embl/
- EMBL is a DNA sequence database from the European Bioinformatics Institute (EBI).
- EMBL includes sequences from direct submissions, genome sequencing projects, scientific literature and patent applications.
- Its growth is exponential.
- It supports several retrieval tools: SRS for text-based retrieval, and BLAST and FASTA for sequence-based retrieval.
Challenge: Signal Transduction
Signal transduction concerns the cell mechanisms used:
- To interpret, integrate, and act upon information received at the cell surface
- To convert this information into biochemical events within the interior of the cell that trigger specific pathways, control gene activity, and modify many aspects of cell behavior and dysfunction (i.e., disease)
[Figure: signalling pathway with steps numbered 1 to 5; labels: Signal, Receptor, Transducer, Amplifier, Glucose, Response]
Challenge: Protein Unfolding
Gain insight into folding properties and pathways by unfolding proteins from their 3D structure.
Molecular dynamics simulations 'heat' the protein up.
Amyloidogenic protein: transthyretin, 127 AA, L55P mutation (leucine replaced by proline).
1 917 protein atoms, 14 190 water molecules, 37 Na+ ions and 32 Cl- ions, total of 45 256 atoms.
Rui M.M. Brito, University of Coimbra
Total length of DNA ~ 2 x 10^14 km
Circumference of the earth 4 x 10^4 km
Distance between the earth and the sun 1.5 x 10^8 km
GENOME: Total DNA in an organism
Human genome ~ 3 billion bp
Worm 100 million bp
Fruit fly 160 million bp
Yeast 15 million bp
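As a rough illustration of what these genome sizes mean for the storage side of bioinformatics, the sketch below estimates the space each genome would need if every base were packed into 2 bits (enough to distinguish four nucleotides). The 2-bit encoding and the function name `packed_megabytes` are assumptions made for this example, not something the slides specify; real file formats also carry annotation and are larger.

```python
# Genome sizes in base pairs, taken from the list above.
GENOME_SIZES_BP = {
    "Human": 3_000_000_000,
    "Worm": 100_000_000,
    "Fruit fly": 160_000_000,
    "Yeast": 15_000_000,
}

BITS_PER_BASE = 2  # assumption: A, C, G and T each packed into 2 bits


def packed_megabytes(base_pairs: int) -> float:
    """Size in megabytes (10**6 bytes) of one strand stored at 2 bits per base."""
    return base_pairs * BITS_PER_BASE / 8 / 1_000_000


for organism, size in GENOME_SIZES_BP.items():
    print(f"{organism}: {packed_megabytes(size):,.2f} MB")
```

Under this assumption the human genome fits in 750 MB, which is why the raw sequence itself is rarely the storage bottleneck; annotation and redundancy are.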
Genomics

Application of high-throughput automated molecular biology technologies
Study of large numbers of genes and gene products, taking advantage of the complete genome sequence
All at once in whole cells, whole tissues or whole organisms
A holistic or systems approach to the study of information flow within a cell
Knowledge of the specific genes underlying diseases, and of the differences in individuals' genetic make-up that make them respond differently to drugs, is changing the face of drug development and delivery
323 bacterial genomes have been sequenced
235 sequences belong to different species
65 sequences of type strains
32 sequenced more than once of the same species

1350 more sequencing projects in progress


Impact of:
- bacterial genomics
- bioinformatics
- second-generation genomic technologies
on:
- target identification
- assay development
- lead optimization
- compound characterization
Genomics: a revolution in the making
First genome sequences of complete organisms (bacteria):
Haemophilus influenzae, 1995, first free-living organism
Mycoplasma genitalium, 1995, 470 proteins
First genome sequence of an archaeon: Methanococcus jannaschii, 1996, ~1,700 proteins
First genome sequence of a complete multicellular organism: Caenorhabditis elegans, 1998, ~18,000 proteins
Human genome sequence: 2001, 25,000-30,000 proteins
Second mammal: mouse genome
- Others: over 300 genomes of bacteria/archaea, plants and animals, including zebrafish
- Provides complete information about what makes up an organism
but: we know the functions of <50% of all genes
[Figure: two strategies for sequencing a genome]
Shotgun sequence approach: Genome; random small-insert library of whole genome; sequence and contig assembly; assemble complete genome sequence.
Clone-by-clone approach: Genome; ordered library; random small-insert library of one clone; sequence and contig assembly; repeat for other clones; assemble complete genome sequence.
SEQUENCING
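The "sequence and contig assembly" step that both strategies above rely on can be illustrated with a toy greedy overlap assembler. This is a minimal sketch under strong assumptions: reads are error-free, all come from the same strand, and the names `overlap` and `greedy_assemble` are hypothetical helpers written for this example. Production assemblers handle sequencing errors, repeats and reverse strands, none of which is attempted here.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two reads with the largest overlap into a contig."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads covering a 16-base stretch.
reads = ["ATGAAGGC", "AGGCCTTA", "CTTAAAAG"]
print(greedy_assemble(reads))  # ['ATGAAGGCCTTAAAAG']
```

In the clone-by-clone approach the same procedure would be run separately on the reads of each clone, and the resulting contigs ordered using the clone map.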
FIRST-GENERATION GEL-BASED SEQUENCE ANALYSIS
Capable of sequencing small regions
Y chromosome: 50 Mb; Chromosome 1: 250 Mb
50,000-100,000 b/year @ $1-2/b
SECOND-GENERATION SEQUENCING TECHNOLOGIES
Sequence 100,000 b/d @ $0.20 to $0.50/b
Faster, more sensitive, more accurate
THIRD-GENERATION GEL-LESS TECHNOLOGIES

Fluorescence detection of bases by flow cytometry


Scanning tunneling or atomic force microscopies
Mass spectroscopic analysis
Sequencing by hybridization
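To put the first- and second-generation throughput figures above in perspective, the following sketch works out how long a single instrument would need to read Chromosome 1 (250 Mb) once, and what it would cost. The rates and prices are the ones quoted on the slide; the single-instrument, single-pass setup, the 365-day year and the helper name `years_and_cost` are assumptions made for this example.

```python
CHROMOSOME_1_BASES = 250_000_000  # 250 Mb, from the slide
DAYS_PER_YEAR = 365               # assumption: instrument runs every day


def years_and_cost(bases: int, bases_per_year: float, dollars_per_base: float):
    """Return (years needed, total cost in dollars) for one pass on one instrument."""
    return bases / bases_per_year, bases * dollars_per_base


# (bases per year, dollars per base) at each end of the quoted ranges
scenarios = {
    "first generation, slow and expensive end": (50_000, 2.00),
    "first generation, fast and cheap end": (100_000, 1.00),
    "second generation, expensive end": (100_000 * DAYS_PER_YEAR, 0.50),
    "second generation, cheap end": (100_000 * DAYS_PER_YEAR, 0.20),
}

for label, (rate, price) in scenarios.items():
    years, cost = years_and_cost(CHROMOSOME_1_BASES, rate, price)
    print(f"{label}: {years:,.1f} years, ${cost:,.0f}")
```

At first-generation rates a single pass over Chromosome 1 would take 2,500 to 5,000 instrument-years, which is why gel-based methods were described as suitable only for small regions.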
Functions of protein
NUCLEIC ACIDS: 4 NUCLEOTIDES
PROTEINS: 20 AMINO ACIDS
[Figure: peptide bond formation. Two AMINO ACIDS, H2N-CH(R1)-COOH and H2N-CH(R2)-COOH, condense with loss of H2O to give a DI-PEPTIDE, H2N-CH(R1)-CO-NH-CH(R2)-COOH. Adding a third amino acid (R3) gives a TRI-PEPTIDE, and repeating the step up to residue Rn gives a POLYPEPTIDE.]
GENETIC CODE
(first base in the left column, second base across the top)

     U         C         A         G
U    UUU Phe   UCU Ser   UAU Tyr   UGU Cys
     UUC Phe   UCC Ser   UAC Tyr   UGC Cys
     UUA Leu   UCA Ser   UAA End   UGA End
     UUG Leu   UCG Ser   UAG End   UGG Trp

C    CUU Leu   CCU Pro   CAU His   CGU Arg
     CUC Leu   CCC Pro   CAC His   CGC Arg
     CUA Leu   CCA Pro   CAA Gln   CGA Arg
     CUG Leu   CCG Pro   CAG Gln   CGG Arg

A    AUU Ile   ACU Thr   AAU Asn   AGU Ser
     AUC Ile   ACC Thr   AAC Asn   AGC Ser
     AUA Ile   ACA Thr   AAA Lys   AGA Arg
     AUG Met   ACG Thr   AAG Lys   AGG Arg

G    GUU Val   GCU Ala   GAU Asp   GGU Gly
     GUC Val   GCC Ala   GAC Asp   GGC Gly
     GUA Val   GCA Ala   GAA Glu   GGA Gly
     GUG Val   GCG Ala   GAG Glu   GGG Gly
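The table above can be turned directly into a lookup for translating a sequence into protein. The sketch below builds the 64-codon dictionary from a compact string listed in the same U, C, A, G order as the table, using one-letter amino acid codes and `*` for the "End" codons. The function name `translate`, the choice to start reading at the first base, and stopping at the first End codon are assumptions made for this example.

```python
BASES = "UCAG"  # same row/column order as the table above

# One-letter amino acid codes for UUU, UUC, UUA, UUG, UCU, ... GGG; '*' marks End.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(sequence: str) -> str:
    """Translate an RNA (or DNA coding-strand) sequence, reading from its first base.

    Translation stops at the first End codon; trailing bases that do not
    fill a complete codon are ignored.
    """
    rna = sequence.upper().replace("T", "U")
    protein = []
    for start in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGUUUUAA"))  # MF  (Met, Phe, then End)
```

Because several codons map to the same amino acid (for example all four GGx codons give Gly), the mapping cannot be reversed to recover a unique nucleotide sequence from a protein.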
The primary structure of a protein is its linear
sequence of amino acids and the location of any
disulfide (-S-S-) bridges.
SECONDARY STRUCTURE
Most proteins contain one or more stretches of
amino acids that take on a characteristic structure in 3-D
space. The most common of these are the alpha helix and
the beta conformation.
Alpha Helix

The R groups of the amino acids all extend to the outside.


The helix makes a complete turn every 3.6 amino acids.
The helix is right-handed; it twists in a clockwise direction.
The carbonyl group (-C=O) of each peptide bond extends
parallel to the axis of the helix and points directly at the -N-H group of
the peptide bond 4 amino acids below it in the helix. A hydrogen bond
forms between them [-N-H·····O=C-] .
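The two numbers in this description, 3.6 residues per turn and a hydrogen bond to the residue 4 positions along, are enough for a small geometric sketch. The code below computes where each residue sits around the helix axis (360 / 3.6 = 100 degrees per residue) and lists the backbone hydrogen-bond pairs. The function names and the 0-based residue numbering are assumptions made for this example.

```python
RESIDUES_PER_TURN = 3.6  # from the description above


def helix_angle(residue_index: int) -> float:
    """Angular position in degrees of a residue around the helix axis.

    Residue 0 is placed at 0 degrees; each residue adds 360 / 3.6 degrees.
    """
    return (residue_index * 360 / RESIDUES_PER_TURN) % 360


def hydrogen_bond_pairs(n_residues: int) -> list[tuple[int, int]]:
    """Pair the C=O of residue i with the N-H of residue i + 4."""
    return [(i, i + 4) for i in range(n_residues - 4)]


for i in range(5):
    print(f"residue {i}: {helix_angle(i):.0f} degrees")

print(hydrogen_bond_pairs(8))  # [(0, 4), (1, 5), (2, 6), (3, 7)]
```

Plotting these angles gives the familiar "helical wheel" view, in which the outward-pointing R groups are seen end-on along the helix axis.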
BETA CONFORMATION
Consists of pairs of chains lying side-by-side and stabilized by hydrogen bonds between the carbonyl oxygen atom on one chain and the -NH group on the adjacent chain.
The chains are often "anti-parallel", the N-terminal to C-terminal direction of one being the reverse of the other.
FOUR-RESIDUE β-HAIRPINS
These are also quite common, with the first two residues adopting the alpha-helical conformation. The third residue has psi and phi angles which lie in the bridging region between alpha-helix and beta-sheet, and the final residue adopts the left-handed alpha-helical conformation and is therefore usually glycine, aspartate or asparagine.
TERTIARY PROTEIN STRUCTURE AND FOLDS
There are a number of examples of small proteins (or peptides) which consist of little more than a single helix. A striking example is alamethicin, a transmembrane voltage-gated ion channel, acting as a peptide antibiotic.
Quaternary Structure of Protein
Structures of Proteins


FOUR NUCLEOTIDES OF DNA

A T

G C
DNA-1 ATGAAGGCCTTAAAAGAGCTTTCCCAATTTCTAG..
DNA-2 GGTTAACGTTAGGGGGAACCAAGTGGAATTGATA.
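Since A pairs with T and G pairs with C, the opposite strand of either sequence can be derived mechanically, and simple statistics such as GC content follow from counting letters. The sketch below uses only the bases visible on the slide (the trailing dots indicate the sequences continue, and nothing is filled in); the function names are assumptions made for this example.

```python
# Visible portions of the two slide sequences; the originals continue past this point.
DNA_1 = "ATGAAGGCCTTAAAAGAGCTTTCCCAATTTCTAG"
DNA_2 = "GGTTAACGTTAGGGGGAACCAAGTGGAATTGATA"

COMPLEMENT = str.maketrans("ATGC", "TACG")  # A<->T, G<->C


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def gc_content(dna: str) -> float:
    """Fraction of bases that are G or C."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


for name, seq in (("DNA-1", DNA_1), ("DNA-2", DNA_2)):
    print(name, reverse_complement(seq), f"GC = {gc_content(seq):.2f}")
```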
DNA MICROARRAY
Sickle cell anaemia: due to a change in one base pair (A to T), which replaces glutamic acid with valine in the β-globin chain.
CCTGATCC (normal) -> CCTGTTCC (sickle)
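Locating a single-base change like this one is a position-by-position comparison of two aligned sequences. The sketch below reports every position where the two fragments shown on the slide differ; the function name `point_differences` and the 1-based position numbering are assumptions made for this example, and it only handles substitutions between equal-length sequences, not insertions or deletions.

```python
def point_differences(reference: str, variant: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, variant base) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        (position, ref_base, var_base)
        for position, (ref_base, var_base) in enumerate(zip(reference, variant), start=1)
        if ref_base != var_base
    ]


print(point_differences("CCTGATCC", "CCTGTTCC"))  # [(5, 'A', 'T')]
```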
