
Introduction to Bioinformatics

2. Genetics Background

Course 341
Department of Computing
Imperial College, London

© Simon Colton
Coursework

 1 coursework – worth 20 marks


– Work in pairs

 Retrieving information from a database

 Using Perl to manipulate that information


The Robot Scientist

 Performs experiments
 Learns from results
– Using machine learning
 Plans more experiments
 Saves time and money

 Team member:
– Stephen Muggleton
Biological Nomenclature

 Need to know the meaning of:


– Species, organism, cell, nucleus, chromosome, DNA
– Genome, gene, base, residue, protein, amino acid
– Transcription, translation, messenger RNA
– Codons, genetic code, evolution, mutation, crossover
– Polymer, genotype, phenotype, conformation
– Inheritance, homology, phylogenetic trees
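Substructure and Effect
(Top Down/Bottom Up)

 Top down (substructure):
– Species, organism, cell, nucleus, chromosome, DNA strand, gene, base
 Bottom up (effect):
– Gene prescribes amino acid
– Amino acid folds into protein
– Protein affects the function of the cell
– Cell affects the behaviour of the organism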
Cells

 Basic unit of life


 Different types of cell:
– Skin, brain, red/white blood
– Different biological function
 Cells produced by cells
– Cell division (mitosis)
– 2 daughter cells
 Eukaryotic cells
– Have a nucleus
Nucleus and Chromosomes

 Each cell has nucleus


 Rod-shaped particles inside
– Are chromosomes
– Which we think of in pairs
 Different number for species
– Human (46), tobacco (48)
– Goldfish (94), chimp (48)
– Usually paired up
 X & Y Chromosomes
– Humans: Male (XY), Female (XX)
– Birds: the reverse pattern, written Male (ZZ), Female (ZW)
DNA Strands

 Chromosomes are same in every cell of organism


– Supercoiled DNA (Deoxyribonucleic acid)
 Take a human, take one cell
– Determine the structure of all chromosomal DNA
– You’ve just read the human genome (for 1 person)
– Human genome project
 13 years, 3.2 billion chemicals (bases) in human genome
 Other genomes being/been decoded:
– Pufferfish, fruit fly, mouse, chicken, yeast, bacteria
DNA Structure
 Double Helix (Crick & Watson)
– 2 coiled matching strands
– Backbone of sugar phosphate pairs
 Nitrogenous Base Pairs
– Roughly 20 atoms in a base
– Adenine ↔ Thymine [A,T]
– Cytosine ↔ Guanine [C,G]
– Weak bonds (can be broken)
– Form long chains called polymers
 Read the sequence on 1 strand
– GATTCATCATGGATCATACTAAC
Differences in DNA
 DNA differentiates:
– Species/race/gender
– Individuals
 We share DNA with
– Primates, mammals
– Fish, plants, bacteria
 Genotype
– DNA of an individual
 Genetic constitution
 Phenotype
– Characteristics of the resulting organism
 Nature and nurture
[Figure labels: "Share Material", "Roughly 4%", "2% tiny"]
Genes
 Chunks of DNA sequence
– Between 600 and 1200 bases long
– 32,000 human genes, 100,000 genes in tulips
 Large percentage of human genome
– Is “junk”: does not code for proteins
 “Simpler” organisms such as bacteria
– Are much more evolved (have hardly any junk)
– Viruses have overlapping genes (zipped/compressed)
 Often the active part of a gene is split into exons
– Separated by introns
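
The exon/intron layout above can be sketched in a few lines of Python. This is only an illustration: the toy gene, the exon positions and the helper name `splice_exons` are all made up for the example, and real exon boundaries have to come from annotation data.

```python
def splice_exons(gene, exons):
    """Join the exon segments of a gene, skipping the introns between them.

    `exons` is a list of (start, end) index pairs, with `end` exclusive.
    """
    return "".join(gene[start:end] for start, end in exons)


# Hypothetical toy gene: upper case marks exons, lower case marks introns.
gene = "ATGGCC" + "gtaagt" + "TTTGAC" + "gtcag" + "TAA"
exons = [(0, 6), (12, 18), (23, 26)]

coding = splice_exons(gene, exons)
print(coding)  # ATGGCCTTTGACTAA
```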
The Synthesis of Proteins

 Instructions for generating Amino Acid sequences


– (i) DNA double helix is unzipped
– (ii) One strand is transcribed to messenger RNA
– (iii) RNA acts as a template
 ribosomes translate the RNA into the sequence of amino acids
 Amino acid sequences fold into a 3d molecule
 Gene expression
– Every cell has every gene in it (has all chromosomes)
– Which ones produce proteins (are expressed) & when?
Transcription
 Take one strand of DNA
 Write out the counterparts to each base
– G becomes C (and vice versa)
– A becomes T (and vice versa)
 Change Thymine [T] to Uracil [U]
 You have transcribed DNA into messenger RNA
 Example:
Start: GGATGCCAATG
Intermediate: CCTACGGTTAC
Transcribed: CCUACGGUUAC
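
The recipe on this slide can be written as a short Python sketch. The function names `complement` and `transcribe` are chosen here for illustration; the code follows the slide's simplified recipe and reproduces its example.

```python
# Pair each DNA base with its counterpart: G<->C and A<->T.
COMPLEMENT = str.maketrans("GCAT", "CGTA")


def complement(strand):
    """Write out the counterpart of every base on one DNA strand."""
    return strand.translate(COMPLEMENT)


def transcribe(strand):
    """Complement the strand, then change thymine (T) to uracil (U)."""
    return complement(strand).replace("T", "U")


start = "GGATGCCAATG"
print(complement(start))  # CCTACGGTTAC (intermediate)
print(transcribe(start))  # CCUACGGUUAC (messenger RNA)
```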
Genetic Code
 How the translation occurs

 Think of this as a function:


– Input: triples of base letters (codons)
– Output: amino acid
– Example: ACC becomes threonine (T)

 Gene sequences end with:


– TAA, TAG or TGA
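
Reading a sequence as codons and stopping at the end marker might look like the following Python sketch. It assumes the sequence is written in DNA letters and read in triples from its first base; the names `codons` and `coding_codons` and the test sequence are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(sequence):
    """Split a sequence into consecutive triples, dropping leftover bases."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


def coding_codons(sequence):
    """Return the codons that come before the first stop codon."""
    result = []
    for codon in codons(sequence):
        if codon in STOP_CODONS:
            break
        result.append(codon)
    return result


print(coding_codons("ACCGGTTAAGGG"))  # ['ACC', 'GGT']
```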
Genetic Code

A=Ala=Alanine
C=Cys=Cysteine
D=Asp=Aspartic acid
E=Glu=Glutamic acid
F=Phe=Phenylalanine
G=Gly=Glycine
H=His=Histidine
I=Ile=Isoleucine
K=Lys=Lysine
L=Leu=Leucine
M=Met=Methionine
N=Asn=Asparagine
P=Pro=Proline
Q=Gln=Glutamine
R=Arg=Arginine
S=Ser=Serine
T=Thr=Threonine
V=Val=Valine
W=Trp=Tryptophan
Y=Tyr=Tyrosine
Example Synthesis
 TCGGTGAATCTGTTTGAT
Transcribed to:
 AGCCACUUAGACAAACUA
Translated to:
 SHLDKL
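
The whole example can be reproduced with a small Python sketch that treats the genetic code as a lookup table. The table below is the standard genetic code written in the usual compact form (codons in UCAG order, `*` for stop); the function names are chosen for illustration, and the transcription step follows the simplified recipe from the Transcription slide.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one amino-acid letter per codon in UCAG order;
# '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(codon): amino_acid
                for codon, amino_acid
                in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna):
    """Complement each base, then write U in place of T."""
    return dna.translate(str.maketrans("GCAT", "CGTA")).replace("T", "U")


def translate(mrna):
    """Map each codon to its amino acid, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        amino_acid = GENETIC_CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("TCGGTGAATCTGTTTGAT")
print(mrna)             # AGCCACUUAGACAAACUA
print(translate(mrna))  # SHLDKL
```

The table is keyed on RNA letters, so the stop codons appear as UAA, UAG and UGA.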
Proteins
 DNA codes for
– strings of amino acids
 Amino acid strings
– Fold up into complex 3d molecule
– 3d structures: conformations
– Between 200 & 400 “residues”
– Folds are proteins
 Residue sequences
– Always fold to same conformation
 Proteins play a part
– In almost every biological process
Evolution of Genes: Inheritance

 Evolution of species
– Caused by reproduction and survival of the fittest
 But actually, it is the genotype which evolves
– Organism has to live with it (or die before reproduction)
– Three mechanisms: inheritance, mutation and crossover
 Inheritance: properties from parents
– Embryo has cells with 23 pairs of chromosomes
– Each pair: 1 chromosome from father, 1 from mother
– Most important factor in offspring’s genetic makeup
Evolution of Genes: Mutation

 Genes alter (slightly) during reproduction


– Caused by errors, from radiation, from toxicity
– 3 possibilities: deletion, insertion, substitution (alteration)
 Deletion: ACGTTGACTC → ACGTGACTC
 Insertion: ACGTTGACTC → AGCGTTGACTC
 Substitution: ACGTTGACTC → ACGATGACTT
 Mutations are almost always deleterious
– A single change has a massive effect on translation
– Causes a different protein conformation
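
The three kinds of mutation can be sketched as simple string operations in Python. The function names and the zero-based positions are chosen here to reproduce the examples on this slide; note that the substitution example changes two bases.

```python
def delete(seq, i):
    """Remove the base at position i."""
    return seq[:i] + seq[i + 1:]


def insert(seq, i, base):
    """Insert a base before position i."""
    return seq[:i] + base + seq[i:]


def substitute(seq, i, base):
    """Replace the base at position i."""
    return seq[:i] + base + seq[i + 1:]


original = "ACGTTGACTC"
print(delete(original, 3))                               # ACGTGACTC
print(insert(original, 1, "G"))                          # AGCGTTGACTC
print(substitute(substitute(original, 3, "A"), 9, "T"))  # ACGATGACTT
```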
Evolution of Genes:
Crossover (Recombination)
 DNA sections are swapped
– From male and female genetic input to offspring DNA
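
As a rough illustration, swapping sections between two sequences can be sketched in Python as a single-point crossover. This is a deliberately simplified model (one swap point, equal-length toy strings), not a description of how recombination works in a cell; the function name and the sequences are made up.

```python
def crossover(maternal, paternal, point):
    """Swap everything after `point` between two sequences."""
    child_a = maternal[:point] + paternal[point:]
    child_b = paternal[:point] + maternal[point:]
    return child_a, child_b


# Made-up toy sequences, chosen so the swapped sections are easy to see.
child_a, child_b = crossover("AAAAAAAA", "CCCCCCCC", 3)
print(child_a)  # AAACCCCC
print(child_b)  # CCCAAAAA
```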
Bioinformatics Application #1
Phylogenetic trees
 Understand our evolution
 Genes are homologous
– If they share a common ancestor
 By looking at DNA sequences
– For particular genes
– See who evolved from whom
 Example:
– Mammoth most related to
 African or Indian Elephants?
 LUCA:
– Last Universal Common Ancestor
– Roughly 4 billion years ago
Genetic Disorders

 Disorders have fuelled much genetics research


– Remember that genes have evolved to function
 Not to malfunction
 Different types of genetic problems
 Down's syndrome: three chromosome 21s
 Cystic fibrosis:
– Single base-pair mutation disables a protein
– Restricts the flow of ions into certain lung cells
– Lung is less able to expel fluids
Bioinformatics Application #2
Predicting Protein Structure

 Proteins fold to set up an active site


– Small, but highly effective (sub)structure
– Active site(s) determine the activity of the protein
 Remember that translation is a function
– Always same structure given same set of codons
– Is there a set of rules governing how proteins fold?
– No one has found one yet
– “Holy Grail” of bioinformatics
Protein Structure Knowledge

 Both protein sequence and structure


– Are being determined at an exponential rate
 1.3+ Million protein sequences known
– Found with projects like Human Genome Project
 20,000+ protein structures known
– Found using techniques like X-ray crystallography
 Takes between 1 month and 3 years
– To determine the structure of a protein
– Process is getting quicker
Sequence versus Structure

[Plot: number of known protein sequences and protein structures by year, 1985 to 2000; vertical axis "Number", 0 to 500,000]
Database Approaches
 Slow(er) rate of finding protein structure
– Still a good idea to pursue the Holy Grail
 Structure is much more conservative than sequence
– 1.3m genes, but only 2,000 – 10,000 different conformations
 First approach to structure prediction:
– Store [sequence,structure] pairs in a database
– Find ways to score similarity of residue sequences
– Given a new sequence, find closest matches
 A good match will possibly mean similar protein shape
 E.g., sequence identity > 35% will give a good match
– Rest of the first half of the course about these issues
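
The database approach can be sketched in Python as follows. This is a minimal illustration under strong assumptions: the sequences are already the same length and aligned position by position, similarity is plain percent identity, and the database entries, names and structure labels are invented for the example. Scoring methods for real sequences are more involved.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def closest_matches(query, database, threshold=35.0):
    """Return (identity, name) pairs above the threshold, best first."""
    scored = [(percent_identity(query, seq), name)
              for name, (seq, _structure) in database.items()
              if len(seq) == len(query)]
    return sorted((s for s in scored if s[0] > threshold), reverse=True)


# Hypothetical [sequence, structure] pairs; names and labels are made up.
database = {
    "protein_1": ("SHLDKL", "structure_1"),
    "protein_2": ("SHLEKV", "structure_2"),
    "protein_3": ("GAVWYT", "structure_3"),
}
print(closest_matches("SHLDKL", database))
```

With the 35% threshold from the slide, the first two toy entries are returned and the third is filtered out.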
Potential (Big) Payoffs
of Protein Structure Prediction

 Protein function prediction


– Protein interactions and docking

 Rational drug design


– Inhibit or stimulate protein activity with a drug

 Systems biology
– Putting it all together: “E-cell” and “E-organism”
– In-silico modelling of biological entities and processes
Further Reading
 Human Genome Project at Sanger Centre
– http://www.sanger.ac.uk/HGP/

 Talking glossary of genetic terms


– http://www.genome.gov/glossary.cfm

 Primer on molecular genetics


– http://www.ornl.gov/TechResources/Human_Genome/publicat/primer/toc.html
