2. Genetics Background
Course 341
Department of Computing
Imperial College, London
© Simon Colton
Coursework
Performs experiments
Learns from results
– Using machine learning
Plans more experiments
Saves time and money
Team member:
– Stephen Muggleton
Biological Nomenclature
[Diagram: levels from largest to smallest: Species, Organism, Cell, Nucleus, Chromosome, DNA strand, Gene, Base]
– Gene prescribes amino acid
– Amino acid folds into protein
– Protein affects the function of the cell and the behaviour of the organism
Cells
Genotype
– DNA of an individual
– Genetic constitution
Phenotype
– Characteristics of the resulting organism
– Nature and nurture
[Figure labels: "Share Material", "Roughly 4%"]
Genes
Chunks of DNA sequence
– Between 600 and 1200 bases long
– 32,000 human genes, 100,000 genes in tulips
Large percentage of human genome
– Is “junk”: does not code for proteins
“Simpler” organisms such as bacteria
– Are much more evolved (have hardly any junk)
– Viruses have overlapping genes (zipped/compressed)
Often the active part of a gene is split into exons
– Separated by introns
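As a rough illustration of the exon/intron split, the sketch below joins the exon segments of a toy gene and skips the introns between them. The sequence, the coordinates and the function name `splice_exons` are all made up for this example; real splicing is considerably more involved.

```python
def splice_exons(dna, exons):
    """Concatenate the exon segments of `dna`, skipping the introns between them.

    `exons` is a list of (start, end) pairs, 0-based with `end` exclusive.
    """
    return "".join(dna[start:end] for start, end in sorted(exons))


# Toy gene (hypothetical): uppercase marks exons, lowercase marks introns.
toy_gene = "ATGGCC" + "gtaagt" + "TTTGAA" + "gtcag" + "TAA"
toy_exons = [(0, 6), (12, 18), (23, 26)]

# Only the exon segments remain in the coding sequence.
coding_sequence = splice_exons(toy_gene, toy_exons)
```

Sorting the exon coordinates first means the segments are joined in genome order even if they are listed out of order.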
The Synthesis of Proteins
Evolution of species
– Caused by reproduction and survival of the fittest
But actually, it is the genotype which evolves
– Organism has to live with it (or die before reproduction)
– Three mechanisms: inheritance, mutation and crossover
Inheritance: properties from parents
– Embryo has cells with 23 pairs of chromosomes
– Each pair: 1 chromosome from father, 1 from mother
– Most important factor in offspring’s genetic makeup
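The three mechanisms listed above can be sketched on toy DNA strings. This is a heavily simplified model, not the course's own code: the function names, the single-point crossover and the per-base mutation rate are assumptions made for illustration, and a single chromosome pair stands in for all 23.

```python
import random

BASES = "ACGT"


def mutate(strand, rate, rng):
    """Replace each base with a different random base with probability `rate`."""
    out = []
    for base in strand:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def crossover(strand_a, strand_b, rng):
    """Single-point crossover: a prefix of one strand joined to the suffix of the other."""
    point = rng.randint(1, len(strand_a) - 1)
    return strand_a[:point] + strand_b[point:]


def inherit(mother_pair, father_pair, rng):
    """Build the child's chromosome pair: one chromosome derived from each parent."""
    from_mother = crossover(mother_pair[0], mother_pair[1], rng)
    from_father = crossover(father_pair[0], father_pair[1], rng)
    return (from_mother, from_father)


# Hypothetical parents, each with one pair of toy chromosomes.
rng = random.Random(0)
mother = ("AAAAAAAA", "CCCCCCCC")
father = ("GGGGGGGG", "TTTTTTTT")
child = tuple(mutate(c, 0.05, rng) for c in inherit(mother, father, rng))
```

Passing an explicit `random.Random` instance keeps runs repeatable, which is convenient when experimenting with different mutation rates.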
Evolution of Genes: Mutation
[Figure: number of known protein sequences and protein structures by year, 1985 to 2000; vertical axis "Number", 0 to 500,000]
Database Approaches
Slow(er) rate of finding protein structure
– Still a good idea to pursue the Holy Grail
Structure is much more conservative than sequence
– 1.3m genes, but only 2,000 – 10,000 different conformations
First approach to structure prediction:
– Store [sequence,structure] pairs in a database
– Find ways to score similarity of residue sequences
– Given a new sequence, find closest matches
A good match will possibly mean similar protein shape
E.g., sequence identity > 35% will give a good match
– The rest of the first half of the course is about these issues
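The database approach above can be sketched in a few lines. This example assumes the sequences are already aligned and of equal length, and it scores similarity as plain residue-by-residue identity, which is only one of many possible scoring schemes. The database entries, the structure labels and the function names are hypothetical.

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of positions with the same residue in two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def closest_match(query, database, threshold=0.35):
    """Return (identity, sequence, structure) for the best entry, or None if identity <= threshold."""
    best = max(
        ((sequence_identity(query, seq), seq, structure) for seq, structure in database),
        key=lambda item: item[0],
        default=None,
    )
    if best is None or best[0] <= threshold:
        return None
    return best


# Hypothetical database of [sequence, structure] pairs.
toy_database = [
    ("MKTAYIAK", "fold_A"),
    ("GSHMLEDP", "fold_B"),
]

# The query differs from the first entry at one position out of eight.
hit = closest_match("MKTAYLAK", toy_database)
```

The 0.35 default mirrors the "sequence identity > 35%" rule of thumb from the slide. A query with no entry above that threshold returns `None` rather than a weak match.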
Potential (Big) Payoffs of Protein Structure Prediction
Systems biology
– Putting it all together: “E-cell” and “E-organism”
– In-silico modelling of biological entities and processes
Further Reading
Human Genome Project at Sanger Centre
– http://www.sanger.ac.uk/HGP/