
Introduction to Bioinformatics

Department of
Computer Science and
Electronics
Gadjah Mada University

© Afiahayati
Courses
 You do not need A-level biology for the course

 We don’t have any background requirements


– AI might help, databases might help…
Bioinformatics Soundbites
 An intersection of AI and genetics
– Two very popular (most wanted) sciences

 An opportunity to:
– Use some of the most interesting computational
techniques to solve some of the most important and
rewarding questions

 When KPK solves Gayus's scandal


Aims for the Course

 To give an introduction to
– Modern bioinformatics practices
– With an emphasis on computational aspects
 To produce bioinformaticians
– Able to complement biologists
 They use the algorithms, we design and tweak them
 Two important areas
– Protein structure prediction
– Data mining bioarray informatics
Health Information Hierarchy
What is…?
 Bioinformatics
– Applying computational techniques to biology data

 Medical informatics
– Applying computational techniques to medical data

 Chemo-informatics
– Applying computational techniques to chemical data

 Lots of overlap between the three disciplines


– Idea is to enhance and enable scientific discovery
Bioinformatics Data

 DNA data
– Application: structural genomics
 Drug trial data
– Application: predictive toxicology
 Patient record data
– Application: medical diagnosis
Lecture 2

From DNA to Cell Function

 DNA sequence (split into genes) codes for an amino acid sequence
 The amino acid sequence folds into a protein
 The protein has a 3D structure
 The 3D structure dictates the protein's function
 Protein function determines cell activity
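
The first step of the chain above (a gene's DNA codes for an amino acid sequence) can be sketched in Python. This is a minimal illustration, assuming the standard genetic code and a coding strand read from its first base. The names `CODON_TABLE` and `translate` are illustrative and not part of the course material.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon
    or when fewer than three bases remain.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGTTCCTTTCTAA"))  # made-up example gene fragment
```

The later steps (folding into a 3D structure, and structure dictating function) have no comparably simple rule, which is why structure prediction is treated as a research problem in this course.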
Good and Bad News

 We know the DNA sequence in many species


– E.g., human genome (3.2 billion bases)
 There is a 1:1 mapping
– DNA sequence in genes → protein structure
 But we don’t yet know a good algorithm for this
– And we cannot yet observe proteins folding
 “Holy Grail” of bioinformatics
– Identify protein structure from amino acid sequence
Two Approaches to
Structure Prediction
 Given a new gene sequence
– Predict the structure of the protein it codes for

 Sequence matching approach


– Find a sequence with known structure
 That closely matches the new sequence
 Use the old structure to predict the new one

 Machine learning approach


– Train a learning method (e.g., neural net) with the known
sequence/structure pairs
– Use this to predict new structure
– Careful to use good practice (prepare data/evaluate hypothesis)
Matching Sequences
 We know the structure of many proteins
– X-ray crystallography
 Genes with similar amino acid (residue) sequences
– Produce proteins with similar structures
 Can change much of the sequence without changing the structure
 Given a new genetic sequence
– Search for genes with similar sequences in databases
 Matching DNA/residue sequences
– Very important aspect of bioinformatics
– Not as easy as it sounds
Match These Sequences

 How do we match this sequence:


gattcagacctagct

 With this sequence:


gtcagatcct
Possible Answers

1. No indels
gattcagacctagct
gtcagatcct
2. With indels
gattcaga-cctagct
g-t-cagatcct
3. No overhang
gattcagacctagc-t
gtcagatcct
4. With overhang
gattcagacctagct
gtcagatcct
Lecture 3

Sequence Matching Algorithms #1

 Without indels
 Hamming distance
 Scoring schemes
– Certain changes in sequence more likely
 Due to chemical properties of the residues
 BLAST algorithm
– Idea: match local regions and expand
– Seven-part process
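
A minimal Python sketch of matching without indels: the Hamming distance counts the mismatched positions between two equal-length sequences, and sliding the shorter sequence along the longer one finds the best ungapped placement. The function names are illustrative, and every mismatch is assumed to cost the same. A real scoring scheme would weight substitutions by how likely they are.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_match(long_seq, short_seq):
    """Slide short_seq along long_seq with no indels and no overhang.

    Returns (offset, distance) for the placement with the fewest
    mismatches; the earliest offset wins ties.
    """
    best = None
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        distance = hamming(window, short_seq)
        if best is None or distance < best[1]:
            best = (offset, distance)
    return best


print(best_ungapped_match("gattcagacctagct", "gtcagatcct"))
```

For the two sequences from the matching exercise, the best placement without indels or overhang starts at offset 2 and still leaves 4 mismatches, which is why indels are worth allowing.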
Lecture 4

Sequence Matching Algorithms #2


 With indels
 Drawing of Dotplots
– Example sequence: VPFLLMMVLG
 Dynamic Programming (getting from A to B)
– Quickest route from A to B via Z = quickest route to Z + quickest route from Z
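
The dynamic programming idea (the best route through a point is the best route to it plus the best route from it) can be sketched as an edit distance calculation. This is a minimal Python illustration, assuming unit costs for a substitution and for an indel. The function name `edit_distance` is illustrative, and real alignment tools use residue-specific scores instead.

```python
def edit_distance(a, b):
    """Fewest substitutions and indels needed to turn sequence a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = cheapest way to align the first i symbols of a
    # with the first j symbols of b.
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete i symbols
    for j in range(cols):
        dist[0][j] = j  # insert j symbols
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            # Each cell reuses the best routes to its three neighbours.
            dist[i][j] = min(substitute, delete, insert)
    return dist[-1][-1]


print(edit_distance("gattcagacctagct", "gtcagatcct"))
```

Each cell is filled once from already-solved neighbours, so the cost grows with the product of the two sequence lengths rather than with the number of possible alignments.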
Lecture 5

Searching Databases

 We have ways to score how well two sequences match


 Now want to use this in databases
– Given a known gene sequence
– Which genes in the database are closely related
 Have to worry about:
– Repeated subsequences biasing matches
– Accuracy and significance of matches
– Sensitivity and specificity (false positives and false negatives)
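
Sensitivity and specificity can be computed from the hits a search reports. A minimal Python sketch, assuming we already know which database genes are truly related to the query. The function name and the gene identifiers are made up for illustration.

```python
def search_quality(reported, related, database):
    """Return (sensitivity, specificity) for one database search.

    reported: ids the search returned as matches
    related:  ids that are truly related to the query
    database: all ids that were searched
    Assumes the database holds at least one related and one unrelated id.
    """
    reported, related, database = set(reported), set(related), set(database)
    true_pos = len(reported & related)
    false_pos = len(reported - related)             # reported but unrelated
    false_neg = len(related - reported)             # related but missed
    true_neg = len(database - reported - related)   # correctly ignored
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


database = {f"g{i}" for i in range(1, 11)}   # hypothetical gene ids g1..g10
related = {"g1", "g2", "g3", "g4"}
reported = {"g1", "g2", "g3", "g5"}
print(search_quality(reported, related, database))
```

In this toy search, 3 of the 4 related genes are found (sensitivity 0.75) and 5 of the 6 unrelated genes are correctly left out.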
Lecture 6

Multiple Sequence Alignments


 Protein sequences form families
– Learn much more about a gene by looking at its family

 Multiple sequence alignment algorithms
– Profiles
– PSI-BLAST
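
A profile summarises a family by recording, for each column of a multiple alignment, how often each residue occurs. A minimal Python sketch, assuming the sequences are already aligned to equal length. The toy family below is invented for illustration, and the function names are not from any particular tool.

```python
from collections import Counter


def build_profile(alignment):
    """Per-column residue frequencies for equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(alignment) for res, n in counts.items()})
    return profile


def consensus(profile):
    """Most frequent residue in each column of the profile."""
    return "".join(max(column, key=column.get) for column in profile)


family = ["VPFLL", "VPFLM", "VAFLL", "IPFLL"]  # made-up aligned family
profile = build_profile(family)
print(profile[0])          # frequencies in the first column
print(consensus(profile))
```

Columns where one residue dominates mark conserved positions. Methods such as PSI-BLAST build on this idea by searching with a profile rather than with a single sequence.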
Lectures 7 & 8

Hidden Markov Models


 Statistical Representation
– Of a protein family
 Describes how to generate
– A protein sequence

 Can be used to generate a multiple alignment


 Can be given a multiple alignment
– And estimate HMM parameters

 Can use libraries of HMMs
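
The "describes how to generate a protein sequence" point can be sketched with a toy model. This is a minimal Python illustration with two hidden states. The states, residues, and probabilities are all invented for the example, and it is far simpler than the profile HMMs used for real protein families.

```python
import random

# Hypothetical two-state model: every number here is made up.
TRANSITIONS = {
    "hydrophobic": {"hydrophobic": 0.8, "polar": 0.2},
    "polar": {"hydrophobic": 0.3, "polar": 0.7},
}
EMISSIONS = {
    "hydrophobic": {"V": 0.4, "L": 0.4, "F": 0.2},
    "polar": {"S": 0.5, "T": 0.3, "N": 0.2},
}


def pick(distribution, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(distribution)
    weights = list(distribution.values())
    return rng.choices(keys, weights=weights, k=1)[0]


def generate(length, start_state="hydrophobic", seed=None):
    """Generate (sequence, hidden state path) of the given length."""
    rng = random.Random(seed)
    state = start_state
    residues, path = [], []
    for _ in range(length):
        path.append(state)
        residues.append(pick(EMISSIONS[state], rng))  # emit a residue
        state = pick(TRANSITIONS[state], rng)         # move to next state
    return "".join(residues), path


sequence, path = generate(12, seed=0)
print(sequence)
print(path)
```

Only the residues would be observed in practice. The state path is hidden, which is what parameter estimation from a multiple alignment has to account for.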


Lecture 9

Machine Learning
 Machine learning (inductive reasoning)
– Automatic proposing of hypotheses based on data
– Has many applications in bioinformatics
 Including protein structure prediction
 Example: predictive toxicology
– Given: set of toxic drugs and a set of non-toxic drugs
– Given: background information (chemistry, etc.)
– Produces: hypothesis why drugs are toxic
 Overview of machine learning
– Aims, techniques, methodologies, representations
 Artificial neural networks
Evaluating Learned Hypotheses
 How do we know that a rule/hypothesis
– Reflects something interesting, not a coincidence?

 Show that a learning algorithm isn’t overfitting


– i.e., learning the data, rather than generalising

– Use cross-validation techniques (hold back data)

 Define errors
 Use statistics to define confidence intervals
 Show that one learning algorithm
– Outperforms another algorithm
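
Holding back data and putting a confidence interval on the measured error can be sketched as follows. This is a minimal Python illustration, assuming a plain k-fold split without shuffling and the usual normal approximation to the binomial for the interval. The function names are illustrative.

```python
import math


def k_fold_indices(n_examples, k):
    """Yield (train, held_out) index lists for k-fold cross-validation.

    Every example is held out exactly once. Real use would normally
    shuffle the examples first; that is omitted to keep this repeatable.
    """
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_examples) if i not in held]
        yield train, held_out


def error_confidence_interval(errors, n_tests, z=1.96):
    """Approximate 95% interval (for z=1.96) on the true error rate."""
    rate = errors / n_tests
    half_width = z * math.sqrt(rate * (1 - rate) / n_tests)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


for train, held_out in k_fold_indices(10, 5):
    print(held_out, train)
print(error_confidence_interval(10, 100))
```

The learner is trained on each `train` list and scored only on the matching `held_out` list, so the reported error reflects generalisation rather than memorised data. The approximation behind the interval is only reasonable when the number of test examples is fairly large.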
Lecture 10

Protein Structure and Function


 Proteins share
– Secondary structures

– Helices, hairpins, barrels

 Protein function
– From the protein’s fold

 Look at:
– Folds, functions, evolution of
structure and function
Protein Structure Prediction

 Need for prediction


 Evaluation of predictions
 Secondary structure predictions
 Homology modelling
– Good practice: unbiased data to start with
 Fold recognition
 Ab initio prediction
 Knowledge-based prediction