
Introduction to Bioinformatics

Department of
Computer Science and
Gadjah Mada University

© Afiahayati
 You do not need A-level biology for the course

 We don’t have any background requirements

– AI might help, databases might help…
Bioinformatics Soundbites
 An intersection of AI and genetics
– Two very popular (most wanted) sciences

 An opportunity to:
– Use some of the most interesting computational
techniques to solve some of the most important and
rewarding questions

 When KPK solves Gayus's scandal

Aims for the Course

 To give an introduction to
– Modern bioinformatics practices
– With an emphasis on computational aspects
 To produce bioinformaticians
– Able to complement biologists
 They use the algorithms, we design and tweak them
 Two important areas
– Protein structure prediction
– Data mining bioarray informatics
Health Information Hierarchy
What is…?
 Bioinformatics
– Applying computational techniques to biological data

 Medical informatics
– Applying computational techniques to medical data

 Chemo-informatics
– Applying computational techniques to chemical data

 Lots of overlap between the three disciplines

– Idea is to enhance and enable scientific discovery
Bioinformatics Data

[Figure: data sources and their applications]
 DNA data
– Application: structural genomics
 Drug trial data
– Application: predictive toxicology
 Patient record data
– Application: medical diagnosis
Lecture 2

From DNA to Cell Function

[Diagram: a DNA sequence (split into genes) codes for amino acids, which fold into a protein. Remaining arrow labels in the diagram: "has", "dictates", "determines".]
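
The "codes for" step in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: it assumes the standard genetic code, reads the coding strand in frame from the first base, and ignores introns and other biological detail. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# for the first, second and third base of each codon. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon
    or when fewer than three bases remain.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Going from the resulting amino acid sequence to the folded protein is the hard part, and no comparable lookup table exists for it.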
Good and Bad News

 We know the DNA sequence in many species

– E.g., human genome (3.2 billion bases)
 There is a 1:1 mapping
– DNA sequence in genes → protein structure
 But we don’t yet know a good algorithm for this
– And we cannot yet observe proteins folding
 “Holy Grail” of bioinformatics
– Identify protein structure from amino acid sequence
Two Approaches to
Structure Prediction
 Given a new gene sequence
– Predict the structure of the protein it codes for

 Sequence matching approach

– Find a sequence with known structure
 That closely matches the new sequence
 Use the old structure to predict the new one

 Machine learning approach

– Train a learning method (e.g., neural net) with the known
sequence/structure pairs
– Use this to predict new structure
– Careful to use good practice (prepare data/evaluate hypothesis)
Matching Sequences
 We know the structure of many proteins
– X-ray crystallography
 Genes with similar amino acid (residue) sequences
– Produce proteins with similar structures
 Can change much of the sequence without changing the structure
 Given a new genetic sequence
– Search for genes with similar sequences in databases
 Matching DNA/residue sequences
– Very important aspect of bioinformatics
– Not as easy as it sounds
Match These Sequences

 How do we match this sequence:


 With this sequence:

Possible Answers

1. gattcagacctagct (no indels)

2. gattcaga-cctagct (with indels)
3. gattcagacctagc-t (no overhang)
4. gattcagacctagct (with overhang)
Lecture 3

Sequence Matching Algorithms #1

 Without indels
 Hamming distance
 Scoring schemes
– Certain changes in sequence more likely
 Due to chemical properties of the residues
 BLAST algorithm
– Idea: match local regions and expand
– Seven part process
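
As a rough sketch of matching without indels, the Hamming distance counts the positions at which two equal-length sequences differ. The second function shows one possible scoring scheme; its match and mismatch values are illustrative assumptions, and a real scheme would weight substitutions by the chemical properties of the residues.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def ungapped_score(a, b, match=1, mismatch=-1):
    """Score an ungapped alignment with a simple match/mismatch scheme.

    The default values are placeholders chosen for illustration only.
    """
    if len(a) != len(b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


print(hamming_distance("gattcaga", "gatacagt"))
print(ungapped_score("gattcaga", "gatacagt"))
```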
Lecture 4

Sequence Matching Algorithms #2

 With indels
 Drawing of dotplots
 Dynamic programming (getting from A to B)
– Quickest route to Z + quickest route from Z
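
The "quickest route" idea can be illustrated with a small global alignment scorer that allows indels. This is a sketch in the style of the Needleman-Wunsch algorithm, which the slides do not name explicitly; the scoring values (match, mismatch, gap) are assumptions picked for readability. Each cell holds the best score for aligning two prefixes, built from the best scores of shorter prefixes.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of a with all of b, allowing indels.

    Only the previous row of the dynamic programming table is kept,
    since each cell depends on its left, upper and diagonal neighbours.
    """
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("gattcaga", "gattaga"))
```

Recovering the alignment itself, rather than just its score, would need the full table and a traceback step, which is omitted here.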

Lecture 5

Searching Databases

 We have ways to score how well two sequences match

 Now want to use this in databases
– Given a known gene sequence
– Which genes in the database are closely related
 Have to worry about:
– Repeated subsequences biasing matches
– Accuracy and significance of matches
– Sensitivity and specificity (false positives and false negatives)
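
To make the last point concrete, here is a minimal sketch of how sensitivity and specificity could be computed for one database search. It assumes we already know which database entries are truly related to the query; the function name and the gene identifiers are hypothetical.

```python
def sensitivity_specificity(truly_related, reported, database):
    """Return (sensitivity, specificity) for one database search.

    truly_related: set of entries that really are related to the query
    reported:      set of entries the search returned as matches
    database:      set of all entries searched
    """
    true_pos = len(truly_related & reported)
    false_neg = len(truly_related - reported)   # related, but missed
    false_pos = len(reported - truly_related)   # reported, but unrelated
    true_neg = len(database - truly_related - reported)

    positives = true_pos + false_neg
    negatives = true_neg + false_pos
    sensitivity = true_pos / positives if positives else float("nan")
    specificity = true_neg / negatives if negatives else float("nan")
    return sensitivity, specificity


database = {f"gene{i}" for i in range(1, 11)}
truly_related = {"gene1", "gene2", "gene3", "gene4"}
reported = {"gene1", "gene2", "gene3", "gene5"}
print(sensitivity_specificity(truly_related, reported, database))
```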
Lecture 6

Multiple Sequence Alignments

 Protein sequences form families
– Learn much more about a gene by looking at its family

 Multiple sequence alignment algorithms
– Profiles
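
A profile can be thought of as a per-column summary of a multiple alignment. The sketch below builds the simplest version, a table of residue frequencies for each column, and reads off a consensus sequence. It assumes the sequences are already aligned to equal length; the example sequences and function names are made up for illustration, and real profiles usually add pseudocounts and gap handling.

```python
from collections import Counter


def build_profile(alignment):
    """Residue frequencies for each column of an aligned set of sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def consensus(profile):
    """Most frequent residue in each column."""
    return "".join(max(column, key=column.get) for column in profile)


family = ["VPFLL", "VPYLL", "IPFLM"]  # toy alignment, not real data
profile = build_profile(family)
print(consensus(profile))
```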
Lectures 7 & 8

Hidden Markov Models

 Statistical Representation
– Of a protein family
 Describes how to generate
– A protein sequence

 Can be used to generate a multiple alignment

 Can be given a multiple alignment
– And estimate HMM parameters

 Can use libraries of HMMs
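
To illustrate what "describes how to generate a protein sequence" means, here is a toy HMM that emits residues while moving between hidden states. Everything in it is an assumption made for the example: the two states, their probabilities and the six-letter alphabet are invented, and real profile HMMs for protein families use match, insert and delete states instead.

```python
import random

# Toy model: two hidden states with different residue preferences.
START = {"hydrophobic": 0.5, "charged": 0.5}
TRANSITIONS = {
    "hydrophobic": {"hydrophobic": 0.8, "charged": 0.2},
    "charged": {"hydrophobic": 0.3, "charged": 0.7},
}
EMISSIONS = {
    "hydrophobic": {"A": 0.3, "V": 0.4, "L": 0.3},
    "charged": {"D": 0.3, "E": 0.3, "K": 0.4},
}


def pick(distribution, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(outcomes, weights=weights)[0]


def generate_sequence(length, rng):
    """Generate (sequence, hidden state path) of the given length."""
    state = pick(START, rng)
    residues, path = [], []
    for _ in range(length):
        residues.append(pick(EMISSIONS[state], rng))
        path.append(state)
        state = pick(TRANSITIONS[state], rng)
    return "".join(residues), path


sequence, path = generate_sequence(10, random.Random(0))
print(sequence)
```

Estimating the probabilities from a multiple alignment, as mentioned above, amounts to running this process in reverse: counting how often each transition and emission is observed.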

Lecture 9

Machine Learning
 Machine learning (inductive reasoning)
– Automatic proposing of hypotheses based on data
– Has many applications in bioinformatics
 Including protein structure prediction
 Example: predictive toxicology
– Given: set of toxic drugs and a set of non-toxic drugs
– Given: background information (chemistry, etc.)
– Produces: hypothesis why drugs are toxic
 Overview of machine learning
– Aims, techniques, methodologies, representations
 Artificial neural networks
Evaluating Learned Hypotheses
 How do we know that a rule/hypothesis
– Reflects something interesting, not a coincidence?

 Show that a learning algorithm isn’t overfitting

– i.e., learning the data, rather than generalising

– Use cross-validation techniques (hold back data)

 Define errors
 Use statistics to define confidence intervals
 Show that one learning algorithm
– Outperforms another algorithm
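
Holding back data can be sketched as k-fold cross-validation: the examples are split into k folds, and each fold in turn is used for testing while the rest are used for training. The code below is a minimal illustration. The `majority_class` learner and the toxic/safe labels are hypothetical stand-ins, and a real study would shuffle the data before splitting.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n examples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_error(examples, labels, train_fn, k=5):
    """Fraction of held-back examples that are misclassified."""
    errors = 0
    for train, test in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train], [labels[i] for i in train])
        errors += sum(model(examples[i]) != labels[i] for i in test)
    return errors / len(examples)


def majority_class(train_examples, train_labels):
    """Baseline learner: always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda example: most_common


drugs = list(range(10))                 # placeholder examples
labels = ["toxic"] * 8 + ["safe"] * 2   # invented labels
print(cross_validated_error(drugs, labels, majority_class, k=5))
```

A baseline like this gives a reference point: a learning algorithm that cannot beat always guessing the majority class has probably not found anything interesting.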
Lecture 10

Protein Structure and Function

 Proteins share
– Secondary structures

– Helices, hairpins, barrels

 Protein function
– From the protein’s fold

 Look at:
– Folds, functions, evolution of
structure and function
Protein Structure Prediction

 Need for prediction

 Evaluation of predictions
 Secondary structure predictions
 Homology modelling
– Good practice: unbiased data to start with
 Fold recognition
 Ab initio prediction
 Knowledge-based prediction