
Bioinformatics

Mingzhi Liao
College of Life Science
About me

Name: Mingzhi Liao


Office: Room #E207, Science Building
Email: liaomz@nwsuaf.edu.cn

College of Life Science--Mingzhi Liao


Assessment and Grading

Assessment:
 Test
 closed-book exam
Grading:
 Homework & Quiz (15%)
 Attendance (15%)
 Final exam (70%)



Resources
 T. Charlie Hodgman, Andrew French, David R. Westhead. Instant Notes in Bioinformatics, 2nd edition, 2010.
 Also recommended:
 David W. Mount. Bioinformatics: Sequence and Genome Analysis, 2nd edition, 2004.
 Jonathan Pevsner. Bioinformatics and Functional Genomics, 2nd edition, 2009.



Goals
 Introduce fundamental problems, concepts, methods, and applications in bioinformatics.
 Emphasize both the methods and the practical use of bioinformatics tools and databases.



Lecture Syllabus

 Introduction to Bioinformatics
 Probability and Statistics
 Database and Database Search
 Genomes and Other Sequences
 Transcriptomics and Proteomics



Chapter 1
Introduction to bioinformatics



Outline
 What is Bioinformatics?
 Why use bioinformatics?
 Aims of Bioinformatics
 Why is bioinformatics a hot field?
 Main topics in bioinformatics.

Goal
Get a first impression of the field of bioinformatics and its main topics



What is bioinformatics?

[Figure: Experiments, Data, Analysis, Results, Hypothesis. Data types shown: Sequence, Mutation, Evolution, Expression, Structure, Interaction]



What is bioinformatics?
 Antediluvian origins
 Bioinformatics first appeared in a textbook by
Rybak in 1968, and its content was outlined
in 1978.
 Royal definition
 An apocryphal story from 1995: "Horrible word!"
 Canonical definition
 Bioinformatics is the discipline at the interface of biology, information science and mathematics.



 Functional definition
 Bioinformatics seeks to generate knowledge of the properties, populations and processes of biological entities.
• Properties: Composition, Structure, Activities
• Populations: Design and mining of databases
• Processes: Interactions, Networks, Paths, Rates and efficiencies
 Public services definition
 Bioinformatics is the application of computing and mathematics to the management, analysis, and understanding of data to solve biological questions, and involves links to medical, chemo-, neuro-, etc. informatics.
What is bioinformatics?
Bioinformatics is the computational analysis and
storage of biological data.

NIH definitions
• Bioinformatics: Research, development, or application of
computational tools and approaches for expanding the use of
biological, medical, behavioral or health data, including those
to acquire, store, organize, archive, analyze, or visualize such
data.
• Derivation:
 Bio: biology
 Informatique: French for 'data processing'
• Aim: to discover new biological insights using computers and biology.



Other related disciplines
 Computational Biology: The development and
application of data-analytical and theoretical
methods, mathematical modeling and
computational simulation techniques to the study
of biological, behavioral, and social systems.
 Chemoinformatics: Study and analysis of chemical
information.
 Medical informatics: Study, invention and
implementation of structures and algorithms to
improve communication, understanding and
management of medical information.
History of Bioinformatics



Major events in bioinformatics history
 1962 Pauling's theory of molecular evolution
 1965 Margaret Dayhoff's Atlas of Protein Sequences
 1970 Needleman-Wunsch algorithm
 1970 Gibbs and McIntyre developed Dot Plot method
 1970 Point Accepted Mutation (PAM) matrix (Dayhoff)
 1981 Smith-Waterman algorithm developed
 1981 The concept of a sequence motif
 1982 European Molecular Biology Laboratory (EMBL) nucleotide sequence database (now hosted at EBI)
 1982 GenBank created (now maintained by the National Center for Biotechnology Information (NCBI) at NIH)
 1986 DNA Data Bank of Japan (DDBJ)
• 1985 FASTP/FASTN: fast sequence similarity searching
• 1990 BLAST: fast sequence similarity searching
• 1995 First bacterial genomes completely sequenced
• 1996 Yeast genome completely sequenced
• 1998 Worm (multicellular) genome completely sequenced
• 2000 Fly genome completely sequenced
• 2003 Human genome sequenced
• 2006 Cattle genome sequenced
• 2011 A high quality draft sequence of potato genome published
……
In the future
• 2050 Completion of the first computational model of a complete
cell, or maybe even already of a complete organism
Why use bioinformatics?
 Find an answer quickly
 Most in silico biology is faster than in vitro
 Massive amounts of data to analyse
 Need to make use of all information
 Not possible to do analysis by hand
 Can't organize and store information using lab notebooks alone
 Automation is key



The explosion in biological data!



Biological Data



The challenge of huge data
 The storage, management and sharing of the data
 Data ≠ knowledge!



Astronomy
Tycho Brahe  Johannes Kepler  Isaac Newton
Astronomical Phenomena  Laws of Planetary Motion  Law of Gravity  Aerospace Industry
Chemistry

Dmitri Mendeleev

Compounds  Periodic Table of the Elements  Chemical Industry
Physics

• Max Karl Ernst Ludwig Planck


• Albert Einstein
• Niels Bohr
• Erwin Schrödinger

Atomic Spectrum  Quantum Theory  Information Technology


Biology?

A huge amount of sequence, expression, and structure data  What kinds of discovery?
On bioinformatics
• Science is about building causal relations between natural
phenomena
• for instance, between a mutation in a gene and a
disease.
• The development of instruments to increase our capacity
to observe natural phenomena has, therefore, played a
crucial role in the development of science
• the microscope being the paradigmatic example in
biology.
• With the human genome, the natural world takes an
unprecedented turn: it is better described as a sequence
of symbols.
• Besides high-throughput machines such as sequencers
and DNA chip readers, the computer and the associated
software become the instrument to observe it, and the
discipline of bioinformatics flourishes.
Central Dogma of Molecular Biology

[Figure: DNA replication; transcription (DNA to RNA) and reverse transcription (RNA to DNA); translation (RNA to protein)]
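
The information flow of the central dogma can be illustrated with a short Python sketch. This is a teaching example rather than a production tool: the function names (transcribe, reverse_transcribe, translate) are made up for this illustration, the input is assumed to be the coding strand of the DNA, and only the standard genetic code is used.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """DNA -> RNA: assuming the coding strand is given, replace T with U."""
    return coding_dna.upper().replace("T", "U")


def reverse_transcribe(rna: str) -> str:
    """RNA -> DNA: replace U with T (gives the coding-strand sequence)."""
    return rna.upper().replace("U", "T")


def translate(mrna: str) -> str:
    """RNA -> protein: read codons from position 0 until a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTTTTAA"
mrna = transcribe(dna)
print(mrna)             # AUGGCCUUUUAA
print(translate(mrna))  # MAF
```

Real sequences also contain ambiguous bases (such as N) and may use alternative genetic codes, which this sketch does not handle.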



DNA  RNA  protein  phenotype

genome  transcriptome  proteome

Genomics, Transcriptomics and Proteomics


How to analyze genomics,
transcriptomics and proteomics?
 Many interesting problems arise out of sequence
analysis.
 There are two different types of biological
sequences studied in this class:
– DNA/RNA sequences
– Amino acid sequences



Sequencing Successes
T7 bacteriophage
completed in 1983
39,937 bp, 59 coded proteins

Saccharomyces cerevisiae
completed in 1996
12,069,252 bp, 5800 genes

Escherichia coli
completed in 1997
4,639,221 bp, 4293 ORFs



Sequencing Successes
Caenorhabditis elegans
completed in 1998
95,078,296 bp, 19,099 genes

Drosophila melanogaster
completed in 2000
116,117,226 bp, 13,601 genes

Homo sapiens
1st draft completed in 2001
3,160,079,000 bp, 20,000-25,000 genes



Growth of GenBank
Genome Size



Huge Amount of Biological Data
 DNA/RNA Sequence Data (GenBank)
 Protein Sequence Data (Swiss-Prot and PIR)
 Protein Structure Data (Protein Data Bank)
 Gene Expression Data (Gene Expression Omnibus)
 Protein Function (Gene Ontology)
 And many more...



LET’S FOCUS ON OUR DATA
Nucleotides



Nucleotides



DNA Sequences



Genome
 Human Genome Project
 $3-billion project
 United States Department of Energy
 United Kingdom, France, Germany, Japan, China, and India
 A 'rough draft' of the genome was finished in 2000
 The human genome has approximately 3.3 billion base pairs



There are three major public DNA databases

EMBL GenBank DDBJ

The underlying raw DNA sequences are identical



There are three major public DNA databases

EMBL: housed at EBI (European Bioinformatics Institute)
GenBank: housed at NCBI (National Center for Biotechnology Information)
DDBJ: housed in Japan



Taxonomy at NCBI:
>300,000 species are represented in GenBank

http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
The most sequenced organisms in GenBank
Homo sapiens 14.9 billion bases
Mus musculus 8.9b
Rattus norvegicus 6.5b
Bos taurus 5.4b
Zea mays 5.0b
Sus scrofa 4.8b
Danio rerio 3.1b
Strongylocentrotus purpuratus 1.4b
Oryza sativa (japonica) 1.2b
Nicotiana tabacum 1.2b

Updated Oct. 2010


GenBank release 180.0
Excluding WGS, organelles, metagenomics
At 1,000 letters per page and 1,000 pages per book, we would need about 2,800 books to
record the human genome (taking its length as roughly 2.8 billion letters).
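
The book estimate is simple arithmetic, sketched below in Python. The function name books_needed is hypothetical; the page and book sizes are the ones stated on the slide. Note that the result depends on which genome size is assumed: about 2.8 billion letters gives 2,800 books, while the 3.3 billion base pairs quoted earlier gives 3,300.

```python
LETTERS_PER_PAGE = 1000  # from the slide
PAGES_PER_BOOK = 1000    # from the slide


def books_needed(genome_size_bp: int) -> int:
    """Number of books needed to print a genome, rounding up to whole books."""
    letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK
    # Integer ceiling division avoids floating-point rounding issues.
    return (genome_size_bp + letters_per_book - 1) // letters_per_book


print(books_needed(2_800_000_000))  # 2800
print(books_needed(3_300_000_000))  # 3300
print(books_needed(39_937))         # 1 (T7 bacteriophage fits in one book)
```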
Genome
 http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html



Mining DNA sequences



MINING DNA SEQUENCES
What we will learn
• Query/visualize DNA sequences / genome data
• Analyze single sequence
• Analyze DNA composition
• Finding protein-coding regions
• Finding promoter regions
• ……
• Similarity search
• BLAST
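
Two of the tasks listed above, analyzing DNA composition and finding candidate protein-coding regions, can be previewed with a minimal Python sketch. The function names gc_content and find_orfs are made up for this illustration, and the ORF scan is deliberately naive: it only looks at the three forward reading frames for an ATG followed by an in-frame stop codon. Real gene finders use far more evidence than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(dna: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


def find_orfs(dna: str, min_codons: int = 2) -> list:
    """Return (start, end) pairs of open reading frames on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    'end' is exclusive and includes the stop codon; 'min_codons' is the
    minimum number of codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


seq = "CCATGGCCTTTTAAGG"
print(gc_content(seq))  # 0.5
print(find_orfs(seq))   # [(2, 14)]
print(seq[2:14])        # ATGGCCTTTTAA
```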



RNA



Transcriptome
The transcriptome is the set of all RNA molecules, including
mRNA, rRNA, tRNA, and other non-coding RNA produced in
one or a population of cells

RNA abundance measuring methods:


 EST (Expressed Sequence Tag)
 SAGE (Serial Analysis of Gene Expression)
 DNA Microarray – large-scale gene expression analysis
 RNA-Seq



RNA structure

What we will learn
 Query
 RNA secondary structures
 Predict
 Draw

 Finding miRNA and siRNA


 Analyze microarray data



Protein



Protein

• Sequencing technologies:
– 2D Gel Electrophoresis – protein expression analysis
– Mass Spectrometry – protein sequencing
– Yeast Two-Hybrid (Y2H) System – protein interaction analysis
Protein
Proteomics is the large-scale study of proteins,
particularly their structures and functions



What can we do with this huge amount of data?
 Store (databases)
 Search
 Analyze / Annotate (visualization, interpretation, classification, pattern recognition)
 Generate new biological knowledge and build biological models (prediction)



Aims of Bioinformatics
 To organize biological data so that researchers can access existing information and submit new entries.
 To develop tools that aid in the analysis of data.
 To use these tools to analyze the data and interpret the results in a biologically meaningful manner.



Why is bioinformatics a hot field?
 Whole Genome Analyses and Sequences
 Experimental Analyses involving Thousands of Genes simultaneously
 DNA Chips and Array Analyses
 Expression Arrays
 Comparative Analyses between Species
 Proteomics: 'Proteome' of an Organism



Why is bioinformatics a hot field?

 Medical applications: Genetic Disease ... SNPs


 Pharmaceutical and Biotech Industry
 Agricultural applications
 Nutrition applications
 ……



What does bioinformatics do

When I give talks to young scientists seeking advice about areas of future intense scientific excitement, computational biology is my number one recommendation.
Francis Collins, Director of HGP at NIH
Application prospects

 Customized medicines
 Drug development
 Pathways, systems biology
…



Main topics in bioinformatics
 Sequence alignment
 Sequence analysis
 Phylogenetic prediction
 Functional genomics
 Gene expression data analysis
 Protein structure prediction
 Drug design



Summary
You should now have a first impression of:
 What is Bioinformatics?
 Why use bioinformatics?
 Aims of Bioinformatics
 Why is bioinformatics a hot field?
 Main topics in bioinformatics.

