
Unit 2.4: Bioinformatics and Databases


Objectives: At the end of this unit, students will
- have been introduced to some basic concepts and considerations in bioinformatics and computational biology
- know what a relational database is
- understand why databases are useful for dealing with large amounts of data
- have been introduced to some of the major online biological databases and their features
- have gained experience in extracting data from online biological databases

Reading: Stein, L.D. 2003. Integrating biological databases. Nat Rev Genet 4: 337-345.

Assignments: Read the excerpts from Current Protocols in Bioinformatics on Entrez and the UCSC Browser. Follow along with the examples in Protocol 1 of each section.

Genomic research makes it possible to look at biological phenomena on a scale not previously possible: all genes in a genome, all transcripts in a cell, all metabolic processes in a tissue. One feature that all of these approaches share is the production of massive quantities of data. GenBank, for example, now accommodates >10^10 nucleotides of nucleic acid sequence data and continues to more than double in size every year. New technologies for assaying gene expression patterns, protein structure, protein-protein interactions, etc., will provide even more data. How to handle these data, make sense of them, and render them accessible to biologists working on a wide variety of problems is the challenge facing bioinformatics, an emerging field that seeks to integrate computer science with applications derived from molecular biology. We are swimming in a rapidly rising sea of data. . . how do we keep from drowning?
Roos (2001). Science. 291:1260

Bioinformatics is one solution to this problem: a way of coping with large data sets and making sense of genomic-scale data. But as with most approaches, it is important to have a sense of what is and is not possible to achieve using bioinformatics. Learn to know the difference. Bioinformatics is:
- sometimes a time-saver: you can automate common and/or repetitive tasks, and parse large files
- sometimes essential: how else would you analyze results from a 25,000-gene microarray experiment?
- sometimes not helpful/not useful/unimportant: it can be easier and more straightforward to do a simple wet-lab experiment than to devise an elaborate computational approach
- sometimes not possible: computers can't do everything!

It's also important to have an understanding of the underlying concepts and algorithms in bioinformatics, just as it's important to understand the basic concepts and chemical basis of molecular biology, or genetics, or biochemistry, if you're going to do wet-lab experiments.

"Many biologists are comfortable using algorithms like BLAST or GenScan without really understanding how the underlying algorithm works. . . . BLAST solves a particular problem only approximately and it has certain systematic weaknesses. . . . Users that do not know how BLAST works might misapply the algorithm or misinterpret the results it returns." [Pevzner (2004). Bioinformatics 20(14): 2159-2161.]

A historical perspective
The 1960s: the birth of bioinformatics
- High-level computer languages
- Protein sequence data
- Academic access to computers

Margaret Oakley Dayhoff
- First protein database
- First program for sequence assembly
- IBM 7090 computer

Benfey and Protopapas, "Genomics" 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458

By way of comparison

Computer             RAM         Speed                Cost
IBM 7090 computer    32 Kbytes   2.18 µs cycle time   $2,900,000 in 1960
20" Apple iMac       1 GB        2.4 GHz              $1199 in 2008

Solving problems in computer science


Necessary parameters for assessing the difficulty of a computer science problem
- Algorithmic complexity
  - Is the problem theoretically solvable?
  - If so, what is the most efficient solution?
- Current state of computer technology
  - Memory
  - CPU speed
  - Cost

Algorithms
An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem.
- First you must identify exactly what the problem is!
- A problem describes a class of computational tasks. A problem instance is one particular input from that class.
- In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible).
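To make the distinction between a problem and a problem instance concrete, here is a minimal sketch that is not taken from the lecture. The problem is "compute the GC content of a DNA sequence"; each input string is one instance. The function name `gc_content` is a hypothetical choice, and the handling of the empty sequence is an assumption about what a sensible answer would be.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence.

    Written to work for any instance of the problem: upper- or
    lower-case input, and the empty sequence (returns 0.0 by assumption).
    """
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


# Two different instances of the same problem
print(gc_content("ATGC"))
print(gc_content("ggcc"))
```

The same instructions are applied unchanged to every input, which is what separates an algorithm from a one-off calculation.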

Computer technology: memory, CPU speed, cost


Dramatic improvements on a yearly basis

We do a lot of our work using desktop Macs out of the box:
- 2 quad-core 2.8 GHz processors, 500 GB disk space, 4 GB RAM for ~$3000
- 2 quad-core 3.0 GHz processors, 2.5 TB disk space, 8 GB RAM for ~$6000

CPU speed vs. memory: which is more important?
- for protein structure, might need many calculations but limited memory
- for genome searches, might have few calculations but huge amounts to store in memory

Reading from memory is several orders of magnitude faster than reading from disk

Databases
What is a database?
A collection of related data elements
- tables
- columns (fields)
- rows (records)

Records are retrieved using a query language
Database technology is well established

Databases
- Tables (entities): basic elements of information to track, e.g., gene, organism, sequence, citation
- Columns (fields): attributes of tables, e.g., for a citation table: title, journal, volume, author
- Rows (records): the actual data. Whereas fields describe what data is stored, the rows of a table are where the actual data is stored
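The three terms can be seen in a small sketch using Python's built-in `sqlite3` module. The choice of SQLite is an assumption made for illustration, since the lecture does not name a database engine. The table is the citation table from the example above, and the single record is the assigned reading for this unit.

```python
import sqlite3

# An in-memory database, used here only for illustration
conn = sqlite3.connect(":memory:")

# The table (entity) and its columns (fields)
conn.execute(
    "CREATE TABLE citation (title TEXT, journal TEXT, volume INTEGER, author TEXT)"
)

# One row (record): the actual data
conn.execute(
    "INSERT INTO citation VALUES (?, ?, ?, ?)",
    ("Integrating biological databases", "Nat Rev Genet", 4, "Stein, L.D."),
)

cursor = conn.execute("SELECT * FROM citation")
fields = [column[0] for column in cursor.description]  # column names
records = cursor.fetchall()                            # stored rows

print(fields)
print(records)
```

The fields describe what can be stored; the records list holds what actually is stored.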

Databases
A very simple form of (non-electronic) database is a filing cabinet. In the filing cabinet, you can store many different records (sheets of paper), each containing multiple data elements.
Example: a filing cabinet of invoices
- the filing cabinet is a table
- the columns are the fields of data on the individual invoices (customer, product, price, quantity)
- the rows (records) are the individual invoices

The biggest problem with a filing cabinet is that you can only store your data one way (e.g., in alphabetical order of the customer's last name), and there's no good way of searching your files based on any other criteria (say, by product ordered).


A flat-file database (a spreadsheet) is the electronic analogue of the filing cabinet:

invoice_id  customer  product            price   quantity  total
1           Elmer     buckshot           $2.00   2         $4.00
2           Wiley     Acme snow machine  $5.00   1         $5.00
3           Elmer     shotgun            $25.00  1         $25.00
4           Bugs      carrots            $0.50   20        $10.00

This is more easily searchable than a paper file cabinet, but is still very unwieldy, especially for large amounts of data.


Suppose you now want to be able to send an advertisement to every customer who bought the Acme Snow Machine. You could add a column to your table that includes the address for each customer, but this is very inefficient: you will keep repeating information for customers (like Elmer) who make multiple purchases. Plus, as the number of rows and columns grows, searching a flat file becomes more and more time consuming. Also, it is difficult to construct complex queries (e.g., customers who bought the Snow Machine and who like opera or live in the Southwest desert).
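The redundancy described here can be shown with a small sketch that models the flat file as a Python list of dictionaries. That representation is an assumption made for illustration; the names, products, and addresses are the ones used in this unit's tables.

```python
from collections import Counter

# The flat file with an address column added to every invoice row
flat_file = [
    {"invoice_id": 1, "customer": "Elmer", "address": "Looney Tunes Dr.", "product": "buckshot"},
    {"invoice_id": 2, "customer": "Wiley", "address": "Southwest desert", "product": "Acme snow machine"},
    {"invoice_id": 3, "customer": "Elmer", "address": "Looney Tunes Dr.", "product": "shotgun"},
    {"invoice_id": 4, "customer": "Bugs", "address": "Rabbit Hole", "product": "carrots"},
]

# Every search is a scan over all rows
snow_machine_addresses = [
    row["address"] for row in flat_file if row["product"] == "Acme snow machine"
]

# Number of times each customer's address is stored
address_copies = Counter(row["customer"] for row in flat_file)

print(snow_machine_addresses)   # ['Southwest desert']
print(address_copies["Elmer"])  # 2
```

Elmer's address is stored once per purchase, so a change of address would have to be corrected in every one of his rows.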

Relational Databases

The solution is the relational database. A relational database contains multiple tables and defines the relationships between them. Thus you might also have a customer table and a product table, like this:

customer_table
name   address           notes
Elmer  Looney Tunes Dr.  likes hunting and opera
Wiley  Southwest desert  big mail order customer
Bugs   Rabbit Hole       likes to cross dress

product_table
product            price   notes
carrots            $0.50
shotgun            $25.00  oddly flexible
buckshot           $2.00
Acme snow machine  $5.00   high defect rate

Relational Databases
Relationships can be built between tables and fields:
[Figure: database schema. The customer field of the invoice table is linked to the name field of customer_table, and the product field of the invoice table is linked to the product field of product_table.]

Relational Databases
Now only three items need to be filled in for an invoice: a customer, a product, and a quantity. The price and total fields can be filled in automatically: price from a product_table lookup and total by calculation (price * qty).
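The lookup and calculation can be sketched with Python's built-in `sqlite3` module. SQLite and the column types are assumptions made for illustration. The invoice table stores only customer, product, and quantity; price and total are derived when the query runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_table (product TEXT PRIMARY KEY, price REAL, notes TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer TEXT,
                      product TEXT, quantity INTEGER);
INSERT INTO product_table VALUES
    ('carrots', 0.50, NULL), ('shotgun', 25.00, 'oddly flexible'),
    ('buckshot', 2.00, NULL), ('Acme snow machine', 5.00, 'high defect rate');
INSERT INTO invoice VALUES
    (1, 'Elmer', 'buckshot', 2), (2, 'Wiley', 'Acme snow machine', 1),
    (3, 'Elmer', 'shotgun', 1), (4, 'Bugs', 'carrots', 20);
""")

# price comes from a product_table lookup; total is price * quantity
rows = conn.execute("""
    SELECT invoice.invoice_id,
           product_table.price,
           product_table.price * invoice.quantity AS total
    FROM invoice
    JOIN product_table ON invoice.product = product_table.product
    ORDER BY invoice.invoice_id
""").fetchall()

for row in rows:
    print(row)
```

Because the price is stored only in product_table, changing it there updates every invoice that is computed afterwards.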

Relational Databases
Now we can send our advertisement to every customer who bought the Acme Snow Machine by getting their addresses from customer_table.
To do this, we use Structured Query Language (SQL):

    SELECT customer_table.name, customer_table.address
    FROM customer_table, invoice
    WHERE invoice.product = 'Acme snow machine'
      AND invoice.customer = customer_table.name
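The query above can be tried out with Python's built-in `sqlite3` module. SQLite is an assumption made for illustration, and the invoice table is reduced to the columns that have to be stored. Note that the string literal is quoted and matches the stored product name exactly, since string comparison with `=` is case-sensitive in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer TEXT,
                      product TEXT, quantity INTEGER);
INSERT INTO customer_table VALUES
    ('Elmer', 'Looney Tunes Dr.', 'likes hunting and opera'),
    ('Wiley', 'Southwest desert', 'big mail order customer'),
    ('Bugs', 'Rabbit Hole', 'likes to cross dress');
INSERT INTO invoice VALUES
    (1, 'Elmer', 'buckshot', 2), (2, 'Wiley', 'Acme snow machine', 1),
    (3, 'Elmer', 'shotgun', 1), (4, 'Bugs', 'carrots', 20);
""")

# Names and addresses of everyone who bought the snow machine
mailing_list = conn.execute("""
    SELECT customer_table.name, customer_table.address
    FROM customer_table, invoice
    WHERE invoice.product = 'Acme snow machine'
      AND invoice.customer = customer_table.name
""").fetchall()

print(mailing_list)  # [('Wiley', 'Southwest desert')]
```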

Relational Databases
We can also make our complex query (customers who bought the Snow Machine and who like opera or live in the Southwest desert):

    SELECT customer_table.name
    FROM customer_table, invoice
    WHERE invoice.product = 'Acme snow machine'
      AND invoice.customer = customer_table.name
      AND (customer_table.notes LIKE '%opera%'
           OR customer_table.address = 'Southwest desert')
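As a sketch of how this complex query behaves, the version below runs it through Python's built-in `sqlite3` module. SQLite, the explicit `JOIN ... ON` form, and the `?` placeholders are choices made for illustration; the query is logically equivalent to the comma-style join shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer TEXT,
                      product TEXT, quantity INTEGER);
INSERT INTO customer_table VALUES
    ('Elmer', 'Looney Tunes Dr.', 'likes hunting and opera'),
    ('Wiley', 'Southwest desert', 'big mail order customer'),
    ('Bugs', 'Rabbit Hole', 'likes to cross dress');
INSERT INTO invoice VALUES
    (1, 'Elmer', 'buckshot', 2), (2, 'Wiley', 'Acme snow machine', 1),
    (3, 'Elmer', 'shotgun', 1), (4, 'Bugs', 'carrots', 20);
""")

# Bought the snow machine AND (likes opera OR lives in the Southwest desert)
buyers = conn.execute("""
    SELECT customer_table.name
    FROM customer_table
    JOIN invoice ON invoice.customer = customer_table.name
    WHERE invoice.product = ?
      AND (customer_table.notes LIKE ? OR customer_table.address = ?)
""", ("Acme snow machine", "%opera%", "Southwest desert")).fetchall()

print(buyers)  # [('Wiley',)]
```

Elmer likes opera but never bought the snow machine, so he is excluded. Wiley qualifies through the address condition alone.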

Online Databases
When you query an online database, your query is translated into SQL, the database is interrogated, and the answer is displayed in your web browser.

- Your computer and browser (the client)
- Software to receive and translate the instructions you enter into your browser (on the server)
- The database itself

Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O'Reilly (2002).
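The middle step, where server software translates what you typed into SQL, can be sketched as follows. This is only an illustration: `search_products` is a hypothetical handler, Python's built-in `sqlite3` stands in for the database, and a real web server would add request parsing and HTML rendering around it.

```python
import sqlite3


def search_products(conn, user_text):
    """Hypothetical server-side handler: turn search-box text into SQL.

    The user's text is passed as a bound parameter (the ? placeholder)
    rather than pasted into the SQL string.
    """
    sql = (
        "SELECT product, price FROM product_table "
        "WHERE product LIKE '%' || ? || '%' ORDER BY product"
    )
    return conn.execute(sql, (user_text,)).fetchall()


# The database itself (in memory, for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_table (product TEXT, price REAL, notes TEXT)")
conn.executemany(
    "INSERT INTO product_table VALUES (?, ?, ?)",
    [
        ("carrots", 0.50, None),
        ("shotgun", 25.00, "oddly flexible"),
        ("buckshot", 2.00, None),
        ("Acme snow machine", 5.00, "high defect rate"),
    ],
)

# What the client typed into the search box
results = search_products(conn, "snow")
print(results)  # [('Acme snow machine', 5.0)]
```

Binding the text as a parameter keeps whatever the user types from being interpreted as SQL.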

Biological Databases
Over 1000 biological databases

- Vary in size, quality, coverage, level of interest
- Many of the major ones are covered in the annual Database Issue of Nucleic Acids Research

What makes a good database?
- comprehensiveness
- accuracy
- is up-to-date
- good interface
- batch search/download
- API (web services, DAS, etc.)

The Ten Commandments When Using Servers


- Remember the server, the database, and the program version used
- Write down sequence identification numbers
- Write down the program parameters
- Save your internet results the right way (use screenshots or PDFs if necessary)
- Databases are not like good wine (use up-to-date builds)
- Use local installs when it becomes necessary
Source: Bioinformatics for Dummies
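One possible way to follow the record-keeping advice above is to save a small structured note with every search. The sketch below is an assumption about how that might look, not a method from the source: the class name `SearchRecord`, its field names, and every value shown are hypothetical placeholders.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class SearchRecord:
    """Hypothetical log entry for one search against an online server."""
    server: str
    database: str
    program_version: str
    sequence_ids: list
    parameters: dict = field(default_factory=dict)
    date: str = ""


# All values below are placeholders, not real results
record = SearchRecord(
    server="www.ncbi.nlm.nih.gov",
    database="nr",
    program_version="PLACEHOLDER_PROGRAM_VERSION",
    sequence_ids=["PLACEHOLDER_ID"],
    parameters={"placeholder_parameter": 10},
    date="2008-01-01",
)

# One JSON line per search is easy to append to a lab notebook file
log_line = json.dumps(asdict(record), sort_keys=True)
print(log_line)
```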

Ten Important Bioinformatics Databases


Database    URL                   Contents
GenBank     www.ncbi.nlm.nih.gov  nucleotide sequences
Ensembl     www.ensembl.org       human/mouse genome (and others)
PubMed      www.ncbi.nlm.nih.gov  literature references
NR          www.ncbi.nlm.nih.gov  protein sequences
SWISS-PROT  www.expasy.ch         protein sequences
InterPro    www.ebi.ac.uk         protein domains
OMIM        www.ncbi.nlm.nih.gov  genetic diseases
Enzymes     www.chem.qmul.ac.uk   enzymes
PDB         www.rcsb.org/pdb/     protein structures
KEGG        www.genome.ad.jp      metabolic pathways

Source: Bioinformatics for Dummies

NCBI (National Center for Biotechnology Information)

- over 30 databases including GenBank, PubMed, OMIM, and GEO
- Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)
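Entrez can also be queried programmatically through NCBI's E-utilities web service, which the slides do not cover. The sketch below only builds a search URL for the `esearch` endpoint and makes no network request. The helper name `build_esearch_url` and the example search term are hypothetical; `db`, `term`, and `retmax` are E-utilities parameters.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities search endpoint
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Hypothetical helper: build an Entrez search URL without sending it."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


url = build_esearch_url("pubmed", "integrating biological databases")
print(url)
```

Pasting the printed URL into a browser returns the matching record identifiers as XML, which is the kind of batch access listed among the features of a good database.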

www.ncbi.nlm.nih.gov/GenBank

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. There are approximately 65,369,091,950 bases in 61,132,599 sequence records in the traditional GenBank divisions and 80,369,977,826 bases in 17,960,667 sequence records in the WGS division as of August 2006.

www.ncbi.nlm.nih.gov/GenBank

The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. It should be noted, though, that RefSeq has been built using data from public archival databases only. RefSeq biological sequences (also known as RefSeqs) are derived from GenBank records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article in the literature, a RefSeq represents the consolidation of information by a particular group at a particular time.

Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)


The MOD squad


- Most model organism communities have established organism-specific Model Organism Databases (MODs)
- Many of these databases have different schemas and implementations, although there is movement toward harmonizing many features via the Generic Model Organism Database project.

The MOD squad


- SGD: yeast (www.yeastgenome.org)
- WormBase: C. elegans (www.wormbase.org)
- FlyBase: Drosophila (flybase.bio.indiana.edu)
- ZFIN: zebrafish (zfin.org)
- and many others (Xenopus, Dictyostelium, Arabidopsis)

The MOD squad: what about Homo sapiens?


There is not a true model organism database for humans. The two main sources of genome information that have evolved are the UCSC Genome Browser and Ensembl.

- Ensembl: www.ensembl.org
- UCSC: genome.ucsc.edu

UCSC Browser

Ensembl

Protein Data Bank (PDB)
