
Introduction to Biological Databases

One of the hallmarks of modern genomic research is the generation of enormous
amounts of raw sequence data. As the volume of genomic data grows, sophisticated
computational methodologies are required to manage the data deluge. Thus, the very
first challenge in the genomics era is to store and handle the staggering volume of
information through the establishment and use of computer databases. The development
of databases to handle the vast amount of molecular biological data is thus a
fundamental task of bioinformatics. In this segment of the course we will introduce
some basic concepts related to databases, in particular, the types, designs, and
architectures of biological databases.
WHAT IS A DATABASE?

A database is a computerized archive used to store and organize data in such a
way that information can be retrieved easily via a variety of search criteria. Databases
are composed of computer hardware and software for data management. The chief
objective of the development of a database is to organize data in a set of structured
records to enable easy retrieval of information. Each record, also called an entry, should
contain a number of fields that hold the actual data items, for example, fields for names,
phone numbers, addresses, dates. To retrieve a particular record from the database, a
user can specify a particular piece of information, called a value, to be found in a
particular field and expect the computer to retrieve the whole data record. This process
is called making a query.
A database holds not only the organization's operational data but also a
description of this data. So, it can be defined as a self-describing collection of integrated
records.
Although data retrieval is the main purpose of all databases, biological databases
often have a higher level of requirement, known as knowledge discovery, which refers to
the identification of connections between pieces of information that were not known
when the information was first entered. For example, databases containing raw sequence
information can perform extra computational tasks to identify sequence homology or
conserved motifs. These features facilitate the discovery of new biological insights from
raw data.
A computer program called a database management system (DBMS) manages and
queries the database to get answers to questions. A DBMS is a set of tools that stores,
extracts, and modifies the data.
A schema is the description of the data in terms of a data model.

TYPES OF DATABASES

There are two types of database management systems: flat-like indexing systems
and relational DBMSs.
Originally, databases all used a flat file format, which is a long text file that
contains many entries separated by a delimiter, a special character such as a vertical bar
(|). Within each entry are a number of fields separated by tabs or commas. Except for
the raw values in each field, the entire text file does not contain any hidden instructions
for computers to search for specific information or to create reports based on certain
fields from each record. The text file can be considered a single table. Thus, to search a
flat file for a particular piece of information, a computer has to read through the entire
file, an obviously inefficient process. This is manageable for a small database, but as
database size increases or data types become more complex, this database style can
become very difficult for information retrieval. Indeed, searches through such files often
cause crashes of the entire computer system because of the memory-intensive nature of
the operation.
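As a minimal sketch of why flat-file lookups scale poorly, the following Python snippet parses a small flat file and searches it by reading every entry in turn. The file contents, the field layout, and the helper name find_entries are made up for illustration; entries are separated by a vertical bar and fields by commas, as described above.

```python
# Hypothetical flat file: entries separated by "|", fields separated by ",".
FLAT_FILE = "P001,insulin,human|P002,lysozyme,chicken|P003,myoglobin,whale"

FIELDS = ("accession", "name", "organism")  # assumed field layout


def find_entries(text, field, value):
    """Linear scan: every entry is read and compared, one by one."""
    position = FIELDS.index(field)
    matches = []
    entries_read = 0
    for entry in text.split("|"):
        entries_read += 1
        values = entry.split(",")
        if values[position] == value:
            matches.append(dict(zip(FIELDS, values)))
    return matches, entries_read


matches, entries_read = find_entries(FLAT_FILE, "organism", "whale")
print(matches)
print("entries read:", entries_read)
```

Because the file carries no index, the number of entries read always equals the number of entries in the file, whatever the search value is.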
HISTORY OF BIOLOGICAL DATABASES

1965: Margaret Dayhoff et al. publish the Atlas of Protein Sequence and Structure.
1982: EMBL initiates DNA sequence databases, followed within a year by
GenBank and in 1984 by the DNA Data Bank of Japan.
1988: EMBL/GenBank/DDBJ agree on a common format for data elements.
1980: Only 80 genes had been fully sequenced. The introduction of the PCR technique in 1983
led to a tremendous increase in nucleotide sequence data.
EMBL database growth
Total nucleotides (Nov 07: 188,490,792,445)
Number of entries (Nov 07: 106,144,026)

BIOLOGICAL DATABASES: SOME STATISTICS

1. More than 1000 different databases
o 968 databases reported in The Molecular Biology Database Collection:
2007 update by Galperin, Nucleic Acids Research, 2007, Vol. 35, Database
issue D3-D4
o Metabase: database of biological databases,
http://biodatabase.org/index.php/Main_Page
2. Database sizes: <100kB to >100GB (EMBL >500GB)
o DNA: >100GB, Protein: 1GB
o 3D structure: 5GB
3. Update (adding new data) frequency: daily to annually
4. Freely accessible (as a rule)
CATEGORIES OF BIOLOGICAL DATABASES

1. Nucleotide sequences
2. Genomics (information on gene chromosomal location and nomenclature, provide
links to sequence databases)
3. Mutation/polymorphism (sequence variations linked or not to genetic diseases)
4. Protein sequences
5. Protein domain/family
6. Proteomics (2D gel, MS)
7. Microarray (high-dimensional data: profiles of thousands of genes depending on
hundreds/thousands of various conditions)
8. Organism-specific
9. 3D structure
10. Metabolism (e.g., metabolic pathways graph data)
11. Bibliography
12. Others
BIOLOGICAL DATABASES: SPECIFIC FEATURES

o Sub-class of scientific databases
o Autonomous: many independent maintainers
o Heterogeneous data formats: e.g., various data formats for the same data entities;
various types of biological data: genomic, microarray, proteomic, ...
o Dynamic: frequent and continuous changes in data content (and, more
importantly, in data schema)
o Broad domain knowledge
o Workflow-oriented: databases + rich set of analysis tools
o Information integration is essential: data aggregation from several databases
BIOLOGICAL DATABASES: INTEGRATION

MODELS OF DATABASES

Hierarchical
The hierarchical data model organizes data in a tree structure. There is a hierarchy
of parent and child data segments. The tree has only one root record. Each root record may
participate in relationships with many child records. Each child may itself have many
child records. A parent record owns its child records.
If a parent is deleted, all its child records are automatically deleted. Changing the
structure of the database is very difficult because changes in structure require changes to
the access mechanisms and consequently to the application programs.
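The ownership rule above can be sketched in a few lines of Python. This is only an illustration, not how a hierarchical DBMS is implemented; the record names and the delete_record helper are hypothetical.

```python
# Hypothetical hierarchy: one root record that owns child records.
tree = {
    "genome": ["chromosome_1", "chromosome_2"],
    "chromosome_1": ["gene_A", "gene_B"],
    "chromosome_2": ["gene_C"],
    "gene_A": [],
    "gene_B": [],
    "gene_C": [],
}


def delete_record(tree, record):
    """Delete a record and, recursively, every child record it owns."""
    for child in tree.pop(record):
        delete_record(tree, child)
    # Remove the deleted record from its parent's list of children.
    for children in tree.values():
        if record in children:
            children.remove(record)


delete_record(tree, "chromosome_1")
print(sorted(tree))
print(tree["genome"])
```

Deleting chromosome_1 also removes gene_A and gene_B, because a child record cannot exist without its parent in this model.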
Network
This model was developed in response to the shortcomings of the hierarchical
model. A network database is a collection of record types. Record types are associated
together by links, and there is no constraint on the number and direction of the links that
can be established. There is no root record, and each record can participate in any
number of "owns" relationships. In the network model data duplication is removed,
it is possible to set up record instances without having them participate in a link, and
deleting one owner does not necessarily delete all its members. However, the
implementation of this model is very difficult.
Object-Oriented
Object DBMSs add database functionality to object programming languages.
They bring much more than persistent storage of programming language objects. Object
DBMSs extend the semantics of the C++, Smalltalk and Java object programming
languages to provide full-featured database programming capability, while retaining
native language compatibility. A major benefit of this approach is the unification of the
application and database development into a seamless data model and language
environment. As a result, applications require less code, use more natural data modeling,
and code bases are easier to maintain. Object developers can write complete database
applications with a modest amount of additional effort.
Relational

Data is presented as a collection of relations.
Each relation is depicted as a table.
Columns are attributes.
Rows ("tuples") represent entities.
Every table has a set of attributes that, taken together as a "key" (technically, a
"superkey"), uniquely identifies each entity.

The Relational Data Model in Detail


Schemas
Relation schema = relation name and attribute list.
Optionally: types of attributes.
Example: Beers(name, manf) or Beers(name: string, manf: string)
Database = collection of relations.
Database schema = set of all relation schemas in the database.
Why Relations?
Very simple model.
Often matches how we think about data.
Abstract model that underlies SQL, the most important database language today.

Entity-Relationship Model
Entity = "thing" or object.
Entity set = collection of similar entities, like a class of objects.
Attribute = property of the entities in an entity set.
Relationships connect two or more entity sets.
Purpose of the E/R Diagram
- The E/R model allows us to sketch database designs.
Kinds of data and how they connect.
Not how data changes.
- Designs are pictures called entity-relationship diagrams.
- Later: convert E/R designs to relational DB designs.
In an E/R Diagram
Entity set = rectangle.
Attribute = oval, with a line to the rectangle representing its entity set.

[Figure: E/R diagram with the rectangle Beers connected to the ovals name and manf.]

Entity set Beers has two attributes, name and manf (manufacturer).
Each Beers entity has values for these two attributes, e.g. (Bud, Anheuser-Busch)
A relationship connects two or more entity sets.
It is represented by a diamond, with lines to each of the entity sets involved.
[Figure: E/R diagrams of many-one, one-one, and many-many relationships.]

Subclass = special case = fewer entities = more properties.
Example: Ales are a kind of beer.
Not every beer is an ale, but some are.
Let us suppose that in addition to all the properties (attributes and relationships)
of beers, ales also have the attribute color.
Note: It is relatively straightforward to represent a database design in graphical E/R
diagrams, where rectangles represent entity types, diamonds relationship types, and
ovals attributes.
A key is a set of attributes for one entity set such that no two entities in this set agree on
all the attributes of the key.
It is allowed for two entities to agree on some, but not all, of the key attributes.
We must designate a key for every entity set.

Note: In the previous diagram underlined attribute names represent keys

The Relational Algebra and Calculus


The theory of database operations is based on set theory.
1. The Relational Algebra provides a collection of operations to manipulate
relations. It supports the notion of a query, or request to retrieve information from
a database. There are set operations: Union, Intersection, Difference, and
Cartesian Product.
2. The Relational Calculus is a formal query language. Instead of having to write a
sequence of relational algebra operations, we simply write a single declarative
expression describing the results that we want. It is somewhat like writing a
program in C or Java instead of assembler.
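The four set operations listed under the Relational Algebra can be illustrated with Python sets of tuples, where each set plays the role of a relation. This is a rough sketch: apart from (Bud, Anheuser-Busch), all beer, manufacturer, and bar names are invented for the example.

```python
from itertools import product

# Two hypothetical relations with the same schema (name, manf),
# each modelled as a set of tuples.
beers_a = {("Bud", "Anheuser-Busch"), ("Ale X", "Maker X")}
beers_b = {("Bud", "Anheuser-Busch"), ("Ale Y", "Maker Y")}

union = beers_a | beers_b         # tuples in either relation
intersection = beers_a & beers_b  # tuples in both relations
difference = beers_a - beers_b    # tuples in beers_a but not in beers_b

# Cartesian product: every tuple of one relation paired with every tuple
# of another relation (here a one-attribute relation of bar names).
bars = {("Bar 1",), ("Bar 2",)}
cartesian = {beer + bar for beer, bar in product(beers_a, bars)}

print(len(union), len(intersection), len(difference), len(cartesian))
```

Union, intersection, and difference require both relations to have the same attributes; the Cartesian product does not, and its result has the attributes of both inputs.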

Structured Query Language (SQL)


SQL came from an IBM Research project entitled "SEQUEL", where the intent was to
create a structured English-like query language to interface to the early System R
database system. Along with QUEL, SQL was the first high-level declarative database
language.
Data Definition Language (DDL): allows a database administrator or database
designer to define tables, create views, etc.
Data Manipulation Language (DML): allows an end user to retrieve
information from tables and to insert, delete, and update rows.

Practice in SQL
(Note/Hint: A good discussion of the details of SQL can be found on the course
website: Introduction to SQL 1 and 2 and PL/SQL.)
The CREATE TABLE statement is used to define a new table.
CREATE TABLE Students (sid   CHAR(20),
                       name  CHAR(30),
                       login CHAR(20),
                       age   INTEGER,
                       gpa   REAL)

sid    name     login          age  gpa
50000  Dave     dave@cs        19   3.3
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@chem     18   3.2
53650  Smith    smith@ee       19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@math     12   2.0

An instance S1 of the Students relation


An instance of a relation is a set of tuples, also called records.
Tuples are inserted using the INSERT command. We can insert a single tuple into the Students table as
follows:
INSERT
INTO   Students (sid, name, login, age, gpa)
VALUES (53688, 'Smith', 'smith@chem', 18, 3.2)
We can delete tuples using the DELETE command. We can delete all Students tuples with name equal
to Smith using the command:
DELETE
FROM   Students S
WHERE  S.name = 'Smith'

We can modify the column values in an existing row using the UPDATE command. For example, we
can increment the age and decrement the gpa of the student with sid 53688:
UPDATE Students S
SET    S.age = S.age + 1, S.gpa = S.gpa - 1
WHERE  S.sid = 53688

The WHERE clause is applied first and determines which rows are to be modified. The SET clause
then determines how these rows are to be modified. If the column that is being modified is also used to
determine the new value, the value used in the expression on the right side of the equals sign (=) is the old
value, that is, the value before the modification. To illustrate these points further, consider:
UPDATE Students S
SET    S.gpa = S.gpa - 0.1
WHERE  S.gpa >= 3.3

After the update, instance S1 of Students becomes:

sid    name     login          age  gpa
50000  Dave     dave@cs        19   3.2
53666  Jones    jones@cs       18   3.3
53688  Smith    smith@chem     18   3.2
53650  Smith    smith@ee       19   3.7
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@math     12   2.0
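To check the effect of the gpa update by hand, the statement can be run against an in-memory SQLite database from Python. This is a sketch under a few assumptions: SQLite stands in for whatever DBMS the course uses, the sid values are stored as text, and the UPDATE is written without the table alias S to stay portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid CHAR(20), name CHAR(30), login CHAR(20), "
    "age INTEGER, gpa REAL)"
)
# Instance S1 of Students, as it was before the update.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("50000", "Dave", "dave@cs", 19, 3.3),
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@chem", 18, 3.2),
        ("53650", "Smith", "smith@ee", 19, 3.8),
        ("53831", "Madayan", "madayan@music", 11, 1.8),
        ("53832", "Guldu", "guldu@math", 12, 2.0),
    ],
)

# WHERE selects rows using the old gpa; SET then computes old gpa - 0.1.
conn.execute("UPDATE Students SET gpa = gpa - 0.1 WHERE gpa >= 3.3")

# Round for display: 3.3 - 0.1 is not exactly 3.2 in binary floating point.
gpa_after = {
    sid: round(gpa, 1)
    for sid, gpa in conn.execute("SELECT sid, gpa FROM Students")
}
print(gpa_after)
```

Only the three students whose old gpa was at least 3.3 are changed; the student with gpa 3.2 is untouched.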

Key Constraints
Constraints are mostly a collection of indexes and triggers that restrict certain actions on
a table. Constraints are not actual entities themselves. There are four types of
constraints:
1. Primary Key Constraints (A primary key is a type of index that will most likely
be used as the primary index when a query is made on the table.)
2. Unique Constraints (Unique constraints may be placed on multiple columns.
They constrain the UPDATE/INSERTS on the table so that the values being
updated or inserted do not match any other row in the table for the corresponding
columns.)
3. Check Constraints (A check constraint prevents updates/inserts on the table by
placing a check condition on the selected column. The UPDATE/INSERT is
allowed only if the check condition is satisfied.)
4. Foreign Key (FK) Constraints (A foreign key constraint allows certain fields in
one table to refer to fields in another table. This type of constraint is useful if you
have two tables, one of which has partial information, details on which can be
sought from another table with a matching entry. A foreign key constraint in this
case will prevent the deletion of an entry from the table with detailed information
if there is an entry in the table with partial information that matches it.)
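The four constraint types can be tried out with a small Python and SQLite sketch. The choice of which columns carry which constraint (UNIQUE on login, CHECK on age) is an assumption made for this example, as is the helper name violates; note also that SQLite enforces foreign keys only after the PRAGMA shown below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute(
    """CREATE TABLE Students (
           sid   CHAR(20) PRIMARY KEY,
           name  CHAR(30),
           login CHAR(20) UNIQUE,
           age   INTEGER CHECK (age > 0),
           gpa   REAL)"""
)
conn.execute(
    """CREATE TABLE Enrolled (
           cid   CHAR(20),
           grade CHAR(2),
           sid   CHAR(20) REFERENCES Students (sid))"""
)
conn.execute("INSERT INTO Students VALUES ('53650', 'Smith', 'smith@ee', 19, 3.8)")
conn.execute("INSERT INTO Enrolled VALUES ('CS112', 'A', '53650')")


def violates(statement):
    """Return True if the DBMS rejects the statement with a constraint error."""
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError:
        return True
    return False


results = {
    "primary key": violates(
        "INSERT INTO Students VALUES ('53650', 'Jones', 'jones@cs', 18, 3.4)"),
    "unique": violates(
        "INSERT INTO Students VALUES ('53666', 'Jones', 'smith@ee', 18, 3.4)"),
    "check": violates(
        "INSERT INTO Students VALUES ('53666', 'Jones', 'jones@cs', -1, 3.4)"),
    "foreign key (insert)": violates(
        "INSERT INTO Enrolled VALUES ('His105', 'B', '99999')"),
    "foreign key (delete)": violates(
        "DELETE FROM Students WHERE sid = '53650'"),
}
print(results)
```

Each statement is rejected: the first three break a constraint on Students itself, the fourth refers to a student that does not exist, and the fifth would delete a student that an Enrolled row still refers to.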


Primary Key: sid (Students)

sid    name     login          age  gpa
50000  Dave     dave@cs        19   3.2
53666  Jones    jones@cs       18   3.3
53688  Smith    smith@chem     18   3.2
53650  Smith    smith@ee       19   3.7
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@math     12   2.0

Foreign Key: sid (Enrolled)

cid      grade  sid
Chem302  C      53831
Math203  B      53832
CS112    A      53650
His105   B      53666

Querying Relational Data

A relational database query is a question about the data, and the answer consists of a new relation
containing the result. A query language is a specialized language for writing queries.
SELECT *
FROM   Students S
WHERE  S.age < 18

SELECT S.name, S.login
FROM   Students S, Enrolled E
WHERE  S.sid = E.sid AND E.grade = 'A'
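The two queries above can be run from Python against an in-memory SQLite database. The sketch assumes the original instance S1 of Students and the Enrolled rows shown with the foreign key example; the Enrolled column types are a guess, since the notes do not give its CREATE TABLE statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid CHAR(20), name CHAR(30), login CHAR(20), "
    "age INTEGER, gpa REAL)"
)
# Column types for Enrolled are assumed for this example.
conn.execute("CREATE TABLE Enrolled (cid CHAR(20), grade CHAR(2), sid CHAR(20))")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("50000", "Dave", "dave@cs", 19, 3.3),
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@chem", 18, 3.2),
        ("53650", "Smith", "smith@ee", 19, 3.8),
        ("53831", "Madayan", "madayan@music", 11, 1.8),
        ("53832", "Guldu", "guldu@math", 12, 2.0),
    ],
)
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?, ?)",
    [
        ("Chem302", "C", "53831"),
        ("Math203", "B", "53832"),
        ("CS112", "A", "53650"),
        ("His105", "B", "53666"),
    ],
)

# Query 1: all students younger than 18.
young = conn.execute("SELECT * FROM Students S WHERE S.age < 18").fetchall()

# Query 2: join Students with Enrolled and keep the rows with grade A.
grade_a = conn.execute(
    "SELECT S.name, S.login FROM Students S, Enrolled E "
    "WHERE S.sid = E.sid AND E.grade = 'A'"
).fetchall()

print(young)
print(grade_a)
```

The answer to each query is itself a relation: a list of rows with the columns named in the SELECT clause.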

Consider the following three tables:

sid  sname    rating  age
22   Dustin   7       45.0
29   Brutus   1       33.0
31   Lubber   8       55.5
32   Andy     8       25.5
58   Rusty    10      35.0
64   Horatio  7       35.0
71   Zorba    10      16.0
74   Horatio  9       35.0
85   Art      3       25.5
95   Bob      3       63.5
Instance S3 of Sailors

sid  bid  day
22   101  10/10/98
22   102  10/10/98
22   103  10/8/98
22   104  10/7/98
31   102  11/10/98
31   103  11/6/98
31   104  11/12/98
64   101  9/5/98
64   102  9/8/98
74   103  9/8/98
Instance R2 of Reserves

bid  bname      color
101  Interlake  blue
102  Interlake  red
103  Clipper    green
104  Marine     red
Instance B1 of Boats

1) Find the names and ages of all sailors
SELECT DISTINCT S.sname, S.age
FROM   Sailors S

Without DISTINCT
sname    age
Dustin   45.0
Brutus   33.0
Lubber   55.5
Andy     25.5
Rusty    35.0
Horatio  35.0
Zorba    16.0
Horatio  35.0
Art      25.5
Bob      63.5

With DISTINCT
sname    age
Dustin   45.0
Brutus   33.0
Lubber   55.5
Andy     25.5
Rusty    35.0
Horatio  35.0
Zorba    16.0
Art      25.5
Bob      63.5
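The effect of DISTINCT can be reproduced with instance S3 of Sailors in an in-memory SQLite database. This is a sketch; the column types in the CREATE TABLE statement are assumed, since the notes show only the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed for this example.
conn.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT, rating INTEGER, age REAL)")
conn.executemany(
    "INSERT INTO Sailors VALUES (?, ?, ?, ?)",
    [
        (22, "Dustin", 7, 45.0),
        (29, "Brutus", 1, 33.0),
        (31, "Lubber", 8, 55.5),
        (32, "Andy", 8, 25.5),
        (58, "Rusty", 10, 35.0),
        (64, "Horatio", 7, 35.0),
        (71, "Zorba", 10, 16.0),
        (74, "Horatio", 9, 35.0),
        (85, "Art", 3, 25.5),
        (95, "Bob", 3, 63.5),
    ],
)

all_rows = conn.execute("SELECT S.sname, S.age FROM Sailors S").fetchall()
distinct_rows = conn.execute(
    "SELECT DISTINCT S.sname, S.age FROM Sailors S"
).fetchall()

print(len(all_rows), len(distinct_rows))
print(all_rows.count(("Horatio", 35.0)), distinct_rows.count(("Horatio", 35.0)))
```

The two sailors named Horatio have different sid values but the same name and age, so they collapse into one row once only sname and age are selected with DISTINCT.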

2) Find all sailors with rating above 7
SELECT S.sid, S.sname, S.age
FROM   Sailors AS S
WHERE  S.rating > 7

3) Find the names of sailors who have reserved boat number 103
SELECT S.sname
FROM   Sailors S, Reserves R
WHERE  S.sid = R.sid AND R.bid = 103

4) Find the sids of sailors who have reserved a red boat
SELECT R.sid
FROM   Boats B, Reserves R
WHERE  B.bid = R.bid AND B.color = 'red'

5) Find the names of sailors who have reserved a red boat
SELECT S.sname
FROM   Sailors S, Reserves R, Boats B
WHERE  S.sid = R.sid AND R.bid = B.bid AND B.color = 'red'

6) Find the colors of boats reserved by Lubber
SELECT B.color
FROM   Sailors S, Reserves R, Boats B
WHERE  S.sid = R.sid AND R.bid = B.bid AND S.sname = 'Lubber'

7) Find the names of sailors who have reserved at least one boat
SELECT S.sname
FROM   Sailors S, Reserves R
WHERE  S.sid = R.sid

Nested Queries
Query 3 above (find the names of sailors who have reserved boat number 103) can also be written as:
SELECT S.sname
FROM   Sailors S
WHERE  S.sid IN ( SELECT R.sid
                  FROM   Reserves R
                  WHERE  R.bid = 103 )

Query 5 above (find the names of sailors who have reserved a red boat) can also be written as:
SELECT S.sname
FROM   Sailors S
WHERE  S.sid IN ( SELECT R.sid
                  FROM   Reserves R
                  WHERE  R.bid IN ( SELECT B.bid
                                    FROM   Boats B
                                    WHERE  B.color = 'red' ) )

Find the names of sailors who have not reserved a red boat:
SELECT S.sname
FROM   Sailors S
WHERE  S.sid NOT IN ( SELECT R.sid
                      FROM   Reserves R
                      WHERE  R.bid IN ( SELECT B.bid
                                        FROM   Boats B
                                        WHERE  B.color = 'red' ) )
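As a rough check that the join form and the nested form of the red-boat query agree, both can be run on instances S3, R2, and B1 in an in-memory SQLite database. The column types are assumed, the dates are stored as plain text, and the variable names are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed for this example; dates are kept as text.
conn.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT, rating INTEGER, age REAL)")
conn.execute("CREATE TABLE Reserves (sid INTEGER, bid INTEGER, day TEXT)")
conn.execute("CREATE TABLE Boats (bid INTEGER, bname TEXT, color TEXT)")
conn.executemany("INSERT INTO Sailors VALUES (?, ?, ?, ?)", [
    (22, "Dustin", 7, 45.0), (29, "Brutus", 1, 33.0), (31, "Lubber", 8, 55.5),
    (32, "Andy", 8, 25.5), (58, "Rusty", 10, 35.0), (64, "Horatio", 7, 35.0),
    (71, "Zorba", 10, 16.0), (74, "Horatio", 9, 35.0), (85, "Art", 3, 25.5),
    (95, "Bob", 3, 63.5),
])
conn.executemany("INSERT INTO Reserves VALUES (?, ?, ?)", [
    (22, 101, "10/10/98"), (22, 102, "10/10/98"), (22, 103, "10/8/98"),
    (22, 104, "10/7/98"), (31, 102, "11/10/98"), (31, 103, "11/6/98"),
    (31, 104, "11/12/98"), (64, 101, "9/5/98"), (64, 102, "9/8/98"),
    (74, 103, "9/8/98"),
])
conn.executemany("INSERT INTO Boats VALUES (?, ?, ?)", [
    (101, "Interlake", "blue"), (102, "Interlake", "red"),
    (103, "Clipper", "green"), (104, "Marine", "red"),
])

RED_BOAT_SIDS = (
    "SELECT R.sid FROM Reserves R WHERE R.bid IN "
    "(SELECT B.bid FROM Boats B WHERE B.color = 'red')"
)

# Join form of query 5: one result row per qualifying reservation.
joined = conn.execute(
    "SELECT S.sname FROM Sailors S, Reserves R, Boats B "
    "WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red'"
).fetchall()

# Nested form of query 5: one result row per qualifying sailor.
nested = conn.execute(
    "SELECT S.sname FROM Sailors S WHERE S.sid IN (" + RED_BOAT_SIDS + ")"
).fetchall()

# Sailors who have not reserved a red boat.
not_red = conn.execute(
    "SELECT S.sname FROM Sailors S WHERE S.sid NOT IN (" + RED_BOAT_SIDS + ")"
).fetchall()

print(sorted(joined))
print(sorted(nested))
print(sorted(not_red))
```

Both forms name the same sailors, but the join form repeats a name once for every red boat that sailor reserved, so DISTINCT would be needed to make the two results identical. Horatio appears in both the red and the not-red answers because two different sailors share that name.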
