Professional Documents
Culture Documents
Khademul Islam
Outline
What Is a Database?
Types of Databases
Properties of Biological Databases
Pitfalls of Biological Databases
Information Retrieval from Biological Databases
DB
Organization Information
Objective
Relational Primary Secondary Specialized
Oriented
Example of construction and query of an object-oriented database using the same student information as shown in
previous figure. Three objects are constructed and are linked by pointers shown as arrows. Finding specific information
relies on navigating through the objects by way of pointers. For simplicity, some of the pointers are omitted.
Types of Db: (B) Based on information
Solution:
• NCBI has now created a nonredundant database, called RefSeq, in which
identical sequences from the same organism and associated sequence
fragments are merged into a single entry.
• Proteins sequences derived from the same DNA sequences are explicitly
linked as related entries.
• SWISS-PROT database also has minimal redundancy for protein
sequences compared to most other databases.
• Another way to address the redundancy problem is to create sequence-
cluster databases such as UniGene that coalesce EST sequences that
are derived from the same gene.
Dr. Md. Khademul Islam
Erroneous annotation:
• same gene sequence is found under different names.
• unrelated genes bearing the same name
Solution:
• re-annotation of genes and proteins using a set of common,
controlled vocabulary to describe a gene or protein is
necessary.
• Example of such systems is Gene Ontology
FASTA:
• it contains plain sequence information that is readable by many
bioinformatics analysis programs.
• has a single definition line that begins with a right angle bracket (>)
followed by a sequence name.
• Sometimes, extra information such as gi number or comments can be
given, which are separated from the sequence nameby a “|” symbol.
• The extra information is considered optional and is ignored by sequence
analysis programs.
• The plain sequence in standard one-letter symbols starts in the second
line.
• Each line of sequence data is limited to sixty to eighty characters in width.
• The drawback of this format is that much annotation information is lost.
Objective:
Learn about different types of databases
Introduce with National Centre for Biotechnology Information
(NCBI)
Web Address:
http://www.ncbi.nlm.nih.gov/
Objective:
Retrieve protein or DNA sequences (FASTA format) from NCBI
Web Address:
http://www.ncbi.nlm.nih.gov/
Tools:
BioMart of EnsEMBL
http://www.ensembl.org
200