
BIOINFORMATICS

UIMLT

Prepared by: Syeda Rida Shah


Senior lecturer
THE UNIVERSITY OF LAHORE
Biological Databases
• A biological database is a large, organized
body of persistent data, usually associated
with computerized software designed to
update, query, and retrieve components of
the data stored within the system.

• A simple database might be a single file
containing many records, each of which
includes the same set of information.
• For example, a record associated with a
nucleotide sequence database typically
contains information such as contact name;
the input sequence with a description of the
type of molecule; the scientific name of the
source organism from which it was isolated;
and, often, literature citations associated with
the sequence.
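The record described above can be sketched as a small data structure. This is an illustrative example only; the field names and values are hypothetical, not a real database schema.

```python
# A minimal sketch of one nucleotide-database record, holding the fields
# the text lists: contact name, sequence with molecule type, source
# organism, and literature citations. All names here are illustrative.
record = {
    "contact": "J. Smith",
    "sequence": "ATGGCGTACGTTAGC",
    "molecule_type": "DNA",
    "organism": "Escherichia coli",
    "citations": ["Smith et al. (1993) J. Mol. Biol."],
}

# A simple database is then just a file of many such records.
database = [record]
print(database[0]["organism"])  # Escherichia coli
```

Because every record carries the same set of fields, querying the database reduces to scanning records and comparing field values.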
• For researchers to benefit from the data stored
in a database, two additional requirements must
be met:

• Easy access to the information; and

• A method for extracting only that information
needed to answer a specific biological question.
• Currently, a lot of bioinformatics work is
concerned with the technology of
databases. These databases include both
"public" repositories of gene data like
GenBank or the Protein DataBank (the
PDB), and private databases like those
used by research groups involved in gene
mapping projects or those held by biotech
companies.
• Making such databases accessible via open
standards like the Web is very important
since consumers of bioinformatics data use a
range of computer platforms. RNA and DNA
are the molecules that store the hereditary
information about an organism. These
macromolecules have a fixed structure,
which can be analyzed by biologists with the
help of bioinformatics tools and databases.
• A few popular databases are
• GenBank from NCBI (National Center for
Biotechnology Information)
• SwissProt from the Swiss Institute of
Bioinformatics
• PIR from the Protein Information
Resource.
• GenBank:
GenBank (Genetic Sequence Databank) is one of the
fastest growing repositories of known genetic
sequences. It has a flat-file structure, that is, an
ASCII text file, readable by both humans and
computers. In addition to sequence data, GenBank
files contain information like accession numbers
and gene names, phylogenetic classification and
references to published literature. As of June
1994, it contained approximately 191,400,000
bases in 183,000 sequences.
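Because a GenBank flat file is keyword-prefixed ASCII text, a few header fields can be pulled out with ordinary string handling. The miniature entry below is made up for illustration; it mimics the general flat-file layout (keyword in the left column, value after it) but is not a real GenBank record.

```python
# A sketch of reading header fields from a GenBank-style flat file.
# The entry is a made-up miniature example, not a real GenBank record.
entry = """\
LOCUS       EXAMPLE1     15 bp    DNA     linear   BCT 01-JUN-1994
DEFINITION  Hypothetical example sequence.
ACCESSION   X00001
ORIGIN
        1 atggcgtacg ttagc
//
"""

fields = {}
for line in entry.splitlines():
    keyword = line[:12].strip()   # keywords occupy the leftmost columns
    if keyword in ("LOCUS", "DEFINITION", "ACCESSION"):
        fields[keyword] = line[12:].strip()

print(fields["ACCESSION"])  # X00001
```

The same human-readable file is thus also machine-readable, which is the point of the flat-file design.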
• EMBL:
The EMBL Nucleotide Sequence Database is a
comprehensive database of DNA and RNA
sequences collected from the scientific literature
and patent applications and directly submitted
from researchers and sequencing groups. Data
collection is done in collaboration with GenBank
(USA) and the DNA Database of Japan (DDBJ).
The database doubles in size every 18 months
and currently (June 1994) contains nearly 2
million bases from 182,615 sequence entries.
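Doubling every 18 months is simple exponential growth, which can be sketched directly. The starting figure of 182,615 entries is taken from the text; the projection itself is only an illustration of the growth rule, not a real database statistic.

```python
# Exponential growth under a fixed doubling time: after m months the
# size is start * 2 ** (m / doubling_months).
def projected_entries(start, months, doubling_months=18):
    return start * 2 ** (months / doubling_months)

print(round(projected_entries(182_615, 18)))  # one doubling: 365230
print(round(projected_entries(182_615, 36)))  # two doublings: 730460
```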
• SwissProt:
This is a protein sequence database that provides
a high level of integration with other databases
and also has a very low level of redundancy
(means less identical sequences are present in
the database).
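"Low redundancy" means few identical sequences are stored. The simplest form of redundancy removal, collapsing exact duplicates, can be sketched as below; real curated databases such as SwissProt do far more than this (e.g., merging entries for sequence variants), so this is only a toy illustration with made-up sequences.

```python
# Collapse exact duplicate sequences, keeping the first entry for each.
sequences = {
    "P1": "MKTAYIAK",
    "P2": "MKTAYIAK",   # identical to P1 -> redundant
    "P3": "MVLSPADK",
}

unique = {}
for name, seq in sequences.items():
    unique.setdefault(seq, name)   # only the first name per sequence is kept

print(len(sequences), "entries,", len(unique), "unique sequences")
```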
• PDB:
The Protein Data Bank (PDB), compiled at the
Brookhaven National Laboratory, is the repository
of three-dimensional macromolecular structures
determined largely by X-ray crystallography.
• GDB:
The GDB Human Genome Data Base
supports biomedical research, clinical
medicine, and professional and scientific
education by providing for the storage and
dissemination of data about genes and
other DNA markers, map location, genetic
disease and locus information, and
bibliographic information.
• OMIM:
The Mendelian Inheritance in Man (MIM) data
bank is prepared by Victor McKusick with the
assistance of Claire A. Francomano and
Stylianos E. Antonarakis at Johns Hopkins
University.
• PHYSICAL MAP:
Computation of the human genetic map using
DNA fragments in the form of YAC contigs.
• GENETIC MAP:
Production of micro-satellite probes and the
localization of chromosomes, to create a
genetic map to aid in the study of
hereditary diseases.
• GENEXPRESS (cDNA):
Catalogue of the transcripts required for
protein synthesis obtained from specific
tissues, for example neuromuscular tissues.
Data Annotation, Processing, and Analysis
• Data are expensive to gather and confounded by
noise, but they are the primary means of
validation in the sciences.
•Data annotation helps scientists effectively share
their data and maximize its use in knowledge
discovery.
•Processing steps help control the quality of the
data by reducing irrelevant variation and handling
missing values.
• Data analysis helps scientists form conjectures
about their data and identify hidden relationships.
Informatics tools can support each of these
activities, although tools for analysis receive the
most attention.

• Data Annotation
• Data annotation includes several activities, such as
labeling measurements, adding structure to data,
describing the collection environment, and
recording provenance.
This information enhances the use of scientific
data in collaborative environments and enables
data integration. Shared, controlled
vocabularies let scientists communicate how
and why data were collected to reduce data
misuse. In some cases the annotations supplant
the original observations to become a new
form of scientific data.
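The activities above, labeling with shared terms and recording the collection environment and provenance, can be sketched as a small annotation routine. Everything here is hypothetical: the vocabulary, field names, and values are illustrative, not taken from any real annotation standard.

```python
# A sketch of annotating a raw measurement with metadata of the kinds
# the text lists. The controlled vocabulary is a made-up example.
CONTROLLED_VOCABULARY = {"expression_level", "binding_affinity"}

def annotate(value, label, instrument, collected_by):
    if label not in CONTROLLED_VOCABULARY:
        raise ValueError(f"unknown label: {label}")   # enforce shared terms
    return {
        "value": value,
        "label": label,                  # shared, controlled vocabulary
        "instrument": instrument,        # collection environment
        "provenance": {"collected_by": collected_by},
    }

obs = annotate(3.7, "expression_level", "microarray scanner", "lab A")
print(obs["label"])  # expression_level
```

Rejecting labels outside the shared vocabulary is what lets collaborators trust how and why the data were collected.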
• Using Annotations
Annotated data serves several purposes such as
enhancing traditional information retrieval
approaches with shared knowledge of
concepts and relationships; tracking the
source and original use of scientific data to
facilitate proper interpretation and use by
third parties; creating a new, structured
representation of the data that scientists can
reason about.
• Data Preparation
Observations often require processing before
serving as scientific data. Even then, data may
require further preparation before analysis,
such as normalizing the data to enable the
comparison of results across experiments;
filtering the data to enhance the signal; and
estimating the values of missing observations.
When correctly applied, these steps help
ensure the reliability of scientific results.
• Data Normalization and Filtering
Normalization counters systematic and uninformative
variation in measurement tools and measured entities.
• Normalization of microarray data combats incidental
variation across experimental settings.
• Normalizations may also transform data to fit a
normal distribution to support the use of statistical
analyses.
• Filters remove unreliable data and irrelevant noise by
scanning for outliers, smoothing trajectories, etc.
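The two steps described above can be sketched together: a z-score transformation to put values on a common scale, then a filter dropping points far from the mean. The data and the 2-standard-deviation cutoff are illustrative choices, not a prescription.

```python
import statistics

# Z-score normalization followed by a simple outlier filter.
data = [10.1, 9.8, 10.3, 10.0, 25.0, 9.9]   # 25.0 is an obvious outlier

mean = statistics.mean(data)
sd = statistics.stdev(data)
z_scores = [(x - mean) / sd for x in data]   # center and rescale

# Keep only points within 2 standard deviations of the mean.
filtered = [x for x, z in zip(data, z_scores) if abs(z) <= 2.0]
print(filtered)  # 25.0 is removed
```

Note that the outlier itself inflates the mean and standard deviation, which is one reason real pipelines often use more robust filters.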
• Handling Missing Data
Missing data can skew the distribution of a sample.
Substituting the mean is no longer encouraged; for
series data, interpolation fits a (localized) curve to
the data set and estimates the missing values from
it; maximum likelihood estimation and multiple
imputation are the most common approaches.
Imputation builds a (typically shallow) underlying
model of the available data that provides the
missing values. SPSS, SAS, and R include
imputation routines.
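The simplest case of interpolation for series data, estimating a missing value from its two neighbors, can be sketched as follows. This toy version assumes each missing value sits between two observed values; real interpolation routines handle runs of gaps and boundary cases.

```python
# Linear interpolation of a single missing value (None) in a series:
# the local "curve" is the straight line between the two neighbors.
series = [2.0, 4.0, None, 8.0, 10.0]

def interpolate(values):
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            # Assumes numeric neighbors on both sides (toy example).
            filled[i] = (filled[i - 1] + filled[i + 1]) / 2
    return filled

print(interpolate(series))  # [2.0, 4.0, 6.0, 8.0, 10.0]
```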
• Data Analysis
• Analysis tools can reveal the patterns and
relationships hidden within a scientific data set.
Abstract views of these relationships are gathered
through a combination of descriptive statistics,
correlation tables, and exploratory data analysis.
• These analyses describe the key characteristics of
data sets, helping scientists form conjectures.
Informatics tools supporting these analyses include
Excel, SPSS, Minitab, and R.
• Descriptive Statistics and Correlations
Descriptive statistics include quantitative
measures of central tendency (e.g., mean,
median), variability (e.g., range, standard
deviation), and skewness (whether a distribution
leans to one direction). Correlation tables identify
linear relationships between variables in a
multivariate data set. The correlation coefficient
ranges between -1.0 and 1.0 and provides
heuristic evidence for interesting interactions.
(Figure: example distributions and their
correlation coefficients.)
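The quantities above can be computed with Python's standard library; the Pearson coefficient is written out directly here so the sketch does not depend on any particular library version. The two toy variables are constructed to be perfectly linearly related.

```python
import statistics

# Descriptive statistics plus a Pearson correlation coefficient.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # y = 2x, a perfect linear relationship

print(statistics.mean(x), statistics.median(x), statistics.stdev(x))

mx, my = statistics.mean(x), statistics.mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / (sum((a - mx) ** 2 for a in x) ** 0.5
           * sum((b - my) ** 2 for b in y) ** 0.5)
print(r)  # ~1.0: the strongest possible positive linear relationship
```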
• Exploratory Data Analysis
Exploratory data analysis includes a collection of
techniques designed to identify potential causal
factors in a data set; locate outliers for analysis or
removal; and produce other general intuitions
about the data. These techniques complement
statistical approaches to testing hypotheses and
providing quantitative summaries. Informatics
support for exploratory data analysis includes
Data Desk, SOCR, and JMP.
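One widely used exploratory technique for locating outliers is the interquartile-range (IQR) rule: flag points more than 1.5 IQRs outside the middle half of the data. The sketch below uses made-up data and is not tied to any of the tools named above.

```python
import statistics

# IQR rule for outlier detection on a toy data set.
data = [12, 13, 12, 14, 13, 12, 40, 13, 12, 14]

q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles (Python 3.8+)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print(outliers)  # [40]
```

Whether a flagged point is removed or analyzed further is a scientific judgment, not something the rule decides.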
• Data Annotation and Analysis: Summary
Data annotation assists primarily in information
retrieval, but it has potential for data and knowledge
integration.
However, we need rich informatics tools that use the well
established knowledge bases such as the Gene Ontology.
Software for data processing is becoming more common,
but different types of data have different needs.
General informatics tools that are readily specialized to
particular sciences could address this situation.
RECOMMENDED BOOKS
• 1. Baxevanis AD, “Current Protocols in
Bioinformatics, Volumes 1-3”, John Wiley & Sons,
2003.
• 2. Arthur M. Lesk, “Introduction to Bioinformatics”,
Oxford University Press.
• 3. Ignacimuthu SJ, “Basic Bioinformatics”, Narosa
Publishing House.
• 4. Yadav Neelam, “A Handbook of
Bioinformatics”, Anmol Publications Pvt. Ltd.
• 5. Krawetz, Stephen A., “Introduction to
Bioinformatics: A Theoretical and Practical
Approach”, Humana Press.
Thank You

ANY QUESTIONS?
