SHES2201 Lecture 3 - Data Mining in Bioinformatics

SHES2201
Lecture 3 – Data-mining in Bioinformatics
Associate Professor Khairuddin Itam

Room B20, Bioinformatics Division
khair@um.edu.my
03-79676738
Data Mining The Data
Inference from biological data
• Goal is to move from raw data to meaningful
conclusions.
• Examples: detecting remote homologues,
identifying coregulated genes, predicting binding
affinities
• Broadly applicable computational techniques:
clustering, discrimination/regression & density
estimation
Data Mining Techniques
• Common Techniques
– Classification and prediction
– Clustering
– Data summarization
– Dependency modeling
– Change and deviation detection
• Dependency modeling.
– The aim is to derive some causal structure within the data.
– One example is functional dependency between predicates.
Given: a sample (query) object and a database containing a set of
objects,
• Find the objects within the database that are within a
user-defined distance of the queried object
• Find all pairs within some distance of each other.
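The two queries above (a range query around a sample object, and an all-pairs search) can be sketched in a few lines. This is only an illustration: the function names, the toy feature vectors and the choice of Euclidean distance are assumptions, not part of the lecture.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def range_query(query, database, radius):
    """Return every object in the database within `radius` of `query`."""
    return [obj for obj in database if dist(query, obj) <= radius]


def close_pairs(database, radius):
    """Return all pairs of database objects within `radius` of each other."""
    return [(a, b) for a, b in combinations(database, 2) if dist(a, b) <= radius]


# Hypothetical objects described by two numeric features each.
database = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]

print(range_query((0.5, 0.0), database, radius=1.0))
print(close_pairs(database, radius=1.5))
```

Both functions scan the whole database, which is fine for small sets; large sequence or structure databases normally need an index or a heuristic search instead.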
Clustering.
– The aim is to partition or segment the set of data items into
smaller subsets.
– The elements of one subset are similar to each other (high
intra-group similarity) and significantly different from elements
in other subsets (low inter-group similarity).
– Also called unsupervised learning
Clustering
• Begin with a set of instances (e.g. gene
sequences, protein structures) and a distance
metric.
• Create a collection of groups of the instances
which are more similar to each other than they
are to instances in other groups. The groups can
themselves be clustered hierarchically.
• Examples:
– Building taxonomic trees from aligned sequences
– Identifying coregulated genes from expression arrays
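As a minimal sketch of the idea (instances plus a distance metric in, groups out), the following performs single-linkage agglomerative clustering. The toy sequences, the Hamming distance and the function names are assumptions chosen for illustration; real taxonomic or expression clustering would use dedicated tools.

```python
def hamming(a, b):
    """Number of positions at which two equal-length aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(instances, distance, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[item] for item in instances]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical aligned sequences.
sequences = ["ACGTAC", "ACGTAA", "TTGCAT", "TTGCAA", "ACGAAC"]
for group in single_linkage(sequences, hamming, n_clusters=2):
    print(group)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the hierarchy (tree) mentioned above.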
• Classification and prediction
– The aim is to predict the value of some database field based on
the values of other fields.
– The field to predict is sometimes called class.
– If the class takes discrete values, then it is a classification
problem.
– If the class takes continuous numerical values it is a regression
problem.
– Also called supervised learning
Discrimination/Regression
• Induce a predictor of some aspect (the label) of an
instance from other aspects. Numeric predictions
are regression; class predictions are discrimination.
• Begin with a training set of labeled instances.
• Produce a model which accurately predicts the labels
of other (unlabeled) instances.
• Examples:
– Protein secondary structure prediction
– Prediction of drug response from gene expression
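One of the simplest discriminators is a k-nearest-neighbour classifier: it keeps the labeled training set and predicts the label of a new instance by majority vote among the closest training instances. The sketch below assumes hypothetical two-gene expression profiles and made-up labels; it is not the method used in the lecture.

```python
from collections import Counter
from math import dist


def knn_predict(training, query, k=3):
    """Predict a label by majority vote among the k nearest training instances."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (expression of two genes, drug response label).
training = [
    ((1.0, 1.2), "responder"),
    ((0.9, 1.0), "responder"),
    ((1.1, 0.8), "responder"),
    ((3.0, 3.1), "non-responder"),
    ((3.2, 2.9), "non-responder"),
    ((2.8, 3.3), "non-responder"),
]

print(knn_predict(training, (1.0, 1.0)))
print(knn_predict(training, (3.0, 3.0)))
```

Replacing the vote with the average of the neighbours' numeric labels would turn the same scheme into a regression method.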
Discrimination & Regression
• Clusters of co-expressed genes are interesting,
but just a first step. We really want predictive
models, gene networks, etc.
• Biology tells us that the predictors are likely to
involve interactions more than linear effects.
• Traditional statistics is not strong in non-linear
models, high-order interactions or large
datasets.
• Data summarization.
– The aim is
• to discover patterns that describe subsets of the data
(attribute focusing), and
• to extract rules from the data telling us how a subset
of data influences the presence of another subset
– Association Rule Mining (ARM) is an
undirected (unsupervised) data mining technique.
– Usually produces clear and understandable results
Detect sets of attributes that frequently occur together, and
also the rules among them. Example: 60% of the
population with a credit card also has a charge card (40%
of shoppers have both)
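In association-rule terms, the 40% in this example is the support of the item set {credit card, charge card} and the 60% is the confidence of the rule "credit card → charge card". The sketch below computes both; the 15 shoppers are invented data, constructed only so that the slide's figures come out.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shoppers: 6 hold both cards, 4 credit only, 2 charge only, 3 neither.
shoppers = (
    [{"credit", "charge"}] * 6
    + [{"credit"}] * 4
    + [{"charge"}] * 2
    + [set()] * 3
)

print(f"support    = {support(shoppers, {'credit', 'charge'}):.0%}")
print(f"confidence = {confidence(shoppers, {'credit'}, {'charge'}):.0%}")
```

A rule miner reports only the rules whose support and confidence both exceed user-chosen thresholds.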
Density estimation
• Produces a method for assessing the probability
of an observation.
Like a histogram
• Uses a set of observations (and, optionally, a
distance function)
• Examples
– Recognition of members of protein families
– Evaluation of diversity of compound libraries
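Taking the histogram analogy literally, a density estimate can be built by counting observations per bin and normalising so that the estimate integrates to one. The function name, bin width and observations below are assumptions for illustration only.

```python
def histogram_density(observations, bin_width):
    """Return a function that estimates the probability density at x
    from the fraction of observations falling in the same bin."""
    n = len(observations)
    counts = {}
    for value in observations:
        bin_index = int(value // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1

    def density(x):
        return counts.get(int(x // bin_width), 0) / (n * bin_width)

    return density


# Hypothetical one-dimensional observations (e.g. a score per family member).
observations = [1.1, 1.4, 1.5, 1.9, 2.2, 2.4, 6.8, 7.1]
density = histogram_density(observations, bin_width=1.0)

print(density(1.3))  # a well-populated region
print(density(4.0))  # a region with no observations
```

A new observation that falls in a low-density region is unlikely to belong to the family the observations were drawn from, which is how density estimation supports family recognition.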
Particular applications
• Those broad computational approaches have
many particular instantiations and applications
• Two examples
– Hidden Markov models for multiple sequence
alignment and homologous family discrimination
– Analysis of gene expression array data
• Finding genes that vary significantly
• Estimating the number of clusters
• Finding high-order discriminators
Hidden Markov models
• Technique from speech understanding is now widely used
in sequence analysis
• Good software and tutorials on the web
http://www.cse.ucsc.edu/research/compbio/ismb99.tutorial.html
• HMMs infer unobserved states that influence the
probability distribution of observed states
• Most common use is to model sequence families.
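To make the idea of hidden states shaping the distribution of observed symbols concrete, here is a sketch of the forward algorithm, which computes the probability of an observed sequence by summing over all hidden state paths. The two states and all probabilities are invented for illustration; a real sequence-family model (a profile HMM) has match, insert and delete states with parameters estimated from an alignment.

```python
def forward(sequence, states, start, trans, emit):
    """Return P(sequence) under the HMM (sequence must be non-empty)."""
    # alpha[s] = probability of the symbols seen so far, ending in state s
    alpha = {s: start[s] * emit[s][sequence[0]] for s in states}
    for symbol in sequence[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state model: AT-rich versus GC-rich regions.
states = ("AT_rich", "GC_rich")
start = {"AT_rich": 0.5, "GC_rich": 0.5}
trans = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
emit = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

print(forward("GGCC", states, start, trans, emit))
print(forward("GATC", states, start, trans, emit))
```

Under this toy model the uniformly GC-rich sequence scores higher than the mixed one, which is the basis for using such scores to decide family membership.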
• Change and deviation detection.
– Data has a sequential structure, either temporal,
physical or other.
– The aim is to find patterns assuming an ordering of
the observations.
Find the record(s) that is (are) the most different from
the other records; i.e., find all outliers. These outliers
may be thrown away as noise or may be the
“interesting” ones.
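One simple way to find records that differ most from the rest is a robust score based on the median and the median absolute deviation (MAD). The sketch below uses the conventional constants 0.6745 and 3.5 for this "modified z-score"; the cutoff and the measurements are assumptions and should be tuned to the data.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure deviation against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical measurements with one aberrant record.
measurements = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
print(find_outliers(measurements))
```

Whether a flagged record is noise or the interesting result is a judgement that the code cannot make.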
Expression Array Analysis
• Gene expression arrays are a popular new
technique for assaying the expression levels of
tens of thousands of genes simultaneously
• Many problems arise in analyzing this data
• Collaborative groups are now developing tools
and procedures for such analysis
Expression Analysis Issues
• Identifying genes that changed significantly
over a set of observations. How much change
is enough? What's wrong with 2-fold?
• Estimating the number of expression clusters.
How many groups of genes are there?
• Finding discriminators based on expression
levels (e.g. for response to drugs)
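One possible answer to "What's wrong with 2-fold?" is that a fixed fold-change cutoff ignores how variable the replicates are. The sketch below compares the log2 fold change with Welch's t statistic for two invented genes: one noisy gene that doubles on average, and one steady gene that changes only 1.5-fold. All numbers and names are hypothetical.

```python
from math import log2, sqrt
from statistics import mean, variance


def log2_fold_change(control, treated):
    """log2 of the ratio of mean expression (1.0 means a 2-fold increase)."""
    return log2(mean(treated) / mean(control))


def welch_t(control, treated):
    """Welch's t statistic: difference in means divided by its standard error."""
    se = sqrt(variance(control) / len(control) + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se


# Hypothetical replicate measurements: (control, treated).
noisy_gene = ([100, 400, 50, 250], [900, 150, 100, 450])    # mean 200 -> 400
steady_gene = ([200, 205, 195, 200], [300, 295, 305, 300])  # mean 200 -> 300

for name, (control, treated) in [("noisy", noisy_gene), ("steady", steady_gene)]:
    print(name, round(log2_fold_change(control, treated), 2),
          round(welch_t(control, treated), 1))
```

The noisy gene passes a 2-fold cutoff with a t statistic of only about 1, while the steady gene fails the cutoff despite a far more reproducible change.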
• Other techniques
– Three-dimensional visualisation
– Decision trees
– Neural networks
– Genetic algorithms
– Hidden Markov models
– Time series
– Bayesian networks
– Soft computing : rough and fuzzy sets
– Graphical models
– Density estimation
Some Bioinformatics
Data Mining Perspectives
Taking advantage of public data
• An enormous amount of high-quality data is
available free.
How to find public data
• Start with http://www.ncbi.nlm.nih.gov
• Consult the Jan 2001 issue of Nucleic Acids
Research
http://nar.oupjournals.org/content/vol28/issue1
• Metadata sites (listings of databases)
– The NAR issue has an associated site
– http://www.genome.ad.jp/kegg/kegg4.html
– http://bioinformatics.weizmann.ac.il/mb/molecular_biol_databases.html
• Commercial portals, e.g. www.biolinks.com,
www.doubletwist.com
NCBI: Ground Zero
• The National Center for Biotechnology
Information is the first place to go. Sequences,
structures, PubMed, taxonomy, medical genetics,
etc.
• Spend some time learning all it has to offer.
There are good online tutorials at
http://www.ncbi.nlm.nih.gov/Education/
Look at the site map, not just the front page!
• Check out PROW (Protein Reviews on the Web), a
journal/reference source at NCBI.
An Abundance of Specialized Data
• Gene sequences and protein structures are not
all there is!
• Metabolic, regulatory and signaling pathway data
is growing rapidly
• Carbohydrates, drugs, lipids, diseases, organisms,
etc. all have their own public databases
Integrated data sources
• Like the data, the sheer volume of databases can
be overwhelming.
• Integrated systems offer organized summaries of
diverse datasets.
• An excellent starting place for information about
human genes is GeneCards:
http://bioinformatics.weizmann.ac.il/cards/
• And ENTREZ at NCBI.
• Biozon at Stanford
Definition and scope
• In the computational sense, bioinformatics is the
systematic development and application of computing
systems and computational solution techniques for
analyzing biological datasets obtained by experiments,
modeling, database search and instrumentation.
Computational Perspective
• A sampling of the spheres of research carried out by
biocomputing scientists from the computational
perspective is discussed next.
Neural networks
• Development and application of novel computational
techniques based on neural networks.
• First proposed by McCulloch and Pitts in 1943.
• Neural nets comprise a set of interconnected nodes,
modelled on natural nervous systems, with various
mechanisms of interconnection.
• Neural network architectures are usually designed to
complete specific tasks through some sort of learning
procedure or mechanism.
• Neural networks and genetic algorithms are used to
classify DNA sequences, predict protein structures
from sequence and optimise molecular
structures (Anonymous, 1996).
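The node proposed by McCulloch and Pitts can be sketched as a threshold unit: it fires when the weighted sum of its binary inputs reaches a threshold. The weights and thresholds below are hand-picked to reproduce logical AND and OR; in a trained network they would be set by the learning procedure instead.

```python
def threshold_unit(inputs, weights, threshold):
    """Return 1 (fire) if the weighted sum of the inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def logical_and(a, b):
    return threshold_unit((a, b), weights=(1, 1), threshold=2)


def logical_or(a, b):
    return threshold_unit((a, b), weights=(1, 1), threshold=1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```

Interconnecting many such units, and adjusting their weights from examples, gives the architectures described above.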
Evolutionary algorithms
• Based on observations of the biological process of natural
selection; they include genetic algorithms, evolutionary
strategies, evolutionary programming and genetic
programming.
• Applications developed from these algorithms include
routing and scheduling, timetabling, financial
models, data analysis and data mining (Langdon, 1995;
Abramson and Abela, 1991).
• Neural networks are also used in research to classify
nucleic acid sequences and sequence-based prediction
of protein structure (Notredame and Higgins, 1995),
while genetic algorithms are used in molecular structure
optimisation and protein and RNA folding (Ogata et al.,
1995, Shapiro, 1996).
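A genetic algorithm can be sketched as a loop of selection, crossover and mutation over a population of candidate solutions. The example below is a toy: candidates are bit strings and the fitness is simply the number of ones, standing in for a real objective such as a folding energy. The parameter values and function name are assumptions.

```python
import random


def genetic_algorithm(fitness, length, pop_size=20, generations=50,
                      mutation_rate=0.05, seed=0):
    """Evolve bit strings of the given length towards higher fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)           # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


best = genetic_algorithm(fitness=sum, length=16)
print(best, sum(best))
```

Because the fitter half survives each generation unchanged, the best fitness in the population never decreases from one generation to the next.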
Molecular computing
• The first DNA-based computer was developed by
Adleman (1994) to solve the Hamiltonian Path problem.
• The goal of the Hamiltonian Path problem is to find a
path from one city to another that passes through every
city exactly once.
• It took the DNA computer one week to process and
complete the operation for a seven-city problem (which
can be solved with pen and paper within an hour).
• As the number of cities increases to more than 70,
conventional (serial logic) computers (including
supercomputers) are unable to solve the problem
completely and efficiently.
• The DNA computer operates in a massively parallel
fashion and can, in principle, solve the complex problem
within the same period of time!
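For a handful of cities the problem can be solved on a conventional computer by trying every ordering of the intermediate cities, as in the sketch below. The city names and one-way roads are invented. The number of orderings grows factorially with the number of cities, which is why this brute-force approach breaks down for large instances.

```python
from itertools import permutations


def hamiltonian_path(cities, roads, start, end):
    """Return a path from start to end that visits every city exactly once,
    or None if no such path exists. `roads` is a set of one-way (from, to) pairs."""
    middle = [c for c in cities if c not in (start, end)]
    for order in permutations(middle):
        path = (start, *order, end)
        if all((a, b) in roads for a, b in zip(path, path[1:])):
            return list(path)
    return None


# Hypothetical four-city map with one-way roads.
cities = ["A", "B", "C", "D"]
roads = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "C"), ("B", "D")}

print(hamiltonian_path(cities, roads, "A", "D"))
```

A DNA computer replaces the sequential loop with a test tube in which strands encoding many candidate paths form at once, followed by laboratory steps that filter out the invalid ones.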
Molecular computing
Another example
• Bacteriorhodopsin (bR) from the bacterium
Halobacterium halobium is now being used by
scientists to produce bioelectronic switches a thousand
times smaller and faster than current semiconductor
technologies.
• Hong, Birge and others (in Vitaliano, 1996) are
researching electronic photo-active bR systems to
develop massively parallel and massively distributed
biocomputers.
Biological Perspective
• A sampling of research activities carried out by
biocomputing scientists from the biological
perspective is listed in the next paragraphs.
• Complete discussion of individual research is
beyond the scope of this paper.
Gene expression and genetic networks
• Large-scale gene expression identification and data
analysis using microarrays, Expressed Sequence Tags
(ESTs), SAGE, DNA chips, etc.
• Identification of coordinated gene expression and
regulatory sequences and their functional
characteristics.
• Expression profile or sequence motif identification and
classification using novel pattern recognition methods.
• Forward modeling of genetic networks based on
Boolean, continuous and stochastic nets
• Development of reverse-engineering algorithms to
extract information from noisy sequence data.
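Forward modeling with a Boolean net means choosing update rules for each gene and then simulating how the on/off states evolve. The three-gene network below is entirely hypothetical (C represses A, A activates B, A and B together activate C) and uses synchronous updates, which is only one of several possible update schemes.

```python
def step(state, rules):
    """Synchronously update every gene from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def simulate(state, rules, n_steps):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        trajectory.append(state)
    return trajectory


# Hypothetical regulatory rules.
rules = {
    "A": lambda s: not s["C"],         # C represses A
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # A and B jointly activate C
}

initial = {"A": True, "B": False, "C": False}
for state in simulate(initial, rules, n_steps=5):
    print(state)
```

With these rules the network returns to its initial state after five steps, so the trajectory is a cycle. Reverse engineering is the opposite task: inferring the rules from observed trajectories.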
Distributed and intelligent databases
• Developing a robust, high-speed network to cater for
the needs of the scientific community - Asia Pacific
Bioinformatics Network
• Developing integrated database search engines and
retrieval systems - BioXML Project, KRIS Program Suite.
• Developing an ontology (or middleware) to bridge
the different notions in various databases
Visualisation and interactive molecular modeling
tools
• The study of structure, energetics and dynamics of
proteins and their interaction with ligands.
• Using Virtual Reality Modeling Language (VRML) to
develop models of the substrate channels in
cytochrome P450 - German Cancer Research Center.
• Developing musical algorithms to provide a different
perspective into the structure of DNA - The Nucleic Acid
Database Project
• High throughput graphics library for molecular structure
viewer RASMOL - Electrotechnical Laboratory Japan.
Analysis, management and application of single
nucleotide polymorphisms (SNP) data
• Automation of large scale SNP genotyping
• Tools for high throughput SNP discovery and screening
• Visualization and analysis of SNP data
Computer-aided drug design
• Development of large and high throughput
combinatorial libraries - ECLiPS™ from Pharmacopeia
• Protein evolution and structural genomics
• Developing an information-theoretic DNA compression
scheme for new gene discovery and studying DNA
compression
• Developing software that supports semi-automated
annotation of uncharacterized sequence data - GAIA,
Univ. Pennsylvania
Natural language processing for biology
• GenEng: A dialogue-based natural language user
interface to GenBank - Center for High
Performance Computing, Univ. Texas
• A logic-based syntactic pattern recognition system for
DNA sequences - GenLang, Univ. Pennsylvania.
• Protein structure prediction in biology and medicine
• Probable "folding cluster" role of non-functional
conserved residues of proteins - Laboratory of
Experimental & Computational Biology, National Cancer
Institute, NIH
Application of information theory to biology
• Coincident Detection Method to detect functional and
immunological sites of the highly variable HIV V3 Loop -
Molecular Mining Corporation
• Building predictive prototypes of the immune system
function with information theory models
• Using Minimum Message Length from Information
Theory on Structural Building Blocks (SBBs) to identify
different distributions of rotamer classes amongst the
SBBs
Data mining and discovery in molecular databases
• Investigating the motif rules that predict T cell
activation, from peptide databases with high binding
affinity to the same MHC class I molecule - BONSAI,
Medical Institute of Bioregulation, Kyushu University.
Internet tools for computational biology
• Developing tools to access high volume, heterogeneous
and geographically dispersed biological databases in an
integrated manner - KRIS, NUS Bioinformatics Centre
• Understanding genomic and protein structures on the
WWW - CHIME & RASMOL
• Virtual Reality meeting place for biologists - BioMOO,
Weizmann Institute.
Educational topics in biocomputing
• M.Sc. in medical informatics with biomedical computing
skills - Biomedical Information and Communication
Center, Oregon Health Sciences University.
• Computational biology as an instructional tool for
graduates - Center of Bioengineering, University of
Washington
Some Advice
• A little bioinformatics is good for you!
– Know how to use web data resources
– Know the kinds of analyses that are possible
• Sequence and structure computations are
widespread and (fairly) easy. Finding exons, remote
homologies, structural domains, fold families, etc.
are routine.
• Generic clustering, discrimination/regression and
density estimation tools exist (neural networks...)
• Collaboration with bioinformaticians is no
worse than with statisticians.... :-)