Data mining contributes to biological data analysis in the following ways:
Semantic integration of heterogeneous, distributed genomic and proteomic
databases
The semantic integration of such data is essential to the cross-site analysis of
biological data.
Data cleaning, data integration, reference reconciliation, classification, and
clustering methods will facilitate the integration of biological data and the
construction of data warehouses for biological data analysis.
Alignment, indexing, similarity search, and comparative analysis of multiple
nucleotide sequences
BLAST and FASTA, in particular, are tools for the systematic analysis of
genomic and proteomic data.
Biological sequence analysis methods differ from many sequential pattern
analysis algorithms proposed in data mining research.
Searches over sequence data must deal with insertions, deletions, and
mutations.
Sophisticated statistical analysis and dynamic programming methods often play
a key role in the development of alignment algorithms.
Compare the frequently occurring patterns of each class (e.g., diseased and
healthy).
Identify gene sequence patterns that play roles in various diseases.
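The role of dynamic programming in alignment, mentioned above, can be illustrated with a small sketch. It computes a global alignment score in the style of the Needleman-Wunsch algorithm. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions chosen for illustration; tools such as BLAST and FASTA use far more elaborate scoring and heuristics.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of sequences a and b.

    Gaps model insertions and deletions; mismatches model mutations.
    The default scores are illustrative assumptions, not standard values.
    """
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap (deletion)
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap (insertion)
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


identical = global_alignment_score("GATTACA", "GATTACA")
one_deletion = global_alignment_score("ACGT", "AGT")
```

Here the second call aligns ACGT with A-GT: three matches and one gap. Only the previous row of the table is kept, which is enough when the score alone is needed and not the alignment itself.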
Discovery of structural patterns and analysis of genetic networks and protein
pathways
In biology, protein sequences are folded into three-dimensional structures, and
such structures interact with each other based on their relative positions and the
distances between them. Such complex interactions form the basis of
sophisticated genetic networks and protein pathways.
Association and path analysis
Association analysis: identification of co-occurring gene sequences
Most diseases are not triggered by a single gene but by a combination of
genes acting together
Association analysis may help determine the kinds of genes that are
likely to co-occur together in target samples
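As a rough illustration of association analysis on co-occurring genes, the sketch below counts how often each pair of genes appears together across samples and keeps the pairs whose support reaches a threshold. The gene names, the sample data, and the function name are hypothetical; a real study would use a full frequent-itemset algorithm and measured expression data.

```python
from collections import Counter
from itertools import combinations


def frequent_gene_pairs(samples, min_support):
    """Return {gene pair: support} for pairs present in at least
    min_support (a fraction between 0 and 1) of the samples."""
    pair_counts = Counter()
    for genes in samples:
        # Sort so that each pair is always counted under one key.
        for pair in combinations(sorted(set(genes)), 2):
            pair_counts[pair] += 1
    n = len(samples)
    return {pair: count / n
            for pair, count in pair_counts.items()
            if count / n >= min_support}


# Hypothetical target samples, each listing the genes observed in it.
samples = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneB", "geneC"},
    {"geneA", "geneB", "geneD"},
]
pairs = frequent_gene_pairs(samples, min_support=0.5)
```

With these samples, geneA and geneB co-occur in three of four samples and geneB and geneC in two of four, so only those two pairs pass the 0.5 threshold.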
High-level graphical user interfaces and visualization tools are required for
scientific data mining systems.
These should be integrated with existing domain-specific information
systems and database systems to guide researchers and general users in
searching for patterns, interpreting and visualizing discovered patterns, and
using discovered knowledge in their decision making.
Data Types:
The data mining system may handle formatted text, record-based data, and
relational data.
The data could also be in ASCII text, relational database data or data warehouse
data. Therefore, we should check what exact format the data mining system can
handle.
System issues:
The data mining system should be compatible with one or more operating
systems.
The most popular operating systems that host data mining software are
UNIX/Linux and Microsoft Windows. There are also data mining systems that
run on Macintosh, OS/2, and others.
Large industry-oriented data mining systems often adopt a client/server
architecture.
A recent trend has data mining systems providing Web-based interfaces and
allowing XML data as input and/or output.
Data Sources:
Data sources refer to the data formats on which a data mining system will operate.
Some data mining systems may work only on ASCII text files, while others work on
multiple relational sources.
A data mining system should also support ODBC connections or OLE DB for
ODBC connections.
Coupling data mining with databases or data warehouse systems: Data mining
systems need to be coupled with a database or a data warehouse system. The coupled
components are integrated into a uniform information processing environment. The
types of coupling are listed below −
No coupling
Loose Coupling
Semi tight Coupling
Tight Coupling
SGI MineSet:
MineSet, available from Purple Insight, was introduced by SGI in 1999.
It provides multiple data mining functions, including association mining and
classification, as well as advanced statistics and visualization tools.
A distinguishing feature of MineSet is its set of robust graphics tools (Advanced
Visualization Tools).
Clementine:
Clementine, from SPSS, provides an integrated data mining development
environment for end users and developers.
A distinguishing feature of Clementine is its object-oriented, extended module
interface, which allows users’ algorithms and utilities to be added to Clementine’s
visual programming environment.
Enterprise Miner:
Enterprise Miner was developed by SAS Institute, Inc.
A distinguishing feature of Enterprise Miner is its variety of statistical analysis
tools, which build on the long history of SAS in the statistical analysis market.
Insightful Miner:
Insightful Miner, from Insightful Inc., provides several data mining functions,
including data cleaning, classification, prediction, clustering, and statistical
analysis packages, along with visualization tools.
A distinguishing feature is its visual interface, which allows users to wire
components together to create self-documenting programs.
CART:
CART, available from Salford Systems, is the commercial version of the
CART (Classification and Regression Trees) system.
It creates decision trees for classification and regression trees for prediction.
CART employs boosting to improve accuracy.
See5 and C5.0:
See5 and C5.0, available from RuleQuest, are commercial versions of the C4.5
decision tree and rule generation method.
See5 is the Windows version of C4.5, while C5.0 is its UNIX counterpart.
Weka:
Weka, developed at the University of Waikato in New Zealand, is open-source
data mining software in Java.
It contains a collection of algorithms for data mining tasks, including data
preprocessing, association mining, classification, regression, clustering, and
visualization.
Data Reduction: The basic idea of this theory is to reduce the data representation,
trading accuracy for speed in response to the need to obtain quick approximate answers to
queries on very large databases. Some of the data reduction techniques are as follows −
Singular value Decomposition
Wavelets
Regression
Log-linear models
Histograms
Clustering
Sampling
Construction of Index Trees
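To make the accuracy-for-speed trade concrete, here is a minimal sketch of one technique from the list above, histograms. It replaces the raw values with equal-width bucket counts and then answers a "how many values are below x" query from the counts alone. The function names and the equal-width bucketing scheme are assumptions for illustration; real systems often use other bucketing strategies.

```python
def build_histogram(values, num_buckets):
    """Summarize values as (lowest value, bucket width, counts per bucket)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # avoid zero width if all values are equal
    counts = [0] * num_buckets
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts


def estimate_count_below(hist, threshold):
    """Approximate the number of values below threshold, assuming values
    are spread uniformly inside each bucket."""
    lo, width, counts = hist
    total = 0.0
    for i, count in enumerate(counts):
        left = lo + i * width
        right = left + width
        if threshold >= right:
            total += count  # bucket lies entirely below the threshold
        elif threshold > left:
            total += count * (threshold - left) / width  # partial bucket
    return total


hist = build_histogram(list(range(100)), num_buckets=4)
approx = estimate_count_below(hist, 49.5)
```

The query touches four counts instead of one hundred values. The answer is exact here only because the data are uniform; on skewed data the estimate would be approximate.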
Data Compression: The basic idea of this theory is to compress the given data by
encoding it in terms of the following −
Bits
Association Rules
Decision Trees
Clusters
Pattern Discovery: The basic idea of this theory is to discover patterns occurring in a
database. Following are the areas that contribute to this theory −
Machine Learning
Neural Network
Association Mining
Sequential Pattern Matching
Clustering
Probability Theory: This theory is based on statistical theory. The basic idea behind this
theory is to discover joint probability distributions of random variables.
Microeconomic View: According to this theory, data mining finds the patterns that are
interesting only to the extent that they can be used in the decision-making process of
some enterprise.
Inductive Databases: As per this theory, a database schema consists of data and
patterns that are stored in a database. Therefore, data mining is the task of performing
induction on databases.
3.2 Statistical Data Mining
Apart from the database-oriented techniques, there are statistical techniques available
for data analysis. These techniques can be applied to scientific data and data from
economic and social sciences as well.
Some of the statistical data mining techniques are as follows −
Regression − Regression methods are used to predict the value of the response variable
from one or more predictor variables where the variables are numeric. Listed below are
the forms of Regression −
Linear
Multiple
Weighted
Polynomial
Nonparametric
Robust
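The simplest of these forms, linear regression with one numeric predictor, can be sketched with the ordinary least-squares formulas. The function name and the sample data are hypothetical; the other forms listed above need different estimators.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical predictor and response values lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept  # predict the response for x = 5
```

The fit assumes the predictor values are not all identical, since otherwise the denominator `sxx` is zero.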
Generalized Linear Models − Generalized linear models include −
Logistic Regression
Poisson Regression
Time Series Analysis − Methods for analyzing time-series data include −
Auto-regression methods.
Univariate ARIMA (AutoRegressive Integrated Moving Average) Modeling.
Long-memory time-series modeling.
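As a small illustration of the auto-regression idea in the list above, the sketch below estimates the coefficient of a first-order autoregressive model, x[t] = phi * x[t-1] + noise, by least squares. It assumes a zero-mean series with no intercept term, and the function name and data are hypothetical; ARIMA and long-memory models need considerably more machinery.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise.

    Assumes the series has zero mean (no intercept term is fitted).
    """
    # Pair each value with its successor: (x[t-1], x[t]).
    num = sum(prev * curr for prev, curr in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den


# Hypothetical series in which each value is half the previous one.
series = [8.0, 4.0, 2.0, 1.0, 0.5]
phi = fit_ar1(series)
forecast = phi * series[-1]  # one-step-ahead prediction
```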
Data Mining Process Visualization: Data mining process visualization presents the
individual steps of the data mining process. It lets users see how the data are extracted
and from which database or data warehouse they are cleaned, integrated, preprocessed,
and mined.
3.4 Audio Data Mining
Audio data mining makes use of audio signals to indicate the patterns of data or the
features of data mining results.
By transforming patterns into sound and music, we can listen to pitches and tunes,
instead of watching pictures, in order to identify anything interesting.
3.5 Data Mining and Collaborative Filtering
The Collaborative Filtering Approach is generally used for recommending
products to customers. These recommendations are based on the opinions of other
customers.
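A minimal sketch of this idea is user-based collaborative filtering: a customer's rating for an unseen product is predicted as a similarity-weighted average of other customers' ratings. Cosine similarity over co-rated items is one common choice and is an assumption here, as are the customer names, products, and ratings.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for item.

    Returns None when no other user has rated the item.
    """
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None


# Hypothetical customers and product ratings on a 1-5 scale.
ratings = {
    "ann": {"book": 5, "lamp": 1},
    "bob": {"book": 5, "lamp": 1, "pen": 4},
    "cat": {"book": 1, "lamp": 5, "pen": 2},
}
prediction = predict_rating(ratings, "ann", "pen")
```

Because ann's ratings match bob's much more closely than cat's, the prediction for the pen lands nearer to bob's rating of 4 than to cat's rating of 2.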
Openness:
Individuals have the right to know the nature of the data collected about them, the
identity of the data controller (responsible for ensuring the principles), and how the data
are being used.
Security Safeguards:
Personal data should be protected by reasonable security safeguards against such
risks as loss or unauthorized access, destruction, use, modification, or disclosure of
data.
Privacy-preserving data mining is a new area of data mining research that is emerging in
response to the need for privacy protection during mining.
There are two common approaches: secure multiparty computation and data
obscuration.
In secure multiparty computation, data values are encoded using simulation and
cryptographic techniques so that no party can learn another’s data values. This
approach can be impractical when mining large databases.
In data obscuration, the actual data are distorted by aggregation (such as using the
average income for a neighborhood, rather than the actual income of residents) or by
adding random noise.
In this way, we may continue to reap the benefits of data mining in terms of time and money
savings and the discovery of new knowledge.
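The two data obscuration techniques described above, aggregation and random noise, can be sketched as follows. The function names, the neighborhood labels, the income figures, and the choice of Gaussian noise are all assumptions for illustration; this is not a vetted privacy mechanism.

```python
import random
from collections import defaultdict


def aggregate_by_group(records):
    """Replace each individual value with the average of its group.

    records is a list of (group, value) pairs, e.g. (neighborhood, income).
    """
    grouped = defaultdict(list)
    for group, value in records:
        grouped[group].append(value)
    means = {g: sum(vals) / len(vals) for g, vals in grouped.items()}
    return [(group, means[group]) for group, _ in records]


def add_noise(values, scale, seed=None):
    """Distort each value by adding zero-mean Gaussian noise."""
    rng = random.Random(seed)  # a seed makes the distortion repeatable
    return [v + rng.gauss(0, scale) for v in values]


# Hypothetical incomes per neighborhood.
records = [("north", 40000), ("north", 60000), ("south", 30000)]
obscured = aggregate_by_group(records)
noisy = add_noise([40000, 60000, 30000], scale=1000, seed=7)
```

Aggregation hides individual values behind the group average, while noise keeps one value per record but makes each one inexact. Both trade some accuracy of the mining result for privacy.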
Data mining concepts are still evolving. The latest trends in the field are described
below.
Application exploration:
The exploration of data mining for businesses continues to expand as e-commerce and
e-marketing have become mainstream elements of the retail industry.
Data mining is increasingly used for the exploration of applications in other areas, such
as financial analysis, telecommunications, biomedicine, intrusion detection, mobile
(wireless) data mining and science.
Web mining:
Web content mining, web log mining, and other mining services on the internet
have secured a place among the flourishing subfields of data mining.