Data Mining Notes
Classification
♦ Classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute, and uses it to classify new data.
♦ Finds a model for the class attribute as a function of the values of the other
attributes. A test set is used to determine the accuracy of the model. Usually, the
given data set is divided into training and test sets, with the training set used to
build the model and the test set used to validate it.
➢ The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and is therefore appropriate for exploratory
knowledge discovery. Decision trees can handle high-dimensional data.
➢ The learning and classification steps of decision tree induction are simple and
fast. In general, decision tree classifiers have good accuracy.
➢ Decision tree induction algorithms have been used for classification in many
application areas, such as medicine, manufacturing and production, financial
analysis, astronomy, and molecular biology.
Algorithm:
Generate decision tree. Generate a decision tree from the training tuples of data
partition, D.
Input:
-Data partition, D, which is a set of training tuples and their associated class labels;
-Attribute list, the set of candidate attributes;
-Attribute selection method, a procedure to determine the splitting criterion that
“best” partitions the data tuples into individual classes. This criterion consists of a
splitting attribute and, possibly, either a split-point or splitting subset.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute selection method(D, attribute list) to find the “best” splitting
criterion;
(7) label node N with splitting criterion;
(8) if splitting attribute is discrete-valued and multiway splits allowed then
(9) attribute list ← attribute list − splitting attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion // partition the tuples and grow subtrees
for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate decision tree(Dj, attribute list) to node
N;
endfor
(15) return N;
♦ The parameter attribute list is a list of attributes describing the tuples. Attribute
selection method specifies a heuristic procedure for selecting the attribute
that “best” discriminates the given tuples according to class.
♦ The tree starts as a single node, N, representing the training tuples in D.
♦ If the tuples in D are all of the same class, then node N becomes a leaf and is
labeled with that class.
♦ All of the terminating conditions are explained at the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion. The splitting criterion tells us which attribute to test at node N by
determining the “best” way to separate or partition the tuples in D into individual
classes.
♦ There are three possible scenarios. Let A be the splitting attribute. A has v
distinct values, {a1, a2, …, av}, based on the training data.
1. A is discrete-valued:
In this case, the outcomes of the test at node N correspond directly to the known
values of A. A branch is created for each known value, aj, of A and labeled with
that value. A need not be considered in any future partitioning of the tuples.
2. A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A ≤ split point and A > split point,
respectively, where split point is the split-point returned by Attribute selection
method as part of the splitting criterion.
3. A is discrete-valued and a binary tree must be produced: In this case, the test at
node N is of the form “A ∈ SA?”, where SA is the splitting subset for A, returned
by Attribute selection method as part of the splitting criterion.
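For a concrete illustration, the following is a minimal sketch using scikit-learn's DecisionTreeClassifier (a CART-style implementation, not the exact pseudocode above); the toy tuples, attribute codes, and feature names are illustrative assumptions:

```python
# Minimal sketch: decision tree induction with scikit-learn (CART-style),
# not the exact textbook algorithm above; data and names are illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set: [age_code, income_code] with class label buys_computer
X_train = [[0, 2], [0, 1], [1, 2], [2, 1], [2, 0], [1, 0]]
y_train = ["no", "no", "yes", "yes", "yes", "no"]

# The entropy criterion corresponds to information-gain-style attribute selection
model = DecisionTreeClassifier(criterion="entropy")
model.fit(X_train, y_train)

# Classify a new tuple and inspect the induced tree
print(model.predict([[1, 1]]))
print(export_text(model, feature_names=["age", "income"]))
```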
Bayesian Classification
Bayesian classifiers are statistical classifiers.
♦ They can predict class membership probabilities, such as the probability that a
given tuple belongs to a particular class.
♦ Bayesian classification is based on Bayes’ theorem.
Bayes’ Theorem:
Let X be a data tuple. In Bayesian terms, X is considered “evidence,” and it is
described by measurements made on a set of n attributes.
-Let H be some hypothesis, such as that the data tuple X belongs to a specified class
C.
-For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the “evidence” or observed data tuple X. P(H|X) is the
posterior probability, or a posteriori probability, of H conditioned on X. Bayes’
theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X)
1.Let D be a training set of tuples and their associated class labels. As usual, each
tuple is represented by an n-dimensional attribute vector, X = (x1, x2, …,xn),
depicting n measurements made on the tuple from n attributes, respectively, A1, A2,
…, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier
will predict that X belongs to the class having the highest posterior probability,
conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs
to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called
the maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the
class prior probabilities are not known, then it is commonly assumed that the classes
are equally likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore
maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. To reduce computation in evaluating P(X|Ci), the naïve assumption of
class-conditional independence is made. This presumes that the attributes’ values
are conditionally independent of one another, given the class label of the tuple, so
that
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the
training tuples. For each attribute, we look at whether the attribute is categorical or
continuous-valued. For instance, to compute P(X|Ci), we consider the following:
➢ If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having
the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
➢ If Ak is continuous-valued, then we need to do a bit more work, but the
calculation is pretty straightforward. A continuous-valued attribute is typically
assumed to have a Gaussian distribution with a mean μ and standard deviation σ,
defined by
g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))
so that P(xk|Ci) = g(xk, μCi, σCi).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class
Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.
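As an illustration, the sketch below uses scikit-learn's GaussianNB, which estimates P(xk|Ci) with the Gaussian formula above and predicts the class maximizing P(X|Ci)P(Ci); the toy data are invented for illustration:

```python
# Minimal sketch: Gaussian naive Bayes, estimating P(xk|Ci) with the
# Gaussian formula above; the toy tuples [age, income] are illustrative.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[25, 40000], [30, 60000], [45, 80000], [35, 30000]])
y = np.array(["no", "yes", "yes", "no"])

clf = GaussianNB()
clf.fit(X, y)

X_new = np.array([[32, 55000]])
print(clf.predict(X_new))        # class Ci maximizing P(X|Ci)P(Ci)
print(clf.predict_proba(X_new))  # posterior probabilities P(Ci|X)
```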
Classification by Backpropagation
♦ A neural network is a set of connected input/output units in which each connection
has a weight associated with it. During the learning phase, the network learns by
adjusting the weights so as to be able to predict the correct class label of the input
tuples.
♦ Neural networks involve long training times and are therefore more suitable for
applications where this is feasible.
♦ Backpropagation learns by iteratively processing a data set of training tuples,
comparing the network’s prediction for each tuple with the actual known target
value.
♦ The target value may be the known class label of the training tuple (for
classification problems) or a continuous value (for prediction).
♦ For each training tuple, the weights are modified so as to minimize the mean
squared error between the network’s prediction and the actual target value. These
modifications are made in the “backwards” direction, that is, from the output layer,
through each hidden layer, down to the first hidden layer; hence the name
backpropagation.
Advantages:
♦ They include a high tolerance of noisy data as well as the ability to classify
patterns on which the network has not been trained.
♦ They can be used when you may have little knowledge of the relationships
between attributes and classes.
♦ They are well-suited for continuous-valued inputs and outputs, unlike most
decision tree algorithms.
Process:
♦ Initialize the weights:
♦ The weights in the network are initialized to small random numbers ranging from
-1.0 to 1.0, or -0.5 to 0.5. Each unit has a bias associated with it. The biases are
similarly initialized to small random numbers.
♦ Each training tuple, X, is processed by the following steps.
First, the training tuple is fed to the input layer of the network. The inputs pass
through the input units, unchanged. That is, for an input unit j, its output, Oj, is
equal to its input value, Ij. Next, the net input and output of each unit in the hidden
and output layers are computed. The net input to a unit in the hidden or output layers
is computed as a linear combination of its inputs. Each such unit has a number of
inputs to it that are, in fact, the outputs of the units connected to it in the previous
layer. Each connection has a weight. To compute the net input to the unit, each
input connected to the unit is multiplied by its corresponding weight, and this is
summed.
That is, the net input, Ij, to unit j is
Ij = Σi wij Oi + θj
where wij is the weight of the connection from unit i in the previous layer to unit j;
Oi is the output of unit i from the previous layer; and θj is the bias of the unit, which
acts as a threshold in that it serves to vary the activity of the unit. Each unit in the
hidden and output layers takes its net input and then applies an activation function
to it; with the logistic (sigmoid) function, the output of unit j is
Oj = 1 / (1 + e^(−Ij))
The error is propagated backward by updating the weights and biases to reflect the
error of the network’s prediction. For a unit j in the output layer, the error Errj is
computed by
Errj = Oj (1 − Oj) (Tj − Oj)
where Oj is the actual output of unit j, and Tj is the known target value of the given
training tuple. The error of a hidden layer unit j is
Errj = Oj (1 − Oj) Σk Errk wjk
where wjk is the weight of the connection from unit j to a unit k in the next higher
layer, and Errk is the error of unit k. Weights are updated by the following equations,
where Δwij is the change in weight wij and l is the learning rate:
Δwij = (l) Errj Oi
wij = wij + Δwij
Similarly, biases are updated by
Δθj = (l) Errj
θj = θj + Δθj
Algorithm:
Backpropagation. Neural network learning for classification or numeric prediction,
using the backpropagation algorithm.
Input:
D, a data set consisting of the training tuples and their associated target values;
l, the learning rate;
network, a multilayer feed-forward network.
Output:
A trained neural network.
Method:
(1) Initialize all weights and biases in network;
(2) while terminating condition is not satisfied {
(3) for each training tuple X in D {
(4) // Propagate the inputs forward:
(5) for each input layer unit j {
(6) Oj = Ij; } // output of an input unit is its actual input value
(7) for each hidden or output layer unit j {
(8) Ij = Σi wij Oi + θj; // compute the net input of unit j with respect to the previous
layer, i
(9) Oj = 1 / (1 + e^(−Ij)); } // compute the output of each unit j
(10) // Backpropagate the errors:
(11) for each unit j in the output layer
(12) Errj = Oj (1 − Oj) (Tj − Oj); // compute the error
(13) for each unit j in the hidden layers, from the last to the first hidden layer
(14) Errj = Oj (1 − Oj) Σk Errk wjk; // compute the error with respect to the next
higher layer, k
(15) for each weight wij in network {
(16) Δwij = (l) Errj Oi; // weight increment
(17) wij = wij + Δwij; } // weight update
(18) for each bias θj in network {
(19) Δθj = (l) Errj; // bias increment
(20) θj = θj + Δθj; } // bias update
(21) } }
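For concreteness, here is a minimal NumPy sketch of one forward-and-backward pass for a single-hidden-layer network, mirroring steps (4) through (20); the layer sizes, training tuple, and learning rate l are illustrative assumptions:

```python
# Minimal sketch of one backpropagation pass for a single-hidden-layer
# network, mirroring steps (4)-(20); sizes, data, and rate l are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, l = 3, 2, 1, 0.9

# Step (1): initialize weights and biases to small random numbers
W1 = rng.uniform(-0.5, 0.5, (n_in, n_hid)); b1 = rng.uniform(-0.5, 0.5, n_hid)
W2 = rng.uniform(-0.5, 0.5, (n_hid, n_out)); b2 = rng.uniform(-0.5, 0.5, n_out)

sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))

x = np.array([1.0, 0.0, 1.0])   # one training tuple X
t = np.array([1.0])             # known target value Tj

# Propagate the inputs forward (steps 4-9)
O1 = sigmoid(x @ W1 + b1)       # hidden layer outputs Oj
O2 = sigmoid(O1 @ W2 + b2)      # output layer outputs Oj

# Backpropagate the errors (steps 10-14)
err2 = O2 * (1 - O2) * (t - O2)        # Errj for output units
err1 = O1 * (1 - O1) * (err2 @ W2.T)   # Errj for hidden units

# Update weights and biases (steps 15-20)
W2 += l * np.outer(O1, err2); b2 += l * err2
W1 += l * np.outer(x, err1);  b1 += l * err1
```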
Objectives:
♦ The naïve Bayesian classifier assumes that attributes are conditionally
independent given the class.
Advanced Classification Methods
Introduction
k-Nearest-Neighbor Classifiers
Nearest neighbor classifiers compare a given unknown sample with training
samples that are similar to it. Closeness is typically defined in terms of Euclidean
distance, where the distance between two points X = (x1, x2, …, xn) and
Y = (y1, y2, …, yn) is
D(X, Y) = √( Σi (xi − yi)² )
The unknown sample is assigned the most common class among its k nearest
neighbors when k=1, the unknown sample is assigned the class of the training
sample that is closest to it in pattern space. Nearest neighbor classifiers are
instance-based or lazy learners in that they store all of the training samples and do
not build a classifier until a new (unlabeled) sample needs to be classified. This
contrasts with eager learning methods, such as decision tree induction and
backpropagation, which construct a generalization model before receiving new
samples to classify. Lazy learners can incur expensive computational costs when
the number of potential neighbors (i.e., stored training samples) with which to
compare a given unlabeled sample is great. Therefore, they require efficient
indexing techniques. As expected, lazy learning methods are faster at training than
eager methods, but slower at classification, since all computation is delayed to that
time. Unlike decision tree induction and backpropagation,
nearest neighbor classifiers assign equal weight to each attribute. This may cause
confusion when there are many irrelevant attributes in the data.
Nearest neighbor classifiers can also be used for prediction, that is, to return real-
valued prediction for a given unknown sample. In this case, the classifier returns
the average value of the real-valued labels associated with the k nearest neighbors
of the unknown sample.
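A minimal sketch of the k-nearest-neighbor rule described above, assuming Euclidean distance and majority voting; the training samples and k are illustrative:

```python
# Minimal sketch of k-nearest-neighbor classification with Euclidean
# distance and majority voting; data and k are illustrative. (For prediction,
# return the mean of the neighbors' real-valued labels instead of a vote.)
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every stored training sample
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # indices of the k nearest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]      # most common class among them

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [5.8, 6.2]])
y_train = ["a", "a", "b", "b"]
print(knn_classify(X_train, y_train, np.array([5.5, 5.9])))  # -> "b"
```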
Case-Based Reasoning
When given a new case to classify, a case-based reasoner will first check if an
identical training case exists. If one is found, then the accompanying solution to
that case is returned. If no identical case is found, then the case-based reasoner will
search for training cases having components that are similar to those of the new
case. Conceptually, these training cases may be considered as neighbors of the new
case. If cases are represented as graphs, this involves searching for subgraphs that
are similar to subgraphs within the new case. The case-based reasoner tries to
combine the solutions of the neighboring training cases in order to propose a
solution for the new case. If incompatibilities arise with the individual solutions,
then backtracking to search for other solutions may be necessary. The case-based
reasoner may employ background knowledge and problem-solving strategies in
order to propose a feasible combined solution.
Genetic Algorithms
Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate
the fitness of other algorithms.
Rough Set Approach
Rough set theory can be used for classification to discover structural relationships
within imprecise or noisy data. It applies to discrete-valued attributes. Continuous-
valued attributes must therefore be discretized prior to its use.
Rough set theory is based on the establishment of equivalence classes within the
given training data. All of the data samples forming an equivalence class are
indiscernible, that is, the samples are identical with respect to the attributes
describing the data. Given real-world data, it is common that some classes cannot
be distinguished in terms of the available attributes. Rough sets can be used to
approximately or “roughly” define such classes. A rough set definition for a given
class C is approximated by two sets: a lower approximation of C and an upper
approximation of C. The lower approximation of C consists of all of the data
samples that, based on the knowledge of the attributes, are certain to belong to C
without ambiguity. The upper approximation of C consists of all of the samples
that, based on the knowledge of the attributes, cannot be described as not belonging
to C. Decision rules can be generated for each class. Typically, a decision table is
used to represent the rules.
Rough sets can also be used for feature reduction (where attributes that do not
contribute towards the classification of the given training data can be identified and
removed) and relevance analysis (where the contribution or significance of each
attribute is assessed with respect to the classification task). The problem of finding
the minimal subsets (reducts) of attributes that can describe all of the concepts in
the given data set is NP-hard. However, algorithms to reduce the computation
intensity have been proposed. In one method, for example, a discernibility matrix is
used that stores the differences between attribute values for each pair of data
samples. Rather than searching on the entire training set, the matrix is instead
searched to detect redundant attributes.
Fuzzy Set Approaches
Rule-based systems for classification have the disadvantage that they involve sharp
cutoffs for continuous attributes. For example, consider the following rule for
customer credit application approval. The rule essentially says that applications for
customers who have had a job for two or more years and who have a high income
(i.e., of at least $50K) are approved:
IF (years employed ≥ 2) AND (income ≥ 50K) THEN credit = approved.
A customer who has had a job for at least two years will receive credit if her
income is, say, $50K, but not if it is $49K. Such harsh thresholding may seem
unfair. Instead, fuzzy logic can be introduced into the system to allow “fuzzy”
thresholds or boundaries to be defined. Rather than having a precise cutoff between
categories or sets, fuzzy logic uses truth-values between 0.0 and 1.0 to represent
the degree of membership that a certain value has in a given category. Hence, with
fuzzy logic, we can capture the notion that an income of $49K is, to some degree,
high, although not as high as an income of $50K.
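As a sketch of how such a fuzzy threshold might be defined, the piecewise-linear membership function below treats incomes up to $40K as not high and at or above $50K as fully high; the ramp endpoints are illustrative assumptions, not values from the text:

```python
# Minimal sketch: a piecewise-linear fuzzy membership function for
# "high income"; the $40K-$50K ramp is an illustrative assumption.
def membership_high_income(income_k):
    """Degree in [0, 1] to which an income (in $K) is 'high'."""
    if income_k <= 40:
        return 0.0
    if income_k >= 50:
        return 1.0
    return (income_k - 40) / 10.0  # linear ramp between the two anchors

for income in (39, 49, 50):
    print(income, membership_high_income(income))
# 49K is high to degree 0.9 -- no harsh cutoff at the 50K threshold
```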
Fuzzy logic is useful for data mining systems performing classification. It provides
the advantage of working at a high level of abstraction. In general, the use of fuzzy
logic in rule-based systems involves the following:
• Attribute values are converted to fuzzy values, and the fuzzy membership or truth
values are calculated. Fuzzy logic systems typically provide graphical tools to
assist users in this step.
• For a given new sample, more than one fuzzy rule may apply. Each applicable
rule contributes a vote for membership in the categories. Typically, the truth-values
for each predicted category are summed.
• The sums obtained above are combined into a value that is returned by the
system. This may be done by weighting each category by its truth sum and
multiplying by the mean truth-value of each category. The calculations involved
may be more complex, depending on the complexity of the fuzzy membership
graphs.
Fuzzy logic systems have been used in numerous areas for classification, including
health care and finance.
Prediction
In linear regression, data are modeled using a straight line. Linear regression is the
simplest form of regression. Bivariate linear regression models a random variable,
Y (called a response variable), as a linear function of another random variable, X
(called a predictor variable), that is,
Y = α + β X
where α and β are regression coefficients specifying the Y-intercept and slope of
the line, respectively. These coefficients can be solved for by the method of least
squares.
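A minimal sketch of solving for α and β by the method of least squares; the (x, y) pairs are invented for illustration:

```python
# Minimal sketch: estimating the coefficients of Y = a + b*X by the
# method of least squares; the (x, y) pairs are illustrative.
import numpy as np

x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])

# b = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2); a = y_mean - b*x_mean
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

print(f"Y = {a:.2f} + {b:.2f} X")  # fitted regression line
print(a + b * 10.0)                # predicted response for X = 10
```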
Nonlinear Regression
“How can we model data that does not show a linear dependence? For example,
what if a given response variable and predictor variables have a relationship that
may be modeled by a polynomial function?” Polynomial regression can be
modeled by adding polynomial terms to the basic linear model. By applying
transformations to the variables, we can convert the nonlinear model into a linear
one that can then be solved by the method of least squares.
Some models are intractably nonlinear (such as the sum of exponential terms, for
example) and cannot be converted to a linear model. For such cases, it may be
possible to obtain least square estimates through extensive calculations on more
complex formulae.
Log-Linear Models
Log-linear models approximate discrete multidimensional probability
distributions. They can be used to estimate the probability of each cell in a base
cuboid from the cells of lower-order cuboids, allowing a higher-order data cube to
be constructed from lower-order ones. The technique scales up well to allow for
many dimensions. Aside from prediction, the log-linear model is useful for data
compression (since the smaller-order cuboids together typically occupy less space
than the base cuboid) and data smoothing (since cell estimates in the smaller-order
cuboids are less subject to sampling variations than cell estimates in the base
cuboid).
Classifier Accuracy
Using training data to derive a classifier and then to estimate the accuracy of the
classifier can result in misleading overoptimistic estimates due to
overspecialization of the learning algorithm (or model) to the data. Holdout and
cross-validation are two common techniques for assessing classifier accuracy,
based on randomly sampled partitions of the given data. In the holdout method, the
given data are randomly partitioned into two independent sets, a training set and a
test set. Typically, two thirds of the data are allocated to the training set, and the
remaining one third is allocated to the test set. The training set is used to derive the
classifier, whose accuracy is estimated with the test set. The estimate is pessimistic
since only a portion of the initial data is used to derive the classifier. Random
subsampling is a variation of the holdout method in which the holdout method is
repeated k times. The overall accuracy estimate is taken as the average of the
accuracies obtained from each iteration.
In k-fold cross-validation, the initial data are randomly partitioned into k mutually
exclusive subsets or “folds,” S1, S2, …, Sk, each of approximately equal size.
Training and testing is performed k times. In iteration i, the subset Si is reserved as
the test set, and the remaining subsets are collectively used to train the classifier.
That is, the classifier of the first iteration is trained on subsets S2, …, Sk and tested
on S1; the classifier of the second iteration is trained on subsets S1, S3, …, Sk and
tested on S2; and so on. The accuracy estimate is the overall number of correct
classifications from the k iterations, divided by the total number of samples in the
initial data. In stratified cross-validation, the folds are stratified so that the class
distribution of the samples in each fold is approximately the same as that in the
initial data.
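A minimal sketch of stratified k-fold cross-validation using scikit-learn; the classifier choice, synthetic data, and k = 10 are illustrative assumptions:

```python
# Minimal sketch of stratified k-fold cross-validation accuracy estimation;
# classifier choice, synthetic data, and k are illustrative.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 30)                     # three balanced classes
X = rng.normal(loc=y[:, None] * 2.0, scale=0.8,  # class-dependent means
               size=(90, 2))

correct = 0
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(X, y):      # stratified folds
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    correct += int((clf.predict(X[test_idx]) == y[test_idx]).sum())

# overall number of correct classifications divided by the total samples
print("cross-validated accuracy:", correct / len(y))
```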
UNIT - 4
Cluster Analysis
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of
similar objects.
♦ A cluster of data objects can be treated as one group. While doing cluster analysis,
we first partition the set of data into groups based on data similarity and then assign
labels to the groups.
♦ Clustering can also help marketers discover distinct groups in their customer base.
And they can characterize their customer groups based on the purchasing patterns.
♦ In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures inherent
to populations.
♦ As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
The following points throw light on the requirements of clustering in data mining −
♦ Ability to deal with noisy data − Databases contain noisy, missing or erroneous
data. Some algorithms are sensitive to such data and may lead to poor-quality
clusters.
Clustering Methods
Partitioning Methods:
A partitioning method constructs k partitions of the data, where each partition
represents a cluster and k ≤ n. That is, it classifies the data into k groups, which
together satisfy the following requirements:
♦ Each group must contain at least one object, and each object must belong to
exactly one group.
The k-means method uses the square-error criterion
E = Σi=1..k Σp∈Ci |p − mi|²
where E is the sum of the square error for all objects in the data set, p is the point in
space representing a given object, and mi is the mean of cluster Ci.
Algorithm: k-means
The k-means algorithm for partitioning, where each cluster’s center is represented
by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for
each cluster;
(5) until no change;
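A minimal NumPy sketch of this loop (assuming no cluster ever becomes empty); the data and k are illustrative:

```python
# Minimal sketch of the k-means loop above; assumes no cluster goes empty.
import numpy as np

def k_means(D, k, rng=np.random.default_rng(0)):
    centers = D[rng.choice(len(D), k, replace=False)]  # step (1)
    while True:                                        # step (2) repeat
        # step (3): assign each object to the nearest cluster mean
        labels = np.argmin(((D[:, None] - centers) ** 2).sum(-1), axis=1)
        # step (4): recompute each cluster mean
        new = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):                  # step (5) until no change
            return labels, centers
        centers = new

D = np.array([[1, 1], [1.5, 2], [8, 8], [8, 9], [0.5, 1.5], [9, 8.5]])
print(k_means(D, k=2))
```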
The k-medoids method instead uses the absolute-error criterion
E = Σj=1..k Σp∈Cj |p − oj|
where E is the sum of the absolute error for all objects in the data set, p is the point
in space representing a given object in cluster Cj, and oj is the representative object
of Cj.
♦ The initial representative objects are chosen arbitrarily. The iterative process of
replacing representative objects by non representative objects continues as long as
the quality of the resulting clustering is improved. This quality is estimated using a
cost function that measures the average dissimilarity between an object and the
representative object of its cluster.
Case 1:
p currently belongs to representative object, oj. If oj is replaced by orandom as a
representative object and p is closest to one of the other representative objects,
oi,i≠j, then p is reassigned to oi.
Case 2:
p currently belongs to representative object, oj. If oj is replaced by orandom as a
representative object and p is closest to orandom, then p is reassigned to orandom.
Case 3:
p currently belongs to representative object, oi, i≠j. If oj is replaced by orandom as
a representative object and p is still closest to oi, then the assignment does not
change.
Case 4:
p currently belongs to representative object, oi, i≠j. If oj is replaced by orandom as
a representative object and p is closest to orandom, then p is reassigned to
orandom.
Algorithm: k-medoids
Input:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative
object;
(4) randomly select a nonrepresentative object, orandom;
(5) compute the total cost, S, of swapping representative object, oj, with orandom;
(6) if S < 0 then swap oj with orandom to form the new set of k representative
objects;
(7) until no change;
The k-medoids method is more robust than k-means in the presence of noise and
outliers, because a medoid is less influenced by outliers or other extreme values
than a mean. However, its processing is more costly than the k-means method.
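To make the swap test in steps (4) through (6) concrete, the sketch below computes the total cost of a clustering as the sum of Manhattan distances to the nearest medoid and compares a candidate swap against the current medoid set; the points and medoid choices are illustrative:

```python
# Minimal sketch of the k-medoids swap-cost test: total cost of a clustering
# is the sum of distances to the nearest medoid; data are illustrative.
import numpy as np

def total_cost(D, medoid_idx):
    # distance of every object to its nearest representative object (medoid)
    dists = np.abs(D[:, None] - D[medoid_idx]).sum(-1)  # Manhattan distance
    return dists.min(axis=1).sum()

# illustrative points; the last one lies away from both tight groups
D = np.array([[1.0, 1.0], [2.0, 2.0], [8.0, 8.0], [9.0, 9.0], [20.0, 20.0]])
current = [0, 2]                  # medoids [1,1] and [8,8]
candidate = [0, 4]                # swap [8,8] for orandom = [20,20]
S = total_cost(D, candidate) - total_cost(D, current)
print("swap" if S < 0 else "keep current medoids", "| S =", S)  # S > 0 here
```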
Hierarchical Methods
♦ A divisive hierarchical method works top-down: it subdivides the cluster into
smaller and smaller pieces, until each object forms a cluster on its own or until it
satisfies certain termination conditions, such as a desired number of clusters is
obtained or the diameter of each cluster is within a certain threshold.
Density-based methods
❖ Most partitioning methods cluster objects based on the distance between objects.
Such methods can find only spherical-shaped clusters and encounter difficulty at
discovering clusters of arbitrary shapes.
❖ Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing the given cluster as long as the density in
the neighborhood exceeds some threshold; that is, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number
of points. Such a method can be used to filter out noise (outliers) and discover clusters
of arbitrary shape.
❖ DBSCAN and its extension, OPTICS, are typical density-based methods that
grow clusters according to a density-based connectivity analysis. DENCLUE is a
method that clusters objects based on the analysis of the value distributions of
density functions.
Grid-Based Methods
❖ Grid-based methods quantize the object space into a finite number of cells that
form a grid structure.
❖ All of the clustering operations are performed on the grid structure i.e., on the
quantized space. The main advantage of this approach is its fast processing time,
which is typically independent of the number of data objects and dependent only on
the number of cells in each dimension in the quantized space.
STING (STatistical INformation Grid) is a typical grid-based method, in which the
spatial area is divided into rectangular cells at several levels of resolution. The
statistical parameters of higher-level cells can easily be computed from the
parameters of the lower-level cells. These parameters include the following: the
attribute-independent parameter, count; and the attribute-dependent parameters,
mean, stdev (standard deviation), min (minimum), max (maximum), and the type
of distribution that the attribute value in the cell follows, such as normal, uniform,
exponential, or none (if the distribution is unknown). Here, the attribute is a
selected measure for analysis, such as price for house objects. When the data are
loaded into the database, the parameters count, mean, stdev, min, and max of the
bottom-level cells are calculated directly from the data.
STING offers several advantages: (1) the grid-based computation is
query-independent, because the statistical information stored in each cell
summarizes the data in the cell independently of the query; (2) the grid structure
facilitates parallel processing and incremental updating; and (3) the method’s
efficiency is a major advantage: STING goes through the database once to compute
the statistical parameters of the cells, and hence the time complexity of generating
clusters is O(n), where n is the total number of objects.
CLIQUE (CLustering In QUEst) is a simple grid-based method for finding
density-based clusters in subspaces.
♦ It uses a density threshold to identify dense cells and sparse ones. A cell is dense
if the number of objects mapped to it exceeds the density threshold.
♦ The main strategy behind CLIQUE for identifying a candidate search space uses
the monotonicity of dense cells with respect to dimensionality.
♦ CLIQUE performs clustering in two steps. In the first step, CLIQUE partitions the
d-dimensional data space into nonoverlapping rectangular units, identifying the
dense units among these. CLIQUE finds dense cells in all of the subspaces. To do
so, CLIQUE partitions every dimension into intervals, and identifies intervals
containing at least l points, where l is the density threshold.
♦ In the second step, CLIQUE uses the dense cells in each subspace to assemble
clusters, which can be of arbitrary shape. The idea is to apply the Minimum
Description Length (MDL) principle (Chapter 8) to use the maximal regions to cover
connected dense cells, where a maximal region is a hyper rectangle where every cell
falling into this region is dense, and the region cannot be extended further in any
dimension in the subspace.
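A minimal sketch of the first step (identifying dense rectangular units on an equal-width grid); the interval count and density threshold l are illustrative assumptions:

```python
# Minimal sketch of CLIQUE's first step: partition each dimension into
# equal-width intervals and keep grid cells holding at least l points;
# the interval count and threshold are illustrative.
import numpy as np
from collections import Counter

def dense_cells(D, n_intervals=4, l=2):
    lo, hi = D.min(axis=0), D.max(axis=0)
    width = (hi - lo) / n_intervals
    # map each point to the index tuple of its rectangular unit
    cells = np.minimum(((D - lo) / width).astype(int), n_intervals - 1)
    counts = Counter(map(tuple, cells))
    return {c: n for c, n in counts.items() if n >= l}

D = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [3.9, 3.8], [3.7, 3.9]])
print(dense_cells(D))  # two dense cells, one per group of points
```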
Evaluation of clustering
The major tasks of clustering evaluation include the following:
Assessing clustering tendency. In this task, for a given data set, we assess whether
a nonrandom structure exists in the data. Blindly applying a clustering method on a
data set will return clusters; however, the clusters mined may be misleading.
Clustering analysis on a data set is meaningful only when there is a nonrandom
structure in the data.
Determining the number of clusters in a data set. A few algorithms, such as
k-means, require the number of clusters in a data set as the parameter. Moreover,
the number of clusters can be regarded as an interesting and important summary
statistic of a data set. Therefore, it is desirable to estimate this number even before
a clustering algorithm is used to derive detailed clusters.
Measuring clustering quality. After applying a clustering method on a data set, we
want to assess how good the resulting clusters are. A number of measures can be
used. Some methods measure how well the clusters fit the data set, while others
measure how well the clusters match the ground truth, if such truth is available.
There are also measures that score clusterings and thus can compare two sets of
clustering results on the same data set.
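As one concrete quality measure, the sketch below scores k-means clusterings of synthetic data with the silhouette coefficient, which can also suggest a reasonable number of clusters; the data and candidate k values are illustrative:

```python
# Minimal sketch: scoring clusterings with the silhouette coefficient,
# one common measure of how well clusters fit the data; data are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack([np.random.default_rng(2).normal(m, 0.3, (25, 2))
               for m in (0.0, 5.0)])  # two well-separated blobs

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# the k with the highest score is a reasonable estimate of the cluster count
```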
Unit 5
Mining Complex Data Types
Sequence Classification
Feature-based vs. sequence-distance-based vs. model-based
PathPredict
Similarity Search and OLAP in Information Networks: PathSim, GraphCube
Evolution of Social and Information Networks: EvoNetClus
Applications
Topic modeling, i-topic model, integration with geo- and networked data
Regression
predict the value of a response (dependent) variable from one or more predictor
(independent) variables, where the variables are numeric.
Forms of regression: linear, multiple, weighted, polynomial, nonparametric, and
robust
Generalized linear models
Mixed-effect models
For analyzing grouped data, i.e., data that can be classified according to one or
more grouping variables. Typically describe relationships between a response
variable and some covariates in data grouped according to one or more factors
Regression trees
Binary trees used for classification and prediction. Similar to decision trees: tests
are performed at the internal nodes. In a regression tree, the mean of the objective
attribute is computed and used as the predicted value
Analysis of variance
e.g., for many psychiatric data, one can indirectly measure other quantities (such as
test scores) that reflect the factor of interest
Discriminant analysis
Survival analysis
Predicts the probability that a patient undergoing a medical treatment would survive
at least to time t (life span prediction)
Views on Data Mining Foundations
♦ Data reduction
♦ Data compression
Basis of data mining: Compress the given data by encoding in terms of bits,
association rules, decision trees, clusters, etc.
♦ Microeconomic view
A view of utility: finding patterns that are interesting only to the extent that they
can be used in the decision-making process of some enterprise
Visualization: Use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data .
Visual Data Mining: discovering implicit but useful knowledge from large data
sets using visualization techniques
Visualization
♦ Purpose of Visualization
data visualization
data mining result visualization
data mining process visualization
interactive visual data mining
*Data mining result visualization
Presentation of the results or knowledge obtained from data mining in visual forms
Examples
Decision trees
Association rules
Clusters
Outliers
Generalized rules
*Data mining process visualization
Presentation of the various processes of data mining in visual forms so that users
can see how the data are extracted and processed at each step
*Interactive visual data mining
Example
♦ Display the data distribution in a set of attributes using colored sectors or columns
(depending on whether the whole space is represented by either a circle or a set of
columns)
♦ Use the display to decide which sector should first be selected for classification
and where a good split point for this sector may be
Audio Data Mining
♦ Uses audio signals to indicate the patterns of data or the features of data mining
results. An interesting alternative to visual mining
♦ An inverse task is mining audio (such as music) databases, which is to find patterns
from audio data
♦ Visual data mining may disclose interesting patterns using graphical displays, but
requires users to concentrate on watching patterns
♦ Instead, transform patterns into sound and music and listen to pitches, rhythms,
tune, and melody in order to identify anything interesting or unusual
Data mining:
A young discipline with broad and diverse applications. There still exists a nontrivial
gap between generic data mining methods and effective and scalable data mining
tools for domain-specific applications
Some application domains (briefly discussed here)
♦ Financial data collected in banks and financial institutions are often relatively
complete, reliable, and of high quality. Design and construction of data warehouses
for multidimensional data analysis and data mining
♦ View the debt and revenue changes by month, by region, by sector, and by other
factors
♦ Access statistical information such as max, min, total, average, trend, etc.
Other issues
♦ Data mining in social sciences and social studies: text and social media
♦ Content-based: Recommends items that are similar to items the user preferred or
queried in the past
Further work: Integration of data mining into existing business and scientific
technologies to provide domain-specific data mining tools
E.g., meteorology, astronomy, geography, geology, biology, and other scientific and
engineering data
Randomization (e.g., perturbation): Add noise to the data in order to mask some
attribute values of records
Downgrading the effectiveness of data mining: The output of data mining may
violate privacy
Modify data or mining results, e.g., hiding some association rules or slightly
distorting some classification models
Integration of data mining with Web search engines, database systems, data
warehouse systems and cloud computing systems