
UNIT 3

Classification

♦ Predicts categorical class labels (discrete or nominal)

♦ Classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute and uses it in classifying new data

Given a collection of records (training set)


– Each record contains a set of attributes, one of the attributes is the class.

♦ Find a model for class attribute as a function of the values of other attributes.

♦ Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set
is divided into training and test sets, with training set used to build the model and
test set used to validate it.

Examples of Classification Task


♦ Predicting tumor cells as benign or malignant

♦ Classifying credit card transactions as legitimate or fraudulent

♦ Categorizing news stories as finance, weather, entertainment, sports, etc.

Decision Tree Induction

A decision tree is a flowchart-like tree structure, where

➢ Each internal node denotes a test on an attribute.


➢ Each branch represents an outcome of the test.

➢ Each leaf node holds a class label.

➢ The topmost node in a tree is the root node.

A decision tree indicating whether a customer is likely to purchase a computer

➢ The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and is therefore appropriate for exploratory
knowledge discovery. Decision trees can handle high-dimensional data.

➢ Their representation of acquired knowledge in tree form is intuitive and generally


easy to assimilate by humans.

➢ The learning and classification steps of decision tree induction are simple and
fast. In general, decision tree classifiers have good accuracy.

➢ Decision tree induction algorithms have been used for classification in many
application areas, such as medicine, manufacturing and production, financial
analysis, astronomy, and molecular biology.

Algorithm:
Generate decision tree. Generate a decision tree from the training tuples of data
partition, D.

Input:

-Data partition, D, which is a set of training tuples and their associated class labels;
-attribute list, the set of candidate attributes;
-Attribute selection method, a procedure to determine the splitting criterion that
“best” partitions the data tuples into individual classes. This criterion consists of a
splitting attribute and, possibly, either a split-point or splitting subset.

Output: A decision tree.

Method:

(1) create a node N;

(2) if tuples in D are all of the same class, C, then

(3) return N as a leaf node labeled with the class C;

(4) if attribute list is empty then

(5) return N as a leaf node labeled with the majority class in D; // majority voting

(6) apply Attribute selection method(D, attribute list) to find the “best” splitting
criterion;

(7) label node N with splitting criterion;

(8) if splitting attribute is discrete-valued and multiway splits allowed then // not
restricted to binary trees

(9) attribute list ← attribute list − splitting attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion// partition the tuples and grow subtrees
for each partition

(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition

(12) if Dj is empty then

(13) attach a leaf labeled with the majority class in D to node N;

(14) else attach the node returned by Generate decision tree(Dj, attribute list) to node
N;

endfor

(15) return N;
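A minimal Python sketch of this recursive procedure is given below. It is an illustration only, assuming the training tuples are (attribute_dict, class_label) pairs, that all attributes are discrete-valued with multiway splits, and that an attribute_selection_method function (for example, one based on information gain) is supplied by the caller.

from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """D: list of (attribute_dict, class_label) training tuples."""
    labels = [label for _, label in D]
    majority = Counter(labels).most_common(1)[0][0]
    # (2)-(3): all tuples belong to the same class -> leaf labeled with that class
    if len(set(labels)) == 1:
        return {'leaf': labels[0]}
    # (4)-(5): attribute list is empty -> leaf labeled with the majority class
    if not attribute_list:
        return {'leaf': majority}
    # (6)-(7): choose the "best" splitting attribute
    split_attr = attribute_selection_method(D, attribute_list)
    node = {'test': split_attr, 'branches': {}}
    remaining = [a for a in attribute_list if a != split_attr]      # (9)
    # (10)-(14): partition the tuples and grow a subtree for each outcome;
    # since outcomes are enumerated from D itself, the empty-partition case
    # of step (13) does not arise in this sketch.
    for value in set(t[split_attr] for t, _ in D):
        Dj = [(t, label) for t, label in D if t[split_attr] == value]
        node['branches'][value] = generate_decision_tree(Dj, remaining,
                                                         attribute_selection_method)
    return node                                                     # (15)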

The algorithm is called with three parameters:


➢ Data partition
➢ Attribute list
➢ Attribute selection method

♦ The parameter attribute list is a list of attributes describing the tuples. Attribute
selection method specifies a heuristic procedure for selecting the attribute that
“best” discriminates the given tuples according to class.
♦ The tree starts as a single node, N, representing the training tuples in D.
♦ If the tuples in D are all of the same class, then node N becomes a leaf and is
labeled with that class.
♦ All of the terminating conditions are explained at the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion. The splitting criterion tells us which attribute to test at node N by
determining the “best” way to separate or partition the tuples in D into individual
classes.
♦ There are three possible scenarios. Let A be the splitting attribute. A has v
distinct values, {a1, a2, …, av}, based on the training data.
1 A is discrete-valued:
In this case, the outcomes of the test at node N correspond directly to the known
values of A. A branch is created for each known value, aj, of A and labeled with
that value. A need not be considered in any future partitioning of the tuples.

2 A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A <= split_point and A > split_point,
respectively, where split_point is the split-point returned by Attribute selection
method as part of the splitting criterion.

3 A is discrete-valued and a binary tree must be produced: The test at node N is of
the form “A ∈ SA?”, where SA is the splitting subset for A, returned by Attribute
selection method as part of the splitting criterion. It is a subset of the known values
of A.

Bayesian Classification
Bayesian classifiers are statistical classifiers.
♦ They can predict class membership probabilities, such as the probability that a
given tuple
belongs to a particular class.
♦ Bayesian classification is based on Bayes’ theorem.
Bayes’ Theorem:
Let X be a data tuple. In Bayesian terms, X is considered “evidence,” and it is
described by measurements made on a set of n attributes.

-Let H be some hypothesis, such as that the data tuple X belongs to a specified class
C.
-For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the “evidence,” or observed data tuple, X. P(H|X) is the
posterior probability, or a posteriori probability, of H conditioned on X. Bayes’
theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X):

P(H|X) = P(X|H) P(H) / P(X)

Naïve Bayesian Classification

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1.Let D be a training set of tuples and their associated class labels. As usual, each
tuple is represented by an n-dimensional attribute vector, X = (x1, x2, …,xn),
depicting n measurements made on the tuple from n attributes, respectively, A1, A2,
…, An.

2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier
will predict that X belongs to the class having the highest posterior probability,
conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs
to the class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called
the maximum posteriori hypothesis. By Bayes’ theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3.As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the
class prior probabilities are not known, then it is commonly assumed that the classes
are equally likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore
maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).

4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci),
the naïve assumption of class-conditional independence is made. This presumes that
the values of the attributes are conditionally independent of one another, given the
class label of the tuple. Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the
training tuples. For each attribute, we look at whether the attribute is categorical or
continuous-valued. For instance, to compute P(X|Ci), we consider the following:
➢ If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having
the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
➢ If Ak is continuous-valued, then we need to do a bit more work, but the
calculation is pretty straightforward. A continuous-valued attribute is typically
assumed to have a Gaussian distribution with a mean μ and standard deviation σ,
defined by

g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)), so that P(xk|Ci) = g(xk, μCi, σCi).

5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class
Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
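A compact Python sketch of this classifier for categorical attributes is shown below. It is a minimal illustration only, assuming the training data is a list of (attribute_dict, class_label) pairs; Laplace smoothing and continuous (Gaussian) attributes are omitted.

from collections import Counter, defaultdict

def train_naive_bayes(D):
    """Estimate the priors P(Ci) and likelihoods P(xk|Ci) by counting over D."""
    class_counts = Counter(label for _, label in D)
    priors = {c: n / len(D) for c, n in class_counts.items()}
    # cond[(c, attr, value)] = number of tuples of class c with attr == value
    cond = defaultdict(int)
    for tup, c in D:
        for attr, value in tup.items():
            cond[(c, attr, value)] += 1
    likelihoods = {k: v / class_counts[k[0]] for k, v in cond.items()}
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """Return the class Ci maximizing P(X|Ci) * P(Ci)."""
    def score(c):
        p = priors[c]
        for attr, value in x.items():
            p *= likelihoods.get((c, attr, value), 0.0)
        return p
    return max(priors, key=score)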

Classification by Back propagation


♦ Back propagation is a neural network learning algorithm.

♦ A neural network is a set of connected input/output units in which each connection


has a weight associated with it.

♦ During the learning phase, the network learns by adjusting the weights so as to be
able to predict the correct class label of the input tuples.

♦ Neural network learning is also referred to as connectionist learning due to the


connections between units.

♦ Neural networks involve long training times and are therefore more suitable for
applications where this is feasible.
♦ Backpropagation learns by iteratively processing a data set of training tuples,
comparing the network’s prediction for each tuple with the actual known target
value.

♦ The target value may be the known class label of the training tuple (for
classification problems) or a continuous value (for prediction).

♦ For each training tuple, the weights are modified so as to minimize the mean
squared error between the network’s prediction and the actual target value. These
modifications are made in the “backwards” direction, that is, from the output layer,
through each hidden layer, down to the first hidden layer; hence the name
backpropagation.

♦ Although it is not guaranteed, in general the weights will eventually converge,


and the learning process stops.

Advantages:
♦ It include their high tolerance of noisy data as well as their ability to classify
patterns on which they have not been trained.
♦ They can be used when you may have little knowledge of the relationships
between attributes and classes.

♦ They are well-suited for continuous-valued inputs and outputs, unlike most
decision tree algorithms.

♦ They have been successful on a wide array of real-world data, including


handwritten character recognition, pathology and laboratory medicine, and training
a computer to pronounce English text.

♦ Neural network algorithms are inherently parallel; parallelization techniques can


be used to speed up the computation process.

Process:
♦ Initialize the weights:
♦ The weights in the network are initialized to small random numbers ranging from
−1.0 to 1.0, or −0.5 to 0.5. Each unit has a bias associated with it. The biases are
similarly initialized to small random numbers.
♦ Each training tuple, X, is processed by the following steps.

Propagate the inputs forward:

First, the training tuple is fed to the input layer of the network. The inputs pass
through the input units, unchanged. That is, for an input unit j, its output, Oj, is
equal to its input value, Ij. Next, the net input and output of each unit in the hidden
and output layers are computed. The net input to a unit in the hidden or output layers
is computed as a linear combination of its inputs. Each such unit has a number of
inputs to it that are, in fact, the outputs of the units connected to it in the previous
layer. Each connection has a weight. To compute the net input to the unit, each
input connected to the unit is multiplied by its corresponding weight, and this is
summed:

Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j;
Oi is the output of unit i from the previous layer; and θj is the bias of the unit, which
acts as a threshold in that it serves to vary the activity of the unit. Each unit in the
hidden and output layers takes its net input and then applies an activation function
to it.

Back propagate the error:

The error is propagated backward by updating the weights and biases to reflect the
error of the network’s prediction. For a unit j in the output layer, the error Err j is
computed by
where Oj is the actual output of unit j, and Tj is the known target value of the given
training tuple.The error of a hidden layer unit j is

where wjk is the weight of the connection from unit j to a unit k in the next higher
layer, and Errk is the error of unit k. Weights are updated by the following equations,
where Dwi j is the change in weight wi j:

Biases are updated by the following equations below

Algorithm:
Backpropagation. Neural network learning for classification or numeric prediction,
using the backpropagation algorithm.

Input:
D, a data set consisting of the training tuples and their associated target values; l, the
learning rate; network, a multilayer feed-forward network.

Output:
A trained neural network.

Method:
(1) Initialize all weights and biases in network;

(2) while terminating condition is not satisfied {

(3) for each training tuple X in D {

(4) // Propagate the inputs forward:

(5) for each input layer unit j {

(6) Oj = Ij ; // output of an input unit is its actual input value

(7) for each hidden or output layer unit j {

(8) Ij = Σi wij Oi + θj ; // compute the net input of unit j with respect to the previous
layer, i

(9) Oj = 1 / (1 + e^(−Ij)) ; } // compute the output of each unit j
(10) // Backpropagate the errors:

(11) for each unit j in the output layer

(12) Errj = Oj(1 − Oj)(Tj − Oj); // compute the error

(13) for each unit j in the hidden layers, from the last to the first hidden layer

(14) Errj = Oj(1 − Oj) Σk Errk wjk; // compute the error with respect to the next
higher layer, k

(15) for each weight wij in network {

(16) Δwij = (l)Errj Oi ; // weight increment

(17) wij = wij + Δwij; } // weight update

(18) for each bias θj in network {

(19) Δθj = (l)Errj ; // bias increment

(20) θj = θj + Δθj ; } // bias update

(21) } }
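Below is a minimal NumPy sketch of one training pass of this algorithm for a single-hidden-layer feed-forward network. It is an illustration under stated assumptions: the sigmoid activation, the layer shapes, and the learning rate l are choices of the example, not prescribed by the text.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_epoch(X, T, W1, b1, W2, b2, l=0.1):
    """One pass over training tuples X with targets T (both 2-D arrays)."""
    for x, t in zip(X, T):
        # Propagate the inputs forward
        h = sigmoid(W1 @ x + b1)          # hidden-layer outputs Oj
        o = sigmoid(W2 @ h + b2)          # output-layer outputs Oj
        # Backpropagate the errors
        err_out = o * (1 - o) * (t - o)               # Errj for output units
        err_hid = h * (1 - h) * (W2.T @ err_out)      # Errj for hidden units
        # Weight and bias updates: w += l * Errj * Oi, theta += l * Errj
        W2 += l * np.outer(err_out, h)
        b2 += l * err_out
        W1 += l * np.outer(err_hid, x)
        b1 += l * err_hid
    return W1, b1, W2, b2

Weights and biases can be initialized to small random values as described above, e.g. W1 = np.random.uniform(-0.5, 0.5, (3, 2)) for a hypothetical 2-input, 3-hidden-unit layer.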

Support Vector Machine

Support Vector Machine (SVM) performs classification by finding the hyperplane


that maximizes the margin between the two classes. The vectors (cases) that define
the hyperplane are the support vectors.
Algorithm

1. Define an optimal hyperplane: maximize margin


2. Extend the above definition for non-linearly separable problems: have a penalty term for
misclassifications.
3. Map data to high dimensional space where it is easier to classify with linear decision
surfaces: reformulate problem so that data is mapped implicitly to this space.
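As an illustration, a soft-margin SVM with a nonlinear kernel (steps 2 and 3 above) can be fit with scikit-learn. This is only a usage sketch under the assumption that scikit-learn is installed; the toy data is invented for the example.

import numpy as np
from sklearn.svm import SVC

# A tiny two-class toy data set (for illustration only).
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

# C penalizes misclassifications (step 2); the RBF kernel implicitly maps
# the data to a high-dimensional space (step 3).
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X, y)
print(clf.support_vectors_)     # the vectors that define the maximum-margin hyperplane
print(clf.predict([[0.1, 0.0]]))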

Bayesian Belief Networks

Objectives:
♦ The naïve Bayesian classifier assumes that attributes are conditionally
independent.

♦ Belief nets are proven technology, used in medical diagnosis, decision support
systems (DSS) for complex machines, forecasting, modeling, information retrieval, etc.
Definition

♦ A Bayesian network is a causal directed acyclic graph (DAG) associated with an
underlying probability distribution.
♦ DAG structure: each node is represented by a variable v that depends (only) on its
parents through a conditional probability, P(vi | parents(vi)); v is INDEPENDENT of
its non-descendants, given assignments to its parents.

♦ We must specify the conditional probability distribution for each node.

♦ If the variables are discrete, each node is described by a table (Conditional
Probability Table (CPT)) which lists the probability that the child node takes on
each of its different values for each combination of values of its parents.

Example:
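As a stand-in for a worked example, the Python sketch below shows how CPTs determine joint probabilities for a hypothetical two-node network, Rain → WetGrass. This network and its numbers are assumptions of the illustration, not taken from the text.

# Rain -> WetGrass: a hand-specified two-node Bayesian network (hypothetical).
P_rain = {True: 0.2, False: 0.8}                      # P(Rain)
P_wet_given_rain = {                                  # CPT: P(WetGrass | Rain)
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}

def joint(rain, wet):
    """P(Rain = rain, WetGrass = wet) = P(Rain) * P(WetGrass | Rain)."""
    return P_rain[rain] * P_wet_given_rain[rain][wet]

# P(WetGrass = True), obtained by summing the joint over the parent's values
p_wet = sum(joint(r, True) for r in (True, False))
print(p_wet)   # 0.2*0.9 + 0.8*0.1 = 0.26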
Advanced Classification Methods

Introduction

In this section, we give a brief description of a number of other classification


methods. These methods include k-nearest neighbor classification, case-based
reasoning, genetic algorithms, rough set, and fuzzy set approaches. In general, these
methods are less commonly used for classification in commercial data mining
systems than the methods described earlier in this chapter. Nearest neighbor
classification, for example, stores all training samples, which may present
difficulties when learning from very large data sets. Furthermore, many
applications of case-based reasoning, genetic algorithms, and rough sets for
classification are still in the prototype phase. These methods, however, are enjoying
increasing popularity, and hence we include them here.

K-Nearest Neighbor classifiers


Nearest neighbor classifiers are based on learning by analogy. The training samples
are described by n-dimensional numeric attributes, so each sample corresponds to a
point in an n-dimensional pattern space. When given an unknown sample, a k-nearest
neighbor classifier searches the pattern space for the k training samples that are
closest to the unknown sample. These k training samples are the k “nearest neighbors”
of the unknown sample. “Closeness” is defined in terms of Euclidean distance, where
the Euclidean distance between two points X = (x1, …, xn) and Y = (y1, …, yn) is

d(X, Y) = √( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )

The unknown sample is assigned the most common class among its k nearest
neighbors. When k = 1, the unknown sample is assigned the class of the training
sample that is closest to it in pattern space. Nearest neighbor classifiers are
instance-based or lazy learners in that they store all of the training samples and do
not build a classifier until a new (unlabeled) sample needs to be classified. This
contrasts with eager learning methods, such as decision tree induction and
backpropagation, which construct a generalization model before receiving new
samples to classify. Lazy learners can incur expensive computational costs when
the number of potential neighbors (i.e., stored training samples) with which to
compare a given unlabeled sample is great. Therefore, they require efficient
indexing techniques. As expected, lazy learning methods are faster at training than
eager methods, but slower at classification, since all computation is delayed to that
time. Unlike decision tree induction and backpropagation, nearest neighbor
classifiers assign equal weight to each attribute. This may cause confusion when
there are many irrelevant attributes in the data.

Nearest neighbor classifiers can also be used for prediction, that is, to return real-
valued prediction for a given unknown sample. In this case, the classifier returns
the average value of the real-valued labels associated with the k nearest neighbors
of the unknown sample.
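A small Python sketch of k-nearest-neighbor classification and prediction follows. It is a minimal illustration assuming the training set is a list of (feature_vector, label) pairs, with majority voting among the k neighbors for classification and averaging for prediction.

import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(training, unknown, k=3):
    """Assign the most common class among the k nearest training samples."""
    neighbors = sorted(training, key=lambda s: euclidean(s[0], unknown))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_predict(training, unknown, k=3):
    """Real-valued prediction: average of the labels of the k nearest neighbors."""
    neighbors = sorted(training, key=lambda s: euclidean(s[0], unknown))[:k]
    return sum(label for _, label in neighbors) / k

# Example (toy data)
training = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((5.0, 5.0), 'B'), ((4.8, 5.2), 'B')]
print(knn_classify(training, (1.1, 0.9), k=3))   # 'A'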

Case- Based Reasoning

Case-based reasoning (CBR) classifiers are instance-based. Unlike nearest


neighbor classifiers, which store training samples as points in Euclidean space, the
samples or “cases” stored by CBR are complex symbolic descriptions. Business
applications of CBR include problem resolution for customer service help desks,
for example, where cases describe product-related diagnostic problems. CBR has
also been applied to areas such as engineering and law, where cases are either
technical designs or legal rulings, respectively.

When given a new case to classify, a case-based reasoner will first check if an
identical training case exists. If one is found, then the accompanying solution to
that case is returned. If no identical case is found, then the case-based reasoner will
search for training cases having components that are similar to those of the new
case. Conceptually, these training cases may be considered as neighbors of the new
case. If cases are represented as graphs, this involves searching for subgraphs that
are similar to subgraphs within the new case. The case-based reasoner tries to
combine the solutions of the neighboring training cases in order to propose a
solution for the new case. If incompatibilities arise with the individual solutions,
then backtracking to search for other solutions may be necessary. The case-based
reasoner may employ background knowledge and problem-solving strategies in
order to propose a feasible combined solution.

Challenges in case-based reasoning include finding a good similarity metric (e.g.,


for matching subgraphs), developing efficient techniques for indexing training
cases, and methods for combining solutions.

Genetic Algorithms

Genetic algorithms attempt to incorporate ideas of natural evolution. In general,


genetic learning starts as follows. An initial population is created consisting of
randomly generated rules. Each rule can be represented by a string of bits. As a
simple example, suppose that samples in a given training set are described by two
Boolean attributes, A1 and A2, and that there are two classes, C1 and C2. The rule
“IF A1 AND NOT A2 THEN C2” can be encoded as the bit string “100”,
where the two leftmost bits represent attributes A1 and A2, respectively, and the
rightmost bit represents the class. Similarly, the rule “IF NOT A1 AND NOT A2
THEN C1” can be encoded as “001”. If an attribute has k values, where k > 2, then
k bits may be used to encode the attribute’s values. Classes can be encoded in a
similar fashion.
Based on the notion of survival of the fittest, a new population is formed to consist
of the fittest rules in the current population, as well as offspring of these rules.
Typically, the fitness of a rule is assessed by its classification accuracy on a set of
training samples.

Offspring are created by applying genetic operators such as crossover and mutation.
In crossover, substrings from pairs of rules are swapped to form new pairs of rules;
in mutation, randomly selected bits in a rule’s string are inverted.

The process of generating new populations based on prior populations of rules
continues until a population, P, “evolves,” where each rule in P satisfies a
prespecified fitness threshold.

Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate
the fitness of other algorithms.
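A brief Python sketch of the crossover and mutation operators on bit-string rules follows. It is a minimal illustration; the fitness function and selection scheme are left out and would be supplied by the application.

import random

def crossover(rule1, rule2, point=None):
    """Swap substrings of two bit-string rules at a (random) crossover point."""
    if point is None:
        point = random.randrange(1, len(rule1))
    return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

def mutate(rule, p=0.1):
    """Invert each bit of the rule's string with probability p."""
    return ''.join(('1' if b == '0' else '0') if random.random() < p else b
                   for b in rule)

# Example: rules encoded as in the text, e.g. "100" = IF A1 AND NOT A2 THEN C2
child1, child2 = crossover("100", "001", point=1)
print(child1, child2)          # "101", "000"
print(mutate("100", p=0.5))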

Rough Set Approach

Rough set theory can be used for classification to discover structural relationships
within imprecise or noisy data. It applies to discrete-valued attributes. Continuous-
valued attributes must therefore be discretized prior to its use.

Rough set theory is based on the establishment of equivalence classes within the
given training data. All of the data samples forming an equivalence class are
indiscernible, that is, the samples are identical with respect to the attributes
describing the data. Given real-world data, it is common that some classes cannot
be distinguished in terms of the available attributes. Rough sets can be used to
approximately or “roughly” define such classes. A rough set definition for a given
class, C, is approximated by two sets: a lower approximation of C and an upper
approximation of C. The lower approximation of C consists of all
of the data samples that, based on the knowledge of the attributes, are certain to
belong to C without ambiguity. The upper approximation of C consists of all of the
samples that, based on the knowledge of the attributes, cannot be described as not
belonging to C. Decision rules can be generated for each class. Typically, a
decision table is used to represent the rules.
Rough sets can also be used for feature reduction (where attributes that do not
contribute towards the classification of the given training data can be identified and
removed) and relevance analysis (where the contribution or significance of each
attribute is assessed with respect to the classification task). The problem of finding
the minimal subsets (reducts) of attributes that can describe all of the concepts in
the given data set is NP-hard. However, algorithms to reduce the computation
intensity have been proposed. In one method, for example, a discernibility matrix is
used that stores the differences between attribute values for each pair of data
samples. Rather than searching on the entire training set, the matrix is instead
searched to detect redundant attributes.

Fuzzy set Approaches

Rule-based systems for classification have the disadvantage that they involve sharp
cutoffs for continuous attributes. For example, consider the following rule for
customer credit application approval. The rule essentially says that applications for
customers who have had a job for two or more years and who have a high income
(i.e., of at least $50K) are approved:

IF (years_employed >= 2) AND (income > 50K) THEN credit = “approved”

A customer who has had a job for at least two years will receive credit if her
income is, say, $50K, but not if it is $49K. Such harsh thresholding may seem
unfair. Instead, fuzzy logic can be introduced into the system to allow “fuzzy”
thresholds or boundaries to be defined. Rather than having a precise cutoff between
categories or sets, fuzzy logic uses truth-values between 0.0 and 1.0 to represent
the degree of membership that a certain value has in a given category. Hence, with
fuzzy logic, we can capture the notion that an income of $49K is, to some degree,
high, although not as high as an income of $50K.

Fuzzy logic is useful for data mining systems performing classification. It provides
the advantage of working at a high level of abstraction. In general, the use of fuzzy
logic in rule-based systems involves the following:

• Attribute values are converted to fuzzy values. The fuzzy membership or truth
values are calculated. Fuzzy logic systems typically provide graphical tools to
assist users in this step.
• For a given new sample, more than one fuzzy rule may apply. Each applicable
rule contributes a vote for membership in the categories. Typically, the truth-values
for each predicted category are summed.

• The sums obtained above are combined into a value that is returned by the
system. This may be done by weighting each category by its truth sum and
multiplying by the mean truth-value of each category. The calculations involved
may be more complex, depending on the complexity of the fuzzy membership graphs.

Fuzzy logic systems have been used in numerous areas for classification, including
health care and finance.
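A short Python sketch of a fuzzy membership function for the “high income” notion above. The ramp boundaries of 40K and 50K are hypothetical values chosen only for this illustration.

def membership_high_income(income):
    """Degree (0.0-1.0) to which an income is 'high', using a linear ramp
    between 40K and 50K (hypothetical boundaries for illustration)."""
    if income <= 40_000:
        return 0.0
    if income >= 50_000:
        return 1.0
    return (income - 40_000) / 10_000

print(membership_high_income(49_000))   # 0.9 -- $49K is high "to some degree"
print(membership_high_income(50_000))   # 1.0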

Prediction

The prediction of continuous values can be modeled by statistical techniques of


regression. For example, we may like to develop a model to predict the salary of
college graduates with 10 years of work experience, or the potential sales of a new
product given its price. Many problems can be solved by linear regression, and
even more can be tackled by applying transformations to the variables so that a
nonlinear problem can be converted to a linear one. For reasons of space, we
cannot give a fully detailed treatment of regression. Instead, this section provides
an intuitive introduction to the topic. By the end of this section, you will be
familiar with the ideas of linear, multiple, and nonlinear regression, as well as
generalized linear models.

Several software packages exist to solve regression problems. Examples include


SAS (http://www.sas.com), SPSS (http://www.spss.com), and S-plus
(http://www.mathspfr.com).

Linear and Multiple Regression

In linear regression, data are modeled using a straight line. Linear regression is the
simplest form of regression. Bivariate linear regression models a random variable,
Y (called a response variable), as a linear function of another random variable, X
(called a predictor variable), that is,

Y = a + bX

where the variance of Y is assumed to be constant, and a and b are regression


coefficients specifying the Y-intercept and slope of the line, respectively. These
coefficients can be solved for by the method of least squares, which minimizes the
error between the actual data and the estimate of the line.
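A minimal Python sketch of the method of least squares for these coefficients is given below, assuming paired samples of the predictor X and response Y; the years/salary numbers are toy values invented for the illustration.

def least_squares(xs, ys):
    """Return (a, b) for Y = a + bX, minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Example: years of work experience vs. salary (toy numbers, in $1000s)
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
a, b = least_squares(years, salary)
print(a, b)          # intercept and slope of the fitted line
print(a + b * 10)    # predicted salary for 10 years of experience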

Nonlinear Regression

“How can we model data that does not show a linear dependence? For example,
what if a given response variable and predictor variables have a relationship that
may be modeled by a polynomial function?” Polynomial regression can be
modeled by adding polynomial terms to the basic linear model. By applying
transformations to the variables, we can convert the nonlinear model into a
linear one that can then be solved by the method of least squares.

Some models are intractably nonlinear (such as the sum of exponential terms, for
example) and cannot be converted to a linear model. For such cases, it may be
possible to obtain least square estimates through extensive calculations on more
complex formulae.
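For instance, a quadratic relationship y = w0 + w1·x + w2·x² can be converted to a linear (multiple regression) model by treating x and x² as two predictor variables. The NumPy sketch below illustrates this transformation; the data values are invented for the example.

import numpy as np

# Toy data with a roughly quadratic relationship (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.9, 10.2, 17.1, 26.0, 37.2])

# Transformation: treat x and x**2 as predictor variables of a linear model.
X = np.column_stack([np.ones_like(x), x, x ** 2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares on the linear form
print(coeffs)    # approximately [w0, w1, w2] of y = w0 + w1*x + w2*x**2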

Other Regression models

Linear regression is used to model continuous-valued functions. It is widely used,


owing largely to its simplicity. “Can it also be used to predict categorical labels?”
Generalized linear models represent the theoretical foundation on which linear
regression can be applied to the modeling of categorical response variables. In
generalized linear models, the variance of the response variable, Y, is a function of
the mean value of Y, unlike in linear regression, where the variance of Y is
constant. Common types of generalized linear models include logistic regression
and Poisson regression. Logistic regression models the probability of some event
occurring as a linear function of a set of predictor variables. Count data frequently
exhibit a Poisson distribution and are commonly modeled using Poisson
regression.

Log-linear models approximate discrete multidimensional probability distributions.


They may be used to estimate the probability value associated with data cube cells.
For example, suppose we are given data for the attributes city, item, year, and
sales. In the log-linear method, all attributes must be categorical; hence, continuous-
valued attributes (like sales) must first be discretized. The method can then be used
to estimate the probability of each cell in the 4-D base cuboid for the given
attributes, based on the 2-D cuboids for city and item, city and year, city and sales,
and the 3-D cuboid for item, year, and sales. In this way, an iterative technique can
be used to build higher- order data cubes from

lower-order ones. The technique scales up well to allow for many dimensions.
Aside from prediction, the log-linear model is useful for data compression (since
the smaller-order cuboids together typically occupy less space than the base
cuboid) and data smoothing (since cell estimates in the smaller-order cuboids are
less subject to sampling variations than cell estimates in the base cuboid).

Classifier Accuracy

Estimating classifier accuracy is important in that it allows one to evaluate how


accurately a given classifier will label future data, that is, data on which the
classifier has not been trained. For example, if data from previous sales are used
to train a classifier to predict customer purchasing behavior, we would like some
estimate of how accurately the classifier can predict the purchasing behavior of
future customers.

Estimating Classifier Accuracy

Using training data to derive a classifier and then to estimate the accuracy of the
classifier can result in misleading overoptimistic estimates due to over-
specialization of the learning algorithm (or model) to the data. Holdout and cross-
validation are two common techniques for assessing classifier accuracy, based on
randomly sampled partitions of the given data. In the holdout method, the given
data are randomly partitioned into two independent sets, a training set and a test
set. Typically, two thirds of the data are allocated to the training set, and the
remaining one third is allocated to the test set. The training set is used to derive the
classifier, whose accuracy is estimated with the test set. The estimate is pessimistic
since only a portion of the initial data is used to derive the classifier. Random
subsampling is a variation of the holdout method in which the holdout method is
repeated k times. The overall accuracy estimate is taken as the average of the
accuracies obtained from each iteration.

In k-fold cross-validation, the initial data are randomly partitioned into k mutually
exclusive subsets or “folds,” S1, S2, …, Sk, each of approximately equal size.
Training and testing are performed k times. In iteration i, subset Si is reserved as
the test set, and the remaining subsets are collectively used to train the classifier.
That is, the classifier of the first iteration is trained on subsets S2, …, Sk and tested
on S1; the classifier of the second iteration is trained on subsets S1, S3, …, Sk and
tested on S2; and so on. The accuracy estimate is the overall number of correct
classifications from the k iterations, divided by the total number of samples in the
initial data. In stratified cross-validation, the folds are stratified so that the class
distribution of the samples in each fold is approximately the same as that in the
initial data.

Other methods of estimating classifier accuracy include bootstrapping, which
samples the given training instances uniformly with replacement, and leave-one-
out, which is k-fold cross-validation with k set to s, the number of initial samples.
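A short Python sketch of k-fold cross-validation follows. It is a minimal illustration; the train and classify callables stand in for any classifier's training and prediction routines and are assumptions of the example.

import random

def k_fold_accuracy(data, k, train, classify):
    """Estimate accuracy by k-fold cross-validation.
    data: list of (sample, label); train(subset) -> model;
    classify(model, sample) -> predicted label."""
    data = data[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k roughly equal-sized folds
    correct = 0
    for i in range(k):
        test = folds[i]
        training = [t for j, fold in enumerate(folds) if j != i for t in fold]
        model = train(training)
        correct += sum(1 for sample, label in test
                       if classify(model, sample) == label)
    return correct / len(data)                      # overall correct / total samples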

UNIT - 4
Cluster Analysis

What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of
similar objects.
♦ A cluster of data objects can be treated as one group. While doing cluster analysis,
we first partition the set of data into groups based on data similarity and then assign
labels to the groups.

♦ The main advantage of clustering over classification is that it is adaptable to


changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis

♦ Clustering analysis is broadly used in many applications such as market research,


pattern recognition, data analysis, and image processing.

♦ Clustering can also help marketers discover distinct groups in their customer base
and characterize these customer groups based on their purchasing patterns.
♦ In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures inherent
to populations.

♦ Clustering also helps in identification of areas of similar land use in an earth


observation database. It also helps in the identification of groups of houses in a city
according to house type, value, and geographic location.

♦ Clustering also helps in classifying documents on the web for information


discovery.

♦ Clustering is also used in outlier detection applications such as detection of credit


card fraud.

♦ As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −

♦ Scalability − We need highly scalable clustering algorithms to deal with large


databases.

♦ Ability to deal with different kinds of attributes − Algorithms should be capable
of being applied to any kind of data, such as interval-based (numerical), categorical,
and binary data.

♦ Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be bounded to only
distance measures that tend to find spherical clusters of small size.
♦ High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.

♦ Ability to deal with noisy data − Databases contain noisy, missing or erroneous
data. Some algorithms are sensitive to such data and may lead to poor quality
clusters.

♦ Interpretability − The clustering results should be interpretable, comprehensible,


and usable.

Clustering Methods

Clustering methods can be classified into the following categories −


♦ Partitioning Method
♦ Hierarchical Method
♦ Density-based Method
♦ Grid-Based Method
♦ Model-Based Method
♦ Constraint-based Method

Partitioning Methods:
A partitioning method constructs k partitions of the data, where each partition
represents a cluster and k <= n. That is, it classifies the data into k groups, which
together satisfy the following requirements:
♦ Each group must contain at least one object, and Each object must belong to
exactly one group.

♦ A partitioning method creates an initial partitioning. It then uses an iterative


relocation technique that attempts to improve the partitioning by moving objects
from one group to another.
♦ The general criterion of a good partitioning is that objects in the same cluster are
close or related to each other, whereas objects of different clusters are far apart or
very different.

Classical Partitioning Methods:

The most well-known and commonly used partitioning methods are


❖ The k-Means Method
❖ k-Medoids Method

Centroid-Based Technique: The K-Means Method:


The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intra cluster similarity is high but the
inter cluster similarity is low. Cluster similarity is measured in regard to the mean
value of the objects in a cluster, which can be viewed as the cluster’s centroid or
center of gravity.

The k-means algorithm proceeds as follows.

♦ First, it randomly selects k of the objects, each of which initially represents a


cluster mean or center. For each of the remaining objects, an object is assigned to
the cluster to which it is the most similar, based on the distance between the object
and the cluster mean.
♦ It then computes the new mean for each cluster. This process iterates until the
criterion function converges. Typically, the square-error criterion is used, defined
as

E = Σi=1..k Σp∈Ci |p − mi|²

where E is the sum of the squared error for all objects in the data set, p is the point in
space representing a given object, and mi is the mean of cluster Ci.

Algorithm: k - means
.
The k-means algorithm for partitioning, where each cluster’s center is represented
by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.

Output:
A set of k clusters.

Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;

(2) repeat

(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;

(4) update the cluster means, that is, calculate the mean value of the objects for
each cluster;

(5) until no change;
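A compact Python sketch of this procedure is shown below. It is a minimal illustration for numeric points represented as tuples; empty clusters and tie-breaking are not handled.

import math, random

def kmeans(points, k, max_iter=100):
    """points: list of numeric tuples. Returns (centers, assignments)."""
    centers = random.sample(points, k)                 # (1) arbitrary initial centers
    assignments = None
    for _ in range(max_iter):                          # (2) repeat
        # (3) (re)assign each object to the cluster with the nearest mean
        new_assignments = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                           for p in points]
        if new_assignments == assignments:             # (5) until no change
            break
        assignments = new_assignments
        # (4) update the cluster means
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignments

# Example (toy data)
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
print(kmeans(points, k=2))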

The k-Medoids Method:

♦ The k-means algorithm is sensitive to outliers because an object with an


extremely large value may substantially distort the distribution of data. This effect
is particularly exacerbated due to the use of the square-error function. Instead of
taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster.
Each remaining object is clustered with the representative object to which it is the
most similar.
♦ The partitioning method is then performed based on the principle of minimizing
the sum of the dissimilarities between each object and its corresponding reference
point. That is, an absolute-error criterion is used, defined as

E = Σj=1..k Σp∈Cj |p − oj|

where E is the sum of the absolute error for all objects in the data set, p is the point
in space representing a given object in cluster Cj, and oj is the representative object
of Cj.

♦ The initial representative objects are chosen arbitrarily. The iterative process of
replacing representative objects by non representative objects continues as long as
the quality of the resulting clustering is improved. This quality is estimated using a
cost function that measures the average dissimilarity between an object and the
representative object of its cluster.

♦ To determine whether a nonrepresentative object, orandom, is a good
replacement for a current representative object, oj, the following four cases are
examined for each of the nonrepresentative objects.

Case 1:
p currently belongs to representative object, oj. If oj is replaced by orandom as a
representative object and p is closest to one of the other representative objects,
oi,i≠j, then p is reassigned to oi.

Case 2:
p currently belongs to representative object, oj. If oj is replaced by orandom as a
representative object and p is closest to orandom, then p is reassigned to o random.

Case 3:
p currently belongs to representative object, oi, i≠j. If oj is replaced by orandom as
a representative object and p is still closest to oi, then the assignment does not
change.

Case 4:
p currently belongs to representative object, oi, i≠j. If oj is replaced by orandom as
a representative object and p is closest to orandom, then p is reassigned to
orandom.

Algorithm: k- medoids

PAM, a k-medoids algorithm for partitioning based on medoid or central objects.

Input:
k: the number of clusters,
D: a data set containing n objects.

Output:
A set of k clusters.

Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;

(2) repeat
(3) assign each remaining object to the cluster with the nearest representative
object;

(4) randomly select a nonrepresentative object, orandom;

(5) compute the total cost, S, of swapping representative object, oj, with orandom;

(6) if S < 0 then swap oj with orandom to form the new set of k representative
objects;

(7) until no change;

The k-medoids method is more robust than k-means in the presence of noise and
outliers, because a medoid is less influenced by outliers or other extreme values
than a mean. However, its processing is more costly than the k-means method.
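A small Python sketch of the absolute-error cost used to decide whether a swap improves the clustering (illustration only; a full PAM loop would wrap this in the random-swap procedure given above):

import math

def absolute_error(points, medoids):
    """E = sum over all points of the distance to the nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, old_medoid, candidate):
    """S = cost after swapping old_medoid with candidate, minus the current cost.
    A negative S means the swap improves the clustering."""
    new_medoids = [candidate if m == old_medoid else m for m in medoids]
    return absolute_error(points, new_medoids) - absolute_error(points, medoids)

# Example (toy data)
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
medoids = [(1.0, 1.0), (9.0, 8.0)]
print(swap_cost(points, medoids, (9.0, 8.0), (8.5, 9.0)))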

Hierarchical Clustering Methods


A hierarchical clustering method works by grouping data objects into a tree of
clusters.
The quality of a pure hierarchical clustering method suffers from its inability to
perform adjustment once a merge or split decision has been executed. That is, if a
particular merge or split decision later turns out to have been a poor choice, the
method cannot backtrack and correct it.
♦ Hierarchical clustering methods can be further classified as either agglomerative
or divisive, depending on whether the hierarchical decomposition is formed in a
bottom-up or top-down fashion.

Agglomerative hierarchical clustering:


This bottom-up strategy starts by placing each object in its own cluster and then
merges these atomic clusters into larger and larger clusters, until all of the objects
are in a single cluster or until certain termination conditions are satisfied.
Most hierarchical clustering methods belong to this category. They differ only in
their definition of intercluster similarity.
Divisive hierarchical clustering
♦ This top-down strategy does the reverse of agglomerative hierarchical clustering
by starting with all objects in one cluster.

♦ It subdivides the cluster into smaller and smaller pieces, until each object forms a
cluster on its own or until it satisfies certain termination conditions, such as a desired
number of clusters is obtained or the diameter of each cluster is within a certain
threshold.

Agglomerative versus divisive hierarchical clustering. Figure shows the


application of AGNES (AGglomerative NESting), an agglomerative hierarchical
clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical
clustering method, on
a data set of five objects, {a,b,c,d, e}. Initially, AGNES, the agglomerative method,
places each object into a cluster of its own. The clusters are then merged step-by-
step according to some criterion. For example, clusters C1 and C2 may be merged
if an object in C1 and an object in C2 form the minimum Euclidean distance
between any two objects from different clusters.
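For illustration, single-linkage agglomerative clustering (merging the pair of clusters with the minimum Euclidean distance between any two of their objects, as described above) can be sketched with SciPy. This assumes SciPy is available; the five points are toy stand-ins for the objects {a, b, c, d, e}.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data set of five objects.
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])

Z = linkage(X, method='single')        # single link: minimum inter-cluster distance
print(Z)                               # each row records one merge step
print(fcluster(Z, t=2, criterion='maxclust'))   # cut the tree into 2 clusters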

Density-based methods
❖ Most partitioning methods cluster objects based on the distance between objects.
Such methods can find only spherical-shaped clusters and encounter difficulty at
discovering clusters of arbitrary shapes.

❖ Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing the given cluster as long as the density in
the neighborhood exceeds some threshold; that is, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number
of points. Such a method can be used to filter out noise (outliers) and discover clusters
of arbitrary shape.

❖ DBSCAN and its extension, OPTICS, are typical density-based methods that
grow clusters according to a density-based connectivity analysis. DENCLUE is a
method that clusters objects based on the analysis of the value distributions of
density
functions.

Grid-Based Methods
❖ Grid-based methods quantize the object space into a finite number of cells that
form a grid structure.

❖ All of the clustering operations are performed on the grid structure i.e., on the
quantized space. The main advantage of this approach is its fast processing time,
which is typically independent of the number of data objects and dependent only on
the number of cells in each dimension in the quantized space.

❖ STING is a typical example of a grid-based method. Wave Cluster applies wavelet


transformation for clustering analysis and is both grid-based and density-based.

STING: STatistical INformation Grid


STING is a grid-based multi resolution clustering technique in which the embedding
spatial area of the input objects is divided into rectangular cells. The space can be
divided in a hierarchical and recursive way. Several levels of such rectangular cells
correspond to different levels of resolution and form a hierarchical structure: Each
cell at a high level is partitioned to form a number of cells at the next lower level.
Statistical information regarding the attributes in each grid cell, such as the mean,
maximum, and minimum values, is pre computed and stored as statistical
parameters. These statistical parameters are useful for query processing and for other
data analysis tasks.
Figure shows a hierarchical structure for STING clustering.

The statistical parameters of higher-level cells can easily be computed from the
parameters of the lower-level cells. These parameters include the following: the
attribute-independent parameter, count; and the attribute-dependent parameters,
mean, stdev (standard deviation), min (minimum), max (maximum), and the type
of distribution that the attribute value in the cell follows such as normal, uniform,
exponential, or none (if the distribution is unknown). Here, the attribute is a
selected measure for analysis such as price for house objects. When the data are
loaded into the database, the parameters count, mean, stdev, min, and max of the
bottom-level cells are calculated directly from the data.

STING offers several advantages:


(1) the grid-based computation is query-independent because the statistical
information stored in each cell represents the summary information of the data in the
grid cell, independent of the query

(2) the grid structure facilitates parallel processing and incremental updating and
(3) the method’s efficiency is a major advantage: STING goes through the database
once to compute the statistical parameters of the cells, and hence the time complexity
of generating clusters is O(n), where n is the total number of objects.
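A brief Python sketch of precomputing the bottom-level cell statistics (count, mean, stdev, min, max) for a 1-D attribute over a grid. The cell width and the attribute values are hypothetical, chosen only for the illustration.

import statistics

def cell_statistics(values, cell_width):
    """Group attribute values into grid cells of the given width and
    precompute count, mean, stdev, min, and max for each occupied cell."""
    cells = {}
    for v in values:
        cells.setdefault(int(v // cell_width), []).append(v)
    stats = {}
    for cell, vs in cells.items():
        stats[cell] = {
            'count': len(vs),
            'mean': statistics.mean(vs),
            'stdev': statistics.pstdev(vs),
            'min': min(vs),
            'max': max(vs),
        }
    return stats

prices = [120, 135, 150, 310, 325, 900]          # e.g., house prices (in $1000s)
print(cell_statistics(prices, cell_width=100))   # one entry per occupied cell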

CLIQUE: An Apriori-like Subspace Clustering Method


♦ CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-
based clusters in subspaces. CLIQUE partitions each dimension into nonoverlapping
intervals, thereby partitioning the entire embedding space of the data objects into
cells.

♦ It uses a density threshold to identify dense cells and sparse ones. A cell is dense
if the number of objects mapped to it exceeds the density threshold.

♦ The main strategy behind CLIQUE for identifying a candidate search space uses
the monotonicity of dense cells with respect to dimensionality.

♦ CLIQUE performs clustering in two steps. In the first step, CLIQUE partitions the
d-dimensional data space into nonoverlapping rectangular units, identifying the
dense units among these. CLIQUE finds dense cells in all of the subspaces. To do
so,CLIQUE partitions every dimension into intervals, and identifies intervals
containing at least l points, where l is the density threshold.

♦ In the second step, CLIQUE uses the dense cells in each subspace to assemble
clusters, which can be of arbitrary shape. The idea is to apply the Minimum
Description Length (MDL) principle (Chapter 8) to use the maximal regions to cover
connected dense cells, where a maximal region is a hyper rectangle where every cell
falling into this region is dense, and the region cannot be extended further in any
dimension in the subspace.
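A small sketch of the first step for a single dimension, identifying dense intervals (cells whose object count meets the density threshold l). The interval width and threshold are hypothetical values for the illustration.

def dense_intervals(values, width, l):
    """Partition a dimension into equal-width intervals and return those
    containing at least l points (the dense cells of this 1-D subspace)."""
    counts = {}
    for v in values:
        cell = int(v // width)
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: n for cell, n in counts.items() if n >= l}

ages = [22, 23, 24, 25, 41, 42, 60, 61, 62, 63]
print(dense_intervals(ages, width=10, l=3))   # {2: 4, 6: 4}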

Evaluation of clustering

The major tasks of clustering evaluation include the following:


Assessing clustering tendency:

In this task, for a given data set, we assess whether a nonrandom structure exists in
the data. Blindly applying a clustering method on a data set will return clusters;
however, the clusters mined may be misleading. Clustering analysis on a data set is
meaningful only when there is a nonrandom structure in the data.

Determining the number of clusters in a data set:

A few algorithms, such as k-means, require the number of clusters in a data set as
a parameter. Moreover, the number
of clusters can be regarded as an interesting and important summary statistic of a
data set. Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.

Measuring clustering quality:

After applying a clustering method on a data set, we want to assess how good the
resulting clusters are. A number of measures can be used. Some methods measure
how well the clusters fit the data set, while others measure how well the clusters
match the ground truth, if such truth is available. There are also measures that score
clusterings and thus can compare two sets of clustering results on the same data set.

Unit 5
Mining Complex Data Types

1) Mining Sequence Data

a) Mining Time Series

b) Mining Symbolic Sequences

c) Mining Biological Sequences

2) Mining Graphs and Networks

3) Mining Other Kinds of Data

Mining Sequence Data


♦ Similarity Search in Time-Series Data: subsequence matching, dimensionality
reduction, query-based similarity search, motif-based similarity search

♦ Regression and Trend Analysis in Time-Series Data: long-term + cyclic +
seasonal variation + random movements
♦ Sequential Pattern Mining in Symbolic Sequences: GSP, PrefixSpan, constraint-
based sequential pattern mining

• Sequence Classification: feature-based vs. sequence-distance-based vs. model-based

• Alignment of Biological Sequences: pair-wise vs. multi-sequence alignment,
substitution matrices, BLAST

• Hidden Markov Model for Biological Sequence Analysis: Markov chain vs. hidden
Markov models; forward vs. Viterbi vs. Baum-Welch algorithms

Mining Graphs and Networks


♦ Graph Pattern Mining

Frequent subgraph patterns, closed graph patterns, gSpan vs. CloseGraph

♦ Statistical Modeling of Networks

Small world phenomenon, power law (long-tail) distribution, densification

♦ Clustering and Classification of Graphs and Homogeneous Networks

Clustering: Fast Modularity vs. SCAN

Classification: model vs. pattern-based mining

♦ Clustering, Ranking and Classification of Heterogeneous Networks

RankClus, RankClass, and meta path-based, user-guided methodology

♦ Role Discovery and Link Prediction in Information Networks

PathPredict
• Similarity Search and OLAP in Information Networks: PathSim, GraphCube
• Evolution of Social and Information Networks: EvoNetClus

Mining Other Kinds of Data


• Mining Spatial Data: spatial frequent/co-located patterns, spatial clustering
and classification
• Mining Spatiotemporal and Moving Object Data: spatiotemporal data mining,
trajectory mining, periodica, swarm, …
• Mining Cyber-Physical System Data

Applications

Healthcare, air-traffic control, flood simulation

♦ Mining Multimedia Data

Social media data, geo-tagged spatial clustering, periodicity analysis, …

♦ Mining Text Data

Topic modeling, i-topic model, integration with geo- and networked data

♦ Mining Web Data

Web content, web structure, and web usage mining

♦ Mining Data Streams

Dynamics, one-pass, patterns, clustering, classification, outlier detection

Other Methodologies of Data

• Statistical Data Mining
• Views on Data Mining Foundations
• Visual and Audio Data Mining

Major Statistical Data Mining Methods
• Regression
• Generalized Linear Model
• Analysis of Variance
• Mixed-Effect Models
• Factor Analysis
• Discriminant Analysis
• Survival Analysis

Regression
Predict the value of a response (dependent) variable from one or more predictor
(independent) variables, where the variables are numeric. Forms of regression:
linear, multiple, weighted, polynomial, nonparametric, and robust.
Generalized linear models

Allow a categorical response variable (or some transformation of it) to be related to
a set of predictor variables, similar to the modeling of a numeric response variable
using linear regression. Generalized linear models include logistic regression and
Poisson regression.

Mixed-effect models
For analyzing grouped data, i.e., data that can be classified according to one or
more grouping variables. These models typically describe relationships between a
response variable and some covariates in data grouped according to one or more
factors.

Regression trees
Binary trees used for classification and prediction. Similar to decision trees: tests
are performed at the internal nodes. In a regression tree, the mean of the objective
attribute is computed and used as the predicted value.

Analysis of variance

Analyze experimental data for two or more populations described by a numeric


response variable and one or more categorical variables (factors)
Factor analysis

Determine which variables are combined to generate a given factor

e.g., for many psychiatric data, one can indirectly measure other quantities (such as
test scores) that reflect the factor of interest

Discriminant analysis

Predict a categorical response variable, commonly used in social science

Attempts to determine several discriminant functions (linear combinations of the


independent variables) that discriminate among the groups defined by the response
variable
Time series:

Many methods, such as autoregression, ARIMA (autoregressive integrated moving-
average) modeling, and long-memory time-series modeling.
Quality control: displays group summary charts.

Survival analysis

Predicts the probability that a patient undergoing a medical treatment would survive
at least to time t (life span prediction)
Views on Data Mining Foundations

♦ Data reduction

Basis of data mining: Reduce data representation

Trades accuracy for speed in response

♦ Data compression

Basis of data mining: Compress the given data by encoding in terms of bits,
association rules, decision trees, clusters, etc.

♦ Probability and statistical theory

Basis of data mining: Discover joint probability distributions of random variables

♦ Microeconomic view

A view of utility: finding patterns that are interesting only to the extent that they
can be used in the decision-making process of some enterprise

♦ Pattern Discovery and Inductive databases

Basis of data mining: Discover patterns occurring in the database, such as


associations, classification models, sequential patterns, etc.

♦ Data mining is the problem of performing inductive logic on databases. The task is
to query the data and the theory (i.e., patterns) of the database. This view is popular
among many researchers in database systems.

Visual Data Mining

Visualization: Use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data .
Visual Data Mining: discovering implicit but useful knowledge from large data
sets using visualization techniques

Visualization
♦ Purpose of Visualization

• Gain insight into an information space by mapping data onto graphical primitives
• Provide qualitative overviews of large data sets
• Search for patterns, trends, structure, irregularities, and relationships among data
• Help find interesting regions and suitable parameters for further quantitative
analysis
• Provide a visual proof of computer representations derived
♦ Integration of visualization and data mining

• data visualization
• data mining result visualization
• data mining process visualization
• interactive visual data mining

Data visualization

• Data in a database or data warehouse can be viewed at different levels of
abstraction, as different combinations of attributes or dimensions
• Data can be presented in various visual forms

Data Mining Result Visualization

Presentation of the results or knowledge obtained from data mining in visual forms

Examples

Scatter plots and boxplots (obtained from descriptive data mining)

Decision trees

Association rules

Clusters

Outliers

Generalized rules

Data Mining Process Visualization

Presentation of the various processes of data mining in visual forms so that users
can see:

• Data extraction process
• Where the data is extracted
• How the data is cleaned, integrated, preprocessed, and mined
• Method selected for data mining
• Where the results are stored
• How they may be viewed

Interactive Visual Data Mining


Using visualization tools in the data mining process to help users make smart data
mining decisions

Example

♦ Display the data distribution in a set of attributes using colored sectors or columns
(depending on whether the whole space is represented by either a circle or a set of
columns)

♦ Use the display to decide which sector should be selected first for classification and
where a good split point for this sector may be

Audio Data Mining

♦ Uses audio signals to indicate the patterns of data or the features of data mining
results. An interesting alternative to visual mining
♦ The inverse task, mining audio (such as music) databases, is to find patterns from
the audio data itself

♦ Visual data mining may disclose interesting patterns using graphical displays, but
requires users to concentrate on watching patterns

♦ Instead, transform patterns into sound and music and listen to pitches, rhythms,
tune, and melody in order to identify anything interesting or unusual

Data mining applications

Data mining:
A young discipline with broad and diverse applications. There still exists a nontrivial
gap between generic data mining methods and effective, scalable data mining tools
for domain-specific applications.

Some application domains (briefly discussed here):

♦ Data Mining for Financial data analysis

♦ Data Mining for Retail and Telecommunication Industries

♦ Data Mining in Science and Engineering

♦ Data Mining for Intrusion Detection and Prevention

♦ Data Mining and Recommender Systems

Data Mining for Financial Data Analysis

♦ Financial data collected in banks and financial institutions are often relatively
complete, reliable, and of high quality

♦ Design and construction of data warehouses for multidimensional data analysis
and data mining

♦ View the debt and revenue changes by month, by region, by sector, and by other
factors
♦ Access statistical information such as max, min, total, average, trend, etc.

♦ Loan payment prediction/consumer credit policy analysis: feature selection and
attribute relevance ranking, loan payment performance

♦ Consumer credit rating

♦ Classification and clustering of customers for targeted marketing:
multidimensional segmentation by nearest-neighbor, classification, decision trees,
etc. to identify customer groups or to associate a new customer with an appropriate
customer group

♦ Detection of money laundering and other financial crimes: integration of data from
multiple DBs (e.g., bank transactions, federal/state crime history DBs)

Tools: data visualization, linkage analysis, classification, clustering tools, outlier
analysis, and sequential pattern analysis tools (to find unusual access sequences)
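
As an illustration of the clustering tools mentioned above, a minimal sketch that
groups customers and assigns a new customer to a group using scikit-learn's KMeans
(the two-feature customer records and the choice of three clusters are made up):

# Cluster existing customers, then place a new customer into the nearest group
from sklearn.cluster import KMeans

# Each row: [annual income in thousands, credit utilization %] (hypothetical)
customers = [[30, 80], [32, 75], [85, 20], [90, 15], [60, 50], [58, 55]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)               # customer group of each existing customer
print(kmeans.predict([[88, 18]]))   # group assigned to a new customer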

Data Mining for Retail & Telcomm. Industries

Retail industry: huge amounts of data on sales, customer shopping history,
e-commerce, etc.

Applications of retail data mining

♦ Identify customer buying behaviors

♦ Discover customer shopping patterns and trends

♦ Improve the quality of customer service

♦ Achieve better customer retention and satisfaction

♦ Enhance goods consumption ratios

♦ Design more effective goods transportation and distribution policies


♦ Telcomm. and many other industries share many similar goals and expectations
with retail data mining

Data Mining Practice for Retail Industry

♦ Design and construction of data warehouses

♦ Multidimensional analysis of sales, customers, products, time, and region

♦ Analysis of the effectiveness of sales campaigns

♦ Customer retention: Analysis of customer loyalty

♦ Use customer loyalty card information to register sequences of purchases of
particular customers

♦ Use sequential pattern mining to investigate changes in customer consumption or
loyalty

♦ Suggest adjustments on the pricing and variety of goods

♦ Product recommendation and cross-reference of items

♦ Fraud analysis and the identification of unusual patterns

♦ Use of visualization tools in data analysis

Data Mining in Science and Engineering

Data warehouses and data preprocessing

Resolving inconsistencies or incompatible data collected in diverse environments
and different periods (e.g., eco-system studies)

Mining complex data types


♦ Spatiotemporal, biological, diverse semantics and relationships

♦ Graph-based and network-based mining

♦ Links, relationships, data flow, etc.

♦ Visualization tools and domain-specific knowledge

Other issues

♦ Data mining in social sciences and social studies: text and social media

♦ Data mining in computer science: monitoring systems, software bugs, network
intrusion

Data Mining for Intrusion Detection and Prevention

The majority of intrusion detection and prevention systems use

♦ Signature-based detection: uses signatures, i.e., attack patterns that are
preconfigured and predetermined by domain experts

♦ Anomaly-based detection: builds profiles (models of normal behavior) and detects
activities that deviate substantially from those profiles
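
A minimal sketch of anomaly-based detection via outlier analysis, here using
scikit-learn's IsolationForest as one possible profile model (the tiny connection
features and the contamination setting are made up, not taken from these notes):

# Build a profile from normal traffic, then flag events that deviate from it
from sklearn.ensemble import IsolationForest

# Each row: [packets per second, distinct ports contacted] for a host (hypothetical)
normal_traffic = [[10, 2], [12, 3], [9, 2], [11, 2], [10, 3], [13, 2]]
new_events = [[11, 2], [250, 40]]   # the second event deviates strongly from the profile

model = IsolationForest(contamination=0.1, random_state=0).fit(normal_traffic)
print(model.predict(new_events))    # -1 flags a likely intrusion, 1 means normal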

How data mining can help

New data mining algorithms for intrusion detection

♦ Association, correlation, and discriminative pattern analysis help select and build
discriminative classifiers

♦ Analysis of stream data: outlier detection, clustering, model shifting

♦ Distributed data mining

♦ Visualization and querying tools


Data Mining and Recommender Systems

♦ Recommender systems: Personalization, making product recommendations that
are likely to be of interest to a user

♦ Approaches: Content-based, collaborative, or their hybrid

♦ Content-based: Recommends items that are similar to items the user preferred or
queried in the past

♦ Collaborative filtering: Considers a user’s social environment, i.e., the opinions of
other customers who have similar tastes or preferences

Data mining and recommender systems

♦ Users C × items S: extrapolate from known ratings to predict unknown user-item
combinations

♦ Memory-based method often uses a k-nearest-neighbor approach (a sketch follows
this list)

♦ Model-based method uses a collection of ratings to learn a model (e.g.,
probabilistic models, clustering, Bayesian networks, etc.)

♦ Hybrid approaches integrate both to improve performance (e.g., using ensemble)
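
A minimal sketch of the memory-based (user-based k-nearest-neighbor) approach
with NumPy; the small rating matrix is made up, 0 denotes "not yet rated", and
cosine similarity is one common choice of similarity measure:

# Predict an unknown rating from the most similar users who rated the item
import numpy as np

ratings = np.array([        # rows = users C, columns = items S
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def cosine(a, b):
    return ratings[a] @ ratings[b] / (np.linalg.norm(ratings[a]) * np.linalg.norm(ratings[b]))

def predict(user, item, k=2):
    # keep only the other users who have actually rated this item
    raters = [u for u in range(len(ratings)) if u != user and ratings[u, item] > 0]
    neighbors = sorted(raters, key=lambda u: cosine(user, u), reverse=True)[:k]
    sims = np.array([cosine(user, u) for u in neighbors])
    vals = np.array([ratings[u, item] for u in neighbors])
    return float(vals @ sims / sims.sum())   # similarity-weighted average rating

print(predict(user=0, item=2))   # estimated rating of item 2 for user 0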


Data Mining and Society

Ubiquitous and Invisible Data Mining:

Ubiquitous Data Mining

Data mining is used everywhere, e.g., online shopping

Ex. Customer relationship management (CRM)

Invisible Data Mining

Invisible: Data mining functions are built into daily-life operations

Ex. Google search: Users may be unaware that they are examining results returned
by data mining

♦ Invisible data mining is highly desirable

♦ Invisible mining needs to consider efficiency and scalability, user interaction,
incorporation of background knowledge and visualization techniques, finding
interesting patterns, real-time requirements, …

Further work: Integration of data mining into existing business and scientific
technologies to provide domain-specific data mining tools

Privacy, Security and Social Impacts of Data Mining


Many data mining applications do not touch personal data

E.g., meteorology, astronomy, geography, geology, biology, and other scientific and
engineering data

Many DM studies are on developing scalable algorithms to find general or
statistically significant patterns, not touching individuals

♦ The real privacy concern: unconstrained access to individual records, especially
privacy-sensitive information

Method 1: Removing sensitive IDs associated with the data

Method 2: Data security-enhancing methods

♦ Multi-level security model: permits access only to the authorized level

Encryption: e.g., blind signatures, biometric encryption, and anonymous databases
(personal information is encrypted and stored at different locations)

Method 3: Privacy-preserving data mining methods


Privacy-Preserving Data Mining

♦ Obtaining valid mining results without disclosing the underlying sensitive data
values

♦ Often needs a trade-off between information loss and privacy

Privacy-preserving data mining methods:

Randomization (e.g., perturbation): Add noise to the data in order to mask some
attribute values of records (a sketch appears at the end of this list)

K-anonymity and l-diversity: Alter individual records so that they cannot be
uniquely identified

k-anonymity: Any given record maps onto at least k other records

l-diversity: enforcing intra-group diversity of sensitive values

Distributed privacy preservation: Data partitioned and distributed either
horizontally, vertically, or a combination of both

Downgrading the effectiveness of data mining: The output of data mining may
violate privacy

Modify data or mining results, e.g., hiding some association rules or slightly
distorting some classification models
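
A minimal sketch of the randomization (perturbation) method mentioned above:
additive noise masks individual attribute values while approximately preserving
aggregate statistics (the salary values and the noise scale are made up):

# Release noisy values instead of the true sensitive values
import numpy as np

rng = np.random.default_rng(seed=42)
salaries = np.array([52_000, 61_000, 47_500, 75_000, 58_200], dtype=float)

noise = rng.normal(loc=0.0, scale=5_000.0, size=salaries.shape)   # masking noise
perturbed = salaries + noise      # values released for mining instead of the originals

print(salaries.mean(), perturbed.mean())   # the aggregate mean is roughly preserved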

Data Mining Trends

Application exploration: Dealing with application-specific problems

Scalable and interactive data mining methods

Integration of data mining with Web search engines, database systems, data
warehouse systems and cloud computing systems

Mining social and information networks


Mining spatiotemporal, moving objects and cyber-physical systems

Mining multimedia, text and web data

Mining biological and biomedical data

Data mining with software engineering and system engineering

Visual and audio data mining

Distributed data mining and real-time data stream mining

Privacy protection and information security in data mining
