
Chapter – 1

• Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are
deployed to scour large databases in order to find novel and useful
patterns that might otherwise remain unknown.
• Data mining is an integral part of knowledge discovery in
databases (KDD), which is the overall process of converting raw
data into useful information.
• The purpose of preprocessing is to transform the raw input data
into an appropriate format for subsequent analysis. The steps
involved in data preprocessing include fusing data from multiple
sources, cleaning data to remove noise and duplicate observations,
and selecting records and features that are relevant to the data
mining task at hand.
• Data mining tasks are generally divided into two major categories:-
◦ Predictive tasks. The objective of these tasks is to predict the
value of a particular attribute based on the values of other
attributes. The attribute to be predicted is commonly known as
the target or dependent variable, while the attributes used for
making the prediction are known as the explanatory or
independent variables.
◦ Descriptive tasks. Here, the objective is to derive patterns
(correlations, trends, clusters, trajectories, and anomalies) that
summarize the underlying relationships in data. Descriptive data
mining tasks are often exploratory in nature and frequently
require postprocessing techniques to validate and explain the
results.
• Predictive modeling refers to the task of building a model for the
target variable as a function of the explanatory variables. There are
two types of predictive modeling tasks: classification, which is used
for discrete target variables, and regression, which is used for
continuous target variables.
• Association analysis is used to discover patterns that describe
strongly associated features in the data. The discovered patterns
are typically represented in the form of implication rules or feature
subsets.
• Cluster analysis seeks to find groups of closely related
observations so that observations that belong to the same cluster
are more similar to each other than observations that belong to
other clusters.
• Anomaly detection is the task of identifying observations whose
characteristics are significantly different from the rest of the data.
Such observations are known as anomalies or outliers.

Chapter – 2

• A data set can often be viewed as a collection of data objects. Other names for a data object are record, point, vector, pattern, instance,
case, sample, observation, or entity. In turn, data objects are
described by a number of attributes that capture the basic
characteristics of an object, such as the mass of a physical object or
the time at which an event occurred. Other names for an attribute
are variable, characteristic, field, feature, and dimension.
• An attribute is a property or characteristic of an object that may
vary; either from one object to another or from one time to another.
• A measurement scale is a rule (function) that associates a
numerical or symbolic value with an attribute of an object.
• The term measurement error refers to any problem resulting from
the measurement process. A common problem is that the value
recorded differs from the true value to some extent.
• Noise is the random component of a measurement error. It may
involve the distortion of a value or the addition of spurious objects.
• Data errors may be the result of a more deterministic phenomenon,
such as a streak in the same place on a set of photographs. Such
deterministic distortions of the data are often referred to as
artifacts.
• The closeness of repeated measurements (of the same quantity) to
one another is called Precision.
• A systematic variation of measurements from the quantity being measured is called Bias.
• The closeness of measurements to the true value of the quantity
being measured is called Accuracy.
• Variance is the change in prediction accuracy of an ML model between training data and test data. Simply put, if an ML model predicts with an accuracy of "x" on training data and its prediction accuracy on test data is "y", then Variance = x – y.

• In a nutshell, Bias is a measure of how far the expected value of the estimate is from the true value of the parameter being estimated.
Precision is a measure of how similar the multiple estimates are to
each other, not how close they are to the true value. Precision and
bias are two different components of Accuracy.
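• As a minimal sketch (not from the source notes), the snippet below illustrates these three notions on hypothetical repeated measurements of a quantity whose true value is assumed to be 10.0:

    import numpy as np

    true_value = 10.0
    measurements = np.array([10.3, 10.1, 10.2, 10.4, 10.2])  # made-up repeated measurements

    bias = measurements.mean() - true_value      # systematic deviation from the true value
    precision = measurements.std(ddof=1)         # spread of the measurements around their own mean
    accuracy_error = abs(measurements.mean() - true_value)   # closeness to the true value
    print(bias, precision, accuracy_error)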
• Outliers are either (1) data objects that, in some sense, have
characteristics that are different from most of the other data objects
in the data set, or (2) values of an attribute that are unusual with
respect to the typical values for that attribute.
• Data Preprocessing techniques:- 1.) Aggregation 2.) Sampling 3.)
Dimensionality reduction 4.) Feature subset selection 5.) Feature
creation 6.) Discretization and binarization 7.) Variable
transformation.
• Attribute types are:- Nominal, Ordinal, Interval and Ratio.
• For continuous attributes, the numerical difference of the
measured and true value is called the error. The term data
collection error refers to errors such as omitting data objects or
attribute values, or inappropriately including a data object.
• A Discrete attribute has a finite or countably infinite set of values.
Such attributes can be categorical, such as zip codes or ID numbers,
or numeric, such as counts. Discrete attributes are often
represented using integer variables. Binary attributes are a special
case of discrete attributes and assume only two values, e.g.,
true/false, yes/no, male/female, or 0/1. Binary attributes are often
represented as Boolean variables, or as integer variables that only
take the values 0 or 1.
• A Continuous attribute is one whose values are real numbers.
Examples include attributes such as temperature, height, or weight.
Continuous attributes are typically represented as floating-point
variables. Practically, real values can only be measured and
represented with limited precision.
• Similarity and Dissimilarity between Objects.
• A given distance (e.g., a dissimilarity) is a metric if and only if it satisfies the following four conditions:
1- Non-negativity: d(p, q) ≥ 0 for all observations p and q.

2- Symmetry: d(p, q) = d(q, p) for all p and q.

3- Triangle Inequality: d(p, q) ≤ d(p, r) + d(r, q) for all p, q, r.

4- d(p, q) = 0 only if p = q.

• Methods for handling missing values:
1. Delete Rows.
2. Replace with mean/median/mode.
3. Predict the data.
4. Use such an algorithm that can tolerate missing values.
5. Assign a unique category (such as categorizing NaN as
unknown).
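• As an illustrative sketch (not from the source notes), the pandas snippet below shows three of these strategies on a tiny made-up data frame; the column names are hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, 30, np.nan, 40], "city": ["NY", None, "LA", "SF"]})

    dropped = df.dropna()                                         # 1. delete rows with missing values
    imputed = df.assign(age=df["age"].fillna(df["age"].mean()))   # 2. replace with the mean
    labeled = df.assign(city=df["city"].fillna("unknown"))        # 5. assign a unique category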
Chapter – 3

• Classification is the task of learning a target function f(x) that maps each attribute set x to one of the predefined class labels y. In simple words, it is the process of classifying a data point into one of the pre-defined categories.
• Classification techniques are most suited for predicting or
describing data sets with binary or nominal categories. They are less
effective for ordinal categories (e.g., to classify a person as a member of a high-, medium-, or low-income group) because they do
not consider the implicit order among the categories.
• A classification technique (or classifier) is a systematic approach to
building classification models from an input data set. Examples
include decision tree classifiers, rule-based classifiers, neural networks, support vector machines, and naive Bayes classifiers.
Each technique employs a learning algorithm to identify a model
that best fits the relationship between the attribute set and class
label of the input data. The model generated by a learning algorithm
should both fit the input data well and correctly predict the class
labels of records it has never seen before. Therefore, a key objective
of the learning algorithm is to build models with good generalization
capability; i.e., models that accurately predict the class labels of
previously unknown records.
• General approach for building a classification model: First, a training
set consisting of records whose class labels are known must be
provided. The training set then is used to build a classification
model, which is subsequently applied to the test set, which consists
of records with unknown class labels.
• A confusion matrix is a table that is often used to describe the
performance of a classification model (or "classifier") on a set of
test data for which the true values are known. The confusion matrix
itself is relatively simple to understand, but the related terminology
can be confusing. For example,

• True Negative Rate (TNR) = Specificity = TN / (TN + FP), True Positive Rate (TPR) = Sensitivity = TP / (TP + FN). (T = True, F = False, P = Positive, N = Negative.)
• False Positive Rate (FPR) = FP / (TN + FP), False Negative Rate
(FNR) = FN / (TP + FN).
• Precision = TP / (TP + FP), Precision determines the fraction of records that actually turn out to be positive in the group the classifier has declared as the positive class.
• Recall = TP / (TP + FN), Recall measures the fraction of positive
examples correctly predicted by the classifier.
• Accuracy = (TP + TN) / (TP + TN + FP + FN), Error = (FP + FN) / (TP + TN
+ FP + FN).
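• As a quick sketch (not from the source notes), the snippet below computes these measures from hypothetical confusion-matrix counts; the numbers are made up:

    TP, FP, TN, FN = 40, 10, 45, 5

    tpr = TP / (TP + FN)          # sensitivity / recall
    tnr = TN / (TN + FP)          # specificity
    fpr = FP / (TN + FP)
    fnr = FN / (TP + FN)
    precision = TP / (TP + FP)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    error = (FP + FN) / (TP + TN + FP + FN)
    print(tpr, tnr, precision, accuracy, error)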
• A decision tree is a flowchart-like structure in which each internal
node represents a "test" on an attribute (e.g. whether a coin flip
comes up heads or tails), each branch represents the outcome of
the test, and each leaf node represents a class label (decision taken
after computing all attributes).
• A learning algorithm for inducing decision trees must address the
following two design issues:-
1. How should the training records be split? Each recursive
step of the tree-growing process must select an attribute test
condition to divide the records into smaller subsets. To
implement this step, the algorithm must provide a method for
specifying the test condition for different attribute types as
well as an objective measure for evaluating the goodness of
each test condition.
2. How should the splitting procedure stop? A stopping
condition is needed to terminate the tree-growing process. A
possible strategy is to continue expanding a node until either
all the records belong to the same class or all the records have
identical attribute values.
• Formula for Entropy (degree of uncertainty): Entropy(t) = − Σ_i p(i|t) log2 p(i|t), where p(i|t) is the fraction of records belonging to class i at node t.

• Formula for Information Gain (degree to which the uncertainty has been reduced): Gain = Entropy(parent) − Σ_j [N(v_j)/N] · Entropy(v_j), where v_j are the child nodes of the split, N(v_j) is the number of records at child v_j, and N is the number of records at the parent.

• Formula for Gini Index (another method for choosing an attribute): Gini(t) = 1 − Σ_i [p(i|t)]^2.

• Note that (Entropy, Information Gain) and (Gini Index) are two separate criteria for deciding the best split. Also, remember that for a two-class problem the maximum value of Entropy is 1, while for the Gini Index it is 0.5.
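• As a minimal sketch (not from the source notes), the snippet below computes these impurity measures for a hypothetical split of a parent node with class counts [6, 4] into two children:

    import math

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def information_gain(parent, children):
        n = sum(parent)
        weighted = sum(sum(child) / n * entropy(child) for child in children)
        return entropy(parent) - weighted

    parent = [6, 4]                 # made-up class counts at the parent node
    children = [[5, 1], [1, 3]]     # class counts at the two child nodes after the split
    print(entropy(parent), gini(parent), information_gain(parent, children))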
Chapter – 4

• Rule-based classifiers are another type of classifier that makes the class decision by applying a set of "if...then" rules. These rules are easily interpretable and thus these classifiers are generally used to generate descriptive models. The condition used with "if" is called the antecedent and the predicted class of each rule is called the consequent.
• The left-hand side of the rule is called the rule antecedent or
precondition.
• The right-hand side of the rule is called the rule consequent, which
contains the predicted class yi.
• Formulas:
Coverage(r) = |A| / |D|, Accuracy(r) = |A ∩ y| / |A|,
where |A| is the number of records that satisfy the rule antecedent, |A ∩ y| is the number of records that satisfy both the antecedent and the consequent, and |D| is the total number of records.
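• As an illustrative sketch (not from the source notes), the snippet below computes coverage and accuracy for one hypothetical rule (IF income = high THEN class = buys) over a made-up set of records:

    records = [
        {"income": "high", "student": "no",  "label": "buys"},
        {"income": "high", "student": "yes", "label": "buys"},
        {"income": "low",  "student": "no",  "label": "no_buy"},
        {"income": "high", "student": "no",  "label": "no_buy"},
    ]
    antecedent = lambda r: r["income"] == "high"   # rule antecedent
    consequent = "buys"                            # rule consequent

    A = [r for r in records if antecedent(r)]                # records covered by the antecedent
    A_and_y = [r for r in A if r["label"] == consequent]     # covered records with the predicted class

    coverage = len(A) / len(records)   # |A| / |D|
    accuracy = len(A_and_y) / len(A)   # |A ∩ y| / |A|
    print(coverage, accuracy)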
• Properties:
- Coverage: The percentage of records which satisfy the
antecedent conditions of a particular rule.

- The rules generated by the rule-based classifiers are generally not mutually exclusive, i.e. many rules can cover the
same record.

- The rules generated by the rule-based classifiers may not be exhaustive, i.e. there may be some records which are not
covered by any of the rules.

- The decision boundaries created by rule-based classifiers are rectilinear; however, the resulting model can be more complex than a decision tree because several rules may be triggered for the same record.

• If the rule set is not exhaustive, then a default rule, rd: () --> yd, must
be added to cover the remaining cases. A default rule has an empty
antecedent and is triggered when all other rules have failed. yd is
known as the default class and is typically assigned to the majority
class of training records not covered by the existing rules.
• If the rule set is not mutually exclusive, then a record can be
covered by several rules, some of which may predict conflicting
classes. There are two ways to overcome this problem:-
- Ordered Rules: In this approach, the rules in a rule set are
ordered in decreasing order of their priority, which can be
defined in many ways (e.g., based on accuracy, coverage, total
description length, or the order in which the rules are
generated). An ordered rule set is also known as a decision list.
When a test record is presented, it is classified by the highest-
ranked rule that covers the record. This avoids the problem of
having conflicting classes predicted by multiple classification
rules.

- Unordered Rules: This approach allows a test record to trigger multiple classification rules and considers the
consequent of each rule as a vote for a particular class. The
votes are then tallied to determine the class label of the test
record. The record is usually assigned to the class that receives
the highest number of votes. In some cases, the vote may be
weighted by the rule's accuracy.
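• As a minimal sketch (not from the source notes), the snippet below classifies a record with an ordered rule set plus a default rule; the attributes, conditions, and class labels are made up for illustration:

    # Rules are (condition, class) pairs listed in decreasing order of priority.
    rules = [
        (lambda r: r["blood_type"] == "warm" and r["gives_birth"] == "yes", "mammal"),
        (lambda r: r["blood_type"] == "cold",                               "reptile"),
    ]
    default_class = "non-mammal"   # default rule () --> yd for records no other rule covers

    def classify(record):
        for condition, label in rules:
            if condition(record):
                return label       # the highest-ranked rule that covers the record wins
        return default_class

    print(classify({"blood_type": "warm", "gives_birth": "yes"}))  # -> mammal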

• Rule ordering can be implemented on a rule-by-rule basis or on a class-by-class basis:

- Rule-Based Ordering Scheme: This approach orders the individual rules by some rule quality measure. This ordering scheme ensures that every test record is classified by the "best" rule covering it. A potential drawback of this scheme is
that lower-ranked rules are much harder to interpret because
they assume the negation of the rules preceding them. If the
number of rules is large, interpreting the meaning of the rules
residing near the bottom of the list can be a cumbersome task.

- Class-Based Ordering Scheme: In this approach, rules that belong to the same class appear together in the rule set R. The
rules are then collectively sorted on the basis of their class
information. The relative ordering among the rules from the
same class is not important; as long as one of the rules fires,
the class will be assigned to the test record. This makes rule
interpretation slightly easier. However, it is possible for a high-
quality rule to be overlooked in favor of an inferior rule that
happens to predict the higher-ranked class.

• The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both
classification and regression problems. It is a non-parametric, lazy
learning algorithm. Its purpose is to use a database in which the
data points are separated into several classes to predict the
classification of a new sample point.

• KNN tries to predict the correct class for the test data by calculating
the distance between the test data and all the training points.

• If k is too small, then the nearest-neighbor classifier may be susceptible to overfitting because of noise in the training data. On
the other hand, if k is too large, the nearest-neighbor classifier may
misclassify the test instance because its list of nearest neighbors
may include data points that are located far away from its
neighborhood.
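• As a minimal sketch (not from the source notes), the snippet below implements the basic KNN idea — compute distances to all training points, take the k nearest, and vote — on made-up two-dimensional data:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k=3):
        dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to every training point
        nearest = np.argsort(dists)[:k]                    # indices of the k closest points
        votes = Counter(y_train[i] for i in nearest)       # majority vote among the neighbors
        return votes.most_common(1)[0][0]

    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
    y_train = np.array(["A", "A", "B", "B"])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: "A"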

• The characteristics of the nearest-neighbor classifier are summarized below:

- Nearest-neighbor classification is part of a more general technique known as instance-based learning, which uses specific
training instances to make predictions without having to maintain
an abstraction (or model) derived from data. Instance-based
learning algorithms require a proximity measure to determine
the similarity or distance between instances and a classification
function that returns the predicted class of a test instance based
on its proximity to other instances.

- Lazy learners such as nearest-neighbor classifiers do not require model building. However, classifying a test example can
be quite expensive because we need to compute the proximity
values individually between the test and training examples. In
contrast, eager learners often spend the bulk of their computing
resources for model building. Once a model has been built,
classifying a test example is extremely fast.

- Nearest-neighbor classifiers make their predictions based on local information, whereas decision tree and rule-based
classifiers attempt to find a global model that fits the entire input
space. Because the classification decisions are made locally,
nearest-neighbor classifiers (with small values of k) are quite
susceptible to noise.

- Nearest-neighbor classifiers can produce arbitrarily shaped decision boundaries. Such boundaries provide a more flexible
model representation compared to decision tree and rule-based
classifiers that are often constrained to rectilinear decision
boundaries. Increasing the number of nearest neighbors may
reduce such variability.

- Nearest-neighbor classifiers can produce wrong predictions unless the appropriate proximity measure and data
preprocessing steps are taken.

• Bayes' Theorem states that the conditional probability of an event, based on the occurrence of another event, is equal to the likelihood
of the second event given the first event multiplied by the
probability of the first event.
• Conditional Probability : Probability of one (or more) event given
the occurrence of another event, e.g. P(A given B) or P(A | B).

• Joint Probability: Probability of two (or more) simultaneous events, e.g. P(A and B) or P(A, B).

• Marginal Probability: The probability of an event irrespective of the outcomes of other random variables, e.g. P(A).

• In general, the result P(A|B) is referred to as the posterior probability and P(A) is referred to as the prior probability.
Sometimes P(B|A) is referred to as the likelihood and P(B) is
referred to as the evidence.
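• As a small worked sketch (not from the source notes), the snippet below applies P(A|B) = P(B|A) · P(A) / P(B) with made-up numbers, computing the evidence P(B) by total probability:

    prior_A = 0.01            # P(A)
    likelihood = 0.90         # P(B | A)
    p_B_given_not_A = 0.05    # P(B | not A), assumed for the example

    evidence = likelihood * prior_A + p_B_given_not_A * (1 - prior_A)   # P(B)
    posterior = likelihood * prior_A / evidence                         # P(A | B)
    print(round(posterior, 3))   # about 0.154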

• A Naive Bayes classifier is a probabilistic machine learning model that's used for classification tasks. The crux of the classifier is based
on the Bayes theorem.

• Using Bayes theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the
hypothesis. The assumption made here is that the
predictors/features are independent. That is, the presence of one
particular feature does not affect the other. Hence it is called naive.

• For a categorical attribute Xi, the conditional probability P(Xi = xi | Y = y) is estimated according to the fraction of training instances in
class y that take on a particular attribute value xi.

• We can discretize each continuous attribute and then replace the continuous attribute value with its corresponding discrete interval.
This approach transforms the continuous attributes into ordinal
attributes.
• We can assume a certain form of probability distribution for the
continuous variable and estimate the parameters of the
distribution using the training data. A Gaussian distribution is
usually chosen to represent the class-conditional probability for
continuous attributes.
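• As a minimal sketch (not from the source notes), the snippet below estimates a Gaussian class-conditional probability density for one continuous attribute from made-up training values belonging to a single class:

    import math

    def gaussian_pdf(x, mean, var):
        # Class-conditional density P(Xi = x | Y = y) under the Gaussian assumption.
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    values = [95.0, 85.0, 75.0, 90.0, 70.0]     # attribute values of training records in class y
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)   # sample variance
    print(gaussian_pdf(80.0, mean, var))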

Important Stuff

• The measures proposed for analyzing relationships between pairs of binary variables can be divided into two categories, symmetric
and asymmetric measures. A measure M is symmetric if M(A --> B)
= M(B --> A). For example, interest factor is a symmetric measure
because its value is identical for the rules A --> B and B --> A. In
contrast, confidence is an asymmetric measure since the
confidence for A --> B and B --> A may not be the same.

• A complete clustering assigns every object to a cluster, whereas a partial clustering does not. The motivation for a partial clustering
is that some objects in a data set may not belong to well-defined
groups. Many times objects in the data set may represent noise,
outliers, or "uninteresting background”.

• An association rule having support and confidence greater than or equal to a user-specified minimum support threshold and a minimum confidence threshold, respectively, is called a strong rule.
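• As an illustrative sketch (not from the source notes), the snippet below computes support and confidence for the rule {bread} --> {milk} over made-up transactions and checks it against assumed thresholds; it also shows that confidence is asymmetric:

    transactions = [{"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"milk"}, {"milk"}]
    A, B = {"bread"}, {"milk"}

    n = len(transactions)
    support_AB = sum((A | B).issubset(t) for t in transactions) / n   # s(A --> B)
    support_A = sum(A.issubset(t) for t in transactions) / n
    support_B = sum(B.issubset(t) for t in transactions) / n
    confidence_AB = support_AB / support_A                            # c(A --> B)
    confidence_BA = support_AB / support_B                            # c(B --> A), generally different

    min_sup, min_conf = 0.3, 0.6        # assumed user-specified thresholds
    is_strong = support_AB >= min_sup and confidence_AB >= min_conf
    print(support_AB, confidence_AB, confidence_BA, is_strong)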

• Cluster Proximity.

• A variable transformation refers to a transformation that is
applied to all the values of a variable. In other words, for each
object, the transformation is applied to the value of the variable for
that object. For example, if only the magnitude of a variable is
important, then the values of the variable can be transformed by
taking the absolute value.
• Aggregation in data mining is the process of finding, collecting, and
presenting the data in a summarized format to perform statistical
analysis of business schemes or analysis of human patterns.
Aggregation characteristics:-

▪ There are several motivations for aggregation. First, the smaller data sets resulting from data reduction require less
memory and processing time, and hence, aggregation may
permit the use of more expensive data mining algorithms.

▪ Second, aggregation can act as a change of scope or scale by providing a high-level view of the data instead of a low-level
view.
▪ Finally, the behavior of groups of objects or attributes is often
more stable than that of individual objects or attributes.
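• As an illustrative sketch (not from the source notes), the pandas snippet below aggregates made-up daily sales records into one summary row per store; the column names are hypothetical:

    import pandas as pd

    sales = pd.DataFrame({
        "store":  ["S1", "S1", "S2", "S2"],
        "date":   ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"],
        "amount": [120.0, 80.0, 200.0, 150.0],
    })
    per_store = sales.groupby("store", as_index=False)["amount"].sum()   # aggregated (summarized) view
    print(per_store)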

• KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process, which deals with
identifying patterns in data.

• Sampling is a method that allows us to get information about the population based on the statistics from a subset of the population
(sample), without having to investigate every individual.

• In simple random sampling, the researcher selects the participants randomly. There are a number of data analytic tools, such as random number generators and random number tables, that are based entirely on chance.

• Progressive Sampling (PS) starts with a small sample from the full dataset and uses progressively larger samples until the model accuracy no longer increases substantially.
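• As a rough sketch (not from the source notes), the snippet below draws a simple random sample and mimics progressive sampling; model_accuracy is a made-up placeholder for training and evaluating a model on a sample:

    import numpy as np

    rng = np.random.default_rng(seed=42)
    population = np.arange(10_000)                              # stand-in for the full data set

    sample = rng.choice(population, size=100, replace=False)    # simple random sampling

    def model_accuracy(s):
        return min(0.95, 0.5 + 0.0001 * len(s))                 # placeholder accuracy curve

    size, best = 100, 0.0
    while size <= len(population):
        acc = model_accuracy(rng.choice(population, size=size, replace=False))
        if acc - best < 0.01:      # accuracy no longer increases substantially -> stop
            break
        best, size = acc, size * 2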

• Given two sets A and B, A - B is the set of elements of A that are not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A - B = {1} and B - A = {}, the empty set. We can define the distance d between two sets A and B as d(A, B) = size(A - B), where size is a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as follows: d(A, B) = size(A - B) + size(B - A).
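• As a quick sketch (not from the source notes), the snippet below evaluates both versions of this set distance on the example sets above:

    A = {1, 2, 3, 4}
    B = {2, 3, 4}

    def d(A, B):
        return len(A - B)                  # size(A - B): not symmetric

    def d_sym(A, B):
        return len(A - B) + len(B - A)     # size(A - B) + size(B - A): symmetric

    print(d(A, B), d(B, A))          # 1 0
    print(d_sym(A, B), d_sym(B, A))  # 1 1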

• In the holdout method, the original data with labeled examples is partitioned into two disjoint sets, called the training and the test
sets, respectively. A classification model is then induced from the
training set and its performance is evaluated on the test set. The
holdout method has several well-known limitations.
◦ First, fewer labeled examples are available for training because
some of the records are withheld for testing. As a result, the
induced model may not be as good as when all the labeled
examples are used for training.

◦ Second, the model may be highly dependent on the composition of the training and test sets. The smaller the training set size, the
larger the variance of the model. On the other hand, if the
training set is too large, then the estimated accuracy computed
from the smaller test set is less reliable.

• Suppose we partition the data into two equal-sized subsets. First, we choose one of the subsets for training and the other for testing.
We then swap the roles of the subsets so that the previous training
set becomes the test set and vice versa. This approach is called a
twofold cross-validation.

• The k-fold cross-validation method generalizes this approach by segmenting the data into k equal-sized partitions. During each run,
one of the partitions is chosen for testing, while the rest of them are
used for training. This procedure is repeated k times so that each
partition is used for testing exactly once.

• In the leave-one-out approach (k = N, a special case of k-fold cross-validation), each test set contains only one record. This approach has
the advantage of utilizing as much data as possible for training. In
addition, the test sets are mutually exclusive and they effectively
cover the entire data set. The drawback of this approach is that it is
computationally expensive to repeat the procedure N times.
Furthermore, since each test set contains only one record, the
variance of the estimated performance metric tends to be high.
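• As a minimal sketch (not from the source notes), the snippet below builds k-fold partitions of record indices; with k equal to the number of records it reduces to leave-one-out:

    import numpy as np

    def k_fold_indices(n_records, k, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(n_records)       # shuffle the record indices
        return np.array_split(idx, k)          # k roughly equal-sized partitions

    folds = k_fold_indices(n_records=10, k=5)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # train a model on train_idx and evaluate it on test_idx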

• When a model performs very well for training data but has poor
performance with test data (new data), it is known as overfitting. In
this case, the machine learning model learns the details and noise in
the training data such that it negatively affects the performance of
the model on test data. Overfitting can happen due to low bias and
high variance.

• When a model has not learned the patterns in the training data well
and is unable to generalize well on the new data, it is known as
underfitting. An underfit model has poor performance on the
training data and will result in unreliable predictions. Underfitting
occurs due to high bias and low variance.

• The closeness of repeated measurements (of the same quantity) to one another is called Precision.
• A systematic variation of measurements from the quantity being measured is called Bias.
• The closeness of measurements to the true value of the quantity
being measured is called Accuracy.
• Variance is the change in prediction accuracy of an ML model between training data and test data. Simply put, if an ML model predicts with an accuracy of "x" on training data and its prediction accuracy on test data is "y", then Variance = x – y.

• Noisy data can appear as normal data. So noise objects are not
always outliers.

• Whereas noise can be defined as mislabeled examples (class noise) or errors in the values of attributes (attribute noise), an outlier is a
broader concept that includes not only errors but also discordant
data that may arise from the natural variation within the population
or process.

• A point is a core point if there are at least minPts number of points (including the point itself) in its surrounding area with radius eps. A
point is a border point if it is reachable from a core point and there
are less than minPts number of points within its surrounding area.
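• As an illustrative sketch (not from the source notes), the snippet below labels each point of a tiny made-up data set as core, border, or noise directly from these definitions:

    import numpy as np

    def point_types(X, eps, min_pts):
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
        neighbor_counts = (dists <= eps).sum(axis=1)                    # includes the point itself
        core = neighbor_counts >= min_pts
        labels = []
        for i in range(len(X)):
            if core[i]:
                labels.append("core")
            elif any(core[j] and dists[i, j] <= eps for j in range(len(X))):
                labels.append("border")       # within eps of a core point but not core itself
            else:
                labels.append("noise")
        return labels

    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
    print(point_types(X, eps=0.5, min_pts=3))   # first three points are core, the last is noise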

• The most important limitations of Simple k-means are:
▪ The user has to specify k (the number of clusters) in the
beginning.
▪ k-means can only handle numerical data.
▪ k-means assumes that we deal with spherical clusters and that
each cluster has roughly equal numbers of observations.
• Limitations of DBSCAN:-

▪ DBSCAN cannot cluster data-sets with large differences in densities well, since then the minPts-eps combination cannot
be chosen appropriately for all clusters.
▪ Choosing a meaningful eps value can be difficult if the data
isn't well understood.
▪ DBSCAN is not entirely deterministic. That's because the
algorithm starts with a random point. Therefore border points
that are reachable from more than one cluster can be part of
either cluster.
