
UNIT –III Data Mining Techniques

- Prof. Sachin Lamkane

 Mining Frequent Patterns :-


Frequent patterns are patterns (e.g. itemsets, subsequences, or substructures) that appear
frequently in a data set. Finding frequent patterns plays an essential role in mining
associations, correlations, and many other interesting relationships among data. Moreover, it
helps in data classification, clustering, and other data mining tasks.

 Itemset :-
A set of items that appear frequently together in a transaction data set is a frequent itemset,
e.g. milk and bread being purchased together.

 Subsequence :-
A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a
frequent sequential pattern if it occurs frequently in a shopping history database.

 Substructure :-
A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences.

Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden
pattern in the data set.

Association rules are if-then statements that help to show the probability of relationships between
data items within large data sets in different types of databases. Association rule mining has several
applications and is commonly used to find sales correlations in transactional data or in medical data sets.

The way the algorithm works is that you start with transaction data, for example a list of grocery items
that you have been buying for the last six months, and it calculates the percentage of items being purchased together.

There are three major measurement techniques:

o Lift:
This measure expresses the strength of the confidence relative to how often item B is
purchased overall.
Lift = Confidence / ((Item B) / (Entire dataset))

o Support:
This measure expresses how often the items are purchased together, relative to the
overall dataset.
Support = (Item A + Item B) / (Entire dataset)
o Confidence:
This measure expresses how often item B is purchased when item A is purchased as well.
Confidence = (Item A + Item B) / (Item A)
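As a quick illustration, the following minimal Python sketch computes the three measures for the rule {milk} → {bread} over a small hypothetical transaction list (the data and names are assumptions made for illustration, not from these notes):

```python
# Support, confidence and lift for the rule {milk} -> {bread}
# over a small hypothetical transaction list.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread", "eggs"},
    {"milk", "bread"},
]

def support(itemset):
    # Fraction of all transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"milk"}, {"bread"}
supp = support(antecedent | consequent)   # (Item A + Item B) / (Entire dataset)
conf = supp / support(antecedent)         # (Item A + Item B) / (Item A)
lift = conf / support(consequent)         # Confidence / ((Item B) / (Entire dataset))

print(f"support={supp:.2f}  confidence={conf:.2f}  lift={lift:.2f}")
```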

 Market Basket Analysis :-


Frequent itemset mining leads to the discovery of associations and correlations among items
in large transactional data sets. With massive amounts of data continuously being collected
and stored, many industries are becoming interested in mining such patterns from their
databases. The discovery of interesting correlation relationships among huge amounts of
business transaction records can help in many business decision-making processes such as
catalog design, cross-marketing, and customer shopping behavior analysis.

A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items that
customers place in their shopping baskets. The discovery of these associations can help retailers
develop marketing strategies by gaining insight into which items are frequently purchased
together by customers. For instance, if customers are buying milk, how likely are they to
also buy bread?

 Apriori Algorithm :-

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact
that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an
iterative approach known as level-wise search, where frequent k-itemsets are used to explore
(k+1)-itemsets.
To improve the efficiency of the level-wise generation of frequent
itemsets, an important property called the Apriori property is used to reduce the search
space.

 Apriori Property :-
All nonempty subsets of a frequent itemset must also be frequent. The Apriori algorithm
exploits this property in a two-step process consisting of a join step and a prune step.
1. The Join Step: To find Lk, a set of candidate k-itemsets is generated by joining
Lk-1 with itself. This candidate set is denoted Ck.
2. The Prune Step: In this step the size of Ck is reduced: any candidate having a
(k-1)-subset that is not in Lk-1 is removed, since by the Apriori property it cannot be
frequent. (A sketch of both steps is given below.)
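The following minimal Python sketch shows the level-wise search with these join and prune steps; the function name apriori and the min_support parameter are illustrative choices, not part of the original notes:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Keep only candidates whose support meets the threshold.
        return {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}

    # L1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    Lk, result = frequent(items), []
    k = 2
    while Lk:
        result.extend(Lk)
        # Join step: candidate k-itemsets generated by joining Lk-1 with itself.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = frequent(Ck)
        k += 1
    return result

print(apriori([{"milk", "bread"}, {"milk", "bread", "butter"},
               {"milk"}, {"bread", "eggs"}], min_support=0.5))
```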

Classification and Prediction:-
 What is Classification:-

The problem of classification arises when an investigator makes a number of measurements on an
individual and wishes to classify the individual into one of several categories on the basis of these
observations or measurements. Classification is also known as supervised learning.
A bank loan officer needs analysis of his data to learn which loan applicants are
"safe" and which are "risky" for the bank. A marketing manager needs data analysis to help
guess whether a customer with a given profile will buy a new computer or not. A medical
researcher wants to analyze breast cancer data to predict which one of three specific treatments a
patient should receive. In each of these examples, the data analysis task is to classify data, i.e. to predict
class labels (categories), such as "safe" or "risky" for the loan application data; "yes" or "no" for the
marketing data; or "Treatment A", "Treatment B", or "Treatment C" for the medical data.

How Does Classification Work:-

Data classification is a two-step process: a learning step and a classification step. In the learning step
a classification model is constructed, and in the classification step the model is used to predict class
labels for given data.
Consider the example of a bank analyzing whether loan applications are "safe" or "risky". Consider the
following figure.

Here the class label (i.e. loan decision) of each training tuple is provided; hence classification is
also known as supervised learning.

In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is known as model construction, the learning step, or the training step. Construction
of a classification model is based on training data. Training data consist of a set of tuples.
A tuple X is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn). Each tuple X is
assumed to belong to a predefined class, as determined by another database attribute called the class
label attribute.

 Decision Tree Induction:-


Decision tree induction is the learning of decision trees from class-labeled training tuples. A
decision tree is a flowchart-like tree structure where:

An internal node (non-leaf node) denotes a test on an attribute.
A branch represents an outcome of the test.
A leaf node (terminal node) holds a class label.
The root node is the topmost node in the tree.
Consider the decision tree indicating whether a customer is likely to purchase a
computer or not. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.

Class –label Yes: The customer is likely to buy a computer


Class-label No: The customer is unlikely to buy a computer

 How Are Decision Trees Used For Classification?
Given a tuple, X, for which the associated class label is unknown, the attribute values of the
tuple are tested against the decision tree. A path is traced from the root to a leaf node, which
holds the class prediction for the tuple. Decision trees can easily be converted to
classification rules.

For example:
RID Age Income Student Credit rating Class
1 Youth High No Fair ?

 Test on age: Youth
 Test on student: No
 Reach leaf node
 Class No: The customer is unlikely to buy a computer (the sketch below traces the same path in code)
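A minimal Python sketch of this traversal follows. The tree structure is reconstructed from the training data discussed later (the youth branch tests student, the senior branch tests credit rating, and middle-aged customers are always classified "yes"), so treat it as illustrative rather than a reproduction of the figure:

```python
# The buys-computer decision tree written as nested conditionals.
def classify(age, student, credit_rating):
    if age == "youth":
        return "yes" if student == "yes" else "no"
    elif age == "middle-aged":
        return "yes"
    else:  # senior
        return "yes" if credit_rating == "fair" else "no"

# RID 1: age=youth, student=no, credit rating=fair  ->  class "no"
print(classify("youth", "no", "fair"))
```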

 Performance:-
The construction of decision tree classifiers does not require any domain knowledge or
parameter setting. They can handle multidimensional data. Their representation of acquired
knowledge in tree form is intuitive and easily understood by humans. The learning and
classification steps of decision tree induction are simple and fast. Decision tree classifiers have
good accuracy. Decision trees are used for classification in many application areas such as medicine,
manufacturing and production, financial analysis, astronomy, and molecular biology.

 Attribute Selection Measure:-


An attribute selection measure is a heuristic for selecting the splitting criterion that "best"
separates a given data partition, D, of class-labeled training tuples into individual classes. If we
were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally
each partition would be pure. Attribute selection measures are also known as splitting rules
because they determine how the tuples at a given node are to be split.
An attribute selection measure provides a ranking for each attribute describing the given
training tuples. The attribute having the best score for the measure is chosen as the splitting
attribute. A node created for partition D is labeled with the splitting criterion, branches are
grown for each outcome of the criterion, and the tuples are partitioned accordingly. There are
three popular attribute selection measures.
1. Information Gain
2. Gain ratio
3. Gini Index

Consider the following table of class-labeled training tuples:

RID  Age          Income  Student  Credit-rating  Class: buys computer
1    Youth        High    No       Fair           No
2    Youth        High    No       Excellent      No
3    Middle-aged  High    No       Fair           Yes
4    Senior       Medium  No       Fair           Yes
5    Senior       Low     Yes      Fair           Yes
6    Senior       Low     Yes      Excellent      No
7    Middle-aged  Low     Yes      Excellent      Yes
8    Youth        Medium  No       Fair           No
9    Youth        Low     Yes      Fair           Yes
10   Senior       Medium  Yes      Fair           Yes
11   Youth        Medium  Yes      Excellent      Yes
12   Middle-aged  Medium  No       Excellent      Yes
13   Middle-aged  High    Yes      Fair           Yes
14   Senior       Medium  No       Excellent      No

In the above table:

m = 2 (the number of classes: Yes and No)

n = 14 (the number of tuples)

9 tuples are in class Yes.

5 tuples are in class No.

To find the splitting criterion for these tuples, we must compute the information gain of each
attribute.

The formulas for finding information gain are:
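Assuming the usual entropy-based definition of information gain (consistent with the gain values computed below):

Info(D) = − Σ (i = 1 to m) pi log2(pi)

InfoA(D) = Σ (j = 1 to v) ( |Dj| / |D| ) × Info(Dj)

Gain(A) = Info(D) − InfoA(D)

where pi is the probability that an arbitrary tuple in D belongs to class Ci, and D1, ..., Dv are the partitions of D produced by splitting on attribute A.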

According to these formulas:

Gain (Age) = 0.246 bits

Gain (Income) = 0.029 bits

Gain (Student) = 0.151 bits

Gain (Credit-rating) = 0.048 bits

Here age has the highest information gain among the attributes, so it is selected as the splitting
attribute.
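As a check, the following small Python sketch (the helper name info is an illustrative choice) reproduces Gain(Age) from the class counts read off the table: 9 "yes" / 5 "no" overall, partitioned as youth (2 yes, 3 no), middle-aged (4 yes, 0 no), and senior (3 yes, 2 no):

```python
import math

def info(counts):
    # Entropy of a class distribution given as a list of class counts.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

info_D = info([9, 5])                                  # ~0.940 bits
info_age = (5/14) * info([2, 3]) + (4/14) * info([4, 0]) + (5/14) * info([3, 2])
print(round(info_D - info_age, 3))                     # 0.246 bits -> Gain(Age)
```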

The tuples falling into the partition for age = middle-aged all belong to the same class. Because they
all belong to class "Yes", a leaf should therefore be created at the end of this branch and labeled
"Yes".

 Issues: Overfitting and Tree Pruning Methods:-

When a decision tree is built, many of the branches will reflect anomalies in the
training data due to noise and outliers. Tree pruning methods are used to overcome this
problem of overfitting the data. Tree pruning methods typically use statistical measures to
remove the least reliable branches. Pruning usually reduces the size of the tree, avoids
unnecessary complexity, and helps avoid overfitting of the data when classifying new data.
The following figure shows an unpruned decision tree and a pruned version of it.

There are two techniques for pruning:
1) Pre pruning
2) Post pruning
1) Pre Pruning:-
In the prepruning approach, a tree is "pruned" by halting its construction early. Upon halting,
the node becomes a leaf. The leaf may hold the most frequent class among the subset tuples or
the probability distribution of those tuples.
When constructing a tree, measures such as statistical significance, information gain,
Gini index, and so on can be used to assess the goodness of a split. If partitioning the tuples at a
node would result in a split that falls below a prespecified threshold, then further partitioning of
the given subset is halted; otherwise, it is expanded. High thresholds could result in oversimplified trees,
whereas low thresholds could result in very little simplification.

2) Post Pruning:-
In the postpruning approach, subtrees are removed from a "fully grown" tree. A subtree at a given
node is pruned by removing its branches and replacing it with a leaf. The leaf is labeled with
the most frequent class among the subtree being replaced. For example, consider the subtree at node A3
in the unpruned tree: the most common class within that subtree is "class B", so in the pruned version
of the tree it is replaced with the leaf "class B".

 Classification and Regression Trees (CART):-

Classification and regression trees (CART) is a technique that generates a binary decision tree.
Unlike methods in which a child is created for each subcategory of the splitting attribute, only two
children are created at each node. The splitting is performed around what is determined to be the best
split point. At each step, an exhaustive search is used to determine the best split, where "best" is
defined by
Φ(s/t) = 2 PL PR Σ (j = 1 to m) | P(Cj | tL) − P(Cj | tR) |

This formula is evaluated at the current node, t, for each possible splitting attribute and
criterion, s. Here L and R are used to indicate the left and right subtrees of the current node in
the tree. PL and PR are the probabilities that a tuple in the training set will be on the left or right
side of the tree. P(Cj | tL) or P(Cj | tR) is the probability that a tuple is in class Cj and in the
left or right subtree.
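The following minimal Python sketch evaluates this measure for one candidate binary split; the class-count representation is an assumption made for illustration:

```python
# CART goodness-of-split measure PHI(s/t) from per-class counts on the
# left and right sides of a candidate binary split.
def phi(left_counts, right_counts):
    nL, nR = sum(left_counts), sum(right_counts)
    n = nL + nR
    pL, pR = nL / n, nR / n   # probability of going left / right
    return 2 * pL * pR * sum(abs(l / nL - r / nR)
                             for l, r in zip(left_counts, right_counts))

# e.g. a split sending 4 "yes" / 0 "no" tuples left and 5 "yes" / 5 "no" right
print(round(phi([4, 0], [5, 5]), 3))
```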

 Bayesian Classification:-
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical
classifiers. They can predict class membership probabilities, such as the probability that a given
tuple belongs to a particular class.

 Bayes' Theorem:-
Bayes' theorem is named after Thomas Bayes. Let X be a data tuple. In
Bayesian terms, X is considered "evidence". As usual, it is described by a set of n attributes. Let H
be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification
problems, we want to determine P(H|X), the probability that the hypothesis H holds given the
"evidence" X.
There are two kinds of probabilities:
1. Posterior probability [ P(H|X) ]
2. Prior probability [ P(H) ]

1) Posterior Probability, P(H|X):-
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
For example, suppose the data tuple X describes a customer by the attributes age = 35 years and
income = $40,000. Suppose that H is the hypothesis that the customer will buy a computer.
Then P(H|X) reflects the probability that customer X will buy a computer given that we know
the customer's age and income.

2) Prior Probability, P(H):-
P(H) is the prior probability, or a priori probability, of H.
For example, this is the probability that any given customer will buy a computer, regardless of
age, income, or any other information.
The posterior probability, P(H|X), is based on more information than the prior probability, P(H),
which is independent of X.
Bayes' theorem is:
P(H|X) = P(X|H) P(H) / P(X)

 Naive Bayesian Classification:-


The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

(For this explanation, consider the class-labeled training tuples table given earlier.)

1. Let D be a training set of tuples and their associated class labels.

Within our example there are 14 tuples and two class labels {Yes, No}. Let
C1 correspond to the class buys-computer = Yes and C2 correspond to buys-computer = No.

2. Suppose that there are m classes C1, C2, ..., Cm. Given a tuple X, the classifier will predict
that X belongs to the class having the highest posterior probability, conditioned on X. That is, the
naïve Bayesian classifier predicts that tuple X belongs to class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i

Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. By Bayes' theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. The class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is
the number of training tuples of class Ci in D.

For example, the tuple we wish to classify is

X = (age = youth, income = medium, student = yes, credit rating = fair)

We need to maximize P(X|Ci) P(Ci), for i = 1, 2. The prior probability of each class can be
computed from the training tuples.

P (buys-computer = Yes) = 9/14 =0.643


P (buys-computer = No) = 5/14 = 0.357

To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:

P (age = youth | buys-computer = Yes) = 2/9 = 0.222
P (age = youth | buys-computer = No) = 3/5 = 0.600
P (income = medium | buys-computer = Yes) = 4/9 = 0.444
P (income = medium | buys-computer = No) = 2/5 = 0.400
P (student = yes | buys-computer = Yes) = 6/9 = 0.667
P (student = yes | buys-computer = No) = 1/5 = 0.200
P (credit rating = fair | buys-computer = Yes) = 6/9 = 0.667
P (credit rating = fair | buys-computer = No) = 2/5 = 0.400

4. Compute P(X|Ci) as follows:

P(X|Ci) = ∏ (k = 1 to n) P(xk | Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
For example, using the above probabilities we find:
P (X | buys-computer = Yes) = P (age = youth | buys-computer = Yes)
X P (income = medium | buys-computer = Yes)
X P (student = yes | buys-computer = Yes)
X P (credit rating = fair | buys-computer = Yes)
= 0.222 X 0.444 X 0.667 X 0.667
= 0.044

Similarly,
P (X | buys-computer = No) = 0.600 X 0.400 X 0.200 X 0.400
= 0.019
To find the class Ci that maximizes P(X|Ci) P(Ci), we compute
P (X | buys-computer = Yes) P (buys-computer = Yes) = 0.044 X 0.643
= 0.028
P (X | buys-computer = No) P (buys-computer = No) = 0.019 X 0.357
= 0.007
Therefore, the naïve Bayesian classifier predicts buys-computer = Yes for tuple X.
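The same computation can be reproduced with a short Python sketch; the dictionary layout and key names are illustrative choices, while the probabilities are the ones listed above:

```python
# Naive Bayes by hand for X = (age=youth, income=medium, student=yes, credit=fair).
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age=youth": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age=youth": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

x = ["age=youth", "income=medium", "student=yes", "credit=fair"]
scores = {}
for c in priors:
    p = priors[c]
    for attr in x:
        p *= cond[c][attr]           # naive conditional-independence assumption
    scores[c] = p

print(scores)                         # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))    # predicted class: 'yes'
```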

 Bayesian Networks:-
Bayesian networks specify joint conditional probability distributions. They are also known as
Bayesian belief networks, belief networks, or probabilistic networks.

1) A belief network allows class conditional independencies to be defined between subsets of
variables.
2) It provides a graphical model of causal relationships, on which learning can be performed.
3) We can use a trained Bayesian network for classification.

There are two components that define a Bayesian Belief Network.


1) Directed acyclic graph (DAG)
2) A set of conditional probability tables (CPT)

1) Directed Acyclic Graph (DAG):-

1) Each node in the directed acyclic graph represents a random variable.
2) These variables may be discrete- or continuous-valued.
3) These variables may correspond to actual attributes given in the data.
Consider the following representation of a DAG.

The arcs in the diagram allow representation of causal knowledge. For example, lung cancer is
influenced by a person's family history of lung cancer, as well as by whether or not the person is a
smoker. Note that the variable positive X-ray is independent of whether the patient has a family
history of lung cancer or is a smoker, given that we know the patient has lung cancer.

2) Conditional Probability Table (CPT):-

The conditional probability table for the values of the variable lung cancer (LC) shows
each possible combination of the values of its parent nodes, family history (FH) and smoker
(S), as follows.

        FH, S    FH, ~S    ~FH, S    ~FH, ~S
LC      0.8      0.5       0.7       0.1
~LC     0.2      0.5       0.3       0.9
A belief network has one conditional probability table (CPT) for each variable. The CPT for a
variable Y specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are
the parents of Y. From the above table, the CPT for the variable lung cancer gives, for example,
P (lung cancer = Yes | family history = Yes, smoker = Yes) = 0.8
P (lung cancer = No | family history = No, smoker = No) = 0.9
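A minimal Python sketch of looking up this CPT (a hypothetical dictionary representation, not a library API):

```python
# CPT for LungCancer keyed by the values of its parents FamilyHistory (FH) and Smoker (S).
cpt_lung_cancer = {
    # (FH, S): P(LungCancer = yes | FH, S)
    (True,  True):  0.8,
    (True,  False): 0.5,
    (False, True):  0.7,
    (False, False): 0.1,
}

def p_lung_cancer(value, fh, s):
    p_yes = cpt_lung_cancer[(fh, s)]
    return p_yes if value else 1 - p_yes

print(p_lung_cancer(True,  fh=True,  s=True))    # 0.8
print(p_lung_cancer(False, fh=False, s=False))   # 0.9
```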

 Linear Classification:-
A large number of algorithms for classification can be phrased in terms of a linear function.
1. Logistic Regression:-
In statistics, logistic regression is a regression model where the dependent variable (DV) is
categorical. A binary dependent variable takes two values, such as pass/fail,
win/lose, alive/dead, or healthy/diseased. Cases with more than two categories are referred to as
multinomial logistic regression.
Logistic regression measures the relationship between the
categorical dependent variable and one or more independent variables by estimating
probabilities using the logistic function, which is the cumulative logistic distribution.
Logistic regression predicts the probability of a particular outcome, e.g. the probability of passing
an exam versus hours of study.

Hours of Study Probability of passing Exam


1 0.07
2 0.26
3 0.61
4 0.87
5 0.97
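A minimal scikit-learn sketch is given below; the pass/fail observations are made-up illustrative data, so the fitted probabilities will only roughly resemble the table above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pass/fail outcomes versus hours of study.
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5],
                  [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# Predicted probability of passing for 1..5 hours of study.
for h in range(1, 6):
    print(h, round(model.predict_proba([[h]])[0, 1], 2))
```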

2. Perceptron:-
In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers:
functions that can decide whether an input belongs to one class or another. It is a type of linear
classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor
function combining a set of weights with the feature vector. The algorithm processes
elements in the training set one at a time.
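A minimal Python sketch of the perceptron weight-update rule on a tiny, assumed linearly separable dataset (the bias is folded in as an extra weight):

```python
import numpy as np

# Toy dataset: first column is the constant 1 for the bias term.
X = np.array([[1,  2.0,  1.0], [1,  1.0,  3.0],    # class +1
              [1, -1.0, -2.0], [1, -2.0, -1.0]])   # class -1
y = np.array([1, 1, -1, -1])

w = np.zeros(3)
for _ in range(10):                        # a few passes over the training set
    for xi, target in zip(X, y):
        if target * np.dot(w, xi) <= 0:    # misclassified (or on the boundary)
            w += target * xi               # move the weights toward the example
print(w)
```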

 SVM (Support Vector Machine):-

A support vector machine is a method for the classification of both linear and nonlinear data.
It uses a nonlinear mapping to transform the original training data into a higher dimension. Within
this new dimension, it searches for the linear optimal separating hyperplane (i.e. a decision
boundary separating the tuples of one class from another). With an appropriate nonlinear mapping
to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
The SVM finds this hyperplane using support vectors (essential training tuples) and margins
(defined by the support vectors).

There are an infinite number of separating lines that could be drawn. We want to find the best one,
that is, the one that will have the minimum classification error on previously unseen tuples. How can
we find this best line?
An SVM approaches this problem by searching for the maximum
marginal hyperplane.
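A minimal scikit-learn sketch on assumed toy data, showing the support vectors that define the maximum marginal hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D dataset with two well-separated classes.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)       # the essential training tuples
print(clf.predict([[2.0, 2.0]]))  # classify a previously unseen tuple
```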

 Prediction:-
Prediction models continuous-valued functions, i.e. it predicts unknown or missing values.
For example, a marketing manager would like to predict how much a given customer will spend during a sale.

Customer profile → Prediction → Rs. 50,000
Regression Analysis is used for Prediction.
1. Linear Regression:-
Linear Regression is a statistical procedure for predicting the value of a dependent variable
from an independent variable when the relationship between the variables can be described with
a linear model.

A linear regression model is typically stated in the form Y = α + βX + ε.
Here α and β are the regression coefficients (the parameters of the model). The term ε
represents the unpredicted or unexplained variation in the dependent variable; it is
conventionally called the error term.
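A minimal Python sketch fitting such a model by ordinary least squares on assumed data:

```python
import numpy as np

# Small assumed dataset for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

beta, alpha = np.polyfit(x, y, deg=1)     # slope (beta) and intercept (alpha)
residuals = y - (alpha + beta * x)        # the unexplained variation (error term)
print(round(alpha, 2), round(beta, 2))
print(alpha + beta * 6.0)                 # predict Y for a new value of X
```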

 Nonlinear Regression:-

While a linear equation has one basic form, nonlinear equations can take many different forms.
The easiest way to determine whether an equation is nonlinear is to focus on the term "nonlinear"
itself.
Nonlinear regression covers many different forms and provides very flexible curve-fitting
functionality.

Questions :
1. What is meant by a frequent itemset? Explain association rules.
2. Explain the Apriori algorithm.
3. What is classification? Explain decision tree induction with an example.
4. Explain overfitting and tree pruning methods in detail.
5. Explain the naive Bayes algorithm in detail.
6. Explain Bayesian belief networks.
7. What is linear classification? Explain the logistic regression method of classification.
8. How is regression used for classification and prediction?
9. What is the difference between linear and nonlinear regression?

