Bayes Classification
• Bayesian classifiers are statistical classifiers
based on Bayes’ theorem
• Predict class membership probabilities
• Naive Bayesian classifier
– Assumes effect of an attribute value on a given
class is independent of the values of the other
attributes – class conditional independence
– Simplifies the computations
– Has comparable performance with decision tree
and selected neural network classifiers
Bayes Classification
• Let X be a data sample (“evidence”): n attributes, class
label unknown
• H: some hypothesis such as that the data tuple X
belongs to a specified class C
• Find P(H|X): probability that the hypothesis H holds
given the observed data tuple X
• Probability that tuple X belongs to class C, given that
we know the attribute description of X
• P(H|X) a posteriori probability of H conditioned on X
Bayes Classification
• Bayes’ theorem
  P(H|X) = P(X|H) P(H) / P(X)
• Example: computer purchase problem

– customers described by age and income
– X is a 35-year-old customer with an income of $40,000
– H: hypothesis that customer will buy a computer
– P(H|X) = ?
• P(H|X): posterior probability that customer X will buy
a computer given his age and income
• P(H) is the prior probability that any given customer
will buy a computer, regardless of age, income
• P(X|H): posterior probability that customer X is 35
years old and earns $40,000, given that he buys a computer
• P(X): prior probability that a person from the set of
customers is 35 years old and earns $40,000
• P(H), P(X|H), P(X) may be estimated from given data
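Once those three quantities are estimated, the posterior follows directly from Bayes’ theorem. A minimal sketch for the computer-purchase example, using invented probability estimates:

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# The three estimates below are hypothetical, not from real data.
p_h = 0.5          # prior: fraction of customers who buy a computer
p_x_given_h = 0.2  # likelihood: fraction of buyers who are 35 and earn $40,000
p_x = 0.25         # evidence: fraction of all customers who are 35 and earn $40,000

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.4
```

With these numbers, knowing the customer’s age and income raises the purchase probability from the 0.5 prior to a 0.4 posterior only if the evidence supports it; here the attributes are slightly less common among buyers, so the posterior drops below the prior.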
Naive Bayesian Classification
• D: training set of tuples. Each tuple represented by n-
dimensional attribute vector, X=(x1, x2,…, xn).
Attributes are A1, A2,…, An. m classes - C1, C2,…, Cm
• Given a tuple, X, the classifier will predict that X
belongs to the class having the highest posterior
probability, conditioned on X. Tuple X belongs to class
Ci if and only if
  P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
• By Bayes’ theorem
  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naive Bayesian Classification
• As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs
to be maximized
• If the class prior probabilities are not known, assume
P(C1)=P(C2)=…=P(Cm), and therefore maximize P(X|Ci)
• Class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|
• Given data sets with many attributes, extremely
computationally expensive to compute P(X|Ci)
• Assumption: class-conditional independence, i.e., no
dependence relation between attributes:
P(X|Ci) = ∏ (k=1 to n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
Naive Bayesian Classification
• Attributes may be categorical or continuous
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci
having value xk for Ak divided by |Ci, D| (# of tuples of
Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed
based on a Gaussian distribution with mean μ and
standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) e^( −(x−μ)² / (2σ²) )
  P(xk|Ci) = g(xk, μCi, σCi)
• Compute P(X|Ci) for each class Ci
• Predicted class label is class Ci for which
P(X|Ci) is maximum
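The procedure above — estimate P(Ci) and each P(xk|Ci) by counting, then pick the class maximizing their product — can be sketched for categorical attributes as follows. The toy data and attribute values are invented for illustration:

```python
from collections import Counter, defaultdict

def train(tuples, labels):
    """Estimate P(Ci) and P(xk|Ci) by simple counting."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)  # (class, attr index) -> value counts
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            cond_counts[(c, k)][v] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def predict(x, priors, cond_counts, class_counts):
    """Pick the class Ci maximizing P(Ci) * prod_k P(xk|Ci)."""
    best, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, v in enumerate(x):
            score *= cond_counts[(c, k)][v] / class_counts[c]
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data: (age, student) -> buys_computer
data = [("youth", "yes"), ("youth", "no"), ("senior", "yes"), ("senior", "no")]
labels = ["yes", "no", "yes", "no"]
model = train(data, labels)
print(predict(("youth", "yes"), *model))  # yes
```

Note that P(X) is never computed: it is the same for every class, so only the numerator P(X|Ci)P(Ci) is compared.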
Avoiding the Zero-Probability Problem
• Naive Bayesian prediction requires each conditional
probability to be non-zero; otherwise, the predicted
probability will be zero:
  P(X|Ci) = ∏ (k=1 to n) P(xk|Ci)
• Ex: A dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their
“uncorrected” counterparts
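The correction is one line of arithmetic: add 1 to every value count and add the number of distinct values to the denominator. This sketch reproduces the slide’s income example:

```python
# Laplacian correction: add 1 to each value count.
counts = {"low": 0, "medium": 990, "high": 10}   # 1000 income tuples
total = sum(counts.values()) + len(counts)       # 1000 + 3 = 1003
smoothed = {v: (n + 1) / total for v, n in counts.items()}
print(smoothed)  # low: 1/1003, medium: 991/1003, high: 11/1003
```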
Naive Bayes Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss
of accuracy
– Practically, dependencies exist among variables
• E.g., Patients: Profile - age, family history, etc. Symptoms
- fever, cough etc., Disease - lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve
Bayes Classifier
Bayesian Belief Networks
• Unlike naive Bayesian classifiers, do not assume class
conditional independence
• Allow the representation of dependencies among
subsets of attributes
• Graphical model of causal relationships
• Two components
– a directed acyclic graph
– set of conditional probability tables (CPT)
• Node = random variable (discrete- or continuous-valued)
• Arc = probabilistic dependence
Bayesian Belief Networks
• Arc Y -> Z implies Y is a parent (immediate predecessor) of Z, and Z is a descendant of Y
Bayesian Belief Networks
• Each variable is conditionally independent of its non-
descendants in the graph, given its parents
• One CPT for each variable
• Let X = (x1,…, xn) be a data tuple described by the
variables or attributes Y1,…, Yn
• The network gives a complete representation of the
joint probability distribution:
  P(x1,…, xn) = ∏ (i=1 to n) P(xi | Parents(Yi))
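The joint probability of any full assignment is just the product of one CPT entry per variable. A sketch on a hypothetical two-node network (Smoker -> LungCancer, with invented CPT values):

```python
# Each variable's CPT maps its parents' values to a distribution over its own values.
cpt = {
    "Smoker": {(): {True: 0.3, False: 0.7}},          # no parents
    "LungCancer": {(True,): {True: 0.1, False: 0.9},  # parent: Smoker
                   (False,): {True: 0.01, False: 0.99}},
}
parents = {"Smoker": (), "LungCancer": ("Smoker",)}

def joint(assignment):
    """P(x1,...,xn) = product over variables of P(xi | Parents(Yi))."""
    p = 1.0
    for var, table in cpt.items():
        parent_vals = tuple(assignment[q] for q in parents[var])
        p *= table[parent_vals][assignment[var]]
    return p

print(joint({"Smoker": True, "LungCancer": True}))  # 0.3 * 0.1
```

Each factor conditions only on the variable’s parents, which is exactly the conditional-independence assumption encoded by the graph.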
Bayesian Belief Networks
• A node within the network can be selected as an
“output” node, representing a class label attribute
• May be more than one output node
• Can return probability of each class
• Various learning algorithms – gradient descent
• Some applications
– genetic linkage analysis
– computer vision
– document and text analysis
– decision support systems
– sensitivity analysis
Rule-Based Classification
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
• R: IF age = youth AND student = yes THEN buys_computer = yes
• Can be generated either from a decision tree or directly from
the training data using a sequential covering algorithm
• The “IF” part (or left side) of a rule is known as the rule
antecedent or precondition
• The “THEN” part (or right side) is the rule consequent
• If the rule antecedent holds true for a given tuple, we say that
the rule is satisfied and that the rule covers the tuple
• Assessment of a rule R: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
– coverage(R) = ncovers / |D|
– accuracy(R) = ncorrect / ncovers
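Both measures are straightforward to compute by counting. A sketch using the slide’s example rule and a small invented dataset:

```python
# Rule R: IF age = youth AND student = yes THEN buys_computer = yes
# Toy tuples (age, student, buys_computer) -- invented for illustration.
D = [
    ("youth", "yes", "yes"),
    ("youth", "yes", "no"),
    ("youth", "no", "no"),
    ("senior", "yes", "yes"),
]
covered = [t for t in D if t[0] == "youth" and t[1] == "yes"]
correct = [t for t in covered if t[2] == "yes"]

coverage = len(covered) / len(D)        # n_covers / |D| = 2/4
accuracy = len(correct) / len(covered)  # n_correct / n_covers = 1/2
print(coverage, accuracy)  # 0.5 0.5
```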
Using IF-THEN Rules for Classification
• If more than one rule is triggered, need
conflict resolution
– Size ordering: assigns highest priority to the
triggering rule that has the “toughest”
requirement (i.e., with the most attribute tests)
– Rule ordering: rules prioritized beforehand
• class-based, rule-based
Using IF-THEN Rules for Classification
• Class-based ordering: classes are sorted in
decreasing order of prevalence or misclassification
cost per class
• Rule-based ordering (decision list): rules are
organized into one long priority list, according to
some measure of rule quality or by experts
• What if no rule satisfied by X?
– Set up a default rule to specify a default class,
based on a training set
– May be the class in majority or the majority class
of the tuples that were not covered by any rule
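A rule-based (decision list) classifier with a default rule can be sketched as below; the rules and the default class are invented examples in the style of the buys_computer problem:

```python
# Ordered rule list: highest-priority rule first. Each entry is
# (predicate, predicted class); predicates are toy examples.
rules = [
    (lambda x: x["age"] == "youth" and x["student"] == "yes", "yes"),
    (lambda x: x["age"] == "youth", "no"),
]
default_class = "yes"  # e.g., majority class of tuples not covered by any rule

def classify(x):
    for cond, label in rules:
        if cond(x):          # first triggered rule fires
            return label
    return default_class     # no rule satisfied by x

print(classify({"age": "youth", "student": "no"}))   # no
print(classify({"age": "senior", "student": "no"}))  # yes (default rule)
```

Because the list is ordered, no conflict resolution is needed at prediction time: the first rule whose antecedent holds determines the class.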
Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root
to a leaf
• Each splitting criterion along a given path is
logically ANDed to form the rule antecedent
(“IF” part)
• The leaf node holds the class prediction,
forming the rule consequent (“THEN” part)
Rule Extraction from a Decision Tree
• Example

IF age = youth AND student = no THEN buys_computer = no

IF age = youth AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = senior AND credit_rating = fair THEN buys_computer = no
IF age = senior AND credit_rating = excellent THEN buys_computer = yes
Rule Extraction from a Decision Tree
• Rules extracted are mutually exclusive and exhaustive
• Mutually exclusive: cannot have rule conflicts
here as no two rules will be triggered for the
same tuple
• Exhaustive: one rule for each possible attribute–
value combination
• Since one rule extracted per leaf, the set of rules
is not much simpler than the corresponding
decision tree
• Rule pruning required
Rule Induction: Sequential Covering Algorithm
• Extracts rules directly from training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially, each for a given class Ci will
cover many tuples of Ci but none (or few) of the tuples of
other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rule are removed
– Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
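The steps above can be sketched as a loop. Here `learn_one_rule` is a deliberately simplified greedy learner (it picks the single best attribute test), and the data is invented:

```python
def learn_one_rule(data, target):
    """Toy greedy learner: pick the one attribute test (Ak = v) whose
    covered tuples have the highest accuracy for the target class."""
    best, best_acc = None, 0.0
    for k in range(len(data[0][0])):          # each attribute index
        for v in {x[k] for x, _ in data}:     # each observed value
            covered = [(x, c) for x, c in data if x[k] == v]
            acc = sum(c == target for _, c in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (k, v), acc
    return best                               # rule: IF Ak = v THEN target

def sequential_covering(data, target):
    """Learn rules one at a time, removing covered tuples each round."""
    rules, remaining = [], list(data)
    while any(c == target for _, c in remaining):  # target tuples remain
        rule = learn_one_rule(remaining, target)
        if rule is None:
            break                                  # no useful rule: stop
        rules.append(rule)
        k, v = rule
        remaining = [(x, c) for x, c in remaining if x[k] != v]
    return rules

# Toy data: (age, student) -> buys_computer
data = [(("youth", "yes"), "yes"), (("youth", "no"), "no"),
        (("senior", "yes"), "yes"), (("senior", "no"), "no")]
print(sequential_covering(data, "yes"))  # [(1, 'yes')]: IF student = yes THEN yes
```

Real algorithms like FOIL grow conjunctive antecedents and use a quality threshold as the termination condition; this sketch only shows the covering structure.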
Basic Sequential Covering Algorithm
How are rules learned?
• Start with the most general rule possible:
– IF THEN loan_decision = accept
• Add new attributes by adopting a greedy
depth-first strategy
– Pick the one that improves the rule quality most
• The resulting rule should cover relatively more
of the “accept” tuples
Rule Learning
Rule-Quality measures
• Need to consider both coverage and accuracy
• Entropy - prefers rules that cover a large number of tuples of a
single class and few tuples of other classes
• Tuples of the class for which rules are learned are called
positive tuples, while the remaining tuples are negative
• Foil-gain (in FOIL & RIPPER): assesses information gained by
extending the antecedent
  FOIL_Gain = pos′ × ( log2( pos′ / (pos′ + neg′) ) − log2( pos / (pos + neg) ) )
• Favors rules that have high accuracy and cover many positive tuples
Rule Pruning
• Prune a rule, R, if the pruned version of R has greater
quality, as assessed on an independent set of tuples
pos  neg
FOIL _ Prune( R) 
pos  neg
• If FOIL_Prune is higher for the pruned version of R,
prune R
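Both measures are small arithmetic functions of the positive/negative counts. A sketch with invented counts (pos/neg are the counts covered by rule R, pos′/neg′ by the extended rule R′):

```python
from math import log2

def foil_gain(pos, neg, pos_p, neg_p):
    """Information gained by extending rule R (pos, neg) to R' (pos_p, neg_p)."""
    return pos_p * (log2(pos_p / (pos_p + neg_p)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """Quality of a rule on an independent pruning set; prune R if the
    pruned version of R scores higher."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts: extending a rule from 80 pos / 40 neg to 50 pos / 5 neg.
print(foil_gain(80, 40, 50, 5))   # positive: the extension sharpens the rule
print(foil_prune(50, 5))          # 45/55
```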
Classifier Evaluation Metrics
Confusion Matrix:
Actual class\Predicted class   C1                     ¬C1
C1                             True Positives (TP)    False Negatives (FN)
¬C1                            False Positives (FP)   True Negatives (TN)

Example of Confusion Matrix:

Actual class\Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes                   6954                   46           7000
buy_computer = no                     412                 2588           3000
Total                                7366                 2634          10000
Classifier Evaluation Metrics
A\P   C    ¬C
C     TP   FN   P
¬C    FP   TN   N
      P’   N’   All

• Classifier Accuracy, or recognition rate:
percentage of test set tuples that are
correctly classified
Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
• Sensitivity: True Positive recognition rate
Sensitivity = TP/P
• Specificity: True Negative recognition rate
Specificity = TN/N
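These four metrics can be read straight off the confusion matrix. Using the buys_computer example above:

```python
# Counts from the slide's buys_computer confusion matrix.
TP, FN = 6954, 46     # actual yes: P = 7000
FP, TN = 412, 2588    # actual no:  N = 3000
P, N = TP + FN, FP + TN
ALL = P + N

accuracy = (TP + TN) / ALL     # 0.9542
error_rate = (FP + FN) / ALL   # 0.0458
sensitivity = TP / P           # 6954/7000, about 0.993
specificity = TN / N           # 2588/3000, about 0.863
print(accuracy, error_rate, sensitivity, specificity)
```

Note how accuracy alone hides the class imbalance: the classifier is much better at recognizing buyers (sensitivity) than non-buyers (specificity).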
Classifier Evaluation Metrics
• Precision: exactness – what % of tuples that the classifier labeled
as positive are actually positive?
  Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier
label as positive?
  Recall = TP / (TP + FN)
• F measure (F1 or F-score): harmonic mean of precision and recall
  F = 2 × Precision × Recall / (Precision + Recall)
• Fβ: weighted measure of precision and recall
  Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
– assigns β times as much weight to recall as to precision
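Continuing with the buys_computer confusion matrix (TP = 6954, FP = 412, FN = 46), precision, recall, and the F measures come out as:

```python
TP, FP, FN = 6954, 412, 46

precision = TP / (TP + FP)   # 6954/7366, about 0.944
recall = TP / (TP + FN)      # 6954/7000, about 0.993
f1 = 2 * precision * recall / (precision + recall)

def f_beta(p, r, beta):
    """Weighted harmonic mean; beta > 1 favors recall, beta < 1 precision."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(precision, recall, f1)
print(f_beta(precision, recall, 2.0))  # F2: recall-weighted
```

Common choices are β = 2 (recall weighted higher) and β = 0.5 (precision weighted higher); β = 1 recovers F1.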