

jalali@mshdiua.ac.ir Jalali.mshdiau.ac.ir

Data Mining

Bayesian Classification


Bayesian Learning (Naïve Bayesian Classifier)

- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct
- When to use: a moderate or large training set is available, and the attributes that describe instances are conditionally independent given the classification
- Successful applications: diagnosis, classifying text documents

Bayes' Theorem: Basics

- Let X be a data sample ("evidence"): the class label is unknown
- Let H be a hypothesis that X belongs to class C
- Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
- P(H) (prior probability): the initial probability
  - E.g., the probability that X will buy a computer, regardless of age, income, ...
- P(X): the probability that the sample data is observed
- P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  - E.g., given that X will buy a computer, the probability that X is 31..40 with medium income

Bayes' Theorem

- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

- Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
- Practical difficulty: requires initial knowledge of many probabilities, and has significant computational cost
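As a quick numeric illustration of the theorem (the prior, likelihood, and evidence values below are made up purely for illustration, not taken from the data used later):

```python
# Made-up numbers, for illustration of Bayes' theorem only.
p_h = 0.3            # P(H): prior probability of the hypothesis
p_x_given_h = 0.5    # P(X|H): likelihood of observing X if H holds
p_x = 0.4            # P(X): probability of observing X

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem
print(p_h_given_x)                      # 0.375
```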

Towards Naïve Bayesian Classifier

- Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn)
- Suppose there are m classes C1, C2, ..., Cm
- Classification is to derive the maximum a posteriori hypothesis, i.e., the maximal P(Ci|X)
- This can be derived from Bayes' theorem:

  P(Ci|X) = P(X|Ci) P(Ci) / P(X)

- Since P(X) is constant for all classes, only

  P(X|Ci) P(Ci)

  needs to be maximized

Derivation of Naïve Bayes Classifier

- A simplified assumption: attributes are conditionally independent (i.e., there are no dependence relations between attributes):

  P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

- This greatly reduces the computation cost: only the class distribution has to be counted
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
- If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

  g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

  and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
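A minimal Python sketch of these two estimates, assuming the attribute values of the class-Ci tuples are simply given as a list (the function names and data layout are illustrative choices, not from the slides):

```python
import math

def categorical_prob(xk, values_in_class):
    # P(xk|Ci) for a categorical attribute Ak: fraction of class-Ci tuples
    # whose value for Ak equals xk.
    return values_in_class.count(xk) / len(values_in_class)

def gaussian_prob(xk, values_in_class):
    # P(xk|Ci) for a continuous attribute Ak: Gaussian density g(xk, mu, sigma)
    # with mu and sigma estimated from the class-Ci tuples.
    mu = sum(values_in_class) / len(values_in_class)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values_in_class) / len(values_in_class))
    return math.exp(-((xk - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
```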

Naïve Bayesian Classifier: Training Dataset

- Class: C1: buys_computer = yes, C2: buys_computer = no
- Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

  age     income   student  credit_rating  buys_computer
  <=30    high     no       fair           no
  <=30    high     no       excellent      no
  31..40  high     no       fair           yes
  >40     medium   no       fair           yes
  >40     low      yes      fair           yes
  >40     low      yes      excellent      no
  31..40  low      yes      excellent      yes
  <=30    medium   no       fair           no
  <=30    low      yes      fair           yes
  >40     medium   yes      fair           yes
  <=30    medium   yes      excellent      yes
  31..40  medium   no       excellent      yes
  31..40  high     yes      fair           yes
  >40     medium   no       excellent      no

Naïve Bayesian Classifier: An Example

- Recall: P(Ci|X) = P(X|Ci) P(Ci) / P(X), with P(X|Ci) = ∏(k=1..n) P(xk|Ci)
- P(Ci):
  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no) = 5/14 = 0.357
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- Compute P(xk|Ci) for each class:
  P(age = "<=30" | buys_computer = yes) = 2/9 = 0.222
  P(age = "<=30" | buys_computer = no) = 3/5 = 0.6
  P(income = medium | buys_computer = yes) = 4/9 = 0.444
  P(income = medium | buys_computer = no) = 2/5 = 0.4
  P(student = yes | buys_computer = yes) = 6/9 = 0.667
  P(student = yes | buys_computer = no) = 1/5 = 0.2
  P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
- Compute P(X|Ci) for each class:
  P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
- Compute P(X|Ci) P(Ci):
  P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
  P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007
- Therefore, X belongs to class buys_computer = yes
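The same numbers can be reproduced with a short script; this is a minimal sketch over the 14-tuple training table above (the tuple layout and the rounding are illustrative choices):

```python
# Training tuples: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")   # the unseen tuple X

for c in ("yes", "no"):
    rows = [r for r in data if r[4] == c]
    prior = len(rows) / len(data)                 # P(Ci)
    likelihood = 1.0
    for k, xk in enumerate(x):                    # P(X|Ci) = product over attributes
        likelihood *= sum(1 for r in rows if r[k] == xk) / len(rows)
    print(c, round(likelihood, 3), round(likelihood * prior, 3))
# yes 0.044 0.028
# no 0.019 0.007   -> predict buys_computer = yes
```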

Avoiding the 0-Probability Problem

- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

  P(X|Ci) = ∏(k=1..n) P(xk|Ci)

- Example: suppose a dataset with 1000 tuples of one class has income = low (0 tuples), income = medium (990 tuples), and income = high (10 tuples)
- Use the Laplacian correction (or Laplacian estimator): add 1 to each case
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
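A minimal sketch of this correction for the income example (the dictionary layout is an illustrative choice):

```python
# Counts of the income values within one class (1000 tuples in total).
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())          # 1000
num_values = len(counts)              # 3 distinct income values

for value, count in counts.items():
    uncorrected = count / total                      # "low" would get probability 0
    corrected = (count + 1) / (total + num_values)   # Laplacian correction: add 1 to each case
    print(value, uncorrected, round(corrected, 4))
# low 0.0 0.001      (1/1003)
# medium 0.99 0.988  (991/1003)
# high 0.01 0.011    (11/1003)
```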



Naïve Bayesian Classifier: Comments

- Advantages:
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages:
  - Assumes class conditional independence, therefore loses accuracy
  - Practically, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Dependencies among these cannot be modeled by a naïve Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

Bayesian Belief Networks

- A Bayesian belief network allows a subset of the variables to be conditionally independent (also called Bayes nets, directed graphical models, BNs, ...)
- A graphical model of causal relationships:
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution
- Nodes: variables; links: dependencies

  [Figure: example network with nodes X, Y, Z, P and directed links X → Z, Y → Z, Y → P]

  - X and Y are the parents of Z, and Y is the parent of P
  - There is no dependency between Z and P
  - The graph has no loops or cycles
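As a small add-on, the example network can be written as a mapping from each node to its parents, from which the factorization of the joint distribution can be read off (the dictionary representation is an illustrative choice, not from the slides):

```python
# Structure of the example network above: X -> Z, Y -> Z, Y -> P.
parents = {"X": [], "Y": [], "Z": ["X", "Y"], "P": ["Y"]}

# Each node contributes one factor P(node | its parents) to the joint distribution.
factors = [f"P({n}|{','.join(ps)})" if ps else f"P({n})" for n, ps in parents.items()]
print(" ".join(factors))   # P(X) P(Y) P(Z|X,Y) P(P|Y)
```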

Bayesian Belief Networks

- Simple (naïve) Bayesian: the class concept C is the only parent of X1, X2, ..., Xn

  [Figure: node C with arrows to X1, X2, ..., Xn]

  P(x1, x2, ..., xn, c) = P(c) P(x1|c) P(x2|c) ... P(xn|c)

- Bayesian belief network: attributes may depend on each other as well as on C, e.g.

  [Figure: node C with arrows to X1, X2, X3, X4; X1 and X2 also point to X3]

  P(x1, x2, x3, x4, c) = P(c) P(x1|c) P(x2|c) P(x3|x1,x2,c) P(x4|c)

Bayesian Belief Network: An Example

- Nodes = variables; arcs = dependencies; the direction of an arc represents causality

  [Figure: belief network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea; FamilyHistory and Smoker are the parents of LungCancer]

- The conditional probability table (CPT) for the variable LungCancer:

         (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC       0.8       0.5        0.7        0.1
  ~LC      0.2       0.5        0.3        0.9

- The CPT shows the conditional probability of LungCancer for each possible combination of values of its parents

Bayesian Belief Networks

- Derivation of the probability of a particular combination of values of X from the CPTs:

  P(x1, ..., xn) = ∏(i=1..n) P(xi | Parents(Xi))
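A minimal sketch of evaluating this product for part of the lung-cancer network above; the LungCancer CPT values come from the slide, while the priors assumed for FamilyHistory and Smoker are made-up placeholders, since the slides do not give them:

```python
# Assumed (made-up) priors for the parent nodes -- not given on the slides.
p_fh = {True: 0.1, False: 0.9}       # P(FamilyHistory)
p_s  = {True: 0.3, False: 0.7}       # P(Smoker)
# CPT from the slide: P(LungCancer = yes | FamilyHistory, Smoker)
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}

def joint(fh, s, lc):
    # P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S): one factor per node.
    p_lc_given_parents = p_lc[(fh, s)] if lc else 1 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given_parents

print(round(joint(fh=True, s=True, lc=True), 3))   # 0.1 * 0.3 * 0.8 = 0.024
```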


Text Classification - 2

- Exercise:
  1. Classify the new document x = (sqrt=25, bit=30, chip=25, web=15, limit=20, cpu=12, equal=30) using the training documents below (one possible approach is sketched after the table)
- Each document is a vector of frequencies of all the words in the document:

  Word    Doc1   Doc2   Doc3   Doc4   Doc5   Doc6
  sqrt     42     10     11     33     28      8
  bit      25     28     25     40     32     22
  chip      7     45     22      8      9     30
  log      56      2      4     48     60      1
  web       2     69     40      3      6     29
  limit    66      8      9     40     29      3
  cpu       1     56     44      4      7     40
  equal    89     40     34     77     60     33
  class   Math   Comp   Comp   Math   Math   Comp

- Class attribute: Math = maths related, Comp = computers related
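One way to attack this exercise is a multinomial naïve Bayes model with Laplace smoothing over the word counts. The sketch below follows that approach; the multinomial model, the smoothing, and the choice to treat words missing from x (such as log) as count 0 are assumptions, not something the slide prescribes:

```python
import math

# Word-frequency table from the slide: Doc1..Doc6 with their class labels.
docs = {
    "Doc1": ({"sqrt": 42, "bit": 25, "chip": 7,  "log": 56, "web": 2,  "limit": 66, "cpu": 1,  "equal": 89}, "Math"),
    "Doc2": ({"sqrt": 10, "bit": 28, "chip": 45, "log": 2,  "web": 69, "limit": 8,  "cpu": 56, "equal": 40}, "Comp"),
    "Doc3": ({"sqrt": 11, "bit": 25, "chip": 22, "log": 4,  "web": 40, "limit": 9,  "cpu": 44, "equal": 34}, "Comp"),
    "Doc4": ({"sqrt": 33, "bit": 40, "chip": 8,  "log": 48, "web": 3,  "limit": 40, "cpu": 4,  "equal": 77}, "Math"),
    "Doc5": ({"sqrt": 28, "bit": 32, "chip": 9,  "log": 60, "web": 6,  "limit": 29, "cpu": 7,  "equal": 60}, "Math"),
    "Doc6": ({"sqrt": 8,  "bit": 22, "chip": 30, "log": 1,  "web": 29, "limit": 3,  "cpu": 40, "equal": 33}, "Comp"),
}
vocab = ["sqrt", "bit", "chip", "log", "web", "limit", "cpu", "equal"]

# The new document from the exercise; "log" does not occur in it, so it gets count 0.
x = {"sqrt": 25, "bit": 30, "chip": 25, "web": 15, "limit": 20, "cpu": 12, "equal": 30}

def log_score(cls):
    rows = [freqs for freqs, c in docs.values() if c == cls]
    score = math.log(len(rows) / len(docs))              # log P(Ci)
    total = sum(sum(f.values()) for f in rows)           # total word count in class Ci
    for w in vocab:
        count = sum(f[w] for f in rows)
        p_w = (count + 1) / (total + len(vocab))         # Laplace-smoothed P(w|Ci)
        score += x.get(w, 0) * math.log(p_w)             # multinomial log-likelihood
    return score

print({c: round(log_score(c), 1) for c in ("Math", "Comp")})   # higher log-score wins
```

Under these assumptions the Comp class comes out with the higher log-score for this x; a different model (e.g., Bernoulli naïve Bayes) or no smoothing could of course change the numbers.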
