

jalali@mshdiua.ac.ir Jalali.mshdiau.ac.ir

Data Mining

Bayesian Classification


Bayesian Learning (Naïve Bayesian Classifier)

- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct
- When to use: a moderate or large training set is available, and the attributes that describe instances are conditionally independent given the classification
- Successful applications: diagnosis, classifying text documents

Bayes' Theorem: Basics

- Let X be a data sample ("evidence"): the class label is unknown
- Let H be a hypothesis that X belongs to class C
- Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
- P(H) (prior probability): the initial probability
  - E.g., the probability that X will buy a computer, regardless of age, income, ...
- P(X): the probability that the sample data is observed
- P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  - E.g., given that X will buy a computer, the probability that X is 31..40 with medium income

Bayes' Theorem

- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

- Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
- Practical difficulty: requires initial knowledge of many probabilities, and has significant computational cost
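As a quick numeric illustration of the theorem (the prior, likelihood, and evidence values below are made up purely for illustration, not taken from the data used later):

```python
# Made-up numbers, for illustration of Bayes' theorem only.
p_h = 0.3            # P(H): prior probability of the hypothesis
p_x_given_h = 0.5    # P(X|H): likelihood of observing X if H holds
p_x = 0.4            # P(X): probability of observing X

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem
print(p_h_given_x)                      # 0.375
```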

Towards Naïve Bayesian Classifier

- Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn)
- Suppose there are m classes C1, C2, ..., Cm
- Classification is to derive the maximum a posteriori hypothesis, i.e., the maximal P(Ci|X)
- This can be derived from Bayes' theorem:

  P(Ci|X) = P(X|Ci) P(Ci) / P(X)

- Since P(X) is constant for all classes, only

  P(X|Ci) P(Ci)

  needs to be maximized

Derivation of Naïve Bayes Classifier

- A simplified assumption: attributes are conditionally independent (i.e., there are no dependence relations between attributes):

  P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

- This greatly reduces the computation cost: only the class distribution has to be counted
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
- If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

  g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

  and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
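A minimal Python sketch of these two estimates, assuming the attribute values of the class-Ci tuples are simply given as a list (the function names and data layout are illustrative choices, not from the slides):

```python
import math

def categorical_prob(xk, values_in_class):
    # P(xk|Ci) for a categorical attribute Ak: fraction of class-Ci tuples
    # whose value for Ak equals xk.
    return values_in_class.count(xk) / len(values_in_class)

def gaussian_prob(xk, values_in_class):
    # P(xk|Ci) for a continuous attribute Ak: Gaussian density g(xk, mu, sigma)
    # with mu and sigma estimated from the class-Ci tuples.
    mu = sum(values_in_class) / len(values_in_class)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values_in_class) / len(values_in_class))
    return math.exp(-((xk - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
```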

Naïve Bayesian Classifier: Training Dataset

- Class: C1: buys_computer = yes, C2: buys_computer = no
- Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

  age     income   student  credit_rating  buys_computer
  <=30    high     no       fair           no
  <=30    high     no       excellent      no
  31..40  high     no       fair           yes
  >40     medium   no       fair           yes
  >40     low      yes      fair           yes
  >40     low      yes      excellent      no
  31..40  low      yes      excellent      yes
  <=30    medium   no       fair           no
  <=30    low      yes      fair           yes
  >40     medium   yes      fair           yes
  <=30    medium   yes      excellent      yes
  31..40  medium   no       excellent      yes
  31..40  high     yes      fair           yes
  >40     medium   no       excellent      no

Naïve Bayesian Classifier: An Example

- Recall: P(Ci|X) = P(X|Ci) P(Ci) / P(X), with P(X|Ci) = ∏(k=1..n) P(xk|Ci)
- P(Ci):
  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no) = 5/14 = 0.357
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- Compute P(xk|Ci) for each class:
  P(age = "<=30" | buys_computer = yes) = 2/9 = 0.222
  P(age = "<=30" | buys_computer = no) = 3/5 = 0.6
  P(income = medium | buys_computer = yes) = 4/9 = 0.444
  P(income = medium | buys_computer = no) = 2/5 = 0.4
  P(student = yes | buys_computer = yes) = 6/9 = 0.667
  P(student = yes | buys_computer = no) = 1/5 = 0.2
  P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
- Compute P(X|Ci) for each class:
  P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
- Compute P(X|Ci) P(Ci):
  P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
  P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007
- Therefore, X belongs to class buys_computer = yes
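The same numbers can be reproduced with a short script; this is a minimal sketch over the 14-tuple training table above (the tuple layout and the rounding are illustrative choices):

```python
# Training tuples: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")   # the unseen tuple X

for c in ("yes", "no"):
    rows = [r for r in data if r[4] == c]
    prior = len(rows) / len(data)                 # P(Ci)
    likelihood = 1.0
    for k, xk in enumerate(x):                    # P(X|Ci) = product over attributes
        likelihood *= sum(1 for r in rows if r[k] == xk) / len(rows)
    print(c, round(likelihood, 3), round(likelihood * prior, 3))
# yes 0.044 0.028
# no 0.019 0.007   -> predict buys_computer = yes
```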

Avoiding the 0-Probability Problem

- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

  P(X|Ci) = ∏(k=1..n) P(xk|Ci)

- Example: suppose a dataset with 1000 tuples of one class has income = low (0 tuples), income = medium (990 tuples), and income = high (10 tuples)
- Use the Laplacian correction (or Laplacian estimator): add 1 to each case
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
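A minimal sketch of this correction for the income example (the dictionary layout is an illustrative choice):

```python
# Counts of the income values within one class (1000 tuples in total).
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())          # 1000
num_values = len(counts)              # 3 distinct income values

for value, count in counts.items():
    uncorrected = count / total                      # "low" would get probability 0
    corrected = (count + 1) / (total + num_values)   # Laplacian correction: add 1 to each case
    print(value, uncorrected, round(corrected, 4))
# low 0.0 0.001      (1/1003)
# medium 0.99 0.988  (991/1003)
# high 0.01 0.011    (11/1003)
```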



Naïve Bayesian Classifier: Comments

- Advantages:
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages:
  - Assumes class conditional independence, therefore loses accuracy
  - Practically, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Dependencies among these cannot be modeled by a naïve Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

Bayesian Belief Networks

- A Bayesian belief network allows a subset of the variables to be conditionally independent (also called Bayes nets, directed graphical models, BNs, ...)
- A graphical model of causal relationships:
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution
- Nodes: variables; links: dependencies

  [Figure: example network with nodes X, Y, Z, P and directed links X → Z, Y → Z, Y → P]

  - X and Y are the parents of Z, and Y is the parent of P
  - There is no dependency between Z and P
  - The graph has no loops or cycles
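As a small add-on, the example network can be written as a mapping from each node to its parents, from which the factorization of the joint distribution can be read off (the dictionary representation is an illustrative choice, not from the slides):

```python
# Structure of the example network above: X -> Z, Y -> Z, Y -> P.
parents = {"X": [], "Y": [], "Z": ["X", "Y"], "P": ["Y"]}

# Each node contributes one factor P(node | its parents) to the joint distribution.
factors = [f"P({n}|{','.join(ps)})" if ps else f"P({n})" for n, ps in parents.items()]
print(" ".join(factors))   # P(X) P(Y) P(Z|X,Y) P(P|Y)
```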

Bayesian Belief Networks

- Simple (naïve) Bayesian: the class concept C is the only parent of X1, X2, ..., Xn

  [Figure: node C with arrows to X1, X2, ..., Xn]

  P(x1, x2, ..., xn, c) = P(c) P(x1|c) P(x2|c) ... P(xn|c)

- Bayesian belief network: attributes may depend on each other as well as on C, e.g.

  [Figure: node C with arrows to X1, X2, X3, X4; X1 and X2 also point to X3]

  P(x1, x2, x3, x4, c) = P(c) P(x1|c) P(x2|c) P(x3|x1,x2,c) P(x4|c)

Bayesian Belief Network: An Example

- Nodes = variables; arcs = dependencies; the direction of an arc represents causality

  [Figure: belief network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea; FamilyHistory and Smoker are the parents of LungCancer]

- The conditional probability table (CPT) for the variable LungCancer:

         (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC       0.8       0.5        0.7        0.1
  ~LC      0.2       0.5        0.3        0.9

- The CPT shows the conditional probability of LungCancer for each possible combination of values of its parents

Bayesian Belief Networks

- Derivation of the probability of a particular combination of values of X from the CPTs:

  P(x1, ..., xn) = ∏(i=1..n) P(xi | Parents(Xi))
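A minimal sketch of evaluating this product for part of the lung-cancer network above; the LungCancer CPT values come from the slide, while the priors assumed for FamilyHistory and Smoker are made-up placeholders, since the slides do not give them:

```python
# Assumed (made-up) priors for the parent nodes -- not given on the slides.
p_fh = {True: 0.1, False: 0.9}       # P(FamilyHistory)
p_s  = {True: 0.3, False: 0.7}       # P(Smoker)
# CPT from the slide: P(LungCancer = yes | FamilyHistory, Smoker)
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}

def joint(fh, s, lc):
    # P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S): one factor per node.
    p_lc_given_parents = p_lc[(fh, s)] if lc else 1 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given_parents

print(round(joint(fh=True, s=True, lc=True), 3))   # 0.1 * 0.3 * 0.8 = 0.024
```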


Text Classification - 2

- Exercise:
  1. Classify the new document x = (sqrt=25, bit=30, chip=25, web=15, limit=20, cpu=12, equal=30) using the training documents below (one possible approach is sketched after the table)
- Each document is a vector of frequencies of all the words in the document:

  Word    Doc1   Doc2   Doc3   Doc4   Doc5   Doc6
  sqrt     42     10     11     33     28      8
  bit      25     28     25     40     32     22
  chip      7     45     22      8      9     30
  log      56      2      4     48     60      1
  web       2     69     40      3      6     29
  limit    66      8      9     40     29      3
  cpu       1     56     44      4      7     40
  equal    89     40     34     77     60     33
  class   Math   Comp   Comp   Math   Math   Comp

- Class attribute: Math = maths related, Comp = computers related
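One way to attack this exercise is a multinomial naïve Bayes model with Laplace smoothing over the word counts. The sketch below follows that approach; the multinomial model, the smoothing, and the choice to treat words missing from x (such as log) as count 0 are assumptions, not something the slide prescribes:

```python
import math

# Word-frequency table from the slide: Doc1..Doc6 with their class labels.
docs = {
    "Doc1": ({"sqrt": 42, "bit": 25, "chip": 7,  "log": 56, "web": 2,  "limit": 66, "cpu": 1,  "equal": 89}, "Math"),
    "Doc2": ({"sqrt": 10, "bit": 28, "chip": 45, "log": 2,  "web": 69, "limit": 8,  "cpu": 56, "equal": 40}, "Comp"),
    "Doc3": ({"sqrt": 11, "bit": 25, "chip": 22, "log": 4,  "web": 40, "limit": 9,  "cpu": 44, "equal": 34}, "Comp"),
    "Doc4": ({"sqrt": 33, "bit": 40, "chip": 8,  "log": 48, "web": 3,  "limit": 40, "cpu": 4,  "equal": 77}, "Math"),
    "Doc5": ({"sqrt": 28, "bit": 32, "chip": 9,  "log": 60, "web": 6,  "limit": 29, "cpu": 7,  "equal": 60}, "Math"),
    "Doc6": ({"sqrt": 8,  "bit": 22, "chip": 30, "log": 1,  "web": 29, "limit": 3,  "cpu": 40, "equal": 33}, "Comp"),
}
vocab = ["sqrt", "bit", "chip", "log", "web", "limit", "cpu", "equal"]

# The new document from the exercise; "log" does not occur in it, so it gets count 0.
x = {"sqrt": 25, "bit": 30, "chip": 25, "web": 15, "limit": 20, "cpu": 12, "equal": 30}

def log_score(cls):
    rows = [freqs for freqs, c in docs.values() if c == cls]
    score = math.log(len(rows) / len(docs))              # log P(Ci)
    total = sum(sum(f.values()) for f in rows)           # total word count in class Ci
    for w in vocab:
        count = sum(f[w] for f in rows)
        p_w = (count + 1) / (total + len(vocab))         # Laplace-smoothed P(w|Ci)
        score += x.get(w, 0) * math.log(p_w)             # multinomial log-likelihood
    return score

print({c: round(log_score(c), 1) for c in ("Math", "Comp")})   # higher log-score wins
```

Under these assumptions the Comp class comes out with the higher log-score for this x; a different model (e.g., Bernoulli naïve Bayes) or no smoothing could of course change the numbers.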
