Course: M0614 / Data Mining & OLAP
Year: Feb 2010

Classification and Prediction
Session 08

Learning Outcomes
By the end of this session, students are expected to be able to:
• Apply the classification techniques of decision tree induction, Bayesian classification, classification by backpropagation, and lazy learners in data mining. (C3)

Bina Nusantara

Acknowledgments
These slides have been adapted from Han, J., Kamber, M., & Pei, J. Data Mining: Concepts and Techniques, and Tan, P.-N., Steinbach, M., & Kumar, V. Introduction to Data Mining.


Outline
• Bayesian classification


Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes' theorem
• Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

June 20, 2010

Data Mining: Concepts and Techniques


Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X) (posterior probability), the probability that the hypothesis holds given the observed data sample X
• P(H) (prior probability), the initial probability
– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood), the probability of observing the sample X, given that the hypothesis holds
– E.g., given that X will buy computer, the prob. that X is 31..40, medium income


Bayesian Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

• Informally, this can be written as: posterior = likelihood × prior / evidence
• Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
• Practical difficulty: requires initial knowledge of many probabilities and incurs significant computational cost


Example of Bayes Theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what is the probability that he/she has meningitis?

$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$$
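The meningitis calculation can be reproduced in a few lines. This is a minimal sketch; the function name `posterior` and the variable names are illustrative, not part of the slides. Exact fractions are used so the result is not affected by rounding.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood * prior / evidence


p_s_given_m = Fraction(1, 2)   # P(S|M): meningitis causes stiff neck 50% of the time
p_m = Fraction(1, 50000)       # P(M): prior probability of meningitis
p_s = Fraction(1, 20)          # P(S): prior probability of stiff neck

p_m_given_s = posterior(p_s_given_m, p_m, p_s)
print(p_m_given_s, float(p_m_given_s))  # 1/5000 0.0002
```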

Bayesian Classifiers
• Consider each attribute and class label as random variables
• Given a record with attributes (A1, A2, …, An)
– Goal is to predict class C
– Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?

Bayesian Classifiers
• Approach:
– Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes' theorem:

$$P(C \mid A_1 A_2 \dots A_n) = \frac{P(A_1 A_2 \dots A_n \mid C)\,P(C)}{P(A_1 A_2 \dots A_n)}$$

– Choose the value of C that maximizes P(C | A1, A2, …, An)
– Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
• How to estimate P(A1, A2, …, An | C)?
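The equivalence holds because the denominator P(A1, A2, …, An) does not depend on C, so it cannot change which class attains the maximum. A short derivation, where the symbol \hat{C} for the predicted class is notation introduced here for illustration:

```latex
\begin{aligned}
\hat{C} &= \arg\max_{C} \; P(C \mid A_1 A_2 \dots A_n) \\
        &= \arg\max_{C} \; \frac{P(A_1 A_2 \dots A_n \mid C)\,P(C)}{P(A_1 A_2 \dots A_n)} \\
        &= \arg\max_{C} \; P(A_1 A_2 \dots A_n \mid C)\,P(C)
\end{aligned}
```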

Naïve Bayes Classifier
• Assume independence among attributes Ai when class is given:
– P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
– Can estimate P(Ai | Cj) for all Ai and Cj
– New point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal

How to Estimate Probabilities from Data?

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Evade (class)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Class: P(C) = Nc/N
– e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nck
– where |Aik| is the number of instances that have attribute value Ai and belong to class Ck
– Examples: P(Status=Married|No) = 4/7, P(Refund=Yes|Yes) = 0
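The counting estimates above can be checked directly against the training table. A minimal sketch: the record layout and the helper names `prior` and `conditional` are assumptions made for illustration, and only the two categorical attributes plus the class are included.

```python
from collections import Counter
from fractions import Fraction

# (Refund, Marital Status, Evade) for Tid 1..10, copied from the training table
records = [
    ("Yes", "Single", "No"),
    ("No", "Married", "No"),
    ("No", "Single", "No"),
    ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"),
    ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),
    ("No", "Single", "Yes"),
    ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

# Nc: number of training records in each class
class_counts = Counter(evade for _, _, evade in records)


def prior(c):
    """P(C) = Nc / N."""
    return Fraction(class_counts[c], len(records))


def conditional(attr_index, value, c):
    """P(Ai = value | C = c) = |Aik| / Nck, estimated by counting."""
    matches = sum(1 for r in records if r[attr_index] == value and r[2] == c)
    return Fraction(matches, class_counts[c])


print(prior("No"), prior("Yes"))         # 7/10 3/10
print(conditional(1, "Married", "No"))   # 4/7
print(conditional(0, "Yes", "Yes"))      # 0
```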

How to Estimate Probabilities from Data?
• For continuous attributes:
– Discretize the range into bins
  • one ordinal attribute per bin
  • violates independence assumption
– Two-way split: (A < v) or (A > v)
  • choose only one of the two splits as new attribute
– Probability density estimation:
  • Assume attribute follows a normal distribution
  • Use data to estimate parameters of distribution (e.g., mean and standard deviation)
  • Once probability distribution is known, can use it to estimate the conditional probability P(Ai|c)

How to Estimate Probabilities from Data?

(Uses the training table from the previous slide: Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)

• Normal distribution:

$$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$

– One for each (Ai, cj) pair

• For (Income, Class=No):
– If Class=No
  • sample mean = 110
  • sample variance = 2975

$$P(\text{Income}=120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left(-\frac{(120-110)^2}{2(2975)}\right) = 0.0072$$
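The density value can be verified numerically. A minimal sketch, assuming the sample (n − 1) variance, which is what reproduces the slide's figure of 2975; the helper name `gaussian_pdf` is illustrative.

```python
import math
from statistics import mean, variance

# Taxable income (in K) of the seven training records with Evade = No
income_no = [125, 100, 70, 120, 60, 220, 75]


def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


mu_no = mean(income_no)        # sample mean
var_no = variance(income_no)   # sample variance (divides by n - 1)
density = gaussian_pdf(120, mu_no, var_no)

print(mu_no, var_no, round(density, 4))
```

Note that this is a density, not a probability; it is used here only to compare classes against each other.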

Example of Naïve Bayes Classifier
Given a Test Record:

X = (Refund = No, Married, Income = 120K)

Naïve Bayes Classifier:
• P(Refund=Yes|No) = 3/7
• P(Refund=No|No) = 4/7
• P(Refund=Yes|Yes) = 0
• P(Refund=No|Yes) = 1
• P(Marital Status=Single|No) = 2/7
• P(Marital Status=Divorced|No) = 1/7
• P(Marital Status=Married|No) = 4/7
• P(Marital Status=Single|Yes) = 2/3
• P(Marital Status=Divorced|Yes) = 1/3
• P(Marital Status=Married|Yes) = 0
• For taxable income:
– If class=No: sample mean = 110, sample variance = 2975
– If class=Yes: sample mean = 90, sample variance = 25

• P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024
• P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes) = 1 × 0 × 1.2 × 10^-9 = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes), it follows that P(No|X) > P(Yes|X)

=> Class = No
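The whole worked example can be sketched end to end as follows. This is an illustrative implementation under the slide's assumptions (counting estimates for Refund and Marital Status, a normal density with sample variance for Taxable Income); the function names and data layout are not from the slides.

```python
import math

# (Refund, Marital Status, Taxable Income in K, Evade) for Tid 1..10
data = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def likelihood(refund, status, income, c):
    """P(X | class c) under the naive independence assumption."""
    rows = [r for r in data if r[3] == c]
    n = len(rows)
    p_refund = sum(r[0] == refund for r in rows) / n
    p_status = sum(r[1] == status for r in rows) / n
    incomes = [r[2] for r in rows]
    mu = sum(incomes) / n
    var = sum((v - mu) ** 2 for v in incomes) / (n - 1)  # sample variance
    return p_refund * p_status * gaussian_pdf(income, mu, var)


def classify(refund, status, income):
    """Return the class maximizing P(X|C) * P(C), plus all scores."""
    scores = {}
    for c in ("No", "Yes"):
        prior = sum(r[3] == c for r in data) / len(data)
        scores[c] = prior * likelihood(refund, status, income, c)
    return max(scores, key=scores.get), scores


label, scores = classify("No", "Married", 120)
print(label, scores)
```

The score for Yes is exactly zero because no Married record in the training data has Evade = Yes, which is the problem the next slide addresses.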

Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:
– Original: $P(A_i \mid C) = \frac{N_{ic}}{N_c}$
– Laplace: $P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$
– m-estimate: $P(A_i \mid C) = \frac{N_{ic} + mp}{N_c + m}$
– c: number of classes, p: prior probability, m: parameter
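The three estimators can be compared on the zero count from the tax example, P(Married|Yes), where N_ic = 0 and N_c = 3. A minimal sketch: the function names are illustrative, and the values m = 3 and p = 1/3 are assumptions chosen only to show the effect (c = 2 follows the slide's definition of c as the number of classes).

```python
from fractions import Fraction


def original(n_ic, n_c):
    """Unsmoothed estimate: N_ic / N_c."""
    return Fraction(n_ic, n_c)


def laplace(n_ic, n_c, c):
    """Laplace estimate: (N_ic + 1) / (N_c + c)."""
    return Fraction(n_ic + 1, n_c + c)


def m_estimate(n_ic, n_c, m, p):
    """m-estimate: (N_ic + m * p) / (N_c + m)."""
    return (n_ic + m * p) / Fraction(n_c + m)


n_ic, n_c = 0, 3  # no Married records among the 3 records with Evade = Yes
print(original(n_ic, n_c))                       # 0
print(laplace(n_ic, n_c, c=2))                   # 1/5
print(m_estimate(n_ic, n_c, 3, Fraction(1, 3)))  # 1/6 (m and p are assumed values)
```

Both smoothed estimates are nonzero, so a single unseen attribute value no longer forces the whole product to zero.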

Example of Naïve Bayes Classifier
Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record:

Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

A: attributes, M: mammals, N: non-mammals

$$P(A \mid M) = \frac{6}{7} \times \frac{6}{7} \times \frac{2}{7} \times \frac{2}{7} = 0.06$$

$$P(A \mid N) = \frac{1}{13} \times \frac{10}{13} \times \frac{3}{13} \times \frac{4}{13} = 0.0042$$

$$P(A \mid M)\,P(M) = 0.06 \times \frac{7}{20} = 0.021$$

$$P(A \mid N)\,P(N) = 0.0042 \times \frac{13}{20} = 0.0027$$

P(A|M)P(M) > P(A|N)P(N) => Mammals
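The fractions in this example can be recomputed from the table by counting. A minimal sketch with exact arithmetic; the dictionary layout, the class codes "M" and "N", and the function names are assumptions made for illustration.

```python
from fractions import Fraction

# name: (give_birth, can_fly, live_in_water, have_legs, class)
# class codes: "M" = mammals, "N" = non-mammals
animals = {
    "human": ("yes", "no", "no", "yes", "M"),
    "python": ("no", "no", "no", "no", "N"),
    "salmon": ("no", "no", "yes", "no", "N"),
    "whale": ("yes", "no", "yes", "no", "M"),
    "frog": ("no", "no", "sometimes", "yes", "N"),
    "komodo": ("no", "no", "no", "yes", "N"),
    "bat": ("yes", "yes", "no", "yes", "M"),
    "pigeon": ("no", "yes", "no", "yes", "N"),
    "cat": ("yes", "no", "no", "yes", "M"),
    "leopard shark": ("yes", "no", "yes", "no", "N"),
    "turtle": ("no", "no", "sometimes", "yes", "N"),
    "penguin": ("no", "no", "sometimes", "yes", "N"),
    "porcupine": ("yes", "no", "no", "yes", "M"),
    "eel": ("no", "no", "yes", "no", "N"),
    "salamander": ("no", "no", "sometimes", "yes", "N"),
    "gila monster": ("no", "no", "no", "yes", "N"),
    "platypus": ("no", "no", "no", "yes", "M"),
    "owl": ("no", "yes", "no", "yes", "N"),
    "dolphin": ("yes", "no", "yes", "no", "M"),
    "eagle": ("no", "yes", "no", "yes", "N"),
}


def likelihood(record, c):
    """P(A | class c): product of per-attribute counting estimates."""
    rows = [r for r in animals.values() if r[4] == c]
    p = Fraction(1)
    for i, value in enumerate(record):
        p *= Fraction(sum(r[i] == value for r in rows), len(rows))
    return p


def score(record, c):
    """P(A | c) * P(c)."""
    n_c = sum(r[4] == c for r in animals.values())
    return likelihood(record, c) * Fraction(n_c, len(animals))


test_record = ("yes", "no", "yes", "no")
for c in ("M", "N"):
    print(c, likelihood(test_record, c), float(score(test_record, c)))
```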

Example Naïve Bayesian Classifier: Training Dataset
Class:
• C1: buys_computer = ‘yes’
• C2: buys_computer = ‘no’

Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Example Naïve Bayesian Classifier: Training Dataset
• P(Ci):
– P(buys_computer = “yes”) = 9/14 = 0.643
– P(buys_computer = “no”) = 5/14 = 0.357

• Compute P(X|Ci) for each class:
– P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
– P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
– P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
– P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
– P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
– P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
– P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
– P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• X = (age <= 30, income = medium, student = yes, credit_rating = fair)

• P(X|Ci):
– P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
– P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

• P(X|Ci) × P(Ci):
– P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
– P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007

• Therefore, X belongs to class (“buys_computer = yes”)
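The same numbers can be obtained by counting over the 14 training tuples. A minimal sketch: the tuple layout and the name `score` are illustrative, and the age ranges are kept as plain strings (with "31..40" spelled using two ASCII dots).

```python
# (age, income, student, credit_rating, buys_computer), one tuple per training row
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def score(x, c):
    """P(X | Ci) * P(Ci) with counting estimates for every attribute."""
    in_class = [r for r in rows if r[4] == c]
    p = len(in_class) / len(rows)  # prior P(Ci)
    for i, value in enumerate(x):
        p *= sum(r[i] == value for r in in_class) / len(in_class)
    return p


x = ("<=30", "medium", "yes", "fair")
scores = {c: score(x, c) for c in ("yes", "no")}
prediction = max(scores, key=scores.get)
print(prediction, scores)
```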

Naïve Bayes: Summary
• Robust to isolated noise points
• Handles missing values by ignoring the instance during probability estimate calculations
• Robust to irrelevant attributes
• Independence assumption may not hold for some attributes
– Use other techniques such as Bayesian Belief Networks (BBN)