
CS 4870: Machine Learning - Homework #4

Due on March 6, 2018 at 11:59 PM

Professor Kilian Weinberger, 8:40 AM

A,D,K,M,Z


Problem 1
a. From the question:
P(R) = 1/2
P(B) = 1/2
P(H|R) = 3/5
P(H|B) = 7/10

By Law of Total Probability, we have:

\[
P(H) = P(H \mid R)P(R) + P(H \mid B)P(B) = \frac{3}{5}\cdot\frac{1}{2} + \frac{7}{10}\cdot\frac{1}{2} = \frac{6}{20} + \frac{7}{20} = \frac{13}{20}
\]
By Bayes Rule:

\[
P(R \mid H) = \frac{P(H \mid R)P(R)}{P(H)} = \frac{3}{5}\cdot\frac{1}{2}\cdot\frac{20}{13} = \frac{6}{13}
\]
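As a quick sanity check (not part of the original solution), the values above can be reproduced with exact fractions in Python:

```python
from fractions import Fraction as F

# Given: prior over hats and per-hat probability of heads.
p_r, p_b = F(1, 2), F(1, 2)        # P(R), P(B)
p_h_r, p_h_b = F(3, 5), F(7, 10)   # P(H|R), P(H|B)

p_h = p_h_r * p_r + p_h_b * p_b    # law of total probability
p_r_given_h = p_h_r * p_r / p_h    # Bayes rule

print(p_h, p_r_given_h)            # 13/20 6/13
```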

b. From the question (probabilities are P([coin] is heads | hat color)):

P(P|R) = 3/5    P(P|B) = 7/10
P(N|R) = 3/10   P(N|B) = 1/5
P(D|R) = 1/2    P(D|B) = 1/10
P(Q|R) = 4/5    P(Q|B) = 2/5

We make the naive Bayes assumption. By Bayes Rule (notice the 1/2 terms cancel):

\[
P(HHTH \mid R) = \frac{3}{5}\cdot\frac{3}{10}\cdot\frac{1}{2}\cdot\frac{4}{5} = \frac{9}{125}
\]
\[
P(HHTH \mid B) = \frac{7}{10}\cdot\frac{1}{5}\cdot\frac{9}{10}\cdot\frac{2}{5} = \frac{63}{1250}
\]
\[
P(R \mid HHTH) = \frac{P(HHTH \mid R)}{P(HHTH \mid R) + P(HHTH \mid B)} = \frac{10}{17}
\]
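A short numeric check of this posterior (not part of the original solution); the helper name `nb_posterior_red` is our own, and the probabilities are the part (b) estimates listed above:

```python
from fractions import Fraction as F

def nb_posterior_red(heads_red, heads_blue, outcome):
    """Naive Bayes posterior P(R | outcome) with a uniform prior over hats."""
    like_r = like_b = F(1)
    for pr, pb, o in zip(heads_red, heads_blue, outcome):
        like_r *= pr if o == "H" else 1 - pr
        like_b *= pb if o == "H" else 1 - pb
    return like_r / (like_r + like_b)   # the 1/2 prior terms cancel

# Part (b) heads probabilities for penny, nickel, dime, quarter.
red  = [F(3, 5), F(3, 10), F(1, 2), F(4, 5)]
blue = [F(7, 10), F(1, 5), F(1, 10), F(2, 5)]
print(nb_posterior_red(red, blue, "HHTH"))   # 10/17
```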

c. After examining the data table (probabilities are P([coin] is heads | hat color)):

P(P|R) = 3/4    P(P|B) = 1/10
P(N|R) = 7/8    P(N|B) = 3/10
P(D|R) = 1/2    P(D|B) = 9/10
P(Q|R) = 1/8    P(Q|B) = 4/10

By Bayes Rule (notice the 1/2 terms cancel):

\[
P(HHTH \mid R) = \frac{3}{4}\cdot\frac{7}{8}\cdot\frac{1}{2}\cdot\frac{1}{8} = \frac{21}{512}
\]
\[
P(HHTH \mid B) = \frac{1}{10}\cdot\frac{3}{10}\cdot\frac{1}{10}\cdot\frac{4}{10} = \frac{3}{2500}
\]
\[
P(R \mid HHTH) = \frac{P(HHTH \mid R)}{P(HHTH \mid R) + P(HHTH \mid B)} = \frac{4375}{4503}
\]
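The same style of check with the part (c) estimates (again just a sketch with exact fractions):

```python
from fractions import Fraction as F

# Part (c) heads probabilities for penny, nickel, dime, quarter.
red  = [F(3, 4), F(7, 8), F(1, 2), F(1, 8)]
blue = [F(1, 10), F(3, 10), F(9, 10), F(4, 10)]

# Observed sequence HHTH: the dime (third coin) lands tails.
like_r = red[0] * red[1] * (1 - red[2]) * red[3]
like_b = blue[0] * blue[1] * (1 - blue[2]) * blue[3]
print(like_r / (like_r + like_b))   # 4375/4503
```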


d. Here X is the data (the heads/tails outcomes of the coins) and y is the hat color (red/blue). The Naive Bayes assumption holds because, once the hat color is fixed, the coins are flipped independently of one another; hence the features can be assumed to be conditionally independent given the label.

Problem 2
a.
\[
P(Ham \mid 0,0,1) = \frac{P(0,0,1 \mid Ham)P(Ham)}{P(0,0,1)} = 0
\]
\[
P(Ham \mid 1,1,1) = \frac{P(1,1,1 \mid Ham)P(Ham)}{P(1,1,1)} = \frac{\frac{1}{5}\cdot\frac{1}{3}}{\frac{1}{5}\cdot\frac{1}{3} + 0\cdot\frac{2}{3}} = 1
\]
\[
P(Ham \mid 1,0,0) = \frac{P(1,0,0 \mid Ham)P(Ham)}{P(1,0,0)} = 0
\]
\[
P(Ham \mid 0,0,0) = \frac{P(0,0,0 \mid Ham)P(Ham)}{P(0,0,0)} = \frac{0}{0} = \text{undefined}
\]
Yes, the last one is undefined. It seems unreasonable for every prediction to be guaranteed to be 0, 1, or undefined; this is because we do not use Laplace smoothing.
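As a small illustration of this point (not from the assignment), add-one Laplace smoothing keeps every conditional estimate strictly between 0 and 1, so no posterior collapses to 0, 1, or 0/0. The helper name below is ours, and the count of 5-out-of-5 ham emails containing "bacon" is taken from the estimates in part (d):

```python
from fractions import Fraction as F

def bernoulli_estimate(count_on, count_class, alpha=0):
    """Estimate P(feature=1 | class) with optional add-alpha smoothing."""
    return F(count_on + alpha, count_class + 2 * alpha)

# 5 of 5 ham emails contain "bacon" (consistent with part (d) below).
raw      = bernoulli_estimate(5, 5)            # 1   -> P(bacon=0 | Ham) = 0
smoothed = bernoulli_estimate(5, 5, alpha=1)   # 6/7 -> P(bacon=0 | Ham) = 1/7

print(1 - raw, 1 - smoothed)                   # 0 1/7
```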

b.

• Collecting more emails would help with our predictions, because a larger data sample gives us more realistic probability estimates.

• Extracting more features from each email would allow us to classify each email more accurately.

• Duplicating emails with uncommon features would not help; it only changes the estimated distribution of the emails.

• Making stronger assumptions is helpful; assuming our features are independent of each other is realistic for our data.

c.

\[
P(1,0,1 \mid Ham) = P(bacon{=}1 \mid Ham)\,P(ip{=}0 \mid Ham)\,P(mispell{=}1 \mid Ham) = 1\cdot\frac{2}{5}\cdot\frac{3}{5} = \frac{6}{25}
\]
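A quick check of the product above with exact fractions:

```python
from fractions import Fraction as F

# P(bacon=1|Ham) * P(ip=0|Ham) * P(mispell=1|Ham)
print(F(1) * F(2, 5) * F(3, 5))   # 6/25
```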

d.
P(bacon = 1|Spam) = 1/10
P(ip = 1|Spam) = 3/10
P(mispell = 1|Spam) = 7/10
P(bacon = 1|Ham) = 5/5
P(ip = 1|Ham) = 3/5
P(mispell = 1|Ham) = 3/5
P(Spam) = 2/3
P(Ham) = 1/3


\[
P(Ham \mid 0,0,1) = \frac{P(0,0,1 \mid Ham)P(Ham)}{P(0,0,1)} = 0
\]
\[
P(Ham \mid 1,1,1) = \frac{P(1,1,1 \mid Ham)P(Ham)}{P(1,1,1)} = \frac{60}{67}
\]
\[
P(Ham \mid 1,0,0) = \frac{P(1,0,0 \mid Ham)P(Ham)}{P(1,0,0)} = \frac{80}{101}
\]
\[
P(Ham \mid 0,0,0) = \frac{P(0,0,0 \mid Ham)P(Ham)}{P(0,0,0)} = 0
\]
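These posteriors can be reproduced directly from the part (d) estimates; a minimal sketch (our own helper, exact fractions, no smoothing):

```python
from fractions import Fraction as F

# Part (d) estimates: P(feature=1 | class) for (bacon, ip, mispell).
ham_on  = [F(5, 5), F(3, 5), F(3, 5)]
spam_on = [F(1, 10), F(3, 10), F(7, 10)]
prior_ham, prior_spam = F(1, 3), F(2, 3)

def posterior_ham(x):
    """P(Ham | x) for a binary feature vector x = (bacon, ip, mispell)."""
    lh, ls = prior_ham, prior_spam
    for p_h, p_s, xi in zip(ham_on, spam_on, x):
        lh *= p_h if xi else 1 - p_h
        ls *= p_s if xi else 1 - p_s
    return lh / (lh + ls) if lh + ls else None   # None would mean 0/0

for x in [(0, 0, 1), (1, 1, 1), (1, 0, 0), (0, 0, 0)]:
    print(x, posterior_ham(x))   # 0, 60/67, 80/101, 0
```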

e.
P(bacon = 1|Spam) = 5/18
P(ip = 1|Spam) = 7/18
P(mispell = 1|Spam) = 11/18
P(bacon = 1|Ham) = 9/13
P(ip = 1|Ham) = 7/13
P(mispell = 1|Ham) = 7/13
P(Spam) = 18/31
P(Ham) = 13/31

\[
P(1,0,1 \mid Ham) = P(bacon{=}1 \mid Ham)\,P(ip{=}0 \mid Ham)\,P(mispell{=}1 \mid Ham) = \frac{9}{13}\cdot\frac{6}{13}\cdot\frac{7}{13} \approx 0.172
\]

\[
P(Ham \mid 0,0,1) = \frac{P(0,0,1 \mid Ham)P(Ham)}{P(0,0,1)} \approx 0.16996
\]
\[
P(Ham \mid 1,1,1) = \frac{P(1,1,1 \mid Ham)P(Ham)}{P(1,1,1)} \approx 0.687
\]
\[
P(Ham \mid 1,0,0) = \frac{P(1,0,0 \mid Ham)P(Ham)}{P(1,0,0)} \approx 0.617
\]
\[
P(Ham \mid 0,0,0) = \frac{P(0,0,0 \mid Ham)P(Ham)}{P(0,0,0)} \approx 0.216
\]
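The same sketch applied to the part (e) estimates reproduces the decimals above; note that with these numbers the (1, 1, 1) case comes out to roughly 0.687:

```python
from fractions import Fraction as F

# Part (e) estimates: P(feature=1 | class) for (bacon, ip, mispell).
ham_on  = [F(9, 13), F(7, 13), F(7, 13)]
spam_on = [F(5, 18), F(7, 18), F(11, 18)]
prior_ham, prior_spam = F(13, 31), F(18, 31)

def posterior_ham(x):
    """P(Ham | x) for a binary feature vector x = (bacon, ip, mispell)."""
    lh, ls = prior_ham, prior_spam
    for p_h, p_s, xi in zip(ham_on, spam_on, x):
        lh *= p_h if xi else 1 - p_h
        ls *= p_s if xi else 1 - p_s
    return lh / (lh + ls)

for x in [(0, 0, 1), (1, 1, 1), (1, 0, 0), (0, 0, 0)]:
    print(x, round(float(posterior_ham(x)), 3))   # 0.17, 0.687, 0.617, 0.216
```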


Problem 3
1.
\[
p(y=1 \mid x) = \frac{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1)}{p(x)}
\]
\[
= \frac{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1)}{p(x \mid y=1)\,p(y=1) + p(x \mid y=0)\,p(y=0)} \qquad \text{(sum rule)}
\]
\[
= \frac{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1)}{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1) + \prod_{\alpha=1}^{d} p([x]_\alpha \mid y=0)\, p(y=0)} \qquad \text{(Naive Bayes assumption and product rule)}
\]

2. Dividing the numerator and denominator by the numerator:
\[
p(y=1 \mid x) = \frac{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1)}{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1) + \prod_{\alpha=1}^{d} p([x]_\alpha \mid y=0)\, p(y=0)}
\]
\[
= \frac{1}{1 + \frac{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=0)\, p(y=0)}{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1)}}
\]
\[
= \frac{1}{1 + \exp\!\left(\log \frac{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=0)\, p(y=0)}{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1)}\right)}
\]
\[
= \frac{1}{1 + \exp\!\left(-\log \frac{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1)}{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=0)\, p(y=0)}\right)}
\]

3. Define $\vec{w}$ and $b$ as follows:
\[
w_\alpha = [\vec{w}]_\alpha = \frac{\mu_{\alpha 1} - \mu_{\alpha 0}}{\sigma_\alpha^2}
\]
\[
b = \log\!\left(\frac{p(y=1)}{p(y=0)}\right) - \sum_{\alpha=1}^{d} \frac{\mu_{\alpha 1}^2 - \mu_{\alpha 0}^2}{2\sigma_\alpha^2}
\]


Then, given that $p([x]_\alpha \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma_\alpha}\exp\!\left(-\frac{(x_\alpha - \mu_{\alpha y})^2}{2\sigma_\alpha^2}\right)$,
\[
h(\vec{x}) = 1 \iff \frac{P(y=1 \mid \vec{x})}{P(y=0 \mid \vec{x})} > 1
\]
\[
\iff \frac{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)\, p(y=1)}{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=0)\, p(y=0)} > 1
\]
\[
\iff \log \frac{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=1)}{\prod_{\alpha=1}^{d} p([x]_\alpha \mid y=0)} + \log \frac{p(y=1)}{p(y=0)} > 0
\]
\[
\iff \log \frac{\prod_{\alpha=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_\alpha}\exp\!\left(-\frac{(x_\alpha - \mu_{\alpha 1})^2}{2\sigma_\alpha^2}\right)}{\prod_{\alpha=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_\alpha}\exp\!\left(-\frac{(x_\alpha - \mu_{\alpha 0})^2}{2\sigma_\alpha^2}\right)} + \log \frac{p(y=1)}{p(y=0)} > 0
\]
\[
\iff \log \frac{\exp\!\left(-\sum_{\alpha=1}^{d}\frac{(x_\alpha - \mu_{\alpha 1})^2}{2\sigma_\alpha^2}\right)}{\exp\!\left(-\sum_{\alpha=1}^{d}\frac{(x_\alpha - \mu_{\alpha 0})^2}{2\sigma_\alpha^2}\right)} + \log \frac{p(y=1)}{p(y=0)} > 0
\]
\[
\iff \log\!\left(\exp\!\left(-\sum_{\alpha=1}^{d}\frac{(x_\alpha - \mu_{\alpha 1})^2 - (x_\alpha - \mu_{\alpha 0})^2}{2\sigma_\alpha^2}\right)\right) + \log \frac{p(y=1)}{p(y=0)} > 0
\]
\[
\iff -\sum_{\alpha=1}^{d}\frac{(x_\alpha - \mu_{\alpha 1})^2 - (x_\alpha - \mu_{\alpha 0})^2}{2\sigma_\alpha^2} + \log \frac{p(y=1)}{p(y=0)} > 0
\]
\[
\iff -\sum_{\alpha=1}^{d}\frac{-2x_\alpha\mu_{\alpha 1} + \mu_{\alpha 1}^2 + 2x_\alpha\mu_{\alpha 0} - \mu_{\alpha 0}^2}{2\sigma_\alpha^2} + \log \frac{p(y=1)}{p(y=0)} > 0
\]
\[
\iff \sum_{\alpha=1}^{d}\frac{\mu_{\alpha 1} - \mu_{\alpha 0}}{\sigma_\alpha^2}\, x_\alpha - \sum_{\alpha=1}^{d}\frac{\mu_{\alpha 1}^2 - \mu_{\alpha 0}^2}{2\sigma_\alpha^2} + \log \frac{p(y=1)}{p(y=0)} > 0
\]
\[
\iff \sum_{\alpha=1}^{d} w_\alpha x_\alpha + \left(\log\!\left(\frac{p(y=1)}{p(y=0)}\right) - \sum_{\alpha=1}^{d}\frac{\mu_{\alpha 1}^2 - \mu_{\alpha 0}^2}{2\sigma_\alpha^2}\right) > 0
\]
\[
\iff \vec{w}\cdot\vec{x} + b > 0
\]

Therefore, Gaussian Naive Bayes with shared variances is linear.
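As a numerical spot-check (not part of the original solution), the Gaussian Naive Bayes log-odds can be compared against w·x + b for arbitrary parameters with class-shared, per-feature variances; all parameter values below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
mu1, mu0 = rng.normal(size=d), rng.normal(size=d)   # per-feature class means
sigma2 = rng.uniform(0.5, 2.0, size=d)              # shared per-feature variances
p1 = 0.3                                            # p(y=1); p(y=0) = 1 - p1

# w and b exactly as defined in step 3.
w = (mu1 - mu0) / sigma2
b = np.log(p1 / (1 - p1)) - np.sum((mu1**2 - mu0**2) / (2 * sigma2))

def log_likelihood(x, mu):
    """Sum of per-feature Gaussian log-densities (the Naive Bayes product in log space)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

x = rng.normal(size=d)
log_odds = (log_likelihood(x, mu1) + np.log(p1)) - (log_likelihood(x, mu0) + np.log(1 - p1))
print(np.isclose(log_odds, w @ x + b))   # True
```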
