
Bernoulli Naïve Bayes
Max likelihood: P(wi|c) = Nwi,c / Nc
- Nwi,c = number of occurrences of wi in class c (e.g. Nwi,funk = 10, Nwi,metal = 8); P(wi|c) can be zero.
- Record only whether a word occurs (0 or 1), so |Wi| = 2.
Laplace smoothing: P(wi|c) = (Nwi,c + 1) / (Nc + |Wi|)

Multinomial Naïve Bayes
Laplace smoothing: P(wi|c) = (Nwi,c + 1) / (Nc' + |V|)
- Nc' = total number of words occurring in c, duplicates included.
- |V| = total number of distinct words in the whole data set, e.g. |V| = 10 (excl. "tears").
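
A minimal Python sketch of the two smoothed estimates above; the class totals (Nc, Nc') are not given in the notes, so the 120-token and 12-document figures below are assumed purely for illustration.

```python
# Laplace-smoothed Naive Bayes estimates (toy totals assumed for illustration).

def bernoulli_smoothed(n_docs_with_word, n_docs_in_class):
    # Bernoulli NB: a word either occurs or not, so |Wi| = 2 outcomes.
    return (n_docs_with_word + 1) / (n_docs_in_class + 2)

def multinomial_smoothed(n_word_in_class, n_tokens_in_class, vocab_size):
    # Multinomial NB: denominator counts every word token in the class plus |V|.
    return (n_word_in_class + 1) / (n_tokens_in_class + vocab_size)

# Nwi,funk = 10 and |V| = 10 come from the example; 120 tokens / 12 documents are assumed.
print(multinomial_smoothed(10, 120, 10))  # smoothed P(wi|funk), multinomial
print(bernoulli_smoothed(10, 12))         # smoothed P(wi|funk), Bernoulli
```
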
Linear regression
- Might produce prob < 0 or > 1 if used to predict class probabilities.
E(Y|X) = f(X) = b0 + b1X
Y = b0 + b1X + e, with E(e|X) = 0
- Minimising RSS (single coefficient): dRSS/db = 0 and d^2RSS/db^2 > 0 (local min).
- Minimising RSS (multiple coefficients): dRSS/dbj = 0 for j = 1,…,p, and the Hessian matrix must be positive definite: (x-xi)^T H (x-xi) > 0.

Least Squares Solution
Y = Xb + e, Y^ = Xb, e = Y – Xb
- To minimise RSS, e must be perpendicular to X: X^T e = 0 -> X^T Y – X^T X b = 0 -> b = (X^T X)^-1 X^T Y
- X needs an extra column of 1s prepended (for the intercept).
- Low R^2 means room for improvement.
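
A short NumPy sketch of this closed-form solution; the data below is randomly generated just to illustrate, with assumed true coefficients b0 = 2 and b1 = 3.

```python
import numpy as np

# Toy data, assumed for illustration: y = 2 + 3x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

X = np.column_stack([np.ones_like(x), x])   # extra column of 1s for the intercept
b = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations: X^T X b = X^T Y

e = y - X @ b                                # residuals e = Y - Xb
print(b)          # close to [2, 3]
print(X.T @ e)    # essentially zero: e is perpendicular to the columns of X
```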

In general, we see that if P(w|c1) > P(w|c2), then every occurrence of the word w will increase the chance of class c1 compared to the chance of class c2.
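
A tiny illustration of that point; the two per-class word probabilities below are assumed values, not from the notes.

```python
# Each occurrence of w multiplies the c1-vs-c2 likelihood ratio by P(w|c1)/P(w|c2) > 1.
p_w_c1, p_w_c2 = 0.03, 0.01   # assumed smoothed estimates with P(w|c1) > P(w|c2)
for count in range(4):
    print(count, (p_w_c1 / p_w_c2) ** count)   # ratio ≈ 1, 3, 9, 27: more occurrences favour c1
```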

Logistic regression
f(x) = e^x / (e^x + 1)
d/dx f(x) = f(x)(1-f(x))
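
A quick numeric check of the derivative identity above; the evaluation point 0.7 is arbitrary.

```python
import math

def sigmoid(x):
    # Logistic function f(x) = e^x / (e^x + 1)
    return math.exp(x) / (math.exp(x) + 1.0)

x = 0.7                                                    # arbitrary point
analytic = sigmoid(x) * (1 - sigmoid(x))                   # f(x)(1 - f(x))
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6   # central difference
print(analytic, numeric)                                   # both ≈ 0.2217
```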

Case control
- b0,case-control = b0,random + log(t1/t2)
- t1 = sampling probability for cases
- t2 = sampling probability for controls
- b0* = b0 + log(pi/(1-pi)) – log(pi~/(1-pi~)), where pi = probability of a case in the population and pi~ = proportion of cases in the case-control sample
- Bias is small if we sample enough controls (roughly 5× the number of cases is enough).
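
A worked sketch of the intercept correction; the fitted intercept, population prevalence, and sample case fraction below are assumed numbers.

```python
import math

b0_cc = -0.2     # intercept fitted on the case-control sample (assumed)
pi = 0.01        # population probability of being a case (assumed)
pi_tilde = 0.5   # proportion of cases in the case-control sample (assumed)

# Only the intercept needs correcting; the slope estimates are unaffected.
b0_star = b0_cc + math.log(pi / (1 - pi)) - math.log(pi_tilde / (1 - pi_tilde))
print(b0_star)   # ≈ -4.80: the population-scale intercept
```
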
Ordinal classification
- P(y = 1 | x) = Λ(t1 - bx)
- P(y = 2 | x) = Λ(t2 - bx) - Λ(t1 - bx)
- P(y = 3 | x) = 1 - Λ(t2 - bx)
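
A sketch of these three probabilities using the logistic CDF for Λ; the thresholds t1 < t2, coefficient b, and input x are assumed values.

```python
import math

def logistic_cdf(z):
    # Λ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

t1, t2, b, x = -1.0, 1.5, 0.8, 0.5   # assumed thresholds, coefficient, and input

p1 = logistic_cdf(t1 - b * x)                              # P(y = 1 | x)
p2 = logistic_cdf(t2 - b * x) - logistic_cdf(t1 - b * x)   # P(y = 2 | x)
p3 = 1.0 - logistic_cdf(t2 - b * x)                        # P(y = 3 | x)
print(p1, p2, p3, p1 + p2 + p3)   # the three probabilities sum to 1
```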
