
☞ Outline:
1 – Classification
2 – Linear Regression in Classification
3 – Bayesian Classification
4 – Distance Based Classification
5 – Questions?

Classification (A High Level Idea)

☞ Example: decide which customers are profitable (P) and which are not (N).

Input data:

Name    Gender  Age
Bob     M       20
John    M       45
Dave    M       25
Marthe  F       27
Kathy   F       40
Kimi    M       35
Tod     M       50

A decision tree on Gender and Age assigns the classes:

• Gender = F: Age < 40 → N, Age ≥ 40 → P
• Gender = M: Age < 30 → N, Age ≥ 30 → P

Classified data:

Name    Sex  Age  Prof.
Bob     M    20   N
John    M    45   P
Dave    M    25   N
Marthe  F    27   N
Kathy   F    40   P
Kimi    M    35   P
Tod     M    50   P
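Read as code, the tree is just nested conditionals. Below is a minimal Python sketch (the function and data layout are mine; the thresholds are the ones inferred above):

```python
# A minimal sketch of the decision tree above; the split thresholds
# are the ones inferred from the example table.
def classify(gender: str, age: int) -> str:
    """Return 'P' (profitable) or 'N' (non-profitable)."""
    if gender == "F":
        return "P" if age >= 40 else "N"
    return "P" if age >= 30 else "N"   # gender == "M"

customers = [("Bob", "M", 20), ("John", "M", 45), ("Dave", "M", 25),
             ("Marthe", "F", 27), ("Kathy", "F", 40), ("Kimi", "M", 35),
             ("Tod", "M", 50)]
for name, gender, age in customers:
    print(name, classify(gender, age))   # reproduces the Prof. column
```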


Classification (Definition)

☞ Definition 0.1 Given a database D = {t1, t2, . . . , tn} and a set of classes C = {C1, C2, . . . , Cm}, the classification problem is to define a mapping f : D → C, where each tuple t is assigned to exactly one class.
☞ The classes:
• are pre-defined and known in advance
• are non-overlapping
• partition the whole database (domain)

Classification (Strategies)

☞ Main strategies to classify data:
• Specify the boundaries of the classes in the domain
• With the help of probability distributions: P(ti ∧ Cj) = P(ti | Cj) P(Cj)
• With the help of posterior probabilities: P(Cj | ti)
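As a quick sanity check of the joint-probability identity above, one can count directly on the example table from the first slide (the choice of events, ti = "Sex is F" and Cj = "profitable", is mine):

```python
# A quick numeric check (mine, not from the slides) of
# P(ti ∧ Cj) = P(ti | Cj) P(Cj) on the customer table.
data = [("M", "N"), ("M", "P"), ("M", "N"), ("F", "N"),
        ("F", "P"), ("M", "P"), ("M", "P")]          # (Sex, Prof.)

n_P      = sum(prof == "P" for _, prof in data)
p_C      = n_P / len(data)                                            # P(Cj) = 4/7
p_t_of_C = sum(sex == "F" for sex, prof in data if prof == "P") / n_P # P(ti|Cj) = 1/4
p_joint  = sum(sex == "F" and prof == "P"
               for sex, prof in data) / len(data)                     # P(ti ∧ Cj) = 1/7

print(abs(p_joint - p_t_of_C * p_C) < 1e-12)   # True
```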

Classification (Missing Values)

☞ There are three main strategies to deal with missing values:
• Ignore the missing values
• Predict the missing values
• Classify the missing values separately

Classification (Quality Metrics)

☞ The outcome of a classifier f on a tuple ti with respect to a class Cj falls into one of four cases:
• True Positive: f(ti) ∈ Cj, ti ∈ Cj
• False Negative (we miss data): f(ti) ∉ Cj, ti ∈ Cj
• False Positive (we get too much data): f(ti) ∈ Cj, ti ∉ Cj
• True Negative: f(ti) ∉ Cj, ti ∉ Cj


Linear Regression in Classification (Example 1)

☞ One can use (linear) regression to classify data into:
• data which follows the linear regression
• outliers
• noise
☞ The training data should be provided without noise and outliers
☞ The idea:
• Compute the (linear) regression from the data
• Compute the regression error ε
• All data which lie farther away from the regression line than ε are noise and outliers

[Figure: customers labeled N (ages 20–27) and P (ages 35–50) along the age axis.]

Linear Regression in Classification (Example 2)

☞ Linear regression (line) can be used to separate two classes
☞ Let's use f(x) = a + ε for our classification problem
• The general formula to compute a:

  a = (1/n) Σᵢ₌₁ⁿ Xᵢ

• In our case: a = 34.7

[Figure: the same N and P customers along the age axis (20–50); a = 34.7 lies between the two groups.]
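Under one plausible reading of this example, a is the mean of the training ages and serves as the decision boundary between the two classes. A short sketch of that reading (the variable names are mine):

```python
# Fit the constant model f(x) = a as the mean of the training ages
# and use it as the boundary between N and P customers.
ages   = [20, 45, 25, 27, 40, 35, 50]
labels = ["N", "P", "N", "N", "P", "P", "P"]

a = sum(ages) / len(ages)   # ~34.6 for these ages; the slide reports 34.7
for age, label in zip(ages, labels):
    predicted = "P" if age >= a else "N"
    print(age, label, predicted)   # all seven customers come out correctly
```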

Linear Regression in Classification (Example 3)

☞ Linear regression (line) can be used to separate two classes
☞ Let's use f(x) = ax + b + ε for our classification problem
• The general formulas to compute a and b:

  b = Ȳ − aX̄
  a = cov(X, Y) / σ²(X)

• In our case: a = 0.04, b = −0.7
☞ All x for which f(x) ≥ 1/2 will be classified as profitable customers

[Figure: class labels (0 for N, 1 for P) plotted against age (15–55) with the fitted line, which crosses 1/2 near age 30.]
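A short sketch of this example (my own reconstruction): code the labels as 0/1, fit a and b with the formulas above, and classify by f(x) ≥ 1/2.

```python
# Regress the class label (N -> 0, P -> 1) on age, classify by f(x) >= 1/2.
ages   = [20, 45, 25, 27, 40, 35, 50]
labels = [0, 1, 0, 0, 1, 1, 1]          # N -> 0, P -> 1

n      = len(ages)
x_bar  = sum(ages) / n
y_bar  = sum(labels) / n
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, labels)) / n
var_x  = sum((x - x_bar) ** 2 for x in ages) / n

a = cov_xy / var_x      # ~0.04, matching the slide
b = y_bar - a * x_bar   # ~-0.92 here; the slide reports -0.7

for age in ages:
    print(age, "P" if a * age + b >= 0.5 else "N")   # reproduces the labels
```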

Linear Regression in Classification (Conclusions)

☞ Regression is hardly applicable in classification unless strong assumptions about the data distribution hold

[Figure: the fitted regression line over the labeled ages (15–55), as in Example 3.]


Bayesian Classification (The Formalism)

☞ Use Bayes' theorem to classify the data:

  P(Hj | ti) = P(ti | Hj) P(Hj) / P(ti)

☞ Say ti = (t1i, t2i, . . . , tki) consists of k attributes:

  P(Hj | t1i, t2i, . . . , tki) = P(t1i, t2i, . . . , tki | Hj) P(Hj) / P(t1i, t2i, . . . , tki)

☞ If we assume that the attributes are independent:

  P(t1i, t2i, . . . , tki | Hj) = P(t1i | Hj) P(t2i | Hj) . . . P(tki | Hj)

☞ Therefore:

  P(Hj | t1i, t2i, . . . , tki) = P(Hj) P(t1i | Hj) P(t2i | Hj) . . . P(tki | Hj) / (P(t1i) P(t2i) . . . P(tki))

Bayesian Classification (Example)

Name    Sex  Age  Prof.
Bob     M    20   N
John    M    45   P
Dave    M    25   N
Marthe  F    27   N
Kathy   F    40   P
Kimi    M    35   P
Tod     M    50   P

☞ What is
• P(HP | (F, 27))
• P(HN | (F, 27))
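A rough sketch of how the question can be answered with the formulas above. Since Age is continuous, I bin it at 40 purely as an illustrative assumption; the slides themselves leave this open:

```python
# Naive Bayes estimate for (F, 27); binning Age at 40 is an assumption.
data = [("M", 20, "N"), ("M", 45, "P"), ("M", 25, "N"), ("F", 27, "N"),
        ("F", 40, "P"), ("M", 35, "P"), ("M", 50, "P")]

def age_bin(age):
    return "<40" if age < 40 else ">=40"

def score(cls, sex, age):
    rows  = [r for r in data if r[2] == cls]
    prior = len(rows) / len(data)                        # P(H_cls)
    p_sex = sum(r[0] == sex for r in rows) / len(rows)   # P(sex | H_cls)
    p_age = sum(age_bin(r[1]) == age_bin(age) for r in rows) / len(rows)
    return prior * p_sex * p_age    # proportional to P(H_cls | (sex, age))

s_p, s_n = score("P", "F", 27), score("N", "F", 27)
print(s_p / (s_p + s_n), s_n / (s_p + s_n))   # ~0.2 vs ~0.8
```

With this binning, P(HN | (F, 27)) ≈ 0.8, so the tuple (F, 27) would be assigned to class N.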

Bayesian Classification (Conclusion)

+ Easy to use
+ Requires one scan of the training data
+ Handles missing data
− Independence of the attributes might not be realistic
− Continuous data is not easily handled

Distance Based Classification (A Naive Distance Based Classification Algorithm, the Idea)

☞ Given:
• Training data
• Classes
☞ The main idea:
• Compute a representative for each class
• For each DB point t, assign t to the closest class in terms of the given distance or similarity
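A minimal sketch of the naive idea on a single attribute, with the class mean as the representative (the data layout and names are mine, reusing the example ages):

```python
# Class representatives as mean ages; the nearest representative wins.
train = {"N": [20, 25, 27], "P": [45, 40, 35, 50]}   # ages per class

centroids = {cls: sum(xs) / len(xs) for cls, xs in train.items()}

def classify(age):
    # Assign the point to the class with the closest representative.
    return min(centroids, key=lambda cls: abs(age - centroids[cls]))

print(centroids)      # {'N': 24.0, 'P': 42.5}
print(classify(30))   # 'N': 30 is closer to 24.0 than to 42.5
```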


Distance Based Classification (K-Nearest Neighbors Classification, the Idea)

☞ The idea:
• The training data is the model of the data
• Compute the k nearest neighbors of each DB tuple t in the training data
• t is placed in the class that holds most of these k nearest neighbors
☞ The choice of k has a huge impact on the classification results
☞ A rule of thumb: k ≤ √(number of tuples in the training set)
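A minimal k-NN sketch on the same one-dimensional ages (my own illustration, not code from the slides):

```python
# Majority vote among the k training tuples closest to the query point.
from collections import Counter

train = [(20, "N"), (25, "N"), (27, "N"),
         (35, "P"), (40, "P"), (45, "P"), (50, "P")]

def knn_classify(age, k=3):
    # Sort the training tuples by distance to the query, keep the k nearest.
    neighbors = sorted(train, key=lambda t: abs(t[0] - age))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify(26))   # 'N': nearest are 25, 27, 20
print(knn_classify(42))   # 'P': nearest are 40, 45, 35
```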

Questions?

☞ Questions?
