
☞ Outline:
1 – Classification
2 – Linear Regression in Classification
3 – Bayesian Classification
4 – Distance Based Classification
5 – Questions?

Classification (A High Level Idea)

☞ Example: decide which customers are profitable (P) and which are not (N).

Input data:

Name    Gender  Age
Bob     M       20
John    M       45
Dave    M       25
Marthe  F       27
Kathy   F       40
Kimi    M       35
Tod     M       50

A decision tree on Gender and Age assigns the classes:

• Gender = F: Age < 40 → N, Age ≥ 40 → P
• Gender = M: Age < 30 → N, Age ≥ 30 → P

Classified data:

Name    Sex  Age  Prof.
Bob     M    20   N
John    M    45   P
Dave    M    25   N
Marthe  F    27   N
Kathy   F    40   P
Kimi    M    35   P
Tod     M    50   P
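Read as code, the tree is just nested conditionals. Below is a minimal Python sketch (the function and data layout are mine; the thresholds are the ones inferred above):

```python
# A minimal sketch of the decision tree above; the split thresholds
# are the ones inferred from the example table.
def classify(gender: str, age: int) -> str:
    """Return 'P' (profitable) or 'N' (non-profitable)."""
    if gender == "F":
        return "P" if age >= 40 else "N"
    return "P" if age >= 30 else "N"   # gender == "M"

customers = [("Bob", "M", 20), ("John", "M", 45), ("Dave", "M", 25),
             ("Marthe", "F", 27), ("Kathy", "F", 40), ("Kimi", "M", 35),
             ("Tod", "M", 50)]
for name, gender, age in customers:
    print(name, classify(gender, age))   # reproduces the Prof. column
```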


Classification (Definition)

☞ Definition 0.1 Given a database D = {t1, t2, . . . , tn} and a set of classes C = {C1, C2, . . . , Cm}, the classification problem is to define a mapping f : D → C, where each tuple t is assigned to exactly one class.
☞ The classes:
• are pre-defined and known in advance
• are non-overlapping
• partition the whole database (domain)

Classification (Strategies)

☞ Main strategies to classify data:
• Specify the boundaries of the classes in the domain
• With the help of probability distributions: P(ti ∧ Cj) = P(ti | Cj) P(Cj)
• With the help of posterior probabilities: P(Cj | ti)
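As a quick sanity check of the joint-probability identity above, one can count directly on the example table from the first slide (the choice of events, ti = "Sex is F" and Cj = "profitable", is mine):

```python
# A quick numeric check (mine, not from the slides) of
# P(ti ∧ Cj) = P(ti | Cj) P(Cj) on the customer table.
data = [("M", "N"), ("M", "P"), ("M", "N"), ("F", "N"),
        ("F", "P"), ("M", "P"), ("M", "P")]          # (Sex, Prof.)

n_P      = sum(prof == "P" for _, prof in data)
p_C      = n_P / len(data)                                            # P(Cj) = 4/7
p_t_of_C = sum(sex == "F" for sex, prof in data if prof == "P") / n_P # P(ti|Cj) = 1/4
p_joint  = sum(sex == "F" and prof == "P"
               for sex, prof in data) / len(data)                     # P(ti ∧ Cj) = 1/7

print(abs(p_joint - p_t_of_C * p_C) < 1e-12)   # True
```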

Classification (Missing Values)

☞ There are three main strategies to deal with missing values:
• Ignore the missing values
• Predict the missing values
• Classify the missing values separately

Classification (Quality Metrics)

☞ The outcome of a classifier f on a tuple ti with respect to a class Cj falls into one of four cases:
• True Positive: f(ti) ∈ Cj, ti ∈ Cj
• False Negative (we miss data): f(ti) ∉ Cj, ti ∈ Cj
• False Positive (we get too much data): f(ti) ∈ Cj, ti ∉ Cj
• True Negative: f(ti) ∉ Cj, ti ∉ Cj


Linear Regression in Classification (Example 1)

☞ One can use (linear) regression to classify data into:
• data which follows the linear regression
• outliers
• noise
☞ The training data should be provided without noise and outliers
☞ The idea:
• Compute the (linear) regression from the data
• Compute the regression error ε
• All data which lie farther away from the regression line than ε are noise and outliers

[Figure: customers labeled N (ages 20–27) and P (ages 35–50) along the age axis.]

Linear Regression in Classification (Example 2)

☞ Linear regression (line) can be used to separate two classes
☞ Let's use f(x) = a + ε for our classification problem
• The general formula to compute a:

  a = (1/n) Σᵢ₌₁ⁿ Xᵢ

• In our case: a = 34.7

[Figure: the same N and P customers along the age axis (20–50); a = 34.7 lies between the two groups.]
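Under one plausible reading of this example, a is the mean of the training ages and serves as the decision boundary between the two classes. A short sketch of that reading (the variable names are mine):

```python
# Fit the constant model f(x) = a as the mean of the training ages
# and use it as the boundary between N and P customers.
ages   = [20, 45, 25, 27, 40, 35, 50]
labels = ["N", "P", "N", "N", "P", "P", "P"]

a = sum(ages) / len(ages)   # ~34.6 for these ages; the slide reports 34.7
for age, label in zip(ages, labels):
    predicted = "P" if age >= a else "N"
    print(age, label, predicted)   # all seven customers come out correctly
```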

Linear Regression in Classification (Example 3)

☞ Linear regression (line) can be used to separate two classes
☞ Let's use f(x) = ax + b + ε for our classification problem
• The general formulas to compute a and b:

  b = Ȳ − aX̄
  a = cov(X, Y) / σ²(X)

• In our case: a = 0.04, b = −0.7
☞ All x for which f(x) ≥ 1/2 will be classified as profitable customers

[Figure: class labels (0 for N, 1 for P) plotted against age (15–55) with the fitted line, which crosses 1/2 near age 30.]
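A short sketch of this example (my own reconstruction): code the labels as 0/1, fit a and b with the formulas above, and classify by f(x) ≥ 1/2.

```python
# Regress the class label (N -> 0, P -> 1) on age, classify by f(x) >= 1/2.
ages   = [20, 45, 25, 27, 40, 35, 50]
labels = [0, 1, 0, 0, 1, 1, 1]          # N -> 0, P -> 1

n      = len(ages)
x_bar  = sum(ages) / n
y_bar  = sum(labels) / n
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, labels)) / n
var_x  = sum((x - x_bar) ** 2 for x in ages) / n

a = cov_xy / var_x      # ~0.04, matching the slide
b = y_bar - a * x_bar   # ~-0.92 here; the slide reports -0.7

for age in ages:
    print(age, "P" if a * age + b >= 0.5 else "N")   # reproduces the labels
```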

Linear Regression in Classification (Conclusions)

☞ Regression is hardly applicable in classification unless strong assumptions about the data distribution hold

[Figure: the fitted regression line over the labeled ages (15–55), as in Example 3.]


Bayesian Classification (The Formalism)

☞ Use Bayes' theorem to classify the data:

  P(Hj | ti) = P(ti | Hj) P(Hj) / P(ti)

☞ Say ti = (t1i, t2i, . . . , tki) consists of k attributes:

  P(Hj | t1i, t2i, . . . , tki) = P(t1i, t2i, . . . , tki | Hj) P(Hj) / P(t1i, t2i, . . . , tki)

☞ If we assume that the attributes are independent:

  P(t1i, t2i, . . . , tki | Hj) = P(t1i | Hj) P(t2i | Hj) . . . P(tki | Hj)

☞ Therefore:

  P(Hj | t1i, t2i, . . . , tki) = P(Hj) P(t1i | Hj) P(t2i | Hj) . . . P(tki | Hj) / (P(t1i) P(t2i) . . . P(tki))

Bayesian Classification (Example)

Name    Sex  Age  Prof.
Bob     M    20   N
John    M    45   P
Dave    M    25   N
Marthe  F    27   N
Kathy   F    40   P
Kimi    M    35   P
Tod     M    50   P

☞ What is
• P(HP | (F, 27))
• P(HN | (F, 27))
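A rough sketch of how the question can be answered with the formulas above. Since Age is continuous, I bin it at 40 purely as an illustrative assumption; the slides themselves leave this open:

```python
# Naive Bayes estimate for (F, 27); binning Age at 40 is an assumption.
data = [("M", 20, "N"), ("M", 45, "P"), ("M", 25, "N"), ("F", 27, "N"),
        ("F", 40, "P"), ("M", 35, "P"), ("M", 50, "P")]

def age_bin(age):
    return "<40" if age < 40 else ">=40"

def score(cls, sex, age):
    rows  = [r for r in data if r[2] == cls]
    prior = len(rows) / len(data)                        # P(H_cls)
    p_sex = sum(r[0] == sex for r in rows) / len(rows)   # P(sex | H_cls)
    p_age = sum(age_bin(r[1]) == age_bin(age) for r in rows) / len(rows)
    return prior * p_sex * p_age    # proportional to P(H_cls | (sex, age))

s_p, s_n = score("P", "F", 27), score("N", "F", 27)
print(s_p / (s_p + s_n), s_n / (s_p + s_n))   # ~0.2 vs ~0.8
```

With this binning, P(HN | (F, 27)) ≈ 0.8, so the tuple (F, 27) would be assigned to class N.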

Bayesian Classification (Conclusion)

+ Easy to use
+ Requires one scan of the training data
+ Handles missing data
− Independence of the attributes might not be realistic
− Continuous data is not easily handled

Distance Based Classification (A Naive Distance Based Classification Algorithm, the Idea)

☞ Given:
• Training data
• Classes
☞ The main idea:
• Compute a representative for each class
• For each DB point t, assign t to the closest class in terms of the given distance or similarity
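A minimal sketch of the naive idea on a single attribute, with the class mean as the representative (the data layout and names are mine, reusing the example ages):

```python
# Class representatives as mean ages; the nearest representative wins.
train = {"N": [20, 25, 27], "P": [45, 40, 35, 50]}   # ages per class

centroids = {cls: sum(xs) / len(xs) for cls, xs in train.items()}

def classify(age):
    # Assign the point to the class with the closest representative.
    return min(centroids, key=lambda cls: abs(age - centroids[cls]))

print(centroids)      # {'N': 24.0, 'P': 42.5}
print(classify(30))   # 'N': 30 is closer to 24.0 than to 42.5
```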


Distance Based Classification (K-Nearest Neighbors Classification, the Idea)

☞ The idea:
• The training data is the model of the data
• Compute the k nearest neighbors of each DB tuple t in the training data
• t is placed in the class that holds most of these k nearest neighbors
☞ The choice of k has a huge impact on the classification results
☞ A rule of thumb: k ≤ √(number of tuples in the training set)
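A minimal k-NN sketch on the same one-dimensional ages (my own illustration, not code from the slides):

```python
# Majority vote among the k training tuples closest to the query point.
from collections import Counter

train = [(20, "N"), (25, "N"), (27, "N"),
         (35, "P"), (40, "P"), (45, "P"), (50, "P")]

def knn_classify(age, k=3):
    # Sort the training tuples by distance to the query, keep the k nearest.
    neighbors = sorted(train, key=lambda t: abs(t[0] - age))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify(26))   # 'N': nearest are 25, 27, 20
print(knn_classify(42))   # 'P': nearest are 40, 45, 35
```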

Questions?

☞ Questions?
