WINSEM2020-21 ECE3047 ETH VL2020210503202 Reference Material I 08-Apr-2021 CS583-Supervised-learning
Supervised Learning
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Training: learn a model using the training data
Testing: test the model using unseen test data to assess the model accuracy
Accuracy = Number of correct classifications / Total number of test cases
entropy(D) = -∑_{j=1}^{|C|} Pr(c_j) log_2 Pr(c_j), where ∑_{j=1}^{|C|} Pr(c_j) = 1

entropy_A(D) = ∑_{j=1}^{v} (|D_j| / |D|) × entropy(D_j)
Age     Yes  No  entropy(D_i)
young    2    3   0.971
middle   3    2   0.971
old      4    1   0.722

entropy_Age(D) = (5/15) × entropy(D_1) + (5/15) × entropy(D_2) + (5/15) × entropy(D_3)
               = (5/15) × 0.971 + (5/15) × 0.971 + (5/15) × 0.722
               = 0.888
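The weighted-entropy calculation above can be sketched in a few lines of Python; the counts mirror the Age table, and the helper name is my own:

```python
import math

def entropy(counts):
    """Entropy of a class distribution given raw class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# (Yes, No) counts per Age partition, as in the table above.
partitions = [(2, 3), (3, 2), (4, 1)]
n = sum(sum(p) for p in partitions)  # 15 examples in total

# Weighted entropy of the split: sum over partitions of |D_j|/|D| * entropy(D_j).
weighted = sum(sum(p) / n * entropy(p) for p in partitions)
print(round(weighted, 3))  # 0.888
```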
Efficiency
time to construct the model
time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability:
the understandability of, and insight provided by, the model
Compactness of the model: size of the tree, or the
number of rules.
p = TP / (TP + FP),   r = TP / (TP + FN)
Precision p is the number of correctly classified
positive examples divided by the total number of
examples that are classified as positive.
Recall r is the number of correctly classified positive
examples divided by the total number of actual
positive examples in the test set.
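These two definitions can be sketched directly from the counts; the function name and toy label lists below are illustrative:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision p = TP/(TP+FP) and recall r = TP/(TP+FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

y_true = [1, 1, 1, 0, 0, 0]  # toy test-set labels
y_pred = [1, 1, 0, 1, 0, 0]  # toy classifier output
p, r = precision_recall(y_true, y_pred)
print(p, r)  # TP = 2, FP = 1, FN = 1, so p = r = 2/3
```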
CS583, Bing Liu, UIC
An example
Then we have the following lift curve:
[Figure: lift curve vs. random baseline. x-axis: percent of testing cases (0 to 100); y-axis: percent of total positive cases (0 to 100).]
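A lift curve can be computed by ranking the test cases by classifier score and accumulating the share of positives captured; the scores and labels below are made up for illustration:

```python
# Rank test cases by classifier score (descending), then record the cumulative
# percentage of all positives captured as more of the ranked test set is covered.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]  # hypothetical
labels = [1,   1,   0,   1,   0,   0,   1,   0,   0,   0]     # true classes

order = sorted(range(len(scores)), key=lambda i: -scores[i])
total_pos = sum(labels)
captured = 0
lift_points = []  # (percent of testing cases, percent of total positive cases)
for rank, i in enumerate(order, start=1):
    captured += labels[i]
    lift_points.append((100 * rank / len(order), 100 * captured / total_pos))

print(lift_points[-1])  # (100.0, 100.0): the full test set captures all positives
```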
Apply Bayes' rule:

Pr(C=c_j | A_1=a_1, ..., A_|A|=a_|A|)
  = Pr(A_1=a_1, ..., A_|A|=a_|A| | C=c_j) Pr(C=c_j) / Pr(A_1=a_1, ..., A_|A|=a_|A|)
  = Pr(A_1=a_1, ..., A_|A|=a_|A| | C=c_j) Pr(C=c_j) / ∑_{r=1}^{|C|} Pr(A_1=a_1, ..., A_|A|=a_|A| | C=c_r) Pr(C=c_r)

With the conditional independence assumption, the denominator becomes
∑_{r=1}^{|C|} Pr(C=c_r) ∏_{i=1}^{|A|} Pr(A_i=a_i | C=c_r)
We are done!
How do we estimate Pr(A_i = a_i | C = c_j)? Easy!
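The estimate is just a ratio of counts; a minimal sketch, in which the function name and toy data are hypothetical:

```python
def estimate_conditional(examples, attr, value, cls):
    """Pr(A = value | C = cls), estimated as
    (# class-cls examples with A = value) / (# class-cls examples)."""
    in_class = [attrs for attrs, c in examples if c == cls]
    if not in_class:
        return 0.0
    return sum(1 for attrs in in_class if attrs[attr] == value) / len(in_class)

# Hypothetical toy loan data.
data = [({"Age": "young"}, "Yes"), ({"Age": "young"}, "No"),
        ({"Age": "old"}, "Yes"), ({"Age": "old"}, "Yes")]
print(estimate_conditional(data, "Age", "old", "Yes"))  # 2/3 ≈ 0.667
```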
∑_{t=1}^{|V|} N_ti = |d_i|, and the word probabilities sum to one in each class:

∑_{t=1}^{|V|} Pr(w_t | c_j; Θ) = 1.   (25)

With Laplace smoothing, the word probability estimates are

Pr(w_t | c_j; Θ̂) = (1 + ∑_{i=1}^{|D|} N_ti Pr(c_j | d_i)) / (|V| + ∑_{s=1}^{|V|} ∑_{i=1}^{|D|} N_si Pr(c_j | d_i)).   (27)

The class prior probabilities are

Pr(c_j | Θ̂) = ∑_{i=1}^{|D|} Pr(c_j | d_i) / |D|.   (28)

A test document d_i is then classified by computing the posterior

Pr(c_j | d_i; Θ̂) = Pr(c_j | Θ̂) ∏_{k=1}^{|d_i|} Pr(w_{d_i,k} | c_j; Θ̂) / ∑_{r=1}^{|C|} Pr(c_r | Θ̂) ∏_{k=1}^{|d_i|} Pr(w_{d_i,k} | c_r; Θ̂)
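A compact sketch of the training estimates (27) and (28) and the classification step, assuming hard labels so that Pr(c_j | d_i) is 1 for the labeled class and 0 otherwise; the corpus and function names are invented for illustration:

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab):
    """Multinomial naive Bayes with Laplace smoothing (hard labels)."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}  # Eq. (28)
    cond = {}
    for c in classes:
        counts = Counter()
        for d, lab in zip(docs, labels):
            if lab == c:
                counts.update(d)
        total = sum(counts.values())
        # Eq. (27): (1 + count of w in class c) / (|V| + total words in class c)
        cond[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return prior, cond

def classify(doc, prior, cond):
    # Log space avoids floating-point underflow on long documents.
    scores = {c: math.log(prior[c]) + sum(math.log(cond[c][w]) for w in doc)
              for c in prior}
    return max(scores, key=scores.get)

docs = [["ball", "goal"], ["goal", "match"], ["atom", "energy"]]
labels = ["sport", "sport", "science"]
vocab = {"ball", "goal", "match", "atom", "energy"}
prior, cond = train_nb(docs, labels, vocab)
print(classify(["atom", "energy"], prior, cond))  # science
```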
y_i(⟨w · x_i⟩ + b) ≥ 1,  i = 1, 2, ..., r, which summarizes
⟨w · x_i⟩ + b ≥ 1    for y_i = 1
⟨w · x_i⟩ + b ≤ -1   for y_i = -1.
L_D = ∑_{i=1}^{r} α_i - (1/2) ∑_{i,j=1}^{r} y_i y_j α_i α_j ⟨x_i · x_j⟩,   (55)
only require dot products ⟨φ(x) · φ(z)⟩ and never the mapped
vector φ(x) in its explicit form. This is a crucial point.
Thus, if we have a way to compute the dot product
⟨φ(x) · φ(z)⟩ using the input vectors x and z directly,
there is no need to know the feature vector φ(x) or even the mapping φ itself.
In SVM, this is done through the use of kernel
functions, denoted by K:
K(x, z) = ⟨φ(x) · φ(z)⟩   (82)
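A quick numerical check of this idea, using the degree-2 polynomial kernel (a standard textbook example, not specific to these slides): K(x, z) = (⟨x · z⟩)^2 agrees with the explicit feature map φ(x) = (x_1^2, √2·x_1·x_2, x_2^2):

```python
import math

def K(x, z):
    """Degree-2 polynomial kernel: (<x . z>)^2, computed from inputs directly."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map whose dot product reproduces K."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, 4.0)
lhs = K(x, z)                                    # kernel, no mapping needed
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # explicit feature-space dot product
print(lhs)  # (1*3 + 2*4)^2 = 121.0
```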
A new point d: Pr(science | d)?
Original 1 2 3 4 5 6 7 8
Training set 1 2 7 8 3 7 6 3 1
Training set 2 7 8 5 6 4 2 7 1
Training set 3 3 6 2 7 5 6 2 2
Training set 4 4 5 1 4 6 4 3 8
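Bootstrap sampling like the table above can be sketched as follows; the seed and helper name are arbitrary:

```python
import random

def bootstrap(data, rng):
    """Draw |data| items with replacement: one bagging training set."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)  # arbitrary seed, for reproducibility only
original = [1, 2, 3, 4, 5, 6, 7, 8]
training_sets = [bootstrap(original, rng) for _ in range(4)]
for t in training_sets:
    print(t)  # each line: 8 items drawn with replacement from the original
```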
Weighted training set: (x_1, y_1, w_1), (x_2, y_2, w_2), …, (x_n, y_n, w_n), where the weights are non-negative and sum to 1.
In each round, build a classifier h_t whose accuracy on the weighted training set is > 1/2 (better than random), then change the weights.
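This boosting loop can be sketched in the AdaBoost style; the 1-D data and threshold stumps are invented for illustration, and the slides describe the weighting scheme rather than this exact code:

```python
import math

def adaboost(X, y, rounds, learners):
    """AdaBoost-style sketch: X is a list of 1-D points, y holds labels in
    {-1, +1}, and `learners` is a pool of weak classifiers (functions x -> ±1)."""
    n = len(X)
    w = [1.0 / n] * n  # non-negative weights that sum to 1
    ensemble = []
    for _ in range(rounds):
        def weighted_error(h):
            return sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        h = min(learners, key=weighted_error)  # best weak classifier this round
        err = weighted_error(h)
        if err >= 0.5:  # nothing better than random is left
            break
        alpha = 0.5 * math.log((1 - err) / (err + 1e-10))
        ensemble.append((alpha, h))
        # Increase the weights of misclassified examples, then renormalise.
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return lambda x: 1 if sum(a * g(x) for a, g in ensemble) >= 0 else -1

# Hypothetical 1-D data; weak learners are simple threshold stumps.
X = [1, 2, 3, 4, 5, 6]
y = [1, 1, 1, -1, -1, -1]
stumps = [lambda x, t=t: 1 if x <= t else -1 for t in range(7)]
H = adaboost(X, y, rounds=5, learners=stumps)
print([H(x) for x in X])  # [1, 1, 1, -1, -1, -1]
```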
Experimental comparisons: Bagged C4.5 vs. C4.5; Boosted C4.5 vs. C4.5; Boosting vs. Bagging.
Genetic algorithms
Fuzzy classification