CS583 Supervised Learning
Supervised Learning
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
An example application
Another application
Age
Marital status
Annual salary
Outstanding debts
Credit rating
etc.
Approved or not
Learning (training): learn a classification model using the training data.
Testing: test the model using unseen test data to assess its accuracy.
Accuracy = (number of correct classifications) / (total number of test cases)
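A minimal Python sketch of this train/test evaluation and the accuracy measure (the data arrays and the generic `model` object with fit/predict are hypothetical placeholders):

```python
# Holdout evaluation: learn on training data, test on unseen data, report accuracy.

def accuracy(y_true, y_pred):
    """Fraction of test cases whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical usage with a generic classifier object:
# model.fit(X_train, y_train)        # learning (training) step
# y_pred = model.predict(X_test)     # apply the model to unseen test data
# print("accuracy:", accuracy(y_test, y_pred))
```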
What do we mean by learning?
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to perform the task T if, after learning, its performance on T improves as measured by M.
An example
Fundamental assumption of learning
Assumption: The distribution of training examples is identical to the distribution of test examples (including future unseen examples).
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Introduction
Choose an attribute to partition data
Information theory
$$\text{entropy}(D) = -\sum_{j=1}^{|C|} \Pr(c_j)\log_2 \Pr(c_j)$$
where $\Pr(c_j)$ is the probability of class $c_j$ in the data set $D$, and $\sum_{j=1}^{|C|} \Pr(c_j) = 1$.
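As a concrete aid, a small Python sketch of this entropy computation from class counts (names are illustrative):

```python
import math

def entropy(class_counts):
    """entropy(D) = -sum_j Pr(c_j) * log2 Pr(c_j), from a list of class counts."""
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count == 0:
            continue                 # 0 * log2(0) is taken to be 0
        p = count / total
        ent -= p * math.log2(p)
    return ent

# Example: a data set with 6 examples of one class and 9 of the other.
print(entropy([6, 9]))               # ~0.971
```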
Information gain
An example
The data set $D$ has 15 examples: 6 of one class and 9 of the other, so
$$\text{entropy}(D) = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.971$$
Partitioning on the attribute Own_house gives two subsets $D_1$ (6 examples) and $D_2$ (9 examples):
$$\text{entropy}_{Own\_house}(D) = \frac{6}{15}\,\text{entropy}(D_1) + \frac{9}{15}\,\text{entropy}(D_2) = \frac{6}{15}\times 0 + \frac{9}{15}\times 0.918 = 0.551$$
Partitioning on the attribute Age gives three subsets of 5 examples each:
$$\text{entropy}_{Age}(D) = \frac{5}{15}\,\text{entropy}(D_1) + \frac{5}{15}\,\text{entropy}(D_2) + \frac{5}{15}\,\text{entropy}(D_3) = \frac{5}{15}\times 0.971 + \frac{5}{15}\times 0.971 + \frac{5}{15}\times 0.722 = 0.888$$
Own_house therefore gives the larger information gain and is the better attribute to partition on.
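To make the arithmetic easy to verify, a short Python sketch reproducing these numbers (the subset entropies are taken directly from the example above; the gains at the end are the standard entropy(D) − entropy_A(D) differences):

```python
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

e_D = entropy([6/15, 9/15])                              # ~0.971

# Two-way split (Own_house): subsets of 6 and 9 examples, entropies 0 and 0.918.
e_own_house = 6/15 * 0 + 9/15 * 0.918                    # ~0.551

# Three-way split (Age): subsets of 5 examples each, entropies 0.971, 0.971, 0.722.
e_age = 5/15 * 0.971 + 5/15 * 0.971 + 5/15 * 0.722       # ~0.888

print("gain(Own_house) =", e_D - e_own_house)            # ~0.420
print("gain(Age)       =", e_D - e_age)                  # ~0.083
```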
Handling continuous attributes
Handle a continuous attribute by splitting its value range into intervals (typically two) at each node: sort the values, consider a threshold between each pair of adjacent values, and choose the threshold that maximizes the information gain.
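A Python sketch of such a threshold search, assuming a binary split ("value <= t" vs. "value > t") scored by information gain; all names are illustrative:

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Try a threshold between every pair of adjacent distinct sorted values
    and return the one with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

print(best_threshold([23, 25, 31, 40, 52], ["No", "No", "Yes", "Yes", "Yes"]))
```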
An example in a continuous space
Avoiding overfitting in classification
Overfitting: a tree may overfit the training data, achieving high accuracy on the training data but poor accuracy on unseen test data. Overfitting is typically handled by pruning the tree (pre-pruning or post-pruning).
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Evaluating classification methods
Predictive accuracy
Efficiency
Evaluation methods
The available data can be divided into three disjoint subsets:
a training set,
a validation set (for tuning parameters / selecting a model), and
a test set.
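A minimal Python sketch of such a three-way split (the shuffling step and the 60/20/20 proportions are illustrative choices, not prescribed here):

```python
import random

def three_way_split(examples, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle and split the data into training, validation and test sets.
    The validation set is for parameter tuning / model selection; the test
    set is touched only once, for the final accuracy estimate."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return data[:n_train], data[n_train:n_train + n_valid], data[n_train + n_valid:]

train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))   # 60 20 20
```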
Classification measures
Recall:
$$r = \frac{TP}{TP + FN}$$
where $TP$ is the number of true positives and $FN$ the number of false negatives.
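A Python sketch of recall, together with the companion precision and F1 measures (stated here as standard definitions), computed from confusion-matrix counts:

```python
def recall(tp, fn):
    """r = TP / (TP + FN): fraction of actual positives that are found."""
    return tp / (tp + fn)

def precision(tp, fp):
    """p = TP / (TP + FP): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Illustrative confusion counts:
print(recall(tp=8, fn=2), precision(tp=8, fp=4), f1(tp=8, fp=4, fn=2))
```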
An example
Receiver operating characteristic (ROC) curve
Then we have the true positive rate $TPR = \frac{TP}{TP+FN}$ and the false positive rate $FPR = \frac{FP}{FP+TN}$.
An example
An example
Lift curve
Bin:           1     2     3     4     5       6      7       8       9      10
Positives:     210   120   60    40    22      18     12      7       6      5
% of total:    42%   24%   12%   8%    4.40%   3.60%  2.40%   1.40%   1.20%  1%
Cumulative %:  42%   66%   78%   86%   90.40%  94%    96.40%  97.80%  99%    100%
[Lift chart: cumulative % of positive cases (lift curve) vs. % of ranked test cases considered, 0-100% on both axes, plotted against the random-selection diagonal.]
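A short Python sketch that turns the per-bin counts in the table above into the cumulative lift points plotted in the chart; the "random" baseline is simply the percentage of ranked cases examined:

```python
positives_per_bin = [210, 120, 60, 40, 22, 18, 12, 7, 6, 5]   # from the table above
total_positives = sum(positives_per_bin)                       # 500

cumulative = 0
for i, count in enumerate(positives_per_bin, start=1):
    cumulative += count
    pct_cases = 100 * i / len(positives_per_bin)        # % of ranked cases examined
    pct_positives = 100 * cumulative / total_positives  # lift curve value
    print(f"{pct_cases:5.1f}% of cases -> {pct_positives:5.1f}% of positives "
          f"(random baseline: {pct_cases:5.1f}%)")
```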
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Summary
Introduction
Sequential covering
Learn one rule at a time, sequentially.
After a rule is learned, the training examples
covered by the rule are removed.
Only the remaining data are used to find
subsequent rules.
The process repeats until some stopping
criteria are met.
Note: a rule covers an example if the example
satisfies the conditions of the rule.
We introduce two specific algorithms.
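A schematic Python rendering of this covering loop; `learn_one_rule` and `covers` are placeholders for the algorithm-specific parts (e.g. learn-one-rule-1 or learn-one-rule-2) described next:

```python
def sequential_covering(data, learn_one_rule, covers):
    """Learn an ordered rule list: learn one rule, remove the examples it
    covers, and repeat on the remaining data until no useful rule is found."""
    rules = []
    remaining = list(data)
    while remaining:
        rule = learn_one_rule(remaining)      # e.g. learn-one-rule-1 or -2
        if rule is None:                      # stopping criterion met
            break
        rules.append(rule)
        # keep only the examples NOT covered by the new rule
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return rules
```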
Differences:
Learn-one-rule-1 function
Learn-one-rule-1 function (cont.)
In iteration m, each (m-1)-condition rule in the current candidate set is extended by adding one more condition.
Learn-one-rule-1 algorithm
Learn-one-rule-2 function
Learn-one-rule-2 algorithm
Rule evaluation in learn-one-rule-2
Let the current partially developed rule be
$$R:\ av_1, \ldots, av_k \rightarrow class$$
where each $av_j$ is a condition (an attribute-value pair), and let $R^+$ be the rule obtained by adding one more condition to $R$. The gain of the added condition is
$$\text{gain}(R, R^+) = p_1 \times \left(\log_2\frac{p_1}{p_1+n_1} - \log_2\frac{p_0}{p_0+n_0}\right)$$
where $p_0$ ($n_0$) is the number of positive (negative) examples covered by $R$, and $p_1$ ($n_1$) the number covered by $R^+$.
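A direct Python transcription of this gain computation (argument names are illustrative):

```python
import math

def rule_gain(p0, n0, p1, n1):
    """gain(R, R+) = p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))), where
    p0/n0 (p1/n1) are the positive/negative examples covered by R (R+)."""
    if p1 == 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Adding a condition that keeps 20 of 30 positives but cuts negatives from 25 to 5:
print(rule_gain(p0=30, n0=25, p1=20, n1=5))
```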
$$v(BestRule, PrunePos, PruneNeg) = \frac{p - n}{p + n}$$
where p (n) is the number of examples in PrunePos (PruneNeg) covered by the current rule (after a deletion).
Discussions
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Three approaches
Considerations in CAR mining
Multiple minimum class supports
Building classifiers
Coverage: rules for rare items are not found using the classic (single minimum support) algorithm. Multiple minimum supports and the support difference constraint help a great deal.
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Bayesian classification
$$\Pr(C=c_j \mid A_1=a_1, \ldots, A_{|A|}=a_{|A|}) = \frac{\Pr(A_1=a_1, \ldots, A_{|A|}=a_{|A|} \mid C=c_j)\Pr(C=c_j)}{\sum_{r=1}^{|C|}\Pr(A_1=a_1, \ldots, A_{|A|}=a_{|A|} \mid C=c_r)\Pr(C=c_r)}$$
Computing probabilities
Conditional independence
assumption
Formally, we assume,
Pr(A1=a1 | A2=a2, ..., A|A|=a|A|, C=cj) = Pr(A1=a1 | C=cj)
$$\Pr(C=c_j \mid A_1=a_1, \ldots, A_{|A|}=a_{|A|}) = \frac{\Pr(C=c_j)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)}{\sum_{r=1}^{|C|}\Pr(C=c_r)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_r)}$$
We are done!
How do we estimate Pr(Ai = ai | C = cj)? Easy!
For classification we only need the numerator (the denominator is the same for all classes), so we assign the class
$$c = \arg\max_{c_j} \Pr(C=c_j)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)$$
An example
An example (cont.)
For C = t, we have
$$\Pr(C=t)\prod_{j=1}^{2}\Pr(A_j=a_j \mid C=t) = \frac{1}{2}\times\frac{2}{5}\times\frac{2}{5} = \frac{2}{25}$$
For C = f, we have
$$\Pr(C=f)\prod_{j=1}^{2}\Pr(A_j=a_j \mid C=f) = \frac{1}{2}\times\frac{1}{5}\times\frac{2}{5} = \frac{1}{25}$$
C = t is more probable, so t is the predicted class.
Additional issues
To avoid zero probabilities, the conditional probabilities are estimated with smoothing:
$$\Pr(A_i=a_i \mid C=c_j) = \frac{n_{ij} + \lambda}{n_j + \lambda m_i}$$
where $n_{ij}$ is the number of examples with $A_i=a_i$ and class $c_j$, $n_j$ is the number of examples with class $c_j$, $m_i$ is the number of values of attribute $A_i$, and $\lambda$ is a smoothing parameter ($\lambda = 1$ gives Laplace smoothing).
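A compact Python sketch of naïve Bayes learning and classification for categorical attributes using this smoothed estimate (λ = 1 below; the data layout and names are illustrative):

```python
from collections import Counter

def train_nb(examples, lam=1.0):
    """examples: list of (attribute_value_tuple, class_label).
    Returns class priors and a smoothed conditional probability function."""
    class_counts = Counter(c for _, c in examples)
    n_attrs = len(examples[0][0])
    # value_counts[c][i][v] = number of class-c examples with attribute i = v
    value_counts = {c: [Counter() for _ in range(n_attrs)] for c in class_counts}
    attr_values = [set() for _ in range(n_attrs)]
    for attrs, c in examples:
        for i, v in enumerate(attrs):
            value_counts[c][i][v] += 1
            attr_values[i].add(v)

    priors = {c: class_counts[c] / len(examples) for c in class_counts}

    def cond_prob(i, v, c):
        # Pr(A_i = v | C = c) with smoothing: (n_ij + lam) / (n_j + lam * m_i)
        m_i = len(attr_values[i])
        return (value_counts[c][i][v] + lam) / (class_counts[c] + lam * m_i)

    return priors, cond_prob

def classify_nb(priors, cond_prob, attrs):
    scores = {c: priors[c] for c in priors}
    for c in priors:
        for i, v in enumerate(attrs):
            scores[c] *= cond_prob(i, v, c)
    return max(scores, key=scores.get)

data = [(("g", "s"), "t"), (("g", "q"), "t"), (("b", "q"), "f"), (("b", "s"), "f")]
priors, cond_prob = train_nb(data)
print(classify_nb(priors, cond_prob, ("g", "q")))   # -> "t"
```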
Advantages:
Easy to implement
Very efficient
Good results obtained in many applications
Disadvantages:
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Text classification/categorization
Due to the rapid growth of online documents in organizations and on the Web, automated document classification has become an important research and application area.
Probabilistic framework
Generative model: Each document is
generated by a parametric distribution
governed by a set of hidden parameters.
The generative model makes two assumptions:
the data (documents) are generated by a mixture model, and
there is a one-to-one correspondence between mixture components and document classes.
Mixture model
An example
[Figure: an example mixture of two distributions, one generating the documents of each class (class 1 and class 2).]
Document generation
$$\Pr(d_i \mid \Theta) = \sum_{j=1}^{|C|}\Pr(c_j \mid \Theta)\Pr(d_i \mid c_j;\Theta) \qquad (23)$$
Multinomial distribution
$$\Pr(d_i \mid c_j;\Theta) = \Pr(|d_i|)\,|d_i|!\prod_{t=1}^{|V|}\frac{\Pr(w_t \mid c_j;\Theta)^{N_{ti}}}{N_{ti}!} \qquad (24)$$
where $N_{ti}$ is the number of times word $w_t$ occurs in document $d_i$, and
$$\sum_{t=1}^{|V|}\Pr(w_t \mid c_j;\Theta) = 1 \qquad (25)$$
Parameter estimation
The word probabilities are estimated from the training documents $D$ (with $N_{ti}$ the count of word $w_t$ in document $d_i$, and $\Pr(c_j \mid d_i) \in \{0, 1\}$ given the class labels):
$$\Pr(w_t \mid c_j;\hat\Theta) = \frac{\sum_{i=1}^{|D|} N_{ti}\Pr(c_j \mid d_i)}{\sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\Pr(c_j \mid d_i)} \qquad (26)$$
To avoid zero probabilities, Laplacian smoothing with parameter $\lambda$ is added:
$$\Pr(w_t \mid c_j;\hat\Theta) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti}\Pr(c_j \mid d_i)}{\lambda|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\Pr(c_j \mid d_i)} \qquad (27)$$
The class prior probabilities are estimated as
$$\Pr(c_j \mid \hat\Theta) = \frac{\sum_{i=1}^{|D|}\Pr(c_j \mid d_i)}{|D|} \qquad (28)$$
Classification
Given a test document $d_i$, from Eqs. (23), (27) and (28):
$$\Pr(c_j \mid d_i;\hat\Theta) = \frac{\Pr(c_j \mid \hat\Theta)\Pr(d_i \mid c_j;\hat\Theta)}{\Pr(d_i \mid \hat\Theta)} = \frac{\Pr(c_j \mid \hat\Theta)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_j;\hat\Theta)}{\sum_{r=1}^{|C|}\Pr(c_r \mid \hat\Theta)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_r;\hat\Theta)}$$
The class with the highest $\Pr(c_j \mid d_i;\hat\Theta)$ is assigned to the document.
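A minimal Python sketch of multinomial naïve Bayes in this spirit: training with hard class labels (so Pr(c_j|d_i) ∈ {0, 1}) and Laplacian smoothing as in Eqs. (27)-(28), then classification in log space. All names and the toy data are illustrative:

```python
import math
from collections import Counter

def train_text_nb(docs, labels, lam=1.0):
    """docs: list of token lists; labels: class label per document.
    Returns log priors and smoothed log word probabilities per class."""
    vocab = {w for doc in docs for w in doc}
    log_prior, log_cond = {}, {}
    for c in set(labels):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))       # Eq. (28)
        counts = Counter(w for doc in class_docs for w in doc)     # sum of N_ti
        total = sum(counts.values())
        # Eq. (27) with Laplacian smoothing (lam = 1)
        log_cond[c] = {w: math.log((lam + counts[w]) / (lam * len(vocab) + total))
                       for w in vocab}
    return log_prior, log_cond, vocab

def classify_text_nb(log_prior, log_cond, vocab, doc):
    scores = {}
    for c in log_prior:
        s = log_prior[c]
        for w in doc:
            if w in vocab:                     # ignore unseen words
                s += log_cond[c][w]
        scores[c] = s
    return max(scores, key=scores.get)

docs = [["cheap", "offer", "buy"], ["meeting", "project"], ["buy", "cheap"]]
labels = ["spam", "ham", "spam"]
model = train_text_nb(docs, labels)
print(classify_text_nb(*model, ["cheap", "buy", "now"]))   # -> "spam"
```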
Discussions
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Introduction
Basic concepts
$$y_i = \begin{cases} 1 & \text{if } \langle \mathbf{w}\cdot\mathbf{x}_i\rangle + b \ge 0 \\ -1 & \text{if } \langle \mathbf{w}\cdot\mathbf{x}_i\rangle + b < 0 \end{cases}$$
The hyperplane
The distance from a point $\mathbf{x}_i$ to the hyperplane $\langle\mathbf{w}\cdot\mathbf{x}\rangle + b = 0$ is
$$\frac{|\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b|}{\|\mathbf{w}\|} \qquad (36)$$
where $\|\mathbf{w}\|$ is the Euclidean norm of $\mathbf{w}$:
$$\|\mathbf{w}\| = \sqrt{\langle\mathbf{w}\cdot\mathbf{w}\rangle} = \sqrt{w_1^2 + w_2^2 + \ldots + w_n^2} \qquad (37)$$
For a support vector $\mathbf{x}_s$ on the margin hyperplane (so that $|\langle\mathbf{w}\cdot\mathbf{x}_s\rangle + b| = 1$), the distance to the decision boundary is
$$d_+ = \frac{|\langle\mathbf{w}\cdot\mathbf{x}_s\rangle + b|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|} \qquad (38)$$
so the margin is
$$\text{margin} = d_+ + d_- = \frac{2}{\|\mathbf{w}\|} \qquad (39)$$
An optimization problem!
Definition (Linear SVM, separable case): Given a set of linearly separable training examples
$$D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_r, y_r)\}$$
learning is to solve the following constrained minimization problem:
$$\text{Minimize:}\quad \frac{\langle\mathbf{w}\cdot\mathbf{w}\rangle}{2} \qquad \text{Subject to:}\quad y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1,\quad i = 1, 2, \ldots, r \qquad (40)$$
The constraint $y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1$ summarizes
$$\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b \ge 1 \text{ for } y_i = 1, \qquad \langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b \le -1 \text{ for } y_i = -1.$$
The Lagrangian of the problem is
$$L_P = \frac{1}{2}\langle\mathbf{w}\cdot\mathbf{w}\rangle - \sum_{i=1}^{r}\alpha_i\bigl[y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) - 1\bigr] \qquad (41)$$
where $\alpha_i \ge 0$ are the Lagrange multipliers.
Kuhn-Tucker conditions
Dual formulation
$$L_D = \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{r} y_i y_j \alpha_i\alpha_j\langle\mathbf{x}_i\cdot\mathbf{x}_j\rangle \qquad (55)$$
The decision boundary is
$$\langle\mathbf{w}\cdot\mathbf{x}\rangle + b = \sum_{i\in sv} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{x}\rangle + b = 0 \qquad (57)$$
and a test instance $\mathbf{z}$ is classified by
$$\text{sign}(\langle\mathbf{w}\cdot\mathbf{z}\rangle + b) = \text{sign}\Bigl(\sum_{i\in sv} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{z}\rangle + b\Bigr) \qquad (58)$$
If (58) returns 1, the test instance z is classified as positive; otherwise, it is classified as negative.
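A small Python sketch of Eq. (58), classifying a test instance from the support vectors, their labels and Lagrange multipliers; the numbers below are made-up placeholders, not a solved SVM:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_decision(support_vectors, labels, alphas, b, z):
    """sign( sum_{i in sv} y_i * alpha_i * <x_i . z> + b )  -- Eq. (58)."""
    s = sum(y * a * dot(x, z) for x, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s >= 0 else -1

# Hypothetical support vectors / multipliers, for illustration only:
sv = [(1.0, 1.0), (2.0, 0.0)]
y = [1, -1]
alpha = [0.5, 0.5]
b = -0.25
print(svm_decision(sv, y, alpha, b, (1.5, 1.2)))   # -> -1 for these made-up values
```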
$$\text{Minimize:}\quad \frac{\langle\mathbf{w}\cdot\mathbf{w}\rangle}{2} \qquad \text{Subject to:}\quad y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1,\quad i = 1, 2, \ldots, r$$
To allow errors, the constraints are relaxed with slack variables $\xi_i$: $y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1 - \xi_i$, with $\xi_i \ge 0,\ i = 1, 2, \ldots, r$.
Geometric interpretation
We also need to penalize the errors in the objective function. A natural way of doing it is to assign an extra cost for errors, changing the objective function to
$$\text{Minimize:}\quad \frac{\langle\mathbf{w}\cdot\mathbf{w}\rangle}{2} + C\sum_{i=1}^{r}(\xi_i)^k \qquad (60)$$
where $C$ is a user-specified penalty parameter. $k = 1$ is commonly used, which has the advantage that neither $\xi_i$ nor its Lagrangian multipliers appear in the dual formulation.
The new constraints become
$$y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1 - \xi_i,\quad i = 1, 2, \ldots, r \qquad (61)$$
$$\xi_i \ge 0,\quad i = 1, 2, \ldots, r \qquad (62)$$
and the Lagrangian is
$$L_P = \frac{1}{2}\langle\mathbf{w}\cdot\mathbf{w}\rangle + C\sum_{i=1}^{r}\xi_i - \sum_{i=1}^{r}\alpha_i\bigl[y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) - 1 + \xi_i\bigr] - \sum_{i=1}^{r}\mu_i\xi_i$$
Kuhn-Tucker conditions
Dual
The offset $b$ can be computed from any support vector $(\mathbf{x}_j, y_j)$ with $0 < \alpha_j < C$ using the condition $y_j(\langle\mathbf{w}\cdot\mathbf{x}_j\rangle + b) = 1$:
$$\frac{1}{y_j} - b - \sum_{i=1}^{r} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{x}_j\rangle = 0 \qquad (73)$$
$$\langle\mathbf{w}\cdot\mathbf{x}\rangle + b = \sum_{i=1}^{r} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{x}\rangle + b = 0 \qquad (75)$$
Space transformation
$$\phi:\ X \to F, \qquad \mathbf{x} \mapsto \phi(\mathbf{x}) \qquad (77)$$
Geometric interpretation
An example space transformation
$$(x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2}\,x_1x_2)$$
Kernel functions
Polynomial kernel:
$$K(\mathbf{x}, \mathbf{z}) = \langle\mathbf{x}\cdot\mathbf{z}\rangle^d \qquad (83)$$
Let us compute the kernel with degree $d = 2$ in a 2-dimensional space: $\mathbf{x} = (x_1, x_2)$ and $\mathbf{z} = (z_1, z_2)$.
$$\langle\mathbf{x}\cdot\mathbf{z}\rangle^2 = (x_1z_1 + x_2z_2)^2 = x_1^2z_1^2 + 2x_1z_1x_2z_2 + x_2^2z_2^2 = \langle(x_1^2, x_2^2, \sqrt{2}\,x_1x_2)\cdot(z_1^2, z_2^2, \sqrt{2}\,z_1z_2)\rangle = \langle\phi(\mathbf{x})\cdot\phi(\mathbf{z})\rangle \qquad (84)$$
This shows that the kernel $\langle\mathbf{x}\cdot\mathbf{z}\rangle^2$ is a dot product in the transformed feature space.
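A tiny Python check of this identity for d = 2, comparing K(x, z) = ⟨x·z⟩² computed in the input space with the dot product under the explicit map φ(x) = (x1², x2², √2·x1x2):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2 dimensions."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

x, z = (1.0, 3.0), (2.0, -1.0)
kernel_value = dot(x, z) ** 2          # K(x, z) = <x . z>^2, computed in input space
feature_value = dot(phi(x), phi(z))    # <phi(x) . phi(z)>, computed in feature space
print(kernel_value, feature_value)     # both equal 1 here, up to floating-point rounding
```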
Kernel trick
Is it a kernel function?
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
k-Nearest Neighbor Classification (kNN)
kNN Algorithm
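The algorithm itself is short; a minimal Python sketch using Euclidean distance and a majority vote (names are illustrative):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label); query: point to classify.
    Find the k nearest neighbors (Euclidean distance) and take a majority vote."""
    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((1, 0), "a")]
print(knn_classify(train, (0.5, 0.5), k=3))   # -> "a"
```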
[Figure: a new point is classified by the classes of its k nearest neighbors, e.g. estimating Pr(science | •) from the neighborhood.]
Discussions
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Combining classifiers
Bagging
Boosting
Bagging (Breiman, 1996)
Bagging (cont.)
Training: in each iteration t = 1, ..., T, create a bootstrap sample S_t by drawing n examples from the original training set (of size n) with replacement, and build a classifier from S_t.
Testing: classify a test instance by the majority vote of the T classifiers.
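A schematic Python sketch of this train/test cycle; `base_learner` is a placeholder for any learning algorithm (e.g. a decision-tree inducer) that returns a classifier callable on an instance:

```python
import random
from collections import Counter

def bagging_train(data, base_learner, n_classifiers=10, seed=0):
    """Build n classifiers, each from a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(n_classifiers):
        sample = [rng.choice(data) for _ in range(len(data))]   # bootstrap sample
        classifiers.append(base_learner(sample))
    return classifiers

def bagging_classify(classifiers, instance):
    """Majority vote over the predictions of the individual classifiers."""
    votes = Counter(clf(instance) for clf in classifiers)
    return votes.most_common(1)[0][0]
```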
Bagging example
[Table: an original training set and four bootstrap training sets (Training set 1-4), each drawn from the original by sampling with replacement.]
Bagging (cont.)
Boosting
A family of methods; we study AdaBoost.
Training: produce a sequence of classifiers, each concentrating (through example weights) on the examples misclassified by its predecessors.
Testing: combine the classifiers by weighted voting.
AdaBoost
called a weaker classifier
Weighted
training set
Build a classifier ht
whose accuracy on
training set >
(better than random)
Non-negative weights
sum to 1
Change weights
CS583, Bing Liu, UIC
AdaBoost algorithm
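A compact Python sketch of one standard formulation of the AdaBoost loop (labels assumed to be ±1; `weak_learner(data, weights)` is a placeholder that must return a classifier better than random on the weighted data; the exact classifier-weight formula varies across presentations):

```python
import math

def adaboost_train(data, labels, weak_learner, rounds=10):
    """data/labels: training examples with labels in {-1, +1}.
    Maintains a weight per example and boosts the weight of misclassified ones."""
    n = len(data)
    weights = [1.0 / n] * n                        # non-negative, sum to 1
    ensemble = []                                  # list of (alpha, classifier)
    for _ in range(rounds):
        h = weak_learner(data, weights)
        err = sum(weights[i] for i in range(n) if h(data[i]) != labels[i])
        if err >= 0.5 or err == 0:                 # not better than random, or perfect
            break
        alpha = 0.5 * math.log((1 - err) / err)    # weight of this classifier
        ensemble.append((alpha, h))
        # increase weights of misclassified examples, decrease the rest, renormalize
        weights = [w * math.exp(-alpha * labels[i] * h(data[i]))
                   for i, w in enumerate(weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_classify(ensemble, instance):
    score = sum(alpha * h(instance) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```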
Bagged C4.5 vs. C4.5
Boosted C4.5 vs. C4.5
Boosting vs. Bagging
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Summary
Summary