Chapter 4 Machine Learning Part 4
´ Standard methodology:
1. Collect a large set of examples (all with correct classifications).
2. Divide the collection into training, validation and test sets.
Loop:
3. Apply the learning algorithm to the training set.
4. Measure performance on the validation set, and adjust hyper-parameters* to improve performance.
5. Measure performance on the test set.
´ DO NOT LOOK AT THE TEST SET until step 5.
´ *Hyper-parameters: parameters used to set up your ML algorithm.
´ for Naive Bayes (NB): the value of delta used for smoothing;
´ for decision trees (DTs): the pruning level.
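To make the workflow concrete, here is a minimal sketch of the train/validation/test loop in Python, using scikit-learn's train_test_split and a decision tree whose maximum depth stands in for the pruning level; the 60/20/20 split and the candidate depth values are illustrative assumptions, not prescribed by the slides.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def tune_and_evaluate(X, y):
    # Step 2: split into training (60%), validation (20%) and test (20%) sets
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    # Steps 3-4 (loop): train on the training set, measure on the validation set,
    # and adjust a hyper-parameter (here, the maximum tree depth)
    best_model, best_val_acc = None, -1.0
    for depth in (2, 4, 8, None):  # candidate hyper-parameter values (assumed)
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        model.fit(X_train, y_train)
        val_acc = accuracy_score(y_val, model.predict(X_val))
        if val_acc > best_val_acc:
            best_model, best_val_acc = model, val_acc

    # Step 5: only now touch the test set, once, with the chosen model
    return best_model, accuracy_score(y_test, best_model.predict(X_test))
```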
5 Metrics
´ Accuracy
´ the % of test-set instances the algorithm classifies correctly
´ appropriate when all classes are equally important and equally represented
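As a quick illustration (not from the slides), accuracy is simply the fraction of predictions that match the true labels; the function name is illustrative.

```python
def accuracy(y_true, y_pred):
    # fraction of test instances whose predicted class matches the true class
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# e.g. accuracy(["spam", "ham", "spam"], ["spam", "spam", "spam"]) == 2/3
```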
´ Precision P and recall R can be combined with a weighted harmonic mean (WHM), with weights a and b:

    WHM = \frac{a + b}{\frac{a}{R} + \frac{b}{P}} = \frac{(a + b)\,P R}{a P + b R}

´ let β² = a/b; dividing the numerator and denominator by b:

    WHM = \frac{\beta^2 + 1}{\frac{\beta^2}{R} + \frac{1}{P}} = \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R}    … which is called the F-measure
11 Evaluation: the F-measure
    F = \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R}
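A minimal sketch of the F-measure as a function of precision and recall (the function name and the β = 1 default are illustrative assumptions; β = 1 gives the usual F1 score):

```python
def f_measure(precision, recall, beta=1.0):
    # F = (beta^2 + 1) * P * R / (beta^2 * P + R)
    # beta > 1 weighs recall more heavily; beta < 1 weighs precision more heavily
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# e.g. f_measure(0.5, 1.0) == 2 * 0.5 * 1.0 / (0.5 + 1.0) ≈ 0.667
```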
´ Noisy Input:
´ two examples have the same feature-value pairs, but different outputs
´ We can also underfit the data, i.e. use too simple a decision boundary
        a1   a2   a3   Output
  X1     1    0    0     ?
  X2     1    6    0     ?
  X3     8    0    1     ?
  X4     6    1    0     ?
  X5     1    7    1     ?

[Figure: the five unlabelled examples X1–X5 plotted as points in feature space]
25 Clustering
´ Clustering Algorithm
´ Represent the test instances in an n-dimensional space
´ Partition them into regions of high density
´ How? … many algorithms (e.g. k-means); see the sketch below
´ Compute the centroid of each region as the average of the data points in the cluster
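As a quick illustration of this pipeline (not part of the slides), scikit-learn's KMeans partitions the points and exposes the resulting centroids; the toy data is an assumption for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# toy 2-D data: two loose groups of points (illustrative assumption)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # each centroid is the average of its cluster's points
```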
26 Clustering Techniques
K-means
27 k-means Clustering
´ To find the nearest centroid, a possible metric is the Euclidean distance
´ distance between 2 points p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn):

    d = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

´ Where to assign a data point x?
´ For all k clusters, choose the one where x has the smallest distance.

[Figure: data points plotted on a 10 × 10 grid]
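A minimal sketch of the distance computation and the assignment step, assuming points and centroids are given as NumPy arrays (the function names are illustrative):

```python
import numpy as np

def euclidean_distance(p, q):
    # d = sqrt( sum_i (p_i - q_i)^2 )
    return np.sqrt(np.sum((p - q) ** 2))

def assign_to_cluster(x, centroids):
    # for all k centroids, choose the index of the one closest to x
    distances = [euclidean_distance(x, c) for c in centroids]
    return int(np.argmin(distances))

# e.g. with centroids at (1, 1) and (8, 8), the point (2, 3) joins cluster 0
centroids = np.array([[1.0, 1.0], [8.0, 8.0]])
print(assign_to_cluster(np.array([2.0, 3.0]), centroids))  # -> 0
```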
30 Example
initial 3 centroids (e.g. at random) in 2-D, i.e. 2 features

[Figure: three initial centroids c1, c2, c3 placed among the data points]
31 Example
partition the data points to their closest centroid

[Figure: each data point assigned to the nearest of c1, c2, c3]
32 Example
re-compute the new centroids

[Figure: c1, c2, c3 moved to the average of their assigned points]
33 Example
re-assign the data points to the new closest centroids

[Figure: points re-assigned to the updated centroids c1, c2, c3]
34 Example
[Figure: the resulting clusters around centroids c1, c2, c3]
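Putting the example together, here is a minimal from-scratch sketch of the k-means loop (pick initial centroids, assign points, re-compute centroids, repeat until the centroids stop moving); random initialization from the data points and the iteration cap are illustrative assumptions.

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k initial centroids at random among the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # assignment step: each point joins its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)

        # update step: each centroid becomes the average of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # stop when the centroids no longer move (a local optimum)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels
```

Applied to a small 2-D data set such as the toy array in the earlier sketch, the returned labels reproduce the kind of partition illustrated in the example slides.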
35 Why use K-means ?
´ Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are small, k-means is considered a linear algorithm.
´ K-means is the most popular clustering algorithm.
´ Note that it terminates at a local optimum; the global optimum is hard to find due to complexity.
36 Weakness of K-means
[Figure: a data set containing two outliers]