
Seminar 3

Outline

• DT

• KNN

• Unsupervised algorithms

• Clustering
DT

• DTs are non-parametric supervised learning methods that can be used for classification and regression

• There are no θ or W to be learned from the data

• Objective is to create a model that predicts the value of the target variable by learning simple decision rules

• Decision rules are inferred from the data features
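For concreteness, a minimal sketch of fitting a decision tree with scikit-learn; the tiny dataset and the entropy criterion are purely illustrative:

```python
# Minimal sketch: fitting a decision tree classifier with scikit-learn.
# The four 2-feature points below are purely illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [1, 0], [0, 1], [1, 1]]   # feature vectors
y = [0, 0, 1, 1]                       # class labels

clf = DecisionTreeClassifier(criterion="entropy")  # "gini" is the default criterion
clf.fit(X, y)

print(clf.predict([[0, 1]]))   # -> [1]
print(export_text(clf))        # the learned decision rules, as text
```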
DT — Goals
• Two conflicting goals in building a DT

• achieving a low training error

• building a DT that is not too large

• Training error — fraction of errors made on the training dataset

• Testing error — fraction of errors made on the testing dataset
DT — Advantages

• Easy to understand and interpret

• Tree can be visualised

• Requires little data preparation and minimal preprocessing


DT — Disadvantages

• Easy to overfit, leading to complex trees which do not generalise well

• Small variations in data can result in a totally different tree

• DT learning algorithms are heuristic (e.g. greedy algorithms) and hence cannot guarantee a global optimum
DT for playing Tennis
Outlook?
  Sunny    → Humidity?  (High → No,  Normal → Yes)
  Overcast → Yes
  Rain     → Wind?      (Strong → No,  Weak → Yes)

(Outlook = Sunny and Humidity = Normal)
or
(Outlook = Overcast)
or
(Outlook = Rain and Wind = Weak)
Tree Algorithms

• ID3 — Iterative Dichotomiser 3

• C4.5, C5.0

• CART — Classification & Regression Trees
Impurity Function

• Decision tree is constructed by splitting the dataset

• Measure the quality of the dataset after each split

• Impurity function measures the quality of the split

• Gini impurity — uses class probabilities

• Entropy — uses information entropy


Play Tennis

Yes - 9
No - 5

http://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/mitchell-dectrees.pdf
Information Gain

• Select the attribute/feature providing the highest information gain

• Entropy — 9+, 5-

E = (9/14) log2(14/9) + (5/14) log2(14/5) = 0.940
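The 0.940 value can be checked in a few lines of Python (standard library only):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

print(f"{entropy(9, 5):.3f}")   # 0.940
```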
Information Gain

• Outlook:

• Sunny [2+, 3-]: E = (2/5) log2(5/2) + (3/5) log2(5/3) = 0.9709

• Overcast [4+, 0-]: E = 0

• Rain [3+, 2-]: E = 0.9709


Information Gain

G(S, Outlook) = 0.940 − (5/14) × 0.9709 − (4/14) × 0 − (5/14) × 0.9709

G(S, Outlook) = 0.2465

G(S, Humidity) = 0.151

G(S, Wind) = 0.048

G(S, Temperature) = 0.029
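A self-contained sketch of the same calculation; the subset counts come from the slides, and the result matches G(S, Outlook) up to rounding:

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

def information_gain(parent, subsets):
    """Gain = entropy of the parent set minus the weighted entropy of its subsets."""
    total = sum(parent)
    return entropy(*parent) - sum((p + n) / total * entropy(p, n) for p, n in subsets)

# Outlook splits S [9+, 5-] into Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
print(f"{information_gain((9, 5), [(2, 3), (4, 0), (3, 2)]):.4f}")   # 0.2467 ≈ 0.2465
```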


Tree Size

[Plots: Training Error vs Tree Size, and Testing Error vs Tree Size]


Tree Growth

• Two methods:

• Grow the DT carefully and stop it at an appropriate size

• Grow the biggest tree possible and prune it to an appropriate size (most adopted method)
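One common way to realise the grow-then-prune approach is cost-complexity pruning. A hedged sketch with scikit-learn follows; the dataset and the ccp_alpha value are illustrative, and ccp_alpha would normally be tuned on held-out data:

```python
# Grow a full tree, then a pruned one via cost-complexity pruning (ccp_alpha).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full   = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                  # biggest tree
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_tr, y_tr)  # pruned tree

print(full.tree_.node_count, pruned.tree_.node_count)     # pruned tree is smaller
print(full.score(X_te, y_te), pruned.score(X_te, y_te))   # compare testing accuracy
```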
Real-valued Features
• In a regular DT, the target variable is discrete, and the feature values at the decision nodes are also discrete

• In the playing-tennis example, the target is the YES/NO decision to play tennis

• Features were discrete, e.g. Wind — STRONG/WEAK

• How to use DT for regression? i.e. when target and/or features are continuous-valued
Real-valued Features
• Continuous-valued features can be easily handled by DT
at the decision nodes

• For example, the decision node can be rendered as

Temperature > C

Yes No

• How to determine the threshold C?


Real-valued Features
Temp.  40   48   60    72    80    90
Play   No   No   Yes   Yes   Yes   No

(48 + 60)/2 = 54        (80 + 90)/2 = 85

2 temperature values at which the Play tennis variable changes

Create the 2 new features, Temp>54 and Temp>85

With these new features, construct the DT by maximising the information gain
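A short sketch of how these candidate thresholds can be found programmatically: sort by temperature and take the midpoints where the Play label changes.

```python
temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]

pairs = sorted(zip(temps, play))
# midpoint between consecutive temperatures whenever the label flips
thresholds = [(a + b) / 2 for (a, pa), (b, pb) in zip(pairs, pairs[1:]) if pa != pb]
print(thresholds)   # [54.0, 85.0] -> candidate features Temp>54 and Temp>85
```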
DT — Regression/Classification

• DT Classification — predicted class is the most common class in the node (majority class)

• DT Regression — DT can also be used for estimating a real-valued target
DT Regression
Outlook    Temp   Humidity   Windy   Min
Rainy      Hot    High       FALSE   25
Rainy      Hot    High       TRUE    30
Overcast   Hot    High       FALSE   46
Sunny      Mild   High       FALSE   45
Sunny      Cool   Normal     FALSE   52
Sunny      Cool   Normal     TRUE    23
Overcast   Cool   Normal     TRUE    43
Rainy      Mild   High       FALSE   35
Rainy      Cool   Normal     FALSE   38
Sunny      Mild   Normal     FALSE   46
Rainy      Mild   Normal     TRUE    48
Overcast   Mild   High       TRUE    52
Overcast   Hot    Normal     FALSE   44
Sunny      Mild   High       TRUE    30


Statistics of Dataset

Count  14

Mean  39.7857143

S (standard deviation)  9.32108647

CV (coefficient of variation)  0.23428225
Root Node
SDR = 9.32 − (4/14)×3.49 − (5/14)×7.78 − (5/14)×10.87

Outlook       S            Mean         Count
  Overcast    3.49106001   46.25        4
  Rainy       7.78203058   35.2         5
  Sunny       10.8701426   39.2         5
SDR = 1.66 *

Temp          S            Mean         Count
  Cool        10.511898    39           4
  Hot         8.95474734   36.25        4
  Mild        7.65216019   42.6666667   6
SDR = 0.48

Humidity      S            Mean         Count
  High        9.3634112    37.5714286   7
  Normal      8.73416935   42           7
SDR = 0.27

Windy         S            Mean         Count
  TRUE        10.5934991   37.6666667   6
  FALSE       7.87301562   41.375       8
SDR = 0.28
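A small sketch of the standard deviation reduction (SDR) computation, using the rounded per-subset values from the Outlook table above:

```python
parent_sd = 9.32                                 # std. dev. of Min over the whole dataset
outlook   = [(3.49, 4), (7.78, 5), (10.87, 5)]   # (std. dev., count) for Overcast, Rainy, Sunny

total = sum(n for _, n in outlook)
sdr = parent_sd - sum(n / total * sd for sd, n in outlook)
print(round(sdr, 2))   # 1.66 -> largest SDR, so Outlook is chosen as the root node
```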


First Split
Outlook       S            Mean    CV
  Overcast    3.49106001   46.25   0.07548238
  Rainy       7.78203058   35.2    0.22108041
  Sunny       10.8701426   39.2    0.27729956
DT - Regressor
k-Nearest Neighbours
(KNN)
— Supervised Learning
KNN

• k-nearest neighbours is a supervised learning algorithm

• Can be used for classification/regression

• Finds the k nearest points to the test data from the training dataset
KNN Classifier

    Height   Weight   Gender
1   175      80       Male
2   160      58       Female
3   179      78       Male
4   163      68       Female
5   159      75       Female
6   180      77       Male
7   183      75       Male
8   158      69       ?
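A minimal from-scratch sketch classifying row 8 with k = 3, using Euclidean distance on the raw Height/Weight values (in practice the features would usually be scaled first):

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

train = [((175, 80), "Male"),   ((160, 58), "Female"), ((179, 78), "Male"),
         ((163, 68), "Female"), ((159, 75), "Female"), ((180, 77), "Male"),
         ((183, 75), "Male")]
query = (158, 69)   # row 8

k = 3
neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
prediction = Counter(label for _, label in neighbours).most_common(1)[0][0]
print(prediction)   # Female
```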
KNN Regressor

     Height   Age   Weight
1    5        45    77
2    5.11     26    47
3    5.6      30    55
4    5.9      34    59
5    4.8      40    72
6    5.8      36    60
7    5.3      19    40
8    5.8      28    60
9    5.5      23    45
10   5.6      32    58
11   5.5      38    ?
KNN Regressor

• k = 2: {1,5} weight = (77+72)/2 = 74.5

• k = 3: {1,5,6} weight = (77+72+60)/3 = 69.7

• k = 5: {1,5,6,4,10} weight = (77+72+60+59+58)/5 = 65.2
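Each prediction above is just the mean weight of the chosen neighbours; a tiny sketch reproducing the numbers (the neighbour sets themselves are taken from the slide):

```python
weights = {1: 77, 2: 47, 3: 55, 4: 59, 5: 72, 6: 60, 7: 40, 8: 60, 9: 45, 10: 58}

def knn_regress(neighbour_ids):
    """k-NN regression: average the target value over the chosen neighbours."""
    return sum(weights[i] for i in neighbour_ids) / len(neighbour_ids)

print(knn_regress({1, 5}))               # 74.5
print(round(knn_regress({1, 5, 6}), 1))  # 69.7
print(knn_regress({1, 5, 6, 4, 10}))     # 65.2
```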


Unsupervised
Learning
Unsupervised Learning
• There is no labelled dataset

• The dataset has several features but no target variable or class

• Goal — find the hidden patterns in the unlabelled dataset {x1, x2, ⋯, xM}

• Most commonly used algorithms are hierarchical clustering, the apriori algorithm and k-means clustering
Unsupervised Learning
• Clustering — look for patterns and group similar
datapoints

• Association — deduce rules from observing the datapoints

[Figures: a scatter plot of Internet usage vs Call Duration illustrating clustering, and three customers' shopping baskets (milk, bread, butter, sugar, rice, biscuits, vegetables, toothpaste, mouthwash) illustrating market basket analysis]


Clustering Tech.
Clustering
  Hierarchical
    Divisive
    Agglomerative
  Bayesian
    Decision Based
    Non Parametric
  Partitional
    Centroid (K-means)
    Model Based
    Graph Theoretic
    Spectral
Clustering Tech.
• Hierarchical — find successive clusters using previously found clusters

• Divisive — start with the whole dataset and proceed to divide it into smaller clusters successively (like DT)

• Agglomerative — treat each element (datapoint) as a separate cluster and merge them successively into larger clusters

• Bayesian — tries to generate a posterior distribution over the collection of all dataset partitions

• Partitional — determines all clusters at once and improves them iteratively
1854 - Cholera Outbreak in
London

John Snow
Image Segmentation

https://www.mathworks.com/matlabcentral/fileexchange/41967-fast-fuzzy-c-means-image-segmentation
Hierarchical Clustering
Distance Measure
• Minimum distance — distance between the nearest points of two clusters (Single Linkage / Nearest Neighbour)

• Maximum distance — distance between the farthest points of two clusters (Complete Linkage / Farthest Neighbour)

• Average distance — average of the distances between every pair of points across the clusters

• Mean distance — distance between the cluster centres
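These distance measures correspond to the linkage options in standard libraries. A hedged sketch with SciPy on a few toy 2-D points (the data is illustrative):

```python
# Agglomerative clustering with different linkage criteria.
# 'single'   = minimum distance, 'complete' = maximum distance,
# 'average'  = average distance, 'centroid' = distance between cluster centres.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

Z = linkage(X, method="single")                    # try "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)
```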


k-means Clustering
• k-means is a partitional clustering technique

• There are M datapoints {x1, x2, ⋯, xM}

• Each datapoint is n-dimensional: xk = [x1k, x2k, ⋯, xnk]

• The k-means algorithm partitions the dataset into k clusters

• Each cluster has a cluster centre, called the centroid

• k is specified by the user
k-means Algorithm

1. Select k random data points to be the initial centroids (cluster centres)

2. Assign each data point to the closest centroid

3. Re-compute each centroid using the data points currently in its cluster

4. Check convergence; if not converged, go to step 2 and continue
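A compact NumPy sketch of these four steps (illustrative only; production implementations such as scikit-learn's KMeans add smarter initialisation and multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. random initial centroids
    for _ in range(n_iter):
        # 2. assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. re-compute each centroid from the points currently assigned to it
        #    (an empty cluster keeps its old centroid)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        # 4. converged when the centroids stop moving
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8], [9.0, 9.0], [8.8, 9.1]])
labels, centroids = kmeans(X, k=3)
print(labels)
print(centroids)
```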


k-means Convergence

• no (or minimum) re-assignment of data points to different clusters, or

• no (or minimum) change of centroids, or

• minimum decrease in the sum of squared error (SSE)

SSE = Σ_{l=1}^{k} Σ_{x ∈ Cl} d(x, ml)²

where Cl is the l-th cluster, ml is the centroid of the l-th cluster, and d is the Euclidean distance
k-means
• Euclidean Distance: a = [a1, a2, ⋯, an], b = [b1, b2, ⋯, bn]

d(a, b) = √((a1 − b1)² + (a2 − b2)² + ⋯ + (an − bn)²)

• Centroid: if Cl = {x4, x7, x9}, then

ml = [ (x14 + x17 + x19)/3,  (x24 + x27 + x29)/3,  ⋯,  (xn4 + xn7 + xn9)/3 ]
k-means visually explained

https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
