
Seminar 3

Outline

• DT

• KNN

• Unsupervised algorithms

• Clustering
DT

• DTs are non-parametric supervised learning methods that can be used for classification and regression

• There are no θ or W to be learned from the data

• Objective is to create a model that predicts the value of the target variable by learning simple decision rules

• Decision rules are inferred from the data features
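For concreteness, a minimal sketch of fitting a decision tree with scikit-learn; the tiny dataset and the entropy criterion are purely illustrative:

```python
# Minimal sketch: fitting a decision tree classifier with scikit-learn.
# The four 2-feature points below are purely illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [1, 0], [0, 1], [1, 1]]   # feature vectors
y = [0, 0, 1, 1]                       # class labels

clf = DecisionTreeClassifier(criterion="entropy")  # "gini" is the default criterion
clf.fit(X, y)

print(clf.predict([[0, 1]]))   # -> [1]
print(export_text(clf))        # the learned decision rules, as text
```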
DT — Goals
• Two conflicting goals in building a DT

• achieving a low training error

• building a DT that is not too large

• Training error — fraction of errors made on the training dataset

• Testing error — fraction of errors made on the testing dataset
DT — Advantages

• Easy to understand and interpret

• Tree can be visualised

• Requires little data preparation and minimal preprocessing


DT — Disadvantages

• Easy to overfit, leading to complex trees which do not generalise well

• Small variations in data can result in a totally different tree

• DT learning algorithms are heuristic (e.g. greedy algorithms) and hence cannot guarantee a global optimum
DT for playing Tennis
Outlook?
  Sunny    → Humidity?  (High → No,  Normal → Yes)
  Overcast → Yes
  Rain     → Wind?      (Strong → No,  Weak → Yes)

(Outlook = Sunny and Humidity = Normal)
or
(Outlook = Overcast)
or
(Outlook = Rain and Wind = Weak)
Tree Algorithms

• ID3 — Iterative Dichotomiser 3

• C4.5, C5.0

• CART — Classification & Regression Trees
Impurity Function

• Decision tree is constructed by splitting the dataset

• Measure the quality of the dataset after each split

• Impurity function measures the quality of the split

• Gini impurity — uses class probabilities

• Entropy — uses information entropy


Play Tennis

Yes - 9
No - 5

http://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/mitchell-dectrees.pdf
Information Gain

• Select the attribute/feature providing the highest information gain

• Entropy — 9+, 5-

E = (9/14) log2(14/9) + (5/14) log2(14/5) = 0.940
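The 0.940 value can be checked in a few lines of Python (standard library only):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

print(f"{entropy(9, 5):.3f}")   # 0.940
```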
Information Gain

• Outlook:

• Sunny [2+, 3-]: E = (2/5) log2(5/2) + (3/5) log2(5/3) = 0.9709

• Overcast [4+, 0-]: E = 0

• Rain [3+, 2-]: E = 0.9709


Information Gain

G(S, Outlook) = 0.940 − (5/14) × 0.9709 − (4/14) × 0 − (5/14) × 0.9709

G(S, Outlook) = 0.2465

G(S, Humidity) = 0.151

G(S, Wind) = 0.048

G(S, Temperature) = 0.029
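A self-contained sketch of the same calculation; the subset counts come from the slides, and the result matches G(S, Outlook) up to rounding:

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

def information_gain(parent, subsets):
    """Gain = entropy of the parent set minus the weighted entropy of its subsets."""
    total = sum(parent)
    return entropy(*parent) - sum((p + n) / total * entropy(p, n) for p, n in subsets)

# Outlook splits S [9+, 5-] into Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
print(f"{information_gain((9, 5), [(2, 3), (4, 0), (3, 2)]):.4f}")   # 0.2467 ≈ 0.2465
```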


Tree Size

[Plots: Training Error vs Tree Size, and Testing Error vs Tree Size]


Tree Growth

• Two methods:

• Grow the DT carefully and stop it at an appropriate size

• Grow the biggest tree possible and prune it to an appropriate size (most adopted method)
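One common way to realise the grow-then-prune approach is cost-complexity pruning. A hedged sketch with scikit-learn follows; the dataset and the ccp_alpha value are illustrative, and ccp_alpha would normally be tuned on held-out data:

```python
# Grow a full tree, then a pruned one via cost-complexity pruning (ccp_alpha).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full   = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                  # biggest tree
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_tr, y_tr)  # pruned tree

print(full.tree_.node_count, pruned.tree_.node_count)     # pruned tree is smaller
print(full.score(X_te, y_te), pruned.score(X_te, y_te))   # compare testing accuracy
```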
Real-valued Features
• In a regular DT, the target variable is discrete, and the feature values at the decision nodes are also discrete

• In the playing-tennis example, the target is the YES/NO decision to play tennis

• Features were discrete, e.g. Wind — STRONG/WEAK

• How to use DT for regression? i.e. when target and/or features are continuous-valued
Real-valued Features
• Continuous-valued features can be easily handled by DT
at the decision nodes

• For example, the decision node can be rendered as

Temperature > C

Yes No

• How to determine the threshold C?


Real-valued Features
Temp.  40   48   60    72    80    90
Play   No   No   Yes   Yes   Yes   No

(48 + 60)/2 = 54        (80 + 90)/2 = 85

2 temperature values at which the Play tennis variable changes

Create the 2 new features, Temp>54 and Temp>85

With these new features, construct the DT by maximising the information gain
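A short sketch of how these candidate thresholds can be found programmatically: sort by temperature and take the midpoints where the Play label changes.

```python
temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]

pairs = sorted(zip(temps, play))
# midpoint between consecutive temperatures whenever the label flips
thresholds = [(a + b) / 2 for (a, pa), (b, pb) in zip(pairs, pairs[1:]) if pa != pb]
print(thresholds)   # [54.0, 85.0] -> candidate features Temp>54 and Temp>85
```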
DT — Regression/Classification

• DT Classification — predicted class is the most common class in the node (majority class)

• DT Regression — DT can also be used for estimating a real-valued target
DT Regression
Outlook    Temp   Humidity   Windy   Min
Rainy      Hot    High       FALSE   25
Rainy      Hot    High       TRUE    30
Overcast   Hot    High       FALSE   46
Sunny      Mild   High       FALSE   45
Sunny      Cool   Normal     FALSE   52
Sunny      Cool   Normal     TRUE    23
Overcast   Cool   Normal     TRUE    43
Rainy      Mild   High       FALSE   35
Rainy      Cool   Normal     FALSE   38
Sunny      Mild   Normal     FALSE   46
Rainy      Mild   Normal     TRUE    48
Overcast   Mild   High       TRUE    52
Overcast   Hot    Normal     FALSE   44
Sunny      Mild   High       TRUE    30


Statistics of Dataset

Count  14

Mean  39.7857143

S (standard deviation)  9.32108647

CV (coefficient of variation)  0.23428225
Root Node
SDR = 9.32 − (4/14)×3.49 − (5/14)×7.78 − (5/14)×10.87

Outlook       S            Mean         Count
  Overcast    3.49106001   46.25        4
  Rainy       7.78203058   35.2         5
  Sunny       10.8701426   39.2         5
SDR = 1.66 *

Temp          S            Mean         Count
  Cool        10.511898    39           4
  Hot         8.95474734   36.25        4
  Mild        7.65216019   42.6666667   6
SDR = 0.48

Humidity      S            Mean         Count
  High        9.3634112    37.5714286   7
  Normal      8.73416935   42           7
SDR = 0.27

Windy         S            Mean         Count
  TRUE        10.5934991   37.6666667   6
  FALSE       7.87301562   41.375       8
SDR = 0.28
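A small sketch of the standard deviation reduction (SDR) computation, using the rounded per-subset values from the Outlook table above:

```python
parent_sd = 9.32                                 # std. dev. of Min over the whole dataset
outlook   = [(3.49, 4), (7.78, 5), (10.87, 5)]   # (std. dev., count) for Overcast, Rainy, Sunny

total = sum(n for _, n in outlook)
sdr = parent_sd - sum(n / total * sd for sd, n in outlook)
print(round(sdr, 2))   # 1.66 -> largest SDR, so Outlook is chosen as the root node
```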


First Split
Outlook       S            Mean    CV
  Overcast    3.49106001   46.25   0.07548238
  Rainy       7.78203058   35.2    0.22108041
  Sunny       10.8701426   39.2    0.27729956
DT - Regressor
k-Nearest Neighbours
(KNN)
— Supervised Learning
KNN

• k-nearest neighbours is a supervised learning algorithm

• Can be used for classification/regression

• Finds the k nearest points to the test data from the training dataset
KNN Classifier

    Height   Weight   Gender
1   175      80       Male
2   160      58       Female
3   179      78       Male
4   163      68       Female
5   159      75       Female
6   180      77       Male
7   183      75       Male
8   158      69       ?
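A minimal from-scratch sketch classifying row 8 with k = 3, using Euclidean distance on the raw Height/Weight values (in practice the features would usually be scaled first):

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

train = [((175, 80), "Male"),   ((160, 58), "Female"), ((179, 78), "Male"),
         ((163, 68), "Female"), ((159, 75), "Female"), ((180, 77), "Male"),
         ((183, 75), "Male")]
query = (158, 69)   # row 8

k = 3
neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
prediction = Counter(label for _, label in neighbours).most_common(1)[0][0]
print(prediction)   # Female
```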
KNN Regressor

     Height   Age   Weight
1    5        45    77
2    5.11     26    47
3    5.6      30    55
4    5.9      34    59
5    4.8      40    72
6    5.8      36    60
7    5.3      19    40
8    5.8      28    60
9    5.5      23    45
10   5.6      32    58
11   5.5      38    ?
KNN Regressor

• k = 2: {1,5} weight = (77+72)/2 = 74.5

• k = 3: {1,5,6} weight = (77+72+60)/3 = 69.7

• k = 5: {1,5,6,4,10} weight = (77+72+60+59+58)/5 = 65.2
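Each prediction above is just the mean weight of the chosen neighbours; a tiny sketch reproducing the numbers (the neighbour sets themselves are taken from the slide):

```python
weights = {1: 77, 2: 47, 3: 55, 4: 59, 5: 72, 6: 60, 7: 40, 8: 60, 9: 45, 10: 58}

def knn_regress(neighbour_ids):
    """k-NN regression: average the target value over the chosen neighbours."""
    return sum(weights[i] for i in neighbour_ids) / len(neighbour_ids)

print(knn_regress({1, 5}))               # 74.5
print(round(knn_regress({1, 5, 6}), 1))  # 69.7
print(knn_regress({1, 5, 6, 4, 10}))     # 65.2
```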


Unsupervised
Learning
Unsupervised Learning
• There is no labelled dataset

• The dataset has several features but no target variable or class

• Goal — find the hidden patterns in the unlabelled dataset {x1, x2, ⋯, xM}

• Most commonly used algorithms are hierarchical clustering, the apriori algorithm and k-means clustering
Unsupervised Learning
• Clustering — look for patterns and group similar
datapoints

• Association — deduce rules from observing the datapoints

[Figures: a scatter plot of Internet usage vs Call Duration illustrating clustering, and three customers' shopping baskets (milk, bread, butter, sugar, rice, biscuits, vegetables, toothpaste, mouthwash) illustrating market basket analysis]


Clustering Tech.
Clustering
  Hierarchical
    Divisive
    Agglomerative
  Bayesian
    Decision Based
    Non Parametric
  Partitional
    Centroid (K-means)
    Model Based
    Graph Theoretic
    Spectral
Clustering Tech.
• Hierarchical — find successive clusters using previously found clusters

• Divisive — start with the whole dataset and proceed to divide it into smaller clusters successively (like DT)

• Agglomerative — treat each element (datapoint) as a separate cluster and merge them successively into larger clusters

• Bayesian — tries to generate a posterior distribution over the collection of all dataset partitions

• Partitional — determines all clusters at once and improves them iteratively
1854 - Cholera Outbreak in
London

John Snow
Image Segmentation

https://www.mathworks.com/matlabcentral/fileexchange/41967-fast-fuzzy-c-means-image-segmentation
Hierarchical Clustering
Distance Measure
• Minimum distance — distance between the nearest points of two clusters (Single Linkage / Nearest Neighbour)

• Maximum distance — distance between the farthest points of two clusters (Complete Linkage / Farthest Neighbour)

• Average distance — average of the distances between every pair of points across the clusters

• Mean distance — distance between the cluster centres
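These distance measures correspond to the linkage options in standard libraries. A hedged sketch with SciPy on a few toy 2-D points (the data is illustrative):

```python
# Agglomerative clustering with different linkage criteria.
# 'single'   = minimum distance, 'complete' = maximum distance,
# 'average'  = average distance, 'centroid' = distance between cluster centres.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

Z = linkage(X, method="single")                    # try "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)
```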


k-means Clustering
• k-means is a partitional clustering technique

• There are M datapoints {x1, x2, ⋯, xM}

• Each datapoint is n-dimensional: xk = [x1k, x2k, ⋯, xnk]

• The k-means algorithm partitions the dataset into k clusters

• Each cluster has a cluster centre, called the centroid

• k is specified by the user
k-means Algorithm

1. Select k random data points to be the initial centroids (cluster centres)

2. Assign each data point to the closest centroid

3. Re-compute each centroid using the data points currently in its cluster

4. Check convergence; if not converged, go to step 2 and continue
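A compact NumPy sketch of these four steps (illustrative only; production implementations such as scikit-learn's KMeans add smarter initialisation and multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. random initial centroids
    for _ in range(n_iter):
        # 2. assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. re-compute each centroid from the points currently assigned to it
        #    (an empty cluster keeps its old centroid)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        # 4. converged when the centroids stop moving
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8], [9.0, 9.0], [8.8, 9.1]])
labels, centroids = kmeans(X, k=3)
print(labels)
print(centroids)
```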


k-means Convergence

• no (or minimum) re-assignment of data points to different clusters, or

• no (or minimum) change of centroids, or

• minimum decrease in the sum of squared error (SSE)

SSE = Σ_{l=1}^{k} Σ_{x ∈ Cl} d(x, ml)²

where Cl is the l-th cluster, ml is the centroid of the l-th cluster, and d is the Euclidean distance
k-means
• Euclidean Distance: a = [a1, a2, ⋯, an], b = [b1, b2, ⋯, bn]

d(a, b) = √((a1 − b1)² + (a2 − b2)² + ⋯ + (an − bn)²)

• Centroid: if Cl = {x4, x7, x9}, then

ml = [ (x14 + x17 + x19)/3,  (x24 + x27 + x29)/3,  ⋯,  (xn4 + xn7 + xn9)/3 ]
k-means visually explained

https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
