
Machine Learning 1 (CM50264)

Week 1, Lecture 2: Glossary of ML and decision trees

Subhadip Mukherjee
Acknowledgment: Tom S. F. Haines (for many images and other content)
5 Oct. 2022
University of Bath, UK

Supervised machine learning

• Given features and corresponding targets $(x_i, y_i)_{i=1}^{n} \in \mathcal{X} \times \mathcal{Y}$


• Select a ‘class of functions’ F (e.g., linear)
• Perform empirical risk minimization over $\mathcal{F}$:
$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$$

• f often takes a parametric form and we solve


$$\min_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^{n} L(y_i, f_\theta(x_i))$$

Classification

• Target $y$ is discrete, i.e., $y \in \{1, 2, \cdots, c\}$

• Input: Image
• Output: Which animal

• Input: Demographics
• Output: Preferred candidate

Loss function for classification

• 0-1 loss:
$$L(y, f_\theta(x)) = \begin{cases} 0 & \text{if } y = f_\theta(x) \\ 1 & \text{if } y \neq f_\theta(x). \end{cases}$$

• Not amenable to gradient-based optimization


• Cross-entropy loss:
• One-hot encoding of target labels:
$$y_p^{\text{onehot}} = \begin{cases} 1 & \text{if } y = p \\ 0 & \text{if } y \neq p. \end{cases}$$
• Suppose $f_\theta(x)$ is a probability vector in $\mathbb{R}^c$, i.e., $(f_\theta(x))_p \geq 0$ and $\sum_{p=1}^{c} (f_\theta(x))_p = 1$
• Cross-entropy: $L(y, f_\theta(x)) = -\sum_{p=1}^{c} y_p^{\text{onehot}} \log\,(f_\theta(x))_p$
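As a concrete illustration, here is a minimal NumPy sketch of the cross-entropy loss for a single example. The label and probability values, and the small epsilon inside the log (added for numerical stability), are assumptions for the example rather than part of the formula above; classes are 0-indexed here, whereas the slides use $\{1, \dots, c\}$.

```python
import numpy as np

def cross_entropy(y, probs, eps=1e-12):
    """Cross-entropy between a class label y in {0, ..., c-1} and a
    predicted probability vector probs of length c."""
    c = len(probs)
    y_onehot = np.zeros(c)
    y_onehot[y] = 1.0                       # one-hot encoding of the target
    return -np.sum(y_onehot * np.log(probs + eps))

probs = np.array([0.2, 0.7, 0.1])           # f_theta(x): non-negative, sums to 1
print(cross_entropy(1, probs))              # -log(0.7) ≈ 0.357
```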
Converting model output to a probability vector

• Two-class problem: Let the model output be $v \in \mathbb{R}$. Apply sigmoid on it to convert it to a probability:
$$\text{sigmoid}(v) = \frac{1}{1 + \exp(-v)}$$

• Multi-class problem: Let the model output be $v \in \mathbb{R}^c$. Apply softmax to convert it to a probability:
$$(\text{softmax}(v))_k = \frac{\exp(v_k)}{\sum_{p=1}^{c} \exp(v_p)}, \quad \text{for } k = 1, 2, \cdots, c$$
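A minimal NumPy sketch of both conversions; the max-subtraction inside softmax is a standard numerical-stability trick added here, not part of the formula above.

```python
import numpy as np

def sigmoid(v):
    """Map a scalar score v in R to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    """Map a score vector v in R^c to a probability vector summing to 1."""
    z = np.exp(v - np.max(v))    # subtracting the max does not change the result
    return z / np.sum(z)

print(sigmoid(0.0))                         # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))   # ≈ [0.66, 0.24, 0.10]
```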
Regression

• Target variable $y$ is continuous-valued. For simplicity, let $y \in \mathbb{R}$; this can be generalized easily to $y \in \mathbb{R}^p$, $p > 1$
• Find $\theta$ by solving:
$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \|y_i - f_\theta(x_i)\|_2^2 + \lambda\, R(\theta)$$

• minimize data-fidelity + regularizer (controls model complexity)


• Different data-fidelity losses are used depending on the noise statistics
• Examples: predicting housing prices, image denoising, ...
• We will learn more about an important special case, namely linear regression, in Lecture 4
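As a small, hedged preview of that special case: assume a linear model $f_\theta(x) = x^\top \theta$ and the regularizer $R(\theta) = \|\theta\|_2^2$. Setting the gradient of the objective above to zero gives the closed form below; the synthetic data and the value of $\lambda$ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                    # features, one row per example
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)  # noisy linear targets

lam = 0.1
# Minimiser of (1/n) * ||y - X theta||^2 + lam * ||theta||^2
theta_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print(theta_hat)   # close to theta_true (shrunk slightly towards zero by lam)
```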

Unsupervised machine learning – 1

Clustering: Grouping ‘similar’ points


• Examples: health insurance company
clustering households based on
household size, chronic condition,
average age, etc.; grouping similar
users on social media

Dimensionality reduction
• Given a set of vectors $(x_i)_{i=1}^{n} \in \mathbb{R}^d$, find a lower-dimensional representation that
captures as much variability in the data as possible.
• Example: Hyperspectral image analysis

Unsupervised machine learning – 2

Dictionary learning for sparse coding: Given data points $(x_i)_{i=1}^{n} \in \mathbb{R}^d$, learn a basis $B \in \mathbb{R}^{d \times k}$ such that $x_i = B\alpha_i$, where the $\alpha_i$'s are sparse
• Applications: image compression, image recovery,...
Density estimation:
• Parametric: Given i.i.d. samples $(x_i)_{i=1}^{n} \in \mathbb{R}^d$ from $p_\theta$, estimate $\theta$.
• Deterministic: $\theta$ is deterministic but unknown (e.g., maximum-likelihood estimation)
• Bayesian: $\theta$ is a random variable (e.g., maximum a-posteriori probability estimation)
• Non-parametric methods
Generative modeling: Given i.i.d. samples $(x_i)_{i=1}^{n} \in \mathbb{R}^d$ from $p_\theta$, learn to draw more samples from the same distribution

Reinforcement learning

• Actions in an environment
• Maximize accumulated reward
over time
• Formulated as a Markov
decision process

• Examples:
• Games, e.g. AlphaGo
• Autonomous robots

Further categories

• Type of answer:
  • Point estimate, e.g. “You have cancer”
  • Probabilistic, e.g. “60% chance you have cancer”
• Workflow:
  • Batch learning, i.e. collect data then learn
  • Incremental learning, i.e. learn as data arrives
  • Active learning
• Area:
  • Computer vision
  • Natural language processing (NLP)
  • ···

Decision tree classifiers

• Partitions the input space into cuboid regions

• Prediction: traverse the tree and assign the most frequent class in the leaf node (see the sketch below)
• The classifier $f_\theta$ is defined by the tree; the split values are the parameters to learn
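A minimal sketch of this prediction rule, assuming a hand-built tree of hypothetical `Node` objects; the class layout below is illustrative, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None       # internal node: which feature to test
    split: Optional[float] = None       # internal node: threshold of the test
    left: Optional["Node"] = None       # child where x[feature] <= split
    right: Optional["Node"] = None      # child where x[feature] > split
    prediction: Optional[int] = None    # leaf node: most frequent class there

def predict(node, x):
    """Traverse the tree until a leaf is reached, then return its class."""
    while node.prediction is None:
        node = node.left if x[node.feature] <= node.split else node.right
    return node.prediction

# Tiny hand-built tree: a single split on x[0] at 0.0
tree = Node(feature=0, split=0.0,
            left=Node(prediction=0), right=Node(prediction=1))
print(predict(tree, [0.7, -1.2]))   # 1, since x[0] > 0.0
```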

How to learn a decision tree from data?

• Question: How to learn such a classifier and its parameters?

We will begin by considering a very simple case, where we have two features (x1 on
the horizontal axis and x2 on the vertical axis) and two classes (red and blue).

• Suppose you are allowed to have just one split on one of the features.
• Such a classifier is known as a decision stump.

Decision stump

• feature = x2, split = 0.5: accuracy = 50.8%
• feature = x2, split = 0.3: accuracy = 52.1%
• feature = x1, split = −0.5: accuracy = 83.2%
• feature = x1, split = 0.2: accuracy = 92.4%
• feature = x1, split = 0.0: accuracy = 100%

[Figure: scatter plot of the two classes (x1 horizontal, x2 vertical) with each candidate split drawn as an axis-aligned line]
Decision stump: Learning algorithm

• Our intuitive approach can be summarized in the following brute-force algorithm (see the sketch after the figure below):
1. Sweep each dimension and consider every possible split
(halfway between each pair of exemplars when sorted along that axis)
2. Evaluate the loss function for every split
3. Choose the best axis (i.e., feature) and the best split

[Figure: score of every candidate split plotted against the split value, for x1 (left) and x2 (right)]
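A minimal sketch of the brute-force sweep described above, assuming a feature matrix `X` of shape (n, d) and non-negative integer labels `y`; the names are illustrative only, and training accuracy is used as the score (maximizing accuracy is equivalent to minimizing the 0-1 loss in step 2).

```python
import numpy as np

def fit_stump(X, y):
    """Try every axis and every mid-point split; return the
    (feature, split, accuracy) with the highest training accuracy."""
    n, d = X.shape
    best = (None, None, -1.0)
    for j in range(d):
        values = np.sort(np.unique(X[:, j]))
        splits = (values[:-1] + values[1:]) / 2       # halfway between neighbours
        for s in splits:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # each side predicts its own most frequent class
            correct = (left == np.bincount(left).argmax()).sum() \
                    + (right == np.bincount(right).argmax()).sum()
            accuracy = correct / n
            if accuracy > best[2]:
                best = (j, s, accuracy)
    return best

# Toy data: class 1 iff the first feature is positive
X = np.array([[-1.0, 0.2], [-0.5, -1.0], [0.3, 0.8], [1.2, -0.4]])
y = np.array([0, 0, 1, 1])
print(fit_stump(X, y))   # splits on feature 0, accuracy 1.0
```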

Real-world data isn’t as simplistic!

• What about this?
• Axis-aligned separation is rare in real data!
• You can still learn a decision stump, but you will end up learning a poor classifier.

[Figure: scatter plot of two classes that cannot be separated well by any single axis-aligned split]

Decision tree

• What if we learn multiple splits recursively?
• Background color – shaded with ratio of red/blue points
• First split – as before
• Can be represented as a tree
• Have split left half of first split again
• Jumping ahead...

[Figure: scatter plot of the two classes, with the background shaded by the red/blue ratio of each region as successive splits are added]
Decision trees: a few things to remember

• You have two kinds of nodes:


• internal, which contains a split (rounded rectangles)
• leaf, which shows the corresponding class

• When do you stop growing the tree?


• If the number of data points associated with a
node is smaller than a predetermined threshold n0
• If you keep going until you have just one point per
leaf node, you’ll likely overfit

• So, you begin with the root node (containing all data points). How do you start splitting?

Entropy: measure of randomness

• Let $X$ be a random variable taking values in a finite alphabet $\mathcal{A} = \{a_1, a_2, \cdots, a_m\}$
• $H(X) = -\sum_{i=1}^{m} p_i \log p_i$, where $p_i = \mathbb{P}(X = a_i)$
• = bits required on average to encode $X$
(bits assumes $\log_2$; if $\log_e$ the unit is nats)
• Maximum when $p_i = \frac{1}{m}$ for all $i$; minimum when there exists some $j$ such that
$$p_i = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$
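A small sketch of the entropy of a discrete distribution, using base-2 logarithms so the answer is in bits; the example probability vectors are the maximum- and minimum-entropy cases mentioned above.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p (assumed to sum to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                   # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (uniform: maximum for m = 4)
print(entropy([1.0, 0.0, 0.0, 0.0]))       # -0.0, i.e. 0 bits (certain outcome: minimum)
```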

Information gain

Let’s try to understand it using the following simple example:


• Let X denote the outcome of two independent coin tosses (heads denoted by 1,
and tail denoted by 0)
• $X = (X_1, X_2) \in \{(0,0),\, (0,1),\, (1,0),\, (1,1)\}$, all equally likely, so $H(X) = 2$
• Let $Y = X_1 + X_2$
• $H(X \mid Y = 0) = 0$, $H(X \mid Y = 1) = 1$, $H(X \mid Y = 2) = 0$
• Conditional entropy:
$$H(X \mid Y) = \sum_y H(X \mid Y = y)\,\mathbb{P}(Y = y) = 0 \cdot \tfrac{1}{4} + 1 \cdot \tfrac{1}{2} + 0 \cdot \tfrac{1}{4} = \tfrac{1}{2}$$
• Information gain = reduction in entropy = $H(X) - H(X \mid Y) = 2 - \tfrac{1}{2} = \tfrac{3}{2}$
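The same numbers can be checked with a few lines of NumPy; the entropy helper mirrors the definition from the previous slide, with base-2 logarithms.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X = entropy([0.25, 0.25, 0.25, 0.25])          # H(X) = 2 bits
# P(Y=0) = 1/4, P(Y=1) = 1/2, P(Y=2) = 1/4; H(X|Y=y) = 0, 1, 0 respectively
H_X_given_Y = 0.25 * 0.0 + 0.5 * entropy([0.5, 0.5]) + 0.25 * 0.0
print(H_X - H_X_given_Y)                          # information gain = 1.5 bits
```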
Information gain for a split

• Entropy of a node:
• Let the node contain $n$ points in total, with $n_i$ being the number of points in class $i$
• Let $p_i = \frac{n_i}{n}$, $i \in \{1, 2, \cdots, c\}$
• $H(\text{node}) = -\sum_{i=1}^{c} p_i \log p_i$
• Consider some split based on some feature. Treat it as a binary random variable $Z$:
$$Z = \begin{cases} 1 & \text{when the split condition is satisfied (left child node)} \\ 0 & \text{when the split condition is NOT satisfied (right child node)} \end{cases}$$
• $H(\text{node} \mid \text{split}) = H(\text{node} \mid Z = 1)\,\mathbb{P}(Z = 1) + H(\text{node} \mid Z = 0)\,\mathbb{P}(Z = 0)$
• $H(\text{node} \mid \text{split}) = \frac{n_{\text{left}}}{n} H(\text{left child}) + \frac{n_{\text{right}}}{n} H(\text{right child})$
• $\text{info gain}(\text{split}) = H(\text{node}) - H(\text{node} \mid \text{split})$
• Recursively split the nodes; pick the split that leads to the maximum information gain (a sketch follows below)
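A minimal sketch of the information gain of a single candidate split, given the class labels that reach the node and each child; the function names and toy labels are illustrative.

```python
import numpy as np

def node_entropy(labels):
    """Entropy (in bits) of the empirical class distribution at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, left, right):
    """H(node) - H(node | split) for one binary split."""
    n = len(parent)
    h_split = (len(left) / n) * node_entropy(left) \
            + (len(right) / n) * node_entropy(right)
    return node_entropy(parent) - h_split

y = np.array([0, 0, 0, 1, 1, 1])
print(info_gain(y, y[:3], y[3:]))                              # 1.0 bit: perfect split
print(info_gain(y, np.array([0, 1]), np.array([0, 0, 1, 1])))  # 0.0 bits: uninformative split
```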
Entropy vs. Gini impurity
• Gini impurity: $G(p) = \sum_i p_i (1 - p_i) = 1 - \sum_i p_i^2$
• Probability that if you select two items from a data set at random (with
replacement) they will have different class labels
[Figure: Gini impurity (green) and entropy (red) as functions of the class probability on [0, 1]]
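A small sketch of the Gini impurity, together with a simulation of the "two random draws with replacement disagree" interpretation; the label array is made up for the example.

```python
import numpy as np

def gini(labels):
    """Gini impurity of the empirical class distribution of `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

labels = np.array([0, 0, 1, 1, 1, 2])
print(gini(labels))                           # 1 - (2/6)^2 - (3/6)^2 - (1/6)^2 ≈ 0.611

# Interpretation check: draw two items with replacement, many times
rng = np.random.default_rng(0)
draws = rng.choice(labels, size=(100_000, 2))
print(np.mean(draws[:, 0] != draws[:, 1]))    # ≈ 0.611 as well
```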


Early stopping to avoid overfitting

• Can avoid overfitting by stopping early


• Limit on tree depth
• Minimum leaf node size

• Prevents function getting too complicated!

• These extra parameters are called hyperparameters
(the choice of Gini impurity vs. information gain is also one)

• Also optimized!
(often by hand)

Some remarks

• In decision stumps, we used the accuracy directly to determine the best split.
Can’t we do the same for a decision tree?
• Unfortunately, no. It is found empirically that often no available split produces a
significant reduction in mis-classification error, and yet after several more splits a
substantial error reduction is achieved.

• Extension to regression: entropy can no longer be used to measure the homogeneity of a node; use the variance instead
• $\text{var}(\text{node}) = \frac{1}{n} \sum_{i \in \text{node}} (y_i - \hat{y})^2$, where $\hat{y} = \frac{1}{n} \sum_{i \in \text{node}} y_i$ (a small sketch follows this list)
• Choose the split that leads to maximum variance reduction
• For any leaf node, the predicted output is the mean of the target outputs of all
points in that leaf
• Referred to as classification and regression trees (CART) [Breiman et al., 1984]
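A minimal sketch of the variance-reduction criterion for one candidate split in a regression tree; weighting each child's variance by its share of points mirrors the conditional-entropy weighting used for classification, and the toy targets are made up.

```python
import numpy as np

def node_variance(y):
    """Mean squared deviation of the targets in a node from their mean
    (that mean is also the prediction the node would make as a leaf)."""
    return np.mean((y - np.mean(y)) ** 2)

def variance_reduction(y_parent, y_left, y_right):
    """var(node) minus the size-weighted variances of the two children."""
    n = len(y_parent)
    weighted = (len(y_left) / n) * node_variance(y_left) \
             + (len(y_right) / n) * node_variance(y_right)
    return node_variance(y_parent) - weighted

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
print(variance_reduction(y, y[:3], y[3:]))   # large: the split separates the two clusters
```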

