Subhadip Mukherjee
Acknowledgment: Tom S. F. Haines (for many images and other contents)
5 Oct. 2022
University of Bath, UK
Supervised machine learning
Classification
• Input: Image → Output: Which animal
• Input: Demographics → Output: Preferred candidate
Loss function for classification
• 0-1 loss:
$$L(y, f_\theta(x)) = \begin{cases} 0 & \text{if } y = f_\theta(x), \\ 1 & \text{if } y \neq f_\theta(x). \end{cases}$$
• Sigmoid:
$$\mathrm{sigmoid}(v) = \frac{1}{1 + \exp(-v)}$$
• Softmax:
$$(\mathrm{softmax}(v))_k = \frac{\exp(v_k)}{\sum_{p=1}^{c} \exp(v_p)}, \quad \text{for } k = 1, 2, \cdots, c$$
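These link functions are one-liners in NumPy; a minimal sketch (the max-subtraction in softmax is a standard numerical-stability trick, not part of the definition above):

```python
import numpy as np

def sigmoid(v):
    """Logistic sigmoid: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    """Softmax over a length-c score vector v."""
    e = np.exp(v - np.max(v))  # subtracting max(v) cancels out but avoids overflow
    return e / e.sum()

print(sigmoid(2.0))                        # ~0.881
print(softmax(np.array([2.0, 1.0, 0.1])))  # three probabilities summing to 1
```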
Regression
Unsupervised machine learning – 1
Dimensionality reduction
• Given a set of vectors $(x_i)_{i=1}^n \in \mathbb{R}^d$, find a lower-dimensional representation that captures as much variability in the data as possible.
• Example: Hyperspectral image analysis
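The slides do not commit to a particular method here, but principal component analysis (PCA) is the standard linear approach; a minimal NumPy sketch (the data and dimensions below are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (n x d) onto the top-k principal directions.

    The directions are the leading right singular vectors of the centred
    data matrix; among all k-dimensional linear projections they retain
    the most variance.
    """
    Xc = X - X.mean(axis=0)                          # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # n x k representation

X = np.random.randn(200, 10)   # 200 points in R^10 (synthetic stand-in)
print(pca(X, k=2).shape)       # (200, 2)
```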
Unsupervised machine learning – 2
Dictionary learning for sparse coding: Given data points $(x_i)_{i=1}^n \in \mathbb{R}^d$, learn a basis $B \in \mathbb{R}^{d \times k}$ such that $x_i = B\alpha_i$, where the $\alpha_i$'s are sparse
• Applications: image compression, image recovery,...
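The inner sparse-coding step (recovering $\alpha_i$ for a fixed $B$) is usually posed as $\ell_1$-regularised least squares; one classical solver is ISTA (iterative soft-thresholding), sketched here with illustrative $\lambda$ and iteration count:

```python
import numpy as np

def ista(x, B, lam=0.1, n_iters=500):
    """Solve min_a 0.5*||x - B a||^2 + lam*||a||_1 by ISTA."""
    L = np.linalg.norm(B, ord=2) ** 2   # Lipschitz bound for the gradient step
    a = np.zeros(B.shape[1])
    for _ in range(n_iters):
        g = a - (B.T @ (B @ a - x)) / L                        # gradient step
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-threshold
    return a

rng = np.random.default_rng(0)
B = rng.standard_normal((20, 50))       # overcomplete basis, d=20, k=50
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.0, -2.0, 0.5]  # sparse ground-truth code
x = B @ a_true
a_hat = ista(x, B)
print(np.argsort(np.abs(a_hat))[-3:])   # should contain 3, 17, 41
```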
Density estimation:
• Parametric: Given i.i.d. samples $(x_i)_{i=1}^n \in \mathbb{R}^d$ from $p_\theta$, estimate $\theta$.
• Deterministic: $\theta$ is deterministic but unknown (e.g., maximum-likelihood estimation)
• Bayesian: $\theta$ is a random variable (e.g., maximum a posteriori (MAP) estimation)
• Non-parametric methods
Generative modeling: Given i.i.d. samples $(x_i)_{i=1}^n \in \mathbb{R}^d$ from $p_\theta$, learn to draw more samples from the same distribution
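In the parametric, deterministic-$\theta$ case, maximum-likelihood estimation is sometimes available in closed form; a minimal Gaussian sketch (the true parameters below are illustrative), which also shows the trivial generative step of sampling from the fitted model:

```python
import numpy as np

def gaussian_mle(X):
    """Closed-form ML estimates of a Gaussian's mean and covariance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / len(X)   # ML uses 1/n, not the unbiased 1/(n-1)
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=5000)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat, Sigma_hat, sep="\n")  # close to the true parameters

# "Generative modeling" in this toy case: draw new samples from the fit.
new_samples = rng.multivariate_normal(mu_hat, Sigma_hat, size=10)
```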
Reinforcement learning
• Actions in an environment
• Maximize accumulated reward over time
• Formulated as a Markov decision process
• Examples:
  • Games, e.g. AlphaGo
  • Autonomous robots
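The core loop behind these examples is small; below is a minimal tabular Q-learning sketch on a hypothetical 5-state chain MDP (the environment and all hyperparameters are my own illustrative choices, not from the slides):

```python
import numpy as np

# Toy MDP: a chain of 5 states; action 0 = left, 1 = right.
# Reaching the right end (state 4) gives reward 1 and ends the episode.
n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1), s2 == n_states - 1

Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1   # learning rate, discount, exploration

for _ in range(2000):               # episodes
    s, done = 0, False
    while not done:
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Q-learning: move Q[s, a] toward the bootstrapped return.
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

print(np.argmax(Q[:-1], axis=1))    # learned policy: always "go right"
```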
Further categories
Decision tree classifiers
• The classifier $f_\theta$ is defined by the tree; the split features and values are the parameters to learn (see the sketch below)
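To make this concrete, one possible representation (the names and structure are my own illustration, not from the slides):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One node of a binary decision tree. Internal nodes hold the learned
    parameters (feature index and split value); leaves hold a class label."""
    feature: int = 0
    split: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None   # set only at leaves

def predict(node, x):
    """Route x down the tree until a leaf is reached."""
    if node.label is not None:
        return node.label
    child = node.left if x[node.feature] <= node.split else node.right
    return predict(child, x)

# A hand-built stump: split on feature 0 at 0.0.
stump = Node(feature=0, split=0.0, left=Node(label=0), right=Node(label=1))
print(predict(stump, [-1.3, 2.0]), predict(stump, [0.7, -0.2]))  # 0 1
```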
How to learn a decision tree from data?
We will begin by considering a very simple case, where we have two features ($x_1$ on the horizontal axis and $x_2$ on the vertical axis) and two classes (red and blue).
• Suppose you are allowed to have just one split on one of the features.
• Such single-split trees are known as decision stumps.
Decision stump
• accuracy = 50.8%
• feature = $x_2$, split = 0.3: accuracy = 52.1%
• feature = $x_1$, split = −0.5: accuracy = 83.2%
• feature = $x_1$, split = 0.2: accuracy = 92.4%
• feature = $x_1$, split = 0.0: accuracy = 100%
[Figure: two-class scatter plot with the candidate splits drawn; both axes range from −3 to 3]
Decision stump: Learning algorithm
• Our intuitive approach can be summarized in the following brute-force algorithm (a code sketch follows the figure below):
1. Sweep each dimension and consider every possible split (halfway between each pair of adjacent exemplars when sorted along that axis)
2. Evaluate the loss function for every split
3. Choose the best axis (i.e., feature) and the best split
[Figure: training accuracy (%) of every candidate split, swept along $x_1$ (left panel) and $x_2$ (right panel); accuracy axis from 0 to 100, split position from −3 to 3]
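A direct NumPy translation of this brute-force sweep, using accuracy as the score (the dataset at the end is a synthetic stand-in for the one on the slide):

```python
import numpy as np

def fit_stump(X, y):
    """Exhaustively search (feature, split) pairs; return the best by accuracy.

    Candidate splits sit halfway between consecutive sorted feature values;
    each side of a split predicts its majority class.
    """
    best = (-1.0, 0, 0.0)                        # (accuracy, feature, split)
    for f in range(X.shape[1]):
        vals = np.sort(np.unique(X[:, f]))
        for split in (vals[:-1] + vals[1:]) / 2:
            left = X[:, f] <= split
            pred = np.empty_like(y)
            for side in (left, ~left):
                classes, counts = np.unique(y[side], return_counts=True)
                pred[side] = classes[np.argmax(counts)]   # majority vote
            acc = float(np.mean(pred == y))
            if acc > best[0]:
                best = (acc, f, split)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = (X[:, 0] > 0).astype(int)       # separable on x1 near 0, as on the slide
print(fit_stump(X, y))              # accuracy 1.0, feature 0, split near 0
```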
Real-world data isn't this simple!
[Figure: a more complex two-class dataset; both axes range from −8 to 8]
Decision tree
• Jumping ahead...
[Figure: the same dataset partitioned by a learned decision tree; both axes range from −8 to 8]
Decision trees: a few things to remember
Entropy: measure of randomness
Information gain
• Entropy of a node:
  • Let the node contain $n$ points in total, with $n_i$ being the number of points in class $i$
  • Let $p_i = \frac{n_i}{n}$, $i \in \{1, 2, \cdots, c\}$
  • $$H(\text{node}) = -\sum_{i=1}^{c} p_i \log p_i$$
• Consider some split based on some feature. Treat it as a binary random variable $Z$:
$$Z = \begin{cases} 1 & \text{when the split condition is satisfied (left child node)} \\ 0 & \text{when the split condition is NOT satisfied (right child node)} \end{cases}$$
[Figure: entropy of a binary variable as a function of its probability $p \in [0, 1]$]
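Putting the ingredients together: the information gain of a split is the node's entropy minus the weighted entropies of the two children (this combination is the standard definition; the slide above only sets up the pieces). A minimal sketch:

```python
import numpy as np

def entropy(y):
    """H = -sum_i p_i log p_i over the class proportions in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def information_gain(y, z):
    """H(node) - [P(Z=1) H(left) + P(Z=0) H(right)] for a boolean split z."""
    p_left = z.mean()
    return entropy(y) - p_left * entropy(y[z]) - (1 - p_left) * entropy(y[~z])

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
z = np.array([True, True, True, False, False, False, False, False])
print(entropy(y))              # ln 2 ~ 0.693 for a 50/50 node
print(information_gain(y, z))  # ~0.38: the split reduces entropy
```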
• Also optimized! (often by hand)
Some remarks
• In decision stumps, we used the accuracy directly to determine the best split. Can't we do the same for a decision tree?
• Unfortunately, no. It is found empirically that often no available split produces a significant reduction in misclassification error, and yet after several more splits a substantial error reduction is achieved.
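A classic way to see this is XOR-style data, a hypothetical example of my own: no single axis-aligned split beats chance accuracy, yet a depth-2 tree classifies perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # XOR labelling

# Best single axis-aligned split, scored by accuracy.
best = 0.0
for f in range(2):
    for split in np.linspace(-1, 1, 201):
        left = X[:, f] <= split
        for y_left in (0, 1):                     # class predicted on the left
            acc = np.mean(np.where(left, y_left, 1 - y_left) == y)
            best = max(best, acc)
print(best)   # ~0.5: no stump does better than chance

# Depth-2 tree: split on x1 at 0, then on x2 at 0 within each half.
pred = np.where(X[:, 0] <= 0,
                np.where(X[:, 1] <= 0, 0, 1),
                np.where(X[:, 1] <= 0, 1, 0))
print(np.mean(pred == y))   # 1.0: the second level of splits succeeds
```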