Lecture 1: Introduction
Iasonas Kokkinos
i.kokkinos@cs.ucl.ac.uk
Lecture outline
• Introduction to the course
• Introduction to Machine Learning
• Least squares
Machine Learning: why?
[Diagram: nested fields, with neural networks (NN) inside machine learning (ML), inside artificial intelligence (AI)]
Machine Learning variants
• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction
• Weakly supervised / semi-supervised
– Some data supervised, some unsupervised
• Reinforcement learning
– Supervision: a sparse reward for a sequence of decisions
Classification
• Based on our experience, should we give a loan to this customer?
– Binary decision: yes/no
[Figure: customer data in a 2D feature space, with a decision boundary separating the two classes]
Classification examples
• Digit Recognition
• Spam Detection
• Face detection
Segmentation + Classification in Real Images
[Figure: a window classifier scans the image, labelling each window as Face or Non-face]
Face Detection
• An example of supervised learning: classification (face vs. non-face)
Regression
• Output: continuous
– E.g. the price of a car based on its age, mileage, condition, …
Computer vision example: Human Face/Pose Estimation
• An example of supervised learning: regression (predicting continuous pose parameters)
Clustering
• An example of unsupervised learning: grouping data points without labels
Dimensionality reduction & manifold learning
[Figure: 2D data lying close to a 1D manifold; a point (x1, x2) near the manifold can be summarized by a single coordinate such as (x1 + x2)/2]
Moving along the learned face manifold
• An example of unsupervised learning: dimensionality reduction / manifold learning
Prerequisites: Vectors and Matrices
• Vectors, matrices, linear equations, inner products

Input-output mapping
• Prediction: y = f_w(x)
– f: the method, w: its parameters, x: the input
• Calculus: x ∈ R
• Vector calculus: x ∈ R^D
• Machine learning: can also work with discrete inputs, strings, trees, graphs, …
• Classification: y ∈ {0, 1}
• Regression: y ∈ R
[Figure: training data plotted against feature coordinates]
Method example: neural network
[Figures: a neural network as an example of the mapping f_w]
Course plan: all of the above, week by week
• 1st week: Introduction & linear regression.
• 2nd week: Regularization, Ridge Regression, Cross-Validation
• 3rd week: Logistic Regression
• 4th week: Support Vector Machines
• 5th week: Ensemble Models (Adaboost, Random Forests)
• 6th week: Deep Learning (neural networks, backpropagation, SGD)
• 7th week: Deep Learning and applications
• 8th week: Unsupervised learning (K-means, PCA, Sparse Coding)
• 9th week: Probabilistic modelling (hidden variable models, EM)
• 10th week: Introduction to Reinforcement Learning
https://en.wikipedia.org/wiki/Least_squares
The first clear and concise exposition of the method of least squares was published by Legendre in 1805. The technique is described as an algebraic procedure for fitting linear equations to data, and Legendre demonstrates the new method by analyzing the same data as Laplace for the shape of the Earth. The value of Legendre's method of least squares was immediately recognized by the leading astronomers and geodesists of the time.
• Input-output mapping
– Prediction: y = f_w(x) = f(x; w)
– f: the method, w: its parameters (w ∈ R or w ∈ R^K), x: the input
Assumption: linear function
y = f_w(x) = f(x, w) = w^T x

Inner product: w^T x = ⟨w, x⟩ = Σ_{d=1}^{D} w_d x_d,   with x ∈ R^D, w ∈ R^D
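As a quick aside (not part of the slides), a minimal NumPy check of the inner product with made-up numbers:

```python
import numpy as np

# Inner product <w, x> = sum_d w_d * x_d, computed three equivalent ways.
w = np.array([1.0, 2.0, 3.0])
x = np.array([0.5, -1.0, 2.0])
print(w @ x, np.dot(w, x), np.sum(w * x))   # all three print 4.5
```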
Reminder: linear classifier
x^i positive: x^i · w + b ≥ 0
x^i negative: x^i · w + b < 0
Label: y^t = +1 (positive) or −1 (negative)
[Figure: data points in a 2D feature space (feature coordinates i and j), split by the linear decision boundary]
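A minimal sketch of this decision rule in NumPy; the weights w and bias b below are illustrative, not learned:

```python
import numpy as np

# Linear classification rule: predict +1 if w.x + b >= 0, else -1.
w = np.array([1.0, -2.0])
b = 0.5

def classify(x):
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([3.0, 1.0])))   # 3 - 2 + 0.5 = 1.5 >= 0, so +1
print(classify(np.array([0.0, 2.0])))   # 0 - 4 + 0.5 = -3.5 < 0, so -1
```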
Linear regression in 1D
[Figure: 1D data points around a fitted line; the vertical deviations from the line are the noise]
Sum of squared errors criterion
y^i = w^T x^i + ε^i

Loss function: sum of squared errors
L(w) = Σ_{i=1}^{N} (ε^i)^2
Calculus 101
[Figure: a function f(x) with an extremum at x*]
x* = argmax_x f(x)  ⟹  f′(x*) = 0
x* = argmin_x f(x)  ⟹  f′(x*) = 0
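A one-line worked example (not from the slides): setting the derivative to zero recovers the minimizer.

```latex
f(x) = (x - 2)^2, \qquad f'(x) = 2(x - 2) = 0 \;\Rightarrow\; x^* = 2
```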
" #
@f
f (x) @x1
f (x) = c rf (x) = @f
@x2
2D function graph isocontours gradient field
Recall: y^i = w^T x^i + ε^i, with loss L(w) = Σ_{i=1}^{N} (ε^i)^2. That’s it!
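A short NumPy sketch of this loss; the data matrix, targets, and weight vector below are toy values, not from the lecture:

```python
import numpy as np

# Sum-of-squared-errors loss L(w) = sum_i (y^i - w^T x^i)^2,
# with X holding one training example x^i per row.
def sse_loss(w, X, y):
    residuals = y - X @ w          # epsilon^i = y^i - w^T x^i
    return np.sum(residuals ** 2)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(sse_loss(np.array([1.0, 2.0]), X, y))   # 0.0: this w fits the toy data exactly
```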
Line Fitting (continued)
[Figure: geometric illustration of line fitting and the normal equation]
Linear regression in 1D
Linear regression in 2D (or ND)
Least squares solution for linear regression
D: problem dimension, N: training set size

\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix}
=
\begin{bmatrix} x^1_1 & \dots & x^1_D \\ x^2_1 & \dots & x^2_D \\ \vdots & & \vdots \\ x^N_1 & \dots & x^N_D \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}
+
\begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}

Dimensions: (N×1) = (N×D)(D×1) + (N×1)
Least squares solution for linear regression
y = Xw + ε
Least squares solution for linear regression
Loss function: L(w) = Σ_{i=1}^{N} (y^i − w^T x^i)^2 = Σ_{i=1}^{N} (ε^i)^2

L(w) = \begin{bmatrix} \epsilon^1 & \epsilon^2 & \dots & \epsilon^N \end{bmatrix}
       \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}
Least squares solution for linear regression
In matrix form y = Xw + ε, so the loss is L(w) = ε^T ε = (y − Xw)^T (y − Xw)
Recap: Sum of squared errors criterion
y^i = w^T x^i + ε^i
Loss function: sum of squared errors, L(w) = Σ_{i=1}^{N} (ε^i)^2

Gradient-based optimization
∇f(x) = [∂f/∂x1, ∂f/∂x2]^T
Condition for a minimum: ∇L(w*) = 0
−2 X^T y + 2 X^T X w* = 0
w* = (X^T X)^{−1} X^T y
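A minimal NumPy sketch of this closed-form solution on synthetic data; the dimensions, seed, and true weights are made up for illustration:

```python
import numpy as np

# Closed-form least squares: w* = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))                    # N x D design matrix
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)      # y = X w + noise

w_star = np.linalg.solve(X.T @ X, X.T @ y)     # solve the normal equations
print(w_star)                                  # close to w_true
```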
Generalized linear regression
Map the input through a vector of basis functions:
x → φ(x) = [φ_1(x), …, φ_M(x)]^T
1D Example: 2nd degree polynomial fitting
φ(x) = [1, x, x^2]^T

More generally, for degree K:
φ(x) = [1, x, …, x^K]^T
⟨w, φ(x)⟩ = w_0 + w_1 x + … + w_K x^K
2D example: second-order polynomials
x = (x_1, x_2)
φ(x) = [1, x_1, x_2, (x_1)^2, (x_2)^2, x_1 x_2]^T
⟨w, φ(x)⟩ = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2
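A sketch of this second-order feature map in NumPy; the weight values are illustrative, not learned:

```python
import numpy as np

# Second-order polynomial feature map phi(x) for x = (x1, x2).
def phi(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([1.0, 0.5, -0.5, 0.1, 0.2, 0.3])   # illustrative weights
x = np.array([2.0, 3.0])
print(w @ phi(x))                               # prediction <w, phi(x)>
```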
Reminder: linear regression
Loss function: L(w) = Σ_{i=1}^{N} (y^i − w^T x^i)^2 = Σ_{i=1}^{N} (ε^i)^2

\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix}
=
\begin{bmatrix} x^1_1 & \dots & x^1_D \\ x^2_1 & \dots & x^2_D \\ \vdots & & \vdots \\ x^N_1 & \dots & x^N_D \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}
+
\begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}
Reminder: linear regression
Loss function: L(w) = Σ_{i=1}^{N} (y^i − w^T x^i)^2 = Σ_{i=1}^{N} (ε^i)^2

\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix}
=
\begin{bmatrix} (x^1)^T \\ (x^2)^T \\ \vdots \\ (x^N)^T \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}
+
\begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}
Generalized linear regression
Loss function: L(w) = Σ_{i=1}^{N} (y^i − w^T φ(x^i))^2 = Σ_{i=1}^{N} (ε^i)^2

\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix}
=
\begin{bmatrix} \phi(x^1)^T \\ \phi(x^2)^T \\ \vdots \\ \phi(x^N)^T \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix}
+
\begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}

Dimensions: (N×1) = (N×M)(M×1) + (N×1), with φ: R^D → R^M
Least squares solution for linear regression
y = Xw + ε,   with  X = \begin{bmatrix} (x^1)^T \\ (x^2)^T \\ \vdots \\ (x^N)^T \end{bmatrix}

Minimize L(w) = ε^T ε:
w* = (X^T X)^{−1} X^T y
Least squares solution for generalized linear regression
y = Φw + ε,   with  Φ = \begin{bmatrix} \phi(x^1)^T \\ \phi(x^2)^T \\ \vdots \\ \phi(x^N)^T \end{bmatrix}

Minimize L(w) = ε^T ε:
w* = (Φ^T Φ)^{−1} Φ^T y
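A sketch of the generalized formula in NumPy, fitting a 2nd-degree polynomial in 1D; the data and true coefficients are made up for illustration:

```python
import numpy as np

# Generalized least squares: w* = (Phi^T Phi)^{-1} Phi^T y.
def phi(x):
    return np.array([1.0, x, x**2])            # phi: R -> R^3

rng = np.random.default_rng(1)
xs = rng.uniform(-2.0, 2.0, size=50)
ys = 1.0 - 2.0 * xs + 0.5 * xs**2 + 0.1 * rng.normal(size=50)

Phi = np.stack([phi(x) for x in xs])           # N x M design matrix
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ ys)
print(w_star)                                  # roughly [1, -2, 0.5]
```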
2D example: second-order polynomials (as above): ⟨w, φ(x)⟩ = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2
5D Example: fourth-order polynomials in 5D
x = (x_1, …, x_5)
φ(x) = [1, x_1, …, x_5, …, (x_1 x_2 x_3 x_4 x_5)^4]^T

y^1 = w_0 x_0^1 + w_1 x_1^1 + … + w_D x_D^1
y^2 = w_0 x_0^2 + w_1 x_1^2 + … + w_D x_D^2
⋮
y^N = w_0 x_0^N + w_1 x_1^N + … + w_D x_D^N

If N < D (e.g. 30 points, 15265 dimensions) we have more unknowns than equations: an underdetermined system!
The input-output equations hold exactly, but we are simply memorizing the data, as the sketch below illustrates.
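A toy NumPy illustration of this memorization effect; the dimensions and random data are assumptions made for the example:

```python
import numpy as np

# With fewer equations than unknowns (N < D), y = Xw can be satisfied exactly,
# so the fitted model has zero training error: it memorizes the data.
rng = np.random.default_rng(2)
N, D = 5, 20
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

w, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)   # minimum-norm solution
print(rank)                      # 5: only N independent equations
print(np.allclose(X @ w, y))     # True: the training points are fit exactly
```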
Overfitting, in images
[Figures: underfitting / “just right” / overfitting, for both classification and regression]
Tuning the model’s complexity
A flexible model approximates the target function well on the training set, but it can “overtrain” and perform poorly on the test set (“variance”).
But what if we force the model not to use all of its parameters?
w = [w_1, w_2, …, w_D]
Is w a “large” vector? Measure it with a vector norm:

L2 (“Euclidean”) norm:  ‖w‖_2 = ( Σ_{d=1}^{D} w_d^2 )^{1/2} = sqrt(⟨w, w⟩)
L1 (“Manhattan”) norm:  ‖w‖_1 = Σ_{d=1}^{D} |w_d|
Lp norm, p > 1:         ‖w‖_p = ( Σ_{d=1}^{D} |w_d|^p )^{1/p}
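A quick NumPy check of the three norms above on a toy vector (values are arbitrary):

```python
import numpy as np

w = np.array([3.0, -4.0, 1.0])
l2 = np.sqrt(np.sum(w ** 2))               # same as np.linalg.norm(w, 2)
l1 = np.sum(np.abs(w))                     # same as np.linalg.norm(w, 1)
p = 3
lp = np.sum(np.abs(w) ** p) ** (1.0 / p)   # same as np.linalg.norm(w, p)
print(l2, l1, lp)
```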
Regularized linear regression
L(w) = ε^T ε : linear regression minimizes the model error

Complexity term (regularizer): R(w) = ‖w‖_2^2 = w^T w

New loss: L(w) = ε^T ε + λ w^T w   (“data fidelity” + complexity)
        = y^T y − 2 y^T X w + w^T X^T X w + λ w^T I w   (first three terms as in linear regression; I is the identity matrix)
        = y^T y − 2 y^T X w + w^T (X^T X + λ I) w

Condition for a minimum: ∇L(w*) = 0
−2 X^T y + 2 (X^T X + λ I) w* = 0
w* = (X^T X + λ I)^{−1} X^T y
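A minimal NumPy sketch of this ridge solution; the synthetic data and the value of the hyperparameter lam are arbitrary choices for illustration:

```python
import numpy as np

# Ridge regression: w* = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge(X, y, lam=0.1))
```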
Ridge regression, continued
Regularizer: R(w) = ‖w‖_2^2 = w^T w
New objective: L(w) = ε^T ε + λ w^T w   (“data fidelity” + complexity)
λ: a “hyperparameter”
Note: minimizing the training objective directly w.r.t. λ would drive λ to 0
Bias-Variance tradeoff as a function of λ
[Figure: bias-variance tradeoff as a function of λ, with a “sweet spot” at an intermediate value]
Selecting λ with cross-validation
• Cross validation technique
– Exclude part of the training data
from parameter estimation
– Use the held-out data only to estimate the test error
• K-fold cross validation:
– K splits; average the K validation errors (see the sketch below)
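A K-fold cross-validation sketch for choosing λ; the function name, seed, and candidate λ values are illustrative assumptions:

```python
import numpy as np

# Average validation error of ridge regression over K folds, for a given lambda.
def cv_error(X, y, lam, K=5, seed=0):
    N, D = X.shape
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)
    errors = []
    for k in range(K):
        val = folds[k]                                        # held-out fold
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        # ridge fit on the remaining folds
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(D),
                            X[train].T @ y[train])
        errors.append(np.mean((y[val] - X[val] @ w) ** 2))    # validation error
    return np.mean(errors)

# Pick the lambda with the lowest average validation error, e.g.:
# best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: cv_error(X, y, lam))
```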
Reminder: vector norms ‖w‖_2, ‖w‖_1, ‖w‖_p (defined above)
Vector norm: determines geometry