
1

Introduction to Machine Learning

Lecture 1: Introduction
Iasonas Kokkinos
i.kokkinos@cs.ucl.ac.uk

University College London


12

Lecture outline
Introduction to the course
Introduction to Machine Learning

Least squares
13
Machine Learning

Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to learn (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.

‘ML’ coined by Arthur Samuel, 1959.

(pipeline: data → model building → prediction)


14
Machine Learning

Goal: Machines that perform a task based on experience, instead of explicitly coded instructions

Why?

• Crucial component of every intelligent/autonomous system

• Important for a system’s adaptability

• Important for a system’s generalization capabilities

• Attempt to understand human learning


15
Rise of Machine Learning

(figure: growing interest over time in artificial intelligence (AI), machine learning (ML), and neural networks (NN))
16
Machine Learning variants

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction
• Weakly supervised/semi-supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: sparse reward for a sequence of decisions
17
Classification
• Based on our experience, should we give a loan to this customer?
– Binary decision: yes/no

(figure: customer data with a decision boundary)
18
Classification examples

• Digit Recognition

• Spam Detection

• Face detection
19
Segmentation + Classification in Real
Images

Evaluation measures: Confusion matrix, ROC curve, precision, recall, etc.


20

`Faceness function’: classifier

(figure: decision boundary separating Face from Background)
21

Test time: deploy the learned function


• Scan window over image
– Multiple scales
– Multiple orientations
• Classify window as either:
– Face
– Non-face

(diagram: window → classifier → Face / Non-face)
22
Face Detection

CMU Science Lab


23
Machine Learning variants

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction
• Weakly supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: reward for a sequence of decisions
24
Regression

• Output: Continuous
– E.g. price of a car based on years, mileage, condition,…
25
Computer vision example:
Human Face/Pose Estimation

[Blanz and Vetter, Siggraph, 1999]


26
Machine Learning variants

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction
• Weakly supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: reward for a sequence of decisions
27
Clustering

• Break a set of data into coherent groups


– Labels are `invented’
28
Clustering examples
• Spotify recommendations
29
Clustering examples
• Image segmentation
30
Machine Learning variants

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction
• Weakly supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: reward for a sequence of decisions
31
Dimensionality reduction & manifold learning

• Find a low-dimensional representation of high-dimensional data


– Continuous outputs are `invented’
32
Dimensionality Reduction examples
Isomap

[Tenenbaum et al., Science, 2000]


Face Manifold

[Averkiou et al., Eurographics, 2014]


33
Example of nonlinear manifold: faces

Average of two faces is not a face

$\tfrac{1}{2}(x_1 + x_2)$: the average of two faces $x_1$, $x_2$
34
Moving along the learned face manifold

Trajectory along the “male” dimension

Trajectory along the “young” dimension

Lample et al., Fader Networks, NIPS 2017


42
Tentative Course Schedule
• 1st week: Introduction & linear regression.
• 2nd week: Regularization, Ridge Regression, Cross-Validation
• 3rd week: Logistic Regression
• 4th week: Support Vector Machines
• 5th week: Ensemble Models (Adaboost, Random Forests)
• 6th week: Deep Learning (neural networks, backpropagation, SGD)
• 7th week: Deep Learning and applications
• 8th week: Unsupervised learning (K-means, PCA, Sparse Coding)
• 9th week: Probabilistic modelling (hidden variable models, EM)
• 10th week: Introduction to Structured Prediction, RNNs & NLP

No rush - stop me whenever something is not clear!


43
Focus of first part: supervised learning

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction, Manifold Learning
• Weakly supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: reward for a sequence of decisions
44
Prerequisites: Vectors and Matrices

(notation shown on the slide: vector, matrix, linear equation, inner product)
Self check: do you know what these stand for?


• linear independence; rank of a matrix
• span of a matrix
45
Prerequisites: Differential Calculus
46
Classification: yes/no decision
47
Regression: continuous output
48

What we want to learn: a function

• Input-output mapping

$y = f_w(x)$
49

What we want to learn: a function

• Input-output mapping

$y = f_w(x)$
($f$: method, $y$: prediction, $w$: parameters, $x$: input)
50

What we want to learn: a function

$y = f_w(x)$
($f$: method, $y$: prediction, $w$: parameters, $x$: input)

Calculus: $x \in \mathbb{R}$
Vector calculus: $x \in \mathbb{R}^D$
Machine learning: can also work for discrete inputs, strings, trees, graphs, …
51

What we want to learn: a function

$y = f_w(x)$
($f$: method, $y$: prediction, $w$: parameters, $x$: input)

Classification: $y \in \{0, 1\}$
Regression: $y \in \mathbb{R}$
52

What we want to learn: a function

$y = f_w(x)$
($f$: method, $y$: prediction)

Linear classifiers, neural networks, decision trees, ensemble models, probabilistic classifiers, …
53
Example of method: K-nearest neighbor classifier

(figure panels: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor; X marks the query point)

– Compute distance to other training records
– Identify K nearest neighbors
– Take majority vote
54
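A minimal NumPy sketch of this procedure (illustrative only; the slides give no code, and the function and variable names here are my own):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training record
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example: two classes in 2D
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```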
Training data for NN classifier (in $\mathbb{R}^2$)
55
1-nn classifier prediction (in $\mathbb{R}^2$)
56
3-nn classifier prediction
60
Method example: linear classifier
(figure: data in feature coordinates $i$ and $j$, separated by a linear classifier)
61
Method example: neural network
62
Method example: neural network
63
Method example: neural network
64
Lectures 1-7: all of that
• 1st week: Introduction & linear regression.
• 2nd week: Regularization, Ridge Regression, Cross-Validation
• 3rd week: Logistic Regression
• 4th week: Support Vector Machines
• 5th week: Ensemble Models (Adaboost, Random Forests)
• 6th week: Deep Learning (neural networks, backpropagation, SGD)
• 7th week: Deep Learning and applications
• 8th week: Unsupervised learning (K-means, PCA, Sparse Coding)
• 9th week: Probabilistic modelling (hidden variable models, EM)
• 10th week: Introduction to Reinforcement Learning

No rush - stop me whenever something is not clear!


65
We have two centuries of material to cover!

https://en.wikipedia.org/wiki/Least_squares

The first clear and concise exposition of the method of least squares was
published by Legendre in 1805.
The technique is described as an algebraic procedure for fitting linear
equations to data and Legendre demonstrates the new method by
analyzing the same data as Laplace for the shape of the earth.
The value of Legendre's method of least squares was immediately recognized by leading astronomers and geodesists of the time.
67

What we want to learn: a function

• Input-output mapping

$y = f_w(x) = f(x; w)$
($f$: method, $y$: prediction, $w$: parameters, $x$: input)

$w \in \mathbb{R}$ or $w \in \mathbb{R}^K$
68
Assumption: linear function
$y = f_w(x) = f(x, w) = w^\top x$

Inner product: $w^\top x = \langle w, x \rangle = \sum_{d=1}^{D} w_d x_d$, with $x \in \mathbb{R}^D$, $w \in \mathbb{R}^D$
69
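As a quick sanity check (my own illustration, not from the slides), the inner product can be computed in several equivalent ways:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # parameters, w in R^3
x = np.array([1.0, 3.0, 0.5])    # input, x in R^3

# Three equivalent ways to compute the inner product w^T x
print(w @ x)                                      # matrix-style product
print(np.dot(w, x))                               # explicit dot product
print(sum(w_d * x_d for w_d, x_d in zip(w, x)))   # the sum over d = 1..D
```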
Reminder: linear classifier

$x_i$ positive: $x_i \cdot w + b \ge 0$
$x_i$ negative: $x_i \cdot w + b < 0$

Each data point has a class label: $y_t \in \{+1, -1\}$

(figure: labelled points in feature coordinates $i$ and $j$, separated by a line)
70

Question: which one?

$x_i$ positive: $x_i \cdot w + b \ge 0$
$x_i$ negative: $x_i \cdot w + b < 0$

Each data point has a class label: $y_t \in \{+1, -1\}$

(figure: the same labelled points in feature coordinates $i$ and $j$)
71
Linear regression in 1D
72
Linear regression in 1D

Training set: input–output pairs $S = \{(x^i, y^i)\},\; i = 1, \dots, N$, with $x^i \in \mathbb{R}$, $y^i \in \mathbb{R}$
73
Linear regression in 1D

(figure: deviations of the data from the fitted line are the noise)
74
Sum of squared errors criterion

$y^i = w^\top x^i + \epsilon^i$

Loss function: sum of squared errors
$L(w) = \sum_{i=1}^{N} (\epsilon^i)^2$

Expressed as a function of two variables:
$L(w_0, w_1) = \sum_{i=1}^{N} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$

Question: what is the best (or least bad) value of $w$?
Answer: least squares
75
Prerequisites: Differential Calculus
76

Calculus 101

(figure: graph of $f(x)$ with its maximum at $x^*$)

$x^* = \operatorname{argmax}_x f(x)$

Condition for maximum: derivative is zero
$x^* = \operatorname{argmax}_x f(x) \;\Rightarrow\; f'(x^*) = 0$
80

Condition for minimum: derivative is zero

$x^* = \operatorname{argmin}_x f(x) \;\Rightarrow\; f'(x^*) = 0$
81

Vector calculus 101

$\nabla f(x) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix}$

(figure panels: 2D function graph, isocontours $f(x) = c$, gradient field)

At the minimum of the function: $\nabla f(x) = 0$
82
Back to least squares…

$y^i = w^\top x^i + \epsilon^i$

Loss function: sum of squared errors
$L(w) = \sum_{i=1}^{N} (\epsilon^i)^2$

Expressed as a function of two variables:
$L(w_0, w_1) = \sum_{i=1}^{N} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$
(superscript $i$: training sample; subscripts $0, 1$: feature dimension)

Question: what is the best (or least bad) value of $w$?
Answer: least squares
83
Line Fitting
84
Fitting a line

$L(w_0, w_1) = \sum_{i=1}^{N} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$

$\frac{\partial L(w_0, w_1)}{\partial w_0} = \sum_{i=1}^{N} \frac{\partial}{\partial w_0} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$
$= \sum_{i=1}^{N} 2 \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right] (-x_0^i)$
$= -2 \sum_{i=1}^{N} \left( y^i x_0^i - w_0 x_0^i x_0^i - w_1 x_1^i x_0^i \right)$

$\frac{\partial L(w_0, w_1)}{\partial w_0} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_0^i = w_0 \sum_{i=1}^{N} x_0^i x_0^i + w_1 \sum_{i=1}^{N} x_1^i x_0^i$
85
Fitting a line, continued

$\frac{\partial L(w_0, w_1)}{\partial w_0} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_0^i = w_0 \sum_{i=1}^{N} x_0^i x_0^i + w_1 \sum_{i=1}^{N} x_1^i x_0^i$

$\frac{\partial L(w_0, w_1)}{\partial w_1} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_1^i = w_0 \sum_{i=1}^{N} x_0^i x_1^i + w_1 \sum_{i=1}^{N} x_1^i x_1^i$

2 linear equations, 2 unknowns
86
Fitting a line, continued

$\sum_{i=1}^{N} y^i x_0^i = w_0 \sum_{i=1}^{N} x_0^i x_0^i + w_1 \sum_{i=1}^{N} x_1^i x_0^i$
$\sum_{i=1}^{N} y^i x_1^i = w_0 \sum_{i=1}^{N} x_0^i x_1^i + w_1 \sum_{i=1}^{N} x_1^i x_1^i$

2x2 system of equations:
$\begin{bmatrix} \sum_{i=1}^{N} y^i x_0^i \\ \sum_{i=1}^{N} y^i x_1^i \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{N} x_0^i x_0^i & \sum_{i=1}^{N} x_0^i x_1^i \\ \sum_{i=1}^{N} x_0^i x_1^i & \sum_{i=1}^{N} x_1^i x_1^i \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$

That's it!
87
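For concreteness, the 2x2 normal equations above could be assembled and solved as follows (a sketch under the slide's convention that $x_0^i = 1$ carries the intercept; the toy data and names are my own):

```python
import numpy as np

# Toy 1D data: y is roughly 2 + 3*x plus noise
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, size=30)
y = 2 + 3 * x1 + 0.1 * rng.standard_normal(30)
x0 = np.ones_like(x1)            # constant feature x_0^i = 1 (intercept)

# Right-hand side and 2x2 matrix of the normal equations
b = np.array([np.sum(y * x0), np.sum(y * x1)])
A = np.array([[np.sum(x0 * x0), np.sum(x1 * x0)],
              [np.sum(x0 * x1), np.sum(x1 * x1)]])

w0, w1 = np.linalg.solve(A, b)   # solve for the two unknowns
print(w0, w1)                    # approximately 2 and 3
```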
Line Fitting (continued)

(figure: the fitted line, obtained by solving the normal equations)
88
Linear regression in 1D
89
Linear regression in 2D (or ND)
90
Least squares solution for linear regression

D: problem dimension; N: training set size

$\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix} = \begin{bmatrix} x_1^1 & \dots & x_D^1 \\ x_1^2 & \dots & x_D^2 \\ & \vdots & \\ x_1^N & \dots & x_D^N \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} + \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}$
(sizes: $N \times 1 = (N \times D)(D \times 1) + N \times 1$)
91
Least squares solution for linear regression

$y = Xw + \epsilon$
92
Least squares solution for linear regression

Loss function: $L(w) = \sum_{i=1}^{N} (y^i - w^\top x^i)^2 = \sum_{i=1}^{N} (\epsilon^i)^2$

$L(w) = \begin{bmatrix} \epsilon^1 & \epsilon^2 & \dots & \epsilon^N \end{bmatrix} \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix} = \epsilon^\top \epsilon, \qquad y = Xw + \epsilon$
96
Recap: Sum of squared errors criterion

$y^i = w^\top x^i + \epsilon^i$

Loss function: sum of squared errors
$L(w) = \sum_{i=1}^{N} (\epsilon^i)^2$

Expressed as a function of two variables:
$L(w_0, w_1) = \sum_{i=1}^{N} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$

Question: what is the best (or least bad) value of $w$?
Answer: least squares
97

Gradient-based optimization

$\nabla f(x) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix}$

(figure panels: 2D function graph, isocontours $f(x) = c$, gradient field)

At the minimum of the function: $\nabla f(x) = 0$
98
Recap: condition for optimum

$\frac{\partial L(w_0, w_1)}{\partial w_0} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_0^i = w_0 \sum_{i=1}^{N} x_0^i x_0^i + w_1 \sum_{i=1}^{N} x_1^i x_0^i$

$\frac{\partial L(w_0, w_1)}{\partial w_1} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_1^i = w_0 \sum_{i=1}^{N} x_0^i x_1^i + w_1 \sum_{i=1}^{N} x_1^i x_1^i$

2 linear equations, 2 unknowns
99
Least squares solution, in vector form

$L(w) = \epsilon^\top \epsilon = (y - Xw)^\top (y - Xw) = y^\top y - 2 y^\top X w + w^\top X^\top X w$

Condition for minimum:
$\nabla L(w^*) = 0$
$-2 X^\top y + 2 X^\top X w^* = 0$
$w^* = (X^\top X)^{-1} X^\top y$
100
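A minimal NumPy sketch of this closed-form solution (my own illustration; in practice one solves the linear system rather than forming the inverse of $X^\top X$ explicitly):

```python
import numpy as np

def least_squares(X, y):
    """Closed-form least squares: w* = (X^T X)^{-1} X^T y,
    computed by solving the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: N = 50 points in D = 2 (a column of ones acts as the intercept)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
true_w = np.array([1.0, -2.0])
y = X @ true_w + 0.05 * rng.standard_normal(50)

print(least_squares(X, y))  # close to [1.0, -2.0]
```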
Generalized linear regression

$x \rightarrow \phi(x) = \begin{bmatrix} \phi_1(x) \\ \vdots \\ \phi_M(x) \end{bmatrix}$
101
1D Example: 2nd degree polynomial fitting

$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}$

$\langle w, \phi(x) \rangle = w_0 + w_1 x + w_2 x^2$
102
1D Example: k-th degree polynomial fitting

$\phi(x) = \begin{bmatrix} 1 \\ x \\ \vdots \\ x^K \end{bmatrix}$

$\langle w, \phi(x) \rangle = w_0 + w_1 x + \dots + w_K x^K$
103
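A small sketch of this 1D feature map in NumPy (the function name `poly_features` is my own, not from the slides):

```python
import numpy as np

def poly_features(x, K):
    """Map a 1D input array of shape (N,) to rows [1, x, x^2, ..., x^K]."""
    return np.column_stack([x ** k for k in range(K + 1)])

x = np.array([0.0, 0.5, 1.0, 2.0])
print(poly_features(x, K=3))   # each row is phi(x^i) for a cubic fit
```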
2D example: second-order polynomials

$x = (x_1, x_2)$

$\phi(x) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}$

$\langle w, \phi(x) \rangle = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2$
104
Reminder: linear regression

Loss function: $L(w) = \sum_{i=1}^{N} (y^i - w^\top x^i)^2 = \sum_{i=1}^{N} (\epsilon^i)^2$

$\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix} = \begin{bmatrix} x_1^1 & \dots & x_D^1 \\ x_1^2 & \dots & x_D^2 \\ & \vdots & \\ x_1^N & \dots & x_D^N \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} + \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}$
105
Reminder: linear regression

Loss function: $L(w) = \sum_{i=1}^{N} (y^i - w^\top x^i)^2 = \sum_{i=1}^{N} (\epsilon^i)^2$

$\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix} = \begin{bmatrix} (x^1)^\top \\ (x^2)^\top \\ \vdots \\ (x^N)^\top \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} + \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}$
106
Generalized linear regression

Loss function: $L(w) = \sum_{i=1}^{N} (y^i - w^\top \phi(x^i))^2 = \sum_{i=1}^{N} (\epsilon^i)^2$

$\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix} = \begin{bmatrix} \phi(x^1)^\top \\ \phi(x^2)^\top \\ \vdots \\ \phi(x^N)^\top \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix} + \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}$
(sizes: $N \times 1 = (N \times M)(M \times 1) + N \times 1$)

$\phi : \mathbb{R}^D \rightarrow \mathbb{R}^M$
107
Least squares solution for linear regression

$y = Xw + \epsilon, \qquad X = \begin{bmatrix} (x^1)^\top \\ (x^2)^\top \\ \vdots \\ (x^N)^\top \end{bmatrix}$

Minimize: $L(w) = \epsilon^\top \epsilon$

$w^* = (X^\top X)^{-1} X^\top y$
108
Least squares solution for generalized linear regression

$y = \Phi w + \epsilon, \qquad \Phi = \begin{bmatrix} \phi(x^1)^\top \\ \phi(x^2)^\top \\ \vdots \\ \phi(x^N)^\top \end{bmatrix}$

Minimize: $L(w) = \epsilon^\top \epsilon$

$w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$
109
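Putting the feature map and the closed-form solution together, a possible NumPy sketch of generalized least squares (the function name, toy data, and degree are my own choices for illustration):

```python
import numpy as np

def fit_generalized(Phi, y):
    """w* = (Phi^T Phi)^{-1} Phi^T y, computed by solving the normal equations."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Fit a cubic to noisy samples of a nonlinear function
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(40)

Phi = np.column_stack([x ** k for k in range(4)])   # phi(x) = [1, x, x^2, x^3]
w = fit_generalized(Phi, y)

x_test = np.linspace(-1, 1, 5)
Phi_test = np.column_stack([x_test ** k for k in range(4)])
print(Phi_test @ w)   # predictions <w, phi(x)> on new inputs
```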
2D example: second-order polynomials

$x = (x_1, x_2)$

$\phi(x) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}$

$\langle w, \phi(x) \rangle = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2$
110
5D Example: fourth-order polynomials in 5D

$x = (x_1, \dots, x_5)$

$\phi(x) = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_5 \\ \vdots \\ (x_1 x_2 x_3 x_4 x_5)^4 \end{bmatrix}$

15625 dimensions ⇒ 15625 parameters
111
What was happening before: approximations

Training: $S = \{(x^i, y^i)\},\; i = 1, \dots, N$

$y^1 \simeq w_0 x_0^1 + w_1 x_1^1 + \dots + w_D x_D^1$
$y^2 \simeq w_0 x_0^2 + w_1 x_1^2 + \dots + w_D x_D^2$
$\vdots$
$y^N \simeq w_0 x_0^N + w_1 x_1^N + \dots + w_D x_D^N$

If N > D (e.g. 30 points, 2 dimensions) we have more equations than unknowns: overdetermined system!
Input-output relations can only hold approximately!
112
What is happening now: overfitting

Training: $S = \{(x^i, y^i)\},\; i = 1, \dots, N$

$y^1 = w_0 x_0^1 + w_1 x_1^1 + \dots + w_D x_D^1$
$y^2 = w_0 x_0^2 + w_1 x_1^2 + \dots + w_D x_D^2$
$\vdots$
$y^N = w_0 x_0^N + w_1 x_1^N + \dots + w_D x_D^N$

If N < D (e.g. 30 points, 15625 dimensions) we have more unknowns than equations: underdetermined system!
Input-output equations hold exactly, but we are simply memorizing the data
113
Overfitting, in images
(figure: classification and regression fits ranging from too simple to “just right” to overfitted)
114
Tuning the model’s complexity
A flexible model approximates the target function well on the training set, but can “overtrain” and perform poorly on the test set (“variance”).

A rigid model's performance on the test set is more predictable, but the model may not be good even on the training set (“bias”).
115
Regularization: keeping it simple

In high dimensions: too many solutions for the same problem

Regularization: prefer the least complex among them

How? Penalize complexity


116
How to control complexity?

Observation: problem started with high-dimensional embeddings

Guess: Number of dimensions relates to “complexity”

(Week 4: we will guess again!)

Intuition: with many parameters, we can fit anything

But what if we force the classifier not to use all of the parameters?

Idea: penalize the use of large parameter values


How do we measure “large”?
How do we enforce small values?
117
How do we measure “large”?

Method parameters: D-dimensional vector
$w = [w_1, w_2, \dots, w_D]$

“Large” vector: vector norm

L2 (“Euclidean”) norm: $\|w\|_2 \doteq \sqrt{\sum_{d=1}^{D} w_d^2} = \sqrt{\langle w, w \rangle}$

L1 (“Manhattan”) norm: $\|w\|_1 \doteq \sum_{d=1}^{D} |w_d|$

Lp norm, p > 1: $\|w\|_p \doteq \left( \sum_{d=1}^{D} |w_d|^p \right)^{1/p}$
118
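These definitions are easy to check numerically (my own illustration, not from the slides):

```python
import numpy as np

w = np.array([3.0, -4.0, 1.0])

l2 = np.sqrt(np.sum(w ** 2))            # Euclidean norm
l1 = np.sum(np.abs(w))                  # Manhattan norm
lp = np.sum(np.abs(w) ** 3) ** (1 / 3)  # Lp norm with p = 3

# Same values via NumPy's built-in norm
print(l2, np.linalg.norm(w, 2))
print(l1, np.linalg.norm(w, 1))
print(lp, np.linalg.norm(w, 3))
```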
Regularized linear regression

$\epsilon = y - Xw$   (residual vector)

$L(w) = \epsilon^\top \epsilon$   (linear regression: minimize model error)

Complexity term (regularizer): $R(w) \doteq \|w\|_2^2 = w^\top w$

$L(w) = \epsilon^\top \epsilon + \lambda\, w^\top w$
(“data fidelity” term + complexity term; the minimizer remains to be determined, and the scalar $\lambda$ also remains to be determined)
119
Least squares solution

$L(w) = \epsilon^\top \epsilon = (y - Xw)^\top (y - Xw) = y^\top y - 2 y^\top X w + w^\top X^\top X w$

Condition for minimum:
$\nabla L(w^*) = 0$
$-2 X^\top y + 2 X^\top X w^* = 0$
$w^* = (X^\top X)^{-1} X^\top y$
120
Ridge regression: L2-regularized linear regression

$L(w) = \epsilon^\top \epsilon + \lambda\, w^\top w$
$= y^\top y - 2 y^\top X w + w^\top X^\top X w + \lambda\, w^\top I w$   (first terms as before; $I$: identity matrix)
$= y^\top y - 2 y^\top X w + w^\top \left( X^\top X + \lambda I \right) w$

Condition for minimum:
$\nabla L(w^*) = 0$
$-2 X^\top y + 2 (X^\top X + \lambda I) w^* = 0$
$w^* = (X^\top X + \lambda I)^{-1} X^\top y$
121
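A minimal NumPy sketch of this ridge solution (my own illustration; note that in practice the intercept is often left unregularized, which this sketch ignores):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: w* = (X^T X + lam*I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(30)

print(ridge(X, y, lam=0.0))   # plain least squares
print(ridge(X, y, lam=10.0))  # weights shrink towards zero as lambda grows
```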
Ridge regression, continued

Regularizer: $R(w) \doteq \|w\|_2^2 = w^\top w$

New objective:
$L(w) = \epsilon^\top \epsilon + \lambda\, w^\top w$
(“data fidelity” term + complexity term)

We have just determined the minimizer; $\lambda$ is a scalar that remains to be determined.

λ: “hyperparameter”
Note: direct minimization of the objective w.r.t. it would lead to λ = 0
122
Bias-Variance tradeoff as a function of λ

(figure: bias–variance trade-off as a function of λ, with a “sweet spot” in between)
123
Selecting λ with cross-validation
• Cross-validation technique
– Exclude part of the training data from parameter estimation
– Use it only to estimate the test error
• K-fold cross-validation:
– K splits, average the K errors

• Use cross-validation for different values of the λ parameter
– Pick the value that minimizes the cross-validation error

Least glorious, most effective of all methods
124
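A possible sketch of K-fold cross-validation for choosing λ in ridge regression (my own illustration; the names and toy data are made up):

```python
import numpy as np

def cv_error(X, y, lam, K=5):
    """Average validation error of ridge regression over K folds."""
    N = X.shape[0]
    folds = np.array_split(np.random.default_rng(0).permutation(N), K)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        # Fit ridge on the training folds only
        w = np.linalg.solve(X[train_idx].T @ X[train_idx] + lam * np.eye(X.shape[1]),
                            X[train_idx].T @ y[train_idx])
        # Measure error on the held-out fold
        errs.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    return np.mean(errs)

# Pick the lambda with the smallest cross-validation error
rng = np.random.default_rng(4)
X = rng.standard_normal((60, 8))
y = X[:, 0] + 0.2 * rng.standard_normal(60)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print(best)
```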
How do we measure “large”?

Method parameters: D-dimensional vector
$w = [w_1, w_2, \dots, w_D]$

“Large” vector: vector norm

L2 (“Euclidean”) norm: $\|w\|_2 \doteq \sqrt{\sum_{d=1}^{D} w_d^2} = \sqrt{\langle w, w \rangle}$

L1 (“Manhattan”) norm: $\|w\|_1 \doteq \sum_{d=1}^{D} |w_d|$

Lp norm, p > 1: $\|w\|_p \doteq \left( \sum_{d=1}^{D} |w_d|^p \right)^{1/p}$
125
Vector norm: determines geometry

• Unit balls for different vector norms
• L1: ‘sparsity promoting’ loss
126
Regularized regression, revisited

Regularizer: $R(w) \doteq \|w\|_2^2 = w^\top w$

Ridge regression:
$L(w) = \epsilon^\top \epsilon + \lambda \sum_{d=1}^{D} w_d^2$

Alternative regularizer: $R(w) \doteq \|w\|_1 = \sum_{d=1}^{D} |w_d|$

Lasso (least absolute shrinkage and selection operator):
$L(w) = \epsilon^\top \epsilon + \lambda \sum_{d=1}^{D} |w_d|$
https://en.wikipedia.org/wiki/Lasso_(statistics)
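For illustration, scikit-learn provides both estimators; its `alpha` parameter plays the role of λ, up to the library's own scaling conventions (this usage sketch is mine, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)  # only 2 relevant features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks all weights; Lasso drives most of them exactly to zero
print(np.sum(np.abs(ridge.coef_) < 1e-6))   # typically 0
print(np.sum(np.abs(lasso.coef_) < 1e-6))   # typically close to 18
```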
