
1

Introduction to Machine Learning

Lecture 1: Introduction
Iasonas Kokkinos
i.kokkinos@cs.ucl.ac.uk

University College London


12

Lecture outline
Introduction to the course
Introduction to Machine Learning

Least squares
13
Machine Learning

Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to learn (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.

‘ML’ coined by Arthur Samuel, 1959.

(pipeline: data → model building → prediction)


14
Machine Learning

Goal: Machines that perform a task based on experience, instead of explicitly coded instructions

Why?

• Crucial component of every intelligent/autonomous system

• Important for a system’s adaptability

• Important for a system’s generalization capabilities

• Attempt to understand human learning


15
Rise of Machine Learning

(figure: growing interest over time in artificial intelligence (AI), machine learning (ML), and neural networks (NN))
16
Machine Learning variants

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction
• Weakly supervised/semi-supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: sparse reward for a sequence of decisions
17
Classification
• Based on our experience, should we give a loan to this customer?
– Binary decision: yes/no

(figure: customer data with a decision boundary)
18
Classification examples

• Digit Recognition

• Spam Detection

• Face detection
19
Segmentation + Classification in Real
Images

Evaluation measures: Confusion matrix, ROC curve, precision, recall, etc.


20

`Faceness function’: classifier

(figure: decision boundary separating Face from Background)
21

Test time: deploy the learned function


• Scan window over image
– Multiple scales
– Multiple orientations
• Classify window as either:
– Face
– Non-face

(diagram: window → classifier → Face / Non-face)
22
Face Detection

CMU Science Lab


23
Machine Learning variants

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction
• Weakly supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: reward for a sequence of decisions
24
Regression

• Output: Continuous
– E.g. price of a car based on years, mileage, condition,…
25
Computer vision example:
Human Face/Pose Estimation

[Blanz and Vetter, Siggraph, 1999]


26
Machine Learning variants

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction
• Weakly supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: reward for a sequence of decisions
27
Clustering

• Break a set of data into coherent groups


– Labels are `invented’
28
Clustering examples
• Spotify recommendations
29
Clustering examples
• Image segmentation
30
Machine Learning variants

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction
• Weakly supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: reward for a sequence of decisions
31
Dimensionality reduction & manifold learning

• Find a low-dimensional representation of high-dimensional data


– Continuous outputs are `invented’
32
Dimensionality Reduction examples
Isomap

[Tenenbaum et al., Science, 2000]


Face Manifold

[Averkiou et al., Eurographics, 2014]


33
Example of nonlinear manifold: faces

Average of two faces is not a face

$\tfrac{1}{2}(x_1 + x_2)$: the average of two faces $x_1$, $x_2$
34
Moving along the learned face manifold

Trajectory along the “male” dimension

Trajectory along the “young” dimension

Lample et al., Fader Networks, NIPS 2017


42
Tentative Course Schedule
• 1st week: Introduction & linear regression.
• 2nd week: Regularization, Ridge Regression, Cross-Validation
• 3rd week: Logistic Regression
• 4th week: Support Vector Machines
• 5th week: Ensemble Models (Adaboost, Random Forests)
• 6th week: Deep Learning (neural networks, backpropagation, SGD)
• 7th week: Deep Learning and applications
• 8th week: Unsupervised learning (K-means, PCA, Sparse Coding)
• 9th week: Probabilistic modelling (hidden variable models, EM)
• 10th week: Introduction to Structured Prediction, RNNs & NLP

No rush - stop me whenever something is not clear!


43
Focus of first part: supervised learning

• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimensionality Reduction, Manifold Learning
• Weakly supervised
Some data supervised, some unsupervised
• Reinforcement learning
Supervision: reward for a sequence of decisions
44
Prerequisites: Vectors and Matrices

(notation shown on the slide: vector, matrix, linear equation, inner product)
Self check: do you know what these stand for?


• linear independence; rank of a matrix
• span of a matrix
45
Prerequisites: Differential Calculus
46
Classification: yes/no decision
47
Regression: continuous output
48

What we want to learn: a function

• Input-output mapping

$y = f_w(x)$
49

What we want to learn: a function

• Input-output mapping

$y = f_w(x)$
($f$: method, $y$: prediction, $w$: parameters, $x$: input)
50

What we want to learn: a function

$y = f_w(x)$
($f$: method, $y$: prediction, $w$: parameters, $x$: input)

Calculus: $x \in \mathbb{R}$
Vector calculus: $x \in \mathbb{R}^D$
Machine learning: can also work for discrete inputs, strings, trees, graphs, …
51

What we want to learn: a function

$y = f_w(x)$
($f$: method, $y$: prediction, $w$: parameters, $x$: input)

Classification: $y \in \{0, 1\}$
Regression: $y \in \mathbb{R}$
52

What we want to learn: a function

$y = f_w(x)$
($f$: method, $y$: prediction)

Linear classifiers, neural networks, decision trees, ensemble models, probabilistic classifiers, …
53
Example of method: K-nearest neighbor classifier

(figure panels: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor; X marks the query point)

– Compute distance to other training records
– Identify K nearest neighbors
– Take majority vote
54
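A minimal NumPy sketch of this procedure (illustrative only; the slides give no code, and the function and variable names here are my own):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training record
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example: two classes in 2D
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```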
Training data for NN classifier (in $\mathbb{R}^2$)
55
1-nn classifier prediction (in $\mathbb{R}^2$)
56
3-nn classifier prediction
60
Method example: linear classifier
(figure: data in feature coordinates $i$ and $j$, separated by a linear classifier)
61
Method example: neural network
62
Method example: neural network
63
Method example: neural network
64
Lectures 1-7: all of that
• 1st week: Introduction & linear regression.
• 2nd week: Regularization, Ridge Regression, Cross-Validation
• 3rd week: Logistic Regression
• 4th week: Support Vector Machines
• 5th week: Ensemble Models (Adaboost, Random Forests)
• 6th week: Deep Learning (neural networks, backpropagation, SGD)
• 7th week: Deep Learning and applications
• 8th week: Unsupervised learning (K-means, PCA, Sparse Coding)
• 9th week: Probabilistic modelling (hidden variable models, EM)
• 10th week: Introduction to Reinforcement Learning

No rush - stop me whenever something is not clear!


65
We have two centuries of material to cover!

https://en.wikipedia.org/wiki/Least_squares

The first clear and concise exposition of the method of least squares was
published by Legendre in 1805.
The technique is described as an algebraic procedure for fitting linear
equations to data and Legendre demonstrates the new method by
analyzing the same data as Laplace for the shape of the earth.
The value of Legendre's method of least squares was immediately recognized by leading astronomers and geodesists of the time.
67

What we want to learn: a function

• Input-output mapping

$y = f_w(x) = f(x; w)$
($f$: method, $y$: prediction, $w$: parameters, $x$: input)

$w \in \mathbb{R}$ or $w \in \mathbb{R}^K$
68
Assumption: linear function
$y = f_w(x) = f(x, w) = w^\top x$

Inner product: $w^\top x = \langle w, x \rangle = \sum_{d=1}^{D} w_d x_d$, with $x \in \mathbb{R}^D$, $w \in \mathbb{R}^D$
69
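As a quick sanity check (my own illustration, not from the slides), the inner product can be computed in several equivalent ways:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # parameters, w in R^3
x = np.array([1.0, 3.0, 0.5])    # input, x in R^3

# Three equivalent ways to compute the inner product w^T x
print(w @ x)                                      # matrix-style product
print(np.dot(w, x))                               # explicit dot product
print(sum(w_d * x_d for w_d, x_d in zip(w, x)))   # the sum over d = 1..D
```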
Reminder: linear classifier

$x_i$ positive: $x_i \cdot w + b \ge 0$
$x_i$ negative: $x_i \cdot w + b < 0$

Each data point has a class label: $y_t \in \{+1, -1\}$

(figure: labelled points in feature coordinates $i$ and $j$, separated by a line)
70

Question: which one?

$x_i$ positive: $x_i \cdot w + b \ge 0$
$x_i$ negative: $x_i \cdot w + b < 0$

Each data point has a class label: $y_t \in \{+1, -1\}$

(figure: the same labelled points in feature coordinates $i$ and $j$)
71
Linear regression in 1D
72
Linear regression in 1D

Training set: input–output pairs $S = \{(x^i, y^i)\},\; i = 1, \dots, N$, with $x^i \in \mathbb{R}$, $y^i \in \mathbb{R}$
73
Linear regression in 1D

(figure: deviations of the data from the fitted line are the noise)
74
Sum of squared errors criterion

$y^i = w^\top x^i + \epsilon^i$

Loss function: sum of squared errors
$L(w) = \sum_{i=1}^{N} (\epsilon^i)^2$

Expressed as a function of two variables:
$L(w_0, w_1) = \sum_{i=1}^{N} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$

Question: what is the best (or least bad) value of $w$?
Answer: least squares
75
Prerequisites: Differential Calculus
76

Calculus 101

(figure: graph of $f(x)$ with its maximum at $x^*$)

$x^* = \operatorname{argmax}_x f(x)$

Condition for maximum: derivative is zero
$x^* = \operatorname{argmax}_x f(x) \;\Rightarrow\; f'(x^*) = 0$
80

Condition for minimum: derivative is zero

$x^* = \operatorname{argmin}_x f(x) \;\Rightarrow\; f'(x^*) = 0$
81

Vector calculus 101

$\nabla f(x) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix}$

(figure panels: 2D function graph, isocontours $f(x) = c$, gradient field)

At the minimum of the function: $\nabla f(x) = 0$
82
Back to least squares…

$y^i = w^\top x^i + \epsilon^i$

Loss function: sum of squared errors
$L(w) = \sum_{i=1}^{N} (\epsilon^i)^2$

Expressed as a function of two variables:
$L(w_0, w_1) = \sum_{i=1}^{N} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$
(superscript $i$: training sample; subscripts $0, 1$: feature dimension)

Question: what is the best (or least bad) value of $w$?
Answer: least squares
83
Line Fitting
84
Fitting a line

$L(w_0, w_1) = \sum_{i=1}^{N} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$

$\frac{\partial L(w_0, w_1)}{\partial w_0} = \sum_{i=1}^{N} \frac{\partial}{\partial w_0} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$
$= \sum_{i=1}^{N} 2 \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right] (-x_0^i)$
$= -2 \sum_{i=1}^{N} \left( y^i x_0^i - w_0 x_0^i x_0^i - w_1 x_1^i x_0^i \right)$

$\frac{\partial L(w_0, w_1)}{\partial w_0} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_0^i = w_0 \sum_{i=1}^{N} x_0^i x_0^i + w_1 \sum_{i=1}^{N} x_1^i x_0^i$
85
Fitting a line, continued

$\frac{\partial L(w_0, w_1)}{\partial w_0} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_0^i = w_0 \sum_{i=1}^{N} x_0^i x_0^i + w_1 \sum_{i=1}^{N} x_1^i x_0^i$

$\frac{\partial L(w_0, w_1)}{\partial w_1} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_1^i = w_0 \sum_{i=1}^{N} x_0^i x_1^i + w_1 \sum_{i=1}^{N} x_1^i x_1^i$

2 linear equations, 2 unknowns
86
Fitting a line, continued

$\sum_{i=1}^{N} y^i x_0^i = w_0 \sum_{i=1}^{N} x_0^i x_0^i + w_1 \sum_{i=1}^{N} x_1^i x_0^i$
$\sum_{i=1}^{N} y^i x_1^i = w_0 \sum_{i=1}^{N} x_0^i x_1^i + w_1 \sum_{i=1}^{N} x_1^i x_1^i$

2x2 system of equations:
$\begin{bmatrix} \sum_{i=1}^{N} y^i x_0^i \\ \sum_{i=1}^{N} y^i x_1^i \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{N} x_0^i x_0^i & \sum_{i=1}^{N} x_0^i x_1^i \\ \sum_{i=1}^{N} x_0^i x_1^i & \sum_{i=1}^{N} x_1^i x_1^i \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$

That's it!
87
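For concreteness, the 2x2 normal equations above could be assembled and solved as follows (a sketch under the slide's convention that $x_0^i = 1$ carries the intercept; the toy data and names are my own):

```python
import numpy as np

# Toy 1D data: y is roughly 2 + 3*x plus noise
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, size=30)
y = 2 + 3 * x1 + 0.1 * rng.standard_normal(30)
x0 = np.ones_like(x1)            # constant feature x_0^i = 1 (intercept)

# Right-hand side and 2x2 matrix of the normal equations
b = np.array([np.sum(y * x0), np.sum(y * x1)])
A = np.array([[np.sum(x0 * x0), np.sum(x1 * x0)],
              [np.sum(x0 * x1), np.sum(x1 * x1)]])

w0, w1 = np.linalg.solve(A, b)   # solve for the two unknowns
print(w0, w1)                    # approximately 2 and 3
```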
Line Fitting (continued)

(figure: the fitted line, obtained by solving the normal equations)
88
Linear regression in 1D
89
Linear regression in 2D (or ND)
90
Least squares solution for linear regression

D: problem dimension; N: training set size

$\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix} = \begin{bmatrix} x_1^1 & \dots & x_D^1 \\ x_1^2 & \dots & x_D^2 \\ & \vdots & \\ x_1^N & \dots & x_D^N \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} + \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}$
(sizes: $N \times 1 = (N \times D)(D \times 1) + N \times 1$)
91
Least squares solution for linear regression

$y = Xw + \epsilon$
92
Least squares solution for linear regression

Loss function: $L(w) = \sum_{i=1}^{N} (y^i - w^\top x^i)^2 = \sum_{i=1}^{N} (\epsilon^i)^2$

$L(w) = \begin{bmatrix} \epsilon^1 & \epsilon^2 & \dots & \epsilon^N \end{bmatrix} \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix} = \epsilon^\top \epsilon, \qquad y = Xw + \epsilon$
96
Recap: Sum of squared errors criterion

$y^i = w^\top x^i + \epsilon^i$

Loss function: sum of squared errors
$L(w) = \sum_{i=1}^{N} (\epsilon^i)^2$

Expressed as a function of two variables:
$L(w_0, w_1) = \sum_{i=1}^{N} \left[ y^i - (w_0 x_0^i + w_1 x_1^i) \right]^2$

Question: what is the best (or least bad) value of $w$?
Answer: least squares
97

Gradient-based optimization

$\nabla f(x) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix}$

(figure panels: 2D function graph, isocontours $f(x) = c$, gradient field)

At the minimum of the function: $\nabla f(x) = 0$
98
Recap: condition for optimum

$\frac{\partial L(w_0, w_1)}{\partial w_0} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_0^i = w_0 \sum_{i=1}^{N} x_0^i x_0^i + w_1 \sum_{i=1}^{N} x_1^i x_0^i$

$\frac{\partial L(w_0, w_1)}{\partial w_1} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} y^i x_1^i = w_0 \sum_{i=1}^{N} x_0^i x_1^i + w_1 \sum_{i=1}^{N} x_1^i x_1^i$

2 linear equations, 2 unknowns
99
Least squares solution, in vector form

$L(w) = \epsilon^\top \epsilon = (y - Xw)^\top (y - Xw) = y^\top y - 2 y^\top X w + w^\top X^\top X w$

Condition for minimum:
$\nabla L(w^*) = 0$
$-2 X^\top y + 2 X^\top X w^* = 0$
$w^* = (X^\top X)^{-1} X^\top y$
100
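A minimal NumPy sketch of this closed-form solution (my own illustration; in practice one solves the linear system rather than forming the inverse of $X^\top X$ explicitly):

```python
import numpy as np

def least_squares(X, y):
    """Closed-form least squares: w* = (X^T X)^{-1} X^T y,
    computed by solving the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: N = 50 points in D = 2 (a column of ones acts as the intercept)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
true_w = np.array([1.0, -2.0])
y = X @ true_w + 0.05 * rng.standard_normal(50)

print(least_squares(X, y))  # close to [1.0, -2.0]
```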
Generalized linear regression

$x \rightarrow \phi(x) = \begin{bmatrix} \phi_1(x) \\ \vdots \\ \phi_M(x) \end{bmatrix}$
101
1D Example: 2nd degree polynomial fitting

$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}$

$\langle w, \phi(x) \rangle = w_0 + w_1 x + w_2 x^2$
102
1D Example: k-th degree polynomial fitting

$\phi(x) = \begin{bmatrix} 1 \\ x \\ \vdots \\ x^K \end{bmatrix}$

$\langle w, \phi(x) \rangle = w_0 + w_1 x + \dots + w_K x^K$
103
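A small sketch of this 1D feature map in NumPy (the function name `poly_features` is my own, not from the slides):

```python
import numpy as np

def poly_features(x, K):
    """Map a 1D input array of shape (N,) to rows [1, x, x^2, ..., x^K]."""
    return np.column_stack([x ** k for k in range(K + 1)])

x = np.array([0.0, 0.5, 1.0, 2.0])
print(poly_features(x, K=3))   # each row is phi(x^i) for a cubic fit
```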
2D example: second-order polynomials

$x = (x_1, x_2)$

$\phi(x) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}$

$\langle w, \phi(x) \rangle = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2$
104
Reminder: linear regression

Loss function: $L(w) = \sum_{i=1}^{N} (y^i - w^\top x^i)^2 = \sum_{i=1}^{N} (\epsilon^i)^2$

$\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix} = \begin{bmatrix} x_1^1 & \dots & x_D^1 \\ x_1^2 & \dots & x_D^2 \\ & \vdots & \\ x_1^N & \dots & x_D^N \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} + \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}$
105
Reminder: linear regression

Loss function: $L(w) = \sum_{i=1}^{N} (y^i - w^\top x^i)^2 = \sum_{i=1}^{N} (\epsilon^i)^2$

$\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix} = \begin{bmatrix} (x^1)^\top \\ (x^2)^\top \\ \vdots \\ (x^N)^\top \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} + \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}$
106
Generalized linear regression

Loss function: $L(w) = \sum_{i=1}^{N} (y^i - w^\top \phi(x^i))^2 = \sum_{i=1}^{N} (\epsilon^i)^2$

$\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix} = \begin{bmatrix} \phi(x^1)^\top \\ \phi(x^2)^\top \\ \vdots \\ \phi(x^N)^\top \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix} + \begin{bmatrix} \epsilon^1 \\ \epsilon^2 \\ \vdots \\ \epsilon^N \end{bmatrix}$
(sizes: $N \times 1 = (N \times M)(M \times 1) + N \times 1$)

$\phi : \mathbb{R}^D \rightarrow \mathbb{R}^M$
107
Least squares solution for linear regression

$y = Xw + \epsilon, \qquad X = \begin{bmatrix} (x^1)^\top \\ (x^2)^\top \\ \vdots \\ (x^N)^\top \end{bmatrix}$

Minimize: $L(w) = \epsilon^\top \epsilon$

$w^* = (X^\top X)^{-1} X^\top y$
108
Least squares solution for generalized linear regression

$y = \Phi w + \epsilon, \qquad \Phi = \begin{bmatrix} \phi(x^1)^\top \\ \phi(x^2)^\top \\ \vdots \\ \phi(x^N)^\top \end{bmatrix}$

Minimize: $L(w) = \epsilon^\top \epsilon$

$w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$
109
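Putting the feature map and the closed-form solution together, a possible NumPy sketch of generalized least squares (the function name, toy data, and degree are my own choices for illustration):

```python
import numpy as np

def fit_generalized(Phi, y):
    """w* = (Phi^T Phi)^{-1} Phi^T y, computed by solving the normal equations."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Fit a cubic to noisy samples of a nonlinear function
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(40)

Phi = np.column_stack([x ** k for k in range(4)])   # phi(x) = [1, x, x^2, x^3]
w = fit_generalized(Phi, y)

x_test = np.linspace(-1, 1, 5)
Phi_test = np.column_stack([x_test ** k for k in range(4)])
print(Phi_test @ w)   # predictions <w, phi(x)> on new inputs
```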
2D example: second-order polynomials

$x = (x_1, x_2)$

$\phi(x) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}$

$\langle w, \phi(x) \rangle = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2$
110
5D Example: fourth-order polynomials in 5D

$x = (x_1, \dots, x_5)$

$\phi(x) = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_5 \\ \vdots \\ (x_1 x_2 x_3 x_4 x_5)^4 \end{bmatrix}$

15625 dimensions ⇒ 15625 parameters
111
What was happening before: approximations

Training: $S = \{(x^i, y^i)\},\; i = 1, \dots, N$

$y^1 \simeq w_0 x_0^1 + w_1 x_1^1 + \dots + w_D x_D^1$
$y^2 \simeq w_0 x_0^2 + w_1 x_1^2 + \dots + w_D x_D^2$
$\vdots$
$y^N \simeq w_0 x_0^N + w_1 x_1^N + \dots + w_D x_D^N$

If N > D (e.g. 30 points, 2 dimensions) we have more equations than unknowns: overdetermined system!
Input-output relations can only hold approximately!
112
What is happening now: overfitting

Training: $S = \{(x^i, y^i)\},\; i = 1, \dots, N$

$y^1 = w_0 x_0^1 + w_1 x_1^1 + \dots + w_D x_D^1$
$y^2 = w_0 x_0^2 + w_1 x_1^2 + \dots + w_D x_D^2$
$\vdots$
$y^N = w_0 x_0^N + w_1 x_1^N + \dots + w_D x_D^N$

If N < D (e.g. 30 points, 15625 dimensions) we have more unknowns than equations: underdetermined system!
Input-output equations hold exactly, but we are simply memorizing the data
113
Overfitting, in images
(figure: classification and regression fits ranging from too simple to “just right” to overfitted)
114
Tuning the model’s complexity
A flexible model approximates the target function well on the training set, but can “overtrain” and perform poorly on the test set (“variance”).

A rigid model's performance on the test set is more predictable, but the model may not be good even on the training set (“bias”).
115
Regularization: keeping it simple

In high dimensions: too many solutions for the same problem

Regularization: prefer the least complex among them

How? Penalize complexity


116
How to control complexity?

Observation: problem started with high-dimensional embeddings

Guess: Number of dimensions relates to “complexity”

(Week 4: we will guess again!)

Intuition: with many parameters, we can fit anything

But what if we force the classifier not to use all of the parameters?

Idea: penalize the use of large parameter values


How do we measure “large”?
How do we enforce small values?
117
How do we measure “large”?

Method parameters: D-dimensional vector
$w = [w_1, w_2, \dots, w_D]$

“Large” vector: vector norm

L2 (“Euclidean”) norm: $\|w\|_2 \doteq \sqrt{\sum_{d=1}^{D} w_d^2} = \sqrt{\langle w, w \rangle}$

L1 (“Manhattan”) norm: $\|w\|_1 \doteq \sum_{d=1}^{D} |w_d|$

Lp norm, p > 1: $\|w\|_p \doteq \left( \sum_{d=1}^{D} |w_d|^p \right)^{1/p}$
118
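These definitions are easy to check numerically (my own illustration, not from the slides):

```python
import numpy as np

w = np.array([3.0, -4.0, 1.0])

l2 = np.sqrt(np.sum(w ** 2))            # Euclidean norm
l1 = np.sum(np.abs(w))                  # Manhattan norm
lp = np.sum(np.abs(w) ** 3) ** (1 / 3)  # Lp norm with p = 3

# Same values via NumPy's built-in norm
print(l2, np.linalg.norm(w, 2))
print(l1, np.linalg.norm(w, 1))
print(lp, np.linalg.norm(w, 3))
```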
Regularized linear regression

$\epsilon = y - Xw$   (residual vector)

$L(w) = \epsilon^\top \epsilon$   (linear regression: minimize model error)

Complexity term (regularizer): $R(w) \doteq \|w\|_2^2 = w^\top w$

$L(w) = \epsilon^\top \epsilon + \lambda\, w^\top w$
(“data fidelity” term + complexity term; the minimizer remains to be determined, and the scalar $\lambda$ also remains to be determined)
119
Least squares solution

$L(w) = \epsilon^\top \epsilon = (y - Xw)^\top (y - Xw) = y^\top y - 2 y^\top X w + w^\top X^\top X w$

Condition for minimum:
$\nabla L(w^*) = 0$
$-2 X^\top y + 2 X^\top X w^* = 0$
$w^* = (X^\top X)^{-1} X^\top y$
120
Ridge regression: L2-regularized linear regression

$L(w) = \epsilon^\top \epsilon + \lambda\, w^\top w$
$= y^\top y - 2 y^\top X w + w^\top X^\top X w + \lambda\, w^\top I w$   (first terms as before; $I$: identity matrix)
$= y^\top y - 2 y^\top X w + w^\top \left( X^\top X + \lambda I \right) w$

Condition for minimum:
$\nabla L(w^*) = 0$
$-2 X^\top y + 2 (X^\top X + \lambda I) w^* = 0$
$w^* = (X^\top X + \lambda I)^{-1} X^\top y$
121
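A minimal NumPy sketch of this ridge solution (my own illustration; note that in practice the intercept is often left unregularized, which this sketch ignores):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: w* = (X^T X + lam*I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(30)

print(ridge(X, y, lam=0.0))   # plain least squares
print(ridge(X, y, lam=10.0))  # weights shrink towards zero as lambda grows
```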
Ridge regression, continued

Regularizer: $R(w) \doteq \|w\|_2^2 = w^\top w$

New objective:
$L(w) = \epsilon^\top \epsilon + \lambda\, w^\top w$
(“data fidelity” term + complexity term)

We have just determined the minimizer; $\lambda$ is a scalar that remains to be determined.

λ: “hyperparameter”
Note: direct minimization of the objective w.r.t. it would lead to λ = 0
122
Bias-Variance tradeoff as a function of λ

(figure: bias–variance trade-off as a function of λ, with a “sweet spot” in between)
123
Selecting λ with cross-validation
• Cross-validation technique
– Exclude part of the training data from parameter estimation
– Use it only to estimate the test error
• K-fold cross-validation:
– K splits, average the K errors

• Use cross-validation for different values of the λ parameter
– Pick the value that minimizes the cross-validation error

Least glorious, most effective of all methods
124
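A possible sketch of K-fold cross-validation for choosing λ in ridge regression (my own illustration; the names and toy data are made up):

```python
import numpy as np

def cv_error(X, y, lam, K=5):
    """Average validation error of ridge regression over K folds."""
    N = X.shape[0]
    folds = np.array_split(np.random.default_rng(0).permutation(N), K)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        # Fit ridge on the training folds only
        w = np.linalg.solve(X[train_idx].T @ X[train_idx] + lam * np.eye(X.shape[1]),
                            X[train_idx].T @ y[train_idx])
        # Measure error on the held-out fold
        errs.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    return np.mean(errs)

# Pick the lambda with the smallest cross-validation error
rng = np.random.default_rng(4)
X = rng.standard_normal((60, 8))
y = X[:, 0] + 0.2 * rng.standard_normal(60)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print(best)
```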
How do we measure “large”?

Method parameters: D-dimensional vector
$w = [w_1, w_2, \dots, w_D]$

“Large” vector: vector norm

L2 (“Euclidean”) norm: $\|w\|_2 \doteq \sqrt{\sum_{d=1}^{D} w_d^2} = \sqrt{\langle w, w \rangle}$

L1 (“Manhattan”) norm: $\|w\|_1 \doteq \sum_{d=1}^{D} |w_d|$

Lp norm, p > 1: $\|w\|_p \doteq \left( \sum_{d=1}^{D} |w_d|^p \right)^{1/p}$
125
Vector norm: determines geometry

• Unit balls for different vector norms
• L1: ‘sparsity promoting’ loss
126
Regularized regression, revisited

Regularizer: $R(w) \doteq \|w\|_2^2 = w^\top w$

Ridge regression:
$L(w) = \epsilon^\top \epsilon + \lambda \sum_{d=1}^{D} w_d^2$

Alternative regularizer: $R(w) \doteq \|w\|_1 = \sum_{d=1}^{D} |w_d|$

Lasso (least absolute shrinkage and selection operator):
$L(w) = \epsilon^\top \epsilon + \lambda \sum_{d=1}^{D} |w_d|$
https://en.wikipedia.org/wiki/Lasso_(statistics)
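For illustration, scikit-learn provides both estimators; its `alpha` parameter plays the role of λ, up to the library's own scaling conventions (this usage sketch is mine, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)  # only 2 relevant features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks all weights; Lasso drives most of them exactly to zero
print(np.sum(np.abs(ridge.coef_) < 1e-6))   # typically 0
print(np.sum(np.abs(lasso.coef_) < 1e-6))   # typically close to 18
```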
