
Course: Intelligent Systems

Unit 2: Neural Networks


2.2 Training. Linear Regression

Daniel Manrique
2021

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License:
http://creativecommons.org/licenses/by-nc-sa/4.0/
Intelligent Systems
Neural Networks

1. Representing neural networks.
2. Training neural networks.
   1. Linear Regression.
   2. Softmax Regression.
   3. Multilayer perceptrons.
Training Artificial Neural Networks
● Artificial neural networks learn from a dataset.
● Learning is the process of adjusting the weights and biases (parameters) of the connections between neurons so that the neural network adapts its responses to exhibit the expected target behavior: to fit the dataset.
● This set of weights and biases is the solution to the problem.
● The goal is to provide adequate answers to unseen data, not present in the training dataset: generalization.

[Figure: a network with weight matrices W[1], W[2], and W[3], taking MNIST digit images as input examples.]
Weights adjustment
[Figure: supervised learning scheme. The input x feeds the neural architecture, which produces the output y; this output is compared with the desired (target) output t, and the weights are adjusted accordingly.]

● The training data fed to the algorithm include the desired or target outputs, called labels.
● Typical examples are classification, where labels correspond to classes, or regression, where labels correspond to the target values.
● An attribute is an input variable.
https://scorecardstreet.wordpress.com/2015/12/09/is-machine-learning-the-new-epm-black/ Laura Edell, 2015.

Attributes, x (first nine columns), and target output, t (last column):

| Long.   | Lat.  | Age   | Rooms | Beds  | Pop.  | HouH. | Inc.  | Ocean proximity | Median house value |
|---------|-------|-------|-------|-------|-------|-------|-------|-----------------|--------------------|
| -0.5623 | 0.763 | 0.17  | -0.87 | -0.74 | -0.85 | -0.91 | 0.93  | 0.33            | 0.87               |
| 0.4297  | 0.477 | -0.13 | 0.44  | 0.32  | 0.44  | 0.52  | 0.23  | -1              | -0.45              |
| 0.0212  | -0.75 | 0.21  | 0.67  | 0.87  | 0.94  | 0.9   | 0.64  | -0.33           | 0.34               |
| -0.6171 | -0.11 | -0.37 | -0.31 | -0.23 | -0.34 | -0.27 | -0.34 | 1               | -0.82              |
[Figure: The Neural Network Zoo, http://www.asimovinstitute.org/neural-network-zoo/]
Linear regression
[Figure: a single linear unit with 9 input variables (attributes) and a bias input fixed at +1, computing Wx + b = y.]
● The 9 attributes come from MedianHouseValuePreparedCleanAttributes.csv; for the first sample, (-0.5623, 0.763, 0.17, -0.87, -0.74, -0.85, -0.91, 0.93, 0.33).
● The training process adjusts the weights W and the bias b so that the output matches the target from MedianHouseValueContinuousOutput.csv; for the first sample, 0.87 (328,500 $).
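As a quick illustration of this mapping, a minimal NumPy sketch of the forward pass follows. The weight and bias values are random placeholders (the actual values come out of training, covered in the next slides); only the attribute row is taken from the table above:

```python
import numpy as np

# One training example: the 9 normalized attributes of the first table row.
x = np.array([-0.5623, 0.763, 0.17, -0.87, -0.74, -0.85, -0.91, 0.93, 0.33])

# Hypothetical parameters: 1 output unit, 9 inputs. Random placeholders here;
# training is what produces the actual values.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(1, 9))  # weight matrix, shape (n_y, n_x)
b = np.zeros(1)                         # bias vector, shape (n_y,)

# Forward pass of the linear unit: identity activation, so y = Wx + b.
y = W @ x + b
print(y)  # predicted (normalized) median house value
```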
Linear activation function
● The linear activation function is the identity function:

   $y = f(Wx + b) = Wx + b$

● For each unit:

   $y_i = f\left(\sum_{j=1}^{n_x} x_j w_{ij} + b\right) = \sum_{j=1}^{n_x} x_j w_{ij} + b$, where $\mathrm{net}_i = \sum_{j=1}^{n_x} x_j w_{ij} + b$

[Figure: plot of the identity activation function f(x) = x.]

https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
LMS learning algorithm or Delta rule
● The LMS (least mean squares) algorithm was designed to adjust a linear unit or ADALINE.
● The bias b is usually included as column 0 of W:
   ● $W_i = (w_{i,0}, w_{i,1}, \dots, w_{i,n_x})$, where $w_{i,0}$ is $b_i$.
● The input vector x is correspondingly extended with a constant 1 at position 0:
   ● $x^{\top} = (1, x_1, x_2, \dots, x_{n_x})$.
● Therefore, the output vector is $y = Wx$; $y_i = \sum_{j=0}^{n_x} x_j w_{ij}$.
● Given a linear regression problem, the loss (error) function for a single pth training sample with $n_x$ attributes and $n_y$ classes is:

   $\mathcal{L}^{(p)}(W) = E^{(p)}(W) = \frac{1}{2} \sum_{i=1}^{n_y} \left( t_i^{(p)} - y_i^{(p)} \right)^2$

W is the matrix of weights.
$t_i^{(p)}$ is the desired target output for the ith unit: the label corresponding to the pth training sample.
$y_i^{(p)}$ is the computed output for the ith unit and the pth input vector.
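The bias absorption just described can be checked with a short sketch (variable names and sizes are illustrative, not from the course materials):

```python
import numpy as np

rng = np.random.default_rng(0)
n_y, n_x = 1, 9
W = rng.normal(size=(n_y, n_x))   # weights without the bias column
b = rng.normal(size=n_y)          # one bias per output unit
x = rng.normal(size=n_x)          # one input vector

# Fold each bias b_i into column 0 of W and prepend a constant 1 to x.
W_aug = np.hstack([b[:, None], W])    # W_i = (w_i0 = b_i, w_i1, ..., w_i,nx)
x_aug = np.concatenate([[1.0], x])    # x^T = (1, x_1, ..., x_nx)

print(np.allclose(W @ x + b, W_aug @ x_aug))  # True: y = Wx now holds
```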
LMS learning algorithm or Delta rule
● Gradient descent applied to the error function adjusts the weights as follows:

   $E^{(p)}(W) = \frac{1}{2} \sum_{i=1}^{n_y} \left( t_i^{(p)} - y_i^{(p)} \right)^2; \quad \Delta^{(p)} w_{ij} = -\alpha \, \frac{\partial E^{(p)}(W)}{\partial w_{ij}}$

● $w_{ij}$ is the weight between the jth input unit and the ith output unit.
● The chain rule permits calculating the derivative of the error function:

   $\frac{\partial E^{(p)}(W)}{\partial w_{ij}} = \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} \cdot \frac{\partial y_i^{(p)}}{\partial w_{ij}}; \quad \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} = -\left( t_i^{(p)} - y_i^{(p)} \right)$
LMS learning algorithm or Delta rule
● Since the activation function is linear:

   $y_i^{(p)} = \mathrm{net}_i^{(p)} = \sum_{j=0}^{n_x} w_{ij} \, x_j^{(p)}$

   where $x_j^{(p)}$ is the value at the jth position of the pth input vector.

● Putting the pieces of the chain rule together:

   $\Delta^{(p)} w_{ij} = -\alpha \, \frac{\partial E^{(p)}(W)}{\partial w_{ij}}; \quad \frac{\partial E^{(p)}(W)}{\partial w_{ij}} = \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} \cdot \frac{\partial y_i^{(p)}}{\partial w_{ij}}; \quad \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} = -\left( t_i^{(p)} - y_i^{(p)} \right); \quad \frac{\partial y_i^{(p)}}{\partial w_{ij}} = x_j^{(p)}$

● The final weight adjustment for each training example is:

   $\Delta^{(p)} w_{ij} = \alpha \left( t_i^{(p)} - y_i^{(p)} \right) x_j^{(p)}$

● In matrix form: $\Delta^{(p)} W = \alpha \left( t^{(p)} - y^{(p)} \right) x^{(p)\top}$

● α is the learning rate or step.
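A minimal sketch of one such update, assuming the augmented representation from the previous slide (biases in column 0 of W, x[0] = 1); the function name and shapes are my own:

```python
import numpy as np

def delta_rule_step(W, x, t, alpha=0.1):
    """One LMS / Delta-rule update for a single training example.

    W: weights, shape (n_y, n_x + 1), with the biases in column 0.
    x: input, shape (n_x + 1,), with x[0] = 1 (bias input).
    t: target, shape (n_y,).
    """
    y = W @ x                            # linear unit: identity activation
    error = t - y                        # (t(p) - y(p))
    W = W + alpha * np.outer(error, x)   # Delta(p) W = alpha (t - y) x^T
    return W, 0.5 * np.sum(error ** 2)   # updated weights, SSE loss E(p)(W)
```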
Cost and loss functions
● We adopt the following convention:
   ● The loss function, $\mathcal{L}(W)$, compares, for each input example $x^{(p)}$, the computed output $y^{(p)}$ and the target $t^{(p)}$. It is a local error, e.g., the SSE:

      $\mathcal{L}^{(p)}(W) = E^{(p)}(W) = \frac{1}{2} \sum_{i=1}^{n_y} \left( t_i^{(p)} - y_i^{(p)} \right)^2$

   ● The cost function, $J(W)$, involves the whole dataset of size m. It is a global error, e.g., the MSE:

      $J(W) = \mathrm{MSE}(W) = \frac{1}{m} \sum_{p=1}^{m} E^{(p)}(W)$
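To make the local/global distinction concrete, a short sketch with illustrative names:

```python
import numpy as np

def loss_sse(t, y):
    """Local error for one example: L(p)(W) = 1/2 * sum_i (t_i - y_i)^2."""
    return 0.5 * np.sum((t - y) ** 2)

def cost_mse(W, X, T):
    """Global error: J(W) = MSE(W) = (1/m) * sum_p E(p)(W).

    X: m input vectors, shape (m, n_x + 1); T: m targets, shape (m, n_y).
    """
    Y = X @ W.T   # forward pass of the linear model over the whole dataset
    return np.mean([loss_sse(t, y) for t, y in zip(T, Y)])
```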
Linear regression: results
Hyperparameters:
● Stop condition: 1,000 epochs.
● Learning rate α = 0.1.
Data: 1,634 samples for training; 204 for development. Test samples are not considered.
Datasets: MedianHouseValuePreparedCleanAttributes.csv, MedianHouseValueContinuousOutput.csv.
[Figure: evolution of the training MSE over the epochs.]
MSE in training: 0.22. MSE in dev: 0.238.
Sample of computed vs. target outputs:

| y(p)  | t(p)  |
|-------|-------|
| -0.17 | -0.54 |
| -0.21 | 0.17  |
| -0.18 | -0.54 |
| -0.22 | -0.41 |
| -0.20 | -0.43 |
| -0.21 | -0.26 |
| -0.20 | 0.06  |

● The MSE evolution flattens as early as 100 epochs.
● The computed values are all around -0.2, the mean value of the target outputs.
● We are trying to solve a non-linear problem with linear regression.
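A training loop matching these hyperparameters might look like the sketch below. The data are synthetic stand-ins, since the CSV files are not reproduced here, and the delta_rule_step and cost_mse helpers sketched on the previous slides are assumed to be in scope:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n_x, n_y = 1634, 9, 1                  # training set size and dimensions
X = np.hstack([np.ones((m, 1)),           # bias input x0 = 1
               rng.normal(size=(m, n_x))])
T = rng.normal(size=(m, n_y))             # stand-in targets

W = np.zeros((n_y, n_x + 1))              # weights, with biases in column 0
alpha, epochs = 0.1, 1000                 # hyperparameters from this slide

for epoch in range(epochs):
    for x, t in zip(X, T):                # one Delta-rule step per sample
        W, _ = delta_rule_step(W, x, t, alpha)

print("training MSE:", cost_mse(W, X, T))
```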
Linear models cannot be deep
● Sandwiching in intermediate linear layers does not increase the model's computational power or its accuracy.
● The general case for the ℓth layer is: $y^{[\ell]} = W^{[\ell]} \cdot y^{[\ell-1]}$
   ● $y^{[1]} = W^{[1]} \cdot x$
   ● $y^{[2]} = W^{[2]} \cdot y^{[1]} = W^{[2]} \cdot W^{[1]} \cdot x$
   ● …
   ● $y^{[\ell]} = W^{[\ell]} \cdot y^{[\ell-1]} = W^{[\ell]} \cdot W^{[\ell-1]} \cdots W^{[2]} \cdot W^{[1]} \cdot x$
● An ℓ-layer neural network is computationally equivalent to a one-layer network whose connection matrix (kernel) W is:
   ● $W = W^{[\ell]} \cdot W^{[\ell-1]} \cdots W^{[2]} \cdot W^{[1]}$. Therefore, $y = Wx$.
● Therefore, linear models cannot be deep (see the numerical check below).
● The linear activation function may still be used in the output layer, though, combined with non-linear activation functions in the hidden layers.
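A quick numerical check of this equivalence (layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)

# Three linear layers with arbitrary sizes: 4 -> 5 -> 3 -> 2.
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(3, 5))
W3 = rng.normal(size=(2, 3))

deep = W3 @ (W2 @ (W1 @ x))      # layer-by-layer forward pass
W = W3 @ W2 @ W1                 # collapsed single kernel
print(np.allclose(deep, W @ x))  # True: equivalent to a one-layer network
```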
Conclusions
● The computed output y(p) tends to the mean value of the discretized target outputs to achieve the least MSE.
● A linear approach cannot solve a non-linear problem.
● We'll try a logistic regression approach in the next lesson to check whether or not more accurate results are achieved.
Lecture slides "Training. Linear Regression" of the master course "Intelligent Systems".
2021 Daniel Manrique

Suggested citation:

D. Manrique. (2021). Training. Linear Regression. Lecture slides. In course "Intelligent Systems" of the Master Degree in Computer Engineering. Department of Artificial Intelligence. Universidad Politécnica de Madrid.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License:
http://creativecommons.org/licenses/by-nc-sa/4.0/
