ANN-Unit 3 - Regression & Multi-Layer Perceptron
Applied Neural Networks – Unit 3
Lecture Outline
▪ Machine Learning Basics
▪ Linear and Logistic Regression
▪ Neural Networks and Architecture
▪ Vector Analysis for Neural Networks
▪ Loss and Cost Functions
▪ Derivative Evaluation
▪ Vectorization
10/30/2023
Machine Learning
▪ As a broad subfield of artificial intelligence, machine learning is concerned with the
design and development of algorithms and techniques that allow computers
to "learn".
▪ A major focus of machine learning research is to automatically learn to recognize
complex patterns and make intelligent decisions based on data.
▪ Unsupervised Learning
▪ Machine learning algorithms used to draw inferences from datasets consisting of input data
without labeled responses
▪ Reinforcement Learning
▪ Learning from a series of reinforcements: rewards or punishments. For example, the lack of a tip at the end of a customer interaction or sale.
Classification
Classification: Definition
▪ Given a collection of records (training set)
▪ Each record contains a set of attributes; one of the attributes is the class.
▪ Find a model for the class attribute as a function of the values of the other attributes.
Classification Example
Venue    | Type of Wicket | Type of match | Batted first | Winning Team
Pakistan | Slow           | ODI           | Pakistan     | Pakistan

The input and output values can be discrete or continuous. For now we will concentrate on problems where the output has exactly two possible values; this is Boolean classification.
Classification
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class (a categorical variable).
• Find a model for the class attribute as a function of the values of the other attributes (supervised learning).
Linear Regression
What is Regression?
▪ Regression is a parametric technique used to predict a continuous (dependent) variable given a set of independent variables.
▪ It is parametric in nature because it makes certain assumptions (discussed next)
based on the data set.
▪ If the data set follows those assumptions, regression gives reliable results; otherwise, it struggles to provide convincing accuracy.
[Figure: regression predicts a dependent (output) variable y]
Linear Regression
We want to find the best line (linear function y=f(X))
to explain the data.
[Figure: scatter of y against X with the fitted regression line]
Dr. Muhammad Usman Arif; Applied Neural Networks 10/30/2023 15
[Figure: multiple regression; inputs x2 (Sex), x3 (Experience), x4 (Age) predict y (Income)]
Linear Regression
The predicted value of y is given by:
$\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j$
Y = b0 + b1X
▪ To find the coefficient values that minimize the objective function, we take the partial derivatives of the objective function (SSE) with respect to the coefficients, set them to 0, and solve.
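Carrying out that minimization for the simple model Y = b0 + b1X gives the standard closed-form estimates (the usual summation notation):

```latex
\hat{b}_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}
```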
Example I
▪ Find the least square regression line for the following set of data
{(-1 , 0),(0 , 2),(1 , 4),(2 , 5)}
x      | y       | xy       | x²
-1     | 0       | 0        | 1
0      | 2       | 0        | 0
1      | 4       | 4        | 1
2      | 5       | 10       | 4
Σx = 2 | Σy = 11 | Σxy = 14 | Σx² = 6
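Plugging the column sums into the closed-form least-squares formulas, a quick sanity check in Python (plain stdlib; variable names are mine):

```python
# Least-squares line through (-1, 0), (0, 2), (1, 4), (2, 5)
xs = [-1, 0, 1, 2]
ys = [0, 2, 4, 5]
n = len(xs)

sx = sum(xs)                              # Σx  = 2
sy = sum(ys)                              # Σy  = 11
sxy = sum(x * y for x, y in zip(xs, ys))  # Σxy = 14
sxx = sum(x * x for x in xs)              # Σx² = 6

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
b0 = (sy - b1 * sx) / n                         # intercept
print(round(b0, 2), round(b1, 2))  # 1.9 1.7, i.e. y = 1.9 + 1.7x
```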
Example II
• Find the least square regression line for the following set of data. Estimate y when x = 10.
x       | y       | xy       | x²
0       | 2       | 0        | 0
1       | 3       | 3        | 1
2       | 5       | 10       | 4
3       | 4       | 12       | 9
4       | 6       | 24       | 16
Σx = 10 | Σy = 20 | Σxy = 49 | Σx² = 30
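The same closed-form formulas give the Example II line and the requested estimate at x = 10 (a stdlib sketch; names are mine):

```python
# Least-squares line for x = 0..4, y = 2, 3, 5, 4, 6
xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]
n = len(xs)

sx, sy = sum(xs), sum(ys)                 # Σx = 10, Σy = 20
sxy = sum(x * y for x, y in zip(xs, ys))  # Σxy = 49
sxx = sum(x * x for x in xs)              # Σx² = 30

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # 45 / 50 = 0.9
b0 = (sy - b1 * sx) / n                         # 2.2
estimate = b0 + b1 * 10                         # predicted y at x = 10
print(round(estimate, 1))  # 11.2
```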
Example III
▪ The sales of a company (in million dollars) for each year are shown in the table
below.
Error Calculation
▪ Error is an inevitable part of the prediction-making process.
▪ No matter how powerful the algorithm we choose, there will always remain an irreducible error (ε) which reminds us that the "future is uncertain."
▪ We can only try to keep it as low as possible.
▪ Conceptually, the regression model tries to reduce the sum of squared
errors ∑[Actual(y) - Predicted(y')]² by finding the best possible value of regression
coefficients (β0, β1, etc).
Regression Model
▪ The first coefficient without an input is called
the intercept, and it adjusts what the model
predicts when all your inputs are 0.
Residual Errors
▪ We call the difference between the actual value and the model’s estimate a residual.
▪ If our collection of residuals is small, it implies that the model that produced them does a good job of predicting our output of interest.
▪ Conversely, if these residuals are generally large, it implies that the model is a poor estimator.
▪ We technically can inspect all of the residuals to judge the model’s accuracy but this
does not scale well.
▪ Statistical Computations
▪ Mean Absolute Error
▪ Mean Squared Error
▪ Root Mean Squared Error
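The three metrics differ only in how residuals are aggregated; a minimal sketch (function names are mine, the predictions are made up):

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average |residual|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean Squared Error: average squared residual."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: back in the units of the target."""
    return math.sqrt(mse(actual, predicted))

y_true = [2, 3, 5, 4, 6]            # actual values
y_pred = [2.2, 3.1, 4.0, 4.9, 5.8]  # some model's predictions (invented)
print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

Because residuals are squared before averaging, MSE and RMSE penalize large errors more heavily than MAE does.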
Interpreting MAE
▪ The MAE is also the most intuitive of the metrics since we’re just looking at the
absolute difference between the data and the model’s predictions.
▪ Because we use the absolute value of the residual, the MAE does not
indicate underperformance or overperformance of the model (whether or not the
model under or overshoots actual data).
▪ Each residual contributes proportionally to the total amount of error, meaning that
larger errors will contribute linearly to the overall error.
▪ A small MAE suggests the model is great at prediction, while a large MAE suggests
that your model may have trouble in certain areas.
▪ An MAE of 0 means that your model is a perfect predictor of the outputs (but this will almost never happen).
▪ MSE is measured in units that are the square of the target variable.
▪ RMSE is measured in the same units as the target variable.
Multiple Linear Regression
Slides in this section are taken from the Instructor Resources of Applied Statistics and Probability for Engineers
by Montgomery and Runger (John Wiley and Sons).
Introduction
• Many applications of regression analysis involve situations in which there is more than one regressor variable.
• A regression model that contains more than one
regressor variable is called a multiple regression
model.
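With several regressors the fit is usually computed in matrix form; a sketch using NumPy's least-squares solver on made-up data that satisfies y = 1 + 2·x1 + 3·x2 exactly (names and numbers are mine):

```python
import numpy as np

# Four observations of two regressor variables (invented data)
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 3.0, 4.0, 6.0])   # exactly 1 + 2*x1 + 3*x2

A = np.column_stack([np.ones(len(X)), X])     # prepend the intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
print(beta)  # ≈ [1. 2. 3.]
```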
Logistic Regression
Logistic Regression
Rather than fitting a line to the given data, logistic regression fits a curve to the data.
The curve gives us the probability of the output variable being 1 or 0 based on the independent attributes.
In our figure, this gives us the probability of a mouse being obese based on the weight of the mouse.
π = Proportion of “Success”
In ordinary regression the model predicts the
mean Y for any combination of predictors.
What’s the “mean” of a 0/1 indicator variable?
ȳ = Σyᵢ / n = (# of 1's) / (# of trials) = proportion of "success"
Goal of logistic regression: Predict the “true”
proportion of success, π, at any value of the
predictor.
Sigmoid Function
[Figure: sigmoid function plot; y rises from near 0 to near 1 as x goes from -10 to 12]

y = exp(b0 + b1·x) / (1 + exp(b0 + b1·x))
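The S-shaped curve above is the logistic (sigmoid) function; a minimal sketch (the coefficient values would come from a fitted model):

```python
import math

def logistic(x, b0, b1):
    """P(output = 1 | x) under a logistic model with coefficients b0, b1."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))  # same as 1 / (1 + exp(-z))

# The curve passes through 0.5 where b0 + b1*x = 0 and flattens toward 0 and 1
print(logistic(0.0, 0.0, 1.0))  # 0.5
```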
Maximum Likelihood
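Instead of minimizing squared error, logistic regression picks b0, b1 to maximize the likelihood of the observed 0/1 labels. A sketch of the quantity being maximized (my own function and variable names):

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of 0/1 labels ys at inputs xs under a logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))          # P(y = 1 | x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total
```

Because no closed-form solution exists, the fitting procedure searches iteratively for the coefficients at which this quantity peaks.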
Specification of ANN
▪ The number of input attributes found within individual instances determines the
number of input layer nodes.
▪ The user specifies the number of hidden layers as well as the number of nodes
within a specific hidden layer.
▪ Neural networks can be used both for classification (to predict the class label of a given tuple) and for prediction (to predict a continuous-valued output).
▪ For classification, one output unit (node) may be used to represent two classes (where the
value 1 represents one class, and the value 0 represents the other).
▪ If there are more than two classes, then one output unit per class is used.
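The two bullets above pin down the input and output layer sizes from the data itself; a small helper to that effect (a sketch, names are mine):

```python
def layer_sizes(records, labels):
    """Input nodes = number of attributes; output nodes = 1 for two
    classes, otherwise one per class."""
    n_in = len(records[0])
    n_classes = len(set(labels))
    n_out = 1 if n_classes == 2 else n_classes
    return n_in, n_out

# Four attributes (Give Birth, Can Fly, Live in Water, Have Legs), two classes
records = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 1]]
labels = ["mammals", "non-mammals", "non-mammals"]
print(layer_sizes(records, labels))  # (4, 1)
```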
Architecture of NN?
▪ How many neurons are required in the input layer?
Name Give Birth Can Fly Live in Water Have Legs Class
human yes no no yes mammals
python no no no no non-mammals
salmon no no yes no non-mammals
whale yes no yes no mammals
frog no no sometimes yes non-mammals
komodo no no no yes non-mammals
bat yes yes no yes mammals
pigeon no yes no yes non-mammals
cat yes no no yes mammals
leopard shark yes no yes no non-mammals
turtle no no sometimes yes non-mammals
penguin no no sometimes yes non-mammals
porcupine yes no no yes mammals
eel no no yes no non-mammals
salamander no no sometimes yes non-mammals
gila monster no no no yes non-mammals
platypus no no no yes mammals
owl no yes no yes non-mammals
dolphin yes no yes no mammals
eagle no yes no yes non-mammals
Architecture of NN?
▪ How many neurons are required in the input layer?
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
[Figure: inputs x1–x5 feed the input layer, then the hidden layer, then the output layer]

Output Format
[Figure: node j receives weighted inputs, e.g. w1j from node 1 and w3j from node 3]
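Each node in the hidden and output layers computes a weighted sum of its incoming values and squashes it with the sigmoid; a sketch of one node's computation (names are mine):

```python
import math

def node_output(incoming, weights, bias=0.0):
    """Sigmoid of the weighted sum of the node's incoming values."""
    z = sum(w * o for w, o in zip(weights, incoming)) + bias
    return 1 / (1 + math.exp(-z))

# e.g. node j fed by two upstream nodes through weights 0.5 and -0.5
print(node_output([0.6, 0.4], [0.5, -0.5]))  # sigmoid(0.1) ≈ 0.525
```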
Learning ANN
▪ The backpropagation algorithm performs learning on a multilayer
feed-forward neural network
▪ Learning is accomplished by modifying network connection
weights while a set of input instances is repeatedly passed
through the network.
▪ Once trained, an unknown instance passing through the network
is classified according to the value(s) seen at the output layer.
▪ Δwik = r × Error(k) × Oi
▪ where r is the learning-rate parameter (0 < r < 1), Error(k) is the computed error at node k, and Oi is the output of node i; each weight wik is incremented by Δwik.
Algorithm
▪ Initialize the network:
▪ Create the network topology by choosing the number of nodes for the input, hidden, and output layers.
▪ Initialize weights for all node connections to arbitrary values between -1.0 and 1.0.
▪ Choose a value between 0 and 1 for the learning parameter.
▪ Choose a terminating condition.
▪ For all the training instances:
▪ Feed the training instance through the network.
▪ Determine the output error.
▪ Update the network weights.
▪ If the terminating condition has not been met, repeat step 2.
▪ Test the accuracy of the network on a test dataset. If the accuracy is less than optimal,
change one or more parameters of the network topology and start over.
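The steps above can be sketched end to end on a tiny fixed network: one forward pass, one backward pass, one weight update, and the output error drops. All numbers below are invented for illustration, and the error terms use the usual sigmoid-derivative form.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Tiny network: 2 inputs -> 2 hidden nodes -> 1 output (all made-up numbers)
x, target = [1.0, 0.0], 1.0
W1 = [[0.2, -0.3], [0.4, 0.1]]   # W1[j][i]: weight from input i to hidden j
b1 = [0.1, -0.1]
W2 = [0.3, -0.2]                 # weights from hidden node j to the output
b2 = 0.2
r = 0.5                          # learning rate, 0 < r < 1

def forward():
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(2)]
    o = sigmoid(sum(W2[j] * h[j] for j in range(2)) + b2)
    return h, o

h, o = forward()
err0 = 0.5 * (target - o) ** 2

# Backward pass: output error uses the sigmoid derivative o*(1-o),
# then each hidden node receives its share of the output error.
delta_o = (target - o) * o * (1 - o)
delta_h = [h[j] * (1 - h[j]) * W2[j] * delta_o for j in range(2)]

# Update every weight by r * Error * (upstream output)
for j in range(2):
    W2[j] += r * delta_o * h[j]
    b1[j] += r * delta_h[j]
    for i in range(2):
        W1[j][i] += r * delta_h[j] * x[i]
b2 += r * delta_o

_, o_new = forward()
err1 = 0.5 * (target - o_new) ** 2   # smaller than err0 after the step
```

In practice the feed-forward and update steps repeat over the whole training set until the terminating condition is met.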
Training/Testing of ANN
▪ During the training phase, training instances are
repeatedly passed through the network while individual
weight values are modified.
▪ The purpose of changing the connection weights is to
minimize training set error rate.
▪ Network training continues until a specific terminating
condition is satisfied.
▪ The terminating condition can be convergence of the
network to a minimum total error value, a specific time
criterion, or a maximum number of iterations.