
Chapter 83

Linear Regression Analysis
83.1 Basic Concepts
83.2 Pre-Processing of Data
83.3 Method of Least Squares
83.4 Model Validation
83.5 Goodness of Fit
83.6 Worked Example No 1
83.7 Multiple Linear Regression
83.8 Variable Selection
83.9 Worked Example No 2
83.10 Worked Example No 3
83.11 Comments

Regression analysis is concerned with curve fitting. Given a set of empirical data relating two or more variables, what is the best straight line or curve that fits the data? Whereas the data may be plotted and a line or curve fitted by inspection, regression analysis is a more rational means of doing so. The most commonly used method is the so-called "least squares" approach, which minimises the error involved in curve fitting. Regression analysis provides the basis for principal components analysis (PCA) and for statistical process control (SPC), as described in Chapters 101 and 102 respectively. In both cases the interest is in understanding the relationship between process variables.

The concept of least squares is used extensively in advanced process control. For example, for problems that can be formulated in a quadratic form, least squares provides the basis for most optimisation techniques. Another example of its use is in identification, the process used to establish dynamic models in which time is the dependent variable.

The principles of the least squares method are covered in this chapter in relation to both simple and multiple regression analysis. The principles of least squares are covered in many texts on modern control and related topics, such as the text on optimisation by Edgar (2001). Regression analysis is obviously covered in most standard texts on statistics.

83.1 Basic Concepts

Consider some phenomenon for which there is an underlying physical relationship of a linear nature:

y = \beta_0 + \beta_1 x   (83.1)

If the β coefficients are known, for any value of x the corresponding value of y is uniquely determined. Conversely, any two pairs of x and y values will enable the β coefficients to be determined. However, with empirical data there are measurement errors and other sources of inaccuracy. Provided these errors are small, for a set of values of x the corresponding values of y may be plotted and a straight line graph drawn. There will be little scatter and the β coefficients can be estimated with confidence from the intercept and slope. However, if the errors are significant, for any given value of x there will be an error on the measured value of y according to:

y = \beta_0 + \beta_1 x + \varepsilon   (83.2)

Often the dependent variable y is referred to as the response or output, and the independent variable x as the regressor, predictor or input variable. The error is referred to as the residual.

The objective of regression analysis is to estimate the β coefficients. This is realised by means of the least squares method which, for a set of empirically determined x and y values, minimises the residuals. Thus, for a given value of x, the value of y predicted is, at best, an estimate of the true value of y and is often referred to as the fitted value:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x   (83.3)

where the ^ denotes an estimated value.

Equation 83.1 is the so-called "simple linear regression" model. If the model contains powers of the independent variable, as in the following quadratic example, it is referred to as a polynomial regression model. It is nevertheless still linear with respect to the β coefficients:

y = \beta_0 + \beta_1 x + \beta_2 x^2   (83.4)

Indeed, logarithmic and other nonlinear functions may be included in linear regression models, such as:

\log(y) = \beta_0 + \beta_1 \log(x)   (83.5)

When there are several independent variables involved, the model is referred to as being a multiple linear regression (MLR) model:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3   (83.6)
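Because models such as Equations 83.4 and 83.5 are linear in the β coefficients, they can be fitted by exactly the same machinery as Equation 83.1 once columns for x, x², log(x), etc. have been constructed. A minimal sketch in Python/NumPy, using made-up data purely for illustration:

```python
import numpy as np

# Illustrative data only: y is roughly quadratic in x, with added noise
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 1.5 * x - 0.3 * x**2 + np.random.normal(0.0, 0.5, x.size)

# Quadratic model of Equation 83.4: columns are 1, x and x^2, so the
# model is nonlinear in x but still linear in the beta coefficients
X = np.column_stack([np.ones_like(x), x, x**2])

# Least squares estimates of beta0, beta1 and beta2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [2.0, 1.5, -0.3]
```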
83.2 Pre-Processing of Data

Prior to carrying out a regression analysis, it may be necessary to pre-process the data available. The objective is to identify periods of unrepresentative data and unusual events, with a view to removing suspect data. Pre-processing essentially involves common sense. Plot the data to get a feel for the relationships: trends can be observed much more readily from graphs than from tables. Question any apparent anomalies:

• Points which are inconsistent with the trend, referred to as outliers, are probably either caused by some unusual event or are false readings, and may be rejected. However, they should only be rejected if there is strong non-statistical evidence that they are abnormal.
• Sometimes points are known in advance from the nature of the relationship, such as the graph passing through the origin. Check whether the facts are supported by the data.
• Data which is excessively noisy may be filtered.
• Calculate summary statistics, e.g. mean, median, standard deviation, correlation coefficients, etc.

For MLR analysis:

• Cross-correlation tests using Equation 82.16 on both the input and output variables may be necessary to reveal time delays and dependencies.
• Standardisation of the data so that it has zero mean and unit variance may be necessary. This requires subtraction of the mean and division by the standard deviation. Such standardising makes the data independent of the scale and/or units of measurement and prevents one input overshadowing others (see the sketch below).
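As a sketch of the standardisation step, assuming the data is held column-wise in a NumPy array (the numbers are illustrative only):

```python
import numpy as np

def standardise(data):
    """Scale each column to zero mean and unit variance
    (subtract the column mean, divide by the column standard deviation)."""
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)   # sample standard deviation
    return (data - mean) / std, mean, std

# Example: three inputs with very different scales and units
raw = np.array([[345.5, 1.19, 95.8],
                [348.2, 1.21, 96.3],
                [343.9, 1.17, 95.1],
                [346.7, 1.22, 96.0]])
scaled, mean, std = standardise(raw)
print(scaled.mean(axis=0))           # ~0 for every column
print(scaled.std(axis=0, ddof=1))    # 1 for every column
```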
83.3 Method of Least Squares

Figure 83.1 depicts n sets of measurements (x1, y1), (x2, y2), ..., (xn, yn). The aim is to find the underlying linear relationship of the form of Equation 83.1. Estimates of the regression coefficients β0 and β1 are required such that the best fit is obtained.

[Fig. 83.1 Residuals on n sets of measurements: points scattered about a line of intercept β0 and slope β1, with the residuals εj shown as vertical deviations from the line]

For each measurement, i.e. for each value of x, let the residual (the error between the observed value of y and its underlying true value) be:

\varepsilon_j = y_j - (\beta_0 + \beta_1 x_j)

Dependency upon the sign of the residual is removed by squaring. Thus the sum of the squares of the residuals for all the measurements is given by:

Q = \sum_{j=1}^{n} \varepsilon_j^2 = \sum_{j=1}^{n} \left( y_j - (\beta_0 + \beta_1 x_j) \right)^2   (83.7)

The straight line best fitting the data corresponds to Q being a minimum. Clearly Q is a function of both β0 and β1. Thus, differentiating Q with respect to each of β0 and β1 and setting the differentials to zero establishes that minimum.

Differentiating Equation 83.7 with respect to β0 gives:

\frac{\partial Q}{\partial \beta_0} = -2 \sum_{j=1}^{n} \left( y_j - (\beta_0 + \beta_1 x_j) \right)

For a minimum:

\sum_{j=1}^{n} \left( y_j - (\beta_0 + \beta_1 x_j) \right) = 0

Strictly speaking, to prove that this is a minimum a positive second differential should be established: just assume that to be the case. Whence:

\sum_{j=1}^{n} y_j = n \beta_0 + \beta_1 \sum_{j=1}^{n} x_j

Dividing throughout by n and rearranging gives:

\beta_0 = \bar{y} - \beta_1 \bar{x}   (83.8)

where x̄ and ȳ denote the average observed values.

Similarly, differentiating Equation 83.7 with respect to β1 gives:

\frac{\partial Q}{\partial \beta_1} = -2 \sum_{j=1}^{n} x_j \left( y_j - (\beta_0 + \beta_1 x_j) \right)

whence, for a minimum:

\sum_{j=1}^{n} x_j y_j = \beta_0 \sum_{j=1}^{n} x_j + \beta_1 \sum_{j=1}^{n} x_j^2

This, together with Equation 83.8, forms a set of two simultaneous equations which can be solved algebraically for the two unknowns β0 and β1. Extensive manipulation yields:

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

and:

\hat{\beta}_1 = \frac{\sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^{n} (x_j - \bar{x})^2} = \frac{(x - \bar{x}) \cdot (y - \bar{y})^T}{(x - \bar{x}) \cdot (x - \bar{x})^T}   (83.9)

The two equations at Equation 83.9 enable the regression coefficients to be determined directly from the empirical data available. In essence, the value of β1 is found first: that value is then used to find the value of β0. Note that, unless n is large, the coefficients found cannot be anything other than estimates of the regression coefficients, hence they are denoted as estimates in Equation 83.9.
83.4 Model Validation

Validating the model is essentially a question of confirming that the regression analysis has produced sensible results. The most likely problems are due to:

• Insufficient or inadequate empirical data
• The model fitted being inappropriate, e.g. a straight line instead of a quadratic
• The effect of variables not included in the model, e.g. a simple regression instead of a multiple regression model

Validation is best realised by consideration of the residuals. For any particular input, the residual is the difference between the measured output and its fitted value according to Equation 83.3:

\varepsilon_j = y_j - (\hat{\beta}_0 + \hat{\beta}_1 x_j)   (83.10)

If the model is correct, the population of residuals should have zero mean, constant variance and be normally distributed. Any plot of the residuals should demonstrate such characteristics or, if the data is sparse, should at least not contradict them. Otherwise the plot should not exhibit any structure: non-random looking patterns must be considered to be suspect. The most useful diagnostic plot is that of the residuals vs the fitted values. The points should have a random distribution along the fitted value axis and a normal distribution along the residuals axis.

An auto-correlation analysis on the residuals is a particularly useful means of validation. The closer the auto-correlation function, as determined by Equation 82.18, is to zero, the better the regression model fits the data.
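These residual checks are straightforward to automate. A sketch of the basic diagnostics for a simple linear model (illustrative only; the full auto-correlation function of Equation 82.18 is approximated here by the lag-1 coefficient):

```python
import numpy as np

def residual_diagnostics(x, y, b0, b1):
    """Basic residual checks: mean, spread and lag-1 auto-correlation."""
    fitted = b0 + b1 * x
    resid = y - fitted                  # Equation 83.10
    r = resid - resid.mean()
    # Lag-1 auto-correlation: should be near zero for a well-fitting model
    lag1 = np.sum(r[1:] * r[:-1]) / np.sum(r * r)
    return {
        "mean": resid.mean(),           # should be close to zero
        "std": resid.std(ddof=1),       # should be roughly constant over x
        "lag1_autocorrelation": lag1,
    }
```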

83.5 Goodness of Fit

This is a measure of how well a regression equation fits the data from which it was derived. Consider the identity:

y - \bar{y} = (y - \hat{y}) + (\hat{y} - \bar{y})

Squaring both sides gives:

(y - \bar{y})^2 = (y - \hat{y})^2 + 2 (y - \hat{y})(\hat{y} - \bar{y}) + (\hat{y} - \bar{y})^2

Thus, for a series of n data points:

\sum_{j=1}^{n} (y_j - \bar{y})^2 = \sum_{j=1}^{n} (y_j - \hat{y}_j)^2 + 2 \sum_{j=1}^{n} (y_j - \hat{y}_j)(\hat{y}_j - \bar{y}) + \sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2

If the regression is a good fit, and provided n is fairly large, then both:

\sum_{j=1}^{n} (y_j - \hat{y}_j) \approx 0 \quad \text{and} \quad \sum_{j=1}^{n} (\hat{y}_j - \bar{y}) \approx 0

whence:

\sum_{j=1}^{n} (y_j - \bar{y})^2 = \sum_{j=1}^{n} (y_j - \hat{y}_j)^2 + \sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2

This may be thought of as:

Total variation = unexplained variation + explained variation

The coefficient of determination R², sometimes referred to as the goodness of fit coefficient, is then articulated as the ratio of explained variation to total variation:

R^2 = \frac{\sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2}{\sum_{j=1}^{n} (y_j - \bar{y})^2} = \frac{(\hat{y} - \bar{y})^T (\hat{y} - \bar{y})}{(y - \bar{y})^T (y - \bar{y})}   (83.11)

where y = (y_1 \; y_2 \; \cdots \; y_n)^T.

The coefficient of determination lies between 0 and 1. The closer R² is to one, the nearer the fitted values are to the observed values and the better the regression model fits the data. A value for R² of 0.9 is excellent, 0.8 is good, 0.7 is OK, 0.6 is suspect and 0.5 or less is useless.
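In code, Equation 83.11 is essentially a one-liner once the fitted values are available. A sketch:

```python
import numpy as np

def r_squared(y, fitted):
    """Coefficient of determination: explained variation over total variation."""
    y_bar = y.mean()
    explained = np.sum((fitted - y_bar) ** 2)
    total = np.sum((y - y_bar) ** 2)
    return explained / total
```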
83.6 Worked Example No 1

An experiment was set up to establish the variation of specific heat of a substance with temperature. Results of measurements taken at each of a series of temperatures are as shown in Table 83.1. The mean values are θ̄ = 72.5 and c̄p = 7.039.

Table 83.1 Specific heat vs temperature data

θ (°C)         50    55    60    65    70    75    80    85    90    95
cp (kJ/kg °C)  6.72  6.91  6.85  6.97  7.01  7.12  7.14  7.22  7.18  7.27
Noting that the temperature is the input and the specific heat is the output, the β coefficients may be calculated from Equation 83.9:

\hat{\beta}_1 = \frac{(\theta - \bar{\theta}) \cdot (c_p - \bar{c}_p)^T}{(\theta - \bar{\theta}) \cdot (\theta - \bar{\theta})^T} = 0.0113

where θ and cp are row vectors of the measured values, and from Equation 83.8:

\hat{\beta}_0 = \bar{c}_p - \hat{\beta}_1 \bar{\theta} = 6.221

whence:

\hat{c}_p = 6.221 + 0.0113 \, \theta

The coefficient of determination is given by Equation 83.11:

R^2 = \frac{(\hat{c}_p - \bar{c}_p) \cdot (\hat{c}_p - \bar{c}_p)^T}{(c_p - \bar{c}_p) \cdot (c_p - \bar{c}_p)^T} \approx 0.93

where ĉp is the row vector of the fitted values. The coefficient indicates that some 93% of the variability in the specific heat is explained by the change in temperature.
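The whole of this worked example can be reproduced numerically. A sketch using the data of Table 83.1:

```python
import numpy as np

# Data from Table 83.1
theta = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95], dtype=float)
cp = np.array([6.72, 6.91, 6.85, 6.97, 7.01, 7.12, 7.14, 7.22, 7.18, 7.27])

# Equations 83.9 and 83.8
b1 = np.sum((theta - theta.mean()) * (cp - cp.mean())) \
     / np.sum((theta - theta.mean()) ** 2)
b0 = cp.mean() - b1 * theta.mean()

# Equation 83.11
fitted = b0 + b1 * theta
r2 = np.sum((fitted - cp.mean()) ** 2) / np.sum((cp - cp.mean()) ** 2)

print(b0, b1, r2)   # approximately 6.221, 0.0113 and 0.93
```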
83.7 Multiple Linear Regression

As seen in Equation 83.6, an MLR model involves several inputs and one output. Due to measurement errors, for any given set of values of x there will be an error on the measured value of y according to:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon   (83.12)

Regression analysis involves estimating the various β coefficients, for which n sets of empirical data are required. This data may be collated in matrix form as follows:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}

which is of the general form:

y = X \beta + \varepsilon   (83.13)

where y and ε are (n×1) vectors, X is an (n×(p+1)) matrix and β is a ((p+1)×1) vector.

The sum of the squares of the residuals may be formulated according to:

Q = \sum_{j=1}^{n} \varepsilon_j^2 = \varepsilon^T \varepsilon = (y - X\beta)^T (y - X\beta) = y^T y - \beta^T X^T y - y^T X \beta + \beta^T X^T X \beta

Noting that \beta^T X^T y = (y^T X \beta)^T and that both \beta^T X^T y and y^T X \beta are scalar quantities:

Q = y^T y - 2 \beta^T X^T y + \beta^T X^T X \beta

The regression equation that best fits the data corresponds to the vector β̂ that minimises Q. Noting that differentiation of a scalar by a vector is covered in Chapter 79, the derivative of Q with respect to β is given by:

\frac{\partial Q}{\partial \beta} = -2 X^T y + 2 X^T X \beta

Setting this to zero yields the best estimate of the vector β̂:

-X^T y + X^T X \hat{\beta} = 0

whence the so-called batch least squares (BLS) solution:

\hat{\beta} = (X^T X)^{-1} X^T y   (83.14)

The inverse of XᵀX should exist provided that the inputs are linearly independent, i.e. no column of the X matrix is a linear combination of the other columns. The less the "collinearity", the greater the accuracy of the matrix inversion.

The vector of fitted values is thus given by:

\hat{y} = X \hat{\beta} = X (X^T X)^{-1} X^T y = H y   (83.15)

where H = X (XᵀX)⁻¹ Xᵀ.

Analogous to Equation 83.10, the residuals for MLR are defined to be the difference between the measured outputs and their fitted values:

\varepsilon = y - \hat{y} = y - X\hat{\beta} = y - Hy = (I - H) y

Note that the formula of Equation 83.11 used for calculating the coefficient of determination R² for simple linear regression also applies to multiple linear regression.
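A sketch of the batch least squares computation. Solving the normal equations with a linear solve (or using np.linalg.lstsq) is numerically preferable to forming the inverse of XᵀX explicitly, particularly when the columns of X are nearly collinear:

```python
import numpy as np

def batch_least_squares(X, y):
    """Equation 83.14: solve (X^T X) beta = X^T y for beta."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    fitted = X @ beta            # Equation 83.15
    residuals = y - fitted       # (I - H) y
    return beta, fitted, residuals
```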
83.8 Variable Selection

The key issue in MLR is the choice of variables. Given a number of possible inputs, how are the most important ones selected? Without doubt, the most important basis of selection is a knowledge of the relationships between the variables gained from an understanding of the underlying process and/or plant. If there is doubt about the significance of possible inputs, then selection may be aided by the use of cross correlation or principal components analysis, as explained in Chapter 101.

Otherwise, a systematic approach has to be adopted in which inputs are added to or deleted from a subset of inputs according to the significance of their effect on the regression analysis. The so-called "forward selection" approach starts with the simple linear regression model which contains the input x1 that has the biggest correlation with the output y. This correlation is in terms of absolute values and does not need to be standardised. The next input x2 to be added to the model is that which has the second highest sample correlation with the output and/or increases the coefficient of determination (R²) by more than any other input. This process of adding inputs continues until all the inputs are included, or the number of inputs is deemed to be sufficient, or the increase in R² is no longer significant.

The reverse selection approach is essentially the reverse of forward selection, starting with a model containing all the inputs and then eliminating the least significant. Although these processes have the semblance of being quantitative, it should be recognised that they are essentially subjective.
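A sketch of forward selection on this basis, assuming standardised data (so no intercept column is needed). The stopping threshold of 0.01 on the R² gain is an arbitrary illustration:

```python
import numpy as np

def forward_selection(X, y, min_gain=0.01):
    """Greedily add the input that most increases R-squared;
    stop when the improvement falls below min_gain."""
    def r2_of(cols):
        Xs = X[:, cols]
        beta = np.linalg.solve(Xs.T @ Xs, Xs.T @ y)   # Equation 83.14
        fitted = Xs @ beta
        return np.sum((fitted - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Try each remaining input and keep the one giving the highest R^2
        r2, j = max((r2_of(selected + [j]), j) for j in remaining)
        if r2 - best_r2 < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2
```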
83.9 Worked Example No 2

A polymerisation is carried out in a reactor batchwise. The end point of the reaction is imprecise, being some function of the mean molecular weight, but normally occurs between 8 and 12 h after the start of the batch. The extent of conversion (fractional) is determined by analysis of samples. The refractive index (dimensionless) and viscosity are measured on line. Data obtained for one batch is given in Table 83.2.

Table 83.2 Conversion, refractive index and viscosity vs time data

Time: t (h)           8.5    9.0    9.5    10.0   10.5   11.0   11.5
Conversion: y         0.781  0.843  0.841  0.840  0.850  0.852  0.855
Refractive index: r   1.533  1.428  1.567  1.496  1.560  1.605  1.487
Viscosity: v (kg/ms)  0.027  0.066  0.140  0.294  0.541  1.048  1.810

From an understanding of the process the following MLR model is proposed:

y = \beta_0 + \beta_1 r + \beta_2 \log(v)

There are seven sets of data which can be captured in the form of Equation 83.13, y = Xβ + ε:

\begin{bmatrix} 0.781 \\ 0.843 \\ 0.841 \\ 0.840 \\ 0.850 \\ 0.852 \\ 0.855 \end{bmatrix} = \begin{bmatrix} 1.0 & 1.533 & -1.569 \\ 1.0 & 1.428 & -1.181 \\ 1.0 & 1.567 & -0.854 \\ 1.0 & 1.496 & -0.532 \\ 1.0 & 1.560 & -0.267 \\ 1.0 & 1.605 & 0.020 \\ 1.0 & 1.487 & 0.258 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \end{bmatrix}

The estimates of the β coefficients are given by batch least squares, Equation 83.14:

\hat{\beta} = (X^T X)^{-1} X^T y = \begin{bmatrix} 1.010 \\ -0.100 \\ 0.0332 \end{bmatrix}

whence the regression equation:

y = 1.01 - 0.1 \, r + 0.0332 \log(v)

The fitted values are given by Equation 83.15:

\hat{y} = X \hat{\beta} = \begin{pmatrix} 0.804 & 0.828 & 0.824 & 0.842 & 0.845 & 0.850 & 0.869 \end{pmatrix}^T

Knowing that ȳ = 0.837, the coefficient of determination is found from Equation 83.11:

R^2 = \frac{(\hat{y} - \bar{y})^T (\hat{y} - \bar{y})}{(y - \bar{y})^T (y - \bar{y})} \approx 0.67

This means that some 67% of the variability in the conversion is explained by the changes in refractive index and viscosity. A relatively low coefficient could have been anticipated because inspection of the data reveals that it is not monotonic: the values of conversion and refractive index do not successively increase or decrease with time.
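This example too can be checked numerically. A sketch using the data of Table 83.2 (note that the base-10 logarithm reproduces the log(v) column of the X matrix above):

```python
import numpy as np

y = np.array([0.781, 0.843, 0.841, 0.840, 0.850, 0.852, 0.855])
r = np.array([1.533, 1.428, 1.567, 1.496, 1.560, 1.605, 1.487])
v = np.array([0.027, 0.066, 0.140, 0.294, 0.541, 1.048, 1.810])

# Design matrix of Equation 83.13: bias, refractive index, log10(viscosity)
X = np.column_stack([np.ones_like(y), r, np.log10(v)])

# Batch least squares, Equation 83.14, then R^2 from Equation 83.11
beta = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta
r2 = np.sum((fitted - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

print(beta)   # approximately [1.010, -0.100, 0.0332]
print(r2)     # approximately 0.67
```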
83.10 Worked Example No 3

A naphtha (C6–C8 hydrocarbons) stream is split in a column as depicted in Figure 83.2, the objective being to operate the column against a constraint on the maximum amount of ≥ C7s in the top product stream.

[Fig. 83.2 Outline P&I diagram for naphtha column, showing the locations of the measurements x1–x14 and the analyses y1–y3]

The various measurements and controls are as described in Table 83.3.

Table 83.3 Definition of variables for naphtha column

Variable  Units    Description
y1        %        Amount of benzene in top product (by analysis)
y2        %        Amount of ≤ C4s in top product (by analysis)
y3        %        Amount of ≥ C7s in top product (by analysis)
x1        m³/h     Controlled flow rate of feed to splitter
x2        °C       Inlet temperature of feed stream
x3        °C       Temperature at top of splitter
x4        bar (g)  Controlled pressure in overhead system
x5        m³/h     Controlled reflux stream flow rate
x6        m³/h     Manipulated flow rate of vapour top product stream (liquid equivalent)
x7        m³/h     Controlled flow rate of liquid top product stream
x8        °C       Controlled lower tray temperature
x9        %        Controlled level in splitter still
x10       °C       Temperature at bottom of splitter
x11       °C       Composition of bottoms (5% cut point)
x12       m³/h     Manipulated flow rate of bottom product stream
x13       m³/h     Controlled flow rate of second bottom product stream
x14       m³/h     Controlled flow rate of third bottom product stream

Plotting the raw data reveals the following insights:

• The splitter is being forced by the feed rate x1, which is cyclical, albeit with a small amplitude (period of approximately 25 samples).
• The composition y3 (of interest) and the top product vapour take off x6 seem to cycle in response to x1.
• The column is also forced by the top product liquid take off x7, which is progressively stepped upwards, and the bottom product take offs x13 and x14, which have large increases.
• The level x9 decreases to reflect the various increased take offs: the level controller seems to be detuned because, as the level changes, the take off x12 doesn't vary much.
• There is an apparent connection between the level x9, the feed temperature x2 and the bottom product composition x11.

A total of 145 sets of data for these 17 variables was gathered at 5-min intervals. Summary statistics for pre-processing of the data are as shown in Table 83.4. The standard deviation is calculated from Equation 82.3 and the cross correlation function using Equation 82.16.

Table 83.4 Summary statistical data for variables of naphtha column

Variable  Units    Mean    Standard deviation  Percent  Xcorr fn
y1        %        2.325   0.1735              7.4      –
y2        %        1.497   0.2771              18.5     –
y3        %        0.7394  0.2447              33.0     0
x1        m³/h     345.5   6.120               1.7      12
x2        °C       131.0   1.175               0.9      74
x3        °C       75.05   0.7822              1.0      12
x4        bar (g)  1.191   0.0396              3.3      13
x5        m³/h     95.80   1.744               1.8      113
x6        m³/h     102.5   4.520               4.4      10
x7        m³/h     11.76   2.570               21.8     3
x8        °C       126.1   0.3364              0.2      108
x9        %        61.36   14.52               23.6     87
x10       °C       133.0   0.5096              0.3      11
x11       °C       93.76   0.9124              0.9      117
x12       m³/h     0.9258  0.4828              52.0     95
x13       m³/h     118.1   2.386               2.0      63
x14       m³/h     118.1   2.372               2.0      64

Inspection of the column headed "Percent" reveals that for variables x2, x8, x10 and x11 the standard deviation is ≤ 1% of the mean which, bearing in mind the accuracy of the instrumentation likely to have been used in their measurement, suggests that the errors in the measurements are likely to be more significant than the trends in the data. These variables are therefore discounted as being statistically suspect. Inspection of the column headed "Xcorr fn" reveals that the cross correlation functions for variables x2, x5, x8, x9 and x11 to x14 are all ≥ 13 samples or 65 min. These are not credible on the basis of a time delay, known a priori, of some 30 min for changes in the feed to affect changes in the product streams, and so they too are discounted.

Of the remaining variables the biggest delay is of 13 samples, in the variable x4, so in effect there are 132 complete sets of data representing a period of some 11 h, which is an excellent statistical basis. The data for each of these variables is shifted by the appropriate amount, standardised by subtracting the mean and dividing by the standard deviation, and assembled in the form of the X matrix of Equation 83.13. Note that since the data is standardised there is no need for a bias term β0. The regression coefficients are found by means of batch least squares, Equation 83.14, yielding the MLR model:

y_3 = 0.1863 x_1 + 0.3255 x_3 - 0.1492 x_4 + 0.4621 x_6 + 0.1867 x_7

The coefficient of determination is found from Equation 83.11 to be 0.824, which indicates that 82% of the variability in the data is accounted for by the model. But this is based upon all five credible inputs. By means of the reverse selection process it is found that a good fit is obtained by eliminating all the variables except for x3 and x6, which results in the MLR model:

y_3 = 0.5505 x_3 + 0.3988 x_6

for which the coefficient of determination is 0.783. Thus the fit is still good but the model much simpler. Given that the reflux rate x5 is relatively constant, it is to be expected that the top product composition would be highly correlated to the overhead temperature x3 and to the dominant distillate flow rate x6. A plot of the residuals versus fitted values is of a random nature, which confirms the validity of the model.
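A sketch of the pre-processing and fit described above. The arrays here are random stand-ins for the logged plant data (so the computed coefficients are meaningless); the delays are taken from the Xcorr column of Table 83.4, and no intercept column is needed because the data is standardised:

```python
import numpy as np

def align(col, delay, n_keep):
    """Shift an input back by its delay (in samples) so that x[t - delay]
    lines up with the output y[t]; keep the last n_keep rows."""
    end = col.size - delay
    return col[end - n_keep:end]

def standardise(col):
    """Zero mean, unit variance (so no bias term is required)."""
    return (col - col.mean()) / col.std(ddof=1)

# Hypothetical logged data: 145 samples per variable, delays from Table 83.4
n = 145
delays = {"x1": 12, "x3": 12, "x4": 13, "x6": 10, "x7": 3}
rng = np.random.default_rng(0)
logged = {name: rng.normal(size=n) for name in delays}   # stand-ins only
y3 = rng.normal(size=n)                                  # stand-in output

n_keep = n - max(delays.values())                        # 132 complete sets
X = np.column_stack([standardise(align(logged[k], d, n_keep))
                     for k, d in delays.items()])
y = standardise(y3[-n_keep:])

beta = np.linalg.solve(X.T @ X, X.T @ y)                 # Equation 83.14
```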
83.11 Comments

Regression models are intended for use as interpolation equations and are only as good as the original data from which they were derived. The models are only valid over the range of inputs and outputs used to fit the model, and extrapolation beyond these ranges should be treated with suspicion.

And finally, just because a regression model can be fitted to two or more variables, it doesn't necessarily imply a causal relationship. For example, as a reaction approaches completion both the viscosity and refractive index may change. There may well be a regression model between the viscosity and the refractive index, but it is not sensible to say that the change in viscosity is caused by the change in refractive index. In fact, the changes are primarily a function of the extent of reaction.
