
Linear Regression

Linear regression
- involves solving problems with sets of variables when it is known that
some inherent relationship exists among the variables
Examples:
It is highly likely that for many example runs in which the inlet temperature is
the same, say 130°C, the outlet tar content will not be the same.

Automobiles with the same engine volume will not all have the
same gas mileage.
Dependent variables - the responses in these scenarios (tar content, gas
mileage)
Independent variables (regressors) - inlet temperature and engine
volume (cubic feet)

• A reasonable form of the relationship between the response Y and the regressor x is the linear
relationship
$Y = \beta_0 + \beta_1 x$
where $\beta_0$ is the intercept and $\beta_1$ is the slope.
• If the relationship is exact, then it is a deterministic relationship
between two scientific variables and there is no random or
probabilistic component to it.

• The concept of regression analysis deals with finding the best
relationship between Y and x, quantifying the strength of that
relationship, and using methods that allow for prediction of the
response values given values of the regressor x.
• For more than one regressor (i.e., more than one independent variable
that helps to explain Y), the multiple regression structure
might be written

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

The resulting analysis is termed multiple regression, while the analysis
of the single-regressor case is called simple regression:
$Y = \beta_0 + \beta_1 x$

Simple Linear Regression Model
In regression analysis, which deals with nondeterministic relationships
among variables, there must be a random component in the equation that
relates the variables. Indeed, in most applications of regression, the
linear equation, say $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, is an approximation that is
a simplification of something unknown and much more complicated.
• More often than not, models such as $Y = \beta_0 + \beta_1 x_1$ that are
simplifications of more complicated and unknown structures are linear in
nature. These linear structures are simple and empirical in nature and
are thus called empirical models.

• An analysis of the relationship between Y and x requires the statement
of a statistical model. The basis for the use of a statistical model
relates to how the random variable Y moves with x and the random
component.
The response Y is related to the independent variable x through the
equation
$Y = \beta_0 + \beta_1 x + \epsilon$

where $\beta_0$ and $\beta_1$ are the unknown intercept and slope parameters,
respectively, and $\epsilon$ is a random variable that is assumed to be distributed
with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$. The quantity $\sigma^2$ is often called the error
variance or the residual variance.
• From the model, the quantity Y is a random variable since $\epsilon$ is random.
The quantity $\epsilon$, often called a random error or random disturbance,
has constant variance. The presence of this random error keeps the
model from becoming simply a deterministic equation.
[Figure: data points with random errors $\epsilon_1, \ldots, \epsilon_5$ scattered about the true regression line $E(Y) = \beta_0 + \beta_1 x$]
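The role of the random error can be illustrated with a small simulation. This is only a sketch: the function name `simulate_responses`, the parameter values, and the choice of a normal distribution for the error are assumptions made for illustration (the model itself only requires mean 0 and constant variance).

```python
import random

def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw Y = beta0 + beta1*x + eps for each x.

    Assumption: eps is drawn from Normal(0, sigma^2); the model in the
    text only requires E(eps) = 0 and Var(eps) = sigma^2.
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Five runs at the same regressor value (hypothetical parameters).
xs = [30, 30, 30, 30, 30]
ys = simulate_responses(xs, beta0=3.0, beta1=1.0, sigma=2.0)
print(ys)  # same x every time, yet the responses differ

# With sigma = 0 the relationship becomes deterministic.
print(simulate_responses([1, 2], beta0=3.0, beta1=1.0, sigma=0.0))
```

Setting `sigma` to zero removes the random component, which is the deterministic case described earlier.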


The Fitted Regression Line
An important aspect of regression analysis is to estimate the
parameters $\beta_0$ and $\beta_1$, the so-called regression coefficients.

Denote the estimates by $b_0$ for $\beta_0$ and $b_1$ for $\beta_1$. Then the estimated or fitted
regression line is given by
$\hat{y} = b_0 + b_1 x$
where $\hat{y}$ is the predicted or fitted value. The fitted line is an estimate
of the true regression line.
In looking for the relationship between x and y, many
lines can be drawn. Of all the possible lines that can be
drawn, the one that is usually of most interest is
called the line of best fit or the least-squares
regression line. The least-squares regression line is
the line that fits the data better than any other line
that might be drawn, in the sense of the definition that follows.
Definition:
• The least-squares regression line for a set of bivariate
data is the line that minimizes the sum of the squares
of the vertical deviations from each data point to the
line.
• The procedure to determine a point between given
data points is referred to as interpolation.
• The process of using an equation to determine a point
to the right or left of the given data points is referred to as
extrapolation.
• Consider the experimental data in the table, which were obtained from
33 samples of chemically treated waste in a study conducted at
Virginia Tech.

The readings are x, the percent reduction in total solids, and y, the percent
reduction in chemical oxygen demand.
Solids Reduction,   Oxygen Demand        Solids Reduction,   Oxygen Demand
x (%)               Reduction, y (%)     x (%)               Reduction, y (%)

 3                   5                   36                  34
 7                  11                   37                  36
11                  21                   38                  38
15                  16                   39                  37
18                  16                   39                  36
27                  28                   39                  45
29                  27                   40                  39
30                  25                   41                  41
30                  35                   42                  40
31                  30                   42                  44
31                  40                   43                  37
32                  32                   44                  44
33                  34                   45                  46
33                  32                   46                  46
34                  34                   47                  49
36                  37                   50                  51
36                  38
The data in the table are plotted in a scatter diagram, as shown in the figure. From
inspection, the points closely follow a straight line, indicating that the
assumption of linearity between the two variables appears to be
reasonable.
Least Squares and the Fitted Model
Fitting an estimated regression line to the data requires
determining the estimates $b_0$ for $\beta_0$ and $b_1$ for $\beta_1$ and computing
the predicted values from the equation

$\hat{y} = b_0 + b_1 x$
A residual is essentially an error in the fit of this model.

Definition
Residual (error in fit): Given a set of regression data $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$
and a fitted model $\hat{y}_i = b_0 + b_1 x_i$, the ith residual $e_i$ is given
by
$e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n$
If the n residuals are large, then the fit of the model is not good.
Small residuals are a sign of a good fit.
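As a minimal sketch of the definition, the snippet below computes the residuals of a candidate line for a few points. The data, the coefficients `b0 = 1.0` and `b1 = 2.0`, and the helper name `residuals` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def residuals(xs, ys, b0, b1):
    """Return e_i = y_i - (b0 + b1 * x_i) for each observation."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical data and a hypothetical fitted line y_hat = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.5, 7.0]
e = residuals(xs, ys, b0=1.0, b1=2.0)
sum_sq = sum(r * r for r in e)  # one overall measure of how large the residuals are

print(e)       # one residual per data point
print(sum_sq)  # smaller values indicate a closer fit
```

A line that passes closer to the points gives residuals nearer to zero, and therefore a smaller sum of squared residuals.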

Equation

$y_i = b_0 + b_1 x_i + e_i$

Bear in mind that the $\epsilon_i$ are not observed, whereas the $e_i$ are not only observed
but also play an important role in the total analysis.
The Method of Least Squares

The estimates $b_0$ and $b_1$ are chosen so that the sum of the squares of
the residuals is a minimum. The residual sum of squares is often called the sum of
squares of the errors about the regression line and is denoted by SSE.
This minimization procedure is called the method of least squares.
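The step from minimizing SSE to the normal equations is not shown in the slides. One standard way to obtain them, sketched here on the assumption that the minimum is found by setting the partial derivatives of SSE to zero:

```latex
\begin{align*}
SSE &= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial (SSE)}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0 \\
\frac{\partial (SSE)}{\partial b_1} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)\, x_i = 0 \\
\intertext{Rearranging the two conditions gives the normal equations:}
n b_0 + b_1 \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} y_i \\
b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} x_i y_i
\end{align*}
```

These are two linear equations in the two unknowns $b_0$ and $b_1$.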
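Normal equations, solved for $b_1$ and $b_0$:

$b_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$b_0 = \frac{\sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i}{n} = \bar{y} - b_1 \bar{x}$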
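The formulas for the slope and intercept translate directly into code. The following is a minimal sketch in plain Python; the function name `least_squares_fit` and the three-point data set are illustrative assumptions, not part of the original material.

```python
def least_squares_fit(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = (n * sum_xy - sum_x * sum_y) / denom  # slope
    b0 = (sum_y - b1 * sum_x) / n              # intercept
    return b0, b1

# Hypothetical points that lie exactly on y = 1 + 2x.
b0, b1 = least_squares_fit([0, 1, 2], [1, 3, 5])
print(b0, b1)
```

Because the sample points lie exactly on a line, the fit recovers that line's intercept and slope.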

Estimate the regression line for the pollution data.

$\sum_{i=1}^{33} x_i = 1104, \quad \sum_{i=1}^{33} y_i = 1124, \quad \sum_{i=1}^{33} x_i y_i = 41{,}355, \quad \sum_{i=1}^{33} x_i^2 = 41{,}086$

Therefore:
$b_1 = \frac{(33)(41{,}355) - (1104)(1124)}{(33)(41{,}086) - (1104)^2} = 0.903643$

$b_0 = \frac{1124 - (0.903643)(1104)}{33} = 3.829633$
The estimated regression line is given by

$\hat{y} = 3.8296 + 0.9036x$

From the estimated regression line, we would predict a 31% reduction in the chemical oxygen demand
when the reduction in total solids is 30%. Such estimates are subject to
error: the original data show that reductions of 25% and 35%
in oxygen demand were recorded when the reduction
in total solids was 30%.
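The arithmetic for the pollution data can be checked from the four summary sums alone. This is a sketch using only the totals quoted in the text; the variable names are illustrative.

```python
# Summary sums for the 33 pollution samples, as quoted in the text.
n = 33
sum_x = 1104      # sum of x_i
sum_y = 1124      # sum of y_i
sum_xy = 41355    # sum of x_i * y_i
sum_x2 = 41086    # sum of x_i ** 2

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
b0 = (sum_y - b1 * sum_x) / n                                   # intercept

# Predicted oxygen demand reduction at a 30% solids reduction.
y_hat_30 = b0 + b1 * 30

print(round(b1, 6), round(b0, 6))
print(round(y_hat_30, 2))
```

The observed values at x = 30 were 25% and 35%, so the prediction sits between them rather than matching either one.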
Example from the book Mathematics of the Modern World (Rex Bookstore)

Speed for Selected Stride Lengths of Adult Individuals

Stride length (m)   2.5   3.0   3.3   3.5   3.8   4.0   4.2   4.5
Speed (m/s)         3.4   4.9   5.5   6.6   7.0   7.7   8.3   8.7
Finding the relationship between stride length and speed

Stride length is defined as the distance x from a
particular point on a footprint to that same point on the
next footprint of the same foot.
Table of values
n      Stride length (x)   Speed (y)   xy       $x^2$
1      2.5                 3.4         8.5      6.25
2      3.0                 4.9         14.7     9
3      3.3                 5.5         18.15    10.89
4      3.5                 6.6         23.1     12.25
5      3.8                 7.0         26.6     14.44
6      4.0                 7.7         30.8     16
7      4.2                 8.3         34.86    17.64
8      4.5                 8.7         39.15    20.25

sum    28.8                52.1        195.86   106.72
mean   3.6                 6.5125
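The columns and totals of the table can be reproduced in a few lines. This is a sketch in plain Python using the stride and speed values from the table; the variable names are illustrative.

```python
stride = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]  # x, in metres
speed = [3.4, 4.9, 5.5, 6.6, 7.0, 7.7, 8.3, 8.7]   # y, in metres per second

# One row per observation: (x, y, x*y, x^2).
rows = [(x, y, x * y, x * x) for x, y in zip(stride, speed)]

sum_x = sum(r[0] for r in rows)
sum_y = sum(r[1] for r in rows)
sum_xy = sum(r[2] for r in rows)
sum_x2 = sum(r[3] for r in rows)
mean_x = sum_x / len(rows)
mean_y = sum_y / len(rows)

for i, (x, y, xy, x2) in enumerate(rows, start=1):
    print(i, x, y, round(xy, 2), round(x2, 2))
print("sum", round(sum_x, 2), round(sum_y, 2), round(sum_xy, 2), round(sum_x2, 2))
print("mean", round(mean_x, 4), round(mean_y, 4))
```

The values are rounded only for display, since floating-point sums can differ from the hand-computed totals in the last few digits.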
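$\left(\sum x\right)^2 = (28.8)^2 = 829.44$
$b_1 = \frac{8(195.86) - (28.8)(52.1)}{8(106.72) - 829.44} = 2.730263158$
$b_0 = \bar{y} - b_1 \bar{x}$
$b_0 = 6.5125 - (3.6)(2.730263158)$
$b_0 = -3.3164$
$\hat{y} = -3.3164 + 2.7303x$
Use the equation of the least-squares line to predict the
average speed of an adult man for each of the following stride
lengths:
a. 3.7 m
b. 4.7 m

Solution:
a. $\hat{y} = -3.3164 + 2.7303(3.7)$
$\hat{y} \approx 6.786$ m/s (interpolation, since 3.7 m lies within the observed stride lengths)

b. $\hat{y} = -3.3164 + 2.7303(4.7)$
$\hat{y} \approx 9.516$ m/s (extrapolation, since 4.7 m lies beyond the largest observed stride length, 4.5 m)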
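The whole stride-length example, from fitting to prediction, can be sketched as follows. The helper names `fit_line` and `predict` are illustrative assumptions. The slope is computed here from deviations about the means, which is algebraically the same as the sums form used in the hand calculation.

```python
stride = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]  # x, in metres
speed = [3.4, 4.9, 5.5, 6.6, 7.0, 7.7, 8.3, 8.7]   # y, in metres per second

def fit_line(xs, ys):
    """Least-squares fit using deviations from the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def predict(x, b0, b1, xs):
    """Return the fitted value and whether x is inside the observed range."""
    kind = "interpolation" if min(xs) <= x <= max(xs) else "extrapolation"
    return b0 + b1 * x, kind

b0, b1 = fit_line(stride, speed)
for x in (3.7, 4.7):
    y_hat, kind = predict(x, b0, b1, stride)
    print(x, round(y_hat, 3), kind)
```

Predictions labelled as extrapolation rely on the linear trend continuing outside the observed stride lengths, so they deserve more caution than interpolated ones.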
