Regression Analysis
The objective of many investigations is to understand and explain the relationship among
variables. Frequently, one wants to know how, and to what extent, a certain variable (the response
variable) is related to a set of other variables (the explanatory variables).
Regression analysis helps us determine the nature and the strength of the relationship among
variables.
Types of relationship:
i) Deterministic relationship, also called a functional relationship
ii) Probabilistic relationship, also called a statistical relationship
In a deterministic relationship the relationship between two variables is known exactly, such as:
a) Area of a circle $= \pi r^2$
b) $F = k\,\dfrac{m_1 m_2}{r^2}$ (Newton's law of gravity)
c) The relationship between the dollar sales (Y) of a product sold at a fixed price and the number of units
sold.
In a statistical relationship the relation between variables is not known exactly, and we have to approximate
the relationship and develop models that characterize its main features. Regression analysis is
concerned with developing such "approximating" models.
For example, in a chemical process where the yield of product is related to the operating temperature, it may be
of interest to build a model relating yield to temperature and then use the model for prediction, process
optimization, or process control.
The word regression is used to investigate the dependence of one variable, called the dependent variable
and denoted by Y, on one or more other variables, called independent variables and denoted by X's. Regression
provides an equation to be used for estimating or predicting the average value of the dependent variable from
the known values of the independent variables. When we study the dependence of a variable on a single
independent variable, it is called simple regression, whereas the dependence of a variable on two or
more independent variables is called multiple regression. When the parameters in the model appear
in linear form, we say that the model is linear.
The dependent variable is also called the predictand, the response, or the regressand, whereas the
independent variable is also called the predictor, the explanatory variable, or the regressor.
Regression analysis is generally classified into two kinds:
1. Linear Regression
   Simple Linear Regression
   Multiple Linear Regression
   Curvilinear Regression
2. Nonlinear Regression
   Intrinsically Linear
   Intrinsically Non-Linear
Linear:- The regression model is linear if the parameters in the model appear in linear form (that is, no
parameter appears as an exponent, or is multiplied or divided by any other parameter). Otherwise, the
model is non-linear.
Suppose
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2$$
where $\alpha$ & $\beta$ are parameters. It is a linear model.
But if
$$Y = \alpha X^{\beta} \quad \text{or} \quad Y = \alpha \beta^{X}$$
it is non-linear.
Non-Linear Model:- A non-linear model that can be linearized (that is, converted into a linear
model) by an appropriate transformation is called intrinsically linear, and one that cannot be so
transformed is called intrinsically non-linear.
e.g. $Y = \alpha X^{\beta}$: apply log on both sides, $\log(Y) = \log(\alpha) + \beta\log(X)$
$Y = \alpha \beta^{X}$: apply log on both sides, $\log(Y) = \log(\alpha) + X\log(\beta)$
Regressor:- The variable that forms the basis of estimation or prediction is called the regressor. It is also
called the independent, explanatory, controlled, or predictor variable, and is usually denoted by X.
Regressand:- The variable whose values depend upon the known values of the independent
variable is called the regressand. It is also called the response, dependent, or random variable, and is usually
denoted by Y.
In simple regression, the dependence of the response variable (Y) is investigated on only one
regressor (X). If the relationship between these variables can be described by a straight line, it is termed
simple linear regression.
The population simple linear regression model is defined as:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
Example:- Wing length (Y, in cm) was measured at various ages (X, in days), with n = 13 observations in all.
Solution:-
[Scatter plot of wing length (cm) against age (days)]
$$\bar{X} = 10 \qquad \bar{Y} = 3.415$$
$$S(XY) = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 70.8$$
$$S(XX) = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} = 262$$
$$S(YY) = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} = 19.6569$$
$$b_1 = \frac{S(XY)}{S(XX)} = 0.270 \text{ cm/day}$$
$$b_0 = \bar{Y} - b_1\bar{X} = 0.715 \text{ cm}$$
So the estimated simple linear regression equation is
$$\hat{Y} = 0.715 + 0.270\,X$$
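As a quick check, the estimates can be reproduced with a short script. Since the raw observations are not reprinted in this handout, the sketch below works from the summary quantities S(XY), S(XX) and the two means:

```python
# Least-squares slope and intercept from the reported summary statistics.
S_XY = 70.8    # corrected sum of cross-products, S(XY)
S_XX = 262.0   # corrected sum of squares of X, S(XX)
x_bar, y_bar = 10.0, 3.415

b1 = S_XY / S_XX           # slope ~ 0.270 cm/day
b0 = y_bar - b1 * x_bar    # intercept ~ 0.713 cm (handout rounds to 0.715)
print(f"Y-hat = {b0:.3f} + {b1:.3f} X")
```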
Interpretation of the estimated regression parameters
The value of b1 = 0.270 indicates that the average wing length is expected to increase by 0.270
cm with each one-day increase in age.
The observed range of age (the explanatory variable) in the experiment was 3 to 17 days (i.e., the scope of the
model); therefore it would be an unreasonable extrapolation to expect this rate of increase in wing length
to continue if the number of days were to increase further. It is safe to use the results of a regression only within the
range of the observed values of the independent variable (i.e., within the scope of the model).
In the regression equation, b0 = 0.715 is the average wing length when age = 0 days. In this example,
since the scope of the model does not cover X = 0, b0 does not have any particular meaning as a
separate term in the regression equation.
NOTE: Interpolation and Extrapolation
Interpolation is making a prediction within the range of values of the predictor in the sample used to
generate the model; interpolation is generally safe. Extrapolation is making a prediction outside the
range of values of the predictor in the sample used to generate the model. The more removed the
prediction is from the range of values used to fit the model, the riskier the prediction becomes, because
there is no way to check that the relationship continues to be linear.
$$S_e^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum Y^2 - b_0 \sum Y - b_1 \sum XY}{n-2} = 0.048 \quad \text{OR} \quad S_e^2 = \frac{0.525}{11} = 0.048$$
$$S_e = 0.218$$
$$SE(b_0) = S_e \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{S(XX)}} = 0.148 \qquad SE(b_1) = S_e \sqrt{\frac{1}{S(XX)}} = 0.0135$$
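These quantities can be verified numerically; a continuation of the earlier sketch, using n = 13 (the error degrees of freedom above are 11):

```python
import math

# Residual variance and standard errors for the wing-length example,
# computed from the values reported in the handout.
n, S_XX, x_bar = 13, 262.0, 10.0
SSE = 0.525                    # Sum(Y - Y-hat)^2, from the handout
Se2 = SSE / (n - 2)            # residual variance ~ 0.048
Se = math.sqrt(Se2)            # residual std. deviation ~ 0.218

SE_b1 = Se * math.sqrt(1 / S_XX)                   # ~ 0.0135
SE_b0 = Se * math.sqrt(1 / n + x_bar**2 / S_XX)    # ~ 0.148
print(round(Se, 3), round(SE_b0, 3), round(SE_b1, 4))
```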
Test of hypothesis for β₁
1) Construction of hypotheses
H₀: β₁ = 0
H₁: β₁ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
$$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{0.270 - 0}{0.0135} = 20.03$$
4) Decision Rule:- Reject H₀ if $t_{cal} \ge t_{\alpha/2}(n-2) = 2.201$ or $t_{cal} \le -t_{\alpha/2}(n-2) = -2.201$
5) Result:- Since 20.03 > 2.201, reject H₀.
Test of hypothesis for the mean value of Y at X = 13, i.e. $\mu_{Y|13}$
1) Construction of hypotheses
H₀: $\mu_{Y|13}$ = 4
H₁: $\mu_{Y|13}$ ≠ 4
2) Level of significance
α = 5%
3) Test statistic
$$t = \frac{\hat{Y}_{13} - \mu_{Y|13}}{SE(\hat{Y}_{13})} = \frac{4.225 - 4}{0.073} = 3.082$$
4) Decision Rule:- Reject H₀ if $t_{cal} \ge t_{\alpha/2}(n-2) = 2.201$ or $t_{cal} \le -t_{\alpha/2}(n-2) = -2.201$
5) Result:- Since 3.082 > 2.201, reject H₀.
Confidence interval for the mean value of Y, i.e. $\mu_{Y|X}$:
$$\hat{Y}_{13} \pm t_{\alpha/2}(n-2)\,SE(\hat{Y}_{13}) = 4.225 \pm (2.201)(0.073)$$
$$(4.064,\ 4.386)$$
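Both t-tests and the interval can be checked in a few lines. The sketch below assumes scipy is available for the critical value $t_{.025}(11) = 2.201$:

```python
from scipy import stats  # assumed available for the t critical value

n = 13
t_crit = stats.t.ppf(0.975, df=n - 2)   # ~ 2.201

# Test H0: beta1 = 0 (slope)
t_b1 = (0.270 - 0) / 0.0135             # ~ 20.0; reject H0

# Test H0: mean of Y at X = 13 equals 4
y13_hat, se_y13 = 4.225, 0.073
t_mu = (y13_hat - 4) / se_y13           # ~ 3.08; reject H0

# 95% confidence interval for the mean of Y at X = 13
lo = y13_hat - t_crit * se_y13          # ~ 4.064
hi = y13_hat + t_crit * se_y13          # ~ 4.386
print(round(t_b1, 2), round(t_mu, 3), (round(lo, 3), round(hi, 3)))
```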
4. $Y = a e^{bX}$ : $\ln(Y) = \ln(a) + bX$ : $Y^* = a^* + bX$
5. $Y = a + b\sqrt{X}$ : put $X^* = \sqrt{X}$ : $Y = a + bX^*$
6. $Y = aX^2 + bX$ : $\dfrac{Y}{X} = aX + b$, i.e. put $Y^* = Y/X$ : $Y^* = b + aX$
Example:- The number (Y) of bacteria per unit volume present in a culture after X hours is given in the
following table:

Y       X    Log(Y)=Y*   XY*       X²
32      0    1.50515     0         0
47      1    1.6721      1.6721    1
65      2    1.81291     3.6258    4
92      3    1.96379     5.8914    9
132     4    2.12057     8.4823    16
190     5    2.27875     11.3938   25
275     6    2.43933     14.636    36
833     21   13.7926     45.7014   91   (Totals)
Fit a least squares curve having the form $Y = ab^X$ to the data. Estimate the value of Y when X = 7.
We have to estimate the model $Y = ab^X$, for which the transformed line takes the form:
$$\log(Y) = \log(a) + X\log(b)$$
$$Y^* = a^* + b^*X$$
$$b^* = \frac{S(XY^*)}{S(XX)} = \frac{4.33}{28} = 0.154$$
$$a^* = \bar{Y}^* - b^*\bar{X} = 1.51$$
The regression equation is
$$\log(Y) = 1.51 + 0.154\,X$$
Now
Log(a) = 1.51 and Log(b) = 0.154
Antilog[Log(a)] = Antilog(1.51) = 32.36
Antilog[Log(b)] = Antilog(0.154) = 1.43
Hence the fitted curve is $\hat{Y} = 32.36\,(1.43)^X$. When X = 7, $\hat{Y} = \text{Antilog}(1.51 + 0.154(7)) = \text{Antilog}(2.588) \approx 387$ bacteria.
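The whole fit can be scripted directly from the table's X and Y columns; a minimal sketch in plain Python using base-10 logs, as in the handout:

```python
import math

# Fit Y = a * b**X by least squares on Y* = log10(Y).
X = [0, 1, 2, 3, 4, 5, 6]
Y = [32, 47, 65, 92, 132, 190, 275]
n = len(X)

Ystar = [math.log10(y) for y in Y]
S_xy = sum(x * ys for x, ys in zip(X, Ystar)) - sum(X) * sum(Ystar) / n
S_xx = sum(x * x for x in X) - sum(X) ** 2 / n

b_star = S_xy / S_xx                           # log(b) ~ 0.154
a_star = sum(Ystar) / n - b_star * sum(X) / n  # log(a) ~ 1.507 (handout rounds to 1.51)

a, b = 10 ** a_star, 10 ** b_star              # a ~ 32.1, b ~ 1.43
y7 = a * b ** 7                                # ~ 387 bacteria at X = 7
print(round(a, 2), round(b, 2), round(y7))
```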
If there are only two independent variables, then the Multiple Regression Model is:
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \qquad \text{(Population Regression Model)}$$
where
X₁ & X₂ are the independent variables and Y is the dependent variable;
α is the Y-intercept;
β₁ & β₂ are also called partial regression coefficients.
These parameters can be estimated from sample information by b₀, b₁ and b₂ as:
$$b_1 = \frac{S(X_2,X_2)S(X_1,Y) - S(X_1,X_2)S(X_2,Y)}{S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2}$$
$$b_2 = \frac{S(X_1,X_1)S(X_2,Y) - S(X_1,X_2)S(X_1,Y)}{S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2}$$
$$b_0 = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2$$
Interpretation of regression coefficients:
b₀ (the estimate of α) is the mean value of Y when X₁ = X₂ = 0.
b₁ is the average change (increase or decrease) in the response variable Y for a one-unit increase in the
explanatory variable X₁ when the effect of X₂ is held constant.
b₂ measures the average change in Y for a one-unit increase in X₂ when the effect of X₁ is held constant.
EXAMPLE: The following data represent the performance of a chemical process as a function of
several controllable process variables:
Y = CO2 Product, X1 = Total Solvent, X2 = Hydrogen Consumption

Y        X1        X2     Y²        X1²        X2²       X1Y      X2Y       X1X2
36.98    2227.25   2.06   1367.52   4960643    4.2436    82364    76.179    4588.1
13.74    434.90    1.33   188.79    189138     1.7689    5976     18.274    578.4
10.08    481.19    0.97   101.61    231544     0.9409    4850     9.778     466.8
8.53     247.14    0.62   72.76     61078      0.3844    2108     5.289     153.2
36.42    1645.89   0.22   1326.42   2708954    0.0484    59943    8.012     362.1
26.59    907.59    0.76   707.03    823720     0.5776    24133    20.208    689.8
19.07    608.05    1.71   363.66    369725     2.9241    11596    32.610    1039.8
5.96     380.55    3.93   35.52     144818     15.4449   2268     23.423    1495.6
15.52    213.40    1.97   240.87    45540      3.8809    3312     30.574    420.4
56.61    2043.36   5.08   3204.69   4175320    25.8064   115675   287.579   10380.3
229.50   9189.32   18.65  7608.87   13710479   56.0201   312224   511.926   20174.4   (Totals)
Ŷ          e = Y − Ŷ   e²
47.3928    -10.41      108.42633
13.30172   0.44        0.1920925
13.68672   -3.61       13.008438
8.901963   -0.37       0.1383564
34.23853   2.18        4.7588268
21.29525   5.29        28.034412
16.99981   2.07        4.2857005
15.69707   -9.74       94.810435
10.04365   5.48        29.990376
47.9425    8.67        75.125472
229.50     0.00        358.77      (Totals)
1. Fit a multiple linear regression relating CO2 product to total solvent and hydrogen consumption,
and calculate the value of R².
2. Test the significance of the regression.
3. Test the significance of the partial regression coefficients and construct confidence intervals.
4. Can we conclude that total solvent and hydrogen consumption are a sufficient set of
independent variables for explaining the variability in CO2 product?
From the column totals (n = 10):
$$\bar{Y} = 22.95 \qquad \bar{X}_1 = 918.932 \qquad \bar{X}_2 = 1.865$$
$$S(X_1,X_1) = 5266118.9 \qquad S(X_2,X_2) = 21.24 \qquad S(X_1,X_2) = 3036.32$$
$$S(X_1,Y) = 101329.11 \qquad S(X_2,Y) = 83.9085 \qquad S(Y,Y) = 2341.84$$
$$D = S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2 = 102633124.2$$
$$b_1 = \frac{S(X_2,X_2)S(X_1,Y) - S(X_1,X_2)S(X_2,Y)}{D} = 0.0185$$
$$b_2 = \frac{S(X_1,X_1)S(X_2,Y) - S(X_1,X_2)S(X_1,Y)}{D} = \frac{134212437.4}{102633124.2} = 1.31$$
$$b_0 = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 = 3.52$$
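As a check on the arithmetic, the sketch below recomputes D, b1, b2 and b0 directly from the raw Y, X1 and X2 columns of the table:

```python
# Two-predictor least squares via the corrected-sums formulas above.
Y  = [36.98, 13.74, 10.08, 8.53, 36.42, 26.59, 19.07, 5.96, 15.52, 56.61]
X1 = [2227.25, 434.90, 481.19, 247.14, 1645.89,
      907.59, 608.05, 380.55, 213.40, 2043.36]
X2 = [2.06, 1.33, 0.97, 0.62, 0.22, 0.76, 1.71, 3.93, 1.97, 5.08]
n = len(Y)

def S(u, v):
    """Corrected sum of cross-products: Sum(uv) - Sum(u)Sum(v)/n."""
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n

D  = S(X1, X1) * S(X2, X2) - S(X1, X2) ** 2               # ~ 102.6 million
b1 = (S(X2, X2) * S(X1, Y) - S(X1, X2) * S(X2, Y)) / D    # ~ 0.0185
b2 = (S(X1, X1) * S(X2, Y) - S(X1, X2) * S(X1, Y)) / D    # ~ 1.31
b0 = sum(Y) / n - b1 * sum(X1) / n - b2 * sum(X2) / n     # ~ 3.52
print(round(b0, 2), round(b1, 4), round(b2, 2))
```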
ANOVA TABLE

S.O.V        DF   SS        MSS = SS/df   Fcal     Ftab
Regression   2    1983.07   991.54        19.35*   F.05(2,7) = 4.74
Error        7    358.77    51.25
Total        9    2341.84
Coefficient of Determination
The coefficient of determination tells us the proportion of variation in the dependent variable explained
by the independent variables:
$$R^2 = \frac{Reg.SS}{Total\ SS} \times 100 = \frac{1983.07}{2341.84} \times 100 = 84.7\%$$
The value of R² indicates that about 85% of the variation in the dependent variable has been explained by the
linear relationship with X₁ & X₂, and the remainder is due to other, unknown factors.
Test of hypothesis about the significance of the partial regression coefficients:
Test of hypothesis for β₁
1) Construction of hypotheses
H₀: β₁ = 0
H₁: β₁ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
$$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{0.0185 - 0}{0.003257} = 5.68$$
where
$$SE(b_1) = S_e \sqrt{\frac{S(X_2,X_2)}{S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2}} = 7.16\sqrt{\frac{21.24}{102633124.2}} = 0.003257$$
Polynomial Regression
Example:- The data are regarding time in weeks [X] and the corresponding yield in kg [Y] of cotton
per plot in the specified period.
Put $X_1 = X$ and $X_2 = X^2$.
$$S(X_1,X_1) = \sum X_1^2 - \frac{(\sum X_1)^2}{n} = 82.50$$
$$S(X_2,X_2) = \sum X_2^2 - \frac{(\sum X_2)^2}{n} = 10510.50$$
$$S(X_1,Y) = \sum X_1 Y - \frac{(\sum X_1)(\sum Y)}{n} = 6860 - \frac{(55)(1266)}{10} = -103$$
$$S(X_2,Y) = \sum X_2 Y - \frac{(\sum X_2)(\sum Y)}{n} = 45974 - \frac{(385)(1266)}{10} = -2767$$
$$S(X_1,X_2) = \sum X_1 X_2 - \frac{(\sum X_1)(\sum X_2)}{n} = 907.50$$
$$S(Y,Y) = \sum Y^2 - \frac{(\sum Y)^2}{n} = 6402.40$$
$$D = S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2 = 43560$$
$$b_1 = \frac{S(X_2,X_2)S(X_1,Y) - S(X_1,X_2)S(X_2,Y)}{D} = 32.7932$$
$$b_2 = \frac{S(X_1,X_1)S(X_2,Y) - S(X_1,X_2)S(X_1,Y)}{D} = -3.0947$$
$$b_0 = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 = 65.3800$$
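The individual yields are not reprinted in this extract, but the coefficients can still be verified from the corrected sums above; the means follow from Sum(X1) = 55, Sum(X2) = 385, Sum(Y) = 1266 and n = 10:

```python
# Quadratic fit Y = b0 + b1*X + b2*X**2 from the corrected sums of squares.
S11, S22, S12 = 82.50, 10510.50, 907.50   # S(X1,X1), S(X2,X2), S(X1,X2)
S1Y, S2Y = -103.0, -2767.0                # S(X1,Y), S(X2,Y)
y_bar, x1_bar, x2_bar = 126.6, 5.5, 38.5  # means for n = 10, X = 1..10

D  = S11 * S22 - S12 ** 2                 # 43560
b1 = (S22 * S1Y - S12 * S2Y) / D          # ~ 32.7932
b2 = (S11 * S2Y - S12 * S1Y) / D          # ~ -3.0947
b0 = y_bar - b1 * x1_bar - b2 * x2_bar    # ~ 65.38
print(round(b0, 2), round(b1, 4), round(b2, 4))
```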
ANALYSIS OF VARIANCE
The hypothesis β₁ = β₂ = 0 may be tested by the analysis of variance procedure.
Total SS = S(Y,Y) = 6402.4
Reg.SS = b₁S(X₁,Y) + b₂S(X₂,Y) = (32.7932)(-103) + (-3.0947)(-2767) = 5185.3
ANOVA TABLE
Source Of Variation Degree of Sum of Squares Mean Sum Fcal Ftab
(S.O.V) Freedom (SS) of Squares
(DF) (MSS=SS/df)
Regression (X , X2) 2 5185.3 2592.7 14.91* F.05(2,7)=4.74
Error 7 1217.1 173.9
TOTAL 9 6402.40
Test of significance of the quadratic regression
1) Construction of hypotheses
H₀: β₂ = 0
H₁: β₂ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
$$t = \frac{b_2 - \beta_2}{SE(b_2)} = \frac{-3.0947 - 0}{0.5738} = -5.39$$
where
$$SE(b_2) = S_e \sqrt{\frac{S(X_1,X_1)}{S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2}} = 13.19\sqrt{\frac{82.50}{43560}} = 0.5738$$
4) Decision Rule:- Reject H₀ if $t_{cal} \ge t_{\alpha/2}(7) = 2.365$ or $t_{cal} \le -2.365$
5) Result:- Since -5.39 < -2.365, reject H₀: the quadratic term is significant.
Coefficient of Determination
The coefficient of determination tells us the proportion of variation in the dependent variable explained
by the independent variable:
$$R^2 = \frac{Reg.SS}{Total\ SS} \times 100 = \frac{5185.33}{6402.40} \times 100 = 81\%$$
The 2nd-degree curve is appropriate for the above data set.
The value of X at which the maximum or minimum value of the quadratic regression occurs is
$$X = \frac{-b_1}{2b_2} = 5.30$$
and the maximum or minimum value of Y is
$$Y = b_0 - \frac{b_1^2}{4b_2} = 152.28$$
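A two-line check of the vertex formulas (the small difference from the reported 152.28 is rounding in the coefficients):

```python
# Vertex of the fitted quadratic: location and extreme value.
b0, b1, b2 = 65.38, 32.7932, -3.0947
x_star = -b1 / (2 * b2)          # ~ 5.30 weeks
y_star = b0 - b1**2 / (4 * b2)   # ~ 152.3 kg
print(round(x_star, 2), round(y_star, 2))
```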
[Figure: scatter plot of yield (y) against time (x1) with the fitted SIMPLE LINEAR REGRESSION (1st-degree curve) and CURVILINEAR REGRESSION (2nd-degree curve)]