Regression Analysis
The objective of many investigations is to understand and explain the relationship among
variables. Frequently, one wants to know how, and to what extent, a certain variable (the response
variable) is related to a set of other variables (the explanatory variables).
Regression analysis helps us determine the nature and the strength of the relationship among
variables.
Types of relationship:
i) Deterministic relationship, also called a functional relationship
ii) Probabilistic relationship, also called a statistical relationship
In a deterministic relationship the relationship between two variables is known exactly, such as:
a) Area of a circle $= \pi r^2$
b) $F = k\,\dfrac{m_1 m_2}{r^2}$ (Newton's law of gravity)
c) The relationship between the dollar sales (Y) of a product sold at a fixed price and the number of units
sold.
In a statistical relationship the relation between variables is not known exactly, and we have to approximate
the relationship and develop models that characterize its main features. Regression analysis is
concerned with developing such "approximating" models.
For example, in a chemical process where the yield of product is related to the operating temperature, it may be
of interest to build a model relating yield to temperature and then use the model for prediction, process
optimization, or process control.
The word regression is used to investigate the dependence of one variable, called the dependent variable
and denoted by Y, on one or more other variables, called independent variables and denoted by X's. Regression
provides an equation to be used for estimating or predicting the average value of the dependent variable from
the known values of the independent variables. When we study the dependence of a variable on a single
independent variable, it is called simple regression, whereas the dependence of a variable on two or
more independent variables is called multiple regression. When the parameters in the model appear
in linear form, we say that the model is linear.
The dependent variable is also called the predictand, the response, or the regressand, whereas the
independent variable is also called the predictor, the explanatory variable, or the regressor.
Regression analysis is generally classified into two kinds:
1. Linear Regression
   Simple Linear Regression
   Multiple Linear Regression
   Curvilinear Regression
2. Nonlinear Regression
   Intrinsically Linear
   Intrinsically Non-Linear
Linear:- The regression model is linear if the parameters in the model appear in linear form (that is, no
parameter appears as an exponent, or is multiplied or divided by any other parameter). Otherwise, the
model is non-linear.
Suppose
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2$$
where $\alpha$ & $\beta$ are parameters. It is a linear model.
But if
$$Y = \alpha X^{\beta} \quad \text{or} \quad Y = \alpha \beta^{X}$$
it is non-linear.
Non-Linear Model:- A non-linear model that can be linearized (that is, converted into a linear
model) by an appropriate transformation is called intrinsically linear, and one that cannot be so
transformed is called intrinsically non-linear.
e.g. $Y = \alpha X^{\beta}$: apply log on both sides, $\log(Y) = \log(\alpha) + \beta\log(X)$
$Y = \alpha \beta^{X}$: apply log on both sides, $\log(Y) = \log(\alpha) + X\log(\beta)$
Regressor:- The variable that forms the basis of estimation or prediction is called the regressor. It is also
called the independent, explanatory, controlled, or predictor variable, and is usually denoted by X.
Regressand:- The variable whose values depend upon the known values of the independent
variable is called the regressand. It is also called the response, dependent, or random variable, and is usually
denoted by Y.
In simple regression, the dependence of the response variable (Y) is investigated on only one
regressor (X). If the relationship between these variables can be described by a straight line, it is termed
simple linear regression.
The population simple linear regression model is defined as:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
Example:- Wing length (Y, in cm) was measured at various ages (X, in days), with n = 13 observations in all.
Solution:-
[Scatter plot of wing length (cm) against age (days)]
$$\bar{X} = 10 \qquad \bar{Y} = 3.415$$
$$S(XY) = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 70.8$$
$$S(XX) = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} = 262$$
$$S(YY) = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} = 19.6569$$
$$b_1 = \frac{S(XY)}{S(XX)} = 0.270 \text{ cm/day}$$
$$b_0 = \bar{Y} - b_1\bar{X} = 0.715 \text{ cm}$$
So the estimated simple linear regression equation is
$$\hat{Y} = 0.715 + 0.270\,X$$
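As a quick check, the estimates can be reproduced with a short script. Since the raw observations are not reprinted in this handout, the sketch below works from the summary quantities S(XY), S(XX) and the two means:

```python
# Least-squares slope and intercept from the reported summary statistics.
S_XY = 70.8    # corrected sum of cross-products, S(XY)
S_XX = 262.0   # corrected sum of squares of X, S(XX)
x_bar, y_bar = 10.0, 3.415

b1 = S_XY / S_XX           # slope ~ 0.270 cm/day
b0 = y_bar - b1 * x_bar    # intercept ~ 0.713 cm (handout rounds to 0.715)
print(f"Y-hat = {b0:.3f} + {b1:.3f} X")
```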
Interpretation of the estimated regression parameters
The value of b1 = 0.270 indicates that the average wing length is expected to increase by 0.270
cm with each one-day increase in age.
The observed range of age (the explanatory variable) in the experiment was 3 to 17 days (i.e., the scope of the
model); therefore it would be an unreasonable extrapolation to expect this rate of increase in wing length
to continue if the number of days were to increase further. It is safe to use the results of a regression only within the
range of the observed values of the independent variable (i.e., within the scope of the model).
In the regression equation, b0 = 0.715 is the average wing length when age = 0 days. In this example,
since the scope of the model does not cover X = 0, b0 does not have any particular meaning as a
separate term in the regression equation.
NOTE: Interpolation and Extrapolation
Interpolation is making a prediction within the range of values of the predictor in the sample used to
generate the model; interpolation is generally safe. Extrapolation is making a prediction outside the
range of values of the predictor in the sample used to generate the model. The more removed the
prediction is from the range of values used to fit the model, the riskier the prediction becomes, because
there is no way to check that the relationship continues to be linear.
$$S_e^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum Y^2 - b_0 \sum Y - b_1 \sum XY}{n-2} = 0.048 \quad \text{OR} \quad S_e^2 = \frac{0.525}{11} = 0.048$$
$$S_e = 0.218$$
$$SE(b_0) = S_e \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{S(XX)}} = 0.148 \qquad SE(b_1) = S_e \sqrt{\frac{1}{S(XX)}} = 0.0135$$
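These quantities can be verified numerically; a continuation of the earlier sketch, using n = 13 (the error degrees of freedom above are 11):

```python
import math

# Residual variance and standard errors for the wing-length example,
# computed from the values reported in the handout.
n, S_XX, x_bar = 13, 262.0, 10.0
SSE = 0.525                    # Sum(Y - Y-hat)^2, from the handout
Se2 = SSE / (n - 2)            # residual variance ~ 0.048
Se = math.sqrt(Se2)            # residual std. deviation ~ 0.218

SE_b1 = Se * math.sqrt(1 / S_XX)                   # ~ 0.0135
SE_b0 = Se * math.sqrt(1 / n + x_bar**2 / S_XX)    # ~ 0.148
print(round(Se, 3), round(SE_b0, 3), round(SE_b1, 4))
```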
Test of hypothesis for β₁
1) Construction of hypotheses
H₀: β₁ = 0
H₁: β₁ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
$$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{0.270 - 0}{0.0135} = 20.03$$
4) Decision Rule:- Reject H₀ if $t_{cal} \ge t_{\alpha/2}(n-2) = 2.201$ or $t_{cal} \le -t_{\alpha/2}(n-2) = -2.201$
5) Result:- Since 20.03 > 2.201, reject H₀.
Test of hypothesis for the mean value of Y at X = 13, i.e. $\mu_{Y|13}$
1) Construction of hypotheses
H₀: $\mu_{Y|13}$ = 4
H₁: $\mu_{Y|13}$ ≠ 4
2) Level of significance
α = 5%
3) Test statistic
$$t = \frac{\hat{Y}_{13} - \mu_{Y|13}}{SE(\hat{Y}_{13})} = \frac{4.225 - 4}{0.073} = 3.082$$
4) Decision Rule:- Reject H₀ if $t_{cal} \ge t_{\alpha/2}(n-2) = 2.201$ or $t_{cal} \le -t_{\alpha/2}(n-2) = -2.201$
5) Result:- Since 3.082 > 2.201, reject H₀.
Confidence interval for the mean value of Y, i.e. $\mu_{Y|X}$:
$$\hat{Y}_{13} \pm t_{\alpha/2}(n-2)\,SE(\hat{Y}_{13}) = 4.225 \pm (2.201)(0.073)$$
$$(4.064,\ 4.386)$$
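Both t-tests and the interval can be checked in a few lines. The sketch below assumes scipy is available for the critical value $t_{.025}(11) = 2.201$:

```python
from scipy import stats  # assumed available for the t critical value

n = 13
t_crit = stats.t.ppf(0.975, df=n - 2)   # ~ 2.201

# Test H0: beta1 = 0 (slope)
t_b1 = (0.270 - 0) / 0.0135             # ~ 20.0; reject H0

# Test H0: mean of Y at X = 13 equals 4
y13_hat, se_y13 = 4.225, 0.073
t_mu = (y13_hat - 4) / se_y13           # ~ 3.08; reject H0

# 95% confidence interval for the mean of Y at X = 13
lo = y13_hat - t_crit * se_y13          # ~ 4.064
hi = y13_hat + t_crit * se_y13          # ~ 4.386
print(round(t_b1, 2), round(t_mu, 3), (round(lo, 3), round(hi, 3)))
```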
4. $Y = a e^{bX}$ : $\ln(Y) = \ln(a) + bX$ : $Y^* = a^* + bX$
5. $Y = a + b\sqrt{X}$ : put $X^* = \sqrt{X}$ : $Y = a + bX^*$
6. $Y = aX^2 + bX$ : $\dfrac{Y}{X} = aX + b$, i.e. put $Y^* = Y/X$ : $Y^* = b + aX$
Example:- The number (Y) of bacteria per unit volume present in a culture after X hours is given in the
following table:

Y       X    Log(Y)=Y*   XY*       X²
32      0    1.50515     0         0
47      1    1.6721      1.6721    1
65      2    1.81291     3.6258    4
92      3    1.96379     5.8914    9
132     4    2.12057     8.4823    16
190     5    2.27875     11.3938   25
275     6    2.43933     14.636    36
833     21   13.7926     45.7014   91   (Totals)
Fit a least squares curve having the form $Y = ab^X$ to the data. Estimate the value of Y when X = 7.
We have to estimate the model $Y = ab^X$, for which the transformed line takes the form:
$$\log(Y) = \log(a) + X\log(b)$$
$$Y^* = a^* + b^*X$$
$$b^* = \frac{S(XY^*)}{S(XX)} = \frac{4.33}{28} = 0.154$$
$$a^* = \bar{Y}^* - b^*\bar{X} = 1.51$$
The regression equation is
$$\log(Y) = 1.51 + 0.154\,X$$
Now
Log(a) = 1.51 and Log(b) = 0.154
Antilog[Log(a)] = Antilog(1.51) = 32.36
Antilog[Log(b)] = Antilog(0.154) = 1.43
Hence the fitted curve is $\hat{Y} = 32.36\,(1.43)^X$. When X = 7, $\hat{Y} = \text{Antilog}(1.51 + 0.154(7)) = \text{Antilog}(2.588) \approx 387$ bacteria.
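The whole fit can be scripted directly from the table's X and Y columns; a minimal sketch in plain Python using base-10 logs, as in the handout:

```python
import math

# Fit Y = a * b**X by least squares on Y* = log10(Y).
X = [0, 1, 2, 3, 4, 5, 6]
Y = [32, 47, 65, 92, 132, 190, 275]
n = len(X)

Ystar = [math.log10(y) for y in Y]
S_xy = sum(x * ys for x, ys in zip(X, Ystar)) - sum(X) * sum(Ystar) / n
S_xx = sum(x * x for x in X) - sum(X) ** 2 / n

b_star = S_xy / S_xx                           # log(b) ~ 0.154
a_star = sum(Ystar) / n - b_star * sum(X) / n  # log(a) ~ 1.507 (handout rounds to 1.51)

a, b = 10 ** a_star, 10 ** b_star              # a ~ 32.1, b ~ 1.43
y7 = a * b ** 7                                # ~ 387 bacteria at X = 7
print(round(a, 2), round(b, 2), round(y7))
```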
If there are only two independent variables, then the Multiple Regression Model is:
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \qquad \text{(Population Regression Model)}$$
where
X₁ & X₂ are the independent variables and Y is the dependent variable;
α is the Y-intercept;
β₁ & β₂ are also called partial regression coefficients.
These parameters can be estimated from sample information by b₀, b₁ and b₂ as:
$$b_1 = \frac{S(X_2,X_2)S(X_1,Y) - S(X_1,X_2)S(X_2,Y)}{S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2}$$
$$b_2 = \frac{S(X_1,X_1)S(X_2,Y) - S(X_1,X_2)S(X_1,Y)}{S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2}$$
$$b_0 = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2$$
Interpretation of regression coefficients:
b₀ (the estimate of α) is the mean value of Y when X₁ = X₂ = 0.
b₁ is the average change (increase or decrease) in the response variable Y for a one-unit increase in the
explanatory variable X₁ when the effect of X₂ is held constant.
b₂ measures the average change in Y for a one-unit increase in X₂ when the effect of X₁ is held constant.
EXAMPLE: The following data represent the performance of a chemical process as a function of
several controllable process variables:
Y = CO2 Product, X1 = Total Solvent, X2 = Hydrogen Consumption

Y        X1        X2     Y²        X1²        X2²       X1Y      X2Y       X1X2
36.98    2227.25   2.06   1367.52   4960643    4.2436    82364    76.179    4588.1
13.74    434.90    1.33   188.79    189138     1.7689    5976     18.274    578.4
10.08    481.19    0.97   101.61    231544     0.9409    4850     9.778     466.8
8.53     247.14    0.62   72.76     61078      0.3844    2108     5.289     153.2
36.42    1645.89   0.22   1326.42   2708954    0.0484    59943    8.012     362.1
26.59    907.59    0.76   707.03    823720     0.5776    24133    20.208    689.8
19.07    608.05    1.71   363.66    369725     2.9241    11596    32.610    1039.8
5.96     380.55    3.93   35.52     144818     15.4449   2268     23.423    1495.6
15.52    213.40    1.97   240.87    45540      3.8809    3312     30.574    420.4
56.61    2043.36   5.08   3204.69   4175320    25.8064   115675   287.579   10380.3
229.50   9189.32   18.65  7608.87   13710479   56.0201   312224   511.926   20174.4   (Totals)
Ŷ          e = Y − Ŷ   e²
47.3928    -10.41      108.42633
13.30172   0.44        0.1920925
13.68672   -3.61       13.008438
8.901963   -0.37       0.1383564
34.23853   2.18        4.7588268
21.29525   5.29        28.034412
16.99981   2.07        4.2857005
15.69707   -9.74       94.810435
10.04365   5.48        29.990376
47.9425    8.67        75.125472
229.50     0.00        358.77      (Totals)
1. Fit a multiple linear regression relating CO2 product to total solvent and hydrogen consumption,
and calculate the value of R².
2. Test the significance of the regression.
3. Test the significance of the partial regression coefficients and construct confidence intervals.
4. Can we conclude that total solvent and hydrogen consumption are a sufficient set of
independent variables for explaining the variability in CO2 product?
From the column totals (n = 10):
$$\bar{Y} = 22.95 \qquad \bar{X}_1 = 918.932 \qquad \bar{X}_2 = 1.865$$
$$S(X_1,X_1) = 5266118.9 \qquad S(X_2,X_2) = 21.24 \qquad S(X_1,X_2) = 3036.32$$
$$S(X_1,Y) = 101329.11 \qquad S(X_2,Y) = 83.9085 \qquad S(Y,Y) = 2341.84$$
$$D = S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2 = 102633124.2$$
$$b_1 = \frac{S(X_2,X_2)S(X_1,Y) - S(X_1,X_2)S(X_2,Y)}{D} = 0.0185$$
$$b_2 = \frac{S(X_1,X_1)S(X_2,Y) - S(X_1,X_2)S(X_1,Y)}{D} = \frac{134212437.4}{102633124.2} = 1.31$$
$$b_0 = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 = 3.52$$
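As a check on the arithmetic, the sketch below recomputes D, b1, b2 and b0 directly from the raw Y, X1 and X2 columns of the table:

```python
# Two-predictor least squares via the corrected-sums formulas above.
Y  = [36.98, 13.74, 10.08, 8.53, 36.42, 26.59, 19.07, 5.96, 15.52, 56.61]
X1 = [2227.25, 434.90, 481.19, 247.14, 1645.89,
      907.59, 608.05, 380.55, 213.40, 2043.36]
X2 = [2.06, 1.33, 0.97, 0.62, 0.22, 0.76, 1.71, 3.93, 1.97, 5.08]
n = len(Y)

def S(u, v):
    """Corrected sum of cross-products: Sum(uv) - Sum(u)Sum(v)/n."""
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n

D  = S(X1, X1) * S(X2, X2) - S(X1, X2) ** 2               # ~ 102.6 million
b1 = (S(X2, X2) * S(X1, Y) - S(X1, X2) * S(X2, Y)) / D    # ~ 0.0185
b2 = (S(X1, X1) * S(X2, Y) - S(X1, X2) * S(X1, Y)) / D    # ~ 1.31
b0 = sum(Y) / n - b1 * sum(X1) / n - b2 * sum(X2) / n     # ~ 3.52
print(round(b0, 2), round(b1, 4), round(b2, 2))
```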
ANOVA TABLE

S.O.V        DF   SS        MSS = SS/df   Fcal     Ftab
Regression   2    1983.07   991.54        19.35*   F.05(2,7) = 4.74
Error        7    358.77    51.25
Total        9    2341.84
Coefficient of Determination
The coefficient of determination tells us the proportion of variation in the dependent variable explained
by the independent variables:
$$R^2 = \frac{Reg.SS}{Total\ SS} \times 100 = \frac{1983.07}{2341.84} \times 100 = 84.7\%$$
The value of R² indicates that about 85% of the variation in the dependent variable has been explained by the
linear relationship with X₁ & X₂, and the remainder is due to other, unknown factors.
Test of hypothesis about the significance of the partial regression coefficients:
Test of hypothesis for β₁
1) Construction of hypotheses
H₀: β₁ = 0
H₁: β₁ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
$$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{0.0185 - 0}{0.003257} = 5.68$$
where
$$SE(b_1) = S_e \sqrt{\frac{S(X_2,X_2)}{S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2}} = 7.16\sqrt{\frac{21.24}{102633124.2}} = 0.003257$$
Polynomial Regression
Example:- The data are regarding time in weeks [X] and the corresponding yield in kg [Y] of cotton
per plot in the specified period.
Put $X_1 = X$ and $X_2 = X^2$.
$$S(X_1,X_1) = \sum X_1^2 - \frac{(\sum X_1)^2}{n} = 82.50$$
$$S(X_2,X_2) = \sum X_2^2 - \frac{(\sum X_2)^2}{n} = 10510.50$$
$$S(X_1,Y) = \sum X_1 Y - \frac{(\sum X_1)(\sum Y)}{n} = 6860 - \frac{(55)(1266)}{10} = -103$$
$$S(X_2,Y) = \sum X_2 Y - \frac{(\sum X_2)(\sum Y)}{n} = 45974 - \frac{(385)(1266)}{10} = -2767$$
$$S(X_1,X_2) = \sum X_1 X_2 - \frac{(\sum X_1)(\sum X_2)}{n} = 907.50$$
$$S(Y,Y) = \sum Y^2 - \frac{(\sum Y)^2}{n} = 6402.40$$
$$D = S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2 = 43560$$
$$b_1 = \frac{S(X_2,X_2)S(X_1,Y) - S(X_1,X_2)S(X_2,Y)}{D} = 32.7932$$
$$b_2 = \frac{S(X_1,X_1)S(X_2,Y) - S(X_1,X_2)S(X_1,Y)}{D} = -3.0947$$
$$b_0 = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 = 65.3800$$
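The individual yields are not reprinted in this extract, but the coefficients can still be verified from the corrected sums above; the means follow from Sum(X1) = 55, Sum(X2) = 385, Sum(Y) = 1266 and n = 10:

```python
# Quadratic fit Y = b0 + b1*X + b2*X**2 from the corrected sums of squares.
S11, S22, S12 = 82.50, 10510.50, 907.50   # S(X1,X1), S(X2,X2), S(X1,X2)
S1Y, S2Y = -103.0, -2767.0                # S(X1,Y), S(X2,Y)
y_bar, x1_bar, x2_bar = 126.6, 5.5, 38.5  # means for n = 10, X = 1..10

D  = S11 * S22 - S12 ** 2                 # 43560
b1 = (S22 * S1Y - S12 * S2Y) / D          # ~ 32.7932
b2 = (S11 * S2Y - S12 * S1Y) / D          # ~ -3.0947
b0 = y_bar - b1 * x1_bar - b2 * x2_bar    # ~ 65.38
print(round(b0, 2), round(b1, 4), round(b2, 4))
```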
ANALYSIS OF VARIANCE
The hypothesis β₁ = β₂ = 0 may be tested by the analysis of variance procedure.
Total SS = S(Y,Y) = 6402.4
Reg.SS = b₁S(X₁,Y) + b₂S(X₂,Y) = (32.7932)(-103) + (-3.0947)(-2767) = 5185.3
ANOVA TABLE
Source Of Variation Degree of Sum of Squares Mean Sum Fcal Ftab
(S.O.V) Freedom (SS) of Squares
(DF) (MSS=SS/df)
Regression (X , X2) 2 5185.3 2592.7 14.91* F.05(2,7)=4.74
Error 7 1217.1 173.9
TOTAL 9 6402.40
Test of significance of the quadratic regression
1) Construction of hypotheses
H₀: β₂ = 0
H₁: β₂ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
$$t = \frac{b_2 - \beta_2}{SE(b_2)} = \frac{-3.0947 - 0}{0.5738} = -5.39$$
where
$$SE(b_2) = S_e \sqrt{\frac{S(X_1,X_1)}{S(X_1,X_1)S(X_2,X_2) - [S(X_1,X_2)]^2}} = 13.19\sqrt{\frac{82.50}{43560}} = 0.5738$$
4) Decision Rule:- Reject H₀ if $t_{cal} \ge t_{\alpha/2}(7) = 2.365$ or $t_{cal} \le -2.365$
5) Result:- Since -5.39 < -2.365, reject H₀: the quadratic term is significant.
Coefficient of Determination
The coefficient of determination tells us the proportion of variation in the dependent variable explained
by the independent variable:
$$R^2 = \frac{Reg.SS}{Total\ SS} \times 100 = \frac{5185.33}{6402.40} \times 100 = 81\%$$
The 2nd-degree curve is appropriate for the above data set.
The value of X at which the maximum or minimum value of the quadratic regression occurs is
$$X = \frac{-b_1}{2b_2} = 5.30$$
and the maximum or minimum value of Y is
$$Y = b_0 - \frac{b_1^2}{4b_2} = 152.28$$
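A two-line check of the vertex formulas (the small difference from the reported 152.28 is rounding in the coefficients):

```python
# Vertex of the fitted quadratic: location and extreme value.
b0, b1, b2 = 65.38, 32.7932, -3.0947
x_star = -b1 / (2 * b2)          # ~ 5.30 weeks
y_star = b0 - b1**2 / (4 * b2)   # ~ 152.3 kg
print(round(x_star, 2), round(y_star, 2))
```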
[Figure: scatter plot of yield (y) against time (x1) with the fitted SIMPLE LINEAR REGRESSION (1st-degree curve) and CURVILINEAR REGRESSION (2nd-degree curve)]