LESSON 30:
MAKING INFERENCES ABOUT POPULATION PARAMETERS

Copyright: Rai University
Essentially, when we run a regression we are estimating the
parameters on the basis of a sample of observations. Therefore
y = a + bx, for example, is a sample regression line, in much the
same way that x̄ is a sample estimate of the population parameter
μ. Our population regression line, or the true relationship in
the data, is Y = A + BX. This equation, however, is unknown, and
we have to use sample data to estimate it. The true form of the
unknown equation for the k-variable case is:

Y = A + B1x1 + B2x2 + B3x3 + ... + Bkxk
Even in the case of the population regression plane, not all data
points will lie on it. Why? Consider our IRS problem. Not all
payments to informants will be equally effective; some of the
computer hours may be used for organizing data rather than
analyzing accounts. For these and other reasons some of the data
points will lie above the regression plane and some below it.
Therefore, instead of satisfying the above equation, the
individual data points will satisfy:
Y = A + B1x1 + B2x2 + B3x3 + ... + Bkxk + e

This is the population regression plane plus a random disturbance
term.
The term e is a random disturbance term which equals zero on the
average. The standard deviation of this term is σe. The standard
error of the regression, se, which we talked about in the earlier
section, is an estimate of σe.
Our sample regression equation

y = a + b1x1 + b2x2 + b3x3 + ... + bkxk

estimates the unknown population regression plane

Y = A + B1x1 + B2x2 + B3x3 + ... + Bkxk
As we can see, the estimation of a regression plane can also be
thought of as a problem of statistical inference, where we make
inferences regarding an unknown population relationship on the
basis of an estimated relationship derived from sample data.
Much as in hypothesis testing for a mean, we can set up confidence
intervals for the parameters of the estimated equation. We can
also make inferences about the slope parameters of the true
regression equation (B1, B2, ..., Bk) on the basis of the slope
coefficients of the estimated equation (b1, b2, b3, ..., bk).
Tests of Inference of an Individual Slope Parameter Bi
As explained earlier, we can use the value of an individual bi,
which is the estimated slope for the ith variable, to test a
hypothesis about the value of Bi, which is the true population
value of the slope for the ith variable.
The process of hypothesis testing is the same as that delineated
for testing the mean.
When we perform a regression we are frequently interested in the
question of whether y actually depends on x. Taking our example,
we may ask whether the volume of unpaid tax recovery actually
depends on the number of computer hours of research the field
researcher puts in. Essentially we are asking: is x a significant
explanatory variable for y?
There is some relationship between y and x if Bi ≠ 0; there is no
relationship between x and y if Bi = 0.
Thus we can formulate our hypotheses regarding the test of
significance of the xi coefficient as follows:
Ho: Bi = 0  Null hypothesis, that xi is not a significant
explanatory variable for y
Ha: Bi ≠ 0  Alternative hypothesis, that xi is a significant
explanatory variable for y.
We can test this hypothesis using the t ratio:

to = (bi − Bio) / sbi

where
bi: slope of the fitted regression
Bio: hypothesized value of the population slope
sbi: standard error of the regression coefficient
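The arithmetic of the t ratio can be sketched as follows; the coefficient, standard error, and critical value here are hypothetical illustrative numbers, not output from the IRS regression:

```python
# Hypothetical coefficient estimate and its standard error
# (illustrative numbers only, not from a real regression run).
b_i = 1.3646      # fitted slope b_i for the i-th variable
B_i0 = 0.0        # hypothesized population slope (Ho: B_i = 0)
s_bi = 0.37627    # standard error of the coefficient

t_observed = (b_i - B_i0) / s_bi
print(f"t ratio = {t_observed:.2f}")

# With n - k - 1 degrees of freedom, compare |t| to the critical
# value for the chosen significance level; 2.306 is the standard
# two-tailed 5% critical value for 8 degrees of freedom.
t_crit = 2.306
significant = abs(t_observed) > t_crit
print("significant explanatory variable" if significant
      else "not a significant explanatory variable")
```

A statistical package reports this same ratio in its t-ratio column, along with the corresponding p-value.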
Why is the t Statistic Used?
In multiple regression we use n data points to estimate k + 1
coefficients: the intercept a and the k slope coefficients
b1, ..., bk. These coefficients are used to calculate se, which
estimates σe, the standard deviation of the disturbance in the
data. We use se to estimate sbi. Therefore, since se has n − k − 1
degrees of freedom, sbi will also have n − k − 1 degrees of
freedom.
The value of sbi is given in the output as the stdev term.
Because our hypothesized value of Bi is 0, the standardized value
of the regression coefficient becomes:

to = bi / sbi
The value of to is called the observed or computed t value. This is
the number that appears in the column headed t-ratio in the
computer output.
We test for the significance of the t ratio by checking against
the column headed p-value. This column gives the prob values for
the two-tailed test of the hypotheses:
Ho: Bi = 0
Ha: Bi ≠ 0
The prob values are the probabilities that each bi would be as far
(or farther) away from zero (the hypothesized value of the Bi
coefficient) if Ho is true. This is shown in Figure 2. We need
only compare the p-values with α, the level of significance, to
determine whether xi is a significant explanatory variable for y.
Figure 2
If p > α, xi is not a significant explanatory variable.
If p < α, xi is a significant explanatory variable.
This test of significance of the explanatory variable is always a
two-tailed test. The independent variable xi is a significant
explanatory variable if bi is significantly different from zero.
This requires that our t ratio be a large positive or negative
number.
In our IRS example, p is less than .01 for each of the three
explanatory variables. Therefore we conclude that each one is a
significant explanatory variable.
Test of Significance of the Regression as a Whole
It is quite possible to get a high value of R² by pure chance.
After all, if we threw darts at a board to generate a scatter
plot, we could fit a regression which might conceivably have a
high R². Therefore we need to ask: does a high value of R²
necessarily mean that the independent variables explain a large
proportion of the variation in y, or could it be a fluke?
In statistical terms we ask the following question:
Is the regression as a whole significant? In the last section we
looked at whether the individual xi were significant. Now we ask
whether collectively all the xi (i = 1, ..., k) together
significantly explain the variability in y.
Our hypothesis is:
Ho: B1 = B2 = ... = Bk = 0  Null hypothesis, that y does not
depend on the xi's
Ha: at least one Bi ≠ 0  Alternative hypothesis, that at least one
Bi is not zero.
To explain this concept we have to go back to our initial diagram,
which shows the two-variable case.
The total variation in y is Σ(y − ȳ)².
The variation explained by the regression is Σ(ŷ − ȳ)².
The unexplained variation is Σ(y − ŷ)².
This is shown in Figure 3 for the one-variable case for
simplicity. For the multiple-variable case the same applies
conceptually.
Figure 3
Thus when we look at the variation in y we look at three different
terms, each of which is a sum of squares. These are denoted as
follows:

SST = total sum of squares = Σ(y − ȳ)²
SSR = regression sum of squares = Σ(ŷ − ȳ)²
SSE = error sum of squares = Σ(y − ŷ)²

Total variation in y can be broken into two parts, the explained
and the unexplained:

SST = SSR + SSE
Each of these has an associated number of degrees of freedom. SST
has n − 1 degrees of freedom. SSR has k degrees of freedom because
there are k independent variables. SSE has n − k − 1 degrees of
freedom because we used n observations to estimate k + 1
parameters a, b1, b2, ..., bk.
To test the null hypothesis we compute the F ratio:

F = (SSR / k) / (SSE / (n − k − 1))

which, if the null hypothesis is true, has an F distribution with
k numerator degrees of freedom and n − k − 1 degrees of freedom in
the denominator.
If the null hypothesis is false, i.e., the explanatory variables
have a significant effect on y, then the F ratio tends to be
higher than if the null hypothesis is true. So if the F ratio is
large, we reject the null hypothesis that the explanatory
variables have no effect on the variation of y; that is, we reject
Ho and conclude that the regression is significant.
Going back to our IRS example, we now look at the computer output.
A typical regression output also includes the computed F ratio for
the regression. This is at times called the ANOVA for the
regression, because we break up the analysis of variation in y
into the explained variance, or variance explained by the
regression (between-column variance), and the unexplained variance
(within-column variance). This is shown in Table 3.
Table 3
Analysis of Variance

Source       DF        SS        MS         F       P
Regression    3   29.1088    9.7029    118.52    0.00
Error         6    0.4912    0.0819
Total         9   29.6000
The sample output for the IRS problem is given above.
SSR = 29.109, with k = 3 degrees of freedom.
SSE = 0.491, with n − k − 1 = 6 degrees of freedom.

F = (29.11 / 3) / (0.491 / 6) ≈ 118.5

The MS column is the sum of squares divided by the number of
degrees of freedom. The output also gives us the p-value, which is
0.00. Because p < α = 0.01, we can conclude that the regression as
a whole is highly significant.
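The F computation above can be reproduced directly from the Table 3 figures; a minimal sketch:

```python
# ANOVA figures from the IRS output in Table 3.
ssr, k = 29.109, 3    # regression sum of squares, no. of explanatory variables
sse, n = 0.491, 10    # error sum of squares, no. of observations

df_error = n - k - 1  # degrees of freedom for the error term
f_ratio = (ssr / k) / (sse / df_error)
print(f"F = {f_ratio:.1f} with {k} and {df_error} degrees of freedom")
```

An F this large, far beyond any tabulated critical value for 3 and 6 degrees of freedom, is why the output reports a p-value of 0.00.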
Exercises
Q1. Bill Buxton, a statistics professor in a leading business school,
has a keen interest in the factors affecting students' performance
on exams. The midterm exam for the past semester had a
wide distribution of grades, but Bill feels certain that several
factors explain the distribution: he allowed his students to
study from as many different books as they liked, their IQs
vary, they are of different ages, and they study varying amounts
of time for exams. To develop a predicting formula for exam
grades, Bill asked each student to answer, at the end of the
exam, questions regarding study time and number of books
used. Bill's teaching records already contained the IQs and
ages of the students, so he compiled the data for the class
and ran a multiple regression with Minitab. The output from
Bill's computer run was as follows:
Predictor      Coef       Stdev     T-ratio     P
Constant    -49.948     41.55       -1.20     0.268
Hours         1.06931    0.98163     1.09     0.312
IQ            1.36460    0.37627     3.63     0.008
Books         2.03982    1.50799     1.35     0.218
Age          -1.78990    0.67332    -2.67     0.319

s = 11.657    R-sq = 76.7%
a. What is the best-fitting regression equation for these data?
b. What percentage of the variation in grades is explained by this
equation?
c. What grade would you expect for a 21-year-old student with an
IQ of 113 who studied 5 hours and used three different books?
Notes