$$Y = X\beta + \varepsilon \;\Rightarrow\; \hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y$$
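As a minimal sketch of the formula above, the following NumPy snippet computes the least-squares coefficients and fitted values by solving the normal equations. The toy data are made up for illustration (they are not from these notes) and lie exactly on the line Y = 1 + 2x, so the estimates should come back as (1, 2).

```python
import numpy as np

# Toy data, made up for illustration: Y = 1 + 2*x exactly.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])  # the column of ones carries the intercept
Y = np.array([1.0, 3.0, 5.0, 7.0])

# beta_hat = (X'X)^{-1} X'Y, obtained by solving (X'X) beta = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Fitted values: Y_hat = X beta_hat = X (X'X)^{-1} X'Y
Y_hat = X @ beta_hat

print(beta_hat, Y_hat)
```

Solving the linear system is used here instead of forming the inverse explicitly; both give the same estimate, and the solve is usually the more numerically stable choice.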
We list for future reference some useful facts concerning the transpose and inverse
of a matrix. Proofs are omitted.
$(A')' = A$
$(AB)' = B'A'$
$(A^{-1})^{-1} = A$
$(AB)^{-1} = B^{-1}A^{-1}$
$(A')^{-1} = (A^{-1})'$
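The transpose and inverse facts can be checked numerically. The sketch below uses two small invertible matrices chosen only for illustration (they are not from these notes) and confirms, in particular, that the transpose of a product reverses the order of the factors.

```python
import numpy as np

# Two invertible 2x2 matrices, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])
inv = np.linalg.inv

checks = {
    "(A')' = A": np.allclose(A.T.T, A),
    "(AB)' = B'A'": np.allclose((A @ B).T, B.T @ A.T),
    "(A^-1)^-1 = A": np.allclose(inv(inv(A)), A),
    "(AB)^-1 = B^-1 A^-1": np.allclose(inv(A @ B), inv(B) @ inv(A)),
    "(A')^-1 = (A^-1)'": np.allclose(inv(A.T), inv(A).T),
}

# The order matters: A'B' is generally NOT the transpose of AB.
order_matters = not np.allclose((A @ B).T, A.T @ B.T)

for name, ok in checks.items():
    print(name, ok)
```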
FITTING A REGRESSION MODEL WITH DUMMY VARIABLES
QUALITATIVE INDEPENDENT VARIABLES / DUMMY VARIABLES
What are dummy variables?
Dummy variables, also called indicator variables, allow us to include categorical data
(such as gender) in regression models. A dummy variable can take only two values: 0
(absence of a category) and 1 (presence of a category).
Suppose we set the dummy variable gender to 1 when the employee is female and 0 when
the employee is not female. When interpreting results for gender, we remember that
when the dummy variable is 0 (not female), we are talking about males.
EXAMPLE 1
1 7.5 Male 6
2 8.6 Male 10
3 9.1 Male 12
4 10.3 Male 18
5 13 Male 30
6 6.2 Female 5
7 8.7 Female 13
8 9.4 Female 15
9 9.8 Female 21
Coding: What would have happened if we had used 0 for females and 1 for males in
our data? Would our results be any different?
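One way to explore this question is to fit the model twice with the two codings and compare. The sketch below uses the nine rows of Example 1. Because the column headers are not shown, it assumes (hypothetically) that the second column is the response y and the fourth column is a numeric predictor x; the function name `fit` is also just an illustrative choice.

```python
import numpy as np

# Example 1 data. ASSUMPTION: column 2 is the response y, column 4 a numeric predictor x.
y = np.array([7.5, 8.6, 9.1, 10.3, 13.0, 6.2, 8.7, 9.4, 9.8])
x = np.array([6, 10, 12, 18, 30, 5, 13, 15, 21], dtype=float)
female = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # 1 = female, 0 = male
male = 1.0 - female                                           # reversed coding

def fit(dummy):
    """Least-squares fit of y on an intercept, x, and the given dummy variable."""
    X = np.column_stack([np.ones_like(y), x, dummy])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta, X @ beta

beta_f, fitted_f = fit(female)
beta_m, fitted_m = fit(male)

print("female = 1 coding:", beta_f)
print("male = 1 coding:  ", beta_m)
```

Under these assumptions the fitted values and the slope on x are the same for both codings. Only the bookkeeping changes: the dummy coefficient flips sign, and the intercept shifts by the size of the dummy coefficient, because the intercept always describes the group coded 0.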
EXAMPLE 2
Consider a business problem that involves developing a model for predicting the
assessed value of houses ($000), based on the size of the house (in thousands of
square feet) and whether the house has a swimming pool.
Holding constant whether a house has a swimming pool, for each increase of
1.0 thousand square feet in the size of the house, the predicted assessed value is
estimated to increase by 16.1858 thousand dollars (i.e., $16,185.80).
Holding constant the size of the house, the presence of a swimming pool is
estimated to increase the predicted assessed value of the house by 3.8530 thousand
dollars (i.e., $3,853).
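The two interpretations above can be turned into a small calculation. Since the intercept is not given in this excerpt, the sketch below computes only the change in predicted assessed value between two houses, where the intercept cancels. The function name `predicted_change` is a hypothetical helper, not part of the original example.

```python
SIZE_COEF = 16.1858  # $000 per additional thousand square feet, pool held constant
POOL_COEF = 3.8530   # $000 for having a pool (dummy = 1), size held constant

def predicted_change(delta_size, delta_pool):
    """Change in predicted assessed value ($000) between two houses.

    delta_size: difference in size, in thousands of square feet
    delta_pool: difference in the pool dummy (+1, 0, or -1)
    """
    return SIZE_COEF * delta_size + POOL_COEF * delta_pool

# A house 0.5 thousand square feet larger that also has a pool:
print(predicted_change(0.5, 1))
```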
EXAMPLE 3
EXAMPLE 4
A team of research physicians conducted a study to determine the effect of health
education on the utilization of health services for hypertension patients (Drug Topics,
April 1993). Data collected for a sample of n=282 new HMO enrollers with
hypertension problems were used to fit the following regression model:
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$$
where
$y$ = Annual health care expenditures (dollars)
$x_1$ = Age (years)
$x_2$ = 1 if female, 0 if male
$x_3$ = 1 if white, 0 if nonwhite
$x_4$ = Number of concomitant maintenance medications (regimen)
$x_5$ = 1 if enrolled in health education program, 0 if not
The regression results are summarized below
VARIABLE                  β ESTIMATE    p-VALUE FOR TESTING $H_0\colon \beta_k = 0$
Intercept                    64.82      ¿ 0.05
Age (x1)                      1.05      ¿ 0.05
Gender (x2)                 -10.53      ¿ 0.05
Race (x3)                     0.27      ¿ 0.05
Regimen (x4)                  9.46      ¿ 0.05
Health education (x5)       -92.97      ¿ 0.001
$F = 37.84$, $R^2 = 0.4357$
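To see how the estimates combine, the sketch below plugs the reported coefficients into the fitted equation. The patient profile used in the example call is hypothetical, and the function name `expected_expenditure` is an illustrative choice.

```python
# Estimated coefficients as reported in the table of regression results.
COEF = {
    "intercept": 64.82,
    "age": 1.05,
    "female": -10.53,
    "white": 0.27,
    "regimen": 9.46,
    "health_ed": -92.97,
}

def expected_expenditure(age, female, white, regimen, health_ed):
    """Estimated mean annual health care expenditure (dollars) from the fitted model."""
    return (COEF["intercept"]
            + COEF["age"] * age
            + COEF["female"] * female
            + COEF["white"] * white
            + COEF["regimen"] * regimen
            + COEF["health_ed"] * health_ed)

# Hypothetical patient: 50-year-old nonwhite female on 2 medications.
enrolled = expected_expenditure(50, 1, 0, 2, 1)
not_enrolled = expected_expenditure(50, 1, 0, 2, 0)
print(enrolled, not_enrolled, enrolled - not_enrolled)
```

Whatever profile is chosen, the difference between the enrolled and not-enrolled predictions equals the health education coefficient, which is how a dummy variable coefficient is read: a shift in the mean response with the other predictors held fixed.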
When estimating a mean response for $Y$, one needs to specify a vector of $x_i$ values
within the range in which the model was constructed.
Matrix Expressions for Confidence Intervals and Prediction Intervals
Point estimates and confidence intervals for the mean response and prediction
intervals for a future response can also be expressed using matrix notation. The
fitted value at $x_h$ is
$$\hat{Y}(x_h) = \hat{\beta}_0 + \hat{\beta}_1 x_h = (1, x_h)\begin{pmatrix}\hat{\beta}_0 \\ \hat{\beta}_1\end{pmatrix} = X_h'\hat{\beta},$$
where $X_h = (1, x_h)'$.
Recall that $\hat{\beta} = (X'X)^{-1}X'Y$; therefore $X_h'\hat{\beta} = X_h'(X'X)^{-1}X'Y$.
Let $A = X_h'(X'X)^{-1}X'$ and $A' = X(X'X)^{-1}X_h$.
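Writing the fitted value as $AY$ is convenient because the variance of a linear transformation of $Y$ has a simple form. The following is a sketch of that step, assuming the usual model condition $\operatorname{Var}(Y) = \sigma^2 I$, which is not stated in this excerpt:

```latex
\begin{aligned}
\hat{Y}(x_h) &= X_h'\hat{\beta} = AY, \\
\operatorname{Var}(AY) &= A\,\operatorname{Var}(Y)\,A' = \sigma^2 A A' \\
&= \sigma^2\, X_h'(X'X)^{-1}X'X(X'X)^{-1}X_h \\
&= \sigma^2\, X_h'(X'X)^{-1}X_h .
\end{aligned}
```

Replacing $\sigma^2$ with MSE and taking the square root gives the standard error of the fit.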
The statistical intervals for estimating the mean or predicting new observations in the
simple linear regression case are easily extended to the multiple regression case.
Here, it is only necessary to present the formulas. First, let us define the vector of
given predictors as
$$X_h = \begin{bmatrix} 1 \\ x_{h,1} \\ x_{h,2} \\ x_{h,3} \\ \vdots \\ x_{h,p-1} \end{bmatrix}$$
We are interested in either intervals for $E(Y \mid X_h)$ or intervals for the value of a
new response given that the observation has the particular value $X_h$. First we define
the standard error of the fit at $X_h$, given by:
$$\text{s.e.}(\hat{y}_h) = \sqrt{\mathrm{MSE}\,\bigl(X_h'(X'X)^{-1}X_h\bigr)}$$
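The standard error of the fit can be computed directly from its matrix form. The sketch below reuses the Example 1 rows with a single predictor. It assumes (hypothetically, since the column headers are not shown) that the second column is the response y and the fourth column is the predictor x. The evaluation point x = 15 and the name `se_fit` are illustrative choices.

```python
import numpy as np

# Example 1 data. ASSUMPTION: column 2 is the response y, column 4 the predictor x.
y = np.array([7.5, 8.6, 9.1, 10.3, 13.0, 6.2, 8.7, 9.4, 9.8])
x = np.array([6, 10, 12, 18, 30, 5, 13, 15, 21], dtype=float)

X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept column
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
residuals = y - X @ beta_hat
mse = residuals @ residuals / (n - p)      # MSE = SSE / (n - p)

def se_fit(x_h):
    """Standard error of the fit at x_h (a vector that includes the leading 1)."""
    x_h = np.asarray(x_h, dtype=float)
    return np.sqrt(mse * (x_h @ XtX_inv @ x_h))

X_h = np.array([1.0, 15.0])                # illustrative point inside the observed x range
y_hat_h = X_h @ beta_hat
print(y_hat_h, se_fit(X_h))
```

With an intercept in the model, the standard error is smallest at the mean of the predictors and grows as $X_h$ moves away from it, which is one reason to keep $X_h$ within the range of the data.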
The partial F-test is similar to the F-test, except that individual or subsets of predictor
variables are evaluated for their contribution in the model to increase SSR or,
conversely, to decrease SSE.
With two predictors in the model ($X_1$, age, and $X_2$, mileage), we ask, "What is the
contribution of the individual $X_1$ (age) and $X_2$ (mileage) variables?" To determine
this, we can evaluate the model first with $X_1$ in the model, then with $X_2$. We
evaluate $X_1$ in the model, not excluding $X_2$ but holding it constant, and then we
measure its contribution with the sum-of-squares regression (or sum-of-squares error),
and vice versa. That is, the sum-of-squares regression for variable $x_j$ is
$$SSR(x_j \mid \text{all variables except } j) = SSR(\text{all variables including } j) - SSR(\text{all variables except } j)$$
Example: Suppose we have two independent variables in our model. Use the
selling price data.
SSR(x1 and x2) = regression sum of squares when variables x1 and x2 are
both included in the multiple regression model.
H0: Variable x1 does not significantly improve the model after x2 has been included.
Ha: Variable x1 significantly improves the model after x2 has been included.
$$t_a^2 = F_{1,a}$$
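The partial F-test for a single variable, and its relationship to the t statistic, can be sketched numerically. The selling price data are not reproduced in this excerpt, so the sketch below substitutes the Example 1 rows. It assumes (hypothetically) that the second column is the response y, the fourth column is x1, and the gender dummy is x2. The helper name `fit` is illustrative.

```python
import numpy as np

# Example 1 data used as a stand-in. ASSUMPTION: column 2 is y, column 4 is x1.
y = np.array([7.5, 8.6, 9.1, 10.3, 13.0, 6.2, 8.7, 9.4, 9.8])
x1 = np.array([6, 10, 12, 18, 30, 5, 13, 15, 21], dtype=float)
x2 = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # 1 = female

def fit(X):
    """Return (beta, SSR, SSE) for a least-squares fit of y on X (X includes an intercept)."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    fitted = X @ beta
    ssr = np.sum((fitted - y.mean()) ** 2)
    sse = np.sum((y - fitted) ** 2)
    return beta, ssr, sse

ones = np.ones_like(y)
X_full = np.column_stack([ones, x1, x2])   # model with x1 and x2
X_reduced = np.column_stack([ones, x2])    # model with x2 only

beta_full, ssr_full, sse_full = fit(X_full)
_, ssr_reduced, _ = fit(X_reduced)

n, p = X_full.shape
mse_full = sse_full / (n - p)

# SSR(x1 | x2) = SSR(x1 and x2) - SSR(x2); one numerator degree of freedom
ssr_x1_given_x2 = ssr_full - ssr_reduced
partial_F = ssr_x1_given_x2 / mse_full

# t statistic for x1 in the full model
cov_beta = mse_full * np.linalg.inv(X_full.T @ X_full)
t_x1 = beta_full[1] / np.sqrt(cov_beta[1, 1])

print(partial_F, t_x1 ** 2)
```

The two printed values agree, which is the numerical counterpart of the identity between the squared t statistic and the F statistic with one numerator degree of freedom.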
The ANOVA table dividing the regression sum of squares into components to
determine the contribution of variable x1.
The ANOVA table dividing the regression sum of squares into components to
determine the contribution of variable x2.
X1 X2 Y
17 42 90
19 45 71 76
20 29 63 63 80 80
21 93 80 64 82 66
25 34 75 82
27 98 99
28 9 73
30 73 67 74