Log Value With Dummy
13.2 | 13.1a | 13.2a | 13.2b | 13.3 | 13.3a | 13.4 | 13.3b | 13.5 | 13.6
Dummy Variables -- continued
To create dummy variables we can use an IF
statement or we can use StatPro’s Dummy variable
procedure.
The Dummy variable procedure is usually easier,
particularly when there are multiple categories.
Once the dummy variables are created, we can
combine the variables if we like by simply adding the
columns to get the dummy for the new category.
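The two steps above, creating one dummy per category and then adding columns to merge categories, can be sketched outside Excel as follows. This is only an illustration: the category codes and the helper name make_dummies are hypothetical and are not taken from the bank data or from StatPro.

```python
# Hypothetical category codes for six employees (not the bank's data).
educ_lev = [1, 3, 2, 3, 1, 2]

def make_dummies(values, prefix):
    """Return one 0/1 column per distinct category, keyed as prefix_category."""
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in sorted(set(values))
    }

dummies = make_dummies(educ_lev, "Ed")

# Merge categories 1 and 2 into a single dummy by adding their columns.
# The sum is still 0/1 because each row belongs to exactly one category.
ed_1_or_2 = [a + b for a, b in zip(dummies["Ed_1"], dummies["Ed_2"])]

print(dummies)
print(ed_1_or_2)
```

Each list plays the role of a spreadsheet column, so the addition mirrors adding two dummy columns cell by cell.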
Regression Analysis
In this example we create dummy variables for Gender and
EducLev.
Then we can run a regression analysis with Salary as the
response variable, using any combination of numerical and
dummy explanatory variables.
We must follow two rules:
– We shouldn’t use any of the original categorical variables that the
dummies are based on.
– We should use one less dummy than the number of categories for
any categorical variable.
Regression Analysis -- continued
This second rule is a technical one. If we violate it, the
software will give us an error message.
For example, of the six education dummies Ed_1-Ed_6, any
five can be used. The omitted dummy then corresponds to
the reference category.
As we will see, the interpretations of the dummy variable
coefficients are all relative to this reference category.
To get used to dummy variables in regression analysis
we will proceed in several stages.
Regression Analysis -- continued
We first estimate a regression equation with only one variable.
The output is shown in this table. The resulting equation is
Predicted Salary = 45.505 - 8.296Female
Regression Analysis -- continued
To interpret this equation, recall that Female has only
two possible values, 0 and 1. If we substitute 1, the
predicted salary equals 37.209, and if we substitute 0,
the predicted salary is 45.505.
These are the average salaries of females and males,
respectively. Therefore the interpretation of the -8.296
coefficient of the Female dummy variable is
straightforward: it is the difference between the average
female salary and the average male salary.
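The claim that a regression on a single dummy simply reproduces the two group averages can be checked on a small example. The six salaries below are hypothetical (they are not the bank's data), and the least-squares formulas are the standard ones for one explanatory variable.

```python
from statistics import mean

# Hypothetical salaries (in $1000s) with a Female dummy; not the bank's data.
salary = [50.0, 44.0, 41.0, 36.0, 39.0, 33.0]
female = [0, 0, 0, 1, 1, 1]

# Least-squares slope and intercept for a single explanatory variable.
x_bar, y_bar = mean(female), mean(salary)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(female, salary))
sxx = sum((x - x_bar) ** 2 for x in female)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Group averages computed directly.
male_mean = mean(s for s, f in zip(salary, female) if f == 0)
female_mean = mean(s for s, f in zip(salary, female) if f == 1)

print(intercept, male_mean)            # intercept equals the male average
print(slope, female_mean - male_mean)  # slope equals the difference in averages
```

With only the dummy in the equation, the intercept is the average for the reference group and the coefficient is the gap between the two group averages.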
Regression Analysis -- continued
The above equation tells only part of the story, because it
ignores all information except for gender.
We expand this equation by adding the experience
variables. The output is shown in this table.
Regression Analysis -- continued
The corresponding equation is
Predicted Salary = 35.492 + 0.998YrsExper
+ 0.131YrsPrior - 8.080Female
Regression Analysis -- continued
We next add job grade to the equation by including
five of the six job grade dummies. Although any five
can be used, we use Job_2-Job_6. The resulting
output is shown in this table.
Regression Analysis -- continued
The estimated regression equation is now
Predicted Salary = 30.230 + 0.408YrsExper + 0.149YrsPrior
- 1.962Female + 2.57Job_2 + 6.295Job_3 + 10.475Job_4
+ 16.011Job_5 + 27.647Job_6
Regression Analysis -- continued
For example, the equation for females at the fifth job
grade is found by setting Female=1 and Job_5=1 and
setting the other job dummies equal to 0. The
equation formed is
Predicted Salary = 44.279 + 0.408YrsExper + 0.150YrsPrior
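The intercept of 44.279 comes from adding the Female and Job_5 coefficients to the constant term. As a quick check, the sketch below stores the dummy coefficients quoted in the estimated equation and adds up whichever ones are switched on; the function name group_intercept is only an illustrative choice.

```python
# Constant and dummy coefficients as quoted in the estimated equation.
coef = {
    "Intercept": 30.230,
    "Female": -1.962,
    "Job_2": 2.57,
    "Job_3": 6.295,
    "Job_4": 10.475,
    "Job_5": 16.011,
    "Job_6": 27.647,
}

def group_intercept(female, job_grade):
    """Intercept of the salary line for a gender (0/1) and job grade (1-6)."""
    value = coef["Intercept"] + coef["Female"] * female
    if job_grade >= 2:  # grade 1 is the reference category, so it adds nothing
        value += coef[f"Job_{job_grade}"]
    return value

print(round(group_intercept(female=1, job_grade=5), 3))  # females, fifth grade
print(round(group_intercept(female=0, job_grade=1), 3))  # males, reference grade
```

The slopes on YrsExper and YrsPrior are the same for every group; only the intercept shifts.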
Regression Analysis -- continued
– The coefficients of the job dummies indicate the average increase in
salary an employee can expect relative to the reference (lowest) job
grade.
– The key coefficient, the negative $1962 for females, indicates the
average salary disadvantage for females relative to males, given that
they have the same experience levels and are in the same job grade.
Regression Analysis -- continued
We can check whether females are
disproportionately in the lower job categories by using
a pivot table with JobGrade in the row area, Gender
in the column area and the count (expressed as a
percentage) of any variable in the data area.
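For readers working outside Excel, the same cross-tabulation can be sketched with a counter. The eight records below are hypothetical, and expressing each count as a percentage of its gender column is an assumption about how the pivot table is set up.

```python
from collections import Counter

# Hypothetical (JobGrade, Gender) records; not the bank's data.
records = [
    (1, "Female"), (1, "Female"), (1, "Male"),
    (2, "Female"), (2, "Male"),
    (3, "Male"), (3, "Male"), (3, "Female"),
]

counts = Counter(records)                            # cell counts
column_totals = Counter(g for _, g in records)       # totals per gender

def column_percent(grade, gender):
    """Share of a gender's employees who are in the given job grade."""
    return 100.0 * counts[(grade, gender)] / column_totals[gender]

for grade in sorted({g for g, _ in records}):
    print(grade, column_percent(grade, "Female"), column_percent(grade, "Male"))
```

Comparing the two percentage columns row by row shows whether one gender is concentrated in particular grades.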
Regression Analysis -- continued
Clearly, females tend to be concentrated at the lower
job grades.
This certainly helps to explain why females get lower
salaries on average, but it doesn’t explain why
females are at the lower job grades in the first place.
We won’t be able to provide a thorough analysis of
this issue but we can add one more piece to the
puzzle now by adding education level, age, and
PCJob to the equation.
Regression Analysis -- continued
We don’t provide the whole equation but the resulting
output is shown here.
Regression Analysis -- continued
The coefficients can be seen in the output.
The new variables don’t appear to add much to the previous
equation. The “penalty” for females does, however, go up to
$2555, which is somewhat greater than the previous $1962.
At face value we can interpret the coefficients of the
education dummies as a benefit (or loss if negative)
of extra education relative to a high school diploma,
the reference category.
Regression Analysis -- continued
The coefficient of PCJob implies that an employee
with a computer-related job can expect an extra
$4923 in salary relative to an employee without a
computer-related job, provided the other variables
are the same for each employee.
The age coefficient is quite small and has little effect
on salary.
Conclusion
The main conclusion we can draw from the output is
that there is still a plausible case to be made for
discrimination against females, even after including
information on all the variables in the database in the
regression equation.
Modeling Possibilities
BANK.XLS
The Fifth National Bank of Springfield is facing a
gender-discrimination suit. The charge is that its
female employees receive substantially smaller
salaries than its male employees.
The bank’s employee database is listed in this file.
Here is a partial list of the data.
Question
Earlier we estimated an equation for Salary using the
numerical explanatory variables YrsExper and YrsPrior
and the dummy variable Female.
If we drop the YrsPrior variable from the equation (for
simplicity) and rerun the regression, we obtain the
equation
Predicted Salary = 35.824 + 0.981YrsExper - 8.012Female
Interaction Terms
An interaction variable algebraically is the product of
two variables. Its effect is to allow the effect of one of
the variables on Y to depend on the value of the
other variable.
The interaction term allows the slope of the
regression line to differ between the two categories.
Solution
We first need to form an interaction variable that is the
product of YrsExper and Female.
This can be done two ways in Excel.
– we can do it manually by introducing a new variable that contains
the product of the two variables involved, or
Solution -- continued
Once the interaction variable has been created, we
include it in the regression equation in addition to the
other variables. The multiple regression output is
shown here.
Solution -- continued
The estimated regression equation is
Predicted Salary = 30.430 + 1.528YrsExper + 4.908Female
- 1.248YrsExper_Female
Nonparallel Female and Male
Salary Lines
Solution -- continued
The Y-intercept for the female line is slightly higher - females
with no experience at Fifth National Bank tend to start out
slightly higher than males - but the slope of the female line is
much lower. That is, males tend to move up the salary ladder
much more quickly than females.
Again, this provides another argument, although a somewhat
different one, for gender discrimination against females.
The R² value increased from 49.1% to 63.9%. The interaction
variable has definitely added to the explanatory power of the
equation.
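To see the two nonparallel lines in numbers, the sketch below splits the estimated interaction equation (30.430 + 1.528YrsExper + 4.908Female - 1.248YrsExper_Female) into a male line and a female line. The crossover calculation is an added illustration, not something reported in the regression output.

```python
# Coefficients from the estimated equation with the interaction term.
b0, b_exper, b_female, b_inter = 30.430, 1.528, 4.908, -1.248

def predicted_salary(yrs_exper, female):
    """Predicted salary; the interaction variable is YrsExper * Female."""
    return (b0 + b_exper * yrs_exper + b_female * female
            + b_inter * yrs_exper * female)

# Male line: Female = 0, so only the constant and YrsExper terms remain.
male_intercept, male_slope = b0, b_exper
# Female line: the Female terms shift both the intercept and the slope.
female_intercept = b0 + b_female
female_slope = b_exper + b_inter

# Years of experience at which the two lines cross.
crossover_years = -b_female / b_inter

print(round(female_intercept, 3), round(female_slope, 3))
print(round(crossover_years, 2))
```

The female line starts higher but rises far more slowly, so under this equation the male line overtakes it after roughly four years of experience.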
Modeling Possibilities
BANK.XLS
The Fifth National Bank of Springfield is facing a
gender-discrimination suit. The charge is that its
female employees receive substantially smaller
salaries than its male employees.
The bank’s employee database is listed in this file.
Here is a partial list of the data.
Question
A glance at the distribution of salaries of the 208
employees shows some skewness to the right - a few
employees make substantially more than the majority
of employees.
Therefore, it might make sense to use the natural
logarithm of Salary instead of Salary as the response
variable.
If we do this, how do we interpret the results?
Solution
All of the analyses we did earlier with this data set
could be repeated except with Log_Salary as the
response variable.
For the sake of discussion we will look only at the
regression equation with Female and YrsExper as
explanatory variables.
After we create the Log_Salary variable and run the
regression, we obtain the output shown here.
Regression Output with
Log_Salary as Response Variable
Solution
The estimated regression equation is
Predicted Log_Salary = 3.5829 + 0.0188YrsExper
- 0.1616Female
Solution -- continued
The situation for the standard error of estimate, se, is even worse.
Each se is a measure of a typical residual, but the residuals in the
Log_Salary equation are in log dollars, whereas the residuals in the
Salary equation are in dollars.
Therefore it is no surprise that the se for the Log_Salary equation
is much smaller than the se for the Salary equation.
If we want comparable standard error measures for the two
equations, we should take antilogs of the fitted values from the
Log_Salary equation to convert them back to dollars, subtract
these from the original Salary values, and take the standard
deviation of these residuals.
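The conversion just described can be written as a short function. This is a minimal sketch: the function name is hypothetical, the two-employee input is made up purely to exercise it, and it uses the ordinary sample standard deviation, which may differ slightly from the divisor a regression package uses for se.

```python
import math
from statistics import stdev

def dollar_scale_se(salaries, fitted_logs):
    """Standard deviation of residuals after converting log fits to dollars."""
    # Antilog each fitted Log_Salary value, then subtract from actual Salary.
    residuals = [s - math.exp(f) for s, f in zip(salaries, fitted_logs)]
    return stdev(residuals)

# Made-up inputs: actual salaries and fitted values on the log scale.
salaries = [10.0, 20.0]
fitted_logs = [math.log(9.0), math.log(22.0)]

print(dollar_scale_se(salaries, fitted_logs))
```

Because the result is back in the units of Salary, it can be compared directly with the se from the Salary equation.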
Solution -- continued
The resulting standard deviation is 7.74. This is
somewhat smaller than the se from the Salary equation,
an indication of a slightly better fit.
Finally we interpret the equation itself.
When the response variable is Log_Y and a term on
the right-hand side of the equation is of the form bX,
then whenever X increases by one unit, Y-hat changes
by a constant percentage, and this percentage is
approximately equal to b (written as a percentage).
Solution -- continued
This means that for each additional year of experience with
Fifth National, an employee’s salary can be expected
to increase by about 1.88%.
The coefficient of Female implies an expected decrease in
salary of about 16.16%.
In other words, this equation implies that females can
expect to make about 16% less than males with
comparable years of experience.
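The percentages above rely on the approximation that a coefficient b in a Log_Y equation is roughly the percentage change in Y. The exact change implied by the equation is exp(b) - 1, and the short sketch below compares the two for the coefficients 0.0188 and -0.1616 quoted earlier. The helper name is illustrative.

```python
import math

def exact_pct_change(b):
    """Exact percentage change in Y when X rises by one unit in a Log_Y equation."""
    return 100.0 * (math.exp(b) - 1.0)

for name, b in [("YrsExper", 0.0188), ("Female", -0.1616)]:
    approx = 100.0 * b
    print(name, round(approx, 2), round(exact_pct_change(b), 2))
```

The approximation is very close for a small coefficient such as 0.0188 and somewhat rougher for a larger one such as -0.1616, where the exact figure is nearer 15% than 16%.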
Modeling Possibilities
POWER.XLS
The Public Service Electric Company produces
different quantities of electricity each month, depending
on the demand.
This file lists the number of units of electricity produced
(Units) and the total cost of producing these (Cost) for
a 36-month period.
The data set appears on the next slide.
How can regression be used to analyze the
relationship between Cost and Units?
Data for Electric Power
Solution
A good place to start is with a scatterplot of Cost
versus Units.
Solution -- continued
The scatterplot indicates a definite positive relationship and
one that is nearly linear.
However, there is also some evidence of curvature in the
plot. The points increase slightly less rapidly as Units
increase from left to right.
In economic terms, there may be economies of scale, where the
marginal cost of electricity decreases as more units of
electricity are produced.
Nevertheless, we use regression to estimate a linear
relationship between Cost and Units.
Solution -- continued
The resulting regression equation is
Predicted Cost = 23,651 + 30.53 Units
Residuals from a Straight-Line
Fit
Solution -- continued
Admittedly the pattern is far from perfect - there are a
few negatives in the middle - but the plot does hint at
nonlinear behavior.
The negative-positive-negative behavior of the residuals
suggests a parabola; that is, a quadratic equation with
the square of Units included in the equation.
We first create a new variable Sqr_Units in the data set.
This can be done manually or using StatPro’s Transform
Variables menu item.
Solution -- continued
Then we use multiple regression to estimate the
equation for Cost with both explanatory variables,
Units and Sqr_Units, included.
The resulting equation from the output on the next
slide is
Predicted Cost = 5793 + 98.3Units - 0.0600Sqr_Units
Regression Output with Squared
Term Included
Solution -- continued
One way to see how this regression equation fits the
scatterplot of Cost versus Units is to use Excel’s
trendline option.
To do so, activate the scatterplot, click on any point,
use the Chart/Add Trendline menu item, click the Type
tab, and select the Polynomial type of order 2, that is, a
quadratic.
A graph of the equation is superimposed on the
scatterplot on the following slide. It shows reasonably
good fit, plus an obvious curvature.
Quadratic Fit Scatterplot
Solution -- continued
The main downside to a quadratic regression equation is that
there is no easy interpretation of the coefficients of Units and
Sqr_Units.
All we can say is that the terms in the equation combine to
explain the nonlinear relationship between units produced and
total cost.
A final note about the equation concerns the coefficient of
Sqr_Units.
– First, the fact that it is negative makes the parabola bend downward.
This produces the decreasing marginal cost behavior, where every
extra unit of electricity incurs a smaller cost.
Solution -- continued
– Second, we shouldn’t be fooled by the small magnitude of
this coefficient. Remember that it is the coefficient of Units
squared, which is a large quantity. Therefore, the effect of
the product -0.0600Sqr_Units is sizable.
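Both points about the Sqr_Units coefficient can be seen by evaluating the quadratic equation (5793 + 98.3Units - 0.0600Sqr_Units) directly. In the sketch below the marginal cost is taken as the derivative of predicted cost with respect to Units, and the Units values 200, 500 and 800 are hypothetical choices for illustration, not values read from the data set.

```python
def predicted_cost(units):
    """Quadratic cost equation from the regression output."""
    return 5793 + 98.3 * units - 0.0600 * units ** 2

def marginal_cost(units):
    """Derivative of predicted cost with respect to Units."""
    return 98.3 - 2 * 0.0600 * units

for units in (200, 500, 800):  # hypothetical production levels
    squared_term = -0.0600 * units ** 2
    print(units, round(predicted_cost(units), 1),
          round(marginal_cost(units), 2), round(squared_term, 1))
```

The marginal cost falls as Units rises, and the squared term, despite its small coefficient, contributes thousands of dollars at these production levels.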
Solution -- continued
To create the new variable, Log_Units, we can again use
StatPro’s Transform Variables menu item, and then we
can superimpose a logarithmic curve on the
scatterplot of Cost versus Units by using the trendline
feature.
This curve appears in the scatterplot on the next
slide.
To the naked eye, it appears to be similar, and about
as good a fit as the quadratic curve.
Logarithmic Fit Scatterplot
Solution -- continued
The resulting regression equation is
Predicted Cost = -63,993 + 16,654Log_Units
Solution -- continued
In this case, where the log of the explanatory variable is
used, we can interpret its coefficient as follows.
– Suppose Units increases by 1%, for example from 600 to 606.
Then the equation implies that the expected Cost will increase
approximately $166.54.
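The $166.54 figure is the coefficient of Log_Units divided by 100. As a check, the sketch below computes the change in predicted Cost implied by the equation for the 600 to 606 example and compares it with that rule of thumb. It assumes Log_Units is a natural logarithm; the names are illustrative.

```python
import math

B_LOG_UNITS = 16_654  # coefficient of Log_Units in the estimated equation

def cost_change(units_from, units_to):
    """Change in predicted Cost when Units moves between two levels."""
    return B_LOG_UNITS * (math.log(units_to) - math.log(units_from))

exact = cost_change(600, 606)        # a 1% increase in Units
rule_of_thumb = B_LOG_UNITS / 100    # coefficient divided by 100

print(round(exact, 2), round(rule_of_thumb, 2))
```

The two numbers differ by less than a dollar, and the same change results from any 1% increase, whatever the starting level of Units.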
Modeling Possibilities
CARDEMAND.XLS
This file contains annual data (1970-1987) on domestic auto
sales in the United States. The data set is shown on the
next slide.
The variables are defined as
– Quantity: annual domestic auto sales (in number of units)
– Price: real price index of new cars
– Income: real disposable income
– Interest: prime rate of interest
Estimate and interpret a multiplicative (constant elasticity)
relationship between Quantity and Price, Income and Interest.
Car Demand Data
Constant Elasticity Relationships
A particular type of nonlinear relationship that has
firm grounding in economic theory is called a
constant elasticity relationship. It is also called a
multiplicative relationship.
One property of this type of relationship is that the
effect of a change in any explanatory variable Xi on
Y depends on the levels of the other X’s in the
equation.
Solution
We first take the natural logs of all four variables.
– This can be done in one step using the Transform Variables
menu item or we can use Excel’s LN function.
Regression Output for
Multiplicative Relationship
Solution -- continued
If we like we can convert this back to the original variables,
that is back to multiplicative form, by taking antilogs. The
result is
Predicted Quantity = 107.198 × Price^(-1.185) × Income^(2.183) × Interest^(-0.191)
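The exponents in the multiplicative form are elasticities: a 1% change in one variable changes predicted Quantity by roughly that exponent in percent, whatever the levels of the variables. The sketch below checks this for Price using the constant and exponents quoted above; the Price, Income and Interest levels passed in are hypothetical and are not taken from the data set.

```python
def predicted_quantity(price, income, interest):
    """Multiplicative (constant elasticity) form of the estimated equation."""
    return 107.198 * price ** -1.185 * income ** 2.183 * interest ** -0.191

def pct_change_from_price_rise(price, income, interest, rise=0.01):
    """Percentage change in predicted Quantity when Price rises by `rise`."""
    before = predicted_quantity(price, income, interest)
    after = predicted_quantity(price * (1 + rise), income, interest)
    return 100.0 * (after / before - 1.0)

# Two hypothetical sets of levels give the same percentage response.
print(round(pct_change_from_price_rise(100.0, 500.0, 8.0), 3))
print(round(pct_change_from_price_rise(140.0, 900.0, 12.0), 3))
```

The absolute change in Quantity does depend on the other variables, but the percentage change does not, which is what "constant elasticity" refers to.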
Conclusions
Does this multiplicative equation provide a better fit to
the automobile data than does an additive relationship?
Without doing considerably more work it is difficult to
answer this question with certainty.
As we discussed previously, it is not sufficient to
compare R² and se values for the two fits.
We will simply state that the multiplicative relationship
provides a reasonably good fit, and it makes sense
economically.
Modeling Possibilities
LEARNING.XLS
The Presario Company produces a variety of small
industrial products.
It has just finished producing 22 batches of a new
product (new to Presario) for a customer.
This file contains the times (in hours) to produce each
batch. These data are in the table on the next slide.
Clearly, the times have tended to decrease as Presario
has gained more experience in making the product.
Data for Learning Curve
Solution
One way to check whether the multiplicative learning
model is reasonable is to create the log variables
Log_Time and Log_Batch in the usual way and then
see whether a scatterplot of Log_Time versus
Log_Batch is approximately linear.
The multiplicative model implies that it should be.
Such a scatterplot is shown on the next slide, along
with a superimposed linear trend line. The fit appears
to be quite good.
Scatterplot of Log Variables with
Linear Trend Superimposed
Solution -- continued
To estimate the relationship, we regress Log_Time
on Log_Batch. The resulting equation is
Predicted Log_Time = 4.834 - 0.155Log_Batch
Solution -- continued
– We know that the estimated learning rate satisfies
-0.155 = ln(learning rate)/ln(2)
Solving for the learning rate (multiplying through by ln(2) and
then taking antilogs), we find that it is 0.898, or approximately
90%. In other words, whenever cumulative production
doubles, the time to produce a batch decreases by about
10%.
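The two algebra steps, multiplying the slope by ln(2) and then taking the antilog, amount to raising 2 to the power of the slope. A short check using the estimated slope of -0.155:

```python
import math

slope = -0.155  # coefficient of Log_Batch in the estimated equation

# ln(learning rate) = slope * ln(2), so the rate is the antilog of that product.
learning_rate = math.exp(slope * math.log(2))

# Equivalent one-step form: 2 raised to the slope.
assert abs(learning_rate - 2 ** slope) < 1e-12

print(round(learning_rate, 3))                  # about 0.898
print(round(100 * (1 - learning_rate), 1))      # percentage drop per doubling
```

Each doubling of cumulative production multiplies the batch time by this rate, which is where the decrease of about 10% comes from.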
Predicting Future Production
Times
Presario could use this regression equation to predict
future production times.
For example, suppose the customer places an order for
15 more batches of the same product. We can use the
equation to predict the log of production time for each
batch, then take their antilogs and sum them to obtain
the total production time.
The calculations are shown in rows 26-42 of the
following table. The total predicted time to finish is about
1115 hours.
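The same prediction can be reproduced outside the spreadsheet. The sketch below uses the estimated equation Log_Time = 4.834 - 0.155Log_Batch and assumes that the 15 new batches are numbered 23 through 37, continuing from the 22 batches already produced, and that the logs are natural logarithms.

```python
import math

def predicted_time(batch):
    """Predicted hours for a batch: antilog of the fitted Log_Time."""
    return math.exp(4.834 - 0.155 * math.log(batch))

# Assumption: the 15 additional batches follow the first 22.
future_batches = range(23, 38)

total_time = sum(predicted_time(b) for b in future_batches)

for b in future_batches:
    print(b, round(predicted_time(b), 1))
print("Total predicted hours:", round(total_time))
```

The predicted batch times fall slowly from batch to batch, and their sum is close to the 1115 hours reported in the table.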
Using the Learning Curve Model
for Predictions