# Chapter 10

© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

BUSINESS ANALYTICS: DATA ANALYSIS AND DECISION MAKING

Relationships

## Introduction (slide 1 of 2)

- Regression analysis is the study of relationships between variables.
- There are two potential objectives of regression analysis: to understand how the world operates and to make predictions.
- Two basic types of data are analyzed:
  - Cross-sectional data are usually data gathered from approximately the same period of time from a population.
  - Time series data involve one or more variables that are observed at several, usually equally spaced, points in time.
    - Time series variables are usually related to their own past values (a property called autocorrelation), which adds complications to the analysis.


## Introduction (slide 2 of 2)

- In every regression study, there is a single variable that we are trying to explain or predict, called the dependent variable.
- To help explain or predict the dependent variable, we use one or more explanatory variables.
- If there is a single explanatory variable, the analysis is called simple regression.
- If there are several explanatory variables, it is called multiple regression.
- Regression can be linear (straight-line relationships) or nonlinear (curved relationships).
  - Many nonlinear relationships can be linearized mathematically.


## Scatterplots: Graphing Relationships

- Drawing scatterplots is a good way to begin regression analysis.
- A scatterplot is a graphical plot of two variables, an X and a Y.
- If there is any relationship between the two variables, it is usually apparent from the scatterplot.
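The slides build scatterplots in Excel or StatTools; for readers working in Python, a minimal sketch looks like the following. The Promote/Sales numbers here are hypothetical stand-ins, not the Pharmex data.

```python
# Minimal scatterplot sketch in Python (the slides use Excel/StatTools).
# The Promote/Sales data below are hypothetical, not the Pharmex data.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
promote = rng.uniform(80, 120, 50)                  # X variable
sales = 25 + 0.76 * promote + rng.normal(0, 5, 50)  # Y variable, roughly linear in X

fig, ax = plt.subplots()
ax.scatter(promote, sales)
ax.set_xlabel("Promote")
ax.set_ylabel("Sales")
fig.savefig("scatterplot.png")
```

If a relationship exists, it usually shows up immediately in a plot like this one.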


## Example 10.1: Drugstore Sales.xlsx (slide 1 of 2)

- Objective: To use a scatterplot to examine the relationship between promotional expenditures and sales at Pharmex.
- Solution: Pharmex has collected data from 50 randomly selected metropolitan regions.
- There are two variables: Pharmex's promotional expenditures as a percentage of those of the leading competitor (Promote) and Pharmex's sales as a percentage of those of the leading competitor (Sales).
- A partial listing of the data is shown below.


## Example 10.1: Drugstore Sales.xlsx (slide 2 of 2)

- Use Excel's Chart Wizard or the StatTools Scatterplot procedure to create a scatterplot.
- Sales is on the vertical axis and Promote is on the horizontal axis because the store believes that larger promotional expenditures tend to cause larger values of sales.


## Example 10.2: Overhead Costs.xlsx (slide 1 of 3)

- Objective: To use scatterplots to examine the relationships among overhead, machine hours, and production runs at Bendrix.
- Solution: The data file contains observations of overhead costs, machine hours, and number of production runs at Bendrix.
- Each observation (row) corresponds to a single month.

## Example 10.2: Overhead Costs.xlsx (slide 2 of 3)

- Examine scatterplots between each explanatory variable (Machine Hours and Production Runs) and the dependent variable (Overhead).

## Example 10.2: Overhead Costs.xlsx (slide 3 of 3)

- Check for possible time series patterns by creating a time series graph for any of the variables.
- Check for relationships among the multiple explanatory variables (Machine Hours versus Production Runs).


- Scatterplots are useful for detecting relationships that may not be obvious otherwise.
- The typical relationship you hope to see is a straight-line, or linear, relationship.
  - This doesn't mean that all points lie on a straight line, but that the points tend to cluster around a straight line.
- The scatterplot below illustrates a relationship that is clearly nonlinear.


## Outliers (slide 1 of 2)

- Scatterplots are especially useful for identifying outliers: observations that fall outside the general pattern of the rest of the observations.
- If an outlier is clearly not a member of the population of interest, then it is probably best to delete it from the analysis.
- If it isn't clear whether outliers are members of the relevant population, run the regression analysis with them and again without them.
  - If the results are practically the same in both cases, then it is probably best to report the results with the outliers included.
  - Otherwise, you can report both sets of results with a verbal explanation of the outliers.

## Outliers (slide 2 of 2)

- In the figure below, the outlier (the point at the top right) is the company CEO, whose salary is well above that of all of the other employees.


## Unequal Variance

- Occasionally, the variance of the dependent variable depends on the value of the explanatory variable.
- The figure below illustrates an example of this.
  - There is a clear upward relationship, but the variability of amount spent increases as salary increases, which is evident from the fan shape.
- This unequal variance violates one of the assumptions of linear regression analysis, but there are ways to deal with it.


## No Relationship

- A scatterplot can also indicate that there is no relationship between a pair of variables.
- This is usually the case when the scatterplot appears as a shapeless swarm of points.


## Correlations: Indicators of Linear Relationships (slide 1 of 2)

- Correlations are numerical summary measures that indicate the strength of linear relationships between pairs of variables.
- A correlation between a pair of variables is a single number that summarizes the information in a scatterplot.
  - It measures the strength of linear relationships only.
  - The usual notation for a correlation between variables X and Y is r_xy.
- Formula for correlation:

  r_xy = Cov(X, Y) / (s_X · s_Y), where Cov(X, Y) = Σ(X_i − X̄)(Y_i − Ȳ) / (n − 1)

- The numerator of the equation is also a measure of association between X and Y, called the covariance between X and Y.
- The magnitude of a covariance is difficult to interpret because it depends on the units of measurement.
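The covariance and correlation definitions above can be computed directly in Python; a minimal sketch with hypothetical paired data (assuming numpy):

```python
import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 5.0, 7.2, 8.8, 11.1])

n = len(x)
# Covariance: average product of deviations from the means (n - 1 denominator)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
# Correlation: covariance scaled by both standard deviations, so units cancel
r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
```

Because of the scaling by s_X and s_Y, r_xy is unit-free and always lies between −1 and +1, while cov_xy changes if either variable is re-expressed in different units.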


## Correlations: Indicators of Linear Relationships (slide 2 of 2)

- By looking at the sign of the covariance or correlation (plus or minus), you can tell whether the two variables are positively or negatively related.
- Unlike covariances, correlations are completely unaffected by the units of measurement.
  - A correlation equal to 0 or near 0 indicates practically no linear relationship.
  - A correlation with magnitude close to 1 indicates a strong linear relationship.
  - A correlation equal to −1 (negative correlation) or +1 (positive correlation) occurs only when the linear relationship between the two variables is perfect.
- Be careful when interpreting correlations: they are relevant descriptors only for linear relationships.


## Simple Linear Regression

- Scatterplots and correlations indicate linear relationships and the strengths of these relationships, but they do not quantify them.
- Simple linear regression quantifies the relationship when there is a single explanatory variable.
- A straight line is fitted through the scatterplot of the dependent variable Y versus the explanatory variable X.


## Least Squares Estimation (slide 1 of 2)

- When fitting a straight line through a scatterplot, choose the line that makes the vertical distances from the points to the line as small as possible.
- A fitted value is the predicted value of the dependent variable.
  - Graphically, it is the height of the line above a given explanatory value.


## Least Squares Estimation (slide 2 of 2)

- Fundamental equation for regression:

  Observed Value = Fitted Value + Residual

- The best-fitting line through the points of a scatterplot is the line with the smallest sum of squared residuals.
  - This is called the least squares line.
  - It is the line quoted in regression outputs.
- The least squares line is specified by its slope and intercept.
- Equations for slope and intercept in simple linear regression:

  b = r_xy · (s_Y / s_X)
  a = Ȳ − b·X̄

## Example 10.1 (continued): Drugstore Sales.xlsx (slide 1 of 2)

- Objective: To use StatTools's Regression procedure to find the least squares line for sales as a function of promotional expenses at Pharmex.
- Solution: Select Regression from the StatTools Regression and Classification dropdown list.
- Use Sales as the dependent variable and Promote as the explanatory variable.
- The regression output is shown below and on the next slide.


## Example 10.1 (continued): Drugstore Sales.xlsx (slide 2 of 2)

- The equation for the least squares line is:

  Predicted Sales = 25.1264 + 0.7623·Promote


## Example 10.2 (continued): Overhead Costs.xlsx (slide 1 of 2)

- Objective: To use the StatTools Regression procedure to regress overhead expenses at Bendrix against machine hours and then against production runs.
- Solution: The Bendrix manufacturing data set has two potential explanatory variables, Machine Hours and Production Runs.
- The regression output for Overhead with Machine Hours as the single explanatory variable is shown below.


## Example 10.2 (continued): Overhead Costs.xlsx (slide 2 of 2)

- The output when Production Runs is the only explanatory variable is shown below.
- The two least squares lines are therefore:

  Predicted Overhead = 48621 + 34.7·MachineHours
  Predicted Overhead = 75606 + 655.1·ProductionRuns


## Standard Error of Estimate

- The magnitudes of the residuals provide a good indication of how useful the regression line is for predicting Y values from X values.
- Because there are numerous residuals, it is useful to summarize them with a single numerical measure, called the standard error of estimate and denoted s_e.
  - It is essentially the standard deviation of the residuals and is given by this equation:

  s_e = √( Σe_i² / (n − 2) )

- The usual empirical rules for standard deviations can be applied to the standard error of estimate.
- In general, the standard error of estimate indicates the level of accuracy of predictions made from the regression equation.
- The smaller it is, the more accurate predictions tend to be.
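A minimal sketch of the standard error of estimate for a simple regression, using hypothetical data (assuming numpy); note the n − 2 denominator, which reflects the two estimated coefficients:

```python
import numpy as np

# Hypothetical data with a roughly linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

b, a = np.polyfit(x, y, 1)        # least squares slope and intercept
residuals = y - (a + b * x)
n = len(x)
# Standard error of estimate: like a standard deviation of the residuals,
# but with n - 2 degrees of freedom for simple regression
se = np.sqrt(np.sum(residuals ** 2) / (n - 2))
```

When the regression is useful, s_e is well below the plain standard deviation of Y, since knowing X shrinks the typical prediction error.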


## R-Square

- R² is an important measure of the goodness of fit of the least squares line.
- It is the percentage of variation of the dependent variable explained by the regression.
  - It always ranges between 0 and 1.
  - The better the linear fit is, the closer R² is to 1.
- Formula for R²:

  R² = 1 − Σe_i² / Σ(Y_i − Ȳ)²

- In simple linear regression, R² is the square of the correlation between the dependent variable and the explanatory variable.
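The R² formula, sketched with hypothetical data (assuming numpy); the test of the claim that R² equals the squared correlation in simple regression is immediate:

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 5.5, 8.5, 9.5])

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)
ss_res = np.sum(residuals ** 2)          # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation of Y
r_squared = 1 - ss_res / ss_tot          # fraction of variation explained
```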


## Multiple Regression

- To obtain improved fits in regression, several explanatory variables can be included in the regression equation. This is the realm of multiple regression.
- Graphically, you are no longer fitting a line to a set of points.
  - If there are two explanatory variables, you are fitting a plane to the data in three-dimensional space.
- The regression equation is still estimated by the least squares method, but it is not practical to do this by hand.
- There is a slope term for each explanatory variable in the equation, but the interpretation of these terms is different.
- The standard error of estimate and R² summary measures are almost exactly as in simple regression.
- Many types of explanatory variables can be included in the regression equation.


- If Y is the dependent variable, and X1 through Xk are the explanatory variables, then a typical multiple regression equation has the form shown below, where a is the Y-intercept, and b1 through bk are the slopes.
- General multiple regression equation:

  Predicted Y = a + b1X1 + b2X2 + … + bkXk

- Collectively, a and the b's in the equation are called the regression coefficients.
- Each slope coefficient is the expected change in Y when this particular X increases by one unit and the other Xs in the equation remain constant.
  - This means that the estimates of the b's depend on which other Xs are included in the regression equation.
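Estimating the multiple regression coefficients by least squares can be sketched as follows; the data are simulated in the spirit of the Bendrix example (all numbers hypothetical), assuming numpy:

```python
import numpy as np

# Simulated data: Y depends on two explanatory variables.
# All numbers here are hypothetical, not the actual Bendrix data.
rng = np.random.default_rng(1)
x1 = rng.uniform(1000, 2000, 36)   # e.g. machine hours
x2 = rng.uniform(20, 60, 36)       # e.g. production runs
y = 4000 + 40 * x1 + 900 * x2 + rng.normal(0, 2000, 36)

# Least squares: stack a column of ones (for the intercept a) with the Xs
X = np.column_stack([np.ones(len(y)), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coeffs
predicted = X @ coeffs             # Predicted Y = a + b1*X1 + b2*X2
```

Each estimated slope measures the expected change in Y per unit change in its X, holding the other X fixed; rerunning the fit with x2 dropped would generally change the estimate of b1.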


## Example 10.2 (continued): Overhead Costs.xlsx

- Objective: To use StatTools's Regression procedure to estimate the equation for overhead costs at Bendrix as a function of machine hours and production runs.
- Solution: Select Regression from the StatTools Regression and Classification dropdown list. Then choose the Multiple option and specify the single D (dependent) variable and the two I (independent) variables.
- The coefficients in the output below indicate that the estimated regression equation is:

  Predicted Overhead = 3997 + 43.54·Machine Hours + 883.62·Production Runs


## Interpretation of Standard Error of Estimate and R-Square

- The multiple regression output is very similar to simple regression output.
- The standard error of estimate is essentially the standard deviation of the residuals, but it is now given by the equation below, where n is the number of observations and k is the number of explanatory variables:

  s_e = √( Σe_i² / (n − k − 1) )

- The R² value is again the percentage of variation of the dependent variable explained by the combined set of explanatory variables, but it has a serious drawback: it can only increase when extra explanatory variables are added to an equation.
- Adjusted R² is an alternative measure that adjusts R² for the number of explanatory variables in the equation.
  - It is used primarily to monitor whether extra explanatory variables really belong in the equation.
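The slide does not show the adjusted R² formula explicitly; the standard version, together with the multiple-regression standard error, can be sketched as follows (function names are illustrative, assuming numpy):

```python
import numpy as np

def adjusted_r_squared(r_squared, n, k):
    """Standard adjustment of R^2 for k explanatory variables and n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

def multiple_se(residuals, k):
    """Standard error of estimate with n - k - 1 degrees of freedom."""
    residuals = np.asarray(residuals)
    n = len(residuals)
    return np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
```

If an added variable leaves R² essentially unchanged, adjusted R² falls, which is exactly the signal that the extra variable may not belong in the equation.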


## Modeling Possibilities

- Several types of explanatory variables can be included in regression equations:
  - Dummy variables
  - Interaction variables
  - Nonlinear transformations
- There are many alternative approaches to modeling the relationship between a dependent variable and potential explanatory variables.
- In many applications, these techniques produce much better fits than you could obtain otherwise.

## Dummy Variables

- Some potential explanatory variables are categorical and cannot be measured on a quantitative scale.
- However, these categorical variables are often related to the dependent variable, so they need to be included in the regression equation.
- A dummy variable is a variable with possible values of 0 and 1.
  - It is also called a 0-1 variable or an indicator variable.
  - It equals 1 if the observation is in a particular category, and 0 if it is not.
- A dummy variable can be used when there are only two categories (example: gender) or when there are more than two categories (example: quarters).
  - In the latter case, multiple dummy variables must be created.
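Dummy coding for a multi-category variable can be sketched as follows (hypothetical quarter data, assuming numpy); note the "one fewer dummy than categories" rule:

```python
import numpy as np

# Hypothetical categorical data with four categories (quarters)
quarter = np.array(["Q1", "Q2", "Q3", "Q4", "Q1", "Q3"])

# One fewer dummy than categories: Q1 serves as the reference category,
# represented by zeros in all three dummies
dummies = {q: (quarter == q).astype(int) for q in ["Q2", "Q3", "Q4"]}
```

Each dummy's coefficient is then interpreted relative to the omitted reference category (Q1 here).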


## Example 10.3: Bank Salaries.xlsx (slide 1 of 3)

- Objective: To use the StatTools Regression procedure to analyze whether the bank discriminates against females in terms of salary.
- Solution: The data set includes the following variables for each of the 208 employees of the bank: Education (categorical), Grade (categorical), Years1 (years with this bank), Years2 (years of previous work experience), Age, Gender (categorical with two values), PCJob (categorical yes/no), and Salary.


## Example 10.3: Bank Salaries.xlsx (slide 2 of 3)

- Create dummy variables for the various categorical variables, using IF functions or the StatTools Dummy procedure.
- Then run a regression analysis with Salary as the dependent variable, using any combination of numerical and dummy explanatory variables.
  - Don't use any of the original categorical variables (such as Education) that the dummies are based on.
  - Always use one fewer dummy than the number of categories for any categorical variable.


## Example 10.3: Bank Salaries.xlsx (slide 3 of 3)

- The regression output with all variables appears below.


## Interaction Variables

- When you include only a dummy variable in a regression equation, like the one above, you are allowing the intercepts of the two lines to differ, but you are forcing the lines to be parallel.
- To be more realistic, you might want to allow them to have different slopes. You can do this by including an interaction variable.
- An interaction variable is the product of two explanatory variables.
- Include an interaction variable in a regression equation if you believe the effect of one explanatory variable on Y depends on the value of another explanatory variable.
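Forming an interaction variable is just an elementwise product; a minimal sketch with hypothetical data (assuming numpy):

```python
import numpy as np

# Hypothetical data: years of experience and a gender dummy
years = np.array([1.0, 3.0, 5.0, 7.0])
female = np.array([0, 1, 0, 1])

# Interaction variable: the product of the two explanatory variables.
# In Y = a + b1*Years + b2*Female + b3*(Years*Female), the slope on Years
# is b1 when Female = 0 and b1 + b3 when Female = 1, so the two fitted
# lines can differ in slope as well as intercept.
interaction = years * female
```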


## Example 10.3 (continued): Bank Salaries.xlsx (slide 1 of 2)

- Objective: To use multiple regression with an interaction variable to see whether the effect of years of experience on salary is different across the two genders.
- Solution: First, form an interaction variable that is the product of Years1 and Female, using an Excel formula or the Interaction option from the StatTools Data Utilities dropdown menu.
- Include the interaction variable in addition to the other variables in the regression equation.
- The multiple regression output appears below.


## Example 10.3 (continued): Bank Salaries.xlsx (slide 2 of 2)

- The regression equations for Female and Male are shown graphically below.


## Nonlinear Transformations

- The general linear regression equation has the form:

  Predicted Y = a + b1X1 + b2X2 + … + bkXk

- It is linear in the sense that the right side of the equation is a constant plus a sum of products of constants and variables.
  - The variables can be transformations of original variables.
- Nonlinear transformations of variables are often used because of curvature detected in scatterplots.
- You can transform the dependent variable Y or any of the explanatory variables, the Xs. Or you can do both.
- Typical nonlinear transformations include the natural logarithm, the square root, the reciprocal, and the square.


## Example 10.4: Cost of Power.xlsx (slide 1 of 3)

- Objective: To see whether the cost of supplying electricity is a nonlinear function of demand, and if it is, what form the nonlinearity takes.
- Solution: The data set lists the number of units of electricity produced (Units) and the total cost of producing these (Cost) for a 36-month period.
- Start with a scatterplot of Cost versus Units.


## Example 10.4: Cost of Power.xlsx (slide 2 of 3)

- Next, request a scatterplot of the residuals versus the fitted values.
- The negative-positive-negative behavior of the residuals suggests a parabola, that is, a quadratic relationship with the square of Units included in the equation.
- Create a new variable (Units)^2 in the data set and then use multiple regression to estimate the equation for Cost with both Units and (Units)^2 included.
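The quadratic-fit step can be sketched as a multiple regression on Units and (Units)^2; the cost data below are an exact made-up parabola, not the actual power data (assuming numpy):

```python
import numpy as np

# Hypothetical curved cost data (an exact quadratic, for illustration)
units = np.linspace(10.0, 100.0, 36)
cost = 5000 + 120 * units - 0.4 * units ** 2

# Multiple regression with Units and Units^2 as explanatory variables:
# Cost = a + b1*Units + b2*Units^2
X = np.column_stack([np.ones(len(units)), units, units ** 2])
coeffs, *_ = np.linalg.lstsq(X, cost, rcond=None)
```

Because the squared term is itself just another explanatory variable, the equation remains linear in the coefficients and ordinary least squares applies unchanged.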


## Example 10.4: Cost of Power.xlsx (slide 3 of 3)

- Use Excel's Trendline option to superimpose a quadratic curve on the scatterplot. This curve is shown below, on the left.
- Finally, try a logarithmic fit by creating a new variable, Log(Units), and then regressing Cost against this variable. This curve is shown below, on the right.
- One reason logarithmic transformations of variables are used so widely in regression analysis is that they are fairly easy to interpret.
- A logarithmic transformation of Y is often useful when the distribution of Y values is skewed to the right.


## Example 10.3 (continued): Bank Salaries.xlsx (slide 1 of 2)

- Objective: To reanalyze the bank salary data, now using the logarithm of salary as the dependent variable.
- Solution: The distribution of the salaries of the 208 employees shows some skewness to the right.
- First, create the Log(Salary) variable.
- Then run the regression, with Log(Salary) as the dependent variable and Female and Years1 as the explanatory variables.


## Example 10.3 (continued): Bank Salaries.xlsx (slide 2 of 2)

- The following points hold in general:
  - The R² values with Y and Log(Y) as dependent variables are not directly comparable. They are percentages explained of different variables.
  - The s_e values with Y and Log(Y) as dependent variables are usually of totally different magnitudes. To make the s_e from the log equation comparable, you need to go through the procedure described in the example so that the residuals are in original units.
  - To interpret any term of the form bX in the log equation, you should first express b as a percentage. Then, when X increases by one unit, the expected percentage change in Y is approximately this percentage b.


## Constant Elasticity Relationships

- A particular type of nonlinear relationship that has firm grounding in economic theory is the constant elasticity relationship.
- This is also called a multiplicative relationship. It has the form shown in the equation below:

  Predicted Y = a·X1^b1·X2^b2·…·Xk^bk

- The effect of a one-unit change in any X on Y depends on the levels of the other Xs in the equation.
- The dependent variable is expressed as a product of explanatory variables raised to powers.
- When any explanatory variable X changes by 1%, the predicted value of the dependent variable changes by a constant percentage, regardless of the value of this X or the values of the other Xs.
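Taking logs of both sides turns the multiplicative form into a linear one, so the exponents can be estimated by ordinary regression; a minimal single-X sketch with exact made-up data (assuming numpy):

```python
import numpy as np

# Hypothetical constant elasticity data: Y = a * X^b with a = 3, b = -0.5
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 3.0 * x ** -0.5

# Logs linearize the relationship: log(Y) = log(a) + b * log(X),
# so b (the elasticity) is the slope of a simple regression on logs
b, log_a = np.polyfit(np.log(x), np.log(y), 1)
```

The estimated exponent b is the elasticity: a 1% increase in X changes predicted Y by approximately b%, at any level of X.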


## Example 10.5: Car Sales.xlsx (slide 1 of 2)

- Objective: To use logarithms of variables in a multiple regression to estimate a multiplicative relationship for automobile sales as a function of price, income, and interest rate.
- Solution: The data set contains annual data on domestic auto sales in the United States.
  - Variables include: Sales (in units), Price Index (consumer price index of transportation), Income (real disposable income), and Interest (prime rate).
- First, take natural logs of all four variables.
- Then run a multiple regression with Log(Sales) as the dependent variable and Log(Price Index), Log(Income), and Log(Interest) as the explanatory variables.


## Example 10.5: Car Sales.xlsx (slide 2 of 2)

- The resulting output is shown below.


## Learning Curves

- A learning curve relates the unit production time (or cost) to the cumulative volume of output since the production process first began.
- Empirical studies indicate that production times tend to decrease by a relatively constant percentage every time cumulative output doubles.
  - This constant is often called the learning rate.
- Under the multiplicative model Y = a·X^b (Y the unit time, X the cumulative output), the learning rate satisfies this equation (where LN refers to the natural logarithm):

  LN(learning rate) = b·LN(2)


## Example 10.6: Learning Curve.xlsx (slide 1 of 2)

- Objective: To use a multiplicative regression equation to estimate the learning rate for production time.
- Solution: The data set contains the times (in hours) to produce each batch of a new product at the Presario Company.


## Example 10.6: Learning Curve.xlsx (slide 2 of 2)

- First, check whether the multiplicative learning model is reasonable by creating a scatterplot of Log(Time) versus Log(Batch). The multiplicative model implies that it should be approximately linear.
- The relationship can then be estimated by regressing Log(Time) on Log(Batch); the resulting equation gives the estimated exponent b.
- The estimated learning rate satisfies the equation LN(learning rate)/LN(2) = b.
- Now solve for the learning rate: multiply through by LN(2) and then take antilogs, which gives learning rate = 2^b.
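The whole procedure, sketched end to end with made-up batch data generated from an exact 85% learning curve (assuming numpy; the Presario data are not reproduced here):

```python
import numpy as np

# Hypothetical batch data on an exact 85% learning curve:
# unit time = 100 * batch^b with b = log2(0.85)
batch = np.arange(1.0, 11.0)
time = 100.0 * batch ** np.log2(0.85)

# Estimate b by regressing Log(Time) on Log(Batch)
b, log_a = np.polyfit(np.log(batch), np.log(time), 1)

# Solve LN(learning rate) = b * LN(2), i.e. learning rate = 2^b
learning_rate = 2.0 ** b
```

The recovered rate of 0.85 means each doubling of cumulative output cuts the unit production time to 85% of its previous value.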


## Validation of the Fit

- The fit from a regression analysis is often overly optimistic.
- To see whether the regression equation will be successful in predicting new values of the dependent variable, split the original data into two subsets: one for estimation and one for validation.
- A regression equation is estimated from the first subset.
- Then the values of the explanatory variables from the second subset are substituted into the equation to obtain predicted values for the dependent variable.
- Finally, these predicted values are compared to the known values of the dependent variable in the second subset.
- If the agreement is good, there is reason to believe that the regression equation will predict well for the new data.
- This procedure is called validating the fit.
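The estimation/validation split described above can be sketched as follows with simulated data (assuming numpy; split sizes and summary measures are illustrative choices):

```python
import numpy as np

# Hypothetical data; the true relationship is y = 2 + 3x plus noise
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 60)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, 60)

# Estimation subset: first 40 observations; validation subset: last 20
b, a = np.polyfit(x[:40], y[:40], 1)

# Substitute the validation Xs into the fitted equation and compare
# the predictions with the known validation Ys
pred = a + b * x[40:]
rmse_valid = np.sqrt(np.mean((y[40:] - pred) ** 2))
corr_valid = np.corrcoef(pred, y[40:])[0, 1]
```

A small validation RMSE and a high correlation between predicted and actual values give reason to believe the equation will predict well for genuinely new data.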


## Example 10.2 (continued): Overhead Costs Validation.xlsx

- Objective: To validate the original Bendrix regression for making predictions at another plant.
- Solution: Bendrix would like to predict overhead costs for another plant by using data on machine hours and production runs at this second plant.
- The first step is to see how well the regression from the first plant fits data from the other plant.