
Simple & Multiple Linear Regression

Associate Professor
Dept of Accounting & Information Systems
University of Dhaka
Regression Analysis

• In 1889, Sir Francis Galton, a cousin of Charles Darwin, published "Natural Inheritance", a work on heredity.

• Regression analysis refers to the statistical technique of modeling the relationship between two or more variables. In a general sense, regression analysis means the estimation or prediction of the unknown value of one variable from the known value(s) of the other variable(s).

• It is one of the most important and widely used statistical techniques in almost all sciences: natural, social, and physical.
Simple Linear Regression

• Regression analysis is a mathematical measure of the average relationship between two or more variables, expressed in terms of the original units of the data.

• Study the nature of the relationship between the variables and establish whether there is a statistically significant relationship between them.
• Regression analysis can suggest a cause-and-effect relationship, although by itself it cannot prove causation.
• Forecast new observations, e.g., what will the sales of masks be over the next quarter?
Simple Linear Regression
• In regression analysis we use the independent variable (X) to estimate the
dependent variable (Y).

• Dependent variable: This is the variable whose values we want to explain or forecast. Its values depend on something else. We denote it as Y.

• Independent variable: This is the variable that explains or predicts the dependent variable. Its values do not depend on another variable, hence the name. We denote it as X.

• Both variables must be at least interval scale.

• The relationship between the variables is linear.

You may remember the equation of a straight line, y = mx + c; in regression notation it is written y = a + bx.
Linear Regression Model

Regression Analysis
LINEAR REGRESSION: The line of regression is the graphical representation of the best estimate of one variable for any given value of the other variable.
If X and Y are two variables whose relationship is to be modeled, the line that gives the best estimate of Y for any value of X is called the regression line of Y on X.
If the dependent variable changes to X, the line that gives the best estimate of X for any value of Y is called the regression line of X on Y.

Assumptions Underlying Linear Regression
For each value of X, there is a group of Y values, and these Y values satisfy the following:
1. Normality: For any fixed value of X, Y is normally distributed. The means of these normal distributions of Y values all lie on the straight line of regression, and the standard deviations of these normal distributions are equal.
2. Independence: The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X value.

Assumptions Underlying Linear Regression

3. Linearity: The relationship between X and the mean of Y is linear.
4. Homoscedasticity: The variance of the error term (residual) is constant, i.e., the same for every value of X.
Least Square Method

The least squares method of fitting the line of best fit minimizes the sum of the squared vertical deviations of each observed Y value from the fitted line.

• The straight-line relationship in the equation is stated in terms of two constants, a and b.
• The constant a is the Y-intercept; it indicates the height on the vertical axis from where the straight line originates, representing the value of Y when X is zero.
• The constant b is a measure of the slope of the straight line; it shows the absolute change in Y for a unit change in X. As the slope may be positive or negative, it indicates the nature of the relationship between Y and X. Accordingly, b is also known as the regression coefficient of Y on X.
Regression coefficient

• The least squares principle is used to obtain a and b.
• The equations to determine a and b are:

$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}, \qquad a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$
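As a quick illustration, here is a minimal Python sketch that evaluates these two formulas directly. The data values are hypothetical, chosen only to show the arithmetic; they are not from the slides.

```python
# Least squares estimates of the intercept a and slope b,
# computed directly from the normal equations above.
# The x/y values are hypothetical, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# b = (n*SumXY - SumX*SumY) / (n*SumX^2 - (SumX)^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# a = mean(Y) - b * mean(X)
a = sum_y / n - b * sum_x / n

print(f"Y-hat = {a:.4f} + {b:.4f} X")
```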
The Standard Error of Estimate

• The standard error of estimate is similar to the standard deviation. It is a measure of the variation, or scatter, of the observations about the line of regression, whereas the standard deviation measures the variation, or scatter, about the arithmetic mean.
• The standard error of estimate indicates how precise the prediction of Y based on X is, or conversely.
• There are two standard errors of estimate:


The Standard Error of Estimate
1. The standard error of the estimated regression equation of Y on X
2. The standard error of the estimated regression equation of X on Y

• The formulas used to compute the standard error of Y on X:

$$s_{y.x} = \sqrt{\frac{\sum \left(Y - \hat{Y}\right)^2}{n-2}} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}}$$

$s_{y.x}$ measures the average amount by which the observed Y values depart from the corresponding fitted $\hat{Y}$ values.
Regression Equation - Example
Recall the example involving Copier Sales of America. The sales manager gathered information on the number of sales calls made and the number of copiers sold for a random sample of 10 sales representatives. Use the least squares method to determine a linear equation to express the relationship between the two variables.
What is the expected number of copiers sold by a representative who made 20 calls?

Finding the Regression Equation - Example

The regression equation is:

$$\hat{Y} = a + bX$$
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(20) = 42.6316$$
Computing the Estimates of Y

Step 1 – Using the regression equation, substitute the value of each X to solve for the estimated sales.

Tom Keller: $\hat{Y} = 18.9476 + 1.1842(20) = 42.6316$
Soni Jones: $\hat{Y} = 18.9476 + 1.1842(30) = 54.4736$
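A short sketch of the same two predictions in Python, using the coefficients exactly as given on the slide:

```python
# Fitted line from the slide: Y-hat = 18.9476 + 1.1842 X
a, b = 18.9476, 1.1842

def predict(calls):
    """Estimated copiers sold for a given number of sales calls."""
    return a + b * calls

print(predict(20))  # Tom Keller: 42.6316
print(predict(30))  # Soni Jones: 54.4736
```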
Plotting the Estimated and the Actual Y’s

Standard Error of the Estimate - Example

Recall the example involving Copier Sales of America. The sales manager determined that the least squares regression equation is given below. Determine the standard error of estimate as a measure of how well the values fit the regression line.
$$\hat{Y} = 18.9476 + 1.1842X$$
$$s_{y.x} = \sqrt{\frac{\sum \left(Y - \hat{Y}\right)^2}{n-2}} = \sqrt{\frac{784.211}{10-2}} = 9.901$$

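The arithmetic can be verified in a line or two of Python, using the slide's values $\sum (Y - \hat{Y})^2 = 784.211$ and n = 10:

```python
import math

sse = 784.211   # sum of squared residuals, from the slide
n = 10          # number of sales representatives

s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 3))  # 9.901
```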
Standard Error of the Estimate
Graphical illustration of the differences between the actual Y and the estimated $\hat{Y}$, i.e., $(Y - \hat{Y})$.
PROPERTIES OF REGRESSION COEFFICIENTS

1. The slope of the regression line is called the regression coefficient. It tells the effect on the dependent variable of a unit change in the independent variable. Since for paired data on the X and Y variables there are two regression lines, the regression line of Y on X and the regression line of X on Y, we have two regression coefficients:
a. Regression coefficient of Y on X, denoted by $b_{yx}$
b. Regression coefficient of X on Y, denoted by $b_{xy}$
PROPERTIES OF REGRESSION COEFFICIENTS

2. The values of both regression coefficients cannot exceed 1 at the same time. If one of the coefficients is greater than 1, the other must be less than 1, so that the product of the two regression coefficients is at most 1 and the square root of that product lies within the limits ±1.

3. The coefficient of correlation is the geometric mean of the regression coefficients, i.e.

$$r = \pm\sqrt{b_{yx}\, b_{xy}}$$

The signs of both regression coefficients are always the same, and r takes that common sign.
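A small Python sketch of this property, on hypothetical paired data; it computes both regression coefficients and confirms that their geometric mean equals the Pearson correlation:

```python
import math

# Hypothetical paired data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
syy = sum(b * b for b in y) - sum(y) ** 2 / n

b_yx = sxy / sxx   # regression coefficient of Y on X
b_xy = sxy / syy   # regression coefficient of X on Y

# r carries the common sign of the two coefficients.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
print(b_yx, b_xy, r)   # 0.6  1.0  0.7746 (equals the Pearson r)
```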
Multiple Regression Analysis

The general multiple regression equation with k independent variables is:

$$\hat{Y} = a + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$$

The least squares criterion is used to develop this equation. Because determining $b_1$, $b_2$, etc. by hand is very tedious, a software package such as Excel or Minitab is recommended.

Multiple Regression Analysis

For two independent variables, the general form of the multiple regression equation is:

$$\hat{Y} = a + b_1 X_1 + b_2 X_2$$

• $X_1$ and $X_2$ are the independent variables.
• a is the Y-intercept.
• $b_1$ is the net change in Y for each unit change in $X_1$, holding $X_2$ constant. It is called a partial regression coefficient, a net regression coefficient, or simply a regression coefficient. A fitting sketch follows below.
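As a hedged sketch of how such an equation might be fitted in code (the slides use Excel or Minitab; numpy is shown here only as an illustration, on hypothetical data):

```python
import numpy as np

# Hypothetical observations: two predictors and a response.
X1 = np.array([35.0, 29.0, 36.0, 60.0, 65.0, 30.0])
X2 = np.array([3.0, 5.0, 7.0, 6.0, 10.0, 9.0])
Y  = np.array([250.0, 360.0, 165.0, 43.0, 92.0, 200.0])

# Design matrix: a leading column of 1s gives the intercept a.
A = np.column_stack([np.ones_like(X1), X1, X2])

# Least squares solution for (a, b1, b2).
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef
print(f"Y-hat = {a:.2f} + ({b1:.2f})*X1 + ({b2:.2f})*X2")
```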

Regression Plane for a 2-Independent Variable Linear
Regression Equation

Problem

• The model is given below:

$$Z = 6.56X_1 + 3.26X_2 + 6.72X_3 + 1.05X_4$$

• Here,
Z = Financial distress score
$X_1$ (Liquidity Ratio) = Net Working Capital / Total Assets
$X_2$ (Profitability Ratio) = Retained Earnings / Total Assets
$X_3$ (Return on Assets) = EBIT / Total Assets
$X_4$ (Solvency Ratio) = Market Value of Equity / Book Value of Total Liabilities
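A minimal sketch of applying this model. The coefficients are the slide's; the ratio values passed in are hypothetical, for illustration only:

```python
def z_score(x1, x2, x3, x4):
    # Z = 6.56*X1 + 3.26*X2 + 6.72*X3 + 1.05*X4 (model from the slide)
    return 6.56 * x1 + 3.26 * x2 + 6.72 * x3 + 1.05 * x4

# Hypothetical ratios for a firm being screened for distress.
print(round(z_score(x1=0.12, x2=0.20, x3=0.08, x4=0.90), 3))
```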
Multiple Linear Regression - Example

• Salsberry Realty sells homes along the east coast of the United States. One of the questions most frequently asked by prospective buyers is: If we purchase this home, how much can we expect to pay to heat it during the winter? The research department at Salsberry has been asked to develop some guidelines regarding heating costs for single-family homes.
• Three variables are thought to relate to the heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age in years of the furnace.
• To investigate, Salsberry's research department selected a random sample of 20 recently sold homes. It determined the cost to heat each home last January, as well as the values of these three variables for each home.
Multiple Linear Regression - Example

The Multiple Regression Equation – Interpreting the
Regression Coefficients

• The regression coefficient for mean outside temperature is −4.583. The coefficient is negative and shows an inverse relationship between heating cost and temperature. As the outside temperature increases, the cost to heat the home decreases. The numeric value of the regression coefficient provides more information: if we increase temperature by 1 degree and hold the other two independent variables constant, we can estimate a decrease of $4.583 in monthly heating cost. So if the mean temperature in Boston is 25 degrees and it is 35 degrees in Philadelphia, all other things being the same (insulation and age of furnace), we expect the heating cost to be $45.83 less in Philadelphia.
• The attic insulation variable also shows an inverse relationship: the more
insulation in the attic, the less the cost to heat the home. So the negative sign for
this coefficient is logical. For each additional inch of insulation, we expect the
cost to heat the home to decline $14.83 per month, regardless of the outside
temperature or the age of the furnace.
• The age of the furnace variable shows a direct relationship. With an older
furnace, the cost to heat the home increases. Specifically, for each additional year
older the furnace is, we expect the cost to increase $6.10 per month.
Applying the Model for Estimation

What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?

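A hedged sketch of the calculation. The three slope coefficients come from the interpretation slide above; the intercept is not visible in the extracted text, so the value 427.194 used below is an assumption taken from the textbook example these slides appear to follow:

```python
a = 427.194     # intercept: assumed, not shown in the extracted text
# Slopes as in the textbook example; the slides round the
# last two to 14.83 and 6.10.
b_temp, b_insul, b_age = -4.583, -14.831, 6.101

cost = a + b_temp * 30 + b_insul * 5 + b_age * 10
print(round(cost, 2))  # about 276.56 per month, under these assumptions
```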
Multiple Standard Error of Estimate

The multiple standard error of estimate is a measure of the effectiveness of the regression equation.
• It is measured in the same units as the dependent variable.
• It is difficult to determine what is a large value and what is a small value of the standard error.
• The formula is:

$$s_{Y.12\ldots k} = \sqrt{\frac{\sum \left(Y - \hat{Y}\right)^2}{n - (k+1)}}$$

Multiple Regression and Correlation Assumptions

• The independent variables and the dependent variable have a linear relationship. The dependent variable must be continuous and at least interval-scale.
• The variation in the residuals must be the same for all values of $\hat{Y}$. When this is the case, we say the residuals exhibit homoscedasticity.
• The residuals should follow the normal distribution with mean 0.
• Successive values of the dependent variable must be uncorrelated.

The ANOVA Table

• The ANOVA table reports the variation in the dependent variable, divided into two components.
• The Explained Variation is the part accounted for by the set of independent variables.
• The Unexplained or Random Variation is the part not accounted for by the independent variables.

Minitab – the ANOVA Table

Coefficient of Multiple Determination (r2)

Characteristics of the coefficient of multiple determination:
1. It is symbolized by a capital R squared, i.e., written as $R^2$, because it behaves like the square of a correlation coefficient.
2. It can range from 0 to 1. A value near 0 indicates little association between the set of independent variables and the dependent variable; a value near 1 means a strong association.
3. It cannot assume negative values. Any number that is squared or raised to the second power cannot be negative.
4. It is easy to interpret. Because $R^2$ is a value between 0 and 1, it is easy to interpret, compare, and understand.

Minitab – the ANOVA Table

$$R^2 = \frac{SSR}{SS_{total}} = \frac{171{,}220}{212{,}916} = 0.804$$

Adjusted Coefficient of Determination

• Adding independent variables to a multiple regression equation makes the coefficient of determination larger: each new independent variable causes $R^2$ to increase (or at least stay the same), whether or not it genuinely improves the predictions.
• If the number of independent variables, k, and the sample size, n, are equal, the coefficient of determination is 1.0. In practice, this situation is rare and would also be ethically questionable.
• To balance the effect that the number of independent variables has on the coefficient of multiple determination, statistical software packages report an adjusted coefficient of multiple determination.

Adjusted Coefficient of Determination

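The formula itself is not visible in the extracted slide; the standard definition, stated here as the usual textbook form reported by Minitab, is:

$$R^2_{adj} = 1 - \frac{SSE\,/\,\left(n-(k+1)\right)}{SS_{total}\,/\,(n-1)}$$

Using the ANOVA figures shown earlier (SSE = 212,916 − 171,220 = 41,696, with n = 20 homes and k = 3 predictors), this works out to $1 - (41{,}696/16)/(212{,}916/19) \approx 0.767$.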
Evaluating the
Assumptions of Multiple Regression

1. There is a linear relationship. That is, there is a straight-line relationship between the dependent variable and the set of independent variables.
2. The variation in the residuals is the same for both large and small values of the estimated Y. To put it another way, the size of the residuals is unrelated to whether the estimated Y is large or small.
3. The residuals follow the normal probability distribution.
4. The independent variables should not be correlated. That is, we would like to select a set of independent variables that are not themselves correlated.
5. The residuals are independent. This means that successive observations of the dependent variable are not correlated. This assumption is often violated when time is involved with the sampled observations.

Analysis of Residuals

• A residual is the difference between the actual value of Y and the predicted value of Y. Residuals should be approximately normally distributed. Histograms and stem-and-leaf charts are useful in checking this requirement.
• A plot of the residuals against their corresponding $\hat{Y}$ values is used to show that there are no trends or patterns in the residuals.
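A minimal matplotlib sketch of both checks, on simulated data (hypothetical; in practice the residuals would come from the fitted regression):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate data, fit a line, and compute residuals.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 1.5 * x + rng.normal(0, 1, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
residuals = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.hist(residuals, bins=10)           # should look roughly normal
ax1.set_title("Histogram of residuals")
ax2.scatter(y_hat, residuals)          # should show no trend or pattern
ax2.axhline(0, linestyle="--")
ax2.set_title("Residuals vs fitted values")
plt.show()
```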

Scatter Diagram

Residual Plot

Distribution of Residuals

Both Minitab and Excel offer another graph that helps to evaluate the assumption of normally distributed residuals. It is called a normal probability plot and is shown to the right of the histogram.

Multicollinearity
• Multicollinearity exists when independent
variables (X’s) are correlated.
• Correlated independent variables make it
difficult to make inferences about the individual
regression coefficients (slopes) and their
individual effects on the dependent variable (Y).
• However, correlated independent variables do
not affect a multiple regression equation’s
ability to predict the dependent variable (Y).

Variance Inflation Factor
• A general rule is if the correlation between two independent variables
is between -0.70 and 0.70 there likely is not a problem using both of
the independent variables.
• A more precise test is to use the variance inflation factor (VIF).
• The value of VIF is found as follows:

$$VIF = \frac{1}{1 - R_j^2}$$

• The term $R_j^2$ refers to the coefficient of determination obtained when the selected independent variable is used as the dependent variable and is regressed on the remaining independent variables.
• A VIF greater than 10 is considered unsatisfactory, indicating that the independent variable should be removed from the analysis.
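A sketch of this computation in Python: regress each predictor on the others, take that $R_j^2$, and invert. The data matrix is hypothetical, for illustration only:

```python
import numpy as np

def vif(X, j):
    """VIF for column j of predictor matrix X: 1 / (1 - R^2_j)."""
    y = X[:, j]                          # treat predictor j as the response
    others = np.delete(X, j, axis=1)     # remaining predictors
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2_j = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2_j)

# Hypothetical columns: temperature, insulation, furnace age.
X = np.array([[35, 3, 6], [29, 5, 10], [36, 7, 3],
              [60, 6, 9], [65, 10, 6], [30, 9, 5]], dtype=float)
print([round(vif(X, j), 2) for j in range(X.shape[1])])
```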

Multicollinearity – Example
Refer to the data in the table, which relates the heating cost to the independent variables outside temperature, amount of insulation, and age of furnace.
• Develop a correlation matrix for all the independent variables. Does it appear there is a problem with multicollinearity?
• Find and interpret the variance inflation factor for each of the independent variables.

Correlation Matrix - Minitab

VIF – Minitab Example

The VIF value of 1.32 is less than the upper limit of 10. This indicates that the independent variable temperature is not strongly correlated with the other independent variables.
Independence Assumption

• The fifth assumption about regression and correlation analysis is that successive residuals should be independent.
• When successive residuals are correlated, we refer to this condition as autocorrelation. Autocorrelation frequently occurs when the data are collected over a period of time.

Residual Plot versus Fitted Values
• The graph below shows the
residuals plotted on the
vertical axis and the fitted
values on the horizontal axis.
• Note the run of residuals
above the mean of the
residuals, followed by a run
below the mean. A scatter plot
such as this would indicate
possible autocorrelation.

Qualitative Independent Variables
• Frequently we wish to use nominal-scale
variables—such as gender, whether the home has
a swimming pool, or whether the sports team
was the home or the visiting team—in our
analysis. These are called qualitative variables.
• To use a qualitative variable in regression
analysis, we use a scheme of dummy variables in
which one of the two possible conditions is coded
0 and the other 1.
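A minimal sketch of dummy coding in Python, with hypothetical data (a home either has a pool or not):

```python
# Code the qualitative variable "pool" as 1 (yes) / 0 (no).
homes = [
    {"price": 215.0, "pool": "yes"},
    {"price": 190.0, "pool": "no"},
    {"price": 230.0, "pool": "yes"},
]

for h in homes:
    h["pool_dummy"] = 1 if h["pool"] == "yes" else 0

print([h["pool_dummy"] for h in homes])  # [1, 0, 1]
```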

