
CHAPTER 6 (PART I)

Trendlines and Regression Analysis

Prepared by: Nur Liyana Mohamed Yousop


CHAPTER OUTLINES

Introduction

Simple Linear Regression

Multiple Linear Regression

Regression with Categorical Independent Variable


INTRODUCTION
SCOPE OF BUSINESS ANALYTICS

CHAPTER 4

CHAPTER 6 AND 7

CHAPTER 8
PREDICTIVE ANALYTICAL MODELS

Predictive analytics are executed by processing historical data to forecast future happenings.

A predictive analytical model is developed by using mathematical functions such as the following:

• Linear function
• Logarithmic function
• Polynomial function
• Power function
• Exponential function
PREDICTIVE ANALYTICAL MODELS

Linear function: y = a + bx
• Linear functions show steady increases or decreases over the range of x.
• This is the simplest type of function used in predictive models.
• E.g. Demand function (price and quantity)

Logarithmic function: y = ln(x)
• Logarithmic functions are used when the rate of change in a variable increases or decreases quickly.
• E.g. Richter scale used to measure earthquakes

Polynomial function: y = ax^2 + bx + c
• Polynomial functions are functions that have quadratic, cubic, quartic and higher-order terms, built using only addition, subtraction and multiplication, with just non-negative integer powers of x.
• E.g. Business people use polynomials to see how a rise in the price of a good will affect its sales

Power function: y = ax^b
• Power functions are defined by single monomials (a number and variables that are multiplied together, e.g. 3xy), where a ≠ 0 and b > 0.

Exponential function: y = ab^x
• Exponential functions have the property that y increases or decreases at a constantly increasing rate.
• E.g. Continuously compounding interest (PV and FV)
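The five function families can be compared side by side with a minimal sketch. The parameter values A, B and C below are illustrative assumptions, not values from the slides.

```python
import math

# Illustrative parameter values (assumptions, not taken from the slides).
A, B, C = 2.0, 1.5, 1.0


def linear(x):
    """y = a + bx"""
    return A + B * x


def logarithmic(x):
    """y = ln(x), defined for x > 0"""
    return math.log(x)


def polynomial(x):
    """y = ax^2 + bx + c (second-order polynomial)"""
    return A * x**2 + B * x + C


def power(x):
    """y = ax^b"""
    return A * x**B


def exponential(x):
    """y = ab^x"""
    return A * B**x


# Print each function's value at a few sample points.
for x in (1, 2, 4):
    print(x, linear(x), round(logarithmic(x), 3), polynomial(x),
          round(power(x), 3), round(exponential(x), 3))
```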
TYPES OF DATA

Type                  Location  Time  Example
Time series           1         n     Malaysia GDP from 2009-2019
Cross-sectional       n         1     India GDP in 2019
                                      China GDP in 2019
                                      Japan GDP in 2019
Pooled / Panel data   n         n     Malaysia GDP from 2009-2019
                                      Japan GDP from 2009-2019
                                      Korea GDP from 2009-2019
TYPES OF DATA

TIME SERIES
CROSS-SECTIONAL DATA

POOLED / PANEL DATA

Our syllabus
MODELING RELATIONSHIPS AND TRENDS IN DATA

Create charts to better understand data sets.
• For time series data, use a line chart.
• For cross-sectional data, use a scatter chart.
MODELING A PRICE-DEMAND FUNCTION

Linear demand function:


Demand = 20,512 - 9.5116(Price)
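As a quick illustration, the linear demand function above can be evaluated at a few prices. The prices used here are hypothetical and chosen only to show that demand falls as price rises.

```python
def demand(price):
    """Linear demand function from the slide: Demand = 20,512 - 9.5116(Price)."""
    return 20512 - 9.5116 * price


# Hypothetical prices, for illustration only.
for price in (1000, 1200, 1400):
    print(price, round(demand(price), 1))
```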
EXCEL TRENDLINE TOOL
ORDINARY LEAST SQUARES REGRESSION
TYPES OF VARIABLES

Dependent Variable
• The variable that depends on other variable/s that is/are measured.

Independent Variable
• The variable that is stable and unaffected by the other variables you are trying to measure.
LEAST-SQUARES REGRESSION

Ordinary Least Squares regression (OLS) is more commonly named linear regression (simple or multiple depending on the number of explanatory variables).

Regression is a powerful analysis that can analyze multiple variables simultaneously to answer complex research questions.

However, if OLS assumptions are not satisfied, then results cannot be trusted.
LEAST-SQUARES REGRESSION

To obtain the best-fitted line, minimize the sum of squared distances between the actual values and the predicted values through the Ordinary Least-Squares (OLS) method.

Formula OLS_SLR:

\( y = \beta_0 + \beta_1 x \)

Where;
y : Dependent variable
\( \beta_0 \) : Intercept (often labeled the constant) is the expected mean value of y when x = 0
\( \beta_1 \) : Slope represents the rate of change in y as x changes.
x : Independent variable
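A minimal sketch of how the OLS intercept and slope can be computed by hand for simple linear regression, using the standard closed-form formulas. The small data set is made up for illustration and is not the home market value data.

```python
def ols_fit(x, y):
    """Return (b0, b1), the OLS intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = ols_fit(x, y)
print(round(b0, 4), round(b1, 4))
```

For this made-up data the fitted line is y = 2.2 + 0.6x.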
REGRESSION ANALYSIS

Regression analysis
• A tool for building mathematical and statistical models that characterize relationships between a dependent (ratio) variable and one or more independent, or explanatory variables (ratio or categorical), all of which are numerical.

Simple linear regression
• Involves a single independent variable.

Multiple regression
• Involves two or more independent variables.
POPULATION & SAMPLE REGRESSION
MODELS

Population
• Unknown relationship: \( Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \)

Random Sample
• Fitted line: \( \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \)
• The fitted line is based on estimation. There is a difference between the actual value (\( y \)) and the predicted value (\( \hat{y} \)).
• Observed error / Residuals: \( y_i - \hat{y}_i = \varepsilon_i \)
TEXTBOOK PAGE: 70

SIMPLE LINEAR REGRESSION

• Simple linear regression (SLR) is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables:
• One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
• The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Independent Variable Dependent Variable


SIMPLE LINEAR REGRESSION

Finds a linear relationship between:
• one independent variable X and;
• one dependent variable Y

First prepare a scatter plot to verify the data has a linear trend.

Use alternative approaches if the data is not linear.
OLS_SLR: STEP BY STEP

DATA: HOME MARKET VALUE

STEP 1: Determine Y (DV) and X (IV)

Size of a house is typically related to its market value.
• Y = market value ($) → DV
• X = square feet → IV

STEP 2: Plot the Scatter Plot

The scatter plot of the full data set (42 homes) indicates a linear trend.

How to adjust scale?
o Select axis
o Right click → Format axis
o Axis option → Change scale

[Figure: scatter plot of MARKET VALUES ($60,000.00 to $130,000.00) against SQUARE FEET (1,400 to 2,600)]
FINDING THE BEST-FITTING
REGRESSION LINE

SLR Formula:
Market value = \( \beta_0 \) + \( \beta_1 \)(Square feet)

STEP 3: Find the Best-Fit Regression Line

Click Chart → Add Chart Elements → Trendline → Linear

[Figure: RELATIONSHIP BETWEEN HOME MARKET VALUES AND SIZE OF THE HOUSE (SQUARE FEET). Scatter plot of MARKET VALUES against SQUARE FEET with linear trendline y = 35.036x + 32673, R² = 0.5347]
FINDING THE BEST-FITTING
REGRESSION LINE

SLR Formula:
Market value = \( \beta_0 \) + \( \beta_1 \)(Square feet)

STEP 4: Determine the Best Regression Line

• The regression model explains variation in market value due to size of the home.
• It provides better estimates of market value than simply using the average.
• Market value = $32,673 + $35.036(Square feet)
• The estimated market value of a home with 2,200 square feet would be:
• Market value = $32,673 + $35.036(2,200) = $109,752

[Figure: RELATIONSHIP BETWEEN HOME MARKET VALUES AND SIZE OF THE HOUSE (SQUARE FEET). Scatter plot of MARKET VALUES against SQUARE FEET with linear trendline y = 35.036x + 32673, R² = 0.5347]
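The Step 4 estimate can be reproduced with a short sketch. The intercept and slope are the trendline values from the slide; the function name is only an illustrative choice.

```python
B0 = 32673    # intercept from the Excel trendline ($)
B1 = 35.036   # slope from the Excel trendline ($ per square foot)


def predict_market_value(square_feet):
    """Estimated market value = B0 + B1 * square feet."""
    return B0 + B1 * square_feet


# A 2,200 square foot home; the slide reports $109,752.
print(round(predict_market_value(2200)))
```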
SIMPLE LINEAR REGRESSION WITH EXCEL

Data → Data Analysis → Regression


HOME MARKET VALUE REGRESSION
RESULTS

REGRESSION STATISTICS

Analysis: Multiple R
Details:
o Value range -1 to 1
o Value > 0 : +ve correlation
o Value < 0 : -ve correlation
o Value = 0 : no correlation
Interpretation: 0.7313 > 0, +ve correlation

Analysis: R-Squared (R2) / Coefficient of determination
Details:
o Variation in the DV explained by IV
o Value between 0 and 1
o Closer to 1, better fit
Interpretation: R2 = 0.5347. 53.47% of variation in market values is explained by the size of the house (square feet)
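To make the definition of R2 concrete, the sketch below computes it as 1 - SSE/SST for a small made-up data set (not the home market value data). In simple linear regression the square root of R2 equals the absolute value of the correlation between X and Y, which is what the last line computes.

```python
import math


def r_squared(x, y):
    """R^2 = 1 - SSE/SST for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # SSE: variation left unexplained by the fitted line.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # SST: total variation of y around its mean.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y)
multiple_r = math.sqrt(r2)
print(round(r2, 4), round(multiple_r, 4))
```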
HOME MARKET VALUE REGRESSION
RESULTS

REGRESSION STATISTICS
Analysis: Adjusted R-Squared
Details: Modified R2 (adjusted for the sample size and the number of explanatory variables)
Interpretation: Adjusted R2 = 0.5231
o Will be beneficial when the present model is compared with other models that incorporate more explanatory variables.

Analysis: Standard Error
Details: The standard error of the estimate measures the typical difference between the observed (ACTUAL) and ESTIMATED values. The SE will be small if the data is close to the regression line. The SE will be big if the data is dispersed widely from the regression line.
Interpretation: None
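The adjusted R2 reported for the home market value model can be checked with the standard formula, 1 - (1 - R2)(n - 1)/(n - k - 1). The sketch assumes n = 42 homes and k = 1 explanatory variable, as stated earlier in the slides.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R^2 = 0.5347, 42 homes, one explanatory variable (square feet).
adj_r2 = adjusted_r_squared(0.5347, n=42, k=1)
print(round(adj_r2, 4))  # the slide reports 0.5231
```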
HOME MARKET VALUE REGRESSION
RESULTS

ANALYSIS OF VARIANCE

Analysis: ANOVA
Details:
ANOVA is used to test for significance of regression:
H0: population slope coefficient (\( \beta_1 \)) = 0
H1: population slope coefficient (\( \beta_1 \)) ≠ 0
The Significance F value given in the ANOVA table is the p-value for the F-test.
If Significance F < the level of significance (normally 5%), H0 is rejected.
Interpretation:
H0: \( \beta_1 = 0 \) (IV has no effect on the DV)
H1: \( \beta_1 \neq 0 \) (IV explains variation in DV)
Significance F = 3.8E-08 < 0.05
H0 rejected
The slope is not equal to zero. Using a linear relationship, home size (square feet) is a significant variable in explaining variation in market value.
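The F statistic behind the ANOVA test can be recovered from R2 with the identity F = (R2/k) / ((1 - R2)/(n - k - 1)). The sketch below applies it to the rounded R2 from the slides, so the result is approximate; the slides themselves report only the Significance F (p-value), and computing that p-value would additionally need the F distribution, which is left out here.

```python
def f_statistic(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Rounded R^2 from the slides, 42 homes, one explanatory variable.
f_value = f_statistic(0.5347, n=42, k=1)
print(round(f_value, 2))
```

A large F value corresponds to a small Significance F, which is consistent with rejecting H0 above.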
HOME MARKET VALUE REGRESSION
RESULTS

ANALYSIS OF VARIANCE

Positive slope
• y-value increases as x-values increase

Negative slope
• y-value decreases as x-values increase

Zero slope
• y-value stays constant as x-values increase
HOME MARKET VALUE REGRESSION
RESULTS

TESTING HYPOTHESIS FOR REGRESSION COEFFICIENT


Analysis: T-TEST
Details:
An alternate method for testing whether a slope or intercept is zero is to use a t-test:
H0: population slope coefficient (\( \beta_1 \)) = 0
H1: population slope coefficient (\( \beta_1 \)) ≠ 0
The test can be computed by using:
\[ t = \frac{\hat{\beta}_1 - \beta_1}{\text{Standard Error of Slope}} \]
where \( \beta_1 \) is the hypothesized value of the slope (0 under H0).
Interpretation:
p-value = 0.0000 < α = 5%
H0 rejected
We can conclude that the coefficient is statistically different from zero, meaning that home size (square feet) has a significant relationship with market values.

Excel provides the p-values for tests on the slope and intercept.
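The t statistic for the slope can be sketched from first principles. The standard error of the slope is computed here with the usual formula sqrt(s^2 / Sxx), where s^2 = SSE/(n - 2); the data set is made up for illustration and is not the home market value data.

```python
import math


def slope_t_statistic(x, y, hypothesized_slope=0.0):
    """Return (t, se_b1) for testing the slope of a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)            # residual variance with n - 2 degrees of freedom
    se_b1 = math.sqrt(s2 / sxx)   # standard error of the slope
    return (b1 - hypothesized_slope) / se_b1, se_b1


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_stat, se_b1 = slope_t_statistic(x, y)
print(round(t_stat, 4), round(se_b1, 4))
```

The resulting t value would then be compared with a t distribution with n - 2 degrees of freedom to obtain the p-value that Excel reports.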
HOW TO INTERPRET COEFFICIENT?

In our example X is House Size (Square Feet) and Y is Home Value, thus,

\[ \hat{y} = 32{,}673.2199 + 35.0364X \]

For coefficient:
If the house size increases by 1 square foot, the estimated home value increases by $35.0364.

For intercept:
If the house size is zero (X = 0), the estimated home value is $32,673.2199.
HOME MARKET VALUE REGRESSION
RESULTS

CONFIDENCE INTERVAL FOR REGRESSION COEFFICIENT


Analysis: CONFIDENCE INTERVAL
Details:
Confidence intervals (Lower 95% and Upper 95% values in the output) provide information about the unknown values of the true regression coefficients, accounting for sampling error.
Narrower confidence intervals provide more accuracy in our predictions.
Interpretation:
For the Home Market Value data, it can be concluded that the true intercept and slope lie between [14,823, 50,523] and [24.59, 45.48] respectively at the α=5% level of significance.
Although we estimated that a house with 1,750 square feet has a market value of 32,673 + 35.036(1,750) = $93,986, if the true population parameters are at the extremes of the confidence intervals, the estimate might be as low as 14,823 + 24.59(1,750) = $57,855 or as high as 50,523 + 45.48(1,750) = $130,113.
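The three estimates for a 1,750 square foot house can be reproduced with the coefficient values and confidence limits quoted above. This is only an arithmetic check of the slide's numbers, not a formal prediction interval.

```python
def predict(intercept, slope, square_feet):
    """Market value implied by a given intercept and slope."""
    return intercept + slope * square_feet


SQFT = 1750
point = predict(32673, 35.036, SQFT)   # fitted coefficients
low = predict(14823, 24.59, SQFT)      # lower 95% limits of both coefficients
high = predict(50523, 45.48, SQFT)     # upper 95% limits of both coefficients

print(f"point: {point:,.1f}  low: {low:,.1f}  high: {high:,.1f}")
```

The values agree with the slide's $93,986, $57,855 and $130,113 to within a dollar.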
RESIDUALS

• Residuals are the observed errors associated with estimating the value of the dependent variable using the regression line:

\[ \varepsilon_i = y_i - \hat{y}_i \]
RESIDUAL ANALYSIS AND REGRESSION
ASSUMPTIONS

• Residual (ε) = Actual (Observed) Y value − Predicted Y value


• Standard residual = Residual / Standard deviation
• Rule of thumb: Standard residuals outside of ±2 or ±3 are
potential outliers.
• Excel provides a table and a plot of residuals.

[Figure annotation: This point has a standard residual of 4.5336]
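The rule of thumb above can be sketched as follows. The residuals are made up for illustration, and the sketch assumes the standard deviation in "Residual / Standard deviation" is the sample standard deviation of the residuals.

```python
import statistics


def standard_residuals(residuals):
    """Divide each residual by the sample standard deviation of the residuals."""
    sd = statistics.stdev(residuals)
    return [r / sd for r in residuals]


def potential_outliers(residuals, threshold=2.0):
    """Indices whose standard residual lies outside +/- threshold."""
    z = standard_residuals(residuals)
    return [i for i, zi in enumerate(z) if abs(zi) > threshold]


# Hypothetical residuals (they sum to zero, as OLS residuals do).
residuals = [-1, -2, -1, 0, -1, -2, -1, -1, -1, 10]
outliers = potential_outliers(residuals)
print(outliers)
```

Only the last observation, with a standard residual of about 2.8, is flagged at the ±2 threshold.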
CHECKING ASSUMPTIONS

Assumption: Linearity
Linear relationship between IV and DV
Verification:
• Examine scatter diagram (should appear linear)
• Examine residual plot (should appear random)
Details: If assumption is met:
o Residuals randomly scattered about zero
o Do not exhibit a specific pattern

Assumption: Normality of Errors
Errors of all IVs are normally distributed, mean=0
Verification:
• View a histogram of standard residuals
• Formal Goodness of Fit Test (e.g. Pearson, Chi-square, Jarque-Bera and others)
Details: If assumption is met:
o Bell-shaped distribution
CHECKING ASSUMPTIONS

Assumption: Homoscedasticity (constant variance)
Variance around the regression line is similar for all the IVs
Verification:
• Examine the residual plot
Details: If assumption is met:
o There will not be dramatic differences in the spread of the data for different values of the IVs

Assumption: Independence of Errors (Autocorrelation)
The error terms for all IVs should not be correlated with one another. If they are, then the problem of autocorrelation persists.
Verification:
• Durbin-Watson statistic
• d takes on values between 0 and 4. A value of d = 2 means there is no autocorrelation. A value substantially below 2 (and especially a value less than 1) means that the data is positively autocorrelated. A value of d substantially above 2 means that the data is negatively autocorrelated.
Details: If assumption is met:
o No autocorrelation, if 1.5 ≤ d ≤ 2.5
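The Durbin-Watson statistic can be computed directly from the residuals as the sum of squared successive differences divided by the sum of squared residuals. The sketch below uses made-up residual sequences and the 1.5 to 2.5 rule of thumb from the slide; the function names are illustrative.

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2), with residuals in time order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def interpret(d):
    """Apply the slide's rule of thumb: 1.5 <= d <= 2.5 means no autocorrelation."""
    if d < 1.5:
        return "positive autocorrelation suspected"
    if d > 2.5:
        return "negative autocorrelation suspected"
    return "no autocorrelation"


# Hypothetical residual sequences, for illustration only.
alternating = [1, -1, 1, -1]   # signs flip every period
persistent = [1, 1, -1, -1]    # neighbouring residuals tend to share a sign

for name, e in (("alternating", alternating), ("persistent", persistent)):
    d = durbin_watson(e)
    print(name, d, interpret(d))
```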
CHECKING REGRESSION ASSUMPTIONS FOR
THE HOME MARKET VALUE DATA

• Linearity
• linear trend in scatterplot
• no pattern in residual plot
CONTINUED…

• Normality of Errors
• Residual histogram appears slightly skewed but is not a serious departure
• Data → Data Analysis → Histogram

[Figure: Histogram-Standard Residual. Frequency by bin, with bins from -3 to 3 and More]

• Homoscedasticity
• Residual plot shows no serious difference in the spread of the data for different X values.

[Figure: Square Feet Residual Plot. Residuals against Square Feet (1,300 to 2,500)]
CONTINUED…

• Homoscedasticity
CONTINUED…

• Autocorrelation
