
Regression Analysis

Ordinary Least Squares Method

Learning Objectives

Understanding: What is regression analysis? Where is it used? Prediction and estimation

Statistics used in regression analysis: R-Square, F test, t test


The Basic Problem

Available data on two or more variables

Formulate a model to predict or estimate a value of interest

Use model to make a business decision

Regression

Provides a conceptually simple method for investigating functional relationships between one or more independent (explanatory) variables, or factors, and a dependent variable (the outcome of interest)

The relationship is expressed in the form of an equation or a model connecting the response (dependent) variable and one or more explanatory (predictor, independent) variables

Regression in Business

Risk analysis for investment (optimal portfolio choice)

Predict the future joint distribution of asset returns
Construct an optimal portfolio (choose weights)

Determining price and marketing strategy

Estimate the effect of price and advertisement on sales
Decide what the optimal price and ad campaign are

Credit scoring models

Predict the future probability of default using known characteristics of the borrower
Decide whether or not to lend (and if so, how much)

Regression in Business

Sales or market forecast

Sales volume, market movement (ice cream, houses)

Total quality control

Customer complaints over time
Key product specialization

Linear regression in human resources

Predict the demographics and types of future workforce for large companies
Estimate training impact

Regression: Prediction & Explanation

Straight prediction question:

What price should I charge for my car? What will the interest rate be next month? Will this person like that movie?

Explanation and Understanding


Does your income increase if you complete this course? Will tax incentives change purchasing behaviour? Is my advertising campaign working?

Where to start?

Describe the data using descriptive statistics

mean, median, mode, standard deviation, percentiles

Plot and explore your data

Scatterplot exploring 2-3 dimensions.
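As a starting point, the descriptive statistics listed above can be computed with Python's standard `statistics` module. This is only a sketch: the `sizes` list is hypothetical data invented for illustration, not data from these slides.

```python
import statistics

# Hypothetical house sizes in thousands of square feet (illustrative only).
sizes = [1.2, 1.5, 1.5, 1.8, 2.2, 2.4, 3.0]

summary = {
    "mean": statistics.mean(sizes),
    "median": statistics.median(sizes),
    "mode": statistics.mode(sizes),
    "std_dev": statistics.stdev(sizes),  # sample standard deviation
    # quantiles(n=4) returns the three quartile cut points (percentiles 25/50/75)
    "quartiles": statistics.quantiles(sizes, n=4),
}

for name, value in summary.items():
    print(name, value)
```

A scatterplot of the same data would normally follow; plotting needs a third-party library, so it is left out of this sketch.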

Linear Prediction
Example: Predicting house price

Problem: Predict market price based on observed characteristics

Solution: Look at property-related data where we know the price and some observed characteristics. Build a decision rule that predicts price as a function of the observed characteristics

Linear Prediction: Predicting house prices


What characteristics do we use?

Many factors or characteristics affect the price of a house


Size, number of rooms, attached baths, garage space, UPS facility, neighbourhood, etc.

Variables like price and size are easy to quantify, but what about other variables such as aesthetics or workmanship?

Linear Prediction: Predicting house prices


To keep things simple, let's focus only on size

The value that we seek to predict is called the dependent (or explained ) variable, and we denote this as

Y = price of house (e.g. lakhs of rupees)

The variable that we use to guide prediction is the independent (or explanatory) variable, and this is labelled

X = size of house (e.g. thousands of square feet)

Linear Prediction: Predicting house prices

What does this data look like

Plot the scatterplot

Linear Prediction: Predicting house prices

Appears to be a linear relationship between size and price:

As size goes up price goes up

Line here is the Trend line

Linear prediction

Recall that the equation of a line is: Y = b0 + b1X

We add the random residual term: Y = b0 + b1X + u

where b0 is the intercept and b1 is the slope

The intercept is in units of Y (Rs. 1,00,000)
The slope is in units of Y per unit of X (Rs. 1,00,000 per 1,000 sq ft)

Linear regression: interpretation


Y = b0 + b1X + u

Intercept b0: when X = 0, Y = b0, so the intercept is the best predictor of Y when X = 0

Slope b1: when X increases by 1 unit (1,000 sq ft), Y increases by b1 units (each unit being Rs. 1,00,000)

Linear Prediction

For our example the estimates are b0 = 37 and b1 = 38, so the fitted line is $\hat{Y} = 37 + 38X$

The hat indicates an estimate

Linear Prediction

We can now predict the price of a house when we know only the size

just read the value of the line we have drawn

For example, given a house with size X = 2.2 (i.e. 2,200 sq ft)

Predicted price Y = 37 + 38 x 2.2 = 37 + 83.6 = Rs. 120.6 lakhs
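The same reading-off-the-line calculation can be sketched in a few lines of Python. The coefficient values 37 and 38 come from the example; the function and constant names are made up for illustration.

```python
# Estimated coefficients from the house price example.
B0_HAT = 37.0  # intercept, in lakhs of rupees
B1_HAT = 38.0  # slope, in lakhs of rupees per 1,000 sq ft


def predict_price(size_thousand_sqft: float) -> float:
    """Predicted price in lakhs of rupees for a given size (1,000 sq ft units)."""
    return B0_HAT + B1_HAT * size_thousand_sqft


price = predict_price(2.2)
print(f"Predicted price: Rs. {price:.1f} lakhs")  # Predicted price: Rs. 120.6 lakhs
```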

Regression Model: General


Y = dependent variable
X1, X2, X3, ..., Xp = independent variables

Linear relationship is written as:

Y = b0 + b1X1 + b2X2 + ... + bpXp + u

Estimating this model requires statistical tools better than simple graphical methods: the Least Squares Method

Least Square Regression Model

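A reasonable way to fit a line is to minimize the amount by which the fitted value differs from the actual value. This amount is called the residual or error

Let's work with the two-variable model

$Y_i = \beta_0 + \beta_1 X_i + u_i$

where $\beta_0$ and $\beta_1$ are the intercept and slope (written b0 and b1 earlier)

Estimated using the least squares regression model

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$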

Fitted value
What is the fitted value?

The dots are the observed values and the line represents our fitted values, given by $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

Residual or error term


What is the residual for the ith observation?

$\hat{u}_i = Y_i - \hat{Y}_i$

Ordinary Least Squares Method (OLS)

Ideally we want to minimize the size of all residuals:

If they were all zero we would have a perfect fit

First idea: minimize the total of the residuals to get the best fit

The total may be small even when the individual residuals are widely scattered, because positive residuals cancel out negative ones

Remedy: square the residuals and then sum them
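The cancellation problem is easy to see numerically. The residuals below are hypothetical numbers chosen for illustration: their plain total is zero even though no point lies on the line, while the sum of squares exposes the poor fit.

```python
# Hypothetical residuals from a poorly fitting line (illustrative only).
residuals = [5.0, -5.0, 3.0, -3.0]

plain_total = sum(residuals)                    # positives cancel negatives
squared_total = sum(u ** 2 for u in residuals)  # every term is non-negative

print(plain_total)    # 0.0
print(squared_total)  # 68.0
```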

OLS Criteria

Objective of OLS: Minimize $\sum_i \hat{u}_i^2 = \sum_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$
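For the two-variable model, minimizing the sum of squared residuals has a well-known closed-form solution: the slope is the ratio of the cross-deviation sum to the squared-deviation sum of X, and the intercept makes the line pass through the point of means. The sketch below assumes nothing beyond that formula. The data are hypothetical, chosen so that the fit lands on b0 = 37 and b1 = 38; they are not the dataset behind the slides.

```python
def ols_fit(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """The OLS objective evaluated at a candidate line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: size in 1,000 sq ft, price in lakhs of rupees.
sizes = [1.0, 1.5, 2.0, 2.5, 3.0]
prices = [76.0, 92.0, 115.0, 130.0, 152.0]

b0, b1 = ols_fit(sizes, prices)
print(b0, b1)
print(sum_squared_residuals(sizes, prices, b0, b1))
```

Any other line, for example one with a slightly different slope, gives a larger value of `sum_squared_residuals` on the same data.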

The coefficient of determination r2

How well does the sample regression line fit the data? We want to know what proportion of the variation in Y our model explains

This is given by the r-square statistic, the coefficient of determination


r2: Measures the goodness of fit
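One common way to compute this proportion, for a least squares line with an intercept, is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical values and shows the two extreme cases: fitted values equal to the data, and fitted values that ignore X and always predict the mean.

```python
def r_squared(actual, fitted):
    """Proportion of the variation in `actual` explained by `fitted`."""
    y_bar = sum(actual) / len(actual)
    ss_total = sum((y - y_bar) ** 2 for y in actual)
    ss_residual = sum((y - f) ** 2 for y, f in zip(actual, fitted))
    return 1.0 - ss_residual / ss_total


# Hypothetical observed prices (illustrative only).
actual = [76.0, 92.0, 115.0, 130.0, 152.0]

perfect_fit = list(actual)                             # every point on the line
mean_only = [sum(actual) / len(actual)] * len(actual)  # model explains nothing

print(r_squared(actual, perfect_fit))  # 1.0
print(r_squared(actual, mean_only))    # 0.0
```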

The coefficient of determination r2

Ballentine view of r2

r2 = 0

r2 = 1

0 < r2 < 1

The closer r2 is to 1, the better the fit

Example for multivariate regression model

Sam wants to predict the sales of a compact cassette tape recorder across stores using advertisement and price data, where

Sales = number of units sold
Advertisement = number of times the product is advertised within the store
Price = price in dollars

Predict the sales of the compact cassette tape recorder if Advertisement = 7 and Price = $132

Least Square Regression Model

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$

Least Square Regression


Sales = $\beta_0$ + $\beta_1$ (Advertisement) + $\beta_2$ (Price) + Error

Coefficients (a)

Model 1                    B          Std. Error   Beta     t        Sig.
(Constant)                 219.231    86.242                2.542    .085
Number of Advertisement    6.381      2.180        .847     2.927    .061
Price in Dollars           -1.671     .684         -.706    -2.441   .092

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

a. Dependent Variable: Sales (units sold)

Least Square Regression

Estimated Equation

Sales = 219.231 + 6.381 (Advertisement) - 1.671 (Price)


Interpretation

Constant $\beta_0$: when Advertisement and Price are both zero, average sales = 219.231 units

Slopes:
$\beta_1$: if Advertisement increases by 1, sales increase by about 6.4 units, holding Price fixed
$\beta_2$: if Price increases by $1, sales decrease by about 1.67 units, holding Advertisement fixed

Prediction

Predict Sales when Advertisement = 7 and Price = $132

Sales = 219.231 + 6.381 x 7 - 1.671 x 132 = 219.231 + 44.667 - 220.572 = 43.326 units of sale
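The arithmetic of this prediction can be checked with a short script. The coefficients are the ones reported in the SPSS coefficients table; the dictionary and function names are illustrative only.

```python
# Estimated coefficients from the SPSS output for the sales model.
COEFFICIENTS = {
    "constant": 219.231,
    "advertisement": 6.381,  # units sold per additional in-store advertisement
    "price": -1.671,         # units sold per additional dollar of price
}


def predict_sales(advertisement: float, price: float) -> float:
    """Predicted units sold from the estimated equation."""
    return (
        COEFFICIENTS["constant"]
        + COEFFICIENTS["advertisement"] * advertisement
        + COEFFICIENTS["price"] * price
    )


sales = predict_sales(advertisement=7, price=132)
print(f"{sales:.3f}")  # 43.326
```

The advertisement term contributes 6.381 x 7 = 44.667 and the price term contributes -1.671 x 132 = -220.572, so the prediction is about 43 units.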

R-Square

SPSS output

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .884 (a)   .782       .637                16.108

a. Predictors: (Constant), Price in Dollars, Number of Advertisement

R-Square = 0.782 indicates that the model explains 78.2% of the variation in the Y variable
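The Adjusted R Square in the output can be reproduced from R Square with the usual degrees-of-freedom correction. The sketch assumes n = 6 observations (inferred from the total degrees of freedom of 5 in the ANOVA output, not stated directly on the slides) and k = 3 estimated coefficients including the constant.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjust R-square for the number of estimated coefficients.

    n: number of observations (assumed, see the note above)
    k: number of estimated coefficients, including the constant
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


adj = adjusted_r_squared(0.782, n=6, k=3)
print(round(adj, 3))  # 0.637
```

The adjustment penalizes extra predictors, which is why it falls well below 0.782 in such a small sample.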

Hypothesis testing

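Testing individual slope coefficients using the t test: does the sample estimate agree with the hypothesized population value?

$H_0: \beta_2 = \beta_2^*$

$H_1: \beta_2 \neq \beta_2^*$

where $\beta_2^*$ is the hypothesized population value

For df = n - k (n observations, k estimated coefficients including the constant) and the chosen level of significance, read the critical value from the t table. Decision rule: if the calculated |t| exceeds the critical t, reject H0.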

Hypothesis testing

Test whether each of the slope coefficients has any impact on the Y variable at a significance level of 0.05.

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

$t = \frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)} = \frac{6.381 - 0}{2.18} = 2.927$


Significance level (SPSS output): 0.061
0.061 > 0.05 => Do not reject H0
Advertisement has no significant impact on Sales
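The t statistic and the decision rule can be written out as a small sketch. The estimate, standard error and Sig. value are taken from the SPSS output; the function names are illustrative. The standard library has no t distribution, so the Sig. value is taken from the output rather than computed.

```python
def t_statistic(estimate: float, std_error: float, hypothesized: float = 0.0) -> float:
    """t = (estimate - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_error


def reject_null(sig: float, alpha: float = 0.05) -> bool:
    """Reject H0 only when the reported Sig. (p-value) is below alpha."""
    return sig < alpha


t_adv = t_statistic(6.381, 2.180)
print(round(t_adv, 3))     # 2.927
print(reject_null(0.061))  # False: do not reject H0 at the 0.05 level
```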

Individual t test: significance of slope coefficient


Coefficients (a)

Model 1                    B          Std. Error   Beta     t        Sig.
(Constant)                 219.231    86.242                2.542    .085
Number of Advertisement    6.381      2.180        .847     2.927    .061
Price in Dollars           -1.671     .684         -.706    -2.441   .092

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

a. Dependent Variable: Sales (units sold)

For both slopes the significance level exceeds 0.05 (.061 and .092) -> Do not reject H0

Hypothesis testing: The overall significance


Is the regression as a whole significant? Test if at least one X variable has an impact on Y

$H_0: \beta_1 = \beta_2 = 0$ (Y doesn't depend on X)

$H_1$: at least one $\beta_i \neq 0$ (Y depends on at least one X)

Statistic used: the F statistic, given in the ANOVA table of the SPSS output
At a significance level of 0.05: if Sig. < 0.05, reject H0

Hypothesis testing : F test


ANOVA (b)

Model 1       Sum of Squares   df   Mean Square   F       Sig.
Regression    2792.424         2    1396.212      5.381   .102 (a)
Residual      778.409          3    259.470
Total         3570.833         5

a. Predictors: (Constant), Price in Dollars, Number of advertisement
b. Dependent Variable: Sales (units sold)

Sig. = 0.102 > 0.05, hence do not reject H0. At the 5% level there is no significant evidence that Y depends on any of the X variables
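The F statistic is the ratio of the regression mean square to the residual mean square, and the same sums of squares also give R-Square. The sketch below recomputes both from the ANOVA figures; the variable names are illustrative, and the Sig. value is read from the output rather than computed.

```python
# Sums of squares and degrees of freedom from the ANOVA output.
ss_regression, df_regression = 2792.424, 2
ss_residual, df_residual = 778.409, 3

ms_regression = ss_regression / df_regression  # mean square, regression
ms_residual = ss_residual / df_residual        # mean square, residual

f_stat = ms_regression / ms_residual
r2 = ss_regression / (ss_regression + ss_residual)

print(round(f_stat, 3))  # 5.381
print(round(r2, 3))      # 0.782

sig = 0.102  # reported by SPSS
print("reject H0" if sig < 0.05 else "do not reject H0")  # do not reject H0
```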