
Regression Analysis

Learning Objectives

◉ How business depends on relationships between two or more variables


◉ Using scatter diagrams to understand relationships
◉ Types of relationships
◉ Quantifying the relationship
◉ Least squares method for regression analysis
◉ Correlation analysis
◉ Limitations of regression & correlation analysis
1 Introduction
Getting Started
1.1 Forecasting in Business

◉ Every day, managers make decisions based on predictions of future events. To make
these decisions, one has to rely on the different attributes affecting the event and their
relationships with each other.

◉ If decision makers can predict a future event based on today's facts, it helps them craft
effective strategies to overcome a future problem, to seize a business opportunity, or
sometimes simply to boost sales.

◉ Example: predicting the power output of windmills based on wind speed. It gives you an
understanding of what the output will be for the next week, which helps in assessing
power consumption and supply issues well in advance.
1.2 Regression & Correlation Analysis

◉ Regression and correlation analyses help us understand the nature and strength of the
relationship between two or more variables.

◉ In other words, regression quantifies the relationship between two variables, while
correlation measures the strength of the relationship between two or more variables.

◉ In regression analysis we will develop an equation which relates the known variable to the
unknown variable.
1.3 Types of Relationships

Regression and correlation analysis is based on the relationship between two or more variables. The
known variable is called the independent variable, and the variable we want to predict is called the
dependent variable.

For example, the monthly sales of packaged drinking water depend on the average temperature for
the month. The monthly sales value in this example is our dependent variable, and the average
temperature for the month is the independent variable.

The relationship between monthly sales and average temperature in the above example is a direct
one: as the temperature increases, the monthly sales value increases.

We can also have an inverse relationship between two variables. For example, your monthly
savings depend on petrol prices (for some of us they do). In this case, as petrol prices
go up, your savings come down.
1.3 Types of Relationships

[Figure: two scatter plots, each with the dependent variable (denoted by y) on the vertical axis and the independent variable (denoted by x) on the horizontal axis, illustrating a direct relationship and an inverse relationship.]

Direct relationship example: the amount invested in a social media campaign and the number of visitors on your website.

Inverse relationship example: the power of a vehicle and the fuel efficiency of the vehicle.
1.4 Establishing the Relationships

The first step before starting with regression analysis is to determine whether there is a relationship
between the variables. The easiest way to find out is to plot a scatter diagram.

Scatter Diagram:

A scatter diagram can give us two types of information:

1. We can look for patterns that indicate that the variables are related.
2. If there is a relationship, we can see what kind of equation explains the relationship
between the variables.

For example, the monthly sales of packaged drinking water depend on the average temperature for
the month. The monthly sales and temperature data have been captured and are shown below:

Average Monthly Temperature | Monthly Revenue
18 | 2250
20 | 2200
22 | 2466
24 | 3105
35 | 3763
20 | 2200
1.4 Establishing the Relationships

Here is a scatter diagram for the example mentioned in the last section:

[Figure: scatter plot of Monthly Revenue (1000 to 4000) against Average Monthly Temperature (10 to 40) with a straight line drawn through the points.]

This scatter diagram gives us two types of information:

1. The plot shows that there is a pattern in the data, i.e. as the temperature increases,
revenue also increases.
2. It also shows that this relationship can be explained by the equation of a line.

In this case, the line drawn through our data points represents a direct relationship because
y increases as x increases. Since the relationship can be described by a line, this kind of
relationship is termed a linear relationship.

Similarly, depending on the variables, a relationship can also be curvilinear.
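As a minimal sketch in R, assuming the six temperature/revenue pairs from the table above, this scatter diagram can be reproduced as follows:

#plotting the scatter diagram
temp <- c(18, 20, 22, 24, 35, 20)
revenue <- c(2250, 2200, 2466, 3105, 3763, 2200)
plot(temp, revenue, xlab = "Average Monthly Temperature", ylab = "Monthly Revenue")
abline(lm(revenue ~ temp)) #add a fitted straight line through the points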
1.5 Scatter Plots

[Figure: example scatter plots illustrating direct linear, inverse linear, direct curvilinear, inverse curvilinear, inverse linear with more scattering, and no relationship.]
1.6 Estimation using Regression Line
What is a Regression Line?

A regression line can first be put in place by visually fitting a line among the data points.

Our objective for the analysis is to calculate the regression line more precisely, using an
equation that relates the two variables mathematically.

The equation of a straight line where the dependent variable Y is determined by the
independent variable X can be given as

Y = mX + c

where m = slope of the line
      c = Y-intercept

[Figure: data points with a regression line fitted through them.]
1.6 Estimation using Regression Line

How many regression lines are possible?

[Figure: the same data points with three different candidate lines drawn through them.]

Y = m1X + c1        Y = m2X + c2        Y = m3X + c3

Which line should we choose?

1.7 The Method of Least Squares

To choose between candidate lines, we measure how far each estimate is from what was actually observed:

Error = Observed Value - Estimated Value

[Figure: two candidate lines (Graph A and Graph B) drawn through the same data points, with the error of each point marked; Graph A shows errors of 0.4, -0.4 and 0.5, while Graph B shows errors of 0.5, -0.7 and -0.5.]
1.7 The Method of Least Squares

Applying the least squares method to the estimated lines, we square each error and sum the squares:

Graph A (ŷ)                   | Graph B (ŷ)
(0.6 - 0.2)² = (0.4)² = 0.16  | (0.7 - 0.2)² = (0.5)² = 0.25
(0.1 - 0.5)² = (-0.4)² = 0.16 | (0.1 - 0.8)² = (-0.7)² = 0.49
(1.1 - 0.6)² = (0.5)² = 0.25  | (1.0 - 1.5)² = (-0.5)² = 0.25
∑ = 0.57                      | ∑ = 0.99

The estimated line in Graph A has a smaller sum of squared errors than that of Graph B, so in this
case the first regression line is a better estimate than the second.
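This comparison is easy to reproduce in R; a minimal sketch using the observed and estimated values from the table above:

#sum of squared errors for the two candidate lines
err_a <- c(0.6 - 0.2, 0.1 - 0.5, 1.1 - 0.6) #errors under Graph A's line
err_b <- c(0.7 - 0.2, 0.1 - 0.8, 1.0 - 1.5) #errors under Graph B's line
sum(err_a^2) # 0.57
sum(err_b^2) # 0.99, so Graph A's line is the better estimate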
1.7 The Method of Least Squares

The objective of regression analysis is to find the equation ŷ = mx + c for which the sum of squared
errors ∑(y - ŷ)² is minimized. In other words, we need to find the values of m and c which
minimize ∑(y - ŷ)².

To minimize ∑(y - ŷ)² we need to solve

d/dm ∑(y - ŷ)² = 0
d/dc ∑(y - ŷ)² = 0

Solving these two equations will give us the desired values of m and c.
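Solving the two equations gives the familiar closed-form estimates m = ∑(x - x̄)(y - ȳ) / ∑(x - x̄)² and c = ȳ - m·x̄. A minimal sketch in R, reusing the temperature/revenue data from section 1.4:

#closed-form least squares estimates of slope m and intercept c
x <- c(18, 20, 22, 24, 35, 20)
y <- c(2250, 2200, 2466, 3105, 3763, 2200)
m_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
c_hat <- mean(y) - m_hat * mean(x)
c(slope = m_hat, intercept = c_hat)
#coef(lm(y ~ x)) returns the same estimates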
1.8 The Standard Error of Estimate

The standard error of estimate measures the variability of the observed values around the regression
line:

se = √( ∑(Y - Ŷ)² / (n - 2) )

◉ Note: the sum of squared errors is divided by n - 2 because we lose two degrees of freedom in
estimating the regression line.

[Figure: two fitted lines, one with points widely scattered around it (a less accurate estimator) and one with points close to it (a more accurate estimator).]
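A minimal sketch of this computation in R, continuing with the x and y vectors from the previous sketch:

#standard error of estimate around the fitted line
fit_line <- lm(y ~ x)
n <- length(y)
se <- sqrt(sum(residuals(fit_line)^2) / (n - 2))
se
#summary(fit_line)$sigma returns the same value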


1.9 Interpreting Se

Just like the standard deviation, the larger the standard error of estimate, the greater the
scattering of points around the regression line. If se = 0, the estimating equation is a perfect
estimator.

[Figure: regression line with parallel bands drawn at ±1 se and ±2 se around it.]

Assuming that the observed points are normally distributed around the regression line, we can
expect about 68% of the points to fall within ±1 se, and so on.

The two assumptions that we are making here are:
1. The observed values are normally distributed around each estimated value.
2. The variance around each estimated value is constant.

These two assumptions are required to construct confidence intervals and to carry out hypothesis
testing on the estimated values of m and c.
2 Correlation Analysis
Strength of relationships
2.1 Correlation Analysis: Concept

Correlation analysis determines the strength of the linear relationship between two variables. It helps in
finding out how well the regression line explains the relationship between the dependent and independent
variable(s).

To illustrate further, let us consider an example.

The manager at a bar understands that the bar's earnings depend on temperature. To understand this
pattern, the manager has taken the last six days of data as a sample, shown in the table below.

Bar Earning | Temperature
1000 | 20
1500 | 25
1500 | 30
2500 | 35
4800 | 40
6000 | 45

The scatter diagram gave an idea that there is a direct relationship between temperature and the bar
earnings. Based on this relationship, they have developed an equation to predict bar earnings depending
on temperature:

Earning = 205 * temperature - 3784
2.1 Correlation Analysis: Concept

The table below shows the estimated earnings as per the equation Earning = 205 * temperature - 3784.
The generalized form is y = 205x - 3784, where y is the dependent variable (bar earning) and x is the
independent variable (temperature).

Bar Earning | Temp. | Estimated Earning
1000 | 20 | 316
1500 | 25 | 1341
1500 | 30 | 2366
2500 | 35 | 3391
4800 | 40 | 4416
6000 | 45 | 5441

The average earning is ȳ = 2883. It means that on average the bar should earn about 2883, but due to
temperature it sometimes goes up and sometimes goes down.

This deviation from the average earning is termed the overall variance in earnings. We denote it by
SST (total sum of squares):

SST = ∑(y - ȳ)²
2.1 Correlation Analysis: Concept

From the figure, it can be seen that the overall variance SST has two components:

1. The proportion of the overall variance which can be explained by the regression line, denoted by
SSR (sum of squares due to regression):

SSR = ∑(ŷ - ȳ)²

2. The proportion of the variance which is not explained and can be termed error, denoted by SSE
(sum of squared errors):

SSE = ∑(y - ŷ)²

And, SST = SSR + SSE

[Figure: the regression line y = 205x - 3784 and the mean line ȳ = 2883; at temp. = 45 the observed value is 6000 and the estimated value is 5441, showing the split of SST into SSR and SSE.]
2.1 Correlation Analysis: Concept

◉ Coefficient of Determination: This is the first measure of the strength of the linear relationship
between two variables and is given as

r² = SSR / SST = 1 - SSE / SST

◉ Coefficient of Correlation: This is the second measure of the strength of the linear relationship
between two variables and is given as

r = √r²
2.2 Range of Correlation Values

◉ Coefficient of Determination: r² = SSR / SST ranges from 0 to 1. A value of 0 means no
relationship, and 1 means a perfect relationship where all the estimated points are equal to the
observed points.

◉ Coefficient of Correlation: r = √r² ranges from -1 to +1. A value of -1 means perfect correlation
but an inverse relationship; a value of +1 means a perfect direct relationship; a value of 0 means
no relationship.
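A minimal sketch of these quantities in R, using the bar data and the (rounded) equation from section 2.1, so SSR + SSE ≈ SST up to rounding:

#decomposing total variance for the bar earnings example
temp <- c(20, 25, 30, 35, 40, 45)
earning <- c(1000, 1500, 1500, 2500, 4800, 6000)
est <- 205 * temp - 3784 #estimated earnings
sst <- sum((earning - mean(earning))^2) #total sum of squares
ssr <- sum((est - mean(earning))^2) #explained by the regression line
sse <- sum((earning - est)^2) #unexplained error
r2 <- ssr / sst #coefficient of determination
r <- sqrt(r2) #coefficient of correlation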


3 Multiple Regression
Building regression models
3.1 Multiple Regression: Introduction

Most business cases involve multiple factors impacting a decision or an event, and hence most
predictions in business depend on multiple factors.

So our regression equation, which was y = mx + c (where y was the dependent variable and x the
independent variable, with m as slope and c as intercept), changes to:

y = m1x1 + m2x2 + ... + mnxn + c    (this represents a plane rather than a line)

Where
y is the dependent variable,
x1, x2, ..., xn are the independent variables,
m1, m2, ..., mn are the corresponding slopes for x1, x2, ..., xn, and
c is the y-intercept.
3.2 The Standard Error of Estimate

The standard error of estimate measures the variability of the observed values around the regression
plane:

se = √( ∑(Y - Ŷ)² / (n - k - 1) )

where k is the number of independent variables.

3.3 Making Inference about Parameters

Our estimated equation of the regression plane is:

ŷ = m1x1 + m2x2 + ... + mnxn + c

The above equation is an estimate of an unknown population regression plane of the form

Y = M1x1 + M2x2 + ... + Mnxn + C

Inference about an individual slope Mn (the parameter for the nth independent variable xn):

Null hypothesis H0: Mn = 0
Alternative hypothesis Ha: Mn ≠ 0

Calculate t = mn / s(mn), where s(mn) is the standard error of the estimated slope mn.

Reject H0 if the calculated t is significantly large.
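In R these t statistics come with every fitted model; a minimal sketch, assuming a model object fit returned by lm():

#coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
summary(fit)$coefficients
#a small p-value (e.g. below 0.05) leads to rejecting H0: Mn = 0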


3.4 Multicollinearity

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant
information about the response.

Examples of multicollinear predictors are the height and weight of a person, years of education and
income, and the assessed value and square footage of a home.

Consequences of high multicollinearity:

1. Increased standard errors of the estimates of the m's (decreased reliability).

2. Often confusing and misleading results.

3.5 Detecting Multicollinearity

Methods for detecting multicollinearity:

1. Compute correlations between all pairs of predictors. If some correlations are close to 1 or -1,
remove one of the two correlated predictors from the model.

2. Calculate the variance inflation factor for each predictor xj:

VIFj = 1 / (1 - R²j)

where R²j is the coefficient of determination of the model that regresses the jth predictor on all
the other predictors.

If VIFj ≥ 10 then there is a problem with multicollinearity.
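A minimal sketch of both checks in R, where df is a hypothetical data frame containing only the numeric predictors and x1 is the predictor being checked:

#1. pairwise correlations between predictors
cor(df)

#2. VIF for x1: regress it on all the other predictors
r2_j <- summary(lm(x1 ~ ., data = df))$r.squared
vif_j <- 1 / (1 - r2_j)
vif_j #values of 10 or more signal a multicollinearity problem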


3.6 Creating Dummy Variables

A dummy variable or indicator variable is an artificial variable created to represent an attribute with
two or more distinct categories/levels.

Regression analysis treats all independent (X) variables in the analysis as numerical. Numerical
variables are interval or ratio scale variables whose values are directly comparable.

But sometimes a qualitative variable (like Gender) may impact the dependent variable, so we create
dummy variables for such qualitative variables. We code the levels of the variable as 1 or 0, where 1
means the attribute is present and 0 means it is absent, as in the table below (see the R sketch that
follows):

Income | Gender | Gender_dummy | Location | East_dummy | West_dummy | North_dummy
15784 | Male | 1 | East | 1 | 0 | 0
25789 | Female | 0 | West | 0 | 1 | 0
35789 | Female | 0 | North | 0 | 0 | 1
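A minimal sketch of this coding in R (the column names follow the table above):

#recreating the example data
d <- data.frame(Income = c(15784, 25789, 35789),
                Gender = c("Male", "Female", "Female"),
                Location = c("East", "West", "North"))
#binary attribute: one dummy column
d$Gender_dummy <- ifelse(d$Gender == "Male", 1, 0)
#multi-level attribute: one dummy per level
d$East_dummy <- ifelse(d$Location == "East", 1, 0)
d$West_dummy <- ifelse(d$Location == "West", 1, 0)
d$North_dummy <- ifelse(d$Location == "North", 1, 0)

As noted in the R code that follows, lm() performs this coding automatically for factor columns.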


Multiple Regression: R Code

#Data Set: house_price
hp = read.csv("house.csv")

#checking data
View(hp)

#checking relationship with categorical data
aov_fit = aov(Price ~ Neighborhood, data = hp)
summary(aov_fit) # Neighborhood is impacting Price

aov_fit1 = aov(Price ~ Brick, data = hp)
summary(aov_fit1) # Brick is also impacting Price

#We would normally create dummy variables for Neighborhood & Brick, but R does this automatically for factor variables
Multiple Regression: R Code

#running correlation on numerical variables using the Hmisc package
library(Hmisc)
rcorr(as.matrix(hp[c(2:6)])) # taking the 2nd to 6th columns

#splitting data into train and test
set.seed(123)
train_ind <- sample(seq_len(nrow(hp)), size = 80)
train <- hp[train_ind, ]
test <- hp[-train_ind, ]

#building regression model
fit = lm(Price ~ ., data = train)
summary(fit)

#checking VIF with the car package
library(car)
vif(fit) # predictors with values above 5 may need to be dropped

#finding estimated values
train = data.frame(train, fit$fitted.values)
3.7 Regression Diagnostics

Regression diagnostics let us know whether our model is working well. In this phase, we
primarily check all the assumptions of regression, which are:

◉ Linearity - the relationships between the predictors and the outcome variable should be linear.

◉ Normality - the errors should be normally distributed. Technically, normality is necessary only for
the t-tests to be valid; estimation of the coefficients only requires that the errors be identically
and independently distributed.

◉ Homogeneity of variance (homoscedasticity) - the error variance should be constant.

◉ Independence - the errors associated with one observation should not be correlated with the errors
of any other observation.
3.8 Outlier, Leverage & Influence

A single observation that is significantly different from all other observations can make a large
difference in the results of your regression analysis. If a single observation (or a small group of
observations) substantially changes your results, you would want to know about this and investigate
further. There are three ways that an observation can be unusual.

Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an
observation whose dependent-variable value is unusual given its values on the predictor variables. An
outlier may indicate a sample peculiarity, a data entry error, or some other problem.

Leverage: An observation with an extreme value on a predictor variable is called a point with high
leverage. Leverage is a measure of how far an observation deviates from the mean of that variable.
These leverage points can have an effect on the estimates of the regression coefficients.

Influence: An observation is said to be influential if removing it substantially changes the estimates
of the coefficients. Influence can be thought of as the product of leverage and outlierness.
Influential Observations

#checking influential observations using Cook's distance
d1 = cooks.distance(fit)
train = data.frame(train, d1)

# any point whose Cook's distance is more than 4/(n-k-1) is an influential point, where n is the
# number of observations and k is the number of predictors

train$ol = ifelse(train$d1 > 4/69, TRUE, FALSE)

#deleting observations flagged as influential
train = train[!train$ol == TRUE, ]

Note: An influential point may also impact the model in a positive manner, so we should always
cross-check R-squared values after removing influential points.
Outliers & Leverage

# Assessing outliers
outlierTest(fit) # Bonferroni p-value for the most extreme observations
qqPlot(fit, main = "QQ Plot") # QQ plot for studentized residuals
leveragePlots(fit) # leverage plots

#leverage can also be checked by simply using the plot command
plot(fit)

#any point which is very far from the rest of the observations requires our attention
Assumptions in Regression

#checking assumptions
par(mfrow = c(2, 2))
plot(fit)
#any failure of assumptions will be visible in these plots
3.9 Homoscedasticity Failed: Examples

Some examples where the test of homoscedasticity failed, i.e. heteroscedasticity was found.

[Figure: residual plots showing funnel-shaped and otherwise non-constant spread of residuals.]
Scoring on Test Dataset

# scoring on the test data-set
pred = predict(fit, test) # creates a vector of predicted values
test = data.frame(test, pred) # combining predicted values with the test data-set

The last step is to find the residuals on the test dataset and create residual plots to see whether the
assumptions are validated on the test data set as well.
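A minimal sketch of that last step, continuing with the pred column created above and assuming Price is the response column:

#residuals on the test set: observed minus predicted
test$resid = test$Price - test$pred
#residual plot: look for a constant spread around zero
plot(test$pred, test$resid, xlab = "Predicted Price", ylab = "Residual")
abline(h = 0)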
4 Transformation in Regression
When the assumptions fail.
4.1 Windmill Data

The output of a windmill depends on the wind speed. The table below documents this data for a
windmill on different days.

Sno | Wind Velocity | DC Output
1 | 5 | 1.582
2 | 6 | 1.822
3 | 3.4 | 1.057
4 | 2.7 | 0.5
5 | 10 | 2.236
. | . | .

A model was built to predict DC Output depending on Wind Velocity.

◉ As shown in the graph, the model fails to meet the homoscedasticity assumption.

◉ In such scenarios we transform the variables to make sure that the assumptions are not violated.
4.1 Windmill Data

For example, in this scenario, instead of taking Wind Velocity as the independent variable, we
transform it in the following way:

vel = 1 / Wind Velocity

where vel is a new variable created by transforming the original independent variable Wind Velocity.

Sno | Wind.Velocity | DC.Output | vel
1 | 5 | 1.582 | 0.2
2 | 6 | 1.822 | 0.16
3 | 3.4 | 1.057 | 0.29
4 | 2.7 | 0.5 | 0.37
5 | 10 | 2.236 | 0.1

◉ Now one can see that the chart meets the homoscedasticity assumption.
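A minimal sketch of this transformation in R, where wm is a hypothetical data frame with the Wind.Velocity and DC.Output columns shown above:

#transform the predictor and refit
wm$vel <- 1 / wm$Wind.Velocity
fit_t <- lm(DC.Output ~ vel, data = wm)
#equivalently, inline: lm(DC.Output ~ I(1 / Wind.Velocity), data = wm)
#re-check the residual spread
plot(fitted(fit_t), residuals(fit_t)); abline(h = 0)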
4.2 Transformation Examples

There is no set rule for what to transform and how to transform it, but generally we try to transform
the dependent variable.

Some possible mathematical transformations are:

sin(variable)
cos(variable)
sin⁻¹(variable)
exp(variable)
1/variable
1/variable²
variable²
...
Thanks!
Any questions ?
You can contact me at
◉ 8106105105
◉ amitendra.kumar@theanalytica.com
