Learning Objectives
◉ Every day, managers make decisions based on predictions of future events. To make these decisions, one has to rely on the different attributes affecting the event and their relationships with each other.
◉ If decision makers can predict a future event based on today's facts, they gain an aid to making effective strategies to overcome a future problem, to capitalize on a business opportunity, or sometimes simply to boost sales.
◉ Example: predicting the power output of windmills from wind speed. It tells you what the output will be for the next week, which helps in assessing power consumption and supply issues well in advance.
1.2 Regression & Correlation Analysis
◉ Regression and correlation analyses help us understand the nature and strength of the relationship between two or more variables.
◉ In other words, regression quantifies the relationship between two variables, while correlation measures the strength of the relationship between two or more variables.
◉ In regression analysis we develop an equation that relates the known variable to the unknown variable.
1.3 Types of Relationships
Regression and correlation analysis is based on the relationship between two or more variables. The known variable is called the independent variable, and the variable we want to predict is called the dependent variable.
For example, the monthly sales of packaged drinking water depend on the average temperature for the month. The monthly sales value in this example is our dependent variable, and the average temperature for the month is the independent variable.
The relationship between monthly sales and average temperature in the above example is a direct one: as the temperature increases, the monthly sales value increases.
We can also have an inverse relationship between two variables. For example, your monthly savings depend on petrol prices (for some of us they do). In this case, as petrol prices go up, your savings come down.
1.3 Types of Relationships
[Figure: two panels contrasting direct and inverse relationships; in each panel the dependent variable is denoted by y, with an example under each.]
The first step before starting regression analysis is to determine whether there is a relationship between the variables. The easiest way to find that out is to plot a scatter diagram.

Scatter Diagram:
A scatter diagram can give us two types of information:
1. We can look for patterns that indicate that the variables are related.
2. If there is a relationship, we can see what kind of equation explains the relationship between the variables.

For example, the monthly sales of packaged drinking water depend on the average temperature for the month. This monthly sales and temperature data has been captured and is shown below:

Average Monthly Temperature | Monthly Revenue
18 | 2250
20 | 2200
22 | 2466
24 | 3105
35 | 3763
20 | 2200
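As a quick numerical check of the pattern in this table, one can compute the Pearson correlation between temperature and revenue. A minimal sketch in Python (used here purely for illustration; the data is taken from the table above):

```python
# Monthly temperature / revenue sample from the table above
temps = [18, 20, 22, 24, 35, 20]
revenue = [2250, 2200, 2466, 3105, 3763, 2200]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r = pearson_r(temps, revenue)
# r is strongly positive: the points cluster around a rising line
```

A strongly positive r confirms what the scatter diagram suggests visually: a direct relationship worth modelling with a line.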
1.4 Establishing the Relationships
1. The scatter diagram shows a pattern, indicating that monthly revenue and average monthly temperature are related.
2. It also shows that this relationship can be explained by the equation of a line.
In this case, the line drawn through our data points represents a direct relationship, because y increases as x increases. Since the relationship can be described by a line, this kind of relationship is termed a linear relationship.
Similarly, depending on the variables, a relationship can also be termed curvilinear.
[Figure: scatter plot of Monthly Revenue (y-axis, 1000-2500) against Average Monthly Temperature (x-axis, 10-40) with a line drawn through the points.]
1.5 Scatter Plots
[Figure: three scatter plots illustrating Direct Linear, Inverse Linear, and Direct Curvilinear relationships.]
A regression line can be put in place by fitting a line visually among the data points:
Y = mX + c
[Figure: two candidate lines (Graph A and Graph B) fitted visually through the same data points; each point's vertical distance from the line is its error, e.g. errors of 0.4, -0.4, and 0.5 in Graph A and 0.5, -0.7, and -0.5 in Graph B.]
1.7 The Method of Least Squares
Applying the least squares method to the two estimated lines:

Graph A | Graph B
(0.6-0.2)² = (0.4)² = 0.16 | (0.7-0.2)² = (0.5)² = 0.25
(0.1-0.5)² = (-0.4)² = 0.16 | (0.1-0.8)² = (-0.7)² = 0.49
(1.1-0.6)² = (0.5)² = 0.25 | (1-1.5)² = (-0.5)² = 0.25
Sum of squared errors = 0.57 | Sum of squared errors = 0.99

Since Graph A has the smaller sum of squared errors, its line is the better fit.
Our objective in regression analysis is to find the equation ŷ = m·x + c for which the sum of squared errors is minimized. In other words, we need to find the values of m and c that minimize
∑ (y − ŷ)²
To minimize this for given data we need to solve:
d/dm ∑ (y − ŷ)² = 0
d/dc ∑ (y − ŷ)² = 0
Solving these two equations will give us the desired values of m and c.
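Solving those two equations by hand gives the familiar closed form m = ∑(x−x̄)(y−ȳ) / ∑(x−x̄)² and c = ȳ − m·x̄. A small sketch that applies it to the earlier temperature/revenue sample and checks that nudging m or c away from the solution can only increase the sum of squared errors:

```python
# temperature / revenue sample from the earlier scatter-diagram slide
x = [18, 20, 22, 24, 35, 20]
y = [2250, 2200, 2466, 3105, 3763, 2200]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# closed-form least squares solution of the two normal equations
m = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
c = my - m * mx

def sse(m, c):
    """Sum of squared errors for the line y = m*x + c."""
    return sum((b - (m * a + c)) ** 2 for a, b in zip(x, y))

best = sse(m, c)
# perturbing the fitted coefficients can only make the squared error worse
assert best <= sse(m + 0.1, c) and best <= sse(m, c + 10)
```

The assertions at the end are the defining property of least squares: no other (m, c) pair yields a smaller sum of squared errors.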
1.8 The Standard Error of Estimate
The standard error of estimate measures the variability of the observed values around the regression line:
se = √( ∑ (Y − Ŷ)² / (n − 2) )
Just like the standard deviation, the larger the standard error of estimate, the greater the scattering of points around the regression line, and the less accurate the line is as an estimator.
[Figure: bands at ±1 se and ±2 se around the regression line; a wider band indicates a less accurate estimator.]
The above assumptions are required to construct confidence intervals and to carry out hypothesis tests on the estimated values of m and c.
2 Correlation Analysis
Strength of relationships
2.1 Correlation Analysis: Concept
Correlation analysis determines the strength of the linear relationship between two variables. It helps in finding out how well the regression line explains the relationship between the dependent and independent variable(s).

The manager at a bar understands that the bar's earnings depend on temperature. To understand this pattern, the manager has taken the last six days of data as a sample, shown in the table below:

Temperature | Bar Earning
20 | 1000
25 | 1500
30 | 1500
35 | 2500
40 | 4800
45 | 6000

The scatter diagram gave an idea that there is a direct relationship between temperature and bar earnings. Based on this relationship, an equation was developed to predict bar earnings from temperature:

Earning = 205 * temperature - 3784
2.1 Correlation Analysis: Concept
The table below shows estimated earnings as per the equation:
Earning = 205 * temperature - 3784
The generalized form would be y = 205 * x - 3784, where y is the dependent variable (bar earning) and x is the independent variable (temperature).

Temp. | Bar Earning | Estimated Earning
20 | 1000 | 316
25 | 1500 | 1341
30 | 1500 | 2366
35 | 2500 | 3391
40 | 4800 | 4416
45 | 6000 | 5441

The average earning is ȳ = 2883. It means that on average the bar should earn about 2883, but due to temperature it sometimes goes up and sometimes goes down.
This deviation from the average earning is termed the overall variance in earnings. We denote it by SST (total sum of squares):
SST = ∑ (y - ȳ)²
2.1 Correlation Analysis: Concept
From the figure on the right, it can be seen that the overall variance SST has two components:
1. The proportion of the overall variance which can be explained by the regression line, denoted by SSR (regression sum of squares):
SSR = ∑ (ŷ - ȳ)²
2. The proportion of the variance which is not explained and can be termed error, denoted by SSE (error sum of squares):
SSE = ∑ (y - ŷ)²
And SST = SSR + SSE.
[Figure: for temp. = 45, the observed value is 6000, the estimated value on the line y = 205x - 3784 is 5441, and the average value is ȳ = 2883; the gap from the average to the estimate contributes to SSR, and the gap from the estimate to the observation contributes to SSE.]
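The decomposition SST = SSR + SSE can be verified numerically on the bar data. A sketch in Python, fitting m and c by least squares (which reproduces the slide's rounded coefficients 205 and -3784):

```python
# bar data from the 2.1 slides
temp = [20, 25, 30, 35, 40, 45]
earning = [1000, 1500, 1500, 2500, 4800, 6000]

n = len(temp)
xbar, ybar = sum(temp) / n, sum(earning) / n
# least squares slope and intercept (m ~ 205.14, c ~ -3783.8,
# matching the slide's rounded 205 and -3784)
m = sum((x - xbar) * (y - ybar) for x, y in zip(temp, earning)) / \
    sum((x - xbar) ** 2 for x in temp)
c = ybar - m * xbar
yhat = [m * x + c for x in temp]

SST = sum((y - ybar) ** 2 for y in earning)               # total variance
SSR = sum((yh - ybar) ** 2 for yh in yhat)                # explained by the line
SSE = sum((y - yh) ** 2 for y, yh in zip(earning, yhat))  # unexplained error

assert abs(SST - (SSR + SSE)) < 1e-6                      # SST = SSR + SSE
```

Note that the identity SST = SSR + SSE holds exactly only for the least-squares line; for a line fitted by eye the two sides generally differ.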
2.1 Correlation Analysis: Concept
Coefficient of Determination:
This is the first measure of the strength of the linear relationship between two variables and is given by
r² = SSR / SST
It ranges from 0 to 1. A value of 0 means no relationship, and 1 means a perfect relationship where all the estimated points are equal to the observed points.

Coefficient of Correlation:
This is the second measure of the strength of the linear relationship between two variables and is given by
r = ±√r²
where the sign of r matches the sign of the slope. The coefficient of correlation ranges from -1 to +1. A value of -1 means perfect correlation with an inverse relationship, and a value of +1 means a perfect direct relationship.
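Continuing the bar example, r² and r can be computed directly from the decomposition. A sketch (same data as the 2.1 slides, Python for illustration):

```python
# bar data and least squares fit, as in the SST/SSR/SSE slides
temp = [20, 25, 30, 35, 40, 45]
earning = [1000, 1500, 1500, 2500, 4800, 6000]

n = len(temp)
xbar, ybar = sum(temp) / n, sum(earning) / n
m = sum((x - xbar) * (y - ybar) for x, y in zip(temp, earning)) / \
    sum((x - xbar) ** 2 for x in temp)
c = ybar - m * xbar
yhat = [m * x + c for x in temp]

SST = sum((y - ybar) ** 2 for y in earning)
SSR = sum((yh - ybar) ** 2 for yh in yhat)

r2 = SSR / SST      # coefficient of determination, between 0 and 1
r = r2 ** 0.5       # the slope is positive, so r takes the positive root
```

For this sample the regression line explains most, but not all, of the variance in earnings, so r² sits well above 0 but below 1.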
3 Multiple Regression
Most business cases involve multiple factors impacting a decision or an event, and hence most predictions in business depend on multiple factors.
So our regression equation, which was y = m*x + c (where y was the dependent variable and x the independent variable, with m as slope and c as intercept), changes to:
y = m1*x1 + m2*x2 + ... + mn*xn + c
where
y is the dependent variable,
x1, x2, ..., xn are the independent variables,
m1, m2, ..., mn are the corresponding slopes for x1, x2, ..., xn, and
c is the y-intercept.
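A minimal sketch of fitting such a multi-variable equation with NumPy's least-squares solver. The data and the true coefficients (2, -3, 5) are made up purely for illustration:

```python
import numpy as np

# synthetic example: y = 2*x1 - 3*x2 + 5, exactly (no noise)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 2 * x1 - 3 * x2 + 5

# design matrix with a column of ones for the intercept c
X = np.column_stack([x1, x2, np.ones_like(x1)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
m1, m2, c = coef   # recovers the slopes and intercept
```

The column of ones plays the role of the intercept c; each remaining column of X gets its own slope, exactly as in the equation above.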
3.2 The Standard Error of Estimate
The standard error of estimate measures the variability of the observed values around the regression plane:
se = √( ∑ (Y − Ŷ)² / (n − k − 1) )
where n is the number of observations and k is the number of independent variables.
ŷ = m1*x1 + m2*x2 + ... + mn*xn + c
The above equation is an estimate of an unknown population regression plane of the form
Y = M1*x1 + M2*x2 + ... + Mn*xn + C
Inference about an individual slope Mn (the population parameter for the nth independent variable xn):
Null hypothesis H0: Mn = 0
Alternative hypothesis Ha: Mn ≠ 0
Calculate t = mn / s(mn), where mn is the estimated slope and s(mn) is its standard error.
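For simple regression this test statistic reduces to t = m / s(m) with s(m) = se / √∑(x−x̄)², where se is the standard error of estimate from section 1.8. A sketch on the bar data from section 2.1:

```python
# bar data and least squares fit from section 2.1
temp = [20, 25, 30, 35, 40, 45]
earning = [1000, 1500, 1500, 2500, 4800, 6000]

n = len(temp)
xbar, ybar = sum(temp) / n, sum(earning) / n
sxx = sum((x - xbar) ** 2 for x in temp)
m = sum((x - xbar) * (y - ybar) for x, y in zip(temp, earning)) / sxx
c = ybar - m * xbar

sse = sum((y - (m * x + c)) ** 2 for x, y in zip(temp, earning))
se = (sse / (n - 2)) ** 0.5   # standard error of estimate (section 1.8)
s_m = se / sxx ** 0.5         # standard error of the slope
t = m / s_m                   # large |t| -> reject H0: slope is zero
```

Here t works out to roughly 5.4, well above the two-sided 5% critical value for n − 2 = 4 degrees of freedom (about 2.78), so the slope is significantly different from zero.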
Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.
Examples of multicollinear predictors are the height and weight of a person, years of education and income, and the assessed value and square footage of a home.
Methods for detecting multicollinearity:
1. Compute correlations between all pairs of predictors. If some r are close to 1 or -1, remove one of the two correlated predictors from the model.
2. Compute the variance inflation factor (VIF) for each predictor:
VIFj = 1 / (1 - R²j)
where R²j is the coefficient of determination obtained by regressing the jth predictor on all the other predictors.
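A sketch of method 2 with NumPy. The data here is synthetic: x2 is deliberately built as a near-copy of x1, so both should show a large VIF, while x3 is unrelated and should have a VIF near 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                   # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R2_j), R2_j from regressing column j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(y))])   # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
# vifs[0] and vifs[1] are large; vifs[2] is close to 1
```

A common rule of thumb (used in the R code later in this deck) is to drop or combine predictors whose VIF exceeds 5.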
A dummy variable or indicator variable is an artificial variable created to represent an attribute with two or more distinct categories/levels.
Regression analysis treats all independent (X) variables as numerical. Numerical variables are interval- or ratio-scale variables whose values are directly comparable.
But sometimes a qualitative variable (like gender) may impact the dependent variable, so we create dummy variables for such qualitative variables. We code the levels of the variable as 1 or 0, where 1 means the category is present and 0 means it is absent.
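A minimal sketch of this 0/1 coding in Python. The category names are made up for illustration; one level is dropped as the baseline so the dummy columns are not perfectly collinear with the intercept (R's lm(), used in the code below, does this automatically for factor columns):

```python
# dummy coding by hand: one 0/1 column per category level,
# dropping one level (the baseline) to avoid perfect multicollinearity
records = ["East", "West", "North", "East", "North"]
levels = sorted(set(records))   # ['East', 'North', 'West']
baseline = levels[0]            # 'East' becomes the reference level

dummies = {
    lvl: [1 if r == lvl else 0 for r in records]
    for lvl in levels if lvl != baseline
}
# dummies == {'North': [0, 0, 1, 0, 1], 'West': [0, 1, 0, 0, 0]}
```

The baseline level is represented by all dummy columns being 0; each slope on a dummy column then measures the shift relative to that baseline.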
hp=read.csv("house.csv")
# checking data
View(hp)
# one-way ANOVA: does Brick impact Price?
aov_fit1=aov(Price~Brick,data=hp)
summary(aov_fit1) # Brick also impacts Price
# We should create dummy variables for Neighborhood & Brick, but R does this automatically
Multiple Regression: R Code
# running correlation on numerical variables using the Hmisc package
library(Hmisc)
rcorr(as.matrix(hp[c(2:6)])) # taking the 2nd to 6th columns
# splitting data into train and test
set.seed(123)
train_ind <- sample(seq_len(nrow(hp)), size = 80)
train <- hp[train_ind, ]
test <- hp[-train_ind, ]
# building the regression model
fit=lm(Price~.,data=train)
summary(fit)
# checking VIF with the car package
library(car)
vif(fit) # a predictor with VIF greater than 5 should be dropped
# adding the fitted (estimated) values to the training data
train=data.frame(train,fit$fitted.values)
3.7 Regression Diagnostics
Regression diagnostics tell us whether our model is working well or not. In this phase, we primarily check the assumptions of regression, which are:
◉ Linearity - the relationships between the predictors and the outcome variable should be linear
◉ Normality - the errors should be normally distributed; technically, normality is necessary only for the t-tests to be valid - estimation of the coefficients only requires that the errors be identically and independently distributed
◉ Homoscedasticity - the errors should have constant variance across the fitted values
◉ Independence - the errors associated with one observation should not be correlated with the errors of any other observation
3.8 Outlier, Leverage & Influence
A single observation that is significantly different from all other observations can make a large
difference in the results of your regression analysis. If a single observation (or small group of
observations) substantially changes your results, you would want to know about this and investigate
further. There are three ways that an observation can be unusual.
Outliers: In linear regression, an outlier is an observation with large residual. In other words, it is an
observation whose dependent-variable value is unusual given its values on the predictor variables. An
outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.
Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage measures how far an observation deviates from the mean of that variable. These leverage points can affect the estimates of the regression coefficients.
Influence: An observation is influential if removing it would substantially change the estimated coefficients. Influence can be thought of as a combination of leverage and outlierness, and is commonly measured with Cook's distance.
# Cook's distance: any point whose Cook's distance exceeds 4/(n-k-1) is an influential
# point, where n is the number of observations and k is the number of predictors
train$d1=cooks.distance(fit)
train$ol=ifelse(train$d1>4/69,TRUE,FALSE)
Note: An influential point may also impact the model in a positive manner, so we should always cross-check the R-squared value after removing an influential point.
Outliers & Leverage
# assessing outliers and leverage
plot(fit)
# any point that lies very far from the rest of the observations requires our attention
Assumptions in Regression
# checking assumptions
par(mfrow=c(2,2))
plot(fit)
# any failure of the assumptions will be visible in these plots
3.9 Examples
[Figure: example diagnostic plots, e.g. a residual plot where homoscedasticity has failed.]
Scoring on the Test Dataset
The last step is to compute residuals on the test dataset and create residual plots to see whether the assumptions hold on the test dataset as well.
4 Transformation in Regression
When the assumptions fail.
4.1 Windmill Data
The output of a windmill depends on the wind speed. The table below documents this data for a windmill on different days. A model was built to predict DC output from wind velocity.

Sno | Wind Velocity | DC Output
1 | 5 | 1.582
2 | 6 | 1.822
3 | 3.4 | 1.057
4 | 2.7 | 0.5
5 | 10 | 2.236
. | . | .
For example, in this scenario, instead of taking wind velocity as the independent variable, we transform it in the following way:
vel = 1 / Wind Velocity
where vel is a new variable created by transforming the original independent variable, Wind Velocity:

Sno | Wind.Velocity | DC.Output | vel
1 | 5 | 1.582 | 0.2
2 | 6 | 1.822 | 0.16
3 | 3.4 | 1.057 | 0.29
4 | 2.7 | 0.5 | 0.37
5 | 10 | 2.236 | 0.1

There is no set rule for what to transform and how to transform it. But generally we try to transform the dependent variable.
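The effect of the transformation can be seen on the five rows shown above: DC output is almost perfectly linear in vel = 1/velocity, with an inverse relationship. A quick Python check:

```python
# windmill rows from the 4.1 table
wind = [5, 6, 3.4, 2.7, 10]
dc = [1.582, 1.822, 1.057, 0.5, 2.236]
vel = [1 / v for v in wind]     # the transformed predictor

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r = pearson_r(vel, dc)   # strongly negative: output is linear in 1/velocity
```

After the transformation, an ordinary linear regression of DC output on vel fits nearly perfectly, which is exactly why the transformation is worthwhile here.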
Thanks!
Any questions ?
You can contact me at
◉ 8106105105
◉ amitendra.kumar@theanalytica.com