
UNIT-IV

BADM

• Retail Store / Grocery store:
The number of items sold in a day is positively correlated with the number of customers who visited the store.

• Finance / Banking:
The profit of a financial institution is negatively correlated with the number of bad loans.
• Education:
The number of students enrolled in a university is correlated with the number of professors and the resources available.

• Defense:
The number of troops/cops assigned to a particular area is correlated with the number of criminal activities in the area.
• Healthcare:
The more patients there are with a certain disease or symptom, the more new clinical trials are run related to that disease.

• The fuel consumed by a car is correlated with the number of miles travelled.
• Height & weight

• Income & expenditure

• Price and demand

• Volume and pressure of a perfect gas
Correlation

• Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.

• Two variables are said to be correlated if a change in one variable affects a change in the other variable; the relation between them is known as correlation.

• If two variables vary together, they are said to be correlated.
• Positive Correlation:
• Two variables are said to be positively correlated if they deviate in the same direction.
• e.g. height & weight, income & expenditure
• Negative Correlation:
• Two variables are said to be negatively correlated if they deviate in opposite directions.
• e.g. volume and pressure of a perfect gas, price and demand
• Un-correlation:
• Two variables are said to be uncorrelated, or statistically independent, if there is no relation between them.
• e.g. intelligence level and the amount of tea drunk
Karl Pearson’s coefficient of correlation
• The correlation coefficient between two random variables X and Y, usually denoted by r(X, Y), is defined as

$$r(X, Y) = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}$$
Note:
• If r=+1 then correlation is perfectly positive,

• If r=-1 then correlation is perfectly negative,

• If r=0 then variables are uncorrelated.
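
As a quick check of the formula above, here is a minimal sketch that computes r with NumPy; the x and y arrays are hypothetical and only illustrate the calculation.

```python
import numpy as np

# Hypothetical paired data, e.g. customers visiting a store (x) and items sold (y)
x = np.array([20.0, 23.0, 31.0, 35.0, 42.0])
y = np.array([58.0, 66.0, 90.0, 99.0, 123.0])
n = len(x)

# Karl Pearson's formula as given above
r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2))

print(round(r, 4))               # close to +1: strong positive correlation
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in computation agrees
```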

$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

$$r = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}}$$
REGRESSION ANALYSIS

Example of MLR
Suppose I have data on my monthly spending, monthly income and the number of trips per month for the last three years. Now I need to answer the following questions:
 What will be my monthly spending next year?
 Which factor (monthly income or number of trips per month) is more important in deciding my monthly spending?
 How are monthly income and trips per month correlated with monthly spending?
Example of MLR
• In the credit card industry, a financial company may be interested in minimizing its risk portfolio and wants to understand the top five factors that cause a customer to default.

• Based on the results, the company could implement specific EMI options so as to minimize default among risky customers.
Example of MLR
A company wanted to be able to estimate or predict how much fuel they needed to transport building materials to their oil wells so that they could line them with concrete. The data provided was:
• Number of wells
• Depth of wells
• Distance to wells
• Weight of materials
• Tonne kilometres
• Fuel costs
Example of MLR

The selling price of a house can depend on:
• the desirability of the location,
• the number of bedrooms,
• the number of bathrooms,
• the year the house was built,
• the square footage of the lot and
• a number of other factors.
Example of MLR

• Predicting gross movie revenue.

• The success or failure of a movie can depend on many factors: star power, release date, critics' reviews, budget, rating, plot and highly unpredictable human reactions.

• Predicted revenues can be used for planning both the production and distribution stages.
Regression

• Regression can be defined as a method to estimate the value of one variable when that of the other is known, provided the variables are correlated.
Example
• The table below shows some data from the early days of the Italian clothing company Benetton.

• Each row in the table shows Benetton's sales for a year and the amount spent on advertising that year.

• In this case, our outcome of interest is sales: it is what we want to predict.
Simple Regression Analysis
The General Idea
Simple regression considers the relation between a single explanatory variable and a response variable.
In math courses, the slope-intercept form of the equation of a line often takes the form

y = mx + b

where
m = slope of the line
b = y-intercept of the line
The estimated regression line is

$$\hat{y} = b_0 + b_1 x$$

with slope and intercept

$$b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}, \qquad b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n}$$
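
A minimal sketch, assuming hypothetical (x, y) data, that applies the slope and intercept formulas above directly:

```python
import numpy as np

# Hypothetical data, e.g. R&D spend (x) and annual profit (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n = len(x)

# Slope and intercept from the formulas above
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b0 = np.sum(y) / n - b1 * np.sum(x) / n

print(b0, b1)
y_hat = b0 + b1 * x   # fitted values on the regression line
```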
Regression Nomenclature

Dependent Variable   | Independent Variable
Explained Variable   | Explanatory Variable
Regressand           | Regressor
Predictand           | Predictor
Endogenous Variable  | Exogenous Variable
Controlled Variable  | Control Variable
Target Variable      | Stimulus Variable
Response Variable    | Explanatory Variable
The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.

• A positive sign indicates that as the predictor variable increases, the response variable also increases.

• A negative sign indicates that as the predictor variable increases, the response variable decreases.
Classwork Problems

The relationship between money spent on research and development and a chemical firm's annual profit for the preceding 6 years is given below:
1. Use these data to develop a regression model to predict annual profit.
2. Plot the scatter plot.
3. Which type of relationship is it?
4. Predict annual profit when money spent on research and development is $8 million.
$$b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}$$
$$b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n}$$
Classwork Problems

Q.01 Suppose a study is conducted using only Boeing 737s traveling 500 miles on comparable routes during the same season of the year. Can the number of passengers predict the cost of flying such routes? Suppose the data displayed in the table are the costs and associated numbers of passengers for twelve 500-mile commercial airline flights using Boeing 737s during the same season of the year.
1. Use these data to develop a regression model to predict cost by number of passengers.
2. Plot the scatter plot.
3. Which type of relationship is it?
4. Predict the cost of flying when the number of passengers is 110.
Residuals
o The residual, or error, of the regression model is the difference between the observed y value and the predicted value. Each data point has one residual.

Residual = Observed value - Predicted value

$$\text{Residual} = y - \hat{y}$$

o Because a linear regression model is not always appropriate for the data, you should assess the appropriateness of the model by defining residuals and examining residual plots.

o Both the sum and the mean of the residuals are equal to zero.
RESIDUAL SUM OF SQUARES (RSS)

• The residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared errors of prediction (SSE), is the sum of the squares of the residuals:

$$RSS = SSR = SSE = \sum \varepsilon^2 = \sum (y - \hat{y})^2$$

• Ideally, the sum of squared residuals should be a small value (close to 0) in any regression model.
The Standard Error of Estimate

The standard error of estimate measures the variability, or scatter, of the observed values around the regression line:

$$s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{RSS}{n - 2}}$$
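
Continuing the hypothetical fit from the earlier sketch, this computes the residuals, the RSS, and the standard error of estimate exactly as defined above:

```python
import numpy as np

# Same hypothetical data and fit as in the earlier sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n = len(x)
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b0 = np.sum(y) / n - b1 * np.sum(x) / n
y_hat = b0 + b1 * x

residuals = y - y_hat                # observed minus predicted
rss = np.sum(residuals**2)           # RSS = SSE = sum of squared residuals
se = np.sqrt(rss / (n - 2))          # standard error of estimate

print(round(np.sum(residuals), 10))  # sum of residuals is zero (up to rounding)
print(rss, se)
```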
Interpreting the Standard Error of Estimate
• We shall use the standard error of estimate as a tool in the same way that we can use the standard deviation.

• The larger the standard error of estimate, the greater the scattering (or dispersion) of points around the regression line.

• The smaller the standard error of estimate, the smaller the scattering (or dispersion) of points around the regression line.

• If se = 0, we expect the estimating equation to be a "perfect" estimator of the dependent variable.
The Coefficient of Determination: R-Square Value

• The coefficient of determination measures the strength of the association that exists between two variables, X and Y.

• The coefficient of determination can be calculated by the following formula:

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$$
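
A small sketch of this formula, using hypothetical observed and fitted values:

```python
import numpy as np

# Hypothetical observed values and fitted values from a regression
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
y_hat = np.array([2.03, 4.05, 6.08, 8.10, 10.13, 12.15])

sse = np.sum((y - y_hat)**2)         # unexplained variation
sst = np.sum((y - np.mean(y))**2)    # total variation about the mean
r_squared = 1 - sse / sst

print(round(r_squared, 4))           # near 1: the line fits the data well
```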
The Coefficient of Determination: R-Square Value

• The coefficient of determination ranges from 0 to 1.

• An R² close to 1 indicates a strong correlation between X and Y,

• whereas an R² near 0 means that there is little correlation between these two variables.

• In general, the higher the R-squared (close to 1, or close to 100%), the better the model fits your data.
Which model is better?

R² = 15% vs. R² = 85%

Which model is better?

R² = 38% vs. R² = 87.4%
Classwork Problems

For the research and development problem:

1. Calculate RSS and the standard error of estimate.
2. Analyze the residuals.
3. Calculate the coefficient of determination, or R-square.
4. Comment on the value of R-square.
$$b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}, \qquad b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n}$$
Classwork Problems

1. Calculate RSS and the standard error of estimate.
2. Analyze the residuals.
3. Calculate the coefficient of determination, or R-square.
4. Comment on the value of R-square.
Measures of Variation
Total variation is made up of two parts:

$$SST = SSR + SSE$$

Total Sum of Squares: $SST = \sum (Y_i - \bar{Y})^2$
Regression Sum of Squares: $SSR = \sum (\hat{Y}_i - \bar{Y})^2$
Error Sum of Squares: $SSE = \sum (Y_i - \hat{Y}_i)^2$

where:
$\bar{Y}$ = mean value of the dependent variable
$Y_i$ = observed value of the dependent variable
$\hat{Y}_i$ = predicted value of Y for the given $X_i$ value
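
The identity SST = SSR + SSE holds for any least-squares fit with an intercept; the sketch below verifies it numerically on made-up data (np.polyfit is used only to obtain the least-squares line):

```python
import numpy as np

# Hypothetical data; fit a least-squares line with an intercept
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
b1, b0 = np.polyfit(x, y, 1)           # returns slope, then intercept
y_hat = b0 + b1 * x

sst = np.sum((y - np.mean(y))**2)      # total sum of squares
ssr = np.sum((y_hat - np.mean(y))**2)  # regression sum of squares
sse = np.sum((y - y_hat)**2)           # error sum of squares

print(round(sst, 6), round(ssr + sse, 6))  # the two totals agree
```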
Measures of Variation (continued)

[Figure: at each observation, the total deviation of Yi from the mean Ȳ splits along the fitted line: SST = Σ(Yi − Ȳ)² decomposes into SSE = Σ(Yi − Ŷi)², the scatter of the points about the line, plus SSR = Σ(Ŷi − Ȳ)², the spread of the fitted values about the mean.]
Assumptions of Linear Regression

o Assumption 1: Linearity
o Assumption 2: Variables are measured without error (reliably)
o Assumption 3: Outliers/influential cases
o Assumption 4: Autocorrelation
o Assumption 5: Multicollinearity
o Assumption 6: Homoscedasticity
o Assumption 7: Residuals should be normally distributed
Assumption 1: Linearity

o First, linear regression requires the relationship between the independent and dependent variables to be linear.

o The linearity assumption can best be tested with scatterplots.
o If data are given in pairs, then the scatter diagram of the data is just the points plotted on the xy-plane.

o The scatter plot is used to visually identify relationships between the first and the second entries of paired data.
o The scatter plot above represents the age vs. size of a plant.
o It is clear from the scatter plot that as the plant ages, its size tends to increase.
o If the points follow a linear pattern well, then we say that there is a high linear correlation, while if the data do not follow a linear pattern, we say that there is no linear correlation.
o If the data somewhat follow a linear path, then we say that there is a moderate linear correlation.
o Given a scatter plot, we can draw the line that best fits the data.
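
A scatter plot is the quickest linearity check; here is a minimal matplotlib sketch using hypothetical plant age/size numbers in the spirit of the example above:

```python
import matplotlib.pyplot as plt

# Hypothetical paired data: plant age vs. plant size
age = [1, 2, 3, 4, 5, 6, 7, 8]
size = [2.0, 3.1, 4.3, 4.9, 6.2, 7.0, 8.4, 9.1]

plt.scatter(age, size)          # each pair becomes one point on the xy-plane
plt.xlabel("Age of plant")
plt.ylabel("Size of plant")
plt.title("Scatter plot: eyeball whether the pattern looks linear")
plt.show()
```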
o Types of Regression Models:

Assumption 2: Variables are measured without error
(reliably):

Using Residuals to Test the Assumptions of the Regression Model

o One of the major uses of residual analysis is to test some of the assumptions underlying regression. The following are the assumptions of simple regression analysis:

1. The model is linear. (Assumption 1)

2. The error terms have constant variances. (Assumption 6)

3. The error terms are independent.

4. The error terms are normally distributed. (Assumption 7)
Residual Plot

• A particular method for studying the behavior of residuals is the residual plot.

• The residual plot is a type of graph in which the residuals for a particular regression model are plotted along with their associated values of x, as ordered pairs (x, residual).
• A residual plot is a graph in which the residuals are on the vertical axis and the independent variable is on the horizontal axis.
• If the dots are randomly dispersed around the horizontal axis, then a linear regression model is appropriate for the data; otherwise, choose a non-linear model.
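
A residual plot is just the residuals scattered against x with a reference line at zero; a minimal sketch with hypothetical residuals:

```python
import matplotlib.pyplot as plt

# Hypothetical x values and the residuals from some fitted model
x = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.2, -0.3, 0.1, 0.4, -0.2, -0.1, 0.3, -0.4]

plt.scatter(x, residuals)   # ordered pairs (x, residual)
plt.axhline(0)              # horizontal reference line at residual = 0
plt.xlabel("x")
plt.ylabel("Residual")
plt.title("Random scatter around zero suggests a linear model is appropriate")
plt.show()
```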
• The following example shows a few patterns in residual plots.

• In the first case, the dots are randomly dispersed, so a linear regression model is preferred.

• In the second and third cases, the dots are non-randomly dispersed, which suggests that a non-linear regression method is preferred.
Clearly the dots are randomly dispersed, so a linear regression model is preferred.
Independence of Errors
Error values are statistically independent.
If your points follow a clear pattern, it might indicate that the errors are influencing each other. The following image shows two linear regression lines; on the left, the points are scattered randomly, while on the right, the points are clearly influencing each other.
Assumption 3: Outliers/influential cases

Example:
Let us consider a dataset where
• y = foot length (cm) and
• x = height (in)
for n = 33 male students in a statistics class.
A scatterplot of the male foot length and height data shows one point labeled as an outlier.
• An outlier is a data point which is, in some way, very far from the rest of the data.
• An outlier is an observation that does not fit the rest of the data.
• It is sometimes called an extreme value.
• When you graph an outlier, it will appear not to fit the pattern of the graph.
• Some outliers are due to mistakes (for example, writing down 50 instead of 500), while others may indicate that something unusual is happening.
Most common causes of outliers in a data set:
• Data entry errors (human errors)
• Measurement errors (instrument errors)
• Experimental errors (data extraction or experiment
planning/executing errors)
• Intentional (dummy outliers made to test detection methods)
• Data processing errors (data manipulation or data set unintended
mutations)
• Sampling errors (extracting or mixing data from wrong or various
sources)

Residual Plots to Detect Multivariate Outliers

• As another example, the regression model for Yield as a function of Concentration is significant, but note that the line of fit appears to be tilted towards the outlier.

• We can see the effect of this outlier in the residual-by-predicted plot.
Box Plot Diagram to Identify Outliers

• A box plot, also termed a box-and-whisker plot, is a graphical method typically depicted by quartiles and inter-quartiles that helps in defining the upper limit and lower limit beyond which any data lying will be considered outliers.

• The very purpose of this diagram is to identify outliers and discard them from the data series before making any further observations, so that the conclusions made from the study give more accurate results, not influenced by any extreme or abnormal values.

• Box plots can be used as an initial screening tool for outliers, as they provide a graphical depiction of data distribution and extreme values.
Example:

• The box part of the chart is as described above, except that the mean is shown as an ×.
• The whiskers extend up from the top of the box to the largest data element that is less than or equal to 1.5 times the interquartile range (IQR) above it, and down from the bottom of the box to the smallest data element that is larger than 1.5 times the IQR below it.
• Values outside this range are considered to be outliers and are represented by dots.
• The boundaries of the box and whiskers are calculated by the values and formulas shown in the figure.
• The only outlier is the value 1850 for Brand B, which is higher than the upper whisker, and so is shown as a dot.
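
The whisker rule above translates into simple 1.5 × IQR fences. A sketch, assuming a made-up sample that includes the 1850 value mentioned in the example:

```python
import numpy as np

# Hypothetical sample containing one suspiciously large value (1850)
data = np.array([1200, 1250, 1300, 1320, 1350, 1400, 1450, 1850])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # bottom whisker boundary
upper_fence = q3 + 1.5 * iqr   # top whisker boundary

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)                # only 1850 lies beyond the upper fence
```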
Assumption 4: Autocorrelation

• Autocorrelation is correlation between successive observations of the same variable.

• Economic activities of the past often have a strong effect on present and future economic activities.

• Students from the same class might perform more similarly to each other than students from different classes.
• Linear regression analysis requires that there is little or no autocorrelation in the data.

• Typically autocorrelation occurs in stock prices, where the price is not independent of the previous price.
How to check:
• Look for the Durbin-Watson (DW) statistic.
• It must lie between 0 and 4.
• DW = 2 implies no autocorrelation,
• 0 < DW < 2 implies positive autocorrelation,
• while 2 < DW < 4 indicates negative autocorrelation.
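
The DW statistic can be computed directly from the residuals as DW = Σ(e_t − e_{t−1})² / Σe_t²; a minimal sketch with hypothetical residuals in time order:

```python
import numpy as np

# Hypothetical regression residuals, in time order
e = np.array([0.5, 0.4, 0.6, -0.3, -0.5, -0.2, 0.1, 0.3])

dw = np.sum(np.diff(e)**2) / np.sum(e**2)
print(round(dw, 3))   # near 2: little autocorrelation; below 2: positive; above 2: negative
```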
Remedial measure for autocorrelation:
• To remove the problem of autocorrelation from the data, we have to transform the original data and then apply the OLS technique to estimate the parameters.
Assumption 5: Multicollinearity

• You may be looking at contributions to town charity organizations using a model that includes the population of the town and the total gross income of the town.

[Model diagram: contributions to town charity organizations, predicted by the population of the town and the total gross income of the town.]
• You identify that these variables are highly correlated because the population of the town is a direct contributor to the total gross income of the town.
• In a case like this, you should restructure your model to avoid regressing on two variables that are causally related.
• You could do this by either omitting one of these variables or combining them into a single ratio variable such as per capita income.
[Model diagram: contributions to town charity organizations, predicted by per capita income.]
• For example, you may be looking at customer loyalty to a shop using a model that includes several different measures of satisfaction.

[Model diagram: customer loyalty to a shop, predicted by satisfaction with quality of product and satisfaction with the network.]
• You identify that two of these measures of satisfaction (satisfaction with quality of product and satisfaction with the network) are highly correlated, and determine that it is because customers don't tend to distinguish their satisfaction in that way. Rather, both measures of satisfaction are really a reflection of the same measure of overall satisfaction.
• In this case, you could simply use overall satisfaction as a predictor variable instead of the separate measures of satisfaction.
[Model diagram: customer loyalty to a shop, predicted by overall satisfaction.]
• Multicollinearity occurs when independent variables in a regression model are correlated.

• This correlation is a problem because independent variables should be independent.

• If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

• For example: height and weight, household income and water consumption, mileage and price of a car, study time and leisure time, etc.
• Multicollinearity is a statistical phenomenon in which there exists a perfect or exact relationship between the predictor variables.

• When there is a perfect or exact relationship between the predictor variables, it is difficult to come up with reliable estimates of their individual coefficients.

• It will result in incorrect conclusions about the relationship between the outcome variable and the predictor variables.
• Remedial measure: drop one or several predictor variables in order to lessen the multicollinearity.
How to detect Multicollinearity?

• There is a very simple test to assess multicollinearity in your regression model. The variance inflation factor (VIF) identifies correlation between independent variables and the strength of that correlation.
The variance inflation factor for the jth predictor is:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the R²-value obtained by regressing the jth predictor on the remaining predictors.
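
A sketch of the VIF computation under the definition above: each predictor is regressed on the remaining predictors (here via NumPy least squares) and 1/(1 − R²_j) is reported. The data are synthetic, with x2 built to be nearly collinear with x1 so that both show inflated VIFs.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = observations)."""
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]                                # jth predictor as the response
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        y_hat = A @ coef
        r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
        factors.append(1.0 / (1.0 - r2))
    return factors

# Synthetic predictors: x2 is almost an exact multiple of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2 * x1 + rng.normal(scale=0.1, size=50)
x3 = rng.normal(size=50)
print(vif(np.column_stack([x1, x2, x3])))  # large VIFs for x1 and x2, near 1 for x3
```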
Assumption 6: Homoscedasticity

• Homoscedasticity means equal scatter or same variance.

• Heteroscedasticity means unequal scatter.

• This assumption means that the variance around the regression line is the same for all values of the predictor variable (X).

The plot shows a violation of this assumption.
• Homoscedasticity can also be tested using a scatter plot of residuals vs. fitted values.

• If heteroscedasticity exists, the plot will exhibit a funnel-shape pattern. For example:
Residual Analysis for Equal Variance

[Figure: two pairs of plots of Y vs. x and residuals vs. x. A fan-shaped spread of residuals indicates non-constant variance; an even band around zero indicates constant variance.]
Assumption 7: Assumption of Normality

• The values of the residuals are normally distributed.

• This assumption can be tested by looking at the distribution of the residuals.

• We can do this by checking the histogram and the normal probability plot.

• However, unless the residuals are far from normal or have an obvious pattern, we generally don't need to be overly concerned about normality.
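
A minimal sketch of both checks, using SciPy's probplot for the normal probability plot; the residuals are hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical residuals from a fitted regression model
residuals = np.array([0.2, -0.3, 0.1, 0.4, -0.2, -0.1, 0.3, -0.4, 0.0, 0.2])

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(residuals)                  # histogram: look for a rough bell shape
ax1.set_title("Histogram of residuals")
stats.probplot(residuals, plot=ax2)  # normal probability (Q-Q) plot
plt.show()
```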
To determine which predictors are significant, a t test is performed for the individual significance of each explanatory variable.
Inferences About the Slope
The standard error of the regression slope coefficient (b1) is estimated by

$$S_{b_1} = \frac{S_e}{\sqrt{\sum (X_i - \bar{X})^2}}$$

where:
$S_{b_1}$ = estimate of the standard error of the slope
$S_e = \sqrt{\dfrac{SSE}{n-2}}$ = standard error of the estimate
t Test
Is there a linear relationship between X and Y?
Null and alternative hypotheses:
◦ H0: β1 = 0 (no linear relationship)
◦ H1: β1 ≠ 0 (a linear relationship does exist)
Test statistic:

$$t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad d.f. = n - 2$$

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
If the p-value of your regression estimate is less than 0.05 (or 5%), then you can conclude that, in the population from which the sample is drawn, there is a true, non-zero relationship between Y and X, that the estimate is trustworthy for the population, and that the predictor variables are statistically significant.
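
A sketch of the t test computation, using the slope and standard error from the house-price example that follows; SciPy supplies the two-tailed p-value:

```python
from scipy import stats

# Slope, its standard error, and sample size from the house-price example below
b1, sb1, n = 0.10977, 0.03297, 10

t_stat = (b1 - 0) / sb1                    # test H0: beta1 = 0
df = n - 2
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p-value
print(round(t_stat, 4), round(p_value, 5)) # 3.3294, ~0.0104: reject H0 at the 5% level
```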
t Test Example

House Price in $1000s (y) | Square Feet (x)
245 | 1400
312 | 1600
279 | 1700
308 | 1875
199 | 1100
219 | 1550
405 | 2350
324 | 2450
319 | 1425
255 | 1700

Estimated regression equation:
house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098.
Is there a relationship between the square footage of the house and its sales price?
t Test Example

H0: β1 = 0, H1: β1 ≠ 0

From the Excel output:

              Coefficients | Standard Error | t Stat  | P-value
Intercept     98.24833     | 58.03348       | 1.69296 | 0.12892
Square Feet   0.10977      | 0.03297        | 3.32938 | 0.01039

Here b1 = 0.10977 and Sb1 = 0.03297, so

$$t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938$$
t Test Example

H0: β1 = 0
H1: β1 ≠ 0

Test statistic: tSTAT = 3.329
d.f. = 10 - 2 = 8
α/2 = 0.025 in each tail, so the critical values are ±2.3060.

Decision: since tSTAT = 3.329 > 2.3060, it falls in the rejection region, so reject H0.

Conclusion: there is sufficient evidence that square footage affects house price.
• It is common in regression analysis to compute an F test to determine the overall significance of the model.

Null hypothesis:
• H0: all the coefficients = 0
• This implies that none of the explanatory variables are significant predictors of the response variable,
• i.e. the model is not significant.

Alternative hypothesis:
• HA: at least one coefficient is not 0
• This implies that at least one of the explanatory variables is a significant predictor of the response variable,
• i.e. the model is significant.
• If the p-value from the F test is less than the level of significance (LOS), we should continue the analysis: the null hypothesis is rejected.

• But if the p-value is greater than the LOS, then there is no evidence to indicate that any of the explanatory variables are significant predictors of the response variable, and therefore there would be no need to continue to the next step: the null hypothesis is not rejected.
F Test for Significance

$$F_{STAT} = \frac{MSR}{MSE}, \qquad MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$$

where FSTAT follows an F distribution with k numerator and (n - k - 1) denominator degrees of freedom

(k = the number of independent variables in the regression model)
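
A sketch of the F test computation, using the sums of squares from the Excel output below (one predictor, ten observations):

```python
from scipy import stats

# Sums of squares from the ANOVA table in the Excel output below
ssr, sse = 18934.9348, 13665.5652
k, n = 1, 10                    # one independent variable, ten observations

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(round(f_stat, 4), round(p_value, 5))  # 11.0848, ~0.0104: model is significant
```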
F Test for Significance: Excel Output

Regression Statistics
Multiple R         0.76211
R Square           0.58082
Adjusted R Square  0.52842
Standard Error     41.33032
Observations       10

$$F_{STAT} = \frac{MSR}{MSE} = \frac{18934.9348}{1708.1957} = 11.0848$$

With 1 and 8 degrees of freedom, the p-value for the F test is 0.01039.

ANOVA
             df | SS         | MS         | F       | Significance F
Regression    1 | 18934.9348 | 18934.9348 | 11.0848 | 0.01039
Residual      8 | 13665.5652 | 1708.1957  |         |
Total         9 | 32600.5000 |            |         |
F Test for Significance (continued)

H0: β1 = 0
H1: β1 ≠ 0
α = 0.05, df1 = 1, df2 = 8

Critical value: F0.05 = 5.32

Test statistic: FSTAT = MSR/MSE = 11.08

Decision: since FSTAT = 11.08 > 5.32, reject H0 at α = 0.05.

Conclusion: there is sufficient evidence that house size affects selling price.
