
UNIT-IV

BADM

• Retail Store / Grocery store:
The number of items sold in a day is positively correlated with the number of customers who visited the store.

• Finance / Banking:
The profit of a financial institution is negatively correlated with the number of bad loans.
• Education:
The number of students enrolled in a university is correlated with the number of professors and the resources available.

• Defense:
The number of troops/cops assigned to a particular area is correlated with the number of criminal activities in the area.
• Healthcare:
The more patients there are with a certain disease or symptom, the more new clinical trials are run related to that disease.

• The fuel consumed by a car is correlated with the number of miles travelled.
• Height & weight

• Income & expenditure

• Price and demand

• Volume and pressure of a perfect gas
Correlation

• Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.

• Two variables are said to be correlated if a change in one variable affects a change in the other variable; the relation between them is known as correlation.

• If two variables vary together, they are said to be correlated.
• Positive Correlation:
• Two variables are said to be positively correlated if they deviate in the same direction.
• e.g. height & weight, income & expenditure
• Negative Correlation:
• Two variables are said to be negatively correlated if they deviate in opposite directions.
• e.g. volume and pressure of a perfect gas, price and demand
• Un-correlation:
• Two variables are said to be uncorrelated, or statistically independent, if there is no relation between them.
• e.g. intelligence level and the amount of tea drunk
Karl Pearson’s coefficient of correlation
• The correlation coefficient between two random variables X and Y, usually denoted by r(X, Y), is defined as

$$r(X, Y) = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}$$
Note:
• If r=+1 then correlation is perfectly positive,

• If r=-1 then correlation is perfectly negative,

• If r=0 then variables are uncorrelated.
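
As a quick check of the formula above, here is a minimal sketch that computes r with NumPy; the x and y arrays are hypothetical and only illustrate the calculation.

```python
import numpy as np

# Hypothetical paired data, e.g. customers visiting a store (x) and items sold (y)
x = np.array([20.0, 23.0, 31.0, 35.0, 42.0])
y = np.array([58.0, 66.0, 90.0, 99.0, 123.0])
n = len(x)

# Karl Pearson's formula as given above
r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2))

print(round(r, 4))               # close to +1: strong positive correlation
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in computation agrees
```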

$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

$$r = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}}$$
REGRESSION ANALYSIS

Example of MLR
Suppose I have data on my monthly spending, monthly income and the number of trips per month for the last three years. Now I need to answer the following questions:
 What will be my monthly spending next year?
 Which factor (monthly income or number of trips per month) is more important in deciding my monthly spending?
 How are monthly income and trips per month correlated with monthly spending?
Example of MLR
• In the credit card industry, a financial company may be interested in minimizing its risk portfolio and wants to understand the top five factors that cause a customer to default.

• Based on the results, the company could implement specific EMI options so as to minimize default among risky customers.
Example of MLR
A company wanted to be able to estimate or predict how much fuel they needed to transport building materials to their oil wells so that they could line them with concrete. The data provided was:
• Number of wells
• Depth of wells
• Distance to wells
• Weight of materials
• Tonne kilometres
• Fuel costs
Example of MLR

The selling price of a house can depend on:
• the desirability of the location,
• the number of bedrooms,
• the number of bathrooms,
• the year the house was built,
• the square footage of the lot and
• a number of other factors.
Example of MLR

• Predicting gross movie revenue.

• The success or failure of a movie can depend on many factors: star power, release date, critics' reviews, budget, rating, plot and highly unpredictable human reactions.

• Predicted revenues can be used for planning both the production and distribution stages.
Regression

• Regression can be defined as a method to estimate the value of one variable when that of the other is known, provided the variables are correlated.
Example
• The table below shows some data from the early days of the Italian clothing company Benetton.

• Each row in the table shows Benetton's sales for a year and the amount spent on advertising that year.

• In this case, our outcome of interest is sales: it is what we want to predict.
Simple Regression Analysis
The General Idea
Simple regression considers the relation between a single explanatory variable and a response variable.
In math courses, the slope-intercept form of the equation of a line often takes the form

y = mx + b

where
m = slope of the line
b = y-intercept of the line
The estimated regression line is

$$\hat{y} = b_0 + b_1 x$$

with slope and intercept

$$b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}, \qquad b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n}$$
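
A minimal sketch, assuming hypothetical (x, y) data, that applies the slope and intercept formulas above directly:

```python
import numpy as np

# Hypothetical data, e.g. R&D spend (x) and annual profit (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n = len(x)

# Slope and intercept from the formulas above
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b0 = np.sum(y) / n - b1 * np.sum(x) / n

print(b0, b1)
y_hat = b0 + b1 * x   # fitted values on the regression line
```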
Regression Nomenclature

Dependent Variable   | Independent Variable
Explained Variable   | Explanatory Variable
Regressand           | Regressor
Predictand           | Predictor
Endogenous Variable  | Exogenous Variable
Controlled Variable  | Control Variable
Target Variable      | Stimulus Variable
Response Variable    | Explanatory Variable
The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.

• A positive sign indicates that as the predictor variable increases, the response variable also increases.

• A negative sign indicates that as the predictor variable increases, the response variable decreases.
Classwork Problems

The relationship between money spent on research and development and a chemical firm's annual profit for the preceding 6 years is given below:
1. Use these data to develop a regression model to predict annual profit.
2. Plot the scatter plot.
3. Which type of relationship is it?
4. Predict annual profit when money spent on research and development is $8 million.
$$b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}$$
$$b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n}$$
Classwork Problems

Q.01 Suppose a study is conducted using only Boeing 737s traveling 500 miles on comparable routes during the same season of the year. Can the number of passengers predict the cost of flying such routes? Suppose the data displayed in the table are the costs and associated numbers of passengers for twelve 500-mile commercial airline flights using Boeing 737s during the same season of the year.
1. Use these data to develop a regression model to predict cost by number of passengers.
2. Plot the scatter plot.
3. Which type of relationship is it?
4. Predict the cost of flying when the number of passengers is 110.
Residuals
o The residual, or error, of the regression model is the difference between the observed y value and the predicted value. Each data point has one residual.

Residual = Observed value - Predicted value

$$\text{Residual} = y - \hat{y}$$

o Because a linear regression model is not always appropriate for the data, you should assess the appropriateness of the model by defining residuals and examining residual plots.

o Both the sum and the mean of the residuals are equal to zero.
RESIDUAL SUM OF SQUARES (RSS)

• The residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared errors of prediction (SSE), is the sum of the squares of the residuals:

$$RSS = SSR = SSE = \sum \varepsilon^2 = \sum (y - \hat{y})^2$$

• Ideally, the sum of squared residuals should be a small value (close to 0) in any regression model.
The Standard Error of Estimate

The standard error of estimate measures the variability, or scatter, of the observed values around the regression line:

$$s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{RSS}{n - 2}}$$
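
Continuing the hypothetical fit from the earlier sketch, this computes the residuals, the RSS, and the standard error of estimate exactly as defined above:

```python
import numpy as np

# Same hypothetical data and fit as in the earlier sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n = len(x)
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b0 = np.sum(y) / n - b1 * np.sum(x) / n
y_hat = b0 + b1 * x

residuals = y - y_hat                # observed minus predicted
rss = np.sum(residuals**2)           # RSS = SSE = sum of squared residuals
se = np.sqrt(rss / (n - 2))          # standard error of estimate

print(round(np.sum(residuals), 10))  # sum of residuals is zero (up to rounding)
print(rss, se)
```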
Interpreting the Standard Error of Estimate
• We shall use the standard error of estimate as a tool in the same way that we can use the standard deviation.

• The larger the standard error of estimate, the greater the scattering (or dispersion) of points around the regression line.

• The smaller the standard error of estimate, the smaller the scattering (or dispersion) of points around the regression line.

• If se = 0, we expect the estimating equation to be a "perfect" estimator of the dependent variable.
The Coefficient of Determination: R-Square Value

• The coefficient of determination measures the strength of the association that exists between two variables, X and Y.

• The coefficient of determination can be calculated by the following formula:

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$$
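
A small sketch of this formula, using hypothetical observed and fitted values:

```python
import numpy as np

# Hypothetical observed values and fitted values from a regression
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
y_hat = np.array([2.03, 4.05, 6.08, 8.10, 10.13, 12.15])

sse = np.sum((y - y_hat)**2)         # unexplained variation
sst = np.sum((y - np.mean(y))**2)    # total variation about the mean
r_squared = 1 - sse / sst

print(round(r_squared, 4))           # near 1: the line fits the data well
```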
The Coefficient of Determination: R-Square Value

• The coefficient of determination ranges from 0 to 1.

• An R² close to 1 indicates a strong correlation between X and Y,

• whereas an R² near 0 means that there is little correlation between these two variables.

• In general, the higher the R-squared (close to 1, or close to 100%), the better the model fits your data.
Which model is better?

R² = 15% vs. R² = 85%

Which model is better?

R² = 38% vs. R² = 87.4%
Classwork Problems

For the research and development problem:

1. Calculate RSS and the standard error of estimate.
2. Analyze the residuals.
3. Calculate the coefficient of determination, or R-square.
4. Comment on the value of R-square.
$$b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}, \qquad b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n}$$
Classwork Problems

1. Calculate RSS and the standard error of estimate.
2. Analyze the residuals.
3. Calculate the coefficient of determination, or R-square.
4. Comment on the value of R-square.
Measures of Variation
Total variation is made up of two parts:

$$SST = SSR + SSE$$

Total Sum of Squares: $SST = \sum (Y_i - \bar{Y})^2$
Regression Sum of Squares: $SSR = \sum (\hat{Y}_i - \bar{Y})^2$
Error Sum of Squares: $SSE = \sum (Y_i - \hat{Y}_i)^2$

where:
$\bar{Y}$ = mean value of the dependent variable
$Y_i$ = observed value of the dependent variable
$\hat{Y}_i$ = predicted value of Y for the given $X_i$ value
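
The identity SST = SSR + SSE holds for any least-squares fit with an intercept; the sketch below verifies it numerically on made-up data (np.polyfit is used only to obtain the least-squares line):

```python
import numpy as np

# Hypothetical data; fit a least-squares line with an intercept
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
b1, b0 = np.polyfit(x, y, 1)           # returns slope, then intercept
y_hat = b0 + b1 * x

sst = np.sum((y - np.mean(y))**2)      # total sum of squares
ssr = np.sum((y_hat - np.mean(y))**2)  # regression sum of squares
sse = np.sum((y - y_hat)**2)           # error sum of squares

print(round(sst, 6), round(ssr + sse, 6))  # the two totals agree
```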
Measures of Variation (continued)

[Figure: at each observation, the total deviation of Yi from the mean Ȳ splits along the fitted line: SST = Σ(Yi − Ȳ)² decomposes into SSE = Σ(Yi − Ŷi)², the scatter of the points about the line, plus SSR = Σ(Ŷi − Ȳ)², the spread of the fitted values about the mean.]
Assumptions of Linear Regression

o Assumption 1: Linearity
o Assumption 2: Variables are measured without error (reliably)
o Assumption 3: Outliers/influential cases
o Assumption 4: Autocorrelation
o Assumption 5: Multicollinearity
o Assumption 6: Homoscedasticity
o Assumption 7: Residuals should be normally distributed
Assumption 1: Linearity

o First, linear regression requires the relationship between the independent and dependent variables to be linear.

o The linearity assumption can best be tested with scatterplots.
o If data are given in pairs, then the scatter diagram of the data is just the points plotted on the xy-plane.

o The scatter plot is used to visually identify relationships between the first and the second entries of paired data.
o The scatter plot above represents the age vs. size of a plant.
o It is clear from the scatter plot that as the plant ages, its size tends to increase.
o If the points follow a linear pattern well, then we say that there is a high linear correlation, while if the data do not follow a linear pattern, we say that there is no linear correlation.
o If the data somewhat follow a linear path, then we say that there is a moderate linear correlation.
o Given a scatter plot, we can draw the line that best fits the data.
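
A scatter plot is the quickest linearity check; here is a minimal matplotlib sketch using hypothetical plant age/size numbers in the spirit of the example above:

```python
import matplotlib.pyplot as plt

# Hypothetical paired data: plant age vs. plant size
age = [1, 2, 3, 4, 5, 6, 7, 8]
size = [2.0, 3.1, 4.3, 4.9, 6.2, 7.0, 8.4, 9.1]

plt.scatter(age, size)          # each pair becomes one point on the xy-plane
plt.xlabel("Age of plant")
plt.ylabel("Size of plant")
plt.title("Scatter plot: eyeball whether the pattern looks linear")
plt.show()
```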
o Types of Regression Models:

Assumption 2: Variables are measured without error
(reliably):

Using Residuals to Test the Assumptions of the Regression Model

o One of the major uses of residual analysis is to test some of the assumptions underlying regression. The following are the assumptions of simple regression analysis:

1. The model is linear. (Assumption 1)

2. The error terms have constant variances. (Assumption 6)

3. The error terms are independent.

4. The error terms are normally distributed. (Assumption 7)
Residual Plot

• A particular method for studying the behavior of residuals is the residual plot.

• The residual plot is a type of graph in which the residuals for a particular regression model are plotted along with their associated values of x, as ordered pairs (x, residual).
• A residual plot is a graph in which the residuals are on the vertical axis and the independent variable is on the horizontal axis.
• If the dots are randomly dispersed around the horizontal axis, then a linear regression model is appropriate for the data; otherwise, choose a non-linear model.
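
A residual plot is just the residuals scattered against x with a reference line at zero; a minimal sketch with hypothetical residuals:

```python
import matplotlib.pyplot as plt

# Hypothetical x values and the residuals from some fitted model
x = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.2, -0.3, 0.1, 0.4, -0.2, -0.1, 0.3, -0.4]

plt.scatter(x, residuals)   # ordered pairs (x, residual)
plt.axhline(0)              # horizontal reference line at residual = 0
plt.xlabel("x")
plt.ylabel("Residual")
plt.title("Random scatter around zero suggests a linear model is appropriate")
plt.show()
```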
• The following example shows a few patterns in residual plots.

• In the first case, the dots are randomly dispersed, so a linear regression model is preferred.

• In the second and third cases, the dots are non-randomly dispersed, which suggests that a non-linear regression method is preferred.
Clearly the dots are randomly dispersed, so a linear regression model is preferred.
Independence of Errors
Error values are statistically independent.
If your points follow a clear pattern, it might indicate that the errors are influencing each other. The following image shows two linear regression lines; on the left, the points are scattered randomly, while on the right, the points are clearly influencing each other.
Assumption 3: Outliers/influential cases

Example:
Let us consider a dataset where
• y = foot length (cm) and
• x = height (in)
for n = 33 male students in a statistics class.
A scatterplot of the male foot length and height data shows one point labeled as an outlier.
• An outlier is a data point which is, in some way, very far from the rest of the data.
• An outlier is an observation that does not fit the rest of the data.
• It is sometimes called an extreme value.
• When you graph an outlier, it will appear not to fit the pattern of the graph.
• Some outliers are due to mistakes (for example, writing down 50 instead of 500), while others may indicate that something unusual is happening.
Most common causes of outliers in a data set:
• Data entry errors (human errors)
• Measurement errors (instrument errors)
• Experimental errors (data extraction or experiment
planning/executing errors)
• Intentional (dummy outliers made to test detection methods)
• Data processing errors (data manipulation or data set unintended
mutations)
• Sampling errors (extracting or mixing data from wrong or various
sources)

Residual Plots to Detect Multivariate Outliers

• As another example, the regression model for Yield as a function of Concentration is significant, but note that the line of fit appears to be tilted towards the outlier.

• We can see the effect of this outlier in the residual-by-predicted plot.
Box Plot Diagram to Identify Outliers

• A box plot, also termed a box-and-whisker plot, is a graphical method typically depicted by quartiles and inter-quartiles that helps in defining the upper limit and lower limit beyond which any data lying will be considered outliers.

• The very purpose of this diagram is to identify outliers and discard them from the data series before making any further observations, so that the conclusions made from the study give more accurate results, not influenced by any extreme or abnormal values.

• Box plots can be used as an initial screening tool for outliers, as they provide a graphical depiction of data distribution and extreme values.
Example:

• The box part of the chart is as described above, except that the mean is shown as an ×.
• The whiskers extend up from the top of the box to the largest data element that is less than or equal to 1.5 times the interquartile range (IQR) above it, and down from the bottom of the box to the smallest data element that is larger than 1.5 times the IQR below it.
• Values outside this range are considered to be outliers and are represented by dots.
• The boundaries of the box and whiskers are calculated by the values and formulas shown in the figure.
• The only outlier is the value 1850 for Brand B, which is higher than the upper whisker, and so is shown as a dot.
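
The whisker rule above translates into simple 1.5 × IQR fences. A sketch, assuming a made-up sample that includes the 1850 value mentioned in the example:

```python
import numpy as np

# Hypothetical sample containing one suspiciously large value (1850)
data = np.array([1200, 1250, 1300, 1320, 1350, 1400, 1450, 1850])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # bottom whisker boundary
upper_fence = q3 + 1.5 * iqr   # top whisker boundary

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)                # only 1850 lies beyond the upper fence
```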
Assumption 4: Autocorrelation

• Autocorrelation is correlation between successive observations of the same variable.

• Economic activities of the past often have a strong effect on present and future economic activities.

• Students from the same class might perform more similarly to each other than students from different classes.
• Linear regression analysis requires that there is little or no autocorrelation in the data.

• Typically autocorrelation occurs in stock prices, where the price is not independent of the previous price.
How to check:
• Look for the Durbin-Watson (DW) statistic.
• It must lie between 0 and 4.
• DW = 2 implies no autocorrelation,
• 0 < DW < 2 implies positive autocorrelation,
• while 2 < DW < 4 indicates negative autocorrelation.
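
The DW statistic can be computed directly from the residuals as DW = Σ(e_t − e_{t−1})² / Σe_t²; a minimal sketch with hypothetical residuals in time order:

```python
import numpy as np

# Hypothetical regression residuals, in time order
e = np.array([0.5, 0.4, 0.6, -0.3, -0.5, -0.2, 0.1, 0.3])

dw = np.sum(np.diff(e)**2) / np.sum(e**2)
print(round(dw, 3))   # near 2: little autocorrelation; below 2: positive; above 2: negative
```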
Remedial measure for autocorrelation:
• To remove the problem of autocorrelation from the data, we have to transform the original data and then apply the OLS technique to estimate the parameters.
Assumption 5: Multicollinearity

• You may be looking at contributions to town charity organizations using a model that includes the population of the town and the total gross income of the town.

[Model diagram: contributions to town charity organizations, predicted by the population of the town and the total gross income of the town.]
• You identify that these variables are highly correlated because the population of the town is a direct contributor to the total gross income of the town.
• In a case like this, you should restructure your model to avoid regressing on two variables that are causally related.
• You could do this by either omitting one of these variables or combining them into a single ratio variable such as per capita income.
[Model diagram: contributions to town charity organizations, predicted by per capita income.]
• For example, you may be looking at customer loyalty to a shop using a model that includes several different measures of satisfaction.

[Model diagram: customer loyalty to a shop, predicted by satisfaction with quality of product and satisfaction with the network.]
• You identify that two of these measures of satisfaction (satisfaction with quality of product and satisfaction with the network) are highly correlated, and determine that it is because customers don't tend to distinguish their satisfaction in that way. Rather, both measures of satisfaction are really a reflection of the same measure of overall satisfaction.
• In this case, you could simply use overall satisfaction as a predictor variable instead of the separate measures of satisfaction.
[Model diagram: customer loyalty to a shop, predicted by overall satisfaction.]
• Multicollinearity occurs when independent variables in a regression model are correlated.

• This correlation is a problem because independent variables should be independent.

• If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

• For example: height and weight, household income and water consumption, mileage and price of a car, study time and leisure time, etc.
• Multicollinearity is a statistical phenomenon in which there exists a perfect or exact relationship between the predictor variables.

• When there is a perfect or exact relationship between the predictor variables, it is difficult to come up with reliable estimates of their individual coefficients.

• It will result in incorrect conclusions about the relationship between the outcome variable and the predictor variables.
• Remedial measure: drop one or several predictor variables in order to lessen the multicollinearity.
How to detect Multicollinearity?

• There is a very simple test to assess multicollinearity in your regression model. The variance inflation factor (VIF) identifies correlation between independent variables and the strength of that correlation.
The variance inflation factor for the jth predictor is:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the R²-value obtained by regressing the jth predictor on the remaining predictors.
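
A sketch of the VIF computation under the definition above: each predictor is regressed on the remaining predictors (here via NumPy least squares) and 1/(1 − R²_j) is reported. The data are synthetic, with x2 built to be nearly collinear with x1 so that both show inflated VIFs.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = observations)."""
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]                                # jth predictor as the response
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        y_hat = A @ coef
        r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
        factors.append(1.0 / (1.0 - r2))
    return factors

# Synthetic predictors: x2 is almost an exact multiple of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2 * x1 + rng.normal(scale=0.1, size=50)
x3 = rng.normal(size=50)
print(vif(np.column_stack([x1, x2, x3])))  # large VIFs for x1 and x2, near 1 for x3
```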
Assumption 6: Homoscedasticity

• Homoscedasticity means equal scatter or same variance.

• Heteroscedasticity means unequal scatter.

• This assumption means that the variance around the regression line is the same for all values of the predictor variable (X).

The plot shows a violation of this assumption.
• Homoscedasticity can also be tested using a scatter plot of residuals vs. fitted values.

• If heteroscedasticity exists, the plot will exhibit a funnel-shape pattern. For example:
Residual Analysis for Equal Variance

[Figure: two pairs of plots of Y vs. x and residuals vs. x. A fan-shaped spread of residuals indicates non-constant variance; an even band around zero indicates constant variance.]
Assumption 7: Assumption of Normality

• The values of the residuals are normally distributed.

• This assumption can be tested by looking at the distribution of the residuals.

• We can do this by checking the histogram and the normal probability plot.

• However, unless the residuals are far from normal or have an obvious pattern, we generally don't need to be overly concerned about normality.
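
A minimal sketch of both checks, using SciPy's probplot for the normal probability plot; the residuals are hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical residuals from a fitted regression model
residuals = np.array([0.2, -0.3, 0.1, 0.4, -0.2, -0.1, 0.3, -0.4, 0.0, 0.2])

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(residuals)                  # histogram: look for a rough bell shape
ax1.set_title("Histogram of residuals")
stats.probplot(residuals, plot=ax2)  # normal probability (Q-Q) plot
plt.show()
```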
To determine which predictors are significant, a t test is performed for the individual significance of each explanatory variable.
Inferences About the Slope
The standard error of the regression slope coefficient (b1) is estimated by

$$S_{b_1} = \frac{S_e}{\sqrt{\sum (X_i - \bar{X})^2}}$$

where:
$S_{b_1}$ = estimate of the standard error of the slope
$S_e = \sqrt{\dfrac{SSE}{n-2}}$ = standard error of the estimate
t Test
Is there a linear relationship between X and Y?
Null and alternative hypotheses:
◦ H0: β1 = 0 (no linear relationship)
◦ H1: β1 ≠ 0 (a linear relationship does exist)
Test statistic:

$$t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad d.f. = n - 2$$

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
If the p-value of your regression estimate is less than 0.05 (or 5%), then you can conclude that, in the population from which the sample is drawn, there is a true, non-zero relationship between Y and X, that the estimate is trustworthy for the population, and that the predictor variables are statistically significant.
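
A sketch of the t test computation, using the slope and standard error from the house-price example that follows; SciPy supplies the two-tailed p-value:

```python
from scipy import stats

# Slope, its standard error, and sample size from the house-price example below
b1, sb1, n = 0.10977, 0.03297, 10

t_stat = (b1 - 0) / sb1                    # test H0: beta1 = 0
df = n - 2
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p-value
print(round(t_stat, 4), round(p_value, 5)) # 3.3294, ~0.0104: reject H0 at the 5% level
```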
t Test Example

House Price in $1000s (y) | Square Feet (x)
245 | 1400
312 | 1600
279 | 1700
308 | 1875
199 | 1100
219 | 1550
405 | 2350
324 | 2450
319 | 1425
255 | 1700

Estimated regression equation:
house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098.
Is there a relationship between the square footage of the house and its sales price?
t Test Example

H0: β1 = 0, H1: β1 ≠ 0

From the Excel output:

              Coefficients | Standard Error | t Stat  | P-value
Intercept     98.24833     | 58.03348       | 1.69296 | 0.12892
Square Feet   0.10977      | 0.03297        | 3.32938 | 0.01039

Here b1 = 0.10977 and Sb1 = 0.03297, so

$$t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938$$
t Test Example

H0: β1 = 0
H1: β1 ≠ 0

Test statistic: tSTAT = 3.329
d.f. = 10 - 2 = 8
α/2 = 0.025 in each tail, so the critical values are ±2.3060.

Decision: since tSTAT = 3.329 > 2.3060, it falls in the rejection region, so reject H0.

Conclusion: there is sufficient evidence that square footage affects house price.
• It is common in regression analysis to compute an F test to determine the overall significance of the model.

Null hypothesis:
• H0: all the coefficients = 0
• This implies that none of the explanatory variables are significant predictors of the response variable,
• i.e. the model is not significant.

Alternative hypothesis:
• HA: at least one coefficient is not 0
• This implies that at least one of the explanatory variables is a significant predictor of the response variable,
• i.e. the model is significant.
• If the p-value from the F test is less than the level of significance (LOS), we should continue the analysis: the null hypothesis is rejected.

• But if the p-value is greater than the LOS, then there is no evidence to indicate that any of the explanatory variables are significant predictors of the response variable, and therefore there would be no need to continue to the next step: the null hypothesis is not rejected.
F Test for Significance

$$F_{STAT} = \frac{MSR}{MSE}, \qquad MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$$

where FSTAT follows an F distribution with k numerator and (n - k - 1) denominator degrees of freedom

(k = the number of independent variables in the regression model)
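
A sketch of the F test computation, using the sums of squares from the Excel output below (one predictor, ten observations):

```python
from scipy import stats

# Sums of squares from the ANOVA table in the Excel output below
ssr, sse = 18934.9348, 13665.5652
k, n = 1, 10                    # one independent variable, ten observations

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(round(f_stat, 4), round(p_value, 5))  # 11.0848, ~0.0104: model is significant
```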
F Test for Significance: Excel Output

Regression Statistics
Multiple R         0.76211
R Square           0.58082
Adjusted R Square  0.52842
Standard Error     41.33032
Observations       10

$$F_{STAT} = \frac{MSR}{MSE} = \frac{18934.9348}{1708.1957} = 11.0848$$

With 1 and 8 degrees of freedom, the p-value for the F test is 0.01039.

ANOVA
             df | SS         | MS         | F       | Significance F
Regression    1 | 18934.9348 | 18934.9348 | 11.0848 | 0.01039
Residual      8 | 13665.5652 | 1708.1957  |         |
Total         9 | 32600.5000 |            |         |
F Test for Significance (continued)

H0: β1 = 0
H1: β1 ≠ 0
α = 0.05, df1 = 1, df2 = 8

Critical value: F0.05 = 5.32

Test statistic: FSTAT = MSR/MSE = 11.08

Decision: since FSTAT = 11.08 > 5.32, reject H0 at α = 0.05.

Conclusion: there is sufficient evidence that house size affects selling price.
