Correlation and Regression

There is always
something new to
experience!!!!
Always curious to learn!!!
We are here to learn those
interesting things!!!
What’s
now?
Today’s challenge is –
LOOKING FOR
RELATIONS AND
ASSOCIATIONS!!!
LOOKING FOR RELATIONS...
AMONG SET OF VARIABLES

Looking at the following table, can we say that the higher the category of officers, the higher the degree of satisfaction?
Categories of Officers (I means Senior Post)
Satisfaction   I     II    III   IV
High           40    60    52    48
Medium         103   87    82    88
Low            57    53    66    64
THINK
• Do you think a credit card has boosted your purchasing power?
a) Strongly Agree
b) Somewhat Agree
c) Neither Disagree nor Agree
d) Disagree
e) Strongly Disagree
• Do you feel you end up buying more when using a credit card?
a) Strongly Agree
b) Somewhat Agree
c) Neither Disagree nor Agree
d) Disagree
e) Strongly Disagree
What is the thing
you wish to ‘PUT
ON TEST’?
What to do …?
Research Project:
“TABOOS ABOUT ABORTION IN INDIAN SOCIETY”
One of the research issues in it:
“Under what circumstances would people like to have an abortion?”
The researcher obtained responses from different male and female respondents.
The data is …
FEELINGS ABOUT ABORTION          GENDER
                                 MALE   FEMALE   TOTAL
WHENEVER THEY WANT ABORTION      466    448      914
ONLY IN SPECIAL CIRCUMSTANCES    345    383      728
SHOULD NOT BE ALLOWED            38     46       84
TOTAL                            849    877      1726
IS THERE ANY ‘DEGREE OF ASSOCIATION’ BETWEEN GENDER AND FEELINGS ABOUT ABORTION?
MEASURES OF ASSOCIATION/CORRELATION
• A Measure of Association is a numerical index summarizing the strength, degree and direction of the relationship in a two-dimensional cross-classification of variables. It reveals ‘how two variables are related’.
• Any measure of association/correlation represents the mutual relationship between two variables.
FIRST,
MEASURES OF
ASSOCIATION FOR
NOMINAL DATA...
MEASURES OF ASSOCIATION FOR NOMINAL DATA
• CHI-SQUARE is used to determine whether any association/relation exists among nominal variables. But the CHI-SQUARE STATISTIC has the following features/limitations:
  – It shows only whether a relationship exists or not.
  – It does not provide the degree of relationship among the variables.
  – It is highly sensitive to sample size.

MEASURES OF ASSOCIATION…
NOMINAL DATA
MEASURES BASED UPON CHI – SQUARE
Cramer’s CONTINGENCY coefficient, V:
$V = \sqrt{\phi^2 / (q - 1)}$
where
$\phi^2 = \chi^2 / n$
q = Minimum(number of rows, number of columns).

• In case of no association, V = 0, and in case of perfect association, V = 1.
NOMINAL DATA (CONTINUED…)
• There is no general consensus as to what constitutes a strong or a weak relation among the variables. However, the following is offered as a guideline for interpreting the values of the above-mentioned statistics.
Value of Statistic   Possible Interpretation
0 - 0.0999           Negligible Association
0.10 - 0.1999        Weak Association
0.20 - 0.3999        Moderate Association
0.40 - 0.5999        Fairly Strong Association
0.60 - 0.9999        Very Strong Association
What is the degree of association between the degree a student pursues and the student's major?
OBSERVED DATA
                   Degree
Major         BBA     MBA     DBA     TOTAL
Accounting    75      5       10      90
Finance       25      20      5       50
Other         20      5       35      60
TOTAL         120     30      50      200

EXPECTED DATA
                   Degree
Major         BBA     MBA     DBA     TOTAL
Accounting    54      13.5    22.5    90
Finance       30      7.5     12.5    50
Other         36      9       15      60
TOTAL         120     30      50      200

CALCULATION OF CHI-SQUARE
                   Degree
Major         BBA     MBA     DBA     TOTAL
Accounting    8.17    5.35    6.94    20.46
Finance       0.83    20.83   4.50    26.17
Other         7.11    1.78    26.67   35.56
TOTAL         16.11   27.96   38.11   82.19

• Cramer’s V is:
$V = \sqrt{(82.19 / 200) / (3 - 1)} \approx 0.4533$
• Gender: □ Male □ Female
• Occupation: □ Student □ Homemaker
□ Salaried □ Self-Employed
• Marital Status: □ Single □ Married
• Which Outdoor ENTERTAINMENT you like most?

a) Go for shopping in malls etc
b) Watch movie in a theater
c) Dine with friends and family members
d) Go out for adventurous experience
e) Any Other (please specify)
What is the thing you wish to ‘PUT ON TEST’?
SPSS Output …
MARITAL STATUS * ENTERTAINMENT Crosstabulation (Count)
ENTERTAINMENT columns: (1) Go for shopping in malls etc; (2) Watch movie in a theater; (3) Dine with friends and family members; (4) Go out for adventurous experience; (5) Any Other (please specify)
MARITAL STATUS   (1)    (2)    (3)    (4)    (5)    Total
MARRIED          80     40     10     35     15     180
UNMARRIED        25     40     50     25     5      145
Total            105    80     60     60     20     325
Symmetric Measures
                                   Value   Approx. Sig.
Nominal by Nominal   Phi           .426    .000
                     Cramer's V    .426    .000
N of Valid Cases                   325
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
What can you say about the association?
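SPSS Output …
GENDER * ENTERTAINMENT Crosstabulation (Count)
ENTERTAINMENT columns: (1) Go for shopping in malls etc; (2) Watch movie in a theater; (3) Dine with friends and family members; (4) Go out for adventurous experience; (5) Any Other (please specify)
GENDER    (1)    (2)    (3)    (4)    (5)    Total
MALE      40     60     20     75     25     220
FEMALE    45     25     10     5      20     105
Total     85     85     30     80     45     325
Symmetric Measures
                                   Value   Approx. Sig.
Nominal by Nominal   Phi           .371    .000
                     Cramer's V    .371    .000
N of Valid Cases                   325
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
What can you say about the association?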
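SPSS Output …
OCCUPATION * ENTERTAINMENT Crosstabulation (Count)
ENTERTAINMENT columns: (1) Go for shopping in malls etc; (2) Watch movie in a theater; (3) Dine with friends and family members; (4) Go out for adventurous experience; (5) Any Other (please specify)
OCCUPATION       (1)    (2)    (3)    (4)    (5)    Total
STUDENTS         25     40     50     15     5      135
HOMEMAKER        30     12     8      2      5      57
SALARIED         25     20     10     5      3      63
SELF-EMPLOYED    8      7      4      1      3      23
Total            88     79     72     23     16     278
Symmetric Measures
                                   Value   Approx. Sig.
Nominal by Nominal   Phi           .370    .000
                     Cramer's V    .214    .000
N of Valid Cases                   278
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
What can you say about the association?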
What do you say?
Which has more impact on Type of Entertainment: Gender, Marital Status or Occupation?
SECOND,
MEASURES OF
ASSOCIATION FOR
ORDINAL DATA...
ORDINAL DATA
• Measures of association between ordinal data are classified into two groups:
  – Those based upon the concept of rank order correlation
  – Those based upon the concepts of agreement/concordance or disagreement/discordance.
• METHOD BASED UPON THE RANK ORDER CORRELATION CONCEPT:
SPEARMAN RANK ORDER CORRELATION

It is equal to $r_s = 1 - \frac{6 \sum d^2}{N(N^2 - 1)}$
where d is the difference between the two ranks assigned to an observation and N is the number of observations.
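A minimal sketch of the rank-difference formula in Python follows. The ranks are hypothetical (two judges ranking five candidates) and are not taken from the slides; the formula in this form assumes there are no tied ranks.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (N * (N^2 - 1)).

    Assumes both inputs are already ranks and contain no ties.
    """
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks given by two judges to five candidates.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]

rho = spearman_rho(judge_a, judge_b)
print(f"Spearman rank correlation = {rho:.2f}")
```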
ORDINAL DATA (CONTINUED…)
• Methods based upon the concepts of AGREEMENT/CONCORDANCE or DISAGREEMENT/DISCORDANCE
• These measures are developed to measure association among ordinal data in the sense of “WHAT IS THE DEGREE OF AGREEMENT OR DISAGREEMENT AMONG THE RANKS ASSIGNED TO TWO VARIABLES?”
• These measures make use of the following concepts …
· AGREEMENT/CONCORDANCE (C): It means the degree of harmony/agreement between two ranks. Two pairs (X1, Y1) and (X2, Y2) are said to be concordant if
  X1 > X2 and Y1 > Y2, OR
  X1 < X2 and Y1 < Y2
· DISAGREEMENT/DISCORDANCE (D): It means the degree of disharmony/disagreement between two ranks. Two pairs (X1, Y1) and (X2, Y2) are said to be discordant if
  X1 > X2 and Y1 < Y2, OR
  X1 < X2 and Y1 > Y2
· TIES: If some equality is found between pairs of observations, then there exists a tie. Pairs may be tied on X, tied on Y, or tied on both X and Y.
· Goodman and Kruskal Gamma ($\gamma$):
$\gamma = (C - D) / (C + D)$
where C and D are the numbers of concordant and discordant pairs.
Can we estimate what is
the DEGREE OF
AGREEMENT between
two tests?
Two tests were conducted to measure the LEADERSHIP TRAITS OF 10 EMPLOYEES.
The data collected shows the number of traits possessed by an employee out of 20 traits.
Employees   Test #1   Test #2
1           10        11
2           12        15
3           13        14
4           14        14
5           15        14
6           10        13
7           8         9
8           9         9
9           12        10
10          15        15
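One way to answer the question is to count concordant and discordant pairs of employees and form Goodman and Kruskal's Gamma. The sketch below does that for the two test scores above; the function name is an illustrative choice, and pairs tied on either test are simply left out, which is how Gamma treats ties.

```python
from itertools import combinations

# Scores of the 10 employees on the two tests (from the table above).
test1 = [10, 12, 13, 14, 15, 10, 8, 9, 12, 15]
test2 = [11, 15, 14, 14, 14, 13, 9, 9, 10, 15]


def goodman_kruskal_gamma(x, y):
    """Return (concordant, discordant, gamma) over all pairs of observations."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # both differences have the same sign
            concordant += 1
        elif product < 0:    # the differences have opposite signs
            discordant += 1
        # product == 0: the pair is tied on X, on Y, or on both; Gamma ignores it
    gamma = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, gamma


c, d, gamma = goodman_kruskal_gamma(test1, test2)
print(f"C = {c}, D = {d}, gamma = {gamma:.4f}")
```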
Looking at the following table, can we say that the higher the category of officers, the higher the degree of satisfaction?
Categories of Officers (I means Senior Post)
Satisfaction   I     II    III   IV
High           40    60    52    48
Medium         103   87    82    88
Low            57    53    66    64
Revisiting the
Problem!!!
1. You prefer that your life partner’s family
should be financially sound.
2. Love marriage is better than ‘arranged
marriage’
Are respondents
consistent in
their responses?
SPSS Output …
Love marriage is better than 'arranged marriage'. * You prefer that your life partner's family should be financially sound. Crosstabulation (Count)
Rows: Love marriage is better than 'arranged marriage'.
Columns: You prefer that your life partner's family should be financially sound.
                    STRONGLY   SOMEWHAT             SOMEWHAT   STRONGLY
                    AGREE      AGREE      NEUTRAL   DISAGREE   DISAGREE   Total
STRONGLY AGREE      10         4          12        13         16         55
SOMEWHAT AGREE      15         6          22        8          25         76
NEUTRAL             20         12         23        8          15         78
SOMEWHAT DISAGREE   28         9          24        18         10         89
STRONGLY DISAGREE   22         14         25        17         5          83
Total               95         45         106       64         71         381
Symmetric Measures
                             Value   Asymp. Std. Error (a)   Approx. T (b)   Approx. Sig.
Ordinal by Ordinal   Gamma   -.197   .049                    -3.991          .000
N of Valid Cases             381
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
What can you CONCLUDE?
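As a cross-check on the reported Gamma, the statistic can be recomputed directly from the cell counts of the crosstabulation. The sketch below is an illustration rather than SPSS's own procedure: for every cell it multiplies the count by the counts lying below and to the right (concordant pairs) and below and to the left (discordant pairs).

```python
# Cell counts from the crosstabulation above.
# Rows: "Love marriage is better" (Strongly Agree ... Strongly Disagree).
# Columns: "Partner's family should be financially sound" (same order).
table = [
    [10, 4, 12, 13, 16],
    [15, 6, 22, 8, 25],
    [20, 12, 23, 8, 15],
    [28, 9, 24, 18, 10],
    [22, 14, 25, 17, 5],
]


def gamma_from_table(t):
    """Return (concordant, discordant, gamma) for an ordered two-way table."""
    n_rows, n_cols = len(t), len(t[0])
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # rows strictly below
                for m in range(n_cols):
                    if m > j:                   # below and to the right
                        concordant += t[i][j] * t[k][m]
                    elif m < j:                 # below and to the left
                        discordant += t[i][j] * t[k][m]
    gamma = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, gamma


c, d, gamma = gamma_from_table(table)
print(f"C = {c}, D = {d}, gamma = {gamma:.3f}")
```

A negative Gamma means discordant pairs outnumber concordant ones: respondents who agree more with one statement tend to agree less with the other.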
THIRD,
MEASURES OF
ASSOCIATION FOR
INTERVAL & RATIO
SCALE DATA...
MEASURES OF ASSOCIATION FOR INTERVAL AND RATIO SCALE DATA
• The degree of correlation between two variables measured on interval and ratio scales can be measured through PEARSON’S CORRELATION COEFFICIENT, which is:
$r_{xy} = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}}$
• Value of r     Possible Interpretation
0.90 - 1.00      Very Strong Association
0.70 - 0.90      Fairly Strong Association
0.40 - 0.70      Moderate Association
0.20 - 0.40      Weak Association
Less than 0.2    Negligible Association
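The raw-sums form of Pearson's coefficient translates almost term by term into code. The sketch below uses a small hypothetical data set (not from the slides) purely to show the mechanics.

```python
import math


def pearson_r(x, y):
    """Pearson correlation using the raw-sums formula."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical paired measurements, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```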
FOURTH,
MEASURES OF ASSOCIATION
FOR INTERVAL & RATIO
SCALE DATA AND NOMINAL
DATA...
Research Project:
What to do …?
“TV VIEWING HABITS AMONG WOMEN IN NCR”
One of the research issues in it:
“Does the working/non-working status of women have any impact on the hours of TV viewing?”
The researcher obtained responses from different working and non-working women respondents.
The data is …
IS THERE ANY ‘DEGREE OF ASSOCIATION’ BETWEEN STATUS OF WOMEN AND TV VIEWING HOURS?
MEASURE OF ASSOCIATION … INTERVAL BY NOMINAL
• ETA ($\eta$): When one variable is categorical and the other is a scaled one, Eta is a more suitable measure of correlation.
• Usually, ETA is used when the dependent variable is measured on the Scale Level while the independent variable is measured on the Nominal Level.
• ETA squared is the proportion of variation in the dependent variable explained by the independent variable.
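Eta squared can be sketched as the between-group sum of squares divided by the total sum of squares. The group labels and viewing hours below are hypothetical values chosen for illustration; they are not the survey data.

```python
import math


def eta_squared(groups):
    """Eta squared = between-group sum of squares / total sum of squares.

    `groups` maps each category of the nominal variable to the list of
    values of the scale variable observed in that category.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)

    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
    return ss_between / ss_total


# Hypothetical daily TV viewing hours, for illustration only.
hours = {
    "working": [1, 2, 3],
    "non-working": [3, 4, 5],
}

eta_sq = eta_squared(hours)
eta = math.sqrt(eta_sq)
print(f"eta squared = {eta_sq:.2f}, eta = {eta:.4f}")
```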
SPSS Output …
Directional Measures
                                                                                         Value
Nominal by Interval   Eta   STATUS OF WOMEN RESPONDENT Dependent                         .799
                            AVERAGE TV VIEWING HOUR PER DAY IN THE LAST WEEK Dependent   .617
What can you say about the association?
The Last thing about Correlation …
“The invalid assumption that

correlation implies cause is
probably among the two or
three most serious and
common errors of human
reasoning”
With a happy relation and correlation … let’s MOVE to
something different ...
Have you ever wondered how a financial analyst
can predict Profits of a company?
Are you aware of the TOOL needed to estimate the
beta of a company?
Can I establish a functional relation between EPS of
share and its market price?
Can you model CAUSE and EFFECT
RELATIONSHIP between the two variables?
If you are serious in looking answers
to the issues raised then we have ……
Regression Model
$Y_i = \alpha + \beta X_i + \varepsilon_i$
Dr. C. P. Gupta
DETERMINISTIC vs. STOCHASTIC MODEL
• A model with the following functional relation is called a DETERMINISTIC model:
$Y = \alpha + \beta X$
• A model with the following functional relation is called a STOCHASTIC model:
$Y = \alpha + \beta X + \varepsilon$
Why DISTURBANCE/ERROR TERM …???
• Usually, the following rationales are given for “Why ERROR term…?”:
  – Omission of variables and specification error
  – Measurement error
  – Human indeterminacy or the stochastic nature of economic processes
Some Preliminaries!!!
• Parameters: Unknown constants in a model are called parameters. For instance, $\alpha$ and $\beta$ are parameters in the following model:
$Y = \alpha + \beta X + \varepsilon$
• Estimators: An estimator is a rule, formula or algorithm that is applied to the data in a specific sample to compute an estimate of the population parameter.
• Estimates: An estimate is a number or specific value computed or obtained through an estimator.
What is a GOOD estimator?
• Following are some criteria on the basis of which one can judge the “goodness” of an estimator:
  – Computational Cost
  – Highest R²
  – Linear Estimator
  – Unbiased Estimator
  – Minimum-variance or Efficient Estimator
  – Based on all available information
CLASSICAL LINEAR REGRESSION MODEL
• The general form of the Classical Linear Regression Model:
$Y_i = \alpha + \beta X_i + \varepsilon_i$; i = 1, …, n
• BASIC ASSUMPTIONS:
  – Zero Mean of the Disturbance: $E[\varepsilon_i] = 0$ for all i;
  – Homoscedasticity: $Var[\varepsilon_i] = \sigma^2$, a constant for all i;
  – Non-autocorrelation: $Cov[\varepsilon_i, \varepsilon_j] = 0$ if $i \neq j$;
  – Uncorrelatedness of regressor and disturbance: $Cov[X_i, \varepsilon_j] = 0$ for all i and j;
  – Normality: $\varepsilon_i \sim N[0, \sigma^2]$; and
  – Non-Stochastic Regressor: the value of $X_i$ is a known constant in the probability distribution of $Y_i$.
The parameters of the Classical Regression Model are determined by the LEAST SQUARES METHOD.
Using the Least Squares Method, the estimate of $\beta$, say b, is determined as follows:
$b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}$
And the estimate of $\alpha$, say a, can be determined thus:
$a = \bar{Y} - b\bar{X}$
Let’s do step-by-step Regression
Analysis …
Trying to establish a Relation between the
Interest Rates and Futures Index
Day Interest Rate Futures Index
1 7.43 221
2 7.48 222
3 8.00 226
4 7.75 225
5 7.60 224
6 7.63 223
7 7.68 223
8 7.67 226
9 7.59 226
10 8.07 235
11 8.03 233
12 7.25 325
13 8.00 241
Step No.#1: Do we have sufficient
evidence to fit a Linear Regression Model?
Can you fit a linear regression model to the data?
Relook at the data…
What would you like to say about the point for Day 12 (Interest Rate 7.25, Futures Index 325)?
It is an OUTLIER!
Identify an outlier and remove it…
Removing the outlier, we get the final data for Regression Analysis …
Day   Interest Rate   Futures Index
1     7.43            221
2     7.48            222
3     8.00            226
4     7.75            225
5     7.60            224
6     7.63            223
7     7.68            223
8     7.67            226
9     7.59            226
10    8.07            235
11    8.03            233
13    8.00            241
Using the Least Squares Method, we get …
Estimate of the Beta ($\beta$)…
$\text{Estimate of } \beta = \frac{\text{Covariance}(x, y)}{\text{Variance}(x)}$
Using a scientific calculator, one can get:
Covariance(Interest Rate, Futures Index) = 1.0180; and Variance of Interest Rate = 0.0462.
Therefore, the estimate of Beta is: 22.0307 (computed from the unrounded covariance and variance).
Using the Least Squares Method, we get …
Estimate of the Alpha ($\alpha$)…
$\text{Estimate of } \alpha = \bar{y} - \hat{\beta}\,\bar{x}$
Using a scientific calculator, one can get:
Mean of Interest Rate = 7.74; and
Mean of Futures Index = 227.0833.
Therefore, the estimate of Alpha is: 56.4740
Our FINAL REGRESSION EQUATION…
Futures Index = 56.4740 + 22.0307 Interest Rate
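The same estimates can be reproduced without a calculator. The sketch below applies the least squares formulas to the 12 observations that remain after dropping Day 12; the function name `least_squares_line` is an illustrative choice.

```python
# Data after removing the outlier (Day 12).
interest_rate = [7.43, 7.48, 8.00, 7.75, 7.60, 7.63,
                 7.68, 7.67, 7.59, 8.07, 8.03, 8.00]
futures_index = [221, 222, 226, 225, 224, 223,
                 223, 226, 226, 235, 233, 241]


def least_squares_line(x, y):
    """Return (intercept a, slope b) of the least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line(interest_rate, futures_index)
print(f"Futures Index = {a:.4f} + {b:.4f} * Interest Rate")
```

Dividing both sums by n gives the covariance and variance quoted above; the factor cancels in the ratio, so the slope is unchanged.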

Line of BEST-FIT – Regression Line
[Figure: scatter plot of Futures Index (vertical axis) against Interest Rate (horizontal axis) with the fitted line f(x) = 22.03 x + 56.47, R² = 0.66]
Will our story of Regression
Analysis end here?
NO!
We shall have a beginning of … a
NEW STORY!
Before we proceed further, we must ensure –
‘how best is our line of BEST FIT?’
For that, we need a tool…
… to measure the DEGREE OF GOODNESS OF FIT.
One of the ways in which the FIT of the Regression can be evaluated is:
‘whether variation in x is a good predictor of variation in y.’
Now, what is it that can measure such a variation?
And, it is …
R² = R Square!
R² - COEFFICIENT OF DETERMINATION. It is a measure that represents the proportion of total variation in the dependent variable explained by the model.
The higher the value of R², the greater the variation explained and hence the better the fit.
It is good that R² can tell us about the GOODNESS of FIT. But how can I believe that whatever is explained is statistically significant?
For that, one can use ANOVA!
But why ANOVA in Regression?
ANALYSIS OF VARIANCE
ANOVA TABLE
Sources of Variation   Variation   Degrees of Freedom   Mean Square       F-Ratio
Regression             SSR         K                    SSR/K             Ratio of Mean Squares
Residuals              SSE         n - (K+1)            SSE/(n-(K+1))
Total                  SST         n - 1                SST/(n-1)
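To make the table concrete, the sketch below fills it in for the Interest Rate and Futures Index data (Day 12 removed) and also derives R² as SSR/SST. It assumes K is the number of regressors, so K = 1 here; the variable names are illustrative.

```python
# Data after removing the outlier (Day 12).
x = [7.43, 7.48, 8.00, 7.75, 7.60, 7.63, 7.68, 7.67, 7.59, 8.07, 8.03, 8.00]
y = [221, 222, 226, 225, 224, 223, 223, 226, 226, 235, 233, 241]

n = len(x)
k = 1  # assumed meaning of K: number of regressors in the model

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

# Decomposition of the total variation.
sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # explained by regression
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual

r_squared = ssr / sst
f_ratio = (ssr / k) / (sse / (n - (k + 1)))

print(f"SSR = {ssr:.4f}, SSE = {sse:.4f}, SST = {sst:.4f}")
print(f"R Square = {r_squared:.4f}, F = {f_ratio:.2f}")
```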
Summarizing…
Evaluating the FIT of the Regression!
• There are different tools to capture different dimensions of FIT:
  – Coefficient of determination or R²
  – Analysis of Variance (ANOVA)
  – Testing the significance of the parameters of the model individually.
Using EXCEL to get the Result of
Regression Analysis.
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.8153
R Square 0.6646
Adjusted R Square 0.6311
Standard Error 3.6850
Observations 12

ANOVA
df SS MS F Significance F
Regression 1 269.1232 269.1 19.82 0.0012
Residual 10 135.7934 13.58
Total 11 404.9167

Coefficients Standard Error t Stat P-value
Intercept 56.4740 38.3384 1.473 0.172
Interest Rate 22.0307 4.9487 4.452 0.001
Once we get the Regression Line and
assuming that it is the BEST FITTED LINE,
Then WHAT?
Where to go?
One can use Regression Analysis for …
• First, establishing a relation between the variables and estimating the values.
• Second, making a forecast!

Let’s take another example:
Car Age(Years) Selling Price (Rs.'000)
1 9 81
2 7 60
3 11 36
4 12 40
5 8 56
6 7 15
7 8 76
8 11 80
9 10 80
10 12 60
11 6 86
12 8 80
13 5 90
14 8 70
15 9 50
16 12 40
17 8 75
18 7 65
19 6 85
20 10 50
EXCEL OUTPUT……
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.4218
R Square            0.1779
Adjusted R Square   0.1322
Standard Error      18.8546
Observations        20

ANOVA
             df   SS            MS        F         Significance F
Regression   1    1384.805684   1384.81   3.89541   0.063970444
Residual     18   6398.944316   355.497
Total        19   7783.75

             Coefficients   Standard Error   t Stat    P-value
Intercept    98.6206        18.1639          5.42948   3.7E-05
Age(Years)   -4.0081        2.0308           -1.9737   0.06397

1. What is the Regression Line?
2. How well does the Regression Line fit the data?
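The coefficients in this output can be recomputed from the 20 cars listed earlier and then used for a forecast. The sketch below does both; the 10-year-old car is a hypothetical input chosen for illustration, and with an R Square of about 0.18 any such forecast should be read with caution.

```python
# Car data from the example: age in years and selling price in Rs.'000.
age = [9, 7, 11, 12, 8, 7, 8, 11, 10, 12, 6, 8, 5, 8, 9, 12, 8, 7, 6, 10]
price = [81, 60, 36, 40, 56, 15, 76, 80, 80, 60,
         86, 80, 90, 70, 50, 40, 75, 65, 85, 50]

n = len(age)
age_bar = sum(age) / n
price_bar = sum(price) / n

# Least squares slope and intercept.
slope = (sum((x - age_bar) * (y - price_bar) for x, y in zip(age, price))
         / sum((x - age_bar) ** 2 for x in age))
intercept = price_bar - slope * age_bar


def forecast_price(car_age):
    """Point forecast of selling price (Rs.'000) from the fitted line."""
    return intercept + slope * car_age


print(f"Selling Price = {intercept:.4f} + ({slope:.4f}) * Age")
print(f"Forecast for a 10-year-old car: {forecast_price(10):.2f}")
```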
EXCEL OUTPUT……
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.7262
R Square            0.5274
Adjusted R Square   0.5012
Standard Error      2.0738
Observations        20

ANOVA
             df   SS         MS        F         Significance F
Regression   1    86.4036    86.4036   20.0902   0.0003
Residual     18   77.4139    4.3008
Total        19   163.8175

             Coefficients   Standard Error   t Stat   P-value
Intercept    5.9772         0.9174           6.5155   0.0000
R/S Ratio    74.0676        16.5248          4.4822   0.0003

1. What is the Regression Line?
2. How well does the Regression Line fit the data?
3. If the R/S Ratio is 0.30, then determine the P/E Ratio.
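The third question can be answered by substituting into the fitted line. The sketch below uses only the two coefficients reported in the output; the function name `predict_pe` is an illustrative choice, and the underlying data are not shown in the slides.

```python
# Coefficients as reported in the Excel output above.
INTERCEPT = 5.9772
SLOPE_RS_RATIO = 74.0676


def predict_pe(rs_ratio):
    """Point estimate of the P/E Ratio from the fitted regression line."""
    return INTERCEPT + SLOPE_RS_RATIO * rs_ratio


pe_at_030 = predict_pe(0.30)
print(f"Estimated P/E Ratio at R/S Ratio 0.30: {pe_at_030:.2f}")
```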
