
UNIVERSITY OF NEGROS OCCIDENTAL - RECOLETOS

GRADUATE SCHOOL

First Semester 2022 - 2023


Chapter 3. Correlation and Regression

According to Surbhi (2017), correlation and regression are two analyses based on the multivariate distribution, which is described as a distribution of multiple variables. Correlation is the analysis that tells us whether an association exists between two variables ‘x’ and ‘y’. On the other hand, regression analysis predicts the value of the dependent variable based on the known value of the independent variable, assuming an average mathematical relationship between two or more variables.

Statistical correlation is measured by what is called the coefficient of correlation (r). Its numerical value ranges from +1.0 to -1.0. It gives us an indication of both the strength and direction of the relationship between variables (Wilson, 2019).

Moreover, r > 0 indicates a positive relationship, r < 0 indicates a negative relationship, and r = 0 indicates no relationship (or that the variables are independent of each other and not related). Here r = +1.0 describes a perfect positive correlation and r = -1.0 describes a perfect negative correlation. In addition, the closer the coefficients are to +1.0 and -1.0, the greater the strength of the relationship between the variables.

According to Cohen et al. (2002), the correlation between two variables can be strong or weak. The value of the correlation coefficient indicates the amount of variability shared between the two variables and the strength of the relationship, where the correlation coefficient always falls between -1 and +1.

There are no absolute criteria for interpreting the strength of the relationship based on the value of the correlation coefficient, according to Salkind (2008). Table 1 shows the rule of thumb generally accepted by social science researchers in interpreting the correlation coefficient.
Table 1

Correlation Coefficient    General interpretation of the strength of relationship

± 0.8 to ± 1.0             Very strong / Very high (± 1 = perfect relationship)
± 0.6 to ± 0.79            Strong / High
± 0.4 to ± 0.59            Moderate / Average
± 0.2 to ± 0.39            Weak / Low
0 to ± 0.2                 Very weak / Very low (0 = no relationship)
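Table 1 can be expressed as a small helper function, which is convenient when labeling many coefficients at once. This is a minimal sketch in Python (the function name is illustrative; since the bands in Table 1 overlap slightly at the cut-offs, the sketch assigns each boundary value to the stronger band):

```python
def interpret_r(r: float) -> str:
    """Return Salkind's (2008) rule-of-thumb label for a correlation coefficient."""
    size = abs(r)  # the sign gives direction; strength depends on magnitude only
    if size >= 0.8:
        return "very strong/very high"
    if size >= 0.6:
        return "strong/high"
    if size >= 0.4:
        return "moderate/average"
    if size >= 0.2:
        return "weak/low"
    return "very weak/very low"

print(interpret_r(0.908))   # very strong/very high
print(interpret_r(-0.45))   # moderate/average
```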

Test of Correlation/Relationship

Pearson’s correlation coefficient (Pearson’s r) is the parametric test statistic that measures the statistical relationship, or association, between two continuous variables. It is regarded as the best method of measuring the association between variables of interest because it is based on the method of covariance. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship.
To use Pearson correlation, your data must meet the following requirements:
1. Two or more continuous variables (i.e., interval or ratio level)
2. Cases must have non-missing values on both variables
3. Linear relationship between the variables
4. Independent cases (i.e., independence of observations)
5. Normality
6. Random sample of data from the population
7. No outliers

What values can the Pearson correlation coefficient take?


The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there
is no association between the two variables. A value greater than 0 indicates a positive association; that is, as the value
of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association;
that is, as the value of one variable increases, the value of the other variable decreases. This is shown in the diagram
below:

Example:
An analyst is studying the relationship between shopping-center traffic and a department store’s daily sales. The analyst develops an index to measure the daily volume of traffic entering the shopping center, and an index of daily sales, for 10 randomly selected days. Using the 0.05 level of significance, is there a significant relationship between shopping-center traffic and the department store’s daily sales?

Traffic index (X) Sales index (Y)

71 250

82 280

111 301

85 325

89 328

110 390

111 410

121 420

129 450

132 475

Using the 5 steps of hypothesis testing:

1. State your null and alternative hypothesis


Ho: There is no significant relationship between shopping-center traffic and the department store’s daily sales.
H1: There is a significant relationship between shopping-center traffic and the department store’s daily sales.

2. Level of significance.
α = 0.05

3. Statistical Tool
Pearson’s - r

4. Computation

Using JASP

1. Open the data

2. Click Regression and select Correlation

3. Direct Traffic index and Sales index to Variables, and go to Results

Pearson's Correlations

Variable                        Traffic index   Sales index
1. Traffic index  Pearson's r   —
                  p-value       —
2. Sales index    Pearson's r   0.908           —
                  p-value       < .001          —

Note: Before interpreting the result, first determine the df (degrees of freedom). The degrees of freedom are the difference between n (the number of participants) and 2 (the independent and dependent variables). Since n = 10, df = 10 - 2 = 8.

5. Making decision and conclusion

Based on the results, there was a very high positive relationship between the traffic index and the sales index [r(8) = 0.908, p < 0.001] at the 0.05 level of significance. This implies that as the traffic index increases, the sales index also increases.
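The r value that JASP reports can also be reproduced outside the software. Below is a minimal sketch in Python (standard library only; variable names are illustrative) that applies the definitional formula of Pearson's r to the ten paired observations:

```python
import math

traffic = [71, 82, 111, 85, 89, 110, 111, 121, 129, 132]
sales = [250, 280, 301, 325, 328, 390, 410, 420, 450, 475]

def pearson_r(x, y):
    """Pearson's r: shared variability of x and y over the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation of x and y
    sxx = sum((a - mx) ** 2 for a in x)                   # deviation of x alone
    syy = sum((b - my) ** 2 for b in y)                   # deviation of y alone
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(traffic, sales)
print(round(r, 3))  # 0.908, matching the JASP output
```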

Activity:

With the growth of internet service providers, a researcher decides to examine whether there is a relationship between the cost of internet service per month and the degree of customer satisfaction (on a scale of 1-10, with 1 being not at all satisfied and 10 being extremely satisfied). The researcher only includes programs with comparable types of services. A sample of the data is provided below.

Pesos Satisfaction

1100 6

1800 8

1700 10

1500 4

900 9

500 6

1200 3

1900 5

2200 2

2500 10

Can we conclude that there is a relationship between the amount of money spent per month on internet provider service and the level of customer satisfaction? (use 0.05 level of significance)

Linear Regression

Statistical regression estimates relationships between independent variables and dependent variables.
Furthermore, regression models can be used to help understand and explain relationships among variables; they can
also be used to predict actual outcomes, according to Pardoe (2019).

An example of statistical regression is linear regression. According to Prabhakaran (2017), it is used to predict the value of an outcome variable Y based on one or more input predictor variables X. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that we can use this formula to estimate the value of the response Y when only the predictors’ (X’s) values are known.
The basic equation for the regression line is (Nishishiba, 2014):

Y = a + bX

Where: Y = the dependent variable (value to be predicted)
X = the independent variable (predictor)
a = the point where the regression line crosses the Y axis, called the intercept
b = the slope of the regression line, indicating the strength of the relationship between X and Y (regression coefficient)

Frost (2017) indicated that regression analysis produces a regression equation in which the coefficients represent the relationship between each independent variable and the dependent variable. This equation is used to make predictions.

Example:
An analyst wants to determine if shopping-center traffic can predict the department store’s daily sales. The analyst develops an index to measure the daily volume of traffic entering the shopping center, and an index of daily sales, for 10 randomly selected days. Using the 0.05 level of significance, can the shopping-center traffic index predict the department store’s daily sales?

Using the 5 steps of hypothesis testing

1. State null and alternative hypothesis


Ho: The shopping-center traffic index cannot predict the department store’s daily sales.
H1: The shopping-center traffic index can predict the department store’s daily sales.

2. Level of significance
α = 0.05

3. Statistical Tool
Linear Regression

4. Computation

Using JASP

1. Open data

2. Click Regression and select Linear Regression

3. Direct Traffic index to Covariates and Sales index to Dependent Variable, and go to Results
Model Summary - Sales index
Model R R² Adjusted R² RMSE
H₀ 0.000 0.000 0.000 76.355
H₁ 0.908 0.824 0.802 33.937

This table provides the R and R² values. The R value represents the simple correlation and is 0.908 (the "R" column), which indicates a high degree of correlation. The R² value (the "R²" column) indicates how much of the total variation in the dependent variable, sales, can be explained by the independent variable, traffic. In this case, 82.4% can be explained, which is very large.

ANOVA

Model          Sum of Squares   df   Mean Square   F        p
H₁ Regression  43257.398         1   43257.398     37.560   < .001
   Residual     9213.502         8    1151.688
   Total       52470.900         9

Note.  The intercept model is omitted, as no meaningful information can be shown.

This table indicates that the regression model predicts the dependent variable significantly well. How do we know this? Look at the "Regression" row and go to the "p" column. This indicates the statistical significance of the regression model that was run. Here, p < 0.001, which is less than 0.05, and indicates that, overall, the regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).
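The entries of the ANOVA table are tied together by simple arithmetic: the total sum of squares is the regression plus residual sums of squares, R² is the regression share of the total, and F is the ratio of the two mean squares. A quick sketch (values copied from the table above) verifies this:

```python
# Sums of squares and degrees of freedom from the JASP ANOVA table
ss_reg, ss_res = 43257.398, 9213.502
df_reg, df_res = 1, 8

ss_total = ss_reg + ss_res                  # 52470.900, the "Total" row
r_squared = ss_reg / ss_total               # proportion of variation explained
f = (ss_reg / df_reg) / (ss_res / df_res)   # ratio of the mean squares

print(round(r_squared, 3), round(f, 2))  # 0.824 37.56
```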

The Coefficients table provides us with the necessary information to predict sales from traffic, as well as to determine whether traffic contributes statistically significantly to the model (by looking at the "p" column). Furthermore, we can use the values in the "Unstandardized" column to build the regression equation.

Coefficients

Model               Unstandardized   Standard Error   Standardized   t        p
H₀  (Intercept)     362.900          24.146                          15.030   < .001
H₁  (Intercept)      20.175          56.942                           0.354   0.732
    Traffic index     3.292           0.537           0.908           6.129   < .001

5. Making decision and conclusion

Results show that the shopping-center traffic index can predict the department store’s daily sales [F(1, 8) = 37.560, p < 0.001, R = 0.908, R2 = 0.824] at the 0.05 level of significance, with a linear equation of Sales = 20.175 + 3.292 (Traffic).

This implies that an increase in the shopping-center traffic index anticipates an increase in the department store’s daily sales.


The regression equation is presented as:
Sales = 20.175 + 3.292 (Traffic)

Example:
1. If the traffic index (X) = 60, what is the predicted daily sales?

Then, Sales = 20.175 + 3.292 (60) = 217.695 or 218

2. If the traffic index (X) = 150, what is the predicted daily sales?

Then, Sales = 20.175 + 3.292 (150) = 513.975 or 514
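The intercept and slope themselves can be recovered from the raw data with the least-squares formulas b = Sxy / Sxx and a = mean(Y) - b * mean(X). Below is a minimal sketch in Python (standard library only; names are illustrative) that fits the line and applies it to a new traffic value:

```python
traffic = [71, 82, 111, 85, 89, 110, 111, 121, 129, 132]
sales = [250, 280, 301, 325, 328, 390, 410, 420, 450, 475]

def fit_line(x, y):
    """Least-squares estimates: slope b = Sxy / Sxx, intercept a = mean(y) - b * mean(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b

a, b = fit_line(traffic, sales)
print(round(a, 3), round(b, 3))  # 20.175 3.292, matching the JASP coefficients
print(round(a + b * 60))         # 218, predicted sales at traffic index 60
```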

Activity 3:
Given: The data of the 10 permanent government employees who have been infected with COVID-19 and sent to a quarantine facility for health reasons, together with their levels of depression and self-esteem after infection.
Depression Self-Esteem
10 104
12 100
19 98
4 150
25 75
15 105
21 82
7 133

a. Determine if there is a significant relationship between the level of depression and the self-esteem of the employees.
b. Determine if the level of depression can predict the level of self-esteem of the employees. What will be the model equation of the given data?

Multiple Regression

Multiple regression is a statistical technique that can be used to analyze the relationship between a single dependent variable and several independent variables. The objective of multiple regression analysis is to use the independent variables whose values are known to predict the value of the single dependent variable. Each predictor value is weighted, the weights denoting its relative contribution to the overall prediction.

Y = a + b1X1 + b2X2 + ... + bnXn

The difference between Linear and Multiple Regression

The analyst utilizes linear regression to explain the change in a dependent variable using only one independent variable, while in multiple regression the analyst attempts to explain a dependent variable using more than one independent variable.

A multiple regression considers the effect of more than one explanatory variable on some outcome of interest. It evaluates the relative effect of these explanatory, or independent, variables on the dependent variable while holding all the other variables in the model constant.

Example: An industrial psychologist conducts a study to examine those variables thought to be related to on-the-job performance of technical employees. A random sample of 15 employees gives the following results.

Performance Ratings Job Aptitude Test In-service Training Units

54 15 8

37 13 1
30 15 1

48 15 7

37 10 4

37 14 2

31 8 3

49 12 7

43 10 9

12 3 1

30 15 1

37 14 2

61 14 10

31 9 1

31 4 5

Do the job aptitude test results and in-service training units earned predict the job performance of the employees? (use 0.05 level of significance)

Using the 5 steps of hypothesis testing

1. State the null and alternative hypothesis


Ho: The job aptitude test results and in-service training units earned cannot predict the job performance of the employees.
H1: The job aptitude test results and in-service training units earned can predict the job performance of the employees.

2. Level of significance
α = 0.05

3. Statistical Tool
Multiple Regression

4. Computation

Note: Establish first the relationship of job performance to job aptitude and in-service training units.

Using JASP
1. Open the data

2. Select Regression and click Correlation


3. Direct performance ratings, job aptitude test and in-service training units to Variables. Then go to the results.

Results
Pearson's Correlations

Variable                                    Performance Ratings   Job Aptitude Test   In-service Training Units
1. Performance Ratings       Pearson's r   —
                             p-value       —
2. Job Aptitude Test         Pearson's r   0.604                 —
                             p-value       0.017                 —
3. In-service Training Units Pearson's r   0.815                 0.149               —
                             p-value       < .001                0.595               —

Results show that the employees’ job performance has a high positive relationship with their job aptitude test [r(13) = 0.604, p = 0.017] and a very high positive relationship with the in-service training units earned [r(13) = 0.815, p < 0.001] at the 0.05 level of significance.

Since the employees’ job performance has a relationship with the job aptitude test and in-service training units earned, we can now employ multiple regression to determine if the job aptitude test and in-service training units earned predict the performance ratings of the employees.

4. Select Regression and click Linear Regression


5. Direct performance ratings to Dependent Variable and, job aptitude test and in-service training units to Covariates.
Then go to results.

6. Results

Model Summary - Performance Ratings


Model R R² Adjusted R² RMSE
H₀ 0.000 0.000 0.000 11.910
H₁ 0.950 0.902 0.886 4.028

ANOVA

Model          Sum of Squares   df   Mean Square   F        p
H₁ Regression  1791.079          2   895.539       55.208   < .001
   Residual     194.654         12    16.221
   Total       1985.733         14

Note.  The intercept model is omitted, as no meaningful information can be shown.

Coefficients

Model                            Unstandardized   Standard Error   Standardized   t        p
H₀  (Intercept)                  37.867           3.075                           12.314   < .001
H₁  (Intercept)                   9.870           3.380                            2.920   0.013
    Job Aptitude Test             1.477           0.274            0.494          5.399   < .001
    In-service Training Units     2.699           0.333            0.741          8.107   < .001

5. Making decision and conclusion

Based on the results, the job aptitude test and in-service training units earned predict the job performance of the employees [F(2, 12) = 55.208, p < 0.001, R = 0.950, R2 = 0.902] at the 0.05 level of significance, with a model equation of:

Job Performance = 9.870 + 1.477 (X1) + 2.699 (X2)

This implies that as the job aptitude test results rise and the employees earn additional in-service training units, their job performance increases.
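For two predictors, the multiple regression coefficients can be obtained by solving the normal equations directly. The sketch below (Python, standard library only; function and variable names are illustrative) reproduces the JASP coefficients from the raw data:

```python
performance = [54, 37, 30, 48, 37, 37, 31, 49, 43, 12, 30, 37, 61, 31, 31]
aptitude    = [15, 13, 15, 15, 10, 14,  8, 12, 10,  3, 15, 14, 14,  9,  4]
training    = [ 8,  1,  1,  7,  4,  2,  3,  7,  9,  1,  1,  2, 10,  1,  5]

def fit_two_predictors(y, x1, x2):
    """Solve the normal equations for Y = a + b1*X1 + b2*X2 (two predictors)."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    # Centered sums of squares and cross-products
    s11 = sum((v - m1) ** 2 for v in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (v - my) for u, v in zip(x1, y))
    s2y = sum((u - m2) * (v - my) for u, v in zip(x2, y))
    # Cramer's rule on the 2x2 system: s11*b1 + s12*b2 = s1y, s12*b1 + s22*b2 = s2y
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

a, b1, b2 = fit_two_predictors(performance, aptitude, training)
print(round(a, 2), round(b1, 3), round(b2, 3))  # 9.87 1.477 2.699
```

The same approach generalizes to more predictors, but the system then grows beyond 2x2 and is more practically solved with matrix routines.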

Activity:

Determine if the level of preparedness and level of awareness of the participants towards natural disasters predict their level of resiliency. Refer to your data in ACTIVITY 1 and use the 0.05 level of significance.

Reference:

Cohen, J., et al. (2002). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.

Daniel, W. and Terrel, J. (1986). Business Statistics (Basic Concepts and Methodology) (4th ed.). Boston: Houghton Mifflin Company.

Nishishiba, M. (2014). Research Methods and Statistics for Public and Nonprofit Administrators. Sage.

Pardoe, I. (2019). Regression. https://www.statistics.com/courses/regression-analysis/. Retrieved March 13, 2020.

Prabhakaran, S. (2017). Linear Regression. http://r-statistics.co/Linear-Regression.html. Retrieved March 13, 2020.

Salkind, N.J. (2008). Statistics for people who (think they) hate statistics. Thousand Oaks, CA: Sage.

Surbhi, S. (2017). Difference Between Correlation and Regression. https://keydifferences.com/difference-between-correlation-and-regression.html. Retrieved March 13, 2020.

Wilson, L. (2019). Statistical Correlation. https://explorable.com/statistical-correlation. Retrieved March 13, 2020.
