Pearson Product-Moment Correlation
SAS Mathematics Faculty
Objectives
1. Use the methods of linear
regression and correlations to
predict the value of a variable
given certain conditions.
2. Solve for and interpret r and the
coefficient of determination, r².
Karl Pearson (1857-1936)
- He was an
influential English
mathematician and
biostatistician.
Karl Pearson (1857-1936)
- In 1911, he founded the
world’s first university
statistics department at
University College
London, and contributed
significantly to the fields of
biometrics, meteorology,
social Darwinism and
eugenics.
What is Correlation?
It is a statistical method used to
determine whether a relationship between
two variables exists.
It also measures the direction and
strength of the linear relationship between
two variables.
Direction may be positive, negative or
zero.
Types of Correlation
Positive correlation
Negative correlation
Zero correlation
Positive Correlation
A positive correlation exists
when high values of one variable
correspond to high values in the
other variable or low values in one
variable correspond to low values in
the other variable.
Positive Correlation
[Scatter plot: Score in English (vertical axis) against Score in Mathematics (horizontal axis)]
The graph indicates a direct correlation between
variables x and y which appears to be increasing.
Negative Correlation
A negative correlation exists
when high values of one variable
correspond to low values in the
other variable or low values in one
variable correspond to high values
in the other variable.
Negative Correlation
[Scatter plot: Score in English (vertical axis) against Score in Mathematics (horizontal axis)]
This time the trend of the data is decreasing, hence, the variables are
negatively correlated.
Zero Correlation
A zero correlation exists when
high values in one variable
correspond to either high or low
values in the other variable.
Zero Correlation
[Scatter plot: Score in English (vertical axis) against Score in Mathematics (horizontal axis)]
The scatter graph of the data above is neither increasing nor decreasing. This
graph represents a zero correlation.
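A minimal Python sketch of how the three directions show up numerically. The scores are made-up illustrative values, not the data behind the plots, and `pearson_r` and `direction` are hypothetical helper names. The sign of the correlation coefficient r gives the direction.

```python
import math

def pearson_r(x, y):
    """Pearson correlation using the raw-sum (computational) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def direction(r):
    """Classify the direction of the relationship from the sign of r."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "zero"

# Hypothetical scores for five students (illustration only).
math_scores = [1, 2, 3, 4, 5]
english_up = [2, 4, 5, 7, 9]      # rises with the math score
english_down = [9, 7, 5, 4, 2]    # falls as the math score rises
english_flat = [4, 6, 2, 6, 4]    # no upward or downward trend

for label, english in [("up", english_up), ("down", english_down), ("flat", english_flat)]:
    r = pearson_r(math_scores, english)
    print(label, round(r, 2), direction(r))
```

With real data r is rarely exactly 0, so in practice values close to 0 are read as zero correlation.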
Example:
Determine the direction of
relationship between the following
pairs of variables. Is it positive,
negative or zero?
Example:
1. Room rate and size of a room in
a hotel
A. Positive
B. Negative
C. Zero
Example:
2. Weight and height of students
A. Positive
B. Negative
C. Zero
Example:
3. Pressure and volume of a gas
A. Positive
B. Negative
C. Zero
Example:
4. IQ and height of persons
A. Positive
B. Negative
C. Zero
Example:
5. Number of customers and sales
in a department store
A. Positive
B. Negative
C. Zero
What is Correlation?
Strength can be perfect, strong or
high, moderate, low, zero or no
correlation.
Note:
Correlation between two variables
does not prove X causes Y or Y causes
X.
A scatter plot (or scatter diagram)
shows how the points
collected from a set of bivariate data
are scattered on the Cartesian plane.
It gives a good visual picture of the
relationship between the two variables.
It is a graphical representation of the
relationship between two variables.
Types of Linear Correlation
Example:
Construct the scatterplots for the
following bivariate data using
Microsoft Excel and describe the
relationship in terms of direction
and strength.
[Scatter plots for Sample 1, Sample 2 and Sample 3]
Pearson Product-Moment Correlation
- The most widely used statistic for
measuring the degree of relationship
between linearly related variables.
- The Pearson r correlation
requires both variables to be normally
distributed.
Testing Normality of the Data
• To determine whether the data follow a
normal distribution, we can use a
graphical or a numerical method.
Graphical method
• Histogram
• It plots the observed
values against their
frequency and gives a
visual estimate of
whether the distribution
is bell-shaped or not.
Graphical method
• Q-Q probability Plots
• It displays the
observed values
against normally
distributed
data (represented by
the line).
Graphical method
• Q-Q probability Plots
• If the data is normally
distributed, the points
in a Q-Q plot will lie
on a straight diagonal
line.
Remember:
• Graphical methods are typically not very
useful when the sample size is small.
Numerical method
• Shapiro-Wilk Test
• One of the most popular tests for checking the
normality assumption. It has good
power properties and is based on the correlation
between the given observations and their associated normal
scores.
Hypotheses of Normality Test
• Ho: The sample data follows a normal distribution
• Ha: The sample data does not follow a normal
distribution.
• When we are testing normality:
• If the p-value is greater than alpha, we do not reject Ho and
treat the data as normal.
• If the p-value is less than or equal to alpha, we reject Ho; the data are
NOT normal.
Pearson Product-Moment Correlation
The following summarizes the correlation coefficient
and strength of relationships:
0.00 no correlation, no relationship
±0.01 to ±0.20 very low correlation, almost negligible relationship
±0.21 to ±0.40 slight correlation, definite but small relationship
±0.41 to ±0.70 moderate correlation, substantial relationship
±0.71 to ±0.90 high correlation, marked relationship
±0.91 to ±0.99 very high correlation, very dependable relationship
±1.00 perfect correlation, perfect relationship
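The table above can be sketched as a small lookup in Python. The function name `describe_strength` is hypothetical, and rounding r to two decimals before the lookup is an assumption made here so that every value falls into exactly one band.

```python
def describe_strength(r):
    """Return the table's verbal description for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = round(abs(r), 2)  # the table is stated to two decimal places
    if size == 0.0:
        return "no correlation, no relationship"
    # (upper bound of the band, description), in increasing order
    bands = [
        (0.20, "very low correlation, almost negligible relationship"),
        (0.40, "slight correlation, definite but small relationship"),
        (0.70, "moderate correlation, substantial relationship"),
        (0.90, "high correlation, marked relationship"),
        (0.99, "very high correlation, very dependable relationship"),
    ]
    for upper, label in bands:
        if size <= upper:
            return label
    return "perfect correlation, perfect relationship"

print(describe_strength(0.75))
print(describe_strength(-0.30))
```

The sign of r is ignored here because the table describes strength only; direction is read from the sign separately.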
Example
• A medical researcher wants to know if a linear relationship exists
between toddlers’ age (months) and their height (cm). Data are
shown below:
Age (months)   Height (cm)
12             75
13             72
15             70
18             80
20             81
24             80
Example
• Determine the following:
• Whether the data is normal.
• The correlation.
• Whether there is a significant relationship
between the toddlers’ age and their height.
Solution
• The data is normal.
• Based on the Q-Q plot for
age, the points lie
almost on the diagonal
line. Hence, we can say that
the age data is normal.
Solution
• The data is normal.
• Based on the Q-Q plot for
height, most
points lie almost on the
diagonal line. Hence, we
can say that the height
data is also normal.
Solution
• The data is normal.
• Based on the table, the
computed p-value for age
(0.749) and the computed
p-value for height (0.231) are
both greater than alpha
(0.05). Hence, the data for
age and height are normal.
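The decision rule used above can be sketched in Python as follows. The p-values are the ones reported in this example; the helper name `looks_normal` is hypothetical. Computing the p-values themselves needs a statistics package (for instance, SciPy provides `scipy.stats.shapiro`), which is not used here.

```python
ALPHA = 0.05  # level of significance used in this example

def looks_normal(p_value, alpha=ALPHA):
    """True when Ho (the data follow a normal distribution) is not rejected."""
    return p_value > alpha

# p-values reported for the toddler data
p_values = {"age": 0.749, "height": 0.231}

for name, p in p_values.items():
    verdict = "treated as normal" if looks_normal(p) else "not normal"
    print(f"{name}: p = {p} -> {verdict}")
```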
• Determine the correlation.
• Using the Pearson Product-Moment Correlation
formula, you can create the following columns:
        X     Y     X²     Y²      XY
        12    75    144    5625    900
        13    72    169    5184    936
        15    70    225    4900    1050
        18    80    324    6400    1440
        20    81    400    6561    1620
        24    80    576    6400    1920
Total   102   458   1838   35070   7866
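The column totals can be checked, and r computed, with a short Python sketch. It assumes the usual computational form of the formula, r = (nΣXY − ΣXΣY) / √[(nΣX² − (ΣX)²)(nΣY² − (ΣY)²)]; the variable names are chosen here for illustration.

```python
import math

# Toddler data from the example
age = [12, 13, 15, 18, 20, 24]      # X, in months
height = [75, 72, 70, 80, 81, 80]   # Y, in cm

n = len(age)
sum_x = sum(age)                                  # 102
sum_y = sum(height)                               # 458
sum_x2 = sum(x * x for x in age)                  # 1838
sum_y2 = sum(y * y for y in height)               # 35070
sum_xy = sum(x * y for x, y in zip(age, height))  # 7866

numerator = n * sum_xy - sum_x * sum_y            # 480
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 2))  # 0.75
```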
Interpretation
The Pearson correlation coefficient (r = 0.75) indicates
a high positive correlation and a
marked relationship between the toddlers’ age in months and
their height. It means that as the toddlers’ age increases,
their height tends to increase, and vice versa.
Example
• To identify if there is a significant
relationship between the two variables:
• Step 1. State the null and alternative
hypotheses.
• Step 2. Determine the value of alpha.
• Step 3. Identify the test statistic.
t-test
- Used to test if there is a significant relationship
between two sets of scores.
Example
• Step 4. Determine the degrees of freedom, computed
t-value, and critical t-value.
• Step 5. If the absolute value of the computed t is greater than or equal to
the critical value of t, then reject the null hypothesis.
If it is less than the critical value of t,
then do not reject the null hypothesis.
• Step 6. Formulate your conclusion and interpretation.
Solution
• Step 1.
• Ho: There is no significant relationship
between the toddlers’ age (in months) and
their height.
• Ha: There is a significant relationship
between the toddlers’ age (in months) and
their height.
Solution
• Step 2. The value of $\alpha$ is 0.05.
• Step 3. The t-test will be used with
degrees of freedom $df = n - 2$.
• Step 4. Since $n = 6$, the degrees of
freedom is $df = 6 - 2 = 4$.
Solution
• To find the computed t-value:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.75\sqrt{4}}{\sqrt{1-0.5625}} \approx 2.27$$
• To find the critical t-value, use the t-table. For a two-tailed
test with $\alpha = 0.05$ and $df = 4$, the critical t-value is 2.776.
Solution
• Step 5. Since the computed t-value (2.27)
is less than the critical t-value (2.776),
we fail to reject the null hypothesis.
• Step 6. There is no significant relationship
between the toddlers’ age (in months)
and their height at the 0.05 level of significance.
Although r = 0.75 is high, six pairs of observations
are too few for it to be statistically significant.
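The significance test can be reproduced with a short Python sketch, taking the null hypothesis to be that there is no significant relationship between age and height. The critical value 2.776 is read from a t-table (two-tailed, alpha = 0.05, 4 degrees of freedom) rather than computed; the variable names are chosen here for illustration.

```python
import math

r = 0.75            # Pearson correlation from the example
n = 6               # number of (age, height) pairs
critical_t = 2.776  # from the t-table: two-tailed, alpha = 0.05, df = 4

df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Reject Ho only when |t| reaches the critical value.
reject_null = abs(t) >= critical_t

print(df, round(t, 2), reject_null)  # 4 2.27 False
```

Because `reject_null` is `False`, the null hypothesis of no relationship is not rejected at the 0.05 level.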
Coefficient of Determination
- This tells us how much of the variation in the dependent
variable (Y) is due to or can be attributed
to the independent variable (X).
- This is denoted as r².
Example
From our previous example, if r = 0.75, then r² = 0.5625.
Interpretation
Fifty-six percent of the variation in
toddlers’ height is due to or can be
attributed to the variation in the toddlers’
age. The remaining 44% is due to other
factors.
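The arithmetic behind this interpretation, as a minimal Python sketch using r = 0.75 from the toddler example (the variable names are illustrative):

```python
r = 0.75                 # Pearson correlation from the toddler example

r_squared = r ** 2       # coefficient of determination
explained_pct = 100 * r_squared
unexplained_pct = 100 - explained_pct

print(r_squared, explained_pct, unexplained_pct)  # 0.5625 56.25 43.75
```

Rounded to whole percentages, these are the 56% and 44% quoted in the interpretation.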
Summary
r closer to 0 = weaker
r closer to ±1.0 = stronger
Summary
r = ±1.0 means a perfect correlation
r = 0 could mean many things:
No relationship at all between X & Y
Non-linear relationship between X & Y
Restricted range on X and/or Y
Outlier may be causing problems
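A small Python sketch of the non-linear case: Y is completely determined by X (Y = X²), yet r comes out as 0 because the relationship is not linear. The data are constructed for illustration and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation using the raw-sum (computational) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Constructed example: a perfect U-shaped (quadratic) relationship.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]   # [4, 1, 0, 1, 4]

r = pearson_r(x, y)
print(r)  # 0.0
```

This is why a scatter plot should be checked before concluding from r = 0 that two variables are unrelated.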
Thank you!