
CORRELATION AND REGRESSION

2017 MASS TRAINING OF TEACHERS
for Senior High School
Roldan C. Bangalan
Content Standards: The learner demonstrates
understanding of key concepts of correlation and
regression analyses.

Performance Standards: The learner is able to
perform correlation and regression analyses on
real-life problems in different disciplines.

Learning Competencies The learner…
1. Illustrates the nature of bivariate data.
2. Constructs a scatter plot.
3. Describes shape (form), trend (direction), and variation
(strength) based on a scatter plot.
4. Estimates strength of association between the variables based
on a scatter plot.
5. Calculates the Pearson’s sample correlation coefficient.
6. Solves problems involving correlation analysis.
7. Identifies the independent and dependent variables.
8. Draws the best-fit line on a scatter plot.
9. Calculates the slope and y-intercept of the regression line.
10. Interprets the calculated slope and y-intercept of the regression
line.
11. Predicts the value of the dependent variable given the value of
the independent variable.
12. Solves problems involving regression analyses.
Matching Type

1. Philippines
2. Thailand
3. Indonesia
4. Singapore
5. Malaysia
What’s the Rule?

A    B
1   -1
2    1
3    3
4    5
5    7

Rule: B = 2A - 3, or equivalently 2A - B = 3
Who is he?
 Michael Fred Phelps II
(born June 30, 1985) is an
American competition swimmer
and the most decorated
Olympian of all time. As of the
2012 London Games, he had won
22 medals in three Olympiads
and held the all-time records
for Olympic gold medals (18,
double the second highest
record holders), Olympic gold
medals in individual events (11),
and Olympic medals in individual
events for a male (13).
You and Michael
 Roman writer, architect, and engineer Marcus
Vitruvius proposed, among other relationships, that
a person’s height and their arm span (herein called
“wingspan”) are approximately equal. In this
investigation, students will collect data to assess
whether or not Vitruvius’s proposal was reasonable.
Scatterplots will be drawn to illustrate the data and
a best-fit line will be overlain on the scatterplot.
The equation of the best-fit line will be determined,
and the slope interpreted in context.
 Stephen Miller, Winchester Thurston School, http://www.amstat.org/education/stew/
Michael Phelps
Guide questions:
 Assessment
 1. How well does a line fit the wingspan vs. height
data? What does that mean?
 2. Can we claim that the scatterplot represents the
relationship between height and “wingspan” in the
general population? Why or why not?
 3. What about Michael Phelps – is he like us or is he
different? How?
 4. How do your measurements compare to Michael
Phelps?
Answers
 1. The data points do seem to cluster closely to the best-fit
line; there is not a lot of deviation between the line and the
points.
 2. No, we cannot claim that the scatterplot represents the
relationship between height and wingspan in the general
population. These data values were collected for students;
there is no guarantee that as adults or younger children this
same relationship between height and wingspan holds true.
 3. Although Michael seems to follow the same general trend,
his wingspan seems to be somewhat longer compared to his
height than typical students.
 4. Answers may vary. One possible answer is “My height and
wingspan are closer to each other than are Michael’s height
and wingspan.”
Statistics @ Work
 A businessperson may want to know whether the volume of
sales for a given month is related to the amount of advertising
the firm does that month.
 Educators are interested in determining whether the number of
hours a student studies is related to the student’s score on a
particular exam.
 Medical researchers are interested in questions such as, Is
caffeine related to heart damage? or Is there a relationship
between a person’s age and his or her blood pressure?
 A zoologist may want to know whether the birth weight of a
certain animal is related to its life span.
Correlation Analysis
 Correlation analysis is a method used to measure the
strength of relationship between two variables.
 Correlation is a statistical method used to determine
whether a linear relationship between variables
exists.
Examples of Correlated Variables
 The students’ mental ability and academic
performance in school are related.
 There is a close relationship between reading
comprehension and mathematical ability.
Bivariate data
 Bivariate data is a fancy way to say, ‘two-variable
data.’ The easiest way to visualize bivariate data is
through a scatter plot.
Bivariate Data
 Can you think of pairs of variables that may be linked?
 Ice cream sales and temperature
 Hours spent studying and marks in exams
 The number of hours you work and the amount of money you earn
 Law of Supply
 Law of Demand
 Why do you think there is a link between the variables you have chosen?
Types of Correlation
1. A positive correlation exists when high scores in one
variable are associated with high scores in the
second variable. This is also true when low scores in
one variable are associated with low scores in the
other. Thus, a direct relationship exists between
positively correlated variables.
Types of Correlation
2. A negative correlation exists when high scores in one
variable are associated with low scores in the
second variable. This is also true when low scores in
one variable are associated with high scores in the
other.
Types of Correlation
3. A zero correlation exists when high scores in one
variable are associated with neither systematically high
nor systematically low scores in the other variable.
Examples (Positive Correlation)
 The more you study for a test, the higher your grade will be.
 The more you practice a sport, the better you will become.
 The more hours you work, the more money you'll have in
your bank account.
 The more you go over your notes, the higher your test scores
will be.
 The more you shoot a basketball, the easier it gets to score.
 The more clubs you join in school, the more friends you can
make.
 The more you exercise, the more weight you will lose.

 Retrieved from: https://psychlopedia.wikispaces.com/Correlation+Study


Examples (Negative Correlation)
 The more you daydream in class, the worse you will do on
the tests.
 The faster you drive the sooner you'll get where you're
going. The faster ... the less time.
 The more hours of sleep you get, the less stressed you will
be.
 The more sunscreen you wear, the less sunburned you will
get.
 The more you work, the less of a social life you'll have.
 The more you eat veggies and fruit, the less chance you will
have to take vitamins for nutrients.
 The more you exercise, the less chance of gaining weight.
Describe the relationship
 Soft drink sales and temperature
 Coffee sales and temperature
 Reaction time and age
 Mood and drug dose
Scatter Diagram
A graph of plotted points that shows the relationship
between two sets of data.
Construct a scatter plot for the data
below:
Student   Hours of study (x)   Grade (y, %)
A         6                    92
B         2                    73
C         1                    67
D         5                    98
E         2                    78
F         3                    85
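As a minimal sketch of how this scatter plot can be constructed, the Python snippet below places each (hours, grade) pair on a coarse character grid. The helper name `text_scatter` and the grid size are illustrative assumptions, not part of the training material; in class, graphing paper or a graphing tool would normally be used.

```python
# Data from the table: hours of study (x) and grade in % (y).
hours = [6, 2, 1, 5, 2, 3]
grades = [92, 73, 67, 98, 78, 85]


def text_scatter(xs, ys, width=30, height=10):
    """Return a rough text scatter plot (hypothetical helper, for illustration)."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * (width + 1) for _ in range(height + 1)]
    for x, y in zip(xs, ys):
        # Scale each value to a column or row index on the grid.
        col = round((x - x_min) / (x_max - x_min) * width)
        row = round((y - y_min) / (y_max - y_min) * height)
        grid[height - row][col] = "*"  # row 0 is the top of the plot
    return "\n".join("".join(line).rstrip() for line in grid)


plot = text_scatter(hours, grades)
print(plot)
```

The points rise from the lower left to the upper right, which suggests a positive trend between hours of study and grade.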
Scatter Plot
Scatter Diagram
Employee Age Efficiency Rating
1 44 61
2 44 41
3 45 91
4 43 76
5 40 79
6 52 67
7 43 73
8 47 94
9 54 96
Scatter Diagram (Interpretation)
Pearson Product-Moment Correlation Coefficient

 The most common statistical tool
for measuring the linear
relationship between two
random variables, x and y, is
the linear correlation coefficient,
commonly called the Pearson
Product-Moment Correlation
Coefficient or Pearson r for
short. This formula was
developed and perfected by
Karl Pearson.
Pearson Product-Moment Correlation Coefficient
(Formula)
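For reference, a standard computational form of the Pearson r for n paired observations (x, y) is given below. It is the common textbook formula, assumed here to match the one intended for this slide.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```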
Properties of the Linear Correlation
Coefficient r
 1. The value of r is always between -1 and +1
inclusive. That is, −1 ≤ r ≤ +1
 2. The value of r does not change if all values of
either variable are converted to a different scale.
 3. The value of r is not affected by the choice of x
or y. Interchange all x- and y-values and the value
of r will not change.
 4. r measures the strength of a linear relationship. It
is not designed to measure the strength of a
relationship that is not linear.
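Properties 1 to 3 can be checked numerically. The sketch below computes r with the usual computational formula (the helper `pearson_r` is an illustrative name, not from the training material), then recomputes it after converting hours to minutes and after interchanging x and y. It reuses the hours-of-study data from the scatter plot exercise.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r via the computational formula (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator


# Hours of study and grades, reused from the scatter plot exercise.
hours = [6, 2, 1, 5, 2, 3]
grades = [92, 73, 67, 98, 78, 85]

r = pearson_r(hours, grades)
minutes = [60 * h for h in hours]        # property 2: change of scale
r_rescaled = pearson_r(minutes, grades)
r_swapped = pearson_r(grades, hours)     # property 3: interchange x and y

print(round(r, 4), round(r_rescaled, 4), round(r_swapped, 4))
```

All three printed values agree. Note that the rescaling here uses a positive factor; multiplying one variable by a negative number would reverse the sign of r.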
Three Characteristics of the relationship
between two variables
 TREND (Direction) – positive or negative
 SHAPE (Form) – linear and nonlinear
 STRENGTH (Variation/Degree) – value of r
Correlation Coefficient
r                  Interpretation
±1.00              Perfect correlation
±0.80 to ±0.99     High
±0.60 to ±0.79     Moderately high
±0.40 to ±0.59     Moderate
±0.20 to ±0.39     Low
±0.01 to ±0.19     Slight/negligible
0                  No correlation

Jimenez, R. and Parreno, E. (2014). Basic Statistics. Quezon City. C & E Publishing, Inc.
Correlation Coefficient

 Exactly –1. A perfect downhill (negative) linear relationship
 –0.70. A strong downhill (negative) linear relationship
 –0.50. A moderate downhill (negative) relationship
 –0.30. A weak downhill (negative) linear relationship
 0. No linear relationship
 +0.30. A weak uphill (positive) linear relationship
 +0.50. A moderate uphill (positive) relationship
 +0.70. A strong uphill (positive) linear relationship
 Exactly +1. A perfect uphill (positive) linear relationship
 http://www.dummies.com/how-to/content/how-to-interpret-a-correlation-coefficient-r.html
Pearson Product-Moment Correlation Coefficient
(Example)

A personnel manager would like to know if there is a
relationship between knowledge factors and
practical factors of a training course. The following
scores were obtained by six trainees on the
knowledge factors, X, and the practical factors, Y, in
a training course.
Pearson Product-Moment Correlation Coefficient
(Example)

Trainee   Knowledge Factors (X)   Practical Factors (Y)
1         2                       4
2         4                       10
3         4                       8
4         5                       8
5         7                       14
6         9                       16
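A minimal Python sketch of the computation for this table, using the standard computational formula for Pearson r. The variable names are illustrative; only the data come from the example.

```python
from math import sqrt

knowledge = [2, 4, 4, 5, 7, 9]      # X, knowledge factors
practical = [4, 10, 8, 8, 14, 16]   # Y, practical factors

n = len(knowledge)
sum_x = sum(knowledge)
sum_y = sum(practical)
sum_xy = sum(x * y for x, y in zip(knowledge, practical))
sum_x2 = sum(x * x for x in knowledge)
sum_y2 = sum(y * y for y in practical)

# r = [n*Sxy - Sx*Sy] / sqrt([n*Sx2 - Sx^2][n*Sy2 - Sy^2])
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r = numerator / denominator

print(f"sum x = {sum_x}, sum y = {sum_y}, sum xy = {sum_xy}")
print(f"r = {r:.3f}")
```

The result is roughly 0.96, which falls in the "High" band of the interpretation table and indicates a strong positive linear relationship between the two sets of scores.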
Test of Significance of the Correlation Coefficient

 It is important to test whether the value of the
correlation coefficient is significant or not.
 If it is found to be significant, then a
relationship exists between the two variables.
 Otherwise, the computed r may be due to chance alone.
Test of Significance of the
Correlation Coefficient
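For reference, the usual test statistic for the null hypothesis that the population correlation is zero is the t statistic below, with n - 2 degrees of freedom. It is the standard textbook form, assumed here to match the one intended for this slide. The computed value is compared with the critical value from a t table at the chosen level of significance.

```latex
t = r\sqrt{\frac{n - 2}{1 - r^{2}}}, \qquad df = n - 2
```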
Test of Significance of the Correlation Coefficient
(Example)

 A teacher wants to know if the number of hours
spent in studying is correlated with the score
obtained in an examination. The following table
shows the number of hours spent in studying and the
scores obtained by six students. Compute the
correlation coefficient and test its significance at
the 0.01 level.
Test of Significance of the Correlation Coefficient
(Example)

Student   No. of Hours Spent in Studying (X)   Score in the Exam (Y)
A         3.0                                  20
B         2.7                                  34
C         3.8                                  19
D         2.6                                  10
E         3.3                                  24
F         3.4                                  31
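The sketch below works through this example in Python under two stated assumptions: the test statistic is the usual t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom, and the two-tailed critical value for df = 4 at the 0.01 level (4.604) is read from a standard t table rather than computed. The helper name `pearson_r` is illustrative.

```python
from math import sqrt

hours = [3.0, 2.7, 3.8, 2.6, 3.3, 3.4]   # X, hours spent studying
scores = [20, 34, 19, 10, 24, 31]        # Y, exam scores


def pearson_r(xs, ys):
    """Pearson r via the computational formula (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
    )


n = len(hours)
r = pearson_r(hours, scores)
t = r * sqrt((n - 2) / (1 - r**2))       # test statistic with df = n - 2

# Assumption: two-tailed critical t for df = 4, alpha = 0.01, from a t table.
T_CRITICAL = 4.604
significant = abs(t) > T_CRITICAL

print(f"r = {r:.3f}, t = {t:.3f}, significant at 0.01: {significant}")
```

Here r is only about 0.11 and the computed t is far below the critical value, so the correlation is not significant at the 0.01 level for these six students.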
Problem 1:
 Ruben wants to see the degree of relationship that
exists between Jojo’s scores in English (x) and
Trigonometry (y). The data are given below. Is the
obtained relationship significant at α = 0.05?
 English (x) 14 12 18 20 8 10 12
 Trigonometry (y) 10 9 12 13 7 8 9
Problem 2
 A researcher wishes to determine if a person’s age
is related to the number of hours he or she exercises
per week. The data for the sample are shown here.
Test the significance of r at α = 0.05.
 Age (x) 18 26 32 38 52 59
 Hours (y) 10 5 2 3 1.5 1
SPSS Result (Pearson r)
Spearman Rank-Order
Correlation Coefficient

 The Pearson product-moment correlation coefficient
is most appropriate when the data are on an interval or
ratio scale.
 For ordinal data, the Spearman rank-order
correlation coefficient, Spearman rho (ρ), computed
from the ranks of the variables, is used to determine
the strength of the relationship between two variables.
Spearman Rank-Order Correlation Coefficient
(Formula)
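For reference, the usual form of the Spearman rank-order coefficient is shown below, where d is the difference between the two ranks of each individual and n is the number of pairs. It is the standard textbook formula, assumed here to match the one intended for this slide. It is exact when there are no tied ranks and an approximation otherwise.

```latex
\rho = 1 - \frac{6\sum d^{2}}{n\left(n^{2} - 1\right)}
```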
Spearman Rank-Order Correlation Coefficient

 Rank the performance of the following students in
their History and Literature classes. Then, use the
Spearman rho coefficient to test whether their ranks
in the two subjects are significantly correlated.
Use the 5% level of significance.
Spearman Rank-Order Correlation Coefficient
(Example)

STUDENT   PERFORMANCE IN HISTORY   PERFORMANCE IN LITERATURE
A 78 79
B 77 80
C 88 85
D 84 78
E 80 89
F 85 80
G 79 80
H 88 85
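A minimal sketch of the ranking and the rho computation for this table. It assumes that rank 1 goes to the highest score, that tied scores share the mean of the ranks they occupy, and that the simple formula ρ = 1 − 6Σd² / (n(n² − 1)) is used even though ties are present. The helper name `average_ranks` is illustrative.

```python
history = [78, 77, 88, 84, 80, 85, 79, 88]      # students A to H
literature = [79, 80, 85, 78, 89, 80, 80, 85]


def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values get their mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first position held by v
        count = ordered.count(v)                # number of values tied with v
        ranks.append(first + (count - 1) / 2)   # mean of the tied positions
    return ranks


rank_h = average_ranks(history)
rank_l = average_ranks(literature)

# d is the difference between the two ranks of each student.
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_h, rank_l))
n = len(history)
rho = 1 - 6 * sum_d2 / (n * (n**2 - 1))

print("History ranks:   ", rank_h)
print("Literature ranks:", rank_l)
print(f"sum of d^2 = {sum_d2}, rho = {rho:.3f}")
```

The computed rho of about 0.43 is then compared with the critical value for n = 8 at the 5% level from a Spearman table to decide on significance.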
Possible Relationships Between
Variables
 When the null hypothesis has been rejected for a specific α
value, any of the following five possibilities can exist.
 1. There is a direct cause-and-effect relationship between the
variables. That is, x causes y. For example, water causes
plants to grow, poison causes death, and heat causes ice to
melt.
 2. There is a reverse cause-and-effect relationship between the
variables. That is, y causes x. For example, suppose a
researcher believes excessive coffee consumption causes
nervousness, but the researcher fails to consider that the
reverse situation may occur. That is, it may be that an
extremely nervous person craves coffee to calm his or her
nerves.
Possible Relationships Between
Variables
 3. The relationship between the variables may be caused by a third variable.
For example, if a statistician correlated the number of deaths due to
drowning and the number of cans of soft drink consumed daily during the
summer, he or she would probably find a significant relationship. However,
the soft drink is not necessarily responsible for the deaths, since both
variables may be related to heat and humidity.
 4. There may be a complexity of interrelationships among many variables. For
example, a researcher may find a significant relationship between students’
high school grades and college grades. But there probably are many other
variables involved, such as IQ, hours of study, influence of parents,
motivation, age, and instructors.
 5. The relationship may be coincidental. For example, a researcher may be
able to find a significant relationship between the increase in the number of
people who are exercising and the increase in the number of people who
are committing crimes. But common sense dictates that any relationship
between these two values must be due to coincidence.
 Thus, when the null hypothesis is rejected, the
researcher must consider all possibilities and select
the appropriate one as determined by the study.
Remember, correlation does not necessarily imply
causation.
Correlation does not necessarily imply
causation.
 There is a strong positive correlation between ice
cream sales and murder rates in the summer.
 As ice cream sales rise, so do murder rates.
 Is this because eating ice cream makes us want to
murder people?
 The actual explanation is that when the weather is hot,
more people buy ice cream, but they also go out more,
drink more, and socialize more, leading to an increase
in murder rates. Extreme temperatures observed in the
summer also have been shown to increase aggression.
 Source: Boundless. “Correlational Research.” Boundless Psychology. Boundless, 26 May 2016. Retrieved 29 May 2016 from
https://www.boundless.com/psychology/textbooks/boundless-psychology-textbook/researching-psychology-2/types-of-research-studies-27/correlational-research-125-12660/
REGRESSION ANALYSIS
What’s next?

x    1   2   3   4   5
y   -2   1   4   7   ?

Answer: 10, since y = 3x - 5
Follow the rule

x    2   3   4   5   7
y   -3  -5  -7  -9   ?

Answer: -13, since y = -2x + 1
What’s the pattern?

x 18 26 32 38 52 59
y 10 5 2 3 1.5 ??

y=?
Scatter It! (Predict Billy’s Height)
 In this lesson, students explore the relationship
between age and height in order to help a
hypothetical student predict his height in two years.
Students will examine data that will enable them to
create a scatterplot and approximate a line of best
fit. The scatterplot and line of best fit will be used
to predict height. The slope of the line of best fit
will be interpreted in context.
 Susan Haller, St. Cloud State University
Explain to the class that Billy’s parents measured each of their
children’s heights on the first day of school every year.

Person Grade Age Height (in inches)
Billy K 5 42.00
Billy 1 6 44.50
Billy 2 7 46.25
Billy 3 8 50.00
Billy 4 9 53.50
Billy 5 10 56.25
Billy 6 11 59.25
Billy 7 12 63.25
Billy 8 13 68.00
Guide questions:
 1. After the graph of Billy’s data has been constructed, ask
students to share anything they notice within the pattern of Billy’s
growth – does there seem to be a relationship between age and
height? Does this relationship appear to be linear? Discuss the
properties of a line of best fit, and have students place the
spaghetti in their scatterplot to approximate the line of best fit for
Billy’s data. Next, have students locate Billy’s age in grade 10 (he
would be 15 at the start of that school year), and use the line of
best fit to predict Billy’s height. Students should compare their
results with the rest of the class (predictions will vary because the
lines will be slightly different).
 Students should then enter the data into a graphing calculator or
appropriate computer statistical software program and create a
scatterplot of and line of best fit for Billy’s data. Alternatively, the
teacher can calculate the line of best fit. In this case, the line of best
fit is y = 3.21x + 24.8.
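The stated line can be checked with a short least-squares computation. The sketch below assumes the usual formulas b = [nΣxy − ΣxΣy] / [nΣx² − (Σx)²] and a = (Σy − bΣx) / n, applied to Billy's ages and heights from the table. The variable names are illustrative.

```python
ages = [5, 6, 7, 8, 9, 10, 11, 12, 13]
heights = [42.00, 44.50, 46.25, 50.00, 53.50, 56.25, 59.25, 63.25, 68.00]

n = len(ages)
sum_x, sum_y = sum(ages), sum(heights)
sum_xy = sum(x * y for x, y in zip(ages, heights))
sum_x2 = sum(x * x for x in ages)

# Least-squares slope and y-intercept.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
intercept = (sum_y - slope * sum_x) / n

# Billy is 15 at the start of grade 10.
predicted_at_15 = intercept + slope * 15

print(f"y = {slope:.2f}x + {intercept:.1f}")
print(f"Predicted height at age 15: {predicted_at_15:.1f} inches")
```

The slope of about 3.21 means Billy grew roughly 3.2 inches per year over the recorded ages. The prediction of about 73 inches at age 15 extends the line beyond the observed ages (5 to 13), so it should be read with caution.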
REGRESSION ANALYSIS
 Regression Analysis deals with the estimation of one
variable based on the changes or movements of the
other variable.
 If two variables are correlated, then it is possible to
predict or estimate the value of one variable from the
knowledge of the other variable.
 The goal of regression analysis is to describe the
relationship between two variables based on observed
data and to predict the value of the dependent
variable based on the value of the independent
variable.
Line of best fit
 A line on a scatter plot which can be drawn near
the points to more clearly show the trend between
two sets of data
Line of best fit
 Does your line minimize
the average distance
from it to each of the
data points?
 Best fit means that the
sum of the squares of
the vertical distances
from each point to the
line is at a minimum.
 The reason you need a line of
best fit is that the values of y
will be predicted from the
values of x; hence, the closer
the points are to the line, the
better the fit and the prediction
will be.
How do I construct a best-fit line?
 A best-fit line is meant to mimic the trend of the data. In
many cases, the line may not pass through very many of
the plotted points. Instead, the idea is to get a line that
has equal numbers of points on either side.
 There are two possible ways to construct a best-fit line:
 The first method involves enclosing the data in an area.
 The second method involves dividing data into two
equal groups, approximating the center of each group,
and constructing a line between the two centers.
 http://serc.carleton.edu/mathyouneed/graphing/bestfit.html
STEPS FOR CONSTRUCTING A BEST-FIT LINE
USING THE AREA METHOD
 1. Begin by plotting all your data on
graph paper.
 2. Examine the data and determine the
visual trend of data. Does it look like a
line? A blob? Does x increase as y
increases? Try to visualize approximately
where the trend should be.
 3. Draw a shape that encloses all of the
data (try to make it smooth and relatively
even).
 4. Draw a line that divides the area that
encloses the data into two even sized
areas. In other words, bisect the area with
a line that goes from one edge of the plot
to the other.
 5. Congratulations! You have just
constructed a best fit line through the
data!
STEPS FOR CONSTRUCTING A BEST-FIT LINE
USING THE DIVIDING METHOD
 1. Begin by plotting all your data on graph
paper.
 2. Examine the data and determine the
visual trend of data. Does it look like a
line? A blob? Does x increase as y
increases? Try to visualize approximately
where the trend should be.
 3. Draw a line that divides the data points
in two equal groups (even numbers of points
on either side).
 4. Place an x (or a dot) at the center of the
clusters of data on either side of the line
you drew in part 3.
 5. Draw a line that connects the two x -
marks (or points) that you drew in part 4.
 6. Congratulations! You have just constructed
a best fit line through the data!
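The dividing method is done by eye on graph paper, but a rough numerical analogue can be sketched. The snippet below assumes that the dividing line in step 3 separates the points into a lower-x half and an upper-x half, and that the "center" in step 4 is the mean point of each half; both are interpretations for illustration, not the procedure's exact definition. It reuses the hours-of-study data from the scatter plot exercise.

```python
# Hours of study (x) and grade (y), reused from the scatter plot exercise.
points = [(6, 92), (2, 73), (1, 67), (5, 98), (2, 78), (3, 85)]


def centroid(group):
    """Mean x and mean y of a group of points (illustrative helper)."""
    xs = [p[0] for p in group]
    ys = [p[1] for p in group]
    return sum(xs) / len(xs), sum(ys) / len(ys)


# Step 3 (assumed reading): split the points into two equal groups by x.
ordered = sorted(points)
half = len(ordered) // 2
left, right = ordered[:half], ordered[half:]

# Step 4: mark the center of each group.
(x1, y1), (x2, y2) = centroid(left), centroid(right)

# Step 5: the line through the two centers.
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1

print(f"y = {intercept:.2f} + {slope:.2f}x")
```

The resulting line is an approximation of the trend, not the least-squares line, so its slope and intercept will generally differ somewhat from those given by the regression formulas. With an odd number of points the two groups cannot be exactly equal, and the split would need a judgment call.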
Outliers
 A scatter plot should be checked for outliers. An outlier is a point that
seems out of place when compared with the other points. Some of
these points can affect the equation of the regression line. When this
happens, the points are called influential points or influential
observations. When a point on the scatter plot appears to be an
outlier, it should be checked to see if it is an influential point. An
influential point tends to “pull” the regression line toward the point
itself. To check for an influential point, the regression line should be
graphed with the point included in the data set.
 Then a second regression line should be graphed that excludes the
point from the data set. If the position of the second line is changed
considerably, the point is said to be an influential point. Points that
are outliers in the x direction tend to be influential points.
Outliers
 Researchers should use their judgment as to whether
to include influential observations in the final
analysis of the data. If the researcher feels that the
observation is not necessary, then it should be
excluded so that it does not influence the results of
the study. However, if the researcher feels that it is
necessary, then he or she may want to obtain
additional data values whose x values are near the
x value of the influential point and then include
them in the study.
REGRESSION ANALYSIS (formula)
 y = a + bx
where
 y = criterion measure (dependent variable)
 x = predictor (independent variable)
 a = y-intercept, the ordinate of the point where the
regression line crosses the y-axis
 b = beta weight, or the slope of the line
REGRESSION ANALYSIS (formula)
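For reference, the usual least-squares formulas for the slope b and the y-intercept a of the line y = a + bx, for n paired observations, are given below. They are the standard textbook forms, assumed here to match the ones intended for this slide.

```latex
b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
a = \frac{\sum y - b\sum x}{n} = \bar{y} - b\,\bar{x}
```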
Sample Problem
 A researcher wishes to determine if a person’s age
is related to the number of hours he or she exercises
per week. The data for the sample are shown here.
Test the significance of r at α = 0.05. Predict the
number of hours a 21-year-old person exercises per
week.
 Age (x) 18 26 32 38 52 59
 Hours (y) 10 5 2 3 1.5 1
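A minimal Python sketch of the full workflow for this problem: compute r, test it with t = r·sqrt((n − 2)/(1 − r²)), fit y = a + bx by least squares, and predict at x = 21. The two-tailed critical value for df = 4 at α = 0.05 (2.776) is an assumption read from a standard t table, and the variable names are illustrative.

```python
from math import sqrt

ages = [18, 26, 32, 38, 52, 59]      # x
hours = [10, 5, 2, 3, 1.5, 1]        # y, hours of exercise per week

n = len(ages)
sum_x, sum_y = sum(ages), sum(hours)
sum_xy = sum(x * y for x, y in zip(ages, hours))
sum_x2 = sum(x * x for x in ages)
sum_y2 = sum(y * y for y in hours)

# Building blocks shared by r and the regression slope.
nsxy = n * sum_xy - sum_x * sum_y
nsxx = n * sum_x2 - sum_x**2
nsyy = n * sum_y2 - sum_y**2

r = nsxy / sqrt(nsxx * nsyy)
t = r * sqrt((n - 2) / (1 - r**2))

# Assumption: two-tailed critical t for df = 4, alpha = 0.05, from a t table.
T_CRITICAL = 2.776
significant = abs(t) > T_CRITICAL

b = nsxy / nsxx                  # slope
a = (sum_y - b * sum_x) / n      # y-intercept
predicted = a + b * 21           # hours per week for a 21-year-old

print(f"r = {r:.3f}, t = {t:.3f}, significant: {significant}")
print(f"y = {a:.3f} + ({b:.3f})x")
print(f"Predicted hours at age 21: {predicted:.2f}")
```

Under these assumptions r is about −0.83, the absolute value of t exceeds the critical value, and the fitted line predicts roughly 6.7 hours of exercise per week for a 21-year-old. The negative slope says that each additional year of age is associated with about 0.18 fewer hours of exercise per week.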
SPSS Result (Pearson r)
SPSS Result (Scatter Plot)
SPSS Result (y = a + bx)
REGRESSION ANALYSIS (example)
Data about the advertising cost of a product and its annual sales
in million pesos are shown below.
Advertising Cost (x)   Annual Sales (y)
0.9 2
1.1 4
1.1 6
1.4 8
1.5 10
1.9 12
1. Write the regression equation for predicting the annual sales
from the advertising cost.
2. Predict the annual sales of the product if the advertising cost is
1.3 million pesos.
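Both questions can be answered with a short least-squares computation. The sketch below assumes the usual formulas b = [nΣxy − ΣxΣy] / [nΣx² − (Σx)²] and a = (Σy − bΣx) / n for the line y = a + bx. The variable names are illustrative; the data are from the table, in million pesos.

```python
cost = [0.9, 1.1, 1.1, 1.4, 1.5, 1.9]   # x, advertising cost
sales = [2, 4, 6, 8, 10, 12]            # y, annual sales

n = len(cost)
sum_x, sum_y = sum(cost), sum(sales)
sum_xy = sum(x * y for x, y in zip(cost, sales))
sum_x2 = sum(x * x for x in cost)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)   # slope
a = (sum_y - b * sum_x) / n                                  # y-intercept

predicted = a + b * 1.3   # annual sales when advertising cost is 1.3

print(f"y = {a:.3f} + {b:.3f}x")
print(f"Predicted annual sales: {predicted:.2f} million pesos")
```

The fitted equation is approximately y = −6.20 + 10.03x, which predicts annual sales of about 6.83 million pesos for an advertising cost of 1.3 million pesos. The negative intercept has no practical meaning here, since an advertising cost of zero lies outside the range of the data.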
TAYO para sa EDUKASYON!

Roldan C. Bangalan, MST


rbangalan@spup.edu.ph
St. Paul University Philippines
Tuguegarao City, Cagayan
Caritas Christi urget nos!