
Examine Correlation in Data

Journey Toward Data Fluency


Data literacy is the foundation for using and communicating with data with ease.

The Data Literacy Basics module describes quantitative variables as numerically
measurable characteristics, such as number of hours spent watching television each
day, speed measured in miles per hour, total inches of annual rainfall in a city,
sales in dollars, and amount spent on marketing.

When you are examining relationships within your data, how do you determine how
closely two variables, like sales and the amount spent on marketing, are related?
Can you use one variable to predict the other?

Correlation and regression are important techniques used to discover trends and
make predictions. While there are other important forms used in analytics, we focus
on the simplest form used in AI and analytics—linear correlation and regression.

In this unit, you gain familiarity with the concept of correlation, which describes
whether and how closely two variables move in relation to each other. You gain an
appreciation of how correlation measures association but doesn’t prove causation.
In the next unit, you explore how linear regression can be used to calculate or
predict the value of one variable based on another, in addition to measuring how
well this model fits your data.

What Is Correlation?
Correlation is a technique that can show whether and how strongly pairs of
quantitative variables are related.

Note
This unit discusses Pearson's correlation, which measures linear relationships.
Other types of correlation, including measures for nonlinear relationships, are not
covered here.

For example, do the number of daily calories consumed and body weight have a
relationship? Do people who consume more calories weigh more? Correlation can tell
you how strongly people’s weights are related to their calorie intake.

The correlation between weight and calorie intake is a simple example, but
sometimes the data you work with may not have the relationships that you expect.
Other times, you may suspect correlations without knowing which are the strongest.
Correlation analysis helps you understand your data.

When you begin your correlation analysis, you can create a scatter plot to
investigate the relationship between two quantitative variables. The variables are
plotted as Cartesian coordinates, marking how far along on a horizontal x-axis and
how far up on a vertical y-axis each data point is. In the scatter plot below, you
see the relationship between sales and the amount spent on marketing. It appears
there’s a correlation: As one variable goes up, the other seems to as well.

A scatter plot that indicates a correlation between two quantitative variables

Note
Concepts in this unit are adapted from David M. Lane's online, public domain work,
Introduction to Statistics.

Correlation Versus Causation


Now that you know how correlation is defined and how it is represented graphically,
let's discuss how to better understand correlation.

First, it’s important to know that correlation never proves causation.


Pearson’s correlation tells us only how strongly a pair of quantitative variables
are linearly related. It does not explain how or why they’re related.

For example, sales of air conditioners correlate with sales of sunscreen. People
aren’t buying air conditioners because they bought sunscreen, or vice versa. The
cause of both purchases is hot weather.
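
The air conditioner and sunscreen example can be sketched as a small simulation. Everything in it is an assumption made for illustration: the temperatures, the sales formulas, the noise levels, and the helper name pearson_r are all hypothetical, not real sales data. The point is only that two variables driven by a shared cause can correlate strongly even though neither appears in the other's formula.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation for two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


rng = random.Random(42)  # fixed seed so the simulation is repeatable

# Hypothetical daily temperatures (the shared cause).
temperature = [rng.uniform(10, 40) for _ in range(200)]

# Both sales figures depend on temperature plus their own independent noise.
# Neither sales figure is used to compute the other.
ac_sales = [5 * t + rng.gauss(0, 20) for t in temperature]
sunscreen_sales = [3 * t + rng.gauss(0, 15) for t in temperature]

r_sales = pearson_r(ac_sales, sunscreen_sales)

# Subtract the temperature-driven part; only the independent noise remains.
ac_leftover = [a - 5 * t for a, t in zip(ac_sales, temperature)]
sunscreen_leftover = [s - 3 * t for s, t in zip(sunscreen_sales, temperature)]
r_leftover = pearson_r(ac_leftover, sunscreen_leftover)

print(f"AC vs. sunscreen sales: r = {r_sales:.2f}")
print(f"After removing the temperature effect: r = {r_leftover:.2f}")
```

In this simulated setup, the first value comes out strongly positive and the second sits close to zero, which is what you would expect when hot weather is the only link between the two kinds of sales.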

How Is Correlation Measured?


Pearson’s correlation, also called the correlation coefficient, is used to measure
the strength and direction (positive or negative) of the linear relationship
between two quantitative variables. When correlation is measured in a sample of
data, the symbol used is the letter r. Pearson’s r can range from -1 to 1.

When r = 1, there is a perfect positive linear relationship between variables,
meaning that the data points fall exactly on an upward-sloping straight line: as
one variable increases, the other increases at a constant rate. When r = -1, there
is a perfect negative linear relationship between variables. In a perfect negative
correlation, the points fall exactly on a downward-sloping straight line: as one
variable increases, the other decreases at a constant rate. When r = 0, no linear
relationship between variables is indicated.
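
To make these three cases concrete, here is a minimal sketch that computes Pearson's r from its standard definition (the sum of the products of the deviations from each mean, divided by the square root of the product of the summed squared deviations). The function name pearson_r and the small data sets are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation for two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has no variation.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with 2+ points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


x = [1, 2, 3, 4, 5]

# Points exactly on an upward-sloping line: perfect positive correlation.
r_positive = pearson_r(x, [2 * v + 10 for v in x])

# Points exactly on a downward-sloping line: perfect negative correlation.
r_negative = pearson_r(x, [100 - 3 * v for v in x])

# A symmetric curve (y = x squared): clearly related, but not linearly.
x_centered = [-2, -1, 0, 1, 2]
r_curve = pearson_r(x_centered, [v * v for v in x_centered])

print(round(r_positive, 6), round(r_negative, 6), round(r_curve, 6))
```

The last case is worth noticing: the two variables are perfectly related by a curve, yet r is 0, because r only measures how well a straight line describes the relationship.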

With real data, you would rarely expect to see r values of exactly -1, 0, or 1.

Generally, the closer r is to 1 or to -1, the stronger the correlation, as shown in
the following table.

r                                  Correlation
0.90 to 1 or -0.90 to -1           Very strong correlation
0.70 to 0.89 or -0.70 to -0.89     Strong correlation
0.40 to 0.69 or -0.40 to -0.69     Modest correlation
0.20 to 0.39 or -0.20 to -0.39     Weak correlation
0 to 0.19 or 0 to -0.19            Very weak or no correlation

Note
Some resources on this topic categorize correlations simply as strong, modest, or
weak.
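
The table can be turned into a small lookup, sketched below. The function name correlation_strength is hypothetical, and one detail is an assumption: the table lists ranges to two decimal places, so this sketch treats each lower bound as a cutoff (for example, anything from 0.70 up to but not including 0.90 counts as strong).

```python
def correlation_strength(r):
    """Return the table's strength label for a Pearson r value."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("Pearson's r must be between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign gives the direction
    if size >= 0.90:
        return "very strong"
    if size >= 0.70:
        return "strong"
    if size >= 0.40:
        return "modest"
    if size >= 0.20:
        return "weak"
    return "very weak or none"


for value in (0.95, -0.75, 0.5, -0.25, 0.1):
    direction = "positive" if value > 0 else "negative"
    print(f"r = {value:+.2f}: {correlation_strength(value)} ({direction})")
```

Because these cutoffs are conventions rather than fixed rules, treat the labels as a reading aid, not a statistical test.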

Linear Correlation Conditions


For correlations to be meaningful, you need to consider some conditions: the
variables must be quantitative, the relationship must be linear, and the effect of
any outliers must be taken into account. You should check these conditions before
you run a correlation analysis.

In 1973, a statistician named Francis Anscombe developed Anscombe’s Quartet to show
the importance of graphing data visually, as opposed to simply running statistical
tests. The four data sets in his quartet share nearly identical summary statistics:
they all produce the same trend line equation and nearly the same r value. The
quartet illustrates why visualizations are so important: they help us identify
patterns within our data that summary statistics alone can hide.

In the example below, only the top-left scatter plot in the quartet meets the
criteria of being linear without any outliers. The top-right scatter plot does not
show a linear relationship, and a nonlinear model would be more appropriate. The
two scatter plots on the bottom each have outliers, which can dramatically affect
the results.

Four scatter plots with the scatter plot on the top left highlighted, showing a
linear relationship with no outliers
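
You can check the quartet's numbers yourself. The sketch below uses the values commonly published for Anscombe's four data sets and a hand-written helper, pearson_r, which is a name chosen for this example. The labels I through IV follow Anscombe's numbering; how they map onto the four positions in the figure is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation for two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Data sets I, II, and III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_four = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x_shared,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x_four,
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Compute r for each data set; all four land at roughly 0.816.
r_values = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}

for name, r in r_values.items():
    print(f"Data set {name}: r = {r:.3f}")
```

Four very different shapes produce almost the same r, which is why plotting the data before trusting the coefficient matters.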

Now that you’re more familiar with the concepts around the statistical technique of
correlation, you’re ready for the next unit, where you learn about linear
regression.
