When you are examining relationships within your data, how do you determine how
closely two variables, like sales and the amount spent on marketing, are related?
Can you use one variable to predict the other?
Correlation and regression are important techniques for discovering trends and
making predictions. Other forms are used in analytics, but this unit focuses on the
simplest form used in AI and analytics: linear correlation and regression.
In this unit, you gain familiarity with the concept of correlation, which describes
whether and how closely two variables move in relation to each other. You gain an
appreciation of how correlation measures association but doesn’t prove causation.
In the next unit, you explore how linear regression can be used to calculate or
predict the value of one variable based on another, in addition to measuring how
well this model fits your data.
What Is Correlation?
Correlation is a technique that can show whether and how strongly pairs of
quantitative variables are related.
Note
This unit discusses Pearson's correlation, which measures linear relationships.
Other correlation measures, including ones suited to nonlinear relationships, are
not covered here.
For example, do the number of daily calories consumed and body weight have a
relationship? Do people who consume more calories weigh more? Correlation can tell
you how strongly people’s weights are related to their calorie intake.
The correlation between weight and calorie intake is a simple example, but
sometimes the data you work with may not have the relationships that you expect.
Other times, you may suspect correlations without knowing which are the strongest.
Correlation analysis helps you understand your data.
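As a minimal sketch of what such an analysis computes, the snippet below calculates Pearson's correlation coefficient directly from its definition: the sum of the products of each variable's deviations from its mean, divided by the square root of the product of the two sums of squared deviations. The function name pearson_r and the calorie and weight figures are hypothetical, chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return co_deviation / math.sqrt(ss_x * ss_y)


# Hypothetical data (not from the source): daily calories and body weight in kg.
calories = [1800, 2000, 2200, 2500, 2800, 3000]
weight_kg = [58, 64, 63, 72, 80, 79]

r = pearson_r(calories, weight_kg)
print(f"r = {r:.2f}")
```

A value close to 1 here would indicate that, in this made-up sample, higher calorie intake goes together with higher weight.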
When you begin your correlation analysis, you can create a scatter plot to
investigate the relationship between two quantitative variables. The variables are
plotted as Cartesian coordinates, marking how far along on a horizontal x-axis and
how far up on a vertical y-axis each data point is. In the scatter plot below, you
see the relationship between sales and the amount spent on marketing. It appears
there’s a correlation: As one variable goes up, the other seems to as well.
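In practice a plotting library would draw the scatter plot. As a dependency-free sketch of the same idea, the snippet below places each (x, y) pair on a character grid, with x increasing to the right and y increasing upward. The function text_scatter and the marketing and sales figures are hypothetical, and the function assumes each variable takes more than one distinct value.

```python
def text_scatter(xs, ys, width=30, height=10):
    """Return a character-grid scatter plot as a list of strings, top row first."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column (x) or row (y) index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    # Prefix each row with a y-axis bar and append an x-axis line.
    return ["|" + "".join(line) for line in grid] + ["+" + "-" * width]


# Hypothetical data (not from the source): marketing spend and sales.
marketing = [10, 15, 20, 25, 30, 35, 40]
sales = [120, 150, 145, 190, 210, 205, 250]

rows = text_scatter(marketing, sales)
print("\n".join(rows))
```

The points should run roughly from the lower left to the upper right, the visual signature of a positive correlation.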
Note
Concepts in this unit are adapted from David M. Lane's online, public domain work,
Introduction to Statistics.
For example, sales of air conditioners correlate with sales of sunscreen. People
aren’t buying air conditioners because they bought sunscreen, or vice versa. The
cause of both purchases is hot weather.
With real data, you would not expect the correlation coefficient, r, to be exactly
-1, 0, or 1.
r                                   Correlation
0.90 to 1 or -0.90 to -1            Very strong correlation
0.70 to 0.89 or -0.70 to -0.89      Strong correlation
0.40 to 0.69 or -0.40 to -0.69      Modest correlation
0.20 to 0.39 or -0.20 to -0.39      Weak correlation
0 to 0.19 or 0 to -0.19             Very weak correlation
Note
Some resources on this topic categorize correlations simply as strong, modest, or
weak.
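The bands above can be expressed as a small lookup, sketched below. The function name describe_correlation is hypothetical, the labels used for the outermost bands ("very strong" and "very weak") should be treated as assumptions, and values that fall between two listed ranges (for example 0.895) are assigned to the lower band.

```python
def describe_correlation(r):
    """Map a correlation coefficient to a descriptive strength label."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    strength = abs(r)  # the sign gives the direction, not the strength
    if strength >= 0.90:
        return "very strong"  # assumed label for the top band
    if strength >= 0.70:
        return "strong"
    if strength >= 0.40:
        return "modest"
    if strength >= 0.20:
        return "weak"
    return "very weak"  # assumed label for the bottom band


for value in (0.95, -0.75, 0.45, -0.25, 0.10):
    print(f"{value:+.2f}: {describe_correlation(value)}")
```

Taking the absolute value first means a coefficient of -0.75 is reported as just as strong as one of 0.75.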
In the example below, only the top-left scatter plot in the quartet meets the
criteria of being linear without any outliers. The top-right scatter plot does not
show a linear relationship, so a nonlinear model would be more appropriate. The
two scatter plots on the bottom each have outliers, which can dramatically affect
the results.
Figure: Four scatter plots, with the top-left plot highlighted to show a linear
relationship with no outliers.
Now that you’re more familiar with the concepts around the statistical technique of
correlation, you’re ready for the next unit, where you learn about linear
regression.