
How to Do Linear Regression in SQL
Author: Julia Glick
Edited by: Jessica Schimm

© Mode Analytics, Inc. 2020 | mode.com


Table of Contents

Introduction
Creating the predictor variable
Validating your SQL query output in R Notebook
Detrend & deseason the dataset
Validating model outputs
Conclusion



Introduction

Ordinary least squares (OLS) regression is one of the most powerful tools in a data scientist’s pocket. Personally, as someone who specializes in causal inference, I have built most of my career on OLS and closely related analyses, from t-tests up through structural time series models.

In this intermediate-advanced tutorial, I’ll show you how you can get a simple regression analysis done in pure SQL with relatively little pain. Doing OLS in SQL is a bit tedious but not terribly complicated, especially once we bring in slightly more advanced syntax such as window functions and WITH clauses.

Why would you want to do linear regression in SQL? Well, we’ve all had to work with simpler tools than we’d prefer from time to time (like before a company has adopted Mode!). For example, you might be working with a dataset that is too large to easily export into R. I personally have had to analyze and share A/B test results when the only analysis tool I could easily access was SQL, and linear regression in SQL was a great tool for the job (but don’t ask me about the time I wrote a Mann-Whitney-Wilcoxon U test in SQL).



The dataset we’ll use shows the number of completed housing units in four major regions in the U.S. by month from January 1968 through July 2014, and has no missing observations. The data we will be working with are counts of completed housing units, which we’ll be calling variable “Y” in many of the equations below, since it is (what we will treat as) our “dependent” variable.

As you can see, there isn’t a strong time trend in the raw data (the line doesn’t seem to rise or fall very much on average over the full dataset). It does look like there might be a fair bit of time-of-year seasonality, but it’s hard to tell by eye how much of this is really seasonality (repeated year after year) and how much is variation that we won’t be able to explain.



Creating the predictor variable

We’ll start with simple linear regression (i.e., with only a single predictor) to remove the linear time trend, and work with just one of the four regions for simplicity.

The first step is to create our predictor variable, called “X” below, which is to say, the linear effect of time. Unfortunately, time in this dataset is represented by two columns, one for year and one for month. That’s not so convenient for us!

We can create the predictor by multiplying the year by 12 and adding the month; this gets us a variable that gives consecutive numbers to consecutive months. It’s also really big and hard to work with, so we’ll “center” the variable around 0 by subtracting off the mean time value in our dataset. If there’s an odd number of months, that might mean that all our time values end with .5, but that’s not a problem. Here’s a query that shows this transformation.
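The query itself is linked in the original rather than reproduced here; a minimal sketch, assuming a table housing_units with integer year and month columns and a completed_units count, might look like:

```sql
-- Sketch: build a centered linear time predictor (assumed table/column names).
-- time_raw numbers consecutive months consecutively; time_c centers it at 0.
SELECT
  year,
  month,
  completed_units AS y,
  (year * 12 + month)
    - AVG(year * 12 + month) OVER () AS time_c  -- window AVG centers X
FROM housing_units;
```

The window function AVG() OVER () computes the grand mean in the same pass over the data, which is what lets us center the variable without a separate query.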



Now that we have our X and Y variables, we need to actually do the regression. But how? This is where all those by-hand formulas that we did in Intro Stats come in handy. Formulas that are meant to be done by hand are often, in my experience, quite easy to translate into SQL.

Fortunately, I have an interest in old-fashioned computation (slide rules and so forth), so I grabbed my copy of Edwards’ Statistical Methods for the Behavioral Sciences from 1954. No computer-assisted shortcuts to worry about in this book. The critical formula is the one for the slope of the regression line.

The traditional way to do this by hand for small datasets is to write out a table with your X and Y values, then make additional columns with the X², Y², and XY values, and finally total each column. (Edwards gives an example on page 121, should you happen to have a copy of the book.) Sounds a bit like SQL, don’t you think?

Finding a line of best fit for Y = a + bX (the traditional way):

$$b = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}$$

And sure enough, the SQL query that implements this regression is pretty straightforward, if a bit long; here it is. I rely on lots of common table expressions (using WITH) to do the transformations progressively, which keeps the query and its logic more readable.
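Again, the original query is linked rather than printed in this copy; building on the transformation above (same assumed table and column names), a sketch with progressive CTEs could be:

```sql
-- Sketch: simple OLS slope and intercept from the hand-computation formulas.
WITH xy AS (
  SELECT
    completed_units AS y,
    (year * 12 + month)
      - AVG(year * 12 + month) OVER () AS x   -- centered time predictor
  FROM housing_units
),
sums AS (
  SELECT
    CAST(COUNT(*) AS numeric) AS n,
    SUM(x)                    AS sum_x,
    SUM(y)                    AS sum_y,
    SUM(x * x)                AS sum_xx,
    SUM(x * y)                AS sum_xy
  FROM xy
),
slope AS (
  SELECT
    n, sum_x, sum_y,
    (sum_xy - sum_x * sum_y / n)
      / (sum_xx - sum_x * sum_x / n) AS b     -- Edwards' slope formula
  FROM sums
)
SELECT
  b,
  sum_y / n - b * sum_x / n AS a              -- intercept from the means
FROM slope;
```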


Notice that we centered our X (time) variable in advance. That means that, in particular, ∑X = 0 by definition, so we could simplify this formula to just b = ∑XY / ∑X², but it’s already simple enough that I’d prefer to leave it in its full form.

Once we know b, getting the intercept a is easy. We just need the average values of Y and X, computed using ∑Y and ∑X, which we used earlier, together with the number of rows.
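In symbols, that’s the familiar intercept identity:

$$a = \bar{Y} - b\,\bar{X} = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$

And because we centered X, its mean is 0, so here a is simply the mean of Y.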

OK, so those are our regression coefficients, but that’s only part of the story. We want to detrend the data, which is to say, remove the variance that’s explainable by the linear effect of time. We can do that by joining the regression coefficients back to our original dataset.

Take a look at this query. There’s only a single set of coefficients, so we can copy them to every row by joining on 1=1, and then use them to create the predicted Y values Ŷ = a + bX, and the residuals or “error terms” e = Y − Ŷ.

In a time series context, these “error terms” (the part of the dataset that cannot be explained by the linear trend) are almost exactly the detrended data that we want! The only problem is that the residuals are centered around 0, so we need to add back the mean of Y to get our detrended data.
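That query is likewise only linked; a self-contained sketch of the join-back step, under the same assumed schema, might be:

```sql
-- Sketch: copy the single coefficient row onto every data row via 1=1,
-- then compute predictions, residuals, and the detrended series.
WITH xy AS (
  SELECT
    year, month,
    completed_units AS y,
    (year * 12 + month)
      - AVG(year * 12 + month) OVER () AS x
  FROM housing_units
),
coefs AS (
  SELECT
    (SUM(x * y) - SUM(x) * SUM(y) / CAST(COUNT(*) AS numeric))
      / (SUM(x * x) - SUM(x) * SUM(x) / CAST(COUNT(*) AS numeric)) AS b,
    AVG(y) AS mean_y,
    AVG(x) AS mean_x
  FROM xy
)
SELECT
  xy.year, xy.month, xy.y,
  coefs.mean_y - coefs.b * coefs.mean_x + coefs.b * xy.x AS y_hat,       -- Y-hat = a + bX
  xy.y - (coefs.mean_y - coefs.b * coefs.mean_x + coefs.b * xy.x) AS e,  -- e = Y - Y-hat
  xy.y - (coefs.mean_y - coefs.b * coefs.mean_x + coefs.b * xy.x)
    + coefs.mean_y AS y_detrended                                        -- add back mean(Y)
FROM xy
JOIN coefs ON 1 = 1;
```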




Let’s look at the results.

The above plot shows the raw data again, along with the line of best fit.
For a simple linear regression, that line of best fit represents the explainable
variance. We can see that there is a downward trend on average in the dataset.

And this plot shows the detrended data. Early observations are brought down a
bit and late observations are brought up a bit to cancel out the trend. If we were
to draw a line of best fit for the detrended dataset, it would be a flat horizontal
line through the mean.



Validating your SQL query output in R Notebook

Of course, we also need to validate that the SQL query is giving us the right answer. So let’s just hop over to an R Notebook and pull in the raw dataset. I use a query that includes the linear time trend predictor so that we don’t need to worry about whether we did that correctly.

Once we’ve done that, it’s a very simple run of the R function lm (code block #2 at the linked Notebook), and we can immediately validate that we’re getting the right coefficients. Sure enough, the “Intercept” from the lm call matches our variable “a,” and the “time_c” coefficient matches our “b.”

There’s another thing worth mentioning about this simple regression, which is not really relevant to detrending the time series but might be useful for other regression applications: running statistical significance tests.



There’s a lot of potential depth here, so I’ll just touch on it lightly. We’re going to run a significance test to see whether our coefficient b is significantly different from zero: in this case, it asks whether there is a non-zero time trend (up or down), or whether we don’t have enough evidence to reject a zero trend. That’s not a terribly interesting question in some sense, because with any kind of random-walk-like time series like this one, there’s going to be some non-zero trend. But it might be much more interesting if, for example, our X predictor took values 0 and 1 for two different groups of users, and Y was some kind of engagement metric; in a situation like that, simple regression is the equivalent of a two-sample t-test. (Indeed, there is a deep relationship between t-tests, ANOVA designs, and linear regression, but we can’t get into all of that here.)

To run our statistical significance test, we need to know the standard error on our coefficient b. Alas, Edwards 1954 does not give us this formula; he goes into the t-test and ANOVA formulations instead, because they’re generally simpler to do by hand. Wikipedia comes to the rescue and tells us that the standard error of b is:

$$se_b = \sqrt{\frac{\frac{1}{n-2}\sum e^2}{\sum \left(X - \bar{X}\right)^2}}$$

where e are our errors or residuals as we discussed above. Or, using the easily derived identity ∑(X − X̄)² = ∑X² − (∑X)²/n to turn that denominator into a by-now-familiar quantity:

$$se_b = \sqrt{\frac{\frac{1}{n-2}\sum e^2}{\sum X^2 - \frac{(\sum X)^2}{n}}}$$



With the standard error on b in hand, we just need to take the ratio t_b = b / se_b to get a random variable with a t-distribution. This t-stat gives us the statistical significance test we want, on the null hypothesis that b = 0. So, waving our hands furiously in the air, we can just glance at t and see whether it’s bigger than 2 in absolute magnitude for a traditional α = 0.05 significance test. For anything more precise, you’ll probably want to grab a table of critical values from your retro stats textbook, or switch off of SQL, of course. Here’s the query.
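The query is linked rather than embedded here; under the same assumed schema, a sketch might be:

```sql
-- Sketch: standard error of b and the t statistic, computed from residuals.
WITH xy AS (
  SELECT
    completed_units AS y,
    (year * 12 + month)
      - AVG(year * 12 + month) OVER () AS x
  FROM housing_units
),
coefs AS (
  SELECT
    (SUM(x * y) - SUM(x) * SUM(y) / CAST(COUNT(*) AS numeric))
      / (SUM(x * x) - SUM(x) * SUM(x) / CAST(COUNT(*) AS numeric)) AS b,
    AVG(y) AS mean_y,
    AVG(x) AS mean_x
  FROM xy
),
residuals AS (
  SELECT
    xy.x,
    xy.y - (coefs.mean_y - coefs.b * coefs.mean_x) - coefs.b * xy.x AS e,
    coefs.b
  FROM xy JOIN coefs ON 1 = 1
)
SELECT
  b,
  SQRT( (SUM(e * e) / (COUNT(*) - 2))
        / (SUM(x * x) - SUM(x) * SUM(x) / CAST(COUNT(*) AS numeric)) ) AS se_b,
  b / SQRT( (SUM(e * e) / (COUNT(*) - 2))
        / (SUM(x * x) - SUM(x) * SUM(x) / CAST(COUNT(*) AS numeric)) ) AS t_b
FROM residuals
GROUP BY b;   -- b is constant across rows, so grouping collapses to one row
```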

Once again, we can validate this result in R using the regression we already ran. Running summary on our lm output (cell 3 in the Notebook) gives us a nice table that includes t and p, matching our query output.



Detrend & deseason the dataset

I promised that we would both detrend and deseason the dataset, though, which leads us from simple regression to multiple regression. To accomplish this, we need to include multiple predictors: one for the linear effect, plus another eleven for the month-of-year seasonality term. (Why eleven? Because there are twelve months in the year, and you need k−1 numerical predictors to capture the variation in a categorical predictor with k values.)

Going from one predictor to twelve is going to expand our query substantially, and involve a lot of ugly copy-and-pasting. But we can do it.

There’s another problem, though. We can reproduce our formula from simple regression above for the eleven new predictors, but it’s wrong. Specifically, it’s wrong because when we do multiple regression, we need to account not only for the correlations between each predictor and the outcome (which that formula includes), but also for the correlations between each pair of predictors (which that formula does not include). And the only reasonable way to accomplish that is to use a matrix formulation, which we can’t easily code up in SQL. (You can do the special case of k=2 predictors fairly easily, but beyond that it’s just not reasonable.)



So, it looks like we’re stuck... right?

Well, in general, sure. If the predictors are correlated, then we can’t easily get the right answer from a query. But for this specific problem, there’s a path forward: we can ensure that the predictors are all uncorrelated with each other. This wouldn’t work if we were using just any old predictors that we observed in the world, say, multiple different measures of economic health. But our linear time trend and our eleven seasonality predictors aren’t “observations” in the same kind of way, and we can construct them to be mutually orthogonal to each other.

The trick is, we can’t use the obvious encoding for the seasonality terms. The easiest approach is to use eleven 0/1 indicator variables, one for each month of the year except for a single base case. So, for example, one variable would be 1 for January and 0 otherwise. Another would be 1 for February and 0 otherwise, and so on through November. December wouldn’t have an indicator variable and would represent our base case against which the other 11 months would be compared. The problem is, these 11 variables are correlated with each other.

Instead, we need to move to orthogonal contrasts. These are encodings that have the properties that 1) across all 12 months, the dot product of any two encodings is 0, and also 2) the sum across all 12 months of any single encoding is 0. (The indicator variables above satisfy 1 but not 2.) There are lots of different encodings that work, and they mean different things if you’re interested in the individual regression coefficients or statistical significance tests, but we don’t care about any of that here, because they’ll all give the same residuals and thus the same deseasonalized data.
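To restate those two properties symbolically, writing c_j(m) for the value that encoding j assigns to month m (this notation is introduced here for convenience):

$$\sum_{m=1}^{12} c_j(m)\,c_k(m) = 0 \;\text{ for all } j \neq k, \qquad \sum_{m=1}^{12} c_j(m) = 0 \;\text{ for all } j.$$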




Figure 1: Helmert Contrasts

We’ll use one of the contrasts built into R, called Helmert contrasts.
These contrasts compare each level to the average of the previous
levels: February to January; March to the average of January and
February; April to the average of January through March; and so on.
The SQL to generate them is a bit tedious, but straightforward.
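The generating query itself is only linked; a sketch of it, following R’s contr.helmert convention (column k is −1 for months 1 through k, k for month k+1, and 0 afterwards), might look like:

```sql
-- Sketch: Helmert contrast codes as CASE expressions on the assumed
-- housing_units table with an integer month column (1-12).
-- Each column sums to 0 across the 12 months, and any two columns have
-- a dot product of 0, so the encodings are mutually orthogonal.
SELECT
  year,
  month,
  completed_units AS y,
  CASE WHEN month <= 1  THEN -1 WHEN month = 2  THEN 1  ELSE 0 END AS helmert_1,
  CASE WHEN month <= 2  THEN -1 WHEN month = 3  THEN 2  ELSE 0 END AS helmert_2,
  CASE WHEN month <= 3  THEN -1 WHEN month = 4  THEN 3  ELSE 0 END AS helmert_3,
  -- ... and so on, up through ...
  CASE WHEN month <= 11 THEN -1 WHEN month = 12 THEN 11 ELSE 0 END AS helmert_11
FROM housing_units;
```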



Finally, in order to get our predictors as close to perfectly orthogonal as possible, we’ll want to exclude any years where we don’t have all 12 months of observations. The 2014 data in this dataset is only partial, so we’ll only include data from 2013 and earlier.

The completed query is here for the regression coefficients, and here for the deseasonalized and detrended data. They should look a lot like the earlier queries, but with a lot of extra repeated code.

Obviously in a language like R or Python, this would all be much easier to implement (but also you’d probably be using a pre-existing package instead!).

Of special note, we need to get the sums and “sums of squares” for each predictor, and the “sums of products” for each predictor with the outcome (in the "sums_squares" subquery), then use those to calculate the regression coefficients separately for each predictor.

Finally, once we have all of the regression coefficients, we need to combine all of them together to get the predicted values. Once we have the model’s predicted values, getting the residuals is the same as before. But the meaning of those residuals is different, because we’re controlling for a different set of predictors.

In the first set of queries, we were only controlling for the linear time trend, and so the residuals allowed us to create a detrended dataset. In this case, we control for both the trend and the month-of-year seasonality term, and so the residuals correspond to data that has had both trend and seasonality removed.
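The completed queries are linked rather than printed, but the coefficient step described above might be sketched like this, showing the linear trend plus one Helmert column (the remaining ten follow the same copy-and-paste pattern; table and column names are assumptions as before):

```sql
-- Sketch: with mutually orthogonal, zero-sum predictors, each coefficient
-- reduces to the simple-regression form b_j = sum(Xj * Y) / sum(Xj^2).
WITH coded AS (
  SELECT
    completed_units AS y,
    (year * 12 + month)
      - AVG(year * 12 + month) OVER () AS time_c,
    CASE WHEN month <= 1 THEN -1 WHEN month = 2 THEN 1 ELSE 0 END AS helmert_1
    -- ... helmert_2 through helmert_11 go here ...
  FROM housing_units
  WHERE year <= 2013  -- only complete years, to keep the columns orthogonal
),
sums_squares AS (
  SELECT
    SUM(time_c * y)            AS sp_time,  -- "sum of products" with the outcome
    SUM(time_c * time_c)       AS ss_time,  -- "sum of squares" for the predictor
    SUM(helmert_1 * y)         AS sp_h1,
    SUM(helmert_1 * helmert_1) AS ss_h1
    -- ... the same pair for each remaining predictor ...
  FROM coded
)
SELECT
  sp_time / ss_time              AS b_time,
  sp_h1 / CAST(ss_h1 AS numeric) AS b_helmert_1  -- cast avoids integer division
  -- ... one coefficient per predictor ...
FROM sums_squares;
```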



Validating model outputs

Once again, we can validate our model outputs against a linear regression in R using the standard lm function (code blocks 4 and 5). Running the equivalent model in R takes a little bit of extra work, because we need to tell R how to set up the orthogonal contrast codes for the month-of-year terms. Fortunately, we don’t need to manually code up those contrasts like we did in SQL; we can use the contrasts function to set things up.

When we look at the output (code block 5), the results are not exactly the same, unlike the simple regression case. That’s because the contrast codes are only approximately orthogonal to the linear time trend, as we can see by running cor on the output of model.matrix (code block 6). So our simplified formula doesn’t give exactly the right answer. It’s pretty close, though!




Figure 2: Predictor Correlations

Let’s look at the results. In the first plot, we have the raw data
plotted against the model’s predictions. But unlike before,
the prediction is no longer a “line of best fit.” Instead it
contains both an annually repeating component and the
linear trend we saw earlier. Sure enough, there is quite a
lot of time-of-year seasonality; the difference between the
highest and lowest performing months of the year is
substantial relative to the overall variance of the time series.



And once that prediction is subtracted from the raw data (and the mean added back in), we can recover the detrended and deseasonalized dataset.

Notice that the month-to-month swings are now somewhat less extreme than before. The data seems to have been smoothed out a little bit. This again suggests that the time-of-year seasonality was a meaningful factor in the raw dataset, and removing it helps us see where the true seasonally-adjusted peaks and valleys are.



Conclusion

This has been a lot of words about a solution to a rather specific problem. Sure, lots of problems can be fit into this framework, possibly with a bit of effort, but it’s still just one tool. I’d like to close with some higher-level takeaways which I think the discussion above both highlights and exemplifies.

First, the general point is that there is a wide range of statistical analyses you can do in SQL, and the algorithms and computations you need to do them already exist in many cases; it’s just that they were intended for hand computation, not for SQL. Of course, there is a wide range of approaches you can’t readily do in pure SQL, such as anything that needs matrix formulations, anything iterative like EM algorithms or gradient descent, etc., but those are also generally the algorithms that aren’t suited to computation by hand. Adapting these old statistical computations to SQL lets you run them on enormously vaster quantities of data than anything from the 1950s or 1970s, and that can transform their usefulness all by itself. (As a side note: for even larger quantities of data, the above regression formulation fits nicely into an incrementally updating system. Why not try updating your regression coefficients right in your ETL?)



CONCLUSION

Second, all of these methods & Finally, if you’ll allow me


approaches are tools. Different types to editorialize a bit, simple
of modeling approaches are tools. statistical methods aren’t just
Different software environments and useful, they’re worthwhile.
programming languages are tools.
In most cases, I believe, the
If you have a big toolbox, and limitations on your ability to extract
you’re flexible about how you can business-relevant understanding
fit multiple tools together, then when from raw data is not due to the
you’re confronted with a novel problem sophistication of your models.
you can craft a combination of tools Generally, you will benefit more
that fit the problem. If you don’t, then from understanding & improving
you’re going to be limited to forcing your measurement, carefully
the problem to fit the tool. And that crafting your experimental
rarely results in good analyses. designs, (whether randomized or
observational), and thinking very
deeply about the appropriate
causal interpretation of your results,
than from making your statistical
analyses more complicated. As data
scientists, we should be ready to do
“boring” analyses when they will
result in exciting business insights.

