Linear Regression in SQL
Author: Julia Glick
Edited by: Jessica Schimm
Introduction
The first step is to create our predictor variable, called "X" below, which is to say, the linear effect of time. Unfortunately, time in this dataset is represented by two columns, one for year and one for month. That's not so convenient for us!

We can create the predictor by multiplying the year by 12 and adding the month; this gets us a variable that gives consecutive numbers to consecutive months. It's also really big and hard to work with, so we'll "center" the variable around 0 by subtracting off the mean time value in our dataset. If there's an odd number of months, that might mean that all our time values end with .5, but that's not a problem. Here's a query that shows this transformation.

Now that we have our X and Y variables, we need to actually do the regression. But how? This is where all those by-hand formulas that we did in Intro Stats come in handy. Formulas that are meant to be done by hand are often, in my experience, quite easy to translate into SQL.
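The article's own queries aren't reproduced here, but the two steps above can be sketched like this, with sqlite3 standing in for the warehouse. The table and column names (`observations`, `year`, `month`, `value`) are invented for illustration; the formulas are the Intro Stats ones, with the slope simplifying to sum(x*y)/sum(x*x) because x is centered.

```python
import sqlite3

# Hypothetical toy data: two full years, with value declining 0.25 per month.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (year INT, month INT, value REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [(yr, m, 50.0 - 0.25 * ((yr - 2012) * 12 + m))
     for yr in (2012, 2013) for m in range(1, 13)],
)

b, a = conn.execute("""
    WITH months AS (
        SELECT year * 12 + month AS t, value AS y  -- consecutive month numbers
        FROM observations
    ),
    xy AS (
        SELECT t - (SELECT AVG(t) FROM months) AS x, y  -- center around 0
        FROM months
    )
    SELECT SUM(x * y) / SUM(x * x) AS slope,   -- b = Sxy / Sxx (x centered)
           AVG(y)                  AS intercept -- a = mean of y (x centered)
    FROM xy
""").fetchone()
print(b, a)  # the toy data was built with a slope of -0.25
```

With an even number of months, the centered x values all end in .5, exactly as described above.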
The above plot shows the raw data again, along with the line of best fit.
For a simple linear regression, that line of best fit represents the explainable
variance. We can see that there is a downward trend on average in the dataset.
And this plot shows the detrended data. Early observations are brought down a
bit and late observations are brought up a bit to cancel out the trend. If we were
to draw a line of best fit for the detrended dataset, it would be a flat horizontal
line through the mean.
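Detrending as described above is just subtracting the fitted line from each observation. A self-contained sketch, again with invented names (the table `obs` holds an already-centered x):

```python
import sqlite3

# Invented toy data: a noisy downward trend around a centered time index x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?)",
                 [(-2, 14.0), (-1, 12.5), (0, 9.5), (1, 8.0), (2, 6.0)])

detrended = conn.execute("""
    WITH fit AS (
        SELECT SUM(x * y) / SUM(x * x) AS b,  -- slope (x is centered)
               AVG(y)                  AS a   -- intercept
        FROM obs
    )
    SELECT obs.x, obs.y - (fit.a + fit.b * obs.x) AS resid
    FROM obs CROSS JOIN fit
    ORDER BY obs.x
""").fetchall()
print(detrended)  # residuals sum to zero, with no remaining linear trend
```

By construction the residuals average to zero, which is why the line of best fit through the detrended data is flat.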
There's a lot of potential depth here, so I'll just touch on it lightly. We're going to run a significance test to see whether our coefficient b is significantly different from zero: in this case, it asks whether there is a non-zero time trend (up or down), or whether we don't have enough evidence to reject a zero trend. That's not a terribly interesting question in some sense, because with any kind of random-walk-like time series like this one, there's going to be some non-zero trend. But it might be much more interesting if, for example, our X predictor took values 0 and 1 for two different groups of users, and Y was some kind of engagement metric; in a situation like that, simple regression is the equivalent of a two-sample t-test. (Indeed, there is a deep relationship between t-tests, ANOVA designs, and linear regression, but we can't get into all of that here.)
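The test itself is another by-hand formula that translates directly: t = b / SE(b), where SE(b) = sqrt((SSE / (n - 2)) / Sxx). A hedged sketch with an invented table (the sums come from SQL; the square root is taken in Python):

```python
import math
import sqlite3

# Invented toy data with a centered x, as in the earlier sketches.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?)",
                 [(-2, 14.0), (-1, 12.5), (0, 9.5), (1, 8.0), (2, 6.0)])

b, sse, sxx, n = conn.execute("""
    WITH fit AS (
        SELECT SUM(x * y) / SUM(x * x) AS b,   -- slope (x is centered)
               AVG(y)                  AS a,   -- intercept
               SUM(x * x)              AS sxx,
               COUNT(*)                AS n
        FROM obs
    )
    SELECT MAX(fit.b),                                              -- constant
           SUM((y - fit.a - fit.b * x) * (y - fit.a - fit.b * x)),  -- SSE
           MAX(fit.sxx),
           MAX(fit.n)
    FROM obs CROSS JOIN fit
""").fetchone()

t = b / math.sqrt(sse / (n - 2) / sxx)  # compare |t| to a critical value, n-2 df
print(round(t, 2))
```

A large |t| relative to the t distribution with n - 2 degrees of freedom is evidence of a non-zero trend.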
We’ll use one of the contrasts built into R, called Helmert contrasts.
These contrasts compare each level to the average of the previous
levels: February to January; March to the average of January and
February; April to the average of January through March; and so on.
The SQL to generate them is a bit tedious, but straightforward.
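The tedium is a stack of CASE expressions, one per contrast. A sketch with invented names, shortened to four levels (the month-of-year version needs eleven such columns): contrast j is -1 for levels 1 through j, j for level j + 1, and 0 afterwards, matching R's contr.helmert coding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (month INT)")
conn.executemany("INSERT INTO obs VALUES (?)", [(m,) for m in range(1, 5)])

rows = conn.execute("""
    SELECT month,
           -- h1: level 2 vs. level 1
           CASE WHEN month < 2 THEN -1 WHEN month = 2 THEN 1 ELSE 0 END AS h1,
           -- h2: level 3 vs. the average of levels 1-2
           CASE WHEN month < 3 THEN -1 WHEN month = 3 THEN 2 ELSE 0 END AS h2,
           -- h3: level 4 vs. the average of levels 1-3
           CASE WHEN month < 4 THEN -1 WHEN month = 4 THEN 3 ELSE 0 END AS h3
    FROM obs ORDER BY month
""").fetchall()
for row in rows:
    print(row)
```

A useful property of this coding is that the contrast columns are mutually orthogonal, which is exactly what the next step is after.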
Finally, in order to get our predictors as close to perfectly orthogonal as possible, we'll want to exclude any years where we don't have all 12 months of observations. The 2014 data in this dataset is only partial, so we'll only include data from 2013 and earlier.

The completed query is here for the regression coefficients, and here for the deseasonalized and detrended data. They should look a lot like the earlier queries, but with a lot of extra repeated code.

Obviously, in a language like R or Python, this would all be much easier to implement (but also you'd probably be using a pre-existing package instead!).

Finally, once we have all of the regression coefficients, we need to combine all of them together to get the predicted values. Once we have the model's predicted values, getting the residuals is the same as before. But the meaning of those residuals is different, because we're controlling for a different set of predictors.

In the first set of queries, we were only controlling for the linear time trend, and so the residuals allowed us to create a detrended dataset. In this case, we control for both trend and for the month-of-year seasonality term, and so the residuals correspond to data that has had both trend and seasonality removed.
Predictor Correlations
Let’s look at the results. In the first plot, we have the raw data
plotted against the model’s predictions. But unlike before,
the prediction is no longer a “line of best fit.” Instead it
contains both an annually repeating component and the
linear trend we saw earlier. Sure enough, there is quite a
lot of time-of-year seasonality; the difference between the
highest and lowest performing months of the year is
substantial relative to the overall variance of the time series.
Conclusion
This has been a lot of words about a solution to a rather specific problem. Sure, lots of problems can be fit into this framework, possibly with a bit of effort, but it's still just one tool. I'd like to close with some higher-level takeaways which I think the discussion above both highlights and exemplifies.

First, the general point is that there is a wide range of statistical analyses you can do in SQL, and the algorithms and computations you need to do them already exist in many cases; it's just that they were intended for hand computation, not for SQL. Of course, there is a wide range of approaches you can't readily do in pure SQL, such as anything that needs matrix formulations, anything iterative like EM algorithms or gradient descent, and so on, but those are also generally the algorithms that aren't suited to computation by hand. Adapting these old statistical computations to SQL lets you run them on enormously vaster quantities of data than anything from the 1950s or 1970s, and that can transform their usefulness all by itself.

(As a side note: for even larger quantities of data, the above regression formulation fits nicely into an incrementally updating system. Why not try updating your regression coefficients right in your ETL?)
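The incremental idea in the side note can be sketched as follows (my own illustration, with invented batches): simple regression needs only a handful of running sums, so an ETL job can fold each new batch into them and recompute the coefficients at any time.

```python
# Sufficient statistics for simple regression: (n, sum_x, sum_y, sum_xy, sum_xx).
def update(stats, batch):
    """Fold a batch of (x, y) pairs into the running sums."""
    n, sx, sy, sxy, sxx = stats
    for x, y in batch:
        n += 1; sx += x; sy += y; sxy += x * y; sxx += x * x
    return (n, sx, sy, sxy, sxx)

def coefficients(stats):
    """Recover intercept and slope from the sums alone."""
    n, sx, sy, sxy, sxx = stats
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    a = (sy - b * sx) / n                          # intercept
    return a, b

# Two hypothetical ETL batches drawn from the exact line y = 1 + 2x.
stats = update((0, 0.0, 0.0, 0.0, 0.0), [(1, 3.0), (2, 5.0)])
stats = update(stats, [(3, 7.0), (4, 9.0)])  # later batch updates the same sums
print(coefficients(stats))
```

Note the sums never need the raw history again, which is what makes the coefficients updatable right in the pipeline.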