
Piecewise linear regression with R

Prerequisites

This module assumes that the user has installed R and has some basic familiarity with the installation of packages.

Preliminaries

Linear regression in R is accomplished through the lm() function. lm() fits a linear model to the dataset supplied by the user, with the functional form specified by a formula object given between the parentheses. For example, the most basic formula is:

y ~ x

This fits the regression model in which the dependent/response variable y is determined by the single independent variable x. To regress y against more variables we can write y ~ x + z, where x and z are the independent variables. When we are interested in interactions between the independent variables, we may write y ~ x + z + x:z, or equivalently y ~ x*z, since * expands to the main effects plus their interaction. The functional form of the regression is thus simply told to R, which takes these terms and constructs the necessary linear regression for you.
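As a quick, self-contained illustration of the interaction syntax (the variables here are throwaway and unrelated to the example data built below):

## Throwaway data purely to illustrate formula syntax
set.seed(1)
x <- rnorm(30); z <- rnorm(30)
y <- 1 + 2*x - z + 0.5*x*z + rnorm(30)
fit <- lm(y ~ x * z)   ## same model as y ~ x + z + x:z
summary(fit)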
Create the data for this example:

x <- c(1:10, 13:22)
y <- numeric(20)
## Create first segment
y[1:10] <- 20:11 + rnorm(10, 0, 1.5)
## Create second segment
y[11:20] <- seq(11, 15, len=10) + rnorm(10, 0, 1.5)
## Plot it
par(mar=c(4,4,1,1)+0.2)
plot(x, y, ylim=c(5, 20), pch=16)
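Because rnorm() injects random noise, your numbers will differ from those printed later in this module. Calling set.seed() before re-running the data-creation code above makes a run reproducible (the seed value is arbitrary):

set.seed(42)  ## any fixed seed; repeats the same random noise on every run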

Alternative 1: Iterative searching


Fit line segments to the observed data by finding the set of best-fitting regression lines such that the overall mean squared error (MSE) across the segments is minimized.
Formula: y ~ x * I(x < c) + x * I(x >= c)

Guess where c is? Start with a range of guesses, then search for the model that has the lowest residual MSE:
breaks <- x[which(x >= 9 & x <= 17)]
## alternative: breaks <- seq(1, 20, 0.1)
mse <- numeric(length(breaks))
## alternative: mse <- matrix(0, 1, length(breaks))
for(i in 1:length(breaks)){
  piecewise1 <- lm(y ~ x*(x < breaks[i]) + x*(x >= breaks[i]))
  mse[i] <- summary(piecewise1)$sigma  ## residual standard error
  ## alternative: mse[i] <- summary(piecewise1)[6]
}
mse <- as.numeric(mse)
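The same search can be written more compactly without an explicit loop; a sketch equivalent to the code above:

mse <- sapply(breaks, function(b) {
  fit <- lm(y ~ x*(x < b) + x*(x >= b))
  summary(fit)$sigma  ## residual standard error for this candidate break
})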

We can graphically inspect the MSE curve to check for the preferred breakpoints:
## look at the data in different ways; try each of the three commands below
plot(mse)
plot(mse, type="l")
plot(mse, type="o")

On the other hand, we may not be able to spot the lowest point by eye, so it is safer and more direct to look for the minimum explicitly:
breaks[which(mse==min(mse))]

As we can see, there is no unique breakpoint, so we shall pick the highest value for the breakpoint, or 14.
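That choice can also be made programmatically; a one-line sketch picking the largest of the tied breakpoints (14 for the run shown in this module):

bp <- max(breaks[which(mse == min(mse))])  ## largest breakpoint among the ties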
piecewise2 <- lm(y ~ x*(x < 14) + x*(x > 14))
summary(piecewise2)

The output from the linear regression is presented in the usual lm() format, but we may interpret the estimates as shown in the table below.

Interpretation:

Line segment    Intercept            Slope coefficient
x < 14          8.7232 + 13.4874     0.4117 - 1.4293
x > 14          8.7232 - 2.8868      0.4117
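Rather than retyping the estimates by hand, one can pull them out of the fitted model; a minimal sketch (the term labels R assigns to the logical indicator terms can be surprising, so print the vector first):

cf <- coef(piecewise2)  ## named vector of estimates
cf                      ## inspect the labels R assigned
## e.g. the left segment's intercept is cf["(Intercept)"] + cf["x < 14TRUE"]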

Let's inspect the results of our guess using this code:


plot(x, y, ylim=c(5, 20), pch=16)
curve((8.7232 + 13.4874) + (0.4117 - 1.4293)*x, add=T, from=1, to=14)
curve((8.7232 - 2.8868) + 0.4117*x, add=T, from=14, to=max(x))
abline(v=14, lty=3)


We see that the breakpoint does not result in continuous line segments, i.e. the start of the second line does not coincide with the end of the first. However, we are only getting what we asked for: the method we chose does not aim for continuity, but rather tries to find the best-fitting individual segments. For the best set of joined segments, we have to modify the problem explicitly. Another issue with the iterative method is that we had to interpret the output and draw the two curves by hand. Finally, the iterative method as shown can create only two line segments.

Alternative 2: Segmented lines


What if the formula we solve for does not allow jumps between the line segments?
Formula: y = b0 + b1*x + b2*(x - c)*I(x > c)

Here I(x > c) is an indicator that equals 1 when x exceeds the breakpoint c and 0 otherwise, so the second slope term switches on only past the breakpoint and the two segments are forced to meet at x = c.
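For a single fixed breakpoint, this model can be fit directly with lm() by building the hinge term by hand; a minimal sketch, reusing x and y from above (the breakpoint value 14 is just an illustration):

c0 <- 14                     ## assumed breakpoint, for illustration only
hinge <- pmax(x - c0, 0)     ## equals (x - c0) * I(x > c0)
joined <- lm(y ~ x + hinge)  ## segments forced to meet at x = c0
summary(joined)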

We search for the breakpoints at which the residual sum of squares of this model is minimized, within the same data we've already experimented with above. For the sake of showing what this method is capable of, we shall guess that there are three distinct line segments.
library(segmented)
testlm <- lm(y ~ x)
o <- segmented(testlm, seg.Z = ~x, psi = list(x = c(4, 17)),
               control = seg.control(display = FALSE))
o

The segmented package improves on our guess and estimates that the best breakpoints are at 2.00 and 10.62 respectively. This produces the plot shown below (using these commands):



plot(x, y)
plot(o, add=T)
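The estimated breakpoints and per-segment slopes can also be read off the fitted object directly (both accessors come with the segmented package):

o$psi     ## estimated breakpoints with standard errors
slope(o)  ## slope estimate for each fitted segment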

The weakness with this second method, of course, is that we need to provide initial estimates for where the breakpoints may be. If these breakpoints are too close to each other or to the extremes of the dataset, the algorithm will fail. You may experience this yourself with the initial breakpoint set I've suggested above; try changing the values until the algorithm works, as sketched below.
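One way to automate that trial-and-error is to wrap segmented() in tryCatch() and scan a list of candidate starting pairs, keeping the first fit that succeeds; a sketch (the candidate pairs here are arbitrary guesses):

starts <- list(c(4, 17), c(5, 15), c(8, 14))  ## arbitrary candidate guesses
for (psi0 in starts) {
  o <- tryCatch(segmented(testlm, seg.Z = ~x, psi = list(x = psi0)),
                error = function(e) NULL)
  if (!is.null(o)) break  ## keep the first starting pair that converges
}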
