When we fit a curve to a set of data, we are finding a function that relates an

independent variable (inches horizontally from the launch point in this example)

to a predicted value of a dependent variable (inches above the launch point in

this example). Asking about the goodness of a fit is equivalent to asking about

the accuracy of these predictions. Recall that the fits were found by minimizing

the mean square error. This suggests that one could evaluate the goodness of a

fit by looking at the mean square error. The problem with that approach is that

while there is a lower bound for the mean square error (zero), there is no upper

bound. This means that while the mean square error is useful for comparing

the relative goodness of two fits to the same data, it is not particularly useful for

getting a sense of the absolute goodness of a fit.
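One way to see why the mean square error has no natural scale is to express the same measurements in different units. The sketch below is only an illustration: it assumes NumPy is available, and the data values and the helper name `meanSquareError` are made up for the example rather than taken from the text.

```python
import numpy as np

def meanSquareError(measured, predicted):
    """Hypothetical helper: mean of the squared prediction errors"""
    return ((predicted - measured)**2).mean()

# Made-up measurements, for illustration only
distances = np.array([0.0, 12.0, 24.0, 36.0, 48.0])      # inches from launch point
heightsInches = np.array([0.0, 10.0, 15.0, 14.0, 8.0])   # inches above launch point
heightsFeet = heightsInches / 12.0                       # same heights, in feet

# Fit a line to each version of the data and compute its mean square error
fitInches = np.polyval(np.polyfit(distances, heightsInches, 1), distances)
fitFeet = np.polyval(np.polyfit(distances, heightsFeet, 1), distances)
mseInches = meanSquareError(heightsInches, fitInches)
mseFeet = meanSquareError(heightsFeet, fitFeet)

# The two fits are equally good, yet the errors differ by a factor of 12**2
print(mseInches / mseFeet)
```

The two fits describe the data equally well, but their mean square errors differ by a factor of 144, so neither number says anything by itself about how good the fit is.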

We can calculate the absolute goodness of a fit using the coefficient of

determination, often written as $R^2$ (see footnote 97). Let $y_i$ be the $i^{th}$ observed value, $p_i$ be the corresponding value predicted by the model, and $\mu$ be the mean of the observed values.

$$R^2 = 1 - \frac{\sum_i (y_i - p_i)^2}{\sum_i (y_i - \mu)^2}$$

By comparing the estimation errors (the numerator) with the variability of the

original values (the denominator), R2 is intended to capture the proportion of

variability in a data set that is accounted for by the statistical model provided by

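the fit. When the model being evaluated is produced by a least-squares linear regression (with a constant term, as in a polynomial fit) and is scored on the same data used to fit it, the value of R2 always lies between 0 and 1. If R2 = 1, the model explains all of the variability in the data. If R2 = 0, the model explains none of that variability: its predictions are no better than always predicting the mean of the observed values.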
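The two extreme values can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `coefficientOfDetermination` and the data are hypothetical and chosen only so that the arithmetic is easy to follow. The last case uses predictions that do not come from a regression, which is why its value can fall below 0.

```python
import numpy as np

def coefficientOfDetermination(observed, predicted):
    """Hypothetical helper: 1 - (sum of squared errors) /
       (sum of squared deviations of the observations from their mean)"""
    errors = ((observed - predicted)**2).sum()
    variability = ((observed - observed.mean())**2).sum()
    return 1 - errors/variability

observed = np.array([2.0, 4.0, 6.0, 8.0])   # made-up values; their mean is 5.0

# Predictions that match every observation exactly
perfect = coefficientOfDetermination(observed, observed.copy())
# Predictions that ignore the data and always return the mean
meanOnly = coefficientOfDetermination(observed, np.full(4, 5.0))
# Predictions not produced by a regression: the observations in reverse order
reversedGuess = coefficientOfDetermination(observed, observed[::-1])

print(perfect, meanOnly, reversedGuess)   # 1.0 0.0 -3.0
```

Here the variability of the observations is 20. Perfect predictions leave an error of 0, always predicting the mean leaves an error of 20, and the reversed predictions leave an error of 80, giving 1, 0, and -3 respectively.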

The code in Figure 15.5 provides a straightforward implementation of this

statistical measure. Its compactness stems from the expressiveness of the

operations on arrays. The expression (predicted - measured)**2 subtracts the

elements of one array from the elements of another, and then squares each

element in the result. The expression (measured - meanOfMeasured)**2

subtracts the scalar value meanOfMeasured from each element of the array

measured, and then squares each element of the result.
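The two array expressions described above can be tried on small arrays. This is a sketch assuming NumPy arrays; the three-element arrays are made up so that each step can be checked by hand.

```python
import numpy as np

# Made-up values, for illustration only
measured = np.array([1.0, 2.0, 3.0])
predicted = np.array([1.5, 2.0, 2.0])

# Array minus array: element-by-element difference, then each element squared
squaredErrors = (predicted - measured)**2
print(squaredErrors)          # elements are 0.25, 0.0, 1.0

# Array minus scalar: the scalar is subtracted from every element
meanOfMeasured = measured.sum()/float(len(measured))   # 2.0
squaredDeviations = (measured - meanOfMeasured)**2
print(squaredDeviations)      # elements are 1.0, 0.0, 1.0
```

Calling `.sum()` on each result collapses it to the single number used in the numerator or the denominator of the formula.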

def rSquared(measured, predicted):
    """Assumes measured a one-dimensional array of measured values
               predicted a one-dimensional array of predicted values
       Returns coefficient of determination"""
    estimateError = ((predicted - measured)**2).sum()
    meanOfMeasured = measured.sum()/float(len(measured))
    variability = ((measured - meanOfMeasured)**2).sum()
    return 1 - estimateError/variability
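As a usage sketch, `rSquared` can be applied to two fits of the same data. The example assumes NumPy's `polyfit` and `polyval` are used to produce the fits, and the measurements are invented for illustration; they are not the data discussed in the text. The function is repeated only so that the block runs on its own.

```python
import numpy as np

def rSquared(measured, predicted):
    """Same computation as in Figure 15.5, repeated so this block is self-contained"""
    estimateError = ((predicted - measured)**2).sum()
    meanOfMeasured = measured.sum()/float(len(measured))
    variability = ((measured - meanOfMeasured)**2).sum()
    return 1 - estimateError/variability

# Made-up measurements (inches), for illustration only
distances = np.array([0.0, 12.0, 24.0, 36.0, 48.0, 60.0])
heights = np.array([0.0, 9.0, 14.0, 15.0, 11.0, 4.0])

# Least-squares fits of degree 1 (line) and degree 2 (parabola)
linearModel = np.polyfit(distances, heights, 1)
quadraticModel = np.polyfit(distances, heights, 2)

# Score each fit against the data it was fit to
linearR2 = rSquared(heights, np.polyval(linearModel, distances))
quadraticR2 = rSquared(heights, np.polyval(quadraticModel, distances))

print('R2 of linear fit    =', round(linearR2, 4))
print('R2 of quadratic fit =', round(quadraticR2, 4))
```

Because a quadratic fit can always reproduce any line, its R2 on the fitted data is never lower than that of the linear fit; a higher value here says only that the parabola accounts for more of the variability in these particular measurements.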

97 There are several different definitions of the coefficient of determination. The definition

supplied here is used to evaluate the quality of a fit produced by a linear regression.

