
Curve Fitting

Fitting Functions to Data

Rex Boggs
1997 Raybould Fellow
From the Exploring Data website - http://curriculum.qed.qld.gov.au/kla/eda/
© Education Queensland, 1997

Curve Fitting
Introduction

A major part of the Maths B course is the study of functions. If you choose to use real problems with
real data when teaching applications of these functions (which I hope you do) then you have to
introduce some concepts of linear and nonlinear regression that are outside of our syllabus. The main
topics are scatterplots, least squares regression, correlation, normal probability plots and residual
plots.

Scatterplots are simple to construct, and the least squares regression line and correlation coefficient
only need to be understood at a conceptual level. It is certainly not necessary for a student to be able
to find the equation of the line or the value of r by hand. A normal probability plot is optional though
it is a nice application of the normal distribution and can be taught as such, while a residual plot is a
simple extension to a scatterplot. The time needed to cover these topics is not great and is balanced
by the richer maths course you are able to offer your students.

Curve fitting, as nonlinear regression is often called, nicely integrates algebra and statistics and has
the potential to integrate Maths B with other senior subjects, notably Physics, Chemistry and Biology.

Using Technology

The TI-82 graphical calculator allows students to fit a range of non-linear functions to a set of data.
The data and a function can be plotted on the same axes so the user can see how well the function fits
the data, and the calculator can give a correlation coefficient (or an r^2 value) which can assist in
deciding if the model is appropriate. The TI-83 calculator will also produce a residual plot to inform
the decision about the appropriateness of the model and to assist in looking for underlying patterns.
The TI-83 also extends the choice of functions by including the logistic and sinusoidal functions.

Statistics packages such as NCSS Jr and Minitab are both less powerful and more powerful than
graphical calculators. With these packages the user applies a transformation to the data to
‘straighten’ it and then finds a linear regression line through the data. The gradient and y-intercept of
this line are then used (with some algebra) to find the equation of a non-linear function that fits the
original data. Graphical calculators follow the same process with polynomial, exponential and
logarithmic functions, but carry out the process automatically.
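
To make the transform-and-back process concrete, here is a minimal sketch in Python (not part of the original materials; it assumes numpy is installed), using an exponential model as the example. The data values are hypothetical.

import numpy as np

# Hypothetical data that grows roughly like y = e^x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4])

# Step 1: 'straighten' the data. If y = A0*e^(k*x), then
# ln(y) = ln(A0) + k*x, which is linear in x.
ln_y = np.log(y)

# Step 2: find the least squares regression line through the
# straightened data; its gradient is k and its y-intercept is ln(A0).
k, ln_A0 = np.polyfit(x, ln_y, 1)

# Step 3: a little algebra transforms back to the nonlinear model.
A0 = np.exp(ln_A0)
print(f"y = {A0:.2f} * e^({k:.3f}x)")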

Statistics packages provide an immense amount of information about the fit of the function, far more
than the graphical calculator. NCSS Jr for example will display a number of different residual plots,
each of which tells something about how the function fits the data. Statistics packages also allow the
user to work with data where there is more than one explanatory variable (called multiple regression).
Note that some datasets, eg a dataset that exhibits periodic behaviour, can’t be straightened and hence
can’t be analysed with a statistics program.

There are also programs designed specifically for fitting functions to data. One of these is
CurveExpert. It contains over twenty-five common classes of functions (plus user-defined functions)
and a specialised method of fitting a function to the data that doesn’t require the data be straightened
first. CurveExpert provides an r value, the standard error and a residual plot to assist the user in
choosing which function best fits the data, but doesn’t provide the range of other information given
by a statistics program. There is a danger with CurveExpert that a user will go ‘function shopping’ and end up choosing whichever function happens to give the best r value. The screen shot below, where a polynomial of degree 13 is applied to a dataset with 15 values, shows the danger in this.
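
The same danger can be demonstrated in a few lines of Python (numpy assumed; this is my own illustration with simulated data, not the original screen shot). A degree-13 polynomial through 15 points of genuinely linear data fits the data values almost perfectly, yet can swing wildly between them.

import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 14, 15)
y = 2 * x + rng.normal(0, 1, size=15)    # truly linear data plus noise

# A degree-13 polynomial through 15 points all but interpolates them,
# so r is very close to 1 ...
p = Polynomial.fit(x, y, 13)
print("largest residual at the data points:", np.abs(p(x) - y).max())

# ... but look at the fitted values between the data points.
mid = (x[:-1] + x[1:]) / 2
print("fitted values between points range from",
      p(mid).min(), "to", p(mid).max())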


An awareness of the properties of the functions studied in Maths is critical if students are to make an
informed choice about which functions are possible candidates for a particular dataset. A knowledge
of the effect of each parameter on the graph of the function is equally valuable.

There is value in students carrying out the entire process for at least one or two datasets: transforming a dataset to ‘straighten’ it (using logarithms, for example), using technology to find a least squares regression line, and then doing the algebra to transform back to get the nonlinear function that fits the original data. Past that I would allow students to use whichever technology they feel is most appropriate. Students should justify their choice of technology, the function chosen to model the data and the accuracy of the model, and they need to discuss whether extrapolating to values larger or smaller than those in the dataset is appropriate.


Using Statistics in Biology - The Outbreak of the Gypsy Moth


Biological populations can grow exponentially if not restrained by predators or lack of food or space.
Here are some data on an outbreak of the gypsy moth, which devastated forests in Massachusetts in
the US. Rather than count the number of moths, the number of acres defoliated by the moths was
counted. This data was supplied by Chuck Schwalbe, U.S. Dept of Agriculture.

Year     1978      1979      1980      1981
Acres    63 042    226 260   907 075   2 826 095

1. Plot the number of acres defoliated y against the year x. Does the pattern of growth appear to
be exponential?

2. Use a graphical calculator to find the exponential model that best fits this data.

3. Use this model to predict the number of acres defoliated in 1982. The actual number for 1982
was 1 383 265. Give a reason why the predicted value and the actual value could be so
different.

The Outbreak of the Gypsy Moth : Solution


Year 1978 1979 1980 1981
Acres 63 042 226 260 907 075 2 826 095

If we use this data as is, we run into a problem. We are trying to find the values of A0 and k to fit the equation A = A0*e^(kY), where A is the number of acres and Y is the year. The numbers as given are outside the range of values that the calculator can handle for exponential regression. Since the actual years are not important, we will modify the data as follows.

Year 1 2 3 4
Acres (1000s) 63 226 907 2 826

1. Draw a scatterplot of Acres (1000s) versus Year.

2. The data looks exponential. Use your graphical calculator to fit an exponential curve to this data. An exponential model seems to fit this data very well.


3. Draw a residual plot. With only four data values, showing that a pattern exists will be difficult. There is no obvious pattern to this data.

We conclude that an exponential function is a good model for this data. The model given by the graphical calculator is:

A = 17.8*3.60^Y

Students should alter this to the form A = A0*e^(kY), which is more common for biological models (here k = ln 3.60 ≈ 1.28, giving A = 17.8*e^(1.28Y)).

4. The predicted value for the following year is 10 723 000 acres. The actual value was 1 383 000. There was a viral infection in the gypsy moths which reduced their numbers drastically.

Extrapolate with care!
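
For readers without a calculator to hand, the same fit can be reproduced in a few lines of Python (numpy assumed). This mirrors the straighten-and-transform-back process, which is the same process the calculator automates.

import numpy as np

year = np.array([1, 2, 3, 4])
acres = np.array([63, 226, 907, 2826])          # thousands of acres

# Straighten with ln, fit a line, transform back.
k, ln_A0 = np.polyfit(year, np.log(acres), 1)
A0, base = np.exp(ln_A0), np.exp(k)

print(f"A = {A0:.1f} * {base:.2f}^Y")           # approx. A = 17.8*3.60^Y
print(f"prediction for Y = 5: about {A0 * base**5:,.0f} thousand acres")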

The Size of Alligators


Many wildlife populations are monitored by taking aerial photographs. Information about the
number of animals and their whereabouts is important to protecting certain species and to ensuring
the safety of surrounding human populations.

In addition, it is sometimes possible to monitor certain characteristics of the animals. The length of an
alligator can be estimated quite accurately from aerial photographs or from a boat. However, the
alligator's weight is much more difficult to determine. In the example below, data on the length (in
inches) and weight (in pounds) of alligators captured in central Florida are used to develop a model
from which the weight of an alligator can be predicted from its length.

Weight Length Weight Length

130 94 83 86
51 74 70 88
640 147 61 72
28 58 54 74
80 86 44 61
110 94 106 90
33 63 84 89
90 86 39 68
36 69 42 76
38 72 197 114
366 128 102 90
84 85 57 78
80 82

A scatterplot of weight against length reveals that the relationship between these variables is not
linear but curved. A successful model must take into account this non-linear relationship.


Some Possible Models

1. Assume that alligators 'scale up' nicely, ie an alligator that is twice as long is also twice as wide, twice as thick, each tooth is twice as long, etc. Then the model would be a cubic power function, W = k*L^3. To find k, we plot W vs L^3 and find the least squares regression line. The slope of the line is k.

2. Assume that the model is a power function of the form W = k*L^b. Plot ln(weight) vs ln(length) and find the least squares regression line. It’s a nice application of log laws to find values for k and b algebraically. As this is the method that the TI-82/3 uses, the values for k and b should match those given by these calculators when using the power regression function.

3. Assume that the model is a power function of the form W = k*L^b and use CurveExpert to find values for k and b. CurveExpert uses an iterative method to find values for the parameters that doesn’t require ‘straightening’ the data first, and finds values for k and b that minimise the standard error.

4. Assume that the model is a general cubic of the form W = a*L^3 + b*L^2 + c*L + d. Use either CurveExpert or the TI-83 to find the best fit.
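
Before looking at the package output, here is a sketch in Python (numpy assumed) of how the first two models can be fitted directly from the data table above; the values should match the NCSS output in the next sections.

import numpy as np

W = np.array([130, 51, 640, 28, 80, 110, 33, 90, 36, 38, 366, 84, 80,
              83, 70, 61, 54, 44, 106, 84, 39, 42, 197, 102, 57])
L = np.array([94, 74, 147, 58, 86, 94, 63, 86, 69, 72, 128, 85, 82,
              86, 88, 72, 74, 61, 90, 89, 68, 76, 114, 90, 78])

# Model 1: if W = k*L^3, then W plotted against L^3 is a straight
# line with gradient k.
k1, c1 = np.polyfit(L**3, W, 1)
print(f"Model 1: W = {k1:.3e}*L^3  (fitted intercept {c1:.1f})")

# Model 2: if W = k*L^b, then ln(W) = b*ln(L) + ln(k), so the
# gradient of the log-log regression line is b and k = e^intercept.
b, ln_k = np.polyfit(np.log(L), np.log(W), 1)
print(f"Model 2: W = {np.exp(ln_k):.3e}*L^{b:.3f}")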

Some Possible Models - Results


First Model - weight vs length^3

This model assumes the relationship between weight and length is a cubic one. Here is the output from the Regression section of NCSS Jr., a freeware statistics program (http://WWW.NCSS.com/).

Regression Equation Section


Independent Regression Standard T-Value Prob Decision Power
Variable Coefficient Error (Ho: B=0) Level (5%) (5%)
Intercept -34.81806 6.620628 -5.2590 0.000025 Reject Ho 0.998927
l_cubed 1.973203E-04 0 0.0000 1.000000 Accept Ho 0.050000
R-Squared 0.973043

Plots Section

[Figures: histogram and normal probability plot of the residuals of weight; scatterplot of weight vs l_cubed with the fitted line; plot of residuals vs predicted values.]

Analysis

The scatterplot of weight vs l_cubed indicates the transformed data is approximately linear. The regression output indicates k = .0001973 and the y-intercept is -34.8, ie the regression equation is W = .0001973*L^3 - 34.8. The r^2 value of 0.973 indicates that the fit of the straight line to the linearised data is quite good. However the residual plot shows some disturbing patterns. For all of the alligators except the largest three, there is a notable negative gradient in the residuals, while the largest three show the opposite trend. Additionally the scatterplot of weight vs l_cubed shows that the lengths of the largest alligators are influential values, as their x-values (the cube of the length) are much larger than the others. This is due in part to the data being cubed, as any differences in length are exaggerated by this operation. Finally the normal probability plot of the residuals shows that the residuals may not be normally distributed, which again indicates that this model may not be satisfactory.

There may be a temptation to remove these large values as they appear to be atypical, but that would be a big
mistake! After all it is the largest alligators that are the most important in terms of their impact on nearby
human populations.

If we chose to use this model, there may be a case for using a piecewise function - one branch for the typical
alligators and another for larger alligators.

Second Model - ln(weight) vs ln(length)

This model assumes the relationship between weight and length is a power function. To straighten this data
we plot ln(weight) vs ln(length). Here is the output from NCSS Jr.

Regression Equation Section


Independent Regression Standard T-Value Prob Decision Power
Variable Coefficient Error (Ho: B=0) Level (5%) (5%)
Intercept -10.1746 0.7316143 -13.9071 0.000000 Reject Ho 1.000000
ln_len 3.285993 0.1653929 19.8678 0.000000 Reject Ho 1.000000
R-Squared 0.944940


Plots Section

[Figures: histogram and normal probability plot of the residuals of ln_wht; scatterplot of ln_wht vs ln_len with the fitted line; plot of residuals vs predicted values.]

Analysis

From the scatterplot the transformed data appear to be approximately linear. The original model is W = k*L^b and the transformed model is Ln(W) = b*Ln(L) + Ln(k). From the regression output we have Ln(W) = 3.286*Ln(L) - 10.17. It is a nice application of logarithm laws to obtain the function for W in terms of L:

Ln(W) = 3.286*Ln(L) - 10.17
Ln(W) = Ln(L^3.286) - Ln(26239)
Ln(W) = Ln(L^3.286 / 26239)
W = .0000381 * L^3.286

It is worth noting that the TI-82 and TI-83 give the same values if power regression is applied to this data. The residual plot looks much better, with only a weak pattern to the data. Except for one extreme value, the normal plot of the residuals shows there are no problems with this model. While the r^2 value is a bit lower at .945, the model still fits the data very well.


Third Model - Power Function Using CurveExpert

CurveExpert is a curve-fitting program that uses iterative methods to find a ‘curve of best fit’ to a dataset (the shareware version is available from http://www.ebicom.net/~dhyams/cvxpt.htm). As it doesn’t linearise the data first, its results generally don’t agree with those generated by NCSS Jr or graphical calculators. It is a matter of judgement as to which model should be used.

Here is the output data from CurveExpert. ‘S’ is the standard error and ‘r’ the correlation coefficient.

Power Fit: y=ax^b

Coefficient Data:
a= 3.4045285e-006 S = 13.50
b= 3.8140161 r = .9949

Below is the plot of the data with the line of best fit, and the residual plot.

[Figures: the data with the fitted power curve (S = 13.50, r = 0.9949); plot of residuals vs length.]

Analysis

The model that CurveExpert has found is W = .0000034*L^3.81. The correlation coefficient r = .995 and the standard error S = 13.50. While it may seem that a method that fits the data directly to a model would have to be preferable to one that linearises the data first, that isn’t necessarily the case.


The benefit of using a statistics program like NCSS is the wealth of information available about the validity of the fit. All that CurveExpert supplies is a residual plot, valuable to be sure but in some cases maybe not sufficient.

It is important that the student doesn’t go ‘curve shopping’, i.e. choosing the model that gives the smallest standard error. There should be a reason why a particular model is chosen and a physical interpretation of each of the parameters. In this problem it is reasonable to expect the power to be approximately 3, since weight correlates strongly with volume. A value of 3.81 appears too high for this situation.

In our previous model we were able to reduce the influence of the largest alligators by taking the logarithm of the lengths. This isn’t possible with CurveExpert, so the three largest lengths are much larger than the others and they have undue influence on the model. One way to determine the influence of a data value is to remove it and recalculate the regression equation. Removing the largest alligator from the dataset gave the following values:

Power Fit: y=ax^b

Coefficient Data:
a= 9.491332e-06
b= 3.5899794

Removing a single data value has markedly altered the values of both parameters. Removing the three largest alligators gave this output:

Power Fit: y=ax^b

Coefficient Data:
a= 4.6383263e-05
b= 3.236518

The implication is that this model is not very robust and hence should not be used.
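
This influence check does not require CurveExpert; any iterative least squares routine behaves similarly. Here is a sketch in Python (scipy assumed); the fitted values should be close to the CurveExpert figures quoted above, though the two programs' algorithms may differ in detail.

import numpy as np
from scipy.optimize import curve_fit

W = np.array([130, 51, 640, 28, 80, 110, 33, 90, 36, 38, 366, 84, 80,
              83, 70, 61, 54, 44, 106, 84, 39, 42, 197, 102, 57])
L = np.array([94, 74, 147, 58, 86, 94, 63, 86, 69, 72, 128, 85, 82,
              86, 88, 72, 74, 61, 90, 89, 68, 76, 114, 90, 78])

def power(x, a, b):
    return a * x**b

# Fit the full dataset, then refit with the 1 and 3 longest
# alligators removed, to see how much the parameters move.
for drop in (0, 1, 3):
    keep = np.argsort(L)[:len(L) - drop]
    (a, b), _ = curve_fit(power, L[keep], W[keep], p0=[1e-5, 3.5])
    print(f"dropping the {drop} longest: a = {a:.2e}, b = {b:.3f}")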

Fourth Model - Fitting a General Cubic Equation

Either a graphics calculator or CurveExpert can be used to fit a general cubic equation to the dataset.
Here is the output from CurveExpert:

3rd degree Polynomial Fit: y=a+bx+cx^2+dx^3...

Coefficient Data:
a= -277.82231 S = 11.36
b= 11.473033 R = .9966
c= -0.15420092
d= 0.00080703423


[Figures: the data with the fitted cubic (S = 11.36, r = 0.9967); plot of residuals vs length.]

Analysis

The regression equation is W = .000807*L^3 - .154*L^2 + 11.5*L - 278. The standard error = 11.36 and the r value = .9966. Based on the standard error and correlation coefficient, one may think that this model is satisfactory.

However what is happening here is that this model is including some of the sample error in the model itself by wiggling its way through the data values. A small standard error does not mean the model is a valid one! I would reject this model on the basis that there is no physical reason for including all of these terms of the cubic function.
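
For completeness, the cubic fit itself is a one-liner in Python (numpy assumed) and should reproduce essentially the CurveExpert coefficients above, since both solve the same least squares problem.

import numpy as np

W = np.array([130, 51, 640, 28, 80, 110, 33, 90, 36, 38, 366, 84, 80,
              83, 70, 61, 54, 44, 106, 84, 39, 42, 197, 102, 57])
L = np.array([94, 74, 147, 58, 86, 94, 63, 86, 69, 72, 128, 85, 82,
              86, 88, 72, 74, 61, 90, 89, 68, 76, 114, 90, 78])

# np.polyfit returns coefficients from the highest power down:
# W = d*L^3 + c*L^2 + b*L + a.
d, c, b, a = np.polyfit(L, W, 3)
print(f"W = {d:.6f}*L^3 {c:+.4f}*L^2 {b:+.2f}*L {a:+.1f}")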

Decision

Having tested a number of models I will choose the power function W = .000038*L^3.3. I feel that two significant digits is reasonable accuracy given that the sample size is small. This model was less influenced by the large data values, and analysis of this model showed a reasonable residual plot and normal plot. This model can of course be modified if more alligators are able to be captured and measured.


World Oil Production


Mathematical models which describe physical phenomena are often very accurate, reflecting the
simple underlying formula that links the variables and the ability to measure such variables precisely.
We usually aren't so fortunate when we model activities that involve nature and biological processes,
while those involving people are the most difficult of all to model accurately.

The data in the table is the world oil production measured in millions of barrels. Your task is to find
a function to model this data, discussing limitations of your model, and its usefulness as a predictor
of future production.

Construct a residual plot and discuss any interesting features in the plot.

Year Mbbl
1880 30
1890 77
1900 149
1905 215
1910 328
1915 432
1920 689
1925 1069
1930 1412
1935 1655
1940 2150
1945 2595
1950 3803
1955 5626
1960 7674
1962 8882
1964 10310
1966 12016
1968 14104
1970 16690
1972 18584
1974 20389
1976 20188
1978 21922
1980 21722
1982 19411
1984 19837
1986 20246
1988 21338


Solution - World Oil Production


The first thing we do with data is to graph it, in this case a scatterplot with time as the predictor
variable and oil production as the response variable.

Scatterplots

[Figure: scatterplot of Mbbl vs Year, 1880-1988.]

The data is clearly nonlinear. Assuming a constant percentage growth gives rise to an exponential
model. To test this we can plot Year vs ln(Mbbl) and see how linear the data appears to be.

[Figure: scatterplot of ln(Mbbl) vs Year.]

The data is approximately linear so the model is worth pursuing. The original scatterplot does
indicate that something peculiar happened in the early 1970s to alter the exponential pattern. It was
of course the war in the Mid-East that disrupted oil production. Looking back at the original data,
obviously the model isn’t applicable after 1972 so the next step in the analysis is to delete the last 8
rows of data. Here is a new plot with this data removed, and the least squares regression line added.


[Figure: scatterplot of ln(Mbbl) vs Year, 1880-1972, with the least squares regression line.]
Before using NCSS to do a regression analysis on this data it is useful to make 1880 the base year (by
subtracting 1880 from each year), otherwise the y-intercept represents an estimate for Ln(Mbbl) in
the year 0, which is quite outside the range of values we are considering.

Both the regression equation and plots output are given below.

Regression Equation Section


Independent Regression Standard T-Value Prob Decision Power
Variable Coefficient Error (Ho: B=0) Level (5%) (5%)
Intercept 3.702236 6.431098E-02 57.5677 0.000000 Reject Ho 1.000000
Year_adj 6.649784E-02 1.025216E-03 64.8623 0.000000 Reject Ho 1.000000
R-Squared 0.995504

Plots

[Figures: histogram and normal probability plot of the residuals of ln(Mbbl).]


[Figure: plot of residuals vs Year_adj.]

A bit of algebra gives Mbbl (b) in terms of Year Since 1880 (y):

ln(b) = 3.702 + .0665y
b = e^(3.702 + .0665y)
b = 40.53*e^(.0665y)
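
The whole analysis compresses into a short Python sketch (numpy assumed; the data are the 1880-1972 rows of the table above).

import numpy as np

year = np.array([1880, 1890, 1900, 1905, 1910, 1915, 1920, 1925, 1930,
                 1935, 1940, 1945, 1950, 1955, 1960, 1962, 1964, 1966,
                 1968, 1970, 1972])
mbbl = np.array([30, 77, 149, 215, 328, 432, 689, 1069, 1412, 1655,
                 2150, 2595, 3803, 5626, 7674, 8882, 10310, 12016,
                 14104, 16690, 18584])

y = year - 1880                           # make 1880 the base year
c, ln_A = np.polyfit(y, np.log(mbbl), 1)  # straighten with ln, fit a line
print(f"b = {np.exp(ln_A):.2f} * e^({c:.4f}y)")   # approx. b = 40.53*e^(.0665y)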

Report

The scatterplot of Mbbl vs Year Since 1880 shows a non-linear pattern until the early 1970s, after which the underlying pattern has obviously changed. We restrict our domain to 1880 - 1972 by removing the last 8 pairs of data.

The scatterplot of ln(Mbbl) vs year shows that the relationship is approximately linear and hence the function is of the form b = A*e^(cy). The regression analysis followed by a bit of algebra gives us the equation b = 40.53*e^(.0665y).

It is worth noting that with real-life data there may be underlying patterns that are the result of external factors and hence can’t be accounted for by the mathematical model. The residual plot and the normal probability plot together indicate some underlying patterns to this dataset. The residual plot shows that growth was faster than that predicted by the model prior to 1930, slower in the 1930s and 1940s, and then faster again from the 1950s to the 1970s. The obvious factors linked to the slowdown are the Great Depression and World War II. A history lesson reflected in a residual plot!

Student Generated Data


In addition to using data gathered by researchers, students should generate their own data, either by experiment or observation, and then find a model that gives an acceptable fit to the data. Gathering good data is not always an easy task, and students are best made aware of this by going through the process themselves. The first set of ideas comes from an email from Alice Hankla, Galloway School, Atlanta, Georgia, USA. I have edited her email slightly.

**

With exponentials, we use kid-collected data (say radioactivity or Newton's law of cooling) and
graph on semilog paper first, then use a graphing program. For log-log plots, we do Kepler's law
concerning period vs average distance from sun. One is power of 2 and the other power of 3, hence
log-log to straighten out.

About cubic, quartic and higher powers of polynomials: one can fit anything with a high enough power, and it is meaningless. This is the time for the lesson on the meaning of the parameters of an equation vs a best fit. They must think about the model, and about extrapolating values. A good problem is the stopping distance of cars at various speeds (data from the Georgia Drivers Manual, which they all have). Fit a quadratic or cubic and the extrapolated distance for 100 mph is outrageous. Point made.

Another outrageous extrapolation is from the cumulative number of cases of AIDS diagnosed (downloaded from the CDC webpage). The last year shows a tad of a decrease, and decrease is predicted for the future. (It is being modelled via differential equations; my class attended a lecture by a man currently doing this.) Another interesting dataset is the number of TB cases, which showed exponential decay until 1988, when the graph turned upward (data from CDC by phone; I have it and can send it via this listserve if anyone wants it).

We also do the number of hamburgers that McDonald's sold annually and the number of locations opened cumulatively. I got this data from the web about 4 years ago and can send it to you all.

Increase in population world-wide and in the US is in the newspaper from time to time - exponential, of course.

From a pizza menu, graph area vs price for various toppings, to see the parallel lines and to find the best buy.

Nonlinear - Mostly Physics

Time same-size cans with different contents - cream soup, thin soup, vege soup - as each rolls down the same incline. You can also time one can over different distances.

On a gallon milk jug, make marks one inch apart. Fill with water to each mark and time the exit of water from a hole near the bottom until the level reaches a specified point.

Measure the height of each bounce of a tennis ball until it stops. Measure the height of the first bounce when dropping the ball from different heights.

Time a steel ball falling the same distance in different liquids - motor oil, shampoo, vinegar. Meaningless but fun. Different sizes of balls of the same metal in the same liquid does give the coefficient of viscosity.

Time students running up a set of stairs. Plot time vs weight; separate by gender. Can get work by figuring the height of the steps - mgh - and horsepower by mgh/time.

Newton's law of cooling - Record water temp at 3-min intervals for 2 hours. Graph time vs change in
temperature and time vs temperature.

Torque on a meterstick: Set a meterstick with its 30 cm mark over the fulcrum. Balance it by hanging a known weight at one location, say 5 cm from the fulcrum. Record the weight and location. Repeat for different weights and distances from the fulcrum.

Measure and graph object distance vs real image distance for a convex lens - hyperbola. For virtual
images one gets the other half of the hyperbola. Unique experiment.


Other Ideas
Here are a few other non-linear modelling applications that I have heard about.

Hang a rope across the classroom before the students arrive, with the ends at different heights. The
students are allowed to measure whatever they want, but only to the middle of the room. They must
use their data to create a mathematical model that predicts how high up the wall the other end of the
rope is attached. Then test it.

Grow a spinach plant, and measure its height periodically. This gives a nice logistic equation, at least
it does if you water regularly!

Fill a plastic drum with water, place it on a stand, punch a hole in the side (near the bottom) and
measure how far the water squirts out, as a function of time.

Bounce a wet superball down a footpath (ie sidewalk) and measure the distance between wet spots. It
gives a remarkably good decaying exponential function.

One that I tried - Using a CBL unit with a TI-83, we took a park bench (you can use any long straight
thing with a groove that will keep a ball on track) and gave it a slight tilt (eg a brick under one end).
We rolled a ball (actually a small globe because that was all we could find) down the incline, using
the CBL and a motion detector probe to measure distance vs time.

We plotted the data, and applied a quadratic model. After deleting the first few data points (since we
reckon my mate John pushed the globe slightly rather than just letting it roll), we had a correlation
coefficient of over 0.999. I was very surprised it was so high.

The WorldWatch Datadisk

Worldwatch tracks trends in key factors that affect our environment, with many of their datasets
starting around 1950. They publish a very interesting book every year, called Vital Signs, which
discusses the changes in the trends over the past year. More importantly for statistics education, they
publish a disk with all of the data given in the tables and graphs in Vital Signs. To find out more
about the data disk, visit the WorldWatch website at http://www.worldwatch.org/

Further Problems of Linear and Non-Linear Regression

Statistics and Nutrition


A study of nutrition in developing countries collected data from the Egyptian village of Nahya. Here
are the mean weights for 170 infants in Nahya who were weighed each month during their first year
of life.

(Data from Zeinab E. M. Afifi, “Principal components analysis of growth of Nahya infants: Size,
velocity and two physique factors,” Human Biology, 57 (1985), pp. 659-669.)

Age (Months) 1 2 3 4 5 6 7 8 9 10 11 12
Weight (kg) 4.3 5.1 5.7 6.3 6.8 7.1 7.2 7.2 7.2 7.2 7.5 7.8


1. Plot the mean weight against time. Compute the least squares regression line. Plot this line
on your graph. Is it an acceptable summary of the overall pattern of growth?

2. Plot residuals against age. Describe what this output tells you.

3. Describe a better model for weight against age (Hint: there may be different functions
needed for different ages).

Erosion
A study of erosion produced the following data on the rate (in litres per second) at which water flows
across a soil test bed and the weight (in kilograms) of soil washed away. (Data from G.R. Foster,
W.R. Ostercamp, and L.J. Lane, “Effects of discharge rate on rill erosion,” paper presented at the
1982 Winter Meeting of the American Society of Agricultural Engineers.)

Flow Rate     .31    .85    1.26   2.47   3.75
Eroded Soil   .82    1.95   2.18   3.01   6.07

Find a mathematical model for this data. Determine its validity. Comment on the strength of its
predictive value.

Mystery Dataset
from: Bruce King, New Milford, CT (kingb@wcsu.ctstateu.edu)

I have enjoyed asking my students, occasionally, to deal with a "mystery" data set. That is, I want
them to arrive at conclusions, however tentative, based only on the characteristics of the data,
without taking into account any contextual information. Afterwards, I tell them what the variables
measure; sometimes this can be a bit of fun.

For example, once some years ago I grabbed about two dozen hard-cover books from my shelves, and
measured such things as weight (Y), area of cover, number of pages, thickness, etc.,--and number of
letters in the author's (or first author's) last name. I would ask students to construct a model for Y in
terms of the other variables. (This was in a unit on multiple regression. In a more elementary course,
I might ask them to identify, however tentatively, which X-variable(s) are strongly related to Y; and
which are not related to Y.) Only after we had done everything we could would I tell them what the
Xs and Y were.

Here's another "mystery" data set I've found useful about this time of the year (or a bit later) in a "Moore-oriented" course (i.e., one that relies on BPS or IPS). It involves transformations, a topic which appears first in BPS on p. 110 and in IPS on p. 150.


X Y

.3871 .2409
.7323 .6152
1.000 1.000
1.524 1.881
5.203 11.86
9.555 29.46
19.22 84.01
30.11 164.8
39.81 247.7

The idea, of course, is to find a reasonable model for Y as a function of X.

If you haven't seen this before, you might like to try it before looking at some feedback, which I'll put
in a separate, second, message.

Answer: Kepler may have recognised this data, as it is the ‘average’ distance of each planet from the
Sun (using the Earth’s distance as 1) and the length of the planet’s year (with the Earth’s year as 1).
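
For teachers who want a quick check, a log-log fit in Python (numpy assumed) unmasks the data: a log-log plot straightens a power law, and the fitted power of about 1.5 is Kepler's third law, Y^2 = X^3.

import numpy as np

X = np.array([.3871, .7323, 1.000, 1.524, 5.203, 9.555, 19.22, 30.11, 39.81])
Y = np.array([.2409, .6152, 1.000, 1.881, 11.86, 29.46, 84.01, 164.8, 247.7])

# If Y = k*X^b, then ln(Y) = b*ln(X) + ln(k).
b, ln_k = np.polyfit(np.log(X), np.log(Y), 1)
print(f"Y = {np.exp(ln_k):.4f} * X^{b:.4f}")   # power close to 1.5, k close to 1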


Sample Assignment
Galileo's Gravity and Motion Experiments

Over 400 years ago, Galileo conducted a series of experiments on the paths of projectiles, attempting
to find a mathematical description of falling bodies. Two of his experiments are the basis of this
assignment.

The experiments consisted of rolling a ball down a grooved ramp that was placed at a fixed height
above the floor and inclined at a fixed angle to the horizontal. In one experiment the ball left the end
of the ramp and descended to the floor. In a related experiment a horizontal shelf was placed at the
end of the ramp, and the ball would travel along this shelf before descending to the floor. In each
experiment Galileo altered the release height of the ball (h) and measured the distance (d) the ball
travelled before landing. The units of measurement were called 'punti'. A page from Galileo's notes
is shown below.

The data from these experiments is given in the following two tables.

Table 1 - Ramp Only


Release Height Above Table (h)    Horizontal Distance (d)
1000                              573
800                               534
600                               495
450                               451
300                               395
200                               337
100                               253

Table 2 - Ramp and Shelf

Release Height Above Table (h)    Horizontal Distance (d)
1000                              1500
828                               1340
800                               1328
650                               1172
300                               800

Source: Drake, S. (1978), Galileo at Work, Chicago: University of Chicago Press.


Take Note of:

Ockham's Razor: a maxim that, whenever possible, one should choose a simple model over a more complicated one. It just seems to be the way the world often works.

1. Use the Ramp and Shelf data to find a mathematical model for the horizontal distance
travelled as a function of release height. In particular:

a. Test at least two different mathematical models. Show any scatterplots and statistical
analyses used in these tests.

b. Decide which mathematical model you feel best represents the data. Justify your decision.

c. Discuss how accurately your model fits the given data.

d. Would your mathematical model give sensible answers if the ball is released at greater
heights? How well does your model work if the release height is 0?

2. According to Jeffreys and Berger, in an erratum to an article in American Scientist (1992), the model for the Ramp Only data is of the form

h = a*d^2 / (1 - b*d)

where a and b are parameters to be determined.

a. Find the values of a and b, using a non-linear regression software program such as
CurveExpert.

Note that this program requires you to enter initial estimates for these parameters. You can find good initial estimates of the values of a and b by choosing two pairs of data from the Ramp Only data table, substituting into the above function and solving the resulting simultaneous equations (a sketch of this approach appears after these questions).

b. Discuss how accurately your function fits the data.

c. What is the domain of d, given these values for a and b? What is the physical interpretation of
this domain?
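
The following sketch (Python, numpy and scipy assumed; CurveExpert itself is not required) shows one way to carry out question 2a. Substituting the pairs (1000, 573) and (100, 253) into h = a*d^2/(1 - b*d) and rearranging gives two equations that are linear in a and b, namely a*d^2 + h*d*b = h; their solution provides starting values for an iterative fit over all seven pairs.

import numpy as np
from scipy.optimize import curve_fit

h = np.array([1000, 800, 600, 450, 300, 200, 100])   # Ramp Only data
d = np.array([573, 534, 495, 451, 395, 337, 253])

# Initial estimates from two data pairs: each pair gives one linear
# equation a*d^2 + h*d*b = h in the unknowns a and b.
M = np.array([[573.0**2, 1000 * 573.0],
              [253.0**2, 100 * 253.0]])
a0, b0 = np.linalg.solve(M, np.array([1000.0, 100.0]))

def model(d, a, b):
    return a * d**2 / (1 - b * d)

# Refine the estimates by nonlinear least squares over all the data.
(a, b), _ = curve_fit(model, d, h, p0=[a0, b0])
print(f"a = {a:.6f}, b = {b:.6f}")
print(f"the model breaks down as d approaches 1/b = {1/b:.0f} punti")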

References:

Dickey, D.A. and Arnold, J.T. (1995). Teaching Statistics with Data of Historic Significance: Galileo's Gravity and Motion Experiments, Journal of Statistics Education, v. 3, n. 1.

Drake, S. (1978). Galileo at Work, Chicago: University of Chicago Press.

Jeffreys, W. H., and Berger, J. O. (1992). Ockham's Razor and Bayesian Analysis, American
Scientist, 80, 64-72 (Erratum, p. 116).
