Chapter 7
Linear Regression
Solutions to class examples:
1. See Class Example 1.
2. According to the linear model, each additional cricket chirp is associated with an increase in
temperature by an average of 0.25 degrees Fahrenheit.
3. Scatterplots, regression equations, correlations, \(R^2\) values, and residuals plots for the heights
and weights provided with Chapter 6 materials are on pages 6-3 and 6-4. Of course, if you use
your own class data for heights and weights, answers will vary. A worksheet detailing the
relationship between distance and airfare is on pages 7-3 and 7-4, with a key on page 7-5.
4. See Class Example 4.
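5. \(\widehat{\text{Fare}} = 177.22 + 0.079(\text{Distance})\); \(r = \sqrt{R^2} = \sqrt{0.482} = 0.694\)
6. \(\widehat{\text{Distance}} = -644.287 + 6.13(\text{Fare})\)
Predicting the distance traveled for a $300 fare, for example:
\[\widehat{\text{Distance}} = -644.287 + 6.13(300) = 1194.713 \text{ miles}\]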
Some students may think that you could just use the equation that predicts the fare, and
merely “back-solve” it. This isn’t true.
\[\widehat{\text{Fare}} = 177.22 + 0.079(\text{Distance})\]
\[300 = 177.22 + 0.079(\text{Distance})\]
\[\text{Distance} = 1554.18 \text{ miles}\]
In order to predict distance, we need to use the model that minimizes the residuals for
distances. The correct model predicts a distance of 1194.7 miles for a $300 fare.
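The contrast between the two approaches can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the coefficients quoted in examples 5 and 6; the function and variable names are our own and are not part of the original materials.

```python
# Coefficients quoted in class examples 5 and 6.
FARE_INTERCEPT, FARE_SLOPE = 177.22, 0.079    # fare predicted from distance
DIST_INTERCEPT, DIST_SLOPE = -644.287, 6.13   # distance predicted from fare


def predict_distance(fare):
    """Use the regression of distance on fare (the appropriate model)."""
    return DIST_INTERCEPT + DIST_SLOPE * fare


def back_solve_distance(fare):
    """Invert the fare-on-distance line (the tempting shortcut)."""
    return (fare - FARE_INTERCEPT) / FARE_SLOPE


correct = predict_distance(300)
shortcut = back_solve_distance(300)
slope_product = FARE_SLOPE * DIST_SLOPE  # equals R^2, up to rounding

print(f"regression of distance on fare: {correct:.1f} miles")
print(f"back-solved from fare model:    {shortcut:.1f} miles")
print(f"product of the two slopes:      {slope_product:.3f}")
```

The two lines would give the same answer only if the product of their slopes were 1, that is, only if r were exactly 1 or -1. Here the product is about 0.484, which agrees with the reported \(R^2\) of 0.482 up to rounding in the quoted coefficients.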
7. The Data Desk Lab Worksheet is on pages 7-17 and 7-18, with a key on page 7-19.
Investigative Task
Chapter 7’s Investigative Task examines the relationship between smoking and coronary heart
disease.
Supplemental Resources
The following page, 7-2, contains the algebra to back up our assertion that the least squares line
passes through \((\bar{x}, \bar{y})\). This is merely provided for interested students, and is not required to
understand the idea of linear regression.
A worksheet for the distance and airfares data is on pages 7-14 and 7-15, with a key on 7-16.
A lab activity using the Data Desk Cars datafile is on pages 7-17 and 7-18, with a key on 7-19.
The derivation of the least-squares regression line begins with the assertion that the best line
must pass through the mean-mean point (the origin in the scatterplot of z-scores). Most students
will not object.
If you have some students who insist on a proof of the mean-mean assertion, here are the steps.
We advise not doing all this in class lest most eyes glaze over…
We seek the line \(\hat{z}_y = a + m z_x\) that minimizes \(\sum (z_y - \hat{z}_y)^2\).
Substitute the equation: \(\sum \bigl(z_y - (a + m z_x)\bigr)^2\)
Rearrange terms: \(\sum \bigl((z_y - m z_x) - a\bigr)^2\)
Square the binomial: \(\sum (z_y - m z_x)^2 - 2a \sum (z_y - m z_x) + \sum a^2\)
Consider the “middle term”: \(-2a \sum (z_y - m z_x)\)
But the mean (and thus the sum) of a set of z-scores must be zero. Hence this whole middle
term is zero and we can turn our attention to the task of trying to minimize what’s left:
\(\sum (z_y - m z_x)^2 + \sum a^2\)
By choosing a = 0 we can be sure that the sum will be minimal. (Adding the square of any
other value would make it bigger.) Hence the best line must have a y-intercept of 0 in the
standardized plane, proving that the line of regression goes through the mean-mean point.
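For students who prefer a numerical check to algebra, the identity behind this argument can be verified directly. The sketch below is illustrative only: it standardizes the year and rate values from the postal-rate worksheet in this chapter, picks an arbitrary trial slope, and confirms that the sum of squared residuals for intercept a equals the a = 0 sum plus n times a squared. The helper names are our own.

```python
from statistics import mean, stdev

# Year and first-class rate pairs from the postal-rate worksheet in this chapter.
years = [1971, 1974, 1975, 1978, 1981, 1981, 1985, 1988, 1991, 1995]
rates = [0.08, 0.10, 0.13, 0.15, 0.18, 0.20, 0.22, 0.25, 0.29, 0.32]


def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]


def sum_sq_residuals(a, m, zx, zy):
    """Sum of squared residuals for the trial line z_y-hat = a + m * z_x."""
    return sum((y - (a + m * x)) ** 2 for x, y in zip(zx, zy))


zx, zy = z_scores(years), z_scores(rates)
n = len(zx)
m = 0.9  # an arbitrary trial slope; the identity holds for any m

baseline = sum_sq_residuals(0.0, m, zx, zy)
for a in (-0.5, 0.0, 0.25, 1.0):
    total = sum_sq_residuals(a, m, zx, zy)
    # The middle term vanishes, so the total is the a = 0 sum plus n * a^2.
    print(f"a = {a:5.2f}  sum of squares = {total:.4f}  "
          f"baseline + n*a^2 = {baseline + n * a * a:.4f}")
```

Whatever slope is tried, the smallest sum of squares occurs at a = 0, which is the numerical counterpart of the statement that the best line passes through the mean-mean point.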
An article in the Journal of Statistics Education reported the price of diamonds of different sizes
in Singapore dollars (SGD). The following table contains a data set that is consistent with this
data, adjusted to US dollars in 2004:
1. Make a scatterplot and describe the association between the size of the diamond (carat) and
the cost (in US dollars).
2. Create a model to predict diamond costs from the size of the diamond.
8. Would it be better for a customer buying a diamond to have a negative residual or a positive
residual from this model? Explain.
An article in the Journal of Statistics Education reported the price of diamonds of different sizes
in Singapore dollars (SGD). The following table contains a data set that is consistent with this
data, adjusted to US dollars in 2004:
1. Make a scatterplot and describe the association between the size of the diamond (carat) and
the cost (in US dollars).
[Figure: Scatterplot of 2004 US $ vs Carat]
There is a strong, positive, linear association between the size of the diamond and its cost. The
cost of a diamond increases with size.
2. Create a model to predict diamond costs from the size of the diamond.
The regression equation is
2004 US $ = - 559 + 8225 Carat
[Figure: Residuals plot, Residual vs Fitted Value]
A linear model is appropriate for this problem. The residual plot shows no obvious pattern.
The slope of the model is 8225.1. The model predicts that for each additional carat, the cost of the
diamond will increase by $8225.10, on average. Equivalently, for each additional 0.01 carat, the
cost of the diamond will increase by about $82.25, on average.
The intercept of the model is -558.52. The model predicts that a diamond of 0 carats costs
-$558.52. This is not realistic.
The correlation is \(r = \pm\sqrt{0.987} = \pm 0.993\). Since the scatterplot shows a positive
relationship, the positive value must be used.
\(R^2 = 0.987\). So 98.7% of the variation in diamond prices can be accounted for by the variation in
the size of the diamond.
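The two pieces of arithmetic in this key (rescaling the slope and recovering r from \(R^2\)) can be reproduced as follows. This is a small sketch using only the values quoted in the key; the variable names are our own.

```python
import math

slope_per_carat = 8225.1  # dollars per carat, from the fitted model
r_squared = 0.987

# Rescale the slope to a more realistic step of 0.01 carat.
slope_per_hundredth = slope_per_carat * 0.01

# R^2 gives only the magnitude of r; r takes the sign of the slope.
r = math.copysign(math.sqrt(r_squared), slope_per_carat)

print(f"cost change per 0.01 carat: ${slope_per_hundredth:.2f}")
print(f"r = {r:.3f}")
```

Taking the sign from the slope is the same decision as reading the direction off the scatterplot: a positive slope and a positive association always go together.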
8. Would it be better for a customer buying a diamond to have a negative residual or a positive
residual from this model? Explain.
It would be better for customers to have a negative residual from this model, since a negative
residual would indicate that the actual cost of the diamond was less than the model predicted it to
be.
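A quick calculation may help students see the sign convention. The sketch below uses the model coefficients from this key; the diamond itself (0.25 carat, priced at $1,400) is a hypothetical example invented for illustration and is not taken from the data set.

```python
def predicted_cost(carat):
    """Predicted cost in 2004 US dollars from the model in this key."""
    return -558.52 + 8225.1 * carat


def residual(actual_cost, carat):
    """Residual = actual - predicted."""
    return actual_cost - predicted_cost(carat)


# Hypothetical diamond, not from the data set: 0.25 carat priced at $1,400.
example_residual = residual(1400, 0.25)
print(f"predicted: ${predicted_cost(0.25):.2f}")
print(f"residual:  ${example_residual:.2f}")
```

A negative result here means the buyer paid less than the model predicts for a diamond of that size.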
In an effort to decide if there is an association between the year of a postal increase and the new
postal rate for first class mail, the data were gathered from the United States Postal Service. In
1981, the United States Postal Service changed their rates on March 22 and November 1. This
information is shown in the table below.
1. Make a scatterplot and describe the association between the year and the
first class postal rate.

Year   Rate ($)
1971   0.08
1974   0.10
1975   0.13
1978   0.15
1981   0.18
1981   0.20
1985   0.22
1988   0.25
1991   0.29
1995   0.32
8. Would it be better for customers for a year to have a negative residual or a positive residual
from this model? Explain.
In an effort to decide if there is an association between the year of a postal increase and the new
postal rate for first class mail, the data were gathered from the United States Postal Service. In
1981, the United States Postal Service changed their rates on March 22 and November 1. This
information is shown in the table below.
1. Make a scatterplot and describe the association between the year and the
first class postal rate.

Year   Rate ($)
1971   0.08
1974   0.10
1975   0.13
1978   0.15
1981   0.18
1981   0.20
1985   0.22
1988   0.25
1991   0.29
1995   0.32
There is a strong, positive, linear association between the year and the first class postal rate. Postal
rates have increased over time.
The slope of the model is 0.0101518. The model predicts that for every additional year the first class
postal rate will increase by about $0.01, on average.
The intercept of the model is -19.93. The model predicts that at Year = 0, the first class postal rate was
-$19.93. This is not realistic.
The correlation is \(r = \pm\sqrt{0.990} = \pm 0.9950\). Since the scatterplot shows a positive relationship, the
positive value must be used.
\(R^2 = 0.990\). So 99.0% of the variation in first class postal rates can be accounted for by the
variation in year.
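The slope, intercept, and \(R^2\) reported in this key can be reproduced from the table with the standard least-squares formulas. The following is a minimal sketch using only the Python standard library; the variable names are our own, and any statistics package should give the same values.

```python
from statistics import mean

# Year and first-class postal rate (in dollars) from the table in this worksheet.
years = [1971, 1974, 1975, 1978, 1981, 1981, 1985, 1988, 1991, 1995]
rates = [0.08, 0.10, 0.13, 0.15, 0.18, 0.20, 0.22, 0.25, 0.29, 0.32]

x_bar, y_bar = mean(years), mean(rates)
sxx = sum((x - x_bar) ** 2 for x in years)
syy = sum((y - y_bar) ** 2 for y in rates)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, rates))

slope = sxy / sxx                    # change in rate per additional year
intercept = y_bar - slope * x_bar    # the line passes through (x_bar, y_bar)
r = sxy / (sxx * syy) ** 0.5         # correlation; sign matches the slope

print(f"slope     = {slope:.7f}")
print(f"intercept = {intercept:.2f}")
print(f"r         = {r:.4f}")
print(f"R^2       = {r * r:.3f}")
```

Computing r directly from the data gives a value slightly different in the fourth decimal place from the square root of the rounded \(R^2\), which is a useful reminder that 0.9950 comes from a rounded input.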
8. Would it be better for customers for a year to have a negative residual or a positive residual
from this model? Explain.
It would be better for customers to have a negative residual from this model. A negative residual
would indicate that the actual first class postal rate is lower than the model predicted it would be.
Smoking
Now that cigarette smoking has been clearly tied to lung cancer, researchers are focusing on
possible links to other diseases. The data below show annual rates of cigarette consumption and
deaths from coronary heart disease for several nations. Some public health officials are urging
that the United States adopt a national goal of cutting cigarette consumption in half.
Examine these data and write a report. In your report you should:
• include appropriate graphs and statistics;
• describe the association between cigarette smoking and coronary heart disease;
• create a linear model;
• evaluate the strength and appropriateness of your model;
• interpret the slope and y-intercept of the line;
• use your model to estimate the potential benefits of reaching the “national goal” proposed
for the United States.
NOTE: We present a model solution with some trepidation. This is not a scoring key, just an
example. Many other approaches could fully satisfy the requirements outlined in the scoring
rubric. That (not this) is the standard by which student responses should be evaluated.
… 100,000 people.
[Figure: Scatterplot, x-axis: Cigarettes consumed per adult per year]
… standard deviation of 809 cigarettes per adult per year. The
mean CHD rate was 144.9 deaths per 100,000 citizens, with a
standard deviation of 66.5 deaths per 100,000 residents.
The association between cigarette consumption and CHD
deaths is moderate, with r = 0.731, linear, and positive.
Countries with higher cigarette consumption generally have
higher rates of CHD deaths. There are several unusual points.
Mexico and Greece have CHD death rates lower than we
would expect, given their cigarette consumption, and Finland
has a higher rate of CHD deaths than we would expect for its
level of cigarette consumption. These points are notable, but
not drastic departures from the pattern.
[Figure: Residuals Plot, Residual vs predicted(C/C)]
Since the relationship is straight enough, both variables are
quantitative, and there are no outliers, we can model the
association with a linear model. The regression equation for
the relationship is \(\widehat{\text{CHD}} = 15.6415 + 0.060176(\text{Cigarettes})\).
The model predicts that a country in which there is no cigarette consumption would have a CHD
death rate of about 15.6 deaths per 100,000. Furthermore, according to the model, for each
additional 100 cigarettes consumed per adult per year, there is an expected increase of about 6
deaths due to CHD. With \(R^2 = 53.4\%\), the model explains 53.4% of the variability in CHD
death rate. The residuals plot shows no pattern, so the linear model is appropriate.
If the U.S. were to cut its cigarette consumption in half, from 3900 cigarettes per adult per year to 1950,
the model predicts that the CHD rate would drop from 257 to approximately 133 deaths per 100,000 citizens.
\[\widehat{\text{CHD}} = 15.6415 + 0.060176(\text{Cigarettes}) = 15.6415 + 0.060176(1950) = 132.98\]
There is no guarantee of a decrease in CHD death rate, since we are dealing with a model, not
reality, and there may not be a causal relationship between average cigarette consumption and
CHD death rate.
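The prediction in this model solution can be reproduced with a short calculation. The sketch below uses the regression coefficients and the consumption figures stated in the report; the function name is our own.

```python
def predicted_chd(cigarettes):
    """Predicted CHD deaths per 100,000 from the model in this report."""
    return 15.6415 + 0.060176 * cigarettes


current, goal = 3900, 1950  # cigarettes per adult per year, from the report

at_goal = predicted_chd(goal)
at_current = predicted_chd(current)

print(f"predicted at {goal}: {at_goal:.2f}")
print(f"predicted at {current}: {at_current:.2f}")
print(f"model-predicted drop: {at_current - at_goal:.2f}")
```

Note that the model's own prediction at 3900 cigarettes is about 250 deaths per 100,000, so the starting figure of 257 in the report appears to be an observed rate rather than a fitted value. Students who compare two fitted values will report a drop of about 117, which is simply the slope times the change in consumption.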
4. Find the y-intercept of the regression line.
5. Write the equation of the linear model.
6. Estimate the fare for a 200-mile flight.
7. Estimate the fare for a 2000-mile flight.
10. Explain what the slope means in this context.
11. The fare to fly to Los Angeles, 1719 miles from Atlanta, is $212. Find the residual.
12. In general, a positive residual means…
13. In general, a negative residual means…
[Figure: Residuals plot]
7. How far does this model suggest you could fly for the fare you estimated in #5?
Fuel Economy
Yes, cars again. But now we aren’t insuring them, we are driving them. To most of us, fuel
economy is an important issue. In the US, we measure fuel economy in miles per gallon; in
Europe and Canada, the standard is liters per 100 kilometers. A sociologist might say that this
difference is a window into the national psyche. A European has to go from A to B and asks,
“How much gas will I need to get there?” An American says, “The gas tank is full. Where can I
go?”
Anyway, we seek a good way to predict a car’s fuel economy. In completing this assignment you
will examine several possible explanatory variables, choose the one you think is best, construct
the linear model, then make a prediction and interpret the results. Here are the step-by-step
instructions. Check off the steps as you go.
1. Launch ActivStats.
Open (double click) the Resources tab on the left side of the Lesson Book.
Scroll down and choose ActivStats DataSets: to launch the List Dataset window.
Choose the file Cars1991, then hit the Insert into Data Desk button.
Load the data by following the instructions. You’ll see these variables of interest:
MPG - fuel economy ratings in miles per gallon;
Weight - in pounds;
Horsepo - engine horsepower;
Eng. Dis - engine displacement (size) in cubic inches;
Cylinders - number of cylinders (usually 4, 6, or 8);
Drive R - drive ratio (how many revolutions the engine makes to rotate the wheels
once).
2. Let’s look for the best predictor of fuel economy.
Select the response variable MPG as Y.
Holding the shift key down, select the other variables simultaneously as possible
explanatory variables X.
Calculate Correlations (Pearson).
The correlation matrix shows you the strength and direction of the association between
fuel economy and each of the other variables. Which variable seems to be most strongly
correlated with MPG? _____________
Explain: _____________________________________________________________
3. From now on you will be working only with the response variable MPG and whatever you
just chose as the best explanatory variable.
Select these variables as Y and X respectively.
Plot the Scatterplot.
Is the pattern you see what the correlation led you to expect? Explain.
________________________________________________________________________
4. Now create the model (find the equation of the line of best fit).
Using the scatterplot’s hypermenu (click on the little triangle in the plot’s title bar), Add
Regression Line and do the Regression of MPG vs X.
How many cars is this analysis based on? ______
What is the model - the equation of the line of best fit? The constant term and the slope
for the equation of the line are the coefficients displayed in the bottom left corner of the
regression analysis.
(Use meaningful variable names) ___________________________
What does the slope mean in this context?
_____________________________________________________________________
What does the value of \(R^2\) mean in this context?
_____________________________________________________________________
_____________________________________________________________________
5. A 1992 Geo Prizm weighed 2608 pounds and had a 102 horsepower, 97 cubic inch, 4
cylinder engine with a final drive ratio of 3.05.
Use your model to estimate how many miles per gallon it should get.
_____________
The owner actually averaged about 33.7 mpg. What is the residual?
_____________
What does the residual mean?
___________________________________________
6. Residuals provide an important look at how successful the model is.
Using the hypermenu on the title bar of the regression analysis, create the Scatterplot
residuals vs predicted.
In that scatterplot’s hypermenu you can again Add Regression Line.
The regression line now is horizontal, indicating where residuals would equal 0. See the
actual residuals plotted for the cars data? Explain what they represent.
_____________________________________________________________________
7. If the model had successfully extracted all the meaning from the data, the remaining error
would be random. In that case the residuals plot should appear to contain no pattern in the
scatter. Look at your residuals plot. Does the scatter appear to be random? Look carefully.
You should see a hint of a curve. That means you might be able to find a better model. Try
the European concept: gallons per 100 miles.
Under Manip, choose Transform, and create a New Derived Variable named GpHM for
“gallons per hundred miles.”
Enter the formula 100/MPG.
You will now see a new icon GpHM to use as the response variable Y. Using your chosen
explanatory variable as X again, make a new scatterplot, do the regression analysis to
create a new model, and check the new residuals.
Using the residuals plot and the value of \(R^2\), explain why you think this model is better
than the original.
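For anyone working outside Data Desk, the same re-expression is a one-line calculation. The sketch below is illustrative: the function name is our own, and apart from the 33.7 mpg figure from step 5, the MPG values are arbitrary round numbers chosen to show the effect.

```python
def gallons_per_hundred_miles(mpg):
    """Re-express fuel economy as fuel used: 100 / MPG."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    return 100 / mpg


# 33.7 mpg is the Geo Prizm figure from step 5; the others are arbitrary.
for mpg in (20, 25, 33.7, 35, 40):
    print(f"{mpg:>5} mpg -> {gallons_per_hundred_miles(mpg):.3f} gallons per 100 miles")

saving_low = gallons_per_hundred_miles(20) - gallons_per_hundred_miles(25)
saving_high = gallons_per_hundred_miles(35) - gallons_per_hundred_miles(40)
print(f"20 -> 25 mpg saves {saving_low:.3f} gallons per 100 miles")
print(f"35 -> 40 mpg saves {saving_high:.3f} gallons per 100 miles")
```

Equal steps in MPG do not correspond to equal steps in fuel used, which is one way to see why a relationship that is curved in MPG may straighten out when the response is re-expressed as gallons per hundred miles.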
Respectfully,
Alexander L. Kielland.”
II.
Yours,
Alexander L. Ki.”
III.
Yours,
Alexander L. Kielland.”
IV.
Yours,
Alexander L. Kielland.”
*****
I.
II.
“My friend.
III.
IV.
*****
“She came from the country. After a very long struggle she had at last
torn herself free of all the ties in which circumstances had entangled
her. There had been times when she had been on the verge of changing,
of dying, of vanishing, as lonely and as misunderstood as she had been
since her earliest childhood.
Life lay before her like a riddle, like a beautiful, wondrous,
mysterious riddle, or like a dark, sad, hopeless riddle.
She burned with the desire to throw herself into it and to gain clarity
on all the thoughts she had had. She led a strange, wild,
indefinite life. She stayed awake at night; it was so wonderful to
imagine that she was the queen of the world and that the old, wretched
life was over. How she pitied those who slept their lives away.
She went out in the evenings unaccompanied and without telling anyone,
and when she came home and was questioned, she never
stooped to explanations. Storm and thunder filled her with wild joy;
she claimed that they had the same nature as she did.
She had never had a female friend; she bared her soul
only to nature, because it was the only thing that did not betray. She
could sit whole nights staring into space; at times she wept as though
her heart would break. Everything was so shabby, so empty, so pointless. ‒ ‒ ‒
Over the years she changed. Something restless came into her gaze,
and she began to long for ever greater variety. At times
she was lively and friendly to such a degree that it became
a burden to those around her; at other times
her indifference and rudeness could offend
people. Nervous and as if in the grip of a fever, she could
fling insolent remarks in others' faces. People began to
shun her. Her female acquaintances generally did not like her. In
their opinion she was unwomanly, a coquette, cold. But she did not
care. She said she felt more interest in gentlemen,
because one could learn something from them. She was said to be engaged
now to one man, now to another. But time passed, and she did not
marry. ‒ ‒ ‒
Alone, as she had lived until now, she resolved to create
a sphere of influence for herself; she wanted to become something, she
wanted to leave some trace behind. She had at last escaped
the oppressive air that had threatened to suffocate her; she had
air under her wings and she wanted to fly, to fly freely and
to fly high and rejoice in the wondrous beauty and
loveliness of the world. She could not live for life's small duties;
she could not even admire those who did, believing that
they simply lacked the capacity for anything greater. She believed in
life, she trusted that she would attain happiness; with burning zeal she
wanted to take up everything, see everything, learn everything, learn, oh
God, what happiness! She wanted something to come that would take
complete possession of her being, of every thought, every feeling,
every drop of blood in her body. She sought, she longed, she thirsted
for something she had thirsted for all her life. ‒ ‒ ‒