
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS

Unit 2 – Regression and Correlation


WEEK 2 - Practice Problems

SOLUTIONS
Stata version

1. A regression analysis of measurements of a dependent variable Y on an independent variable X
produces a statistically significant association between X and Y. Drawing upon your education in
introductory biostatistics, the theory of epidemiology, the scientific method, etc., see how many
explanations you can offer for this finding. Hint – I get seven (7).

(i) X causes Y
Example: X=smoking Y=cancer

(ii) Y causes X
Example: X=cancer Y=smoking

(iii) Positive confounding
Example: X=alcohol Y=cancer Confounder = Z = smoking

(iv) Necessary but not sufficient
Example: X=sun exposure Y=melanoma
Sun exposure alone does not cause melanoma. Melanoma is the result of a gene-environment interaction.

(v) Intermediary
Example: X = asthma Y=lesions on the lung
X=asthma is an intermediary in the pathway coal exposure → asthma → lesions on lung

(vi) Confounding
Example: X=yellow finger Y=lung cancer
Z=smoking causes both yellow finger and lung cancer.

(vii) Anomaly
There is no association. An event of low probability has occurred.


2. Below is a figure summarizing some data for which a simple linear regression analysis has been performed.
The point denoted X that appears on the line is (x̄, ȳ). The two points indicated by open circles were NOT
included in the original analysis.


Multiple Choice (Choose ONE):

Suppose the two points represented by the open circles are added to the data set and the regression analysis is
repeated. What is the effect of adding these points on:

1. The estimated slope.
_____ (a) increase
_____ (b) decrease
__X__ (c) no change

2. The residual sum of squares.
_____ (a) increase
_____ (b) decrease
__X__ (c) no change

3. The degrees of freedom.
__X__ (a) increase
_____ (b) decrease
_____ (c) no change

4. Standard error of the estimated slope.
_____ (a) increase
__X__ (b) decrease
_____ (c) no change

5. The predicted value of y at x=24.
_____ (a) increase
_____ (b) decrease
__X__ (c) no change
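
Tip – You can convince yourself of these answers with a toy data set. The sketch below uses made-up numbers (NOT the data in the figure) and assumes, as the answer key implies, that both added points lie exactly on the original fitted line. The scalar names are my own choice.

```stata
* Toy illustration with made-up data (NOT the data in the figure).
* Assumption: the two added points lie exactly on the original fitted line.
clear
input double x double y
10 4.8
12 6.3
14 6.9
16 8.4
18 9.1
end

* Original fit: save the quantities asked about in questions 1-5
regress y x
scalar b0     = _b[_cons]
scalar b1     = _b[x]
scalar rss0   = e(rss)
scalar df0    = e(df_r)
scalar se0    = _se[x]
scalar yhat0  = b0 + b1*24

* Add two observations that sit exactly on the fitted line
set obs 7
replace x = 22 in 6
replace x = 26 in 7
replace y = b0 + b1*x in 6/7

* Refit and compare before -> after
regress y x
scalar yhat1 = _b[_cons] + _b[x]*24
display "slope:       " b1    " -> " _b[x]
display "residual SS: " rss0  " -> " e(rss)
display "residual df: " df0   " -> " e(df_r)
display "SE of slope: " se0   " -> " _se[x]
display "yhat at 24:  " yhat0 " -> " yhat1
```

The new points have zero residuals, so the fitted line and the residual sum of squares do not move. The residual degrees of freedom go up by 2 and the spread of the x values widens, and both of these shrink the standard error of the slope.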


3. The course website page REGRESSION AND CORRELATION has some examples of code
to produce regression analyses in STATA. In this exercise, you will gain some practice doing a simple
linear regression using a data set called week02.dta. This data set has n=31 observations of boiling
points (Y=boiling) and temperature (X=temp).

Tip - I encourage you to download and print the solutions to this question, so that you can follow along!

Tip - To access this data set in STATA, follow these 2 steps:

STEP 1: From the course website page THIS WEEK or page REGRESSION, right
click to download week02.dta. Take care to remember the directory or folder where
you have placed the data.

STEP 2: Now launch Stata. From the main menu bar, use FILE > OPEN to input
into Stata memory the data in week02.dta

Tip – You will need to create a new dependent variable that is 100*log10(Y). To do this, type the following
into the command window:

. generate newy=100*log10(boiling)

Carry out an exploratory analysis to determine whether the relationship between temperature and boiling
point is better represented using

(i) Y = β₀ + β₁X or
(ii) 100 log₁₀(Y) = β₀′ + β₁′X

In developing your answer, try your hand at producing

(a) Estimates of the regression line parameters
(b) Analysis of variance tables
(c) R²
(d) Scatter plot with overlay of fitted line.

Complete your answer with a one paragraph text that is an interpretation of your work. Take your time with
this and have fun.


Tip! A really good practice is to maintain a log of your session. Here is how.
Step 1: From the main menu at upper left, click: FILE > LOG > BEGIN


Step 2: At the dialog box SAVE AS: Type in a file name of your choosing. Mine is “carol_log”
Note that I have NOT typed an extension. Stata does this for you. Do NOT click on SAVE just yet.

Step 3: At the dialog box WHERE: Click on the drop down menu and select a path of your choosing.
Again, do NOT click on SAVE just yet

Step 4: At the dialog box FILE FORMAT: click on drop down menu at right.

Step 5: From the drop down menu, choose STATA LOG.


Now you can click on SAVE
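
Tip – If you prefer typing to clicking, the menu steps above have a command equivalent. This is a minimal sketch: it assumes you want a plain text log named carol_log in the current working directory, and note that the replace option overwrites any existing file of that name.

```stata
* Command alternative to FILE > LOG > BEGIN.
* Assumption: a plain text log called carol_log in the current working directory.
* The option text produces carol_log.log; replace overwrites an existing file.
log using "carol_log", text replace

* ... your analysis commands go here ...

* Close the log when you are done
log close
```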


My Stata Session
Note – I imported my log into an MS Word document and, there, edited all my misspellings and errors

Key
BLACK – commands. Comments begin with asterisk and are in bold. Blue – output

name: <unnamed>
log: /Users/carolbigelow/Desktop/hw_week02.log
log type: text

. *
. ***** Turn off screen pause of output
. set more off

. *
. ***** Read data into stata workspace
. use "http://people.umass.edu/biep640w/datasets/week02.dta"

. *
. ***** Preliminary) Use command codebook with option compact for a quick look at data
. codebook, compact

Variable   Obs Unique      Mean     Min     Max  Label
-----------------------------------------------------------------------------------------
temp        31     29  191.7839   180.6   210.8
boiling     31     31  19.76958  12.267  29.211

. *
. ***** Preliminary) Use command generate to create newy = 100* log10(boiling)
. generate newy=100*log10(boiling)

. *
. ***** Simple Linear Regression
. *
. ** a-c) Use command regress yvariable xvariable for: fitted line, anova, and R-squared
. regress boiling temp

      Source |       SS       df       MS              Number of obs =      31
-------------+------------------------------           F(  1,    29) =  336.60
       Model |  450.558755     1  450.558755           Prob > F      =  0.0000
    Residual |  38.8187237    29  1.33857668           R-squared     =  0.9207
-------------+------------------------------           Adj R-squared =  0.9179
       Total |  489.377479    30  16.3125826           Root MSE      =   1.157

------------------------------------------------------------------------------
     boiling |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        temp |   .4437942   .0241895    18.35   0.000     .3943211    .4932674
       _cons |    -65.343   4.643815   -14.07   0.000    -74.84067   -55.84533
------------------------------------------------------------------------------
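
Tip – It helps to see how the pieces of this output fit together. The sketch below recomputes R-squared, Root MSE, the t statistic and F by hand, using only numbers copied from the output above. The scalar names are my own choice.

```stata
* Numbers copied from the regress boiling temp output above
scalar ss_model = 450.558755
scalar ss_total = 489.377479
scalar ms_resid = 1.33857668
scalar b_temp   = .4437942
scalar se_temp  = .0241895

* Each line should match the corresponding entry of the output, up to rounding
display "R-squared  = SS(Model)/SS(Total): " ss_model/ss_total
display "Root MSE   = sqrt(MS Residual):   " sqrt(ms_resid)
display "t for temp = Coef./Std. Err.:     " b_temp/se_temp
display "F = t squared (one predictor):    " (b_temp/se_temp)^2
```

With a single predictor, the overall F test and the t test of the slope are the same test, which is why F is the square of t.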


. *
. ** d) Use command graph twoway for scatter plot with overlay of fitted line
. ** No frills graph
. graph twoway (scatter boiling temp) (lfit boiling temp)

. ** Graph with aesthetics.
. graph twoway (scatter boiling temp, symbol(d)) (lfit boiling temp), title("Simple Linear Regression") subtitle("Y=boiling on X=temp") caption("boiling.png", size(vsmall))

Tip – I used the option caption( ) so that the name of my saved file (which I am about to
save) will appear on the graph itself. Handy!


Tip!
Save your graph as a “.png” image so that you can “paste” it elsewhere. Here is how.
Step 1: Create your graph (see above)

Step 2: With the graph window still active, click on the SAVE icon.

Step 3: From drop down menu for File format: Choose PORTABLE NETWORK GRAPHICS (*.png)
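
Tip – The same save can be done with a command instead of the SAVE icon. This is a minimal sketch: it assumes the data set URL shown in my Stata session is reachable and that you want the file written to the current working directory. The replace option overwrites an existing boiling.png.

```stata
* Read the data (URL as used in the session log) and draw the no frills graph
use "http://people.umass.edu/biep640w/datasets/week02.dta", clear
graph twoway (scatter boiling temp) (lfit boiling temp)

* Write the graph currently displayed to a PNG file.
* Stata picks the file format from the .png extension.
graph export "boiling.png", replace
```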


. *
. ***** Simple Linear Regression of Y=100*log10(boiling) on X=temp
. *
. ** a-c) Use command regress yvariable xvariable for: fitted line, anova, and R-squared
. regress newy temp

      Source |       SS       df       MS              Number of obs =      31
-------------+------------------------------           F(  1,    29) =  223.69
       Model |  1962.25312     1  1962.25312           Prob > F      =  0.0000
    Residual |  254.398153    29   8.7723501           R-squared     =  0.8852
-------------+------------------------------           Adj R-squared =  0.8813
       Total |  2216.65127    30  73.8883757           Root MSE      =  2.9618

------------------------------------------------------------------------------
        newy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        temp |   .9261545   .0619247    14.96   0.000     .7995043    1.052805
       _cons |  -48.85826   11.88807    -4.11   0.000    -73.17209   -24.54443
------------------------------------------------------------------------------
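
Tip – Because the outcome here is 100*log10(boiling), the slope is on a log scale. One way to read it, offered as a sketch and not as part of the assigned solution: dividing the slope by 100 gives the change in log10(boiling) per 1 unit of temp, and raising 10 to that power gives the multiplicative change in boiling. The scalar names are my own choice.

```stata
* Slope copied from the regress newy temp output above
scalar b1_log = .9261545

* newy = 100*log10(boiling), so a 1 unit increase in temp
* multiplies boiling by 10^(slope/100)
scalar ratio = 10^(b1_log/100)
display "multiplicative change in boiling per unit of temp: " ratio
display "percent change in boiling per unit of temp:        " 100*(ratio - 1)
```

The ratio comes out at roughly 1.02, that is, about a 2 percent increase in boiling for each 1 unit increase in temp under this model.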

. *
. ** d) Scatter plot with overlay of fitted line using the command graph twoway
. ** No frills graph
. graph twoway (scatter newy temp) (lfit newy temp)


. ** Graph with aesthetics
. graph twoway (scatter newy temp, symbol(d)) (lfit newy temp), title("Simple Linear Regression") subtitle("Y=100*log10(boiling) on X=temp") caption("newy.png", size(vsmall))


Putting it all together …


(a) Estimates of Regression Line Parameters
i. Y = -65.34 + 0.44*X
ii. 100*log10(Y) = -48.86 + 0.93*X

Table 1. Parameter estimates for dependent variable Y = boiling (Boiling Point)

                          Parameter Estimates

                                Parameter   Standard
Variable    Label         DF     Estimate      Error   t Value   Pr > |t|
Intercept   Intercept      1    -65.34300    4.64382    -14.07     <.0001
Temp        Temperature    1      0.44379    0.02419     18.35     <.0001

Table 2. Parameter estimates for dependent variable Y = newy (100 log10 (Boiling Point))

                          Parameter Estimates

                                Parameter   Standard
Variable    Label         DF     Estimate      Error   t Value   Pr > |t|
Intercept   Intercept      1    -48.85830   11.88807     -4.11     0.0003
Temp        Temperature    1      0.92615    0.06192     14.96     <.0001
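
Tip – To see what the two fitted lines say in practice, plug in a value of X. The sketch below uses the coefficients from my Stata output and a hypothetical temperature of 200, which I picked only because it lies inside the observed range (180.6 to 210.8). The prediction from model ii is converted back to the original boiling scale by undoing the 100*log10 transformation.

```stata
* Hypothetical value of temp, chosen inside the observed range 180.6 to 210.8
scalar t0 = 200

* Model i: boiling = b0 + b1*temp (coefficients from regress boiling temp)
scalar yhat_lin = -65.343 + .4437942*t0

* Model ii: 100*log10(boiling) = b0' + b1'*temp (coefficients from regress newy temp),
* converted back to the boiling scale
scalar yhat_log = 10^((-48.85826 + .9261545*t0)/100)

display "model i  prediction of boiling at temp = 200: " yhat_lin
display "model ii prediction of boiling at temp = 200: " yhat_log
```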

(b) Analysis of Variance Tables

Y = boiling (Boiling Point)

                     Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               1    450.55876    450.55876    336.60   <.0001
Error              29     38.81874      1.33858
Corrected Total    30    489.37750

Y = newy (100 log10 (Boiling Point))

                     Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               1   1962.25366   1962.25366    223.69   <.0001
Error              29    254.39829      8.77235
Corrected Total    30   2216.65195


(c) R²

Y = boiling (Boiling Point): R² = 0.9207
Y = newy (100 log10 (Boiling Point)): R² = 0.8852

(d) Scatterplot with Overlay of Fitted Line

Y=boiling (Boiling Point)

Y = newy (100 log10 (Boiling Point))

Solution (one paragraph of text that is interpretation of analysis):

Did you notice that the scatter plot of these data reveals two outlying values? Their inclusion may or may not be
appropriate.
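
Tip – If you would like Stata to point to candidate outliers, one possibility is to list observations with large studentized residuals. This is only a sketch and not part of the assigned solution: the variable name rstu and the cutoff of 2 are my own assumptions (2 is a common rule of thumb, not a formal test), and the data set URL is the one used in my Stata session.

```stata
* Read the data and refit the simple linear regression of boiling on temp
use "http://people.umass.edu/biep640w/datasets/week02.dta", clear
regress boiling temp

* Studentized residuals; rstu is a hypothetical variable name
predict rstu, rstudent

* List observations whose studentized residual exceeds 2 in absolute value.
* The cutoff of 2 is a rule of thumb chosen for illustration.
list temp boiling rstu if abs(rstu) > 2 & !missing(rstu)
```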

If all n=31 data points are included in the analysis, then the model that explains more of the variability in
boiling point is Y=boiling point modeled linearly in X=temperature. It has a greater R² (92% vs. 89%).

Be careful - It would not make sense to compare the residual mean squares of the two models because the
scales of measurement involved are different.
