Professional Documents
Culture Documents
SOLUTIONS
Stata version
(i) X causes Y
Example: X=smoking Y=cancer
(ii) Y causes X
Example: X=cancer Y=smoking
(v) Intermediary
Example: X = asthma Y=lesions on the lung
X=asthma is an intermediary in the pathway coal exposure à asthma à lesions on lung
(vi) Confounding
Example: X=yellow finger Y=lung cancer
Z=smoking causes both yellow finger and lung cancer.
(vii) Anomoly
There is no association. An event of low probability has occurred.
sol_WEEK2regression.doc Page 1 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
2. Below is a figure summarizing some data for which a simple linear regression analysis has been performed.
The point denoted X that appears on the line is (x,y). The two points indicated by open circles were NOT
included in the original analysis.
sol_WEEK2regression.doc Page 2 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
Suppose the two points represented by the open circles are added to the data set and the regression analysis is
repeated. What is the effect of adding these points on:
sol_WEEK2regression.doc Page 3 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
3. . The course website page REGRESSION AND CORRELATION has some examples of code
to produce regression analyses in STATA. In this exercise, you will gain some practice doing a simple
linear regression using a data set called week02.dta. This data set has n=31 observations of boiling
points (Y=boiling) and temperature (X=temp).
Tip - I encourage you to download and print the solutions to this question, so that you can follow along!
Tip - To access this data set in STATA, follow the following 2 steps:
STEP 1: From the course website page THIS WEEK or page REGRESSION, right
click to download week02.dta. Take care to remember the directory or folder where
you have placed the data.
STEP 2: Now launch Stata. From the main menu bar, use FILE > OPEN to input
into Stata memory the data in week02.dta
Tip – You will need to create a new dependent variable that is 100log10(Y). To do this, type the following
into the command window:
. generate newy=100*log10(boiling)
Carry out an exploratory analysis to determine whether the relationship between temperature and boiling
point is better represented using
(i) Y = β0 +β1X or
(ii) 100 log10 (Y) = β'0 +β1' X
Complete your answer with a one paragraph text that is an interpretation of your work. Take your time with
this and have fun.
sol_WEEK2regression.doc Page 4 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
Tip! A really good practice is to maintain a log of your session. Here is how.
Step 1: From the main menu at upper left, click: FILE > LOG > BEGIN
sol_WEEK2regression.doc Page 5 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
Step 2: At the dialog box SAVE AS: Type in a file name of your choosing. Mine is “carol_log”
Note that I have NOT typed an extension. Stata does this for you. Do NOT click on SAVE just yet.
Step 3: At the dialog box WHERE: Click on the drop down menu and select a path of your choosing.
Again, do NOT click on SAVE just yet
Step 4: At the dialog box FILE FORMAT: click on drop down menu at right.
sol_WEEK2regression.doc Page 6 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
My Stata Session
Note – I imported my log into an MS Word document and, there, edited all my misspellings and errors
Key
BLACK – commands. Comments begin with asterisk and are in bold. Blue – output
name: <unnamed>
log: /Users/carolbigelow/Desktop/hw_week02.log
log type: text
. *
. ***** Turn off screen pause of output
. set more off
. *
. ***** Read data into stata workspace
. use "http://people.umass.edu/biep640w/datasets/week02.dta"
. *
. ***** Preliminary) Use command codebook with option compact for a quick look at data.
codebook, compact
. *
. ***** Preliminary) Use command generate to create newy = 100* log10(boiling)
. generate newy=100*log10(boiling)
. *
. ***** Simple Linear Regression
. *
. ** a-c) Use command regress yvariable xvariable for: fitted line, anova, and R-squared
. regress boiling temp
------------------------------------------------------------------------------
boiling | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
temp | .4437942 .0241895 18.35 0.000 .3943211 .4932674
_cons | -65.343 4.643815 -14.07 0.000 -74.84067 -55.84533
------------------------------------------------------------------------------
sol_WEEK2regression.doc Page 7 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
. *
. ** d) Use command graph twoway for scatter plot with overlay of fitted line
.** No frills graph
. graph twoway (scatter boiling temp) (lfit boiling temp)
Tip – I used the option caption( )so that the name of my saved file (which I am about to
save) will appear on the graph itself. Handy!
sol_WEEK2regression.doc Page 8 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
Tip!
Save your graph as a “.png” image so that you can “paste” it elsewhere. Here is how.
Step 1: Create your graph (see previous page )
Step 2: With the graph window still active, click on the SAVE icon.
Step 3: From drop down menu for File format: Choose PORTABLE NETWORK GRAPHICS (*.png)
sol_WEEK2regression.doc Page 9 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
. *
. ***** Simple Linear Regression of Y=100*log10(boiling) on X=temp
. *
. ** a-c) Use command regress yvariable xvariable for: fitted line, anova, and R-squared.
regress newy temp
------------------------------------------------------------------------------
newy | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
temp | .9261545 .0619247 14.96 0.000 .7995043 1.052805
_cons | -48.85826 11.88807 -4.11 0.000 -73.17209 -24.54443
------------------------------------------------------------------------------
. *
. ** d) Scatter plot with overlay of fitted line using the command graph twoway
. ** No frills graph
. graph twoway (scatter newy temp) (lfit newy temp)
sol_WEEK2regression.doc Page 10 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
sol_WEEK2regression.doc Page 11 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Table 2. Parameters estimations for dependent = y = newy (100 log10 (Boiling Point))
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
sol_WEEK2regression.doc Page 12 of 13
BIOSTATS 640 – Spring 2016 2. Regression and Correlation (Part 1 of 2) - SOLUTIONS
(c) R2
Did you notice that the scatter plot of these data reveal two outlying values? Their inclusion may or may not be
appropriate.
If all n=31 data points are included in the analysis, then the model that explains more of the variability in
boiling point is Y=boiling point modeled linearly in X=temperature. It has a greater R2 (92% v 89%).
Be careful - It would not make sense to compare the residual mean squares of the two models because the
scales of measurement involved are different.
sol_WEEK2regression.doc Page 13 of 13