
Homework 3: Poisson Regression for Counts & Rates

Dr. Timothy R. Johnson Spring Semester, 2014


This homework assignment is due no later than 3:00 on Friday, March 28th. Please read the instructions below carefully.

Homework Instructions
This homework assignment is due no later than 3:00 on Friday, March 28th. Late assignments will be accepted only in extreme circumstances and only if arrangements have been made in advance.

Your solutions must be typed and very neatly organized. I will not try to infer your solutions if they are not clearly presented. Equations need not be typeset perfectly but they should be clear. You may substitute letters for symbols (e.g., b1 for β1), and you may write out equations (neatly) by hand if necessary.

With your solutions you must include the relevant R output and the R scripts that created it. Include these within the text of your solutions using cut-and-paste. Try to include only the relevant output. Use a monospace font (e.g., Consolas or Courier) for R scripts and output for clarity, but only for R scripts and output.

You are permitted to discuss the homework with other students in the course. However, you must still write your own R scripts, produce your own output, and write up your own solutions.

You are welcome to ask me questions concerning the homework. I will be particularly open to helping with any R problems. I want to evaluate your understanding of applied regression, not R, but part of the purpose of the homework assignments is to get you to exercise using R. If you email me with an R question, it may be helpful to include your full R script so that I can replicate your problem. The Statistics Assistance Center (SAC) and Statistical Consulting Center (SCC) are not designed to accommodate this course. Direct all questions to me.


Cancer Deaths of Atomic Bomb Survivors


The data frame ex2220 from the Sleuth3 package contains data from an observational study of cancer deaths among survivors of the two atomic bombs dropped on Japan during World War II. Here the data are summarized in terms of the number of deaths, radiation exposure, and years after exposure. Since the number of people for a given exposure and time interval varies, the number of person-years was also recorded.[1] The primary goal here is to investigate the effect of radiation exposure on the cancer death rate.

1. Download an R script file from the following link.

https://dl.dropboxusercontent.com/u/10884844/Homework/atomicbomb.R

Note that the script file estimates and plots a linear model for the cancer death rate using exposure and years after exposure as explanatory variables. Modify the script to estimate and plot a Poisson regression model with the same linear predictor for the cancer death rate. Report the parameter estimates and standard errors using the summary function, and provide your plot.[2]

2. According to the Poisson regression model, by what factor does the cancer death rate increase per unit (1 rad) increase in radiation exposure? Estimate this factor and provide a profile likelihood confidence interval for it as well. Also test the significance of the effect for radiation exposure using each of the following methods: the profile likelihood confidence interval, a likelihood ratio test, and a Wald test.[3] Report the test statistic for the latter two, and explain how you use the profile likelihood confidence interval to decide whether or not to conclude that the effect is statistically significant at α = 0.05.

3. In the previous analyses radiation exposure was treated as a quantitative variable. However, given that the exposure values are probably averages or midpoints of ranges of radiation exposure, it might be reasonable to treat exposure as categorical. To do this, convert it to a factor prior to your analysis and plotting using the following command.

> ex2220$Exposure <- factor(ex2220$Exposure)

You will also need to make a change for plotting. The geom_line will not work if x is categorical unless we group the points. This can be done by including a group aesthetic variable as follows.

> p <- p + geom_line(aes(y = yhat, linetype = YearsAfter, group = YearsAfter))

Report the parameter estimates and standard errors for this model using summary and provide the plot.

[1] The concept of person-years can be confusing, but it is basically the sum of the number of years at risk for each person. For example, if two people were each observed for five years, the person-years would be ten.

[2] Although the plot is in color you need not print it in color.

[3] We should not be surprised to see a significant effect for radiation exposure. However, it is interesting to note that it is not significant at α = 0.05 for the original linear model.
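The ideas in problems 1 and 2 can be sketched on simulated data. The block below is only an illustration under stated assumptions: the data frame toy and its columns dose, pyears, and deaths are hypothetical, and modelling the rate through an offset of log person-years is one common approach that may differ from how the course script handles person-years.

```r
# Simulated toy data (NOT the ex2220 data); all names here are hypothetical.
set.seed(1)
toy <- data.frame(dose = rep(c(0, 50, 100, 200), each = 5))
toy$pyears <- round(runif(nrow(toy), 1000, 5000))
toy$deaths <- rpois(nrow(toy), lambda = toy$pyears * exp(-6 + 0.004 * toy$dose))

# Rate model: log(E[deaths] / pyears) = b0 + b1 * dose.
# The offset adds log(pyears) to the linear predictor with its coefficient fixed at 1.
m  <- glm(deaths ~ dose + offset(log(pyears)), family = poisson, data = toy)
m0 <- glm(deaths ~ 1 + offset(log(pyears)), family = poisson, data = toy)

# Estimated factor by which the rate changes per unit increase in dose.
rate_ratio <- exp(coef(m)[["dose"]])

# Profile likelihood interval for b1, exponentiated to the rate-ratio scale.
ci <- exp(confint(m, parm = "dose"))

# Likelihood ratio test (compares the models with and without dose).
lrt <- anova(m0, m, test = "LRT")

# Wald test statistic for dose, taken from the coefficient table.
wald_z <- summary(m)$coefficients["dose", "z value"]

print(rate_ratio)
print(ci)
print(lrt)
print(wald_z)
```

The likelihood ratio statistic is in the Deviance column of the anova table, and the Wald statistic is the z value reported by summary.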


4. The plot from the previous problem should show a rather curious result. The cancer death rate actually appears to be lower for the 400 rad exposure than for the 250 rad exposure. Use that model and contrast in combination with contrastfix to determine whether the cancer death rate at 250 rads is significantly different from that at 400 rads using a Wald test.[4]

5. Looking at the plot you can see that the observed rate for survivors observed 28 to 31 years after an exposure of 250 rads appears to be unusually high. It could be an outlier, but because of the asymmetry and heteroscedasticity of the Poisson distribution it is particularly important to use standardized or studentized residuals when evaluating potential outliers. Calculate and report the studentized residual for this observation, and compare it to the residuals for the other observations. Do this or any other observations appear to be outliers that might warrant further investigation of the data or revision of the model?

6. It might be worthwhile to determine the influence of the outlier on our analyses by repeating them with that observation omitted.[5] There are several ways to do this. Perhaps the easiest would be to drop that observation from the data. To do this you can use the subset function as follows.[6]

> ex2220 <- subset(ex2220, !(Exposure == "250" & YearsAfter == "28to31"))

Repeat what you did in problems 3 and 4 after dropping this observation from your analysis. Briefly summarize the changes to your results.[7]

[4] Since the model does not include an interaction you should find that the difference is the same for each level of years after exposure. Also note that since exposure is now a categorical variable you will want to specify its levels as characters, i.e., Exposure = "250" rather than Exposure = 250.

[5] Influence analysis is a term for a collection of methods designed to determine how an observation or set of observations influences inferences.

[6] The subset function is one way to create a subset of observations and/or variables from a data frame. In this case we are creating a subset of observations by excluding the observation corresponding to an exposure of 250 rads and 28 to 31 years after exposure. The ! means "not", so the subset is all observations except the possible outlier.

[7] Although these data make for a good exercise, I am not very satisfied with this analysis. There is no explanation for the outlier and there are what look like other irregularities in the data. Also it would have been good to be able to control for age, since the risk of cancer increases with age regardless of exposure, and years after exposure is not the same thing as age. Age would be partially confounded with years after exposure, and it may also be confounded with exposure. Statistics can be complicated, but often data are more so.
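The residual check in problem 5 and the refit in problem 6 follow a general pattern that can be sketched on simulated data. Everything below is hypothetical (the data frame toy, its columns, and the planted outlier), and rstudent is only one of several residual functions that could be used here.

```r
# Simulated toy data (NOT the ex2220 data) with one deliberately planted outlier.
set.seed(2)
toy <- expand.grid(dose   = factor(c(0, 100, 250, 400)),
                   period = factor(c("early", "mid", "late")))
toy$pyears <- round(runif(nrow(toy), 2000, 6000))
toy$deaths <- rpois(nrow(toy), lambda = toy$pyears * 0.003)
toy$deaths[toy$dose == "250" & toy$period == "late"] <- 60  # planted outlier

# Main-effects Poisson rate model with both explanatory variables as factors.
m <- glm(deaths ~ dose + period + offset(log(pyears)), family = poisson, data = toy)

# Studentized residuals, listed from largest to smallest in absolute value.
toy$rstud <- rstudent(m)
print(toy[order(-abs(toy$rstud)), c("dose", "period", "deaths", "rstud")])

# Refit without the suspect observation and compare coefficients side by side.
toy2 <- subset(toy, !(dose == "250" & period == "late"))
m2 <- glm(deaths ~ dose + period + offset(log(pyears)), family = poisson, data = toy2)
print(cbind(full = coef(m), dropped = coef(m2)))
```

Comparing the two coefficient columns gives a quick view of how much a single observation moves the estimates.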


Aerial Snow Geese Counting Again


The following problems concern the aerial snow geese counting data from the previous homework. Download an R script file from the following link.
https://dl.dropboxusercontent.com/u/10884844/Homework/snowgeese-glm.R

This script formats the data, and plots the raw data with the predicted values from a linear model as well as the residuals against the predicted values.

1. One major issue with the linear model is the heteroscedasticity that is evident in the plots. Estimate a Poisson regression model for these data and report the parameter estimates and standard errors from summary.

2. Based on the plot of the predicted values against the observed counts and the plot of the residuals, you should see that the Poisson regression model is questionable because the relationship between photo count and observer count appears to be linear, but the Poisson regression model is log-linear. Transforming the explanatory variable photo may help with this. Try the base-2 logarithm of x, which is specified in R as log2(x).[8] To do this replace every instance of photo in your script with log2(photo). You will also need to adjust the range for the fake data for plotting predicted values. Something like

seq(log2(9), log2(409), length = 100)

would do it. Report the parameter estimates and standard errors from summary for this model and comment on whether the plots show that the model fits the data better, and why.

3. In previous models like that for the alcohol metabolism data you learned how to use contrast to estimate the slope of each line when you have an interaction between a quantitative and a categorical variable. Here you can do the same since the model is linear for log[E(Yi)], but to interpret these slopes you can exponentiate them to summarize the factor by which E(Yi) changes per unit change in the quantitative variable. Use the contrast function in combination with my contrastfix function to estimate the factor by which the expected count increases for each observer when the photo count is doubled. Write a short paragraph (i.e., just a few sentences) that summarizes these estimates as well as their Wald confidence intervals. Write as if you were summarizing the results for a paper or report.

[8] The log2(x) transformation is relatively easy to interpret. Normally we interpret parameters in terms of a unit increase in the explanatory variable. A unit increase in log2(x) is the same as doubling x because log2(2x) = log2(x) + 1. So for a log-linear model we would say that the expected value of the response variable changes by a certain factor when x is doubled.
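The doubling interpretation of log2(x) can be checked numerically. The sketch below uses simulated data with hypothetical names (toy, photo, count) and a single explanatory variable, so it illustrates only the interpretation of the slope, not the model with an observer interaction used in the problems.

```r
# Simulated toy data (NOT the snow geese data); all names here are hypothetical.
set.seed(3)
toy <- data.frame(photo = round(runif(40, 10, 400)))
toy$count <- rpois(nrow(toy), lambda = 1.2 * toy$photo^0.9)

# Log-linear model with a base-2 log of the explanatory variable.
m <- glm(count ~ log2(photo), family = poisson, data = toy)

# exp(slope) is the factor by which E(count) changes when photo is doubled.
factor_per_doubling <- exp(coef(m)[["log2(photo)"]])

# Check: the ratio of predicted means at photo = 100 versus photo = 50 is the same factor.
p <- predict(m, newdata = data.frame(photo = c(50, 100)), type = "response")
ratio <- unname(p[2] / p[1])

print(c(factor_per_doubling = factor_per_doubling, ratio = ratio))

# Wald interval for the doubling factor (exponentiated Wald limits for the slope).
print(exp(confint.default(m, parm = "log2(photo)")))
```

The same ratio results for any pair of photo values where one is twice the other, which is what makes the doubling interpretation convenient.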


Locations of Malignant Melanoma


Roberts et al. (1981) report the results of an observational study in which n = 400 patients with malignant melanoma (a type of skin cancer) were classified in terms of the type of tumor and its location.[9] The data are summarized in the following contingency table.

                                        Tumor Site
Tumor Type                        extremity   head   trunk
Hutchinson's melanotic freckle           10     22       2
indeterminate                            28     11      17
nodular                                  73     19      33
superficial spreading melanoma          115     16      54

[9] Roberts, G., Martyn, A. L., Dobson, A. J., & McCarthy, W. H. (1981). Tumor thickness and histological type in malignant melanoma in New South Wales, Australia, 1970-76. Pathology, 13, 763-770.

A common exercise in an introductory statistics class is to determine if there is a relationship between two categorical variables. Statistical tests for this are sometimes called tests of independence or tests of homogeneity. The test can be done in R in a variety of ways. One way is to use the assocstats function from the vcd package.

> library(vcd)
> library(faraway)
> data(melanoma)
> mytable <- xtabs(count ~ tumor + site, data = melanoma)
> assocstats(mytable)
                    X^2 df   P(> X^2)
Likelihood Ratio 51.795  6 2.0505e-09
Pearson          65.813  6 2.9432e-12

The output shows two different test statistics for the test of independence/homogeneity (i.e., a test of the null hypothesis that there is no statistical relationship between tumor type and site).[10] Interestingly the test can also be done using Poisson regression.[11] The test of independence/homogeneity above is equivalent to a test of the null hypothesis that there is no interaction between tumor type and site for a Poisson regression model that uses the count as a response variable and tumor type and site as explanatory variables. Do this using a Poisson regression model with the melanoma data frame. The test statistic for the likelihood ratio test for the interaction should be equal (within rounding error) to the likelihood ratio test statistic reported above.

[10] The Pearson test statistic is the one that is usually covered in an introductory statistics course, but several other test statistics exist.

[11] This is actually not obvious because the counts in a table like the one above will usually have a predetermined sum by row, column, or total, which results in what is called a multinomial or product-multinomial distribution, not a Poisson distribution. A properly specified Poisson regression model will, however, have the same likelihood function as a multinomial model, which means we can obtain the same inferences from the Poisson regression model. In this application the Poisson regression model plays the role of what is sometimes called a surrogate model. There are some cases in which one statistical model can serve as a surrogate for another model because they have the same likelihood function, even though they appear to be fundamentally different.
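The equivalence described above can be illustrated on a small made-up table. The counts and the names toy, type, and site below are hypothetical and are not the melanoma data; the sketch only shows that the likelihood ratio statistic for the interaction in a Poisson regression model matches the statistic computed directly from observed and expected counts.

```r
# Made-up 2 x 3 table in long format (NOT the melanoma data).
toy <- data.frame(
  type  = factor(rep(c("A", "B"), each = 3)),
  site  = factor(rep(c("x", "y", "z"), times = 2)),
  count = c(12, 30, 18, 25, 20, 15)
)

# Poisson regression models without and with the type-by-site interaction.
m_add <- glm(count ~ type + site, family = poisson, data = toy)
m_int <- glm(count ~ type * site, family = poisson, data = toy)
lrt <- anova(m_add, m_int, test = "LRT")

# Likelihood ratio statistic computed directly from the contingency table:
# G^2 = 2 * sum(observed * log(observed / expected)) under independence.
tab <- xtabs(count ~ type + site, data = toy)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
g2 <- 2 * sum(tab * log(tab / expected))

print(lrt)
print(c(glm = lrt$Deviance[2], direct = g2))
```

The degrees of freedom also agree: the interaction adds (2 - 1)(3 - 1) = 2 parameters, which is the usual degrees of freedom for a test of independence in a 2 x 3 table.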
