
Poisson regression using Stata

Mike Crowson, Ph.D.


Created July 10, 2019

In this video, I provide a demonstration of how to carry out a basic Poisson regression analysis in Stata
(the version I am using is Stata 14.2).

A link to the data used, as well as this PowerPoint (which contains additional information not covered in
the video) and a text file containing the commands used in the video, will be made available for download
underneath the video description.

If you find the video and materials useful, please take the time to “like” the video and share the link with
others. Also, please consider subscribing to my YouTube channel.
YouTube video link: https://www.youtube.com/watch?v=Wo_QSeLV0Vk

For more videos and resources, check out my website:


https://sites.google.com/view/statistics-for-the-real-world/home

Poisson regression is a technique that is used to predict count outcomes, with those counts occurring
within a given space or span of time (King, 1988). Oftentimes, the observed counts reflect low frequency
events (Osborne, 2017). Unlike OLS regression, Poisson regression does not assume normally distributed
residuals with constant variance (Coxe, West, and Aiken, 2009). It also does not assume a linear
relationship between the independent and dependent variables, as OLS regression requires.

Since the Poisson distribution becomes increasingly normal as the expected value (i.e., mean) of the
distribution increases, OLS regression can be a reasonable approach to predicting count outcomes when
the counts on the dependent variable involve larger integer values – so long as OLS assumptions of
normality and homoskedasticity of residuals are met (Osborne, 2015). When the expected value of the
distribution is small (e.g., less than 10), then Poisson regression is more appropriate (Coxe et al., 2009).
Notably, OLS regression involving low frequency count outcomes can produce illogical predicted values on
the dependent variable in the form of negative predicted counts, despite the fact that a count can take
only non-negative integer values (Gardner, Mulvey, & Shaw, 1995). Moreover, modeling low frequency
count outcomes with OLS regression can result in downwardly biased standard errors, thereby increasing
the likelihood of incorrect inferences (Coxe et al., 2009).

Poisson regression makes the assumption that the conditional mean and variance of the distribution of counts
are equal (Payne et al., 2017; see also
https://www.theanalysisfactor.com/glm-r-overdispersion-count-regression/) – a condition referred to as
equidispersion (Coxe et al., 2009). In many cases, the variance of the count outcome is greater than the mean,
referred to as overdispersion (Karazsia & van Dulmen, 2008). Allison (2009) describes overdispersion as a
condition in which “there is more variation in the event counts than would be expected based on a Poisson
distribution” – something that typically occurs when there is a failure to “include all causes of variation in the
counts” (p. 57) in the model. When this occurs, the Poisson model can underestimate standard errors leading to
increased likelihood of Type I error (Allison, 2009; Ismail & Jemain, 2007). When overdispersion is a problem,
negative binomial regression may be used, or the analyst might try using a different scaling parameter
(Osborne, 2017). [Note: When equidispersion is present, the negative binomial distribution converges onto the
Poisson distribution (Karazsia & van Dulmen, 2008).]

Sometimes the conditional variance is lower than the conditional mean – a condition referred to as
underdispersion. The effect of underdispersion is to inflate standard errors, thereby reducing the power of tests
of the regression coefficients. Osborne (2017) suggests that re-specifying the scaling parameter may help
when this problem presents. Nevertheless, researchers tend to be more concerned with issues of
overdispersion than underdispersion in their treatments of Poisson regression.

Frequently, Poisson regression is used when counts have been made within a fixed period of time (i.e., the
measurement period has the same length for all cases). However, in situations where counts (e.g., number of
children in a family) are made over varying periods of time (e.g., age of the mother), then it becomes necessary
to control for differences in the length of the periods in which observations are made. This can be accomplished
through incorporation of an offset variable (Coxe et al., 2009). In effect, an offset variable is used to account for
different levels of exposure associated with the cases under observation (see
https://www.theanalysisfactor.com/the-exposure-variable-in-poission-regression-models/). In Stata it is
possible to incorporate an offset (or exposure) variable.
Assessing model fit:

In general, the first step in assessing model fit is evaluation of the results of the likelihood ratio (LR) chi-
square test. The LR test is used to test whether the model containing the full set of predictors represents a
significant improvement in fit over a null (intercept only) model. When the LR test is statistically significant,
this means that the model is a significant improvement in fit relative to the null. Only when this model is
significant do you proceed to evaluation of each of the regression coefficients in order to determine those
predictors that contribute significantly to the model (Osborne, 2017).

There are two ways of testing the regression coefficients in a Poisson regression model: Wald test and
Likelihood ratio test. A limitation of the Wald test is that it can be overly conservative when it comes to
testing the regression coefficient against the null. The Likelihood ratio test is a more powerful test of
regression parameters. It involves testing the full model including a given predictor against a reduced
model without that predictor. If the decrease in fit is statistically significant, then this indicates the
predictor is a significant contributor to the model.
Examples in which Poisson regression might be used:

(a) A management researcher is studying predictors of the number of work days during a 90-day period
employees are absent from work (for whatever reason). The researcher obtains data from a sample of
150 employees and models the number of days absent as a function of employee salary, gender
identification, an indicator of whether or not an employee is in a managerial position, and self-report
measures of employee satisfaction and stress associated with their job.

(b) A researcher is studying predictors of the number of self-reported alcoholic beverages an adult has had
over the previous week. The researcher obtains cross-sectional data from n=200 adults on drinking
behavior and models it as a function of age, gender identification, family income, and marital status.

(c) A sports statistician obtains data from a cross-section of 300 regular season games (with ‘game’ being
the unit of analysis) during the college football season in order to predict the number of penalties the
losing team is assessed based on characteristics of the teams playing in those games.

(d) A counseling researcher is studying predictors of the number of “no-shows” of clients at a counseling
clinic over the span of a month. The researcher models factors such as (a) family income, (b) gender
identification, (c) symptom severity, and (d) a measure of discomfort with psychotherapy as predictors
of the number of “no-shows”.
(e) An educational psychologist is studying predictors of the number of times a student at a high school
received a failing letter grade for a class over a six month span. The researcher considers predictors such
as student gender, a measure of how often a student had been written up for a behavioral infraction
during the six month period, and self-report measures of student engagement and academic self-
efficacy.

(f) A researcher is studying the number of accidents that occur over the span of six months in a sample of
200 traffic intersections randomly sampled from the population of intersections within a large city. The
researcher hypothesizes that factors such as the degree of traffic, road conditions, signage, and other
characteristics of an intersection account for variation in number of accidents observed during that time
frame.

(g) A human performance researcher is studying the number of times a person is able to lift a 50 pound
weight during a one-minute period of time, as a function of a person’s height, weight, and a pre-task
intervention (intervention versus control) aimed at increasing motivation for performing the task.

(h) A researcher is studying predictors of the count for the number of parking tickets a driver has received
in a lifetime. The researcher models ‘count’ as a function of three of the Big 5 personality factors:
‘conscientiousness’, ‘agreeableness’, and ‘extroversion’. Given that the observation/exposure period will
not be the same for everyone (since individuals vary in the length of time they have been driving), an
offset variable is incorporated into the model.
Scenario: You are studying predictors of the number of work days during a 90-day period employees are absent
from work (for whatever reason). You obtain data from a sample of 50 employees (actually a very small
sample for carrying out Poisson regression) on the number of days absent, employee salary (in 10k’s), a
measure of gender identification (coded 0=male, 1=female), an indicator of whether or not an employee is
in a managerial position
(0=not manager, 1=manager), and self-report measures of employee satisfaction and stress associated with
their job.

You are predicting the count of absences as a function of salary, gender identification, managerial position,
and employee satisfaction and stress. Note: In this demonstration, we will NOT be including an offset
variable as we will assume the level of exposure is the same for everyone in the study. For example, we
will assume that no one was hired or left within the observation period.
The basic command structure for running a Poisson regression is:

poisson <name of dv> followed by <names of independent variables>

Below, I’ve typed the command for running the Poisson regression for our analysis into the
command line:
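(The command itself appeared on the slide as a screenshot. As a sketch, using hypothetical variable names chosen to match the output discussed below – ‘daysabsent’ and ‘worksat’ in particular are assumed names, not names confirmed by the dataset – it would look like this:)

```stata
* Poisson regression of days absent on the five predictors
* (variable names are assumed for illustration; substitute the
*  actual names in your dataset)
poisson daysabsent salary gendered manager worksat stress
```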
This is a likelihood ratio chi-square test aimed at testing whether the model containing the full set of
predictors fits significantly better than a null (intercept only) model. If the test result is significant (as it is
here), then we say that our model is a significant improvement in fit over a null model.

The pseudo R-square reported is McFadden’s. (For details, see
https://stats.idre.ucla.edu/stata/output/poisson-regression/)
The regression slope (B column) is interpreted as the predicted change in the log count for every one unit
increase on the predictor (controlling for the remaining predictors). Although the units we are working with
when interpreting the regression slopes are log counts, we can generally state that a positive coefficient
indicates that as scores increase on a predictor, the predicted incidence rate (i.e., count) on the
dependent variable increases. On the other hand, a negative coefficient indicates that the predicted
incidence rate on the dependent variable decreases with increasing values on the predictor. The Wald test
(z) is provided in the output to test the significance of the regression coefficients.
In the model, ‘salary’ was a negative and significant predictor of the incidence rate for the number of days
absent from work (b=-.075, s.e.=.0355, p=.035).

The regression coefficient for ‘manager’ was non-significant (b=-.147, s.e.=.1735, p=.397). Similarly, the
regression slope for ‘gendered’ was non-significant (b=.136, s.e.=.1591, p=.391), indicating no significant
difference in predicted incidence rate between persons identified as male and female. [Had the regression slope been
significant, then the positive slope would’ve been interpreted as indicating that the rate of absences for
females – coded 1 – was greater than that for males – coded 0. Again, the difference shown in the table is not
significant.]
The regression coefficient for ‘work satisfaction’ was negative, suggesting that employees scoring higher on
work satisfaction are predicted to exhibit a lower incidence rate for absences than employees scoring
lower on the measure. Nevertheless, work satisfaction was not a significant predictor in the model (b=-.061,
s.e.=.0322, p=.056).

The regression coefficient for ‘stress’ was positive and significant (b=.061, s.e.=.0250, p=.016), indicating
that employees scoring higher on stress were predicted to have a higher rate of absenteeism than those
scoring lower on stress.
To obtain the incidence rate ratio (IRR), you simply type in an additional comma, followed by ‘irr’.
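(Sketched with the same hypothetical variable names used for illustration throughout – the outcome and predictor names are assumptions, not confirmed dataset names:)

```stata
* The irr option reports exponentiated coefficients
* (incidence rate ratios) instead of raw slopes
poisson daysabsent salary gendered manager worksat stress, irr
```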
The incidence rate ratio (IRR) represents the predicted change in the incidence rate per unit increase on the
predictor. A value greater than 1 indicates that with increasing scores on the predictor, the incidence rate
increases by a factor of the IRR; a value less than 1 indicates that with increasing scores on the predictor, the
incidence rate decreases by a factor of the IRR. [For a nice discussion of the incidence rate ratio
(found in the Exp(B) column) go here: https://stats.idre.ucla.edu/stata/output/poisson-regression/.] The 95%
confidence interval for the IRR is found in the table as well. The null IRR is 1.
The IRR for ‘salary’ suggests that for every one unit increase on the predictor, the predicted incidence rate
changes by a factor of .928 (meaning the incidence rate decreases).

The IRR for ‘stress’ indicates that for every one unit increase on stress, the predicted incidence
rate changes by a factor of 1.062.
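The IRR is simply the exponentiated regression coefficient, which you can confirm directly with Stata’s display command:

```stata
* IRR = exp(b): exponentiate the slopes reported earlier
display exp(-.075)   // salary: approximately .928
display exp(.061)    // stress: approximately 1.063
```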

Since the remaining predictors were not significant, we will not interpret the remaining coefficients.
You can use several postestimation commands to obtain additional measures of fit. Just use the
command ‘estat’ followed by ‘gof’ to obtain the Deviance goodness of fit and Pearson goodness of fit
tests.
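A minimal sketch (run immediately after the poisson command):

```stata
* Deviance and Pearson goodness-of-fit chi-square tests
* for the most recently fitted poisson model
estat gof
```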

Non-significant test results (as in this case) are indicators of a good-fitting model.
One way of evaluating the possibility of overdispersion is to compute the ratio of the deviance or
Pearson chi-square value to its degrees of freedom.

According to Payne et al. (2017) and Osborne (2017), departures from equidispersion can be assessed by
computing the ratio of the deviance to its degrees of freedom and/or forming a ratio of the Pearson chi-
square to its degrees of freedom. Values greater than 1 indicate the presence of overdispersion, whereas
values less than 1 signal the presence of underdispersion (as shown in this example, when you look at the
Value/df column in the output). Field (2018) suggests that overdispersion is most likely to be problematic
when the ratio of the chi-square to its degrees of freedom is greater than 2. On the other hand, Payne et
al. (2017) suggest a chi-square/df ratio of 1.2 as a threshold for moving from use of a Poisson model to a
negative binomial model.

The ratio of the deviance to its degrees of freedom in these data is: 16.79955/44 = .382.
The ratio of the Pearson chi-square to its degrees of freedom is: 13.05686/44 = .297.
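These ratios can be computed directly in Stata with the display command:

```stata
* Ratio of each goodness-of-fit chi-square to its degrees of freedom
display 16.79955/44   // deviance/df = approximately .382
display 13.05686/44   // Pearson/df  = approximately .297
```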
You can use several postestimation commands to obtain additional measures of fit. Just use the
command ‘estat’ followed by ‘ic’ to obtain Akaike’s Information criterion (AIC) and Bayesian
Information Criterion (BIC).
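A minimal sketch (again run after the poisson command):

```stata
* AIC and BIC for the most recently fitted model
estat ic
```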
Likelihood ratio tests can be carried out to test each individual predictor for significance.

Step 1: Store the results from the full model using the ‘estimates store’ command.

The ‘full’ in this line provides a name for the full model (which is used to
compare against later reduced models).

Step 2: Re-analyze the model, excluding the predictor you are testing for significance. Following this,
type ‘lrtest full’. Let’s test the ‘manager’ variable for significance. We’ll re-run our model, excluding
this variable, and then use the ‘lrtest’ command.

Here, we see that ‘manager’ was not a significant predictor.
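Putting the two steps together (sketched with the hypothetical variable names used for illustration in this example):

```stata
* Step 1: fit and store the full model under the name 'full'
poisson daysabsent salary gendered manager worksat stress
estimates store full

* Step 2: refit without 'manager', then compare against the stored model
poisson daysabsent salary gendered worksat stress
lrtest full
```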
References & Resources
Allison, P.D. (2009). Fixed effects regression models. Thousand Oaks, CA: Sage.

Coxe, S., West, S.G., & Aiken, L.S. (2009). The analysis of count data: A gentle introduction to Poisson regression and its alternatives. Journal of
Personality Assessment, 91, 121-136.

Field, A. (2018). Discovering statistics using IBM SPSS statistics (5th ed). Los Angeles: Sage.

Gardner, W., Mulvey, E.P., & Shaw, E.C. (1995). Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models.
Psychological Bulletin, 118, 392-404.

Ismail, N., & Jemain, A.A. (2007). Handling overdispersion with negative binomial and generalized Poisson regression models. Casualty Actuarial Society
Forum, Winter 2007. Retrieved on June 22, 2019 from https://www.casact.org/pubs/forum/07wforum/07w109.pdf.

Karazsia, B.T., & van Dulmen, H.M. (2008). Regression models for count data: Illustrations using longitudinal predictors of childhood injury. Journal of
Pediatric Psychology, 33, 1076–1084.

King, G. (1988). Statistical models for political science event counts: Bias in conventional procedures and evidence for the exponential Poisson
regression model. American Journal of Political Science, 32(3), 838-863. Retrieved on June 22, 2019 from
https://scholar.harvard.edu/files/gking/files/epr.pdf.

Osborne, J.W. (2017). Regression and linear modeling: Best practices and modern methods. Los Angeles, CA: Sage.

Payne, E.H., Gebregziabher, M., Hardin, J.W., Ramakrishnan, V., & Egede, L.E. (2017). An empirical approach to determine a threshold for identification of
overdispersion in count data. Communications in Statistics – Simulation and Computation, 47, 1722-1738.

Additional information can be found here: https://stats.idre.ucla.edu/stata/output/poisson-regression/


Thanks for watching!

If you find the video and materials I have made available useful, please take the time to “like” the video
and share it with others. Also, please consider subscribing to receive information on new statistics
videos I upload!
