
Introduction to GEE

Generalised Estimating Equations (GEE)


- Extends the generalized linear model to accommodate correlated Ys

- Handles longitudinal, repeated measures or otherwise clustered or correlated data, e.g. from complex surveys

- Often people would fit a linear model to such data and only then adjust the
standard errors to account for the clustering; the problem is that this post-hoc
approach does not affect the parameter estimates in the model.

GEEs are:
- A different way to estimate regression coefficients (not based on likelihoods)

- Can account for clustered or correlated data

- Called population average or marginal models

- They are not really statistical models (“estimating equations”)

- New! First introduced in 1986 (Liang & Zeger)


- GEEs have consistent and asymptotically normal solutions, even with
misspecification of the correlation structure

- Avoids the need for multivariate distributions by only assuming a functional form for the marginal distribution at each timepoint (i.e., y_ij)

- The covariance structure is treated as a nuisance

- Relies on the independence across subjects to consistently estimate the variance of the regression coefficients (even when the assumed correlation structure is incorrect)
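
The last point is usually called the robust or "sandwich" variance estimator. As a sketch in standard Liang and Zeger notation (the symbols are not from the slides): for subject i, let \mu_i be the vector of fitted means, D_i = \partial\mu_i / \partial\beta its derivative matrix, and V_i the working covariance matrix.

```latex
\begin{align*}
\widehat{\operatorname{Var}}(\hat\beta) &= B^{-1} M B^{-1}, \\
B &= \sum_{i=1}^{N} D_i^{\top} V_i^{-1} D_i, \\
M &= \sum_{i=1}^{N} D_i^{\top} V_i^{-1}
     (Y_i - \hat\mu_i)(Y_i - \hat\mu_i)^{\top} V_i^{-1} D_i .
\end{align*}
```

The middle term M uses each subject's observed residuals rather than the assumed V_i, and it sums over independent subjects. That is why the variance estimate stays consistent when the working correlation is wrong.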
Subject-specific vs population-average
GEE estimates the population average effects. Consider:

- A: You are a doctor. You want to know how much a certain drug will reduce
your patient’s odds of getting a heart attack.
- B: You are a department of health official. You want to know how the number
of people who die of heart attacks would change if everyone at risk took a
certain drug.

A is asking for subject-specific information; B is asking for information about the entire population. GEE can give estimates for B, but not A.
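
A rough numerical sketch of why A and B can get different answers. It assumes a hypothetical logistic model in which every patient has their own baseline risk (a normally distributed random intercept). All parameter values and names below are made up for illustration, and the code does not fit a GEE; it only computes the two kinds of effect for a known model.

```python
import math


def expit(x):
    """Inverse logit: log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))


def marginal_prob(eta, sigma, n=2001, width=8.0):
    """Average expit(eta + b) over b ~ Normal(0, sigma^2) (trapezoid rule)."""
    lo, hi = -width * sigma, width * sigma
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        b = lo + k * h
        weight = 0.5 if k in (0, n - 1) else 1.0
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        total += weight * expit(eta + b) * density
    return total * h


# Hypothetical values: baseline log-odds, drug effect for one patient,
# and the spread of patient-specific baselines.
beta0, beta_drug, sigma = -1.0, -1.0, 2.0

# Question A: change in log-odds for one given patient (subject-specific).
subject_specific_log_or = beta_drug

# Question B: compare the whole population treated vs untreated.
p_untreated = marginal_prob(beta0, sigma)
p_treated = marginal_prob(beta0 + beta_drug, sigma)
population_average_log_or = logit(p_treated) - logit(p_untreated)

print("subject-specific log odds ratio:  ", subject_specific_log_or)
print("population-average log odds ratio:", round(population_average_log_or, 3))
```

Under these assumptions the population-average log odds ratio is closer to zero than the subject-specific one. Both numbers are correct; they answer different questions, and a GEE targets the second.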
Parts of a GEE

g(Y) = 𝛃X + CORR
g(Y) - outcome, related to the systematic part through a link function g()

𝛃X - systematic part (linear predictor, includes coefficients, covariates etc)

CORR - placeholder term for the correlation / covariance matrix

Y - the responses, are assumed to be clustered or correlated


Assumptions
We assume that the mean of Y (E(Y)) depends on the covariates via a link
function g()

AND

that the variance of Y is related (through a function) to the mean

To describe how the Ys within a cluster are correlated, we also need to specify (and estimate) something called a ‘working correlation’
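
These assumptions can be written out in the standard Liang and Zeger notation. The symbols are not from the slides: i indexes subjects, j indexes observations within a subject, v() is the variance function, \phi a scale parameter and R_i(\alpha) the working correlation matrix.

```latex
\begin{align*}
g(\mu_{ij}) &= x_{ij}^{\top}\beta, \qquad \mu_{ij} = E(Y_{ij}) \\
\operatorname{Var}(Y_{ij}) &= \phi\, v(\mu_{ij}) \\
V_i &= \phi\, A_i^{1/2} R_i(\alpha)\, A_i^{1/2},
  \qquad A_i = \operatorname{diag}\{v(\mu_{i1}), \dots, v(\mu_{in_i})\} \\
\sum_{i=1}^{N} D_i^{\top} V_i^{-1} (Y_i - \mu_i) &= 0,
  \qquad D_i = \partial\mu_i / \partial\beta
\end{align*}
```

The last line is the estimating equation that is solved for \beta. With R_i(\alpha) equal to the identity matrix it reduces to the score equations of an ordinary GLM.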


Assume a structure for the working correlation
Independent - Exchangeable - Autoregressive - Unstructured - Free specification
Describing correlation mathematically
Exchangeable correlation: Responses within subjects are equally correlated
First-order autoregressive (AR1): Correlation among responses within subjects decays exponentially
Unstructured: Correlation among responses within subjects completely unspecified
Independence: No correlation among responses within subjects
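
As a sketch of what these structures look like, assume a subject with four repeated measurements and write \alpha (or \alpha_{jk}) for the correlation parameters. This notation is an assumption for illustration, not taken from the slides.

```latex
\begin{align*}
R_{\text{independence}} &=
\begin{pmatrix} 1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&0&1 \end{pmatrix}
&
R_{\text{exchangeable}} &=
\begin{pmatrix} 1&\alpha&\alpha&\alpha\\ \alpha&1&\alpha&\alpha\\
                \alpha&\alpha&1&\alpha\\ \alpha&\alpha&\alpha&1 \end{pmatrix}
\\[1ex]
R_{\text{AR1}} &=
\begin{pmatrix} 1&\alpha&\alpha^{2}&\alpha^{3}\\ \alpha&1&\alpha&\alpha^{2}\\
                \alpha^{2}&\alpha&1&\alpha\\ \alpha^{3}&\alpha^{2}&\alpha&1 \end{pmatrix}
&
R_{\text{unstructured}} &=
\begin{pmatrix} 1&\alpha_{12}&\alpha_{13}&\alpha_{14}\\ \alpha_{12}&1&\alpha_{23}&\alpha_{24}\\
                \alpha_{13}&\alpha_{23}&1&\alpha_{34}\\ \alpha_{14}&\alpha_{24}&\alpha_{34}&1 \end{pmatrix}
\end{align*}
```

Independence has no correlation parameters, exchangeable and AR1 have one each, and the unstructured matrix has n(n-1)/2 of them for n measurements (six here).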
Independent: the covariance matrix is diagonal (only variances, no covariances; same estimates as an ordinary GLM)

Exchangeable: All measurements on the same unit are equally correlated (plausible for clustered data, maybe not so much for longitudinal)

Also called: spherical, compound symmetry

Autoregressive: Correlation depends on time or distance between measurements (plausible for many spatial and temporal data)

Unstructured: No assumptions about correlation

Lots of parameters, so it often fails to converge, especially with many measurements per unit or few units


Examples
An example
● Longitudinal follow-up from 1995-2003 of lung function decline
● N = 322 males used for this example
● Variables of interest

○ FEV1

○ Age (at each observation)

○ Smoking status: current, former, nonsmoker

○ Age at baseline

○ Height at baseline
Data from 2
Code for fitting this
. xtgee fev age smoking2 smoking3 agebase hbase, i(id) t(wave) corr(exchangeable)

library(geepack)
# "lungdata" is a placeholder name for your data frame
geeglm(fev ~ age + smoking2 + smoking3 + agebase + hbase, data = lungdata,
       id = id, waves = wave, corstr = "exchangeable")

** Please note: the order of the observations within clusters (and of the clusters in the data set) and the type of your variables usually matter more than you might wish. Please read the documentation carefully.
“An expected difference in blood pressure comparing smokers to non-smokers of
the same age and weight”

Not

“An expected difference in blood pressure comparing a smoker to a non-smoker of
the same age and weight”
A real study! GEEs!
Design
● N = 375
● Visits at weeks 4, 12, 24, 36, 48
● Outcome 1: Retention at week 48 (Y/N)
● Outcome 2: Protective tenofovir concentrations at week 48 (Y/N)
Methods
“as tenofovir diphosphate concentrations for individuals aged 18-24 years were
assessed at all visits, to control for correlation among observations, generalised
estimating equation logistic regression models with autoregressive correlation
matrix and robust variance were used..”
Reading / links
Simple worked example in R:
https://rlbarter.github.io/Practical-Statistics/2017/05/10/generalized-estimating-equations-gee/

Nice paper (probably read the introduction only; the rest gets fairly technical):
James A. Hanley, Abdissa Negassa, Michael D. deB. Edwardes, Janet E. Forrester. Statistical Analysis of Correlated Data Using Generalized Estimating Equations: An Orientation. American Journal of Epidemiology, Volume 157, Issue 4, 15 February 2003, Pages 364–375.
https://academic.oup.com/aje/article/157/4/364/78911
