
Categorical Data Analysis

陳俞成
Email:ycchen@mail.chna.edu.tw

2005.10.24

Generalized Linear Models

- Models serve as the basis for investigating the effects of explanatory variables on categorical response variables.

Generalized Linear Models

- Benefits of a good-fitting model:
  - The structural form of the model describes the patterns of association and interaction.
  - Inferences for model parameters help us evaluate which explanatory variables affect the response, while controlling for the effects of possible confounding variables.
  - The sizes of the estimated model parameters determine the strength and importance of the effects.
  - The model's predicted values smooth the data and provide improved estimates of the mean of the response distribution.

Generalized Linear Models

- A model can handle more complicated situations, such as analyzing simultaneously the effects of several explanatory variables.
- The model-building paradigm focuses on estimating parameters that describe the effects, which is more informative than mere significance testing.
- The explanatory variables in the model can be continuous, categorical, or both.

Generalized Linear Models

- Generalized linear models (GLMs) are a broad class of models that includes ordinary regression and ANOVA models for continuous response variables, as well as models for categorical response variables.

Generalized Linear Models

- §4.1 discusses three components that are common to all GLMs.
- §4.2 introduces the logistic regression model for binary response variables, appropriate for binomial data.
- §4.3 introduces the loglinear model for count-type response variables modeled as Poisson data.
- §4.4 discusses checks of the adequacy of model fit for GLMs, illustrating for Poisson loglinear models.
- §4.5 presents further details about the fitting and checking of GLMs.

Components of a Generalized Linear Model

- The random component identifies the response variable $Y$ and assumes a probability distribution for it.
- The systematic component specifies the explanatory variables used as predictors in the model.
- The link component describes the functional relationship between the systematic component and the expected value (mean) of the random component.

Random Component

- The random component of a GLM consists of a response variable $Y$ with independent observations $(y_1, \ldots, y_n)$ from a distribution in the natural exponential family.
- The natural exponential family has probability density function or mass function of the form

  $$f(y_i; \theta_i) = a(\theta_i)\, b(y_i) \exp[y_i Q(\theta_i)].$$

Random Component

- The value of the parameter $\theta_i$ may vary for $i = 1, \ldots, n$, depending on values of explanatory variables.
- The term $Q(\theta)$ is called the natural parameter.

Binomial Logit Models for Binary Data

- Represent the success and failure outcomes by 1 and 0.
- The Bernoulli distribution for this Bernoulli trial specifies probabilities $P(Y = 1) = \pi$ and $P(Y = 0) = 1 - \pi$, for which $E(Y) = \pi$.
- For $y = 0$ and $1$,

  $$f(y; \pi) = \pi^{y}(1 - \pi)^{1-y} = (1 - \pi)\left[\frac{\pi}{1 - \pi}\right]^{y} = (1 - \pi)\exp\left(y \log\frac{\pi}{1 - \pi}\right).$$

Binomial Logit Models for Binary Data

- This is in the natural exponential family, identifying $\theta = \pi$, $a(\pi) = 1 - \pi$, $b(y) = 1$, and $Q(\pi) = \log[\pi/(1 - \pi)]$.
- The natural parameter $\log[\pi/(1 - \pi)]$ is the log odds of response 1, the logit of $\pi$.
- GLMs using the logit link are often called logit models.

Poisson Loglinear Models for Count Data

- Poisson variates can take any nonnegative integer value.
- The Poisson probability mass function for $Y$ is

  $$f(y; \mu) = \frac{e^{-\mu}\mu^{y}}{y!} = \exp(-\mu)\left(\frac{1}{y!}\right)\exp(y \log \mu), \qquad y = 0, 1, 2, \ldots.$$

Poisson Loglinear Models for Count Data

- This has natural exponential form with $\theta = \mu$, $a(\mu) = \exp(-\mu)$, $b(y) = 1/y!$, and $Q(\mu) = \log \mu$.
- The natural parameter is $\log \mu$.
- Poisson GLMs using the log link are often called Poisson loglinear models.
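
As a quick numerical check of the two factorizations above, the following sketch compares the usual Bernoulli and Poisson mass functions with their natural exponential family forms $a(\theta)\,b(y)\exp[y\,Q(\theta)]$. The function names and the test values $\pi = 0.3$ and $\mu = 2.5$ are illustrative assumptions, not part of the slides.

```python
import math

def bernoulli_pmf(y, pi):
    """Direct form: pi^y * (1 - pi)^(1 - y) for y in {0, 1}."""
    return pi**y * (1 - pi) ** (1 - y)

def bernoulli_expfam(y, pi):
    """a(pi) * b(y) * exp[y * Q(pi)] with a = 1 - pi, b = 1, Q = logit(pi)."""
    a, b, q = 1 - pi, 1.0, math.log(pi / (1 - pi))
    return a * b * math.exp(y * q)

def poisson_pmf(y, mu):
    """Direct form: exp(-mu) * mu^y / y!."""
    return math.exp(-mu) * mu**y / math.factorial(y)

def poisson_expfam(y, mu):
    """a(mu) * b(y) * exp[y * Q(mu)] with a = exp(-mu), b = 1/y!, Q = log(mu)."""
    a, b, q = math.exp(-mu), 1 / math.factorial(y), math.log(mu)
    return a * b * math.exp(y * q)

# The two forms should agree up to floating-point rounding.
for y in (0, 1):
    assert math.isclose(bernoulli_pmf(y, 0.3), bernoulli_expfam(y, 0.3))
for y in range(6):
    assert math.isclose(poisson_pmf(y, 2.5), poisson_expfam(y, 2.5))
```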

Systematic Component

- The systematic component of a GLM relates a vector $(\eta_1, \ldots, \eta_n)$ to the explanatory variables through a linear model.
- Let $x_{ij}$ denote the value of predictor $j$ ($j = 1, 2, \ldots, p$) for subject $i$.
- $\eta_i = \sum_j \beta_j x_{ij}, \quad i = 1, \ldots, n$

Systematic Component

- This linear combination of the explanatory variables, $\sum_j \beta_j x_{ij}$, is called the linear predictor.
- One $x_{ij} = 1$ for all $i$, for the coefficient of an intercept (often denoted by $\alpha$) in the model.
- Some $\{x_{ij}\}$ may be based on others in the model; for instance, perhaps $x_{i3} = x_{i1} x_{i2}$, or $x_{i3} = x_{i1}^{2}$.

Link

- The third component of a GLM is a link function that connects the random and systematic components.
- Let $\mu_i = E(Y_i)$, $i = 1, \ldots, n$.
- The model links $\mu_i$ to $\eta_i$ by $\eta_i = g(\mu_i)$, where the link function $g(\cdot)$ is a monotonic, differentiable function.
- $g(\mu_i) = \sum_j \beta_j x_{ij}, \quad i = 1, \ldots, n$

Link

- The link function $g(\mu) = \mu$, called the identity link, has $\eta_i = \mu_i$.
- $\mu_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$
- This is the form of ordinary regression models for normally distributed responses.

Link

- The link function $g(\mu) = \log[\mu/(1 - \mu)]$, called the logit link, has $\eta_i = g(\mu_i)$.
- $\log[\mu_i/(1 - \mu_i)] = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$
- It is appropriate when $\mu$ is between 0 and 1, such as a probability.
- A GLM that uses the logit link is called a logit model.

Link

- The link function $g(\mu) = \log(\mu)$, called the log link, has $\eta_i = g(\mu_i)$.
- $\log(\mu_i) = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$
- It is appropriate when $\mu$ cannot be negative, such as with count data.
- A GLM that uses the log link is called a loglinear model.

Link

- The link function that transforms the mean to the natural parameter is called the canonical link.
- For it, $g(\mu_i) = Q(\theta_i)$, and $Q(\theta_i) = \sum_j \beta_j x_{ij}$.
- For the normal distribution, it is the mean itself.
- For the Poisson, the natural parameter is the log of the mean.
- For the binomial, the natural parameter is the logit of the success probability.
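
The three links discussed above can be sketched as pairs of functions, each link $g$ together with its inverse $g^{-1}$, so that the mean is recovered from the linear predictor as $\mu = g^{-1}(\eta)$. The dictionary `LINKS`, the helper `mean_from_predictor`, and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

# Each entry maps a link name to (g, g_inverse).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
}

def mean_from_predictor(link, alpha, betas, xs):
    """Compute mu = g^{-1}(alpha + sum_j beta_j * x_j) for the named link."""
    eta = alpha + sum(b * x for b, x in zip(betas, xs))
    _, inverse = LINKS[link]
    return inverse(eta)

# Hypothetical coefficients giving a linear predictor of eta = 0.
alpha, betas, xs = -1.0, [0.5], [2.0]
for name in LINKS:
    print(name, mean_from_predictor(name, alpha, betas, xs))
```

With $\eta = 0$, the identity link gives $\mu = 0$, the logit link gives $\mu = 0.5$, and the log link gives $\mu = 1$, which shows how the same linear predictor maps to different means depending on the link.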

Normal GLM

- Ordinary regression and ANOVA models for continuous variates are special cases of GLMs.
- A GLM generalizes ordinary regression models in two ways:
  - First, it allows the random component to have a distribution other than the normal.
  - Second, it allows modeling some function of the mean.

Normal GLM

- A traditional way of analyzing nonnormal data attempts to transform the response so that it is approximately normal, with constant variance.
- For example, the Box-Cox transformation stabilizes the variance by finding $\lambda$ such that

  $$y^{*} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \ln y & \text{if } \lambda = 0, \end{cases}$$

  and then $y^{*}$ is expected to be approximately normally distributed.
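
A minimal sketch of the Box-Cox transformation as defined above, for a single positive value $y$ and a given $\lambda$. The function name `box_cox` is an assumption, and the sketch does not estimate $\lambda$; it only evaluates the transform.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform: (y^lam - 1) / lam if lam != 0, and ln(y) if lam == 0."""
    if y <= 0:
        raise ValueError("the Box-Cox transform requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y**lam - 1) / lam

# lam = 1 only shifts the data, lam = 0.5 is a square-root-type transform,
# and small lam approaches the log transform.
print(box_cox(4.0, 1.0), box_cox(4.0, 0.5), box_cox(4.0, 0.0))
```

The $\lambda = 0$ case is the limit of the general formula as $\lambda \to 0$, which is why the two branches join smoothly.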

Normal GLM

- A transform that produces constant variance may not produce normality, or else simple linear models for the explanatory variables may fit poorly on that scale.
- With the theory and methodology of GLMs, it is unnecessary to transform data so that normal-theory methods apply.

Normal GLM

- The GLM fitting process utilizes maximum likelihood methods for our choice of random component, and in GLMs the choice of link is separate from the choice of random component.
- If a link produces additivity of effects (i.e., if a linear model holds for that link), it is not necessary that it also stabilize variance or produce normality.

Normal GLM

- Regression, ANOVA, and models for categorical data are special cases of one super model.
- The same fitting method yields ML estimates of parameters for all GLMs.
- This method is the basis of software for fitting GLMs, such as GLIM and SAS (PROC GENMOD).

Generalized Linear Models for Binary Data

- $Y$ might indicate the vote in a British election (Labour, Conservative), choice of automobile (domestic, import), or diagnosis of breast cancer (present, absent).
- Each observation has one of two outcomes, denoted by 0 and 1, binomial for a single trial.

Generalized Linear Models for Binary Data

- A binary response is sometimes called a Bernoulli variable.
- Its distribution is specified by probabilities $P(Y = 1) = \pi$ of success and $P(Y = 0) = 1 - \pi$ of failure.
- This distribution has mean $E(Y) = \pi$ and variance $\mathrm{var}(Y) = \pi(1 - \pi)$.

Generalized Linear Models for Binary Data

- For $n$ independent observations on a binary response with parameter $\pi$, the number of successes has the binomial distribution specified by the indices $n$ and $\pi$.
- We denote $P(Y = 1)$ by $\pi(x)$, reflecting its dependence on values $x = (x_1, \ldots, x_p)$ of predictors.
- The variance of $Y$ is $\mathrm{var}(Y) = \pi(x)[1 - \pi(x)]$.

Linear Probability Model

- For a binary response, the regression model $\pi(x) = \alpha + \beta x$ is called a linear probability model.
- With independent observations it is a GLM with binomial random component and identity link function.
- The parameter $\beta$ represents the change in $\pi(x)$ for a one-unit increase in $x$.

Linear Probability Model

- Probabilities fall between 0 and 1, but linear functions take values over the entire real line.
- This model has $\pi(x) < 0$ or $\pi(x) > 1$ for sufficiently large or small $x$ values.

Linear Probability Model

- For its extension with multiple predictors, difficulties often occur in fitting this model because, during the fitting process, $\hat{\pi}(x)$ falls outside the $[0, 1]$ range for some subject's $x$ values.
- The model can be valid over a restricted range of $x$ values.

Linear Probability Model

- Least squares is ML for a normal distribution with constant variance.
- For binary responses, the constant variance condition that makes least squares estimators optimal (i.e., minimum variance in the class of linear unbiased estimators) is not satisfied.

Linear Probability Model

- Since $\mathrm{var}(Y) = \pi(x)[1 - \pi(x)]$, the variance depends on $x$ through its influence on $\pi(x)$.
- As $\pi(x)$ moves toward 0 or 1, the distribution of $Y$ is more nearly concentrated at a single point, and the variance moves toward 0.

Linear Probability Model

- Because of the nonconstant variance, the binomial ML estimator is more efficient than least squares.
- $Y$, being binary, is very far from normally distributed. The usual sampling distributions for the least squares estimators do not apply.

Linear Probability Model

- However, the estimates and standard errors for ML and least squares are usually similar when $\hat{\pi}(x)$ for the sample $x$ values falls in the range within which the variance is relatively stable (about 0.3 to 0.7).
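
To see why the range of about 0.3 to 0.7 counts as relatively stable, the sketch below evaluates $\mathrm{var}(Y) = \pi(1 - \pi)$ on a small grid. The grid values and the function name are illustrative assumptions.

```python
def bernoulli_variance(pi):
    """Variance of a binary response with success probability pi."""
    return pi * (1 - pi)

for pi in (0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95):
    print(f"pi = {pi:.2f}   var(Y) = {bernoulli_variance(pi):.4f}")
```

Between $\pi = 0.3$ and $\pi = 0.7$ the variance stays between 0.21 and its maximum of 0.25, whereas it drops below 0.05 once $\pi$ is 0.05 or 0.95.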

Snoring and Heart Disease Example

- The data come from an epidemiological survey of 2484 subjects that investigated snoring as a risk factor for heart disease.
- The model states that the probability of heart disease $\pi(x)$ is linearly related to the level of snoring $x$.

Snoring and Heart Disease Example

- We treat the rows of the table as independent binomial samples with that probability as the parameter.
- We use scores (0, 2, 4, 5) for the snoring categories, treating the last two levels as closer than the other adjacent pairs.

Snoring and Heart Disease Example

Relationship between Snoring and Heart Disease

                      Heart Disease    Proportion   Linear   Logit   Probit
Snoring               Yes       No     Yes          Fit(a)   Fit     Fit
Never                  24     1355     .017         .017     .021    .020
Occasional             35      603     .055         .057     .044    .046
Nearly every night     21      192     .099         .096     .093    .095
Every night            30      224     .118         .116     .132    .131

(a) Model fits refer to proportion of yes responses.
Source: Brit. Med. J., 291: 630-632 (1985).

Snoring and Heart Disease Example

- Software reports the ML fit $\hat{\pi}(x) = 0.0172 + 0.0198x$, with a standard error SE = 0.0028 for $\hat{\beta} = 0.0198$.
- For nonsnorers ($x = 0$), the estimated proportion of subjects having heart disease is 0.0172.
- We refer to the estimated values of $E(Y)$ for a GLM as fitted values.

Snoring and Heart Disease Example

- Figure 4.1 graphs the sample and fitted values.
- The table and graph suggest that the model fits well.
- §5.4 discusses formal goodness-of-fit analyses for binary-response GLMs.

Snoring and Heart Disease Example

- The estimated probability of heart disease is about 0.02 for nonsnorers; it increases by 2(0.0198) = 0.04 for occasional snorers, by another 0.04 for those who snore nearly every night, and by another 0.02 for those who always snore.
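
The increments quoted above follow directly from the reported linear probability fit $\hat{\pi}(x) = 0.0172 + 0.0198x$ and the scores (0, 2, 4, 5). The sketch below reproduces the fitted values; the variable and function names are illustrative assumptions.

```python
# Reported ML estimates for the linear probability model.
ALPHA_HAT, BETA_HAT = 0.0172, 0.0198

# Scores assigned to the snoring categories.
scores = {"Never": 0, "Occasional": 2, "Nearly every night": 4, "Every night": 5}

def linear_probability(x):
    """Fitted probability of heart disease at snoring score x."""
    return ALPHA_HAT + BETA_HAT * x

fitted = {level: linear_probability(x) for level, x in scores.items()}
for level, value in fitted.items():
    print(f"{level:<20s} {value:.4f}")
```

The fitted values come out near 0.017, 0.057, 0.096, and 0.116, in line with the "Linear Fit" column of the table.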

Snoring and Heart Disease Example

- Suppose we had chosen scores for snoring level having different relative spacings from the scores {0, 2, 4, 5}. Examples are {0, 2, 4, 4.5} or {0, 1, 2, 3}. Then the fitted values for the four snoring categories would change somewhat.
- They would not change if the relative spacings between scores were the same, such as {0, 4, 8, 10} or {1, 3, 5, 6}.

Snoring and Heart Disease Example

- If we entered the data as 2484 binary observations of 0 or 1 and fitted the model using ordinary least squares rather than ML, we would obtain $\hat{\pi}(x) = 0.0169 + 0.0200x$.
- When the model fit is good, least squares and ML estimates are usually similar.

Logistic Regression Model

- Binary data usually result from a nonlinear relationship between $\pi(x)$ and $x$.
- A fixed change in $x$ often has less impact when $\pi(x)$ is near 0 or 1 than when $\pi(x)$ is near 0.5.
- Nonlinear relationships between $\pi(x)$ and $x$ are often monotonic, with $\pi(x)$ increasing continuously or $\pi(x)$ decreasing continuously as $x$ increases.

Logistic Regression Model

- The S-shaped curves in Figure 4.2 are typical.
- The most important curve with this shape has the model formula

  $$\pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}.$$

- This is the logistic regression model.

Logistic Regression Model

- As $x \to \infty$, $\pi(x) \downarrow 0$ when $\beta < 0$ and $\pi(x) \uparrow 1$ when $\beta > 0$.
- As $|\beta|$ increases, the curve has a steeper rate of change.
- When $\beta = 0$, the curve flattens to a horizontal straight line.

Logistic Regression Model

- The odds are $\dfrac{\pi(x)}{1 - \pi(x)} = \exp(\alpha + \beta x)$.
- The log odds has the linear relationship $\mathrm{logit}[\pi(x)] = \log\left(\dfrac{\pi(x)}{1 - \pi(x)}\right) = \alpha + \beta x$.
- This is called the logistic regression function.
- The log odds transformation is called the logit transformation.

Logistic Regression Model

- Logistic regression models are GLMs with binomial random component and logit link function.
- Logistic regression models are also called logit models.
- The logit is the natural parameter of the binomial distribution, so the logit link is its canonical link.

Logistic Regression Model

- Whereas $\pi(x)$ must fall in the (0, 1) range, the logit can be any real number.
- The real numbers are also the range for linear predictors (such as $\alpha + \beta x$) that form the systematic component of a GLM.

Logistic Regression Model

- For the snoring and heart disease data, software reports the logistic regression ML fit $\mathrm{logit}[\hat{\pi}(x)] = -3.87 + 0.40x$.
- The positive $\hat{\beta} = 0.40$ reflects the increased incidence of heart disease at higher snoring levels.
- Chapter 5 presents several ways of interpreting such equations.
- Results are similar to those for the linear probability model.
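
The slides only quote the software output, so the following is a hedged sketch of how such a fit can be obtained: a plain Newton-Raphson iteration on the binomial log-likelihood for the grouped snoring data. The function `fit_logistic`, the starting values, and the fixed iteration count are assumptions for illustration, not the algorithm of any particular package. The group sizes assume "No" counts of 1355, 603, 192, and 224, which together with the "Yes" counts give the 2484 subjects of the survey.

```python
import math

# Grouped data: snoring score x, heart-disease cases y, group size n.
x = [0, 2, 4, 5]
y = [24, 35, 21, 30]
n = [1379, 638, 213, 254]  # assumed group sizes (Yes + No), total 2484

def fit_logistic(x, y, n, iterations=25):
    """Newton-Raphson for logit[pi(x)] = alpha + beta * x with binomial counts."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(alpha + beta * xi))) for xi in x]
        # Score vector: first derivatives of the log-likelihood.
        u0 = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
        u1 = sum(xi * (yi - ni * pi) for xi, yi, ni, pi in zip(x, y, n, p))
        # Information matrix entries, with weights n * p * (1 - p).
        w = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]
        i00 = sum(w)
        i01 = sum(xi * wi for xi, wi in zip(x, w))
        i11 = sum(xi * xi * wi for xi, wi in zip(x, w))
        det = i00 * i11 - i01 * i01
        # Solve the 2x2 system (information) * step = score by Cramer's rule.
        alpha += (i11 * u0 - i01 * u1) / det
        beta += (i00 * u1 - i01 * u0) / det
    return alpha, beta

alpha_hat, beta_hat = fit_logistic(x, y, n)
print(f"logit[pi_hat(x)] = {alpha_hat:.2f} + {beta_hat:.2f} x")
```

Under these assumptions the estimates should land close to the reported fit $-3.87 + 0.40x$.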

Alternative Binary Links

- The cumulative distribution function (cdf) $F(x)$ for $X$ is defined as $F(x) = P(X \le x)$, $-\infty < x < \infty$.
- A monotone regression curve with $\beta > 0$, as in Figure 4.2, has the shape of a cdf for a continuous random variable.
- This suggests a model for a binary response having form $\pi(x) = F(x)$ for some cdf $F$.

Alternative Binary Links

- When $\beta > 0$, the logistic regression curve has the form $\pi(x) = F(x)$, where $F$ is the cdf of a two-parameter logistic distribution.
- When $\beta < 0$, the formula for $1 - \pi(x)$ has the logistic cdf appearance.
- Each choice of $\alpha$ and of $\beta > 0$ corresponds to a different logistic distribution.

Alternative Binary Links

- The logistic cdf corresponds to a probability distribution with a symmetric, bell shape.
- It looks similar to a normal distribution but with slightly thicker tails.

Alternative Binary Links

- When a tolerance distribution applies to subjects' responses, the model form $\pi(x) = F(x)$ occurs naturally.
- For instance, in a toxicology study, suppose that researchers spray an insecticide at various dosage levels on batches of mosquitoes.
- If a cdf $F$ describes the distribution of tolerances, then the model for the probability $\pi(x)$ of death at dosage level $x$ necessarily has form $\pi(x) = F(x)$.

Probit Models

- When $F$ is the cdf of a normal distribution, the model $\pi(x) = F(x)$ is called the probit model.
- The link function for the model is then called the probit link.
- The probit model has alternative expression $\mathrm{probit}[\pi(x)] = \alpha + \beta x$.

Probit Models

- The probit link applied to a probability $\pi(x)$ transforms it to the standard normal z-score at which the left-tail probability equals $\pi(x)$.
- For instance, probit(.05) = −1.645, probit(.50) = 0, probit(.95) = 1.645, and probit(.975) = 1.96.
- The probit model is a GLM with binomial random component and probit link.
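
The probit values quoted above are standard normal quantiles, so they can be checked with the inverse cdf. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the wrapper name `probit` is an assumption.

```python
from statistics import NormalDist

def probit(p):
    """Standard normal z-score whose left-tail probability equals p."""
    return NormalDist().inv_cdf(p)

for p in (0.05, 0.50, 0.95, 0.975):
    print(f"probit({p}) = {probit(p):.3f}")
```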

Probit Models

- The ML fit of the probit model, using scores {0, 2, 4, 5} for snoring level in the snoring and heart disease data, is $\mathrm{probit}[\hat{\pi}(x)] = -2.061 + 0.188x$.
- $\hat{\pi}(0) = \Phi(-2.061 + 0.188(0)) = \Phi(-2.06) = .020$
- $\hat{\pi}(5) = \Phi(-2.061 + 0.188(5)) = \Phi(-1.12) = .131$
- The fitted values are similar to those obtained with the linear probability and logistic regression models.
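
The two fitted values above extend to all four snoring scores by applying the standard normal cdf $\Phi$ to the reported probit fit $-2.061 + 0.188x$. The sketch below does this with `statistics.NormalDist`; the function name `probit_fit` is an assumption.

```python
from statistics import NormalDist

ALPHA_HAT, BETA_HAT = -2.061, 0.188  # reported probit ML estimates

def probit_fit(x):
    """Fitted probability Phi(alpha_hat + beta_hat * x) at snoring score x."""
    return NormalDist().cdf(ALPHA_HAT + BETA_HAT * x)

probit_fitted = {x: probit_fit(x) for x in (0, 2, 4, 5)}
for x, value in probit_fitted.items():
    print(f"x = {x}: pi_hat = {value:.3f}")
```

The results can be compared with the "Probit Fit" column of the snoring table (.020, .046, .095, .131).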

Probit Models

- It is rare, and requires enormous sample sizes, to find data for which a logistic regression model fits well but the probit model fits poorly, or conversely.
- When both models fit well, slope estimates in logistic regression models are roughly 1.6-2.0 times those in probit models.

Probit Models

- The probit transform maps $\pi(x)$ so that the regression curve for $\pi(x)$ (or $1 - \pi(x)$, when $\beta < 0$) has the appearance of the normal cdf with mean $\mu = -\alpha/\beta$ and standard deviation $\sigma = 1/|\beta|$.
- For the snoring and heart disease data, $-\hat{\alpha}/\hat{\beta} = 2.061/0.188 = 11.0$ and $1/|\hat{\beta}| = 1/0.188 = 5.3$.

Probit Models

- The predicted probability of heart disease equals $\frac{1}{2}$ at snoring level $x = 11.0$.
- The fitted probit value of −2.06 at $x = 0$ means that 0 is 2.06 standard deviations below the mean of a normal distribution with mean 11.0 and standard deviation 5.3.

Probit Models

- The probit model was introduced in 1934 for models in toxicology.
- The logistic regression model was not studied until about a decade later, but it is now much more popular than the probit.
- Partly this is because one can interpret the logistic regression effects using odds ratios.

Generalized Linear Models for Count Data

- The number of automobile thefts in 1995, or the number of imperfections on a wafer, have counts as possible outcomes.
- Poisson variates can take any nonnegative integer value.
- Chapter 6 presents Poisson GLMs for counts in contingency tables.
- The response data are cell counts obtained by cross-classifying subjects on two or more categorical response variables.
- This section introduces Poisson regression-type models using an alternative application: modeling count or rate data for a single response variable.

Poisson Regression

- The log mean is the natural parameter for the Poisson distribution, and the log link is the canonical link for a GLM with Poisson random component.
- A Poisson loglinear model is a GLM that assumes a Poisson distribution for $Y$ and uses the log link.

Poisson Regression

- Let $\mu$ denote the expected value for a Poisson variate $Y$, and let $X$ denote an explanatory variable.
- The Poisson loglinear model has form $\log \mu = \alpha + \beta x$.
- The mean satisfies the exponential relationship $\mu = \exp(\alpha + \beta x) = e^{\alpha}(e^{\beta})^{x}$.

Poisson Regression

- A one-unit increase in $X$ has a multiplicative impact of $e^{\beta}$ on $\mu$.
- The mean of $Y$ at $x + 1$ equals the mean of $Y$ at $x$ multiplied by $e^{\beta}$.
- If $\beta = 0$, then $e^{\beta} = e^{0} = 1$ and the multiplicative factor is 1; that is, the mean of $Y$ does not change as $X$ changes.
- If $\beta > 0$, then $e^{\beta} > 1$, and the mean of $Y$ increases as $X$ increases.
- If $\beta < 0$, then $e^{\beta} < 1$, and the mean of $Y$ decreases as $X$ increases.

Horseshoe Crabs and Satellites

- The data come from a study of nesting horseshoe crabs.
- Each female horseshoe crab in the study had a male crab attached to her in her nest.
- The study investigated factors that affect whether the female crab had any other males, called satellites, residing near her.

Horseshoe Crabs and Satellites

- Explanatory variables thought possibly to affect this included the female crab's color, spine condition, weight, and carapace width.
- The response outcome for each female crab is her number of satellites.
- For now, we use width alone as a predictor of the response. (Other analyses of these data occur in Chapter 5.)

Horseshoe Crabs and Satellites

- Figure 4.3 plots the response counts against width, with numbered symbols indicating the number of observations at each point.
- The substantial variability in counts makes it difficult to discern a clear pattern.

Horseshoe Crabs and Satellites

- To obtain a clearer picture of the overall trend, we grouped the female crabs into a set of width categories (≤ 23.25, 23.25-24.25, 24.25-25.25, 25.25-26.25, 26.25-27.25, 27.25-28.25, 28.25-29.25, > 29.25) and calculated the sample mean number of satellites for female crabs in each category.

Horseshoe Crabs and Satellites

- Figure 4.4 plots these sample means against the sample mean width for crabs in each category.
- The sample mean width equals 26.3 and the standard deviation equals 2.1.
- We used 26.25 rather than 26.3 for the midpoint of the eight classes so that no observation would fall exactly on the boundary between two categories.

Horseshoe Crabs and Satellites

- More sophisticated ways of portraying the trend smooth the data without grouping the width values or assuming a particular functional relationship.
- Figure 4.4 shows such a smoothed curve.
- The sample means and the smoothed curve both show a strong increasing trend.

Horseshoe Crabs and Satellites

- The means tend to fall above the curve, since the response counts in a category tend to be skewed to the right.
- The smoothed curve is less susceptible to outlying observations.
- The trend seems approximately linear, and we next discuss models for the ungrouped data for which the mean or the log of the mean is linear in width.

Horseshoe Crabs and Satellites

- For a female crab, let $\mu$ be the expected number of satellites and $x$ = width.
- From GLM software, the ML fit of the Poisson loglinear model is $\log \hat{\mu} = \hat{\alpha} + \hat{\beta}x = -3.305 + 0.164x$.
- The effect $\hat{\beta} = 0.164$ of width has an asymptotic (large-sample) standard error of ASE = 0.020.

Horseshoe Crabs and Satellites

- The model fitted value at any width level is an estimated mean number of satellites $\hat{\mu}$.
- The fitted value at the mean width of $x = 26.3$ is $\hat{\mu} = \exp(\hat{\alpha} + \hat{\beta}x) = \exp[-3.305 + 0.164(26.3)] = 2.74$.
- For this model, $\exp(\hat{\beta}) = \exp(0.164) = 1.18$ represents the multiplicative effect on $\hat{\mu}$ for a 1-cm increase in $x$.

Horseshoe Crabs and Satellites

- The fitted value at $x = 27.3 = 26.3 + 1$ is $\exp[-3.305 + 0.164(27.3)] = 3.23$, which equals $1.18 \times 2.74$.
- A 1-cm increase in width yields an 18% increase in the estimated mean.
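
The arithmetic above can be reproduced from the reported loglinear fit $\log\hat{\mu} = -3.305 + 0.164x$. In this sketch the function name `mu_log_link` is an assumption; the coefficients are the ones quoted in the slides.

```python
import math

ALPHA_HAT, BETA_HAT = -3.305, 0.164  # reported Poisson loglinear estimates

def mu_log_link(width):
    """Estimated mean number of satellites at a given carapace width (cm)."""
    return math.exp(ALPHA_HAT + BETA_HAT * width)

mu_at_mean = mu_log_link(26.3)
mu_one_cm_wider = mu_log_link(27.3)
ratio = mu_one_cm_wider / mu_at_mean  # equals exp(BETA_HAT) for any width

print(f"{mu_at_mean:.2f} {mu_one_cm_wider:.2f} {ratio:.2f}")
```

The ratio of fitted means one centimetre apart is $\exp(\hat{\beta})$ regardless of the starting width, which is what makes the effect multiplicative.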

Horseshoe Crabs and Satellites

- Figure 4.3 shows that one crab had somewhat greater width than the others, 33.5 cm.
- An observation having explanatory variable values much different from the rest of the sample can have an undue influence on the model fit.
- To check the effect of this observation, we deleted it and refitted the model for the remaining 172 crabs.
- The ML estimates then equal $\hat{\alpha} = -3.461$ and $\hat{\beta} = 0.170$ (ASE = 0.022).

Horseshoe Crabs and Satellites

- Figure 4.4 shows that $E(Y)$ may grow approximately linearly with width.
- This suggests the Poisson GLM with identity link, $\mu = \alpha + \beta x$.
- It has ML fit $\hat{\mu} = \hat{\alpha} + \hat{\beta}x = -11.53 + 0.55x$.
- The effect of $X$ on $\mu$ in this model is additive, rather than multiplicative.

Horseshoe Crabs and Satellites

- A 1-cm increase in $x$ has an estimated increase of $\hat{\beta} = 0.55$ in $\hat{\mu}$.
- The fitted value at the mean width of $x = 26.3$ is $\hat{\mu} = -11.53 + 0.55(26.3) = 2.93$; at $x = 27.3$, it is $2.93 + 0.55 = 3.48$.
- The fitted values are positive at all sampled $x$.
- On average, about a 2-cm increase in width is associated with an extra satellite.

Horseshoe Crabs and Satellites

- Figure 4.5 plots $\hat{\mu}$ against width for the models with log link and identity link.
- Although they diverge somewhat for relatively small and large widths, they provide similar predictions over the width range in which most observations occur.
- §4.4.2 and §4.4.3 study whether either model provides an adequate fit to these data.

Poisson Regression for Rate Data

- When events of a certain type occur over time, space, or some other index of size, it is often relevant to model the rate at which events occur.
- In modeling numbers of auto thefts in 1995 for a sample of cities, we could form a rate for each city by dividing the number of thefts by the city's population size.
- The model might describe how the rate depends on explanatory variables such as the city's unemployment rate, its residents' median income, and the percentage of residents having completed high school.

Poisson Regression for Rate Data

- When a response count $Y$ has index (such as population size) equal to $t$, the sample rate is $Y/t$.
- The expected value of the rate is $\mu/t$.
- With an explanatory variable $x$, a loglinear model for the expected rate has form $\log(\mu/t) = \alpha + \beta x$.

Poisson Regression for Rate Data

- This model has equivalent representation $\log \mu - \log t = \alpha + \beta x$.
- The adjustment term, $-\log t$, to the log link of the mean is called an offset.
- The fit corresponds to using $\log t$ as a predictor on the right-hand side and forcing its coefficient to equal 1.0.

Poisson Regression for Rate Data

- The expected response count satisfies $\mu = t \exp(\alpha + \beta x)$.
- The mean is proportional to the index $t$, with proportionality constant depending on the value of $x$.
- For a fixed value of $x$, doubling the population size $t$ also doubles the expected number of auto thefts $\mu$.
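
The proportionality in $t$ can be illustrated with a short sketch of $\mu = t\exp(\alpha + \beta x)$. The coefficient values, the predictor value, and the population sizes below are hypothetical, chosen only to show that doubling $t$ doubles $\mu$.

```python
import math

def expected_count(t, x, alpha, beta):
    """mu = t * exp(alpha + beta * x), i.e. log(mu) = log(t) + alpha + beta * x."""
    return t * math.exp(alpha + beta * x)

# Hypothetical coefficients and predictor value, for illustration only.
alpha, beta, x = -4.0, 0.1, 5.0

mu_small = expected_count(100_000, x, alpha, beta)
mu_large = expected_count(200_000, x, alpha, beta)
print(mu_small, mu_large, mu_large / mu_small)
```

Because $\log t$ enters the linear predictor with coefficient fixed at 1, the rate $\mu/t$ depends on $x$ alone.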

Examples of Rate Models

- The example uses data on motor vehicle accident rates for elderly drivers (from an article by W.A. Ray et al., Amer. J. Epidemiol., 132: 873-884 (1992)).
- The sample consisted of 16,262 Medicaid enrollees aged 65-84 years, with data on each subject for a period of somewhere between 0 and 4 years.

Examples of Rate Models

- The total observation time for women in the sample was 17.30 thousand years. During this period, they had 175 accidents in which an injury occurred.
- The total observation time for men was 21.40 thousand years, during which they had 320 injurious accidents.
- The sample rates of injurious accidents are 320/21.40 = 14.95 crashes per thousand years of driving for males, and 175/17.30 = 10.12 for females.

Examples of Rate Models

- Let $\mu$ denote the expected number of injurious accidents, for an observation period of $t$ thousand years.
- To model the effect of gender on the accident rate, we use the model $\log(\mu/t) = \alpha + \beta x$ with $x = 0$ for females and $x = 1$ for males.
- The explanatory variable $x$ is a dummy variable for gender.

Examples of Rate Models

- The log of the accident rate equals $\alpha$ for females, and it equals $\alpha + \beta$ for males. The rates are identical if $\beta = 0$.
- The estimate of $\alpha$ is simply the sample log(rate) for females, namely $\log(10.12) = 2.31$; the estimate of $\alpha + \beta$ is simply the sample log(rate) for males, namely $\log(14.95) = 2.70$.
- The estimated difference is $\hat{\beta} = 0.39$.
- The estimated accident rate for men was $\exp(\hat{\beta}) = \exp(0.39) = 1.48$ times the rate for women. That is, 14.95/10.12 = 1.48, the sample rate being 48% higher for men.
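
Because the model has one parameter per group, its estimates are just the sample log rates, which makes them easy to reproduce from the counts and observation times given earlier. The variable names in this sketch are assumptions.

```python
import math

# Injurious accidents and observation time (thousands of years) by gender.
accidents = {"female": 175, "male": 320}
exposure = {"female": 17.30, "male": 21.40}

rate = {g: accidents[g] / exposure[g] for g in accidents}

alpha_hat = math.log(rate["female"])              # log rate for x = 0
beta_hat = math.log(rate["male"]) - alpha_hat     # difference in log rates
rate_ratio = math.exp(beta_hat)                   # male rate / female rate

print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}, "
      f"rate ratio = {rate_ratio:.2f}")
```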

Examples of Rate Models

- To test whether the true rates are the same, we test $H_0: \beta = 0$.
- GLM software reports an ASE for $\hat{\beta} = 0.39$ of 0.09, so $z = 0.39/0.09 \approx 4.3$ and there is strong evidence that the accident rate was higher for males (i.e., that $\beta > 0$).
- The accident rates do not take into account possibly different yearly levels of driving for the two groups.
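
A hedged sketch of the Wald test behind this conclusion, using the reported $\hat{\beta} = 0.39$ and ASE = 0.09. The slides do not say how the software computes the ASE; the check `ase_check` relies on the standard large-sample result that the log of a ratio of two independent Poisson counts has variance approximately $1/y_1 + 1/y_2$, which is an assumption added here rather than something stated in the source.

```python
import math
from statistics import NormalDist

beta_hat, ase = 0.39, 0.09  # values reported in the slides

# Wald statistic and two-sided p-value under H0: beta = 0.
z = beta_hat / ase
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# Assumed approximation: var(log rate ratio) ~ 1/175 + 1/320.
ase_check = math.sqrt(1 / 175 + 1 / 320)

print(f"z = {z:.2f}, two-sided p = {p_two_sided:.2g}, ASE check = {ase_check:.3f}")
```

A $z$ statistic above 4 corresponds to a p-value far below 0.001, consistent with the strong evidence described above.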

Examples of Rate Models

- When all counts in a data set have the same index value $t$, or when counts do not refer to an index such as time or group size, the model does not need an offset term.

Summary

- Components of a generalized linear model
- Generalized linear model for binary data
- Generalized linear model for count data
