
Modeling Ordinal Data with Log-linear Models

Earlier in the course we described ways to perform significance tests for independence
and conditional independence, and to measure (linear) associations with ordinal categorical
variables.
For example, we focused on the CMH statistic and correlation measures for testing independence
and linear association in the example with heart disease and cholesterol level. We
concluded that the variables are not independent, and that the linear association was not strong
either ($M^2 = 27.3$, df = 1).


                Serum cholesterol (mg/100 cc)
          0-199   200-219   220-259   260+   Total
CHD          12         8        31     41      92
no CHD      307       246       439    245    1237
total       319       254       470    286    1329

Can we answer the same questions (and more) via log-linear models?

Modeling Ordinal Data in 2-way Tables
Log-linear models for contingency tables, by default, treat all variables as nominal variables.
If there is an ordering of the categories of the variables, it is not taken into account. That
means that we could rearrange the rows and/or columns of a table and we would get the same
fitted odds ratios for the data as we would with the original ordering of the rows and/or columns.
To model ordinal data with log-linear models we can apply some of the general ideas we saw
with incomplete tables and the analysis of ordinal data earlier in the semester.
That is, we typically
- assign scores to the levels of our categorical variables, and
- include additional parameters (which represent these scores) in a log-linear model to
model the dependency between two variables.
Linear by Linear Association Model
This is the most common log-linear model when you have ordinal data.
Objective:
Modeling the log counts by accounting for the ordering of the categories of discrete variables.
Suppose we assign scores for the categories of the row variable, $u_1 \le u_2 \le \dots \le u_I$, and for the
categories of the column variable, $v_1 \le v_2 \le \dots \le v_J$. These are numbers that you will
use to describe the difference in magnitude between the categories of each variable. Then we can model the
dependency between two variables, e.g. C = CHD and S = Serum cholesterol.
Model Structure:
$\log(\mu_{ij}) = \lambda + \lambda_i^C + \lambda_j^S + \beta u_i v_j$
For each row i, the log fitted values are a linear function of the columns. For each column j, the
log fitted values are a linear function of the rows.
Parameter estimates and interpretation:
This model has only one more parameter than the independence model (i.e., $\beta$, the coefficient of
$u_i v_j$), and in complexity it lies between the independence and the saturated models. The
'linear by linear association' term models the association between the two variables based on the
scores that you have assigned.
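If $\beta > 0$, then C and S are positively associated (i.e., C tends to go up as S goes up).
If $\beta < 0$, then C and S are negatively associated.
The odds ratio for any 2 × 2 sub-table is a direct function of the row and column scores
and $\beta$:
$\log\left(\frac{\mu_{ij}\,\mu_{i'j'}}{\mu_{ij'}\,\mu_{i'j}}\right) = \beta\,(u_i - u_{i'})(v_j - v_{j'})$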
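The dependence of the odds ratio on the scores can be derived in a few lines. The sketch below assumes the usual notation: $\mu_{ij}$ for the expected count in cell $(i, j)$, $\lambda$, $\lambda_i^C$, $\lambda_j^S$ for the intercept and main effects, and $\beta$ for the association parameter. Rows $i, i'$ and columns $j, j'$ are arbitrary.

```latex
\begin{align*}
\log\theta
  &= \log\mu_{ij} + \log\mu_{i'j'} - \log\mu_{ij'} - \log\mu_{i'j} \\
  &= \beta\,\bigl(u_i v_j + u_{i'} v_{j'} - u_i v_{j'} - u_{i'} v_j\bigr)
     && \text{(intercept and main effects cancel)} \\
  &= \beta\,(u_i - u_{i'})(v_j - v_{j'}).
\end{align*}
```

So the log odds ratio grows with the distance between the two row scores and between the two column scores, and its sign is the sign of $\beta$ whenever both score differences have the same sign.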

Model Fit:
We use $G^2$ and $\Delta G^2$ as with any other log-linear model.
We observe $G^2 = 4.09$, df = 2, p-value = 0.13, which indicates that the linear by linear association
model fits well, and that it fits significantly better than the independence model ($\Delta G^2 = 27.832$, df =
1, p-value < 0.001). Notice the close agreement between $\Delta G^2$, $M^2$, and the likelihood ratio
statistic for the "xCHD*yserum" parameter reported under the significance tests for the individual
parameters (e.g. the 'Type 3 Analysis' output of GENMOD).
$\hat\beta = -0.574$ and $\exp(-0.574) = 0.56$ mean that the estimated odds ratio for a unit change in the row
and column scores, i.e. 'chd' vs. 'nochd' and '0-199' vs. '200-219', equals 0.56.
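The fit described above can be approximated without SAS or R. The following is a minimal sketch in plain Python, not the course's own code: it fits the linear by linear association model as a Poisson log-linear model by iteratively reweighted least squares. The function names `solve` and `fit_linear_by_linear` are hypothetical, the row coding (CHD = 1, no CHD = 2) is an assumption, and the CHD count for 220-259 is taken as 31, the value implied by the row and column totals.

```python
import math

# Observed counts: rows are CHD / no CHD, columns are the four cholesterol levels.
# The CHD count for 220-259 is taken as 31, the value implied by the table margins.
counts = [[12, 8, 31, 41],
          [307, 246, 439, 245]]
u = [1, 2]        # row scores: CHD = 1, no CHD = 2 (assumed coding)
v = [1, 2, 3, 4]  # column scores: consecutive integers


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_by_linear(counts, u, v, iters=25):
    """Fit log(mu_ij) = lambda + lambda_i + lambda_j + beta * u_i * v_j by IRLS."""
    n_rows, n_cols = len(u), len(v)
    X, y = [], []
    for i in range(n_rows):
        for j in range(n_cols):
            row = [1.0]                                                # intercept
            row += [1.0 if i == k else 0.0 for k in range(1, n_rows)]  # row dummies
            row += [1.0 if j == k else 0.0 for k in range(1, n_cols)]  # column dummies
            row.append(float(u[i] * v[j]))                             # score product
            X.append(row)
            y.append(float(counts[i][j]))
    p, n = len(X[0]), len(y)
    mu = y[:]  # start from the observed counts (all positive here)
    for _ in range(iters):
        # Working response, then one weighted least-squares step.
        z = [math.log(m) + (yi - m) / m for yi, m in zip(y, mu)]
        xtwx = [[sum(mu[k] * X[k][a] * X[k][b] for k in range(n)) for b in range(p)]
                for a in range(p)]
        xtwz = [sum(mu[k] * X[k][a] * z[k] for k in range(n)) for a in range(p)]
        coef = solve(xtwx, xtwz)
        mu = [math.exp(sum(c * x for c, x in zip(coef, row))) for row in X]
    g2 = 2.0 * sum(yi * math.log(yi / m) for yi, m in zip(y, mu))  # deviance G^2
    fitted = [mu[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]
    return coef[-1], g2, fitted


beta_hat, g2, fitted = fit_linear_by_linear(counts, u, v)
print(f"beta_hat = {beta_hat:.3f}, G2 = {g2:.2f}")
for row in fitted:
    print([round(m, 1) for m in row])
```

With this coding the estimate should come out negative and close in magnitude to the 0.574 quoted above, and the deviance should be close to the reported $G^2$. Reversing the row scores would flip the sign of the estimate without changing the fit.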


Look at the model fitted values ('Pred' in the SAS "Observation Statistics" table, or the output of
the "fitted" function in R).

[Table of fitted values, with selected cells highlighted in red.]

Under this model we can also estimate the odds of 'chd' at a higher level of cholesterol, e.g. '260+'.

We can use this evidence to conclude that the odds of a heart condition are about 5.5 times higher
at such a high cholesterol level.
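One way to read these numbers is through the odds ratio formula of the model, $\exp(\beta (u_i - u_{i'})(v_j - v_{j'}))$. The sketch below assumes the estimate is $-0.574$ (the sign that makes $\exp(\hat\beta) = 0.56$), row scores CHD = 1 and no CHD = 2, and integer column scores 1 to 4. It also assumes that the factor of about 5.5 compares '260+' with '0-199', three unit steps apart. The helper `odds_ratio` is a hypothetical name.

```python
import math


def odds_ratio(beta, u_i, u_k, v_j, v_l):
    """Odds ratio for rows (i, k) and columns (j, l): exp(beta*(u_i-u_k)*(v_j-v_l))."""
    return math.exp(beta * (u_i - u_k) * (v_j - v_l))


beta_hat = -0.574  # assumed sign, see the lead-in

# Adjacent columns '0-199' vs '200-219', rows 'chd' vs 'nochd'.
local = odds_ratio(beta_hat, 1, 2, 1, 2)

# Odds of 'chd' at '260+' relative to '0-199' (column scores 4 and 1).
extreme = odds_ratio(beta_hat, 1, 2, 4, 1)

print(round(local, 2), round(extreme, 1))
```

Under these assumptions the second value is a little above 5.5, which is consistent with the rounded figure in the text.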

Choice of Scores
There are many options for assigning the score values, and these can have equal or unequal
spacing.
The most common choice of scores is consecutive integers; that is, $u_1 = 1, u_2 = 2, \dots, u_I = I$
and $v_1 = 1, v_2 = 2, \dots, v_J = J$ (which is what we used in the above example).
The model with such scores is a special case of the linear by linear association model and is
known as the Uniform Association Model. It is called the uniform association model because
the odds ratios for any two adjacent rows and any two adjacent columns are the same and equal
to
$\theta = \exp\left(\beta\,(u_{i+1} - u_i)(v_{j+1} - v_j)\right) = \exp(\beta)$
In other words, the Local Odds Ratio equals $\exp(\beta)$ and is the same for all adjacent rows and
columns.
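The uniform association property can be checked numerically. The sketch below builds a table of expected counts from the model with consecutive integer scores and then computes every local odds ratio. All parameter values are made up for illustration, and `mu_table` and `local_odds_ratios` are hypothetical helper names.

```python
import math


def mu_table(lam, lam_row, lam_col, beta, u, v):
    """Expected counts exp(lam + lam_row[i] + lam_col[j] + beta * u[i] * v[j])."""
    return [[math.exp(lam + lam_row[i] + lam_col[j] + beta * u[i] * v[j])
             for j in range(len(v))] for i in range(len(u))]


def local_odds_ratios(mu):
    """Odds ratios of all 2 x 2 sub-tables formed by adjacent rows and columns."""
    return [[mu[i][j] * mu[i + 1][j + 1] / (mu[i][j + 1] * mu[i + 1][j])
             for j in range(len(mu[0]) - 1)] for i in range(len(mu) - 1)]


beta = 0.3  # illustrative value, not an estimate from the CHD data
mu = mu_table(1.0, [0.0, 0.4, -0.2], [0.0, 0.5, 0.1, -0.3],
              beta, u=[1, 2, 3], v=[1, 2, 3, 4])
lors = local_odds_ratios(mu)
print(lors)
print(math.exp(beta))
```

Every entry of `lors` agrees with `math.exp(beta)` up to rounding, whatever main effects are chosen, because the main effects cancel in each 2 × 2 sub-table.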
Also, sets of scores with the same spacing between them will lead to the same goodness-of-fit
statistics, fitted counts, odds ratios, and $\hat\beta$.
For example, $v_1 = 1, v_2 = 2, v_3 = 3, v_4 = 4$ and $v_1 = 8, v_2 = 9, v_3 = 10, v_4 = 11$ will yield the same
results.
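A short calculation shows why shifting the scores changes nothing. It assumes the model notation $\lambda_i^C$ for the row main effect and $\beta$ for the association parameter, and writes $c$ for an arbitrary constant shift of the column scores ($c = 7$ in the example above).

```latex
\begin{align*}
\beta\, u_i\,(v_j + c) &= \beta\, u_i v_j + (\beta c)\, u_i .
\end{align*}
```

The extra term $(\beta c)\, u_i$ depends only on the row, so it is absorbed into the row main effect $\lambda_i^C$. The fitted counts, the fit statistics and $\hat\beta$ are therefore unchanged; only the estimated row effects differ.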

However, please note: two sets of scores with the same relative spacing will lead to the
same goodness-of-fit statistics, fitted counts, and odds ratios, BUT different estimates of $\beta$.
For example, $v_1 = 1, v_2 = 2, v_3 = 4, v_4 = 8$ and $v_1 = 2, v_2 = 4, v_3 = 8, v_4 = 16$.
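The reason is that doubling the scores can be compensated exactly by halving the association parameter. The sketch below checks this for the two score sets in the example, using the log odds ratio $\beta (u_i - u_{i'})(v_j - v_{j'})$ of the model. The value of `beta_a` is made up for illustration and `log_odds_ratio` is a hypothetical helper name.

```python
import math


def log_odds_ratio(beta, u, v, i, k, j, l):
    """Log odds ratio for rows (i, k) and columns (j, l) under the model."""
    return beta * (u[i] - u[k]) * (v[j] - v[l])


u = [1, 2]
v_a = [1, 2, 4, 8]
v_b = [2, 4, 8, 16]   # same relative spacing, every score doubled

beta_a = -0.4         # illustrative value only
beta_b = beta_a / 2   # halved to compensate for the doubled scores

pairs = [(j, l) for j in range(4) for l in range(j + 1, 4)]
same = all(
    math.isclose(log_odds_ratio(beta_a, u, v_a, 0, 1, j, l),
                 log_odds_ratio(beta_b, u, v_b, 0, 1, j, l))
    for j, l in pairs
)
print(same)
```

Both parameterizations describe the same set of odds ratios, so the fitted counts and fit statistics agree while the estimate of the association parameter changes with the scale of the scores.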
The choice of scores may depend heavily on your data and the context of your problem. There are
other ways of using and modeling ordinality, e.g. cumulative logit models (see Agresti (2002), Sec.
7.2 and 7.3; Agresti (2007), Sec. 6.2; and Agresti (1996), Sec. 8.2 and 8.3), which have already
been discussed.
Generalization to Higher-dimensional Tables
For higher dimensions we already know how to test for associations and conditional
independence with ordinal data, and with combinations of ordinal and nominal data, via the CMH statistic.
The modeling approach described today generalizes to higher-dimensional tables as well. We
can always create new variables representing the scores.
Association models are generalizations of linear by linear association models to multi-way tables.
We can also combine ordinal and nominal variables, where we assign scores only to the
ordinal variables and estimate scores from the data. Some of these models are known as row
effects, column effects, and row and column effects models. These are more advanced topics on
this issue.

Source: https://onlinecourses.science.psu.edu/stat504/node/141