1988 - Modeling Strategies For Categorical Data Examples From Housing and Tenure Choice
W. A. V. Clark, M. C. Deurloo, and F. M. Dieleman
1. INTRODUCTION
Techniques that use categorical data have proliferated in the past decade.
Logit, multinomial logit, and sequential logit models have been applied to mobility
(Clark, Deurloo, and Dieleman 1984; Clark and Onaka 1985), consumer
choice (Wrigley 1985), housing choice (Quigley 1976), and transportation mode
choice (Johnson and Hensher 1982) to mention only a few of the substantive
areas. However, this work has sometimes proceeded without a detailed consid-
eration of the underlying contingency tables. In this article, we deal with the
statistical analysis of contingency tables rather than with sequential choice
modeling. It is necessary to stress this point because both statistical modeling of
categorical data and discrete choice approaches often use the same terms with
different meanings. An example of this is the term nested logit model. In a later
section, the term is not used for sequential choice modeling, as is usually the case
in geographical studies (Wrigley 1985), but to describe the situation where the
effect of an independent variable (on the dependent variable) is only operative
2. MODELING STRATEGIES
The methods considered in this paper can be grouped into three general categories
(Figure 1). The first group is a selection of approaches where the objective
is to simplify both the set of explanatory variables and their number of catego-
ries. In our experience, two techniques, proportional reduction in uncertainty
(PRU) (Kim 1984), and Chi-square automatic interaction detection (CHAID)
(Kass 1980), are good choices from a wider set of possible techniques with the
same purpose. They are both applicable in situations where a qualitative de-
pendent variable with two or more categories and a large set of qualitative inde-
pendent variables with two or more categories are to be analyzed.
Given the dependent variable, the PRU technique selects independent vari-
ables in a sequential manner. The independent variable that explains most of the
distribution of the observations over the categories of the dependent variable is
chosen first and its categorization is simplified. The criterion for selection of
additional independent variables and simplification of their categories is the
change in entropy of the dependent variable. Variables are added
200 / Geographical Analysis
[Figure 1: flow diagram of the modeling strategies. Un-nested analysis: PRU, then MNA/ANOTA, then hierarchical logit. Nested analysis: CHAID, then nested logit. Stages: pre-processing, limited parameterization, formal modeling.]
until no relevant additional variables can be included or the data are exhausted.
The PRU method is more fully explained and illustrated in section 4.
CHAID has approximately the same purpose as the PRU method and stems
from the better-known automatic-interaction-detection (AID) method of
Sonquist and Morgan (1964). That method was limited to a binary dependent
variable. Like PRU, CHAID is a multivariable method and considers selection
and simplification of variables one after another. The difference between the
two techniques, apart from their different theoretical underpinning, lies mainly
in the nesting of independent variables within the categories of other indepen-
dent variables in the case of CHAID. Both methods add the most effective vari-
able to the tabulation at each step. Therefore, both methods lead to a selection of
independent variables that are usually not strongly interrelated. The detail of the
categorization can often be reduced dramatically without important loss of in-
formation with both methods. CHAID will be explained more fully in section 5.
In our modeling strategy, a second stage involves the use of multivariate nominal
scale analysis (MNA), or the analysis of tables (ANOTA). A standard procedure
after the preprocessing of PRU or CHAID is to estimate a logit model.
While logit transformation is the classical solution to prevent the estimated probabilities
in models for categorical data from taking on values outside the 0-1
range, it is important to realize that this is the only reason to perform this transformation;
it is certainly not true that logit models are a priori more suitable than
other models. The use of MNA/ANOTA offers an alternative to proceeding
directly to logit transformations. Logit transformation may not always be desirable,
as it entails a number of disadvantages, especially when the dependent
variable has more than two categories or when there are numerous predictors
that may themselves have more than two categories. In practice, the selection of a
qualified logit model containing interaction effects, even in cases with a binary
dependent variable, is only possible with a relatively small number of predictors.
MNA/ANOTA provides a number of simplifications as soon as the restriction is
dropped that the estimated coefficients must be in the 0-1 range. Then the coefficients
can be estimated with the ordinary-least-squares method and, consequently,
only a linear system of equations has to be solved in which only bivariate
tables are used, so that even extensive problems can be handled on small (micro)
computers. Furthermore, the interpretation of the parameters becomes straight-
forward and the number of categories of the dependent variable is of no impor-
W. A. V. Clark, M . C . Deurloo, and F . M . Dieleman / 201
tance. It is true that the MNA/ANOTA model does not guarantee a good fit. But
for logit models, good fit can be guaranteed only for the saturated model and the
saturated model is not an end in itself, but only a starting point for a process of
reduction to a parsimonious form. In our opinion, it is more a matter of personal
taste than of statistical methodology whether modeling is done on the linear scale or on the
log-linear scale.
MNA/ANOTA is labeled as a “limited” parameterization method because assumptions
about the structure of the data are used to constrain the complexity of
the models, thus facilitating interpretation. For example, MNA/ANOTA makes
the restrictive assumption that an additive model adequately represents the data.
If the method is applied to data that have been preprocessed with PRU or
CHAID, this assumption is not unreasonable. It is possible to avoid the strongest
interactions in the data by dividing them first into relevant separate subgroups
and then selecting the predictor variables and their categories by virtue of the
preliminary CHAID analysis. In many respects CHAID can be seen as comple-
mentary to MNA/ANOTA. The strongest interaction effects are filtered out
with these procedures. This is illustrated by the fact that the estimated coeffi-
cients for a specific category of any independent variable in our models differ
little from the corresponding deviations of the average proportions of the cate-
gories of the dependent variable in the sample.
As a consequence of the constraints, the interpretation of the MNA/ANOTA
results is straightforward and comparable to multiple regression. MNA/ANOTA
is especially useful if the dependent variable has more than two categories; the
ease of computation and interpretation then compares favorably with the much
more complicated multinomial logit modeling. However, as mentioned above,
there are disadvantages of the ANOTA method. To reiterate, the estimated pro-
portions may be out of the 0-1 range, and combined effects of independent
variables on the dependent variable are ignored. However, the method often
gives a reasonable approximation, and can be used as the final result of an analy-
sis, if the greater sophistication of a logit model serves no definite purpose or if
the formulations and interpretation of the parameters of a logit model are not
straightforward. ANOTA can also serve as an interim step towards more formal
modeling and in that case is a good starting point for finding an adequate logit
model. It is also possible to move directly from the preprocessing to the full
parameterization stage. Omitting the intermediate stage can be justified when
the preprocessing stage has yielded a sufficient reduction in categories and vari-
ables. Log-linear modeling is most successful with small tables of three to five
variables with two to three categories.
The final, fully parametric group of models (Figure 1) comprises the logit models. No
preliminary assumptions about the data are introduced. If necessary, the original
cross-tabulation of the dependent variable and the independent variables can be
reproduced exactly (the saturated model). The logit models are theoretically
more sophisticated than the MNA/ANOTA method and allow the specification
of interaction effects of predictors on the dependent variable. Hierarchical logit
models are widely known and applied for both dichotomous and polytomous
dependent variables and need no further explanation here. Nonhierarchical logit
models have not been used widely for several reasons. In the first place, the basic
literature (e.g., Bishop, Fienberg, and Holland 1975) is almost exclusively de-
voted to hierarchical logit modeling. In the second place there exists a broad
variety of different nonhierarchical logit models that can be specified for any
given data set. Therefore, it is difficult to devise a way to find appropriate mod-
els, although it is sometimes logical to look for nonhierarchical logit models, such
as models which take order into account (see, for example, Deurloo et al. 1988),
or nested models in which a contrast or group of contrasts is supposed to be
operative at some levels of a variable (but not at all levels). To reiterate, the term
nested is used here in the statistical sense and not to describe sequential choice
modeling such as in Clark and Onaka (1985). In this paper, we will discuss an
example of a nested model in which size of household and type of housing
market seem to influence housing choices of lower-income households, while
they play no important role for the high-income groups. Magidson (1982) argues
that nested logit models can be useful in such situations. He also argues that a
preliminary analysis of the data with CHAID can guide the specification of rele-
vant nested logit models because of its property to nest independent variables
within the categories of previously chosen independent variables. Sometimes
CHAID is also used as a preprocessing method for hierarchical logit modeling
(e.g., Green 1978).
In our analysis, we distinguish between two approaches: a nested modeling
strategy and a non-nested modeling strategy (Figure 1). In the PRU-MNA/
ANOTA-LOGIT analyses of the data, no nesting of independent variables oc-
curs, but in the CHAID-NESTED LOGIT analyses, the emphasis is on estimat-
ing nested models. Because the statistical presentations of PRU (Kim 1984) and
CHAID (Kass 1980) are available we do not replicate those materials. The litera-
ture on MNA and its reformulation as ANOTA is somewhat less accessible and
we have provided a detailed statistical appendix on this technique (Appendix I).
Programs to run PRU, CHAID, and ANOTA are described in Appendix II.
TABLE 1
The Nine Variables and Their Categories

Variable                           Categories                Number of cases

Housing choice                     1. multifamily rent              914
                                   2. single family rent           1076
                                   3. owner occupation              933
Household Characteristics
Income                             1.                               445
                                   2.                               878
                                   3.                               938
                                   4.                               662
Age of head of household           1. < 34 years                   1397
                                   2. 35-44 years                   622
                                   3. 45-54 years                   306
                                   4. 55-74 years                   598
Size of household                  1. 1 person                      375
                                   2. 2 persons                     861
                                   3. 3 or 4 persons               1391
                                   4. 5 or more persons             296
Characteristics of Previous House
Previous tenure                    1. public rental                1934
                                   2. private rental                989
Number of rooms previous dwelling  1. 1 or 2 rooms                  406
                                   2. 3 rooms                       625
                                   3. 4 rooms                      1346
                                   4. 5 or more rooms               546
Type previous dwelling             1. single family                1098
                                   2. multifamily                  1825
Rent previous dwelling             1.                               428
                                   2.                              1324
                                   3. 50 -bO/month                  513
                                   4. 50-550/month                  325
                                   5. > fl 550/month                333
Type housing market                1. Periphery                     502
                                   2. South                         672
                                   3. Middle                        727
                                   4. Randstad                     1022
choice and the previously selected explanatory variables. This new dimension is
examined for its effect on association between the dependent variable and the
set of independent variables in the tabulation when categories are aggregated.
This often reduces the detail of the categorization dramatically without a noticeable
effect on the level of association. In the earliest reference we know of
(McGill and Quastler 1955) the PRU measure is called the “coefficient of constraint.”
Hays (1980) calls it “the relative reduction in uncertainty” and Nie et al.
(1975) the “uncertainty coefficient.” There is a strong analogy between the PRU
for discrete variables and the coefficient of determination (the square of Pear-
son’s correlation coefficient) for continuous variables.
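As a minimal sketch of the measure, assuming the usual definition of the uncertainty coefficient, PRU = (H(Y) - H(Y|X)) / H(Y), where H is the Shannon entropy, the PRU of a two-way table could be computed as follows. The function names and the example counts are hypothetical and are not taken from the paper.

```python
from math import log


def entropy(counts):
    """Shannon entropy (natural logarithm) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts if c > 0)


def pru(table):
    """Proportional reduction in uncertainty of the dependent (column) variable.

    table[i][j] is the count for predictor category i and dependent category j.
    """
    n = sum(map(sum, table))
    column_totals = [sum(col) for col in zip(*table)]
    h_dep = entropy(column_totals)  # uncertainty before using the predictor
    # Uncertainty that remains within each predictor category, weighted by its share.
    h_cond = sum(sum(row) / n * entropy(row) for row in table if sum(row) > 0)
    return (h_dep - h_cond) / h_dep


# Hypothetical counts (illustration only): 2 predictor categories x 3 choices.
example = [[60, 30, 10],
           [20, 30, 50]]
print(round(pru(example), 3))
```

A value of 0 means the predictor tells nothing about the dependent variable; a value of 1 means it determines it completely.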
The PRU is also generally related to the likelihood ratio test statistic (LR statistic)
G² (see Clark et al. 1986). But G² can only be used to determine whether
significant differences exist. This statistic is used in the present analysis only in a
secondary manner and only after the PRU measure, which provides better insight
into the level of association between the dependent variable and independent
variable(s) and thus of the relevancy rather than significance of the relationships.
PRU is used increasingly as a measure in the evaluation of logit models
(Kim 1984). Lammerts van Bueren (1982) discusses the measure and its usefulness
in detail. In our opinion, the PRU has some advantages over other preprocessing
approaches, like those of Higgins and Koch (1977) and Conant (1980)
(see Clark et al. 1986).
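The relation between the PRU and G² can be written out for a two-way table. The derivation below is a sketch under two assumptions that the text does not state explicitly: entropies are measured with natural logarithms, and G² is the likelihood ratio statistic against independence of the dependent variable Y and the predictor X. The symbols n_{ij}, n_{i+}, n_{+j}, and N (cell counts, row totals, column totals, sample size) are introduced here for illustration.

```latex
\begin{align*}
H(Y) &= -\sum_{j} \frac{n_{+j}}{N} \ln \frac{n_{+j}}{N}, \qquad
H(Y \mid X) = -\sum_{i}\sum_{j} \frac{n_{ij}}{N} \ln \frac{n_{ij}}{n_{i+}}, \\
G^2 &= 2 \sum_{i}\sum_{j} n_{ij} \ln \frac{n_{ij}\, N}{n_{i+}\, n_{+j}}
     = 2N \left[ H(Y) - H(Y \mid X) \right], \\
\mathrm{PRU} &= \frac{H(Y) - H(Y \mid X)}{H(Y)} = \frac{G^2}{2N\, H(Y)} .
\end{align*}
```

Under these assumptions the two measures carry the same information for a fixed sample, but G² grows with N while the PRU does not, which is consistent with using the PRU to judge relevance and G² to judge significance.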
The selection of variables and the reduction of the number of categories fol-
lows a simple forward step procedure. In the first step, the PRU is calculated for
the two-way cross-tabulation of housing choice and each of the independent
variables (Table 2). Income is by far the most important predictor of choice, and
is therefore chosen in the first step in the construction of a meaningful cross-
tabulation.
In step 1B, the simplification of the four categories of income is examined. If
any simplification is to be effected, collapsing categories 1 and 2 would be most
appropriate. The decrease in PRU is lowest in that case, indicating the lowest
reduction in explained variation of choice. But, even then, the reduction in PRU
is fairly large. The substantial decrease in G² between the original table and the
smaller table also indicates this (710.4 - 658.8 = 51.6). With 2 fewer degrees of
freedom, one would only combine categories 1 and 2 at the 1 percent significance
level if the decrease in G² were less than 9.2. So on the basis of PRU and G²,
income should retain its original four categories.
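The arithmetic of step 1B can be retraced in a few lines. This is only a sketch: the G² values are those quoted in the text, the variable names are hypothetical, and the critical value uses the fact that a chi-square variable with exactly 2 degrees of freedom has survival function exp(-x/2).

```python
from math import exp, log

g2_four_categories = 710.4   # G² with the original four income categories (from the text)
g2_three_categories = 658.8  # G² after collapsing income categories 1 and 2 (from the text)

# Degrees of freedom lost by collapsing: (4-1)*(3-1) - (3-1)*(3-1) = 2.
df_lost = (4 - 1) * (3 - 1) - (3 - 1) * (3 - 1)

drop = g2_four_categories - g2_three_categories

# With 2 degrees of freedom, P(X > x) = exp(-x/2), so the 1 percent
# critical value is -2*ln(0.01), roughly 9.2.
critical_1pct = -2 * log(0.01)
p_value = exp(-drop / 2)

# Collapsing would only be acceptable if the drop stayed below the critical value.
keep_four_categories = drop > critical_1pct
print(round(drop, 1), round(critical_1pct, 2), keep_four_categories)
```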
In step 2A, the two-way table resulting from the first step is expanded with a
new dimension. The variable housing market type increases the PRU substan-
tially and by more than any other potential explanatory variable (Table 2). The
increase is also significant at the 1 percent level (the increase of G² is 284.7 with 24
df). In step 2B, categories 1 and 2 of housing market type are combined without real
loss in PRU, and category 3 also can be combined with these categories. It is clear
that housing choice in the Randstad is very different from elsewhere in the
Netherlands; whenever category 4 (Randstad) is collapsed with another cate-
gory the PRU decreases drastically. The PRU after simplification of housing
market type (0.152) is still higher than that of the variable with the second-highest PRU in
step 2 (size of household, 0.150), so we can proceed to the next step.
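The 24 degrees of freedom quoted for step 2A can be reproduced with a short calculation. The sketch assumes that G² is measured against independence between housing choice (3 categories) and the cross-classified predictors, so that the degrees of freedom are (rows - 1) x (columns - 1); the helper name `table_df` is hypothetical.

```python
from math import prod


def table_df(predictor_levels, n_choice=3):
    """Degrees of freedom for independence between the dependent variable
    and the joint classification of the predictors."""
    rows = prod(predictor_levels)  # number of predictor combinations
    return (rows - 1) * (n_choice - 1)


df_step1 = table_df([4])     # income only: 4 x 3 table
df_step2 = table_df([4, 4])  # income x housing market type: 16 x 3 table
df_increase = df_step2 - df_step1
print(df_step1, df_step2, df_increase)
```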
In step 3A, size of household is added to the three-way cross-tabulation and
increases the PRU significantly. Categories 3 and 4 of this variable can be col-
lapsed (step 3B).
In the next step of the analysis, a critical point in the PRU procedure is reached.
The addition of yet another variable (rent of previous dwelling and age of head
TABLE 2
Steps in the Analysis with the PRU Criterion
of household are the candidates) increases the PRU significantly, but many
empty cells occur, and even marginals now have zero cases. We are thus well
beyond the limit of a meaningful addition of variables to the table if we want to
perform logit analysis. Therefore, no further variables are added after step 3. It is
sometimes useful to perform a backward simplification procedure after the final
step in the forward selection procedure of variables and categories with PRU. It
is possible that the categorization of variables in the earlier steps of the forward
selection procedure can now be simplified further because other variables have
been added to the table. In our analysis, further simplification is considered in
steps 4A and 4B. In step 4A, only the categorization of income has to be reconsidered,
because housing market is already simplified to two categories, and size of
household has just been considered in step 3B. Collapsing categories 1 and 2 of
income decreases the PRU slightly, although the loss of information is significant
at the 1 percent level. As argued above, we attribute more value to the absolute
level of the PRU for the selection of the model than to considerations of signifi-
cance. The simplification of income to three categories decreases the cross-
tabulation to 18 cells, with only a slight loss in PRU value. Further combination of
categories leads to a much larger decrease in PRU (as step 4B illustrates) and thus
the process of combining categories was terminated.
The PRU measure helps to select the most relevant variables from a larger set,
but the simplification of the categorization of the variables is equally important.
The original categorization of income, housing market, and size of household
would lead to a table of 192 cells, while after the combination of categories there
are only 54 cells. The PRU values for these tables are 0.219 and 0.192, respectively.
Therefore, the number of cells in the table is reduced to 28 percent of the
original number, while 88 percent of the original PRU is retained.
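The cell counts and the share of PRU retained follow directly from the category counts reported in the text. The dictionary names below are hypothetical; the numbers are those given in the steps above (income reduced from 4 to 3 categories, housing market from 4 to 2, household size from 4 to 3, with 3 categories of housing choice).

```python
from math import prod

original_levels = {"income": 4, "housing market": 4, "household size": 4, "housing choice": 3}
reduced_levels = {"income": 3, "housing market": 2, "household size": 3, "housing choice": 3}

cells_before = prod(original_levels.values())  # 192
cells_after = prod(reduced_levels.values())    # 54

cell_share = cells_after / cells_before        # fraction of cells that remain
pru_share = 0.192 / 0.219                      # fraction of the PRU that is retained

print(cells_before, cells_after, round(cell_share, 2), round(pru_share, 2))
```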
Table 3 is the result of the PRU procedure. Careful inspection of the table
gives an initial idea of how the chosen predictors influence housing choice. First,
with increasing income, choice of single-family rental dwellings and owner oc-
cupation increases. Second, in the Randstad, because multifamily rental dwel-
lings form a larger part of the housing stock, they are chosen more often. Third,
the larger the household, the greater the probability of choosing single-family
housing. The ANOTA and logit models that use Table 3 as input will bring out
these patterns more clearly.
The ANOTA Analysis
Multivariate nominal scale analysis, developed by Andrews and Messenger
(1973), or its reformulation as ANOTA by Keller, Verbeek, and Bethlehem
(1984) meets the demand for a simple alternative to logit and probit models in
multivariate analysis of qualitative data. The authors argue that existing models,
such as logit and probit, are not completely satisfactory for dependent variables
with more than two categories, that the logit and probit transformations hamper
the interpretation of the linear parameters, and that the computational require-
ments are substantial. MNA/ANOTA seeks to optimize ease of computation and
interpretation instead.
The core of the ANOTA model is formed from the estimated coefficients
which show the “effect” of membership in the particular (nominal) category of
the independent variable on the likelihood of membership in each (nominal)
category of the dependent variable. The coefficients are corrected for possible
interactions between the explanatory variables, and therefore represent “pure”
effects, which can be interpreted as partial regression coefficients. Thus, these
coefficients can be added together (literally) across the several independent vari-
TABLE 3
Housing Choice of Movers Previously in the Rental Sector by Income,
Type of Housing Market, and Household Size; Result of PRU Analysis

                                        Housing choice
Income     Housing                 multifam.   single fam.   own
(x1000)    market      Size        rent (%)    rent (%)      (%)      No.
           Randstad    1 p.           *            *          *        14
                       2 p.          32           15         53       124
                       3 or more     23           29         48       122
Total                                31           37         32      2923
* Values may be unreliable because of the small cell sizes and are therefore not reported.
ables to predict the household's score on the dependent variable (the expected
probability for any household is obtained by summing the base likelihood and
the coefficients that pertain to that household, and dividing by 100). For exam-
ple, in Table 4, the expected probability of a one-person household moving from
the rental sector, with an income below fl. 30,000, living in the Randstad, and
choosing a multifamily rental dwelling, is 0.313 (the base likelihood or average)
plus 0.123 (income effect) plus 0.171 (Randstad effect) plus 0.245 (size of household
effect), a total of 0.852. Thus, the expected probability of making a particular
choice, given the categories of the independent variables of a moving household,
can be determined in a straightforward manner. It is also possible to focus
on a particular column of coefficients in Table 4. The coefficients associated
with any category of any predictor sum to zero across the categories of the de-
pendent variable, and so can be interpreted as deviations from the average.
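The additive prediction described above amounts to a single sum. The sketch below uses the coefficients quoted in the text for choosing a multifamily rental dwelling; the dictionary layout and variable names are hypothetical.

```python
base = 0.313  # base likelihood: average share choosing multifamily rent

effects = {
    "income below fl. 30,000": 0.123,
    "Randstad": 0.171,
    "one-person household": 0.245,
}

# ANOTA prediction: base likelihood plus the coefficients that apply to the household.
expected = base + sum(effects.values())

# Nothing in the additive form keeps the sum inside the 0-1 range,
# so it is worth checking each prediction.
in_range = 0.0 <= expected <= 1.0
print(round(expected, 3), in_range)
```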
The coefficients in Table 4 show the relationship of the predictors with hous-
ing choice more clearly than the percentages in Table 3, because the coefficients
are “pure” effects as a result of the assumption of independent influences of the
independent variables. If the assumptions of the model are severely violated, the
ANOTA parameters would be misleading. But this does not seem to be the case
here, as inspection of the bivariate tables for independent variables indicates.
Income, in particular, affects the choice between own and rent. Keeping size of
household and housing market type constant in the lowest income category, the
likelihood of buying a house for households who were originally in the rental
sector is low (0.319 - 0.202 = 0.117), while it is high in the highest income category
(0.319 + 0.336 = 0.655). The housing market type mainly affects the decision
to select multifamily rental housing; in the Randstad the expected probability
of choosing a multifamily dwelling is nearly 0.50, while in the rest of the
Netherlands the probability is below 0.25, again keeping the household’s income
and size constant. Choice patterns vary widely between one- and two-or-more-
person households. For example, the expected probability of a single person
(controlling income and housing market type) moving into multifamily housing
is 0.558, against 0.145 for moving into a single-family rental house.
As noted earlier, the ease of computation and the straightforward manner of
interpretation shown must be traded off against a number of disadvantages of
ANOTA. Sometimes the estimated probabilities may be out of the 0-1 range for
rare combinations of categories of the predictors. As a rule of thumb, Andrews
and Messenger (1973) suggest at least ten times as many cases as number of
predictor variable categories, and that each category of the dependent variable
contains at least 10 percent of the cases to avoid inaccurate estimates. Another
disadvantage of ANOTA pertains to the overall explanatory power of the model
or of a single predictor on choice. The ANOTA analysis in itself does not give this
information, although a number of coefficients have been suggested for this
purpose (Deurloo et al. 1988). In many instances, the ANOTA results will be
treated as the final step in the analysis (e.g., Linde et al. 1986). If the disadvan-
tages are critical, or if the model assumptions are violated, the use of a logit
model is necessary.
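The rule of thumb attributed to Andrews and Messenger (1973) can be checked mechanically. The sketch below reads "ten times as many cases as number of predictor variable categories" as ten times the total number of categories over all predictors, which is an interpretation rather than something the text spells out; the function name is hypothetical. The counts for housing choice come from Table 1, and the category counts are those of the three predictors after the PRU simplification.

```python
def meets_rule_of_thumb(dependent_counts, predictor_category_counts):
    """Check both parts of the rule: enough cases overall, and no rare
    category of the dependent variable (each holds at least 10 percent)."""
    n = sum(dependent_counts)
    enough_cases = n >= 10 * sum(predictor_category_counts)
    balanced = all(count / n >= 0.10 for count in dependent_counts)
    return enough_cases and balanced


choice_counts = [914, 1076, 933]  # multifamily rent, single family rent, owner occupation
predictor_categories = [3, 2, 3]  # income, housing market, household size after PRU

print(meets_rule_of_thumb(choice_counts, predictor_categories))
```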
The Hierarchical Logit Model
The description of the association between the dependent variable housing
choice and the selected predictors can be explored further with the logit ap-
proach. Relevant interaction effects between the independent variables can be
estimated and irrelevant interaction effects can be dropped, thereby simplifying
the model. In this phase of the analysis the PRU can be used to evaluate the fit of
the (unsaturated) logit models (see Kim 1984; Clark et al. 1986).
In presenting the logit models, we use the notation of fitted marginals, as is
usually done for log-linear analyses (Haberman 1978). As a consequence of this
notation, all effects between the independent variables are listed. For example,
in a four-dimensional cross-tabulation with the variables housing choice (CH),
income (I), housing market (M), and size of household (S), [I,M,S], [CH] denotes
the model without association between the dependent variable (CH) and
the set of independent variables. The notation [I,M,S], [I,CH], [M,CH],
[S,CH] covers the unsaturated model in which the logit is the sum of the main
effects of each of the independent variables on choice, without any interaction
effects (this is the log-linear equivalent to the linear ANOTA model). Table 5
shows a selection of hierarchical multinomial logit models fitted to the data in
Table 3. The simple unsaturated models with zero or only one interaction effect
TABLE 5
A Selection of Hierarchical Multinomial Logit Models of Housing Choice (CH), Income (I),
Housing Market (M), and Size of Household (S)
(models 1-4) are not very different from the saturated model. The PRU of Table
3 is 0.192, while the PRU of model 1 is 0.180. Therefore, with only the main
effects of income, housing market type, and household size on choice in the
model, the loss of information is quite small. Of the three models with one inter-
action effect, model 4 fits best. We decided to take this simple model for inter-
pretation because the PRU of 0.186 is close to the PRU of Table 3 (although in
terms of G² the loss in explanation is significant). The more complicated model 5,
with two interaction effects and a PRU of 0.191, should be chosen when the
significance of G² is the ultimate criterion. However, some of the parameters of
model 5 are unreliable, showing very high standard errors. It is clear that the
preprocessing now bears fruit; the careful selection of the predictors and their
categorization is a prerequisite for finding such simple and robust logit models.
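In equation form, the fitted-marginals notation used above corresponds to models for the log odds of each choice category relative to a reference category. The sketch below assumes choice category 1 (multifamily rent) as the reference and uses β symbols that are introduced here only for illustration; i, m, and s index the categories of income, housing market, and household size.

```latex
% Main-effects model [I,M,S], [I,CH], [M,CH], [S,CH]:
\ln \frac{P(CH = c \mid i, m, s)}{P(CH = 1 \mid i, m, s)}
  = \beta_{c} + \beta^{I}_{ic} + \beta^{M}_{mc} + \beta^{S}_{sc},
  \qquad c = 2, 3 .

% Model with the housing market by household size interaction, [I,M,S], [I,CH], [M,S,CH]:
\ln \frac{P(CH = c \mid i, m, s)}{P(CH = 1 \mid i, m, s)}
  = \beta_{c} + \beta^{I}_{ic} + \beta^{M}_{mc} + \beta^{S}_{sc} + \beta^{MS}_{msc},
  \qquad c = 2, 3 .
```

With a cornered parameterization, every β whose subscript refers to the first category of a predictor is fixed at zero, so the remaining parameters are contrasts with that first category.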
Table 6 presents the parameters of model 4 written out for each category of
the predictors in the form of a table with “cornered effects” (see Wrigley 1985).
The parameters of category 1 of each variable are set to zero, and the parameters
for the other categories can be directly compared to these categories. For example,
the parameter for income category three (> fl. 42,000) on choice category
three (owner occupation) has a value of 3.00; this indicates that in the highest
income category the odds of buying a house rather than renting a multifamily
dwelling are twenty times (e^3.00) those of a household in the lowest income
category (holding other characteristics of the household constant); it also
indicates that (holding other characteristics of the household constant) these
odds are three times those of a household in income bracket
fl. 30,000-42,000 (e^3.00/e^1.92). Thus, the
logit parameters of Table 6 can be compared directly. Presented in this tabulated
form they show major and minor effects of the choice predictors.
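The two comparisons quoted for the income parameters can be reproduced by exponentiating the cornered-effect coefficients. The values 3.00 and 1.92 are those given in the text; the variable names are hypothetical.

```python
from math import exp

b_high = 3.00  # income > fl. 42,000, owner occupation versus multifamily rent
b_mid = 1.92   # income fl. 30,000-42,000, same contrast

# Odds relative to the lowest income category (whose parameter is set to zero).
odds_high_vs_low = exp(b_high)          # about 20.1

# Odds relative to the middle income category: e^3.00 / e^1.92 = e^(3.00 - 1.92).
odds_high_vs_mid = exp(b_high - b_mid)  # about 2.9

print(round(odds_high_vs_low, 1), round(odds_high_vs_mid, 1))
```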
For the group of households used as an illustration in this article (movers from
or within the rental sector), income’s main effect is on the choice between rent
and own, as the high parameters for owner occupation show. Three-or-more-person
households show a very strong preference for single-family rental housing
and owner occupation (which is also mainly single-family). Size of household
seems to be an important determinant for the choice of type of dwelling. Living
in the Randstad in general means a decrease in the possibility of renting a single-
family dwelling (note the main effect of housing market and the interaction ef-
fect of housing market and size of household on choice category two). But living
in the Randstad affects two-or-more-person households buying a house in particular;
the interaction effect of size of household and housing market shows that
these households have a much lower probability of buying a house in the Rand-
stad as compared to the rest of the Netherlands (the parameters are -1.11 and
-2.15 of the [M,S,CH] effect on choice category three).
TABLE 6
Coefficients of Logit Model {I,M,S} {I,CH} {M,S,CH}: Cornered Effects

Income                  < fl 30,000                         fl 30,000-42,000                    > fl 42,000
Housing market          rest Neth.        Randstad          rest Neth.        Randstad          rest Neth.        Randstad
Size household             1     2   >=3     1     2   >=3     1     2   >=3     1     2   >=3     1     2   >=3     1     2   >=3
Choice      Effects
multifamily constant
rent        I,CH
            M,CH
            S,CH
            M,S,CH
single      constant    -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88  -.88
family      I,CH                                             .64   .64   .64   .64   .64   .64   .38   .38   .38   .38   .38   .38
rent        M,CH                         -1.42 -1.42 -1.42                   -1.42 -1.42 -1.42                   -1.42 -1.42 -1.42
            S,CH               .75  2.18         .75  2.18         .75  2.18         .75  2.18         .75  2.18         .75  2.18
            M,S,CH                               .43  -.34                           .43  -.34                           .43  -.34
FIG. 2. CHAID Dendrogram for Renters. Table values are the percentages moving to each destination category.
6. CONCLUSION
In situations with a dependent categorical variable (especially if it has more
than two categories), and a large number of categorical independent variables, a
combination of preprocessing and logit or MNA/ANOTA modeling is a preferred
modeling strategy, as the housing choice examples in this paper illustrate.
The preprocessing is useful in selecting an “optimal” subset of independent vari-
ables from the larger available set and simplifying the categorization wherever
possible. Without thorough preprocessing, any attempt at logit modeling for
data sets with large numbers of empty cells, or in which there are only a few
observations, renders the parameters of a logit model unreliable if not meaningless.
Also, a large cross-tabulation usually leads to so many parameters in the logit
model that it is impossible to provide a meaningful interpretation.
The PRU and CHAID methods are efficient approaches for preprocessing.
Both lead to a selection of relevant variables and important simplifications in the
categorization often without any loss of information about the choice patterns.
CHAID gives more detail about specific housing choices because predictor vari-
ables are “nested” within the categories of previously selected predictors. But
this also makes it more difficult to proceed from the results of the CHAID pre-
processing toward logit or ANOTA modeling than from the results of the pre-
processing with the PRU measure; a somewhat subjective interpretation of the
CHAID results seems to be unavoidable in formulating standard logit models.
Preprocessing in itself yields an indication of the structure of the relationships
between the dependent and most relevant independent variables, as Table 3 and
Figure 2 of our example illustrate. Although it has been common to stop at this
stage, we have shown that a subsequent ANOTA analysis and/or logit modeling
can still be profitable. Logit and ANOTA modeling lead to much clearer and
richer conclusions about the relationships between the variables than a mere
cross-tabulation reached by preprocessing. On the other hand, ANOTA and
logit modeling clearly profit from the preprocessing, and robust models are the
result of the combination of methods.
MNA/ANOTA analysis seems a reasonable alternative to logit modeling. The
$$\underset{[K \times I]}{B} \;=\; \underset{[K \times K]}{(X'X + R'R)^{-1}} \cdot \underset{[K \times N]}{X'} \cdot \underset{[N \times I]}{Y}.$$
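The estimator B = (X'X + R'R)^-1 X'Y can be sketched in plain Python. Several things here are assumptions and not taken from the appendix: X is taken to hold a constant column plus one indicator column per predictor category, Y is the indicator matrix of the dependent variable, and each row of R holds the sample shares of one predictor's categories, so that the coefficients of each predictor are forced to a weighted sum of zero. The counts, the function names, and the use of exact fractions are hypothetical illustration choices.

```python
from fractions import Fraction  # exact arithmetic, so the checks hold without rounding


def solve(a, b):
    """Solve a*x = b by Gauss-Jordan elimination; a is K x K, b is K x I."""
    k = len(a)
    aug = [list(a[r]) + list(b[r]) for r in range(k)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        head = aug[col][col]
        aug[col] = [v / head for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def anota(counts, levels, n_out):
    """Return (B, share) with B = (X'X + R'R)^-1 X'Y.

    counts maps (predictor categories..., dependent category) to a number of cases;
    levels lists the number of categories of each predictor.
    """
    k = 1 + sum(levels)  # constant plus one column per predictor category
    offsets = [1 + sum(levels[:j]) for j in range(len(levels))]
    n = sum(counts.values())
    xtx = [[Fraction(0)] * k for _ in range(k)]
    xty = [[Fraction(0)] * n_out for _ in range(k)]
    share = [Fraction(0)] * k  # column means of X, i.e. category shares
    for (*cats, y), w in counts.items():
        cols = [0] + [off + c for off, c in zip(offsets, cats)]
        for a in cols:
            share[a] += Fraction(w, n)
            xty[a][y] += w
            for b in cols:
                xtx[a][b] += w
    # ASSUMED form of R: one row per predictor, holding that predictor's category shares.
    for off, size in zip(offsets, levels):
        r = [Fraction(0)] * k
        r[off:off + size] = share[off:off + size]
        for a in range(k):
            for b in range(k):
                xtx[a][b] += r[a] * r[b]
    return solve(xtx, xty), share


# Hypothetical counts: (income class, market class, choice class) -> households.
counts = {
    (0, 0, 0): 30, (0, 0, 1): 15, (0, 0, 2): 5,
    (0, 1, 0): 40, (0, 1, 1): 5, (0, 1, 2): 5,
    (1, 0, 0): 10, (1, 0, 1): 20, (1, 0, 2): 30,
    (1, 1, 0): 20, (1, 1, 1): 10, (1, 1, 2): 10,
}
B, share = anota(counts, [2, 2], 3)
print([float(v) for v in B[0]])  # constants: the overall shares of the three choices
```

Under the assumed form of R, the first row of B reproduces the overall proportions of the dependent variable, and the coefficients of every predictor category sum to zero across the categories of the dependent variable, which matches the interpretation of the coefficients as deviations from the average.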
Estimates for the standard errors of the regression coefficients can also be
calculated. They are completely different from ordinary-least-squares standard
errors.
Let bi represent the vector formed by taking the ith column of B , so its com-
ponents are the regression coefficients corresponding to the ith category of the
dependent variable Y. Representing the ith column of the N × I indicator matrix
Y by Y_i, it follows that
p denotes the true model value. Two approximations are now applied. First
is approximated by p
the overall probability in the population that a randomly selected element belongs
to category i of the dependent variable Y. This is not very accurate, but
when
takes values between 0.15 and 0.85 the variance varies only between 0.13 and
0.25. In applying this approximation the heteroscedasticity of Yj is in fact
neglected.
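The range quoted for the variance follows from the variance of a 0/1 indicator, p(1 - p). The sketch below assumes that this is the variance meant in the passage; the function name is hypothetical.

```python
def indicator_variance(p):
    """Variance of a 0/1 indicator that equals 1 with probability p."""
    return p * (1 - p)


low = indicator_variance(0.15)   # 0.1275, about 0.13; the same value holds at p = 0.85
peak = indicator_variance(0.50)  # 0.25, the maximum over all p

print(round(low, 2), peak)
```

Because the variance changes so little over this range, replacing the individual probabilities by a single overall proportion costs little accuracy.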
In the second approximation the true fraction p yi is, as usual, estimated by the
sample fraction CYi. Summing up, Var(b) is estimated by
The authors of ANOTA also propose a more accurate estimate for Var(b) in
which the bivariate tables Y × X_i are used. They point out, however, that although
this method is much more complex than the first, the difference in accuracy is
remarkably small.