
# Journal of Data Science 5(2007), 297-313

Linear Information Models: An Introduction

**Philip E. Cheng, Jiun W. Liou, Michelle Liou and John A. D. Aston**

*Academia Sinica*

Abstract: Relative entropy identities yield basic decompositions of categorical data log-likelihoods. These naturally lead to the development of information models in contrast to the hierarchical log-linear models. A recent study by the authors clarified the principal difference in the data likelihood analysis between the two model types. The proposed scheme of log-likelihood decomposition introduces a prototype of linear information models, with which a basic scheme of model selection can be formulated accordingly. Empirical studies with high-way contingency tables are exemplified to illustrate the natural selections of information models in contrast to hierarchical log-linear models.

Key words: Contingency tables, log-linear models, information models, model selection, mutual information.

1. Introduction

Analysis of contingency tables with multi-way classifications has been a fundamental area of research in the history of statistics. From testing the hypothesis of independence in a 2 × 2 table (Pearson, 1904; Yule, 1911; Fisher, 1934; Yates, 1934) to testing interaction across strata of 2 × 2 tables (Bartlett, 1935), many discussions have emerged in the literature to build up the foundation of statistical inference in categorical data analysis. In this vital field of applied statistics, three closely related topics have gradually developed and are still theoretically incomplete even after half a century of investigation.

The first and primary topic is born out of the initial hypothesis testing for independence in a 2 × 2 table. From the 1960s until the 1980s, Fisher's exact test repeatedly received criticism for being conservative due to its discrete nature (Berkson, 1978; Yates, 1984). Although the arguments in favor of the unconditional tests essentially focused on using the unconditional exact tests in recent decades, the reasons for preferred p values and sensitivity of the unconditional tests have not been assured in theory. In this respect, a recent approach via information theory proved that the power analysis of unconditional tests is not suitable for testing independence, but simply for equal and unequal binomial rates. In retrospect, it is seen that noncentral hypergeometric distributions cannot determine the power analysis at alternatives to the null, that is, the probability at alternate interactions given the observed data; power evaluations at arbitrary alternative 2 × 2 tables have only recently been discussed in the literature, and for the same data, the data likelihood identities provide appropriate explanations (Cheng et al., 2005). Consequently, the long-term ambiguous criticism of the exact test was finally proved to be logically vacuous (Cheng et al., 2007).

The second topic extends the first to several 2 × 2 tables. Bartlett (1935) addressed the topic of testing interaction and derived an estimate of the common odds ratio. Simpson (1951) and Roy and Kastenbaum (1956) discussed interpretations of interactions and showed that Bartlett's test is a conditional maximum likelihood estimation (MLE) given the table margins. Since then, the celebrated CMH test for association (Cochran, 1954; Mantel and Haenszel, 1959) has been applied extensively in the fields of biology, medicine, engineering, education, sociology and psychology. It was recently found that a flaw of inference exists with a likelihood ratio test for association (Roy and Kastenbaum, 1956) and another for testing interaction (Darroch, 1962); it was implicit that an inferential flaw lies in the estimating-equation design of the CMH score test (Woolf, 1955). A remedy to such statistical inference was recently provided by an analysis of invariant information identities (Cheng et al., 2006). The solution, as an extension of the power analysis for a single 2 × 2 table, will also be useful for testing hypotheses with high-way contingency tables.

The third topic concerns the partition of the data likelihood. Analysis of variance (ANOVA; Fisher, 1925) inspired discussions of partitioning chi-squares within the contingency tables (Norton, 1945; Mood, 1950; Lancaster, 1951; Snedecor, 1958; Claringbold, 1961). A drawback of inference with the test statistics by Lancaster, Kullback and Claringbold was remarked by Plackett (1962). It inspired in turn the development of log-linear models (Kullback, 1959; Darroch, 1962; Lewis, 1962; Birch, 1964; Goodman, 1964, 1969, 1970). Hierarchical log-linear models were thereby formulated to analyze general aspects of contingency tables (cf. Bishop et al., 1975; Christensen, 1997; Agresti, 2002), and since then, they have been widely used in the literature (Hagenaars, 1993). These data information identities also indicate that analysis of variance may not be designed and used to measure deviations from uniform association, or zero conditional association, or varied interactions between the categorical variables, which is the topic of this study to be discussed below. A prototype of the linear information models, which are simply defined by likelihood factorizations, will be formulated below and compared to the hierarchical log-linear models.

The basic log-linear models in three variables will be reviewed in Section 2; basic identities of data likelihood are examined, and a prototype of linear information models in three variables will be considered. In Section 3, information identities of three-way tables are fully discussed to characterize the corresponding information models, which may differ from the log-linear models only in some representations. Section 4 will provide empirical studies of four- and five-way tables, which have been analyzed with log-linear models in the literature. Comparisons between the log-linear models and the proposed information models will be discussed; in particular, the prototypes of information models with four-way and high-way tables begin to indicate the essential difference from that of the log-linear models, and an elementary scheme of model selection can be formulated with easily justified tests of significance. In conclusion, remarks on criteria of information model selection are noted for further useful research.

2. Elementary Log-likelihood Equations

The representations of data log-likelihood can be formulated in various ways, depending on the methods of likelihood decomposition. There are numerous ways of decomposing the data likelihood with high-way tables, which allow only a few different expressions of log-likelihood equations. It is elementary and instructive to discuss the case of three-way tables. The three-way log-linear models are first reviewed; next, the corresponding information models, which are simply defined by likelihood factorizations, are formulated, and obvious advantages over the log-linear modeling are easily shown through the information model selection.

2.1 Basic log-linear models

Suppose that individuals of a sample are classified according to three categorical variables {X}, {Y}, {Z} with classified levels i = 1, ..., I, j = 1, ..., J, and k = 1, ..., K, respectively. Denote the joint probability density by f_ijk (= f(i, j, k)) = P(X = i, Y = j, Z = k). The full (saturated) log-linear model (Goodman, 1970; Bishop et al., 1975) is defined as

log f_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_jk^YZ + λ_ik^XZ + λ_ijk^XYZ,  (2.1)

where Σ_ijk f_ijk = 1. The model permits all three pairs to be conditionally dependent; that is, no pair is conditionally independent (Agresti, 2002). The full model can be reduced to a submodel of no three-way interaction, which is denoted by (XY, YZ, XZ) and written as

log f_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_jk^YZ + λ_ik^XZ.  (2.2)

Equation (2.2) obtains from (2.1) by checking the magnitude of the three-way interaction using the iterated proportional fitting (Deming and Stephan, 1940). In case exactly one pair of factors is conditionally independent given the third factor, say, {X, Z} given Y, the model is expressed as

log f_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_jk^YZ,  (2.3)

which is denoted by (XY, YZ). Equation (2.3) is derived from (2.2) by deleting the conditional association between {X, Z} given Y, which is the sum of the unique three-way interaction and the conditional uniform association between {X, Z} given Y. This will be further explained using equation (2.12) below. With two pairs of conditionally independent factors, the model reduces to independence between {X, Y} and Z, that is, zero mutual information between {X, Y} and Z (Cheng et al., 2006). It is denoted (XY, Z) and written as

log f_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY.  (2.4)

Equation (2.4) is, however, not derived from (2.3), but directly computed from the corresponding margins; analogously, independence between {Y, Z} and X, denoted (YZ, X), is written as

log f_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_jk^YZ.  (2.5)

The final reduction to three pairs of conditional independence is denoted by (X, Y, Z) and expressed as

log f_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z.  (2.6)

Indeed, equation (2.6) defines the mutual independence between the three factors. It is obvious that the one-way and two-way parameters in each of the models (2.4) to (2.6) are statistically independent, using disjoint data information. Such a desirable property is lost between the two-way and three-way terms in the models (2.1) and (2.2), which indicates an intrinsic difference in the interpretation of these log-linear models. The parameters in the log-linear models, besides the normalizing constant λ, are scaled log-likelihood ratios, or the logarithmic odds ratios.

2.2 Basic information models

It may be useful to take a different look at the formulation of hierarchical log-linear models, in order to understand the intrinsic difference between dependent and independent likelihood decompositions. Define the marginal probabilities as

f_i = f_i·· = Σ_{j=1}^{J} Σ_{k=1}^{K} f_ijk,  f_ij· = Σ_{k=1}^{K} f_ijk,  (2.7)

and analogously for f_j, f_k, f_i·k and f_·jk.
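Model (2.3) can be tested directly: its likelihood-ratio (deviance) statistic is twice the sample analog of the conditional mutual information I(X, Z|Y), referred to a chi-square distribution with J(I − 1)(K − 1) degrees of freedom. A minimal numerical sketch (the 2 × 3 × 2 table of counts is invented for illustration):

```python
import numpy as np
from scipy.stats import chi2

# Invented 2 x 3 x 2 table of counts; axis 0 = X (I=2), axis 1 = Y (J=3), axis 2 = Z (K=2).
n = np.array([[[20, 10], [15, 15], [8, 12]],
              [[18, 12], [14, 16], [10, 10]]], dtype=float)
N = n.sum()
f = n / N                            # joint MLE f_ijk

f_y  = f.sum(axis=(0, 2))            # f_.j.
f_xy = f.sum(axis=2)                 # f_ij.
f_yz = f.sum(axis=0)                 # f_.jk

# Conditional mutual information I(X, Z | Y); it is zero exactly under
# model (XY, YZ), i.e. X and Z conditionally independent given Y.
ratio = (f * f_y[None, :, None]) / (f_xy[:, :, None] * f_yz[None, :, :])
cmi = np.sum(f * np.log(ratio))

G2 = 2 * N * cmi                     # LR (deviance) statistic for model (2.3)
df = 3 * (2 - 1) * (2 - 1)           # J(I-1)(K-1) = 3
p = chi2.sf(G2, df)
```

Under the null, G2 is asymptotically chi-square with 3 d.f.; a small p value rejects the conditional-independence model (2.3).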

Some convenient terminology will be borrowed from our previous study (Cheng et al., 2006, Section 2). The entropy identity in three variables is expressed as

H(X) + H(Y) + H(Z) = H(X, Y, Z) + I(X, Y, Z),  (2.8)

where the marginal entropy, the joint entropy, and the mutual information are defined, respectively, to be

H(X) = −Σ_i f_i log f_i,  H(Y) = −Σ_j g_j log g_j,  H(Z) = −Σ_k h_k log h_k,
H(X, Y, Z) = −Σ_{i,j,k} f_ijk log f_ijk,
I(X, Y, Z) = Σ_{i,j,k} f_ijk log ( f_ijk / (f_i g_j h_k) ) = D(f(x, y, z) || f(x) g(y) h(z)).  (2.9)

The last entry of (2.9), the mutual information, is the Kullback-Leibler divergence between the joint density and the product marginal density of (X, Y, Z). The sample analogs of the entries in (2.9), the sample entropy and the sample mutual information, are defined in terms of the observed cell frequencies, which are the natural maximum likelihood estimates under the general model of the multinomial distribution. For example, let n_ijk denote the number of individuals classified to the cell {X = i, Y = j, Z = k}, and let n = n_···, n_i = n_i·· and n_ij·, with similar notations, denote the total cell frequency and the marginal totals. The sample analog of the mutual information I(X, Y, Z) yields the likelihood ratio test statistic

Σ_{i,j,k} n_ijk log ( n_ijk n² / (n_i n_j n_k) ),  (2.10)

such that twice of (2.10) is asymptotically chi-square distributed with d.f. IJK − (I + J + K) + 2 under the null hypothesis of mutual independence. It is easily seen that the data log-likelihood admits three equivalent information identities:

I(X, Y, Z) = I(X, Y) + I(Y, Z) + I(X, Z|Y)
           = I(X, Z) + I(Y, Z) + I(X, Y|Z)
           = I(X, Y) + I(X, Z) + I(Y, Z|X),  (2.11)

which are the three possible saturated information models for a three-way table.
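The entropy identity (2.8) and the mutual-independence statistic (2.10) are easy to check numerically; a sketch with an invented 2 × 2 × 2 table of counts:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a probability array, ignoring zero cells."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Invented 2 x 2 x 2 table of counts (illustrative data only).
n = np.array([[[10, 14], [22, 8]],
              [[16, 20], [12, 18]]], dtype=float)
f = n / n.sum()                      # joint MLE f_ijk

fx = f.sum(axis=(1, 2))              # f_i..
fy = f.sum(axis=(0, 2))              # f_.j.
fz = f.sum(axis=(0, 1))              # f_..k

# Mutual information I(X, Y, Z) = sum f_ijk log( f_ijk / (f_i g_j h_k) ).
outer = fx[:, None, None] * fy[None, :, None] * fz[None, None, :]
mask = f > 0
I_xyz = np.sum(f[mask] * np.log(f[mask] / outer[mask]))

# Entropy identity (2.8): H(X) + H(Y) + H(Z) = H(X,Y,Z) + I(X,Y,Z).
lhs = entropy(fx) + entropy(fy) + entropy(fz)
rhs = entropy(f.ravel()) + I_xyz

# Twice the statistic (2.10): chi-square with IJK - (I+J+K) + 2 d.f.
G2 = 2 * n.sum() * I_xyz
df = 8 - (2 + 2 + 2) + 2             # = 4 for a 2 x 2 x 2 table
```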

Being saturated models, the equations of (2.11) are different expressions of the log-linear full model (2.1). In each equation, the conditional mutual information (CMI), say I(X, Z|Y), is the sum of the unique three-way interaction and the conditional uniform association between {X, Z} given Y:

I(X, Z|Y) = Int(X, Z||Y) + I(X, Z||Y).  (2.12)

The unique three-way interaction, Int(X, Z||Y), characterizes exactly the three-way term λ_ijk^XYZ of (2.1), and the latter expression is unique up to the normalizing constant λ; the second summand, I(X, Z||Y), characterizes the uniform association between X and Z across the levels of Y. If the common interaction term Int(X, Z||Y) of (2.12) is removed (equal to zero), all three equations are reduced to model (2.2); the remaining terms are not the same, which leaves four possible combined cases.

It was recently proposed that the two summands of (2.12) form a natural two-step likelihood ratio (LR) test, where the first-step LR statistic, Int(X, Z||Y), tests for no interaction between X and Z across K strata of Y, and the second, I(X, Z||Y), tests for no association between X and Z within the levels of Y. The special case of the 2 × 2 × K table, the simplest three-way table, has been a much discussed topic in the literature, for example, the Breslow-Day test (1980) and the CMH test (1954, 1959), also called the generalized CMH test (Agresti, 2002, p.325). The two-step LR test for no association across strata of Y is shown to be asymptotically unbiased and more powerful than the one-step omnibus LR test; that is, the two-step LR test improves over the combination of the score tests, the Breslow-Day test and the CMH test (Cheng et al., 2007).

Logically, if the omnibus CMI I(X, Z|Y) is insignificantly small, there is no need to perform the second-step test, the conditional odds ratios between X and Z are close to 1 within strata of Y, and the first equation of (2.11) may be reduced to the sum I(X, Y) + I(Y, Z), which is model (2.3). Otherwise, this CMI, which is usually expected to be significantly large, has two summands, Int(X, Z||Y) and I(X, Z||Y), that can be individually significantly large or small. If the test for no interaction, Int(X, Z||Y), is rejected, it is logically consistent with the significant omnibus test. If the interaction is insignificantly small, then the second-step test statistic I(X, Z||Y) characterizes the uniform association between X and Z across the levels of Y, and, if it is significantly large, that is also consistent with the significant omnibus test; the second step is in use only when the first step is insignificant, and it is insufficient to use only the second-step test for testing no association between X and Z. With a rare chance it may happen that both Int(X, Z||Y) and I(X, Z||Y) are insignificantly small, whereas the omnibus CMI is marginally significant; it is theoretically rare that the two-step test is not sufficiently sensitive. It is understood that such significant or insignificant tests are defined with a common fixed level of significance for each approximating chi-square distribution. In summary, it follows from (2.12) that four combinations of significance and insignificance patterns between the three LR tests are meaningful in practice.
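The two summands of (2.12) can be sketched numerically: the interaction deviance is obtained from the no-three-way-interaction fit by iterative proportional fitting (the classical Deming-Stephan scaling), and the uniform-association deviance is the remainder of the omnibus CMI deviance. The table and helper below are invented for illustration:

```python
import numpy as np
from scipy.stats import chi2

def ipf_no_three_way(n, iters=500):
    """Fit the no-three-way-interaction model (XY, YZ, XZ) by
    iteratively matching the three two-way margins of the counts n."""
    m = np.full(n.shape, n.sum() / n.size)
    for _ in range(iters):
        m *= (n.sum(axis=2) / m.sum(axis=2))[:, :, None]   # match XY margin
        m *= (n.sum(axis=0) / m.sum(axis=0))[None, :, :]   # match YZ margin
        m *= (n.sum(axis=1) / m.sum(axis=1))[:, None, :]   # match XZ margin
    return m

# Invented 2 x 2 x 2 table; axis order (X, Y, Z).
n = np.array([[[25, 10], [12, 18]],
              [[14, 16], [20, 11]]], dtype=float)

# Omnibus CMI deviance: LR test of (XY, YZ), i.e. X independent of Z given Y.
ny  = n.sum(axis=(0, 2))
nxy = n.sum(axis=2)
nyz = n.sum(axis=0)
G2_cmi = 2 * np.sum(n * np.log(n * ny[None, :, None]
                               / (nxy[:, :, None] * nyz[None, :, :])))

# Step 1: interaction deviance, chi-square with (I-1)(J-1)(K-1) = 1 d.f. here.
m = ipf_no_three_way(n)
G2_int = 2 * np.sum(n * np.log(n / m))

# Step 2 (used when step 1 is insignificant): uniform association,
# the remainder of the CMI deviance, with (I-1)(K-1) = 1 d.f. here.
G2_ua = G2_cmi - G2_int
p_int, p_ua = chi2.sf(G2_int, 1), chi2.sf(G2_ua, 1)
```

Because model (XY, YZ) is nested in (XY, YZ, XZ), the interaction deviance never exceeds the omnibus CMI deviance, so both summands are nonnegative.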

It is plain that both formulae of the data log-likelihood decompositions are built upon the notion of mutual information, not in terms of ANOVA as defined with the log-linear models. These information terms, with varied statistical inference, are not directly revealed between models (2.2) to (2.4) as mentioned above; this key point, however, is not clarified in the discussion of hierarchical log-linear models.

3. Information Structures of High-Way Tables

The natural extensions of equation (2.11) to high-way contingency tables will enhance interpretations of high-way effects and manifest further differences from the conventional log-linear models. The goal of this study is to make a natural and systematic extension of equation (2.11) to high-way contingency tables, yielding what are coined information models. Subsequently, a prototype of the linear information model, defined by simple criteria of model selection, will be introduced to offer advantages of statistical inference over the conventional log-linear model. The difference between the proposed information models and the log-linear models will be illustrated using four-way and five-way data tables that have been analyzed in the literature.

3.1 A four-way information structure

It is observed from equation (2.11) that a common variable appears in each term of the decomposed three-way mutual information, and it is used as the conditioning variable (termed CV). This is not necessarily required of a four-way or a high-way table. When any variable is of primary interest, like the response variable in a linear regression model, it is used as a CV for finding its relations to the remaining variables; the use of such a CV is always applicable and particularly useful. The equations of (2.11) consist of two major formulae for each saturated model of a K-way contingency table. The primary formula is that the log-likelihood decomposition is the sum of K main effects, (K − 1) two-way (MI) effects, (K − 2) three-way (CMI) effects, ..., two (K − 1)-way CMI, and one K-way CMI. The secondary formula is that each t-way CMI (t ≤ K) is a mutual information between a pair of variables, conditioned upon the other t − 2 variables; a t-way CMI is always a sum of the t-way interaction and the uniform association between the pair of variables, conditioned on the remaining t − 2 variables. In case of four variables, the analog of (2.8) is the entropy identity

H(W) + H(X) + H(Y) + H(Z) = H(W, X, Y, Z) + I(W, X, Y, Z).  (3.1)

As in (2.10), twice the sample analog of I(W, X, Y, Z) yields the LR statistic for testing mutual independence. An analog of (2.11), among many equivalent identities, may be expressed as

I(W, X, Y, Z) = I(W, Y) + I(X, Y) + I(Y, Z) + I(W, X|Y) + I(X, Z|Y) + I(W, Z|X, Y).  (3.2)

The last equation of (3.2) follows by using (2.11) with the variable Y as the CV, that is, by decomposing I(W, X, Z|Y) conditionally on Y. If there is no special CV of interest, then either a different identity of (2.11) may be used to express, say, I(W, X|Y) and I(X, Z|W, Y), respectively, or another identity, such as one built from grouped terms like I((W, X), Y), simply leads to different non-CV saturated models, for example,

I(W, X, Y, Z) = I(W, X) + I(X, Y) + I(W, Y|X) + I(Y, Z) + I(X, Z|Y) + I(W, Z|X, Y).  (3.3)

It is worth noting with these saturated models that all the variables must appear at least once in the main effects, and also among the two-way MI terms. Among the four variables {W, X, Y, Z}, there are exactly six distinct saturated models for each fixed CV, which yields 24 distinct saturated CV models. If both CV and non-CV models are included, the total number of saturated information models in four variables would be 72 = (4!) × 3. The question is how to select a meaningful and parsimonious model among those seventy-two candidate linear information models.

3.2 Elementary model selection schemes

Without loss of generality, the information model selection scheme is illustrated with a four-way contingency table. For a three-way contingency table, exactly one of three CV models may be used according to equation (2.11). A selection scheme based on equation (3.2) using a CV is first outlined below. This is organized as a four-step procedure for ease of exposition.

Step 1: Select the CV, say Y, either because it is a variable of focus, or, if there is no special CV of interest, because Y yields the maximal significant (in p value, for a fixed nominal level) sum of three two-way effects.

Step 2: Find the maximal insignificant (in p value) among the insignificant four-way CMI (between certain two variables, given the CV and the remaining variable).

Suppose the chosen four-way CMI is I(W, Z|X, Y). Then the two candidate three-way CMI, I(W, X|Y) and I(Z, X|Y), are directly obtained by reading the appeared variables from the chosen four-way CMI. Since the three two-way (MI) terms have already been selected in Step 1, a saturated CV information model selection in the four-variable case is essentially complete. Otherwise, when the three available four-way CMI are all significantly large, choose the minimal significant one.

Step 3: To confirm the selected MI and CMI terms, each one of the ten selected terms (including the four main-effect terms) must be individually and separately tested against the same nominal level, say, 0.05 in common practice. The sum of all the insignificant (in p value) terms, among the ten terms, is taken as the tentative remainder, which is then tested against the total (chi-square) d.f., to the same nominal level. If this test is insignificant, then the tentative remainder is insignificantly small and deleted as an insignificant residual, so that the tentative model is accepted; this step is asymptotically correct by the orthogonal likelihood decomposition. Otherwise, the remainder is significantly large (near or over the 95th percentile of the associated chi-square distribution) and the tentative model may lack sufficient information. In this case, a simple remedy is recommended: a maximal insignificant (or, a minimal significant) high-way CMI term can be replaced by its next term, the second-maximal insignificant (or, second-minimal significant) CMI, and, while modifying the information decomposition, the same selection procedure is continued with the renewed model modified in Step 2; the new remainder is likewise tested to be a negligible residual. This remedy, as a supplementary scheme, is easily used in practice, because it can always choose the next insignificant (significant) term whenever needed.

Step 4: To conclude a parsimonious model selection after performing the above three steps, it is often of extra interest, though not necessary, to test the summands of each selected CMI term, an interaction term and a uniform association term (cf. (2.12)), under the same nominal level, to yield a more parsimonious model.

If there is no need to fix a CV of particular interest, Step 1 is bypassed, and Step 2 is generalized without requiring a fixed CV throughout the selection scheme; Steps 3 and 4 are kept unchanged. It is understood that a general scheme may select either a CV model (3.2) or a non-CV model (3.3); being more balanced in selecting the variables among the CMI terms, such a general scheme is expected often to result in selecting a non-CV model. In other words, whether a common CV is in use or not, from high-way to low-way subtables, each variable must appear at least once among the two-way terms of a saturated model before parsimonious selection. Finally, there are alternate ways of selecting a model like (3.2) or (3.3).
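Steps 2-3 of the scheme reduce to comparing each term's deviance with its chi-square reference and then testing the pooled remainder. A small runnable sketch (the term names, deviances, and degrees of freedom are invented for illustration):

```python
from scipy.stats import chi2

# Hypothetical decomposition terms of a four-way CV model (CV = Y):
# each term maps to (LR deviance, degrees of freedom). Invented numbers.
terms = {
    "I(W,Y)":     (34.2, 4),
    "I(X,Y)":     (28.1, 4),
    "I(Y,Z)":     (19.5, 4),
    "I(W,X|Y)":   (7.9,  8),
    "I(X,Z|W,Y)": (11.2, 16),
    "I(W,Z|X,Y)": (30.6, 16),
}
alpha = 0.05

# Steps 2-3: keep the individually significant terms, pool the rest.
selected, remainder_dev, remainder_df = [], 0.0, 0
for name, (dev, df) in terms.items():
    if chi2.sf(dev, df) < alpha:
        selected.append(name)
    else:
        remainder_dev += dev
        remainder_df += df

# Step 3 (continued): the pooled remainder must be an insignificant residual,
# otherwise the remedy of swapping in the next CMI term would be triggered.
residual_ok = chi2.sf(remainder_dev, remainder_df) >= alpha
```

With these invented numbers the three two-way effects and the four-way CMI survive, while the two insignificant CMI terms pool into a negligible residual, so the tentative model is accepted.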

These alternatives to Step 2 may sometimes yield fewer high-way CMI terms compared to the original Step 2. As an alternative choice in Step 2, selecting a minimal insignificant high-way CMI term may be preferred; otherwise, select the minimal (or maximal) significant high-way CMI term when all such CMI terms are significantly large. However, such alternatives usually lead to selecting more terms in the end and sacrifice model parsimony, particularly with high-way tables. It is remarkable that the principal idea in formulating the models (3.2) and (3.3) is to delete more high-way CMI terms, if any, compared to the deletion of high-way interaction terms as a common practice in the selection of a log-linear model. The above three- or four-step model selection scheme can be easily extended to high-way contingency tables. This will be illustrated below in Section 4.

4. Applications to High-Way Tables

Four-way and five-way contingency data tables in the literature will be examined. The proposed information modeling and the four-step model selection scheme of Section 3 will be applied to both four-way and five-way data below. It is notable that the proposed method yields easily interpretable and more parsimonious models compared to those obtained from the hierarchical log-linear modeling.

Example 4.1: A four-way contingency table

A 3 × 3 × 3 × 3 four-way data frequency table (Agresti, 2002, Table 8.19) is exemplified for a comparison study. The data consist of three-level individual opinions on each of four variables of government spending, defined with the variables: environment (E), health (H), big cities (C), and law enforcement (L). These 81 cell frequency counts are listed according to the three levels: "too little", "about right" and "too much". The basic analysis of this data by log-linear modeling leads to the accepted model: deletion of the four-way interaction and all the three-way effects, that is, fitting the data to the summary of the four main effects and all the six two-way effects (Agresti, 2002, Table 8.20). It is so fitted with the log-linear modeling because the deviance between this fitted model and the full model is evaluated to be 31.67 against the chi-square distribution with 48 d.f. Readers may refer to the subsequent estimation of odds ratio parameters for the log-linear model fitting with all two-way effects. An application to a five-way data table will also be illustrated in Example 4.2.

The information modeling begins with finding the most relevant variable to be the CV. According to Step 1 of the selection scheme as illustrated in Section 3.2, it is found that the variable health (H) yields the greatest significant sum of two-way effects.

It can be easily checked from this four-way data that the other two categories, cities (C) and law enforcement (L), may not be considered as useful CVs, because the selected models would include at least one three-way CMI term, in addition to the two-way terms and main effects. It then easily follows by Steps 2 and 3 to find that the only significant terms in the information equation (3.2), using Y = H, are the two-way effects, with the deviances 2I(C, H) ≈ 28.74 and 2I(E, H) ≈ 24.18. By Step 4, a prototype of information model selection by equation (3.2) is tentatively concluded with the information model: "four main effects plus the two-way effects {CH, EH}". It can be easily checked that the remainder deviance, the sum of the residual insignificant CMIs, is 71.98, which is close to the 76th percentile of the chi-square distribution with 64 d.f. This exhibits a basic advantage of information modeling: it usually selects a more parsimonious model with a more concise and simpler explanation compared to the classical log-linear modeling, here with a total fitted d.f. of 16, only half of the 32 fitted d.f. of the log-linear model with all six two-way terms as used in the selected log-linear model (Agresti, 2002).

In case another variable is of interest, say, environment (E), which may be closely related to budget spending and worth an investigation, the variable (E) can be taken as the CV. It also follows by equation (3.2) and the same selection scheme that a similar model is selected: "the main effects, plus the two-way effects {CE, EH}", for which the remainder deviance is 77.41, close to the 88th percentile of the chi-square (64 d.f.) distribution. This provides a similar CV model selection and interpretation.

In case no particular variable is fixed as CV, then a complete selection scheme in accordance with equation (3.3) (omitting Step 1) would lead to a non-CV model: "{CE, HL}, plus the four main effects", for which the remainder deviance is 81.59, close to the 93.5th percentile of the chi-square distribution with 64 d.f. This provides another equally parsimonious information model in which all the four variables share the two-way effects together in a pair of two-way terms.

The (conditional) odds ratio parameters and the deviances of the above three selected information models, including interval estimates, can be computed along with the selection scheme. For brevity, these calculations are not discussed here for each selected information model, and the readers may contact the authors for details. A prototype of information model selection by equation (3.2) for the four-way table is summarized in Table 1 below.
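The chi-square percentiles quoted for the remainder deviances can be checked directly with a one-line helper (the function name is ours):

```python
from scipy.stats import chi2

def deviance_percentile(deviance, df):
    """Percentile of a deviance on its reference chi-square distribution."""
    return 100 * chi2.cdf(deviance, df)

# A remainder deviance of 71.98 on 64 d.f. is quoted in the text as
# close to the 76th percentile of the chi-square distribution.
pct = deviance_percentile(71.98, 64)
```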

Table 1: Mutual Information (3.2) MI, CMI

  MI, CMI        chi-square   d.f.   p-value
  I(C,L|E,H)     34.55        36     0.55
  I(C,E|H)       19.90        12     0.07
  I(E,L|H)       7.99         12     0.78
  I(C,H)         28.74        4      < 0.001
  I(E,H)         24.18        4      < 0.001
  I(L,H)         8.78         4      0.07

Example 4.2: A five-way contingency table

A data set of cross-classifications of individuals according to five dichotomized factors had been studied in six publications prior to Goodman (1978, p.112, Table 1). The purpose was to understand the association relationship between the knowledge (good or poor, denoted by K) of cancer, and the presence or absence of the other four qualitative attributes: L (= lecture), N (= newspaper), R (= radio), and S (= solid reading). The factor of primary interest is the "knowledge K", and hypotheses about the logits of K, plus estimates of the hypothesized effects, were examined by Goodman (1978, Tables 5 to 7) using hierarchical log-linear models. However, it is so far unknown in the literature whether there are specific criteria or schemes of selecting a definite, or tentatively entertained, parsimonious log-linear model. It is thus worth investigating the proposed selection scheme of linear information models, in contrast to the hierarchical log-linear modeling, for the present five-way data that had been much discussed prior to Goodman (1978).

According to Step 1 of Section 3.2, it is found that factor K has the maximal two-way effects with the other variables, which evidences that it was a useful study design. Let factor K be the CV; a saturated information model can be derived according to Step 2 as follows:

I(K, L, N, R, S) = I(K, L) + I(K, N) + I(K, R) + I(K, S) + I(L, N, R, S|K).  (4.1)
Equation (4.1) allows five ways of separating one variable from the other four, and, for the current five-way data, the number of saturated five-way CV models is 720 = 5! × 3 × 2. The possible choices of five-way information models may be examined like the previous discussion of four-way models based on equations (3.1) and (4.1). By Step 4, expanding the compound MI terms in the last equation of the saturated model (4.1) yields the linear information model that exhibits the desired relationship between the CV "Knowledge K" and the other four variables:

I(K, L, N, R, S) = I(K, N) + I(K, R) + I(N, R|K) + I(K, L) + I(L, N|K) + I(L, R|K, N)
                 + I(K, S) + I(N, S|K) + I(R, S|K, N) + I(L, S|K, N, R).   (4.2)

Equation (4.2) yields Table 2, which exhibits the component MI and CMI values of the overall mutual information.

Table 2: Mutual Information (4.2), listing the component MI and CMI chi-squares, with degrees of freedom 8, 4, 4, 2, 2, 2, 1, 1, 1 and 1, and the corresponding p-values.

In Table 2, it is found that all terms are statistically significant at the nominal level of 0.05, except two high-way CMI terms: the five-way CMI term I(L, S|K, N, R) and the four-way CMI I(R, S|K, N). As a supplementary note, it is checked that twice the sample sum of these two CMI terms is insignificantly small, approximately equal to the 65th percentile of the chi-square distribution with 12 d.f. Thus, the selection scheme confirms only a slight reduction of two CMI terms from the saturated equations (4.1) and (4.2), and very little information reduction is possible. This presents a case in which the variables defined in the study are highly associated. To summarize the data analysis, it is notable that the selected information model (4.2) treats the variable (Knowledge, K) as a response to the four variables of information dissemination, and that it includes a four-variable model as part of the whole model.
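The screening behind Table 2 compares each component's deviance, twice the sample size times the plug-in MI or CMI estimate, with a chi-square distribution whose degrees of freedom equal the number of free cells in the component. A minimal sketch of this step (our own illustration with hypothetical counts, not the study data; `deviance_term` is an assumed helper name):

```python
import numpy as np

def plugin_cmi(counts, a, b, cond=()):
    """Plug-in estimate (nats) of I(A, B | C) from a multi-way count array,
    where a, b, cond are disjoint tuples of axis indices."""
    p = counts / counts.sum()
    def H(keep):
        drop = tuple(i for i in range(p.ndim) if i not in keep)
        m = np.asarray(p.sum(axis=drop) if drop else p).ravel()
        m = m[m > 0]
        return -(m * np.log(m)).sum()
    return (H(tuple(a) + tuple(cond)) + H(tuple(b) + tuple(cond))
            - H(tuple(a) + tuple(b) + tuple(cond)) - H(tuple(cond)))

def deviance_term(counts, a, b, cond=()):
    """Deviance 2*N*I-hat for one MI/CMI component, with chi-square d.f.
    (|A|-1)(|B|-1) times the number of conditioning cells."""
    shape = counts.shape
    df = ((shape[a[0]] - 1) * (shape[b[0]] - 1)
          * int(np.prod([shape[c] for c in cond], dtype=int)))
    g2 = 2.0 * counts.sum() * plugin_cmi(counts, a, b, cond)
    return g2, df

# Hypothetical 2^5 table of N = 1000: axis 0 (K) is associated with
# axis 1 (L); axes 2-4 (N, R, S) are uniform and independent of the rest.
p_kl = np.array([[0.4, 0.1], [0.1, 0.4]])
table = 1000.0 * p_kl.reshape(2, 2, 1, 1, 1) * np.full((1, 1, 2, 2, 2), 1 / 8)

g2_kl, df_kl = deviance_term(table, (0,), (1,))             # I(K, L), 1 d.f.
g2_ls, df_ls = deviance_term(table, (1,), (4,), (0, 2, 3))  # I(L, S|K, N, R), 8 d.f.
# g2_kl is large while g2_ls vanishes; a term is retained when its deviance
# exceeds the chi-square critical value at its degrees of freedom.
```

The degrees of freedom reproduce the pattern seen for binary factors: 1 for an unconditional MI term, and 2, 4, 8 as the conditioning set grows.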

5. Concluding Remarks

A short summary of the proposed information models and the selection scheme in Section 3 can be illustrated with a few remarks.

The primary purpose of developing the linear information models is to recommend the natural factorization of the raw data likelihood, directly using the observed data likelihood, but not through the adapted ANOVA of the hierarchical log-linear models. The important advantages of linear information models are based on direct use of the data likelihood identities, as illustrated in Section 3 and exemplified in Section 4. A basic drawback of the latter, the hierarchical log-linear models, is the inevitably crossed and overlapped data information in the summands of the models, which can be especially intricate with high-way tables. Essentially, this drawback invalidates the development of useful selection schemes among the hierarchical log-linear models.

The proposed model selection schemes, either using a CV or not, are naturally born out of the data likelihood, without additional operations on the data. It is the simplest method, based on comparing data deviances of any possible remainder terms against appropriate chi-square distributions. The basic information identities of Section 2 are used to illustrate the advantages of orthogonal information decomposition, for which a three-way case was illustrated in the recent literature (Cheng et al., 2006). Extensions of equations (3.3) and (4.1) to multi-way contingency tables appear to be straightforward, although the number of candidates grows quickly; for the five-way data, the number of all saturated five-way models is 25920.

Given a natural selection criterion, it is understood that no best or uniquely optimal model can be defined whichever selection criterion is used. While useful information model selections still depend on interpreting the data through the choice of a certain CV (or no CV) by the experimenter, the proposed model identification and selection schemes offer a fundamental likelihood analysis with the observed data. It may take further study to define optimal model parsimony and selection, together with some additional selection criteria that would be useful in other statistical applications.

References

Agresti, A. (2002). Categorical Data Analysis, 2nd ed. Wiley.

Bartlett, M. S. (1935). Contingency table interactions. J. Roy. Statist. Soc. Suppl. 2, 248-252.

Berkson, J. (1978). In dispraise of the exact test. J. Statist. Planning and Inference 2, 27-42.

Birch, M. W. (1964). The detection of partial association. J. Roy. Statist. Soc. B 26, 313-324.

Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis. Cambridge, MA: MIT Press.

Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research 1: The Analysis of Case-Control Studies. Lyon: IARC.

Cheng, P. E., Liou, J. W., Liou, M. and Aston, J. A. D. (2006). Data information in contingency tables: A fallacy of hierarchical log-linear models. Journal of Data Science 4, 387-398.

Cheng, P. E., Liou, J. W., Liou, M. and Aston, J. A. D. (2007). Information identities and testing hypotheses: Power analysis for contingency tables. Statistica Sinica, in press.

Cheng, P. E., Liou, M., Aston, J. A. D. and Tsai. Testing association: A two-stage test or Cochran-Mantel-Haenszel test. Technical Report 2007-03, Institute of Statistical Science, Academia Sinica.

Christensen, R. (1997). Log-linear Models and Logistic Regression, 2nd ed. Springer-Verlag.

Claringbold, P. J. (1961). The use of orthogonal polynomials in the partition of chi-square. Aust. J. Statist. 3, 48-63.

Cochran, W. G. (1954). Some methods for strengthening the common chi-square tests. Biometrics 10, 417-451.

Darroch, J. N. (1962). Interactions in multifactor contingency tables. J. Roy. Statist. Soc. B 24, 251-263.

Deming, W. E. and Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist. 11, 427-444.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd. (5th ed., 1934.)

Fisher, R. A. (1962). Confidence limits for a cross-product ratio. Aust. J. Statist. 4, 41.

Goodman, L. A. (1964). Simple methods for analyzing three-factor interaction in contingency tables. J. Amer. Statist. Assoc. 59, 319-352.

Goodman, L. A. (1969). On partitioning chi-square and detecting partial association in three-way contingency tables. J. Roy. Statist. Soc. B 31, 486-498.

Goodman, L. A. (1970). The multivariate analysis of qualitative data: Interactions among multiple classifications. J. Amer. Statist. Assoc. 65, 226-256.

Goodman, L. A. (1978). Analyzing Qualitative/Categorical Data. Cambridge: Abt Associates Inc.

Hagenaars, J. A. (1993). Loglinear Models with Latent Variables. Newbury Park: Sage.

Kullback, S. (1959). Information Theory and Statistics. Wiley.

Lancaster, H. O. (1951). Complex contingency tables treated by the partition of chi-square. J. Roy. Statist. Soc. B 13, 242-249.

Lewis, B. N. (1962). On the analysis of interaction in multi-dimensional contingency tables. J. Roy. Statist. Soc. A 125, 88-117.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. J. Nat. Cancer Inst. 22, 719-748.

Mellenberg, G. J. (1982). Contingency table models for assessing item bias. J. Edu. Statist. 7, 105-118.

Mood, A. M. (1950). Introduction to the Theory of Statistics. McGraw-Hill.

Norton, H. W. (1945). Calculation of chi-square for complex contingency tables. J. Amer. Statist. Assoc. 40, 251-258.

Pearson, K. (1904). Mathematical contributions to the theory of evolution XIII: On the theory of contingency and its relation to association and normal correlation. Draper's Co. Research Memoirs, Biometric Series, no. 1. (Reprinted in Karl Pearson's Early Papers, ed. E. S. Pearson, 1948. Cambridge: Cambridge University Press.)

Plackett, R. L. (1962). A note on interactions in contingency tables. J. Roy. Statist. Soc. B 24, 162-166.

Roy, S. N. and Kastenbaum, M. A. (1956). On the hypothesis of no "interaction" in a multi-way contingency table. Ann. Math. Statist. 27, 749-757.

Simpson, E. H. (1951). The interpretation of interaction in contingency tables. J. Roy. Statist. Soc. B 13, 238-241.

Snedecor, G. W. (1958). Chi-squares of Bartlett, Mood, and Lancaster in a contingency table. Biometrics 14, 560-562.

Woolf, B. (1955). On estimating the relation between blood group and disease. Ann. Human Genetics 19, 251-253.

Yates, F. (1934). Contingency tables involving small numbers and the chi-square test. J. Roy. Statist. Soc. Suppl. 1, 217-235.

Yates, F. (1984). Tests of significance for 2×2 contingency tables (with discussion). J. Roy. Statist. Soc. A 147, 426-463.

Yule, G. U. (1911). An Introduction to the Theory of Statistics. Griffin.

Received April 1, 2007; accepted May 31, 2007.

Philip E. Cheng
Institute of Statistical Science
Academia Sinica
Taipei 115, Taiwan
pcheng@stat.sinica.edu.tw

Jiun W. Liou
Institute of Statistical Science
Academia Sinica
Taipei 115, Taiwan
needgem@stat.sinica.edu.tw

Michelle Liou
Institute of Statistical Science
Academia Sinica
Taipei 115, Taiwan
mliou@stat.sinica.edu.tw

John A. D. Aston
Institute of Statistical Science
Academia Sinica
Taipei 115, Taiwan
jaston@stat.sinica.edu.tw