Journal of Data Science 5(2007), 297-313

Linear Information Models: An Introduction

Philip E. Cheng, Jiun W. Liou, Michelle Liou and John A. D. Aston Academia Sinica
Abstract: Relative entropy identities yield basic decompositions of categorical data log-likelihoods. These naturally lead to the development of information models in contrast to the hierarchical log-linear models. A recent study by the authors clarified the principal difference in the data likelihood analysis between the two model types. The proposed scheme of log-likelihood decomposition introduces a prototype of linear information models, with which a basic scheme of model selection can be formulated accordingly. Empirical studies with high-way contingency tables are exemplified to illustrate the natural selections of information models in contrast to hierarchical log-linear models. Key words: Contingency tables, log-linear models, information models, model selection, mutual information.

1. Introduction Analysis of contingency tables with multi-way classifications has been a fundamental area of research in the history of statistics. From testing hypothesis of independence in a 2 × 2 table (Pearson, 1904; Yule, 1911; Fisher, 1934; Yates, 1934) to testing interaction across a strata of 2 × 2 tables (Bartlett, 1935), many discussions had emerged in the literature to build up the foundation of statistical inference of categorical data analysis. In this vital field of applied statistics, three closely related topics have gradually developed and are still theoretically incomplete even after half a century of investigation. The first and primary topic is born out of the initial hypothesis testing for independence in a 2 × 2 table. From 1960’s utill the 1980’s, Fisher’s exact test repeatedly received criticism for being conservative due to discrete nature (Berkson, 1978; Yates, 1984). Although the arguments in favor of the unconditional tests essentially focused on using the unconditional exact tests in the recent decades, the reasons for preferred p values and sensitivity of the unconditional tests have not been assured in theory. In this respect, a recent approach via information theory proved that the power analysis of unconditional tests is not suitable for

1925) inspired discussions of partitioning chi-squares within the contingency tables (Mood. has only recently been discussed in the literature. the probability at alternate interactions given the observed data. 1982).. 2006). et al. Bartlett (1935) addressed the topic of testing interaction and derived an estimate of the common odds ratio.298 Philip E.. and since then. 2002). it was implicit that an inferential flaw lies in the estimating-equation design of the CMH score test (Woolf. Simpson (1951) and Roy and Kastenbaum (1956) discussed interpretations of interactions and showed that Bartlett’s test is a conditional maximum likelihood estimation (MLE) given the table margins. Analysis of variance (ANOVA. Kullback and Claringbold was remarked by Plackett (1962). Darroch. 1964. 2007). medicine. 1964). These data information identities also indicate that analysis of variance may not be designed and used to measure deviations from uniform association. Goodman. as an extension of the power analysis for a single 2 × 2 table. Consequently. 1959) has been applied extensively in the fields of biology. 1993. Lewis. 1962. 1969. Extending to several 2×2 tables. 1975). 1955. have been widely used in the literature (Hagenaars. Birch. the power analysis at alternatives to the null.. testing independence. 1958. Hierarchical log-linear models were thereby formulated to analyze general aspects of contingency tables (cf. Mellenberg. Cheng et al. Bishop et al. which is the topic of this study to be discussed below.. Goodman. . A drawback of inference with the the test statistics by Lancaster. 1961). it is seen that noncentral hypergeometric distributions cannot determine power evaluations at arbitrary alternative 2 × 2 tables (Fisher. 2005).. For the same data. 1962) and again. It was recently found that a flaw of inference exists with a likelihood ratio test for association (Roy and Kastenbaum. or varied interactions between the categorical variables. 1956) and another for testing interaction (Darroch. 1954. Norton (1945). 1997. A prototype of the linear information models will be formulated below and compared to the hierarchical log-linear model. Christensen. Fisher. A remedy to such statistical inference was recently provided by an analysis of invariant information identities (Cheng. 1950. It inspired in turn the development of loglinear models (Kullback. Mantel and Haenszel. that is. Agresti. 1970. engineering. Claringbold. In addition. 1951. et al. the celebrated CMH test for association (Cochran. and Goodman. Lancaster. but simply for equal and unequal binomial rates. sociology and psychology. 1962. will also be useful for testing hypothesis with high-way contingency tables. the data likelihood identities provide appropriate explanations (Cheng. which are simply defined by likelihood factorizations. In retrospect. However. the long term ambiguous criticism of the exact test was finally proved to be logically vacuous (Cheng et al. 1959. education. Snedecor. 1962). The solution.

.. or zero conditional association. et al. where the notations and parameters defined by ANOVA decompositions may require careful interpretations. The model permits all three pairs to be conditionally dependent. 1970.. k = 1. that is.. remarks on criteria of information models selection are noted for further useful research. XZ). I. Section 4 will provide empirical study of four.. which allow only a few different expressions of log-likelihood equations.2) The full model can be reduced to be a submodel of no three-way interaction. respectively. It is elementary and instructive to discuss the case of three-way tables. Elementary Log-likelihood Equations The representations of data log-likelihood can be formulated in various ways. i j k ij jk ik ijk (2. Denote the joint probability density by fijk (= f (i. k)) = P (X = i.. (2. an elementary scheme of model selection can be formulated with easily justified tests of significance. 2. the prototypes of information models with four-way and high-way tables begin to indicate the essential difference from that of the log-linear models.1 Basic log-linear models Suppose that individuals of a sample are classified according to three categorical variables {X}. no pair is conditionally independent (Agresti. j = 1. which is denoted by (XY. The three-way log-linear models is first reviewed. J. and. Y Z. the corresponding log-linear model is formulated as . . and obvious advantages over the log-linear modeling are easily shown through the information models selection. 1975) is defined as log fijk = λ + λX + λY + λZ + λXY + λY Z + λXZ + λXY Z . j.. 2.. There are numerous ways of decomposing the data likelihood with high-way tables. 2002). Next.Linear Information Models 299 The basic log-linear models in three variables will be reviewed in Section 2.. depending on the methods of likelihood decomposition. which may differ from the log-linear models only in some representations. In conclusion. Comparisons between the log-linear models and the proposed information models will be discussed.1) where ijk fijk = 1. In Section 3. {Y }.. which have been analyzed with log-linear models in the literature. . information identities of three-way tables are fully discussed to characterize the corresponding information models. {Z} with classified levels: i = 1. Z = k). Bishop. in particular.and five-way tables. The full (saturated) log-linear model (Goodman. K. . Y = j.

in order to understand the intrinsic difference between dependent and independent likelihood decompositions. i j ij k jk (2. zero mutual information (Cheng et al. say. and written as log fijk = λ + λX + λY + λZ + λXY . denoted (XY. which indicates an intrinsic difference in the interpretation of these log-linear models. It is obvious that the one-way and two-way parameters in each of the models (2. are scaled log-likelihood ratios. log fijk = λ + λX + λY + λZ + λXY + λY Z + λXZ .4) is however not derived from (2. {X. i j k (2.2) by deleting the conditional association between {X. Equation (2. Equation (2.3).12) below.300 Philip E. The parameters in the log-linear models.2) by checking the magnitude of the three-way interaction using the iterated proportional fitting (Deming and Stephens. and expressed as log fijk = λ + λX + λY + λZ . Z).4) to (2. Z). but directly computed from (2. Z} given Y . 2006).6) Indeed. Y. Basic identities of data likelihood are examined. Such a desirable property is lost between the two-way and three-way terms in the models (2. and a prototype of linear information models in three variables will be considered.2 Basic information models It may be useful to take a different look at the formulation of hierarchical loglinear models. Y } and Z. i j ij k (2.6) defines the mutual independence between the three factors. i j ij k jk ik (2.. or the logarithmic odds ratios. model (2.4) reduces to independence between {X. exactly one pair of factors is conditionally independent given the third factor. (2.2) and (2.1) . Define the marginal probabilities as J K K fi = fi·· = j=1 k=1 fijk . Z} given Y . Cheng et al. fij· = k=1 fijk . This will be further explained using equation (2.5) The final reduction to three pairs of conditional independence is denoted by (X.3). 2.3) obtains from (2. Z} given Y .4) With two pairs of conditionally independent factors. equation (2. then the model is expressed as log fijk = λ + λX + λY + λZ + λXY + λY Z . besides the normalizing constant λ. which is the sum of the unique three-way interaction and the conditional uniform association between {X. that is.3) In case. 1940).6) are statistically independent using disjoint data information.

which are the three possible saturated information models for a three-way table. Y ) + I(X. Z) + I(X. IJK− (I +J +K)+2 under the null hypothesis of mutual independence.j. are defined in terms of the observed cell frequencies. i. let nijk denote the number of individuals classified to the cell {X = i. The entropy identity in three variables is expressed as H(X) + H(Y ) + H(Z) = H(X. the sample entropy and the sample mutual information. 2006. z)||f (x)g(y)h(z)). Z|X) = I(X. I(X.j. Y ) = − fijk log fijk i. For example. (2. fi·k . Z) + I(X. Z) = i. Y ) + I(Y . Z) + I(Y .k H(X. The sample analogs of the entries in (2.j. the mutual information. (2.k nijk n2 ni nj nk . Y. Z) . Z) + I(Y . Z). ni = ni·· . Y |Z).9). Z) yields the likelihood ratio test statistic nijk log i. and the mutual information are defined respectively to be H(X) = − i fi log fi . Z) = − I(X. Y. Z) + I(X. Some convenient terminology will be borrowed from our previous study (Cheng. fj . Y .j H(Z) = − k hk log hk . y.Linear Information Models 301 and analogously. Y .f.9) The last entry of (2. Y . Y . f·jk . fij· log fij· . Z|Y ) = I(X. Also. Z) = I(X. respectively. et al.9).10) such that twice of (2. Y = j. H(Y ) = − j gj log gj .. denote the total cell frequency and marginal totals.10) is asymptotically chi-squared distributed with d. and similar notations n = n···. Y.k fijk log fijk fi gj hk = D(f (x. H(X. joint entropy. which are the natural maximum likelihood estimates under the general model of multinomial distribution. Section 2). (2. and nij· . the sample analog of the mutual information I(X. Z = k}. is the Kullback-Leibler divergence between the joint density and the product marginal density of (X. It is easily seen that data log-likelihood admits three equivalent information identities.11) . (2. fk .8) where the marginal entropy.

is usually expected to be significantly large.325). Int(X. characterizes exactly the three-way term λXY Z of (2. Z||Y ).12). Y ) + I(Y . with a rare chance it may happen that both Int(X. Z||Y ) are insignificantly small.12) form a natural two-step likelihood ratio (LR) test. has been a much disucussed topic in the literature. where the first step LR statistic. across the levels of Y . is rejected. Z) of (2. if the test for no interaction. Y . 2002. Cheng et al. Z||Y ). Z|Y ) is insignificantly small..3). Being saturated models. there is no need to perform the second step test. Z|Y ) = Int(X. that is also consistent with the significant omnibus test. the Breslow-Day test (1980) and the CMH test (1954. p. I(X. are close to 1. across K strata of Y .12) that four combinations of significance and insignificance patterns between the three LR tests . Z|Y ) of (2. If the omnibus CMI I(X. Z||Y ) characterizes the uniform association between X and Z. then the second-step test statistic I(X. it is theoretically rare that the two-step test is not sufficiently sensitive. the second step is in use only when the first step is insignificant. and. Y .11) are different expressions of the log-linear full model (2.2).12).11) may be reduced to the sum I(X. It is understood that such significant or insignificant tests are defined with a common fixed level of significance for each approximating chi-square distribution. Y . also called the generalized CMH test (Agresti. 2007).12) is removed (equal to zero). Z) and I(X.4). Int(X. in which the two independent summands. it is logically consistent with the significant omnibus test. Y . If the interaction is insignificantly small.2). with four possible combined cases. it is insufficient to use only the second-step test for testing no association between X and Z. or Pearson’s chi-square test. Likewise. Z) in the equations. the conditional mutual information (CMI) I(X. Y . Here. Z||Y ). and the latter expression is unique up ijk to the normalizing constant λ. I(X. and the first equation of (2. Z) and I(X. It was recently proposed that the two summands of (2. it follows from (2. Otherewise. Z). across Y . Int(X. Z) = 0. then the conditional odds ratios between X and Z. and. for example. The remaining terms are not the same. within strata of Y (Cheng et al. Int(X. 1959). Logically. and. Z) + I(X. Y . In this case. (2. this CMI is significantly large. whereas the omnibus CMI is marginally significant. The unique three-way interaction. All three equations are reduced to model (2. within the levels of Y. and the second. The special case of 2 × 2 × K table. the equations of (2. The two-step LR test for no association across strata Y is shown to be asymptotically unbiased and more powerful than the one-step omnibus LR test. the simplest three-way table.302 Philip E. Z) which is model (2. and clearly. tests for no interaction between X and Z. Y . Z||Y ). say. tests for no association between X and Z. the two-step LR test improves over the combination of the score tests. I(X. can be individually significantly large or small. if the common interaction term Int(X.

the use of such a CV is always applicable and particularly useful.Linear Information Models 303 are meaningful in practice.. It is plain that both formulae of data log-likelihood decompositions are built upon the notion of mutual information.11) that a common variable appears in each term of the decomposed three-way mutual informtion. which are coined information models.1 A four-way information structure It is observed from equation (2. conditioned on the remaining t − 2 variables. 3.2) to (2. and it is used as the conditioning variable (termed CV). conditioned upon other t − 2 variables. Z) + I(W . When any variable is of primary interest. (K − 2) three-way (CMI) effects. that equations (2. Information Structures of High-Way Tables The natural extensions of equation (2.4) as mentioned above. like the response variable in a linear regression model.11) consist of two major formulae for each saturated model of a K-way contingency table.11) to high-way contingency tables. however. two (K − 1)-way CMI.11) to high-way contingency tables will enhance interpretations of high-way effects and manifest further differences from the conventional log-linear models. These information with varied statistical inference are not directly revealed between models (2. Z). the analog of (2. Y . The key point of fact. . The goal of this study is to make a natural and systematic extension of equation (2. Subsequently. (K − 1) two-way (MI) effects.. Y. X. and one K-way CMI.8) is the entropy identity H(W ) + H(X) + H(Y ) + H(Z) = H(W.1) . a prototype of the linear information model. The primary formula is that the log-likelihood decomposition is the sum of K main effects. 3. it is used as a CV for finding its relations to the remaining variables. defined by simple criteria of model selection. not in terms of ANOVA as defined with the log-linear models.11) disallow the inclusion of all three two-way terms. X. This is not necessarily required of a four-way or a high-way table. The difference between the proposed information models and the log-linear models will be illustrated using four-way and five-way data tables that have been analyzed in the literature. will be introduced to offer advantage of statistical inference over the conventional log-linear model. In case of four variables. is not clarified in the discussion of hierarchical log-linear models.. (3. A t-way CMI is always a sum of the t-way interaction and the uniform association between the pair of variables. The secondary formula is that each t-way CMI (t ≤ K) is a mutual information between a pair of variables. It is perceivable that extensions of identity (2.

Y ) + I(W . then either a different identity of (2. Z).304 Philip E. X. or. or another identity such as I((X. Y ) = I(W . Z|X) + I(W . twice the sample analog of I(W . either because it is a variable of focus. conditioned upon the chosen CV.11) may be used to express I(W . A selection scheme based on equation (3. Y ).4). X. X. Y ) + I(X.2) using a CV is first outlined below. Y ). Y ) + I(W . Y ) = I(W . Y ) + I(Y . Z) + I(Y . Z) + I(W . Y ). Y ) + I(X. respectively.2) The last equation of (3. X. 3. the total number of saturated information models in four variables would be 72 = (4!) × 3. the three-way CMI terms. Z|X. say Y . Cheng et al. Z) = I(W . (3. exactly one of three CV models may be used according to equation (2. Y ). Z) + I(X. X|Y ) + I(Y . If both CV and non-CV models are included. Y ). The question is how to select a meaningful and parsimonious model among those seventy-two candidate linear information models. Z) + I(Y . And. Z|X. Y |W ) + I(X. For example. Z|X. for a fixed nominal level) sum of three two-way effects. X. (3. . Z|Y ) + I(W .2 Elementary model selection schemes Without loss of generality.2). Y ) + I(X. Z) + I(Y . Y. X. an analog of (2. say. in the second equation of (3.3) It is worth noting with these saturated models that all the variables must appear at least once in the main effects. Y ) + I(X.2) follows by using (2. Z) = I(X. among the four variables {W. For a threeway contingency table. Step 1 : Select the CV. Z|X) + I(W . Z) = I(W . X. and a specified four-way CMI. the information model selection scheme is illustrated with a four-way contingency table. and also among the two-way MI terms. X) + I(W . the last statement simply leads to different non-CV saturated models such as I(W . Y. Z) = I(W . which yields 24 distinct saturated CV models. This is organized as a four-step procedure for ease of exposition. Z|X. Y ). Y ) + I((W. and. Y ) + I((X. Y yields the maximal significant (in p value. Z|X) can be used.11) among many equivalent identities may be expressed as I(W . If there is no special CV of interest. Y . Step 2 : Find the maximal insignificant (in p value) among the insignificant four-way CMI (between certain two variables. there are exactly six distinct saturated models for each fixed CV. Y . X|Y ) + I(X. for a four-way table. Z}.11) with the variable Y as the CV.

second-minimal significant) CMI. a saturated CV information model selection in the four-variable case is essentially complete. the same selection procedure is continued with the renewed model modified in Step 2. Suppose the chosen fourway CMI is I(W . a minimal significant) high-way CMI term can be replaced by its next term. Steps 3 and 4 are kept unchanged. Since the three two-way (MI) terms have already been selected in Step 1.f. To conclude a parsimonious model selection after performing the above three steps. X|Y ). Z|X. If there is no need to fix a CV of particular interest. though not necessary. when the three available four-way CMI are all significantly large.2). (2. choose the minimal significant. The sum of all the insignificantly (small in p value) terms. because it can always choose the next insignificant (significant) term whenever needed. while modifying the information decomposition. each variable must appear at least once among the two-way terms of a saturated model before parsimonious selection. it is expected that such a general scheme often results in selecting a non-CV model. an interaction term and a uniform association term (cf. A maximal insignificant (or. the second-maximal insignificant (or.. Thus. 0. from high-way to low-way subtables. the remainder is significantly large (near or over a 95th percentile of the associated chi-square distribution) and the tentative model may be lack of sufficient information. are directly obtained by reading the appeared variables from the chosen four-way CMI.3). Step 3 : To confirm the selected MI and CMI terms. Step 4. to the same nominal level. it is often of extra interest. among the ten terms. whether a common CV is in use or not. and. Y ). then the tentative remainder is insignificantly small and deleted as an insignificant residual.2) . then. each one of the ten selected terms (including four main-effect terms) must be individually and separately tested against the same nominal level.12)). the new remainder is likewise tested to be a negligible residual. say. Step 1 is bypassed. to yield a more parsimonious model. as shown in Step 3. under the same nominal level.05 in common practice. This remedy as a supplementary scheme is easily used in practice. and Step 2 is generalized without requiring a fixed CV throughout the selection scheme. It is understood that a general scheme may select either a CV model (3. Otherwise. If this test is insignificant. to test against the summands of each selected CMI term.Linear Information Models 305 and the remaining variable). there are alternate ways of selecting a model like (3. I(W . Finally. a simple remedy is recommended. is taken as the tentative remainder which is then tested against the total (chisquare) d. This step is asymptotically correct by the orthogonal likelihood decomposition. and then. or a non-CV model (3. X|Y ) and I(Z. On the other hand. more balanced in selecting the variables among the CMI terms. In this case. so that the tentative model is accepted. Otherwise. In other words. the two candidate three-way CMI.

These alternatives to Step 2 may sometimes yield less high-way CMI terms compared to the original Step 2.3) is to delete more high-way CMI terms. Cheng et al. This will be illustrated below in Section 4. and law enforcement (L). selecting a minimal insignificant high-way CMI term may be preferred. which approximately equals to the 35th percentile of the chi-square distribution with 48 d.or four-step model selection scheme can be easily extended to high-way contingency tables. compared to the deletion of high-way interaction terms as a common practice in the selection of a log-linear model. Applications to High-Way Tables Four-way and five-way contingency data tables in the literature will be examined.2. they usually lead to selecting more terms at end. Readers may refer to subsequent estimation of odds ratios parameters to the log-linear model fitting with all two-way effects. particularly with high-way tables. fitting the data to the summary of the four main effects and all the six two-way effects (Agresti. defined with the variables: environment (E).20).f. The above three.1. that is.306 Philip E. The basic analysis of this data by log-linear modeling leads to the accepted model: deletion of the four-way interaction and all the threeway effects. As an alternative choice in Step 2. 2002. 2002. Table 8. However. Table 8. and sacrifices model parsimony. Example 4. according to Step 1 of the selection scheme as illustrated in Section 3. otherwise. An application to five-way data table will also be illustrated in Section 4.2) and (3. health (H).3). It is entertaining that the proposed method yields easily interpretable and more parsimonious models compared to those obtained from the hierarchical log-linear modeling. or (3. in particular.67. if any. It is so fitted with the loglinear modeling because the deviance between this fitted model and the full model is evaluated to be 31. Example 4. “about right” and “too much”. The information modeling begins with finding the most relevant variable to be the CV. big cities (C). The proposed information modeling and the four-step model selection scheme of Section 3 will be applied to both four-way and five-way data below. 4.19) is exemplified for a comparison study. It is remarkable that the principal idea in formulaing the models (3.1: A four-way contingency table A 3 × 3 × 3 × 3 four-way data frequency table (Agresti. select the minimal (or maximal) significant high-way CMI term when all such CMI terms are significantly large. These 81 cell frequency counts are listed according to the three levels: “too little”. The data consist of three-level individual’s opinions on each of four variables of government spending. It is found that the variable health (H) yields the greatest significant sum .

f. can be computed along with the selection scheme. for which the remainder deviance is 77.f. may not be considered as useful CV. plus the two-way effects {CE. plus the four main effects”. EH}”.Linear Information Models 307 of two-way effects. including interval estimates.f. For brevity. the deviances 2I(C.2).59. it can be = = easily checked that the deviance. these calculations are not discussed here for each selected information models. and the readers may contact the authors for details. instead of all the six two-way terms as used in the selected log-linear model (Agresti. This exhibits a basic advantage of information modeling that it usually selects a more parsimonious model with more concise and simpler explanation compared to the classical log-linear modeling. close to the 88 percentile of the chi-square (64 d. are the two-way effects. By Step 4.2) and the same selection scheme that a similar model is selected: “the main effects. the fitted log-linear model of all two-way effects. for which the remainder deviance is 81. In case no particular variable is fixed as CV. then. using Y = H and X = E. which is 61.41.3) and the same selection scheme (omitting Step 1) would lead to a non-CV model: “{CE. In case another variable is of interest. . This provides a similar CV model selection and interpretation. say. EH}”.98. which is close to the 76th percentile of the chi-square distribution with 64 d.= 16.87 to the chi-square distribution with 12 d. with the total fitted d.2) is tentatively concluded with the information model: “four main effects plus the two-way effects {CH. H) ∼ 28. It can be easily checked from this four-way data that the other two categories. A complete selection scheme in accordance with equation (3. that is only half of 32. This provides another equally parsimonious information model in which all the four variables share the two-way effects together in a pair of two-way terms. the variable (E) can be taken as the CV. H) ∼ 24. close to the 93. Thus. because the selected models will include at least one three-way CMI term. Environment (E).18.f. HL}. the sum of the residual insignificant CMI’s.f.2) for the four-way table is summarized in Table 1 below. it is checked by equation (3. It also follows by equation (3. which may be closely related to budget spending and worth an investigation. 2002). The (conditional) odds ratio parameters and the deviances of the above three selected information models. is 71. It then easily follows by Steps 2 and 3 to find that the only significant terms in the information equation (3.) distribution. in addition to the two-way terms and main effects. a prototype of information model selection by equation (3.74 and 2I(E.5 percentile of the chi-square distribution with 64 d. cities (C) and law enforcement (L).

55 0. R)) + I(K.001 0. S) = I(L. Tables 5 to 7) using hierarchical log-linear models. it is found that factor K has the maximal two-way effect with the other variables. denoted by K) of cancer. S) + I(L. (K. R) = I(L. it is so far unknown in the literature whether there are specific criteria or schemes of selecting a definite.2: A five-way contingency table A data of cross-classification of individuals according to five dichotomized factors had been studied in six publications prior to Goodman (1978. N . p. a saturated information model can be derived according to Step 2 as follows. R. and the presence or absence of the other four qualitative attributes: L(= lecture). R(= radio). R)) + I(K. R|K) + I(K. (K. in contrast to the hierarchical log-linear modeling. Let factor K be the CV. L) I(E.99 8. N.95 24. N ) + I(K. for the present five-way data that had been much discussed prior to Goodman (1978). R. R) + I(L. N (= newspaper).77 7. H) chi-squares 34. S|K. and S(= solid reading). N |K. N. However. R.78 0. (4. R.308 Philip E. or tentatively entertained.1) . R. N. for the current five-way data. S|K. S) + I(L. (K. H) I(C.2. parsimonious loglinear model. S|K.18 28. Cheng et al. (K. R) + I(L.2) MI. R) = I(L. which evidences that it was a useful study design. R|K) + I(K. S|K. R). S) + I(N . I(K. (K. R. N ) + I(S. 36 12 12 4 4 4 p-values 0. N |K. L|H) I(H.74 d. E|H) I(E. Table 1). Table 1: Mutual Information (3. and hypotheses about the logits of K. (K. The purpose was to understand the association relationship between the knowledge (good or poor. S) + I(L. S) = I(L. R)) + I(R. L|E. According to Step 1 of Section 3.112.f.07 0. R. L) + I(R.07 0. H) I(C. N |K. N )) + I(N . The factor of primary interest is the “knowledge K”. It is thus worth investigating the proposed selection scheme of linear information models. N.90 19. CMI \ values I(C. N ) + I(N . L. S|K) + I(K. S)) + I(S.001 Example 4. S)) + I(K. plus estimates of the hypothesized effects were examined by Goodman (1978.

R) I(L. and. L) I(K.610 < 0. R.001 < 0.001 < 0. S). the five-way CMI term I(L.001 < 0. the selection scheme confirms only a slight reduction of two CMI terms by equations (4.78 24.25 150. S) ∼ I(L. S|K. N |K. except a few insignificant terms. S).2) MI.001 As a supplementary note. although information dissemination by model (4.2) For the five-way data.2) and (3.230 0. S|K. N ) I(K. R). N )+I(L.6” is insignificantly small. and the four-way CMI I(R.2) treats the variable (Knowledge.3). R. S) chi-squares 10. K) as a response variable. S|K) + I(K. This yields the linear information model that exhibits the desired relationship between the CV “Knowledge K” and the other four variables: I(K.001 < 0. To summarize the data analysis.001 < 0. R) + I(L. L. S) = + I(N . N |K.001 < 0.1). R. which includes a four-variable model (3. it is notable that the selected information model (4.89 58. S|K) I(N .58 2.16 105. the possible choices of five-way information models may be estimated like the previous discussion of four-way models based on equations (3. CMI \ values I(L.Linear Information Models 309 In the last equation of the saturated model (4.96 17. S|K.001 < 0. R. Thus. R) I(K. 8 4 4 2 2 2 1 1 1 1 p-values 0. N ) + I(N .f.45 d. L) + I(K.2). N ) I(L. the number of saturated five-way CV models is 720 = 5! × 3 × 2. By Step 4. except two high-way CMI terms. R|K) + I(K. N . Table 2: Mutual Information (4. Equation (4. R|K) I(K. . (4. approximately equal to the 65th percentile on the chi-square distriubtion with 12 d. N ).1) allows five ways of separating one variable from the other four. it is checked that “twice the sample sum of I(R. in which most are significantly large. R|K) + I(K.56 190.05.2) yields Table 2. R|K) I(N . it is found that all terms are statistically significant at the nominal level of 0. and very little information reduction is possible.2) as part of the whole model. Table 2 exhibits the component MI and CMI values of the overall mutual information.72 20. S|K.1) and (4. or 26.48 15. N |K. S|K. S) I(R.1) and (4. This presents a case that the variables defined in the study are highly associated.f.

248-252. The proposed model selection schemes. S. but not through the adapted ANOVA. 2nd ed. such as the current proposal. (2002). J. Categorical Data Analysis. without additional operations on the data. A basic drawback of the latter. The important advantages of linear information models are based on direct use of the data likelihood identities as illustrated in Section 3 and exemplified in Section 4. 5. Contingency table interactions. It is the simplest method based on comparing data deviances of any possible remainder terms against appropriate chi-square distributions. Extensions of equations (3. Essentially. Statist. the proposed model identification and selection schemes offers a fundamental likelihood analysis with observed data. it may take further study to define optimal model parsimony and selection.. are naturally born out of the data likelihood. In dispraise of the exact test. can be especially intricate with high-way tables. the number of all saturated five-way models is 5! × (4! × 3) = 25920. and. together with some additional selection criteria that would be useful in other statistical applications. the disadvantage of inevitably crossed and overlapped data information in the summands of the log-linear models. (1978). Bartlett.3) and (4. 2. it is understood that no best or uniquely optimal model can be defined whichever selection criterion is used. The basic information identities of Section 2 are used to illustrate the advantages of using orthogonal information decomposition. for which a three-way case was illustrated in the recent literature (Cheng et al. the classical approach invalidates the development of useful selection schemes among the hierarchical log-linear models. Thus. (1935). .2). Planning and Inference. Roy. Given a natural selection criterion. B 2. either using a CV or not. Statist. Wiley. J. Cheng et al. (3. Concluding Remarks A short summary of the proposed information models and the selection scheme in Section 3 can be illustrated with a few remarks. M. 2006). References Agresti. While useful information model selections still depend on interpreting the data through the choice of certain CV (or no CV) by the experimenter.1) to multi-way contingency tables appear to be straightforward. J. Soc. The primary purpose of developing the linear information models is to recommend the natural factorization of the raw data likelihood.310 Philip E. Berkson. A. directly using the observed data likelihood. 27-42.

E. and Day. Soc. M. E. 319-352. Aust. (1975). Fisher. Testing association: A two-stage test or Cochran-Mantel-Haenszel test. J. Statist. E. MA: MIT Press. Analyzing Qualitative/Categorical Data. (2006). Fisher. Goodman. Cheng. M. E. Statist. Data information in contingency tables: A fallacy of hierarchical log-linear models. 11. Goodman. (1997). Roy. Cambridge: Abt Associates Inc. and Aston. and Holland. A.Linear Information Models 311 Birch. Some methods for strengthening the common chi-square tests. 4. (1980). J. Biometrics 24. J. Aston. (2007). in press. R. N. Liou. Roy.. Australian Journal of Statistics. W. J. J. Newbury Park: Sage. The use of orthogonal polynomials in the partition of chisquare. J. P. W. L. Liou. (1978). 59. 1. Deming. M. (1993). R. (2005). 486-498. (1970). A. J Amer. Oliver & Boyd. Lyon. W. Technical Report 2007-03. F. On partitioning and detecting partial association in three-way contingency tables. (1940). Loglinear Models with Latent Variables. A. 387-398. M. Assoc. P. Christensen. S. L. (1961). Information identities and testing hypotheses: Power analysis for contingency tables... Liou. Goodman. Amer. Log-linear Models and logistic regression. 41-41. Fienberg. Cambridge. J. Statist. J. B 31. L. M. A.. 315-327. B 24. Claringbold. Goodman. Y. J. Ann. Academia Sinica. Discrete Multivariate Analysis. W. Statist. Statistical Methods in Cancer Research. Bishop. A. J. 251-263. Statist. E. 48-63. R. The Analysis of Case-Control Studies. Cheng. L. J. 226-256. M. Arthur C. (1962). (1954). (1925) (5th ed. Roy. 313-324. and Stephan F. A. The multivariate analysis of qualitative data: Interactions among multiple classifications. 4. Math.. (1969). Darroch. E. Liou. B 26. 427-444. Assoc. N. J. Statist. 3. E. Cheng. Institute of Statistical Science. Soc.. W. Springer-Verlag. Statistical Methods for Research Workers. 1934). P.. Soc. G. and Tsai. and Aston. Cochran. P. 65. (1964). (1964). N. Confidence limits for a cross-product ratio. (1962). Statist. Journal of Data Science. Hagenaars. The detection of partial association. . Simple methods for analyzing three-factor interaction in contingency tables. A. Statistica Sinica. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Interactions in multifactor contingency tables. Breslow. P.

Cancer Inst. G. 162-166. A 125. Mellenberg. Yule. Snedecor. J. G. A 147. J. 2007. H. Royal Statist. Roy. Statist. Chi-squares of Bartlett. Contingency tables involving small numbers and the test. J. 749-757. J. H. Mathematical contributions to the theory of evolution XIII: On the theory of contingency and its relation to association and normal correlation. Math. . E. M. J. ed. J. Statist. N. Roy. Lancaster. and Lancaster in a contingency table. Roy.. An Introduction to the Theory of Statistics. Soc. (1982). J. (1959). J. B 24. Nat. 2007. 1948. Calculation of chi-square for complex contingency tables. B 13. Pearson. Woolf. (1959). 22. Ann. N. Information Theory and Statistics. Statist. 105-118. and Haenszel. Received April 1. Assoc. Mood. (1956). Ann. Pearson. (1950).) Plackett. E. 251-258. 217-235. (1911). 242-249. Statist. 238-241. 7. Yates. McGraw-Hill. A. no. W. Introduction to the Theory of Statistics. F. Statist. (1958). Soc. Soc. Simpson. Soc. (1984). M. Tests of Significance for contingency tables (with discussion). Research Memoirs. Kullback. (1945). Mood. (1962). L. (1962). R. (1904). 88-117. B. G. and Kastenbaum. Wiley. Roy. (1951). 40. Contingency table models for assessing item bias. accepted May 31. 251-253. U. Griffin. On the hypothesis of no “interaction” in a multi-way contingency table. 27. (1934). Draper’s Co. J.312 Philip E. H. Roy. Human Genetics 19. A. S.. S. W. The interpretation of interaction in contingency tables. Complex contingency tables treated by the partition of chisquare. 560-562. Soc. W. Biometrics 14. Soc. Cheng et al. 426-463. S. On estimating the relation between blood group and disease. Statistical aspects of the analysis of data from retrospective studies of disease. A note on interactions in contingency tables. Royal Statist. (1951). J. (Reprinted in Karl Pearson’s Early Papers. Yates. B. Lewis. Mantel. Statist. Statist. B 13. F. Suppl. 1. K. Edu. On the analysis of interaction in multi-dimensional contingency tables. Norton. (1955). O. Biometric Series. 1. Amer. N. Cambridge: Cambridge University Press. 719-748.

Cheng Institute of Statistical Science Academia Sinica Taipei. Taiwan needgem@stat.edu.sinica. 115. John A. D. Taiwan mliou@stat. Liou Institute of Statistical Science Academia Sinica Taipei. 115.tw Jiun W. Taiwan jaston@stat.tw Aston.edu.tw 313 . 115.Linear Information Models Philip E.edu. Institute of Statistical Science Academia Sinica Taipei.edu. Taiwan pcheng@stat. 115.sinica.sinica.tw Michelle Liou Institute of Statistical Science Academia Sinica Taipei.sinica.

Sign up to vote on this title
UsefulNot useful