
Behaviormetrics: Quantitative Approaches to Human Behavior 3

Nobuoki Eshima

Statistical Data Analysis and Entropy
Behaviormetrics: Quantitative Approaches
to Human Behavior

Volume 3

Series Editor
Akinori Okada, Graduate School of Management and Information Sciences,
Tama University, Tokyo, Japan
This series covers in their entirety the elements of behaviormetrics, a term that
encompasses all quantitative approaches of research to disclose and understand
human behavior in the broadest sense. The term includes the concept, theory,
model, algorithm, method, and application of quantitative approaches from
theoretical or conceptual studies to empirical or practical application studies to
comprehend human behavior. The Behaviormetrics series deals with a wide range
of topics of data analysis and of developing new models, algorithms, and methods
to analyze these data.
The characteristics featured in the series have four aspects. The first is the variety
of methods utilized in data analysis, including newly developed methods: not only
standard or general statistical methods and psychometric methods traditionally used
in data analysis, but also cluster analysis, multidimensional scaling, machine learning,
correspondence analysis, biplot, network analysis and graph theory, conjoint
measurement, biclustering, visualization, and data and web mining. The second
aspect is the variety of types of data, including ranking, categorical, preference,
functional, angle, contextual, nominal, multi-mode multi-way, continuous, discrete,
high-dimensional, and sparse data.
The third comprises the varied procedures by which the data are collected: by
survey, experiment, sensor devices, and purchase records, and other means. The
fourth aspect of the Behaviormetrics series is the diversity of fields from which the
data are derived, including marketing and consumer behavior, sociology, psychol-
ogy, education, archaeology, medicine, economics, political and policy science,
cognitive science, public administration, pharmacy, engineering, urban planning,
agriculture and forestry science, and brain science.
In essence, the purpose of this series is to describe the new horizons opening up
in behaviormetrics—approaches to understanding and disclosing human behaviors
both in the analyses of diverse data by a wide range of methods and in the
development of new methods to analyze these data.

Editor in Chief
Akinori Okada (Rikkyo University)

Managing Editors
Daniel Baier (University of Bayreuth)
Giuseppe Bove (Roma Tre University)
Takahiro Hoshino (Keio University)

More information about this series at http://www.springer.com/series/16001


Nobuoki Eshima

Statistical Data Analysis and Entropy
Nobuoki Eshima
Center for Educational Outreach
and Admissions
Kyoto University
Kyoto, Japan

ISSN 2524-4027 ISSN 2524-4035 (electronic)


Behaviormetrics: Quantitative Approaches to Human Behavior
ISBN 978-981-15-2551-3 ISBN 978-981-15-2552-0 (eBook)
https://doi.org/10.1007/978-981-15-2552-0
© Springer Nature Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

In modern times, various kinds of data are gathered and accumulated in all fields of
scientific research, in practical business affairs, in government, and so on. Computing
power and data analysis methodologies have been advancing, extending our capacity
to process large and complex data. Statistics provides researchers and practitioners
with useful methods for handling data for their purposes. Modern statistics dates back
to the early 20th century. Student's t statistic by W. S. Gosset and the basic ideas of
experimental design by R. A. Fisher have had a great influence on the development
of statistical methodologies. On the other hand, information theory originates from
C. E. Shannon's 1948 paper, "A Mathematical Theory of Communication". The theory
is indispensable for measuring the uncertainty of events from information sources and
systems of random variables, and for processing data effectively from the viewpoint
of entropy. Nowadays, interdisciplinary research domains have been increasing in
order to promote novel studies that resolve complicated problems. The common tool
of statistics and information theory is "probability", and the common aim is to deal
with information and data effectively. In this sense, both theories have similar scopes.
The logarithm of a probability is negative information; the logarithms of odds and
odds ratios in statistics are relative information; and the log-likelihood function is
asymptotically a negative entropy. The author is a statistician and takes the standpoint
of statistics; however, there are problems in statistical data analysis that cannot be
resolved within the conventional views of statistics. In such cases, perspectives based
on entropy may provide good clues and ideas for tackling these problems and for
developing statistical methodologies. The aim of this book is to elucidate how most
statistical methods, e.g., correlation analysis and the t, F, and χ² statistics, can be
interpreted in terms of entropy, and to introduce entropy-based methods for data
analysis, e.g., entropy-based approaches to the analysis of association in contingency
tables and path analysis with generalized linear models, that may be useful tools for
behaviormetric research. The author hopes that this book motivates readers, especially
young researchers, to grasp that "entropy" is a useful tool for dealing with practical
data and research themes.

Kyoto, Japan Nobuoki Eshima


Acknowledgements

I am grateful to Prof. A. Okada, Professor Emeritus of Rikkyo University, for giving
me the opportunity and encouragement to write the present book. I would like to
express my gratitude to Prof. M. Kitano, Professor Emeritus and Executive Vice
President of Kyoto University, for providing precious time and an excellent
environment to support the present work. Lastly, I deeply thank an anonymous
referee for valuable comments that improved the draft of the present book.

Contents

1 Entropy and Basic Statistics
  1.1 Introduction
  1.2 Information
  1.3 Loss of Information
  1.4 Entropy
  1.5 Entropy of Joint Event
  1.6 Conditional Entropy
  1.7 Test of Goodness-of-Fit
  1.8 Maximum Likelihood Estimation of Event Probabilities
  1.9 Continuous Variables and Entropy
  1.10 Discussion
  References
2 Analysis of the Association in Two-Way Contingency Tables
  2.1 Introduction
  2.2 Odds, Odds Ratio, and Relative Risk
  2.3 The Association in Binary Variables
  2.4 The Maximum Likelihood Estimation of Odds Ratios
  2.5 General Two-Way Contingency Tables
  2.6 The RC(M) Association Models
  2.7 Discussion
  References
3 Analysis of the Association in Multiway Contingency Tables
  3.1 Introduction
  3.2 Loglinear Model
  3.3 Maximum Likelihood Estimation of Loglinear Models
  3.4 Generalized Linear Models
  3.5 Entropy Multiple Correlation Coefficient for GLMs
  3.6 Multinomial Logit Models
  3.7 Entropy Coefficient of Determination
  3.8 Asymptotic Distribution of the ML Estimator of ECD
  3.9 Discussions
  References
4 Analysis of Continuous Variables
  4.1 Introduction
  4.2 Correlation Coefficient and Entropy
  4.3 The Multiple Correlation Coefficient
  4.4 Partial Correlation Coefficient
  4.5 Canonical Correlation Analysis
  4.6 Test of the Mean Vector and Variance–Covariance Matrix in the Multivariate Normal Distribution
  4.7 Comparison of Mean Vectors of Two Multivariate Normal Populations
  4.8 One-Way Layout Experiment Model
  4.9 Classification and Discrimination
  4.10 Incomplete Data Analysis
  4.11 Discussion
  References
5 Efficiency of Statistical Hypothesis Test Procedures
  5.1 Introduction
  5.2 The Most Powerful Test of Hypotheses
  5.3 The Pitman Efficiency and the Bahadur Efficiency
  5.4 Likelihood Ratio Test and the Kullback Information
  5.5 Information of Test Statistics
  5.6 Discussion
  References
6 Entropy-Based Path Analysis
  6.1 Introduction
  6.2 Path Diagrams of Variables
  6.3 Path Analysis of Continuous Variables with Linear Models
  6.4 Examples of Path Systems with Categorical Variables
  6.5 Path Analysis of Structural Generalized Linear Models
  6.6 Path Analysis of GLM Systems with Canonical Links
  6.7 Summary Effects Based on Entropy
  6.8 Application to Examples 6.1 and 6.2
    6.8.1 Path Analysis of Dichotomous Variables (Example 6.1 (Continued))
    6.8.2 Path Analysis of Polytomous Variables (Example 6.2 (Continued))
  6.9 General Formulation of Path Analysis of Recursive Systems of Variables
  6.10 Discussion
  References
7 Measurement of Explanatory Variable Contribution in GLMs
  7.1 Introduction
  7.2 Preliminary Discussion
  7.3 Examples
  7.4 Measuring Explanatory Variable Contributions
  7.5 Numerical Illustrations
  7.6 Application to Test Analysis
  7.7 Variable Importance Assessment
  7.8 Discussion
  References
8 Latent Structure Analysis
  8.1 Introduction
  8.2 Factor Analysis
    8.2.1 Factor Analysis Model
    8.2.2 Conventional Method for Measuring Factor Contribution
    8.2.3 Entropy-Based Method of Measuring Factor Contribution
    8.2.4 A Method of Calculating Factor Contributions by Using Covariance Matrices
    8.2.5 Numerical Example
  8.3 Latent Trait Analysis
    8.3.1 Latent Trait and Item Response
    8.3.2 Item Characteristic Function
    8.3.3 Information Functions and ECD
    8.3.4 Numerical Illustration
  8.4 Discussion
  References
Chapter 1
Entropy and Basic Statistics

1.1 Introduction

Entropy is a physical concept for measuring the complexity or uncertainty of systems
under study, and it is used in thermodynamics, statistical mechanics, information
theory, and so on; however, its definitions differ among these research domains. For
measuring the information of events, "entropy" was introduced in the study of
communication systems by Shannon [6]; this entropy is called the Shannon entropy,
and it contributed greatly to the foundation of modern information theory. Nowadays,
the theory has been applied to a wide range of interdisciplinary research fields.
Entropy can be grasped as a mathematical concept of the uncertainty of events in
sample spaces and of random variables. The basic idea in information theory is to
use a logarithmic measure based on probability. In statistical data analysis, parameters
describing the populations under consideration are estimated from random samples,
and confidence intervals for the estimators and various statistical tests concerning the
parameters are usually constructed. In these cases, the uncertainty of the estimators,
i.e., of the events under consideration, is measured with probability, for example,
through the 100(1 − α)% confidence intervals of the estimators and the significance
levels of the tests. Information theory and statistics both treat phenomena and data
processing systems with probability. In this respect, it is worthwhile to take another
look at statistics from the viewpoint of entropy. In this textbook, data analysis is
considered through information theory, especially "entropy"; in what follows, the
term "entropy" means the Shannon entropy. In this chapter, information theory is
first reviewed for readers unfamiliar with entropy, and second, basic statistics are
reconsidered in view of entropy. Section 1.2 introduces an information measure for
measuring the uncertainties of events in the sample space, and in Sect. 1.3 the loss of
information is assessed by this information measure. In Sect. 1.4, the Shannon
entropy is introduced for measuring the uncertainties of sample spaces and random
variables, and the Kullback–Leibler (KL) information [4] is derived through a
theoretical discussion. Sections 1.5 and 1.6 discuss the joint and the conditional
entropy of random variables and briefly discuss the association between the variables
through the mutual information (entropy) and the conditional entropy. In Sect. 1.7,
the chi-square test is considered through the KL information. Section 1.9 treats the
information of continuous variables, and the t and F statistics are expressed through
entropy.

1.2 Information

Let Ω be a sample space, A an event in Ω, and P(A) the probability of event A.
The information of event A is defined mathematically according to its probability,
not to its content. The smaller the probability of an event, the greater we feel its
value. Based on this intuition, the mathematical definition of information [6] is given by

Definition 1.1 For P(A) ≠ 0, the information of A is defined by

I(A) = log_a (1/P(A)),  (1.1)

where a > 1.

In what follows, the base of the logarithm is e, and the notation (1.1) is simply
denoted by

I(A) = log (1/P(A)).

In this case, the unit is called "nat," i.e., natural unit of information. If P(A) = 1,
i.e., event A always occurs, then I(A) = 0, which implies that event A has no
information. The information measure I(A) has the following properties:
(i) For events A and B, if P(A) ≥ P(B) > 0, the following inequality holds:

I (A) ≤ I (B). (1.2)

(ii) If events A and B are statistically independent, then, it follows that

I (A ∩ B) = I (A) + I (B). (1.3)

Proof Inequality (1.2) is trivial. In (ii), we have

P(A ∩ B) = P(A)P(B).  (1.4)

From this,

I(A ∩ B) = log (1/P(A ∩ B)) = log (1/(P(A)P(B)))
         = log (1/P(A)) + log (1/P(B)) = I(A) + I(B).

Example 1.1 In a trial drawing a card from a deck of cards, let events A and B be
"ace" and "heart," respectively. Then,

P(A ∩ B) = 1/52 = (1/13) × (1/4),
P(A) = 1/13, P(B) = 1/4.

Since A ∩ B ⊂ A and A ∩ B ⊂ B, corresponding to (1.2), we have

I(A ∩ B) > I(A), I(B),

and the events are statistically independent, so we also have Eq. (1.3).

Remark 1.1 In information theory, base 2 is usually used for the logarithm. Then, the
unit of information is referred to as "bit." One bit is the information of an event with
probability 1/2.
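As a minimal illustration of Definition 1.1 and Remark 1.1 (not part of the original text), the following Python sketch computes the information of the events in Example 1.1 in nats and in bits; the function name `information` is only illustrative.

```python
import math

def information(p, base=math.e):
    """Information I(A) = log_base(1/p) of an event with probability p (p > 0)."""
    return math.log(1.0 / p, base)

# Events from Example 1.1: A = "ace", B = "heart", A ∩ B = "ace of hearts"
p_A, p_B, p_AB = 1 / 13, 1 / 4, 1 / 52

print(information(p_A))          # I(A) in nats: log 13 ≈ 2.565
print(information(p_B))          # I(B) in nats: log 4 ≈ 1.386
print(information(p_AB))         # I(A ∩ B) = I(A) + I(B) ≈ 3.951 by independence, Eq. (1.3)
print(information(0.5, base=2))  # an event with probability 1/2 carries 1 bit (Remark 1.1)
```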

1.3 Loss of Information

When we forget the last two digits of a 10-digit phone number, e.g., 075-753-25##,
the right number can only be restored with probability 1/100 if there is no other
information about the number. In this case, the loss of information about the right
phone number is log 100. In general, the loss of information is defined as follows:

Definition 1.2 Let A and B be events such that A ⊂ B. Then, the loss of "information
concerning event A" by using or knowing event B for A is defined by

Loss(A|B) = I(A) − I(B).  (1.5)

In the above definition, A ∩ B = A, and from (1.5) the loss is also expressed by

Loss(A|B) = log (1/P(A)) − log (1/P(B)) = log (P(B)/P(A)) = log (1/P(A|B)).  (1.6)

In the above example of a phone number, the true number, e.g., 075-753-2517, is
event A, and the memorized number, e.g., 075-753-25##, is event B. Then,
P(A|B) = 1/100.

Table 1.1 Original data

Category    0    1    2    3
Frequency   5   10   15    5

Table 1.2 Aggregated data from Table 1.1

Category    0    1
Frequency  15   20

Example 1.2 Let X be a categorical random variable that takes values in sample
space Ω = {0, 1, 2, . . . , I − 1} and let πi = P(X = i). Suppose that, for an integer
k > 0, the sample space is changed into Ω* = {0, 1} by

X* = 0 (X < k);  X* = 1 (X ≥ k).

Then, the loss of information of X = i is given by

Loss(X = i | X* = j(i)) = log [ P(X* = j(i)) / P(X = i) ]
  = log( (Σ_{a=0}^{k−1} πa) / πi )   (i < k; j(i) = 0),
  = log( (Σ_{a=k}^{I−1} πa) / πi )   (i ≥ k; j(i) = 1).

In the above example, random variable X is dichotomized, which is a common
technique for processing and/or analyzing categorical data; however, the information
of the original variable is reduced.

Example 1.3 In Table 1.1, categories 0 and 1 are reclassified into category 0 as shown
in Table 1.2, and categories 2 and 3 into category 1. Then, the loss of information is
considered. Let A and B be the events shown in Tables 1.1 and 1.2, respectively. Then,
we have P(A|B) = 1/(15 + 1) × 1/(20 + 1) = 1/336, which is the conditional
probability of restoring Table 1.1 from Table 1.2, and from (1.6) we have

Loss(A|B) = log 336.
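The loss of information in Example 1.3 can be checked numerically. The sketch below (illustrative only, not from the text) counts the number of ways Table 1.1 can be restored from Table 1.2 and converts the restoration probability into a loss, reproducing Loss(A|B) = log 336.

```python
import math

# Aggregated frequencies in Table 1.2: category 0 holds 15 observations, category 1 holds 20.
# Restoring Table 1.1 means splitting 15 into (n0, n1) and 20 into (n2, n3), with n_i >= 0.
ways = (15 + 1) * (20 + 1)        # 16 * 21 = 336 equally likely restorations
p_restore = 1.0 / ways            # P(A|B) = 1/336
loss = math.log(1.0 / p_restore)  # Loss(A|B) = log 336 ≈ 5.817 nats, cf. Eq. (1.6)
print(ways, loss)
```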

1.4 Entropy

Let Ω = {ω1, ω2, . . . , ωn} be a discrete sample space and let P(ωi) be the probability
of event ωi. The uncertainty of the sample space is measured by the mean of the
information I(ωi) = − log P(ωi).

Table 1.3 Probability distribution of X

X            0     1     2
Probability  1/4   1/2   1/4

Definition 1.3 The entropy of discrete sample space Ω [6] is defined by

H(Ω) = Σ_{i=1}^{n} P(ωi) I(ωi) = − Σ_{i=1}^{n} P(ωi) log P(ωi).  (1.7)

Remark 1.2 In the above definition, sample space Ω and probability space (Ω, P)
are identified.

Remark 1.3 In definition (1.7), if P(ωi) = 0, then for convenience of the discussion
we set

P(ωi) log P(ωi) = 0 · log 0 = 0.

If P(ωi) = 1, i.e., P(ωj) = 0, j ≠ i, we have

H(Ω) = 0.

Example 1.4 Let X be a categorical random variable that takes values in sample
space Ω = {0, 1, 2} with the probabilities shown in Table 1.3. Then, the entropy of
Ω is given by

H(Ω) = (1/4) log 4 + (1/2) log 2 + (1/4) log 4 = (3/2) log 2.

In this case, the entropy is also referred to as that of X and is denoted by H(X).
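As a small numerical check of Definition 1.3 (an illustrative sketch, not from the text), the entropy of the distribution in Table 1.3 can be computed as follows; it reproduces H(Ω) = (3/2) log 2 ≈ 1.040 nats.

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p_i log p_i in nats, with 0 * log 0 treated as 0 (Remark 1.3)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

p_X = [1 / 4, 1 / 2, 1 / 4]               # Table 1.3
print(entropy(p_X), 1.5 * math.log(2))    # both ≈ 1.0397
```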

With respect to entropy (1.7), we have the following theorem:

Theorem 1.1 Let p = (p1, p2, . . . , pn) and q = (q1, q2, . . . , qn) be probability
distributions, i.e.,

Σ_{i=1}^{n} pi = Σ_{i=1}^{n} qi = 1.

Then, it follows that:

Σ_{i=1}^{n} pi log pi ≥ Σ_{i=1}^{n} pi log qi.  (1.8)

The equality holds true if and only if p = q.



Proof For x > 0, the following inequality holds true:

log x ≤ x − 1.

By using the above inequality, we have

Σ_{i=1}^{n} pi log qi − Σ_{i=1}^{n} pi log pi = Σ_{i=1}^{n} pi log (qi/pi)
 ≤ Σ_{i=1}^{n} pi (qi/pi − 1) = 0.

This completes the theorem.


In (1.7), the entropy is defined for the sample space. For probability distribution
p = (p1, p2, . . . , pn), the entropy is defined by

H(p) = − Σ_{i=1}^{n} pi log pi.

From (1.8), we have

H(p) = − Σ_{i=1}^{n} pi log pi ≤ − Σ_{i=1}^{n} pi log qi.  (1.9)

In (1.9), setting qi = 1/n, i = 1, 2, . . . , n, we have

H(p) = − Σ_{i=1}^{n} pi log pi ≤ log n.

Hence, from Theorem 1.1, entropy H(p) is maximized at p = (1/n, 1/n, . . . , 1/n), i.e., the
uniform distribution. The following entropy is referred to as the Kullback–Leibler
(KL) information or divergence [4]:


D(p‖q) = Σ_{i=1}^{n} pi log (pi/qi).  (1.10)

From Theorem 1.1, (1.10) is nonnegative and takes the value 0 if and only if p = q.
When the entropy of distribution q with respect to distribution p is defined by

H_p(q) = − Σ_{i=1}^{n} pi log qi,

Table 1.4 Probability distribution of Y

Y            0     1     2
Probability  1/8   3/8   1/2

we have

D(p‖q) = H_p(q) − H(p).

This entropy is interpreted as the loss of information incurred by substituting
distribution q for the true distribution p. With respect to the KL information, in general,

D(p‖q) ≠ D(q‖p).

Example 1.5 Let Y be a categorical random variable that has the probability distribution
shown in Table 1.4. Then, random variable X in Example 1.4 and Y are compared. Let
the distributions in Tables 1.3 and 1.4 be denoted by p and q, respectively. Then, we have

D(p‖q) = (1/4) log 2 + (1/2) log (4/3) + (1/4) log (1/2) = (1/2) log (4/3) ≈ 0.144,
D(q‖p) = (1/8) log (1/2) + (3/8) log (3/4) + (1/2) log 2 = (3/8) log (3/2) ≈ 0.152.

In what follows, distributions and the corresponding random variables are identified,
and the KL information is also denoted by using the variables; e.g., in Example 1.5,
the KL information D(p‖q) and D(q‖p) are expressed as D(X‖Y) and D(Y‖X),
respectively, depending on the occasion.
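The KL information in Example 1.5 can be verified with a few lines of code (again only an illustrative sketch); it reproduces D(p‖q) ≈ 0.144 and D(q‖p) ≈ 0.152 and shows the asymmetry of (1.10).

```python
import math

def kl(p, q):
    """Kullback-Leibler information D(p||q) = sum p_i log(p_i / q_i), Eq. (1.10)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1 / 4, 1 / 2, 1 / 4]   # distribution of X (Table 1.3)
q = [1 / 8, 3 / 8, 1 / 2]   # distribution of Y (Table 1.4)
print(kl(p, q))             # ≈ 0.1438
print(kl(q, p))             # ≈ 0.1521, so D(p||q) != D(q||p)
```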

Theorem 1.2 Let q = (q1 , q2 , . . . , qk ) be a probability distribution that is made by


appropriately combining categories of distribution p = ( p1 , p2 , . . . , pn ). Then, it
follows that

H ( p) ≥ H (q). (1.11)

Proof For simplicity of the discussion, let

q = (p1 + p2, p3, . . . , pn).

Then,

H(p) − H(q) = p1 log (1/p1) + p2 log (1/p2) − (p1 + p2) log (1/(p1 + p2))
 = p1 log ((p1 + p2)/p1) + p2 log ((p1 + p2)/p2) ≥ 0.

In this case, the theorem holds true. The above proof can be extended to the general
case, and the theorem follows.
In the above theorem, the mean loss of information by combining categories of
distribution p is given by

H ( p) − H (q).

Example 1.6 In a large and closed human population, if the marriage is random, the
distribution of genotypes is stationary. In ABO blood types, let the ratios of genes A, B,
and O be p, q, and r, respectively, where p + q + r = 1. Then, the probability distribution
of genotypes AA, AO, BB, BO, AB, and OO is u = (p², 2pr, q², 2qr, 2pq, r²).
We usually observe phenotypes A, B, AB, and O, which correspond to genotypes
"AA or AO," "BB or BO," AB, and OO, respectively, and the phenotype probability
distribution is v = (p² + 2pr, q² + 2qr, 2pq, r²). In this case, the mean loss of
information is given as follows:

H(u) − H(v) = p² log ((p² + 2pr)/p²) + 2pr log ((p² + 2pr)/(2pr))
 + q² log ((q² + 2qr)/q²) + 2qr log ((q² + 2qr)/(2qr))
 = p² log ((p + 2r)/p) + 2pr log ((p + 2r)/(2r))
 + q² log ((q + 2r)/q) + 2qr log ((q + 2r)/(2r)).
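To make the mean loss in Example 1.6 concrete, the following sketch evaluates H(u) − H(v) for illustrative gene ratios p = 0.3, q = 0.2, r = 0.5 (these numbers are hypothetical, chosen only for the example); it computes the genotype and phenotype entropies directly and agrees with the closed-form expression above.

```python
import math

def entropy(probs):
    return -sum(x * math.log(x) for x in probs if x > 0)

p, q, r = 0.3, 0.2, 0.5                          # hypothetical gene ratios with p + q + r = 1
u = [p*p, 2*p*r, q*q, 2*q*r, 2*p*q, r*r]         # genotypes AA, AO, BB, BO, AB, OO
v = [p*p + 2*p*r, q*q + 2*q*r, 2*p*q, r*r]       # phenotypes A, B, AB, O

direct = entropy(u) - entropy(v)                 # mean loss of information H(u) - H(v)
closed = (p*p*math.log((p + 2*r)/p) + 2*p*r*math.log((p + 2*r)/(2*r))
          + q*q*math.log((q + 2*r)/q) + 2*q*r*math.log((q + 2*r)/(2*r)))
print(direct, closed)                            # both ≈ 0.319 for these ratios
```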

Theorem 1.3 Let p = ( p1 , p2 , . . . , pn ) and q = (q1 , q2 , . . . , qn ) be probability


distributions. For 0 ≤ λ ≤ 1, the following inequality holds true:

H (λ p + (1 − λ)q) ≥ λH ( p) + (1 − λ)H (q). (1.11)

Proof The function x log x is convex, so we have

(λpi + (1 − λ)qi) log (λpi + (1 − λ)qi) ≤ λ pi log pi + (1 − λ) qi log qi.

Summing the above inequality with respect to i, we have inequality (1.11).

1.5 Entropy of Joint Event

Let X and Y be categorical random variables having sample spaces
Ω_X = {C1, C2, . . . , CI} and Ω_Y = {D1, D2, . . . , DJ}, respectively. In what follows,
the sample spaces are simply denoted as Ω_X = {1, 2, . . . , I} and Ω_Y = {1, 2, . . . , J},
as long as no confusion arises in the context. Let πij = P(X = i, Y = j),
i = 1, 2, . . . , I; j = 1, 2, . . . , J; πi+ = P(X = i), i = 1, 2, . . . , I; and let
π+j = P(Y = j), j = 1, 2, . . . , J.

Definition 1.4 The joint entropy of X and Y is defined as

H(X, Y) = − Σ_{i=1}^{I} Σ_{j=1}^{J} πij log πij.

With respect to the joint entropy, we have the following theorem:

Theorem 1.4 For categorical random variables X and Y,

H (X, Y ) ≤ H (X ) + H (Y ). (1.12)

Proof From Theorem 1.1,

H(X, Y) = − Σ_{i=1}^{I} Σ_{j=1}^{J} πij log πij ≤ − Σ_{i=1}^{I} Σ_{j=1}^{J} πij log (πi+ π+j)
 = − Σ_{i=1}^{I} Σ_{j=1}^{J} πij log πi+ − Σ_{i=1}^{I} Σ_{j=1}^{J} πij log π+j = H(X) + H(Y).

Hence, the inequality (1.12) follows. The equality holds if and only if
πij = πi+ π+j, i = 1, 2, . . . , I; j = 1, 2, . . . , J, i.e., X and Y are statistically independent.

From the above theorem, the following entropy is interpreted as that reduced due
to the association between variables X and Y:

H (X ) + H (Y ) − H (X, Y ). (1.13)

The stronger the association is, the greater this entropy is. The image of
the entropies H (X ), H (Y ), and H (X, Y ) is illustrated in Fig. 1.1. Entropy H (X ) is
expressed by the left ellipse; H (Y ) by the right ellipse; and H (X, Y ) by the union
of the two ellipses.

Fig. 1.1 Image of entropies H(X), H(Y), and H(X, Y)

Table 1.5 Joint distribution of X and Y

          Y
X         1       2       3       4       Total
1         1/8     1/8     0       0       1/4
2         1/16    1/8     1/16    0       1/4
3         0       1/16    1/8     1/16    1/4
4         0       0       1/8     1/8     1/4
Total     3/16    5/16    5/16    3/16    1

From Theorem 1.4, a more general proposition can be derived. We have the
following corollary:

Corollary 1.1 Let X1, X2, . . . , XK be categorical random variables. Then, the
following inequality holds true:

H(X1, X2, . . . , XK) ≤ Σ_{k=1}^{K} H(Xk).

The equality holds if and only if the categorical random variables are statistically
independent.

Proof From Theorem 1.4, we have

H((X1, X2, . . . , XK−1), XK) ≤ H(X1, X2, . . . , XK−1) + H(XK).

Hence, the corollary follows inductively.


Example 1.7 The joint probability distribution of categorical variables X and Y is
given in Table 1.5. From this table, we have

H(X) = 4 × (1/4) log 4 = 2 log 2 ≈ 1.386,
H(Y) = 2 × (3/16) log (16/3) + 2 × (5/16) log (16/5) = (3/8) log (16/3) + (5/8) log (16/5)
     = 4 log 2 − (3/8) log 3 − (5/8) log 5 ≈ 1.355,
H(X, Y) = 6 × (1/8) log 8 + 4 × (1/16) log 16 = (13/4) log 2 ≈ 2.253.

From this, the entropy reduced through the association between X and Y is
calculated as:

H(X) + H(Y) − H(X, Y) ≈ 1.386 + 1.355 − 2.253 = 0.488.
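The computations in Example 1.7 can be reproduced with the following sketch (illustrative only), which evaluates H(X), H(Y), H(X, Y), and the reduced entropy (1.13) directly from the joint distribution in Table 1.5.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Joint distribution of Table 1.5: rows X = 1..4, columns Y = 1..4
joint = [
    [1/8,  1/8,  0,    0   ],
    [1/16, 1/8,  1/16, 0   ],
    [0,    1/16, 1/8,  1/16],
    [0,    0,    1/8,  1/8 ],
]
p_X = [sum(row) for row in joint]              # marginals of X: 1/4 each
p_Y = [sum(col) for col in zip(*joint)]        # marginals of Y: 3/16, 5/16, 5/16, 3/16
H_X, H_Y = entropy(p_X), entropy(p_Y)
H_XY = entropy([p for row in joint for p in row])
print(H_X, H_Y, H_XY)                          # ≈ 1.386, 1.355, 2.253
print(H_X + H_Y - H_XY)                        # reduced entropy (1.13) ≈ 0.488
```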



1.6 Conditional Entropy

Let X and Y be categorical random variables having sample spaces
Ω_X = {1, 2, . . . , I} and Ω_Y = {1, 2, . . . , J}, respectively. The conditional
probability of Y = j given X = i is given by

P(Y = j|X = i) = P(X = i, Y = j)/P(X = i).

Definition 1.5 The conditional entropy of Y given X = i is defined by

H(Y|X = i) = − Σ_{j=1}^{J} P(Y = j|X = i) log P(Y = j|X = i),

and that of Y given X, H(Y|X), is defined by taking the expectation of the above
entropy with respect to X:

H(Y|X) = − Σ_{i=1}^{I} Σ_{j=1}^{J} P(X = i, Y = j) log P(Y = j|X = i).  (1.14)

With respect to the conditional entropy, we have the following theorem [6]:
Theorem 1.5 Let H(X, Y) and H(X) be the joint entropy of (X, Y) and the entropy
of X, respectively. Then, the following equality holds true:

H(Y|X) = H(X, Y) − H(X).  (1.15)

Proof From (1.14), we have

H(Y|X) = − Σ_{i=1}^{I} Σ_{j=1}^{J} P(X = i, Y = j) log [P(X = i, Y = j)/P(X = i)]
 = − Σ_{i=1}^{I} Σ_{j=1}^{J} P(X = i, Y = j) log P(X = i, Y = j)
 + Σ_{i=1}^{I} Σ_{j=1}^{J} P(X = i, Y = j) log P(X = i)
 = H(X, Y) + Σ_{i=1}^{I} P(X = i) log P(X = i) = H(X, Y) − H(X).

This completes the theorem.


With respect to the conditional entropy, we have the following theorem:

Theorem 1.6 For entropy H (Y ) and the conditional entropy H (Y |X ), the following
inequality holds:

H (Y |X ) ≤ H (Y ). (1.16)

The equality holds if and only if X and Y are statistically independent.

Proof Combining (1.15) with inequality (1.12), we have

H(Y) − H(Y|X) = H(Y) − (H(X, Y) − H(X)) = H(X) + H(Y) − H(X, Y) ≥ 0.

Hence, the inequality follows. From Theorem 1.4, the equality in (1.16) holds true
if and only if X and Y are statistically independent. This completes the theorem.

From the above theorem, the following entropy is interpreted as that of Y explained
by X:

I M (X, Y ) = H (Y ) − H (Y |X ). (1.17)

From (1.17) and (1.15) we have

I M (X, Y ) = H (Y ) − (H (X, Y ) − H (X )) = H (Y ) + H (X ) − H (X, Y ). (1.18)

This implies that I M (X, Y ) is symmetric with respect to X and Y, i.e.,

I M (X, Y ) = I M (Y, X ).

Thus, I M (X, Y ) is the entropy reduced by the association between the variables
and is referred to as the mutual information. Moreover, we have the following
theorem:

Theorem 1.7 For entropy measures H (X, Y ), H (X ) and H (Y ), the following


inequality holds true:

H (X, Y ) ≤ H (X ) + H (Y ). (1.19)

The equality holds true if and only if X and Y are statistically independent.

Proof From (1.15) and (1.16), (1.19) follows. 


The mutual information (1.17) is expressed by the KL information:

I_M(X, Y) = Σ_{i=1}^{I} Σ_{j=1}^{J} P(X = i, Y = j) log [P(X = i, Y = j)/(P(X = i)P(Y = j))].

Fig. 1.2 Image of entropy measures

From the above theorems, entropy measures are illustrated as in Fig. 1.2.
Theil [8] used entropy to measure the association between independent variable X
and dependent variable Y. With respect to mutual information (1.18), the following
definition is given [2, 3].
Definition 1.6 The ratio of I_M(X, Y) to H(Y), i.e.,

C(Y|X) = I_M(X, Y)/H(Y) = (H(Y) − H(Y|X))/H(Y),  (1.20)

is defined as the coefficient of uncertainty.


The above ratio is that of the explained (reduced) entropy of Y by X, and we have

0 ≤ C(Y |X ) ≤ 1.

Example 1.8 Table 1.6 shows the joint probability distribution of categorical
variables X and Y. From this table, the following entropy measures of the variables
are computed:

H(X, Y) = 8 × (1/8) log 8 = 3 log 2,
H(X) = 4 × (1/4) log 4 = 2 log 2,  H(Y) = 4 × (1/4) log 4 = 2 log 2.

From the above entropy measures, we have

I_M(X, Y) = H(X) + H(Y) − H(X, Y) = log 2.

Table 1.6 Joint probability distribution of categorical variables X and Y

          Y
X         1       2       3       4       Total
1         1/8     1/8     0       0       1/4
2         1/8     1/8     0       0       1/4
3         0       0       1/8     1/8     1/4
4         0       0       1/8     1/8     1/4
Total     1/4     1/4     1/4     1/4     1

From the above entropy measures, the coefficient of uncertainty is obtained as

C(Y|X) = log 2/(2 log 2) = 1/2.

From this, 50% of the entropy (information) of Y is explained (reduced) by X.
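The quantities in Example 1.8 can be reproduced with the short sketch below (illustrative only), which computes the mutual information (1.18), the conditional entropy (1.15), and the coefficient of uncertainty (1.20) from the joint distribution of Table 1.6.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Table 1.6: rows X = 1..4, columns Y = 1..4
joint = [
    [1/8, 1/8, 0,   0  ],
    [1/8, 1/8, 0,   0  ],
    [0,   0,   1/8, 1/8],
    [0,   0,   1/8, 1/8],
]
H_X = entropy([sum(row) for row in joint])
H_Y = entropy([sum(col) for col in zip(*joint)])
H_XY = entropy([p for row in joint for p in row])

I_M = H_X + H_Y - H_XY           # mutual information (1.18): log 2
H_Y_given_X = H_XY - H_X         # conditional entropy (1.15)
C = I_M / H_Y                    # coefficient of uncertainty (1.20): 0.5
print(I_M, H_Y_given_X, C)
```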


In general, Theorem 1.6 can be extended as follows:
Theorem 1.8 Let X i , i = 1, 2, . . . ; and Y be categorical random variables. Then,
the following inequality holds true:

H (Y |X 1 , X 2 , . . . , X i , X i+1 ) ≤ H (Y |X 1 , X 2 , . . . , X i ). (1.21)

The equality holds if and only if X i+1 and Y are conditionally independent, given
(X 1 , X 2 , . . . , X i ).
Proof It is sufficient to prove the case with i = 1. As in Theorem 1.5, from (1.12)
and (1.15) we have

H (X 2 , Y |X 1 ) ≤ H (X 2 |X 1 ) + H (Y |X 1 ),
H (Y |X 1 , X 2 ) = H (X 2 , Y |X 1 ) − H (X 2 |X 1 ).

By adding both sides of the above inequality and equation, it follows that

H (Y |X 1 , X 2 ) ≤ H (Y |X 1 ).

Hence, inductively the theorem follows. 


From Theorem 1.8, with respect to the coefficient of uncertainty (1.20), the
following corollary follows.
Corollary 1.2 Let X i , i = 1, 2, . . . ; and Y be categorical random variables. Then,

C(Y |X 1 , X 2 , . . . , X i , X i+1 ) ≥ C(Y |X 1 , X 2 , . . . , X i ).

Proof From (1.21),

C(Y|X1, X2, . . . , Xi, Xi+1) = 1 − H(Y|X1, X2, . . . , Xi, Xi+1)/H(Y)
 ≥ 1 − H(Y|X1, X2, . . . , Xi)/H(Y) = C(Y|X1, X2, . . . , Xi).

This completes the corollary.


The above corollary shows that increasing the number of explanatory variables Xi
increases the power to explain the response variable Y.

Table 1.7 Joint probability distribution of X1, X2, and Y

                    Y
X1      X2      0        1        Total
0       0       3/32     1/32     1/8
0       1       1/32     3/32     1/8
1       0       1/16     1/4      5/16
1       1       1/8      5/16     7/16
Total           5/16     11/16    1

This property is desirable for measuring the predictive or explanatory power of
explanatory variables Xi for response variable Y.

Example 1.9 The joint probability distribution of X1, X2, and Y is given in Table 1.7.
Let us compare the entropy measures H(Y), H(Y|X1), and H(Y|X1, X2). First,
H(Y), H(X1), H(X1, X2), H(X1, Y), and H(X1, X2, Y) are calculated. From
Table 1.7, we have

H(Y) = −(5/16) log (5/16) − (11/16) log (11/16) ≈ 0.621,
H(X1) = −(1/4) log (1/4) − (3/4) log (3/4) ≈ 0.563,
H(X1, X2) = −(1/4) log (1/4) − (1/4) log (1/4) − (5/16) log (5/16) − (7/16) log (7/16) ≈ 1.418,
H(X1, Y) = −(1/8) log (1/8) − (1/8) log (1/8) − (3/16) log (3/16) − (9/16) log (9/16) ≈ 1.157,
H(X1, X2, Y) = −(3/32) log (3/32) × 2 − (1/32) log (1/32) × 2 − (1/16) log (1/16)
 − (1/8) log (1/8) − (1/4) log (1/4) − (5/16) log (5/16) ≈ 1.695.

From the above measures of entropy, we have the following conditional entropy
measures:

H(Y|X1) = H(X1, Y) − H(X1) ≈ 1.157 − 0.563 = 0.595,
H(Y|X1, X2) = H(X1, X2, Y) − H(X1, X2) ≈ 1.695 − 1.418 = 0.277.

From these results, the inequalities in Theorems 1.6 and 1.8 hold true:

H(Y|X1, X2) < H(Y|X1) < H(Y).

Moreover, we have

C(Y|X1, X2) = (0.621 − 0.277)/0.621 = 0.554,  C(Y|X1) = (0.621 − 0.595)/0.621 = 0.042.

From the above coefficients of uncertainty, the entropy of Y is reduced by about
4% by explanatory variable X1; however, variables X1 and X2 together reduce the
entropy by about 55%. The above coefficients of uncertainty demonstrate Corollary 1.2.
The measure of the joint entropy is decomposed into those of the conditional
entropy. We have the following theorem.
Theorem 1.9 Let X i , i = 1, 2, . . . , n be categorical random variables. Then,

H (X 1 , X 2 , . . . , X n ) = H (X 1 ) + H (X 2 |X 1 ) + . . . + H (X n |X 1 , X 2 , · · · , X n−1 ).

Proof The random variables (X1, X2, . . . , Xn) are divided into (X1, X2, . . . , Xn−1)
and Xn. From Theorem 1.5, it follows that

H(X1, X2, . . . , Xn) = H(X1, X2, . . . , Xn−1) + H(Xn|X1, X2, . . . , Xn−1).

Hence, the theorem follows inductively.


The above theorem gives a recursive decomposition of the joint entropy.

1.7 Test of Goodness-of-Fit

Let X be a categorical random variable with sample space Ω = {1, 2, . . . , I}; let
pi be the hypothesized probability of X = i, i = 1, 2, . . . , I; and let ni be the number
of observations with X = i, i = 1, 2, . . . , I (Table 1.8). Then, for testing the
goodness-of-fit of the model, the following log-likelihood ratio test statistic [9] is used:

G1² = 2 Σ_{i=1}^{I} ni log (ni/(N pi)).  (1.22)

Let p̂i be the estimators of pi, i = 1, 2, . . . , I, i.e.,

p̂i = ni/N.

Then, statistic (1.22) is expressed as follows:

Table 1.8 Probability distribution and the numbers of observations

X            1     2     …     I     Total
Probability  p1    p2    …     pI    1
Frequency    n1    n2    …     nI    N


G1² = 2N Σ_{i=1}^{I} (ni/N) log ((ni/N)/pi) = 2N Σ_{i=1}^{I} p̂i log (p̂i/pi) = 2N · D(p̂‖p),  (1.23)

where p = (p1, p2, . . . , pI) and p̂ = (p̂1, p̂2, . . . , p̂I). From this, the above statistic
is 2N times the KL information. As N → ∞, p̂i → pi, i = 1, 2, . . . , I (in probability).
For x ≈ 1,

log x ≈ (x − 1) − (1/2)(x − 1)².

For large sample size N, it follows that

pi/p̂i ≈ 1 (in probability).

From this, we have

G1² = −2N Σ_{i=1}^{I} p̂i log (pi/p̂i) ≈ −2N Σ_{i=1}^{I} p̂i [ (pi/p̂i − 1) − (1/2)(pi/p̂i − 1)² ]
 = 2N Σ_{i=1}^{I} p̂i · (1/2)(pi/p̂i − 1)²
 = N Σ_{i=1}^{I} (p̂i − pi)²/p̂i ≈ N Σ_{i=1}^{I} (p̂i − pi)²/pi = Σ_{i=1}^{I} (ni − N pi)²/(N pi),

where the first-order terms vanish because Σ_{i=1}^{I} (pi − p̂i) = 0. Let us set

X² = Σ_{i=1}^{I} (ni − N pi)²/(N pi).  (1.24)

The above statistic is the chi-square statistic [5]. Hence, the log-likelihood ratio
statistic G1² is asymptotically equal to the chi-square statistic (1.24). For a given
distribution p, the statistic (1.22) is asymptotically distributed according to a chi-square
distribution with I − 1 degrees of freedom as N → ∞.
Let us consider the following statistic similar to (1.22):

G2² = 2 Σ_{i=1}^{I} N pi log (N pi/ni).

The above statistic is described as in (1.23):

G2² = 2N Σ_{i=1}^{I} pi log (pi/p̂i) = 2N · D(p‖p̂).  (1.25)

For large N, given probability distribution p, it follows that


G2² = −2N Σ_{i=1}^{I} pi log (p̂i/pi) ≈ N Σ_{i=1}^{I} pi (p̂i/pi − 1)² = Σ_{i=1}^{I} (ni − N pi)²/(N pi).

From the above result, given p = (p1, p2, . . . , pI), the three statistics are
asymptotically equivalent, i.e.,

G1² ≈ G2² ≈ X² (in probability).

Table 1.9 Probability distribution and the numbers of observations

X            1     2     3     4     5     6     Total
Probability  1/6   1/6   1/6   1/6   1/6   1/6   1
Frequency    12    13    19    14    19    23    100

Table 1.10 Data produced with distribution q = (1/6, 1/12, 1/4, 1/4, 1/6, 1/12)

X            1     2     3     4     5     6     Total
Frequency    12    8     24    23    22    11    100
 
Example 1.10 For the uniform distribution p = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6), data with
sample size 100 were generated, and the results are shown in Table 1.9. From this
table, we have

G1² = 5.55, G2² = 5.57, X² = 5.60.

The three statistics are approximately equal, and since the degrees of freedom of the
chi-square distribution is 5, the results are not statistically significant.

Example 1.11 Data with sample size 100 were produced by the distribution
q = (1/6, 1/12, 1/4, 1/4, 1/6, 1/12), and the data are shown in Table 1.10. We compare
the data with the uniform distribution p = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). In this artificial
example, the null hypothesis H0 is p = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). Then, we have

G1² = 15.8, G2² = 17.1, X² = 15.1.

Since the degrees of freedom is 5, the critical value at significance level 0.05 is
11.1, and the three statistics lead to the same conclusion, i.e., the null hypothesis is rejected.
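The three goodness-of-fit statistics of Examples 1.10 and 1.11 can be computed with a few lines of code; the sketch below (illustrative, not from the text) reproduces G1² ≈ 15.8, G2² ≈ 17.1, and X² ≈ 15.1 for the data of Table 1.10 under the uniform null hypothesis.

```python
import math

def fit_statistics(counts, probs):
    """Return (G1^2, G2^2, X^2) of Eqs. (1.22)-(1.24) for observed counts and null probabilities."""
    N = sum(counts)
    e = [N * p for p in probs]  # expected counts N * p_i (assumed positive)
    G1 = 2 * sum(n * math.log(n / ei) for n, ei in zip(counts, e) if n > 0)
    G2 = 2 * sum(ei * math.log(ei / n) for n, ei in zip(counts, e))
    X2 = sum((n - ei) ** 2 / ei for n, ei in zip(counts, e))
    return G1, G2, X2

counts = [12, 8, 24, 23, 22, 11]          # Table 1.10
uniform = [1 / 6] * 6                     # null hypothesis H0
print(fit_statistics(counts, uniform))    # ≈ (15.8, 17.1, 15.1); chi-square critical value (df = 5, 0.05) is 11.1
```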

1.8 Maximum Likelihood Estimation of Event Probabilities

Let X be a categorical random variable with sample space Ω = {1, 2, . . . , I}; let
pi(θ1, θ2, . . . , θK) be the hypothesized probability of X = i, i = 1, 2, . . . , I, where
θk, k = 1, 2, . . . , K, are parameters; and let ni be the number of observations with
X = i, i = 1, 2, . . . , I. Then, the log-likelihood function is given by

l(θ1, θ2, . . . , θK) = Σ_{i=1}^{I} ni log pi(θ1, θ2, . . . , θK).  (1.26)

The maximum likelihood (ML) estimation is carried out by maximizing the above
log likelihood with respect to θk, k = 1, 2, . . . , K. Let θ̂k be the ML estimators of
parameters θk. Then, the maximum of (1.26) is expressed as

l(θ̂1, θ̂2, . . . , θ̂K) = N Σ_{i=1}^{I} (ni/N) log pi(θ̂1, θ̂2, . . . , θ̂K).

Concerning the data, the following definition is given:

Definition 1.7 The negative data entropy (information) in Table 1.11 is defined by

l_data = Σ_{i=1}^{I} ni log (ni/N).  (1.27)

Since

l_data = N Σ_{i=1}^{I} (ni/N) log (ni/N),

the negative data entropy (1.27) is N times the negative entropy of the distribution
(n1/N, n2/N, . . . , nI/N). The log-likelihood statistic for testing the goodness-of-fit
of the model is given by

G²(θ̂) = 2(l_data − l(θ̂)) = 2N Σ_{i=1}^{I} (ni/N) log ((ni/N)/pi(θ̂)),  (1.28)

where

θ̂ = (θ̂1, θ̂2, . . . , θ̂K).

Table 1.11 Data of X

X                   1       2       …      I       Total
Relative frequency  n1/N    n2/N    …      nI/N    1
Frequency           n1      n2      …      nI      N

For simplicity of notation, let us set

(ni/N) = (n1/N, n2/N, . . . , nI/N) and (pi(θ̂)) = (p1(θ̂), p2(θ̂), . . . , pI(θ̂)).

Then, the log-likelihood ratio test statistic is described as follows:

G²(θ̂) = 2N × D((ni/N)‖(pi(θ̂))).  (1.29)

Hence, from (1.29), the ML estimation minimizes the difference between the data
(ni/N) and the model (pi(θ)) in terms of the KL information, and statistic (1.29) is
used for testing the goodness-of-fit of model (pi(θ)) to the data. Under the null
hypothesis H0: model (pi(θ)), statistic (1.29) is asymptotically distributed according
to a chi-square distribution with I − 1 − K degrees of freedom [1].
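As an illustrative sketch of Sect. 1.8 (the model and the data below are hypothetical, not taken from the text), suppose the hypothesized probabilities form a binomial model with three trials, pi(θ) = C(3, i) θ^i (1 − θ)^{3−i}, i = 0, . . . , 3. For this model the ML estimator has the closed form θ̂ = (Σ i ni)/(3N), and statistic (1.28) compares the data entropy with the fitted model.

```python
import math
from math import comb

# Hypothetical observed counts for X = 0, 1, 2, 3 (illustrative data only)
counts = [20, 45, 28, 7]
N = sum(counts)

# Binomial(3, theta) model: p_i(theta) = C(3, i) * theta^i * (1 - theta)^(3 - i)
theta_hat = sum(i * n for i, n in enumerate(counts)) / (3 * N)   # closed-form ML estimator
p_hat = [comb(3, i) * theta_hat**i * (1 - theta_hat)**(3 - i) for i in range(4)]

# Log-likelihood ratio statistic (1.28): G^2 = 2N * sum (n_i/N) log((n_i/N) / p_i(theta_hat))
G2 = 2 * sum(n * math.log((n / N) / p) for n, p in zip(counts, p_hat) if n > 0)
df = 4 - 1 - 1                                                   # I - 1 - K with I = 4, K = 1
print(theta_hat, G2, df)
```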

1.9 Continuous Variables and Entropy

Let X be a continuous random variable that has density function f(x). As in the
above discussion, we define the information of X = x by

log (1/f(x)).

Since the density f(x) is not a probability, the above quantity does not have a meaning
of information in a strict sense; however, in an analogous manner the entropy of
continuous variables is defined as follows [6]:

Definition 1.8 The entropy of continuous variable X with density function f(x) is
defined by

H(X) = − ∫ f(x) log f(x) dx.  (1.30)

Remark 1.4 For continuous distributions, the entropy (1.30) is not necessarily
positive.

Remark 1.5 For a constant a ≠ 0, if we transform variable X to Y = aX, the density
function of Y, f*(y), is

f*(y) = f(y/a) |dx/dy| = (1/|a|) f(y/a).

From this, we have

H(Y) = − ∫ f*(y) log f*(y) dy = − ∫ (1/|a|) f(y/a) [ log f(y/a) + log (1/|a|) ] dy
 = − ∫ (1/|a|) f(y/a) log f(y/a) dy − log (1/|a|) ∫ (1/|a|) f(y/a) dy
 = − ∫ f(x) log f(x) dx + log|a| = H(X) + log|a|.

Hence, the entropy (1.30) is not scale-invariant.


For a differential dx, f(x)dx can be viewed as the probability of x ≤ X < x + dx.
Hence,

log (1/(f(x)dx))

is the information of the event x ≤ X < x + dx. Let us compare two density functions
f(x) and g(x). The following quantity is the relative information of g(x) for f(x):

log [ (1/(g(x)dx)) / (1/(f(x)dx)) ] = log (f(x)/g(x)).

From this, we have the following definition:

Definition 1.9 The KL information between two density functions (distributions)
f(x) and g(x), given that f(x) is the true density function, is defined by

D(f‖g) = ∫ f(x) log (f(x)/g(x)) dx.  (1.31)

The above entropy is interpreted as the loss of information of the true distribution
f (x) when g(x) is used for f (x). Although the entropy (1.30) is not scale-invariant,
the KL information (1.31) is scale-invariant. With respect to the KL information
(1.31), the following theorem holds:
Theorem 1.10 For density functions f(x) and g(x), D(f‖g) ≥ 0. The equality
holds if and only if f(x) = g(x) almost everywhere.

Proof Since the function −log x is convex, from (1.31) we have

D(f‖g) = − ∫ f(x) log (g(x)/f(x)) dx ≥ − log ∫ (g(x)/f(x)) f(x) dx
 = − log ∫ g(x) dx = − log 1 = 0.

This completes the theorem.


Fig. 1.3 N(4, 1) and N(6, 1)

Example 1.12 Let f(x) and g(x) be the normal density functions of N(μ1, σ1²) and
N(μ2, σ2²), respectively. For σ1² = σ2² = σ² = 1, μ1 = 1, and μ2 = 4, the graphs of
the density functions are illustrated in Fig. 1.3. Since in general we have

log (f(x)/g(x)) = (1/2)[ log (σ2²/σ1²) − (x − μ1)²/σ1² + (x − μ2)²/σ2² ],
log (g(x)/f(x)) = (1/2)[ log (σ1²/σ2²) − (x − μ2)²/σ2² + (x − μ1)²/σ1² ],

it follows that

D(f‖g) = (1/2)[ log (σ2²/σ1²) + σ1²/σ2² + (μ1 − μ2)²/σ2² − 1 ],  (1.32)
D(g‖f) = (1/2)[ log (σ1²/σ2²) + σ2²/σ1² + (μ1 − μ2)²/σ1² − 1 ].  (1.33)

For σ1² = σ2² = σ², from (1.32) and (1.33) we have

D(f‖g) = D(g‖f) = (μ1 − μ2)²/(2σ²).  (1.34)

The above KL information is increasing in the difference |μ1 − μ2| and decreasing
in the variance σ².
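The closed forms (1.32)–(1.34) are easy to verify numerically; the following sketch (illustrative only, with helper names of my own choosing) compares the closed-form KL information of two normal densities with a crude numerical integration of Definition 1.9.

```python
import math

def kl_normal(mu1, s1sq, mu2, s2sq):
    """Closed-form D(f||g) of Eq. (1.32) for f = N(mu1, s1sq), g = N(mu2, s2sq)."""
    return 0.5 * (math.log(s2sq / s1sq) + s1sq / s2sq + (mu1 - mu2) ** 2 / s2sq - 1)

def kl_numeric(mu1, s1sq, mu2, s2sq, lo=-50.0, hi=50.0, n=200000):
    """Midpoint Riemann-sum evaluation of the integral in Eq. (1.31)."""
    def pdf(x, mu, ssq):
        return math.exp(-(x - mu) ** 2 / (2 * ssq)) / math.sqrt(2 * math.pi * ssq)
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        f, g = pdf(x, mu1, s1sq), pdf(x, mu2, s2sq)
        if f > 0 and g > 0:           # skip numerically underflowed tails
            total += f * math.log(f / g) * h
    return total

# Equal variances: D(f||g) = D(g||f) = (mu1 - mu2)^2 / (2 sigma^2), Eq. (1.34)
print(kl_normal(1, 1, 4, 1), kl_numeric(1, 1, 4, 1))   # both ≈ 4.5
```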

Let us consider a two-sample problem. Let {X1, X2, . . . , Xm} and {Y1, Y2, . . . , Yn}
be random samples from N(μ1, σ²) and N(μ2, σ²), respectively. For comparing the
two distributions with respect to their means, the following t statistic is usually used [7]:

t = [ (X̄ − Ȳ) / √((m − 1)U1² + (n − 1)U2²) ] √( mn(m + n − 2)/(m + n) ),

where

X̄ = Σ_{i=1}^{m} Xi/m,  Ȳ = Σ_{i=1}^{n} Yi/n,
U1² = Σ_{i=1}^{m} (Xi − X̄)²/(m − 1),  U2² = Σ_{i=1}^{n} (Yi − Ȳ)²/(n − 1).

Since the unbiased estimator of the variance σ² is

U² = [(m − 1)U1² + (n − 1)U2²]/(m + n − 2),

the above test statistic is expressed as

t = (X̄ − Ȳ) / ( U √((m + n)/(mn)) ).

Hence, the square of the statistic t is

t² = [(X̄ − Ȳ)²/U²] · mn/(m + n) = 2 D(f‖g) · mn/(m + n),

where

D(f‖g) = (X̄ − Ȳ)²/(2U²).

From this, the KL information is related to the t² statistic, i.e., the F statistic.

Example 1.13 Let the density functions f(x) and g(x) be normal density functions
with variances σ1² and σ2², respectively, where the means are the same, μ1 = μ2 = μ
(see, e.g., Fig. 1.4). From (1.32) and (1.33), we have

D(f‖g) = (1/2)[ log (σ2²/σ1²) − 1 + σ1²/σ2² ],  D(g‖f) = (1/2)[ log (σ1²/σ2²) − 1 + σ2²/σ1² ].

In this case, in general,

D(f‖g) ≠ D(g‖f).

Fig. 1.4 Graphs of density functions of N(5, 1) and N(5, 4)

Let

x = σ2²/σ1²,  D(f‖g) = (1/2)( log x − 1 + 1/x ).

The graph of D(f‖g) is illustrated in Fig. 1.5. This function is increasing for x > 1
and decreasing for 0 < x < 1. Let {X1, X2, . . . , Xm} and {Y1, Y2, . . . , Yn} be random
samples from N(μ1, σ1²) and N(μ2, σ2²), respectively. Then, the F statistic for testing
the equality of the variances is given by

F = U1²/U2²,  (1.35)

Fig. 1.5 Graph of D(f‖g) = (1/2)(log x − 1 + 1/x), x = σ2²/σ1²

where

X̄ = Σ_{i=1}^{m} Xi/m,  Ȳ = Σ_{i=1}^{n} Yi/n,
U1² = Σ_{i=1}^{m} (Xi − X̄)²/(m − 1),  U2² = Σ_{i=1}^{n} (Yi − Ȳ)²/(n − 1).

Under the null hypothesis H0: σ1² = σ2², statistic F is distributed according to the
F distribution with m − 1 and n − 1 degrees of freedom. Under the assumption
μ1 = μ2 = μ, the estimators of the KL information D(f‖g) and D(g‖f) are given by

D(f‖g) = (1/2)( log (U2²/U1²) − 1 + U1²/U2² ) = (1/2)( −log F − 1 + F ),
D(g‖f) = (1/2)( log (U1²/U2²) − 1 + U2²/U1² ) = (1/2)( log F − 1 + 1/F ).

Although the above KL information measures are not symmetric with respect to
the statistics F and 1/F, their sum is a symmetric function of F and 1/F:

D(f‖g) + D(g‖f) = (1/2)( F + 1/F − 2 ).

Example 1.14 Let f(x) and g(x) be the following exponential distributions:

f(x) = λ1 exp(−λ1 x),  g(x) = λ2 exp(−λ2 x),  λ1, λ2 > 0.

Since

log (f(x)/g(x)) = log (λ1/λ2) − λ1 x + λ2 x,
log (g(x)/f(x)) = log (λ2/λ1) − λ2 x + λ1 x,

we have

D(f‖g) = log (λ1/λ2) − 1 + λ2/λ1 and D(g‖f) = log (λ2/λ1) − 1 + λ1/λ2.

Example 1.15 (Property of the Normal Distribution in Entropy) [6] Let f(x) be
a density function of a continuous random variable X that satisfies the following
conditions:

∫ x² f(x) dx = σ² and ∫ f(x) dx = 1.  (1.36)

Then, the entropy is maximized with respect to the density function f(x). For
Lagrange multipliers λ and μ, we set

w = f(x) log f(x) + λ x² f(x) + μ f(x).

Differentiating the function w with respect to f(x), we have

(d/df(x)) w = log f(x) + 1 + λ x² + μ = 0.

From this, it follows that

f(x) = exp(−1 − μ − λx²).

Determining the multipliers according to (1.36), we have

f(x) = (1/√(2πσ²)) exp(−x²/(2σ²)).

Hence, the normal distribution has the maximum entropy for a given variance.

Example 1.16 Let X be a continuous random variable on a finite interval (a, b) with
density function f(x). The function that maximizes the entropy

H(X) = − ∫_a^b f(x) log f(x) dx

is determined under the condition

∫_a^b f(x) dx = 1.

As in Example 1.15, for a Lagrange multiplier λ, the following function is made:

w = f(x) log f(x) + λ f(x).

By differentiating the above function with respect to f(x), we have

(d/df(x)) w = log f(x) + 1 + λ = 0.

From this,

f(x) = exp(−λ − 1) = constant.

Hence, we have the following uniform distribution:

f(x) = 1/(b − a).

As discussed in Sect. 1.4, the entropy (1.7) of a discrete sample space or variable
is maximized by the uniform distribution P(ωi) = 1/n, i = 1, 2, . . . , n. The above
result for the continuous distribution is similar to that for the discrete case.

1.10 Discussion

Information theory and statistics use "probability" as an important tool for measuring
the uncertainty of events and random variables and of the data processed in
communications and in statistical data analysis. Based on this common point, the
present chapter has given a discussion for taking a look at "statistics" from the
viewpoint of "entropy." Reviewing information theory, the present chapter has shown
that the likelihood ratio statistics and the t and F statistics can be expressed with
entropy. In order to create novel and innovative methods for statistical data analysis,
borrowing approaches from information theory is a good strategy, and there are
possibilities to resolve problems in statistical data analysis by using information theory.

References

1. Agresti, A. (2002). Categorical data analysis. Hoboken: Wiley.


2. Joe, H. (1989). Relative entropy measures of multivariate dependence. Journal of the American
Statistical Association, 84, 157–164.
3. Haberman, S. J. (1982). Analysis of dispersion of multinomial responses. Journal of the
American Statistical Association, 77, 568–580.
4. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical
Statistics, 22, 79–86.
5. Pearson, K. (1900). On the criterion that a given system of deviations from the probable in
the case of a correlated system of variables is such that it can be reasonably supposed to have
arisen from random sampling. Philosophical Magazine, 5, 157–175. https://doi.org/10.1080/
14786440009463897.
6. Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical
Journal, 27, 379–423.
7. Student. (1908). The probable error of a mean. Biometrika, 6, 1–25.
8. Theil, H. (1970). On the estimation of relationships involving qualitative variables. American
Journal of Sociology, 76, 103–154.
9. Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing com-
posite hypotheses. Annals of Mathematical Statistics, 9, 60–62. https://doi.org/10.1214/aoms/
1177732360, https://projecteuclid.org/euclid.aoms/1177732360.
Chapter 2
Analysis of the Association in Two-Way
Contingency Tables

2.1 Introduction

In categorical data analysis, it is a basic approach to discuss the analysis of association
in two-way contingency tables. The usual approach for analyzing the association is to
compute odds, odds ratios, and relative risks in the contingency tables; however, as
the numbers of categories of the categorical variables concerned increase, the number
of odds ratios to be calculated also increases, and the interpretation of the results
becomes complicated. In this sense, processing contingency tables is more difficult
than processing continuous variables, because the association (correlation) between
two continuous variables can be measured by the correlation coefficient between
them in the bivariate normal distribution. Therefore, especially for contingency tables
of nominal variables, it is useful to summarize the association by a single measure,
although some information in the tables is lost. For analyzing the association in
ordinal two-way contingency tables, the RC(1) association model [11], which has a
structure similar to that of the bivariate normal distribution [3, 4, 12, 13], was
proposed. The model was extended to the RC(M) association model with M canonical
association components [13, 14], and [6] showed that the RC(M) association model
is a discretized version of the multivariate normal distribution in canonical correlation
analysis. A generalized version of the RC(M) association model was discussed as
canonical exponential models by Eshima [7]. It is a good approach to treat the
association in two-way contingency tables by a method that is similar to correlation
analysis in the normal distribution. In this chapter, a summary measure of association
in contingency tables is discussed from a viewpoint of entropy. In Sect. 2.2, odds,
odds ratios, and relative risks are reviewed, and their logarithms are interpreted
through information theory [8]. Section 2.3 discusses the association between binary
variables, and the entropy covariance and the entropy correlation coefficient of the
variables are proposed. It is shown that the entropy covariance is expressed by the
KL information and that the entropy correlation coefficient is the absolute value of
the Pearson correlation coefficient of binary variables. In Sect. 2.4, properties of the
maximum likelihood estimator of the odds ratio are discussed in terms of entropy.
It is shown that the Pearson chi-square statistic for testing the independence of binary
variables is related to the square of the entropy correlation coefficient, and the
log-likelihood ratio test statistic to the KL information. Section 2.5 considers the
association in general two-way contingency tables based on odds ratios and the
log-likelihood ratio test statistic. In Sect. 2.6, the RC(M) association model, which
was designed for analyzing the association in general two-way contingency tables,
is considered in view of entropy. Desirable properties of the model are discussed,
and the entropy correlation coefficient is expressed by using the intrinsic association
parameters and the correlation coefficients of the scores assigned to the categorical
variables concerned.

2.2 Odds, Odds Ratio, and Relative Risk

Let X and Y be binary random variables that take 0 or 1. The responses 0 and 1 are
formal expressions of categories for convenience of notations. In real data analyses,
responses are “negative” and “positive”, “agreement” and “disagreement”, “yes” or
“no”, and so on. Table 2.1 shows the joint probabilities of the variables. The odds of
Y = 1 instead of Y = 0 given X = i, i = 1, 2 are calculated by

P(Y = 1|X = i) P(X = i, Y = 1)


i = = , i = 0, 1, (2.1)
P(Y = 0|X = i) P(X = i, Y = 0)

for P(X = i, Y = 0) = 0, i = 0, 1. The above odds are basic association measures


for change of responses in Y , given X = i, i = 1, 2. The logarithms of the odds are
referred to as log odds and are given as follows:

logi = (−logP(Y = 0|X = i)) − (−logP(Y = 1|X = i)), i = 0, 1.

From this, log odds are interpreted as a decrease of uncertainty of response Y = 1


for Y = 0, given X = i.

Table 2.1 Joint probability distribution of X and Y


2.2 Odds, Odds Ratio, and Relative Risk 31

Definition 2.1 In Table 2.1, the odds ratio (OR) is defined by

1 P(Y = 1|X = 1)P(Y = 0|X = 0)


= . (2.2)
0 P(Y = 0|X = 1)P(Y = 1|X = 0)

The above OR is interpreted as the association or effect of variable X on variable


Y . From (2.1), we have

1 P(X = 1, Y = 1)P(X = 0, Y = 0) P(X = 1|Y = 1)P(X = 0|Y = 0)


= = .
0 P(X = 1, Y = 0)P(X = 0, Y = 1) P(X = 0|Y = 1)P(X = 1|Y = 0)
(2.3)

Hence, odds ratio is the ratio of the cross products in Table 2.1 and symmetric
for X and Y , and the odds ratio is a basic association measure to analyze categorical
data.

Remark 2.1 With respect to odds ratio expressed in (2.2) and (2.3), if variable X
precedes Y , or X is a cause and Y is effect, the first expression of odds ratio (2.2)
implies a prospective expression, and the second expression (2.3) is a retrospective
expression. Thus, the odds ratio is the same for both samplings, e.g., a prospective
sampling in a clinical trial and a retrospective sampling in a case-control study.

From (2.2), the log odds ratio is decomposed as follows:

1
log = (logP(Y = 1|X = 1) − logP(Y = 0|X = 1))
0
− (logP(Y = 1|X = 0) − logP(Y = 0|X = 0))
 
1 1
= log − log
P(Y = 0|X = 1) P(Y = 1|X = 1)
 
1 1
− log − log , (2.4)
P(Y = 0|X = 0) P(Y = 1|X = 0)

where P(X = i, Y = j) = 0, i = 0, 1; j = 0, 1. The first term of (2.4) is a decrease


of uncertainty (information) of Y = 1 for Y = 0 at X = 1, and the second term that
at X = 0. Thus, the log odds ratio is a change of uncertainty of Y = 1 in X . If X is
an explanatory variable or factor and Y is a response variable, OR in (2.4) may be
interpreted as the effect of X on response Y measured with information (uncertainty).
From (2.3), an alternative expression of the log odds ratio is given as follows:

1
log = (logP(X = 1|Y = 1) − logP(X = 0|Y = 1))
0
− (logP(X = 1|Y = 0) − logP(X = 0|Y = 0)). (2.5)
32 2 Analysis of the Association in Two-Way Contingency Tables

From this expression, the above log odds ratio can be interpreted as the effect of
Y on X , if X is a response variable and Y is explanatory variable. From (2.3), the
third expression of the log odds ratio is made as follows:

1
log = (logP(X = 1, Y = 1) + logP(X = 0, Y = 0))
0
− (logP(X = 1, Y = 0) + logP(X = 0, Y = 1))
 
1 1
= log + log
P(X = 1, Y = 0) P(X = 0, Y = 1)
 
1 1
− log + log . (2.6)
P(X = 0, Y = 0) P(X = 1, Y = 1)

If X and Y are response variables, from (2.6), the log odds ratio is the difference
between information of discordant and concordant responses in X and Y , and it
implies a measure of the association between the variables.

Theorem 2.1 For binary random variables X and Y , the necessary and sufficient
condition that the variables are statistically independent is

1
= 1. (2.7)
0

Proof If (2.7) holds, from (2.3),

P(X = 1, Y = 1)P(X = 0, Y = 0) = P(X = 1, Y = 0)P(X = 0, Y = 1).

There exists constant a > 0, such that

P(X = 1, Y = 1) = a P(X = 0, Y = 1),


P(X = 1, Y = 0) = a P(X = 0, Y = 0). (2.8)

From (2.8), we have

ai
P(X = i) = , i = 0, 1.
a+1

and
a
P(X = 1)P(Y = 1) = (P(X = 0, Y = 1) + P(X = 1, Y = 1))
a+1
a
= (P(X = 0, Y = 1) + a P(X = 0, Y = 1))
a+1
= a P(X = 0, Y = 1) = P(X = 1, Y = 1).
2.2 Odds, Odds Ratio, and Relative Risk 33

Table 2.2 Conditional


X Y Total
distribution of Y given factor
X 0 1
0 P(Y = 0|X = 0) P(Y = 1|X = 0) 1
1 P(Y = 0|X = 1) P(Y = 1|X = 1) 1

Similarly, we have

P(X = i)P(Y = j) = P(X = i, Y = j). (2.9)

This completes the sufficiency. Reversely, if X and Y are statistically independent,


it is trivial that (2.9) holds true. Then, (2.7) follows. This completes the theorem. 

Remark 2.2 The above discussion has been made under the assumption that variables
X and Y are random. As in (2.4), the interpretation is also given in terms of the
conditional probabilities. In this case, variable X is an explanatory variable, and Y
is a response variable. If variable X is a factor, i.e., non-random, odds are made
according to Table 2.2. For example, X implies one of two groups, e.g., “control”
and “treatment” groups in a clinical trial. In this case, the interpretation can be given
as the effect of factor X on response variable Y in a strict sense.

Example 2.1 The conditional distribution of Y given factor X is assumed as in


Table 2.3. Then, the odds ratio (2.2) is
7
· 11
77
OR1 = 8 16
= .
1
8
· 5
16
5

Changing the responses of Y as in Table 2.3, the odds ratio is

1
· 5
5
OR2 = 8 16
= .
7
8
· 11
16
77

Therefore, the strength of association is the same for

Table 2.3 Conditional


X Y Total
distribution of Y given factor
X 0 1
(a)
7 1
0 8 8 1
5 11
1 16 16 1
(b)
1 7
0 8 8 1
11 5
1 16 16 1
34 2 Analysis of the Association in Two-Way Contingency Tables

1
OR1 = .
OR2

In terms of information, i.e., log odds ratio, we have

logOR1 = −logOR2 .

Next, the relative risk is considered. In Table 2.2, risks with respect to response
Y are expressed by the conditional probabilities P(Y = 1|X = i), i = 0, 1.

Definition 2.2 In Table 2.2, the relative risk (RR) is defined by

P(Y = 1|X = 1)
RR = . (2.10)
P(Y = 1|X = 0)

The logarithm of relative risk (2.10) is

logRR = (−logP(Y = 1|X = 0)) − (−logP(Y = 1|X = 1)). (2.11)

The above information is a decrease of uncertainty (information) of Y = 1 by


factor X . In a clinical trial, let X be a factor assigning a “placebo” group by X = 0
and a treatment group by X = 1, and let Y denote a side effect, expressing “no” by
Y = 0 and “yes” Y = 1. Then, the effect of the factor is measured by (2.10) and
(2.11). In this case, it may follow that

P(Y = 1|X = 1) ≥ P(Y = 1|X = 0),

and

RR ≥ 1 ⇔ logRR ≥ 0

Then, the effect of the factor on the risk is positive. From (2.2), we have

P(Y = 1|X = 1)(1 − P(Y = 1|X = 0)) (1 − P(Y = 1|X = 0))


OR = = RR × .
P(Y = 1|X = 0)(1 − P(Y = 1|X = 1)) (1 − P(Y = 1|X = 1))

If risks P(Y = 1|X = 0) and P(Y = 1|X = 1) are small in Table 2.2, it follows
that

OR ≈ RR.

Example 2.2 The conditional distribution of binary variables of Y given X is illus-


trated in Table 2.4. In this example, the odds (2.1) and odds ratio (2.3) are calculated
as follows:
2.2 Odds, Odds Ratio, and Relative Risk 35

Table 2.4 Conditional


X Y Total
probability distributions of Y
given X 0 1
0 0.95 0.05 1
1 0.75 0.25 1

1
0.05 1 0.25 1 1 19
0 = = , 1 = = , OR = = 3
= ≈ 6.33.
0.95 19 0.75 3 0 1
19
3

From the odds ratio, it may be thought intuitively that the effect of X on Y is
strong. The effect measured by the relative effect is given by

0.25
RR = = 5.
0.05
The result is similar to that of odds ratio.
Example 2.3 In estimating the odds ratio and the relative risk, it is critical to take the
sampling methods into consideration. For example, flu influenza vaccine efficacy is
examined through a clinical trial using vaccination and control groups. In the clinical
trial, both of the odds ratio and the relative risk can be estimated as discussed in this
section. Let X be factor taking 1 for “vaccination” and 0 for “control” and let Y be a
state of infection, 1 for “infected” and 0 for “non-infected.” Then, P(Y = 1|X = 1)
and P(Y = 1|X = 0) are the infection probabilities (ratios) of vaccination and con-
trol groups, respectively. Then, the reduction ratio of the former probability (risk)
for the latter is assessed by
 
P(Y = 1|X = 0) − P(Y = 1|X = 1)
efficacy = × 100 = (1 − RR) × 100,
P(Y = 1|X = 0)

where RR is the following relative risk:

P(Y = 1|X = 1)
RR = .
P(Y = 1|X = 0)

On the other hand, the influenza vaccine effectiveness is measured through a


retrospective method (case-control study) by using flu patients (infected) and controls
(non-infected), e.g., non-flu patients. In this sampling, the risks P(Y = 1|X = 0) and
P(Y = 1|X = 1) cannot be estimated; however, the odds ratio can be estimated as
mentioned in Remark 2.1. In this case, P(X = 1|Y = 0) and P(X = 1|Y = 1) can
be estimated, and then, the vaccine effectiveness of the flu vaccine is evaluated by
using odds ratios (OR):
 P(X =1|Y =0)

1−P(X =1|Y =0)
effectiveness = 1 − P(X =1|Y =1)
× 100 = (1 − OR) × 100,
1−P(X =1|Y =1)
36 2 Analysis of the Association in Two-Way Contingency Tables

where OR is
P(X =1|Y =0)
1−P(X =1|Y =0)
OR = P(X =1|Y =1)
.
1−P(X =1|Y =1)

2.3 The Association in Binary Variables

In this section, binary variables are assumed to be random. Let πi j =


P(X = i, Y = j); πi+ = P(X = i), and let π+ j = P(Y = j). The logarithms of
the joint probabilities are transformed as follows:

logπi j = λ + λiX + λYj + ψi j , i = 0, 1; j = 0, 1. (2.12)

In order to identify the model parameters, the following appropriate constraints


are placed on the parameters:


1 
1 
1 
1
λiX πi+ = λYj π+ j = ψi j πi+ = ψi j π+ j = 0. (2.13)
i=0 j=0 i=0 j=0

From (2.12) and (2.13), we have the following equations:


1 
1 
1
πi+ π+ j logπi j = λ, π+ j logπi j = λ + λiX
i=0 j=0 j=0


1
πi+ logπi j = λ + λYj .
i=0

Therefore, we get


1 
1 
1 
1
ψi j = logπi j + π+ j logπi j + πi+ logπi j − πi+ π+ j logπi j . (2.14)
j=0 i=0 i=0 j=0

Let  be the following two by two matrix:


 
ψ00 ψ01
= .
ψ01 ψ11

The matrix has information of the association between the variables. For a link to
a discussion in continuous variables, (2.12) is re-expressed as follows:
2.3 The Association in Binary Variables 37

log p(x, y) = λ + λxX + λYy + (1 − x, x)(1 − y, y)T . (2.15)

where
p(x, y) = πx y .

From (2.15), we have

(1 − x, x)(1 − y, y)t = ϕx y + (ψ01 − ψ00 )x + (ψ10 − ψ00 )y + ψ00

where

ϕ = ψ00 + ψ11 − ψ01 − ψ10

From the results, (2.15) can be redescribed as

log p(x, y) = α + β(x) + γ (y) + ϕx y. (2.16)

where

α = λ + ψ00 , β(x) = λxX + (ψ01 − ψ00 )x, γ (y) = λYy + (ψ10 − ψ00 )y.
(2.17)

By using expression (2.16), for baseline categories X = x0 , Y = y0 , we have

p(x, y) p(x0 , y0 )
log = ϕ(x − x0 )(y − y0 ). (2.18)
p(x0 , y) p(x, y0 )

In (2.18), log odds ratio is given by

p(1, 1) p(0, 0)
log = ϕ.
p(0, 1) p(1, 0)

Definition 2.3 Let μ X and μY be the expectation of X and Y , respectively. Formally


substituting baseline categories x0 , and y0 in (2.18) for μ X and μY , respectively, a
generalized log odds ratio is defined by

ϕ(x − μ X )(y − μY ). (2.19)

and the exponential of the above quantity is referred to as a generalized odds ratio.
Remark 2.3 In binary variables X and Y , it follows that μ X = π1+ and μY = π+1 .
Definition 2.4 The entropy covariance between X and Y is defined by the expectation
of (2.19) with respect to X and Y , and it is denoted by ECov(X, Y ) :

ECov(X, Y ) = ϕCov(X, Y ). (2.20)


38 2 Analysis of the Association in Two-Way Contingency Tables

The entropy covariance is expressed by the KL information:


1 
1
πi j 
1 1
πi+ π+ j
ECov(X, Y ) = πi j log + πi+ π+ j log ≥ 0. (2.21)
i=0 j=0
π π
i+ + j i=0 j=0
πi j

The equality holds true if and only if variables X and Y are statistically inde-
pendent. From this, the entropy covariance is an entropy
 measure that explains the
difference between distributions (πi j ) and πi+ π+ j . In (2.20), ECov(X, Y ) is also
viewed as an inner product between X and Y with respect to association parameter
ϕ. From the Cauchy inequality, we have

0 ≤ ECov(X, Y ) ≤ |ϕ| Var(X ) Var(Y ). (2.22)

Since the (Pearson) correlation coefficient between binary variables X and Y is

Cov(X, Y ) π00 π11 − π01 π10


Corr(X, Y ) = √ √ =√ .
Var(X ) Var(Y ) π0+ π1+ π+0 π+1

for binary variables, it can be thought that the correlation coefficient measures the
degree of response concordance or covariation of the variables, and from (2.20) and
(2.22), we have

ECov(X, Y ) ϕCov(X, Y ) ϕ
0≤ √ √ ≤ √ √ = Corr(X, Y )
|ϕ| Var(X ) Var(Y ) |ϕ| Var(X ) Var(Y ) |ϕ|
= |Corr(X, Y )|.

From this, we make the following definition:

Definition 2.5 The entropy correlation coefficient (ECC) between binary variables
X and Y is defined by

ECorr(X, Y ) = |Corr(X, Y )|. (2.23)

From (2.21), the entropy covariance


√ ECov(X,
√ Y ) is the KL information, and the
upper bound of it is given as |ϕ| Var(X ) Var(Y ) in (2.22). Hence, ECC (2.23) is
interpreted as the ratio of the explained (reduced) entropy by the association between
variables X and Y . The above discussion gives the following theorem.

Theorem 2.2 For binary random variables X and Y , the following statements are
equivalent:

(i) Binary variables X and Y are statistically independent.


(ii) Corr(X, Y ) = 0.
(iii) ECorr(X, Y ) = 0.
(iv) RR = 1.
2.3 The Association in Binary Variables 39

Table 2.5 Joint probability


X Y Total
distribution of X and Y
0 1
(a)
0 0.4 0.1 0.5
1 0.2 0.3 0.5
Total 0.6 0.4 1
(b)
0 0.49 0.01 0.5
1 0.1 0.4 0.5
Total 0.59 0.41 1

Proof From Definition 2.5, Proposition (ii) and (iii) are equivalent. From Theo-
rem 2.1, Proposition (i) is equivalent to
π11 π22
OR = = 1 ⇔ π11 π22 − π12 π21 = 0 ⇔ Corr(X, Y ) = 0.
π12 π21

Similarly, we can show that

π11 π22 π12 π21 + π22


OR = = 1 ⇔ RR = · = 1.
π12 π21 π11 + π12 π22

This completes the theorem. 

Example 2.4 In Table 2.5a, the joint distribution of binary variables X and Y is given.
In this table, the odds (2.1) and odds ratio (2.3) are calculated as follows:
3
0.1 1 0.3 3 1
0 = = , 1 = = , = 2
= 6.
0.4 4 0.2 2 0 1
4

From the odds ratio, it may be thought intuitively that the association between
the two variables are strong in a sense of odds ratio, i.e., the effect of X on Y or that
of Y on X . For example, X is supposed to be a smoking state, where “smoking” is
denoted by X = 1 and “non-smoking” by X = 0, and Y be the present “respiratory
problem,” “yes” by Y = 1 and “no” by Y = 0. Then, from the above odds ratio, it
may be concluded that the effect of “smoking” on “respiratory problem” is strong.
The conclusion is valid in this context, i.e., the risk of disease. On the other hand,
treating variables X and Y parallelly in entropy, ECC is

0.4 · 0.3 − 0.1 · 0.2


ECorr(X, Y ) = Corr(X, Y ) = √ ≈ 0.408.
0.5 · 0.5 · 0.6 · 0.4
40 2 Analysis of the Association in Two-Way Contingency Tables

Table 2.6 Joint probability


X Y Total
distribution of X and Y for
t ∈ (0, 2) 0 1
0 0.4t 0.1t 0.5t
1 0.4(1 − 0.5t) 0.6(1 − 0.5t) (1 − 0.5t)
Total 0.4 + 0.2t 0.6 − 0.2t 1

In the above result, we may say the association between the variables is moderate
in a viewpoint of entropy. Similarly, we have the following coefficient of uncertainty
C(Y |X ) (1.19).

C(Y |X ) ≈ 0.128.

In this sense, the reduced (explained) entropy of Y by X is about 12.8% and the
effect of X on Y is week. Finally, the measures mentioned above are demonstrated
according to Table 2.5. We have

0.01 1 0.4 1 4
0 = = , 1 = = 4, = 1 = 196;
0.49 49 0.1 0 49
ECorr(X, Y ) = Corr(X, Y ) ≈ 0.793;
C(Y |X ) ≈ 0.558.

In the joint distribution in Table 2.5b, the association between the variables are
strong as shown by ECorr(X, Y ) and C(Y |X ).

Remark 2.4 As defined by (2.2) and (2.10), OR and RR do not depend on the
marginal probability distributions. For example, for any real value t ∈ (0, 2),
Table 2.5 is modified as in Table 2.6. Then, OR and RR in Table 2.6 are the same as
in Table 2.5a; however,

t(2 − t)
ECorr(X, Y ) = . (2.24)
(t + 2)(3 − t)

and it depends on real value t. The above ECC takes the maximum at t = 6 − 2 6 ≈
1.101:

max ECorr(X, Y ) ≈ 0.410.


t∈(0,2)

The graph of ECorr(X, Y ) (2.24) is illustrated in Fig. 2.1. Hence, for analyzing
contingency tables, not only association measures OR and RR but also ECC is needed.
2.4 The Maximum Likelihood Estimation of Odds Ratios 41

ECC
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
t
0
0.08
0.16
0.24
0.32
0.4
0.48
0.56
0.64
0.72
0.8
0.88
0.96
1.04
1.12
1.2
1.28
1.36
1.44
1.52
1.6
1.68
1.76
1.84
1.92
Fig. 2.1 Graph of ECorr(X, Y ) (2.24)

2.4 The Maximum Likelihood Estimation of Odds Ratios

The asymptotic distribution of the ML estimator of log odds ratio is considered.


Let n i j , i = 0, 1; j = 0, 1 be the numbers of observations in cell (X, Y ) = (i, j);
n i+ = n i0 + n i1 ; n + j = n 0 j + n 1 j ; and let n ++ = n 0+ + n 1+ = n +0 + n +1 . The
sampling is assumed to be independent Poisson samplings in the cells. Then, the
distribution of numbers n i j given the total n ++ is the multinomial distribution, and
the ML estimators of the cell probabilities πi j are calculated as

ni j
π̂i j = , i = 0, 1; j = 0, 1.
n ++

The ML estimator of the log odds ratio is computed by



n 00 n 11
OR = ,
n 01 n 10

and from this, for sufficiently large n i j , we have



n 00 n 11 μ00 μ11 n 00 n 01
logOR − logOR = log − log = log − log
n 01 n 10 μ01 μ10 μ00 μ01
n 10 n 11 1 1
− log + log ≈ (n 00 − μ00 ) − (n 01 − μ01 )
μ10 μ11 μ00 μ01
1 1
− (n 10 − μ10 ) + (n 11 − μ11 ),
μ10 μ11

where μi j are the expectations of n i j , i = 0, 1; j = 0, 1. Since the sampling method


is an independent Poisson sampling, if data n i j , i = 0, 1; j = 0, 1 are sufficiently
 
large, they are asymptotically
 independent
 and distributed according to N μi j , μi j ,
and we have n i j = μi j + o μi j , i = 0, 1; j = 0, 1, where
42 2 Analysis of the Association in Two-Way Contingency Tables
 
o μi j
lim = 0.
μi j →∞ μi j

Therefore, the asymptotic variance of OR is given by



1 1 1 1 1 1 1 1
Var logOR ≈ + + + ≈ + + + . (2.25)
μ00 μ01 μ10 μ11 n 00 n 01 n 10 n 11

According to the above results, the asymptotic distribution of logOR is normal


with variance (2.25) and the standard error (SE) of the estimator is given by

1 1 1 1
SE ≈ + + + .
n 00 n 01 n 10 n 11

The test of the independence of binary variables X and Y can be performed by


using the above SE, i.e., the critical region for significance level α is given as follows:


α 1 1 1 1
logOR > Q + + + ,
2 n 00 n 01 n 10 n 11


α 1 1 1 1
logOR < −Q + + + ,
2 n 00 n 01 n 10 n 11

where function Q(α) is the 100(1 − α) percentile of the standard normal distribution.

Remark 2.5 From (2.25), the following statistic is asymptotically distributed


according to the standard normal distribution:

logOR − logOR
Z= .
1
n 00
+ 1
n 01
+ 1
n 10
+ 1
n 11

Then, a 100(1 − α)% confidence interval for the log odds ratio is


α 1 1 1 1

logOR − Q + + + < logOR < logOR


2 n 00 n 01 n 10 n 11

α 1 1 1 1
+Q + + + . (2.26)
2 n 00 n 01 n 10 n 11

From the above confidence interval, if the interval does not include 0, then the
null hypothesis is rejected with the level of significance α.
In order to test the independence of binary variables X and Y , the following
Pearson chi-square statistic is usually used:
2.4 The Maximum Likelihood Estimation of Odds Ratios 43

n ++ (n 00 n 11 − n 10 n 01 )2
X2 = . (2.27)
n 0+ n 1+ n +0 n +1

Since

|n 00 n 11 − n 10 n 01 |
ECorr(X, Y ) = √ ,
n 0+ n 1+ n +0 n +1

we have

X 2 = n ++ ECorr(X, Y )2 .

Hence, the Pearson chi-square statistic is also related to entropy.


On the other hand, the log-likelihood ratio statistic for testing the independence
is given by


1 
1
n ++ n i j
G2 = 2 n i j log (2.28)
j=0 i=0
n i+ n + j

As mentioned in Chap. 1, the above statistic is related to the KL information.


Under the null hypothesis that X and Y are statistically independent, the statistics
(2.27) and (2.28) are asymptotically equal and the asymptotic distribution is the chi-
square distribution with degrees of freedom 1. By using the above testing methods,
a test for independence between X and Y can be carried out.

Example 2.5 Table 2.7 illustrates the two-way cross-classified data of X : smoking
state and Y : chronic bronchitis state from 339 adults over 50 years old. The log OR
is estimated as

43 · 121
logOR = log = 2.471,
162 · 13
A 95% confidence interval of the log odds ratio (2.26) is calculated as follows:

0.241 < logOR < 1.568, (2.29)

Table 2.7 Data of smoking


X Y Total
status and chronic bronchitis
of subjects above 50 years old Yes (1) No (0)
Yes (1) 43 162 205
No (0) 13 121 134
Total 56 283 339
X smoking state; Y chronic bronchitis state
Source Asano et al. [2], p. 161
44 2 Analysis of the Association in Two-Way Contingency Tables

From this, the association between the two variables is statistically significant at
significance level 0.05. From (2.29), a 95% confidence interval of the odds ratio is
given by

1.273 < OR < 4.797.

By using test statistic G 2 , we have

G 2 = 7.469, d f = 1, P = 0.006.

Hence, the statistic indicates that the association in Table 2.7 is statistically
significant.

2.5 General Two-Way Contingency Tables

Let X and Y be categorical random variables that have categories {1, 2, . . . , I } and
{1, 2, . . . , J }, respectively, and let πi j = P(X = i, Y = j), πi+ = P(X = i), and
let π+ j = P(Y = j). Then, the number of odds ratios is
  
I J I (I − 1) J (J − 1)
= · .
2 2 2 2

Table 2.8 illustrates a joint probability distribution of categorical variables. The


odds ratio with respect to X = i, i and Y = j, j is the following cross product
ratio:

Table 2.8 Joint probability distribution of general categorical variables X and Y


2.5 General Two-Way Contingency Tables 45

  πi j πi j
OR i, i ; j, j = for i = i and j = j . (2.30)
πi j πi j

The discussion on the analysis of association in contingency tables is compli-


cated as the numbers of categories in variables are increasing. As an extension of
Theorem 2.1, we have the following theorem:

Theorem 2.3 Categorical variables X and Y are statistically independent, if and


only if
 
OR i, i ; j, j = 1 for all i = i and j = j . (2.31)

Proof In (2.30), we formally set i = i and j = j . Then, we have

OR(i, i; j, j) = 1.

Hence, assumption (2.31) is equivalent to


  πi j πi j
OR i, i ; j, j = = 1 for all i, i , j, and j .
πi j πi j

From this, it follows that


J I
j =1 i =1 πi j πi j πi j π++ πi j
1= J I = = = 1 for all i and j.
j =1 i =1 πi j πi j
πi+ π+ j πi+ π+ j

Hence, we have

πi j = πi+ π+ j for all i and j,

it means that X and Y are statistically independent. This completes the theorem. 
 
 Let the model in Table 2.8 be denoted by πi j and let the independent model by
πi+ π+ j . Then, from (1.10), the Kullback–Leibler information is

    
I 
J
πi j
D πi j || πi+ π+ j = πi j log .
i=1 j=1
πi+ π+ j

Let n i j be the numbers of observations in cells (X, Y ) = (i, j), i =


1, 2, . . . , I ; j = 1, 2, . . . , J . Then, the ML estimators of probabilities πi j , πi+ ,
and π+ j are given as

ni j n i+ n+ j
π̂i j = , π̂i+ = , π̂+ j = ,
n ++ n ++ n ++
46 2 Analysis of the Association in Two-Way Contingency Tables

Table 2.9 Cholesterol and high blood pressure data in a medical examination for students in a
university
X Y Total
L M H
L 2 6 3 11
M 12 32 11 56
H 9 5 3 17
Total 24 43 17 84
Source [1], p. 206

where


J 
I 
I 
J
n i+ = ni j , n+ j = n i j , n ++ = ni j .
j=1 i=1 i=1 j=1

The test of independence is performed by the following statistic:


I 
J
n i j n ++    
G2 = 2 n i j log = 2n ++ · D π̂i j || π̂i+ π̂+ j . (2.32)
i=1 j=1
n i+ n + j

The above
 statistic implies 2n ++ times  the KL  information for discriminating
model πi j and the independent
  model πi+ π+ j . Under the null hypothesis H0 :
the independent model πi+ π+ j , the above statistic is distributed according to a
chi-square distribution with degrees of freedom (I − 1)(J − 1).

Example 2.6 Table 2.9 shows a part of data in a medical examination. Variables X
and Y are cholesterol and high blood pressure, respectively, which are graded as low
(L), medium (M), and high (H). Applying (2.32), we have

G 2 = 6.91, d f = 4, P = 0.859.

From this, the independence of the two variables is not rejected at the level of
significant 0.05, i.e., the association of the variables is not statistically significant.

Example 2.7 Table 2.10 illustrates cross-classified data with respect to eye color
X and hair color Y in [5]. For baselines X = Blue, Y = Fair, odds ratios are
calculated in Table 2.11, and P-values with respect to the related odds ratios are
shown in Table 2.12. All the estimated odds ratios related to X = Midium, Dark are
statistically significant. The summary test of the independence of the two variables
is made with statistic (2.32), and we have G 2 = 1218.3, d f = 9, P = 0.000.
Hence, the independence of the variables is strongly rejected. By using odds ratios
in Table 2.11, the association between the two variables in Table 2.10 is illustrated
in Fig. 2.2.
2.5 General Two-Way Contingency Tables 47

Table 2.10 Eye and hair color data


X Y Total
Fair Red Medium Dark Black
Blue 326 38 241 110 3 718
Light 688 116 584 188 4 1580
Medium 343 84 909 412 26 1774
Dark 98 48 403 681 85 1315
Total 1455 286 2137 1391 118 5387
Source [5], p. 143

Table 2.11 Baseline odds ratios for baselines X = blue, Y = fair


X Y
Red Medium Dark Black
Light 1.45 1.15 0.81 0.63
Medium 2.10 3.58 3.56 8.24
Dark 4.20 5.56 20.59 94.25

Table 2.12 P-values in estimators of odds ratios in Table 2.11


X Y
Red Medium Dark Black
Light 0.063 0.175 0.125 0.549
Medium 0.000 0.000 0.000 0.001
Dark 0.000 0.000 0.000 0.000

Odds Ratio
100
80
60
40
Dark
20
0 Medium
Red
Medium
Dark Light
Black
0-20 20-40 40-60 60-80 80-100

Fig. 2.2 Image of association according to odds ratios in Table 2.11


48 2 Analysis of the Association in Two-Way Contingency Tables

As demonstrated in Example 2.7, since a discussion of the association in contin-


gency tables becomes complex as the number of categories increases, an association
measure is needed to summarize the association in two-way contingency tables for
polytomous variables. An entropy-based approach for it will be given below.

2.6 The RC(M) Association Models

The association model was proposed for analyzing two-way contingency tables
of categorical variables [11, 13]. Let X and Y be categorical variables that
have categories {1, 2, . . . , I } and {1, 2, . . . , J }, respectively, and let πi j =
P(X = i, Y = j); πi+ = P(X = i), and let π+ j = P(Y = j), i =
1, 2, . . . , I ; j = 1, 2, . . . , J . Then, a loglinear formulation of the model is given
by

logπi j = λ + λiX + λYj + ψi j .

In order to identify the model parameters, the following constraints are assumed:


I 
J 
I 
J
λiX πi+ = λYj π+ j = ψi j πi+ = ψi j π+ j = 0.
i=1 j=1 i=1 j=1

The association between the two variables


   is expressed
  by parameters
 ψi j and the
log odds ratio with respect to (i, j), i , j , i, j , and i , j is calculated as

πi j πi j
log = ψi j + ψi j − ψi j − ψi j .
πi j πi j

To discuss the association in the two-way contingency table, the RC(M)


association model is given as


M
logπi j = λ + λiX + λYj + φm μmi νm j , (2.33)
m=1

where μmi and νm j are scored for X = i and Y = j in the m th components,


m = 1, 2, . . . , M, and the following constraints are used for model identification:
2.6 The RC(M) Association Models 49
⎧ I
⎪  J

⎪ μ π = νm j π+ j = 0, m = 1, 2, . . . , M;


mi i+

⎪ i=1 j=1
⎨ I J
μ2mi πi+ = νm2 j π+ j = 1, m = 1, 2, . . . , M; (2.34)




i=1 j=1

⎪  I  J

⎩ μmi μm i πi+ = νm j νm j π+ j = 0, m = m .
i=1 j=1

Then, the log odds ratio is given by the following bilinear form:

πi j πi j 
M
 
log = φm (μmi − μmi ) νm j − νm j . (2.35)
πi j πi j m=1

Preferable properties of the model in entropy are discussed below. For the
RC(1) model, Gillula et al. [10] gave a discussion of properties of the model
in entropy and Eshima [9] extended the discussion in the RC(M) association
model. The discussion is given below. In the RC(M) association model (2.33), let
μm = (μm1 , μm2 , . . . , μm I ) and ν m = (νm1 , νm2 , . . . , νm J ) be scores for X and Y in
the m th components, m = 1, 2, . . . , M. From (2.35), parameters φm are related to
log odds ratios, so the parameters are referred to as the intrinsic association param-
eters. If it is possible for scores to vary continuously, the intrinsic parameters φm
are interpreted as the log
 odds ratio in unit changes of the related scores in the mth
components. Let Corr μm , ν m be the correlation coefficients between μm and ν m ,
m, m = 1, 2, . . . , M, which are defined by

  J 
I
Corr μm , ν m = μmi νm j πi j . (2.36)
j=1 i=1

 
For simplicity of the notation, let us set ρm = Corr μm , ν m , m = 1, 2, . . . , M.
With respect to the RC(M) association model (2.33), we have the following theorem.
Theorem 2.4 The RC(M) association model maximizes the entropy, given the cor-
relation coefficients between μm and ν m and the marginal probability distributions
of X and Y .
Proof For Lagrange multipliers κ, λiX , λYj , and φm , we set


J 
I 
J 
I 
I 
J
G=− πi j logπi j + κ πi j + λiX πi j
j=1 i=1 j=1 i=1 i=1 j=1


J 
I 
M 
J 
I
+ λYj πi j + φm μmi νm j πi j .
j=1 i=1 m=1 j=1 i=1

Differentiating the above function with respect to πi j , we have


50 2 Analysis of the Association in Two-Way Contingency Tables

∂G  M
= −logπi j + 1 + κ + λiX + λYj + φm μmi νm j = 0.
∂πi j m=1

Setting λ = κ + 1, we have (2.33). This completes the theorem. 

With respect to the correlation coefficients and the intrinsic association parame-
ters, we have the following theorem.

Theorem 2.5 In the RC(M) association model (2.33), let U be the M×M matrix that
∂ρa
has (a, b) elements ∂φb
. Then, U is positive definite, given the marginal probability
distributions of X and Y and scores μm and ν m , m = 1, 2, . . . , M.

Proof Differentiating the correlation coefficients ρa with respect to φb in model


(2.33), we have
 
∂ρa  J I
∂λ ∂λiX ∂λYj
= μai νa j πi j × + + + μbi νbj . (2.37)
∂φb j=1 i=1
∂φb ∂φb ∂φb

Since marginal probabilities πi+ and π+ j are given, we have


 
∂πi+  J
∂λ ∂λ X ∂λYj
= πi j × + i + + μbi νbj = 0, (2.38)
∂φb j=1
∂φb ∂φb ∂φb
 
∂π+ j  I
∂λ ∂λ X ∂λYj
= πi j × + i + + μbi νbj = 0. (2.39)
∂φb i=1
∂φb ∂φb ∂φb

From (2.38) and (2.39), we obtain


 
∂πi+ J
∂λ ∂λ ∂λiX ∂λYj
= πi j × + + + μbi νbj = 0, (2.40)
∂φb j=1
∂φb ∂φb ∂φb ∂φb
 
∂πi+ J
∂λiX ∂λ ∂λiX ∂λYj
= πi j × + + + μbi νbj = 0, (2.41)
∂φb j=1
∂φb ∂φb ∂φb ∂φb
 
∂π+ j I
∂λYj ∂λ ∂λiX ∂λYj
= πi j × + + + μbi νbj = 0. (2.42)
∂φb i=1
∂φb ∂φb ∂φb ∂φb

From (2.41) and (2.42), it follows that


 

I 
J
∂λ ∂λ ∂λ X ∂λYj
πi j × + i + + μbi νbj = 0, (2.43)
i=1 j=1
∂φa ∂φb ∂φb ∂φb
2.6 The RC(M) Association Models 51
 

I 
J
∂λ X ∂λ ∂λ X ∂λYj
πi j × i + i + + μbi νbj = 0, (2.44)
i=1 j=1
∂φa ∂φb ∂φb ∂φb
 

I 
J
∂λYj ∂λ ∂λ X ∂λYj
πi j × + i + + μbi νbj = 0. (2.45)
i=1 j=1
∂φa ∂φb ∂φb ∂φb

By using (2.43) to (2.45), formula (2.37) is reformed as follows:


 
∂ρa  J I
∂λ ∂λiX ∂λYj
= μai νa j πi j × + + + μbi νbj
∂φb ∂φb ∂φb ∂φb
j=1 i=1
 
 I  J
∂λ ∂λ ∂λiX ∂λYj
+ πi j × + + + μbi νbj .
∂φb ∂φb ∂φb ∂φb
i=1 j=1
 
 I  J
∂λ X ∂λ ∂λ X ∂λYj
+ πi j × i + i + + μbi νbj
∂φa ∂φb ∂φb ∂φb
i=1 j=1
 
 I  J ∂λYj ∂λ ∂λ X ∂λYj
+ πi j × + i + + μbi νbj
∂φba ∂φb ∂φb ∂φb
i=1 j=1
  
I  J
∂λ ∂λiX ∂λYj ∂λ ∂λiX ∂λYj
= πi j × + + + μai νa j + + + μbi νbj
∂φa ∂φa ∂φa ∂φb ∂φb ∂φb
i=1 j=1

The above quantity is an inner product between I J dimensional vectors


   
∂λ ∂λiX ∂λYj ∂λ ∂λiX ∂λYj
+ + + μai νa j and + + + μbi νbj
∂φa ∂φa ∂φa ∂φb ∂φb ∂φb
 
with respect to the joint probability distribution πi j . Hence, theorem follows. 

From the above theorem, we have the following theorem.


 
Theorem 2.6 In the RC(M) association model (2.33), ρm = Corr μm , ν m are
increasing in the intrinsic association parameters φm , m = 1, 2, . . . , M, given the
marginal probability distributions of X and Y .

∂ρa
Proof In the above theorem, matrix U = ∂φ b
is positive definite, so we get

∂ρa
≥ 0.
∂φa

This completes the theorem. 

With respect to the entropy in RC(M) association model (2.33), we have the
following theorem.
52 2 Analysis of the Association in Two-Way Contingency Tables

Theorem 2.7 Let φ = (φ1 , φ2 , . . . , φ M ) be the intrinsic association parameter


vector. For real value t > 0, let φ = (tφ10 , tφ20 , . . . , tφ M0 ). Then, the entropy of
the RC(M) association model (2.33) is decreasing in t > 0, given the marginal
probability distributions of X and Y .

Proof The entropy of RC(M) association model (2.33) is calculated as follows:


⎛ ⎞

J 
I 
I 
J 
M
H =− πi j logπi j = −⎝λ + λiX πi+ + λYj π+ j + t φm0 ρm ⎠.
j=1 i=1 i=1 j=1 m=1

Differentiating the entropy with respect to t, we have

d(−H )  ∂λ 
M M  I
∂λiX
= φm0 + φm0 πi+
dt m=1
∂φm m=1 i=1
∂φm

M 
J
∂λYj 
M 
M 
M
∂ρm
+ φm0 π+ j + φm0 ρm + t φn0 φm0 .
m=1 j=1
∂φm m=1 n=1 m=1
∂φm

Since


J 
I
πi j = 1,
j=1 i=1

we have

d  
J I J  I J  I
d
πi j = πi j = πi j
dt j=1 i=1 j=1 i=1
dt j=1 i=1
 M 
 ∂λ M
∂λiX M
∂λYj M
φm0 + φm0 + φm0 + φm0 μmi νm j
m=1
∂φm m=1
∂φm m=1
∂φm m=1

M
∂λ J  M
∂λiX
= φm0 + φm0 πi+
m=1
∂φm j=1 m=1
∂φm

J  M
∂λiX M
+ φm0 π+ j + φm0 ρm = 0.
j=1 m=1
∂φm m=1

From the above result and Theorem 2.6, it follows that

dH  M M
∂ρm
= −t φn0 φm0 < 0.
dt n=1 m=1
∂φm
2.6 The RC(M) Association Models 53

Thus, the theorem follows. 

Remark 2.6 The RC(1) association model was proposed for the analysis of associ-
ation between ordered categorical variables X and Y . The RC(1) association model
is given by

logπi j = λ + λiX + λYj + φμi ν j , i = 1, 2, . . . , I ; j = 1, 2, . . . , J.


 
Then, from Theorem 2.6, the correlation coefficient between scores (μi ) and ν j
is increasing in the intrinsic association parameter φ and from Theorem 2.7 the
entropy of the model is decreasing in φ.
In the RC(M) association model (2.33), as in the previous discussion, we have


I 
J
πi j I J
πi+ π+ j   M J I
πi j log + πi+ π+ j log = φm μmi νm j πi j
i=1 j=1
πi+ π+ j i=1 j=1
πi j m=1 j=1 i=1


M
= φm ρm ≥ 0. (2.46)
m=1

In the RC(M) association model, categories X = i and  Y = j are identified to


score vectors (μ1i , μ2i , . . . , μ Mi ) and ν1 j , ν2 j , . . . , ν M j , respectively. From this,
as an extension of (2.20) in Definition 2.4, the following definition is given.

Definition 2.6 The entropy covariance between categorical variables X and Y in the
RC(M) association model (2.33) is defined by


M
ECov(X, Y ) = φm ρm . (2.47)
m=1

The above covariance can be interpreted as that of variables X and Y in entropy


(2.46) and is decomposed into M components. Since


M 
J 
I
ECov(X, Y ) = φm μmi νm j πi j , (2.48)
m=1 j=1 i=1

we can formally calculate ECov(X, X ) by


M 
J 
I 
M 
I 
M
ECov(X, X ) = φm μ2mi πi j = φm μ2mi πi+ = φm .
m=1 j=1 i=1 m=1 i=1 m=1

Similarly, we also have


54 2 Analysis of the Association in Two-Way Contingency Tables


M
ECov(Y, Y ) = φm . (2.49)
m=1

From (2.48) and (2.49), ECov(X, X ) and ECov(Y, Y ) can be interpreted as the
variances of X and Y in entropy, which are referred to as EVar(X ) and EVar(Y ),
respectively. As shown in (2.35), since association parameters
 φm are related to odds
ratios, the contributions of the mth pairs of score vectors μm , ν m in the association
between X and Y may be measured by φm ; however, from the entropy covariance
(2.47), it is more sensible to measure
 the contributions by φm ρm . The contribution
ratios of the mth components μm , ν m can be calculated by

φm ρm
M . (2.50)
k=1 φk ρk

In (2.46), from the Cauchy–Schwarz inequality, we have


M 
M 
J 
I 
M 
I 
J 
M
φm ρm = φm μmi νm j πi j ≤ φm μ2mi πi+ νm2 j π+ j = φm .
m=1 m=1 j=1 i=1 m=1 i=1 j=1 m=1

It implies that

0 ≤ ECov(X, Y ) ≤ EVar(X ) EVar(Y ). (2.51)

From this, we have the following definition:

Definition 2.7 The entropy correlation coefficient (ECC) between X and Y in the
RC(M) association model (2.33) is defined by
M
φm ρm
ECorr(X, Y ) = 
m=1
M
. (2.52)
m=1 φm

From the above discussion, it follows that

0 ≤ ECorr(X, Y ) ≤ 1.

As in Definition 2.4, the interpretation of ECC (2.52) is the ratio of entropy


explained by the RC(M) association model.

Example 2.8 Becker and Clogg [5] analyzed the data in Table 2.13 with the RC(2)
association model. The estimated parameters are given in Table 2.14. From the
estimates, we have

ρ̂1 = 0.444, ρ̂2 = 0.164.


2.6 The RC(M) Association Models 55

Table 2.13 Data of eye color X and hair color Y


X Y Total
Fair Red Medium Dark Black
Blue 326 38 241 110 3 718
Light 688 116 584 188 4 1580
Medium 343 84 909 412 26 1774
Dark 98 48 403 681 85 1315
Total 1455 286 2137 1391 118 5387
Source [5]

and

φ̂1 ρ̂1 = 0.219, φ̂2 ρ̂2 = 0.018.

From the above estimates, the entropy correlation coefficient between X and Y is
computed as


φ̂1 ρ̂1 + φ̂2 ρ̂2
ECorr(X, Y ) = = 0.390. (2.53)
φ̂1 + φ̂2

From this, the association in the contingency table is moderate. The contribution
ratios of the first and second pairs of score vectors are

φ̂1 ρ̂1 φ̂2 ρ̂2


= 0.924, = 0.076.
φ̂1 ρ̂1 + φ̂2 ρ̂2

φ̂1 ρ̂1 + φ2 ρ2

About 92.4% of the association between the categorical variables are explained
by the first pair of components (μ1 , ν 1 ).
A preferable property of ECC is given in the following theorem.

Theorem 2.8 Let φ = (φ1 , φ2 , . . . , φ M ) be the intrinsic association parameter


vector in the RC(M) association model (2.33). For real value t > 0, let φ =
(tφ10 , tφ20 , . . . , tφ M0 ). Then, the entropy covariance (2.47) is increasing in t > 0,
given the marginal probability distributions of X and Y .

Proof Since


M
ECov(X, Y ) = t φm0 ρm ,
m=1

differentiating the above entropy covariance with respect to t, we have


56

Table 2.14 Estimated parameters in the RC(2) association model


m φm μ1 μ2 μ3 μ4 ν1 ν2 ν3 ν4 ν5
1 0.494 −0.894 −1.045 0.166 1.520 −1.313 −0.425 0.019 1.213 2.567
2 0.110 1.246 0.271 −1.356 0.824 0.782 0.518 −1.219 0.945 0.038
Source [5], p. 145
2 Analysis of the Association in Two-Way Contingency Tables
2.6 The RC(M) Association Models 57

d M M  M
∂ρa dφb 1
M
ECov(X, Y ) = φm0 ρm + t φa0 = φm ρm
dt m=1 b=1 a=1
∂φb dt t m=1

M 
M
∂ρa
+t φa0 φb0
b=1 a=1
∂φb

1  M M
∂ρa
= ECov(X, Y ) + t φa0 φb0 .
t b=1 a=1
∂φb

For t > 0, the first term is positive, and the second term is also positive. This
completes the theorem. 

By using Theorem 2.5, we have the following theorem.

Theorem 2.9 Let φ = (φ1 , φ2 , . . . , φ M ) be the intrinsic association parameter


vector in the RC(M) association model (2.33). For real value t > 0, let φ =
(tφ10 , tφ20 , . . . , tφ M0 ). Then, ECC (2.52) is increasing in t > 0, given the marginal
probability distributions of X and Y .

Proof For φ = (tφ10 , tφ20 , . . . , tφ M0 ), we have


M
φm0 ρm
ECorr(X, Y ) = 
m=1
M
.
m=1 φm0

By differentiating ECorr(X, Y ) with respect to t, we obtain


M M M dρm dφm
m=1 φm0 dt m =1 φm0 dφm dt
dρm
d m=1
ECorr(X, Y ) = M = M
m=1 φm0 m=1 φm0
dt
M M
m =1 φm0 dφm φm 0
dρm
m=1
= M ≥ 0.
m=1 φm0

This completes the theorem. 

2.7 Discussion

Continuous data are more tractable than categorical data, because the variables are
quantitative and, variances and covariances among the variables can be calculated
for summarizing the correlations of them. The RC(M) association model, which was
proposed for analyzing the associations in two-way contingency tables, has a similar
correlation structure to the multivariate normal distribution in canonical correlation
analysis [6] and gives a useful method for processing the association in contingency
58 2 Analysis of the Association in Two-Way Contingency Tables

tables. First, this chapter has considered the association between binary variables
in entropy, and the entropy correlation coefficient has been introduced. Second, the
discussion has been extended for the RC(M) association mode. Desirable properties
of the model in entropy have been discussed, and the entropy correlation coefficient
to summarize the association in the RC(M) association model has been given. It
is sensible to treat the association in contingency tables as in continuous data. The
present approach has a possibility to be extended for analyzing multiway contingency
tables.

References

1. Asano, C., & Eshima, N. (1996). Basic multivariate analysis. Nihon Kikaku Kyokai: Tokyo.
(in Japanese).
2. Asano, C., Eshima, N., & Lee, K. (1993). Basic statistic. Tokyo: Morikita Publishing Ltd. (in
Japanese).
3. Becker, M. P. (1989a). Models of the analysis of association in multivariate contingency tables.
Journal of the American Statistical Association, 84, 1014–1019.
4. Becker, M. P. (1989b). On the bivariate normal distribution and association models for ordinal
data. Statistics and Probability Letters, 8, 435–440.
5. Becker, M. P., & Clogg, C. C. (1989). Analysis of sets of two-way contingency tables using
association models. Journal of the American Statistical Association, 84, 142–151.
6. Eshima, N. (2004). Canonical exponential models for analysis of association between two sets
of variables. Statistics and Probability Letters, 66, 135–144.
7. Eshima, N., & Tabata, M. (1997). The RC(M) association model and canonical correlation
analysis. Journal of the Japan Statistical Society, 27, 109–120.
8. Eshima, N., & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive
power of generalized linear models. Statistics and Probability Letters, 77, 588–593.
9. Eshima, N., Tabata, M., & Tsujitani, M. (2001). Properties of the RC(M) association model
and a summary measure of association in the contingency table. Journal of the Japan Statistical
Society, 31, 15–26.
10. Gilula, Z., Krieger, A., & Ritov, Y. (1988). Ordinal association in contingency tables: Some
interpretive aspects. Journal of the American Statistical Association, 74, 537–552.
11. Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications
having ordered categories. Journal of the American Statistical Association, 74(367), 537–552.
12. Goodman, L. A. (1981a). Association models and bivariate normal for contingency tables with
ordered categories. Biometrika, 68, 347–355.
13. Goodman, L. A. (1981b). Association models and canonical correlation in the analysis of cross-
classification having ordered categories. Journal of the American Statistical Association, 76,
320–334.
14. Goodman, L. A. (1985). The analysis of cross-classified data having ordered and/or unordered
categories: Association models, correlation models, and asymmetry models for contingency
tables with or without missing entries. Annals of Statistics, 13, 10–69.
Chapter 3
Analysis of the Association in Multiway
Contingency Tables

3.1 Introduction

Basic methodologies to analyze two-way contingency tables can be extended for


multiway contingency tables, and the RC(M) association model [11] and ECC [7]
discussed in the previous section can be applied to the analysis of association in
multiway contingency tables [6]. In the analysis of the association among the vari-
ables, it is needed to treat not only the association between the variables but also
the higher-order association among the variables concerned. Moreover, the condi-
tional association among the variables is also analyzed. In Sect. 3.2, the loglinear
model is treated, and the basic properties of the model are reviewed. Section 3.3
considers the maximum likelihood estimation of loglinear models for three-way
contingency tables. It is shown that the ML estimates of model parameters can be
obtained explicitly except model [X Y, Y Z , Z X ] [2]. In Sect. 3.4, generalized linear
models (GLMs), which make useful regression analyses of both continuous and cat-
egorical response variables [13, 15], are discussed, and properties of the models are
reviewed. Section 3.5 introduces the entropy correlation coefficient (ECC) to mea-
sure the predictive power of GLMs. The coefficient is an extension of the multiple
correlation coefficient for the ordinary linear regression model, and its preferable
properties for measuring the predictive power of GLMs are discussed in comparison
with the regression correlation coefficient. The conditional entropy correlation coef-
ficient is also treated. In Sect. 3.6, GLMs with multinomial (polytomous) response
variables are discussed by using generalized logit models. Section 3.7 considers the
coefficient of determination in a unified framework, and the entropy coefficient of
determination (ECD) is introduced, which is the predictive or explanatory power
measure for GLMs [9, 10]. Several predictive power measures for GLMs are com-
pared on the basis of six preferable criteria, and it is shown that ECD is the most
preferable predictive power measure for GLMs. Finally, in Sect. 3.8, the asymptotic
property of the ML estimator of ECD is considered.

© Springer Nature Singapore Pte Ltd. 2020 59


N. Eshima, Statistical Data Analysis and Entropy,
Behaviormetrics: Quantitative Approaches to Human Behavior 3,
https://doi.org/10.1007/978-981-15-2552-0_3
60 3 Analysis of the Association in Multiway Contingency Tables

3.2 Loglinear Model

Let X, Y, and Z be categorical random variables having categories {1, 2, . . . , I },


{1, 2, . . . , J }, and {1, 2, . . . , K }, respectively; let πi jk = P(X = i, Y = j, Z = k).
Then, the logarithms of the probabilities can be expressed as follows:

logππi jk = λ + λiX + λYj + λkZ + λiXY


j + λ jk + λki + λi jk .
YZ ZX XYZ
(3.1)

In order to identify the parameters, the following constraints are used.



⎪ 
I 
J 
K

⎪ λiX = λYj = λkZ = 0



⎪ i=1 j=1 k=1
⎨ 
I 
J 
J 
K 
K 
I
λiXj Y = λiXj Y = λYjkZ = λYjkZ = λkiZ X = λkiZ X = 0, (3.2)




i=1 j=1 j=1 k=1 k=1 i=1

⎪ 
I 
J 
K

⎩ λiXjkY Z = λiXjkY Z = λiXjkY Z = 0.
i=1 j=1 k=1

The above model is referred to as the loglinear model and is used for the analysis
of association in three-way contingency tables. The model has two-dimensional
association terms λiXj Y , λYjkZ , and λkiZ X and three-dimensional association terms λiXjkY Z ,
so in this textbook, the above loglinear model is denoted as model [XYZ] for simplicity
of the notation, where the notation is correspondent with the highest dimensional
association terms [2].
Let n i jk be the numbers of observations for X = i, Y = j, and Z = k, i =
1, 2, . . . , I ; j = 1, 2, . . . , J ; k = 1, 2, . . . , K . Since model (3.1) is saturated, the
ML estimators of the cell probabilities are given by
n i jk
logπ̂i jk = log ,
n +++

where


I 
J 
K
n +++ = n i jk .
i=1 j=1 k=1

From this, the estimates of the parameters are obtained by solving the following
equations:

n i jk
λ + λ̂iX + λ̂Yj + λ̂kZ + λ̂iXj Y + λ̂YjkZ + λ̂kiZ X + λ̂iXjkY Z = log ,
n +++
i = 1, 2, . . . , I ; j = 1, 2, . . . , J ; k = 1, 2, . . . , K .

More parsimonious models than [XYZ] are given as follows:


3.2 Loglinear Model 61

(i) [X Y, Y Z , Z X ]

If there are no three-dimensional association terms in loglinear model (3.1), we have

logπi jk = λ + λiX + λYj + λkZ + λiXj Y + λYjkZ + λkiZ X . (3.3)

and then, we denote the model as [X Y, Y Z , Z X ] corresponding to the highest dimen-


sional association terms with respect to the variables, i.e., λiXj Y , λYjkZ , and λkiZ X . These
association terms indicate the conditional associations related to the variables, i.e.,
the conditional log odds ratios with respect to X and Y given Z are
πi jk πi  j  k
log = λiXj Y + λiX Yj  − λiX Yj − λiXj Y .
πi  jk πi j  k

Similarly, the conditional odds ratios with respect to Y and Z given X and those
with respect to X and Z given Y can be discussed.

(ii) [X Y, Y Z ]

If λkiZ X = 0 in (3.3), it follows that

logπi jk = λ + λiX + λYj + λkZ + λiXj Y + λYjkZ . (3.4)

The highest dimensional association terms with respect to the variables are λiXj Y
and λYjkZ , so we express the model by [X Y, Y Z ]. Similarly, models [Y Z , Z X ] and
[Z X, X Y ] can also be defined. In model (3.4), the log odds ratios with respect to X
and Z given Y are
πi jk πi  jk 
log = 0.
πi  jk πi jk 

From this, the above model implies that X and Z are conditionally independent,
given Y. Hence, we have
πi j+ π+ jk
πi jk = .
π+ j+

In this case, the marginal distribution of X and Y, πi j+ , is given as follows:



logπi j+ = λ + λiX + λYj + λiXj Y , (3.5)

where



K

λYj = λYj + log exp λkZ + λYjkZ .


k=1
62 3 Analysis of the Association in Multiway Contingency Tables

From (3.5), parameters λiX and λiXj Y can be expressed by the marginal probabilities
πi j+ , and similarly, λkZ and λYjkZ are described with π+ jk .

(iii) [X Y, Z ]

If λYjkZ = 0 in (3.4), the model is

logπi jk = λ + λiX + λYj + λkZ + λiXj Y . (3.6)

The highest dimensional association terms with respect to X and Y are λiXj Y , and
for Z term λkZ is the highest dimensional association terms. In this model, (X, Y ) and
Z are statistically independent, and it follows that

πi jk = πi j+ π++k

The marginal distribution of (X, Y ) has the same parameters λiX , λYj , and λiXj Y as
in (3.6), i.e.,

logπi j+ = λ + λiX + λYj + λiXj Y ,

where


K

λ = λ + log exp λkZ .


k=1

From this, parameters λiX , λYj , and λiXj Y can be explicitly expressed by probabilities
πi j+ , and similarly, λkZ by π++k . Parallelly, we can interpret models [Y Z , X ] and
[Z X, Y ].
(iv) [X, Y, Z ]

This model is the most parsimonious one and is given by

logπi jk = λ + λiX + λYj + λkZ .

The above model has no association terms among the variables, and it implies the
three variables are statistically independent. Hence, we have

πi jk = πi++ π+ j+ π++k

Remark 3.1 Let n i jk be the numbers (counts) of events for categories X = i, Y = j,


and Z = k, i = 1, 2, . . . , I ; j = 1, 2, . . . , J ; k = 1, 2, . . . , K , which are inde-
pendently distributed according to Poisson distributions with means μi jk . Then, the
loglinear models are given by replacing πi jk by μi jk in the models discussed in the
above, that is
3.2 Loglinear Model 63

logμi jk = λ + λiX + λYj + λkZ + λiXj Y + λYjkZ + λkiZ X + λiXjkY Z .

This sampling model is similar to that for the three-way layout experiment model,
and the terms λiXj Y , λYjkZ , and λkiZ X are referred to as two-factor interaction terms and
λiXjkY Z three-factor interaction terms.

Remark 3.2 Let us observe one of IJK events (X, Y, Z ) = (i, j, k) with probabili-
ties πi jk in one trial and repeat it n +++ times independently; let n i jk be the numbers
(counts) of events for categories X = i, Y = j, and Z = k, i = 1, 2, . . . , I ; j =
1, 2, . . . , J ; k = 1, 2, . . . , K . Then, the distribution of n i jk is the multinomial distri-
bution with total sampling size n +++ . In the independent Poisson sampling mentioned
in Remark 3.1, the conditional distribution of observations n i jk given n +++ is the
multinomial distribution with cell probabilities
μi jk
πi jk = ,
μ+++

where


I 
J 
K
μ+++ = μi jk .
i=1 j=1 k=1

As discussed above, measuring the associations in loglinear models is carried


out with odds ratios. For binary variables, the ordinary approach may be sufficient;
however, for polytomous variables, further studies for designing summary measures
of the association will be needed.

3.3 Maximum Likelihood Estimation of Loglinear Models

We discuss the maximum likelihood estimation of loglinear models in the previous


section. Let n i jk be the numbers of observations for (X, Y, Z ) = (i, j, k). For model
[X Y, Y Z , Z X ] (3.3), we set

λiXj Y = λ + λiX + λYj + λiXj Y .

Then, considering the constraints in (3.2), the number of independent parameters



λiXj Y is IJ. Since the loglikelihood function is


I 
J 
K 

l= n i jk λiXj Y + λkZ + λYjkZ + λkiZ X
i=1 j=1 k=1
64 3 Analysis of the Association in Multiway Contingency Tables


I 
J


K 
J 
K
= n i j+ λiXj Y + n ++k λkZ + n + jk λYjkZ
i=1 j=1 k=1 j=1 k=1


I 
K
+ n i+k λkiZ X .
i=1 k=1

From


I 
J 
K
πi jk = 1,
i=1 j=1 k=1

for Lagrange multiplier ζ, we set


I 
J 
K
lLagrange = l − ζ πi jk
i=1 j=1 k=1

and differentiating the above function with respect to λiXj Y  , we have

∂  K  K
l
X Y  Lagrange
= n i jk − ζ πi jk = 0.
∂λi j k=1 k=1

From this,

n i j+ − ζ πi j+ = 0 (3.7)

and we have

ζ = n +++

Hence, from (3.7), it follows that


n i j+
π̂i j+ = ,
n +++

Let λ̂iXj Y  , λ̂kZ , λ̂YjkZ , and λ̂kiZ X be the ML estimators of λiXj Y  , λkZ , λYjkZ , and λkiZ X ,
respectively. Then, we obtain



K
π̂i j+ = exp λ + λ̂iX + λ̂Yj + λ̂iXj Y exp λ̂kZ + λ̂YjkZ + λ̂kiZ X
k=1

K
= exp λ̂iXj Y  exp λ̂kZ + λ̂YjkZ + λ̂kiZ X . (3.8)
k=1
3.3 Maximum Likelihood Estimation of Loglinear Models 65

Similarly, the ML estimators of πi+k and π+ jk that are made by the ML estimator
of the model parameters are obtained as
n i+k n + jk
π̂i+k = , π̂+ jk = ;
n +++ n +++

and as in (3.8), we have



J
π̂i+k = exp λ + λ̂iX + λ̂kZ + λ̂kiZ X exp λ̂Yj + λ̂iXj Y + λ̂YjkZ ,
j=1


I
π̂+ jk = exp λ + λ̂Yj + λ̂kZ + λ̂YjkZ exp λ̂iX + λ̂iXj Y + λ̂kiZ X
i=1

As shown above, marginal probabilities πi j+ , πi+k , and π+ jk can be estimated in


explicit forms of observations n i j+ , n i+k , and n + jk ; however, the model parameters
in model [X Y, Y Z , Z X ] cannot be obtained in explicit forms of the observations.
Hence, in this model, an iteration method is needed to get the ML estimators of the
model parameters.
In the ML estimation of model [X Y, Y Z ], the following estimators can be derived
as follows:
n i j+ n + jk
π̂i j+ = , π̂+ jk = .
n +++ n +++

Thus, by solving the following equations



n i j+
λ + λ̂iX + λ̂Yj + λ̂iXj Y = log ,
n +++

n + jk
λ + λ̂Yj + λ̂kZ + λ̂YjkZ = log
n +++


we can obtain the ML estimators λ, λ̂iX , λ̂Yj , λ̂kZ , λ̂iXj Y , and λ̂YjkZ in explicit forms.
Through a similar discussion, the ML estimators of parameters in models [X Y, Z ]
and [X, Y, Z ] can also be obtained in explicit forms as well.

3.4 Generalized Linear Models

Generalized linear models (GLMs) are designed by random, systematic, and link
components and make useful regression analyses of both continuous and categorical
response variables [13, 15]. There are many cases of non-normal response vari-
ables in various fields of studies, e.g., biomedical researches, behavioral sciences,
economics, etc., and GLMs play an important role in regression analyses for the
66 3 Analysis of the Association in Multiway Contingency Tables

non-normal response variables. GLMs include various kinds of regression models,


e.g., ordinary linear regression model, loglinear models, Poisson regression mod-
els, and so on. In this section, variables are classified into response (dependent)
and explanatory
(independent)

variables. Let Y be the response variable, and let
X = X 1 , X 2 , . . . , X p be an explanatory variable vector. The GLMs are composed
of three components (i)–(iii) explained below.

(i) Random Component

Let f (y|x) be the conditional density or probability function. Then, the function is
assumed to be the following exponential family of distributions:

yθ − b(θ )
f (y|x) = exp + c(y, ϕ) , (3.9)
a(ϕ)

where θ and ϕ are the parameters. This assumption is referred to as the random
component. If Y is the Bernoulli trial, then the conditional probability function is
 
π
f (y|x) = π y (1 − π )1−y = exp ylog + log(1 − π ) .
1−π

Corresponding to (3.9), we have



π
θ = log , a(ϕ) = 1, b(θ ) = −log(1 − π ), c(y, ϕ) = 0. (3.10)
1−π

For normal variable Y with mean μ and variance σ 2 , the conditional density
function is
 
1 yμ − 21 μ2 y2
f (y|x) = √ exp + − 2 ,
2π σ 2 σ2 2σ

where

1 2 y2
θ = μ, a(ϕ) = σ 2 , b(θ ) = μ , c(y, ϕ) = − 2 . (3.11)
2 2σ
Remark 3.3 For distribution (3.9), the expectation and variance are calculated. For
simplicity, response variable is assumed to be continuous. Since

f (y|x)dy = 1,

we have
 
d 1

f (y|x)dy = y − b (θ ) f (y|x)dy = 0. (3.12)


dθ a(ϕ)
3.4 Generalized Linear Models 67

Hence,

E(Y ) = b (θ ). (3.13)

Moreover, differentiating (3.12) with respect to θ , we obtain

Var(Y ) = b (θ )a(ϕ).

In this sense, the dispersion parameter a(ϕ) relates to the variance of the response
variable. If response variable Y is discrete, the integral in (3.12) is replaced by an
appropriate summation with respect to Y.

(ii) Systematic Component



T
For regression coefficients β0 and β = β1 , β2 , . . . , β p , the linear predictor is
given by

η = β0 + β1 x1 + β2 x2 + . . . + β p x p = β0 + β T x, (3.14)

where

T
x = x1 , x2 , . . . , x p .

(iii) Link Function

For function h(u), mean (3.13) and predictor (3.14) are linked as follows:

b (θ ) = h −1 β0 + β T x .

Eventually, according to the link function, θ is regarded as a function of β T x, so


for simplicity of the discussion, we describe it as

θ = θ βT x .

By using the above three components, regression models can be constructed


flexibly.

Example 3.1 For Bernoulli random variable Y (3.10), the following link function is
assumed:
u
h(u) = log .
1−u

Then, from
π
h(π ) = log = β0 + β T x,
1−π
68 3 Analysis of the Association in Multiway Contingency Tables

we have the conditional mean of Y given x is


exp β0 + β T x
f (1|x) =
.
1 + exp β0 + β T x

The above model is a logistic regression model. For a normal distribution with
(3.11), the identity function

h(u) = u

is used. Then, the GLM is an ordinary linear regression model, i.e.,

μ = β0 + β T x

In both cases, we have

θ = β0 + β T x

and the links are called canonical links.

3.5 Entropy Multiple Correlation Coefficient for GLMs

In GLMs composed of (i), (ii), and (iii) in the previous section, for baseline (X, Y ) =
(x 0 , y0 ), the log odds ratio is given by

(y − y0 ) θ β T x − θ β T x 0
log OR(x, x 0 ; y, y0 ) = , (3.15)
a(ϕ)

where
f (y|x) f (y0 |x 0 )
OR(x, x 0 ; y, y0 ) = . (3.16)
f (y0 |x) f (y|x 0 )

The above formulation of log odds ratio is similar to that of the association
model

(2.35). Log odds ratio (3.15) is viewed as an inner product of y and θ β T x with
respect to the dispersion parameter a(ϕ). From (3.16), since

log OR(x, x 0 ; y, y0 ) = {(−log f (y0 |x)) − (−log f (y|x))}


− {(−log f (y0 |x 0 )) − (−log f (y|x 0 ))},
3.5 Entropy Multiple Correlation Coefficient for GLMs 69

predictor β T x is related to the reduction in uncertainty of response Y through explana-


tory variable X. For simplicity of the discussion, variables X and Y are assumed to
be random. Then, the expectation of (3.15) with respect to the variables is calculated
as

Cov Y, θ β T X (E(Y ) − y0 ) E θ β T X − θ β T x 0
+ ,
a(ϕ) a(ϕ)

The first term can be viewed as the mean change of uncertainty of response
variable Y in explanatory variables X for baseline Y = E(Y ). Let g(x) and f (y)
be the marginal density functions of X and Y, respectively. Then, as in the previous
section, we have

¨
Cov Y, θ β T X f (y|x)
= f (y|x)g(x)log dxdy
a(ϕ) f (y)
¨
f (y)
+ f (y)g(x)log dxdy. (3.17)
f (y|x)

The above quantity is the sum of the two types of the KL information, so it is
denoted by KL(X, Y ). If X and/or Y are discrete, the integrals in (3.17) are replaced
by appropriate summations with respect to the variables. If X is a factor, i.e., not
random, taking levels X = x 1 , x 2 , . . . , x K , then, (3.17) is modified as

K 
Cov Y, θ β T X  1 f (y|x k )
= f (y|x k ) log dy
a(ϕ) k=1
K f (y)
 K 
1 f (y)
+ f (y) log dy.
k=1
K f (y|x k)

From the Cauchy inequality, it follows that

√ 

Cov Y, θ β T X Var(Y ) Var θ β T X


KL(X, Y ) = ≤ . (3.18)
a(ϕ) a(ϕ)

Definition 3.1 The entropy (multiple) correlation coefficient (ECC) between X and
Y in GLM (3.9) with (i), (ii), and (iii) is defined as follows [7]:

Cov Y, θ β T X
ECorr(Y, X) = √ 

. (3.19)
Var(Y ) Var θ β T X

From (3.18) and (3.19),

0 ≤ ECorr(Y, X) ≤ 1.
70 3 Analysis of the Association in Multiway Contingency Tables

Since inequality (3.18) indicates the upper bound of KL(X, Y ) is




Var(Y ) Var(θ (β T X ))
a(ϕ)
,
as in the previous section, ECC is interpreted as the propor-
tion of the explained entropy of response variable Y by explanatory variable vector
X, and it can be used for a predictive or explanatory power measure for GLMs. This
measure is an extension of the multiple correlation coefficient R in ordinary linear
regression analysis.

Remark 3.4 KL(X, Y ) (3.17) can also be expressed as


¨
Cov Y, θ β T X
= ( f (y|x) − f (y))g(x)log f (y|x)dxdy.
a(ϕ)

Theorem 3.1 In (3.19), ECC is decreasing as a(ϕ), given Var(Y ) and Var θ β T X .

Proof For simplicity of the discussion, let us set E(Y ) = 0, and the proof is given in
the case where explanatory variable X is continuous and random. Let f (y|x) be the
conditional density or probability function of Y, and let g(x) be the marginal density
or probability function of X. Then,
¨

Cov Y, θ β X T
= yθ β T x f (y|x)g(x)dxdy.

Differentiating the above covariance with respect to a(ϕ), we have


¨
d T

Cov Y, θ β X = yθ β T x f (y|x)g(x)dxdy
da(ϕ) da(ϕ)
¨

d
= yθ β T x f (y|x)g(x)dxdy
da(ϕ)
¨
1 T

2
=− 2
yθ β x f (y|x)g(x)dxdy ≤ 0.
a(ϕ)

Thus, the theorem follows. 

For a GLM with the canonical link, ECC is the correlation coefficient between
response variable Y and linear predictor

θ = β0 + β T x, (3.20)

where

T
β = β1 , β2 , . . . , β p .
3.5 Entropy Multiple Correlation Coefficient for GLMs 71

Then, we have
p
βi Cov(Y, X i )
ECorr(Y, X) = √ i=1

. (3.21)
Var(Y ) Var β T X

Especially, for simple regression,

ECorr(Y, X ) = |Corr(Y, X )|. (3.22)

The property (3.21) is preferable for a predictive power measure, because


the effects (contributions) of explanatory variables (factors) may be assessed by
components related to the explanatory variables βi Cov(Y, X i ).

Theorem 3.2 In GLMs with canonical links (3.20), if explanatory variables


X 1 , X 2 , . . . , X p are statistically independent,

βi Cov(Y, X i ) ≥ 0, i = 1, 2, . . . , p.

Proof For simplicity of the discussion, we make a proof of the theorem for continu-
ous variables. Since explanatory variables X 1 , X 2 , . . . , X p are independent, the joint
density function is expressed as follows:


p
g(x) = gi (xi ),
i=1

where gi (xi ) are the marginal density or probability functions. Without the loss of
generality, it is sufficient to show

β1 Cov(Y, X 1 ) ≥ 0

The joint density function of Y and X = X 1 , X 2 , . . . , X p , f (y, x), and the



conditional density function of Y and X 1 given X /1 = X 2 , . . . , X p , f 1 y|x /1 , are


expressed by


p
f (y, x) = f (y|x) gi (xi ),
i=1

/1

f (y, x)
f 1 y|x = p dx1 ,
i=2 gi (x i )

where f (y|x) is the conditional density or probability function of Y, given X = x.



p
Since f 1 y|x /1 is a GLM with canonical link θ = β0 + βi X i , we have
i=2
72 3 Analysis of the Association in Multiway Contingency Tables

¨ 
p
f (y|x)
0≤ f (y|x) gi (xi )log
dxdy
i=1
f 1 y|x /1
¨

/1

p
f 1 y|x /1
+ f 1 y|x gi (xi )log dxdy
i=1
f (y|x)
¨


p

= f (y|x)g1 (x1 ) − f y|x /1 gi (xi )log f (y|x)dxdy
i=1
= β1 Cov(Y, X 1 ).

This completes the theorem. 

As a predictive power measure, the regression correlation coefficient (RCC) is


also used [17]. The regression function is the conditional expectation of Y given X,
i.e., E(Y |X) and RCC are defined by

Cov(Y, E(Y |X))


RCorr(Y, X) = √ √ . (3.23)
Var(Y ) Var(E(Y |X))

The above measure also satisfies

0 ≤ RCorr(Y, X) ≤ 1

and in linear regression analysis it is the multiple correlation coefficient R. Since


Cov(Y, E(Y |X)) = Var(E(Y |X)), RCC can also be expressed as follows:

Var(E(Y |X))
RCorr(Y, X) = .
Var(Y )

The measure does not have such a decomposition property as ECC (3.21). In this
respect, ECC is more preferable for GLMs than RCC. Comparing ECC and RCC,
ECC measures the predictive power of GLMs in entropy; whereas, RCC measures
that in the Euclidian space. As discussed above, regression coefficient β in GLMs is
related to the entropy of response variable Y. From this point of view, ECC is more
advantageous than RCC.

Theorem 3.3 In GLMs, the following inequality holds:

RCorr(Y, X) ≥ ECorr(Y, X).

Proof Since

Cov Y, θ β T X = Cov E(Y |X), θ β T X ,


Cov(Y, E(Y |X)) = Var(E(Y |X)),
3.5 Entropy Multiple Correlation Coefficient for GLMs 73

standardizing them, we have


Cov Y, θ β T X Cov E(Y |X), θ β T X Cov(E(Y |X), E(Y |X))


√ 

=√ 

≤√ √
Var(Y ) Var θ β T X Var(Y ) Var θ β T X Var(Y ) Var(E(Y |X))

Var(E(Y |X))
=√ √ ,
Var(Y ) Var(E(Y |X))

Hence, the inequality of the theorem follows. The equality holds if and only if
there exist constants c and d such that

θ β T X = cE(Y |X) + d.

This completes the theorem. 

Example 3.2 For binary response variable Y and explanatory variable X, the
following logistic regression model is considered:

exp{y(α + βx)}
f (y|x) = , y = 0, 1.
1 + exp(α + βx)

Since the link is canonical and the regression is simple, ECC can be calculated
by (3.22). On the other hand, in this model, the regression function is

exp{(α + βx)}
E(Y |x) = .
1 + exp(α + βx)

From this, RCC is obtained as Corr(Y, E(Y |X )).

Example 3.3 Table 3.1 shows the data of killed beetles after exposure to gaseous
carbon disulfide [4]. Response variable Y is

Y = 0(killed), 1(alive).

For the data, a complementary log–log model was applied. In the model, for
function h(u), the link is given by

h −1 (E(Y |x)) = log(−log(1 − E(Y |x))) = α + βx.

Table 3.1 Beetle mortality data


Log dose 1.691 1.724 1.755 1.784 1.811 1.837 1.861 1.884
No. beetles 59 60 62 56 63 59 62 60
No. killed 6 13 18 25 52 53 61 60
Source Bliss [4], Agresti [2]
74 3 Analysis of the Association in Multiway Contingency Tables

1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1.65 1.7 1.75 1.8 1.85 1.9

Fig. 3.1 Estimated conditional probability function in log dose X

Since the response variable is binary, E(Y |x) is equal to the conditional probability
function of Y, and it is given by

f (y|x) = {1 − exp(−exp(α + βx))} y exp(−(1 − y)exp(α + βx)).

Then, θ in (3.9) is

f (1|x) 1 − exp(−exp(α + βx))


θ = log = .
1 − f (1|x) exp(−exp(α + βx))

From the data, the estimates of the parameters are obtained as follows:

α̂ = −39.52, β̂ = 22.01.

The graph of the estimated f (y|x) in log dose X is given in Fig. 3.1. ECC and
RCC are calculated according to (3.19) and (3.23), respectively, as follows:

ECC = 0.681, RCC = 0.745.

According to ECC, 68.1% of entropy in Y is reduced by the explanatory variable.

Definition 3.2 Let subX Y be a subset of the sample space of explanatory variable
vector
X
and response variable Y in a GLM. Then, the correlation coefficient between
θ β T X and Y restricted in sub
X Y is referred to as the conditional
entropy

correlation
coefficient between X and Y, which is denoted by ECorr Y, X| sub XY .
T
variable vector X = (X 1 , X 2 ) , the correlation coefficient
T
For explanatory
between θ β X and Y given X 2 = x2 is the conditional correlation coefficient
between X and Y, i.e.,
3.5 Entropy Multiple Correlation Coefficient for GLMs 75

ECorr(Y, X|X 2 = x2 ) = Corr Y, θ β T X |X 2 = x2

In an experimental case, let X be a factor with quantitative levels {c1 , c2 , . . . , c I }


and βx be the systematic component. In this case, the factor levels are randomly
assigned to an experimental unit (subject), so it implies P(X = ci ) = 1I , and
 
restricting the factor levels to ci , c j , i.e.,
 
sub
X Y = (x, y)|x = ci or c j ,


1
it follows that P X = ci | sub
X Y = 2 . From this, it follows that


θ (βci ) + θ βc j
E θ (β X )| X Y =
sub
,
2


E(Y |X = ci ) + E Y |X = c j
E Y | X Y =
sub
,
2
1

Cov(Y, θ (β X )) = E(Y |X = ci ) − E Y |X = c j θ (βci ) − θ βc j ,


4

2

θ (βci ) − θ βc j
Var θ (β X )| X Y =
sub
.
4
Then, we have



 E(Y |X = ci ) − E Y |X = c j 
ECorr Y, X | sub
XY = 
, (3.24)
2 Var Y | subXY

The results are similar to those of pairwise comparisons of the effects of factor
levels. If response variable Y is binary, i.e.,Y ∈ {0, 1}, we have

ECorr Y, X | sub
XY


 E(Y |X = ci ) − E Y |X = c j 
= 

.
E(Y |X = ci ) + E Y |X = c j 2 − E(Y |X = ci ) − E Y |X = c j
(3.25)

Example 3.4 We apply (3.24) to the data in Table 3.1. For X = 1.691, 1.724, the par-
tial data are illustrated in Table 3.2. In this case, we calculate the estimated conditional
probabilities (Table 3.3), and from (3.25), we have

Table 3.2 Partial table in


Log dose 1.691 1.724
Table 3.1
No. beetles 59 60
No. killed 6 13
76 3 Analysis of the Association in Multiway Contingency Tables

Table 3.3 Estimated conditional probability distribution of beetle mortality data in Table 3.1
Log dose 1.691 1.724
Killed 0.095 0.187
Alive 0.905 0.813

Table 3.4 Conditional ECCs for base line log dose X = 1.691
Log dose 1724 1.755 1.784 1.811 1.837 1.861 1.884
Conditional ECC 0.132 0.293 0.477 0.667 0.822 0.893 0.908

1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1.7 1.72 1.74 1.76 1.78 1.8 1.82 1.84 1.86 1.88 1.9

Fig. 3.2 Estimated conditional ECC in l log dose X

ECorr(Y, X |X = {1.691, 1.724}) = 0.132.

It means the effect of level X = 1.724 for base line X = 1.691, i.e., 13.2% of
entropy of the response is reduced by the level X = 1.724. The conditional ECCs of
X = x for baseline X = 1.691 are calculated in Table 3.4. The conditional ECC is
increasing in log dose X. The estimated conditional ECC is illustrated in Fig. 3.2.

3.6 Multinomial Logit Models

Let X 1 and X 2 be categorical explanatory variables with categories {1, 2, . . . , I } and


{1, 2, . . . , J }, respectively, and let Y be a categorical response variable with levels
{1, 2, . . . , K }. Then, the conditional probability function of Y given explanatory
variables (X 1 , X 2 ) is assumed to be
3.6 Multinomial Logit Models 77

exp αk + β(1)ki + β(2)k j


f (Y = k|(X 1 , X 2 ) = (i, j)) =  K
. (3.26)
k=1 exp αk + β(1)ki + β(2)k j

Let
 
1 (X a = i) 1 (Y = k)
X ai = , a = 1, 2, ; Yk = ,
0 (X a = i) 0 (Y = k)

Then, dummy variable vectors

X 1 = (X 11 , X 12 , . . . , X 1I )T , X 2 = (X 21 , X 22 , . . . , X 2J )T , and
Y = (Y1 , Y2 , . . . , Y K )T

are identified with categorical explanatory variables X 1 , X 2 , and Y, respectively,


where the upper suffix “T ” implies the transpose of the vector and matrix. From this,
the systematic component of the above model can be expressed as follows:

θ = α + B (1)
T
X 1 + B (2)
T
X 2,

where

θ = (θ1 , θ2 , . . . , θ K )T , α = (α1 , α2 , . . . , α K )T ,
⎛ ⎞ ⎛ ⎞
β(1)11 β(1)12 . . . β(1)1I β(2)11 β(2)12 . . . β(2)1J
⎜ β(1)21 β(1)22 . . . β(1)2I ⎟ ⎜ β(2)21 β(2)22 . . . β(2)2J ⎟
⎜ ⎟ ⎜ ⎟
B(1) = ⎜ . .. .. .. ⎟, B(2) = ⎜ .. .. .. .. ⎟.
⎝ . . . . . ⎠ ⎝ . . . . ⎠
β(1)K 1 β(1)K 2 . . . β(1)K I β(2)K 1 β(2)K 2 . . . β(2)K J

Then, the conditional probability function is described as follows:


exp y T α + y T B (1) x 1 + y T B (2) x 2


f ( y|x 1 , x 2 ) = 
,
y exp y α + y B (1) x 1 + y B (2) x 2
T T T

where y implies the summation over all categories y. In this case, a(ϕ) = 1 and
the KL information (3.17) become as follows:

KL(X, Y ) = trCov(θ , Y ),

where Cov(θ , Y ) is the K × K matrix with (i. j) elements Cov θi , Y j . From this,
ECC can be extended as
trCov(θ , Y )
ECorr(Y , (X 1 , X 2 )) = √ √ ,
trCov(θ, θ ) trCov(Y , Y )
78 3 Analysis of the Association in Multiway Contingency Tables

where trCov(θ , θ ) the K × K matrix with (i. j) elements


Cov θi , θ j ; and
trCov(Y , Y )the K × K matrix with (i. j) elements Cov Yi , Y j . Since


K
trCov(θ , Y ) = Cov(θk , Yk ) = trB(1) Cov(X 1 , Y ) + trB(2) Cov(X 2 , Y ),
k=1


K 
K
trCov(θ , θ ) = Var(θk ), trCov(Y , Y ) = Var(Yk ),
k=1 k=1

we have
trB (1) Cov(X 1 , Y ) + trB (2) Cov(X 2 , Y )
ECorr(Y , (X 1 , X 2 )) =   ,
K K
k=1 Var(θk ) k=1 Var(Yk )

From this, the contributions of categorical variables X 1 and X 2 may be assessed


by using the above decomposition.

Remark 3.5 In model (3.26), let

πiXjk1 X 2 Y ≡ P(X 1 = i, X 2 = j, Y = k)

exp αk + β(1)ki + β(2)k j


= K
P(X 1 = i, X 2 = j). (3.27)
k=1 exp αk + β(1)ki + β(2)k j

Then, the above model is equivalent to loglinear model [X 1 X 2 , X 1 Y, X 2 Y ].

Example 3.5 The data for an investigation of factors influencing the primary food
choice of alligators (Table 3.5) are analyzed ([2; pp. 268–271]). In this example,
explanatory variables are X 1 : lakes where alligators live, {1. Hancock, 2. Oklawaha,

Table 3.5 Alligator food choice data


Lake Size Primary food choice
Fish Invertebrate Reptile Bird Other
Hancock ≤2.3 m (S) 23 4 2 2 8
>2.3 m (L) 7 0 1 3 5
Oklawaha S 5 11 1 0 3
L 13 8 6 1 0
Trafford S 5 11 2 1 5
L 8 7 6 3 5
George S 16 19 1 2 3
L 17 1 0 1 3
Source Agresti ([2], p. 268)
3.6 Multinomial Logit Models 79

3. Trafford, 4. George}; and X 2 : sizes of alligators, {1. small, 2. large}; and the
response variable is Y: primary food choice of alligators, {1. fish, 2. invertebrate, 3.
reptile, 4. bird, 5. other}. Model (3.26) is used for the analysis, and the following
dummy variables are introduced:
 
1 (X 1 = i) 1 (X 2 = j)
X 1i = , i = 1, 2, 3, 4; X 2 j = , j = 1, 2, ;
0 (X 1 = i) 0 (X 2 = j)

1 (Y = k)
Yk = , k = 1, 2, 3, 4, 5.
0 (Y = k)

Then, the categorical variables X 1 , X 2 , and the response variable Y are identified
with the correspondent dummy random vectors:

X 1 = (X 11 , X 12 , X 13 , X 14 )T , X 2 = (X 21 , X 22 )T , and Y = (Y1 , Y2 , Y3 , Y4 , Y5 )T ,

respectively. Under the constraints,

β(1)ki = 0 for k = 5, i = 4; β(2)k j = 0 for k = 5, j = 2,

the following estimates of regression coefficients are obtained:


⎛ ⎞ ⎛ ⎞
−0.826 −0.006 −1.516 0 −0.332 0
⎜ −2.485 −0.394 0⎟ ⎜ 1.127 0 ⎟
⎜ 0.931 ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
B̂ (1) = ⎜ 0.417 2.454 1.419 0 ⎟, B̂ (2) = ⎜ −0.683 0 ⎟.
⎜ ⎟ ⎜ ⎟
⎝ −0.131 −0.659 −0.429 0⎠ ⎝ −0.962 0 ⎠
0 0 0 0 0 0

From this, we have


 

trB̂(1) Cov(X 1 , Y ) = 0.258, trB̂(2) Cov(X 2 , Y ) = 0.107,


T 
T 

trB̂(1) B̂(1) Cov(X 1 , X 1 ) = 2.931, trB̂(2) B̂(2) Cov(X 2 , X 2 ) = 0.693,




trCov(Y , Y ) = 0.691.

From the above estimates, we calculate the ECC as


 

tr B̂ (1) Cov(X 1 , Y ) + tr B̂ (2) Cov(X 2 , Y ) 0.258 + 0.107


ECorr(Y , (X 1 , X 2 )) =   =√ √
2.931 + 0.693 0.691
 

trCov(θ , θ ) trCov(Y , Y )
= 0.231.

Although the effects of factors are statistically significant, the predictive power
of the logit model may be small, i.e., only 23.1% of the uncertainty of the response
80 3 Analysis of the Association in Multiway Contingency Tables

variable is reduced or explained by the explanatory variables. In this sense, the


explanatory power of the model is weak. From the above analysis, we assess the
effects of lake and size on food, respectively. Since
 

tr B̂ (1) Cov(X 1 , Y ) = 0.258 and tr B̂ (2) Cov(X 2 , Y ) = 0.107.

It may be thought that the effect of lake on food is about 2.4 times larger than that
of size.

3.7 Entropy Coefficient of Determination

Measurement of the predictive or explanatory power is important in GLMs, and it


leads to the determination of important factors in GLMs. In ordinary linear regression
models, the coefficient of determination is useful to measure the predictive power
of the models. As discussed above, since GLMs have an entropy-based property, it
is suitable to consider predictive power measures for GLMs from a viewpoint of
entropy.
Definition 3.3 A general coefficient of determination is defined by

D(Y ) − D(Y |X)


R 2D = , (3.28)
D(Y )

where D(Y ) and D(Y |X) are a variation function of Y and a conditional variation
function of Y given X, respectively [1, 5, 12]. D(Y |X) implies that

D(Y |X) = D(Y |X = x)g(x)dx,

where g(x) is the marginal density or probability function of X [1, 11].


For ordinary linear regression models,

D(Y ) = Var(Y ), D(Y |X) = Var(Y |X),

A predictive power measure based on the likelihood function (15, 16) is given as
follows:
 n2
L(0)
R L2 = 1 − , (3.29)
L(β)

where L(β) is a likelihood function relating to the regression coefficient vector β,


and n is the sample size. For the ordinary linear regression model, the above measure
is the coefficient of determination R 2 ; however, it is difficult to interpret the measure
3.7 Entropy Coefficient of Determination 81

(3.29) in general cases. This is a drawback of this measure. For GLMs with categorical
response variables, the following entropy-based measure is proposed:

H (Y ) − H (Y |X)
R 2E = , (3.30)
H (Y )

where H (Y ) and H (Y |X) are the entropy of Y and the conditional entropy given
X [11]. As discussed in Sect. 3.4, GLMs have properties in entropy, e.g., log odds
ratios (3.15) and KL information (3.17) are related to linear predictors. We refer the
entropy-based measure (3.17), i.e., KL(X, Y ), to as a basic

predictive power measure


for GLMs. The measure is increasing in Cov Y, θ β T X and decreasing in a(ϕ).
Since

Var(Y |X = x) = a(ϕ)b θ β T x

and

Cov Y, θ β T X
KL(X, Y ) = ,
a(ϕ)

function
a(ϕ)

can be interpreted as error variation of Y in entropy, and


Cov Y, θ β T X is the explained variation of Y by X. From this, we define entropy
variation function of Y as follows:

D E (Y ) = Cov Y, θ β T X + a(ϕ), (3.31)

Since



Cov Y, θ β X |X =
T
(Y − E(Y |x))θ β T x g(x)dx = 0,

the conditional entropy variation function D E (Y |X)is calculated as



D E (Y |X) = Cov Y, θ β T X |X + a(ϕ) = a(ϕ). (3.32)

According to (3.28), we have the following definition:

Definition 3.4 The entropy coefficient of determination (ECD) [8] is defined by

D E (Y ) − D E (Y |X)
ECD(X, Y ) = . (3.33)
D E (Y )

ECD can be described by using KL(X, Y ), i.e.,


Cov Y, θ β T X KL(X, Y )
ECD(X, Y ) = T

= . (3.34)
Cov Y, θ β X + a(ϕ) KL(X, Y ) + 1
82 3 Analysis of the Association in Multiway Contingency Tables

From the above formulation, ECD is interpreted as the ratio of the explained
variation of Y by X for the variation of Y in entropy. For ordinary linear regression
2
T
ECDT is the coefficient of determination R , and for canonical link, i.e.,
models,
θ β X = β X, ECD has the following decomposition property:
p
βi Cov(Y, X i )
ECD(X, Y ) = i=1 T

.
Cov Y, θ β X + a(ϕ)

Thus, ECC and ECD have the decomposition property with respect to explanatory
variables X i . This property is preferable, because the contributions of explanatory
variables X 1 can be assessed with components

βi Cov(Y, X i )

,
Cov Y, θ β T X + a(ϕ)

if the explanatory variables are statistically independent. A theoretical discussion


on assessing explanatory variable contribution is given in Chap. 7. We have the
following theorem:

T
Theorem 3.4 Let X = X 1 , X 2 , . . . , X p be an explanatory variable vector; let

T
X \i = X 1 , X 2 , . . . , X i−1 X i+1 , . . . , X p be a sub-vector of X; and let Y be a
response variable. Then,

KL(X, Y ) ≥ KL X \i , Y , i = 1, 2, . . . , p.

Proof Without loss of generality, the theorem is proven for k = 1, i.e., X \1 =



T
X 2 , . . . , X p . For simplicity of the discussion, the variables concerned are
assumed

to be continuous. Let f (x, y) be the joint density function of X and Y; f 1 x \1 , y be


the joint density function of X \1 and Y; f Y (y) be the marginal

density function of Y;
g(x) be the marginal density function of X; and let g1 x \1 be the marginal density
function of X \1 . Since

f (x, y) = f x1 , x \1 , y , g(x) = g x1 , x \1 ,

we have

˚ f x1 , x \1 , y
\1
KL(X, Y ) − KL X , Y = \1
f x1 , x , y log
dx1 dx \1 dy
f Y (y)g x1 , x \1

˚ f Y (y)g x1 , x \1
+ \1
f Y (y)g x1 , x log
dx1 dx \1 dy.
f x1 , x \1 , y

˚ f 1 x \1 , y
− f x1 , x \1 , y log
dx1 dx \1 dy
f Y (y)g1 x \1
3.7 Entropy Coefficient of Determination 83

˚ f Y (y)g1 x \1
− f Y (y)g x1 , x \1 log
dx1 dx \1 dy
f 1 x \1 , y

˚ f x1 , x \1 , y g1 x \1
= f x1 , x \1 , y log

dx1 dx \1 dy
f 1 x \1 , y g x1 , x \1

˚ f 1 x \1 , y g x1 , x \1
+ f Y (y)g x1 , x \1 log \1

dx1 dx \1 dy.
g1 x f x1 , x \1 , y

˚ f x1 , x \1 , y /g x1 , x \1
= \1
f x1 , x , y log

dx1 dx \1 dy
f 1 x \1 , y /g1 x \1

˚ g x1 , x \1 /g1 x \1
+ f Y (y)g x1 , x \1 log

dx1 dx \1 dy ≥ 0.
f x1 , x \1 , y / f 1 x \1 , y

This completes the theorem. 

The above theorem guarantees that ECD is increasing in complexity of GLMs,


where “complexity” implies the number of explanatory variables. This is an
advantage of ECD.

Example 3.6 Applying ECD to Example 3.5, from (3.34) we have


 

tr B̂ (1) Cov(X 1 , Y ) + tr B̂ (2) Cov(X 2 , Y ) 0.258 + 0.107


ECD(Y , (X 1 , X 2 )) =   =
tr B̂ (1) Cov(X 1 , Y ) + tr B̂ (2) Cov(X 2 , Y ) + 1 0.258 + 0.107 + 1
= 0.267.

From this, 26.5% of the variation of response variable Y in entropy is explained


by explanatory variables X 1 and X 2 .

Desirable properties for predictive power measures for GLMs are given as follows
[9]:
(i) A predictive power measure can be interpreted, i.e., interpretability.
(ii) The measure is the multiple correlation coefficient or the coefficient of
determination in normal linear regression models.
(iii) The measure has an entropy-based property.
(iv) The measure is applicable to all GLMs (applicability to all GLMs).
(v) The measure is increasing in the complexity of the predictor (monotonicity in
the complexity of the predictor).
(vi) The measure is decomposed into components with respect to explanatory
variables (decomposability).
First, RCC (3.23) is checked for the above desirable properties. The measure
can be interpreted as the correlation coefficient (cosine) of response variable Y and
the regression function E(Y |X) and is the multiple correlation coefficient R in the
ordinary linear regression analysis; however, the measure has no property in entropy.
The measure is available only for single continuous or binary variable cases. For
84 3 Analysis of the Association in Multiway Contingency Tables

polytomous response variable cases, the measure cannot be employed. It has not
been proven whether the measure satisfies property (v) or not, and it is trivial for
the measure not to have property (vi). Second, R L2 is considered. From (3.29), the
measure is expressed as
2 2
L(β) n − L(0) n
R L2 = 2 ;
L(β) n ,
2 2
however, L(β) n and L(0) n cannot be interpreted. Let β̂ be the ML estimates of β.
Then, the likelihood ratio statistic is
⎛ ⎞ n2
L β̂ 1
⎝ ⎠ = .
L(0) 1 − R2

From this, this measure satisfies property (ii). Since logL(β) is interpreted in a
viewpoint of entropy, so the measure has property (iii). This measure is applicable
to all GLMs and increasing in the number of explanatory variables, because it is
based on the likelihood function; however, the measure does not have property (vi).
Concerning R 2E , the measure was proposed for categorical response variables, so the
measure has properties except (ii) and (vi). For ECC, as explained in this chapter
and the previous one, this measure has (i), (ii), (iii), and (vi); and the measure is
the correlation coefficient of the response variable and the predictor of explanatory
variables, so this measure is not applicable to continuous response variable vectors
and polytomous response variables. Moreover, it has not proven whether the measure
has property (v) or not. As discussed in this chapter, ECD has all the properties (i)
to (vi). The above discussion is summarized in Table 3.6.
Mittlebock and Schemper [14] compared some summary measures of association
in logistic regression models. The desirable properties are (1) interpretability, (2)
consistency with basic characteristics of logistic regression, (3) the potential range
should be [0,1], and (4) R 2 . Property (2) corresponds to (iii), and property (3) is met

Table 3.6 Properties of predictive power measures for GLMs


RCC R 2L R 2E ECC ECD
(i) Interpretability ◯ × ◯ ◯ ◯
(ii) Ror R 2 ◯ ◯ × ◯ ◯
(iii) Entropy × ◯ ◯ ◯ ◯
(iv) All GLMs × ◯ ◯ × ◯
(v) Monotonicity  ◯ ◯  ◯
(vi) Decomposition × × × ◯ ◯
◯: The measure has the property; : may have the property, but has not been proven; ×: does not
have the property
3.7 Entropy Coefficient of Determination 85

by ECD and ECC. In the above discussion, ECD is the most suitable measure for the
predictive power of GLMs. A similar discussion for binary response variable models
was given by Ash and Shwarts [3].

3.8 Asymptotic Distribution of the ML Estimator of ECD

The entropy coefficient of determination is a function of KL(X, Y ) as shown in


(3.34), so it is sufficient to derive the asymptotic property of the ML estimator of
KL(X, Y ). Let f (y|x) be the conditional density or probability function of Y given
X = x, and let f (y) be the marginal density or probability function of Y. First, we
consider the case where explanatory variables are not random. We have the following
theorem [8]:

Theorem 3.5 In GLMs with systematic component (3.14), let x k , k = 1, 2, . . . , K


be the levels of p dimensional explanatory variable vector X and let n k be the
numbers of observations at levels x k , k = 1, 2, . . . , K ; Then, for sufficiently large


n k , k = 1, 2, . . . , K , statistic N K L (X, Y ) is asymptotically chi-square distribution



with degrees of freedom p under the null hypothesis H0 : β = 0, where N = nk=1 n k .

Proof Let (x k , yki ), i = 1, 2, . . . , n k ; k = 1, 2, . . . , K be random samples. Let




fˆ(y|x), fˆ(y), and K L (X, Y ) be the ML estimators of f (y|x), f (y), and K L(X, Y ),
respectively. Then, the likelihood functions under H0 : β = 0 and H1 : β = 0 are,
respectively, given as follows:


K 
nk
l0 = log fˆ(yki ), (3.35)
k=1 i=1

and


K 
nk
l= log fˆ(yki |x k ). (3.36)
k=1 i=1

Then,

1  fˆ(yki |x k )
K nk
1
(l − l0 ) = log
N N k=1 i=1 fˆ(yki )
K 
nk f (y|x k )
→ f (y|x k )log dy(in probability),
k=1
N f (y)

as n k → ∞, k = 1, 2, . . . , K . Under the null hypothesis, it follows that


86 3 Analysis of the Association in Multiway Contingency Tables

 
fˆ(y|x k )
K
nk
fˆ(y|x k )log dy
k=1
N fˆ(y)
K 
nk f (y|x k )
→ f (y|x k )log dy(in probability).
k=1
N f (y)

For sufficiently large n k , k = 1, 2, . . . , K , under the null hypothesis: H0 : β = 0,


we also have

    
fˆ(y|x k ) fˆ(y)
K K
nk nk 1
fˆ(y|x k )log dy = fˆ(y)log dy + o ,
k=1
N fˆ(y) k=1
N fˆ(y|x k ) N

where

1
N ·o → 0 (in probability).
N

From the above discussion, we have


  !
 
K
nk fˆ(y|x k ) fˆ(y)
KL(X, Y ) = ˆ
f (y|x k )log dy + fˆ(y)log dy
k=1
N fˆ(y) fˆ(y|x k )

2 1
= (l − l0 ) + o .
N N

From this, under the null hypothesis: H0 : β = 0, it follows that




1
N KL(X, Y ) = 2(l − l0 ) + N · o .
N

Hence, the theorem follows. 

As in the above theorem, for random explanatory variables X, the following


theorem holds:


Theorem 3.6 For random explanatory variables X, statistic N K L (X, Y ) is asymp-


totically chi-square distribution with degrees of freedom p under the null hypothesis
H0 : β = 0 for sufficiently large sample size N.

Proof Let (x i , yi ), i = 1, 2, . . . , N , be random samples, and let fˆ(y), ĝ(x), and


fˆ(y|x) be the ML estimators of functions f (y), g(x), and f (y|x), respectively. Let


N
l0 = log fˆ(yi )ĝ(x i )
i=1
3.8 Asymptotic Distribution of the ML Estimator of ECD 87

and


N
l= log fˆ(yi |x i )ĝ(x i ).
i=1

Then, by using a similar method in the previous theorem, as N → ∞, we have


¨ 
1  fˆ(yi |x i ) fˆ(y|x)
N
1 ˆ 1
(l − l0 ) = log = f (y|x)ĝ(x)log dxdy + o
N N i=1 ˆ
f (yi ) ˆ
f (y) N
¨
f (y|x)
→ f (y|x)g(x)log dxdy(in probability).
f (y)

Under the null hypothesis: H0 : β = 0, we also have


¨ ¨ 
fˆ(y|x) fˆ(y) 1
fˆ(y|x)ĝ(x)log dxdy = fˆ(y)ĝ(x)log dxdy + o .
fˆ(y) ˆ
f (y|x) N

From this, it follows that


 ¨
1 fˆ(y|x)
2(l − l0 ) + N · o =N fˆ(y|x)ĝ(x)log dxdy
N fˆ(y)
¨  
fˆ(y) 1
+ fˆ(y)ĝ(x)log dxdy + o
ˆ
f (y|x) N


1
= N KL(X, Y ) + o .
N

This completes the theorem as follows. 

By using the above theorems, we can test H0 : β = 0 versus H1 : β = 0


Example
3.7
In Example 3.5, the test for H0 : B (1) , B (2) = 0 versus
H1 : B (1) , B (2) = 0 is performed by using the above discussion. Since N = 219,we
have

 

219KL(Y , (X 1 , X 2 )) = 219 tr B̂ (1) Cov(X 1 , Y ) + tr B̂ (2) Cov(X 2 , Y )
= 219(0.258 + 0.107) = 79.935.(d f = 16, P = 0.000).

From this, regression coefficient matrix B (1) , B (2) is statistically significant.


88 3 Analysis of the Association in Multiway Contingency Tables

3.9 Discussions

In this chapter, first, loglinear models have been discussed. In the models, the loga-
rithms of probabilities in contingency tables are modeled with association parame-
ters, and as pointed out in Sect. 3.2, the associations between the categorical variables
in the loglinear models are measured with odds ratios. Since the number of odds ratios
is increasing as those of categories of variables increasing, further studies are needed
to make summary measures of association for the loglinear models. There may be
a possibility to make entropy-based association measures that are extended versions
of ECC and ECD. Second, predictive or explanatory power measures for GLMs have
been discussed. Based on an entropy-based property of GLMs, ECC and ECD were
considered. According to six desirable properties for predictive power measures as
listed in Sect. 3.7, ECD is the best to measure the predictive power.

References

1. Agresti, A. (1986). Applying R2 -type measures to ordered categorical data. Technometrics, 28,
133–138.
2. Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: John Wiley & Sons Inc.
3. Ash, A., & Shwarts, M. (1999). R2 : A useful measure of model performance with predicting
a dichotomous outcome. Stat Med, 18, 375–384.
4. Bliss, C. I. (1935). The calculation of the doze-mortality curve. Ann Appl Biol, 22, 134–167.
5. Efron, B. (1978). Regression and ANOVA with zero-one data: measures of residual variation.
J Am Stat Assoc, 73, 113–121.
6. Eshima, N., & Tabata, M. (1997). The RC(M) association model and canonical correlation
analysis. J Jpn Stat Soc, 27, 109–120.
7. Eshima, N., & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive
power of generalized linear models. Stat Probab Lett, 77, 588–593.
8. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear
models. Comput Stat Data Anal, 54, 1381–1389.
9. Eshima, N., & Tabata, M. (2011). Three predictive power measures for generalized linear
models: entropy coefficient of determination, entropy correlation coefficient and regression
correlation coefficient. Comput Stat Data Anal, 55, 3049–3058.
10. Goodman, L. A. (1981). Association models and canonical correlation in the analysis of cross-
classification having ordered categories. J Am Stat Assoc, 76, 320–334.
11. Haberman, S. J. (1982). Analysis of dispersion of multinomial responses. J Am Stat Assoc, 77,
568–580.
12. Korn, E. L., & Simon, R. (1991). Explained residual variation, explained risk and goodness of
fit. Am Stat, 45, 201–206.
13. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman and
Hall: London.
14. Mittlebock, M., & Schemper, M. (1996). Explained variation for logistic regression. Stat Med,
15, 1987–1997.
15. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear model. J Roy Stat Soc A,
135, 370–384.
References 89

16. Theil, H. (1970). On the estimation of relationships involving qualitative variables. Am J Sociol,
76, 103–154.
17. Zheng, B., & Agresti, A. (2000). Summarizing the predictive power of a generalized linear
model. Stat Med, 19, 1771–1781.
Chapter 4
Analysis of Continuous Variables

4.1 Introduction

Statistical methodologies for continuous data analysis have been well developed for
over a hundred years, and many research fields apply them for data analysis. For
correlation analysis, simple correlation coefficient, multiple correlation coefficient,
partial correlation coefficient, and canonical correlation coefficients [12] were pro-
posed and the distributional properties of the estimators were studied. The methods
are used for basic statistical analysis of research data. For making confidence regions
of mean vectors and testing those in multivariate normal distributions, the Hotelling’s
T 2 statistic was proposed as a multivariate extension of the t statistic [11], and it was
proven its distribution is related to the F distribution. In discriminant analysis of
several populations [7, 8], discriminant planes are made according to the optimal
classification method of samples from the populations, which minimize the misclas-
sification probability [10]. In methods of experimental designs [6], the analysis of
variance has been developed, and the methods are used for studies of experiments,
e.g., clinical trials and experiments with animals. In real data analyses, we often face
with missing data and there are cases that have to be analyzed by assuming latent
variables, as well. In such cases, the Expectation and Maximization (EM) algorithm
[2] is employed for the ML estimation of the parameters concerned. The method is
very useful to make the parameter estimation from missing data.
The aim of this chapter is to discuss the above methodologies in view of entropy. In
Sect. 4.2, the correlation coefficient in the bivariate normal distribution is discussed
through an association model approach in Sect. 2.6, and with a discussion similar
to the RC(1) association model [9], the entropy correlation coefficient (ECC) [3] is
derived as the absolute value of the usual correlation coefficient. The distributional
properties are also considered with entropy. Section 4.3 treats regression analysis
in the multivariate normal distribution. From the association model and the GLM
frameworks, it is shown that ECC and the multiple correlation coefficient are equal
and that the entropy coefficient of determination (ECD) [4] is equal to the usual

© Springer Nature Singapore Pte Ltd. 2020 91


N. Eshima, Statistical Data Analysis and Entropy,
Behaviormetrics: Quantitative Approaches to Human Behavior 3,
https://doi.org/10.1007/978-981-15-2552-0_4
92 4 Analysis of Continuous Variables

coefficient of determination. In Sect. 4.4, the discussion in Sect. 4.3 is extended to that
of the partial ECC and ECD. Section 4.5 treats canonical correlation analysis. First,
canonical correlation coefficients between two random vectors are derived with an
ordinary method. Second, it is proven that the entropy with respect to the association
between the random vectors is decomposed into components related to the canonical
correlation coefficients, i.e., pairs of canonical variables. Third, ECC and ECD are
considered to measure the association between the random vectors and to assess
the contributions of pairs of canonical variables in the association. In Sect. 4.6, the
Hotelling’s T 2 statistic from the multivariate samples from a population is discussed.
It is shown that the statistic is the estimator of the KL information between two
multivariate normal distributions, multiplied by the sample size. Section 4.7 extends
the discussion in Sect. 4.6 to that for comparison between two multivariate normal
populations. In Sect. 4.8, the one-way layout experimental design model is treated in
a framework of GLMs, and ECD is used for assessing the factor effect. Section 4.9,
first, makes a discriminant plane between two multivariate normal populations based
on an optimal classification method of samples from the populations, and second,
an entropy-based approach for discriminating the two populations is given. It is also
shown that the squared Mahalanobis’ distance between the two populations [13]
is equal to the KL information between the two multivariate normal distributions.
Finally, in Sect. 4.10, the EM algorithm for analyzing incomplete data is explained
in view of entropy.

4.2 Correlation Coefficient and Entropy

Let (X , Y ) be a random vector of which the joint distribution is a bivariate normal


distribution with mean vector μ = (μX , μY )T and variance–covariance matrix
 
σ11 σ12
= . (4.1)
σ21 σ22

Then, the density function f (x1 , x2 ) is given by


 
1 σ12 (x − μX )(y − μY ) σ22 (x − μX )2 + σ11 (y − μY )2
f (x, y) = 1 exp −
2π || 2 || 2||
(4.2)

The logarithm of the above density function is

1 σ22 (x − μX )2 σ11 (y − μY )2
log f (x, y) = log 1 − −
2π || 2 2|| 2||
σ12 (x − μX )(y − μY )
+ .
||
4.2 Correlation Coefficient and Entropy 93

Let us set

1 σ22 (x − μX )2 Y σ11 (y − μY )2 σ12


λ = log , λX (x) = − , λ (y) = − ,ϕ = .
2π ||
1
2 2|| 2|| ||

From the above formulation, we have

log f (x, y) = λ + λX (x) + λY (y) + ϕ(x − μX )(y − μY ). (4.3)

The model is similar to the RC(1) association


   model
 in Sect. 2.6. The log odds
ratio with respect to four points, (x, y), x , y , x, y , and x , y , is calculated as
 
f (x, y)f x , y   
log  
= ϕ x − x y − y .
f (x , y)f (x, y )

Replacing x and y for the means and taking the expectation of the above log odds
ratio with respect to X and Y we have
 
f (X , Y )f (μX , μY )
E log = ϕCov(X , Y ).
f (μX , Y )f (X , μY )

Definition 4.1 The entropy covariance between X and Y in (4.3) is defined by

ECov(X , X ) = ϕCov(X , Y ). (4.4)

Let fX (x) and fY (y) be the marginal density functions of X and Y, respectively.
Since
  ¨
f (X , Y )f (μX , μY ) f (x, y)
E log = f (x, y)log dxdy
f (μX , Y )f (X , μY ) fX (x)fY (y)
¨
fX (x)fY (y)
+ fX (x)fY (y)log dxdy ≥ 0, (4.5)
f (x, y)

it follows that

ϕCov(X , Y ) ≥ 0.

From the Cauchy inequality, we have


 
0 ≤ ϕCov(X , Y ) < |ϕ| Var(X ) Var(X ). (4.6)

Dividing the second term by the third one in (4.6), we have the following definition:
94 4 Analysis of Continuous Variables

Definition 4.2 In model (4.3), the entropy correlation coefficient (ECC) is defined
by

|Cov(X , Y )|
ECorr(X , Y ) = √ √ . (4.7)
Var(X ) Var(X )

From (4.5)
√ √ and (4.6), the upper limit of the KL information (4.5) is
|ϕ| Var(X ) Var(X ). Hence, ECC (4.7) can be interpreted as the ratio of the
information induced by the correlation between X and Y in entropy. As shown in
(4.7),

ECorr(X , Y ) = |Corr(X , Y )|.

Remark 4.1 The above discussion has been made in a framework of the RC(1)
association model in Sect. 2.6. In the association model, ϕ is assumed to be positive
for scores assigned to categorical variables X and Y. On the other hand, in the normal
model (4.2), if Cov(X , Y ) < 0, then, ϕ < 0.

From the above remark, we make the following definition of the entropy variance:

Definition 4.3 The entropy variances of X and Y in (4.3) are defined, respectively,
by

EVar(X ) = |ϕ|Cov(X , X )(= |ϕ|Var(X )) and EVar(Y )


= |ϕ|Cov(Y , Y )(= |ϕ|Var(Y )). (4.8)

By using (4.4) and (4.8), ECC (4.7) can also be expressed as

ECov(X , X )
ECorr(X , Y ) = √ √ .
EVar(X ) EVar(X )

Since the entropy covariance (4.4) is the KL information between model f (x, y)
and independent model fX (x)fY (y), in which follows, for simplicity of discussion, let
us set
¨
f (x, y)
KL(X , Y ) = f (x, y)log dxdy
fX (x)fY (y)
¨
fX (x)fY (y)
+ fX (x)fY (y)log dxdy
f (x, y)

and let

ρ = Corr(X , Y ).
4.2 Correlation Coefficient and Entropy 95

Calculating (4.5) in normal distribution model (4.2), since


σ
σ12 σ12 12
σ11 σ22 ρ 1
ϕ= = = = ×√ .
|| σ11 σ22 − σ12 σ12 σ 2
1 − σ1112σ22 1−ρ 2 σ11 σ22

we have

ρ σ12 ρ2
KL(X , Y ) = ϕσ12 = × √ = .
1 − ρ2 σ11 σ22 1 − ρ2

Next, the distribution of the estimator of KL(X , Y ) is considered. Let


(X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) be random samples from the bivariate normal dis-
tribution with mean vector μ = (μ1 , μ2 )T and variance–covariance matrix (4.1).
Then, the ML estimator of correlation coefficient ρ is given by
n   
i=1 Xi − X Yi − Y
r= 2 n  2 .
n 
i=1 X i − X i=1 Yi − Y

Then, the following statistic is the non-central F distribution with degrees 1 and
n − 2:

F = (n − 2)KL(X , Y ), (4.9)

where

r2
KL(X , Y ) = . (4.10)
1 − r2

Hence, F statistic is proportional to the estimator of the KL information. In order


to test the hypotheses H0 :ρ = 0 versus H1 :ρ = 0, we have the following t statistic
with degrees of freedom n − 2:
√ r
t= n − 2√ . (4.11)
1 − r2

Then, the squared t statistic is proportional to the KL information (4.10), i.e.,


t 2 = F = (n − 2)KL(X , Y ).

Remark 4.2 Let X be an explanatory variable and let Y be a response variable in


bivariate normal distribution model (4.2). Then, the conditional distribution of Y
given X can be expressed in a GLM formulation and it is given by
96 4 Analysis of Continuous Variables

σ
1 − μX )(y − μY ) − 21 σσ2211 (x − μX )2
σ11 (x
12

fY (y|x) =    exp  
2π σ22 1 − ρ 2 σ22 1 − ρ 2

(y − μY )2
−   .
2σ22 1 − ρ 2

In the GLM framework (3.9), it follows that


σ12   σ22
θ= (x − μX ), a(ϕ) = σ22 1 − ρ 2 , b(θ ) = (x − μX )2 ,
σ11 2σ11
(y − μY )2
c(y, ϕ) = −  .
2σ22 1 − ρ 2

Then, we have ECC for the above GLM as



Cov Y , σσ1211 X σ12
2
Cov(Y , θ ) σ11
Corr(Y , θ ) = √ √
Var(Y ) Var(θ )
=

 = √  σ2
Var(Y ) Var σ11 Xσ12 σ22 σ1211

σ12
2
=√ √ = |ρ|.
σ22 σ11

Hence, ECCs based on the RC(1) association model approach (4.3) and the above
GLM approach are the same.
In statistic (4.9), for large sample size n, (4.9) is asymptotically distributed
according to the non-central chi-square distribution with degrees of freedom 1 and
non-centrality

ρ2
λ = (n − 2) .
1 − ρ2

To give an asymptotic distribution of (4.10), the following theorem [14] is used.


Theorem 4.1 Let χ 2 be distributed according to a non-central chi-square dis-
tribution with non-central parameter λ and the degrees of freedom ν; and let us
set

λ λ2
c=ν+ and ν  = ν + .
ν+λ ν + 2λ

Then, the following statistic

χ 2
χ2 = (4.12)
c
4.2 Correlation Coefficient and Entropy 97

is asymptotically distributed according to the chi-square distribution with degrees


of freedom ν  .

From the above theorem, statistic (4.9) is asymptotically distributed according to


the non-central chi-square distribution with degrees of freedom 1 and non-centrality

ρ2
λ = (n − 2) .
1 − ρ2

Let

λ λ2
ν = 1, c = 1 + and ν  = 1 + . (4.13)
1+λ 1 + 2λ

Then, for large sample size n, statistic


F (n − 2)KL(X , Y )
= (4.14)
c c
is asymptotically distributed according to the chi-square distribution with degrees
of freedom ν  . Hence, statistic (4.14) is asymptotically normal with mean ν  and
variance 2ν  . From (4.13), for large n we have

ρ 2
(n − 2) 1−ρ 2
c =1+ ρ 2 ≈ 2,
1 + (n − 2) 1−ρ 2
2 2
ρ
(n − 2)2 1−ρ 2 ρ2  
ν = 1 + ρ2
≈ (n − 2)   from r 2 ≈ ρ 2 .
1 + 2(n − 2) 1−ρ 2 2 1−ρ 2

Normalizing the statistic (4.12), we have


ρ 2
ρ2
− ν
F F − (n − 2) 1−ρ 2 √ KL(X , Y ) − 1−ρ 2
ZKL = √ c
=  = n − 2  . (4.15)
2ν  ρ2
2 (n − 2) 1−ρ 2 2 1−ρ ρ2
2

The above statistic is asymptotically distributed according to the standard normal


distribution. By using the following asymptotic property:

ρ2

√ KL(X , Y ) − 1−ρ 2 √ KL(X , Y ) − KL(X , Y )


ZKL = n−2  ≈ n−2  ,
ρ2 r2
2 1−ρ 2 2 1−r 2

we have the following asymptotic standard error (SE):


98 4 Analysis of Continuous Variables

2 r2
SE = √ ,
n−2 1 − r2

ρ 2
From this, the asymptotic confidence intervals of KL(X , Y ) = 1−ρ 2 can be

constructed, for example, a 100(1 − α)% confidence interval of it is given by




α 2 r2
KL(X , Y ) − z 1 − ×√
2 n−2 1 − r2


α 2 r2
< KL(X , Y ) < KL(X , Y ) + z 1 − ×√ , (4.16)
2 n−2 1 − r2
   
where z 1 − α2 is the upper 100 1 − α2 % point of the standard normal distribu-
tion. Similarly, by using (4.16), a 100(1 − α)% confidence interval of ρ 2 can be
constructed. From the above result, the asymptotic confidence interval is given by
 
A1 A2
< ρ2 < , (4.17)
1 + A1 1 + A2

where


α 2 r2
A1 = KL(X , Y ) − z 1 − ×√ ,
2 n−2 1 − r2


α 2 r2
A2 = KL(X , Y ) + z 1 − ×√
2 n − 2 1 − r2

Concerning estimator r, the following result is obtained [1].

Theorem 4.2 In normal distribution (4.2) with variance–covariance matrix (4.1),


the asymptotic distribution of statistic
√ r−ρ
Z= n−1 (4.18)
1 − ρ2

is the standard normal distribution.

The statistic (4.18) is refined in view of the normality. The Fisher Z transformation
[5] is given by

1 1+r
ZFisher = log (4.19)
2 1−r
4.2 Correlation Coefficient and Entropy 99

Theorem 4.3 The Fisher Z transformation (4.19) is asymptotically normally


distributed with mean 21 log 1+ρ
1−ρ
and variance n − 1.

Proof By the Taylor expansion of ZFisher at ρ, for large sample size n we have
 
1 1+ρ 1 1 1
ZFisher ≈ log + + (r − ρ)
2 1−ρ 2 1+ρ 1−ρ
1 1+ρ 1
= log + (r − ρ).
2 1−ρ 1 − ρ2

From Theorem 4.2, the above statistic is asymptotically distributed according to


the normal distribution with mean 21 log 1+ρ
1−ρ
and variance n − 1.
In Theorem 4.3, a better approximation of the variance of ZFisher is given by n − 3
[1]. By using the result, an asymptotic 100(1 − α)% confidence interval is computed
as
exp(B1 ) − 1 exp(B2 ) − 1
<ρ< , (4.20)
exp(B1 ) + 1 exp(B2 ) + 1

where
1+r 2 α
B1 = log −√ z ,
1−r n−3 2
1+r 2 α
B2 = log +√ z
1−r n−3 2

Example 4.1 Table 4.1 shows an artificial data from the bivariate normal distribution
with mean vector μ = (100, 50) and variance matrix
 
100 35
= .
35 25

From the data, we have


ρ = 0.726 (SE = 0.132),


KL(X , Y ) = 1.113 (SE = 0.399).

From (4.16), (4.17) and (4.20), it follows that

0.332 < KL(X , Y ) < 1.896,


0.249 < ρ 2 < 0.655,
0.454 < ρ < 0.847.
100

Table 4.1 Artificial two-dimensional data


Case 1 2 3 4 5 6 7 8 9 10
X 94.6 97.9 112.9 86.6 102.4 92.8 99.1 102.0 97.3 104.1
y 49.4 55.4 59.7 46.0 49.9 43.6 51.7 49.0 48.9 53.5
Case 11 12 13 14 15 16 17 18 19 20
X 109.5 104.2 87.7 106.2 101.2 115.1 108.2 89.5 114.2 112.7
Y 52.7 51.2 52.2 53.9 44.1 56.8 52.7 48.0 58.0 56.8
Case 21 22 23 24 25 26 27 28 29 30
X 91.5 104.2 95.6 80.8 98.1 105.7 88.4 112.6 98.5 97.1
Y 50.4 50.6 44.2 37.3 47.4 53.7 53.6 56.5 48.5 44.9
4 Analysis of Continuous Variables
4.3 The Multiple Correlation Coefficient 101

4.3 The Multiple Correlation Coefficient


 T
Let X = X1 , X2 , . . . , Xp be a random vector and let Y be a random variable, and
 T
the variance–covariance matrix of X T , Y is set as follows:
 
 XX σ XY
= , (4.21)
σ Y X σYY

where
 
 XX = Cov X, X T ,

σ XY = Cov(X, Y ) and σYY = Var(Y ).

Then, we have the following theorem.


 T
Theorem 4.4 Let β = β1 , β2 , . . . , βp be a regression coefficient vector and let


p
V = βT X = βi Xi
i=1

Then, Corr(V, Y ) is maximized by β = C −1


XX σ XY , where C is a positive constant,
and the maximum is given by.

  σ Y X  −1
XX σ XY
Corr σ Y X  −1
XX X , Y = (4.22)
σYY

Proof The proof is given under the following constraint:


 
Var β T X = β T  XX β = 1. (4.23)

Since

β T σ XY
Corr(V, Y ) = √ ,
σYY

the following Lagrange function is used:

β T σ XY λ
L(β) = √ − β T  XX β,
σYY 2

Differentiating the above function with respect to β, there exists constant λ such
that
102 4 Analysis of Continuous Variables

∂ 1
L(β) = √ σ XY − λ XX β = 0.
∂β σYY

From this, we have

1
β= √  −1 σ XY .
λ σYY XX

From (4.23), we have

σ Y X  −1
XX σ XY
λ2 = .
σYY

Hence, for λ = λ2 , Corr(V, Y ) is maximized by (4.22). The theorem follows.

Definition 4.2 The correlation coefficient (4.22) is defined as the multiple correlation
 T
coefficient between random vector X = X1 , X2 , . . . , Xp and random variable Y.
 T
Second, let us assume the joint distribution of X = X1 , X2 , . . . , Xp and Y is

logf (x, y) = λ + λX (x) + λY (y) + φμ(x)ν(y), (4.24)

where φ > 0. The above model is similar to the RC(1) association model. In the
model, ECC is calculated as follows:

Cov(μ(X), ν(Y ))
ECorr(X, Y ) = √ √ .
Var(μ(X)) Var(Y )

Remark 4.3 The ECC cannot be directly defined in the case where X and Y are ran-
 T  T
dom vectors, i.e., X = X1 , X2 , . . . , Xp (p ≥ 2) and Y = Y1 , Y2 , . . . , Yq (q ≥ 2).
The discussion is given in Sect. 4.5.
Like the RC(1) association model, the ECC is positive. In matrix (4.21), let

1
 XX·Y =  XX − σ XY σ Y X and σYY ·X = σYY − σ Y X  −1
XX σ XY .
σYY

For a normal distribution with mean vector 0 and variance–covariance matrix


(4.21), since

1
f (x, y) = p+1 1
(2π ) 2 [] 2

 −1  
1  T   XX σ XY x
× exp − x , y , (4.25)
2 σ Y X σYY y

in model (4.24) we set


4.3 The Multiple Correlation Coefficient 103
p+1 1

λ = −log (2π ) 2 [] 2 ,

1 1 −1 2
λX (x) = − xT  −1
XX·Y x, λY (y) = − σYY ·X y ,
2 2
−1
μ(x) = σYY σ Y X  −1
XX·Y x, ν(y) = y, φ = 1.

From the above formulation, we have (4.22). The ECC for the normal distribution
(4.25) is the multiple correlation coefficient.
Third, in the GLM framework (3.9), ECC in the normal distribution is discussed.
Let f (y|x) be the conditional density function of Y given X. Assuming the joint
distribution of X and Y is normal with mean vector 0 and variance–covariance
matrix (4.21). Then, f (y|x) is given as the following GLM expression:

1
f (y|x) = 
σYY − σ Y X  −1
p+1
(2π ) XX σ XY
2



yβx − 21 (βx)2 1 2
y
× exp − 2
,
σYY − σ Y X  −1
XX σ XY σYY − σ Y X  −1
XX σ XY

where

β =  −1
XX σ XY

In (3.9), setting

1 2
θ = βx, a(ϕ) = σYY − σ Y X  −1
XX σ XY , b(θ ) = θ , and
2
1 2
y
c(y, ϕ) = − 2
, (4.26)
σYY − σ Y X  −1
XX σ XY

the ECC between X and Y is given by



Cov(βX, Y ) σ Y X  −1
XX σ XY
ECorr(X, Y ) = Corr(βX, Y ) = √ √ =
Var(βX) Var(Y ) σYY

In this case, ECCs in the RC(1) association model and the GLM frameworks are
equal to the multiple correlation coefficient. In this sense, the ECC can be called the
entropy multiple correlation coefficient between X and Y. The entropy coefficient of
determination (ECD) is calculated. For (4.26), we have
104 4 Analysis of Continuous Variables

Cov(βX, Y ) σ Y X  −1 σ XY
ECD(X, Y ) = =  XX 
Cov(βX, Y ) + a(ϕ) σ Y X  XX σ XY + σYY − σ Y X  −1
−1
XX σ XY
σ Y X  −1
XX σ XY
= = ECorr(X, Y )2 .
σYY

In the normal case, the above equation holds true. Below, a more general discussion
on ECC and ECD is provided.

4.4 Partial Correlation Coefficient

Let X, Y, and Z be random variables that have the joint normal distribution with the
following variance–covariance matrix:
⎛ ⎞
σX2 σXY σXZ
 = ⎝ σYX σY2 σYZ ⎠,
σZX σZY σZ2

Then, the conditional distribution of X and Y, given Z, has the following variance–
covariance matrix:
 2   
σX σXY 1 σXZ  
 XY ·Z = − 2 σZX σZY
σYX σY 2
σZ σYZ

2 σ σ
σX − σ 2 σXY − σXZσσ2 ZY
XZ ZX

= Z Z .
σYX − σYZσσ2ZX σY2 − σYZσσ2ZY
Z Z

From the above result, the partial correlation coefficient ρXY ·Z between X and Y
given Z is
σXZ σZY
σXY − σZ2 ρXY − ρXZ ρYZ
ρXY ·Z =   =  .
σXZ σZX σYZ σZY
σX2 − σZ2
σY2 − σZ2 1 − ρXZ
2
1 − ρYZ
2

With respect to the above correlation coefficient, the present discussion can be
directly applied. Let fXY (x, y|z) be the conditional density function of X and Y given
Z = z; let fX (x|z) and fY (y|z) be the conditional density functions of X and Y given
Z = z, respectively, and let fZ (z) be the marginal density function of Z. Then, we
have
4.4 Partial Correlation Coefficient 105
˚
fXY (x, y|z)
KL(X , Y |Z) = fXY (x, y|z)fZ (z) log dxdydz
fX (x|z)fY (y|z)
¨
fX (x|z)fY (y|z) ρXY
2
·Z
+ fX (x|z)fY (y|z)fZ (z) log dxdy = .
fXY (x, y|z) 1 − ρXY
2
·Z
(4.27)

Thus, the partial (conditional) ECC and ECD of X and Y given Z are computed
as follows:

ECorr(X , Y |Z) = |ρXY ·Z |,

KL(X , Y |Z)
ECD(X , Y |Z) = = ρXY
2
·Z.
KL(X , Y |Z) + 1
 T  T
For normal random vectors X = X1 , X2 , . . . , Xp , Y = Y1 , Y2 , . . . , Yq , and
Z = (Z1 , Z2 , . . . , Zr )T , the joint distribution is assumed to be the multivariate normal
with the following variance–covariance matrix:
⎛ ⎞
 XX  XY  XZ
 = ⎝  YX  YY  YZ ⎠.
 ZX  ZY  ZZ

From the above matrix, the partial variance–covariance matrix of X and Y given
Z is calculated as follows:
     
 XX·Z  XY·Z  XX  XY  XZ  
 (X,Y)·Z ≡ = − Σ −1
ZZ  ZX  ZY .
 YX·Z  YY·Z  YX  YY  YZ

Let the inverse of the above matrix be expressed as


 
 XX·Z  XY·Z
 −1
(X,Y)·Z = .
 YX·Z  YY·Z

Then, applying (4.27) to this multivariate case, we have

KL(X, Y|Z) = −tr XY·Z  YX·Z .

Hence, the partial ECD of X and Y given Z is calculated as follows:

−tr XY·Z  YX·Z


ECD(X, Y|Z) = .
−tr XY·Z  YX·Z + 1

Example 4.2 Let X = (X1 , X2 )T , Y = (Y1 , Y2 )T , and Z = (Z1 , Z2 )T be random


vectors that have the joint normal distribution with the variance–covariance matrix:
106 4 Analysis of Continuous Variables
⎛ ⎞
1 0.8 0.6 0.5 0.6 0.7
⎜ 0.8 1 ⎟
⎜ 0.5 0.6 0.5 0.4 ⎟
⎜ ⎟
⎜ 0.6 0.5 1 0.5 0.7 0.6 ⎟
=⎜ ⎟.
⎜ 0.5 0.6 0.5 1 0.4 0.5 ⎟
⎜ ⎟
⎝ 0.6 0.5 0.7 0.4 1 0.8 ⎠
0.7 0.4 0.6 0.5 0.8 1

In this case, we have


⎛ ⎞
1 0.8 0.6 0.5
⎜ 0.8 1 0.5 0.6 ⎟
 (X,Y)·Z =⎜
⎝ 0.6 0.5 1 0.5 ⎠

0.5 0.6 0.5 1


⎛ ⎞
0.6 0.7
   
⎜ 0.5 0.4 ⎟ 1 0.8 −1 0.6 0.5 0.7 0.4

−⎝ ⎟
0.7 0.6 ⎠ 0.8 1 0.7 0.4 0.6 0.5
0.4 0.5
⎛ ⎞
0.506 0.5 0.156 0.15
⎜ 0.5 0.75 0.15 0.4 ⎟
=⎜ ⎟
⎝ 0.156 0.15 0.506 0.2 ⎠,
0.15 0.4 0.2 0.75
⎛ ⎞
7.575 −5.817 −1.378 1.955
⎜ −5.817 6.345 0.879 −2.455 ⎟
 −1 =⎜
⎝ −1.378 0.879
⎟.
(X,Y)·Z
2.479 −0.854 ⎠
1.955 −2.455 −0.854 2.479

From this, we compute


  
0.156 0.15 5.175 −4.825
tr XY·Z  YX·Z
= tr = −0.771.
0.15 0.4 −5.478 4.522

Hence,

0.771
ECD(X, Y|Z) = = 0.435.
0.771 + 1

Remark 4.5 The partial variance–covariance matrix  (X,Y)·Z can also be calculated
by using the inverse of . Let
⎛ ⎞
 XX  XY  XZ
 −1 = ⎝  YX  YY  YZ ⎠.
 ZX  ZY  ZZ
4.4 Partial Correlation Coefficient 107

Then, we have
 −1
 XX  XY
 (X,Y)·Z = .
 YX  YY

4.5 Canonical Correlation Analysis


 T  T
Let X = X1 , X2 , . . . , Xp and Y = Y1 , Y2 , . . . , Yq be two random vectors for
p < q. Without loss of generality, we set E(X) = 0 and E(Y) = 0. Corresponding
to the random vectors, the variance–covariance matrix of X and Y is divided as
 
 XX  XY
= , (4.28)
 YX  YY

where
       
 XX = E XX T ,  XY = E XY T ,  YX = E YX T , and  YY = E YY T .
 T  T
For coefficient vectors a = a1 , a2 , . . . , ap and b = b1 , b2 , . . . , bp , we
determine V1 = aT X and W1 = bT Y that maximize their correlation coefficient
under the following constraints:

Var(V1 ) = aT  XX a = 1, Var(W1 ) = bT  YY b = 1. (4.29)

For Lagrange multiplier λ and μ, we set

λ T μ
h = aT  XY b − a  XX a − bT  YY b.
2 2
Differentiating the above function with respect to a and b, we have

∂h ∂h
=  XY b − λ XX a = 0, =  YX a − μ YY b = 0.
∂a ∂b
Applying constraints (4.29) to the above equations, we have

aT  XY b = λ = μ

Hence, we obtain
  
λ XX − XY a
= 0.
− YX μ YY b
108 4 Analysis of Continuous Variables

Since

a = 0 and b = 0,

It follows that
 
 λ XX − XY 
 
 − YX λ YY  = 0.

Since
   
 λ XX − XY   λ XX 0 
 = 
 − YX λ YY   0 λ YY − 1  YX  −1  XY 
λ XX
 
 1 

= |λ XX |λ YY −  YX  XX  XY 
−1
λ
 2 

= λ | XX | λ  YY −  YX  −1 
XX  XY
p−q

= 0,

we have
 2 
λ  YY −  YX  −1  XY  = 0. (4.30)
XX

From this, λ2 is the maximum eigenvalue of  YX  −1 −1


XX  XY  YY and the coefficient
vector b is the eigenvector corresponding to the eigenvalue. Equation (4.30) has p
non-negative eigenvalues such that

1 ≥ λ21 ≥ λ22 ≥ · · · ≥ λ2p ≥ 0. (4.31)

Since the positive square roots of these are given by

1 ≥ λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0,

the maximum correlation coefficient is λ1 . Similarly, we have


 2 
λ  XX −  XY  −1  YX  = 0. (4.32)
YY

From the above equation, we get the roots in (4.31) and q − p zero roots, and the
coefficient vector a is the eigenvector of  XY  −1 −1
YY  YX  XX , corresponding to λ1 . Let
2

us denote the obtained coefficient vectors a and b as a(1) and b(1) , respectively. Then,
the pair of variables V1 = aT1 X and W1 = bT1 Y is called the first pair of canonical
variables. Next, we determine V2 = aT(2) X and W2 = bT(2) Y that make the maximum
correlation under the following constraints:
4.5 Canonical Correlation Analysis 109

aT(2)  XX a(1) = 0, aT(2)  XY b(1) = 0, bT(2)  YX a(1) = 0, bT(2)  YY b(1) = 0,


aT(2)  XX a(2) = 1, bT(2)  YY b(2) = 1.

By using a similar discussion that derives the first pair of canonical variables,
we can derive Eqs. (4.30) and (4.32). Hence, the maximum correlation coefficient
satisfying the above constraints is given by λ2 , and the coefficient vectors a(2)
and b(2) are, respectively, obtained as the eigenvectors of  XY  −1 −1
YY  YX  XX and
−1 −1
 YX  XX  XY  YY , corresponding to λ2 . Similarly, let a(i) and b(i) be the eigenvec-
2

tors of  XY  −1 −1 −1 −1
YY  YX  XX and  YX  XX  XY  YY corresponding to the eigenvalues
λi , i ≥ 3. Then, the pair of variables Vi = a(i) X and Wi = bT(i) Y gives the maximum
2 T

correlation under the following constraints:

aT(i)  XX a(j) = 0, aT(i)  XY b(j) = 0, bT(i)  YX a(j) = 0, bT(i)  YY b(j) = 0, i > j;

aT(i)  XX a(i) = 1, bT(i)  YY b(i) = 1.

As a result, we can get

aT(i)  XX a(j) = 0, aT(i)  XY b(j) = 0, bT(i)  YY b(j) = 0, i = j;

and
 
Corr(Vi , Wi ) = Corr aT(i) X, bT(i) Y = aT(i)  XY b(i) = λi , i = 1, 2, . . . , p.

Inductively, we can decide bT(q+k) such that


 
bT(p+k)  YY b(1) , b(2) , . . . , b(p) , . . . , b(p+k−1) = 0,
bT(p+k)  YY b(p+k) = 1, k = 1, 2, . . . , q − p.

Let us set
T
V = aT(1) X, aT(2) X, . . . , aT(p) X ,
T T
W (1) = bT(1) Y, bT(2) Y, . . . , bT(p) Y , W (2) = bT(p+1) Y, bT(p+2) Y, . . . , bT(q) Y ,
   
A = a(1) , a(2) , . . . , a(p) , B(1) = b(1) , b(2) , . . . , b(p) ,
 
B(2) = b(p+1) , b(p+2) , . . . , b(q) . (4.33)

Then, the covariance matrix of V, W (1) and W (2) is given by


110 4 Analysis of Continuous Variables
⎛       ⎞
Var V T Cov V T , W T(1) Cov V T , W T(1)
  ⎜      ⎟
Var V T , W T(1) , W T(2) = ⎝ Cov V T , W T(1) Var W T(1) Cov W T(1) , W T(1) ⎠
 T     
Cov V , W T(1) Cov W T(1) , W T(1) Var W T(1)
⎛ T ⎞
A  XX A AT  XY B(1) AT  XY B(2)
= ⎝ BT(1)  YX A BT(1)  YY B(1) BT(1)  XY B(2) ⎠
BT(2)  YX A BT(2)  YX B(1) BT(2)  YY B(2)
⎛ ⎡ ⎤ ⎞
λ1
⎜ ⎢ 0 ⎥ ⎟
⎜ ⎢ λ2 ⎥ ⎟
⎜ Ip ⎢ .. ⎥ 0 ⎟
⎜ ⎣ . ⎦ ⎟
⎜ 0 ⎟
⎜ λp ⎟
⎜⎡ ⎤ ⎟
⎜ ⎟
= ⎜ λ1 ⎟ ≡ ∗,
⎜⎢ 0 ⎥ ⎟
⎜⎢ λ2 ⎥ ⎟
⎜⎢ ⎥ ⎟
⎜⎣ .. ⎦
I p 0 ⎟
⎜ . ⎟
⎜ 0 ⎟
⎝ λp ⎠
0 0 I q−p
(4.34)

where I p and I q−p are the p- and (q − p)-dimensional identity matrix, respectively.
Then, the inverse of the above variance–covariance matrix is computed by
⎛⎡ ⎤⎡ ⎤ ⎞
1 −λ1
1−λ21 1−λ21
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎢ 0 ⎥⎢ 0 ⎥ ⎟
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎢ 1 ⎥⎢ −λ2 ⎥ ⎟
⎜⎢ 1−λ22 ⎥⎢ 1−λ22 ⎥ ⎟
⎜⎢ .. ⎥⎢ .. ⎥ 0 ⎟
⎜⎢ . ⎥⎢ . ⎥ ⎟
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎣ 0 ⎦⎣ 0 ⎦ ⎟
⎜ −λp ⎟
⎜ 1 ⎟
⎜⎡ 1−λ2p
⎤⎡ 1−λ2p
⎤ ⎟
⎜ −λ1 ⎟
 ∗−1 ⎜ 1−λ 1
⎟ (4.35)
⎜⎢ 2
⎥⎢
1−λ21
⎥ ⎟
⎜⎢ ⎟
1

⎜⎢ 0 ⎥⎢ 0 ⎥ ⎟
⎜⎢ −λ2 ⎥⎢ ⎥ ⎟
⎜⎢ ⎥⎢ 1 ⎥ ⎟
⎜⎢ 1−λ22 ⎥⎢ ⎥
⎥ 0 ⎟
1−λ22
⎜⎢ .. ⎥⎢ .. ⎟
⎜⎢ . ⎥⎢ . ⎥ ⎟
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎣ ⎥⎢ ⎥ ⎟
⎜ 0 ⎦⎣ 0 ⎦ ⎟
⎜ −λp 1 ⎟
⎝ 1−λ2p 1−λ22 ⎠
0 0 I q−p

From this, we have the following theorem:


4.5 Canonical Correlation Analysis 111
 T
Theorem 4.5 Let us assume the joint distribution of X = X1 , X2 , . . . , Xp and
 T
Y = Y1 , Y2 , . . . , Yq is the multivariate normal distribution with variance–covari-
ance matrix (4.28) and let
 
(Vi , Wi ) = aT(i) X, bT(i) Y , i = 1, 2, . . . , p (4.36)

be the p pairs of canonical variates defined in (4.33). Then,


p
λ2i
KL(X, Y) = .
i=1
1 − λ2i

 T
Proof The joint distribution of random vectors V and W T(1) , W T(2) is the multi-
variate normal distribution with variance–covariance matrix (4.34) and the inverse
of (4.34) is given by (4.35). From this, we have
⎛ ⎞
⎛ ⎞ λ1
− 1−λ
λ1 ⎜
2
1 0 ⎟
⎜ 0 ⎟⎜ λ2 ⎟
 T ⎜ λ2 ⎟⎜
− 1−λ 2

KL V , W T(1) , W T(2) = −tr⎜ .. ⎟⎜
2
.. ⎟
⎝ . ⎠⎜ . ⎟
0 ⎝ 0 ⎠
λ
λp − 1−λp 2
p


p
λ2i
= .
i=1
1 − λ2i

Since
 T
KL(X, Y) = KL V , W T(1) , W T(2) ,

the theorem follows. 

Remark 4.6 Let the inverse of the variance–covariance matrix (4.28) be denoted as
 
 XX  XY
 −1 = .
 YX  YY

Then, we have

KL(X, Y) = −tr XY  YX . (4.37)

From Theorem 4.5, we obtain


p
λ2i
−tr XY  YX
= . (4.38)
i=1
1 − λ2i
112 4 Analysis of Continuous Variables

From the p pairs of canonical variates (4.36), it follows that

λ2i
KL(Vi , Wi ) = , i = 1, 2, . . . , p. (4.39)
1 − λ2i

From (4.37), (4.38), and (4.39), we have


p
KL(X, Y) = KL(Vi , Wi ).
i=1

Next, the entropy correlation coefficient introduced in the previous section


is extended for measuring the association between random vectors X =
 T  T
X1 , X2 , . . . , Xp and Y = Y1 , Y2 , . . . , Yq , where p ≤ q. Let f (x, y) be the joint
density function of X and Y and let g(v, w) be the joint density function of canonical
 T
random vectors V and W = W T(1) , W T(2) in (4.33). Then, the joint density function
of X and Y is expressed as follows:
 −1
f (x, y) = g(v, w)|A|−1 B(1) , B(2) 
⎛ ⎞
 p p q wj2
1 vi2 + wi2  
= ⎝ vi φi wi − − ⎠|A|−1 B(1) , B(2) −1 ,
p+q   1 exp 2
(2π ) 2   2
2
∗ i=1 i=1 2 1 − λi j=p+1

where
λi
φi = , i = 1, 2, . . . , p.
1 − λ2i

The above formulation is similar to the RC(p) association model explained in


Chap. 2. Assuming the means of X and Y are zero vectors, we have the following
log odds ratio:

g(v, w)g(0, 0) 
p
f (x, y)f (0, 0)
log = log = vi φi wi .
f (x, 0)f (0, y) g(v, 0)g(0, w) i=1

As discussed in Chap. 2, we have the entropy covariance between X and Y as


p

p

p
λ2i
ECov(X, Y) = φi Cov(vi , wi ) = φi λ i = .
i=1 i=1 i=1
1 − λ2i

and


p

p

p
λi
ECov(X, X) = φi Cov(vi , vi ) = φi = = ECov(Y, Y).
i=1 i=1 i=1
1 − λ2i
4.5 Canonical Correlation Analysis 113

Then, we have
 
0 ≤ ECov(X, Y) ≤ ECov(X, X) ECov(Y, Y).

From the above results, the following definition can be made:

Definition 4.3 The entropy correlation coefficient between X and Y is defined by


p λ2i
ECov(X, Y) i=1 1−λ2i
ECorr(X, Y) = √ √ = p λi
. (4.40)
ECov(X, X) ECov(Y, Y) i=1 1−λ2i

For p = q = 1, (4.40) is the absolute value of the simple correlation coefficient,


and p = 1 (4.40) becomes the multiple correlation coefficient.

Finally, in a GLM framework, the entropy coefficient of determination is consid-


ered for the multivariate normal distribution. Let f (y|x) be the conditional density
function of response variable vector Y given explanatory variable vector X. Then,
we have
 
1 1 T −1
f (y|x) = q 1 exp − (y − μY (x))  YY·X (y − μY (x)) ,
(2π ) 2 | YY·X | 2 2

where

E(X) = μX , E(Y) = μY ,
μY (x) = μY +  YX  −1
XX (x − μX ),
 YY·X =  YY −  YX  −1
XX  XY .

Let
⎧ −1
⎨ θ =  YY·X μY (x),

a(ϕ) = 1, b(θ ) = 21  YY·X θ T θ, (4.41)

⎩ c(y, ϕ) = − 1 yT  −1 y − log (2π ) 2q | YY·X | 21 .
2 YY·X

Then, we have
 
1 T −1 1 T −1 1 T −1
f (y|x) = exp y  μ
YY·X Y (x) − μ (x)  μ (x) − y  y
2 Y YY·X Y YY·X
q 1
(2π ) | YY·X | 2
2 2
 T 
y θ − b(θ )
= exp + c(y, ϕ) . (4.42)
a(ϕ)

As shown above, the multivariate linear regression under the multivariate normal
distribution is expressed by a GLM with (4.41) and (4.42). Since
114 4 Analysis of Continuous Variables

Cov(Y, θ )  
KL(X, Y) = = Cov Y,  −1 −1 −1
YY·X μY (X) = tr YY·X  YX  XX  XY
a(ϕ)
 −1
= tr  YY −  YX  −1
XX  XY  YX  −1
XX  XY ,

it follows that
 −1
KL(X, Y) tr  YY −  YX  −1
XX  XY  YX  −1
XX  XY
ECD(X, Y) = =  −1
.
KL(X, Y) + 1 −1 −1
tr  YY −  YX  XX  XY  YX  XX  XY + 1

Since KL(X, Y) is invariant for any one-to-one transformations of X and Y, we


have


p
λ2i
KL(X, Y) = KL(V , W ) = ,
i=1
1 − λ2i

 T
where V and W = W T(1) , W T(2) are given in (4.33). Hence, it follows that

p λ2i
KL(V , W ) i=1 1−λ2i
ECD(X, Y) = = . (4.43)
KL(V , W ) + 1 p λ2i
2 + 1 i=1 1−λi

The above formula implies that the effect of explanatory variable vector X on
λ2
response variable vector Y is decomposed into p canonical components 1−λi 2 , i =
i
1, 2, . . . , p. From the above results (4.40) and (4.43), the contribution ratios of canon-
ical variables (Vi , Wi ) to the association between random vectors X and Y can be
evaluated by

λ2i λ2i
KL(Vi , Wi ) 1−λ2i 1−λ2i
CR(Vi , Wi ) = = = . (4.44)
KL(X, Y) −tr XY  YX p λ2j
2 j=1 1−λj

Example 4.3 Let X = (X1 , X2 )T and Y = (Y1 , Y2 , Y3 )T be random vectors that have
the joint normal distribution with the following correlation matrix:
⎛ ⎞
1 0.7 0.4 0.6 0.4
  ⎜ ⎟
⎜ 0.7 1 0.7 0.6 0.5 ⎟
 XX  XY ⎜ ⎟
=⎜ 0.4 0.7 1 0.8 0.7 ⎟.
 YX  YY ⎜ ⎟
⎝ 0.6 0.6 0.8 1 0.6 ⎠
0.4 0.5 0.7 0.6 1

The inverse of the above matrix is


4.5 Canonical Correlation Analysis 115
⎛ ⎞
3.237 −2.484 2.569 −2.182 −0.542
 −1   ⎜ −2.484 −3.167 1.457 ⎟
⎜ 3.885 0.394 ⎟
 XX  XY  XX  XY ⎜ ⎟
= =⎜ 2.568 −3.167 6.277 −3.688 −1.626 ⎟.
 YX  YY  YX  YY ⎜ ⎟
⎝ −2.182 1.467 −3.688 4.296 0.148 ⎠
−0.542 0.394 −1.626 0.148 2.069

From this, we obtain


⎛ ⎞
 
2.569 −3.167  
0.4 0.6 0.4 ⎝ −0.498 −0.235
 XY  YX = −2.182 1.457 ⎠ = .
0.7 0.6 0.5 0.218 −1.146
−0.542 0.394

From the above result, it follows that

−tr XY  YX 1.644
ECD(X.Y) = = = 0.622.
−tr XY  + 1
YX 1.644 +1

By the singular value decomposition, we have


     
−0.498 −0.235 0.147 −0.989 3.381 0 0.121 −0.993
= .
0.218 −1.146 0.989 0.147 0 0.705 0.993 0.121

From this, the two sets of canonical variables are given as follows:

V1 = 0.147X1 + 0.989X2 , W1 = 0.121Y1 − 0.993Y2 ,


V2 = −0.989X1 + 0.147X2 , W2 = 0.993Y1 + 0.121Y2 .

From (4.44), the contribution ratios of the above two sets of canonical variables
are computed as follows:

3.381
CR(V1 , W1 ) = = 0.716,
3.381 + 0.705
0.705
CR(V2 , W2 ) = = 0.284.
3.381 + 0.705

4.6 Test of the Mean Vector and Variance–Covariance


Matrix in the Multivariate Normal Distribution

Let X 1 , X 2 , . . . , X n be random samples from the p-variate normal distribution with


mean vector μ and variance–covariance matrix Σ. First, we discuss the following
statistic:
116 4 Analysis of Continuous Variables

 2 T
T∗ = nX S−1 X, (4.45)

where

1 1   T
n n
X= Xi, S = Xi − X Xi − X .
n i=1 n − 1 i=1

Theorem 4.6 Let f (x|μ, Σ) be the multivariate normal density function with mean
vector μ and variance–covariance matrix Σ. Then, it follows that
" "
f (x|μ, Σ) f (x|0, Σ)
f (x|μ, Σ)log dx + f (x|0, Σ)log dx = μT  −1 μ.
f (x|0, Σ) f (x|μ, Σ)
(4.46)

Proof Since

f (x|μ, Σ) 1 1 1
log = − (x − μ)T  −1 (x − μ) + xT  −1 x = xT  −1 μ − μT  −1 μ,
f (x|0, Σ) 2 2 2

we have
"
f (x|μ, Σ) 1
f (x|μ, Σ)log dx = μT  −1 μ,
f (x|0, Σ) 2
"
f (x|0, Σ) 1 T −1
f (x|0, Σ)log dx = μ  μ.
f (x|μ, Σ) 2

Hence, the theorem follows. 

From the above theorem, statistics (4.45) is the ML estimator of the KL informa-
tion (4.46) multiplied by sample size n − 1. With respect to the statistic (4.45), we
have the following theorem [1].
n−p
Theorem 4.7 For the statistic (4.45), p(n−1) (T ∗ )2 is distributed according to the
non-central F distribution with degrees of freedom p and n − p. The non-centrality
parameter is μT Σ −1 μ. 

In the above theorem, the non-centrality parameter is the KL information (4.46).


The following statistic is called the Hotelling’s T 2 statistic [11]:
 T  
T 2 = n X − μ0 S−1 X − μ0 . (4.47)

Testing the mean vector of the multivariate normal distribution, i.e.,


4.6 Test of the Mean Vector and Variance-Covariance Matrix … 117

H0 :μ = μ0 versus H1 :μ = μ1 , from Theorem 4.6, we have

D(f (x|μ1 , Σ)||f (x|μ0 , Σ)) + D(f (x|μ0 , Σ)||f (x|μ1 , Σ))
= (μ1 − μ0 )T  −1 (μ1 − μ0 ).

By substituting μ1 and  for X and S in the above formula, respectively, we have


 T  
T 2 = n X − μ0 S−1 X − μ0 . From Theorem 4.7, we have the following theorem:
n−p
Theorem 4.8 For the Hotelling’s T 2 statistic (4.47), p(n−1) T 2 is distributed accord-
ing to the F distribution with degrees of freedom p and n−p under the null hypothesis
H0 :μ = μ0 . Under the alternative hypothesis H1 :μ = μ1 , the statistic is distributed
according to the F distribution with degrees of freedom p and n−p. The non-centrality
is (μ1 − μ0 )T Σ −1 (μ1 − μ0 ). 

Concerning the Hotelling’s T 2 statistic (4.47), the following theorem asymptoti-


cally folds true.

Theorem 4.9 For sufficiently large sample size n, T 2 is asymptotically distributed


according to non-central chi-square distribution with degrees of freedom p. The
non-centrality is n(μ1 − μ0 )T Σ −1 (μ1 − μ0 ).

Proof Let

n−p 2
F= T .
p(n − 1)

For large sample size n, we have

n−p 2 1
T ≈ T 2.
p(n − 1) p

From this,

T 2 ≈ pF,

and from Theorem 4.8, T 2 is asymptotically distributed according to the non-central


chi-square distribution with degrees of freedom p. This completes the theorem. 

In comparison with the above discussion, let us consider the test for variance
matrices, H0 :Σ = Σ 0 , versus H1 :Σ = Σ 0 . Then, the following KL information
expresses the difference between normal distributions f (x|μ, Σ 0 ) and f (x|μ, Σ):
118 4 Analysis of Continuous Variables

D(f (x|μ, Σ)||f (x|μ, Σ 0 )) + D(f (x|μ, Σ 0 )||f (x|μ, Σ))


"
f (x|μ, Σ)
= f (x|μ, Σ)log dx
f (x|μ, Σ 0 )
"
f (x|μ, Σ 0 )
+ f (x|μ, Σ 0 )log dx
f (x|μ, Σ)
1  
= −p + tr Σ −1 0 + Σ 0Σ
−1
. (4.48)
2
For random samples X 1 , X 2 , . . . , X n , the ML estimator of μ and Σ is given by

1 1   T
n
n
X= Xi, Σ = Xi − X Xi − X .
n i=1 n i=1

Then, for given Σ 0 , the ML estimators of D(f (x|μ, Σ)||f (x|μ, Σ 0 )) and
D(f (x|μ, Σ 0 )||f (x|μ, Σ)) are calculated as
 
D f (x|μ, )||f (x|μ,  0 ) = D f x|X̄, ˆ ||f x|X̄,  0 ,
 
ˆ .
D f (x|μ,  0 )||f (x|μ, ) = D f x|X̄,  0 ||f x||X̄, 

From this, the ML estimator of the KL information (4.48) is calculated as



D f (x|μ, )||f (x|μ,  0 ) + D f (x|μ,  0 ||f )(x|μ, )
1 ˆ −1 ˆ
−1

= −p + tr  0 + 0
2
The above statistic is an entropy-based test statistic for H0 :Σ = Σ 0 , versus
H1 :Σ = Σ 0 . With respect to the above statistics, we have the following theorem:
Theorem 4.10 Let f (x|μ, Σ) be the multivariate normal density function with mean
vector μ and variance–covariance matrix Σ. Then, for null hypothesis H0 :Σ =
Σ 0 , the following statistic is asymptotically distributed according to the chi-square
distribution with degrees of freedom 21 p(p − 1):

χ 2 = n D f (x|μ, ||f (x|μ,  0 ) + D f (x|μ,  0 ||f (x|μ, )

Proof Let X i , i = 1, 2, . . . , n be random samples from f (x|μ, Σ); let l(μ, Σ) be


the following log likelihood function


n
l(μ, Σ) = logf (X i |μ, Σ),
i=1

and let
4.6 Test of the Mean Vector and Variance-Covariance Matrix … 119


 

l μ, Σ = max l(μ, Σ), l μ, Σ 0 = max l(μ, Σ 0 ),


μ,Σ μ,

where μ and Σ are the maximum likelihood estimators of μ and Σ, respectively,


i.e.,

1 1  2
n
n

μ=X= Xi, Σ = Xi − X .
n i=1 n i=1

Then, it follows that





1 

  1 n f x i |μ, 
l μ,  − l μ,  0 = log   

n n i=1 f x|μ,  0
"
f (x|μ, )
→ f (x|μ, )log dx,
f (x|μ,  0 )

"



f x|μ, 


D f (x|μ, )||f (x|μ,  0 ) = f x|μ,  log   dx 

f x|μ,  0
"
f (x|μ, )
→ f (x|μ, )log dx,
f (x|μ,  0 )
"    


 f x|μ,  0


D f (x|μ,  0 )||f (x|μ, ) = f x|μ,  0 log dx 




f x|μ, 
"
f (x|μ,  0 )
→ f (x|μ,  0 )log dx
f (x|μ, )

in probability, as n → ∞. Under the null hypothesis, we have



" f x|μ, Σ

"

 



  f x|μ, Σ 0

n f x|μ, Σ log   dx = n f x|μ, Σ 0 log


dx + o(n),
f x|μ, Σ 0

f x|μ, Σ

where
o(n)
→ 0(n → ∞).
n
Hence, under the null hypothesis, we obtain

"





  
f x|μ, 


2 l μ,  − l μ,  0 = 2n f x|μ,  log   dx + o(n)




f x|μ,  0
120 4 Analysis of Continuous Variables

"



 f x|μ, 
= n⎝ f x|μ,  log 


  dx
f x|μ,  0
"    ⎞
  f x|μ,  0
dx⎠ + o(n)


+ f x|μ,  0 log 


f x|μ, 

= n D f (x|μ, ||f )(x|μ,  0 ) + D f (x|μ,  0 ||f )(x|μ, )

+ o(n).

This completes the theorem. 

By using the above result, we can test H0 :Σ = Σ 0 , versus H1 :Σ = Σ 0 .

4.7 Comparison of Mean Vectors of Two Multivariate


Normal Populations

Let p-variate random vectors X k be independently distributed according to normal


distribution with mean vectors and variance–covariance matrices μk and Σ k , and
let f (xk |μk , Σ k ) be the density functions, k = 1, 2. Then, X 1 − X 2 is distributed
according to f (x|μ1 − μ2 , Σ 1 + Σ 2 ) and we have
"
f (x|μ1 − μ2 , Σ 1 + Σ 2 )
f (x|μ1 − μ2 , Σ 1 + Σ 2 )log dx
f (x|0, Σ 1 + Σ 2 )
"
f (x|0, Σ 1 + Σ 2 )
+ f (x|0, Σ 1 + Σ 2 )log dx
f (x|μ1 − μ2 , Σ 1 + Σ 2 )
= (μ1 − μ2 )T (Σ 1 + Σ 2 )−1 (μ1 − μ2 ).

First, we consider the case of Σ 1 = Σ 2 = Σ. Let {X 11 , X 12 , . . . , X 1m }


and {X 21 , X 22 , . . . , X 2n } be random samples from distributions f (x|μ1 , Σ) and
f (x|μ2 , Σ), respectively. Then, the sample mean vectors

1 1
m n
X1 = X 1i and X2 = X 2i , (4.49)
m i=1 n i=1
   
are distributed according to the normal distribution f x|μ1 , m1 Σ and f x|μ2 , 1n Σ ,
respectively. From the samples, the unbiased estimators of μ1 , μ2 and Σ are,
respectively, given by

(m − 1)S1 + (n − 1)S2
μ1 = X 1 , μ2 = X 2 , and S = ,
m+n−2
4.7 Comparison of Mean Vectors of Two Multivariate … 121

where
⎧ m 
⎪   T

⎨ S1 =
1
X 1i − X 1 X 1i − X 1 ,
m−1
i=1
n 
  T


⎩ S2 = X 2i − X 2 X 2i − X 2 .
1
n−1
i=1

From this, the Hotelling’s T 2 statistic is given by

mn  T −1 

T2 = X1 − X2 S X1 − X2 . (4.50)
m+n

From Theorem 4.7, we have the following theorem:

Theorem 4.11 For the Hotelling’s T 2 (4.50), m+n−p−1


(m+n−2)p
T 2 is distributed according
to non-central F distribution with degrees of freedom p and n − p − 1, where the
non-centrality is (μ1 − μ2 )T Σ −1 (μ1 − μ2 ). 

Second, the case of Σ 1 = Σ 2 is discussed. Since sample mean vec-


 X 1 and  X 2 (4.49)
tors  are distributed
 according to normal distributions
f x1 |μ1 , m1 Σ 1 and f x2 |μ2 , 1n Σ 2 , respectively, X 1 − X 2 is distributed according
 
to f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2 . From this, it follows that
"    
1 1 f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2
f x|μ1 − μ2 , Σ 1 + Σ 2 log   dx
m n f x|0, m1 Σ 1 + 1n Σ 2
"    
1 1 f x|0, m1 Σ 1 + 1n Σ 2
+ f x|0, Σ 1 + Σ 2 log   dx
m n f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2
 −1
T 1 1
= (μ1 − μ2 ) Σ1 + Σ2 (μ1 − μ2 ).
m n

Then, the estimator of the above information is given as the following statistic:
 −1
 T 1 1  
T 2 = X1 − X2 S1 + S2 X1 − X2 . (4.51)
m n

With respect to the above statistic, we have the following theorem:

Theorem 4.12 Let {X 11 , X 12 , . . . , X 1m } and {X 21 , X 22 , . . . , X 2n } be random sam-


ples from distributions f (x|μ1 , Σ 1 ) and f (x|μ2 , Σ 2 ), respectively. For sufficiently
large sample size m and n, the statistic T 2 (4.51) is asymptotically distributed
according to the non-central chi-square distribution with degrees of freedom p.

Proof For large sample size m and n, it follows that


122 4 Analysis of Continuous Variables
 −1
 T 1 1  
T 2 = X1 − X2 S1 + S2 X1 − X2
m n
 −1
 T 1 1  
≈ X1 − X2 Σ1 + Σ2 X1 − X2 . (4.52)
m n
 
Since X 1 − X 2 is normally distributed according to f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2 ,
statistic (4.52) is distributed according to the non-central chi-square distribution with
degrees of freedom p. Hence, the theorem follows. 

Remark 4.7 In Theorem 4.12, the theorem holds true, regardless of the normality
 because X 1 − X 2 is asymptotically
assumption of samples,  distributed according to
normal distribution f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2 .

4.8 One-Way Layout Experiment Model

Let Yij , j = 1, 2, . . . , n be random samples observed at level i = 1, 2, . . . , I . One-


way layout experiment model [7] is expressed by

Yij = μ + αi + eij , i = 1, 2, . . . , I ; j = 1, 2, . . . , n.

where
 
E Yij = μ + αi , i = 1, 2, . . . , I ; j = 1, 2, . . . , n,
     
E eij = 0, Var eij = σ 2 , Cov eij , ekl = 0, i = k, j = j.

For model identification, the following constraint is placed on the model.


I
αi = 0. (4.53)
i=1

In order to consider the above model in a GLM framework, the following dummy
variables are introduced:
#
1 (for level i)
Xi = , i = 1, 2, . . . , I . (4.54)
0 (for the other levels)

and we set the explanatory dummy variable vector as X = (X1 , X2 , . . . , XI )T . Assum-


ing the random component, i.e., the conditional distribution of Y given X = x, f (y|x),
to be the normal distribution with mean θ and variance σ 2 , we have
4.8 One-Way Layout Experiment Model 123


yθ − θ2
2
y2 √
f (y|x) = exp − − log 2π σ ,
2
σ2 2σ 2

Since the systematic component is given by a linear combination of the dummy


variables (4.54):


I
η= αi Xi ,
i=1

it follows that


I
θ =μ+ αi Xi .
i=1

The factor levels are randomly assigned to experimental units, e.g., subjects, with
probability 1I , so the marginal distribution of response Y is

θi2

1 √
I
yθi − y2
f (y) = exp 2
− − log 2π σ ,
2
I i=1 σ2 2σ 2

where

θi = μ + αi , i = 1, 2, . . . , I .

Then, we have

I " " 
1 f (y|Xi = 1) f (y)
f (y|Xi = 1)log dy + f (y)log dy
I i=1 f (y) f (y|Xi = 1)
 I  I
i=1 Cov(Y , αi Xi ) i=1 αi
1 1
Cov(Y , θ ) 2
= = I
= I
. (4.55)
σ2 σ2 σ2
Hence, the entropy coefficient of determination is calculated as
I
i=1 αi
1 2
ECD(X, Y ) = 1 I
I
.
i=1 αi + σ
2 2
I

In the above one-way layout experiment, the variance of response variable Y is


computed as follows:
124 4 Analysis of Continuous Variables
"
Var(Y ) = (y − μ)2 f (y)dy
"
θi2


I
yθi − y2 √
21
= (y − μ) exp 2
− − log 2π σ dy
2
I i=1 σ2 2σ 2
I "

θi2

1 yθi − y2 √
= (y − μ) exp
2 2
− − log 2π σ dy
2
I i=1 σ2 2σ 2

1  2  1 2
I I
= αi + σ 2 = α + σ 2. (4.56)
I i=1 I i=1 i

and the partial variance of Y given X = (X1 , X2 , . . . , XI )T is

1
I
Var(Y |X) = Var(Y |Xi = 1)
I i=1
"
θi2

1 √
I
yθi − y2
= (y − μ − αi )2 exp 2
− − log 2π σ 2 dy
I i=1 σ2 2σ 2

1 2
I
= σ = σ 2. (4.57)
I i=1

From (4.56) and (4.57), the explained variance of Y by X is given by

1 2
I
Var(Y ) − Var(Y |X) = α .
I i=1 i

Hence, we obtain

Cov(Y , θ ) = Var(Y ) − Var(Y |X),

and KL information (4.55) is also expressed by

I " " 
1 f (y|Xi = 1) f (y)
f (y|Xi = 1)log dy + f (y)log dy
I i=1 f (y) f (y|Xi = 1)
Var(Y ) − Var(Y |X)
= .
Var(Y |X)

The above result shows the KL information (4.55) is interpreted as a signal-to-


noise ratio, where the signal is Cov(Y , θ ) = Var(Y ) − Var(Y |X) and the noise is
Var(Y |X) = σ 2 .
4.8 One-Way Layout Experiment Model 125

From the data Yij , i = 1, 2, . . . , I ; j = 1, 2, . . . , J , we usually have the following


decomposition:


I 
n
 2 
I
 2 
I 
n
 2
SS T = Yij − Y ++ = n Y i+ − Y ++ + Yij − Y i+ ,
i=1 j=1 i=1 i=1 j=1

where

1 1 
n I n
Y i+ = Yij , Y ++ = Yij .
n j=1 nI i=1 j=1

Let


I
 2 
I 
n
 2
SS A = n Y i+ − Y ++ , SS E = Yij − Y i+ , (4.58)
i=1 i=1 j=1

Then, the expectations of the above sums of variances are calculated as follows:


I $ 2 % 
I
E(SS A ) = n E Y i+ − Y ++ = (I − 1)σ 2 + n αi2 ,
i=1 i=1
I  n $ 2 %  I  n
n−1 2
E(SS E ) = E Yij − Y i+ = σ = I (n − 1)σ 2 .
i=1 j=1 i=1 j=1
n

In this model, the entropy correlation coefficient between factor X and response
Y is calculated as
&
' 1 I
Cov(Y , θ ) ' i=1 αi
2
ECorr(X , Y ) = √ √ =(1 I
. (4.59)
Var(Y ) Var(θ ) I
I
i=1 αi + σ
2 2

and ECC is the square root of ECD. Since the ML estimators of the effects αi2 and
error variance σ 2 are, respectively, given by

1   2
I n
α i = Y i+ − Y ++ and σ 2 =

Yij − Y i+ ,
nI i=1 j=1

the ML estimator of ECD is computed as


126 4 Analysis of Continuous Variables

Table 4.2 Variance decomposition


Factor SS df SS/df Expectation
SSA
Factor A SSA I −1 I −1 
I
σ2 + J
I −1 αi2
i=1
SSE
Error SSE I (n − 1) I (n−1) σ2
Total SSA + SSE nI − 1

I  2
i=1 Y i+ − Y ++

1
ECD(X, Y ) =   2
I
   2
I
1
I i=1 Y i+ − Y ++ + nI1 Ii=1 nj=1 Yij − Y i+
SSA
= . (4.60)
SSA + SSE

Since the F statistic in Table 4.2 is


1
I −1
SSA SSA I −1
F= ⇔ = F. (4.61)
1
I (n−1)
SSE SSE I (n − 1)

ECD can be expressed by the above F statistic.


SSA

SSE (I − 1)F
ECD(X, Y ) = = . (4.62)
SSA
SSE
+1 (I − 1)F + I (n − 1)

If samples are Yij , j = 1, 2, . . . , ni ; i = 1, 2, . . . , I , the constraint on the main


effects (4.53) is modified as


I
ni αi = 0, (4.63)
i=1

and the marginal density function as



θi2

I
ni yθi − y2 √
f (y) = exp 2
− − log 2π σ 2 , (4.64)
i=1
N σ2 2σ 2

where


I
N= ni .
i=1
4.8 One-Way Layout Experiment Model 127

Then, the formulae derived for samples Yij , j = 1, 2, . . . , n; i = 1, 2, . . . , I are


modified by using (4.63) and (4.64).
Multiway layout experiment models can also be discussed in entropy as explained
above. Two-way layout experiment models are considered in Chap. 7.

4.9 Classification and Discrimination


 T
Let X = X1 , X2 , . . . , Xp be p-dimensional random vector, and let fi (x), i = 1, 2 be
density functions. In this section, we discuss the optimal classification of observation
X = x into one of the two populations with the density functions fi (x), i = 1, 2 [15].
Below, the populations and the densities are identified. In the classification, we have
two types of errors. One is the error that the observation X = x from population
f1 (x) is assigned to population f2 (x), and the other is the error that the observation
X = x from population f2 (x) is assigned to population f1 (x). To discuss the optimal
classification, the sample space of X, , is decomposed into two subspaces

 = 1 + 2 . (4.65)

such that if x ∈ 1 , then the observation x is classified into population f1 (x) and if
x ∈ 2 , into population f2 (x). If there is no prior information on the two populations,
the first error probability is
"
P(2|1) = f1 (x)dx,
2

and the second one is


"
P(1|2) = f2 (x)dx.
1

In order to minimize the total error probability P(2|1) + P(1|2), the optimal
classification procedure is made by deciding the optimal decomposition of the sample
space (4.65).

Theorem 4.13 The optimal classification procedure that minimizes the error prob-
ability P(2|1) + P(1|2) is given by the following decomposition of sample space
(4.65):
# ) # )
f1 (x) f1 (x)
1 = x| > 1 , 2 = x| ≤1 . (4.66)
f2 (x) f2 (x)
128 4 Analysis of Continuous Variables

Proof Let 1 and 2 be any decomposition of sample space . For the decompo-
sition, the error probabilities are denoted by P(2|1) and P(1|2) . Then,
" " "
P(2|1) = f1 (x)dx = f1 (x)dx + f1 (x)dx,
2 2 ∩1 2 ∩2
" " "
P(1|2) = f2 (x)dx = f2 (x)dx + f2 (x)dx.
1 1 ∩1 1 ∩2

From this, we have


" "
P(2|1) − P(2|1) = f1 (x)dx − f1 (x)dx
2 2
⎛ ⎞
" " "
⎜ ⎟
= f1 (x)dx − ⎝ f1 (x)dx + f1 (x)dx⎠
2 2 ∩1 2 ∩2
" " "
= f1 (x)dx − f1 (x)dx ≤ f2 (x)dx
1 ∩2 2 ∩1 1 ∩2
"
− f1 (x)dx,
2 ∩1

and
" " "

P(1|2) − P(1|2) = f2 (x)dx − f2 (x)dx ≤ f1 (x)dx
2 ∩1 1 ∩2 2 ∩1
"
− f2 (x)dx.
1 ∩2

Hence, it follows that


 
(P(2|1) + P(1|2)) − P(2|1) + P(1|2) ≤ 0
⇔ P(2|1) + P(1|2) ≤ P(2|1) + P(1|2) .

This completes the theorem. 


The above theorem is reconsidered from a viewpoint of entropy. Let δ(x) be
a classification (decision) function that takes population fi (x) with probability
πi (x), i = 1, 2, where
4.9 Classification and Discrimination 129

π1 (x) + π2 (x) = 1, x ∈ .

In this sense, the classification function δ(X) is random variable. If there is no


prior information on the two populations, then the probability that decision δ(x) is
correct is
f1 (x) f2 (x)
p(x) = π1 (x) + π2 (x) ,
f1 (x) + f2 (x) f1 (x) + f2 (x)

and the incorrect decision probability is 1 − p(x). Let


#
1 (if the decision δ(X)is correct)
Y (δ(X)) = .
0 (if the decision δ(X) is incorrect)

Then, the entropy of classification function δ(X) with respect to the correct and
incorrect classifications can be calculated by
"
H(Y (δ(X))) = (−p(x)logp(x) − (1 − p(x))log(1 − p(x)))f (x)dx,

where
f1 (x) + f2 (x)
f (x) = .
2
With respect to the above entropy, it follows that
" # )
f1 (x) f2 (x)
H (Y (δ(X))) ≥ min −log , −log f (x)dx
f1 (x) + f2 (x) f1 (x) + f2 (x)
"
f1 (x)
=− f (x)log dx
f1 (x) + f2 (x)
1
"
f2 (x)
− f (x)log dx, (4.67)
f1 (x) + f2 (x)
2

where 1 and 2 are given in (4.66). According to Theorem 4.13, for (4.66), the
optimal classification function δoptimal (X) can be set as
#
1 x ∈ 1
π1 (x) = , π2 (x) = 1 − π1 (x). (4.68)
0 x ∈ 2

From the above discussion, we have the following theorem:


Theorem 4.14 For any classification function δ(x) for populations fi (x), i = 1, 2,
the optimal classification function δoptimal (X) (4.68) satisfies the following inequality:
130 4 Analysis of Continuous Variables
  
H(Y (δ(X))) ≥ H Y δoptimal (X)

Proof For the optimal classification function, we have


"
   f1 (x)
H Y δoptimal (X) = − f (x)log dx
f1 (x) + f2 (x)
1
"
f2 (x)
− f (x)log dx.
f1 (x) + f2 (x)
2

From (4.67), the theorem follows. 

For p-variate normal distributions N (μ1 , ) and N (μ2 , ), let f1 (x) and f2 (x) be
the density functions, respectively. Then, it follows that

f1 (x) 1
> 1 ⇔ logf1 (x) − logf2 (x) = − (x − μ1 ) −1 (x − μ1 )T
f2 (x) 2
1
+ (x − μ2 ) −1 (x − μ2 )T = (μ1 − μ2 ) −1 xT
2
1 1
− μ1  −1 μT1 + μ2  −1 μT2
2 2
 
−1 μT1 + μT2
= (μ1 − μ2 ) x −
T
>0 (4.69)
2

Hence, sample subspace 1 is made by (4.69), and the p-dimensional hyperplane

1 1
(μ1 − μ2 ) −1 xT − μ1  −1 μT1 + μ2  −1 μT2 = 0,
2 2
discriminates the two normal distributions. From this, the following function is called
the linear discriminant function:

Y = (μ1 − μ2 ) −1 xT . (4.70)

The Mahalanobis’ distance between the two mean vectors μ1 and μ2 of p-variate
normal distributions N (μ1 , ) and N (μ2 , ) is given by

DM (μ1 , μ2 ) = (μ1 − μ2 )T  −1 (μ1 − μ2 )

The square of the above distance is expressed by the following KL information:


4.9 Classification and Discrimination 131
"
f (x|μ2 , Σ)
DM (μ1 , μ2 )2 = f (x|μ2 , Σ)log dx
f (x|μ1 , Σ)
"
f (x|μ1 , Σ)
+ f (x|μ1 , Σ)log dx. (4.71)
f (x|μ2 , Σ)

The discriminant function (4.70) is considered in view of the above KL


information, i.e., square of the Mahalanobis’ distance. We have the following
theorem:

Theorem 4.15 For any p-dimensional column vectors a, μ1 , μ2 and p × p vari-


 2
 T Σ, the DM α T μ 1 , α μ2
T
ance–covariance matrix KL
 information
  between nor-
mal distributions N α μ1 , α T Σα and N α T μ2 , α T Σα (4.71) is maximized by
α = Σ −1 (μ1 − μ2 ) and
 2
max DM α T μ1 , α T μ2 = (μ1 − μ2 )T  −1 (μ1 − μ2 ). (4.72)
α

Proof From (4.71), it follows that


 T 2
 T 2 α μ1 − α T μ2
DM α μ1 , α μ2 =
T
(4.73)
αT Σα

Given αT Σα = 1, the above function is maximized. For Lagrange multiplier λ,


we set
 2
g = α T μ1 − α T μ2 − λα T Σα

and differentiating the above function with respect to α, we have

∂g  
= 2 α T μ1 − α T μ2 (μ1 − μ2 ) − 2λΣα = 0
∂α
 T 
Since ν = α μ1 − α T μ2 is a scalar, we obtain

ν(μ1 − μ2 ) = λα.

From this, we have


ν −1
α=  (μ1 − μ2 ).
λ

Function (4.73) is invariant with respect to scalar λν , so the theorem follows. 

From the above theorem, the discriminant function (4.70) discriminates N (μ1 , )
and N (μ2 , ) in the sense of the maximum of the KL information (4.72).
132 4 Analysis of Continuous Variables

For the two distributions fi (x), i = 1, 2, if prior probabilities of the distributions


are given, the above discussion is modified as follows. Let qi be prior probabilities
of fi (x), i = 1, 2, where q1 + q2 = 1. Then, substituting fi (x) for qi fi (x), a discussion
similar to the above one can be made. For example, the optimal procedure (4.66) is
modified as
#  ) #  )
 q1 f1 (x)  q1 f1 (x)
1 = x  < 1 , 2 = x  ≤1 .
q2 f2 (x) q2 f2 (x)

4.10 Incomplete Data Analysis

The EM algorithm [2] is widely used for the ML estimation from incomplete data.
Let X and Y be complete and incomplete data (variables), respectively; let f (x|φ)
and g(y|φ) be the density or probability function of X and Y, respectively; let  be
the sample space of complete data X and (y) = {x ∈ |Y = y} be the conditional
sample space of X given Y = y. Then, the log likelihood function of φ based on
incomplete data Y = y is

l(φ) = logg(y|φ) (4.74)

and the conditional density or probability function of X given Y = y is

f (x|φ)
k(x|y, φ) = . (4.75)
g(y|φ)

Let
     
H φ  |φ = E logk X|y, φ  |y, φ
"
 
= k(x|y, φ)logk x|y, φ  dx.
(y)

 
Then, the above function is the negative entropy of distribution k x|y, φ  (4.75)
for distribution k(x|y, φ). Hence, from Theorems 1.1 and 1.8, we have
 
H φ  |φ ≤ H (φ|φ). (4.76)
 
The inequality holds if and only if k x|y, φ  = k(x|y, φ). Let
4.10 Incomplete Data Analysis 133
⎛ ⎞
"
     ⎜ f (x|φ)   ⎟
Q φ  |φ = E logf X|φ  |y, φ ⎝= logf x|φ  dx⎠.
g(y|φ)
(y)

From (4.74) and (4.75), we have


"   "
  f (x|φ) f x|φ  f (x|φ)  
H φ  |φ = log    dx = log f x|φ  dx
g(y|φ) g y|φ g(y|φ)
(y) (y)
     
− logg y|φ  = Q φ  |φ − l φ  . (4.77)
 
With respect to function H φ  |φ , we have the following theorem [2]:
Theorem 4.16 Let φ p , p = 1, 2, . . . , be a sequence such that
   
max

Q φ  |φ p = Q φ p+1 |φ p .
φ

Then,
   
l φ p+1 ≥ l φ p , p = 1, 2, . . . .

Proof From (4.77), it follows that


             
0 ≤ Q φ p+1 |φ p − Q φ p |φ p = H φ p+1 |φ p + l φ p+1 − H φ p |φ p + l φ p
         
= H φ p+1 |φ p − H φ p |φ p + l φ p+1 − l φ p . (4.78)

From (4.76), we have


   
H φ p+1 |φ p − H φ p |φ p ≤ 0,

so it follows that
   
l φ p+1 ≥ l φ p .

from (4.77). This completes the theorem. 


The above theorem is applied to obtain the ML estimate of parameter φ, φ , such


that

l φ = max l(φ)
φ

from incomplete data y.


Definition 4.4 The following iterative procedure with E-step and M-step for the ML
estimation, (i) and (ii), is defined as the Expectation-Maximization (EM) algorithm:
134 4 Analysis of Continuous Variables

(i) Expectation step (E-step)


For estimate φ p at the pth step, compute the conditional expectation of logf (x|φ)
given the incomplete data y and parameter φ p :
   
Q φ|φ p = E logf (x|φ)|y, φ p . (4.79)

(ii) Maximization step (M-step)


Obtain φ p+1 such that
   
Q φ p+1 |φ p = max Q φ|φ p . (4.80)
φ

If there exists φ ∗ = lim φ p , from Theorem 4.16, we have


p→∞

   
l φ ∗ ≥ l φ p , p = 1, 2, . . . .

For the ML estimate φ , it follows that


Q φ |φ = max Q φ|φ .
φ

Hence, the EM algorithm can be applied to obtain the ML estimate φ .

If the density or probability function of complete data X is assumed to be the


following exponential family of distribution:
 
exp φt(x)T
f (x|φ) = b(x) ,
a(φ)

where φ be a 1 × r parameter vector and t(x) be a 1 × r sufficient statistic vector.


Since
"
 
a(φ) = b(x)exp φt(x)T dx,

we have
∂   ∂  
Q φ|φ p = E logf (x|φ)|y, φ p
∂φ ∂φ
  ∂
= E t(X)|y, φ p − loga(φ)
∂φ
"
  1 ∂  
= E t(X)|y, φ p − b(x)exp φt(x)T dx
a(φ) ∂φ
 
= E t(X)|y, φ − E(t(X)|φ) = 0.
p
4.10 Incomplete Data Analysis 135

From the above result, the EM algorithm with (4.79) and (4.80) can be simplified
as follows:
(i) E-step
Compute
 
t p+1 = E t(X)|y, φ p . (4.81)

(ii) M-step
Obtain φ p+1 from the following equation:

t p+1 = E(t(X)|φ). (4.82)

Example 4.4 Table 4.3 shows random samples according to bivariate normal
distribution with mean vector μ = (50, 60) and variance–covariance matrix
 
225 180
= .
180 400

Table 4.3 Artificial complete data from bivariate normal distribution


Case X1 X2 Case X1 X2 Case X1 X2
1 69.8 80.8 21 54.8 28.3 41 41.6 50.4
2 69.5 47.1 22 54.7 92.7 42 40.4 28.7
3 67.9 60.5 23 53.5 78.3 43 39.9 64.4
4 67.8 75.8 24 53.0 56.8 44 39.8 66.5
5 65.5 58.5 25 52.9 69.3 45 39.1 44.9
6 64.4 85.8 26 52.2 80.1 46 38.7 57.8
7 63.9 52.3 27 51.6 71.7 47 38.0 49.2
8 63.9 77.8 28 51.4 60.4 48 37.8 49.8
9 62.7 52.0 29 49.9 60.9 49 36.8 70.5
10 60.5 71.2 30 49.8 58.8 50 33.7 48.6
11 60.5 74.0 31 48.8 71.3 51 33.3 45.5
12 60.4 78.7 32 47.4 52.3 52 32.9 53.6
13 60.0 83.3 33 47.1 57.1 53 32.4 54.2
14 57.9 78.9 34 46.4 63.2 54 30.3 45.5
15 57.6 52.9 35 45.6 63.7 55 27.8 53.0
16 57.2 72.8 36 44.8 74.2 56 26.7 58.0
17 57.0 73.7 37 44.8 51.5 57 24.9 39.3
18 55.5 58.2 38 43.8 43.4 58 24.5 30.4
19 55.2 66.6 39 43.6 58.7 59 20.9 23.0
20 55.1 68.4 40 41.7 57.5 60 14.0 28.6
136 4 Analysis of Continuous Variables

From Table 4.3, the mean vector, the variance–covariance matrix, and the
correlation coefficient are estimated as
 


174.3 123.1

μ = (47.7, 59.7),  = , ρ = 0.607.


123.18 235.7

Table 4.4 illustrates incomplete (missing) data from Table 4.3. In the missing data,
if only the samples 1–40 with both values of the two variables are used for estimating
the parameters, we have
 


60.7 24.7

μ0 = (55.3, 65.5),  0 = , ρ 0 = 0.242. (4.83)


24.7 171.0

In data analysis, we have to use all the data in Table 4.4 to estimate the parameters.
The EM algorithm (4.81) and (4.82) is applied to analyze the data in Table 4.4. Let
μp ,  p and ρ p be the estimates of the parameters in the pth iteration, where

Table 4.4 Incomplete (missing) data from Table 4.3


Case X1 X2 Case X1 X2 Case X1 X2
1 69.8 80.8 21 54.8 28.3 41 41.6 Missing
2 69.5 47.1 22 54.7 92.7 42 40.4 Missing
3 67.9 60.5 23 53.5 78.3 43 39.9 Missing
4 67.8 75.8 24 53.0 56.8 44 39.8 Missing
5 65.5 58.5 25 52.9 69.3 45 39.1 Missing
6 64.4 85.8 26 52.2 80.1 46 38.7 Missing
7 63.9 52.3 27 51.6 71.7 47 38.0 Missing
8 63.9 77.8 28 51.4 60.4 48 37.8 Missing
9 62.7 52.0 29 49.9 60.9 49 36.8 Missing
10 60.5 71.2 30 49.8 58.8 50 33.7 Missing
11 60.5 74.0 31 48.8 71.3 51 33.3 Missing
12 60.4 78.7 32 47.4 52.3 52 32.9 Missing
13 60.0 83.3 33 47.1 57.1 53 32.4 Missing
14 57.9 78.9 34 46.4 63.2 54 30.3 Missing
15 57.6 52.9 35 45.6 63.7 55 27.8 Missing
16 57.2 72.8 36 44.8 74.2 56 26.7 Missing
17 57.0 73.7 37 44.8 51.5 57 24.9 Missing
18 55.5 58.2 38 43.8 43.4 58 24.5 Missing
19 55.2 66.6 39 43.6 58.7 59 20.9 Missing
20 55.1 68.4 40 41.7 57.5 60 14.0 Missing
4.10 Incomplete Data Analysis 137
 p 
 p p p
σ11 σ12
μ = μ1 , μ2 ,  p =
p
p p .
σ21 σ22

Let x1i and x2i be the observed values of case i for variables X1 and X2 in
Table 4.4, respectively. For example, (x11 , x21 ) = (69.8, 80.8) and x1,41 , x2,41 =
(41.6, missing). Then, in the E-step (4.81), the missing values in Table 4.4 are
estimated as follows:

σ21  p
p
p p
x2i = μ2 + p x1i − μ1 , i = 41, 42, . . . , 60.
σ11

In the (p + 1) th iteration, by using the observed data in Table 4.4 and the above
estimates of the missing values, we have μp+1 ,  p+1 and ρ p+1 in the M-step (4.82).
Setting the initial values of the parameters by (4.83), we have
 
174.3 70.8
μ∞ = (47.7, 62.4),  ∞ = , ρ ∞ = 0.461. (4.84)
70.8 135.2

Hence, the convergence values (4.84) are the ML estimates obtained by using the
incomplete data in Table 4.4.
The EM algorithm can be used for the ML estimation from incomplete data in
both continuous and categorical data analyses. A typical example of categorical
incomplete data is often faced in an analysis from phenotype data. Although the
present chapter has treated continuous data analysis, in order to show an efficacy of
the EM algorithm to apply to analysis of phenotype data, the following example is
provided.

Example 4.5 Let p, q, and r be the probabilities or ratios of blood gene types A,
B, and O, respectively, in a large and closed population. Then, a randomly selected
individual in the population has one of genotypes AA, AO, BB, BO, AB, and OO with
probability p2 , 2pr, q2 , 2qr, 2pq, r 2 , respectively; however, the complete information
on the genotype cannot be obtained and we can merely decide phenotype A, B, or
O from his or her blood sample. Table 4.5 shows an artificial data produced by
p = 0.4, q = 0.3, r = 0.3. Let nAA , nAO , nBB , nBO , nAB , and nOO be the numbers of
genotypes AA, AO, BB, BO, AB, and OO in the data in Table 4.5. Then, we have

nAA + nAO = 394, nBB + nBO = 277, nAB = 239, nOO = 90.

Table 4.5 Artificial blood phenotype data


Phenotype A B AB OO Total
Frequency 394 277 239 90 1000
138 4 Analysis of Continuous Variables

In this sense, the data in Table 4.5 are incomplete, and X =


(nAA , nAO , nBB , nBO , nAB , nOO ) is complete and sufficient statistics of the parame-
ters. Given the total in Table 4.5, the complete data are distributed according to a
multinomial distribution, so the E-step and M-step (4.81) and (4.82) are given as fol-
lows. Let p(u) , q(u) , and r (u) be the estimates of parameters p, q, and r in the (u + 1)
th iteration, and let nu+1 u+1 u+1 u+1 u+1 u+1
AA , nAO , nBB , nBO , nAB , and nOO be the estimated complete
data in the (u + 1) th iteration. Then, we have the EM algorithm as follows:
(i) E-step
p(u)2 2p(u) r (u)
AA = (nAA + nAO )
nu+1 , nu+1
AO = (nAA + nAO ) (u)2 ,
+ 2p r
p(u)2 (u) (u) p + 2p(u) r (u)
q(u)2 2q(u) r (u)
nu+1 = (nBB + nBO ) (u)2 , nu+1
BO = (nBB + nBO ) (u)2 ,
BB
q + 2q r(u) (u) q + 2q(u) r (u)
AB = nAB , nOO = nOO .
nu+1 u+1

(ii) M-step

AA + nAO + nAB
2nu+1 2nu+1 + nu+1
BO + nAB
u+1
p(u+1) = , q(u+1) = BB ,
2n 2n

r (u+1) = 1 − p(u+1) − q(u+1) ,

where n = nAA + nAO + nBB + nBO + nAB + nOO .

For the initial estimates of the parameters p(0) = 0.1, q(0) = 0.5, and r (0) = 0.4,
we have

p(1) = 0.338, q(1) = 0.311, r (1) = 0.350,

and the convergence values are

p(10) = 0.393, q(10) = 0.304, r (10) = 0.302.

In this example, the algorithm converges quickly.

4.11 Discussion

In this chapter, first, correlation analysis has been discussed in association model
and GLM frameworks. In correlation analysis, the absolute values of the correlation
coefficient and the partial correlation coefficient are ECC and the conditional ECC,
respectively, and the multiple correlation coefficient and the coefficient of determi-
nation are the same as ECC and ECD, respectively. In canonical correlation analysis,
4.11 Discussion 139

it has been shown that the KL information that expresses the association between
two random vectors are decomposed into those between the pairs of canonical vari-
ables, and ECC and ECD have been calculated to measure the association between
two random vectors. Second, it has been shown that basic statistical methods have
been explained in terms of entropy. In testing the mean vectors, the Hotelling’s T 2
statistic has been deduced from KL information and in testing variance–covariance
matrices of multivariate normal distributions, an entropy-based test statistic has been
proposed. The discussion has been applied in comparison with two multivariate
normal distributions. The discussion has a possibility, which is extended to a more
general case, i.e., comparison of more than two normal distributions. In the exper-
imental design, one-way layout experimental design model has been treated from
a viewpoint of entropy. The discussion can also be extended to multiway layout
experimental design models. In classification and discriminant analysis, the opti-
mal classification method is reconsidered through an entropy-based approach, and
the squared Mahalanobis’s distance has been explained with the KL information.
In missing data analysis, the EM algorithm has been overviewed in entropy. As
explained in this chapter, entropy-based discussions for continuous data analysis are
useful and it suggests a novel direction to make approaches for other analyses not
treated in this chapter.

References

1. Anderson, T. W. (1984). An introduction to multivariate statistical analysis. New York: Wiley.


2. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, B, 39,
1–38.
3. Eshima, N., & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive
power of generalized linear models. Statistics and Probability Letters, 77, 588–593.
4. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear
models. Computational Statistics & Data Analysis, 54, 1381–1389.
5. Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in
samples of an indefinitely large population. Biometrika, 10, 507–521.
6. Fisher, R. A. (1935). The design of experiments. London: Pliver and Boyd.
7. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7, 179–188.
8. Fisher, R. A. (1938). The statistical utilization of multiple measurements. Annals of Eugenics,
8, 376–386.
9. Goodman, L. A. (1981). Association models and canonical correlation in the analysis of cross-
classification having ordered categories. Journal of the American Statistical Association, 76,
320–334.
10. Hastie, T., & Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures. Journal of
the Royal Statistical Society: Series B, 58, 155–176.
11. Hotelling, H. (1931). The generalization of student’s ratio. Annals of Mathematical Statistics,
2, 360–372.
12. Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.
13. Mahalanobis, P. C. (1936). On the general distance in statistics. Proceedings of the National
Institute of Sciences of India, 2, 49–55.
140 4 Analysis of Continuous Variables

14. Patnaik, P. B. (1949). The non-central χ2 and F-distributions and their applications. Biometrika,
36, 202–232.
15. Wald, A. (1944). On a statistical problem arising in the classification of an individual into one
of two groups. Annals of Statistics, 15, 145–162.
Chapter 5
Efficiency of Statistical Hypothesis Test
Procedures

5.1 Introduction

The efficiency of hypothesis test procedures has been discussed mainly in the context
of the Pittman- and Bahadur-efficiency approaches [1] up to now. In the Pitman-
efficiency approach [4, 5], the power functions of the test procedures are compared
from a viewpoint of sample sizes, on the other hand, in the Bahadur-efficiency
approach, the efficiency is discussed according to the slopes of log power functions,
i.e., the limits of the ratios of the log power functions for sample sizes. Although the
theoretical aspects in efficiency of test statistics have been derived in both approaches,
herein, most of the results based on the two approaches are the same [6]. The aim
of this chapter is to reconsider the efficiency of hypothesis testing procedures in the
context of entropy. In Sect. 5.2, the likelihood ratio test is reviewed, and it is shown
that the procedure is the most powerful test. Section 5.3 reviews the Pitman and
Bahadur efficiencies. In Sect. 5.4, the asymptotic distribution of the likelihood ratio
test statistic is derived, and the entropy-based efficiency is made. It is shown that
the relative entropy-based efficiency is related the Fisher information. By using the
median test as an example, it is shown that the results based on the relative entropy-
based efficiency, the relative Pitman and Bahadur efficiencies are the same under an
appropriate condition. Third, “information of parameters” in general test statistics is
discussed, and a general entropy-based efficiency is defined. The relative efficiency
of the Wilcoxon test, as an example, is discussed from the present context.

5.2 The Most Powerful Test of Hypotheses

Let x1 , x2 , . . . , xn be random samples according to probability or density function


f (x|θ ), where θ is a parameter. In testing the null hypothesis H0 : θ = θ0 versus the

© Springer Nature Singapore Pte Ltd. 2020 141


N. Eshima, Statistical Data Analysis and Entropy,
Behaviormetrics: Quantitative Approaches to Human Behavior 3,
https://doi.org/10.1007/978-981-15-2552-0_5
142 5 Efficiency of Statistical Hypothesis Test Procedures

alternative hypothesis H0 : θ = θ1 , for constant λ, the critical region with significance


level α is assumed to be
 n 
f (xi |θ1 )
W0 = (x1 , x2 , . . . , xn ) ni=1 >λ . (5.1)
i=1 f (xi |θ0 )

The above testing procedure is called the likelihood ratio test. The procedure is
the most powerful, and it is proven in the following theorem [3]:

Theorem 5.1 (Neyman–Pearson Lemma) Let W be any critical region with signif-
icance level α and W0 that based on the likelihood ratio with significance level α
(5.1), the following inequality holds:

P((X1 , X2 , . . . , Xn ) ∈ W0 |θ1 ) ≥ P((X1 , X2 , . . . , Xn ) ∈ W |θ1 ).

Proof Let W be any critical region with significance level α. Then,


   
P(W0 |θ1 ) − P(W |θ1 ) = P W0 ∩ W c |θ1 − P W0c ∩ W |θ1
  n
= f (xi |θ1 )dx1 dx2 . . . dxn
W0 ∩W c i=1
  n
− f (xi |θ1 )dx1 dx2 . . . dxn .
W0c ∩W i=1
 
n
≥λ f (xi |θ0 )dx1 dx2 . . . dxn
W0 ∩W c i=1
 
n
−λ f (xi |θ0 )dx1 dx2 . . . dxn . (5.2)
W0c ∩W i=1

Since the critical regions W0 and W have significance level α, we have


 
n
α= f (xi |θ0 )dx1 dx2 . . . dxn
W0 i=1
 
n
= f (xi |θ0 )dx1 dx2 . . . dxn
W0 ∩W i=1
 
n
+ f (xi |θ0 )dx1 dx2 . . . dxn .
W0 ∩W c i=1

and we also have


5.2 The Most Powerful Test of Hypotheses 143

 
n
α= f (xi |θ0 )dx1 dx2 . . . dxn
W i=1
 
n
= f (xi |θ0 )dx1 dx2 . . . dxn
W0 ∩W i=1
 
n
+ f (xi |θ0 )dx1 dx2 . . . dxn .
W0c ∩W i=1

From the above results, it follows that


 
n  
n
f (xi |θ0 )dx1 dx2 . . . dxn = f (xi |θ0 )dx1 dx2 . . . dxn .
W0 ∩W c i=1 W0c ∩W i=1

Hence, from (5.2)

P(W0 |θ1 ) ≥ P(W |θ1 ).

This completes the theorem. 


From the above theorem, we have the following corollary:
Corollary 5.1 Let X1 , X2 , . . . , Xn be random samples from a distribution with
parameter θ and let Tn (X1 , X2 , . . . , Xn ) be any statistic of the samples. Then, in
testing the null hypothesis H0 : θ = θ0 and the alternative hypothesis H1 : θ = θ1 , the
power of the likelihood ratio test with the samples Xi is greater than or equal to that
with the statistic Tn .
Proof Let f (x|θ) be the probability or density function of samples Xi and let ϕn (t|θ)
be that of Tn . The critical regions with significance level α of the likelihood ratio
statistics of samples X1 , X2 , . . . , Xn and statistic Tn (X1 , X2 , . . . , Xn ) are set as
⎧ n 
⎨ W = (x1 , x2 , . . . , xn ) ni=1 f (xi |θ1 ) > λ(f , n) ,
i=1 f (xi 0
|θ ) 
⎩ WT = T (X1 , X2 , . . . , Xn ) = t ϕn (t|θ1 ) > λ(ϕ, n) ,
ϕn (t|θ0 )

such that
 n   
f (xi |θ1 ) ϕn (t|θ1 )
α = P ni=1 > λ(f , n)|H0 = P > λ(ϕ, n)|H0 .
i=1 f (xi |θ0 ) ϕn (t|θ0 )

Then, Theorem 5.1 proves

P(W |H1 ) ≥ P(WT |H1 ).


144 5 Efficiency of Statistical Hypothesis Test Procedures

Since from Theorem 5.1, critical region WT gives the most powerful test procedure
among those based on statistic T, the theorem follows.

In the above corollary, if statistic Tn (X1 , X2 , . . . , Xn ) is a sufficient statistic of θ ,


then, the power of the statistic is the same that of the likelihood ratio test with sam-
ples X1 , X2 , . . . , Xn . In testing the null hypothesis H0 : θ = θ0 versus the alternative
hypothesis H1 : θ = θ1 , the efficiencies of test procedures can be compared with the
power functions. Optimal properties of the likelihood ratio test were discussed in
Bahadur [2].

5.3 The Pitman Efficiency and the Bahadur Efficiency

Let θ be a parameter; H0 : θ = θ0 be null hypothesis; H1 : θ = θ0 + √hn (= θn ) be


a sequence of alternative hypotheses; and let Tn = Tn (X1 , X2 , . . . , Xn ) be the test
statistic, where X1 , X2 , . . . , Xn are random samples. In this section, for simplicity of
the discussion, we assume h > 0.

Definition 5.1 (Pitman efficiency) Let Wn = {Tn > tn } be critical regions with sig-
nificant level α, i.e., α = P{Wn |θ0 }. Then, for the sequence of power functions
γn (θn ) = P{Wn |θn }, n = 1, 2, . . ., the Pitman efficiency is defined by

PE(T ) = lim γn (θn ). (5.3)


n→∞

Definition 5.2 (Relative Pitman efficiency) Let Tn(k) = Tn(k) (X1 , X2 , . . . , Xn ), k =


1, 2 be the two sequences of test statistics and let γn(k) (θn ) be the power functions of
 
Tn(k) , k = 1, 2. Then, for γn(1) (θn ) = γN(2)
n
θNn , the relative efficiency is defined by

  n
RPE T (2) |T (1) = logn→∞ (5.4)
Nn

Example
  5.1 Let X1 , X2 , . . . , Xn be random sample from normal distribution
N μ, σ 2 . In testing the null hypothesis H0 : μ = μ0 versus the alternative hypothesis
H1 : μ = μ0 + √hn (= μn ) for h > 0. The critical region with significance level α is
given by

1
n
σ
X = Xi > μ0 + zα √ ,
n i=1 n

where
 
σ
P X > μ0 + zα √ |μ0 = α,
n
5.3 The Pitman Efficiency and the Bahadur Efficiency 145

Let
√  
n X − μ0 − √h
n
Z= .
σ
Then, Z is distributed according to the standard normal distribution under H1 and
we have
   
h σ h
γ μ0 + √ = P X > μ0 + zα √ |μ0 + √
n n n
   
h h
= P Z > − + zα |μ = 0 = 1 −  − + zα ,
σ σ

where function (z) is the distribution function of N (0, 1). The above power function
is constant in n, so we have
 
  h
PE X = 1 −  − + zα
σ

Next, the Bahadur efficiency is reviewed. In testing the null hypothesis H0 : θ = θ0


versus the alternative hypothesis H1 : θ = θ1 ; let Tn(k) = Tn(k) (X), k = 1, 2 be test
statistics; and let
 (k)    (k) 
L(k) (k)
n Tn (X )|θ = −2 log 1 − Fn Tn (X)|θ ,
 
where Fn(k) (t|θ ) = P Tn(k) < t|θ , k = 1, 2 and X = (X1 , X2 , . . . , Xn ).
Definition 5.3 (Bahadur efficiency) If under the alternative hypothesis H1 : θ = θ1 ,
 (k) 
L(k)
n Tn (X)|θ0 P
→ ck (θ1 , θ0 ) (n → ∞),
n
then, the Bahadur efficiency (RBE) is defined by
 
BE T (k) = ck (θ1 , θ0 ), k = 1, 2. (5.5)

Definition 5.4 (Relative Bahadur efficiency) For (5.5), the relative Bahadur effi-
ciency (RBE) is defined by

  c2 (θ1 , θ0 )
RBE T (2) |T (1) = . (5.6)
c1 (θ1 , θ0 )

Under an appropriate condition, the relative Bahadur efficiency is equal to the


relative Pitman efficiency [1].
Example 5.2 Under the same condition of Example 5.1, let
146 5 Efficiency of Statistical Hypothesis Test Procedures

Tn (X) = X .

Since under H1 : μ = μ1 (> μ0 ), for large n,


 1
X = μ1 + O n− 2 ,

where
 1 1 P
O n− 2 /n− 2 → c (n → ∞).

Since

+∞ √  n 
n
1 − Fn (Tn (X)|μ0 ) = √ exp − 2 (t − μ0 )2 dt
2π σ 2 2σ
X
+∞ √  n 
n
= √ exp − 2 (t − μ0 )2 dt
 1 2π σ 2 2σ
μ1 +O n− 2

+∞ √  n 
n
≈ √ exp − 2 (s − μ0 + μ1 )2 ds
2π σ 2 2σ
0
√  n 
n
=√ exp − 2 (μ1 − μ0 )2
2π σ 2 2σ

+∞
 n n 2
exp − 2 (μ1 − μ0 )s − s ds,
σ 2σ 2
0

for sufficiently large n, we have


√  n 
n
1 − Fn (Tn (X)|μ0 ) < √ exp − 2 (μ1 − μ0 )2
2π σ 2 2σ
+∞  
n
exp − 2 (μ1 − μ0 )s ds
σ
0
√  n 
n σ2
=√ exp − 2 (μ1 − μ0 )2
2π σ 2 2σ n(μ1 − μ0 )
σ  n 
=√ exp − 2 (μ1 − μ0 )2 ,
2π n(μ1 − μ0 ) 2σ
√  
n n
1 − Fn (Tn (X)|μ0 ) > √ exp − 2 (μ1 − μ0 )2
2π σ 2 2σ
5.3 The Pitman Efficiency and the Bahadur Efficiency 147

+∞  
n
exp − 2 (s − μ1 + μ0 )2 ds

0
√  n 
n
>√ exp − 2 (μ1 − μ0 )2
2π σ 2 2σ
+∞
 n 
exp − 2 (s − μ1 + μ0 )2 ds

μ1 −μ0
1  n 
= exp − 2 (μ1 − μ0 )2 .
2 2σ
From this, we have
  n 
−2 σ
log √ exp − 2 (μ1 − μ0 ) 2
n 2π n(μ1 − μ0 ) 2σ
 
−2 σ n
= log √ − (μ1 − μ0 ) 2
n 2π n(μ1 − μ0 ) 2σ 2
1
→ 2 (μ1 − μ0 )2 = c(μ1 , μ0 )(n → ∞),
σ
−2 1  n  −2  n 
log exp − 2 (μ1 − μ0 )2 = −log2 − 2
(μ1 − μ0 )2
n 2 2σ n 2σ
1
→ 2 (μ1 − μ0 )2 ,
σ
so it follows that
Ln (Tn (X)|μ0 ) −2log(1 − Fn (Tn (X)|μ0 )) 1
= → 2 (μ1 − μ0 )2
n n σ
= c(μ1 , μ0 ) (n → ∞),
   
Let f (x) and g(x) be normal density functions of N μ1 , σ 2 and N μ2 , σ 2 ,
respectively. Then, in this case, we also have

1
2D(g||f ) = (μ1 − μ0 )2 = c(μ1 , μ0 ).
σ2
In the next section, the efficiency of test procedures is discussed in view of entropy.
148 5 Efficiency of Statistical Hypothesis Test Procedures

5.4 Likelihood Ratio Test and the Kullback Information

As reviewed in Sect. 5.2, the likelihood ratio test is the most powerful test. In this
section, the test is discussed with respect to information. Let f (x) and g(x) be den-
sity or probability functions corresponding with null hypothesis H0 and alterna-
tive hypothesis H1 , respectively. Then, for a sufficiently large sample size n, i.e.,
X1 , X2 , . . . , Xn , we have
n   
n
i=1 g(Xi ) g(Xi ) −nD(f ||g) (under H0 )
log n = log ≈ (5.7)
i=1 f (Xi ) f (Xi ) nD(g||f ) (under H1 )
i=1

where
 
f (x) g(x)
D(f ||g) = f (x)log dx, D(g||f ) = g(x)log dx. (5.8)
g(x) f (x)

In (5.8), the integrals are replaced by the summations, if the f (x) and g(x) are
probability functions.

Remark 5.1 In general, with respect to the KL information (5.8), it follows that

D(f ||g) = D(g||f ).

In the normal distribution, we have

D(f ||g) = D(g||f ).

The critical region of the likelihood ratio test with significance level α is given by
n
g(Xi )
i=1
n > λ(n), (5.9)
i=1 f (Xi )

where
 n 
g(Xi )
P i=1
n > λ(n)|H0 =α
i=1 f (Xi )

With respect to the likelihood ratio function, we have the following theorem:

Theorem 5.2 For parameter θ, let f (x) = f (x|θ0) and g(x) = f (x|θ0 + δ) be
density or probability functions corresponding to the null hypothesis H0 : θ = θ0 and the
alternative hypothesis H1 : θ = θ0 + δ, respectively, and let X1, X2, . . . , Xn be
samples to test the hypotheses. For large sample size n and small constant δ, the
following log-likelihood ratio statistic

$$\log\frac{\prod_{i=1}^{n} g(X_i)}{\prod_{i=1}^{n} f(X_i)} = \sum_{i=1}^{n}\log\frac{g(X_i)}{f(X_i)} \tag{5.10}$$

is asymptotically distributed according to the normal distribution
N(−nD(f‖g), nδ²I(θ0) − nD(f‖g)²) under H0, and
N(nD(g‖f), nδ²I(θ0) − nD(g‖f)²) under H1, where I(θ) is the Fisher information:

$$I(\theta) = \int \left(\frac{d}{d\theta}\log f(x|\theta)\right)^2 f(x|\theta)\,dx. \tag{5.11}$$

Proof Under H0, the statistic

$$\frac{1}{n}\sum_{i=1}^{n}\log\frac{g(X_i)}{f(X_i)} \tag{5.12}$$

is asymptotically distributed according to a normal distribution with mean −D(f‖g)
and variance $\frac{1}{n}\mathrm{Var}\!\left(\left.\log\frac{g(X)}{f(X)}\right|\theta_0\right)$. For small δ, we obtain

$$\log\frac{g(X)}{f(X)} = \log f(X|\theta_0+\delta) - \log f(X|\theta_0) \approx \delta\frac{d}{d\theta}\log f(X|\theta_0).$$

Since

$$E\!\left(\left.\frac{d}{d\theta}\log f(X|\theta_0)\right|\theta_0\right) = \int \frac{d}{d\theta} f(x|\theta_0)\,dx = 0,$$

we have

$$E\!\left(\left.\left(\log\frac{g(X)}{f(X)}\right)^2\right|\theta_0\right) \approx \delta^2 E\!\left(\left.\left(\frac{d}{d\theta}\log f(X|\theta_0)\right)^2\right|\theta_0\right) = \delta^2 I(\theta_0).$$

Hence, we get

$$\frac{1}{n}\mathrm{Var}\!\left(\left.\log\frac{g(X)}{f(X)}\right|\theta_0\right) \approx \frac{1}{n}\left(\delta^2 I(\theta_0) - D(f\|g)^2\right).$$

On the other hand, under the alternative hypothesis H1, the statistic (5.12) is asymptot-
ically normally distributed with mean D(g‖f) and variance $\frac{1}{n}\mathrm{Var}\!\left(\left.\log\frac{g(X)}{f(X)}\right|\theta_0+\delta\right)$.
For small δ, we have

$$
\begin{aligned}
E\!\left(\left.\left(\log\frac{g(X)}{f(X)}\right)^2\right|\theta_0+\delta\right)
&\approx \delta^2 E\!\left(\left.\left(\frac{d}{d\theta}\log f(X|\theta_0)\right)^2\right|\theta_0+\delta\right)
\approx \delta^2 E\!\left(\left.\left(\frac{d}{d\theta}\log f(X|\theta_0)\right)^2\right|\theta_0\right) = \delta^2 I(\theta_0).
\end{aligned}
$$

From this it follows that

$$\frac{1}{n}\mathrm{Var}\!\left(\left.\log\frac{g(X)}{f(X)}\right|\theta_0+\delta\right) \approx \frac{1}{n}\left(\delta^2 I(\theta_0) - D(g\|f)^2\right).$$

Hence, the theorem follows. □

According to the above theorem, for the critical region (5.9), the power satisfies

$$P\left(\left.\frac{\prod_{i=1}^{n} g(X_i)}{\prod_{i=1}^{n} f(X_i)} > \lambda(n)\,\right|H_1\right) \to 1 \quad (n\to\infty).$$

Below, the efficiency of the likelihood ratio test is discussed.

Lemma 5.1 If the density or probability function f (x|θ) is continuous with respect to θ
and if there exist integrable functions ϕ1(x) and ϕ2(x) such that

$$\left|f(x|\theta+\delta)\frac{d}{d\theta}\log f(x|\theta)\right| < \varphi_1(x) \tag{5.13}$$

and

$$\left|\frac{d}{d\theta} f(x|\theta)\right| < \varphi_2(x), \tag{5.14}$$

then

$$\int f(x|\theta+\delta)\frac{d}{d\theta}\log f(x|\theta)\,dx \to 0 \quad (\delta\to 0).$$

Proof From (5.13), we obtain

$$\int f(x|\theta+\delta)\frac{d}{d\theta}\log f(x|\theta)\,dx \to \int f(x|\theta)\frac{d}{d\theta}\log f(x|\theta)\,dx = \int \frac{d}{d\theta}f(x|\theta)\,dx \quad (\delta\to 0).$$

From (5.14), it follows that

$$\int \frac{d}{d\theta}f(x|\theta)\,dx = \frac{d}{d\theta}\int f(x|\theta)\,dx = 0.$$

From the above results, we have

$$\int f(x|\theta+\delta)\frac{d}{d\theta}\log f(x|\theta)\,dx \to 0 \quad (\delta\to 0).$$

Hence, the lemma follows. □

Example 5.3 For the exponential density function f (x|λ) = λe^{−λx} (x > 0), for a < λ < b
and small δ > 0, we have

$$\left|f(x|\lambda+\delta)\frac{d}{d\lambda}\log f(x|\lambda)\right|
= f(x|\lambda+\delta)\,\frac{\left|\frac{d}{d\lambda} f(x|\lambda)\right|}{f(x|\lambda)}
= \frac{(\lambda+\delta)e^{-(\lambda+\delta)x}}{\lambda}\,|1-\lambda x|
< 2\,|1-\lambda x|\,e^{-\lambda x} \ \ (=\varphi_1(x)).$$

Similarly, it follows that

$$\left|\frac{d}{d\lambda} f(x|\lambda)\right| = |1-\lambda x|\,e^{-\lambda x} < |1-ax|\,e^{-ax} \ \ (=\varphi_2(x)).$$

Since the functions ϕ1(x) and ϕ2(x) are integrable, from Lemma 5.1 we have

$$\int_0^{\infty} f(x|\lambda+\delta)\frac{d}{d\lambda}\log f(x|\lambda)\,dx
= \int_0^{\infty} (\lambda+\delta)\exp(-(\lambda+\delta)x)\left(\frac{1}{\lambda}-x\right)dx
= \frac{1}{\lambda} - \frac{1}{\lambda+\delta} \to 0 \quad (\delta\to 0).$$

The relation between the KL information and the Fisher information is given in
the following theorem:

Theorem 5.3 Let f (x|θ) and f (x|θ + δ) be density or probability functions with
parameters θ and θ + δ, respectively, and let I(θ) be the Fisher information. Under
the same condition as in Lemma 5.1, it follows that

$$\frac{D(f(x|\theta)\|f(x|\theta+\delta))}{\delta^2} \to \frac{I(\theta)}{2} \quad (\delta\to 0), \tag{5.15}$$

$$\frac{D(f(x|\theta+\delta)\|f(x|\theta))}{\delta^2} \to \frac{I(\theta)}{2} \quad (\delta\to 0). \tag{5.16}$$

Proof Since

$$\log\frac{f(x|\theta)}{f(x|\theta+\delta)} = \log f(x|\theta) - \log f(x|\theta+\delta)
\approx -\delta\frac{d}{d\theta}\log f(x|\theta) - \frac{\delta^2}{2}\frac{d^2}{d\theta^2}\log f(x|\theta),$$

from this and Lemma 5.1, for small δ, we have

$$
\begin{aligned}
D(f(x|\theta)\|f(x|\theta+\delta)) &= \int f(x|\theta)\log\frac{f(x|\theta)}{f(x|\theta+\delta)}\,dx\\
&\approx \int f(x|\theta)\left(-\delta\frac{d}{d\theta}\log f(x|\theta) - \frac{\delta^2}{2}\frac{d^2}{d\theta^2}\log f(x|\theta)\right)dx\\
&= -\delta\int \frac{d}{d\theta}f(x|\theta)\,dx - \frac{\delta^2}{2}\int f(x|\theta)\frac{d^2}{d\theta^2}\log f(x|\theta)\,dx\\
&= -\frac{\delta^2}{2}\int f(x|\theta)\frac{d^2}{d\theta^2}\log f(x|\theta)\,dx = \frac{\delta^2}{2}I(\theta).
\end{aligned}
$$

Similarly, since

$$\log\frac{f(x|\theta+\delta)}{f(x|\theta)} = \log f(x|\theta+\delta) - \log f(x|\theta)
\approx \delta\frac{d}{d\theta}\log f(x|\theta+\delta) - \frac{\delta^2}{2}\frac{d^2}{d\theta^2}\log f(x|\theta+\delta),$$

we have

$$
\begin{aligned}
D(f(x|\theta+\delta)\|f(x|\theta)) &= \int f(x|\theta+\delta)\log\frac{f(x|\theta+\delta)}{f(x|\theta)}\,dx\\
&\approx \int f(x|\theta+\delta)\left(\delta\frac{d}{d\theta}\log f(x|\theta+\delta) - \frac{\delta^2}{2}\frac{d^2}{d\theta^2}\log f(x|\theta+\delta)\right)dx\\
&\approx -\frac{\delta^2}{2}\int f(x|\theta)\frac{d^2}{d\theta^2}\log f(x|\theta)\,dx = \frac{\delta^2}{2}I(\theta).
\end{aligned}
$$

Hence, the theorem follows. □
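For the exponential density of Example 5.3, the KL information has the closed form D(Exp(λ)‖Exp(λ + δ)) = log(λ/(λ + δ)) + (λ + δ)/λ − 1 and I(λ) = 1/λ², so the limit (5.15) can be checked numerically. The following sketch uses an arbitrary rate λ = 2 purely for illustration.

```python
import numpy as np

lam = 2.0                                   # illustrative rate parameter
half_fisher = 1.0 / (2.0 * lam ** 2)        # I(lambda)/2 for Exp(lambda)

for delta in (0.5, 0.1, 0.01, 0.001):
    kl = np.log(lam / (lam + delta)) + (lam + delta) / lam - 1.0
    print(delta, kl / delta ** 2)           # approaches I(lambda)/2 as delta -> 0

print("I(lambda)/2 =", half_fisher)
```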

By using the above theorem, we have the following theorem:

Theorem 5.4 In the likelihood ratio test (5.9) for H0 : θ = θ0 versus H1 : θ = θ0 + δ,
for small δ and large sample size n, the 100α% critical region (5.9) is given by

$$\log\lambda(n) = -n\frac{\delta^2 I(\theta_0)}{2} + z_\alpha\sqrt{n\delta^2 I(\theta_0)}, \tag{5.17}$$

where zα is the upper 100α percentile of the standard normal distribution, and the
asymptotic power is $Q\!\left(z_\alpha - \sqrt{n\delta^2 I(\theta_0)}\right)$, where the function Q(z) is

$$Q(z) = \int_z^{+\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)dx.$$

Proof From Theorem 5.2, under the null hypothesis H0, the log-likelihood ratio
statistic (5.10) is asymptotically distributed according to the normal distribution
N(−nD(f‖g), nδ²I(θ0) − nD(f‖g)²). From Theorem 5.3, under the null hypothesis,
from (5.15) we have

$$D(f\|g) \approx \frac{\delta^2 I(\theta_0)}{2},$$

so it follows that

$$-nD(f\|g) \approx -n\frac{\delta^2 I(\theta_0)}{2},
\qquad n\delta^2 I(\theta_0) - nD(f\|g)^2 \approx n\delta^2 I(\theta_0) - n\frac{\delta^4}{4}I(\theta_0)^2 \approx n\delta^2 I(\theta_0).$$

From this, the asymptotic distribution is $N\!\left(-n\frac{\delta^2 I(\theta_0)}{2},\, n\delta^2 I(\theta_0)\right)$, and for (5.17) we have

$$P\left(\left.\frac{\prod_{i=1}^{n} g(X_i)}{\prod_{i=1}^{n} f(X_i)} > \lambda(n)\,\right|H_0\right) = \alpha.$$

Under the alternative hypothesis H1, from (5.16) in Theorem 5.3,

$$D(g\|f) \approx \frac{\delta^2}{2}I(\theta_0),$$

and from Theorem 5.2 the asymptotic mean and variance of the log-likelihood ratio
statistic under H1 are

$$nD(g\|f) \approx \frac{n\delta^2 I(\theta_0)}{2},
\qquad n\delta^2 I(\theta_0) - nD(g\|f)^2 \approx n\delta^2 I(\theta_0) - n\frac{\delta^4}{4}I(\theta_0)^2 \approx n\delta^2 I(\theta_0).$$

Hence, the asymptotic distribution of the statistic (5.10) is

$$N\!\left(nD(g\|f),\, n\delta^2 I(\theta_0)\right) \approx N\!\left(\frac{n\delta^2 I(\theta_0)}{2},\, n\delta^2 I(\theta_0)\right).$$

Then, we have

$$P\left(\left.\frac{\prod_{i=1}^{n} g(X_i)}{\prod_{i=1}^{n} f(X_i)} > \lambda(n)\,\right|H_1\right) \approx Q\!\left(z_\alpha - \sqrt{n\delta^2 I(\theta_0)}\right). \tag{5.18}$$

This completes the theorem. □
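The asymptotic power (5.18) is easy to compare with simulation when f (x|θ) is the normal density with known variance, in which case the likelihood ratio test reduces to the usual z-test. The sketch below uses arbitrary illustrative settings and is only a check of the formula, not part of the proof.

```python
import numpy as np
from scipy.stats import norm

mu0, delta, sigma, n, alpha = 0.0, 0.3, 1.0, 50, 0.05   # illustrative settings
z_alpha = norm.ppf(1.0 - alpha)
I0 = 1.0 / sigma ** 2                                    # Fisher information

asym_power = norm.sf(z_alpha - np.sqrt(n * delta ** 2 * I0))   # Q(z_a - sqrt(n d^2 I))

rng = np.random.default_rng(2)
xbar = rng.normal(mu0 + delta, sigma, size=(20000, n)).mean(axis=1)
sim_power = np.mean(np.sqrt(n) * (xbar - mu0) / sigma > z_alpha)

print(asym_power, sim_power)                             # the two should be close
```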

With respect to the Fisher information, we have the following theorem.

Theorem 5.5 For a random variable X and a function of the variable Y = η(X), let
f (x|θ) and f*(y|θ) be the density or probability functions of X and Y, respectively,
and let If (θ) and If*(θ) be the Fisher information of f (x|θ) and that of f*(y|θ),
respectively. Under the same condition as in Theorem 5.3, it follows that

$$I_f(\theta) \ge I_{f^*}(\theta).$$

The equality holds true if and only if y = η(x) is a one-to-one function.

Proof Since the variable Y = η(X) is a function of X, in this sense, it is a restricted
version of X. From this, we have

$$D(f(x|\theta)\|f(x|\theta+\delta)) \ge D\!\left(f^*(y|\theta)\|f^*(y|\theta+\delta)\right). \tag{5.19}$$

From Theorem 5.3,

$$\frac{D(f(x|\theta)\|f(x|\theta+\delta))}{\delta^2} \to \frac{I_f(\theta)}{2} \quad (\delta\to 0),
\qquad \frac{D(f^*(y|\theta)\|f^*(y|\theta+\delta))}{\delta^2} \to \frac{I_{f^*}(\theta)}{2} \quad (\delta\to 0).$$

Hence, the theorem follows. □

As shown in the above discussion, the test power depends on the KL and the
Fisher information. For random variables, the efficiency for testing hypotheses can
be defined according to entropy.

Definition 5.5 (Entropy-based efficiency) Let f (x|θ0) and f (x|θ0 + δ) be the density
or probability functions of random variable X under the hypotheses H0 : θ = θ0 and
H1 : θ = θ0 + δ, respectively. Then, the entropy-based efficiency (EE) is defined by

$$EE(X;\theta_0,\theta_0+\delta) = D(f(x|\theta_0+\delta)\|f(x|\theta_0)). \tag{5.20}$$

Definition 5.6 (Relative entropy-based efficiency) Let f*(y|θ0) and f*(y|θ0 + δ) be
the probability or density functions of Y = η(X) for testing the hypotheses H0 : θ = θ0
and H1 : θ = θ0 + δ, respectively. Then, the relative entropy-based efficiency (REE)
of the entropy-based efficiency of Y = η(X) for that of X is defined as follows:

$$REE(Y|X;\theta_0,\theta_0+\delta) = \frac{D(f^*(y|\theta_0+\delta)\|f^*(y|\theta_0))}{D(f(x|\theta_0+\delta)\|f(x|\theta_0))}. \tag{5.21}$$

From (5.19) in Theorem 5.5, we have

$$0 \le REE(Y|X) \le 1.$$

Let X1, X2, . . . , Xn be the original random samples and let Y1, Y2, . . . , Yn be the
variables transformed by the function Y = η(X). Then, the REE (5.21) gives
the relative efficiency of the likelihood ratio test with samples Y1, Y2, . . . , Yn for that
with the original samples X1, X2, . . . , Xn. Hence, the log-likelihood ratio statistic
(5.10) has the maximum EE for testing the hypotheses H0 : θ = θ0 versus H1 : θ = θ0 + δ.
Moreover, from Theorem 5.5, we have

$$\lim_{\delta\to 0} REE(Y|X;\theta_0,\theta_0+\delta) = \lim_{\delta\to 0}\frac{D(f^*(y|\theta_0+\delta)\|f^*(y|\theta_0))}{D(f(x|\theta_0+\delta)\|f(x|\theta_0))} = \frac{I_{f^*}(\theta_0)}{I_f(\theta_0)}. \tag{5.22}$$
By using the above discussion, the REE is calculated in the following example.
Example 5.3 Let X be a random variable that is distributed according to the normal
distribution with mean μ and variance σ². In testing the null hypothesis H0 : μ = 0
and the alternative hypothesis H1 : μ = δ, let X* be the dichotomized variable such
that

$$X^* = \begin{cases} 0 & (X < 0)\\ 1 & (X \ge 0) \end{cases}.$$

Then, the entropy-based relative efficiency of the test with the dichotomized vari-
able X* for that with the normal variable X is calculated. Since the normal density
function with mean δ and variance σ² is

$$f(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\delta)^2\right),$$

the Fisher information with respect to the mean δ is

$$I_f(\delta) = E\left(\left(\frac{\partial}{\partial\delta}\log f(X)\right)^2\right)
= E\left(\left(\frac{X-\delta}{\sigma^2}\right)^2\right) = \frac{1}{\sigma^2} = I_f(0).$$

The distribution of the dichotomized variable X* is given by

$$P(X^*=1) = \int_0^{+\infty} \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\delta)^2\right)dx. \tag{5.23}$$

For simplicity of the discussion, we denote the above probability by p(δ). From
this, the distribution f* is the Bernoulli distribution with success probability (5.23),
and the probability function is expressed as follows:

$$P(X^*=x) = p(\delta)^x(1-p(\delta))^{1-x}.$$

The Fisher information of this distribution with respect to δ is

$$
\begin{aligned}
I_{f^*}(\delta) &= E\left(\left(\frac{\partial}{\partial\delta}\left(X^*\log p(\delta) + (1-X^*)\log(1-p(\delta))\right)\right)^2\right)\\
&= p(\delta)\left(\frac{\partial}{\partial\delta}\log p(\delta)\right)^2 + (1-p(\delta))\left(\frac{\partial}{\partial\delta}\log(1-p(\delta))\right)^2\\
&= \frac{1}{p(\delta)}\left(\frac{\partial}{\partial\delta}p(\delta)\right)^2 + \frac{1}{1-p(\delta)}\left(\frac{\partial}{\partial\delta}p(\delta)\right)^2
= \frac{1}{p(\delta)(1-p(\delta))}\left(\frac{\partial}{\partial\delta}p(\delta)\right)^2.
\end{aligned}
$$

Since

$$\frac{\partial}{\partial\delta}p(\delta)
= \int_0^{+\infty}\frac{x-\delta}{\sigma^2}\cdot\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\delta)^2\right)dx
= \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\delta^2}{2\sigma^2}\right),$$

we have

$$
\begin{aligned}
I_{f^*}(\delta) &= \frac{1}{p(\delta)(1-p(\delta))}\left(\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\delta^2}{2\sigma^2}\right)\right)^2\\
&= \frac{1}{p(\delta)(1-p(\delta))}\times\frac{1}{2\pi\sigma^2}\exp\left(-\frac{\delta^2}{\sigma^2}\right)
\to \frac{2}{\pi\sigma^2} = I_{f^*}(0) \quad (\delta\to 0).
\end{aligned}
$$

From this, the REE (5.21) is

$$\lim_{\delta\to 0} REE(X^*|X;\theta_0,\theta_0+\delta) = \frac{I_{f^*}(0)}{I_f(0)} = \frac{\frac{2}{\pi\sigma^2}}{\frac{1}{\sigma^2}} = \frac{2}{\pi}.$$

The above REE is the relative efficiency of the likelihood ratio test
with the dichotomized samples X1*, X2*, . . . , Xn* for that with the original samples
X1, X2, . . . , Xn.
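The limit 2/π can also be checked directly from the two Fisher informations; the short sketch below (with σ = 1, an arbitrary choice) evaluates I_f*(δ)/I_f(δ) for decreasing δ.

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0
I_f = 1.0 / sigma ** 2                                   # Fisher information of X

for delta in (1.0, 0.5, 0.1, 0.01):
    p = norm.sf(0.0, loc=delta, scale=sigma)             # p(delta) = P(X >= 0)
    dp = norm.pdf(0.0, loc=delta, scale=sigma)           # d p(delta) / d delta
    I_fstar = dp ** 2 / (p * (1.0 - p))                  # Fisher information of X*
    print(delta, I_fstar / I_f)                          # tends to 2/pi

print("2/pi =", 2.0 / np.pi)
```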
By using the above example, the relative Pitman efficiency is computed on the
basis of Theorem 5.3.

Example 5.4 In Example 5.3, let H0 : θ = θ0 be the null hypothesis, and let
H1 : θ = θ0 + h/√n (= θn) be the alternative hypothesis. In Theorem 5.4, setting θ0 = 0
and δ = h/√n, from (5.18), for sufficiently large sample size n,
the asymptotic powers of the likelihood ratio tests based on the normal variable X
and on the dichotomized variable X* are

$$\gamma_n^{(1)}(\theta_n) \approx Q\!\left(z_\alpha - \sqrt{h^2 I_f(0)}\right) \quad\text{and}\quad
\gamma_{N_n}^{(2)}(\theta_n) \approx Q\!\left(z_\alpha - \sqrt{\frac{N_n}{n}h^2 I_{f^*}(0)}\right),$$

respectively. Assuming

$$\gamma_n^{(1)}(\theta_n) = \gamma_{N_n}^{(2)}(\theta_n),$$

we have

$$h^2 I_f(0) = \frac{N_n}{n}h^2 I_{f^*}(0).$$

Hence, the relative Pitman efficiency (RPE) is

$$RPE(X^*|X) = \lim_{n\to\infty}\frac{n}{N_n} = \frac{I_{f^*}(0)}{I_f(0)} = \frac{2}{\pi}.$$

The above RPE is equivalent to the REE in Example 5.3.

Example 5.5 In Example 5.3, the relative Bahadur efficiency (RBE) is considered.
Let H0 : μ = 0 be the null hypothesis, and let H1 : μ = δ be the alternative hypothesis.
Let

$$T_n^{(1)}(\boldsymbol{X}) = \frac{1}{n}\sum_{i=1}^{n}X_i, \qquad
T_n^{(2)}(\boldsymbol{X}) = \frac{1}{n}\sum_{i=1}^{n}X_i^*.$$

Then, under H1 : μ = δ, it follows that

$$1 - F_n\!\left(T_n^{(1)}(\boldsymbol{X})|\mu=0\right)
= \int_{\bar{X}}^{+\infty}\frac{\sqrt{n}}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{n}{2\sigma^2}t^2\right)dt
= \int_{\delta+O\left(n^{-\frac{1}{2}}\right)}^{+\infty}\frac{\sqrt{n}}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{n}{2\sigma^2}t^2\right)dt$$

and

$$1 - F_n\!\left(T_n^{(2)}(\boldsymbol{X})|\mu=0\right)
= \int_{T_n^{(2)}(\boldsymbol{X})}^{+\infty}\frac{\sqrt{2n}}{\sqrt{\pi}}\exp\left(-2n\left(t-\frac{1}{2}\right)^2\right)dt
= \int_{p(\delta)+O\left(n^{-\frac{1}{2}}\right)}^{+\infty}\frac{\sqrt{2n}}{\sqrt{\pi}}\exp\left(-2n\left(t-\frac{1}{2}\right)^2\right)dt,$$

where

$$p(\delta) = \int_0^{+\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x-\delta)^2\right)dx.$$

From the above results, as in Example 5.2, we have

$$\frac{-2\log\left(1-F_n\!\left(T_n^{(1)}(\boldsymbol{X})|\mu=0\right)\right)}{n} \to c_1(\delta,0) = \frac{\delta^2}{\sigma^2},
\qquad
\frac{-2\log\left(1-F_n\!\left(T_n^{(2)}(\boldsymbol{X})|\mu=0\right)\right)}{n} \to c_2(\delta,0) = 4\left(p(\delta)-\frac{1}{2}\right)^2 \quad (n\to\infty).$$

Hence, the RBE is computed as follows:

$$RBE(X^*|X) = \frac{c_2(\delta,0)}{c_1(\delta,0)} = \frac{4\left(p(\delta)-\frac{1}{2}\right)^2}{\frac{\delta^2}{\sigma^2}}
\to \frac{4\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^2}{\frac{1}{\sigma^2}} = \frac{2}{\pi} \quad (\delta\to 0).$$

Hence, the RBE is equal to the REE.

From Examples 5.3, 5.4, and 5.5, for the dichotomized (sign-type) test of the mean of the
normal distribution, the relative entropy-based efficiency is the same as the relative
Pitman and the relative Bahadur efficiencies. Under appropriate assumptions [2], as
in the above example, it follows that

$$\frac{L_n(T_n(\boldsymbol{X})|\theta_0)}{n} \to 2D(g\|f) \quad (n\to\infty).$$

5.5 Information of Test Statistics

In this section, test statistics are considered from the viewpoint of entropy. Let
X1, X2, . . . , Xn be random samples; let Tn(X1, X2, . . . , Xn) = Tn(X) be a test statis-
tic; and let ϕn(t|θ) be the probability or density function of the statistic Tn(X). Then, the
following definition is made.

Definition 5.7 (Entropy-based efficiency) For testing the hypotheses H0 : θ = θ0 and
H1 : θ = θ0 + δ, the entropy-based efficiency of the test statistic Tn(X) is defined by

$$EE(T_n;\theta_0,\theta_0+\delta) = D(\varphi_n(t|\theta_0+\delta)\|\varphi_n(t|\theta_0)). \tag{5.24}$$

Definition 5.8 (Relative entropy-based efficiency) Let Tn(1)(X) and Tn(2)(X) be test
statistics for testing H0 : θ = θ0 versus H1 : θ = θ0 + δ, and let ϕn(1)(t|θ) and ϕn(2)(t|θ)
be the probability or density functions of the test statistics, respectively. Then, the
relative entropy-based efficiency of Tn(2)(X) for Tn(1)(X) is defined by

$$REE\!\left(T_n^{(2)}|T_n^{(1)};\theta_0,\theta_0+\delta\right) = \frac{EE\!\left(T_n^{(2)};\theta_0,\theta_0+\delta\right)}{EE\!\left(T_n^{(1)};\theta_0,\theta_0+\delta\right)}, \tag{5.25}$$

and the relative entropy-based efficiency of the procedure {Tn(2)(X)} for {Tn(1)(X)}
is defined as

$$REE\!\left(T^{(2)}|T^{(1)};\theta_0,\theta_0+\delta\right) = \lim_{n\to\infty} REE\!\left(T_n^{(2)}|T_n^{(1)};\theta_0,\theta_0+\delta\right). \tag{5.26}$$

According to the discussions in the previous sections, to test the hypotheses
H0 : θ = θ0 and H1 : θ = θ0 + δ at significance level α, the most powerful test procedure
based on the statistic Tn(X) uses the following critical region:

$$W_0 = \left\{(x_1,x_2,\ldots,x_n)\ \left|\ \frac{\varphi_n(t|\theta_0+\delta)}{\varphi_n(t|\theta_0)} > \lambda(n)\right.\right\},$$

where

$$P(W_0|\theta_0) = \alpha.$$

With respect to REE (5.25), we have the following theorem:



Theorem 5.6 Let X1, X2, . . . , Xn be random samples from the distribution f (x|θ); let
Tn(1) = Tn(1)(X1, X2, . . . , Xn) be the log-likelihood ratio test statistic for testing
H0 : θ = θ0 versus H1 : θ = θ0 + δ; and let Tn(2) = Tn(2)(X1, X2, . . . , Xn) be a test statistic
whose asymptotic distribution is N(η(θ), σ²/n), where η(θ) is an increasing and
differentiable function of the parameter θ. Then, the relative entropy-based efficiency of
the test T(2) for the likelihood ratio test T(1) is

$$\lim_{\delta\to 0} REE\!\left(T^{(2)}|T^{(1)};\theta_0,\theta_0+\delta\right) = \frac{\eta'(\theta_0)^2}{I(\theta_0)\sigma^2} \le 1, \tag{5.27}$$

where

$$I(\theta) = \int f(x|\theta)\left(\frac{d}{d\theta}\log f(x|\theta)\right)^2 dx.$$

Proof For testing H0 : θ = θ0 versus H1 : θ = θ0 + δ, the asymptotic distribution of
the log-likelihood ratio statistic Tn(1) is N(−nD(f‖g), nδ²I(θ0) − nD(f‖g)²) under
H0, and N(nD(g‖f), nδ²I(θ0 + δ) − nD(g‖f)²) under H1, from Theorem 5.2. Let
nor(x|μ, σ²) be the normal density function with mean μ and variance σ². Then, for
small δ, the entropy-based efficiency of Tn(2) is

$$EE\!\left(T_n^{(2)};\theta_0,\theta_0+\delta\right)
= D\!\left(\mathrm{nor}\!\left(x\,\Big|\,\eta(\theta_0+\delta),\frac{\sigma^2}{n}\right)\Big\|\,\mathrm{nor}\!\left(x\,\Big|\,\eta(\theta_0),\frac{\sigma^2}{n}\right)\right)
= \frac{n(\eta(\theta_0+\delta)-\eta(\theta_0))^2}{2\sigma^2},$$

and from Theorem 5.3, for large sample size n, we have

$$EE\!\left(T_n^{(1)};\theta_0,\theta_0+\delta\right) = nD(f(x|\theta_0+\delta)\|f(x|\theta_0)) \approx \frac{nI(\theta_0)\delta^2}{2}.$$

From this, the relative efficiency of the test with Tn(2) is

$$REE\!\left(T^{(2)}|T^{(1)};\theta_0,\theta_0+\delta\right)
= \lim_{n\to\infty}\frac{EE\!\left(T_n^{(2)};\theta_0,\theta_0+\delta\right)}{EE\!\left(T_n^{(1)};\theta_0,\theta_0+\delta\right)}
= \frac{\frac{(\eta(\theta_0+\delta)-\eta(\theta_0))^2}{2\sigma^2}}{\frac{I(\theta_0)\delta^2}{2}}
= \left(\frac{\eta(\theta_0+\delta)-\eta(\theta_0)}{\delta}\right)^2\cdot\frac{1}{I(\theta_0)\sigma^2}.$$

Hence, we have

$$\lim_{\delta\to 0} REE\!\left(T^{(2)}|T^{(1)};\theta_0,\theta_0+\delta\right) = \frac{\eta'(\theta_0)^2}{I(\theta_0)\sigma^2}.$$

For the statistic Tn(2), η⁻¹(Tn(2)) is asymptotically normally distributed according to
nor(x|θ0, σ²/(nη′(θ0)²)), and for the Fisher information I(θ0), we have

$$\frac{\sigma^2}{n\,\eta'(\theta_0)^2} \ge \frac{1}{nI(\theta_0)}.$$

From this, we obtain

$$\frac{\eta'(\theta_0)^2}{I(\theta_0)\sigma^2} \le 1.$$

This completes the theorem. □
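As a simple illustration of (5.27), suppose (this is an assumption made only for the example) that Tn(2) is the sample median of X1, . . . , Xn from N(θ, σX²); its asymptotic variance is πσX²/(2n), so with η(θ) = θ the relative efficiency is again 2/π. The sketch below estimates n·Var(median) by simulation and plugs it into (5.27).

```python
import numpy as np

theta0, sigma_x, n, reps = 0.0, 1.0, 200, 20000    # illustrative settings
rng = np.random.default_rng(3)

medians = np.median(rng.normal(theta0, sigma_x, size=(reps, n)), axis=1)
n_var = n * medians.var()                # ~ pi * sigma_x^2 / 2

I0 = 1.0 / sigma_x ** 2                  # Fisher information; eta'(theta) = 1
print(1.0 / (I0 * n_var), 2.0 / np.pi)   # REE of (5.27), close to 2/pi
```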

From Theorem 5.6, we have the following theorem:

Theorem 5.7 Let θ̂ be the maximum likelihood estimator of θ. Then, the test of the
null hypothesis H0 : θ = θ0 versus the alternative hypothesis H1 : θ = θ0 + δ based on
θ̂ is asymptotically equivalent to the likelihood ratio test (5.9).

Proof For a large sample, the asymptotic distribution of θ̂ is N(θ0, 1/(nI(θ0))) under H0
and N(θ0 + δ, 1/(nI(θ0 + δ))) under H1. In Theorem 5.6, setting η(θ) = θ and σ² = 1/I(θ0),
it follows that η′(θ) = 1 in (5.27). Let T be the log-likelihood ratio statistic in (5.9). Then,
we have

$$\lim_{\delta\to 0} REE\!\left(\hat{\theta}\,|T;\theta_0,\theta_0+\delta\right) = 1.$$

This completes the theorem. □

Moreover, in general, with respect to the likelihood ratio test, the following
theorem holds.

Theorem 5.8 Let X1, X2, . . . , Xn be random samples from the probability or density
function f (x|θ0) under H0 : θ = θ0 and from f (x|θ0 + δ) under H1 : θ = θ0 + δ.
Let Tn(1) = T(1)(X1, X2, . . . , Xn) be the log-likelihood ratio test statistic; let Tn(2) =
T(2)(X1, X2, . . . , Xn) be any test statistic; and let ϕn(t|θ0) and ϕn(t|θ0 + δ) be the
probability or density functions of Tn(2) under H0 : θ = θ0 and H1 : θ = θ0 + δ,
respectively. Then,

$$REE\!\left(T^{(2)}|T^{(1)};\theta_0,\theta_0+\delta\right) = \frac{D(\varphi_n(t|\theta_0+\delta)\|\varphi_n(t|\theta_0))}{nD(f(x|\theta_0+\delta)\|f(x|\theta_0))} \le 1. \tag{5.28}$$

Proof For simplicity of the discussion, the proof is given for continuous random
samples Xi. Then, we have

$$
\begin{aligned}
D(\varphi_n(t|\theta_0)\|\varphi_n(t|\theta_0+\delta))
&= \int \varphi_n(t|\theta_0)\log\frac{\varphi_n(t|\theta_0)}{\varphi_n(t|\theta_0+\delta)}\,dt\\
&= \int \left(\frac{d}{dt}\int_{T_n<t}\prod_{i=1}^{n}f(x_i|\theta_0)\,dx_i\right)
   \log\frac{\frac{d}{dt}\int_{T_n<t}\prod_{i=1}^{n}f(x_i|\theta_0)\,dx_i}
            {\frac{d}{dt}\int_{T_n<t}\prod_{i=1}^{n}f(x_i|\theta_0+\delta)\,dx_i}\,dt\\
&\le \int \prod_{i=1}^{n}f(x_i|\theta_0)\log\frac{\prod_{i=1}^{n}f(x_i|\theta_0)}{\prod_{i=1}^{n}f(x_i|\theta_0+\delta)}\,d\boldsymbol{x}\\
&= \sum_{i=1}^{n}\int f(x_i|\theta_0)\log\frac{f(x_i|\theta_0)}{f(x_i|\theta_0+\delta)}\,dx_i
= nD(f(x|\theta_0)\|f(x|\theta_0+\delta)).
\end{aligned}
$$

The same argument with the roles of θ0 and θ0 + δ interchanged gives the inequality in (5.28).
Thus, the theorem follows. □

From the above theorem, the likelihood ratio test is optimal in the sense of REE.
With respect to the Pitman, Bahadur, and entropy-based efficiencies, we have the
following theorem:

Theorem 5.9 Under the same conditions as in Theorem 5.6,

$$RPE\!\left(T^{(2)}|T^{(1)}\right) = \lim_{\delta\to 0} RBE\!\left(T^{(2)}|T^{(1)}\right) = \lim_{\delta\to 0} REE\!\left(T^{(2)}|T^{(1)};\theta_0,\theta_0+\delta\right).$$

Proof In Theorem 5.6, let H0 : θ = θ0 be the null hypothesis and let H1 : θ = θ0 +
h/√n (= θn) be a sequence of alternative hypotheses. Then, the Pitman efficiency of the
likelihood ratio test statistic T(1) = T(1)(X1, X2, . . . , Xn) is calculated as follows. The
power of the likelihood ratio test is

$$\gamma_n^{(1)}(\theta_n) \approx Q\!\left(z_\alpha - \sqrt{h^2 I(\theta_0)}\right)$$

and that of $T_{N_n}^{(2)} = T^{(2)}\!\left(X_1, X_2, \ldots, X_{N_n}\right)$ is

$$\gamma_{N_n}^{(2)}(\theta_n) \approx Q\!\left(z_\alpha - \frac{\sqrt{N_n}}{\sigma}\left(\eta\!\left(\theta_0+\frac{h}{\sqrt{n}}\right)-\eta(\theta_0)\right)\right).$$

For

$$\gamma_n^{(1)}(\theta_n) = \gamma_{N_n}^{(2)}(\theta_n),$$

it follows that

$$\sqrt{h^2 I(\theta_0)} = \frac{\sqrt{N_n}}{\sigma}\left(\eta\!\left(\theta_0+\frac{h}{\sqrt{n}}\right)-\eta(\theta_0)\right).$$

From the above result, the relative Pitman efficiency is derived as follows:

$$\frac{n}{N_n} = \frac{n\left(\eta\!\left(\theta_0+\frac{h}{\sqrt{n}}\right)-\eta(\theta_0)\right)^2}{\sigma^2 h^2 I(\theta_0)}
= \left(\frac{\eta\!\left(\theta_0+\frac{h}{\sqrt{n}}\right)-\eta(\theta_0)}{\frac{h}{\sqrt{n}}}\right)^2\frac{1}{I(\theta_0)\sigma^2}
\to \frac{\eta'(\theta_0)^2}{I(\theta_0)\sigma^2} \quad (n\to\infty).$$

The above RPE is the same as the REE (5.27), i.e.,

$$RPE\!\left(T^{(2)}|T^{(1)}\right) = \lim_{\delta\to 0} REE\!\left(T^{(2)}|T^{(1)};\theta_0,\theta_0+\delta\right).$$

By a discussion similar to that in Example 5.2, we can show

$$\lim_{\delta\to 0} RBE\!\left(T^{(2)}|T^{(1)}\right) = \lim_{\delta\to 0} REE\!\left(T^{(2)}|T^{(1)};\theta_0,\theta_0+\delta\right).$$

Hence, the theorem follows. □

In the next example, we consider the efficiency of the Wilcoxon test for the two-sample
problem.

Example 5.6 (Relative efficiency of the Wilcoxon test) Let Xi, i = 1, 2, . . . , m be
random samples from N(μ0, σ²) and let Yi, i = 1, 2, . . . , n be random samples
from N(μ0 + δ, σ²). Then, for large sample sizes m and n, the t statistic is asymp-
totically normally distributed according to N(0, (1/m + 1/n)σ²) under H0 : μ = μ0, and
N(δ, (1/m + 1/n)σ²) under H1 : μ = μ0 + δ. Let f (x|μ0) be the density function of
N(μ0, σ²) and f (x|μ0 + δ) be that of N(μ0 + δ, σ²). From this, we have

$$D\!\left(\mathrm{nor}\!\left(0,\left(\tfrac{1}{m}+\tfrac{1}{n}\right)\sigma^2\right)\Big\|\,\mathrm{nor}\!\left(\delta,\left(\tfrac{1}{m}+\tfrac{1}{n}\right)\sigma^2\right)\right)
= \frac{\delta^2}{2\left(\frac{1}{m}+\frac{1}{n}\right)\sigma^2}. \tag{5.29}$$

On the other hand, the KL information of the Wilcoxon (Mann–Whitney) statistic
is considered. Let us introduce the following function:

$$u(t) = \begin{cases} 0 & (t<0)\\ 1 & (t\ge 0)\end{cases}.$$

Then, the Wilcoxon statistic is

$$W = \sum_{i=1}^{n}\left(\sum_{j=1}^{n}u(Y_i-Y_j) + \sum_{j=1}^{m}u(Y_i-X_j)\right).$$

Since

$$\sum_{i=1}^{n}\sum_{j=1}^{n}u(Y_i-Y_j) = \frac{n(n+1)}{2},$$

we have

$$W = \sum_{i=1}^{n}\sum_{j=1}^{m}u(Y_i-X_j) + \frac{n(n+1)}{2}.$$

Under the alternative hypothesis, the mean of W is calculated as follows. Since

$$Y_i - X_j \sim N(\delta, 2\sigma^2),$$

for small δ, we have

$$p(\delta) = \int_{-\infty}^{0}\frac{1}{\sqrt{4\pi\sigma^2}}\exp\left(-\frac{(t-\delta)^2}{4\sigma^2}\right)dt
\approx \frac{1}{2} - \frac{\delta}{\sqrt{4\pi\sigma^2}}\exp\left(-\frac{\delta^2}{4\sigma^2}\right).$$

From this, under the alternative hypothesis H1 : μ = μ0 + δ, it follows that

$$E(W|\mu_0+\delta) \approx \frac{n(m+n+1)}{2} - \frac{mn\delta}{\sqrt{4\pi\sigma^2}}\exp\left(-\frac{\delta^2}{4\sigma^2}\right),
\qquad \mathrm{Var}(W|\mu_0+\delta) \approx \frac{mn(m+n+1)}{12}.$$

On the other hand, under the null hypothesis H0 : μ = μ0,

$$E(W|\mu_0) \approx \frac{n(m+n+1)}{2}, \qquad \mathrm{Var}(W|\mu_0) \approx \frac{mn(m+n+1)}{12}.$$

For large m and n, the asymptotic distribution of W is normal, so we have

$$D(\mathrm{nor}(E(W|\mu_0),\mathrm{Var}(W|\mu_0))\,\|\,\mathrm{nor}(E(W|\mu_0+\delta),\mathrm{Var}(W|\mu_0+\delta)))
\approx \frac{3mn\delta^2\exp\left(-\frac{\delta^2}{2\sigma^2}\right)}{2\pi(m+n+1)\sigma^2}.$$

From (5.29), for large m and n, it follows that

$$
\begin{aligned}
\lim_{\delta\to 0} REE(W,t;\mu_0,\mu_0+\delta)
&= \lim_{\delta\to 0}\frac{D(\mathrm{nor}(E(W|\mu_0),\mathrm{Var}(W|\mu_0))\,\|\,\mathrm{nor}(E(W|\mu_0+\delta),\mathrm{Var}(W|\mu_0+\delta)))}
{D\!\left(\mathrm{nor}\!\left(0,\left(\frac{1}{m}+\frac{1}{n}\right)\sigma^2\right)\Big\|\,\mathrm{nor}\!\left(\delta,\left(\frac{1}{m}+\frac{1}{n}\right)\sigma^2\right)\right)}\\
&= \lim_{\delta\to 0}\frac{\frac{3mn\delta^2\exp\left(-\frac{\delta^2}{2\sigma^2}\right)}{2\pi(m+n+1)\sigma^2}}{\frac{\delta^2}{2\left(\frac{1}{m}+\frac{1}{n}\right)\sigma^2}}
\approx \frac{3}{\pi}.
\end{aligned}
$$

For m = n, it can be shown that the result is equal to the relative Pitman and
Bahadur efficiencies.
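A quick numerical check of the limit 3/π is sketched below: it simply evaluates the ratio of the two KL quantities used above for decreasing δ (m, n, and σ are arbitrary illustrative choices).

```python
import numpy as np

m, n, sigma = 80, 100, 1.0                   # illustrative sample sizes and scale

for delta in (1.0, 0.5, 0.1, 0.01):
    kl_t = delta ** 2 / (2.0 * (1.0 / m + 1.0 / n) * sigma ** 2)               # (5.29)
    kl_w = (3.0 * m * n * delta ** 2 * np.exp(-delta ** 2 / (2.0 * sigma ** 2))
            / (2.0 * np.pi * (m + n + 1) * sigma ** 2))                        # Wilcoxon KL
    print(delta, kl_w / kl_t)                # tends to 3/pi for large m and n

print("3/pi =", 3.0 / np.pi)
```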

5.6 Discussion

The efficiency of test procedures has been reconsidered in the context of entropy.
By reviewing the likelihood ratio test and the Pitman and Bahadur efficiencies of
test procedures, an entropy-based efficiency for test procedures has been proposed.
A theoretical discussion of the entropy-based efficiency has been given, and it has
been proven that the three approaches give the same results under an appropriate
condition.

References

1. Bahadur, R. R. (1964). On the asymptotic efficiency of tests and estimates. Sankhya, 22, 229–252.
2. Bahadur, R. R. (1965). An optimal property of the likelihood ratio statistic. In Proceedings of
the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 13–26).
Berkeley and Los Angeles: University of California Press.
3. Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical
hypotheses. Philosophical Transactions of the Royal Society of London, A, Containing Papers
of a Mathematical or physical Character, 231, 289–337.
4. Noether, G. E. (1950). Asymptotic properties of the Wald-Wolfowitz test of randomness. Annals
of Mathematical Statistics, 21, 231–246.
5. Noether, G. E. (1955). On a theorem of Pitman. The Annals of Mathematical Statistics, 26,
64–68.
6. Wieand, H. S. (1976). A condition under which the Pitman and Bahadur approaches to efficiency
coincide. Annals of Statistics, 4, 1003–1011.
Chapter 6
Entropy-Based Path Analysis

6.1 Introduction

Path analysis [19] is a statistical method that treats causal relationships among the
variables concerned. In the analysis, causal relationships among the variables are
described by diagrams with directed arrows, which are called path diagrams; according
to the diagrams, path analysis models are hypothesized, and the causal
relationships are analyzed. For analyzing causal systems of continuous variables,
linear regression models are used for describing the causal relationships, e.g., the lin-
ear structural equation model (LISREL) [3, 14], and the path analysis is made by
using regression coefficients and correlation coefficients. In comparison with path
analysis of continuous variables, that of categorical variables is complex, because
the causal system under consideration cannot be described by linear regression equa-
tions. Goodman [10–12] considered a path analysis of binary variables by using
logit models and discussed the effects of the variables by logit parameters; however,
no discussion of the direct and indirect effects was given. In a factor analysis
approach for categorical manifest variables, latent variables are assumed to be continuous;
however, there was no discussion on measuring the effects of the latent variables on
the manifest variables [4, 15, 16]. Hagenaars [13] discussed path analysis
of recursive systems of categorical variables by using a loglinear model approach,
which is a combination of Goodman's approach and graphical modeling. Although
the approach is an analogy to LISREL, the direct and indirect effects were not
discussed. Eshima and Tabata [5] also discussed path analysis with
loglinear models. In path analysis with categorical variables, it is a question how the
effects are measured [9, 13]. Albert and Nelson [1] proposed a path analysis method
to calculate pathway effects for causal systems based on generalized linear models
(GLMs) [17, 18], but not all pathway effects are identifiable. As in the two-stage
case, when factors, intermediate variables, and response variables are categorical,
the pathway effects become very complicated because the variable effects are defined in
terms of mean differences of response variables. Eshima et al. [6] proposed a method of path
analysis of categorical variables by using logit models. In this approach, the direct
and indirect effects of variables are discussed according to log odds ratios; however,
the results are complex, and it is required to make a summary measure for causal
analysis with categorical variables. The approach was extended to an entropy-based
approach to path analysis for GLMs [8]. The rest of the present chapter is organized
as follows. In Sect. 6.2, recursive systems of variables are expressed with path dia-
grams, and the joint probability or density functions of the variables are described by
products of the conditional ones based on the path diagrams. Section 6.3 reviews
the ordinary path analysis of continuous variables. In Sect. 6.4, a preliminary dis-
cussion of path analysis of categorical variables is given by using two examples
of categorical data. Section 6.5 provides a general approach to entropy-based path
analysis for GLM path systems, and the basic discussion is made by log odds ratios,
and Sect. 6.6 applies the basic method to that for GLM path systems with canonical
links. In Sect. 6.7, in order to summarize the effects measured with log odds ratios,
the summary entropy-based effects are introduced by using a recursive system of
four variables, and the total, direct, and indirect effects are discussed. Section 6.8
applies the present path analysis to the examples in Sect. 6.4. Finally, in Sect. 6.9, a
general formulation of path analysis for structural GLM systems is discussed.

6.2 Path Diagrams of Variables

Let X i , i = 1, 2, . . . , n be random variables such that X i precedes X i+1 , i = 1, 2, . . . , n−


1 and let X = (X 1 , X 2 , . . . , X n )T . Then, the causal order can be expressed as

X1 ≺ X2 ≺ . . . ≺ Xn. (6.1)

In this case, a general path diagram is shown in Fig. 6.1. The parent variables of
X i are X k , k = 1, 2, . . . , i −1, and the set of the variables is denoted by the following
column vector:

Fig. 6.1 A general path


diagram

X pa(i) = (X 1 , X 2 , . . . , X i−1 )T

and X i is a descendant of the parent variables. The arrows express the directions
of the direct effects, e.g., X 1 → X 2 indicates the direct effect of X 1 on X 2 . Path
analysis is a statistical method for measuring the effects of parent variables on the
descendant variables, and the following effect decomposition is important:

The total effect = the direct effect + the indirect effect.

Let f (x) be the joint probability or density function of X = x. Then, for the path
diagram shown in Fig. 6.1, we have the following recursive decomposition of f (X):


$$f(\boldsymbol{x}) = f_1(x_1)\prod_{i=2}^{n} f_i\!\left(x_i|\boldsymbol{x}_{pa(i)}\right), \tag{6.2}$$

 
where f 1 (x1 ) is the marginal probability or density function of X 1 and f i xi |x pa(i)
are the conditional probability or density functions of X i given X pa(i) = x pa(i) , i =
2, 3, . . . , n.
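The factorization (6.2) also prescribes how a recursive system can be simulated: each variable is drawn from its conditional distribution given its parents, in the causal order (6.1). The sketch below does this for a hypothetical fully connected diagram with n = 4 and linear-Gaussian conditionals; the coefficients are arbitrary illustrative values, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_system(size):
    # Sampling according to (6.2): f1(x1) f2(x2|x1) f3(x3|x1,x2) f4(x4|x1,x2,x3)
    x1 = rng.normal(0.0, 1.0, size)
    x2 = rng.normal(0.5 * x1, 1.0)
    x3 = rng.normal(0.3 * x1 + 0.4 * x2, 1.0)
    x4 = rng.normal(0.2 * x1 + 0.3 * x2 + 0.5 * x3, 1.0)
    return np.column_stack([x1, x2, x3, x4])

X = draw_system(10000)
print(np.corrcoef(X, rowvar=False).round(2))   # induced association structure
```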
For n = 4, some path diagrams are considered. In Fig. 6.2a, the diagram has all
paths between the four variables, so the joint probability or density functions of the
four variables are given by

f (x) = f 1 (x1 ) f 2 (x2 |x1 ) f 3 (x3 |x1 , x2 ) f 4 (x4 |x1 , x2 , x3 ).

In Fig. 6.2b, since there is no path from X 1 to X 3 , it means that X 1 and X 3 are
conditionally independent given X 2 . From this, we have

f (x) = f 1 (x1 ) f 2 (x2 |x1 ) f 3 (x3 |x2 ) f 4 (x4 |x1 , x2 , x3 ).

Fig. 6.2 Four examples of path diagrams for n = 4 (panels a–d)

In Fig. 6.2c, there are no paths between (X 1 , X 2 ) and X 4 , so (X 1 , X 2 ) and X 4 are


conditionally independent given X 3 , and it follows that

f (x) = f 1 (x1 ) f 2 (x2 |x1 ) f 3 (x3 |x1 , x2 ) f 4 (x4 |x3 ).

Since Fig. 6.2d implies X i and X i+2 are independent given X i+1 , i = 1, 2, we
have

f (x) = f 1 (x1 ) f 2 (x2 |x1 ) f 3 (x3 |x2 ) f 4 (x4 |x3 ).

In the above discussion, all the variables are causally ordered as in (6.1); however,
there may be cases that all the variables are not able to be causally ordered. For
example, for variables X 1 , X 2 , X 3 , and X 4 , it is assumed variables X 1 , X 2 , and X 3
have no causal orders and the following preceding relation holds:

(X 1 , X 2 , X 3 ) ≺ X 4 .

Then, the path diagram is generally given as in Fig. 6.3. The rounded double-
headed arrows imply the associations between the variables. In this case, (6.2)
becomes

f (x) = f 123 (x1 , x2 , x3 ) f 4 (x4 |x1 , x2 , x3 ),

where f 123 (x1 , x2 , x3 ) is the joint probability or density function of (X 1 , X 2 , X 3 ),


i.e.,

f 123 (x1 , x2 , x3 ) = f 3 (x3 |x1 , x2 ) f 2 (x2 |x1 ) f 1 (x1 ),

As shown in the above concrete examples, path analysis is carried out according
to the path diagrams under consideration. From this, first, we have to make a path
diagram (model) by discussing practical phenomena, and second, it is needed to
decide what model, i.e., a distributional assumption on the variables concerned, is

Fig. 6.3 Path diagram of


(6.2)

used. Moreover, important is how the effects of parent variables on descendant ones
are measured and expressed. In the next section, usual path analysis of continuous
variables based on linear models is discussed.

6.3 Path Analysis of Continuous Variables with Linear


Models

For continuous variables X i , i = 1, 2, . . . , n, the causal order (6.1) is assumed. Then,


the following linear models are used for path analysis:

X 2 = α2 + β21 X 1 + ε2 , (6.3)

X 3 = α3 + β31 X 1 + β32 X 2 + ε3 , (6.4)

X 4 = α4 + β41 X 1 + β42 X 2 + β43 X 3 + ε4 , (6.5)

···


$$X_n = \alpha_n + \sum_{k=1}^{n-1}\beta_{nk} X_k + \varepsilon_n, \tag{6.6}$$

where εi are the error terms with mean 0 and variances σi2 , i = 1, 2, . . . , n. Let
eT (X k → X l ), e D (X k → X l ), and e I (X k → X l ) be the total, direct, and indirect
effects of X k on X l . First, from (6.3), variable X 2 increases by β21 for unit change of
variable X 1 , so the total effect of X 1 on X 2 is defined by β21 , and it is also the direct
effect, i.e.,

eT (X 1 → X 2 ) = e D (X 1 → X 2 ) = β21 , e I (X 1 → X 2 ) = 0.

Second, the effects of X 1 and X 2 on X 3 are considered by using Eqs. (6.3) and
(6.4) (Fig. 6.4). In (6.4), X 3 increases by β31 and β32 for unit changes of X 1 and X 2 ,
respectively, so the direct effects of X 1 and X 2 are defined as follows:

Fig. 6.4 A part of the path


diagram in Fig. 6.2a

e D (X 1 → X 3 ) = β31 , e D (X 2 → X 3 ) = β32 . (6.7)

Since X 2 has no indirect paths to X 3 , we have

eT (X 2 → X 3 ) = β32 , e I (X 2 → X 3 ) = 0.

On the other hand, X 1 has the indirect path to X 3 , i.e., X 1 → X 2 → X 3 . From


(6.3) and (6.4), we have

X 3 = α3 + β31 X 1 + β32 (α2 + β21 X 1 + ε2 ) + ε3


= (α3 + β32 α2 ) + (β31 + β32 β21 )X 1 + (β32 ε2 + ε3 ).

From the above result, the total effect of X 1 on X 3 is defined by

eT (X 1 → X 3 ) = β31 + β32 β21 . (6.8)

Subtracting (6.7) from (6.8), the indirect effect is calculated as

e I (X 1 → X 3 ) = eT (X 1 → X 3 ) − e D (X 1 → X 3 ) = β32 β21 . (6.9)

This effect is interpreted as that through path X 1 → X 2 → X 3 , and from (6.9),


we have

e I (X 1 → X 3 ) = e D (X 1 → X 2 )e D (X 2 → X 3 ).

Third, the effects of X 1 , X 2 , and X 3 on X 4 are discussed in Fig. 6.5. This figure
has the regression coefficients from (6.3) to (6.5), and the regression coefficients are
called the path coefficients. It is convenient to use this diagram to compute the effects

Fig. 6.5 A part of the path


diagram in Fig. 6.2a

of the variables concerned. According to the above discussion, we can calculate the
effects of the variables by the regression coefficients. Since the path coefficients are
the direct effects indicated by the arrows, we have

e D (X 1 → X 4 ) = β41 , e D (X 2 → X 4 ) = β42 , e D (X 3 → X 4 ) = β43 .

The indirect effect of X 1 on X 4 is calculated through the following three paths:

X 1 → X 2 → X 4, X 1 → X 3 → X 4, X 1 → X 2 → X 3 → X 4. (6.10)

From the above paths and the path coefficients, we have

e I (X 1 → X 4 ) = β21 β42 + β31 β43 + β21 β32 β43 .

Similarly, it follows that

e I (X 2 → X 4 ) = β32 β43 , e I (X 3 → X 4 ) = 0.

In the above decomposition of the indirect effect, we define the pathway effects.
In (6.3)–(6.5), there are four pathways from X 1 to X 4 , and the effects through the
pathways can be calculated by the products of the path coefficients related to the
directed arrows, for example, for path X 1 → X 3 → X 4 in (6.10), the pathway effect
is calculated by β31 β43 . The method of calculation of the effects can be extended to the
general case illustrated in path diagram Fig. 6.1. As shown in the above discussion,
path analysis of continuous variables based on linear regression models is carried
out easily because continuous variables are quantitative and the causal relationships
among variables are expressed through linear equations as above.
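The effect calculus of this section reduces to sums of products of path coefficients. A minimal sketch with hypothetical coefficients for Fig. 6.2a is given below; the three products correspond to the indirect paths listed in (6.10).

```python
# Hypothetical path coefficients for Fig. 6.2a (illustration only)
b21 = 0.5
b31, b32 = 0.3, 0.4
b41, b42, b43 = 0.2, 0.3, 0.5

direct_x1_x4 = b41
indirect_x1_x4 = b21 * b42 + b31 * b43 + b21 * b32 * b43   # the three paths in (6.10)
total_x1_x4 = direct_x1_x4 + indirect_x1_x4

print(total_x1_x4, direct_x1_x4, indirect_x1_x4)
```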

6.4 Examples of Path Systems with Categorical Variables

In this section, examples of complex path systems of variables are demonstrated,


and issues to resolve are pointed out. The first example is a path system of binary
variables, and the second one is that of polytomous variables.

Example 6.1 Table 6.1 shows the data about marital status with three explanatory
variables: X G : gender; X PMS : premarital sex (PMS); X EMS : extramarital sex (EMS);
X MS : marital status (MS) [2]. The path diagram is illustrated in Fig. 6.6. All the
variables are binary, and the method of path analysis in the previous section cannot
be used. By using logit models, path analysis is carried out below. In the recursive
decomposition of the joint probability function of the four variables according to
Fig. 6.6, the conditional probability functions are expressed by logit models, not
linear regression models. Then, the discussion for linear regression models cannot
be applied, and log odds ratios are used for measuring the effects of parent variables

Table 6.1 Marital status data

Gender       PMS       EMS       MS: Divorced (1)   MS: Married (0)
Male (1)     Yes (1)   Yes (1)   28                 11
                       No (0)    60                 42
             No (0)    Yes (1)   17                 4
                       No (0)    68                 130
Female (0)   Yes (1)   Yes (1)   17                 4
                       No (0)    54                 25
             No (0)    Yes (1)   36                 4
                       No (0)    214                322

Source: Agresti [2]

Fig. 6.6 Path diagram of the four dichotomous variables

on descendant ones. As discussed in the previous chapters, especially in Chap. 2, log


odds ratios are interpreted as changes of relative information of descendant variables
for parent variables.

Example 6.2 The data for an investigation of factors influencing the primary food
choice of alligators (Table 6.2) are analyzed in Agresti [2]. In this example, explana-
tory variables are X L : lakes where alligators live, {1. Hancock, 2. Oklawaha, 3.
Trafford, 4. George}; and X S : sizes of alligators, {1. Small (≤ 2.3 m), 2. Large
(> 2.3 m) }; and the response variable is Y : primary food choice of alligators, {1.
Fish, 2. Invertebrate, 3. Reptile, 4. Bird, 5. Other}. Considering the effects of lake
and size of alligator on the primary food choice of alligators, it is valid to use the path
diagram shown in Fig. 6.7. In this example, since variables X L and Y are polytomous
and as in the first example, the method for continuous variables cannot be used as
well. In this example, generalized logit models are used, and the path analysis is
made on the basis of log odds ratios.

In the above examples, path analysis is demonstrated on the basis of log odds ratios
below; however, the path analysis becomes complex as the numbers of categories of
variables are increasing because the number of log odds ratios to employ for the path
analysis is also increasing more than those of categories of variables. For example,

Table 6.2 Data of primary food choice of alligators


Primary food choice
X L : Lake X S : Size of alligator Fish Invertebrate Reptile Bird Other
Hancock S 23 4 2 2 8
L 7 0 1 3 5
Oklawaha S 5 11 1 0 3
L 13 8 6 1 0
Trafford S 5 11 2 1 5
L 8 7 6 3 5
George S 16 19 1 2 3
L 17 1 0 1 3
Source Agresti [2]

Fig. 6.7 Path diagram


(Primary food choice data)

when variables X and Y have I and J categories, respectively, the number of log odds
ratios is $\binom{I}{2}\binom{J}{2}$. Hence, it is important to summarize causal effects measured
with log odds ratios. In the next section, an entropy-based method of path analysis
is discussed in the context of generalized linear models.

6.5 Path Analysis of Structural Generalized Linear Models


 
Let $f_i\!\left(x_i|\boldsymbol{x}_{pa(i)}\right)$ be the conditional probability or density function of X i given
X pa(i) = x pa(i). In GLMs, the random component is assumed to be

$$f_i\!\left(x_i|\boldsymbol{x}_{pa(i)}\right) = \exp\left(\frac{x_i\theta_i - b_i(\theta_i)}{a_i(\varphi_i)} + c_i(x_i,\varphi_i)\right). \tag{6.11}$$

Without loss of generality, we assume ai(ϕi) > 0. For the regression coefficient vector
$\boldsymbol{\beta}_{(i)} = \left(\beta_{(i)1},\beta_{(i)2},\ldots,\beta_{(i)i-1}\right)^{\mathsf T}$ and a link function that relates the linear predictor

$$\boldsymbol{\beta}_{(i)}^{\mathsf T}\boldsymbol{x}_{pa(i)} = \beta_{(i)1}x_1 + \beta_{(i)2}x_2 + \cdots + \beta_{(i)i-1}x_{i-1}$$

Fig. 6.8 Path diagrams from (X 1 , X 2 , X 3 ) to X 4 (panels a and b)

to the mean of X i, i.e., b′i(θi), the parameter θi can be viewed as a function of $\boldsymbol{\beta}_{(i)}^{\mathsf T}\boldsymbol{x}_{pa(i)}$.
For simplification, in this chapter, the function is expressed as

$$\theta_i = \theta_i\!\left(\boldsymbol{x}_{pa(i)}\right) = \theta_i(x_1, x_2, \ldots, x_{i-1}).$$

In GLMs, the discussion on path analysis in the previous section cannot be
used, except for linear regression models. In this section, the effects of parent variables
on the descendant ones are measured according to log odds ratios. First, by using
Fig. 6.2a, the effects of X 1, X 2, and X 3 on X 4 are considered. The following log odds
ratio can be viewed as the total effect of X pa(4) = (x1, x2, x3)T on X 4 (Fig. 6.8a):

$$\log\frac{f_4(x_4|x_1,x_2,x_3)\,f_4\!\left(x_4^*|x_1^*,x_2^*,x_3^*\right)}{f_4\!\left(x_4^*|x_1,x_2,x_3\right)\,f_4\!\left(x_4|x_1^*,x_2^*,x_3^*\right)}
= \frac{\left(x_4-x_4^*\right)\left(\theta_4(x_1,x_2,x_3)-\theta_4\!\left(x_1^*,x_2^*,x_3^*\right)\right)}{a_4(\varphi_4)},$$

where $\left(x_1^*, x_2^*, x_3^*, x_4^*\right)$ is a baseline. Let μi = E(X i), i = 1, 2, 3, 4. Setting

$$\left(x_1^*, x_2^*, x_3^*, x_4^*\right) = (\mu_1, \mu_2, \mu_3, \mu_4),$$

the total effect of X pa(4) = (x1, x2, x3)T on X 4 is defined by

$$\log\frac{f_4(x_4|x_1,x_2,x_3)\,f_4(\mu_4|\mu_1,\mu_2,\mu_3)}{f_4(\mu_4|x_1,x_2,x_3)\,f_4(x_4|\mu_1,\mu_2,\mu_3)}
= \frac{(x_4-\mu_4)(\theta_4(x_1,x_2,x_3)-\theta_4(\mu_1,\mu_2,\mu_3))}{a_4(\varphi_4)}. \tag{6.12}$$

As explained in the previous chapter, the log odds ratio is a change of relative
information. The total effect of (X 2 , X 3 ) on X 4 in Fig. 6.8b is assessed. Let μi (x1 ) =
E(X i |X 1 = x1 ), i = 2, 3, 4. Since X 1 is the parent of (X 2 , X 3 ), by giving X 1 = x1 ,
the total effect of (X 2 , X 3 ) = (x2 , x3 ) on X 4 = x4 is defined as follows:

$$\log\frac{f_4(x_4|x_1,x_2,x_3)\,f_4(\mu_4(x_1)|x_1,\mu_2(x_1),\mu_3(x_1))}{f_4(\mu_4(x_1)|x_1,x_2,x_3)\,f_4(x_4|x_1,\mu_2(x_1),\mu_3(x_1))}
= \frac{(x_4-\mu_4(x_1))(\theta_4(x_1,x_2,x_3)-\theta_4(x_1,\mu_2(x_1),\mu_3(x_1)))}{a_4(\varphi_4)}. \tag{6.13}$$

By subtracting (6.13) from (6.12), the total effect of X 1 = x1 on X 4 = x4 at
(X 2 , X 3 ) = (x2 , x3 ) is defined by

(the total effect of (x1 , x2 , x3 ) on x4 ) − (the total effect of (x2 , x3 ) on x4 given x1 )

$$
\begin{aligned}
&= \log\frac{f_4(x_4|x_1,x_2,x_3)\,f_4(\mu_4|\mu_1,\mu_2,\mu_3)}{f_4(\mu_4|x_1,x_2,x_3)\,f_4(x_4|\mu_1,\mu_2,\mu_3)}
 - \log\frac{f_4(x_4|x_1,x_2,x_3)\,f_4(\mu_4(x_1)|x_1,\mu_2(x_1),\mu_3(x_1))}{f_4(\mu_4(x_1)|x_1,x_2,x_3)\,f_4(x_4|x_1,\mu_2(x_1),\mu_3(x_1))}\\
&= \frac{(x_4-\mu_4)(\theta_4(x_1,x_2,x_3)-\theta_4(\mu_1,\mu_2,\mu_3))}{a_4(\varphi_4)}
 - \frac{(x_4-\mu_4(x_1))(\theta_4(x_1,x_2,x_3)-\theta_4(x_1,\mu_2(x_1),\mu_3(x_1)))}{a_4(\varphi_4)}.
\end{aligned}
\tag{6.14}
$$

The effects (6.12) and (6.13) are defined by using log odds ratios; however, the
above quantity is calculated by the subtraction of (6.13) from (6.12) and is not a
log odds ratio. As a log odds ratio implies a change of relative information, the
above quantity is also interpreted as a change of information. Let μ1(x2, x3) and
μ4(x2, x3) be the conditional expectations of X 1 and X 4, given (X 2 , X 3 ) = (x2 , x3 ),
respectively. Then, the direct effect of X 1 = x1 on X 4 = x4 given (X 2 , X 3 ) = (x2 , x3 )
is

$$\log\frac{f_4(x_4|x_1,x_2,x_3)\,f_4(\mu_4(x_2,x_3)|\mu_1(x_2,x_3),x_2,x_3)}{f_4(\mu_4(x_2,x_3)|x_1,x_2,x_3)\,f_4(x_4|\mu_1(x_2,x_3),x_2,x_3)}
= \frac{(x_4-\mu_4(x_2,x_3))(\theta_4(x_1,x_2,x_3)-\theta_4(\mu_1(x_2,x_3),x_2,x_3))}{a_4(\varphi_4)}. \tag{6.15}$$

The above quantity is the log odds ratio of X 4 = x4 and X 1 = x1, given (X 2 , X 3 ) =
(x2 , x3 ). By subtracting the direct effect (6.15) from the total effect (6.14), the indirect
effect of X 1 = x1 on X 4 = x4 through (X 2 , X 3 ) = (x2 , x3 ) can be calculated as

$$
\begin{aligned}
&\frac{(x_4-\mu_4)(\theta_4(x_1,x_2,x_3)-\theta_4(\mu_1,\mu_2,\mu_3))}{a_4(\varphi_4)}
 - \frac{(x_4-\mu_4(x_1))(\theta_4(x_1,x_2,x_3)-\theta_4(x_1,\mu_2(x_1),\mu_3(x_1)))}{a_4(\varphi_4)}\\
&\quad - \frac{(x_4-\mu_4(x_2,x_3))(\theta_4(x_1,x_2,x_3)-\theta_4(\mu_1(x_2,x_3),x_2,x_3))}{a_4(\varphi_4)}.
\end{aligned}
$$

Next, the effects of X 2 = x2 are considered. Let μ3(x1, x2) and μ4(x1, x2) be the
conditional expectations of X 3 and X 4, given (X 1 , X 2 ) = (x1 , x2 ), respectively. As
in the above discussion, the total effect of X 3 = x3 on X 4 = x4 given (X 1 , X 2 ) =
(x1 , x2 ) is defined by

$$\frac{(x_4-\mu_4(x_1,x_2))(\theta_4(x_1,x_2,x_3)-\theta_4(x_1,x_2,\mu_3(x_1,x_2)))}{a_4(\varphi_4)}. \tag{6.16}$$

As illustrated in Fig. 6.2a, X 3 has no indirect paths to X 4, so the above quantity is also the
direct effect. From this, the total effect of X 2 = x2 on X 4 = x4 at (X 1 , X 3 ) = (x1 , x3 )
is defined by subtracting (6.16) from (6.13):

$$\frac{(x_4-\mu_4(x_1))(\theta_4(x_1,x_2,x_3)-\theta_4(x_1,\mu_2(x_1),\mu_3(x_1)))}{a_4(\varphi_4)}
- \frac{(x_4-\mu_4(x_1,x_2))(\theta_4(x_1,x_2,x_3)-\theta_4(x_1,x_2,\mu_3(x_1,x_2)))}{a_4(\varphi_4)}. \tag{6.17}$$

Let μ2(x1, x3) and μ4(x1, x3) be the conditional expectations of X 2 and X 4, given
(X 1 , X 3 ) = (x1 , x3 ), respectively. Then, the direct effect of X 2 = x2 on X 4 = x4
given (X 1 , X 3 ) = (x1 , x3 ) is defined as follows:

$$\frac{(x_4-\mu_4(x_1,x_3))(\theta_4(x_1,x_2,x_3)-\theta_4(x_1,\mu_2(x_1,x_3),x_3))}{a_4(\varphi_4)}. \tag{6.18}$$

By subtracting (6.18) from (6.17), the indirect effect of X 2 = x2 on X 4 = x4


through (X 1 , X 3 ) = (x1 , x3 ) can be calculated.
Remark 6.1 In the above discussion, we treat the log odds ratios as if the variables X i
were continuous; for example, in (6.12), we used functions such as f 4 (μ4 |μ1 , μ2 , μ3 ),
f 4 (μ4 |x1 , x2 , x3 ), and f 4 (x4 |μ1 , μ2 , μ3 ). For categorical variables, such functions may
not exist in a strict sense; however, the formal results are interpretable as log odds
ratios, e.g.,

$$\frac{(x_4-\mu_4)(\theta_4(x_1,x_2,x_3)-\theta_4(\mu_1,\mu_2,\mu_3))}{a_4(\varphi_4)}.$$

A general discussion based on the path diagram in Fig. 6.1 can be made in a
similar method explained above. In the next section, the above discussion is applied
to path analysis of GLMs with canonical links.

6.6 Path Analysis of GLM Systems with Canonical Links

Let us assume that, in path diagram 6.2a, the recursive causal relationships are expressed
by GLMs with canonical links. For canonical links, we have

$$\begin{cases}
\theta_2(x_1) = \beta_{21}x_1\\
\theta_3(x_1,x_2) = \beta_{31}x_1 + \beta_{32}x_2\\
\theta_4(x_1,x_2,x_3) = \beta_{41}x_1 + \beta_{42}x_2 + \beta_{43}x_3
\end{cases} \tag{6.19}$$

and the conditional distributions (6.11) become

$$\begin{cases}
f_2\!\left(x_2|\boldsymbol{x}_{pa(2)}\right) = \exp\left(\dfrac{x_2\beta_{21}x_1 - b_2(\theta_2(x_1))}{a_2(\varphi_2)} + c_2(x_2,\varphi_2)\right)\\[3mm]
f_3\!\left(x_3|\boldsymbol{x}_{pa(3)}\right) = \exp\left(\dfrac{x_3(\beta_{31}x_1+\beta_{32}x_2) - b_3(\theta_3(x_1,x_2))}{a_3(\varphi_3)} + c_3(x_3,\varphi_3)\right)\\[3mm]
f_4\!\left(x_4|\boldsymbol{x}_{pa(4)}\right) = \exp\left(\dfrac{x_4(\beta_{41}x_1+\beta_{42}x_2+\beta_{43}x_3) - b_4(\theta_4(x_1,x_2,x_3))}{a_4(\varphi_4)} + c_4(x_4,\varphi_4)\right)
\end{cases} \tag{6.20}$$

In order to demonstrate the present path analysis, the effects of X i, i = 1, 2, 3, on
X 4 are calculated below. First, the effects of X 1 = x1 on X 4 = x4 are considered.
According to (6.12), the total effect of X pa(4) = (x1, x2, x3)T on X 4 = x4 is computed
as follows:

$$\frac{(x_4-\mu_4)(\beta_{41}(x_1-\mu_1)+\beta_{42}(x_2-\mu_2)+\beta_{43}(x_3-\mu_3))}{a_4(\varphi_4)}. \tag{6.21}$$

From (6.13), the total effect of (X 2 , X 3 ) = (x2 , x3 ) on X 4 = x4 at X 1 = x1 is
calculated by

$$\frac{(x_4-\mu_4(x_1))(\beta_{42}(x_2-\mu_2(x_1))+\beta_{43}(x_3-\mu_3(x_1)))}{a_4(\varphi_4)}. \tag{6.22}$$

Subtracting (6.22) from (6.21), it follows that

(the total effect of X 1 = x1 on X 4 = x4 at (X 2 , X 3 ) = (x2 , x3 ))
= (the total effect of X pa(4) = (x1, x2, x3)T on X 4 = x4)
− (the total effect of (X 2 , X 3 ) = (x2 , x3 ) on X 4 = x4 at X 1 = x1)

$$= \frac{(x_4-\mu_4)(\beta_{41}(x_1-\mu_1)+\beta_{42}(x_2-\mu_2)+\beta_{43}(x_3-\mu_3))}{a_4(\varphi_4)}
 - \frac{(x_4-\mu_4(x_1))(\beta_{42}(x_2-\mu_2(x_1))+\beta_{43}(x_3-\mu_3(x_1)))}{a_4(\varphi_4)}.$$

From (6.15), we have the direct effect of X 1 = x1 on X 4 = x4 at (X 2 , X 3 ) =
(x2 , x3 ) as

$$\frac{(x_4-\mu_4(x_2,x_3))\,\beta_{41}(x_1-\mu_1(x_2,x_3))}{a_4(\varphi_4)}.$$

Second, the effects of X 2 = x2 on X 4 = x4 at (X 1 , X 3 ) = (x1 , x3 ) are computed.
Since the total effect of (X 2 , X 3 ) = (x2 , x3 ) on X 4 = x4 at X 1 = x1 is (6.22) and
the total effect of X 3 = x3 on X 4 = x4 at (X 1 , X 2 ) = (x1 , x2 ) is

$$\frac{(x_4-\mu_4(x_1,x_2))\,\beta_{43}(x_3-\mu_3(x_1,x_2))}{a_4(\varphi_4)}, \tag{6.23}$$

we have

(the total effect of X 2 = x2 on X 4 = x4 at (X 1 , X 3 ) = (x1 , x3 )) = (6.22) − (6.23)

$$= \frac{(x_4-\mu_4(x_1))(\beta_{42}(x_2-\mu_2(x_1))+\beta_{43}(x_3-\mu_3(x_1)))}{a_4(\varphi_4)}
 - \frac{(x_4-\mu_4(x_1,x_2))\,\beta_{43}(x_3-\mu_3(x_1,x_2))}{a_4(\varphi_4)}.$$

The direct effect of X 2 = x2 on X 4 = x4 at (X 1 , X 3 ) = (x1 , x3 ) is

$$\frac{(x_4-\mu_4(x_1,x_3))\,\beta_{42}(x_2-\mu_2(x_1,x_3))}{a_4(\varphi_4)}. \tag{6.24}$$

The total effect of X 3 = x3 on X 4 = x4 at (X 1 , X 2 ) = (x1 , x2 ), (6.23), is also the
direct effect. Similarly, the effects of X i on X 3, i = 1, 2, and those of X 1 on X 2 can
also be computed. The results are given as follows:

$$(\text{the total effect of } X_2=x_2 \text{ on } X_3=x_3 \text{ at } X_1=x_1)
= \frac{(x_3-\mu_3(x_1))\,\beta_{32}(x_2-\mu_2(x_1))}{a_3(\varphi_3)},$$

$$
\begin{aligned}
(\text{the total effect of } X_1=x_1 \text{ on } X_3=x_3 \text{ at } X_2=x_2)
&= \left(\text{the total effect of } \boldsymbol{X}_{pa(3)}=(x_1,x_2)^{\mathsf T} \text{ on } X_3=x_3\right)\\
&\quad - (\text{the total effect of } X_2=x_2 \text{ on } X_3=x_3 \text{ at } X_1=x_1)\\
&= \frac{(x_3-\mu_3)(\beta_{31}(x_1-\mu_1)+\beta_{32}(x_2-\mu_2))}{a_3(\varphi_3)} - \frac{(x_3-\mu_3(x_1))\,\beta_{32}(x_2-\mu_2(x_1))}{a_3(\varphi_3)},
\end{aligned}
$$

$$(\text{the direct effect of } X_1=x_1 \text{ on } X_3=x_3 \text{ at } X_2=x_2)
= \frac{(x_3-\mu_3(x_2))\,\beta_{31}(x_1-\mu_1(x_2))}{a_3(\varphi_3)}.$$
As demonstrated above, the path analysis based on log odds ratios is complicated,
even in the canonical-link case, so it is necessary to summarize the effects.

6.7 Summary Effects Based on Entropy

In order to simplify the discussion, all the variables concerned are assumed to be
continuous. Taking the expectation of the total effect of X pa(4) = (x1 , x2 , x3 )T on X 4
(6.12) with respect to X pa(4) and X 4 , we have the following summary total effect:
$$
\begin{aligned}
\frac{\mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3))}{a_4(\varphi_4)}
&= \iint f_{1234}\!\left(\boldsymbol{x}_{pa(4)},x_4\right)\log\frac{f_{1234}\!\left(\boldsymbol{x}_{pa(4)},x_4\right)}{f_4(x_4)\,f_{123}\!\left(\boldsymbol{x}_{pa(4)}\right)}\,d\boldsymbol{x}_{pa(4)}\,dx_4\\
&\quad + \iint f_4(x_4)\,f_{123}\!\left(\boldsymbol{x}_{pa(4)}\right)\log\frac{f_4(x_4)\,f_{123}\!\left(\boldsymbol{x}_{pa(4)}\right)}{f_{1234}\!\left(\boldsymbol{x}_{pa(4)},x_4\right)}\,d\boldsymbol{x}_{pa(4)}\,dx_4
= KL\!\left(\boldsymbol{X}_{pa(4)},X_4\right),
\end{aligned}
\tag{6.25}
$$

where $f_{1234}\!\left(\boldsymbol{x}_{pa(4)},x_4\right)$ is the joint density function of X pa(4) and X 4, and $f_{123}\!\left(\boldsymbol{x}_{pa(4)}\right)$
and f 4 (x4 ) are the marginal density functions of X pa(4) and X 4, respectively.
The quantity (6.25) is the ratio of the explained variation of X 4 by X pa(4), i.e.,
Cov(X 4 , θ4 (X 1 , X 2 , X 3 )), to the error variation a4 (ϕ4 ) in entropy. In this sense, the
total effect (6.25) is a signal-to-noise ratio. Standardizing the above KL information,
we define the standardized summary total effect of X pa(4) on X 4 by

$$e_T\!\left(\boldsymbol{X}_{pa(4)}\to X_4\right) = \frac{\mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3))}{\mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3))+a_4(\varphi_4)}
= \frac{KL\!\left(\boldsymbol{X}_{pa(4)},X_4\right)}{KL\!\left(\boldsymbol{X}_{pa(4)},X_4\right)+1}. \tag{6.26}$$
 
The above measure is the ECD of $\left(\boldsymbol{X}_{pa(4)},X_4\right)$ [7]. By taking the expectation of (6.13) with
respect to X pa(4) and X 4, it follows that

$$
\begin{aligned}
\frac{\mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3)|X_1)}{a_4(\varphi_4)}
&= \iint f_{1234}\!\left(\boldsymbol{x}_{pa(4)},x_4\right)\log\frac{f_{1234}\!\left(\boldsymbol{x}_{pa(4)},x_4\right)}{f_4(x_4|x_1)\,f_{123}\!\left(\boldsymbol{x}_{pa(4)}\right)}\,d\boldsymbol{x}_{pa(4)}\,dx_4\\
&\quad + \iint f_4(x_4|x_1)\,f_{123}\!\left(\boldsymbol{x}_{pa(4)}\right)\log\frac{f_4(x_4|x_1)\,f_{123}\!\left(\boldsymbol{x}_{pa(4)}\right)}{f_{1234}\!\left(\boldsymbol{x}_{pa(4)},x_4\right)}\,d\boldsymbol{x}_{pa(4)}\,dx_4\\
&= KL\!\left(\boldsymbol{X}_{pa(4)},X_4|X_1\right).
\end{aligned}
$$

As defined in (6.26), the standardized summary total effect of (X 2 , X 3 ) on X 4 is
defined by

$$e_T((X_2,X_3)\to X_4) = \frac{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)|X_1\right)}{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)\right)+a_4(\varphi_4)}
= \frac{KL((X_2,X_3),X_4|X_1)}{KL\!\left(\boldsymbol{X}_{pa(4)},X_4\right)+1}.$$

From (6.14), the standardized summary total effect of X 1 on X 4 is given by

$$
\begin{aligned}
e_T(X_1\to X_4) &= e_T\!\left(\boldsymbol{X}_{pa(4)}\to X_4\right) - e_T((X_2,X_3)\to X_4)\\
&= \frac{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)\right)-\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)|X_1\right)}{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)\right)+a_4(\varphi_4)}
= \frac{KL\!\left(\boldsymbol{X}_{pa(4)},X_4\right)-KL((X_2,X_3),X_4|X_1)}{KL\!\left(\boldsymbol{X}_{pa(4)},X_4\right)+1}.
\end{aligned}
\tag{6.27}
$$

The standardized summary direct effect of X 1 on X 4 is computed by taking the
expectation of (6.15):

$$e_D(X_1\to X_4) = \frac{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)|X_2,X_3\right)}{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)\right)+a_4(\varphi_4)}
= \frac{KL(X_1,X_4|X_2,X_3)}{KL\!\left(\boldsymbol{X}_{pa(4)},X_4\right)+1}. \tag{6.28}$$

By subtracting (6.28) from (6.27), we have the indirect effect of X 1 on X 4 as

$$e_I(X_1\to X_4) = e_T(X_1\to X_4) - e_D(X_1\to X_4).$$

Although the indirect effect is defined by subtracting (6.28) from (6.27), the above
quantity is interpreted in terms of entropy. By using a similar discussion, we have

$$
\begin{aligned}
e_T(X_2\to X_4) &= \frac{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)|X_1\right)-\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)|X_1,X_2\right)}{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)\right)+a_4(\varphi_4)}\\
&= e_T((X_2,X_3)\to X_4) - e_T(X_3\to X_4),
\end{aligned}
\tag{6.29}
$$

$$e_D(X_2\to X_4) = \frac{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)|X_1,X_3\right)}{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)\right)+a_4(\varphi_4)}, \tag{6.30}$$

$$e_T(X_3\to X_4) = e_D(X_3\to X_4) = \frac{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)|X_1,X_2\right)}{\mathrm{Cov}\!\left(X_4,\theta_4\!\left(\boldsymbol{X}_{pa(4)}\right)\right)+a_4(\varphi_4)}. \tag{6.31}$$

Similarly, we get

$$e_T(X_1\to X_3) = \frac{\mathrm{Cov}(X_3,\theta_3(X_1,X_2))-\mathrm{Cov}(X_3,\theta_3(X_1,X_2)|X_1)}{\mathrm{Cov}(X_3,\theta_3(X_1,X_2))+a_3(\varphi_3)},$$

$$e_D(X_1\to X_3) = \frac{\mathrm{Cov}(X_3,\theta_3(X_1,X_2)|X_2)}{\mathrm{Cov}(X_3,\theta_3(X_1,X_2))+a_3(\varphi_3)},$$

$$e_T(X_2\to X_3) = e_D(X_2\to X_3) = \frac{\mathrm{Cov}(X_3,\theta_3(X_1,X_2)|X_1)}{\mathrm{Cov}(X_3,\theta_3(X_1,X_2))+a_3(\varphi_3)}.$$

Remark 6.2 It is assumed that path diagram 6.2a is expressed with the linear models
(6.3) to (6.5), and that the error terms εi, i = 2, 3, 4, are normally distributed with mean
0 and variances σi². In this case, the summary effects discussed above are calculated
with covariances of the variables. Since

$$\theta_4(x_1,x_2,x_3) = \beta_{41}x_1 + \beta_{42}x_2 + \beta_{43}x_3, \qquad a_4(\varphi_4) = \sigma_4^2$$

and


$$\mathrm{Var}(X_4) = \mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3)) + a_4(\varphi_4) = \sum_{i=1}^{3}\beta_{4i}\,\mathrm{Cov}(X_4,X_i) + \sigma_4^2,$$

from (6.27) and (6.28) it follows that

$$
\begin{aligned}
e_T(X_1\to X_4) &= \frac{\beta_{41}\mathrm{Cov}(X_4,X_1) + \sum_{i=2}^{3}\beta_{4i}\left(\mathrm{Cov}(X_4,X_i)-\mathrm{Cov}(X_4,X_i|X_1)\right)}{\sum_{i=1}^{3}\beta_{4i}\,\mathrm{Cov}(X_4,X_i)+\sigma_4^2}\\
&= \frac{\beta_{41}\mathrm{Cov}(X_4,X_1) + \sum_{i=2}^{3}\beta_{4i}\left(\mathrm{Cov}(X_4,X_i)-\mathrm{Cov}(X_4,X_i|X_1)\right)}{\mathrm{Var}(X_4)},
\end{aligned}
$$

$$e_D(X_1\to X_4) = \frac{\beta_{41}\mathrm{Cov}(X_4,X_1|X_2,X_3)}{\mathrm{Var}(X_4)}.$$

Similarly, from (6.29) to (6.31), we have

$$e_T(X_2\to X_4) = \frac{\beta_{42}\mathrm{Cov}(X_4,X_2|X_1)+\beta_{43}\left(\mathrm{Cov}(X_4,X_3|X_1)-\mathrm{Cov}(X_4,X_3|X_1,X_2)\right)}{\mathrm{Var}(X_4)},$$

$$e_D(X_2\to X_4) = \frac{\beta_{42}\mathrm{Cov}(X_4,X_2|X_1,X_3)}{\mathrm{Var}(X_4)},$$

$$e_T(X_3\to X_4) = e_D(X_3\to X_4) = \frac{\beta_{43}\mathrm{Cov}(X_4,X_3|X_1,X_2)}{\mathrm{Var}(X_4)}.$$
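For the linear-Gaussian case of Remark 6.2, the summary effects can be computed from the joint covariance matrix, using the usual Gaussian formula Cov(Xa, Xb|XC) = Σab − ΣaC ΣCC⁻¹ ΣCb for the conditional covariances. The sketch below simulates a system with hypothetical path coefficients and evaluates eT(X1 → X4) and eD(X1 → X4); it is an illustration of the formulas above, not code from the original text.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200000
b21, b31, b32 = 0.5, 0.3, 0.4                 # hypothetical path coefficients
b41, b42, b43 = 0.2, 0.3, 0.5

x1 = rng.normal(0.0, 1.0, n)
x2 = b21 * x1 + rng.normal(0.0, 1.0, n)
x3 = b31 * x1 + b32 * x2 + rng.normal(0.0, 1.0, n)
x4 = b41 * x1 + b42 * x2 + b43 * x3 + rng.normal(0.0, 1.0, n)
S = np.cov(np.column_stack([x1, x2, x3, x4]), rowvar=False)

def cond_cov(S, a, b, C):
    # Cov(X_a, X_b | X_C) for a jointly Gaussian vector with covariance matrix S
    adj = S[np.ix_([a], C)] @ np.linalg.inv(S[np.ix_(C, C)]) @ S[np.ix_(C, [b])]
    return S[a, b] - adj[0, 0]

var4 = S[3, 3]
e_total = (b41 * S[3, 0]
           + b42 * (S[3, 1] - cond_cov(S, 3, 1, [0]))
           + b43 * (S[3, 2] - cond_cov(S, 3, 2, [0]))) / var4
e_direct = b41 * cond_cov(S, 3, 0, [1, 2]) / var4
print(e_total, e_direct, e_total - e_direct)   # total, direct, indirect of X1 on X4
```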

In the next section, the above discussion is applied to the examples in Sect. 6.4.

6.8 Application to Examples 6.1 and 6.2

In this section, path analyses of the examples in Sect. 6.4 are discussed in detail.

6.8.1 Path Analysis of Dichotomous Variables (Example 6.1


(Continued))

By using the following logit model:

$$f_{MS}(x_{MS}|x_G,x_{PMS},x_{EMS})
= \frac{\exp\{x_{MS}(\alpha+\beta_G x_G+\beta_{PMS}x_{PMS}+\beta_{EMS}x_{EMS})\}}{1+\exp(\alpha+\beta_G x_G+\beta_{PMS}x_{PMS}+\beta_{EMS}x_{EMS})},$$

the present entropy-based path analysis is carried out. The estimated parame-
ters are given in Table 6.3. The log-likelihood ratio statistic for the goodness-
of-fit of the model is G² = 13.629, df = 12, p = 0.325. The total effects

Table 6.3 Estimated logit parameters


Parameter α βG βPMS βEMS
Estimate −0.350 −0.306 0.889 1.565
SE 0.083 0.146 0.167 0.246
t value −4.197 −2.116 −5.339 6.352

Table 6.4 The total effects of (X G , X PMS , X EMS ) on X MS


xG X PMS X EMS X MS Total effect
Male Yes Yes Divorced 0.976
No 0.157
No Yes 0.510
No −0.308
Yes Yes Married −0.889
No −0.143
No Yes −0.465
No 0.281
Female Yes Yes Divorced 1.135
No 0.317
No Yes 0.671
No −0.147
Yes Yes Married −1.035
No −0.289
No Yes −0.611
No 0.135

of (X G , X PMS , X EMS ) on X MS are calculated according to (6.12), and the results


are shown in Table 6.4. The exponentials of the effects can be interpreted as
odds ratios for the mean vector of (X G , X PMS , X EMS , X MS ). The total effects
of (X G , X PMS , X EMS ) = (Male,Yes,Yes), (Female,Yes,Yes) are 0.996 and 1.156,
respectively, and these are greater than the other total effects. The exponentials
of these are 2.708 and 3.187. The total effects of (X PMS , X EMS ) on X MS given
X G are given in Table 6.5. The total effects of (X PMS , X EMS ) = (Yes,Yes) at
X G = Male,Female are 0.958 and 1.145, respectively, and the effects are the great-
est, respectively, for male and female in the table. The total effects of X G on X MS
are calculated by subtracting values in the fourth column of Table 6.5 from those in
the last of Table 6.4. The direct and indirect effects of gender on marital status are
calculated in Table 6.6. From the table, it is concluded that the effects of gender on
marital status are small.
Next, the effects of X PMS on X MS are calculated. Since the total effects of
(X PMS , X EMS ) on X MS given X G and those of X EMS on X MS given (X G , X PMS ) are

Table 6.5 The total effects of (X PMS , X EMS ) on X MS given X G


X PMS X EMS X MS Total effect xG
Yes Yes Divorced 0.958 Male
No 0.145
No Yes 0.497
No −0.316
Yes Yes Married −0.876
No −0.133
No Yes −0.454
No 0.289
Yes Yes Divorced 1.145 Female
No 0.324
No Yes 0.679
No −0.143
Yes Yes Married −1.036
No −0.293
No Yes −0.614
No 0.130

Table 6.6 The effects of X G on X M S at X PMS and X EMS


xG X MS Total effect Direct effect Indirect effect X PMS X EMS
Male Divorced 0.017 −0.014 0.031 Yes Yes
0.011 −0.055 0.066 No
0.014 −0.050 0.064 No Yes
0.008 −0.135 0.143 No
Married −0.013 0.093 −0.106 Yes Yes
−0.010 0.079 −0.089 No
−0.011 0.151 −0.162 No Yes
−0.008 0.088 −0.096 No
Female Divorced −0.010 0.026 −0.035 Yes Yes
−0.007 0.071 −0.077 No
−0.008 0.026 −0.034 No Yes
−0.005 0.050 −0.055 No
Married −0.001 −0.173 0.174 Yes Yes
−0.004 −0.102 0.106 No
−0.002 −0.079 0.082 No Yes
−0.006 −0.033 0.038 No

Table 6.7 The total (direct) effects of X EMS on X MS given X G and X PMS
X EMS X MS Total effect xG X PMS
Yes Divorced 0.403 Male Yes
No −0.155
Yes 0.881 No
No −0.093
Yes Married −0.726 Yes
No −0.277
Yes −0.534 No
No 0.057
Yes Divorced 0.388 Female Yes
No −0.103
Yes 0.818 No
No −0.061
Yes Married −0.843 Yes
No 0.226
Yes −0.638 No
No 0.048

shown in Tables 6.5 and 6.7, respectively, the total effects of X PMS on X MS are com-
puted by subtracting "the total effects of X EMS on X MS given (X G , X PMS ) (Table 6.7)"
from "the total effects of (X PMS , X EMS ) on X MS given X G (Table 6.5)." The direct
effects of X PMS on X MS are calculated by (6.24). These effects are demonstrated in
Table 6.8.
As demonstrated in Tables 6.5, 6.6, 6.7, 6.8, and 6.9, the path analysis is com-
plicated, so the summary effects are calculated. Table 6.9 illustrates the estimated
marginal joint probabilities of X G , X PMS , and X EMS that are calculated as the rela-
tive frequencies from Table 6.1. Table 6.10 shows the estimated joint distribution of
X G , X PMS , X EMS , and X MS , which is obtained by using the estimated logit model
(Table 6.3) and the estimated marginal joint distribution of X G , X PMS , and X EMS . By
using the table, the estimated means of the effects in Tables 6.5, 6.6, 6.7, 6.8, and
6.9 can be calculated. For simplicity of notation, let us set

X 1 = X G , X 2 = X PMS , X 3 = X EMS , X 4 = X MS ,

and we use the notations in Sect. 6.6. From Table 6.4, the estimated mean of the total
effects can be obtained as

 
KL(X_pa(4), X_4) = 0.098.    (6.32)

From this, the standardized total effect (ECD) is



Table 6.8 The effects of X PMS on X MS given X G and X EMS


X PMS X MS Total effect Direct effect Indirect effect xG X EMS
Yes Divorced 0.551 0.060 0.492 Male Yes
No 0.301 0.343 −0.042
Yes −0.385 −0.111 −0.273 No
No −0.223 −0.177 −0.046
Yes Married −0.150 −0.251 0.101 Yes
No −0.410 −0.244 −0.167
Yes 0.080 0.466 −0.387 No
No 0.233 0.126 0.107
Yes Divorced 0.757 0.106 0.648 Female Yes
No 0.427 0.433 −0.006
Yes −0.140 −0.057 −0.082 No
No −0.082 −0.063 −0.018
Yes Married −0.187 −0.474 0.286 Yes
No −0.518 −0.342 −0.176
Yes 0.024 0.249 −0.224 No
No 0.082 0.050 0.031

Table 6.9 The estimated marginal joint distribution of (X_G, X_PMS, X_EMS)

X_G X_PMS X_EMS Probability
Male Yes Yes 0.038
No 0.098
No Yes 0.020
No 0.191
Female Yes Yes 0.020
No 0.076
No Yes 0.039
No 0.517



e_T((X_G, X_PMS, X_EMS) → X_MS) = e_T(X_pa(4) → X_4) = KL(X_pa(4), X_4) / {1 + KL(X_pa(4), X_4)} = 0.098/(1 + 0.098) = 0.090.    (6.33)

From this result, the effect of X G , X PMS , and X EMS on X MS is not strong because
(6.33) means that the entropy of X_MS explained by the explanatory variables is only 9%.
Similarly, from Tables 6.6 and 6.9, we have


KL((X 2 , X 3 ), X 4 |X 1 ) = 0.098. (6.34)



Table 6.10 The estimated joint distribution of X G , X PMS , X EMS , and X MS


XG X PMS X EMS X MS Total effect
Male Yes Yes Divorced 0.032
No 0.055
No Yes 0.014
No 0.065
Yes Yes Married 0.005
No 0.044
No Yes 0.006
No 0.126
Female Yes Yes Divorced 0.018
No 0.048
No Yes 0.030
No 0.214
Yes Yes Married 0.002
No 0.028
No Yes 0.009
No 0.304

and subtracting (6.34) from (6.32), we get



  

KL(X_pa(4), X_4) − KL((X_2, X_3), X_4 | X_1) = 0.000.    (6.35)

The above quantity implies the total effect of X 1 (= X G ), and it can be calculated
from the third column of Table 6.6 as well. Hence, from (6.32) and (6.35), the
standardized total effect of X 1 (= X G ) on X 4 (= X M S ) is given by
 
 

e_T(X_G → X_MS) (= e_T(X_1 → X_4)) = {KL(X_pa(4), X_4) − KL((X_2, X_3), X_4 | X_1)} / {1 + KL(X_pa(4), X_4)} = 0.000/(1 + 0.098) = 0.000.

The mean direct effect of X_1 (= X_G) on X_4 (= X_MS) is computed from the fourth column of Table 6.6. Since

KL(X_1, X_4 | X_2, X_3) = 0.004,    (6.36)

from (6.28) and (6.36) we have

e_D(X_G → X_MS) (= e_D(X_1 → X_4)) = KL(X_1, X_4 | X_2, X_3) / {1 + KL(X_pa(4), X_4)} = 0.004/(1 + 0.098) = 0.004

Fig. 6.9 Partial path diagram of the four binary variables

and

e I (X G → X M S ) = eT (X G → X M S ) − e D (X G → X M S ) = −0.004

By using a similar method, from Tables 6.8 and 6.9, it follows that

e_T(X_PMS → X_MS) = 0.044, e_D(X_PMS → X_MS) = 0.026, e_I(X_PMS → X_MS) = 0.018,

eT (X EMS → X MS ) = 0.046

Path analysis for the partial path diagram (Fig. 6.9) of Fig. 6.6 is carried out by using the following logit model:

f_EMS(x_EMS | x_G, x_PMS) = exp{x_EMS(α + β_G x_G + β_PMS x_PMS)} / {1 + exp(α + β_G x_G + β_PMS x_PMS)}.

From the data in Table 6.1, we have the following estimates of the logit parameters:

α̂ = −2.597 (SE = 0.151), β̂_G = 0.357 (0.208), β̂_PMS = 1.276 (0.209).

By using the estimates and the method explained in Sect. 6.6, the present path
analysis can be made. Here, the standardized summary effects are calculated. First,
we have

eT ((X G , X PMS ) → X EMS ) = 0.043, (6.37)

e_T(X_PMS → X_EMS) = e_D(X_PMS → X_EMS) = 0.032.    (6.38)

Second, subtracting (6.38) from (6.37), we have the standardized summary total
effect of X G on X EMS as

eT (X G → X EMS ) = eT ((X G , X PMS ) → X EMS ) − eT (X PMS → X EMS ) = 0.011

Since the direct effect of X_G on X_EMS is

e_D(X_G → X_EMS) = 0.003,

it follows that

e I (X G → X EMS ) = eT (X G → X EMS ) − e D (X G → X EMS ) = 0.008.
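Because a(ϕ) = 1 for the logit link, all of the quantities used above reduce to sample covariances between the response and the fitted canonical predictor, conditioned on subsets of the parent variables. The following Python sketch (not the author's code) illustrates how the standardized total, direct, and indirect effects of this subsection could be estimated; the data frame is synthetic and the variable names are assumptions, so the numbers will not reproduce those of Example 6.1.

```python
# A minimal sketch, assuming 0/1 codings of X_G, X_PMS, X_EMS, X_MS and a(phi) = 1
# for the binary logit link; the synthetic data are for illustration only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def cov(a, b):
    """Sample covariance (divisor n)."""
    return float(np.cov(a, b, ddof=0)[0, 1])

def cond_cov(theta, y, z):
    """Estimate Cov(theta, Y | Z) as the size-weighted mean of within-group
    covariances, where Z is a DataFrame of categorical conditioning variables."""
    d = z.copy()
    d["theta"], d["y"] = np.asarray(theta), np.asarray(y)
    n = len(d)
    return sum(len(g) / n * cov(g["theta"], g["y"])
               for _, g in d.groupby(list(z.columns)))

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"G": rng.integers(0, 2, n),
                   "PMS": rng.integers(0, 2, n),
                   "EMS": rng.integers(0, 2, n)})
lin = -1.0 + 0.3 * df["G"] + 0.8 * df["PMS"] + 0.5 * df["EMS"]
df["MS"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))     # binary response

X = sm.add_constant(df[["G", "PMS", "EMS"]])
res = sm.Logit(df["MS"], X).fit(disp=0)
theta = np.asarray(X) @ res.params                          # fitted canonical predictor

kl_total = cov(theta, df["MS"])                             # KL(X_pa(4), X_4), cf. (6.32)
e_T_all = kl_total / (1 + kl_total)                         # standardized total effect, cf. (6.33)
kl_noG = cond_cov(theta, df["MS"], df[["G"]])               # KL((X_2, X_3), X_4 | X_1), cf. (6.34)
e_T_G = (kl_total - kl_noG) / (1 + kl_total)                # total effect of X_G, cf. (6.35)
e_D_G = cond_cov(theta, df["MS"], df[["PMS", "EMS"]]) / (1 + kl_total)  # direct, cf. (6.36)
e_I_G = e_T_G - e_D_G                                       # indirect effect
print(e_T_all, e_T_G, e_D_G, e_I_G)
```

With real data, the same conditional-covariance routine also yields the effects of X_PMS and X_EMS by conditioning on the corresponding subsets of parent variables.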

6.8.2 Path Analysis of Polytomous Variables (Example 6.2 (Continued))

Example 6.2 is analyzed according to generalized logit models. In this example,


explanatory variables X L and X S have four and two categories, respectively, and
response variable Y has five categories. Since all the variables are categorical
(nominal), the following dummy variables are introduced.

X_Li = {1 if X_L = i; 0 if X_L ≠ i},  i = 1, 2, 3, 4;
X_Sj = {1 if X_S = j; 0 if X_S ≠ j},  j = 1, 2;
Y_k = {1 if Y = k; 0 if Y ≠ k},  k = 1, 2, 3, 4, 5.

Then, categorical variables X_L, X_S and response variable Y are identified with the corresponding dummy random vectors X_L = (X_L1, X_L2, X_L3, X_L4)^T, X_S = (X_S1, X_S2)^T and Y = (Y_1, Y_2, Y_3, Y_4, Y_5)^T, respectively. In this analysis, the following generalized logit model is assumed:
 
f(y|x) = exp{y^T(α + B_L x_L + B_S x_S)} / Σ_y exp{y^T(α + B_L x_L + B_S x_S)},

where α is a 5 × 1 parameter vector, and B_L and B_S are 5 × 4 and 5 × 2 regression parameter matrices, respectively. The link function of the above model is canonical, i.e.,

θ = B_L x_L + B_S x_S.

From Table 6.2, the following estimates of regression parameters are obtained:
B̂_L =
( −0.826  −0.006  −1.516  0 )
( −2.485  −0.394   0.931  0 )
(  0.417   2.454   1.419  0 )
( −0.131  −0.659  −0.429  0 )
(  0       0       0      0 ),

B̂_S =
( −0.332  0 )
(  1.127  0 )
( −0.683  0 )
( −0.962  0 )
(  0      0 ).
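Parameter matrices of this form can be obtained with standard multinomial (baseline-category) logit software. The sketch below is a hypothetical illustration, not the author's code, of how B_L and B_S could be estimated with statsmodels when the alligator food-choice data are arranged with one row per animal; the data-frame and column names are assumptions, and the coefficients agree with the matrices above only up to the choice of reference category.

```python
# A minimal sketch, assuming a long-format DataFrame `gators` (hypothetical) with
# categorical columns "lake" (4 levels), "size" (2 levels), and "food" (5 levels).
import pandas as pd
import statsmodels.api as sm

def fit_generalized_logit(gators: pd.DataFrame):
    # Dummy-code the explanatory variables; drop_first fixes a reference category,
    # which plays the role of the zero columns in the matrices above.
    X = pd.get_dummies(gators[["lake", "size"]], drop_first=True).astype(float)
    X = sm.add_constant(X)
    # MNLogit fits a baseline-category (generalized) logit model; its coefficient
    # table corresponds to (alpha, B_L, B_S) up to the reference-category choice.
    return sm.MNLogit(gators["food"], X).fit(disp=0)

# Example usage (with real data in place of `gators`):
# res = fit_generalized_logit(gators)
# print(res.params)   # one coefficient column per non-reference food category
```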

Since applying the discussion in Sect. 6.5 to the present data would make the results very complicated due to the polytomous variables concerned, for simplicity of the analysis only the summary effects of X_L and X_S on Y are calculated. From the following estimates,


Ĉov(θ, Y) = 0.329 (SE = 0.082),  Ĉov(θ, Y | X_L) = 0.101 (0.045),  Ĉov(θ, Y | X_S) = 0.260 (0.072),

we have the standardized summary effects of X L and X S on Y as




e_T((X_L, X_S) → Y) = Ĉov(θ, Y) / {Ĉov(θ, Y) + 1} = 0.329/(0.329 + 1) = 0.248,    (6.39)

e_T(X_L → Y) = {Ĉov(θ, Y) − Ĉov(θ, Y | X_L)} / {Ĉov(θ, Y) + 1} = (0.329 − 0.101)/(0.329 + 1) = 0.172,    (6.40)

e_D(X_L → Y) = Ĉov(θ, Y | X_S) / {Ĉov(θ, Y) + 1} = 0.260/(0.329 + 1) = 0.196,

e_I(X_L → Y) = e_T(X_L → Y) − e_D(X_L → Y) = −0.024,

e_T(X_S → Y) = e_D(X_S → Y) = Ĉov(θ, Y | X_L) / {Ĉov(θ, Y) + 1} = 0.101/(0.329 + 1) = 0.076.    (6.41)

From (6.39), 24.8% of the variation of Food Y in entropy is explained by Lake X_L and Size X_S. Comparing the total effects of X_L and X_S, the total effect of X_L (6.40) is about two times greater than that of X_S (6.41). From this, it may be concluded that the alligators' food choices vary across the four lakes. As shown in this example, it is useful to summarize the effects of categorical variables through the present approach.

6.9 General Formulation of Path Analysis of Recursive Systems of Variables

Path analysis of general recursive systems of variables,

X1 ≺ X2 ≺ . . . ≺ X K ,

where X i ≺ X j implies that X i precedes X j , i.e., a parent or ancestor, is dis-


cussed. For simplicity of the discussion, the standardized summary effects on X K
are considered below. The conditional density or probability function is expressed
as
 
f(x_K | x_pa(K)) = exp{ (x_K θ − b(θ))/a(ϕ) + c(x_K, ϕ) },    (6.42)

where θ is a function of x_pa(K) = (x_1, x_2, …, x_{K−1}) and is expressed as θ = θ(x_1, x_2, …, x_{K−1}). The KL information KL(X_pa(K), X_K) is calculated by

KL(X_pa(K), X_K) = Cov(X_K, θ(X_pa(K))) / a(ϕ).    (6.43)

The above information can also be interpreted as the signal-to-noise ratio for explaining X_K by X_pa(K). Let X_pa(K)^{\1,2,…,i} be the set of variables that excludes X_1, X_2, …, X_i from the parent variables X_pa(K), i.e., X_pa(K)^{\1,2,…,i} = (X_{i+1}, X_{i+2}, …, X_{K−1}). Then, we have

KL(X_pa(K)^{\1}, X_K | X_1) = Cov(X_K, θ(X_pa(K)) | X_1) / a(ϕ).

The above information can be interpreted as the information that originates from the parent variables X_pa(K)^{\1}. From this, the summary total effect of X_1 is measured by

 
KL(X_pa(K), X_K) − KL(X_pa(K)^{\1}, X_K | X_1) = {Cov(X_K, θ(X_pa(K))) − Cov(X_K, θ(X_pa(K)) | X_1)} / a(ϕ).

Similarly, we have

KL(X_pa(K)^{\1,2}, X_K | X_1, X_2) = Cov(X_K, θ(X_pa(K)) | X_1, X_2) / a(ϕ),    (6.44)


KL(X_pa(K)^{\1}, X_K | X_1) − KL(X_pa(K)^{\1,2}, X_K | X_1, X_2) = {Cov(X_K, θ(X_pa(K)) | X_1) − Cov(X_K, θ(X_pa(K)) | X_1, X_2)} / a(ϕ).    (6.45)
Information (6.44) is that which originates from X_pa(K)^{\1,2}, and (6.45) is that which originates from X_2. Recursively, the information that originates from X_pa(K)^{\1,2,…,i} and that which originates from X_i are given as follows:

KL(X_pa(K)^{\1,2,…,i}, X_K | X_pa(i+1)) = Cov(X_K, θ(X_pa(K)) | X_pa(i+1)) / a(ϕ),

KL(X_pa(K)^{\1,2,…,i−1}, X_K | X_pa(i)) − KL(X_pa(K)^{\1,2,…,i}, X_K | X_pa(i+1)) = {Cov(X_K, θ(X_pa(K)) | X_pa(i)) − Cov(X_K, θ(X_pa(K)) | X_pa(i+1))} / a(ϕ),
i = 1, 2, …, K − 2.

The information of X K that originates from X K −1 is given by

KL(X_pa(K)^{\1,2,…,K−2}, X_K | X_pa(K−1)) = Cov(X_K, θ(X_pa(K)) | X_pa(K−1)) / a(ϕ).

Since the total information of X_K explained by X_pa(K) is (6.43), the standardized summary total effects of X_i, i = 1, 2, …, K − 1, are computed as follows:

e_T(X_i → X_K) = {KL(X_pa(K)^{\1,2,…,i−1}, X_K | X_pa(i)) − KL(X_pa(K)^{\1,2,…,i}, X_K | X_pa(i+1))} / {KL(X_pa(K), X_K) + 1}
= {Cov(X_K, θ(X_pa(K)) | X_pa(i)) − Cov(X_K, θ(X_pa(K)) | X_pa(i+1))} / {Cov(X_K, θ(X_pa(K))) + a(ϕ)},
i = 1, 2, …, K − 2;    (6.46)

e_T(X_{K−1} → X_K) = Cov(X_K, θ(X_pa(K)) | X_pa(K−1)) / {Cov(X_K, θ(X_pa(K))) + a(ϕ)}.    (6.47)

From (6.46) and (6.47), we have

eT (X i → X K ) ≥ 0, i = 1, 2, . . . , K − 1;

According to the above discussion, we have



 
e_T(X_pa(K) → X_K) = e_T(X_1 → X_K) + e_T(X_pa(K)^{\1} → X_K),

e_T(X_pa(K)^{\1,2,…,i−1} → X_K) = e_T(X_i → X_K) + e_T(X_pa(K)^{\1,2,…,i−1,i} → X_K),
i = 2, 3, …, K − 1.

Hence, inductively, it follows that

e_T(X_pa(K) → X_K) = Σ_{i=1}^{K−1} e_T(X_i → X_K).

The standardized direct effects of X_i on X_K are calculated by

e_D(X_i → X_K) = Cov(X_K, θ(X_pa(K)) | X_pa(K)^{\i}) / {Cov(X_K, θ(X_pa(K))) + a(ϕ)} = KL(X_i, X_K | X_pa(K)^{\i}) / {KL(X_pa(K), X_K) + 1},
i = 1, 2, …, K − 1.

From this, we have

e I (X i → X K ) = eT (X i → X K ) − e D (X i → X K ), i = 1, 2, . . . , K − 1. (6.48)

If the link of GLM (6.42) is canonical, i.e.,

θ(X_pa(K)) = Σ_{i=1}^{K−1} β_i X_i,

the above effects can be calculated from covariances of X_i and X_K. The summary total effect of X_pa(K) on X_K is given by

KL(X_pa(K), X_K) = Cov(θ(X_pa(K)), X_K) / a(ϕ) = Σ_{j=1}^{K−1} β_j Cov(X_j, X_K) / a(ϕ).

Since
 
KL(X_pa(K)^{\1,2,…,i−1}, X_K | X_pa(i)) = Σ_{j=i}^{K−1} β_j Cov(X_j, X_K | X_pa(i)) / a(ϕ), i = 1, 2, …, K,

the standardized summary total effects of X i on X K are computed by


e_T(X_i → X_K) = {Σ_{j=i}^{K−1} β_j Cov(X_j, X_K | X_pa(i)) − Σ_{j=i+1}^{K−1} β_j Cov(X_j, X_K | X_pa(i+1))} / {Σ_{j=1}^{K−1} β_j Cov(X_j, X_K) + a(ϕ)},
i = 1, 2, …, K − 1.

Similarly, we have

e_D(X_i → X_K) = β_i Cov(X_i, X_K | X_pa(K)^{\i}) / {Σ_{j=1}^{K−1} β_j Cov(X_j, X_K) + a(ϕ)}, i = 1, 2, …, K − 1.

From the above result, the following theorem follows:

Theorem 6.1 In GLMs with canonical links,



β_i Cov(X_i, X_K | X_pa(K)^{\i}) ≥ 0, i = 1, 2, …, K − 1.

The equality holds if and only if X_i and X_K are conditionally independent given X_pa(K)^{\i}.

Proof Since

KL(X_i, X_K | X_pa(K)^{\i}) = β_i Cov(X_i, X_K | X_pa(K)^{\i}) / a(ϕ) ≥ 0,

the theorem follows.
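For canonical links, the formulas above involve only the coefficients β_j and (conditional) covariances of the parent variables with X_K, so they can be estimated directly from data. The following sketch is one possible implementation (not the author's code) under the assumption that the parent variables are categorical, so that conditional covariances can be estimated by grouping; the function names and the argument `theta` (the fitted linear predictor from the GLM for X_K) are assumptions.

```python
# A minimal sketch of e_T(X_i -> X_K), e_D(X_i -> X_K), and e_I(X_i -> X_K) from
# Sect. 6.9, assuming categorical parent variables and a known dispersion a_phi
# (a_phi = 1 for logit or Poisson models).
import numpy as np
import pandas as pd

def _cov(a, b):
    return float(np.cov(a, b, ddof=0)[0, 1])

def _cond_cov(theta, y, z):
    """Size-weighted within-group covariance: an estimate of Cov(X_K, theta | Z)."""
    if z.shape[1] == 0:
        return _cov(theta, y)
    d = z.copy()
    d["_t"], d["_y"] = np.asarray(theta), np.asarray(y)
    n = len(d)
    return sum(len(g) / n * _cov(g["_t"], g["_y"]) for _, g in d.groupby(list(z.columns)))

def summary_effects(df, parents, response, theta, a_phi=1.0):
    """parents: list of parent column names in the causal order X_1 < ... < X_{K-1}."""
    y = df[response]
    denom = _cov(theta, y) + a_phi                      # Cov(X_K, theta) + a(phi)
    effects = {}
    for i, xi in enumerate(parents):
        before = parents[:i]                            # X_pa(i)
        others = parents[:i] + parents[i + 1:]          # X_pa(K) without X_i
        # total effect: information lost when X_i joins the conditioning set, cf. (6.46)
        num_T = _cond_cov(theta, y, df[before]) - _cond_cov(theta, y, df[before + [xi]])
        # direct effect: information of X_i given all the other parents
        num_D = _cond_cov(theta, y, df[others])
        effects[xi] = {"e_T": num_T / denom, "e_D": num_D / denom,
                       "e_I": (num_T - num_D) / denom}
    return effects
```

Summing the returned e_T values over all parents recovers the standardized summary total effect e_T(X_pa(K) → X_K), in line with the decomposition above.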

Remark 6.3 As discussed in this chapter, the present approach is different from the
usual approach for linear equation models because it is based on the log odds ratio
and entropy by using all the variables concerned. Thus, although indirect effects are
defined by the total effects minus the direct effects, e.g.,

e I (X i → X K ) = eT (X i → X K ) − e D (X i → X K ),

the interpretation can be made in terms of entropy. This point is an advantage of the present method.

6.10 Discussion

Path analysis is a statistical methodology to measure the effects of parent variables on descendant ones, and the usual path analysis for continuous variables using linear regression models can be carried out easily. This chapter has discussed path analysis with GLMs based on information theory. The recursive system is expressed by GLMs, and the effects of parent variables on descendant ones have been measured by using log odds ratios (changes of information). For categorical (polytomous) variables, the total, direct, and indirect effects to be measured become complex as the numbers of categories of the variables become large. From this, summary measures of the effects were proposed in view of entropy [8], and the

effectiveness has been demonstrated in the two examples in this chapter. The present discussion provides a method to decompose the total effects of parent variables into the direct and indirect ones; however, the pathway effects have not been considered. There can be an arbitrary number of intermediate (parent) variables between a parent variable and a descendant variable, so there may be several pathways from the parent variable to the descendant one. For example, in Fig. 6.6, there are two variables, X_PMS and X_EMS, between X_G and X_MS, so it is meaningful to decompose the total effect of X_G on X_MS into those through the four pathways X_G → X_MS, X_G → X_PMS → X_MS, X_G → X_EMS → X_MS, and X_G → X_PMS → X_EMS → X_MS. The pathway effect through the path X_G → X_MS is the direct effect of X_G on X_MS, and the indirect effect of X_G on X_MS is composed of those through the other paths. It is important to study a method of pathway effect analysis in GLMs, and this remains to be solved in a future study.

References

1. Albert, J. M., & Nelson, S. (2011). Generalized causal mediation analysis. Biometrics, 67(3),
1028–1038.
2. Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
3. Bentler, P. M., & Weeks, D. B. (1980). Linear structural equations with latent variables.
Psychometrika, 45, 289–308.
4. Christoferson, A. (1975). Factor analysis of dichotomous variables. Psychometrika, 40, 5–31.
5. Eshima, N., & Tabata, M. (1999). Effect analysis in loglinear model approach to path analysis
of categorical variables. Behaviormetrika, 26, 221–233.
6. Eshima, N., Tabata, M., & Geng, Z. (2001). Path analysis with logistic regression models:
Effect analysis of fully recursive causal systems of categorical variables. Journal of the Japan
Statistical Society, 31, 1–14.
7. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear
models. Computational Statistics & Data Analysis, 54, 1381–1389.
8. Eshima, N., Tabata, M., Borroni, C. G., & Kano, Y. (2015). An entropy-based approach to path
analysis of structural generalized linear models: A basic idea. Entropy, 17, 5117–5132.
9. Fienberg, S. E. (1991). The analysis of cross-classified categorical data (2nd ed.). Cambridge, MA: The MIT Press.
10. Goodman, L. A. (1973). Causal analysis of data from panel studies and other kinds of surveys.
American Journal of Sociology, 78, 1135–1191.
11. Goodman, L. A. (1973). The analysis of multidimensional contingency tables when some
variables are posterior to others: A modified path analysis approach. Biometrika, 60, 179–192.
12. Goodman, L. A. (1974). The analysis of systems of qualitative variables when some of the
variables are unobservable. Part I: A modified latent structure approach. American Journal of
Sociology, 79(5), 1179–1259.
13. Hagenaars, J. A. (1998). Categorical causal modeling : Latent class analysis and directed
loglinear models with latent variables. Sociological Methods & Research, 26, 436–489.
14. Jöreskog, K. G. & Sörbom, D. (1996). LISREL8: User’s reference guide (2nd ed.). Chicago:
Scientific Software International.
15. Muthen, B. (1978). Contribution of factor analysis of dichotomous variables. Psychometrika,
43, 551–560.
16. Muthen, B. (1984). A general structural equation model with dichotomous ordered categorical
and continuous latent variable indicators. Psychometrika, 49, 114–132.

17. McCullagh, P., & Nelder, J. A. (1989). Generalized Linear models (2nd ed.). Chapman and
Hall: London.
18. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear model. Journal of the Royal
Statistical Society A, 135, 370–384.
19. Wright, S. (1934). The method of path coefficients. The Annals of Mathematical Statistics, 5,
161–215.
Chapter 7
Measurement of Explanatory Variable
Contribution in GLMs

7.1 Introduction

In GLMs other than linear regression, usually only estimation and significance testing of the regression parameters are performed; the predictive power of GLMs is rarely measured in practical data analyses. The assessment of explanatory variable con-
tribution is also important in regression analysis. Although regression coefficients are
used to measure the factor contributions of explanatory variables, if the explanatory
variables are correlated, it may not be meaningful to interpret the results based on the
coefficients. If the explanatory variables have a causal order, squared partial corre-
lation coefficients according to the order are used to measure the factor contribution
[14]. The unexplained variance fraction can be expressed by 1 minus the squared
multiple correlation coefficient, and the fraction can be expressed by the product
of the unexplained partial variance fractions according to the causal order. By tak-
ing the logarithm, it is suggested that the partial fractions related to the explanatory
variables are used as factor contributions [20, 21]. The products of standardized
regression coefficients and the correlation coefficients between a response variable
and the explanatory variables are proposed for measuring the factor contributions [5,
22] as well, where R2 is decomposed into the products. It is pointed out that the above
measures are not interpretable in general [19]. There are several attempts for mea-
suring the factor contribution; however, there is no unified and satisfactory approach
for this topic even in ordinary linear regression [11–13]. Considering GLMs in view
of entropy, as mentioned in the previous chapters, the entropy coefficient of deter-
mination (ECD) [8, 9] is an extension of R2 in the ordinary linear regression model,
and ECD can be applied to all GLMs. In GLMs, measuring not only the predictive
power of regression models but also the explanatory variable or factor contribution
are important. This chapter gives an ECD approach for measuring the explanatory
variable contribution in GLMs, i.e. an extension of the R2 approach in the ordinary
linear regression model, and a method for assessing importance of explanatory vari-
ables is given [3, 10]. In Sect. 7.2, by using the ordinary linear regression model


and path diagrams, the issue to be treated in this chapter is explained. Section 7.3
explains three examples, to which an entropy-based method for assessing variable
contribution is applied. In Sect. 7.4, the method is theoretically discussed according
to a path analysis method in Chap. 6, and in Sect. 7.5, the method is applied to the
three examples explained in Sect. 7.3. Section 7.6 considers an application of the
present method for analyzing usual test data. Finally, Sect. 7.7 discusses the assessment of variable importance in GLMs.

7.2 Preliminary Discussion

In order to clarify a question to treat in this chapter, the ordinary linear regression
 T
model is discussed. Let X = X1 , X2 , . . . , Xp , and Y be a p×1 factor or explanatory
 T
variable vector and a response variable, respectively, and let β T = β1 , β2 , . . . , βp
be the regression coefficient vector. Then, a linear regression model is given by

Y = μ + β T X + e, (7.1)

where μ is an intercept parameter and e is an error term distributed according to a normal distribution with mean 0 and variance σ². The present chapter treats the case where there is no causal ordering among the explanatory variables (factors) X_k, e.g. as shown in Fig. 7.1a. In this figure, all the explanatory variables are associated with
each other without any causal ordering. Since ECD = R2 , we have
ECD = Σ_{i=1}^p β_i Cov(Y, X_i) / {Σ_{i=1}^p β_i Cov(Y, X_i) + σ²} = Σ_{i=1}^p Σ_{j=1}^p β_i β_j σ_ij / {Σ_{i=1}^p Σ_{j=1}^p β_i β_j σ_ij + σ²},    (7.2)

 
where σ_ij = Cov(X_i, X_j), i, j = 1, 2, …, p. If the explanatory variables are statistically independent or factors (Fig. 7.1b), then we have

ECD = Σ_{i=1}^p β_i² σ_i² / {Σ_{i=1}^p β_i² σ_i² + σ²},    (7.3)

where σ_i² = σ_ii, i = 1, 2, …, p, and the contributions of the explanatory variables C(X_i → Y) are easily defined as

C(X_i → Y) ≡ e_T(X_i → Y) = β_i² σ_i² / {Σ_{i=1}^p β_i² σ_i² + σ²},    (7.4)
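A tiny numerical illustration of (7.3) and (7.4) is given below; the coefficient values, predictor variances, and error variance are hypothetical.

```python
# A minimal sketch of (7.3) and (7.4) for independent explanatory variables;
# all numeric values below are hypothetical.
import numpy as np

beta = np.array([1.5, -0.8, 0.4])      # regression coefficients beta_i
var_x = np.array([1.0, 2.0, 0.5])      # Var(X_i) for independent X_i
sigma2 = 2.0                           # error variance sigma^2

explained = beta**2 * var_x            # beta_i^2 * sigma_i^2, the per-variable shares
ecd = explained.sum() / (explained.sum() + sigma2)    # (7.3): ECD = R^2
contrib = explained / (explained.sum() + sigma2)      # (7.4): C(X_i -> Y)
print(ecd, contrib, contrib.sum())     # in this independent case the shares sum to ECD
```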

Fig. 7.1 a Path diagram of a regression model. b Path diagram of a regression model with independent explanatory variables. c Path diagram of a regression model in which there are explanatory variables having no direct paths to Y

For a general path diagram (Fig. 7.1a), the entropy-based path analysis is applied
for measuring the explanatory variable contribution. In Fig. 7.1c, the explanatory
variables concerned include ones that have no direct paths, which implies the vari-
ables are conditionally independent of the response variable Y, given the explanatory
variables with the direct paths to Y. In this case, it may also be needed to measure their
contributions to the response variable. It is an important issue how the explanatory
variable contribution and importance are measured in GLMs. The present chapter
gives an approach to the issue based on an entropy-based path analysis in Chap. 6.

7.3 Examples

Example 7.1 This example provides an assessment of the so-called residual income
valuation model, which was proposed by [17] as an alternative to the classical
dividend-discounting valuation model. The model is an ordinary linear regression
model. A sample of 20 banks was observed in 2000 according to their stock price
(PRICE) (Y ), their book value per share (BVS) (X1 ), their earnings for the following
year per share as forecasted by analysts (FY1) (X2 ), and their residual income, given
by their current earnings minus the discounted book value of the preceding year
(INC) (X3 ). All values are in US dollars and PRICE is used as a response variable.
The model used is the following ordinary linear regression model:

Y = μ + β1 x1 + β2 x2 + β3 x3 + e, (7.5)
 
where e is an error term following the normal distribution N(0, σ²). In a GLM formulation, the link is canonical and θ is a linear function of the explanatory variables,
i.e.

θ = β1 X1 + β2 X2 + β3 X3 . (7.6)

Data are shown in Table 7.1, where the names of banks are masked for privacy.
In this example, there is no causal order among the explanatory variables, but they are associated with one another. The path model is then expressed as in Fig. 7.2, and measuring the contributions of the explanatory variables is meaningful. The direct effect of an explanatory variable should be defined as its effect given the other explanatory variables, and the indirect effect is that arising through its association with the other variables.

Example 7.2 Table 7.2 shows two-way layout experimental data in a study of length
of time spent on individual home visits by public health nurses [4, pp. 348–353].
In the example, analysis of the effects of factors, i.e. the type of patient and the
age of a nurse, on the nurses’ behavior will be significant. Let Y be length of home
visit, and let factors X1 and X2 denote the type of a patient and the age of a nurse,
respectively. The results of two-way analysis of variance are shown in Table 7.3. The
main and interactive effects of factors are significant. In this case, levels of factor
vector X = (X1 , X2 ) are levels (i, j), i = 1, 2, 3, 4; j = 1, 2, 3, 4. Although factors
X1 and X2 are independent, the model has interaction terms between them (Fig. 7.3).
Let Yijk be the kth response given factor level (X1 , X2 ) = (i, j); let αi , i = 1, 2, 3, 4
be the main effect parameters of factor X1 ; let βj , j = 1, 2, 3, 4 be the main effects
of X2 ; and let (αβ)ij , i = 1, 2, 3, 4; j = 1, 2, 3, 4 be the interactive effects of X1 and
X2 , respectively. Then, the usual expression of the model is

Yijk = μ + αi + βj + (αβ)ij + εijk , i = 1, 2, 3, 4; j = 1, 2, 3, 4; k = 1, 2, 3, 4, 5.


(7.7)

Table 7.1 Data of PRICE, BVS, FY1, and INC from twenty banks
Bank PRICE (Y) BVS (X1) FY1 (X2) INC (X3)
1 11.00 4.6850 0.87 0.5962
2 1.20 0.5680 0.11 0.0867
3 7.65 3.0550 0.62 0.7225
4 10.19 3.9210 0.42 0.2447
5 12.40 1.9860 0.36 0.2112
6 6.10 3.1200 0.51 0.3116
7 12.96 3.9390 0.60 0.3477
8 6.04 2.0610 0.39 0.2806
9 36.10 23.4850 1.80 0.8252
10 10.21 9.3070 0.94 0.6149
11 84.70 44.7590 3.59 1.2261
12 16.70 13.1700 1.16 0.3900
13 9.54 10.2000 1.20 0.3235
14 50.00 20.0290 3.18 1.9821
15 9.39 5.9820 0.30 0.0761
16 4.31 3.2120 0.38 0.3064
17 10.08 6.8790 0.77 0.5846
18 49.75 28.5270 4.16 1.2641
19 12.75 5.0950 0.43 0.3439
20 4.24 1.7080 0.22 0.2806
Source Ohlson [17]

Fig. 7.2 Path diagram of the residual income valuation model

where ε_ijk are the measurement errors, which are distributed according to the normal distribution N(0, σ²). In order to identify the model, the effect parameters are constrained as follows:

Table 7.2 Length of home visit in minutes by public health nurses by nurse’s age group and type
of patient
Factor X2 (age groups of nurses) (years old)
Factor X1 1 2 3 4
(type of patients) (20–29) (30–39) (40–49) (50 and over)
1 (Cardiac) 20, 25, 22 25, 30, 29 24, 28, 24 28, 31, 26
27, 21 28, 30 5, 30 29, 32
2 (Cancer) 30, 45, 30 30, 39, 31 39, 42, 36 40, 45, 50
35, 36 30, 30 42, 40 45, 60
3 (C.V.T.) 31, 30, 40 32, 35, 30 41, 45, 40 42, 50, 40
35, 30 40, 30 40, 35 55, 45
4 (Tuberculosis) 20, 21, 20 23, 25, 28 24, 25, 30 29, 30, 28
20, 19 30, 31 26, 23 27, 30
Source [4]

Fig. 7.3 Path diagram of a two-way layout experiment model. The broken lines indicate the interactive terms in the model


Σ_{i=1}^4 α_i = Σ_{j=1}^4 β_j = Σ_{i=1}^4 (αβ)_ij = Σ_{j=1}^4 (αβ)_ij = 0.    (7.8)

In this example, factors are categorical and the response is continuous, which is
different from the path system of continuous variables in Sect. 7.2. It is meaningful
to measure the effects of factors, i.e. the direct and indirect effects, by using a path
analysis approach. A GLM expression of the two-way layout experimental design
model is given below (Table 7.3).

Table 7.3 Results of two-way analysis of variance


Factor SSa dfb MSSc F P
X1 3226.5 3 1075.5 52.64 0.000
X2 1185.1 3 395.0 19.33 0.000
(X1 , X2 ) 704.5 9 78.3 3.83 0.000
Error 1307.6 64 20.4 –
Total 6423.6 79 – –
a Sum of squares (SS); b Degrees of freedom (df); c The mean of SS

Example 7.3 Table 7.4 shows the data from 2276 high school seniors about questions
whether they have ever used alcohol, cigarettes, or marijuana [2]. For explanatory
variables alcohol use X1 , cigarette use X2 , and response variable marijuana use Y, we
analyzed the data with a logit model. In this case, it is appropriate not to assume any
causal order in the variables, because someone used alcohol before cigarettes, and
others cigarettes before alcohol. The path diagram is illustrated in Fig. 7.4. In this logit
model, we used θ = μ + β1 X1 + β2 X2 , and the estimates are μ = −5.302 (ASE =
0.458), β1 = 2.980(0.448), and β2 = 2.847(0.162). The usual interpretation of the
results is given by using odds with respect to marijuana use. The partial odds of marijuana use Y given cigarette use X2 are exp(2.980) = 19.688 times higher at alcohol use X1 (yes) than at alcohol use (no), and the partial odds of marijuana use Y given alcohol use X1 are exp(2.847) = 17.236 times higher at cigarette use X2 (yes) than at X2 (no). According to the usual interpretation, it might be thought
the effects of alcohol use and cigarette use on marijuana use would be almost the
same, given that alcohol use and cigarette use were completely controlled; however,
alcohol use and cigarette use are associated and cannot be completely controlled.
Hence, in addition to the usual method based on odds ratios, it is sensible to use a
new method for assessing the explanatory variable contribution. The direct effect of
Xi should be defined given the other explanatory variable, and the indirect effect is
that through the association between them.

Table 7.4 Marijuana use data


Alcohol use Cigarette use Marijuana Number of
(X1 ) (X2 ) use (Y ) subjects
Yes Yes Yes 911
No 538
No Yes 44
No 456
No Yes Yes 3
No 43
No Yes 2
No 279
Source [2]

Fig. 7.4 Path diagram of a logit model with explanatory variables X1 and X2
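The logit model of Example 7.3 can be fitted directly from the counts in Table 7.4. The following sketch (not the author's code) expands the grouped counts into one record per student and fits the model with statsmodels; it should approximately reproduce the estimates reported above, up to rounding.

```python
# A minimal sketch that fits the logit model of Example 7.3 from Table 7.4.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# (alcohol, cigarette, marijuana) -> count, taken from Table 7.4
counts = {(1, 1, 1): 911, (1, 1, 0): 538, (1, 0, 1): 44, (1, 0, 0): 456,
          (0, 1, 1): 3,   (0, 1, 0): 43,  (0, 0, 1): 2,  (0, 0, 0): 279}

rows = [(a, c, m) for (a, c, m), k in counts.items() for _ in range(k)]
df = pd.DataFrame(rows, columns=["alcohol", "cigarette", "marijuana"])

X = sm.add_constant(df[["alcohol", "cigarette"]])
res = sm.Logit(df["marijuana"], X).fit(disp=0)
print(res.params)                     # approx. (-5.30, 2.98, 2.85), cf. the text
print(np.exp(res.params.iloc[1:]))    # conditional odds ratios, approx. 19.7 and 17.2
```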

7.4 Measuring Explanatory Variable Contributions


 T
Let X = X1 , X2 , . . . , Xp and Y be a p × 1 factor or explanatory variable vector and
a response variable, respectively. In GLMs, the uncertainty in response variable Y is
expressed by the random, systematic, and link components [15, 16]. Let f (y|x) be
the conditional probability or density function of Y given X = x. The function f (y|x)
is assumed to be a member of the following exponential family of distributions:
 
f(y|x) = exp{ (yθ − b(θ))/a(ϕ) + c(y, ϕ) },    (7.9)

where θ and ϕ are parameters, and a(ϕ) > 0, b(θ), and c(y, ϕ) are specific functions. In a GLM, θ is a function of the linear predictor η = β^T x = Σ_{i=1}^p β_i x_i. Let f_i(x) be the marginal
density or probability functions of explanatory variables Xi , and let Cov(Y , θ |Xi ) be
the conditional covariances of Y and θ , given Xi . From a viewpoint of entropy, the
covariance Cov(Y , θ ) can be regarded as the explained variation of Y in entropy by
all the explanatory variables Xk and Cov(Y , θ |Xi ) is that excluding the effect of Xi .
From this, we make the following definition.

Definition 7.1 In a GLM with explanatory variables X1 , X2 , . . . , Xp (7.9), the


contribution ratio of Xi for predicting response Y is defined by

CR(X_i → Y) = {Cov(Y, θ) − Cov(Y, θ | X_i)} / Cov(Y, θ), i = 1, 2, …, p.    (7.10)
p
Remark 7.1 For canonical link θ = i=1 βi Xi , the contribution ratio of Xi is
     
βi Cov(Xi , Y ) − j =i βj Cov Xj , Y − Cov Xj , Y |Xi
CR(Xi → Y ) = p   , i = 1, 2, . . . , p. (7.11)
j=1 βj Cov Xj , Y

If explanatory variables Xi are statistically independent, it follows that

βi Cov(Xi , Y )
CR(Xi → Y ) = p   , i = 1, 2, . . . , p.
j=1 βj Cov Xj , Y

Especially, in the ordinary linear regression model, we have

CR(X_i → Y) = β_i² Var(X_i) / Σ_{j=1}^p β_j² Var(X_j), i = 1, 2, …, p.

In this case, the denominator is the explained variance of Y by all the explanatory
variables, and the numerator that by Xi .

Fig. 7.5 Path diagram of explanatory variables X_i and X^{\i} and response variable Y

Remark 7.2 In the entropy-based path analysis in Chap. 6, the calculation of contribution ratios of explanatory variables is based on the total effect of variable X_i under the assumption that variable X_i is the parent of the variables X^{\i} = (X_1, X_2, …, X_{i−1}, X_{i+1}, …, X_p)^T, as in Fig. 7.5. Then, the total effect of X_i on Y is defined by

e_T(X_i → Y) = {Cov(Y, θ) − Cov(Y, θ | X_i)} / {Cov(Y, θ) + a(ϕ)}, i = 1, 2, …, p,

and we have

CR(X_i → Y) = e_T(X_i → Y) / e_T(X → Y), i = 1, 2, …, p.

Hence, the contribution ratios of the explanatory variables Xi in the path diagrams
for GLMs such as Fig. 7.1a, b, and c are the same as that in the case described in
path diagram Fig. 7.5.

Lemma 7.1 Let X = (X1 , X2 )T be a factor or explanatory variable vector; let Y


be a response variable; let f (x1 , x2 , y) be the joint probability or density function
of X and Y; let f1 (y|x1 ) be the conditional probability or density function of Y given
X1 = x1 ; let g(x1 , x2 ) be the joint probability or density function of X; and let f (y)
be the marginal probability or density function of Y. Then,
∭ f_1(y|x_1) g(x_1, x_2) log f(x_1, x_2, y) dx_1 dx_2 dy ≥ ∭ f(y) g(x_1, x_2) log f(x_1, x_2, y) dx_1 dx_2 dy,    (7.12)

where for discrete variables the related integrals are replaced by the summations.

Proof For simplification, the Lemma is proven in the case of the continuous
distribution. Let
h = ∭ f_1(y|x_1) g(x_1, x_2) log f(x_1, x_2, y) dx_1 dx_2 dy − ∭ f(y) g(x_1, x_2) log f(x_1, x_2, y) dx_1 dx_2 dy.    (7.13)

Under the following constraint:

∭ f(x_1, x_2, y) dx_1 dx_2 dy = 1,

the quantity h is minimized with respect to f(x_1, x_2, y). For a Lagrange multiplier λ, let

L = h − λ ∭ f(x_1, x_2, y) dx_1 dx_2 dy
  = ∭ f_1(y|x_1) g(x_1, x_2) log f(x_1, x_2, y) dx_1 dx_2 dy − ∭ f(y) g(x_1, x_2) log f(x_1, x_2, y) dx_1 dx_2 dy − λ ∭ f(x_1, x_2, y) dx_1 dx_2 dy.

Differentiating L with respect to f (x1 , x2 , y), we have

∂L/∂f(x_1, x_2, y) = f_1(y|x_1) g(x_1, x_2)/f(x_1, x_2, y) − f(y) g(x_1, x_2)/f(x_1, x_2, y) − λ ∂/∂f(x_1, x_2, y) ∭ f(x_1, x_2, y) dx_1 dx_2 dy
= f_1(y|x_1) g(x_1, x_2)/f(x_1, x_2, y) − f(y) g(x_1, x_2)/f(x_1, x_2, y) − λ = 0.    (7.14)

From (7.14), it follows that

f1 (y|x1 )g(x1 , x2 ) − f (y)g(x1 , x2 ) = λf (x1 , x2 , y), (7.15)

Since

∭ f_1(y|x_1) g(x_1, x_2) dx_1 dx_2 dy = ∭ f(y) g(x_1, x_2) dx_1 dx_2 dy = ∭ f(x_1, x_2, y) dx_1 dx_2 dy = 1,

integrating both sides of Eq. (7.15) with respect to x_1, x_2, and y, we have λ = 0. From this, we obtain

f_1(y|x_1) = f(y),

which gives h = 0 as the minimum value of h in (7.13). This completes the proof of the Lemma. □
 
Remark 7.3 Let X = (X_1, X_2)^T be a factor vector with levels (x_1i, x_2j), i = 1, 2, …, I; j = 1, 2, …, J; let n_ij be the sample sizes at (x_1i, x_2j); and let n_{i+} = Σ_{j=1}^J n_ij, n_{+j} = Σ_{i=1}^I n_ij, n = Σ_{i=1}^I Σ_{j=1}^J n_ij. Then, Lemma 7.1 holds by setting as follows:

g(x_1i, x_2j) = n_ij/n,  g_1(x_1i) = n_{i+}/n,  f_1(y|x_1i) = (1/n_{i+}) Σ_{j=1}^J n_ij f(y|x_1i, x_2j),

f(y) = Σ_{i=1}^I Σ_{j=1}^J (n_ij/n) f(y|x_1i, x_2j).

Remark 7.4 Let X = (X_1, X_2, …, X_p)^T; let X^(1) = (X_1, X_2, …, X_q)^T and let X^(2) = (X_{q+1}, X_{q+2}, …, X_p)^T. Then, by replacing X_1 and X_2 in Lemma 7.1 with X^(1) and X^(2), respectively, the inequality (7.12) holds true.

From Lemma 7.1, we have the following theorem:

Theorem 7.1 In the GLM (7.9) with explanatory variable vector X = (X_1, X_2, …, X_p)^T,

Cov(θ, Y)/a(ϕ) − Cov(θ, Y | X_i)/a(ϕ) ≥ 0, i = 1, 2, …, p.    (7.16)

Proof For simplicity of the discussion, the theorem is proven for continuous variables
in the case of p = 2 and i = 1. Since
Cov(θ, Y)/a(ϕ) = ∭ (f(x_1, x_2, y) − f(y) g(x_1, x_2)) log f(x_1, x_2, y) dx_1 dx_2 dy

and

Cov(θ, Y | X_i)/a(ϕ) = ∭ (f(x_1, x_2, y) − f_1(y|x_1) g(x_1, x_2)) log f(x_1, x_2, y) dx_1 dx_2 dy,

we have

Cov(θ, Y)/a(ϕ) − Cov(θ, Y | X_i)/a(ϕ) = ∭ (f_1(y|x_1) g(x_1, x_2) − f(y) g(x_1, x_2)) log f(x_1, x_2, y) dx_1 dx_2 dy.

From Lemma 7.1, the theorem follows. □

From the above theorem, we have the following theorem.


Theorem 7.2 In the GLM (7.9) with X = (X_1, X_2, …, X_p)^T,

0 ≤ CR(Xi → Y ) ≤ 1, i = 1, 2, . . . , p. (7.17)

Proof Since Cov(θ, Y |Xi ) ≥ 0, from (7.16) Theorem 7.1 shows that

0 ≤ Cov(θ, Y ) − Cov(θ, Y |Xi ) ≤ Cov(θ, Y ).

Thus, from (7.10) the theorem follows. 


Remark 7.4 Contribution ratio (7.10) can be interpreted as the explained variation
of response Y by factor or explanatory variable Xi . In a GLM (7.9) with a canonical
link, if explanatory variables Xi are independent, then

Cov(θ, Y) − Cov(θ, Y | X_i) = β_i Cov(X_i, Y) ≥ 0,

and we have

CR(X_i → Y) = β_i Cov(X_i, Y) / Cov(θ, Y) = β_i Cov(X_i, Y) / Σ_{j=1}^p β_j Cov(X_j, Y).    (7.18)

Remark 7.5 Let n be the sample size; let

χ²(X_i) = n {Ĉov(θ, Y) − Ĉov(θ, Y | X_i)} / a(ϕ̂);

and let τ_i be the number of regression parameters related to X_i, where Ĉov(θ, Y), Ĉov(θ, Y | X_i), and a(ϕ̂) are the ML estimators of Cov(θ, Y), Cov(θ, Y | X_i), and a(ϕ), respectively. Then, for a sufficiently large sample size n, χ²(X_i) asymptotically follows a non-central chi-square distribution with τ_i degrees of freedom, and the asymptotic non-centrality parameter is

λ = n {Cov(θ, Y) − Cov(θ, Y | X_i)} / a(ϕ).    (7.19)

In GLMs, after a model selection procedure, there may be cases in which some
explanatory variables have no direct paths to the response variables, that is, the related
regression coefficients are zeros. One of the examples is depicted in Fig. 7.1c. For
such cases, we have the following theorem:
Theorem 7.3 In GLM (7.9), let the explanatory variable vector X be decomposed
into two sub-vectors X (1) with the direct paths to response variable Y and X (2) without
the direct paths to Y. Then,

KL(Y, X) = KL(Y, X^(1)) = Cov(θ, Y)/a(ϕ).    (7.20)

Proof Let f(y|x) be the conditional probability or density function of Y given X = x, and let f_1(y|x^(1)) be that given X^(1) = x^(1). In the GLM, the linear predictor is a linear combination of X^(1), and it means that

f(y|x) = f_1(y|x^(1)).

From this, the theorem follows. □

7.5 Numerical Illustrations

Example 1 (continued) According to a model selection procedure, the estimated


regression coefficients, their standard errors, and the factor contributions (7.10) for
explanatory variables are reported in Table 7.5. The estimated linear predictor θ is
θ = 1.522X1 + 8.508X3 , where R2 (= ECD) = 0.932. In this case, as illustrated in
Fig. 7.6, there is no direct path from X2 to Y. According to (7.10), the factor contributions in the model are calculated in Table 7.5. This table shows that BVS (X1)
has the highest individual contribution and that the contribution of this variable is
1.3 times greater than INC (X3 ). Although FY1 (X2 ) does not have the direct path
to PRICE (Y ), when considering the association among the explanatory variables,
the contribution is high, i.e. 0.889. These conclusions are consistent with other more
specific empirical studies about the Ohlson’s model [6].

Table 7.5 Estimated regression coefficients and contribution ratios


Variable Regression coefficient SE CR(Xk → Y) SE
Intercept −1.175 1.692 / /
BVS (X1) 1.522 0.149 0.981 0.121
FY1 (X2) / / 0.889 0.115
INC (X3 ) 0.858 3.463 0.669 0.100

Fig. 7.6 Path diagram for the final model for explaining PRICE
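For continuous explanatory variables, the conditional covariances in (7.10) must be estimated. One possible implementation (not the author's code, and not guaranteed to reproduce Table 7.5 exactly) is sketched below, under the assumption that Cov(Y, θ | X_i) is approximated by the Gaussian partial covariance; the data-frame name `banks` and its columns are taken from Table 7.1.

```python
# A minimal sketch of CR(X_i -> Y) of (7.10) for an ordinary linear regression,
# assuming the conditional covariance is evaluated as the Gaussian partial
# covariance (an approximation under joint normality).
import numpy as np
import pandas as pd

def contribution_ratios(df, response, predictors, model_vars=None):
    """theta is fitted on `model_vars` (default: all predictors), while CR is
    evaluated for every variable in `predictors` (cf. Theorem 7.3)."""
    model_vars = model_vars or predictors
    y = df[response].to_numpy(dtype=float)
    Xm = df[model_vars].to_numpy(dtype=float)
    Xm1 = np.column_stack([np.ones(len(y)), Xm])
    beta = np.linalg.lstsq(Xm1, y, rcond=None)[0][1:]     # OLS slopes
    theta = Xm @ beta                                      # fitted linear predictor

    def pcov(a, b, z):
        # partial covariance of a and b given z: covariance of the residuals
        # from linear regressions of a and b on (1, z)
        Z1 = np.column_stack([np.ones(len(a)), z])
        ra = a - Z1 @ np.linalg.lstsq(Z1, a, rcond=None)[0]
        rb = b - Z1 @ np.linalg.lstsq(Z1, b, rcond=None)[0]
        return float(np.cov(ra, rb, ddof=0)[0, 1])

    cov_total = float(np.cov(theta, y, ddof=0)[0, 1])      # Cov(Y, theta)
    return {p: (cov_total - pcov(theta, y, df[[p]].to_numpy(dtype=float))) / cov_total
            for p in predictors}

# Hypothetical usage for the final model (theta fitted on BVS and INC only):
# cr = contribution_ratios(banks, "PRICE", ["BVS", "FY1", "INC"],
#                          model_vars=["BVS", "INC"])
```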

Example 2 The present discussion is applied to the ordinary two-way layout exper-
imental design model. Before analyzing the data in Table 7.2, a general two-way
layout experiment model is formulated in a GLM framework. Let X1 and X2 be fac-
tors with levels {1, 2, . . . , I } and {1, 2, . . . , J }, respectively. Then, the linear predictor
is a function of (X1 , X2 ) = (i, j), i.e.

θ = μ + αi + βj + (αβ)ij . (7.21)

where


Σ_{i=1}^I α_i = Σ_{j=1}^J β_j = Σ_{i=1}^I (αβ)_ij = Σ_{j=1}^J (αβ)_ij = 0.

Let

X_ki = {1 if X_k = i; 0 if X_k ≠ i}, k = 1, 2.

In experimental designs, factor levels are randomly assigned to subjects, so the


above dummy variables can be regarded as independent random variables such that

1   1
P(X1i = 1) = , i = 1, 2, . . . , I ; P X2j = 1 = , j = 1, 2, . . . , J ..
I J
Dummy vectors

X 1 = (X11 , X12 , . . . , X1I )T and X 2 = (X21 , X22 , . . . , X2J )T

are identified with factors X1 and X2 , respectively. From this, the systematic
component of model (7.21) can be written as follows:

θ = μ + αTX 1 + β TX 2 + γ TX 1 ⊗ X 2 , (7.22)

where

α = (α1 , α2 , . . . , αI )T , β = (β1 , β2 , . . . , βJ )T ,

 T
γ = (αβ)11 , (αβ)12 , . . . , (αβ)1J , . . . , (αβ)IJ ,

and

X 1 ⊗ X 2 = (X11 X21 , X11 X22 , . . . , X11 X2J , . . . , X1I X2J )T .



Let Cov(X_1, Y), Cov(X_2, Y), and Cov(X_1 ⊗ X_2, Y) be the covariance vectors. Then, the total effect of X_1 and X_2 is

KL(Y, X) = Cov(θ, Y)/σ² = α^T Cov(X_1, Y)/σ² + β^T Cov(X_2, Y)/σ² + γ^T Cov(X_1 ⊗ X_2, Y)/σ²
= (1/I)Σ_{i=1}^I α_i²/σ² + (1/J)Σ_{j=1}^J β_j²/σ² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²/σ².    (7.23)

The above three terms are referred to as the main effect of X1 , that of X2 and the
interactive effect, respectively. Then, ECD is calculated as follows:

ECD((X_1, X_2), Y) = Cov(θ, Y) / {Cov(θ, Y) + σ²}
= {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²} / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij² + σ²} = R².    (7.24)

In this case, since

Cov(θ, Y | X_1) = (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²,    (7.25)

Cov(θ, Y | X_2) = (1/I)Σ_{i=1}^I α_i² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²,    (7.26)

we have

e_T(X_1 → Y) = {Cov(θ, Y) − Cov(θ, Y | X_1)} / {Cov(θ, Y) + σ²}
= (1/I)Σ_{i=1}^I α_i² / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij² + σ²},    (7.27)

e_T(X_2 → Y) = (1/J)Σ_{j=1}^J β_j² / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij² + σ²}.    (7.28)

From the above results, the contributions of X1 and X2 are given by

CR(X_1 → Y) = {Cov(θ, Y) − Cov(θ, Y | X_1)} / Cov(θ, Y)
= (1/I)Σ_{i=1}^I α_i² / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²},    (7.29)

CR(X_2 → Y) = (1/J)Σ_{j=1}^J β_j² / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²}.    (7.30)

The above contribution ratios correspond to those of main effects of X1 and X2 ,


respectively. The total effect of factors X1 and X2 is given by (7.24), i.e.

eT ((X1 , X2 ) → Y ) = ECD((X1 , X2 ), Y ).

From this, the following effect is referred to as the interaction effect:

e_T((X_1, X_2) → Y) − e_T(X_1 → Y) − e_T(X_2 → Y)
= (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij² / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij² + σ²}.

From the present data, we have


  

Ĉov(θ, Y) = 100.6, Ĉov(θ, Y | X_1) = 30.3, Ĉov(θ, Y | X_2) = 61.4.

From this, it follows that

CR(X_1 → Y) = {Ĉov(θ, Y) − Ĉov(θ, Y | X_1)} / Ĉov(θ, Y) = (80.2 − 30.3)/80.2 = 0.622,
CR(X_2 → Y) = 0.234.
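The quantities above can also be computed directly from the estimated main and interaction effects. The sketch below (not the author's code) estimates the effect parameters by the usual cell and marginal means from a balanced I × J layout with replications and then evaluates (7.24) and (7.27)–(7.30); the array name is an assumption, and since σ² is replaced by the ML within-cell estimate, the numbers need not coincide exactly with those reported in the text.

```python
# A minimal sketch for a balanced two-way layout stored in an array `y` of shape
# (I, J, R) with R replications per cell (hypothetical name); effect parameters are
# estimated by the usual cell/marginal means under the constraints (7.8).
import numpy as np

def two_way_contributions(y: np.ndarray):
    I, J, R = y.shape
    grand = y.mean()
    a = y.mean(axis=(1, 2)) - grand                                   # alpha_i
    b = y.mean(axis=(0, 2)) - grand                                   # beta_j
    ab = y.mean(axis=2) - grand - a[:, None] - b[None, :]             # (alpha beta)_ij
    sigma2 = ((y - y.mean(axis=2, keepdims=True)) ** 2).mean()        # within-cell variance
    A, B, AB = (a**2).mean(), (b**2).mean(), (ab**2).mean()
    cov_theta_y = A + B + AB                                          # Cov(theta, Y), cf. (7.23)
    return {"ECD": cov_theta_y / (cov_theta_y + sigma2),              # (7.24)
            "e_T(X1)": A / (cov_theta_y + sigma2),                    # (7.27)
            "e_T(X2)": B / (cov_theta_y + sigma2),                    # (7.28)
            "interaction": AB / (cov_theta_y + sigma2),
            "CR(X1)": A / cov_theta_y,                                # (7.29)
            "CR(X2)": B / cov_theta_y}                                # (7.30)
```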

Example 3 (continued) The data in Table 7.4 are analyzed and the explanatory variable contributions are calculated. From the ML estimates of the parameters, we have

Ĉov(θ, Y) = 0.539, Ĉov(θ, Y | X_1) = 0.323, Ĉov(θ, Y | X_2) = 0.047,    (7.31)

and hence

ECD = 0.539/(0.539 + 1) = 0.350 (= e_T((X_1, X_2) → Y)),
e_T(X_1 → Y) = (0.539 − 0.323)/(0.539 + 1) = 0.140,
e_T(X_2 → Y) = (0.539 − 0.047)/(0.539 + 1) = 0.320,
CR(X_1 → Y) = e_T(X_1 → Y)/e_T((X_1, X_2) → Y) = 0.140/0.350 = 0.401,
CR(X_2 → Y) = e_T(X_2 → Y)/e_T((X_1, X_2) → Y) = 0.320/0.350 = 0.913.

From ECD, 35% of the variation of Y in entropy is explained by the two explanatory variables, and from the above results, the contribution of alcohol use X_1 to marijuana use Y is 40.1%, whereas that of cigarette use X_2 to marijuana use Y is 91.3%. The effect of cigarette use X_2 on marijuana use Y is about two times greater than that of alcohol use X_1.

7.6 Application to Test Analysis

Let X_i, i = 1, 2, …, p, be the scores of items (subjects) for a test battery; for example, the subjects are English, Mathematics, Physics, Chemistry, Social Science, and so on. Usually, the test score

Y = Σ_{i=1}^p X_i    (7.32)

is used for evaluating students’ learning levels. In order to assess contributions of


items to the test score, the present method is applied. Let us assume the following
linear regression model:


Y = Σ_{i=1}^p X_i + e,

 
where e is an error term according to N(0, σ²) and independent of the item scores X_i. In a GLM framework, we set

θ = Σ_{i=1}^p X_i,

and from (7.10) we have

CR(X_i → Y) = {Cov(Y, θ) − Cov(Y, θ | X_i)} / Cov(Y, θ) = {Var(θ) − Var(θ | X_i)} / Var(θ)
→ {Var(Y) − Var(Y | X_i)} / Var(Y)  (as σ² → 0), i = 1, 2, …, p.

Hence, in (7.32), the contributions of X_i to Y are calculated by

CR(X_i → Y) = {Var(Y) − Var(Y | X_i)} / Var(Y) = Corr(Y, X_i)², i = 1, 2, …, p.    (7.33)

Table 7.6 Test data for five test items


Subject Japanese X1 English X2 Social X3 Mathematics X4 Science X5 Total Y
1 64 65 83 69 70 351
2 54 56 53 40 32 235
3 80 68 75 74 84 381
4 71 65 40 41 68 285
5 63 61 60 56 80 320
6 47 62 33 57 87 286
7 42 53 50 38 23 206
8 54 17 46 58 58 233
9 57 48 59 26 17 207
10 54 72 58 55 30 269
11 67 82 52 50 44 295
12 71 82 54 67 28 302
13 53 67 74 75 53 322
14 90 96 63 87 100 436
15 71 69 74 76 42 332
16 61 100 92 53 58 364
17 61 69 48 63 71 312
18 87 84 64 65 53 353
19 77 75 78 37 44 311
20 57 27 41 54 30 209
Source [1, 18]

Table 7.6 shows test data with five items. Under normality of the data, the above discussion is applied to analyze the test data. From this table, by using (7.33), the contribution ratios CR(X_i → Y), i = 1, 2, 3, 4, 5, are calculated as follows:

CR(X1 → Y ) = 0.542, CR(X2 → Y ) = 0.568,


CR(X3 → Y ) = 0.348, CR(X4 → Y ) = 0.542, CR(X5 → Y ) = 0.596.

Except for Social Science X3 , the contributions of the other subjects are similar,
and the correlations between Y and Xi are strong. The contributions are illustrated
in Fig. 7.7.
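Formula (7.33) reduces this computation to squared correlations, e.g. as in the sketch below (not the author's code); the DataFrame name `scores` is an assumption standing for Table 7.6.

```python
# A one-step computation of (7.33); `scores` is a hypothetical DataFrame holding the
# Table 7.6 columns "Japanese", "English", "Social", "Mathematics", "Science".
import pandas as pd

def item_contributions(scores: pd.DataFrame) -> pd.Series:
    total = scores.sum(axis=1)             # the test score Y of (7.32)
    return scores.corrwith(total) ** 2     # CR(X_i -> Y) = Corr(Y, X_i)^2

# Example usage:
# print(item_contributions(scores[["Japanese", "English", "Social",
#                                  "Mathematics", "Science"]]))
```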

7.7 Variable Importance Assessment

In regression analysis, usually no causal order among the explanatory variables is assumed, so in assessing the explanatory variable contribution in GLMs, as explained above, the calculation of the contribution of the explanatory variables is made as if

Fig. 7.7 Radar chart of the contributions of five subjects to the total score

explanatory variable Xi were the parent of the other variables, as shown in Fig. 7.5.
If the explanatory variables are causally ordered, e.g.
 
X1 → X2 → . . . → Xp → Y = Xp+1 , (7.34)

applying the path analysis in Chap. 6, from (6.46) and (6.47) the contributions of
explanatory variables Xi are computed as follows:



CR(X_i → Y) = e_T(X_i → Y)/e_T(X → Y) = {KL(X_pa(p+1)^{\1,2,…,i−1}, Y | X_pa(i)) − KL(X_pa(p+1)^{\1,2,…,i}, Y | X_pa(i+1))} / KL(X_pa(p+1), Y),    (7.35)

where X = (X_1, X_2, …, X_p)^T. Below, the contributions CR(X_i → Y) are written simply as CR(X_i) as long as the response variable Y is clear from the context. Then, it follows that

Σ_{i=1}^p CR(X_i) = 1.    (7.36)

With respect to measures for variable importance assessment in ordinary linear


regression models, the following criteria for R2 -decomposition are listed [12]:
(a) Proper decomposition: the model variance is to be decomposed into shares, that
is, the sum of all shares has to be the model variance.
(b) Non-negativity: all shares have to be non-negative.

(c) Exclusion: the share allocated to a regressor Xi with βi = 0 should be 0.


(d) Inclusion: a regressor Xi with βi = 0 should receive a non-zero share.
In the GLM framework, the model variance in (a) and R2 are substituted for
KL(Y , X) and ECD(Y , X), respectively, and from Theorem 7.3, we have
 
ECD(Y , X) = ECD Y , X (1) ,

where X (1) is the subset of all the explanatory variables with non-zero regression coef-
ficients βi = 0 in X. Hence, the explanatory power of X and X (1) are the same. From
this, in the present context, based on the above criteria, the variable importance is dis-
cussed. In which follows, it is assumed all the explanatory variables X1 , X2 , . . . , Xp
have non-zero regression coefficients. In GLMs with explanatory variables that
have no causal ordering, the present entropy-based path analysis is applied to rel-
ative importance assessment
 of explanatory variables. Let U = X1 , X2 , . . . , Xp ;
r = r1 , r2 , . . . , rp a permutation of explanatory variable indices {1, 2, . . . , p}; Si (r)
be the parent variables that appear before Xi in permutation r; and let Ti (r) = U \Si (r).
 
Definition 7.2 For a causal ordering r = (r_1, r_2, …, r_p) of the explanatory variables U = {X_1, X_2, …, X_p}, the contribution ratio of X_i is defined by

CR_r(X_i) = {KL(T_i(r), Y | S_i(r)) − KL(T_i(r)\{X_i}, Y | S_i(r) ∪ {X_i})} / KL(X, Y).    (7.37)

Definition 7.3 The degree of relative importance of X_i is defined as

RI(X_i) = (1/p!) Σ_r CR_r(X_i),    (7.38)

where the summation is the sum of the CR_r(X_i)'s over all the permutations r = (r_1, r_2, …, r_p).

From (7.37), we have

CR_r(X_i) > 0, i = 1, 2, …, p;

Σ_{i=1}^p CR_r(X_i) = 1 for any permutation r.

Hence, from (7.38) it follows that

RI(X_i) > 0, i = 1, 2, …, p;    (7.39)

Σ_{i=1}^p RI(X_i) = 1.    (7.40)

Remark 7.6 In GLMs (7.9), CR_r(X_i) can be expressed in terms of covariances of θ and the explanatory variables, i.e.

CR_r(X_i) = {Cov(θ, Y | S_i(r))/a(ϕ) − Cov(θ, Y | S_i(r) ∪ {X_i})/a(ϕ)} / {Cov(θ, Y)/a(ϕ)}
= {Cov(θ, Y | S_i(r)) − Cov(θ, Y | S_i(r) ∪ {X_i})} / Cov(θ, Y).    (7.41)

Remark 7.7 In the above definition of the relative importance of explanatory variables (7.38), if X_i is the parent of the other explanatory variables, i.e. S_i(r) = ∅, then (7.37) implies

CR_r(X_i) = {KL(U, Y) − KL(T_i(r)\{X_i}, Y | X_i)} / KL(X, Y) = {KL((X_1, X_2, …, X_p), Y) − KL(T_i(r)\{X_i}, Y | X_i)} / KL(X, Y).

If T_i(r) = {X_i}, i.e. S_i(r) ∪ {X_i} = U, formula (7.37) is calculated as

CR_r(X_i) = KL(T_i(r), Y | S_i(r)) / KL(X, Y) = KL(X_i, Y | U\{X_i}) / KL(X, Y).

 
Remark 7.8 In (7.37), for any permutations r and r′ such that T_i(r) = T_i(r′) or S_i(r) = S_i(r′), it follows that

CR_r(X_i) = CR_{r′}(X_i).

Example 7.4 For p = 2, i.e. U = {X1 , X2 }, we have two permutations of the explana-
tory variables r = (1, 2), (2, 1). Then, the relative importance of the explanatory
variables is evaluated as follows:
RI(X_i) = (1/2) Σ_r CR_r(X_i), i = 1, 2,

where

CR_(1,2)(X_1) = {KL((X_1, X_2), Y) − KL(X_2, Y | X_1)} / KL((X_1, X_2), Y) = {Cov(θ, Y) − Cov(θ, Y | X_1)} / Cov(θ, Y),
CR_(1,2)(X_2) = KL(X_2, Y | X_1) / KL((X_1, X_2), Y),
CR_(2,1)(X_1) = KL(X_1, Y | X_2) / KL((X_1, X_2), Y),
CR_(2,1)(X_2) = {KL((X_1, X_2), Y) − KL(X_1, Y | X_2)} / KL((X_1, X_2), Y).

Hence, we have

RI(X_1) = (1/2){CR_(1,2)(X_1) + CR_(2,1)(X_1)} = {KL((X_1, X_2), Y) − KL(X_2, Y | X_1) + KL(X_1, Y | X_2)} / {2 KL((X_1, X_2), Y)},

RI(X_2) = {KL((X_1, X_2), Y) − KL(X_1, Y | X_2) + KL(X_2, Y | X_1)} / {2 KL((X_1, X_2), Y)}.

Similarly, for p = 3, i.e. U = {X_1, X_2, X_3}, we have

RI(X_1) = (1/3!){2 CR_(1,2,3)(X_1) + CR_(2,1,3)(X_1) + CR_(3,1,2)(X_1) + 2 CR_(2,3,1)(X_1)}
= {1/(6 KL((X_1, X_2, X_3), Y))}{2 KL((X_1, X_2, X_3), Y) − 2 KL((X_2, X_3), Y | X_1) + KL((X_1, X_3), Y | X_2) − KL(X_3, Y | X_1, X_2) + KL((X_1, X_2), Y | X_3) − KL(X_2, Y | X_1, X_3) + 2 KL(X_1, Y | X_2, X_3)}
= {1/(6 Cov(θ, Y))}{2 Cov(θ, Y) − 2 Cov(θ, Y | X_1) + Cov(θ, Y | X_2) − Cov(θ, Y | X_1, X_2) + Cov(θ, Y | X_3) − Cov(θ, Y | X_1, X_3) + 2 Cov(θ, Y | X_2, X_3)},

RI(X_2) = (1/3!){2 CR_(2,1,3)(X_2) + CR_(1,2,3)(X_2) + CR_(3,2,1)(X_2) + 2 CR_(1,3,2)(X_2)},

RI(X_3) = (1/3!){2 CR_(3,1,2)(X_3) + CR_(1,3,2)(X_3) + CR_(2,3,1)(X_3) + 2 CR_(1,2,3)(X_3)}.
As shown in the above example, the calculation of RI(Xi ) becomes complex as
the number of the explanatory variables increases.
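The averaging over permutations can, however, be automated. The sketch below (one possible implementation, not the author's code) computes RI(X_i) from (7.37)–(7.38) given any routine that returns the estimated KL information KL(T, Y | S) for disjoint subsets T and S of the explanatory variables; by (7.41), for a fitted GLM such a routine can be based on estimates of Cov(θ, Y | S)/a(ϕ). The callable name `kl_info` is an assumption.

```python
# A minimal sketch of the relative importance RI(X_i) of (7.38). `kl_info(T, S)` is a
# user-supplied (hypothetical) function returning the estimated KL(T, Y | S); for a
# GLM it can be taken as Cov_hat(theta, Y | S) / a(phi), cf. (7.41).
from itertools import permutations
from math import factorial

def relative_importance(variables, kl_info):
    """variables: tuple of variable names; returns {name: RI(X_i)}."""
    kl_all = kl_info(tuple(variables), ())            # KL(X, Y)
    ri = dict.fromkeys(variables, 0.0)
    for r in permutations(variables):
        for i, xi in enumerate(r):
            s, t = r[:i], r[i:]                       # S_i(r) and T_i(r)
            rest = kl_info(t[1:], s + (xi,)) if len(t) > 1 else 0.0
            ri[xi] += (kl_info(t, s) - rest) / kl_all # CR_r(X_i), cf. (7.37)
    p_fact = factorial(len(variables))
    return {x: v / p_fact for x, v in ri.items()}     # (7.38)
```

For p = 2 this reproduces RI(X_i) = (1/2){CR_(1,2)(X_i) + CR_(2,1)(X_i)} of Example 7.4; by Remark 7.8, permutations sharing the same S_i(r) give identical terms, so for many variables the loop could be grouped by conditioning sets to save computation.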

Example 7.1 (continued) Three explanatory variables are hypothesized prior to the linear regression analysis; however, only the regression coefficients of the two explanatory variables X_1 and X_3 are statistically significant. According to this result, the variable importance is assessed for X_1 and X_3. Since

Ĉov(θ, Y) = 394.898, Ĉov(θ, Y | X_1) = 7.538, and Ĉov(θ, Y | X_3) = 130.821,

by using (7.41), we have

CR_(1,3)(X_1) = (394.898 − 7.538)/394.898 = 0.981,
CR_(1,3)(X_3) = 7.538/394.898 = 0.019,
CR_(3,1)(X_1) = 130.821/394.898 = 0.331,
CR_(3,1)(X_3) = 1 − CR_(3,1)(X_1) = 0.669.

Hence, it follows that

RI(X_1) = {CR_(1,3)(X_1) + CR_(3,1)(X_1)}/2 = 0.656,
RI(X_3) = 0.344.

Example 7.2 (continued) By using the model (7.22), the relative importance of the
explanatory variables X1 and X2 is considered. From (7.37), we have

CR_(1,2)(X_1) = {Cov(θ, Y) − Cov(θ, Y | X_1)} / Cov(θ, Y)
= (1/I)Σ_{i=1}^I α_i² / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²},

CR_(1,2)(X_2) = Cov(θ, Y | X_1) / Cov(θ, Y)
= {(1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²} / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²},

CR_(2,1)(X_1) = {(1/I)Σ_{i=1}^I α_i² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²} / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²},

CR_(2,1)(X_2) = (1/J)Σ_{j=1}^J β_j² / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²}.

Hence, from (7.38), it follows that

RI(X_1) = {CR_(1,2)(X_1) + CR_(2,1)(X_1)}/2 = {Cov(θ, Y) − Cov(θ, Y | X_1) + Cov(θ, Y | X_2)} / {2 Cov(θ, Y)}
= {(1/I)Σ_{i=1}^I α_i² + (1/2IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²} / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²},

RI(X_2) = {CR_(1,2)(X_2) + CR_(2,1)(X_2)}/2 = {Cov(θ, Y) − Cov(θ, Y | X_2) + Cov(θ, Y | X_1)} / {2 Cov(θ, Y)}
= {(1/J)Σ_{j=1}^J β_j² + (1/2IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²} / {(1/I)Σ_{i=1}^I α_i² + (1/J)Σ_{j=1}^J β_j² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)_ij²}.

In order to demonstrate the above results, the following estimates are used:
  

Ĉov(θ, Y) = 100.6, Ĉov(θ, Y | X_1) = 30.3, Ĉov(θ, Y | X_2) = 61.4.

From the estimates, we have


 

CR_(1,2)(X_1) = {Ĉov(θ, Y) − Ĉov(θ, Y | X_1)} / Ĉov(θ, Y) = 0.699,
CR_(2,1)(X_1) = Ĉov(θ, Y | X_2) / Ĉov(θ, Y) = 0.610,
CR_(1,2)(X_2) = Ĉov(θ, Y | X_1) / Ĉov(θ, Y) = 0.301,
CR_(2,1)(X_2) = {Ĉov(θ, Y) − Ĉov(θ, Y | X_2)} / Ĉov(θ, Y) = 0.390,
RI(X_1) = {CR_(1,2)(X_1) + CR_(2,1)(X_1)}/2 = (0.699 + 0.610)/2 = 0.655,
RI(X_2) = {CR_(1,2)(X_2) + CR_(2,1)(X_2)}/2 = 0.346.
From the above results, the degree of importance of factor X_1 (type of patients) is about twice that of factor X_2 (age groups of nurses).

Example 7.3 (continued) From (7.31), we have

CR_(1,2)(X_1) = (0.539 − 0.323)/0.539 = 0.401, CR_(2,1)(X_1) = 0.047/0.539 = 0.087,
CR_(1,2)(X_2) = 0.323/0.539 = 0.599, CR_(2,1)(X_2) = (0.539 − 0.047)/0.539 = 0.913,
RI(X_1) = (0.401 + 0.087)/2 = 0.244, RI(X_2) = (0.599 + 0.913)/2 = 0.756.
From the results, the variable importance of cigarette use X_2 is more than three times that of alcohol use X_1.

7.8 Discussion

In regression analysis, what is important is not only the estimation and significance testing of the regression coefficients of explanatory variables but also measuring the contributions of the explanatory variables and assessing their importance. The present chapter has addressed the latter subject in GLMs by applying the entropy-based path analysis of Chap. 6. The method employs ECD for dealing with the subject. It is an advantage that the present method can be applied to all GLMs. As demonstrated in the numerical examples, the entropy-based method can be used in practical data analyses with GLMs.

References

1. Adachi, K., & Trendafilov, N. T. (2018). Some mathematical properties of the matrix
decomposition solution in factor analysis. Psychometrika, 83, 407–424.
2. Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
3. Azen, R., & Budescu, D. V. (2003). The dominance analysis approach for comparing predictors
in multiple regression. Psychological Methods, 8(2), 129–148.
4. Eshima, N., Borroni, C. G., & Tabata, M. (2016). Relative-importance assessment of
explanatory variables in generalized linear models: an entropy-based approach, Statistics and
Applications, 14, 107–122.
5. Daniel, W. W. (1999). Biostatistics: A foundation for analysis in the health sciences (7th ed.).
New York: Wiley.
6. Darlington, R. B. (1968). Multiple regression in psychological research and practice.
Psychological Bulletin, 69, 161–182.
7. Dechow, P. M., Hutton, A. P., & Sloan, R. G. (1999). An empirical assessment of the residual
income valuation model. Journal of Accounting and Economics, 26, 1–34.
8. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear
models. Computational Statistics and Data Analysis, 54, 1381–1389.
9. Eshima, N., & Tabata, M. (2011). Three predictive power measures for generalized linear
models: Entropy coefficient of determination, entropy correlation coefficient and regression
correlation coefficient. Computational Statistics & Data Analysis, 55, 3049–3058.
10. Eshima, N., Tabata, M., Borroni, C. G., & Kano, Y. (2015). An entropy-based approach to path
analysis of structural generalized linear models: A basic idea. Entropy, 17, 5117–5132.
11. Grőmping, U. (2006). Relative importance for linear regression in R: The package relaimpo.
Journal of Statistical Software, 17, 1–26.
12. Grőmping, U. (2007). Estimators of relative importance in linear regression based on variance
decomposition. The American Statistician, 61, 139–147.
13. Grőmping, U. (2009). Variable importance assessment in regression: Linear regression versus
random forest. The American Statistician, 63, 308–319.
14. Kruskal, W. (1987). Relative importance by averaging over orderings. The American Statisti-
cian, 41, 6–10.
15. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman and
Hall: London.
16. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear model. Journal of the Royal
Statistical Society, A, 135, 370–384.
17. Ohlson, J. A. (1995). Earnings, book values and dividends in security valuation. Contemporary
Accounting Research, 11, 661–687.
18. Tanaka, Y., & Tarumi, T. (1995). Handbook for statistical analysis: Multivariate analysis
(windows version). Tokyo: Kyoritsu-Shuppan. (in Japanese).
19. Theil, H. (1987). How many bits of information does an independent variable yield in a multiple
regression? Statistics and Probability Letters, 6, 107–108.
20. Theil, H., & Chung, C. F. (1988). Information-theoretic measures of fit for univariate and
multivariate regressions. The American Statistician, 42, 249–253.
21. Thomas, D. R., & Zumbo, B. D. (1996). Using a measure of variable importance to investi-
gate the standardization of discriminant coefficients. Journal of Educational and Behavioral
Statistics, 21, 110–130.
Chapter 8
Latent Structure Analysis

8.1 Introduction

Latent structure analysis [11] is a general name that includes factor analysis [19, 22], latent trait analysis [12, 14], and latent class analysis. Let X = (X_1, X_2, ..., X_p)^T be a manifest variable vector; let ξ = (ξ_1, ξ_2, ..., ξ_q)^T be a latent variable (factor) vector that influences the manifest variables; let f(x) and f(x|ξ) be the joint density or probability function of manifest variable vector X and the conditional one given latent variables ξ, respectively; let f(x_i|ξ) be the conditional density or probability functions of manifest variables X_i given latent variables ξ; and let g(ξ) be the marginal density or probability function of latent variable vector ξ. Then, a general latent structure model assumes that

$$f(x|\xi) = \prod_{i=1}^{p} f(x_i|\xi). \qquad (8.1)$$

From this, it follows that

$$f(x) = \int \prod_{i=1}^{p} f(x_i|\xi)\, g(\xi)\, d\xi, \qquad (8.2)$$

where the integral is replaced by summation when the latent variables are discrete. The manifest variables X_i are observable; however, the latent variables ξ_j represent latent traits or abilities that cannot be observed and are hypothesized components,
to as that of local independence. The assumption implies that latent variables explain
the association of the manifest variables. In general, latent structure analysis explains
how latent variables affect manifest variables. Factor analysis treats continuous man-
ifest and latent variables, and latent trait models are modeled with discrete manifest


variables and continuous latent variables. Latent class analysis handles categorical
manifest and latent variables. It is important to estimate the model parameters, and the
interpretation of the extracted latent structures is also critical. In this chapter, latent
structure models are treated as GLMs, and the entropy-based approach in Chaps. 6
and 7 is applied to factor analysis and latent trait analysis. In Sect. 8.2, factor analysis
is treated, and an entropy-based method of assessing factor contribution is considered
[6, 7]. Comparing the conventional method of measuring factor contribution with the entropy-based one, the advantages of the entropy-based method are discussed in a GLM framework.
factor contributions is given by using covariance matrices. Section 8.3 deals with
the latent trait model that expresses dichotomous responses underlying a common
latent ability or trait. In the ML estimation of latent abilities of individuals from their
responses, the information of test items and that of tests are discussed according to
the Fisher information. Based on the GLM framework, ECD is used for measuring
test reliability. Numerical illustrations are also provided to demonstrate the present
discussion.

8.2 Factor Analysis

8.2.1 Factor Analysis Model

The origin of factor analysis dates back to the works of [19], and the single factor
model was extended to the multiple factor model [22]. Let Xi be manifest variables;
ξj latent variables (common factors); εi unique factors peculiar to Xi ; and let λij be
factor loadings that are weights of factors ξj to explain Xi . Then, the factor analysis
model is given as follows:

$$X_i = \sum_{j=1}^{m} \lambda_{ij}\xi_j + \varepsilon_i \quad (i = 1, 2, \ldots, p), \qquad (8.3)$$

where

$$\begin{cases} E(X_i) = E(\varepsilon_i) = 0, & i = 1, 2, \ldots, p;\\ E(\xi_j) = 0, & j = 1, 2, \ldots, m;\\ \mathrm{Var}(\xi_j) = 1, & j = 1, 2, \ldots, m;\\ \mathrm{Var}(\varepsilon_i) = \omega_i^2 > 0, & i = 1, 2, \ldots, p;\\ \mathrm{Cov}(\varepsilon_k, \varepsilon_l) = 0, & k \ne l. \end{cases}$$
Let Σ be the variance–covariance matrix of X = (X_1, X_2, ..., X_p)^T; let Φ be the m × m correlation matrix of common factor vector ξ = (ξ_1, ξ_2, ..., ξ_m)^T; let Ω be the p × p variance–covariance matrix of unique factor vector ε = (ε_1, ε_2, ..., ε_p)^T; and let Λ be the p × m regression coefficient matrix of λ_ij. Then, we have

$$\Sigma = \Lambda\Phi\Lambda^{T} + \Omega, \qquad (8.4)$$

and the model (8.3) is expressed by using the matrices as

$$X = \Lambda\xi + \varepsilon. \qquad (8.5)$$
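As a quick numerical check of the covariance structure (8.4), the following sketch (Python/NumPy; the loadings, factor correlations, and unique variances are made up for illustration) simulates model (8.5) and compares the sample covariance matrix of X with ΛΦΛᵀ + Ω.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the text): p = 4 manifest variables, m = 2 factors
Lam = np.array([[0.8, 0.1],
                [0.7, 0.2],
                [0.2, 0.7],
                [0.1, 0.8]])             # factor loading matrix Lambda
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])             # factor correlation matrix Phi
omega2 = np.array([0.3, 0.4, 0.4, 0.3])  # unique variances omega_i^2

n = 200_000
xi = rng.multivariate_normal(np.zeros(2), Phi, size=n)   # common factors
eps = rng.normal(0.0, np.sqrt(omega2), size=(n, 4))      # unique factors
X = xi @ Lam.T + eps                                     # model (8.5)

Sigma_model = Lam @ Phi @ Lam.T + np.diag(omega2)        # structure (8.4)
Sigma_sample = np.cov(X, rowvar=False)
print(np.max(np.abs(Sigma_sample - Sigma_model)))        # small for large n
```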

Let C be any m × m non-singular matrix with standardized row vectors. Then, the transformation ξ* = Cξ yields another model,

$$X = \Lambda^{*}\xi^{*} + \varepsilon, \qquad (8.6)$$

where Λ* = ΛC^{-1}. In this model, matrix Φ* = CΦC^{T} is the correlation matrix of factors ξ*, and we have

$$\Sigma = \Lambda^{*}\Phi^{*}\Lambda^{*T} + \Omega. \qquad (8.7)$$

This implies that the factor analysis model is not identified with respect to matrices C. If the common factors ξ_i are mutually independent, correlation matrix Φ is the identity matrix and covariance structure (8.4) is simplified as follows:

$$\Sigma = \Lambda\Lambda^{T} + \Omega, \qquad (8.9)$$

and for any orthogonal matrix C, we have

$$\Sigma = \Lambda^{*}\Lambda^{*T} + \Omega, \qquad (8.10)$$

where Λ* = ΛC^{-1} = ΛC^{T}. As discussed above, factor analysis model (8.5) cannot be determined uniquely. Hence, to estimate the model parameters, m(m − 1)/2 constraints have to be placed on the model parameters. On the other hand, the factor analysis model is scale-invariant under transformations of the scales of the manifest variables. For a diagonal matrix D, let us transform manifest variables X as

$$X^{*} = DX. \qquad (8.11)$$

Then,

$$X^{*} = D\Lambda\xi + D\varepsilon, \qquad (8.12)$$

and factor analysis models (8.5) and (8.11) are equivalent. From this, factor analysis can be treated under standardization of the manifest variables. Below, we set Var(X_i) = 1, i = 1, 2, ..., p. Methods of parameter estimation in factor analysis have been studied actively by many authors, for example, least squares estimation [1, 10] and ML estimation [8, 9, 15]. The estimates of the model parameters can be obtained by using ordinary statistical software packages [24]. Moreover,
high-dimensional factor analysis where the number of manifest variables is greater
than that of observations has also been developed [16, 20, 21, 23]. In this chapter,
the topics of parameter estimation are not treated, and an entropy-based method for
measuring factor contribution is discussed.

8.2.2 Conventional Method for Measuring Factor Contribution

The interpretation of the extracted factors is based on factor loading matrix Λ and factor structure matrix ΛΦ. After interpreting the factors, it is important to assess factor contribution. For orthogonal factor analysis models (Fig. 8.1a), the contribution of factor ξ_j to the manifest variables X_i is defined as follows:

Fig. 8.1 a Path diagram of an orthogonal factor analysis model. b Path diagram of an orthogonal factor analysis model
$$C_j = \sum_{i=1}^{p} \mathrm{Cov}(X_i, \xi_j)^2 = \sum_{i=1}^{p} \lambda_{ij}^2. \qquad (8.13)$$

The contributions of the extracted factors can also be measured from the following decomposition of the total variance of the observed variables X_i [2, p. 59]:

$$\sum_{i=1}^{p} \mathrm{Var}(X_i) = \sum_{i=1}^{p} \lambda_{i1}^2 + \sum_{i=1}^{p} \lambda_{i2}^2 + \cdots + \sum_{i=1}^{p} \lambda_{im}^2 + \sum_{i=1}^{p} \omega_i^2. \qquad (8.14)$$

From this, the contribution (8.13) is regarded as a component of the sum of the variances of the manifest variables. Applied to the manifest variables as observed, contribution (8.13) is not scale-invariant. For this reason, factor contribution is considered for standardized versions of the manifest variables X_i; however, the sum of the variances (8.14) has no physical meaning. The variation of manifest variable vector X = (X_1, X_2, ..., X_p) is summarized by the variance–covariance matrix Σ. The generalized variance of manifest variable vector X is the determinant of the variance–covariance matrix, |Σ|, and it expresses the p-dimensional variation of random vector X. If the determinant could be decomposed into sums of quantities related to factors ξ_j as in (8.14), it would be natural to define the factor contributions by those quantities; however, such a decomposition of |Σ| is impossible.
For standardized manifest variables X_i, we have

$$\lambda_{ij} = \mathrm{Corr}(X_i, \xi_j), \qquad (8.15)$$

and

$$\sum_{i=1}^{p} \mathrm{Var}(X_i) = p. \qquad (8.16)$$

The squared correlation coefficients (8.15) are the ratios of explained variance for the manifest variables X_i and can be interpreted as the contributions (effects) of the factors to the individual manifest variables, but their sum has no logical foundation to be viewed as the contribution of factor ξ_j to manifest variable vector X = (X_1, X_2, ..., X_p). Nevertheless, the contribution ratio of ξ_j is defined by

$$RC_j = \frac{C_j}{\sum_{l=1}^{m} C_l} = \frac{C_j}{\sum_{l=1}^{m}\sum_{k=1}^{p} \lambda_{kl}^2}. \qquad (8.17)$$

The above measure is referred to as the factor contribution ratio in the common factor space. Another contribution ratio of ξ_j is referred to as that in the whole space of manifest variable vector X = (X_i), and it is defined by

$$\widetilde{RC}_j = \frac{C_j}{\sum_{i=1}^{p} \mathrm{Var}(X_i)} = \frac{C_j}{p}. \qquad (8.18)$$
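For concreteness, the conventional measures (8.13), (8.17), and (8.18) can be computed directly from a loading matrix, as in the following sketch (Python/NumPy; the small loading matrix is illustrative only and assumes standardized manifest variables).

```python
import numpy as np

# Illustrative loading matrix Lambda (rows: manifest variables, columns: factors)
Lam = np.array([[0.8, 0.3],
                [0.5, 0.5],
                [0.2, 0.7]])
p, m = Lam.shape

C = (Lam ** 2).sum(axis=0)   # C_j, Eq. (8.13): column sums of squared loadings
RC = C / C.sum()             # RC_j, Eq. (8.17): ratio in the common factor space
RC_whole = C / p             # Eq. (8.18): ratio in the whole space, Var(X_i) = 1 assumed
print(C, RC, RC_whole)
```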

The conventional approach can be said to be intuitive. In order to overcome the


above difficulties, an entropy-based path analysis approach [5] was applied to mea-
suring the contributions of factors to manifest variables [7]. In the next subsection,
the approach is explained.
Remark 8.1 The path diagram in Fig. 8.1a expresses the factor analysis model through the linear equations (8.3). The present chapter treats the factor analysis model as a GLM, so the common factors ξ_j are explanatory variables and the manifest variables X_i are response variables. In this framework, the effects of the unique factors are dealt with through the random component of the GLM. Accordingly, in what follows, Fig. 8.1b is employed; i.e., the random component is expressed with the dashed arrows.

8.2.3 Entropy-Based Method of Measuring Factor Contribution

Factor analysis model (8.3) is discussed in a framework of GLMs. A general path


diagram of factor analysis model is illustrated in Fig. 8.2a. It is assumed that factors
ξj , j = 1, 2, . . . , m and εi , i = 1, 2, . . . , p are normally distributed. Then, the condi-
tional density functions of the manifest variables X_i, i = 1, 2, ..., p, given the factors ξ_j, j = 1, 2, ..., m, are given by

$$f_i(x_i|\xi) = \frac{1}{\sqrt{2\pi\omega_i^2}}\exp\!\left(-\frac{\bigl(x_i - \sum_{j=1}^{m}\lambda_{ij}\xi_j\bigr)^2}{2\omega_i^2}\right) = \exp\!\left(\frac{x_i\sum_{j=1}^{m}\lambda_{ij}\xi_j - \frac{1}{2}\bigl(\sum_{j=1}^{m}\lambda_{ij}\xi_j\bigr)^2}{\omega_i^2} - \frac{x_i^2}{2\omega_i^2} - \log\sqrt{2\pi\omega_i^2}\right), \quad i = 1, 2, \ldots, p. \qquad (8.19)$$


Let θ_i = Σ_{j=1}^m λ_ij ξ_j and C(x_i, ω_i²) = −x_i²/(2ω_i²) − log√(2πω_i²). Then, the above density function is described in a GLM framework as follows:

$$f_i(x_i|\xi) = \exp\!\left(\frac{x_i\theta_i - \frac{1}{2}\theta_i^2}{\omega_i^2} + C(x_i, \omega_i^2)\right), \quad i = 1, 2, \ldots, p. \qquad (8.20)$$

From the general latent structure model (8.1), the conditional normal density
function of X given ξ is expressed as
Fig. 8.2 a Path diagram of a general factor analysis model. b Path diagram of manifest variables X_i, i = 1, 2, ..., p, and common factor vector ξ. c Path diagram of manifest variable sub-vectors X^(a), a = 1, 2, ..., A, common factor vector ξ, and error sub-vectors ε^(a) related to X^(a), a = 1, 2, ..., A. d Path diagram of manifest variable vector X, common factors ξ_j, and unique factors ε_i

$$f(x|\xi) = \prod_{i=1}^{p}\exp\!\left(\frac{x_i\theta_i - \frac{1}{2}\theta_i^2}{\omega_i^2} + C(x_i, \omega_i^2)\right) = \exp\!\left(\sum_{i=1}^{p}\frac{x_i\theta_i - \frac{1}{2}\theta_i^2}{\omega_i^2} + \sum_{i=1}^{p}C(x_i, \omega_i^2)\right). \qquad (8.21)$$

From (8.21), we have the following theorem:


Theorem 8.1 In factor analysis model (8.21), let X = (X_1, X_2, ..., X_p)^T and ξ = (ξ_1, ξ_2, ..., ξ_m)^T. Then,

$$KL(X, \xi) = \sum_{i=1}^{p} KL(X_i, \xi). \qquad (8.22)$$

Proof Let f_i(x_i|ξ) be the conditional density functions of manifest variables X_i given factor vector ξ; f_i(x_i) be the marginal density functions of X_i; f(x) be the marginal density function of X; and let g(ξ) be the marginal density function of common factor vector ξ = (ξ_j). Then, from (8.20), we have

$$KL(X_i, \xi) = \iint f_i(x_i|\xi)g(\xi)\log\frac{f_i(x_i|\xi)}{f_i(x_i)}\,dx_i\,d\xi + \iint f_i(x_i)g(\xi)\log\frac{f_i(x_i)}{f_i(x_i|\xi)}\,dx_i\,d\xi = \frac{\mathrm{Cov}(X_i, \theta_i)}{\omega_i^2}, \quad i = 1, 2, \ldots, p. \qquad (8.23)$$

From (8.21), it follows that

$$\begin{aligned}
KL(X, \xi) &= \iint f(x|\xi)g(\xi)\log\frac{f(x|\xi)g(\xi)}{f(x)}\,dx\,d\xi + \iint f(x)g(\xi)\log\frac{f(x)}{f(x|\xi)g(\xi)}\,dx\,d\xi\\
&= \iint \bigl(f(x|\xi)g(\xi) - f(x)g(\xi)\bigr)\log f(x|\xi)\,dx\,d\xi\\
&= \sum_{i=1}^{p}\frac{\mathrm{Cov}(X_i, \theta_i)}{\omega_i^2}. \qquad (8.24)
\end{aligned}$$

Hence, the theorem follows. 


KL information (8.23) is viewed as a signal-to-noise ratio, and the total KL information (8.24) is decomposed into KL information components for the manifest variables X_i. In this sense, KL(X, ξ) and KL(X_i, ξ) measure the effects of factor vector ξ in Fig. 8.2b. Let R_i be the multiple correlation coefficient of X_i and ξ = (ξ_j). Since

$$E(X_i|\xi) = \theta_i = \sum_{j=1}^{m}\lambda_{ij}\xi_j, \quad i = 1, 2, \ldots, p,$$
we have

$$\frac{\mathrm{Cov}(X_i, \theta_i)}{\mathrm{Cov}(X_i, \theta_i) + \omega_i^2} = R_i^2, \quad i = 1, 2, \ldots, p.$$

Then, from (8.23),

$$KL(X_i, \xi) = \frac{R_i^2}{1 - R_i^2}, \quad i = 1, 2, \ldots, p. \qquad (8.25)$$

In a GLM framework, the ECDs with respect to the manifest variables X_i and factor vector ξ are computed as follows:

$$ECD(X_i, \xi) = \frac{KL(X_i, \xi)}{KL(X_i, \xi) + 1} = R_i^2, \quad i = 1, 2, \ldots, p; \qquad (8.26)$$

$$ECD(X, \xi) = \frac{\sum_{i=1}^{p}\mathrm{Cov}(X_i, \theta_i)/\omega_i^2}{\sum_{i=1}^{p}\mathrm{Cov}(X_i, \theta_i)/\omega_i^2 + 1} = \frac{\sum_{i=1}^{p} R_i^2/(1 - R_i^2)}{\sum_{i=1}^{p} R_i^2/(1 - R_i^2) + 1}. \qquad (8.27)$$
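A small numerical sketch of (8.25)–(8.27) follows (Python/NumPy; the loadings and factor correlation matrix are illustrative). It computes R_i² = λ_iΦλ_iᵀ for standardized manifest variables and, from these, the per-variable KL terms and the ECDs.

```python
import numpy as np

Lam = np.array([[0.8, 0.3],
                [0.5, 0.5],
                [0.2, 0.7]])                   # illustrative loadings
Phi = np.array([[1.0, 0.4],
                [0.4, 1.0]])                   # factor correlation matrix

R2 = np.einsum('ij,jk,ik->i', Lam, Phi, Lam)   # R_i^2 = lambda_i Phi lambda_i^T
omega2 = 1.0 - R2                              # unique variances (standardized X_i)

KL_i = R2 / (1.0 - R2)                         # Eq. (8.25)
ECD_i = KL_i / (KL_i + 1.0)                    # Eq. (8.26), equals R2
ECD_X = KL_i.sum() / (KL_i.sum() + 1.0)        # Eq. (8.27)
print(ECD_i, ECD_X)
```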

From Theorem 8.1, in Fig. 8.2c, we have the following corollary:

Corollary 8.1 Let manifest variable sub-vectors X^(a), a = 1, 2, ..., A, be any decomposition of manifest variable vector X = (X_1, X_2, ..., X_p)^T. Then,

$$KL(X, \xi) = \sum_{a=1}^{A} KL\bigl(X^{(a)}, \xi\bigr). \qquad (8.28)$$

Proof In factor analysis models, sub-vectors X (a) , a = 1, 2, . . . , A are condition-


ally independent, given factor vector ξ . From this, the proof is similar to that of
Theorem 8.1. This completes the corollary. 

With respect to Theorem 8.1, a more general theorem is given as follows:

Theorem 8.2 Let X = (X_1, X_2, ..., X_p)^T and ξ = (ξ_1, ξ_2, ..., ξ_m)^T be manifest and latent variable vectors, respectively. Under the assumption of local independence (8.1), the same decomposition as in (8.22) holds:

$$KL(X, \xi) = \sum_{i=1}^{p} KL(X_i, \xi).$$

Proof

$$\begin{aligned}
KL(X, \xi) &= \iint \prod_{i=1}^{p} f_i(x_i|\xi)g(\xi)\log\frac{\prod_{k=1}^{p} f_k(x_k|\xi)}{f(x)}\,dx\,d\xi + \iint f(x)g(\xi)\log\frac{f(x)}{\prod_{k=1}^{p} f_k(x_k|\xi)}\,dx\,d\xi\\
&= \iint \Bigl(\prod_{i=1}^{p} f_i(x_i|\xi)g(\xi) - f(x)g(\xi)\Bigr)\log\prod_{k=1}^{p} f_k(x_k|\xi)\,dx\,d\xi\\
&= \iint \prod_{i=1}^{p} f_i(x_i|\xi)g(\xi)\log\frac{\prod_{k=1}^{p} f_k(x_k|\xi)}{\prod_{k=1}^{p} f_k(x_k)}\,dx\,d\xi + \iint f(x)g(\xi)\log\frac{\prod_{k=1}^{p} f_k(x_k)}{\prod_{k=1}^{p} f_k(x_k|\xi)}\,dx\,d\xi\\
&= \sum_{k=1}^{p}\left(\iint \prod_{i=1}^{p} f_i(x_i|\xi)g(\xi)\log\frac{f_k(x_k|\xi)}{f_k(x_k)}\,dx\,d\xi + \iint f(x)g(\xi)\log\frac{f_k(x_k)}{f_k(x_k|\xi)}\,dx\,d\xi\right)\\
&= \sum_{k=1}^{p}\left(\iint f_k(x_k|\xi)g(\xi)\log\frac{f_k(x_k|\xi)}{f_k(x_k)}\,dx_k\,d\xi + \iint f_k(x_k)g(\xi)\log\frac{f_k(x_k)}{f_k(x_k|\xi)}\,dx_k\,d\xi\right)\\
&= \sum_{i=1}^{p} KL(X_i, \xi).
\end{aligned}$$

This completes the theorem. 


 
Remark 8.2 Let Λ = (λ_ij) be a p × m factor loading matrix; let Φ be the m × m correlation matrix of common factor vector ξ = (ξ_1, ξ_2, ..., ξ_m)^T; and let Ω be the p × p variance–covariance matrix of unique factor vector ε = (ε_1, ε_2, ..., ε_p)^T. Then, the conditional density function of X given ξ, f(x|ξ), is normal with mean Λξ and variance matrix Ω and is given as follows:

$$f(x|\xi) = \frac{1}{(2\pi)^{p/2}|\Omega|^{1/2}}\exp\!\left(\frac{x^{T}\widetilde{\Omega}\Lambda\xi}{|\Omega|} - \frac{1}{2}\frac{\xi^{T}\Lambda^{T}\widetilde{\Omega}\Lambda\xi}{|\Omega|} - \frac{1}{2}\frac{x^{T}\widetilde{\Omega}x}{|\Omega|}\right),$$

where Ω̃ is the cofactor matrix of Ω. Then, KL information (8.24) is expressed by

$$KL(X, \xi) = \iint f(x|\xi)g(\xi)\log\frac{f(x|\xi)}{f(x)}\,dx\,d\xi + \iint f(x)g(\xi)\log\frac{f(x)}{f(x|\xi)}\,dx\,d\xi = \frac{\mathrm{tr}\,\Lambda^{T}\widetilde{\Omega}\Lambda\Phi}{|\Omega|}.$$

The above information can be interpreted as a generalized signal-to-noise ratio, where the signal is tr Λ^TΩ̃ΛΦ and the noise is |Ω|. From (8.24) and (8.25), we have

$$\frac{\mathrm{tr}\,\Lambda^{T}\widetilde{\Omega}\Lambda\Phi}{|\Omega|} = \sum_{i=1}^{p}\frac{\mathrm{Cov}(X_i, \theta_i)}{\omega_i^2} = \sum_{i=1}^{p}\frac{R_i^2}{1 - R_i^2}.$$

Hence, the generalized signal-to-noise ratio is decomposed into those for manifest
variables.
The contributions of factors in factor analysis model (8.20) are discussed in view
of entropy. According to the above discussion, the following definitions are made:
Definition 8.1 The contribution of factor vector ξ = (ξ_1, ξ_2, ..., ξ_m)^T to manifest variable vector X = (X_1, X_2, ..., X_p)^T is defined by

$$C(\xi \to X) = KL(X, \xi). \qquad (8.29)$$

Definition 8.2 The contributions of factor vector ξ = (ξ_1, ξ_2, ..., ξ_m)^T to the manifest variables X_i are defined by

$$C(\xi \to X_i) = KL(X_i, \xi) = \frac{R_i^2}{1 - R_i^2}, \quad i = 1, 2, \ldots, p. \qquad (8.30)$$
Definition 8.3 The contributions of ξ_j to X = (X_1, X_2, ..., X_p)^T are defined by

$$C(\xi_j \to X) = KL(X, \xi) - KL\bigl(X, \xi^{\backslash j}|\xi_j\bigr), \quad j = 1, 2, \ldots, m, \qquad (8.31)$$

where ξ^{\j} = (ξ_1, ξ_2, ..., ξ_{j−1}, ξ_{j+1}, ..., ξ_m)^T and KL(X, ξ^{\j}|ξ_j) is the following conditional KL information:

$$KL\bigl(X, \xi^{\backslash j}|\xi_j\bigr) = \iint f(x|\xi)g(\xi)\log\frac{f(x|\xi)}{f(x|\xi_j)}\,dx\,d\xi + \iint f(x|\xi_j)g(\xi)\log\frac{f(x|\xi_j)}{f(x|\xi)}\,dx\,d\xi, \quad j = 1, 2, \ldots, m.$$

According to Theorem 8.1 and (8.25), we have the following decomposition:

$$C(\xi \to X) = \sum_{i=1}^{p} C(\xi \to X_i) = \sum_{i=1}^{p}\frac{R_i^2}{1 - R_i^2}. \qquad (8.32)$$

The contributions C(ξ → X) and C(ξ → X_i) measure the effects of the common factors in Fig. 8.2b. The effects of the common factors in Fig. 8.2d are evaluated by the contributions C(ξ_j → X).

Remark 8.3 The contribution C(ξ → X) is decomposed into C(ξ → X_i) with respect to the manifest variables X_i (8.32); however, in general, it follows that

$$C(\xi \to X) \ne \sum_{j=1}^{m} C(\xi_j \to X) \qquad (8.33)$$

due to the correlations between the common factors ξ_j.


In Fig. 8.2a, the following definition is given.
Definition 8.4 The contribution of ξ_j to X_i, i = 1, 2, ..., p, is defined by

$$C(\xi_j \to X_i) = KL(X_i, \xi) - KL\bigl(X_i, \xi^{\backslash j}|\xi_j\bigr) = \frac{\mathrm{Var}(\theta_i) - \mathrm{Var}(\theta_i|\xi_j)}{\omega_i^2}, \quad i = 1, 2, \ldots, p;\ j = 1, 2, \ldots, m. \qquad (8.34)$$

 Considering
 the above discussion, a more general decomposition of contribution
C ξj → X can be made in the factor analysis model.
Theorem 8.3 Let X = (X_1, X_2, ..., X_p)^T and ξ = (ξ_1, ξ_2, ..., ξ_m)^T be manifest and latent variable vectors, respectively. Under the assumption of local independence (8.1),

$$KL\bigl(X, \xi^{\backslash j}|\xi_j\bigr) = \sum_{i=1}^{p} KL\bigl(X_i, \xi^{\backslash j}|\xi_j\bigr), \quad j = 1, 2, \ldots, m. \qquad (8.35)$$

 
Proof Let f(x, ξ^{\j}|ξ_j) be the conditional density function of X and ξ^{\j} given ξ_j; f(x|ξ_j) be the conditional density function of X given ξ_j; and let g(ξ^{\j}|ξ_j) be the conditional density function of ξ^{\j} given ξ_j. Then, in factor analysis model (8.21), we have

$$\begin{aligned}
KL\bigl(X, \xi^{\backslash j}|\xi_j\bigr) &= \iiint f(x, \xi)\log\frac{f\bigl(x, \xi^{\backslash j}|\xi_j\bigr)}{f(x|\xi_j)g\bigl(\xi^{\backslash j}|\xi_j\bigr)}\,dx\,d\xi^{\backslash j}\,d\xi_j + \iiint f(x|\xi_j)g(\xi)\log\frac{f(x|\xi_j)g\bigl(\xi^{\backslash j}|\xi_j\bigr)}{f\bigl(x, \xi^{\backslash j}|\xi_j\bigr)}\,dx\,d\xi^{\backslash j}\,d\xi_j\\
&= \iiint f(x, \xi)\log\frac{f(x|\xi)}{f(x|\xi_j)}\,dx\,d\xi^{\backslash j}\,d\xi_j + \iiint f(x|\xi_j)g(\xi)\log\frac{f(x|\xi_j)}{f(x|\xi)}\,dx\,d\xi^{\backslash j}\,d\xi_j\\
&= \sum_{i=1}^{p}\frac{\mathrm{Cov}(X_i, \theta_i|\xi_j)}{\omega_i^2} = \sum_{i=1}^{p} KL\bigl(X_i, \xi^{\backslash j}|\xi_j\bigr), \quad j = 1, 2, \ldots, m.
\end{aligned}$$

This completes the theorem. 

From Theorems 8.1 and 8.3, we have the following decomposition of the contribution of ξ_j to X:

$$C(\xi_j \to X) = KL(X, \xi) - KL\bigl(X, \xi^{\backslash j}|\xi_j\bigr) = \sum_{i=1}^{p} KL(X_i, \xi) - \sum_{i=1}^{p} KL\bigl(X_i, \xi^{\backslash j}|\xi_j\bigr) = \sum_{i=1}^{p}\Bigl(KL(X_i, \xi) - KL\bigl(X_i, \xi^{\backslash j}|\xi_j\bigr)\Bigr) = \sum_{i=1}^{p} C(\xi_j \to X_i). \qquad (8.36)$$
From Remark 8.2, in general,

$$C(\xi \to X) \ne \sum_{j=1}^{m}\sum_{i=1}^{p} C(\xi_j \to X_i).$$

However, in orthogonal factor analysis models, the following theorem holds:

Theorem 8.4 Let X = (X_1, X_2, ..., X_p)^T and ξ = (ξ_1, ξ_2, ..., ξ_m)^T be manifest and latent variable vectors, respectively. In factor analysis models (8.1), if the factors ξ_i, i = 1, 2, ..., m, are statistically independent, then

$$C(\xi \to X) = \sum_{j=1}^{m}\sum_{i=1}^{p} C(\xi_j \to X_i). \qquad (8.37)$$

Proof Since the factors are independent, it follows that

$$KL(X_i, \xi) = \frac{\sum_{k=1}^{m}\lambda_{ik}^2}{\omega_i^2}, \quad KL\bigl(X_i, \xi^{\backslash j}|\xi_j\bigr) = \frac{\sum_{k \ne j}\lambda_{ik}^2}{\omega_i^2}.$$

From this, we obtain

$$C(\xi_j \to X_i) = KL(X_i, \xi) - KL\bigl(X_i, \xi^{\backslash j}|\xi_j\bigr) = \frac{\lambda_{ij}^2}{\omega_i^2}.$$

From (8.23) and (8.24), we have

$$C(\xi \to X) = KL(X, \xi) = \sum_{i=1}^{p}\sum_{j=1}^{m}\frac{\lambda_{ij}^2}{\omega_i^2}.$$

Hence, Eq. (8.37) holds. This completes the theorem.
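The identity (8.37) is easy to verify numerically. The sketch below (Python/NumPy, with an illustrative orthogonal loading matrix, not taken from the text) computes C(ξ_j → X_i) = λ_ij²/ω_i² and checks that their total equals KL(X, ξ) = Σ_i R_i²/(1 − R_i²).

```python
import numpy as np

# Illustrative orthogonal loadings (rows: manifest variables, columns: factors)
Lam = np.array([[0.8, 0.3],
                [0.5, 0.5],
                [0.2, 0.7]])

R2 = (Lam ** 2).sum(axis=1)          # communalities R_i^2 (orthogonal factors)
omega2 = 1.0 - R2                    # unique variances for standardized X_i

C_ji = (Lam ** 2) / omega2[:, None]  # C(xi_j -> X_i) = lambda_ij^2 / omega_i^2
KL_X_xi = (R2 / (1.0 - R2)).sum()    # C(xi -> X) = KL(X, xi)

print(C_ji.sum(), KL_X_xi)           # the two totals agree, illustrating Eq. (8.37)
```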


In the conventional method of measuring factor contribution, the relative contri-
butions (8.17) and (8.18) are used. In the present context, the relative contributions
in Fig. 8.2d are defined, corresponding to (8.17) and (8.18).

Definition 8.5 The relative contribution of factor ξ_j to manifest variable vector X is defined by

$$RC(\xi_j \to X) = \frac{C(\xi_j \to X)}{C(\xi \to X)} = \frac{KL(X, \xi_j)}{KL(X, \xi)}. \qquad (8.38)$$

From (8.23) and (8.24), we have

$$RC(\xi_j \to X) = \frac{KL(X, \xi) - KL\bigl(X, \xi^{\backslash j}|\xi_j\bigr)}{KL(X, \xi)} = \frac{\sum_{i=1}^{p}\bigl(\mathrm{Cov}(X_i, \theta_i) - \mathrm{Cov}(X_i, \theta_i|\xi_j)\bigr)/\omega_i^2}{\sum_{i=1}^{p}\mathrm{Cov}(X_i, \theta_i)/\omega_i^2}. \qquad (8.39)$$

Entropy KL(X, ξ) expresses the entropy variation of manifest variable vector X explained by factor vector ξ, and KL(X, ξ^{\j}|ξ_j) is that explained by ξ^{\j} after excluding the effect of factor ξ_j on X. Based on the ECD approach [4], we give the following definition:

Definition 8.6 The relative contribution of factor ξ_j to manifest variable vector X in the whole entropy space of X is defined by

$$\widetilde{RC}(\xi_j \to X) = \frac{KL(X, \xi) - KL\bigl(X, \xi^{\backslash j}|\xi_j\bigr)}{KL(X, \xi) + 1} = \frac{\sum_{i=1}^{p}\bigl(\mathrm{Cov}(X_i, \theta_i) - \mathrm{Cov}(X_i, \theta_i|\xi_j)\bigr)/\omega_i^2}{\sum_{i=1}^{p}\mathrm{Cov}(X_i, \theta_i)/\omega_i^2 + 1}. \qquad (8.40)$$

8.2.4 A Method of Calculating Factor Contributions by Using Covariance Matrices

In factor analysis model (8.3), in order to simplify the discussion, the factor contribution of ξ_1 is calculated. Let Γ ≡ Var(X, ξ) be the variance–covariance matrix of X = (X_1, X_2, ..., X_p)^T and ξ = (ξ_1, ξ_2, ..., ξ_m), that is,

$$\Gamma = \begin{pmatrix} \Sigma & \Lambda\Phi\\ \Phi\Lambda^{T} & \Phi \end{pmatrix},$$

and let the matrix be partitioned according to X, ξ_1, and (ξ_2, ξ_3, ..., ξ_m) as follows:

$$\Gamma = \begin{pmatrix} \Gamma_{11} & \Gamma_{12} & \Gamma_{13}\\ \Gamma_{21} & \Gamma_{22} & \Gamma_{23}\\ \Gamma_{31} & \Gamma_{32} & \Gamma_{33} \end{pmatrix},$$

where

$$\Gamma_{11} = \Sigma; \quad (\Gamma_{12}\ \Gamma_{13}) = \Lambda\Phi; \quad \text{and} \quad \begin{pmatrix} \Gamma_{22} & \Gamma_{23}\\ \Gamma_{32} & \Gamma_{33} \end{pmatrix} = \Phi.$$

Let the partial variance–covariance matrix of X and (ξ_2, ξ_3, ..., ξ_m) given ξ_1 be denoted by

$$\Gamma_{(1,3)(1,3)\cdot 2} = \begin{pmatrix} \Gamma_{11\cdot 2} & \Gamma_{13\cdot 2}\\ \Gamma_{31\cdot 2} & \Gamma_{33\cdot 2} \end{pmatrix}. \qquad (8.41)$$

When the inverse of matrix Γ is expressed as

$$\Gamma^{-1} = \begin{pmatrix} \Gamma^{11} & \Gamma^{12} & \Gamma^{13}\\ \Gamma^{21} & \Gamma^{22} & \Gamma^{23}\\ \Gamma^{31} & \Gamma^{32} & \Gamma^{33} \end{pmatrix},$$

then matrix (8.41) is computed as follows:

$$\Gamma_{(1,3)(1,3)\cdot 2} = \begin{pmatrix} \Gamma^{11} & \Gamma^{13}\\ \Gamma^{31} & \Gamma^{33} \end{pmatrix}^{-1},$$

and it follows that

$$KL\bigl((\xi_2, \xi_3, \ldots, \xi_m), X|\xi_1\bigr) = -\mathrm{tr}\,\Gamma_{31\cdot 2}\Gamma^{13}.$$

By using the above result, we can calculate the contribution of factor ξ_1 as follows:

$$C(\xi_1 \to X) = KL\bigl((\xi_1, \xi_2, \ldots, \xi_m), X\bigr) - KL\bigl((\xi_2, \xi_3, \ldots, \xi_m), X|\xi_1\bigr) = -\mathrm{tr}\begin{pmatrix}\Gamma_{21}\\ \Gamma_{31}\end{pmatrix}\bigl(\Gamma^{12}\ \Gamma^{13}\bigr) + \mathrm{tr}\,\Gamma_{31\cdot 2}\Gamma^{13}. \qquad (8.42)$$
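In practice, the same contributions can also be obtained without the block-matrix route, using the decomposition (8.36) together with (8.34): for jointly normal factors, Var(θ_i|ξ_j) = Var(θ_i) − Cov(θ_i, ξ_j)², so C(ξ_j → X_i) = ((ΛΦ)_{ij})²/ω_i². The following sketch (Python/NumPy; the loadings and correlations are illustrative) implements this alternative route.

```python
import numpy as np

def factor_contributions(Lam, Phi):
    """C(xi_j -> X_i) and C(xi_j -> X) via Eqs. (8.34) and (8.36).

    Assumes standardized manifest variables, so omega_i^2 = 1 - lambda_i Phi lambda_i^T.
    """
    R2 = np.einsum('ij,jk,ik->i', Lam, Phi, Lam)   # Var(theta_i) = R_i^2
    omega2 = 1.0 - R2                              # unique variances
    cov_theta_xi = Lam @ Phi                       # Cov(theta_i, xi_j) = (Lambda Phi)_{ij}
    C_ij = cov_theta_xi ** 2 / omega2[:, None]     # C(xi_j -> X_i), Eq. (8.34)
    return C_ij, C_ij.sum(axis=0)                  # per-variable and per-factor totals, Eq. (8.36)

# Illustrative two-factor oblique example
Lam = np.array([[0.8, 0.1],
                [0.5, 0.4],
                [0.1, 0.8]])
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])
C_ij, C_j = factor_contributions(Lam, Phi)
print(C_j)
```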

8.2.5 Numerical Example

The entropy-based method for measuring factor contribution is scale-invariant, so a numerical illustration is made under standardization of the manifest variables. Assuming five common factors, Table 8.1 shows artificial factor loadings for nine manifest variables. First, in order to compare the method with the conventional

Table 8.1 Artificial factor loadings of five common factors in nine manifest variables
Common Manifest variable
factor X1 X2 X3 X4 X5 X6 X7 X8 X9
ξ1 0.1 0.8 0.4 0.3 0.1 0.3 0.8 0.2 0.4
ξ2 0.3 0.2 0.2 0.3 0.8 0.4 0.1 0.1 0.2
ξ3 0.2 0.1 0.2 0.7 0.1 0.2 0.2 0.3 0.8
ξ4 0.7 0.1 0.1 0.1 0.1 0.6 0.1 0.1 0.1
ξ5 0.1 0.1 0.7 0.1 0.1 0.1 0.2 0.7 0.1
ωi2 0.360 0.290 0.260 0.310 0.320 0.340 0.260 0.360 0.140
ωi2 are unique factor variances
Table 8.2 Factor contributions measured with the conventional method (orthogonal case)
Factor ξ1 ξ2 ξ3 ξ4 ξ5
Cj 1.72 0.79 1.07 1.60 1.08
RCj 0.275 0.126 0.171 0.256 0.173
RC~j 0.246 0.113 0.153 0.229 0.154

Table 8.3 Factor contributions measured with the entropy-based method (orthogonal case)
Factor ξ1 ξ2 ξ3 ξ4 ξ5
 
C ξj → X 16.13 2.18 4.27 7.64 4.11
 
RC ξj → X 0.474 0.063 0.124 0.223 0.120
 
RC~(ξj → X) 0.457 0.062 0.121 0.216 0.116

one, the five factors are assumed orthogonal. By using the conventional method, the
factor contributions of the common factors are calculated with Cj (8.13), RCj (8.17),
and RC~j (8.18) (Table 8.2). On the other hand, the entropy-based approach uses C(ξj → X) (8.36), RC(ξj → X) (8.38), and RC~(ξj → X) (8.40), and the results are
shown in Table 8.3. According to the conventional method, factors ξ1 and ξ4 have
similar contributions to the manifest variables; however, in the entropy-based method,
the contribution of ξ1 is more than twice that of ξ4 .
Second, an oblique case is considered by using factor loadings shown in Table 8.1.
Assuming the correlation matrix of the five common factors as

$$\Phi = \begin{pmatrix} 1 & 0.7 & 0.5 & 0.2 & 0.1\\ 0.7 & 1 & 0.7 & 0.5 & 0.2\\ 0.5 & 0.7 & 1 & 0.7 & 0.5\\ 0.2 & 0.5 & 0.7 & 1 & 0.7\\ 0.1 & 0.2 & 0.5 & 0.7 & 1 \end{pmatrix},$$

the factor contributions are computed by (8.39) and (8.42), and the results are illustrated in Table 8.4. Owing to the correlations between the factors, factors ξ3, ξ4, and ξ5 have large contributions to the manifest variables; i.e., their relative contributions are greater than 0.5. In particular, the contributions of ξ3 and ξ4 are more than 0.7.
The entropy-based method is applied to factor analysis of Table 7.6, assuming
two common factors. By using the covarimin method, we have the factor loadings in

Table 8.4 Factor contributions measured with the entropy-based method (oblique case)
Factor ξ1 ξ2 ξ3 ξ4 ξ5
 
C ξj → X 9.708 12.975 21.390 18.417 13.972
 
RC ξj → X 0.385 0.515 0.859 0.731 0.553
 
RC~(ξj → X) 0.370 0.495 0.816 0.703 0.531
Table 8.5 Estimated factor loadings from data in Table 7.6


Japanese X1 English X2 Social X3 Mathematics X4 Science X5
ξ1 0.593 0.768 0.677 0.285 0
ξ2 0.244 0 −0.122 0.524 0.919

Table 8.5, where the manifest variables are standardized. According to the results,
factor ξ1 can be interpreted as a latent ability for liberal arts and factor ξ2 as that for
sciences. The correlation coefficient between the two common factors is estimated
as 0.315, and the factor contributions are calculated as in Table 8.6. The table shows the following decompositions of C(ξj → X) and C(ξ → X):

$$C(\xi_j \to X) = \sum_{i=1}^{5} C(\xi_j \to X_i), \quad j = 1, 2; \qquad C(\xi \to X) = \sum_{i=1}^{5} C(\xi \to X_i).$$

However, the factor analysis model is oblique, and it follows that

$$C(\xi \to X_i) \ne \sum_{j=1}^{2} C(\xi_j \to X_i), \quad i = 1, 2, 3, 4, 5.$$

The effect of ξ2 on the manifest variable vector X is about 1.7 times that of ξ1. A summary of the contributions (effects) to the manifest variables is given in Table 8.7. From RC~(ξ → X) = 0.904, the explanatory power of the latent common factor vector ξ is strong, especially that of ξ2, i.e.,
Table 8.6 Factor contributions C ξj → X , C ξj → Xi , and C(ξ → Xi )
Japanese X1 English X2 Social X3 Mathematics X4 Science X5 Totala
ξ1 0.902 1.438 0.704 0.368 0.539 3.951
ξ2 0.373 0.143 0.014 0.685 5.433 6.648
ξ 1.009 1.438 0.728 0.818 5.433 9.426
 
a Totals in the table are equal to C ξ → X , j = 1, 2 and C(ξ → X)
j

Table 8.7 Factor


ξ1 ξ2 Effect of ξ on X
contributions to manifest  
variable vector X C ξj → X 3.951 6.648 C(ξ → X) = 9.426
 
ξj → X
RC 0.379 0.638 → X) = 0.904
RC(ξ
 
RC ξj → X 0.419 0.705
RC~(ξ2 → X) = 0.638. As mentioned in Remark 8.3, since the factor analysis model is oblique, in Table 8.7, we have

$$C(\xi \to X) \ne \sum_{j=1}^{2} C(\xi_j \to X).$$
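The entries of Table 8.6 can be recomputed directly from the estimated loadings in Table 8.5 and the estimated factor correlation 0.315, as in the following sketch (Python/NumPy). It uses the relation C(ξ_j → X_i) = Cov(θ_i, ξ_j)²/ω_i² = ((ΛΦ)_{ij})²/ω_i², which follows from (8.34) under normally distributed factors, together with C(ξ → X_i) = R_i²/(1 − R_i²).

```python
import numpy as np

# Estimated loadings from Table 8.5 (rows: Japanese, English, Social, Mathematics, Science)
Lam = np.array([[0.593, 0.244],
                [0.768, 0.000],
                [0.677, -0.122],
                [0.285, 0.524],
                [0.000, 0.919]])
Phi = np.array([[1.000, 0.315],
                [0.315, 1.000]])               # estimated factor correlation matrix

R2 = np.einsum('ij,jk,ik->i', Lam, Phi, Lam)   # Var(theta_i) for standardized X_i
omega2 = 1.0 - R2                              # unique variances
C_ij = (Lam @ Phi) ** 2 / omega2[:, None]      # C(xi_j -> X_i); columns j = 1, 2
C_i = R2 / (1.0 - R2)                          # C(xi -> X_i) = KL(X_i, xi)

print(C_ij.round(3))                           # columns reproduce the xi_1 and xi_2 rows of Table 8.6
print(C_i.round(3))                            # reproduces the xi row of Table 8.6
print(C_ij.sum(axis=0).round(3), C_i.sum().round(3))   # C(xi_j -> X), C(xi -> X)
```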

8.3 Latent Trait Analysis

8.3.1 Latent Trait and Item Response

In a test battery composed of response items with binary response categories, e.g.,
“yes” or “no”, “positive” or “negative”, “success” or “failure”, and so on, let us
assume that the responses to the test items depend on a latent trait or ability θ, where
θ is a hypothesized variable that cannot be observed directly. In item response theory,
the relationship between responses to the items and the latent trait is analyzed, and
it is applied to test-making. In this section, the term “latent trait” is mainly used for
convenience; however, the term "latent ability" is also employed. In a test battery with p test items, let X_i be the responses to items i, i = 1, 2, ..., p, such that

$$X_i = \begin{cases} 1 & (\text{success response to item } i)\\ 0 & (\text{failure}), \end{cases} \quad i = 1, 2, \ldots, p,$$

and let P_i(θ) = P(X_i = 1|θ) be the (success) response probability functions of an individual with latent trait θ to items i, i = 1, 2, ..., p. Then, under the assumption of local independence (Fig. 8.3), it follows that

$$P\bigl(X_1, X_2, \ldots, X_p = x_1, x_2, \ldots, x_p|\theta\bigr) = \prod_{i=1}^{p} P_i(\theta)^{x_i} Q_i(\theta)^{1 - x_i}, \qquad (8.43)$$

where Qi (θ ) = 1 − Pi (θ ), i = 1, 2, . . . , p are failure probabilities of an individual


with latent trait θ . The functions Pi (θ ), i = 1, 2, . . . , p are called item characteristic
functions. Usually, test score

Fig. 8.3 Path diagram of the


latent trait model
$$V = \sum_{i=1}^{p} X_i$$

is used for evaluating the latent traits of individuals. Then, we have

$$E\!\left(\sum_{i=1}^{p} X_i \,\Big|\, \theta\right) = \sum_{i=1}^{p} E(X_i|\theta) = \sum_{i=1}^{p} P_i(\theta), \qquad \mathrm{Var}\!\left(\sum_{i=1}^{p} X_i \,\Big|\, \theta\right) = \sum_{i=1}^{p} \mathrm{Var}(X_i|\theta) = \sum_{i=1}^{p} P_i(\theta)Q_i(\theta).$$

In general, if weights ci are assigned to responses Xi , we have the following


general score:


p
V = ci Xi . (8.44)
i=1

For the above score, the average of the conditional expectations of c_i X_i given latent trait θ over the items is given by

$$T(\theta) = \frac{1}{p}\sum_{i=1}^{p} c_i P_i(\theta). \qquad (8.45)$$

The above function is called a test characteristic function. In the next section, a
theoretical discussion for deriving the item characteristic function Pi (θ ) is provided
[18, pp. 37–40].

8.3.2 Item Characteristic Function

Let Yi be latent traits to answer items i, i = 1, 2, . . . , p and let θ be a common latent


trait to answer all the items. Since latent traits are unobservable, hypothesized variables, it is assumed that Y_i and θ are jointly distributed according to a bivariate normal distribution with mean vector (0, 0) and variance–covariance matrix

$$\begin{pmatrix} 1 & \rho_i\\ \rho_i & 1 \end{pmatrix}. \qquad (8.46)$$

Let η_i be the threshold of latent ability Y_i for successfully answering item i, i = 1, 2, ..., p. The probabilities that an individual with latent trait θ gives correct answers to items i, i = 1, 2, ..., p, are computed as follows:

$$P_i(\theta) = P(Y_i > \eta_i|\theta) = \int_{\eta_i}^{+\infty} \frac{1}{\sqrt{2\pi(1-\rho_i^2)}}\exp\!\left(-\frac{(y-\rho_i\theta)^2}{2(1-\rho_i^2)}\right)dy = \int_{\frac{\eta_i-\rho_i\theta}{\sqrt{1-\rho_i^2}}}^{+\infty} \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{t^2}{2}\right)dt = \int_{-a_i(\theta-d_i)}^{+\infty} \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{t^2}{2}\right)dt = \Phi(a_i(\theta - d_i)), \quad i = 1, 2, \ldots, p, \qquad (8.47)$$

where Φ(x) is the standard normal distribution function and

$$a_i = \frac{\rho_i}{\sqrt{1-\rho_i^2}}, \qquad d_i = \frac{\eta_i}{\rho_i}.$$

From this, for θ = d_i the success probability for item i is 1/2, and in this sense, the parameter d_i expresses the difficulty of item i. Since the cumulative normal distribution function (8.47) is difficult to treat mathematically, the following logistic model is used as an approximation:

$$P_i(\theta) = \frac{1}{1 + \exp(-Da_i(\theta - d_i))} = \frac{\exp(Da_i(\theta - d_i))}{1 + \exp(Da_i(\theta - d_i))}, \quad i = 1, 2, \ldots, p, \qquad (8.48)$$

where D is a constant. If we set D = 1.7, the above functions are good approximations of (8.47). For a_i = 2, d_i = 1, the two graphs are almost the same, as shown in Fig. 8.4a and b. Differentiating (8.48) with respect to θ, we have

$$\frac{d}{d\theta}P_i(\theta) = Da_iP_i(\theta)Q_i(\theta) \le Da_iP_i(d_i)Q_i(d_i) = \frac{Da_i}{4}, \quad i = 1, 2, \ldots, p. \qquad (8.49)$$

From this, the success probabilities for items i increase rapidly in neighborhoods of θ = d_i, i = 1, 2, ..., p (Fig. 8.5). As the parameters a_i increase, the success probabilities change more sharply around θ = d_i. In this sense, the parameters a_i express the discriminating powers of items i, i = 1, 2, ..., p. As shown in Fig. 8.6, as the difficulty parameter d increases, the success probabilities decrease for a fixed discrimination parameter a. Since the logistic models (8.48) are GLMs, we can handle model (8.48) more easily than the normal distribution model (8.47). In the next section, the information of tests for estimating latent traits is discussed, and ECD is applied for measuring the reliability of tests.
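The quality of the logistic approximation with D = 1.7 can be checked numerically, as in the following sketch (Python with NumPy/SciPy); the parameter values a = 2, d = 1 mirror Fig. 8.4.

```python
import numpy as np
from scipy.stats import norm

D = 1.7          # scaling constant
a, d = 2.0, 1.0  # discrimination and difficulty (as in Fig. 8.4)

theta = np.linspace(-3.0, 3.0, 601)
p_normal = norm.cdf(a * (theta - d))                      # normal-ogive model (8.47)
p_logistic = 1.0 / (1.0 + np.exp(-D * a * (theta - d)))   # logistic model (8.48)

# The maximum discrepancy over this range is small (on the order of 0.01)
print(np.max(np.abs(p_normal - p_logistic)))
```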
Fig. 8.4 a The standard normal distribution function (8.47) (a = 2, d = 1). b Logistic model (8.48) (a = 2, d = 1)

8.3.3 Information Functions and ECD

In item response models (8.48), the likelihood functions of latent common trait θ
given responses Xi are

$$L_i(\theta|X_i) = P_i(\theta)^{X_i} Q_i(\theta)^{1 - X_i}, \quad i = 1, 2, \ldots, p.$$

From this, the log-likelihood functions are given as follows:


Fig. 8.5 Logistic models for a = 0.5, 1, 2, 3; d = 1

Fig. 8.6 Logistic models for a = 2; d = 1, 1.5, 2, 2.5

$$l_i(\theta|X_i) = \log L_i(\theta|X_i) = X_i\log P_i(\theta) + (1 - X_i)\log Q_i(\theta), \quad i = 1, 2, \ldots, p.$$

Then, the Fisher information for estimating latent trait θ is computed as follows:

$$\begin{aligned}
I_i(\theta) &= E\!\left(\frac{d}{d\theta}l_i(\theta|X_i)\right)^2 = E\!\left(\frac{X_i}{P_i(\theta)}\frac{dP_i(\theta)}{d\theta} + \frac{1 - X_i}{Q_i(\theta)}\frac{dQ_i(\theta)}{d\theta}\right)^2\\
&= \frac{1}{P_i(\theta)}\left(\frac{dP_i(\theta)}{d\theta}\right)^2 + \frac{1}{Q_i(\theta)}\left(\frac{dQ_i(\theta)}{d\theta}\right)^2 = \frac{P_i'(\theta)^2}{P_i(\theta)Q_i(\theta)}\\
&= D^2 a_i^2 P_i(\theta)Q_i(\theta), \quad i = 1, 2, \ldots, p, \qquad (8.50)
\end{aligned}$$

where

$$P_i'(\theta) = \frac{dP_i(\theta)}{d\theta}, \quad i = 1, 2, \ldots, p.$$
In test theory, functions (8.50) are called item information functions, because the
Fisher information is related to the precision of the estimate of latent trait θ . Under
the assumption of local independence (8.43), from (8.50), the Fisher information of response X = (X_1, X_2, ..., X_p) is calculated as follows:

$$I(\theta) = \sum_{i=1}^{p} I_i(\theta) = \sum_{i=1}^{p}\frac{P_i'(\theta)^2}{P_i(\theta)Q_i(\theta)} = D^2\sum_{i=1}^{p} a_i^2 P_i(\theta)Q_i(\theta). \qquad (8.51)$$

The above function is referred to as the test information function, and the characteristics of a test can be discussed through it. Based on model (8.48), we have

$$P\bigl(X_1, X_2, \ldots, X_p = x_1, x_2, \ldots, x_p|\theta\bigr) = \prod_{i=1}^{p}\frac{\exp(x_iDa_i(\theta - d_i))}{1 + \exp(Da_i(\theta - d_i))} = \frac{\exp\bigl(D\sum_{i=1}^{p} x_ia_i(\theta - d_i)\bigr)}{\prod_{i=1}^{p}\bigl(1 + \exp(Da_i(\theta - d_i))\bigr)}. \qquad (8.52)$$

From (8.52), we get the following sufficient statistic:

$$V_{\mathrm{sufficient}} = \sum_{i=1}^{p} a_iX_i. \qquad (8.53)$$

The ML estimator θ̂ of latent trait θ is a function of the sufficient statistic V_sufficient, and from Fisher information (8.51), the asymptotic variance of the ML estimator θ̂ for response X = (X_1, X_2, ..., X_p) is computed as follows:

$$\mathrm{Var}\bigl(\hat{\theta}\bigr) \approx \frac{1}{\sum_{i=1}^{p} I_i(\theta)} = \frac{1}{D^2\sum_{i=1}^{p} a_i^2P_i(\theta)Q_i(\theta)}. \qquad (8.54)$$

In this sense, the score (8.53) is the best to measure the common latent trait θ .
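The sufficient score (8.53), the test information function (8.51), and the asymptotic variance (8.54) are easily computed for a given set of item parameters, as in the sketch below (Python/NumPy; the three example items are illustrative, not from the text).

```python
import numpy as np

D = 1.7
a = np.array([0.5, 1.0, 2.0])    # illustrative discrimination parameters
d = np.array([-1.0, 0.0, 1.0])   # illustrative difficulty parameters

def icc(theta, a, d):
    """Item characteristic functions P_i(theta) of the logistic model (8.48)."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - d)))

def test_information(theta):
    """Test information function I(theta), Eq. (8.51)."""
    P = icc(theta, a, d)
    return (D**2 * a**2 * P * (1.0 - P)).sum()

theta0 = 0.5
info = test_information(theta0)
print(info, 1.0 / info)          # I(theta0) and the asymptotic variance (8.54)

x = np.array([1, 0, 1])          # an observed response pattern
v_sufficient = (a * x).sum()     # sufficient score (8.53)
print(v_sufficient)
```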
For a large number of items p, the statistic (score) V (8.44) is asymptotically distributed according to the normal distribution N(Σ_{i=1}^p c_iP_i(θ), Σ_{i=1}^p c_i²P_i(θ)Q_i(θ)). The mean is increasing in latent trait θ; however, the variance is comparatively stable over a range including the item difficulties d_i, i = 1, 2, ..., p. From this, assuming the variance to be constant,

$$\sum_{i=1}^{p} c_i^2P_i(\theta)Q_i(\theta) = \sigma^2, \qquad (8.55)$$

the log-likelihood function based on statistic V is given by

$$l(\theta|V) = -\frac{1}{2}\log\bigl(2\pi\sigma^2\bigr) - \frac{1}{2\sigma^2}\left(v - \sum_{i=1}^{p} c_iP_i(\theta)\right)^2.$$

From this, it follows that

$$\frac{d}{d\theta}l(\theta|V) = \frac{1}{\sigma^2}\left(v - \sum_{i=1}^{p} c_iP_i(\theta)\right)\cdot D\sum_{i=1}^{p} a_ic_iP_i(\theta)Q_i(\theta).$$

Then, the Fisher information is calculated by

$$E\!\left(\frac{d}{d\theta}l(\theta|V)\right)^2 = \frac{1}{\sigma^2}D^2\left(\sum_{i=1}^{p} a_ic_iP_i(\theta)Q_i(\theta)\right)^2. \qquad (8.56)$$

Since

$$\sum_{i=1}^{p} a_ic_iP_i(\theta)Q_i(\theta) = \bigl(a_1\ a_2\ \cdots\ a_p\bigr)\begin{pmatrix} P_1(\theta)Q_1(\theta) & & 0\\ & \ddots & \\ 0 & & P_p(\theta)Q_p(\theta) \end{pmatrix}\begin{pmatrix} c_1\\ \vdots\\ c_p \end{pmatrix},$$

the above quantity can be viewed as an inner product between the vectors (a_1, a_2, ..., a_p) and (c_1, c_2, ..., c_p) with respect to the diagonal matrix

$$\mathrm{diag}\bigl(P_1(\theta)Q_1(\theta), P_2(\theta)Q_2(\theta), \ldots, P_p(\theta)Q_p(\theta)\bigr).$$

Hence, under the condition (8.55), Fisher information (8.56) is maximized by

ci = τ ai , i = 1, 2, . . . , p,
where τ is a constant. From (8.56), we have

$$\max_{c_i,\, i=1,2,\ldots,p} E\!\left(\frac{d}{d\theta}l(\theta|V)\right)^2 = D^2\sum_{i=1}^{p} a_i^2P_i(\theta)Q_i(\theta).$$

The above information is the same as that for Vsufficient (8.51). Comparing (8.54)
and (8.56), we have the following definition:

Definition 8.7 The efficiency of score V (8.44) relative to the sufficient statistic V_sufficient is defined by

$$\psi(\theta) = \frac{\left(\sum_{i=1}^{p} a_ic_iP_i(\theta)Q_i(\theta)\right)^2}{\sum_{i=1}^{p} a_i^2P_i(\theta)Q_i(\theta)\,\sum_{i=1}^{p} c_i^2P_i(\theta)Q_i(\theta)}.$$
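By the Cauchy–Schwarz inequality for the weighted inner product above, ψ(θ) ≤ 1, with equality when c_i ∝ a_i. A small sketch (Python/NumPy; the item parameters and weights are illustrative) evaluates ψ(θ) on a grid of trait values.

```python
import numpy as np

D = 1.7
a = np.array([0.5, 1.0, 2.0, 1.0, 0.5])    # illustrative discriminations
d = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # illustrative difficulties
c = np.ones_like(a)                        # weights of the simple sum score

def efficiency(theta):
    P = 1.0 / (1.0 + np.exp(-D * a * (theta - d)))
    w = P * (1.0 - P)                      # P_i(theta) Q_i(theta)
    num = (a * c * w).sum() ** 2
    den = (a**2 * w).sum() * (c**2 * w).sum()
    return num / den                       # psi(theta) <= 1

for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(t, round(efficiency(t), 3))
```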

The conditional probability functions of X_i, i = 1, 2, ..., p, given latent trait θ are given by

$$f_i(x_i|\theta) = \frac{\exp(x_iDa_i(\theta - d_i))}{1 + \exp(Da_i(\theta - d_i))}, \quad i = 1, 2, \ldots, p.$$

Hence, the above logistic models are GLMs, and we have

$$KL(X_i, \theta) = Da_i\mathrm{Cov}(X_i, \theta), \quad i = 1, 2, \ldots, p,$$

and

$$ECD(X_i, \theta) = \frac{KL(X_i, \theta)}{KL(X_i, \theta) + 1} = \frac{Da_i\mathrm{Cov}(X_i, \theta)}{Da_i\mathrm{Cov}(X_i, \theta) + 1}, \quad i = 1, 2, \ldots, p.$$

 From Theorem  8.2, the KL information concerning the association between X =


(X_1, X_2, ..., X_p) and θ is calculated as follows:

$$KL(X, \theta) = KL(V_{\mathrm{sufficient}}, \theta) = \mathrm{Cov}(DV_{\mathrm{sufficient}}, \theta) = \sum_{i=1}^{p} Da_i\mathrm{Cov}(X_i, \theta) = \sum_{i=1}^{p} KL(X_i, \theta).$$
i=1 i=1

Hence, the ECD in GLM (8.52) is computed by

$$ECD(X, \theta) = ECD(V_{\mathrm{sufficient}}, \theta) = \frac{KL(X, \theta)}{KL(X, \theta) + 1} = \frac{\sum_{i=1}^{p} KL(X_i, \theta)}{KL(X, \theta) + 1}. \qquad (8.57)$$

The above ECD can be called the entropy coefficient of reliability (ECR) of the test. Since θ is distributed according to N(0, 1), from model (8.48), we have

$$KL(X_i, \theta) = Da_i\mathrm{Cov}(X_i, \theta) = Da_iE(X_i\theta) = Da_i\int_{-\infty}^{+\infty}\frac{\theta\exp(Da_i(\theta - d_i))}{1 + \exp(Da_i(\theta - d_i))}\cdot\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\theta^2\right)d\theta, \quad i = 1, 2, \ldots, p.$$

Since

$$\lim_{a_i\to+\infty}\mathrm{Cov}(X_i, \theta) = \lim_{a_i\to+\infty}\int_{-\infty}^{+\infty}\frac{\theta\exp(Da_i(\theta - d_i))}{1 + \exp(Da_i(\theta - d_i))}\cdot\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\theta^2\right)d\theta = \int_{d_i}^{+\infty}\theta\cdot\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\theta^2\right)d\theta = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}d_i^2\right), \quad i = 1, 2, \ldots, p,$$

it follows that

$$\lim_{a_i\to+\infty}KL(X_i, \theta) = +\infty \iff \lim_{a_i\to+\infty}ECD(X_i, \theta) = 1, \quad i = 1, 2, \ldots, p.$$

8.3.4 Numerical Illustration

Table 8.8 shows the discrimination and difficulty parameters of nine artificial items for a simulation study (Test A). The nine item characteristic functions are illustrated in Fig. 8.7. In this case, according to KL(X_i, θ) and ECD(X_i, θ) (Table 8.8), the explanatory powers of latent trait θ for the responses to the nine items are moderate. According to Table 8.8, the KL information is maximized for d = 0. The entropy coefficient of reliability of this test is calculated by (8.57). Since

$$KL(X, \theta) = \sum_{i=1}^{9} KL(X_i, \theta) = 0.234 + 0.479 + \cdots + 0.234 = 6.356,$$

Table 8.8 An item response model with nine items and the KL information [Test (A)]
Item 1 2 3 4 5 6 7 8 9
ai 2 2 2 2 2 2 2 2 2
di −2 −1.5 −1 −0.5 0 0.5 1.0 1.5 2
KL(Xi , θ ) 0.234 0.479 0.794 1.075 1.190 1.075 0.794 0.479 0.234
ECD(Xi , θ) 0.190 0.324 0.443 0.518 0.543 0.518 0.443 0.324 0.190
Fig. 8.7 Item characteristic functions in Table 8.8

we have

$$ECD(X, \theta) = \frac{6.356}{6.356 + 1} = 0.864.$$

In this test, the test reliability can be considered high. Usually, the following test score is used for measuring latent trait θ:

$$V = \sum_{i=1}^{9} X_i. \qquad (8.58)$$

In Test A, the above score is a sufficient statistic of the item response model with
the nine items in Table 8.8 [Test (A)], and the information of the item response model
is the same as that of the above score. Then, the test characteristic function is given
by

$$T(\theta) = \frac{1}{9}\sum_{i=1}^{9} P_i(\theta). \qquad (8.59)$$

As shown in Fig. 8.8, the relation between latent trait θ and T (θ ) is almost linear.
The test information function of Test (A) is given in Fig. 8.9. The precision for
estimating the latent trait θ is maximized at θ = 0.
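The entries of Table 8.8 and the ECR of Test (A) can be recomputed by one-dimensional numerical integration of KL(X_i, θ) = D a_i E(X_iθ) over the standard normal density, as in the following sketch (Python with NumPy/SciPy); small differences from the printed values may arise from rounding.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

D = 1.7
a = np.full(9, 2.0)                       # discrimination parameters of Test (A)
d = np.arange(-2.0, 2.5, 0.5)             # difficulties -2, -1.5, ..., 2

def kl_item(ai, di):
    """KL(X_i, theta) = D a_i E(X_i theta) under theta ~ N(0, 1)."""
    integrand = lambda t: t * norm.pdf(t) / (1.0 + np.exp(-D * ai * (t - di)))
    val, _ = quad(integrand, -np.inf, np.inf)
    return D * ai * val

KL = np.array([kl_item(ai, di) for ai, di in zip(a, d)])
ECD_items = KL / (KL + 1.0)               # item-wise ECD(X_i, theta)
ECR = KL.sum() / (KL.sum() + 1.0)         # entropy coefficient of reliability (8.57)
print(KL.round(3), ECD_items.round(3), round(ECR, 3))
```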
Second, for Test (B) shown in Table 8.9, the ECDs are uniformly smaller than
those in Table 8.8, and the item characteristic functions are illustrated in Fig. 8.10.
In this test, ECD of Test (B) is

ECD(X, θ ) = 0.533
Fig. 8.8 Test characteristic function (8.59) of Test (A)

Fig. 8.9 Test information function of Test (A)

Table 8.9 An item response model with nine items and the KL information [Test (B)]
Item 1 2 3 4 5 6 7 8 9
ai 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
di −2 −1.5 −1 −0.5 0 0.5 1.0 1.5 2
KL(Xi , θ ) 0.095 0.116 0.135 0.148 0.152 0.148 0.135 0.116 0.095
ECD(Xi , θ) 0.087 0.104 0.119 0.129 0.132 0.129 0.119 0.104 0.087

and the ECD is less than that of Test (A). Comparing Figs. 8.9 and 8.11, the precision
of estimating latent trait in Test (A) is higher than that in Test (B), and it depends on
discrimination parameters ai .
Fig. 8.10 Item characteristic functions of Test (B)

Fig. 8.11 Test information function of Test (B)

Lastly, Test (C) with nine items in Table 8.10 is considered. The item characteristic
functions are illustrated in Fig. 8.12. According to KL information in Table 8.10,
items 3 and 7 have the largest predictive power on item responses. ECD in this test
is computed by
Table 8.10 An item response model with nine items and the KL information [Test (C)]
Item 1 2 3 4 5 6 7 8 9
ai 0.5 1 2 1 0.5 1 2 1 0.5
di −2 −1.5 −1 −0.5 0 0.5 1.0 1.5 2
KL(Xi , θ ) 0.095 0.260 0.794 0.442 0.152 0.442 0.794 0.260 0.095
ECD(Xi , θ) 0.087 0.207 0.443 0.306 0.132 0.306 0.443 0.207 0.087

Fig. 8.12 Item characteristic functions of Test (C)

ECD(X, θ ) = 0.769.

The test information function of Test (C) is illustrated in Fig. 8.13. Its shape is similar to that of the KL information plotted against the item difficulties d_i of Test (C) in Table 8.10 (Fig. 8.14). Since this test has nine items with discrimination parameters 0.5, 1, and
2, the sufficient statistic (score) is given by

Vsufficient = 0.5X1 + X2 + 2X3 + X4 + 0.5X5 + X6 + 2X7 + X8 + 0.5X9 . (8.60)

The above score is the best estimator of the latent trait among test scores of the form (8.44). Test score (8.58) is usually used; however, the efficiency of this test score relative to the sufficient score (8.60) is less than 0.9 over the range of latent trait θ, as illustrated in Fig. 8.15. Since the distribution of latent trait θ is assumed to be N(0, 1), about 95% of the latent traits lie in the range (−1.96, 1.96), i.e.,
Fig. 8.13 Test information function of Test (C)

Fig. 8.14 KL information KL(X_i, θ) and item difficulties d_i

Fig. 8.15 Efficiency of test score V (8.58) in Test (C)


P(−1.96 < θ < 1.96) ≈ 0.95.

Hence, in the range −1.96 < θ < 1.96, the precision for estimating θ with test score (8.58) is about 80% of that attained with the sufficient statistic (8.60), as shown in Fig. 8.15.
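The efficiency curve in Fig. 8.15 can be reproduced by evaluating ψ(θ) of Definition 8.7 with c_i = 1 (the simple sum score (8.58)) against the discrimination parameters of Test (C), as in the sketch below (Python/NumPy).

```python
import numpy as np

D = 1.7
a = np.array([0.5, 1, 2, 1, 0.5, 1, 2, 1, 0.5], dtype=float)   # Test (C) discriminations
d = np.arange(-2.0, 2.5, 0.5)                                  # Test (C) difficulties
c = np.ones(9)                                                 # weights of score (8.58)

def psi(theta):
    """Efficiency of the simple sum score relative to the sufficient score (8.60)."""
    P = 1.0 / (1.0 + np.exp(-D * a * (theta - d)))
    w = P * (1.0 - P)
    return (a * c * w).sum() ** 2 / ((a**2 * w).sum() * (c**2 * w).sum())

for t in np.linspace(-2.0, 2.0, 9):
    print(round(t, 2), round(psi(t), 3))   # stays below 0.9, cf. Fig. 8.15
```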

8.4 Discussion

In this chapter, latent structure models are considered in the framework of GLMs. In
factor analysis, the contributions of common factors have been measured through an
entropy-based path analysis. The contributions of common factors have been defined
as the effects of factors on the manifest variable vector. In latent trait analysis, for the
two-parameter logistic model for dichotomous manifest variables, the test reliability
has been discussed with ECD. It would be important to extend the discussion to the graded response model [17], the partial credit model [13], the nominal response model [3], and so on. Latent class analysis has not been treated in this chapter; however, an entropy-based discussion may also be possible for comparing latent classes. Further studies are needed to extend the present discussion.

References

1. Brown, M. N. (1974). Generalized least squares in the analysis of covariance structures. South
African Statistical Journal, 8, 1–24.
2. Bartholomew, D. J. (1987). Latent variable models and factor analysis. New York: Oxford
University Press.
3. Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored
in two or more nominal categories. Psychometrika, 37, 29–51.
4. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear
models. Computational Statistics & Data Analysis, 54, 1381–1389.
5. Eshima, N., Tabata, M., Borroni, C. G., & Kano, Y. (2015). An entropy-based approach to path
analysis of structural generalized linear models: A basic idea. Entropy, 17, 5117–5132.
6. Eshima, N., Borroni, C. G., & Tabata, M. (2016). Relative importance assessment of explanatory
variables in generalized linear models: An entropy-based approach. Statistics & Applications,
16, 107–122.
7. Eshima, N., Tabata, M., & Borroni, C. G. (2018). An entropy-based approach for measuring
factor contributions in factor analysis models. Entropy, 20, 634.
8. Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychome-
trika, 32, 443–482.
9. Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor
analysis. Psychometrika, 34, 183–202.
10. Jöreskog, K. G., & Goldberger, A. S. (1972). Factor analysis by generalized least squares.
Psychometrika, 37, 243–260.
11. Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. New York: Houghton-
Mifflin.
12. Lord, F. M. (1952). A theory of test scores (Psychometrika Monograph, No. 7). Psychometric
Corporation: Richmond.
References 257

13. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
14. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Illinois: The
University of Chicago Press.
15. Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika,
32, 443–482.
16. Robertson, D., & Symons, J. (2007). Maximum likelihood factor analysis with rank-deficient
sample covariance matrices. Journal of Multivariate Analysis, 98, 813–828.
17. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores
(Psychometric Monograph). Educational Testing Services: Princeton.
18. Shiba, S. (1991). Item response theory. Tokyo: Tokyo University.
19. Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201–293.
20. Sundberg, R., & Feldmann, U. (2016). Exploratory factor analysis-parameter estimation and
scores prediction with high-dimensional data. Journal of Multivariate Analysis, 148, 49–59.
21. Trendafilov, N. T., & Unkel, S. (2011). Exploratory factor analysis of data matrices with more
variables than observations. Journal of Computational and Graphical Statistics, 20, 874–891.
22. Thurstone, L. L. (1935). The vectors of mind: Multiple factor analysis for the isolation of primary traits. Chicago, IL: The University of Chicago Press.
23. Unkel, S., & Trendafilov, N. T. (2010). A majorization algorithm for simultaneous parameter
estimation in robust exploratory factor analysis. Computational Statistics & Data Analysis, 54,
3348–3358.
24. Young, A. G., & Pearce, S. (2013). A beginner’s guide to factor analysis: Focusing on
exploratory factor analysis. Quantitative Methods for Psychology, 9, 79–94.
