Professional Documents
Culture Documents
Nobuoki Eshima
Statistical Data
Analysis and
Entropy
Behaviormetrics: Quantitative Approaches
to Human Behavior
Volume 3
Series Editor
Akinori Okada, Graduate School of Management and Information Sciences,
Tama University, Tokyo, Japan
This series covers in their entirety the elements of behaviormetrics, a term that
encompasses all quantitative approaches of research to disclose and understand
human behavior in the broadest sense. The term includes the concept, theory,
model, algorithm, method, and application of quantitative approaches from
theoretical or conceptual studies to empirical or practical application studies to
comprehend human behavior. The Behaviormetrics series deals with a wide range
of topics of data analysis and of developing new models, algorithms, and methods
to analyze these data.
The characteristics featured in the series have four aspects. The first is the variety
of the methods utilized in data analysis and a newly developed method that includes
not only standard or general statistical methods or psychometric methods
traditionally used in data analysis, but also includes cluster analysis, multidimen-
sional scaling, machine learning, corresponding analysis, biplot, network analysis
and graph theory, conjoint measurement, biclustering, visualization, and data and
web mining. The second aspect is the variety of types of data including ranking,
categorical, preference, functional, angle, contextual, nominal, multi-mode
multi-way, contextual, continuous, discrete, high-dimensional, and sparse data.
The third comprises the varied procedures by which the data are collected: by
survey, experiment, sensor devices, and purchase records, and other means. The
fourth aspect of the Behaviormetrics series is the diversity of fields from which the
data are derived, including marketing and consumer behavior, sociology, psychol-
ogy, education, archaeology, medicine, economics, political and policy science,
cognitive science, public administration, pharmacy, engineering, urban planning,
agriculture and forestry science, and brain science.
In essence, the purpose of this series is to describe the new horizons opening up
in behaviormetrics—approaches to understanding and disclosing human behaviors
both in the analyses of diverse data by a wide range of methods and in the
development of new methods to analyze these data.
Editor in Chief
Akinori Okada (Rikkyo University)
Managing Editors
Daniel Baier (University of Bayreuth)
Giuseppe Bove (Roma Tre University)
Takahiro Hoshino (Keio University)
123
Nobuoki Eshima
Center for Educational Outreach
and Admissions
Kyoto University
Kyoto, Japan
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
In modern times, various kinds of data are gathered and cumulated in all scientific
research fields, practical business affairs, governments and so on. The computer
efficiency and data analysis methodologies have been developing, and it has been
extending the capacity to process big and complex data. Statistics provides
researchers and practitioners with useful methods to handle data for their purposes.
The modern statistics dates back to the early 20th century. The Student’s t statistic
by W. S. Gosset and the basic idea of experimental designs by R. A. Fisher have
been influencing great effects on the development of statistical methodologies.
On the other hand, information theory originates from C. E. Shannon’s paper,
“A Mathematical Theory in Communication”, in 1948. The theory is indispensable
for measuring uncertainty of events from information sources and systems of ran-
dom variables and for processing the data effectively in viewpoints of entropy. In
these days, interdisciplinary research domains have been increasing in order to
enhance noble studies to resolve complicated problems. The common tool for
statistics and information theory is “probability”, and the common aim is to deal with
information and data effectively. In this sense, thus, both theories have similar
scopes. The logarithm of probability is negative information; those of odds, and odds
ratios in statistics are relative information; that of the log likelihood function is the
asymptotic negative entropy. The author is a statistician and takes a standpoint in
Statistics; however, there are problems in statistical data analysis that are not able to
resolve in conventional views of statistics. In such cases, perspectives in entropy
may provide us with good clues and ideas to tackle the problems and to develop
statistical methodologies. The aim of this book is to elucidate how most of statistical
methods, e.g. correlation analysis, t, F, v2 statistics, can be interpreted in entropy and
to introduce entropy-based methods for analyzing data analysis, e.g. entropy-based
approaches to the analysis of association in contingency tables and path analysis
v
vi Preface
with generalized linear models, that may be useful tools for behavior-metric
researches. The author would like to expect this book to motivate readers, especially
young researchers, to grasp that “entropy” is one of utilities to deal with practical
data and study themes.
vii
Contents
ix
x Contents
1.1 Introduction
between the variables through the mutual information (entropy) and the conditional
entropy. In Sect. 1.7, the chi-square test is considered through the KL information.
Section 1.9 treats the information of continuous variables, and t and F statistics are
expressed through entropy.
1.2 Information
Let be a sample space; A ∈ be an event; and let P(A) the probability of event A.
The information of event A is defined mathematically according to the probability,
not the content itself. Smaller the probability of an event is, greater we feel its value.
Based on our intuition, the mathematical definition of information [6] is given by
1
I (A) = loga , (1.1)
P(A)
where a > 1
In which follows, the base of the logarithm is e and the notation (1.1) is simply
denoted by
1
I (A) = log .
P(A)
In this case, the unit is called “nat,” i.e., natural unit of information. If P(A) =
1, i.e., event A always occurs, then I (A) = 0 and it implies that event A has no
information. The information measure I (A) has the following properties:
(i) For events A and B, if P(A) ≥ P(B) > 0, the following inequality holds:
From this,
1 1
I (A ∩ B) = log = log
P(A ∩ B) P(A)P(B)
1.2 Information 3
1 1
= log + log = I (A) + I (B).
P(A) P(B)
Example 1.1 In a trial drawing a card from a deck of cards, let events A and B be
“ace” and “heart,” respectively. Then,
1 1 1
P(A ∩ B) = = × ,
52 4 13
1 1
P(A) = , P(B) = .
4 13
Since A ∩ B ⊂ A, B, corresponding to (1.2), we have
and the events are statistically independent, so we also have Eq. (1.3).
Remark 1.1 In information theory, base 2 is usually used for logarithm. Then, the
unit of information is referred to as “bit.” One bit is the information of an event with
probability 21 .
When we forget the last two columns of a phone number with 10 digits, e.g., 075-
753-25##, then there is nothing except that the right number can be restored with
1
probability 100 , if there is no information about the number. In this case, the loss of
the information about the right phone number is log 100. In general, the loss of the
information is defined as follows:
Definition 1.2 Let A and B be events such that A ⊂ B. Then, the loss of “information
concerning event A” by using or knowing event B for A is defined by
In the above definition, A ∩ B = A and from (1.5), the loss is also expressed by
1 1 P(B) 1
Loss(A|B) = log − log = log = log . (1.6)
P(A) P(B) P(A) P(A|B)
In the above example of a phone number, the true number, e.g., 097-753-2517, is
event A and the memorized number is event B, e.g., 097-753-25##. Then, P(A|B) =
1
100
.
4 1 Entropy and Basic Statistics
Example 1.2 Let X be a categorical random variable that takes values in sample
space = {0, 1, 2, . . . , I − 1} and let πi = P(X = i). Supposed that for integer
k > 0, the sample space is changed into ∗ = {0, 1} by
0 (X < k)
X∗ = .
1 (X ≥ k)
Example 1.3 In Table 1.1, categories 0 and 1 are reclassified into category 0 shown
in Table 1.2 and categories 2 and 3 into category 1. Then, the loss of information is
considered. Let A and B be events shown in Tables 1.1 and 1.2, respectively. Then,
we have P(A|B) = 15+1 1
× 20+1
1
= 336
1
, which is the conditional probability restoring
Table 1.1 from Table 1.2, and from (1.6) we have
1.4 Entropy
Let = {ω1 , ω2 , . . . , ωn } be a discrete sample space and let P(ωi ) be the proba-
bility of event ωi . The uncertainty of the sample space is measured by the mean of
information I (ωi ) = − log P(ωi ).
1.4 Entropy 5
n
n
H () = P(ωi )I (ωi ) = − P(ωi ) log P(ωi ). (1.7)
i=1 i=1
Remark 1.2 In the above definition, sample space and probability space (, P)
are identified.
Remark 1.3 In definition (1.7), if P(ωi ) = 0, then for convenience of the discussion
we set
H () = 0.
Example 1.4 Let X be a categorical random variable that takes values in sample
space = {0, 1, 2} with probabilities illustrated in Table 1.3. Then, the entropy of
is given by
1 4 1 2 1 4 3
H () = log + log + log = log 2.
4 1 2 1 4 1 2
In this case, the entropy is also referred to as that in X and is denoted by H (X ).
n
n
pi = qi = 1.
i=1 i=1
n
n
pi log pi ≥ pi log qi . (1.8)
i=1 i=1
log x ≤ x − 1.
n
n
n
qi
pi log qi − pi log pi = pi log
i=1 i=1 i=1
pi
n
qi
≤ pi − 1 = 0.
i=1
pi
n
H ( p) = − pi log pi .
i=1
n
n
H ( p) = − pi log pi ≤ − pi log qi . (1.9)
i=1 i=1
n
H ( p) = − pi log pi ≤ log n.
i=1
Hence, from Theorem 1.1, entropy H ( p) is maximized at p = n1 , n1 , . . . , n1 , i.e.,
uniform distribution. The following entropy is referred to as the Kullback–Leibler
(KL) information or divergence [4]:
n
pi
D( p q) = pi log . (1.10)
i=1
qi
From Theorem 1.1, (1.10) is nonnegative and takes 0 if and only if p = q. When
the entropy of distribution q for distribution p is defined by
n
H p (q) = − pi log qi ,
i=1
1.4 Entropy 7
we have
D( p q) = H p (q) − H ( p).
D( p q) = D(q p).
Example 1.5 Let Y be a categorical random variable that has the probability distribu-
tion shown in Table 1.4. Then, random variable X in Example 1.4 and Y are compared.
Let the distributions in Tables 1.3 and 1.4 are denoted by p and p, respectively. Then,
we have
1 1 4 1 1 1 4
D( p q) = log2 + log + log = log ≈ 0.144,
4 2 3 4 2 2 3
1 1 3 3 1 3 3
D(q p) = log + log + log2 = log ≈ 0.152.
8 2 8 4 2 8 2
In which follows, distributions and the corresponding random variables are iden-
tified, the KL information is also denoted by using the variables, e.g., in Example
1.5 KL information D( p q) and D(q p) are expressed as D(X Y ) and D(Y X ),
respectively, depending on the occasion.
H ( p) ≥ H (q). (1.11)
q = ( p1 + p2 , p3 . . . , pn ).
Then,
1 1 1
H ( p) − H (q) = p1 log + p2 log − ( p1 + p2 )log
p1 p2 p1 + p2
p1 + p2 p1 + p2
= p1 log + p2 log ≥ 0.
p1 p2
8 1 Entropy and Basic Statistics
In this case, theorem holds true. The above proof can be extended into a general
case and the theorem follows.
In the above theorem, the mean loss of information by combining categories of
distribution p is given by
H ( p) − H (q).
Example 1.6 In a large and closed human population, if the marriage is random, a dis-
tribution of genotypes is stationary. In ABO blood types, let the ratios of genes A, B,
and O be p, q, and r, respectively, where p+q +r = 1. Then, the probability distribu-
tion of genotypes AA, AO, BB, BO, AB, and OO is u = p 2 , 2 pr, q 2 , 2qr, 2 pq, r 2 .
We usually observe phenotypes A, B, AB, and O, which correspond with genotypes
“AA or AO,” “BB or BO,” AB, and OO, respectively, and the phenotype probabil-
ity distribution is v = p 2 + 2 pr, q 2 + 2qr, 2qr, r 2 . In this case, the mean loss of
information is given as follows:
p 2 + 2 pr p 2 + 2 pr
H (u) − H (v) = p 2 log 2
+ 2 pr log
p 2 pr
q + 2qr
2
q + 2qr
2
+ q 2 log + 2qr log
q2 2qr
p + 2r p + 2r
= p 2 log + 2 pr log
p 2r
q + 2r q + 2r
+ q 2 log + 2qr log .
q 2r
Summing up the above inequality with respect to i and j, we have inequality (1.11).
I
J
H (X, Y ) = − πi j log πi j .
i=1 j=1
H (X, Y ) ≤ H (X ) + H (Y ). (1.12)
I
J
I
J
H (X, Y ) = − πi j log πi j ≤ − πi j log(πi+ π+ j )
i=1 j=1 i=1 j=1
I
J
I
J
=− πi j log πi+ − πi j log π+ j = H (X ) + H (Y ).
i=1 j=1 i=1 j=1
Hence, the inequality (1.12) follows. The equality holds if and only if πi j =
πi+ π+ j , i = 1, 2, . . . , I ; j = 1, 2, . . . , J , i.e., X and Y are statistically indepen-
dent.
From the above theorem, the following entropy is interpreted as that reduced due
to the association between variables X and Y:
H (X ) + H (Y ) − H (X, Y ). (1.13)
Stronger the strength of the association is, greater the entropy is. The image of
the entropies H (X ), H (Y ), and H (X, Y ) is illustrated in Fig. 1.1. Entropy H (X ) is
expressed by the left ellipse; H (Y ) by the right ellipse; and H (X, Y ) by the union
of the two ellipses.
From Theorem 1.3, a more general proposition can be derived. We have the
following corollary:
Corollary 1.1 Let X 1 , X 2 , . . . , X K be categorical random variables. Then, the
following inequality folds true.
K
H (X 1 , X 2 , . . . , X K ) ≤ H (X k ).
k=1
The equality holds if and only if the categorical random variables are statistically
independent.
Proof From Theorem 1.3, we have
H ((X, X 2 , . . . , X K −1 ), X K ) ≤ H (X, X 2 , . . . , X K −1 ) + H (X K ).
1
H (X ) = 4 × log 4 = 2 log 2 ≈ 1.386,
4
3 16 5 16 3 16 5 16
H (Y ) = 2 × log +2× log = log + log
16 3 16 5 8 3 8 5
3 5
= 4 log 2 − log 3 − log 5 ≈ 1.355,
8 8
1 1 13
H (X, Y ) = 6 × log8 + 4 × log16 = log2 ≈ 2.253.
8 16 4
From this, the reduced entropy through the association between X and Y is
calculated as:
P(X = i, Y = j)
P(Y = j|X = i) = .
P(X = i)
J
H (Y |X = i) = − P(Y = j|X = i) log P(Y = j|X = i)
j=1
I
J
H (Y |X ) = − P(X = i, Y = j) log P(Y = j|X = i). (1.14)
i=1 j=1
With respect to the conditional entropy, we have the following theorem [6]:
Theorem 1.5 Let H (X, Y ) and H (X ) be the joint entropy of (X, Y ) and X,
respectively. Then, the following equality holds true:
H (Y |X ) = H (X, Y ) − H (X ). (1.15)
I
J
P(X = i, Y = j)
H (Y |X ) = − P(X = i, Y = j) log
i=1 j=1
P(X = i)
I
J
=− P(X = i, Y = j) log P(X = i, Y = j)
i=1 j=1
I
J
+ P(X = i, Y = j) log P(X = i)
i=1 j=1
I
= H (X, Y ) − P(X = i) log P(X = i) = H (X, Y ) − H (X ).
i=1
Theorem 1.6 For entropy H (Y ) and the conditional entropy H (Y |X ), the following
inequality holds:
H (Y |X ) ≤ H (Y ). (1.16)
Proof Adding (1.12) and (1.15) for the right and left sides, we have
H (Y ) − H (Y |X ) = H (Y ) − (H (X, Y ) − H (X ))
= H (X ) + H (Y ) − H (Y |X ) ≥ 0.
Hence, the inequality follows. From Theorem 1.3, the equality of (1.16) holds true
if and only if X and Y are statistically independent. This completes the theorem.
From the above theorem, the following entropy is interpreted as that of Y explained
by X:
I M (X, Y ) = H (Y ) − H (Y |X ). (1.17)
I M (X, Y ) = I M (Y, X ).
Thus, I M (X, Y ) is the entropy reduced by the association between the variables
and is referred to as the mutual information. Moreover, we have the following
theorem:
H (X, Y ) ≤ H (X ) + H (Y ). (1.19)
The equality holds true if and only if X and Y are statistically independent.
I
J
P(X = i, Y = j)
I M (X, Y ) = P(X = i, Y = j)log .
i=1 j=1
P(X = i)P(Y = j)
1.6 Conditional Entropy 13
From the above theorems, entropy measures are illustrated as in Fig. 1.2.
Theil [8] used entropy to measure the association between independent variable X
and dependent variable Y. With respect to mutual information (1.18), the following
definition is given [2, 3].
Definition 1.6 The ratio of I M (X, Y ) to H (Y ), i.e.,
I M (X, Y ) H (Y ) − H (Y |X )
C(Y |X ) = = . (1.20)
H (Y ) H (Y )
0 ≤ C(Y |X ) ≤ 1.
Example 1.8 Table 1.6 shows the joint probability distribution of categorical vari-
ables X and Y. In this table, the following entropy measures of the variables are
computed:
1
H (X, Y ) = 8 × log 8 = 3 log 2,
8
1 1
H (X ) = 4 × log 4 = 2 log 2, H (Y ) = 4 × log 4 = 2 log 2.
4 4
From the above entropy measures, we have
log 2 1
C(Y |X ) = = .
2 log 2 2
H (Y |X 1 , X 2 , . . . , X i , X i+1 ) ≤ H (Y |X 1 , X 2 , . . . , X i ). (1.21)
The equality holds if and only if X i+1 and Y are conditionally independent, given
(X 1 , X 2 , . . . , X i ).
Proof It is sufficient to prove the case with i = 1. As in Theorem 1.5, from (1.12)
and (1.15) we have
H (X 2 , Y |X 1 ) ≤ H (X 2 |X 1 ) + H (Y |X 1 ),
H (Y |X 1 , X 2 ) = H (X 2 , Y |X 1 ) − H (X 2 |X 1 ).
By adding both sides of the above inequality and equation, it follows that
H (Y |X 1 , X 2 ) ≤ H (Y |X 1 ).
1 − H (Y |X 1 , X 2 , . . . , X i , X i+1 )
C(Y |X 1 , X 2 , . . . , X i , X i+1 ) =
H (Y )
1 − H (Y |X 1 , X 2 , . . . , X i )
≥ = C(Y |X 1 , X 2 , . . . , X i ).
H (Y )
Example 1.9 The joint probability distribution of X 1 , X 2 , and Y is given in Table 1.7.
Let us compare the entropy measures H (Y ), H (Y |X 1 ), and H (Y |X 1 , X 2 ). First,
H (Y ), H (X 1 ), H (X 1 , X 2 ), H (X 1 , Y ), and H (X 1 , X 2 , Y ) are calculated. From
Table 1.7, we have
5 5 11 11
H (Y ) = − log − log ≈ 0.621,
16 16 16 16
1 1 3 3
H (X 1 ) = − log − log ≈ 0.563,
4 4 4 4
1 1 1 1 5 5 7 7
H (X 1 , X 2 ) = − log − log − log − log ≈ 1.418.
4 4 4 4 16 16 16 16
1 1 1 1 3 3 9 9
H (X 1 , Y ) = − log − log − log − log ≈ 1.157,
8 8 8 8 16 16 16 16
3 3 1 1 1 1
H (X 1 , X 2 , Y ) = − log × 2 − log × 2 − log
32 32 32 32 16 16
1 1 1 1 5 5
− log − log − log ≈ 1.695.
8 8 4 4 16 16
From the above measures of entropy, we have the following conditional entropy
measures:
From the results, the following inequality in Theorems 1.6 and 1.8 hold true:
H (Y |X 1 , X 2 ) < H (Y |X 1 ) < H (Y ).
Moreover, we have
16 1 Entropy and Basic Statistics
H (X 1 , X 2 , . . . , X n ) = H (X 1 ) + H (X 2 |X 1 ) + . . . + H (X n |X 1 , X 2 , · · · , X n−1 ).
H (X 1 , X 2 , . . . , X n ) = H (X 1 , X 2 , . . . , X n−1 ) + H (X n |X 1 , X 2 , . . . , X n−1 ).
I
ni
G 21 = 2 n i log . (1.22)
i=1
N pi
I ni
I
ni
pi
G 21 = 2N log N
= 2N pi log = 2N · D( p p), (1.23)
i=1
N pi i=1
pi
1
logx ≈ (x − 1) − (x − 1)2 .
2
For large sample size N, it follows that
pi
≈ 1 (in probability).
pi
I I 2
pi 1 pi
≈ −2N
G 21 pi log = 2N
pi · −1
i=1
pi i=1
2 pi
I 2 I 2
I
pi pi − pi (n i − N pi )2
=N pi −1 ≈ N
= .
i=1
pi i=1
pi i=1
N pi
Let us set
I
(n i − N pi )2
X2 = . (1.24)
i=1
N pi
The above statistics are a chi-square statistic [5]. Hence, the log-likelihood ratio
statistic R12 is asymptotically equal to the chi-square statistics (1.24). For given dis-
tribution p, the statistic (1.22) is asymptotically distributed according a chi-square
distribution with degrees of freedom I − 1, as N → ∞.
Let us consider the following statistic similar to (1.22):
I
N pi
G 22 = 2 N pi log .
i=1
ni
I
pi
G 22 = 2N pi log
= 2N · D( p p), (1.25)
i=1
pi
1
Table 1.10 Data produced with distribution q = 6 , 12 , 4 , 4 , 6 , 12
1 1 1 1 1
X 1 2 3 4 5 6 Total
Frequency 12 8 24 23 22 11 100
I
I
2 I
pi pi (n i − N pi )2
G 22 = −2N pi log ≈N pi −1 = .
i=1
pi i=1
pi i=1
N pi
G 21 ≈ G 22 ≈ X 2 (in probability)
Example 1.10 For uniform distribution p = 16 , 16 , 16 , 16 , 16 , 16 , data with sample size
100 were made and the results are shown in Table 1.9. From this table, we have
The three statistics are asymptotically equal and the degrees of freedom in the
chi-square distribution is 5, so the results are not statistically significant.
Example
1 1 1 1.11 with sample size 100 were produced by a distribution q =
Data
, , , , ,
1 1 1
and the data are shown in Table 1.10. We compare the data
6 12 4 4 6 12
with uniform distribution p = 16 , 16 , 16 , 16 , 16 , 16 . In this artificial example, the
null hypothesis H0 is p = 16 , 16 , 16 , 16 , 16 , 16 . Then, we have
Since the degrees of freedom is 5, the critical point of significance level 0.05 is
11.1 and the three statistics lead to the same result, i.e., the null hypothesis is rejected.
I
l(θ1 , θ2 , . . . , θ K ) = n i log pi (θ1 , θ2 , . . . , θ K ). (1.26)
i=1
The maximum likelihood (ML) estimation is carried out by maximizing the above
I
ni
l θ1 , θ2 , . . . , θ K = N log pi θ1 , θ2 , . . . , θ K .
i=1
N
Definition 1.7 The negative data entropy (information) in Table 1.11 is defined by
I
ni
ldata = n i log . (1.27)
i=1
N
Since
I
ni ni
ldata = N log ,
i=1
N N
I
ni
ni
G (θ ) = 2 ldata − l(θ ) = 2N
2
log N ,
(1.28)
i=1
N pi (θ )
where
θ = θ1 , θ2 , . . . , θ K .
i 1
= ,
,..., and pi (θ ) = p1 (θ ), p2 (θ ), . . . , p K (θ ) .
N N N N
Then, the log-likelihood ratio test statistic is described as follows:
n
i
G 2 (θ ) = 2N × D
pi (θ ) . (1.29)
N
Hence, from (1.29) the ML estimation minimizes the difference between data nNi
and model ( pi (θ )) in terms of the KL information, and statistic (1.29) is used for
testing the goodness-of-fit of model ( pi (θ)) to the data. Under the null hypothesis H0 :
model ( pi (θ )), statistic (1.29) is asymptotically distributed according to a chi-square
distribution with degrees of freedom I − 1 − K [1].
Let X be a continuous random variable that has density function f (x). As in the
above discussion, we define the information of X = x by
1
log .
f (x)
Since density f (x) is not a probability, the above quantity does not have a meaning
of information in a strict sense; however, in an analogous manner the entropy of
continuous variables is defined as follows [6]:
Definition 1.8 The entropy of continuous variable X with density function f (x) is
defined by:
H (X ) = − f (x)log f (x)dx. (1.30)
Remark 1.4 For continuous distributions, the entropy (1.30) is not necessarily
positive.
1
log
f (x)dx
is the information of event x ≤ X < x + dx. Let us compare two density functions
f (x) and g(x). The following quantity is the relative information of g(x) for f (x):
1
g(x)dx f (x)
log = log .
1
f (x)dx
g(x)
The above entropy is interpreted as the loss of information of the true distribution
f (x) when g(x) is used for f (x). Although the entropy (1.30) is not scale-invariant,
the KL information (1.31) is scale-invariant. With respect to the KL information
(1.31), the following theorem holds:
Theorem 1.10 For density function of f (x) and g(x), D( f g) ≥ 0. The equality
holds if and only if f (x) = g(x) almost everywhere.
Proof Since function log x is convex, from (1.31), we have
g(x) g(x)
D( f g) = − f (x) log dx ≥ − log f (x)dx.
f (x) f (x)
= − log g(x)dx = − log 1 = 0.
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
0 2 4 6 8 10
Example 1.12 Let f (x) and g(x) be normal density functions of N (μ1 , σ12 ) and
N(μ2 , σ22 ), respectively. For σ12 = σ22 = σ 2 = 1, μ1 = 1 and μ2 = 4, the graphs of
the density functions are illustrated as in Fig. 1.3. Since in general we have
f (x) 1 σ2 (x − μ1 )2 (x − μ2 )2
log = log 22 − + ,
g(x) 2 σ1 σ12 σ22
g(x) 1 σ12 (x − μ2 )2 (x − μ1 )2
log = log 2 − + .
f (x) 2 σ2 σ22 σ12
it follows that
1 σ2 σ2 (μ1 − μ2 )2
D( f g) = log 22 + 12 + − 1 , (1.32)
2 σ1 σ2 σ22
1 σ2 σ2 (μ1 − μ2 )2
D(g f ) = log 12 + 22 + − 1 . (1.33)
2 σ2 σ1 σ12
(μ1 − μ2 )2
D( f g) = D(g f ) = . (1.34)
2σ 2
The above KL information is increasing in difference |μ1 − μ2 | and decreasing
in variance σ 2 .
where
m n 2 m n 2
i−1 Xi i−1 Yi Xi − X i=1 i=1 Yi − Y
X= ,Y = = , U12 , U2 =
2
m n m−1 n−1
X −Y (m + n)(m + n − 2)
t= .
(m − 1)U 2 + (n − 1)U 2 mn
1 2
(m − 1)U12 + (n − 1)U22
U2 = ,
m+n−2
(X − Y )2 m + n m+n
t2 = 2
· = 2D( f g) · ,
U mn mn
where
(X − Y )2
D( f g) = ,
2U 2
Example 1.13 Let density functions f (x) and g(x) be normal density functions with
variances σ12 and σ22 , respectively, where the means are the same, μ1 = μ2 = μ,
e.g., Figure 1.4. From (1.32) and (1.33), we have
1 σ2 σ2 1 σ2 σ2
D( f g) = log 22 − 1 + 12 , D(g f ) = log 12 − 1 + 22 .
2 σ1 σ2 2 σ2 σ1
In this case,
D( f g) = D(g f ).
24 1 Entropy and Basic Statistics
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
0 2 4 6 8 10
Let
σ2 1 1
x = 22 , D( f g) = log x − 1 + .
σ1 2 x
U12
F= , (1.35)
U22
4
3.5
3
2.5
2
1.5
1
0.5
0
0 2 4 6 8 10
σ22
Fig. 1.5 Graph of D( f g) = 1
2 log x − 1 + 1
x ,x = σ12
1.9 Continuous Variables and Entropy 25
where
m n m 2 n 2
i−1 Xi i−1 Yi i=1 Xi − X i=1Yi − Y
X= ,Y = , U12 = , U22 = .
m n m−1 n−1
1 U2 U22 1 1
D( f g) = log 12 − 1 + = log F − 1 + ,
2 U2 U12 2 F
1 U2 U12 1
D(g f ) = log 22 − 1 + = (− log F − 1 + F)
2 U1 U22 2
Although the above KL information measures are not symmetrical with respect
to statistics F and F1 , the following measure is a symmetric function of F and F1 :
1 1
D( f g) + D(g f ) = F + −2 .
2 F
Example 1.14 Let f (x) and g(x) be the following exponential distributions:
Since
f (x) λ1
log = log − λ1 x + λ2 x,
g(x) λ2
g(x) λ2
log = log − λ2 x + λ1 x,
f (x) λ1
we have
λ1 λ2 λ2 λ1
D( f g) = log −1+ and D(g f ) = log − 1 + .
λ2 λ1 λ1 λ2
Example 1.15 (Property of the Normal Distribution in Entropy) [6] Let f (x) be
a density function of continuous random variable X that satisfy the following
conditions:
x f (x)dx = σ and
2 2
f (x)dx = 1. (1.36)
Then, the entropy is maximized with respect to density function f (x). For
Lagrange multipliers λ and μ, we set
26 1 Entropy and Basic Statistics
d
w = log f (x) + 1 + λx 2 + μ = 0.
d f (x)
d
w = log f (x) + 1 + λx 2 + μ = 0.
d f (x)
Hence, the normal distribution has the maximum of entropy given the variance.
Example 1.16 Let f (x) be a continuous random variable on finite interval (a, b).
The function that maximizes the entropy
b
H (X ) = − f (x) log f (x)dx.
a
b
f (x)dx = 1.
a
d
w = log f (x) + 1 + λ = 0.
d f (x)
From this,
1
f (x) = .
b−a
As discussed in Sect. 1.4, entropy (1.7) for the discrete sample space or variable
is maximized for the uniform distribution P(ωi ) = n1 , i = 1, 2, . . . , n. The above
result for the continuous distribution is similar to that for the discrete case.
1.10 Discussion
Information theory and statistics use “probability” as an important tool for measuring
uncertainty of events, random variables, and data processed in communications and
statistical data analysis. Based on the common point, the present chapter has given
a discussion for taking a look at “statistics” from a view of “entropy.” Reviewing
information theory, the present chapter has shown that the likelihood ratio statistics, t
and F statistics can be expressed with entropy. In order to make novel and innovative
methods for statistical data analysis, it is one of the good approaches to take ways
from information theory, and there are possibilities to resolve problems in statistical
data analysis by using information theory.
References
2.1 Introduction
properties of the maximum likelihood estimator of the odds ratio are discussed in
entropy. It is shown the Pearson chi-square statistic for testing the independence of
binary variables is related to the square of the entropy correlation coefficient and
the log-likelihood ratio test statistic the KL information. Section 2.5 considers the
association in general two-way contingency tables based on odds ratios and the log-
likelihood ratio test statistic. In Sect. 2.6, the RC(M) association model, which was
designed for analyzing the association in general two-way contingency tables, is con-
sidered in view of entropy. Desirable properties of the model are discussed, and the
entropy correlation coefficient is expressed by using the intrinsic association param-
eters and the correlation coefficients of scores assigned to the categorical variables
concerned.
Let X and Y be binary random variables that take 0 or 1. The responses 0 and 1 are
formal expressions of categories for convenience of notations. In real data analyses,
responses are “negative” and “positive”, “agreement” and “disagreement”, “yes” or
“no”, and so on. Table 2.1 shows the joint probabilities of the variables. The odds of
Y = 1 instead of Y = 0 given X = i, i = 1, 2 are calculated by
Hence, odds ratio is the ratio of the cross products in Table 2.1 and symmetric
for X and Y , and the odds ratio is a basic association measure to analyze categorical
data.
Remark 2.1 With respect to odds ratio expressed in (2.2) and (2.3), if variable X
precedes Y , or X is a cause and Y is effect, the first expression of odds ratio (2.2)
implies a prospective expression, and the second expression (2.3) is a retrospective
expression. Thus, the odds ratio is the same for both samplings, e.g., a prospective
sampling in a clinical trial and a retrospective sampling in a case-control study.
1
log = (logP(Y = 1|X = 1) − logP(Y = 0|X = 1))
0
− (logP(Y = 1|X = 0) − logP(Y = 0|X = 0))
1 1
= log − log
P(Y = 0|X = 1) P(Y = 1|X = 1)
1 1
− log − log , (2.4)
P(Y = 0|X = 0) P(Y = 1|X = 0)
1
log = (logP(X = 1|Y = 1) − logP(X = 0|Y = 1))
0
− (logP(X = 1|Y = 0) − logP(X = 0|Y = 0)). (2.5)
32 2 Analysis of the Association in Two-Way Contingency Tables
From this expression, the above log odds ratio can be interpreted as the effect of
Y on X , if X is a response variable and Y is explanatory variable. From (2.3), the
third expression of the log odds ratio is made as follows:
1
log = (logP(X = 1, Y = 1) + logP(X = 0, Y = 0))
0
− (logP(X = 1, Y = 0) + logP(X = 0, Y = 1))
1 1
= log + log
P(X = 1, Y = 0) P(X = 0, Y = 1)
1 1
− log + log . (2.6)
P(X = 0, Y = 0) P(X = 1, Y = 1)
If X and Y are response variables, from (2.6), the log odds ratio is the difference
between information of discordant and concordant responses in X and Y , and it
implies a measure of the association between the variables.
Theorem 2.1 For binary random variables X and Y , the necessary and sufficient
condition that the variables are statistically independent is
1
= 1. (2.7)
0
ai
P(X = i) = , i = 0, 1.
a+1
and
a
P(X = 1)P(Y = 1) = (P(X = 0, Y = 1) + P(X = 1, Y = 1))
a+1
a
= (P(X = 0, Y = 1) + a P(X = 0, Y = 1))
a+1
= a P(X = 0, Y = 1) = P(X = 1, Y = 1).
2.2 Odds, Odds Ratio, and Relative Risk 33
Similarly, we have
Remark 2.2 The above discussion has been made under the assumption that variables
X and Y are random. As in (2.4), the interpretation is also given in terms of the
conditional probabilities. In this case, variable X is an explanatory variable, and Y
is a response variable. If variable X is a factor, i.e., non-random, odds are made
according to Table 2.2. For example, X implies one of two groups, e.g., “control”
and “treatment” groups in a clinical trial. In this case, the interpretation can be given
as the effect of factor X on response variable Y in a strict sense.
1
· 5
5
OR2 = 8 16
= .
7
8
· 11
16
77
1
OR1 = .
OR2
logOR1 = −logOR2 .
Next, the relative risk is considered. In Table 2.2, risks with respect to response
Y are expressed by the conditional probabilities P(Y = 1|X = i), i = 0, 1.
P(Y = 1|X = 1)
RR = . (2.10)
P(Y = 1|X = 0)
and
RR ≥ 1 ⇔ logRR ≥ 0
Then, the effect of the factor on the risk is positive. From (2.2), we have
If risks P(Y = 1|X = 0) and P(Y = 1|X = 1) are small in Table 2.2, it follows
that
OR ≈ RR.
1
0.05 1 0.25 1 1 19
0 = = , 1 = = , OR = = 3
= ≈ 6.33.
0.95 19 0.75 3 0 1
19
3
From the odds ratio, it may be thought intuitively that the effect of X on Y is
strong. The effect measured by the relative effect is given by
0.25
RR = = 5.
0.05
The result is similar to that of odds ratio.
Example 2.3 In estimating the odds ratio and the relative risk, it is critical to take the
sampling methods into consideration. For example, flu influenza vaccine efficacy is
examined through a clinical trial using vaccination and control groups. In the clinical
trial, both of the odds ratio and the relative risk can be estimated as discussed in this
section. Let X be factor taking 1 for “vaccination” and 0 for “control” and let Y be a
state of infection, 1 for “infected” and 0 for “non-infected.” Then, P(Y = 1|X = 1)
and P(Y = 1|X = 0) are the infection probabilities (ratios) of vaccination and con-
trol groups, respectively. Then, the reduction ratio of the former probability (risk)
for the latter is assessed by
P(Y = 1|X = 0) − P(Y = 1|X = 1)
efficacy = × 100 = (1 − RR) × 100,
P(Y = 1|X = 0)
P(Y = 1|X = 1)
RR = .
P(Y = 1|X = 0)
where OR is
P(X =1|Y =0)
1−P(X =1|Y =0)
OR = P(X =1|Y =1)
.
1−P(X =1|Y =1)
1
1
1
1
λiX πi+ = λYj π+ j = ψi j πi+ = ψi j π+ j = 0. (2.13)
i=0 j=0 i=0 j=0
1
1
1
πi+ π+ j logπi j = λ, π+ j logπi j = λ + λiX
i=0 j=0 j=0
1
πi+ logπi j = λ + λYj .
i=0
Therefore, we get
1
1
1
1
ψi j = logπi j + π+ j logπi j + πi+ logπi j − πi+ π+ j logπi j . (2.14)
j=0 i=0 i=0 j=0
The matrix has information of the association between the variables. For a link to
a discussion in continuous variables, (2.12) is re-expressed as follows:
2.3 The Association in Binary Variables 37
where
p(x, y) = πx y .
where
where
α = λ + ψ00 , β(x) = λxX + (ψ01 − ψ00 )x, γ (y) = λYy + (ψ10 − ψ00 )y.
(2.17)
p(x, y) p(x0 , y0 )
log = ϕ(x − x0 )(y − y0 ). (2.18)
p(x0 , y) p(x, y0 )
p(1, 1) p(0, 0)
log = ϕ.
p(0, 1) p(1, 0)
and the exponential of the above quantity is referred to as a generalized odds ratio.
Remark 2.3 In binary variables X and Y , it follows that μ X = π1+ and μY = π+1 .
Definition 2.4 The entropy covariance between X and Y is defined by the expectation
of (2.19) with respect to X and Y , and it is denoted by ECov(X, Y ) :
1
1
πi j
1 1
πi+ π+ j
ECov(X, Y ) = πi j log + πi+ π+ j log ≥ 0. (2.21)
i=0 j=0
π π
i+ + j i=0 j=0
πi j
The equality holds true if and only if variables X and Y are statistically inde-
pendent. From this, the entropy covariance is an entropy
measure that explains the
difference between distributions (πi j ) and πi+ π+ j . In (2.20), ECov(X, Y ) is also
viewed as an inner product between X and Y with respect to association parameter
ϕ. From the Cauchy inequality, we have
0 ≤ ECov(X, Y ) ≤ |ϕ| Var(X ) Var(Y ). (2.22)
for binary variables, it can be thought that the correlation coefficient measures the
degree of response concordance or covariation of the variables, and from (2.20) and
(2.22), we have
ECov(X, Y ) ϕCov(X, Y ) ϕ
0≤ √ √ ≤ √ √ = Corr(X, Y )
|ϕ| Var(X ) Var(Y ) |ϕ| Var(X ) Var(Y ) |ϕ|
= |Corr(X, Y )|.
Definition 2.5 The entropy correlation coefficient (ECC) between binary variables
X and Y is defined by
Theorem 2.2 For binary random variables X and Y , the following statements are
equivalent:
Proof From Definition 2.5, Proposition (ii) and (iii) are equivalent. From Theo-
rem 2.1, Proposition (i) is equivalent to
π11 π22
OR = = 1 ⇔ π11 π22 − π12 π21 = 0 ⇔ Corr(X, Y ) = 0.
π12 π21
Example 2.4 In Table 2.5a, the joint distribution of binary variables X and Y is given.
In this table, the odds (2.1) and odds ratio (2.3) are calculated as follows:
3
0.1 1 0.3 3 1
0 = = , 1 = = , = 2
= 6.
0.4 4 0.2 2 0 1
4
From the odds ratio, it may be thought intuitively that the association between
the two variables are strong in a sense of odds ratio, i.e., the effect of X on Y or that
of Y on X . For example, X is supposed to be a smoking state, where “smoking” is
denoted by X = 1 and “non-smoking” by X = 0, and Y be the present “respiratory
problem,” “yes” by Y = 1 and “no” by Y = 0. Then, from the above odds ratio, it
may be concluded that the effect of “smoking” on “respiratory problem” is strong.
The conclusion is valid in this context, i.e., the risk of disease. On the other hand,
treating variables X and Y parallelly in entropy, ECC is
In the above result, we may say the association between the variables is moderate
in a viewpoint of entropy. Similarly, we have the following coefficient of uncertainty
C(Y |X ) (1.19).
C(Y |X ) ≈ 0.128.
In this sense, the reduced (explained) entropy of Y by X is about 12.8% and the
effect of X on Y is week. Finally, the measures mentioned above are demonstrated
according to Table 2.5. We have
0.01 1 0.4 1 4
0 = = , 1 = = 4, = 1 = 196;
0.49 49 0.1 0 49
ECorr(X, Y ) = Corr(X, Y ) ≈ 0.793;
C(Y |X ) ≈ 0.558.
In the joint distribution in Table 2.5b, the association between the variables are
strong as shown by ECorr(X, Y ) and C(Y |X ).
Remark 2.4 As defined by (2.2) and (2.10), OR and RR do not depend on the
marginal probability distributions. For example, for any real value t ∈ (0, 2),
Table 2.5 is modified as in Table 2.6. Then, OR and RR in Table 2.6 are the same as
in Table 2.5a; however,
t(2 − t)
ECorr(X, Y ) = . (2.24)
(t + 2)(3 − t)
√
and it depends on real value t. The above ECC takes the maximum at t = 6 − 2 6 ≈
1.101:
The graph of ECorr(X, Y ) (2.24) is illustrated in Fig. 2.1. Hence, for analyzing
contingency tables, not only association measures OR and RR but also ECC is needed.
2.4 The Maximum Likelihood Estimation of Odds Ratios 41
ECC
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
t
0
0.08
0.16
0.24
0.32
0.4
0.48
0.56
0.64
0.72
0.8
0.88
0.96
1.04
1.12
1.2
1.28
1.36
1.44
1.52
1.6
1.68
1.76
1.84
1.92
Fig. 2.1 Graph of ECorr(X, Y ) (2.24)
ni j
π̂i j = , i = 0, 1; j = 0, 1.
n ++
1 1 1 1
SE ≈ + + + .
n 00 n 01 n 10 n 11
α
1 1 1 1
logOR > Q + + + ,
2 n 00 n 01 n 10 n 11
α
1 1 1 1
logOR < −Q + + + ,
2 n 00 n 01 n 10 n 11
where function Q(α) is the 100(1 − α) percentile of the standard normal distribution.
logOR − logOR
Z= .
1
n 00
+ 1
n 01
+ 1
n 10
+ 1
n 11
Then, a 100(1 − α)% confidence interval for the log odds ratio is
α
1 1 1 1
α
1 1 1 1
+Q + + + . (2.26)
2 n 00 n 01 n 10 n 11
From the above confidence interval, if the interval does not include 0, then the
null hypothesis is rejected with the level of significance α.
In order to test the independence of binary variables X and Y , the following
Pearson chi-square statistic is usually used:
2.4 The Maximum Likelihood Estimation of Odds Ratios 43
n ++ (n 00 n 11 − n 10 n 01 )2
X2 = . (2.27)
n 0+ n 1+ n +0 n +1
Since
|n 00 n 11 − n 10 n 01 |
ECorr(X, Y ) = √ ,
n 0+ n 1+ n +0 n +1
we have
X 2 = n ++ ECorr(X, Y )2 .
1
1
n ++ n i j
G2 = 2 n i j log (2.28)
j=0 i=0
n i+ n + j
Example 2.5 Table 2.7 illustrates the two-way cross-classified data of X : smoking
state and Y : chronic bronchitis state from 339 adults over 50 years old. The log OR
is estimated as
43 · 121
logOR = log = 2.471,
162 · 13
A 95% confidence interval of the log odds ratio (2.26) is calculated as follows:
From this, the association between the two variables is statistically significant at
significance level 0.05. From (2.29), a 95% confidence interval of the odds ratio is
given by
G 2 = 7.469, d f = 1, P = 0.006.
Hence, the statistic indicates that the association in Table 2.7 is statistically
significant.
Let X and Y be categorical random variables that have categories {1, 2, . . . , I } and
{1, 2, . . . , J }, respectively, and let πi j = P(X = i, Y = j), πi+ = P(X = i), and
let π+ j = P(Y = j). Then, the number of odds ratios is
I J I (I − 1) J (J − 1)
= · .
2 2 2 2
πi j πi j
OR i, i ; j, j = for i = i and j = j . (2.30)
πi j πi j
OR(i, i; j, j) = 1.
Hence, we have
it means that X and Y are statistically independent. This completes the theorem.
Let the model in Table 2.8 be denoted by πi j and let the independent model by
πi+ π+ j . Then, from (1.10), the Kullback–Leibler information is
I
J
πi j
D πi j || πi+ π+ j = πi j log .
i=1 j=1
πi+ π+ j
ni j n i+ n+ j
π̂i j = , π̂i+ = , π̂+ j = ,
n ++ n ++ n ++
46 2 Analysis of the Association in Two-Way Contingency Tables
Table 2.9 Cholesterol and high blood pressure data in a medical examination for students in a
university
X Y Total
L M H
L 2 6 3 11
M 12 32 11 56
H 9 5 3 17
Total 24 43 17 84
Source [1], p. 206
where
J
I
I
J
n i+ = ni j , n+ j = n i j , n ++ = ni j .
j=1 i=1 i=1 j=1
I
J
n i j n ++
G2 = 2 n i j log = 2n ++ · D π̂i j || π̂i+ π̂+ j . (2.32)
i=1 j=1
n i+ n + j
The above
statistic implies 2n ++ times the KL information for discriminating
model πi j and the independent
model πi+ π+ j . Under the null hypothesis H0 :
the independent model πi+ π+ j , the above statistic is distributed according to a
chi-square distribution with degrees of freedom (I − 1)(J − 1).
Example 2.6 Table 2.9 shows a part of data in a medical examination. Variables X
and Y are cholesterol and high blood pressure, respectively, which are graded as low
(L), medium (M), and high (H). Applying (2.32), we have
G 2 = 6.91, d f = 4, P = 0.859.
From this, the independence of the two variables is not rejected at the level of
significant 0.05, i.e., the association of the variables is not statistically significant.
Example 2.7 Table 2.10 illustrates cross-classified data with respect to eye color
X and hair color Y in [5]. For baselines X = Blue, Y = Fair, odds ratios are
calculated in Table 2.11, and P-values with respect to the related odds ratios are
shown in Table 2.12. All the estimated odds ratios related to X = Midium, Dark are
statistically significant. The summary test of the independence of the two variables
is made with statistic (2.32), and we have G 2 = 1218.3, d f = 9, P = 0.000.
Hence, the independence of the variables is strongly rejected. By using odds ratios
in Table 2.11, the association between the two variables in Table 2.10 is illustrated
in Fig. 2.2.
2.5 General Two-Way Contingency Tables 47
Odds Ratio
100
80
60
40
Dark
20
0 Medium
Red
Medium
Dark Light
Black
0-20 20-40 40-60 60-80 80-100
The association model was proposed for analyzing two-way contingency tables
of categorical variables [11, 13]. Let X and Y be categorical variables that
have categories {1, 2, . . . , I } and {1, 2, . . . , J }, respectively, and let πi j =
P(X = i, Y = j); πi+ = P(X = i), and let π+ j = P(Y = j), i =
1, 2, . . . , I ; j = 1, 2, . . . , J . Then, a loglinear formulation of the model is given
by
In order to identify the model parameters, the following constraints are assumed:
I
J
I
J
λiX πi+ = λYj π+ j = ψi j πi+ = ψi j π+ j = 0.
i=1 j=1 i=1 j=1
πi j πi j
log = ψi j + ψi j − ψi j − ψi j .
πi j πi j
M
logπi j = λ + λiX + λYj + φm μmi νm j , (2.33)
m=1
Then, the log odds ratio is given by the following bilinear form:
πi j πi j
M
log = φm (μmi − μmi ) νm j − νm j . (2.35)
πi j πi j m=1
Preferable properties of the model in entropy are discussed below. For the
RC(1) model, Gillula et al. [10] gave a discussion of properties of the model
in entropy and Eshima [9] extended the discussion in the RC(M) association
model. The discussion is given below. In the RC(M) association model (2.33), let
μm = (μm1 , μm2 , . . . , μm I ) and ν m = (νm1 , νm2 , . . . , νm J ) be scores for X and Y in
the m th components, m = 1, 2, . . . , M. From (2.35), parameters φm are related to
log odds ratios, so the parameters are referred to as the intrinsic association param-
eters. If it is possible for scores to vary continuously, the intrinsic parameters φm
are interpreted as the log
odds ratio in unit changes of the related scores in the mth
components. Let Corr μm , ν m be the correlation coefficients between μm and ν m ,
m, m = 1, 2, . . . , M, which are defined by
J
I
Corr μm , ν m = μmi νm j πi j . (2.36)
j=1 i=1
For simplicity of the notation, let us set ρm = Corr μm , ν m , m = 1, 2, . . . , M.
With respect to the RC(M) association model (2.33), we have the following theorem.
Theorem 2.4 The RC(M) association model maximizes the entropy, given the cor-
relation coefficients between μm and ν m and the marginal probability distributions
of X and Y .
Proof For Lagrange multipliers κ, λiX , λYj , and φm , we set
J
I
J
I
I
J
G=− πi j logπi j + κ πi j + λiX πi j
j=1 i=1 j=1 i=1 i=1 j=1
J
I
M
J
I
+ λYj πi j + φm μmi νm j πi j .
j=1 i=1 m=1 j=1 i=1
∂G M
= −logπi j + 1 + κ + λiX + λYj + φm μmi νm j = 0.
∂πi j m=1
With respect to the correlation coefficients and the intrinsic association parame-
ters, we have the following theorem.
Theorem 2.5 In the RC(M) association model (2.33), let U be the M×M matrix that
∂ρa
has (a, b) elements ∂φb
. Then, U is positive definite, given the marginal probability
distributions of X and Y and scores μm and ν m , m = 1, 2, . . . , M.
∂ρa
≥ 0.
∂φa
With respect to the entropy in RC(M) association model (2.33), we have the
following theorem.
52 2 Analysis of the Association in Two-Way Contingency Tables
d(−H ) ∂λ
M M I
∂λiX
= φm0 + φm0 πi+
dt m=1
∂φm m=1 i=1
∂φm
M
J
∂λYj
M
M
M
∂ρm
+ φm0 π+ j + φm0 ρm + t φn0 φm0 .
m=1 j=1
∂φm m=1 n=1 m=1
∂φm
Since
J
I
πi j = 1,
j=1 i=1
we have
d
J I J I J I
d
πi j = πi j = πi j
dt j=1 i=1 j=1 i=1
dt j=1 i=1
M
∂λ M
∂λiX M
∂λYj M
φm0 + φm0 + φm0 + φm0 μmi νm j
m=1
∂φm m=1
∂φm m=1
∂φm m=1
M
∂λ J M
∂λiX
= φm0 + φm0 πi+
m=1
∂φm j=1 m=1
∂φm
J M
∂λiX M
+ φm0 π+ j + φm0 ρm = 0.
j=1 m=1
∂φm m=1
dH M M
∂ρm
= −t φn0 φm0 < 0.
dt n=1 m=1
∂φm
2.6 The RC(M) Association Models 53
Remark 2.6 The RC(1) association model was proposed for the analysis of associ-
ation between ordered categorical variables X and Y . The RC(1) association model
is given by
I
J
πi j I J
πi+ π+ j M J I
πi j log + πi+ π+ j log = φm μmi νm j πi j
i=1 j=1
πi+ π+ j i=1 j=1
πi j m=1 j=1 i=1
M
= φm ρm ≥ 0. (2.46)
m=1
Definition 2.6 The entropy covariance between categorical variables X and Y in the
RC(M) association model (2.33) is defined by
M
ECov(X, Y ) = φm ρm . (2.47)
m=1
M
J
I
ECov(X, Y ) = φm μmi νm j πi j , (2.48)
m=1 j=1 i=1
M
J
I
M
I
M
ECov(X, X ) = φm μ2mi πi j = φm μ2mi πi+ = φm .
m=1 j=1 i=1 m=1 i=1 m=1
M
ECov(Y, Y ) = φm . (2.49)
m=1
From (2.48) and (2.49), ECov(X, X ) and ECov(Y, Y ) can be interpreted as the
variances of X and Y in entropy, which are referred to as EVar(X ) and EVar(Y ),
respectively. As shown in (2.35), since association parameters
φm are related to odds
ratios, the contributions of the mth pairs of score vectors μm , ν m in the association
between X and Y may be measured by φm ; however, from the entropy covariance
(2.47), it is more sensible to measure
the contributions by φm ρm . The contribution
ratios of the mth components μm , ν m can be calculated by
φm ρm
M . (2.50)
k=1 φk ρk
M
M
J
I
M
I
J
M
φm ρm = φm μmi νm j πi j ≤ φm μ2mi πi+ νm2 j π+ j = φm .
m=1 m=1 j=1 i=1 m=1 i=1 j=1 m=1
It implies that
0 ≤ ECov(X, Y ) ≤ EVar(X ) EVar(Y ). (2.51)
Definition 2.7 The entropy correlation coefficient (ECC) between X and Y in the
RC(M) association model (2.33) is defined by
M
φm ρm
ECorr(X, Y ) =
m=1
M
. (2.52)
m=1 φm
0 ≤ ECorr(X, Y ) ≤ 1.
Example 2.8 Becker and Clogg [5] analyzed the data in Table 2.13 with the RC(2)
association model. The estimated parameters are given in Table 2.14. From the
estimates, we have
and
From the above estimates, the entropy correlation coefficient between X and Y is
computed as
φ̂1 ρ̂1 + φ̂2 ρ̂2
ECorr(X, Y ) = = 0.390. (2.53)
φ̂1 + φ̂2
From this, the association in the contingency table is moderate. The contribution
ratios of the first and second pairs of score vectors are
φ̂1 ρ̂1 + φ2 ρ2
About 92.4% of the association between the categorical variables are explained
by the first pair of components (μ1 , ν 1 ).
A preferable property of ECC is given in the following theorem.
Proof Since
M
ECov(X, Y ) = t φm0 ρm ,
m=1
d M M M
∂ρa dφb 1
M
ECov(X, Y ) = φm0 ρm + t φa0 = φm ρm
dt m=1 b=1 a=1
∂φb dt t m=1
M
M
∂ρa
+t φa0 φb0
b=1 a=1
∂φb
1 M M
∂ρa
= ECov(X, Y ) + t φa0 φb0 .
t b=1 a=1
∂φb
For t > 0, the first term is positive, and the second term is also positive. This
completes the theorem.
2.7 Discussion
Continuous data are more tractable than categorical data, because the variables are
quantitative and, variances and covariances among the variables can be calculated
for summarizing the correlations of them. The RC(M) association model, which was
proposed for analyzing the associations in two-way contingency tables, has a similar
correlation structure to the multivariate normal distribution in canonical correlation
analysis [6] and gives a useful method for processing the association in contingency
58 2 Analysis of the Association in Two-Way Contingency Tables
tables. First, this chapter has considered the association between binary variables
in entropy, and the entropy correlation coefficient has been introduced. Second, the
discussion has been extended for the RC(M) association mode. Desirable properties
of the model in entropy have been discussed, and the entropy correlation coefficient
to summarize the association in the RC(M) association model has been given. It
is sensible to treat the association in contingency tables as in continuous data. The
present approach has a possibility to be extended for analyzing multiway contingency
tables.
References
1. Asano, C., & Eshima, N. (1996). Basic multivariate analysis. Nihon Kikaku Kyokai: Tokyo.
(in Japanese).
2. Asano, C., Eshima, N., & Lee, K. (1993). Basic statistic. Tokyo: Morikita Publishing Ltd. (in
Japanese).
3. Becker, M. P. (1989a). Models of the analysis of association in multivariate contingency tables.
Journal of the American Statistical Association, 84, 1014–1019.
4. Becker, M. P. (1989b). On the bivariate normal distribution and association models for ordinal
data. Statistics and Probability Letters, 8, 435–440.
5. Becker, M. P., & Clogg, C. C. (1989). Analysis of sets of two-way contingency tables using
association models. Journal of the American Statistical Association, 84, 142–151.
6. Eshima, N. (2004). Canonical exponential models for analysis of association between two sets
of variables. Statistics and Probability Letters, 66, 135–144.
7. Eshima, N., & Tabata, M. (1997). The RC(M) association model and canonical correlation
analysis. Journal of the Japan Statistical Society, 27, 109–120.
8. Eshima, N., & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive
power of generalized linear models. Statistics and Probability Letters, 77, 588–593.
9. Eshima, N., Tabata, M., & Tsujitani, M. (2001). Properties of the RC(M) association model
and a summary measure of association in the contingency table. Journal of the Japan Statistical
Society, 31, 15–26.
10. Gilula, Z., Krieger, A., & Ritov, Y. (1988). Ordinal association in contingency tables: Some
interpretive aspects. Journal of the American Statistical Association, 74, 537–552.
11. Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications
having ordered categories. Journal of the American Statistical Association, 74(367), 537–552.
12. Goodman, L. A. (1981a). Association models and bivariate normal for contingency tables with
ordered categories. Biometrika, 68, 347–355.
13. Goodman, L. A. (1981b). Association models and canonical correlation in the analysis of cross-
classification having ordered categories. Journal of the American Statistical Association, 76,
320–334.
14. Goodman, L. A. (1985). The analysis of cross-classified data having ordered and/or unordered
categories: Association models, correlation models, and asymmetry models for contingency
tables with or without missing entries. Annals of Statistics, 13, 10–69.
Chapter 3
Analysis of the Association in Multiway
Contingency Tables
3.1 Introduction
The above model is referred to as the loglinear model and is used for the analysis
of association in three-way contingency tables. The model has two-dimensional
association terms λiXj Y , λYjkZ , and λkiZ X and three-dimensional association terms λiXjkY Z ,
so in this textbook, the above loglinear model is denoted as model [XYZ] for simplicity
of the notation, where the notation is correspondent with the highest dimensional
association terms [2].
Let n i jk be the numbers of observations for X = i, Y = j, and Z = k, i =
1, 2, . . . , I ; j = 1, 2, . . . , J ; k = 1, 2, . . . , K . Since model (3.1) is saturated, the
ML estimators of the cell probabilities are given by
n i jk
logπ̂i jk = log ,
n +++
where
I
J
K
n +++ = n i jk .
i=1 j=1 k=1
From this, the estimates of the parameters are obtained by solving the following
equations:
n i jk
λ + λ̂iX + λ̂Yj + λ̂kZ + λ̂iXj Y + λ̂YjkZ + λ̂kiZ X + λ̂iXjkY Z = log ,
n +++
i = 1, 2, . . . , I ; j = 1, 2, . . . , J ; k = 1, 2, . . . , K .
(i) [X Y, Y Z , Z X ]
Similarly, the conditional odds ratios with respect to Y and Z given X and those
with respect to X and Z given Y can be discussed.
(ii) [X Y, Y Z ]
The highest dimensional association terms with respect to the variables are λiXj Y
and λYjkZ , so we express the model by [X Y, Y Z ]. Similarly, models [Y Z , Z X ] and
[Z X, X Y ] can also be defined. In model (3.4), the log odds ratios with respect to X
and Z given Y are
πi jk πi jk
log = 0.
πi jk πi jk
From this, the above model implies that X and Z are conditionally independent,
given Y. Hence, we have
πi j+ π+ jk
πi jk = .
π+ j+
where
K
From (3.5), parameters λiX and λiXj Y can be expressed by the marginal probabilities
πi j+ , and similarly, λkZ and λYjkZ are described with π+ jk .
(iii) [X Y, Z ]
The highest dimensional association terms with respect to X and Y are λiXj Y , and
for Z term λkZ is the highest dimensional association terms. In this model, (X, Y ) and
Z are statistically independent, and it follows that
πi jk = πi j+ π++k
The marginal distribution of (X, Y ) has the same parameters λiX , λYj , and λiXj Y as
in (3.6), i.e.,
where
K
From this, parameters λiX , λYj , and λiXj Y can be explicitly expressed by probabilities
πi j+ , and similarly, λkZ by π++k . Parallelly, we can interpret models [Y Z , X ] and
[Z X, Y ].
(iv) [X, Y, Z ]
The above model has no association terms among the variables, and it implies the
three variables are statistically independent. Hence, we have
πi jk = πi++ π+ j+ π++k
This sampling model is similar to that for the three-way layout experiment model,
and the terms λiXj Y , λYjkZ , and λkiZ X are referred to as two-factor interaction terms and
λiXjkY Z three-factor interaction terms.
Remark 3.2 Let us observe one of IJK events (X, Y, Z ) = (i, j, k) with probabili-
ties πi jk in one trial and repeat it n +++ times independently; let n i jk be the numbers
(counts) of events for categories X = i, Y = j, and Z = k, i = 1, 2, . . . , I ; j =
1, 2, . . . , J ; k = 1, 2, . . . , K . Then, the distribution of n i jk is the multinomial distri-
bution with total sampling size n +++ . In the independent Poisson sampling mentioned
in Remark 3.1, the conditional distribution of observations n i jk given n +++ is the
multinomial distribution with cell probabilities
μi jk
πi jk = ,
μ+++
where
I
J
K
μ+++ = μi jk .
i=1 j=1 k=1
I
J
K
l= n i jk λiXj Y + λkZ + λYjkZ + λkiZ X
i=1 j=1 k=1
64 3 Analysis of the Association in Multiway Contingency Tables
I
J
K
J
K
= n i j+ λiXj Y + n ++k λkZ + n + jk λYjkZ
i=1 j=1 k=1 j=1 k=1
I
K
+ n i+k λkiZ X .
i=1 k=1
From
I
J
K
πi jk = 1,
i=1 j=1 k=1
I
J
K
lLagrange = l − ζ πi jk
i=1 j=1 k=1
∂ K K
l
X Y Lagrange
= n i jk − ζ πi jk = 0.
∂λi j k=1 k=1
From this,
n i j+ − ζ πi j+ = 0 (3.7)
and we have
ζ = n +++
Let λ̂iXj Y , λ̂kZ , λ̂YjkZ , and λ̂kiZ X be the ML estimators of λiXj Y , λkZ , λYjkZ , and λkiZ X ,
respectively. Then, we obtain
K
π̂i j+ = exp λ + λ̂iX + λ̂Yj + λ̂iXj Y exp λ̂kZ + λ̂YjkZ + λ̂kiZ X
k=1
K
= exp λ̂iXj Y exp λ̂kZ + λ̂YjkZ + λ̂kiZ X . (3.8)
k=1
3.3 Maximum Likelihood Estimation of Loglinear Models 65
Similarly, the ML estimators of πi+k and π+ jk that are made by the ML estimator
of the model parameters are obtained as
n i+k n + jk
π̂i+k = , π̂+ jk = ;
n +++ n +++
J
π̂i+k = exp λ + λ̂iX + λ̂kZ + λ̂kiZ X exp λ̂Yj + λ̂iXj Y + λ̂YjkZ ,
j=1
I
π̂+ jk = exp λ + λ̂Yj + λ̂kZ + λ̂YjkZ exp λ̂iX + λ̂iXj Y + λ̂kiZ X
i=1
we can obtain the ML estimators λ, λ̂iX , λ̂Yj , λ̂kZ , λ̂iXj Y , and λ̂YjkZ in explicit forms.
Through a similar discussion, the ML estimators of parameters in models [X Y, Z ]
and [X, Y, Z ] can also be obtained in explicit forms as well.
Generalized linear models (GLMs) are designed by random, systematic, and link
components and make useful regression analyses of both continuous and categorical
response variables [13, 15]. There are many cases of non-normal response vari-
ables in various fields of studies, e.g., biomedical researches, behavioral sciences,
economics, etc., and GLMs play an important role in regression analyses for the
66 3 Analysis of the Association in Multiway Contingency Tables
Let f (y|x) be the conditional density or probability function. Then, the function is
assumed to be the following exponential family of distributions:
yθ − b(θ )
f (y|x) = exp + c(y, ϕ) , (3.9)
a(ϕ)
where θ and ϕ are the parameters. This assumption is referred to as the random
component. If Y is the Bernoulli trial, then the conditional probability function is
π
f (y|x) = π y (1 − π )1−y = exp ylog + log(1 − π ) .
1−π
For normal variable Y with mean μ and variance σ 2 , the conditional density
function is
1 yμ − 21 μ2 y2
f (y|x) = √ exp + − 2 ,
2π σ 2 σ2 2σ
where
1 2 y2
θ = μ, a(ϕ) = σ 2 , b(θ ) = μ , c(y, ϕ) = − 2 . (3.11)
2 2σ
Remark 3.3 For distribution (3.9), the expectation and variance are calculated. For
simplicity, response variable is assumed to be continuous. Since
f (y|x)dy = 1,
we have
d 1
Hence,
E(Y ) = b (θ ). (3.13)
In this sense, the dispersion parameter a(ϕ) relates to the variance of the response
variable. If response variable Y is discrete, the integral in (3.12) is replaced by an
appropriate summation with respect to Y.
η = β0 + β1 x1 + β2 x2 + . . . + β p x p = β0 + β T x, (3.14)
where
T
x = x1 , x2 , . . . , x p .
For function h(u), mean (3.13) and predictor (3.14) are linked as follows:
b (θ ) = h −1 β0 + β T x .
θ = θ βT x .
Example 3.1 For Bernoulli random variable Y (3.10), the following link function is
assumed:
u
h(u) = log .
1−u
Then, from
π
h(π ) = log = β0 + β T x,
1−π
68 3 Analysis of the Association in Multiway Contingency Tables
exp β0 + β T x
f (1|x) =
.
1 + exp β0 + β T x
The above model is a logistic regression model. For a normal distribution with
(3.11), the identity function
h(u) = u
μ = β0 + β T x
θ = β0 + β T x
In GLMs composed of (i), (ii), and (iii) in the previous section, for baseline (X, Y ) =
(x 0 , y0 ), the log odds ratio is given by
(y − y0 ) θ β T x − θ β T x 0
log OR(x, x 0 ; y, y0 ) = , (3.15)
a(ϕ)
where
f (y|x) f (y0 |x 0 )
OR(x, x 0 ; y, y0 ) = . (3.16)
f (y0 |x) f (y|x 0 )
The above formulation of log odds ratio is similar to that of the association
model
(2.35). Log odds ratio (3.15) is viewed as an inner product of y and θ β T x with
respect to the dispersion parameter a(ϕ). From (3.16), since
Cov Y, θ β T X (E(Y ) − y0 ) E θ β T X − θ β T x 0
+ ,
a(ϕ) a(ϕ)
The first term can be viewed as the mean change of uncertainty of response
variable Y in explanatory variables X for baseline Y = E(Y ). Let g(x) and f (y)
be the marginal density functions of X and Y, respectively. Then, as in the previous
section, we have
¨
Cov Y, θ β T X f (y|x)
= f (y|x)g(x)log dxdy
a(ϕ) f (y)
¨
f (y)
+ f (y)g(x)log dxdy. (3.17)
f (y|x)
The above quantity is the sum of the two types of the KL information, so it is
denoted by KL(X, Y ). If X and/or Y are discrete, the integrals in (3.17) are replaced
by appropriate summations with respect to the variables. If X is a factor, i.e., not
random, taking levels X = x 1 , x 2 , . . . , x K , then, (3.17) is modified as
K
Cov Y, θ β T X 1 f (y|x k )
= f (y|x k ) log dy
a(ϕ) k=1
K f (y)
K
1 f (y)
+ f (y) log dy.
k=1
K f (y|x k)
√
Definition 3.1 The entropy (multiple) correlation coefficient (ECC) between X and
Y in GLM (3.9) with (i), (ii), and (iii) is defined as follows [7]:
Cov Y, θ β T X
ECorr(Y, X) = √
. (3.19)
Var(Y ) Var θ β T X
0 ≤ ECorr(Y, X) ≤ 1.
70 3 Analysis of the Association in Multiway Contingency Tables
¨
Cov Y, θ β T X
= ( f (y|x) − f (y))g(x)log f (y|x)dxdy.
a(ϕ)
Theorem 3.1 In (3.19), ECC is decreasing as a(ϕ), given Var(Y ) and Var θ β T X .
Proof For simplicity of the discussion, let us set E(Y ) = 0, and the proof is given in
the case where explanatory variable X is continuous and random. Let f (y|x) be the
conditional density or probability function of Y, and let g(x) be the marginal density
or probability function of X. Then,
¨
Cov Y, θ β X T
= yθ β T x f (y|x)g(x)dxdy.
Cov Y, θ β X = yθ β T x f (y|x)g(x)dxdy
da(ϕ) da(ϕ)
¨
d
= yθ β T x f (y|x)g(x)dxdy
da(ϕ)
¨
1 T
2
=− 2
yθ β x f (y|x)g(x)dxdy ≤ 0.
a(ϕ)
For a GLM with the canonical link, ECC is the correlation coefficient between
response variable Y and linear predictor
θ = β0 + β T x, (3.20)
where
T
β = β1 , β2 , . . . , β p .
3.5 Entropy Multiple Correlation Coefficient for GLMs 71
Then, we have
p
βi Cov(Y, X i )
ECorr(Y, X) = √ i=1
. (3.21)
Var(Y ) Var β T X
βi Cov(Y, X i ) ≥ 0, i = 1, 2, . . . , p.
Proof For simplicity of the discussion, we make a proof of the theorem for continu-
ous variables. Since explanatory variables X 1 , X 2 , . . . , X p are independent, the joint
density function is expressed as follows:
p
g(x) = gi (xi ),
i=1
where gi (xi ) are the marginal density or probability functions. Without the loss of
generality, it is sufficient to show
β1 Cov(Y, X 1 ) ≥ 0
p
f (y, x) = f (y|x) gi (xi ),
i=1
/1
f (y, x)
f 1 y|x = p dx1 ,
i=2 gi (x i )
¨
p
f (y|x)
0≤ f (y|x) gi (xi )log
dxdy
i=1
f 1 y|x /1
¨
/1
p
f 1 y|x /1
+ f 1 y|x gi (xi )log dxdy
i=1
f (y|x)
¨
p
= f (y|x)g1 (x1 ) − f y|x /1 gi (xi )log f (y|x)dxdy
i=1
= β1 Cov(Y, X 1 ).
0 ≤ RCorr(Y, X) ≤ 1
The measure does not have such a decomposition property as ECC (3.21). In this
respect, ECC is more preferable for GLMs than RCC. Comparing ECC and RCC,
ECC measures the predictive power of GLMs in entropy; whereas, RCC measures
that in the Euclidian space. As discussed above, regression coefficient β in GLMs is
related to the entropy of response variable Y. From this point of view, ECC is more
advantageous than RCC.
Proof Since
=√
≤√ √
Var(Y ) Var θ β T X Var(Y ) Var θ β T X Var(Y ) Var(E(Y |X))
Var(E(Y |X))
=√ √ ,
Var(Y ) Var(E(Y |X))
Hence, the inequality of the theorem follows. The equality holds if and only if
there exist constants c and d such that
θ β T X = cE(Y |X) + d.
Example 3.2 For binary response variable Y and explanatory variable X, the
following logistic regression model is considered:
exp{y(α + βx)}
f (y|x) = , y = 0, 1.
1 + exp(α + βx)
Since the link is canonical and the regression is simple, ECC can be calculated
by (3.22). On the other hand, in this model, the regression function is
exp{(α + βx)}
E(Y |x) = .
1 + exp(α + βx)
Example 3.3 Table 3.1 shows the data of killed beetles after exposure to gaseous
carbon disulfide [4]. Response variable Y is
Y = 0(killed), 1(alive).
For the data, a complementary log–log model was applied. In the model, for
function h(u), the link is given by
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1.65 1.7 1.75 1.8 1.85 1.9
Since the response variable is binary, E(Y |x) is equal to the conditional probability
function of Y, and it is given by
Then, θ in (3.9) is
From the data, the estimates of the parameters are obtained as follows:
α̂ = −39.52, β̂ = 22.01.
The graph of the estimated f (y|x) in log dose X is given in Fig. 3.1. ECC and
RCC are calculated according to (3.19) and (3.23), respectively, as follows:
Definition 3.2 Let subX Y be a subset of the sample space of explanatory variable
vector
X
and response variable Y in a GLM. Then, the correlation coefficient between
θ β T X and Y restricted in sub
X Y is referred to as the conditional
entropy
correlation
coefficient between X and Y, which is denoted by ECorr Y, X|sub XY .
T
variable vector X = (X 1 , X 2 ) , the correlation coefficient
T
For explanatory
between θ β X and Y given X 2 = x2 is the conditional correlation coefficient
between X and Y, i.e.,
3.5 Entropy Multiple Correlation Coefficient for GLMs 75
1
it follows that P X = ci |sub
X Y = 2 . From this, it follows that
θ (βci ) + θ βc j
E θ (β X )| X Y =
sub
,
2
E(Y |X = ci ) + E Y |X = c j
E Y | X Y =
sub
,
2
1
2
θ (βci ) − θ βc j
Var θ (β X )| X Y =
sub
.
4
Then, we have
E(Y |X = ci ) − E Y |X = c j
ECorr Y, X |sub
XY =
, (3.24)
2 Var Y |subXY
The results are similar to those of pairwise comparisons of the effects of factor
levels. If response variable Y is binary, i.e.,Y ∈ {0, 1}, we have
ECorr Y, X |sub
XY
E(Y |X = ci ) − E Y |X = c j
=
.
E(Y |X = ci ) + E Y |X = c j 2 − E(Y |X = ci ) − E Y |X = c j
(3.25)
Example 3.4 We apply (3.24) to the data in Table 3.1. For X = 1.691, 1.724, the par-
tial data are illustrated in Table 3.2. In this case, we calculate the estimated conditional
probabilities (Table 3.3), and from (3.25), we have
Table 3.3 Estimated conditional probability distribution of beetle mortality data in Table 3.1
Log dose 1.691 1.724
Killed 0.095 0.187
Alive 0.905 0.813
Table 3.4 Conditional ECCs for base line log dose X = 1.691
Log dose 1724 1.755 1.784 1.811 1.837 1.861 1.884
Conditional ECC 0.132 0.293 0.477 0.667 0.822 0.893 0.908
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1.7 1.72 1.74 1.76 1.78 1.8 1.82 1.84 1.86 1.88 1.9
It means the effect of level X = 1.724 for base line X = 1.691, i.e., 13.2% of
entropy of the response is reduced by the level X = 1.724. The conditional ECCs of
X = x for baseline X = 1.691 are calculated in Table 3.4. The conditional ECC is
increasing in log dose X. The estimated conditional ECC is illustrated in Fig. 3.2.
Let
1 (X a = i) 1 (Y = k)
X ai = , a = 1, 2, ; Yk = ,
0 (X a = i) 0 (Y = k)
X 1 = (X 11 , X 12 , . . . , X 1I )T , X 2 = (X 21 , X 22 , . . . , X 2J )T , and
Y = (Y1 , Y2 , . . . , Y K )T
θ = α + B (1)
T
X 1 + B (2)
T
X 2,
where
θ = (θ1 , θ2 , . . . , θ K )T , α = (α1 , α2 , . . . , α K )T ,
⎛ ⎞ ⎛ ⎞
β(1)11 β(1)12 . . . β(1)1I β(2)11 β(2)12 . . . β(2)1J
⎜ β(1)21 β(1)22 . . . β(1)2I ⎟ ⎜ β(2)21 β(2)22 . . . β(2)2J ⎟
⎜ ⎟ ⎜ ⎟
B(1) = ⎜ . .. .. .. ⎟, B(2) = ⎜ .. .. .. .. ⎟.
⎝ . . . . . ⎠ ⎝ . . . . ⎠
β(1)K 1 β(1)K 2 . . . β(1)K I β(2)K 1 β(2)K 2 . . . β(2)K J
where
y implies the summation over all categories y. In this case, a(ϕ) = 1 and
the KL information (3.17) become as follows:
KL(X, Y ) = trCov(θ , Y ),
where Cov(θ , Y ) is the K × K matrix with (i. j) elements Cov θi , Y j . From this,
ECC can be extended as
trCov(θ , Y )
ECorr(Y , (X 1 , X 2 )) = √ √ ,
trCov(θ, θ ) trCov(Y , Y )
78 3 Analysis of the Association in Multiway Contingency Tables
K
trCov(θ , Y ) = Cov(θk , Yk ) = trB(1) Cov(X 1 , Y ) + trB(2) Cov(X 2 , Y ),
k=1
K
K
trCov(θ , θ ) = Var(θk ), trCov(Y , Y ) = Var(Yk ),
k=1 k=1
we have
trB (1) Cov(X 1 , Y ) + trB (2) Cov(X 2 , Y )
ECorr(Y , (X 1 , X 2 )) = ,
K K
k=1 Var(θk ) k=1 Var(Yk )
πiXjk1 X 2 Y ≡ P(X 1 = i, X 2 = j, Y = k)
Example 3.5 The data for an investigation of factors influencing the primary food
choice of alligators (Table 3.5) are analyzed ([2; pp. 268–271]). In this example,
explanatory variables are X 1 : lakes where alligators live, {1. Hancock, 2. Oklawaha,
3. Trafford, 4. George}; and X 2 : sizes of alligators, {1. small, 2. large}; and the
response variable is Y: primary food choice of alligators, {1. fish, 2. invertebrate, 3.
reptile, 4. bird, 5. other}. Model (3.26) is used for the analysis, and the following
dummy variables are introduced:
1 (X 1 = i) 1 (X 2 = j)
X 1i = , i = 1, 2, 3, 4; X 2 j = , j = 1, 2, ;
0 (X 1 = i) 0 (X 2 = j)
1 (Y = k)
Yk = , k = 1, 2, 3, 4, 5.
0 (Y = k)
Then, the categorical variables X 1 , X 2 , and the response variable Y are identified
with the correspondent dummy random vectors:
X 1 = (X 11 , X 12 , X 13 , X 14 )T , X 2 = (X 21 , X 22 )T , and Y = (Y1 , Y2 , Y3 , Y4 , Y5 )T ,
trCov(Y , Y ) = 0.691.
trCov(θ , θ ) trCov(Y , Y )
= 0.231.
Although the effects of factors are statistically significant, the predictive power
of the logit model may be small, i.e., only 23.1% of the uncertainty of the response
80 3 Analysis of the Association in Multiway Contingency Tables
It may be thought that the effect of lake on food is about 2.4 times larger than that
of size.
where D(Y ) and D(Y |X) are a variation function of Y and a conditional variation
function of Y given X, respectively [1, 5, 12]. D(Y |X) implies that
D(Y |X) = D(Y |X = x)g(x)dx,
A predictive power measure based on the likelihood function (15, 16) is given as
follows:
n2
L(0)
R L2 = 1 − , (3.29)
L(β)
(3.29) in general cases. This is a drawback of this measure. For GLMs with categorical
response variables, the following entropy-based measure is proposed:
H (Y ) − H (Y |X)
R 2E = , (3.30)
H (Y )
where H (Y ) and H (Y |X) are the entropy of Y and the conditional entropy given
X [11]. As discussed in Sect. 3.4, GLMs have properties in entropy, e.g., log odds
ratios (3.15) and KL information (3.17) are related to linear predictors. We refer the
entropy-based measure (3.17), i.e., KL(X, Y ), to as a basic
Var(Y |X = x) = a(ϕ)b θ β T x
and
Cov Y, θ β T X
KL(X, Y ) = ,
a(ϕ)
function
a(ϕ)
Since
Cov Y, θ β X |X =
T
(Y − E(Y |x))θ β T x g(x)dx = 0,
D E (Y ) − D E (Y |X)
ECD(X, Y ) = . (3.33)
D E (Y )
Cov Y, θ β T X KL(X, Y )
ECD(X, Y ) = T
= . (3.34)
Cov Y, θ β X + a(ϕ) KL(X, Y ) + 1
82 3 Analysis of the Association in Multiway Contingency Tables
From the above formulation, ECD is interpreted as the ratio of the explained
variation of Y by X for the variation of Y in entropy. For ordinary linear regression
2
T
ECDT is the coefficient of determination R , and for canonical link, i.e.,
models,
θ β X = β X, ECD has the following decomposition property:
p
βi Cov(Y, X i )
ECD(X, Y ) = i=1 T
.
Cov Y, θ β X + a(ϕ)
Thus, ECC and ECD have the decomposition property with respect to explanatory
variables X i . This property is preferable, because the contributions of explanatory
variables X 1 can be assessed with components
βi Cov(Y, X i )
,
Cov Y, θ β T X + a(ϕ)
KL(X, Y ) ≥ KL X \i , Y , i = 1, 2, . . . , p.
f (x, y) = f x1 , x \1 , y , g(x) = g x1 , x \1 ,
we have
˚ f x1 , x \1 , y
\1
KL(X, Y ) − KL X , Y = \1
f x1 , x , y log
dx1 dx \1 dy
f Y (y)g x1 , x \1
˚ f Y (y)g x1 , x \1
+ \1
f Y (y)g x1 , x log
dx1 dx \1 dy.
f x1 , x \1 , y
˚ f 1 x \1 , y
− f x1 , x \1 , y log
dx1 dx \1 dy
f Y (y)g1 x \1
3.7 Entropy Coefficient of Determination 83
˚ f Y (y)g1 x \1
− f Y (y)g x1 , x \1 log
dx1 dx \1 dy
f 1 x \1 , y
˚ f x1 , x \1 , y g1 x \1
= f x1 , x \1 , y log
dx1 dx \1 dy
f 1 x \1 , y g x1 , x \1
˚ f 1 x \1 , y g x1 , x \1
+ f Y (y)g x1 , x \1 log \1
dx1 dx \1 dy.
g1 x f x1 , x \1 , y
˚ f x1 , x \1 , y /g x1 , x \1
= \1
f x1 , x , y log
dx1 dx \1 dy
f 1 x \1 , y /g1 x \1
˚ g x1 , x \1 /g1 x \1
+ f Y (y)g x1 , x \1 log
dx1 dx \1 dy ≥ 0.
f x1 , x \1 , y / f 1 x \1 , y
Desirable properties for predictive power measures for GLMs are given as follows
[9]:
(i) A predictive power measure can be interpreted, i.e., interpretability.
(ii) The measure is the multiple correlation coefficient or the coefficient of
determination in normal linear regression models.
(iii) The measure has an entropy-based property.
(iv) The measure is applicable to all GLMs (applicability to all GLMs).
(v) The measure is increasing in the complexity of the predictor (monotonicity in
the complexity of the predictor).
(vi) The measure is decomposed into components with respect to explanatory
variables (decomposability).
First, RCC (3.23) is checked for the above desirable properties. The measure
can be interpreted as the correlation coefficient (cosine) of response variable Y and
the regression function E(Y |X) and is the multiple correlation coefficient R in the
ordinary linear regression analysis; however, the measure has no property in entropy.
The measure is available only for single continuous or binary variable cases. For
84 3 Analysis of the Association in Multiway Contingency Tables
polytomous response variable cases, the measure cannot be employed. It has not
been proven whether the measure satisfies property (v) or not, and it is trivial for
the measure not to have property (vi). Second, R L2 is considered. From (3.29), the
measure is expressed as
2 2
L(β) n − L(0) n
R L2 = 2 ;
L(β) n ,
2 2
however, L(β) n and L(0) n cannot be interpreted. Let β̂ be the ML estimates of β.
Then, the likelihood ratio statistic is
⎛ ⎞ n2
L β̂ 1
⎝ ⎠ = .
L(0) 1 − R2
From this, this measure satisfies property (ii). Since logL(β) is interpreted in a
viewpoint of entropy, so the measure has property (iii). This measure is applicable
to all GLMs and increasing in the number of explanatory variables, because it is
based on the likelihood function; however, the measure does not have property (vi).
Concerning R 2E , the measure was proposed for categorical response variables, so the
measure has properties except (ii) and (vi). For ECC, as explained in this chapter
and the previous one, this measure has (i), (ii), (iii), and (vi); and the measure is
the correlation coefficient of the response variable and the predictor of explanatory
variables, so this measure is not applicable to continuous response variable vectors
and polytomous response variables. Moreover, it has not proven whether the measure
has property (v) or not. As discussed in this chapter, ECD has all the properties (i)
to (vi). The above discussion is summarized in Table 3.6.
Mittlebock and Schemper [14] compared some summary measures of association
in logistic regression models. The desirable properties are (1) interpretability, (2)
consistency with basic characteristics of logistic regression, (3) the potential range
should be [0,1], and (4) R 2 . Property (2) corresponds to (iii), and property (3) is met
by ECD and ECC. In the above discussion, ECD is the most suitable measure for the
predictive power of GLMs. A similar discussion for binary response variable models
was given by Ash and Shwarts [3].
fˆ(y|x), fˆ(y), and K L (X, Y ) be the ML estimators of f (y|x), f (y), and K L(X, Y ),
respectively. Then, the likelihood functions under H0 : β = 0 and H1 : β = 0 are,
respectively, given as follows:
K
nk
l0 = log fˆ(yki ), (3.35)
k=1 i=1
and
K
nk
l= log fˆ(yki |x k ). (3.36)
k=1 i=1
Then,
1 fˆ(yki |x k )
K nk
1
(l − l0 ) = log
N N k=1 i=1 fˆ(yki )
K
nk f (y|x k )
→ f (y|x k )log dy(in probability),
k=1
N f (y)
fˆ(y|x k )
K
nk
fˆ(y|x k )log dy
k=1
N fˆ(y)
K
nk f (y|x k )
→ f (y|x k )log dy(in probability).
k=1
N f (y)
fˆ(y|x k ) fˆ(y)
K K
nk nk 1
fˆ(y|x k )log dy = fˆ(y)log dy + o ,
k=1
N fˆ(y) k=1
N fˆ(y|x k ) N
where
1
N ·o → 0 (in probability).
N
N
l0 = log fˆ(yi )ĝ(x i )
i=1
3.8 Asymptotic Distribution of the ML Estimator of ECD 87
and
N
l= log fˆ(yi |x i )ĝ(x i ).
i=1
Example
3.7
In Example 3.5, the test for H0 : B (1) , B (2) = 0 versus
H1 : B (1) , B (2) = 0 is performed by using the above discussion. Since N = 219,we
have
219KL(Y , (X 1 , X 2 )) = 219 tr B̂ (1) Cov(X 1 , Y ) + tr B̂ (2) Cov(X 2 , Y )
= 219(0.258 + 0.107) = 79.935.(d f = 16, P = 0.000).
3.9 Discussions
In this chapter, first, loglinear models have been discussed. In the models, the loga-
rithms of probabilities in contingency tables are modeled with association parame-
ters, and as pointed out in Sect. 3.2, the associations between the categorical variables
in the loglinear models are measured with odds ratios. Since the number of odds ratios
is increasing as those of categories of variables increasing, further studies are needed
to make summary measures of association for the loglinear models. There may be
a possibility to make entropy-based association measures that are extended versions
of ECC and ECD. Second, predictive or explanatory power measures for GLMs have
been discussed. Based on an entropy-based property of GLMs, ECC and ECD were
considered. According to six desirable properties for predictive power measures as
listed in Sect. 3.7, ECD is the best to measure the predictive power.
References
1. Agresti, A. (1986). Applying R2 -type measures to ordered categorical data. Technometrics, 28,
133–138.
2. Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: John Wiley & Sons Inc.
3. Ash, A., & Shwarts, M. (1999). R2 : A useful measure of model performance with predicting
a dichotomous outcome. Stat Med, 18, 375–384.
4. Bliss, C. I. (1935). The calculation of the doze-mortality curve. Ann Appl Biol, 22, 134–167.
5. Efron, B. (1978). Regression and ANOVA with zero-one data: measures of residual variation.
J Am Stat Assoc, 73, 113–121.
6. Eshima, N., & Tabata, M. (1997). The RC(M) association model and canonical correlation
analysis. J Jpn Stat Soc, 27, 109–120.
7. Eshima, N., & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive
power of generalized linear models. Stat Probab Lett, 77, 588–593.
8. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear
models. Comput Stat Data Anal, 54, 1381–1389.
9. Eshima, N., & Tabata, M. (2011). Three predictive power measures for generalized linear
models: entropy coefficient of determination, entropy correlation coefficient and regression
correlation coefficient. Comput Stat Data Anal, 55, 3049–3058.
10. Goodman, L. A. (1981). Association models and canonical correlation in the analysis of cross-
classification having ordered categories. J Am Stat Assoc, 76, 320–334.
11. Haberman, S. J. (1982). Analysis of dispersion of multinomial responses. J Am Stat Assoc, 77,
568–580.
12. Korn, E. L., & Simon, R. (1991). Explained residual variation, explained risk and goodness of
fit. Am Stat, 45, 201–206.
13. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman and
Hall: London.
14. Mittlebock, M., & Schemper, M. (1996). Explained variation for logistic regression. Stat Med,
15, 1987–1997.
15. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear model. J Roy Stat Soc A,
135, 370–384.
References 89
16. Theil, H. (1970). On the estimation of relationships involving qualitative variables. Am J Sociol,
76, 103–154.
17. Zheng, B., & Agresti, A. (2000). Summarizing the predictive power of a generalized linear
model. Stat Med, 19, 1771–1781.
Chapter 4
Analysis of Continuous Variables
4.1 Introduction
Statistical methodologies for continuous data analysis have been well developed for
over a hundred years, and many research fields apply them for data analysis. For
correlation analysis, simple correlation coefficient, multiple correlation coefficient,
partial correlation coefficient, and canonical correlation coefficients [12] were pro-
posed and the distributional properties of the estimators were studied. The methods
are used for basic statistical analysis of research data. For making confidence regions
of mean vectors and testing those in multivariate normal distributions, the Hotelling’s
T 2 statistic was proposed as a multivariate extension of the t statistic [11], and it was
proven its distribution is related to the F distribution. In discriminant analysis of
several populations [7, 8], discriminant planes are made according to the optimal
classification method of samples from the populations, which minimize the misclas-
sification probability [10]. In methods of experimental designs [6], the analysis of
variance has been developed, and the methods are used for studies of experiments,
e.g., clinical trials and experiments with animals. In real data analyses, we often face
with missing data and there are cases that have to be analyzed by assuming latent
variables, as well. In such cases, the Expectation and Maximization (EM) algorithm
[2] is employed for the ML estimation of the parameters concerned. The method is
very useful to make the parameter estimation from missing data.
The aim of this chapter is to discuss the above methodologies in view of entropy. In
Sect. 4.2, the correlation coefficient in the bivariate normal distribution is discussed
through an association model approach in Sect. 2.6, and with a discussion similar
to the RC(1) association model [9], the entropy correlation coefficient (ECC) [3] is
derived as the absolute value of the usual correlation coefficient. The distributional
properties are also considered with entropy. Section 4.3 treats regression analysis
in the multivariate normal distribution. From the association model and the GLM
frameworks, it is shown that ECC and the multiple correlation coefficient are equal
and that the entropy coefficient of determination (ECD) [4] is equal to the usual
coefficient of determination. In Sect. 4.4, the discussion in Sect. 4.3 is extended to that
of the partial ECC and ECD. Section 4.5 treats canonical correlation analysis. First,
canonical correlation coefficients between two random vectors are derived with an
ordinary method. Second, it is proven that the entropy with respect to the association
between the random vectors is decomposed into components related to the canonical
correlation coefficients, i.e., pairs of canonical variables. Third, ECC and ECD are
considered to measure the association between the random vectors and to assess
the contributions of pairs of canonical variables in the association. In Sect. 4.6, the
Hotelling’s T 2 statistic from the multivariate samples from a population is discussed.
It is shown that the statistic is the estimator of the KL information between two
multivariate normal distributions, multiplied by the sample size. Section 4.7 extends
the discussion in Sect. 4.6 to that for comparison between two multivariate normal
populations. In Sect. 4.8, the one-way layout experimental design model is treated in
a framework of GLMs, and ECD is used for assessing the factor effect. Section 4.9,
first, makes a discriminant plane between two multivariate normal populations based
on an optimal classification method of samples from the populations, and second,
an entropy-based approach for discriminating the two populations is given. It is also
shown that the squared Mahalanobis’ distance between the two populations [13]
is equal to the KL information between the two multivariate normal distributions.
Finally, in Sect. 4.10, the EM algorithm for analyzing incomplete data is explained
in view of entropy.
1 σ22 (x − μX )2 σ11 (y − μY )2
log f (x, y) = log 1 − −
2π || 2 2|| 2||
σ12 (x − μX )(y − μY )
+ .
||
4.2 Correlation Coefficient and Entropy 93
Let us set
Replacing x and y for the means and taking the expectation of the above log odds
ratio with respect to X and Y we have
f (X , Y )f (μX , μY )
E log = ϕCov(X , Y ).
f (μX , Y )f (X , μY )
Let fX (x) and fY (y) be the marginal density functions of X and Y, respectively.
Since
¨
f (X , Y )f (μX , μY ) f (x, y)
E log = f (x, y)log dxdy
f (μX , Y )f (X , μY ) fX (x)fY (y)
¨
fX (x)fY (y)
+ fX (x)fY (y)log dxdy ≥ 0, (4.5)
f (x, y)
it follows that
ϕCov(X , Y ) ≥ 0.
Dividing the second term by the third one in (4.6), we have the following definition:
94 4 Analysis of Continuous Variables
Definition 4.2 In model (4.3), the entropy correlation coefficient (ECC) is defined
by
|Cov(X , Y )|
ECorr(X , Y ) = √ √ . (4.7)
Var(X ) Var(X )
From (4.5)
√ √ and (4.6), the upper limit of the KL information (4.5) is
|ϕ| Var(X ) Var(X ). Hence, ECC (4.7) can be interpreted as the ratio of the
information induced by the correlation between X and Y in entropy. As shown in
(4.7),
Remark 4.1 The above discussion has been made in a framework of the RC(1)
association model in Sect. 2.6. In the association model, ϕ is assumed to be positive
for scores assigned to categorical variables X and Y. On the other hand, in the normal
model (4.2), if Cov(X , Y ) < 0, then, ϕ < 0.
From the above remark, we make the following definition of the entropy variance:
Definition 4.3 The entropy variances of X and Y in (4.3) are defined, respectively,
by
ECov(X , X )
ECorr(X , Y ) = √ √ .
EVar(X ) EVar(X )
Since the entropy covariance (4.4) is the KL information between model f (x, y)
and independent model fX (x)fY (y), in which follows, for simplicity of discussion, let
us set
¨
f (x, y)
KL(X , Y ) = f (x, y)log dxdy
fX (x)fY (y)
¨
fX (x)fY (y)
+ fX (x)fY (y)log dxdy
f (x, y)
and let
ρ = Corr(X , Y ).
4.2 Correlation Coefficient and Entropy 95
we have
ρ σ12 ρ2
KL(X , Y ) = ϕσ12 = × √ = .
1 − ρ2 σ11 σ22 1 − ρ2
Then, the following statistic is the non-central F distribution with degrees 1 and
n − 2:
F = (n − 2)KL(X , Y ), (4.9)
where
r2
KL(X , Y ) = . (4.10)
1 − r2
t 2 = F = (n − 2)KL(X , Y ).
fY (y|x) = exp
2π σ22 1 − ρ 2 σ22 1 − ρ 2
(y − μY )2
− .
2σ22 1 − ρ 2
Hence, ECCs based on the RC(1) association model approach (4.3) and the above
GLM approach are the same.
In statistic (4.9), for large sample size n, (4.9) is asymptotically distributed
according to the non-central chi-square distribution with degrees of freedom 1 and
non-centrality
ρ2
λ = (n − 2) .
1 − ρ2
λ λ2
c=ν+ and ν = ν + .
ν+λ ν + 2λ
χ 2
χ2 = (4.12)
c
4.2 Correlation Coefficient and Entropy 97
ρ2
λ = (n − 2) .
1 − ρ2
Let
λ λ2
ν = 1, c = 1 + and ν = 1 + . (4.13)
1+λ 1 + 2λ
F (n − 2)KL(X , Y )
= (4.14)
c c
is asymptotically distributed according to the chi-square distribution with degrees
of freedom ν . Hence, statistic (4.14) is asymptotically normal with mean ν and
variance 2ν . From (4.13), for large n we have
ρ 2
(n − 2) 1−ρ 2
c =1+ ρ 2 ≈ 2,
1 + (n − 2) 1−ρ 2
2
2
ρ
(n − 2)2 1−ρ 2 ρ2
ν = 1 + ρ2
≈ (n − 2) from r 2 ≈ ρ 2 .
1 + 2(n − 2) 1−ρ 2 2 1−ρ 2
ρ 2
ρ2
− ν
F F − (n − 2) 1−ρ 2 √ KL(X , Y ) − 1−ρ 2
ZKL = √ c
= = n − 2 . (4.15)
2ν ρ2
2 (n − 2) 1−ρ 2 2 1−ρ ρ2
2
ρ2
ρ 2
From this, the asymptotic confidence intervals of KL(X , Y ) = 1−ρ 2 can be
where
α
2 r2
A1 = KL(X , Y ) − z 1 − ×√ ,
2 n−2 1 − r2
α
2 r2
A2 = KL(X , Y ) + z 1 − ×√
2 n − 2 1 − r2
The statistic (4.18) is refined in view of the normality. The Fisher Z transformation
[5] is given by
1 1+r
ZFisher = log (4.19)
2 1−r
4.2 Correlation Coefficient and Entropy 99
Proof By the Taylor expansion of ZFisher at ρ, for large sample size n we have
1 1+ρ 1 1 1
ZFisher ≈ log + + (r − ρ)
2 1−ρ 2 1+ρ 1−ρ
1 1+ρ 1
= log + (r − ρ).
2 1−ρ 1 − ρ2
where
1+r 2 α
B1 = log −√ z ,
1−r n−3 2
1+r 2 α
B2 = log +√ z
1−r n−3 2
Example 4.1 Table 4.1 shows an artificial data from the bivariate normal distribution
with mean vector μ = (100, 50) and variance matrix
100 35
= .
35 25
where
XX = Cov X, X T ,
p
V = βT X = βi Xi
i=1
Since
β T σ XY
Corr(V, Y ) = √ ,
σYY
β T σ XY λ
L(β) = √ − β T XX β,
σYY 2
Differentiating the above function with respect to β, there exists constant λ such
that
102 4 Analysis of Continuous Variables
∂ 1
L(β) = √ σ XY − λ XX β = 0.
∂β σYY
1
β= √ −1 σ XY .
λ σYY XX
σ Y X −1
XX σ XY
λ2 = .
σYY
√
Hence, for λ = λ2 , Corr(V, Y ) is maximized by (4.22). The theorem follows.
Definition 4.2 The correlation coefficient (4.22) is defined as the multiple correlation
T
coefficient between random vector X = X1 , X2 , . . . , Xp and random variable Y.
T
Second, let us assume the joint distribution of X = X1 , X2 , . . . , Xp and Y is
where φ > 0. The above model is similar to the RC(1) association model. In the
model, ECC is calculated as follows:
Cov(μ(X), ν(Y ))
ECorr(X, Y ) = √ √ .
Var(μ(X)) Var(Y )
Remark 4.3 The ECC cannot be directly defined in the case where X and Y are ran-
T T
dom vectors, i.e., X = X1 , X2 , . . . , Xp (p ≥ 2) and Y = Y1 , Y2 , . . . , Yq (q ≥ 2).
The discussion is given in Sect. 4.5.
Like the RC(1) association model, the ECC is positive. In matrix (4.21), let
1
XX·Y = XX − σ XY σ Y X and σYY ·X = σYY − σ Y X −1
XX σ XY .
σYY
1
f (x, y) = p+1 1
(2π ) 2 [] 2
−1
1 T XX σ XY x
× exp − x , y , (4.25)
2 σ Y X σYY y
1 1 −1 2
λX (x) = − xT −1
XX·Y x, λY (y) = − σYY ·X y ,
2 2
−1
μ(x) = σYY σ Y X −1
XX·Y x, ν(y) = y, φ = 1.
From the above formulation, we have (4.22). The ECC for the normal distribution
(4.25) is the multiple correlation coefficient.
Third, in the GLM framework (3.9), ECC in the normal distribution is discussed.
Let f (y|x) be the conditional density function of Y given X. Assuming the joint
distribution of X and Y is normal with mean vector 0 and variance–covariance
matrix (4.21). Then, f (y|x) is given as the following GLM expression:
1
f (y|x) =
σYY − σ Y X −1
p+1
(2π ) XX σ XY
2
yβx − 21 (βx)2 1 2
y
× exp − 2
,
σYY − σ Y X −1
XX σ XY σYY − σ Y X −1
XX σ XY
where
β = −1
XX σ XY
In (3.9), setting
1 2
θ = βx, a(ϕ) = σYY − σ Y X −1
XX σ XY , b(θ ) = θ , and
2
1 2
y
c(y, ϕ) = − 2
, (4.26)
σYY − σ Y X −1
XX σ XY
In this case, ECCs in the RC(1) association model and the GLM frameworks are
equal to the multiple correlation coefficient. In this sense, the ECC can be called the
entropy multiple correlation coefficient between X and Y. The entropy coefficient of
determination (ECD) is calculated. For (4.26), we have
104 4 Analysis of Continuous Variables
Cov(βX, Y ) σ Y X −1 σ XY
ECD(X, Y ) = = XX
Cov(βX, Y ) + a(ϕ) σ Y X XX σ XY + σYY − σ Y X −1
−1
XX σ XY
σ Y X −1
XX σ XY
= = ECorr(X, Y )2 .
σYY
In the normal case, the above equation holds true. Below, a more general discussion
on ECC and ECD is provided.
Let X, Y, and Z be random variables that have the joint normal distribution with the
following variance–covariance matrix:
⎛ ⎞
σX2 σXY σXZ
= ⎝ σYX σY2 σYZ ⎠,
σZX σZY σZ2
Then, the conditional distribution of X and Y, given Z, has the following variance–
covariance matrix:
2
σX σXY 1 σXZ
XY ·Z = − 2 σZX σZY
σYX σY 2
σZ σYZ
2 σ σ
σX − σ 2 σXY − σXZσσ2 ZY
XZ ZX
= Z Z .
σYX − σYZσσ2ZX σY2 − σYZσσ2ZY
Z Z
From the above result, the partial correlation coefficient ρXY ·Z between X and Y
given Z is
σXZ σZY
σXY − σZ2 ρXY − ρXZ ρYZ
ρXY ·Z = = .
σXZ σZX σYZ σZY
σX2 − σZ2
σY2 − σZ2 1 − ρXZ
2
1 − ρYZ
2
With respect to the above correlation coefficient, the present discussion can be
directly applied. Let fXY (x, y|z) be the conditional density function of X and Y given
Z = z; let fX (x|z) and fY (y|z) be the conditional density functions of X and Y given
Z = z, respectively, and let fZ (z) be the marginal density function of Z. Then, we
have
4.4 Partial Correlation Coefficient 105
˚
fXY (x, y|z)
KL(X , Y |Z) = fXY (x, y|z)fZ (z) log dxdydz
fX (x|z)fY (y|z)
¨
fX (x|z)fY (y|z) ρXY
2
·Z
+ fX (x|z)fY (y|z)fZ (z) log dxdy = .
fXY (x, y|z) 1 − ρXY
2
·Z
(4.27)
Thus, the partial (conditional) ECC and ECD of X and Y given Z are computed
as follows:
KL(X , Y |Z)
ECD(X , Y |Z) = = ρXY
2
·Z.
KL(X , Y |Z) + 1
T T
For normal random vectors X = X1 , X2 , . . . , Xp , Y = Y1 , Y2 , . . . , Yq , and
Z = (Z1 , Z2 , . . . , Zr )T , the joint distribution is assumed to be the multivariate normal
with the following variance–covariance matrix:
⎛ ⎞
XX XY XZ
= ⎝ YX YY YZ ⎠.
ZX ZY ZZ
From the above matrix, the partial variance–covariance matrix of X and Y given
Z is calculated as follows:
XX·Z XY·Z XX XY XZ
(X,Y)·Z ≡ = − Σ −1
ZZ ZX ZY .
YX·Z YY·Z YX YY YZ
Hence,
0.771
ECD(X, Y|Z) = = 0.435.
0.771 + 1
Remark 4.5 The partial variance–covariance matrix (X,Y)·Z can also be calculated
by using the inverse of . Let
⎛ ⎞
XX XY XZ
−1 = ⎝ YX YY YZ ⎠.
ZX ZY ZZ
4.4 Partial Correlation Coefficient 107
Then, we have
−1
XX XY
(X,Y)·Z = .
YX YY
where
XX = E XX T , XY = E XY T , YX = E YX T , and YY = E YY T .
T T
For coefficient vectors a = a1 , a2 , . . . , ap and b = b1 , b2 , . . . , bp , we
determine V1 = aT X and W1 = bT Y that maximize their correlation coefficient
under the following constraints:
λ T μ
h = aT XY b − a XX a − bT YY b.
2 2
Differentiating the above function with respect to a and b, we have
∂h ∂h
= XY b − λ XX a = 0, = YX a − μ YY b = 0.
∂a ∂b
Applying constraints (4.29) to the above equations, we have
aT XY b = λ = μ
Hence, we obtain
λ XX − XY a
= 0.
− YX μ YY b
108 4 Analysis of Continuous Variables
Since
a = 0 and b = 0,
It follows that
λ XX − XY
− YX λ YY = 0.
Since
λ XX − XY λ XX 0
=
− YX λ YY 0 λ YY − 1 YX −1 XY
λ XX
1
= |λ XX |λ YY − YX XX XY
−1
λ
2
= λ | XX | λ YY − YX −1
XX XY
p−q
= 0,
we have
2
λ YY − YX −1 XY = 0. (4.30)
XX
1 ≥ λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0,
From the above equation, we get the roots in (4.31) and q − p zero roots, and the
coefficient vector a is the eigenvector of XY −1 −1
YY YX XX , corresponding to λ1 . Let
2
us denote the obtained coefficient vectors a and b as a(1) and b(1) , respectively. Then,
the pair of variables V1 = aT1 X and W1 = bT1 Y is called the first pair of canonical
variables. Next, we determine V2 = aT(2) X and W2 = bT(2) Y that make the maximum
correlation under the following constraints:
4.5 Canonical Correlation Analysis 109
By using a similar discussion that derives the first pair of canonical variables,
we can derive Eqs. (4.30) and (4.32). Hence, the maximum correlation coefficient
satisfying the above constraints is given by λ2 , and the coefficient vectors a(2)
and b(2) are, respectively, obtained as the eigenvectors of XY −1 −1
YY YX XX and
−1 −1
YX XX XY YY , corresponding to λ2 . Similarly, let a(i) and b(i) be the eigenvec-
2
tors of XY −1 −1 −1 −1
YY YX XX and YX XX XY YY corresponding to the eigenvalues
λi , i ≥ 3. Then, the pair of variables Vi = a(i) X and Wi = bT(i) Y gives the maximum
2 T
and
Corr(Vi , Wi ) = Corr aT(i) X, bT(i) Y = aT(i) XY b(i) = λi , i = 1, 2, . . . , p.
Let us set
T
V = aT(1) X, aT(2) X, . . . , aT(p) X ,
T
T
W (1) = bT(1) Y, bT(2) Y, . . . , bT(p) Y , W (2) = bT(p+1) Y, bT(p+2) Y, . . . , bT(q) Y ,
A = a(1) , a(2) , . . . , a(p) , B(1) = b(1) , b(2) , . . . , b(p) ,
B(2) = b(p+1) , b(p+2) , . . . , b(q) . (4.33)
where I p and I q−p are the p- and (q − p)-dimensional identity matrix, respectively.
Then, the inverse of the above variance–covariance matrix is computed by
⎛⎡ ⎤⎡ ⎤ ⎞
1 −λ1
1−λ21 1−λ21
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎢ 0 ⎥⎢ 0 ⎥ ⎟
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎢ 1 ⎥⎢ −λ2 ⎥ ⎟
⎜⎢ 1−λ22 ⎥⎢ 1−λ22 ⎥ ⎟
⎜⎢ .. ⎥⎢ .. ⎥ 0 ⎟
⎜⎢ . ⎥⎢ . ⎥ ⎟
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎣ 0 ⎦⎣ 0 ⎦ ⎟
⎜ −λp ⎟
⎜ 1 ⎟
⎜⎡ 1−λ2p
⎤⎡ 1−λ2p
⎤ ⎟
⎜ −λ1 ⎟
∗−1 ⎜ 1−λ 1
⎟ (4.35)
⎜⎢ 2
⎥⎢
1−λ21
⎥ ⎟
⎜⎢ ⎟
1
⎜⎢ 0 ⎥⎢ 0 ⎥ ⎟
⎜⎢ −λ2 ⎥⎢ ⎥ ⎟
⎜⎢ ⎥⎢ 1 ⎥ ⎟
⎜⎢ 1−λ22 ⎥⎢ ⎥
⎥ 0 ⎟
1−λ22
⎜⎢ .. ⎥⎢ .. ⎟
⎜⎢ . ⎥⎢ . ⎥ ⎟
⎜⎢ ⎥⎢ ⎥ ⎟
⎜⎣ ⎥⎢ ⎥ ⎟
⎜ 0 ⎦⎣ 0 ⎦ ⎟
⎜ −λp 1 ⎟
⎝ 1−λ2p 1−λ22 ⎠
0 0 I q−p
p
λ2i
KL(X, Y) = .
i=1
1 − λ2i
T
Proof The joint distribution of random vectors V and W T(1) , W T(2) is the multi-
variate normal distribution with variance–covariance matrix (4.34) and the inverse
of (4.34) is given by (4.35). From this, we have
⎛ ⎞
⎛ ⎞ λ1
− 1−λ
λ1 ⎜
2
1 0 ⎟
⎜ 0 ⎟⎜ λ2 ⎟
T
⎜ λ2 ⎟⎜
− 1−λ 2
⎟
KL V , W T(1) , W T(2) = −tr⎜ .. ⎟⎜
2
.. ⎟
⎝ . ⎠⎜ . ⎟
0 ⎝ 0 ⎠
λ
λp − 1−λp 2
p
p
λ2i
= .
i=1
1 − λ2i
Since
T
KL(X, Y) = KL V , W T(1) , W T(2) ,
Remark 4.6 Let the inverse of the variance–covariance matrix (4.28) be denoted as
XX XY
−1 = .
YX YY
Then, we have
p
λ2i
−tr XY YX
= . (4.38)
i=1
1 − λ2i
112 4 Analysis of Continuous Variables
λ2i
KL(Vi , Wi ) = , i = 1, 2, . . . , p. (4.39)
1 − λ2i
p
KL(X, Y) = KL(Vi , Wi ).
i=1
where
λi
φi = , i = 1, 2, . . . , p.
1 − λ2i
g(v, w)g(0, 0)
p
f (x, y)f (0, 0)
log = log = vi φi wi .
f (x, 0)f (0, y) g(v, 0)g(0, w) i=1
p
p
p
λ2i
ECov(X, Y) = φi Cov(vi , wi ) = φi λ i = .
i=1 i=1 i=1
1 − λ2i
and
p
p
p
λi
ECov(X, X) = φi Cov(vi , vi ) = φi = = ECov(Y, Y).
i=1 i=1 i=1
1 − λ2i
4.5 Canonical Correlation Analysis 113
Then, we have
0 ≤ ECov(X, Y) ≤ ECov(X, X) ECov(Y, Y).
where
E(X) = μX , E(Y) = μY ,
μY (x) = μY + YX −1
XX (x − μX ),
YY·X = YY − YX −1
XX XY .
Let
⎧ −1
⎨ θ = YY·X μY (x),
⎪
a(ϕ) = 1, b(θ ) = 21 YY·X θ T θ,
(4.41)
⎪
⎩ c(y, ϕ) = − 1 yT −1 y − log (2π ) 2q | YY·X | 21 .
2 YY·X
Then, we have
1 T −1 1 T −1 1 T −1
f (y|x) = exp y μ
YY·X Y (x) − μ (x) μ (x) − y y
2 Y YY·X Y YY·X
q 1
(2π ) | YY·X | 2
2 2
T
y θ − b(θ )
= exp + c(y, ϕ) . (4.42)
a(ϕ)
As shown above, the multivariate linear regression under the multivariate normal
distribution is expressed by a GLM with (4.41) and (4.42). Since
114 4 Analysis of Continuous Variables
Cov(Y, θ )
KL(X, Y) = = Cov Y, −1 −1 −1
YY·X μY (X) = tr YY·X YX XX XY
a(ϕ)
−1
= tr YY − YX −1
XX XY YX −1
XX XY ,
it follows that
−1
KL(X, Y) tr YY − YX −1
XX XY YX −1
XX XY
ECD(X, Y) = = −1
.
KL(X, Y) + 1 −1 −1
tr YY − YX XX XY YX XX XY + 1
p
λ2i
KL(X, Y) = KL(V , W ) = ,
i=1
1 − λ2i
T
where V and W = W T(1) , W T(2) are given in (4.33). Hence, it follows that
p λ2i
KL(V , W ) i=1 1−λ2i
ECD(X, Y) = = . (4.43)
KL(V , W ) + 1 p λ2i
2 + 1 i=1 1−λi
The above formula implies that the effect of explanatory variable vector X on
λ2
response variable vector Y is decomposed into p canonical components 1−λi 2 , i =
i
1, 2, . . . , p. From the above results (4.40) and (4.43), the contribution ratios of canon-
ical variables (Vi , Wi ) to the association between random vectors X and Y can be
evaluated by
λ2i λ2i
KL(Vi , Wi ) 1−λ2i 1−λ2i
CR(Vi , Wi ) = = = . (4.44)
KL(X, Y) −tr XY YX p λ2j
2 j=1 1−λj
Example 4.3 Let X = (X1 , X2 )T and Y = (Y1 , Y2 , Y3 )T be random vectors that have
the joint normal distribution with the following correlation matrix:
⎛ ⎞
1 0.7 0.4 0.6 0.4
⎜ ⎟
⎜ 0.7 1 0.7 0.6 0.5 ⎟
XX XY ⎜ ⎟
=⎜ 0.4 0.7 1 0.8 0.7 ⎟.
YX YY ⎜ ⎟
⎝ 0.6 0.6 0.8 1 0.6 ⎠
0.4 0.5 0.7 0.6 1
−tr XY YX 1.644
ECD(X.Y) = = = 0.622.
−tr XY + 1
YX 1.644 +1
From this, the two sets of canonical variables are given as follows:
From (4.44), the contribution ratios of the above two sets of canonical variables
are computed as follows:
3.381
CR(V1 , W1 ) = = 0.716,
3.381 + 0.705
0.705
CR(V2 , W2 ) = = 0.284.
3.381 + 0.705
2 T
T∗ = nX S−1 X, (4.45)
where
1 1 T
n n
X= Xi, S = Xi − X Xi − X .
n i=1 n − 1 i=1
Theorem 4.6 Let f (x|μ, Σ) be the multivariate normal density function with mean
vector μ and variance–covariance matrix Σ. Then, it follows that
" "
f (x|μ, Σ) f (x|0, Σ)
f (x|μ, Σ)log dx + f (x|0, Σ)log dx = μT −1 μ.
f (x|0, Σ) f (x|μ, Σ)
(4.46)
Proof Since
f (x|μ, Σ) 1 1 1
log = − (x − μ)T −1 (x − μ) + xT −1 x = xT −1 μ − μT −1 μ,
f (x|0, Σ) 2 2 2
we have
"
f (x|μ, Σ) 1
f (x|μ, Σ)log dx = μT −1 μ,
f (x|0, Σ) 2
"
f (x|0, Σ) 1 T −1
f (x|0, Σ)log dx = μ μ.
f (x|μ, Σ) 2
From the above theorem, statistics (4.45) is the ML estimator of the KL informa-
tion (4.46) multiplied by sample size n − 1. With respect to the statistic (4.45), we
have the following theorem [1].
n−p
Theorem 4.7 For the statistic (4.45), p(n−1) (T ∗ )2 is distributed according to the
non-central F distribution with degrees of freedom p and n − p. The non-centrality
parameter is μT Σ −1 μ.
D(f (x|μ1 , Σ)||f (x|μ0 , Σ)) + D(f (x|μ0 , Σ)||f (x|μ1 , Σ))
= (μ1 − μ0 )T −1 (μ1 − μ0 ).
Proof Let
n−p 2
F= T .
p(n − 1)
n−p 2 1
T ≈ T 2.
p(n − 1) p
From this,
T 2 ≈ pF,
In comparison with the above discussion, let us consider the test for variance
matrices, H0 :Σ = Σ 0 , versus H1 :Σ = Σ 0 . Then, the following KL information
expresses the difference between normal distributions f (x|μ, Σ 0 ) and f (x|μ, Σ):
118 4 Analysis of Continuous Variables
1 1 T
n
n
X= Xi, Σ = Xi − X Xi − X .
n i=1 n i=1
Then, for given Σ 0 , the ML estimators of D(f (x|μ, Σ)||f (x|μ, Σ 0 )) and
D(f (x|μ, Σ 0 )||f (x|μ, Σ)) are calculated as
D f (x|μ, )||f (x|μ, 0 ) = D f x|X̄, ˆ ||f x|X̄, 0 ,
ˆ .
D f (x|μ, 0 )||f (x|μ, ) = D f x|X̄, 0 ||f x||X̄,
n
l(μ, Σ) = logf (X i |μ, Σ),
i=1
and let
4.6 Test of the Mean Vector and Variance-Covariance Matrix … 119
1 1 2
n
n
μ=X= Xi, Σ = Xi − X .
n i=1 n i=1
1
1 n f x i |μ,
l μ, − l μ, 0 = log
n n i=1 f x|μ, 0
"
f (x|μ, )
→ f (x|μ, )log dx,
f (x|μ, 0 )
"
f x|μ,
f x|μ, 0
"
f (x|μ, )
→ f (x|μ, )log dx,
f (x|μ, 0 )
"
f x|μ, 0
f x|μ,
"
f (x|μ, 0 )
→ f (x|μ, 0 )log dx
f (x|μ, )
"
f x|μ, Σ 0
f x|μ, Σ
where
o(n)
→ 0(n → ∞).
n
Hence, under the null hypothesis, we obtain
"
f x|μ,
f x|μ, 0
120 4 Analysis of Continuous Variables
⎛
"
f x|μ,
= n⎝ f x|μ, log
dx
f x|μ, 0
" ⎞
f x|μ, 0
dx⎠ + o(n)
+ f x|μ, 0 log
f x|μ,
= n D f (x|μ, ||f )(x|μ, 0 ) + D f (x|μ, 0 ||f )(x|μ, )
+ o(n).
1 1
m n
X1 = X 1i and X2 = X 2i , (4.49)
m i=1 n i=1
are distributed according to the normal distribution f x|μ1 , m1 Σ and f x|μ2 , 1n Σ ,
respectively. From the samples, the unbiased estimators of μ1 , μ2 and Σ are,
respectively, given by
(m − 1)S1 + (n − 1)S2
μ1 = X 1 , μ2 = X 2 , and S = ,
m+n−2
4.7 Comparison of Mean Vectors of Two Multivariate … 121
where
⎧ m
⎪ T
⎪
⎨ S1 =
1
X 1i − X 1 X 1i − X 1 ,
m−1
i=1
n
T
⎪
⎪
⎩ S2 = X 2i − X 2 X 2i − X 2 .
1
n−1
i=1
mn T −1
T2 = X1 − X2 S X1 − X2 . (4.50)
m+n
Then, the estimator of the above information is given as the following statistic:
−1
T 1 1
T 2 = X1 − X2 S1 + S2 X1 − X2 . (4.51)
m n
Remark 4.7 In Theorem 4.12, the theorem holds true, regardless of the normality
because X 1 − X 2 is asymptotically
assumption of samples, distributed according to
normal distribution f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2 .
Yij = μ + αi + eij , i = 1, 2, . . . , I ; j = 1, 2, . . . , n.
where
E Yij = μ + αi , i = 1, 2, . . . , I ; j = 1, 2, . . . , n,
E eij = 0, Var eij = σ 2 , Cov eij , ekl = 0, i = k, j = j.
I
αi = 0. (4.53)
i=1
In order to consider the above model in a GLM framework, the following dummy
variables are introduced:
#
1 (for level i)
Xi = , i = 1, 2, . . . , I . (4.54)
0 (for the other levels)
I
η= αi Xi ,
i=1
it follows that
I
θ =μ+ αi Xi .
i=1
The factor levels are randomly assigned to experimental units, e.g., subjects, with
probability 1I , so the marginal distribution of response Y is
θi2
1 √
I
yθi − y2
f (y) = exp 2
− − log 2π σ ,
2
I i=1 σ2 2σ 2
where
θi = μ + αi , i = 1, 2, . . . , I .
Then, we have
I " "
1 f (y|Xi = 1) f (y)
f (y|Xi = 1)log dy + f (y)log dy
I i=1 f (y) f (y|Xi = 1)
I I
i=1 Cov(Y , αi Xi ) i=1 αi
1 1
Cov(Y , θ ) 2
= = I
= I
. (4.55)
σ2 σ2 σ2
Hence, the entropy coefficient of determination is calculated as
I
i=1 αi
1 2
ECD(X, Y ) = 1 I
I
.
i=1 αi + σ
2 2
I
1 2 1 2
I I
= αi + σ 2 = α + σ 2. (4.56)
I i=1 I i=1 i
1
I
Var(Y |X) = Var(Y |Xi = 1)
I i=1
"
θi2
1 √
I
yθi − y2
= (y − μ − αi )2 exp 2
− − log 2π σ 2 dy
I i=1 σ2 2σ 2
1 2
I
= σ = σ 2. (4.57)
I i=1
1 2
I
Var(Y ) − Var(Y |X) = α .
I i=1 i
Hence, we obtain
I " "
1 f (y|Xi = 1) f (y)
f (y|Xi = 1)log dy + f (y)log dy
I i=1 f (y) f (y|Xi = 1)
Var(Y ) − Var(Y |X)
= .
Var(Y |X)
I
n
2
I
2
I
n
2
SS T = Yij − Y ++ = n Y i+ − Y ++ + Yij − Y i+ ,
i=1 j=1 i=1 i=1 j=1
where
1 1
n I n
Y i+ = Yij , Y ++ = Yij .
n j=1 nI i=1 j=1
Let
I
2
I
n
2
SS A = n Y i+ − Y ++ , SS E = Yij − Y i+ , (4.58)
i=1 i=1 j=1
Then, the expectations of the above sums of variances are calculated as follows:
I $ 2 %
I
E(SS A ) = n E Y i+ − Y ++ = (I − 1)σ 2 + n αi2 ,
i=1 i=1
I n $ 2 % I n
n−1 2
E(SS E ) = E Yij − Y i+ = σ = I (n − 1)σ 2 .
i=1 j=1 i=1 j=1
n
In this model, the entropy correlation coefficient between factor X and response
Y is calculated as
&
' 1 I
Cov(Y , θ ) ' i=1 αi
2
ECorr(X , Y ) = √ √ =(1 I
. (4.59)
Var(Y ) Var(θ ) I
I
i=1 αi + σ
2 2
and ECC is the square root of ECD. Since the ML estimators of the effects αi2 and
error variance σ 2 are, respectively, given by
1 2
I n
α i = Y i+ − Y ++ and σ 2 =
Yij − Y i+ ,
nI i=1 j=1
I 2
i=1 Y i+ − Y ++
1
ECD(X, Y ) = 2
I
2
I
1
I i=1 Y i+ − Y ++ + nI1 Ii=1 nj=1 Yij − Y i+
SSA
= . (4.60)
SSA + SSE
SSE (I − 1)F
ECD(X, Y ) = = . (4.62)
SSA
SSE
+1 (I − 1)F + I (n − 1)
I
ni αi = 0, (4.63)
i=1
where
I
N= ni .
i=1
4.8 One-Way Layout Experiment Model 127
= 1 + 2 . (4.65)
such that if x ∈ 1 , then the observation x is classified into population f1 (x) and if
x ∈ 2 , into population f2 (x). If there is no prior information on the two populations,
the first error probability is
"
P(2|1) = f1 (x)dx,
2
In order to minimize the total error probability P(2|1) + P(1|2), the optimal
classification procedure is made by deciding the optimal decomposition of the sample
space (4.65).
Theorem 4.13 The optimal classification procedure that minimizes the error prob-
ability P(2|1) + P(1|2) is given by the following decomposition of sample space
(4.65):
# ) # )
f1 (x) f1 (x)
1 = x| > 1 , 2 = x| ≤1 . (4.66)
f2 (x) f2 (x)
128 4 Analysis of Continuous Variables
Proof Let 1 and 2 be any decomposition of sample space . For the decompo-
sition, the error probabilities are denoted by P(2|1) and P(1|2) . Then,
" " "
P(2|1) = f1 (x)dx = f1 (x)dx + f1 (x)dx,
2 2 ∩1 2 ∩2
" " "
P(1|2) = f2 (x)dx = f2 (x)dx + f2 (x)dx.
1 1 ∩1 1 ∩2
and
" " "
P(1|2) − P(1|2) = f2 (x)dx − f2 (x)dx ≤ f1 (x)dx
2 ∩1 1 ∩2 2 ∩1
"
− f2 (x)dx.
1 ∩2
π1 (x) + π2 (x) = 1, x ∈ .
Then, the entropy of classification function δ(X) with respect to the correct and
incorrect classifications can be calculated by
"
H(Y (δ(X))) = (−p(x)logp(x) − (1 − p(x))log(1 − p(x)))f (x)dx,
where
f1 (x) + f2 (x)
f (x) = .
2
With respect to the above entropy, it follows that
" # )
f1 (x) f2 (x)
H (Y (δ(X))) ≥ min −log , −log f (x)dx
f1 (x) + f2 (x) f1 (x) + f2 (x)
"
f1 (x)
=− f (x)log dx
f1 (x) + f2 (x)
1
"
f2 (x)
− f (x)log dx, (4.67)
f1 (x) + f2 (x)
2
where 1 and 2 are given in (4.66). According to Theorem 4.13, for (4.66), the
optimal classification function δoptimal (X) can be set as
#
1 x ∈ 1
π1 (x) = , π2 (x) = 1 − π1 (x). (4.68)
0 x ∈ 2
For p-variate normal distributions N (μ1 , ) and N (μ2 , ), let f1 (x) and f2 (x) be
the density functions, respectively. Then, it follows that
f1 (x) 1
> 1 ⇔ logf1 (x) − logf2 (x) = − (x − μ1 ) −1 (x − μ1 )T
f2 (x) 2
1
+ (x − μ2 ) −1 (x − μ2 )T = (μ1 − μ2 ) −1 xT
2
1 1
− μ1 −1 μT1 + μ2 −1 μT2
2 2
−1 μT1 + μT2
= (μ1 − μ2 ) x −
T
>0 (4.69)
2
1 1
(μ1 − μ2 ) −1 xT − μ1 −1 μT1 + μ2 −1 μT2 = 0,
2 2
discriminates the two normal distributions. From this, the following function is called
the linear discriminant function:
Y = (μ1 − μ2 ) −1 xT . (4.70)
The Mahalanobis’ distance between the two mean vectors μ1 and μ2 of p-variate
normal distributions N (μ1 , ) and N (μ2 , ) is given by
DM (μ1 , μ2 ) = (μ1 − μ2 )T −1 (μ1 − μ2 )
∂g
= 2 α T μ1 − α T μ2 (μ1 − μ2 ) − 2λΣα = 0
∂α
T
Since ν = α μ1 − α T μ2 is a scalar, we obtain
ν(μ1 − μ2 ) = λα.
From the above theorem, the discriminant function (4.70) discriminates N (μ1 , )
and N (μ2 , ) in the sense of the maximum of the KL information (4.72).
132 4 Analysis of Continuous Variables
The EM algorithm [2] is widely used for the ML estimation from incomplete data.
Let X and Y be complete and incomplete data (variables), respectively; let f (x|φ)
and g(y|φ) be the density or probability function of X and Y, respectively; let be
the sample space of complete data X and (y) = {x ∈ |Y = y} be the conditional
sample space of X given Y = y. Then, the log likelihood function of φ based on
incomplete data Y = y is
f (x|φ)
k(x|y, φ) = . (4.75)
g(y|φ)
Let
H φ |φ = E logk X|y, φ |y, φ
"
= k(x|y, φ)logk x|y, φ dx.
(y)
Then, the above function is the negative entropy of distribution k x|y, φ (4.75)
for distribution k(x|y, φ). Hence, from Theorems 1.1 and 1.8, we have
H φ |φ ≤ H (φ|φ). (4.76)
The inequality holds if and only if k x|y, φ = k(x|y, φ). Let
4.10 Incomplete Data Analysis 133
⎛ ⎞
"
⎜ f (x|φ) ⎟
Q φ |φ = E logf X|φ |y, φ ⎝= logf x|φ dx⎠.
g(y|φ)
(y)
Then,
l φ p+1 ≥ l φ p , p = 1, 2, . . . .
so it follows that
l φ p+1 ≥ l φ p .
l φ = max l(φ)
φ
l φ ∗ ≥ l φ p , p = 1, 2, . . . .
Q φ |φ = max Q φ|φ .
φ
we have
∂ ∂
Q φ|φ p = E logf (x|φ)|y, φ p
∂φ ∂φ
∂
= E t(X)|y, φ p − loga(φ)
∂φ
"
1 ∂
= E t(X)|y, φ p − b(x)exp φt(x)T dx
a(φ) ∂φ
= E t(X)|y, φ − E(t(X)|φ) = 0.
p
4.10 Incomplete Data Analysis 135
From the above result, the EM algorithm with (4.79) and (4.80) can be simplified
as follows:
(i) E-step
Compute
t p+1 = E t(X)|y, φ p . (4.81)
(ii) M-step
Obtain φ p+1 from the following equation:
Example 4.4 Table 4.3 shows random samples according to bivariate normal
distribution with mean vector μ = (50, 60) and variance–covariance matrix
225 180
= .
180 400
From Table 4.3, the mean vector, the variance–covariance matrix, and the
correlation coefficient are estimated as
174.3 123.1
Table 4.4 illustrates incomplete (missing) data from Table 4.3. In the missing data,
if only the samples 1–40 with both values of the two variables are used for estimating
the parameters, we have
60.7 24.7
In data analysis, we have to use all the data in Table 4.4 to estimate the parameters.
The EM algorithm (4.81) and (4.82) is applied to analyze the data in Table 4.4. Let
μp , p and ρ p be the estimates of the parameters in the pth iteration, where
Let x1i and x2i be the observed values of case i for variables X1 and X2 in
Table 4.4, respectively. For example, (x11 , x21 ) = (69.8, 80.8) and x1,41 , x2,41 =
(41.6, missing). Then, in the E-step (4.81), the missing values in Table 4.4 are
estimated as follows:
σ21 p
p
p p
x2i = μ2 + p x1i − μ1 , i = 41, 42, . . . , 60.
σ11
In the (p + 1) th iteration, by using the observed data in Table 4.4 and the above
estimates of the missing values, we have μp+1 , p+1 and ρ p+1 in the M-step (4.82).
Setting the initial values of the parameters by (4.83), we have
174.3 70.8
μ∞ = (47.7, 62.4), ∞ = , ρ ∞ = 0.461. (4.84)
70.8 135.2
Hence, the convergence values (4.84) are the ML estimates obtained by using the
incomplete data in Table 4.4.
The EM algorithm can be used for the ML estimation from incomplete data in
both continuous and categorical data analyses. A typical example of categorical
incomplete data is often faced in an analysis from phenotype data. Although the
present chapter has treated continuous data analysis, in order to show an efficacy of
the EM algorithm to apply to analysis of phenotype data, the following example is
provided.
Example 4.5 Let p, q, and r be the probabilities or ratios of blood gene types A,
B, and O, respectively, in a large and closed population. Then, a randomly selected
individual in the population has one of genotypes AA, AO, BB, BO, AB, and OO with
probability p2 , 2pr, q2 , 2qr, 2pq, r 2 , respectively; however, the complete information
on the genotype cannot be obtained and we can merely decide phenotype A, B, or
O from his or her blood sample. Table 4.5 shows an artificial data produced by
p = 0.4, q = 0.3, r = 0.3. Let nAA , nAO , nBB , nBO , nAB , and nOO be the numbers of
genotypes AA, AO, BB, BO, AB, and OO in the data in Table 4.5. Then, we have
nAA + nAO = 394, nBB + nBO = 277, nAB = 239, nOO = 90.
(ii) M-step
AA + nAO + nAB
2nu+1 2nu+1 + nu+1
BO + nAB
u+1
p(u+1) = , q(u+1) = BB ,
2n 2n
For the initial estimates of the parameters p(0) = 0.1, q(0) = 0.5, and r (0) = 0.4,
we have
4.11 Discussion
In this chapter, first, correlation analysis has been discussed in association model
and GLM frameworks. In correlation analysis, the absolute values of the correlation
coefficient and the partial correlation coefficient are ECC and the conditional ECC,
respectively, and the multiple correlation coefficient and the coefficient of determi-
nation are the same as ECC and ECD, respectively. In canonical correlation analysis,
4.11 Discussion 139
it has been shown that the KL information that expresses the association between
two random vectors are decomposed into those between the pairs of canonical vari-
ables, and ECC and ECD have been calculated to measure the association between
two random vectors. Second, it has been shown that basic statistical methods have
been explained in terms of entropy. In testing the mean vectors, the Hotelling’s T 2
statistic has been deduced from KL information and in testing variance–covariance
matrices of multivariate normal distributions, an entropy-based test statistic has been
proposed. The discussion has been applied in comparison with two multivariate
normal distributions. The discussion has a possibility, which is extended to a more
general case, i.e., comparison of more than two normal distributions. In the exper-
imental design, one-way layout experimental design model has been treated from
a viewpoint of entropy. The discussion can also be extended to multiway layout
experimental design models. In classification and discriminant analysis, the opti-
mal classification method is reconsidered through an entropy-based approach, and
the squared Mahalanobis’s distance has been explained with the KL information.
In missing data analysis, the EM algorithm has been overviewed in entropy. As
explained in this chapter, entropy-based discussions for continuous data analysis are
useful and it suggests a novel direction to make approaches for other analyses not
treated in this chapter.
References
14. Patnaik, P. B. (1949). The non-central χ2 and F-distributions and their applications. Biometrika,
36, 202–232.
15. Wald, A. (1944). On a statistical problem arising in the classification of an individual into one
of two groups. Annals of Statistics, 15, 145–162.
Chapter 5
Efficiency of Statistical Hypothesis Test
Procedures
5.1 Introduction
The efficiency of hypothesis test procedures has been discussed mainly in the context
of the Pittman- and Bahadur-efficiency approaches [1] up to now. In the Pitman-
efficiency approach [4, 5], the power functions of the test procedures are compared
from a viewpoint of sample sizes, on the other hand, in the Bahadur-efficiency
approach, the efficiency is discussed according to the slopes of log power functions,
i.e., the limits of the ratios of the log power functions for sample sizes. Although the
theoretical aspects in efficiency of test statistics have been derived in both approaches,
herein, most of the results based on the two approaches are the same [6]. The aim
of this chapter is to reconsider the efficiency of hypothesis testing procedures in the
context of entropy. In Sect. 5.2, the likelihood ratio test is reviewed, and it is shown
that the procedure is the most powerful test. Section 5.3 reviews the Pitman and
Bahadur efficiencies. In Sect. 5.4, the asymptotic distribution of the likelihood ratio
test statistic is derived, and the entropy-based efficiency is made. It is shown that
the relative entropy-based efficiency is related the Fisher information. By using the
median test as an example, it is shown that the results based on the relative entropy-
based efficiency, the relative Pitman and Bahadur efficiencies are the same under an
appropriate condition. Third, “information of parameters” in general test statistics is
discussed, and a general entropy-based efficiency is defined. The relative efficiency
of the Wilcoxon test, as an example, is discussed from the present context.
The above testing procedure is called the likelihood ratio test. The procedure is
the most powerful, and it is proven in the following theorem [3]:
Theorem 5.1 (Neyman–Pearson Lemma) Let W be any critical region with signif-
icance level α and W0 that based on the likelihood ratio with significance level α
(5.1), the following inequality holds:
n
α= f (xi |θ0 )dx1 dx2 . . . dxn
W i=1
n
= f (xi |θ0 )dx1 dx2 . . . dxn
W0 ∩W i=1
n
+ f (xi |θ0 )dx1 dx2 . . . dxn .
W0c ∩W i=1
such that
n
f (xi |θ1 ) ϕn (t|θ1 )
α = P ni=1 > λ(f , n)|H0 = P > λ(ϕ, n)|H0 .
i=1 f (xi |θ0 ) ϕn (t|θ0 )
144 5 Efficiency of Statistical Hypothesis Test Procedures
Since from Theorem 5.1, critical region WT gives the most powerful test procedure
among those based on statistic T, the theorem follows.
Definition 5.1 (Pitman efficiency) Let Wn = {Tn > tn } be critical regions with sig-
nificant level α, i.e., α = P{Wn |θ0 }. Then, for the sequence of power functions
γn (θn ) = P{Wn |θn }, n = 1, 2, . . ., the Pitman efficiency is defined by
n
RPE T (2) |T (1) = logn→∞ (5.4)
Nn
Example 5.1 Let $X_1, X_2, \ldots, X_n$ be a random sample from the normal distribution $N(\mu,\sigma^2)$. Consider testing the null hypothesis $H_0: \mu=\mu_0$ versus the alternative hypothesis $H_1: \mu = \mu_0 + \frac{h}{\sqrt n}\;(=\mu_n)$ for $h>0$. The critical region with significance level $\alpha$ is given by
$$\bar X = \frac1n\sum_{i=1}^n X_i > \mu_0 + z_\alpha\frac{\sigma}{\sqrt n},$$
where
$$P\!\left(\bar X > \mu_0 + z_\alpha\frac{\sigma}{\sqrt n}\,\Big|\,\mu_0\right) = \alpha.$$
Let
$$Z = \frac{\sqrt n\left(\bar X - \mu_0 - \frac{h}{\sqrt n}\right)}{\sigma}.$$
Then, $Z$ is distributed according to the standard normal distribution under $H_1$, and we have
$$\gamma\!\left(\mu_0+\frac{h}{\sqrt n}\right) = P\!\left(\bar X > \mu_0 + z_\alpha\frac{\sigma}{\sqrt n}\,\Big|\,\mu_0+\frac{h}{\sqrt n}\right) = P\!\left(Z > -\frac{h}{\sigma}+z_\alpha\right) = 1-\Phi\!\left(-\frac{h}{\sigma}+z_\alpha\right),$$
where $\Phi(z)$ is the distribution function of $N(0,1)$. The above power function is constant in $n$, so we have
$$\mathrm{PE}(\bar X) = 1-\Phi\!\left(-\frac{h}{\sigma}+z_\alpha\right).$$
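As a numerical illustration of Example 5.1 (not part of the original text; the values of $h$, $\sigma$, $\alpha$, and the sample sizes are assumptions chosen for illustration), the following sketch evaluates the exact power of the critical region $\bar X > \mu_0 + z_\alpha\sigma/\sqrt n$ under the local alternative $\mu_n = \mu_0 + h/\sqrt n$ and confirms that it is constant in $n$ and equal to $1-\Phi(-h/\sigma+z_\alpha)$.

```python
from scipy.stats import norm
import numpy as np

mu0, sigma, alpha, h = 0.0, 1.0, 0.05, 1.5     # illustrative values (assumptions)
z_alpha = norm.ppf(1 - alpha)                  # upper 100*alpha percentile

for n in [10, 100, 1000, 10000]:
    mu_n = mu0 + h / np.sqrt(n)                # local alternative
    crit = mu0 + z_alpha * sigma / np.sqrt(n)  # critical value for the sample mean
    # Under H1, Xbar ~ N(mu_n, sigma^2/n); power = P(Xbar > crit | mu_n)
    power = 1 - norm.cdf(crit, loc=mu_n, scale=sigma / np.sqrt(n))
    print(n, round(power, 6))

# limiting (Pitman) power, constant in n
print(1 - norm.cdf(-h / sigma + z_alpha))
```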
Definition 5.4 (Relative Bahadur efficiency) For (5.5), the relative Bahadur efficiency (RBE) is defined by
$$\mathrm{RBE}\bigl(T^{(2)}\mid T^{(1)}\bigr) = \frac{c_2(\theta_1,\theta_0)}{c_1(\theta_1,\theta_0)}. \qquad (5.6)$$
$$T_n(\mathbf X) = \bar X,$$
where
$$O\!\left(n^{-1/2}\right)\big/ n^{-1/2} \xrightarrow{\;P\;} c \quad (n\to\infty).$$
Since
$$1 - F_n\!\left(T_n(\mathbf X)\mid\mu_0\right) = \int_{\bar X}^{+\infty}\frac{\sqrt n}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{n}{2\sigma^2}(t-\mu_0)^2\right)dt = \int_{\mu_1+O(n^{-1/2})}^{+\infty}\frac{\sqrt n}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{n}{2\sigma^2}(t-\mu_0)^2\right)dt$$
$$\approx \int_{0}^{+\infty}\frac{\sqrt n}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{n}{2\sigma^2}(s-\mu_0+\mu_1)^2\right)ds = \frac{\sqrt n}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{n}{2\sigma^2}(\mu_1-\mu_0)^2\right)\int_0^{+\infty}\exp\!\left(-\frac{n}{\sigma^2}(\mu_1-\mu_0)s-\frac{n}{2\sigma^2}s^2\right)ds,$$
and
$$\frac{\sqrt n}{\sqrt{2\pi\sigma^2}}\int_0^{+\infty}\exp\!\left(-\frac{n}{2\sigma^2}(s-\mu_1+\mu_0)^2\right)ds > \frac{\sqrt n}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{n}{2\sigma^2}(\mu_1-\mu_0)^2\right)\int_{\mu_1-\mu_0}^{+\infty}\exp\!\left(-\frac{n}{2\sigma^2}(s-\mu_1+\mu_0)^2\right)ds = \frac12\exp\!\left(-\frac{n}{2\sigma^2}(\mu_1-\mu_0)^2\right).$$
From this, we have
$$-\frac{2}{n}\log\!\left(\frac{\sigma}{\sqrt{2\pi n}\,(\mu_1-\mu_0)}\exp\!\left(-\frac{n}{2\sigma^2}(\mu_1-\mu_0)^2\right)\right) = -\frac{2}{n}\log\frac{\sigma}{\sqrt{2\pi n}\,(\mu_1-\mu_0)} + \frac{1}{\sigma^2}(\mu_1-\mu_0)^2 \to \frac{1}{\sigma^2}(\mu_1-\mu_0)^2 = c(\mu_1,\mu_0) \quad (n\to\infty),$$
$$-\frac{2}{n}\log\!\left(\frac12\exp\!\left(-\frac{n}{2\sigma^2}(\mu_1-\mu_0)^2\right)\right) = \frac{2\log 2}{n} + \frac{1}{\sigma^2}(\mu_1-\mu_0)^2 \to \frac{1}{\sigma^2}(\mu_1-\mu_0)^2,$$
so it follows that
$$\frac{L_n(T_n(\mathbf X)\mid\mu_0)}{n} = \frac{-2\log\bigl(1-F_n(T_n(\mathbf X)\mid\mu_0)\bigr)}{n} \to \frac{1}{\sigma^2}(\mu_1-\mu_0)^2 = c(\mu_1,\mu_0) \quad (n\to\infty).$$
Let $f(x)$ and $g(x)$ be the normal density functions of $N(\mu_0,\sigma^2)$ and $N(\mu_1,\sigma^2)$, respectively. Then, in this case, we also have
$$2D(g\|f) = \frac{1}{\sigma^2}(\mu_1-\mu_0)^2 = c(\mu_1,\mu_0).$$
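The identity $2D(g\|f) = (\mu_1-\mu_0)^2/\sigma^2$ linking the Bahadur slope to the KL information can be checked numerically. The following sketch (illustrative parameter values, not taken from the text) evaluates the KL information between the two normal densities by numerical integration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu0, mu1, sigma = 0.0, 0.8, 1.3              # illustrative values (assumptions)

f = lambda x: norm.pdf(x, mu0, sigma)        # density under H0
g = lambda x: norm.pdf(x, mu1, sigma)        # density under H1

# KL information D(g||f) = integral of g(x) log(g(x)/f(x)) dx
D_gf, _ = quad(lambda x: g(x) * np.log(g(x) / f(x)), -np.inf, np.inf)

print(2 * D_gf)                              # twice the KL information
print((mu1 - mu0) ** 2 / sigma ** 2)         # Bahadur slope c(mu1, mu0)
```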
In the next section, the efficiency of test procedures is discussed in view of entropy.
As reviewed in Sect. 5.2, the likelihood ratio test is the most powerful test. In this section, the test is discussed with respect to information. Let $f(x)$ and $g(x)$ be density or probability functions corresponding to the null hypothesis $H_0$ and the alternative hypothesis $H_1$, respectively. Then, for a sufficiently large sample size $n$, i.e., $X_1, X_2, \ldots, X_n$, we have
$$\log\frac{\prod_{i=1}^n g(X_i)}{\prod_{i=1}^n f(X_i)} = \sum_{i=1}^n\log\frac{g(X_i)}{f(X_i)} \approx \begin{cases}-nD(f\|g) & (\text{under } H_0)\\ \;\;\,nD(g\|f) & (\text{under } H_1)\end{cases} \qquad (5.7)$$
where
$$D(f\|g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx, \qquad D(g\|f) = \int g(x)\log\frac{g(x)}{f(x)}\,dx. \qquad (5.8)$$
In (5.8), the integrals are replaced by summations if $f(x)$ and $g(x)$ are probability functions.
Remark 5.1 In general, with respect to the KL information (5.8), it follows that
The critical region of the likelihood ratio test with significance level α is given by
$$\frac{\prod_{i=1}^n g(X_i)}{\prod_{i=1}^n f(X_i)} > \lambda(n), \qquad (5.9)$$
where
$$P\!\left(\frac{\prod_{i=1}^n g(X_i)}{\prod_{i=1}^n f(X_i)} > \lambda(n)\,\Big|\,H_0\right) = \alpha.$$
With respect to the likelihood ratio function, we have the following theorem:
Theorem 5.2 For parameter θ , let f (x) = f (x|θ0 ) and g(x) = f (x|θ0 + δ) be
density or probability functions corresponding to null hypothesis H0 : θ = θ0 and
alternative hypothesis H1 : θ = θ0 + δ, respectively, and let X1 , X2 , . . . , Xn be
samples to test the hypotheses. For large sample size n and small constant δ, the
following log-likelihood ratio statistic
$$\log\frac{\prod_{i=1}^n g(X_i)}{\prod_{i=1}^n f(X_i)} = \sum_{i=1}^n\log\frac{g(X_i)}{f(X_i)} \qquad (5.10)$$
is asymptotically distributed according to a normal distribution with mean $-nD(f\|g)$ and variance $n\,\mathrm{Var}\!\left(\log\frac{g(X)}{f(X)}\,\Big|\,\theta_0\right)$ under $H_0$. For small $\delta$, we obtain
$$\log\frac{g(X)}{f(X)} = \log f(X\mid\theta_0+\delta) - \log f(X\mid\theta_0) \approx \delta\frac{d}{d\theta}\log f(X\mid\theta_0).$$
Since
$$E\!\left[\frac{d}{d\theta}\log f(X\mid\theta_0)\,\Big|\,\theta_0\right] = \int\frac{d}{d\theta}f(x\mid\theta_0)\,dx = 0,$$
we have
$$E\!\left[\left(\log\frac{g(X)}{f(X)}\right)^{\!2}\Big|\,\theta_0\right] \approx \delta^2 E\!\left[\left(\frac{d}{d\theta}\log f(X\mid\theta_0)\right)^{\!2}\Big|\,\theta_0\right] = \delta^2 I(\theta_0).$$
Hence, we get
$$\mathrm{Var}\!\left(\log\frac{g(X)}{f(X)}\,\Big|\,\theta_0\right) \approx \delta^2 I(\theta_0) - D(f\|g)^2.$$
Similarly, under $H_1$,
$$E\!\left[\left(\log\frac{g(X)}{f(X)}\right)^{\!2}\Big|\,\theta_0+\delta\right] \approx \delta^2 E\!\left[\left(\frac{d}{d\theta}\log f(X\mid\theta_0)\right)^{\!2}\Big|\,\theta_0+\delta\right] \approx \delta^2 E\!\left[\left(\frac{d}{d\theta}\log f(X\mid\theta_0)\right)^{\!2}\Big|\,\theta_0\right] = \delta^2 I(\theta_0).$$
According to the above theorem, for critical region (5.9), the power satisfies
$$P\!\left(\frac{\prod_{i=1}^n g(X_i)}{\prod_{i=1}^n f(X_i)} > \lambda(n)\,\Big|\,H_1\right) \to 1 \quad (n\to\infty).$$
and
$$\left|\frac{d}{d\theta}f(x\mid\theta)\right| < \varphi_2(x), \qquad (5.14)$$
then
$$\int f(x\mid\theta+\delta)\,\frac{d}{d\theta}\log f(x\mid\theta)\,dx \to 0 \quad (\delta\to 0).$$
Example 5.2 For the exponential density function $f(x\mid\lambda) = \lambda e^{-\lambda x}\;(x>0)$, for $a<\lambda<b$ and small $\delta>0$, we have
$$\left|f(x\mid\lambda+\delta)\,\frac{d}{d\lambda}\log f(x\mid\lambda)\right| = f(x\mid\lambda+\delta)\,\frac{1}{f(x\mid\lambda)}\left|\frac{d}{d\lambda}f(x\mid\lambda)\right| = (\lambda+\delta)e^{-(\lambda+\delta)x}\,\frac{|1-\lambda x|}{\lambda} < \frac{2|1-\lambda x|}{\lambda}\,e^{-\lambda x}\;(=\varphi_1(x)).$$
Similarly, it follows that
$$\left|\frac{d}{d\lambda}f(x\mid\lambda)\right| = |1-\lambda x|\,e^{-\lambda x} < (1+bx)\,e^{-ax}\;(=\varphi_2(x)).$$
Since the functions $\varphi_1(x)$ and $\varphi_2(x)$ are integrable, from Lemma 5.1 we have
$$\int_0^{\infty}f(x\mid\lambda+\delta)\,\frac{d}{d\lambda}\log f(x\mid\lambda)\,dx = \int_0^{\infty}(\lambda+\delta)\exp(-(\lambda+\delta)x)\left(\frac1\lambda - x\right)dx = \frac1\lambda - \frac{1}{\lambda+\delta} \to 0 \quad (\delta\to 0).$$
The relation between the KL information and the Fisher information is given in
the following theorem:
Theorem 5.3 Let f (x|θ ) and f (x|θ + δ) be density or probability functions with
parameters θ and θ + δ, respectively, and let I (θ ) be the Fisher information. Under
the same condition as in Lemma 5.1, it follows that
$$D\bigl(f(x\mid\theta)\,\|\,f(x\mid\theta+\delta)\bigr) \approx \frac{\delta^2}{2}I(\theta) \approx D\bigl(f(x\mid\theta+\delta)\,\|\,f(x\mid\theta)\bigr). \qquad (5.15)$$
Proof Since
$$\log\frac{f(x\mid\theta)}{f(x\mid\theta+\delta)} = \log f(x\mid\theta) - \log f(x\mid\theta+\delta) \approx -\delta\frac{d}{d\theta}\log f(x\mid\theta) - \frac{\delta^2}{2}\frac{d^2}{d\theta^2}\log f(x\mid\theta),$$
from this and Lemma 5.1, for small $\delta$ we have
$$D(f(x\mid\theta)\|f(x\mid\theta+\delta)) = \int f(x\mid\theta)\log\frac{f(x\mid\theta)}{f(x\mid\theta+\delta)}\,dx \approx \int f(x\mid\theta)\left(-\delta\frac{d}{d\theta}\log f(x\mid\theta) - \frac{\delta^2}{2}\frac{d^2}{d\theta^2}\log f(x\mid\theta)\right)dx$$
$$= -\delta\int f(x\mid\theta)\frac{d}{d\theta}\log f(x\mid\theta)\,dx - \frac{\delta^2}{2}\int f(x\mid\theta)\frac{d^2}{d\theta^2}\log f(x\mid\theta)\,dx \approx -\delta\int\frac{d}{d\theta}f(x\mid\theta)\,dx - \frac{\delta^2}{2}\int f(x\mid\theta)\frac{d^2}{d\theta^2}\log f(x\mid\theta)\,dx$$
$$= -\frac{\delta^2}{2}\int f(x\mid\theta)\frac{d^2}{d\theta^2}\log f(x\mid\theta)\,dx = \frac{\delta^2}{2}I(\theta).$$
Similarly, since
$$\log\frac{f(x\mid\theta+\delta)}{f(x\mid\theta)} = \log f(x\mid\theta+\delta) - \log f(x\mid\theta) \approx \delta\frac{d}{d\theta}\log f(x\mid\theta+\delta) - \frac{\delta^2}{2}\frac{d^2}{d\theta^2}\log f(x\mid\theta+\delta),$$
we have
$$D(f(x\mid\theta+\delta)\|f(x\mid\theta)) = \int f(x\mid\theta+\delta)\log\frac{f(x\mid\theta+\delta)}{f(x\mid\theta)}\,dx \approx \int f(x\mid\theta+\delta)\left(\delta\frac{d}{d\theta}\log f(x\mid\theta+\delta) - \frac{\delta^2}{2}\frac{d^2}{d\theta^2}\log f(x\mid\theta+\delta)\right)dx$$
$$\approx \delta\int f(x\mid\theta)\frac{d}{d\theta}\log f(x\mid\theta)\,dx - \frac{\delta^2}{2}\int f(x\mid\theta)\frac{d^2}{d\theta^2}\log f(x\mid\theta)\,dx = -\frac{\delta^2}{2}\int f(x\mid\theta)\frac{d^2}{d\theta^2}\log f(x\mid\theta)\,dx = \frac{\delta^2}{2}I(\theta).$$
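Theorem 5.3 can be verified numerically for a concrete family. The sketch below (an illustration, not part of the text) uses the exponential distribution, for which $I(\lambda) = 1/\lambda^2$, and compares both KL informations with $\delta^2 I(\lambda)/2$ for decreasing $\delta$.

```python
import numpy as np
from scipy.integrate import quad

lam = 2.0                                      # illustrative parameter (assumption); I(lam) = 1/lam^2

f = lambda x, l: l * np.exp(-l * x)            # exponential density

def kl(l1, l2):
    # D(f(.|l1) || f(.|l2)) by numerical integration on (0, inf)
    value, _ = quad(lambda x: f(x, l1) * np.log(f(x, l1) / f(x, l2)), 0, np.inf)
    return value

for delta in [0.5, 0.1, 0.02]:
    approx = delta ** 2 / (2 * lam ** 2)       # delta^2 * I(lam) / 2
    print(delta, kl(lam, lam + delta), kl(lam + delta, lam), approx)
```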
$$\log\lambda(n) = -n\frac{\delta^2 I(\theta_0)}{2} + z_\alpha\sqrt{n\delta^2 I(\theta_0)}, \qquad (5.17)$$
where $z_\alpha$ is the upper $100\alpha$ percentile of the standard normal distribution, and the asymptotic power is $Q\!\left(z_\alpha - \sqrt{n\delta^2 I(\theta_0)}\right)$, where the function $Q(z)$ is
$$Q(z) = \frac{1}{\sqrt{2\pi}}\int_z^{+\infty}\exp\!\left(-\frac{x^2}{2}\right)dx.$$
Proof From Theorem 5.2, under the null hypothesis $H_0$, the log-likelihood ratio statistic (5.10) is asymptotically distributed according to the normal distribution $N\!\left(-nD(f\|g),\,n\delta^2 I(\theta_0)-nD(f\|g)^2\right)$. From Theorem 5.3, under the null hypothesis, from (5.15) we have
$$D(f\|g) \approx \frac{\delta^2 I(\theta_0)}{2},$$
so it follows that
$$-nD(f\|g) \approx -n\frac{\delta^2 I(\theta_0)}{2}, \qquad n\delta^2 I(\theta_0) - nD(f\|g)^2 \approx n\delta^2 I(\theta_0) - n\frac{\delta^4}{4}I(\theta_0)^2 \approx n\delta^2 I(\theta_0).$$
Hence, the asymptotic null distribution is $N\!\left(-n\frac{\delta^2 I(\theta_0)}{2},\,n\delta^2 I(\theta_0)\right)$, and for the critical value (5.17) we have
$$P\!\left(\frac{\prod_{i=1}^n g(X_i)}{\prod_{i=1}^n f(X_i)} > \lambda(n)\,\Big|\,H_0\right) = \alpha.$$
Similarly, under the alternative hypothesis,
$$D(g\|f) \approx \frac{\delta^2}{2}I(\theta_0),$$
and from Theorem 5.2 the asymptotic mean and variance of the log-likelihood ratio statistic under $H_1$ are
$$nD(g\|f) \approx \frac{n\delta^2 I(\theta_0)}{2}, \qquad n\delta^2 I(\theta_0) - nD(g\|f)^2 \approx n\delta^2 I(\theta_0) - n\frac{\delta^4}{4}I(\theta_0)^2 \approx n\delta^2 I(\theta_0).$$
Hence, the asymptotic distribution of statistic (5.10) is
$$N\!\left(nD(g\|f),\,n\delta^2 I(\theta_0)\right) \approx N\!\left(\frac{n\delta^2 I(\theta_0)}{2},\,n\delta^2 I(\theta_0)\right).$$
Then, we have
$$P\!\left(\frac{\prod_{i=1}^n g(X_i)}{\prod_{i=1}^n f(X_i)} > \lambda(n)\,\Big|\,H_1\right) \approx Q\!\left(z_\alpha - \sqrt{n\delta^2 I(\theta_0)}\right). \qquad (5.18)$$
Theorem 5.5 For a random variable $X$ and a function of the variable $Y=\eta(X)$, let $f(x\mid\theta)$ and $f^*(y\mid\theta)$ be the density or probability functions of $X$ and $Y$, respectively; and let $I_f(\theta)$ and $I_{f^*}(\theta)$ be the Fisher information of $f(x\mid\theta)$ and that of $f^*(y\mid\theta)$, respectively. Under the same condition as in Theorem 5.3, it follows that
$$I_f(\theta) \ge I_{f^*}(\theta).$$
As shown in the above discussion, the test power depends on the KL and the
Fisher information. For random variables, the efficiency for testing hypotheses can
be defined according to entropy.
Definition 5.5 (Entropy-based efficiency) Let $f(x\mid\theta_0)$ and $f(x\mid\theta_0+\delta)$ be the density or probability functions of the random variable $X$ under the hypotheses $H_0: \theta=\theta_0$ versus $H_1: \theta=\theta_0+\delta$, respectively. Then, the entropy-based efficiency (EE) is defined by
0 ≤ REE(Y |X ) ≤ 1.
By using the above discussion, the REE is calculated in the following example.
Example 5.3 Let $X$ be a random variable distributed according to the normal distribution with mean $\mu$ and variance $\sigma^2$. In testing the null hypothesis $H_0: \mu = 0$ versus the alternative hypothesis $H_1: \mu = \delta$, let $X^*$ be the dichotomized variable
$$X^* = \begin{cases}0 & (X < 0)\\ 1 & (X \ge 0)\end{cases}.$$
Then, the entropy-based relative efficiency of the test with the dichotomized variable $X^*$ for that with the normal variable $X$ is calculated. Since the normal density function with mean $\mu$ and variance $\sigma^2$ is
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right),$$
under $H_1$ we have
$$P\bigl(X^*=1\bigr) = \int_0^{+\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{1}{2\sigma^2}(x-\delta)^2\right)dx. \qquad (5.23)$$
For simplicity of the discussion, we denote the above probability by $p(\delta)$. Thus, the distribution $f^*$ is the Bernoulli distribution with success probability (5.23), and the probability function is expressed as
$$P\bigl(X^*=x\bigr) = p(\delta)^x\,(1-p(\delta))^{1-x}.$$
Since
$$\frac{\partial}{\partial\delta}p(\delta) = \int_0^{+\infty}\frac{x-\delta}{\sigma^2}\,\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{1}{2\sigma^2}(x-\delta)^2\right)dx = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{\delta^2}{2\sigma^2}\right),$$
we have
$$I_{f^*}(\delta) = \frac{1}{p(\delta)(1-p(\delta))}\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{\delta^2}{2\sigma^2}\right)\right)^{\!2} = \frac{1}{p(\delta)(1-p(\delta))}\times\frac{1}{2\pi\sigma^2}\exp\!\left(-\frac{\delta^2}{\sigma^2}\right) \to \frac{2}{\pi\sigma^2} = I_{f^*}(0) \quad (\delta\to 0).$$
$$\lim_{\delta\to 0}\mathrm{REE}\bigl(X^*\mid X;\theta_0,\theta_0+\delta\bigr) = \frac{I_{f^*}(0)}{I_f(0)} = \frac{\dfrac{2}{\pi\sigma^2}}{\dfrac{1}{\sigma^2}} = \frac{2}{\pi}.$$
π
The above REE is the relative efficiency of the likelihood ratio test
with dichotomized samples X1∗ , X2∗ , . . . , Xn∗ for that with the original samples
X1 , X2 , . . . , Xn .
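The limiting efficiency $2/\pi$ can be reproduced numerically from the Bernoulli Fisher information of the dichotomized variable. The following sketch (with an assumed illustrative $\sigma$) computes $I_{f^*}(0)$ and the ratio $I_{f^*}(0)/I_f(0)$.

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0                                    # illustrative standard deviation (assumption)

def fisher_dichotomized(delta):
    # Fisher information of X* = 1{X >= 0} with X ~ N(delta, sigma^2):
    # I_{f*}(delta) = p'(delta)^2 / (p(delta) * (1 - p(delta)))
    p = 1 - norm.cdf(0, loc=delta, scale=sigma)   # P(X* = 1)
    dp = norm.pdf(0, loc=delta, scale=sigma)      # derivative of p with respect to delta
    return dp ** 2 / (p * (1 - p))

I_star0 = fisher_dichotomized(0.0)
I_f0 = 1 / sigma ** 2                          # Fisher information of the normal mean
print(I_star0, 2 / (np.pi * sigma ** 2))       # agree
print(I_star0 / I_f0, 2 / np.pi)               # relative entropy-based efficiency
```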
By using the above example, the relative Pitman efficiency is computed on the
basis of Theorem 5.3.
$$\gamma^{(2)}_{N_n}(\theta_n) \approx Q\!\left(z_\alpha - \sqrt{\frac{N_n}{n}h^2 I_{f^*}(0)}\right),$$
respectively. Assuming
$$\gamma^{(1)}_n(\theta_n) = \gamma^{(2)}_{N_n}(\theta_n),$$
we have
$$h^2 I_f(0) = \frac{N_n}{n}h^2 I_{f^*}(0).$$
Hence, the relative Pitman efficiency (RPE) is
$$\mathrm{RPE}\bigl(X^*\mid X\bigr) = \lim_{n\to\infty}\frac{n}{N_n} = \frac{I_{f^*}(0)}{I_f(0)} = \frac{2}{\pi}.$$
Example 5.5 In Example 5.3, the relative Bahadur efficiency (RBE) is considered. Let $H_0: \mu=0$ be the null hypothesis and let $H_1: \mu=\delta$ be the alternative hypothesis. Let
$$T^{(1)}_n(\mathbf X) = \frac1n\sum_{i=1}^n X_i, \qquad T^{(2)}_n(\mathbf X) = \frac1n\sum_{i=1}^n X^*_i.$$
$$1 - F_n\!\left(T^{(1)}_n(\mathbf X)\mid\mu=0\right) = \int_{\bar X}^{+\infty}\frac{\sqrt n}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{n}{2\sigma^2}t^2\right)dt = \int_{\delta+O(n^{-1/2})}^{+\infty}\frac{\sqrt n}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{n}{2\sigma^2}t^2\right)dt$$
and
$$1 - F_n\!\left(T^{(2)}_n(\mathbf X)\mid\mu=0\right) = \int_{T^{(2)}_n(\mathbf X)}^{+\infty}\sqrt{\frac{2n}{\pi}}\exp\!\left(-2n\left(t-\frac12\right)^2\right)dt = \int_{p(\delta)+O(n^{-1/2})}^{+\infty}\sqrt{\frac{2n}{\pi}}\exp\!\left(-2n\left(t-\frac12\right)^2\right)dt,$$
where
$$p(\delta) = \int_0^{+\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{1}{2\sigma^2}(x-\delta)^2\right)dx.$$
From Examples 5.3, 5.4, and 5.5, for the median test for testing the mean of the normal distribution, the relative entropy-based efficiency is the same as the relative Pitman and relative Bahadur efficiencies. Under appropriate assumptions [2], as in the above example, it follows that
$$\frac{L_n(T_n(\mathbf X)\mid\theta_0)}{n} \to 2D(g\|f) \quad (n\to\infty).$$
Definition 5.8 (Relative entropy-based efficiency) Let Tn(1) (X) and Tn(2) (X) be test
statistics for testing H0 : θ = θ0 versus H1 : θ = θ0 + δ, and let ϕn(1) (t|θ ) and ϕn(2) (t|θ)
be the probability or density functions of the test statistics, respectively. Then, the
relative entropy-based efficiency of Tn(2) (X) for Tn(1) (X) is defined by
$$\mathrm{REE}\bigl(T^{(2)}_n\mid T^{(1)}_n;\theta_0,\theta_0+\delta\bigr) = \frac{\mathrm{EE}\bigl(T^{(2)}_n;\theta_0,\theta_0+\delta\bigr)}{\mathrm{EE}\bigl(T^{(1)}_n;\theta_0,\theta_0+\delta\bigr)}, \qquad (5.25)$$
and the relative entropy-based efficiency of the procedure $T^{(2)}_n(\mathbf X)$ for $T^{(1)}_n(\mathbf X)$ is defined as
$$\mathrm{REE}\bigl(T^{(2)}\mid T^{(1)};\theta_0,\theta_0+\delta\bigr) = \lim_{n\to\infty}\mathrm{REE}\bigl(T^{(2)}_n\mid T^{(1)}_n;\theta_0,\theta_0+\delta\bigr). \qquad (5.26)$$
where
P(W0 |θ0 ) = α.
$$\lim_{\delta\to 0}\mathrm{REE}\bigl(T^{(2)}\mid T^{(1)};\theta_0,\theta_0+\delta\bigr) = \frac{\eta'(\theta_0)^2}{I(\theta_0)\sigma^2} \le 1, \qquad (5.27)$$
where
$$I(\theta) = \int f(x\mid\theta)\left(\frac{d}{d\theta}\log f(x\mid\theta)\right)^{\!2}dx.$$
$$\mathrm{EE}\bigl(T^{(1)}_n;\theta_0,\theta_0+\delta\bigr) = nD\bigl(f(x\mid\theta_0)\,\|\,f(x\mid\theta_0+\delta)\bigr) \approx \frac{nI(\theta_0)\delta^2}{2}.$$
Hence, we have
$$\frac{\sigma^2}{n\,\eta'(\theta_0)^2} \ge \frac{1}{nI(\theta_0)},$$
so that
$$\frac{\eta'(\theta_0)^2}{I(\theta_0)\sigma^2} \le 1.$$
Theorem 5.7 Let θ̂ be the maximum likelihood estimator of θ . Then, “the test of
null hypothesis H0 : θ = θ0 versus alternative hypothesis H1 : θ = θ0 + δ” based on
θ̂ is asymptotically equivalent to the likelihood ratio test (5.9).
Proof For a large sample, the asymptotic distribution of $\hat\theta$ is $N\!\left(\theta_0,\frac{1}{nI(\theta_0)}\right)$ under $H_0$ and $N\!\left(\theta_0+\delta,\frac{1}{nI(\theta_0+\delta)}\right)$ under $H_1$. In Theorem 5.6, we set $\eta(\theta)=\theta$ and $\sigma^2 = \frac{1}{I(\theta_0)}$,
Moreover, in general, with respect to the likelihood ratio test, the following
theorem holds.
Proof For simplicity of the discussion, the proof is given for continuous random samples $X_i$. Then, we have
$$D\bigl(\varphi_n(t\mid\theta_0)\,\|\,\varphi_n(t\mid\theta_0+\delta)\bigr) = \int\varphi_n(t\mid\theta_0)\log\frac{\varphi_n(t\mid\theta_0)}{\varphi_n(t\mid\theta_0+\delta)}\,dt$$
$$= \int\left(\frac{d}{dt}\int_{T_n<t}\prod_{i=1}^n f(x_i\mid\theta_0)\,dx_i\right)\log\frac{\dfrac{d}{dt}\displaystyle\int_{T_n<t}\prod_{i=1}^n f(x_i\mid\theta_0)\,dx_i}{\dfrac{d}{dt}\displaystyle\int_{T_n<t}\prod_{i=1}^n f(x_i\mid\theta_0+\delta)\,dx_i}\,dt$$
$$\le \int\!\cdots\!\int\prod_{i=1}^n f(x_i\mid\theta_0)\log\frac{\prod_{i=1}^n f(x_i\mid\theta_0)}{\prod_{i=1}^n f(x_i\mid\theta_0+\delta)}\,dx_1\cdots dx_n = \sum_{i=1}^n\int f(x_i\mid\theta_0)\log\frac{f(x_i\mid\theta_0)}{f(x_i\mid\theta_0+\delta)}\,dx_i = nD\bigl(f(x_i\mid\theta_0)\,\|\,f(x_i\mid\theta_0+\delta)\bigr).$$
From the above theorem, the likelihood ratio test is optimal in the sense of REE.
With respect to the Pitman, Bahadur, and entropy-based efficiencies, we have the
following theorem:
Equating the asymptotic powers of the two procedures, it follows that
$$\sqrt{h^2 I(\theta_0)} = \frac{\sqrt{N_n}}{\sigma}\left(\eta\!\left(\theta_0+\frac{h}{\sqrt n}\right)-\eta(\theta_0)\right).$$
From the above result, the relative Pitman efficiency is derived as follows:
$$\frac{n}{N_n} = \frac{n\left(\eta\!\left(\theta_0+\frac{h}{\sqrt n}\right)-\eta(\theta_0)\right)^2}{\sigma^2 h^2 I(\theta_0)} = \left(\frac{\eta\!\left(\theta_0+\frac{h}{\sqrt n}\right)-\eta(\theta_0)}{\frac{h}{\sqrt n}}\right)^{\!2}\frac{1}{I(\theta_0)\sigma^2} \to \frac{\eta'(\theta_0)^2}{I(\theta_0)\sigma^2} \quad (n\to\infty).$$
In the next example, we consider the efficiency of the Wilcoxon test for two sample
tests.
Since
$$\sum_{i=1}^n\sum_{j=1}^n u\bigl(Y_i-Y_j\bigr) = \frac{n(n+1)}{2},$$
we have
$$W = \sum_{i=1}^n\sum_{j=1}^m u\bigl(Y_i-X_j\bigr) + \frac{n(n+1)}{2}.$$
$$p(\delta) = \int_{-\infty}^{0}\frac{1}{\sqrt{4\pi\sigma^2}}\exp\!\left(-\frac{(t-\delta)^2}{4\sigma^2}\right)dt \approx \frac12 - \frac{\delta}{\sqrt{4\pi\sigma^2}}\exp\!\left(-\frac{\delta^2}{4\sigma^2}\right),$$
$$E(W\mid\mu_0) \approx \frac{n(m+n+1)}{2}, \qquad \mathrm{Var}(W\mid\mu_0) \approx \frac{mn(m+n+1)}{12}.$$
For large $m$ and $n$, the asymptotic distribution of $W$ is normal, so we have
$$D\bigl(N(E(W\mid\mu_0),\mathrm{Var}(W\mid\mu_0))\,\|\,N(E(W\mid\mu_0+\delta),\mathrm{Var}(W\mid\mu_0+\delta))\bigr) \approx \frac{3mn\delta^2\exp\!\left(-\frac{\delta^2}{2\sigma^2}\right)}{2\pi(m+n+1)\sigma^2},$$
and
$$\lim_{\delta\to 0}\mathrm{REE}(W,t;\mu_0,\mu_0+\delta) = \lim_{\delta\to 0}\frac{D\bigl(N(E(W\mid\mu_0),\mathrm{Var}(W\mid\mu_0))\,\|\,N(E(W\mid\mu_0+\delta),\mathrm{Var}(W\mid\mu_0+\delta))\bigr)}{D\bigl(N\!\left(0,\left(\frac1m+\frac1n\right)\sigma^2\right)\,\|\,N\!\left(\delta,\left(\frac1m+\frac1n\right)\sigma^2\right)\bigr)} = \lim_{\delta\to 0}\frac{\dfrac{3mn\delta^2\exp\!\left(-\frac{\delta^2}{2\sigma^2}\right)}{2\pi(m+n+1)\sigma^2}}{\dfrac{\delta^2}{2\left(\frac1m+\frac1n\right)\sigma^2}} \approx \frac{3}{\pi}.$$
For m = n, it can be shown that the result is equal to the relative Pitman and
Bahadur efficiencies.
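A small Monte Carlo experiment (illustrative sample sizes, shift, and replication number, not taken from the text) makes the efficiency comparison concrete: for normal data the Wilcoxon rank-sum test attains power slightly below that of the two-sample t-test, in line with the asymptotic relative efficiency $3/\pi \approx 0.955$.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(0)
m = n = 50                                    # illustrative sample sizes (assumptions)
delta, sigma, alpha = 0.4, 1.0, 0.05          # illustrative shift and level
reps = 2000

rej_w = rej_t = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, m)             # first sample, mean mu0 = 0
    y = rng.normal(delta, sigma, n)           # second sample, mean mu0 + delta
    if mannwhitneyu(y, x, alternative="greater").pvalue < alpha:
        rej_w += 1
    if ttest_ind(y, x, alternative="greater").pvalue < alpha:
        rej_t += 1

print("Wilcoxon power:", rej_w / reps)
print("t-test power:  ", rej_t / reps)        # slightly higher, as the ARE 3/pi suggests
```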
5.6 Discussion
References
1. Bahadur, R. R. (1964). On the asymptotic efficiency of tests and estimates. Sankhya, 22, 229–252.
2. Bahadur, R. R. (1965). An optimal property of the likelihood ratio statistic. In Proceedings of
the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 13–26).
Berkeley and Los Angeles: University of California Press.
3. Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or Physical Character, 231, 289–337.
4. Noether, G. E. (1950). Asymptotic properties of the Wald-Wolfowitz test of randomness. Annals
of Mathematical Statistics, 21, 231–246.
5. Noether, G. E. (1955). On a theorem of Pitman. The Annals of Mathematical Statistics, 26,
64–68.
6. Wieand, H. S. (1976). A condition under which the Pitman and Bahadur approaches to efficiency
coincide. Annals of Statistics, 4, 1003–1011.
Chapter 6
Entropy-Based Path Analysis
6.1 Introduction
Path analysis [19] is a statistical method that treats causal relationships among the
variables concerned. In the analysis, causal relationships among the variables are
described by diagrams with directed arrows, which are called path diagrams, and
according to the diagrams, path analysis models are hypothesized, and the causal
relationships are analyzed. For analyzing causal systems of continuous variables,
linear regression models are used for describing the causal relationships, e.g., lin-
ear structural equation model (LISREL) [3, 14], and the path analysis is made by
using regression coefficients and correlation coefficients. In comparison with path
analysis of continuous variables, that of categorical variables is complex, because the causal system under consideration cannot be described by linear regression equations. Goodman [10–12] considered a path analysis of binary variables by using logit models and discussed the effects of the variables through logit parameters; however, no discussion of the direct and indirect effects was made. In a factor analysis approach for categorical manifest variables, latent variables are assumed to be continuous; however, there was no discussion of measuring the effects of the latent variables on the manifest variables [4, 15, 16]. Hagenaars [13] discussed path analysis of recursive systems of categorical variables by using a loglinear model approach, which is a combination of Goodman's approach and graphical modeling. Although the approach is an analogue of LISREL, the direct and indirect effects were not discussed. Eshima and Tabata [5] also discussed path analysis with loglinear models. In path analysis with categorical variables, it is a question how the effects should be measured [9, 13]. Albert and Nelson [1] proposed a path analysis method to calculate pathway effects for causal systems based on generalized linear models (GLMs) [17, 18], but not all pathway effects are identifiable. As in the two-stage case, when the factors, intermediate variables, and response variables are categorical, the pathway effects become very complicated because the variable effects are defined in terms of mean differences of the response variables. Eshima et al. [6] proposed a method of path analysis of categorical variables by using logit models. In this approach, the direct and indirect effects of variables are discussed according to log odds ratios; however, the results are complex, and a summary measure is required for causal analysis with categorical variables. The approach was extended to an entropy-based approach to path analysis for GLMs [8]. The rest of the present chapter is organized
as follows. In Sect. 6.2, recursive systems of variables are expressed with path dia-
grams, and the joint probability or density functions of the variables are described by products of the conditional ones based on the path diagrams. Section 6.3 reviews
the ordinary path analysis of continuous variables. In Sect. 6.4, a preliminary dis-
cussion of path analysis of categorical variables is given by using two examples
of categorical data. Section 6.5 provides a general approach to entropy-based path
analysis for GLM path systems, and the basic discussion is made by log odds ratios,
and Sect. 6.6 applies the basic method to that for GLM path systems with canonical
links. In Sect. 6.7, in order to summarize the effects measured with log odds ratios,
the summary entropy-based effects are introduced by using a recursive system of
four variables, and the total, direct, and indirect effects are discussed. Section 6.8
applies the present path analysis to the examples in Sect. 6.4. Finally, in Sect. 6.9, a
general formulation of path analysis for structural GLM systems is discussed.
X1 ≺ X2 ≺ . . . ≺ Xn. (6.1)
In this case, a general path diagram is shown in Fig. 6.1. The parent variables of
X i are X k , k = 1, 2, . . . , i −1, and the set of the variables is denoted by the following
column vector:
X pa(i) = (X 1 , X 2 , . . . , X i−1 )T
and X i is a descendant of the parent variables. The arrows express the directions
of the direct effects, e.g., X 1 → X 2 indicates the direct effect of X 1 on X 2 . Path
analysis is a statistical method for measuring the effects of parent variables on the descendant variables, and the following effect decomposition is important: the total effect of a parent variable on a descendant variable is decomposed into the direct effect and the indirect effect.
Let f (x) be the joint probability or density function of X = x. Then, for the path
diagram shown in Fig. 6.1, we have the following recursive decomposition of f (X):
$$f(\mathbf x) = f_1(x_1)\prod_{i=2}^n f_i\bigl(x_i\mid\mathbf x_{pa(i)}\bigr), \qquad (6.2)$$
where f 1 (x1 ) is the marginal probability or density function of X 1 and f i xi |x pa(i)
are the conditional probability or density functions of X i given X pa(i) = x pa(i) , i =
2, 3, . . . , n.
For n = 4, some path diagrams are considered. In Fig. 6.2a, the diagram has all
paths between the four variables, so the joint probability or density functions of the
four variables are given by
In Fig. 6.2b, since there is no path from X 1 to X 3 , it means that X 1 and X 3 are
conditionally independent given X 2 . From this, we have
Since Fig. 6.2d implies X i and X i+2 are independent given X i+1 , i = 1, 2, we
have
In the above discussion, all the variables are causally ordered as in (6.1); however,
there may be cases that all the variables are not able to be causally ordered. For
example, for variables X 1 , X 2 , X 3 , and X 4 , it is assumed variables X 1 , X 2 , and X 3
have no causal orders and the following preceding relation holds:
(X 1 , X 2 , X 3 ) ≺ X 4 .
Then, the path diagram is generally given as in Fig. 6.3. The rounded double-
headed arrows imply the associations between the variables. In this case, (6.2)
becomes
As shown in the above concrete examples, path analysis is carried out according to the path diagram under consideration. Therefore, first, a path diagram (model) has to be constructed by discussing the practical phenomena, and second, it is necessary to decide what model, i.e., what distributional assumption on the variables concerned, is used. Moreover, it is important how the effects of parent variables on descendant ones are measured and expressed. In the next section, the usual path analysis of continuous variables based on linear models is discussed.
$$X_2 = \alpha_2 + \beta_{21}X_1 + \varepsilon_2, \qquad (6.3)$$
$$\cdots$$
$$X_n = \alpha_n + \sum_{k=1}^{n-1}\beta_{nk}X_k + \varepsilon_n, \qquad (6.6)$$
where εi are the error terms with mean 0 and variances σi2 , i = 1, 2, . . . , n. Let
eT (X k → X l ), e D (X k → X l ), and e I (X k → X l ) be the total, direct, and indirect
effects of X k on X l . First, from (6.3), variable X 2 increases by β21 for unit change of
variable X 1 , so the total effect of X 1 on X 2 is defined by β21 , and it is also the direct
effect, i.e.,
eT (X 1 → X 2 ) = e D (X 1 → X 2 ) = β21 , e I (X 1 → X 2 ) = 0.
Second, the effects of X 1 and X 2 on X 3 are considered by using Eqs. (6.3) and
(6.4) (Fig. 6.4). In (6.4), X 3 increases by β31 and β32 for unit changes of X 1 and X 2 ,
respectively, so the direct effects of X 1 and X 2 are defined as follows:
eT (X 2 → X 3 ) = β32 , e I (X 2 → X 3 ) = 0.
e I (X 1 → X 3 ) = e D (X 1 → X 2 )e D (X 2 → X 3 ).
Third, the effects of X 1 , X 2 , and X 3 on X 4 are discussed in Fig. 6.5. This figure
has the regression coefficients from (6.3) to (6.5), and the regression coefficients are
called the path coefficients. It is convenient to use this diagram to compute the effects
of the variables concerned. According to the above discussion, we can calculate the
effects of the variables by the regression coefficients. Since the path coefficients are
the direct effects indicated by the arrows, we have
X 1 → X 2 → X 4, X 1 → X 3 → X 4, X 1 → X 2 → X 3 → X 4. (6.10)
e I (X 2 → X 4 ) = β32 β43 , e I (X 3 → X 4 ) = 0.
In the above decomposition of the indirect effect, we define the pathway effects.
In (6.2)–(6.5), there are four pathways from X 1 to X 4 , and the effects through the
pathways can be calculated by the products of the path coefficients related to the
directed arrows, for example, for path X 1 → X 3 → X 4 in (6.10), the pathway effect
is calculated by β31 β43 . The method of calculation of the effects can be extended to the
general case illustrated in path diagram Fig. 6.1. As shown in the above discussion,
path analysis of continuous variables based on linear regression models is carried
out easily because continuous variables are quantitative and the causal relationships
among variables are expressed through linear equations as above.
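As a small worked sketch of the linear path analysis described above (the path coefficients below are illustrative assumptions, not values from the text), the total, direct, and indirect effects in the four-variable recursive system can be computed as sums of products of path coefficients along the pathways.

```python
import numpy as np

# Illustrative path (regression) coefficients for the recursive system (6.3)-(6.5)
b21 = 0.5
b31, b32 = 0.3, 0.4
b41, b42, b43 = 0.2, 0.6, 0.5

# Direct effect of X1 on X4 is the coefficient on the arrow X1 -> X4
direct_X1_X4 = b41

# Indirect effect of X1 on X4: sum of pathway effects through
# X1->X2->X4, X1->X3->X4, and X1->X2->X3->X4
indirect_X1_X4 = b21 * b42 + b31 * b43 + b21 * b32 * b43

total_X1_X4 = direct_X1_X4 + indirect_X1_X4
print(direct_X1_X4, indirect_X1_X4, total_X1_X4)

# Effects of X2 on X4: direct b42, indirect via X3 (b32 * b43)
print(b42, b32 * b43, b42 + b32 * b43)
```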
Example 6.1 Table 6.1 shows the data about marital status with three explanatory variables, X G: gender, X PMS: premarital sex (PMS), and X EMS: extramarital sex (EMS), and the response variable X MS: marital status (MS) [2]. The path diagram is illustrated in Fig. 6.6. All the
variables are binary, and the method of path analysis in the previous section cannot
be used. By using logit models, path analysis is carried out below. In the recursive
decomposition of the joint probability function of the four variables according to
Fig. 6.6, the conditional probability functions are expressed by logit models, not
linear regression models. Then, the discussion for linear regression models cannot
be applied, and log odds ratios are used for measuring the effects of parent variables
Example 6.2 The data for an investigation of factors influencing the primary food
choice of alligators (Table 6.2) are analyzed in Agresti [2]. In this example, explana-
tory variables are X L : lakes where alligators live, {1. Hancock, 2. Oklawaha, 3.
Trafford, 4. George}; and X S : sizes of alligators, {1. Small (≤ 2.3 m), 2. Large
(> 2.3 m) }; and the response variable is Y : primary food choice of alligators, {1.
Fish, 2. Invertebrate, 3. Reptile, 4. Bird, 5. Other}. Considering the effects of lake
and size of alligator on the primary food choice of alligators, it is valid to use the path
diagram shown in Fig. 6.7. In this example, since variables X L and Y are polytomous, as in the first example, the method for continuous variables cannot be used either. In this example, generalized logit models are used, and the path analysis is
made on the basis of log odds ratios.
In the above examples, path analysis is demonstrated on the basis of log odds ratios below; however, the path analysis becomes complex as the numbers of categories of the variables increase, because the number of log odds ratios to be employed for the path analysis also increases faster than the numbers of categories. For example, when variables $X$ and $Y$ have $I$ and $J$ categories, respectively, the number of log odds ratios is $\binom{I}{2}\binom{J}{2}$. Hence, it is important to summarize the causal effects measured with log odds ratios. In the next section, an entropy-based method of path analysis is discussed in the context of generalized linear models.
Without loss of generality, we assume $a_i(\phi_i) > 0$. For the regression coefficient vector $\boldsymbol\beta_{(i)} = \bigl(\beta_{(i)1},\beta_{(i)2},\ldots,\beta_{(i)i-1}\bigr)^{\mathsf T}$ and a link function that relates the linear predictor
$$\boldsymbol\beta_{(i)}^{\mathsf T}\mathbf x_{pa(i)} = \beta_{(i)1}x_1 + \beta_{(i)2}x_2 + \cdots + \beta_{(i)i-1}x_{i-1}$$
In GLMs, the discussion on path analysis in the previous section cannot be used except for linear regression models. In this section, the effects of parent variables on the descendant ones are measured according to log odds ratios. First, by using Fig. 6.2a, the effects of $X_1$, $X_2$, and $X_3$ on $X_4$ are considered. The following log odds ratio can be viewed as the total effect of $\mathbf X_{pa(4)} = (x_1,x_2,x_3)^{\mathsf T}$ on $X_4$ (Fig. 6.8a):
$$\log\frac{f_4(x_4\mid x_1,x_2,x_3)\,f_4(x_4^*\mid x_1^*,x_2^*,x_3^*)}{f_4(x_4^*\mid x_1,x_2,x_3)\,f_4(x_4\mid x_1^*,x_2^*,x_3^*)} = \frac{(x_4-x_4^*)\bigl(\theta_4(x_1,x_2,x_3)-\theta_4(x_1^*,x_2^*,x_3^*)\bigr)}{a_4(\phi_4)},$$
where $(x_1^*,x_2^*,x_3^*,x_4^*)$ is a baseline. Let $\mu_i = E(X_i)$, $i=1,2,3,4$. Setting
$$(x_1^*,x_2^*,x_3^*,x_4^*) = (\mu_1,\mu_2,\mu_3,\mu_4),$$
As explained in the previous chapter, the log odds ratio is a change of relative
information. The total effect of (X 2 , X 3 ) on X 4 in Fig. 6.8b is assessed. Let μi (x1 ) =
E(X i |X 1 = x1 ), i = 2, 3, 4. Since X 1 is the parent of (X 2 , X 3 ), by giving X 1 = x1 ,
the total effect (X 2 , X 3 ) = (x2 , x3 ) on X 4 = x4 is defined as follows:
$$(\text{the total effect of }(x_1,x_2,x_3)\text{ on }x_4) - (\text{the total effect of }(x_2,x_3)\text{ on }x_4\text{ given }x_1)$$
$$= \log\frac{f_4(x_4\mid x_1,x_2,x_3)\,f_4(\mu_4\mid\mu_1,\mu_2,\mu_3)}{f_4(\mu_4\mid x_1,x_2,x_3)\,f_4(x_4\mid\mu_1,\mu_2,\mu_3)} - \log\frac{f_4(x_4\mid x_1,x_2,x_3)\,f_4(\mu_4(x_1)\mid x_1,\mu_2(x_1),\mu_3(x_1))}{f_4(\mu_4(x_1)\mid x_1,x_2,x_3)\,f_4(x_4\mid x_1,\mu_2(x_1),\mu_3(x_1))}$$
$$= \frac{(x_4-\mu_4)\bigl(\theta_4(x_1,x_2,x_3)-\theta_4(\mu_1,\mu_2,\mu_3)\bigr)}{a_4(\phi_4)} - \frac{(x_4-\mu_4(x_1))\bigl(\theta_4(x_1,x_2,x_3)-\theta_4(x_1,\mu_2(x_1),\mu_3(x_1))\bigr)}{a_4(\phi_4)}. \qquad (6.14)$$
The effects (6.12) and (6.13) are defined by using log odds ratios; however, the
above quantity is calculated by the subtraction of (6.13) from (6.12) and is not the
log odds ratio. As the log odds ratio implies the change of relative information, the
above quantity is also interpreted as the change of information. Let μ1 (x2 , x3 ) and
μ4 (x2 , x3 ) be the conditional expectations of X 1 and X 4 , given (X 2 , X 3 ) = (x2 , x3 ),
respectively. Then, the direct effect of X 1 = x1 on X 4 = x4 given (X 2 , X 3 ) = (x2 , x3 )
is
$$\log\frac{f_4(x_4\mid x_1,x_2,x_3)\,f_4(\mu_4(x_2,x_3)\mid\mu_1(x_2,x_3),x_2,x_3)}{f_4(\mu_4(x_2,x_3)\mid x_1,x_2,x_3)\,f_4(x_4\mid\mu_1(x_2,x_3),x_2,x_3)} = \frac{(x_4-\mu_4(x_2,x_3))\bigl(\theta_4(x_1,x_2,x_3)-\theta_4(\mu_1(x_2,x_3),x_2,x_3)\bigr)}{a_4(\phi_4)}. \qquad (6.15)$$
Next, the effects of X 2 = x2 are considered. Let μ3 (x1 , x2 ) and μ4 (x1 , x2 ) be the
conditional expectations of X 3 and X 4 , given (X 1 , X 2 ) = (x1 , x2 ), respectively. As
in the above discussion, the total effect of X 3 = x3 on X 4 = x4 given (X 1 , X 2 ) =
(x1 , x2 ) is defined by
As illustrated in Fig. 6.2a, X 3 has no indirect paths, so the above quantity is also the
direct effect. From this, the total effect of X 2 = x2 on X 4 = x4 at (X 1 , X 2 ) = (x1 , x2 )
is defined by subtracting (6.16) from (6.13):
A general discussion based on the path diagram in Fig. 6.1 can be made in a
similar method explained above. In the next section, the above discussion is applied
to path analysis of GLMs with canonical links.
Let us assume that in path diagram Fig. 6.2a the recursive causal relationships are expressed by GLMs with canonical links. For canonical links, we have
$$\begin{cases}\theta_2(x_1) = \beta_{21}x_1\\ \theta_3(x_1,x_2) = \beta_{31}x_1+\beta_{32}x_2\\ \theta_4(x_1,x_2,x_3) = \beta_{41}x_1+\beta_{42}x_2+\beta_{43}x_3\end{cases} \qquad (6.19)$$
we have
In order to simplify the discussion, all the variables concerned are assumed to be
continuous. Taking the expectation of the total effect of X pa(4) = (x1 , x2 , x3 )T on X 4
(6.12) with respect to X pa(4) and X 4 , we have the following summary total effect:
$$\frac{\mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3))}{a_4(\phi_4)} = \iint f_{1234}\bigl(\mathbf x_{pa(4)},x_4\bigr)\log\frac{f_{1234}\bigl(\mathbf x_{pa(4)},x_4\bigr)}{f_4(x_4)f_{123}\bigl(\mathbf x_{pa(4)}\bigr)}\,d\mathbf x_{pa(4)}dx_4 + \iint f_4(x_4)f_{123}\bigl(\mathbf x_{pa(4)}\bigr)\log\frac{f_4(x_4)f_{123}\bigl(\mathbf x_{pa(4)}\bigr)}{f_{1234}\bigl(\mathbf x_{pa(4)},x_4\bigr)}\,d\mathbf x_{pa(4)}dx_4 = \mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\bigr), \qquad (6.25)$$
where $f_{1234}\bigl(\mathbf x_{pa(4)},x_4\bigr)$ is the joint density function of $\mathbf X_{pa(4)}$ and $X_4$, and $f_{123}\bigl(\mathbf x_{pa(4)}\bigr)$ and $f_4(x_4)$ are the marginal density functions of $\mathbf X_{pa(4)}$ and $X_4$, respectively.
The quantity (6.25) is the ratio of the explained variation of $X_4$ by $\mathbf X_{pa(4)}$ in entropy, i.e., $\mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3))$, to the error variation $a_4(\phi_4)$. In this sense, the total effect (6.25) is a signal-to-noise ratio. Standardizing the above KL information, we define the standardized summary total effect of $\mathbf X_{pa(4)}$ on $X_4$ by
$$e_T\bigl(\mathbf X_{pa(4)}\to X_4\bigr) = \frac{\mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3))}{\mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3))+a_4(\phi_4)} = \frac{\mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\bigr)}{\mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\bigr)+1}. \qquad (6.26)$$
The above measure is $\mathrm{ECD}\bigl(\mathbf X_{pa(4)},X_4\bigr)$ [7]. By taking the expectation of (6.13) with respect to $\mathbf X_{pa(4)}$ and $X_4$, it follows that
$$\frac{\mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3)\mid X_1)}{a_4(\phi_4)} = \iint f_{1234}\bigl(\mathbf x_{pa(4)},x_4\bigr)\log\frac{f_{1234}\bigl(\mathbf x_{pa(4)},x_4\bigr)}{f_4(x_4\mid x_1)f_{123}\bigl(\mathbf x_{pa(4)}\bigr)}\,d\mathbf x_{pa(4)}dx_4 + \iint f_4(x_4\mid x_1)f_{123}\bigl(\mathbf x_{pa(4)}\bigr)\log\frac{f_4(x_4\mid x_1)f_{123}\bigl(\mathbf x_{pa(4)}\bigr)}{f_{1234}\bigl(\mathbf x_{pa(4)},x_4\bigr)}\,d\mathbf x_{pa(4)}dx_4 = \mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\mid X_1\bigr).$$
e I (X 1 → X 4 ) = eT (X 1 → X 4 ) − e D (X 1 → X 4 ).
Although the indirect effect is defined by subtracting (6.28) from (6.27), the above
quantity is interpreted as entropy. By using a similar discussion, we have
$$e_T(X_2\to X_4) = \frac{\mathrm{Cov}\bigl(X_4,\theta_4(\mathbf X_{pa(4)})\mid X_1\bigr)-\mathrm{Cov}\bigl(X_4,\theta_4(\mathbf X_{pa(4)})\mid X_1,X_2\bigr)}{\mathrm{Cov}\bigl(X_4,\theta_4(\mathbf X_{pa(4)})\bigr)+a_4(\phi_4)} = e_T((X_2,X_3)\to X_4) - e_T(X_3\to X_4), \qquad (6.29)$$
$$e_D(X_2\to X_4) = \frac{\mathrm{Cov}\bigl(X_4,\theta_4(\mathbf X_{pa(4)})\mid X_1,X_3\bigr)}{\mathrm{Cov}\bigl(X_4,\theta_4(\mathbf X_{pa(4)})\bigr)+a_4(\phi_4)}, \qquad (6.30)$$
$$e_T(X_3\to X_4) = e_D(X_3\to X_4) = \frac{\mathrm{Cov}\bigl(X_4,\theta_4(\mathbf X_{pa(4)})\mid X_1,X_2\bigr)}{\mathrm{Cov}\bigl(X_4,\theta_4(\mathbf X_{pa(4)})\bigr)+a_4(\phi_4)}. \qquad (6.31)$$
Similarly, we get
$$e_T(X_1\to X_3) = \frac{\mathrm{Cov}(X_3,\theta_3(X_1,X_2))-\mathrm{Cov}(X_3,\theta_3(X_1,X_2)\mid X_1)}{\mathrm{Cov}(X_3,\theta_3(X_1,X_2))+a_3(\phi_3)},$$
$$e_D(X_1\to X_3) = \frac{\mathrm{Cov}(X_3,\theta_3(X_1,X_2)\mid X_2)}{\mathrm{Cov}(X_3,\theta_3(X_1,X_2))+a_3(\phi_3)},$$
$$e_T(X_2\to X_3) = e_D(X_2\to X_3) = \frac{\mathrm{Cov}(X_3,\theta_3(X_1,X_2)\mid X_1)}{\mathrm{Cov}(X_3,\theta_3(X_1,X_2))+a_3(\phi_3)}.$$
Remark 6.2 It is assumed that path diagram Fig. 6.2a is expressed with the linear models (6.3) to (6.5), and the error terms $\varepsilon_i$, $i=2,3,4$, are normally distributed with mean 0 and variances $\sigma_i^2$. In this case, the summary effects discussed above are calculated from the covariances of the variables, since
$$a_4(\phi_4) = \sigma_4^2$$
and
$$\mathrm{Var}(X_4) = \mathrm{Cov}(X_4,\theta_4(X_1,X_2,X_3)) + a_4(\phi_4) = \sum_{i=1}^3\beta_{4i}\mathrm{Cov}(X_4,X_i) + \sigma_4^2.$$
In the next section, the above discussion is applied to the examples in Sect. 6.4.
In this section, path analyses of the examples in Sect. 6.4 are discussed in detail.
the present entropy-based path analysis is carried out. The estimated parame-
ters are given in Table 6.3. The log-likelihood ratio statistic for the goodness-
of-fit of the model is G 2 = 13.629, d f = 12, p = 0.325. The total effects
Table 6.7 The total (direct) effects of X EMS on X MS given X G and X PMS

X G      X PMS   X EMS   X MS       Total effect
Male     Yes     Yes     Divorced    0.403
Male     Yes     No      Divorced   -0.155
Male     No      Yes     Divorced    0.881
Male     No      No      Divorced   -0.093
Male     Yes     Yes     Married    -0.726
Male     Yes     No      Married    -0.277
Male     No      Yes     Married    -0.534
Male     No      No      Married     0.057
Female   Yes     Yes     Divorced    0.388
Female   Yes     No      Divorced   -0.103
Female   No      Yes     Divorced    0.818
Female   No      No      Divorced   -0.061
Female   Yes     Yes     Married    -0.843
Female   Yes     No      Married     0.226
Female   No      Yes     Married    -0.638
Female   No      No      Married     0.048
shown in Tables 6.6 and 6.8, respectively, the total effects of X PMS on X MS are computed by subtracting "the total effects of X EMS on X MS given (X G, X PMS)" (Table 6.7) from "the total effects of (X PMS, X EMS) on X MS given X G" (Table 6.5). The direct effects of X PMS on X MS are calculated by (6.24). These effects are demonstrated in Table 6.8.
As demonstrated in Tables 6.5, 6.6, 6.7, 6.8, and 6.9, the path analysis is complicated, so the summary effects are calculated. Table 6.9 illustrates the estimated marginal joint probabilities of X G, X PMS, and X EMS, which are calculated as the relative frequencies from Table 6.1. Table 6.10 shows the estimated joint distribution of X G, X PMS, X EMS, and X MS, which is obtained by using the estimated logit model (Table 6.3) and the estimated marginal joint distribution of X G, X PMS, and X EMS. By using the table, the estimated means of the effects in Tables 6.5, 6.6, 6.7, 6.8, and 6.9 can be calculated. For simplicity of notation, let us set
$$X_1 = X_{\mathrm G},\quad X_2 = X_{\mathrm{PMS}},\quad X_3 = X_{\mathrm{EMS}},\quad X_4 = X_{\mathrm{MS}},$$
and we use the notation in Sect. 6.6. From Table 6.4, the estimated mean of the total effects can be obtained as
$$\mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\bigr) = 0.098. \qquad (6.32)$$
Hence, from (6.26), the standardized summary total effect is
$$e_T\bigl((X_{\mathrm G},X_{\mathrm{PMS}},X_{\mathrm{EMS}})\to X_{\mathrm{MS}}\bigr) = e_T\bigl(\mathbf X_{pa(4)}\to X_4\bigr) = \frac{\mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\bigr)}{1+\mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\bigr)} = \frac{0.098}{1+0.098} = 0.090. \qquad (6.33)$$
From this result, the effect of X G, X PMS, and X EMS on X MS is not strong, because (6.33) means that the explained entropy by the explanatory variables is only 9%. Similarly, from Tables 6.6 and 6.9, we have
The above quantity implies the total effect of $X_1(=X_{\mathrm G})$, and it can be calculated from the third column of Table 6.6 as well. Hence, from (6.32) and (6.35), the standardized total effect of $X_1(=X_{\mathrm G})$ on $X_4(=X_{\mathrm{MS}})$ is given by
$$e_T(X_{\mathrm G}\to X_{\mathrm{MS}})\;(=e_T(X_1\to X_4)) = \frac{\mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\bigr)-\mathrm{KL}\bigl((X_2,X_3),X_4\mid X_1\bigr)}{1+\mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\bigr)} = \frac{0.000}{1+0.098} = 0.000,$$
$$e_D(X_{\mathrm G}\to X_{\mathrm{MS}})\;(=e_D(X_1\to X_4)) = \frac{\mathrm{KL}\bigl(X_1,X_4\mid X_2,X_3\bigr)}{1+\mathrm{KL}\bigl(\mathbf X_{pa(4)},X_4\bigr)} = \frac{0.004}{1+0.098} = 0.004,$$
and
$$e_I(X_{\mathrm G}\to X_{\mathrm{MS}}) = e_T(X_{\mathrm G}\to X_{\mathrm{MS}}) - e_D(X_{\mathrm G}\to X_{\mathrm{MS}}) = -0.004.$$
By using a similar method, from Tables 6.8 and 6.9, it follows that
eT (X EMS → X MS ) = 0.046
Path analysis for the partial path diagram (Fig. 6.9) of Fig. 6.6 is carried out by using the following logit model:
From the data in Table 6.1, we have the following estimates of the logit parameters:
By using the estimates and the method explained in Sect. 6.6, the present path
analysis can be made. Here, the standardized summary effects are calculated. First,
we have
Second, subtracting (6.38) from (6.37), we have the standardized summary total
effect of X G on X EMS as
e D (X G → X EMS ) = 0.003,
It follows that
θ = BL x L + BS x S ,
From Table 6.2, the following estimates of regression parameters are obtained:
$$\hat B_L = \begin{pmatrix}-0.826 & -0.006 & -1.516 & 0\\ -2.485 & -0.394 & 0.931 & 0\\ 0.417 & 2.454 & 1.419 & 0\\ -0.131 & -0.659 & -0.429 & 0\\ 0 & 0 & 0 & 0\end{pmatrix}, \qquad \hat B_S = \begin{pmatrix}-0.332 & 0\\ 1.127 & 0\\ -0.683 & 0\\ -0.962 & 0\\ 0 & 0\end{pmatrix}.$$
Since applying the discussion in Sect. 6.5 to the present data leads to very complicated results due to the polytomous variables concerned, for simplicity of the analysis only the summary effects of X L and X S on Y are calculated. From the
following estimates,
Cov(θ , Y |X L ) = 0.101(0.045),
Cov(θ, Y |X S ) = 0.260(0.072),
$$e_T((X_L,X_S)\to Y) = \frac{\mathrm{Cov}(\boldsymbol\theta,Y)}{\mathrm{Cov}(\boldsymbol\theta,Y)+1} = \frac{0.329}{0.329+1} = 0.248, \qquad (6.39)$$
$$e_D(X_L\to Y) = \frac{\mathrm{Cov}(\boldsymbol\theta,Y\mid X_S)}{\mathrm{Cov}(\boldsymbol\theta,Y)+1} = \frac{0.260}{0.329+1} = 0.196,$$
$$e_I(X_L\to Y) = e_T(X_L\to Y) - e_D(X_L\to Y) = -0.024,$$
$$e_T(X_S\to Y) = e_D(X_S\to Y) = \frac{\mathrm{Cov}(\boldsymbol\theta,Y\mid X_L)}{\mathrm{Cov}(\boldsymbol\theta,Y)+1} = \frac{0.101}{0.329+1} = 0.076. \qquad (6.41)$$
X1 ≺ X2 ≺ . . . ≺ X K ,
The above information can be interpreted as that originating from the parent variables $\mathbf X^{\backslash 1}_{pa(K)}$. From this, the summary total effect of $X_1$ is measured by
$$\mathrm{KL}\bigl(\mathbf X_{pa(K)},X_K\bigr) - \mathrm{KL}\bigl(\mathbf X^{\backslash 1}_{pa(K)},X_K\mid X_1\bigr) = \frac{\mathrm{Cov}\bigl(X_K,\theta(\mathbf X_{pa(K)})\bigr)-\mathrm{Cov}\bigl(X_K,\theta(\mathbf X_{pa(K)})\mid X_1\bigr)}{a(\phi)},$$
Similarly, we have
$$\mathrm{KL}\bigl(\mathbf X^{\backslash 1}_{pa(K)},X_K\mid X_1\bigr) - \mathrm{KL}\bigl(\mathbf X^{\backslash 1,2}_{pa(K)},X_K\mid X_1,X_2\bigr) = \frac{\mathrm{Cov}\bigl(X_K,\theta(\mathbf X_{pa(K)})\mid X_1\bigr)-\mathrm{Cov}\bigl(X_K,\theta(\mathbf X_{pa(K)})\mid X_1,X_2\bigr)}{a(\phi)}. \qquad (6.45)$$
Information (6.44) is that originating from $\mathbf X^{\backslash 1,2}_{pa(K)}$ and (6.45) is that from $X_2$. Recursively, the information that originates from $\mathbf X^{\backslash 1,2,\ldots,i}_{pa(K)}$ and that from $X_i$, respectively, are given as follows:
$$\mathrm{KL}\bigl(\mathbf X^{\backslash 1,2,\ldots,i}_{pa(K)},X_K\mid\mathbf X_{pa(i+1)}\bigr) = \frac{\mathrm{Cov}\bigl(X_K,\theta(\mathbf X_{pa(K)})\mid\mathbf X_{pa(i+1)}\bigr)}{a(\phi)},$$
$$\mathrm{KL}\bigl(\mathbf X^{\backslash 1,2,\ldots,i-1}_{pa(K)},X_K\mid\mathbf X_{pa(i)}\bigr) - \mathrm{KL}\bigl(\mathbf X^{\backslash 1,2,\ldots,i}_{pa(K)},X_K\mid\mathbf X_{pa(i+1)}\bigr) = \frac{\mathrm{Cov}\bigl(X_K,\theta(\mathbf X_{pa(K)})\mid\mathbf X_{pa(i)}\bigr)-\mathrm{Cov}\bigl(X_K,\theta(\mathbf X_{pa(K)})\mid\mathbf X_{pa(i+1)}\bigr)}{a(\phi)}, \quad i=1,2,\ldots,K-2.$$
$$e_T(X_i\to X_K) \ge 0, \quad i=1,2,\ldots,K-1;$$
$$e_T\bigl(\mathbf X_{pa(K)}\to X_K\bigr) = e_T(X_1\to X_K) + e_T\bigl(\mathbf X^{\backslash 1}_{pa(K)}\to X_K\bigr),$$
$$e_T\bigl(\mathbf X^{\backslash 1,2,\ldots,i-1}_{pa(K)}\to X_K\bigr) = e_T(X_i\to X_K) + e_T\bigl(\mathbf X^{\backslash 1,2,\ldots,i-1,i}_{pa(K)}\to X_K\bigr), \quad i=2,3,\ldots,K-1;$$
$$e_T\bigl(\mathbf X_{pa(K)}\to X_K\bigr) = \sum_{i=1}^{K-1}e_T(X_i\to X_K).$$
$$e_I(X_i\to X_K) = e_T(X_i\to X_K) - e_D(X_i\to X_K), \quad i=1,2,\ldots,K-1. \qquad (6.48)$$
$$\theta\bigl(\mathbf X_{pa(K)}\bigr) = \sum_{i=1}^{K-1}\beta_i X_i,$$
The above effects are calculated by covariances of $X_i$ and $X_K$. The summary total effect of $\mathbf X_{pa(K)}$ on $X_K$ is given by
$$\mathrm{KL}\bigl(\mathbf X_{pa(K)},X_K\bigr) = \frac{\mathrm{Cov}\bigl(\theta(\mathbf X_{pa(K)}),X_K\bigr)}{a(\phi)} = \frac{\sum_{j=1}^{K-1}\beta_j\,\mathrm{Cov}\bigl(X_j,X_K\bigr)}{a(\phi)}.$$
Since
$$\mathrm{KL}\bigl(\mathbf X^{\backslash 1,2,\ldots,i-1}_{pa(K)},X_K\mid\mathbf X_{pa(i)}\bigr) = \frac{\sum_{j=i}^{K-1}\beta_j\,\mathrm{Cov}\bigl(X_j,X_K\mid\mathbf X_{pa(i)}\bigr)}{a(\phi)}, \quad i=1,2,\ldots,K,$$
similarly we have
$$e_D(X_i\to X_K) = \frac{\beta_i\,\mathrm{Cov}\bigl(X_i,X_K\mid\mathbf X^{\backslash i}_{pa(K)}\bigr)}{\sum_{j=1}^{K-1}\beta_j\,\mathrm{Cov}\bigl(X_j,X_K\bigr)+a(\phi)}, \quad i=1,2,\ldots,K-1.$$
The equality holds if and only if $X_i$ and $X_K$ are conditionally independent given $\mathbf X^{\backslash i}_{pa(K)}$.
Proof Since
$$\mathrm{KL}\bigl(X_i,X_K\mid\mathbf X^{\backslash i}_{pa(K)}\bigr) = \frac{\beta_i\,\mathrm{Cov}\bigl(X_i,X_K\mid\mathbf X^{\backslash i}_{pa(K)}\bigr)}{a(\phi)} \ge 0,$$
Remark 6.3 As discussed in this chapter, the present approach is different from the
usual approach for linear equation models because it is based on the log odds ratio
and entropy by using all the variables concerned. Thus, although indirect effects are
defined by the total effects minus the direct effects, e.g.,
e I (X i → X K ) = eT (X i → X K ) − e D (X i → X K ),
6.10 Discussion
effectiveness has been demonstrated in the two examples in this chapter. The present discussion provides a method to decompose the total effects of parent variables into the direct and indirect ones; however, the pathway effects have not been considered. There can be an arbitrary number of intermediate (parent) variables between a parent variable and a descendant variable, so there may be several pathways from the parent variable to the descendant one. For example, in Fig. 6.6, there are two variables, X PMS and X EMS, between X G and X MS, so it is meaningful to decompose the total effect of X G on X MS into those through the four pathways X G → X MS, X G → X PMS → X MS, X G → X EMS → X MS, and X G → X PMS → X EMS → X MS. The pathway effect through path X G → X MS is the direct effect of X G on X MS, and the indirect effect of X G on X MS is composed of those through the other paths. It is significant to study a method of pathway effect analysis in GLMs, and this remains to be solved in a future study.
References
1. Albert, J. M., & Nelson, S. (2011). Generalized causal mediation analysis. Biometrics, 67(3),
1028–1038.
2. Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
3. Bentler, P. M., & Weeks, D. B. (1980). Linear structural equations with latent variables.
Psychometrika, 45, 289–308.
4. Christoferson, A. (1975). Factor analysis of dichotomous variables. Psychometrika, 40, 5–31.
5. Eshima, N., & Tabata, M. (1999). Effect analysis in loglinear model approach to path analysis
of categorical variables. Behaviormetrika, 26, 221–233.
6. Eshima, N., Tabata, M., & Geng, Z. (2001). Path analysis with logistic regression models:
Effect analysis of fully recursive causal systems of categorical variables. Journal of the Japan
Statistical Society, 31, 1–14.
7. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear
models. Computational Statistics & Data Analysis, 54, 1381–1389.
8. Eshima, N., Tabata, M., Borroni, C. G., & Kano, Y. (2015). An entropy-based approach to path
analysis of structural generalized linear models: A basic idea. Entropy, 17, 5117–5132.
9. Fienberg, S. E. (1991). The analysis of cross-classified categorical data (2nd ed.). Cambridge, MA: The MIT Press.
10. Goodman, L. A. (1973). Causal analysis of data from panel studies and other kinds of surveys.
American Journal of Sociology, 78, 1135–1191.
11. Goodman, L. A. (1973). The analysis of multidimensional contingency tables when some
variables are posterior to others: A modified path analysis approach. Biometrika, 60, 179–192.
12. Goodman, L. A. (1974). The analysis of systems of qualitative variables when some of the variables are unobservable. Part I—A modified latent structure approach. American Journal of Sociology, 79(5), 1179–1259.
13. Hagenaars, J. A. (1998). Categorical causal modeling: Latent class analysis and directed loglinear models with latent variables. Sociological Methods & Research, 26, 436–489.
14. Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: User's reference guide (2nd ed.). Chicago: Scientific Software International.
15. Muthen, B. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika, 43, 551–560.
16. Muthen, B. (1984). A general structural equation model with dichotomous ordered categorical and continuous latent variable indicators. Psychometrika, 49, 114–132.
17. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman and Hall.
18. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society A, 135, 370–384.
19. Wright, S. (1934). The method of path coefficients. The Annals of Mathematical Statistics, 5,
161–215.
Chapter 7
Measurement of Explanatory Variable
Contribution in GLMs
7.1 Introduction
In GLMs other than linear regression, generally only the estimation and testing of regression parameters are performed; the predictive powers of GLMs are not measured in practical data analyses. The assessment of explanatory variable con-
tribution is also important in regression analysis. Although regression coefficients are
used to measure the factor contributions of explanatory variables, if the explanatory
variables are correlated, it may not be meaningful to interpret the results based on the
coefficients. If the explanatory variables have a causal order, squared partial corre-
lation coefficients according to the order are used to measure the factor contribution
[14]. The unexplained variance fraction can be expressed by 1 minus the squared
multiple correlation coefficient, and the fraction can be expressed by the product
of the unexplained partial variance fractions according to the causal order. By tak-
ing the logarithm, it is suggested that the partial fractions related to the explanatory
variables are used as factor contributions [20, 21]. The products of standardized
regression coefficients and the correlation coefficients between a response variable
and the explanatory variables are proposed for measuring the factor contributions [5,
22] as well, where R2 is decomposed into the products. It is pointed out that the above
measures are not interpretable in general [19]. There are several attempts for mea-
suring the factor contribution; however, there is no unified and satisfactory approach
for this topic even in ordinary linear regression [11–13]. Considering GLMs in view
of entropy, as mentioned in the previous chapters, the entropy coefficient of deter-
mination (ECD) [8, 9] is an extension of R2 in the ordinary linear regression model,
and ECD can be applied to all GLMs. In GLMs, measuring not only the predictive power of regression models but also the explanatory variable or factor contribution is important. This chapter gives an ECD approach for measuring the explanatory
variable contribution in GLMs, i.e. an extension of the R2 approach in the ordinary
linear regression model, and a method for assessing importance of explanatory vari-
ables is given [3, 10]. In Sect. 7.2, by using the ordinary linear regression model
and path diagrams, the issue to be treated in this chapter is explained. Section 7.3
explains three examples, to which an entropy-based method for assessing variable
contribution is applied. In Sect. 7.4, the method is theoretically discussed according
to a path analysis method in Chap. 6, and in Sect. 7.5, the method is applied to the
three examples explained in Sect. 7.3. Section 7.6 considers an application of the
present method for analyzing usual test data. Finally, Sect. 7.7 discusses the measurement of variable importance in GLMs with explanatory variables.
In order to clarify the question to be treated in this chapter, the ordinary linear regression model is discussed. Let $\mathbf X = (X_1,X_2,\ldots,X_p)^{\mathsf T}$ and $Y$ be a $p\times 1$ factor or explanatory variable vector and a response variable, respectively, and let $\boldsymbol\beta = (\beta_1,\beta_2,\ldots,\beta_p)^{\mathsf T}$ be the regression coefficient vector. Then, a linear regression model is given by
$$Y = \mu + \boldsymbol\beta^{\mathsf T}\mathbf X + e, \qquad (7.1)$$
where $\sigma_{ij} = \mathrm{Cov}(X_i,X_j)$, $i,j = 1,2,\ldots,p$. If the explanatory variables are statistically independent or factors (Fig. 7.1b), then we have
$$\mathrm{ECD} = \frac{\sum_{i=1}^p\beta_i^2\sigma_i^2}{\sum_{i=1}^p\beta_i^2\sigma_i^2+\sigma^2}, \qquad (7.3)$$
$$C(X_i\to Y) \equiv e_T(X_i\to Y) = \frac{\beta_i^2\sigma_i^2}{\sum_{i=1}^p\beta_i^2\sigma_i^2+\sigma^2}, \qquad (7.4)$$
For a general path diagram (Fig. 7.1a), the entropy-based path analysis is applied
for measuring the explanatory variable contribution. In Fig. 7.1c, the explanatory
variables concerned include ones that have no direct paths, which implies the vari-
ables are conditionally independent of the response variable Y, given the explanatory
variables with the direct paths to Y. In this case, it may also be needed to measure their
contributions to the response variable. It is an important issue how the explanatory
variable contribution and importance are measured in GLMs. The present chapter
gives an approach to the issue based on an entropy-based path analysis in Chap. 6.
7.3 Examples
Example 7.1 This example provides an assessment of the so-called residual income
valuation model, which was proposed by [17] as an alternative to the classical
dividend-discounting valuation model. The model is an ordinary linear regression
model. A sample of 20 banks was observed in 2000 according to their stock price
(PRICE) (Y ), their book value per share (BVS) (X1 ), their earnings for the following
year per share as forecasted by analysts (FY1) (X2 ), and their residual income, given
by their current earnings minus the discounted book value of the preceding year
(INC) (X3 ). All values are in US dollars and PRICE is used as a response variable.
The model used is the following ordinary linear regression model:
Y = μ + β1 x1 + β2 x2 + β3 x3 + e, (7.5)
where e is an error term according to normal distribution N 0.σ 2 . In a GLM formu-
lation, the link is canonical and θ is a linear function of the explanatory variables,
i.e.
θ = β1 X1 + β2 X2 + β3 X3 . (7.6)
Data are shown in Table 7.1, where the names of banks are masked for privacy.
In this example, there is no causal order in the explanatory variables but association
between the explanatory variables. Then, the path model is expressed as in Fig. 7.2, and measuring the contributions of the explanatory variables is meaningful. The direct
effect of an explanatory variable should be defined as that of the variable given the
other explanatory variables, and the indirect effect is that through the association
between the variable and the other ones.
Example 7.2 Table 7.2 shows two-way layout experimental data in a study of length
of time spent on individual home visits by public health nurses [4, pp. 348–353].
In the example, analysis of the effects of factors, i.e. the type of patient and the
age of a nurse, on the nurses’ behavior will be significant. Let Y be length of home
visit, and let factors X1 and X2 denote the type of a patient and the age of a nurse,
respectively. The results of two-way analysis of variance are shown in Table 7.3. The
main and interactive effects of factors are significant. In this case, levels of factor
vector X = (X1 , X2 ) are levels (i, j), i = 1, 2, 3, 4; j = 1, 2, 3, 4. Although factors
X1 and X2 are independent, the model has interaction terms between them (Fig. 7.3).
Let Yijk be the kth response given factor level (X1 , X2 ) = (i, j); let αi , i = 1, 2, 3, 4
be the main effect parameters of factor X1 ; let βj , j = 1, 2, 3, 4 be the main effects
of X2 ; and let (αβ)ij , i = 1, 2, 3, 4; j = 1, 2, 3, 4 be the interactive effects of X1 and
X2 , respectively. Then, the usual expression of the model is
Table 7.1 Data of PRICE, BVD, and INC from twenty banks
Bank PRICE (Y) BVD (X1 ) FY1 (X2 ) INC (X3 )
1 11.00 4.6850 0.87 0.5962
2 1.20 0.5680 0.11 0.0867
3 7.65 3.0550 0.62 0.7225
4 10.19 3.9210 0.42 0.2447
5 12.40 1.9860 0.36 0.2112
6 6.10 3.1200 0.51 0.3116
7 12.96 3.9390 0.60 0.3477
8 6.04 2.0610 0.39 0.2806
9 36.10 23.4850 1.80 0.8252
10 10.21 9.3070 0.94 0.6149
11 84.70 44.7590 3.59 1.2261
12 16.70 13.1700 1.16 0.3900
13 9.54 10.2000 1.20 0.3235
14 50.00 20.0290 3.18 1.9821
15 9.39 5.9820 0.30 0.0761
16 4.31 3.2120 0.38 0.3064
17 10.08 6.8790 0.77 0.5846
18 49.75 28.5270 4.16 1.2641
19 12.75 5.0950 0.43 0.3439
20 4.24 1.7080 0.22 0.2806
Source Ohlson [17]
Table 7.2 Length of home visit in minutes by public health nurses, by nurse's age group and type of patient

Factor X1 (type of patient)   Factor X2 (age group of nurses, years old)
                              1 (20–29)            2 (30–39)            3 (40–49)            4 (50 and over)
1 (Cardiac)                   20, 25, 22, 27, 21   25, 30, 29, 28, 30   24, 28, 24, 5, 30    28, 31, 26, 29, 32
2 (Cancer)                    30, 45, 30, 35, 36   30, 39, 31, 30, 30   39, 42, 36, 42, 40   40, 45, 50, 45, 60
3 (C.V.T.)                    31, 30, 40, 35, 30   32, 35, 30, 40, 30   41, 45, 40, 40, 35   42, 50, 40, 55, 45
4 (Tuberculosis)              20, 21, 20, 20, 19   23, 25, 28, 30, 31   24, 25, 30, 26, 23   29, 30, 28, 27, 30
Source [4]
$$\sum_{i=1}^4\alpha_i = \sum_{j=1}^4\beta_j = \sum_{i=1}^4(\alpha\beta)_{ij} = \sum_{j=1}^4(\alpha\beta)_{ij} = 0. \qquad (7.8)$$
In this example, factors are categorical and the response is continuous, which is
different from the path system of continuous variables in Sect. 7.2. It is meaningful
to measure the effects of factors, i.e. the direct and indirect effects, by using a path
analysis approach. A GLM expression of the two-way layout experimental design
model is given below (Table 7.3).
Example 7.3 Table 7.4 shows the data from 2276 high school seniors about questions
whether they have ever used alcohol, cigarettes, or marijuana [2]. For explanatory
variables alcohol use X1 , cigarette use X2 , and response variable marijuana use Y, we
analyzed the data with a logit model. In this case, it is appropriate not to assume any
causal order in the variables, because someone used alcohol before cigarettes, and
others cigarettes before alcohol. The path diagram is illustrated in Fig. 7.4. In this logit
model, we used θ = μ + β1 X1 + β2 X2 , and the estimates are μ = −5.302 (ASE =
0.458), β1 = 2.980(0.448), and β2 = 2.847(0.162). The usual interpretation of the
results is given by using odds with respect to marijuana use. The partial odds of marijuana use Y given cigarette use X2 are exp(2.980) = 19.688 times higher at alcohol use X1 (yes) than at alcohol use (no), and the partial odds of marijuana use Y given alcohol use X1 are exp(2.847) = 17.236 times higher at cigarette use X2
the effects of alcohol use and cigarette use on marijuana use would be almost the
same, given that alcohol use and cigarette use were completely controlled; however,
alcohol use and cigarette use are associated and cannot be completely controlled.
Hence, in addition to the usual method based on odds ratios, it is sensible to use a
new method for assessing the explanatory variable contribution. The direct effect of
Xi should be defined given the other explanatory variable, and the indirect effect is
that through the association between them.
where $\theta$ and $\phi$ are parameters, and $a(\phi) > 0$, $b(\theta)$, and $c(y,\phi)$ are specific functions. In a GLM, $\theta$ is a function of the linear predictor $\eta = \boldsymbol\beta^{\mathsf T}\mathbf x = \sum_{i=1}^p\beta_i x_i$. Let $f_i(x)$ be the marginal density or probability functions of the explanatory variables $X_i$, and let $\mathrm{Cov}(Y,\theta\mid X_i)$ be the conditional covariance of $Y$ and $\theta$ given $X_i$. From a viewpoint of entropy, the covariance $\mathrm{Cov}(Y,\theta)$ can be regarded as the explained variation of $Y$ in entropy by all the explanatory variables $X_k$, and $\mathrm{Cov}(Y,\theta\mid X_i)$ is that excluding the effect of $X_i$. From this, we make the following definition.
$$\mathrm{CR}(X_i\to Y) = \frac{\beta_i\,\mathrm{Cov}(X_i,Y)}{\sum_{j=1}^p\beta_j\,\mathrm{Cov}\bigl(X_j,Y\bigr)}, \quad i=1,2,\ldots,p.$$
When the explanatory variables are statistically independent, this becomes
$$\mathrm{CR}(X_i\to Y) = \frac{\beta_i^2\,\mathrm{Var}(X_i)}{\sum_{j=1}^p\beta_j^2\,\mathrm{Var}\bigl(X_j\bigr)}, \quad i=1,2,\ldots,p.$$
In this case, the denominator is the explained variance of $Y$ by all the explanatory variables, and the numerator is that by $X_i$.
and we have
$$\mathrm{CR}(X_i\to Y) = \frac{e_T(X_i\to Y)}{e_T(\mathbf X\to Y)}, \quad i=1,2,\ldots,p.$$
Hence, the contribution ratios of the explanatory variables $X_i$ in path diagrams for GLMs such as Fig. 7.1a, b, and c are the same as those in the case described in path diagram Fig. 7.5.
where for discrete variables the related integrals are replaced by the summations.
Proof For simplification, the Lemma is proven in the case of continuous distributions. Let
$$h = \iiint f_1(y\mid x_1)g(x_1,x_2)\log f(x_1,x_2,y)\,dx_1dx_2dy - \iiint f(y)g(x_1,x_2)\log f(x_1,x_2,y)\,dx_1dx_2dy. \qquad (7.13)$$
The left side of (7.12) is minimized with respect to $f(x_1,x_2,y)$. For the Lagrange multiplier $\lambda$, let
$$L = h - \lambda\iiint f(x_1,x_2,y)\,dx_1dx_2dy = \iiint f_1(y\mid x_1)g(x_1,x_2)\log f(x_1,x_2,y)\,dx_1dx_2dy - \iiint f(y)g(x_1,x_2)\log f(x_1,x_2,y)\,dx_1dx_2dy - \lambda\iiint f(x_1,x_2,y)\,dx_1dx_2dy.$$
Since
$$\iiint f_1(y\mid x_1)g(x_1,x_2)\,dx_1dx_2dy = \iiint f(y)g(x_1,x_2)\,dx_1dx_2dy = \iiint f(x_1,x_2,y)\,dx_1dx_2dy = 1,$$
integrating both sides of Eq. (7.15) with respect to $x_1$, $x_2$, and $y$, we have $\lambda = 0$. From this, we obtain $f_1(y\mid x_1) = f(y)$, which yields $h = 0$ as the minimum value of $h$ in (7.13). This completes the proof of the Lemma.
Remark 7.3 Let $\mathbf X = (X_1,X_2)^{\mathsf T}$ be a factor vector with levels $(x_{1i},x_{2j})$, $i=1,2,\ldots,I$; $j=1,2,\ldots,J$; let $n_{ij}$ be the sample sizes at $(x_{1i},x_{2j})$; and let $n_{i+} = \sum_{j=1}^J n_{ij}$, $n_{+j} = \sum_{i=1}^I n_{ij}$, $n = \sum_{i=1}^I\sum_{j=1}^J n_{ij}$. Then, Lemma 7.1 holds by setting
$$g\bigl(x_{1i},x_{2j}\bigr) = \frac{n_{ij}}{n}, \quad g_1(x_{1i}) = \frac{n_{i+}}{n}, \quad f_1(y\mid x_{1i}) = \frac{1}{n_{i+}}\sum_{j=1}^J n_{ij}\,f\bigl(y\mid x_{1i},x_{2j}\bigr),$$
$$f(y) = \sum_{i=1}^I\sum_{j=1}^J\frac{n_{ij}}{n}\,f\bigl(y\mid x_{1i},x_{2j}\bigr).$$
Remark 7.4 Let $\mathbf X = (X_1,X_2,\ldots,X_p)^{\mathsf T}$; let $\mathbf X^{(1)} = (X_1,X_2,\ldots,X_q)^{\mathsf T}$ and $\mathbf X^{(2)} = (X_{q+1},X_{q+2},\ldots,X_p)^{\mathsf T}$. Then, by replacing $X_1$ and $X_2$ in Lemma 7.1 with $\mathbf X^{(1)}$ and $\mathbf X^{(2)}$, respectively, the inequality (7.12) holds true.
Proof For simplicity of the discussion, the theorem is proven for continuous variables in the case of $p=2$ and $i=1$. Since
$$\frac{\mathrm{Cov}(\theta,Y)}{a(\phi)} = \iiint\bigl(f(x_1,x_2,y)-f(y)g(x_1,x_2)\bigr)\log f(x_1,x_2,y)\,dx_1dx_2dy$$
and
$$\frac{\mathrm{Cov}(\theta,Y\mid X_i)}{a(\phi)} = \iiint\bigl(f(x_1,x_2,y)-f_1(y\mid x_1)g(x_1,x_2)\bigr)\log f(x_1,x_2,y)\,dx_1dx_2dy,$$
we have
$$\frac{\mathrm{Cov}(\theta,Y)}{a(\phi)} - \frac{\mathrm{Cov}(\theta,Y\mid X_i)}{a(\phi)} = \iiint\bigl(f_1(y\mid x_1)g(x_1,x_2)-f(y)g(x_1,x_2)\bigr)\log f(x_1,x_2,y)\,dx_1dx_2dy.$$
$$0 \le \mathrm{CR}(X_i\to Y) \le 1, \quad i=1,2,\ldots,p. \qquad (7.17)$$
Proof Since $\mathrm{Cov}(\theta,Y\mid X_i) \ge 0$, from (7.16) Theorem 7.1 shows that $\mathrm{Cov}(\theta,Y\mid X_i) \le \mathrm{Cov}(\theta,Y)$, and we have
$$\mathrm{CR}(X_i\to Y) = \frac{\beta_i\,\mathrm{Cov}(X_i,Y)}{\mathrm{Cov}(\theta,Y)} = \frac{\beta_i\,\mathrm{Cov}(X_i,Y)}{\sum_{j=1}^p\beta_j\,\mathrm{Cov}\bigl(X_j,Y\bigr)}. \qquad (7.18)$$
In GLMs, after a model selection procedure, there may be cases in which some
explanatory variables have no direct paths to the response variable, that is, the related
regression coefficients are zero. One example is depicted in Fig. 7.1c. For
such cases, we have the following theorem:
Theorem 7.3 In GLM (7.9), let the explanatory variable vector X be decomposed
into two sub-vectors X (1) with the direct paths to response variable Y and X (2) without
the direct paths to Y. Then,
KL(Y, X) = KL(Y, X^(1)) = Cov(θ, Y)/a(ϕ). (7.20)
Example 2 The present discussion is applied to the ordinary two-way layout exper-
imental design model. Before analyzing the data in Table 7.2, a general two-way
layout experiment model is formulated in a GLM framework. Let X1 and X2 be fac-
tors with levels {1, 2, . . . , I } and {1, 2, . . . , J }, respectively. Then, the linear predictor
is a function of (X1 , X2 ) = (i, j), i.e.
θ = μ + αi + βj + (αβ)ij . (7.21)
where

Σ_{i=1}^I αi = Σ_{j=1}^J βj = Σ_{i=1}^I (αβ)ij = Σ_{j=1}^J (αβ)ij = 0.
Let

Xki = 1 (Xk = i), Xki = 0 (Xk ≠ i), k = 1, 2,

and

P(X1i = 1) = 1/I, i = 1, 2, . . . , I;  P(X2j = 1) = 1/J, j = 1, 2, . . . , J.
Dummy vectors X_1 = (X11, X12, . . . , X1I)^T and X_2 = (X21, X22, . . . , X2J)^T are identified with factors X1 and X2, respectively. From this, the systematic
component of model (7.21) can be written as follows:

θ = μ + α^T X_1 + β^T X_2 + γ^T (X_1 ⊗ X_2), (7.22)
where

α = (α1, α2, . . . , αI)^T, β = (β1, β2, . . . , βJ)^T, γ = ((αβ)11, (αβ)12, . . . , (αβ)1J, . . . , (αβ)IJ)^T.
The above three terms are referred to as the main effect of X1 , that of X2 and the
interactive effect, respectively. Then, ECD is calculated as follows:
ECD((X1, X2), Y) = Cov(θ, Y) / (Cov(θ, Y) + σ²)
 = [ (1/I)Σ_{i=1}^I αi² + (1/J)Σ_{j=1}^J βj² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)ij² ]
   / [ (1/I)Σ_{i=1}^I αi² + (1/J)Σ_{j=1}^J βj² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)ij² + σ² ] = R². (7.24)
Since

Cov(θ, Y|X_1) = (1/J)Σ_{j=1}^J βj² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)ij², (7.25)

Cov(θ, Y|X_2) = (1/I)Σ_{i=1}^I αi² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)ij², (7.26)
we have

eT(X1 → Y) = [Cov(θ, Y) − Cov(θ, Y|X_1)] / [Cov(θ, Y) + σ²]
 = (1/I)Σ_{i=1}^I αi² / [ (1/I)Σ_{i=1}^I αi² + (1/J)Σ_{j=1}^J βj² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)ij² + σ² ], (7.27)

eT(X2 → Y) = (1/J)Σ_{j=1}^J βj² / [ (1/I)Σ_{i=1}^I αi² + (1/J)Σ_{j=1}^J βj² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)ij² + σ² ]. (7.28)
CR(X1 → Y) = [Cov(θ, Y) − Cov(θ, Y|X_1)] / Cov(θ, Y)
 = (1/I)Σ_{i=1}^I αi² / [ (1/I)Σ_{i=1}^I αi² + (1/J)Σ_{j=1}^J βj² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)ij² ], (7.29)

CR(X2 → Y) = (1/J)Σ_{j=1}^J βj² / [ (1/I)Σ_{i=1}^I αi² + (1/J)Σ_{j=1}^J βj² + (1/IJ)Σ_{i=1}^I Σ_{j=1}^J (αβ)ij² ]. (7.30)
eT ((X1 , X2 ) → Y ) = ECD((X1 , X2 ), Y ).
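To make (7.24)–(7.30) concrete, here is a minimal Python sketch (illustrative only; the toy effect values are made up and are not taken from the book's tables) that computes ECD, eT and CR for a two-way layout directly from the main effects, interaction effects and error variance.

```python
import numpy as np

def two_way_contributions(alpha, beta, gamma, sigma2):
    """ECD, eT and CR for the two-way layout GLM (7.21), following (7.24)-(7.30)."""
    alpha = np.asarray(alpha, dtype=float)     # main effects of X1, length I
    beta = np.asarray(beta, dtype=float)       # main effects of X2, length J
    gamma = np.asarray(gamma, dtype=float)     # interaction effects (alpha*beta)_ij, shape (I, J)
    a = (alpha ** 2).mean()                    # (1/I) * sum_i alpha_i^2
    b = (beta ** 2).mean()                     # (1/J) * sum_j beta_j^2
    g = (gamma ** 2).mean()                    # (1/IJ) * sum_ij (alpha*beta)_ij^2
    cov_theta_y = a + b + g                    # Cov(theta, Y)
    ecd = cov_theta_y / (cov_theta_y + sigma2)                          # (7.24)
    eT1, eT2 = a / (cov_theta_y + sigma2), b / (cov_theta_y + sigma2)   # (7.27), (7.28)
    cr1, cr2 = a / cov_theta_y, b / cov_theta_y                         # (7.29), (7.30)
    return ecd, eT1, eT2, cr1, cr2

# toy effects satisfying the zero-sum constraints
alpha = np.array([1.0, -1.0])
beta = np.array([0.5, 0.0, -0.5])
gamma = np.array([[0.2, -0.1, -0.1], [-0.2, 0.1, 0.1]])
print(two_way_contributions(alpha, beta, gamma, sigma2=1.0))
```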
Example 3 (continued) The data in Table 7.4 is analyzed and the explanatory variable
contributions are calculated. From the ML estimators of the parameters, we have

ECD = 0.539/(0.539 + 1) = 0.350 (= eT((X1, X2) → Y)),

eT(X1 → Y) = (0.539 − 0.323)/(0.539 + 1) = 0.140,

eT(X2 → Y) = (0.539 − 0.047)/(0.539 + 1) = 0.320,

CR(X1 → Y) = eT(X1 → Y)/eT((X1, X2) → Y) = 0.140/0.350 = 0.401,

CR(X2 → Y) = eT(X2 → Y)/eT((X1, X2) → Y) = 0.320/0.350 = 0.913.
Thus, the contribution ratio of alcohol use X1 on marijuana use Y is 40.1%, whereas that of cigarette use X2 on marijuana use Y is 91.3%. The
effect of cigarette use X2 on marijuana use Y is about two times greater than that of
alcohol use X1.
Y = Σ_{i=1}^p Xi (7.32)

Y = Σ_{i=1}^p Xi + e,

where e is an error term according to N(0, σ²) and independent of item scores Xi.
In a GLM framework, we set

θ = Σ_{i=1}^p Xi.
Table 7.6 shows test data with five items. Under the normality of the data, the
above discussion is applied to analyze the test data. From this table, by using (7.33),
the contribution ratios CR(Xi → Y), i = 1, 2, 3, 4, 5, are calculated as follows:
Except for Social Science X3, the contributions of the other subjects are similar,
and the correlations between Y and Xi are strong. The contributions are illustrated
in Fig. 7.7.
Fig. 7.7 Radar chart of the contributions of the five subjects (Japanese, English, Social, Mathematics, Science) to the total score
explanatory variable Xi were the parent of the other variables, as shown in Fig. 7.5.
If the explanatory variables are causally ordered, e.g.

X1 → X2 → · · · → Xp → Y = Xp+1, (7.34)

then, by applying the path analysis in Chap. 6, from (6.46) and (6.47) the contributions of the
explanatory variables Xi are computed as follows:
CR(Xi → Y) = eT(Xi → Y)/eT(X → Y)
 = [ KL(X_{pa(p+1)}^{\1,2,...,i−1}, Y | X_{pa(i)}) − KL(X_{pa(p+1)}^{\1,2,...,i}, Y | X_{pa(i+1)}) ] / KL(X_{pa(p+1)}, Y), (7.35)
where X = (X1, X2, . . . , Xp)^T. Below, contributions CR(Xi → Y) are simply written
as CR(Xi) as long as the response variable Y is clear from the context. Then, it follows
that

Σ_{i=1}^p CR(Xi) = 1. (7.36)
where X^(1) is the subset of all the explanatory variables with non-zero regression coefficients
βi ≠ 0 in X. Hence, the explanatory powers of X and X^(1) are the same. From
this, in the present context, the variable importance is discussed based on the above
criteria. In what follows, it is assumed that all the explanatory variables X1, X2, . . . , Xp
have non-zero regression coefficients. In GLMs with explanatory variables that
have no causal ordering, the present entropy-based path analysis is applied to relative
importance assessment of explanatory variables. Let U = {X1, X2, . . . , Xp};
let r = (r1, r2, . . . , rp) be a permutation of the explanatory variable indices {1, 2, . . . , p}; let Si(r)
be the parent variables that appear before Xi in permutation r; and let Ti(r) = U\Si(r).
Definition 7.2 For a causal ordering r = (r1, r2, . . . , rp) of the explanatory variables
U = {X1, X2, . . . , Xp}, the relative importance of Xi is defined by

RI(Xi) = (1/p!) Σ_r CRr(Xi), (7.38)

where the summation implies the sum of the CRr(Xi)'s over all permutations r = (r1, r2, . . . , rp).
Σ_{i=1}^p RI(Xi) = 1. (7.40)
Remark 7.6 In GLMs (7.9), CRr (Xi ) can be expressed in terms of covariances of θ
and the explanatory variables, i.e.
CRr(Xi) = [ Cov(θ, Y|Si(r))/a(ϕ) − Cov(θ, Y|Si(r) ∪ {Xi})/a(ϕ) ] / [ Cov(θ, Y)/a(ϕ) ]
 = [ Cov(θ, Y|Si(r)) − Cov(θ, Y|Si(r) ∪ {Xi}) ] / Cov(θ, Y). (7.41)
Remark 7.7 In the above definition of the relative importance of explanatory variables
(7.38), if Xi is the parent of the other explanatory variables, i.e. Si(r) = ∅, then (7.37)
implies
Example 7.4 For p = 2, i.e. U = {X1 , X2 }, we have two permutations of the explana-
tory variables r = (1, 2), (2, 1). Then, the relative importance of the explanatory
variables is evaluated as follows:
RI(Xi) = (1/2) Σ_r CRr(Xi), i = 1, 2,

where

CR(1,2)(X1) = [KL((X1, X2), Y) − KL(X2, Y|X1)]/KL((X1, X2), Y) = [Cov(θ, Y) − Cov(θ, Y|X1)]/Cov(θ, Y),

CR(1,2)(X2) = KL(X2, Y|X1)/KL((X1, X2), Y),

CR(2,1)(X1) = KL(X1, Y|X2)/KL((X1, X2), Y),

CR(2,1)(X2) = [KL((X1, X2), Y) − KL(X1, Y|X2)]/KL((X1, X2), Y).
Hence, we have

RI(X1) = (1/2)[CR(1,2)(X1) + CR(2,1)(X1)]
 = [KL((X1, X2), Y) − KL(X2, Y|X1) + KL(X1, Y|X2)] / (2 KL((X1, X2), Y)),

RI(X2) = [KL((X1, X2), Y) − KL(X1, Y|X2) + KL(X2, Y|X1)] / (2 KL((X1, X2), Y)).
Similarly, for p = 3, we have

RI(X1) = (1/3!)[ 2CR(1,2,3)(X1) + CR(2,1,3)(X1) + CR(3,1,2)(X1) + 2CR(2,3,1)(X1) ]
 = 1/(6 KL((X1, X2, X3), Y)) × ( 2KL((X1, X2, X3), Y) − 2KL((X2, X3), Y|X1)
   + KL((X1, X3), Y|X2) − KL(X3, Y|X1, X2) + KL((X1, X2), Y|X3)
   − KL(X2, Y|X1, X3) + 2KL(X1, Y|X2, X3) )
 = 1/(6 Cov(θ, Y)) × ( 2Cov(θ, Y) − 2Cov(θ, Y|X1) + Cov(θ, Y|X2) − Cov(θ, Y|X1, X2)
   + Cov(θ, Y|X3) − Cov(θ, Y|X1, X3) + 2Cov(θ, Y|X2, X3) ),

RI(X2) = (1/3!)[ 2CR(2,1,3)(X2) + CR(1,2,3)(X2) + CR(3,2,1)(X2) + 2CR(1,3,2)(X2) ],

RI(X3) = (1/3!)[ 2CR(3,1,2)(X3) + CR(1,3,2)(X3) + CR(2,3,1)(X3) + 2CR(1,2,3)(X3) ].
As shown in the above example, the calculation of RI(Xi ) becomes complex as
the number of the explanatory variables increases.
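For the ordinary linear regression case, the averaging over orderings can be carried out by brute force. The following Python sketch is illustrative and assumes the linear model with (approximately) jointly normal predictors, under which the conditional covariances Cov(θ, Y|S) reduce to differences of explained variances, so that CRr(Xi) becomes an R² increment normalized by the total R², in the spirit of Kruskal [14] and the relaimpo package [11].

```python
import numpy as np
from itertools import permutations

def r_squared(X, y, cols):
    """R^2 of the least-squares regression of y on the columns `cols` of X."""
    if not cols:
        return 0.0
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return 1.0 - resid.var() / y.var()

def relative_importance(X, y):
    """RI(Xi): average over all orderings of the normalized R^2 increment of Xi."""
    p = X.shape[1]
    r2_full = r_squared(X, y, list(range(p)))
    ri = np.zeros(p)
    perms = list(permutations(range(p)))
    for r in perms:
        seen = []
        for i in r:
            gain = r_squared(X, y, seen + [i]) - r_squared(X, y, seen)
            ri[i] += gain / r2_full        # CR_r(Xi) under the linear-normal reduction
            seen.append(i)
    return ri / len(perms)                 # average over the p! permutations

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0, 0], [[1, .5, .2], [.5, 1, .3], [.2, .3, 1]], size=500)
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500)
print(relative_importance(X, y))           # entries sum to 1, cf. (7.40)
```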
Example 7.1 (continued) Three explanatory variables are hypothesized prior to the
linear regression analysis; however, only the regression coefficients of the two explanatory
variables X1 and X3 are statistically significant. According to this result, the variable
importance is assessed for X1 and X3. Since
CR(1,3)(X1) = (394.898 − 7.538)/394.898 = 0.981,

CR(1,3)(X3) = 7.538/394.898 = 0.019,
CR(3,1)(X1) = 130.821/394.898 = 0.331,

CR(3,1)(X3) = 1 − CR(3,1)(X1) = 0.669.

Hence, RI(X1) = (0.981 + 0.331)/2 = 0.656 and RI(X3) = (0.019 + 0.669)/2 = 0.344.
Example 7.2 (continued) By using the model (7.22), the relative importance of the
explanatory variables X1 and X2 is considered. From (7.37), we have
In order to demonstrate the above results, the following estimates are used:

CR(2,1)(X1) = Cov(θ, Y|X_2)/Cov(θ, Y) = 0.610,

CR(1,2)(X2) = Cov(θ, Y|X_1)/Cov(θ, Y) = 0.301.
7.8 Discussion
References
1. Adachi, K., & Trendafilov, N. T. (2018). Some mathematical properties of the matrix
decomposition solution in factor analysis. Psychometrika, 83, 407–424.
2. Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
3. Azen, R., & Budescu, D. V. (2003). The dominance analysis approach for comparing predictors
in multiple regression. Psychological Methods, 8(2), 129–148.
4. Eshima, N., Borroni, C. G., & Tabata, M. (2016). Relative-importance assessment of
explanatory variables in generalized linear models: An entropy-based approach. Statistics and
Applications, 14, 107–122.
5. Daniel, W. W. (1999). Biostatistics: A foundation for analysis in the health sciences (7th ed.).
New York: Wiley.
6. Darlington, R. B. (1968). Multiple regression in psychological research and practice.
Psychological Bulletin, 69, 161–182.
7. Dechow, P. M., Hutton, A. P., & Sloan, R. G. (1999). An empirical assessment of the residual
income valuation model. Journal of Accounting and Economics, 26, 1–34.
8. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear
models. Computational Statistics and Data Analysis, 54, 1381–1389.
9. Eshima, N., & Tabata, M. (2011). Three predictive power measures for generalized linear
models: Entropy coefficient of determination, entropy correlation coefficient and regression
correlation coefficient. Computational Statistics & Data Analysis, 55, 3049–3058.
10. Eshima, N., Tabata, M., Borroni, C. G., & Kano, Y. (2015). An entropy-based approach to path
analysis of structural generalized linear models: A basic idea. Entropy, 17, 5117–5132.
11. Grömping, U. (2006). Relative importance for linear regression in R: The package relaimpo.
Journal of Statistical Software, 17, 1–26.
12. Grömping, U. (2007). Estimators of relative importance in linear regression based on variance
decomposition. The American Statistician, 61, 139–147.
13. Grömping, U. (2009). Variable importance assessment in regression: Linear regression versus
random forest. The American Statistician, 63, 308–319.
14. Kruskal, W. (1987). Relative importance by averaging over orderings. The American Statisti-
cian, 41, 6–10.
15. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman
and Hall.
16. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal
Statistical Society, A, 135, 370–384.
17. Ohlson, J. A. (1995). Earnings, book values and dividends in security valuation. Contemporary
Accounting Research, 11, 661–687.
18. Tanaka, Y., & Tarumi, T. (1995). Handbook for statistical analysis: Multivariate analysis
(windows version). Tokyo: Kyoritsu-Shuppan. (in Japanese).
19. Theil, H. (1987). How many bits of information does an independent variable yield in a multiple
regression? Statistics and Probability Letters, 6, 107–108.
20. Theil, H., & Chung, C. F. (1988). Information-theoretic measures of fit for univariate and
multivariate regressions. The American Statistician, 42, 249–253.
21. Thomas, D. R., & Zumbo, B. D. (1996). Using a measure of variable importance to investi-
gate the standardization of discriminant coefficients. Journal of Educational and Behavioral
Statistics, 21, 110–130.
Chapter 8
Latent Structure Analysis
8.1 Introduction
Latent structure analysis [11] is a general name that includes factor analysis [19, 22],
latent trait analysis [12, 14], and latent class analysis. Let X = (X1, X2, . . . , Xp)^T
be a manifest variable vector; let ξ = (ξ1, ξ2, . . . , ξq)^T be a latent variable (factor)
vector that influences the manifest variables; let f(x) and f(x|ξ) be the joint density
or probability functions of manifest variable vector X and the conditional one given
latent variables ξ, respectively; let f(xi|ξ) be the conditional density or probability
functions of manifest variables Xi given latent variables ξ; and let g(ξ) be the marginal
density or probability function of latent variable vector ξ. Then, a general latent
structure model assumes that

f(x|ξ) = Π_{i=1}^p f(xi|ξ). (8.1)
where the integral is replaced by the summation when the manifest variables are dis-
crete. The manifest variables Xi are observable; however, the latent variables ξj represent
latent traits or abilities that cannot be observed directly and are hypothesized components
of, for example, human behavior or responses. The above assumption (8.1) is referred
to as that of local independence. The assumption implies that latent variables explain
the association of the manifest variables. In general, latent structure analysis explains
how latent variables affect manifest variables. Factor analysis treats continuous man-
ifest and latent variables, and latent trait models are modeled with discrete manifest
variables and continuous latent variables. Latent class analysis handles categorical
manifest and latent variables. It is important to estimate the model parameters, and the
interpretation of the extracted latent structures is also critical. In this chapter, latent
structure models are treated as GLMs, and the entropy-based approach in Chaps. 6
and 7 is applied to factor analysis and latent trait analysis. In Sect. 8.2, factor analysis
is treated, and an entropy-based method of assessing factor contribution is considered
[6, 7]. In comparison with the conventional method of measuring factor contribution
and that based on entropy, advantages of the entropy-based method are discussed
in a GLM framework. For oblique factor analysis models, a method for calculating
factor contributions is given by using covariance matrices. Section 8.3 deals with
the latent trait model that expresses dichotomous responses underlying a common
latent ability or trait. In the ML estimation of latent abilities of individuals from their
responses, the information of test items and that of tests are discussed according to
the Fisher information. Based on the GLM framework, ECD is used for measuring
test reliability. Numerical illustrations are also provided to demonstrate the present
discussion.
8.2 Factor Analysis

The origin of factor analysis dates back to the works of [19], and the single factor
model was extended to the multiple factor model [22]. Let Xi be manifest variables;
ξj latent variables (common factors); εi unique factors peculiar to Xi ; and let λij be
factor loadings that are weights of factors ξj to explain Xi . Then, the factor analysis
model is given as follows:
Xi = Σ_{j=1}^m λij ξj + εi (i = 1, 2, . . . , p), (8.3)
where

E(Xi) = E(εi) = 0, i = 1, 2, . . . , p;
E(ξj) = 0, j = 1, 2, . . . , m;
Var(ξj) = 1, j = 1, 2, . . . , m;
Var(εi) = ωi² > 0, i = 1, 2, . . . , p;
Cov(εk, εl) = 0, k ≠ l.
Let Σ be the variance–covariance matrix of X = (X1, X2, . . . , Xp)^T; let Φ be the
m × m correlation matrix of common factor vector ξ = (ξ1, ξ2, . . . , ξm)^T; let Ω be
the p × p variance–covariance matrix of unique factor vector ε = (ε1, ε2, . . . , εp)^T; and let
Λ = (λij) be the p × m factor loading matrix. Then,

Σ = ΛΦΛ^T + Ω, (8.4)

X = Λξ + ε. (8.5)

X = Λ*ξ* + ε, (8.6)

Σ = Λ*Φ*Λ*^T + Ω. (8.7)

Σ = ΛΛ^T + Ω. (8.9)

Σ = Λ*Λ*^T + Ω. (8.10)

X* = DX. (8.11)
Then, factor analysis models (8.5) and (8.11) are equivalent. From this, factor anal-
ysis can be treated under standardization of the manifest variables. Below, we set
Var(Xi ) = 1, i = 1, 2, . . . , p. Methods of parameter estimation in factor analysis
have been studied actively by many authors, for example, for least square estimation,
[1, 10], for ML estimation, [8, 9, 15] and so on. The estimates of model parameters
can be obtained by using the ordinary statistical software packages [24]. Moreover,
high-dimensional factor analysis where the number of manifest variables is greater
than that of observations has also been developed [16, 20, 21, 23]. In this chapter,
the topics of parameter estimation are not treated, and an entropy-based method for
measuring factor contribution is discussed.
The interpretation of the extracted factors is based on the factor loading matrix and
the factor structure matrix. After interpreting the factors, it is important to assess
the factor contributions. For orthogonal factor analysis models (Fig. 8.1a), the contribution
of factor ξj to the manifest variables Xi is defined as follows:
Cj = Σ_{i=1}^p Cov(Xi, ξj)² = Σ_{i=1}^p λij². (8.13)
The contributions of the extracted factors can also be measured from the
following decomposition of the total variance of the observed variables Xi [2, p. 59]:
Σ_{i=1}^p Var(Xi) = Σ_{i=1}^p λi1² + Σ_{i=1}^p λi2² + · · · + Σ_{i=1}^p λim² + Σ_{i=1}^p ωi². (8.14)
From this, the contribution (8.13) is regarded as the part of the sum of the
variances of the manifest variables attributed to factor ξj. When applied to the manifest variables as observed,
contribution (8.13) is not scale-invariant. For this reason, factor contribution is
considered for standardized versions of manifest variables Xi; however, the sum of
the variances (8.14) has no physical meaning. The variation of manifest variable
vector X = (X1, X2, . . . , Xp) is summarized by the variance–covariance matrix Σ.
The generalized variance of manifest variable vector X = (X1, X2, . . . , Xp) is the
determinant of the variance–covariance matrix, |Σ|, and it expresses the p-dimensional
variation of random vector X. If the determinant could be decomposed into sums
of quantities related to factors ξj as in (8.14), it would be natural to define the
factor contributions by these quantities; however, such a decomposition based on |Σ| is impossible.
For standardized manifest variables Xi , we have
λij = Corr(Xi, ξj). (8.15)
and

Σ_{i=1}^p Var(Xi) = p. (8.16)
The squared correlation coefficients (8.15) are the ratios of explained variances
for the manifest variables Xi and can be interpreted as the contributions (effects) of
the factors on the individual manifest variables, but the sum of these has no logical
foundation to be viewed as the contribution of factor ξj to manifest variable vector
X = (X1, X2, . . . , Xp). Nevertheless, the contribution ratio of ξj is defined by

RCj = Cj / Σ_{l=1}^m Cl = Cj / Σ_{l=1}^m Σ_{k=1}^p λkl². (8.17)
The above measure is referred to as the factor contribution ratio in the common
factor space. Another contribution ratio of ξj is referred to as that in the whole space
of manifest variable vector X = (Xi ), and it is defined by
R̃Cj = Cj / Σ_{i=1}^p Var(Xi) = Cj / p. (8.18)
Assuming normal distributions for the common and unique factors, the conditional density functions of the manifest variables Xi given ξ in model (8.3) are

fi(xi|ξ) = (2πωi²)^{−1/2} exp{ −(xi − Σ_{j=1}^m λij ξj)² / (2ωi²) }, i = 1, 2, . . . , p. (8.19)

Let θi = Σ_{j=1}^m λij ξj and C(xi, ωi²) = −xi²/(2ωi²) − (1/2) log(2πωi²). Then, the above density
function is described in a GLM framework as follows:

fi(xi|ξ) = exp{ (xiθi − θi²/2)/ωi² + C(xi, ωi²) }, i = 1, 2, . . . , p. (8.20)
From the general latent structure model (8.1), the conditional normal density
function of X given ξ is expressed as
Fig. 8.2 a Path diagram of a general factor analysis model. b Path diagram of manifest variables
Xi, i = 1, 2, . . . , p, and common factor vector ξ. c Path diagram of manifest variable sub-vectors
X^(a), a = 1, 2, . . . , A, common factor vector ξ and error sub-vectors ε^(a) related to X^(a), a =
1, 2, . . . , A. d Path diagram of manifest variable vector X, common factors ξj, and unique factors εi
f(x|ξ) = Π_{i=1}^p exp{ (xiθi − θi²/2)/ωi² + C(xi, ωi²) }
 = exp{ Σ_{i=1}^p (xiθi − θi²/2)/ωi² + Σ_{i=1}^p C(xi, ωi²) }. (8.21)
KL(X, ξ) = Σ_{i=1}^p KL(Xi, ξ). (8.22)
Proof Let fi (xi |ξ ) be the conditional density functions of manifest variables Xi , given
factor vector ξ ; fi (xi ) be the marginal density functions of Xi ; f (x) be the marginal
density function of X; and let g(ξ ) be the marginal density function of common
factor vector ξ = ξj . Then, from (8.20), we have
KL(Xi, ξ) = ∫∫ fi(xi|ξ)g(ξ) log[ fi(xi|ξ)/fi(xi) ] dxi dξ + ∫∫ fi(xi)g(ξ) log[ fi(xi)/fi(xi|ξ) ] dxi dξ
 = Cov(Xi, θi)/ωi², i = 1, 2, . . . , p, (8.23)

and

KL(X, ξ) = Σ_{i=1}^p Cov(Xi, θi)/ωi². (8.24)
E(Xi|ξ) = θi = Σ_{j=1}^m λij ξj, i = 1, 2, . . . , p,
we have

Cov(Xi, θi) / (Cov(Xi, θi) + ωi²) = R²i, i = 1, 2, . . . , p,

KL(Xi, ξ) = R²i / (1 − R²i), i = 1, 2, . . . , p, (8.25)

ECD(Xi, ξ) = KL(Xi, ξ) / (KL(Xi, ξ) + 1) = R²i, i = 1, 2, . . . , p; (8.26)

ECD(X, ξ) = ( Σ_{i=1}^p Cov(Xi, θi)/ωi² ) / ( Σ_{i=1}^p Cov(Xi, θi)/ωi² + 1 )
 = ( Σ_{i=1}^p R²i/(1 − R²i) ) / ( Σ_{i=1}^p R²i/(1 − R²i) + 1 ). (8.27)
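Since (8.25)–(8.27) depend on the data only through the squared multiple correlations R²i (the communalities of the standardized manifest variables), they are simple to evaluate. A minimal Python sketch, assuming the R²i are already available from a fitted factor model, might look as follows; the input values below are hypothetical.

```python
import numpy as np

def entropy_ecd(r_squared):
    """KL(Xi, xi), ECD(Xi, xi) and ECD(X, xi) from squared multiple correlations
    R_i^2 of the manifest variables, following (8.25)-(8.27)."""
    r2 = np.asarray(r_squared, dtype=float)
    kl_i = r2 / (1.0 - r2)                     # KL(Xi, xi), Eq. (8.25)
    ecd_i = r2                                 # ECD(Xi, xi) = R_i^2, Eq. (8.26)
    ecd_x = kl_i.sum() / (kl_i.sum() + 1.0)    # ECD(X, xi), Eq. (8.27)
    return kl_i, ecd_i, ecd_x

print(entropy_ecd([0.64, 0.71, 0.74, 0.69, 0.68]))   # hypothetical communalities
```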
KL(X, ξ) = Σ_{a=1}^A KL(X^(a), ξ). (8.28)

KL(X, ξ) = Σ_{i=1}^p KL(Xi, ξ).
Proof
KL(X, ξ) = ∫∫ Π_{i=1}^p fi(xi|ξ) g(ξ) log[ Π_{k=1}^p fk(xk|ξ) / f(x) ] dx dξ + ∫∫ f(x)g(ξ) log[ f(x) / Π_{k=1}^p fk(xk|ξ) ] dx dξ

 = ∫∫ ( Π_{i=1}^p fi(xi|ξ) g(ξ) − f(x)g(ξ) ) log Π_{k=1}^p fk(xk|ξ) dx dξ

 = ∫∫ Π_{i=1}^p fi(xi|ξ) g(ξ) log[ Π_{k=1}^p fk(xk|ξ) / Π_{k=1}^p fk(xk) ] dx dξ + ∫∫ f(x)g(ξ) log[ Π_{k=1}^p fk(xk) / Π_{k=1}^p fk(xk|ξ) ] dx dξ

 = Σ_{k=1}^p { ∫∫ Π_{i=1}^p fi(xi|ξ) g(ξ) log[ fk(xk|ξ)/fk(xk) ] dx dξ + ∫∫ f(x)g(ξ) log[ fk(xk)/fk(xk|ξ) ] dx dξ }

 = Σ_{k=1}^p { ∫∫ fk(xk|ξ) g(ξ) log[ fk(xk|ξ)/fk(xk) ] dxk dξ + ∫∫ fk(xk) g(ξ) log[ fk(xk)/fk(xk|ξ) ] dxk dξ }

 = Σ_{i=1}^p { ∫∫ fi(xi|ξ) g(ξ) log[ fi(xi|ξ)/fi(xi) ] dxi dξ + ∫∫ fi(xi) g(ξ) log[ fi(xi)/fi(xi|ξ) ] dxi dξ }

 = Σ_{i=1}^p KL(Xi, ξ).
˜
tr T Cov(Xi , θi ) R2
p p
= = i
.
|| i=1
ω 2
i i=1
1 − R2i
Hence, the generalized signal-to-noise ratio is decomposed into those for manifest
variables.
The contributions of factors in factor analysis model (8.20) are discussed in view
of entropy. According to the above discussion, the following definitions are made:
Remark 8.3 The contribution C(ξ → X) is decomposed into C(ξ → Xi) with respect
to the manifest variables Xi (8.32); however, in general, it follows that

C(ξ → X) ≠ Σ_{j=1}^m C(ξj → X). (8.33)
Considering
the above discussion, a more general decomposition of contribution
C ξj → X can be made in the factor analysis model.
Theorem 8.3 Let X = (X1, X2, . . . , Xp)^T and ξ = (ξ1, ξ2, . . . , ξm)^T be manifest and
latent variable vectors, respectively. Under the assumption of local independence
(8.1),

KL(X, ξ\j | ξj) = Σ_{i=1}^p KL(Xi, ξ\j | ξj), j = 1, 2, . . . , m. (8.35)
Proof Let f(x, ξ\j|ξj) be the conditional density function of X and ξ\j given ξj;
f(x|ξj) be the conditional density function of X given ξj; and let g(ξ\j|ξj) be the
conditional density function of ξ\j given ξj. Then, in factor analysis model (8.21),
we have

KL(X, ξ\j|ξj) = ∫∫∫ f(x, ξ) log[ f(x, ξ\j|ξj) / (f(x|ξj) g(ξ\j|ξj)) ] dx dξ\j dξj + ∫∫∫ f(x|ξj) g(ξ) log[ f(x|ξj) g(ξ\j|ξj) / f(x, ξ\j|ξj) ] dx dξ\j dξj

 = ∫∫∫ f(x, ξ) log[ f(x|ξ)/f(x|ξj) ] dx dξ\j dξj + ∫∫∫ f(x|ξj) g(ξ) log[ f(x|ξj)/f(x|ξ) ] dx dξ\j dξj

 = Σ_{i=1}^p Cov(Xi, θi|ξj)/ωi² = Σ_{i=1}^p KL(Xi, ξ\j|ξj), j = 1, 2, . . . , m.
From Theorems 8.1 and 8.3, we have the following decomposition of the
contribution of ξj to X:
C(ξj → X) = KL(X, ξ) − KL(X, ξ\j|ξj) = Σ_{i=1}^p KL(Xi, ξ) − Σ_{i=1}^p KL(Xi, ξ\j|ξj)
 = Σ_{i=1}^p [ KL(Xi, ξ) − KL(Xi, ξ\j|ξj) ] = Σ_{i=1}^p C(ξj → Xi). (8.36)
C(ξ → X) = Σ_{j=1}^m Σ_{i=1}^p C(ξj → Xi). (8.37)
KL(Xi, ξ) = Σ_{k=1}^m λik² / ωi²,   KL(Xi, ξ\j|ξj) = Σ_{k≠j} λik² / ωi².

C(ξj → Xi) = KL(Xi, ξ) − KL(Xi, ξ\j|ξj) = λij² / ωi².

C(ξ → X) = KL(X, ξ) = Σ_{i=1}^p Σ_{j=1}^m λij² / ωi².
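The orthogonal-case formulas above are straightforward to evaluate from a loading matrix. The following Python sketch is illustrative only: it implements just the displayed decomposition C(ξj → Xi) = λij²/ωi² for an orthogonal model with standardized manifest variables, using a small hypothetical loading matrix, and it makes no attempt to reproduce the book's tables.

```python
import numpy as np

def orthogonal_factor_contributions(loadings):
    """Entropy-based contributions for an orthogonal factor model with
    standardized manifest variables: C(xi_j -> X_i) = lambda_ij^2 / omega_i^2."""
    L = np.asarray(loadings, dtype=float)       # shape (p, m): variables x factors
    omega2 = 1.0 - (L ** 2).sum(axis=1)         # unique variances omega_i^2
    c_ji = (L ** 2) / omega2[:, None]           # C(xi_j -> X_i), shape (p, m)
    c_j = c_ji.sum(axis=0)                      # C(xi_j -> X) = sum_i C(xi_j -> X_i)
    c_total = c_ji.sum()                        # C(xi -> X) = KL(X, xi)
    return c_ji, c_j, c_total

L = np.array([[0.7, 0.2],                       # hypothetical loadings: 3 variables, 2 factors
              [0.6, 0.4],
              [0.1, 0.8]])
c_ji, c_j, c_total = orthogonal_factor_contributions(L)
print(c_j, c_total)
```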
In factor analysis model (8.3), in order to simplify the discussion, the factor contribution
of ξ1 is calculated. Let Σ̃ ≡ Var(X, ξ) be the variance–covariance matrix of
X = (X1, X2, . . . , Xp)^T and ξ = (ξ1, ξ2, . . . , ξm), partitioned as

Σ̃ = ( Σ11 Σ12 Σ13 ; Σ21 Σ22 Σ23 ; Σ31 Σ32 Σ33 ),

where Σ11 = Var(X), ( Σ12 Σ13 ) = Cov(X, ξ), and ( Σ22 Σ23 ; Σ32 Σ33 ) = Var(ξ).
By using the above result, we can calculate the contribution of factor ξ1 as follows:
Table 8.1 Artificial factor loadings of five common factors in nine manifest variables
Common Manifest variable
factor X1 X2 X3 X4 X5 X6 X7 X8 X9
ξ1 0.1 0.8 0.4 0.3 0.1 0.3 0.8 0.2 0.4
ξ2 0.3 0.2 0.2 0.3 0.8 0.4 0.1 0.1 0.2
ξ3 0.2 0.1 0.2 0.7 0.1 0.2 0.2 0.3 0.8
ξ4 0.7 0.1 0.1 0.1 0.1 0.6 0.1 0.1 0.1
ξ5 0.1 0.1 0.7 0.1 0.1 0.1 0.2 0.7 0.1
ωi2 0.360 0.290 0.260 0.310 0.320 0.340 0.260 0.360 0.140
ωi2 are unique factor variances
Table 8.2 Factor contributions measured with the conventional method (orthogonal case)
Factor   ξ1     ξ2     ξ3     ξ4     ξ5
Cj       1.72   0.79   1.07   1.60   1.08
RCj      0.275  0.126  0.171  0.256  0.173
R̃Cj      0.246  0.113  0.153  0.229  0.154
Table 8.3 Factor contributions measured with the entropy-based method (orthogonal case)
Factor        ξ1      ξ2     ξ3     ξ4     ξ5
C(ξj → X)     16.13   2.18   4.27   7.64   4.11
RC(ξj → X)    0.474   0.063  0.124  0.223  0.120
R̃C(ξj → X)    0.457   0.062  0.121  0.216  0.116
one, the five factors are assumed orthogonal. By using the conventional method, the
factor contributions of the common factors are calculated with Cj (8.13), RCj (8.17),
and R̃Cj (8.18) (Table 8.2). On the other hand, the entropy-based approach uses
C(ξj → X) (8.36), RC(ξj → X) (8.38), and R̃C(ξj → X) (8.40), and the results are
shown in Table 8.3. According to the conventional method, factors ξ1 and ξ4 have
similar contributions to the manifest variables; however, in the entropy-based method,
the contribution of ξ1 is more than twice that of ξ4.
Second, an oblique case is considered by using the factor loadings shown in Table 8.1.
Assuming the correlation matrix of the five common factors as

Φ =
( 1    0.7  0.5  0.2  0.1 )
( 0.7  1    0.7  0.5  0.2 )
( 0.5  0.7  1    0.7  0.5 )
( 0.2  0.5  0.7  1    0.7 )
( 0.1  0.2  0.5  0.7  1   ),
the factor contributions are computed by (8.39) and (8.42), and the results are illus-
trated in Table 8.4. According to the correlations between factors, factors ξ3 , ξ4 , and
ξ5 have large contributions to the manifest variables; i.e., the contributions are greater
than 0.5. In particular, the contributions of ξ3 and ξ4 are more than 0.7.
The entropy-based method is applied to factor analysis of Table 7.6, assuming
two common factors. By using the covarimin method, we have the factor loadings in
Table 8.4 Factor contributions measured with the entropy-based method (oblique case)
Factor        ξ1      ξ2      ξ3      ξ4      ξ5
C(ξj → X)     9.708   12.975  21.390  18.417  13.972
RC(ξj → X)    0.385   0.515   0.859   0.731   0.553
R̃C(ξj → X)    0.370   0.495   0.816   0.703   0.531
Table 8.5, where the manifest variables are standardized. According to the results,
factor ξ1 can be interpreted as a latent ability for liberal arts and factor ξ2 as that for
sciences. The correlation coefficient between the two common factors is estimated
as 0.315, and the factor contributions are calculated as in Table 8.6. The table shows
the following decompositions of C(ξj → X) and C(ξ → X):

C(ξj → X) = Σ_{i=1}^5 C(ξj → Xi), j = 1, 2;

C(ξ → X) = Σ_{i=1}^5 C(ξ → Xi);

C(ξ → Xi) = Σ_{j=1}^2 C(ξj → Xi), i = 1, 2, 3, 4, 5.
The effect of ξ2 on the manifest variable vector X is about 1.7 times greater
than that of ξ1. The summary of the contributions (effects) to the manifest variables
is demonstrated in Table 8.7. From R̃C(ξ → X) = 0.904, the explanatory
power of the latent common factor vector ξ is strong, especially that of ξ2, i.e.,
Table 8.6 Factor contributions C(ξj → X), C(ξj → Xi), and C(ξ → Xi)
       Japanese X1   English X2   Social X3   Mathematics X4   Science X5   Total^a
ξ1     0.902         1.438        0.704       0.368            0.539        3.951
ξ2     0.373         0.143        0.014       0.685            5.433        6.648
ξ      1.009         1.438        0.728       0.818            5.433        9.426
^a Totals in the table are equal to C(ξj → X), j = 1, 2, and C(ξ → X)
C(ξ → X) = Σ_{j=1}^2 C(ξj → X).
8.3 Latent Trait Analysis

In a test battery composed of response items with binary response categories, e.g.,
“yes” or “no”, “positive” or “negative”, “success” or “failure”, and so on, let us
assume that the responses to the test items depend on a latent trait or ability θ, where
θ is a hypothesized variable that cannot be observed directly. In item response theory,
the relationship between responses to the items and the latent trait is analyzed, and
it is applied to test-making. In this section, the term “latent trait” is mainly used for
convenience; however, the term “latent ability” is also employed. In a test battery with
p test items, let Xi be the responses to items i, i = 1, 2, . . . , p, such that

Xi = 1 (success response to item i), Xi = 0 (failure), i = 1, 2, . . . , p,
P(X1, X2, . . . , Xp = x1, x2, . . . , xp | θ) = Π_{i=1}^p Pi(θ)^{xi} Qi(θ)^{1−xi}, (8.43)
V = Σ_{i=1}^p Xi.

V = Σ_{i=1}^p ci Xi. (8.44)
For the above score, the average of the conditional expectations of ci Xi given
latent trait θ over the items is given by
T(θ) = (1/p) Σ_{i=1}^p ci Pi(θ). (8.45)
The above function is called a test characteristic function. In the next section, a
theoretical discussion for deriving the item characteristic function Pi (θ ) is provided
[18, pp. 37–40].
Pi(θ) = P(Yi > ηi | θ) = ∫_{ηi}^{+∞} 1/√(2π(1 − ρi²)) · exp{ −(y − ρiθ)² / (2(1 − ρi²)) } dy
 = ∫_{(ηi − ρiθ)/√(1−ρi²)}^{+∞} (1/√(2π)) exp(−t²/2) dt = ∫_{−ai(θ−di)}^{+∞} (1/√(2π)) exp(−t²/2) dt = Φ(ai(θ − di)),
 i = 1, 2, . . . , p, (8.47)
From this, for θ = di, the success probability to item i is 1/2, and in this sense, the
parameter di implies the difficulty of item i. Since the cumulative normal distribution
function (8.47) is difficult to treat mathematically, the following logistic model is
used as an approximation:

Pi(θ) = 1/(1 + exp(−Dai(θ − di))) = exp(Dai(θ − di))/(1 + exp(Dai(θ − di))), i = 1, 2, . . . , p,
(8.48)
where D is a constant. If we set D = 1.7, the above functions are good approximations
of (8.47). For ai = 2, di = 1, the graphs are almost the same as shown in Fig. 8.4a
and b. Differentiating (8.48) with respect to θ , we have
(d/dθ) Pi(θ) = Dai Pi(θ)Qi(θ) ≤ Dai Pi(di)Qi(di) = Dai/4, i = 1, 2, . . . , p. (8.49)
From this, the success probabilities to items i increase rapidly in neighborhoods
of θ = di, i = 1, 2, . . . , p (Fig. 8.5). As the parameters ai increase, the
success probabilities to the items increase more sharply. In this sense, parameters ai
express the discriminating powers of the responses to items i, i = 1, 2, . . . , p. As shown in
Fig. 8.6, as difficulty parameter d increases, the success probabilities for the items
decrease for fixed discrimination parameter a. Since logistic models (8.48)
are GLMs, we can handle model (8.48) more easily than the normal distribution
model (8.47). In the next section, the information of tests for estimating latent traits
is discussed, and ECD is applied for measuring the reliability of tests.
Fig. 8.4 a The standard normal distribution function (8.47) (a = 2, d = 1). b Logistic model (8.48)
(a = 2, d = 1)
In item response models (8.48), the likelihood functions of latent common trait θ
given responses Xi are
Li(θ|Xi) = Pi(θ)^{Xi} Qi(θ)^{1−Xi}, i = 1, 2, . . . , p,

with log-likelihood functions li(θ|Xi) = Xi log Pi(θ) + (1 − Xi) log Qi(θ).

Fig. 8.5 Logistic models (8.48) for discrimination parameters a = 0.5, 1, 2, 3
Fig. 8.6 Logistic functions (8.48) for difficulty parameters d = 1, 1.5, 2, 2.5
Then, the Fisher information for estimating latent trait θ is computed as follows:
Ii(θ) = E[ ( (d/dθ) li(θ|Xi) )² ] = E[ ( (Xi/Pi(θ)) (dPi(θ)/dθ) + ((1 − Xi)/Qi(θ)) (dQi(θ)/dθ) )² ]
 = (1/Pi(θ)) (dPi(θ)/dθ)² + (1/Qi(θ)) (dQi(θ)/dθ)²
 = (1/Pi(θ)) (dPi(θ)/dθ)² + (1/Qi(θ)) (dPi(θ)/dθ)² = Pi′(θ)² / (Pi(θ)Qi(θ))
 = D²ai² Pi(θ)Qi(θ) (= Ii(θ)), i = 1, 2, . . . , p, (8.50)

where

Pi′(θ) = dPi(θ)/dθ, i = 1, 2, . . . , p.
In test theory, functions (8.50) are called item information functions, because the
Fisher information is related to the precision of the estimate of latent trait θ . Under
the assumption of local independence
(8.43), from (8.50), the Fisher information of
response X = X1 , X2 , . . . , Xp is calculated as follows:
I(θ) = Σ_{i=1}^p Ii(θ) = Σ_{i=1}^p Pi′(θ)² / (Pi(θ)Qi(θ)) = Σ_{i=1}^p D²ai² Pi(θ)Qi(θ). (8.51)
The above function is referred to as the test information function and the char-
acteristic of a test can be discussed by the function. Based on model (8.48), we
have
P(X1, X2, . . . , Xp = x1, x2, . . . , xp | θ) = Π_{i=1}^p exp(xi Dai(θ − di)) / (1 + exp(Dai(θ − di)))
 = exp( D Σ_{i=1}^p xi ai(θ − di) ) / Π_{i=1}^p (1 + exp(Dai(θ − di))). (8.52)
Vsufficient = Σ_{i=1}^p ai Xi. (8.53)
In this sense, the score (8.53) is the best to measure the common latent trait θ .
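As an illustration of (8.48)–(8.53), the following Python sketch (the item parameters and the response pattern are arbitrary, not those of the book's Tests A–C) evaluates the two-parameter logistic item characteristic curves, the item and test information functions, and the weighted score Σ ai Xi.

```python
import numpy as np

D = 1.7  # scaling constant of the logistic approximation (8.48)

def icc(theta, a, d):
    """Item characteristic curves P_i(theta) of the 2PL model (8.48)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))[:, None]
    return 1.0 / (1.0 + np.exp(-D * a * (theta - d)))

def item_information(theta, a, d):
    """Item information functions I_i(theta) = D^2 a_i^2 P_i Q_i, Eq. (8.50)."""
    P = icc(theta, a, d)
    return (D * a) ** 2 * P * (1.0 - P)

def test_information(theta, a, d):
    """Test information function I(theta) = sum_i I_i(theta), Eq. (8.51)."""
    return item_information(theta, a, d).sum(axis=1)

a = np.array([0.8, 1.2, 2.0])      # arbitrary discriminations
d = np.array([-1.0, 0.0, 1.0])     # arbitrary difficulties
x = np.array([1, 1, 0])            # an arbitrary response pattern
print(test_information(np.linspace(-3, 3, 7), a, d))
print("sufficient score:", float(np.dot(a, x)))   # V_sufficient = sum_i a_i X_i, Eq. (8.53)
```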
For a large number of items p, statistic (score) V (8.44) is asymptotically distributed
according to the normal distribution N( Σ_{i=1}^p ci Pi(θ), Σ_{i=1}^p ci² Pi(θ)Qi(θ) ), where

Σ_{i=1}^p ci² Pi(θ)Qi(θ) = σ². (8.55)
Since

Σ_{i=1}^p ai ci Pi(θ)Qi(θ) = (a1, a2, . . . , ap) diag( P1(θ)Q1(θ), P2(θ)Q2(θ), . . . , Pp(θ)Qp(θ) ) (c1, c2, . . . , cp)^T,

the above quantity can be viewed as an inner product between vectors (a1, a2, . . . , ap)
and (c1, c2, . . . , cp) with respect to the diagonal matrix

diag( P1(θ)Q1(θ), P2(θ)Q2(θ), . . . , Pp(θ)Qp(θ) ).
ci = τ ai , i = 1, 2, . . . , p,
The above information is the same as that for Vsufficient (8.51). Comparing (8.54)
and (8.56), we have the following definition:
Definition 8.7 The efficiency of score V (8.44) for sufficient statistic Vsufficient is
defined by

ψ(θ) = ( Σ_{i=1}^p ai ci Pi(θ)Qi(θ) )² / ( Σ_{i=1}^p ai² Pi(θ)Qi(θ) · Σ_{i=1}^p ci² Pi(θ)Qi(θ) ).
fi(xi|θ) = exp(xi Dai(θ − di)) / (1 + exp(Dai(θ − di))), i = 1, 2, . . . , p.
and

ECD(Xi, θ) = KL(Xi, θ) / (KL(Xi, θ) + 1) = Dai Cov(Xi, θ) / (Dai Cov(Xi, θ) + 1), i = 1, 2, . . . , p.
KL(X, θ) = KL(Vsufficient, θ) = Cov(DVsufficient, θ) = Σ_{i=1}^p Dai Cov(Xi, θ) = Σ_{i=1}^p KL(Xi, θ),

and hence

ECD(X, θ) = KL(X, θ) / (KL(X, θ) + 1). (8.57)
The above ECD can be called the entropy coefficient of reliability (ECR) of the
test. Since θ is distributed according to N(0, 1), from model (8.48), we have
Since

lim_{ai→+∞} Cov(Xi, θ) = lim_{ai→+∞} ∫_{−∞}^{+∞} θ · exp(Dai(θ − di))/(1 + exp(Dai(θ − di))) · (1/√(2π)) exp(−θ²/2) dθ
 = ∫_{di}^{+∞} θ · (1/√(2π)) exp(−θ²/2) dθ = (1/√(2π)) exp(−di²/2), i = 1, 2, . . . , p,

it follows that

lim_{ai→+∞} KL(Xi, θ) = +∞ ⇔ lim_{ai→+∞} ECD(Xi, θ) = 1, i = 1, 2, . . . , p.
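The quantities KL(Xi, θ) = Dai Cov(Xi, θ) and the resulting ECDs can be obtained by numerical integration over the standard normal latent trait. The following Python sketch is illustrative; under the stated N(0, 1) assumption it should roughly reproduce values of the kind reported for the tests below, although no exact agreement is claimed.

```python
import numpy as np

D = 1.7

def kl_and_ecd(a, d, lo=-8.0, hi=8.0, n=4001):
    """KL(Xi, theta) = D * a_i * Cov(Xi, theta) and the corresponding ECDs,
    with theta ~ N(0, 1) and the 2PL model (8.48)."""
    a = np.asarray(a, dtype=float)
    d = np.asarray(d, dtype=float)
    grid = np.linspace(lo, hi, n)
    theta = grid[:, None]
    dx = grid[1] - grid[0]
    phi = np.exp(-theta ** 2 / 2.0) / np.sqrt(2.0 * np.pi)              # N(0, 1) density
    P = 1.0 / (1.0 + np.exp(-D * a[None, :] * (theta - d[None, :])))    # P_i(theta)
    cov = (theta * P * phi).sum(axis=0) * dx     # Cov(Xi, theta) = E[theta * P_i(theta)]
    kl_i = D * a * cov                           # KL(Xi, theta)
    ecd_i = kl_i / (kl_i + 1.0)                  # ECD(Xi, theta)
    ecr = kl_i.sum() / (kl_i.sum() + 1.0)        # ECD(X, theta), the ECR (8.57)
    return kl_i, ecd_i, ecr

# e.g. nine items with common discrimination 2 and difficulties -2, -1.5, ..., 2
kl_i, ecd_i, ecr = kl_and_ecd([2.0] * 9, [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
print(np.round(kl_i, 3), round(ecr, 3))
```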
Table 8.8 shows the discrimination and difficulty parameters of nine artificial items
for a simulation study (Test A). The nine item characteristic functions are illustrated
in Fig. 8.7. In this case, according to KL(Xi , θ ) and ECD(Xi , θ ) (Table 8.8), the
explanatory powers of latent trait θ for responses to the nine items are moderate.
According to Table 8.8, the KL information is maximized for d = 0. The entropy
coefficient of reliability of this test is calculated by (8.57). Since
KL(X, θ) = Σ_{i=1}^9 KL(Xi, θ) = 0.234 + 0.479 + · · · + 0.234 = 6.356,
Table 8.8 An item response model with nine items and the KL information [Test (A)]
Item 1 2 3 4 5 6 7 8 9
ai 2 2 2 2 2 2 2 2 2
di −2 −1.5 −1 −0.5 0 0.5 1.0 1.5 2
KL(Xi , θ ) 0.234 0.479 0.794 1.075 1.190 1.075 0.794 0.479 0.234
ECD(Xi , θ) 0.190 0.324 0.443 0.518 0.543 0.518 0.443 0.324 0.190
we have

ECD(X, θ) = 6.356/(6.356 + 1) = 0.864.
In this test, the test reliability can be considered high. Usually, the following test
score is used for measuring latent trait θ:

V = Σ_{i=1}^9 Xi. (8.58)
In Test A, the above score is a sufficient statistic of the item response model with
the nine items in Table 8.8 [Test (A)], and the information of the item response model
is the same as that of the above score. Then, the test characteristic function is given
by
T(θ) = (1/9) Σ_{i=1}^9 Pi(θ). (8.59)
As shown in Fig. 8.8, the relation between latent trait θ and T (θ ) is almost linear.
The test information function of Test (A) is given in Fig. 8.9. The precision for
estimating the latent trait θ is maximized at θ = 0.
Second, for Test (B) shown in Table 8.9, the ECDs are uniformly smaller than
those in Table 8.8, and the item characteristic functions are illustrated in Fig. 8.10.
In this test, ECD of Test (B) is
ECD(X, θ ) = 0.533
Table 8.9 An item response model with nine items and the KL information [Test (B)]
Item 1 2 3 4 5 6 7 8 9
ai 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
di −2 −1.5 −1 −0.5 0 0.5 1.0 1.5 2
KL(Xi , θ ) 0.095 0.116 0.135 0.148 0.152 0.148 0.135 0.116 0.095
ECD(Xi , θ) 0.087 0.104 0.119 0.129 0.132 0.129 0.119 0.104 0.087
and the ECD is less than that of Test (A). Comparing Figs. 8.9 and 8.11, the precision
of estimating latent trait in Test (A) is higher than that in Test (B), and it depends on
discrimination parameters ai .
Fig. 8.10 Item characteristic functions of Test (B)
Lastly, Test (C) with nine items in Table 8.10 is considered. The item characteristic
functions are illustrated in Fig. 8.12. According to KL information in Table 8.10,
items 3 and 7 have the largest predictive power on item responses. ECD in this test
is computed by
Table 8.10 An item response model with nine items and the KL information [Test (C)]
Item 1 2 3 4 5 6 7 8 9
ai 0.5 1 2 1 0.5 1 2 1 0.5
di −2 −1.5 −1 −0.5 0 0.5 1.0 1.5 2
KL(Xi , θ ) 0.095 0.260 0.794 0.442 0.152 0.442 0.794 0.260 0.095
ECD(Xi , θ) 0.087 0.207 0.443 0.306 0.132 0.306 0.443 0.207 0.087
Fig. 8.12 Item characteristic functions of Test (C)
ECD(X, θ ) = 0.769.
The test information function of Test (C) is illustrated in Fig. 8.13. The configuration
is similar to that of the KL information over the item difficulties di of Test (C) in Table 8.10
(Fig. 8.14). Since this test has nine items with discrimination parameters 0.5, 1, and
2, the sufficient statistic (score) is given by

Vsufficient = 0.5(X1 + X5 + X9) + (X2 + X4 + X6 + X8) + 2(X3 + X7). (8.60)

The above score is the best for estimating the latent trait among scores of the form (8.44). Test
score (8.58) is usually used; however, the efficiency of the test score for the
sufficient test score (8.60) is less than 0.9 over latent trait θ, as illustrated in Fig. 8.15.
Since the distribution of latent trait θ is assumed to be N(0, 1), about 95% of latent
trait θ values lie in the range (−1.96, 1.96), i.e., P(−1.96 < θ < 1.96) ≈ 0.95.
Fig. 8.13 Test information function of Test (C)
Fig. 8.14 KL information of the items plotted against the item difficulties di
Fig. 8.15 Efficiency of test score (8.58) relative to the sufficient score (8.60) over latent trait θ
Hence, in the range −1.96 < θ < 1.96, the precision for estimating θ according to
test score (8.58) is about 80% of that for the sufficient statistic (8.60), as shown in Fig. 8.15.
8.4 Discussion
In this chapter, latent structure models are considered in the framework of GLMs. In
factor analysis, the contributions of common factors have been measured through an
entropy-based path analysis. The contributions of common factors have been defined
as the effects of factors on the manifest variable vector. In latent trait analysis, for the
two-parameter logistic model for dichotomous manifest variables, the test reliability
has been discussed with ECD. It is critical to extend the discussion to those for the
graded response model [17], the partial credit model [13], the nominal response model
[3], and so on. Latent class analysis has not been treated in this chapter; however,
there may be a possibility of developing an entropy-based approach for comparing latent
classes. Further studies are needed to extend the present discussion.
References
1. Brown, M. N. (1974). Generalized least squares in the analysis of covariance structures. South
African Statistical Journal, 8, 1–24.
2. Bartholomew, D. J. (1987). Latent variable models and factor analysis. New York: Oxford
University Press.
3. Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored
in two or more nominal categories. Psychometrika, 37, 29–51.
4. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear
models. Computational Statistics & Data Analysis, 54, 1381–1389.
5. Eshima, N., Tabata, M., Borroni, C. G., & Kano, Y. (2015). An entropy-based approach to path
analysis of structural generalized linear models: A basic idea. Entropy, 17, 5117–5132.
6. Eshima, N., Borroni, C. G., & Tabata, M. (2016). Relative importance assessment of explanatory
variables in generalized linear models: An entropy-based approach. Statistics & Applications,
16, 107–122.
7. Eshima, N., Tabata, M., & Borroni, C. G. (2018). An entropy-based approach for measuring
factor contributions in factor analysis models. Entropy, 20, 634.
8. Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychome-
trika, 32, 443–482.
9. Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor
analysis. Psychometrika, 34, 183–202.
10. Jöreskog, K. G., & Goldberger, A. S. (1972). Factor analysis by generalized least squares.
Psychometrika, 37, 243–260.
11. Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. New York: Houghton-
Mifflin.
12. Lord, F. M. (1952). A theory of test scores (Psychometrika Monograph, No. 7). Psychometric
Corporation: Richmond.
13. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
14. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Illinois: The
University of Chicago Press.
15. Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika,
47, 69–76.
16. Robertson, D., & Symons, J. (2007). Maximum likelihood factor analysis with rank-deficient
sample covariance matrices. Journal of Multivariate Analysis, 98, 813–828.
17. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores
(Psychometric Monograph). Educational Testing Services: Princeton.
18. Shiba, S. (1991). Item response theory. Tokyo: Tokyo University.
19. Spearman, C. (1904). “General intelligence,” objectively determined and measured. American
Journal of Psychology, 15, 201–293.
20. Sundberg, R., & Feldmann, U. (2016). Exploratory factor analysis-parameter estimation and
scores prediction with high-dimensional data. Journal of Multivariate Analysis, 148, 49–59.
21. Trendafilov, N. T., & Unkel, S. (2011). Exploratory factor analysis of data matrices with more
variables than observations. Journal of Computational and Graphical Statistics, 20, 874–891.
22. Thurstone, L. L. (1935). The vectors of mind: Multiple factor analysis for the isolation of primary
traits. Chicago, IL: The University of Chicago Press.
23. Unkel, S., & Trendafilov, N. T. (2010). A majorization algorithm for simultaneous parameter
estimation in robust exploratory factor analysis. Computational Statistics & Data Analysis, 54,
3348–3358.
24. Young, A. G., & Pearce, S. (2013). A beginner’s guide to factor analysis: Focusing on
exploratory factor analysis. Quantitative Methods for Psychology, 9, 79–94.