
Journal of Multivariate Analysis 149 (2016) 144–159


Bias correction of the Akaike information criterion in factor analysis

Haruhiko Ogasawara
Department of Information and Management Science, Otaru University of Commerce, 3-5-21, Midori, Otaru 047-8501, Japan

Article history: Received 18 November 2015; Available online 25 April 2016
AMS subject classifications: 62E20; 62F12; 62H25
Keywords: Asymptotic bias; AIC; Covariance structure analysis; Exploratory factor analysis; Non-normality

Abstract: The higher-order asymptotic bias for the Akaike information criterion (AIC) in factor analysis or covariance structure analysis is obtained when the parameter estimators are given by the Wishart maximum likelihood. Since the formula of the exact higher-order bias is complicated, simple approximations which do not include unknown parameter values are obtained. Numerical examples with simulations show that the approximations are reasonably similar to their corresponding exact asymptotic values and simulated values. Simulations for model selection give consistently improved results by the approximate correction of the higher-order bias for the AIC over the usual AIC.

© 2016 Elsevier Inc. All rights reserved.
1. Introduction

Information criteria are used primarily for model selection. One of the typical information criteria is the Akaike information criterion (AIC, [1]). The AIC is given by the familiar formula of −2 times the sample log likelihood plus 2 times the number of parameters in a statistical model. The latter quantity is a correction term corresponding to the negative asymptotic bias of order O(1) of the former main term as an estimator of −2 times the expected log predictive likelihood, which is an unknown criterion for model selection. Since the bias is negative, the positive correction term is often interpreted as a penalty on a relatively complicated model in model selection.
The popularity of the AIC is due to the simplicity of its fixed correction term, which is independent of unknown population values of parameters. The Takeuchi information criterion (TIC, [44]) gives an alternative correction term corresponding to that of the AIC under possible model (distribution) misspecification. The TIC was also derived by Stone [41] in the context of cross-validation. For the TIC and its derivation, see [28, Proposition 2, Appendix A.2.1], [26, Section 3.4.3], [14, Section 2.5] and [12, Section 2.3]. Though the correction term in the TIC is robust under model misspecification, the term generally depends on unknown population parameters and tends to be relatively complicated. In practice, the correction term is given by its sample counterpart without changing the order of the remainder in the approximation to the exact bias correction.
Improvements of the Akaike and Takeuchi information criteria have typically been given by deriving the higher-order asymptotic biases and the asymptotic biases of the estimated lower-order asymptotic biases. Konishi and Kitagawa [25] gave these results using the von Mises calculus [46,47], while Ogasawara [38, Section 3] obtained the general results using log likelihood derivatives and gave simplified formulas in the case of the exponential family of distributions under canonical parametrization.

E-mail address: hogasa@res.otaru-uc.ac.jp.

http://dx.doi.org/10.1016/j.jmva.2016.04.003

A typical problem in model selection by the AIC is how to choose appropriate regressors among candidates in regression models. Sugiura [42] gave the exact correction term for the AIC in normal univariate linear regression under correct model specification. Fujikoshi and Satoh [16] derived the corresponding higher-order asymptotic bias under possible model misspecification in normal multivariate linear regression. Yanagihara, Sekiguchi and Fujikoshi [48] obtained the higher-order term in logistic regression. Kamo, Yanagihara and Satoh [22] derived the higher-order term in Poisson regression. Ogasawara [38, Section 6] added the corresponding results in gamma and negative binomial regression. It is known that in normal linear regression the AIC tends to choose models with a relatively large number of regressors, i.e., complicated models (see [18,16]). It has been shown that the bias-corrected AIC substantially increases the proportions of choosing correct models in normal linear regression [16].
The AIC has also been used in factor analysis [43] and, more generally, in covariance structure analysis and structural equation modeling. A typical problem in exploratory factor analysis (EFA) is to select an appropriate number of common factors. This problem partially motivated H. Akaike to develop the AIC [24, p. 124]. It is known, as in regression analysis, that in EFA the AIC tends to choose a relatively excessive number of common factors [2, Table 5], [19, p. 39]. It is to be noted that the factor analysis model can be seen as a multivariate linear regression model whose regressors are latent common factors. Note also that even in usual regression models the error term(s) are latent variables corresponding to unique factors in EFA.
This paper gives the usual and higher-order correction terms of the TIC and AIC in EFA using the Wishart likelihood under possible non-normality. Simple approximations to the higher-order correction term in the AIC that do not use unknown population factor loadings and unique variances are also given under normality. The approximations are shown to be similar to the corresponding simulated and exact asymptotic values. In the simulation of model selection, it is shown that the proportion of choosing the correct number of common factors is increased by the approximation to the higher-order correction term in the AIC under normality and different conditions of non-normality.

2. Bias correction of the AIC and TIC

Let

x_i ∼ N(µ, Σ) i.i.d. (i = 1, . . . , N) and U ≡ ∑_{i=1}^{N} (x_i − µ)(x_i − µ)′, (2.1)

where x_i is the p × 1 vector of observable variables, and µ and Σ are its expectation and covariance matrix, respectively. Then, it is known that U is Wishart distributed with scale parameter Σ and N degrees of freedom, which is denoted by

U ∼ W_p(Σ, N), (2.2)

whose probability density is

f(U|Σ) = |U|^{(N−p−1)/2} exp{−tr(Σ⁻¹U)/2} / {2^{pN/2} |Σ|^{N/2} Γ_p(N/2)} (2.3)

[3, Section 7.2], where Γ_p(·) is the p-variate gamma function given by

Γ_p(t) = π^{p(p−1)/4} ∏_{i=1}^{p} Γ{t − (1/2)(i − 1)} (2.4)

and Γ{·} is the usual or univariate gamma function.
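As a numerical aside (not in the original paper), the p-variate gamma function of (2.4) can be checked directly against SciPy's implementation; a minimal sketch assuming numpy and scipy are available:

```python
import numpy as np
from scipy.special import gammaln, multigammaln

def log_multigamma(t, p):
    # ln Gamma_p(t) computed directly from (2.4):
    # ln Gamma_p(t) = {p(p-1)/4} ln(pi) + sum_{i=1}^p ln Gamma{t - (i-1)/2}
    return p * (p - 1) / 4 * np.log(np.pi) + sum(
        gammaln(t - (i - 1) / 2) for i in range(1, p + 1))

# agreement with scipy's multigammaln
t, p = 7.5, 4
print(log_multigamma(t, p), multigammaln(t, p))  # identical values
```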


Let

S = n⁻¹ ∑_{i=1}^{N} (x_i − x̄)(x_i − x̄)′, (2.5)

with n = N − 1, x̄ = N⁻¹ ∑_{i=1}^{N} x_i and E_f(S) = Σ, where E_f(·) is an expectation under correct model (distribution) specification corresponding to (2.3) or, equivalently, to (2.1). Using the property of normally distributed variables with (2.3) and (2.5), it can be shown that

S ∼ W_p(n⁻¹Σ, n), (2.6)

and its density is

f(S|Σ) = |S|^{(n−p−1)/2} exp{−tr(nΣ⁻¹S)/2} / {(2/n)^{pn/2} |Σ|^{n/2} Γ_p(n/2)}. (2.7)
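A small Monte Carlo sketch (an illustration added here, assuming numpy) of the unbiasedness E_f(S) = Σ implied by (2.5) and (2.6):

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 4, 50
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)      # an arbitrary positive definite Sigma

# Monte Carlo check that E_f(S) = Sigma for the unbiased S in (2.5)
reps = 20000
S_mean = np.zeros((p, p))
for _ in range(reps):
    x = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
    S_mean += np.cov(x, rowvar=False)  # unbiased S with divisor n = N - 1
S_mean /= reps
print(np.max(np.abs(S_mean - Sigma)))  # small simulation error
```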

Define σ as a {(p² + p)/2} × 1 vector consisting of the non-duplicated elements of Σ. Let l = l(σ|S) be the log likelihood of σ when S is given and l̄ ≡ n⁻¹l the mean log likelihood averaged over independent observations or degrees of freedom. Then,

−2n⁻¹l = −2l̄ = ln|Σ| + tr(Σ⁻¹S) + C(S), (2.8)

where

C(S) = −{(n − p − 1)/n} ln|S| + p ln(2/n) + (2/n) ln Γ_p(n/2) (2.9)

is a random variable not involving Σ.
Define θ as a q × 1 vector of structural parameters in Σ = Σ(θ). Then, the Wishart maximum likelihood estimator (WMLE) θ̂ = θ̂_WML = θ(S) of θ is given by minimizing −2l̄ or ln|Σ| + tr(Σ⁻¹S) in (2.8). Let

l̂ = l(θ̂|S), ˆl̄ = l̄(θ̂|S) = n⁻¹l̂ and Σ̂ = Σ(θ̂) = Σ{θ(S)}; (2.10)

then, we have

−2ˆl̄ = ln|Σ̂| + tr(Σ̂⁻¹S) + C(S). (2.11)

The AIC in a general covariance structure is defined by

AIC = −2l̂ + 2q = n{ln|Σ̂| + tr(Σ̂⁻¹S) + C(S)} + 2q. (2.12)

Since the AIC is of order O_p(n), for tractability and without loss of generality we use

n⁻¹AIC = ln|Σ̂| + tr(Σ̂⁻¹S) + C(S) + n⁻¹2q. (2.13)
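A minimal sketch (added for illustration, not the author's code) of computing n⁻¹AIC in (2.13) for a fitted covariance matrix; the saturated model Σ̂ = S with q = p(p + 1)/2 is used as a convenient example:

```python
import numpy as np
from scipy.special import multigammaln

def n_inv_aic(S, Sigma_hat, q, n):
    """n^{-1}AIC of (2.13); Sigma_hat is the fitted covariance matrix."""
    p = S.shape[0]
    # C(S) from (2.9)
    C = (-(n - p - 1) / n * np.linalg.slogdet(S)[1]
         + p * np.log(2 / n) + 2 / n * multigammaln(n / 2, p))
    F = np.linalg.slogdet(Sigma_hat)[1] + np.trace(np.linalg.solve(Sigma_hat, S))
    return F + C + 2 * q / n

# saturated model: Sigma_hat = S and q = p(p+1)/2
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 3))
S = np.cov(x, rowvar=False)
n, p = 99, 3
print(n_inv_aic(S, S, p * (p + 1) // 2, n))
```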
Let T be an independent copy of S as another sample in the future. Define g(S|ς₀) = g(T|ς₀) as true densities of S and T under possible model misspecification, respectively, where ς₀ is the q* × 1 vector of parameters in the true distribution with

E_g(S) = E_g(T) ≡ Σ_T, (2.14)

which is a true covariance matrix with the assumption of its existence. Denote an expectation using g(T|ς₀) by E_g^{(T)}(·) (the superscript (T) will be dropped when obvious, as in (2.14)) and define ˆl̄* as the expected predictive log likelihood averaged over degrees of freedom using θ̂ = θ(S):

−2ˆl̄* = −2E_g^{(T)}{l̄(θ̂|T)} = E_g^{(T)}{ln|Σ̂| + tr(Σ̂⁻¹T) + C(T)}
= ln|Σ̂| + tr(Σ̂⁻¹Σ_T) + E_g^{(T)}{C(T)}, (2.15)

where the existence of E_g^{(T)}{C(T)} is assumed, and it is known that under normality

E_f^{(T)}{C(T)} = E_f{C(T)}
= −{(n − p − 1)/n} E_f{ln|T|} + p ln(2/n) + (2/n) ln Γ_p(n/2)
= −{(n − p − 1)/n} {ln|n⁻¹Σ| + p ln 2 + ψ_p(n/2)} + p ln(2/n) + (2/n) ln Γ_p(n/2) (2.16)

(see e.g., [6, Equation (B.81)]), where E_f{ln|T|} is given by the cumulant generating function in the exponential family under canonical parametrization and ψ_p(·) is the p-variate psi function, i.e., the first derivative of ln Γ_p(·).
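The identity for E_f{ln|T|} used in (2.16) can be checked by simulation; a sketch assuming scipy's psi and wishart, with the p-variate psi function expanded as ψ_p(t) = ∑_{i=1}^{p} ψ{t − (i − 1)/2}:

```python
import numpy as np
from scipy.special import psi
from scipy.stats import wishart

rng = np.random.default_rng(2)
p, n = 3, 30
A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)

# analytic value: E_f{ln|T|} = ln|n^{-1} Sigma| + p ln 2 + psi_p(n/2)
psi_p = sum(psi(n / 2 - (i - 1) / 2) for i in range(1, p + 1))
analytic = np.linalg.slogdet(Sigma / n)[1] + p * np.log(2) + psi_p

# Monte Carlo: T ~ W_p(n^{-1} Sigma, n)
draws = wishart.rvs(df=n, scale=Sigma / n, size=20000, random_state=rng)
mc = np.mean([np.linalg.slogdet(T)[1] for T in draws])
print(analytic, mc)  # close
```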
The bias of −2ˆl̄ = −2l̄(S) = ln|Σ̂| + tr(Σ̂⁻¹S) + C(S) under possible model misspecification is defined by

−2E_g^{(S)}(ˆl̄ − ˆl̄*) = −2E_g(ˆl̄ − ˆl̄*) = E_g{tr(Σ̂⁻¹S) − tr(Σ̂⁻¹Σ_T)}
= E_g[tr{Σ̂⁻¹(S − Σ_T)}] ≡ n⁻¹b1 + n⁻²b2 + O(n⁻³), (2.17)

where E_g^{(S)}(ln|S|) = E_g^{(T)}(ln|T|) is used and n⁻ⁱb_i (i = 1, 2) are the asymptotic biases of orders O(n⁻ⁱ) (i = 1, 2), respectively. Note that b1 = −2q is the negative of the correction term in the AIC under correct model specification.
Let θ₀ be the population value of θ, where θ₀ under model misspecification is defined as the solution of θ in the estimating equation for θ̂ when S is replaced by Σ_T, i.e., θ₀ ≡ arg max_{θ∈Θ} l{Σ(θ)|Σ_T} with Θ being the parameter space of θ. Let Σ₀ = Σ(θ₀). Define h = h(s) = tr{Σ̂⁻¹(S − Σ_T)} (see (2.17)), where s = v(S) and v(·) is the vectorizing operator taking the non-duplicated elements of a symmetric matrix, with σ_T = v(Σ_T). Note that h is a function of s, where Σ̂ = Σ(θ̂) = Σ{θ(s)} and S are functions of s while Σ_T in h is seen as a constant. Then,

Theorem 1. Under regularity conditions for the existence of associated expectations and possible non-normality, suppose that h is expanded as

h = ∑_{i=1}^{4} (1/i!) {∂ⁱh/(∂s′)⟨i⟩}|_{s=σ_T} (s − σ_T)⟨i⟩ + O_p(n^{−5/2})
≡ ∑_{i=1}^{4} (1/i!) {∂ⁱh/(∂σ_T′)⟨i⟩} (s − σ_T)⟨i⟩ + O_p(n^{−5/2}), (2.18)

where s⟨i⟩ = s ⊗ · · · ⊗ s (i times) is the i-fold Kronecker product of s. Then,

b1 = (1/2) {∂²h/(∂σ_T′)⟨2⟩} nE_g{(s − σ_T)⟨2⟩}_{→O(n⁻¹)} (2.19)

and

b2 = (1/2) {∂²h/(∂σ_T′)⟨2⟩} n²[E_g{(s − σ_T)⟨2⟩}_{→O(n⁻²)} − E_g{(s − σ_T)⟨2⟩}_{→O(n⁻¹)}]
+ (1/6) {∂³h/(∂σ_T′)⟨3⟩} n²[E_g{(s − σ_T)⟨3⟩}_{→O(n⁻²)}] + (1/24) {∂⁴h/(∂σ_T′)⟨4⟩} n²[E_g{(s − σ_T)⟨4⟩}_{→O(n⁻²)}], (2.20)

where E_g{·}_{→O(n⁻ⁱ)} indicates that the expectation is taken up to order O(n⁻ⁱ).


The proof and actual expressions of b1 and b2 using Σ₀, with additional explanations, are given in Appendix A.1. Under correct model specification, as addressed earlier, b1 becomes −2q. On the other hand, b2 generally depends on θ₀ even under correct model specification, where θ₀ can be replaced by θ̂ without changing the order of the remainder for the AIC. In n⁻¹TIC, b1 is replaced by its estimator b̂1 using θ̂. Then, the higher-order correction term in n⁻¹TIC should be −n⁻²(b̂2 + b̂_∆1), where n⁻¹b̂_∆1 is the estimator of the asymptotic bias n⁻¹b_∆1 of b̂1, which is given by Theorem A.1 in Appendix A.2.

3. Results in exploratory factor analysis

In Section 2, the results are shown for a general covariance structure, where differentiability up to a required number of times is implicitly assumed together with the existence of the inverses of associated matrices. In this section, EFA for unstandardized variables is dealt with (EFA for standardized variables will be addressed in the discussion section). Define Γ = Γ(η) = Σ⁻¹ = {Σ(θ)}⁻¹, where η is the q × 1 vector of parameters after reparametrization. That is, η is the parameter vector in the inverse structure, i.e., the structure of the inverse of Σ(θ). Note that when Γ is unstructured or linear with respect to η, where η = v(Γ) in the former case, f(S|Σ) = f(S|Γ) = f(S|Γ(η)) in (2.7) becomes

f(S|Γ) = |S|^{(n−p−1)/2} |Γ|^{n/2} exp{−tr(nΓS)/2} / {(2/n)^{pn/2} Γ_p(n/2)}, (3.1)

which shows that η = v(Γ), or η when Γ(η) is linear with respect to η, is the canonical parameter vector, which yields various simplifications. For example, −2ˆl̄ = ln|Σ̂| + tr(Σ̂⁻¹S) + C(S) (see the result before (2.17)) becomes −ln|Γ̂| + tr(Γ̂S) + C(S), where Γ̂ = Γ(η̂) and no inverse appears.
In factor analysis, the inverse structure Γ(η) can be used; three examples including the usual EFA model are shown in Appendix A.3. In the case of EFA with k common factors, Σ = ΛΛ′ + Ψ, where Ψ = diag(ψ1, . . . , ψp) consists of unique factor variances (unique variances) and Λ is the p × k loading matrix whose (k² − k)/2 elements are set to zero to remove rotational indeterminacy without loss of generality. The corresponding inverse structure is Γ = Σ⁻¹ = Ψ* − Λ*Λ*′ after reparametrization with Ψ* = Ψ⁻¹ and Λ*Λ*′ = Ψ⁻¹ − Σ⁻¹, where the reparametrized p × k matrix Λ* is assumed to have restrictions similar to those on Λ.
Using Γ, h in (2.18) becomes h = tr{Σ̂⁻¹(S − Σ_T)} = tr{Γ̂(S − Σ_T)}, where no inverse appears. The partial derivatives of h with respect to s evaluated at s = σ_T shown in Appendix A.1 are simplified, and are given in Appendix A.4. The partial derivatives of the normal-theory (NT) discrepancy function evaluated at the WMLEs, i.e., F̂_NT = ln|Σ̂| + tr(Σ̂⁻¹S) = −ln|Γ̂| + tr(Γ̂S), with respect to η̂ and s are also simplified (compare with [35, Subsection A.4] for the corresponding results when θ̂ is used), and are shown in Appendix A.4.
It is known that in the case of EFA, tr(Σ̂⁻¹S) = tr(Γ̂S) = p, which generally holds when θ* exists such that Σ(θ*) = αΣ(θ), where θ is an arbitrary parameter vector and α > 0 is arbitrary [7, p. 521], [10, Section 1.2]. This property is called positive homogeneity or invariance under a constant scaling factor, which holds in many cases; as Browne and Arminger [11, p. 192] write, ''Nearly all covariance structures that are useful in practice have this property''.
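A sketch (added for illustration) of this positive-homogeneity property: a one-factor EFA model is fitted by minimizing F_NT numerically, and tr(Σ̂⁻¹S) = p is checked. The optimizer and log-parametrization below are arbitrary choices, not the paper's algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def fit_efa_1factor(S):
    """WMLE of a one-factor EFA model by minimizing F_NT = ln|Sigma| + tr(Sigma^{-1} S)."""
    p = S.shape[0]
    def f_nt(par):
        lam, psi = par[:p], np.exp(par[p:])        # log-parametrized unique variances
        Sigma = np.outer(lam, lam) + np.diag(psi)
        sign, logdet = np.linalg.slogdet(Sigma)
        if sign <= 0:
            return np.inf
        return logdet + np.trace(np.linalg.solve(Sigma, S))
    start = np.concatenate([np.full(p, 0.5), np.log(np.diag(S) / 2)])
    res = minimize(f_nt, start, method="BFGS")
    lam, psi = res.x[:p], np.exp(res.x[p:])
    return np.outer(lam, lam) + np.diag(psi)

rng = np.random.default_rng(3)
lam0 = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
Sigma0 = np.outer(lam0, lam0) + np.diag(1 - lam0 ** 2)
x = rng.multivariate_normal(np.zeros(6), Sigma0, size=200)
S = np.cov(x, rowvar=False)
Sigma_hat = fit_efa_1factor(S)
print(np.trace(np.linalg.solve(Sigma_hat, S)))  # = p = 6 (positive homogeneity)
```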

4. Approximations to the higher-order asymptotic bias for the AIC in covariance structures

In contrast to the simple expression of the correction term n−1 2q for bias in n−1 AIC, the corresponding higher-order term
−n−2 b̂2 tends to require excessive computation even when the inverse structure is used. In this section, approximations
without using the associated parameter values are considered. We have two extreme models for covariance structures. One
of them is a saturated model with 6̂ = S and the other is the model of uncorrelated observable variables with 6̂ = Diag(S),
where Diag(·) is the diagonal matrix with the same diagonal elements of the matrix in parentheses. The latter model is used
as a baseline model to evaluate the goodness-of-fit of a model 6 = 6(θ) and will be used later in simulations. Models in
practice can be seen to exist in between. The baseline model can be relaxed to that with uncorrelated subsets of variables,
which may be more similar to models in practice. Note that the relaxed model gives a factor loading matrix with a perfect
simple structure or the pattern of independent clusters in an orthogonal solution when each subset of observable variables
has a one-factor model.

Lemma 1. Let m uncorrelated subsets of observable variables consist of p1, . . . , pm variables, respectively, with ∑_{i=1}^{m} p_i = p, and let S_i be the unbiased sample covariance matrix for the ith subset of variables (i = 1, . . . , m). Then, under normality, b2 for the AIC in (2.17) is

b2 = −∑_{i=1}^{m} p_i(p_i + 1)². (4.1)

Proof.

E_f[tr{Σ̂⁻¹(S − Σ₀)}] = E_f[tr{bdiag(S₁⁻¹, . . . , S_m⁻¹)(S − Σ₀)}]
= p − ∑_{i=1}^{m} np_i/(n − p_i − 1) = p − ∑_{i=1}^{m} (n − p_i − 1 + p_i + 1)p_i/(n − p_i − 1)
= −∑_{i=1}^{m} p_i(p_i + 1)/(n − p_i − 1) = −n⁻¹ ∑_{i=1}^{m} p_i(p_i + 1) − n⁻² ∑_{i=1}^{m} p_i(p_i + 1)² + O(n⁻³)
= −n⁻¹2q − n⁻² ∑_{i=1}^{m} p_i(p_i + 1)² + O(n⁻³) = n⁻¹b1 + n⁻²b2 + O(n⁻³), (4.2)

where bdiag(·) indicates the block-diagonal matrix with diagonal blocks given in parentheses, and E_f(S_i⁻¹) = {n/(n − p_i − 1)}Σ₀ᵢ⁻¹ with E_f(S_i) = Σ₀ᵢ from the property of the inverse Wishart distribution [3, Lemma 7.7.1] is used. □
Note that −∑_{i=1}^{m} p_i(p_i + 1)/(n − p_i − 1) in (4.2) is the exact bias of −2ˆl̄ and the remainder of order O(n⁻³) in the approximation is −∑_{a=3}^{+∞} n⁻ᵃ ∑_{i=1}^{m} p_i(p_i + 1)ᵃ. We find that b1 < 0, b2 < 0 and the remainder is negative. Lemma 1 shows that the amount of bias correction by the AIC, i.e., 2q, is not sufficient to remove the whole bias, which suggests the tendency of the AIC to choose relatively complicated models.
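The exact bias in (4.2) is easy to reproduce by simulation; a minimal sketch (illustrative only) assuming normal data and the block-diagonal fitted model Σ̂ = bdiag(S₁, . . . , S_m):

```python
import numpy as np

rng = np.random.default_rng(4)
p_blocks = [2, 3]                     # p_1 = 2, p_2 = 3, so p = 5
p, N = sum(p_blocks), 50
n = N - 1
Sigma0 = np.eye(p)                    # any block-diagonal Sigma_0 works

exact = -sum(pi * (pi + 1) / (n - pi - 1) for pi in p_blocks)

reps, acc = 20000, 0.0
for _ in range(reps):
    x = rng.multivariate_normal(np.zeros(p), Sigma0, size=N)
    S = np.cov(x, rowvar=False)
    # Sigma_hat = bdiag(S_1, ..., S_m): saturated within blocks, zero elsewhere
    Sigma_hat = np.zeros((p, p))
    i0 = 0
    for pi in p_blocks:
        Sigma_hat[i0:i0 + pi, i0:i0 + pi] = S[i0:i0 + pi, i0:i0 + pi]
        i0 += pi
    acc += np.trace(np.linalg.solve(Sigma_hat, S - Sigma0))
print(acc / reps, exact)   # both close to -sum p_i(p_i+1)/(n-p_i-1)
```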
In normal univariate linear regression with p* fixed covariates, including an intercept when it is used, the exact bias of −2ˆl̄ derived by Sugiura [42] is the left-hand side of

−2(p* + 1)/(N − p* − 2) < −N⁻¹2(p* + 1), (4.3)

where 2(p* + 1) = 2q = −b1 in the AIC, and 1 = q − p* is due to the unknown error variance σ₀². The similarity of (4.3) and (4.2) is expected since Sugiura [42] also used E_f{(Nσ̂²_ML/σ₀²)⁻¹} with σ̂²_ML being the MLE of σ₀², which is given by the inverse chi-square distribution, a special case of the inverse Wishart. Note that in (4.3) the degrees of freedom N − p* change with p* while those in Lemma 1, i.e., n = N − 1, are unchanged irrespective of p1, . . . , pm.
From (4.3), in normal regression we have n⁻¹b1 = −n⁻¹2(p* + 1), n⁻²b2 = −n⁻²2(p* + 1)(p* + 2) and the remainder is −∑_{a=3}^{+∞} n⁻ᵃ 2(p* + 1)(p* + 2)^{a−1}. The known tendency of the AIC to choose models with a relatively large number of regressors is explained by this insufficient amount of the penalty or correction.

Theorem 2. When p1 = · · · = pm ≡ p̄ in Lemma 1, b2 = −4q²/p.

Proof. From p1 = · · · = pm = p̄, we have 2q = mp̄(p̄ + 1) and consequently

b2 = −mp̄(p̄ + 1)² = −(p̄ + 1)2q = −{2q/(mp̄)}2q = −4q²/p. □

Table 1
Asymptotic and simulated biases of −2ˆl̄ in artificial one-factor models (2q = 24; the number of replications = 10⁵).

            n⁻¹TIC    n⁻¹CTIC   b1      b̃1                b2     b̃2                (Deleted cases, Heywood cases)
                                        N = 200   400            N = 200   400     (N = 200)   (N = 400)
Normal
  Case A    −0.2643   −0.2613   −24     −24.7    −24.3    −117   −131     −111     (0, 0)      (0, 0)
  Case B    −0.2416   −0.2381   −24     −24.7    −24.3    −138   −149     −126     (0, 10)     (0, 0)
Uniform
  Case A    −0.2953   −0.2929   −17.8   −18.4    −18.3    −96    −125     −186     (0, 0)      (0, 0)
  Case B    −0.2732   −0.2702   −17.7   −18.5    −18.2    −118   −146     −199     (0, 14)     (0, 0)
χ²(3df)
  Case A    −0.1607   −0.1589   −44.6   −44.7    −43.9    −74    −24      287      (0, 0)      (0, 0)
  Case B    −0.1364   −0.1339   −44.9   −45.3    −44.4    −98    −79      235      (0, 22)     (0, 0)
χ²(1df)
  Case A    0.0464    0.0410    −85.8   −85.8    −85.2    211    4        258      (0, 0)      (0, 0)
  Case B    0.0741    0.0697    −86.8   −87.0    −86.2    175    −42      254      (1, 68)     (0, 0)

Note. N = n + 1 = the sample size, q = the number of parameters, b̃1 = n times the simulated bias, b̃2 = n(b̃1 − b1). Case A: λ0 = 0.6 × 1(6) and Ψ0 = diag(0.64 × 1(6)). Case B: λ0 = (0.2, 0.3, 0.4, 0.5, 0.6, 0.7)′ and Ψ0 = diag(0.96, 0.91, 0.84, 0.75, 0.64, 0.51). Heywood cases are included in the data set. n⁻¹TIC and n⁻¹CTIC are given for N = 200. n⁻¹CTIC = n⁻¹TIC − n⁻²b2 = −2l̄0|_{S=Σ0} − n⁻¹b1 − n⁻²b2 under arbitrary distributions and n⁻¹CAIC = n⁻¹AIC − n⁻²b2 = −2l̄0|_{S=Σ0} + n⁻¹2q − n⁻²b2 under normality in this table. n⁻¹AIC = n⁻¹TIC and n⁻¹CAIC = n⁻¹CTIC under normality.

Note that the results of Lemma 1 and Theorem 2 do not depend on Σ₀, which is similar to the situation in regression (see (4.3)). The block-diagonal model considered above has saturated block-diagonal parts and least-fitted remaining parts. This may correspond to models in practice with medium goodness-of-fit in all parts of S.
When q is seen as the number of parameters in a covariance structure, the b2 in the corresponding block-diagonal model is expected to work as an approximation to the exact b2 in the covariance structure. In EFA, q = p(k + 1) − (k² − k)/2 and the approximation b2 = −4q²/p may be used. In the next section, comparisons with the exact and simulated values will be shown. When p ≫ k, b2/b1 = 2q/p is approximately 2(k + 1), and b2 ≈ 2(k + 1)b1 is another approximation, which is smaller (more negative) than the first approximation by {(k² − k)/p}(−b1).
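A small helper (illustrative only) computing the two approximations for EFA:

```python
def b2_approximations(p, k):
    """Two approximations to b2 for EFA with p variables and k common factors."""
    q = p * (k + 1) - (k * k - k) // 2   # number of EFA parameters
    b1 = -2 * q
    return -4 * q ** 2 / p, 2 * (k + 1) * b1

# e.g., the 8 physical variables model (p = 8, k = 2): q = 23
print(b2_approximations(8, 2))  # (-264.5, -276), both near the exact b2 = -271 in Table 2
```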

5. Numerical illustration of asymptotic and simulated biases

In this section, numerical values of asymptotic and simulated biases are illustrated based on artificial and real data. Tables 1 and 2 show the results given by artificial one-factor models and by two-factor models fitted to three real data sets used in the literature: 6 school subjects (6SS, [27, p. 66]), 8 physical variables (8PV, [17, p. 22]) and 9 educational tests (9ET, [15, p. 90]). The population parameters in Case A of Table 1 are Λ0 = λ0 = 0.6 × 1(6) and Ψ0 = diag(0.64 × 1(6)), where 1(6) is the 6 × 1 vector of 1's. That is, Σ0 has compound symmetry in Case A. The corresponding values in Case B of Table 1 are λ0 = (0.2, 0.3, 0.4, 0.5, 0.6, 0.7)′ and Ψ0 = diag(0.96, 0.91, 0.84, 0.75, 0.64, 0.51). In Table 2, the original data are seen as sample covariance matrices. Then, the fitted matrices with two common factors are regarded as population covariance matrices.
The population values of b1 and b2 are given under normality and three conditions of non-normality. The non-normal data are generated by independently distributed common and unique factors, with Λ0 and Λ0* each having an echelon form for model identification in the case of the two-factor model. The uniform and chi-square distributions with 3 and 1 degrees of freedom, followed by standardization to mean zero and unit variance, are used. The skewnesses of these distributions are 0, 2√(2/3) (=˙ 1.63) and 2√2 (=˙ 2.83), with the corresponding excess kurtoses being −1.2, 4 and 12, respectively.
Simulations are performed under correct model specification and normality/non-normality with the number of replications being 10⁵ to obtain b̃1 and b̃2, where b̃1 is n times the simulated bias, i.e., n times the mean of tr{Σ̂⁻¹(S − Σ0)} = tr{Γ̂(S − Σ0)} over the 10⁵ replications, b̃2 = n(b̃1 − b1), n = N − 1 and N is the sample size. In Table 1, N = 200 and 400 are used, and 2q = 24 in the table. In Table 2, N = 220, 305 and 211 as in the literature and 2q = 34, 46 and 52 for 6SS, 8PV and 9ET, respectively.
In Table 1, the bias-corrected TIC, denoted by CTIC, and similarly the CAIC are shown, which are given by

n⁻¹CTIC = n⁻¹TIC − n⁻²b2 = −2l̄0|_{S=Σ0} − n⁻¹b1 − n⁻²b2 (5.1)

under arbitrary distributions and

n⁻¹CAIC = n⁻¹AIC − n⁻²b2 = −2l̄0|_{S=Σ0} + n⁻¹2q − n⁻²b2 (5.2)

under normality. Note that in Table 1 these are evaluated using population values of parameters. Under normality, n⁻¹AIC = n⁻¹TIC and n⁻¹CAIC = n⁻¹CTIC. In Table 2, (5.1) and (5.2) are also used with −2l̄0|_{S=Σ0} replaced by −2l̄0|_{S=S}. The number of non-convergent cases excluded, and the number of Heywood cases (included, with negative estimated unique variance(s)), until 10⁵ regular sets of estimates were obtained, are shown in the tables.

Table 2
Asymptotic and simulated biases of −2ˆl̄ in two-factor models based on real data (N = n + 1 = 220, 305 and 211 for 6SS, 8PV and 9ET, respectively; the number of replications = 10⁵).

          n⁻¹TIC    n⁻¹CTIC   2q   b1       b̃1       b2     b̃2     (Deleted cases, Heywood cases)
Normal
  6SS     −0.2004   −0.1945   34   −34      −35.7    −287   −369   (91, 905)
  8PV     −0.2410   −0.2381   46   −46      −46.8    −271   −254   (0, 212)
  9ET     −0.5613   −0.5539   52   −52      −53.8    −326   −368   (0, 3)
Uniform
  6SS     −0.2254   −0.2199   34   −28.5    −30.0    −262   −329   (58, 826)
  8PV     −0.2676   −0.2651   46   −37.9    −38.8    −230   −277   (0, 181)
  9ET     −0.6043   −0.5979   52   −43.0    −44.4    −281   −305   (0, 2)
χ²(3df)
  6SS     −0.1173   −0.1116   34   −52.2    −53.7    −273   −317   (124, 1072)
  8PV     −0.1526   −0.1498   46   −72.9    −73.0    −261   −45    (0, 255)
  9ET     −0.4180   −0.4109   52   −82.1    −83.1    −315   −211   (0, 5)
χ²(1df)
  6SS     0.0491    0.0505    34   −88.6    −90.2    −65    −343   (337, 1547)
  8PV     0.0242    0.0240    46   −126.6   −126.6   22     17     (0, 353)
  9ET     −0.1315   −0.1313   52   −142.3   −143.7   −7     −294   (0, 30)

Note. N = n + 1 = the sample size, q = the number of parameters, b̃1 = n times the simulated bias, b̃2 = n(b̃1 − b1), 6SS = 6 school subjects (Lawley and Maxwell [27, p. 66]), 8PV = 8 physical variables (Harman [17, p. 22]), 9ET = 9 educational tests (Emmett [15, p. 90]). Heywood cases are included in the data set. n⁻¹CTIC = n⁻¹TIC − n⁻²b2 = −2l̄0|_{S=S} − n⁻¹b1 − n⁻²b2 under arbitrary distributions and n⁻¹CAIC = n⁻¹AIC − n⁻²b2 = −2l̄0|_{S=S} + n⁻¹2q − n⁻²b2 under normality in this table. n⁻¹AIC = n⁻¹TIC and n⁻¹CAIC = n⁻¹CTIC under normality.

The tables show that the absolute values of b1 and b̃1 become large when the excess kurtosis becomes large. Note, however, that the absolute values under the uniform distribution are smaller than those under normality. This is expected since the excess kurtosis of this distribution is negative. The values of b2 and b̃2 are mostly negative, while some of them are positive under the chi-square distribution with 1 degree of freedom. The simulated values b̃1 and b̃2 are reasonably similar to the corresponding asymptotic values b1 and b2, respectively, especially under normality or relatively mild non-normality in the case of b̃2.
In Section 4, approximations to b2 were given. For Tables 1 and 2 under normality, they compare as follows:

            b2/b1            2(k + 1)   2q/p (= (4q²/p)/(2q))
Case A      117/24 = 4.9     4          4
Case B      138/24 = 5.8     4          4
6SS         287/34 = 8.4     6          34/6 = 5.67
8PV         271/46 = 5.9     6          46/8 = 5.75
9ET         326/52 = 6.3     6          52/9 = 5.78

We find that both approximation formulas, 2(k + 1) and 2q/p, are reasonable for these data.
Note that the block-diagonal model used in the approximations gives zero off-block-diagonal elements of the fitted covariance matrix. So, the one-factor model in Cases A and B is never given by the block-diagonal model. However, as shown above, the approximations to b2 are reasonable even in the one-factor model. This can partially be explained as follows. In the block-diagonal model, the block-diagonal elements are perfectly fitted while the remaining elements are poorly fitted by zero values. On the other hand, in the one-factor model typically all the elements are moderately fitted. So, the overall model fits of these two models can be similar.

6. Simulation for model selection

Since information criteria are used primarily for model selection, simulations of selecting the number of common factors are carried out in this section. The population covariance matrices are given by one- and two-factor models. In the one-factor model with 7 observable variables, λ0 = (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)′ and Ψ0 = diag(0.96, 0.91, 0.84, 0.75, 0.64, 0.51, 0.36). The population two-factor models are the same as those for 8PV and 9ET in Table 2. Normal and non-normal data are generated as in Tables 1 and 2. Candidate models are those with k − 1, k and k + 1 common factors, where k = 1, 2 are the true numbers of common factors. Note that when k = 1, k − 1 = 0 indicates the model of only unique factors, or equivalently that of uncorrelated observable variables (the baseline model). Note also that when the number of common factors is 3, the minimum p satisfying Ledermann's inequality q < p(p + 1)/2 is 7.
We had data distortion when the model with k + 1 common factors was fitted to generated data, due to frequent non-convergent cases. The number of replications was 1,000, while many non-convergent cases had to be discarded; these occurred mostly when the model with k + 1 common factors was fitted, and their numbers are shown in Tables 3 and 4. No Heywood cases were found in the regular 1,000 replications. Although the data are thus distorted, this is favorable to the (k + 1)-factor model, since sample covariance matrices giving convergence should fit the (k + 1)-factor model relatively better than those discarded. So, when the true model is chosen by some criteria in spite of such conditions, this may show their excellence.
In factor analysis or covariance structure analysis various fit indexes other than information criteria have been proposed. In this section, thirteen such indexes are used as in Ogasawara [35, Section 2]:

(1) F̂ = ln|Σ̂| − ln|S| + tr(Σ̂⁻¹S) − p = ln|Σ̂| − ln|S| = −ln|Γ̂| − ln|S|, the −2n⁻¹ log likelihood ratio chi-square statistic;
(2) F̂_B = ln|Diag(S)| − ln|S|, which is F̂ for the baseline model of uncorrelated observable variables;
(3) GFI = 1 − 2F̂/(2F̂ + p);
(4) AGFI = 1 − {p(p + 1)/(2υ)}(1 − GFI) with υ = p(p + 1)/2 − q [21, I.40], [29, Equation (9)], [30, Appendix A], [39];
(5) Abs.GFI = exp{−(1/2)(F̂ − υ/n)} [31, Equation (8)], [32, Equation (14)];
(6) RMSEA = √(max{F̂/υ − 1/n, 0}) [40]; (6.1)
(7) Γ̂1 = 1 − 2{F̂ − (υ/n)}/[2{F̂ − (υ/n)} + p] [39, Equation (51), p. 85];
(8) Γ̂2 = 1 − {p(p + 1)/(2υ)}(1 − Γ̂1) [39, Equation (52), p. 86].

Indexes (9)–(13) are of the common form

I = (F̂_B − k1F̂ − k2)/(F̂_B − k0) = 1 − (k1F̂ + k2 − k0)/(F̂_B − k0)

with
(9) NFI [5, p. 599], k0 = 0, k1 = 1, k2 = 0;
(10) IFI (∆2) [9, Equation (3)], k0 = υ/n, k1 = 1, k2 = 0;
(11) ρ1 [8, Equation (3)], k0 = 0, k1 = υ_B/υ, k2 = 0;
(12) ρ2 [45, Equation (11)], k0 = υ_B/n, k1 = υ_B/υ, k2 = 0;
(13) FI [4, Equation (12)]; RNI [32, Equation (27)], k0 = υ_B/n, k1 = 1, k2 = (υ_B − υ)/n;
where υ_B = {p(p + 1)/2} − p = p(p − 1)/2. The comparative fit index (CFI, [4, Equation (13)]) is the FI restricted to the range [0, 1]. The indexes (1) F̂ and (2) F̂_B are included as extreme ones for comparison.
Information criteria used in this section are:

(14) n⁻¹AIC = −2ˆl̄ + n⁻¹2q;
(15) n⁻¹CAIC* = −2ˆl̄ + n⁻¹2q − n⁻²b2* with b2* = −4q²/p; (6.2)
(16) n⁻¹TIC = −2ˆl̄ − n⁻¹b̂1, where b̂1 is the estimated b1 using η̂ or θ̂.

A minimal computational sketch of (14) and (15) is given after this list.
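The sketch below (added for illustration) computes the criteria of (14) and (15) given the maximized average log likelihood of each candidate model; the fitted values used are hypothetical placeholders, and the TIC of (16) is omitted since b̂1 requires the estimated fourth-order moments:

```python
def n_inv_aic_caic(minus2lbar_hat, q, p, n):
    """n^{-1}AIC of (14) and n^{-1}CAIC* of (15); minus2lbar_hat = -2 * lbar-hat.
    The TIC of (16) additionally needs the estimate b1-hat and is omitted here."""
    b2_star = -4 * q ** 2 / p
    aic = minus2lbar_hat + 2 * q / n
    caic = aic - b2_star / n ** 2
    return aic, caic

# choose among k-1, k, k+1 factors: smallest criterion wins;
# q(k) = p(k+1) - (k^2 - k)/2 as in Section 4
p, n = 8, 304
fits = {1: 0.95, 2: 0.90, 3: 0.89}   # illustrative -2*lbar-hat values only
for k, m2l in fits.items():
    q = p * (k + 1) - (k * k - k) // 2
    print(k, n_inv_aic_caic(m2l, q, p, n))
```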
Tables 3 and 4 show the frequencies of choosing the true models using the minimum (or maximum) values of the fit indexes and information criteria in 1,000 replications. Indexes (1) F̂, (2) F̂_B and (9) NFI, with zero frequencies, chose only the k + 1, k − 1 and k + 1 models, respectively. The (k − 1)-common factor model was never chosen except by (2) F̂_B. In the tables, (5) Abs.GFI, (7) Γ̂1 and (13) FI gave the same results, where the first two indexes have a monotone relationship. The indexes (11) ρ1 and (12) ρ2 also gave the same results. Among the 13 fit indexes, RMSEA gives excellent results, which is in line with the familiarity of this index as the gold standard, with its cutoff point 0.05, in the community of covariance structure analysis (see [13, p. 465]). The AIC shows a further substantial increase in the proportions of choosing correct models. Although the improvement of the CAIC* over the AIC is not large, it is observed consistently under each condition. The results of the TIC are similar to those of the AIC, but slightly worse.
To see the above situations transparently, histograms of n⁻¹CAIC* in 8PV under normality and non-normality (chi-square with 1 degree of freedom) and of RMSEA under normality are shown in Figs. 1–3, respectively. In the figures, the histograms of the criteria are shown for the k − 1 (Model 1), k (Model 2) and k + 1 (Model 3) common factor models and their differences (Model 1 − Model 2, Model 1 − Model 3 and Model 3 − Model 2). Note that the three left-hand plots are on the same range, and the upper two plots on the right-hand side are also on the same range. The lower-right plot shows details of the difference between the two competing models.
From the figures, it is clear that Model 1 is completely separated from Models 2 and 3. In Fig. 3, the lower-right plot shows that the differences (Model 3 − Model 2) are mostly negative, which should give a favorable result for Model 3 (the incorrect (k + 1)-common factor model). The figure looks contradictory to the result in Table 4, which is explained as follows. As Fig. 3 shows, many of the 1,000 values of RMSEA are 0 for the k and k + 1 common factor models, which is expected from the definition of the index. The numbers of zero RMSEA values for the k and k + 1 common factor models are as many as 544 and 905 under normality, respectively. The corresponding counts under non-normality are 553 and 893, respectively. When the values of RMSEA for the two models are equal, the simpler model, i.e., the k-factor model, is chosen in the simulation, which gives results favorable to the true k-common factor model. This is based on the principle that simpler models are to be chosen when their goodness-of-fit is equal.

Table 3
1,000 times the simulated proportions of choosing the correct one-factor model among three candidate models with 0, 1 and 2 common factors (the number of observable variables = 7; 2q = 28; the number of replications = 1,000).

N = n + 1                Normal         Uniform        χ²(3df)        χ²(1df)
                         200    400     200    400     200    400     200    400
Fit indexes other than information criteria
(1) F̂                    0      0       0      0       0      0       0      0
(2) F̂_B                  0      0       0      0       0      0       0      0
(3) GFI                  0      0       0      0       0      0       0      0
(4) AGFI                 48     27      36     34      39     37      29     26
(5) Abs.GFI              206    208     208    225     195    206     229    210
(6) RMSEA                565    576     567    567     552    554     561    560
(7) Γ̂1                   206    208     208    225     195    206     229    210
(8) Γ̂2                   46     26      33     33      34     35      27     25
(9) NFI                  0      0       0      0       0      0       0      0
(10) IFI (∆2)            215    213     220    230     205    210     241    218
(11) ρ1                  43     26      33     33      34     35      27     25
(12) ρ2                  43     26      33     33      34     35      27     25
(13) FI                  206    208     208    225     195    206     229    210
Information criteria
(14) n⁻¹AIC              801    801     801    818     793    802     773    784
(15) n⁻¹CAIC*            835    815     832    831     826    817     805    797
(16) n⁻¹TIC              790    783     790    813     763    791     755    782
Non-conv.                1623   1522    1522   1419    1540   1436    1495   1341

Note. N = n + 1 = the sample size, Non-conv. = the number of non-convergent cases deleted. λ0 = (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)′ and Ψ0 = diag(0.96, 0.91, 0.84, 0.75, 0.64, 0.51, 0.36). n⁻¹CAIC* = n⁻¹AIC − n⁻²b2* = −2ˆl̄ + n⁻¹2q − n⁻²b2* with b2* = −4q²/p; n⁻¹TIC = −2ˆl̄ − n⁻¹b̂1.

Table 4
1,000 times the simulated proportions of choosing correct two-factor models based on real data among three candidate models with 1, 2 and 3 common factors (the number of replications = 1,000).

                         8 physical variables                 9 educational tests
                         (Harman [17]; N = 305; 2q = 46)      (Emmett [15]; N = 211; 2q = 52)
                         Normal  Uniform  χ²(3df)  χ²(1df)    Normal  Uniform  χ²(3df)  χ²(1df)
Fit indexes other than information criteria
(1) F̂                    0       0        0        0          0       0        0        0
(2) F̂_B                  0       0        0        0          0       0        0        0
(3) GFI                  0       0        0        0          0       0        0        0
(4) AGFI                 42      30       28       32         15      25       18       26
(5) Abs.GFI              255     250      223      264        196     201      187      175
(6) RMSEA                560     587      530      565        633     602      555      567
(7) Γ̂1                   255     250      223      264        196     201      187      175
(8) Γ̂2                   41      29       28       30         13      22       18       26
(9) NFI                  0       0        0        0          0       0        0        0
(10) IFI (∆2)            256     250      224      264        199     208      188      176
(11) ρ1                  41      29       28       30         13      21       18       25
(12) ρ2                  41      29       28       30         13      21       18       25
(13) FI                  255     250      223      264        196     201      187      175
Information criteria
(14) n⁻¹AIC              826     815      791      802        838     829      805      771
(15) n⁻¹CAIC*            853     840      825      823        881     864      845      808
(16) n⁻¹TIC              814     806      765      774        828     823      785      753
Non-conv.                1920    1651     1665     1733       20,560  20,797   18,174   18,607

Note. N = n + 1 = the sample size, Non-conv. = the number of non-convergent cases deleted. n⁻¹CAIC* = n⁻¹AIC − n⁻²b2* = −2ˆl̄ + n⁻¹2q − n⁻²b2* with b2* = −4q²/p; n⁻¹TIC = −2ˆl̄ − n⁻¹b̂1.

7. Discussion

From the results of Section 6, the formula of the CAIC* is promising. Notice that the results under non-normality are almost the same as those under normality in a practical sense. This is somewhat surprising since b1 (= −2q) and b2 under normality are not robust under non-normality. Fig. 2 under non-normality is similar to Fig. 1 under normality, except that the separation of Model 1 from Models 2 and 3 is reduced, though still complete. The robust results may be explained by the robustness of the differences of the CAIC* between candidate models under non-normality, as far as model selection is concerned.

Fig. 1. Frequencies of CAIC*/n and their differences for factor analysis models under normality in Harman's (1976) 8 physical variables.

In the behavioral sciences, observable variables are frequently standardized to have unit variances. The EFA model for standardized variables can be written as

Σ = DPD = D{ΛΛ′ + Diag(I(p) − ΛΛ′)}D, (7.1)

where P is a structural correlation matrix, D = diag(d1, . . . , dp), d_i (i = 1, . . . , p) are the standard deviations of the observable variables and I(p) is the p × p identity matrix. It is known that in the model for unstandardized variables, i.e., Σ = ΛΛ′ + Ψ, we have Diag(Σ̂) = Diag(S). This gives a valid estimation method using a sample correlation matrix in place of S in the Wishart likelihood. That is, the restriction Diag(P̂) = I(p) is automatically satisfied. It is to be noted that the number of parameters q is unchanged from that for unstandardized variables. The approximate formulas −4q²/p and −4(k + 1)q for b2 may similarly be used.

Acknowledgments

The author is indebted to two anonymous reviewers for the improvement of an earlier version of this paper. This work
was partially supported by a Grant-in-Aid for Scientific Research from the Japanese Ministry of Education, Culture, Sports,
Science and Technology (JSPS KAKENHI, Grant No. 26330031).

Appendix

A.1. Proof of Theorem 1 and additional explanations

Let σ_Tab = (Σ_T)_ab, where (·)_ab is the (a, b)th element of the matrix in parentheses, and θ_0i = (θ₀)_i, where (·)_i is the ith element of the vector in parentheses. Define E_cd as a matrix of appropriate size whose (c, d)th element is 1 with the other elements being 0. Let δ_ab be the Kronecker delta.

Fig. 2. Frequencies of CAIC*/n and their differences for factor analysis models under non-normality (chi-square with 1 d.f.) in Harman's (1976) 8 physical variables.

Then, from h = tr{Σ̂⁻¹(S − Σ_T)}, we obtain

∂²h/(∂σ_Tab ∂σ_Tcd) = −∑_{i=1}^{q} tr[Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹ {(∂θ_0i/∂σ_Tab) ((2 − δ_cd)/2) (E_cd + E_dc) + (∂θ_0i/∂σ_Tcd) ((2 − δ_ab)/2) (E_ab + E_ba)}]
= −∑²_{(ab,cd)} ∑_{i=1}^{q} (2 − δ_ab) {Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹}_ab (∂θ_0i/∂σ_Tcd), (A.1.1)

∂³h/(∂σ_Tab ∂σ_Tcd ∂σ_Tef) = −∑³_{(ab,cd,ef)} (2 − δ_ab) [∑_{i=1}^{q} {Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹}_ab ∂²θ_0i/(∂σ_Tcd ∂σ_Tef)
+ ∑_{i,j=1}^{q} {−Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹ (∂Σ₀/∂θ_0j) Σ₀⁻¹ − Σ₀⁻¹ (∂Σ₀/∂θ_0j) Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹ + Σ₀⁻¹ {∂²Σ₀/(∂θ_0i ∂θ_0j)} Σ₀⁻¹}_ab (∂θ_0i/∂σ_Tcd)(∂θ_0j/∂σ_Tef)],

∂⁴h/(∂σ_Tab ∂σ_Tcd ∂σ_Tef ∂σ_Tgh*) = −∑⁴_{(ab,cd,ef,gh*)} (2 − δ_ab) [∑_{i=1}^{q} {Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹}_ab ∂³θ_0i/(∂σ_Tcd ∂σ_Tef ∂σ_Tgh*)
+ ∑_{i,j=1}^{q} {−Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹ (∂Σ₀/∂θ_0j) Σ₀⁻¹ − Σ₀⁻¹ (∂Σ₀/∂θ_0j) Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹ + Σ₀⁻¹ {∂²Σ₀/(∂θ_0i ∂θ_0j)} Σ₀⁻¹}_ab ∑³_{(cd,ef,gh*)} {∂²θ_0i/(∂σ_Tcd ∂σ_Tef)} (∂θ_0j/∂σ_Tgh*)
+ ∑_{i,j,k=1}^{q} {∑⁶_{(i,j,k)} Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹ (∂Σ₀/∂θ_0j) Σ₀⁻¹ (∂Σ₀/∂θ_0k) Σ₀⁻¹ − ∑³_{(i,j,k)} (Σ₀⁻¹ {∂²Σ₀/(∂θ_0i ∂θ_0j)} Σ₀⁻¹ (∂Σ₀/∂θ_0k) Σ₀⁻¹ + Σ₀⁻¹ (∂Σ₀/∂θ_0k) Σ₀⁻¹ {∂²Σ₀/(∂θ_0i ∂θ_0j)} Σ₀⁻¹) + Σ₀⁻¹ {∂³Σ₀/(∂θ_0i ∂θ_0j ∂θ_0k)} Σ₀⁻¹}_ab (∂θ_0i/∂σ_Tcd)(∂θ_0j/∂σ_Tef)(∂θ_0k/∂σ_Tgh*)]
(p ≥ a ≥ b ≥ 1; p ≥ c ≥ d ≥ 1; p ≥ e ≥ f ≥ 1; p ≥ g ≥ h* ≥ 1),
2
where (ab,cd) is the sum of two terms considering the associated combination with other similar notations defined
similarly. Define FNT = ln |6| + tr(6−1 S) as the normal-theory (NT) discrepancy function to be minimized for estimation of
θ. Then, from Ogasawara [34, Equations (17) and (19)], [37, Equation (3.16)], we have
 −1
∂θ0 ∂ FNT ∂ 2 FNT
 2
=− ,
∂σTab ∂θ0 ∂θ0

∂θ0 ∂σTab
−1   q
∂ 2 θ0 ∂ FNT ∂ 3 FNT ∂θ0i ∂θ0j
 2
= −
∂σTab ∂σTcd ∂θ0 ∂θ0

i,j=1
∂θ 0 ∂θ 0i ∂θ0j ∂σ Tab ∂σTcd
q 2
∂ 3 FNT ∂θ0i ∂ 3 FNT
  
+ + , (A.1.2)
i=1 (ab,cd)
∂θ0 ∂θ0i ∂σTab ∂σTcd ∂θ0 ∂σTab ∂σTcd
156 H. Ogasawara / Journal of Multivariate Analysis 149 (2016) 144–159

 −1  
q
∂ 3 θ0 ∂ 2 FNT ∂ 4 FNT ∂θ0i ∂θ0j ∂θ0k

= −
∂σTab ∂σTcd ∂σTef ∂θ0 ∂θ′0 ∂θ 0 ∂θ 0i ∂θ 0j ∂θ 0k ∂σ Tab ∂σTcd ∂σTef
 i,j,k=1
q q
 3   ∂ 3 FNT ∂θ0i ∂ 2 θ0j
+
(ab,cd,ef ) i=1 j =1
∂θ0 ∂θ0i ∂θ0j ∂σTab ∂σTcd ∂σTef
∂ FNT
4
∂θ0i ∂θ0j ∂ 3 FNT ∂ 2 θ0i

+ +
∂θ0 ∂θ0i ∂θ0j ∂σTab ∂σTcd ∂σTef ∂θ0 ∂θ0i ∂σTab ∂σTcd ∂σTef
∂ 4 FNT ∂θ0i ∂ 4 FNT
 
+ +
∂θ0 ∂θ0i ∂σTab ∂σTcd ∂σTef ∂θ0 ∂σTab ∂σTcd ∂σTef
(p ≥ a ≥ b ≥ 1; p ≥ c ≥ d ≥ 1; p ≥ e ≥ f ≥ 1).
On the other hand, from e.g., Kaplan [23, pp. 320–321] and Ogasawara [33, Equations (3.7) and (3.8)],

nE_g{(s_ab − σ_Tab)(s_cd − σ_Tcd)} = σ_Tabcd − σ_Tab σ_Tcd − n⁻¹κ_abcd + O(n⁻²),

n²E_g{(s_ab − σ_Tab)(s_cd − σ_Tcd)(s_ef − σ_Tef)}
= κ_abcdef + ∑¹² κ_abce κ_df + ∑⁴ κ_ace κ_bdf + ∑⁸ κ_ac κ_be κ_df + O(n⁻¹), (A.1.3)

n²E_g{(s_ab − σ_Tab)(s_cd − σ_Tcd)(s_ef − σ_Tef)(s_gh* − σ_Tgh*)}
= ∑³ (σ_Tabcd − σ_Tab σ_Tcd)(σ_Tefgh* − σ_Tef σ_Tgh*) + O(n⁻¹)
(p ≥ a ≥ b ≥ 1; p ≥ c ≥ d ≥ 1; p ≥ e ≥ f ≥ 1; p ≥ g ≥ h* ≥ 1),

where σ_Tabcd (κ_abcd) is the fourth multivariate central moment (cumulant) with respect to observable variables X_a, X_b, X_c and X_d corresponding to (x_i)_a, (x_i)_b, (x_i)_c and (x_i)_d, respectively; σ_Tab = κ_ab; ∑ʲ is the sum of j terms considering the associated combinations; and the other cumulants are defined similarly. □
Under normality with Σ₀ = Σ_T and σ_0ab = (Σ₀)_ab,

nE_f{(s_ab − σ_0ab)(s_cd − σ_0cd)} = σ_0ac σ_0bd + σ_0ad σ_0bc (A.1.4)

holds exactly. Under possible non-normality, b1 becomes

b1 = (1/2) {∂²h/(∂σ_T′)⟨2⟩} nE_g{(s − σ_T)⟨2⟩}_{→O(n⁻¹)}
= −(1/2) ∑_{a≥b} ∑_{c≥d} ∑²_{(ab,cd)} ∑_{i=1}^{q} (2 − δ_ab) {Σ₀⁻¹ (∂Σ₀/∂θ_0i) Σ₀⁻¹}_ab [−{∂²F_NT/(∂θ₀ ∂θ₀′)}⁻¹ ∂²F_NT/(∂θ₀ ∂σ_Tcd)]_i × n acov_g(s_ab, s_cd)
= −(1/2) ∑_{a≥b} ∑_{c≥d} ∑²_{(ab,cd)} {∂²F_NT/(∂θ₀′ ∂σ_Tab)} {∂²F_NT/(∂θ₀ ∂θ₀′)}⁻¹ {∂²F_NT/(∂θ₀ ∂σ_Tcd)} n acov_g(s_ab, s_cd)
= −2 ∑_{a≥b} ∑_{c≥d} tr[{∂²(F_NT/2)/(∂θ₀ ∂θ₀′)}⁻¹ {∂²(F_NT/2)/(∂θ₀ ∂σ_Tab)} n acov_g(s_ab, s_cd) {∂²(F_NT/2)/(∂θ₀′ ∂σ_Tcd)}]
= −2 tr[{E_g(−∂²l̄/(∂θ₀ ∂θ₀′))}⁻¹ nE_g{(∂l̄/∂θ₀)(∂l̄/∂θ₀′)}_{→O(n⁻¹)}], (A.1.5)

where ∑_{a≥b} ∑_{c≥d} (·) = ∑_{p≥a≥b≥1} ∑_{p≥c≥d≥1} (·); acov_g(·) is the asymptotic covariance of order O(n⁻¹); and F_NT = −2l̄ − C(S) is used. Note that the last expression of (A.1.5) corresponds to the negative correction term of the TIC when it is evaluated at its population value. Under normality and correct model specification, from the Bartlett identity, i.e., E_f{−∂²l̄/(∂θ₀ ∂θ₀′)} = nE_f{(∂l̄/∂θ₀)(∂l̄/∂θ₀′)}, it is found that (A.1.5) becomes −2q, the negative of the correction term in the AIC. Under these conditions, b2 in (2.20) becomes simplified as

b2 = (1/6) {∂³h/(∂σ_T′)⟨3⟩} n²[E_f{(s − σ_T)⟨3⟩}_{→O(n⁻²)}] + (1/24) {∂⁴h/(∂σ_T′)⟨4⟩} n²[E_f{(s − σ_T)⟨4⟩}_{→O(n⁻²)}] (A.1.6)

since the first term on the right-hand side of (2.20) vanishes due to the exact result (A.1.4).

A.2. The higher-order correction term in the TIC

Define b̂1 as a consistent estimator of b1 = (1/2){∂²h/(∂σ_T′)⟨2⟩} nE_g{(s − σ_T)⟨2⟩}_{→O(n⁻¹)}. Let the (ab, cd)th element of the estimator of nE_g{(s − σ_T)⟨2⟩}_{→O(n⁻¹)}, using double-subscript notation, be s_abcd − s_ab s_cd (p ≥ a ≥ b ≥ 1; p ≥ c ≥ d ≥ 1), where s_abcd is the multivariate sample fourth central moment with respect to X_a, X_b, X_c and X_d. Then, by direct expansion,

Theorem A.1. Under regularity conditions of the existence of the associated expectations, the asymptotic bias of b̂1 is given by

E_g(b̂1 − b1) = n⁻¹ [(1/4) {∂⁴h/(∂σ_T′)⟨4⟩} n²[E_g{(s − σ_T)⟨2⟩}]⟨2⟩ − {∂³h/(∂σ_T′)⟨3⟩} [σ_T ⊗ nE_g{(s − σ_T)⟨2⟩}]
− (1/2) {∂²h/(∂σ_T′)⟨2⟩} nE_g{(s − σ_T)⟨2⟩} + (1/2) {∂³h/(∂σ_T′)⟨3⟩} nE_g{(s − σ_T)(s(4) − σ_T(4))}] + O(n⁻²)
≡ n⁻¹b_∆1 + O(n⁻²), (A.2.1)

where all the expectations are taken up to order O(n⁻¹); σ_T(4) is the {p(p + 1)/2}² × 1 vector of associated fourth central moments of, e.g., X_a, X_b, X_c and X_d; s(4) is a consistent estimator of σ_T(4); and nE_g{(s − σ_T)(s(4) − σ_T(4))}_{→O(n⁻¹)} is given by

nE_g{(s_abcd − σ_Tabcd)(s_ef − σ_Tef)} = σ_Tabcdef − σ_Tabcd σ_Tef − ∑⁴ σ_Taef σ_Tbcd + O(n⁻¹) (a, b, c, d, e, f = 1, . . . , p) (A.2.2)

[36, Lemma 1], where σ_Tabcdef is the sixth multivariate central moment of X_a, X_b, X_c, X_d, X_e and X_f, and σ_Taef = κ_aef.

A.3. Three examples of the inverse structure in factor analysis

In factor analysis, there are linear and bilinear models for Σ with respect to θ. Using the Woodbury formulas

(A + BCB′)⁻¹ = A⁻¹ − A⁻¹B(B′A⁻¹B + C⁻¹)⁻¹B′A⁻¹,
(A + bb′)⁻¹ = A⁻¹ − (b′A⁻¹b + 1)⁻¹A⁻¹bb′A⁻¹, (A.3.1)

where the inverses are assumed to exist, we have the following examples.

Example 1 (The Model of Parallel Tests (See e.g., [20])). This is the one-factor model of equal factor loadings and unique variances with Σ = φ1(p)1(p)′ + ψI(p), where 1(p) is the p × 1 vector of 1's, I(p) is the p × p identity matrix and θ = (φ, ψ)′. From (A.3.1),

Γ = Σ⁻¹ = (φ1(p)1(p)′ + ψI(p))⁻¹ = ψ⁻¹I(p) − (pφψ⁻¹ + 1)⁻¹φψ⁻²1(p)1(p)′
= ψ⁻¹I(p) − {φ/(pφψ + ψ²)}1(p)1(p)′ = η1 I(p) − η2 1(p)1(p)′, (A.3.2)

where η = {ψ⁻¹, φ/(pφψ + ψ²)}′. Note that Γ is linear with respect to η.
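A quick numerical check of (A.3.2), added here for illustration:

```python
import numpy as np

p, phi, psi = 5, 0.4, 0.6
Sigma = phi * np.ones((p, p)) + psi * np.eye(p)     # parallel-tests model

eta1 = 1 / psi
eta2 = phi / (p * phi * psi + psi ** 2)
Gamma = eta1 * np.eye(p) - eta2 * np.ones((p, p))   # right-hand side of (A.3.2)
print(np.max(np.abs(Gamma - np.linalg.inv(Sigma))))  # ~ 0
```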

Example 2 (The Essentially Tau-equivalent Model (See e.g., [20])). This model relaxes the restriction of equal unique variances while retaining the equality of the factor loadings: Σ = φ1(p)1(p)′ + Ψ with Ψ = diag(ψ1, . . . , ψp), where θ = (φ, 1(p)′Ψ)′. From (A.3.1),

Γ = Σ⁻¹ = (φ1(p)1(p)′ + Ψ)⁻¹ = Ψ⁻¹ − (φ1(p)′Ψ⁻¹1(p) + 1)⁻¹φΨ⁻¹1(p)1(p)′Ψ⁻¹
= diag(ψ*) − φ*ψ*ψ*′, (A.3.3)

where ψ* = Ψ⁻¹1(p), φ* = φ/(φ1(p)′Ψ⁻¹1(p) + 1), η = {φ/(φ1(p)′Ψ⁻¹1(p) + 1), 1(p)′Ψ⁻¹}′ and (A.3.3) is a trilinear model. After further reparametrization as ψ** = ψ*(φ*)^{1/2} and φ** = (φ*)^{−1/2}, (A.3.3) becomes

Γ = diag(ψ**φ**) − ψ**ψ**′, (A.3.4)

which is a bilinear model.


Example 3 (The EFA Model). This is the usual EFA model with Σ = ΛΛ′ + Ψ, where Λ is the p × k (p > k) loading matrix and Ψ is as before. The (k² − k)/2 elements λij = (Λ)ij (i < j) are set to zero, after permutation of observable variables when necessary, for model identification without loss of generality. Then, from (A.3.1),

Γ = (ΛΛ′ + Ψ)⁻¹ = Ψ⁻¹ − Ψ⁻¹Λ(Λ′Ψ⁻¹Λ + I(k))⁻¹Λ′Ψ⁻¹ = Ψ* − Λ*Λ*′, (A.3.5)

where Ψ = Ψ*⁻¹ and restrictions as in Λ for model identification are imposed on Λ*. In (A.3.5), η = (1(p)′Ψ*, λ*′)′, where λ* is the [pk − {(k² − k)/2}] × 1 vector of λ*ij = (Λ*)ij (i ≥ j). Note that when k = 1, the (k² − k)/2 restrictions are unnecessary.
In (A.3.5), Ψ* and Λ* in the inverse structure are reparametrized from Ψ and Λ. However, Ψ̂* and Λ̂* can be directly obtained by minimizing F_NT = ln|Σ| + tr(Σ⁻¹S) = −ln|Γ| + tr(ΓS), which is intriguing since F_NT written in terms of Γ involves no inverse. We have convergence in iterative computation for estimating η when the starting value is close to the solution; otherwise we tend to have divergence. This difficulty is partially explained by possible singularity of Γ in the process of iteration before the solution is attained. In contrast, the non-negative definiteness of Σ = ΛΛ′ + Ψ is guaranteed when Ψ is positive definite throughout the iteration under standard conditions. So, reparametrization from Ψ and Λ is recommended in practice. Note that Λ* is obtained from the Cholesky decomposition of Ψ⁻¹ − Σ⁻¹, whose solution is algebraic without using iteration.
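A sketch (illustrative; the k × k Cholesky route below is one way to realize the algebraic solution mentioned above) verifying the inverse structure (A.3.5):

```python
import numpy as np

rng = np.random.default_rng(5)
p, k = 6, 2
Lam = rng.uniform(0.3, 0.7, size=(p, k))
Lam[0, 1] = 0.0                                    # one echelon-type zero for k = 2
Psi = np.diag(rng.uniform(0.3, 0.9, size=p))
Sigma = Lam @ Lam.T + Psi
Psi_inv = np.linalg.inv(Psi)

# Lam* Lam*' = Psi^{-1} - Sigma^{-1} (Woodbury); build Lam* from a k x k Cholesky factor
B = Lam.T @ Psi_inv @ Lam + np.eye(k)
L = np.linalg.cholesky(B)                          # B = L L'
Lam_star = Psi_inv @ Lam @ np.linalg.inv(L).T      # so Lam* Lam*' = Psi^{-1} Lam B^{-1} Lam' Psi^{-1}

Gamma = Psi_inv - Lam_star @ Lam_star.T            # inverse structure (A.3.5)
print(np.max(np.abs(Gamma - np.linalg.inv(Sigma))))  # ~ 0
```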

A.4. Partial derivatives when the inverse structure is used

From

h = tr{Σ̂⁻¹(S − Σ_T)} = tr{Γ̂(S − Σ_T)} = ∑_{i=1}^{4} (1/i!) {∂ⁱh/(∂σ_T′)⟨i⟩} (s − σ_T)⟨i⟩ + O_p(n^{−5/2}), (A.4.1)

we have

∂²h/(∂σ_Tab ∂σ_Tcd) = ∑_{i=1}^{q} tr[(∂Γ₀/∂η_0i) {(∂η_0i/∂σ_Tab) ((2 − δ_cd)/2) (E_cd + E_dc) + (∂η_0i/∂σ_Tcd) ((2 − δ_ab)/2) (E_ab + E_ba)}]
= ∑²_{(ab,cd)} ∑_{i=1}^{q} (2 − δ_ab) {∂(Γ₀)_ab/∂η_0i} (∂η_0i/∂σ_Tcd), (A.4.2)

∂³h/(∂σ_Tab ∂σ_Tcd ∂σ_Tef) = ∑³_{(ab,cd,ef)} (2 − δ_ab) [∑_{i=1}^{q} {∂(Γ₀)_ab/∂η_0i} ∂²η_0i/(∂σ_Tcd ∂σ_Tef) + ∑_{i,j=1}^{q} {∂²(Γ₀)_ab/(∂η_0i ∂η_0j)} (∂η_0i/∂σ_Tcd)(∂η_0j/∂σ_Tef)],

∂⁴h/(∂σ_Tab ∂σ_Tcd ∂σ_Tef ∂σ_Tgh*) = ∑⁴_{(ab,cd,ef,gh*)} (2 − δ_ab) [∑_{i=1}^{q} {∂(Γ₀)_ab/∂η_0i} ∂³η_0i/(∂σ_Tcd ∂σ_Tef ∂σ_Tgh*)
+ ∑_{i,j=1}^{q} {∂²(Γ₀)_ab/(∂η_0i ∂η_0j)} ∑³_{(cd,ef,gh*)} {∂²η_0i/(∂σ_Tcd ∂σ_Tef)} (∂η_0j/∂σ_Tgh*) + ∑_{i,j,k=1}^{q} {∂³(Γ₀)_ab/(∂η_0i ∂η_0j ∂η_0k)} (∂η_0i/∂σ_Tcd)(∂η_0j/∂σ_Tef)(∂η_0k/∂σ_Tgh*)]
(p ≥ a ≥ b ≥ 1; p ≥ c ≥ d ≥ 1; p ≥ e ≥ f ≥ 1; p ≥ g ≥ h* ≥ 1).

From F̂_NT = −ln|Γ̂| + tr(Γ̂S), we obtain

∂F̂_NT/∂η̂_i = tr{(∂Γ̂/∂η̂_i)(S − Γ̂⁻¹)} = 0, ∂F_NT/∂η_0i = tr{(∂Γ₀/∂η_0i)(Σ_T − Σ₀)} = 0,

∂²F̂_NT/(∂η̂_i ∂η̂_j) = tr{(∂²Γ̂/(∂η̂_i ∂η̂_j))(S − Γ̂⁻¹) + Γ̂⁻¹ (∂Γ̂/∂η̂_i) Γ̂⁻¹ (∂Γ̂/∂η̂_j)}, (A.4.3)

∂²F̂_NT/(∂η̂_i ∂s_ab) = (2 − δ_ab) ∂(Γ̂)_ab/∂η̂_i,

∂³F̂_NT/(∂η̂_i ∂η̂_j ∂η̂_k) = tr{(∂³Γ̂/(∂η̂_i ∂η̂_j ∂η̂_k))(S − Γ̂⁻¹) + ∑³_{(i,j,k)} Γ̂⁻¹ (∂²Γ̂/(∂η̂_i ∂η̂_j)) Γ̂⁻¹ (∂Γ̂/∂η̂_k) − 2Γ̂⁻¹ (∂Γ̂/∂η̂_i) Γ̂⁻¹ (∂Γ̂/∂η̂_j) Γ̂⁻¹ (∂Γ̂/∂η̂_k)},

∂³F̂_NT/(∂η̂_i ∂η̂_j ∂s_ab) = (2 − δ_ab) ∂²(Γ̂)_ab/(∂η̂_i ∂η̂_j),

∂⁴F̂_NT/(∂η̂_i ∂η̂_j ∂η̂_k ∂η̂_l*) = tr{(∂⁴Γ̂/(∂η̂_i ∂η̂_j ∂η̂_k ∂η̂_l*))(S − Γ̂⁻¹) + ∑⁴_{(i,j,k,l*)} Γ̂⁻¹ (∂³Γ̂/(∂η̂_i ∂η̂_j ∂η̂_k)) Γ̂⁻¹ (∂Γ̂/∂η̂_l*)
+ ∑³_{(i,j,k,l*)} Γ̂⁻¹ (∂²Γ̂/(∂η̂_i ∂η̂_j)) Γ̂⁻¹ (∂²Γ̂/(∂η̂_k ∂η̂_l*)) − 2∑⁶_{(i,j,k,l*)} Γ̂⁻¹ (∂²Γ̂/(∂η̂_i ∂η̂_j)) Γ̂⁻¹ (∂Γ̂/∂η̂_k) Γ̂⁻¹ (∂Γ̂/∂η̂_l*)
+ 2∑³_{(i,j,k,l*)} Γ̂⁻¹ (∂Γ̂/∂η̂_i) Γ̂⁻¹ (∂Γ̂/∂η̂_j) Γ̂⁻¹ (∂Γ̂/∂η̂_k) Γ̂⁻¹ (∂Γ̂/∂η̂_l*)},

∂⁴F̂_NT/(∂η̂_i ∂η̂_j ∂η̂_k ∂s_ab) = (2 − δ_ab) ∂³(Γ̂)_ab/(∂η̂_i ∂η̂_j ∂η̂_k) (i, j, k, l* = 1, . . . , q; p ≥ a ≥ b ≥ 1),

where the second result, ∂F_NT/∂η_0i = ∂F_NT/∂η_i|_{η=η₀} = 0, is given by the definition of η₀ under possible model misspecification.

References

[1] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: B.N. Petrov, F. Csáki (Eds.), Proceedings of the 2nd
International Symposium on Information Theory, Académiai Kiado, Budapest, 1973, pp. 267–281.
[2] H. Akaike, Factor analysis and AIC, Psychometrika 52 (1987) 317–332.
[3] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, third ed., Wiley, New York, 2003.
[4] P.M. Bentler, Comparative fit indexes in structural models, Psychol. Bull. 107 (1990) 238–246.
[5] P.M. Bentler, D.G. Bonett, Significance tests and goodness of fit in the analysis of covariance structures, Psychol. Bull. 88 (1980) 588–606.
[6] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[7] R.D. Bock, R.E. Bargmann, Analysis of covariance structures, Psychometrika 31 (1966) 507–534.
[8] K.A. Bollen, Sample size and Bentler and Bonett’s nonnormed fit index, Psychometrika 51 (1986) 375–377.
[9] K.A. Bollen, A new incremental fit index for general equation models, Sociol. Methods Res. 17 (1989) 303–316.
[10] M.W. Browne, Covariance structures, in: D.M. Hawkins (Ed.), Topics in Applied Multivariate Analysis, Cambridge University Press, Cambridge, 1982,
pp. 72–141.
[11] M.W. Browne, G. Arminger, Specification and estimation of mean- and covariance-structure models, in: G. Arminger, C.C. Clogg, M.E. Sobel (Eds.),
Handbook of Statistical Modeling for the Social and Behavioral Sciences, Plenum Press, New York, 1995, pp. 185–241.
[12] K.P. Burnham, D.R. Anderson, Model Selection and Multimodel Inference — A Practical Information-theoretic Approach, second ed., Springer, New
York, 2010.
[13] F. Chen, P.J. Curran, K.A. Bollen, J. Kirby, P. Paxon, An empirical evaluation of the use of fixed cutoff points in RMSEA test statistic in structural equation
models, Sociol. Methods Res. 36 (2008) 462–494.
[14] G. Claeskens, N.L. Hjort, Model Selection and Model Averaging, Cambridge University Press, Cambridge, 2008.
[15] W.G. Emmett, Factor analysis by Lawley’s method of maximum likelihood, Br. J. Psychol. Stat. Sect. 2 (1949) 90–97.
[16] Y. Fujikoshi, K. Satoh, Modified AIC and Cp in multivariate linear regression, Biometrika 84 (1997) 707–716.
[17] H.H. Harman, Modern Factor Analysis, third ed., University of Chicago Press, Chicago, 1976.
[18] C.M. Hurvich, C.-L. Tsai, Regression and time series model selection in small samples, Biometrika 76 (1989) 297–307.
[19] M. Ichikawa, Empirical assessment of AIC procedure for model selection in factor analysis, Behaviormetrika 24 (1988) 33–40.
[20] K.G. Jöreskog, Structural analysis of covariance and correlation matrices, Psychometrika 43 (1978) 443–477.
[21] K.G. Jöreskog, D. Sörbom, LISREL V: Analysis of Linear Structural Relations by the Method of Maximum Likelihood, International Educational Services,
Chicago, 1981.
[22] K. Kamo, H. Yanagihara, K. Satoh, Bias-corrected AIC for selecting variables in Poisson regression models, Comm. Statist. Theory Methods 42 (2013)
1911–1921.
[23] E.L. Kaplan, Tensor notation and the sampling cumulants of k-statistics, Biometrika 39 (1952) 319–323.
[24] G. Kitagawa, Contributions of Professor Hirotsugu Akaike in statistical science, J. Japan Statist. Soc. 38 (2008) 119–130.
[25] S. Konishi, G. Kitagawa, Asymptotic theory for information criteria in model selection — functional approach, J. Statist. Plann. Inference 114 (2003)
45–61.
[26] S. Konishi, G. Kitagawa, Information Criteria and Statistical Modeling, Springer, New-York, 2008.
[27] D.N. Lawley, A.E. Maxwell, Factor Analysis as a Statistical Method, second ed., Butterworths, London, 1971.
[28] H. Linhart, W. Zucchini, Model Selection, Wiley, New York, 1986.
[29] S.S. Maiti, B.N. Mukherjee, A note on distributional properties of the Jöreskog–Sörbom fit indices, Psychometrika 55 (1990) 721–726.
[30] H.W. Marsh, J.R. Balla, R.P. McDonald, Goodness-of-fit indices in confirmatory factor analysis: Effects of sample size, Psychol. Bull. 103 (1988) 391–411.
[31] R.P. McDonald, An index of goodness-of-fit based on noncentrality, J. Classification 6 (1989) 97–103.
[32] R.P. McDonald, H.W. Marsh, Choosing a multivariate model: Noncentrality and goodness of fit, Psychol. Bull. 107 (1990) 247–255.
[33] H. Ogasawara, Asymptotic expansion of the sample correlation coefficient under nonnormality, Comput. Statist. Data Anal. 50 (2006) 891–910.
[34] H. Ogasawara, Higher-order estimation error in structural equation modeling, Econ. Rev. Otaru Univ. Commer. 57 (4) (2007) 131–160. Permalink:
http://hdl.handle.net/10252/268.
[35] H. Ogasawara, Higher-order approximations to the distributions of fit indexes under fixed alternatives in structural equation models, Psychometrika
72 (2007) 227–243.
[36] H. Ogasawara, Asymptotic expansion of the distributions of the estimators in factor analysis under nonnormality, British J. Math. Statist. Psych. 60
(2007) 395–420.
[37] H. Ogasawara, Asymptotic expansions in mean and covariance structure analysis, J. Multivariate Anal. 100 (2009) 902–919.
[38] H. Ogasawara, Asymptotic cumulants of some information criteria (2nd version). Discussion Paper, Center for Business Creation, Otaru University of
Commerce, No.174 (2015) Permalink: http://hdl.handle.net/10252/5497.
[39] J.H. Steiger, EzPATH: A Supplementary Module for SYSTAT and SYGRAPH, SYSTAT Inc., Evanston, IL, 1989.
[40] J.H. Steiger, J.C. Lind, Statistically based tests for the number of common factors. Paper presented at the annual Spring Meeting of the Psychometric
Society, Iowa City, IA., 1980, May.
[41] M. Stone, An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion, J. R. Stat. Soc. Ser. B Stat. Methodol. 39 (1977)
44–47.
[42] N. Sugiura, Further analysis of the data by Akaike’s information criterion and the finite corrections, Comm. Statist. Theory Methods 7 (1978) 13–26.
[43] Y. Takane, H. Bozdogan, Introduction to special section, Psychometrika 52 (1987) 315.
[44] K. Takeuchi, Distributions of information statistics and criteria of the goodness of models, Math. Sci. 153 (1976) 12–18 (in Japanese).
[45] L.R. Tucker, C. Lewis, A reliability coefficient for maximum likelihood factor analysis, Psychometrika 38 (1973) 1–10.
[46] R. von Mises, On the asymptotic distribution of differentiable statistical functions, Ann. Math. Stat. 18 (1947) 309–348.
[47] C.S. Withers, Expansions for the distribution and quantiles of a regular functional of the empirical distribution with applications to nonparametric
confidence intervals, Ann. Statist. 11 (1983) 577–587.
[48] H. Yanagihara, R. Sekiguchi, Y. Fujikoshi, Bias correction of AIC in logistic regression models, J. Statist. Plann. Inference 115 (2003) 349–360.
