
Geophysical Prospecting, 2004, 52, 213–239

Adjustment of regularization in ill-posed linear inverse problems by the empirical Bayes approach
Yuji Mitsuhata∗
Exploration Geophysics Group, National Institute of Advanced Industrial Science and Technology, Central No. 7, 1-1-1 Higashi, Tsukuba 305-8567, Japan
∗ E-mail: y.mitsuhata@aist.go.jp

Received August 2002, revision accepted October 2003

ABSTRACT
Regularization is the most popular technique to overcome the null space of model
parameters in geophysical inverse problems, and is implemented by including a con-
straint term as well as the data-misfit term in the objective function being minimized.
The weighting of the constraint term relative to the data-fitting term is controlled by
a regularization parameter, and its adjustment to obtain the best model has received
much attention. The empirical Bayes approach discussed in this paper determines the
optimum value of the regularization parameter from a given data set. The regular-
ization term can be regarded as representing a priori information about the model
parameters. The empirical Bayes approach and its more practical variant, Akaike’s
Bayesian Information Criterion, adjust the regularization parameter automatically
in response to the level of data noise and to the suitability of the assumed a priori
model information for the given data. When the noise level is high, the regularization
parameter is made large, which means that the a priori information is emphasized.
If the assumed a priori information is not suitable for the given data, the regular-
ization parameter is made small. Both these behaviours are desirable characteristics
for the regularized solutions of practical inverse problems. Four simple examples
are presented to illustrate these characteristics for an underdetermined problem, a
problem adopting an improper prior constraint and a problem having an unknown
data variance, all frequently encountered geophysical inverse problems. Numerical
experiments using Akaike’s Bayesian Information Criterion for synthetic data pro-
vide results consistent with these characteristics. In addition, concerning the selec-
tion of an appropriate type of a priori model information, a comparison between
four types of difference-operator model – the zeroth-, first-, second- and third-order
difference-operator models – suggests that the automatic determination of the opti-
mum regularization parameter becomes more difficult with increasing order of the
difference operators. Accordingly, taking the effect of data noise into account, it is
better to employ the lower-order difference-operator models for inversions of noisy
data.

INTRODUCTION

In almost all geophysical inverse problems, we want to know as much as possible about the detailed interior structures of the earth
from limited data which are contaminated by noise. Such geophysical inverse problems are ill-posed because of non-uniqueness

and ill-conditioning, meaning that small changes in the data lead to large changes in the constructed model. The least-squares
method is the most popular for solving inverse problems. For ill-posed linear problems, the least-squares method, incorporating
the stabilization of its ill-posed nature, minimizes the following objective function U (Tikhonov and Arsenin 1977):

U(m) = ||Wd (dobs − Am)||^2 + λ φ(m),    (1)

where dobs is an N-dimensional column vector composed of N data points, m is an M-dimensional column vector containing
M model parameters, A is an N × M linear operator, Wd is an N × N weighting matrix, φ(m) is a constraint or a penalty to
regularize the inversion, and hence λ is called a regularization parameter. The first term on the right-hand side of (1) corresponds
to the data misfit, and so λ determines the balance between the misfit and the constraint. This regularization technique is called
Tikhonov regularization. The estimation of m by minimizing U is strongly dependent on λ. In geophysics, non-linear problems
are in the majority. When the Gauss–Newton technique is applied to non-linear problems and the inversion is implemented
iteratively, equation (1) results from the linearized approximation at each iteration.
The selection of a reasonable value of λ has received a certain amount of attention in geophysical inverse problems. Several
useful techniques have been developed in the mathematical literature and applied to geophysical problems. In the most popular
approach, a target misfit based on, for example, the chi-squared distribution is determined in advance, and λ is adjusted so
that the data misfit matches the target misfit (e.g. Constable, Parker and Constable 1987). However, this scheme needs correct
information about the data noise, which, in practice, is difficult to determine. In such a case, the L-curve method (Hansen 1992)
and the generalized cross-validation method (Wahba 1990) are applicable (see also Farquharson and Oldenburg 2000; Haber
and Oldenburg 2000). Wang (2003) addressed the choice of the trade-off parameter λ and suggested normalizing the sensitivities of the linear operator A and an additional Hessian constraint so that λ becomes a dimensionless factor that is easy to control.
Recently, Bayesian inference has become popular in geophysical inverse problems. The reason for this is that a priori information
on the model is useful, sometimes essential, for regularizing the inversions. The application of Bayes’ theorem to inversion is
called Bayesian inversion (Scales and Snieder 1997). By regarding the constraint as a priori model information in Bayes’ theorem,
Akaike (1980) proposed Akaike's Bayesian Information Criterion (ABIC) and determined λ by minimizing the ABIC. The
ABIC minimization method has been applied to various geophysical inversions, and it has given successful results: for example,
analysis of earth tide data (Tamura et al. 1991), estimation of surficial density from gravity data (Murata 1993), deconvolution
of palaeomagnetic remanence data (Oda and Shibuya 1994), groundwater inverse analysis (Honjo and Kashiwagi 1999), 2D
inversion of magnetotelluric data (Uchida 1993; Ogawa and Uchida 1996), seismic waveform inversion (Yoshida 1989), the
Fourier transform (Mitsuhata et al. 2001) and 2.5D inversion of controlled-source electromagnetic data (Mitsuhata, Uchida and
Amano 2002). Although this method can be a powerful tool for ill-posed inverse problems, its mechanism for the adjustment
and selection of the regularization parameter seems to be unclear to many geophysicists.
In this paper, firstly an explanation is given of Bayes’ theorem, from which the a posteriori model distribution and the
predictive distribution for the observed data can be derived. In the empirical Bayes approach (O’Hagan 1994, p. 131), the
predictive distribution is used to determine the optimum values of λ and the data variance, and then with the determined value
of λ, the a posteriori model distribution is used to estimate the most probable model. These two procedures are illustrated
with simple examples, and it is shown how assumed a priori information is adjusted objectively in response to the input data.
Secondly, as a more practical version of the empirical Bayes approach, the ABIC minimization scheme is introduced and tested
on synthetic data. Using numerical experiments, the influence of the data noise and a priori model information in the ABIC
scheme is also investigated. In ill-posed inverse problems, the purpose of adopting the a priori model information is to regularize
the problem. Too much emphasis on the a priori information obstructs the contribution from the observed data. Consequently,
in practice, the regularization parameter should be adjusted objectively depending on the observed data, which means the a
posteriori determination of the regularization parameter (Groetsch 1993). In this paper, it is demonstrated that the empirical
Bayes approach and the ABIC minimization method realize this data-based adjustment of the a priori model information.

BAYESIAN INVERSION

When a priori knowledge about model parameters m, such as a priori model parameters mprior and a covariance matrix Cm , is
represented by zprior, the probability density functions (PDFs) for m and the observed data dobs, conditional on zprior, have the following relationship:

p(m, dobs | zprior ) = p(m | dobs ) p(dobs | zprior ) = p(dobs | m) p(m | zprior ), (2)

where p(A | B) means a conditional PDF for the proposition that A is true under the assumption that proposition B is true. From
(2), the PDF of m given dobs is obtained as

p(m | dobs) = p(dobs | m) p(m | zprior) / p(dobs | zprior).    (3)
This equation is known as Bayes’ theorem (Carlin and Louis 2000, p. 17), although it is usual to neglect explicitly writing zprior in
(3). The conditional PDFs, p(m | zprior ) and p(m | dobs ), are called the a priori model PDF and a posteriori model PDF, respectively,
and p(dobs | m) is the likelihood function. Integrating both sides of (2) with respect to m, and because

∫ p(m | dobs) dm = 1,    (4)

the denominator of (3) is given by



p(dobs | zprior) = ∫ p(dobs | m) p(m | zprior) dm.    (5)

This is the marginal PDF of dobs given zprior, and is sometimes called the predictive distribution of dobs in Bayesian analysis because
it describes the PDF of dobs for the prediction from zprior (Berger 1985, p. 95). If we have the a priori model information, it can
provide the a priori information on the observed data through a physical relationship, i.e. forward modelling. The predictive
distribution plays a very important role in determining the regularization parameter. However, it is conventionally regarded as
a constant for the estimation of the model parameters.
As a consequence, the a posteriori PDF is described as p(m | dobs ) ∝ p(dobs | m) · p(m | zprior ), and we should choose a solu-
tion for m that coincides with the maximum in p(m | dobs ). This estimation, based on Bayes’ theorem, is called the maximum
a posteriori (MAP) estimation. Hereafter, it is assumed that observed data contain only Gaussian noise and that the a priori
model PDF is described with Gaussian distributions in order to make the analytical calculations of p(m | dobs ) and p(dobs | zprior )
possible. This assumption may be unacceptable in some practical problems, in which case, numerical computations or some ap-
proximations of p(m | dobs ) and p(dobs | zprior ) can be applied. To avoid any confusion concerning the names of the PDFs, each PDF
is represented as p(m | zprior ) ⇒ π (m | zprior ), p(dobs | m) ⇒ f (dobs | m), p(m | dobs ) ⇒ h(m | dobs ) and p(dobs | zprior ) ⇒ l(dobs | zprior ).
Under the Gaussian noise assumption, the likelihood function is
 
f(dobs | m) = (2π)^(−N/2) det(Cd)^(−1/2) exp{−(1/2)(dobs − Am)^T Cd^(−1) (dobs − Am)},    (6)
where Cd is the data covariance matrix, det denotes the determinant of a matrix, and T signifies transpose. Moreover, the
following a priori model PDF is assumed:
 
π(m | zprior) = (2π)^(−M/2) det(Cm)^(−1/2) exp{−(1/2)(m − mprior)^T Cm^(−1) (m − mprior)},    (7)
where zprior is composed of mprior and Cm . By substituting (6) and (7) into (3), the a posteriori model PDF is written as

h(m | dobs) = l^(−1)(dobs | zprior) · (2π)^(−(N+M)/2) det(Cd)^(−1/2) det(Cm)^(−1/2) exp{−U(m)/2},    (8)

where

U(m) = (dobs − Am)^T Cd^(−1) (dobs − Am) + (m − mprior)^T Cm^(−1) (m − mprior).    (9)

Once mprior and Cm are specified, l(dobs | zprior ) becomes a constant. Therefore, the MAP estimation corresponds to minimizing
U(m).
Assuming that Cd and Cm are symmetric positive-definite matrices, they can be factorized as Cd = Ld Ld^T and Cm = Lm Lm^T by Cholesky factorization, where Ld and Lm are lower-triangular matrices. By defining Wd = Ld^(−1) and Wm = Lm^(−1), the objective function can be rewritten as follows:



U(m) = ||Wd (dobs − Am)||^2 + ||Wm (m − mprior)||^2.    (10)

When compared with (1), the role of the second term on the right-hand side of (10) is recognized as that of the regularization. In the case of Wm = λ^(1/2) IM, where IM is the M × M identity matrix, equation (10) is identical to (1) with φ(m) = ||m − mprior||^2. Furthermore, for the convenience of applying the least-squares method, equation (10) is rewritten
as

U(m) = ||W(d − Fm)||^2,    (11)

where

d = [ dobs
      mprior ],    (12)

F = [ A
      IM ],    (13)

W = [ Wd   0
      0    Wm ].    (14)
By applying ∂U(m)/∂m = 0, the least-squares estimate, m∗ , is obtained:

m∗ = (F^T W^T W F)^(−1) F^T W^T W d.    (15)

In addition, by using m∗ as shown in Appendix A, the objective function can be written as

U(m) = U(m∗) + (m − m∗)^T F^T W^T W F (m − m∗).    (16)

Substituting (6) and (7) into (5) and calculating the integration analytically (Appendix A), the predictive distribution can be
written as follows:
l(dobs | zprior) = (2π)^(−N/2) {det Cd · det Cm · det(F^T W^T W F)}^(−1/2) exp{−U(m∗)/2}.    (17)

After more rearrangement as shown in Appendix A, equation (17) can be written as


 
l(dobs | zprior) = (2π)^(−N/2) (det Cdp)^(−1/2) exp{−(1/2)(dobs − Amprior)^T Cdp^(−1) (dobs − Amprior)},    (18)

where

Cdp = A Cm A^T + Cd.    (19)

Equation (18) explicitly shows that l(dobs | zprior ) is the PDF of dobs for a given zprior and its covariance matrix Cdp is composed of
Cm and Cd . Substituting (16) and (17) into (8), the a posteriori model PDF is obtained as
 
h(m | dobs) = (2π)^(−M/2) det(F^T W^T W F)^(1/2) exp{−(1/2)(m − m∗)^T (F^T W^T W F)(m − m∗)}
            = (2π)^(−M/2) (det Cm∗)^(−1/2) exp{−(1/2)(m − m∗)^T Cm∗^(−1) (m − m∗)},    (20)

where, by comparison with the Gaussian distributions of (6) and (7), Cm∗ is a covariance matrix of m∗ and is given by
Cm∗ = (F^T W^T W F)^(−1) = (A^T Cd^(−1) A + Cm^(−1))^(−1).    (21)



EMPIRICAL BAYES APPROACH

The determination of the regularization parameter is a critical component of regularized inverse problems. As the previous
section shows, the regularization parameter is related to the variance of the a priori model PDF in a Bayesian inversion, and its
determination is equivalent to the estimation of a reasonable variance. In practice, there may be some statistical ways to estimate
mprior and Cm . For geophysical data measured on the earth’s surface, analysis of well-logging data can provide mprior and Cm
for the subsurface. However, well-logging data can have a very different scale of resolution from surface measurements, and the
well may not be located at the same position as the measurements. Moreover, it is often the case that there are no well-logging
data available. In such cases, we must resort to more general a priori information such as the smoothness of the model.
As described in the previous section, the predictive distribution, l(dobs | zprior ), represents the PDF of dobs given zprior . More
generally, it can be rewritten as l(dobs | π ) where the a priori information is expressed as the symbol π. When we have two
assumptions for the a priori information, for example, π 1 and π 2 , where l(dobs | π 1 ) > l(dobs | π 2 ), we naturally select π 1 as the
more reasonable candidate for the a priori information (Berger 1985, p. 99). The selection of π by using l(dobs | π) is called the
empirical Bayes approach (e.g. O’Hagan 1994, p. 131), in which π is selected from the same data for which the a posteriori
model PDF is maximized. The a priori information π can be parametrized with zprior , zprior being called hyperparameters (Everitt
1998, pp. 158–159). The estimation of zprior by the empirical Bayes approach is called the parametric empirical Bayes approach (Carlin
and Louis 2000, pp. 57–58). A simple form of the parametric empirical Bayes approach is to regard l(dobs | zprior ) as a likelihood
function of zprior , and to select zprior by the maximum-likelihood method (Berger 1985, pp. 99–101). Good (1965) suggested this
method and called it the type II maximum-likelihood method. l(dobs | zprior ) is also called the Bayesian likelihood (Akaike 1980)
or the evidence (e.g. MacKay 1992; Malinverno 2000). Reasonable zprior for dobs can be determined by maximizing l(dobs | zprior ).
Hereafter this method is called maximum predictive distribution.
The empirical Bayes approach violates the rigorous Bayes’ theorem which requires the a priori model PDF to be independent
of a current data set (Press 1989, p. 43; Scales and Snieder 1997), but it proves to be a practical and effective scheme for the
adjustment of the regularization parameters in response to observed data. In the following four simple examples, it is demonstrated
how the variance of the a priori model PDF is estimated from the maximum predictive distribution.

EXAMPLES OF THE EMPIRICAL BAYES APPROACH

The first example in this section deals with the simplest problem, which is helpful in understanding the empirical Bayes process.
Similar cases were discussed by Berger (1985) and Carlin and Louis (2000). The other three examples treat an underdetermined
problem, a problem using an improper prior constraint, and a problem having an unknown data variance, all frequently
encountered geophysical problems. These three examples are most important in demonstrating how the empirical Bayes approach
works in regularized inverse problems.

Example 1

In this example we consider the case with one datum dobs and one model parameter m, related by the expression dobs = gm + ε d
with Gaussian noise εd , and we assume the data variance σ 2d is known. In addition, we have a priori information on the model
parameter: m = mprior + εm with Gaussian error εm . The likelihood function and the a priori model PDF are given by
f(dobs | m) = (2πσd^2)^(−1/2) exp{−(dobs − gm)^2/(2σd^2)}    (22)

and
π(m | mprior) = (2πσm^2)^(−1/2) exp{−(m − mprior)^2/(2σm^2)},    (23)

where σ 2m is the variance of the a priori model PDF. After simple calculation in consideration of the MAP, that is, ∂U(m)/∂m =
0, the objective function can be rewritten as



U = (dobs − gm)^2/σd^2 + (m − mprior)^2/σm^2
  = (m − m∗)^2/σm∗^2 + (dobs − gmprior)^2/(g^2 σm^2 + σd^2),    (24)

where the model estimate m∗ and its variance σm∗^2 are given by

m∗ = (g σm^2 dobs + σd^2 mprior)/(g^2 σm^2 + σd^2)    (25)

and

σm∗^2 = σm^2 σd^2/(g^2 σm^2 + σd^2).    (26)

As we can see from (25) and (26), when σd^2 ≫ σm^2, m∗ → mprior and σm∗^2 → σm^2. Conversely, when σm^2 ≫ σd^2, m∗ → dobs/g and σm∗^2 → σd^2/g^2. In addition, σm∗^2 < σm^2 is always satisfied, which means that the information from the observed data always improves the reliability of the model parameters.
From (18)–(21), the predictive distribution and the a posteriori model PDF are given by
l(dobs | mprior) = {2π(g^2 σm^2 + σd^2)}^(−1/2) exp{−(dobs − gmprior)^2/[2(g^2 σm^2 + σd^2)]}    (27)

and
h(m | dobs) = (2πσm∗^2)^(−1/2) exp{−(m − m∗)^2/(2σm∗^2)}.    (28)

In order to estimate the value of σm^2 that maximizes log l, rather than l itself, the derivative of log l with respect to σm^2 is calculated, i.e.

∂ log l(dobs | mprior)/∂σm^2 = −{g^2/[2(g^2 σm^2 + σd^2)]}{1 − (dobs − gmprior)^2/(g^2 σm^2 + σd^2)}.    (29)

This becomes equal to zero when


σm^2 = {(dobs − gmprior)^2 − σd^2}/g^2,    (30)

and if (dobs – gmprior )2 ≤ σ 2d , the derivative in (29) is always negative. Therefore σ 2m = 0 should be selected (Berger 1985, p. 100),
meaning that mprior is sufficient for the estimation of m in this case. Setting τ = σ 2d /(dobs – gmprior )2 and substituting (30) into
(25), we obtain
m∗ = (1 − τ) dobs/g + τ mprior,   for τ < 1,
   = mprior,                      for τ ≥ 1.    (31)
This result is illustrated in Fig. 1, where the case for g = 1 is considered for simplicity. The cases of τ < 1 and τ ≥ 1 are represented
by Figs 1(a) and 1(b), respectively. When τ < 1, σ 2m is adjusted to be large if mprior is far from dobs (Fig. 1c), and σ 2m becomes
small if mprior is close to dobs (Fig. 1d). These illustrations show that the maximization of the predictive distribution adjusts σ 2m
depending on the data reliability σ 2d and the consistency of mprior with dobs .
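As a hedged numerical illustration (Python; g, dobs and σd^2 below are arbitrary values chosen for this sketch, not taken from the paper), the adjustment expressed by (30) and (31) can be checked directly:

    def empirical_bayes_1d(d_obs, g, m_prior, var_d):
        # sigma_m^2 from (30) and the resulting estimate m* from (31).
        var_m = ((d_obs - g * m_prior) ** 2 - var_d) / g ** 2        # (30)
        if var_m <= 0.0:                                             # tau >= 1: keep the prior
            return 0.0, m_prior
        tau = var_d / (d_obs - g * m_prior) ** 2
        return var_m, (1.0 - tau) * d_obs / g + tau * m_prior        # (31)

    # mprior far from dobs -> large sigma_m^2, m* moves towards dobs/g (Fig. 1c);
    # mprior close to dobs -> small sigma_m^2, m* stays near mprior (Fig. 1d).
    print(empirical_bayes_1d(d_obs=1.0, g=1.0, m_prior=0.0, var_d=0.0025))   # (0.9975, 0.9975)
    print(empirical_bayes_1d(d_obs=1.0, g=1.0, m_prior=0.9, var_d=0.0025))   # (0.0075, 0.975)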

Example 2

In this example, we consider an underdetermined problem in which two model parameters, m1 and m2 , are estimated from only
one datum, d, with the relationship d = m1 + m2 + εd . This problem cannot be solved without some a priori model information,
and hence it is assumed that the mean values of m1 and m2 are equal to zero. The likelihood function and the a priori model
PDF are written as
f(d | m1, m2) = (2πσd^2)^(−1/2) exp{−(d − m1 − m2)^2/(2σd^2)}    (32)



Figure 1 Schematic diagrams of the adjustment of σ 2m by the maximum predictive distribution in the case of g = 1 of example 1. (a) When τ <
1, σ 2m has a finite value and m∗ is between dobs and mprior , but (b) when τ ≥ 1, σ 2m is equal to zero and m∗ remains equal to mprior . In the case
of τ < 1, σ 2m is adjusted depending on the consistency between dobs and mprior . Namely (c) if mprior is far from dobs , σ 2m is adjusted to be large,
otherwise (d) if mprior is close to dobs , σ 2m becomes small. Thick solid black lines and thick broken grey lines indicate the probability densities of
f (dobs | m) and π (m | mprior ), respectively.

and
π(m1, m2) = (2πσm^2)^(−1) exp{−(m1^2 + m2^2)/(2σm^2)}.    (33)

In the same way as the previous example, applying ∂U(m)/∂m = 0 gives the following estimations of m1 and m2 :

 
m1∗ = m2∗ = σm^2 d/(2σm^2 + σd^2).    (34)

In this case, we can write Cd = σ 2d , Cm = diag(σ 2m , σ 2m ) and A = [1 1]. Substituting them into (18)–(21) gives the predictive
distribution and a posteriori model PDF:
l(d) = {2π(σd^2 + 2σm^2)}^(−1/2) exp{−d^2/[2(σd^2 + 2σm^2)]}    (35)

and
    
h(m1, m2 | d) = {1/(2πσm^2)} {(σd^2 + 2σm^2)/σd^2}^(1/2) exp{−(1/2) [m1 − m1∗, m2 − m2∗] Cm∗^(−1) [m1 − m1∗, m2 − m2∗]^T},    (36)

where
 
Cm∗ = {σm^2/(σd^2 + 2σm^2)} [ σd^2 + σm^2    −σm^2
                              −σm^2    σd^2 + σm^2 ].    (37)
By setting ∂log l/∂σ 2m = 0, σ 2m is obtained as
 

σm^2 = (d^2 − σd^2)/2.    (38)



If d2 < σ 2d , the derivative ∂log l/∂σ 2m is always negative, and hence σ 2m = 0 should be selected. By substituting (38) into (34), and
by introducing the parameter τ = σ 2d /d2 , the estimated model parameters become

m1∗ = m2∗ = d(1 − τ)/2,   for τ < 1,
m1∗ = m2∗ = 0,            for τ ≥ 1.    (39)

Figure 2 explains this result, where the cases of τ < 1 and τ ≥ 1 are represented by Figs 2(a) and 2(b), respectively. When d is precise, i.e. σd^2 → 0, or d^2 becomes much larger than σd^2, τ → 0 and m1∗ and m2∗ approach d/2. When τ < 1 and σd^2 is constant, if d^2 is much larger than σd^2, σm^2 is adjusted to be large, and consequently m1∗ and m2∗ become close to d/2 (Fig. 2c). If d^2 is not much larger than σd^2, σm^2 is adjusted to be small, and m1∗ and m2∗ tend to zero (Fig. 2d).
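A similar sketch for this underdetermined case (Python; the datum and noise variance are illustrative choices) evaluates (38) and (39):

    def underdetermined_example(d, var_d):
        # sigma_m^2 from (38) and the common estimate m1* = m2* from (39).
        var_m = (d ** 2 - var_d) / 2.0                # (38)
        if var_m <= 0.0:                              # tau >= 1
            return 0.0, 0.0
        tau = var_d / d ** 2
        return var_m, d * (1.0 - tau) / 2.0           # (39)

    # d^2 much larger than sigma_d^2: the estimates approach d/2 (Fig. 2c);
    # d^2 comparable to sigma_d^2: the estimates are pulled towards zero (Fig. 2d).
    print(underdetermined_example(d=10.0, var_d=1.0))   # (49.5, 4.95)
    print(underdetermined_example(d=1.5, var_d=1.0))    # (0.625, 0.4166...)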

Figure 2 Schematic diagrams of the adjustment of σ 2m by the maximum predictive distribution in example 2. (a) When τ < 1, σ 2m has a finite
value, and m∗1 and m∗2 lie between d/2 and zero, but (b) when τ ≥ 1, σ 2m should be equal to zero, and m∗1 and m∗2 become zero. In the case of
τ < 1 with a constant σ 2d , σ 2m is adjusted depending on the value of d2 . Namely (c) if d2 is much larger than σ 2d , σ 2m is adjusted to be large, and
as a result, m∗1 and m∗2 become close to d/2. (d) If d2 is not much larger than σ 2d , σ 2m is adjusted to be small, and m∗1 and m∗2 tend to zero. Dashed
grey lines and solid black concentric circles represent the contours of f (d | m1 , m2 ) and π (m1 , m2 ), respectively.



Example 3

In the third example, we consider the problem of estimating m1 and m2 from observed data d1 and d2 with the relationship:
d1 = m1 + εd1 and d2 = m2 + εd2 , where the Gaussian noise εd1 and εd2 have the same variance. It is also assumed that the a
priori model information can be written as m1 ∼ = m2 and that the data variance σ 2d is known. The likelihood function and the a
priori model PDF are therefore
f(d1, d2 | m1, m2) = (2πσd^2)^(−1) exp{−[(d1 − m1)^2 + (d2 − m2)^2]/(2σd^2)}    (40)

and
π(m1, m2) = (2πσm^2)^(−1/2) exp{−(m1 − m2)^2/(2σm^2)}.    (41)

The above a priori model PDF is a kind of improper prior distribution since its integration with respect to m1 and m2 is not
one but infinity (Everitt 1998, p. 161). Equation (41) gives only the relationship between m1 and m2 , from which we cannot
determine their individual values. The use of an improper a priori distribution leads to some scepticism about the Bayesian
approach (e.g. Press 1989, p. 48). However, even vague or general a priori information, such as smoothness, can be useful in inverse problems, and such information tends to be represented by improper a priori distributions. The objective function for the MAP
is given by

U(m1, m2) = [(d1 − m1)^2 + (d2 − m2)^2]/σd^2 + (m1 − m2)^2/σm^2,    (42)

and from the derivatives, ∂U/∂m1 and ∂U/∂m2 , the estimates of m1 and m2 are obtained as

m1∗ = d1 − (d1 − d2)σd^2/(σm^2 + 2σd^2)   and   m2∗ = d2 + (d1 − d2)σd^2/(σm^2 + 2σd^2).    (43)

These estimates show that when σd^2 ≫ σm^2, m1∗ and m2∗ → (d1 + d2)/2, and conversely, when σm^2 ≫ σd^2, m1∗ → d1 and m2∗ → d2.
By substituting these estimates into (42), the minimum of U becomes

U(m1∗, m2∗) = (d1 − d2)^2/(σm^2 + 2σd^2),    (44)

and, as shown in Appendix B, U can be rewritten in the form,


 
U(m1, m2) = [m1 − m1∗, m2 − m2∗] G^T G [m1 − m1∗, m2 − m2∗]^T + U(m1∗, m2∗),    (45)
where
 
G = [ 1/σd      0
       0      1/σd
      1/σm   −1/σm ].    (46)
Integrating the product of (40) and (41) to evaluate the predictive distribution, and using (44)–(46) as shown in Appendix B, we
obtain
 
l(d1, d2) = {2π(σm^2 + 2σd^2)}^(−1/2) exp{−(d1 − d2)^2/[2(σm^2 + 2σd^2)]}.    (47)

It should be noted that l is also an improper distribution in terms of d1 and d2, involving only the difference between d1 and d2. Nevertheless, it indicates that data with d1 = d2 are the most probable, which is plausible under the given a priori model information. The a posteriori model PDF is obtained from (3) as
 
h(m1, m2 | d1, d2) = {1/(2πσd^2)} {(σm^2 + 2σd^2)/σm^2}^(1/2) exp{−(1/2) [m1 − m1∗, m2 − m2∗] Cm∗^(−1) [m1 − m1∗, m2 − m2∗]^T},    (48)



where
 
Cm∗ = (G^T G)^(−1) = {σd^2/(σm^2 + 2σd^2)} [ σm^2 + σd^2    σd^2
                                             σd^2    σm^2 + σd^2 ].    (49)

As a result, the covariance matrix of the estimated model parameters has off-diagonal elements due to the a priori information.
If σ 2m → ∞, they disappear. From ∂log l/∂σ 2m = 0, σ 2m is evaluated as

σm^2 = (d1 − d2)^2 − 2σd^2,    (50)

and if (d1 – d2 )2 ≤ 2σ 2d , the derivative, ∂log l/∂σ 2m , is always negative, indicating that σ 2m = 0 should be selected. By the substitution
of (50) into (43), and by using the parameter τ = 2σ 2d /(d1 – d2 )2 , the estimated model parameters are given by

m∗1 = (1 − τ )d1 + τ (d1 + d2 )/2 and m∗2 = (1 − τ )d2 + τ (d1 + d2 )/2, for τ < 1,
m∗1 = m∗2 = (d1 + d2 )/2, for τ ≥ 1. (51)

Figure 3 shows the schematic diagrams of the result. The cases of τ < 1 and τ ≥ 1 are represented by Figs 3(a) and 3(b),
respectively. In addition, for the case of τ < 1 with a constant σ 2d , σ 2m is adjusted to be large if the point (d1 , d2 ) is far from
the line m1 = m2 (Fig. 3c), and σ 2m becomes small if the point (d1 , d2 ) is close to the line m1 = m2 (Fig. 3d). In the same way
as for the previous examples, the maximum predictive distribution adjusts σ 2m depending on the data noise and the consistency
between the observed data and the a priori model information.
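Again as an illustrative sketch (Python; the data values and σd^2 are arbitrary, and the helper name is an invented one), (50), (43) and (51) can be evaluated for this improper-prior example:

    def smoothness_example(d1, d2, var_d):
        # sigma_m^2 from (50) and (m1*, m2*) from (43)/(51) under the improper prior m1 ~ m2.
        var_m = (d1 - d2) ** 2 - 2.0 * var_d          # (50)
        if var_m <= 0.0:                              # tau >= 1: both estimates collapse to the mean
            return 0.0, (d1 + d2) / 2.0, (d1 + d2) / 2.0
        tau = 2.0 * var_d / (d1 - d2) ** 2
        m1 = (1.0 - tau) * d1 + tau * (d1 + d2) / 2.0 # (51)
        m2 = (1.0 - tau) * d2 + tau * (d1 + d2) / 2.0
        return var_m, m1, m2

    # (d1, d2) far from the line m1 = m2: weak smoothing (Fig. 3c);
    # (d1, d2) close to the line m1 = m2: the estimates are averaged (Fig. 3d).
    print(smoothness_example(d1=3.0, d2=1.0, var_d=0.1))   # (3.8, 2.95, 1.05)
    print(smoothness_example(d1=2.1, d2=1.9, var_d=0.1))   # (0.0, 2.0, 2.0)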

Example 4

In most cases, the data variance σ 2d is unknown. Then the approach of using a target data misfit is inapplicable. To simulate
such a situation, we consider the problem of estimating one model parameter m from two observed data d1 and d2 with the
relationship: d1 = m + εd1 and d2 = m + εd2 under the a priori model information of m = mprior + ε m . The likelihood function
composed of two data is given by
f(d1, d2 | m) = (2πσd^2)^(−1) exp{−[(d1 − m)^2 + (d2 − m)^2]/(2σd^2)},    (52)

and we can use (23) as the a priori model PDF. Consequently the objective function becomes

U(m) = [(d1 − m)^2 + (d2 − m)^2]/σd^2 + (m − mprior)^2/σm^2.    (53)

By setting ∂U/∂m = 0, the estimation of the model parameter is given by

m∗ = {2σm^2/(2σm^2 + σd^2)} (d1 + d2)/2 + {σd^2/(2σm^2 + σd^2)} mprior.    (54)

After calculations given in Appendix C, the predictive distribution can be written as


 
l(d1, d2 | mprior) = {(2π)^2 σd^2 (2σm^2 + σd^2)}^(−1/2) exp{−(1/2) d̂^T Cdp^(−1) d̂},    (55)

where d̂ = [d̂1, d̂2]^T, d̂1 = d1 − mprior, d̂2 = d2 − mprior,

and
 
Cdp^(−1) = {1/[σd^2(2σm^2 + σd^2)]} [ σm^2 + σd^2    −σm^2
                                      −σm^2    σm^2 + σd^2 ].    (56)



Figure 3 Schematic diagrams of the adjustment of σ 2m by the maximum predictive distribution in example 3. (a) When τ < 1, σ 2m has a finite
value and the estimation, m∗ , approaches the line m1 = m2 depending on σ 2m , but (b) when τ ≥ 1, σ 2m should be zero and m∗ is fixed on the line
m1 = m2 . In the case of τ < 1 with a constant σ 2d , σ 2m is adjusted depending on the consistency between dobs and the a priori model information
of m1 ∼= m2 . (c) If the point (d1 , d2 ) is far from the line m1 = m2 , σ 2m is adjusted to be large and m∗ approaches dobs . (d) In contrast, if the point
(d1 , d2 ) is close to the line m1 = m2 , σ 2m becomes small and m∗ moves on the line perpendicular to the line m1 = m2 . Solid black concentric
circles and dashed grey lines represent the contours of f (d1 , d2 | m1 , m2 ) and π (m1 , m2 ), respectively.

Calculating the derivatives of log l with respect to σm^2 and σd^2, and setting ∂log l/∂σm^2 = ∂log l/∂σd^2 = 0, the following estimates of σm^2 and σd^2 are obtained (Appendix C):

σm^2 = d̂1 d̂2   and   σd^2 = (d1 − d2)^2/2,   for d̂1 d̂2 > 0    (57)

and

σm^2 = 0   and   σd^2 = (d̂1^2 + d̂2^2)/2,   for d̂1 d̂2 ≤ 0.    (58)



By substituting these estimated variances into (54), m∗ becomes

m∗ = (1 − τ)(d1 + d2)/2 + τ mprior,   for τ < 1,
   = mprior,                          for τ ≥ 1,    (59)

where τ = (d̂1 − d̂2 )2 /(d̂1 + d̂2 )2 . Corresponding to (57) and (58), τ < 1 for d̂1 d̂2 > 0 and τ ≥ 1 for d̂1 d̂2 ≤ 0. This result provides
significant insight. For instance, when we have a priori information that mprior = 5 and the observed data are d1 = 9 and d2 =
3, the estimated result is m∗ = mprior = 5, σ 2m = 0 and σ 2d = 10, based on (58) and (59). This means that we should adopt the
a priori model mprior as the estimated model if it lies between d1 and d2 , that is, when τ ≥ 1. In the case of d1 = 9 and d2 = 7 under
mprior = 5, the results are m∗ = 7.67, σ 2m = 8 and σ 2d = 2. Figure 4 illustrates the result in this example. The cases of τ < 1 and
τ ≥ 1 are represented by Figs 4(a) and 4(b), respectively. The smaller the data noise, the smaller is the difference between d1
and d2 . Therefore, the estimation from the data with large noise results in the case of τ ≥ 1 (Fig. 4b). In such a case, the data
variance is estimated to be large. In the case of τ < 1, σ 2m is adjusted to be large if mprior is far from d1 and d2 (Fig. 4c), and σ 2m
becomes small if mprior is close to d1 and d2 (Fig. 4d). In addition, in both of these cases, the estimation of the data variance is
the same and is independent of mprior , as shown in (57).
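The numbers quoted in this example can be reproduced by a direct transcription of (57)–(59) (Python; the function name is an illustrative choice, not from the paper):

    def unknown_variance_example(d1, d2, m_prior):
        # Joint estimates of sigma_m^2 and sigma_d^2 from (57)-(58) and m* from (59).
        dh1, dh2 = d1 - m_prior, d2 - m_prior
        if dh1 * dh2 > 0.0:                                           # (57): tau < 1
            var_m, var_d = dh1 * dh2, (d1 - d2) ** 2 / 2.0
            tau = (dh1 - dh2) ** 2 / (dh1 + dh2) ** 2
            m_star = (1.0 - tau) * (d1 + d2) / 2.0 + tau * m_prior    # (59)
        else:                                                         # (58): tau >= 1
            var_m, var_d = 0.0, (dh1 ** 2 + dh2 ** 2) / 2.0
            m_star = m_prior
        return m_star, var_m, var_d

    print(unknown_variance_example(9.0, 3.0, 5.0))   # (5.0, 0.0, 10.0)  - mprior between d1 and d2
    print(unknown_variance_example(9.0, 7.0, 5.0))   # (7.666..., 8.0, 2.0)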

Figure 4 Schematic diagrams of the adjustment of σ 2m and σ 2d by the maximum predictive distribution in example 4, where the likelihood
function is represented by the product of the likelihood for each datum, f (d1 , d2 | m) = f 1 (d1 | m)· f 2 (d2 | m). (a) When τ < 1, σ 2m has a finite
value and m∗ is estimated between mprior and (d1 + d2 )/2. (b) If the difference between d1 and d2 becomes large because of a large noise term
and mprior lies between d1 and d2 (τ ≥ 1), σ 2m becomes equal to zero, σ 2d becomes large, and m∗ = mprior . (c) In the case of τ < 1, if mprior is far
from d1 and d2 , σ 2m is adjusted to be large and m∗ is close to (d1 + d2 )/2. (d) If mprior is close to d1 and d2 , σ 2m becomes small and m∗ is pulled
to mprior . In both cases, the estimation of σ 2d is independent of mprior . Thick solid black lines indicate the probability densities of f 1 (d1 | m) and
f 2 (d2 | m), and thick broken grey lines show the probability densities of π(m | mprior ).



The four examples described above demonstrate that maximizing the predictive distribution adjusts σ 2m in response to the
data reliability and the consistency between the observed data and the a priori model information, that is, the suitability of the
a priori model information for the observed data.

ABIC MINIMIZATION METHOD

Akaike (1980) developed Akaike’s Bayesian Information Criterion (ABIC), and proposed the use of the hyperparameters mini-
mizing the ABIC. This method provides a practical scheme for maximum predictive distribution because it is applicable in the
case when σ 2d is unknown, by treating σ 2d as one of the hyperparameters.
In some situations, correlated noise has to be considered. Mitsuhata et al. (2001) dealt with correlated Gaussian noise in
the ABIC minimization. In this paper, however, we assume uncorrelated Gaussian noise to keep the problem simple. Thus the
likelihood function can be written as
 
f(dobs | m) = (2πσd^2)^(−N/2) exp{−(1/(2σd^2))(dobs − Am)^T (dobs − Am)},    (60)

where the data variance σ 2d is assumed to be unknown and is determined by minimizing the ABIC. For one-dimensional model
space, m(x), the following constraint function is considered as a priori information:
φk(m) = ∫ {∂^k(m − mprior)/∂x^k}^2 dx,    (61)

where k = 0, 1, . . . , and mprior is the a priori model parameter, which is assumed to have a constant value. Equation (61) is discretized with a step size Δ as

φk(m) = Δ^(−2k+1) ||Dk(m − mprior)||^2,    (62)

where Dk is the kth-order differential operator. By considering (62) as a constraint function, the a priori model PDF is written as
 
  
π(m | σd^2, λ, mprior) = {λΔ^(−2k+1)/(2πσd^2)}^(Mk/2) exp{−(λΔ^(−2k+1)/(2σd^2))(m − mprior)^T Dk^T Dk (m − mprior)},    (63)

where Mk is the rank of Dk, given by Mk = M − k. It seems strange that (63) contains the data variance σd^2 as a factor of Cm^(−1). However, this makes the analytical derivative of the ABIC with respect to σd^2 possible, as is shown later. The difference operator for k = 0 is given by D0 = IM. For k = 1, a flat model will result, and the first-order differential operator is given by
 
D1 = [ −1   1   0   ···   0   0
        0  −1   1   ···   0   0
        ⋮        ⋱    ⋱       ⋮
        0   0   ···  0  −1   1
        0   0   ···  0   0   0 ].    (64)

For k = 2, a smooth model will be produced, and the second-order differential operator is given by
 
D2 = [ 1  −2   1   0   ···   0
       0   1  −2   1   ···   0
       ⋮        ⋱   ⋱   ⋱    ⋮
       0   ···  0   1  −2   1
       0   ···  0   0   0   0
       0   ···  0   0   0   0 ].    (65)



For k = 3, the third-order differential operator is written as


 
D3 = [ 1  −3   3  −1   0   ···   0
       0   1  −3   3  −1   ···   0
       ⋮            ⋱   ⋱   ⋱    ⋮
       0   ···  0   1  −3   3  −1
       0   ···  0   0   0   0   0
       0   ···  0   0   0   0   0
       0   ···  0   0   0   0   0 ].    (66)
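As a small construction sketch (Python; the helper name difference_operator is an illustrative choice, not from the paper), the operators of (64)–(66) can be generated for any M and k; the k trailing rows of zeros keep Dk square with rank Mk = M − k:

    import math
    import numpy as np

    def difference_operator(M, k):
        # k-th order difference operator of (64)-(66): M - k difference rows followed by
        # k rows of zeros, so Dk is M x M with rank Mk = M - k. The overall sign of a row
        # is immaterial because only Dk^T Dk enters the constraint (62)-(63).
        coeffs = [(-1) ** j * math.comb(k, j) for j in range(k + 1)]   # e.g. k = 2 -> [1, -2, 1]
        Dk = np.zeros((M, M))
        for i in range(M - k):
            Dk[i, i:i + k + 1] = coeffs
        return Dk

    print(difference_operator(6, 2))   # rows of [1, -2, 1] with the last two rows zero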

Except for the case of k = 0, the a priori model PDF is improper, that is, the ranks of D1 , D2 and D3 are smaller than M, which
means that the integral of the a priori model PDF with respect to m is infinity. If the null space of the a priori model PDF overlaps
the null space of the likelihood function, the regularization will fail (Gouveia and Scales 1997) and the a posteriori model PDF
will be improper. Such cases rarely occur, but we need to keep them in mind. From the comparison of (7) and (63), we know
Cm^(−1) = (λΔ^(−2k+1)/σd^2) Dk^T Dk, but this is not invertible except for the case of k = 0 (Ory and Pratt 1995). This fact makes the elegant
description of Cdp in (19) impossible. However, despite the above-mentioned drawbacks, the use of the improper a priori model
PDF is important and effective in regularizing inversions as long as the a posteriori model PDF is proper. From the product of
(60) and (63), the objective function to be minimized is given by

U(m) = (dobs − Am)^T (dobs − Am) + λΔ^(−2k+1) (m − mprior)^T Dk^T Dk (m − mprior).    (67)

By setting Wd = IN and Wm = (λΔ^(−2k+1))^(1/2) Dk, using (10)–(15), and following the same derivation that gave (17), the predictive distribution is obtained as

l(dobs | σd^2, λ, mprior) = (2πσd^2)^(−(N+Mk−M)/2) (λΔ^(−2k+1))^(Mk/2) · det Wd · {det(F^T W^T W F)}^(−1/2) exp{−U(m∗)/(2σd^2)}.    (68)

The a posteriori model PDF is given by the same form as that of (20), and from (21) the covariance Cm∗ is obtained as
Cm∗ = σd^2 {A^T Wd^T Wd A + λΔ^(−2k+1) Dk^T Dk}^(−1).    (69)

From analogy with the AIC (Akaike’s Information Criterion, Akaike 1973, 1974), Akaike (1980) defined the ABIC as
  

ABIC = −2 ln l(dobs | σd^2, λ, mprior) + 2Nh,    (70)

where Nh is the number of hyperparameters. The hyperparameters, σ 2d and λ (and also mprior for some cases) are adjusted to
minimize the ABIC. Substituting (68) into (70) and calculating ∂ABIC/∂σ 2d = 0 to estimate the optimum σ 2d , we obtain

U(m∗ )
σd2 = . (71)
N + Mk − M

Consequently, by using (71), ABIC can be written as


 
ABIC = (N + Mk − M) ln 2π − Mk ln λ + Mk(2k − 1) ln Δ + (N + Mk − M) ln{U(m∗)/(N + Mk − M)}
       − 2 ln(det Wd) + ln{det(F^T W^T W F)} + N + Mk − M + 2Nh.    (72)

In practical applications, for a given λ, m∗ is estimated based on the MAP, and σ 2d (λ) and the ABIC(λ) are computed. Then by
using a one-dimensional search for the minimum of ABIC with respect to λ, we determine the optimal values of λ and σ 2d , and
thus the optimum model parameters.
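The procedure can be summarized in a short sketch (Python with NumPy; the function names, the use of the difference_operator helper sketched above, and the coarse logarithmic grid are illustrative assumptions rather than the author's implementation). For a trial λ it forms Wm, solves for m∗, evaluates σd^2(λ) from (71) and the ABIC from (72) with Wd = IN, and a simple grid search then picks the minimizer:

    import numpy as np

    def abic_for_lambda(A, d_obs, m_prior, Dk, k, delta, lam, n_hyper=2):
        # ABIC of (72) for a trial regularization parameter lam, assuming Wd = IN.
        N, M = A.shape
        Mk = M - k                                                    # rank of Dk
        scale = lam * delta ** (-2 * k + 1)
        Wm = np.sqrt(scale) * Dk
        H = A.T @ A + Wm.T @ Wm                                       # F^T W^T W F of (15)
        m_star = np.linalg.solve(H, A.T @ d_obs + Wm.T @ (Wm @ m_prior))
        U_star = (np.linalg.norm(d_obs - A @ m_star) ** 2
                  + scale * np.linalg.norm(Dk @ (m_star - m_prior)) ** 2)   # (67)
        var_d = U_star / (N + Mk - M)                                 # (71)
        _, logdet = np.linalg.slogdet(H)
        abic = ((N + Mk - M) * np.log(2.0 * np.pi)
                - Mk * np.log(lam)
                + Mk * (2 * k - 1) * np.log(delta)
                + (N + Mk - M) * np.log(var_d)
                + logdet
                + (N + Mk - M) + 2 * n_hyper)                         # (72), with det Wd = 1
        return abic, m_star, var_d

    def abic_search(A, d_obs, m_prior, Dk, k, delta, lambdas):
        # Coarse one-dimensional search for the lambda that minimizes the ABIC.
        results = [abic_for_lambda(A, d_obs, m_prior, Dk, k, delta, lam) for lam in lambdas]
        best = int(np.argmin([abic for abic, _, _ in results]))
        return lambdas[best], results[best]

In practice a finer search (or a golden-section refinement) around the coarse minimizer would be used, but the sketch reflects the one-dimensional search described above.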



SYNTHETIC DATA EXAMPLES

The usefulness of the ABIC minimization method for regularized inverse problems is demonstrated by some examples from
numerical experiments. The main purpose of the experiments is to show how the ABIC minimization method adjusts the
regularization parameter in response to the data noise and the suitability of the a priori model information for the observed
data, just as in the empirical Bayes approach illustrated by the preceding simple examples. In discrete ill-posed problems arising
in a variety of applications, the underlying mathematical problem is often a linear Fredholm integral equation of the first kind
(Hansen 1992). Therefore, as an example of the linear inverse problem in this paper, the Fredholm integral equation of the first
kind by Matsuoka (1986) is adopted, but with a different function for describing the model parameters. The synthetic data
dobs (t), the kernel G(t,u) and the model m(u) are given by
dobs(ti) = ∫₀¹ G(ti, u) m(u) du + εdi,    (73)

G(ti , u) = exp{−(ti + 1)u} (74)

and

m(u) = a sin(π u), (75)

where a is a given value, dobs was sampled at t = 0, 0.25, 0.50, . . . , 7.50, giving dobs = [d1 , d2 , . . . , dN ]T with N = 31. The
model was discretized with a spacing of  = 0.025 for 0.0 ≤ u ≤ 1.00, that is, m = [m1 , m2 , . . . , mM ]T with M = 41, and
m(u) was approximated by the linear interpolation of m. The elements of the matrix A were calculated analytically (Appendix
D). Uncorrelated Gaussian noise, εdi ∼ N(0, σt^2), was added to the data. Figure 5 shows the synthetic data calculated for the
model with a = 1 in (75). These data contain noise with σ t = 3.95 × 10−3 corresponding to 1% of the maximum of the original
data. The data were inverted by employing the smooth model (k = 2) as the a priori model information. Figure 6 shows the
variations of the ABIC and root-mean-square (RMS) misfit, defined as (||dobs − Am∗||^2/N)^(1/2), with respect to λ. The ABIC has
its minimum at λ = 3.98 × 10−7 for which the estimated standard deviation is σ d = 4.20 × 10−3 . On the other hand, the RMS
misfit gradually decreases with decreasing λ in the region of λ < 10−3 and there is no minimum point. The model parameters
obtained for various values of λ are shown in Fig. 7 along with the true model. For the small value of λ = 3.98 × 10−10 ,
the evaluated model has large oscillations, and these oscillations are depressed as λ increases. However, at λ = 3.98 × 10−6 ,
since the smooth a priori information is over-emphasized, the obtained model tends towards a straight line. The model for λ =
3.98 × 10−7 is therefore a reasonable selection.
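The synthetic test of (73)–(75) can be set up approximately as follows (Python; here the elements of A are obtained by numerical quadrature of the linear-interpolation basis rather than by the analytic expressions of Appendix D, and the random seed and quadrature grid are arbitrary choices). Combined with the difference_operator and abic_search sketches above, this reproduces the kind of experiment summarized in Figs 5–7.

    import numpy as np

    # Sampling of (73)-(75): N = 31 samples in t and M = 41 model nodes in u.
    t = np.arange(0.0, 7.51, 0.25)                 # N = 31
    u = np.linspace(0.0, 1.0, 41)                  # M = 41, node spacing delta = 0.025
    delta = u[1] - u[0]

    # m(u) is linearly interpolated from the nodes, so each column of A is the kernel
    # integrated against a hat function; here the integral is done by simple quadrature.
    uq = np.linspace(0.0, 1.0, 2001)               # fine quadrature grid
    hat = np.maximum(0.0, 1.0 - np.abs(uq[:, None] - u[None, :]) / delta)   # interpolation weights
    kern = np.exp(-np.outer(t + 1.0, uq))          # G(t_i, u) of (74)
    A = (kern[:, :, None] * hat[None, :, :]).sum(axis=1) * (uq[1] - uq[0])  # N x M

    m_true = 1.0 * np.sin(np.pi * u)               # (75) with a = 1
    rng = np.random.default_rng(0)
    sigma_t = 0.01 * np.max(A @ m_true)            # about 3.95e-3, i.e. 1% of the maximum datum
    d_obs = A @ m_true + rng.normal(0.0, sigma_t, size=t.size)

    # With the helpers sketched earlier, a smooth-model (k = 2) inversion would then read:
    # Dk = difference_operator(41, 2)
    # lam_best, (abic_min, m_star, var_d) = abic_search(A, d_obs, np.zeros(41), Dk,
    #                                                   k=2, delta=delta,
    #                                                   lambdas=np.logspace(-10, -4, 25))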

Figure 5 Synthetic data in the test of the ABIC minimization method. The data incorporate the uncorrelated Gaussian noise (σ t = 3.95 × 10−3 ),
and the number of data is N = 31. The data calculated from the estimated model parameters of Fig. 7(d) are also shown.



Figure 6 Plots of the ABIC (solid line) and the RMS misfit (broken line) as functions of λ. The ABIC has its minimum at λ = 3.98 × 10−7 , where
σ d = 4.20 × 10−3 (true value σ t = 3.95 × 10−3 ).


Figure 7 Reconstructed models (open circles) for the various values of λ, (a) λ = 3.98 × 10−10 , (b) λ = 3.98 × 10−9 , (c) λ = 3.98 × 10−8 , (d)
λ = 3.98 × 10−7 , (e) λ = 3.98 × 10−6 and (f) λ = 3.98 × 10−5 . The number of model parameters is M = 41. The true model parameters are
shown by the solid curves. When λ = 3.98 × 10−7 , the ABIC has its minimum value.



Figure 8 Plots of the ABIC for the synthetic data sets with the three different noise levels: 1% noise (σ t = 3.95 × 10−3 ), 3% noise (σ t = 1.19 ×
10−2 ) and 10% noise (σ t = 3.95 × 10−2 ). To emphasize how the sharpness of the minimum depends on the noise level, the minimum value of
each curve has been subtracted from the plotted values. The ABIC has its minimum values at λ = 3.98 × 10−7 , 3.16 × 10−6 and 3.16 × 10−4 ,
respectively.

Effect of noise

In order to investigate the effect of data noise in the ABIC minimization method, the noise level in the synthetic data is increased to 3% (σ t = 1.19 × 10−2 ) and 10% (σ t = 3.95 × 10−2 ) of the maximum datum, and the ABIC minimization is
applied to each data set with the true data variance unknown. Figure 8 shows the plots of the ABIC for each data set as functions
of λ, where the minimum value of the ABIC has been subtracted to emphasize the sharpness of the valley for each data set. The
value of λ at the minimum increases with the increase in the noise level, even without the information about the data variance,
which is reasonable because we usually require a large contribution from the constraint term of (1) for the inversion of noisy
data. Figure 9 shows the model parameters evaluated for each data set. Discrepancies between the reconstructed model and
the true model appear in the region of about u > 0.8 in Fig. 9(b), and in Fig. 9(c), the reconstructed model becomes almost a
straight line due to the large contribution from the smoothness a priori information. The 10% noise data and the data calculated
for Fig. 9(c) are shown in Fig. 10. In spite of the scattering of the data, the calculated data agree with the true data. From
Fig. 8, it also should be noted that the minimum of the ABIC becomes unclear with the increase in the noise level. In particular,
for the 10% noise data, it is impossible to find the minimum by a coarse line search. This example therefore shows that the
ABIC minimization method adjusts the regularization parameter λ in response to the noise level, in particular, providing a large
value of λ for noisy data. However, if the noise level becomes too high, the curve of the ABIC becomes flat, indicating that the
appropriate value of λ is approaching infinity. This is because the a priori information can adequately account for the noisy data.
This case is consistent with the results of the simple examples shown in Figs 1(b), 2(b), 3(b) and 4(b). It should be noted that
the intensification of smoothing in response to the noise level has already been reported in individual applications of the ABIC
(e.g. Oda and Shibuya 1994; Mitsuhata 1994). However, this example also illustrates the difficulty of automatically finding the
minimum of the ABIC for highly noisy data.

Effect of the suitability of the a priori model information

The ABIC minimization with the smoothness a priori information is now examined for three different synthetic data sets
calculated for the models with a = 1, 10 and 100 in (75). All data have the same level of Gaussian noise with σ t = 3.95 ×
10−3 . The three different cases have different model roughnesses defined as ||D2 mt ||2 where mt are the true model parameters,
i.e. the values of roughness for the models with a = 1, 10 and 100 are 7.60 × 10−4 , 7.60 × 10−2 and 7.60, respectively. Plots of
the ABIC for these data sets with respect to λ are shown in Fig. 11. As the roughness increases, the value of λ at the minimum
decreases. This result means that the greater the roughness of the model, the less the a priori information of the smooth model contributes to the inversion.



Figure 9 Models reconstructed by the ABIC minimization method for the synthetic data with Gaussian noise of (a) 1% (σ t = 3.95 × 10−3 ),
(b) 3% (σ t = 1.19 × 10−2 ) and (c) 10% (σ t = 3.95 × 10−2 ). The estimated and true model parameters are shown by solid circles and solid
curves, respectively. The error bars are calculated from the diagonal terms of the covariance Cm∗ .

Figure 10 A comparison between the synthetic data with 10% Gaussian noise (σ t = 3.95 × 10−2 ), the data calculated for the model of
Fig. 9(c), and the true data.



Figure 11 Plots of the ABIC for the synthetic data sets for the three different models: a = 1, a = 10 and a = 100 in (75). All data have the same
noise level of σ t = 3.95 × 10−3 . The ABIC has its minimum values at λ = 3.98 × 10−7 , 6.31 × 10−9 and 1.00 × 10−10 , respectively.

Figure 12 shows model parameters estimated from each data set. In each case, the estimated result
agrees well with the true model parameters.
As a further experiment, we consider the application of the a priori information of the zeroth-order difference-operator
model (k = 0) assuming the constant reference model, mref , as mprior in (63). In general, a background or average value may
be an appropriate candidate for mref . However, in some cases, a priori estimation of mref may be difficult. In this experiment,
from the point of view of the empirical Bayes approach, mref was regarded as one of the hyperparameters and it was determined
simultaneously with λ by the ABIC minimization method. The value of ABIC for given values of mref and λ was calculated
for the same synthetic data as in Fig. 5. Figure 13 shows the contour lines of the ABIC as a function of mref and λ. The ABIC
has its minimum at mref = 0.644 and λ = 7.94 × 10−4 . In addition, it should be noted that although there is a value of λ
minimizing the ABIC for a given value of mref , the value of λ = 7.94 × 10−4 for mref = 0.644 is the largest value of all values
of λ minimizing the ABIC. The variance of the a priori model PDF, σ 2m , corresponds to σ 2m = σ 2d /λ from (63) in the case of
k = 0. Therefore, this result shows that the ABIC minimization method suggests that the a priori information of mref = 0.644 is
the most reliable, because the corresponding σ 2m is the smallest. In fact, the average value of the true model over 0 ≤ u ≤ 1 is
2/π ≈ 0.637, which is close to the estimated value of mref . Figure 14 shows the models reconstructed for mref = 0.0, 0.644 and
1.0. The models for mref = 0.0 and 1.0 have a gentle oscillation in the region of u < 0.5, but it has been depressed in the model for
mref = 0.644.
The above two results demonstrate how the ABIC minimization method adjusts the regularization parameter in response
to the suitability of the a priori information. In particular, in the application of the smooth model assumption, if the data do not
support a smooth model, the ABIC weakens the contribution from the a priori information. Furthermore, in the application of
the zeroth-order difference-operator model, if a given reference model is not appropriate, the ABIC also weakens the a priori
information. These results are consistent with the results shown in Figs 1(c), 2(c), 3(c) and 4(c). Since we can never know the true
model, the suitability of the a priori information is judged from the observed data through the ABIC or the predictive distribution
of (17).
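To make the joint adjustment of mref and λ concrete, the following is a minimal sketch (not the code used in this study) of a brute-force grid search that minimizes the ABIC for a generic linear problem. The predictive distribution is evaluated through the Gaussian form implied by (A11) and (A13); the matrix A, the data, the noise level and the grids are placeholders rather than the values of the numerical experiment, and the additive constant 2 × (number of hyperparameters) in the ABIC does not affect the location of the minimum.

```python
import numpy as np

# Sketch: joint grid search over the reference model m_ref and the regularization
# parameter lambda by ABIC minimization, assuming a zeroth-order (k = 0) prior with
# C_m = (sigma_d^2 / lambda) I, as stated for k = 0 in the text.
def neg2_log_predictive(d_obs, A, m_prior, C_d, C_m):
    """-2 log l(d_obs): Gaussian predictive density with covariance A C_m A^T + C_d."""
    C_dp = A @ C_m @ A.T + C_d
    r = d_obs - A @ m_prior
    _, logdet = np.linalg.slogdet(C_dp)
    return logdet + r @ np.linalg.solve(C_dp, r)

# placeholder linear problem (illustrative only)
rng = np.random.default_rng(0)
N, M = 20, 30
A = rng.normal(size=(N, M))
m_true = np.sin(np.pi * np.linspace(0.0, 1.0, M))
sigma_d = 0.05
d_obs = A @ m_true + sigma_d * rng.normal(size=N)
C_d = sigma_d**2 * np.eye(N)

best = None
for m_ref in np.linspace(0.0, 1.0, 21):
    for lam in np.logspace(-6, 2, 41):
        C_m = (sigma_d**2 / lam) * np.eye(M)
        abic = neg2_log_predictive(d_obs, A, np.full(M, m_ref), C_d, C_m) + 2 * 2
        if best is None or abic < best[0]:
            best = (abic, m_ref, lam)
print("ABIC minimum at m_ref = %.3f, lambda = %.2e" % (best[1], best[2]))
```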

Effect of the type of the a priori model information

The purpose of this final experiment is different from those of the previous experiments. Sometimes we wonder which type of
a priori information we should employ for the inversion of given data. Here four types of a priori information are compared:
the zeroth-order (k = 0), first-order (k = 1), second-order (k = 2) and third-order (k = 3) difference-operator models for the
data containing 1% noise as shown in Fig. 5. In the application of the zeroth-order difference-operator model, mref = 0.644 was



Figure 12 Models reconstructed by the ABIC minimization method for the synthetic data sets for the three different models: (a) a = 1, (b) a = 10 and (c) a = 100. All data have the same noise level of σ_t = 3.95 × 10^{-3}. The reconstructed and true model parameters are shown by the solid circles and solid curves, respectively. The error bars are calculated from the diagonal terms of the covariance C_{m^*}.

Figure 13 Contour lines of the ABIC for the same synthetic data as in Fig. 5, shown as a function of mref and λ. The ABIC has its minimum at mref = 0.644 and λ = 7.94 × 10^{-4}.



Figure 14 Models reconstructed by the ABIC minimization method from the same synthetic data as in Fig. 5. As the a priori information, we
employed the zeroth-order difference-operator model (k = 0) with (a) mref = 0.0, (b) mref = 0.644 and (c) mref = 1.0. The reconstructed and
true model parameters are shown by solid circles and solid curves, respectively. The error bars are calculated from the diagonal terms of the
covariance C_{m^*}.

used. Figure 15 shows the plots of the ABIC for each type of a priori information with respect to λ. The sharpness around the
minimum of the ABIC is reduced as the order of the difference increases, and the curve of the ABIC becomes flat in the case of
k = 3, which means that we cannot find the minimum of ABIC in the application of the third-order difference-operator model.
The minimum values of ABIC increase as the order of the difference increases. The reconstructed models are shown in Fig. 16,
with the regularization parameter for the third-order difference-operator model fixed at λ = 10.0. The model reconstructed using
the third-order difference-operator model agrees best with the true model. However, the corresponding minimum of the ABIC is
the largest. Although the ABIC minimization method works well for the determination of the hyperparameters, the ABIC may
not be a good indicator for the selection of an appropriate type of a priori information. This may be caused by the difference
of the rank, M_k, depending on the type of the a priori information in (72). In addition, it should be noted that the estimated model error bars shown in Fig. 16 depend strongly on the type of a priori information adopted.
Higher-order difference-operator models may be necessary to describe a more complex model function. However, there are
two practical problems in applying the higher-order difference-operator models. One is that the higher-order difference operators
have larger null spaces, which may easily cause failure of the inversion process. The other is that the minimum of the ABIC is
unclear. If the data noise increases, the curve of the ABIC for the higher-order difference-operator models becomes flat, which may be because the data noise greatly increases the error in estimating the higher-order differences. In practical algorithms



Figure 15 Plots of the ABIC for the same synthetic data as in Fig. 5 with respect to λ in the application of the four different types of a priori
model information: the zeroth-order (k = 0), first-order (k = 1), second-order (k = 2) and third-order (k = 3) difference-operator models. In
the zeroth-order difference-operator model, mref = 0.644 was used.


Figure 16 Models reconstructed by using the four different types of the a priori model information. In the zeroth-order difference-operator
model, mref = 0.644 was used, and in the third-order difference-operator model, the model was reconstructed with λ = 10.0 because the
minimum could not be found. The reconstructed and true model parameters are shown by solid circles and solid curves, respectively. The error
bars are calculated from the diagonal terms of the covariance C_{m^*}.

for the selection of λ, it is preferable that the optimum λ can be determined automatically. Therefore, when the minimum of the ABIC is difficult to find, applying a lower-order difference-operator model, which shows a more distinct minimum, is the better choice.
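For reference, the k-th order difference operators compared in this experiment can be built by repeated first differences. The minimal sketch below (illustrative only, with an arbitrary number of model parameters, and not the code used in this study) also prints the rank of each operator, showing that the null space grows with the order k, consistent with the remarks above.

```python
import numpy as np

def difference_operator(M, k):
    """Sketch of a k-th order difference operator acting on M model parameters.
    k = 0 returns the identity (used together with m_ref in the text); each further
    order applies one more first difference, giving an (M - k) x M matrix."""
    D = np.eye(M)
    for _ in range(k):
        D = np.diff(D, axis=0)
    return D

for k in range(4):
    D_k = difference_operator(20, k)
    print(k, D_k.shape, np.linalg.matrix_rank(D_k))   # rank M - k: larger null space as k grows
```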

CONCLUSIONS

An explanation has been given of the empirical Bayes approach as a general concept and the ABIC minimization method as
a practical algorithm for the adjustment of the regularization parameter in regularized least-squares inversions. The examples
have demonstrated that the regularization parameter is adjusted in response to the noise level of the observed data and also to
the suitability of the a priori model information for the observed data. If the noise level is high, the regularization parameter



is adjusted to be large. Moreover, if the a priori model information explains the observed data sufficiently well within the
data variance, the regularization parameter is adjusted to be infinite, and the a priori model is kept as the prime candidate
for the estimated model. If the a priori model information is not suitable as a candidate to account for the observed data, the
regularization parameter is adjusted to be small. These results show that the scheme for the adjustment of the regularization
parameter is reasonable. The ABIC minimization method, or the more general empirical Bayes approach, is therefore a powerful
tool for regularized least-squares inversions. Concerning the selection of the most appropriate one from several types of a priori
model information, the ABIC proved not to be a good criterion in the numerical experiments presented. From a practical point
of view, however, the use of a lower-order difference-operator model is preferable for inversions of noisy data.

ACKNOWLEDGEMENTS

This work was performed while the author was a visitor at the University of British Columbia on leave from the National
Institute of Advanced Industrial Science and Technology, Japan. The author thanks D.W. Oldenburg for his kind hospitality and
his valuable comments, C.G. Farquharson for his encouragement and patient help in improving the manuscript, T. Uchida for
his support in this study, and Y. Wang for his suggestion.

REFERENCES

Akaike H. 1973. Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium
on Information Theory (eds B.N. Petrov and F. Csaki), pp. 267–281. Akademiai Kiado, Budapest.
Akaike H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control AC-19, 716–723.
Akaike H. 1980. Likelihood and the Bayes procedure. In: Bayesian Statistics (eds J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith),
pp. 141–166. University Press, Valencia.
Berger J.O. 1985. Statistical Decision Theory and Bayesian Analysis, 2nd edn. Springer-Verlag, Inc.
Carlin B.P. and Louis T.A. 2000. Bayes and Empirical Bayes Methods for Data Analysis, 2nd edn. Chapman & Hall, New York.
Constable S.C., Parker R.L. and Constable C.G. 1987. Occam’s inversion, a practical algorithm for generating smooth models from electromag-
netic sounding data. Geophysics 52, 289–300.
Everitt B.S. 1998. The Cambridge Dictionary of Statistics. Cambridge University Press.
Farquharson C.G. and Oldenburg D.W. 2000. Automatic estimation of the trade-off parameter in nonlinear inverse problems using the GCV
and L-curve criteria. 70th SEG meeting, Calgary, Canada, Expanded Abstracts, 265–268.
Good I.J. 1965. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press, Cambridge, MA.
Gouveia W.P. and Scales J.A. 1997. Resolution of seismic waveform inversion: Bayes versus Occam. Inverse Problems 13, 323–349.
Groetsch C.W. 1993. Inverse Problems in the Mathematical Sciences. Friedr. Vieweg & Sohn Verlagsgesellschaft mbH, Wiesbaden, Germany.
Haber E. and Oldenburg D.W. 2000. A GCV based method for nonlinear ill-posed problems. Computational Geosciences 4, 41–63.
Hansen P.C. 1992. Analysis of discrete ill-posed problems by means of the L-curve. SIAM Review 34, 561–580.
Honjo Y. and Kashiwagi N. 1999. Matching objective and subjective information in groundwater inverse analysis by Akaike’s Bayesian Infor-
mation Criterion. Water Resources Research 35, 435–447.
MacKay D.J.C. 1992. Bayesian interpolation. Neural Computation 4, 415–447.
Malinverno A. 2000. A Bayesian criterion for simplicity in inverse problem parametrization. Geophysical Journal International 140, 267–285.
Matsuoka T. 1986. Numerical techniques in inverse problems – mainly a topic of the least squares method. Buturi-Tansa (Geophysical Explo-
ration) 39, 340–356 (in Japanese).
Mitsuhata Y. 1994. Application of prior information of smooth resistivity structure to 1-D inversion of magnetotelluric data by using the ABIC
minimization method. Buturi-Tansa (Geophysical Exploration) 47, 358–374.
Mitsuhata Y., Uchida T. and Amano H. 2002. 2.5-D inversion of frequency-domain electromagnetic data generated by a ground-wire source.
Geophysics 67, 1753–1768.
Mitsuhata Y., Uchida T., Murakami Y. and Amano H. 2001. The Fourier transform of controlled-source time-domain electromagnetic data by
smooth spectrum inversion. Geophysical Journal International 144, 123–135.
Murata Y. 1993. Estimation of optimum average surficial density from gravity data: An objective Bayesian approach. Journal of Geophysical
Research 98, 12097–12109.
Noble B. 1969. Applied Linear Algebra. Prentice-Hall, Inc.
O’Hagan A. 1994. Bayesian Inference, Kendall’s Advanced Theory of Statistics, Vol. 2B. John Wiley & Sons, Inc.
Oda H. and Shibuya H. 1994. Deconvolution of whole-core magnetic remanence data by ABIC minimization. Journal of Geomagnetism and Geoelectricity 46, 613–628.



Ogawa Y. and Uchida T. 1996. A two-dimensional magnetotelluric inversion assuming Gaussian static shift. Geophysical Journal International
126, 69–76.
Ory J. and Pratt R.G. 1995. Are our parameter estimators biased? The significance of finite-difference regularization operators. Inverse Problems
11, 397–424.
Press S.J. 1989. Bayesian Statistics: Principles, Models, and Application. John Wiley & Sons, Inc.
Scales J.A. and Snieder R. 1997. To Bayes or not to Bayes. Geophysics 62, 1045–1046.
Tamura Y., Sato T., Ooe M. and Ishiguro M. 1991. A procedure for tidal analysis with a Bayesian information criterion. Geophysical Journal
International 104, 507–516.
Tarantola A. 1987. Inverse Problem Theory: Methods for Data Fitting and Model Parameter Estimation. Elsevier Science Publishing Co.
Tikhonov A.N. and Arsenin V.Y. 1977. Solutions of Ill-posed Problems. John Wiley & Sons, Inc.
Uchida T. 1993. Smooth 2-D inversion for magnetotelluric data based on statistical criterion ABIC. Journal of Geomagnetism and Geoelectricity 45, 841–858.
Wahba G. 1990. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia.
Wang Y. 2003. Seismic Amplitude Inversion in Reflection Tomography. Pergamon Press, Inc.
Yoshida S. 1989. Waveform inversion using ABIC for the rupture process of the 1983 Hindu Kush earthquake. Physics of the Earth and Planetary
Interiors 56, 389–405.

APPENDIX A

Calculation of the predictive distribution

Using (15), the objective function U of (11) can be rewritten as follows:

U = m^T F^T C^{-1} F m − 2 m^T F^T C^{-1} d + d^T C^{-1} d
  = m^T F^T C^{-1} F m − 2 m^T F^T C^{-1} d + d^T C^{-1} d − 2 m^T (F^T C^{-1} F m^* − F^T C^{-1} d) + 2 m^{*T} (F^T C^{-1} F m^* − F^T C^{-1} d)
  = (m^T F^T C^{-1} F m − 2 m^T F^T C^{-1} F m^* + m^{*T} F^T C^{-1} F m^*) + (d^T C^{-1} d − 2 m^{*T} F^T C^{-1} d + m^{*T} F^T C^{-1} F m^*)
  = (m − m^*)^T F^T C^{-1} F (m − m^*) + (d − F m^*)^T C^{-1} (d − F m^*),   (A1)

where C^{-1} = W^T W, which results in (16). From (5), the predictive distribution is given by

l(d_obs | z_prior) = ∫ f(d_obs | m) π(m | z_prior) dm
                   = (2π)^{-(N+M)/2} (det C_d · det C_m)^{-1/2} ∫ exp(−U(m)/2) dm.   (A2)

By substituting (16) into the integration of the exponential term of (A2), the integral is given by

I = ∫ exp(−U/2) dm
  = exp{−U(m^*)/2} ∫ exp{−(1/2) (m − m^*)^T F^T W^T W F (m − m^*)} dm.   (A3)

Using the following formula, based on the fact that the integral of the multivariate normal density is one (e.g. Tarantola 1987, p. 159):

∫ exp(−(1/2) x^T B x) dx = (2π)^{M/2} (det B)^{-1/2},   (A4)

where B is an M × M positive-definite matrix, we obtain

I = exp(−U(m^*)/2) · (2π)^{M/2} · (det F^T W^T W F)^{-1/2}.   (A5)

Finally, by substituting (A5) into (A2), the predictive distribution of (17) is obtained.



Considering (12)–(14) and following Tarantola (1987, p. 70), we can rearrange the estimated model parameters m^* of (15) as

m^* = (A^T C_d^{-1} A + C_m^{-1})^{-1} (A^T C_d^{-1} d_obs + C_m^{-1} m_prior)
    = (A^T C_d^{-1} A + C_m^{-1})^{-1} [A^T C_d^{-1} d_obs + (A^T C_d^{-1} A + C_m^{-1} − A^T C_d^{-1} A) m_prior]
    = m_prior + (A^T C_d^{-1} A + C_m^{-1})^{-1} A^T C_d^{-1} (d_obs − A m_prior).   (A6)

In addition, as shown in Tarantola (1987, p. 158), from the relationship,

A^T + A^T C_d^{-1} A C_m A^T = A^T C_d^{-1} (C_d + A C_m A^T)
                             = (C_m^{-1} + A^T C_d^{-1} A) C_m A^T,   (A7)

we can derive

(A^T C_d^{-1} A + C_m^{-1})^{-1} A^T C_d^{-1} = C_m A^T (A C_m A^T + C_d)^{-1},   (A8)

since A^T C_d^{-1} A + C_m^{-1} and A C_m A^T + C_d are positive-definite, and thus regular, matrices. By substituting (A8) into (A6), m^* becomes

m^* = m_prior + C_m A^T (A C_m A^T + C_d)^{-1} (d_obs − A m_prior).   (A9)
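As a quick consistency check of (A6) and (A9), the model-space and data-space forms of m^* can be compared numerically. The sketch below uses small randomly generated matrices rather than any example from the paper.

```python
import numpy as np

# Sketch: verify that the model-space form (A6) and the data-space form (A9)
# give the same estimate m* (random placeholder matrices, illustrative only).
rng = np.random.default_rng(2)
N, M = 6, 4
A = rng.normal(size=(N, M))
Cd = np.diag(rng.uniform(0.5, 2.0, size=N))
Cm = np.diag(rng.uniform(0.5, 2.0, size=M))
m_prior = rng.normal(size=M)
d_obs = rng.normal(size=N)
r = d_obs - A @ m_prior

model_space = np.linalg.inv(A.T @ np.linalg.inv(Cd) @ A + np.linalg.inv(Cm))
m_a6 = m_prior + model_space @ A.T @ np.linalg.inv(Cd) @ r                  # (A6)
m_a9 = m_prior + Cm @ A.T @ np.linalg.solve(A @ Cm @ A.T + Cd, r)           # (A9)
print(np.allclose(m_a6, m_a9))   # True
```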

Moreover, substituting (A9) into (9), using the following equation,

d_obs − A m^* = [I_N − A C_m A^T (A C_m A^T + C_d)^{-1}] (d_obs − A m_prior)
             = [(A C_m A^T + C_d) − A C_m A^T] (A C_m A^T + C_d)^{-1} (d_obs − A m_prior)
             = C_d (A C_m A^T + C_d)^{-1} (d_obs − A m_prior),   (A10)

and rearranging, U(m^*) can be re-formed as

U(m^*) = (d_obs − A m_prior)^T (A C_m A^T + C_d)^{-1} (d_obs − A m_prior),   (A11)

and a matrix C_dp is defined as in (19). In the following formula (Noble 1969, p. 225):

det S det(V + U S^{-1} T) = det V det(S + T V^{-1} U),   (A12)

where S and V are regular matrices of orders M and N, and T and U are M × N and N × M matrices, respectively, we set S = C_m^{-1}, T = A^T, U = A and V = C_d. Then, considering (21), we obtain

det C_dp = det(A C_m A^T + C_d) = det C_m det C_d det(A^T C_d^{-1} A + C_m^{-1})
         = det C_m det C_d det(F^T W^T W F).   (A13)

Finally, by using (A11) and (A13), equation (18) is obtained.
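The chain (A1)–(A13) can be verified numerically. The sketch below (with small random matrices, not the paper's example) evaluates the predictive distribution once through the Gaussian integral of (A2)–(A5) and once through the closed form implied by (A11) and (A13), i.e. a Gaussian density in d_obs with mean A m_prior and covariance A C_m A^T + C_d; the two evaluations agree to rounding error.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 3
A = rng.normal(size=(N, M))
Cd = np.diag(rng.uniform(0.5, 1.5, size=N))
Cm = np.diag(rng.uniform(0.5, 1.5, size=M))
m_prior = rng.normal(size=M)
d_obs = rng.normal(size=N)

# Route 1: Gaussian integral over m, equations (A2)-(A5).
Wd = np.linalg.cholesky(np.linalg.inv(Cd)).T      # Wd^T Wd = Cd^{-1}
Wm = np.linalg.cholesky(np.linalg.inv(Cm)).T      # Wm^T Wm = Cm^{-1}
F = np.vstack([A, np.eye(M)])
W = np.block([[Wd, np.zeros((N, M))],
              [np.zeros((M, N)), Wm]])
d = np.concatenate([d_obs, m_prior])
WF = W @ F
m_star = np.linalg.solve(WF.T @ WF, F.T @ W.T @ W @ d)     # least-squares estimate (15)
U_star = np.sum((W @ (d - F @ m_star))**2)                 # U(m*)
l1 = ((2*np.pi)**(-(N + M)/2) * (np.linalg.det(Cd) * np.linalg.det(Cm))**-0.5
      * np.exp(-U_star/2) * (2*np.pi)**(M/2) * np.linalg.det(WF.T @ WF)**-0.5)

# Route 2: closed Gaussian form from (A11) and (A13).
Cdp = A @ Cm @ A.T + Cd
r = d_obs - A @ m_prior
l2 = ((2*np.pi)**(-N/2) * np.linalg.det(Cdp)**-0.5
      * np.exp(-0.5 * r @ np.linalg.solve(Cdp, r)))
print(l1, l2)   # identical up to rounding error
```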

APPENDIX B

Predictive distribution in example 3

In example 3, the matrices related to the PDFs can be written as

d = [d_1  d_2  0  0]^T,   A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},   W_d = \begin{pmatrix} σ_d^{-1} & 0 \\ 0 & σ_d^{-1} \end{pmatrix}   and   W_m = \begin{pmatrix} σ_m^{-1} & −σ_m^{-1} \\ 0 & 0 \end{pmatrix}.   (B1)



Substituting the above matrices into (13) and (14), we obtain

WF = \begin{pmatrix} σ_d^{-1} & 0 \\ 0 & σ_d^{-1} \\ σ_m^{-1} & −σ_m^{-1} \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} G \\ 0 \quad 0 \end{pmatrix},   (B2)
where G is given by (46). Substituting (B2) into (16) yields (45). The predictive distribution can be calculated by integrating the
product of (40) and (41):

l(d_1, d_2) = ∫∫ f(d_1, d_2 | m_1, m_2) π(m_1, m_2) dm_1 dm_2
            = (2π σ_d^2)^{-1} (2π σ_m^2)^{-1/2} · 2π (det G^T G)^{-1/2} exp{−U(m_1^*, m_2^*)/2}
            = [2π (σ_m^2 + 2σ_d^2)]^{-1/2} exp{−U(m_1^*, m_2^*)/2},   (B3)

which is derived from (A5) and the relationship:

F^T W^T W F = G^T G.
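A quick numerical check of the normalization in (B3) is given below as a sketch: it uses the three nonzero rows of WF in (B2) as G and arbitrary positive variances (not values from the paper), and confirms that the two prefactors of exp{−U(m_1^*, m_2^*)/2} coincide.

```python
import numpy as np

# Check that (2*pi*sigma_d^2)^-1 * (2*pi*sigma_m^2)^-1/2 * 2*pi * det(G^T G)^-1/2
# equals [2*pi*(sigma_m^2 + 2*sigma_d^2)]^-1/2, with G taken from (B2).
sigma_d, sigma_m = 0.3, 0.7                     # arbitrary positive values
G = np.array([[1/sigma_d, 0.0],
              [0.0, 1/sigma_d],
              [1/sigma_m, -1/sigma_m]])
lhs = ((2*np.pi*sigma_d**2)**-1 * (2*np.pi*sigma_m**2)**-0.5
       * 2*np.pi * np.linalg.det(G.T @ G)**-0.5)
rhs = (2*np.pi*(sigma_m**2 + 2*sigma_d**2))**-0.5
print(lhs, rhs)   # the two prefactors agree
```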

APPENDIX C

Estimation of σ_d^2 and σ_m^2 in example 4

In example 4, the matrices related to the PDFs can be written as

d_obs = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix},   A = \begin{pmatrix} 1 \\ 1 \end{pmatrix},   C_d = \begin{pmatrix} σ_d^2 & 0 \\ 0 & σ_d^2 \end{pmatrix}   and   C_m = σ_m^2.   (C1)
By substituting the above matrices and variances into (18) and (19), equations (55) and (56) are obtained. Next, ∂log l/∂σ_m^2 and ∂log l/∂σ_d^2 are calculated as

∂log l/∂σ_m^2 = − \frac{1}{2σ_m^2 + σ_d^2} \left[ 1 − \frac{(d̂_1 + d̂_2)^2}{2(2σ_m^2 + σ_d^2)} \right]   (C2)

and

∂log l/∂σ_d^2 = − \frac{1}{2} \left\{ \frac{2(σ_m^2 + σ_d^2)}{σ_d^2 (2σ_m^2 + σ_d^2)} \left[ 1 + \frac{σ_m^2 (d̂_1 + d̂_2)^2}{σ_d^2 (2σ_m^2 + σ_d^2)} \right] − \frac{d̂_1^2 + d̂_2^2}{σ_d^4} \right\}.   (C3)

Considering ∂log l/∂σ_m^2 = 0 for (C2), when σ_d^2 < (d̂_1 + d̂_2)^2/2 we obtain σ_m^2 = (d̂_1 + d̂_2)^2/4 − σ_d^2/2, and when σ_d^2 ≥ (d̂_1 + d̂_2)^2/2 we should adopt σ_m^2 = 0 because ∂log l/∂σ_m^2 < 0. By substituting these results into ∂log l/∂σ_d^2 = 0 for (C3), equations (57) and (58) are obtained.
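As a check on the reconstructed derivative (C2), a central finite difference of the Gaussian predictive log-density of example 4 can be compared with the analytic expression. The sketch below uses arbitrary numbers rather than values from the paper.

```python
import numpy as np

# Finite-difference check of (C2). log l is the log predictive density of example 4:
# d_obs ~ N(m_prior * [1, 1]^T, sigma_m^2 * ones((2, 2)) + sigma_d^2 * I),
# written in terms of d_hat = d_obs - A m_prior.
def log_l(sm2, sd2, d_hat):
    C_dp = sm2 * np.ones((2, 2)) + sd2 * np.eye(2)
    return (-np.log(2*np.pi) - 0.5*np.log(np.linalg.det(C_dp))
            - 0.5 * d_hat @ np.linalg.solve(C_dp, d_hat))

d_hat = np.array([0.9, 1.4])          # arbitrary residuals
sm2, sd2, h = 0.5, 0.2, 1e-6          # arbitrary variances, finite-difference step
numeric = (log_l(sm2 + h, sd2, d_hat) - log_l(sm2 - h, sd2, d_hat)) / (2*h)
analytic = -(1/(2*sm2 + sd2)) * (1 - d_hat.sum()**2 / (2*(2*sm2 + sd2)))
print(numeric, analytic)              # agree to finite-difference accuracy
```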

APPENDIX D

Calculation of the Jacobian in the numerical experiment

Discretizing the model function m(u) as m(u_j) = m_j and m(u_{j+1}) = m_{j+1}, m(u) is approximated within u_j ≤ u < u_{j+1} by the linear interpolation,

m(u) = \frac{u_{j+1} − u}{Δu_j} m_j + \frac{u − u_j}{Δu_j} m_{j+1},   (D1)



where Δu_j = u_{j+1} − u_j. Substituting (D1) into the integral of (73), and performing the integration within u_j ≤ u < u_{j+1}, gives

∫_{u_j}^{u_{j+1}} G_i(u) · m(u) du = \left[ \frac{u_{j+1}}{Δu_j} ∫_{u_j}^{u_{j+1}} G_i(u) du − \frac{1}{Δu_j} ∫_{u_j}^{u_{j+1}} u · G_i(u) du \right] m_j
                                   + \left[ \frac{−u_j}{Δu_j} ∫_{u_j}^{u_{j+1}} G_i(u) du + \frac{1}{Δu_j} ∫_{u_j}^{u_{j+1}} u · G_i(u) du \right] m_{j+1},   (D2)

where G_i(u) = exp{−(t_i + 1)u}, and the contents of the brackets in the above equation can be calculated analytically. By implementing this approximated integration for each segment Δu_j, and assembling them for 0 ≤ u < 1, the Jacobian matrix is obtained (Matsuoka 1986).
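A minimal sketch of this assembly is given below (with placeholder t and u grids, not those of the numerical experiment); the segment integrals of G_i(u) = exp{−(t_i + 1)u} and of u·G_i(u) are evaluated with the standard closed-form antiderivatives.

```python
import numpy as np

def jacobian(t, u):
    """Sketch of the Jacobian assembly of Appendix D: rows are data abscissae t_i,
    columns are model nodes u_j of the piecewise-linear m(u)."""
    N, M = len(t), len(u)
    J = np.zeros((N, M))
    for i, ti in enumerate(t):
        a = ti + 1.0                                  # kernel G_i(u) = exp(-(t_i + 1) u)
        for j in range(M - 1):
            uj, uj1 = u[j], u[j + 1]
            du = uj1 - uj
            # analytic segment integrals of G_i(u) and u * G_i(u)
            I0 = (np.exp(-a*uj) - np.exp(-a*uj1)) / a
            I1 = ((uj + 1/a)*np.exp(-a*uj) - (uj1 + 1/a)*np.exp(-a*uj1)) / a
            J[i, j]     += (uj1 * I0 - I1) / du       # coefficient of m_j in (D2)
            J[i, j + 1] += (-uj * I0 + I1) / du       # coefficient of m_{j+1} in (D2)
    return J

# Example call with placeholder grids:
# J = jacobian(np.linspace(0.0, 10.0, 20), np.linspace(0.0, 1.0, 41))
```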

