by

Steven Xiaogang Wang

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy
in
THE FACULTY OF GRADUATE STUDIES
Department of Statistics

THE UNIVERSITY OF BRITISH COLUMBIA
Maximum Weighted Likelihood Estimation
Steven X. Wang
Abstract
Contents

Abstract
Table of Contents
List of Figures
List of Tables
Acknowledgements

1 Introduction
  1.1 Introduction
  1.2 Local Likelihood and Related Methods
  1.3 Relevance Weighted Likelihood Method
  1.4 Weighted Likelihood Method
  1.5 A Simple Example
  1.6 The Scope of the Thesis

2 Motivating Example: Normal Populations
  2.4 The Optimum WLE
  2.5 Results for Bivariate Normal Populations

4 Asymptotic Properties of the WLE

5 Choosing Weights by Cross-Validation
  5.1 Introduction
  5.2 Linear WLE for Equal Sample Sizes
    5.2.1 Two Population Case
    5.2.2 Alternative Matrix Representation of A and b
    5.2.3 Optimum Weights by Cross-Validation
  5.3 Linear WLE for Unequal Sample Sizes
    5.3.1 Two Population Case
    5.3.2 Optimum Weights by Cross-Validation
  5.4 Asymptotic Properties of the Weights
  5.5 Simulation Studies

8 Application to Disease Mapping

Bibliography
List of Figures

8.1 Daily hospital admissions for CSD #380 in the summer of 1983.
8.2 Hospital admissions for CSD #380, 362, 366 and 367 in 1983.
List of Tables

5.1 MSE of the MLE and the WLE and standard deviations of the squared errors for samples with equal sizes from N(0,1) and N(0.3,1).
5.2 Optimum weights and their standard deviations for samples with equal sizes from N(0,1) and N(0.3,1).
5.3 MSE of the MLE and the WLE and their standard deviations for samples with equal sizes from P(3) and P(3.6).
5.4 Optimum weights and their standard deviations for samples with equal sizes from P(3) and P(3.6).
8.1 Estimated MSE for the MLE and the WLE.
8.2 Correlation matrix and the weight function for 1984.
8.3 MSE of the MLE and the WLE for CSD 380.
Acknowledgments
Chapter 1
Introduction
1.1 Introduction
Recently there has been increasing interest in combining information from diverse
sources. Effective methods for combining information are needed in a variety of fields,
including engineering, environmental sciences, geosciences and medicine. Cox (1981)
gives an overview of some methods for the combination of data, such as weighted
means and pooling in the presence of over-dispersion. An excellent survey of more
current techniques of combining information, together with some concrete examples, can be
found in the report Combining Information written by the Committee on Applied
and Theoretical Statistics, U.S. National Research Council (1992).

This thesis will concentrate on the combination of information from separate
sets of data. When two or more data sets, derived from similar variables measured
on different samples of subjects, are available to answer a given question, a judgment
must be made as to whether the samples and variables are sufficiently similar that the data
sets may be directly merged, or whether some other method of combining information
that only partially merges them is more appropriate. For instance, data sets may be
time series data for ozone levels in each year. Or they may be hospital admissions
for different geographical regions that are close to each other. The question is how
to combine information from data sets collected under different conditions (and with
differing degrees of precision and bias) to yield more reliable conclusions than those
obtainable from any single data set alone.
1.2 Local Likelihood and Related Methods

Local likelihood, introduced by Tibshirani and Hastie (1987), extends the idea of
local fitting to likelihood-based regression models; kernel regression can be viewed as
a special case of the local likelihood procedure. Staniswalis (1989) defines her version
for a model in which the x_i are fixed and b is a single unknown parameter. Recently, versions of local
likelihood for estimation have been proposed and discussed. The general form of local
likelihood was presented by Eguchi and Copas (1998). The basic idea is to infuse local
information into the likelihood:

L(t; x_1, x_2, ..., x_n) = \sum_{i=1}^{n} K\left(\frac{x_i - t}{h}\right) \log f(x_i, \theta),

where K = K((x - t)/h) is a kernel function with center t and bandwidth h. The local
maximum likelihood estimate (LMLE) is obtained
by maximizing this weighted version of the likelihood function, which gives more weight
to sample points near t. This does not give an unbiased estimating equation as it
stands, and so the local likelihood approach introduces a correction factor to ensure
unbiasedness. The resulting estimate,
say \hat{\theta}_{t,h}, depends on the controllable variables t and h through the kernel function K,
and, intuitively, it is natural to think that \hat{\theta}_{t,h} gains more information about the data
around t in the sample space. Detailed discussions of the local likelihood method
can be found in Eguchi and Copas (1998). The LMLE might be related to the local
M-estimator proposed by Hardle and Gasser (1984). The local M-estimator is defined
as a solution of

\sum_{i=1}^{n} K\left(\frac{t - t_i}{h}\right) \psi(x_i - \theta) = 0.
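As a small illustration (not from the thesis, and with hypothetical data), take a normal location model f(x; \theta) = N(\theta, 1): maximizing the kernel-weighted log-likelihood above in \theta gives a kernel-weighted mean centered at t.

```python
import math

# Local likelihood sketch for f(x; theta) = N(theta, 1): the maximizer of
# sum_i K((x_i - t)/h) log f(x_i; theta) is the kernel-weighted mean at t.
def local_mle(t, xs, h):
    K = lambda u: math.exp(-0.5 * u * u)   # Gaussian kernel
    w = [K((x - t) / h) for x in xs]
    return sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)

xs = [0.2, 0.4, 1.8, 2.1, 2.3]             # hypothetical sample
lo = local_mle(0.3, xs, h=0.5)             # tracks the low cluster near t = 0.3
hi = local_mle(2.0, xs, h=0.5)             # tracks the high cluster near t = 2
```

As t moves across the sample space, the estimate follows whichever data lie near t, which is exactly the "more information around t" intuition above.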
The term weighted likelihood has also been used in the statistics literature outside
the local likelihood approach. Dickey and Lientz (1970) propose the use of what they
called the weighted likelihood ratio for tests of a simple hypothesis. Dickey (1971) proposes
the weighted utility-likelihood along the same line of argument. Assume a statistical
model with a Borel set H in the parameter space. Let \bar{H} denote a Borel measurable alternative such that H \cap \bar{H} = \emptyset
with P(H) + P(\bar{H}) = 1. The key feature of their proposal is to use the weighted likelihood
ratio \Phi(D|H) / \Phi(D|\bar{H}). The reason
that they called it a weighted likelihood ratio is that the quantity \Phi(D|H) is a
summary of the evidence in D for H. A modern name for their weighted likelihood
ratio is the Bayes factor.
Markatou, Basu and Lindsay (1997, 1998) propose a method based on the weighted
likelihood equation in the context of robust estimation. In their approach, a weight
w(x_i, \hat{F}) is selected such that it has value close to 1 if there is no evidence of model
violation at x_i from the empirical distribution function \hat{F}, while the weight is small
when the empirical distribution function indicates lack of fit at or near x_i. Thus, the role of the weight function is to
down-weight points in the sample that are inconsistent with the assumed model.
Hunsberger (1994) also uses the term "weighted likelihood" to arrive at semiparametric kernel
regression models. Consider a model in which X_i | (Y_i, T_i = t_i) has a distribution of known form.
Here y\beta_0 is the parametric portion, \beta_0 being the
unknown parameter to be estimated that relates the covariate y to the response, while
g is the non-parametric portion of the model, the only assumption on g being smoothness;
the \eta_i are independent random error terms with E(\eta_i) = 0 and E(\eta_i^2) = \sigma^2. Now
X_i can be rewritten using the model for the y's to obtain X_i = y_i\beta_0 + h(t_i), where
h(t_i) = r(t_i)\beta_0 + g(t_i) is the portion that depends on t. The main purpose is to
estimate \beta_0 and g by maximizing a kernel-weighted log-likelihood of the form

WL(\beta, g) = \sum_{i} \sum_{j} K\left(\frac{t_i - t_j}{b}\right) \log f(X_i; \beta, g_j) / (n^2 b).
Besides these "weighted likelihood" approaches, it should be noted that the term
weighted likelihood has been used in other contexts as well. Newton and Raftery
(1994) introduce what they call the weighted likelihood bootstrap as a way to simulate
approximately from a posterior distribution. Their weighted likelihood is
defined as

\tilde{L}(\theta) = \prod_{i=1}^{n} f(x_i; \theta)^{w_{n,i}},

where the random weight vector w_n = (w_{n,1}, w_{n,2}, ..., w_{n,n}) has some probability
distribution; maximizing \tilde{L} over repeated draws of w_n produces an approximation
to the posterior.
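A minimal sketch of the idea (hypothetical data, and the uniform Dirichlet weight distribution is one common choice, not necessarily Newton and Raftery's only one): with f = N(\theta, 1), the maximizer of \prod_i f(x_i; \theta)^{w_i} is the w-weighted mean, so each fresh draw of w yields one approximate posterior draw.

```python
import random

# Weighted likelihood bootstrap sketch for f = N(theta, 1): the weighted
# MLE is the w-weighted mean; w is redrawn from Dirichlet(1, ..., 1),
# built here by normalizing i.i.d. exponentials.
random.seed(1)
xs = [1.2, 0.8, 1.5, 0.9, 1.1]             # hypothetical sample

def one_draw(xs):
    e = [random.expovariate(1.0) for _ in xs]
    w = [ei / sum(e) for ei in e]          # Dirichlet(1, ..., 1) weights
    return sum(wi * xi for wi, xi in zip(w, xs))

draws = [one_draw(xs) for _ in range(2000)]
# the draws cluster around the sample mean of xs
```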
Rao (1991) introduces his definition of the weighted maximum likelihood to deal
with irregularities of the observation times in longitudinal studies of growth
curves; the weights are built from a known nondecreasing function \Lambda(\cdot) on [a, b]
evaluated at the observation times t_{i,n}.
The term weighted likelihood is also used by Warm (1987) in the context of item
response theory, where a weighted version
of the traditional likelihood function L(x|\theta) is maximized; the weight function w(\theta) is carefully
chosen to reduce the bias of the resulting estimate.
1.3 Relevance Weighted Likelihood Method

Hu and Zidek (1997) give a very general method for using all relevant sample information
in statistical inference. They base their theory on what they call the relevance-weighted
likelihood (REWL). Consider an independent sample
vector X = (X_1, X_2, ..., X_n), and let f_i be the unknown probability density function of
X_i, while the study population has probability
density function f. At least in some qualitative sense, the f_i are thought to be
similar to f, even
though the X_i are independently drawn from populations different from our study
population. The relevance-weighted likelihood is

REWL(\theta) = \prod_{i=1}^{n} f(x_i; \theta)^{\lambda_i}.

The REWL plays the same role in the "relevant-sample" case as the likelihood in the
i.i.d. case. Asymptotic properties of the Relevance Weighted
Likelihood Estimator can be found in Hu (1997) and in Hu, Rosenberger and Zidek (2000).
1.4 Weighted Likelihood Method

In this thesis, we are interested in a context which is different from that of all the
methods just described. Samples from
related populations, Population 2, Population 3 and so on, are available together with
the direct information from Population 1. Let m denote the total number of populations,
and let n_1, n_2, ..., n_m
denote the number of observations obtained from each population mentioned above.
The densities f_2(\cdot; \theta_2), ..., f_m(\cdot; \theta_m) are thought to be "similar to" f_1(\cdot; \theta_1),
and the weights attached to the likelihoods below are specified by the
analyst.

We assume that X_{i1}, X_{i2}, ..., X_{in_i}, i = 1, 2, ..., m, are independent, and define
the weighted likelihood as

WL(\theta_1) = \prod_{i=1}^{m} \prod_{j=1}^{n_i} f(x_{ij}; \theta_i)^{\lambda_i}.

Hu and Zidek's paradigm is suited to non-parametric regression
and function estimation. There, information about \theta_1 builds up because the number
of populations grows, with increasingly many in close proximity to that of \theta_1. This
is the paradigm commonly invoked in non-parametric regression, but
it is not always the most natural one. In contrast, we postulate a fixed number of
populations with an increasingly large number of samples from each population. Our
paradigm may be more natural in situations like that in which the James-Stein estimator is
obtained, where a specific set of populations is under study.
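As a concrete sketch (with made-up data, taking f(\cdot; \theta_i) = N(\theta_1, 1) for every population when estimating \theta_1), maximizing log WL numerically recovers the closed-form \lambda- and n-weighted average of the sample means:

```python
import math

# Weighted log-likelihood for normal populations with unit variance:
# log WL(theta) = sum_i lam_i sum_j log f(x_ij; theta).
samples = [[0.1, -0.3, 0.4, 0.2],          # population 1 (of interest)
           [0.5, 0.6, 0.2]]                # a "similar" population
lam = [1.0, 0.5]                           # analyst-chosen weights

def log_wl(theta):
    return sum(l * sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)
                       for x in s)
               for l, s in zip(lam, samples))

# Maximize over a fine grid; the maximizer is the weighted mean.
grid = [i / 10000 for i in range(-10000, 10001)]
theta_wle = max(grid, key=log_wl)
closed_form = (sum(l * sum(s) for l, s in zip(lam, samples))
               / sum(l * len(s) for l, s in zip(lam, samples)))
```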
1.5 A Simple Example

Suppose a fair coin is tossed twice, and let \hat{\theta}_1 denote the MLE of \theta_1 = P(H) based on those two tosses.
The probability that the MLE concludes that the fair coin has either no heads or no
tails is 50%. It is clear that the probability of making a nonsensical decision about
the fair coin in this case is extremely high due to the small sample size.
Suppose that another coin, not necessarily a fair one, is tossed twice as well. The
question here is whether we can use the result from the second experiment to derive
a better estimate for \theta_1. The answer is affirmative if we combine the information
obtained from the two experiments.
Suppose for definiteness \theta_2 = P(H) = 0.6 for the second coin, and let \hat{\theta}_2 denote the
MLE of \theta_2 based on its two tosses Y_1 and Y_2. Then

P(\hat{\theta}_2 = 0) = P(Y_1 = Y_2 = 0) = 0.16;
P(\hat{\theta}_2 = 1/2) = P(\{Y_1 = 0, Y_2 = 1\} \cup \{Y_1 = 1, Y_2 = 0\}) = 0.48;
P(\hat{\theta}_2 = 1) = P(Y_1 = Y_2 = 1) = 0.36.

Consider estimates of \theta_1 of the form

\tilde{\theta}_1 = \lambda_1 \hat{\theta}_1 + \lambda_2 \hat{\theta}_2.

The optimum weights will be discussed in later chapters. Pretend that we do not
know how to choose the best weights; we might just set each of the weights to be
1/2, so that \tilde{\theta}_1 = (S_1 + S_2)/4, where S_i denotes the number of heads observed for coin i. Then

P(\tilde{\theta}_1 = 0) = P(S_1 = 0, S_2 = 0) = 0.04;
P(\tilde{\theta}_1 = 1/2) = P(\{S_1 = 1, S_2 = 1\} \cup \{S_1 = 2, S_2 = 0\} \cup \{S_1 = 0, S_2 = 2\}) = 0.37;
P(\tilde{\theta}_1 = 1) = P(S_1 = 2, S_2 = 2) = 0.09.
Note that the probability of making a nonsensical decision has been greatly reduced
to 0.13 (0.04 + 0.09) through the introduction of the information obtained from the
second experiment. Moreover, for \theta_1 = 0.5 and \theta_2 = 0.6,

MSE(\tilde{\theta}_1) / MSE(\hat{\theta}_1) \approx 0.5.

Due to the small sample size of the first experiment, for arbitrary \theta_1 the probability
of a nonsensical decision remains 50%. By incorporating the relevant information in a very simple way, that probability
is greatly reduced. We can also compare the MSE's of
the two estimators in this case. Let \tilde{\theta}_1 = \frac{1}{2}\hat{\theta}_1 + \frac{1}{2}\hat{\theta}_2. For arbitrary \theta_1 and \theta_2, each with a sample
size of 2, we have

MSE(\hat{\theta}_1) = Var(\hat{\theta}_1) = \frac{1}{2}\theta_1(1 - \theta_1);

MSE(\tilde{\theta}_1) = Var(\tilde{\theta}_1) + Bias^2 = \frac{1}{8}\theta_1(1 - \theta_1) + \frac{1}{8}\theta_2(1 - \theta_2) + \frac{1}{4}(\theta_1 - \theta_2)^2.

If |\theta_1 - \theta_2| \le 0.3, then (\theta_1 - \theta_2)^2 \le 0.09, and since \theta(1 - \theta) \le 1/4,

MSE(\tilde{\theta}_1) \le \frac{1}{8} \cdot \frac{1}{4} + \frac{1}{8} \cdot \frac{1}{4} + \frac{1}{4}(0.09) < 0.09.

For \theta_1 near 1/2, MSE(\hat{\theta}_1) \approx 0.125, so

MSE(\tilde{\theta}_1) / MSE(\hat{\theta}_1) \le 0.72.

Therefore, for a wide range of values of \theta_1 and \theta_2, the simple average of \hat{\theta}_1 and \hat{\theta}_2 will
produce at least a 28% reduction in the MSE compared with the traditional MLE, \hat{\theta}_1.
The maximum reduction is achieved if these two coins happen to be of the same type,
i.e. \theta_1 = \theta_2. We remark that the upper bound on the bias in this case is 0.15. Notice
that the weights are chosen to be 0.5; however, they are not the optimum weights.
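The two-coin calculation above can be verified by exact enumeration (a sketch; the equal weights and \theta_2 = 0.6 are as in the text):

```python
from itertools import product

def binom2(s, th):
    # P(S = s) for the number of heads S in two tosses, P(H) = th
    return [(1 - th) ** 2, 2 * th * (1 - th), th ** 2][s]

def mses(theta1, theta2, w=0.5):
    """Exact MSE of the MLE S1/2 and the WLE w*S1/2 + (1-w)*S2/2."""
    mse_mle = mse_wle = 0.0
    for s1, s2 in product(range(3), range(3)):
        p = binom2(s1, theta1) * binom2(s2, theta2)
        mse_mle += p * (s1 / 2 - theta1) ** 2
        mse_wle += p * (w * s1 / 2 + (1 - w) * s2 / 2 - theta1) ** 2
    return mse_mle, mse_wle

m, wl = mses(0.5, 0.6)
# m = 0.125, wl = 0.06375, so the ratio is about 0.51
```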
1.6 The Scope of the Thesis

In Chapter 2 we will show that certain linear combinations of the MLE's derived
from two possibly different normal populations achieve smaller MSE than that of
the traditional one-sample MLE for finite sample sizes. A criterion for assessing the
relevance of two related samples is proposed to control the magnitude of possible bias
introduced by the combination of data. Results for two normal populations are shown
in this chapter.

The weighted likelihood function which uses all the relevant information is formally
defined in Chapter 3, where results for exponential families are presented and the
advantages of using the weighted likelihood are discussed. A choice of
weights for the linear WLE is proposed, and the non-stochastic limits of the proposed
optimum weights when the sample size of the first population goes to infinity are
found.
In Chapter 4 we establish asymptotic properties of the WLE
when the parameter space is a subset of R^p, p \ge 1. The asymptotic results proved
here differ from those of Hu (1997) because a different asymptotic paradigm is used.
There, information about \theta_1 builds up because the number of populations grows, with
increasingly many in close proximity to that of \theta_1. This is the paradigm commonly
invoked in the context of non-parametric regression, but it is not always the most
natural one. In contrast, we postulate a fixed number of populations with an increasingly
large number of samples from each. Asymptotically, the procedure can rely on just the
data from the population of interest alone. The asymptotic properties of the WLE
using adaptive weights, i.e. weights determined from the sample, are also established
in this chapter. These results offer guidance on the difficult problem of specifying \lambda.
In Chapter 5 we address the problem of choosing the adaptive weights by using the cross-validation
criterion. Stone (1974) introduces and studies in detail the cross-validatory
assessment of statistical predictions. The ideas of Stone
(1974) and Geisser (1975) are closely related to the linear WLE. Breiman and Friedman
(1997) also demonstrate the benefit of using cross-validation to obtain linear
combinations of predictors in multivariate regression. Although there are many ways of dividing the entire sample into
subsamples, such as a random selection of the validation sample, we use the simplest
leave-one-out approach in this chapter since the analytic forms of the optimum weights
are tractable for the linear WLE. The weak consistency and asymptotic normality
of the adaptive weights are also established.

A later chapter reviews the historical development of entropy and discusses the importance of the
maximum entropy principle. Hu and Zidek (1997) discovered the connection between
relevance weighted likelihood and the maximum entropy principle for the discrete case. We
shall show that the weighted likelihood function can be derived from the maximum
entropy principle. In the weighted likelihood setting the classical i.i.d. assumption is no longer
valid: observations from different samples follow different distributions. The saddlepoint
approximation technique in Daniels (1954) is therefore generalized for the non-i.i.d.
case, and the method proposed in Daniels (1983) is also generalized to derive the approximate density of the
WLE.

The last chapter of this thesis applies the WLE approach to disease mapping
data. Weekly hospital admission data are analyzed. The data from a particular site
and neighboring sites are combined to yield a more reliable estimate of the average
admission rate.
Chapter 2

Motivating Example: Normal Populations

2.1 A Motivating Example
Consider two similar linear regressions through the origin,

X_i = \alpha t_i + \epsilon_i,  Y_i = \beta t_i + \epsilon_i',  i = 1, ..., n,

where the \{t_i\}_{i=1}^{n} are fixed. The \{\epsilon_i\} are i.i.d., as are the \{\epsilon_i'\}, while Cov(\epsilon_i, \epsilon_j') =
0 if i \ne j and Cov(\epsilon_i, \epsilon_i') = \rho\sigma_1\sigma_2 for all i. For the purpose of our demonstration we
assume \rho, \sigma_1^2 and \sigma_2^2 are known, although that would rarely be the case in practice.
Note that a bivariate normal distribution is not assumed in the above model. In
fact, only the marginal distributions are specified; no joint distribution is assumed.

The parameters, \alpha and \beta, are of primary interest. They are thought or expected
to be not too different due to the "similarity" of the two experiments. The error
terms are assumed to be i.i.d. within each sample. The joint distributions of the
(X_i, Y_i), i = 1, ..., n, are not specified. Only the correlations between samples are
assumed to be known. The objective is to get reasonably good estimates for the
parameters without assuming a functional form of the joint distribution of (X_i, Y_i).
The marginal likelihoods are

L_1(x_1, x_2, ..., x_n; \alpha) \propto \prod_{i=1}^{n} \exp\left( -\frac{(x_i - \alpha t_i)^2}{2\sigma_1^2} \right),

L_2(y_1, y_2, ..., y_n; \beta) \propto \prod_{i=1}^{n} \exp\left( -\frac{(y_i - \beta t_i)^2}{2\sigma_2^2} \right),

so that

\ln L_1(\alpha) = -\sum_{i=1}^{n} \frac{(x_i - \alpha t_i)^2}{2\sigma_1^2} + const,

\ln L_2(\beta) = -\sum_{i=1}^{n} \frac{(y_i - \beta t_i)^2}{2\sigma_2^2} + const.

The MLE's based on the X_i and Y_i, respectively for \alpha and \beta, are

\hat{\alpha} = \frac{\sum_{i=1}^{n} t_i x_i}{\sum_{i=1}^{n} t_i^2},  \hat{\beta} = \frac{\sum_{i=1}^{n} t_i y_i}{\sum_{i=1}^{n} t_i^2}.
2.2 Weighted Likelihood Estimation

The novel feature of the problem of combining information that we have just described is that we can
incorporate the relevant information into the likelihood function by assigning possibly
different weights to the marginal likelihoods,
where \lambda_1 and \lambda_2 are weights selected according to the relevance of the likelihoods to
which they attach. Non-negativity of the weights is not assumed. We maximize
WL(\alpha), defined below, since \alpha is of primary interest at this stage and the marginal
distributions of the Y's are thought to resemble the marginal distributions of the X's.
Note that the WL depends on the distributions of the X's, but it does not depend
on the distribution of the Y's. Since X and Y are not independent, the weights of the
likelihood functions are designed to reflect the dependence that is not expressed in
the marginals. Notice that the joint distribution of the X's and Y's does not appear
in the weighted likelihood at all.
In

\ln WL(\alpha) = \lambda_1 \ln L_1(x_1, x_2, ..., x_n; \alpha) + \lambda_2 \ln L_2(y_1, y_2, ..., y_n; \alpha),

the score is

\frac{\partial \ln WL(\alpha)}{\partial \alpha} = \frac{\lambda_1}{\sigma_1^2} \sum_{i=1}^{n} t_i(x_i - \alpha t_i) + \frac{\lambda_2}{\sigma_2^2} \sum_{i=1}^{n} t_i(y_i - \alpha t_i).

Setting this to zero gives a linear combination of \hat{\alpha} and \hat{\beta}; with the variance factors absorbed into the weights, the WLE for \alpha is

\tilde{\alpha} = \frac{\lambda_1}{\lambda_1 + \lambda_2} \hat{\alpha} + \frac{\lambda_2}{\lambda_1 + \lambda_2} \hat{\beta},

which we write as

\tilde{\alpha} = \lambda_1 \hat{\alpha} + \lambda_2 \hat{\beta},

where \lambda_1 + \lambda_2 = 1.
This is admittedly an over-simplified model. The weights \lambda_1 and \lambda_2 reflect the importance of \hat{\alpha} and \hat{\beta}. Intuitively,
the inequality \lambda_2 \le \lambda_1 should be satisfied, because direct sample information
from \{X_i\}_{i=1}^{n} should be more reliable than relevant sample information from \{Y_i\}_{i=1}^{n}.
Obviously, the WLE is the MLE obtained from the marginal distribution if the weight
for the second marginal likelihood function is set to be zero. This may happen when
evidence suggests that the seemingly relevant sample information is actually totally
irrelevant. In that case, we would not want to include that information and thereby
bias the estimate.
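A minimal numerical sketch of the construction (hypothetical data; the weights are illustrative, not optimal):

```python
# Two samples measured at the same fixed design points t.
t = [1.0, 2.0, 3.0, 4.0]
x = [1.1, 1.9, 3.2, 3.9]    # from population 1, slope alpha
y = [1.3, 2.3, 3.5, 4.4]    # from the related population, slope beta

St2 = sum(ti * ti for ti in t)                            # S_t^2
alpha_hat = sum(ti * xi for ti, xi in zip(t, x)) / St2    # marginal MLE of alpha
beta_hat = sum(ti * yi for ti, yi in zip(t, y)) / St2     # marginal MLE of beta

lam1, lam2 = 0.75, 0.25      # analyst-chosen, lam1 + lam2 = 1
alpha_wle = lam1 * alpha_hat + lam2 * beta_hat
```

Setting lam2 = 0 recovers the one-sample MLE, matching the remark above.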
We call the estimator derived from the weighted likelihood function the WLE, in
line with the terminology found in the literature, although the estimator described here
differs from others proposed in those published papers. In particular, we work with
the problem in which the number of samples is fixed in advance, while the authors
of the REWLE were interested in the problem where the number of populations goes
to infinity.
2.3 A Criterion for Assessing Relevance

The weighted likelihood estimator is a linear combination of the MLE's derived from
the likelihood function under the condition of marginal normality. We would like to
find a good WLE under our model, since there is no guarantee that any WLE will
perform well: its performance
will be determined by the weights assigned to the likelihood function and the MLE's
obtained from the marginals. The value of any estimator will depend on our choice of
a loss function. The most commonly used criterion, the mean squared error (MSE),
is selected as the measure of the performance of the WLE. The next proposition will
give a lower bound for the ratio \lambda_1/\lambda_2 such that the WLE will outperform the MLE
obtained from the marginal distribution. For simplicity, we make the assumption that
\hat{\alpha} and \hat{\beta} are unbiased.
Since E(\hat{\alpha}) = \alpha and E(\hat{\beta}) = \beta, we have

E(\tilde{\alpha}) = \lambda_1 \alpha + \lambda_2 \beta,

so that

E(\tilde{\alpha} - \alpha) = \lambda_2 (\beta - \alpha),

and the bias of \tilde{\alpha} is
bounded by \lambda_2 C if |\beta - \alpha| \le C. Moreover,

Var(\hat{\alpha}) = \frac{\sigma_1^2}{S_t^2},  Var(\hat{\beta}) = \frac{\sigma_2^2}{S_t^2},  Cov(\hat{\alpha}, \hat{\beta}) = \frac{\rho \sigma_1 \sigma_2}{S_t^2},

where S_t^2 = \sum_{i=1}^{n} t_i^2.

Furthermore, we have

Var(\tilde{\alpha}) = (\lambda_1^2 \sigma_1^2 + \lambda_2^2 \sigma_2^2 + 2 \lambda_1 \lambda_2 \rho \sigma_1 \sigma_2) / S_t^2.
We would like to choose weights that are independent of \alpha and \beta, so that the
comparison holds uniformly. We have

E(\hat{\alpha} - \alpha)^2 = Var(\hat{\alpha}) = \frac{\sigma_1^2}{S_t^2},

and

E(\tilde{\alpha} - \alpha)^2 = Var(\tilde{\alpha}) + Bias^2,

so that

\max_{|\beta - \alpha| \le C} E(\tilde{\alpha} - \alpha)^2 \le (\lambda_1^2 \sigma_1^2 + \lambda_2^2 \sigma_2^2 + 2 \lambda_1 \lambda_2 \rho \sigma_1 \sigma_2)/S_t^2 + \lambda_2^2 C^2.

Writing \lambda_1 = 1 - \lambda_2, it follows that \max_{|\beta - \alpha| \le C} MSE(\tilde{\alpha}) < MSE(\hat{\alpha}) for 0 < \lambda_2 satisfying

\lambda_2 < \frac{2(\sigma_1^2 - \rho \sigma_1 \sigma_2)}{\sigma_1^2 + \sigma_2^2 - 2\rho \sigma_1 \sigma_2 + C^2 S_t^2}.  (2.4)

The optimum weights which achieve the minimum of the MSE in this case will
be positive, as will be shown in the next section. Thus, we will not consider the case
\lambda_2 < 0.

If \sigma_1^2 = \sigma_2^2 = \sigma^2, then equation (2.4) reduces to the inequality

\frac{\lambda_1}{\lambda_2} > \frac{C^2 S_t^2}{2(1 - \rho)\sigma^2}.

From equation (2.4), observe that we will have \lambda_1 = 1 and \lambda_2 = 0 for a number
of special cases:

1) C \to \infty and \rho < 1. Here we have little information about the upper bound for the
distance between the two parameters, or we do not have any prior information at all.

2) S_t^2 \to \infty and \rho < 1. We already have enough data to make inferences, and the additional
relevant data has little value in terms of estimating the parameter of primary
interest.

3) \sigma_1^2 \to 0 and \rho < 1. This means that the precision of the data from the first sample
is already so high that the second sample can add little.
Figure 2.1: A special case: the solid line is \max_{|\beta - \alpha| \le C} MSE(WLE), and the
broken line represents the MSE(MLE). The X-axis represents the value of \lambda_1.

A special case is discussed here to gain some insight into the situation. The upper
bound of the mean squared error for the WLE as a function of \lambda_1 is shown in Figure 2.1.
It can be seen that the upper bound of the MSE for the WLE will be less than that
of the MLE if \lambda_1 lies in a certain range. From Proposition 2.1, the cut-off point
should be 0.5, which is exactly the case shown in the graph. In fact, the minimum
of the \max_{|\beta - \alpha| \le C} MSE can be obtained by using Proposition 2.2, which will be
established in the next section.
2.4 The Optimum WLE

In the previous section we found the threshold for \lambda_1 and \lambda_2 in order to make the WLE
outperform the MLE in the sense of obtaining an MSE and variance which are smaller
than those of the MLE uniformly on the set \{(\alpha, \beta) : |\beta - \alpha| \le C\}. Because there are
numerous cases in which \lambda_1 and \lambda_2 satisfy the threshold, we might want to find
the optimum WLE in the sense of minimizing the maximum of the MSE and variance.
In the equal-variance case \sigma_1^2 = \sigma_2^2 = \sigma^2, the minimax solutions are

\tilde{\alpha} = \frac{1+K}{2+K} \hat{\alpha} + \frac{1}{2+K} \hat{\beta},

\tilde{\beta} = \frac{1}{2+K} \hat{\alpha} + \frac{1+K}{2+K} \hat{\beta},

where K = \frac{C^2 S_t^2}{(1-\rho)\sigma^2}.

In fact, the stationarity equation for the upper bound of the MSE has a unique solution, and it can be verified that the
minimum is achieved at that point by checking the second derivative. The solution
is \lambda_1^* = \frac{1+K}{2+K} and \lambda_2^* = \frac{1}{2+K}, where K = \frac{C^2 S_t^2}{(1-\rho)\sigma^2}. Note that the
weights \lambda_1^* and \lambda_2^* satisfy the condition \frac{\lambda_1}{\lambda_2} > \frac{K}{2}, because \frac{\lambda_1^*}{\lambda_2^*} = K + 1 > \frac{K}{2}. This
implies that, with the optimum weights given above, the WLE will achieve a smaller
MSE than the MLE according to Proposition 2.1.

Thus, the optimum weights for estimating \alpha are

\lambda_1^* = \frac{1+K}{2+K},  \lambda_2^* = \frac{1}{2+K},

and

\tilde{\alpha} = \frac{1+K}{2+K} \hat{\alpha} + \frac{1}{2+K} \hat{\beta},

where K = \frac{C^2 S_t^2}{(1-\rho)\sigma^2}. Note that the optimum WLE for \beta is obtained by the argument
of symmetry. \Box

Notice that \lambda_2^* < \frac{1}{2} for all K > 0. This implies that, for the purpose of estimating
\alpha, \hat{\alpha} should never get a weight which is less than that of \hat{\beta} if we want to obtain the
optimum WLE. This is consistent with our intuition, since relevant sample information
(the \{Y_i\}'s) should never get a larger weight than direct sample information (the \{X_i\}'s)
when the sample sizes and variances are all equal. It should be noted that the
WLE under a marginal normality assumption is a linear combination of the MLE's.
Furthermore, the optimum WLE is the best linear combination of MLE's in the sense
of achieving the smallest upper bound for the MSE compared to other linear combinations
of MLE's. The optimization procedure is used to obtain the minimax solution in the
sense that basically we are minimizing the upper bound of the MSE function over the
set \{(\alpha, \beta) : |\alpha - \beta| \le C\}.
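The optimum weights translate directly into code (a sketch under the equal-variance setup of this section):

```python
def optimum_weights(C, St2, sigma2, rho):
    """Minimax weights of Proposition 2.2: K = C^2 S_t^2 / ((1 - rho) sigma^2)."""
    K = C ** 2 * St2 / ((1 - rho) * sigma2)
    return (1 + K) / (2 + K), 1 / (2 + K)

lam1, lam2 = optimum_weights(C=0.5, St2=10.0, sigma2=1.0, rho=0.2)
# here K = 3.125, so lam1 = 4.125 / 5.125 and lam2 = 1 / 5.125
```

As C grows (little prior similarity), lam2 shrinks toward 0, matching the special cases noted after equation (2.4).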
2.5 Results for Bivariate Normal Populations

This section contains some results for bivariate normal distributions. The following
results should be considered as immediate corollaries of Propositions 2.1 and 2.2.

Corollary 2.1 Let

\begin{pmatrix} X_i \\ Y_i \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix} \right),  i = 1, ..., n.

If |\mu_X - \mu_Y| \le C and \rho < 1, then the optimum WLE's for estimating the marginal
means are

\hat{\mu}_X = \frac{1+M}{2+M} \bar{X} + \frac{1}{2+M} \bar{Y},

\hat{\mu}_Y = \frac{1}{2+M} \bar{X} + \frac{1+M}{2+M} \bar{Y},

where M = \frac{n C^2}{(1-\rho)\sigma^2}.

Proof: Setting \hat{\alpha} = \bar{X}
and \hat{\beta} = \bar{Y}, we can apply Proposition 2.2 to get the optimum WLE. Therefore the
optimum WLE will have smaller maximum MSE and variance than the MLE, \bar{X}, obtained from
the first marginal. Since \hat{\mu}_X + \hat{\mu}_Y = \bar{X} + \bar{Y}, we also have the
following:

Var(\hat{\mu}_X) + Var(\hat{\mu}_Y) + 2 Cov(\hat{\mu}_X, \hat{\mu}_Y) = Var(\bar{X}) + Var(\bar{Y}) + 2 Cov(\bar{X}, \bar{Y}).

It follows that

Cov(\hat{\mu}_X, \hat{\mu}_Y) > Cov(\bar{X}, \bar{Y}). \Box

From Corollary 2.1, we draw the conclusion that the optimum WLE out-performs
the MLE when |\mu_X - \mu_Y| \le C is true.
Corollary 2.2 Under the conditions of Corollary 2.1,

\hat{\mu}_X - \bar{X} \to 0 in probability as n \to \infty.

Proof:

P(|\hat{\mu}_X - \bar{X}| > \epsilon) = P\left( \frac{1}{2+M} |\bar{X} - \bar{Y}| > \epsilon \right) \le \frac{E(\bar{X} - \bar{Y})^2}{(2+M)^2 \epsilon^2} \to 0 as n \to \infty. \Box
Corollary 2.3 The optimum WLE is strongly consistent under the conditions of
Corollary 2.1.

Proof: Consider

P(|\hat{\mu}_X - \mu_X| > \epsilon) \le P\left( |\hat{\mu}_X - \bar{X}| > \frac{\epsilon}{2} \right) + P\left( |\bar{X} - \mu_X| > \frac{\epsilon}{2} \right).

Bounding each term by a fourth-moment Markov inequality gives

P(|\hat{\mu}_X - \mu_X| > \epsilon) \le \frac{M'}{n^2},

where M' is a constant. Thus,

\sum_{n=1}^{\infty} P(|\hat{\mu}_X - \mu_X| > \epsilon) < \infty,

and strong consistency follows from the Borel-Cantelli lemma. \Box
It follows from Corollary 2.1 that the optimum WLE is preferable to the MLE. However,
the relevant information will play a decreasingly important role as the direct
sample information accumulates.
Motivated by our previous results, it seems that the linear combination of MLE's is
worth studying even when the marginal distributions are not assumed
to be normal. A more general result is given as follows when the normality condition
is relaxed.

Theorem 2.1 Let X_1, X_2, ..., X_n be i.i.d. random variables with E(X_i) = \alpha and
Var(X_i) = \sigma_1^2 < \infty, and let Y_1, Y_2, ..., Y_n be i.i.d. random variables with E(Y_i) = \beta
and Var(Y_i) < \infty. Let \hat{\alpha} and \hat{\beta} denote unbiased estimates of \alpha and \beta respectively,
with Var(\hat{\beta}) \le Var(\hat{\alpha}), and suppose |\beta - \alpha| \le C. Set

\lambda_1 = \frac{1+K}{2+K},  \lambda_2 = \frac{1}{2+K},  K = \frac{C^2}{(1-\rho)Var(\hat{\alpha})},

where \rho = cor(\hat{\alpha}, \hat{\beta}) < 1 while
Var(\hat{\alpha}) is assumed known. Then \tilde{\alpha} = \lambda_1 \hat{\alpha} + \lambda_2 \hat{\beta} satisfies
\max_{|\beta - \alpha| \le C} E(\tilde{\alpha} - \alpha)^2 \le E(\hat{\alpha} - \alpha)^2, and \tilde{\alpha} - \hat{\alpha} \to 0 in probability
as n \to \infty.

Proof: First,

Var(\tilde{\alpha}) = Var(\lambda_1 \hat{\alpha} + \lambda_2 \hat{\beta})
= \lambda_1^2 Var(\hat{\alpha}) + \lambda_2^2 Var(\hat{\beta}) + 2 \lambda_1 \lambda_2 Cov(\hat{\alpha}, \hat{\beta})
\le \lambda_1^2 Var(\hat{\alpha}) + \lambda_2^2 Var(\hat{\alpha}) + 2 \lambda_1 \lambda_2 \rho \sqrt{Var(\hat{\alpha})} \sqrt{Var(\hat{\beta})}
\le (\lambda_1^2 + \lambda_2^2 + 2 \lambda_1 \lambda_2 \rho) Var(\hat{\alpha})
\le Var(\hat{\alpha}).

Similarly, for the MSE,

E(\tilde{\alpha} - \alpha)^2 = \lambda_1^2 Var(\hat{\alpha}) + \lambda_2^2 Var(\hat{\beta}) + 2 \lambda_1 \lambda_2 Cov(\hat{\alpha}, \hat{\beta}) + \lambda_2^2 (\beta - \alpha)^2
\le (\lambda_1^2 + \lambda_2^2 + 2 \lambda_1 \lambda_2 \rho) Var(\hat{\alpha}) + \lambda_2^2 C^2.

Now we are in a position to apply Proposition 2.2 to get the weights for the best
linear combination of \hat{\alpha} and \hat{\beta}. Consequently we have the best linear combination
\tilde{\alpha} = \lambda_1 \hat{\alpha} + \lambda_2 \hat{\beta}, where the weights are \lambda_1 = \frac{1+K}{2+K}, \lambda_2 = \frac{1}{2+K} and
K = \frac{C^2}{(1-\rho)Var(\hat{\alpha})}. Also,
we have \max_{|\beta - \alpha| \le C} E(\tilde{\alpha} - \alpha)^2 \le E(\hat{\alpha} - \alpha)^2. The last statement of this theorem can
be established in the same way as shown in Corollary 2.2. \Box

The above theorem implies that any information contained in \hat{\beta} can be used to
improve the quality of inference provided it is as reliable as \hat{\alpha}. Normality assumptions
for the marginals are no longer necessary to obtain a better estimate of \alpha under the
conditions of Theorem 2.1. Moreover, we can correct for our "working assumption"
of independent samples through likelihood weights.
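A Monte Carlo sketch of Theorem 2.1 with non-normal marginals (exponential data and independent samples, so \rho = 0; all numbers are hypothetical):

```python
import random

random.seed(0)
alpha, beta, n, C = 1.0, 0.9, 10, 0.2     # |beta - alpha| <= C
var_alpha_hat = alpha ** 2 / n            # Var(X_bar) for Exp(mean alpha)
K = C ** 2 / ((1 - 0.0) * var_alpha_hat)  # rho = 0 for independent samples
lam1, lam2 = (1 + K) / (2 + K), 1 / (2 + K)

reps, se_mle, se_wle = 20000, 0.0, 0.0
for _ in range(reps):
    xbar = sum(random.expovariate(1 / alpha) for _ in range(n)) / n
    ybar = sum(random.expovariate(1 / beta) for _ in range(n)) / n
    se_mle += (xbar - alpha) ** 2
    se_wle += (lam1 * xbar + lam2 * ybar - alpha) ** 2
# se_wle comes out well below se_mle, as the theorem predicts
```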
Chapter 3
The weighted likelihood is

WL(\theta_1) = \prod_{i=1}^{m} \prod_{j=1}^{n_i} f(x_{ij}; \theta_i)^{\lambda_i},

so that

\log WL(\theta_1) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \lambda_i \log f(x_{ij}; \theta_i).

We will use \hat{\theta}_1 and \theta_1 to denote the WLE and the true value of the parameter in the sequel.
3.2 Results for One-Parameter Exponential Families

Exponential family models play a central role in classical statistical theory for independent
observations. Many models used in statistical practice are of exponential
family form. Assume that the observations from the first population follow the
same distribution from the exponential family with one parameter, \theta_1, that is,

f(x; \theta_1) = \exp( A(\theta_1) T(x) + B(\theta_1) + S(x) ).

For n_1 observations which are i.i.d. from the above distribution,
it follows that

\frac{\partial}{\partial \theta_1} \ln L_1(x_{11}, ..., x_{1n_1}; \theta_1) = n_1 \left( A'(\theta_1) \bar{T}(x^1) + B'(\theta_1) \right),

where \bar{T}(x^1) = \frac{1}{n_1} \sum_{j=1}^{n_1} T(x_{1j}).

It is known (Lehmann 1983, p. 123) that for the exponential family, the necessary
and sufficient condition for an unbiased estimator to achieve the Cramer-Rao lower
bound is that there exists D(\theta_1) such that

\frac{\partial \ln L_1(X_{11}, X_{12}, ..., X_{1n_1}; \theta_1)}{\partial \theta_1} = n_1 D(\theta_1) (\bar{T}(x^1) - \theta_1),  \forall \theta_1.
Theorem 3.1 Assume that, for any given i, X_{ij} \sim f(x; \theta_i), j = 1, 2, ..., n_i. Then
the WLE of \theta_1 will be a linear combination of the MLE's if the Cramer-Rao
lower bound can be achieved by the MLE's derived from the marginals.

Proof: From

\frac{\partial \ln L_i}{\partial \theta_1} = n_i D(\theta_1) (\bar{T}(x^i) - \theta_1),

it can be seen that \bar{T}(x^1) is the traditional MLE for \theta_1, which achieves the Cramer-Rao
lower bound. Then

\frac{\partial \ln WL}{\partial \theta_1} = \sum_{i=1}^{m} \lambda_i n_i D(\theta_1) (\bar{T}(x^i) - \theta_1),

and setting this to zero gives \hat{\theta}_1 = \sum_{i=1}^{m} \tilde{t}_i \bar{T}(x^i),

where \tilde{t}_i = \lambda_i n_i / \sum_{j=1}^{m} \lambda_j n_j. \Box

Therefore the WLE of \theta_1 is a linear combination of the MLE's obtained from the
marginal distributions.
Theorem 3.2 For distributions of the exponential family form, suppose the MLE of
\theta_i from the ith sample is \bar{T}(x^i). Then the WLE is

\hat{\theta}_1 = \sum_{i=1}^{m} \tilde{t}_i \bar{T}(x^i),  where \tilde{t}_i = n_i \lambda_i / \sum_{j=1}^{m} \lambda_j n_j.

Proof: As seen above,

\frac{\partial \ln L_i(x_{i1}, x_{i2}, ..., x_{in_i}; \theta_1)}{\partial \theta_1} = n_i \left( A'(\theta_1) \bar{T}(x^i) + B'(\theta_1) \right).

Setting the derivative of the weighted log-likelihood to zero,

\sum_{i=1}^{m} \lambda_i n_i \left( A'(\theta_1) \bar{T}(x^i) + B'(\theta_1) \right) = 0,

which reduces to

A'(\theta_1) \left( \sum_{i=1}^{m} \tilde{t}_i \bar{T}(x^i) \right) + B'(\theta_1) = 0,

where \tilde{t}_i = n_i \lambda_i / \sum_{j=1}^{m} \lambda_j n_j. \Box

Therefore the WLE has the same functional form as the MLE obtained from a
single sample, but with a weighted average of the sufficient
statistics from all the samples in place of the sufficient statistic from a single
sample. The advantage of doing this is that it may be a better estimator in terms of
variance and MSE.
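For example (a sketch with made-up counts), in the Poisson family the sufficient statistic is the sample mean, so the WLE of the first population's rate is the \tilde{t}-weighted average of the per-sample means:

```python
# Hypothetical Poisson counts from m = 3 related populations.
samples = [[3, 4, 2, 5, 3, 4],     # population 1 (parameter of interest)
           [4, 5, 3, 6],           # related population 2
           [2, 3, 3, 4, 3]]        # related population 3
lam = [1.0, 0.3, 0.3]              # analyst-chosen likelihood weights

means = [sum(s) / len(s) for s in samples]            # per-sample MLE's
sizes = [len(s) for s in samples]
denom = sum(l * n for l, n in zip(lam, sizes))
t_tilde = [l * n / denom for l, n in zip(lam, sizes)] # t_i = lam_i n_i / sum
theta1_wle = sum(ti * mi for ti, mi in zip(t_tilde, means))
```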
3.3 WLE on Restricted Parameter Spaces

Estimation on restricted parameter spaces is a well-studied
problem. An overview of the history and some recent developments can be found
in van Eeden (1996). van Eeden and Zidek (1998) consider the problem of combining
sample information in estimating ordered normal means; van Eeden and Zidek
(2001) also consider estimating one of two normal means when their difference is
bounded. We are concerned with combining the sample information when a number
of related samples are available and prior
information in some form is available from past studies or expert opinions. All statisticians
must make use of such information in their analysis, although they might differ
in how they do so. As shown above, the WLE for exponential families
is a linear combination of the individual MLE's. We remark that the results of this section
hold for the general case where the new estimator is a linear combination of the
estimators derived from each sample. We need the following lemma on optimization
to prove the major theorem of this section. We emphasize that we do not require the
\lambda_i's to be non-negative.
Lemma 3.1 Let \lambda = (\lambda_1, \lambda_2, ..., \lambda_m)^t with \sum_{i=1}^{m} \lambda_i = 1, and let A be a symmetric,
invertible m \times m matrix. Then \lambda^t A \lambda is optimized subject to the constraint at

\lambda^* = \frac{A^{-1} 1}{1^t A^{-1} 1},

provided 1^t A^{-1} 1 > 0.

Proof: Using the Lagrange method for optimizing an objective function under constraints,
we only need to consider

G = \lambda^t A \lambda - c(1^t \lambda - 1),

subject to \sum_{i=1}^{m} \lambda_i = 1.

Differentiating the function G gives

\frac{\partial G}{\partial \lambda} = 2 A \lambda - c 1,

so the stationary point satisfies \lambda = \frac{c}{2} A^{-1} 1. Now 1 = 1^t \lambda = \frac{c}{2} 1^t A^{-1} 1,

so c = \frac{2}{1^t A^{-1} 1}.

Therefore,

\lambda^* = \frac{A^{-1} 1}{1^t A^{-1} 1}. \Box
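Lemma 3.1 is easy to apply numerically; a sketch that forms A^{-1}1 by plain Gauss-Jordan elimination (any linear solver would do):

```python
def optimal_weights(A):
    """lambda* = A^{-1} 1 / (1^t A^{-1} 1), as in Lemma 3.1."""
    m = len(A)
    aug = [row[:] + [1.0] for row in A]     # augment with the 1-vector
    for k in range(m):                      # Gauss-Jordan with partial pivoting
        p = max(range(k, m), key=lambda r: abs(aug[r][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for r in range(m):
            if r != k:
                f = aug[r][k] / aug[k][k]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[k])]
    z = [aug[r][m] / aug[r][r] for r in range(m)]   # z = A^{-1} 1
    return [zi / sum(z) for zi in z]                # normalize: 1^t lambda = 1
```

For A = diag(2, 1) this returns (1/3, 2/3): the component with the smaller diagonal entry gets the larger weight.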
Theorem 3.3 Suppose X_{ij}, j = 1, ..., n_i, are observations from m related data sources
with E(X_{ij}) = \theta_i, where m is the number of related data sources and n_i is the
sample size for each source. The marginal distributions are assumed to be known; the
joint distribution across those related sources is not assumed. Instead, the covariances
of the MLE's are assumed known, the variances
are all finite, and |\theta_i - \theta_1| \le C_i, where C_i is a known constant. Let \hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, ..., \hat{\theta}_m)^t,
where \hat{\theta}_i = \hat{\theta}_i(x_{i1}, x_{i2}, ..., x_{in_i}) is the MLE for the parameter \theta_i derived from the distribution
of the X_{ij}. Suppose E(\hat{\theta}_i) = \theta_i and V = cov(\hat{\theta}) are known, and that (V + BB^t)^{-1}
exists, where B = (C_1, C_2, ..., C_m)^t with C_1 = 0. Then the optimum linear WLE of \theta_1 is

\tilde{\theta}_1 = \sum_{i=1}^{m} \lambda_i^* \hat{\theta}_i,

where

\lambda^* = (\lambda_1^*, \lambda_2^*, ..., \lambda_m^*)^t = \frac{(V + BB^t)^{-1} 1}{1^t (V + BB^t)^{-1} 1}.

Proof: We are seeking the best linear combination of these MLE's derived from the
marginal distributions. As before, let us consider the MSE of the WLE:

E(\tilde{\theta}_1 - \theta_1)^2 = E\left[ \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_i \lambda_j (\hat{\theta}_i - \theta_1)(\hat{\theta}_j - \theta_1) \right]
= \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_i \lambda_j E\left[ (\hat{\theta}_i - \theta_i + \theta_i - \theta_1)(\hat{\theta}_j - \theta_j + \theta_j - \theta_1) \right]
= \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_i \lambda_j \left( cov(\hat{\theta}_i, \hat{\theta}_j) + (\theta_i - \theta_1)(\theta_j - \theta_1) \right)
= \lambda^t cov(\hat{\theta}) \lambda + \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_i \lambda_j (\theta_i - \theta_1)(\theta_j - \theta_1)
\le \lambda^t cov(\hat{\theta}) \lambda + \lambda^t BB^t \lambda  (by the assumptions)
= \lambda^t (V + BB^t) \lambda.

Applying Lemma 3.1 with A = V + BB^t, the upper bound is minimized by \lambda^*, and the optimum linear
WLE is

\tilde{\theta}_1 = \sum_{i=1}^{m} \lambda_i^* \hat{\theta}_i. \Box

The derivation involves only the first and the second moments of the marginal distributions and
the joint distributions. The whole procedure consists of two stages. The first step is
to work out the functional form of the MSE; the second step is to work out the
optimum weights to construct the optimum WLE. Note that the optimum weights
are functions of the matrices V and B.
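As a numerical check (hypothetical numbers), the Theorem 3.3 weights for the two-population case reproduce the Proposition 2.2 weights:

```python
# Two-population case, equal variances: V + BB^t with B = (0, C)^t should
# give lam1* = (1+K)/(2+K) and lam2* = 1/(2+K), K = C^2 S_t^2 / ((1-rho) sigma^2).
sigma2, rho, C, St2 = 1.0, 0.2, 0.5, 10.0
v = sigma2 / St2                          # Var(alpha_hat) = Var(beta_hat)
V = [[v, rho * v], [rho * v, v]]          # covariance matrix of the MLE's
A = [[V[0][0], V[0][1]], [V[1][0], V[1][1] + C ** 2]]   # V + BB^t

# Solve A z = (1, 1)^t in closed form for the 2x2 case, then normalize.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
z = [(A[1][1] - A[0][1]) / det, (A[0][0] - A[1][0]) / det]
lam = [zi / sum(z) for zi in z]

K = C ** 2 * St2 / ((1 - rho) * sigma2)
# lam[0] equals (1 + K) / (2 + K) and lam[1] equals 1 / (2 + K)
```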
Let's consider some special cases:
3.3. W L EO N RESTRICTED P A R A M E T E R SPACES 36
(i) \(B = 0\) and \(V = \sigma^2 I\).
This implies that all the data can be pooled together in such a way that equal weights should be attached to the MLE derived from each marginal distribution, since now
\[ \tilde{\theta}_1 = \frac{\hat{\theta}_1 + \hat{\theta}_2 + \cdots + \hat{\theta}_m}{m}. \]
Note that most current pooling methods estimate the population parameter \(\theta\) as
\[ \hat{\theta} = \frac{\sum_{i=1}^m \hat{\theta}_i/\sigma_i^2}{\sum_{i=1}^m 1/\sigma_i^2}, \]
provided that the \(Var(\hat{\theta}_i) = \sigma_i^2\) are known. The optimal weights under this assumption are proportional to the precision in the ith data set, that is, to the inverse of its variance. Thus the optimum WLE coincides with the weighted least squares estimator under this covariance assumption.
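The coincidence with weighted least squares can be illustrated directly. The sketch below uses made-up estimates and variances (not values from the thesis); the precision-weighted combination equals the minimizer of the weighted sum of squares:

```python
import numpy as np

theta_hat = np.array([1.2, 0.9, 1.05])   # hypothetical MLEs from three sources
var = np.array([0.5, 0.2, 1.0])          # their known variances sigma_i^2

# Inverse-variance (precision) weighting.
w = (1.0 / var) / np.sum(1.0 / var)
pooled = np.sum(w * theta_hat)

# Weighted least squares: minimize sum_i (theta_hat_i - theta)^2 / var_i.
# Setting the derivative to zero gives theta = sum(theta_i/var_i)/sum(1/var_i).
wls = np.sum(theta_hat / var) / np.sum(1.0 / var)

assert abs(pooled - wls) < 1e-12
```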
Furthermore, let us apply Theorem 3.3 to our motivating example, since it is also a special case of the theorem. To be consistent with the notation used in Section 2, we still use a and b to denote the parameters of interest. We need to work out the matrices V and \(BB^t\) before applying the theorem, since both are assumed to be known. As before, we have
\[ Var(\hat{a}) = Var(\hat{b}) = \frac{\sigma^2}{S_t^2} \quad \text{and} \quad cov(\hat{a}, \hat{b}) = \frac{\rho\sigma^2}{S_t^2}. \]
\[ V = cov\big((\hat{a},\hat{b})^t\big) = \frac{1}{S_t^2}\begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}, \qquad BB^t = \begin{pmatrix} 0 & 0 \\ 0 & C^2 \end{pmatrix}, \]
so that
\[ V + BB^t = \frac{1}{S_t^2}\begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 + C^2 S_t^2 \end{pmatrix}. \]
Therefore,
\[ (V + BB^t)^{-1} = \frac{S_t^2}{\sigma^4 - \rho^2\sigma^4 + C^2 S_t^2\sigma^2}\begin{pmatrix} \sigma^2 + C^2 S_t^2 & -\rho\sigma^2 \\ -\rho\sigma^2 & \sigma^2 \end{pmatrix}. \]
Thus, we have
\[ (V + BB^t)^{-1}\mathbf{1} = \frac{S_t^2}{\sigma^4 - \rho^2\sigma^4 + C^2 S_t^2\sigma^2}\begin{pmatrix} (1-\rho)\sigma^2 + C^2 S_t^2 \\ (1-\rho)\sigma^2 \end{pmatrix}. \]
Furthermore,
\[ \mathbf{1}^t(V + BB^t)^{-1}\mathbf{1} = \frac{S_t^2\,\big[2(1-\rho)\sigma^2 + C^2 S_t^2\big]}{\sigma^4 - \rho^2\sigma^4 + C^2 S_t^2\sigma^2}, \]
so that
\[ \lambda^* = \Big(\frac{(1-\rho)\sigma^2 + C^2 S_t^2}{2(1-\rho)\sigma^2 + C^2 S_t^2},\; \frac{(1-\rho)\sigma^2}{2(1-\rho)\sigma^2 + C^2 S_t^2}\Big)^t = \Big(\frac{1+K}{2+K},\; \frac{1}{2+K}\Big)^t, \]
where
\[ K = \frac{C^2 S_t^2}{(1-\rho)\sigma^2}. \]
These are exactly the weights we derived in Proposition 2.2. We have therefore shown that Theorem 3.3 can be used to derive the results we found earlier in Proposition 2.2.
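The agreement between the general formula and the Proposition 2.2 weights can be verified numerically. The sketch below uses hypothetical values of \(\sigma^2, \rho, C, S_t^2\) (illustrative only):

```python
import numpy as np

# Hypothetical parameter values for the motivating example.
sigma2, rho, C, St2 = 1.0, 0.4, 0.5, 10.0

V = (1.0 / St2) * np.array([[sigma2, rho * sigma2],
                            [rho * sigma2, sigma2]])
BBt = np.array([[0.0, 0.0],
                [0.0, C**2]])
one = np.ones(2)

# General optimum-weight formula from Theorem 3.3.
x = np.linalg.solve(V + BBt, one)
lam = x / (one @ x)

# Closed form from Proposition 2.2.
K = C**2 * St2 / ((1.0 - rho) * sigma2)
lam_closed = np.array([(1 + K) / (2 + K), 1.0 / (2 + K)])

assert np.allclose(lam, lam_closed)
```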
3.4 Limits of Optimum Weights

Theorem 3.4 Re-write \(V = \sigma_1^2 W\), where \(\sigma_1^2\) is a function of \(n_1\) and W is a function of \(n_1, n_2, \ldots, n_m\). Assume that \(\sigma_1^2(n_1) \to 0\) and \(W \to W_0\), invertible, as \(n_1 \to \infty\). Then
\[ \lim_{n_1\to\infty}\lambda^* = \frac{S\mathbf{1}}{\mathbf{1}^t S\mathbf{1}}, \qquad \text{where } S = W_0^{-1} - \frac{W_0^{-1}BB^tW_0^{-1}}{B^tW_0^{-1}B}. \]

Proof: By the Sherman–Morrison formula, it follows that
\[ (\sigma_1^2 W + BB^t)^{-1} = \frac{1}{\sigma_1^2}\Big(W^{-1} - \frac{W^{-1}BB^tW^{-1}}{\sigma_1^2 + B^tW^{-1}B}\Big). \]
Therefore,
\[ \lim_{n_1\to\infty}\lambda^* = \lim_{n_1\to\infty}\frac{(\sigma_1^2 W + BB^t)^{-1}\mathbf{1}}{\mathbf{1}^t(\sigma_1^2 W + BB^t)^{-1}\mathbf{1}} = \lim_{n_1\to\infty}\frac{\Big(W^{-1} - \dfrac{W^{-1}BB^tW^{-1}}{\sigma_1^2 + B^tW^{-1}B}\Big)\mathbf{1}}{\mathbf{1}^t\Big(W^{-1} - \dfrac{W^{-1}BB^tW^{-1}}{\sigma_1^2 + B^tW^{-1}B}\Big)\mathbf{1}} = \frac{S\mathbf{1}}{\mathbf{1}^t S\mathbf{1}}, \]
where
\[ S = W_0^{-1} - \frac{W_0^{-1}BB^tW_0^{-1}}{B^tW_0^{-1}B}. \]
This completes the proof. □
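Both the Sherman–Morrison step and the limiting weights can be checked numerically. The sketch below is an illustration only; it uses a randomly generated positive definite W (not a matrix from the thesis) and C = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
M = rng.normal(size=(m, m))
W = M @ M.T + m * np.eye(m)           # random symmetric positive definite W
B = np.array([0.0, 1.0, 1.0, 1.0])    # B = (0, C, ..., C)^t with C = 1
one = np.ones(m)

def weights(sigma2):
    A = sigma2 * W + np.outer(B, B)
    x = np.linalg.solve(A, one)
    return x / (one @ x)

# Sherman-Morrison identity for (sigma1^2 W + B B^t)^{-1}.
sigma2 = 0.3
Winv = np.linalg.inv(W)
sm = (Winv - np.outer(Winv @ B, B @ Winv) / (sigma2 + B @ Winv @ B)) / sigma2
assert np.allclose(sm, np.linalg.inv(sigma2 * W + np.outer(B, B)))

# Limit as sigma1^2 -> 0: weights approach S 1 / (1' S 1).
S = Winv - np.outer(Winv @ B, B @ Winv) / (B @ Winv @ B)
lim = (S @ one) / (one @ S @ one)
assert np.allclose(weights(1e-10), lim, atol=1e-6)
```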
Corollary 3.1 Under the conditions of Theorem 3.4, assume that \(V = Diag(\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2)\). If \(\sigma_1^2(n_1) \to 0\) and \(\sigma_i^2(n_i)/\sigma_1^2(n_1) \to \gamma_i > 0\), for \(i \ge 2\), as \(n_1 \to \infty\), and \(\sum_{i=2}^m \gamma_i > 0\), then
\[ \lim_{n_1\to\infty}\lambda^* = (1, 0, 0, \ldots, 0)^t. \]
Proof: Here \(W_0 = Diag(1, \gamma_2, \ldots, \gamma_m)\) and \(B = (0, C, \ldots, C)^t\), so that
\[ W_0^{-1}B = \Big(0, \frac{C}{\gamma_2}, \ldots, \frac{C}{\gamma_m}\Big)^t, \qquad B^tW_0^{-1}B = C^2\sum_{i=2}^m\frac{1}{\gamma_i}, \qquad B^tW_0^{-1}\mathbf{1} = C\sum_{i=2}^m\frac{1}{\gamma_i}. \]
With S as in Theorem 3.4, the first component of \(S\mathbf{1}\) is therefore 1, while for \(i \ge 2\) the ith component is
\[ \frac{1}{\gamma_i} - \frac{C}{\gamma_i}\cdot\frac{C\sum_{j=2}^m 1/\gamma_j}{C^2\sum_{j=2}^m 1/\gamma_j} = 0. \]
Hence \(S\mathbf{1} = (1, 0, \ldots, 0)^t\) and
\[ \lim_{n_1\to\infty}\lambda^* = (1, 0, \ldots, 0)^t. \;\square \]
Next we will consider the case where the covariance matrix is not a diagonal matrix but has a special form. Suppose that \(cov(\hat{\theta}) = \frac{\sigma^2}{n}V_0\), where
\[ V_0 = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{m-1} \\ \rho & 1 & \rho & \cdots & \rho^{m-2} \\ \vdots & \vdots & & \ddots & \vdots \\ \rho^{m-1} & \rho^{m-2} & \cdots & \rho & 1 \end{pmatrix}, \tag{3.2} \]
where \(\rho \neq 0\) and \(\rho^2 < 1\). This kind of covariance structure is found in the first-order autoregressive model with lag-1 effect, namely the AR(1) model, in time series analysis. If the goal of inference is to predict the average response at the next time point given observations from the current and a fixed number of previous time points, then this type of covariance structure is of interest. By Proposition 2.2, the optimum weights will go to (1, 0) if m = 2. This is because \(K = \frac{nC^2}{(1-\rho)\sigma^2}\) goes to infinity as n grows. As a result, only the case m > 2 is considered below.
Corollary 3.2 For m > 2, if \(cov(\hat{\theta}) = \frac{\sigma^2}{n}V_0\), where \(V_0\) takes the form in (3.2), then
\[ \lim_{n\to\infty}\lambda^* = \Big(1,\; -\frac{\rho(D_0 - 1 + \rho - \rho^2)}{D_0 - \rho^2},\; \frac{\rho(1-\rho)^2}{D_0 - \rho^2},\; \ldots,\; \frac{\rho(1-\rho)^2}{D_0 - \rho^2},\; \frac{\rho(1-\rho)}{D_0 - \rho^2}\Big)^t, \]
where \(D_0 = 1 + (m-2)(1-\rho)^2\).
Proof: The inverse of \(V_0\) is the tridiagonal matrix
\[ W_0^{-1} = V_0^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho & & & \\ -\rho & 1+\rho^2 & -\rho & & \\ & \ddots & \ddots & \ddots & \\ & & -\rho & 1+\rho^2 & -\rho \\ & & & -\rho & 1 \end{pmatrix}. \]
It follows that, with \(B = (0, C, \ldots, C)^t\),
\[ W_0^{-1}B = \frac{C}{1-\rho^2}\big(-\rho,\; 1-\rho+\rho^2,\; (1-\rho)^2,\; \ldots,\; (1-\rho)^2,\; 1-\rho\big)^t, \]
and
\[ B^tW_0^{-1}B = \frac{C^2 D_0}{1-\rho^2}, \qquad B^tW_0^{-1}\mathbf{1} = \frac{C(D_0 - \rho)}{1-\rho^2}. \]
We then have, with S as in Theorem 3.4,
\[ S\mathbf{1} = W_0^{-1}\mathbf{1} - W_0^{-1}B\,\frac{B^tW_0^{-1}\mathbf{1}}{B^tW_0^{-1}B} = \frac{1}{(1-\rho^2)D_0}\Big(D_0 - \rho^2,\; -\rho(D_0 - 1 + \rho - \rho^2),\; \rho(1-\rho)^2,\; \ldots,\; \rho(1-\rho)^2,\; \rho(1-\rho)\Big)^t. \]
It can be verified that the components after the first sum to zero, so that
\[ \mathbf{1}^t S\mathbf{1} = \frac{D_0 - \rho^2}{(1-\rho^2)D_0}. \]
Normalizing \(S\mathbf{1}\) by \(\mathbf{1}^tS\mathbf{1}\) completes the proof. □

Note that the limits of \(\lambda_2^*\) and \(\lambda_m^*\) take a different form than those of \(\lambda_3^*, \ldots, \lambda_{m-1}^*\). The reason is that \(B = (0, C, \ldots, C)^t\) ignores the first row or column of \(W_0^{-1}\) when multiplied with it, and the last row of \(W_0^{-1}\) is different than the other interior rows of that matrix.
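Corollary 3.2 can be verified numerically against Theorem 3.4. The sketch below is an illustration (hypothetical m and ρ, C = 1); it builds the AR(1) correlation matrix, computes the limiting weights \(S\mathbf{1}/(\mathbf{1}^tS\mathbf{1})\), and compares them with the closed form:

```python
import numpy as np

m, rho = 5, 0.5
idx = np.arange(m)
V0 = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) correlation matrix
B = np.array([0.0] + [1.0] * (m - 1))             # B = (0, C, ..., C)^t, C = 1
one = np.ones(m)

# Limiting weights from Theorem 3.4 with W0 = V0.
W0inv = np.linalg.inv(V0)
S = W0inv - np.outer(W0inv @ B, B @ W0inv) / (B @ W0inv @ B)
lim = (S @ one) / (one @ S @ one)

# Closed form from Corollary 3.2.
D0 = 1 + (m - 2) * (1 - rho) ** 2
closed = np.array(
    [1.0, -rho * (D0 - 1 + rho - rho**2) / (D0 - rho**2)]
    + [rho * (1 - rho) ** 2 / (D0 - rho**2)] * (m - 3)
    + [rho * (1 - rho) / (D0 - rho**2)]
)
assert np.allclose(lim, closed)
assert abs(closed.sum() - 1.0) < 1e-12   # the weights still sum to one
```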
Positive Weights
If the estimators are uncorrelated, that is, \(\rho = 0\), then we show below that all the weights are positive for every sample size, and hence asymptotically as well.

Theorem 3.5 As before, let V be the covariance matrix of \((\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_m)^t\) and let \(V = Diag(\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2)\), where the \(\sigma_i^2\), i = 1, 2, ..., m, are known, with \(B = (0, C, \ldots, C)^t\). Then \(\lambda_i^* > 0\) for all i = 1, 2, ..., m and for all n.
Proof: By the Sherman–Morrison formula,
\[ (V + BB^t)^{-1} = V^{-1} - \frac{V^{-1}BB^tV^{-1}}{1 + t_0}, \qquad t_0 = B^tV^{-1}B = \sum_{i=2}^m\frac{C^2}{\sigma_i^2}. \]
Since \(V^{-1}\mathbf{1} = (1/\sigma_1^2, \ldots, 1/\sigma_m^2)^t\), \(V^{-1}B = (0, C/\sigma_2^2, \ldots, C/\sigma_m^2)^t\) and \(B^tV^{-1}\mathbf{1} = t_0/C\), the components of \((V + BB^t)^{-1}\mathbf{1}\) are
\[ \frac{1}{\sigma_1^2} \qquad \text{and} \qquad \frac{1}{\sigma_i^2}\Big(1 - \frac{t_0}{1+t_0}\Big) = \frac{1}{\sigma_i^2}\cdot\frac{1}{1+t_0} > 0, \quad i = 2, \ldots, m. \]
Hence,
\[ \mathbf{1}^t(V + BB^t)^{-1}\mathbf{1} = \frac{1}{\sigma_1^2} + \frac{1}{1+t_0}\sum_{j=2}^m\frac{1}{\sigma_j^2} > 0. \]
Therefore,
\[ \lambda_i^* = \frac{\big[(V + BB^t)^{-1}\mathbf{1}\big]_i}{\mathbf{1}^t(V + BB^t)^{-1}\mathbf{1}} > 0 \]
for every i and every n. □
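The positivity claim and the explicit components used in the proof are easy to confirm. The sketch below uses hypothetical variances and C (any positive values give the same conclusion):

```python
import numpy as np

sig2 = np.array([0.5, 0.8, 1.3, 0.4])   # hypothetical known variances
C = 0.7
V = np.diag(sig2)
B = np.array([0.0, C, C, C])
one = np.ones(4)

x = np.linalg.solve(V + np.outer(B, B), one)
lam = x / (one @ x)
assert np.all(lam > 0)                   # Theorem 3.5: all weights positive

# The proof's explicit form of the components of (V + BB^t)^{-1} 1.
t0 = np.sum(C**2 / sig2[1:])
explicit = np.concatenate(([1.0 / sig2[0]], (1.0 / sig2[1:]) / (1.0 + t0)))
assert np.allclose(x, explicit)
```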
Negative Weights
The above theorem gives conditions that ensure positive weights. However, weights can be negative in general. Recall from Corollary 3.2 that
\[ \lim_{n\to\infty}\lambda^* = \Big(1,\; -\frac{\rho(D_0 - 1 + \rho - \rho^2)}{D_0 - \rho^2},\; \frac{\rho(1-\rho)^2}{D_0 - \rho^2},\; \ldots,\; \frac{\rho(1-\rho)^2}{D_0 - \rho^2},\; \frac{\rho(1-\rho)}{D_0 - \rho^2}\Big)^t, \]
where \(\rho^2 < 1\) and \(D_0 = 1 + (m-2)(1-\rho)^2\). It follows that \(D_0 > 1\) provided m > 2. Hence the signs of those asymptotic weights taking the form \(\rho(1-\rho)^2/(D_0 - \rho^2)\) are determined by the sign of ρ. Notice that \(\lim \lambda_1^* = 1\) while the weights sum to one, so \(\lim \lambda_2^* = -\sum_{i=3}^m \lim \lambda_i^*\). Therefore, we have the following.

Theorem 3.6 Let \(a_{ij}\), i, j = 1, 2, ..., m, denote the entries of the symmetric matrix \(W_0^{-1}\) and set \(e_i = \sum_{j=2}^m a_{ij}\). Assume \(a_{11} - e_1^2/\sum_{j=2}^m e_j > 0\). Then the asymptotic weight \(\lim_{n_1\to\infty}\lambda_i^*\) is negative if
\[ a_{1i} < \frac{e_ie_1}{\sum_{j=2}^m e_j} \quad \text{for some } i > 1. \]

Proof: By Theorem 3.4,
\[ \lim_{n_1\to\infty}\lambda^* = \frac{S\mathbf{1}}{\mathbf{1}^tS\mathbf{1}}, \qquad S = W_0^{-1} - \frac{W_0^{-1}BB^tW_0^{-1}}{B^tW_0^{-1}B}. \]
With \(B = (0, C, \ldots, C)^t\), we have
\[ W_0^{-1}B = C(e_1, e_2, \ldots, e_m)^t, \qquad B^tW_0^{-1}B = C^2\sum_{i=2}^m e_i, \qquad B^tW_0^{-1}\mathbf{1} = C\sum_{i=1}^m e_i. \]
Note that \(W_0\) is symmetric, which implies that \(W_0^{-1}\) is also symmetric; its ith row sum is \(a_{1i} + e_i\), so that \(W_0^{-1}\mathbf{1} = (a_{11} + e_1, a_{12} + e_2, \ldots, a_{1m} + e_m)^t\). It follows that
\[ S\mathbf{1} = W_0^{-1}\mathbf{1} - W_0^{-1}B\,\frac{B^tW_0^{-1}\mathbf{1}}{B^tW_0^{-1}B} = \Big(a_{11} - \frac{e_1^2}{\sum_{i=2}^m e_i},\; a_{12} - \frac{e_2e_1}{\sum_{i=2}^m e_i},\; \ldots,\; a_{1m} - \frac{e_me_1}{\sum_{i=2}^m e_i}\Big)^t. \]
Summing the components gives \(\mathbf{1}^tS\mathbf{1} = a_{11} - e_1^2/\sum_{j=2}^m e_j > 0\) by the assumption, so the sign of \(\lim\lambda_i^*\) is that of the ith component of \(S\mathbf{1}\). Therefore, \(\lim_{n_1\to\infty}\lambda_i^* < 0\) if \(a_{1i} < e_ie_1/\sum_{j=2}^m e_j\) for any i > 1. □
We can construct a simple example where the non-asymptotic weights can actually be negative. Suppose that information from three populations is available and one single observation is obtained from each population. Further, we assume that the three random variables, \(X_1\), \(X_2\) and \(X_3\), say, follow a multivariate normal distribution with
\[ V + BB^t = \begin{pmatrix} 1 & 0.7 & 0.3 \\ 0.7 & 1 & 0.7 \\ 0.3 & 0.7 & 1 \end{pmatrix} + \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0.7 & 0.3 \\ 0.7 & 2 & 1.7 \\ 0.3 & 1.7 & 2 \end{pmatrix}. \]
Hence, approximately,
\[ (V + BB^t)^{-1} = \begin{pmatrix} 1.67 & -1.34 & 0.89 \\ -1.34 & 2.88 & -2.24 \\ 0.89 & -2.24 & 2.27 \end{pmatrix}. \]
We then have
\[ \mathbf{1}^t(V + BB^t)^{-1}\mathbf{1} = 1.43. \]
It follows that
\[ \lambda^* = \frac{(V + BB^t)^{-1}\mathbf{1}}{\mathbf{1}^t(V + BB^t)^{-1}\mathbf{1}} = (0.85, -0.49, 0.64)^t. \]
Thus, \(\lambda_2^*\) is negative in this example. The negative weight in this example might be due to the collinearity. If we replace 0.7 by 0.3 in the covariance matrix, all the weights turn out to be positive.
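The numbers above can be reproduced directly from the printed matrices:

```python
import numpy as np

V = np.array([[1.0, 0.7, 0.3],
              [0.7, 1.0, 0.7],
              [0.3, 0.7, 1.0]])
BBt = np.array([[0.0, 0.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 1.0, 1.0]])
one = np.ones(3)

x = np.linalg.solve(V + BBt, one)     # (V + BB^t)^{-1} 1
lam = x / (one @ x)                   # optimum weights

assert abs(one @ x - 1.43) < 0.01     # 1'(V + BB^t)^{-1} 1 is about 1.43
assert np.allclose(lam, [0.85, -0.49, 0.64], atol=0.01)
assert lam[1] < 0                     # the second weight is negative
```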
Asymptotic Properties of the WLE

The weighted likelihood for \(\theta_1\) is
\[ WL(x, \theta_1) = \prod_{i=1}^m\prod_{j=1}^{n_i} f_1(x_{ij}; \theta_1)^{\lambda_i}. \]
It follows that
\[ \log WL(x, \theta_1) = \sum_{i=1}^m\sum_{j=1}^{n_i}\lambda_i\log f_1(x_{ij}; \theta_1). \]
but it is not always the most natural one. In contrast, we postulate a fixed number of populations, so that the procedure can rely on just the data from the population of interest alone. We also consider the general version of the adaptively weighted likelihood in which the weights are allowed to depend on the data. Such a likelihood arises naturally when the data are discrete random variables. In that case the likelihood factors into powers of the common probability mass function at successive discrete points in the sample space. (The multinomial likelihood arises in precisely this way, for example.) The factors in the likelihood may change over the sampling period. The sample itself now "self-weights" the likelihood factors accordingly.

In Section 4.1, we present our extension of the classical large sample theory for the maximum likelihood estimator. Both consistency and asymptotic normality of the WLE for fixed weights are shown under appropriate assumptions. The weights may also be "adaptive", that is, allowed to depend on the data. In Section 4.2 we present the asymptotic properties of the WLE using adaptive weights.
4.1 Asymptotic Results for the WLE

In this section, we establish the consistency and asymptotic normality of the WLE under appropriate conditions.
In this and the next sub-section, we give general conditions that ensure the consistency of the WLE's. Our analysis concerns σ-finite probability spaces \((\mathcal{X}, \mathcal{F}, \mu_i)\), i = 1, 2, under suitable regularity conditions. We assume that the probability measures \(\mu_i\) are absolutely continuous with respect to one another; that is, there exists no set (event) \(E \in \mathcal{F}\) for which \(\mu_i(E) = 0\) and \(\mu_j(E) \neq 0\) for \(i \neq j\). Let \(\nu\) be a measure that dominates the \(\mu_i\), i = 1, 2, for example \((\mu_1 + \mu_2)/2\). By the Radon–Nikodym theorem (Royden 1988, p. 276), there exist measurable functions \(f_i(x)\), i = 1, 2, called probability densities, unique up to sets of (probability) measure zero in ν, with \(0 \le f_i(x) < \infty\) (a.e. ν), such that
\[ \mu_i(E) = \int_E f_i(x)\,d\nu(x) \quad \text{for all } E \in \mathcal{F}. \]
In what follows, \(\log(f_1(X)/f_2(X))\) is defined as \(+\infty\) if \(f_1(x) > 0\) and \(f_2(x) = 0\); if \(f_1(x) = 0\) and \(f_2(x) > 0\), the integrand \(\log(f_1(x)/f_2(x))f_1(x)\) is defined as zero. The Kullback–Leibler divergence satisfies
\[ K(f_1, f_2) = E_1\Big(\log\frac{f_1(X)}{f_2(X)}\Big) = \int\log\frac{f_1(x)}{f_2(x)}\,f_1(x)\,d\nu(x) \ge 0, \]
with equality if and only if \(f_1 = f_2\) a.e. ν.
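The non-negativity of the Kullback–Leibler divergence can be illustrated numerically. The sketch below (not thesis code; it uses plain numerical integration on a grid) computes the divergence between two unit-variance normal densities, for which the closed form is \((\mu_1-\mu_2)^2/2\):

```python
import math

def kl_normal(mu1, mu2, lo=-12.0, hi=12.0, n=20001):
    """K(f1, f2) for N(mu1, 1) vs N(mu2, 1) via a Riemann sum."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * h
        f1 = math.exp(-0.5 * (x - mu1) ** 2) / math.sqrt(2 * math.pi)
        f2 = math.exp(-0.5 * (x - mu2) ** 2) / math.sqrt(2 * math.pi)
        total += f1 * math.log(f1 / f2) * h
    return total

# Closed form for unit-variance normals: (mu1 - mu2)^2 / 2.
assert abs(kl_normal(0.0, 0.3) - 0.3 ** 2 / 2) < 1e-5
assert kl_normal(0.0, 0.0) < 1e-12     # zero when the densities coincide
assert kl_normal(0.0, 1.0) > 0.0       # strictly positive otherwise
```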
Assumption 4.2 For each i = 1, ..., m, assume that \(\{X_{ij} : j = 1, 2, \ldots, n_i\}\) are i.i.d. random variables.

Assumption 4.3 Assume that \(\theta_1 \neq \theta_1'\) implies \(f_1(\cdot; \theta_1) \neq f_1(\cdot; \theta_1')\) for \(\theta_1, \theta_1' \in \Theta\), and that the densities \(f_1(x, \theta)\) have the same support for all \(\theta \in \Theta\).

Assumption 4.4 For any \(\theta_1^0 \in \Theta\) and for any open set \(O \subset \Theta\), assume
\[ E_{\theta^0}\Big[\sup_{\theta_1\in O}\Big|\log\frac{f_1(X_{ij}; \theta_1)}{f_1(X_{ij}; \theta_1^0)}\Big|\Big]^2 \le K < \infty, \]
while

Assumption 4.5 requires \(|\lambda_i^{(n)} - w_i| = O(n_1^{-1/2})\) and \(\frac{n_i}{n_1}(\lambda_i^{(n)} - w_i) \to 0\) as \(n_1 \to \infty\), i = 1, 2, ..., m, where \(w = (w_1, \ldots, w_m)^t = (1, 0, \ldots, 0)^t\) denotes the limiting weight vector.
such that \(\lim\|\theta_n - \theta\| = 0\), we have \(g(\theta) \ge \limsup g(\theta_n)\). A function is called lower semi-continuous if \(g(\theta) \le \liminf g(\theta_n)\) whenever \(\lim\|\theta_n - \theta\| = 0\).

We need to show that \(\sup_{\theta_1\in O} U(x;\theta_1)\), where \(U(x;\theta_1) = \log\frac{f_1(x;\theta_1)}{f_1(x;\theta_1^0)}\), is measurable for an open set O if \(f_1(x;\theta_1)\) is upper semi-continuous.

Proposition 4.1 If \(U(x;\theta_1)\) is lower semi-continuous in \(\theta_1\) for all x, then \(\sup_{\|\theta_1-\theta_1^0\|<R} U(x;\theta_1)\) is measurable.

Proof: For simplicity, let us assume that \(\Theta = R\) and \(O = \{\theta_1 : \|\theta_1 - \theta_1^0\| < R\}\). We will show that
\[ \sup_{\|\theta_1-\theta_1^0\|<R} U(x;\theta_1) = \sup_{\theta_1\in D} U(x;\theta_1), \tag{4.2} \]
where \(D = N \cap \{\theta_1 : |\theta_1 - \theta_1^0| < R\}\) and N is the set of all the rational numbers in R.

Let \(s = \sup_{|\theta_1-\theta_1^0|<R} U(x;\theta_1)\). It follows that for any \(\theta_1 \in D\), \(U(x;\theta_1) \le s\). For any given ε > 0, there exists \(\theta_1^*(\varepsilon)\) such that \(|\theta_1^*(\varepsilon) - \theta_1^0| < R\) and
\[ U(x;\theta_1^*(\varepsilon)) > s - \varepsilon/2. \tag{4.3} \]
Since D is a dense subset of \(\{\theta_1 : |\theta_1 - \theta_1^0| < R\}\), there exists a sequence \(\theta_1^{(n)}(\varepsilon) \in D\) such that \(\lim_{n\to\infty}\theta_1^{(n)}(\varepsilon) = \theta_1^*(\varepsilon)\). Since \(U(x;\theta_1)\) is lower semi-continuous, then, for fixed ε and some δ > 0, there exists \(\theta_1^{**} \in D\) such that \(|\theta_1^{**} - \theta_1^*(\varepsilon)| < \delta\) and
\[ U(x;\theta_1^{**}) > U(x;\theta_1^*(\varepsilon)) - \varepsilon/2. \tag{4.4} \]
Thus, combining equations (4.3) and (4.4), it follows that for any given ε > 0, there exists \(\theta_1^{**} \in D\) such that \(U(x;\theta_1^{**}) > s - \varepsilon\). Equation (4.2) is then established.

We then have
\[ \Big\{x : \sup_{\theta_1\in D} U(x;\theta_1) \le a\Big\} = \bigcap_{\theta_1\in D}\{x : U(x;\theta_1) \le a\}. \]
Since each \(\{x : U(x;\theta_1) \le a\}\) is measurable and D is countable, the set \(\{x : \sup_{|\theta_1-\theta_1^0|<R} U(x;\theta_1) \le a\}\) is measurable. □
Lemma 4.2 Suppose \(\{Z_{ij} : j = 1, \ldots, n_i\}\), i = 1, ..., m, are random variables with \(E_{\theta^0}Z_{ij}^2 \le K < \infty\), and the weights satisfy Assumption 4.5. Then
\[ \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)Z_{ij} \xrightarrow{P} 0. \]
Proof: By the Cauchy–Schwarz inequality,
\[ E_{\theta^0}\Big[\frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)Z_{ij}\Big]^2 = \frac{1}{n_1^2}\sum_{i=1}^m\sum_{i'=1}^m\sum_{j=1}^{n_i}\sum_{j'=1}^{n_{i'}}\big(\lambda_i^{(n)} - w_i\big)\big(\lambda_{i'}^{(n)} - w_{i'}\big)E_{\theta^0}\big(Z_{ij}Z_{i'j'}\big) \]
\[ \le K\Big(m\max_{1\le k\le m}\frac{n_k}{n_1}\big|\lambda_k^{(n)} - w_k\big|\Big)^2 = o(1), \]
by Assumption 4.5. The conclusion then follows from Chebyshev's inequality, for any \(\theta_1, \theta_2, \ldots, \theta_m \in \Theta\). □
The above result will be used to establish the following theorem which will be
applied to prove the weak consistency and asymptotic normality of the WLE.
Theorem 4.1 For each \(\theta_1^0\), the true value of \(\theta_1\), and each \(\theta_1 \neq \theta_1^0\),
\[ P_{\theta^0}\Big(\prod_{i=1}^m\prod_{j=1}^{n_i} f_1(X_{ij};\theta_1^0)^{\lambda_i} > \prod_{i=1}^m\prod_{j=1}^{n_i} f_1(X_{ij};\theta_1)^{\lambda_i}\Big) \to 1, \quad \text{as } n_1 \to \infty. \]
Proof: With \(P_{\theta^0}\) measure 1,
\[ \prod_{i=1}^m\prod_{j=1}^{n_i} f_1(X_{ij};\theta_1^0)^{\lambda_i} > \prod_{i=1}^m\prod_{j=1}^{n_i} f_1(X_{ij};\theta_1)^{\lambda_i} \]
if and only if
\[ W_{n_1}(X, \theta_1) > 0, \]
where
\[ W_{n_1}(X, \theta_1) = \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\lambda_i\log\frac{f_1(X_{ij};\theta_1^0)}{f_1(X_{ij};\theta_1)}. \]
Observe that
\[ W_{n_1}(X, \theta_1) = \frac{1}{n_1}\sum_{j=1}^{n_1}\log\frac{f_1(X_{1j};\theta_1^0)}{f_1(X_{1j};\theta_1)} + \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)\log\frac{f_1(X_{ij};\theta_1^0)}{f_1(X_{ij};\theta_1)}. \]
The first term converges in probability to \(K\big(f_1(\cdot;\theta_1^0), f_1(\cdot;\theta_1)\big) > 0\) by the Weak Law of Large Numbers and Lemma 4.1, while the second term converges to zero in probability by Lemma 4.2. The assertion follows. □
Theorem 4.2 Suppose that \(\log f_1(x;\theta)\) is upper semi-continuous in θ for all x. Assume that for every \(\theta_1 \neq \theta_1^0\) there is an open set \(N_{\theta_1}\) such that \(\theta_1 \in N_{\theta_1} \subset \Theta\). Then for any sequence of maximum weighted likelihood estimates \(\hat{\theta}_1^{(n_1)}\) of \(\theta_1\), and for all ε > 0,
\[ P_{\theta^0}\big(\|\hat{\theta}_1^{(n_1)} - \theta_1^0\| > \varepsilon\big) \to 0, \quad \text{as } n_1 \to \infty. \]
Proof: The proof of this theorem given below resembles the proof of weak consistency of the MLE in Schervish (1995).

For each \(\theta_1 \neq \theta_1^0\), let \(N_{\theta_1}^{(k)}\), k = 1, 2, ..., be a sequence of open balls centered at \(\theta_1\) and of radius at most 1/k such that, for all k,
\[ N_{\theta_1}^{(k+1)} \subset N_{\theta_1}^{(k)} \subset \Theta. \]
It follows that \(\bigcap_{k=1}^\infty N_{\theta_1}^{(k)} = \{\theta_1\}\). Write \(Z_{ij}(N) = \inf_{\theta_1'\in N}\log\frac{f_1(X_{ij};\theta_1^0)}{f_1(X_{ij};\theta_1')}\). Thus, for fixed \(X_{ij}\), \(Z_{ij}(N_{\theta_1}^{(k)})\) increases with k and therefore has a limit as k → ∞. Note that \(\log\frac{f_1(X_{ij};\theta_1^0)}{f_1(X_{ij};\theta_1)}\) is lower semi-continuous in \(\theta_1\) for each x. So,
\[ \lim_{k\to\infty} Z_{ij}\big(N_{\theta_1}^{(k)}\big) \ge \log\frac{f_1(X_{ij};\theta_1^0)}{f_1(X_{ij};\theta_1)}, \]
and the \(Z_{ij}(N_{\theta_1}^{(k)})\) are integrable by Assumption 4.4. Using the monotone convergence theorem, for each \(\theta_1'\) with \(\|\theta_1' - \theta_1^0\| \ge \varepsilon\) we may choose an open ball \(N_{\theta_1'}^*\) with \(E_{\theta^0}Z_{1j}(N_{\theta_1'}^*) > 0\), and finitely many such balls cover the set \(\{\theta_1 : \|\theta_1 - \theta_1^0\| \ge \varepsilon\}\) outside a neighborhood \(N_0\) of \(\theta_1^0\). Then
\[ P_{\theta^0}\big(\|\hat{\theta}_1^{(n_1)} - \theta_1^0\| > \varepsilon\big) \le \sum_{\theta_1'} P_{\theta^0}\Big(\frac{1}{n_1}\sum_{j=1}^{n_1}Z_{1j}\big(N_{\theta_1'}^*\big) + \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)Z_{ij}\big(N_{\theta_1'}^*\big) \le 0\Big). \]
Since \(E_{\theta^0}Z_{1j}^2(N_{\theta_1'}^*) \le E_{\theta^0}\big(\sup_{\theta_1\in\Theta}\big|\log\frac{f_1(X_{1j};\theta_1)}{f_1(X_{1j};\theta_1^0)}\big|\big)^2 \le K < \infty\) by Assumption 4.4, it follows from Lemma 4.2 that
\[ \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)Z_{ij}\big(N_{\theta_1'}^*\big) \xrightarrow{P} 0. \]
Also,
\[ \frac{1}{n_1}\sum_{j=1}^{n_1}Z_{1j}\big(N_{\theta_1'}^*\big) \xrightarrow{P} E_{\theta^0}Z_{1j}\big(N_{\theta_1'}^*\big) > 0, \quad \text{for any } \theta_1' \in \Theta\setminus N_0, \]
by the Weak Law of Large Numbers and the construction of \(N_{\theta_1'}^*\). Thus, for any \(\theta_1' \in \Theta\setminus N_0\),
\[ P_{\theta^0}\Big(\frac{1}{n_1}\sum_{j=1}^{n_1}Z_{1j}\big(N_{\theta_1'}^*\big) + \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)Z_{ij}\big(N_{\theta_1'}^*\big) \le 0\Big) \to 0, \quad \text{as } n_1 \to \infty. \]
Thus the assertion follows. □
In the next theorem we drop Assumption 4.1, which assumes the compactness of the parameter space, and replace it with a slightly different condition, in which \(K_C\) denotes a constant independent of \(\theta_2, \ldots, \theta_m\).
Proof: Cover the compact set \(C \cap \{\theta_1 : \|\theta_1 - \theta_1^0\| > \varepsilon\}\) by finitely many balls \(N_k\), k = 1, ..., K, as in the proof of Theorem 4.2. Then
\[ P_{\theta^0}\big(\|\hat{\theta}_1^{(n_1)} - \theta_1^0\| > \varepsilon\big) \le \sum_{k=1}^{K} P_{\theta^0}\big(\hat{\theta}_1^{(n_1)} \in N_k\big) + P_{\theta^0}\big(\hat{\theta}_1^{(n_1)} \in C^c\cap\Theta\big). \]
It follows from the proof of Theorem 4.2 that the first term of the last expression goes to zero as \(n_1\) goes to infinity.

For the second term, write \(Z_{ij}(C^c\cap\Theta) = \inf_{\theta_1\in C^c\cap\Theta}\log\frac{f_1(X_{ij};\theta_1^0)}{f_1(X_{ij};\theta_1)}\). By the Weak Law of Large Numbers,
\[ \frac{1}{n_1}\sum_{j=1}^{n_1}Z_{1j}\big(C^c\cap\Theta\big) \xrightarrow{P} E_{\theta^0}Z_{1j}\big(C^c\cap\Theta\big), \]
which is a positive number by the condition of this theorem. Moreover,
\[ \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)Z_{ij}\big(C^c\cap\Theta\big) = \sum_{i=1}^m\big(\lambda_i^{(n)} - w_i\big)\frac{n_i}{n_1}\cdot\frac{1}{n_i}\sum_{j=1}^{n_i}Z_{ij}\big(C^c\cap\Theta\big) \xrightarrow{P} 0, \]
since, by Assumption 4.5,
\[ \frac{n_i}{n_1}\big(\lambda_i^{(n)} - w_i\big) \to 0, \quad \text{as } n_1 \to \infty. \tag{4.8} \]
Hence \(P_{\theta^0}(\hat{\theta}_1^{(n_1)} \in C^c\cap\Theta) \to 0\), and the theorem follows. □
To obtain asymptotic normality of the WLE, more restrictive conditions are needed. In particular, some conditions will be imposed upon the first and second derivatives of the likelihood function.

For each fixed n, there may be many solutions to the likelihood equation even if the WLE is unique. However, as will be seen in the next theorem, there generally exists a sequence of solutions of this equation that is asymptotically normal.
Assume that \(\theta_1\) is a vector in \(R^p\) with p a positive integer, i.e. \(\theta_1 = (\theta_{11}, \theta_{12}, \ldots, \theta_{1p})^t\), and the true value of the parameter is \(\theta_1^0 = (\theta_{11}^0, \theta_{12}^0, \ldots, \theta_{1p}^0)^t\). Write
\[ \psi(x; \theta_1) = \frac{\partial}{\partial\theta_1}\log f_1(x; \theta_1), \text{ a } p\times 1 \text{ column vector}, \]
and
\[ \dot{\psi}(x; \theta_1) = \frac{\partial}{\partial\theta_1}\psi(x; \theta_1), \text{ a } p \text{ by } p \text{ matrix}. \]
Assuming that the partial derivatives can be passed under the integral sign in \(\int f_1(x;\theta_1)\,d\nu(x) = 1\), we find that, for any j,
\[ I(\theta_1^0) = cov_{\theta^0}\,\psi(X_{1j};\theta_1^0). \]
If the second partial derivatives with respect to \(\theta_1\) can also be passed under the integral sign, then \(\int\frac{\partial^2}{\partial\theta_1^2}f_1(x;\theta_1^0)\,d\nu(x) = 0\), and
\[ I(\theta_1^0) = -E_{\theta^0}\dot{\psi}(X_{1j};\theta_1^0). \]
Write
\[ \dot{WL}_{n_1}(x;\theta_1) = \frac{\partial}{\partial\theta_1}\log WL(x;\theta_1) \quad \text{and} \quad \dot{WL}_{n_1}(x;\theta_1^0) = \frac{\partial}{\partial\theta_1}\log WL(x;\theta_1)\Big|_{\theta_1=\theta_1^0}. \]
In the next theorem we assume that the parameter space is an open subset of \(R^p\).

Theorem 4.4 Suppose:
(1) for almost all x the first and second partial derivatives of \(f_1(x;\theta)\) with respect to θ exist, are continuous in \(\theta \in \Theta\), and may be passed through the integral sign in \(\int f_1(x;\theta)\,d\nu(x) = 1\);
(2) there exist three functions \(G_1(x)\), \(G_2(x)\) and \(G_3(x)\) such that, for all \(\theta_1, \ldots, \theta_m\), \(E_{\theta^0}|G_l(X_{ij})|^2 \le K_l < \infty\), l = 1, 2, 3, i = 1, ..., m, and in some neighborhood of \(\theta_1^0\) the first, second and third partial derivatives of \(\log f_1(x;\theta_1)\) with respect to the components of \(\theta_1\) are bounded in absolute value by \(G_1(x)\), \(G_2(x)\) and
\[ \Big|\frac{\partial^3\log f_1(x;\theta_1)}{\partial\theta_{1k_1}\partial\theta_{1k_2}\partial\theta_{1k_3}}\Big| \le G_3(x), \]
respectively.
Then there exists a sequence of roots \(\hat{\theta}_1^{(n_1)}\) of the weighted likelihood equation that is weakly consistent and satisfies
\[ \sqrt{n_1}\big(\hat{\theta}_1^{(n_1)} - \theta_1^0\big) \xrightarrow{d} N\big(0, \big(I(\theta_1^0)\big)^{-1}\big), \quad \text{as } n_1 \to \infty. \]
Let \(S_a = \{\theta_1 : \|\theta_1 - \theta_1^0\| \le a\}\) and
\[ I_n(a) = \{x : \log WL(x;\theta_1^0) > \log WL(x;\theta_1) \text{ for all boundary points } \theta_1 \text{ of } S_a\}. \]
We claim that, for any given ε > 0, there exists \(N_\varepsilon\) such that, for any \(n_1 > N_\varepsilon\), we have \(P_{\theta^0}(X \in I_n(a)) > 1 - \varepsilon\). This implies that \(I_n(a)\) is not an empty set when \(n_1 > N_\varepsilon\). To prove the claim, we expand the log weighted likelihood on the boundary of \(S_a\) about the true value \(\theta_1^0\):
\[ \frac{1}{n_1}\log WL(x;\theta_1) - \frac{1}{n_1}\log WL(x;\theta_1^0) = S_1 + S_2 + S_3, \]
where
\[ S_1 = \frac{1}{n_1}\sum_{k_1=1}^p A_{k_1}(x)\big(\theta_{1k_1} - \theta_{1k_1}^0\big), \qquad S_2 = \frac{1}{2n_1}\sum_{k_1=1}^p\sum_{k_2=1}^p B_{k_1k_2}(x)\big(\theta_{1k_1} - \theta_{1k_1}^0\big)\big(\theta_{1k_2} - \theta_{1k_2}^0\big), \]
\[ S_3 = \frac{1}{6n_1}\sum_{k_1=1}^p\sum_{k_2=1}^p\sum_{k_3=1}^p\gamma_{k_1k_2k_3}(x)\big(\theta_{1k_1} - \theta_{1k_1}^0\big)\big(\theta_{1k_2} - \theta_{1k_2}^0\big)\big(\theta_{1k_3} - \theta_{1k_3}^0\big), \]
with
\[ A_{k_1}(x) = \sum_{i=1}^m\sum_{j=1}^{n_i}\lambda_i\frac{\partial\log f_1(x_{ij};\theta_1)}{\partial\theta_{1k_1}}\Big|_{\theta_1=\theta_1^0}, \]
and \(B_{k_1k_2}(x)\) and \(\gamma_{k_1k_2k_3}(x)\) defined analogously through the second and third partial derivatives.
By the Weak Law of Large Numbers and Lemma 4.2,
\[ \frac{1}{n_1}\sum_{j=1}^{n_1}\frac{\partial^2\log f_1(X_{1j};\theta_1)}{\partial\theta_{1k_1}\partial\theta_{1k_2}}\Big|_{\theta_1=\theta_1^0} \xrightarrow{P} -I_{k_1k_2}(\theta_1^0), \quad \text{as } n_1 \to \infty, \tag{4.13} \]
\[ \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)\frac{\partial^2\log f_1(X_{ij};\theta_1)}{\partial\theta_{1k_1}\partial\theta_{1k_2}}\Big|_{\theta_1=\theta_1^0} \xrightarrow{P} 0, \quad \text{as } n_1 \to \infty. \tag{4.14} \]
To prove that \(\frac{1}{n_1}\log WL(x;\theta_1) - \frac{1}{n_1}\log WL(x;\theta_1^0)\) is negative on the boundary of \(S_a\) for sufficiently small a, we will show that, with \(P_{\theta^0}\) probability tending to 1, the maximum of \(S_1\) and of \(S_3\) is small compared to \(|S_2|\). First,
\[ \frac{1}{n_1}A_{k_1}(X) \xrightarrow{P} 0, \quad \text{as } n_1 \to \infty, \tag{4.15} \]
and
\[ |S_1| \le \frac{a}{n_1}\sum_{k_1=1}^p\big|A_{k_1}(X)\big|. \]
For any given a, it follows from (4.15) that with \(P_{\theta^0}\) probability tending to 1,
\[ \frac{1}{n_1}\sum_{k_1=1}^p\big|A_{k_1}(X)\big| < pa^2, \tag{4.16} \]
so that, on \(\|\theta_1 - \theta_1^0\| = a\),
\[ |S_1| < pa^3. \tag{4.17} \]
Next consider
\[ 2S_2 = -\sum_{k_1=1}^p\sum_{k_2=1}^p\big(\theta_{1k_1} - \theta_{1k_1}^0\big)I_{k_1k_2}(\theta_1^0)\big(\theta_{1k_2} - \theta_{1k_2}^0\big) + \sum_{k_1=1}^p\sum_{k_2=1}^p\Big(\frac{1}{n_1}B_{k_1k_2}(X) + I_{k_1k_2}(\theta_1^0)\Big)\big(\theta_{1k_1} - \theta_{1k_1}^0\big)\big(\theta_{1k_2} - \theta_{1k_2}^0\big). \tag{4.18} \]
By (4.13) and (4.14), with \(P_{\theta^0}\) probability tending to 1, for any boundary point with \(\|\theta_1 - \theta_1^0\| = a\),
\[ \Big|\sum_{k_1=1}^p\sum_{k_2=1}^p\Big(\frac{1}{n_1}B_{k_1k_2}(X) + I_{k_1k_2}(\theta_1^0)\Big)\big(\theta_{1k_1} - \theta_{1k_1}^0\big)\big(\theta_{1k_2} - \theta_{1k_2}^0\big)\Big| < p^2a^3. \tag{4.21} \]
Let us examine the first term in (4.18). Since \(I(\theta_1^0)\) is a symmetric and positive definite matrix, there exists a matrix \(B_{p\times p}\) such that \(BB^t = Diag(1, 1, \ldots, 1)\), the identity matrix, and \(-I(\theta_1^0) = B^t\,Diag(\delta_1, \delta_2, \ldots, \delta_p)\,B\), where \(\delta_i < 0\), i = 1, 2, ..., p. Hence, with \(v = B(\theta_1 - \theta_1^0)\),
\[ -\sum_{k_1=1}^p\sum_{k_2=1}^p\big(\theta_{1k_1} - \theta_{1k_1}^0\big)I_{k_1k_2}(\theta_1^0)\big(\theta_{1k_2} - \theta_{1k_2}^0\big) = \sum_{l=1}^p\delta_l v_l^2 \le \delta^* a^2, \tag{4.22} \]
where \(\delta^* = \max\{\delta_1, \delta_2, \ldots, \delta_p\} < 0\). We see that there exists \(a_0\) such that \(\delta^* + p^2a_0 < 0\). Combining (4.21) and (4.22), it follows that with \(P_{\theta^0}\) probability tending to 1, there exists a constant c > 0 such that, for all sufficiently small a,
\[ S_2 < -ca^2. \tag{4.23} \]
Finally, for the third-order term, condition (2) gives
\[ \frac{1}{n_1}\big|\gamma_{k_1k_2k_3}(X)\big| \le \frac{1}{n_1}\sum_{j=1}^{n_1}G_3(X_{1j}) + \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\big|\lambda_i^{(n)} - w_i\big|G_3(X_{ij}), \tag{4.24} \]
where we use the inequality \(|EZ| \le E|Z| \le 1 + E|Z|^2\) for any random variable Z to conclude that, with \(P_{\theta^0}\) probability tending to 1,
\[ \frac{1}{n_1}\big|\gamma_{k_1k_2k_3}(X)\big| \le 1 + K_3. \tag{4.25} \]
Hence, by (4.24) and (4.25), for any given a and \(\theta_1\) such that \(\|\theta_1 - \theta_1^0\| = a\),
\[ |S_3| \le \frac{p^3}{6}(1 + K_3)a^3. \tag{4.26} \]
Finally, combining (4.17), (4.23) and (4.26), we then have, with \(P_{\theta^0}\) probability tending to 1,
\[ S_1 + S_2 + S_3 < pa^3 - ca^2 + \frac{p^3}{6}(1 + K_3)a^3, \]
which is less than zero if \(a < c/\big[p + \frac{p^3}{6}(1 + K_3)\big]\). This completes the proof of our claim that, for any sufficiently small a, the \(P_{\theta^0}\) probability of \(I_n(a)\) tends to 1.

For any such a and \(x \in I_n(a)\), it follows that there exists at least one point \(\hat{\theta}_1^{(n_1)}\) with \(\|\hat{\theta}_1^{(n_1)} - \theta_1^0\| < a\) at which \(WL(x;\theta_1)\) has a local maximum, so that \(\dot{WL}_{n_1}(x;\theta_1)\big|_{\theta_1=\hat{\theta}_1^{(n_1)}} = 0\). Since \(P_{\theta^0}(X \in I_n(a)) \to 1\) as \(n_1 \to \infty\) for all sufficiently small a, it then follows that for such fixed a > 0 there exists a sequence of roots \(\hat{\theta}_1^{(n_1)}(a)\) such that
\[ P_{\theta^0}\big(\|\hat{\theta}_1^{(n_1)}(a) - \theta_1^0\| < a\big) \to 1. \]
It remains to show that we can determine such a sequence of roots which does not depend on a. Let \(\hat{\theta}_1^{(n_1)}\) be the closest root to \(\theta_1^0\) among all the roots of the likelihood equation for every fixed \(n_1\). This closest root always exists. The reason for this can be seen as follows. If there is a finite number of roots within the closed ball \(\|\theta_1 - \theta_1^0\| \le a\), we can always find such a root. If there are infinitely many roots in that ball, which is compact, then there exists a convergent sequence of roots \(\tilde{\theta}_1^{(n_1)}\) inside the ball such that \(\lim\|\tilde{\theta}_1^{(n_1)} - \theta_1^0\| = \inf_{\theta_1\in\Gamma}\|\theta_1 - \theta_1^0\|\), where Γ is the set of all the roots of the likelihood equation. Then the closest root exists, since the limit of this sequence of roots is again a root by the assumed continuity of \(\dot{WL}_{n_1}(\theta_1)\). Thus \(\hat{\theta}_1^{(n_1)}\) does not depend on a and \(P_{\theta^0}\big(\|\hat{\theta}_1^{(n_1)} - \theta_1^0\| < a\big) \to 1\).
By Taylor expansion,
\[ \dot{WL}_{n_1}\big(x;\hat{\theta}_1^{(n_1)}\big) = \dot{WL}_{n_1}(x;\theta_1^0) + \Big[\int_0^1\ddot{WL}_{n_1}\big(x;\theta_1^0 + t(\hat{\theta}_1^{(n_1)} - \theta_1^0)\big)\,dt\Big]\big(\hat{\theta}_1^{(n_1)} - \theta_1^0\big). \]
Since \(\dot{WL}_{n_1}(x;\hat{\theta}_1^{(n_1)}) = 0\), we may write
\[ \frac{1}{\sqrt{n_1}}\dot{WL}_{n_1}(x;\theta_1^0) = B_{n_1}\,\sqrt{n_1}\big(\hat{\theta}_1^{(n_1)} - \theta_1^0\big), \tag{4.27} \]
where
\[ B_{n_1} = -\frac{1}{n_1}\int_0^1\sum_{i=1}^m\sum_{j=1}^{n_i}\lambda_i\,\dot{\psi}\big(X_{ij};\theta_1^0 + t(\hat{\theta}_1^{(n_1)} - \theta_1^0)\big)\,dt. \]
Note that
\[ \dot{WL}_{n_1}(x;\theta_1^0) = \sum_{j=1}^{n_1}\psi(X_{1j};\theta_1^0) + \sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)\psi(X_{ij};\theta_1^0). \]
From the Central Limit Theorem, because \(E_{\theta^0}\psi(X_{1j};\theta_1^0) = 0\) and \(cov_{\theta^0}\psi(X_{1j};\theta_1^0) = I(\theta_1^0)\),
\[ \frac{1}{\sqrt{n_1}}\sum_{j=1}^{n_1}\psi(X_{1j};\theta_1^0) \xrightarrow{d} N\big(0, I(\theta_1^0)\big). \]
Now we prove that the remaining term is negligible. Let
\[ A_{n_1} = \sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)\psi(X_{ij};\theta_1^0). \]
We then have, by Chebyshev's inequality,
\[ P_{\theta^0}\Big(\Big\|\frac{1}{\sqrt{n_1}}A_{n_1}\Big\| > \varepsilon\Big) \le \frac{1}{n_1\varepsilon^2}\sum_{i=1}^m\sum_{i'=1}^m\sum_{j=1}^{n_i}\sum_{j'=1}^{n_{i'}}\big|\lambda_i^{(n)} - w_i\big|\,\big|\lambda_{i'}^{(n)} - w_{i'}\big|\,\big|E_{\theta^0}\psi(X_{ij};\theta_1^0)^t\psi(X_{i'j'};\theta_1^0)\big| \to 0, \]
by Assumption 4.5. Therefore,
\[ \frac{1}{\sqrt{n_1}}\dot{WL}_{n_1}(X;\theta_1^0) \xrightarrow{d} N\big(0, I(\theta_1^0)\big), \quad \text{as } n_1 \to \infty. \]
First, we prove \(B_{n_1} \xrightarrow{P} I(\theta_1^0)\) as \(n_1 \to \infty\). Let \(S_\rho = \{\theta_1 : \|\theta_1 - \theta_1^0\| \le \rho\}\). Note that \(E_{\theta^0}\dot{\psi}(X_{1j};\theta_1)\) is continuous in \(\theta_1\) by condition (1), so there is a ρ > 0 such that
\[ \big\|\hat{\theta}_1^{(n_1)} - \theta_1^0\big\| \le \rho \implies \big\|\theta_1^0 + t\big(\hat{\theta}_1^{(n_1)} - \theta_1^0\big) - \theta_1^0\big\| \le \rho, \quad 0 \le t \le 1. \tag{4.29} \]
By the weak consistency established above and (4.29), we then have
\[ P_{\theta^0}\big(\theta_1^0 + t(\hat{\theta}_1^{(n_1)} - \theta_1^0) \in S_\rho\big) \to 1, \quad \text{as } n_1 \to \infty, \tag{4.31} \]
and hence
\[ P_{\theta^0}\Big(\big\|E_{\theta^0}\dot{\psi}\big(X_{1j};\theta_1^0 + t(\hat{\theta}_1^{(n_1)} - \theta_1^0)\big) + I(\theta_1^0)\big\| < \varepsilon\Big) \to 1, \quad \text{as } n_1 \to \infty. \tag{4.30} \]
From the Uniform Strong Law of Large Numbers, Theorem 16(a) in Ferguson (1996), for \(n_1\) so large that \(\|\hat{\theta}_1^{(n_1)} - \theta_1^0\| \le \rho\),
\[ \sup_{0\le t\le 1}\Big\|\frac{1}{n_1}\sum_{j=1}^{n_1}\dot{\psi}\big(X_{1j};\theta_1^0 + t(\hat{\theta}_1^{(n_1)} - \theta_1^0)\big) - E_{\theta^0}\dot{\psi}\big(X_{1j};\theta_1^0 + t(\hat{\theta}_1^{(n_1)} - \theta_1^0)\big)\Big\| \to 0, \quad \text{as } n_1 \to \infty, \]
while the contribution of the terms with weights \(\lambda_i^{(n)} - w_i\) vanishes in probability as before. Therefore \(B_{n_1} \xrightarrow{P} I(\theta_1^0)\), and combining this with (4.27) completes the proof. □

Before turning to strong consistency, define
\[ S_{n_1}(x;\theta_1) = \sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)} - w_i\big)\log\frac{f_1(x_{ij};\theta_1)}{f_1(x_{ij};\theta_1^0)}. \]
Lemma 4.3 states that, under the assumptions of this section,
\[ \sup_{\theta_1\in\Theta}\frac{1}{n_1}S_{n_1}(x;\theta_1) \to 0, \quad \text{a.s. } [P_{\theta^0}]. \tag{4.33} \]
Theorem 4.5 Suppose:
(1) Θ is compact;
(2) Assumptions 4.2–4.4 and the conditions of Lemma 4.3 hold.
Then
\[ \hat{\theta}_1^{(n_1)} \to \theta_1^0, \quad \text{a.s. } [P_{\theta^0}], \]
for any \(\theta_2, \theta_3, \ldots, \theta_m \in \Theta\), i = 2, 3, ..., m.
Proof: Write
\[ W_{n_1} = \frac{1}{n_1}\log WL(\theta_1) - \frac{1}{n_1}\log WL(\theta_1^0) = \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i}\lambda_i\big(\log f_1(X_{ij};\theta_1) - \log f_1(X_{ij};\theta_1^0)\big) \]
\[ = \frac{1}{n_1}\sum_{j=1}^{n_1}U(X_{1j};\theta_1) + \frac{1}{n_1}S_{n_1}(X;\theta_1), \]
where \(U(X_{1j};\theta_1) = \log\frac{f_1(X_{1j};\theta_1)}{f_1(X_{1j};\theta_1^0)}\).

Let ρ > 0 and \(D = \{\theta_1 \in \Theta : \|\theta_1 - \theta_1^0\| \ge \rho\}\). Then D is compact by condition (1). We then have (c.f. Ferguson 1996, p. 109)
\[ \frac{1}{n_1}\sum_{j=1}^{n_1}\sup_{\theta_1\in D}U(X_{1j};\theta_1) \to \sup_{\theta_1\in D}\mu(\theta_1), \quad \text{a.s. } [P_{\theta^0}], \]
where \(\mu(\theta_1) = E_{\theta^0}U(X_{1j};\theta_1)\). Hence \(\mu(\theta_1)\) achieves its maximum value on D. Let \(\delta = \sup_{\theta_1\in D}\mu(\theta_1)\); then δ < 0 by Lemma 4.1 and Assumption 4.3. Then by Lemma 4.3, with \(P_{\theta^0}\) measure 1, there exists an \(N_1\) such that, for all \(n_1 > N_1\),
\[ \sup_{\theta_1\in D}\frac{1}{n_1}S_{n_1}(X,\theta_1) < -\delta/2. \]
Observe that
\[ \sup_{\theta_1\in D}\Big(\frac{1}{n_1}\sum_{j=1}^{n_1}U(X_{1j};\theta_1) + \frac{1}{n_1}S_{n_1}(X,\theta_1)\Big) \le \sup_{\theta_1\in D}\frac{1}{n_1}\sum_{j=1}^{n_1}U(X_{1j};\theta_1) + \sup_{\theta_1\in D}\frac{1}{n_1}S_{n_1}(X,\theta_1). \]
It follows that, with \(P_{\theta^0}\) measure 1, there exists an N such that, for all \(n_1 > N\),
\[ \sup_{\theta_1\in D}\Big(\frac{1}{n_1}\sum_{j=1}^{n_1}U(X_{1j};\theta_1) + \frac{1}{n_1}S_{n_1}(X,\theta_1)\Big) \le \delta/2 < 0. \]
On the other hand,
\[ \sup_{\theta_1\in\Theta}\Big(\frac{1}{n_1}\sum_{j=1}^{n_1}U(X_{1j};\theta_1) + \frac{1}{n_1}S_{n_1}(X;\theta_1)\Big) \ge 0, \]
since
\[ \frac{1}{n_1}\sum_{j=1}^{n_1}U(X_{1j};\theta_1^0) + \frac{1}{n_1}S_{n_1}(X;\theta_1^0) = 0. \]
This implies that, with \(P_{\theta^0}\) measure 1, the WLE \(\hat{\theta}_1^{(n_1)}\) eventually lies outside D; that is, \(\|\hat{\theta}_1^{(n_1)} - \theta_1^0\| < \rho\) for all large \(n_1\). Since ρ is arbitrary, the theorem follows. □

The proof of the above theorem resembles the famous proof given by Wald (1949), which established the strong consistency of the MLE, except that we have to deal with the extra term \(S_{n_1}\) introduced by the weights.
Theorem 4.6 Suppose that:
(1) there exists a compact set \(C \subset \Theta\) containing \(\theta_1^0\) as an interior point such that \(E_{\theta^0}\sup_{\theta_1\in C^c\cap\Theta}U(X_{1j};\theta_1) < 0\);
(2) there exists a function K(x) such that \(E_{\theta^0}|K(X)| < \infty\) and \(\log\frac{f_1(x;\theta_1)}{f_1(x;\theta_1^0)} \le K(x)\), for all x and \(\theta_1 \in C\).
Then for any sequence of maximum weighted likelihood estimates \(\hat{\theta}_1^{(n_1)}\) of \(\theta_1\),
\[ \hat{\theta}_1^{(n_1)} \to \theta_1^0, \quad \text{a.s. } [P_{\theta^0}]. \]

Proof: Let \(D = \{\theta_1 : \|\theta_1 - \theta_1^0\| \ge \rho\}\) as in the proof of Theorem 4.5, with ρ chosen such that \(C \cap D \neq \emptyset\). It follows that \(C \cap D\) is also compact. It follows from the proof of Theorem 4.5 that, with \(P_{\theta^0}\) measure 1, there exists an \(N_1\) such that, for all \(n_1 > N_1\),
\[ \sup_{\theta_1\in C\cap D}\Big(\frac{1}{n_1}\sum_{j=1}^{n_1}U(X_{1j};\theta_1) + \frac{1}{n_1}S_{n_1}(X;\theta_1)\Big) < 0, \]
where \(U(X_{1j};\theta_1) = \log\frac{f_1(X_{1j};\theta_1)}{f_1(X_{1j};\theta_1^0)}\). Moreover,
\[ \frac{1}{n_1}\sum_{j=1}^{n_1}\sup_{\theta_1\in C^c\cap\Theta}U(X_{1j};\theta_1) \to E_{\theta^0}\sup_{\theta_1\in C^c\cap\Theta}U(X_{1j};\theta_1) < 0, \quad \text{a.s. } [P_{\theta^0}], \tag{4.35} \]
by the Strong Law of Large Numbers and condition (1). It follows that, with \(P_{\theta^0}\) measure 1, there exists an \(N_3\) such that, for all \(n_1 > N_3\),
\[ \frac{1}{n_1}\sum_{j=1}^{n_1}\sup_{\theta_1\in C^c\cap\Theta}U(X_{1j};\theta_1) + \sup_{\theta_1\in C^c\cap\Theta}\frac{1}{n_1}S_{n_1}(X;\theta_1) < 0, \]
and hence
\[ \sup_{\theta_1\in C^c\cap\Theta}\Big(\frac{1}{n_1}\sum_{j=1}^{n_1}U(X_{1j};\theta_1) + \frac{1}{n_1}S_{n_1}(X;\theta_1)\Big) < 0, \tag{4.36} \]
while the same expression with the supremum taken over all of Θ is non-negative, since the sum is equal to 0 for \(\theta_1 = \theta_1^0\). This implies that the WLE satisfies \(\hat{\theta}_1^{(n_1)} \in D^c\) for \(n_1 > N^* = \max(N_1, N_3)\); that is, \(\|\hat{\theta}_1^{(n_1)} - \theta_1^0\| < \rho\). Since ρ is arbitrary, the theorem follows. □
4.2 Asymptotic Properties of Adaptive Weights

At the practical level, we might want to let the relevance weight vector be a function of the data. This section is concerned with the asymptotic properties of the WLE using adaptive weights. Assumptions 4.1–4.4 are assumed to hold in this section.

Assumption 4.6 The adaptive relevance weight vector \(\lambda^{(n)}(X) = (\lambda_1^{(n)}(X), \lambda_2^{(n)}(X), \ldots, \lambda_m^{(n)}(X))^t\) satisfies
\[ \lambda_i^{(n)}(X) \xrightarrow{P} w_i, \quad i = 1, 2, \ldots, m. \]
Define
\[ S_{n_1}(X, \theta_1) = \sum_{i=1}^m\sum_{j=1}^{n_i}\big(\lambda_i^{(n)}(X) - w_i\big)\log\frac{f_1(X_{ij};\theta_1)}{f_1(X_{ij};\theta_1^0)}. \]
Lemma 4.4 If the adaptive relevance weight vector satisfies Assumption 4.6, then
\[ \frac{1}{n_1}S_{n_1}(X, \theta_1) \xrightarrow{P} 0, \quad \text{as } n_1 \to \infty, \]
for any \(\theta_2, \theta_3, \ldots, \theta_m, \theta_i \in \Theta\), i = 2, 3, ..., m.

Proof: Let \(T_i = \sum_{j=1}^{n_i}\log\frac{f_1(X_{ij};\theta_1)}{f_1(X_{ij};\theta_1^0)}\) for i = 1, 2, ..., m. Then
\[ \frac{1}{n_1}S_{n_1}(X,\theta_1) = \frac{1}{n_1}\sum_{i=1}^m\big(\lambda_i^{(n)}(X) - w_i\big)T_i = \sum_{i=1}^m\big(\lambda_i^{(n)}(X) - w_i\big)\frac{n_i}{n_1}\cdot\frac{T_i}{n_i} \xrightarrow{P} 0. \;\square \]
Theorem 4.7 For each \(\theta_1^0\), the true value of \(\theta_1\), and each \(\theta_1 \neq \theta_1^0\),
\[ P_{\theta^0}\Big(\prod_{i=1}^m\prod_{j=1}^{n_i} f_1(X_{ij};\theta_1^0)^{\lambda_i(X)} > \prod_{i=1}^m\prod_{j=1}^{n_i} f_1(X_{ij};\theta_1)^{\lambda_i(X)}\Big) \to 1, \quad \text{as } n_1 \to \infty. \]

Theorem 4.8 Suppose that the conditions of Theorem 4.2 are satisfied. Then for any sequence of maximum weighted likelihood estimates constructed with adaptive weights, \(\hat{\theta}_1^{(n_1)} \xrightarrow{P} \theta_1^0\), for any \(\theta_2, \theta_3, \ldots, \theta_m, \theta_i \in \Theta\), i = 2, 3, ..., m.

Theorem 4.9 Suppose that the conditions of Theorem 4.3 are satisfied. Then
\[ \lim_{n_1\to\infty}P_{\theta^0}\big(\|\hat{\theta}_1^{(n_1)} - \theta_1^0\| > \varepsilon\big) = 0, \]
for any \(\theta_2, \theta_3, \ldots, \theta_m, \theta_i \in \Theta\), i = 2, 3, ..., m.

We remark that the proofs of Theorems 4.7–4.9 are identical to the proofs of Theorems 4.1–4.3 except that the fixed weights are replaced by adaptive weights and the utilization of Lemma 4.2 is replaced everywhere by Lemma 4.4.

We are now in a position to establish the asymptotic normality for the WLE constructed by adaptive weights. We assume that the parameter space is an open subset of \(R^p\).

Theorem 4.10 Suppose the conditions of Theorem 4.4 are satisfied. Then there exists a sequence of roots of the weighted likelihood equation that is weakly consistent and asymptotically normal, as \(n_1 \to \infty\).
To establish the strong consistency of the WLE constructed by the adaptive weights, we need a condition that is stronger than Assumption 4.6. We hence assume the following condition:

Assumption 4.7 The adaptive relevance weight vector satisfies
\[ \lambda_i^{(n)}(X) \to w_i, \quad \text{a.s. } [P_{\theta^0}], \quad i = 1, 2, \ldots, m. \]

Lemma 4.5 If the adaptive relevance weight vector satisfies Assumption 4.7, then
\[ \sup_{\theta_1\in\Theta}\frac{1}{n_1}S_{n_1}(X,\theta_1) \to 0, \quad \text{a.s. } [P_{\theta^0}]. \]
Proof: By Assumption 4.4, \(E_{\theta^0}\sup_{\theta_1\in\Theta}\big|\log\frac{f_1(X_{11};\theta_1)}{f_1(X_{11};\theta_1^0)}\big| < \infty\). This implies that, for any i = 1, 2, ..., m,
\[ \sup_{\theta_1\in\Theta}\frac{n_i}{n_1}\,\big|\lambda_i^{(n)}(X) - w_i\big|\cdot\frac{1}{n_i}\Big|\sum_{j=1}^{n_i}\log\frac{f_1(X_{ij};\theta_1)}{f_1(X_{ij};\theta_1^0)}\Big| \to 0, \quad \text{a.s. } [P_{\theta^0}]. \;\square \]

Theorem 4.11 Suppose the conditions of Theorem 4.5 are satisfied. Then for any adaptive weights \(\lambda_i(X)\) satisfying Assumption 4.7, \(\hat{\theta}_1^{(n_1)} \to \theta_1^0\), a.s. \([P_{\theta^0}]\).

Theorem 4.12 Suppose the conditions of Theorem 4.6 are satisfied. Then for any adaptive weights \(\lambda_i(X)\) satisfying Assumption 4.7, \(\hat{\theta}_1^{(n_1)} \to \theta_1^0\), a.s. \([P_{\theta^0}]\).
4.3 Examples

Suppose \(X_{ij}\) are independent random variables that follow a normal distribution with mean \(\theta_i\) and variance 1. Assume \(\Theta = (-\infty, \infty)\) and \(C = [-M, M]\).

We need to verify the condition that, for \(\theta_1^0 \in C\),
\[ 0 < E_{\theta^0}\Big(\inf_{\theta_1\in C^c\cap\Theta}\log\frac{f_1(X_{ij};\theta_1^0)}{f_1(X_{ij};\theta_1)}\Big) < \infty. \]
We then have
\[ \inf_{|\theta_1|>M}\log\frac{f_1(x;\theta_1^0)}{f_1(x;\theta_1)} = \begin{cases} -\frac{1}{2}(x - \theta_1^0)^2, & |x| > M, \\ \frac{1}{2}(M - |x|)^2 - \frac{1}{2}(x - \theta_1^0)^2, & |x| \le M. \end{cases} \]
Taking expectations,
\[ E_{\theta^0}\Big(\inf_{|\theta_1|>M}\log\frac{f_1(X_{ij};\theta_1^0)}{f_1(X_{ij};\theta_1)}\Big) = I_{1i} + I_{2i} + I_{3i}, \quad i = 1, 2, \ldots, m, \]
where
\[ I_{1i} = -\int_{|x|>M}\frac{1}{2}(x - \theta_1^0)^2\,\frac{1}{\sqrt{2\pi}}\exp\{-(x - \theta_i^0)^2/2\}\,dx, \]
\[ I_{2i} = \int_0^M\Big(\frac{1}{2}(x - M)^2 - \frac{1}{2}(x - \theta_1^0)^2\Big)\frac{1}{\sqrt{2\pi}}\exp\{-(x - \theta_i^0)^2/2\}\,dx, \]
and \(I_{3i}\) is the corresponding integral over \([-M, 0]\) with \((x + M)^2\) in place of \((x - M)^2\). The first term \(I_{1i}\) goes to zero as M goes to infinity. It can be verified that \(I_{2i} + I_{3i} = \frac{M^2}{2} + o(M^2)\). It then follows that there exists \(M_0 > 0\) such that \(I_{1i} + I_{2i} + I_{3i} > 0\) for \(M > M_0\). If we choose \(K_C = 2M_0^2\), the remaining conditions then hold for i = 1, 2, ..., m and j = 1, 2, ..., \(n_i\).
Suppose next that \(X_{1j}\), j = 1, ..., \(n_1\), are i.i.d. normal random variables each with mean \(\theta_1\) and variance \(\sigma^2\), and \(X_{2j}\), j = 1, ..., \(n_2\), are i.i.d. normal random variables each with mean \(\theta_2\) and variance \(\sigma^2\). Population 1 is of inferential interest while Population 2 is the relevant population. However, \(|\theta_1 - \theta_2| \le C\) for a known constant C > 0. Assumptions 4.2 and 4.3 are obviously satisfied for this example. The condition (4.6) in Theorem 4.3 is satisfied as shown in the previous example. If we show that Assumption 4.5 is also satisfied, then all the conditions assumed will be satisfied for this example.

To verify the final assumption, an explicit expression for the weight vector is needed. Let \(\bar{X}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}X_{ij}\), i = 1, ..., m, \(V = Cov\big((\bar{X}_{1.}, \bar{X}_{2.})^t\big)\) and \(B = (0, C)^t\).
It can be shown that the "optimum" WLE in this case, the one that minimizes the maximum MSE over the restricted parameter space, takes the following form:
\[ \hat{\theta}_1 = \lambda_1^*\bar{X}_{1.} + \lambda_2^*\bar{X}_{2.}, \]
where
\[ (\lambda_1^*, \lambda_2^*)^t = \frac{(V + BB^t)^{-1}\mathbf{1}}{\mathbf{1}^t(V + BB^t)^{-1}\mathbf{1}}. \]
We find that \(V + BB^t = Diag\big(\sigma^2/n_1,\; \sigma^2/n_2 + C^2\big)\). It follows that
\[ \mathbf{1}^t(V + BB^t)^{-1}\mathbf{1} = \frac{n_1}{\sigma^2} + \frac{1}{\sigma^2/n_2 + C^2}. \]
Thus, we have
\[ \lambda_1^* = \frac{\sigma^2/n_2 + C^2}{\sigma^2/n_1 + \sigma^2/n_2 + C^2}, \qquad \lambda_2^* = \frac{\sigma^2/n_1}{\sigma^2/n_1 + \sigma^2/n_2 + C^2}. \]
Estimators of this type are considered by van Eeden and Zidek (2000).

It follows that \(|\lambda_i^* - w_i| = O(1/n_1)\), i = 1, 2, with \(w_1 = 1\) and \(w_2 = 0\). If \(n_2 = O(n_1^{2-\delta})\) for some δ > 0, then Assumption 4.5 will be satisfied. Therefore, we do not require that the two sample sizes approach infinity at the same rate for this example in order to obtain consistency and asymptotic normality. The sample size of the relevant sample might go to infinity at a much higher rate.
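The closed-form weights and the \(O(1/n_1)\) decay of \(\lambda_2^*\) are easy to check numerically. The sketch below uses hypothetical values of \(\sigma^2\) and C:

```python
import numpy as np

sigma2, C = 1.0, 0.5    # hypothetical variance and bound |theta1 - theta2| <= C

def weights(n1, n2):
    V_plus_BBt = np.diag([sigma2 / n1, sigma2 / n2 + C**2])
    x = np.linalg.solve(V_plus_BBt, np.ones(2))
    return x / x.sum()

lam1, lam2 = weights(100, 400)
denom = sigma2/100 + sigma2/400 + C**2
assert abs(lam1 - (sigma2/400 + C**2) / denom) < 1e-12
assert abs(lam2 - (sigma2/100) / denom) < 1e-12

# lambda_2 shrinks at rate 1/n1 even when n2 grows much faster.
for n1 in (10, 100, 1000):
    _, l2 = weights(n1, n1**2)
    assert l2 < sigma2 / (n1 * C**2)   # crude O(1/n1) envelope
```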
Under the assumptions made in this subsection, it can be shown that the conditions of Theorem 4.4 are satisfied. The maximum weighted likelihood estimator in this example is unique for any fixed sample size. Therefore, we have
\[ \hat{\theta}_1 = \sum_{i=1}^m\lambda_i\bar{X}_{i.}. \]
Consider next the adaptive weights suggested by the James–Stein estimator, \(\zeta(X) = (\zeta_1(X), \ldots, \zeta_m(X))^t\), where
\[ \zeta_i(X) = \Big(1 - \frac{m-2}{n\sum_{i=1}^m\bar{X}_{i.}^2}\Big)\bar{X}_{i.}. \]
The quantity
\[ 1 - \frac{m-2}{n\sum_{i=1}^m\bar{X}_{i.}^2} \]
can be viewed as a weight function derived from the weight in the James–Stein estimator. Accordingly, consider adaptive weights of the form
\[ \lambda_1(X) = 1 - \frac{m-1}{m}\cdot\frac{m-2}{n^{1+\delta}\sum_{i=1}^m\bar{X}_{i.}^2 + c}, \qquad \lambda_i(X) = \frac{1}{m}\cdot\frac{m-2}{n^{1+\delta}\sum_{i=1}^m\bar{X}_{i.}^2 + c}, \quad i = 2, 3, \ldots, m, \]
for some \(\delta \ge 0\) and c > 0. It can be verified that \(\sum_{i=1}^m\lambda_i = 1\) and \(\lambda_i > 0\), i = 2, 3, ..., m. Since \(\bar{X}_{i.}^2 \ge 0\), it can further be shown that
\[ P_{\theta^0}\Big(\frac{m-2}{n^{1+\delta}\sum_{i=1}^m\bar{X}_{i.}^2 + c} > \varepsilon\Big) = O\Big(\frac{1}{n^{1+\delta}}\Big), \tag{4.37} \]
and
\[ P_{\theta^0}(\lambda_1 < 0) = O\Big(\frac{1}{n^{1+\delta}}\Big). \tag{4.38} \]
4.4 Concluding Remarks
(i) By (4.37) and (4.38),
\[ \lambda_i(X) - w_i \xrightarrow{P} 0, \]
for i = 1, 2, ..., m, where \(w = (1, 0, \ldots, 0)^t\). Assumption 4.6 is then satisfied. Therefore, the weak consistency and asymptotic normality of the WLE constructed by this set of weights will follow.
(ii) If δ > 0, the bounds in (4.37) and (4.38) are summable in n, so by the Borel–Cantelli lemma
\[ \lambda_i(X) - w_i \to 0, \quad \text{a.s. } [P_{\theta^0}], \]
for i = 1, 2, ..., m, and this leads to strong consistency. Since strong consistency implies weak consistency, asymptotic normality of the WLE using adaptive weights will follow in this case.
In this chapter we have shown how classical large sample theory for the maximum likelihood estimator can be extended to the adaptively weighted likelihood estimator. In particular, we have proved the weak consistency of the latter and of the roots of the likelihood equation under more restrictive conditions. The asymptotic normality of the WLE is also proved. Observations from the same population are assumed to be independent, although observations from different populations obtained at the same time can be dependent.

In practice, weights will sometimes need to be estimated. Assumption 4.6 states conditions that ensure the large sample results obtain. In particular, they obtain as long as the samples drawn from populations different from that of inferential interest are of the same order as that drawn from the latter.
This finding could have useful practical implications, since often there will be a differential cost of drawing samples from the various populations. The overall cost may be reduced by relying on inexpensive data that, although biased, nevertheless increase the accuracy of the estimator. Our theory suggests that this works as long as the amount of that other data is about the same order as that obtained from the population of direct interest (and the weights are chosen appropriately).
Choosing Weights by Cross-Validation

5.1 Introduction
\[ X_{m1}, X_{m2}, X_{m3}, \ldots, X_{mn_m} \sim f_m(x; \theta_m), \]
where, for fixed i, the \(\{X_{ij}\}\) are observations obtained from population i, and so on. Assume that observations obtained from each population are independent of those from other populations and \(E(X_{ij}) = \phi(\theta_i)\), j = 1, 2, ..., \(n_i\). The population parameter of the first population, \(\theta_1\), is of inferential interest. Taking the usual path, we predict the observations from the first population by a linear combination of the sample means.
The optimum weights are derived such that the minimum of D(λ) is achieved for fixed sample sizes \(n_1, n_2, \ldots, n_m\) with \(\sum_{i=1}^m\lambda_i = 1\).

We will study the linear WLE chosen by cross-validation when \(E(X_{ij}) = \theta_i\), j = 1, 2, ..., \(n_i\), for any fixed i. The asymptotic properties of the WLE are established in this chapter. The results of simulation studies are shown later in this chapter.
5.2 Linear WLE for Equal Sample Sizes
Stone (1974) and Geisser (1975) discuss the application of the cross-validation approach to the so-called K-group problem. Suppose that the data set S consists of n observations in each of K groups. The prediction of the mean for the ith group is constructed as

\hat\mu_i = \alpha \bar X_{i\cdot} + (1 - \alpha) \bar X_{\cdot\cdot}.

Substituting \bar X_{\cdot\cdot} = K^{-1} \sum_i \bar X_{i\cdot} shows, for the first group,

\hat\mu_1 = \Big( 1 - \frac{K-1}{K}(1-\alpha) \Big) \bar X_{1\cdot} + \frac{1-\alpha}{K} \sum_{i=2}^{K} \bar X_{i\cdot}.

We remark that the above formula is a special form of linear combination of the sample means. The cross-validation procedure is used by Stone (1974) to derive the value of \alpha.
We consider general linear combinations. Let \hat\theta_1^{(cv)} denote the WLE obtained by the cross-validation rule when the sample sizes are equal. If \phi(\theta) = \theta, the linear WLE for \theta_1 is then defined as

\hat\theta_1 = \sum_{i=1}^m \lambda_i \bar X_{i\cdot},

where \sum_{i=1}^m \lambda_i = 1. We assume n_1 = n_2 = ... = n_m = n in this section.
In this section, we will use cross-validation by simultaneously deleting X_{1j}, X_{2j}, ..., X_{mj} for each fixed j. That is, we delete one data point from each sample at each step. This might be appropriate if these data points are obtained at the same time point and strong associations exist among them. By simultaneously deleting X_{1j}, X_{2j}, ..., X_{mj} for each fixed j, we might also achieve numerical stability of the cross-validation procedure. An alternative approach is to delete a data point from only the first sample at each step; it will be studied in the next section.
Let \bar X_{i\cdot}^{(-j)} be the sample mean of the ith sample with the jth element in that sample excluded. A natural measure of the discrepancy of \hat\theta_1 might be:

D_e(\lambda) = \sum_{j=1}^n \Big( X_{1j} - \sum_{i=1}^m \lambda_i \bar X_{i\cdot}^{(-j)} \Big)^2
 = \sum_{j=1}^n X_{1j}^2 - 2 \sum_{i=1}^m \lambda_i \sum_{j=1}^n X_{1j} \bar X_{i\cdot}^{(-j)} + \sum_{i=1}^m \sum_{k=1}^m \lambda_i \lambda_k \sum_{j=1}^n \bar X_{i\cdot}^{(-j)} \bar X_{k\cdot}^{(-j)}.
5.2.1 Two Population Case

We seek the optimum weights such that \lambda_1 + \lambda_2 = 1 and they minimize

D_e^{(2)}(\lambda) = \sum_{j=1}^n \big( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \lambda_2 \bar X_{2\cdot}^{(-j)} \big)^2 - \gamma(\lambda_1 + \lambda_2 - 1),

where \gamma is a Lagrange multiplier. Setting the partial derivatives to zero gives

\partial D_e^{(2)}/\partial \lambda_1 = -2 \sum_{j=1}^n \bar X_{1\cdot}^{(-j)} \big( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \lambda_2 \bar X_{2\cdot}^{(-j)} \big) - \gamma = 0,
\partial D_e^{(2)}/\partial \lambda_2 = -2 \sum_{j=1}^n \bar X_{2\cdot}^{(-j)} \big( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \lambda_2 \bar X_{2\cdot}^{(-j)} \big) - \gamma = 0.

It follows that

\sum_{j=1}^n \big( \bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot}^{(-j)} \big) \big( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \lambda_2 \bar X_{2\cdot}^{(-j)} \big) = 0.

Substituting \lambda_2 = 1 - \lambda_1 yields

\lambda_1^{opt}(X) = \frac{\sum_{j=1}^n (\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot}^{(-j)})(X_{1j} - \bar X_{2\cdot}^{(-j)})}{\sum_{j=1}^n (\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot}^{(-j)})^2},   (5.2)

and \lambda_2^{opt}(X) = 1 - \lambda_1^{opt}(X) = S_2^e / S_1^e, where

S_1^e = \frac{1}{n} \sum_{j=1}^n \big( \bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot}^{(-j)} \big)^2,   S_2^e = \frac{1}{n} \sum_{j=1}^n \big( \bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot}^{(-j)} \big)\big( \bar X_{1\cdot}^{(-j)} - X_{1j} \big).

These sums simplify. Since

\bar X_{1\cdot}^{(-j)} = \frac{1}{n-1} \sum_{k \ne j} X_{1k} = \frac{n}{n-1} \bar X_{1\cdot} - \frac{X_{1j}}{n-1} = e_n \bar X_{1\cdot} - \frac{X_{1j}}{n-1},

where e_n = n/(n-1), we have

\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot}^{(-j)} = e_n(\bar X_{1\cdot} - \bar X_{2\cdot}) - \frac{X_{1j} - X_{2j}}{n-1},   \bar X_{1\cdot}^{(-j)} - X_{1j} = e_n(\bar X_{1\cdot} - X_{1j}).

It then follows that

S_1^e = e_n^2 (\bar X_{1\cdot} - \bar X_{2\cdot})^2 - \frac{2 e_n}{n-1}(\bar X_{1\cdot} - \bar X_{2\cdot})^2 + \frac{1}{n(n-1)^2} \sum_{j=1}^n (X_{1j} - X_{2j})^2
 = \frac{n(n-2)}{(n-1)^2}(\bar X_{1\cdot} - \bar X_{2\cdot})^2 + \frac{1}{n(n-1)^2} \sum_{j=1}^n (X_{1j} - X_{2j})^2,

and

S_2^e = \frac{n}{(n-1)^2} \big( \hat\sigma_1^2 - \widehat{cov} \big),

where \hat\sigma_1^2 = \frac{1}{n} \sum_{j=1}^n (X_{1j} - \bar X_{1\cdot})^2 and \widehat{cov} = \frac{1}{n} \sum_{j=1}^n (X_{1j} - \bar X_{1\cdot})(X_{2j} - \bar X_{2\cdot}).
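The simplified expressions for S_1^e and S_2^e can be checked numerically against the direct leave-one-out sums; a sketch with illustrative data and helper names of our own choosing:

```python
import random

random.seed(2)
n = 15
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [random.gauss(0.3, 1.0) for _ in range(n)]

xbar1, xbar2 = sum(x1) / n, sum(x2) / n
loo1 = [(n * xbar1 - x) / (n - 1) for x in x1]   # leave-one-out means
loo2 = [(n * xbar2 - x) / (n - 1) for x in x2]

# Direct leave-one-out sums.
S1 = sum((a - b) ** 2 for a, b in zip(loo1, loo2)) / n
S2 = sum((a - b) * (a - x) for a, b, x in zip(loo1, loo2, x1)) / n

# Closed forms derived in the text.
var1 = sum((x - xbar1) ** 2 for x in x1) / n
cov = sum((a - xbar1) * (b - xbar2) for a, b in zip(x1, x2)) / n
S1_closed = (n * (n - 2) / (n - 1) ** 2) * (xbar1 - xbar2) ** 2 \
    + sum((a - b) ** 2 for a, b in zip(x1, x2)) / (n * (n - 1) ** 2)
S2_closed = n * (var1 - cov) / (n - 1) ** 2

lam2 = S2 / S1  # cross-validated weight on the second sample
print(abs(S1 - S1_closed) < 1e-10, abs(S2 - S2_closed) < 1e-10)
```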
The numerator of \lambda_2^{opt}, namely \hat\sigma_1^2 - \widehat{cov}, measures the association between the two samples. If that measure is almost zero, the formula for \lambda_2^{opt} will reflect it by assigning little weight to the second sample. Little weight is also given to the second population if the difference between the two sample means is relatively large or the second sample has little information of relevance to the first. The weights chosen by the cross-validation rule will then guard against the undesirable scenario in which too much bias might be introduced into the estimation procedure. On the other hand, if the second sample does contain valuable information about the parameter of interest, the cross-validation rule will assign a substantial value to \lambda_2^{opt}.
Suppose that \rho < \sigma_1/\sigma_2, where \rho is the correlation between paired observations from the two populations. By the Weak Law of Large Numbers,

\hat\sigma_1^2 - \widehat{cov} \to \sigma_1^2 - \rho \sigma_1 \sigma_2 in probability.

Thus the condition \rho < \sigma_1/\sigma_2 implies that \hat\sigma_1^2 > \widehat{cov} for sufficiently large n, so \lambda_2^{opt} eventually will be positive. \square

We remark that the condition \rho < \sigma_1/\sigma_2 is satisfied if \sigma_2 \le \sigma_1 or \rho < 0. If the condition \rho < \sigma_1/\sigma_2 is not satisfied, then \lambda_2^{opt} will have a negative sign for sufficiently large n. However, the value of \lambda_2^{opt} will converge to zero, as shown in the next proposition.
Proof: From Lemma 5.1, it follows that the second term of S_1^e goes to zero in probability as n goes to infinity, while the first term converges to (\theta_1^0 - \theta_2^0)^2 in probability. Therefore we have

S_1^e \to (\theta_1^0 - \theta_2^0)^2 in probability as n \to \infty,

while S_2^e \to 0 in probability, so that

|\lambda_2^{opt}| = |S_2^e / S_1^e| \to 0 as n \to \infty. \square
5.2.2 Alternative Matrix Representation of A_e and b_e

To study the case of more than two populations, it is necessary to derive an alternative matrix representation of \lambda^{opt}. It can be verified that

\bar X_{i\cdot}^{(-j)} \bar X_{k\cdot}^{(-j)} = e_n^2 \bar X_{i\cdot} \bar X_{k\cdot} - \frac{e_n}{n-1} X_{ij} \bar X_{k\cdot} - \frac{e_n}{n-1} X_{kj} \bar X_{i\cdot} + \frac{1}{(n-1)^2} X_{ij} X_{kj},

where e_n = n/(n-1).
Thus, we have

A_e^{(ik)} = \frac{1}{n} \sum_{j=1}^n \bar X_{i\cdot}^{(-j)} \bar X_{k\cdot}^{(-j)} = \frac{n(n-2)}{(n-1)^2} \bar X_{i\cdot} \bar X_{k\cdot} + \frac{1}{(n-1)^2} \big( \widehat{cov}_{ik} + \bar X_{i\cdot} \bar X_{k\cdot} \big),

where \widehat{cov}_{ik} = \frac{1}{n} \sum_{j=1}^n (X_{ij} - \bar X_{i\cdot})(X_{kj} - \bar X_{k\cdot}). In matrix form,

A_e = \frac{1}{(n-1)^2} \hat\Sigma + \Big( \frac{e_n^2 (n-2)}{n} + \frac{1}{(n-1)^2} \Big) \bar X \bar X^t,   (5.3)

where \hat\Sigma = (\widehat{cov}_{ik}) is the sample covariance matrix and \bar X = (\bar X_{1\cdot}, ..., \bar X_{m\cdot})^t. We also have

b_e^{(i)} = \frac{1}{n} \sum_{j=1}^n X_{1j} \bar X_{i\cdot}^{(-j)} = \bar X_{1\cdot} \bar X_{i\cdot} - \frac{\widehat{cov}_{1i}}{n-1},

and hence

b_e = A_1 - \frac{e_n}{n-1} \hat\Sigma_1,   (5.4)

where A_1 is the first column of A_e and \hat\Sigma_1 is the first column of the sample covariance matrix \hat\Sigma.
5.2.3 Optimum Weights By Cross-Validation

We are now in a position to derive the optimum weights when the sample sizes are equal.

Proposition 5.3 The optimum weight vector which minimizes D_e takes the following form:

\lambda_e^{opt} = (1, 0, 0, ..., 0)^t - \frac{e_n}{n-1} \Big[ A_e^{-1} \hat\Sigma_1 - \frac{\mathbf{1}^t A_e^{-1} \hat\Sigma_1}{\mathbf{1}^t A_e^{-1} \mathbf{1}} A_e^{-1} \mathbf{1} \Big].

Proof: By the Lagrange method, \lambda_e^{opt} = A_e^{-1}(b_e + (\gamma/2)\mathbf{1}) for some multiplier \gamma. The constraint 1 = \mathbf{1}^t \lambda_e^{opt} gives

\gamma/2 = \frac{1 - \mathbf{1}^t A_e^{-1} b_e}{\mathbf{1}^t A_e^{-1} \mathbf{1}}.

Therefore,

\lambda_e^{opt} = A_e^{-1} b_e + \frac{1 - \mathbf{1}^t A_e^{-1} b_e}{\mathbf{1}^t A_e^{-1} \mathbf{1}} A_e^{-1} \mathbf{1}.

Substituting (5.4) and noting that A_e^{-1} A_1 = (1, 0, ..., 0)^t yields the stated form. \square

The expression for the weight vector in the two-population case can also be derived by using the matrix representation given above. The detailed calculation is quite similar to that given in the previous subsection.
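The matrix solution can be verified numerically in the two-population case by comparing the Lagrange form \lambda = A_e^{-1} b_e + \gamma A_e^{-1}\mathbf{1} with the scalar formula \lambda_2^{opt} = S_2^e/S_1^e. A sketch with illustrative data and our own variable names:

```python
import random

random.seed(3)
n = 12
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [random.gauss(0.3, 1.0) for _ in range(n)]

def loo(s):
    m = sum(s) / n
    return [(n * m - x) / (n - 1) for x in s]

loo1, loo2 = loo(x1), loo(x2)
L = [loo1, loo2]

# A_e and b_e as defined in the text (with 1/n scaling).
A = [[sum(L[i][j] * L[k][j] for j in range(n)) / n for k in range(2)]
     for i in range(2)]
b = [sum(x1[j] * L[i][j] for j in range(n)) / n for i in range(2)]

# Invert the 2x2 matrix A_e directly.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]
Ainv_b = [sum(Ainv[i][k] * b[k] for k in range(2)) for i in range(2)]
Ainv_1 = [sum(Ainv[i][k] for k in range(2)) for i in range(2)]
gamma = (1 - sum(Ainv_b)) / sum(Ainv_1)          # Lagrange correction
lam = [Ainv_b[i] + gamma * Ainv_1[i] for i in range(2)]

# Compare with the scalar formula lambda_2 = S_2^e / S_1^e.
S1 = sum((a - c) ** 2 for a, c in zip(loo1, loo2)) / n
S2 = sum((a - c) * (a - x) for a, c, x in zip(loo1, loo2, x1)) / n
print(abs(lam[1] - S2 / S1) < 1e-8, abs(sum(lam) - 1) < 1e-12)
```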
5.3 Linear WLE for Unequal Sample Sizes

In the previous section, we discussed choosing the optimum weights when the sample sizes are equal. In this section, we propose cross-validation methods for choosing adaptive weights when the sample sizes are unequal. If the sample sizes are not equal, it is not clear whether the delete-one-column approach is a reasonable one. For example, suppose that there are 10 observations in the first sample and 5 in the second. Then there is no observation to delete from the second sample for half of the deletion steps, and deleting a whole column may be undesirable for small sample sizes. Therefore we propose an alternative method which deletes only one data point from the first sample and keeps all the data points from the other samples.
5.3.1 Two Population Case

Let us again consider the case in which only two populations are involved, with the objective function

D_u^{(2)}(\lambda) = \sum_{j=1}^{n_1} \big( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \lambda_2 \bar X_{2\cdot} \big)^2.

The difference between D_u^{(2)} and D_e^{(2)} is that only the jth data point of the first sample is left out. Substituting \lambda_2 = 1 - \lambda_1, we obtain

D_u^{(2)} = \sum_{j=1}^{n_1} \big( (X_{1j} - \bar X_{2\cdot}) - \lambda_1 (\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot}) \big)^2,

and minimizing over \lambda_1 gives

\lambda_1^{opt} = \frac{\sum_{j=1}^{n_1} (\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot})(X_{1j} - \bar X_{2\cdot})}{\sum_{j=1}^{n_1} (\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot})^2}.

Using \bar X_{1\cdot}^{(-j)} = \bar X_{1\cdot} + (\bar X_{1\cdot} - X_{1j})/(n_1 - 1), together with \sum_j (\bar X_{1\cdot} - X_{1j}) = 0 and \hat\sigma_1^2 = \frac{1}{n_1}\sum_j (X_{1j} - \bar X_{1\cdot})^2, the numerator and denominator simplify to

\sum_{j=1}^{n_1} (\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot})(X_{1j} - \bar X_{2\cdot}) = n_1 (\bar X_{1\cdot} - \bar X_{2\cdot})^2 - \frac{n_1}{n_1 - 1} \hat\sigma_1^2,
\sum_{j=1}^{n_1} (\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot})^2 = n_1 (\bar X_{1\cdot} - \bar X_{2\cdot})^2 + \frac{n_1}{(n_1 - 1)^2} \hat\sigma_1^2.

We then have

\lambda_1^{opt} = \frac{n_1 (\bar X_{1\cdot} - \bar X_{2\cdot})^2 - \frac{n_1}{n_1 - 1} \hat\sigma_1^2}{n_1 (\bar X_{1\cdot} - \bar X_{2\cdot})^2 + \frac{n_1}{(n_1 - 1)^2} \hat\sigma_1^2},   \lambda_2^{opt} = 1 - \lambda_1^{opt}.   (5.5)

As n_1 \to \infty, \hat\sigma_1^2 \to \sigma_1^2 and (\bar X_{1\cdot} - \bar X_{2\cdot})^2 \to (\theta_1^0 - \theta_2^0)^2 \ne 0 in probability. We then have

\lambda_1^{opt} \to 1 in probability.
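The limit \lambda_1^{opt} \to 1 can be illustrated by plugging population values into (5.5); a small sketch (the values d = 0.3 and \sigma_1^2 = 1 are illustrative stand-ins for the sample quantities):

```python
def lam1_opt(n, d, var1):
    """Formula (5.5) with the sample quantities replaced by population
    values: d stands for xbar1 - xbar2, var1 for sigma_hat_1^2."""
    num = n * d ** 2 - (n / (n - 1)) * var1
    den = n * d ** 2 + (n / (n - 1) ** 2) * var1
    return num / den

weights = [lam1_opt(n, 0.3, 1.0) for n in (20, 200, 2000)]
print([round(w, 3) for w in weights])  # -> [0.403, 0.944, 0.994]
```

The weight on the first sample approaches 1, but quite slowly, matching the behaviour seen in the simulations of Section 5.5.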
5.3.2 Optimum Weights By Cross-Validation

We now derive the general formula for the optimum weights by cross-validation when the sample sizes are not all equal. The objective function is defined as follows:

D_u(\lambda) = \sum_{j=1}^{n_1} \Big( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \sum_{i=2}^m \lambda_i \bar X_{i\cdot} \Big)^2 = c - 2 \sum_{i=1}^m b_i \lambda_i + \sum_{i=1}^m \sum_{k=1}^m \lambda_i a_{ik} \lambda_k = c - 2 b^t \lambda + \lambda^t A \lambda,

where, using \bar X_{1\cdot}^{(-j)} = \bar X_{1\cdot} + (\bar X_{1\cdot} - X_{1j})/(n_1 - 1),

b_1 = \sum_{j=1}^{n_1} X_{1j} \Big( \bar X_{1\cdot} + \frac{\bar X_{1\cdot} - X_{1j}}{n_1 - 1} \Big) = n_1 \bar X_{1\cdot}^2 - \frac{n_1}{n_1 - 1} \hat\sigma_1^2,
b_i = n_1 \bar X_{1\cdot} \bar X_{i\cdot}, i = 2, ..., m;
a_{11} = \sum_{j=1}^{n_1} \Big( \bar X_{1\cdot} + \frac{\bar X_{1\cdot} - X_{1j}}{n_1 - 1} \Big)^2 = n_1 \bar X_{1\cdot}^2 + \frac{n_1}{(n_1 - 1)^2} \hat\sigma_1^2,
a_{ik} = n_1 \bar X_{i\cdot} \bar X_{k\cdot} if i \ne 1 or k \ne 1.
Writing A = n_1 \bar X \bar X^t + D, where d_{11} = \frac{n_1}{(n_1-1)^2} \hat\sigma_1^2 and d_{ik} = 0 if i \ne 1 or k \ne 1, we see that

rank(A) \le 2 < m if m > 2.

It then follows that A is not invertible for m > 2. Thus the Lagrange method will not work in this case, since it involves the inversion of the matrix A.

To solve this problem, we can rewrite the objective function in terms of \lambda_2, \lambda_3, ..., \lambda_m only; that is, we replace \lambda_1 by 1 - \sum_{i=2}^m \lambda_i. The original constrained minimization problem is then transformed into a minimization problem without any constraint. As we will see in the following derivation, the new objective function is a quadratic function of \lambda_2, \lambda_3, ..., \lambda_m. Thus the minimum of the new objective function exists and is unique.
Substituting \lambda_1 = 1 - \sum_{i=2}^m \lambda_i into D_u and collecting terms, we obtain

D_u = (c - 2b_1 + a_{11}) - 2 \sum_{i=2}^m \big[ (a_{11} - a_{1i}) - (b_1 - b_i) \big] \lambda_i + \sum_{i=2}^m \sum_{k=2}^m \lambda_i (a_{ik} + a_{11} - a_{1i} - a_{1k}) \lambda_k.

Let C = (c_{ik}) with c_{ik} = a_{ik} + a_{11} - a_{1i} - a_{1k}, i, k = 2, ..., m, and let \eta = (\eta_2, ..., \eta_m)^t with \eta_i = (a_{11} - a_{1i}) - (b_1 - b_i). If C is invertible, it then follows that

\lambda^{(-1)} = (\lambda_2, \lambda_3, ..., \lambda_m)^t = C^{-1} \eta.

We then have

\lambda_u^{opt} = \big( 1 - \mathbf{1}^t \lambda^{(-1)}, (\lambda^{(-1)})^t \big)^t.
5.4 Asymptotic Properties of the Weights

In this section, we derive the asymptotic properties of the cross-validated weights. Let \hat\theta_1^{(n_1)} be the MLE based on the first sample of size n_1. Let \tilde\theta_1^{(-j)} and \hat\theta_1^{(-j)}, respectively, be the MLE and the WLE based on the m samples without the jth data point from the first sample. This generalizes the two cases in which either only the jth data point is deleted from the first sample or the jth data point from each sample is deleted. Note that \hat\theta_1^{(-j)} is a function of the weight vector \lambda. Let \frac{1}{n_1} D_{n_1} be the average discrepancy in the cross-validation, which is defined as

\frac{1}{n_1} D_{n_1}(\lambda) = \frac{1}{n_1} \sum_{j=1}^{n_1} \big( X_{1j} - \phi(\hat\theta_1^{(-j)}) \big)^2.
The next result gives conditions under which the cross-validated weights converge. Suppose that:

(1) the minimizer \lambda^{(cv)} of \frac{1}{n_1} D_{n_1}(\lambda) is unique for each fixed n_1;
(2) \frac{1}{n_1} \sum_{j=1}^{n_1} \big( \phi(\tilde\theta_1^{(-j)}) - \phi(\hat\theta_1^{(n_1)}) \big)^2 \to 0 as n_1 \to \infty;
(4) P_{\theta^0}\big( |\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)})| > M \big) = o(1/n_1) for some constant 0 < M < \infty;

then

\lambda^{(cv)} \to w^0 = (1, 0, 0, ..., 0)^t in probability.   (5.6)
Proof: Consider

\frac{1}{n_1} D_{n_1}(\lambda) = \frac{1}{n_1} \sum_{j=1}^{n_1} \Big( \big( X_{1j} - \phi(\tilde\theta_1^{(-j)}) \big) + \big( \phi(\tilde\theta_1^{(-j)}) - \phi(\hat\theta_1^{(-j)}) \big) \Big)^2 = \frac{1}{n_1} \sum_{j=1}^{n_1} \big( X_{1j} - \phi(\tilde\theta_1^{(-j)}) \big)^2 + S_1 + S_2,

where

S_1 = \frac{2}{n_1} \sum_{j=1}^{n_1} \big( X_{1j} - \phi(\tilde\theta_1^{(-j)}) \big) \big( \phi(\tilde\theta_1^{(-j)}) - \phi(\hat\theta_1^{(-j)}) \big),
S_2 = \frac{1}{n_1} \sum_{j=1}^{n_1} \big( \phi(\tilde\theta_1^{(-j)}) - \phi(\hat\theta_1^{(-j)}) \big)^2.

For S_1,

P_{\theta^0}(|S_1| > \epsilon) \le P_{\theta^0}\big( \epsilon < |S_1| and |\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)})| \le M for all j \big) + P_{\theta^0}\big( |\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)})| > M for some j \big).

The first term goes to zero by the Weak Law of Large Numbers, and the second term goes to zero by assumption (4). We then have S_1 \to 0 in probability. For S_2, the same decomposition applies: the first term goes to zero by assumption (2) and the second term by assumption (4), so S_2 \to 0 in probability. Therefore

\frac{1}{n_1} D_{n_1}(\lambda) = \frac{1}{n_1} \sum_{j=1}^{n_1} \big( X_{1j} - \phi(\tilde\theta_1^{(-j)}) \big)^2 + R_{n_1},   (5.9)

where R_{n_1} \to 0 in probability. Since \phi(\hat\theta_1^{(-j)}) = \phi(\tilde\theta_1^{(-j)}) when \lambda = w^0 = (1, 0, 0, ..., 0)^t for fixed m, the unique minimizer of the left-hand side converges to w^0; that is,

\lambda^{(cv)} \to w^0 in probability as n_1 \to \infty. \square
To check the assumptions of the above theorem, let us consider the linear WLE for two samples with equal sample sizes. Assumption (1) is satisfied since \frac{1}{n_1} D_{n_1}(\lambda) is a quadratic form in \lambda and its minimum is indeed unique for each fixed n_1. Assumption (2) holds because \phi(\tilde\theta_1^{(-j)}) = \bar X_{1\cdot}^{(-j)} differs from \bar X_{1\cdot} by (\bar X_{1\cdot} - X_{1j})/(n_1 - 1), so that

\frac{1}{n_1} \sum_{j=1}^{n_1} \big( \phi(\tilde\theta_1^{(-j)}) - \phi(\hat\theta_1^{(n_1)}) \big)^2 = \frac{\hat\sigma_1^2}{(n_1 - 1)^2} \to 0.

Moreover, using the explicit form of the cross-validated weights,

\lambda_1^{(cv)} - 1 = - \frac{\hat\sigma_1^2 - \widehat{cov}}{(n-2)(\bar X_{1\cdot} - \bar X_{2\cdot})^2 + n^{-2} \sum_{j=1}^n (X_{1j} - X_{2j})^2},

so that, by Chebyshev's inequality,

P_{\theta^0}\big( |\lambda_1^{(cv)} - w_1^0| > \epsilon \big) \le \frac{1}{(n-2)^2 \epsilon^2} E_{\theta^0} \Big[ \frac{(\hat\sigma_1^2 - \widehat{cov})^2}{(\bar X_{1\cdot} - \bar X_{2\cdot})^4} \Big] \to 0,

since \bar X_{1\cdot} - \bar X_{2\cdot} \to \theta_1^0 - \theta_2^0 \ne 0 and \hat\sigma_1^2 - \widehat{cov} \to \sigma_1^2 - cov(X_{1j}, X_{2j}) in probability. Thus \lambda^{(cv)} - w^0 \to 0 in probability, and the asymptotic normality of the WLE with these adaptive weights follows by Theorem 4.10.
5.5 Simulation Studies

We now present simulation studies of the cross-validation procedure which deletes the jth point from each sample, i.e. the delete-one-column approach. The algorithm is as follows. Step 1: generate a sample from each of the two populations. Step 2: compute the optimum weights by the cross-validation rule. Step 3: compute the MLE and the WLE.
Table 5.1: MSE of the MLE and the WLE and standard deviations of the squared errors for samples with equal sizes from N(0,1) and N(0.3,1). (Columns: n; MSE(MLE); SD of (MLE - \theta_1^0)^2; MSE(WLE); SD of (WLE - \theta_1^0)^2; MSE(WLE)/MSE(MLE).)
Repeat Steps 1-3 1000 times. Calculate the averages and standard deviations of the squared differences between the MLE and the WLE and the true parameter value \theta_1^0, respectively. Calculate the averages and standard deviations of the optimum weights.
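The Monte Carlo comparison can be sketched as follows (a scaled-down version for illustration: 500 replications instead of 1000, n = 10, our own function names, and the weight clipped to [0, 1] as a practical safeguard not required by the theory):

```python
import random

def wle_weight(x1, x2):
    """Delete-one-column CV weight lambda_1 = 1 - S_2^e/S_1^e (clipped)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    l1 = [(n * m1 - x) / (n - 1) for x in x1]
    l2 = [(n * m2 - x) / (n - 1) for x in x2]
    S1 = sum((a - b) ** 2 for a, b in zip(l1, l2))
    S2 = sum((a - b) * (a - x) for a, b, x in zip(l1, l2, x1))
    lam2 = max(0.0, min(1.0, S2 / S1))
    return 1.0 - lam2

random.seed(4)
n, reps, theta1 = 10, 500, 0.0
se_mle, se_wle = [], []
for _ in range(reps):
    x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
    x2 = [random.gauss(0.3, 1.0) for _ in range(n)]
    m1, m2 = sum(x1) / n, sum(x2) / n
    lam1 = wle_weight(x1, x2)
    wle = lam1 * m1 + (1 - lam1) * m2
    se_mle.append((m1 - theta1) ** 2)
    se_wle.append((wle - theta1) ** 2)
mse_mle = sum(se_mle) / reps
mse_wle = sum(se_wle) / reps
print(round(mse_mle, 3), round(mse_wle, 3))
```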
We generate random samples from N(0, \sigma_1^2) and N(c, \sigma_2^2). For simplicity, we set \sigma_1 = \sigma_2 = 1. For the purpose of demonstration, we set c = 0.3, which is 30% of the variance. Other values were tried. In general, the larger the value of c, the closer the ratio of the MSE of the WLE to that of the MLE comes to 1, which means that the cross-validation procedure will not consider the second sample useful in that case. Some of the results are shown in Tables 5.1 and 5.2.
Table 5.2: Optimum weights and their standard deviations for samples with equal sizes from N(0,1) and N(0.3,1).
It is obvious from the table that the MSE of the WLE is much smaller than that of the MLE for small and moderate sample sizes. The standard deviations of the squared differences for the WLE are less than or equal to those of the MLE. This suggests that not only does the WLE achieve a smaller MSE, but its MSE also has less variation than that of the MLE. Intuitively, as the sample size increases, the importance of the second sample diminishes. As indicated by Table 5.2, the cross-validation procedure begins to recognize this by assigning a larger value to \lambda_1 as the sample size increases. Although quite slowly, the optimum weights do increase towards the asymptotic weights (1, 0) for the normal case.
We repeat the algorithm for Poisson distributions P(3) and P(3.6). Some of the results are shown in Table 5.3 and Table 5.4. The result for the Poisson distributions is somewhat different from that for the normal. The striking difference is that the WLE achieves a smaller MSE on average only when the sample sizes are less than 30; once the sample size is over 30, it seems that we should not combine the two samples, which was not the case in the normal example above. Thus the cross-validation procedure will not combine two samples if the second sample does not help to predict the behavior of the first. We also emphasize that the value of c in both cases is not assumed to be known to the cross-validation procedure.
Table 5.3: MSE of the MLE and the WLE and their standard deviations for samples with equal sizes from P(3) and P(3.6). (Columns: n; MSE(MLE); SD; MSE(WLE); SD; MSE(WLE)/MSE(MLE).)
Table 5.4: Optimum weights and their standard deviations for samples with equal sizes from P(3) and P(3.6).
We remark that simulations using the delete-one-point approach have also been done; they give quite similar results.
Chapter 6

Derivations of the Weighted Likelihood Functions
In this chapter, we develop the theoretical foundation for using weighted likelihood. Hu and Zidek (1997) discuss the connection between relevance weighted likelihood and the maximum entropy principle for the discrete case. We show that the weighted likelihood function can also be derived from the maximum entropy principle advocated by Akaike in the continuous case.
6.1 Introduction
Akaike (1985) reviewed the historical development of entropy and discussed the importance of the maximum entropy principle. Hu and Zidek (1997) discovered the connection between relevance weighted likelihood and the maximum entropy principle for the discrete case. We offer a proof for the continuous case.

We first state the maximum entropy principle: all statistical activities are directed toward maximizing the expected entropy of the predictive distribution in each particular application. The relative entropy between a density f_i and a candidate predictive density g is

I(f_i, g) = \int f_i(x) \log \frac{f_i(x)}{g(x)} dx, i = 1, 2, 3, ..., m.

The constant a_i reflects the maximum allowable deviation of g from the density function f_i(x); if a_i is set to zero, then g(x) takes exactly the same functional form as f_i(x). For a given set of density functions, we seek a probability density function which minimizes I(f_1, g) = \int f_1(x) \log \frac{f_1(x)}{g(x)} dx over all probability densities g satisfying

I(f_i, g) \le a_i, i = 2, ..., m,   (6.1)

where the a_i are finite non-negative constants. This idea of minimizing the relative entropy under certain constraints is also similar to the approach outlined in Kullback (1959, Chapter 3) for hypothesis testing.
6.2 Existence of the Optimal Density

We consider the minimization problem

\min_{g \in E} I(g).
To avoid trivial cases, we assume that the function I(g) is proper, i.e. it does not take the value -\infty and is not identically equal to +\infty. We then have the following theorem.

Theorem 6.1 Assume that I(g) is convex, lower semi-continuous and proper with respect to g. In addition, assume that the set E is bounded in L^p. Then the minimization problem has at least one solution. The solution is unique if I(g) is strictly convex.
Define

E_i = \{ g : \int f_i(x) \log \frac{f_i(x)}{g(x)} dx \le a_i, \int g(x) dx = 1, g(x) \ge 0, g \in L^p \}, i = 2, ..., m,

and E = \cap_{i=2}^m E_i. Recall also that, for any two densities,

V(f_1, f_2)^2 / 2 \le I(f_1, f_2),

where V(f_1, f_2) = \int |f_1(x) - f_2(x)| dx.
Proof: It suffices to show that each E_i is bounded in L^p. For any density g in E_i, i = 2, ..., m, we have

\int |g(x) - f_i(x)|^p dx = \int |g(x) - f_i(x)| \, |g(x) - f_i(x)|^{p-1} dx \le B_i \int |g(x) - f_i(x)| dx \le B_i \sqrt{2 I(f_i, g)} \le B_i \sqrt{2 a_i}.

By the Minkowski inequality for L^p spaces with 1 \le p < \infty, E is then bounded in L^p. \square

It follows that problem (6.1) has a unique solution if the admissible density functions satisfy the conditions

|g(x) - f_i(x)|^{p-1} \le B_i for all x and g \in L^p, 1 \le p < \infty.
Proof: Note that the log function is strictly convex. The assertion of this theorem follows from Theorem 6.1 and Lemma 6.2. \square
6.4 Derivation of the WL Functions

In order to find the solution of the problem, it is useful to state certain results from the calculus of variations. In 1744, Euler formulated the general isoperimetric problem as follows:

Find a vector function (y_1, y_2, ..., y_m) which minimizes the functional

\int_a^b f(x, y_1(x), ..., y_m(x), y_1'(x), ..., y_m'(x)) dx

subject to the boundary conditions

y_i(a) = y_{ia}, i = 1, 2, ..., m,   (6.3)
y_i(b) = y_{ib}, i = 1, 2, ..., m,   (6.4)

while

\int_a^b f_i(x, y_1(x), ..., y_m(x), y_1'(x), ..., y_m'(x)) dx = l_i, i = 1, 2, ..., m,   (6.5)

where l_1, l_2, ..., l_m are given numbers.
We now state a fundamental theorem in the calculus of variations: the minimizing functions must satisfy the Euler-Lagrange equations

h_{y_k}(x, y_1, y_2, ..., y_m, y_1', y_2', ..., y_m', t) - \frac{d}{dx} h_{y_k'}(x, y_1, y_2, ..., y_m, y_1', y_2', ..., y_m', t) = 0, k = 1, ..., m,   (6.6)

where

h(x, y_1, y_2, ..., y_m, y_1', y_2', ..., y_m', t) = t_0 f(x, y_1, y_2, ..., y_m, y_1', y_2', ..., y_m') + \sum_{i=1}^m t_i f_i(x, y_1, y_2, ..., y_m, y_1', y_2', ..., y_m').

Proof: See, for example, Theorem 2 on p. 90 in Giaquinta and Hildebrandt (1996). \square

Note that the values of a and b are arbitrary; they can take the values +\infty and -\infty. The original proof is given in Bliss (1930), and detailed discussions can also be found in that paper.
We now establish the necessary condition for the optimal solution to the minimization problem. Assume that the density functions f_1, f_2, ..., f_m are continuous and twice differentiable. The necessary condition is that the optimal solution g^* is a mixture distribution; i.e., there exist constants t_1^*, t_2^*, ..., t_m^* such that

\sum_{i=1}^m t_i^* = 1 and g^*(x) = \sum_{k=1}^m t_k^* f_k(x),
for any particular choice of g satisfying the constraints (6.1). We seek the optimal solution, which is a point in a certain manifold of the function space. Thus the optimization problem can be re-formulated in the context of the calculus of variations as

\min_{g \in E} I(g) = \min_g \int f_1(x) \log \frac{f_1(x)}{g(x)} dx.

The necessary condition for g^* to be the optimal solution is that it satisfies the Euler-Lagrange equation, i.e.

\frac{\partial \psi}{\partial g} - \frac{d}{dx} \frac{\partial \psi}{\partial g'} = 0,

where g' = dg/dx. Notice that \psi(x, g) is not a function of g'. This implies that \partial\psi/\partial g' = 0, and the Euler equation then becomes

\frac{\partial \psi}{\partial g} = 0.

It follows that

-\frac{f_1(x)}{g(x)} - \sum_{k=2}^m l_k \frac{f_k(x)}{g(x)} + l_0 = 0.

We then have

g^*(x) = \sum_{k=1}^m t_k^* f_k(x),

where t_1^* = 1/l_0 and t_k^* = l_k/l_0, k = 2, ..., m.
Since we seek a density function, it then follows that the sum of the t_i^* must be 1, since \int g^*(x) dx = \sum_{k=1}^m t_k^* = 1. It also follows that g^*(x) = \sum_{k=1}^m t_k^* f_k(x) \ge 0: if g^*(x) = \sum_{k=1}^m t_k^* f_k(x) < 0 for all x \in K with P_1(K) > 0, the constraints \int f_i(x) \log (f_i(x)/g^*(x)) dx \le a_i would not be satisfied. This completes the proof. \square
Consider the minimization problem without any constraint. We then seek the optimal density function g^* which minimizes I(f_1, g) for any given f_1(x). According to Theorem 6.4, the necessary condition on the optimal function applies with t_i^* = 0, i = 2, 3, ..., m, so that t_1^* = 1. It then follows that g^*(x) = f_1(x), a.e. Furthermore, we have I(f_1, g) \ge I(f_1, g^*) = I(f_1, f_1) = 0 for any density function g, so the unconstrained solution is consistent with the constraints (6.1).
Proof: Suppose that there exists a probability density function g_0, satisfying the constraints, which achieves a strictly smaller value of the objective than g^*. It follows that

\int f_1(x) \log g^*(x) dx < \int f_1(x) \log g_0(x) dx,

and hence, using the constraints,

\sum_{i=1}^m t_i^* \int f_i(x) \log g^*(x) dx < \sum_{i=1}^m t_i^* \int f_i(x) \log g_0(x) dx.

But the left-hand side equals \int g^*(x) \log g^*(x) dx, which by the Gibbs inequality is at least \int g^*(x) \log g_0(x) dx, the right-hand side. This contradiction completes the proof. \square
Assume that there exists g_0 = \sum_{i=1}^m t_i f_i(x) with the t_i chosen so that g_0 achieves the equalities in the constraints (6.1) and \sum_{i=1}^m t_i = 1, 0 \le t_i \le 1, i = 1, 2, ..., m, for any a such that |a_i - a_i^0| < \delta_i. Then the t_i are monotone functions of the a_i: differentiating the identities \int f_i(x) \log (f_i(x)/g^*(x)) dx = a_i with respect to the mixture weights, and using the fact that g^* satisfies the constraints (6.1) with equality, shows that the variation of each constraint integral with respect to the weights has a fixed sign. In particular,

\frac{\partial t_i}{\partial a_i} < 0, i = 2, ..., m.
Proof: Note that if we set a_i = 0, then t_i = 1; if a_i = \infty, then t_i = 0. Since t_j is a monotone function of a_i for any fixed a_j, i \ne j, it follows that 0 \le t_i \le 1. \square
The weighted likelihood function for the discrete case is given in Hu and Zidek (1994). We now generalize their derivation of the weighted likelihood function.

Theorem 6.8 Assume that the optimal distribution takes the functional form f(\cdot; \theta). Given X_{ij}, i = 1, 2, ..., m, j = 1, 2, ..., n_i, the optimization problem considered above is equivalent to maximizing the weighted likelihood.

Proof: By the proof of Theorem 6.4 and the Lagrange theorem, we need to choose the optimal density function which solves

\max_\theta \sum_{i=1}^m \lambda_i \int \log f(x; \theta) d\hat F_i(x),

where \hat F_i denotes the empirical distribution function of the ith sample. This implies that estimating the parameter of the optimal density is equivalent to finding the WLE derived from the weighted likelihood function when the functional form of the optimal density function is known. This completes the proof. \square
We have shown that the optimal function takes the form

g^*(x) = \sum_{k=1}^m t_k^* f_k(x).

However, the density function g^* does not exist if the constraints define an empty set. Let us consider a relatively simple situation where three populations are involved. Recall that the t_i need to satisfy the following conditions:

t_1 + t_2 + t_3 = 1, 0 \le t_2 + t_3 \le 1.

These requirements cannot always be met, since a probability density function cannot take the functional forms of f_2 and f_3 simultaneously.
We then have

I\big( f_1, \sum_{k=1}^m t_k f_k \big) = \int \log \frac{f_1(x)}{\sum_{k=1}^m t_k f_k(x)} f_1(x) dx
\le \int \big[ \log f_1(x) - \sum_{k=1}^m t_k \log f_k(x) \big] f_1(x) dx
= \int \big[ \sum_{k=1}^m t_k \log f_1(x) - \sum_{k=1}^m t_k \log f_k(x) \big] f_1(x) dx
= \sum_{k=1}^m t_k I(f_1, f_k).

The inequality uses the concavity of the logarithm: \log \sum_k t_k f_k \ge \sum_k t_k \log f_k. This completes the proof. \square
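The mixture inequality I(f_1, \sum_k t_k f_k) \le \sum_k t_k I(f_1, f_k) can be checked numerically; a sketch for two normal densities (the grid limits and step count are arbitrary choices of ours):

```python
import math

def kl(f, g, lo=-20.0, hi=20.0, steps=20000):
    """I(f, g) = integral of f log(f/g) by a simple midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        fx, gx = f(x), g(x)
        if fx > 0:
            total += fx * math.log(fx / gx) * h
    return total

def normal(mu, sigma):
    c = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return lambda x: c * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

f1, f2 = normal(0.0, 1.0), normal(1.0, 1.0)
t = (0.7, 0.3)
mix = lambda x: t[0] * f1(x) + t[1] * f2(x)
lhs = kl(f1, mix)                          # I(f1, t1*f1 + t2*f2)
rhs = t[0] * kl(f1, f1) + t[1] * kl(f1, f2)  # sum_k t_k I(f1, f_k)
print(lhs <= rhs)
```

Here I(N(0,1), N(1,1)) = 1/2, which the numerical integral reproduces closely.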
Let D = (d_{ij}), where

d_{ij} = 1 if i = 1, and d_{ij} = I(f_i, f_j) if i = 2, 3, ..., m,

and let a_{m \times 1} = (1, a_2, ..., a_m)^t. The conditions on t = (t_1, ..., t_m)^t can then be written as the linear system

D t = a.

By a result from elementary linear algebra, the assertion of the theorem follows. \square
Chapter 7

Saddlepoint Approximation of the WLE
7.1 Introduction
In a pioneering paper, Daniels (1954) introduced a new idea into statistics by applying saddlepoint techniques to derive a very accurate approximation to the distribution of the sample mean in the i.i.d. case. It is a general technique which allows us to approximate the density of a broad class of statistics.

7.2 Review of the Saddlepoint Approximation

Consider integrals of the form

\int_{\mathcal{P}} e^{v w(z)} \phi(z) dz   (7.1)

when the real parameter v is large and positive. Here w and \phi are analytic functions of z in a domain of the complex plane which contains the path of integration \mathcal{P}. This technique is called the method of steepest descent and is used to derive saddlepoint approximations.
Consider the integral (7.1). In order to find the approximation, we deform the path of integration \mathcal{P} arbitrarily, provided we remain in the domain where w and \phi are analytic, so that (i) the new path of integration passes through a zero z_0 of the derivative w'(z), and (ii) the imaginary part of w(z) is constant on the new path.
If we write

z = x + iy, z_0 = x_0 + iy_0, w(z) = u(x, y) + i v(x, y),

and denote by S the surface (x, y) \to u(x, y), then by the Cauchy-Riemann differential equations

\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x},

it follows that the point (x_0, y_0) cannot be a maximum or a minimum but is a saddlepoint on the surface S. Moreover, the orthogonal trajectories of the level curves u(x, y) = constant are given by the curves v(x, y) = constant. It follows that the paths on S corresponding to the orthogonal trajectories of the level curves are paths of steepest descent. We truncate the integration at a certain point on the paths of steepest descent; the major part of the integral is then used to approximate the integral in the complex plane. Detailed discussions can be found in Daniels (1954).
Suppose that X_1, X_2, ..., X_n are i.i.d. real-valued random variables with density f and that T_n(X_1, X_2, ..., X_n) is a real-valued statistic with density f_n. Let M_n(a) = \int e^{at} f_n(t) dt be the moment-generating function and K_n(a) = \log M_n(a) the cumulant-generating function. Further suppose that the moment generating function M_n(a) exists for real a in some non-vanishing interval that contains the origin. By Fourier inversion,

f_n(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} M_n(iz) e^{-izx} dz = \frac{n}{2\pi i} \int_{r - i\infty}^{r + i\infty} \exp\big( n (R_n(z) - zx) \big) dz,

where r is any real number in the interval where the moment generating function exists and

R_n(z) = K_n(nz)/n.

Applying the saddlepoint approximation to the last integral gives an approximation of f_n with uniform error of order n^{-1}:

f_n(x) = \Big( \frac{n}{2\pi R_n''(a_0)} \Big)^{1/2} \exp\big( n (R_n(a_0) - a_0 x) \big) \big( 1 + O(n^{-1}) \big),

where the saddlepoint a_0 solves

R_n'(a_0) = x,

and R_n' and R_n'' denote the first two derivatives of R_n. Detailed discussions of the saddlepoint method can be found in Daniels (1954) and Field and Ronchetti (1990).
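As a concrete check of the i.i.d. formula, for the mean of n unit exponentials the saddlepoint equation can be solved in closed form and the exact density is Gamma(n, rate n); the relative error of the unnormalized approximation is then constant in x (a Stirling-type factor of roughly 1/(12n)). A sketch:

```python
import math

def saddlepoint_mean_exp(n, x):
    """Saddlepoint approx to the density of the mean of n Exp(1)'s.
    Here R_n(z) = -log(1 - z); the saddlepoint solves R_n'(a0) = x."""
    a0 = 1.0 - 1.0 / x                      # root of 1/(1 - a0) = x
    R, R2 = -math.log(1.0 - a0), 1.0 / (1.0 - a0) ** 2
    return math.sqrt(n / (2 * math.pi * R2)) * math.exp(n * (R - a0 * x))

def exact_mean_exp(n, x):
    """Exact density: the mean of n Exp(1)'s is Gamma(n, rate n)."""
    return n ** n * x ** (n - 1) * math.exp(-n * x) / math.factorial(n - 1)

n = 10
for x in (0.5, 1.0, 2.0):
    rel = saddlepoint_mean_exp(n, x) / exact_mean_exp(n, x) - 1.0
    print(x, round(rel, 4))   # relative error ~ 1/(12n), uniform in x
```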
7.3 Results for Exponential Family

The saddlepoint approximation stated in the last section is for a general statistic constructed from a series of i.i.d. random variables. In this section, we derive the saddlepoint approximation to the distribution of a sum built from a finite number of samples that are independent but not identically distributed: the X_{ij} are i.i.d. for any given i, but (X_{ij}) and (X_{i'j}) with i \ne i' do not follow the same distribution.
Theorem 7.1 Let S(X) = \sum_{i=1}^m \bar X_{i\cdot}, where \bar X_{i\cdot} is the mean of the ith sample and K_i is the cumulant generating function of a single observation from the ith population. Writing

R_{(n_1, n_2, ..., n_m)}(z) = \sum_{i=1}^m \frac{n_i}{n_1} K_i\Big( \frac{n_1 z}{n_i} \Big),

Fourier inversion gives

f_{S, (n_1, n_2, ..., n_m)}(x) = \frac{n_1}{2\pi i} \int_{r - i\infty}^{r + i\infty} \exp\big( n_1 ( R_{(n_1, ..., n_m)}(z) - zx ) \big) dz.

Applying the saddlepoint technique, we derive the approximate density for S(X):

f_{S, (n_1, n_2, ..., n_m)}(x) = \Big( \frac{n_1}{2\pi R''_{(n_1, ..., n_m)}(a_0)} \Big)^{1/2} \exp\big( n_1 ( R_{(n_1, ..., n_m)}(a_0) - a_0 x ) \big),

where the saddlepoint a_0 satisfies R'_{(n_1, ..., n_m)}(a_0) = x.
Example 7.1 (The Sum of Sample Means) Let us examine the distribution of W_n, where

W_n = \frac{1}{n_1}(X_{11} + X_{12} + ... + X_{1n_1}) + \frac{1}{n_2}(X_{21} + X_{22} + ... + X_{2n_2}),

and both samples are drawn from a common distribution with cumulant generating function K. When n_1 = n_2 = n, the saddlepoint a_0^* solves

2 K'(z) = x,

and the approximate density is

f_{W_n}(x) = \Big( \frac{n}{4\pi K''(a_0^*)} \Big)^{1/2} \exp\big( n (2K(a_0^*) - a_0^* x) \big).

The sample average of the combined sample is W_n/2. It then follows that the saddlepoint equation evaluated at y = x/2 is again

2 K'(z) = x = 2y,

which is exactly the saddlepoint a_0 for an i.i.d. sample of size 2n. Thus, the saddlepoint approximation of the density of W_n/2 given by Theorem 7.1, when the random variables from both samples are indeed i.i.d., is exactly the saddlepoint approximation of the sample mean of an i.i.d. sample of size 2n.
Example 7.2 (Spread Density for the Exponential) If Y_1, Y_2, ..., Y_{m+1} are independent, exponentially distributed random variables with common mean 1, then the spread V_{(m+1)} - V_{(1)}, the difference between the maximum and the minimum of the sample, has the same distribution as \sum_{j=1}^m Y_j/j, whose cumulant generating function is

\sum_{j=1}^m R_j(z) = -\sum_{j=1}^m \log(1 - z/j).

The saddlepoint equation of Theorem 7.1 becomes

\sum_{j=1}^m \frac{1}{j - z} = x,

which needs to be solved numerically. Due to the unequal variances of the summands Y_j/j, the normal approximation does not work well. Lange (1999) calculates the saddlepoint approximation for this particular example. The following table is part of Table 17.2 in Lange (1999); the last column is the difference between the exact density and the saddlepoint approximation.

It can be seen that the saddlepoint approximation is very accurate. Lange (1999) also shows that the saddlepoint approximation out-performs the Edgeworth expansion in this example.
X Exact Error
0.5 .00137 -.00001
1.0 .05928 -.00027
1.5 .22998 .00008
2.0 .36563 .00244
2.5 .37874 .00496
3.0 .31442 .00550
3.5 .22915 .00423
4.0 .15508 .00242
4.5 .10046 .00098
5.0 .06340 .00012
5.5 .03939 -.00026
6.0 .02424 -.00037
6.5 .01483 -.00035
7.0 .00904 -.00028
7.5 .00550 -.00021
8.0 .00334 -.00014
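The table can be reproduced numerically. A sketch, assuming (as the tabulated values indicate) a sample of 11 exponentials, i.e. m = 10 spacings; the bisection bracket and iteration count are our own choices:

```python
import math

M = 10  # spread of a sample of M + 1 unit exponentials

def K(z):   # cumulant generating function of sum_{j=1}^{M} Y_j / j
    return -sum(math.log(1.0 - z / j) for j in range(1, M + 1))

def K1(z):  # its first derivative
    return sum(1.0 / (j - z) for j in range(1, M + 1))

def K2(z):  # its second derivative
    return sum(1.0 / (j - z) ** 2 for j in range(1, M + 1))

def spread_saddlepoint(x):
    lo, hi = -200.0, 1.0 - 1e-12         # K1 is increasing on (-inf, 1)
    for _ in range(200):                  # bisection for K1(z) = x
        mid = 0.5 * (lo + hi)
        if K1(mid) < x:
            lo = mid
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    return math.exp(K(z) - z * x) / math.sqrt(2 * math.pi * K2(z))

def spread_exact(x):  # exact density: M (1 - e^{-x})^{M-1} e^{-x}
    return M * (1.0 - math.exp(-x)) ** (M - 1) * math.exp(-x)

for x in (1.0, 2.5, 5.0):
    print(x, round(spread_exact(x), 5), round(spread_saddlepoint(x), 5))
```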
Under fairly general conditions, the WLE can be written as

\hat\theta = g\Big( \sum_{i=1}^m \lambda_i T_i(X_{i1}, ..., X_{in_i}) \Big)

for some smooth function g. The cumulant generating function of \lambda_i T_i is R_i(\lambda_i z), where R_i(z) is the cumulant generating function of T_i. By Theorem 7.1, the approximate density function of S = \sum_{i=1}^m \lambda_i T_i is given by

f_S(x) = \Big( \frac{n_1}{2\pi B''(a_0)} \Big)^{1/2} \exp\big( n_1 (B(a_0) - a_0 x) \big), where B(z) = \sum_{i=1}^m R_i(\lambda_i n_1 z)/n_1

and the saddlepoint a_0 satisfies B'(a_0) = x. The saddlepoint approximation to the density of the WLE then follows by the change of variables y = g(x):

f_{WLE}(y) = \Big( \frac{n_1}{2\pi B''(a_0)} \Big)^{1/2} \frac{ \exp\big( n_1 (B(a_0) - a_0 g^{-1}(y)) \big) }{ |g'(g^{-1}(y))| },

where a_0 now satisfies B'(a_0) = g^{-1}(y).
7.4 Approximation for General WL Estimation

The saddlepoint approximation to estimating equations for the i.i.d. case was developed by Daniels (1983). We generalize the technique to the WLE derived from the weighted likelihood equation

\sum_{i=1}^m \lambda_i \sum_{j=1}^{n_i} \frac{\partial}{\partial\theta_1} \log f(X_{ij}; \theta_1) = 0.   (7.4)

For simplicity, let \psi(X_{ij}; \theta_1) = \frac{\partial}{\partial\theta_1} \log f(X_{ij}; \theta_1). The estimating equation for the WLE can then be written as

\sum_{i=1}^m \lambda_i \sum_{j=1}^{n_i} \psi(X_{ij}; \theta_1) = 0.   (7.5)

Assume that \psi(X_{ij}; \theta) is a monotone decreasing function of \theta and that \lambda_i \ge 0, i = 1, 2, ..., m. Write

S(a) = \sum_{i=1}^m \lambda_i \sum_{j=1}^{n_i} \psi(X_{ij}; a).

Theorem 7.3 Let \hat\theta_1 be the WLE derived from the weighted likelihood equation. Then

P_{\theta_1}(\hat\theta_1 > a) = P(S(a) > 0) \sim \int_0^\infty f_S(x, a) dx,   (7.6)

where

f_S(x, a) = \Big( \frac{n_1}{2\pi Q''(a_0)} \Big)^{1/2} \exp\big( n_1 (Q(a_0) - a_0 x) \big), Q(z) = \sum_{i=1}^m \frac{n_i}{n_1} K_i(n_1 \lambda_i z, a),   (7.7)

and the saddlepoint a_0 is the root of Q'(a_0) = x.
Proof: The moment generating function of S(a) is

M_S(t, a) = \exp(K_S(t, a)) = \prod_{i=1}^m \int \cdots \int \exp\Big( t \lambda_i \sum_{j=1}^{n_i} \psi(x_{ij}, a) \Big) \prod_{j=1}^{n_i} f(x_{ij}; \theta_i) dx_{i1} \cdots dx_{in_i} = \exp\Big( \sum_{i=1}^m n_i K_i(t \lambda_i, a) \Big),

where \exp(K_i(t, a)) = E_\theta \exp\big( t \psi(X_{ij}, a) \big), i = 1, 2, ..., m. The function M_S(t, a) is assumed to converge for real t in a non-vanishing interval containing the origin. It then follows that

f_S(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp\Big( \sum_{i=1}^m n_i K_i(i\tau \lambda_i, a) \Big) \exp(-i\tau x) d\tau = \frac{n_1}{2\pi i} \int_{\tau - i\infty}^{\tau + i\infty} \exp\Big( n_1 \Big( \sum_{i=1}^m \frac{n_i}{n_1} K_i(n_1 \lambda_i z, a) - xz \Big) \Big) dz.

The saddlepoint a_0 is the root of the equation

\sum_{i=1}^m n_i \lambda_i K_i'(n_1 \lambda_i z, a) = x,

where the derivative is taken with respect to the first argument. It then follows that the saddlepoint approximation to the density f_S(x) is given by (7.7). \square

We can then deploy the device used in Field and Hampel (1982) and Daniels (1983): since \psi is monotone decreasing in \theta,

P_{\theta_1}(\hat\theta_1 > a) = P_{\theta_1}(S(a) > 0).   (7.8)

We then have

P_{\theta_1}(\hat\theta_1 > a) \sim \int_0^\infty f_S(x, a) dx.
In general, the saddlepoint approximation is computationally intensive, since the normalizing constant must be calculated by numerical integration. We remark that the saddlepoint approximations to the WLE proposed in this chapter are for fixed weights; the saddlepoint approximation to the WLE with adaptive weights needs further study.
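To make the computations concrete, here is a small numerical sketch (not from the thesis) of the tail approximation $P(S(a) > 0) \approx \int_0^\infty f_S(x, a)\,dx$ for normal populations, where $\psi(x; a) = x - a$ gives the closed form $K_i(t, a) = t(\mu_i - a) + t^2/2$. The population means, sample sizes, and weights below are hypothetical. Because $S(a)$ is exactly Gaussian here, the saddlepoint density is exact up to numerical integration error, which provides a check.

```python
import math

def saddlepoint_tail(mus, ns, lams, a, grid=4000):
    """Approximate P(S(a) > 0) for S(a) = sum_i lam_i sum_j psi(X_ij; a)
    with psi(x; a) = x - a and X_ij ~ N(mu_i, 1), so that
    K_i(t, a) = t*(mu_i - a) + t**2/2, K_i'(t, a) = mu_i - a + t, K_i'' = 1.
    The saddlepoint density is integrated over (0, upper) by the trapezoid rule."""
    n1 = ns[0]
    mu_s = sum(n * l * (m - a) for m, n, l in zip(mus, ns, lams))   # E S(a)
    var_s = sum(n * l * l for n, l in zip(ns, lams))                # Var S(a)
    upper = mu_s + 8.0 * math.sqrt(var_s)
    if upper <= 0.0:
        return 0.0
    def f_s(x):
        # alpha0 solves sum_i n_i lam_i K_i'(n1 lam_i z, a) = x, in closed form here
        alpha0 = (x - mu_s) / (n1 * var_s)
        expo = sum(n * (n1 * l * alpha0 * (m - a) + (n1 * l * alpha0) ** 2 / 2.0)
                   for m, n, l in zip(mus, ns, lams)) - n1 * alpha0 * x
        return math.exp(expo) / math.sqrt(2.0 * math.pi * var_s)
    h = upper / grid
    ys = [f_s(i * h) for i in range(grid + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

# Check against the exact Gaussian tail P(S > 0) = Phi(mu_s / sd_s).
mus, ns, lams, a = [0.0, 0.3], [10, 10], [0.8, 0.2], 0.1
approx = saddlepoint_tail(mus, ns, lams, a)
mu_s = sum(n * l * (m - a) for m, n, l in zip(mus, ns, lams))
sd_s = math.sqrt(sum(n * l * l for n, l in zip(ns, lams)))
exact = 0.5 * math.erfc(-mu_s / (sd_s * math.sqrt(2.0)))
```

The agreement between `approx` and `exact` in this Gaussian case isolates the numerical-integration error mentioned above; for non-normal populations the saddlepoint step itself introduces additional approximation error.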
Chapter 8

Application to Disease Mapping

8.1 Introduction

In this chapter, we present the results of applying the maximum weighted likelihood method to disease mapping.
8.2 Weighted Likelihood Estimation
Figure 8.1: Daily hospital admissions for CSD #380 in the summer of 1983.
People with respiratory disease went to hospital to seek treatment, perhaps because of the high level of pollution in the region. For CSD #380, which has the largest number of hospital admissions among all the CSDs, there were no hospital admissions on 112 of the 123 days in the summer of 1983; however, there were 17 hospital admissions on day 51. The daily counts for this CSD are shown in Figure 8.1. The problems of data sparseness and high variation are quite obvious. Thus we will consider estimating the rate of weekly admissions instead of the daily counts. There are 17 weeks in total. For simplicity, the data from the last few days of each year are dropped from the analysis, since they do not constitute a whole week.
We assume that the total number of hospital admissions in a week for a particular CSD follows a Poisson distribution:

$$Y_{ij}^q \sim \text{Poisson}(\theta_{ij}^q), \qquad j = 1, 2, \ldots, 17;\ i = 1, 2, \ldots, 733;\ q = 1, 2, \ldots, 6.$$
Figure 8.2: Hospital admissions for CSD #380, 362, 366 and 367 in 1983 (one panel per CSD; x-axis: weeks 1-17 of 1983).
The raw estimate, $Y_{ij}^q$ itself, is highly unreliable: the sample size is only 1 in this case. Moreover, each CSD may contain only a small group of people whose lung conditions are susceptible to high levels of air pollution. Therefore we think it desirable to combine neighboring CSDs in order to produce a more reliable estimate. For any given CSD, the neighboring CSDs are defined to be those in close proximity to the CSD of interest, CSD #380 in our analysis. To study the rate of weekly hospital admissions in a particular CSD, we would expect that neighboring subdivisions contain relevant information which might help us to derive
a better estimate than the traditional sample average. The importance of combining disease and exposure data is discussed in Waller et al. (1997). The Euclidean distance between the target CSD and every other CSD in the dataset is calculated using their longitudes and latitudes. CSDs whose Euclidean distances from the target CSD are less than 0.2 are selected as relevant CSDs. For CSD #380, the neighboring CSDs are CSD #362, 366 and 367. The time series plots of weekly hospital admissions for these CSDs in 1983 are shown in Figure 8.2. The hospital admissions of these CSDs in a given week might well be related, since the major peaks in the time series plots occur at roughly the same time points. However, including data from other CSDs might introduce bias. The weight function defined in the WLE can control that potential bias.
Ideally, we would assume that $\theta_{ij}^q = \theta_i^q$ for $j = 1, 2, \ldots, 17$. But this assumption does not hold, due to seasonality. For example, week 8 always has the largest hospital admissions for CSD 380. Examining the data more closely, we find that the 8th week of each year has more than the normal number of hospital admissions. In 1983, there were 21 admissions in week 8, while the second largest weekly count was only 7, in week 15. In fact, week 8 is an unusual week throughout 1983 to 1988. The air pollution level might further explain this phenomenon via the method proposed in Zidek et al. (1998). One alternative is to perform the analysis on a window of a few weeks and repeat the analysis while moving the window one week forward; this is equivalent to assuming that the $\theta_{ij}^q$ take the same value over a period of a few weeks instead of the entire summer. In this chapter, we will simply exclude the observations from week 8 and proceed under the assumption that $\theta_{ij}^q = \theta_i^q$ for $j = 1, 2, \ldots, 7, 9, \ldots, 17$. The fact that the sample means and sample variances of the weekly hospital admissions over those 16 weeks for CSD 380 are quite close lends support to these assumptions.
Let $\bar{Y}_i^q$ denote the overall sample average for CSD $i$ in year $q$. For Poisson distributions, the MLE of $\theta_1^q$ is the sample average of the weekly admissions of CSD #380, and by Theorem 3.1 the WLE is a linear combination of the sample averages of the relevant CSDs. Thus, the weighted likelihood estimate of the average weekly hospital admissions for CSD #380 is

$$\mathrm{WLE}^q = \sum_{i=1}^{4} \lambda_i \bar{Y}_i^q, \qquad q = 1, 2, \ldots, 6.$$

For our analysis, the weights are selected by the cross-validation procedure proposed in Chapter 5, using the formula for cross-validated weights with equal sample sizes given there.
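As a small illustration, the linear WLE is just a weighted combination of the four CSDs' weekly averages. The counts below are made up, and the fixed weights are illustrative stand-ins for the cross-validated ones:

```python
# Sketch of the linear WLE for Poisson weekly means. The counts and the
# weights lambda_i are hypothetical; in the thesis the weights are chosen
# by the cross-validation procedure of Chapter 5.
counts = {                      # 16 usable weeks (week 8 excluded)
    380: [0, 1, 0, 2, 0, 0, 1, 3, 0, 1, 0, 7, 0, 0, 1, 0],
    362: [1, 0, 0, 1, 0, 2, 0, 2, 1, 0, 0, 5, 0, 1, 0, 0],
    366: [0, 0, 1, 0, 0, 1, 0, 4, 0, 0, 1, 6, 0, 0, 0, 1],
    367: [0, 1, 0, 0, 1, 0, 0, 3, 0, 1, 0, 4, 1, 0, 0, 0],
}
weights = {380: 0.55, 362: 0.20, 366: 0.10, 367: 0.15}   # sum to 1

ybar = {csd: sum(ys) / len(ys) for csd, ys in counts.items()}
mle = ybar[380]                                   # MLE: target CSD's own mean
wle = sum(weights[c] * ybar[c] for c in counts)   # WLE: weighted combination
```

Since the weights sum to one, the WLE always lies between the smallest and largest of the four sample averages.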
We assess the performance of the MLE and the WLE by comparing their MSEs, defined for $q = 1, 2, \ldots, 6$ as

$$\mathrm{MSE}_M = E\left( \bar{Y}_1^q - \theta_1^q \right)^2, \qquad \mathrm{MSE}_W = E\left( \mathrm{WLE}^q - \theta_1^q \right)^2.$$

In fact, the $\theta_1^q$ are unknown. We therefore estimate $\mathrm{MSE}_M$ and $\mathrm{MSE}_W$ by replacing $\theta_1^q$ with the MLE. Under the assumption of Poisson distributions, the estimated MSE for the MLE is given by

$$\widehat{\mathrm{MSE}}{}_M^q = \widehat{\mathrm{var}}(Y_{1j}^q)/16, \qquad q = 1, 2, \ldots, 6.$$
8.3 Results of the Analysis
The estimated MSE for the WLE is obtained in the same way, with $\theta_1^q$ replaced by the MLE in the variance and bias terms. The estimated MSEs for the MLE and the WLE are given in the following table. It can be seen from the table that the MSE for the WLE is much smaller than that of the MLE.
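A sketch of this comparison follows, using hypothetical counts and weights. The plug-in variance-plus-squared-bias form used for the WLE below is one plausible reading of the estimator described in the text, not necessarily the thesis's exact formula:

```python
import statistics

# Illustrative estimated-MSE comparison under the Poisson model with 16
# usable weeks. Counts and weights are made up; the WLE's estimated MSE is
# taken as plug-in variance plus squared bias (an assumption, see lead-in).
counts = {
    380: [0, 1, 0, 2, 0, 0, 1, 3, 0, 1, 0, 7, 0, 0, 1, 0],
    362: [1, 0, 0, 1, 0, 2, 0, 2, 1, 0, 0, 5, 0, 1, 0, 0],
}
lam = {380: 0.7, 362: 0.3}

ybar = {c: sum(v) / len(v) for c, v in counts.items()}
var_ = {c: statistics.variance(v) for c, v in counts.items()}  # sample variance

mse_mle = var_[380] / 16                      # est. variance of the sample mean
wle = sum(lam[c] * ybar[c] for c in counts)
mse_wle = sum(lam[c] ** 2 * var_[c] / 16 for c in counts) + (wle - ybar[380]) ** 2
```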
Combining information across these CSDs might also help with prediction, since a pattern exhibited in a neighboring location in a particular year might manifest itself at the location of interest the next year. To assess the performance of the WLE, we also use the WLE derived from one particular year to predict the overall weekly average of the next year. The overall prediction error is defined as the average of those prediction errors. To be more specific, the overall prediction errors
for the WLE and the traditional sample average are defined as follows:

$$\mathrm{PRED}_M = \frac{1}{5} \sum_{q=1}^{5} \left( \bar{Y}_1^q - \bar{Y}_1^{q+1} \right)^2, \qquad \mathrm{PRED}_W = \frac{1}{5} \sum_{q=1}^{5} \left( \mathrm{WLE}^q - \bar{Y}_1^{q+1} \right)^2.$$

The average prediction error for the MLE, $\mathrm{PRED}_M$, is 0.065, while $\mathrm{PRED}_W$, the average prediction error for the WLE, is 0.047, which is about 72% of that of the MLE.
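The one-year-ahead comparison can be sketched as follows; the yearly averages and WLE values are invented for illustration:

```python
# Sketch of the one-year-ahead prediction comparison. ybar[q] is the target
# CSD's weekly average in year q+1 and wle[q] the WLE for that year; all
# numbers here are hypothetical.
ybar = [1.00, 1.10, 0.85, 1.30, 0.95, 1.05]          # years q = 1..6
wle  = [1.05, 1.02, 0.95, 1.15, 1.00, 1.03]          # hypothetical WLEs

pred_m = sum((ybar[q] - ybar[q + 1]) ** 2 for q in range(5)) / 5
pred_w = sum((wle[q] - ybar[q + 1]) ** 2 for q in range(5)) / 5
```

With these made-up numbers the WLE-based predictor also comes out ahead, but that is a property of the example, not a general guarantee.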
8.4 Discussion
Bayes methods are popular choices in the area of disease mapping. Manton et al. (1989) discuss empirical Bayes procedures for stabilizing maps of cancer mortality rates, and hierarchical Bayes generalized linear models for disease mapping are proposed in Ghosh et al. (1999). But it is not obvious how one should specify the neighborhood that these approaches require. The numerical values of the weight functions can be used as a criterion for defining the neighborhood appropriately in a Bayesian analysis. We will use the following example to demonstrate how a neighborhood can be defined using the weight functions derived from the cross-validation procedure for the WLE.
From Table 8.3, we see that there is a strong linear association between CSD 380 and CSD 366. However, the weight assigned to CSD 366 is the smallest one. This suggests that CSDs with higher correlation contain less information for the prediction, since their pattern in a given year may be too similar to that of the target CSD to be helpful in predicting the next year. Thus CSD 366, which has the smallest weight, should not be included in the analysis. Therefore, the "neighborhood" of CSD 380 in the analysis should include only CSD 362 and CSD 367.

Table 8.2: Correlation matrix and the weight function for 1984.
In general, we might examine those CSDs which are in close proximity to the target CSD and calculate the weight for each selected CSD using the cross-validation procedure. CSDs with small weights should be dropped from the analysis, since the cross-validation procedure does not deem them helpful or relevant.
We remark that the weight function can also be helpful in selecting an appropriate distribution that takes the spatial structure into account. Ghosh et al. (1999) propose a very general hierarchical Bayes spatial generalized linear model that is considered broad enough to cover a large number of situations in which a spatial structure needs to be incorporated. In particular, they propose the following:

$$\theta_i = q_i + x_i' b + u_i + v_i, \qquad i = 1, 2, \ldots, m,$$

where the $q_i$ are known constants, the $x_i$ are covariates, and the $u_i$ and $v_i$ are mutually independent with $v_i \stackrel{\text{iid}}{\sim} N(0, \sigma^2)$, the $u_i$ having joint pdf

$$f(u_1, \ldots, u_m) \propto \exp\left( -\frac{1}{2} \sum_{i=1}^{m} \sum_{j \neq i} w_{ij} (u_i - u_j)^2 \right),$$

where $w_{ij} > 0$ for all $1 \le i \ne j \le m$. The above distribution is designed to take the spatial structure into account. In their paper, they propose to use $w_{ij} = 1$ if locations $i$ and $j$ are neighbors. They also mention the possibility of using the inverse
of the correlation matrix as the weight function. The weight function derived from the cross-validation procedure might be a better choice, since it takes into account the predictive relevance of each location as assessed from the data.
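The joint prior for the $u_i$ can be evaluated, up to its normalizing constant, directly from the pairwise double sum. The 3-location weight matrix below is a made-up example with $w_{ij} = 1$ for neighbours:

```python
# Unnormalized log-density of the spatial prior for u = (u_1, ..., u_m)
# under pairwise weights w_ij. The weight matrix is a hypothetical
# 3-location example, not data from the thesis.
def car_logdensity(u, w):
    # -(1/2) * sum_i sum_{j != i} w[i][j] * (u_i - u_j)^2, up to a constant
    m = len(u)
    return -0.5 * sum(w[i][j] * (u[i] - u[j]) ** 2
                      for i in range(m) for j in range(m) if j != i)

w = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]        # locations 1-2 and 2-3 are neighbours
```

Note that the density depends on the $u_i$ only through their pairwise differences, so it is invariant to shifting all $u_i$ by a common constant; this is why such priors are improper without a further constraint.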
The predictive distribution for the weekly total is taken to be Poisson with mean equal to the WLE. We can then derive the 95% predictive interval for the weekly average hospital admissions. This might be criticized as failing to take into account the uncertainty in the unknown parameter. Smith (1998), however, argues that the traditional plug-in method has its merits; in particular, it has a smaller MSE when the true value of the parameter is not large. Let $CI_W$ and $CI_M$ be the 95% predictive intervals of the weekly averages calculated from the WLE and the MLE, respectively. The results are shown in the following table. Further analysis is needed if one wants to compare the performances of the two methods.
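The plug-in predictive interval can be computed directly from the Poisson CDF. The rate below is a hypothetical WLE value, and the interval ignores the uncertainty in the estimate, as discussed above:

```python
import math

def poisson_quantile(rate, p):
    """Smallest integer k with P(Y <= k) >= p for Y ~ Poisson(rate)."""
    term = math.exp(-rate)      # P(Y = 0)
    cdf, k = term, 0
    while cdf < p:
        k += 1
        term *= rate / k        # P(Y = k) from the Poisson recurrence
        cdf += term
    return k

theta_hat = 1.2                              # hypothetical WLE for one CSD-year
lo = poisson_quantile(theta_hat, 0.025)
hi = poisson_quantile(theta_hat, 0.975)      # [lo, hi] has coverage >= 95%
```

Because the distribution is discrete, the equal-tailed interval is conservative: its actual coverage is at least, and typically above, 95%.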
[1] Akaike, H. (1985). Prediction and entropy. In: A Celebration of Statistics, 1-24, edited by Atkinson, A. C. and Fienberg, S. E., Springer-Verlag, New York.
[2] Bliss, G. A. (1930). The problem of Lagrange in the calculus of variations. American Journal of Mathematics 52 673-744.
[4] Burnett, R. and Krewski, D. (1994). Air pollution effects on hospital admission
rates: A random effects modeling approach. The Canadian Journal of Statistics
22 441-458.
[9] Dickey, J. M. (1971). The weighted likelihood ratio, linear hypotheses on normal
location parameters. The Annals of Mathematical Statistics 42 204-223.
[10] Dickey, J. M. and Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. The Annals of Mathematical Statistics 41 214-226.
[12] Eguchi, S. and Copas, J. (1998). A class of local likelihood methods and near-parametric asymptotics. Journal of the Royal Statistical Society, Series B 60 709-724.
[13] Ekeland, I. and Temam, R. (1976). Convex Analysis and Variational Problems.
American Elsevier Publishing Company Inc., New York.
[14] Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. 2. John Wiley & Sons, Inc., New York.
[15] Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapman and Hall,
New York.
[18] Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association 70 320-328.
[19] Genest, C. and Zidek, J. V. (1986). Combining probability distributions: a cri-
tique and an annotated bibliography. Statistical Science 1 114-148.
[20] Ghosh, M., Natarajan, K., Waller, L. A. and Kim, D. (1999). Hierarchical Bayes
GLMs for the analysis of spatial data: An application to disease mapping. Jour-
nal of Statistical Planning and Inference 75 305-318.
[22] Hardle, W. and Gasser, T. (1984) Robust Nonparametric Function Fitting. Jour-
nal of the Royal Statistical Society, Series B, 46 42-51.
[23] Hu, F. (1994). Relevance Weighted Smoothing and A New Bootstrap Method,
Ph.D. Dissertation, Department of Statistics, University of British Columbia,
Canada.
[25] Hu, F., Rosenberger, W. F. and Zidek, J. V. (2000). Relevance weighted likeli-
hood for dependent data. Metrika 51 223-243.
[26] Hu, F. and Zidek, J. V. (1997). The relevance weighted likelihood with applica-
tions. In: Empirical Bayes and Likelihood Inference 211-235, Edited by Ahmed,
S. E. and Reid, N., Springer, New York.
[28] Kullback, S. (1954). Certain inequality in information theory and the Cramer-
Rao inequality. The Annals of Mathematical Statistics 25 745-751.
[29] Kullback, S. (1959). Information Theory and Statistics. Lecture Notes-
Monograph Series Volume 21, Institute of Mathematical Statistics.
[31] Lehmann, E. L. (1983). Theory of Point Estimation. John Wiley & Sons Inc.,
New York.
[32] Markatou, M., Basu, A. and Lindsay, B. (1997). Weighted likelihood estimating
equations: The discrete case with applications to logistic regression. Journal of
[33] Markatou, M., Basu, A. and Lindsay, B. (1998). Weighted likelihood equations
with bootstrap root search. Journal of the American Statistical Association 93
740-750.
[35] National Research Council (1992). Combining Information: Statistical Issues and Opportunities for Research. National Academy Press, Washington, D.C.
[37] Rao, B. L. S. (1991). Asymptotic theory of weighted maximum likelihood estima-
tion for growth models. In: Statistical Inference in Stochastic Processes 183-208,
edited by Prabhu, N.U. and Basawa, I. V., Marcel Dekker, Inc., New York.
[38] Rao, C. R. (1965). Linear Statistical Inference and Its Applications. John Wiley
& Sons, Inc., New York.
[42] Small, C. G., Wang, J. and Yang, Z. (2000). Eliminating multiple root problems
in estimation. Statistical Science 15, 313-341.
[46] Tibshirani, R. and Hastie, T. (1987). Local likelihood estimation. Journal of the
[48] van Eeden, C. and Zidek, J. V. (1998). Combining sample information in esti-
mating ordered normal means. Technical Report # 182, Department of Statistics,
University of British Columbia.
[49] van Eeden, C. and Zidek, J.V. (2001). Estimating one of two normal means when
their difference is bounded. Statistics & Probability Letters 51 277-284.
[50] Wald, A. (1949). Note on the consistency of the maximum likelihood estimate.
The Annals of Mathematical Statistics 20 595-601.
[51] Waller, L. A., Louis, T. A., and Carlin, B. P. (1997). Bayes methods for combin-
ing disease and exposure data in assessing environmental justice. Environmental
and Ecological Statistics 4 267-281.
[53] Zidek, J. V., White, R. and Le, N. D. (1998). Using spatial data in assessing the
association between air pollution episodes and respiratory morbidity. Statistics
for the Environment 4: Pollution Assessment and Control, 111-136.