
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 69, NO. 12, DECEMBER 2023

Generalized Fisher-Darmois-Koopman-Pitman
Theorem and Rao-Blackwell Type Estimators
for Power-Law Distributions
Atin Gayen and M. Ashok Kumar, Member, IEEE

Abstract— This paper generalizes the notion of sufficiency for estimation problems beyond maximum likelihood. In particular, we consider estimation problems based on Jones et al. and Basu et al. likelihood functions that are popular among distance-based robust inference methods. We first characterize the probability distributions that always have a fixed number of sufficient statistics (independent of sample size) with respect to these likelihood functions. These distributions are power-law extensions of the usual exponential family and contain Student distributions as a special case. We then extend the notion of minimal sufficient statistics and compute it for these power-law families. Finally, we establish a Rao-Blackwell-type theorem for finding the best estimators for a power-law family. This helps us establish Cramér-Rao-type lower bounds for power-law families.

Index Terms— Cramér-Rao bounds, Fisher-Darmois-Koopman-Pitman theorem, Rao-Blackwell theorem, Student distribution, sufficient statistics.

Manuscript received 23 June 2022; revised 1 September 2023; accepted 4 September 2023. Date of publication 16 October 2023; date of current version 22 November 2023. The work of Atin Gayen was supported by the INSPIRE Fellowship of the Department of Science and Technology, Government of India under Grant IF 160347. An earlier version of this paper was presented at the 2021 IEEE International Symposium on Information Theory [DOI: 10.1109/ISIT45174.2021.9517914]. (Corresponding author: M. Ashok Kumar.)

Atin Gayen was with the Department of Mathematics, Indian Institute of Technology Palakkad, Palakkad, Kerala 678557, India. He is now with the Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand 247667, India (e-mail: atinfordst@gmail.com).

M. Ashok Kumar is with the Department of Mathematics, Indian Institute of Technology Palakkad, Palakkad, Kerala 678557, India (e-mail: ashokm@iitpkd.ac.in).

Communicated by Y. Chi, Associate Editor for Machine Learning and Statistics.

Digital Object Identifier 10.1109/TIT.2023.3318184

I. INTRODUCTION

Consider a parametric family of probability distributions, parameterized by θ, Π = {p_θ : θ ∈ Θ}, where Θ is an open subset of R^k. We shall assume (unless explicitly specified) that all distributions in the family have common support S ⊂ R. Let X_1^n := X_1, . . . , X_n be an independent and identically distributed (i.i.d.) random sample drawn according to some p_θ ∈ Π. An estimator of θ is a function of X_1^n. If the estimator is a function of some specific statistics of the sample, then one can use these statistics to estimate θ without knowing the entire sample. Fisher introduced this concept in his seminal paper titled "On the mathematical foundations of theoretical statistics" and termed it sufficient statistics for the parameter [19, Sec. 7]. He designated a statistic T to be sufficient for estimating θ if the conditional distribution of the sample, given T, is independent of θ. Formally, if T := T(X_1^n) is a statistic (T could very well be a vector-valued function), then T is sufficient (for estimating θ) if the conditional distribution

  p_{θ, X_1^n | T}(x_1^n | t) is independent of θ  (1)

for every t ∈ 𝒯 := {T(y_1^n) : y_1^n ∈ S^n}. This definition, however, is helpful only if one has a suitable guess for T. Observe also that the distribution of T is needed to find the conditional distribution in (1). Later, Koopman proposed the following equivalent definition, where the distribution of T is not needed [31]. He called a statistic T sufficient for θ if the following holds: if x_1^n and y_1^n are two sets of i.i.d. samples such that they are the same under T, that is, T(x_1^n) = T(y_1^n), then the difference of their respective log-likelihoods, [L(x_1^n; θ) − L(y_1^n; θ)], is independent of θ, where L(x_1^n; θ) := Σ_{i=1}^n log p_θ(x_i) denotes the log-likelihood function of the sample x_1^n. Observe that, to use this definition also, one needs to have a guess for T. Let us now suppose that a sufficient statistic T is known. A natural question that arises is: what kind of estimators can be found using T? Observe that both Fisher's and Koopman's definitions of sufficiency make use of the log-likelihood function L. However, it is still not clear if Fisher or Koopman had maximum likelihood (ML) estimation in mind while proposing the above definitions. The following result, known as the factorization theorem [26], [38], which is equivalent to both the above definitions, seems to answer this. A statistic T is sufficient for θ if and only if there exist two functions u : Θ × 𝒯 → R and v : S^n → R such that

  L(x_1^n; θ) = u(θ, T(x_1^n)) + v(x_1^n)  (2)

for all x_1^n and all θ ∈ Θ [10, Th. 6.2.6]. Observe that, using this criterion, one can identify the sufficient statistic T by simply factorizing the log-likelihood function. Hence no prior guess for T is needed. If θ_MLE denotes the maximum likelihood estimator (MLE) of θ, then from (2), we have

  θ_MLE = arg max_θ L(X_1^n; θ)
        = arg max_θ [u(θ, T(X_1^n)) + v(X_1^n)]
        = arg max_θ u(θ, T(X_1^n)).

This implies that the MLE of θ depends on the sample only through the statistic T (that is, knowledge of T is sufficient for the estimation of θ).
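Koopman's criterion and the factorization theorem above are easy to check numerically. A minimal sketch for the normal family N(µ, σ²), where T = (Σ x_i, Σ x_i²) is sufficient; the two samples below are illustrative choices (not from the paper) that differ as samples but share the same value of T:

```python
import math

# Two different samples with the same sufficient statistic
# T = (sum x_i, sum x_i^2) for the N(mu, sigma^2) family.
x = [1.0, 5.0, 6.0]   # sum = 12, sum of squares = 62
y = [2.0, 3.0, 7.0]   # sum = 12, sum of squares = 62

def log_likelihood(sample, mu, sigma):
    """L(x_1^n; theta) = sum_i log p_theta(x_i) for the normal density."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (xi - mu)**2 / (2 * sigma**2) for xi in sample)

# Koopman's criterion: L(x; theta) - L(y; theta) must not depend on theta.
# Here the normal log-likelihood is a function of n, sum x_i and sum x_i^2
# only, so the difference is identically zero across parameter values.
diffs = [log_likelihood(x, mu, s) - log_likelihood(y, mu, s)
         for (mu, s) in [(0.0, 1.0), (2.5, 0.7), (-1.0, 3.0)]]
print(max(abs(d) for d in diffs))  # numerically 0 for every theta
```

The same pair of samples is reused further below to illustrate the generalized (non-ML) notion of sufficiency.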
This seems to suggest that the classical notion of sufficiency is based on ML estimation.

Observe that the 'entire sample' is trivially a sufficient statistic with respect to any estimation. However, the notion of sufficient statistics is helpful only when it exists in a fixed number, independent of the sample size. The Fisher-Darmois-Koopman-Pitman theorem addresses this problem [1], [16], [20], [31], [41]. It states that, under certain regularity conditions, the existence of such a set of sufficient statistics is possible if and only if the underlying family is exponential. A set of probability distributions E is said to be an exponential family if there exist functions w = [w_1, . . . , w_s]⊤ and f = [f_1, . . . , f_s]⊤ with w_i : Θ → R, f_i : S → R, and h : R → R such that each p_θ ∈ E is given by

  p_θ(x) = { Z(θ) exp{h(x) + w(θ)⊤ f(x)}  if x ∈ S;  0 otherwise, }  (3)

where Z(θ) is the normalizing factor [7]. Bernoulli, binomial, Poisson, and normal distributions are some examples of exponential families. However, a statistic that is sufficient for ML estimation might not be sufficient for some other estimations. For example, consider the family of binomial distributions with parameters, say (m, θ), where m is a positive integer denoting the number of trials (assumed known) and θ ∈ (0, 1) is the success probability, which is unknown. It is well-known that the sum of the sample is always a sufficient statistic for θ (with respect to the usual likelihood function) (see, for example, [10]).

Let us now suppose that we want to estimate θ by maximizing the following likelihood function, called the Cauchy-Schwarz likelihood function (c.f. [9, Eq. (2.90)]):

  L_cs(x_1^n; θ) := log [ Σ_{x=0}^m p_n(x) p_θ(x) / ( Σ_{x=0}^m p_θ(x)² )^{1/2} ],  (4)

where p_θ(x) = (m choose x) θ^x (1 − θ)^{m−x} for x = 0, . . . , m and p_n is the empirical distribution of the observed sample. Maximizing the Cauchy-Schwarz likelihood function is the same as minimizing the well-known Cauchy-Schwarz divergence. (This divergence is popular in machine learning applications such as blind source separation, independent component analysis, manifold learning, image segmentation, and so on [27], [28], [30], [39], [40], [45].) It is easy to check that the solutions of the estimating equation, in this case, are the roots of a polynomial in θ whose coefficients involve some of p_n(x), x = 0, . . . , m. For example, when m = 2, it is a polynomial of degree five, and the following are, respectively, the coefficients in descending order of exponents of θ: 8p_n(1), 8p_n(0) − 20p_n(1) − 6p_n(2), −20p_n(0) + 16p_n(1) + 12p_n(2), 18p_n(0) − 4p_n(1) − 6p_n(2), −7p_n(0) − 2p_n(1) + p_n(2), and p_n(0) + p_n(1). This shows that at least two among p_n(0), p_n(1) and p_n(2) are needed to find the estimator. Hence the sum of the sample is no longer sufficient for estimating θ.

Observe that the Cauchy-Schwarz likelihood function is a special case (α = 2) of the more general class of likelihood functions, called the Jones et al. likelihood function [29]:

  L_J^{(α)}(x_1^n; θ) := (1/(α−1)) log [ (1/n) Σ_{i=1}^n p_θ(x_i)^{α−1} ] − (1/α) log ∫ p_θ(x)^α dx
                      = (1/(α−1)) log [ (1/n) Σ_{i=1}^n p_θ(x_i)^{α−1} ] − log ∥p_θ∥,  (5)

where α > 0, α ≠ 1, and ∥p_θ∥ := ( ∫ p_θ(x)^α dx )^{1/α}. Let us now consider the following family of Student distributions given by

  p_θ(x) = (2 / (√3 πσ)) [ 1 + (x − µ)² / (3σ²) ]^{−2}  for x ∈ R,  (6)

where θ = [µ, σ²]⊤, µ ∈ (−∞, ∞) and σ ∈ (0, ∞). Observe that (6) does not fall in the exponential family as defined in (3). Hence, by the Fisher-Darmois-Koopman-Pitman theorem, there does not exist any non-trivial sufficient statistic for µ and σ. Indeed, it can be shown that knowledge of the entire sample is needed for estimating µ and σ in this case (see [24]). However, Eguchi et al. considered the estimation based on the Jones et al. likelihood function (5) with α = 1/2 and found closed-form solutions for the estimators (see [17]). The estimators are functions of the statistics Σ_{i=1}^n X_i and Σ_{i=1}^n X_i² [17, Th. 3] (c.f. [25, Th. 11]). This motivates us to generalize the concept of sufficiency based on an estimation method. Let us consider the difference of the Jones et al. likelihood function L_J^{(1/2)} evaluated at samples x_1^n and y_1^n:

  L_J^{(1/2)}(x_1^n; θ) − L_J^{(1/2)}(y_1^n; θ) = 2 [ log Σ_{i=1}^n p_θ(y_i)^{−1/2} − log Σ_{i=1}^n p_θ(x_i)^{−1/2} ].

For the family of Student distributions (6), we have

  L_J^{(1/2)}(x_1^n; θ) − L_J^{(1/2)}(y_1^n; θ)
    = 2 log [ ( n(µ² + 3σ²) + Σ_{i=1}^n y_i² − 2µ Σ_{i=1}^n y_i ) / ( n(µ² + 3σ²) + Σ_{i=1}^n x_i² − 2µ Σ_{i=1}^n x_i ) ].  (7)

(7) is independent of θ if

  Σ_{i=1}^n x_i² = Σ_{i=1}^n y_i²  and  Σ_{i=1}^n x_i = Σ_{i=1}^n y_i.

Notice that the Jones et al. estimators for µ and σ depend on the sample only through the statistics Σ_{i=1}^n X_i and Σ_{i=1}^n X_i². This implies that, when the estimation is based on the Jones et al. likelihood function L_J^{(1/2)}, it is reasonable to call the pair (Σ X_i, Σ X_i²) sufficient for µ and σ.

The Jones et al. likelihood function is popular in the context of robust estimation [29]. Observe that (5) coincides with the usual log-likelihood function L(θ) as α → 1.
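The computation around (7) can be verified numerically. A small sketch (samples chosen for illustration, not from the paper): for the Student family (6), p_θ(x)^{−1/2} is affine in (x², x), so the data-dependent part of L_J^{(1/2)} in (5) depends on the sample only through (Σ x_i, Σ x_i²); the −log ∥p_θ∥ term cancels in the difference, so it need not be computed:

```python
import math

def p_student(x, mu, sigma):
    """Density (6): 2/(sqrt(3)*pi*sigma) * (1 + (x-mu)^2/(3 sigma^2))^(-2)."""
    return 2.0 / (math.sqrt(3.0) * math.pi * sigma) \
        * (1 + (x - mu)**2 / (3 * sigma**2))**(-2)

def jones_data_term(sample, mu, sigma, alpha=0.5):
    """Data part of (5): (1/(alpha-1)) log[(1/n) sum_i p_theta(x_i)^(alpha-1)]."""
    n = len(sample)
    return (1.0 / (alpha - 1)) * math.log(
        sum(p_student(xi, mu, sigma)**(alpha - 1) for xi in sample) / n)

x = [1.0, 5.0, 6.0]   # sum = 12, sum of squares = 62
y = [2.0, 3.0, 7.0]   # same (sum, sum of squares) as x
diffs = [jones_data_term(x, mu, s) - jones_data_term(y, mu, s)
         for (mu, s) in [(0.0, 1.0), (1.5, 0.5), (-2.0, 2.0)]]
print(max(abs(d) for d in diffs))  # numerically 0, as (7) predicts
```

Replacing y by a sample with different sums makes the difference vary with (µ, σ), in line with the discussion above.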
Hence the Jones et al. likelihood function is considered a robust alternative to ML estimation. Motivated by the works of Windham [48] and Field and Smith [18], Jones et al. proposed this method in search of an estimating equation, an alternative to that of the MLE, that uses a suitable power of the model density to down-weight the effect of outliers (possibly) present in the data [17], [21], [25], [29]. It is well-known that ML estimation is closely related to the minimization of the well-known Kullback-Leibler divergence or relative entropy [15, Lem. 3.1]. Similarly, Jones et al. estimation is related to the minimization of the so-called Jones et al. divergence [22], [25], [35], [46], [47] (also known as γ-divergence [11], [22], projective power divergence [17], logarithmic density power divergence [36], relative α-entropy [32], [33], Rényi pseudo-distance [5], [6]). While the Jones et al. divergence for α > 1 corresponds to robust inference, the same for α < 1 has connections to problems in information theory, such as mismatched guessing [46], source coding [33], encoding of tasks [8] and so on. Though Jones et al. estimation for α < 1 does not have a meaning in terms of robustness, it can be used to approximate the MLE of a Student distribution (c.f. [25, Rem. 7]).

Another likelihood function that is popular in the context of robust inference is the so-called Basu et al. likelihood function given by

  L_B^{(α)}(x_1^n; θ) = (1/n) Σ_{i=1}^n [ α · p_θ(x_i)^{α−1} − 1 ] / (α−1) − ∫ p_θ(x)^α dx,  (8)

where α > 0, α ≠ 1 [3, Eq. (2.2)]. Maximization of L_B^{(α)} results in a non-normalized version of the Jones et al. estimating equation (see [25, Sec. 1]). Basu et al. estimation is closely related to the minimization of the so-called Basu et al. divergence, which belongs to the Bregman class [13]. The Basu et al. divergence for α = 2 corresponds to the squared Euclidean distance between the empirical measure and the model density. A comparison of Basu et al. and Jones et al. estimations was studied in [29]. Unlike other likelihood functions such as the Hellinger likelihood function, Basu et al. and Jones et al. likelihood functions are special in robust inference, in the sense that smoothing of the empirical probability measure is not needed (see, for example, [25, Sec. 1] for details).

In this paper, we propose a generalized notion of sufficiency based on the likelihood function associated with an estimation problem. Specifically, we will focus on the Jones et al. and Basu et al. likelihood functions. We will characterize the family of probability distributions that has a fixed number of sufficient statistics independent of the sample size with respect to these likelihood functions. It turns out that these are the power-law families studied in [2], [14], [25], [33], and [37]. Student distributions fall in this family.

Sufficient statistics not only help in data reduction but also help in finding an estimator better than a given one. Indeed, if one has an unbiased estimator along with sufficient statistics, it is always possible to find another unbiased estimator (usually a function of the sufficient statistics) whose variance is not more than that of the given one. This result is known as the Rao-Blackwell theorem [4], [42]. That is, the conditional expectation of an unbiased estimator given the sufficient statistics results in a uniform improvement (in variance) of the estimator. This suggests considering only such conditional expectations while searching for best unbiased estimators.

In this paper, we find Rao-Blackwell type estimators for power-law families of the form (6) with respect to the Jones et al. and Basu et al. likelihood functions. We also find a lower bound for the variance of these estimators. This generalizes the usual Cramér-Rao lower bound (CRLB) and results in a sharper bound for the power-law families. We also apply these results to the Student distributions.

Our main contributions in this paper are the following.
(i) Propose a generalized notion of the principle of sufficiency that is appropriate when the underlying estimation is not necessarily ML estimation.
(ii) Characterize the families of probability distributions that have a fixed number of sufficient statistics independent of sample size with respect to the Basu et al. and Jones et al. likelihood functions.
(iii) Generalize the notion of minimal sufficiency with respect to a likelihood function and find generalized minimal sufficient statistics for the power-law family.
(iv) Extend the Rao-Blackwell theorem to both Basu et al. and Jones et al. estimations and find best estimators for the power-law family.
(v) Derive a Cramér-Rao lower bound for the variance of an unbiased estimator (of the expected value of the generalized sufficient statistic) of a power-law family in terms of the asymptotic variance of the Jones et al. estimator.

The rest of the paper is organized as follows. In Section II, we introduce the generalized notion of sufficiency. In Section III, we establish the Fisher-Darmois-Koopman-Pitman theorem for the Jones et al. and Basu et al. likelihood functions. In this section, we also find minimal sufficient statistics for the mean and variance parameters of Student distributions. In Section IV, we find generalized Rao-Blackwell type estimators for the power-law family. In Section V, we establish a generalized CRLB for Basu et al. and Jones et al. estimators of the power-law family. In this section, we also find the best estimators for the mean and variance parameters of Student distributions. We end the paper with a summary and concluding remarks in Section VI. Proofs of most of the results are presented in the Appendix.

II. A GENERALIZED NOTION OF THE PRINCIPLE OF SUFFICIENCY

In this section, we first propose a generalized notion of sufficiency analogous to Koopman's definition of sufficiency. We then derive a factorization theorem associated with this generalized notion. We also show that this is equivalent to a generalized Fisher's definition of sufficiency.

Definition 1: Let Π = {p_θ : θ ∈ Θ} be a family of probability distributions. Suppose the underlying problem is to estimate θ by maximizing some generalized likelihood function L_G. A set of statistics T is said to be a sufficient statistic for θ with respect to L_G if [L_G(x_1^n; θ) − L_G(y_1^n; θ)] is independent of θ whenever T(x_1^n) = T(y_1^n).
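Before developing the theory further, the robustness claimed for the Basu et al. likelihood (8) can be sketched numerically. In this hedged example (model, data, and grid search are illustrative choices, not from the paper), a N(µ, 1) model is fitted to data containing one gross outlier; since ∫ p_θ^α dx does not depend on µ for this model, maximizing (8) over µ only needs the data term, in which each observation enters through p_θ(x_i)^{α−1}, so low-density outliers are automatically down-weighted:

```python
import math

def basu_data_term(sample, mu, alpha=2.0):
    """Data part of (8) for a N(mu, 1) model; the mu-free integral
    term of (8) is omitted since it does not affect the argmax."""
    phi = lambda x: math.exp(-0.5 * (x - mu)**2) / math.sqrt(2 * math.pi)
    return sum((alpha * phi(xi)**(alpha - 1) - 1) / (alpha - 1)
               for xi in sample) / len(sample)

data = [-0.3, 0.1, 0.2, -0.1, 0.0, 10.0]     # one gross outlier
grid = [i / 100.0 for i in range(-200, 1100)]  # crude grid search over mu
mu_basu = max(grid, key=lambda m: basu_data_term(data, m))
mu_mle = sum(data) / len(data)               # MLE = sample mean
print(mu_basu, mu_mle)  # Basu estimate stays near 0; the mean is dragged
```

With α = 2 the data term is (up to constants) the average model density at the observations, so the outlier at 10 contributes almost nothing, while the MLE (the sample mean, 1.65 here) is pulled toward it.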
Observe that when L_G is the log-likelihood function, Definition 1 coincides with the usual Koopman's definition of sufficiency. We now derive a generalized factorization theorem.

Proposition 2 (Generalized Factorization Theorem): Let Π = {p_θ : θ ∈ Θ} be a family of probability distributions. Suppose the underlying problem is to estimate θ by maximizing some generalized likelihood function L_G. Then a set of statistics T is a sufficient statistic for θ with respect to L_G if and only if there exist functions u : Θ × 𝒯 → R and v : S^n → R such that

  L_G(x_1^n; θ) = u(θ, T(x_1^n)) + v(x_1^n),  (9)

for all sample points x_1^n and for all θ ∈ Θ.
Proof: See Sec-A of Appendix A.

Observe that when L_G = L, the above coincides with the usual factorization theorem [26], [38]. We now show that this generalized notion of sufficiency is also equivalent to a generalized version of Fisher's definition of sufficiency. Let X_1^n be i.i.d. random variables according to some p_θ ∈ Π. Let x_1^n := x_1, . . . , x_n be a sample. Let us define the deformed probability distribution associated with the likelihood function L_G by

  p*_θ(x_1^n) := exp[L_G(x_1^n; θ)] / ∫ exp[L_G(y_1^n; θ)] dy_1^n,

provided the denominator is finite. Observe that, if L_G(x_1^n; θ) = L(x_1^n; θ), then

  p*_θ(x_1^n) = exp[log p_θ(x_1^n)] / ∫ exp[log p_θ(y_1^n)] dy_1^n
             = p_θ(x_1^n) / ∫ p_θ(y_1^n) dy_1^n
             = p_θ(x_1^n).  (10)

Let T be a statistic of the sample x_1^n. For any value t of T, let q*_θ(t) := ∫ p*_θ(y_1^n) 1(T(y_1^n) = t) dy_1^n, where 1(·) denotes the indicator function. This can be re-written as

  q*_θ(t) = ∫_{A_t} p*_θ(y_1^n) dy_1^n,

where A_t = {x_1^n : T(x_1^n) = t}. The associated conditional distribution is given by

  p*_{θ, X_1^n | T}(x_1^n | t) := p*_θ(x_1^n) 1(T(x_1^n) = t) / q*_θ(t)
                              = { p*_θ(x_1^n) / q*_θ(t)  if x_1^n ∈ A_t;  0 otherwise. }  (11)

Then

  ∫ p*_{θ, X_1^n | T}(x_1^n | t) dx_1^n = ∫_{A_t} [ p*_θ(x_1^n) / q*_θ(t) ] dx_1^n
                                       = ∫_{A_t} p*_θ(x_1^n) dx_1^n / ∫_{A_t} p*_θ(y_1^n) dy_1^n
                                       = 1.

Observe that, when L_G = L, we have

  p*_{θ, X_1^n | T}(x_1^n | t) = p_{θ, X_1^n | T}(x_1^n | t),

the usual conditional distribution of x_1^n given T = t.

Proposition 3: T is sufficient for θ with respect to a likelihood function L_G if and only if p*_{θ, X_1^n | T}(x_1^n | t) is independent of θ for any value t of T.
Proof: See Sec-B of Appendix A.

Remark 1: Fisher's idea of sufficiency has one appealing interpretation: if we know the value of the sufficient statistic T without knowing the entire sample, then we could (using the conditional distribution of the sample given the sufficient statistic, which is independent of θ) generate another sample which has the same unconditional distribution as the original sample and the same value of the sufficient statistic T. Hence the information contained in both samples is essentially the same. In this sense, the statistic T is "sufficient". This interpretation continues to hold for the generalized notion of sufficiency as well. Indeed, in the case of a generalized sufficient statistic T with respect to L_G, one can generate a sample using the generalized conditional distribution as defined in (11). It contains the same information about θ as the original sample (with respect to L_G). Suppose that the given sample is X_1^n = x_1^n and that T(x_1^n) = t. If we do not know x_1^n, but know only that T(X_1^n) = t, then we can generate z_1^n using the generalized conditional distribution of X_1^n = z_1^n, given T(X_1^n) = t (since the conditional distribution is independent of θ by Proposition 3). Then, using (56), the value of L_G for the generated sample is given by L_G(z_1^n; θ) = u(t, θ) + v(z_1^n), where

  u(t, θ) = log ∫_{A_t} exp[L_G(y_1^n; θ)] dy_1^n  with A_t = {x_1^n : T(x_1^n) = t}, and

  v(z_1^n) = log p*_{θ, X_1^n | T}(z_1^n | t).

Hence

  arg max_θ L_G(z_1^n; θ) = arg max_θ u(t, θ) = arg max_θ L_G(x_1^n; θ),

since x_1^n ∈ A_t. Thus x_1^n and z_1^n have the same information about θ as far as estimation with respect to L_G is concerned. Also, from (11), for any y_1^n ∈ A_t, the unconditional distribution with respect to p*_θ is given by p*_θ(y_1^n) = p*_{θ, X_1^n | T}(y_1^n | t) q*_θ(t). This implies that the generated sample z_1^n has the same distribution as the initial sample x_1^n (with respect to p*_θ).
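Proposition 3 in the classical case L_G = L (where p* reduces to p by (10)) has a familiar discrete analogue that is easy to verify exhaustively. A minimal sketch (a discrete illustration, not from the paper, whose support assumptions differ from the continuous setting above): for Bernoulli(θ) samples and T = sum of the sample, the conditional distribution given T = t is uniform over the arrangements and, in particular, free of θ:

```python
from itertools import product
from math import comb

def cond_given_T(theta, n=4, t=2):
    """Conditional distribution of an i.i.d. Bernoulli(theta) sample
    of size n given T = sum of the sample = t."""
    probs = {xs: theta**sum(xs) * (1 - theta)**(n - sum(xs))
             for xs in product((0, 1), repeat=n)}
    mass_t = sum(p for xs, p in probs.items() if sum(xs) == t)
    return {xs: p / mass_t for xs, p in probs.items() if sum(xs) == t}

c1, c2 = cond_given_T(0.3), cond_given_T(0.8)
same = all(abs(c1[xs] - c2[xs]) < 1e-12 for xs in c1)       # theta-free
uniform = all(abs(p - 1 / comb(4, 2)) < 1e-12 for p in c1.values())
print(same, uniform)  # True True
```

This also illustrates the sample-regeneration argument of Remark 1: drawing a fresh arrangement uniformly from the t-level set requires no knowledge of θ.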
While sufficient statistics help in data compression, minimal sufficient statistics enable the maximum such compression. We now define the notion of minimal sufficiency in a general setup. Let us first recall the definition of minimal sufficiency in the classical sense.

Definition 4 [10, Def. 6.2.11]: A statistic T is said to be a minimal sufficient statistic if T is a function of any other sufficient statistic. In other words, T is minimal if the following holds: for any sufficient statistic T̃ and for any two i.i.d. samples x_1^n and y_1^n, T̃(x_1^n) = T̃(y_1^n) implies T(x_1^n) = T(y_1^n).

There is an easy criterion to find minimal sufficient statistics for MLE due to Lehmann and Scheffé [10, Th. 6.2.13]. Here we first generalize this criterion to a generalized likelihood function. Later we will use it to find minimal sufficient statistics for power-law families.

Proposition 5: Let Π = {p_θ : θ ∈ Θ} be a family of probability distributions. Suppose the underlying problem is to estimate θ by maximizing some likelihood function L_G. Assume that the following condition holds for a statistic T: for any two i.i.d. samples x_1^n and y_1^n, T(x_1^n) = T(y_1^n) if and only if [L_G(x_1^n; θ) − L_G(y_1^n; θ)] is independent of θ for θ ∈ Θ. Then T is a minimal sufficient statistic for θ with respect to L_G.
Proof: See Sec-C of Appendix A.

III. GENERALIZED FISHER-DARMOIS-KOOPMAN-PITMAN THEOREM

In this section, we first characterize the probability distributions that have a fixed number of sufficient statistics independent of sample size with respect to the Jones et al. and Basu et al. likelihood functions. We observe that each of them is a power-law extension of the usual exponential family. We then find sufficient statistics for the power-law families. We start by defining the family associated with the Jones et al. likelihood function.

Definition 6: A k-parameter α-power-law family characterized by h, w, f, Θ and S, where Θ is an open subset of R^k, w = [w_1, . . . , w_s]⊤, f = [f_1, . . . , f_s]⊤, with w_i : Θ → R differentiable for i = 1, . . . , s, f_i : R → R for i = 1, . . . , s, h : R → R, and S ⊆ R, is defined by M^{(α)} = {p_θ : θ ∈ Θ}, where

  p_θ(x) = { Z(θ) [ h(x) + w(θ)⊤ f(x) ]^{1/(α−1)}  for x ∈ S;  0 otherwise, }  (12)

with Z(θ)^{−1} = ∫_S [ h(x) + w(θ)⊤ f(x) ]^{1/(α−1)} dx.

Remark 2: (i) If h(x) = q(x)^{α−1} for some function q(x) > 0 and f(x) = (1 − α)f̃(x), then the M^{(α)}-family can be shown to coincide with an exponential family in the limit α → 1 (see [32, Def. 8]).
(ii) In [33], this family was derived as projections of the Jones et al. divergence on a set of probability distributions determined by linear constraints. More about the M^{(α)}-family can be found in [2], [23], [25], and [33].
(iii) Student distributions fall in the M^{(α)}-family [25, Ex. 3].

We now state the main result of this section.

Theorem 7: Let Π = {p_θ : θ ∈ Θ ⊂ R^k} be a family of probability distributions with common support S, an open subset of R. Let x_1^n be an i.i.d. sample from some p_θ ∈ Π. For some s (k ≤ s < n), let T_j(x_1^n) for j = 1, . . . , s be differentiable functions on S^n. If [T_1, . . . , T_s]⊤ is a sufficient statistic for θ with respect to the Jones et al. likelihood function L_J^{(α)}, then there exist functions h : S → R, f_i : S → R and w_i : Θ → R, i = 1, . . . , s, such that p_θ(x) can be expressed as (12) for x ∈ S.

Proof: The proof is divided into two parts. The first part deals with the case k = s and the second with k < s.
(a) Let k = s. We only give the proof when k = s = 1; other cases can be proved similarly. Let us first consider the Jones et al. likelihood function (5). Taking the derivative with respect to θ on both sides of (5), we get

  ∂/∂θ L_J^{(α)}(x_1^n; θ) = (1/(α−1)) · [ Σ_{i=1}^n ∂/∂θ (p_θ(x_i)^{α−1}) ] / [ Σ_{i=1}^n p_θ(x_i)^{α−1} ] − ∂/∂θ log ∥p_θ∥.  (13)

If T is a sufficient statistic for θ with respect to L_J^{(α)}, then, from Proposition 2, we have

  L_J^{(α)}(x_1^n; θ) = u(θ, T) + v(x_1^n)  (14)

for some functions u and v, where v is independent of θ. This implies that

  ∂/∂θ L_J^{(α)}(x_1^n; θ) = u_1(θ, T),  (15)

where u_1(θ, T) := ∂/∂θ [u(θ, T)]. Comparing (13) and (15), we have

  u_1(θ, T) = (1/(α−1)) · [ Σ_{i=1}^n ∂/∂θ (p_θ(x_i)^{α−1}) ] / [ Σ_{i=1}^n p_θ(x_i)^{α−1} ] − ∂/∂θ log ∥p_θ∥.  (16)

Now, for a particular value of θ ∈ Θ, say θ_0, (16) can be re-written as

  u_1(θ_0, T) = (1/(α−1)) · [ Σ_{i=1}^n [∂/∂θ (p_θ(x_i)^{α−1})]_{θ_0} ] / [ Σ_{i=1}^n p_{θ_0}(x_i)^{α−1} ] − [ ∂/∂θ log ∥p_θ∥ ]_{θ_0}
             = (1/(α−1)) · [ Σ_{i=1}^n g(x_i) ] / [ Σ_{i=1}^n h(x_i) ] − [ ∂/∂θ log ∥p_θ∥ ]_{θ_0},  (17)

where g(x_i) = [∂/∂θ (p_θ(x_i)^{α−1})]_{θ_0} and h(x_i) = p_{θ_0}(x_i)^{α−1}. Let us denote T̃_1 := Σ_{i=1}^n g(x_i) and T̃_2 := Σ_{i=1}^n h(x_i). Then, from (17), there exists a function Φ such that

  T = Φ(T̃_1, T̃_2).  (18)

Using this, (15) can be re-written as

  ∂/∂θ L_J^{(α)}(x_1^n; θ) = u_2(θ, T̃_1, T̃_2)  (19)

for some function u_2.
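As an aside before continuing the proof, Definition 6 can be made concrete numerically. A hedged sketch (parameter values chosen for illustration, not from the paper): the Student density (6) has the power-law form (12) with α = 1/2 (so 1/(α−1) = −2), h(x) ≡ 1 and f(x) = (x², x), with weights worked out as in Example 1 below with b_α = 1/3; proportionality is checked by verifying that p_θ(x) · [h(x) + w(θ)⊤f(x)]² is the same constant Z(θ) for every x:

```python
import math

mu, sigma = 1.5, 2.0
b = 1.0 / 3.0                         # b_alpha = 1/nu with nu = 3
w1 = b / (sigma**2 + b * mu**2)       # coefficient of x^2
w2 = -2.0 * mu * b / (sigma**2 + b * mu**2)  # coefficient of x

def p_student(x):
    """Density (6)."""
    return 2.0 / (math.sqrt(3.0) * math.pi * sigma) \
        * (1 + (x - mu)**2 / (3 * sigma**2))**(-2)

def bracket(x):
    """h(x) + w(theta)^T f(x) from (12), with h = 1, f = (x^2, x)."""
    return 1.0 + w1 * x**2 + w2 * x

# (12) with exponent -2 says p(x) * bracket(x)^2 = Z(theta), a constant in x.
ratios = [p_student(x) * bracket(x)**2 for x in (-3.0, -1.0, 0.0, 0.5, 2.0, 4.0)]
print(max(ratios) - min(ratios))      # numerically 0
```

The constant recovered here is exactly Z(θ) of (12); no numerical integration is needed because only proportionality is being tested.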
Observe also that (5) can be re-written as

  L_J^{(α)}(x_1^n; θ) = (1/(α−1)) log [ (1/n) Σ_{i=1}^n p_θ(x_i)^{α−1} ] − (1/(α−1)) log ∥p_θ∥^{α−1}
                     = (1/(α−1)) log [ (1/n) Σ_{i=1}^n p_θ(x_i)^{α−1} / ∥p_θ∥^{α−1} ].

This implies that

  exp[ (α−1) L_J^{(α)}(x_1^n; θ) ] = (1/n) Σ_{i=1}^n p_θ(x_i)^{α−1} / ∥p_θ∥^{α−1}.

Further, (14) implies that

  exp[ (α−1) L_J^{(α)}(x_1^n; θ) ] = exp[ (α−1) u(T, θ) ] · exp[ (α−1) v(x_1^n) ].

Thus we have

  (1/n) Σ_{i=1}^n p_θ(x_i)^{α−1} / ∥p_θ∥^{α−1} = exp[ (α−1) u(T, θ) ] · exp[ (α−1) v(x_1^n) ].  (20)

Fixing θ = θ_0, the above equation becomes

  T̃_2 = n ∥p_{θ_0}∥^{α−1} exp[ (α−1) u(T, θ_0) ] · exp[ (α−1) v(x_1^n) ].

Since T = Φ(T̃_1, T̃_2), this implies that v(x_1^n) must be a function of T̃_1 and T̃_2 only. Therefore, (20) can be re-written as

  Σ_{i=1}^n p_θ(x_i)^{α−1} = u_3(θ, T̃_1, T̃_2),  (21)

where u_3(θ, T̃_1, T̃_2) := n ∥p_θ∥^{α−1} exp[ (α−1) u(Φ(T̃_1, T̃_2), θ) ] · exp[ (α−1) v(x_1^n) ]. Now, taking the partial derivative with respect to x_j on both sides of (21), we have

  ∂/∂x_j (p_θ(x_j)^{α−1}) = [ ∂u_3/∂T̃_1 ] ∂T̃_1/∂x_j + [ ∂u_3/∂T̃_2 ] ∂T̃_2/∂x_j
                         = [ ∂u_3/∂T̃_1 ] ∂g(x_j)/∂x_j + [ ∂u_3/∂T̃_2 ] ∂h(x_j)/∂x_j.  (22)

Since the left hand side of (22) is a function of x_j and θ only, we shall denote it by A(x_j, θ). Thus, taking the partial derivative with respect to x_i, i ≠ j, on both sides of (22), we get

  ∂/∂x_i [ ∂u_3/∂T̃_1 ] · ∂g(x_j)/∂x_j + ∂/∂x_i [ ∂u_3/∂T̃_2 ] · ∂h(x_j)/∂x_j = 0.  (23)

From (23), it is clear that

  ∂/∂x_i [ ∂u_3/∂T̃_1 ] = 0  if and only if  ∂/∂x_i [ ∂u_3/∂T̃_2 ] = 0,  (24)

since neither ∂g(x_j)/∂x_j nor ∂h(x_j)/∂x_j can be zero. Equivalently,

  ∂/∂x_i [ ∂u_3/∂T̃_1 ] ≠ 0  if and only if  ∂/∂x_i [ ∂u_3/∂T̃_2 ] ≠ 0.

Suppose that ∂/∂x_i [ ∂u_3/∂T̃_1 ] ≠ 0. Then (23) can be re-written as

  ∂/∂x_i [ ∂u_3/∂T̃_1 ] / ∂/∂x_i [ ∂u_3/∂T̃_2 ] = − [ ∂h(x_j)/∂x_j ] / [ ∂g(x_j)/∂x_j ] =: ξ(x_j).  (25)

Since the right hand side of (25) is a function of x_j alone, both ∂u_3/∂T̃_1 and ∂u_3/∂T̃_2 must be of the following form:

  ∂u_3/∂T̃_1 = Ψ(x_1^n, θ) S_1(x_j) + V_1(x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n, θ)

and

  ∂u_3/∂T̃_2 = Ψ(x_1^n, θ) S_2(x_j) + V_2(x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n, θ),

for some real-valued functions Ψ, S_1, S_2, V_1, and V_2. Observe that (25) holds for any i ≠ j. This implies that V_1 and V_2 must be functions of x_j and θ only. Using these in (22), we get

  A(x_j, θ) = Ψ(x_1^n, θ) K_1(x_j) + K_2(x_j, θ),  (26)

where

  K_1(x_j) := S_1(x_j) ∂g(x_j)/∂x_j − S_2(x_j) ∂h(x_j)/∂x_j

and

  K_2(x_j, θ) := V_1(x_j, θ) ∂g(x_j)/∂x_j − V_2(x_j, θ) ∂h(x_j)/∂x_j.

In order for (26) to hold, Ψ must be a function of x_j and θ only. This implies that both

  ∂u_3/∂T̃_1  and  ∂u_3/∂T̃_2  (27)

must be functions of x_j and θ only.

On the other hand, let ∂/∂x_i [ ∂u_3/∂T̃_1 ] = 0 for some i ≠ j. Then from (24), we have ∂/∂x_i [ ∂u_3/∂T̃_2 ] = 0.
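The structure that this derivation extracts in (21) is easy to exhibit for a concrete family. A hedged numerical sketch (parameter values are illustrative, not from the paper): for the Student family (6) with α = 1/2, the quantity p_θ(x)^{α−1} is an affine function w_1(θ) x² + w_2(θ) x + w_3(θ) of the pair (x², x), so Σ_i p_θ(x_i)^{α−1} depends on the sample only through the two symmetric statistics Σ x_i² and Σ x_i; the three coefficients are recovered from three interpolation points and then tested elsewhere:

```python
import math

mu, sigma = 0.7, 1.3

def p_pow(x):
    """p_theta(x)^(alpha - 1) with alpha = 1/2, for the density (6)."""
    p = 2.0 / (math.sqrt(3.0) * math.pi * sigma) \
        * (1 + (x - mu)**2 / (3 * sigma**2))**(-2)
    return p**(-0.5)

# Recover the affine coefficients from x = 0, 1, -1 ...
w3 = p_pow(0.0)
w1 = 0.5 * (p_pow(1.0) + p_pow(-1.0)) - w3
w2 = 0.5 * (p_pow(1.0) - p_pow(-1.0))
# ... and verify the affine identity far from the interpolation points.
err = max(abs(p_pow(x) - (w1 * x * x + w2 * x + w3)) for x in (2.5, -3.7, 10.0))
print(err)  # numerically 0: p^(alpha-1) is affine in (x^2, x)
```

Since p_θ(x)^{−1/2} is proportional to 1 + (x − µ)²/(3σ²), a quadratic in x, the identity holds exactly; this is precisely the form p_θ(x_i)^{α−1} = w_1(θ) g(x_i) + w_2(θ) h(x_i) + w_3(θ) that the proof arrives at below.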
GAYEN AND KUMAR: GENERALIZED FISHER-DARMOIS-KOOPMAN-PITMAN THEOREM AND RAO-BLACKWELL TYPE ESTIMATORS 7571

∂/∂x_i [ ∂/∂T̃_2 u_3(θ, T̃_1, T̃_2) ] = 0. This implies that both ∂/∂T̃_1 u_3(θ, T̃_1, T̃_2) and ∂/∂T̃_2 u_3(θ, T̃_1, T̃_2) are independent of the x_i's with i ≠ j. Thus, in this case also, (27) holds.

Since u_3(θ, T̃_1, T̃_2) is a function of T̃_1, T̃_2 and θ only, both ∂/∂T̃_1 u_3(θ, T̃_1, T̃_2) and ∂/∂T̃_2 u_3(θ, T̃_1, T̃_2) are also functions of T̃_1, T̃_2 and θ only. But both T̃_1 and T̃_2 are symmetric functions of the x_i's, i = 1, . . . , n. This implies that both quantities in (27) must only be a function of θ. Let

∂/∂T̃_1 u_3(θ, T̃_1, T̃_2) =: w_1(θ) and ∂/∂T̃_2 u_3(θ, T̃_1, T̃_2) =: w_2(θ)

for some w_1 and w_2 which are functions of θ only. Then we have

u_3(θ, T̃_1, T̃_2) = w_1(θ)T̃_1 + w_2(θ)T̃_2 + w_3(θ).

Using (21), we get

p_θ(x_i)^{α−1} = w_1(θ)g(x_i) + w_2(θ)h(x_i) + w_3(θ)

for i = 1, . . . , n.

(b) Let k < s. The proof for this case is quite similar to that of case (a). For the sake of simplicity, we give a sketch of proof for k = 1 and s = 2. Let (T_1, T_2) be a pair of sufficient statistics for θ with respect to Jones et al. likelihood function L_J^{(α)}. Then, proceeding as in Case (a), one would get an equation similar to (18) with four symmetric functions of the x_i's (instead of two), say, T̃_1 = Σ_{i=1}^n g_1(x_i), T̃_2 = Σ_{i=1}^n g_2(x_i), T̃_3 = Σ_{i=1}^n h_1(x_i) and T̃_4 = Σ_{i=1}^n h_2(x_i). Hence (21) holds with T̃_1, T̃_2, T̃_3, and T̃_4 instead of only T̃_1 and T̃_2. Then a similar calculation yields that all ∂/∂T̃_i u_3(θ, T̃_1, T̃_2, T̃_3, T̃_4), i = 1, 2, 3, 4, are functions of θ only. This further implies

p_θ(x_i)^{α−1} = Σ_{j=1}^{2} w_j^{(1)}(θ)g_j(x_i) + Σ_{j=1}^{2} w_j^{(2)}(θ)h_j(x_i) + w^{(3)}(θ)

for i = 1, . . . , n. This completes the proof.

The converse of the above theorem is also true. Indeed, in the following, we show that an M(α)-family always has a fixed number of sufficient statistics with respect to the Jones et al. likelihood function L_J^{(α)}.

Theorem 8: Consider a parametric family of probability distributions Π = {p_θ : θ ∈ Θ ⊂ R^k} having a common support S. Let T = [T_1, . . . , T_s]^⊤ be s statistics defined on S^n, where s ≥ k. T is a sufficient statistic for θ with respect to Jones et al. likelihood function L_J^{(α)} if the following hold.
(a) Π is a k-parameter M(α)-family as in Definition 6,
(b) there exist real valued functions ψ_1, . . . , ψ_s defined on T such that, for any i.i.d. sample x_1^n,

ψ_i(T(x_1^n)) = f̄_i/h̄  (28)

for i = 1, . . . , s, where f̄_i = (1/n) Σ_{j=1}^n f_i(x_j) and h̄ := (1/n) Σ_{j=1}^n h(x_j).

Proof: See Sec-A of Appendix B.

Remark 3: It is easy to see from the above theorem that f̄/h̄ := (f̄_1/h̄, . . . , f̄_s/h̄) is a sufficient statistic for the k-parameter M(α)-family with respect to Jones et al. likelihood function.

Example 1: Consider the Student distributions with location parameter μ, scale parameter σ and degrees of freedom parameter ν ∈ (2, ∞):

p_{μ,σ}(x) = N_{σ,ν} [1 + (x − μ)²/(νσ²)]^{−(ν+1)/2},  (29)

for x ∈ R, where N_{σ,ν} := Γ((ν+1)/2) / (√(νπσ²) Γ(ν/2)) is the normalizing factor. Observe that (29) includes (6) for ν = 3.

Let ν be known, and μ and σ² be unknown. Letting α = (ν−1)/(ν+1) and θ = [μ, σ²]^⊤, (29) can be re-written as

p_θ(x) = N_{θ,α} [1 + μ²b_α/σ²]^{1/(α−1)} · [1 + b_α x²/(σ² + b_α μ²) − 2μb_α x/(σ² + b_α μ²)]^{1/(α−1)},  (30)

where b_α = 1/ν = (1−α)/(1+α) and N_{θ,α} = N_{σ,ν}. (30) implies that Student distributions form an M(α)-family with

h(x) ≡ 1, Z(θ) = N_{θ,α} [1 + μ²b_α/σ²]^{1/(α−1)},
w(θ) = [w_1(θ), w_2(θ)]^⊤, f(x) = [f_1(x), f_2(x)]^⊤,
w_1(θ) = b_α/(σ² + b_α μ²), f_1(x) = x²,
w_2(θ) = −2μb_α/(σ² + b_α μ²), and f_2(x) = x.

By Theorem 8, f̄_1 = (1/n) Σ_{i=1}^n x_i² and f̄_2 = (1/n) Σ_{i=1}^n x_i are sufficient statistics for μ and σ² with respect to Jones et al. likelihood function.

A. Generalized Fisher-Darmois-Koopman-Pitman Theorem for Basu et al. Likelihood Function

In [24], we characterized the probability distributions that have a fixed number of sufficient statistics with respect to the Basu et al. likelihood function (only for the case Θ ⊂ R though). We termed the associated family of probability distributions the B(α)-family. In the following, we first define the B(α)-family and then state the results concerning sufficient statistics for the same.

Definition 9: Let h, w, f, Θ and S be as in Definition 6. A k-parameter B(α)-family characterized by h, w, f, Θ and S is defined by B(α) = {p_θ : θ ∈ Θ}, where

p_θ(x) = [h(x) + F(θ) + w(θ)^⊤ f(x)]^{1/(α−1)} for x ∈ S, and p_θ(x) = 0 otherwise,  (31)

for some F : Θ → R, the normalizing factor.

Remark 4: (i) Student distributions fall in the B(α)-family [25, Ex. 1].
(ii) The B(α)-family is also a generalization of the usual exponential family [24], [25]. A more general form of this family can be found in [14] in connection to the projection theorem of Bregman divergences (c.f. [25]). This family was also studied in [2] and [37]. A comparison between B(α) and M(α) families can be found in [25, Rem. 2].

The following is the generalized Fisher-Darmois-Koopman-Pitman theorem with respect to Basu et al. likelihood function L_B^{(α)}.

Theorem 10: Let k ≤ s < n. If [T_1, . . . , T_s]^⊤ is a sufficient statistic for θ with respect to the Basu et al. likelihood function L_B^{(α)}, then for x ∈ S, there exist functions h : S → R, f_i : S → R and w_i : Θ → R, i = 1, . . . , s, such that p_θ(x) can be expressed as (31).

Proof: The proof for the case s = k can be found in [24, Th. 2]. On the other hand, let k < s. We give the sketch of the proof for k = 1 and s = 2 as in Theorem 7. The other cases can be proved similarly. Following the same steps as in Theorem 7, one gets (19) using L_B^{(α)}(x_1^n; θ) in place of L_J^{(α)}(x_1^n; θ). This, upon taking partial derivative with respect to x_j first, and then with respect to some x_i, i ≠ j, implies (23). Then, proceeding as in Theorem 7, one can prove the result.

The converse of the above theorem is also true.

Theorem 11: Consider a k-parameter B(α)-family as in Definition 9. Let T = [T_1, . . . , T_s]^⊤ be s statistics defined on S^n, where s ≥ k. Then T is a sufficient statistic for θ with respect to L_B^{(α)} if there exist real valued functions ψ_1, . . . , ψ_s defined on T such that ψ_i(T(x_1^n)) = f̄_i for any i.i.d. sample x_1^n and i = 1, . . . , s. In particular, f̄ := [f̄_1, . . . , f̄_s]^⊤ is a sufficient statistic for θ with respect to L_B^{(α)}.

Proof: See [24, Th. 4].

Example 2: Consider the Student distributions as in Example 1. In [25], we showed that these also form a 2-parameter B(α)-family. Using Theorem 11, the same statistics as in Example 1 form sufficient statistics for their parameters with respect to Basu et al. likelihood function.

B. Minimal Sufficiency With Respect to Basu et al. and Jones et al. Likelihood Functions

In the following, we show that f̄/h̄ and f̄ are also minimal, respectively, for regular M(α) and B(α) families. An M(α)-family as in Definition 6 is said to be regular if the following are satisfied (c.f. [25, Def. 3]).
(i) The support S does not depend on the parameter θ.
(ii) The number of unknown parameters is the same as the number of w_i's, that is, s = k.
(iii) 1, w_1, . . . , w_k are linearly independent.
(iv) f_1, . . . , f_k are linearly independent.
Similarly, a B(α)-family as in Definition 9 is said to be regular if 1, f_1, . . . , f_k are linearly independent along with conditions (i)-(iii) above (c.f. [25, Def. 1]).

Theorem 12: The statistics f̄/h̄ = [f̄_1/h̄, . . . , f̄_k/h̄]^⊤ and f̄ = [f̄_1, . . . , f̄_k]^⊤ are minimal sufficient statistics, respectively, for a k-parameter regular M(α)-family and a k-parameter regular B(α)-family with respect to Jones et al. and Basu et al. likelihood functions.

Proof: See Sec-B of Appendix B.

Remark 5: In view of Theorem 12, the sufficient statistics derived in Example 1 for Student distributions are minimal too (with respect to both Basu et al. and Jones et al. likelihood functions).

IV. RAO-BLACKWELL THEOREM FOR JONES ET AL. ESTIMATION

Let X_1^n = x_1^n be an i.i.d. sample drawn according to some p_θ ∈ Π. Here and afterward, we assume θ ∈ Θ ⊂ R unless otherwise stated. Let θ̂ := θ̂(x_1^n) be an unbiased estimator of θ. Let T := T(x_1^n) be a sufficient statistic for θ with respect to the usual log-likelihood function. Let us consider the conditional expectation E_θ[θ̂|T]. It is easy to show from Fisher's definition of sufficiency that E_θ[θ̂|T] is independent of θ [4]. Hence E_θ[θ̂|T] is a function of T alone. Let ϕ(T) := E_θ[θ̂|T]. Then, by the Rao-Blackwell theorem, ϕ(T) is a uniformly better unbiased estimator of θ than θ̂ (see [4], [42]). In other words,
1. E_θ[ϕ(T)] = E_θ[θ̂] = θ, and
2. Var_θ[ϕ(T)] ≤ Var_θ[θ̂]
for all θ ∈ Θ, and equality holds if and only if θ̂ is a function of the sufficient statistic T. This implies that the 'best' unbiased estimator of θ (the one with the least variance) is a function of the sufficient statistic T only, that is, one from the family {ψ(T) : E_θ[ψ(T)] = θ}.

Let us consider an exponential family E as in (3). f̄ = (1/n) Σ_{i=1}^n f(x_i) is a sufficient statistic for θ with respect to the usual log-likelihood function [34] and E_θ[f̄] = E_θ[f(X)] =: τ(θ), called the mean value parameter. Let θ̂ be any unbiased estimator of τ(θ). Then, by the Rao-Blackwell theorem, ϕ(f̄) := E_θ[θ̂|f̄] is also an unbiased estimator of τ(θ) and has variance not more than that of θ̂. That is,

E_θ[ϕ(f̄)] = τ(θ) and Var_θ[ϕ(f̄)] ≤ Var_θ[θ̂].  (32)

Now consider ψ(f̄) = ϕ(f̄) − f̄. Then

E_θ[ψ(f̄)] = E_θ[ϕ(f̄)] − E_θ[f̄] = τ(θ) − τ(θ) = 0,

and

Var_θ[ϕ(f̄)] = Var_θ[ψ(f̄) + f̄] = Var_θ[ψ(f̄)] + Var_θ[f̄] + 2 Cov_θ[ψ(f̄), f̄].  (33)

It can be shown, under some mild regularity conditions, that Cov_θ[ψ(f̄), f̄] = 0. See Sec-A of Appendix C for the detailed derivation. Hence (33) reduces to

Var_θ[ϕ(f̄)] = Var_θ[ψ(f̄)] + Var_θ[f̄] ≥ Var_θ[f̄].

Thus from (32), for any θ̂ with E_θ[θ̂] = E_θ[f̄], we have

Var_θ[θ̂] ≥ Var_θ[ϕ(f̄)] ≥ Var_θ[f̄].

This shows that the variance of f̄ is not more than that of any unbiased estimator of τ(θ). Thus we have the following (c.f. [10], [44]).

Proposition 13: Consider a regular exponential family as in (3). Let E_θ[f(X)] = τ(θ). Then there does not exist any
unbiased estimator of τ(θ) that has lesser variance than that of f̄. That is, f̄ is the best estimator of its expected value.

Further, it can be easily shown that (c.f. [43], [44])

Var_θ[f̄] = (1/n) Var[f(X)] = [τ′(θ)]² / (n I(θ)),

where I(θ) is the Fisher information, that is, I(θ) = E_θ[(∂/∂θ log p_θ(X))²]. This and Proposition 13 imply that

Var_θ[θ̂] ≥ [τ′(θ)]² / (n I(θ)).  (34)

Equality holds in (34) if and only if θ̂ = f̄ almost surely. Notice that (34) is the well-known Cramér-Rao inequality [12], [42].

Observe that, in the Rao-Blackwell theorem, instead of a sufficient statistic T, if one conditions on some other statistic, then also the resulting estimator has variance not more than that of the given estimator. However, conditioning on any other statistic does not guarantee that the conditional expectation is independent of θ. Also, by the Fisher-Darmois-Koopman-Pitman theorem, a sufficient statistic with respect to the usual log-likelihood function exists if and only if the underlying model is exponential [16], [31], [41]. Hence the Rao-Blackwell theorem is not applicable outside the exponential family. However, in Section III, we showed that there are models that have a fixed number of sufficient statistics with respect to likelihood functions other than log-likelihood. In particular, we showed that the M(α)-family is one such model which has a fixed number of sufficient statistics with respect to the Jones et al. likelihood function (see Theorem 7 and Theorem 8). In the following, we generalize the Rao-Blackwell theorem for Jones et al. likelihood function. Later we will apply this to the M(α)-family and find a Rao-Blackwell type "best" estimator. This also helps us find a lower bound for the variance of the estimators of an M(α)-family as an alternative to the usual CRLB.

The Jones et al. likelihood function L_J^{(α)} as in (5) can be re-written as

L_J^{(α)}(x_1^n; θ) = log{ [ (1/n) Σ_{i=1}^n p_θ(x_i)^{α−1} ]^{1/(α−1)} / ∥p_θ∥ }.

Then the deformed probability distribution associated with L_J^{(α)} is given by

p*_θ(x_1^n) := exp[L_J^{(α)}(x_1^n; θ)] / ∫ exp[L_J^{(α)}(y_1^n; θ)] dy_1^n = [ (1/n) Σ_{i=1}^n p_θ(x_i)^{α−1} ]^{1/(α−1)} / ∫ [ (1/n) Σ_{i=1}^n p_θ(y_i)^{α−1} ]^{1/(α−1)} dy_1^n.  (35)

Observe that when α → 1, L_J^{(α)}(x_1^n; θ) coincides with the log-likelihood function L(x_1^n; θ). Hence, in this case, p*_θ(x_1^n) = p_θ(x_1^n).

Let T be a statistic of the sample x_1^n. Then, from (11), we have

p*_{θ,X_1^n|T}(x_1^n|t) = p*_θ(x_1^n)/q*_θ(t) if x_1^n ∈ A_t, and 0 otherwise.

Observe that, as α → 1, the above coincides with the usual conditional distribution of x_1^n given T = t.

Let θ̂ := θ̂(x_1^n) be an unbiased estimator of τ*(θ) := E*_θ[θ̂], where E*_θ[·] denotes the expectation with respect to p*_θ. Let T := T(x_1^n) be a sufficient statistic for θ with respect to Jones et al. likelihood function L_J^{(α)}. Let us consider the conditional expectation

E*_θ[θ̂|T = T(x_1^n)] = ∫ θ̂(y_1^n) p*_θ(y_1^n|T(x_1^n)) dy_1^n.  (36)

In the following, we first establish some properties of (36).

Proposition 14: The conditional expectation E*_θ[θ̂|T] is independent of θ.

Proof: See Sec-B of Appendix C.

Let

E*_θ[θ̂|T = T(x_1^n)] =: ϕ*(T(x_1^n)).

Let us consider the expectation of ϕ*(T) with respect to p*_θ. That is,

E*_θ[ϕ*(T(X_1^n))] = ∫ p*_θ(x_1^n) ϕ*(T(x_1^n)) dx_1^n.

Then we have the following.

Proposition 15: ϕ*(T) is unbiased for τ*(θ), that is, E*_θ[ϕ*(T(X_1^n))] = E*_θ[θ̂(X_1^n)] = τ*(θ).

Proof: See Sec-C of Appendix C.

Thus, if we have an unbiased estimator of τ*(θ) and a sufficient statistic with respect to Jones et al. likelihood function, we can always get another unbiased estimator for τ*(θ) that is based only on the sufficient statistic. Let us now consider the following two classes of estimators based on the sufficient statistic T:

A = {E*_θ[θ̂|T] : θ̂ is an unbiased estimator of τ*(θ), that is, E*_θ[θ̂] = τ*(θ)},
B = {ψ(T) : E*_θ[ψ(T)] = τ*(θ)}.  (37)

Proposition 16: The above two classes of estimators are the same, that is, A = B.

Proof: See Sec-D of Appendix C.

The above proposition suggests that any unbiased estimator of τ*(θ) that is a function of the sufficient statistic T is of the form E*_θ[θ̂|T], where θ̂ is an unbiased estimator of τ*(θ). In the following, we compare the variance of an estimator of the form (37) and any other unbiased estimator of τ*(θ).

Lemma 17: Let θ̂ be any unbiased estimator of τ*(θ) and ϕ*(T) = E*_θ[θ̂|T]. Let ψ*(T) ∈ B. Then

Var*_θ[θ̂] = E*_θ[(θ̂ − ψ*(T))²] + Var*_θ[ψ*(T)] + 2 E*_θ[ψ*(T){ϕ*(T) − ψ*(T)}].  (38)

Proof: See Sec-E of Appendix C.
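On a finite sample space, the deformed distribution (35) can be tabulated directly. The following sketch is illustrative and not from the paper: the Bernoulli model with α = 2 anticipates the example of Section IV-A, and the values of θ and n are arbitrary. It checks that (35) defines a probability distribution, that it depends on the sample only through the statistic Σ_i x_i, and that the induced conditional law given this statistic does not depend on θ, in line with Proposition 14.

```python
from itertools import product

def p(x, theta):
    # Bernoulli pmf; this is the M(alpha)-family of Sec. IV-A with alpha = 2
    return theta if x == 1 else 1.0 - theta

def p_star(theta, n, alpha=2.0):
    # Deformed distribution (35) on the finite sample space {0,1}^n:
    # weight(x_1^n) = [ (1/n) sum_i p_theta(x_i)^(alpha-1) ]^(1/(alpha-1)),
    # normalized over all 2^n samples.
    w = {xs: (sum(p(x, theta) ** (alpha - 1) for x in xs) / n) ** (1 / (alpha - 1))
         for xs in product([0, 1], repeat=n)}
    z = sum(w.values())
    return {xs: v / z for xs, v in w.items()}

theta, n = 0.3, 3
ps = p_star(theta, n)
assert abs(sum(ps.values()) - 1.0) < 1e-12      # (35) is a distribution

# p* depends on the sample only through T = sum(x_i)
for xs in ps:
    for ys in ps:
        if sum(xs) == sum(ys):
            assert abs(ps[xs] - ps[ys]) < 1e-12

# Hence the conditional law given T is uniform on A_t and free of theta
ps2 = p_star(0.8, n)
for t in range(n + 1):
    At = [xs for xs in ps if sum(xs) == t]
    c1 = [ps[xs] / sum(ps[ys] for ys in At) for xs in At]
    c2 = [ps2[xs] / sum(ps2[ys] for ys in At) for xs in At]
    assert all(abs(a - b) < 1e-12 for a, b in zip(c1, c2))
```

The same enumeration strategy works for any finite-support model; for continuous families the normalizing integral in (35) must be handled analytically, as done for the M(α)-family in Section V.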

The following (also the main result of this section) is a generalization of the Rao-Blackwell theorem for Jones et al. likelihood function.

Theorem 18: Let Π = {p_θ : θ ∈ Θ ⊂ R} be a family of probability distributions with common support S ⊂ R. Consider the deformed family of probability distributions Π* = {p*_θ} of Π, where p*_θ is as in (35). Let θ̂ be an unbiased estimator for its expected value with respect to p*_θ. Let τ*(θ) := E*_θ[θ̂]. Suppose T is a sufficient statistic for θ with respect to the Jones et al. likelihood function. Let ϕ*(T) := E*_θ[θ̂|T]. Then ϕ*(T) is a uniformly better estimator of τ*(θ) than θ̂, that is,
(a) E*_θ[ϕ*(T)] = τ*(θ),
(b) Var*_θ[ϕ*(T(X_1^n))] ≤ Var*_θ[θ̂(X_1^n)], and equality holds if and only if θ̂ is a function of the sufficient statistic T.

Proof: See Sec-F of Appendix C.

Theorem 18 implies that ϕ*(T) is uniformly a better estimator of τ*(θ) than θ̂. This means that the best unbiased estimator of τ*(θ) is a function of the sufficient statistic T. The following result asserts that the estimator with minimum variance (best estimator) is unique.

Proposition 19: The best unbiased estimator for τ*(θ) with respect to p*_θ is unique.

Proof: See Sec-G of Appendix C.

In the following, we apply the above results to the family of Bernoulli distributions.

A. Example

Observe that the family of Bernoulli distributions can be written as

p_θ(x) = (1 − θ)[1 + x · (2θ − 1)/(1 − θ)] for x = 0, 1,

where θ ∈ (0, 1) is the unknown parameter. Thus Bernoulli distributions form an M(α)-family with α = 2, h(x) ≡ 1, Z(θ) = 1 − θ, w(θ) = (2θ − 1)/(1 − θ) and f(x) = x. By Theorem 8, Σ_{i=1}^n x_i is a sufficient statistic for θ with respect to the Jones et al. likelihood function.

The Jones et al. likelihood function for Bernoulli distributions is

L_J^{(2)}(x_1^n, θ) = log[ (1/n) Σ_{i=1}^n p_θ(x_i) / (E_θ[p_θ(X)])^{1/2} ] = log[ (1 − θ)^{1/2} {1 + x̄ · (2θ − 1)/(1 − θ)} / (1 + θ(2θ − 1)/(1 − θ))^{1/2} ].

The associated deformed probability distribution is given by

p*_θ(x_1^n) = exp[L_J^{(2)}(x_1^n, θ)] / Σ_{y_1^n} exp[L_J^{(2)}(y_1^n, θ)] = [1 + x̄ · (2θ − 1)/(1 − θ)] / Σ_{y_1^n} [1 + ȳ · (2θ − 1)/(1 − θ)] = N(θ)[1 + x̄ · (2θ − 1)/(1 − θ)],

where

N(θ)^{−1} := Σ_{y_1^n} [1 + ȳ · (2θ − 1)/(1 − θ)] = 2^n + ((2θ − 1)/(n(1 − θ))) Σ_{i=1}^n i (n choose i) = 2^n + ((2θ − 1)/(n(1 − θ))) · n 2^{n−1} = 2^{n−1}/(1 − θ).

Also

p*_θ(X_1^n = x_1^n | T = Σ_i x_i) = [1 + x̄ · (2θ − 1)/(1 − θ)] / Σ_{y_1^n ∈ A_{Σx_i}} [1 + ȳ · (2θ − 1)/(1 − θ)] = 1/|A_{Σx_i}| = 1/(n choose Σx_i).

Let θ̂ = x_1. Then

τ*(θ) = E*_θ[θ̂] = Σ_{x_1^n} x_1 · p*_θ(x_1^n) = N(θ)[ 2^{n−1} + ((2θ − 1)/(n(1 − θ))) Σ_{i=0}^{n−1} (1 + i)(n−1 choose i) ] = N(θ) · 2^{n−1} [ 1 + ((2θ − 1)/(n(1 − θ))) · (n + 1)/2 ] = θ/n + (n − 1)/(2n).

The variance of θ̂ with respect to p*_θ is given by

Var*_θ[θ̂] = E*_θ[X_1²] − (E*_θ[X_1])² = τ*(θ) − τ*(θ)².

Let us consider the generalized Rao-Blackwell estimator of τ*(θ). Recall that Σ_{i=1}^n x_i is a sufficient statistic for θ with respect to Jones et al. likelihood function.

ϕ*(T(x_1^n)) = ϕ*(Σ_{i=1}^n x_i) = Σ_{y_1^n} θ̂(y_1^n) p*_θ(y_1^n | T = Σ_i x_i) = Σ_{y_1^n ∈ A_{Σx_i}} y_1 · p*_θ(y_1^n | T = Σ_i x_i) = (n choose Σx_i)^{−1} Σ_{y_1^n ∈ A_{Σx_i}} y_1 = (n choose Σx_i)^{−1} Σ_{y_2^n ∈ A_{(Σx_i − 1)}} 1 = (n−1 choose Σx_i − 1) / (n choose Σx_i) = (1/n) Σ_{i=1}^n x_i.

Thus ϕ*(T(x_1^n)) = (1/n) Σ_{i=1}^n x_i is a uniformly better estimator for τ*(θ) than x_1. It can also be shown that

Var*_θ[ϕ*(T)] = [∂/∂θ τ*(θ)]² / I*_{α,n}(θ),

where I*_{α,n}(θ) is the inverse of the asymptotic variance of Jones et al. estimator for p*_θ(X_1^n) [29]. This shows that the Rao-Blackwell estimator derived from a sufficient statistic attains the asymptotic variance of Jones et al. estimator. In the following section, we elucidate this for a general M(α)-family.

V. EFFICIENCY OF RAO-BLACKWELL ESTIMATORS OF POWER-LAW FAMILY

In this section, we shall show the efficiency of Rao-Blackwell estimators for a power-law family. For simplicity, we shall consider a regular M(α)-family with h ≡ 1. That is, for x ∈ S,

p_θ(x) = Z(θ)[1 + w(θ)f(x)]^{1/(α−1)}.  (39)

Then

[ (1/n) Σ_{i=1}^n p_θ(x_i)^{α−1} ]^{1/(α−1)} = [ Z(θ)^{α−1} (1/n) Σ_{i=1}^n (1 + w(θ)f(x_i)) ]^{1/(α−1)} = Z(θ)[1 + w(θ)f̄]^{1/(α−1)}.

Thus, using (35), we get

p*_θ(x_1^n) = [1 + w(θ)f̄]^{1/(α−1)} / ∫ [1 + w(θ)f̄]^{1/(α−1)} dx_1^n = N(θ)[1 + w(θ)f̄]^{1/(α−1)},  (40)

where f̄ := (1/n) Σ_{i=1}^n f(x_i) and N(θ)^{−1} := ∫ [1 + w(θ)f̄]^{1/(α−1)} dx_1^n is the normalizing constant. Comparing this with (12), we see that (40) also forms an M(α)-family with h ≡ 1 and f(x_1^n) = f̄. Recall that f̄ is a sufficient statistic for M(α) with respect to Jones et al. likelihood function (Theorem 8). Let τ*(θ) := E*_θ[f̄]. In the following, we show that f̄ is the best estimator of τ*(θ). That is, the variance of any unbiased estimator of τ*(θ) is bounded below by the variance of f̄ with respect to p*_θ.

Let θ̂ be an unbiased estimator of τ*(θ) with respect to p*_θ. Then by Theorem 18, ϕ*(f̄) = E*_θ[θ̂|f̄] is also an unbiased estimator of τ*(θ) and

Var*_θ[θ̂] ≥ Var*_θ[ϕ*(f̄)].  (41)

Let us consider ψ*(f̄) := ϕ*(f̄) − f̄. Then E*_θ[ψ*(f̄)] = τ*(θ) − τ*(θ) = 0. Also

Var*_θ[ϕ*(f̄)] = Var*_θ[ψ*(f̄) + f̄] = Var*_θ[ψ*(f̄)] + Var*_θ[f̄] + 2 Cov*_θ[ψ*(f̄), f̄].  (42)

In Sec-A of Appendix D, we have shown that, if w(θ) > 0 for θ ∈ Θ, then Cov*_θ[ψ*(f̄), f̄] ≥ 0. Using this in (42), we get

Var*_θ[ϕ*(f̄)] = Var*_θ[ψ*(f̄)] + Var*_θ[f̄] + 2 Cov*_θ[ψ*(f̄), f̄] ≥ Var*_θ[f̄].  (43)

Using (41) and (43), we get

Var*_θ[θ̂] ≥ Var*_θ[f̄],  (44)

where equality holds if and only if θ̂ = f̄ almost surely. Thus we have the following.

Theorem 20: Consider an M(α)-family as in (39) with w(θ) > 0 for θ ∈ Θ ⊂ R. Let τ*(θ) = E*_θ[f̄]. Then the variance of any unbiased estimator of τ*(θ) with respect to p*_θ is not lower than that of f̄. That is, f̄ is the best estimator of its expected value with respect to p*_θ.

We now show that Var*_θ[f̄], indeed, equals the asymptotic variance of Jones et al. estimator for the model p*_θ. The Jones et al. asymptotic variance for p*_θ is given by (c.f. [29])

1/I*_{α,n}(θ) := Var*_θ[p*_θ(X_1^n)^{α−1} s*(X_1^n, θ)] / Cov*_θ[s*(X_1^n, θ), p*_θ(X_1^n)^{α−1} s*(X_1^n, θ)]²,  (45)

where

s*(x_1^n, θ) := ∂/∂θ log p*_θ(x_1^n).  (46)

It can be easily seen that I*_{α,n}(θ) coincides with the usual Fisher information I*_n(θ) for α = 1, where I*_n(θ) = Var*_θ[∂/∂θ log p*_θ(X_1^n)]. Observe that (40) can be re-written as

p*_θ(x_1^n) = N(θ)[1 + w(θ)f̄]^{1/(α−1)} = [N(θ)^{α−1} + N(θ)^{α−1} w(θ)f̄]^{1/(α−1)}.

For convenience, let F(θ) := N(θ)^{α−1} and w̃(θ) := N(θ)^{α−1} w(θ). Then

p*_θ(x_1^n) = [F(θ) + w̃(θ)f̄]^{1/(α−1)},

and hence

∂/∂θ p*_θ(x_1^n) = (1/(α−1)) p*_θ(x_1^n)^{2−α} [F′(θ) + w̃′(θ)f̄].

Then

Var*_θ[p*_θ(X_1^n)^{α−1} s*(X_1^n, θ)] = (1/(α−1))² [w̃′(θ)]² Var*_θ[f̄],  (47)

and

Cov*_θ[s*(X_1^n, θ), p*_θ(X_1^n)^{α−1} s*(X_1^n, θ)] = (1/(α−1)) w̃′(θ) (∂/∂θ)τ*(θ).  (48)
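The closed-form Bernoulli quantities of Section IV-A, namely τ*(θ) = θ/n + (n − 1)/(2n) and the Rao-Blackwell estimator ϕ*(T) = x̄, can be cross-checked by brute-force enumeration. A sketch follows; θ and n below are arbitrary illustrative choices, not values from the paper.

```python
from itertools import product

theta, n = 0.3, 4
p = lambda x: theta if x == 1 else 1.0 - theta

# p*_theta for alpha = 2: proportional to the average of p_theta(x_i)
w = {xs: sum(p(x) for x in xs) / n for xs in product([0, 1], repeat=n)}
z = sum(w.values())
pstar = {xs: v / z for xs, v in w.items()}

# tau*(theta) = E*[X_1] should equal theta/n + (n-1)/(2n)
tau = sum(q * xs[0] for xs, q in pstar.items())
assert abs(tau - (theta / n + (n - 1) / (2 * n))) < 1e-12

# E*[X_1 | T = t] = t/n: the Rao-Blackwell estimator is the sample mean
for t in range(n + 1):
    num = sum(q * xs[0] for xs, q in pstar.items() if sum(xs) == t)
    den = sum(q for xs, q in pstar.items() if sum(xs) == t)
    assert abs(num / den - t / n) < 1e-12

# and the sample mean has smaller variance than X_1 under p*
mean = lambda g: sum(q * g(xs) for xs, q in pstar.items())
var = lambda g: mean(lambda xs: g(xs) ** 2) - mean(g) ** 2
assert var(lambda xs: sum(xs) / n) <= var(lambda xs: xs[0])
```

The last assertion is exactly part (b) of Theorem 18 specialized to θ̂ = x_1 and T = Σ_i x_i.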
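The step leading to (40) rests on the identity [(1/n) Σ_i p_θ(x_i)^{α−1}]^{1/(α−1)} = Z(θ)[1 + w(θ)f̄]^{1/(α−1)} for densities of the form (39). A quick numerical sanity check of this identity; Z, w, α and f(x) = x² below are arbitrary illustrative values, not from the paper.

```python
import math
import random

# A density of the M(alpha)-family form (39), evaluated pointwise:
# p(x) = Z * (1 + w * f(x))^(1/(alpha-1)); Z, w, alpha arbitrary here.
Z, w, alpha = 0.7, 0.25, 3.0
f = lambda x: x * x
p = lambda x: Z * (1 + w * f(x)) ** (1 / (alpha - 1))

random.seed(1)
xs = [random.uniform(-2, 2) for _ in range(50)]
n = len(xs)

# Left side: the alpha-power average of the density over the sample
lhs = (sum(p(x) ** (alpha - 1) for x in xs) / n) ** (1 / (alpha - 1))
# Right side: the same family form with f replaced by its sample mean
fbar = sum(f(x) for x in xs) / n
rhs = Z * (1 + w * fbar) ** (1 / (alpha - 1))
assert math.isclose(lhs, rhs, rel_tol=1e-12)
```

The identity is exact because p(x)^{α−1} = Z^{α−1}(1 + w f(x)) is affine in f(x), so averaging commutes with the bracket; this is what makes p*_θ in (40) again an M(α)-family in the statistic f̄.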
The detailed derivations of (47) and (48) can be found in Sec-B of Appendix D. Using these values, (45) becomes

1/I*_{α,n}(θ) = Var*_θ[f̄] / [∂/∂θ E*_θ[f̄]]²,

that is,

[∂/∂θ E*_θ[f̄]]² / I*_{α,n}(θ) = [∂/∂θ τ*(θ)]² / I*_{α,n}(θ) = Var*_θ[f̄].

This, together with (44), yields the following result.

Corollary 21: Consider an M(α)-family as in Theorem 20. Let τ*(θ) = E*_θ[f̄]. Let θ̂ be any unbiased estimator of τ*(θ) with respect to p*_θ. Then

Var*_θ[θ̂] ≥ [∂/∂θ τ*(θ)]² / I*_{α,n}(θ).  (49)

Equality holds in (49) if and only if θ̂ = f̄ almost surely.

Remark 6: (i) Observe that

Cov*_θ[s*(X_1^n, θ), p*_θ(X_1^n)^{α−1} s*(X_1^n, θ)]² ≤ Var*_θ[p*_θ(X_1^n)^{α−1} s*(X_1^n, θ)] Var*_θ[s*(X_1^n, θ)],

by the Cauchy-Schwarz inequality. This implies that

1/I*_n(θ) ≤ 1/I*_{α,n}(θ).

That means the bound given by Corollary 21 is sharper than the usual CRLB for any unbiased estimator of τ*(θ). Observe that Jones et al. estimators are, in general, more robust than MLE but less efficient. The generalized bound in Corollary 21 might help in investigating the efficacies of such robust estimators.
(ii) In view of (10) and Remark 2(i), we see that (49) in Corollary 21 is essentially a generalization of the usual Cramér-Rao inequality for an exponential family.

Example 3: Consider the Student distributions with location parameter μ = 0, scale parameter σ² (unknown) and a fixed degrees of freedom parameter ν ∈ (2, ∞):

p_σ(x) = N_{ν,σ} [1 + x²/(νσ²)]^{−(ν+1)/2},  (50)

where N_{ν,σ} is the normalizing constant. Let α = 1 − 2/(ν+1), b_α = 1/ν and N_{α,σ} = N_{ν,σ}. Then (50) can be re-written as a one parameter regular M(α)-family with h ≡ 1, w(σ) = b_α/σ² > 0 and f(x) = x² [25]. By Theorem 20, x̄² := (1/n) Σ_{i=1}^n x_i² is the best estimator for E*_σ[X̄²]. We shall now find E*_σ[X̄²] explicitly. Using (40), we have

p*_σ(x_1^n) = Z_{α,σ} [1 + (b_α/σ²) x̄²]^{1/(α−1)},

where Z_{α,σ}^{−1} := ∫ [1 + (b_α/σ²) ȳ²]^{1/(α−1)} dy_1^n = σ^n · H_n(α), provided α > 1 − 2/n. Here, if n = 2k + 1, k ≥ 0,

H_n(α) = (πb_α)^{1/2} (nπ/b_α)^{n/2} Γ((1+α)/(2(1−α))) A_n(α)^k / [Γ(n/2) Γ(1/(1−α))],

and if n = 2k, k ≥ 1,

H_n(α) = (nπ/b_α)^{n/2} (1−α) A_n(α)^{k−1} / (α Γ(n/2)),

with A_n(α) := (1−α)(n−2)/(2−n(1−α)). Let us now denote B_n(α) := n(1−α)/(2α−n(1−α)). Then E*_σ[X̄²] equals A_n(α)^{2−k} B_n(α)^{k−1} b_α σ² when n = 2k, and A_n(α)^{1−k} B_n(α)^k b_α σ² when n = 2k + 1, provided α > n/(n+2). Hence the generalized CRLB is given by, when n = 2k + 1,

{A_n(α)^{2−k} (C_n(α)^k − A_n(α)^{−k} B_n(α)^{2k})} b_α² σ⁴,

and when n = 2k,

{A_n(α)^{3−k} (C_n(α)^{k−1} − A_n(α)^{1−k} B_n(α)^{2k−1})} b_α² σ⁴,

where C_n(α) := (1−α)(n+2)/(2α+(n+2)(α−1)) and α > (n+2)/(n+4).

Remark 7: Observe that Theorem 20 can not be applied, for example, to Student distributions with unknown location parameter (scale parameter known), since they do not form a regular M(α)-family [25, Rem. 2(c)]. However, in the following, we address this problem by deriving a Rao-Blackwell type theorem for Basu et al. likelihood function (8). Note that, for this model, Basu et al. estimator and Jones et al. estimator are the same [23], [25].

A. Analogous Results for Basu et al. Likelihood Function

Consider the Basu et al. likelihood function L_B^{(α)} as in (8). The associated deformed probability distribution is given by

p*_θ(x_1^n) := exp[L_B^{(α)}(x_1^n; θ)] / ∫ exp[L_B^{(α)}(y_1^n; θ)] dy_1^n = exp[ (α/(n(α−1))) Σ_{i=1}^n p_θ(x_i)^{α−1} ] / ∫ exp[ (α/(n(α−1))) Σ_{i=1}^n p_θ(y_i)^{α−1} ] dy_1^n.  (51)

Recall that B(α)-families are the ones that have a fixed number of sufficient statistics with respect to Basu et al. likelihood function (Theorem 10). Consider a B(α)-family as in (31). Then, using (51), we have

p*_θ(x_1^n) = exp[ (α/(α−1)) h̄ + F̃(θ) + (α/(α−1)) w(θ)f̄ ],  (52)

where F̃(θ) = −log ∫ exp[ (α/(α−1)) (h̄ + w(θ)f̄) ] dx_1^n is the normalizing constant. Observe that p*_θ forms a one parameter exponential family. Then we can derive the following result concerning the best estimator for the parameter, analogous to Theorem 20 and Corollary 21.

Theorem 22: Consider a regular B(α)-family as in (31). Let E*_θ[f̄] = τ(θ). Then f̄ is the best estimator of its expected value with respect to p*_θ. Furthermore, if θ̂ is any estimator such that E*_θ[θ̂] = τ(θ), then

Var*_θ[θ̂] ≥ [τ′(θ)]² / I*_n(θ).  (53)

Proof: See Sec-C of Appendix D.

Remark 8: Notice that the bound in (53) is given by the usual Fisher information (of course, applied to the model p*_θ),
unlike the one in Corollary 21 where the bound is given by (when it comes to practical applications) that, the tuning
the asymptotic variance of Jones et al. estimators. parameter α in Basu et al. or Jones et al. likelihood function
We now apply the above result to find best estimator for the should be related to the degrees of freedom parameter ν of the
mean parameter of a Student distribution. Student distribution by α = (ν − 1)/(ν + 1). Basu et al. and
Example 4: Consider the Student distributions with degrees Jones et al. estimations for exponential families, in particular,
of freedom parameter ν ∈ (2, ∞), location parameter µ and for normal distributions have been studied, for example, in [3],
scale parameter σ 2 = 1 : [6], and [29]. However, unlike these works, the present work
i− ν+1 focused on models that have closed-form solutions for these
h 1 2
pµ (x) = Nν 1 + (x − µ)2 , (54) estimations.
ν Another important application of sufficient statistics is that,
2
where Nν is the normalizing constant. Letting α = 1 − ν+1 , given an estimator, one can find another estimator that is
(α) uniformly better than the given one. This result is known as
bα = 1/ν and Nα = Nν , (54) forms a regular B -family as
in (31) with h(x) = Nαα−1 bα x2 , F (µ) = (1 + bα µ2 )Nαα−1 , the Rao-Blackwell theorem. We generalized Rao-Blackwell
w(µ) = −2µNαα−1 bα and f (x) = x. This implies that x theorem when one has sufficient statistics with respect to
is a sufficient statistic for µ with respect to the Basu et al. Basu et al. or Jones et al. likelihood function. This helps us find
likelihood function [24, Th. 4] and x is the best estimator for the “best” estimator for the parameters of a power-law family
E∗µ [X] (Theorem 22). and hence for Student distributions. Consequently, we derived
It can also be shown, in this case, that a lower bound for the variance of estimators of a deformed
form of power-law family, p∗θ . In case of Jones et al. likelihood
p∗µ (xn1 ) function, the bound turns out to be its asymptotic variance
h α 2α (with respect to p∗θ ). The derivation of asymptotic variance
i
= exp Nαα−1 bα x2 + Fα (µ) − Nαα−1 bα µx ,
α−1 α−1 of Jones et al. estimator can be found in [29]. This bound is
where sharper than the usual CRLB (See Corollary 21). Recall that
α n h α i the usual CRLB equals the asymptotic variance of MLE.
Fα (µ) := Nαα−1 bα µ2 − log π− Nαα−1 bα , We now make some general remarks on the deformed
α−1 2 n(α − 1)
 α − 1  2α Nαα−1 bα µ probability distribution p∗θ . Recall that, it is the factorization
(1−α)
E∗µ [X] = − = µ, and theorem (Halmos and Savage [26] in the classical case and
α w′ (µ) Proposition 2 in the general case) that makes it clear that the
[τ ′ (µ)]2 (1 − α) notion of sufficiency is with respect to a likelihood function.
Var∗µ [X] ∗ = .
In (µ) 2αbα Nαα−1 However, the generalized notion of sufficiency is defined using
the deformed probability distribution p∗θ . The equivalence of
VI. S UMMARY AND C ONCLUDING R EMARKS analogous factorization theorem and Fisher’s definition of
Fisher introduced the concept of sufficiency in connection sufficiency shows the uniqueness of the usage of p∗θ in the
with parameter estimation problems in an attempt to estimate generalized notion of sufficiency. This fact was further evi-
the unknown parameters of a population without knowing the denced in the case of Jones et al. likelihood function, where the
entire sample [19]. Fisher’s notion of sufficiency is based on lower bound of variance of estimators for power-law families
the usual log-likelihood function and the underlying estimation equals the asymptotic variance of the Jones et al. estimator for
is ML estimation. However, there are situations (for example, p∗θ . However, for Basu et al. likelihood function, the bound is
robust inference) where we need to look beyond ML estima- given by the usual CRLB, though computed with respect to
tion. Hence the classical notion of sufficiency is inadequate p∗θ , does not equal the asymptotic variance of the Basu et al.
in such situations. In this paper, we presented a notion of estimator (Theorem 22). This is surprising as one would
sufficiency based on a general likelihood function. In par- expect the bound to equal the Basu et al. asymptotic variance
ticular, we considered two families of likelihood functions, given by [3, Th. 2] (as in the case of Jones et al. likelihood
namely Basu et al. and Jones et al. α-likelihood functions. function). This will be the focus of one of our forthcoming
These (for α > 1) arise in distance-based robust inference works. In any case, such an anomalous behavior of Basu et al.
methods and are generalizations of the usual log-likelihood likelihood function (from Jones et al. likelihood function) was
function. Also, these likelihood functions are special as they do also noticed in the study of the geometry of the associated
not require smoothing of empirical measure (of the observed divergence functions in connection with projection theorems
sample) [29]. We characterized the family of probability for families of probability distributions determined by linear
distributions that have a fixed number of sufficient statistics constraints. While the geometry of the Basu et al. divergence
(independent of sample size) with respect to these likelihood is similar to that of the Kullback-Leibler divergence, the same
functions. This family generalizes the usual exponential family for Jones et al. divergence was different [33].
and has a power law. Student distributions fall in this family.
A PPENDIX A
We also extended the notion of minimal sufficiency and found
P ROOF OF RESULTS OF S ECTION II
minimal sufficient statistics for the power-law families. This
could potentially help when the data is modeled using one A. Proof of Proposition 2
of these power-law distributions. However, we would like Let T be a sufficient statistic for θ with respect to LG . Let
to emphasize that the established results have the limitation xn1 and y1n be two samples such that T (xn1 ) = T (y1n ). Then

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on April 07,2024 at 14:55:30 UTC from IEEE Xplore. Restrictions apply.
7578 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 69, NO. 12, DECEMBER 2023
from the generalized Koopman definition of sufficiency, we have that $[L_G(x_1^n;\theta) - L_G(y_1^n;\theta)]$ is independent of $\theta$. Let us define a relation '$\sim$' on the set of all sample points of length $n$ by $x_1^n \sim y_1^n$ if and only if $T(x_1^n) = T(y_1^n)$. Then '$\sim$' is an equivalence relation. Let us denote the equivalence classes by $S_t$ for each $t \in \mathcal{T} = \{r : r = T(x_1^n) \text{ for some } x_1^n \in \mathbb{S}^n\}$. For each equivalence class $S_t$, designate an element $x_{1,t}^n \in S_t$.

Let $x_1^n$ be a sample point such that $T(x_1^n) = t^* \in \mathcal{T}$. Then $x_1^n \in S_{t^*}$ and $T(x_1^n) = T(x_{1,t^*}^n)$. This implies that $[L_G(x_1^n;\theta) - L_G(x_{1,t^*}^n;\theta)]$ is independent of $\theta$ by hypothesis. Let $v(x_1^n) := L_G(x_1^n;\theta) - L_G(x_{1,t^*}^n;\theta)$. Then

$$L_G(x_1^n;\theta) = L_G(x_{1,t^*}^n;\theta) + v(x_1^n) = u(\theta, t^*) + v(x_1^n),$$

where $u(\theta,t) := L_G(x_{1,t}^n;\theta)$ for each $t \in \mathcal{T}$.

Conversely, let us assume that there exist two functions $u$ and $v$ such that

$$L_G(x_1^n;\theta) = u(\theta, T(x_1^n)) + v(x_1^n)$$

for all sample points $x_1^n$ and for all $\theta \in \Theta$. Thus, for any two sample points $x_1^n$ and $y_1^n$ with $T(x_1^n) = T(y_1^n)$, we have

$$L_G(x_1^n;\theta) - L_G(y_1^n;\theta) = u(\theta,T(x_1^n)) + v(x_1^n) - u(\theta,T(y_1^n)) - v(y_1^n) = v(x_1^n) - v(y_1^n).$$

This implies that $[L_G(x_1^n;\theta) - L_G(y_1^n;\theta)]$ is independent of $\theta$. By the generalized Koopman definition of sufficiency, $T$ is a sufficient statistic for $\theta$ with respect to $L_G$. This completes the proof. ■

B. Proof of Proposition 3

Let $T$ be sufficient for $\theta$ with respect to $L_G$. Then, from Proposition 2, there exist functions $u$ and $v$ such that

$$L_G(x_1^n;\theta) = u(T(x_1^n), \theta) + v(x_1^n) \qquad (55)$$

for all $x_1^n \in \mathbb{S}^n$. We need only show that $p^*_{\theta,X_1^n|T}(x_1^n|t)$ is independent of $\theta$ when $t = T(x_1^n)$. Observe that if $t \neq T(x_1^n)$, then $p^*_{\theta,X_1^n|T}(x_1^n|t) = 0$, which is independent of $\theta$. Using (55), we have

$$p^*_{\theta,X_1^n|T}(x_1^n|t) = \frac{p^*_\theta(x_1^n)}{\int_{A_{T(x_1^n)}} p^*_\theta(y_1^n)\,dy_1^n} = \frac{\exp[L_G(x_1^n;\theta)]}{\int_{A_{T(x_1^n)}} \exp[L_G(y_1^n;\theta)]\,dy_1^n}$$
$$= \frac{\exp[u(T(x_1^n),\theta)]\exp[v(x_1^n)]}{\int_{A_{T(x_1^n)}} \exp[u(T(x_1^n),\theta)]\exp[v(y_1^n)]\,dy_1^n} = \frac{\exp[v(x_1^n)]}{\int_{A_{T(x_1^n)}} \exp[v(y_1^n)]\,dy_1^n},$$

which is independent of $\theta$, where the first equality is by the definition of the generalized conditional distribution.

Conversely, let $p^*_{\theta,X_1^n|T}(x_1^n|t)$ be independent of $\theta$, say $p^*_{\theta,X_1^n|T}(x_1^n|t) =: v_1(x_1^n)$. Then

$$\frac{\exp[L_G(x_1^n;\theta)]}{\int_{A_{T(x_1^n)}} \exp[L_G(y_1^n;\theta)]\,dy_1^n} = v_1(x_1^n).$$

This implies

$$L_G(x_1^n;\theta) = u(T(x_1^n), \theta) + v(x_1^n), \qquad (56)$$

where $u(T(x_1^n),\theta) = \log \int_{A_{T(x_1^n)}} \exp[L_G(y_1^n;\theta)]\,dy_1^n$ and $v(x_1^n) = \log v_1(x_1^n)$. Since the above is true for any $x_1^n \in \mathbb{S}^n$, we have the result. ■

C. Proof of Proposition 5

The condition that $T(x_1^n) = T(y_1^n)$ implies $[L_G(x_1^n;\theta) - L_G(y_1^n;\theta)]$ is independent of $\theta$, for all $\theta \in \Theta$, means that $T$ is a sufficient statistic for $\theta$ with respect to $L_G$, by Definition 1. Thus the only thing we need to prove is that $T$ is, moreover, minimal. Let $\widetilde{T}$ be a sufficient statistic such that $\widetilde{T}(x_1^n) = \widetilde{T}(y_1^n)$. Then, by Definition 1, $[L_G(x_1^n;\theta) - L_G(y_1^n;\theta)]$ is independent of $\theta$. Hence, by hypothesis, we have $T(x_1^n) = T(y_1^n)$. Therefore $T$ is minimal, by definition. ■

APPENDIX B
PROOF OF RESULTS OF SECTION III

A. Proof of Theorem 8

Let $\Pi = \mathbb{M}^{(\alpha)}$ as in Definition 6. Let $x_1^n$ and $y_1^n$ be two i.i.d. samples from some $p_\theta \in \mathbb{M}^{(\alpha)}$ such that $T(x_1^n) = T(y_1^n)$. Assume that, for $i = 1, \dots, s$,

$$f_i(x_1^n)/h(x_1^n) = \psi_i(T(x_1^n)) \quad \text{and} \quad f_i(y_1^n)/h(y_1^n) = \psi_i(T(y_1^n)).$$

Then

$$L_J^{(\alpha)}(x_1^n;\theta) - L_J^{(\alpha)}(y_1^n;\theta) = \frac{1}{\alpha-1}\log\Big[\frac{1}{n}\sum_{i=1}^n p_\theta(x_i)^{\alpha-1}\Big] - \frac{1}{\alpha-1}\log\Big[\frac{1}{n}\sum_{i=1}^n p_\theta(y_i)^{\alpha-1}\Big]$$
$$= \frac{1}{\alpha-1}\log\frac{Z(\theta)^{\alpha-1}\big[h(x_1^n) + w(\theta)^T f(x_1^n)\big]}{Z(\theta)^{\alpha-1}\big[h(y_1^n) + w(\theta)^T f(y_1^n)\big]}$$
$$= \frac{1}{\alpha-1}\log\frac{h(x_1^n)}{h(y_1^n)} + \frac{1}{\alpha-1}\log\frac{1 + w(\theta)^T f(x_1^n)/h(x_1^n)}{1 + w(\theta)^T f(y_1^n)/h(y_1^n)} \qquad (57)$$
$$= \frac{1}{\alpha-1}\log\big[h(x_1^n)/h(y_1^n)\big],$$

which is independent of $\theta$. The last equality holds since $T(x_1^n) = T(y_1^n)$ and thus $f_i(x_1^n)/h(x_1^n) = f_i(y_1^n)/h(y_1^n)$. Hence $T$ is a sufficient statistic for $\theta$ with respect to the Jones et al. likelihood function $L_J^{(\alpha)}$. ■
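The cancellation at the heart of this proof is easy to check numerically. The sketch below uses illustrative choices that are not from the paper ($h(x) = 1 + x^2$, $f(x) = x$, $w(\theta) = \theta$, $\alpha = 3$, a single statistic): it builds two samples with the same statistic $T = f(x_1^n)/h(x_1^n)$ and verifies that the Jones et al. likelihood difference is constant in $\theta$ and equals $\frac{1}{\alpha-1}\log[h(x_1^n)/h(y_1^n)]$. The normalizer $Z(\theta)$ contributes only a $\theta$-dependent term that cancels in the difference, so it is omitted.

```python
import numpy as np

# Hypothetical M^(alpha)-style ingredients (illustrative, not from the paper):
# p_theta(x) proportional to [h(x) + w(theta) f(x)]^{1/(alpha-1)}.
alpha = 3.0
h = lambda x: 1.0 + x ** 2
f = lambda x: x

def jones_loglik(sample, theta):
    # (1/(alpha-1)) log[(1/n) sum_i p_theta(x_i)^{alpha-1}], dropping the
    # Z(theta)-term, which cancels when differences are taken.
    return np.log(np.mean(h(sample) + theta * f(sample))) / (alpha - 1.0)

x = np.array([1.0, 2.0])
t = f(x).sum() / h(x).sum()      # the statistic T(x) = f(x_1^n)/h(x_1^n)

# Build a second sample y = (0.5, y2) with the same statistic, by bisection.
def gap(y2):
    y = np.array([0.5, y2])
    return f(y).sum() / h(y).sum() - t

lo, hi = 0.0, 1.0                # gap(0) < 0 < gap(1), gap increasing here
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if gap(mid) < 0:
        lo = mid
    else:
        hi = mid
y = np.array([0.5, 0.5 * (lo + hi)])

# The likelihood difference should be theta-free ...
thetas = np.linspace(-0.4, 0.4, 9)
diffs = np.array([jones_loglik(x, th) - jones_loglik(y, th) for th in thetas])
spread = float(diffs.max() - diffs.min())
# ... and should equal (1/(alpha-1)) log[h-bar(x)/h-bar(y)].
predicted = np.log(h(x).mean() / h(y).mean()) / (alpha - 1.0)
assert spread < 1e-8
assert abs(diffs[0] - predicted) < 1e-8
print("statistics matched; likelihood difference is theta-free, spread =", spread)
```

The same check fails, as it should, for two samples whose statistics differ: the $1 + w(\theta)^T f/h$ ratio in (57) then varies with $\theta$.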
GAYEN AND KUMAR: GENERALIZED FISHER-DARMOIS-KOOPMAN-PITMAN THEOREM AND RAO-BLACKWELL TYPE ESTIMATORS 7579
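Several of the remaining proofs generalize the classical Rao-Blackwell mechanism: conditioning an unbiased estimator on a sufficient statistic preserves the mean and never increases the variance. A quick Monte Carlo sanity check of that classical ($\alpha \to 1$) statement, which Appendices C and D extend to $p^*_\theta$, is sketched below for $N(\theta, 1)$ samples, where $\hat{\theta} = X_1$ is unbiased, $T = \bar{X}$ is sufficient, and $E[X_1 \mid \bar{X}] = \bar{X}$. The model and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 10, 200_000

samples = rng.normal(theta, 1.0, size=(trials, n))
raw = samples[:, 0]           # unbiased estimator theta-hat = X_1
rb = samples.mean(axis=1)     # its Rao-Blackwellization E[X_1 | X-bar] = X-bar

# Same expectation, strictly smaller variance (1 vs 1/n in this model).
assert abs(raw.mean() - theta) < 0.02
assert abs(rb.mean() - theta) < 0.02
assert rb.var() < raw.var()
print(f"Var(raw) = {raw.var():.3f} (theory 1.0), Var(RB) = {rb.var():.3f} (theory {1/n:.1f})")
```

Propositions 14-19 and Theorems 18 and 22 below carry out exactly this program with expectations taken under the deformed distribution $p^*_\theta$ instead of $p_\theta$.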
B. Proof of Theorem 12

We only give the proof for the $\mathbb{M}^{(\alpha)}$-family; the result for the $\mathbb{B}^{(\alpha)}$-family can be proved similarly.

Let us consider an $\mathbb{M}^{(\alpha)}$-family as in Definition 6, and assume that the family is regular. We will show that the statistic

$$f(x_1^n)/h(x_1^n) = \big(f_1(x_1^n)/h(x_1^n), \dots, f_k(x_1^n)/h(x_1^n)\big)$$

is minimal. Let us consider two sample points $x_1^n$ and $y_1^n$ such that $[L_J^{(\alpha)}(x_1^n;\theta) - L_J^{(\alpha)}(y_1^n;\theta)]$ is independent of $\theta$. Our aim is to prove that $f(x_1^n)/h(x_1^n) = f(y_1^n)/h(y_1^n)$.

Using (57), we have

$$L_J^{(\alpha)}(x_1^n;\theta) - L_J^{(\alpha)}(y_1^n;\theta) = \frac{1}{\alpha-1}\log\frac{h(x_1^n)}{h(y_1^n)} + \frac{1}{\alpha-1}\log\frac{1 + w(\theta)^T f(x_1^n)/h(x_1^n)}{1 + w(\theta)^T f(y_1^n)/h(y_1^n)}.$$

Since the above is independent of $\theta$, we must have

$$\log\frac{1 + w(\theta)^T f(x_1^n)/h(x_1^n)}{1 + w(\theta)^T f(y_1^n)/h(y_1^n)} = C,$$

where $C$ is a constant independent of $\theta$ (it may depend on $x_1^n$ and $y_1^n$). This implies

$$(1 - \exp(C)) + w(\theta)^T\big[f(x_1^n)/h(x_1^n) - \exp(C)\, f(y_1^n)/h(y_1^n)\big] = 0.$$

Since the given family is regular, the functions $1, w_1, \dots, w_k$ are linearly independent. Hence we have $\exp(C) = 1$, and therefore

$$f(x_1^n)/h(x_1^n) - f(y_1^n)/h(y_1^n) = 0,$$

that is, $f(x_1^n)/h(x_1^n) = f(y_1^n)/h(y_1^n)$. Thus $\bar{f}/\bar{h}$ is a minimal sufficient statistic for a regular $\mathbb{M}^{(\alpha)}$-family with respect to $L_J^{(\alpha)}$. ■

APPENDIX C
PROOF OF RESULTS OF SECTION IV

A. Proof of $\mathrm{Cov}_\theta(\psi(\bar{f}), \bar{f}) = 0$ for a Regular Exponential Family

Recall that $p_\theta(x) = Z(\theta)\exp[h(x) + w(\theta)f(x)]$ for $x \in \mathbb{S}$ and that $E_\theta[\psi(\bar{f})] = 0$. That is,

$$\int \psi(\bar{f})\, p_\theta(x_1^n)\, dx_1^n = 0.$$

Dividing out the $\theta$-only normalizer $Z(\theta)^n$ and taking the derivative with respect to $\theta$ on both sides, we get

$$\int \psi(\bar{f})\, \frac{\partial}{\partial\theta}\Big(\prod_{i=1}^n \exp[h(x_i) + w(\theta)f(x_i)]\Big)\, dx_1^n = 0.$$

This implies

$$\int \psi(\bar{f})\, p_\theta(x_1^n)\, w'(\theta)\, n\bar{f}\, dx_1^n = 0.$$

Since $w'(\theta) \neq 0$ for all $\theta \in \Theta$, we have $E_\theta[\psi(\bar{f})\bar{f}] = 0$. Therefore

$$\mathrm{Cov}_\theta[\psi(\bar{f}), \bar{f}] = E_\theta[\psi(\bar{f})\bar{f}] - E_\theta[\psi(\bar{f})]\,E_\theta[\bar{f}] = 0. \qquad \blacksquare$$

B. Proof of Proposition 14

Since $T$ is a sufficient statistic with respect to the Jones et al. likelihood function, by Proposition 3 the conditional distribution of $x_1^n$ given $T$ with respect to $p^*_\theta$ is independent of $\theta$. Hence the result follows, as $\hat{\theta}$ is a function of $x_1^n$ only. ■

C. Proof of Proposition 15

For every partition set $A_{T(y_1^n)}$ of $\mathbb{S}^n$, the space of all possible $n$-length samples, we have

$$E^*_\theta\big[\phi^*(T(X_1^n)) \cdot \mathbb{1}_{A_{T(y_1^n)}}\big] := \int_{A_{T(y_1^n)}} p^*_\theta(x_1^n)\, \phi^*(T(x_1^n))\, dx_1^n = \int_{A_{T(y_1^n)}} p^*_\theta(x_1^n)\, \phi^*(T(y_1^n))\, dx_1^n, \qquad (58)$$

since $x_1^n \in A_{T(y_1^n)}$ implies $T(x_1^n) = T(y_1^n)$.

Observe that

$$\phi^*(T(x_1^n)) = \int \hat{\theta}(y_1^n)\, p^*_\theta(y_1^n|T(x_1^n))\, dy_1^n$$
$$= \int_{A_{T(x_1^n)}} \hat{\theta}(y_1^n)\, p^*_\theta(y_1^n|T(x_1^n))\, dy_1^n + \int_{A^c_{T(x_1^n)}} \hat{\theta}(y_1^n)\, p^*_\theta(y_1^n|T(x_1^n))\, dy_1^n$$
$$= \int_{A_{T(x_1^n)}} \hat{\theta}(y_1^n)\, p^*_\theta(y_1^n|T(x_1^n))\, dy_1^n + 0$$
$$= \int_{A_{T(x_1^n)}} \hat{\theta}(y_1^n)\, \frac{p^*_\theta(y_1^n)}{q^*_\theta(T(x_1^n))}\, dy_1^n$$
$$= \int_{A_{T(x_1^n)}} \hat{\theta}(y_1^n)\, p^*_\theta(y_1^n)\, dy_1^n \Big/ \int_{A_{T(x_1^n)}} p^*_\theta(y_1^n)\, dy_1^n.$$

Thus (58) implies

$$E^*_\theta\big[\phi^*(T(X_1^n)) \cdot \mathbb{1}_{A_{T(y_1^n)}}\big] = \int_{A_{T(y_1^n)}} p^*_\theta(x_1^n)\, \phi^*(T(y_1^n))\, dx_1^n = \phi^*(T(y_1^n)) \int_{A_{T(y_1^n)}} p^*_\theta(x_1^n)\, dx_1^n$$
$$= \frac{\int_{A_{T(y_1^n)}} \hat{\theta}(x_1^n)\, p^*_\theta(x_1^n)\, dx_1^n}{\int_{A_{T(y_1^n)}} p^*_\theta(x_1^n)\, dx_1^n}\Big[\int_{A_{T(y_1^n)}} p^*_\theta(x_1^n)\, dx_1^n\Big]$$
$$= \int_{A_{T(y_1^n)}} \hat{\theta}(x_1^n)\, p^*_\theta(x_1^n)\, dx_1^n = E^*_\theta\big[\hat{\theta}(X_1^n) \cdot \mathbb{1}_{A_{T(y_1^n)}}\big]. \qquad (59)$$
Observe that (59) holds for every partition set $A_{T(y_1^n)}$ of $\mathbb{S}^n$. This implies that both the estimators $\hat{\theta}$ and $\phi^*(T)$ have the same expected value with respect to $p^*_\theta$. That is,

$$E^*_\theta[\phi^*(T(X_1^n))] = E^*_\theta[\hat{\theta}(X_1^n)] = \tau^*(\theta). \qquad \blacksquare$$

D. Proof of Proposition 16

$\mathcal{A} \subseteq \mathcal{B}$ is immediate from Propositions 14 and 15. Conversely, let $\psi(T) \in \mathcal{B}$. Since $E^*_\theta[\psi(T)] = \tau^*(\theta)$, we have $E^*_\theta[\psi(T)|T] \in \mathcal{A}$. But $E^*_\theta[\psi(T)|T] = \psi(T)$. This implies that $\psi(T) \in \mathcal{A}$. Hence $\mathcal{B} \subseteq \mathcal{A}$. This completes the proof. ■

E. Proof of Lemma 17

We have

$$\mathrm{Var}^*_\theta[\hat{\theta}] = E^*_\theta[(\hat{\theta} - \tau^*(\theta))^2] = E^*_\theta[(\hat{\theta} - \psi^*(T) + \psi^*(T) - \tau^*(\theta))^2]$$
$$= E^*_\theta[(\hat{\theta} - \psi^*(T))^2] + E^*_\theta[(\psi^*(T) - \tau^*(\theta))^2] + 2E^*_\theta[(\hat{\theta} - \psi^*(T))(\psi^*(T) - \tau^*(\theta))]$$
$$= E^*_\theta[(\hat{\theta} - \psi^*(T))^2] + \mathrm{Var}^*_\theta[\psi^*(T)] + 2E^*_\theta[(\hat{\theta} - \psi^*(T))(\psi^*(T) - \tau^*(\theta))].$$

Now

$$E^*_\theta[(\hat{\theta} - \psi^*(T))(\psi^*(T) - \tau^*(\theta))] = E^*_\theta[(\hat{\theta} - \psi^*(T))\psi^*(T)] - \tau^*(\theta)\, E^*_\theta[\hat{\theta} - \psi^*(T)]$$
$$\overset{(a)}{=} E^*_\theta[(\hat{\theta} - \psi^*(T))\psi^*(T)] = E^*_\theta[\hat{\theta}\,\psi^*(T)] - E^*_\theta[(\psi^*(T))^2]$$
$$\overset{(b)}{=} E^*_\theta\big[E^*_\theta[\hat{\theta}\,\psi^*(T)\,|\,T]\big] - E^*_\theta[(\psi^*(T))^2] = E^*_\theta\big[\psi^*(T)\,E^*_\theta[\hat{\theta}|T]\big] - E^*_\theta[(\psi^*(T))^2]$$
$$= E^*_\theta\big[\psi^*(T)\{E^*_\theta[\hat{\theta}|T] - \psi^*(T)\}\big] = E^*_\theta\big[\psi^*(T)\{\phi^*(T) - \psi^*(T)\}\big],$$

where (a) follows from the fact that both $\hat{\theta}$ and $\psi^*(T)$ are unbiased estimators of $\tau^*(\theta)$, and (b) follows from the law of total expectation. Thus we have

$$\mathrm{Var}^*_\theta[\hat{\theta}] = E^*_\theta[(\hat{\theta} - \psi^*(T))^2] + \mathrm{Var}^*_\theta[\psi^*(T)] + 2E^*_\theta\big[\psi^*(T)\{\phi^*(T) - \psi^*(T)\}\big]. \qquad \blacksquare$$

F. Proof of Theorem 18

From Proposition 15, we have $E^*_\theta[\phi^*(T)] = E^*_\theta[\hat{\theta}] = \tau^*(\theta)$. Now, taking $\psi^*(T) = \phi^*(T)$ in Lemma 17, we have

$$\mathrm{Var}^*_\theta[\hat{\theta}] = E^*_\theta[(\hat{\theta} - \phi^*(T))^2] + \mathrm{Var}^*_\theta[\phi^*(T)] \geq \mathrm{Var}^*_\theta[\phi^*(T)].$$

Equality holds if and only if $E^*_\theta[(\hat{\theta} - \phi^*(T))^2] = 0$, that is, if and only if $\hat{\theta} = \phi^*(T) = E^*_\theta[\hat{\theta}|T]$ almost surely. This happens if and only if $\hat{\theta}$ is a function of $T$ only. This completes the proof. ■

G. Proof of Proposition 19

Let $\phi_1(T)$ and $\phi_2(T)$ be two unbiased estimators of $\tau^*(\theta)$ with respect to $p^*_\theta$. Let $\sigma_*^2$ be the minimum variance that any unbiased estimator of $\tau^*(\theta)$ can achieve, and suppose $\mathrm{Var}^*_\theta[\phi_1(T)] = \mathrm{Var}^*_\theta[\phi_2(T)] = \sigma_*^2$. We show that $\phi_1(T) = \phi_2(T)$ almost surely.

Consider $\xi(T) := \dfrac{\phi_1(T) + \phi_2(T)}{2}$. Then $E^*_\theta[\xi(T)] = \tau^*(\theta)$ and

$$\mathrm{Var}^*_\theta[\xi(T)] \leq \sigma_*^2,$$

since $\mathrm{Cov}^*_\theta[\phi_1(T), \phi_2(T)] \leq \sigma_*^2$ by the Cauchy-Schwarz inequality. As $\sigma_*^2$ is the minimum variance that an unbiased estimator of $\tau^*(\theta)$ can achieve, we also have $\mathrm{Var}^*_\theta[\xi(T)] \geq \sigma_*^2$. Therefore $\mathrm{Var}^*_\theta[\xi(T)] = \sigma_*^2$. This implies

$$\mathrm{Cov}^*_\theta[\phi_1(T), \phi_2(T)] = \sigma_*^2.$$

That means $\phi_1(T)$ and $\phi_2(T)$ are perfectly positively correlated (their correlation coefficient is $1$). Thus there exist constants, say $\alpha > 0$ and $\beta$, such that

$$\phi_1(T) = \alpha\phi_2(T) + \beta \quad \text{almost surely}.$$

Taking expectation and variance with respect to $p^*_\theta$ on both sides, respectively, we get

$$\tau^*(\theta) = \alpha\tau^*(\theta) + \beta \quad \text{and} \quad \sigma_*^2 = \alpha^2\sigma_*^2.$$

These together imply $\alpha = 1$ and $\beta = 0$. That is, $\phi_1(T) = \phi_2(T)$ almost surely. ■

APPENDIX D
PROOF OF RESULTS OF SECTION V

A. Proof of $\mathrm{Cov}^*_\theta(\psi^*(\bar{f}), \bar{f}) \geq 0$ in a Regular $\mathbb{M}^{(\alpha)}$-Family

Since $E^*_\theta[\psi^*(\bar{f})] = 0$,

$$\int \psi^*(\bar{f})\, N(\theta)\,[1 + w(\theta)\bar{f}]^{\frac{1}{\alpha-1}}\, dx_1^n = 0.$$

That is,

$$\int \psi^*(\bar{f})\,[1 + w(\theta)\bar{f}]^{\frac{1}{\alpha-1}}\, dx_1^n = 0.$$

Differentiating both sides with respect to $\theta$,

$$\frac{1}{\alpha-1}\int \psi^*(\bar{f})\,[1 + w(\theta)\bar{f}]^{\frac{2-\alpha}{\alpha-1}}\, w'(\theta)\bar{f}\, dx_1^n = 0.$$

This implies

$$\int \frac{\psi^*(\bar{f})\bar{f}}{1 + w(\theta)\bar{f}}\, p^*_\theta(x_1^n)\, dx_1^n = 0.$$

That is,

$$E^*_\theta\left[\frac{\psi^*(\bar{f})\bar{f}}{1 + w(\theta)\bar{f}}\right] = 0. \qquad (60)$$

We now claim that $E^*_\theta[\psi^*(\bar{f})\bar{f}] \geq 0$. Let

$$k(\theta) := E^*_\theta[\psi^*(\bar{f})\bar{f}]. \qquad (61)$$

Suppose that $k(\theta) \neq 0$ for all $\theta \in \Theta$. From (61),

$$\int \psi^*(\bar{f})\bar{f}\, N(\theta)\,[1 + w(\theta)\bar{f}]^{\frac{1}{\alpha-1}}\, dx_1^n = k(\theta).$$
That is,

$$\int \psi^*(\bar{f})\bar{f}\,[1 + w(\theta)\bar{f}]^{\frac{1}{\alpha-1}}\, dx_1^n = \frac{k(\theta)}{N(\theta)}.$$

Differentiating both sides with respect to $\theta$, we get

$$\frac{1}{\alpha-1}\int \psi^*(\bar{f})\bar{f}\,[1 + w(\theta)\bar{f}]^{\frac{2-\alpha}{\alpha-1}}\, w'(\theta)\bar{f}\, dx_1^n = \frac{\partial}{\partial\theta}\left[\frac{k(\theta)}{N(\theta)}\right].$$

Hence

$$\int \psi^*(\bar{f})\,\frac{\bar{f}^2}{1 + w(\theta)\bar{f}}\, p^*_\theta(x_1^n)\, dx_1^n = (\alpha-1)\,\frac{N(\theta)}{w'(\theta)}\,\frac{\partial}{\partial\theta}\left[\frac{k(\theta)}{N(\theta)}\right].$$

Multiplying by $w(\theta)$ and writing $w(\theta)\bar{f} = (1 + w(\theta)\bar{f}) - 1$, we get

$$\int \psi^*(\bar{f})\bar{f}\,\frac{1 + w(\theta)\bar{f} - 1}{1 + w(\theta)\bar{f}}\, p^*_\theta(x_1^n)\, dx_1^n = (\alpha-1)\,\frac{w(\theta)N(\theta)}{w'(\theta)}\,\frac{\partial}{\partial\theta}\left[\frac{k(\theta)}{N(\theta)}\right].$$

This implies

$$E^*_\theta[\psi^*(\bar{f})\bar{f}] - E^*_\theta\left[\frac{\psi^*(\bar{f})\bar{f}}{1 + w(\theta)\bar{f}}\right] = (\alpha-1)\,\frac{w(\theta)N(\theta)}{w'(\theta)}\,\frac{\partial}{\partial\theta}\left[\frac{k(\theta)}{N(\theta)}\right].$$

Using (60), we get

$$k(\theta) = (\alpha-1)\,\frac{w(\theta)N(\theta)}{w'(\theta)}\,\frac{\partial}{\partial\theta}\left[\frac{k(\theta)}{N(\theta)}\right].$$

Since $k(\theta) \neq 0$, we have

$$(\alpha-1)\,\frac{\partial}{\partial\theta}\left[\frac{k(\theta)}{N(\theta)}\right]\Big/\left[\frac{k(\theta)}{N(\theta)}\right] = \frac{w'(\theta)}{w(\theta)}.$$

That is,

$$(\alpha-1)\,\frac{\partial}{\partial\theta}\log\left[\frac{k(\theta)}{N(\theta)}\right] = \frac{\partial}{\partial\theta}\log w(\theta).$$

This reduces to

$$\frac{\partial}{\partial\theta}\log\left[\frac{k(\theta)}{N(\theta)}\right]^{\alpha-1} = \frac{\partial}{\partial\theta}\log w(\theta).$$

Integrating both sides with respect to $\theta$, we get

$$\log\left[\frac{k(\theta)}{N(\theta)}\right]^{\alpha-1} = \log w(\theta) + \log C,$$

where $\log C$ is the constant of integration. This implies

$$k(\theta)^{\alpha-1} = C \cdot w(\theta)\, N(\theta)^{\alpha-1}.$$

Observe that $C > 0$ and $N(\theta) > 0$ for all $\theta \in \Theta$. Hence, if $w(\theta) > 0$ for all $\theta \in \Theta$, we get $k(\theta) > 0$. This, together with the case $k(\theta) = 0$, implies $k(\theta) \geq 0$. That means, in any case, $E^*_\theta[\psi^*(\bar{f})\bar{f}] \geq 0$. Therefore

$$\mathrm{Cov}^*_\theta[\psi^*(\bar{f}), \bar{f}] = E^*_\theta[\psi^*(\bar{f})\bar{f}] - E^*_\theta[\psi^*(\bar{f})]\,E^*_\theta[\bar{f}] = E^*_\theta[\psi^*(\bar{f})\bar{f}] - 0 = E^*_\theta[\psi^*(\bar{f})\bar{f}] \geq 0. \qquad \blacksquare$$

B. Derivation of the Generalized Cramér-Rao Bound for $\mathbb{M}^{(\alpha)}$

Consider

$$p^*_\theta(x_1^n) = N(\theta)\,[1 + w(\theta)\bar{f}]^{\frac{1}{\alpha-1}} = \big[N(\theta)^{\alpha-1} + N(\theta)^{\alpha-1}w(\theta)\bar{f}\big]^{\frac{1}{\alpha-1}}.$$

For convenience, let $F(\theta) := N(\theta)^{\alpha-1}$ and $\widetilde{w}(\theta) := N(\theta)^{\alpha-1}w(\theta)$. Then

$$p^*_\theta(x_1^n) = [F(\theta) + \widetilde{w}(\theta)\bar{f}]^{\frac{1}{\alpha-1}},$$

and thus

$$\frac{\partial}{\partial\theta}p^*_\theta(x_1^n) = \frac{1}{\alpha-1}\, p^*_\theta(x_1^n)^{2-\alpha}\,[F'(\theta) + \widetilde{w}'(\theta)\bar{f}]. \qquad (62)$$

Hence, from the definition of $s^*$, and since $E^*_\theta[s^*(X_1^n,\theta)] = \int \frac{\partial}{\partial\theta}p^*_\theta(x_1^n)\, dx_1^n = 0$,

$$\mathrm{Cov}^*_\theta\big[s^*(X_1^n,\theta),\, p^*_\theta(X_1^n)^{\alpha-1}s^*(X_1^n,\theta)\big] = \int s^*(x_1^n,\theta)^2\, p^*_\theta(x_1^n)^\alpha\, dx_1^n$$
$$= \int s^*(x_1^n,\theta)\left[\frac{1}{p^*_\theta(x_1^n)}\frac{\partial}{\partial\theta}p^*_\theta(x_1^n)\right] p^*_\theta(x_1^n)^\alpha\, dx_1^n$$
$$= \frac{1}{\alpha-1}\int p^*_\theta(x_1^n)\, s^*(x_1^n,\theta)\,[F'(\theta) + \widetilde{w}'(\theta)\bar{f}]\, dx_1^n$$
$$= \frac{1}{\alpha-1}\int p^*_\theta(x_1^n)\, s^*(x_1^n,\theta)\, F'(\theta)\, dx_1^n + \frac{1}{\alpha-1}\int p^*_\theta(x_1^n)\, s^*(x_1^n,\theta)\, \widetilde{w}'(\theta)\bar{f}\, dx_1^n$$
$$= \frac{1}{\alpha-1}\,\widetilde{w}'(\theta)\int p^*_\theta(x_1^n)\left[\frac{1}{p^*_\theta(x_1^n)}\frac{\partial}{\partial\theta}p^*_\theta(x_1^n)\right]\bar{f}\, dx_1^n$$
$$= \frac{1}{\alpha-1}\,\widetilde{w}'(\theta)\,\frac{\partial}{\partial\theta}\int p^*_\theta(x_1^n)\,\bar{f}\, dx_1^n = \frac{1}{\alpha-1}\,\widetilde{w}'(\theta)\,\frac{\partial}{\partial\theta}\tau^*(\theta),$$

where the $F'(\theta)$ term vanishes because $\int p^*_\theta\, s^*\, dx_1^n = \int \frac{\partial}{\partial\theta}p^*_\theta\, dx_1^n = 0$.

Using (62), we also get

$$\int s^*(x_1^n,\theta)\, p^*_\theta(x_1^n)^\alpha\, dx_1^n = \int \left[\frac{\partial}{\partial\theta}\log p^*_\theta(x_1^n)\right] p^*_\theta(x_1^n)^\alpha\, dx_1^n = \frac{1}{\alpha-1}\int p^*_\theta(x_1^n)\,[F'(\theta) + \widetilde{w}'(\theta)\bar{f}]\, dx_1^n$$

and

$$\int s^*(x_1^n,\theta)^2\, p^*_\theta(x_1^n)^{2\alpha-1}\, dx_1^n = \int \left[\frac{\partial}{\partial\theta}\log p^*_\theta(x_1^n)\right]^2 p^*_\theta(x_1^n)^{2\alpha-1}\, dx_1^n = \left(\frac{1}{\alpha-1}\right)^2\int p^*_\theta(x_1^n)\,[F'(\theta) + \widetilde{w}'(\theta)\bar{f}]^2\, dx_1^n.$$
Thus

$$\mathrm{Var}^*_\theta\big[p^*_\theta(X_1^n)^{\alpha-1}s^*(X_1^n,\theta)\big] = \int s^*(x_1^n,\theta)^2\, p^*_\theta(x_1^n)^{2\alpha-1}\, dx_1^n - \left[\int s^*(x_1^n,\theta)\, p^*_\theta(x_1^n)^\alpha\, dx_1^n\right]^2$$
$$= \left(\frac{1}{\alpha-1}\right)^2\Big\{E^*_\theta\big[(F'(\theta) + \widetilde{w}'(\theta)\bar{f})^2\big] - \big(E^*_\theta[F'(\theta) + \widetilde{w}'(\theta)\bar{f}]\big)^2\Big\}$$
$$= \left(\frac{1}{\alpha-1}\right)^2 \mathrm{Var}^*_\theta\big[F'(\theta) + \widetilde{w}'(\theta)\bar{f}\big] = \left(\frac{1}{\alpha-1}\right)^2 [\widetilde{w}'(\theta)]^2\, \mathrm{Var}^*_\theta[\bar{f}]. \qquad \blacksquare$$

C. Proof of Theorem 22

The deformed probability distribution associated with the Basu et al. likelihood function is given by

$$p^*_\theta(x_1^n) = \frac{\exp\Big[\frac{\alpha}{n(\alpha-1)}\sum_{i=1}^n p_\theta(x_i)^{\alpha-1}\Big]}{\int \exp\Big[\frac{\alpha}{n(\alpha-1)}\sum_{i=1}^n p_\theta(y_i)^{\alpha-1}\Big]\, dy_1^n}. \qquad (63)$$

$T := T(x_1^n)$ is a sufficient statistic for $\theta$ with respect to the Basu et al. likelihood function. Let $\hat{\theta} := \hat{\theta}(x_1^n)$ be an unbiased estimator of $\tau(\theta)$ with respect to $p^*_\theta$. Let us define

$$\phi(T(x_1^n)) := E^*_\theta[\hat{\theta}\,|\,T = T(x_1^n)] = \int \hat{\theta}(y_1^n)\, p^*_\theta(y_1^n|T(x_1^n))\, dy_1^n,$$

where the conditional distribution is as defined in (11) of the main article. Then, proceeding with steps similar to those in Lemma 17 and Theorem 18, we have the following.

1. $E^*_\theta[\phi(T)] = E^*_\theta[\hat{\theta}] = \tau(\theta)$.
2. $\mathrm{Var}^*_\theta[\phi(T)] \leq \mathrm{Var}^*_\theta[\hat{\theta}]$, and equality holds if and only if $\hat{\theta}$ is a function of $T$ only.

Now, for the given $\mathbb{B}^{(\alpha)}$-family, using (63), we have

$$p^*_\theta(x_1^n) = \exp\Big[\frac{\alpha}{\alpha-1}\,\bar{h} + \widetilde{F}(\theta) + \frac{\alpha}{\alpha-1}\, w(\theta)\bar{f}\Big].$$

Recall that $T = \bar{f}$ is a sufficient statistic for $\theta$ with respect to the Basu et al. likelihood function. Using the generalized Rao-Blackwell theorem for the Basu et al. likelihood function, we get

$$E^*_\theta[\phi(\bar{f})] = E^*_\theta[\hat{\theta}] = \tau(\theta) \quad \text{and} \quad \mathrm{Var}^*_\theta[\phi(\bar{f})] \leq \mathrm{Var}^*_\theta[\hat{\theta}]. \qquad (64)$$

Let us define $\psi(\bar{f}) := \phi(\bar{f}) - \bar{f}$. Then $E^*_\theta[\psi(\bar{f})] = 0$, since $E^*_\theta[\bar{f}] = \tau(\theta)$ by hypothesis. Also

$$\mathrm{Var}^*_\theta[\phi(\bar{f})] = \mathrm{Var}^*_\theta[\psi(\bar{f}) + \bar{f}] = \mathrm{Var}^*_\theta[\psi(\bar{f})] + \mathrm{Var}^*_\theta[\bar{f}] + 2\,\mathrm{Cov}^*_\theta(\psi(\bar{f}), \bar{f}).$$

Proceeding as in C.1, one can show that $\mathrm{Cov}^*_\theta(\psi(\bar{f}), \bar{f}) = 0$. Thus we get

$$\mathrm{Var}^*_\theta[\phi(\bar{f})] = \mathrm{Var}^*_\theta[\psi(\bar{f})] + \mathrm{Var}^*_\theta[\bar{f}] \geq \mathrm{Var}^*_\theta[\bar{f}].$$

Hence the inequality in (64) reduces to

$$\mathrm{Var}^*_\theta[\hat{\theta}] \geq \mathrm{Var}^*_\theta[\bar{f}], \qquad (65)$$

that is, $\bar{f}$ is the best estimator of its expected value with respect to $p^*_\theta$.

We next show that $\mathrm{Var}^*_\theta[\bar{f}] = \dfrac{[\tau'(\theta)]^2}{I^*_n(\theta)}$. Observe that $p^*_\theta$ forms a one-parameter exponential family, and

$$I^*_n(\theta) = \mathrm{Var}^*_\theta\Big[\frac{\partial}{\partial\theta}\log p^*_\theta(X_1^n)\Big] = \mathrm{Var}^*_\theta\Big[\widetilde{F}'(\theta) + \frac{\alpha}{\alpha-1}\, w'(\theta)\bar{f}\Big] = \Big(\frac{\alpha}{\alpha-1}\Big)^2 (w'(\theta))^2\, \mathrm{Var}^*_\theta[\bar{f}]. \qquad (66)$$

Using the facts that $E^*_\theta\big[\frac{\partial}{\partial\theta}\log p^*_\theta(X_1^n)\big] = 0$ and $E^*_\theta[\bar{f}] = \tau(\theta)$, we get

$$\tau(\theta) = -\frac{(\alpha-1)\,\widetilde{F}'(\theta)}{\alpha\, w'(\theta)}. \qquad (67)$$

Further, taking the derivative on both sides of $E^*_\theta\big[\frac{\partial}{\partial\theta}\log p^*_\theta(X_1^n)\big] = 0$, we get

$$\tau'(\theta) = \frac{\alpha}{\alpha-1}\, w'(\theta)\, \mathrm{Var}^*_\theta[\bar{f}]. \qquad (68)$$

Thus, using (66), (67) and (68), we get

$$\mathrm{Var}^*_\theta[\bar{f}] = \frac{[\tau'(\theta)]^2}{I^*_n(\theta)}.$$

This completes the proof. ■

ACKNOWLEDGMENT

The authors would like to thank the editor, the associate editor, and the referees for their valuable comments that improved the presentation of the article.

REFERENCES

[1] E. W. Barankin and A. P. Maitra, "Generalization of the Fisher-Darmois-Koopman-Pitman theorem on sufficient statistics," Sankhya, Indian J. Statist., A, vol. 25, no. 3, pp. 217–244, 1963.
[2] A. G. Bashkirov, "On maximum entropy principle, superstatistics, power-law distribution and Renyi parameter," Phys. A, Stat. Mech. Appl., vol. 340, nos. 1–3, pp. 153–162, Sep. 2004.
[3] A. Basu, "Robust and efficient estimation by minimising a density power divergence," Biometrika, vol. 85, no. 3, pp. 549–559, Sep. 1998.
[4] D. Blackwell, "Conditional expectation and unbiased sequential estimation," Ann. Math. Statist., vol. 18, no. 1, pp. 105–110, Mar. 1947.
[5] M. Broniatowski, A. Toma, and I. Vajda, "Decomposable pseudodistances and applications in statistical estimation," J. Stat. Planning Inference, vol. 142, no. 9, pp. 2574–2585, Sep. 2012.
[6] M. Broniatowski and I. Vajda, "Several applications of divergence criteria in continuous families," Kybernetika, vol. 48, no. 4, pp. 600–636, 2012.
[7] L. D. Brown, Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory. Hayward, CA, USA: Inst. Math. Statist., 1986.
[8] C. Bunte and A. Lapidoth, "Encoding tasks and Rényi entropy," IEEE Trans. Inf. Theory, vol. 60, no. 9, pp. 5065–5076, Sep. 2014.
[9] J. C. Principe, Information Theoretic Learning: Rényi's Entropy and Kernel Perspectives. New York, NY, USA: Springer, 2010.
[10] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. London, U.K.: Duxbury Press, 2002.
[11] A. Cichocki and S.-I. Amari, "Families of Alpha-, Beta- and Gamma-divergences: Flexible and robust measures of similarities," Entropy, vol. 12, no. 6, pp. 1532–1568, Jun. 2010.
[12] H. Cramér, Mathematical Methods of Statistics. Princeton, NJ, USA: Princeton Univ. Press, 1946.
[13] I. Csiszar, "Generalized cutoff rates and Renyi's information measures," IEEE Trans. Inf. Theory, vol. 41, no. 1, pp. 26–34, Feb. 1995.
[14] I. Csiszár and F. Matúš, "Generalized minimizers of convex integral functionals, Bergman distance, Pythagorean identities," Kybernetika, vol. 48, no. 4, pp. 637–689, 2012.
[15] I. Csiszár and P. C. Shields, "Information theory and statistics: A tutorial," Found. Trends Commun. Inf. Theory, vol. 1, no. 4, pp. 417–528, 2004.
[16] G. Darmois, "Sur les lois de probabilites a estimation exhaustive," Comptes Rendus de L'Académie des Sci., vol. 200, no. 1, pp. 1265–1266, 1935.
[17] S. Eguchi, O. Komori, and S. Kato, "Projective power entropy and maximum Tsallis entropy distributions," Entropy, vol. 13, no. 10, pp. 1746–1764, Sep. 2011.
[18] C. Field and B. Smith, "Robust estimation: A weighted maximum likelihood approach," Int. Stat. Rev., vol. 62, no. 3, pp. 405–424, 1994.
[19] R. A. Fisher, "On the mathematical foundations of theoretical statistics," Philos. Trans. Roy. Soc. London A, Math. Phys. Sci., vol. 222, nos. 594–604, pp. 309–368, 1922.
[20] R. A. Fisher, "Two new properties of mathematical likelihood," Proc. Roy. Soc. London A, vol. 144, no. 8, pp. 285–307, 1972.
[21] H. Fujisawa, "Normalized estimating equation for robust parameter estimation," Electron. J. Statist., vol. 7, pp. 1587–1606, Jan. 2013.
[22] H. Fujisawa and S. Eguchi, "Robust parameter estimation with a small bias against heavy contamination," J. Multivariate Anal., vol. 99, no. 9, pp. 2053–2081, Oct. 2008.
[23] A. Gayen and M. A. Kumar, "Generalized estimating equation for the student-t distributions," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2018, pp. 571–575.
[24] A. Gayen and M. A. Kumar, "A generalized notion of sufficiency for power-law distributions," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2021, pp. 2185–2190.
[25] A. Gayen and M. A. Kumar, "Projection theorems and estimating equations for power-law models," J. Multivariate Anal., vol. 184, Jul. 2021, Art. no. 104734.
[26] P. R. Halmos and L. J. Savage, "Application of the Radon–Nikodym theorem to the theory of sufficient statistics," Ann. Math. Statist., vol. 20, no. 2, pp. 225–241, Jun. 1949.
[27] H. G. Hoang, B.-N. Vo, B.-T. Vo, and R. Mahler, "The Cauchy–Schwarz divergence for Poisson point processes," IEEE Trans. Inf. Theory, vol. 61, no. 8, pp. 4475–4485, Aug. 2015.
[28] R. Jenssen, J. C. Principe, D. Erdogmus, and T. Eltoft, "The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels," J. Franklin Inst., vol. 343, no. 6, pp. 614–629, Sep. 2006.
[29] M. C. Jones, "A comparison of related density-based minimum divergence estimators," Biometrika, vol. 88, no. 3, pp. 865–873, Oct. 2001.
[30] K. Kampa, E. Hasanbelliu, and J. C. Principe, "Closed-form Cauchy–Schwarz pdf divergence for mixture of Gaussians," in Proc. Int. Joint Conf. Neural Netw., San Jose, CA, USA, 2011, pp. 2578–2585.
[31] B. O. Koopman, "On distributions admitting a sufficient statistic," Trans. Amer. Math. Soc., vol. 39, no. 3, pp. 399–409, 1936.
[32] M. A. Kumar and R. Sundaresan, "Minimization problems based on relative α-entropy I: Forward projection," IEEE Trans. Inf. Theory, vol. 61, no. 9, pp. 5063–5080, Sep. 2015.
[33] M. A. Kumar and R. Sundaresan, "Minimization problems based on relative α-entropy II: Reverse projection," IEEE Trans. Inf. Theory, vol. 61, no. 9, pp. 5081–5095, Sep. 2015.
[34] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. Cham, Switzerland: Springer, 1998.
[35] E. Lutwak, D. Yang, and G. Zhang, "Cramér–Rao and moment-entropy inequalities for Rényi entropy and generalized Fisher information," IEEE Trans. Inf. Theory, vol. 51, no. 2, pp. 473–478, Feb. 2005.
[36] A. Maji, A. Ghosh, and A. Basu, "The logarithmic super divergence and asymptotic inference properties," Adv. Stat. Anal., vol. 100, no. 1, pp. 99–131, Jan. 2016.
[37] J. Naudts, "Estimators, escort probabilities, and φ-exponential families in statistical physics," J. Inequalities Pure Appl. Math., vol. 5, no. 4, p. 102, 2004.
[38] J. Neyman, "Su un teorema concernente le cosiddette statistiche sufficienti," Giorn. Ist. Ital. Attuari, vol. 6, pp. 320–334, 1935.
[39] F. Nielsen, K. Sun, and S. Marchand-Maillet, "k-means clustering with Hölder divergences," in Proc. Int. Conf. Geometric Sci. Inf., 2017, pp. 856–863.
[40] F. Nielsen, K. Sun, and S. Marchand-Maillet, "On Hölder projective divergences," Entropy, vol. 19, no. 3, pp. 1–28, 2017.
[41] E. J. G. Pitman, "Sufficient statistics and intrinsic accuracy," Math. Proc. Cambridge Phil. Soc., vol. 32, no. 4, pp. 567–579, Dec. 1936.
[42] C. R. Rao, "Information and accuracy attainable in the estimation of statistical parameters," Bull. Calcutta Math. Soc., vol. 37, no. 3, pp. 81–91, 1945.
[43] C. R. Rao, "Minimum variance and the estimation of several parameters," Math. Proc. Cambridge Phil. Soc., vol. 43, no. 2, pp. 280–283, Apr. 1947.
[44] C. R. Rao, "Sufficient statistics and minimum variance estimates," Math. Proc. Cambridge Phil. Soc., vol. 45, no. 2, pp. 213–218, Apr. 1949.
[45] Z. Shanqing, Z. Kunlong, and X. Weibin, "Application of Cauchy–Schwarz divergence in image segmentation," Comput. Eng. Appl., vol. 49, no. 12, pp. 129–131, 2013.
[46] R. Sundaresan, "A measure of discrimination and its geometric properties," in Proc. IEEE Int. Symp. Inf. Theory, May 2002, p. 264.
[47] R. Sundaresan, "Guessing under source uncertainty," IEEE Trans. Inf. Theory, vol. 53, no. 1, pp. 269–287, Jan. 2007.
[48] M. P. Windham, "Robustifying model fitting," J. Roy. Stat. Soc., B, Methodol., vol. 57, no. 3, pp. 599–609, Sep. 1995.

Atin Gayen received the bachelor's degree in mathematics from the University of Calcutta, Kolkata, in 2012, and the master's and M.Phil. degrees in mathematics from Utkal University, Bhubaneswar, in 2014 and 2015, respectively. Subsequently, he embarked on his doctoral journey, which led him to the Indian Institute of Technology Indore and later to the Indian Institute of Technology Palakkad, where he received the Ph.D. degree from the Department of Mathematics in 2022. He is currently a Post-Doctoral Fellow with the Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Uttarakhand. His research interests include distance-based statistical inference, robust inference, information geometry, divergence-based learning, and quantum information theory.

M. Ashok Kumar (Member, IEEE) received the bachelor's and master's degrees in mathematics from Manonmaniam Sundaranar University, Tirunelveli, India, in 1999 and 2001, respectively, and the Ph.D. degree from the Department of Electrical Communication Engineering, Indian Institute of Science, Bengaluru, India, in May 2015. From 2001 to 2007, he was a lecturer in mathematics in various educational institutions in India. He was a Visiting Scientist with the Stat-Math Unit, Indian Statistical Institute, Bengaluru, from December 2014 to May 2015, and a Post-Doctoral Fellow with the Department of Electrical Engineering, Technion-Israel Institute of Technology, from May 2015 to December 2015. He was a Faculty Member in mathematics with IIT Indore from January 2016 to May 2018. He joined IIT Palakkad in June 2018, where he is currently an Associate Professor with the Department of Mathematics. His research interests include information theory, mathematical statistics, and probability.