
Electronic Journal of Statistics

ISSN: 1935-7524
arXiv: arXiv:0000.0000

Asymptotic theory for maximum likelihood estimates in reduced-rank multivariate generalized linear models

Efstathia Bura1,*, Sabrina Duarte2,**, Liliana Forzani2,†, Ezequiel Smucler3,‡ and Mariela Sued3,§
1 Institute of Statistics and Mathematical Methods in Economics, TU Wien, A-1040 Vienna, Austria and
Department of Statistics, George Washington University, Washington, D.C. 20052, U.S.A.
e-mail: * efstathia.bura@tuwien.ac.at
2 Facultad de Ingeniería Química, UNL, Researcher of CONICET, 3000 Santa Fe, Argentina.
e-mail: ** sabriduarte1905@gmail.com; † liliana.forzani@gmail.com
3 Instituto de Cálculo, UBA, CONICET, FCEM, 1428 Buenos Aires, Argentina.
e-mail: ‡ ezequiels.90@gmail.com; msued@dm.uba.ar

Abstract: Reduced-rank regression is a dimensionality reduction method with many applications. The asymptotic theory for reduced-rank estimators of parameter matrices in multivariate linear models has been studied extensively. In contrast, few theoretical results are available for reduced-rank multivariate generalized linear models. We develop M-estimation theory for concave criterion functions that are maximized over parameter spaces that are neither convex nor closed. These results are used to derive the consistency and asymptotic distribution of maximum likelihood estimators in reduced-rank multivariate generalized linear models, when the response and predictor vectors have a joint distribution.

Keywords and phrases: M-estimation, exponential family, rank restriction, non-convex parameter spaces.

1. Introduction

The multivariate multiple linear regression model for a q-dimensional response Y = (Y1, . . . , Yq)^T and a p-dimensional predictor vector X = (X1, . . . , Xp)^T postulates that Y = BX + ε, where B is a q × p matrix and ε = (ε1, . . . , εq)^T is the error term, with E(ε) = 0 and var(ε) = Σ. Based on a random sample (X1, Y1), . . . , (Xn, Yn) satisfying the model, ordinary least squares estimates the parameter matrix B by minimizing the squared error loss function Σ_{i=1}^n ||Yi − BXi||² to obtain

B̂_ols = ( Σ_{i=1}^n Yi Xi^T ) ( Σ_{i=1}^n Xi Xi^T )^{−1}.    (1.1)

Reduced-rank regression introduces a rank constraint on B, so that (1.1) is minimized subject to the constraint rank(B) ≤ r, where r < min(p, q). The solution is B̂_RRR = Û_r Û_r^T B̂_ols, where Û_r holds the first r left singular vectors of the fitted values Ŷ = B̂_ols X [see, e.g., Reinsel and Velu (1998)].
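To make the construction concrete, the following minimal sketch (ours, not from the paper; plain NumPy on simulated data, and the function name reduced_rank_regression is ours) computes the OLS fit (1.1) and truncates it to rank r by projecting onto the leading left singular vectors of the fitted values.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Minimal sketch (ours): rank-r least squares fit of Y on X (rows are observations)."""
    # OLS estimate, i.e. (1.1): B_ols = (sum Y_i X_i^T)(sum X_i X_i^T)^{-1}
    B_ols = np.linalg.solve(X.T @ X, X.T @ Y).T          # q x p
    fitted = X @ B_ols.T                                 # n x q fitted values
    # Leading r left singular vectors of the fitted values
    U, _, _ = np.linalg.svd(fitted.T, full_matrices=False)
    U_r = U[:, :r]                                       # q x r
    return U_r @ U_r.T @ B_ols                           # rank at most r

# Illustration on simulated data
rng = np.random.default_rng(0)
n, p, q, r = 500, 6, 4, 2
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(q, r)) @ rng.normal(size=(r, p))    # rank r
Y = X @ B_true.T + 0.1 * rng.normal(size=(n, q))
print(np.linalg.matrix_rank(reduced_rank_regression(X, Y, r)))   # 2
```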
Reduced-rank regression has attracted attention as a regularization method
by introducing a shrinkage penalty on B. Moreover, it is used as a dimensionality
reduction method as it constructs latent factors in the predictor space that
explain the variance of the responses.
Anderson (1951) obtained the likelihood-ratio test of the hypothesis that the
rank of B is a given number and derived the associated asymptotic theory under
the assumption of normality of Y and non-stochastic X. Under the assumption
of joint normality of Y and X, Izenman (1975) obtained the asymptotic dis-
tribution of the estimated reduced-rank regression coefficient matrix and drew
connections between reduced-rank regression, principal component analysis and
correlation analysis. Specifically, principal component analysis coincides with
reduced-rank regression when Y = X (Izenman (2008)). The monograph by
Reinsel and Velu (1998) contains a comprehensive survey of the theory and
history of reduced rank regression, including in time series, and its many appli-
cations.
Despite the application potential of reduced-rank regression, it has received
limited attention in generalized linear models. Yee and Hastie (2003) were the
first to introduce reduced-rank regression to the class of multivariate generalized
linear models, which covers a wide range of data types for both the response and
predictor vectors, including categorical data. Multivariate, or vector, generalized linear models are the topic of Yee's (2015) book, which is accompanied by the associated R packages VGAM and VGAMdata. Yee and Hastie (2003) proposed
an alternating estimation algorithm. This was shown to result in the maximum
likelihood estimate of the parameter matrix in reduced-rank multivariate gen-
eralized linear models by Bura, Duarte and Forzani (2016). Asymptotic theory
for the restricted rank maximum likelihood estimates of the parameter matrix
in multivariate GLMs has not been developed yet.
In general, a maximum likelihood estimator is a concave M-estimator in the
sense that it maximizes the empirical mean of a concave criterion function.
Asymptotic theory for M-estimators defined through a concave function has
received much attention. Huber (1967), Haberman (1989) and Niemiro (1992)
are some of the classical references. More recently, Hjort and Pollard (2011)
presented a unified framework for the statistical theory of M-estimation for
convex criterion functions that are minimized over open convex sets of a Eu-
clidean space. Geyer (1994) studied M-estimators restricted to a closed subset
of a Euclidean space.
The rank restriction in reduced rank regression imposes constraints that have
not been studied before in M-estimation as they result in neither convex nor
closed parameter spaces. In this paper we (a) develop M-estimation theory for
concave criterion functions, which are maximized over parameter spaces that
are neither convex nor closed, and (b) apply the results from (a) to obtain
asymptotic theory for reduced rank regression estimators in generalized linear
models. Specifically, we derive the asymptotic distribution and properties of

maximum likelihood estimators in reduced-rank multivariate generalized linear


models where both the response and predictor vectors have a joint distribution.
The asymptotic theory we develop covers reduced-rank regression for linear
models as a special case.
Throughout, for a function f : R^q → R, ∇f(x) denotes the row vector ∇f(x) = (∂f(x)/∂x1, . . . , ∂f(x)/∂xq), f˙(x) stands for the column vector of derivatives, while ∇²f(x) denotes the symmetric matrix of second-order derivatives. For a vector-valued function f : R^{q1} → R^{q2}, ∇f(x) denotes the q2 × q1 matrix of partial derivatives, with (i, j) entry ∂f_i(x)/∂x_j for i = 1, . . . , q2 and j = 1, . . . , q1.

2. M-estimators

Let Z be a random vector taking values in a measurable space Z and distributed


according to the law P . We are interested in estimating a finite dimensional
parameter ξ0 = ξ0 (P ) using n independent and identically distributed copies
Z1 , Z2 , . . . , Zn of Z. In the sequel, we use P f to denote the mean of f (Z); i.e.,
P f = E[f (Z)].
Let the parameter of interest ξ0 be the maximizer of the map ξ 7→ P mξ
defined on Ξ, where Ξ is a subset of a Euclidean space. One can estimate ξ0
by maximizing an empirical version of the optimization criterion. Specifically,
given a known function mξ : Z → R, define

M (ξ) := P mξ and Mn (ξ) := Pn mξ , (2.1)

where, here and throughout, Pn denotes the empirical mean operator Pn V = n^{−1} Σ_{i=1}^n Vi. Hereafter, Mn(ξ) and Pn mξ will be used interchangeably as the criterion function, depending on which of the two is appropriate in a given setting.
Assume that ξ0 is the unique maximizer of the deterministic function M
in (2.1). An M-estimator for the criterion function Mn over Ξ is defined as
ξ̂n = ξ̂n(Z1, . . . , Zn) maximizing Mn(ξ) = (1/n) Σ_{i=1}^n mξ(Zi) over Ξ.    (2.2)

If the maximum of the criterion function Mn over Ξ is not attained but the
supremum of Mn over Ξ is finite, any value ξbn that almost maximizes the cri-
terion function, in the sense that it satisfies

Mn(ξ̂n) ≥ sup_{ξ∈Ξ} Mn(ξ) − An    (2.3)

for An small, can be used instead.


Definition 2.1. An estimator ξbn that satisfies (2.3) with An = op (1) is called
a weak M-estimator for the criterion function Mn over Ξ. When An = op (n−1 ),
ξbn is called a strong M-estimator.
Proposition A.1 in the Appendix lists the conditions for the existence, uniqueness and strong consistency of an M-estimator, as defined in (2.2), when Mn is concave and the parameter space is convex. Under regularity conditions, such as those stated in Theorem 5.23 of van der Vaart (2000), the asymptotic expansion and distribution of a consistent strong M-estimator (see Definition 2.1) for ξ0 is given by
√n(ξ̂n − ξ0) = (1/√n) Σ_{i=1}^n IF_{ξ0}(Zi) + op(1),    (2.4)

where IF_{ξ0}(Zi) = −V_{ξ0}^{−1} ṁ_{ξ0}(Zi), and V_{ξ0} is the nonsingular symmetric second derivative matrix of M(ξ) at ξ0.
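As a numerical illustration of this expansion (ours, not the paper's; it assumes the concave logistic-regression criterion mξ(z) = y x^T ξ − log(1 + exp(x^T ξ)) for z = (x, y)), the sketch below maximizes Pn mξ and estimates the asymptotic variance by the plug-in sandwich V̂^{−1}(Pn ṁ ṁ^T)V̂^{−1}.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch (ours): logistic-regression MLE as a concave M-estimator
rng = np.random.default_rng(1)
n, d = 2000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
xi_true = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ xi_true)))

def neg_Mn(xi):
    # -P_n m_xi for the logistic criterion; minimized instead of maximized
    eta = X @ xi
    return -np.mean(y * eta - np.logaddexp(0.0, eta))

xi_hat = minimize(neg_Mn, np.zeros(d), method="BFGS").x

p_hat = 1.0 / (1.0 + np.exp(-X @ xi_hat))
scores = (y - p_hat)[:, None] * X                       # m-dot_xi(Z_i), n x d
V_hat = -(X.T * (p_hat * (1 - p_hat))) @ X / n          # 2nd derivative of P_n m_xi
meat = scores.T @ scores / n                            # P_n (m-dot m-dot^T)
avar = np.linalg.solve(V_hat, meat) @ np.linalg.inv(V_hat)   # sandwich estimate
print(xi_hat, np.sqrt(np.diag(avar) / n))               # estimate and std. errors
```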

2.1. Restricted M-estimators

We now consider the optimization of Mn over Ξres ⊂ Ξ, where Ξres is the image
of a not necessarily injective function. Specifically, we restrict the optimization
problem to the set Ξres by requiring:
Condition 2.1. There exists an open set Θ ⊂ Rq and a map g : Θ → Ξ such
that ξ0 ∈ g(Θ) = Ξres .
Even when an M-estimator for the unrestricted problem as defined in (2.2)
exists, there is no a priori guarantee that the supremum is attained when con-
sidering the restricted problem. Nevertheless, Lemma 2.1 establishes the ex-
istence of a restricted strong M-estimator, regardless of whether the original
M-estimator is a real maximizer, weak or strong.
Lemma 2.1. Assume there exists a weak/strong M-estimator ξbn for the cri-
terion function Mn over Ξ. If Condition 2.1 holds, then there exists a strong
M-estimator ξbnres for the criterion function Mn over Ξres .
Theorem 2.1 next establishes the existence and consistency of a weak restricted M-estimator sequence when m is concave, under the same assumptions as those of the unrestricted problem in Proposition A.1 in the Appendix.
Theorem 2.1. Assume that Condition 2.1 holds. Then, under the assumptions of Proposition A.1 in the Appendix, there exists a strong M-estimator of the
restricted problem over Ξres that converges to ξ0 in probability.
We derive the asymptotic distribution of ξbnres = g(θbn ), with θbn ∈ Θ, next.
The constrained estimator ξ̂n^res is well defined under Lemma 2.1, even when θ̂n is not uniquely determined. If θ̂n were unique and √n(θ̂n − θ0) had an asymptotic distribution, one could use a Taylor series expansion for g to derive asymptotic

results for ξbnres . Building on this idea, Condition 2.2 introduces a parametrization
of a neighborhood of ξ0 that allows applying standard tools in order to obtain
the asymptotic distribution of the restricted M-estimator.
Condition 2.2. Given ξ0 ∈ g(Θ), there exist an open set M in Ξres with
ξ0 ∈ M ⊂ g(Θ), and (S, h), where S is an open set in Rqs , qs ≤ q, and
h : S → M is one-to-one, bi-continuous and twice continuously differentiable,
with ξ0 = h(s0 ) for some s0 ∈ S.
Under the setting in Condition 2.2, we will prove that sbn = h−1 (ξbnres ) is a
strong M-estimator for the criterion function Pn mh(s) over S. Then, we can ap-
ply Theorem 5.23 of van der Vaart (2000) to obtain the asymptotic behavior of
sbn , which, combined with a Taylor expansion of h about s0 , yields a linear expan-
sion for ξbnres . Finally, requiring Condition 2.3, which relates the parametrizations
(S, h) and (Θ, g), suffices to derive the asymptotic distribution of ξbnres in terms
of g.
Condition 2.3. Consider (Θ, g) as in Condition 2.1 and (S, h) as in Condition
2.2. For each θ0 ∈ g −1 (ξ0 ), span∇g(θ0 ) = span∇h(s0 ).
Condition 2.3 ascertains that T_{ξ0} = span ∇g(θ0) is well defined regardless of the fact that g^{−1}(ξ0) may contain multiple θ0's. Moreover, T_{ξ0} also agrees with span ∇h(s0). Consequently, the orthogonal projection Π_{ξ0}(V) onto T_{ξ0} with respect to the inner product defined by a positive definite matrix V satisfies

Π_{ξ0}(V) = ∇g(θ0)(∇g(θ0)^T V ∇g(θ0))^† ∇g(θ0)^T V    (2.5)
          = ∇h(s0)(∇h(s0)^T V ∇h(s0))^{−1} ∇h(s0)^T V,

where A^† denotes a generalized inverse of the matrix A. The gradient of g is not necessarily of full rank, in contrast to the gradient of h. In particular, when the inner product is defined by V_{ξ0}, the invertible matrix of second derivatives of P mξ at ξ0, the projection is denoted by Π_{ξ0} and is given by

Π_{ξ0} = ∇g(θ0)(∇g(θ0)^T V_{ξ0} ∇g(θ0))^† ∇g(θ0)^T V_{ξ0}.    (2.6)
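A small numerical check (ours; arbitrary matrices, not the paper's model) of the projection in (2.5)-(2.6): with a generalized inverse, the map is idempotent, fixes span ∇g(θ0), and is self-adjoint with respect to the V-inner product even when ∇g(θ0) is rank deficient.

```python
import numpy as np

# Sketch (ours): Pi = G (G' V G)^+ G' V projects onto span(G) w.r.t. <a,b> = a'Vb
rng = np.random.default_rng(2)
q, k = 6, 4
G = rng.normal(size=(q, k))               # plays the role of grad g(theta_0)
G[:, -1] = G[:, 0] + G[:, 1]              # make it rank deficient (rank 3)
A = rng.normal(size=(q, q))
V = A @ A.T + q * np.eye(q)               # a positive definite "V_{xi_0}"

Pi = G @ np.linalg.pinv(G.T @ V @ G) @ G.T @ V
print(np.allclose(Pi @ Pi, Pi))           # idempotent
print(np.allclose(Pi @ G, G))             # leaves span(G) fixed
print(np.allclose(V @ Pi, Pi.T @ V))      # self-adjoint w.r.t. the V-inner product
```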


Theorem 2.2. Assume that Conditions 2.1, 2.2 and 2.3 hold. Assume also that the unrestricted problem satisfies the regularity conditions A.1, A.2 and A.3 in the Appendix. Then, any strong M-estimator ξ̂n^res of the restricted problem that converges in probability to ξ0 satisfies

√n(ξ̂n^res − ξ0) = (1/√n) Σ_{i=1}^n Π_{ξ0} IF_{ξ0}(Zi) + op(1),    (2.7)

where Π_{ξ0} is defined in (2.6), and IF_{ξ0}(z) is the influence function of the unrestricted estimator defined in (2.4).
Moreover, √n(ξ̂n^res − ξ0) is asymptotically normal with mean zero and asymptotic variance

avar{√n(ξ̂n^res − ξ0)} = Π_{ξ0} V_{ξ0}^{−1} P(ṁ_{ξ0} ṁ_{ξ0}^T) V_{ξ0}^{−1} Π_{ξ0}^T.    (2.8)


As an aside, we conjecture that for estimators that are maximizers of a criterion function under restrictions that satisfy Conditions 2.1, 2.2 and 2.3, whenever (2.4) is true, (2.7) also holds. This can be important since the asymptotic distribution of restricted estimators can then be derived directly from the asymptotic distribution of the unrestricted ones.

3. Asymptotic theory for the maximum likelihood estimator in reduced-rank multivariate generalized linear models

In this section we show that maximum likelihood estimators for reduced-rank multivariate generalized linear models are restricted strong M-estimators for the conditional log-likelihood. Using results in Section 2.1, we obtain the existence, consistency and asymptotic distribution of maximum likelihood estimators in reduced-rank multivariate generalized linear models.

3.1. Exponential Family

Let Y = (Y1, . . . , Yq)^T be a q-dimensional random vector and assume that its distribution belongs to a k-parameter canonical exponential family with pdf (pmf)

fη (y) = exp{η T T (y) − ψ(η)}h(y), (3.1)

where T(y) = (T1(y), . . . , Tk(y))^T is a vector of known real-valued functions, h(y) ≥ 0 is a known nonnegative function and η ∈ R^k is the vector of natural parameters, taking values in

H = { η ∈ R^k : ∫ exp{η^T T(y)} h(y) dy < ∞ },    (3.2)

where the integral is replaced by a sum when Y is discrete. The natural parameter space H is assumed to be open and convex in R^k, and ψ is assumed to be strictly convex and infinitely differentiable on H. In particular,

∇ψ(η) = E_η(T(Y))^T and ∇²ψ(η) = var_η(T(Y)), for every η ∈ H.

Since ψ is strictly convex, var_η(T(Y)) is non-singular for every η ∈ H.
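For instance, for the Poisson family, T(y) = y and ψ(η) = exp(η), so both identities give exp(η); the following quick simulation check is ours, not from the paper.

```python
import numpy as np

# Sketch (ours): grad psi = E[T(Y)] and Hessian psi = var(T(Y)) for the Poisson family
rng = np.random.default_rng(3)
eta = 0.7
lam = np.exp(eta)               # = psi'(eta) = psi''(eta) for the Poisson family
y = rng.poisson(lam, size=200_000)
print(lam, y.mean(), y.var())   # mean and variance both close to exp(0.7)
```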

3.2. Multivariate generalized linear models

Let Z = (X, Y) be a random vector, where now Y ∈ R^q is a multivariate response and X ∈ R^p is a vector of predictors. The multivariate generalized
linear model postulates that the conditional distribution of Y given X belongs
to a k-parameter exponential family and that the natural parameters of the

conditional distribution depend linearly on the vector of predictors. Thus, the pdf (pmf) of Y | X = x is given by

fY |X=x (y) = exp{ηxT T (y) − ψ(ηx )}h(y), (3.3)

where ηx ∈ R^k depends linearly on x. Frequently, a subset of the natural parameters depends on the predictors while its complement is independent of x. The normal linear model with constant variance is such an example. To accommodate this structure, we partition the vector ηx indexing model (3.3) into ηx1 and ηx2, with k1 and k2 components, respectively, so that H = R^{k1} × H2, where H2 is an open convex subset of R^{k2}, and assume that

ηx = ( ηx1
       ηx2 )
   = ( η1 + βx
       η2 ),    (3.4)

where β ∈ R^{k1×p}, η1 ∈ R^{k1} and η2 ∈ H2, and η^T = (η1^T, η2^T) is the constant part of ηx1 and ηx2.
Let ξ = (η1^T, vec^T(β), η2^T)^T denote the vector of parameters of model (3.4). Suppose n independent and identically distributed copies of Z = (X, Y) satisfying (3.3) and (3.4), with true parameter vector ξ0, are available. Given a realization zi = (xi, yi), i = 1, . . . , n, the conditional log-likelihood, up to an additive term that does not depend on the parameter of interest, is

Ln(β; η) = (1/n) Σ_{i=1}^n { η_{xi}^T T(yi) − ψ(η_{xi}) }.    (3.5)

Let
m(β;η) (z) = ηxT T (y) − ψ(ηx ). (3.6)
By definition (2.2), the maximum likelihood estimator (MLE) of the parameter
indexing model (3.3) subject to (3.4) is an M-estimator.
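As a concrete instance (ours, not the paper's), take q conditionally independent Bernoulli coordinates with ηx = η1 + βx, T(y) = y and ψ(η) = Σ_j log(1 + exp(ηj)), so that k = k1 = q and there is no η2 block; the sketch below evaluates the criterion (3.5), with the helper name log_lik ours.

```python
import numpy as np

def log_lik(eta1, beta, X, Y):
    """Sketch (ours): average conditional log-likelihood (3.5) for the Bernoulli example."""
    Eta = eta1 + X @ beta.T                      # n x q natural parameters
    psi = np.logaddexp(0.0, Eta).sum(axis=1)     # psi(eta_{x_i}) for each observation
    return np.mean((Y * Eta).sum(axis=1) - psi)

# Toy data: n = 300 observations, p = 2 predictors, q = 3 binary responses
rng = np.random.default_rng(4)
n, p, q = 300, 2, 3
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=(q, p))
Y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true.T))))
print(log_lik(np.zeros(q), beta_true, X, Y))     # criterion evaluated at the truth
```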
Theorem 3.1 next establishes the existence, consistency and asymptotic normality of ξ̂n, the MLE of ξ0.
Theorem 3.1. Assume that Z = (X, Y) satisfies model (3.3) subject to (3.4) with true parameter ξ0. Under regularity conditions (A.20), (A.21), (A.22) and (A.23) in the Appendix, the maximum likelihood estimate of ξ0, ξ̂n, exists, is unique and converges in probability to ξ0. Moreover, √n(ξ̂n − ξ0) is asymptotically normal with covariance matrix

W_{ξ0} = [ E{ F(X)^T ∇²ψ(F(X)ξ0) F(X) } ]^{−1},    (3.7)

where

F(x) = ( (1, x^T) ⊗ I_{k1}   0
         0                   I_{k2} ),

and ∇²ψ is the k × k matrix of second derivatives of ψ.
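The sketch below (ours; it assumes a user-supplied function hessian_psi(eta) returning the k × k matrix ∇²ψ(η), and the helper names F_matrix and avar_estimate are ours) builds F(x) and a plug-in estimate of the covariance in (3.7).

```python
import numpy as np

def F_matrix(x, k1, k2):
    """Sketch (ours): F(x) = blockdiag((1, x') kron I_{k1}, I_{k2}) of Theorem 3.1."""
    f = np.concatenate(([1.0], x))                        # (1, x')
    top = np.kron(f[None, :], np.eye(k1))                 # k1 x k1*(p+1)
    return np.block([
        [top, np.zeros((k1, k2))],
        [np.zeros((k2, top.shape[1])), np.eye(k2)],
    ])

def avar_estimate(X, xi_hat, k1, k2, hessian_psi):
    """Plug-in estimate of W = [E F' psi'' F]^{-1}, evaluated at xi_hat."""
    info = 0.0
    for x in X:                                           # rows of X are observations
        F = F_matrix(x, k1, k2)
        info = info + F.T @ hessian_psi(F @ xi_hat) @ F
    return np.linalg.inv(info / len(X))
```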


3.3. Partial reduced rank multivariate generalized linear models

When the number of natural parameters or the number of predictors is large, the
precision of the estimation and/or the interpretation of results can be adversely
affected. A way to address this is to assume that the parameters live in a lower
dimensional space. That is, we assume that the vector of predictors can be
partitioned as x = (x1^T, x2^T)^T with x1 ∈ R^r and x2 ∈ R^{p−r}, and that the parameter corresponding to x1, β1 ∈ R^{k1×r}, has rank d < min{k1, r}. In this way, the natural parameters ηx in (3.3) are related to the predictors via

ηx = ( ηx1
       ηx2 )
   = ( η1 + β1 x1 + β2 x2
       η2 ),    (3.8)

where β1 ∈ R_d^{k1×r}, the set of matrices in R^{k1×r} of rank d, for known d < min{k1, r}, while β2 ∈ R^{k1×(p−r)} and β = (β1, β2).
Following Yee and Hastie (2003), we refer to the exponential conditional model (3.3) subject to the restrictions imposed in (3.8) as the partial reduced-rank multivariate generalized linear model. The reduced-rank multivariate generalized linear model is the special case of the partial reduced-rank model with β2 = 0.
To obtain the asymptotic distribution of the M-estimators for this reduced model, we will show that Conditions 2.1, 2.2 and 2.3 are satisfied for Ξres, (Θ, g), M and (S, h), which are defined next. To maintain consistency with the notation introduced in Section 2.1, we vectorize each matrix involved in the parametrization of our model and reformulate accordingly the parameter space for each vectorized object. With this understanding, we use the symbol ≅ to indicate that a matrix space component in a product space is identified with its image through the operator vec : R^{m×n} → R^{mn}.
For the non-restricted problem, β1 belongs to R^{k1×r}, so that the entire parameter ξ = (η1, β1, β2, η2) belongs to

Ξ ≅ R^{k1} × R^{k1×r} × R^{k1×(p−r)} × H2.    (3.9)

However, for the restricted problem, we assume that the true parameter ξ0 =
(η 01 , β01 , β02 , η 02 ) belongs to

Ξres ≅ R^{k1} × R_d^{k1×r} × R^{k1×(p−r)} × H2.    (3.10)

Let

Θ ≅ R^{k1} × { R_d^{k1×d} × R_d^{d×r} } × R^{k1×(p−r)} × H2    (3.11)

and consider g : Θ → Ξres, with (η1, (S, T), β2, η2) ↦ (η1, ST, β2, η2). Without loss of generality, we assume that β01 ∈ R_{first,d}^{k1×r}, the set of matrices in R_d^{k1×r} whose first d rows are linearly independent. Therefore,

β01 = ( Id
        A0 ) B0,   with A0 ∈ R^{(k1−d)×d} and B0 ∈ R_d^{d×r}.    (3.12)


This holds because the first d rows of β01, which constitute B0, are linearly independent, so that B0 has rank d and each of the remaining k1 − d rows of β01 is a linear combination of the rows of B0, giving the representation A0 B0. Consider

M ≅ R^{k1} × R_{first,d}^{k1×r} × R^{k1×(p−r)} × H2, and so ξ0 ∈ M ⊂ Ξres.    (3.13)

Finally, let

S ≅ R^{k1} × { R^{(k1−d)×d} × R_d^{d×r} } × R^{k1×(p−r)} × H2,    (3.14)

and h : S → M be the map

(η1, (A, B), β2, η2) ↦ ( η1, ( B
                               A B ), β2, η2 ).    (3.15)
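Numerically (a sketch of ours, not from the paper), h is invertible on its domain while g is not injective, since (S, T) and (SU, U^{−1}T) map to the same β1.

```python
import numpy as np

# Sketch (ours): the two parametrizations of a rank-d coefficient matrix
rng = np.random.default_rng(5)
k1, r, d = 5, 4, 2
A = rng.normal(size=(k1 - d, d))
B = rng.normal(size=(d, r))                    # full row rank with probability 1

beta1 = np.vstack([B, A @ B])                  # h-image: B stacked over A @ B
print(np.linalg.matrix_rank(beta1))            # rank d = 2, first d rows are B

# Non-injectivity of g: (S, T) and (S U, U^{-1} T) have the same product S T
S, T = np.vstack([np.eye(d), A]), B
U = rng.normal(size=(d, d))
print(np.allclose(S @ T, (S @ U) @ (np.linalg.inv(U) @ T)))   # True
```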

Proposition 3.1. Conditions 2.1, 2.2 and 2.3 are satisfied for ξ0, Ξ, Ξres, (Θ, g) and (S, h) defined in (3.9)–(3.15), respectively.
Using the partition of the covariate vector x = (x1 , x2 ) in (3.8), we can write

m(β;η) (z) = ηxT T (y) − ψ (ηx ) . (3.16)

Under the rank restriction on β1 in (3.8), the existence of the maximum likeli-
hood estimate cannot be guaranteed in the sense of an M-estimator as defined in
(2.2) with mξ = m(β;η) in (3.16), and Ξ replaced by Ξres in (3.10). However, using
Lemma 2.1 we can work with a strong M-estimator sequence for the criterion
function Pn m(β;η) over Ξres . Theorem 3.2 states our main contribution.
Theorem 3.2. Let ξ0 = (η01, β01, β02, η02) denote the true parameter value of ξ = (η1, β1, β2, η2). Assume that Z = (X, Y) satisfies model (3.3), subject to (3.8) with ξ0 ∈ Ξres defined in (3.10). Then, there exists a strong maximizing sequence for the criterion function Pn mξ over Ξres for mξ = m(β;η) defined in (3.16). Moreover, any weak M-estimator sequence {ξ̂n^res} converges to ξ0 in probability.
If {ξ̂n^res} is a strong M-estimator sequence, then √n(ξ̂n^res − ξ0) is asymptotically normal with covariance matrix

avar{√n(ξ̂n^res − ξ0)} = Π_{ξ0} W_{ξ0} Π_{ξ0}^T
                       = ∇g(θ0)(∇g(θ0)^T W_{ξ0}^{−1} ∇g(θ0))^† ∇g(θ0)^T
                       = Π_{ξ0}(W_{ξ0}^{−1}) W_{ξ0},

where W_{ξ0} is defined in (3.7).
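In practice one could plug estimates into the middle expression above; the following sketch (ours, with hypothetical inputs W_hat for the unrestricted covariance (3.7) and grad_g for ∇g evaluated at an estimate of θ0) does exactly that.

```python
import numpy as np

def reduced_rank_avar(W_hat, grad_g):
    """Sketch (ours): grad_g (grad_g' W_hat^{-1} grad_g)^+ grad_g'; grad_g may be rank deficient."""
    W_inv = np.linalg.inv(W_hat)
    middle = np.linalg.pinv(grad_g.T @ W_inv @ grad_g)   # Moore-Penrose inverse
    return grad_g @ middle @ grad_g.T
```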


Remark 3.1. Since Π_{ξ0}(W_{ξ0}^{−1}) is a projection, Π_{ξ0}(W_{ξ0}^{−1}) W_{ξ0} ≤ W_{ξ0}. That is, the eigenvalues of W_{ξ0} − Π_{ξ0}(W_{ξ0}^{−1}) W_{ξ0} are non-negative, so that using partial reduced-rank multivariate generalized linear models results in an efficiency gain.


Appendix A: Proofs

The consistency of M-estimators has long been established (see, for instance,
Theorem 5.7 in van der Vaart (2000)). The functions M (ξ) and Mn (ξ) are
defined in (2.1). Typically, the proof for the consistency of M-estimators assumes
that ξ0 , the parameter of interest, is a well-separated point of maximum of M ,
which is ascertained by assumptions (a) and (b) of Proposition A.1. Assumption
(c) of Proposition A.1 yields uniform consistency of Mn as an estimator of M ,
a property needed in order to establish the consistency of M-estimators.
Proposition A.1. Assume (a) mξ(z) is a strictly concave function in ξ ∈ Ξ, where Ξ is a convex open subset of a Euclidean space; (b) the function M(ξ) is well defined for any ξ ∈ Ξ and has a unique maximizer ξ0; that is, M(ξ0) > M(ξ) for any ξ ≠ ξ0; and (c) for any compact set K in Ξ,

E[ sup_{ξ∈K} |mξ(Z)| ] < ∞.    (A.1)

Then, for each n ∈ N, there exists a unique M-estimator ξbn for the criterion
function Mn over Ξ. Moreover, ξbn → ξ0 a.e., as n → ∞.
Proof. For each compact subset K of Ξ, {mξ : ξ ∈ K} is a collection of measur-
able functions which, by assumption (c), has an integrable envelope. Moreover,
for each fixed z, the map ξ 7→ mξ (z) is continuous, since it is concave and de-
fined on the open set Ξ. As stated in Example 19.8 of van der Vaart (2000),
these conditions guarantee that the class is Glivenko-Cantelli. That is,
pr( lim_{n→∞} sup_{ξ∈K} |Pn mξ − P mξ| = 0 ) = 1.    (A.2)

We first consider the deterministic case ignoring for the moment that {Mn } is
a sequence of random functions.
Since ξ0 belongs to the open set Ξ, there exists ε0 > 0 such that the closed
ball B[ξ0 , ε0 ] is contained in Ξ. Uniform convergence of {Mn } to M over K =
{ξ : kξ − ξ0 k = ε0 } guarantees that

lim_{n→∞} sup_{‖ξ−ξ0‖=ε0} {Mn(ξ) − Mn(ξ0)} = sup_{‖ξ−ξ0‖=ε0} {M(ξ) − M(ξ0)} < 0    (A.3)

because M(ξ0) > M(ξ) for any ξ ≠ ξ0. Then, for n ≥ n0(ε0),

sup_{‖ξ−ξ0‖=ε0} {Mn(ξ) − Mn(ξ0)} < 0.    (A.4)

Let ζn (ξ) = Mn (ξ) − Mn (ξ0 ). Since Mn is concave and continuous, ζn attains


its maximum over the compact set B[ξ0 , ε0 ], which we denote by ξbn . Since
ζn (ξ0 ) = 0 and ζn is strictly smaller than zero in the boundary of the ball, as

shown in (A.4), we conclude that ξbn ∈ B(ξ0 , ε0 ) so that ξbn is a local maximum
for ζn .
Let ξ satisfy kξ − ξ0 k > ε0 . The convexity of Ξ implies there exists t ∈ (0, 1)
such that ξe = (1 − t)ξ0 + tξ, which satisfies kξe − ξ0 k = ε0 , and therefore

ζn(ξ̂n) ≥ ζn(ξ0) = 0 > ζn(ξ̃) ≥ tζn(ξ) + (1 − t)ζn(ξ0) = tζn(ξ),    (A.5)

implying that ζn (ξ) < 0 ≤ ζn (ξbn ). Therefore, the maximum ξbn ∈ B(ξ0 , ε0 ) is
global. The strict concavity of Mn guarantees that such global maximum is
unique, thus ξbn is the unique solution to the optimization problem in (2.2).
By repeating this argument for any ε < ε0 , we prove the convergence of the
sequence {ξbn } to ξ0 .
Considering the stochastic case, the uniform convergence of Mn to M over
K = B[ξ0 , ε0 ] on a set Ω1 , with pr(Ω1 ) = 1, as assumed in (A.2), guarantees
the deterministic result can be applied to any element of Ω1 , which obtains the
result.
Proof of Lemma 2.1. Let {ξbn } be any (weak/strong) M-estimator sequence of
the unconstrained maximization problem. Since Mn (ξbn ) ≥ supξ∈Ξ Mn (ξ) − An
with An → 0, we have
1 = lim_{n→∞} pr( sup_{ξ∈Ξ} Pn mξ < ∞ ) ≤ lim_{n→∞} pr( sup_{ξ∈Ξres} Pn mξ < ∞ ).    (A.6)

Define
Ωn := { sup_{ξ∈Ξres} Pn mξ < ∞ }.    (A.7)

On Ωn, for all n, there exists ξ̂n^res such that

Mn(ξ̂n^res) ≥ sup_{ξ∈Ξres} Mn(ξ) − 1/n².    (A.8)

Let ξ̂n^res = 0 on Ωn^c. Then, since pr(Ωn) → 1, {ξ̂n^res} is a strong M-estimator for the criterion function Mn over Ξres.
Proof of Theorem 2.1. We start from the deterministic case. We assume

Mn(ξ̂n) ≥ sup_{ξ∈Ξ} Mn(ξ) − An,    (A.9)

where Mn is as in (2.1), An is a sequence of real numbers with An → 0, and ξ̂n exists thanks to Proposition A.1. As in the proof of Proposition A.1, define ζn(ξ) = Mn(ξ) − Mn(ξ0) to obtain

sup_{‖ξ−ξ0‖=ε} ζn(ξ) ≤ (1/2) sup_{‖ξ−ξ0‖=ε} {M(ξ) − M(ξ0)} := −δ(ε)    (A.10)

for n large enough. Under Condition 2.1, ξ0 ∈ Ξres and, by Proposition A.1 and (A.9),

ζn(ξ̂n^res) = Mn(ξ̂n^res) − Mn(ξ0) ≥ Mn(ξ̂n^res) − sup_{ξ∈Ξres} Mn(ξ) ≥ −An.    (A.11)

Since An → 0, −An > −δ(ε) for n large enough. Combining this with (A.10) and (A.11) obtains

sup_{‖ξ−ξ0‖=ε} ζn(ξ) < ζn(ξ̂n^res).

We can deduce that ‖ξ̂n^res − ξ0‖ < ε once we prove that

sup_{‖ξ−ξ0‖=ε} ζn(ξ) = sup_{‖ξ−ξ0‖≥ε} ζn(ξ).    (A.12)

To do so, choose ξ with ‖ξ − ξ0‖ > ε, and take t ∈ (0, 1) such that ξ̃ = (1 − t)ξ̂n + tξ is at distance ε from ξ0, where ξ̂n is the maximizer of ζn over Ξ, as defined in Proposition A.1, which is assumed to be at distance smaller than ε from ξ0. Then,

ζn(ξ̃) = ζn((1 − t)ξ̂n + tξ) ≥ (1 − t)ζn(ξ̂n) + tζn(ξ) ≥ ζn(ξ).

Thus,

sup_{‖ξ−ξ0‖=ε} ζn(ξ) ≥ ζn(ξ̃) ≥ ζn(ξ), for any ξ with ‖ξ − ξ0‖ > ε,

which in turn yields (A.12). When An = op(1), convergence in probability of {ξ̂n^res} to ξ0 is equivalent to the existence of an almost everywhere convergent sub-subsequence for every subsequence {ξ̂nk^res}. Therefore, by applying the deterministic result to the set of probability one on which there exists a sub-subsequence of Ank that converges to zero a.e., we obtain the result.

Regularity conditions for Theorem 2.2 (from Theorem 5.23, p. 53 in van der Vaart (2000))
Condition A.1. For each ξ in an open subset Ξ of a Euclidean space, mξ(z) is a measurable function in z such that mξ(z) is differentiable at ξ0 for almost every z with derivative ṁ_{ξ0}(z).
Condition A.2. There exists a measurable function φ(z) with Pφ² < ∞ such that, for any ξ1 and ξ2 in a neighborhood of ξ0,

|m_{ξ1}(z) − m_{ξ2}(z)| ≤ φ(z)‖ξ1 − ξ2‖.    (A.13)

Condition A.3. The map ξ ↦ P mξ admits a second order Taylor expansion at a point of maximum ξ0 with nonsingular symmetric second derivative matrix V_{ξ0}.


Under regularity conditions A.1, A.2 and A.3, van der Vaart proved in Theorem 5.23 of his book (van der Vaart (2000)) that if {ξ̂n} is a strong M-estimator sequence for the criterion function Pn mξ over Ξ and ξ̂n → ξ0 in probability, then

√n(ξ̂n − ξ0) = −V_{ξ0}^{−1} (1/√n) Σ_{i=1}^n ṁ_{ξ0}(Zi) + op(1).    (A.14)

Moreover, √n(ξ̂n − ξ0) is asymptotically normal with mean zero and

avar{√n(ξ̂n − ξ0)} = V_{ξ0}^{−1} P(ṁ_{ξ0} ṁ_{ξ0}^T) V_{ξ0}^{−1}.    (A.15)

This result will be invoked in the following proofs.


Proof of Theorem 2.2. Assume that {ξ̂n^res} is a sequence in Ξres that converges in probability to ξ0 ∈ M, which is assumed to be open in Ξres. Then, pr(ξ̂n^res ∈ M) → 1. Bicontinuity of h guarantees that s*_n = h^{−1}(ξ̂n^res) converges in probability to s0 = h^{−1}(ξ0). Note that

Pn m_{h(s*_n)} = Pn m_{ξ̂n^res} ≥ sup_{ξ∈Ξres} Pn mξ ≥ sup_{ξ∈M} Pn mξ ≥ sup_{s∈S} Pn m_{h(s)},    (A.16)

except for an op(n^{−1}) term, coming from the defining property of the strong M-estimator ξ̂n^res, that is omitted in the first inequality. Therefore, {s*_n} is a strong maximizing sequence for the criterion function Pn m_{h(s)} over S.
We next verify that Conditions A.1, A.2 and A.3 are satisfied for {s*_n}, s0, m_{h(s)}(z) and P m_{h(s)}. Condition A.1 holds since m_{h(s)} is a measurable function in z for all s ∈ S and m_{h(s)}(z) is differentiable at s0 for almost every z. In fact, h(s0) = ξ0, m is differentiable at ξ0 and h is also differentiable; the derivative is ∇h(s0)^T ṁ_{ξ0}.
For all s1 and s2 in a neighborhood of s0, by the continuity of h, h(s1) and h(s2) are in a neighborhood of ξ0. Then

|m_{h(s1)}(z) − m_{h(s2)}(z)| ≤ φ(z)‖h(s1) − h(s2)‖ ≤ φ(z)‖∇h‖_{∞,N_{s0}} ‖s1 − s2‖,

where ‖∇h‖_{∞,N_{s0}} denotes the maximum of ‖∇h(s)‖ over a neighborhood N_{s0} of s0. The first inequality holds because the corresponding condition is valid in the unconstrained problem, and the second follows since h is continuously differentiable at s0. Thus, the Lipschitz Condition A.2 is satisfied.
For Condition A.3, we observe that the function s ↦ P m_{h(s)} is twice continuously differentiable at s0 because both P mξ and h(s) satisfy the required regularity properties at ξ0 and s0, respectively. Moreover, since P ṁ_{ξ0} = 0, the second derivative matrix of P m_{h(s)} at s0 is W_{s0} = ∇h(s0)^T V_{ξ0} ∇h(s0), where V_{ξ0} is the second derivative matrix of P mξ at ξ0. The matrix W_{s0} is nonsingular and symmetric because ∇h is of full rank and V_{ξ0} is nonsingular and symmetric.


We can now apply Theorem 5.23 in van der Vaart (2000), and obtain

√n(s*_n − s0) = −(∇h(s0)^T V_{ξ0} ∇h(s0))^{−1} (1/√n) Σ_{i=1}^n ∇h(s0)^T ṁ_{ξ0}(Zi) + op(1),    (A.17)

so that the first order Taylor series expansion of h(s*_n) around s0 is

√n(h(s*_n) − h(s0)) = ∇h(s0) √n(s*_n − s0) + op(1)
    = −(1/√n) Σ_{i=1}^n ∇h(s0) (∇h(s0)^T V_{ξ0} ∇h(s0))^{−1} ∇h(s0)^T ṁ_{ξ0}(Zi) + op(1)
    = (1/√n) Σ_{i=1}^n Π_{ξ0} IF_{ξ0}(Zi) + op(1),

for Π_{ξ0} and IF_{ξ0}(z) defined in (2.6) and (2.4), respectively, which obtains the expansion in (2.7).

Proof of Theorem 3.1. Write

η_{x1} = η1 + βx = (η1, β) f(x) = (f(x)^T ⊗ I_{k1}) vec(η1, β),

where f(x)^T = (1, x^T) ∈ R^{1×(p+1)}. Then, in matrix form,

ηx = ( η1 + βx
       η2 )
   = ( f(x)^T ⊗ I_{k1}   0       ) ( vec(η1, β)
       0                 I_{k2}  )   η2          ) = F(x) ξ,    (A.18)

where

F(x) = ( f(x)^T ⊗ I_{k1}   0
         0                 I_{k2} ) ∈ R^{k×(k1(p+1)+k2)},

and ξ = (η1^T, vec^T(β), η2^T)^T is the vector of parameters of model (3.4). Note that F(x)ξ ∈ H for any value of ξ. This notation allows us to simplify the expression for the log-likelihood function in (3.6), and replace it with

mξ(z) = T(y)^T F(x) ξ − ψ(F(x) ξ).    (A.19)

The regularity conditions required to derive the consistency and asymptotic distribution of the MLE are: for any ξ and any η,

pr(F(X)ξ = η) < 1.    (A.20)

E‖T(Y)^T F(X)‖ < ∞,   E[ sup_{ξ∈K} |ψ(F(X)ξ)| ] < ∞,    (A.21)

for any compact K ⊂ Ξ = R^{k1} × R^{k1 p} × H2;

E‖T(Y)^T F(X)‖² < ∞,   E[ sup_{ξ∈K} ‖∇ψ(F(X)ξ) F(X)‖² ] < ∞,    (A.22)

E[ sup_{ξ∈K} ‖F(X)^T ∇²ψ(F(X)ξ) F(X)‖ ] < ∞,    (A.23)

for any compact K ⊂ Ξ = R^{k1} × R^{k1 p} × H2.


To prove the existence, uniqueness and consistency of the MLE under the
present model, ξbn , we next show that the assumptions stated in Proposition
A.1 are satisfied.
The strict convexity of ψ implies that, for each fixed z, mξ (z) is a strictly
concave function in ξ ∈ Ξ = R^{k1+pk1} × H2. The concavity of mξ (z) is preserved
under expectation, thus M (ξ) = P mξ is concave. The identifiability condition
satisfied by the exponential family in Section 3.1 allows applying Lemma 5.35
of van der Vaart (2000, p. 62) to conclude that

E[ T(Y)^T F(X)ξ − ψ(F(X)ξ) | X ] ≤ E[ T(Y)^T F(X)ξ0 − ψ(F(X)ξ0) | X ].    (A.24)
Taking expectation with respect to X, we conclude that P mξ ≤ P mξ0 for any
ξ. Moreover, if P mξ1 = P mξ0 , pr(F (X)ξ1 = F (X)ξ0 ) = 1, which contradicts
the hypothesis (A.20). Finally, the integrability of (A.1) follows from (A.21).
The conditions A.1, A.2 and A.3 required by van der Vaart’s Theorem 5.23
to derive the asymptotic distribution of M-estimators are easily verifiable under
the integrability assumptions stated in (A.22) and (A.23).
The second derivative matrix of P mξ at ξ0 is

V_{ξ0} = −E[ F(X)^T ∇²ψ(F(X)ξ0) F(X) ].    (A.25)

Finally, observe that

P(ṁ_{ξ0} ṁ_{ξ0}^T) = E[ F(X)^T (T(Y) − ∇^Tψ(F(X)ξ0)) (T(Y)^T − ∇ψ(F(X)ξ0)) F(X) ]
                   = E[ F(X)^T ∇²ψ(F(X)ξ0) F(X) ]

to deduce that, according to the general formula for the asymptotic variance of an M-estimator in (A.15), the asymptotic variance of √n(ξ̂n − ξ0) is given by (3.7).
Lemmas A.1–A.5 and Corollary A.1.1 are required to prove Proposition 3.1.
Lemma A.1. Assume that β01 ∈ R_{first,d}^{k1×r} and that it can be written as in (3.12). Let (S0, T0) ∈ R_d^{k1×d} × R_d^{d×r} with S0 T0 = β01. Then, there exists U ∈ R_d^{d×d} so that S0 = S0(U) and T0 = T0(U), with

S0(U) := ( U
           A0 U )   and   T0(U) := U^{−1} B0.    (A.26)


Proof. Write

S0 = ( S01
       S02 )

with S01 ∈ R^{d×d} and S02 ∈ R^{(k1−d)×d}. The matrix S01 T0 comprises the first d rows of β01, which are linearly independent. Then, d = rank(S01 T0) ≤ rank(S01) ≤ d, hence S01 is invertible. Take U = S01. From the expression (3.12) for β01, we have

S01 T0 = B0,
S02 T0 = A0 B0.

Thus, T0 = U^{−1} B0, and since T0 T0^T ∈ R^{d×d} has rank d,

S02 = A0 B0 T0^T (T0 T0^T)^{−1} = A0 B0 B0^T U^{−T} (U^{−1} B0 B0^T U^{−T})^{−1} = A0 U.

Corollary A.1.1. Let β01 ∈ R_{first,d}^{k1×r} be written as in (3.12). For U in R_d^{d×d} and S0(U), T0(U) as defined in (A.26), the pre-image of ξ0 through g, g^{−1}(ξ0), satisfies

g^{−1}(ξ0) ≅ { (η01, S0(U), T0(U), β02, η02) : U ∈ R_d^{d×d} }.    (A.27)

Lemma A.2. R_d^{d×m} is an open set in R^{d×m}.
Proof. We show that the complement of R_d^{d×m} is closed. Consider a sequence (Tn)_{n≥1} ⊂ R^{d×m}, each Tn of rank strictly less than d, and assume that Tn converges to T ∈ R^{d×m}. Note that |Tn Tn^T| = 0 for all n ≥ 1, so that |T T^T| is also equal to zero. Hence, rank(T) < d.

Lemma A.3. Θ and S are open sets.
Proof. Up to a homeomorphism, Θ and S are equivalent to

R^{k1} × R_d^{k1×d} × R_d^{d×q} × R^{k1×(p−q)} × H2   and   R^{k1} × R^{(k1−d)×d} × R_d^{d×q} × R^{k1×(p−q)} × H2,

respectively, which are products of open sets by Lemma A.2.

Lemma A.4. h : S → M is one-to-one and bicontinuous.


Proof. It suffices to note that h1 : R^{(k1−d)×d} × R_d^{d×q} → R_{first,d}^{k1×q} with

(A, B) ↦ ( B
           A B )

satisfies the required properties for h. Let h1^{−1} : R_{first,d}^{k1×q} → R^{(k1−d)×d} × R_d^{d×q} with

( B
  C ) ↦ ( C B^T (B B^T)^{−1}, B ).    (A.28)

Then, h1^{−1}(h1(A, B)) = (A, B) and

h1( h1^{−1}( β1 ) ) = β1   for β1 = ( B
                                     A B ),

which together show that h1^{−1} is the inverse of h1. Thus, h1 is one-to-one and bicontinuous.
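A quick numerical check of (A.28) (ours, not from the paper): the bottom block C = AB returns A via C B^T(B B^T)^{−1}.

```python
import numpy as np

# Sketch (ours): verify the inverse map (A.28) on random blocks
rng = np.random.default_rng(6)
k1, q, d = 5, 4, 2
A = rng.normal(size=(k1 - d, d))
B = rng.normal(size=(d, q))                    # rank d with probability 1
C = A @ B                                      # bottom block of h1(A, B)
print(np.allclose(C @ B.T @ np.linalg.inv(B @ B.T), A))   # recovers A: True
```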

Lemma A.5. M is open in Ξres.
Proof. It suffices to show that R_{first,d}^{k1×q} is an open set in R_d^{k1×q} or, equivalently, that the complement of R_{first,d}^{k1×q} is a closed set in R_d^{k1×q}. Let {Tn} ⊂ R_d^{k1×q} be a sequence whose elements have first d rows that are not linearly independent, and assume that Tn converges to T ∈ R_d^{k1×q}. Writing

Tn = ( Tn1
       Tn2 )

with Tn1 ∈ R^{d×q} and Tn2 ∈ R^{(k1−d)×q}, we obtain that |Tn1 Tn1^T| = 0 for all n. Then |T1 T1^T| = 0, where T1 comprises the first d rows of T, and therefore the first d rows of T are not linearly independent.

Proof of Proposition 3.1. We verify that Conditions 2.1, 2.2 and 2.3 are satis-
fied.
Condition 2.1: By Lemma A.3, the set Θ, defined in (3.11), is open and ξ0
belongs to g(Θ) = Ξres .
Condition 2.2: Since the first d rows of β01 are linearly independent, ξ0 ∈ M.
Then, applying Lemma A.5 obtains that M is an open set in Ξres .
Consider (S, h) as in (3.14). By Lemma A.3, S is an open set, and h is
one-to-one and bicontinuous by Lemma A.4. Furthermore, if

s0 = (η 01 , A0 , B0 , β02 , η 02 ) ∈ S,

where A0 and B0 are given in (3.12), then h(s0 ) = (η 01 , β01 , β02 , η 02 ) = ξ0 .


The function h is twice continuously differentiable and its Jacobian is of full rank. In fact, the latter is

∇h(η1, A, B, β2, η2) =
( I_{k1}   0                          0                   0              0
  0        B^T ⊗ ( 0 ; I_{k1−d} )     I_q ⊗ ( Id ; A )    0              0
  0        0                          0                   I_{k1(p−q)}   0
  0        0                          0                   0              I_{k2} ),    (A.29)

where ( 0 ; I_{k1−d} ) and ( Id ; A ) denote the k1 × (k1 − d) and k1 × d matrices obtained by stacking the indicated blocks. It is of order (k1 p + k) × (d(k1 + q − d) + k1(p − q) + k) and has full column rank d(k1 + q − d) + k1(p − q) + k [see Cook and Ni (2005); this is also a direct implication of Theorem 5 in Puntanen et al. (2011)].
Condition 2.3: Consider θ0 ∈ g^{−1}(ξ0), associated with U as shown in Lemma A.1. Since

∇g(η1, S, T, β2, η2) =
( I_{k1}   0              0           0              0
  0        T^T ⊗ I_{k1}   I_q ⊗ S     0              0
  0        0              0           I_{k1(p−q)}    0
  0        0              0           0              I_{k2} ),

then

∇g(η1, S0(U), T0(U), β2, η2) =
( I_{k1}   0                0                     0              0
  0        B0^T ⊗ I_{k1}    I_q ⊗ ( Id ; A0 )     0              0
  0        0                0                     I_{k1(p−q)}    0
  0        0                0                     0              I_{k2} ) M(U),

where

M(U) = blockdiag( I_{k1}, U^{−T} ⊗ I_{k1}, I_q ⊗ U, I_{k1(p−q)}, I_{k2} ).    (A.30)

Since M(U) is invertible of order (d(k1 + q) + k1(p − q) + k) × (d(k1 + q) + k1(p − q) + k),

span ∇g(θ0) = span ( I_{k1}   0                0                     0              0
                     0        B0^T ⊗ I_{k1}    I_q ⊗ ( Id ; A0 )     0              0
                     0        0                0                     I_{k1(p−q)}    0
                     0        0                0                     0              I_{k2} ).

Next, since ∇h(s0) = ∇g(θ0) M(U)^{−1} Ĩ, with

Ĩ = blockdiag( I_{k1}, Id ⊗ ( 0 ; I_{k1−d} ), I_{qd}, I_{k1(p−q)}, I_{k2} ),

we have span ∇h(s0) ⊂ span ∇g(θ0). Applying Theorem 5 in Puntanen et al. (2011) obtains rank(∇g(θ0)) = d(k1 + q − d) + k1(p − q) + k = rank(∇h(s0)). Therefore, span ∇h(s0) = span ∇g(θ0).

Proof of Theorem 3.2. In Theorem 3.1 we verified that the conditions of Theorem 5.23 in van der Vaart (2000) are satisfied for the maximum likelihood estimator under model (3.3) subject to (3.4). The asymptotic variance of the MLE is given in (3.7). We also showed that Conditions 2.1, 2.2 and 2.3 are satisfied in the proof of Proposition 3.1. The result follows from Theorem 2.2.

References

Anderson, T. W. (1951). Estimating linear restrictions on regression coefficients


for multivariate normal distributions. The Annals of Mathematical Statistics,
22, 327–351.
Bura, E., Duarte, S. and Forzani, L. (2016). Sufficient Reductions in Regres-
sions With Exponential Family Inverse Predictors. Journal of the American
Statistical Association 111, 1313–1329.
Cook, R. D. and Ni, L. (2005). Sufficient dimension reduction via inverse regres-
sion: a minimum discrepancy approach. Journal of the American Statistical
Association, 100, 410–428.
Geyer, C. J. (1994). On the asymptotics of constrained M-estimation. The An-
nals of Statistics, 22, 1993–2010.
Haberman, S. J. (1989). Concavity and estimation. The Annals of Statistics,
17, 1631–1661.
Hjort, N. L. and Pollard, D. (2011). Asymptotics for minimisers of convex pro-
cesses. arXiv:1107.3806.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under non-
standard conditions. In Proceedings of the fifth Berkeley symposium on math-
ematical statistics and probability (Vol. 1, No. 1, pp. 221–233).
Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear
model. Journal of Multivariate Analysis, 5, 248–264.
Izenman, A. J. (2008). Modern Multivariate Statistical Techniques: Regression, Classification and Manifold Learning. New York: Springer.
Niemiro, W. (1992). Asymptotics for M-estimators defined by convex minimization. The Annals of Statistics, 20, 1514–1533.
Puntanen, S., Styan, G. P., and Isotalo, J. (2011). Matrix tricks for linear statistical models: our personal top twenty. Springer Science and Business Media.
Reinsel, G. C. and Velu, R. P. (1998). Multivariate Reduced-Rank Regression. Lecture Notes in Statistics. New York: Springer.
Van der Vaart, A. W. (2000). Asymptotic statistics. Cambridge University Press.
Yee, T. W. (2015). Vector generalized linear and additive models. Springer.
Yee, T. W. and Hastie, T. J. (2003). Reduced-rank Vector Generalized Linear
Models. Statistical Modelling, 3, 15–41.
