Semiparametric Gaussian Copula Regression Modelling for Mixed Data Types (SGCRM)

Debangan Dey, Vadim Zipunnikov


February 2019

Abstract
Many clinical and epidemiological studies encode participant information via a collection of
continuous, truncated, ordinal, and binary variables. To gain novel insights into the complex
interactions between collected variables, there is a critical need for flexible frameworks for joint
and conditional modeling of mixed data types. We propose Semiparametric Gaussian Copula Regression
Modelling (SGCRM), which models the joint dependence structure between observed continuous,
truncated, ordinal, and binary variables and constructs conditional models with these four data types
as outcomes, with a guarantee that the joint and conditional models are mutually consistent. SGCRM
assumes a semiparametric Gaussian copula mechanism that generates observed variables by monotonically
transforming the marginals of a latent multivariate normal random variable and then
dichotomizing/truncating those transformed variables. SGCRM estimates the correlation matrix of the
latent normal variables through an inversion of “bridges” between Kendall’s Tau rank correlations of
the observed mixed data type variables and the latent Gaussian correlations. We propose
computationally efficient methods to predict latent variables and to impute missing data. We establish
the asymptotic normality of the SGCRM estimators and provide a computationally efficient way to
calculate their asymptotic variance. Using the National Health and Nutrition Examination Survey
(NHANES), we illustrate SGCRM and compare it with traditional conditional regression models, including
truncated Gaussian regression, ordinal probit, and probit models.

1 Introduction
Clinical and epidemiological studies, as well as health surveys, collect a large number of health
outcomes together with physiological and clinical measures. This information is typically encoded via
a collection of continuous, truncated, ordinal, and binary variables. As a main motivating example, we
consider the National Health and Nutrition Examination Survey (NHANES), a cross-sectional, nationally
representative survey that assesses demographic, dietary, and health-related questions and can be used
to better understand trends in health and nutrition. For example, self-reported current health status
(HSD010, on a scale of 1-5, 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor) is an ordinal
variable. Self-reported mobility problem (NAME, 0/1 = no/yes for mobility difficulty) and follow-up
mortality status (mortstat, 0/1 = alive/deceased at the follow-up) are examples of binary variables. In
the 2003-2006 waves, NHANES measured physical activity on more than 10000 participants using
accelerometers worn by the participants for seven days (REF). At the participant level,
accelerometry-measured activity is often summarized using two variables: Total Activity Count (TAC), a
continuous measure of the total volume of physical activity, and the time spent doing Vigorous Physical
Activity (VPA), which is typically seen as a truncated variable, as many participants do zero minutes
of high-intensity physical activity. Figure S2 shows a scatterplot matrix of these five variables. Many
NHANES studies have used these five variables as outcomes in linear regression models (TAC), truncated
regression (VPA), ordinal regression (health status), and logistic regression models (mobility
difficulty and mortality status). Although the results of those models have been summarized in many
reviews (REF), the models, and potentially their results, are not necessarily mutually consistent. To
better understand the complex interrelationships between human health behaviors and various health
outcomes using studies like NHANES, we first need to understand the complex interdependencies between
variables of mixed data types by developing flexible frameworks for their joint modelling.
Due to a lack of standard joint models for multivariate mixed data types, conditional modelling is
frequently used instead. Conditional models fix one of the variables as an outcome and model the mean
of this outcome as a function of the other variables. When the outcome is binary, the traditional
approach is to model the conditional mean of the outcome as a function of a linear combination of the
other variables; logistic and probit regressions are the two most popular models. If the outcome is
ordinal, it is essential to capture the directionality, or the increase in order, of that outcome. The
most widely used ordinal regression model is the cumulative ordinal regression model (McCullagh, 1980).
It assumes that the truncation of an underlying latent continuous variable gives rise to the observed
categories. The coefficients in a cumulative model are easily interpretable in terms of transformed
odds. However, when we consider different coefficients for different levels of the outcome, the
estimation becomes difficult due to the ordering constraint imposed on the estimated coefficients. For
truncated outcomes, the econometrics literature first introduced the popular Tobit models (Tobin, 1958;
Heckman, 1976; Hausman and Wise, 1977), which assume that the outcome originates from the truncated
observation of a latent normal variable. The conditional mean of the latent variable is then modeled as
a linear function of the predictors. The Tobit model was generalized as the Hurdle model (Cragg, 1971),
where conditional models are fit separately for the truncated and non-truncated outcomes at the cost of
introducing more parameters. Both of these models require likelihood-based numerical estimation
approaches and can suffer from convergence issues.
The conditional models discussed above can only model the conditional mean functions of the outcome
distribution given other variables. Hence, they can only draw inferences from a subset of the available
information. In contrast, modeling the joint distribution of the data can give us a more comprehensive
view of the data. To address this issue, researchers have developed a class of factorization models
that specify the marginal distribution of a set of variables and the conditional distribution of the
remaining variables given that set. A popular example is General Location models (GLOMs) (Olkin and
Tate, 1961), which enforce conditional normality for continuous variables and an arbitrary distribution
for discrete components. Another example is conditional grouped continuous models (CGCMs) (Anderson
and Pemberton, 1985; de Leon and Carrière, 2007; De Leon, 2005), which assume that the discrete
variables are derived by truncating a latent multivariate continuous distribution. These models employ
polychoric and polyserial correlations to estimate the joint covariance structure. Even though
factorization models provide a convenient way of specifying mixed distributions, they induce a
hierarchy in the data that depends on the direction of conditioning. Different factorizations for the
same set of variables can lead to different interpretations of the estimated parameters and different
inferences for associations.
Another popular direction employs copulas (Song et al., 2009) that define Vector Generalized Linear
Models (VGLMs) for modeling mixed data type outcomes. This approach requires embedding the marginal
distributions (univariate Generalized Linear Models) into the joint distribution function via a
Gaussian copula. Gaussian copulas are a popular choice for coupling marginal distributions because of
their analytical tractability and flexibility. Jiryaie et al. (2016) introduced Gaussian copula
distributions (GCD) that take a latent variable approach to embed discrete variables using the Gaussian
copula. However, CGCMs, GLOMs, VGLMs, and GCDs require likelihood-based inference and are
computationally intensive in high dimensions. Pairwise likelihood-based approaches (De Leon, 2005;
Jiryaie et al., 2016) reduce the computational burden but can yield worse classification than the full
likelihood-based approach (Jiryaie et al., 2016). It is more desirable to have a joint modeling
framework that treats variables of mixed data types symmetrically and is scalable to high dimensions.
Recently, semiparametric models have seen wide adoption for joint modelling of multivariate mixed type
data. Wang and Hua (2014) used likelihood-based inference and Cai and Zhang (2015) developed a
rank-based approach to estimate a joint semiparametric Gaussian copula for continuous variables. Fan et
al. (2016) extended the rank-based approaches to perform quantile regression on continuous variables.
Rank-based estimation of the covariance of the semiparametric Gaussian copula family (Liu et al., 2009,
2012) has been particularly attractive because the estimation procedure is fast and robust. Multiple
extensions have therefore used the latent semiparametric Gaussian copula to model mixed types of data.
Fan et al. (2017) developed the estimation for the case of binary and continuous variables. Yoon et al.
(2018) extended the approach to include truncated variables, and Quan et al. (2018) additionally
extended it to include ternary variables (ordinal variables with three categories) and general
ordinal-continuous pairs of variables. Feng and Ning (2019) represented an ordinal variable via multiple
dummy binary variables and took a weighted correlation approach to recover the latent correlation for
ordinal pairs with more than three categories, whereas Zhang et al. (2018) arrived at an incorrect
bridging function when trying to tackle the general ordinal case. Huang et al. (2021) provided an R
package that speeds up the computation of latent correlations between pairs of binary, ternary,
truncated, and continuous variables using a numerical interpolation approach. However, this fast
algorithm also requires knowledge of the original analytic function. Unfortunately, no existing
approach can handle a general ordinal variable within this framework. Our first contribution closes
this gap by providing bridging formulas for the general case of an ordinal variable.
Our next contribution is a semiparametric Gaussian copula joint modeling framework that treats mixed
variables symmetrically and is scalable to high dimensions. The main advantages of the proposed
framework are as follows: i) joint modeling for mixed data types; ii) mutually consistent conditional
modeling that is an alternative to a wide range of conditional models, including linear regression,
logistic regression, logistic-ordinal, and truncated outcome models; iii) latent representations
provide natural normalization/scaling of mixed data types; iv) the approach is semiparametric and
likelihood free, so it only requires estimating pairwise Kendall’s Tau correlations; v) the
interpretation of regression coefficients is familiar and based on the $R^2$ interpretation of
predicted variability on the latent space; vi) the approach allows $R^2$ to be defined for all four
types of outcomes and, in addition, allows added value to be quantified on the latent $R^2$ scale, as
in $R_l^2(y \mid x, z) = R_l^2(y \mid x) + R_l^2(y \mid (z \mid x))$; vii) the approach is robust and
computationally fast; viii) the approach allows missing data imputation.
The rest of the paper is organized as follows. Section 2 reviews semiparametric Gaussian copula models
and states a novel result on incorporating the ordinal case. Section 3 introduces the Semiparametric
Gaussian Copula Regression Model and states the main asymptotic results for the regression parameters.
Section 4 describes key advantages of SGCRM and covers two methodological applications of SGCRM:
prediction of latent variables and imputation of missing observations. Section 5 studies the
performance of SGCRM via a few simulation scenarios. Section 6 illustrates SGCRM in NHANES data.
Finally, Section 7 concludes with a summary and a discussion.

2 Gaussian copula Model


Classical Gaussian model assumptions have been popular due to their computational simplicity. However,
these assumptions can be too restrictive. As an alternative, Liu et al. (2009) proposed the
non-paranormal distribution (NPN), which can be seen as a semiparametric Gaussian copula model.

Definition 2.1. (Non-paranormal distribution) A random vector $Z = (Z_1, \dots, Z_p)' \sim NPN_p(0, \Sigma, f)$
if there exist monotone transformation functions $f = (f_1, \dots, f_p)$ such that
$L = (L_1, \dots, L_p) = f(Z) = (f_1(Z_1), \dots, f_p(Z_p)) \sim N(0, \Sigma)$, where $\Sigma_{jj} = 1$ for $1 \le j \le p$.
The last assumption on $\Sigma$ is made to ensure the identifiability of the distribution, as shown in
Liu et al. (2009).
Fan et al. (2017) introduced the latent non-paranormal distribution, which extended the non-paranormal
distribution to jointly model binary and continuous data. Yoon et al. (2018) introduced truncated
variables using latent non-paranormal variables, and Quan et al. (2018) extended the distribution to
ternary-continuous pairs and ternary-ternary pairs. We generalize these results and demonstrate how the
general ordinal case can be treated. We define the Generalized Latent Non-paranormal (GLNPN)
distribution, which covers four mixed data types: continuous, truncated, (general) ordinal (k ordered
categories), and binary variables.
Definition 2.2. (Generalized latent non-paranormal distribution) Suppose we observe a random vector
$X = (X_c, X_t, X_o, X_b)'$, where $X_c$ is a $p_c$-dimensional continuous variable, $X_t$ is
$p_t$-dimensional truncated, $X_o$ is $p_o$-dimensional ordinal (the j-th ordinal variable has levels
$\{0, 1, \cdots, l_j - 1\}$), and $X_b$ is a $p_b$-dimensional binary variable, with
$p = p_c + p_t + p_o + p_b$. We assume that there exist latent variables $Z = (Z_c, Z_t, Z_o, Z_b)'$
such that

$$
\begin{aligned}
X_{cj} &= Z_{cj}, && 1 \le j \le p_c \\
X_{tj} &= Z_{tj}\, I(Z_{tj} > \delta_{tj}), && 1 \le j \le p_t \\
X_{oj} &= \sum_{k=0}^{l_j - 1} k\, I(\delta_{ojk} \le Z_{oj} < \delta_{oj(k+1)}), && 1 \le j \le p_o;\ \delta_{oj0} = -\infty,\ \delta_{ojl_j} = \infty \\
X_{bj} &= I(Z_{bj} > \delta_{bj}), && 1 \le j \le p_b
\end{aligned}
\tag{1}
$$

If $Z = (Z_c, Z_t, Z_o, Z_b)' \sim NPN(0, \Sigma, f)$, we write $X = (X_c, X_t, X_o, X_b)' \sim GLNPN_p(0, \Sigma, f, \delta)$,
where $\delta = \{\delta_{tj};\ j = 1, \cdots, p_t\} \cup \left(\bigcup_{j=1}^{p_o} \{\delta_{oj(k+1)};\ k = 0, \cdots, l_j\}\right) \cup \{\delta_{bj};\ j = 1, \cdots, p_b\}$,
i.e., $\delta$ is the set containing the cutoffs for truncated, ordinal, and binary variables.
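The generating mechanism in equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `observe` and `simulate_glnpn`, the equicorrelated latent structure, the choice of `exp` as the inverse transformation for the continuous and truncated coordinates, and all cutoff values are hypothetical and do not come from the paper.

```python
import math
import random


def observe(z_c, z_t, z_o, z_b, delta_t, deltas_o, delta_b):
    """Map latent values to observed (continuous, truncated, ordinal, binary) values as in (1)."""
    x_c = z_c                                   # continuous: observed as is
    x_t = z_t if z_t > delta_t else 0.0         # truncated: zero unless the cutoff is exceeded
    x_o = sum(1 for d in deltas_o if z_o >= d)  # ordinal: level k when delta_k <= z < delta_{k+1}
    x_b = 1 if z_b > delta_b else 0             # binary: indicator of exceeding the cutoff
    return x_c, x_t, x_o, x_b


def simulate_glnpn(n, rho, seed=0):
    """Draw n observations with four equicorrelated latent normals (assumes 0 <= rho <= 1)."""
    rng = random.Random(seed)
    g = math.exp              # hypothetical inverse transformation f^{-1} (monotone increasing)
    delta_t = g(-0.5)         # truncation cutoff on the Z scale (latent-scale cutoff -0.5)
    deltas_o = [-0.8, 0.0, 0.9]  # three finite cutoffs, so the ordinal variable has four levels
    delta_b = 0.3
    data = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        # L_j = sqrt(rho) * W + sqrt(1 - rho) * E_j has unit variance and pairwise correlation rho
        lat = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
               for _ in range(4)]
        data.append(observe(g(lat[0]), g(lat[1]), lat[2], lat[3],
                            delta_t, deltas_o, delta_b))
    return data


sample = simulate_glnpn(2000, rho=0.5)
zero_fraction = sum(1 for row in sample if row[1] == 0.0) / len(sample)
```

In this sketch the fraction of zeros in the truncated coordinate should be close to $\Phi(-0.5)$, the probability that the latent normal falls at or below the latent-scale cutoff.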
To shorten notation, we will refer to observed continuous, truncated, ordinal, and binary variables
generated according to the GLNPN distribution as CTOB-GLNPN variables. Figure 1 shows the flowchart of
the data generation mechanism for the observed GLNPN variables. Figure 2 shows an example of four
observed CTOB-GLNPN variables generated via monotone-transformation-then-truncation of latent bivariate
normal variables.

Figure 1: The data generation flow of GLNPN distribution

Figure 2: From left to right: (i) a scatterplot of bivariate standard normal variables with correlation
of 0.5, (ii) a continuous-continuous pair, (iii) a truncated-continuous pair, (iv) an
ordinal-continuous pair, (v) a binary-continuous pair

The cutoff parameters of the GLNPN distribution suffer from identifiability issues (see details in Fan
et al. (2017)). The joint probability mass function of the discrete component, or the density of the
truncated component, depends only on the set of transformed cutoffs:

$$
\begin{aligned}
\Delta = f(\delta) &= \{f_j(\delta_{tj});\ j = 1, \cdots, p_t\} \cup \left(\bigcup_{j=1}^{p_o} \{f_j(\delta_{oj(k+1)});\ k = 0, \cdots, l_j\}\right) \cup \{f_j(\delta_{bj});\ j = 1, \cdots, p_b\} \\
&= \{\Delta_{tj};\ j = 1, \cdots, p_t\} \cup \left(\bigcup_{j=1}^{p_o} \{\Delta_{oj(k+1)};\ k = 0, \cdots, l_j\}\right) \cup \{\Delta_{bj};\ j = 1, \cdots, p_b\}.
\end{aligned}
$$

To emphasize this, we will generally refer to the GLNPN distribution as $GLNPN_p(0, \Sigma, f, \Delta)$.
As a result of the identifiability constraints for the cutoffs, the binary and ordinal components of the
GLNPN distribution are marginally equivalent to the latent Gaussian distribution for binary and ordinal
variables. This comes as no surprise, as the discrete components do not carry enough information to
identify the marginal transformations. However, when we model the discrete component jointly with
continuous and truncated variables, the class of GLNPN distributions becomes much larger than the class
of latent Gaussian distributions. The marginal transformations of the continuous and truncated
variables make the joint distribution of mixed variables more flexible and can potentially give a
substantial advantage in explaining the association between mixed types of variables.

2.1 Estimation of Correlation Matrix


2.1.1 Bridging functions
Several authors (including Fan et al. (2017); Yoon et al. (2018); Quan et al. (2018)) have considered
Kendall’s τ rank correlation to estimate the latent correlation matrix Σ across several settings. We
can calculate a sample Kendall’s tau between the j-th and k-th variables as follows:

$$
\hat{\tau}_{jk} = \frac{2}{n(n-1)} \sum_{1 \le i < i' \le n} \mathrm{sgn}\{(X_{ij} - X_{i'j})(X_{ik} - X_{i'k})\} \tag{2}
$$

The construction of the sample Kendall’s τ shows that it is invariant under monotone transformations.
Now, for two independent copies $X_i$, $X_{i'}$ of the random vector $X$, the population-level
Kendall’s τ is defined as

$$
\tau_{jk} = E[\mathrm{sgn}\{(X_{ij} - X_{i'j})(X_{ik} - X_{i'k})\}] \tag{3}
$$

The population Kendall’s Tau ($\tau_{jk} = E(\hat{\tau}_{jk})$) is typically related to the latent
correlation $\Sigma_{jk}$ through a one-to-one bridging function $F$, which for non-continuous
components depends on the cutoffs: $\tau_{jk} = F(\Sigma_{jk})$. The estimated latent correlation is
then obtained as $\hat{\Sigma}_{jk} = F^{-1}(\hat{\tau}_{jk})$.
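As an illustration of equations (2) and (3), the following Python sketch computes the sample Kendall's tau by direct pairwise comparison and inverts the continuous-continuous bridge $\tau = (2/\pi)\sin^{-1}(\rho)$, which gives $\rho = \sin(\pi\tau/2)$. The function names, the simulated data, and the two monotone transformations (`exp` and cubing) are assumptions made for the example; the quadratic-time loop is the naive form of (2), not an optimized implementation.

```python
import math
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau(x, y):
    """Sample Kendall's tau as in (2): average concordance sign over all pairs i < i'."""
    n = len(x)
    total = 0
    for i in range(n - 1):
        for k in range(i + 1, n):
            total += sign(x[i] - x[k]) * sign(y[i] - y[k])
    return 2.0 * total / (n * (n - 1))


def inverse_bridge_cc(tau):
    """Invert the continuous-continuous bridge tau = (2 / pi) * asin(rho)."""
    return math.sin(math.pi * tau / 2.0)


# Simulated latent bivariate normal pair with correlation 0.5 (illustrative values only).
rng = random.Random(0)
rho = 0.5
z1 = [rng.gauss(0.0, 1.0) for _ in range(400)]
z2 = [rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for a in z1]

# Observed variables are strictly increasing transformations of the latent ones.
x1 = [math.exp(a) for a in z1]
x2 = [a ** 3 for a in z2]

tau_latent = kendall_tau(z1, z2)
tau_obs = kendall_tau(x1, x2)       # equals tau_latent: ranks are unchanged
rho_hat = inverse_bridge_cc(tau_obs)
```

Because Kendall's tau depends only on the ordering of the observations, `tau_obs` matches `tau_latent`, which is why the marginal transformations never need to be estimated to recover the latent correlation.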

Type         Continuous           Truncated            Ordinal*             Binary
Continuous   Liu et al. (2009)    Yoon et al. (2018)   Quan et al. (2018)   Fan et al. (2017)
Truncated    Yoon et al. (2018)   Yoon et al. (2018)   Theorem 1            Yoon et al. (2018)
Ordinal      Quan et al. (2018)   Theorem 1            Theorem 1            Theorem 1
Binary       Fan et al. (2017)    Yoon et al. (2018)   Theorem 1            Fan et al. (2017)

Table 1: The reference of bridging functions for all possible pairs of variables.
* Ordinal cases for only three categories were derived in Quan et al. (2018)

Fan et al. (2017) calculated the bridging function for a pair of binary and continuous variables, and
Yoon et al. (2018) showed how to deal with truncated variables in addition to continuous and binary
ones. Quan et al. (2018) provided formulas for bridging functions for ternary variables and for general
ordinal-continuous pairs of variables. Feng and Ning (2019) broke an ordinal variable into multiple
dummy binary variables and took a weighted correlation approach to recover the latent correlation for
ordinal pairs with more than three categories, whereas Zhang et al. (2018) arrived at an incorrect
bridging function when trying to tackle the general ordinal case. We summarize the references to the
correct bridging functions for all possible pairs of variables in Table 1. Our contribution here is to
derive bridging functions for the general ordinal variable with an arbitrary number of ordinal levels.
The result is summarized in Theorem 1, which provides the bridging function for an arbitrary pair of
variables.
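Theorem 1. Let $X_j$, $X_k$ be two GLNPN variables. Then the population Kendall’s Tau is related to the
latent correlation as follows: $\tau_{jk} = F(\Sigma_{jk})$, where $F$ can depend on the cutoffs
$\Delta_j$, $\Delta_k$, each of which denotes the cutoff scalar or vector corresponding to the
non-continuous components among $X_j$ and $X_k$. The bridging functions corresponding to all pairs of
variables are as follows:

$$
\begin{aligned}
F_{cc}(\rho) &= \frac{2}{\pi} \sin^{-1}(\rho) \\
F_{bb}(\rho; \Delta_j, \Delta_k) &= 2\{\Phi_2(\Delta_j, \Delta_k; \rho) - \Phi(\Delta_j)\Phi(\Delta_k)\} \\
F_{cb}(\rho; \Delta_j) &= 4\Phi_2(\Delta_j, 0; \rho/\sqrt{2}) - 2\Phi(\Delta_j) \\
F_{tb}(\rho; \Delta_j, \Delta_k) &= 2\{1 - \Phi(\Delta_j)\}\Phi(\Delta_k) - 2\Phi_3(-\Delta_j, \Delta_k, 0; S_{3a}(\rho)) - 2\Phi_3(-\Delta_j, \Delta_k, 0; S_{3b}(\rho)) \\
F_{ct}(\rho; \Delta_j) &= -2\Phi_2(-\Delta_j, 0; 1/\sqrt{2}) + 4\Phi_3(-\Delta_j, 0, 0; S_3(\rho)) \\
F_{tt}(\rho; \Delta_j, \Delta_k) &= -2\Phi_4(-\Delta_j, -\Delta_k, 0, 0; S_{4a}(\rho)) + 2\Phi_4(-\Delta_j, -\Delta_k, 0, 0; S_{4b}(\rho)) \\
F_{co}(\rho; \Delta_j) &= \sum_{r=1}^{l_j - 1} \left(4\Phi_3(\Delta_{jr}, \Delta_{j(r+1)}, 0) - 2\Phi(\Delta_{jr})\Phi(\Delta_{j(r+1)})\right) \\
F_{oo}(\rho; \Delta_j, \Delta_k) &= 2\Big(\sum_{r=1}^{l_j - 1} \sum_{s=1}^{l_k - 1} \big[\tilde{\Phi}_2((\Delta_{jr}, \Delta_{ks}), (\Delta_{j(r+1)}, \Delta_{k(s+1)}); \rho)\, \tilde{\Phi}_2((-\infty, -\infty), (\Delta_{jr}, \Delta_{ks}); \rho) \\
&\qquad - \tilde{\Phi}_2((\Delta_{jr}, \Delta_{k(s-1)}), (\Delta_{j(r+1)}, \Delta_{ks}); \rho)\, \tilde{\Phi}_2((-\infty, -\infty), (\Delta_{jr}, -\Delta_{ks}); -\rho)\big]\Big) \\
F_{ob}(\rho; \Delta_j, \Delta_k) &= 2\Big(\sum_{r=1}^{l_j - 1} \big[\tilde{\Phi}_2((\Delta_{jr}, -\infty), (\Delta_{j(r+1)}, -\Delta_k); -\rho)\, \tilde{\Phi}_2((-\infty, -\infty), (\Delta_{jr}, \Delta_k); \rho) \\
&\qquad - \tilde{\Phi}_2((\Delta_{jr}, -\infty), (\Delta_{j(r+1)}, \Delta_k); \rho)\, \tilde{\Phi}_2((-\infty, -\infty), (\Delta_{jr}, -\Delta_k); -\rho)\big]\Big) \\
F_{to}(\rho; \Delta_j, \Delta_k) &= 2\Big(\sum_{r=1}^{l_k - 1} \big[\tilde{\Phi}_2((\Delta_{kr}, -\infty), (\Delta_{k(r+1)}, -\Delta_j); -\rho)\, \tilde{\Phi}_2((-\infty, -\infty), (\Delta_{kr}, \Delta_j); \rho) \\
&\qquad + \tilde{\Phi}_4((\Delta_{kr}, -\infty, -\infty, -\infty), (\Delta_{k(r+1)}, \Delta_{kr}, -\Delta_j, 0); S_{5a}(\rho)) \\
&\qquad - \tilde{\Phi}_2((\Delta_{k(r-1)}, -\infty), (\Delta_{kr}, -\Delta_j); -\rho)\, \tilde{\Phi}_2((-\infty, -\infty), (-\Delta_{kr}, \Delta_j); -\rho) \\
&\qquad - \tilde{\Phi}_4((\Delta_{k(r-1)}, -\infty, -\infty, -\infty), (\Delta_{kr}, -\Delta_{kr}, -\Delta_j, 0); S_{5b}(\rho))\big]\Big)
\end{aligned}
\tag{4}
$$

with

$$
S_{3a}(\rho) = \begin{pmatrix} 1 & -\rho & 1/\sqrt{2} \\ -\rho & 1 & -\rho/\sqrt{2} \\ 1/\sqrt{2} & -\rho/\sqrt{2} & 1 \end{pmatrix}, \quad
S_{3b}(\rho) = \begin{pmatrix} 1 & 0 & -1/\sqrt{2} \\ 0 & 1 & -\rho/\sqrt{2} \\ -1/\sqrt{2} & -\rho/\sqrt{2} & 1 \end{pmatrix}, \quad
S_3(\rho) = \begin{pmatrix} 1 & 1/\sqrt{2} & \rho/\sqrt{2} \\ 1/\sqrt{2} & 1 & \rho \\ \rho/\sqrt{2} & \rho & 1 \end{pmatrix},
$$

$$
S_{4a}(\rho) = \begin{pmatrix} 1 & 0 & 1/\sqrt{2} & -\rho/\sqrt{2} \\ 0 & 1 & -\rho/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -\rho/\sqrt{2} & 1 & -\rho \\ -\rho/\sqrt{2} & 1/\sqrt{2} & -\rho & 1 \end{pmatrix}, \quad
S_{4b}(\rho) = \begin{pmatrix} 1 & \rho & 1/\sqrt{2} & \rho/\sqrt{2} \\ \rho & 1 & \rho/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & \rho/\sqrt{2} & 1 & \rho \\ \rho/\sqrt{2} & 1/\sqrt{2} & \rho & 1 \end{pmatrix},
$$

$$
S_{5a}(\rho) = \begin{pmatrix} 1 & 0 & 0 & -\rho/\sqrt{2} \\ 0 & 1 & -\rho & \rho/\sqrt{2} \\ 0 & -\rho & 1 & -1/\sqrt{2} \\ -\rho/\sqrt{2} & \rho/\sqrt{2} & -1/\sqrt{2} & 1 \end{pmatrix}, \quad
S_{5b}(\rho) = \begin{pmatrix} 1 & 0 & 0 & -\rho/\sqrt{2} \\ 0 & 1 & \rho & -\rho/\sqrt{2} \\ 0 & \rho & 1 & -1/\sqrt{2} \\ -\rho/\sqrt{2} & -\rho/\sqrt{2} & -1/\sqrt{2} & 1 \end{pmatrix},
$$

where $\Phi$ denotes the cdf of the univariate standard normal; $\Phi_d(\dots; S)$ denotes the cdf of
the d-variate standard normal with correlation matrix $S$ (for $d = 2$, the last argument is the scalar
correlation); $\tilde{\Phi}_2((a, b), (c, d); \rho)$ denotes the probability of the rectangle
$\{(u, v) : a < u < c,\ b < v < d\}$ for a standard bivariate normal with correlation $\rho$; and
$\tilde{\Phi}_4((a_1, a_2, a_3, a_4), (b_1, b_2, b_3, b_4); S) = P(a_1 < Z_1 < b_1, \dots, a_4 < Z_4 < b_4)$
denotes the corresponding rectangle probability for a standard quadrivariate normal
$Z = (Z_1, Z_2, Z_3, Z_4)$ with correlation matrix $S$, which reduces to the distribution function of
$Z$ when all lower limits are $-\infty$.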
Proof. The derivations of $F_{cc}$, $F_{bb}$, $F_{cb}$, $F_{tb}$, $F_{ct}$, $F_{tt}$, $F_{co}$ have
previously appeared in the literature, as reported in Table 1. We provide novel derivations of
$F_{oo}$, $F_{ob}$, $F_{to}$ in Supplement S1. To the best of our knowledge, this theorem is the first
result deriving analytical forms of pairwise bridging functions for ordinal-ordinal, ordinal-binary,
and truncated-ordinal pairs for ordinal variables with an arbitrary number of levels.
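Theorem 1 shows that the bridging function depends on the cutoffs for binary, ordinal, and truncated
variables. Hence, we need to estimate these cutoffs. From the observed data, we estimate the cutoffs
through the method of moments as follows:

$$
\begin{aligned}
\text{Binary:} \quad & \hat{\Delta}_j = \Phi^{-1}\left(\frac{\sum_{i=1}^{n} I(X_{ij} = 0)}{n}\right) \\
\text{Ordinal:} \quad & \hat{\Delta}_{jr} = \Phi^{-1}\left(\frac{\sum_{i=1}^{n} I(X_{ij} \le r - 1)}{n}\right), \quad r = 1, \dots, l_j - 1 \\
\text{Truncated:} \quad & \hat{\Delta}_j = \Phi^{-1}\left(\frac{\sum_{i=1}^{n} I(X_{ij} = 0)}{n}\right)
\end{aligned}
\tag{5}
$$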
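The method-of-moments estimates in (5) only require empirical proportions and the standard normal quantile function. A minimal Python sketch, assuming ordinal levels are coded 0, 1, ..., l - 1 and every proportion lies strictly between 0 and 1 (otherwise the quantile is infinite and `inv_cdf` raises an error), could look as follows. The function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal; inv_cdf plays the role of Phi^{-1}


def cutoff_from_zero_fraction(x):
    """Binary or truncated variable: Phi^{-1} of the observed proportion of zeros."""
    p_zero = sum(1 for v in x if v == 0) / len(x)
    return _STD_NORMAL.inv_cdf(p_zero)


def cutoffs_ordinal(x, n_levels):
    """Ordinal variable coded 0..n_levels-1: Phi^{-1} of P(X <= r - 1) for r = 1..n_levels-1."""
    n = len(x)
    return [
        _STD_NORMAL.inv_cdf(sum(1 for v in x if v <= r - 1) / n)
        for r in range(1, n_levels)
    ]


binary_cutoff = cutoff_from_zero_fraction([0, 0, 1, 1])   # half zeros, so the cutoff is 0
ordinal_cutoffs = cutoffs_ordinal([0, 1, 1, 2], n_levels=3)
```

In the toy ordinal example, a quarter of the observations are at level 0 and three quarters are at level 1 or below, so the two estimated cutoffs are the 25% and 75% standard normal quantiles.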

Now, we can plug the estimated cutoffs into the bridging functions from (4), so that the bridging
functions depend only on the latent correlations. After bridging, the correlation matrix formed by the
bridged estimates, $\hat{\Sigma} = (\hat{\Sigma}_{jk})$, is not guaranteed to be positive
semi-definite. We therefore perform an extra step and find the nearest positive-definite correlation
matrix to the estimated matrix (Higham, 2002). In Algorithm 1, we lay out all steps of our estimation
procedure for all four mixed data types.

Algorithm 1 GLNPN estimation algorithm
1: Input: Observed data, $X_i = (X_{ic}, X_{it}, X_{io}, X_{ib})$, $i = 1, \dots, n$
Phase 1 – Estimating cutoffs
2: for j in {t, o, b} do
3:     Estimate the set of cutoffs $\hat{\Delta}_j$ from (5) and store them
4: end for
Phase 2 – Inverting bridging functions
5: for j in {c, t, o, b} do
6:     for k ≠ j do
7:         Calculate sample Kendall’s Tau: $\hat{\tau}_{jk}$
8:         Get the appropriate bridging function $F_{jk}$ and plug in the estimated cutoffs
9:         Obtain $\hat{\Sigma}_{jk} = F_{jk}^{-1}(\hat{\tau}_{jk}) = \arg\min_{\rho \in (-1, 1)} (F_{jk}(\rho) - \hat{\tau}_{jk})^2$
10:     end for
11: end for
Phase 3 – Getting nearest PD correlation matrix
12: Get the initial estimate of the latent correlation matrix Σ as follows:

$$
\hat{\Sigma} = \begin{pmatrix}
\hat{\Sigma}_{cc} & \hat{\Sigma}_{ct} & \hat{\Sigma}_{co} & \hat{\Sigma}_{cb} \\
\hat{\Sigma}_{tc} & \hat{\Sigma}_{tt} & \hat{\Sigma}_{to} & \hat{\Sigma}_{tb} \\
\hat{\Sigma}_{oc} & \hat{\Sigma}_{ot} & \hat{\Sigma}_{oo} & \hat{\Sigma}_{ob} \\
\hat{\Sigma}_{bc} & \hat{\Sigma}_{bt} & \hat{\Sigma}_{bo} & \hat{\Sigma}_{bb}
\end{pmatrix}
$$

13: Use the nearPD function in R (Higham, 2002) to find the nearest positive definite correlation
    matrix to $\hat{\Sigma}$ as our final estimate.
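The inversion step of Algorithm 1 can be illustrated for a binary-binary pair. The sketch below is not the authors' implementation. It assumes the standard identity $\partial \Phi_2(h, k; \rho) / \partial \rho = \phi_2(h, k; \rho)$, which gives $F_{bb}(\rho) = 2\int_0^{\rho} \phi_2(\Delta_j, \Delta_k; t)\,dt$, evaluates the integral with composite Simpson's rule, and replaces the least-squares search over $\rho$ with bisection, which is valid because this bridge is strictly increasing in $\rho$. All function names and the search bounds are hypothetical.

```python
import math


def phi2(h, k, t):
    """Density of the standard bivariate normal at (h, k) with correlation t."""
    s = 1.0 - t * t
    quad = (h * h - 2.0 * t * h * k + k * k) / (2.0 * s)
    return math.exp(-quad) / (2.0 * math.pi * math.sqrt(s))


def bridge_bb(rho, delta_j, delta_k, m=200):
    """F_bb(rho) = 2 * integral_0^rho phi2(delta_j, delta_k; t) dt, composite Simpson (m even)."""
    if rho == 0.0:
        return 0.0
    step = rho / m  # signed step, so negative rho integrates from 0 down to rho
    total = phi2(delta_j, delta_k, 0.0) + phi2(delta_j, delta_k, rho)
    for i in range(1, m):
        weight = 4.0 if i % 2 else 2.0
        total += weight * phi2(delta_j, delta_k, i * step)
    return 2.0 * total * step / 3.0


def invert_bridge(tau, bridge, lo=-0.999, hi=0.999, tol=1e-10):
    """Solve bridge(rho) = tau by bisection, assuming bridge is increasing on (lo, hi)."""
    if tau <= bridge(lo):
        return lo
    if tau >= bridge(hi):
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bridge(mid) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Round trip with illustrative cutoffs: compute tau from a known rho, then recover rho.
tau_example = bridge_bb(-0.3, 0.4, -0.7)
rho_recovered = invert_bridge(tau_example, lambda r: bridge_bb(r, 0.4, -0.7))
```

With both cutoffs at zero the bridge has the closed form $\sin^{-1}(\rho)/\pi$, which offers a convenient check of the numerical integration. The other bridging functions in (4) involve trivariate and quadrivariate normal probabilities and would need a dedicated multivariate normal routine.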

3 Semiparametric Gaussian Copula Regression Model


In this section, we introduce the Semiparametric Gaussian Copula Regression Model (SGCRM) and compare
it with the traditional regression framework.
A classical regression model for a continuous outcome $Y_i$ is typically written as

$$
Y_i = \sum_{j=1}^{p} X_{ji} \beta_j + \epsilon_i, \quad i = 1, \cdots, n \tag{6}
$$

The easiest case to understand is when both the outcome and all covariates are standard normal random
variables. In that case, the simple linear regression conceptually assumes that both outcomes and
predictors are on the same additive scale and tries to explain the variability of an outcome via the
variability of the predictors. Various transformations of the outcome/predictors can be used to deal
with possible deviations from normality and symmetry. When the outcome is not continuous, alternative
models such as probit, truncated regression, and other probit-like models have been proposed. However,
they often lose the interpretability appeal of a simple linear regression model.
Semiparametric Gaussian Copula Regression for Mixed Data (SGCRM) can be seen as an alternative to the
simple linear regression that deals with mixed types of outcomes and predictors by operating on and
connecting the underlying continuous normal latent variables that generate the observed mixed type
variables. In this section, we introduce SGCRM and establish key asymptotic results for the estimates
of the regression parameters. We then discuss the main advantages of SGCRM.
First, we define Semiparametric Gaussian Copula Regression for Mixed Data as follows.

$$
\begin{cases}
\text{Observed variables:} & (Y_1, X_1), \dots, (Y_n, X_n) \overset{i.i.d.}{\sim} GLNPN_{p+1}(0, \Sigma, (f_Y, f_X), \Delta) \\
\text{Latent variables:} & (Z_1^Y, Z_1^X), \dots, (Z_n^Y, Z_n^X) \overset{i.i.d.}{\sim} NPN_{p+1}(0, \Sigma, (f_Y, f_X)) \\
\text{SGCRM for latent variables:} & f_Y(Z_i^Y) = \sum_{k \in \{c, t, o, b\}} \sum_{j=1}^{p_k} f_X(Z_{kji}^X) \beta_{kj} + \epsilon_i, \quad i = 1, \dots, n.
\end{cases}
\tag{7}
$$

Essentially, SGCRM is a simple linear regression for the outcome $f_Y(Z_i^Y)$ and predictors
$f_X(Z_i^X)$, which, according to GLNPN, are jointly normal:
$(f_Y(Z_i^Y), f_X(Z_i^X)) \overset{i.i.d.}{\sim} N_{p+1}(0, \Sigma)$, with the correlation matrix Σ
assuming the following partition:

$$
\Sigma = \begin{pmatrix} \Sigma_{YY} & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_{XX} \end{pmatrix}.
$$

In SGCRM, we also assume that the $\epsilon_i$ are i.i.d. from $N(0, 1 - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY})$.
It immediately follows that the regression coefficient is $\beta = \Sigma_{XX}^{-1} \Sigma_{XY}$. To
estimate β, we propose to use the estimate of Σ obtained via bridging as described in Section 2.1. Let
$\hat{\Sigma}_n$ be the estimated latent correlation matrix for the model (7). Then, the estimate of
the regression coefficient is given by $\hat{\beta}_n = \hat{\Sigma}_{nXX}^{-1} \hat{\Sigma}_{nXY}$.
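To make the plug-in estimator concrete, the following Python sketch computes $\beta = \Sigma_{XX}^{-1}\Sigma_{XY}$ from a given latent correlation matrix by solving the linear system with Gaussian elimination. The quantity $\Sigma_{YX}\beta$ is returned as the explained share of the unit latent variance, which is our reading of the latent $R^2$ mentioned in the paper and should be treated as an assumption, as should the function names and the example matrix.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (a is a list of rows)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def sgcrm_coefficients(sigma, outcome=0):
    """Return (beta, explained variance, residual variance) for the chosen outcome index."""
    predictors = [i for i in range(len(sigma)) if i != outcome]
    sigma_xx = [[sigma[i][j] for j in predictors] for i in predictors]
    sigma_xy = [sigma[i][outcome] for i in predictors]
    beta = solve_linear_system(sigma_xx, sigma_xy)       # Sigma_XX^{-1} Sigma_XY
    explained = sum(b * s for b, s in zip(beta, sigma_xy))  # Sigma_YX Sigma_XX^{-1} Sigma_XY
    return beta, explained, 1.0 - explained


# Illustrative 3 x 3 latent correlation matrix: the outcome is the first variable.
sigma_hat = [[1.0, 0.5, 0.3],
             [0.5, 1.0, 0.2],
             [0.3, 0.2, 1.0]]
beta_hat, latent_r2, residual_variance = sgcrm_coefficients(sigma_hat, outcome=0)
```

Because the same matrix serves every choice of `outcome`, calling the function with a different index yields a conditional model that is consistent with the same joint latent distribution.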
In the next theorem, we derive asymptotic properties of both the estimator of the latent correlation
matrix and the regression parameter of the SGCRM model. To formulate the theorem, we need the following
notation: let $vec(A)$ and $vecl(A)$ denote the vectorized matrix $A$ and the vector of
lower-triangular elements of matrix $A$, respectively. Thus, $vecl(\hat{\Sigma}_n)$ and
$vec(\hat{\Sigma}_n)$ are vectors of length $\frac{p(p-1)}{2}$ and $p^2$, respectively.

Theorem 2. Suppose the following assumptions (Eicker, 1963) hold true: (i) the rank of
$P_n = \hat{\Sigma}_{nXX}$ is $p$ and (ii) $\frac{\lambda_{max}(P_n)}{(n \lambda_{min}(P_n))^2} \to 0$
as $n \to \infty$, where $\lambda_{min}(\cdot)$ and $\lambda_{max}(\cdot)$ denote the smallest and
largest eigenvalues of a matrix, respectively. Then, $\sqrt{n}(vecl(\hat{\Sigma}_n) - vecl(\Sigma))$ is
asymptotically normal with mean vector 0 and variance-covariance matrix $V_\Sigma$, and
$\sqrt{n}(\hat{\beta}_n - \beta)$ is asymptotically normal with mean 0 and variance-covariance matrix
$V_\beta$.
Proof. Here, we lay out the key ideas of the proof. First, using the asymptotics of U-statistics in
Hoeffding (1992) and El Maache and Lepage (2003), we establish the asymptotic normality of the
Kendall’s Tau estimates. Since the latent correlations are a deterministic function (the inverse
bridging function) of the Kendall’s Tau correlations, we use the Delta method to obtain the asymptotic
normality of the latent correlations. Next, we project the latent correlations onto a space of
independent parameters (Archakova and Hansen, 2018), so that we can apply the Delta method to obtain
the asymptotic normality of the SGCRM regression coefficient. The regularity assumptions ensure the
stability of the transformation $\hat{\beta}_n = \hat{\Sigma}_{nXX}^{-1} \hat{\Sigma}_{nXY}$ of the
latent correlation matrix, so that the Delta method applies. The detailed proof and analytical
expressions of $V_\beta$ and $V_\Sigma$ are provided in Supplement S1.
As part of the derivations, we solve a non-trivial computational problem by developing an efficient way
of computing the asymptotic covariance of the Kendall’s Tau matrix. Our approach requires $O(n^2)$
FLOPs compared to the $O(n^4)$ FLOPs of a naive brute-force approach. This reduction in computational
burden enables us to calculate the asymptotic variance for moderate-to-large n.

3.1 Advantages of SGCRM


In this section, we present the main differences between SGCRM and traditional regression models
developed for mixed type outcomes. The two approaches are contrasted in Table 2.

Aspect                      Traditional models (observed space)             SGCRM (latent space)

Conditional associations    Use simple linear regression and probit-like    A global model defines mutually consistent
                            regressions (probit, truncated, and ordinal     conditional models for all outcomes (see
                            probit).                                        Section 3).
                            Goodness-of-fit measure: AUC or deviance        Goodness-of-fit measure: latent R-square.
                            (depending on the defined model).
                            Can be used to test for conditional             Can be used to define a test for conditional
                            independence only for Gaussian variables.       independence for mixed types of variables.
Estimation                  Requires likelihood computation; can be         The method-of-moments approach makes the
                            computationally infeasible for certain models.  estimation computationally efficient.
                            Non-robust but efficient.                       Kendall's Tau rank correlation balances
                                                                            robustness and efficiency.
Scaling                     Need to manually normalize the variables to     Inherent model assumptions take care of the
                            account for heterogeneous scales. May be        scaling naturally.
                            impossible for mixed types.
Distributional assumptions  Parametric; convenient but limited.             Semi-parametric; allows us to explain more
                                                                            general associations.
Missing data imputation     Imputation by mean or restricted to complete    Uses latent correlations to impute missing
                            cases.                                          data under the missing-at-random assumption.
Interpretation              Can be interpreted on the absolute scale and    Can be interpreted on the quantile scale.
                            simplified. The signs of the coefficients       The signs of the coefficients denote the
                            denote the direction of association.            direction of association.
Prediction                  Using model construction.                       Uses latent correlations and conditional
                                                                            expectations to construct the best linear
                                                                            unbiased predictors.

Table 2: Comparison between traditional approaches to model mixed data and Semi-parametric Gaussian
Copula Regression Modeling

4 Methodological Applications of SGCRM


4.1 Latent variable predictions
Although latent variables are not needed to estimate the regression parameters of SGCRM, other
applications of SGCRM may require them. To address this, we follow the idea of Best Linear Unbiased
Predictors (BLUPs) in mixed effects modelling and use a similar conditional expectation approach to
find the best predictors of latent variables conditionally on observed variables. Note that at this
point we do not distinguish between an outcome and predictors. We also drop the sub-index i, as we only
use participant-specific observed variables when we predict their latent representations. We introduce
additional notation:

• $L = (L_c, L_t, L_o, L_b) = f(Z_c, Z_t, Z_o, Z_b)$, where $f$ is a vector of coordinate-wise monotone
transformations as described in the definition of GLNPN;
• $L_{-c} = (L_t, L_o, L_b)$, and similarly for all other combinations of sub-indexes;

• $ct$ denotes the union of continuous and truncated indices, so $L_{-ct} = (L_o, L_b)$;

• $\Sigma_{a,a}$ indicates the sub-matrix of Σ with indices running over the set a;
• $\Sigma_{a,-a}$ denotes the rows of Σ indexed by the set a, without the columns indexed by a, and
$\Sigma_{-a,a} = \Sigma_{a,-a}'$;

• $\Sigma_{-a,-a}$ indicates the sub-matrix of Σ with indices not in the set a.

To calculate $E(L \mid X) = E(L_c, L_t, L_o, L_b \mid X_c, X_o, X_t, X_b)$, we consider two cases: Case 1, when
$X_t = 0$, and Case 2, when $X_t > 0$.
Case $X_t = 0$: We observe that for continuous variables $L_c = f_c(X_c)$, and the values of $X_t, X_o, X_b$
restrict each coordinate of $L_{-c} = (L_t, L_o, L_b)$ to a certain interval determined by the cutoffs.
That is, under our model assumptions $\{X_t = x_t, X_b = x_b, X_o = x_o\} \iff \{L_{-c} \in B_{-c}\}$, where
$B_{-c} = \prod_{\lambda \notin c} B_\lambda$, $\prod$ indicates the Cartesian product, and $B_\lambda$ denotes an
interval in $\mathbb{R}$ for the corresponding coordinate.

By using the fact that

$$L_{-c|c} = L_{-c} \mid L_c \sim N\left(\Sigma_{-c,c}\Sigma_{c,c}^{-1} L_c,\ \Sigma_{-c,-c} - \Sigma_{-c,c}\Sigma_{c,c}^{-1}\Sigma_{c,-c}\right)$$

we can derive the following

$$E(L_c \mid X_c, X_t, X_o, X_b) = f_c(X_c)$$

$$\begin{aligned}
E((L_t, L_o, L_b) \mid X_c, X_t, X_o, X_b) &= E(L_{-c} \mid X_c, X_t, X_o, X_b) \\
&= E(L_{-c} \mid L_c, L_{-c} \in B_{-c}) \\
&= E(L_{-c|c} \mid L_{-c|c} \in B_{-c})
\end{aligned} \tag{8}$$

The last quantity in equation (8) is exactly the expectation of a multivariate normal random variable
truncated to the set $B_{-c}$. Thus, we get

$$E(L \mid X_c, X_t, X_o, X_b) = \left(f_c(X_c),\ E(L_{-c|c} \mid L_{-c|c} \in B_{-c})\right) \tag{9}$$
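
As a minimal numerical sketch of the conditional distribution used above (not the authors' implementation), the Python snippet below computes the mean and covariance of $L_{-c} \mid L_c$ for a small correlation matrix. The function name `conditional_normal`, the index convention, and the example matrix are illustrative assumptions.

```python
import numpy as np

def conditional_normal(sigma, idx_c, l_c):
    """Mean and covariance of L_{-c} | L_c = l_c when L ~ N(0, sigma)."""
    sigma = np.asarray(sigma, dtype=float)
    idx_c = np.asarray(idx_c)
    idx_rest = np.setdiff1d(np.arange(sigma.shape[0]), idx_c)
    s_cc = sigma[np.ix_(idx_c, idx_c)]        # Sigma_{c,c}
    s_rc = sigma[np.ix_(idx_rest, idx_c)]     # Sigma_{-c,c}
    s_rr = sigma[np.ix_(idx_rest, idx_rest)]  # Sigma_{-c,-c}
    # w = Sigma_{-c,c} Sigma_{c,c}^{-1}, obtained with a linear solve
    w = np.linalg.solve(s_cc, s_rc.T).T
    mean = w @ np.asarray(l_c, dtype=float)
    cov = s_rr - w @ s_rc.T                   # Schur complement
    return mean, cov

# Hypothetical 3-variable example: coordinate 0 is continuous and observed.
sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])
mean, cov = conditional_normal(sigma, idx_c=[0], l_c=[1.2])
```

The truncated expectation in (8) would then be taken with respect to this conditional distribution restricted to $B_{-c}$.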

Case $X_t > 0$: Observe that $L_c = f_c(X_c)$, $L_t = f_t(X_t)$, and the values of $X_o, X_b$ restrict each
coordinate of $L_{-ct} = (L_o, L_b)$ to a certain interval determined by the cutoffs. Under our model assumptions

$$\{X_o = x_o, X_b = x_b\} \iff \{L_{-ct} \in B_{-ct}\},$$

where $B_{-ct} = \prod_{\lambda \notin ct} B_\lambda$, $\prod$ indicates the Cartesian product, and $B_\lambda$ denotes an
interval in $\mathbb{R}$ for the corresponding coordinate. We also use the fact that

$$L_{-ct|ct} = L_{-ct} \mid L_{ct} \sim N\left(\Sigma_{-ct,ct}\Sigma_{ct,ct}^{-1} L_{ct},\ \Sigma_{-ct,-ct} - \Sigma_{-ct,ct}\Sigma_{ct,ct}^{-1}\Sigma_{ct,-ct}\right)$$

Using the information above, we can derive the following results:

$$E(L_{ct} \mid X_c, X_t, X_o, X_b) = f_{ct}(X_{ct})$$

$$\begin{aligned}
E((L_o, L_b) \mid X_c, X_t, X_o, X_b) &= E(L_{-ct} \mid X_c, X_t, X_o, X_b) \\
&= E(L_{-ct} \mid L_{ct}, L_{-ct} \in B_{-ct}) \\
&= E(L_{-ct|ct} \mid L_{-ct|ct} \in B_{-ct})
\end{aligned} \tag{10}$$

The last quantity in equation (10) is exactly the expectation of a multivariate normal random variable
truncated to the set $B_{-ct}$. Thus, we get

$$E(L \mid X_c, X_t, X_o, X_b) = \left(f_c(X_c),\ f_t(X_t),\ E(L_{-ct|ct} \mid L_{-ct|ct} \in B_{-ct})\right) \tag{11}$$

To get exact values from equations (9) and (11), we need to know three things: (a) the functions $f_c$
(over the entire domain) and $f_t$ (only for non-zero values), (b) the sets $B_{-c}$ and $B_{-ct}$, and (c) a way
to calculate the expectation of a truncated multivariate normal random variable. Below, we show how to
derive these three.

(a) We illustrate this step by considering a single continuous and a single truncated variable. First,
we get the empirical CDF estimates as follows

$$\begin{aligned}
F_{cn}(x) &= \frac{1}{n+1}\sum_{i=1}^{n} I(X_{ci} \le x), \quad x \in \mathbb{R} \\
F_{tn}(x) &= \frac{1}{n+1}\sum_{i=1}^{n} I(X_{ti} \le x), \quad x > 0
\end{aligned} \tag{12}$$

Then, we use equation (12) to construct the estimators of the monotone transformations as follows

$$\begin{aligned}
\hat f_c(x) &= \Phi^{-1}(F_{cn}(x)) \\
\hat f_t(x) &= \Phi^{-1}(F_{tn}(x)),
\end{aligned} \tag{13}$$

where $\Phi$ is the standard normal CDF. This follows the approach for continuous variables discussed
in Section 4 of Liu et al. (2009).
(b) We plug in the method of moments estimates for cutoffs from Section 2.1 to get $\hat B_{-c}$ and $\hat B_{-ct}$.

(c) To calculate the first moment of a truncated multivariate normal distribution, we use ideas from
Wilhelm and Manjunath (2010). They proposed a recursive formula for the moment generating function
of a truncated multivariate normal distribution and take its first derivative at 0 to calculate the
desired expectation. We use their method as implemented in the R package tmvtnorm (Wilhelm and
Manjunath, 2015).
It is important to note that the prediction of latent variables described above can be done
subject-by-subject in an embarrassingly parallel way to reduce the computational burden.
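
Steps (a) and (c) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: `fhat` implements the estimator in (13) with the $n + 1$ denominator from (12), and `truncated_normal_mean` uses the standard closed form for a univariate truncated normal, which is only the one-dimensional special case of the multivariate calculation handled by tmvtnorm. Both function names and the toy data are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def fhat(x_obs, x_new):
    """Estimate f(x) = Phi^{-1}(F_n(x)) with F_n(x) = #{X_i <= x} / (n + 1)."""
    x_sorted = np.sort(np.asarray(x_obs, dtype=float))
    n = x_sorted.size
    # Number of observations <= each new point
    counts = np.searchsorted(x_sorted, np.asarray(x_new, dtype=float), side="right")
    # Points below the sample minimum give F_n = 0 and map to -inf
    return norm.ppf(counts / (n + 1))

def truncated_normal_mean(mu, sd, lower, upper):
    """E[Z | lower < Z < upper] for Z ~ N(mu, sd^2) (univariate closed form)."""
    a, b = (lower - mu) / sd, (upper - mu) / sd
    return mu + sd * (norm.pdf(a) - norm.pdf(b)) / (norm.cdf(b) - norm.cdf(a))

# Toy data (assumed): four observations of a continuous variable
x_obs = [3.1, 0.4, 2.2, 5.0]
latent_value = fhat(x_obs, [2.2])[0]            # Phi^{-1}(2 / 5)
half_normal_mean = truncated_normal_mean(0.0, 1.0, 0.0, np.inf)
```

For a truncated variable, the same `fhat` would be applied only at strictly positive observed values, in line with the domain restriction in (12).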

4.2 Missing data imputation


The GLNPN framework provides a readily available way to impute missing mixed-data observations using
the same techniques as for the prediction of latent variables.
Suppose we have missing observations for a particular subject. We split the full vector $X$ into
observed and missing parts as $X = (X_O, X_M)$, where $O$ denotes the observed and $M$ the missing parts,
and the subject-specific index $i$ has been omitted for notational simplicity. First, we predict
$E[L_M \mid X_O]$ and then obtain the prediction of $X_M$ using an appropriate transformation-then-truncation
step applied to $E[L_M \mid X_O]$. Recall that

$$L_{M|O} = L_M \mid L_O \sim N\left(\Sigma_{M,O}\Sigma_{O,O}^{-1} L_O,\ \Sigma_{M,M} - \Sigma_{M,O}\Sigma_{O,O}^{-1}\Sigma_{O,M}\right) \tag{14}$$

As $X_O$ is a $\sigma(L_O)$-measurable random variable, where $\sigma(L_O)$ denotes the $\sigma$-algebra generated by
$L_O$, we can use the tower property of conditional expectations to get the following identity

$$E[L_M \mid X_O] = E[E[L_M \mid L_O] \mid X_O] = E[\Sigma_{M,O}\Sigma_{O,O}^{-1} L_O \mid X_O] = \Sigma_{M,O}\Sigma_{O,O}^{-1} E[L_O \mid X_O] \tag{15}$$

Finally, we can calculate $E[L_O \mid X_O]$ in the equation above using the same steps as in Section 4.1.
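
The identity in (15) reduces the latent part of the imputation to one linear solve. The Python sketch below illustrates this step only; the function name `latent_mean_missing`, the example correlation matrix, and the values standing in for $E[L_O \mid X_O]$ are assumptions made for illustration.

```python
import numpy as np

def latent_mean_missing(sigma, idx_obs, e_l_obs):
    """E[L_M | X_O] = Sigma_{M,O} Sigma_{O,O}^{-1} E[L_O | X_O]."""
    sigma = np.asarray(sigma, dtype=float)
    idx_obs = np.asarray(idx_obs)
    idx_mis = np.setdiff1d(np.arange(sigma.shape[0]), idx_obs)
    s_oo = sigma[np.ix_(idx_obs, idx_obs)]   # Sigma_{O,O}
    s_mo = sigma[np.ix_(idx_mis, idx_obs)]   # Sigma_{M,O}
    return s_mo @ np.linalg.solve(s_oo, np.asarray(e_l_obs, dtype=float))

# Hypothetical example: variables 0 and 1 observed, variable 2 missing.
sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])
e_l_obs = [0.6, -0.2]                        # assumed values of E[L_O | X_O]
e_l_mis = latent_mean_missing(sigma, [0, 1], e_l_obs)
```

The predicted latent value would still have to be mapped back to the observed scale by the transformation-then-truncation step described above.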

5 Simulation
We conduct a series of simulation experiments to evaluate the performance of our approach. The data
generation algorithm for the simulation experiments is presented below.

1. Generate a random correlation matrix Σ using the random partial correlation method in Joe (2006).
We calculate the condition number of Σ and if the number is below 10, we proceed to Step 2. The
additional step of checking the condition number is to ensure the stability of matrix inversion and
our regression estimates.

2. We generate $n = 1000$ replicates of an 8-variate latent normal variable from the following model

$$(L_{i1}, L_{i2}, L_{i3}, L_{i4}, L_{i5}, L_{i6}, L_{i7}, L_{i8}) \sim N(0, \Sigma), \quad i = 1, 2, \cdots, n$$

3. We then apply the cutoffs from Table 3 to the latent variables generated in the previous step to obtain
observed binary ($X_1$, $X_3$), continuous ($X_2$), ordinal ($X_4$, $X_5$, $X_6$, $X_7$), and truncated ($X_8$) variables.
We consider ordinal variables with 3 categories ($X_4$, $X_5$) and 4 categories ($X_6$, $X_7$). We vary the
entropy of our binary and ordinal variables. The entropy of a discrete random variable is defined
as $-\sum_i p_i \log(p_i)$, where $p_i$ is the probability of the $i$-th distinct value. The entropy indicates the
average level of information contained in the variable's possible outcomes. Varying entropy enables
us to consider the performance of our approach across different levels of information.
4. We perform Steps 1-3 for 200 different seeds to replicate our experiment 200 times.
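
The data generation steps above can be sketched in Python as follows. This is a simplified illustration, not the simulation code used in the paper: an equicorrelated matrix is assumed in place of the random partial-correlation draw of Joe (2006), the continuous variable uses an identity transformation, and the helper names `cut` and `entropy` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 8

# Assumption: an equicorrelated matrix stands in for the random
# partial-correlation draw of Joe (2006) used in Step 1.
sigma = np.full((p, p), 0.3) + 0.7 * np.eye(p)
cond_ok = np.linalg.cond(sigma) < 10      # Step 1: condition-number check

# Step 2: latent 8-variate normal sample
latent = rng.multivariate_normal(np.zeros(p), sigma, size=n)

def cut(values, cutoffs):
    """Ordinal level = number of cutoffs that are <= the latent value."""
    return np.searchsorted(np.asarray(cutoffs, dtype=float), values, side="right")

# Step 3: cutoffs from Table 3 (columns 0..7 correspond to X1..X8)
x1 = cut(latent[:, 0], [0.3])             # binary, high entropy
x2 = latent[:, 1]                         # continuous (identity transform assumed)
x3 = cut(latent[:, 2], [1.0])             # binary, low entropy
x4 = cut(latent[:, 3], [-0.1, 0.6])
x5 = cut(latent[:, 4], [-1.0, 1.0])
x6 = cut(latent[:, 5], [-0.7, 0.1, 0.6])
x7 = cut(latent[:, 6], [-0.3, 0.1, 0.2])
x8 = np.where(latent[:, 7] > 0.0, latent[:, 7], 0.0)   # truncated at 0

def entropy(values):
    """Empirical entropy -sum_i p_i log(p_i) of a discrete sample."""
    _, counts = np.unique(values, return_counts=True)
    prob = counts / counts.sum()
    return -np.sum(prob * np.log(prob))
```

Comparing `entropy(x1)` with `entropy(x3)` shows how the choice of cutoff controls the information carried by a binary variable.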

For regression modeling, we treat X1 (the binary variable with high entropy) as our outcome. We
use methods described in Section 2.1 to estimate the latent correlation matrix and regression coefficients
for each instance of the simulated data. We also calculate the asymptotic confidence intervals of our
estimates from Theorem 2. Finally, for every instance of a simulated correlation matrix, we perform 500
replicates of our experiment to get an empirical distribution of our estimated parameters. We calculate
coverage of these 500 estimates against the asymptotic confidence intervals to gauge the accuracy of the
asymptotic intervals.

Figure 3: The estimates of latent regression coefficients over different simulation scenarios. The black
line denotes the line $y = x$.

Figure 3 shows the estimates of latent regression coefficients against the true values. We observe that
across different combinations, the estimated and true parameters are very well aligned along the diagonal
line. We also report the coverage of our proposed asymptotic confidence interval for regression coefficients
(Figure 4). The median coverage (across 100 seeds) of SGCRM regression coefficients is slightly below the
expected 95% line. We expect this undercoverage because we do not account for the uncertainty of the
estimated cutoffs when calculating the asymptotic variances. Accounting for the additional uncertainty from
the use of plug-in cutoff estimators would make the calculations analytically complex with only a small
practical gain.

6 NHANES 2003-2006
We illustrate our method using the National Health and Nutrition Examination Survey (NHANES) 2003-2006.
We focus on five variables discussed in the Introduction: TAC, VPA, mortality, Health Status, and Mobility
Problem. For the analysis, we excluded participants who (1) have missing mortality information or are alive
with follow-up of less than 5 years, (2) are younger than 50 years old or aged 85 and older, (3) are missing
any of the above-mentioned variables of interest, (4) died due to an accident, or (5) had fewer than
3 valid accelerometry days (a valid day is defined as a day with at least 10 hours of wear time) (Leroux
et al., 2019). The final analytical sample consisted of 3069 subjects with 313 deaths within 5 years.
Figure 5 compares the estimated Pearson and latent correlation matrices (numerical values are in Tables
S1 and S2). We observe that the latent correlation matrix detects stronger correlations between variables
than a naive interpretation of Pearson correlations for mixed-type variables. For example, the
correlation between Mortality and Mobility Problem increases from 0.2 to 0.39, and the correlation between
Mortality and VPA strengthens from -0.08 to -0.30.
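
The attenuation of Pearson correlations for discretized variables can be reproduced in a small simulation. The Python sketch below is an illustration only and does not use NHANES data: it dichotomizes a bivariate normal sample with an assumed latent correlation of 0.5 at the median, for which the Pearson correlation of the two binary variables is known to equal $(2/\pi)\arcsin(\rho) = 1/3$.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.5                                   # assumed latent correlation
cov = np.array([[1.0, rho], [rho, 1.0]])
latent = rng.multivariate_normal(np.zeros(2), cov, size=20000)

# Dichotomize both latent coordinates at 0 (their median)
binary = (latent > 0).astype(float)

pearson_latent = np.corrcoef(latent, rowvar=False)[0, 1]
pearson_binary = np.corrcoef(binary, rowvar=False)[0, 1]
theoretical_binary = 2.0 / np.pi * np.arcsin(rho)   # equals 1/3 for rho = 0.5
```

Here `pearson_binary` is close to 1/3 rather than 0.5, which is the kind of gap between observed-scale and latent correlations seen in Figure 5.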
Figure 4: The coverage of the 95% asymptotic confidence interval for SGCRM regression coefficients. The
red dotted line corresponds to the 0.95 coverage.

After estimating the GLNPN latent correlation matrix, we next fit four mutually consistent conditional
SGCRM models by treating one of the mixed-type variables as the outcome and some of the others as
predictors. Specifically, we consider one outcome of each type: TAC (continuous), VPA (truncated),
Health Status (ordinal), and Mortality (binary). We compare SGCRM models with their traditional
counterparts, namely simple linear regression, truncated regression, ordinal probit, and probit
regressions, in Tables 4, 5, 6, and 7, respectively. Both SGCRM and traditional estimates are reported with
95% confidence intervals. We want to compare the direction and significance of conditional associations
captured by SGCRM and traditional models.
We start with a continuous outcome, TAC. Table 4 contrasts the results of the two models. We observe
that Mobility Problem has a significant negative effect on TAC in both SGCRM (-0.543 (-0.594, -0.492))
and the linear model (-0.342 (-0.378, -0.306)). Furthermore, different levels of reported Health Status
have a significant negative effect on TAC in the simple linear model, but when the ordinal categories of
Health Status are represented via the corresponding GLNPN latent variable, we do not observe a significant
association with TAC in the SGCRM model. This is one clear disadvantage of our approach for ordinal
variables: if the effect is not present across all levels, collapsing all levels may lose partial
significance, as we observe here. We discuss this further in the Discussion.
Next, we treat the truncated variable VPA as the outcome and compare the SGCRM model with the
Gaussian truncated regression model. The results are shown in Table 5. The direction and the
significance of associations estimated by SGCRM are in agreement with those estimated by truncated
regression. However, regression coefficients from the truncated Gaussian regression model are not scaled
and, hence, their magnitudes cannot be compared. In contrast, SGCRM coefficients are normalized and
can be compared across covariates. For example, we observe that the estimated effect of Mobility Problem
is much larger in magnitude than that of Health Status: -0.32 (-0.376, -0.265) vs -0.055 (-0.108, -0.004).
Next, we model Health Status as an ordinal outcome with Mobility Problem and VPA as two covariates.
Because VPA is a truncated variable, we represent VPA via two components in the traditional
model: (1) an indicator variable of VPA being equal to 0, and (2) the VPA value itself. Note that
this representation is not needed in SGCRM. The results are shown in Table 6. VPA is negatively associated
with a higher-value (worse) Health Status both in the probit ordinal model, with a regression coefficient of
-0.017 (-0.024, -0.011), and in the SGCRM model, with a regression coefficient of -0.047 (-0.091, -0.003).
Mobility Problem is significant and positively associated with a higher-value (worse) Health Status in
both models. Again, SGCRM allows comparing the magnitudes of the estimated effects of Mobility Problem
and VPA, with a much larger effect of the former.

Table 3: Mixed GLNPN variables

Variable number | Variable type | Cutpoint(s)
X1 | Binary (high entropy) | 0.3
X2 | Continuous | Not applicable
X3 | Binary (low entropy) | 1
X4 | Ordinal (3 categories and high entropy) | (-0.1, 0.6)
X5 | Ordinal (3 categories and low entropy) | (-1, 1)
X6 | Ordinal (4 categories and high entropy) | (-0.7, 0.1, 0.6)
X7 | Ordinal (4 categories and low entropy) | (-0.3, 0.1, 0.2)
X8 | Truncated | 0

(a) Pearson correlation matrix (b) GLNPN latent correlation matrix

Figure 5: The estimated 5 × 5 correlation matrices of our variables from NHANES 2003-04 and 2005-06

Finally, we look at 5-year mortality as a binary outcome with Mobility Problem, Health Status,
and TAC as covariates. The results are shown in Table 7. We see that the highest (worst) Health Status
level is significant in probit regression, but when the ordinal categories of Health Status are represented
via the corresponding GLNPN latent variable, we do not observe a significant association with mortality in
the SGCRM model. As discussed above, this is a limitation of SGCRM. We again get interpretable
regression coefficients, and we see that the magnitude of the effect of TAC is almost twice that of
Mobility Problem.
It is also important to note that all four SGCRM models are mutually consistent, in contrast to the
set of four traditional regressions: simple linear, truncated, probit ordinal, and probit.
In the final step, we calculate predictions of latent variables using the methods described in Section
4.1. Figure 6 shows the distribution of the predicted latent variables for the five variables. The figure
also shows the interrelation between those variables on the off-diagonal blocks of the scatterplot matrix.
We see that the distribution of predicted latent variables approximates the assumed latent normal
distribution, but with multiple modes originating from the discontinuities of the distributions of observed
variables. It is interesting to note that scatterplots of predicted latent variables for mortality vs.
Mobility Problem, mortality vs. TAC, and TAC vs. Mobility Problem reveal linear patterns (with some
discontinuities around the cutoffs). This particular observation would be harder, if not impossible, to
make by visually exploring scatterplots of the observed counterparts.

Table 4: Comparison of simple linear model and SGCRM results for continuous outcome

TAC ~ Mobility Problem + Health Status

Covariate (simple linear regression) | Coefficient (95% CI) | Covariate (SGCRM) | Coefficient (95% CI)
Mobility Problem (1) | -0.342 (-0.378, -0.306) | Mobility Problem | -0.543 (-0.594, -0.492)
Health Status (2) | -0.073 (-0.134, -0.013) | Health Status | 0.011 (-0.036, 0.059)
Health Status (3) | -0.152 (-0.211, -0.094) | NA |
Health Status (4) | -0.153 (-0.217, -0.09) | NA |
Health Status (5) | -0.181 (-0.271, -0.09) | NA |

Table 5: Comparison of truncated Gaussian regression and SGCRM results for truncated outcome

VPA ~ Mobility Problem + Health Status

Covariate (truncated Gaussian regression) | Coefficient (95% CI) | Covariate (SGCRM) | Coefficient (95% CI)
Mobility Problem (1) | -629.875 (-753.338, -506.412) | Mobility Problem | -0.32 (-0.376, -0.265)
Health Status (2) | -63.201 (-88.362, -38.041) | Health Status | -0.055 (-0.108, -0.004)
Health Status (3) | -194.491 (-239.269, -149.714) | NA |
Health Status (4) | -195.522 (-248.435, -142.608) | NA |
Health Status (5) | -285.281 (-403.648, -166.914) | NA |

[Figure 6 image: scatterplot matrix of the predicted latent variables Mortality, Mobility.Problem, Health.Status, VPA, and TAC. Correlations reported in the upper panels: Mortality with Mobility.Problem 0.618, Health.Status 0.335, VPA -0.483, TAC -0.672; Mobility.Problem with Health.Status 0.618, VPA -0.449, TAC -0.636; Health.Status with VPA -0.244, TAC -0.268; VPA with TAC 0.661.]

Figure 6: Predictions of latent variables in NHANES

Table 6: Comparison of probit ordinal regression and SGCRM results for ordinal outcome

Health Status ~ Mobility Problem + VPA

Covariate (probit ordinal regression) | Coefficient (95% CI) | Covariate (SGCRM) | Coefficient (95% CI)
Mobility Problem (1) | 0.874 (0.79, 0.959) | Mobility Problem | 0.494 (0.45, 0.537)
I(VPA == 0) | 0.12 (0.04, 0.199) | NA |
VPA | -0.017 (-0.024, -0.011) | VPA | -0.047 (-0.091, -0.003)

Table 7: Comparison of probit regression and SGCRM results for binary outcome

Mortality ~ Mobility Problem + Health Status + TAC

Covariate (probit regression) | Coefficient (95% CI) | Covariate (SGCRM) | Coefficient (95% CI)
Mobility Problem (1) | 0.335 (0.192, 0.478) | Mobility Problem | 0.195 (0.097, 0.297)
Health Status (2) | 0.086 (-0.212, 0.402) | Health Status | 0.033 (-0.042, 0.103)
Health Status (3) | 0.244 (-0.038, 0.547) | NA |
Health Status (4) | 0.195 (-0.102, 0.511) | NA |
Health Status (5) | 0.455 (0.094, 0.826) | NA |
TAC | $-3.952 \times 10^{-6}$ ($-4.752 \times 10^{-6}$, $-3.174 \times 10^{-6}$) | TAC | -0.352 (-0.427, -0.272)

7 Discussion
The main contribution of this paper is a joint modeling approach for mixed data types that builds on
the semiparametric Gaussian copula. The approach is scale-free, robust, and fast. Adapted to perform linear
regression on the latent space, SGCRM provides mutually consistent conditional regression models as a
unifying alternative to a range of popular conditional regression models such as simple linear regression
and probit-like regressions, including truncated regression, ordinal probit regression, and probit regression.
Our likelihood-free approach is more computationally efficient than likelihood-based joint copula models.
Embedding the variables using a semiparametric Gaussian copula automatically normalizes the
scale of all latent variables, which results in standardized and more interpretable regression coefficients.
The approach also allows defining $R^2$ for all four types of outcomes and performing missing data
imputation.
In the NHANES application, we kept our models simple to carefully illustrate the approach and the
interpretability of the results. In terms of computational complexity, our method needs to estimate
$O(p^2)$ correlation parameters, and the calculation of Kendall's Tau takes only $O(n \log n)$ FLOPs for
each estimation. Hence, with quadratic complexity in $p$, our approach scales very well with
increasing $p$. Moreover, we propose a computationally efficient way of calculating the asymptotic
variance-covariance matrix of the parameters in $O(n^2)$ FLOPs, compared to $O(n^4)$ for the brute-force
approach.
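
To make the $O(p^2)$ pairwise structure concrete, the Python sketch below computes a Kendall's Tau matrix pair by pair with `scipy.stats.kendalltau`. It is an illustration under stated assumptions rather than the SGCRM implementation: the helper `kendall_matrix` and the simulated data are hypothetical, and only the well-known bridge for a pair of continuous variables, $\rho = \sin(\pi\tau/2)$, is inverted here; mixed pairs require the type-specific bridging functions.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_matrix(x):
    """Pairwise Kendall's Tau matrix: p(p - 1)/2 calls to kendalltau."""
    x = np.asarray(x, dtype=float)
    p = x.shape[1]
    k = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(x[:, i], x[:, j])
            k[i, j] = k[j, i] = tau
    return k

rng = np.random.default_rng(3)
rho = 0.6                                    # assumed latent correlation
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=4000)

# Monotone transformations change the marginals but not the ranks
observed = np.column_stack([np.exp(latent[:, 0]), latent[:, 1] ** 3])

k_hat = kendall_matrix(observed)
rho_hat = np.sin(np.pi * k_hat[0, 1] / 2.0)  # continuous-continuous bridge inverse
```

Because Kendall's Tau depends only on ranks, `rho_hat` recovers the latent correlation even though neither observed variable is normally distributed.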
In terms of limitations, SGCRM is less flexible in dealing with multiple ordinal levels. For example,
traditional models can estimate two different effects for an ordinal covariate with three categories. In
comparison, our model works on the latent scale for the ordinal variable and assumes a uniform magnitude
and direction of the effect. Compared to the traditional models, SGCRM also loses interpretability on the
original scales of covariates, because SGCRM coefficients are only interpretable on a latent scale.
As future work, it would be important to develop a quantile-scale interpretation of SGCRM regression
results. SGCRM can also be adapted to handle survival outcomes. Moreover, it would be interesting to
adapt GLNPN to deal with functional and multi-level/longitudinal mixed data. As one of the methodological
applications, we provided an algorithm for imputing missing observations, and it would be very
interesting to compare that approach with existing ones. Finally, predicted latent variables can be used
within distance-based clustering approaches under mixed data settings as well as for dimension reduction
of multivariate mixed data.

S1 Proofs
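Proof of Theorem 1: Let $X_j, X_k$ be ordinal with levels $\{0, 1, \cdots, l_j - 1\}$ and $\{0, 1, \cdots, l_k - 1\}$,
respectively, and let $(L_j, L_k)$ be the corresponding latent standard bivariate normal with correlation
$\sigma_{jk}$. Then, for two independent observations $i, i'$,

$$\begin{aligned}
P(X_{ij} > X_{i'j}, X_{ik} > X_{i'k}) &= \sum_{r=1}^{l_j-1}\sum_{s=1}^{l_k-1} \left[P(X_{ij} = r, X_{ik} = s)\,P(X_{i'j} < r, X_{i'k} < s)\right] \\
&= \sum_{r=1}^{l_j-1}\sum_{s=1}^{l_k-1} \left[P(\Delta_{jr} \le L_{ij} < \Delta_{j(r+1)}, \Delta_{ks} \le L_{ik} < \Delta_{k(s+1)})\,P(L_{i'j} < \Delta_{jr}, L_{i'k} < \Delta_{ks})\right] \\
&= \sum_{r=1}^{l_j-1}\sum_{s=1}^{l_k-1} \left[\tilde\Phi_2((\Delta_{jr}, \Delta_{ks}), (\Delta_{j(r+1)}, \Delta_{k(s+1)}); \Sigma_{jk})\,\tilde\Phi_2((-\infty, -\infty), (\Delta_{jr}, \Delta_{ks}); \Sigma_{jk})\right]
\end{aligned} \tag{S1}$$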
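Similarly,

$$\begin{aligned}
P(X_{ij} > X_{i'j}, X_{ik} < X_{i'k}) &= \sum_{r=1}^{l_j-1}\sum_{s=1}^{l_k-1} \left[P(X_{ij} = r, X_{ik} = s-1)\,P(X_{i'j} < r, X_{i'k} > (s-1))\right] \\
&= \sum_{r=1}^{l_j-1}\sum_{s=1}^{l_k-1} \left[P(\Delta_{jr} \le L_{ij} < \Delta_{j(r+1)}, \Delta_{k(s-1)} \le L_{ik} < \Delta_{ks})\,P(L_{i'j} < \Delta_{jr}, L_{i'k} > \Delta_{ks})\right] \\
&= \sum_{r=1}^{l_j-1}\sum_{s=1}^{l_k-1} \left[P(\Delta_{jr} \le L_{ij} < \Delta_{j(r+1)}, \Delta_{k(s-1)} \le L_{ik} < \Delta_{ks})\,P(L_{i'j} < \Delta_{jr}, -L_{i'k} < -\Delta_{ks})\right] \\
&= \sum_{r=1}^{l_j-1}\sum_{s=1}^{l_k-1} \left[\tilde\Phi_2((\Delta_{jr}, \Delta_{k(s-1)}), (\Delta_{j(r+1)}, \Delta_{ks}); \Sigma_{jk})\,\tilde\Phi_2((-\infty, -\infty), (\Delta_{jr}, -\Delta_{ks}); -\Sigma_{jk})\right]
\end{aligned} \tag{S2}$$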
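By symmetry, the population Kendall's Tau $\tau_{jk}$ for $X_j, X_k$ can be written as follows:

$$\begin{aligned}
\tau_{jk} &= 2\left(P(X_{ij} > X_{i'j}, X_{ik} > X_{i'k}) - P(X_{ij} > X_{i'j}, X_{ik} < X_{i'k})\right) \\
&= 2\sum_{r=1}^{l_j-1}\sum_{s=1}^{l_k-1} \Big[\tilde\Phi_2((\Delta_{jr}, \Delta_{ks}), (\Delta_{j(r+1)}, \Delta_{k(s+1)}); \Sigma_{jk})\,\tilde\Phi_2((-\infty, -\infty), (\Delta_{jr}, \Delta_{ks}); \Sigma_{jk}) \\
&\qquad - \tilde\Phi_2((\Delta_{jr}, \Delta_{k(s-1)}), (\Delta_{j(r+1)}, \Delta_{ks}); \Sigma_{jk})\,\tilde\Phi_2((-\infty, -\infty), (\Delta_{jr}, -\Delta_{ks}); -\Sigma_{jk})\Big]
\end{aligned} \tag{S3}$$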

Now, reducing the above calculations for $l_k = 2$ yields the bridging function between a general
ordinal variable and a binary variable.
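Now, suppose we have a truncated variable $X_m$ with cutoff $\Delta_m$ and corresponding latent normal
variable $L_m$, and let $\rho$ denote the latent correlation between $L_j$ and $L_m$. Then, redoing the above
calculations gives

$$\begin{aligned}
P(X_{ij} > X_{i'j}, X_{im} > X_{i'm}) &= \sum_{r=1}^{l_j-1} \Big[P(X_{ij} = r, X_{im} > 0)\,P(X_{i'j} < r, X_{i'm} = 0) \\
&\qquad + P(X_{ij} = r, X_{i'j} < r, X_{i'm} > 0, X_{im} - X_{i'm} > 0)\Big] \\
&= \sum_{r=1}^{l_j-1} \Big[P(\Delta_{jr} \le L_{ij} < \Delta_{j(r+1)}, L_{im} > \Delta_m)\,P(L_{i'j} < \Delta_{jr}, L_{i'm} \le \Delta_m) \\
&\qquad + P\big(\Delta_{jr} \le L_{ij} < \Delta_{j(r+1)}, L_{i'j} < \Delta_{jr}, L_{i'm} > \Delta_m, \tfrac{L_{im} - L_{i'm}}{\sqrt{2}} > 0\big)\Big] \\
&= \sum_{r=1}^{l_j-1} \Big[P(\Delta_{jr} \le L_{ij} < \Delta_{j(r+1)}, -L_{im} < -\Delta_m)\,P(L_{i'j} < \Delta_{jr}, L_{i'm} \le \Delta_m) \\
&\qquad + P\big(\Delta_{jr} \le L_{ij} < \Delta_{j(r+1)}, L_{i'j} < \Delta_{jr}, -L_{i'm} < -\Delta_m, \tfrac{L_{i'm} - L_{im}}{\sqrt{2}} < 0\big)\Big] \\
&= \sum_{r=1}^{l_j-1} \Big[\tilde\Phi_2((\Delta_{jr}, -\infty), (\Delta_{j(r+1)}, -\Delta_m); -\rho)\,\tilde\Phi_2((-\infty, -\infty), (\Delta_{jr}, \Delta_m); \rho) \\
&\qquad + \tilde\Phi_4((\Delta_{jr}, -\infty, -\infty, -\infty), (\Delta_{j(r+1)}, \Delta_{jr}, -\Delta_m, 0); S_{5a}(\rho))\Big]
\end{aligned}$$

$$\begin{aligned}
P(X_{ij} < X_{i'j}, X_{im} > X_{i'm}) &= \sum_{r=1}^{l_j-1} \Big[P(X_{ij} = r-1, X_{im} > 0)\,P(X_{i'j} > r-1, X_{i'm} = 0) \\
&\qquad + P(X_{ij} = r-1, X_{i'j} > r-1, X_{i'm} > 0, X_{im} - X_{i'm} > 0)\Big] \\
&= \sum_{r=1}^{l_j-1} \Big[P(\Delta_{j(r-1)} \le L_{ij} < \Delta_{jr}, L_{im} > \Delta_m)\,P(L_{i'j} > \Delta_{jr}, L_{i'm} \le \Delta_m) \\
&\qquad + P\big(\Delta_{j(r-1)} \le L_{ij} < \Delta_{jr}, L_{i'j} > \Delta_{jr}, L_{i'm} > \Delta_m, \tfrac{L_{im} - L_{i'm}}{\sqrt{2}} > 0\big)\Big] \\
&= \sum_{r=1}^{l_j-1} \Big[P(\Delta_{j(r-1)} \le L_{ij} < \Delta_{jr}, -L_{im} < -\Delta_m)\,P(-L_{i'j} < -\Delta_{jr}, L_{i'm} \le \Delta_m) \\
&\qquad + P\big(\Delta_{j(r-1)} \le L_{ij} < \Delta_{jr}, -L_{i'j} < -\Delta_{jr}, -L_{i'm} < -\Delta_m, \tfrac{L_{i'm} - L_{im}}{\sqrt{2}} < 0\big)\Big] \\
&= \sum_{r=1}^{l_j-1} \Big[\tilde\Phi_2((\Delta_{j(r-1)}, -\infty), (\Delta_{jr}, -\Delta_m); -\rho)\,\tilde\Phi_2((-\infty, -\infty), (-\Delta_{jr}, \Delta_m); -\rho) \\
&\qquad + \tilde\Phi_4((\Delta_{j(r-1)}, -\infty, -\infty, -\infty), (\Delta_{jr}, -\Delta_{jr}, -\Delta_m, 0); S_{5b}(\rho))\Big]
\end{aligned} \tag{S4}$$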
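where

$$S_{5a}(\rho) = \mathrm{cov}\left(L_{ij}, L_{i'j}, -L_{i'm}, \tfrac{L_{i'm} - L_{im}}{\sqrt{2}}\right) = \begin{pmatrix} 1 & 0 & 0 & -\frac{\rho}{\sqrt{2}} \\ 0 & 1 & -\rho & \frac{\rho}{\sqrt{2}} \\ 0 & -\rho & 1 & -\frac{1}{\sqrt{2}} \\ -\frac{\rho}{\sqrt{2}} & \frac{\rho}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1 \end{pmatrix}$$

and

$$S_{5b}(\rho) = \mathrm{cov}\left(L_{ij}, -L_{i'j}, -L_{i'm}, \tfrac{L_{i'm} - L_{im}}{\sqrt{2}}\right) = \begin{pmatrix} 1 & 0 & 0 & -\frac{\rho}{\sqrt{2}} \\ 0 & 1 & \rho & -\frac{\rho}{\sqrt{2}} \\ 0 & \rho & 1 & -\frac{1}{\sqrt{2}} \\ -\frac{\rho}{\sqrt{2}} & -\frac{\rho}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1 \end{pmatrix}.$$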
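Hence, we get

$$\begin{aligned}
F_{to}(\rho; \Delta_j, \Delta_m) = 2\sum_{r=1}^{l_j-1} \Big[&\tilde\Phi_2((\Delta_{jr}, -\infty), (\Delta_{j(r+1)}, -\Delta_m); -\rho)\,\tilde\Phi_2((-\infty, -\infty), (\Delta_{jr}, \Delta_m); \rho) \\
&+ \tilde\Phi_4((\Delta_{jr}, -\infty, -\infty, -\infty), (\Delta_{j(r+1)}, \Delta_{jr}, -\Delta_m, 0); S_{5a}(\rho)) \\
&- \tilde\Phi_2((\Delta_{j(r-1)}, -\infty), (\Delta_{jr}, -\Delta_m); -\rho)\,\tilde\Phi_2((-\infty, -\infty), (-\Delta_{jr}, \Delta_m); -\rho) \\
&- \tilde\Phi_4((\Delta_{j(r-1)}, -\infty, -\infty, -\infty), (\Delta_{jr}, -\Delta_{jr}, -\Delta_m, 0); S_{5b}(\rho))\Big].
\end{aligned}$$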
Proof of Theorem 2:
First, we introduce some notation. For a $p \times p$ correlation matrix $A$, we can write the spectral
(eigenvalue) decomposition of $A$ as $A = Q\Lambda Q^T$, where $Q$ is an orthonormal matrix and
$\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_p)$ contains the eigenvalues of $A$. We define $\log A = Q \log\Lambda\, Q^T$, where
$\log\Lambda = \mathrm{diag}(\log\lambda_1, \cdots, \log\lambda_p)$.
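
A short Python sketch of this matrix logarithm and of the vectorization $\mathrm{vecl}(\log A)$ used later in the proof is given below. The function name `gamma_map` and the column-wise ordering of the strictly lower-triangular entries are assumptions made for illustration. For a $2 \times 2$ correlation matrix with off-diagonal entry $r$, the single element of $\mathrm{vecl}(\log A)$ equals $\operatorname{arctanh}(r)$, which gives a simple check.

```python
import numpy as np

def gamma_map(a):
    """vecl(log A): strictly lower-triangular entries of the matrix logarithm."""
    a = np.asarray(a, dtype=float)
    eigval, q = np.linalg.eigh(a)            # A = Q diag(eigval) Q^T
    log_a = (q * np.log(eigval)) @ q.T       # Q diag(log eigval) Q^T
    # Column-wise ordering of the entries below the diagonal (assumed convention)
    cols, rows = np.triu_indices(a.shape[0], k=1)
    return log_a[rows, cols]

a2 = np.array([[1.0, 0.5],
               [0.5, 1.0]])
g2 = gamma_map(a2)                           # one element, arctanh(0.5)

a3 = np.array([[1.0, 0.5, 0.3],
               [0.5, 1.0, 0.2],
               [0.3, 0.2, 1.0]])
g3 = gamma_map(a3)                           # p(p - 1)/2 = 3 elements
```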

First, we need to state the results for the asymptotic variance of Kendall's Tau as calculated in
Hoeffding (1992) and El Maache and Lepage (2003). Using the asymptotic theory of U-statistics, we
state the results in Lemma 3 below.

Lemma 3. Let $K_n$ be the Kendall's Tau matrix estimated from the data. Then $\sqrt{n}(\mathrm{vecl}(K_n) - \mathrm{vecl}(K))$
is asymptotically normal with mean 0 and variance-covariance matrix $V_K$, where
$K_{ij} = E(\mathrm{sgn}((X_{i1} - X_{i2})(X_{j1} - X_{j2})))$ and

$$V_{K(ij),(kl)} = 4\left(E(\mathrm{sgn}((X_{i1} - X_{i2})(X_{j1} - X_{j2})(X_{k2} - X_{k3})(X_{l2} - X_{l3}))) - (\mathrm{vecl}(K)\mathrm{vecl}(K)^T)_{(ij),(kl)}\right).$$

Here $\{(ij), (kl)\}$ denotes the entry corresponding to the covariance of Kendall's Tau between the $(ij)$-th
and $(kl)$-th pairs of variables.
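Now, we can rewrite the expression $E(\mathrm{sgn}((X_{i1} - X_{i2})(X_{j1} - X_{j2})(X_{k2} - X_{k3})(X_{l2} - X_{l3})))$ as follows:

$$\begin{aligned}
& E(\mathrm{sgn}((X_{i1} - X_{i2})(X_{j1} - X_{j2})(X_{k2} - X_{k3})(X_{l2} - X_{l3}))) \\
&= E(E(\mathrm{sgn}((X_{i1} - X_{i2})(X_{j1} - X_{j2})(X_{k2} - X_{k3})(X_{l2} - X_{l3})) \mid (X_{i2}, X_{j2}, X_{k2}, X_{l2}))) \\
&= E(E(\mathrm{sgn}((X_{i1} - X_{i2})(X_{j1} - X_{j2})) \mid (X_{i2}, X_{j2}, X_{k2}, X_{l2})) \cdot E(\mathrm{sgn}((X_{k2} - X_{k3})(X_{l2} - X_{l3})) \mid (X_{i2}, X_{j2}, X_{k2}, X_{l2}))) \\
&= E(E(\mathrm{sgn}((X_{i1} - X_{i2})(X_{j1} - X_{j2})) \mid (X_{i2}, X_{j2})) \cdot E(\mathrm{sgn}((X_{k2} - X_{k3})(X_{l2} - X_{l3})) \mid (X_{k2}, X_{l2}))) \\
&= E(H_{ij}(X_{i2}, X_{j2})\,H_{kl}(X_{k2}, X_{l2}))
\end{aligned} \tag{S5}$$

where $H_{ij}(x, y) = E(\mathrm{sgn}((X_i - x)(X_j - y)))$. We can estimate $H_{ij}(x, y)$ from the sample as
$\hat H_{ij}(x, y) = \frac{1}{n}\sum_{m=1}^{n} \mathrm{sgn}((X_{im} - x)(X_{jm} - y))$.
Hence, the quantity in (S5) can be estimated as

$$\frac{1}{n}\sum_{m=1}^{n} \hat H_{ij}(X_{im}, X_{jm})\,\hat H_{kl}(X_{km}, X_{lm}).$$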
Evaluating each $\hat H_{ij}$ at a single sample point requires $O(n)$ FLOPs, and this is repeated over the $n$
terms of the sum, resulting in $O(n^2)$ computational complexity. This is a significant improvement
over calculating the quantity in (S5) by brute force, which would require $O(n^4)$ FLOPs. Hence,
we provide a novel, efficient way of calculating the asymptotic variance of Kendall's Tau, which would
otherwise be infeasible even for moderate $n$.
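
The $O(n^2)$ estimator described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names `h_hat` and `kendall_cov_entry` are hypothetical, and the term $K_{ij}K_{kl}$ is replaced by the product of the sample means of $\hat H_{ij}$ and $\hat H_{kl}$, which is an assumed plug-in choice. For two independent continuous variables the asymptotic variance of $\sqrt{n}\,\hat\tau$ is known to be $4/9$, which serves as a sanity check.

```python
import numpy as np

def h_hat(xi, xj):
    """H_hat_ij evaluated at every sample point (X_im, X_jm); returns n values."""
    # di[m, m2] = sgn(X_i,m2 - X_i,m); same construction for dj
    di = np.sign(xi[None, :] - xi[:, None])
    dj = np.sign(xj[None, :] - xj[:, None])
    # Average over m2 of sgn((X_i,m2 - X_i,m)(X_j,m2 - X_j,m)): O(n^2) in total
    return (di * dj).mean(axis=1)

def kendall_cov_entry(x, i, j, k, l):
    """Estimate of 4 * (E[H_ij H_kl] - K_ij K_kl) for the pairs (i, j), (k, l)."""
    h_ij = h_hat(x[:, i], x[:, j])
    h_kl = h_hat(x[:, k], x[:, l])
    # Plug-in (assumption): sample means of h_ij, h_kl stand in for K_ij, K_kl
    return 4.0 * (np.mean(h_ij * h_kl) - h_ij.mean() * h_kl.mean())

rng = np.random.default_rng(2)
x = rng.standard_normal((800, 2))            # two independent continuous variables
v = kendall_cov_entry(x, 0, 1, 0, 1)         # should be near 4/9
```

The pairwise sign matrices use $O(n^2)$ memory; for large $n$ the same sums could be accumulated in blocks.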
Now we want to derive the asymptotic normality of $\mathrm{vecl}(\Sigma_n)$ and $\mathrm{vec}(\beta)$ using the Delta method and the
following result.
As shown in Archakova and Hansen (2018), a correlation matrix $A$ can be parametrized by $\mathrm{vecl}(\log A)$.
There exists a bijective map $\gamma: \mathcal{C}_p \to \mathbb{R}^{p(p-1)/2}$ defined by $\gamma(A) = \mathrm{vecl}(\log A)$, where $\mathcal{C}_p$
denotes the set of $p \times p$ correlation matrices. As described in Tracy and Jinadasa (1988), a general
technique for defining derivatives with respect to a structured matrix (such as a correlation matrix) is
to first define a map from the matrix to the independent elements of the matrix and then extend the
function under investigation to the set of general matrices. For example, for a function $h(A)$ of a
correlation matrix $A$, we define the derivative as

$$\frac{d\,\mathrm{vec}(h(A))}{d\,\mathrm{vecl}(\gamma(A))} = \frac{d\,\mathrm{vec}(h(A))}{d\,\mathrm{vec}(A)}\,\frac{d\,\mathrm{vec}(A)}{d\,\mathrm{vecl}(A)}\,\frac{d\,\mathrm{vecl}(A)}{d\,\mathrm{vecl}(\gamma(A))},$$
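where the first derivative $\frac{d\,\mathrm{vec}(h(A))}{d\,\mathrm{vec}(A)}$ is defined assuming $h$ is a general map defined on unstructured
matrices. We can use this result and the chain rule to derive the following:

$$\begin{aligned}
\frac{d\,\mathrm{vecl}(\Sigma)}{d\,\mathrm{vecl}(\gamma(K))} &= \frac{d\,\mathrm{vecl}(\Sigma)}{d\,\mathrm{vecl}(K)}\,\frac{d\,\mathrm{vecl}(K)}{d\,\mathrm{vecl}(\gamma(K))} = D_g \Gamma \\
\frac{d\,\mathrm{vec}(\beta)}{d\,\mathrm{vecl}(\gamma(K))} &= \frac{d\,\mathrm{vec}(\beta)}{d\,\mathrm{vec}(\Sigma)}\,\frac{d\,\mathrm{vec}(\Sigma)}{d\,\mathrm{vecl}(\Sigma)}\,\frac{d\,\mathrm{vecl}(\Sigma)}{d\,\mathrm{vecl}(\gamma(K))} = D_\beta H_p D_g \Gamma
\end{aligned} \tag{S6}$$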

Here, $H_p$ denotes the duplication matrix of order $p$, which transforms $\mathrm{vecl}(A)$ to $\mathrm{vec}(A)$ for any matrix $A$;
$D_g = \mathrm{diag}(g'(K))$, where $g'$ is the first-order derivative of the individual bridging functions, applied
to $K$; and $D_\beta = \left(\Sigma_{22}^{-1},\ -(\Sigma_{21} \otimes I_p)(\Sigma_{22}^{-1} \otimes \Sigma_{22}^{-1})\right)\tilde E$, where $\tilde E$ transforms $\mathrm{vec}(\Sigma)$ to
$(\mathrm{vec}(\Sigma_{21}), \mathrm{vec}(\Sigma_{22}))$. $\Gamma$ is calculated in Archakova and Hansen (2018).

Now, to use the results in (S6), we first derive the asymptotic normality of $\sqrt{n}\,\mathrm{vecl}(\gamma(K_n) - \gamma(K))$
using results in Archakova and Hansen (2018) and calculate the asymptotic covariance matrix as $V_\gamma$.
Then, under the regularity assumptions, we apply the delta method to get the asymptotic covariance matrix
of $\sqrt{n}(\mathrm{vecl}(\hat\Sigma_n) - \mathrm{vecl}(\Sigma))$ as $V_\Sigma = (D_g \Gamma)^T V_\gamma D_g \Gamma$ and the asymptotic covariance matrix of
$\sqrt{n}(\hat\beta_n - \beta)$ as $V_\beta = (D_\beta H_p D_g \Gamma)^T V_\gamma D_\beta H_p D_g \Gamma$.

S2 Additional Plots and Tables

Table S1: Latent correlation matrix of 5 variables of interest

Mortality Mobility Problem Health Status VPA TAC


Mortality 1.00 0.39 0.23 -0.30 -0.45
Mobility Problem 0.39 1.00 0.50 -0.35 -0.53
Health Status 0.23 0.50 1.00 -0.24 -0.27
VPA -0.30 -0.35 -0.24 1.00 0.63
TAC -0.45 -0.53 -0.27 0.63 1.00

Table S2: Pearson’s correlation matrix of 5 variables of interest

Mortality Mobility Problem Health Status VPA TAC


Mortality 1.00 0.2 0.13 -0.08 -0.23
Mobility Problem 0.2 1.00 0.38 -0.15 -0.37
Health Status 0.13 0.38 1.00 -0.15 -0.37
VPA -0.08 -0.15 -0.15 1.00 0.5
TAC -0.23 -0.37 -0.22 0.5 1.00

Figure S1: The coverage of the 95% asymptotic confidence interval for latent correlations. The red
dotted line denotes 0.95 line

[Figure S2 image: scatterplot matrix of the observed variables Mortality, Mobility Problem, Health Status, VPA, and TAC.]

Figure S2: Exploratory analysis for our variables of interest from NHANES

References
Anderson, J. A. and Pemberton, J. (1985). The grouped continuous model for multivariate ordered
categorical variables and covariate adjustment. Biometrics pages 875–885.
Archakova, I. and Hansen, P. (2018). A new parametrization of correlation matrices. Technical report,
Working Paper.
Cai, T. T. and Zhang, L. (2015). High-dimensional gaussian copula regression: Adaptive estimation and
statistical inference. arXiv preprint arXiv:1512.02487 .

Cragg, J. G. (1971). Some statistical models for limited dependent variables with application to the
demand for durable goods. Econometrica: Journal of the Econometric Society pages 829–844.
De Leon, A. (2005). Pairwise likelihood approach to grouped continuous model and its extension.
Statistics & probability letters 75, 49–57.

de Leon, A. R. and Carrière, K. (2007). General mixed-data model: Extension of general location and
grouped continuous models. Canadian Journal of Statistics 35, 533–548.
Eicker, F. (1963). Asymptotic normality and consistency of the least squares estimators for families of
linear regressions. The annals of mathematical statistics 34, 447–456.

El Maache, H. and Lepage, Y. (2003). Spearman’s rho and kendall’s tau for multivariate data sets.
Lecture Notes-Monograph Series pages 113–130.
Fan, J., Liu, H., Ning, Y., and Zou, H. (2017). High dimensional semiparametric latent graphical model
for mixed data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 405–
421.

Fan, J., Xue, L., and Zou, H. (2016). Multitask quantile regression under the transnormal model. Journal
of the American Statistical Association 111, 1726–1735.
Feng, H. and Ning, Y. (2019). High-dimensional mixed graphical model with ordinal data: Parameter
estimation and statistical inference. In The 22nd International Conference on Artificial Intelligence
and Statistics, pages 654–663.
Hausman, J. A. and Wise, D. A. (1977). Social experimentation, truncated distributions, and efficient
estimation. Econometrica: Journal of the Econometric Society pages 919–938.
Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and
limited dependent variables and a simple estimator for such models. In Annals of economic and social
measurement, volume 5, number 4, pages 475–492. NBER.
Higham, N. J. (2002). Computing the nearest correlation matrix—a problem from finance. IMA journal
of Numerical Analysis 22, 329–343.
Hoeffding, W. (1992). A class of statistics with asymptotically normal distribution. In Breakthroughs in
statistics, pages 308–334. Springer.
Huang, M., Müller, C. L., and Gaynanova, I. (2021). latentcor: An r package for estimating latent
correlations from mixed data types. arXiv preprint arXiv:2108.09180 .
Jiryaie, F., Withanage, N., Wu, B., and De Leon, A. (2016). Gaussian copula distributions for mixed
data, with application in discrimination. Journal of Statistical Computation and Simulation 86, 1643–
1659.
Joe, H. (2006). Generating random correlation matrices based on partial correlations. Journal of Multi-
variate Analysis 97, 2177–2189.
Leroux, A., Di, J., Smirnova, E., Mcguffey, E. J., Cao, Q., Bayatmokhtari, E., Tabacu, L., Zipunnikov,
V., Urbanek, J. K., and Crainiceanu, C. (2019). Organizing and analyzing the activity data in nhanes.
Statistics in Biosciences pages 1–26.
Liu, H., Han, F., Yuan, M., Lafferty, J., and Wasserman, L. (2012). High-dimensional semiparametric
gaussian copula graphical models. The Annals of Statistics 40, 2293–2326.

Liu, H., Lafferty, J., and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high
dimensional undirected graphs. Journal of Machine Learning Research 10, 2295–2328.
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society: Series
B (Methodological) 42, 109–127.
Olkin, I. and Tate, R. F. (1961). Multivariate correlation models with mixed discrete and continuous
variables. The Annals of Mathematical Statistics pages 448–465.
Quan, X., Booth, J. G., and Wells, M. T. (2018). Rank-based approach for estimating correlations in
mixed ordinal data. arXiv preprint arXiv:1809.06255 .
Song, P. X.-K., Li, M., and Yuan, Y. (2009). Joint regression analysis of correlated data using gaussian
copulas. Biometrics 65, 60–68.
Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica: journal of
the Econometric Society pages 24–36.
Tracy, D. and Jinadasa, K. (1988). Patterned matrix derivatives. Canadian Journal of Statistics 16,
411–418.

Wang, W. Y. and Hua, Z. (2014). A semiparametric gaussian copula regression model for predicting
financial risks from earnings calls. In Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1155–1165.
Wilhelm, S. and Manjunath, B. G. (2015). tmvtnorm: Truncated Multivariate Normal and Student t Distribution.
R package version 1.4-10.

Wilhelm, S. and Manjunath, B. G. (2010). tmvtnorm : A package for the truncated multivariate normal
distribution.
Yoon, G., Carroll, R. J., and Gaynanova, I. (2018). Sparse semiparametric canonical correlation analysis
for data of mixed types. arXiv preprint arXiv:1807.05274 .

Zhang, A., Fang, J., Calhoun, V. D., and Wang, Y.-p. (2018). High dimensional latent gaussian copula
model for mixed data in imaging genetics. In 2018 IEEE 15th International Symposium on Biomedical
Imaging (ISBI 2018), pages 105–109. IEEE.
