
ICSA Book Series in Statistics

Series Editors: Jiahua Chen · Ding-Geng (Din) Chen

Ding-Geng (Din) Chen


John Dean Chen Editors

Monte-Carlo Simulation-Based Statistical Modeling
ICSA Book Series in Statistics

Series editors

Jiahua Chen, Department of Statistics, University of British Columbia, Vancouver, Canada
Ding-Geng (Din) Chen, University of North Carolina, Chapel Hill, NC, USA
More information about this series at http://www.springer.com/series/13402
Ding-Geng (Din) Chen John Dean Chen

Editors

Monte-Carlo Simulation-Based Statistical Modeling

Editors

Ding-Geng (Din) Chen
University of North Carolina
Chapel Hill, NC, USA
and
University of Pretoria
Pretoria, South Africa

John Dean Chen
Risk Management, Credit Suisse
New York, NY, USA

ISSN 2199-0980 ISSN 2199-0999 (electronic)


ICSA Book Series in Statistics
ISBN 978-981-10-3306-3    ISBN 978-981-10-3307-0 (eBook)
DOI 10.1007/978-981-10-3307-0

Library of Congress Control Number: 2016960187

© Springer Nature Singapore Pte Ltd. 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature

The registered company is Springer Nature Singapore Pte Ltd.


The registered company address is: 152 Beach Road, #22-06/08 Gateway East, Singapore 189721, Singapore
Preface

Over the last two decades, advancements in computer technology have enabled
accelerated research and development of Monte-Carlo computational methods. This
book is a compilation of invited papers from some of the most forward-thinking
statistical researchers. These authors present new developments in Monte-Carlo simulation-based statistical modeling, thereby creating an opportunity for the exchange of ideas among researchers and users of statistical computing.
Our aim in creating this book is to provide a venue for timely dissemination
of the research in Monte-Carlo simulation-based statistical modeling to promote
further research and collaborative work in this area. In the era of big data science,
this collection of innovative research not only has remarkable potential to have a
substantial impact on the development of advanced Monte-Carlo methods across
the spectrum of statistical data analyses but also has great promise for fostering new
research and collaborations addressing the ever-changing challenges and opportu-
nities of statistics and data science. The authors have made their data and computer
programs publicly available, making it possible for readers to replicate the model
development and data analysis presented in each chapter and readily apply these
new methods in their own research.
The 18 chapters are organized into three parts. Part I includes six chapters that
present and discuss general Monte-Carlo techniques. Part II comprises six chapters
with a common focus on Monte-Carlo methods used in missing data analyses,
which is an area of growing importance in public health and social sciences. Part III
is composed of six chapters that address Monte-Carlo statistical modeling and its
applications.


Part I: Monte-Carlo Techniques (Chapters “Joint Generation of Binary, Ordinal, Count, and Normal Data with Specified Marginal and Association Structures in Monte-Carlo Simulations” – “Quantifying the Uncertainty in Optimal Experiment Schemes via Monte-Carlo Simulations”)

Chapter “Joint Generation of Binary, Ordinal, Count, and Normal Data with Specified Marginal and Association Structures in Monte-Carlo Simulations” presents a unified framework for concurrently generating data that include the four major types of distributions (i.e., binary, ordinal, count, and normal) with specified marginal and association structures. In this discussion of an important supplement to existing methods, Hakan Demirtas unifies the Monte-Carlo methods for specified types of data and presents his systematic and comprehensive investigation for mixed data generation. The proposed framework can then be readily used to simulate multivariate data of mixed types for the development of more sophisticated simulation, computation, and data analysis techniques.
In Chapter “Improving the Efficiency of the Monte-Carlo Methods Using Ranked Simulated Approach”, Hani Samawi provides an overview of his development of ranked simulated sampling, a key approach for improving the efficiency of general Monte-Carlo methods. Samawi then demonstrates the capacity of this approach to provide unbiased estimation.
In Chapter “Normal and Non-normal Data Simulations for the Evaluation of Two-Sample Location Tests”, Jessica Hoag and Chia-Ling Kuo discuss Monte-Carlo simulation of normal and non-normal data to evaluate two-sample location tests (i.e., statistical tests that compare means or medians of two independent populations).
Chapter “Anatomy of Correlational Magnitude Transformations in Latency and Discretization Contexts in Monte-Carlo Studies” proposes a general assessment of correlational magnitude changes in the latency and discretization contexts of Monte-Carlo studies. Further, authors Hakan Demirtas and Ceren Vardar-Acar provide a conceptual framework and computational algorithms for modeling the correlation transitions under specified distributional assumptions within the realm of discretization in the context of latency and the threshold concept. The authors illustrate the proposed algorithms with several examples and include a simulation study that demonstrates the feasibility and performance of the methods.
Chapter “Monte-Carlo Simulation of Correlated Binary Responses” discusses the Monte-Carlo simulation of correlated binary responses. Simulation studies are a well-known, highly valuable tool that allows researchers to obtain powerful conclusions for correlated or longitudinal response data. In cases where logistic modeling is used, the researcher must have appropriate methods for simulating correlated binary data along with associated predictors. In this chapter, author Trent Lalonde presents an overview of existing methods for simulating correlated binary response data and compares those methods using R software.

Chapter “Quantifying the Uncertainty in Optimal Experiment Schemes via Monte-Carlo Simulations” provides a general framework for quantifying the sensitivity and uncertainty that result from the misspecification of model parameters in optimal experimental schemes. In designing life-testing experiments, it is widely accepted that the optimal experimental scheme depends on unknown model parameters, and that misspecified parameters can lead to substantial loss of efficiency in the statistical analysis. To quantify this effect, Tony Ng, Yu-Jau Lin, Tzong-Ru Tsai, Y.L. Lio, and Nan Jiang use Monte-Carlo simulations to evaluate the robustness of optimal experimental schemes.

Part II: Monte-Carlo Methods for Missing Data (Chapters “Markov Chain Monte-Carlo Methods for Missing Data Under Ignorability Assumptions” – “Application of Markov Chain Monte-Carlo Multiple Imputation Method to Deal with Missing Data from the Mechanism of MNAR in Sensitivity Analysis for a Longitudinal Clinical Trial”)

Chapter “Markov Chain Monte-Carlo Methods for Missing Data Under Ignorability Assumptions” presents a fully Bayesian method for using the Markov chain Monte-Carlo technique for missing data to sample the full conditional distribution of the missing data given observed data and the other parameters. In this chapter, Haresh Rochani and Daniel Linder show how to apply these methods to real datasets with missing responses as well as missing covariates. Additionally, the authors provide simulation settings to illustrate this method.
In Chapter “A Multiple Imputation Framework for Massive Multivariate Data of Different Variable Types: A Monte-Carlo Technique”, Hakan Demirtas discusses multiple imputation for massive multivariate data of different variable types from planned missingness designs, with the purpose of building the theoretical, algorithmic, and implementation-based components of a unified, general-purpose multiple imputation framework. The planned missingness designs are highly useful and will likely increase in popularity in the future. For this reason, the proposed multiple imputation framework represents an important refinement of existing methods.
Chapter “Hybrid Monte-Carlo in Multiple Missing Data Imputations with Application to a Bone Fracture Data” introduces the Hybrid Monte-Carlo method as an efficient approach for sampling complex posterior distributions of several correlated parameters from a semi-parametric missing data model. In this chapter, Hui Xie describes a modeling approach for missing values that does not require assuming specific distributional forms. To demonstrate the method, the author provides an R program for analyzing missing data from a bone fracture study.
Chapter “Statistical Methodologies for Dealing with Incomplete Longitudinal Outcomes Due to Dropout Missing at Random” considers key methods for handling longitudinal data that are incomplete due to missing-at-random dropout. In this chapter, Ali Satty, Henry Mwambi, and Geert Molenberghs provide readers with an overview of the issues and the different methodologies for handling missing data in longitudinal datasets that result from dropout (e.g., study attrition, loss of follow-up). The authors examine the potential strengths and weaknesses of the various methods through two examples of applying these methods.
In Chapter “Applications of Simulation for Missing Data Issues in Longitudinal Clinical Trials”, Frank Liu and James Kost present simulation-based approaches for addressing missing data issues in longitudinal clinical trials, such as control-based imputation, tipping-point analysis, and a Bayesian Markov chain Monte-Carlo method. Computation programs for these methods are implemented and available in SAS.
In Chapter “Application of Markov Chain Monte-Carlo Multiple Imputation Method to Deal with Missing Data from the Mechanism of MNAR in Sensitivity Analysis for a Longitudinal Clinical Trial”, Wei Sun discusses the application of Markov chain Monte-Carlo multiple imputation for data that are missing not at random in longitudinal datasets from clinical trials. This chapter compares the patterns of missing data between study subjects who received treatment and study subjects who received a placebo.

Part III: Monte-Carlo in Statistical Modellings and Applications (Chapters “Monte-Carlo Simulation in Modeling for Hierarchical Generalized Linear Mixed Models” – “Bootstrap-Based LASSO-Type Selection to Build Generalized Additive Partially Linear Models for High-Dimensional Data”)

Chapter “Monte-Carlo Simulation in Modeling for Hierarchical Generalized Linear Mixed Models” adds a discussion of Monte-Carlo simulation-based hierarchical models, taking into account the variability at each level of the hierarchy. In this chapter, Kyle Irimata and Jeffrey Wilson discuss Monte-Carlo simulations for hierarchical linear mixed-effects models to fit hierarchical logistic regression models with random intercepts (both random intercepts and random slopes) to multilevel data.
Chapter “Monte-Carlo Methods in Financial Modeling” demonstrates the use of Monte-Carlo methods in financial modeling. In this chapter, Chuanshu Ji, Tao Wang, and Leicheng Yin discuss two areas, market microstructure modeling and option pricing, using Monte-Carlo dimension reduction techniques. This approach uses Bayesian Markov chain Monte-Carlo inference based on the trade and quote database from Wharton Research Data Services.
Chapter “Simulation Studies on the Effects of the Censoring Distribution Assumption in the Analysis of Interval-Censored Failure Time Data” discusses using Monte-Carlo simulations to evaluate the effect of the censoring distribution assumption for interval-censored survival data. In this chapter, Tyler Cook and Jianguo Sun investigate the effectiveness and flexibility of two methods for regression analysis of informative case I and case II interval-censored data. The authors present extensive Monte-Carlo simulation studies that provide readers with guidelines regarding dependence on the censoring distribution.
Chapter “Robust Bayesian Hierarchical Model Using Monte-Carlo Simulation” uses Monte-Carlo simulation to demonstrate a robust Bayesian multilevel item response model. In this chapter, Geng Chen uses data from patients with Parkinson's disease, a chronic progressive disease with multidimensional impairments. Using these data, Chen illustrates applying the multilevel item response model to not only deal with the multidimensional nature of the disease but also simultaneously estimate measurement-specific parameters, covariate effects, and patient-specific characteristics of disease progression.
In Chapter “A Comparison of Bootstrap Confidence Intervals for Multi-level Longitudinal Data Using Monte-Carlo Simulation”, Mark Reiser, Lanlan Yao, and Xiao Wang present a comparison of bootstrap confidence intervals for multilevel longitudinal data using Monte-Carlo simulations. Their results indicate that if the sample size at the lower level is small, then the parametric bootstrap and cluster bootstrap perform better at the higher level than the two-stage bootstrap. The authors then apply the bootstrap methods to a longitudinal study of preschool children nested within classrooms.
Chapter “Bootstrap-Based LASSO-Type Selection to Build Generalized Additive Partially Linear Models for High-Dimensional Data” presents an approach to using a bootstrap-based LASSO-type selection to build generalized additive partially linear models for high-dimensional data. In this chapter, Xiang Liu, Tian Chen, Yuanzhang Li, and Hua Liang first propose a bootstrap-based procedure to select variables with penalized regression and then apply their procedure to analyze data from a breast cancer study and an HIV study. The two examples demonstrate the procedure's flexibility and utility in practice. In addition, the authors present a simulation study that shows, when compared with the penalized regression approach, their variable selection procedure performs better.
As a general note, the references for each chapter are included immediately
y
following the chapter text. We have organized the chapters as self-contained units
so readers can more easily and readily refer to the cited sources for each chapter.
To facilitate readers' understanding of the methods presented in this book, the corresponding data and computing programs can be requested from the first editor by email at DrDG.Chen@gmail.com.
The editors are deeply grateful to many who have supported the creation of this
book. We thank the authors of each chapter for their contributions and their gen-
erous sharing of their knowledge, time, and expertise to this book. Second, our
sincere gratitude goes to Ms. Diane C. Wyant from the School of Social Work,
University of North Carolina at Chapel Hill, for her expert editing of and comments on this book, which substantially uplifted its quality. We gratefully acknowledge the professional support of Hannah Qiu (Springer/ICSA Book Series coordinator) and Wei Zhao (associate editor) from Springer Beijing, who made publishing this book with Springer a reality.

We welcome readers' comments, including notes on typos or other errors, and


look forward to receiving suggestions for improvements to future editions of this


book. Please send comments and suggestions to any of the editors listed below.

October 2016 Ding-Geng (Din) Chen


University of North Carolina, Chapel Hill, USA
University of Pretoria, South Africa
John Dean Chen
Credit Suisse, New York, NY, USA
About the Book

This book brings together expert researchers engaged in Monte-Carlo simulation-based statistical modeling, offering them a forum to present and discuss recent issues in methodological development as well as public health applications. It is divided into three parts, with the first providing an overview of Monte-Carlo techniques, the second focusing on missing data Monte-Carlo methods, and the third addressing Bayesian and general statistical modeling using
Monte-Carlo simulations. The data and computer programs used here will also be
made publicly available, allowing readers to replicate the model development and
data analysis presented in each chapter, and to readily apply them in their own
research. Featuring highly topical content, the book has the potential to impact model development and data analyses across a wide spectrum of fields, and to spark further research in this direction.

Contents

Part I  Monte-Carlo Techniques

Joint Generation of Binary, Ordinal, Count, and Normal


Data with Specified Marginal and Association Structures
in Monte-Carlo Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Hakan Demirtas, Rawan Allozi, Yiran Hu, Gul Inan and Levent Ozbek
Improving the Efficiency of the Monte-Carlo Methods
Using Ranked Simulated Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Hani Michel Samawi
Normal and Non-normal Data Simulations for the Evaluation
of Two-Sample Location Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Jessica R. Hoag and Chia-Ling Kuo
Anatomy of Correlational Magnitude Transformations in Latency
and Discretization Contexts in Monte-Carlo Studies . . . . . . . . . . . . . . . . 59
Hakan Demirtas and Ceren Vardar-Acar

Monte-Carlo Simulation of Correlated Binary Responses . . . . . . . . . . . . 85


Trent L. Lalonde
Quantifying the Uncertainty in Optimal Experiment Schemes
via Monte-Carlo Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
H.K.T. Ng, Y.-J. Lin, T.-R. Tsai, Y.L. Lio and N. Jiang

Part II  Monte-Carlo Methods in Missing Data
Markov Chain Monte-Carlo Methods for Missing Data
Under Ignorability Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Haresh Rochani and Daniel F. Linder
A Multiple Imputation Framework for Massive Multivariate
Data of Different Variable Types: A Monte-Carlo Technique . . . . . . . . . 143
Hakan Demirtas


Hybrid Monte-Carlo in Multiple Missing Data Imputations


with Application to a Bone Fracture Data . . . . . . . . . . . . . . . . . . . . . . . . 163
Hui Xie
Statistical Methodologies for Dealing with Incomplete
Longitudinal Outcomes Due to Dropout Missing at Random . . . . . . . . . 179

A. Satty, H. Mwambi and G. Molenberghs


Applications of Simulation for Missing Data Issues
in Longitudinal Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
G. Frank Liu and James Kost
Application of Markov Chain Monte-Carlo Multiple
Imputation Method to Deal with Missing Data from
the Mechanism of MNAR in Sensitivity Analysis for
a Longitudinal Clinical Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Wei Sun

Part III  Monte-Carlo in Statistical Modellings and Applications

Monte-Carlo Simulation in Modeling for Hierarchical


Generalized Linear Mixed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Kyle M. Irimata and Jeffrey R. Wilson
Monte-Carlo Methods in Financial Modeling . . . . . . . . . . . . . . . . . . . . . . 285
Chuanshu Ji, Tao Wang and Leicheng Yin
Simulation Studies on the Effects of the Censoring Distribution
Assumption in the Analysis of Interval-Censored
Failure Time Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Tyler Cook, Zhigang Zhang and Jianguo Sun
Robust Bayesian Hierarchical Model Using Monte-Carlo
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
Geng Chen and Sheng Luo
A Comparison of Bootstrap Confidence Intervals for Multi-level
Longitudinal Data Using Monte-Carlo Simulation . . . . . . . . . . . . . . . . . 367
Mark Reiser, Lanlan Yao, Xiao Wang, Jeanne Wilcox and Shelley Gray
Bootstrap-Based LASSO-Type Selection to Build Generalized
Additive Partially Linear Models for High-Dimensional Data . . . . . . . . . 405
Xiang Liu, Tian Chen, Yuanzhang Li and Hua Liang
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
Editors and Contributors

About the Editors


Prof. Ding-Geng Chen is a fellow of the American Statistical Association and currently the Wallace Kuralt distinguished professor at the University of North Carolina at Chapel Hill. He was a professor at the University of Rochester and the Karl E. Peace endowed eminent scholar chair in biostatistics at Georgia Southern University. He is also a senior consultant for biopharmaceuticals and government agencies with extensive expertise in clinical trial biostatistics and public health statistics. Professor Chen has written more than 150 refereed publications and co-authored/co-edited ten books on clinical trial methodology, meta-analysis, causal inference, and public health statistics.

Mr. John Dean Chen is specialized in Monte-Carlo simulations in modelling financial market risk. He is currently a Vice President at Credit Suisse, specializing in regulatory stress testing with Monte-Carlo simulations. He began his career on Wall Street working in commodities trading market risk before moving to trading structured notes within the Exotics Interest Rate Derivatives desk at Barclays Capital. He transitioned back to risk at Mitsubishi UFJ, working in its model risk group. During his career in the financial industry, he witnessed in person the unfolding of the financial crisis and the immediate aftermath consuming much of the financial industry. He graduated from the University of Washington with a dual Bachelor of Science in Applied Mathematics and Economics.


Contributors

Rawan Allozi Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago, Chicago, IL, USA
Geng Chen Clinical Statistics, GlaxoSmithKline, Collegeville, PA, USA
Tian Chen Department of Mathematics and Statistics, University of Toledo, Toledo, OH, USA
Tyler Cook University of Central Oklahoma, Edmond, OK, USA
Hakan Demirtas Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago, Chicago, IL, USA
Shelley Gray Speech and Hearing Science, Arizona State University, Tempe, AZ, USA
Jessica R. Hoag Department of Community Medicine and Health Care, Connecticut Institute for Clinical and Translational Science, University of Connecticut Health Center, Farmington, USA
Yiran Hu Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago, Chicago, IL, USA
Gul Inan Department of Statistics, Middle East Technical University, Ankara, Turkey
Kyle M. Irimata School of Mathematical and Statistical Sciences, Arizona State
University, Tempe, AZ, USA
Chuanshu Ji Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC, USA
N. Jiang Department of Mathematical Sciences, University of South Dakota, Vermillion, SD, USA
James Kost Merck & Co. Inc., North Wales, PA, USA
Chia-Ling Kuo Department of Community Medicine and Health Care, Connecticut Institute for Clinical and Translational Science, University of Connecticut Health Center, Farmington, USA
Trent L. Lalonde Department of Applied Statistics and Research Methods, University of Northern Colorado, Greeley, CO, USA
Yuanzhang Li Division of Preventive Medicine, Walter Reed Army Institute of
Research, Silver Spring, MD, USA
Hua Liang Department of Statistics, George Washington University, Washington, DC, USA

Y.-J. Lin Department of Applied Mathematics, Chung Yuan Christian University,


Chung-Li District, Taoyuan city, Taiwan
Daniel F. Linder Department of Biostatistics and Epidemiology, Medical College of Georgia, Augusta University, Augusta, GA, USA
Y.L. Lio Department of Mathematical Sciences, University of South Dakota, Vermillion, SD, USA
G. Frank Liu Merck & Co. Inc., North Wales, PA, USA
Xiang Liu Health Informatics Institute, University of South Florida, Tampa, FL,
USA
Sheng Luo Department of Biostatistics, The University of Texas Health Science
Center at Houston, Houston, TX, USA
G. Molenberghs I-BioStat, Universiteit Hasselt & KU Leuven, Hasselt, Belgium
H. Mwambi Faculty of Mathematical Sciences and Statistics, Alneelain
University, Khartoum, Sudan

H.K.T. Ng Department of Statistical Science, Southern Methodist University,
Dallas, TX, USA
Levent Ozbek Department of Statistics, Ankara University, Ankara, Turkey
Mark Reiser School of Mathematical and Statistical Sciences, Arizona State
University, Tempe, AZ, USA
Haresh Rochani Department of Biostatistics, Jiann-Ping Hsu College of Public
Health, Georgia Southern University, Statesboro, GA, USA
Hani Michel Samawi Department of Biostatistics, Jiann-Ping Hsu College of
Public Health, Georgia Southern University, Statesboro, GA, USA
A. Satty School of Mathematics, Statistics and Computer Science, University of
KwaZulu-Natal, Pietermaritzburg, South Africa
Jianguo Sun University of Missouri, Columbia, MO, USA
Wei Sun Manager Biostatistician at Otsuka America, New York, USA
T.-R. Tsai Department of Statistics, Tamkang University, Tamsui District, New
Taipei City, Taiwan
Ceren Vardar-Acar Department of Statistics, Middle East Technical University,
Ankara, Turkey
Tao Wang Bank of America Merrill Lynch, New York, NY, USA
Xiao Wang Statistics and Data Corporation, Tempe, AZ, USA

Jeanne Wilcox Division of Educational Leadership and Innovation, Arizona State
University, Tempe, AZ, USA

Jeffrey R. Wilson W.P. Carey School of Business, Arizona State University,
Tempe, AZ, USA
Hui Xie Simon Fraser University, Burnaby, Canada; The University of Illinois at
Chicago, Chicago, USA
Lanlan Yao School of Mathematical and Statistical Sciences, Arizona State
University, Tempe, AZ, USA
Leicheng Yin Exelon Business Services Company, Enterprise Risk Management,
Chicago, IL, USA
Zhigang Zhang Memorial Sloan Kettering Cancer Center, New York, NY, USA
Part I
Monte-Carlo Techniques
Joint Generation of Binary, Ordinal, Count,
and Normal Data with Specified Marginal
and Association Structures in Monte-Carlo
Simulations

Hakan Demirtas, Rawan Allozi, Yiran Hu, Gul Inan and Levent Ozbek

Abstract This chapter is concerned with building a unified framework for con-
currently generating data sets that include all four major kinds of variables (i.e.,
binary, ordinal, count, and normal) when the marginal distributions and a feasible
association structure are specified for simulation purposes. The simulation paradigm
has been commonly employed in a wide spectrum of research fields including the
physical, medical, social, and managerial sciences. A central aspect of every simula-
tion study is the quantification of the model components and parameters that jointly
define a scientific process. When this quantification cannot be performed via deter-
ministic tools, researchers resort to random number generation (RNG) in finding
simulation-based answers to address the stochastic nature of the problem. Although
many RNG algorithms have appeared in the literature, a major limitation is that they
were not designed to concurrently accommodate all variable types mentioned above.
Thus, these algorithms provide only an incomplete solution, as real data sets include
variables of different kinds. This work represents an important augmentation of the
existing methods as it is a systematic attempt and comprehensive investigation for
mixed data generation. We provide an algorithm that is designed for generating data
of mixed marginals, illustrate its logistical, operational, and computational details;
and present ideas on how it can be extended to span more complicated distributional
settings in terms of a broader range of marginals and associational quantities.

H. Demirtas (B) · R. Allozi · Y. Hu
Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago,
1603 West Taylor Street, Chicago, IL 60612, USA
e-mail: demirtas@uic.edu
G. Inan
Department of Statistics, Middle East Technical University, Ankara, Turkey
L. Ozbek
Department of Statistics, Ankara University, Ankara, Turkey

© Springer Nature Singapore Pte Ltd. 2017
D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical
Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_1


1 Introduction

Stochastic simulation is an indispensable part and major focus of scientific inquiry.


Model building, estimation, and testing typically require verification via simulation
to assess the validity, reliability, and plausibility of inferential techniques, to evaluate
how well the implemented models capture the specified true population values, and
how reasonably these models respond to departures from underlying assumptions,
among other things. Describing a real notion by creating mirror images and imper-
fect proxies of the perceived underlying truth; iteratively refining and occasionally
redefining the empirical truth to decipher the mechanism by which the process under
consideration is assumed to operate in a repeated manner allows researchers to study
the performance of their methods through simulated data replicates that mimic the
real data characteristics of interest in any given setting. Accuracy and precision
measures regarding the parameters under consideration signal if the procedure works
properly, and may suggest remedial action to minimize the discrepancies between
expectation and reality.
Simulation studies have been commonly employed in a broad range of disciplines

in order to better comprehend and solve today’s increasingly sophisticated issues. A


core component of every simulation study is the quantification of the model compo-
nents and parameters that jointly define a scientific phenomenon. Deterministic tools
are typically inadequate to quantify complex situations, leading researchers to uti-
lize RNG techniques in finding simulation-based solutions to address the stochastic
behavior of the problems that generally involve variables of many different types on
a structural level; i.e., causal and correlational interdependencies are a function of a
mixture of binary, ordinal, count, and continuous variables, which act simultaneously
to characterize the mechanisms that collectively delineate a paradigm. In modern
times, we are unequivocally moving from mechanistical to empirical thinking, from
small data to big data, from mathematical perfection to reasonable approximation to
reality, and from exact solutions to simulation-driven solutions. The ideas presented
herein are important in the sense that the basic mixed-data generation setup can be
augmented to handle a large spectrum of situations that can be encountered in many
areas.
This work is concerned with building the basics of a unified framework for con-
currently generating data sets that include all four major kinds of variables (i.e.,
binary, ordinal, count, and normal) when the marginal distributions and a feasible
association structure in the form of Pearson correlations are specified for simulation
purposes. Although many RNG algorithms have appeared in the literature, a funda-
mental restriction is that they were not designed for a mix of all prominent types of
data. The current chapter is a systematic attempt and compendious investigation for
mixed data generation; it represents a substantial augmentation of the existing meth-
ods, and it has potential to advance scientific research and knowledge in a meaningful
way. The broader impact of this framework is that it can assist data analysts, practi-
tioners, theoreticians, and methodologists across many disciplines to simulate mixed
data with relative ease. The proposed algorithm constitutes a comprehensive set of


computational tools that offer promising potential for building enhanced computing
infrastructure for research and education.
We propose an RNG algorithm that encompasses all four major variable types,
building upon our previous work in generation of multivariate ordinal data (Demirtas
2006), joint generation of binary and normal data (Demirtas and Doganay 2012),
ordinal and normal data (Demirtas and Yavuz 2015), and count and normal data
(Amatya and Demirtas 2015) with the specification of marginal and associational
parameters along with other related work (Emrich and Piedmonte 1991; Demirtas
and Hedeker 2011, 2016; Demirtas et al. 2016a; Ferrari and Barbiero 2012; Yahav
and Shmueli 2012). Equally importantly, we discuss the extensions on nonnormal
continuous data via power polynomials that would handle the overwhelming majority
of continuous shapes (Fleishman 1978; Vale and Maurelli 1983; Headrick 2010;
Demirtas et al. 2012; Demirtas 2017a), count data that are prone to over- and under-
dispersion via generalized Poisson distribution (Demirtas 2017b), broader measures
of associations such as Spearman's rank correlations and L-correlations (Serfling
and Xiao 2007), and the specification of higher order product moments. Conceptual,
algorithmic, operational, and procedural details will be communicated throughout
the chapter.
The organization of the chapter is as follows: In Sect. 2, the algorithm for simulta-
neous generation of binary, ordinal, count, and normal data is given. The essence of
the algorithm is finding the correlation structure of underlying multivariate normal
(MVN) data that form a basis for the subsequent discretization in the binary and
ordinal cases, and correlation mapping using inverse cumulative distribution func-
tions (cdfs) in the count data case, where modeling the correlation transitions for
different distributional pairs is discussed in detail. Section 3 presents some logistical
details and an illustrative example through an R package that implements the algo-
rithm, demonstrating how well the proposed technique works. Section 4 includes
discussion on limitations, future directions, extensions, and concluding remarks.

2 Algorithm

The algorithm is designed for concurrently generating binary, ordinal, count, and
continuous data. The count and continuous parts are assumed to follow Poisson
and normal distributions, respectively. While binary is a special case of ordinal, for
the purpose of exposition, the steps are presented separately. Skipped patterns are
allowed for ordinal variables. The marginal characteristics (the proportions for the
binary and ordinal part, the rate parameters for the count part, and the means and
variances for the normal part) and a feasible Pearson correlation matrix need to be
specified by the users. The algorithmic skeleton establishes the basic foundation;
extensions to more general and complicated situations will be discussed in Sect. 4.
The operational engine of the algorithm hinges upon computing the correlation
matrix of underlying MVN data that serve as an intermediate tool in the sense that
binary and ordinal variables are obtained via dichotomization and ordinalization,


respectively, through the threshold concept, and count variables are retrieved by
correlation mapping using inverse cdf matching. The procedure entails modeling the
correlation transformations that result from discretization and mapping.
In what follows, let B , O , C , and N denote binary, ordinal, count, and normal
variables, respectively. Let Σ be the specified Pearson correlation matrix which com-
prises ten submatrices that correspond to all possible variable-type combinations.
Required parameter values are p’s for binary and ordinal variables, λ’s for count
variables, (µ, σ²) pairs for normal variables, and the entries of the correlation matrix
Σ . These quantities are either specified or estimated from a real data set that is to be
mimicked.

1. Check if Σ is positive definite.
2. Find the upper and lower correlation bounds for all pairs by the sorting method of
   Demirtas and Hedeker (2011). It is well-known that correlations are not bounded
   between −1 and +1 in most bivariate settings, as different upper and/or lower
   bounds may be imposed by the marginal distributions (Hoeffding 1940; Fréchet
   1951). These restrictions apply to discrete variables as well as continuous ones.
   Let Π(F, G) be the set of cdf's H on R² having marginal cdf's F and G.
   Hoeffding (1940) and Fréchet (1951) proved that in Π(F, G), there exist cdf's
   HL and HU, called the lower and upper bounds, having minimum and max-
   imum correlation. For all (x, y) ∈ R², HL(x, y) = max[F(x) + G(y) − 1, 0]
   and HU(x, y) = min[F(x), G(y)]. For any H ∈ Π(F, G) and all (x, y) ∈ R²,
   HL(x, y) ≤ H(x, y) ≤ HU(x, y). If δL, δU, and δ denote the Pearson correlation
   coefficients for HL, HU, and H, respectively, then δL ≤ δ ≤ δU. One can infer
   that if V is uniform in [0, 1], then F⁻¹(V) and G⁻¹(V) are maximally correlated,
   and F⁻¹(V) and G⁻¹(1 − V) are maximally anticorrelated. In practical terms,
   generating X and Y independently with a large number of data points before
   sorting them in the same and opposite directions gives the approximate upper
   and lower correlation bounds, respectively. Make sure all elements of Σ are within
   the plausible range.
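The generate-and-sort bound computation is straightforward to prototype. The following Python sketch (illustrative only; the chapter's actual implementation is in R, and the Poisson(3)/Bernoulli(0.4) marginals are arbitrary choices) approximates the bounds by sorting independent draws in the same and opposite directions:

```python
import numpy as np

def approx_corr_bounds(sample1, sample2):
    """Approximate the Frechet-Hoeffding correlation bounds for two marginals:
    sort independent draws the same way for the upper bound (comonotone
    arrangement) and opposite ways for the lower bound (antimonotone)."""
    a, b = np.sort(sample1), np.sort(sample2)
    upper = np.corrcoef(a, b)[0, 1]
    lower = np.corrcoef(a, b[::-1])[0, 1]
    return lower, upper

rng = np.random.default_rng(12345)
n = 100_000
x = rng.poisson(3, n)                        # count marginal with rate 3
y = (rng.uniform(size=n) < 0.4).astype(int)  # binary marginal with p = 0.4
lo, hi = approx_corr_bounds(x, y)
print(lo, hi)  # admissible range for Cor(Poisson(3), Bernoulli(0.4))
```

Any entry of Σ for such a pair that falls outside the computed (lower, upper) interval is infeasible regardless of the generation method.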
3. Perform logical checks such as binary proportions are between 0 and 1, proba-
   bilities add up to 1 for ordinal variables, the Poisson rates are positive for count
   variables, variances for normal variables are positive, and the mean, variance, pro-
   portion, and rate vectors are consistent with the number of variables, Σ is symmetric
   and its diagonal entries are 1, to prevent obvious misspecification errors.
4. For B-B combinations, find the tetrachoric (pre-dichotomization) correlation
   given the specified phi coefficient (post-dichotomization correlation). Let X1, X2
   represent binary variables such that E[Xj] = pj and Cor(X1, X2) = δ12, where
   pj (j = 1, 2) and δ12 (phi coefficient) are given. Let Φ[t1, t2, ρ12] be the cdf for a
   standard bivariate normal random variable with correlation coefficient ρ12 (tetra-
   choric correlation). Naturally, Φ[t1, t2, ρ12] = ∫_{−∞}^{t1} ∫_{−∞}^{t2} f(z1, z2, ρ12) dz1 dz2,
   where f(z1, z2, ρ12) = [2π(1 − ρ12²)^{1/2}]⁻¹ exp[−(z1² − 2ρ12 z1 z2 + z2²)/(2(1 − ρ12²))].
   The connection between δ12 and ρ12 is reflected in the equation

   Φ[z(p1), z(p2), ρ12] = δ12 (p1 q1 p2 q2)^{1/2} + p1 p2.

   Solve for ρ12, where z(pj) denotes the pj-th quantile of the standard normal
   distribution, and qj = 1 − pj. Repeat this process for all B-B pairs.
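Solving this equation is a one-dimensional root search. A Python sketch of Step 4 follows (an illustration, not the chapter's R implementation; SciPy's bivariate normal cdf plays the role of Φ):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def tetrachoric_from_phi(p1, p2, phi):
    """Solve Phi[z(p1), z(p2), rho] = phi*sqrt(p1*q1*p2*q2) + p1*p2 for rho."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    target = phi * np.sqrt(p1 * q1 * p2 * q2) + p1 * p2
    z = [norm.ppf(p1), norm.ppf(p2)]  # thresholds z(p1) and z(p2)

    def gap(rho):
        cov = [[1.0, rho], [rho, 1.0]]
        return multivariate_normal.cdf(z, mean=[0.0, 0.0], cov=cov) - target

    return brentq(gap, -0.999, 0.999)  # bracketed root search over rho

# Sanity check: for p1 = p2 = 0.5 the closed form rho = sin(pi*phi/2) holds.
rho = tetrachoric_from_phi(0.5, 0.5, 0.5)
print(rho)
```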
5. For B-O and O-O combinations, implement an iterative procedure that finds
   the polychoric (pre-discretization) correlation given the ordinal phi coefficient
   (post-discretization correlation). Suppose Z = (Z1, Z2) ∼ N(0, ∆Z1Z2), where
   Z denotes the bivariate standard normal distribution with correlation matrix
   ∆Z1Z2 whose off-diagonal entry is δZ1Z2. Let X = (X1, X2) be the bivariate
   ordinal data where the underlying Z is discretized based on corresponding
   normal quantiles given the marginal proportions, with a correlation matrix
   ∆Z1Z2. If we need to sample from a random vector (X1, X2) whose marginal
   cdfs are F1, F2 tied together via a Gaussian copula, we generate a sample
   (z1, z2) from Z ∼ N(0, ∆Z1Z2), then set x = (x1, x2) = (F1⁻¹(u1), F2⁻¹(u2)),
   where u = (u1, u2) = (Φ(z1), Φ(z2)) and Φ is the cdf of the standard normal
   distribution. The correlation matrix of X, denoted by ∆X1X2 (with an off-diagonal
   entry δX1X2), obviously differs from ∆Z1Z2 due to discretization. More specifically,
   |δX1X2| < |δZ1Z2|. The relationship between δX1X2 and δZ1Z2 can be established
   in large samples via the following algorithm (Ferrari and Barbiero 2012):

   a. Generate standard bivariate normal data with the correlation δZ1Z2(0), where
      δZ1Z2(0) = δX1X2 (here, δZ1Z2(0) is the initial polychoric correlation).
   b. Discretize Z1 and Z2, based on the cumulative probabilities of the marginal
      distributions F1 and F2, to obtain X1 and X2, respectively.
   c. Compute δX1X2(1) through X1 and X2 (here, δX1X2(1) is the ordinal phi
      coefficient after the first iteration).
   d. Execute the following loop as long as |δX1X2(v) − δX1X2| > ε and 1 ≤ v ≤ vmax
      (vmax and ε are the maximum number of iterations and the maximum toler-
      ated absolute error, respectively; both quantities are set by the users):
      (a) Update δZ1Z2(v) by δZ1Z2(v) = δZ1Z2(v − 1) g(v), where g(v) = δX1X2/δX1X2(v).
          Here, g(v) serves as a correction coefficient, which ultimately converges to 1.
      (b) Generate bivariate normal data with δZ1Z2(v) and compute δX1X2(v + 1)
          after discretization.

   Again, one should repeat this process for each B-O (and O-O) pair.
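The loop above can be prototyped directly by simulation. Below is a Python sketch of this iterative scheme (not the GenOrd/ordcont code the chapter actually relies on; sample size, tolerance, and seed are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import norm

def polychoric_from_ordinal_phi(probs1, probs2, target, n=200_000,
                                eps=0.005, vmax=30, seed=2023):
    """Iterate on the pre-discretization correlation until discretized
    data reproduce the target ordinal phi coefficient."""
    rng = np.random.default_rng(seed)
    cuts1 = norm.ppf(np.cumsum(probs1)[:-1])  # normal quantile thresholds
    cuts2 = norm.ppf(np.cumsum(probs2)[:-1])
    rho = target                              # initial polychoric value
    for _ in range(vmax):
        z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], n)
        x1 = np.searchsorted(cuts1, z[:, 0])  # discretize via thresholds
        x2 = np.searchsorted(cuts2, z[:, 1])
        observed = np.corrcoef(x1, x2)[0, 1]
        if abs(observed - target) <= eps:
            break
        # correction coefficient g(v) = target / observed
        rho = float(np.clip(rho * target / observed, -0.999, 0.999))
    return rho

# Binary special case with p1 = p2 = 0.5 and target phi = 0.4; the exact
# tetrachoric answer is sin(pi * 0.4 / 2), roughly 0.588.
rho_hat = polychoric_from_ordinal_phi([0.5, 0.5], [0.5, 0.5], 0.4)
print(rho_hat)
```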
6. For C-C combinations, compute the corresponding normal-normal correlations
   (pre-mapping) given the specified count-count correlations (post-mapping) via
   the inverse cdf method in Yahav and Shmueli (2012) that was proposed in the
   context of correlated count data generation. Their method utilizes a slightly
   modified version of the NORTA (Normal to Anything) approach (Nelsen 2006),
   which involves generation of MVN variates with given univariate marginals and
   the correlation structure (RN), and then transforming it into any desired distri-
   bution using the inverse cdf. In the Poisson case, NORTA can be implemented
   by the following steps:

   a. Generate a k-dimensional normal vector ZN from an MVN distribution with
      mean vector 0 and a correlation matrix RN.
   b. Transform ZN to a Poisson vector XC as follows:
      i. For each element zi of ZN, calculate the normal cdf, Φ(zi).
      ii. For each value of Φ(zi), calculate the Poisson inverse cdf with a desired
          corresponding marginal rate λi, Ψλi⁻¹(Φ(zi)), where Ψλi(x) = Σ_{j=0}^{x} e^{−λi} λi^j / j!.
   c. XC = (Ψλ1⁻¹(Φ(z1)), ..., Ψλk⁻¹(Φ(zk)))^T is a draw from the desired multi-
      variate count data with correlation matrix RPOIS.

   An exact theoretical connection between RN and RPOIS has not been established
   to date. However, it has been shown that a feasible range of correlation between
   a pair of Poisson variables after the inverse cdf transformation is within
   [ρmin = Cor(Ψλi⁻¹(U), Ψλj⁻¹(1 − U)), ρmax = Cor(Ψλi⁻¹(U), Ψλj⁻¹(U))], where
   λi and λj are the marginal rates, and U ∼ Uniform(0, 1). Yahav and Shmueli (2012)
   proposed a conceptually simple method to approximate the relationship between
   the two correlations. They have demonstrated that RPOIS can be approximated
   as an exponential function of RN where the coefficients are functions of ρmin
   and ρmax.
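The three NORTA steps map one-to-one onto array operations. A Python sketch follows (scipy.stats.poisson.ppf stands in for Ψ⁻¹; the rates and the target normal correlation are arbitrary illustrative values):

```python
import numpy as np
from scipy.stats import norm, poisson

def norta_poisson(rates, R_N, n, seed=7):
    """Step a: MVN draw with correlation R_N; Step b: normal cdf, then
    Poisson inverse cdf with the desired marginal rate per dimension."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(rates)), R_N, n)
    u = norm.cdf(z)
    return poisson.ppf(u, np.asarray(rates))

R_N = np.array([[1.0, 0.8], [0.8, 1.0]])
x = norta_poisson([3.0, 5.0], R_N, 100_000)
r_pois = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
print(r_pois)  # post-mapping count correlation, attenuated below 0.8
```

Running this across a grid of input correlations traces the very relationship between the normal-scale and count-scale correlations that Yahav and Shmueli approximate with an exponential function.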
7. For B-N/O-N combinations, find the biserial/polyserial correlation (before dis-
   cretization of one of the variables) given the point-biserial/point-polyserial cor-
   relation (after discretization) by the linearity and constancy arguments pro-
   posed by Demirtas and Hedeker (2016). Suppose that X and Y follow a bivari-
   ate normal distribution with a correlation of δXY. Without loss of generality,
   we may assume that both X and Y are standardized to have a mean of 0
   and a variance of 1. Let XD be the binary variable resulting from a split on
   X, XD = I(X ≥ k). Thus, E[XD] = p and V[XD] = pq, where q = 1 − p.
   The correlation between XD and X, δXDX, can be obtained in a simple way,
   namely, δXDX = Cov[XD, X]/(V[XD]V[X])^{1/2} = E[XD X]/√(pq). We can
   also express the relationship between X and Y via the following linear regression
   model:
                              Y = δXY X + ε                                (1)

   where ε is independent of X and follows N(0, 1 − δXY²). When we
   generalize this to nonnormal X and/or Y (both centered and scaled), the same
   relationship can be assumed to hold with the exception that the distribution of ε
   follows a nonnormal distribution. As long as Eq. 1 is valid,

       Cov[XD, Y] = Cov[XD, δXY X + ε]
                  = Cov[XD, δXY X] + Cov[XD, ε]
                  = δXY Cov[XD, X] + Cov[XD, ε].                           (2)

   Since ε is independent of X, it will also be independent of any deterministic
   function of X such as XD, and thus Cov[XD, ε] will be 0. As E[X] = E[Y] = 0,
   V[X] = V[Y] = 1, Cov[XD, Y] = δXDY √(pq), and Cov[X, Y] = δXY, Eq. 2
   reduces to
                              δXDY = δXY δXDX.                             (3)

   In the bivariate normal case, δXDX = h/√(pq), where h is the ordinate of the nor-
   mal curve at the point of dichotomization. Equation 3 indicates that the linear
   association between XD and Y is assumed to be fully explained by their mutual
   association with X (Demirtas and Hedeker 2016). The ratio δXDY/δXY is equal
   to δXDX = E[XD X]/√(pq) = E[X | X ≥ k]√(p/q). It is a constant given p and
   the distribution of (X, Y). These correlations are invariant to location shifts and
   scaling; X and Y do not have to be centered and scaled, and their means and vari-
   ances can take any finite values. Once the ratio (δXDX) is found, one can compute
   the biserial correlation when the point-biserial correlation is specified. When X
   is ordinalized to obtain XO, the fundamental ideas remain unchanged. If the
   assumptions of Eqs. 1 and 3 are met, the method is equally applicable to the
   ordinal case in the context of the relationship between the polyserial (before
   ordinalization) and point-polyserial (after ordinalization) correlations. The eas-
   iest way of computing δXOX is to generate X with a large number of data points,
   then ordinalize it to obtain XO, and then compute the sample correlation between
   XO and X. X could follow any continuous univariate distribution. However, here
   X is assumed to be a part of MVN data before discretization.
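The ratio is cheap to verify by simulation. A Python sketch for the normal case with an arbitrary p = 0.4 split, checking the empirical ratio against the closed form h/√(pq):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
p = 0.4
k = norm.ppf(1.0 - p)               # split point with P(X >= k) = p
x = rng.standard_normal(1_000_000)
x_d = (x >= k).astype(float)        # dichotomized version of X

ratio_empirical = np.corrcoef(x_d, x)[0, 1]             # delta_{X_D X}
ratio_closed_form = norm.pdf(k) / np.sqrt(p * (1 - p))  # h / sqrt(pq)
print(ratio_empirical, ratio_closed_form)
```

Dividing a specified point-biserial correlation by this ratio recovers the biserial correlation; for the ordinal case one ordinalizes X instead of thresholding it and computes the sample correlation the same way.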
8. For C-N combinations, the count version of Eq. 3, which is δXCY = δXY δXCX,
   is valid. The only difference is that we use the inverse cdf method rather than
discretization via thresholds as in the binary and ordinal cases.
9. For B-C and O-C combinations, suppose that there are two identical standard
   normal variables: one underlies the binary/ordinal variable before discretiza-
   tion, the other underlies the count variable before inverse cdf matching. One
   can find Cor(O, N) by the method of Demirtas and Hedeker (2016). Then,
   assume Cor(C, O) = Cor(C, N) ∗ Cor(O, N). Cor(C, O) is specified and
   Cor(O, N) is calculated. Solve for Cor(C, N). Then, find the underlying N-N
   correlation by Step 8 above (Amatya and Demirtas 2015; Demirtas and Hedeker
   2016).
10. Construct an overall, intermediate correlation matrix, Σ*, using the results from
    Steps 4 through 9, in conjunction with the N-N part that remains untouched when
    we compute Σ* from Σ.
11. Check if Σ* is positive definite. If it is not, find the nearest positive definite
    correlation matrix by the method of Higham (2002).
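When Σ* fails the check, one quick-and-dirty repair is to clip negative eigenvalues and rescale back to unit diagonal. The Python sketch below is a cruder stand-in for Higham's (2002) alternating-projections algorithm (which the nearPD function implements); it illustrates the idea only:

```python
import numpy as np

def clip_to_positive_definite(sigma, floor=1e-6):
    """Clip negative eigenvalues, reassemble, and rescale so the diagonal
    is exactly 1 again. A simplification of Higham (2002), not his method."""
    vals, vecs = np.linalg.eigh(sigma)
    fixed = vecs @ np.diag(np.clip(vals, floor, None)) @ vecs.T
    d = np.sqrt(np.diag(fixed))
    return fixed / np.outer(d, d)

# An infeasible "correlation" matrix with a negative eigenvalue.
bad = np.array([[1.0, 0.9, -0.9],
                [0.9, 1.0, 0.9],
                [-0.9, 0.9, 1.0]])
good = clip_to_positive_definite(bad)
print(np.linalg.eigvalsh(good).min())  # strictly positive after repair
```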
12. Generate multivariate normal data with a mean vector of (0, ..., 0) and correlation
    matrix of Σ*, which can easily be done by using the Cholesky decomposition of
    Σ* and a vector of univariate normal draws. The Cholesky decomposition of Σ*
    produces a lower-triangular matrix A for which AA^T = Σ*. If z = (z1, ..., zd)
    are d independent standard normal random variables, then Z = Az is a random
    draw from this distribution.
13. Dichotomize binary variables and ordinalize ordinal variables by their respective
    quantiles, and go from normal to count by inverse cdf matching.
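Steps 12 and 13 reduce to one Cholesky factorization plus columnwise transformations. A minimal Python sketch for a toy three-variable system (one binary with p = 0.4, one Poisson with rate 3, one normal with mean 2; the intermediate matrix sigma_star is an assumed, already-computed input):

```python
import numpy as np
from scipy.stats import norm, poisson

rng = np.random.default_rng(99)
n = 50_000

# Intermediate (normal-scale) correlation matrix, assumed precomputed.
sigma_star = np.array([[1.0, 0.5, 0.3],
                       [0.5, 1.0, 0.4],
                       [0.3, 0.4, 1.0]])

A = np.linalg.cholesky(sigma_star)      # lower triangular, A @ A.T == sigma_star
z = rng.standard_normal((n, 3)) @ A.T   # Step 12: MVN(0, sigma_star) draws

y_binary = (z[:, 0] >= norm.ppf(0.6)).astype(int)  # dichotomize: P(Y=1) = 0.4
y_count = poisson.ppf(norm.cdf(z[:, 1]), 3.0)      # inverse cdf matching, rate 3
y_normal = 2.0 + 1.0 * z[:, 2]                     # rescale to mean 2, variance 1
print(y_binary.mean(), y_count.mean(), y_normal.mean())
```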


3 Some Operational Details and an Illustrative Example

The software implementation of the algorithm has been done in the PoisBinOrdNor
package (Demirtas et al. 2016b) within the R environment (R Development Core Team
2016). The package has functions for each variable-type pair that are collectively
capable of modeling the correlation transitions. More specifically, the corr.nn4bb
function finds the tetrachoric correlation in Step 4, the corr.nn4bn and corr.nn4on
functions compute the biserial and polyserial correlations for binary and ordinal vari-
ables, respectively, in Step 7, the corr.nn4pbo function is used to handle B-C and
O-C pairs in Step 9, the corr.nn4pn function is designed for C-N combinations in
Step 8, and the corr.nn4pp function calculates the pre-mapping correlations in Step
6. In addition, polychoric correlations in Step 5 are computed by the ordcont function
in the GenOrd package (Barbiero and Ferrari 2015). The correlation bound check (Step 2) as
well as the validation of the specified quantities (Step 3), assembling all the interme-
diate correlation entries into Σ* (Step 10), and generating mixed data (Steps 12 and
13) are performed by the validation.specs, intermat, and genPBONdata
functions, respectively, in the PoisBinOrdNor package. Positive definiteness checks
in Steps 1 and 11 are done by the is.positive.definite function in the corpcor
package (Schaefer et al. 2015), finding the nearest Σ* is implemented by the nearPD
function in the Matrix package (Bates and Maechler 2016), and MVN data are generated
by the rmvnorm function in the mvtnorm package (Genz et al. 2016).
For illustration, suppose we have two variables in each type. Operationally, Pois-
BinOrdNor package assumes that the variables are specified in a certain order. Let
Y1 ∼ Poisson(3), Y2 ∼ Poisson(5), Y3 ∼ Bernoulli(0.4), Y4 ∼ Bernoulli(0.6),
Y5 and Y6 are ordinal with P(Yj = i) = pi, where pi = (0.3, 0.3, 0.4) and (0.5, 0.1, 0.4)
for i = 0, 1, 2 and j = 5 and 6, respectively, Y7 ∼ N(2, 1), and Y8 ∼ N(5, 9). The
correlation matrix Σ is specified as follows under the assumption that columns (and
rows) represent the order above:

Σ =
     1    0.70 0.66 0.25 0.41 0.63 0.22 0.51
     0.70 1    0.59 0.22 0.37 0.57 0.20 0.46
     0.66 0.59 1    0.21 0.34 0.53 0.19 0.43
     0.25 0.22 0.21 1    0.13 0.20 0.07 0.16
     0.41 0.37 0.34 0.13 1    0.33 0.12 0.27
     0.63 0.57 0.53 0.20 0.33 1    0.18 0.42
     0.22 0.20 0.19 0.07 0.12 0.18 1    0.15
     0.51 0.46 0.43 0.16 0.27 0.42 0.15 1

The intermediate correlation matrix Σ*, obtained after validating the feasibility of the
marginal and correlational specifications and applying all the relevant correlation
transition steps, turns out to be (rounded to three digits after the decimal):

Σ* =
     1     0.720 0.857 0.325 0.477 0.776 0.226 0.523
     0.720 1     0.757 0.282 0.424 0.693 0.202 0.466
     0.857 0.757 1     0.336 0.470 0.741 0.241 0.545
     0.325 0.282 0.336 1     0.186 0.299 0.089 0.203
     0.477 0.424 0.470 0.186 1     0.438 0.135 0.305
     0.776 0.693 0.741 0.299 0.438 1     0.216 0.504
     0.226 0.202 0.241 0.089 0.135 0.216 1     0.150
     0.523 0.466 0.545 0.203 0.305 0.504 0.150 1

Generating N = 10,000 rows of data based on this eight-variable system yields the
following empirical correlation matrix (rounded to five digits after the decimal):
\begin{pmatrix}
1 & 0.69823 & 0.67277 & 0.24561 & 0.40985 & 0.63891 & 0.22537 & 0.50361 \\
0.69823 & 1 & 0.59816 & 0.21041 & 0.36802 & 0.57839 & 0.21367 & 0.45772 \\
0.67277 & 0.59816 & 1 & 0.20570 & 0.32448 & 0.55564 & 0.20343 & 0.42192 \\
0.24561 & 0.21041 & 0.20570 & 1 & 0.12467 & 0.20304 & 0.06836 & 0.17047 \\
0.40985 & 0.36802 & 0.32448 & 0.12467 & 1 & 0.32007 & 0.12397 & 0.26377 \\
0.63891 & 0.57839 & 0.55564 & 0.20304 & 0.32007 & 1 & 0.17733 & 0.41562 \\
0.22537 & 0.21367 & 0.20343 & 0.06836 & 0.12397 & 0.17733 & 1 & 0.15319 \\
0.50361 & 0.45772 & 0.42192 & 0.17047 & 0.26377 & 0.41562 & 0.15319 & 1
\end{pmatrix}
The discrepancies between the specified and empirically computed correlations are
indiscernibly small, and the deviations are within an acceptable range that can
be expected in any stochastic process. If we had repeated the experiment many
times in a full-blown simulation study, the average differences would be even more
negligible. We have observed similar trends in the behavior of the marginal
parameters (not reported for brevity), which lend further support to the presented
methodology. The assessment of the algorithm's performance in terms of commonly
accepted accuracy and precision measures in RNG and imputation settings as well
as in other simulated environments can be carried out through the evaluation metric
developed in Demirtas (2004a, b, 2005, 2007a, b, 2008, 2009, 2010), Demirtas and
Hedeker (2007, 2008a, b, c), Demirtas and Schafer (2003), Demirtas et al. (2007,
2008), and Yucel and Demirtas (2010).
4 Future Directions

The significance of the current study stems from three major reasons: First, data
analysts, practitioners, theoreticians, and methodologists across many different dis-
ciplines in medical, managerial, social, biobehavioral, and physical sciences will
be able to simulate multivariate data of mixed types with relative ease. Second, the
proposed work can serve as a milestone for the development of more sophisticated
simulation, computation, and data analysis techniques in the digital information,
massive data era. Capability of generating many variables of different distributional

12 H. Demirtas et al.

types, natures, and dependence structures may be a contributing factor for better grasp-
ing the operational characteristics of today's intensive data trends (e.g., satellite data,
internet traffic data, genetics data, ecological momentary assessment data). Third,
these ideas can help to promote higher education and accordingly be instrumental in
training graduate students. Overall, it will provide a comprehensive and useful set
of computational tools whose generality and flexibility offer promising potential for
building enhanced statistical computing infrastructure for research and education.
While this work represents a decent step forward in mixed data generation, it may
not be sufficiently complex for real-life applications in the sense that real count and
continuous data are typically more complicated than what Poisson and normal distri-
butions accommodate, and it is likely that specification of parameters that control the
first two moments and the second order product moment is inadequate. To address
these concerns, we plan on building a more inclusive structural umbrella, whose
ingredients are as follows: First, the continuous part will be extended to encompass
nonnormal continuous variables by the operational utility of the third order power
polynomials. This approach is a moment-matching procedure where any given con-
tinuous variable in the system is expressed by the sum of linear combinations of pow-
ers of a standard normal variate (Fleishman 1978; Vale and Maurelli 1983; Demirtas
et al. 2012), which requires the specification of the first four moments. A more elab-
orate version in the form of the fifth order system will be implemented (Headrick
2010) in an attempt to control for higher order moments to cover a larger area in the
skewness-elongation plane and to provide a better approximation to the probability
density functions of the continuous variables; and the count data part will be aug-
mented through the generalized Poisson distribution (Demirtas 2017b) that allows
under- and over-dispersion, which is usually encountered in most applications, via
an additional dispersion parameter. Second, although the Pearson correlation may
not be the best association quantity in every situation, all correlations mentioned in
this chapter are special cases of the Pearson correlation; it is the most widespread
measure of association; and generality of the methods proposed herein with different
kinds of variables requires the broadest possible framework. For further broadening
the scale, scope, and applicability of the ideas presented in this chapter, the proposed
RNG technique will be extended to allow the specification of Spearman's rho, which
is more popular for discrete and heavily skewed continuous distributions; it will
be incorporated into the algorithm for concurrently generating all four major types of
variables. For the continuous-continuous pairs, the connection between the Pearson
and Spearman correlations is given in Headrick (2010) through the power coeffi-
cients, and these two correlations are known to be equal for the binary-binary pairs.
The relationship will be derived for all other variable type combinations. Inclusion
of Spearman’s rho as an option will allow us to specify nonlinear associations whose
monotonic components are reflected in the rank correlation. Third, the expanded fifth
order polynomial system will be further augmented to accommodate L-moments and
L-correlations (Hosking 1990; Serfling and Xiao 2007) that are based on expecta-
tions of certain linear combinations of order statistics. The marginal and product
L-moments are known to be more robust to outliers than their conventional counter-
parts in the sense that they suffer less from the effects of sampling variability, and
they enable more secure inferences to be made from small samples about an under-
lying probability distribution. On a related note, further expansions can be designed
to handle more complex associations that involve higher order product moments.
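The first two sample L-moments mentioned above can be computed directly from ordered data via the probability-weighted moments b0 and b1, with l1 = b0 and l2 = 2 b1 - b0. A small stdlib sketch (the four sample values are made up for illustration):

```python
def sample_l_moments(values):
    """Return (l1, l2): the first two sample L-moments.

    Uses the probability-weighted moments b0 and b1, where
    l1 = b0 (the sample mean) and l2 = 2*b1 - b0.
    """
    x = sorted(values)
    n = len(x)
    b0 = sum(x) / n
    # b1 weights the i-th order statistic by (rank - 1)/(n - 1).
    b1 = sum((i / (n - 1)) * x[i] for i in range(n)) / n
    return b0, 2 * b1 - b0

# Tiny illustrative sample: l2 equals half the mean absolute pairwise
# difference, here ((1 + 2 + 3 + 1 + 2 + 1) / 6) / 2 = 5/6.
l1, l2 = sample_l_moments([1, 2, 3, 4])
print(l1, round(l2, 4))
```

Because l2 is a linear combination of order statistics rather than a squared deviation, it reacts far less to a single outlier than the sample variance does, which is the robustness property the text appeals to.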
The salient advantages of the proposed algorithm and its augmented versions are
as follows: (1) Individual components are well-established. (2) Given their compu-
tational simplicity, generality, and flexibility, these methods are likely to be widely
used by researchers, methodologists, and practitioners in a wide spectrum of sci-
entific disciplines, especially in the big data era. (3) They could be very useful
in graduate-level teaching of statistics courses that involve computation and sim-
ulation, and in training graduate students. (4) A specific set of moments for each
variable is fairly rare in practice, but a specific distribution that would lead to these
moments is very common; so having access to these methods is needed by potentially
a large group of people. (5) Simulated variables can be treated as outcomes or pre-
dictors in subsequent statistical analyses as the variables are being generated jointly.
(6) Required quantities can either be specified or estimated from a real data set. (7)
The final product after all these extensions will allow the specification of two promi-
nent types of correlations (Pearson and Spearman correlations) and one emerging
type (L-correlations) provided that they are within the limits imposed by marginal
distributions. This makes it feasible to generate linear and a broad range of nonlinear
associations. (8) The continuous part can include virtually any shape (skewness, low
or high peakedness, mode at the boundary, multimodality, etc.) that is spanned by
power polynomials; the count data part can be under- or over-dispersed. (9) Ability
to jointly generate different types of data may facilitate comparisons among existing
data analysis and computation methods in assessing the extent of conditions under
which available methods work properly, and foster the development of new tools,
especially in contexts where correlations play a significant role (e.g., longitudinal,
clustered, and other multilevel settings). (10) The approaches presented here can
be regarded as a variant of multivariate Gaussian copula-based methods as (a) the
binary and ordinal variables are assumed to have a latent normal distribution before
discretization; (b) the count variables go through a correlation mapping procedure via
the normal-to-anything approach; and (c) the continuous variables consist of poly-
nomial terms involving normals. To the best of our knowledge, existing multivariate
copulas are not designed to have the generality of encompassing all these variable
types simultaneously. (11) As the mixed data generation routine is involved with
latent variables that are subsequently discretized, it should be possible to see how
the correlation structure changes when some variables in a multivariate continuous
setting are dichotomized/ordinalized (Demirtas 2016; Demirtas and Hedeker 2016;
Demirtas et al. 2016a). An important by-product of this research will be a better
understanding of the nature of discretization, which may have significant implica-
tions in interpreting the coefficients in regression-type models when some predictors
are discretized. On a related note, this could be useful in meta-analysis when some
studies discretize variables and some do not. (12) Availability of a general mixed
data generation algorithm can markedly facilitate simulated power-sample size cal-
culations for a broad range of statistical models.
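The latent-normal mechanism described in (a)-(c) above can be sketched in a few lines: draw correlated standard normals, then push each through the target marginal's inverse CDF (the normal-to-anything step). This is a hedged illustration, not the package's algorithm; the latent correlation 0.6, the Poisson(3) and Bernoulli(0.4) marginals, and the seed are all illustrative choices, and only two of the four variable types are shown.

```python
import math
import random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_ppf(u, lam):
    """Inverse CDF of Poisson(lam) by direct summation of the pmf."""
    k, p, cdf = 0, math.exp(-lam), math.exp(-lam)
    while cdf < u:
        k += 1
        p *= lam / k
        cdf += p
    return k

rng = random.Random(7)
rho, n = 0.6, 20_000          # illustrative latent correlation and sample size
xs, ys = [], []
for _ in range(n):
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
    xs.append(poisson_ppf(phi(z1), 3.0))   # Poisson(3) marginal
    ys.append(1 if phi(z2) > 0.6 else 0)   # Bernoulli(0.4) marginal

print(sum(xs) / n, sum(ys) / n)            # roughly 3 and 0.4
```

The marginals come out as specified, while the dependence induced through the latent normals remains positive; matching a *specified* post-transformation correlation is exactly what the correlation-transition step of the chapter handles.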
References

Amatya, A., & Demirtas, H. (2015). Simultaneous generation of multivariate mixed data with
Poisson and normal marginals. Journal of Statistical Computation and Simulation, 85, 3129–
3139.
Barbiero, A., & Ferrari, P. A. (2015). Simulation of ordinal and discrete variables with given
correlation matrix and marginal distributions. R package GenOrd. https://cran.r-project.org/web/
packages/GenOrd
Bates D., & Maechler M. (2016). Sparse and dense matrix classes and methods. R package Matrix.
http://www.cran.r-project.org/web/packages/Matrix
Demirtas, H. (2004a). Simulation-driven inferences for multiply imputed longitudinal datasets.
Statistica Neerlandica, 58, 466–482.
Demirtas, H. (2004b). Assessment of relative improvement due to weights within generalized esti-
mating equations framework for incomplete clinical trials data. Journal of Biopharmaceutical
Statistics, 14, 1085–1098.
Demirtas, H. (2005). Multiple imputation under Bayesianly smoothed pattern-mixture models for
non-ignorable drop-out. Statistics in Medicine, 24, 2345–2363.
Demirtas, H. (2006). A method for multivariate ordinal data generation given marginal distributions
and correlations. Journal of Statistical Computation and Simulation, 76, 1017–1025.
Demirtas, H. (2007a). Practical advice on how to impute continuous data when the ultimate interest
centers on dichotomized outcomes through pre-specified thresholds. Communications in
Statistics-Simulation and Computation, 36, 871–889.
Demirtas, H. (2007b). The design of simulation studies in medical statistics. Statistics in Medicine,
26, 3818–3821.
Demirtas, H. (2008). On imputing continuous data when the eventual interest pertains to ordinalized
outcomes via threshold concept. Computational Statistics and Data Analysis, 52, 2261–2271.
Demirtas, H. (2009). Rounding strategies for multiply imputed binary data. Biometrical Journal,
51, 677–688.
Demirtas, H. (2010). A distance-based rounding strategy for post-imputation ordinal data. Journal
of Applied Statistics, 37 , 489–500.
Demirtas, H. (2016). A note on the relationship between the phi coefficient and the tetrachoric
correlation under nonnormal underlying distributions. American Statistician, 70, 143–148.
Demirtas, H. (2017a). Concurrent generation of binary and nonnormal continuous data through
fifth order power polynomials. Communications in Statistics–Simulation and Computation, 46,
489–357.
Demirtas, H. (2017b). On accurate and precise generation of generalized Poisson variates. Com-
munications in Statistics–Simulation and Computation, 46, 489–499.
Demirtas, H., Ahmadian, R., Atis, S., Can, F. E., & Ercan, I. (2016a). A nonnormal look at polychoric
correlations: Modeling the change in correlations before and after discretization. Computational
Statistics, 31, 1385–1401.
Demirtas, H., Arguelles, L. M., Chung, H., & Hedeker, D. (2007). On the performance of bias-
reduction techniques for variance estimation in approximate Bayesian bootstrap imputation.
Computational Statistics and Data Analysis, 51, 4064–4068.
Demirtas, H., & Doganay, B. (2012). Simultaneous generation of binary and normal data with
specified marginal and association structures. Journal of Biopharmaceutical Statistics, 22, 223–
236.
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assumption
when multiply imputing non-Gaussian continuous outcomes: A simulation assessment. Journal
of Statistical Computation and Simulation, 78, 69–84.
Demirtas, H., & Hedeker, D. (2007). Gaussianization-based quasi-imputation and expansion strate-
gies for incomplete correlated binary responses. Statistics in Medicine, 26, 782–799.
Demirtas, H., & Hedeker, D. (2008a). Multiple imputation under power polynomials. Communica-
tions in Statistics- Simulation and Computation, 37 , 1682–1695.
Demirtas, H., & Hedeker, D. (2008b). Imputing continuous data under some non-Gaussian distrib-
utions. Statistica Neerlandica, 62, 193–205.
Demirtas, H., & Hedeker, D. (2008c). An imputation strategy for incomplete longitudinal ordinal
data. Statistics in Medicine, 27, 4086–4093.
Demirtas, H., & Hedeker, D. (2011). A practical way for computing approximate lower and upper
correlation bounds. The American Statistician, 65, 104–109.
Demirtas, H., & Hedeker, D. (2016). Computing the point-biserial correlation under any underlying
continuous distribution. Communications in Statistics-Simulation and Computation, 45, 2744–
2751.
Demirtas, H., Hedeker, D., & Mermelstein, J. M. (2012). Simulation of massive public health data
by power polynomials. Statistics in Medicine, 31 , 3337–3346.
Demirtas, H., Hu, Y., & Allozi, R. (2016b). Data generation with Poisson, binary, ordinal
and normal components. R package PoisBinOrdNor. https://cran.r-project.org/web/packages/
PoisBinOrdNor
Demirtas, H., & Schafer, J. L. (2003). On the performance of random-coefficient pattern-mixture
models for non-ignorable drop-out. Statistics in Medicine , 22 , 2553–2575.
Demirtas, H., & Yavuz, Y. (2015). Concurrent generation of ordinal and normal data. Journal of
Biopharmaceutical Statistics, 25, 635–650.
Emrich, J. L., & Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate
binary variates. The American Statistician, 45 , 302–304.
Ferrari, P. A., & Barbiero, A. (2012). Simulating ordinal data. Multivariate Behavioral Research,
47, 566–589.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43,
521–532.
Fréchet, M. (1951). Sur les tableaux de corrélation dont les marges sont données. Annales de
l’Université de Lyon Section A, 14 , 53–77.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., et al. (2016). Multivariate normal and
t distributions. R package mvtnorm. https://cran.r-project.org/web/packages/mvtnorm
Headrick, T. C. (2010). Statistical simulation: Power method polynomials and other transformations.
Boca Raton, FL: Chapman and Hall/CRC.
Higham, N. J. (2002). Computing the nearest correlation matrix—a problem from finance. IMA
Journal of Numerical Analysis, 22 , 329–343.
Hoeffding, W. (1994). Scale-invariant correlation theory. In N. I. Fisher & P. K. Sen (Eds.), The
collected works of Wassily Hoeffding (the original publication year is 1940) (pp. 57–107). New
York: Springer.
Hosking, J. R. M. (1990). L-moments: Analysis and estimation of distributions using linear com-
binations of order statistics. Journal of the Royal Statistical Society, Series B, 52, 105–124.
Nelsen, R. B. (2006). An introduction to copulas. Berlin, Germany: Springer.
R Development Core Team (2016). R: A language and environment for statistical computing.
http://www.cran.r-project.org
Schaefer, J., Opgen-Rhein, R., Zuber, V., Ahdesmaki, M., Silva, A. D., & Strimmer, K. (2015).
Efficient estimation of covariance and (partial) correlation. R package corpcor. https://cran.r-
project.org/web/packages/corpcor
Serfling, R., & Xiao, P. (2007). A contribution to multivariate L-moments: L-comoment matrices.
Journal of Multivariate Analysis, 98 , 1765–1781.
Vale, C. D., & Maurelli, V. V. A. (1983). Simulating multivariate nonnormal distributions. Psychome-
trika, 48 , 465–471.
Yahav, I., & Shmueli, G. (2012). On generating multivariate Poisson data in management science
applications. Applied Stochastic Models in Business and Industry , 28 , 91–102.
Yucel, R. M., & Demirtas, H. (2010). Impact of non-normal random effects on inference by multiple
imputation: A simulation assessment. Computational Statistics and Data Analysis, 54, 790–801.

Improving the Efficiency of the Monte-Carlo
Methods Using Ranked Simulated Approach

Hani Michel Samawi
Abstract This chapter explores the concept of using the ranked simulated sampling
approach (RSIS) to improve the well-known Monte-Carlo methods, introduced by
Samawi (1999), and extended to steady-state ranked simulated sampling (SRSIS)
by Al-Saleh and Samawi (2000). Both simulation sampling approaches are then
extended to multivariate ranked simulated sampling (MVRSIS) and the multivariate
steady-state ranked simulated sampling approach (MVSRSIS) by Samawi and
Al-Saleh (2007) and Samawi and Vogel (2013). These approaches have been
demonstrated as providing unbiased estimators and improving the performance
of some of the Monte-Carlo methods of single and multiple integral approximation.
Additionally, the MVSRSIS approach has been shown to improve the performance
and efficiency of Gibbs sampling (Samawi et al. 2012). Samawi and colleagues
showed that their approach resulted in a large savings in cost and time needed to
attain a specified level of accuracy.

1 Introduction

The term Monte-Carlo refers to techniques that use random processes to approximate
a non-stochastic k-dimensional integral of the form

\theta = \int_{R^k} g(\mathbf{u}) \, d\mathbf{u},   (1.1)

(Hammersley and Handscomb 1964).
The literature presents many approximation techniques, including Monte-Carlo
methods. However, as the dimension of the integrals rises, the difficulty of the inte-
gration problem increases even for relatively low dimensions (see Evans and Swartz
1995). Given such complications, many researchers are confused about which method

H.M. Samawi (B)
Department of Biostatistics, Jiann-Ping Hsu College of Public Health,
Georgia Southern University, Statesboro, GA 30460, USA
e-mail: hsamawi@georgiasouthern.edu
© Springer Nature Singapore Pte Ltd. 2017 17
D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical
Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_2


to use; however, the advantages and disadvantages of each method are not the pri-
mary concern of this chapter. The focus of this chapter is the use of Monte-Carlo
methods in multiple integration approximation.
The motivation for this research is based on the concepts of ranked set sampling

(RSS), introduced by McIntyre (1952). The motivation is based on the fact that the
i th quantified unit of RSS is simply an observation from f_{(i)}, where f_{(i)} is the density
function of the i th order statistic of a random sample of size n. When the underlying
density is the uniform distribution on (0, 1), f_{(i)} follows a beta distribution with
parameters (i, n − i + 1).
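This distributional fact is easy to check numerically: the i-th order statistic of n iid U(0, 1) draws has mean i/(n + 1), the mean of Beta(i, n − i + 1). A small stdlib sketch (n = 5, i = 2, and the seed are arbitrary choices):

```python
import random

rng = random.Random(1952)
n, reps = 5, 20_000
i = 2  # inspect the 2nd order statistic (1-based rank)

total = 0.0
for _ in range(reps):
    draws = sorted(rng.random() for _ in range(n))
    total += draws[i - 1]

mean_i = total / reps
# Beta(i, n - i + 1) = Beta(2, 4) has mean i / (n + 1) = 2/6 here.
print(round(mean_i, 3))
```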
Samawi (1999) was the first to explore the idea of RSS (Beta sampler) for inte-
gral approximation. He demonstrated that the procedure can improve the simulation
efficiency based on the ratio of the variances. Samawi's ranked simulated sampling
procedure (RSIS) generates an independent random sample U_{(1)}, U_{(2)}, ..., U_{(n)}, which
is denoted by RSIS, where U_{(i)} ∼ β(i, n − i + 1), i = 1, 2, ..., n, and β(., .) denotes
the beta distribution. The RSIS procedure constitutes an RSS based on random sam-
ples from the uniform distribution U(0, 1). The idea is to use this RSIS to compute
(1.1) with k = 1, instead of using an SRS of size n from U(0, 1), when the range
of the integral in (1.1) is (0, 1). In case of an arbitrary range (a, b) of the integral
in (1.1), Samawi (1999) used the sample X_{(1)}, X_{(2)}, ..., X_{(n)}, where X_{(i)} = F_X^{-1}(U_{(i)})
and F_X(.) is the distribution function of a continuous random variable, together with
the importance sampling technique to evaluate (1.1). He showed theoretically and
through simulation studies that using the RSIS sampler for evaluating (1.1) substan-
tially improved the efficiency when compared with the traditional uniform sampler
(USS).
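A small sketch of that comparison (a hedged stdlib illustration, with g(u) = u² so the true integral over (0, 1) is 1/3; n, the replication count, and the seed are arbitrary): the RSIS estimator averages g over one Beta(i, n − i + 1) draw per rank, while the USS estimator averages g over n iid uniforms.

```python
import random
import statistics

def uss_estimate(rng, g, n):
    """Crude uniform-sampler (USS) estimate of the integral of g over (0, 1)."""
    return sum(g(rng.random()) for _ in range(n)) / n

def rsis_estimate(rng, g, n):
    """RSIS estimate: one Beta(i, n - i + 1) draw per rank i."""
    return sum(g(rng.betavariate(i, n - i + 1)) for i in range(1, n + 1)) / n

rng = random.Random(1999)
g = lambda u: u * u          # true integral over (0, 1) is 1/3
n, reps = 10, 2_000

uss = [uss_estimate(rng, g, n) for _ in range(reps)]
rsis = [rsis_estimate(rng, g, n) for _ in range(reps)]

print(round(statistics.mean(rsis), 3))                  # both near 1/3
print(statistics.variance(rsis) < statistics.variance(uss))
```

Both estimators are unbiased, since the rank-averaged beta densities reconstruct the uniform density; the gain shows up entirely in the variance ratio, which is the efficiency measure the text cites.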
Al-Saleh and Zheng (2002 ( 2002)) introduced the idea of bivariate ranked set sampling
(BVRSS) and showed through theory and simulation that BVRSS outperforms the
bivariate simple random sample for estimating the population means. The BVRSS
is as follows:
Suppose (X, Y) is a bivariate random vector with the joint probability density func-
tion f_{X,Y}(x, y). Then,
1. A random sample of size n^4 is identified from the population and randomly
allocated into n^2 pools each of size n^2, so that each pool is a square matrix with
n rows and n columns.
2. In the first pool, identify the minimum value by judgment with respect to the first
characteristic, X, for each of the n rows.
3. For the n minima obtained in Step 2, the actual quantification is done on the pair
that corresponds to the minimum value of the second characteristic, Y, identified
by judgment. This pair, given the label (1, 1), is the first element of the BVRSS
sample.
4. Repeat Steps 2 and 3 for the second pool, but in Step 3, the pair corresponding to
the second minimum value with respect to the second characteristic, Y, is chosen
for actual quantification. This pair is given the label (1, 2).
5. The process continues until the label (n, n) is ascertained from the n^2 th (last) pool.


The procedure described above produces a BVRSS of size n^2. Let (X_{[i](j)}, Y_{(i)[j]}),
[i = 1, 2, ..., n and j = 1, 2, ..., n] denote the BVRSS sample from f_{X,Y}(x, y),
where f_{X_{[i](j)}, Y_{(i)[j]}}(x, y) is the joint probability density function of (X_{[i](j)}, Y_{(i)[j]}).
From Al-Saleh and Zheng (2002),

f_{X_{[i](j)}, Y_{(i)[j]}}(x, y) = f_{X_{(j)}}(x) f_{Y|X}(y|x) \frac{f_{Y_{(i)[j]}}(y)}{f_{Y_{[j]}}(y)},   (1.2)

where f_{X_{(j)}} is the density of the j th order statistic for an SRS sample of size n from
the marginal density f_X, and f_{Y_{[j]}}(y) is the density of the corresponding Y-value,
given by

f_{Y_{[j]}}(y) = \int_{-\infty}^{\infty} f_{X_{(j)}}(x) f_{Y|X}(y|x) \, dx,

while f_{Y_{(i)[j]}}(y) is the density of the i th order statistic of an iid sample from f_{Y_{[j]}}(y), i.e.,

f_{Y_{(i)[j]}}(y) = c \, (F_{Y_{[j]}}(y))^{i-1} (1 - F_{Y_{[j]}}(y))^{n-i} f_{Y_{[j]}}(y),

where

F_{Y_{[j]}}(y) = \int_{-\infty}^{y} \left( \int_{-\infty}^{\infty} f_{X_{(j)}}(x) f_{Y|X}(w|x) \, dx \right) dw.

Combining these results, Eq. (1.2) can be written as

f_{X_{[i](j)}, Y_{(i)[j]}}(x, y) = c_1 (F_{Y_{[j]}}(y))^{i-1} (1 - F_{Y_{[j]}}(y))^{n-i} (F_X(x))^{j-1} (1 - F_X(x))^{n-j} f(x, y),   (1.3)

where

c_1 = \frac{n!}{(i-1)!(n-i)!} \cdot \frac{n!}{(j-1)!(n-j)!}.

Furthermore, Al-Saleh and Zheng (2002) showed that

\frac{1}{n^2} \sum_{j=1}^{n} \sum_{i=1}^{n} f_{X_{[i](j)}, Y_{(i)[j]}}(x, y) = f(x, y).   (1.4)
For a variety of choices of f(u, v), one can have (U, V) bivariate uniform with a
probability density function f(u, v); 0 < u, v < 1, such that U ∼ U(0, 1) and
V ∼ U(0, 1) (see Johnson 1987). In that case, (U_{[i](j)}, V_{(i)[j]}), [i = 1, 2, ..., n and
j = 1, 2, ..., n] should have a bivariate probability density function given by

f_{(j),(i)}(u, v) = \frac{n!}{(i-1)!(n-i)!} \cdot \frac{n!}{(j-1)!(n-j)!} [F_{Y_{[j]}}(v)]^{i-1} [1 - F_{Y_{[j]}}(v)]^{n-i} [u]^{j-1} [1 - u]^{n-j} f(u, v).   (1.5)
Samawi and Al-Saleh (2007) extended the work of Samawi (1999) and Al-Saleh
and Zheng (2002) for the Monte-Carlo multiple integration approximation of (1.1)
when k = 2.
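As a hedged illustration of the k = 2 case (using independent per-coordinate beta samplers rather than the full BVRSS construction of (1.5), with g(u, v) = uv so the true double integral over the unit square is 1/4; n, the replication count, and the seed are arbitrary):

```python
import random
import statistics

def crude_estimate(rng, g, n):
    """Plain Monte-Carlo over the unit square with n^2 uniform pairs."""
    return sum(g(rng.random(), rng.random()) for _ in range(n * n)) / n**2

def ranked_estimate(rng, g, n):
    """Per-coordinate ranked sampler: Beta(j, n-j+1) x Beta(i, n-i+1)."""
    total = 0.0
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            u = rng.betavariate(j, n - j + 1)
            v = rng.betavariate(i, n - i + 1)
            total += g(u, v)
    return total / n**2

rng = random.Random(2007)
g = lambda u, v: u * v       # integral over (0,1)^2 equals 1/4
n, reps = 5, 1_500

crude = [crude_estimate(rng, g, n) for _ in range(reps)]
ranked = [ranked_estimate(rng, g, n) for _ in range(reps)]

print(round(statistics.mean(ranked), 3))                  # near 0.25
print(statistics.variance(ranked) < statistics.variance(crude))
```

Averaging over all rank pairs keeps the estimator unbiased, by the same rank-averaging identity as (1.4), while the per-coordinate ranking spreads the evaluation points more evenly than iid uniforms.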


Moreover, to further improve some of the Monte-Carlo methods of integration,


Al-Saleh and Samawi (2000) used steady-state ranked set simulated sampling
(SRSIS) as introduced by Al-Saleh and Al-Omari (1999). SRSIS has been shown to
be simpler and more efficient than Samawi's (1999) method.
In Samawi and Vogel's (2013) work, the SRSIS algorithm introduced by Al-Saleh
and Samawi (2000) was extended to the multivariate case for the approximation of
multiple integrals using Monte-Carlo methods. However, to simplify the algorithms,
However,, to simplify the algorithms,
we introduce only the bivariate integration problem; with this foundation, multiple
integral problems are a simple extension.

2 Steady-State Ranked Simulated Sampling (SRSIS)

Al-Saleh and Al-Omari (1999) introduced the idea of multistage ranked set sampling
(MRSS). To promote the use of MRSS in simulation and Monte-Carlo methods, let
{X_i^{(s)}; i = 1, 2, ..., n} be an MRSS of size n at stage s. Assume that X_i^{(s)} has
probability density function f_i^{(s)} and a cumulative distribution function F_i^{(s)}. Al-
Saleh and Al-Omari demonstrated the following properties of MRSS:
1.
f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i^{(s)}(x),   (2.1)

2. If s → ∞, then

F_i^{(s)}(x) \to F_i^{(\infty)}(x) =
\begin{cases}
0 & \text{if } x < Q_{(i-1)/n} \\
n F(x) - (i - 1) & \text{if } Q_{(i-1)/n} \le x < Q_{i/n} \\
1 & \text{if } x \ge Q_{i/n}
\end{cases}   (2.2)

for i = 1, 2, ..., n, where Q_α is the 100 α th percentile of F(x).

3. If X ∼ U(0, 1), then for i = 1, 2, ..., n, we have

F_i^{(\infty)}(x) =
\begin{cases}
0 & \text{if } x < (i-1)/n \\
n x - (i - 1) & \text{if } (i-1)/n \le x < i/n \\
1 & \text{if } x \ge i/n
\end{cases}   (2.3)

and

f_i^{(\infty)}(x) =
\begin{cases}
n & \text{if } (i-1)/n \le x < i/n \\
0 & \text{otherwise.}
\end{cases}   (2.4)
These properties imply X_i^{(\infty)} ∼ U(\frac{i-1}{n}, \frac{i}{n}) when the underlying distribution function
is U(0, 1).
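The steady-state property above says each X_i^(∞) is uniform on its own stratum ((i−1)/n, i/n), so averaging g over one draw per stratum gives an unbiased estimate of ∫₀¹ g(u) du with reduced variance. A hedged stdlib sketch with g(u) = e^u (true value e − 1; n, the replication count, and the seed are arbitrary):

```python
import math
import random
import statistics

def srsis_estimate(rng, g, n):
    """One draw from each stratum ((i-1)/n, i/n), i = 1, ..., n."""
    return sum(g(rng.uniform((i - 1) / n, i / n)) for i in range(1, n + 1)) / n

def crude_estimate(rng, g, n):
    """Plain Monte-Carlo with n iid U(0, 1) draws."""
    return sum(g(rng.random()) for _ in range(n)) / n

rng = random.Random(2000)
g, true_value = math.exp, math.e - 1
n, reps = 10, 2_000

srsis = [srsis_estimate(rng, g, n) for _ in range(reps)]
crude = [crude_estimate(rng, g, n) for _ in range(reps)]

print(round(statistics.mean(srsis), 3))                   # near e - 1
print(statistics.variance(srsis) < statistics.variance(crude))
```

This is why SRSIS is both simpler and more efficient than the beta sampler: each stratum draw is just a shifted and scaled uniform.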
Samawi and Vogel (2013) provided a modification of the Al-Saleh and Samawi
(2000) steady-state ranked simulated samples procedure (SRSIS) to bivariate cases
(BVSRSIS) as follows:


1. For each (i, j), j = 1, 2, ..., n and i = 1, 2, ..., n, generate independently
U_{i(j)} from U(\frac{j-1}{n}, \frac{j}{n}) and independent W_{(i)j} from U(\frac{i-1}{n}, \frac{i}{n}).
2. Generate Y_{(i)j} = F_Y^{-1}(W_{(i)j}) and X_{i(j)} = F_X^{-1}(U_{i(j)}) from F_Y(y) and F_X(x),
respectively.
3. To generate (X_{[i](j)}, Y_{(i)[j]}) from f(x, y), generate U_{i(j)} from U(\frac{j-1}{n}, \frac{j}{n}) and
independent W_{(i)j} from U(\frac{i-1}{n}, \frac{i}{n}); then

X_{[i](j)} | Y_{(i)j} = F_{X|Y}^{-1}(U_{i(j)} | Y_{(i)j}) and Y_{(i)[j]} | X_{i(j)} = F_{Y|X}^{-1}(W_{(i)j} | X_{i(j)}).

The joint density function of (X_{[i](j)}, Y_{(i)[j]}) is formed as follows:

f^{(\infty)}_{X_{[i](j)} Y_{(i)[j]}}(x, y) = f^{(\infty)}_{X_{[i](j)}}(x) f^{(\infty)}_{Y_{(i)[j]} | X_{[i](j)}}(y | X_{[i](j)}) = n^2 f_X(x) f_{Y|X}(y | X_{[i](j)}),

for Q_{X,(j-1)/n} \le x < Q_{X,j/n} and Q_{Y,(i-1)/n} \le y < Q_{Y,i/n},

where Q_X(s) and Q_Y(v) are the 100 s th percentile of F_X(x) and the 100 v th percentile of
F_Y(y), respectively. However, for the first stage, both Stokes (1977) and David (1981)
showed that F_{Y|X_{[i](j)}}(y|x) = F_{Y|X}(y|x). Al-Saleh and Zheng (2003) demonstrated
that the joint density is valid for an arbitrary stage, and therefore, valid for the steady
state. Therefore,

f^{(\infty)}_{X_{[i](j)} Y_{(i)[j]}}(x, y) = n^2 f_X(x) f_{Y|X}(y|x) = n^2 f_{Y,X}(x, y),   (2.5)

Q_{X,(j-1)/n} \le x < Q_{X,j/n}, Q_{Y,(i-1)/n} \le y < Q_{Y,i/n}.

Thus, we can write:

(1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} f^{(∞)}_{X_{[i](j)} Y_{(i)[j]}}(x, y) = Σ_{i=1}^{n} Σ_{j=1}^{n} f_{Y,X}(x, y) I[Q_X((j − 1)/n) ≤ x < Q_X(j/n)] I[Q_Y((i − 1)/n) ≤ y < Q_Y(i/n)] = f(x, y), (2.6)

where I is an indicator variable. Similarly, Eq. (2.5) can be extended by mathematical induction to the multivariate case as follows: f^{(∞)}(x₁, x₂, ..., x_k) = n^k f(x₁, x₂, ..., x_k), Q_{X_i}((j − 1)/n) ≤ x_i < Q_{X_i}(j/n), i = 1, ..., k and j = 1, 2, ..., n. In addition, the above algorithm can be extended for k > 2 as follows:

1. For each (i_l, l = 1, 2, ..., k), generate independently U_{i_l(i_s)} from U((i_s − 1)/n, i_s/n), l, s = 1, 2, ..., k and i_l, i_s = 1, 2, ..., n.

22 H.M. Samawi

2. Generate X_{i_l(i_s)} = F_{X_{i_l}}^{−1}(U_{i_l(i_s)}), l, s = 1, 2, ..., k and i_l, i_s = 1, 2, ..., n, from F_{X_{i_l}}(x), l = 1, 2, ..., k, respectively.
3. Then, generate the multivariate version of the steady-state simulated sample by using any technique for conditional random number generation.
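The three steps of the bivariate procedure can be sketched in code. The following is a minimal illustration (ours, not from the chapter) for a hypothetical bivariate normal target with correlation rho, chosen because both the marginal and conditional inverse CDFs are available in closed form (Y | X = x ~ N(ρx, 1 − ρ²), and symmetrically for X | Y); the function name and the choice of target are assumptions for illustration only.

```python
import random
from statistics import NormalDist

def bvsrsis_normal(n, rho, seed=0):
    """Sketch of BVSRSIS steps 1-3 for a bivariate normal with
    correlation rho: X ~ N(0, 1), Y | X = x ~ N(rho*x, 1 - rho^2)."""
    rng = random.Random(seed)
    std = NormalDist()
    cond_sd = (1.0 - rho**2) ** 0.5
    sample = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # Step 1: U_{i(j)} ~ U((j-1)/n, j/n), W_{(i)j} ~ U((i-1)/n, i/n)
            u = rng.uniform((j - 1) / n, j / n)
            w = rng.uniform((i - 1) / n, i / n)
            # Step 2: marginal draws X_{i(j)} = F_X^{-1}(U), Y_{(i)j} = F_Y^{-1}(W)
            x_ij = std.inv_cdf(u)
            y_ij = std.inv_cdf(w)
            # Step 3: fresh stratified uniforms through the conditional inverse CDFs
            u2 = rng.uniform((j - 1) / n, j / n)
            w2 = rng.uniform((i - 1) / n, i / n)
            x_br = rho * y_ij + cond_sd * std.inv_cdf(u2)  # X_{[i](j)} | Y_{(i)j}
            y_br = rho * x_ij + cond_sd * std.inv_cdf(w2)  # Y_{(i)[j]} | X_{i(j)}
            sample.append((x_br, y_br))
    return sample

pairs = bvsrsis_normal(n=20, rho=0.5)
mean_x = sum(x for x, _ in pairs) / len(pairs)
```

Because each coordinate is generated through stratified uniforms, the n² pairs cover the support far more evenly than n² independent draws would.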

3 Monte-Carlo Methods for Multiple Integration Problems

Very good descriptions of the basics of the various Monte-Carlo methods have been provided by Hammersley and Handscomb (1964), Liu (2001), Morgan (1984), Robert and Casella (2004), and Shreider (1966). The Monte-Carlo methods described include crude, antithetic, importance, control variate, and stratified sampling approaches. However, when variables are related, Monte-Carlo methods cannot be used directly (i.e., similar to the manner in which these methods are used in univariate integration problems) because, when using the bivariate uniform probability density function f(u, v) as a sampler to evaluate Eq. (1.1) with k = 2, f(u, v) is not consistent. However, in this context it is reasonable to use the importance sampling method, and therefore, it follows that other Monte-Carlo techniques can be used in conjunction with importance sampling. Thus, our primary concern is importance sampling.

3.1 Importance Sampling Method

In general, suppose that f is a density function on R^k such that the closure of the set of points where g(·) is non-zero is contained in the closure of the set of points where f(·) is non-zero. Let {U_i, i = 1, 2, ..., n} be a sample from f(·). Then, because

θ = ∫ (g(u)/f(u)) f(u) du,

Equation (1.1) can be estimated by

θ̂ = (1/n) Σ_{i=1}^{n} g(u_i)/f(u_i). (3.1)

Equation (3.1) is an unbiased estimator for (1.1), with variance given by

Var(θ̂) = (1/n) ( ∫_{R^k} (g(u)²/f(u)) du − θ² ).


In addition, from the point of view of the strong law of large numbers, it is clear that θ̂ → θ almost surely as n → ∞.
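As a concrete check of the estimator (3.1), the sketch below (ours, not from the chapter) uses the bivariate uniform density on the unit square as the importance sampler f, so that g/f reduces to g; the integrand has the known value e² − e − 1 ≈ 3.671.

```python
import math
import random

def importance_estimate(g, draw, f, n, seed=0):
    """Estimator (3.1): draw u_i from the sampler density f and
    average g(u_i)/f(u_i)."""
    rng = random.Random(seed)
    return sum(g(u) / f(u) for u in (draw(rng) for _ in range(n))) / n

# Integrand with known value: the integral over (0,1)^2 of
# (1+v)*exp(u*(1+v)) du dv equals e^2 - e - 1 = 3.6708...
g = lambda uv: (1.0 + uv[1]) * math.exp(uv[0] * (1.0 + uv[1]))
draw = lambda rng: (rng.random(), rng.random())  # bivariate uniform sampler
f = lambda uv: 1.0                               # its density on the unit square

theta_hat = importance_estimate(g, draw, f, n=20000)
```

With a flat sampler this is just crude Monte-Carlo; a sampler f shaped like g would reduce the variance term ∫ g²/f du − θ².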
A limited number of distributional families exist in a multidimensional context and are commonly used as importance samplers. For example, the multivariate Student's family is used extensively in the literature as an importance sampler. Evans and Swartz (1995) indicated a need for developing families of multivariate distributions that exhibit a wide variety of shapes. In addition, statisticians want distributional families to have efficient algorithms for random variable generation and the capacity to be easily fitted to a specific integrand.

This paper provides a new way of generating a bivariate sample based on bivariate steady-state sampling (BVSRSIS) that has the potential to extend the existing sampling methods. We also provide a means for introducing new samplers and for substantially improving the efficiency of the integration approximation based on those samplers.

3.2 Using Bivariate Steady-State Sampling (BVSRSIS)

Let

θ = ∫∫ g(x, y) dx dy. (3.2)

To estimate θ, generate a bivariate sample of size n² from f(x, y), which mimics g(x, y) and has the same range, such as (X_{ij}, Y_{ij}), i = 1, 2, ..., n and j = 1, 2, ..., n. Then

θ̂ = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} g(x_{ij}, y_{ij})/f(x_{ij}, y_{ij}). (3.3)

Equation (3.3) is an unbiased estimate for (3.2) with variance

Var(θ̂) = (1/n²) ( ∫∫ (g²(x, y)/f(x, y)) dx dy − θ² ). (3.4)
To estimate (3.2) using BVSRSIS, generate a bivariate sample of size n², as described above, say (X_{[i](j)}, Y_{(i)[j]}), i = 1, 2, ..., n and j = 1, 2, ..., n. Then

θ̂_BVSRSIS = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} g(x_{[i](j)}, y_{(i)[j]})/f(x_{[i](j)}, y_{(i)[j]}). (3.5)

Equation (3.5) is also an unbiased estimate for (3.2) using (2.5). Also, by using (2.5), the variance of (3.5) can be expressed as


Var(θ̂_BVSRSIS) = Var(θ̂) − (1/n⁴) Σ_{i=1}^{n} Σ_{j=1}^{n} (θ^{(i,j)}_{g/f} − θ_{g/f})², (3.6)

where θ^{(i,j)}_{g/f} = E[g(X_{[i](j)}, Y_{(i)[j]})/f(X_{[i](j)}, Y_{(i)[j]})] and θ_{g/f} = E[g(X, Y)/f(X, Y)] = θ. The variance of the estimator in (3.6) is less than the variance of the estimator in (3.4).

3.3 Simulation Study

This section presents the results of a simulation study that compares the performance of the importance sampling method described above using BVSRSIS schemes with the performance of the bivariate simple random sample (BVUSS) and BVRSS schemes by Samawi and Al-Saleh (2007), as introduced by Samawi and Vogel (2013).

3.3.1 Illustration for Importance Sampling Method When Integral's Limits Are (0, 1) × (0, 1)

As in Samawi and Al-Saleh (2007), an illustration of the impact of BVSRSIS on importance sampling is provided by evaluating the following integral:

θ = ∫₀¹ ∫₀¹ (1 + v) exp(u(1 + v)) du dv = 3.671. (3.7)

This example uses four bivariate sample sizes: n = 20, 30, 40, and 50. To estimate the variances using the simulation method, we use 2,000 simulated samples from BVUSS and BVSRSIS. Many choices of bivariate and multivariate distributions with uniform marginals on [0, 1] are available (Johnson 1987). However, for this simulation, we chose Plackett's uniform distribution (Plackett 1965), which is given by

f(u, v) = ψ{(ψ − 1)(u + v − 2uv) + 1} / {[1 + (u + v)(ψ − 1)]² − 4ψ(ψ − 1)uv}^{3/2}, 0 < u, v < 1, ψ > 0. (3.8)
The parameter ψ governs the dependence between the components (U, V) distributed according to f. Three cases explicitly indicate the role of ψ (Johnson 1987):

ψ → 0: U = 1 − V,
ψ = 1: U and V are independent,
ψ → ∞: U = V.
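To make (3.8) concrete, here is a small numerical sanity check (ours, not from the chapter): at ψ = 1 the density reduces to the independence density 1, and for any ψ it should integrate to 1 over the unit square; the grid size m below is an arbitrary choice.

```python
def plackett_density(u, v, psi):
    """Plackett's bivariate uniform density, Eq. (3.8)."""
    num = psi * ((psi - 1.0) * (u + v - 2.0 * u * v) + 1.0)
    den = ((1.0 + (u + v) * (psi - 1.0)) ** 2
           - 4.0 * psi * (psi - 1.0) * u * v) ** 1.5
    return num / den

# Midpoint-rule check that the density integrates to ~1 for psi = 2
m = 200
h = 1.0 / m
total = sum(
    plackett_density((i + 0.5) * h, (j + 0.5) * h, 2.0) * h * h
    for i in range(m) for j in range(m)
)
```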


Table 1 Efficiency of estimating (3.7) using BVSRSIS relative to BVUSS and BVRSS

n \ ψ    1                   2
20       289.92 (8.28)       273.68 (9.71)
30       649.00 (12.94)      631.31 (13.06)
40       1165.31 (16.91)     1086.46 (18.60)
50       1725.25 (21.67)     1687.72 (23.03)

Note: Values shown in bold were extracted from Samawi and Al-Saleh (2007)

Table 1 presents the relative efficiencies of our estimators using BVSRSIS in comparison with BVUSS, together with those of BVRSIS relative to BVUSS, for estimating (3.7). As illustrated in Table 1, BVSRSIS is clearly more efficient than either BVUSS or BVRSIS when used for estimation.

3.3.2 Illustration When the Integral's Limits Are an Arbitrary Subset of R²

Recent work by Samawi and Al-Saleh (2007) and Samawi and Vogel (2013) used an identical example in which the range of the integral was not (0, 1), and the authors evaluated the bivariate normal distribution (e.g., g(x, y) is the N₂(0, 0, 1, 1, ρ) density). For integrations with high dimensions and a requirement of low relative error, the evaluation of the multivariate normal distribution function remains one of the unsolved problems in simulation (e.g., Evans and Swartz 1995). To demonstrate how BVSRSIS increases the precision of evaluating the multivariate normal distribution, we illustrate the method by evaluating the bivariate normal distribution as follows:

θ = ∫_{−∞}^{z₁} ∫_{−∞}^{z₂} g(x, y) dy dx, (3.9)

where g(x, y) is the N₂(0, 0, 1, 1, ρ) density.
Given the similar shapes of the marginal of the normal and the marginal of the logistic probability density functions, it is natural to attempt to approximate the bivariate normal cumulative distribution function by the bivariate logistic cumulative distribution function. For the multivariate logistic distribution and its properties, see Johnson and Kotz (1972). The density of the bivariate logistic (Johnson and Kotz 1972) is chosen to be

f(x, y) = (2π²/3) e^{−π(x+y)/√3} (1 + e^{−πz₁/√3} + e^{−πz₂/√3}) / (1 + e^{−πx/√3} + e^{−πy/√3})³, −∞ < x < z₁; −∞ < y < z₂. (3.10)
It can be shown that the marginal of X is given by

f(x) = (π/√3) e^{−πx/√3} (1 + e^{−πz₁/√3} + e^{−πz₂/√3}) / (1 + e^{−πx/√3} + e^{−πz₂/√3})², −∞ < x < z₁. (3.11)

Now let W = Y + (√3/π) ln(1 + e^{−πX/√3} + e^{−πz₂/√3}). Then it can be shown that

f(w|x) = (2π/√3) e^{−πw/√3} / [ (1 + e^{−πx/√3})/(1 + e^{−πx/√3} + e^{−πz₂/√3}) + e^{−πw/√3} ]³,
−∞ < w < z₂ + (√3/π) ln(1 + e^{−πx/√3} + e^{−πz₂/√3}). (3.12)

To generate from (3.10), proceed as follows:

1. Generate X from (3.11).
2. Generate W independently from (3.12).
3. Let Y = W − (√3/π) ln(1 + e^{−πX/√3} + e^{−πz₂/√3}).
4. Then the resulting pair (X, Y) has the correct probability density function, as defined in (3.10).
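Steps 1–4 can be implemented by inversion. The sketch below is ours, not from the chapter: the closed-form inverse CDFs it uses, F_X(x) = (1 + e^{−πz₁/√3} + e^{−πz₂/√3})/(1 + e^{−πx/√3} + e^{−πz₂/√3}) for (3.11) and F(w|x) = [(1 + e^{−πx/√3})/(1 + e^{−πx/√3} + e^{−πz₂/√3}) + e^{−πw/√3}]^{−2} for (3.12), are derived here from those densities and are not stated in the text.

```python
import math
import random

SQRT3 = math.sqrt(3.0)

def sample_truncated_bivariate_logistic(z1, z2, n, seed=0):
    """Generate (X, Y) pairs from the truncated bivariate logistic
    density (3.10) via steps 1-4, using inverse-CDF draws (the inverse
    CDFs are our own derivation from (3.11) and (3.12))."""
    rng = random.Random(seed)
    t1 = math.exp(-math.pi * z1 / SQRT3)
    t2 = math.exp(-math.pi * z2 / SQRT3)
    norm = 1.0 + t1 + t2  # the factor (1 + e^{-pi z1/sqrt3} + e^{-pi z2/sqrt3})
    out = []
    for _ in range(n):
        # Step 1: X from (3.11) by inversion
        u = 1.0 - rng.random()                 # uniform on (0, 1]
        ex = norm / u - 1.0 - t2               # e^{-pi x/sqrt3}
        x = -SQRT3 / math.pi * math.log(ex)
        d = 1.0 + ex + t2
        # Step 2: W from (3.12) by inversion
        c = (1.0 + ex) / d
        v = 1.0 - rng.random()
        w = -SQRT3 / math.pi * math.log(v ** -0.5 - c)
        # Step 3: transform back to Y
        out.append((x, w - SQRT3 / math.pi * math.log(d)))
    return out

xy_pairs = sample_truncated_bivariate_logistic(z1=1.0, z2=1.0, n=5000)
```

Every generated pair stays inside the truncation region x < z₁, y < z₂ by construction.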

For this illustration, two bivariate sample sizes, n = 20 and 40, and different values of ρ and (z₁, z₂) are used. To estimate the variances using simulation, we use 2,000 simulated samples from BVUSS, BVRSIS, and BVSRSIS (Tables 2 and 3). Notably, when Samawi and Vogel (2013) used examples identical to those used by Samawi and Al-Saleh (2007), a comparison of the simulations showed that the Samawi and Vogel (2013) BVSRSIS approach improved the efficiency of estimating the multiple integrals by a factor ranging from 2 to 100.
As expected, the results of the simulation indicated that using BVSRSIS substan-
tially improved the performance of the importance sampling method for integration

Table 2 Efficiency of using BVSRSIS to estimate Eq. (3.9) relative to BVUSS

(z₁, z₂)    n = 20                                          n = 40
            ρ = ±0.20  ρ = ±0.50  ρ = ±0.80  ρ = ±0.95      ρ = ±0.20  ρ = ±0.50  ρ = ±0.80  ρ = ±0.95
(0, 0)      5.39       9.89       5.98       49.65          9.26       22.50      15.80      158.77
            (6.42)     (21.70)    (144.31)   (171.22)       (12.2)     (63.74)    (526.10)   (612.50)
(−1, −1)    22.73      29.43      22.48      95.10          55.99      90.24      69.85      380.87
            (75.82)    (182.27)   (87.42)    (23.71)        (255.75)   (688.59)   (336.58)   (100.40)
(−2, −2)    148.30     133.21     125.37     205.99         506.63     408.66     411.19     802.73
            (200.46)   (143.62)   (30.45)    (42.75)        (759.81)   (568.77)   (128.94)   (45.55)
(−1, −2)    173.07     281.60     216.86     24.11ᵃ         714.92     1041.39    882.23     98.76ᵃ
            (91.08)    (42.89)    (08.47)                   (382.01)   (148.16)   (43.19)

Note: Values shown in bold are results for negative correlation coefficients
ᵃ Values cannot be obtained due to the steep shape of the bivariate distribution for the large negative correlation


Table 3 Relative efficiency of estimating Eq. (3.9) using BVRSIS as compared with using BVUSS

(z₁, z₂)    n = 20                                      n = 40
            ρ = 0.20  ρ = 0.50  ρ = 0.80  ρ = 0.95      ρ = 0.20  ρ = 0.50  ρ = 0.80  ρ = 0.95
(0, 0)      2.39      3.02      2.29      3.21          3.80      4.85      3.73      5.66
(−1, −1)    4.73      4.30      4.04      3.79          7.01      8.28      7.83      7.59
(−2, −2)    8.44      8.47      8.73      5.65          15.69     15.67     16.85     11.02

Source: Extracted from Samawi and Al-Saleh (2007)

approximation. BVSRSIS also outperformed the BVRSIS method used by Samawi and Al-Saleh (2007). Moreover, increasing the sample size in both of the above illustrations increases the relative efficiencies of these methods. For instance, in the first illustration, by increasing the sample size from 20 to 50, the relative efficiency of using BVSRSIS as compared with BVUSS to estimate (3.7) is increased from 289.92 to 1725.25, with the increase dependent on the dependency between U and V. A similar relationship between sample size and relative efficiencies of these two methods can be demonstrated in the second illustration.

Based on the above conclusions, BVSRSIS can be used in conjunction with other multivariate integration procedures to improve the performance of those methods, thus providing researchers with a significant reduction in required sample size. With the use of BVSRSIS, researchers can perform integral estimation using substantially fewer simulated numbers. Since using BVSRSIS in simulation does not require any extra effort or programming, we recommend using BVSRSIS to improve the well-known Monte-Carlo methods for numerical multiple integration problems. Using BVSRSIS will yield an unbiased and more efficient estimate of those integrals. Moreover, this sampling scheme can be applied successfully to other simulation problems. Last, we recommend using the BVSRSIS method for integrals with a dimension no greater than 3. For higher-dimensional integrals, other methods in the literature can be used in conjunction with independent steady ranked simulated sampling.

4 Steady-State Ranked Gibbs Sampler
Many approximation techniques are found in the literature, including Monte-Carlo
methods, asymptotic, and Markov chain Monte-Carlo (MCMC) methods such as the
Gibbs sampler (Evans and Swartz 1995). Recently, many statisticians have become interested in MCMC methods to simulate complex, nonstandard multivariate distributions. Of the MCMC methods, the Gibbs sampling algorithm is one of the best known and most frequently used MCMC methods. The impact of the Gibbs sampler method on Bayesian statistics has been detailed by many authors (e.g., Chib and Greenberg 1994; Tanner 1993) following the work of Tanner and Wong (1987) and Gelfand and Smith (1990).


To understand the MCMC process, suppose that we need to evaluate the Monte-Carlo integration E[f(X)], where f(·) is any user-defined function of a random variable X. The MCMC process is as follows: Generate a sequence of random variables, {X₀, X₁, X₂, ...}, such that at each time t ≥ 0, the next state X_{t+1} is sampled from a distribution P(X_{t+1} | X_t) which depends only on the current state of the chain, X_t. This sequence is called a Markov chain, and P(·|·) is called the transition kernel of the chain. The transition kernel is a conditional distribution function that represents the probability of moving from X_t to the next point X_{t+1} in the support of X. Assume that the chain is time homogeneous. Thus, after a sufficiently long burn-in of k iterations, {X_t; t = k + 1, ..., n} will be dependent samples from the stationary distribution. Burn-in samples are usually discarded for this calculation, giving the estimator

f̄ ≈ (1/(n − k)) Σ_{t=k+1}^{n} f(X_t). (4.1)

This average in (4.1) is called an ergodic average. Convergence to the required expectation is ensured by the ergodic theorem. More information and discussions on some of the issues in MCMC can be found in Roberts (1995) and Tierney (1995).
To understand how to construct a Markov chain so that its stationary distribution is precisely the distribution of interest π(·), we outline Hastings' (1970) algorithm, which is a generalization of the method first proposed by Metropolis et al. (1953). The method is useful for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. The method is as follows: At each time t, the next state X_{t+1} is chosen by first sampling a candidate point Y from a proposal distribution q(· | X_t) (ergodic). Note that the proposal distribution may depend on the current point X_t. The candidate point Y is then accepted with probability α(X_t, Y), where

α(X_t, Y) = min{ 1, π(Y) q(X_t | Y) / [π(X_t) q(Y | X_t)] }. (4.2)

If the candidate point is accepted, the next state becomes X_{t+1} = Y. If the candidate is rejected, the chain does not move, that is, X_{t+1} = X_t. Thus the Metropolis–Hastings algorithm simply requires the following:

Initialize X₀; set t = 0.
Repeat {
  generate a candidate Y from q(· | X_t) and a value u from a uniform (0, 1);
  if u ≤ α(X_t, Y), set X_{t+1} = Y;
  otherwise, set X_{t+1} = X_t;
  increment t.
}
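The loop above can be sketched as follows; the log-scale formulation of (4.2) and the N(0, 1) target with a random-walk proposal are our own illustrative choices, not from the chapter.

```python
import math
import random

def metropolis_hastings(log_pi, propose, log_q, x0, steps, seed=0):
    """Generic Metropolis-Hastings loop implementing (4.2) on the log
    scale: accept candidate Y with probability
    min(1, pi(Y) q(X_t|Y) / (pi(X_t) q(Y|X_t)))."""
    rng = random.Random(seed)
    x = x0
    chain = [x]
    for _ in range(steps):
        y = propose(rng, x)
        log_alpha = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
        if math.log(1.0 - rng.random()) <= min(0.0, log_alpha):
            x = y        # accept: X_{t+1} = Y
        chain.append(x)  # on rejection the chain stays at X_t
    return chain

# Example target: pi = N(0, 1), with a symmetric random-walk proposal
# (the q terms then cancel in alpha, but are kept for generality)
log_pi = lambda x: -0.5 * x * x
propose = lambda rng, x: x + rng.gauss(0.0, 1.0)
log_q = lambda y, x: -0.5 * (y - x) ** 2

chain = metropolis_hastings(log_pi, propose, log_q, x0=0.0, steps=20000)
kept = chain[2000:]  # discard burn-in, as in (4.1)
mh_mean = sum(kept) / len(kept)
mh_var = sum((v - mh_mean) ** 2 for v in kept) / len(kept)
```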


A special case of the Metropolis–Hastings algorithm is the Gibbs sampling method proposed by Geman and Geman (1984) and introduced by Gelfand and Smith (1990).
).
To date, most statistical applications of MCMC have used Gibbs sampling. In Gibbs sampling, variables are sampled one at a time from their full conditional distributions. Gibbs sampling uses an algorithm to generate random variables from a marginal distribution indirectly, without calculating the density. Similar to Casella and George (1992), we demonstrate the usefulness and the validity of the steady-state Gibbs sampling algorithm by exploring simple cases. This example shows that steady-state Gibbs sampling is based only on elementary properties of Markov chains and the properties of BVSRSIS.

4.1 Traditional (standard) Gibbs Sampling Method
Suppose that f(x, y₁, y₂, ..., y_g) is a joint density function on R^{g+1} and our purpose is to find the characteristics of the marginal density, such as the mean and the variance:

f_X(x) = ∫ ... ∫ f(x, y₁, y₂, ..., y_g) dy₁ dy₂ ... dy_g. (4.3)

In cases where (4.3) is extremely difficult or not feasible to perform either analytically or numerically, Gibbs sampling enables the statistician to efficiently generate a sample X₁, ..., X_n ~ f_X(x) without requiring f_X(x). If the sample size n is large enough, this method will provide a desirable degree of accuracy for estimating the mean and the variance of f_X(x).
The following discussion of the Gibbs sampling method uses a two-variable case to make the method simpler to follow. A case with more than two variables is illustrated in the simulation study.

Given a pair of random variables (X, Y), Gibbs sampling generates a sample from f_X(x) by sampling from the conditional distributions, f_{X|Y}(x|y) and f_{Y|X}(y|x), which are usually known in statistical models applications. The procedure for generating a Gibbs sequence of random variables,

X′₀, Y′₀, X′₁, Y′₁, ..., X′_k, Y′_k, (4.4)

is to start from an initial value Y′₀ = y′₀ (which is a known or specified value) and obtain the rest of the sequence (4.4) iteratively by alternately generating values from

X′_j ~ f_{X|Y}(x | Y′_j = y′_j),
Y′_{j+1} ~ f_{Y|X}(y | X′_j = x′_j). (4.5)


For large k and under reasonable conditions (Gelfand and Smith 1990), the final observation in (4.5), namely X′_k = x′_k, is effectively a sample point from f_X(x). A natural way to obtain an independent and identically distributed (i.i.d.) sample from f_X(x) is to follow the suggestion of Gelfand and Smith (1990) to use Gibbs sampling to find the kth, or final, value from n independent repetitions of the Gibbs sequence in (4.5). Alternatively, we can generate one long Gibbs sequence and use a systematic sampling technique to extract every rth observation. For large enough r, this method will also yield an approximate i.i.d. sample from f_X(x). For the advantages and disadvantages of this alternate method, see Gelman and Rubin (1991).
Next, we provide a brief explanation of why Gibbs sampling works under reasonable conditions. Suppose we know the conditional densities f_{X|Y}(x|y) and f_{Y|X}(y|x) of the two random variables X and Y, respectively. Then the marginal density of X, f_X(x), can be determined as follows:

f_X(x) = ∫ f(x, y) dy,

where f(x, y) is the unknown joint density of (X, Y). Using the fact that f_{XY}(x, y) = f_Y(y) f_{X|Y}(x|y), then

f_X(x) = ∫ f_Y(y) f_{X|Y}(x|y) dy.

Using a similar argument for f_Y(y), then

f_X(x) = ∫ [ ∫ f_{X|Y}(x|y) f_{Y|X}(y|t) dy ] f_X(t) dt = ∫ g(x, t) f_X(t) dt, (4.6)

where g(x, t) = ∫ f_{X|Y}(x|y) f_{Y|X}(y|t) dy. As argued by Gelfand and Smith (1990), Eq. (4.6) defines a fixed-point integral equation for which f_X(x) is the solution and the solution is unique.
4.2 Steady-State Gibbs Sampling (SSGS): The Proposed Algorithms

To guarantee an unbiased estimator for the mean, density, and distribution function of f_X(x), Samawi et al. (2012) introduced two methods for performing steady-state Gibbs sampling. The first method is as follows:

In standard Gibbs sampling, the Gibbs sequence is obtained using the conditional distributions, f_{X|Y}(x|y) and f_{Y|X}(y|x), to generate a sequence of random variables,

X′₀, Y′₀, X′₁, Y′₁, ..., X′_{k−1}, Y′_{k−1}, (4.7)


starting from an initial, specified value Y′₀ = y′₀ and iteratively obtaining the rest of the sequence (4.7) by alternately generating values from

X′_j ~ f_{X|Y}(x | Y′_j = y′_j),
Y′_{j+1} ~ f_{Y|X}(y | X′_j = x′_j). (4.8)

However, in steady-state Gibbs sampling (SSGS), the Gibbs sequence is obtained as follows: One step before the kth step in the standard Gibbs sampling method, take the last step as

X_{i(j)} ~ F_{X|Y}^{−1}(U_{i(j)} | Y′_{k−1} = y′_{k−1}),
Y_{(i)j} ~ F_{Y|X}^{−1}(W_{(i)j} | X′_{k−1} = x′_{k−1}),
X_{[i](j)} ~ F_{X|Y}^{−1}(U′_{i(j)} | Y_{(i)j} = y_{(i)j}),
Y_{(i)[j]} ~ F_{Y|X}^{−1}(W′_{(i)j} | X_{i(j)} = x_{i(j)}), (4.9)

where {U_{i(j)}, U′_{i(j)}} are from U((j − 1)/n, j/n) and {W_{(i)j}, W′_{(i)j}} are from U((i − 1)/n, i/n), as described above. Clearly, this step does not require extra computer time since we generate the Gibbs sequences from uniform distributions only. Repeat this step independently for i = 1, 2, ..., n and j = 1, 2, ..., n to get an independent sample of size n², namely (X_{[i](j)}, Y_{(i)[j]}), i = 1, 2, ..., n and j = 1, 2, ..., n. For large k and under reasonable conditions (Gelfand and Smith 1990), the final observation in Eq. (4.9), namely (X_{[i](j)} = x_{[i](j)}, Y_{(i)[j]} = y_{(i)[j]}), is effectively a sample point from (2.5). Using the properties of SRSIS, (X_{[i](j)}, Y_{(i)[j]}), i = 1, 2, ..., n and j = 1, 2, ..., n, will produce unbiased estimators for the marginal means and distribution functions. Alternatively, we can generate one long standard Gibbs sequence and use a systematic sampling technique to extract every rth observation using a similar method as described above. Again, an SSGS sample will be obtained as

X i ( j ) ∼ F −| (U |Y − = y  − ),
1
X Y i( j) r 1 r 1

Y(i ) j ∼ F −| (W | X  − = x  − ),
1
Y X (i ) j r 1 r 1
−1    (4.10)
X [i ]( j ) FX |Y (Ui ( j ) Y(i ) j | =y (i ) j ),

Y(i )[ j ] ∼ F −| 1
Y X ( W(i ) j
 |X  = x
i( j) i ( j ) ),

where {Ui ( j ) , U  } from U


− j 1 j
,n and {W(i ) j , W  } from U , to obtain − 
i 1 i
i( j) n (i ) j n n
2
an independent sample of size n , that is, ( X [i ]( j ) , Y(i ) [ [ ] ), i = 1, 2, . . . , n and j =
j
1, 2, . . . , n .
]
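The SSGS last step can be sketched in code. The following is a minimal illustration (ours, not from the chapter), assuming a bivariate normal target so that both conditional inverse CDFs are available in closed form through `statistics.NormalDist`.

```python
import random
from statistics import NormalDist

def ssgs_bivariate_normal(n, rho, k=20, seed=0):
    """Sketch of the SSGS scheme (4.9) for a bivariate normal target:
    both full conditionals are N(rho*other, 1 - rho^2), so F^{-1} of
    each conditional has a closed form."""
    rng = random.Random(seed)
    std = NormalDist()
    sd = (1.0 - rho**2) ** 0.5
    inv = lambda p, cond: rho * cond + sd * std.inv_cdf(p)
    sample = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            x, y = 0.0, 0.0          # initial values X'_0, Y'_0
            for _ in range(k - 1):   # standard Gibbs steps up to k - 1
                x = inv(rng.random(), y)
                y = inv(rng.random(), x)
            # last step per (4.9), using stratified uniforms U, W then U', W'
            x_ij = inv(rng.uniform((j - 1) / n, j / n), y)     # X_{i(j)}
            y_ij = inv(rng.uniform((i - 1) / n, i / n), x)     # Y_{(i)j}
            x_br = inv(rng.uniform((j - 1) / n, j / n), y_ij)  # X_{[i](j)}
            y_br = inv(rng.uniform((i - 1) / n, i / n), x_ij)  # Y_{(i)[j]}
            sample.append((x_br, y_br))
    return sample

ssgs_pts = ssgs_bivariate_normal(n=10, rho=0.5)
ssgs_mean = sum(x for x, _ in ssgs_pts) / len(ssgs_pts)
```

Only the final step differs from standard Gibbs sampling, which is why the scheme adds essentially no computing cost.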


Using the same arguments as in (4.1), suppose we know the conditional densities f_{X|Y}(x|y) and f_{Y|X}(y|x) of the two random variables X and Y, respectively. Equation (4.6) is the limiting form of the Gibbs iteration scheme, showing how sampling from conditionals produces a marginal distribution. As in Gelfand and Smith (1990), for k → ∞, X′_{k−1} ~ f_X(x) and Y′_{k−1} ~ f_Y(y), and hence F_{Y|X}^{−1}(W_{(i)j} | x′_{k−1}) = Y_{(i)j} ~ f^{(∞)}_{Y_{(i)}}(y) and F_{X|Y}^{−1}(U_{i(j)} | y′_{k−1}) = X_{i(j)} ~ f^{(∞)}_{X_{(j)}}(x), where W_{(i)j} ~ U((i − 1)/n, i/n) and U_{i(j)} ~ U((j − 1)/n, j/n). Therefore,

F_{X|Y}^{−1}(U′_{i(j)} | Y_{(i)j}) = X_{[i](j)} | Y_{(i)j} ~ f^{(∞)}_{X_{[i](j)}|Y_{(i)j}}(x | Y_{(i)j}) and

F_{Y|X}^{−1}(W′_{(i)j} | X_{i(j)}) = Y_{(i)[j]} | X_{i(j)} ~ f^{(∞)}_{Y_{(i)[j]}|X_{i(j)}}(y | X_{i(j)}).

This step produces an independent bivariate steady-state sample, (X_{[i](j)}, Y_{(i)[j]}), i = 1, 2, ..., n and j = 1, 2, ..., n, where some characteristics of the marginal distributions are to be investigated. To see how to apply this bivariate steady-state sample to Gibbs sampling, using (2.5) we get

f^{(∞)}_{X_{[i](j)}}(x) = ∫_{Q_Y((i−1)/n)}^{Q_Y(i/n)} n² f_Y(y) f_{X|Y}(x|y) dy = ∫_{Q_Y((i−1)/n)}^{Q_Y(i/n)} n f_{X|Y}(x|y) [ n ∫_{Q_X((j−1)/n)}^{Q_X(j/n)} f_X(t) f_{Y|X}(y|t) dt ] dy

= ∫_{Q_X((j−1)/n)}^{Q_X(j/n)} [ ∫_{Q_Y((i−1)/n)}^{Q_Y(i/n)} n f_{Y|X}(y|t) f_{X|Y}(x|y) dy ] n f_X(t) dt (4.11)

= ∫_{Q_X((j−1)/n)}^{Q_X(j/n)} [ ∫_{Q_Y((i−1)/n)}^{Q_Y(i/n)} n f_{Y|X}(y|t) f_{X|Y}(x|y) dy ] f_{X_{i(j)}}(t) dt.

As argued by Gelfand and Smith (1990), Eq. (4.11) defines a fixed-point integral equation for which f^{(∞)}_{X_{[i](j)}}(x) is the solution and the solution is unique.
We next show how SSGS can improve the efficiency of estimating the sample
means of a probability density function f (x ).

Theorem 4.1 (Samawi et al. 2012) Under the same conditions as the standard Gibbs sampling, the bivariate SSGS sample above, (X_{[i](j)}, Y_{(i)[j]}), i = 1, 2, ..., n and j = 1, 2, ..., n, from f(x, y) provides the following:

1. Unbiased estimators of the marginal means of X and/or Y. Hence E(X̄_SSGS) = μ_x, where μ_x = E(X).
2. Var(X̄_SSGS) ≤ Var(X̄), where X̄ = Σ_{i=1}^{n²} X_i / n².


Proof Using (2.5), with each double integral below taken over Q_Y((i − 1)/n) ≤ y < Q_Y(i/n) and Q_X((j − 1)/n) ≤ x < Q_X(j/n),

E(X̄_SSGS) = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} E(X_{[i](j)}) = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ x f^{(∞)}_{X_{[i](j)} Y_{(i)[j]}}(x, y) dx dy

= (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ x n² f(x, y) dx dy

= Σ_{j=1}^{n} ∫_{Q_X((j−1)/n)}^{Q_X(j/n)} x f_X(x) [ Σ_{i=1}^{n} ∫_{Q_Y((i−1)/n)}^{Q_Y(i/n)} f_{Y|X}(y|x) dy ] dx = E(X) = μ_x.

Similarly,

Var(X̄_SSGS) = (1/n⁴) Σ_{i=1}^{n} Σ_{j=1}^{n} Var(X_{[i](j)}) = (1/n⁴) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (x − μ_{x_{[i](j)}})² f^{(∞)}_{X_{[i](j)} Y_{(i)[j]}}(x, y) dx dy

= (1/n⁴) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (x − μ_{x_{[i](j)}} ± μ_x)² n² f(x, y) dx dy

= (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ [(x − μ_x) − (μ_{x_{[i](j)}} − μ_x)]² f(x, y) dx dy

and,

Var(X̄_SSGS) = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ { (x − μ_x)² − 2(x − μ_x)(μ_{x_{[i](j)}} − μ_x) + (μ_{x_{[i](j)}} − μ_x)² } f(x, y) dx dy

= (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (x − μ_x)² f(x, y) dx dy − (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ 2(x − μ_x)(μ_{x_{[i](j)}} − μ_x) f(x, y) dx dy + (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (μ_{x_{[i](j)}} − μ_x)² f(x, y) dx dy

= (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (x − μ_x)² f(x, y) dx dy − (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (μ_{x_{[i](j)}} − μ_x)² f(x, y) dx dy

= σ²_X/n² − (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (μ_{x_{[i](j)}} − μ_x)² f(x, y) dx dy ≤ Var(X̄) = σ²_X/n²,

where σ²_X = Var(X). Similar results can be obtained for the marginal mean of Y and the marginal distributions of X and Y. Note that as compared with standard Gibbs sampling, using SSGS not only provides a gain in efficiency but also reduces the sample size required to achieve a certain accuracy in estimating the marginal means and distributions. For very complex situations, the smaller required sample size can substantially reduce computation time. To provide insight into the gain in efficiency from using SSGS, we next conduct a simulation study.

4.3 Simulation Study and Illustrations

This section presents the results of a simulation study comparing the performance of the SSGS with the standard Gibbs sampling method. To compare the performance of our proposed algorithm, we used the same illustrations as Casella and George (1992). For these examples, bivariate samples of sizes n = 10, 20, and 50 were used, with Gibbs sequence lengths k = 20, 50, and 100, and r = 20, 50, and 100 in the long-sequence Gibbs sampler. To estimate the variances of the estimators using the simulation method, we completed 5,000 replications. Using the 5,000 replications, we estimate the efficiency of our procedure relative to the traditional (i.e., standard) Gibbs sampling method by $\mathrm{eff}(\hat{\theta}, \hat{\theta}_{SSGS}) = Var(\hat{\theta})/Var(\hat{\theta}_{SSGS})$, where $\theta$ is the parameter of interest.
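The efficiency measure above can be estimated with a generic Monte Carlo routine. The following sketch is ours, not the chapter's (all names are our choices): it replicates two competing estimators and takes the ratio of their sample variances.

```python
import random

def relative_efficiency(estimator_a, estimator_b, reps=5000, seed=3):
    """Monte Carlo estimate of eff = Var(theta_hat_a) / Var(theta_hat_b).

    `estimator_a` and `estimator_b` are callables that take an RNG and
    return one estimate each; `reps` replications are used per method.
    """
    rng = random.Random(seed)
    a = [estimator_a(rng) for _ in range(reps)]
    rng = random.Random(seed + 1)
    b = [estimator_b(rng) for _ in range(reps)]

    def var(v):
        m = sum(v) / len(v)
        return sum((u - m) ** 2 for u in v) / (len(v) - 1)

    return var(a) / var(b)

# Sanity check: a mean of 4 uniform draws has one quarter the variance of
# a single draw, so the estimated efficiency should be close to 4.
eff = relative_efficiency(lambda rng: rng.random(),
                          lambda rng: sum(rng.random() for _ in range(4)) / 4)
```

With 5,000 replications the ratio is subject to a few percent of Monte Carlo error, which is why the relative efficiencies in Tables 4, 5, 6 vary slightly across rows with the same parameters.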

Example 1 (Casella and George 1992).
X and Y have the following joint distribution:

$$f(x, y) \propto \binom{m}{x} y^{x+\alpha-1}(1-y)^{m-x+\beta-1}, \quad x = 0, 1, \ldots, m,\; 0 \le y \le 1.$$

Assume our purpose is to determine certain characteristics of the marginal distribution f(x) of X. In the Gibbs sampling method, we use the conditional distributions $f(x \mid y) \sim Binomial(m, y)$ and $f(y \mid x) \sim Beta(x + \alpha, m - x + \beta)$.
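As an illustrative sketch (ours, not the chapter's code), the standard Gibbs sampler for this beta-binomial example can be written as follows; the chain length k, the number of parallel chains, and all function names are our choices.

```python
import random

def gibbs_beta_binomial(m=5, alpha=2, beta=4, k=50, n_chains=1000, seed=1):
    """Standard Gibbs sampler for the beta-binomial example.

    Each chain alternates x | y ~ Binomial(m, y) and
    y | x ~ Beta(x + alpha, m - x + beta) for k steps, and keeps the
    final (x, y) pair as one draw from the joint distribution.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_chains):
        y = rng.random()                      # arbitrary starting value in (0, 1)
        for _ in range(k):
            # x | y ~ Binomial(m, y), sampled as m Bernoulli trials
            x = sum(rng.random() < y for _ in range(m))
            # y | x ~ Beta(x + alpha, m - x + beta)
            y = rng.betavariate(x + alpha, m - x + beta)
        draws.append((x, y))
    return draws

draws = gibbs_beta_binomial()
mean_x = sum(x for x, _ in draws) / len(draws)
mean_y = sum(y for _, y in draws) / len(draws)
# For m=5, alpha=2, beta=4 the exact marginal means are
# E[X] = m*alpha/(alpha+beta) = 5/3 and E[Y] = alpha/(alpha+beta) = 1/3,
# matching the note below Table 4.
```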

Tables 4 and 5 show that, relative to the standard Gibbs sampling method, SSGS
improves the efficiency of estimating the marginal means. The amount of improve-
ment depends on two factors: (1) which parameters we intend to estimate, and (2)
the conditional distributions used in the process. Moreover, using the short or long
Gibbs sampling sequence has only a slight effect on the relative efficiency.

Improving the Efficiency of the Monte-Carlo Methods … 35

Table 4 Standard Gibbs sampling method compared with the Steady-State Gibbs Sampling (SSGS) method (Beta-Binomial distribution)

m = 5, α = 2, and β = 4
n²     k     Gibbs mean   SSGS mean   Relative         Gibbs mean   SSGS mean   Relative
             of X         of X        efficiency (X)   of Y         of Y        efficiency (Y)
100    20    1.672        1.668       3.443            0.340        0.334       3.787
100    50    1.667        1.666       3.404            0.333        0.333       3.750
100    100   1.667        1.666       3.328            0.333        0.333       3.679
400    20    1.666        1.666       3.642            0.333        0.333       3.861
400    50    1.669        1.667       3.495            0.333        0.333       3.955
400    100   1.668        1.667       3.605            0.333        0.333       4.002
2500   20    1.666        1.666       3.760            0.333        0.333       4.063
2500   50    1.668        1.667       3.786            0.333        0.333       3.991
2500   100   1.667        1.667       3.774            0.333        0.333       4.007

m = 16, α = 2, and β = 4
100    20    5.321        5.324       1.776            0.333        0.333       1.766
100    50    5.334        5.334       1.771            0.333        0.333       1.766
100    100   5.340        5.337       1.771            0.334        0.334       1.769
400    20    5.324        5.327       1.805            0.333        0.333       1.811
400    50    5.333        5.333       1.816            0.333        0.333       1.809
400    100   5.330        5.331       1.803            0.333        0.333       1.806
2500   20    5.322        5.325       1.820            0.333        0.333       1.828
2500   50    5.334        5.334       1.812            0.333        0.333       1.827
2500   100   5.334        5.333       1.798            0.333        0.333       1.820

Note The exact mean of x is equal to 5/3 and the exact mean of y is equal to 1/3 for the first case

Example 2 (Casella and George 1992).
Let X and Y have conditional distributions that are exponential distributions restricted to the interval (0, B); that is, $f(x \mid y) \propto y e^{-yx}$, $0 < x < B < \infty$, and $f(y \mid x) \propto x e^{-yx}$, $0 < y < B < \infty$.
Similarly, Table 6 shows that SSGS improves the efficiency of marginal means estimation relative to standard Gibbs sampling. Again, using a short or long Gibbs sampler sequence has only a slight effect on the relative efficiency.
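A sketch of the Gibbs sampler for this example (our code, not the chapter's), using inverse-CDF draws from the truncated exponential conditionals; the helper names and tuning constants are our choices.

```python
import math
import random

def sample_trunc_exp(rate, B, rng):
    """Inverse-CDF draw from an Exponential(rate) truncated to (0, B)."""
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-rate * B))) / rate

def gibbs_trunc_exp(B=5.0, k=50, n_chains=1000, seed=1):
    """Gibbs sampler for Example 2: the conditionals of x given y and of
    y given x are exponentials with rates y and x, each truncated to (0, B)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_chains):
        x, y = 1.0, 1.0                       # arbitrary starting values
        for _ in range(k):
            x = sample_trunc_exp(y, B, rng)   # x | y ∝ y exp(-yx) on (0, B)
            y = sample_trunc_exp(x, B, rng)   # y | x ∝ x exp(-xy) on (0, B)
        draws.append((x, y))
    return draws

draws = gibbs_trunc_exp()
mean_x = sum(x for x, _ in draws) / len(draws)
# For B = 5 the marginal mean of X is roughly 1.26, consistent with the
# sample means reported in Table 6.
```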

Example 3 (Casella and George 1992).
In this example, a generalization of the joint distribution is

$$f(x, y, m) \propto \binom{m}{x} y^{x+\alpha-1}(1-y)^{m-x+\beta-1}\, e^{-\lambda}\frac{\lambda^{m}}{m!}, \quad \lambda > 0,\; x = 0, 1, \ldots, m,\; 0 \le y \le 1,\; m = 1, 2, \ldots.$$
Again, suppose we are interested in calculating some characteristics of the marginal distribution f(x) of X. In the Gibbs sampling method, we use the conditional distributions $f(x \mid y, m) \sim Binomial(m, y)$, $f(y \mid x, m) \sim Beta(x + \alpha, m - x + \beta)$, and


Table 5 Comparison of the long Gibbs sampling method and the Steady-State Gibbs Sampling (SSGS) method (Beta-Binomial distribution)

m = 5, α = 2, and β = 4
n²     r     Gibbs mean   SSGS mean   Relative         Gibbs mean   SSGS mean   Relative
             of X         of X        efficiency (X)   of Y         of Y        efficiency (Y)
100    20    1.667        1.666       3.404            0.333        0.333       3.655
100    50    1.665        1.665       3.506            0.333        0.333       3.670
100    100   1.668        1.667       3.432            0.334        0.333       3.705
400    20    1.667        1.667       3.623            0.333        0.333       4.014
400    50    1.666        1.666       3.606            0.333        0.333       3.945
400    100   1.667        1.667       3.677            0.333        0.333       3.997
2500   20    1.667        1.666       3.814            0.333        0.333       4.011
2500   50    1.667        1.667       3.760            0.333        0.333       4.125
2500   100   1.667        1.667       3.786            0.333        0.333       4.114

m = 16, α = 2, and β = 4
100    20    5.338        5.338       1.770            0.334        0.334       1.785
100    50    5.335        5.334       1.767            0.334        0.334       1.791
100    100   5.335        5.334       1.744            0.334        0.333       1.763
400    20    5.332        5.332       1.788            0.333        0.333       1.820
400    50    5.337        5.336       1.798            0.333        0.333       1.815
400    100   5.332        5.333       1.821            0.333        0.333       1.820
2500   20    5.332        5.332       1.809            0.333        0.333       1.821
2500   50    5.335        5.335       1.832            0.333        0.333       1.827
2500   100   5.333        5.333       1.825            0.333        0.333       1.806

Note The exact mean of x is equal to 5/3 and the exact mean of y is equal to 1/3

$$f(m \mid x, y) = e^{-(1-y)\lambda}\frac{[(1-y)\lambda]^{m-x}}{(m-x)!}, \quad m = x, x+1, \ldots.$$

For this example, we used the following parameters: λ = 5, α = 2, and β = 4.
Similarly, Table 7 illustrates the improved efficiency of using SSGS for marginal means estimation, relative to standard Gibbs sampling. Again, using a short or long Gibbs sampling sequence has only a slight effect on the relative efficiency. Note that this example is a three-dimensional problem, which shows that the improved efficiency depends on the parameters under consideration.
We showed that SSGS converges in the same manner as the standard Gibbs sampling method. However, Sects. 3 and 4 indicate that SSGS is more efficient than standard Gibbs sampling for estimating the means of the marginal distributions using the same sample size. In the examples provided above, the SSGS efficiency (versus standard Gibbs) ranged from 1.77 to 6.6, depending on whether Gibbs sampling used the long or short sequence method and the type of conditional distributions used in the process. Using SSGS yielded a reduced sample size and thus reduces


Table 6 Relative efficiency of the Gibbs sampling method and the Steady-State Gibbs Sampling (SSGS) method (Exponential distribution)

Standard Gibbs algorithm, B = 5
n²     k     Gibbs mean   SSGS mean   Relative         Gibbs mean   SSGS mean   Relative
             of X         of X        efficiency (X)   of Y         of Y        efficiency (Y)
100    20    1.265        1.264       4.255            1.264        1.263       4.132
100    50    1.267        1.267       4.200            1.265        1.264       4.203
100    100   1.267        1.267       4.100            1.263        1.265       4.241
400    20    1.263        1.264       4.510            1.266        1.265       4.651
400    50    1.265        1.265       4.341            1.262        1.263       4.504
400    100   1.263        1.264       4.436            1.265        1.265       4.345
2500   20    1.264        1.264       4.461            1.264        1.264       4.639
2500   50    1.264        1.264       4.466            1.265        1.265       4.409
2500   100   1.265        1.264       4.524            1.265        1.264       4.525

Long Gibbs algorithm, B = 5
n²     r
100    20    1.265        1.265       4.305            1.264        1.265       4.349
100    50    1.267        1.264       4.254            1.261        1.264       4.129
100    100   1.264        1.265       4.340            1.265        1.265       4.272
400    20    1.265        1.264       4.342            1.263        1.264       4.543
400    50    1.265        1.265       4.434            1.265        1.265       4.446
400    100   1.266        1.265       4.387            1.264        1.264       4.375
2500   20    1.264        1.264       4.403            1.264        1.264       4.660
2500   50    1.265        1.265       4.665            1.264        1.264       4.414
2500   100   1.265        1.265       4.494            1.264        1.264       4.659
computing time. For example, if the efficiency of using SSGS is 4, then the sample size needed for estimating the simulated distribution's mean, or other distribution characteristics, when using the ordinary Gibbs sampling method is 4 times greater than when using SSGS to achieve the same accuracy and convergence rate. Additionally, our SSGS sample produces unbiased estimators, as shown by Theorem 4.1. Moreover, the bivariate steady-state simulation depends on a simulated sample of size n² to produce an unbiased estimate. However, in a k-dimensional problem, the multivariate steady-state simulation depends on a simulated sample of size n^k to produce an unbiased estimate. Clearly, this sample size is not practical and will increase the simulated sample size required. To overcome this problem in high-dimensional cases, we can use the independent simulation method described by Samawi (1999), which needs only a simulated sample of size n regardless of the number of dimensions. This approach slightly reduces the efficiency of using steady-state simulation. In conclusion, SSGS


Table 7 Standard and long Gibbs sampling algorithms compared with the Steady-State Gibbs Sampling (SSGS) method (Beta-Binomial and Poisson distribution), λ = 5, α = 2, and β = 4
[Table body not reliably recoverable from this copy: for n² = 100, 400, and 2500 and sequence lengths k (or r) = 20, 50, and 100, the columns report the sample means of X, Y, and Z under Gibbs sampling and SSGS, together with the corresponding relative efficiencies.]


performs at least as well as standard Gibbs sampling and offers greater accuracy. Thus, we recommend using SSGS whenever a Gibbs sampling procedure is needed. Further investigation is needed to explore additional applications and more options for using the SSGS approach.

References

Al-Saleh, M. F., & Al-Omari, A. I. (1999). Multistage ranked set sampling. Journal of Statistical Planning and Inference, 102(2), 273–286.
Al-Saleh, M. F., & Samawi, H. M. (2000). On the efficiency of Monte-Carlo methods using steady state ranked simulated samples. Communication in Statistics - Simulation and Computation, 29(3), 941–954. doi:10.1080/03610910008813647.
Al-Saleh, M. F., & Zheng, G. (2002). Estimation of multiple characteristics using ranked set sampling. Australian & New Zealand Journal of Statistics, 44, 221–232. doi:10.1111/1467-842X.00224.
Al-Saleh, M. F., & Zheng, G. (2003). Controlled sampling using ranked set sampling. Journal of Nonparametric Statistics, 15, 505–516. doi:10.1080/10485250310001604640.
Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. American Statistician, 46(3), 167–174.
Chib, S., & Greenberg, E. (1994). Bayes inference for regression models with ARMA(p, q) errors. Journal of Econometrics, 64, 183–206. doi:10.1016/0304-4076(94)90063-9.
David, H. A. (1981). Order statistics (2nd ed.). New York, NY: Wiley.
Evans, M., & Swartz, T. (1995). Methods for approximating integrals in statistics with special emphasis on Bayesian integration problems. Statistical Science, 10(2), 254–272. doi:10.1214/ss/1177009938.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409. doi:10.1080/01621459.1990.10476213.
Gelman, A., & Rubin, D. (1991). An overview and approach to inference from iterative simulation [Technical report]. Berkeley, CA: University of California-Berkeley, Department of Statistics.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741. doi:10.1109/TPAMI.1984.4767596.
Hammersley, J. M., & Handscomb, D. C. (1964). Monte-Carlo methods. London, UK: Chapman & Hall. doi:10.1007/978-94-009-5819-7.
Hastings, W. K. (1970). Monte-Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. doi:10.1093/biomet/57.1.97.
Johnson, M. E. (1987). Multivariate statistical simulation. New York, NY: Wiley. doi:10.1002/9781118150740.
Johnson, N. L., & Kotz, S. (1972). Distribution in statistics: Continuous multivariate distributions. New York, NY: Wiley.
Liu, J. S. (2001). Monte-Carlo strategies in scientific computing. New York, NY: Springer.
McIntyre, G. A. (1952). A method of unbiased selective sampling, using ranked sets. Australian Journal of Agricultural Research, 3, 385–390. doi:10.1071/AR9520385.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equations of state calculations by fast computing machine. Journal of Chemical Physics, 21, 1087–1091. doi:10.1063/1.1699114.
Morgan, B. J. T. (1984). Elements of simulation. London, UK: Chapman & Hall. doi:10.1007/978-1-4899-3282-2.
Plackett, R. L. (1965). A class of bivariate distributions. Journal of the American Statistical Association, 60, 516–522. doi:10.1080/01621459.1965.10480807.
Robert, C., & Casella, G. (2004). Monte-Carlo statistical methods (2nd ed.). New York, NY: Springer. doi:10.1007/978-1-4757-4145-2.
Roberts, G. O. (1995). Markov chain concepts related to sampling algorithms. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov Chain Monte-Carlo in practice (pp. 45–57). London, UK: Chapman & Hall.
Samawi, H. M. (1999). More efficient Monte-Carlo methods obtained by using ranked set simulated samples. Communication in Statistics - Simulation and Computation, 28, 699–713. doi:10.1080/03610919908813573.
Samawi, H. M., & Al-Saleh, M. F. (2007). On the approximation of multiple integrals using multivariate ranked simulated sampling. Applied Mathematics and Computation, 188, 345–352. doi:10.1016/j.amc.2006.09.121.
Samawi, H. M., Dunbar, M., & Chen, D. (2012). Steady state ranked Gibbs sampler. Journal of Statistical Simulation and Computation, 82, 1223–1238. doi:10.1080/00949655.2011.575378.
Samawi, H. M., & Vogel, R. (2013). More efficient approximation of multiple integration using steady state ranked simulated sampling. Communications in Statistics - Simulation and Computation, 42, 370–381. doi:10.1080/03610918.2011.636856.
Shreider, Y. A. (1966). The Monte-Carlo method. Oxford, UK: Pergamon Press.
Stokes, S. L. (1977). Ranked set sampling with concomitant variables. Communications in Statistics - Theory and Methods, 6, 1207–1211. doi:10.1080/03610927708827563.
Tanner, M. A. (1993). Tools for statistical inference (2nd ed.). New York, NY: Wiley. doi:10.1007/978-1-4684-0192-9.
Tanner, M. A., & Wong, W. (1987). The calculation of posterior distribution by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528–550. doi:10.1080/01621459.1987.10478458.
Tierney, L. (1995). Introduction to general state-space Markov chain theory. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov Chain Monte-Carlo in practice (pp. 59–74). London, UK: Chapman & Hall.
Normal and Non-normal Data Simulations
for the Evaluation of Two-Sample Location
Tests

Jessica R. Hoag and Chia-Ling Kuo

Abstract Two-sample location tests refer to the family of statistical tests that compare two independent distributions via measures of central tendency, most commonly means or medians. The t-test is the most recognized parametric option for two-sample mean comparisons. The pooled t-test assumes the two population variances are equal. Under circumstances where the two population variances are unequal, Welch's t-test is a more appropriate test. Both of these t-tests require data to be normally distributed. If the normality assumption is violated, a non-parametric alternative such as the Wilcoxon rank-sum test has the potential to maintain adequate type I error and appreciable power. While sometimes considered controversial, pretesting for normality followed by the F-test for equality of variances may be applied before selecting a two-sample location test. This option results in multi-stage tests as another alternative for two-sample location comparisons, starting with a normality test, followed by either Welch's t-test or the Wilcoxon rank-sum test. Less commonly utilized alternatives for two-sample location comparisons include permutation tests, which evaluate statistical significance based on empirical distributions of test statistics. Overall, a variety of statistical tests are available for two-sample location comparisons. Which tests demonstrate the best performance in terms of type I error and power depends on variations in data distribution, population variance, and sample size. One way to evaluate these tests is to simulate data that mimic what might be encountered in practice. In this chapter, the use of Monte Carlo techniques is demonstrated to simulate normal and non-normal data for the evaluation of two-sample location tests.

J.R. Hoag · C.-L. Kuo (B)


Department of Community Medicine and Health Care, Connecticut Institute
for Clinical and Translational Science, University of Connecticut Health Center,
Farmington, USA
e-mail: kuo@uchc.edu
J.R. Hoag
e-mail: hoag@uchc.edu

© Springer Nature Singapore Pte Ltd. 2017 41


D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical
Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_3

42 J.R. Hoag and C.-L. Kuo

1 Introduction

Statistical tests for two-sample location comparison include t-tests, the Wilcoxon rank-sum test, permutation tests, and multi-stage tests. The t-test is the most recognized parametric test for two-sample mean comparison. It assumes independent identically distributed (i.i.d.) samples from two normally distributed populations. The pooled t-test (also called Student's t-test) additionally assumes the equality of variances, while Welch's t-test does not. Other tests such as the Wilcoxon rank-sum test (also called the Mann-Whitney test) and Fisher-Pitman permutation tests make no such distributional assumptions and thus are theoretically robust against non-normal distributions. Multi-stage tests that comprise preliminary tests for normality and/or equality of variance before a two-sample location comparison test is attempted may also be used as alternatives.
When the normal assumption is met, t-tests are optimal. The pooled t-test is also robust to the equal variance assumption, but only when the sample sizes are equal (Zimmerman 2004). Welch's t-test does not alter the type I error rate regardless of unequal sample sizes and performs as well as the pooled t-test when the equality of variances holds (Zimmerman 2004; Kohr and Games 1974), but it can result in substantial type II error under heavily non-normal distributions and for small sample sizes, e.g., Beasley et al. (2009). Under circumstances of equal variance with either equal or unequal sample sizes less than or equal to 5, De Winter recommended the pooled t-test over Welch's t-test, provided an adequately large population effect size, due to a significant loss in statistical power associated with Welch's t-test (de Winter 2013). The power loss associated with Welch's t-test in this scenario is attributed to its lower degrees of freedom compared to the pooled t-test (de Winter 2013).
In samples with known non-normal distributions, the Wilcoxon rank-sum test is far superior to parametric tests, with power advantages actually increasing with increasing sample size (Sawilowsky 2005). The Wilcoxon rank-sum test, however, is not a solution to heterogeneous variance. Simply, heterogeneous variance is mitigated but does not disappear when actual data are converted to ranks (Zimmerman 1996). In addition to the Wilcoxon rank-sum test, permutation tests have been recommended as a supplement to t-tests across a range of conditions, with emphasis on samples with non-normal distributions. Boik (1987), however, found that the Fisher-Pitman permutation test (Ernst et al. 2004) is no more robust than the t-test in terms of type I error rate under circumstances of heterogeneous variance.
A pretest for equal variance followed by the pooled t-test or Welch's t-test, however appealing, fails to protect the type I error rate (Zimmerman 2004; Rasch et al. 2011; Schucany and Tony Ng 2006). With the application of a pre-test for normality (e.g., Shapiro-Wilk; Royston 1982), however, it remains difficult to fully assess the normality assumption; acceptance of the null hypothesis is predicated only on insufficient evidence to reject it (Schucany and Tony Ng 2006). A three-stage procedure which also includes a test for equal variance (e.g., the F-test or Levene's test) following a normality test is also commonly applied in practice, but as Rasch and colleagues have pointed out, pretesting biases type I and type II conditional error rates, and may be altogether unnecessary (Rochon et al. 2012; Rasch et al. 2011). Current recommendations suggest that pretesting is largely unnecessary and that the t-test should be replaced with Welch's t-test in general practice because it is robust against heterogeneous variance (Rasch et al. 2011; Welch 1938).
In summary, the choice of a test for two-sample location comparison depends on variations in data distribution, population variance, and sample size. Previous studies suggested Welch's t-test for normal data and the Wilcoxon rank-sum test for non-normal data without heterogeneous variance (Ruxton 2006). Some recommended Welch's t-test for general use (Rasch et al. 2011; Zimmerman 1998). However, Welch's t-test is not a powerful test when the data are extremely non-normal or the sample size is small (Beasley et al. 2009). Previous Monte Carlo simulations tuned the parameters ad hoc and compared a limited selection of two-sample location tests. The simulation settings were simplified by studying either non-normal data only, unequal variances only, small sample sizes only, or unequal sample sizes only.
In this chapter, simulation experiments are designed to mimic what is often encountered in practice. Sample sizes are calculated for a broad range of power based on the pooled t-test, i.e., assuming normally distributed data with equal variance. A medium effect size is assumed (Cohen 2013), as well as multiple sample size ratios. Although the sample sizes are determined assuming normal data with equal variance, the simulations consider normal and moderately non-normal data and allow for heterogeneous variance. The simulated data are used to compare two-sample location tests including parametric tests, non-parametric tests, and permutation-based tests. The goal of this chapter is to provide insight into these tests and how Monte Carlo simulation techniques can be applied to demonstrate their evaluation.

2 Statistical Tests
Capital letters are used to represent random variables and lowercase letters represent realized values. Additionally, vectors are presented in bold. Assume $\mathbf{x} = (x_1, x_2, \ldots, x_{n_1})$ and $\mathbf{y} = (y_1, y_2, \ldots, y_{n_2})$ are two i.i.d. samples from two independent populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$. Denote by $\bar{x}$ and $\bar{y}$ the sample means and $s_1^2$ and $s_2^2$ the sample variances. Let $\mathbf{z} = (z_1, \ldots, z_{n_1}, z_{n_1+1}, \ldots, z_{n_1+n_2})$ represent a vector of group labels with $z_i = 1$ associated with $x_i$ for $i = 1, 2, \ldots, n_1$ and $z_{n_1+j} = 2$ associated with $y_j$ for $j = 1, 2, \ldots, n_2$. The combined $x$'s and $y$'s are ranked from largest to smallest. For tied observations, the average rank is assigned and each observation is still treated uniquely. Denote by $r_i$ the rank associated with $x_i$ and $s_i$ the rank associated with $y_i$. The sum of ranks in the $x$'s is given by $R = \sum_{i=1}^{n_1} r_i$ and by $S = \sum_{i=1}^{n_2} s_i$ for the sum of ranks in the $y$'s.
Next, the two-sample location tests considered in the simulations are reviewed. The performance of these tests is evaluated by type I error and power.


2.1 t-Test

The pooled t-test and Welch's t-test assume that each sample is randomly sampled from a population that is approximately normally distributed. The pooled t-test further assumes that the two variances are equal ($\sigma_1^2 = \sigma_2^2$). The test statistic for the null hypothesis that the two population means are equal ($H_0: \mu_1 = \mu_2$) is given by

$$t_{pooled} = \frac{\bar{x} - \bar{y}}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad (1)$$

where $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$. When the null hypothesis is true and $n_1$ and $n_2$ are sufficiently large, $t_{pooled}$ follows a $t$-distribution with $n_1 + n_2 - 2$ degrees of freedom.
edom
Welch's t-test allows for unequal variances and tests against the null hypothesis by

$$t_{welch} = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}. \qquad (2)$$

The asymptotic distribution of $t_{welch}$ is approximated by the $t$-distribution with degrees of freedom

$$v = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)}}. \qquad (3)$$

Alternatively, the F-test based on the test statistic

$$F = \frac{s_1^2}{s_2^2}, \qquad (4)$$

can be applied to test the equality of variances. If the null hypothesis is rejected, Welch's t-test is preferred; otherwise, the pooled t-test is used.
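A dependency-free sketch of Eqs. (1)-(3) (our code, not the chapter's; the function name is our choice):

```python
import math

def t_tests(x, y):
    """Pooled and Welch two-sample t statistics, following Eqs. (1)-(3)."""
    n1, n2 = len(x), len(y)
    mx, my = sum(x) / n1, sum(y) / n2
    s12 = sum((v - mx) ** 2 for v in x) / (n1 - 1)   # sample variances
    s22 = sum((v - my) ** 2 for v in y) / (n2 - 1)
    # Eq. (1): pooled t with pooled variance s_p^2
    sp2 = ((n1 - 1) * s12 + (n2 - 1) * s22) / (n1 + n2 - 2)
    t_pooled = (mx - my) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    # Eq. (2): Welch's t
    se2 = s12 / n1 + s22 / n2
    t_welch = (mx - my) / math.sqrt(se2)
    # Eq. (3): Welch-Satterthwaite degrees of freedom
    v = se2 ** 2 / (s12 ** 2 / (n1 ** 2 * (n1 - 1)) + s22 ** 2 / (n2 ** 2 * (n2 - 1)))
    return t_pooled, t_welch, v
```

With equal sample sizes and equal sample variances the two statistics coincide and the Welch degrees of freedom reduce to $n_1 + n_2 - 2$, which is one way to check an implementation.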

2.2 Wilcoxon Rank-Sum Test

The Wilcoxon rank-sum test is essentially a rank-based test. It uses the rank data to compare two distributions. The test statistic to test against the null hypothesis that the two distributions are the same is given by

$$U = n_1 n_2 + [n_2(n_2 + 1)]/2 - R. \qquad (5)$$

Under the null hypothesis, $U$ is asymptotically normally distributed with mean $n_1 n_2/2$ and variance $n_1 n_2(n_1 + n_2 + 1)/12$.
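A sketch of the statistic and its normal approximation (our code; it follows Eq. (5) as printed, ranks in ascending order for simplicity, and applies no continuity or tie correction):

```python
import math

def wilcoxon_rank_sum(x, y):
    """Rank-sum U statistic (Eq. 5) with its normal-approximation z-score."""
    n1, n2 = len(x), len(y)
    combined = sorted(x + y)
    # average rank for each distinct value (handles ties)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    R = sum(ranks[v] for v in x)               # rank sum of the x sample
    U = n1 * n2 + n2 * (n2 + 1) / 2 - R        # Eq. (5) as given in the text
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 * (n1 + n2 + 1) / 12
    z = (U - mean_u) / math.sqrt(var_u)
    return U, z
```

For completely separated samples such as x = (1, 2, 3) and y = (4, 5, 6), R = 6 and U = 9, the largest value attainable here, so the z-score is at its most extreme.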


2.3 Two-Stage Test

The choice between the t-test and the Wilcoxon rank-sum test can be based on whether the normality assumption is met for both samples. Prior to performing the mean comparison test, the Shapiro-Wilk test (Royston 1982) can be used to evaluate normality. If both p-values are greater than the significance level α, the t-test is used; otherwise, the Wilcoxon rank-sum test is used. The chosen t-test here is Welch's t-test for its robustness against heterogeneous variance. Since the two normality tests for each sample are independent, the overall type I error rate (i.e., the family-wise error rate) that at least one hypothesis is incorrectly rejected is thus controlled at 2α. When α = 2.5%, this results in the typical value of 5% for the family-wise error rate.
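The two-stage logic can be sketched as follows (our code; the three `*_pvalue` arguments are caller-supplied functions, e.g. `scipy.stats.shapiro`, `scipy.stats.ttest_ind` with `equal_var=False`, and `scipy.stats.mannwhitneyu` — they are kept abstract here so the sketch stays dependency-free):

```python
def two_stage_test(x, y, normality_pvalue, welch_pvalue, wilcoxon_pvalue,
                   alpha=0.025):
    """Two-stage procedure: pretest each sample for normality at level
    alpha; if both samples pass, run Welch's t-test, otherwise the
    Wilcoxon rank-sum test. Returns the chosen test and its p-value."""
    if normality_pvalue(x) > alpha and normality_pvalue(y) > alpha:
        return "welch", welch_pvalue(x, y)
    return "wilcoxon", wilcoxon_pvalue(x, y)
```

The default `alpha=0.025` per sample corresponds to the 2α = 5% family-wise rate discussed above.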

2.4 Permutation Test

Let the group labels in z be shuffled B times. Each shuffle produces a new vector of
group labels, z_k = (z_k1, ..., z_kn1, z_k(n1+1), ..., z_k(n1+n2)), for k = 1, 2, ..., B, where z_kj
is the new group label for x_j if j ≤ n1 and for y_(j−n1) if j > n1. Given the k-th per-
muted data set, x_k and y_k, where x_k = {x_j : z_kj = 1, j = 1, 2, ..., n1} ∪ {y_j :
z_k(n1+j) = 1, j = 1, 2, ..., n2} and y_k = {x_j : z_kj = 2, j = 1, 2, ..., n1} ∪ {y_j :
z_k(n1+j) = 2, j = 1, 2, ..., n2}, Welch's t-test and the Wilcoxon rank-sum test are
applied and the test statistics and p-values are calculated. Denote by s_k^t and s_k^w the
test statistics and p_k^t and p_k^w the p-values at the k-th permutation associated with
Welch's t-test and the Wilcoxon rank-sum test, respectively. Similarly, denote by s_o^t and
s_o^w the observed test statistics and p_o^t and p_o^w the observed p-values, calculated using
the observed data, x and y. Define m_k(p_k^t, p_k^w) as the minimum of p_k^t and p_k^w. The
permutation p-values of Welch's t-test and the Wilcoxon rank-sum test are given by

p_p.welch = (2/B) min( Σ_{k=1}^{B} I(s_k^t ≤ s_o^t), Σ_{k=1}^{B} I(s_k^t > s_o^t) ),   (6)

and

p_p.wilcox = (2/B) min( Σ_{k=1}^{B} I(s_k^w ≤ s_o^w), Σ_{k=1}^{B} I(s_k^w > s_o^w) ),   (7)

where I(·) = 1 if the condition in the parentheses is true; otherwise, I(·) = 0. Sim-
ilarly, the p-value associated with the minimum p-value is given by

p_minp = (1/B) Σ_{k=1}^{B} I( m_k(p_k^t, p_k^w) ≤ m_o(p_o^t, p_o^w) ).   (8)
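Equations (6)-(8) translate directly into a resampling loop. The sketch below is an illustrative Python version using SciPy for the two component tests (the chapter's own code was R); the function name `perm_pvalues` is ours:

```python
import random
from scipy import stats

def perm_pvalues(x, y, B=2000, seed=1):
    """Permutation p-values of Eqs. (6)-(8): p.welch, p.wilcox, and minp."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n1 = len(x)
    s_ot, p_ot = stats.ttest_ind(x, y, equal_var=False)
    s_ow, p_ow = stats.mannwhitneyu(x, y, alternative="two-sided")
    m_o = min(p_ot, p_ow)
    lo_t = lo_w = m_le = 0
    for _ in range(B):
        rng.shuffle(pooled)  # one random relabelling of the pooled sample
        xk, yk = pooled[:n1], pooled[n1:]
        s_kt, p_kt = stats.ttest_ind(xk, yk, equal_var=False)
        s_kw, p_kw = stats.mannwhitneyu(xk, yk, alternative="two-sided")
        lo_t += s_kt <= s_ot
        lo_w += s_kw <= s_ow
        m_le += min(p_kt, p_kw) <= m_o
    p_welch = 2 / B * min(lo_t, B - lo_t)   # Eq. (6)
    p_wilcox = 2 / B * min(lo_w, B - lo_w)  # Eq. (7)
    p_minp = m_le / B                       # Eq. (8)
    return p_welch, p_wilcox, p_minp
```

Since I(s ≤ s_o) and I(s > s_o) partition each permutation, the second sum in Eqs. (6) and (7) is simply B minus the first, which the code exploits.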

46 J.R. Hoag and C.-L. Kuo

Overall, 10 two-sample location tests are compared, including t-tests, Wilcoxon
rank-sum tests, two-stage tests, and permutation tests. For convenience, when pre-
senting results, each test is referenced by an abbreviated notation, shown below in
parentheses.
• t-tests: pooled t-test (pooled), Welch's t-test (welch), permutation Welch's t-test
(p.welch), robust t-test (robust.t, F-test for the equality of variances followed by
pooled t-test or Welch's t-test)
• Wilcoxon rank-sum tests: Wilcoxon rank-sum test (wilcox), permutation Wilcoxon
rank-sum test (p.wilcox)
• two-stage tests: two-stage test with the first-stage α level at 0.5%, 2.5%, and 5%
(2stage0.5, 2stage2.5, 2stage5)
• minimum p-value: minimum p-value of permutation Welch's t-test and Wilcoxon
rank-sum test (minp)

3 Simulations

Simulated data were used to test the null hypothesis H0 : µ1 = µ2 versus the
alternative hypothesis H1 : µ1 ≠ µ2. Assume x = {x1, ..., xn1} was simulated from
N(µ1, σ1^2) and y = {y1, ..., yn2} was simulated from N(µ2, σ2^2). Without loss of gen-
erality, let σ1 and σ2 be set equal to 1. Let µ1 be zero, and let µ2 be 0 under the null hypothesis
and 0.5 under the alternative hypothesis, the conventional value suggested by Cohen
(2013) for a medium effect size. n1 is set equal to or double the size of n2. n1 and
n2 were chosen to detect the difference between µ1 and µ2 with 20, 40, 60, or 80%
power at the 5% significance level. When n1 = n2, the sample size required per group
was 11, 25, 41, and 64, respectively. When n1 = 2n2, n1 = 18, 38, 62, 96 and n2 was
half of n1 in each setting. The same sample sizes were used for the null, non-normal,
and heteroscedastic settings. Power analysis was conducted using G*Power
software (Faul et al. 2007).
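The per-group sample sizes quoted above can be roughly reproduced with the standard normal-approximation power formula. The sketch below uses only the Python standard library; G*Power works with the noncentral t distribution, so its values (11, 25, 41, 64) run up to one observation higher than the normal approximation. The helper name `n_per_group` is ours:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, power, alpha=0.05):
    """Normal-approximation per-group n for a two-sample comparison of means
    with unit variances and standardized effect size delta (Cohen's d)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_a + z_b) / delta) ** 2)

sizes = [n_per_group(0.5, p) for p in (0.2, 0.4, 0.6, 0.8)]
print(sizes)  # [11, 24, 40, 63] -- versus 11, 25, 41, 64 from G*Power
```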
Fleishman’s power method to simulate normal and non-normal data is based on
a polynomial function given by

x = f (ω) = a + b ω + c ω2 + d ω3 , (9)

where ω is a random value from the standard normal distribution with mean 0 and standard
deviation 1. The coefficients a, b, c, and d are determined by the first four moments
of X, with the first two moments set to 0 and 1. For distributions with mean and
standard deviation different from 0 and 1, the data can be shifted and/or rescaled
after being simulated. Let γ3 denote the skewness and γ4 denote the kurtosis. γ3 and
γ4 are both set to 0 if X is normally distributed. The distribution is left-skewed if
γ3 < 0 and right-skewed if γ3 > 0. γ4 is smaller than 0 for a platykurtic distribution
and greater than 0 for a leptokurtic distribution. By the 12 moments of the standard

Normal and Non-normal Data Simulations for the Evaluation … 47

normal distribution, we can derive the equations below and solve for a, b, c, and d via
the Newton-Raphson method or any other non-linear root-finding method,

a = −c                                                                    (10)
b^2 + 6bd + 2c^2 + 15d^2 − 1 = 0                                          (11)
2c(b^2 + 24bd + 105d^2 + 2) − γ3 = 0                                      (12)
24[bd + c^2(1 + b^2 + 28bd) + d^2(12 + 48bd + 141c^2 + 225d^2)] − γ4 = 0  (13)
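A small Newton-Raphson sketch for the system (10)-(13), using a forward-difference Jacobian and only the Python standard library. It is illustrative only; the chapter itself used the R function "Fleishman.coef.NN", and the helper names below are ours:

```python
def solve3(A, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [b] for row, b in zip(A, rhs)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def fleishman_coefficients(skew, kurt):
    """Newton-Raphson for Eqs. (11)-(13); Eq. (10) then gives a = -c."""
    def f(v):
        b, c, d = v
        return [b*b + 6*b*d + 2*c*c + 15*d*d - 1,
                2*c*(b*b + 24*b*d + 105*d*d + 2) - skew,
                24*(b*d + c*c*(1 + b*b + 28*b*d)
                    + d*d*(12 + 48*b*d + 141*c*c + 225*d*d)) - kurt]
    v, h = [1.0, 0.0, 0.0], 1e-7  # start at the normal solution b=1, c=d=0
    for _ in range(100):
        fv = f(v)
        if max(abs(e) for e in fv) < 1e-12:
            break
        # forward-difference Jacobian A[r][i] = d f_r / d v_i
        A = [[0.0] * 3 for _ in range(3)]
        for i in range(3):
            vp = list(v)
            vp[i] += h
            fp = f(vp)
            for r in range(3):
                A[r][i] = (fp[r] - fv[r]) / h
        step = solve3(A, [-e for e in fv])
        v = [vi + si for vi, si in zip(v, step)]
    b, c, d = v
    return -c, b, c, d  # a, b, c, d

# One of the chapter's settings: skewness 0.8, kurtosis 0. Data then follow
# Eq. (9): x = a + b*w + c*w**2 + d*w**3 with w ~ N(0, 1).
import random
a, b, c, d = fleishman_coefficients(0.8, 0.0)
rng = random.Random(1)
xs = [a + b*w + c*w*w + d*w**3 for w in (rng.gauss(0, 1) for _ in range(10_000))]
```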

One limitation of Fleishman's power method is that it does not cover the entire
domain of skewness and kurtosis. Given γ3, the relationship between γ3 and γ4 is
described by the inequality γ4 ≥ γ3^2 − 2 (Devroye 1986). Precisely, the empirical
lower bound of γ4 given γ3 = 0 is −1.151320 (Headrick and Sawilowsky 2000).
By Fleishman's power method, three conditions were investigated: (1) heteroge-
neous variance, (2) skewness, and (3) kurtosis. Each was investigated using equal
and unequal sample sizes to achieve a broad range of power. µ1 = µ2 = 0 under
the null hypothesis; µ1 = 0 and µ2 = 0.5 when the alternative hypothesis is true.
Other parameters were manipulated differently. For (1), normal data was simulated
with equal and unequal variances by letting σ1 = 1 and σ2 = 0.5, 1, 1.5. For (2),
skewed data was simulated assuming equal variance at 1, equal kurtosis at 0, and
equal skewness at 0, 0.4, 0.8. Similarly, for (3), kurtotic data was simulated assuming
equal variance at 1, equal skewness at 0, and equal kurtosis at 0, 5, 10. To visualize
the distributions from which the data were simulated, 10^6 data points were simu-
lated from the null distributions and used to create density plots (see Fig. 1). To best
visualize the distributions, the data range was truncated at −5 and 5. The left panel
Fig. 1 Distributions used to simulate heteroscedastic, skewed, and kurtotic data when the null hypothesis
of equal population means is true (left: normal distributions with means at 0 and standard deviations
at 0.5, 1, and 1.5; middle: distributions with means at 0, standard deviations at 1, skewness at 0, 0.4,
and 0.8, and kurtosis at 0; right: distributions with means at 0, standard deviations at 1, skewness
at 0, and kurtosis at 0, 5, and 10)


demonstrates the distributions for the investigation of heterogeneous variance. One
sample is simulated from the standard normal (green). The other sample is simulated
from the normal with mean 0 and standard deviation 0.5 (red) or 1.5 (blue). For
skewness and kurtosis, the two distributions were assumed to be exactly the same
under the null hypothesis. Although these distributions are for null simulations, the
distributions for power simulations are the same for one sample and shifted from 0
to 0.5 for the second sample.
The number of simulation replicates was 10,000 for null simulations and 1,000 for
power simulations. The number of permutation replicates for the
permutation tests at each simulation replicate was 2,000, i.e. B = 2000. The signif-
icance level for the two-sample location comparison was set to 0.05. All the simulations were
carried out in R 3.1.2 (Team 2014). The Fleishman coefficients (a, b, c, and d) given
the first four moments (the first two moments 0 and 1) were derived via the R function
"Fleishman.coef.NN" (Demirtas et al. 2012).

4 Results

The simulation results are presented in figures. In each figure, the type I error or
power is presented on the y-axis, and the effective sample size per group to achieve
20%, 40%, 60%, and 80% power (assuming the pooled t-test is appropriate) is
presented on the x-axis. The effective sample size per group for two-sample mean comparison
is defined as

n_e = 2 n1 n2 / (n1 + n2).   (14)
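For the unequal-allocation settings of Sect. 3 (n1 = 2n2), Eq. (14) yields effective sizes close to the equal-allocation sample sizes of 11, 25, 41, and 64, which is easy to verify (the helper name `effective_n` is ours):

```python
def effective_n(n1, n2):
    """Effective per-group sample size of Eq. (14) (harmonic-mean form)."""
    return 2 * n1 * n2 / (n1 + n2)

print([round(effective_n(n, n // 2), 1) for n in (18, 38, 62, 96)])
# [12.0, 25.3, 41.3, 64.0]
```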
Each test is represented by a colored symbol. Given an effective sample size, the
results of
1. t-tests (pooled t-test, theoretical and permutation Welch's t-tests, robust t-test)
2. Wilcoxon rank-sum tests (theoretical and permutation Wilcoxon rank-sum tests)
and the minimum p-value (minimum p-value of permutation Welch's t-test and
Wilcoxon rank-sum test)
3. two-stage tests (normality test with the α level at 0.5, 2.5, and 5% for both samples
followed by Welch's t-test or Wilcoxon rank-sum test)
are aligned in three columns from left to right. The results for n1 = n2 are in the left
panels and those for n1 = 2n2 are in the right panels. In the figures that present type
I error results, two horizontal lines y = 0.0457 and y = 0.0543 are added to judge
whether the type I error is correct. The type I error is considered correct if it falls
within the 95% confidence interval of the 5% significance level (0.0457, 0.0543).
We use valid to describe tests that maintain a correct type I error, and liberal and
conservative for tests that result in a type I error above or under the nominal level,
respectively.
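The band (0.0457, 0.0543) is simply the normal-approximation 95% interval for an empirical rejection rate of 5% estimated from the 10,000 null replicates, which a few lines of Python confirm:

```python
# 95% band for an empirical type I error rate at a nominal 5% level,
# estimated from 10,000 null simulation replicates (normal approximation).
p, n, z = 0.05, 10_000, 1.959964
half_width = z * (p * (1 - p) / n) ** 0.5
lo, hi = p - half_width, p + half_width
print(round(lo, 4), round(hi, 4))  # 0.0457 0.0543
```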



Fig. 2 Null simulation
simulation results for normal data with equal or uneq
unequal
ual variances.
variances. y = 0 .0457 and
y = 0 .0543 are added to judge whether the type I error is correct. The type I error is considered
correct if it falls within the 95% confidence interval of the 5% significance level (0.0457, 0.0543).
The tests are grouped into three groups from left to right for result presentation, (1) t -tests: pooled
t -test (pooled
(pooled),), theoretical and permutation Welch’s t -tests (welch
(welch and p.welch
p.welch),), robust t -test
(robust.t ), (2) Wilcoxon rank-sum test (wilcox
robust.t), (wilcox),
), permutation Wilcoxon rank-sum test (p.wilcox
( p.wilcox))
and the minimum p -value (minp(minp),), (3) two-stage tests: two-stage tests with the first stage α level at
(2stage0.5,, 2stage2.5
0.5, 2.5, 5% (2stage0.5 2stage2.5,, 2stage5
2stage5))

4.1 Heterogeneous Variance

The null simulation results are presented in Fig. 2. All tests maintain a correct type
I error when the sample sizes are equal. Wilcoxon rank-sum tests and the minimum
p-value have a slightly inflated type I error when the variances are unequal and
relatively small (σ1 = 1 and σ2 = 0.5). When the sample sizes are unequal, all tests
are valid as long as the variances are equal. When the variances are unequal, however,
neither the pooled t-test nor the Wilcoxon rank-sum test maintains a correct type I



Fig. 3 Power simulation results for normal data with equal or unequal variances. The tests are
grouped into three groups from left to right for result presentation, (1) t-tests: pooled t-test
(pooled), theoretical and permutation Welch's t-tests (welch and p.welch), robust t-test (robust.t),
(2) Wilcoxon rank-sum test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox) and the min-
imum p-value (minp), (3) two-stage tests: two-stage tests with the first stage α level at 0.5, 2.5, 5%
(2stage0.5, 2stage2.5, 2stage5)

error. Both exhibit a conservative type I error when σ1 = 1 and σ2 = 0.5 and a
liberal type I error when σ1 = 1 and σ2 = 1.5. The pooled t-test in particular is
more strongly influenced than the Wilcoxon rank-sum test. The minimum p-value
involves the Wilcoxon rank-sum test and is thus similarly affected, but not as severely as
the Wilcoxon rank-sum test alone. The robust t-test and permutation Welch's t-test
behave similarly when the sample size is small. Welch's t-test and the two-stage tests
are the only tests with type I error protected under heterogeneous variance regardless
of sample sizes.
The power simulation results are presented in Fig. 3. Unsurprisingly, the settings of
σ1 = 1, σ2 = 0.5 result in higher power than σ1 = 1, σ2 = 1, and than σ1 = 1, σ2 =



Fig. 4 Null simulation results for skewed data. y = 0.0457 and y = 0.0543 are added to judge
whether the type I error is correct. The type I error is considered correct if it falls within the 95%
confidence interval of the 5% significance level (0.0457, 0.0543). The tests are grouped into three
groups from left to right for result presentation, (1) t-tests: pooled t-test (pooled), theoretical and
permutation Welch's t-tests (welch and p.welch), robust t-test (robust.t), (2) Wilcoxon rank-sum
test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox) and the minimum p-value (minp),
(3) two-stage tests: two-stage tests with the first stage α level at 0.5, 2.5, 5% (2stage0.5, 2stage2.5,
2stage5)

1.5. All the valid tests have similar power. Tests with an inflated type I error tend to
be slightly more powerful. In contrast, tests that have a conservative type I error tend
to be slightly less powerful.



Fig. 5 Power simulation results for skewed data. The tests are grouped into three groups from
left to right for result presentation, (1) t-tests: pooled t-test (pooled), theoretical and permutation
Welch's t-tests (welch and p.welch), robust t-test (robust.t), (2) Wilcoxon rank-sum test (wilcox),
permutation Wilcoxon rank-sum test (p.wilcox) and the minimum p-value (minp), (3) two-stage
tests: two-stage tests with the first stage α level at 0.5, 2.5, 5% (2stage0.5, 2stage2.5, 2stage5)

4.2 Skewness

The two sample size ratios result in similar patterns of type I error and power. All tests
maintain a correct type I error across the spectrum of sample size and skewness except
2stage0.5, which has an inflated type I error when the sample size is insufficiently
large and the skewness is 0.8 (Fig. 4). A similar pattern for power is observed within
groups (three groups in three columns). With an increase in skewness, the power of t-
tests (pooled t-test, theoretical and permutation Welch's t-tests, robust t-test) remains
at the level as planned and is slightly lower than that of the other tests when the skewness reaches
0.8 (Fig. 5).



Fig. 6 Null simulation results for kurtotic data. y = 0.0457 and y = 0.0543 are added to judge
whether the type I error is correct. The type I error is considered correct if it falls within the 95%
confidence interval of the 5% significance level (0.0457, 0.0543). The tests are grouped into three
groups from left to right for result presentation, (1) t-tests: pooled t-test (pooled), theoretical and
permutation Welch's t-tests (welch and p.welch), robust t-test (robust.t), (2) Wilcoxon rank-sum
test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox) and the minimum p-value (minp),
(3) two-stage tests: two-stage tests with the first stage α level at 0.5, 2.5, 5% (2stage0.5, 2stage2.5,
2stage5)

4.3 Kurtosis

Based on Fig. 6, all tests maintain a correct type I error when the effective sample size
is equal to or greater than 25. When the effective sample size is small, e.g. n_e = 11, and
the kurtosis is 10, t-tests (with the exception of permutation Welch's t-test) exhibit a
conservative type I error; the type I error of permutation Welch's t-test is protected,
with consistently similar power to the other t-tests. Figure 7 displays tests within groups



Fig. 7 Power simulation results for kurtotic data. The tests are grouped into three groups from
left to right for result presentation, (1) t-tests: pooled t-test (pooled), theoretical and permutation
Welch's t-tests (welch and p.welch), robust t-test (robust.t), (2) Wilcoxon rank-sum test (wilcox),
permutation Wilcoxon rank-sum test (p.wilcox) and the minimum p-value (minp), (3) two-stage
tests: two-stage tests with the first stage α level at 0.5, 2.5, 5% (2stage0.5, 2stage2.5, 2stage5)

(three groups in three columns) with similar power. While not apparent when the
sample size is small, t-tests are not as compelling as the other tests when the kurtosis
is 5 or greater. Wilcoxon rank-sum tests are the best performing tests in this setting,
but the power gain is not significant when compared to the minimum p-value and
two-stage tests.


5 Discussion
on

The goal of this chapter was to conduct an evaluation of a variety of tests for two-
sample location comparison using Monte Carlo simulation techniques. In conclusion,
heterogeneous variance is not a problem for any of the tests when the sample sizes are
equal. For unequal sample sizes, Welch's t-test maintains robustness against hetero-
geneous variance. The interaction between heterogeneous variance and sample size
is consistent with the statement of Kohr and Games (1974) for general two-sample
location comparison that a test is conservative when the larger sample has the larger
population variance and liberal when the smaller sample has the larger population
variance. When the normality assumption is violated by moderately skewed or kur-
totic data, Welch's t-test and other t-tests maintain a correct type I error so long as
the sample size is sufficiently large. The t-tests, however, are not as powerful as the
Wilcoxon rank-sum test and others.


As long as the effective sample size is 11 or greater, the distributions of Welch's t-test
and the Wilcoxon rank-sum test statistics are well approximated by their theoretical
distributions. There is no need to apply a permutation test for a small sample size
of that level if a theoretical test is appropriately applied. A permutation test is not a
solution to heterogeneous variance but may protect type I error when normality
is violated and the sample size is small.
An extra test for equality of variance or for normality may fail to protect
type I error, e.g. robust.t for heterogeneous variance and 2stage0.5 for skewed data
when the sample size is insufficiently large. While t-tests are not sensitive to non-
normal data for the protection of type I error, a conservative normality test with α
as low as 0.5% can lead to a biased type I error. In simulations, the two-stage tests
perform well against heterogeneous variance, but this is only the case because the
simulation settings assume both distributions are normal and the two-stage tests in
fact are Welch's t-test most of the time. The two-stage tests are meant to optimize
the test based on whether the normality assumption is met. These tests do not handle
heterogeneous variance. Under circumstances of non-normal distributions and het-
erogeneous variance, it has been shown by Rasch et al. that two-stage tests may lead
to an incorrect type I error and lose power (Rasch et al. 2011).
The literature and these simulation results reveal that the optimal test procedure
is a sensible switch between Welch's t-test and the Wilcoxon rank-sum test. Multi-
stage tests such as the robust t-test and two-stage tests are criticized for their biased
type I error. The minimum p-value simply takes the best result of Welch's t-test and
the Wilcoxon rank-sum test and is not designed specifically for heterogeneous variance
or non-normal distributions. While it shares the flaw of multi-stage tests with no
protection of type I error and secondary power, the minimum p-value is potentially
robust to heterogeneous variance and non-normal distributions.
A simple way to protect type I error and power against heterogeneous variance
is to plan a study with two samples of equal size. In this setting, the Wilcoxon rank-
sum test maintains robustness to non-normal data and is competitive with Welch's
t-test when the normality assumption is met. If the sample sizes are unequal, the


minimum p-value is a robust option but computationally more expensive. Otherwise,
heterogeneous variance and deviation from a normal distribution must be weighed
in selecting between Welch's t-test and the Wilcoxon rank-sum test. Alternatively,
Zimmerman and Zumbo found that Welch's t-test performs better than the
Wilcoxon rank-sum test under conditions of both equal and unequal variance on data
that has been pre-ranked (Zimmerman and Zumbo 1993).
).
Although moderately non-normal data has been the focus of this chapter, the con-
clusions are still useful as a general guideline. When data is extremely non-normal,
none of the tests will be appropriate. Data transformation may be applied to meet
the normality assumption (Osborne 2005, 2010). Not every distribution, however,
can be transformed to normality (e.g. L-shaped distributions). In this scenario, data
dichotomization may be applied to simplify statistical analysis if information loss is
not a concern (Altman and Royston 2006). Additional concerns include multimodal
distributions, which may be a result of subgroups or data contamination (Marrero
1985): a few outliers can distort the distribution completely. All of these consid-
erations serve as a reminder that prior to selecting a two-sample location test, the
first key to success is a full practical and/or clinical understanding of the data being
assessed.

References

Altman, D. G., & Royston, P. (2006). The cost of dichotomising continuous variables. BMJ,
332(7549), 1080.
Beasley, T. M., Erickson, S., & Allison, D. B. (2009). Rank-based inverse normal transformations
are increasingly used, but are they merited? Behavior Genetics, 39(5), 580–595.
Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal
theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical
Psychology, 40(1), 26–42.
Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic Press.
de Winter, J. C. (2013). Using the Student's t-test with extremely small sample sizes. Practical
Assessment, Research & Evaluation, 18(10), 1–12.
Demirtas, H., Hedeker, D., & Mermelstein, R. J. (2012). Simulation of massive public health data
by power polynomials. Statistics in Medicine, 31(27), 3337–3346.
Devroye, L. (1986). Sample-based non-uniform random variate generation. In Proceedings of the
18th Conference on Winter Simulation, pp. 260–265. ACM.
Ernst, M. D., et al. (2004). Permutation methods: A basis for exact inference. Statistical Science,
19(4), 676–685.
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical
power analysis program for the social, behavioral, and biomedical sciences. Behavior Research
Methods, 39(2), 175–191.
Headrick, T. C., & Sawilowsky, S. S. (2000). Weighted simplex procedures for determining boundary
points and constants for the univariate and multivariate power methods. Journal of Educational
and Behavioral Statistics, 25(4), 417–436.
Kohr, R. L., & Games, P. A. (1974). Robustness of the analysis of variance, the Welch procedure
and a Box procedure to heterogeneous variances. The Journal of Experimental Education, 43(1),
61–69.
Marrero, O. (1985). Robustness of statistical tests in the two-sample location problem. Biometrical
Journal, 27(3), 299–316.
Osborne, J. (2005). Notes on the use of data transformations. Practical Assessment, Research and
Evaluation, 9(1), 42–50.
Osborne, J. W. (2010). Improving your data transformations: Applying the Box-Cox transformation.
Practical Assessment, Research & Evaluation, 15(12), 1–9.
Rasch, D., Kubinger, K. D., & Moder, K. (2011). The two-sample t test: Pre-testing its assumptions
does not pay off. Statistical Papers, 52(1), 219–231.
Rochon, J., Gondan, M., & Kieser, M. (2012). To test or not to test: Preliminary assessment of
normality when comparing two independent samples. BMC Medical Research Methodology,
12(1), 81.
Royston, J. (1982). An extension of Shapiro and Wilk's W test for normality to large samples. Applied
Statistics, 115–124.
Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student's t-test and
the Mann-Whitney U test. Behavioral Ecology, 17(4), 688–690.
Sawilowsky, S. S. (2005). Misconceptions leading to choosing the t test over the Wilcoxon Mann-
Whitney test for shift in location parameter.
Schucany, W. R., & Tony Ng, H. (2006). Preliminary goodness-of-fit tests for normality do not
validate the one-sample Student t. Communications in Statistics - Theory and Methods, 35(12),
2275–2286.
Team, R. C. (2014). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria.
Welch, B. L. (1938). The significance of the difference between two means when the population
variances are unequal. Biometrika, 29(3–4), 350–362.
Zimmerman, D. W. (1996). A note on homogeneity of variance of scores and ranks. The Journal of
Experimental Education, 64(4), 351–362.
Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by con-
current violation of two assumptions. The Journal of Experimental Education, 67(1), 55–68.
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of
Mathematical and Statistical Psychology, 57(1), 173–181.
Zimmerman, D. W., & Zumbo, B. D. (1993). Rank transformations and the power of the Student
t test and Welch t' test for non-normal populations with unequal variances. Canadian Journal of
Experimental Psychology/Revue canadienne de psychologie expérimentale, 47(3), 523.

Anatomy of Correlational Magnitude Transformations in Latency
and Discretization Contexts in Monte-Carlo Studies

Hakan Demirtas and Ceren Vardar-Acar

Abstract This chapter is concerned with the assessment of correlational magnitude
changes when a subset of the continuous variables that may marginally or jointly
follow nearly any distribution in a multivariate setting is dichotomized or ordinalized.
Statisticians generally regard discretization as a bad idea on the grounds of power,
information, and effect size loss. Despite this undeniable disadvantage and legitimate
criticism, its widespread use in social, behavioral, and medical sciences stems from
the fact that discretization could yield simpler, more interpretable, and understand-
able conclusions, especially when large audiences are targeted for the dissemination
of the research outcomes. We do not intend to attach any negative or positive con-
notations to discretization, nor do we take a position of advocacy for or against it.
The purpose of the current chapter is to provide a conceptual framework and compu-
tational algorithms for modeling the correlation transitions under specified distribu-
tional assumptions within the realm of discretization in the context of the latency and
threshold concepts. Both directions (identification of the pre-discretization correla-
tion value in order to attain a specified post-discretization magnitude, and the other
way around) are discussed. The ideas are developed for bivariate settings; a natural
extension to the multivariate case is straightforward by assembling the individual
correlation entries. The paradigm under consideration has important implications
and broad applicability in the stochastic simulation and random number generation
worlds. The proposed algorithms are illustrated by several examples; feasibility and
performance of the methods are demonstrated by a simulation study.

H. Demirtas (B)
Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago,
1603 West Taylor Street, Chicago, IL 60612, USA
e-mail: demirtas@uic.edu
C. Vardar-Acar
Department of Statistics, Middle East Technical University, Ankara, Turkey
e-mail: cvardar@metu.edu.tr

© Springer Nature Singapore Pte Ltd. 2017 59


D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical
Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_4

60 H. Demirtas and C. Vardar-Acar

1 Introduction

Unlike natural (true) dichotomies such as male versus female, conductor versus insu-
lator, vertebrate versus invertebrate, and in-patient versus out-patient, some binary
variables are derived through dichotomization of underlying continuous measure-
ments. Such artificial dichotomies often arise across many scientific disciplines.
Examples include obesity status (obese versus non-obese) based on body mass index,
preterm versus term babies given the gestation period, high versus low need of social
interaction, small versus large tumor size, early versus late response time in sur-
veys, young versus old age, among many others. In the ordinal case, discretization
is equally commonly encountered in practice. Derived polytomous variables such
as young-middle-old age, low-medium-high income, cold-cool-average-hot tem-
perature, no-mild-moderate-severe depression are obtained based on nominal age,
income, temperature, and depression score, respectively. While binary is a special
case of ordinal, for the purpose of illustration, integrity, and clarity, separate argu-
ments are presented throughout the chapter. On a terminological note, we use the
words binary/dichotomous and ordinal/polytomous interchangeably to simultane-
ously reflect the preferences of statisticians/psychometricians. Obviously, polyto-
mous variables can normally be ordered (ordinal) or unordered (nominal). For the
remainder of the chapter, the term “polytomous” is assumed to correspond to ordered
variables.
Discretization is typically shunned by statisticians for valid reasons, the most
prominent of which is the power and information loss. In most cases, it leads to a
diminished effect size as well as reduced reliability and strength of association. How-
ever, simplicity, better interpretability and comprehension of the effects of interest,
and superiority of some categorical data measures such as odds ratio have been argued
by proponents of discretization. Those who are against it assert that the regression
paradigm is general enough to account for interactive effects, outliers, skewed distri-
butions, and nonlinear relationships. In practice, especially substantive researchers
and practitioners employ discretization in their works. For conflicting views on rel-
ative perils and merits of discretization, see MacCallum et al. (2002) and Farrington
and Loeber (2000), respectively. We take a neutral position; although it is not a
recommended approach from the statistical theory standpoint, it frequently occurs
in practice, mostly driven by improved understandability-based arguments. Instead
of engaging in fruitless philosophical discussions, we feel that a more productive
effort can be directed towards finding answers when the discretization is performed,
which motivates the formation of this chapter’s major goals: (1) The determina-
tion of correlational magnitude changes when some of the continuous variables that
may marginally or jointly follow almost any distribution in a multivariate setting
are dichotomized or ordinalized. (2) The presentation of a conceptual and compu-
tational framework for modeling the correlational transformations before and after
discretization.

Anatomy of Correlational Magnitude Transformations in Latency … 61

A correlation between two continuous variables is usually computed as the com-
mon Pearson correlation. If one or both variables is/are dichotomized/ordinalized
by a threshold concept of underlying continuous variables, different naming con-
ventions are assigned to the correlations. A correlation between a continuous and
a dichotomized/ordinalized variable is a biserial/polyserial and point-biserial/point-
polyserial correlation before and after discretization, respectively. When both vari-
ables are dichotomized/ordinalized, the correlation between the two latent continuous
variables is known as the tetrachoric/polychoric correlation. The phi coefficient is
the correlation between two discretized variables; in fact, the term phi coefficient is
reserved for dichotomized variables, but for lack of a better term we call it the “ordi-
nal phi coefficient” for ordinalized variables. All of these correlations are special
cases of the Pearson correlation.
Correlations are naturally altered in magnitude by discretization. In the binary
case, there is a closed form, double numerical integration formula that connects
the correlations before and after dichotomization under the normality assumption
when both variables are dichotomized (Emrich and Piedmonte 1991). Demirtas and
Doganay (2012) enhanced this Gaussian copula approach, along with the algebraic
relationship between the biserial and point-biserial correlations, in the context of
joint generation of binary and normal variables. Demirtas and Hedeker (2016) fur-
ther extended this correlational connection to nonnormal variables via linearity and
constancy arguments when only one variable in a bivariate setting is dichotomized.
Going back to the scenario where the dichotomization is performed on both vari-
ables, Demirtas (2016) proposed algorithms that find the phi coefficient when the
tetrachoric correlation is specified (and the other way around), under any distri-
butional assumption for continuous variables through added operational utility of

the power polynomials (Fleishman 1978; Vale and Maurelli 1983). In the ordinal
case, modeling the correlation transition can only be performed iteratively under
the normality assumption (Ferrari and Barbiero 2012). Demirtas et al. (2016a) aug-
other one is specified and vice versa, to an ordinal setting. The primary purpose of
this chapter is providing several algorithms that are designed to connect pre- and
post-discretization correlations under specified distributional assumptions in simu-
lated environments. More specifically, the following relationships are established:
(a) tetrachoric correlation/phi coefficient, (b) biserial/point-biserial correlations,
(c) polychoric correlation/ordinal phi coefficient, and (d) polyserial/point-polyserial
correlations, where (a)–(b) and (c)–(d) are relevant to binary and ordinal data, respec-
tively; (b)–(d) and (a)–(c) pertain to situations where only one or both variables
is/are discretized, respectively. In all of these cases, the marginal distributions that
are needed for finding skewness (symmetry) and elongation (peakedness) values for
the underlying continuous variables, proportions for binary and ordinal variables,
and associational quantities in the form of the Pearson correlation are assumed to be
specified.
This work is important and of interest for the following reasons: (1) The link
between these types of correlations has been studied only under the normality
assumption; however, the presented level of generality that encompasses a com-
prehensive range of distributional setups is a necessary progress in computational
statistics. (2) As simulation studies are typically based on replication of real data
characteristics and/or specified hypothetical data trends, having access to the latent
data as well as the eventual binary/ordinal data may be consequential for exploring a
richer spectrum of feasible models that are applicable for a given data-analytic prob-
lem involving correlated binary/ordinal data. (3) The set of techniques sheds light
on how correlations are related before and after discretization, and has the potential to
broaden our horizon on its relative advantages and drawbacks. (4) The algorithms
work for a very broad class of underlying bivariate latent densities and binary/ordinal
data distributions, allowing skip patterns for the latter, without requiring the identical
distribution assumption on either type of variables. (5) The required software tools
for the implementation are minimal, one only needs a numerical double integration
solver for the binary-binary case, a computational platform with univariate random
number generation (RNG) capabilities for the binary/ordinal-continuous case, an
iterative scheme that connects the polychoric correlations and the ordinal phi coef-
ficients under the normality assumption for the ordinal-ordinal case, a polynomial
root-finder and a nonlinear equations set solver to handle nonnormal continuous vari-
ables. (6) The algorithmic steps are formulated for the bivariate case by the nature
of the problem, but handling the multivariate case is straightforward by assembling
the correlation matrix entries. (7) The collection of the algorithms can be regarded
as an operational machinery for developing viable RNG mechanisms to generate
multivariate latent variables as well as subsequent binary/ordinal variables given
their marginal shape characteristics and associational structure in simulated settings,
potentially expediting the evolution of the mixed data generation routines. (8) The
methods could be useful in advancing research in meta-analysis domains where vari-
ables are discretized in some studies and remained continuous in some others.
The organization of the chapter is as follows. In
Sect. 2, necessary background
information is given for the development of the proposed algorithms. In particular,
how the correlation transformation works for discretized data through numerical inte-
gration and an iterative scheme for the binary and ordinal cases, respectively, under
the normality assumption, is outlined; a general nonnormal continuous data gener-
ation technique that forms a basis for the proposed approaches is described (when
both variables are discretized), and an identity that connects correlations before and
after discretization via their mutual associations is elaborated upon (when only one
variable is discretized). In Sect. 3, several algorithms for finding one quantity given
the other are provided under various combinations of cases (binary versus ordinal,
directionality in terms of specified versus computed correlation with respect to pre-
versus post-discretization, and whether discretization is applied on one versus both
variables) and some illustrative examples, representing a broad range of distributional
shapes that can be encountered in real applications, are presented for the purpose of
exposition. In Sect. 4, a simulation study for evaluating the method’s performance
in a multivariate setup by commonly accepted accuracy (unbiasedness) measures in
both directions is discussed. Section 5 includes concluding remarks, future research
directions, limitations, and extensions.


2 Building Blocks

This section gives necessary background information for the development of the
proposed algorithms in modeling correlation transitions. In what follows, correlation
type and related notation depend on the three factors: (a) before or after discretiza-
tion, (b) only one or both variables is/are discretized, and (c) discretized variable is
dichotomous or polytomous. To establish the notational convention, for the remain-
der of the chapter, let Y1 and Y2 be the continuous variables where either Y1 only
or both are discretized to yield X 1 and X 2 depending on the correlation type under
consideration (When Y’s are normal, they are denoted as Z , which is relevant for the
normal-based results and for the nonnormal extension via power polynomials). To
distinguish between binary and ordinal variables, the symbols B and O appear in the
subscripts. Furthermore, for avoiding any confusion, the symbols $BS$, $TET$, $PS$, and
$POLY$ are made a part of $\delta_{Y_1Y_2}$ and $\delta_{Z_1Z_2}$ to differentiate among the biserial, tetra-
choric, polyserial, and polychoric correlations, respectively. For easier readability,
we include Table 1 that shows the specific correlation types and associated notational
symbols based on the three above-mentioned factors.

2.1 Dichotomous Case: Normality


Borrowing ideas from the RNG literature, if the underlying distribution before
dichotomization is bivariate normal, the relationship between the phi coefficient
and the tetrachoric correlation is known (Emrich and Piedmonte 1991; Demir-
tas and Doganay 2012; Demirtas 2016). Let $X_{1B}$ and $X_{2B}$ represent binary vari-
ables such that $E[X_{jB}] = p_j = 1 - q_j$ for $j = 1, 2$, and $Cor[X_{1B}, X_{2B}] = \delta_{X_{1B}X_{2B}}$,
where $p_1$, $p_2$, and $\delta_{X_{1B}X_{2B}}$ are given. Let $Z_1 = Y_1$ and $Z_2 = Y_2$ be the corre-
sponding standard normal variables, and let $\Phi$ be the cumulative distribution func-
sponding standard normal variables, and let Φ be the cumulative distribution func-

Table 1 Terminological and notational convention for different correlation types depending on the
three self-explanatory factors

When    Discrete data type   Discretized         Name                           Symbol
Before  Dichotomous          $Y_1$ only          Biserial correlation           $\delta_{Y_1Y_2}^{BS}$
After   Dichotomous          $Y_1$ only          Point-biserial correlation     $\delta_{X_{1B}Y_2}$
Before  Dichotomous          Both $Y_1$, $Y_2$   Tetrachoric correlation        $\delta_{Y_1Y_2}^{TET}$
After   Dichotomous          Both $Y_1$, $Y_2$   Phi coefficient                $\delta_{X_{1B}X_{2B}}$
Before  Polytomous           $Y_1$ only          Polyserial correlation         $\delta_{Y_1Y_2}^{PS}$
After   Polytomous           $Y_1$ only          Point-polyserial correlation   $\delta_{X_{1O}Y_2}$
Before  Polytomous           Both $Y_1$, $Y_2$   Polychoric correlation         $\delta_{Y_1Y_2}^{POLY}$
After   Polytomous           Both $Y_1$, $Y_2$   Ordinal phi coefficient        $\delta_{X_{1O}X_{2O}}$


tion for a standard bivariate normal random variable with correlation coefficient
$\delta_{Z_1Z_2}^{TET}$. Obviously,
$$\Phi[z_1, z_2, \delta_{Z_1Z_2}^{TET}] = \int_{-\infty}^{z_1}\!\int_{-\infty}^{z_2} f(z_1, z_2, \delta_{Z_1Z_2}^{TET})\, dz_1\, dz_2,$$
where
$$f(z_1, z_2, \delta_{Z_1Z_2}^{TET}) = \left[2\pi\left(1 - (\delta_{Z_1Z_2}^{TET})^2\right)^{1/2}\right]^{-1} \times \exp\!\left[-\left(z_1^2 - 2\delta_{Z_1Z_2}^{TET} z_1 z_2 + z_2^2\right) / \left(2\left(1 - (\delta_{Z_1Z_2}^{TET})^2\right)\right)\right].$$
The phi coefficient ($\delta_{X_{1B}X_{2B}}$) and the tetrachoric correlation ($\delta_{Z_1Z_2}^{TET}$)
are linked via the equation

$$\Phi[z(p_1), z(p_2), \delta_{Z_1Z_2}^{TET}] = \delta_{X_{1B}X_{2B}}\, (p_1 q_1 p_2 q_2)^{1/2} + p_1 p_2 \qquad (1)$$

where $z(p_j)$ denotes the $p_j$th quantile of the standard normal distribution for $j = 1, 2$.
As long as $\delta_{X_{1B}X_{2B}}$ is within the feasible correlation range (Hoeffding 1940; Fréchet
1951; Demirtas and Hedeker 2011), the solution is unique. Once $Z_1$ and $Z_2$ are

generated, the binary variables are derived by setting $X_{jB} = 1$ if $Z_j \ge z(1 - p_j)$ and
0 otherwise for $j = 1, 2$. While RNG is not our ultimate interest in this work,
we can use Eq. 1 for bridging the phi coefficient and the tetrachoric correlation.
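Equation 1 lends itself to direct numerical treatment. The sketch below, which assumes SciPy is available (the function name and bracketing interval are our own choices, not from the chapter), recovers the tetrachoric correlation from a specified phi coefficient:

```python
# Minimal sketch: solve Eq. 1 for the tetrachoric correlation given the
# phi coefficient and marginal proportions p1, p2 (normality assumed).
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def tetrachoric_from_phi(phi, p1, p2):
    q1, q2 = 1.0 - p1, 1.0 - p2
    target = phi * np.sqrt(p1 * q1 * p2 * q2) + p1 * p2   # right-hand side of Eq. 1
    z1, z2 = norm.ppf(p1), norm.ppf(p2)                   # z(p1), z(p2)

    def f(delta):
        cov = [[1.0, delta], [delta, 1.0]]
        return multivariate_normal.cdf([z1, z2], mean=[0.0, 0.0], cov=cov) - target

    return brentq(f, -0.999, 0.999)   # unique root in the feasible range
```

With $p_1 = p_2 = 0.5$ and a phi coefficient of 0.3, the recovered tetrachoric correlation is approximately 0.454; the reverse direction only requires evaluating the left-hand side of Eq. 1 at a given $\delta_{Z_1Z_2}^{TET}$.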
When only one of the normal variables ($Z_1$) is dichotomized, i.e., $X_{1B} = I(Z_1 \ge z(1 - p_1))$,
it is relatively easy to show that $\delta_{X_{1B}Z_2}/\delta_{Z_1Z_2}^{BS} = \delta_{X_{1B}Z_1} = h/\sqrt{p_1 q_1}$,
where $h$ is the ordinate of the normal curve at the point of dichotomization (Demirtas
and Hedeker 2016).
Real data often do not conform to the assumption of normality; hence most sim-
normality; hence most sim-
ulation studies should take nonnormality into consideration. The next section inves-
tigates the situation where one or both continuous variables is/are nonnormal.

2.2 Dichotomous Case: Beyond Normality

Extending the limited and restrictive normality-based results to a broad range of dis-
tributional setups requires the employment of the two frameworks (an RNG routine
for multivariate continuous data and a derived linear relationship in the presence of
discretization), which we outline below.
We first tackle the relationship between the tetrachoric correlation $\delta_{Y_1Y_2}^{TET}$ and
the phi coefficient $\delta_{X_{1B}X_{2B}}$ under nonnormality via the use of the power polynomials
(Fleishman 1978), which is a moment-matching procedure that simulates nonnormal
distributions often used in Monte-Carlo studies, based on the premise that real-life
distributions of variables are typically characterized by their first four moments. It
hinges upon the polynomial transformation, $Y = a + bZ + cZ^2 + dZ^3$, where $Z$
follows a standard normal distribution, and $Y$ is standardized (zero mean and unit
variance).¹ The distribution of $Y$ depends on the constants $a$, $b$, $c$, and $d$, that can
be computed for specified or estimated values of skewness ($\nu_1 = E[Y^3]$) and excess
kurtosis ($\nu_2 = E[Y^4] - 3$). The procedure of expressing any given variable by the
sum of linear combinations of powers of a standard normal variate is capable of

¹ We drop the subscript in $Y$ as we start with the univariate case.


covering a wide area in the skewness-elongation plane whose bounds are given by
the general expression $\nu_2 \ge \nu_1^2 - 2$.²
Assuming that $E[Y] = 0$ and $E[Y^2] = 1$, by utilizing the moments of the standard
normal distribution, the following set of equations can be derived:

$$a = -c \qquad (2)$$

$$b^2 + 6bd + 2c^2 + 15d^2 - 1 = 0 \qquad (3)$$

$$2c(b^2 + 24bd + 105d^2 + 2) - \nu_1 = 0 \qquad (4)$$

$$24[bd + c^2(1 + b^2 + 28bd) + d^2(12 + 48bd + 141c^2 + 225d^2)] - \nu_2 = 0 \qquad (5)$$

These equations can be solved by the Newton-Raphson method, or any other


plausible root-finding or nonlinear optimization routine. More details for the Newton-
Raphson algorithm for this particular setting are given by Demirtas et al. (2012). The
polynomial coefficients are estimated for centered and scaled variables; the resulting
data set should be back-transformed to the original scale by multiplying every data
point by the standard deviation and adding the mean. Centering-scaling and the
reverse of this operation are linear transformations, so it does not change the values
of skewness, kurtosis, and correlations. Of note, we use the words symmetry and
skewness interchangeably. Similarly, kurtosis, elongation, and peakedness are meant
to convey the same meaning.
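Equations 2–5 can also be solved with a generic nonlinear solver; the sketch below uses SciPy's `fsolve` (the starting values are a common heuristic of ours, not prescribed by the text, and the skewness/kurtosis pair mimics an exponential-like shape):

```python
# Sketch: solve the Fleishman system (Eqs. 2-5) for (a, b, c, d) given
# skewness nu1 and excess kurtosis nu2, then simulate the nonnormal variate.
import numpy as np
from scipy.optimize import fsolve

def fleishman_coefficients(nu1, nu2):
    def system(x):
        b, c, d = x
        return [b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,                    # Eq. 3
                2*c*(b**2 + 24*b*d + 105*d**2 + 2) - nu1,               # Eq. 4
                24*(b*d + c**2*(1 + b**2 + 28*b*d)
                    + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - nu2]  # Eq. 5
    b, c, d = fsolve(system, x0=[1.0, 0.0, 0.0])
    return -c, b, c, d                                                  # a = -c (Eq. 2)

# Exponential-like shape: skewness 2, excess kurtosis 6
a, b, c, d = fleishman_coefficients(2.0, 6.0)
z = np.random.default_rng(1).standard_normal(500_000)
y = a + b*z + c*z**2 + d*z**3    # standardized nonnormal variate
```

The simulated $y$ has (approximately) zero mean, unit variance, and the requested third and fourth moments, which can be checked empirically before the coefficients are used downstream.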
The multivariate extension of Fleishman’s power method (Vale and Maurelli 1983)
plays a central role for the remainder of this chapter. The procedure for generat-
ing multivariate continuous data begins with computation of the constants given
in Eqs. 2–5, independently for each variable. The bivariate case can be formulated
in matrix notation as shown below. First, let Z 1 and Z 2 be variables drawn from

standard normal populations; let z be the vector of normal powers 0 through 3,
 
$z_j' = [1, Z_j, Z_j^2, Z_j^3]$; and let $w$ be the weight vector that contains the power func-
tion weights $a$, $b$, $c$, and $d$: $w_j' = [a_j, b_j, c_j, d_j]$ for $j = 1, 2$. The nonnormal variable
$Y_j$ is then defined as the product of these two vectors, $Y_j = w_j' z_j$. Let $\delta_{Y_1Y_2}$ be the
correlation between two nonnormal variables Y1 and Y 2 that correspond to the nor-
mal variables $Z_1$ and $Z_2$, respectively.³ As the variables are standardized, meaning
$E(Y_1) = E(Y_2) = 0$, $\delta_{Y_1Y_2} = E(Y_1Y_2) = E(w_1' z_1 z_2' w_2) = w_1' R\, w_2$, where $R$ is the
expected matrix product of $z_1$ and $z_2$:

$$R = E(z_1 z_2') = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & \delta_{Z_1Z_2} & 0 & 3\delta_{Z_1Z_2} \\ 1 & 0 & 2\delta_{Z_1Z_2}^2 + 1 & 0 \\ 0 & 3\delta_{Z_1Z_2} & 0 & 6\delta_{Z_1Z_2}^3 + 9\delta_{Z_1Z_2} \end{bmatrix},$$

² In fact, equality is not possible for continuous distributions.
³ $\delta_{Y_1Y_2}$ is the same as $\delta_{Y_1Y_2}^{TET}$ or $\delta_{Y_1Y_2}^{POLY}$ depending on if discretized variables are binary or ordinal,
respectively. For the general presentation of the power polynomials, we do not make that distinction.

where $\delta_{Z_1Z_2}$ is the correlation between $Z_1$ and $Z_2$. After algebraic operations, the
following relationship between δY1 Y2 and δ Z 1 Z 2 in terms of polynomial coefficients
ensues:

$$\delta_{Y_1Y_2} = \delta_{Z_1Z_2}(b_1b_2 + 3b_1d_2 + 3d_1b_2 + 9d_1d_2) + \delta_{Z_1Z_2}^2(2c_1c_2) + \delta_{Z_1Z_2}^3(6d_1d_2) \qquad (6)$$

Solving this cubic equation for δ Z 1 Z 2 gives the intermediate correlation between the
two standard normal variables that is required for the desired post-transformation
correlation $\delta_{Y_1Y_2}$. Clearly, correlations for each pair of variables should be assembled
into a matrix of intercorrelations in the multivariate case. For a comprehensive source
and detailed account on the power polynomials, see Headrick (2010).
In the dichotomization context, the connection between the underlying nonnormal
($\delta_{Y_1Y_2}$) and normal correlations ($\delta_{Z_1Z_2}$) in Eq. 6, along with the relationship between
the tetrachoric correlation ($\delta_{Z_1Z_2}^{TET}$) and the phi coefficient ($\delta_{X_{1B}X_{2B}}$) conveyed in
Eq. 1, is instrumental in Algorithms-1a and -1b in Sect. 3.
To address the situation where only one variable ( Y1 ) is dichotomized, we now
move to the relationship of biserial ( δY1 Y2B S ) and point-biserial (δ X 1 B Y2 ) correlations
in the absence of the normality assumption, which merely functions as a starting
point below. Suppose that Y1 and Y2 jointly follow a bivariate normal distribu-
tion with a correlation of δY1 Y2B S . Without loss of generality, we may assume that
both Y1 and Y2 are standardized to have a mean of 0 and a variance of 1. Let
$X_{1B}$ be the binary variable resulting from a split on $Y_1$, $X_{1B} = I(Y_1 \ge k)$, where
$k$ is the point of dichotomization. Thus, $E[X_{1B}] = p_1$ and $V[X_{1B}] = p_1 q_1$ where
$q_1 = 1 - p_1$. The correlation between $X_{1B}$ and $Y_1$, $\delta_{X_{1B}Y_1}$, can be obtained in a simple
way, namely, $\delta_{X_{1B}Y_1} = Cov[X_{1B}, Y_1]/\sqrt{V[X_{1B}]\,V[Y_1]} = E[X_{1B}Y_1]/\sqrt{p_1q_1} = E[Y_1 I(Y_1 \ge k)]/\sqrt{p_1q_1}$.
We can also express the relationship between $Y_1$ and $Y_2$ via the following linear
regression model:
$$Y_2 = \delta_{Y_1Y_2}^{BS}\, Y_1 + \varepsilon \qquad (7)$$

where $\varepsilon$ is independent of $Y_1$ and follows $N(0,\, 1 - (\delta_{Y_1Y_2}^{BS})^2)$. When we
generalize this to nonnormal Y1 and/or Y2 (both centered and scaled), the same
relationship can be assumed to hold with the exception that the distribution of ε
follows a nonnormal distribution. As long as Eq. 7 is valid,

$$Cov[X_{1B}, Y_2] = Cov[X_{1B}, \delta_{Y_1Y_2}^{BS} Y_1 + \varepsilon] = Cov[X_{1B}, \delta_{Y_1Y_2}^{BS} Y_1] + Cov[X_{1B}, \varepsilon] = \delta_{Y_1Y_2}^{BS}\, Cov[X_{1B}, Y_1] + Cov[X_{1B}, \varepsilon]. \qquad (8)$$

Since $\varepsilon$ is independent of $Y_1$, it will also be independent of any deterministic
function of $Y_1$ such as $X_{1B}$, and thus $Cov[X_{1B}, \varepsilon]$ will be 0. As $E[Y_1] = E[Y_2] = 0$,
$V[Y_1] = V[Y_2] = 1$, $Cov[X_{1B}, Y_1] = \delta_{X_{1B}Y_1}\sqrt{p_1q_1}$ and $Cov[Y_1, Y_2] = \delta_{Y_1Y_2}^{BS}$, Eq. 8
reduces to
$$\delta_{X_{1B}Y_2} = \delta_{Y_1Y_2}^{BS}\, \delta_{X_{1B}Y_1}. \qquad (9)$$


In the bivariate normal case, $\delta_{X_{1B}Y_1} = h/\sqrt{p_1q_1}$ where $h$ is the ordinate of the normal
curve at the point of dichotomization. Equation 9 indicates that the linear association
between $X_{1B}$ and $Y_2$ is assumed to be fully explained by their mutual association
with $Y_1$ (Demirtas and Hedeker 2016). The ratio $\delta_{X_{1B}Y_2}/\delta_{Y_1Y_2}^{BS}$ is equal to $\delta_{X_{1B}Y_1} =
E[X_{1B}Y_1]/\sqrt{p_1q_1} = E[Y_1 I(Y_1 \ge k)]/\sqrt{p_1q_1}$, which is a constant given $p_1$ and the
distribution of $(Y_1, Y_2)$. These correlations are invariant to location shifts and scaling;
$Y_1$ and $Y_2$ do not have to be centered and scaled, their means and variances can take any
finite values. Once the ratio (δ X 1 B Y1 ) is found (it could simply be done by generating
Y1 and dichotomizing it to yield X 1 B ), one can compute the point-biserial ( δ X 1 B Y2 ) or
biserial ($\delta_{Y_1Y_2}^{BS}$) correlation when the other one is specified. This linearity-constancy
argument, which jointly emanates from Eqs. 7 and 9, will be the crux of Algorithm-2
given in Sect. 3.
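The constancy of the ratio can be verified empirically; the sketch below uses a standardized exponential for $Y_1$ purely as an illustration (the marginal distribution, sample size, $p_1$, and biserial value are our choices, not from the chapter):

```python
# Sketch: empirical check of Eq. 9. delta_{X1B,Y1} is estimated by
# Monte Carlo after generating Y1 and dichotomizing it.
import numpy as np

rng = np.random.default_rng(2016)
n = 400_000
p1 = 0.3                               # P(X_1B = 1)

y1 = rng.exponential(size=n)           # any continuous Y1 works here
y1 = (y1 - y1.mean()) / y1.std()       # center and scale
k = np.quantile(y1, 1.0 - p1)          # dichotomization threshold
x1b = (y1 >= k).astype(float)

# ratio delta_{X1B,Y1} = E[X_1B * Y_1] / sqrt(p1 * q1)
ratio = np.mean(x1b * y1) / np.sqrt(p1 * (1.0 - p1))

biserial = 0.6                         # specified pre-discretization correlation
point_biserial = biserial * ratio      # Eq. 9: attenuated correlation with Y2
```

Because the ratio depends only on $p_1$ and the distribution of $Y_1$, the same estimate can be reused to move in either direction between the biserial and point-biserial quantities.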

2.3 Polytomous Case: Normality

In the ordinal case, although the relationship between the polychoric correlation
(δY1 Y2P O L Y ) and the ordinal phi correlation ( δ X 1 O X 2 O ) can be written in closed form,
as explained below, the solution needs to be obtained iteratively even under the nor-
mality assumption since no nice recipe such as Eq. 1 is available. In the context of
correlated ordinal data generation, Ferrari and Barbiero (2012) proposed an iterative
procedure based on a Gaussian copula, in which point-scale ordinal data are gener-
ated when the marginal proportions and correlations are specified. For the purposes
of this chapter, one can utilize their method to find the corresponding polychoric
correlation or the ordinal phi coefficient when one of them is given under normal-
ity. The algorithm in Ferrari and Barbiero (2012) serves as an intermediate step in
formulating the connection between the two correlations under any distributional
assumption on the underlying continuous variables.
Concentrating on the bivariate case, suppose $Z = (Z_1, Z_2) \sim N(0, \Delta_{Z_1Z_2})$, where
$Z$ denotes the bivariate standard normal distribution with correlation matrix $\Delta_{Z_1Z_2}$
whose off-diagonal entry is $\delta_{Z_1Z_2}^{POLY}$. Let $X = (X_{1O}, X_{2O})$ be the bivariate ordi-
nal data where underlying Z is discretized based on corresponding normal quan-
tiles given the marginal proportions, with a correlation matrix ∆ Z 1 Z 2 . If we need to
sample from a random vector ( X 1 O , X 2 O ) whose marginal cumulative distribution
functions (cdfs) are $F_1$, $F_2$ tied together via a Gaussian copula, we generate a sam-
ple $(z_1, z_2)$ from $Z \sim N(0, \Delta_{Z_1Z_2})$, then set $x = (x_{1O}, x_{2O}) = (F_1^{-1}(u_1), F_2^{-1}(u_2))$
where $u = (u_1, u_2) = (\Phi(z_1), \Phi(z_2))$, and $\Phi$ is the cdf of the standard normal
distribution. The correlation matrix of $X$, denoted by $\Delta_{X_{1O}X_{2O}}$ (with an off-diagonal
entry $\delta_{X_{1O}X_{2O}}$), obviously differs from $\Delta_{Z_1Z_2}$ due to discretization. More specifically,
$|\delta_{X_{1O}X_{2O}}| < |\delta_{Z_1Z_2}^{POLY}|$ in large samples. The relationship between $\Delta_{X_{1O}X_{2O}}$ and $\Delta_{Z_1Z_2}$
is established by resorting to the following formula (Cario and Nelson 1997):

$$E[X_{1O}X_{2O}] = E[F_1^{-1}(\Phi(Z_1))\, F_2^{-1}(\Phi(Z_2))] = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} F_1^{-1}(\Phi(z_1))\, F_2^{-1}(\Phi(z_2))\, f(z_1, z_2)\, dz_1\, dz_2 \qquad (10)$$


where $f(z_1, z_2)$ is the bivariate standard normal probability density function (pdf) with corre-
lation matrix $\Delta_{Z_1Z_2}$, which implies that $\Delta_{X_{1O}X_{2O}}$ is a function of $\Delta_{Z_1Z_2}$. If $X_1$ and $X_2$
are ordinal random variables with cardinality $k_1$ and $k_2$, respectively, Eq. 10 reduces
to a sum of $k_1 \times k_2$ integrals of $f(z_1, z_2)$ over a rectangle, i.e., $k_1 \times k_2$ differences
of the bivariate cdf computed at two distinct points in $\mathbb{R}^2$, as articulated by Ferrari
and Barbiero (2012).
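This reduction of Eq. 10 to $k_1 \times k_2$ rectangle probabilities can be sketched as follows (SciPy assumed; integer category scores $1, \ldots, k$ and the $\pm 8$ stand-ins for $\pm\infty$ are our conventions, not from the chapter):

```python
# Sketch: Eq. 10 for ordinal X1, X2 as a sum over k1 x k2 rectangles of
# the bivariate normal cdf, by inclusion-exclusion at the four corners.
import numpy as np
from scipy.stats import norm, multivariate_normal

def expected_cross_product(rho, cum_probs1, cum_probs2):
    # thresholds from interior cumulative probabilities; +/-8 ~ +/-infinity
    t1 = np.concatenate(([-8.0], norm.ppf(cum_probs1), [8.0]))
    t2 = np.concatenate(([-8.0], norm.ppf(cum_probs2), [8.0]))
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    total = 0.0
    for i in range(len(t1) - 1):
        for j in range(len(t2) - 1):
            # P(t1[i] < Z1 <= t1[i+1], t2[j] < Z2 <= t2[j+1])
            p = (mvn.cdf([t1[i + 1], t2[j + 1]]) - mvn.cdf([t1[i], t2[j + 1]])
                 - mvn.cdf([t1[i + 1], t2[j]]) + mvn.cdf([t1[i], t2[j]]))
            total += (i + 1) * (j + 1) * p      # scores 1..k for each margin
    return total
```

For two independent median-split binary variables with scores 1 and 2, the expected cross-product is $1.5 \times 1.5 = 2.25$, and it increases as the latent correlation grows.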
The relevant part of the algorithm is as follows:
1. Generate standard bivariate normal data with the correlation $\delta_{Z_1Z_2}^{POLY(0)}$ where
   $\delta_{Z_1Z_2}^{POLY(0)} = \delta_{X_{1O}X_{2O}}$ (Here, $\delta_{Z_1Z_2}^{POLY(0)}$ is the initial polychoric correlation).
2. Discretize $Z_1$ and $Z_2$, based on the cumulative probabilities of the marginal
   distributions $F_1$ and $F_2$, to obtain $X_{1O}$ and $X_{2O}$, respectively.
3. Compute $\delta_{X_{1O}X_{2O}}^{(1)}$ through $X_{1O}$ and $X_{2O}$ (Here, $\delta_{X_{1O}X_{2O}}^{(1)}$ is the ordinal phi coeffi-
   cient after the first iteration).
4. Execute the following loop as long as $|\delta_{X_{1O}X_{2O}}^{(v)} - \delta_{X_{1O}X_{2O}}| > \varepsilon$ and $1 \le v \le v_{max}$
   ($v_{max}$ and $\varepsilon$ are the maximum number of iterations and the maximum tolerated
   absolute error, respectively; both quantities are set by users):
   (a) Update $\delta_{Z_1Z_2}^{POLY(v)}$ by $\delta_{Z_1Z_2}^{POLY(v)} = \delta_{Z_1Z_2}^{POLY(v-1)}\, g(v)$, where $g(v) = \delta_{X_{1O}X_{2O}}/\delta_{X_{1O}X_{2O}}^{(v)}$.
       Here, $g(v)$ serves as a correction coefficient, which ultimately converges to 1.
   (b) Generate bivariate normal data with $\delta_{Z_1Z_2}^{POLY(v)}$ and compute $\delta_{X_{1O}X_{2O}}^{(v+1)}$ after dis-
       cretization.
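A compact sketch of the iterative scheme above (evaluating the ordinal phi coefficient by Monte Carlo with common random numbers is our simplification; the original method works with the exact rectangle probabilities):

```python
# Sketch: rescale the working polychoric correlation by g(v) until the
# ordinal phi coefficient after discretization matches the target.
import numpy as np
from scipy.stats import norm

def polychoric_from_ordinal_phi(target_phi, cum_probs1, cum_probs2,
                                n=200_000, tol=1e-3, vmax=50, seed=1):
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n)
    e = rng.standard_normal(n)              # reused across iterations
    t1 = norm.ppf(cum_probs1)               # thresholds from cumulative probs
    t2 = norm.ppf(cum_probs2)
    rho = target_phi                        # step 1: initialize at the target
    for _ in range(vmax):
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * e
        x1 = np.searchsorted(t1, z1)        # step 2: discretize
        x2 = np.searchsorted(t2, z2)
        phi_v = np.corrcoef(x1, x2)[0, 1]   # step 3: current ordinal phi
        if abs(phi_v - target_phi) <= tol:  # step 4: stopping rule
            break
        rho = np.clip(rho * target_phi / phi_v, -0.99, 0.99)   # step 4(a)
    return rho
```

For two median-split binary variables and a target phi coefficient of 0.3, the routine returns roughly 0.45, in agreement with the normality-based solution of Eq. 1.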
Again, our focus in the current work is not RNG per se, but the core idea in Ferrari
(2012)) is a helpful tool that links δ Y1 Y2P O L Y and δ X 1 O X 2 O for ordinalized
and Barbiero (2012
data through the intermediary role of normal data between ordinal and nonnormal
continuous data (Sect. 2.4 2.4).
).
When only one of the normal variables ( Z 1 ) is ordinalized, no nice formulas such
as the one in Sect. 2.1 given in the binary data context are available. The good news is
that a much more general procedure that accommodates any distributional assump-
tion on underlying continuous variables is available. The process that relates the
polyserial ($\delta_{Z_1Z_2}^{PS}$) and point-polyserial ($\delta_{X_{1O}Z_2}$) correlations is available by extend-
ing the arguments substantiated in Sect. 2.2 to the ordinal data case.

2.4 Polytomous Case: Beyond Normality

When both variables are ordinalized, the connection between the polychoric correla-
tion ($\delta_{Y_1Y_2}^{POLY}$) and the ordinal phi coefficient ($\delta_{X_{1O}X_{2O}}$) can be established by a two-
stage scheme, in which we compute the normal, intermediate correlation ($\delta_{Z_1Z_2}^{POLY}$)
from the ordinal phi coefficient by the method in Ferrari and Barbiero (2012) (pre-
sented in Sect. 2.3) before we find the nonnormal polychoric correlation via the
power polynomials (Eq. 6). The other direction (computing $\delta_{X_{1O}X_{2O}}$ from $\delta_{Y_1Y_2}^{POLY}$)
can be implemented by executing the same steps in the reverse order. The associated
computational routines are presented in Sect. 3 (Algorithms-3a and -3b).


The correlational identity given in Eq. 9 holds for ordinalized data as well when
only one variable is ordinalized (Demirtas and Hedeker 2016); the ordinal version
of the equation can be written as $\delta_{X_{1O}Y_2} = \delta_{Y_1Y_2}^{PS}\, \delta_{X_{1O}Y_1}$. The same linearity and
constancy of ratio arguments equally apply in terms of the connection between the
polyserial ($\delta_{Y_1Y_2}^{PS}$) and point-polyserial ($\delta_{X_{1O}Y_1}$) correlations; the fundamental utility
and operational characteristics are parallel to the binary case. Once the ratio ($\delta_{X_{1O}Y_1}$)
is found by generating $Y_1$ and discretizing it to obtain $X_{1O}$, one can easily compute
either of these quantities given the other. This will be pertinent in Algorithm-4 below.
The next section puts all these concepts together from an algorithmic point of
view with numerical illustrations.
3 Algorithms and Illustrative Examples

We work with eight distributions to reflect some common shapes that can be encountered in real-life applications. The illustrative examples come from bivariate data with Weibull and Normal mixture marginals. In what follows, W and NM stand for Weibull and Normal mixture, respectively. The W density is f(y | γ, δ) = (δ/γ^δ) y^{δ−1} exp(−(y/γ)^δ) for y > 0, where γ > 0 and δ > 0 are the scale and shape parameters, respectively. The NM density is f(y | π, μ1, σ1, μ2, σ2) = (π/(σ1 √(2π))) exp(−(1/2)((y − μ1)/σ1)²) + ((1 − π)/(σ2 √(2π))) exp(−(1/2)((y − μ2)/σ2)²), where 0 < π < 1 is the mixing parameter. Since it is a mixture, it can be unimodal or bimodal. Depending on the choice of parameters, both distributions can take a variety of shapes. We use four sets of parameter specifications for each of these distributions: For the W distribution, (γ, δ) pairs are chosen to be (1, 1), (1, 1.2), (1, 3.6), and (1, 25), corresponding to mode at the boundary, positively skewed, nearly symmetric, and negatively skewed shapes, respectively. For the NM distribution, the parameter set (π, μ1, σ1, μ2, σ2) is set to (0.5, 0, 1, 3, 1), (0.6, 0, 1, 3, 1), (0.3, 0, 1, 2, 1), and (0.5, 0, 1, 2, 1), whose shapes are bimodal-symmetric, bimodal-asymmetric, unimodal-negatively skewed, and unimodal-symmetric, respectively.
These four variations of the W and NM densities are plotted in Fig. 1 (W/NM: the first/second columns) in the above order of parameter values, moving from top to bottom. Finally, as before, p1 and p2 represent the binary/ordinal proportions. In the binary case, they are single numbers. In the ordinal case, the marginal proportions are denoted as P(Xi = j) = pij for i = 1, 2 and j = 1, 2, ..., ki, and pi = (pi1, pi2, ..., piki), in which skip patterns are allowed. Furthermore, if the users wish to start the ordinal categories from 0 or any integer other than 1, the associational implications remain unchanged, as correlations are invariant to location shifts. Of note, the number of significant digits reported throughout the chapter varies by the computational sensitivity of the quantities.
Algorithm-1a: Computing the tetrachoric correlation (δ_{Y1Y2}^{TET}) from the phi coefficient (δ_{X1B X2B}): The algorithm for computing δ_{Y1Y2}^{TET} when δ_{X1B X2B}, p1, p2, and the key distributional characteristics of Y1 and Y2 (ν1 and ν2) are specified, is as follows:

70 H. Demirtas and C. Vardar-Acar


Fig. 1 Density functions of Weibull (first column) and Normal Mixture (second column) distributions for chosen parameter values that appear in the text

1. Solve Eq. 1 for δ_{Z1Z2}^{TET}.
2. Compute the power coefficients (a, b, c, d) for Y1 and Y2 by Eqs. 2–5.
3. Plug all quantities obtained in Steps 1–2 into Eq. 6, and solve for δ_{Y1Y2}^{TET}.
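Step 1 works entirely under normality and amounts to inverting the bivariate normal orthant probability. A minimal Python sketch (illustrative only; the chapter's own computations used the phi2tetra and pmvnorm functions in R) is:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def phi_from_tetrachoric(rho, p1, p2):
    """Phi coefficient of binary variables obtained by thresholding bivariate normals."""
    t1, t2 = norm.ppf(1 - p1), norm.ppf(1 - p2)        # X_iB = 1 when Z_i > t_i
    # P(Z1 > t1, Z2 > t2) equals the CDF of (-Z1, -Z2), same correlation, at (-t1, -t2)
    p11 = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]]).cdf([-t1, -t2])
    return (p11 - p1 * p2) / np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

def tetrachoric_from_phi(phi, p1, p2):
    """Invert the monotone map rho -> phi by root bracketing."""
    return brentq(lambda r: phi_from_tetrachoric(r, p1, p2) - phi, -0.999, 0.999)

rho_z = tetrachoric_from_phi(0.1, 0.85, 0.15)   # the example below reports 0.277
```

The requested phi must lie within the Fréchet–Hoeffding bounds implied by (p1, p2); otherwise the bracketing in tetrachoric_from_phi fails, which is the expected behavior for an infeasible specification.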
Suppose Y1 ∼ W(1, 1), Y2 ∼ NM(0.5, 0, 1, 3, 1), (p1, p2) = (0.85, 0.15), and δ_{X1B X2B} = 0.1. Solving for δ_{Z1Z2}^{TET} in Eq. 1 (Step 1) yields 0.277. The power coefficients (a, b, c, d) in Eqs. 2–5 (Step 2) turn out to be (−0.31375, 0.82632, 0.31375, 0.02271) and (−0.00004, 1.20301, 0.00004, −0.07305) for Y1 and Y2, respectively. Substituting these into Eq. 6 (Step 3) gives δ_{Y1Y2}^{TET} = 0.243. Similarly, for (p1, p2) = (0.10, 0.30) and δ_{X1B X2B} = 0.5, δ_{Z1Z2}^{TET} = 0.919 and δ_{Y1Y2}^{TET} = 0.801. The upper half of Table 2 includes a few more combinations.
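Assuming Eq. 6 is the usual third-order power-polynomial (Fleishman/Vale–Maurelli-type) relation between the normal and nonnormal correlations, the Step-3 substitution can be reproduced directly; the coefficient values below are the ones from the example above (signs restored per the a = −c constraint), and small discrepancies from the reported 0.243 come from rounded inputs:

```python
def nonnormal_corr(rho_z, coef1, coef2):
    """Map a normal correlation to the nonnormal scale via third-order
    power-polynomial coefficients (a, b, c, d) of the two margins (Eq. 6)."""
    _, b1, c1, d1 = coef1
    _, b2, c2, d2 = coef2
    return (rho_z * (b1 * b2 + 3 * b1 * d2 + 3 * d1 * b2 + 9 * d1 * d2)
            + rho_z ** 2 * (2 * c1 * c2)
            + rho_z ** 3 * (6 * d1 * d2))

coef_y1 = (-0.31375, 0.82632, 0.31375, 0.02271)    # Y1 ~ W(1, 1)
coef_y2 = (-0.00004, 1.20301, 0.00004, -0.07305)   # Y2 ~ NM(0.5, 0, 1, 3, 1)
delta = nonnormal_corr(0.277, coef_y1, coef_y2)    # close to the reported 0.243
```

Solving the reverse direction (Algorithm-1b, Step 2) means finding the real root of this cubic in the admissible range, e.g., with a polynomial root-finder.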
Algorithm-1b: Computing the phi coefficient (δ_{X1B X2B}) from the tetrachoric correlation (δ_{Y1Y2}^{TET}): The quantities that need to be specified are the same as in Algorithm-1a, and the steps are as follows:
1. Compute the power coefficients (a, b, c, d) for Y1 and Y2 by Eqs. 2–5.
2. Solve Eq. 6 for δ_{Z1Z2}^{TET}.
3. Plug δ_{Z1Z2}^{TET} into Eq. 1, and solve for δ_{X1B X2B}.
With the same pair of distributions, where Y1 ∼ W(1, 1) and Y2 ∼ NM(0.5, 0, 1, 3, 1), suppose (p1, p2) = (0.85, 0.15) and δ_{Y1Y2}^{TET} = −0.4. After solving for the


Table 2 Computed values of the tetrachoric correlation (δ_{Y1Y2}^{TET}) or the phi coefficient (δ_{X1B X2B}) when one of them is specified, with two sets of proportions for Y1 ∼ W(1, 1) and Y2 ∼ NM(0.5, 0, 1, 3, 1)

p1     p2     δ_{X1B X2B}    δ_{Z1Z2}^{TET}    δ_{Y1Y2}^{TET}
0.85   0.15   −0.6           −0.849            −0.755
              −0.3           −0.533            −0.472
               0.1            0.277             0.243
0.10   0.30   −0.2           −0.616            −0.540
               0.3            0.572             0.502
               0.5            0.919             0.801

p1     p2     δ_{Y1Y2}^{TET}    δ_{Z1Z2}^{TET}    δ_{X1B X2B}
0.85   0.15   −0.4              −0.456            −0.246
              −0.2              −0.227            −0.085
               0.6               0.685             0.173
0.10   0.30   −0.5              −0.570            −0.192
               0.1               0.114             0.052
               0.7               0.801             0.441
Fig. 2 δ_{X1B X2B} versus δ_{Y1Y2}^{TET} for Y1 ∼ W(1, 1) and Y2 ∼ NM(0.5, 0, 1, 3, 1), where solid, dashed, and dotted curves represent (p1, p2) = (0.85, 0.15), (0.10, 0.30), and (0.50, 0.50), respectively; the range differences are due to the Fréchet–Hoeffding bounds

power coefficients (Step 1), Steps 2 and 3 yield δ_{Z1Z2}^{TET} = −0.456 and δ_{X1B X2B} = −0.246, respectively. Similarly, when (p1, p2) = (0.10, 0.30) and δ_{Y1Y2}^{TET} = 0.7, δ_{Z1Z2}^{TET} = 0.801 and δ_{X1B X2B} = 0.441. The lower half of Table 2 includes a few more combinations. More comprehensively, Fig. 2 shows the comparative behavior of δ_{X1B X2B} and δ_{Y1Y2}^{TET} when the proportion pairs take three different values, with the addition of (p1, p2) = (0.50, 0.50) to the two pairs above, for this particular distributional setup.


Table 3 Computed values of ĉ1 = δ_{X1B Y1} and ĉ2 = δ_{X2B Y2} that connect the biserial (δ_{Y1Y2}^{BS}) and point-biserial (δ_{X1B Y2} or δ_{X2B Y1}) correlations, where Y1 ∼ W(1, 1.2) and Y2 ∼ NM(0.6, 0, 1, 3, 1)

p1 or p2    ĉ1      ĉ2
0.05        0.646   0.447
0.15        0.785   0.665
0.25        0.809   0.782
0.35        0.795   0.848
0.45        0.760   0.858
0.55        0.710   0.829
0.65        0.640   0.772
0.75        0.554   0.692
0.85        0.436   0.585
0.95        0.261   0.392

Algorithm-2: Computing the biserial correlation (δ_{Y1Y2}^{BS}) from the point-biserial correlation (δ_{X1B Y2}) and the other way around: One only needs to specify the distributional form of Y1 (the variable that is to be dichotomized) and the proportion (p1) for this algorithm (see Sect. 2.2). The steps are as follows:
1. Generate Y1 with a large number of data points (e.g., N = 100,000).
2. Dichotomize Y1 to obtain X1B through the specified value of p1, and compute the sample correlation, δ_{X1B Y1} = ĉ1.
3. Find δ_{X1B Y2} or δ_{Y1Y2}^{BS} using δ_{X1B Y2}/δ_{Y1Y2}^{BS} = ĉ1 (Eq. 9).
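The three steps above can be sketched in a few lines; this is an illustrative Python rendering (the chapter's code was in R, and the seed and sample size here are our own choices):

```python
import numpy as np

rng = np.random.default_rng(12345)
N = 100_000
# Step 1: a large sample from the latent continuous variable Y1 ~ W(1, 1.2)
y1 = rng.weibull(1.2, size=N)               # scale gamma = 1, shape delta = 1.2
# Step 2: dichotomize at the quantile giving E(X1B) = p1 = 0.55
p1 = 0.55
threshold = np.quantile(y1, 1 - p1)
x1b = (y1 > threshold).astype(float)
c1_hat = np.corrcoef(x1b, y1)[0, 1]         # sample point-biserial ratio c1-hat
# Step 3: move between the biserial and point-biserial correlations
delta_bs = 0.60                             # specified biserial correlation
delta_pb = c1_hat * delta_bs                # implied point-biserial (Eq. 9)
print(round(c1_hat, 3), round(delta_pb, 3))
```

For this configuration the illustration below reports ĉ1 = 0.710, so the Monte-Carlo estimate should land very close to that value.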
In this illustration, we assume that Y1 ∼ W(1, 1.2), Y2 ∼ NM(0.6, 0, 1, 3, 1), and δ_{Y1Y2}^{BS} = 0.60. Y1 is dichotomized to obtain X1B, where E(X1B) = p1 = 0.55. After following Steps 1 and 2, ĉ1 turns out to be 0.710, and accordingly δ_{X1B Y2} = ĉ1 δ_{Y1Y2}^{BS} = 0.426. Similarly, if the specified value of δ_{X1B Y2} is 0.25, then δ_{Y1Y2}^{BS} = δ_{X1B Y2}/ĉ1 = 0.352. The fundamental ideas remain the same if Y2 is dichotomized (with a proportion p2) rather than Y1. In that case, with a slight notational difference, the new equations would be δ_{X2B Y2} = ĉ2 and δ_{X2B Y1}/δ_{Y1Y2}^{BS} = ĉ2. Table 3 shows ĉ1 and ĉ2 values when p1 or p2 ranges between 0.05 and 0.95 with an increment of 0.10. We further generated bivariate continuous data with the above marginals and biserial correlations between −0.85 and 0.90 with an increment of 0.05. We then dichotomized Y1 where p1 is 0.15 or 0.95, and computed the empirical point-biserial correlation. The upper-left and upper-right graphs in Fig. 3 show the plot of the algorithmic value of ĉ1 in Step 2 and δ_{Y1Y2}^{BS} versus δ_{X1B Y2}, respectively, where the former is a theoretical quantity and δ_{X1B Y2} in the latter is an empirical one. As expected, the two ĉ1 values are the same as the slopes of the linear lines of δ_{Y1Y2}^{BS} versus δ_{X1B Y2}, lending support for how plausibly Algorithm-2 is working. The procedure is repeated under the assumption that Y2 is dichotomized rather than Y1 (lower graphs in Fig. 3).
Algorithm-3a: Computing the polychoric correlation (δ_{Y1Y2}^{POLY}) from the ordinal phi coefficient (δ_{X1O X2O}): The algorithm for computing δ_{Y1Y2}^{POLY} when δ_{X1O X2O}, p1,


Fig. 3 Plots of ĉ1 (upper-left), ĉ2 (lower-left), empirical δ_{X1B Y2} versus δ_{Y1Y2}^{BS} (upper-right), and δ_{X2B Y1} versus δ_{Y1Y2}^{BS} (lower-right) for p1 or p2 of 0.15 (shown by o) or 0.95 (shown by *), where Y1 ∼ W(1, 1.2) and Y2 ∼ NM(0.6, 0, 1, 3, 1)

p2, and the key distributional characteristics of Y1 and Y2 (ν1 and ν2) are specified, is as follows:
1. Use the method in Ferrari and Barbiero (2012), outlined in Sect. 2.3, for finding δ_{Z1Z2}^{POLY}.
2. Compute the power coefficients (a, b, c, d) for Y1 and Y2 by Eqs. 2–5.
3. Plug all quantities obtained in Steps 1–2 into Eq. 6, and solve for δ_{Y1Y2}^{POLY}.
obtained
Suppose Y1 ∼ W(1, 3.6), Y2 ∼ NM(0.3, 0, 1, 2, 1), (p1, p2) = ((0.4, 0.3, 0.2, 0.1), (0.2, 0.2, 0.6)), and δ_{X1O X2O} = −0.7. Solving for δ_{Z1Z2}^{POLY} in Step 1 yields −0.816. The power coefficients (a, b, c, d) in Eqs. 2–5 (Step 2) turn out to be (−0.00010, 1.03934, 0.00010, −0.01268) for Y1 and (0.05069, 1.04806, −0.05069, −0.02626) for Y2. Substituting these into Eq. 6 (Step 3) gives δ_{Y1Y2}^{POLY} = −0.813. Similarly, for (p1, p2) = ((0.1, 0.1, 0.1, 0.7), (0.8, 0.1, 0.1)) and δ_{X1O X2O} = 0.2, δ_{Z1Z2}^{POLY} = 0.441 and δ_{Y1Y2}^{POLY} = 0.439. The upper half of Table 4 includes a few more combinations.
Algorithm-3b: Computing the ordinal phi coefficient (δ_{X1O X2O}) from the polychoric correlation (δ_{Y1Y2}^{POLY}): The required quantities that need specification are the same as in Algorithm-3a, and the steps are as follows:


Table 4 Computed values of the polychoric correlation (δ_{Y1Y2}^{POLY}) or the ordinal phi coefficient (δ_{X1O X2O}) given the other, with two sets of proportions for Y1 ∼ W(1, 3.6) and Y2 ∼ NM(0.3, 0, 1, 2, 1)

p1                     p2                δ_{X1O X2O}    δ_{Z1Z2}^{POLY}    δ_{Y1Y2}^{POLY}
(0.4, 0.3, 0.2, 0.1)   (0.2, 0.2, 0.6)   −0.7           −0.816             −0.813
                                         −0.3           −0.380             −0.378
                                          0.4            0.546              0.544
                                          0.6            0.828              0.826
(0.1, 0.1, 0.1, 0.7)   (0.8, 0.1, 0.1)   −0.6           −0.770             −0.767
                                         −0.4           −0.568             −0.566
                                         −0.1           −0.167             −0.166
                                          0.2            0.441              0.439

p1                     p2                δ_{Y1Y2}^{POLY}    δ_{Z1Z2}^{POLY}    δ_{X1O X2O}
(0.4, 0.3, 0.2, 0.1)   (0.2, 0.2, 0.6)   −0.8               −0.802             −0.686
                                         −0.2               −0.201             −0.155
                                          0.5                0.502              0.368
                                          0.7                0.702              0.511
(0.1, 0.1, 0.1, 0.7)   (0.8, 0.1, 0.1)    0.8                0.802              0.638
                                         −0.2               −0.201             −0.122
                                          0.5                0.502              0.219
                                          0.7                0.702              0.263

1. Compute the power coefficients (a, b, c, d) for Y1 and Y2 by Eqs. 2–5.
2. Solve Eq. 6 for δ_{Z1Z2}^{POLY}.
3. Solve for δ_{X1O X2O} given δ_{Z1Z2}^{POLY} by the method in Ferrari and Barbiero (2012).
With the same set of specifications, namely, Y1 ∼ W(1, 3.6), Y2 ∼ NM(0.3, 0, 1, 2, 1), and (p1, p2) = ((0.4, 0.3, 0.2, 0.1), (0.2, 0.2, 0.6)), suppose δ_{Y1Y2}^{POLY} = 0.5. After solving for the power coefficients (Step 1), Steps 2 and 3 yield δ_{Z1Z2}^{POLY} = 0.502 and δ_{X1O X2O} = 0.368. Similarly, when (p1, p2) = ((0.1, 0.1, 0.1, 0.7), (0.8, 0.1, 0.1)) and δ_{Y1Y2}^{POLY} = 0.7, δ_{Z1Z2}^{POLY} = 0.702 and δ_{X1O X2O} = 0.263. The lower half of Table 4 includes a few more combinations. A more inclusive set of results is given in Fig. 4, which shows the relative trajectories of δ_{X1O X2O} and δ_{Y1Y2}^{POLY} when the proportion sets take three different values, with the addition of (p1, p2) = ((0.25, 0.25, 0.25, 0.25), (0.05, 0.05, 0.9)) to the two sets above.
Algorithm-4: Computing the polyserial correlation (δ_{Y1Y2}^{PS}) from the point-polyserial correlation (δ_{X1O Y2}) and the other way around: The following steps enable us to calculate either one of these correlations when the distribution of Y1 (the variable that is subsequently ordinalized) and the ordinal proportions (p1) are specified (see Sect. 2.4):


Fig. 4 δ_{X1O X2O} versus δ_{Y1Y2}^{POLY} for Y1 ∼ W(1, 3.6) and Y2 ∼ NM(0.3, 0, 1, 2, 1), where solid, dashed, and dotted curves represent (p1, p2) = ((0.4, 0.3, 0.2, 0.1), (0.2, 0.2, 0.6)), ((0.1, 0.1, 0.1, 0.7), (0.8, 0.1, 0.1)), and ((0.25, 0.25, 0.25, 0.25), (0.05, 0.05, 0.9)), respectively; the range differences are due to the Fréchet–Hoeffding bounds
1. Generate Y1 with a large number of data points (e.g., N = 100,000).
2. Ordinalize Y1 to obtain X1O through the specified value of p1, and compute the sample correlation, δ_{X1O Y1} = ĉ1.
3. Find δ_{X1O Y2} or δ_{Y1Y2}^{PS} using δ_{X1O Y2}/δ_{Y1Y2}^{PS} = ĉ1 (Eq. 9).
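These steps mirror Algorithm-2, with ordinalization in place of dichotomization. An illustrative Python sketch (again, not the chapter's R implementation; seed and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
y1 = rng.weibull(25, size=N)                   # Step 1: Y1 ~ W(1, 25)
p1 = np.array([0.4, 0.3, 0.2, 0.1])            # ordinal proportions
cuts = np.quantile(y1, np.cumsum(p1)[:-1])     # empirical category boundaries
x1o = np.searchsorted(cuts, y1) + 1            # Step 2: ordinal codes 1..4
c1_hat = np.corrcoef(x1o, y1)[0, 1]            # sample correlation = c1-hat
delta_ps = 0.6                                 # specified polyserial correlation
delta_pps = c1_hat * delta_ps                  # Step 3: point-polyserial (Eq. 9)
print(round(c1_hat, 3), round(delta_pps, 3))
```

Starting the ordinal codes at 0 instead of 1 would leave ĉ1 unchanged, since correlations are invariant to location shifts.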

For illustrative purposes, we assume that Y1 ∼ W(1, 25), Y2 ∼ NM(0.5, 0, 1, 2, 1), and δ_{Y1Y2}^{PS} = 0.6. Y1 is ordinalized to obtain X1O, where p1 = (0.4, 0.3, 0.2, 0.1). After following Steps 1 and 2, ĉ1 turns out to be 0.837, and accordingly δ_{X1O Y2} = ĉ1 δ_{Y1Y2}^{PS} = 0.502. Similarly, if the specified value of δ_{X1O Y2} is 0.3, then δ_{Y1Y2}^{PS} = δ_{X1O Y2}/ĉ1 = 0.358. The core ideas remain the same if Y2 is ordinalized (with a proportion p2) instead of Y1, in which case the new equations become δ_{X2O Y2} = ĉ2 and δ_{X2O Y1}/δ_{Y1Y2}^{PS} = ĉ2. Several ĉ1 and ĉ2 values are tabulated for corresponding p1 or p2 specifications in Table 5. Figure 5 provides a comparison between the theoretical (suggested by Algorithm-4) and empirical point-polyserial correlations (δ_{X1O Y2}) for specified polyserial correlation (δ_{Y1Y2}^{PS}) values in the range of −0.95 to 0.95 (upper graph), along with the scatter plot of the differences between the two quantities (lower graph). We first generated bivariate continuous data using the above distributional assumptions, then ordinalized Y1, computed the empirical post-discretization correlations, and made a comparison that is shown in the two graphs in Fig. 5, which collectively suggest that the procedure is working properly.
Some operational remarks: All computing work has been done in the R software (R Development Core Team 2016). In the algorithms that involve the power polynomials, a stand-alone computer code in Demirtas and Hedeker (2008a) was used to solve the system of equations. More sophisticated programming implementations such as the fleishman.coef function in the BinNonNor package (Inan and Demirtas 2016), the Param.fleishman function in the PoisNonNor package (Demirtas et al. 2016b), and the Fleishman.coef.NN function in the BinOrdNonNor package (Demirtas et al.


Table 5 Computed values of ĉ1 = δ_{X1O Y1} and ĉ2 = δ_{X2O Y2} that connect the polyserial (δ_{Y1Y2}^{PS}) and point-polyserial (δ_{X1O Y2} or δ_{X2O Y1}) correlations, where Y1 ∼ W(1, 25) and Y2 ∼ NM(0.5, 0, 1, 2, 1)

p1                     p2                ĉ1      ĉ2
(0.4, 0.3, 0.2, 0.1)   –                 0.837   –
(0.1, 0.2, 0.3, 0.4)   –                 0.927   –
(0.1, 0.4, 0.4, 0.1)   –                 0.907   –
(0.4, 0.1, 0.1, 0.4)   –                 0.828   –
(0.7, 0.1, 0.1, 0.1)   –                 0.678   –
–                      (0.3, 0.4, 0.3)   –       0.914
–                      (0.6, 0.2, 0.2)   –       0.847
–                      (0.1, 0.8, 0.1)   –       0.759
–                      (0.1, 0.1, 0.8)   –       0.700
–                      (0.4, 0.4, 0.2)   –       0.906
Fig. 5 The plot of the theoretical (x axis) versus empirical (y axis) point-polyserial correlations (δ_{X1O Y2}) given the specified polyserial correlation (δ_{Y1Y2}^{PS}) values (upper graph) and the scatter plot of the differences between the two quantities (lower graph), where Y1 ∼ W(1, 25) and Y2 ∼ NM(0.5, 0, 1, 2, 1)

2016c) can also be employed. The root of the third order polynomials in Eq. 6 was found by the polyroot function in the base package. The tetrachoric correlation and the phi coefficient in Eq. 1 were computed by the phi2tetra function in the psych package (Revelle 2016) and the pmvnorm function in the mvtnorm package (Genz et al. 2016), respectively. Finding the polychoric correlation given the ordinal phi coefficient, and the opposite direction, were performed by the ordcont and contord functions in the GenOrd package (Barbiero and Ferrari 2015), respectively.


4 Simulations in a Multivariate Setting

By the problem definition and design, all development has been presented in bivariate settings. For assessing how the algorithms work in a broader multivariate context and for highlighting the generality of our approach, we present two simulation studies that involve the specification of either pre- or post-discretization correlations.
Simulation work is devised around five continuous variables, and four of these are subsequently dichotomized or ordinalized. Referring to the Weibull and Normal mixture densities in the illustrative examples, the distributional forms are as follows: Y1 ∼ W(1, 3.6), Y2 ∼ W(1, 1.2), Y3 ∼ NM(0.3, 0, 1, 2, 1), Y4 ∼ NM(0.5, 0, 1, 2, 1), and Y5 ∼ NM(0.5, 0, 1, 3, 1). Y1, ..., Y4 are to be discretized with proportions p1 = 0.6, p2 = 0.3, p3 = (0.4, 0.2, 0.2, 0.2), and p4 = (0.1, 0.6, 0.3), respectively. Two dichotomized (Y1 and Y2), two ordinalized (Y3 and Y4), and one continuous (Y5) variables form a sufficient environment, in which all types of correlations mentioned in this work are covered. For simplicity, we only indicate if the correlations are pre- or post-discretization quantities without distinguishing between different types in terms of naming and notation in this section. We investigate both directions: (1) The pre-discretization correlation matrix is specified; the theoretical (algorithmic) post-discretization quantities were computed; data were generated, discretized with the prescription guided by the proportions, and empirical correlations were found via n = 1,000 simulation replicates to see how closely the algorithmic and empirical values are aligned on average. (2) The post-discretization matrix is specified; correlations among latent variables were computed via the algorithms; data were generated with this correlation matrix; then the data were dichotomized or ordinalized to gauge if we obtain the specified post-discretization correlations on average. In Simulation 1, the pre-discretization correlation matrix (Σ_pre), representing the correlation structure among continuous variables, is defined as
 1 00 0 14 −0 32 0 56 0 54 
. . . . .
 0 14 1 00 −0 10 0 17 0 17 
. . . . .
Σ pr e =  −0 32 −0 10 1 00 −0 40 −0 38 
. . . . . ,
 0 56 0 17 −0 40 1 00 0 67 
. . . . .
0.54 0.17 −
0.38 0.67 1.00

where the variables follow the order of (Y1, ..., Y5). Let Σ[i, j] denote the correlation between variables i and j, where i, j = 1, ..., 5. The theoretical post-discretization values (under the assumption that the algorithms function properly and yield the true values) were computed. More specifically, Σ_post[1, 2] was found by Algorithm-1b, Σ_post[1, 5] and Σ_post[2, 5] by Algorithm-2, Σ_post[1, 3], Σ_post[1, 4], Σ_post[2, 3], Σ_post[2, 4], and Σ_post[3, 4] by Algorithm-3b, and Σ_post[3, 5] and Σ_post[4, 5] by Algorithm-4. These values collectively form a post-discretization correlation matrix (Σ_post), which serves as the True Value (TV). The empirical post-discretization correlation estimates were calculated after generating N = 1,000 rows


Table 6 Results of Simulation 1 (the reported quantities are defined in the text)

Parameter   Σ_pre    TV (Σ_post)   AE         RB        PB     SB
Σ[1, 2]      0.14     0.08942      0.08934    0.00008   0.09   0.03
Σ[1, 3]     −0.32    −0.22936     −0.23385    0.00449   1.96   1.40
Σ[1, 4]      0.56     0.38946      0.39128    0.00182   0.47   0.72
Σ[1, 5]      0.54     0.43215      0.43425    0.00210   0.49   0.84
Σ[2, 3]     −0.10    −0.07420     −0.07319    0.00101   1.36   0.32
Σ[2, 4]      0.17     0.12175      0.12073    0.00102   0.84   0.33
Σ[2, 5]      0.17     0.13681      0.13883    0.00202   1.48   0.65
Σ[3, 4]     −0.40    −0.31908     −0.31922    0.00014   0.04   0.05
Σ[3, 5]     −0.38    −0.34082     −0.34671    0.00589   1.73   2.15
Σ[4, 5]      0.67     0.58528      0.58773    0.00245   0.42   1.28

of multivariate latent, continuous data (Y1, ..., Y5) by the specified Σ_pre, followed by discretization of (Y1, ..., Y4). The whole process was repeated n = 1,000 times.
We evaluated the quality of estimates by three commonly accepted accuracy measures: (a) Raw Bias (RB), (b) Percentage Bias (PB), and (c) Standardized Bias (SB) (Demirtas 2004a, 2007a, b, 2008). They all are functions of the average estimate (AE), and their definitions are well-established: When the parameter of interest is δ, RB = |E[δ̂ − δ]| (absolute average deviation), PB = 100 ∗ |E[δ̂ − δ]/δ| (absolute average deviation as a percentage of the true value), and SB = 100 ∗ |E[δ̂ − δ]|/V[δ̂]^{1/2} (absolute average deviation with respect to the overall uncertainty in the system). A procedure is typically regarded as working properly if PB < 5 and SB < 50 (Demirtas et al. 2007). In Table 6, we tabulate Σ_pre, TV (Σ_post), AE, RB, PB, and SB. All the three accuracy quantities demonstrate negligibly small deviations from the true values; they are within acceptable limits, suggesting that the set of algorithms provides unbiased estimates.
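The three accuracy measures are direct to code; a small helper (our naming, mirroring the definitions above) would be:

```python
import numpy as np

def accuracy_measures(estimates, true_value):
    """Average estimate plus Raw, Percentage, and Standardized Bias
    for a vector of replicate estimates of a parameter delta."""
    est = np.asarray(estimates, dtype=float)
    ae = est.mean()                           # AE: average estimate
    dev = abs(ae - true_value)                # |E[delta_hat - delta]|
    rb = dev                                  # Raw Bias
    pb = 100 * dev / abs(true_value)          # Percentage Bias
    sb = 100 * dev / est.std(ddof=1)          # Standardized Bias
    return ae, rb, pb, sb
```

Feeding it the n = 1,000 replicate estimates of any Σ[i, j] entry reproduces the corresponding row of Tables 6 and 7.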
In Simulation 2, we take the reverse route by specifying the post-discretization correlation matrix (Σ_post), which serves as the True Value (TV), in the following way:

Σ_post =
    [  1.00   0.24   0.18   0.10   0.38 ]
    [  0.24   1.00   0.20  −0.11   0.42 ]
    [  0.18   0.20   1.00  −0.07   0.29 ]
    [  0.10  −0.11  −0.07   1.00  −0.16 ]
    [  0.38   0.42   0.29  −0.16   1.00 ]

The corresponding pre-discretization matrix was found via the algorithms. The theoretical Σ_pre[1, 2] was computed by Algorithm-1a, Σ_pre[1, 5] and Σ_pre[2, 5] by Algorithm-2, Σ_pre[1, 3], Σ_pre[1, 4], Σ_pre[2, 3], Σ_pre[2, 4], and Σ_pre[3, 4] by Algorithm-3a, and Σ_pre[3, 5] and Σ_pre[4, 5] by Algorithm-4. These values jointly form a pre-discretization correlation matrix (Σ_pre). The empirical post-discretization


Table 7 Results of Simulation 2 (the reported quantities are defined in the text)

Parameter   TV (Σ_post)   Σ_pre      AE         RB        PB     SB
Σ[1, 2]      0.24          0.40551   0.24000    0.00000   0.00   0.00
Σ[1, 3]      0.18          0.24916   0.17727    0.00273   1.52   0.90
Σ[1, 4]      0.10          0.14551   0.10041    0.00041   0.41   0.13
Σ[1, 5]      0.38          0.47484   0.38022    0.00022   0.06   0.08
Σ[2, 3]      0.20          0.29496   0.20729    0.00729   3.64   2.44
Σ[2, 4]     −0.11         −0.15954  −0.10601    0.00399   3.63   1.27
Σ[2, 5]      0.42          0.52188   0.43347    0.01347   3.21   5.50
Σ[3, 4]     −0.07         −0.08884  −0.07001    0.00001   0.01   0.01
Σ[3, 5]      0.29          0.32334   0.29443    0.00443   1.53   1.58
Σ[4, 5]     −0.16         −0.18316  −0.16004    0.00004   0.02   0.01

correlation estimates were calculated after generating N = 1,000 rows of multivariate latent, continuous data (Y1, ..., Y5) by the computed Σ_pre before discretization of (Y1, ..., Y4). As before, this process is repeated n = 1,000 times. In Table 7, we tabulate TV (Σ_post), Σ_pre, AE, RB, PB, and SB. Again, the discrepancies between the expected and empirical quantities are minimal by the three accuracy criteria, providing substantial support for the proposed method.
These results indicate compelling and promising evidence in favor of the algorithms herein. Our evaluation is based on accuracy (unbiasedness) measures. Precision is another important criterion in terms of the quality and performance of the estimates (Demirtas et al. 2008; Demirtas and Hedeker 2008b; Yucel and Demirtas 2010). We address the precision issues by plotting the correlation estimates across all simulation replicates in both scenarios (Fig. 6). The estimates closely match the true values shown in Tables 6 and 7, with a healthy amount of variation that is within the limits of Monte-Carlo simulation error. On a cautious note, however, there seems to be slightly more variation in Simulation 2, which is natural since there are two layers of randomness (an additional source of variability).

5 Discussion

If the discretization thresholds and underlying continuous measurements are available, one can easily compute all types of correlations that appear herein. The scale of the work in this chapter is far broader, as it is motivated by and oriented towards computing the different correlational magnitudes before and after discretization when one of these quantities is specified in the context of simulation and RNG, in both directions.
The set of proposed techniques is driven by the idea of augmenting the normal-
based results concerning different types of correlations to any bivariate continuous


Fig. 6 The trace plot of correlation estimates for Simulation 1 (left graph) and Simulation 2 (right graph) across n = 1,000 replicates; they closely match the true values shown in Tables 6 and 7

setting. Nonnormality is handled by the power polynomials that map the normal and
nonnormal correlations. The approach works as long as the marginal characteristics
(skewness and elongation parameters for continuous data and proportion values for
binary/ordinal data) and the degree of linear association between the two variables
are legitimately defined, regardless of the shape of the underlying bivariate contin-
uous density. When the above-mentioned quantities are specified, one can connect
correlations before and after discretization in a relatively simple manner.
One potential limitation is that power polynomials cover most of the feasible symmetry–peakedness plane (ν2 ≥ ν1² − 2), but not entirely. In an attempt to span a larger space, one can utilize the fifth order polynomial systems (Demirtas 2017; Headrick 2002), although it may not constitute an ultimate solution. In addition, a minor concern could be that unlike binary data, the marginal proportions and the second order product moment (correlation) do not fully define the joint distribution for ordinal data. In other words, odds ratios and correlations do not uniquely determine each other. However, in the overwhelming majority of applications, the specification of the first and second order moments suffices for practical purposes; and given the scope
of this work, which is modeling the transition between different pairs of correlations,
this complication is largely irrelevant. Finally, the reasons we base our algorithms
on the Pearson correlation (rather than the Spearman correlation) are that it is much
more common in RNG context and in practice; and the differences between the two

Anatomy of Correlational Magnitude Transformations in Latency … 81

are negligibly small in most cases. Extending this method for encompassing the
Spearman correlation will be taken up in future work resorting to a variation of the
sorting idea that appeared in Demirtas and Hedeker (2011), allowing us to capture
any monotonic relationship in addition to the linear relationships. On a related note,
further expansions can be imagined to accommodate more complex associations that
involve higher order moments.
The positive characteristics and salient advantages of these algorithms are as
follows:
• They work for an extensive class of underlying bivariate latent distributions
whose components are allowed to be non-identically distributed. Nearly all contin-
uous shapes and skip patterns for ordinal variables are permissible.
• The required software tools for the implementation are rather basic; users merely
need a computational platform with numerical double integration solver for the
binary-binary case, univariate RNG capabilities for the binary/ordinal-continuous
case, an iterative scheme that connects the polychoric correlations and the ordi-
nal phi coefficients under the normality assumption for the ordinal-ordinal case, and a
polynomial root-finder and a nonlinear equations set solver to handle nonnormal
continuous variables.

• The description of the connection between the two correlations is naturally given
for the bivariate case. The multivariate extension is easily manageable by assembling
the individual correlation entries. The way the techniques work is independent of the
number of variables; the curse of dimensionality is not an issue.
• The algorithms could be conveniently used in meta-analysis domains where
some studies discretize variables and some others do not.
• Assessing the magnitude of change in correlations before and after ordinaliza-
tion is likely to be contributory in simulation studies where we replicate the speci-
fied trends especially when simultaneous access to the latent data and the eventual
binary/ordinal data is desirable.
• One can more rigorously fathom the nature of discretization in the sense of
knowing how the correlation structure is transformed after dichotomization or ordi-
nalization.
• The proposed procedures can be regarded as a part of sensible RNG mecha-
nisms to generate multivariate latent variables as well as subsequent binary/ordinal
variables given their marginal shape characteristics and associational structure in
simulated environments, potentially expediting the development of novel mixed
data generation routines, especially when an RNG routine is structurally involved
with generating multivariate continuous data as an intermediate step. In conjunc-
tion with the published works on joint binary/normal (Demirtas and Doganay 2012),
binary/nonnormal continuous (Demirtas et al. 2012), ordinal/normal (Demirtas and
Yavuz 2015), count/normal (Amatya and Demirtas 2015), and multivariate ordinal
data generation (Demirtas 2006), the ideas presented in this chapter might serve as
a milestone for concurrent mixed data generation schemes that span binary, ordinal,
count, and nonnormal continuous data.

• These algorithms may be instrumental in developing multiple imputation strate-
gies for mixed longitudinal or clustered data as a generalization of the incomplete

82 H. Demirtas and C. Vardar-Acar

data methods published in Demirtas and Schafer (2003), Demirtas (2004b, 2005),
and Demirtas and Hedeker (2007, 2008c). Concomitantly, they can be helpful in
improving rounding techniques in multiple imputation (Demirtas 2009, 2010).
Wrapping it up, this work is inspired by the development of algorithms that are
designed to model the magnitude of change in correlations when discretization is
employed. In this regard, the proposed algorithms could be of working functionality
in identifying the relationships between different types of correlations before and
after discretization, and have noteworthy advantages for simulation purposes. As a
final note, a software implementation of the algorithms can be accessed through the
recent R package CorrToolBox (Allozi and Demirtas 2016).

References
Allozi, R., & Demirtas, H. (2016). Modeling Correlational Magnitude Transformations in
Discretization Contexts, R package CorrToolBox. https://cran.r-project.org/web/packages/
CorrToolBox.
Amatya, A., & Demirtas, H. (2015). Simultaneous generation of multivariate mixed data
with Poisson and normal marginals. Journal of Statistical Computation and Simulation , 85,
3129–3139.
Barbiero, A., & Ferrari, P.A. (2015). Simulation of Ordinal and Discrete Variables with Given
Correlation Matrix and Marginal Distributions, R package GenOrd. https://cran.r-project.org/
web/packages/GenOrd.
Cario, M. C., & Nelson, B. R. (1997). Modeling and generating random vectors with arbitrary
marginal distributions and correlation matrix (Technical Report). Department of Industrial Engi-
neering and Management Services: Northwestern University, Evanston, IL, USA.
Demirtas, H. (2004a). Simulation-driven inferences for multiply imputed longitudinal datasets.
Statistica Neerlandica, 58, 466–482.
Demirtas, H. (2004b). Assessment of relative improvement due to weights within generalized esti-
mating equations framework for incomplete clinical trials data. Journal of Biopharmaceutical
Statistics, 14, 1085–1098.
Demirtas, H. (2005). Multiple imputation under Bayesianly smoothed pattern-mixture models for
non-ignorable drop-out. Statistics in Medicine, 24, 2345–2363.
Demirtas, H. (2006). A method for multivariate ordinal data generation given marginal distributions
and correlations. Journal of Statistical Computation and Simulation, 76 , 1017–1025.
Demirtas, H. (2007a). Practical advice on how to impute continuous data when the ultimate inter-
est centers on dichotomized outcomes through pre-specified thresholds. Communications in
Statistics-Simulation and Computation, 36 , 871–889.
Demirtas, H. (2007b). The design of simulation studies in medical statistics. Statistics in Medicine,
26, 3818–3821.
Demirtas, H. (2008). On imputing continuous data when the eventual interest pertains to ordinalized
outcomes via threshold concept. Computational Statistics and Data Analysis, 52 , 2261–2271.
Demirtas, H. (2009). Rounding strategies for multiply imputed binary data. Biometrical Journal,
51, 677–688.
Demirtas, H. (2010). A distance-based rounding strategy for post-imputation ordinal data. Journal
of Applied Statistics, 37 , 489–500.
Demirtas, H. (2016). A note on the relationship between the phi coefficient and the tetrachoric
correlation under nonnormal underlying distributions. American Statistician, 70, 143–148.


Demirtas, H. (2017). Concurrent generation of binary and nonnormal continuous data through
fifth order power polynomials. Communications in Statistics - Simulation and Computation, 46,
344–357.
Demirtas, H., Ahmadian, R., Atis, S., Can, F. E., & Ercan, I. (2016a). A nonnormal look at polychoric
correlations: Modeling the change in correlations before and after discretization. Computational
Statistics, 31, 1385–1401.
Demirtas, H., Arguelles, L. M., Chung, H., & Hedeker, D. (2007). On the performance of bias-
reduction techniques for variance estimation in approximate Bayesian bootstrap imputation.
Computational Statistics and Data Analysis, 51 , 4064–4068.
Demirtas, H., & Doganay, B. (2012). Simultaneous generation of binary and normal data with
specified marginal and association structures. Journal of Biopharmaceutical Statistics, 22, 223–
236.
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assumption
when multiply imputing non-Gaussian continuous outcomes: A simulation assessment. Journal
of Statistical Computation and Simulation , 78 , 69–84.
Demirtas, H., & Hedeker, D. (2007). Gaussianization-based quasi-imputation and expansion strate-
gies for incomplete correlated binary responses. Statistics in Medicine, 26, 782–799.
Demirtas, H., & Hedeker, D. (2008a). Multiple imputation under power polynomials. Communica-
tions in Statistics- Simulation and Computation, 37 , 1682–1695.
Demirtas, H., & Hedeker, D. (2008b). Imputing continuous data under some non-Gaussian distrib-
utions. Statistica Neerlandica, 62, 193–205.
Demirtas, H., & Hedeker, D. (2008c). An imputation strategy for incomplete longitudinal ordinal
data. Statistics in Medicine, 27, 4086–4093.
Demirtas, H., & Hedeker, D. (2011). A practical way for computing approximate lower and upper
correlation bounds. The American Statistician, 65, 104–109.
Demirtas, H., & Hedeker, D. (2016). Computing the point-biserial correlation under any underlying
continuous distribution. Communications in Statistics- Simulation and Computation, 45 , 2744–
2751.

Demirtas, H., Hedeker, D., & Mermelstein, J. M. (2012). Simulation of massive public health data
by power polynomials. Statistics in Medicine, 31, 3337–3346.
Demirtas, H., & Schafer, J. L. (2003). On the performance of random-coefficient pattern-mixture
models for non-ignorable drop-out. Statistics in Medicine , 22 , 2553–2575.
Demirtas, H., Shi, Y., & Allozi, R. (2016b). Simultaneous generation of count and continuous data,
R package PoisNonNor. https://cran.r-project.org/web/packages/PoisNonNor.
Demirtas, H., Wang, Y., & Allozi, R. (2016c). Concurrent generation of binary, ordinal and contin-
uous data, R package BinOrdNonNor. https://cran.r-project.org/web/packages/BinOrdNonNor.
Demirtas, H., & Yavuz, Y. (2015). Concurrent generation of ordinal and normal data. Journal of
Biopharmaceutical Statistics, 25, 635–650.
Emrich, J. L., & Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate
binary variates. The American Statistician, 45 , 302–304.
Farrington, D. P., & Loeber, R. (2000). Some benefits of dichotomization in psychiatric and crimi-
nological research. Criminal Behaviour and Mental Health , 10 , 100–122.
Ferrari, P. A., & Barbiero, A. (2012). Simulating ordinal data. Multivariate Behavioral Research,
47, 566–589.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43,
521–532.
Fréchet, M. (1951). Sur les tableaux de corrélation dont les marges sont données. Annales de
l’Université de Lyon Section A, 14 , 53–77.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., Bornkamp, B., Maechler, M., &
Hothorn, T. (2016). Multivariate normal and t distributions, R package mvtnorm. https://cran.r-
project.org/web/packages/mvtnorm.
Headrick, T. C. (2002). Fast fifth-order polynomial transforms for generating univariate and multi-
variate nonnormal distributions. Computational Statistics and Data Analysis, 40 , 685–711.


Headrick, T. C. (2010). Statistical Simulation: Power Method Polynomials and Other Transforma-
tions. Boca Raton, FL: Chapman and Hall/CRC.
Hoeffding, W. (1994). Scale-invariant correlation theory. In N. I. Fisher & P. K. Sen (Eds.), The
Collected Works of Wassily Hoeffding (the original publication year is 1940) (pp. 57–107). New
York: Springer.
Inan, G., & Demirtas, H. (2016). Data generation with binary and continuous non-normal compo-
nents, R package BinNonNor. https://cran.r-project.org/web/packages/BinNonNor.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of
dichotomization of quantitative variables. Psychological Methods, 7 , 19–40.
R Development Core Team. (2016). R: A Language and Environment for Statistical Computing.
http://www.cran.r-project.org.
Revelle, W. (2016). Procedures for psychological, psychometric, and personality research, R pack-
age psych. https://cran.r-project.org/web/packages/psych.
Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychome-
trika, 48 , 465–471.
Yucel, R. M., & Demirtas, H. (2010). Impact of non-normal random effects on inference by multiple
imputation: A simulation assessment. Computational Statistics and Data Analysis, 54, 790–801.
Monte-Carlo Simulation of Correlated
Binary Responses

Trent L. Lalonde

Abstract Simulation studies can provide powerful conclusions for correlated or


longitudinal response data, particularly for relatively small samples for which asymp-
totic theory does not apply. For the case of logistic modeling, it is necessary to have
appropriate methods for simulating correlated binary data along with associated
predictors. This chapter presents a discussion of existing methods for simulating
correlated binary response data, including comparisons of various methods for dif-
ferent data types, such as longitudinal versus clustered binary data generation. The
purposes and issues associated with generating binary responses are discussed. Sim-
ulation methods are divided into four main approaches: using a marginally specified
joint probability distribution, using mixture distributions, dichotomizing non-binary
random variables, and using a conditionally specified distribution. Approaches using
a completely specified joint probability distribution tend to be more computationally
intensive and require determination of distributional properties. Mixture methods can
involve mixtures of discrete variables only, mixtures of continuous variables only,
and mixtures involving both continuous and discrete variables. Methods that involve
discretizing non-binary variables most commonly use normal or uniform variables,
but some use count variables such as Poisson random variables. Approaches using a
conditional specification of the response distribution are the most general, and allow
for the greatest range of autocorrelation to be simulated. The chapter concludes with
a discussion of implementations available using R software.

1 Introduction

Correlated binary data occur frequently in practice, across disciplines such as health
policy analysis, clinical biostatistics, econometric analyses, and education research.
For example, health policy researchers may record whether or not members of a
household have health insurance; econometricians may be interested in whether small

T.L. Lalonde (B)


Department of Applied Statistics and Research Methods, University of Northern Colorado,
Greeley, CO, USA
e-mail: trent.lalonde@unco.edu

© Springer Nature Singapore Pte Ltd. 2017 85


D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical
Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_5


businesses within various urban districts have applied for financial assistance; higher
education researchers might study the probabilities of college attendance for students
from a number of high schools. In all of these cases the response of interest can be
represented as a binary outcome, with a reasonable expectation of autocorrelation
among those responses. Correspondingly, analysis of correlated binary outcomes
has received considerable and long-lasting attention in the literature (Stiratelli et al.
1984; Zeger and Liang 1986; Prentice 1988; Lee and Nelder 1996; Molenberghs and
Verbeke 2006). The most common models fall under the class of correlated binary
logistic regression modeling.
While appropriately developed logistic regression models include asymptotic esti-
mator properties, Monte-Carlo simulation can be used to augment the theoretical
results of such large-sample distributional properties. Monte-Carlo simulation can be
used to confirm such properties, and perhaps more importantly, simulation methods
can be used to complement large-sample distributional properties with small-sample
results. Therefore it is important to be able to simulate binary responses with speci-
fied autocorrelation so that correlated binary data models can benefit from simulation
studies.
Throughout the chapter, the interest will be in simulating correlated binary out-
comes, Yi j , where i indicates a cluster of correlated responses and j enumerates the
responses. It will be assumed that the simulated outcomes have specified marginal
probabilities, πij, and pairwise autocorrelation, ρij,ik. The term “cluster” will be used
to refer to a homogenous group of responses known to have autocorrelation, such
as an individual in a longitudinal study or a group in a correlated study. For many

of the algorithms presented in this discussion, a single cluster will be considered for
simplicity, in which case the first subscript of the Yij will be omitted and Yi, πi, and
ρi j will be used instead. Some methods will additionally require the specification of
joint probabilities, higher-order correlations, or predictors.

1.1 Binary Data Issues

There are a number of important factors to consider when developing or evaluating a


correlated data simulation technique. Common issues include computational feasi-
bility or simplicity, the incorporation of predictors or covariates, variation of parame-
ters between and within clusters, and the ability to effectively control expectations,
variation, and autocorrelation. For binary data in particular, there are a number of
additional concerns that are not relevant to the more traditional normal, continuous
data generation. Simulation of binary responses is typically based on a probability
of interest, or the expectation of the Bernoulli distribution. Many authors have noted
that the pairwise joint probabilities for binary data, πi,j = P(Yi = 1, Yj = 1), are
restricted by the marginal probabilities, according to,

max(0, πi + πj − 1) ≤ πi,j ≤ min(πi, πj),


which imposes restrictions on joint distributions according to the desired marginal


probabilities. In addition, it is necessary to properly account for the inherent mean-
variance relationship associated with Bernoulli data, Var(Yi) = πi(1 − πi), imposing
further restrictions on higher-order moments based on the desired marginal proba-
bilities.
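Both constraints are easy to evaluate directly. The sketch below is my own illustration (the function names are not from the chapter); it computes the admissible range for the pairwise joint probability and the Bernoulli variance:

```python
def joint_probability_bounds(pi_i: float, pi_j: float) -> tuple:
    """Bounds on P(Y_i = 1, Y_j = 1) implied by the marginal probabilities
    of two binary variables: max(0, pi_i + pi_j - 1) <= pi_ij <= min(pi_i, pi_j)."""
    return max(0.0, pi_i + pi_j - 1.0), min(pi_i, pi_j)

def bernoulli_variance(pi_i: float) -> float:
    """Mean-variance relationship for Bernoulli data: Var(Y_i) = pi_i (1 - pi_i)."""
    return pi_i * (1.0 - pi_i)

print(joint_probability_bounds(0.5, 0.75))  # (0.25, 0.5)
print(bernoulli_variance(0.5))              # 0.25
```

Any candidate joint distribution whose pairwise probabilities leave this interval is incompatible with the requested marginals.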
Many methods of simulating correlated binary data suffer from restricted ranges of
produced autocorrelation, which should typically span (−1, 1). For correlated binary
outcomes Y1, . . . , YN with marginal expectations π1, . . . , πN, Prentice (1988) argued
that the pairwise correlation between any two responses Yi and Y j must lie within
the range (l , u ), where

l = max{ −√[πi πj / ((1 − πi)(1 − πj))], −√[(1 − πi)(1 − πj) / (πi πj)] },
u = min{ √[πi (1 − πj) / (πj (1 − πi))], √[πj (1 − πi) / (πi (1 − πj))] },        (1)

to satisfy the requirements for the joint distribution. This implies that, depending
on the desired marginal probabilities, using a fully specified joint probability distri-
bution can lead to simulated values with restricted ranges of pairwise correlation.
The restrictions of Eq. 1 can be counterintuitive to researchers who are used to the
unconstrained correlation values of normal variables.
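To make Eq. 1 concrete, the following sketch (my own illustration, not code from the chapter) evaluates the bounds (l, u) for a pair of marginal probabilities:

```python
import math

def correlation_bounds(pi_i: float, pi_j: float) -> tuple:
    """Feasible range (l, u) of the pairwise correlation between two binary
    variables with marginal expectations pi_i and pi_j, following Eq. 1."""
    qi, qj = 1.0 - pi_i, 1.0 - pi_j
    l = max(-math.sqrt((pi_i * pi_j) / (qi * qj)),
            -math.sqrt((qi * qj) / (pi_i * pi_j)))
    u = min(math.sqrt((pi_i * qj) / (pi_j * qi)),
            math.sqrt((pi_j * qi) / (pi_i * qj)))
    return l, u

# Equal marginals of 0.5 permit the full correlation range ...
print(correlation_bounds(0.5, 0.5))   # (-1.0, 1.0)
# ... while unequal marginals restrict it.
print(correlation_bounds(0.2, 0.7))
```

As the second call illustrates, the more the two marginal probabilities differ, the narrower the attainable correlation range becomes.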
Some methods of simulating correlated binary outcomes struggle to control
changes in probabilities across clusters of correlated data. Due to the typically non-
linear nature of the relationships between response probabilities and predictors, many
met hodss also fail
fail to incorp
incorpora
orate
te pred
predict
ictors
ors into
into dat
dataa simulati
simulation.
on. These
These issues
issues,, among
among
others, must be considered when developing and selecting a method for simulating
correlated binary data.
This chapter presents a thorough discussion of existing methods for simulating
correlated binary data. The methods are broadly categorized into four groups: cor-
related binary outcomes produced directly from a fully specified joint probability
distribution, from mixtures of discrete or continuous variables, from dichotomized
continuous or count variables, and from conditional probability distributions. The
oldest literature is available for fully specified joint probability distributions, but often
these methods are computationally intensive and require specification of higher-order
probabilities or correlations at the outset. Mixture approaches have been used to com-
bine products of binary variables to induce the desired autocorrelation, while other
mixtures involving continuous variables require dichotomization of the resulting val-
ues. Dichotomizing normal or uniform variables is a widely implemented approach
to producing binary data, although less well-known approaches have also been pur-
sued, such as dichotomizing counts. Conditional specification of a binary distribution
to lead to the greatest range of simulated autocorrelation. Each of these methods for
generating correlated binary data will be discussed through a chronological perspec-


tive, with some detail provided and some detail left to the original publications. The
chapter concludes with general recommendations for the most effective binary data
simulation methods.

2 Fully Specified Joint Probability Distributions

The method for simulating correlated binary outcomes with the longest-tenured pres-
ence in the literature is full specification of a joint probability distribution for cor-
related binary variates. The joint pdf can either be written explicitly in closed form,
or derived by exhaustive listing of possible outcomes and associated probabilities.
In all cases the generation of data relies on the method of Devroye (1986).
2.1 Simulating Binary Data with a Joint PDF

Given a joint probability density function for any type of binary data, correlated or
independent, Devroye (1986) described an effective method for generating appropri-
ate sequences of binary outcomes. Assume a joint pdf has been fully specified for the
binary random vector Y =(Y1 , . . . , Y N ), such that marginal probabilities can be
ca
calculated for any combination of binary outcomes. For finite N, probabilities can be
calculated for all 2^N possible outcome vectors, denoted p_0, . . . , p_{2^N −1}. Realizations
for the random vector Y can be generated according to the following algorithm.

Generating Binary Values with a Joint PDF

1. Order the probabilities from smallest to largest, p_(0), . . . , p_(2^N −1).
2. Define cumulative values z_j according to:

z_0 = 0,
z_j = z_{j−1} + p_(j),
z_{2^N −1} = 1.
3. Generate a standard uniform variate U on (0, 1).
4. Select j such that z_j ≤ U < z_{j+1}.
5. The random sequence is the binary representation of the integer j.

The method is relatively simple, relying on calculating probabilities associated with


all possible N -dimensional binary sequences, and generating a single random uni-
form variable to produce an entire vector of binary outcomes. Devroye (1986) has
argued that this method produces appropriate binary sequences according to the pdf


provided, and the method is relied on extensively in situations in which a joint pdf can
be constructed. In fact, relying on this algorithm, many authors have pursued methods
of constructing a joint pdf as a means to generate vectors of binary responses.
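The five steps above can be sketched directly. This is a minimal illustration of my own (names and conventions are mine), assuming the joint pmf is supplied as a list indexed by the integer j whose binary representation is the outcome:

```python
import random

def devroye_binary_sequence(p, rng=random):
    """Inverse-CDF draw of one binary vector from a fully specified joint
    pmf, following the algorithm above: sort the probabilities, accumulate
    them, locate a uniform variate, and decode the selected integer."""
    n_bits = (len(p) - 1).bit_length()                 # N, assuming len(p) == 2^N
    order = sorted(range(len(p)), key=lambda j: p[j])  # step 1
    u = rng.random()                                   # step 3: U ~ Uniform(0, 1)
    z = 0.0
    for j in order:                                    # steps 2 and 4
        z += p[j]
        if u < z:
            break
    # step 5: binary representation of the selected integer j
    return [(j >> k) & 1 for k in reversed(range(n_bits))]

# Degenerate pmf on N = 2 outcomes: all mass on j = 3, i.e. the vector (1, 1).
print(devroye_binary_sequence([0.0, 0.0, 0.0, 1.0]))  # [1, 1]
```

Note that a single uniform draw yields the entire vector, which is what makes the method attractive whenever the full table of 2^N probabilities can be afforded.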

2.2 Explicit Specification of the Joint PDF

Simulation of correlated binary outcomes is generally thought to begin with Bahadur
(1961). Bahadur (1961) presented a joint pdf for N correlated binary random vari-
ables as follows. Let yi indicate realizations of binary variables, π = P(Yi = 1)
represent the constant expectation of the binary variables, all equivalent, and ρij
be the autocorrelation between two variables Yi and Yj. Then the joint pdf can be
written,

f(y1, . . . , yN) = 1 + Σ_{i≠j} ρij (−1)^(yi + yj) π^(2 − yi − yj) (1 − π)^(yi + yj) / (π(1 − π)).

While Bahadur (1961) expressed the joint pdf in terms of lagged correlations for
an autoregressive time series, the idea expands generally to clustered binary data. In
practice it is necessary to estimate range restrictions for the autocorrelation to ensure

f(y1, . . . , yN) ∈ (0, 1). These restrictions depend on values of π, and are typically
determined empirically (Farrell and Sutradhar 2006). Using the algorithm of Devroye
(1986), the model of Bahadur (1961) can be used to simulate values for a single
cluster or longitudinal subject, then repeated for additional clusters. This allows the
probability π to vary across clusters; however, π is assumed constant within clusters,
covariates.. Most crucially,
crucially, the pdf given
by Bahadur
Bahadur (1961 1961)) will become computationally burdensome for high-dimensional
data simulations (Lunn and Davies 1998; 1998; Farrell and Sutradhar 2006 2006).).

2.3 Derivation of the Joint PDF

Instead of relying on the joint pdf of Bahadur (1961), authors have pursued the
construction of joint probability distributions not according to a single pdf formula,
but instead by using desired properties of the simulated data to directly calculate all
possible probabilities associated with N-dimensional sequences of binary outcomes.
Often these methods involve iterative processes, solutions to linear or nonlinear
systems of equations, and matrix and function inversions. They are computationally
complex but provide complete information about the distribution of probabilities
without using a closed-form pdf.
Lee (1993) introduced a method that relies on a specific copula distribution, with
binary variables defined according to a relationship between the copula parameters


and the desired probabilities and correlation for the simulated binary responses.
A copula (Genest and MacKay 1986a, b) is a multivariate distribution for random
variables ( X 1 , . . . , X N ) on the N -dimensional unit space, such that each marginal
distribution for Xi is uniform on the domain (0, 1). The variables X1, . . . , XN
can be used to generate binary variables Y 1 , . . . , Y N by dichotomizing according to
Yi = I ( X i > πi ), where πi is the desired expectation of the binary variable Yi and
I(·) takes the value of 1 for a true argument and 0 otherwise. However, Lee (1993)
did not use a simple dichotomization of continuous variables.
In order to extend this idea to allow for autocorrelation, Lee (1993) proposed
using the copula with exchangeable correlation, the Archimidian copula proposed
by Genest and MacKay (1986a). Use of such a copula will induce Pearson correlation
between any two random binary variables Y i and Y j given by,
ρij = (π0,0 − πi πj) / √(πi πj (1 − πi)(1 − πj)),        (2)

where π0,0 is the joint probability that both variables Yi and Y j take the value 0.
Because of the restriction that π0,0 ≤ min(πi, πj), it follows that π0,0 > πi πj and
therefore the induced Pearson correlation will always be positive when this method
1993)) requires the correlation of Eq. 2 to be constant
is applied. The method of Lee ((1993
within clusters.

Constructing the Joint Distribution by Archimedean Copula

1. For a single cluster, determine marginal probabilities π1, ..., πN for cor-
   related binary variables Y1, ..., YN with desired constant autocorrelation ρ.
2. A value for the Archimedean copula distribution parameter α can be cal-
   culated based on the values of πi and ρ. The parameter α takes values
   on (0, 1], where a value of 1 indicates independence, while as α → 0 the
   Kendall correlation converges to 1.
3. Using the Archimedean copula distribution, solve linearly for the joint
   probabilities of Y1, ..., YN with respect to the binary representation of
   each integer j, where 0 ≤ j ≤ 2^N − 1,

       Pj = P(Y1 = j1, ..., YN = jN),    (3)

   where j1 ... jN is the binary representation of j. Calculation of this prob-
   ability using the Archimedean copula has closed form but requires compu-
   tation of the inverse of a CDF-like function.
4. Given the probabilities P0, ..., P_{2^N − 1}, simulate a binary sequence according
   to the algorithm of Devroye (1986).
5. Repeat for additional clusters.
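Steps 3 and 4 above reduce to sampling from a full table of joint probabilities. A minimal sketch of that tabular sampling step in Python (my own illustration of the general idea, not Devroye's exact algorithm; the joint table below is hypothetical):

```python
import numpy as np

def sample_binary_sequences(P, N, size, rng):
    """Draw binary sequences; P[j] = P(Y equals the binary representation of j)."""
    js = rng.choice(2**N, p=P, size=size)
    # Unpack integer j into its N-digit binary representation j_1 ... j_N,
    # with j_1 the most significant digit.
    return (js[:, None] >> np.arange(N - 1, -1, -1)) & 1

rng = np.random.default_rng(0)
P = np.array([0.4, 0.1, 0.1, 0.4])   # hypothetical joint table for N = 2
Y = sample_binary_sequences(P, 2, 50_000, rng)
print(Y.mean(axis=0))                # marginals implied by the table (0.5, 0.5)
```

In practice the table P would come from the copula calculation in step 3; any joint distribution on binary sequences can be sampled this way once its 2^N cell probabilities are known.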
Monte-Carlo Simulation of Correlated Binary Responses 91
The method of Lee (1993) allows inclusion of predictors when determining the
marginal πi, which clearly allows these probabilities to vary within clusters. However,
the method is restricted to positive, exchangeable autocorrelation that cannot vary
within groups, and requires the solution to numerous systems of equations and the
inverse of a CDF-like function.
Kang and Jung (2001) discussed an alternative to this method such that the prob-
abilities Pj can be obtained by solving a nonlinear system of equations relating Pj
to the first two desired moments, πi and the correlation Corr(Yi, Yj) = ρij. How-
ever, the necessity of solving a system of nonlinear equations does not decrease the
computational complexity of the algorithm as compared to the copula approach.
The method of Gange (1995) simplifies the method of Lee (1993) but can add
computational complexity. Suppose that, in addition to specifying the desired mar-
ginal probabilities πi associated with the correlated binary variables Yi, pairwise
and higher-order joint probabilities are also specified. That is, joint probabilities
πi,j = P(Yi = yi, Yj = yj) and higher-order joint probabilities πi1,...,ik = P(Yi1 =
yi1, ..., Yik = yik) can be specified, up to order k. Given such probabilities, a full
joint probability density function is constructed through the Iterative Proportional
Fitting algorithm.
The main idea of the Iterative Proportional Fitting algorithm is to equate deriva-
tion of the joint pdf to fitting a log-linear model to a contingency table of order 2^N,
including interactions up to order k. Each pairwise or higher-order joint probability
is treated as a constraint on the implicit contingency table, and has an associated
nonlinear equation. Solving this system of nonlinear equations corresponding to the
higher-order joint probabilities is equated to the standard likelihood-based solution
to log-linear model fitting. Predicted values from the log-linear model give the prob-
abilities associated with all possible outcome combinations.

Constructing the Joint Distribution using the Iterative Proportional Fit-
ting Algorithm

1. For a single cluster, specify the desired marginal probabilities πi, pairwise
   probabilities πi,j, and up to k-order joint probabilities πi1,...,ik.
2. Construct a log-linear model with interactions up to order k corresponding
   to the constraints specified by the probabilities up to order k.
3. Fit the log-linear model to obtain estimated probabilities corresponding to
   the joint marginal pdf.
4. Simulate binary values according to the algorithm of Devroye (1986).
5. Repeat for additional clusters.
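The fitting idea can be illustrated on the smallest possible case. The sketch below is my illustration of proportional fitting on a 2 × 2 joint table, raked until its margins match specified marginal probabilities; it is not Gange's full procedure, in which pairwise and higher-order constraints are handled analogously as additional scaling steps:

```python
import numpy as np

def ipf_2x2(pi1, pi2, n_iter=50):
    """Rake a 2x2 joint table so its margins match (pi1, pi2)."""
    table = np.full((2, 2), 0.25)           # start from a uniform table
    row_target = np.array([1 - pi1, pi1])   # margins for Y1 = 0, 1
    col_target = np.array([1 - pi2, pi2])   # margins for Y2 = 0, 1
    for _ in range(n_iter):
        table *= (row_target / table.sum(axis=1))[:, None]  # match row margins
        table *= (col_target / table.sum(axis=0))[None, :]  # match column margins
    return table

t = ipf_2x2(0.3, 0.6)
print(t.sum(axis=1), t.sum(axis=0))  # margins: [0.7, 0.3] and [0.4, 0.6]
```

With only marginal constraints the procedure converges immediately to the independence table; it is the additional pairwise and higher-order scaling steps that make the full algorithm iterative and potentially burdensome.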

The method of Gange ((1995


1995)) allows for covariates to be included in determining ini-
1995))
tial probabilities and pairwise or higher-order joint probabilities, and Gange ((1995
describes how the pairwise probabilities can be connected to the working correlation
structure of the Generalized Estimating Equations, allowing for specific correlation
structures to be derived. However, specific correlation structures are not explicitly

92 T.L. Lalonde

1995).
exemplified by Gange ((1995
exemplified ). Further, this method requires an iterative procedure to
solve a system of nonlinear equations, which can become computationally burden-
some with increased dimension of each cluster.

3 Specification by Mixture Distributions

Many approaches to simulating correlated binary variables rely on the properties
of random variables defined as mixtures of other random variables with various
distributions. Authors have pursued mixtures of discrete distributions, continuous
distributions, and combinations of discrete and continuous distributions. Methods
relying solely on mixtures of continuous distributions tend to struggle to represent
correlated binary responses.

3.1 Mixtures Involving Discrete Distributions

Many attempts to circumvent the computational burden of the methods of Bahadur
(1961), Lee (1993), and Gange (1995) have turned to generating binary responses
through mixtures of discrete random variables. Kanter (1975) presented a method
that directly uses autoregression to simulate properly correlated binary outcomes.
Suppose it is of interest to simulate a series of binary values Yi with constant expec-
tation π. Kanter (1975) proposed to use the model,

    Yi = Ui (Yi−1 ⊕ Wi) + (1 − Ui) Wi,
where ⊕ indicates addition modulo 2, Ui can be taken to be Bernoulli with probability
πU, and Wi can be taken to be Bernoulli with probability πW = π(1 − πU)/(1 −
2π πU). Assuming Yi−1, Ui, and Wi to be independent, it can be shown that all Yi
have expectation π and that the autocorrelation between any two outcomes is

    Corr(Yi, Yj) = [πU (1 − 2π) / (1 − 2π πU)]^{|i−j|}.

The method of Kanter (1975) requires πU ∈ (0, min((1 − π)/π, 1)), and includes
restrictions on the autocorrelation in the simulated data based on the probabilities
used. In particular, Farrell and Sutradhar (2006) showed that no negative autocor-
relation can be generated with any probability π chosen to be less than or equal to
0.50. In addition, this method does not allow for easy variation of probabilities within
clusters or series.
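Kanter's recursion is simple to implement directly. A minimal sketch (my illustration; the parameter values are hypothetical and chosen to satisfy the constraint on πU above):

```python
import numpy as np

def kanter_series(pi, pi_u, n, rng):
    """Y_i = U_i*(Y_{i-1} XOR W_i) + (1 - U_i)*W_i (Kanter 1975),
    with pi_w chosen so that every Y_i has expectation pi."""
    pi_w = pi * (1 - pi_u) / (1 - 2 * pi * pi_u)
    U = rng.random(n) < pi_u
    W = rng.random(n) < pi_w
    Y = np.empty(n, dtype=int)
    Y[0] = rng.random() < pi          # start the chain at stationarity
    for i in range(1, n):
        Y[i] = (Y[i - 1] ^ W[i]) if U[i] else W[i]
    return Y

rng = np.random.default_rng(2)
Y = kanter_series(pi=0.3, pi_u=0.5, n=200_000, rng=rng)
print(Y.mean())                               # approx 0.3
# lag-1 autocorrelation approx pi_u*(1 - 2*pi)/(1 - 2*pi*pi_u) = 0.2/0.7
print(np.corrcoef(Y[:-1], Y[1:])[0, 1])
```

Note that with π ≤ 0.5 the factor (1 − 2π) is nonnegative, which is one way to see the restriction to nonnegative autocorrelation in that range.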
Use of a mixture of binary random variables was updated by Lunn and Davies
(1998), who proposed a simple method for generating binary data for multiple clus-
ters simultaneously, and for various types of autocorrelation structure within groups.


Suppose the intention is to simulate random binary variables Yij such that the expec-
tation of each is cluster-dependent, πi, and the autocorrelation can be specified. First
assume a positive, constant correlation ρi is desired within clusters. Then simulate
binary values according to the equation,

    Yij = (1 − Uij) Wij + Uij Zi,    (4)

where Uij can be taken to be Bernoulli with probability πU = √ρi, and both Wij
and Zi can be Bernoulli with success probability πi, all independent. Then it can be
shown that E[Yij] = πi and Corr(Yij, Yik) = ρi. Lunn and Davies (1998) explain how
to adjust their method to allow for additional correlation structures within clusters,
by changing the particular mixture from Eq. 4,

    Yij = (1 − Uij) Wij + Uij Zi        (Exchangeable),
    Yij = (1 − Uij) Wij + Uij Yi,j−1    (Autoregressive),
    Yij = (1 − Uij) Wij + Uij Wi,j−1    (M-Dependent).

This algorithm requires constant outcome probabilities within clusters. In order to
accommodate varying probabilities within clusters, Lunn and Davies (1998) pro-
posed a simple transformation of generated binary responses,

    Ỹij = Aij Yij,

where Aij is taken to be Bernoulli with success probability αij, independent of all
other simulated variables. Then the Ỹij satisfy the previous distributional require-
ments with E[Ỹij] = αij max(πi). Lunn and Davies (1998) acknowledge that this
multiplicative transformation imposes an additional multiplicative correction to the
correlation between generated binary values. While this transformation allows the
probabilities to vary within clusters, it does not easily incorporate predictors into
this variation, instead accommodating known success probabilities that may differ
across responses within clusters.

Generating Binary Values using a Binary Distribution Mixture

1. For a single cluster i, determine the desired probability πi and autocorre-
   lation structure, along with cluster-dependent correlation ρi.
2. Depending on the autocorrelation structure, generate Uij, Wij, and possi-
   bly Zi.
3. Calculate Yij using the appropriate mixture.
4. If necessary, transform Yij to Ỹij to accommodate varying probabilities
   within-cluster.
5. Repeat for other clusters.
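The exchangeable case of the steps above can be sketched in a few lines (my illustration; the probability and correlation values are hypothetical):

```python
import numpy as np

def lunn_davies_exchangeable(pi, rho, n_clusters, cluster_size, rng):
    """Y_ij = (1 - U_ij)*W_ij + U_ij*Z_i, with U_ij ~ Bern(sqrt(rho)) and
    W_ij, Z_i ~ Bern(pi): marginal mean pi, within-cluster correlation rho."""
    U = rng.random((n_clusters, cluster_size)) < np.sqrt(rho)
    W = rng.random((n_clusters, cluster_size)) < pi
    Z = rng.random((n_clusters, 1)) < pi      # one shared Z_i per cluster
    return np.where(U, Z, W).astype(int)

rng = np.random.default_rng(4)
Y = lunn_davies_exchangeable(pi=0.4, rho=0.3, n_clusters=100_000,
                             cluster_size=2, rng=rng)
print(Y.mean())                              # approx 0.4
print(np.corrcoef(Y[:, 0], Y[:, 1])[0, 1])   # approx 0.3
```

The shared variable Zi is what carries the common within-cluster signal: two responses in the same cluster are equal to the same Zi whenever both of their Uij indicators are 1, which happens with probability (√ρi)² = ρi.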


The issue of allowing within-cluster variation of success probabilities was
addressed by Oman and Zucker (2001) in a method that is truly a combination
of mixture variable simulation and dichotomization of continuous values. Oman and
Zucker (2001) argued that the cause of further restricted ranges of autocorrelation in
simulated binary data, beyond the limits from Eq. 1 given by Prentice (1988), is the
combination of varying probabilities within clusters and the inherent mean-variance
relationship of binary data. Assume the interest is in simulating binary variables Yij
with probabilities πij, varying both between and within clusters. To define the joint
probability distribution of any two binary outcomes Yij and Yik, define

    P(Yij × Yik = 1) = (1 − νjk) πij πik + νjk min(πij, πik),    (5)

where νjk is chosen to reflect the desired correlation structure within groups, as fol-
lows. The joint distribution specified by Eq. 5 allows the correlation between any two
responses within a cluster, denoted ρjk, to be written ρjk = νjk × max(ρst), where
the maximum is taken over all values of correlation within cluster i. The process
of generating binary values is constructed to accommodate this joint distribution as
follows. Define responses according to

    Yij = I(Zij ≤ θij),

where θij = F⁻¹(πij) for any continuous CDF F, and Zij is defined as an appropriate
mixture. Similarly to the method of Lunn and Davies (1998), Oman and Zucker
(2001) provide mixtures for common correlation structures,

    Zij = Uij Xi0 + (1 − Uij) Xij       (Exchangeable),
    Zij = Uij Xi,j−1 + (1 − Uij) Xij    (Moving Average),
    Zij = Uij Zi,j−1 + (1 − Uij) Xij    (Autoregressive),

where Uij can be taken to be Bernoulli with probability πU = √νjk, and all Xij are
independently distributed according to the continuous distribution F. The different
correlation structures are induced by both adjusting the mixture as given above and
also by defining νjk accordingly. For example, constant ν across all i, j with the first
mixture will produce exchangeable correlation among the binary outcomes, while
choosing νj1,j2 = γ^{|j1 − j2|} with the third mixture will produce autoregressive correlation.
Generating Binary Values using Binary and Continuous Distribution
Mixture

1. For a single cluster i, determine the marginal probabilities πij.
2. Decide the target correlation structure.
3. Based on the target correlation structure, select νjk and the mixture func-
   tion accordingly.
4. Generate Uij as Bernoulli, Xij according to continuous F, and calculate
   Zij and θij.
5. Define each outcome as Yij = I(Zij ≤ θij).
6. Repeat for additional clusters.
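A sketch of the exchangeable mixture (my illustration; F is taken here to be the standard logistic CDF so that the quantile function θij = F⁻¹(πij) = log(πij/(1 − πij)) is analytic, and the probabilities and ν value are hypothetical):

```python
import numpy as np

def oman_zucker_exchangeable(pis, nu, n_clusters, rng):
    """Z_ij = U_ij*X_i0 + (1 - U_ij)*X_ij with U_ij ~ Bern(sqrt(nu));
    Y_ij = I(Z_ij <= theta_ij) has marginal mean pi_ij for any continuous F."""
    N = len(pis)
    theta = np.log(pis / (1 - pis))          # logistic quantile function
    U = rng.random((n_clusters, N)) < np.sqrt(nu)
    X = rng.logistic(size=(n_clusters, N))   # X_ij ~ F, independent
    X0 = rng.logistic(size=(n_clusters, 1))  # shared X_i0 within each cluster
    Z = np.where(U, X0, X)
    return (Z <= theta).astype(int)

rng = np.random.default_rng(5)
Y = oman_zucker_exchangeable(np.array([0.2, 0.5, 0.8]), nu=0.36,
                             n_clusters=100_000, rng=rng)
print(Y.mean(axis=0))   # approx the targets even though they vary within cluster
```

Because both X0 and Xij follow the same continuous distribution F, the marginal probability P(Zij ≤ θij) = F(θij) = πij holds regardless of the mixture, which is what lets the probabilities vary within a cluster.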

Oman and Zucker (2001) noted that covariates can be incorporated by defining
θij = xijᵀβ as the systematic component of a generalized linear model and taking
F⁻¹ to be the associated link function. It is an interesting idea to use the inverse link
function from a generalized linear model to help connect predictors to the determina-
tion of whether the binary realization will be 0 or 1. However, the method described
continues to suffer from restricted ranges of autocorrelation, most notably that the
correlations between binary responses must all be positive.

3.2 Mixtures Involving Continuous Distributions

Instead of using proper binary distributions, non-normal values can be simulated
using linear combinations of standard normal variables to represent the known
moment-based properties of the desired data distribution. Many such methods are
based on the work of Fleishman (1978), and later extended (Headrick 2002a, 2010,
2011). In general, the idea is to simulate any non-normal distribution as a polynomial
mixture of normal variables,

    Y = Σ_{i=1}^{m} ci Z^{i−1},

where Y indicates the desired random variable, Z is a standard normal random
variable, and the ci represent coefficients chosen according to the desired distribution
of Y. Fleishman (1978) derived a system of nonlinear equations that, given the target
distribution mean, variance, skewness, and kurtosis, could be solved for coefficients
c1, c2, c3, and c4 to produce the third-order polynomial approximation to the desired
distribution. The intention is to use an expected probability π to estimate those
first four moments for the Bernoulli distribution, and approximate accordingly using
standard normals. This process has been extended numerous times, including to
multivariate data (Vale and Maurelli 1983), to higher-order polynomials (Headrick
2002a), to using percentiles in place of standard moments to reflect instead the
median, the inter-decile range, the left-right tail-weight ratio, and the tail-weight
factor (Karian and Dudewicz 1999), and to control autocorrelation in multivariate
data (Koran et al. 2015).
al. 2015).
The Fleishman process and similar methods derived from it suffer from some
consistent issues. First, the systems of nonlinear equations have a limited range of
solutions for the necessary coefficients, and consequently can only be used to rep-
resent limited ranges of values for moments or percentile statistics. Therefore the
ranges of probabilities and correlation values in the simulated data will be limited.
Secondly, the resulting mixture random variable is a power function of standard nor-
mal variables, which generally will not reflect the true mean-variance relationship
necessary for binary data. While the values of mean and variance using the Fleishman
method can in some cases reflect those of an appropriate sample, the dynamic rela-
tionship between changing mean and variation will not be captured in general. Thus
as the mean changes, the variance generally will not show a corresponding change.
Finally, the method does not readily account for the effects of predictors in simulating
responses. In short, such methods are poorly equipped to handle independent binary
data, let alone correlated binary outcomes.
4 Simulation by Dichotomizing Variates

Perhaps the most commonly implemented methods for simulating correlated binary
outcomes are those that involve dichotomization of other types of variables. The most
frequent choice is to dichotomize normal variables, although defining thresholds for
uniform variables is also prevalent.

4.1 Dichotomizing Normal Variables

Many methods of dichotomizing normal variables have been proposed. The method
of Emrich and Piedmonte (1991), one of the most popular, controls the probabilities
and pairwise correlations of resulting binary variates. Assume it is of interest to
simulate binary variables Yi with associated probabilities πi and pairwise correlations
given by Corr(Yi, Yj) = ρij. Begin by solving the following equation for the normal
pairwise correlation, δij, using the bivariate normal CDF Φ,

    Φ(z(πi), z(πj), δij) = ρij [πi (1 − πi) πj (1 − πj)]^{1/2} + πi πj,

where z() indicates the standard normal quantile function. Next generate one N-
dimensional multivariate normal variable Z with mean 0 and correlation matrix with
components δij. Define the correlated binary realizations using

    Yi = I(Zi ≤ z(πi)).

Emrich and Piedmonte (1991) showed that the sequence Y1, ..., YN has the appro-
priate desired probabilities πi and correlation values ρij.


Generating Binary Values by Dichotomizing Normals

1. For a single cluster, determine the marginal probabilities πi and autocor-
   relation values ρij.
2. Using the bivariate normal CDF, solve for normal pairwise correlation
   values δij.
3. Generate one N-dimensional normal variable Z with correlation compo-
   nents given by δij.
4. Define the binary values as Yi = I(Zi ≤ z(πi)), where z() is the standard
   normal quantile function.
5. Repeat for additional clusters.
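The two numerical steps, solving the bivariate normal equation for each δij and thresholding one multivariate normal draw per cluster, can be sketched as follows (my illustration, assuming SciPy is available for the bivariate normal CDF and a scalar root-finder; the probabilities and the exchangeable target correlation are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.optimize import brentq

def latent_correlation(pi_i, pi_j, rho_ij):
    """Solve Phi(z(pi_i), z(pi_j), delta) = rho_ij*sqrt(...) + pi_i*pi_j for delta."""
    target = rho_ij * np.sqrt(pi_i * (1 - pi_i) * pi_j * (1 - pi_j)) + pi_i * pi_j
    z_i, z_j = norm.ppf(pi_i), norm.ppf(pi_j)
    f = lambda d: multivariate_normal.cdf([z_i, z_j], cov=[[1.0, d], [d, 1.0]]) - target
    return brentq(f, -0.999, 0.999)

def simulate_clusters(pis, rho, n_clusters, rng):
    """Threshold one N-dimensional normal draw per cluster at z(pi_i)."""
    N = len(pis)
    delta = np.eye(N)
    for i in range(N):
        for j in range(i + 1, N):
            delta[i, j] = delta[j, i] = latent_correlation(pis[i], pis[j], rho)
    Z = rng.multivariate_normal(np.zeros(N), delta, size=n_clusters)
    return (Z <= norm.ppf(pis)).astype(int)   # Y_i = I(Z_i <= z(pi_i))

rng = np.random.default_rng(1)
Y = simulate_clusters(np.array([0.3, 0.5, 0.7]), rho=0.4, n_clusters=20_000, rng=rng)
print(Y.mean(axis=0))          # approx the target marginals
print(np.corrcoef(Y.T)[0, 1])  # approx 0.4
```

The root-finding step is where the computational burden mentioned below arises: one nonlinear equation must be solved per pair of responses, so the cost grows quadratically in the cluster size.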

This method is straightforward and allows probabilities to vary both within and
between clusters. A notable disadvantage of this method is the necessity of solving
a system of nonlinear equations involving the normal CDF, which increases compu-
tational burden with large-dimensional data generation.

4.2 Iterated Dichotomization

Headrick (2002b) proposed a method of simulating multiple clusters of correlated
binary data using JMASM3, an iterative method of dichotomizing two sets of binary
variables in two steps. This method is unique in that it allows for autocorrelation
both within and between clusters of binary data.
Assume it is of interest to simulate N clusters of binary data with correlation
within clusters denoted by ρYij,Yik, and correlation between clusters by ρYij,Ykl. Given
probabilities π1, ..., πN, correlated binary variables X1, ..., XN can be defined
using random uniform variables U1, ..., UN as follows. Define X1 = I(U1 < π1).
Define successive Xi as

    Xi = X1,        if Ui < πi
       = X1 + 1,    if Ui > πi and X1 = 0
       = 1 − X1,    if Ui > πi and X1 = 1.

This generates a cluster of binary values, each correlated with X1. Next simulate
binary values Yij, where i indicates the cluster and j indicates individual outcomes,
as follows, where the Uij are independent uniform variables on (0, 1),

    Yij = Xi,       if Uij < πij
        = Xi + 1,   if Uij > πij and Xi = 0
        = 1 − Xi,   if Uij > πij and Xi = 1.


Headrick (2002b) shows that the threshold values πij can be obtained by solving
a nonlinear system in terms of the specified correlation values ρYij,Yik and ρYij,Ykl.
The order of the nonlinear system corresponds to the number of correlation values
specified at the outset. Headrick (2002b) also provides expressions for the marginal
probabilities P(Yij = 1) in terms of the first-stage probabilities πi and the second-
stage thresholds πij; however, it is not clear that these marginal probabilities can be
controlled from the start.
Generating Binary Values by Iterated Dichotomization

1. Determine the autocorrelation desired within and between N clusters of
   binary responses, and select first-stage probabilities π1, ..., πN.
2. Simulate X1 as I(U1 < π1), where U1 is a random uniform realization on
   (0, 1).
3. Generate the remaining first-stage binary outcomes Xi according to X1
   and corresponding random uniform variables Ui.
4. Solve for the second-stage thresholds, πij, given the desired within and
   between correlation values, ρYij,Yik and ρYij,Ykl, respectively.
5. Generate the second-stage binary outcomes Yij according to Xi and cor-
   responding random uniform variables Uij.

The iterated dichotomization algorithm allows for control of autocorrelation both
within and between clusters, and does not require complete specification of the
joint probability distribution. However, it does not clearly accommodate predictors
or reduce in complexity for common correlation structures; it does not allow easy
specification of the marginal binary outcome probabilities; and it requires the solution
of a potentially high-dimensional system of nonlinear equations.

4.3 Dichotomizing Non-normal Variables

The method of Park et al. (1996) simulates correlated binary values using a
dichotomization of counts, and in the process avoids the necessity of solving any
system of nonlinear equations. Assume an interest in generating N correlated binary
variables Y1, ..., YN, with probabilities π1, ..., πN and associated pairwise correla-
tions ρij. Begin by generating N counts Z1, ..., ZN using a collection of M Poisson
random variables X1(λ1), ..., XM(λM), as linear combinations,


    Z1 = Σ_{i∈S1} Xi(λi),
    Z2 = Σ_{i∈S2} Xi(λi),
    ...
    ZN = Σ_{i∈SN} Xi(λi).
Notice that each count Zi is a combination of a specific set of the Poisson ran-
dom variables, denoted by Si. The number of Poisson variables, M, the associated
means, λi, and the sets used in the sums, Si, are all determined algorithmically based
on the desired probabilities and correlations. Each binary value is then defined by
dichotomizing, Yi = I(Zi = 0).
Park et al. (1996) describe the determination of M, λi, and Si as follows. The
Poisson means λi can be constructed as linear combinations of parameters αij, 1 ≤
i, j ≤ N. The αij can be calculated based on the desired probabilities and pairwise
correlations,

    αij = ln[1 + ρij ((1 − πi)(1 − πj)/(πi πj))^{1/2}].

Given all of the αij, define λk to be the smallest positive αij, i, j ≥ k, until the first
mean λL matches the magnitude of the largest positive αij. Then set M = L, let each
mean λi remain as determined by the αij, and define each summation set Si as those
means composed of positive αij (Park et al. 1996).

Generating Binary Values by Dichotomizing Linear Poisson Mixtures

1. For a single cluster, determine the individual probabilities πi and pairwise
   correlations ρij.
2. Using the probabilities and correlation values, calculate the parameters
   αij.
3. Using the parameters αij, determine the number of Poisson variables, M,
   the Poisson means, λi, and the summation sets, Si, for each count variable
   Zi.
4. Calculate Z1, ..., ZN, and define the binary responses as Yi = I(Zi = 0).
5. Repeat for additional clusters.

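The role of the shared Poisson components can be seen in a two-variable sketch (my illustration of the mechanism only; the full allocation algorithm for M, λi, and Si is not reproduced, and the probabilities and correlation below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
pi1, pi2, rho = 0.5, 0.5, 0.3
# Shared mean alpha = ln(1 + rho*sqrt((1-pi1)(1-pi2)/(pi1*pi2))).
alpha = np.log(1 + rho * np.sqrt((1 - pi1) * (1 - pi2) / (pi1 * pi2)))
lam1 = -np.log(pi1) - alpha     # individual means so that P(Z_i = 0) = pi_i
lam2 = -np.log(pi2) - alpha
n = 200_000
X0 = rng.poisson(alpha, n)      # shared Poisson component
Z1 = X0 + rng.poisson(lam1, n)  # counts as sums of Poisson variables
Z2 = X0 + rng.poisson(lam2, n)
Y1 = (Z1 == 0).astype(int)      # dichotomize: Y_i = I(Z_i = 0)
Y2 = (Z2 == 0).astype(int)
print(Y1.mean(), np.corrcoef(Y1, Y2)[0, 1])  # approx 0.5 and 0.3
```

Since Zi = 0 requires every Poisson summand to be zero, P(Yi = 1) = exp(−(α + λi)) = πi, and the shared component X0 induces exactly the positive correlation encoded in α.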
The method of Park et al. (1996) is computationally efficient and does not require
solving systems of nonlinear equations. It allows for varying probabilities and cor-
relation values within and between clusters, and can be adjusted to incorporate
covariates into the probabilities. However, there is still a restriction on the range
of correlations available through this algorithm.


5 Conditionally Specified Distributions

Recent attention has been paid to conditionally specifying the distribution of corre-
lated binary variables for the purposes of simulation. While the mixture distributions
can be viewed as conditional specifications, in the cases discussed in Sect. 3 the mix-
tures were defined so that the marginal distributions of the resulting binary variables
were completely specified. In this section the discussion focuses on situations with-
out full specification of the marginal outcome distribution. Instead, the distributions
are defined using predictor values or prior outcome values.

5.1 The Linear Conditional Probability Model

Qaqish (2003) introduced a method of simulating binary variates using
autoregressive-type relationships to simulate autocorrelation. Each outcome value
is conditioned on prior outcomes, a relationship referred to as the conditional lin-
ear family. The conditional linear family is defined by parameter values that are
so-called reproducible in the following algorithm, or those that result in conditional
means within the allowable range (0, 1).
Suppose the interest is in simulating correlated binary variables Yi with associ-
ated probabilities πi, and variance-covariance structure defined for each response by
its covariation with all previous responses, si = Cov([Y1, ..., Yi−1]ᵀ, Yi). Qaqish
(2003) argued that the expectation of the conditional distribution of any response Yi,
given all previous responses, can be expressed in the form,

    E[Yi | Y1, ..., Yi−1] = πi + κiᵀ ([Y1, ..., Yi−1]ᵀ − [π1, ..., πi−1]ᵀ)

                          = πi + Σ_{j=1}^{i−1} κij (Yj − πj),    (6)

where the components of κi are selected corresponding to the desired variance-
covariance structure according to

    κi = [Cov([Y1, ..., Yi−1]ᵀ)]⁻¹ si.

The correlated binary variables are then generated such that Y1 is Bernoulli with
probability π1, and all subsequent variables are random Bernoulli with probability
given by the conditional mean in Eq. 6. It is straightforward to show that such a
sequence will have the desired expectation [π1, ..., πN]ᵀ and autocorrelation defined
by the variance-covariance si. Qaqish (2003) provides simple expressions for κij to
produce exchangeable, auto-regressive, and moving average correlation structures,
as follows,

    κij = [ρ / (1 + (i − 2)ρ)] (Vii/Vjj)^{1/2}                   (Exchangeable),

    λi = πi + ρ (Vii/Vi−1,i−1)^{1/2} (yi−1 − πi−1)               (Autoregressive),

    κij = [(β^{j} − β^{−j}) / (β^{−i} − β^{i})] (Vii/Vjj)^{1/2}  (Moving Average),

where Vii represents diagonal elements of the response variance-covariance,
λi = E[Yi | [Y1, ..., Yi−1]ᵀ] represents the conditional expectation, and β = [(1 −
4ρ²)^{1/2} − 1]/(2ρ) with ρ the decaying correlation for autoregressive models and the
single time-lag correlation for moving average models.

Generating Binary Values by the Linear Conditional Probability Model

1. For a single cluster, determine the individual probabilities πi.
2. Select the desired correlation structure, and the corresponding constants
   κi and conditional means λi.
3. Generate the first response, Y1, as a random Bernoulli with success prob-
   ability π1.
4. Generate subsequent responses according to the appropriate conditional
   probability.
5. Repeat for additional clusters.
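A minimal sketch of the autoregressive case of these steps (my illustration; the chosen πi and ρ are hypothetical, and must keep every conditional mean λi inside (0, 1), i.e. must be reproducible):

```python
import numpy as np

def qaqish_ar1(pis, rho, n_seq, rng):
    """Conditional linear family, AR(1): Y_i | past ~ Bernoulli(lambda_i) with
    lambda_i = pi_i + rho*sqrt(V_ii/V_{i-1,i-1})*(y_{i-1} - pi_{i-1})."""
    N = len(pis)
    V = pis * (1 - pis)                      # Bernoulli variances V_ii
    Y = np.empty((n_seq, N), dtype=int)
    Y[:, 0] = rng.random(n_seq) < pis[0]
    for i in range(1, N):
        lam = pis[i] + rho * np.sqrt(V[i] / V[i - 1]) * (Y[:, i - 1] - pis[i - 1])
        Y[:, i] = rng.random(n_seq) < lam    # valid only while lam stays in (0, 1)
    return Y

rng = np.random.default_rng(6)
Y = qaqish_ar1(np.array([0.4, 0.5, 0.6]), rho=0.3, n_seq=100_000, rng=rng)
print(Y.mean(axis=0))                        # approx [0.4, 0.5, 0.6]
print(np.corrcoef(Y[:, 0], Y[:, 1])[0, 1])   # approx 0.3
```

Because the conditional mean is linear in the centered previous response, the marginal means and the lag-one correlation are controlled exactly, with no system of nonlinear equations to solve.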

An interesting property of the method presented by Qaqish (2003) is the nature of
including prior binary outcomes. The terms κij (Yj − πj) show that a binary response
variable is included relative to its expectation and transformed according to a constant
related to the desired autocorrelation. While this method does not explicitly include
predictors in the simulation algorithm, predictors could be included as part of each
expected value πi. The method clearly allows for both positive and negative values of
autocorrelation, unlike many other proposed methods, but restrictions on the values
of the autocorrelation remain as discussed by Qaqish (2003).

5.2 Non-linear Dynamic Conditional Probability Model

The most general method in this discussion is based on the work of Farrell and
Sutradhar (2006), in which a nonlinear version of the linear conditional probability
model proposed by Qaqish (2003) is constructed. The model of Farrell and Sutradhar
(2006) is conditioned not only on prior binary outcomes in an autoregressive-type
of sequence, but also on possible predictors to be considered in data generation.
This approach allows for the inclusion of covariates in the conditional mean, allows
for the probabilities to vary both between and within clusters, and allows for the
greatest range of both positive and negative values of autocorrelation. However, the
nonlinear conditional probability approach of Farrell and Sutradhar (2006) does not
explicitly provide methods for controlling the probabilities and correlation structure
at the outset of data simulation.
Assume an interest in simulating correlated binary variables, Y_i, where each out-
come is to be associated with a vector of predictors, x_i, through a vector of parameters,
β. Farrell and Sutradhar (2006) proposed using the non-linear conditional model,

\[
E[Y_i \mid Y_1, \ldots, Y_{i-1}, x_i] = \frac{\exp\left(x_i^T \beta + \sum_{k=1}^{i-1} \gamma_k Y_k\right)}{1 + \exp\left(x_i^T \beta + \sum_{k=1}^{i-1} \gamma_k Y_k\right)}. \tag{7}
\]

Instead of beginning with desired probabilities and pairwise correlations or an asso-
ciated correlation structure, the method of Farrell and Sutradhar (2006) focuses on
the model relating predictors to the conditional response probability. This allows
the greatest flexibility in producing values of correlation at the expense of control
of the correlation structure. Farrell and Sutradhar (2006) show that, for a simple
auto-regression including only the immediately previous response Y_{i−1}, the mar-
ginal expectation and correlation can be calculated based on the nonlinear dynamic
model,


\[
\mu_i = E[Y_i] = P(Y_i = 1 \mid Y_{i-1} = 0) + E[Y_{i-1}]\left[P(Y_i = 1 \mid Y_{i-1} = 1) - P(Y_i = 1 \mid Y_{i-1} = 0)\right],
\]
\[
\mathrm{Corr}(Y_i, Y_j) = \sqrt{\frac{\mu_i(1-\mu_i)}{\mu_j(1-\mu_j)}} \prod_{k \in (i,j]} \left[P(Y_k = 1 \mid Y_{k-1} = 1) - P(Y_k = 1 \mid Y_{k-1} = 0)\right].
\]

Because the conditional probabilities P(Y_k = 1 | Y_{k−1} = 1) and P(Y_k = 1 | Y_{k−1} = 0) can
vary within (0, 1), Farrell and Sutradhar (2006) argue that the marginal correlation
can vary unrestricted between −1 and 1.
Generating Binary Values by the Nonlinear Dynamic Conditional
Probability Model

1. For a single cluster of correlated binary data, select predictors x_i, coeffi-
   cients β, and autoregressive coefficients γ_k.
2. Simulate Y_1 as Bernoulli with probability π_1 = exp(x_1^T β)/(1 + exp(x_1^T β)).
3. Simulate subsequent Y_i according to the conditional probability E[Y_i |
   Y_1, ..., Y_{i−1}, x_i].
4. Repeat for additional clusters.

Monte-Carlo Simulation of Correlated Binary Responses 103

The intuition behind such an approach is that the predictors x_i and also the previous
outcome variables Y_1, ..., Y_{i−1} are combined linearly but related to the conditional
mean through the inverse logit function, as in Eq. 7. The inverse logit function will
map any real values to the range (0, 1), thus avoiding the concern of reproducibility
discussed by Qaqish (2003).
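The four generation steps above can be sketched in a few lines of code. The following is an illustrative sketch, not code from Farrell and Sutradhar (2006); the function name and interface are hypothetical, and the conditional mean is the inverse logit of Eq. 7.

```python
import numpy as np

def simulate_cluster(X, beta, gamma, rng=None):
    """Simulate one cluster of correlated binary outcomes from the
    nonlinear dynamic conditional probability model (Eq. 7).

    X     : (n, p) array of predictors x_i, one row per outcome.
    beta  : (p,) vector of regression coefficients.
    gamma : (n-1,) vector of autoregressive coefficients gamma_k.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    y = np.zeros(n, dtype=int)
    # Step 2: the first outcome depends only on its own predictors.
    eta = X[0] @ beta
    y[0] = int(rng.random() < np.exp(eta) / (1.0 + np.exp(eta)))
    # Step 3: later outcomes also condition on all prior outcomes via gamma_k.
    for i in range(1, n):
        eta = X[i] @ beta + gamma[:i] @ y[:i]
        y[i] = int(rng.random() < np.exp(eta) / (1.0 + np.exp(eta)))
    return y
```

Calling the function repeatedly with independent draws implements Step 4 (additional clusters).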

6 Software Discussion

Few of the methods discussed are readily available in software. The R packages
bindata and BinNor utilize discretizations of normal random variables, but not
according to Emrich and Piedmonte (1991). The method of Emrich and Piedmonte
(1991) is implemented in the generate.binary() function within the MultiOrd pack-
age, and also within the functions of the mvtBinaryEP package.
functions
The rbin() function in the SimCorMultRes package explicitly uses threshold values
to transform continuous values into binary values (Touloumis 2016). Touloumis
(2016) proposed defining the marginal distribution of each binary value according
to a CDF applied to the systematic component associated with a model,

P(Y_ij = 1) = F(x_ij^T β),

where F is a cumulative distribution function of a continuous random variable, and
x_ij^T β is a linear combination of predictors and parameter values. Each binary outcome
is then defined according to Y_ij = I(e_ij ≤ x_ij^T β), where e_ij ∼ F and is independent
across clusters. While the method of Touloumis (2016) does clearly accommodate
predictors and allow probabilities to vary within clusters, it is not clear what the
limitations on the range of autocorrelation are, nor can the autocorrelation be easily
controlled, as with other methods.
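The thresholding construction above can be illustrated with a short sketch. This is not code from the SimCorMultRes package; the function name is hypothetical, and F is taken to be the logistic c.d.f. for concreteness.

```python
import numpy as np

def threshold_binary(X, beta, rng=None):
    """Dichotomize latent continuous draws: Y_ij = I(e_ij <= x_ij^T beta)
    with e_ij ~ F, which gives P(Y_ij = 1) = F(x_ij^T beta).
    Here F is the logistic c.d.f.; any continuous c.d.f. could be used."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(X.shape[0])
    e = np.log(u / (1.0 - u))      # inverse-c.d.f. draws from the logistic F
    return (e <= X @ beta).astype(int)
```

With x_ij^T β = 0, for example, the construction yields marginal probability F(0) = 0.5, as the definition requires.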
The package binarySimCLF implements the method of Qaqish (2003). Authors
have individually implemented the methods of Park et al. (1996) and Fleishman (1978),
and publications include code for various software, such as with Lee (1993) and
Headrick (2002b). However, there is a clear bias in available software: users prefer
dichotomization of normal or uniform random variables to produce binary data.
Table 1 provides a brief summary of the simulation methods considered, along
with some common advantages and disadvantages of each class of algorithms. Meth-
ods requiring the full specification of the joint probability distribution, while allow-
ing complete control of the simulated data properties, tend to be complicated and
computationally expensive. Alternatively, the mixture methods tend to have sim-
pler algorithms and computational burden, and generally allow the user to specify
correlation structures at the outset. The method of Oman and Zucker (2001) would
seem to be the ideal such approach. Dichotomizing normal and uniform variables
remains the most commonly implemented approach for simulating correlated binary
outcomes, but most such methods require computations involving systems of nonlinear
equations. The approach of Emrich and Piedmonte (1991) remains prevalent and, accepting the
computational burden, is an understandable method that allows easy control of cor-
relation structures. Methods involving conditionally defined distributions are more


Table 1  Advantages and disadvantages of methods of correlated binary outcome simulation

Simulation type: Fully specified joint distribution
Prominent example: Lee (1993), using the Archimedean copula
Advantages: Control of probabilities, correlation, and higher-order moments
Disadvantages: Computationally expensive, nonlinear systems, function/matrix inversions

Simulation type: Mixture distributions
Prominent example: Oman and Zucker (2001), using a mixture of binary and continuous variables
Advantages: Simple algorithms, controlled correlation structures
Disadvantages: Constant probabilities within clusters, no predictors

Simulation type: Dichotomizing variables
Prominent example: Emrich and Piedmonte (1991), using dichotomized multivariate normals
Advantages: Short algorithms, probabilities vary within clusters, correlation between clusters
Disadvantages: Nonlinear systems, computationally expensive

Simulation type: Conditional distributions
Prominent example: Qaqish (2003), using the linear conditional probability model
Advantages: Widest range of correlations, controlled correlation structures, predictors
Disadvantages: Complicated algorithms, requiring conditional means and covariances

recent, and allow for the greatest range of correlations to be simulated. The algorithm
of Qaqish (2003) is an ideal example, with few disadvantages other than a slightly
limited range of correlation values, but allowing the inclusion of predictors, prior
outcomes, and easily specified common correlation structures.

References

Bahadur, R. R. (1961). A representation of the joint distribution of responses to n dichotomous


items. Stanford Mathematical Studies in the Social Sciences, 6, 158–168.
Devroye, L. (1986). Non-uniform random variate generation (1st ed.). Springer, New York.
Emrich, L. J., & Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate
binary variates. The American Statistician: Statistical Computing, 45 (4), 302–304.


Farrell, P. J., & Sutradhar, B. C. (2006). A non-linear conditional probability model for generating
correlated binary data. Statistics & Probability Letters, 76, 353–361.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43,
521–532.
Gange, S. J. (1995). Generating multivariate categorical variates using the iterative proportional
fitting algorithm. The American Statistician, 49 (2), 134–138.
Genest, C., & MacKay, R. J. (1986a). Copules archimediennes et familles de lois bidimenionnelles
dont les marges sont donnees. Canadian Journal of Statistics , 14, 280–283.
Genest, C., & MacKay, R. J. (1986b). The joy of copulas: Bivariate distributions with uniform
marginals. The American Statistician, 40, 549–556.
Headrick, T. C. (2002a). Fast fifth-order polynomial transforms for generating univariate and mul-
tivariate nonnormal distributions. Computational Statistics & Data Analysis, 40, 685–711.
Headrick, T. C. (2002b). Jmasm3: A method for simulating systems of correlated binary data.
Journal of Modern Applied Statistical Methods, 1, 195–201.
Headrick, T. C. (2010). Statistical simulation: Power method polynomials and other transformations
(1st ed.). Chapman & Hall/CRC, New York.
Headrick, T. C. (2011). A characterization of power method transformations through l-moments.
Journal of Probability and Statistics, 2011.
Kang, S. H., & Jung, S. H. (2001). Generating correlated binary variables with complete specification
of the joint distribution. Biometrical Journal, 43(3), 263–269.
Kanter, M. (1975). Autoregression for discrete processes mod 2. Journal of Applied Probability,
12, 371–375.
Karian, Z. A., & Dudewicz, E. J. (1999). Fitting the generalized lambda distribution to data: A
method based on percentiles. Communications in Statistics: Simulation and Computation, 28,
793–819.
Koran, J., Headrick, T. C., & Kuo, T. C. (2015). Simulating univariate and multivariate nonnormal
distributions through the method of percentiles. Multivariate Behavioral Research, 50, 216–232.
Lee, A. J. (1993). Generating random binary deviates having fixed marginal distributions and spec-
ified degrees of association. The American Statistician: Statistical Computing, 47 (3), 209–215.
Lee, Y., & Nelder, J. A. (1996). Hierarchical generalized linear models. Journal of the Royal
Statistical Society, Series B (Methodological), 58 (4), 619–678.
Lunn, A. D., & Davies, S. J. (1998). A note on generating correlated binary variables. Biometrika,
85(2), 487–490.
Molenberghs, G., & Verbeke, G. (2006). Models for discrete longitudinal data (1st ed.). Springer.
Oman, S. D., & Zucker, D. M. (2001). Modelling and generating correlated binary variables. Bio-
metrika, 88 (1), 287–290.
Park, C. G., Park, T., & Shin, D. W. (1996). A simple method for generating correlated binary
variates. The American Statistician, 50 (4), 306–310.
Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary obser-
vation. Biometrics, 44, 1033–1048.
Qaqish, B. F. (2003). A family of multivariate binary distributions for simulating correlated binary
variables with specified marginal means and correlations. Biometrika, 90 (2), 455–463.
Stiratelli, R., Laird, N., & Ware, J. H. (1984). Random-effects models for serial observations with
binary response. Biometrics, 40, 961–971.
Touloumis, A. (2016). Simulating correlated binary and multinomial responses with SimCorMultRes.
The Comprehensive R Archive Network, 1–5.
Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychome-
trika, 48, 465–471.
Zeger, S. L., & Liang, K. Y. (1986). Longitudinal data analysis for discrete and continuous outcomes.
Biometrics, 42, 121–130.

Quantifying the Uncertainty in Optimal
Experiment Schemes via Monte-Carlo
Simulations
H.K.T. Ng, Y.-J. Lin, T.-R. Tsai, Y.L. Lio and N. Jiang

Abstract  In the process of designing life-testing experiments, experimenters always
establish the optimal experiment scheme based on a particular parametric lifetime
model. In most applications, the true lifetime model is unknown and needs to be
specified for the determination of optimal experiment schemes. Misspecification
of the lifetime model may lead to a substantial loss of efficiency in the statistical
analysis. Moreover, the determination of the optimal experiment scheme always
relies on asymptotic statistical theory. Therefore, the optimal experiment scheme
may not be optimal for finite sample cases. This chapter aims to provide a general
framework to quantify the sensitivity and uncertainty of the optimal experiment
scheme due to misspecification of the lifetime model. For the illustration of the
methodology developed here, analytical and Monte-Carlo methods are employed to
evaluate the robustness of the optimal experiment scheme for progressive Type-II
censored experiments under the location-scale family of distributions.

H.K.T. Ng (B)
Department of Statistical Science, Southern Methodist University,
Dallas, TX 75275, USA
e-mail: ngh@mail.smu.edu
Y.-J. Lin
Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li District,
Taoyuan city 32023, Taiwan
e-mail: yujaulin@cycu.edu.tw
T.-R. Tsai
Department of Statistics, Tamkang University, Tamsui District, New Taipei City, Taiwan
e-mail: trtsai@stat.tku.edu.tw
·
Y.L. Lio N. Jiang
Department of Mathematical Sciences, University of South Dakota,
Vermillion, SD 57069, USA
e-mail: Yuhlong.Lio@usd.edu
N. Jiang
e-mail: Nan.Jiang@usd.edu

© Springer Nature Singapore Pte Ltd. 2017 107


D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical
Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_6

108 H.K.T. Ng et al.

1 Introduction

In designing life-testing experiments for industrial and medical settings, experi-


menters always assume a parametric model for the lifetime distribution and then
determine the optimal experiment scheme by optimizing a specific objective func-
tion based on the assumed model. In most applications, the optimal experimental
design is model dependent because the optimality criterion is usually a function of
the information measures. Therefore, prior information about the unknown model
derived from physical/chemical theory, engineering pre-test results, or past experi-
ence with similar experimental units is needed to determine the optimal experiment
scheme in practice. However, this prior information may not be accurate and hence
the optimal experiment scheme may not perform as well as one expected. In other
words, the objective function may not be optimized when the optimal experiment
scheme based on inaccurate prior information is used. In addition, to determine the
optima
opt imall ex
exper
perime
iment
nt sch
scheme
eme,, we rel
rely
y on the asympt
asymptoti
oticc sta
statis
tistic
tical
al theory
theory in most
most cases.
cases.
For instance, A -optimality is developed to minimize the variances of the estimators
of the model parameters while these variances are always replaced by the asymp-
totic ones during the process of optimization. There is no guarantee that the optimal
experiment scheme obtained based on asymptotic theory can be more efficient than
a non-optimal experiment scheme in finite sample situations.
For these reasons, it is important to have a systematic procedure to quantify the
sensitivity and uncertainty of the optimal experiment scheme due to model misspec-
ification. It will be useful to see whether a proposed experiment scheme is robust
to model misspecification. If a design is indeed robust, it would then assure the
practitioners that misspecification in the model would not result in an unacceptable
change in the precision of the estimates of model parameters. In this chapter, we
discuss the analytical and Monte-Carlo methods for quantifying the sensitivity and
uncertainty of the optimal experiment scheme and evaluate the robustness of the
optimal experiment scheme.
Let θ be the parameter vector of lifetime distribution of test items. The com-
monly used procedures for the determination of the optimal experiment scheme
are described as follows: A-optimality, that minimizes the trace of the variance-
covariance matrix of the maximum likelihood estimators (MLEs) of elements of θ ,
provides an overall measure of variability from the marginal variabilities. It is par-
ticularly useful when the correlation between the MLEs of the parameters is low. It
is also pertinent for the construction of marginal confidence intervals for the para-
meters in θ. D-optimality, that minimizes the determinant of the variance-covariance
matrix of the MLEs of components of θ , provides an overall measure of variability
by taking into account the correlation between the estimates. It is particularly useful
when the estimates are highly correlated. It is also pertinent for the construction of
joint confidence regions for the parameters in θ . V -optimality, that minimizes the
variance of the estimator of lifetime distribution percentile.
In Sect. 2, we present the notation and general methods for quantifying the
uncertainty in the optimal experiment scheme with respect to changes in model.

Quantifying the Uncertainty in Optimal Experiment Schemes … 109

Then, in Sect. 3, we focus on progressive Type-II censoring with location-scale fam-


ily of distributions. The procedure for determining the optimal scheme with progres-
sive censoring and some commonly used optimal criteria are presented in Sects. 3.1
and 3.2, respectively. In Sect. 3.3, some numerical illustrations via analytical and
simulation approaches based on extreme value (Gumbel), logistic, and normal dis-
tributions are discussed. Discussions based on the numerical results are provided
in Sect. 3.4. A numerical example is presented in Sect. 4. Finally, some concluding
remarks are given in Sect. 5.

2 Quantifying the Uncertainty in the Optimal Experiment
Scheme

In a life-testing experiment, let the lifetimes of test items follow a family of statistical
models M. We are interested in determining the optimal experiment scheme that
optimizes the objective function Q(S, M0), where S denotes an experiment scheme
and M0 denotes the true model. In many situations, the determination of the optimal
experiment scheme requires a specification of the unknown statistical model M
and hence the optimal experiment scheme depends on the specified model M. For
instance, in experimental design of multi-level stress testing, Ka et al. (2011) and
Chan et al. (2016) considered the extreme value regression model and derived the
expected Fisher information matrix. Consequently, the optimal experiment schemes
obtained in Ka et al. (2011) and Chan et al. (2016) are specifically for the extreme
value regression model, which may not be optimal for other regression models.
Here, we denote the optimal experiment scheme based on a specified model M
as S ∗ (M ). In the ideal situation, the model specified for the optimal experimental
scheme is the true model, i.e., Q(S∗(M0), M0) = inf_S Q(S(M0), M0) and
S∗(M0) = arg inf_S Q(S(M0), M0).
On the other hand, in determining the optimal experiment scheme, the experimenter
always relies on asymptotic results, which are derived as the sample size
goes to infinity. Nevertheless, in practice, the number of experimental units that can be
used in an experiment is finite and thus the use of the asymptotic theory may not
be appropriate. For instance, for A -optimality, the aim is to minimize the variances
of the estimated model parameters. This is always attained through minimizing the
trace of the inverse of the Fisher information matrix or equivalently, the trace of the
asymptotic variance-covariance matrix of MLEs. However, the asymptotic variance-
However,, the asymptotic variance-
covariance matrix may not correctly reflect the true variations of the estimators
when the sample size is finite, and hence the optimal experiment scheme may not
be optimal or as efficient as expected in finite sample situations. Therefore, large-
scale Monte-Carlo simulations can be used to estimate the objective functions and
evaluate the performance of the optimal experiment scheme. For quantifying the
sensitivity and uncertainty of the optimal experiment scheme S ∗ (M ), we describe
two possible approaches by comparing experimental schemes and objective functions
in the following subsections.

110 H.K.T. Ng et al.

2.1 Comparing Experimental Schemes

Let the specified model for obtaining the optimal experiment scheme be M ∗ , then
the optimal experiment scheme is
S∗(M∗) = arg inf_S Q(S(M∗), M∗).

To quantify the sensitivity of the optimal experiment scheme S ∗ (M ∗ ), we consider


a different model M and establish the optimal experiment scheme based on M as

S∗(M) = arg inf_S Q(S(M), M).

Comparing the experimental schemes S∗(M∗) and S∗(M) will provide us some
insights on the sensitivity of the optimal experiment scheme S∗(M∗). If S∗(M∗) is
insensitive to the change in the model M, then S∗(M∗) and S∗(M) will be similar
to each other. Depending on the nature of the life-testing experiments, different
measures of similarity of two experimental schemes can be considered to quantify
the sensitivity of the optimal experiment scheme S∗(M∗). When applying this approach,
evaluation of the optimal experiment scheme for different models is needed.

2.2 Comparing Values of Objective Functions

To quantify the sensitivity of the optimal experiment scheme S∗(M∗), another
approach is to compare the objective function of the optimal experiment scheme
S∗(M∗) under the model (M) which is believed to be the true model. Specifi-
cally, we can compute the objective function when the experiment scheme S∗(M∗)
is adopted but the model is M, i.e., to compute Q(S∗(M∗), M). If the optimal
experiment scheme S∗(M∗) is insensitive to the change in the model M, then
Q(S∗(M∗), M∗) will be similar to Q(S∗(M∗), M) or Q(S∗(M), M). When applying
this approach, evaluation of the objective function Q(S∗(M∗), M) is needed.
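Over a finite set of candidate schemes, the two comparisons in Sects. 2.1 and 2.2 reduce to a brute-force search. The following is a minimal sketch, not from the chapter; the function and argument names are hypothetical, and Q is assumed to be a "smaller is better" objective.

```python
def assess_sensitivity(schemes, Q, M_star, M0):
    """Compare optimal schemes and objective values under a specified
    model M_star and a presumed true model M0."""
    schemes = list(schemes)
    S_star = min(schemes, key=lambda S: Q(S, M_star))   # S*(M*), Sect. 2.1
    S_true = min(schemes, key=lambda S: Q(S, M0))       # S*(M0)
    # Sect. 2.2: efficiency loss from designing under the wrong model.
    loss = Q(S_star, M0) / Q(S_true, M0)
    return S_star, S_true, loss
```

A ratio near 1 (and similar schemes S∗(M∗) and S∗(M0)) would indicate robustness to the model change.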

3 Progressive Censoring with Location-Scale Family
of Distributions

In this section, we illustrate the proposed methodology through the optimal progres-
sive Type-II censoring schemes (see, for example, Balakrishnan and Aggarwala 2000;
Balakrishnan 2007; Balakrishnan and Cramer 2014). We consider that the underlying
statistical model, M, used for this purpose is a member of the log-location-scale
family of distributions. Specifically, the log-lifetimes of the units on test have a


Table 1  Examples of functional forms of g(·) and G(·)

        Extreme value (EV)    Logistic (LOGIS)           Normal (NORM)
g(z)    exp[z − exp(z)]       exp(−z)/[1 + exp(−z)]²     (1/√(2π)) exp(−z²/2)
G(z)    1 − exp[−exp(z)]      1/[1 + exp(−z)]            ∫_{−∞}^{z} (1/√(2π)) exp(−x²/2) dx
location-scale distribution with probability density function (p.d.f.)

1 x − µ,
;
fX (x µ , σ ) = g
 (1)
σ σ

and cumulative distribution function (c.d.f.)


 x − µ,
;
FX (x µ , σ ) =G σ
(2)

where g(·) is the standard form of the p.d.f. f_X(x; µ, σ) and G(·) is the standard form
of the c.d.f. F_X(x; µ, σ) when µ = 0 and σ = 1. The functional forms g and G are
completely specified and they are parameter-free, but the location and scale parame-
ters, −∞ < µ < ∞ and σ > 0, of f_X(x; µ, σ) and F_X(x; µ, σ) are unknown. Many
well-known properties for the location-scale family of distributions have been established
in the literature (e.g., Johnson et al. 1994). This is a rich family of distributions that
includes the normal, extreme value and logistic models as special cases. The func-
tional forms of g(·) and G(·) for the extreme value, logistic and normal distributions
are summarized in Table 1.
A progressively Type-II censored life-testing experiment is described as follows.
Let n independent units be placed on a life-test with corresponding lifetimes T_1,
T_2, ..., T_n that are independent and identically distributed (i.i.d.) with p.d.f. f_T(t; θ)
and c.d.f. F_T(t; θ), where θ denotes the vector of unknown parameters. Prior to
the experiment, the number of completely observed failures m < n and the censoring
scheme (R_1, R_2, ..., R_m), where R_j ≥ 0 and Σ_{j=1}^m R_j + m = n, are pre-fixed. During
the experiment, R_j functioning items are removed (or censored) randomly from the
test when the j-th failure is observed. Note that in the analysis of lifetime data,
instead of working with the parametric model for T_i, it is often more convenient to
work with the equivalent model for the log-lifetimes X_i = log T_i, for i = 1, 2, ..., n.
The random variables X_i, i = 1, 2, ..., n, are i.i.d. with p.d.f. f_X(x; µ, σ) and c.d.f.
F_X(x; µ, σ).
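The removal mechanism just described can be simulated directly. The sketch below is illustrative only (the function name is hypothetical, not from the chapter): it draws one progressively Type-II censored sample from given i.i.d. lifetimes.

```python
import numpy as np

def progressive_type2_sample(lifetimes, R, rng=None):
    """Return the m observed ordered failure times under censoring
    scheme R = (R_1, ..., R_m), where sum(R) + m == len(lifetimes).
    At the j-th observed failure, R_j surviving units are withdrawn
    at random from the test."""
    rng = np.random.default_rng() if rng is None else rng
    alive = sorted(lifetimes)
    observed = []
    for r in R:
        observed.append(alive.pop(0))   # next failure among surviving units
        if r:
            # retain a random subset of the survivors, withdrawing r of them
            keep = rng.choice(len(alive), size=len(alive) - r, replace=False)
            alive = sorted(alive[k] for k in keep)
    return np.array(observed)
```

For example, the scheme (0, 0, 0, 0, n − m) is conventional Type-II censoring, where all remaining units are removed at the m-th failure.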


3.1 Maximum Likelihood Estimation

Let the m completely observed (ordered) log-lifetimes from a progressively Type-II


censored experiment be X_{i:m:n}, i = 1, 2, ..., m and their observed values be x_{i:m:n},
i = 1, 2, ..., m. The likelihood function based on x_{i:m:n} (1 ≤ i ≤ m) is

\[
L(\mu,\sigma) = c \prod_{i=1}^{m} f_X(x_{i:m:n}; \mu, \sigma)\, \left[1 - F_X(x_{i:m:n}; \mu, \sigma)\right]^{R_i},
\quad x_{1:m:n} < x_{2:m:n} < \cdots < x_{m:m:n}, \tag{3}
\]

where c is the normalizing constant given by

\[
c = n(n - R_1 - 1) \cdots (n - R_1 - R_2 - \cdots - R_{m-1} - m + 1).
\]
The MLEs of µ and σ are the values of µ and σ that maximize (3). For the location-
scale family of distributions described in Eqs. (1) and (2), the log-likelihood function
can be expressed as

\[
\ell(\mu,\sigma) = \ln L(\mu,\sigma)
= \ln c - m \ln \sigma + \sum_{i=1}^{m} \ln g\!\left(\frac{x_{i:m:n} - \mu}{\sigma}\right)
+ \sum_{i=1}^{m} R_i \ln\!\left[1 - G\!\left(\frac{x_{i:m:n} - \mu}{\sigma}\right)\right].
\]
We denote the MLEs of the parameters µ and σ by µ̂ and σ̂, respectively. Computa-
tional algorithms for obtaining the MLEs of the parameters of some commonly used
location-scale distributions are available in many statistical software packages such
as R (R Core Team 2016), SAS and JMP.
The expected Fisher information matrix of the MLEs can be obtained as

∂ 2 (µ,σ) ∂ 2 (µ,σ)
E ∂µ∂µ
E ∂µ∂σ Iµµ Iµσ
I(µ,σ) = −   2
 
2
  =   . (4)
E
∂ (µ,σ)
E
∂ (µ,σ) Iµσ Iσ σ
∂µ∂σ ∂σ ∂σ

Then, the asymptotic variance-covariance matrix of the MLEs can be obtained by
inverting the expected Fisher information matrix in Eq. (4) as

\[
V(\mu,\sigma) = I^{-1}(\mu,\sigma)
= \begin{bmatrix} Var(\hat\mu) & Cov(\hat\mu, \hat\sigma) \\ Cov(\hat\mu, \hat\sigma) & Var(\hat\sigma) \end{bmatrix}
= \sigma^2 \begin{bmatrix} V_{11} & V_{12} \\ V_{12} & V_{22} \end{bmatrix}. \tag{5}
\]


The computational formulas of the elements in the Fisher information matrix of


(4) can be found in Balakrishnan et al. (2003), Ng et al. (2004) and Dahmen et al.
(2012). For fixed values of n and m and a specific progressive censoring scheme
(R_1, R_2, ..., R_m), we can compute the expected Fisher information matrix and the
asymptotic variance-covariance matrix of the MLEs from Eqs. (4) and (5).

3.2 Optimal Criteria

To determine the optimal scheme under progressive Type-II censoring, we consider


the following optimal criteria:
[1] D-optimality
For D-optimality, we search for the censoring scheme that maximizes the determinant
of the Fisher information matrix, det(I(µ,σ)). For a given censoring scheme
S = (R_1, R_2, ..., R_m) with a specific model M, the objective function is

\[
Q_D(S, M) = I_{\mu\mu} I_{\sigma\sigma} - I_{\mu\sigma}^2. \tag{6}
\]

We denote the optimal experiment scheme for D-optimality with a specific model
M as S_D∗(M).
[2] A-optimality
For A-optimality, we aim to minimize the variances of the estimators of the model
parameters. This can be achieved by designing an experiment that minimizes
the trace of the asymptotic variance-covariance matrix, tr[V(µ,σ)]. For a given
experiment scheme S = (R_1, R_2, ..., R_m) with a specific model M, the objective
function is

\[
Q_A(S, M) = V_{11} + V_{22}. \tag{7}
\]

We denote the optimal experiment scheme for A-optimality with a specified


model M as S_A∗(M).
[3] V -optimality
For V-optimality, we aim to minimize the variance of the estimator of the 100δ-th
percentile of the log-lifetime distribution, 0 < δ < 1, i.e.,

\[
\hat{q}_\delta = \hat{\mu} + \hat{\sigma}\, G^{-1}(\delta).
\]

For a given censoring scheme S = (R_1, R_2, ..., R_m) with a specific model M,
the objective function is

\[
Q_{V_\delta}(S, M) = Var(\hat{q}_\delta) = Var(\hat{\mu} + \hat{\sigma} G^{-1}(\delta))
= V_{11} + \left[G^{-1}(\delta)\right]^2 V_{22} + 2 G^{-1}(\delta) V_{12}, \tag{8}
\]


where G^{−1}(·) is the inverse c.d.f. of the standard location-scale distribution. We
denote the optimal experimental scheme for V-optimality with a specified model
M as S_{V_δ}∗(M).
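Given a 2 × 2 expected Fisher information matrix, the three objective functions of Eqs. (6)–(8) follow directly by matrix inversion. The sketch below is a hypothetical helper, not code from the chapter; here the entries V_ij are read directly off the inverted information matrix, absorbing the σ² factor of Eq. (5).

```python
import numpy as np

def objective_values(info, G_inv_delta):
    """Compute Q_D, Q_A and Q_V from a 2x2 expected Fisher information
    matrix for (mu, sigma) and a standard quantile G^{-1}(delta)."""
    Q_D = np.linalg.det(info)       # Eq. (6): I_mm*I_ss - I_ms^2 (to maximize)
    V = np.linalg.inv(info)         # asymptotic variance-covariance, Eq. (5)
    Q_A = V[0, 0] + V[1, 1]         # Eq. (7): trace (to minimize)
    # Eq. (8): variance of the estimated 100*delta-th percentile
    Q_V = V[0, 0] + G_inv_delta**2 * V[1, 1] + 2 * G_inv_delta * V[0, 1]
    return Q_D, Q_A, Q_V
```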
When the values of n and m are chosen in advance, depending on the availability of
units, experimental facilities and cost considerations, we can determine the optimal
censoring scheme (R1 , R2 , . . . , Rm ). In the finite sample situation, we can list all
possible censoring schemes and compute the corresponding objective functions, and
then determine the optimal censoring schemes, respectively, through an extensive

search.
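The extensive search amounts to enumerating every scheme (R_1, ..., R_m) with nonnegative entries summing to n − m. A minimal sketch (hypothetical function names; Q is treated as a "smaller is better" objective of the scheme alone):

```python
def all_schemes(n, m):
    """Yield every censoring scheme (R_1, ..., R_m) with R_j >= 0
    and R_1 + ... + R_m = n - m."""
    def comps(total, parts):
        if parts == 1:
            yield (total,)
            return
        for r in range(total + 1):
            for rest in comps(total - r, parts - 1):
                yield (r,) + rest
    yield from comps(n - m, m)

def optimal_scheme(n, m, Q):
    """Exhaustive search for the scheme minimizing the objective Q."""
    return min(all_schemes(n, m), key=Q)
```

The number of candidate schemes grows combinatorially (for n = 10, m = 5 there are already 126), which is why the search is feasible only for small n and m.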

3.3 Numerical Illustrations

For illustrative purposes, we consider the true underlying lifetime model of the test
units to be Weibull (i.e., the log-lifetimes follow an extreme value distribution, M0 = EV)
and we are interested in investigating the effect of misspecification of the underlying
lifetime model as log-logistic (i.e., the log-lifetimes follow a logistic distribution,
M∗ = LOGIS). We also consider the case that the true underlying lifetime model for
the test units is lognormal (i.e., the log-lifetimes follow a normal distribution,
M0 = NOR) and we are interested in investigating the effect of misspecification of
the underlying lifetime model as Weibull (M∗ = EV).

3.3.1 Analytical Approach

In this subsection, we evaluate the sensitivities of the optimal progressive Type-II cen-
soring scheme analytically based on the expected Fisher information matrix and the
asymptotic variance-covariance matrix of the MLEs. For the specific model M∗, we
determine the optimal progressive Type-II censoring schemes under different optimality
criteria, S∗_D(M∗), S∗_A(M∗), S∗_{V.95}(M∗) and S∗_{V.05}(M∗), from Eqs. (4) to (8). For the true
model M0, we also determine the optimal progressive Type-II censoring schemes
under different optimality criteria, S∗_D(M0), S∗_A(M0), S∗_{V.95}(M0) and S∗_{V.05}(M0).
Then, we can compare these experimental schemes S∗(M∗) and S∗(M0). In addi-
tion, we compute the objective functions based on the optimal censoring scheme
under the specified model M∗ while the true underlying model is M0, i.e., we com-
pute Q_D(S∗_D(M∗), M0), Q_A(S∗_A(M∗), M0), and Q_{V_δ}(S∗_{V_δ}(M∗), M0) for δ = 0.05
and 0.95. The results for n = 10, m = 5(1)9 with (M∗ = LOGIS, M0 = EV) and
(M∗ = EV, M0 = NOR) are presented in Tables 2 and 3, respectively.

Quantifying the Uncertainty in Optimal Experiment Schemes … 115

3.3.2 Simulation Approach

In this subsection, we use Monte-Carlo simulation to evaluate the performance of
the optimal censoring scheme by comparing the objective functions based on the
asymptotic variance-covariance matrix and the simulated values. Moreover, we can
also evaluate the sensitivities of the optimal censoring schemes when the model is
misspecified based on Monte-Carlo simulation. In our simulation study, 20,000 sets
of progressively Type-II censored data are generated from the true model M0 = EV
and the MLEs of the parameters µ and σ are obtained. The simulated values of the
objective functions for the progressive censoring schemes (n = 10, m = 5(1)9) in
Tables 2 and 3 are presented in Table 4.
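Each simulated data set can be generated with the uniform-transformation algorithm of Balakrishnan and Aggarwala (2000). A sketch in R for the extreme value model M0 = EV (the function name rpcens_ev is ours, not from the chapter):

```r
## One progressively Type-II censored sample of log-lifetimes from the
## extreme value model (location mu, scale sigma), generated with the
## uniform-transformation algorithm of Balakrishnan and Aggarwala (2000).
## 'ir' is the censoring scheme (R_1, ..., R_m).
rpcens_ev <- function(nn, mm, ir, mu = 0, sigma = 1) {
  ## nn = mm + sum(ir) is implied by the censoring scheme
  ww <- runif(mm)
  ## exponents: i plus the number of units removed at the last i stages
  ee <- sapply(1:mm, function(i) i + sum(ir[(mm - i + 1):mm]))
  vv <- ww^(1 / ee)
  uu <- 1 - sapply(1:mm, function(i) prod(vv[(mm - i + 1):mm]))
  mu + sigma * log(-log(1 - uu))   # G^{-1}(u) for the extreme value model
}

set.seed(1)
rpcens_ev(10, 5, c(5, 0, 0, 0, 0))   # one simulated censored sample
```

The returned values are the ordered observed log-lifetimes X_{1:m:n} < ... < X_{m:m:n}; repeating the call 20,000 times and maximizing the likelihood for each sample gives the simulated entries of Table 4.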

3.4 Discussions

3.4.1 Comparing Experimental Schemes

From Tables 2 and 3, we can compare the optimal censoring schemes under models
M∗ = LOGIS and M0 = EV. While comparing the optimal censoring schemes
of the same optimality criterion under these two models, we can see that these two
optimal censoring schemes are different except in the case n = 10, m = 9 for V-
optimality with δ = 0.95. In some cases, the two optimal censoring schemes can be
very different from each other. For example, when n = 10, m = 7 for A-optimality,
the optimal censoring scheme under model M∗ = LOGIS is (0, 0, 0, 0, 0, 0, 3) while
the optimal censoring scheme under model M0 = EV is (0, 0, 3, 0, 0, 0, 0). These
results indicate that when the model is misspecified, the optimal censoring scheme
based on the logistic distribution may not be optimal under the true model (i.e., the
extreme value distribution).
Nevertheless, the analytical approach proposed here will be useful for practitioners
to choose an appropriate censoring scheme which is robust with respect to model
misspecification. For instance, from Table 2 with n = 10, m = 5 for D-optimality, if
one believes that the model is M∗ = LOGIS but also suspects that the model might
be M0 = EV, then it may not be the best option to use the optimal censoring
scheme (0, 0, 0, 0, 5) with objective function det(I) = 45.05, because there is a
substantial loss in efficiency if the underlying model is M0 = EV (det(I) = 32.89,
compared to the optimal value 54.52 under M0 = EV in Table 3). In this situation,
it may be better to use a non-optimal censoring scheme such as (0, 3, 0, 0, 2), which
gives objective function det(I) = 44.64 under M∗ = LOGIS and det(I) = 47.09
under M0 = EV.


[Table 2 Optimal progressive censoring schemes under different optimality criteria for n = 10, m = 5(1)9 with misspecified model (M∗ = LOGIS and M0 = EV)]

[Table 3 Optimal progressive censoring schemes under different optimality criteria for n = 10, m = 5(1)9 with misspecified model (M∗ = EV and M0 = NOR)]

[Table 4 Simulated values of the objective functions for the progressive censoring schemes (n = 10, m = 5(1)9) presented in Tables 2 and 3]
Quantifying the Uncertainty in Optimal Experiment Schemes … 119




3.4.2 Comparing Values of Objective Functions
By comparing the values of the objective functions Q(S∗(M∗), M0) with
Q(S∗(M∗), M∗) and Q(S∗(M0), M0), we can observe that the model misspeci-
fication has a more substantial effect for V-optimality and a relatively minor effect
for A-optimality. For instance, we compare Q(S∗(M∗), M0) and Q(S∗(M0), M0) by
considering n = 10 and m = 5: the optimal censoring scheme for V-optimality with
δ = 0.95 under M∗ = LOGIS is (4, 0, 0, 0, 1) with Q_{V.95}(S∗(LOGIS), EV) = 0.3211
(Table 2), while the optimal censoring scheme under M0 = EV is (0, 0, 0, 0, 5)
with Q_{V.95}(S∗(EV), EV) = 0.2585, which gives a 19.50% loss of efficiency. In con-
trast, for n = 10 and m = 5, the optimal censoring scheme for A-optimality
under M∗ = LOGIS is (0, 3, 0, 0, 2) with Q_A(S∗(LOGIS), EV) = 0.3071 (Table 2),
while the optimal censoring scheme under M0 = EV is (0, 5, 0, 0, 0) with
Q_A(S∗(EV), EV) = 0.2918 (Table 3), which gives a 4.98% loss of efficiency. We have
a similar observation when we compare the objective functions Q(S∗(M∗), M∗)
and Q(S∗(M∗), M0). For example, in Table 3, when n = 10, m = 6, the optimal
censoring scheme for V-optimality with δ = 0.95 under M∗ = EV is (0, 0, 0, 0, 0, 4)
with Q_{V.95}(S∗(EV), EV) = 0.2546. If the censoring scheme (0, 0, 0, 0, 0, 4) is applied
when the true model is normal (M0 = NOR), the asymptotic variance of the esti-
mator of the 95-th percentile is Var(q̂_0.95) = 0.4633, which is clearly not the minimum
variance that can be obtained, because the censoring scheme (4, 0, 0, 0, 0, 0) yields
Var(q̂_0.95) = 0.3684. Based on the results from our simulation studies, one should
be cautious when the quantity of interest is one of the extreme percentiles (e.g., the
1-st, 5-th, 95-th or 99-th percentile) because the optimal censoring schemes could be
sensitive to a change of the model.
Based on the simulation approach, we observe that the optimal censoring schemes
determined based on the asymptotic theory of the MLEs may not be optimal even when
the underlying model is correctly specified. Since the Monte-Carlo simulation is a
numerical mimic of the real data analysis procedure in practice, the results show
that when the analytical value of the objective function of the optimal cen-
soring scheme and the values of the objective functions of other censoring schemes
are close, it is likely that those non-optimal censoring schemes will perform better
than the optimal censoring scheme. We would suggest that practitioners use Monte-
Carlo simulation to compare the candidate progressive censoring schemes and choose
the optimal one. However, since the number of possible censoring schemes can be
numerous when n and m are large, it will not be feasible to use Monte-Carlo simu-
lation to compare all the possible censoring schemes. Therefore, in practice, we can
use the analytical approach to identify the optimal censoring scheme and some near-
optimal censoring schemes; Monte-Carlo simulation can then be used to choose the
best censoring scheme among those candidates. This approach will be illustrated in
the example which will be presented in the next section.
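This two-stage strategy can be sketched generically in R; objective_asym and objective_sim below are hypothetical user-supplied helpers that return, for a single scheme, the analytical and the simulated value of the chosen objective function:

```r
## Two-stage selection of a censoring scheme (sketch).
## 'schemes' has one candidate scheme per row; 'objective_asym' and
## 'objective_sim' are hypothetical helper functions supplied by the user.
pick_scheme <- function(schemes, objective_asym, objective_sim, k = 10) {
  k <- min(k, nrow(schemes))
  asym <- apply(schemes, 1, objective_asym)           # stage 1: analytical
  shortlist <- schemes[order(asym)[1:k], , drop = FALSE]
  sim <- apply(shortlist, 1, objective_sim)           # stage 2: Monte-Carlo
  shortlist[which.min(sim), ]
}

## Toy usage with a trivial objective (weighted sum of removals):
schemes <- rbind(c(5, 0, 0, 0, 0), c(0, 0, 0, 0, 5), c(0, 3, 0, 0, 2))
toy <- function(r) sum(r * seq_along(r))
pick_scheme(schemes, toy, toy, k = 2)
```

With k around 10, only the top analytical candidates are re-examined by simulation, which keeps the Monte-Carlo cost manageable even when the full set of schemes is large.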


4 Illustrative Example
Viveros and Balakrishnan (1994) presented a progressively Type-II censored sample based on
the breakdown data on insulating fluids tested at 34 kV from Nelson (1982). The
progressively censored data presented in Viveros and Balakrishnan (1994) has n = 19 and m = 8
with censoring scheme (0, 0, 3, 0, 3, 0, 0, 5). Suppose that we want to re-run the
same experiment with n = 19 and m = 8 and we are interested in using the optimal
censoring scheme that minimizes the variances of the parameter estimators (i.e.,
A-optimality) or minimizes the variance of the estimator of the 95-th percentile of
the lifetime distribution (i.e., V-optimality with δ = 0.95). We can first identify the
top k optimal censoring schemes based on the asymptotic variances and then use
Monte-Carlo simulation to evaluate the performances of those censoring schemes.
Since Viveros and Balakrishnan (1994) discussed the linear inference under progressive Type-
II censoring when the lifetime distribution is Weibull and used the breakdown data
on insulating fluids as a numerical example, we assume here the underlying lifetime
distribution to be Weibull and determine the top ten censoring schemes subject to the
A-optimality and V-optimality with δ = 0.95. To study the effect of model misspec-
ification on the optimal censoring schemes, we compute the objective functions for
these censoring schemes when the true underlying lifetime distribution is lognormal.
Then, we also use Monte-Carlo simulation to evaluate the performances of these
censoring schemes. To reduce the effect of Monte-Carlo simulation errors, we used
100,000 simulations to obtain the simulated variances. These results are presented
in Tables 5 and 6 for A-optimality and V-optimality with δ = 0.95, respectively.
As we illustrated in the previous section, the optimal censoring scheme under a
particular model may not be optimal if the model is misspecified. The same observa-
tion is obtained in this numerical example. For A-optimality, from Table 5, instead of
choosing the optimal censoring scheme (0, 11, 0, 0, 0, 0, 0, 0) based on asymptotic
variances, the censoring scheme (1, 9, 1, 0, 0, 0, 0, 0) can be a better option based
on the simulation results. For V-optimality with δ = 0.95, from Table 6, instead of
choosing the optimal censoring scheme (0, 0, 0, 0, 0, 0, 11, 0) based on asymptotic
variances, one may adopt (0, 0, 0, 0, 2, 0, 9, 0) as the censoring scheme because it
gives a smaller simulated Var(q̂_0.95), and the performance of this censoring scheme
under M0 = NOR is better than that of (0, 0, 0, 0, 0, 0, 11, 0).
5 Concluding Remarks

In this chapter, we propose analytical and simulation approaches to systematically
quantify the uncertainty in optimal experiment schemes. The robustness of the
optimal progressive Type-II censoring scheme with respect to changes of the model is
studied. We have shown that the optimal censoring schemes are sensitive to misspec-
ification of models, especially when the V-optimality criterion is under consideration.
In practice, we would recommend the use of Monte-Carlo simulation to verify if the


[Table 5 Top ten censoring schemes for A-optimality with n = 19, m = 8, based on asymptotic variances and simulated variances (based on 100,000 simulations), with M∗ = EV and M0 = NOR]



[Table 6 Top ten censoring schemes for V-optimality with δ = 0.95, n = 19, m = 8, based on asymptotic variances and simulated variances (based on 100,000 simulations), with M∗ = EV and M0 = NOR]



optimal censoring schemes are delivering significantly superior results compared to
other censoring schemes.
The current study is limited to progressive Type-II censoring; it will be of interest
to apply the proposed approaches to other optimal experiment design problems
and to study the effect of model misspecification. Moreover, the methodologies and
illustrations presented in this chapter are mainly focused on misspecification of the
underlying statistical model. In the case that the determination of the optimal experi-
ment scheme depends on the specified values of the parameters, the methodologies
developed here can be applied as well. For instance, in the experiment design of
multi-level stress testing with extreme value regression under censoring (Ka et al.
2011; Chan et al. 2016),
the expected Fisher information matrix depends on the proportions of observed fail-
ures which are functions of the unknown parameters, and consequently the optimal
experimental scheme also depends on the unknown parameters. Quantifying the
uncertainty of the optimal experimental scheme due to parameter misspecification
can proceed in a similar manner as presented in this chapter.
R function to compute the objective functions for a specific progressive Type-II
censoring scheme for the extreme value distribution:
##############################################
## Function to compute the objective        ##
## functions for a specific progressive     ##
## Type-II censoring scheme                 ##
##############################################

#######################################
# Input values:                       #
# nn: Sample size                     #
# mm: Effective sample size           #
# ir: Censoring scheme (length = mm)  #
#######################################

###################################################
# Output values:                                  #
# dfi:  Determinant of Fisher information matrix  #
# tvar: Trace of variance-covariance matrix       #
# vq95: Variance of the MLE of 95-th percentile   #
# vq05: Variance of the MLE of 5-th percentile    #
###################################################

objpcs <- function(mm, nn, ir)
{
  rr <- numeric(mm)
  cc <- numeric(mm)
  aa <- matrix(0, mm, mm)
  epcos <- numeric(mm)
  epcossq <- numeric(mm)
  rpcs <- numeric(mm)
  gg <- numeric(mm)

  ## Compute rr ##
  for (jj in 1:mm)
  {rr[jj] <- mm - jj + 1 + sum(ir[jj:mm])}

  ## Compute cc ##
  cc[1] <- rr[1]
  for (ii in 2:mm)
  {cc[ii] <- cc[ii-1]*rr[ii]}

  ## Compute aa ##
  for (jj in 1:mm)
  {for (ii in 1:jj)
    {aa[ii,jj] <- 1
     for (kk in 1:jj)
     {if (kk != ii) {aa[ii,jj] <- aa[ii,jj]/(rr[kk] - rr[ii])}
     } }}

  ## Compute E(Z_i:m:n) and E(Z_i:m:n^2) ##
  for (ii in 1:mm)
  {psum <- 0
   psumsq <- 0
   for (ll in 1:ii) {psum <- psum + aa[ll,ii]*(digamma(1) -
        log(rr[ll]))/(rr[ll])}
   for (ll in 1:ii) {psumsq <- psumsq + aa[ll,ii]*(
        (digamma(1)^2) - 2*digamma(1)*log(rr[ll])
        + log(rr[ll])*log(rr[ll]) + pi*pi/6)/rr[ll]}
   epcos[ii] <- cc[ii]*psum
   epcossq[ii] <- cc[ii]*psumsq }

  ## Elements of Fisher Information Matrix ##
  i22 <- sum(1 + 2*epcos + epcossq)
  i12 <- -sum(1 + epcos)
  i11 <- mm

  dfi <- det(matrix(c(i11, i12, i12, i22), 2, 2))
  vcov <- solve(matrix(c(i11, i12, i12, i22), 2, 2))
  inv05 <- log(-log(0.95))
  inv95 <- log(-log(0.05))

  tvar <- vcov[1,1] + vcov[2,2]
  vq95 <- vcov[1,1] + inv95*inv95*vcov[2,2] + 2*inv95*vcov[1,2]
  vq05 <- vcov[1,1] + inv05*inv05*vcov[2,2] + 2*inv05*vcov[1,2]
  out <- c(dfi, tvar, vq95, vq05)
  names(out) <- c("dfi", "tvar", "vq95", "vq05")
  return(out)}

## Example: n = 10, m = 5, censoring scheme = (5,0,0,0,0)

objpcs(5, 10, c(5,0,0,0,0))
       dfi       tvar       vq95       vq05
54.2144581  0.2947966  0.3473800  0.9247358

References

Balakrishnan, N., & Aggarwala, R. (2000). Progressive censoring: Theory, methods and applica-
tions. Boston: Birkhäuser.



Balakrishnan, N. (2007). Progressive censoring methodology: An appraisal (with discussion). Test,
16, 211–296.
Balakrishnan, N., & Cramer, E. (2014). The art of progressive censoring: Applications to reliability
and quality. Boston, MA: Birkhäuser.
Balakrishnan, N., Kannan, N., Lin, C. T., & Ng, H. K. T. (2003). Point and interval estimation for
Gaussian distribution, based on progressively Type-II censored samples. IEEE Transactions on
Reliability, 52, 90–95.
Chan, P. S., Balakrishnan, N., So, H. Y., & Ng, H. K. T. (2016). Optimal sample size allocation for
multi-level stress testing with exponential regression under Type-I censoring. Communications
in Statistics—Theory and Methods, 45 , 1831–1852.
Dahmen, K., Burkschat, M., & Cramer, E. (2012). A- and D-optimal progressive Type-II censoring
designs based on Fisher information. Journal of Statistical Computation and Simulation , 82,
879–905.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous univariate distributions (Vol. 1, 2nd
ed.). New York: Wiley.
Ka, C. Y., Chan, P. S., Ng, H. K. T., & Balakrishnan, N. (2011). Optimal sample size allocation for
multi-level stress testing with extreme value regression under Type-II Censoring, Statistics, 45,
257–279.
Ng, H. K. T., Chan, P. S., & Balakrishnan, N. (2004). Optimal progressive censoring plan for the
Weibull distribution. Technometrics, 46 , 470–481.
R Core Team. (2016). R: A language and environment for statistical computing. Vienna, Austria: R
Foundation for Statistical Computing.
Viveros, R., & Balakrishnan, N. (1994). Interval estimation of parameters of life from progressively
censored data. Technometrics, 36, 84–91.
Part II
Monte-Carlo Methods in Missing Data
Markov Chain Monte-Carlo Methods
for Missing Data Under Ignorability
Assumptions

Haresh Rochani and Daniel F. Linder

Abstract Missing observations are a common occurrence in public health, clinical
studies and social science research. Consequences of discarding missing observa-
tions, sometimes called complete case analysis, are low statistical power and poten-
tially biased estimates. Fully Bayesian methods using Markov Chain Monte-Carlo
(MCMC) provide an alternative model-based solution to complete case analysis by
treating missing values as unknown parameters. Fully Bayesian paradigms are natu-
rally equipped to handle this situation by augmenting MCMC routines with additional
layers and sampling from the full conditional distributions of the missing data, in
the case of Gibbs sampling. Here we detail ideas behind the Bayesian treatment
of missing data and conduct simulations to illustrate the methodology. We consider
specifically Bayesian multivariate regression with missing responses and the missing
covariate setting under an ignorability assumption. Applications to real datasets are
provided.

1 Introduction

Complete data are rarely available in epidemiological, clinical and social research,
especially when a requirement of the study is to collect information on a large num-
ber of individuals or on a large number of variables. Analyses that improperly treat
missing data can lead to bias and loss of efficiency, which may limit gener-
alizability of results to a wider population and can diminish our ability to under-
stand true underlying phenomena. In applied research, linear regression models are
an important tool to characterize relationships among variables. The four common
approaches for inference in regression models with missing data are: Maximum like-

H. Rochani (B)
Department of Biostatistics, Jiann-Ping Hsu College of Public Health,
Georgia Southern University, Statesboro, GA, USA
e-mail: hrochani@georgiasouthern.edu

D.F. Linder
Department of Biostatistics and Epidemiology, Medical College of Georgia,
Augusta University, Augusta, GA, USA

© Springer Nature Singapore Pte Ltd. 2017 129


D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical
Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_7

lihood (ML), Multiple imputation (MI), Weighted Estimating Equations (WEE) and Fully Bayesian (FB) (Little and Rubin 2014). This chapter focuses on FB methods
for regression models with missing multivariate responses and models with miss-
ing covariates. The Bayesian approach provides a natural framework for making
inferences about regression coefficients with incomplete data, where certain other
methods may be viewed as special cases or related. For instance, the maximum a
posteriori (MAP) estimate from a FB approach under uniform improper priors leads
to ML estimates; therefore ML can be viewed as a special case of Bayesian inference. Moreover, in MI the “imputation” step is based on sampling from a posterior predictive distribution. Overall, FB methods are general and can be powerful tools for dealing with incomplete data, since they easily accommodate missing data without requiring extra modeling assumptions.

2 Missing Data Mechanisms

For researchers, it is crucial to have some understanding of the underlying missing mechanism for the variables under investigation so that parameter estimates are accurate and precise. Rubin (1976) defined the taxonomy of missing data mechanisms based on how the probability of a missing value relates to the data itself. This taxonomy has been widely adopted in the statistical literature. There are mainly three types of missing data mechanisms:

Missing Completely At Random (MCAR):-
MCAR mechanisms assume that the probability of missingness in the variable of interest does not depend on the values of that variable that are either missing or observed. We begin by denoting the data as D and the missing indicator matrix as M, which has values 1 if the variable is observed and 0 if the variable is not observed. For MCAR, missingness in D is independent of the data being observed or missing, or equivalently p(M | D, θ) = p(M | θ), where θ are the unknown parameters.
Missing At Random (MAR):-
A MAR mechanism assumes that the probability of missingness in the variable of interest is associated only with components of observed variables and not with the components that are missing. In mathematical terms, it can be written as p(M | D, θ) = p(M | D_obs, θ).

Missing Not At Random (MNAR):-
Finally, an MNAR assumption allows missingness in the variable of interest to depend on the unobserved values in the data set. In notation, it can be written as p(M | D, θ) = p(M | D_miss, θ).
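The three mechanisms are easy to mimic in simulation. The following sketch (ours, not from the chapter; Python/NumPy with illustrative variable names and coefficients) generates a missingness indicator under each mechanism for a variable y with one fully observed covariate x, and shows why the complete-case mean is safe under MCAR but not under MNAR:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)           # fully observed variable
y = x + rng.normal(size=n)       # variable subject to missingness

def inv_logit(t):
    return 1.0 / (1.0 + np.exp(-t))

# MCAR: P(missing) is a constant, unrelated to the data
m_mcar = rng.random(n) < 0.2

# MAR: P(missing) depends only on the fully observed x
m_mar = rng.random(n) < inv_logit(-1.5 + x)

# MNAR: P(missing) depends on the (possibly unobserved) y itself
m_mnar = rng.random(n) < inv_logit(-1.5 + y)

# Complete-case mean of y is unbiased under MCAR but shifted under MNAR,
# because large y values are preferentially deleted
print(y.mean(), y[~m_mcar].mean(), y[~m_mnar].mean())
```

Only the MNAR indicator uses y itself, which is exactly what makes that mechanism unverifiable from the observed data alone.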

3 Data Augmentation

Data augmentation (DA) within MCMC algorithms is a widely used routine that handles incomplete data by treating missing values as unknown parameters, which are sampled during iterations and then marginalized away in the final computation of various functionals of interest, like for instance when computing posterior means (Tanner and Wong 1987). To illustrate how inferences from data with missing variables may be improved via data augmentation, we first distinguish between complete data, observed data and complete-cases. Complete data refers to the data that are observed or intended to be observed, which includes response variables, covariates of interest and missing data indicators. Conversely, the observed data comprise the observed subset of the complete data. Furthermore, complete-cases comprise the data after excluding all observations for which the outcome or any of the inputs are missing. In a Bayesian context, for most missing data problems, the observed data posterior p(θ | D_obs) is intractable and cannot easily be sampled from. However, when D_obs is ‘augmented’ by assumed values of D_miss, the resulting complete data posterior, p(θ | D_obs, D_miss), becomes much easier to handle. For DA, at each iteration t, we can sample D_miss^(t), θ^(t) by

1. D_miss^(t) ∼ p(D_miss | D_obs, θ^(t−1))
2. θ^(t) ∼ p(θ | D_obs, D_miss^(t))

where θ may be sampled by using one of the various MCMC algorithms discussed in the previous chapters of this book. A stationary distribution of the above transition kernel is p(θ, D_miss | D_obs), where upon marginalization over the imputed missing values one arrives at the desired target density p(θ | D_obs). In the following section, we detail how these ideas may be implemented with missing multivariate responses in the context of Gibbs sampling.
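As a toy illustration of the two-step kernel (our example, not from the chapter), consider y_i ∼ N(µ, 1) with known variance, a flat prior on µ, and some y_i missing completely at random; both conditionals are then available in closed form, so the DA cycle is two lines per iteration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu_true = 200, 3.0
y = rng.normal(mu_true, 1.0, size=n)
miss = rng.random(n) < 0.3                 # indicator: True = missing

mu = 0.0                                   # initial parameter value
draws = []
for t in range(2000):
    # Step 1: D_miss^(t) ~ p(D_miss | D_obs, theta^(t-1)) = N(mu, 1)
    y_fill = y.copy()
    y_fill[miss] = rng.normal(mu, 1.0, size=miss.sum())
    # Step 2: theta^(t) ~ p(theta | D_obs, D_miss^(t)) = N(ybar, 1/n) under a flat prior
    mu = rng.normal(y_fill.mean(), 1.0 / np.sqrt(n))
    draws.append(mu)

post = np.array(draws[500:])               # discard burn-in
print(post.mean())                         # approximates the observed-data posterior mean
```

The retained draws of µ approximate the observed-data posterior p(µ | D_obs), even though each individual step conditions on an augmented complete data set.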

4 Missing Response

It is quite common to have missing responses in multivariate regression models, particularly when repeated measures are taken and missing values occur at later follow up times. In this section, we focus on the multivariate model with missing responses, in which the missingness depends only on the covariates that are fully observed. For some literature dealing with the missing response scenario, see (Little and Rubin 2014; Daniels and Hogan 2008; Schafer 1997).

4.1 Method: Multivariate Normal Model

In this section, we consider regression models where we have n subjects measured at d occasions with the same set of p predictors. For instance, this would be the case for subjects with baseline levels of a predictor and responses measured over time, or clusters in which predictors are measured at the cluster level. Suppose y_1, y_2, ..., y_d are d possibly correlated random variables with respective means µ_1, µ_2, ..., µ_d and variance-covariance matrix Σ. In terms of matrix notation, we denote the n × d response matrix as Y and the n × p design matrix as X, which can be expressed as

    Y = | Y_11 Y_12 ... Y_1d |      X = | X_11 X_12 ... X_1p |
        | Y_21 Y_22 ... Y_2d |          | X_21 X_22 ... X_2p |
        |  ...  ...  ...  ... |          |  ...  ...  ...  ... |
        | Y_n1 Y_n2 ... Y_nd |          | X_n1 X_n2 ... X_np |

Our multivariate regression model for response vector Y_i = (Y_i1, Y_i2, ..., Y_id)′ with covariate vector X_i = (X_i1, X_i2, ..., X_ip)′ can be written as Y_i | X_i ∼ N(µ_i, Σ), where µ_i is a d × 1 vector and Σ is a d × d matrix. Furthermore, µ_ij = E(Y_i | X_i) = X_i′β, where β is a p × 1 vector of regression coefficients. The key assumption for this particular model is that the measured covariate terms X_ik are the same for each component of the observations Y_ij, where 1 ≤ j ≤ d. Additionally, since we are assuming that β ∈ R^p, the mean response for observation i is the same for each 1 ≤ j ≤ d. Both the assumption of common baseline covariates and a vector β can easily be extended to time varying covariates and a matrix of coefficients β ∈ R^(d×p) with only minor changes to notation; however, we use the current scenario for illustration purposes. Since we are only focusing on the ignorable missing mechanism, we will consider the scenario where missingness in Y depends on non-missing predictors.

We define Y_i(obs) = (Y_i1, Y_i2, ..., Y_id*)′ and Y_i(miss) = (Y_i(d*+1), ..., Y_id)′, where d* < d. The variance-covariance matrix, Σ, can be partitioned as

    Σ = | Σ_(obs)          Σ_(obs,miss) |
        | Σ_(obs,miss)′    Σ_(miss)     |

For the Bayesian solution to the missing data problem, after we specify the complete data model and a noninformative independence Jeffreys’ prior for the parameters, p(β, Σ) ∝ |Σ|^(−(d+1)/2), we would then like to draw inferences based on the observed data posterior p(β, Σ | Y_obs, X). However, as discussed previously, the complete data posterior p(β, Σ | Y, X) is easier to sample from than the observed data posterior. In this particular situation, full conditional distributions for the complete data and parameters of interest are easy to derive. For posterior sampling using data augmentation, at each iteration t, samples from (Y_(miss)^(t), β^(t), Σ^(t)) | Y_obs, X can be obtained by first sampling Y_i(miss)^(t) | ··· ∼ p(Y_i(miss) | Y_i(obs), β^(t−1), Σ^(t−1), X) for 1 ≤ i ≤ n and
 

then sampling (β^(t), Σ^(t)) | ··· ∼ p(β, Σ | Y_(miss)^(t), Y_(obs), X). In the data augmentation step, p(Y_i(miss) | Y_i(obs), β^(t−1), Σ^(t−1), X) is a normal density with mean µ*_i and variance-covariance matrix Σ*_i, where

    µ*_i = vec(X_i′β^(t−1))_(d−d*) + Σ_(obs,miss)^(t−1)′ [Σ_(obs)^(t−1)]^(−1) vec(Y_i(obs) − X_i′β^(t−1))_(d*)

    Σ*_i = Σ_(miss)^(t−1) − Σ_(obs,miss)^(t−1)′ [Σ_(obs)^(t−1)]^(−1) Σ_(obs,miss)^(t−1)

and vec(x )d denotes the scalar value x stacked in a vector of length d . Samples
 
from p(β, Σ | Y_(miss)^(t), Y_(obs), X) can be obtained by Gibbs sampling, since the full conditional posteriors for each parameter are analytic and can be written as p(β | ...) ∼ N(µ_β, V_β) and p(Σ | ...) ∼ W^(−1)(n + d, ψ), where

    µ_β = (∑_i X_i′ Σ^(−1) X_i)^(−1) (∑_i X_i′ Σ^(−1) Y_i)

    V_β = (∑_i X_i′ Σ^(−1) X_i)^(−1)

    ψ = ∑_i (Y_i − X_i β)(Y_i − X_i β)′

In the above notation, the bold symbol X_i represents the d × p matrix with rows X_i′, N denotes a multivariate normal density and W^(−1) the inverse Wishart density.
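The cycle described above (conditional-normal imputation of the missing responses, then normal and inverse-Wishart draws for β and Σ) can be sketched in code. The following is our illustrative Python/NumPy/SciPy implementation, not code from the chapter: the simulated Σ, sample size, missingness rate, and the inverse-Wishart degrees of freedom df = n (our reading of the Jeffreys-prior update) are all assumptions of this sketch.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)
n, d, p = 150, 3, 2
beta_true = np.array([1.0, -0.5])
Sigma_true = np.array([[4.0, 2.0, 2.0],
                       [2.0, 4.0, 2.0],
                       [2.0, 2.0, 4.0]])
X = np.column_stack([np.ones(n), rng.normal(size=n)])      # n x p design, rows x_i'
Y = (X @ beta_true)[:, None] + rng.multivariate_normal(np.zeros(d), Sigma_true, size=n)

miss = rng.random(n) < 0.25            # third response component missing for these subjects
obs_idx, mis_idx = np.array([0, 1]), np.array([2])

beta, Sigma = np.zeros(p), np.eye(d)
Yc = Y.copy()
Yc[miss, 2] = 0.0                      # initialize the augmented values
keep = []
for t in range(1500):
    # --- data augmentation: Y_miss | Y_obs, beta, Sigma is a conditional normal
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    cvar = Sigma[np.ix_(mis_idx, mis_idx)] - S_mo @ np.linalg.solve(S_oo, S_mo.T)
    for i in np.flatnonzero(miss):
        m = X[i] @ beta                                        # common mean of all d components
        cmean = m + S_mo @ np.linalg.solve(S_oo, Y[i, obs_idx] - m)
        Yc[i, 2] = rng.normal(cmean[0], np.sqrt(cvar[0, 0]))
    # --- beta | Sigma, Y ~ N(mu_beta, V_beta); rows of the bold X_i all equal x_i'
    Si = np.linalg.inv(Sigma)
    w = Si.sum()                                               # 1' Sigma^{-1} 1
    V = np.linalg.inv(w * (X.T @ X))
    m_beta = V @ (X.T @ (Yc @ Si.sum(axis=1)))
    beta = rng.multivariate_normal(m_beta, V)
    # --- Sigma | beta, Y ~ inverse Wishart (df = n under the Jeffreys prior, our assumption)
    R = Yc - (X @ beta)[:, None]
    Sigma = invwishart.rvs(df=n, scale=R.T @ R, random_state=rng)
    if t >= 500:
        keep.append(beta.copy())

post_beta = np.mean(keep, axis=0)
print(post_beta)            # close to beta_true
```

Because the rows of the bold X_i are identical here, the β update collapses to a weighted least-squares form; with time-varying covariates one would keep the general ∑_i X_i′Σ^(−1)X_i sums instead.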

4.2 Simulation

A simulation study was conducted to compare the bias and root mean squared error (RMSE) of regression coefficients (β) for complete case analysis and the FB approach under a MAR assumption for our multivariate normal model. Data augmentation using Gibbs sampling was performed to compare the performance of the estimators by using various proportions of missing values in the multivariate response variable, as shown in Tables 1 and 2. In the simulation, at each iteration, three covariates (X1, X2, X3) were generated, in which X1 was binary, and X2, X3 were continuous. X1 was sampled from the binomial distribution with success probability of 0.4, while X2 and X3 were sampled from the normal distribution with mean = 0 and variance = 1. Furthermore, the multivariate response variable, Y, was generated from N(β0 + β1x1 + β2x2 + β3x3, Σ) where

    Σ = | 4 2 2 |
        | 2 4 2 |
        | 2 2 4 |

Table 1 Bias comparison of regression coefficients for complete-cases and data augmentation

Missing %        Complete cases              Data augmentation
y1  y2  y3       β1      β2      β3          β1      β2      β3
5   5   5        0.0194  0.0045  0.0001      0.0208  0.0002  0.0008
10 5 5 0.0009 0.0022 0.0033 0.0058 0.0047 0.0038
20 5 5 0.0006 0.0036 0.0015 0.0020 0.0007 0.0026
5 10 5 0.0053 0.0084 0.0018 0.0090 0.0130 0.0002
5 10 5 0.0068 0.0190 0.0086 0.0147 0.0151 0.0067
10 10 5 0.0253 0.0033 0.0053 0.0266 0.0061 0.0081
10 10 5 0.0069 0.0047 0.0001 0.0005 0.0088 0.0002
20 10 5 0.0047 0.0244 0.0039 0.0017 0.0067 0.0025
5 20 5 0.0099 0.0057 0.0001 0.0235 0.0067 0.0032
5 20 5 0.0270 0.0085 0.0057 0.0198 0.0044 0.0021
10 20 5 0.0043 0.0155 0.0045 0.0029 0.0129 0.0029
10 20 5 0.0148 0.0119 0.0055 0.0132 0.0010 0.0019
20 20 5 0.0105 0.0052 0.0012 0.0234 0.0071 0.0091
5 5 10 0.0052 0.0055 0.0042 0.0068 0.0027 0.0039
5 10 10 0.0250 0.0186 0.0082 0.0175 0.0139 0.0065
10 10 10 0.0032 0.0132 0.0028 0.0066 0.0111 0.0013
20 10 10 0.0042 0.0156 0.0004 0.0017 0.0075 0.0058
5 20 10 0.0227 0.0056 0.0042 0.0073 0.0047 0.0010
10 20 10 0.0268 0.0009 0.0046 0.0110 0.0037 0.0014
20 20 10 0.0264 0.0036 0.0144 0.0176 0.0095 0.0072
20 5 20 0.0347 0.0064 0.0006 0.0164 0.0092 0.0015
5 10 20 0.0240 0.0081 0.0045 0.0069 0.0021 0.0037
10 10 20 0.0016 0.0039 0.0083 0.0028 0.0038 0.0006
20 10 20 0.0097 0.0046 0.0033 0.0079 0.0001 0.0043
5 20 20 0.0236 0.0093 0.0052 0.0223 0.0054 0.0041
10 20 20 0.0158 0.0080 0.0016 0.0093 0.0047 0.0010

20 20 20 0.0146 0.0034 0.0003 0.0164 0.0034 0.0050

β0 = 0 and β1 = β2 = β3 = 1. The sample size for each iteration was 100. Various proportions of missing values were created in Y1, Y2 and Y3 that depend only on non-missing X = (X1, X2, X3) in order to simulate a MAR missing mechanism. To model the missing probability for the variable Y1, Pr(Y_i1 = missing | X), we consider the logistic model as follows:

    Pr(Y_i1 = missing | X) = exp(γ0 + X_i1 + X_i2 + X_i3) / (1 + exp(γ0 + X_i1 + X_i2 + X_i3)) = p_i.
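This deletion scheme is straightforward to reproduce; in the sketch below (ours, not the chapter's code), the intercept γ0 is our own choice, since it is not reported and only serves to control the overall missing fraction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X1 = rng.binomial(1, 0.4, size=n)         # binary covariate
X2 = rng.normal(size=n)                   # continuous covariates
X3 = rng.normal(size=n)

gamma0 = -3.0                             # assumed value: tunes the missing fraction
eta = gamma0 + X1 + X2 + X3
p_i = np.exp(eta) / (1.0 + np.exp(eta))   # Pr(Y_i1 = missing | X)

I = rng.binomial(1, p_i)                  # I_i = 1 => delete Y_i1
print(I.mean())                           # realized proportion of missing Y1 values
```

Since p_i depends only on the fully observed X, the resulting missingness in Y1 is MAR by construction.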

Table 2 RMSE comparison of regression coefficients for complete-cases and data augmentation

Missing %        Complete cases              Data augmentation
y1  y2  y3       β1      β2      β3          β1      β2      β3
5   5   5        0.1396  0.0910  0.0440      0.1424  0.0797  0.0419
10 5 5 0.1514 0.0962 0.0461 0.1332 0.0824 0.0441
20 5 5 0.1613 0.1065 0.0521 0.1428 0.0833 0.0368
20 5 5 0.1613 0.1065 0.0461 0.1428 0.0833 0.0368
5 10 5 0.1441 0.0922 0.0516 0.1421 0.0831 0.0438
5 10 5 0.1515 0.0849 0.0461 0.1420 0.0797 0.0399
10 10 5 0.1740 0.0899 0.0491 0.1513 0.0778 0.0400
10 10 5 0.1688 0.0859 0.0477 0.1530 0.0788 0.0438
20 10 5 0.1816 0.1032 0.0555 0.1704 0.0802 0.0441
5 20 5 0.1604 0.0882 0.0456 0.1519 0.0847 0.0428
5 20 5 0.1700 0.0963 0.0515 0.1396 0.0711 0.0434
10 20 5 0.1666 0.1026 0.0562 0.1603 0.0889 0.0465
10 20 5 0.1578 0.0982 0.0539 0.1389 0.0748 0.0424
20 20 5 0.1850 0.0976 0.0507 0.1631 0.0775 0.0473
5 5 10 0.1624 0.0887 0.0428 0.1514 0.0741 0.0362
5 10 10 0.1561 0.0876 0.0428 0.1515 0.0713 0.0368
10 10 10 0.1634 0.1086 0.0457 0.1470 0.0887 0.0403
20 10 10 0.1714 0.0848 0.0477 0.1266 0.0771 0.0368
5 20 10 0.1756 0.0957 0.0539 0.1498 0.0737 0.0420
10 20 10 0.1579 0.0957 0.0504 0.1307 0.0727 0.0394
20 20 10 0.2012 0.1043 0.0589 0.1502 0.0815 0.0433
20 5 20 0.1857 0.1082 0.0533 0.1355 0.0782 0.0444
5 10 20 0.1561 0.0998 0.0546 0.1391 0.0848 0.0458
10 10 20 0.1745 0.0985 0.0528 0.1432 0.0864 0.0385
20 10 20 0.1794 0.1049 0.0495 0.1532 0.0725 0.0464
5 20 20 0.2022 0.1054 0.0570 0.1628 0.0809 0.0457
10 20 20 0.1853 0.1156 0.0556 0.1493 0.0856 0.0459

20 20 20 0.2067 0.1159 0.0593 0.1652 0.0924 0.0414

Based on this probability, we generated a binary variable indicating whether Y_i1 was missing: I_i ∼ Bernoulli(1, p_i). If I_i = 1 then we deleted the corresponding value of Y_i1. This process ensures that missingness in Y1 depends on the X which are fully observed. We generated missing values in Y2 in a similar manner. Inference for β = (β0, β1, β2, β3) in the complete case analysis was performed by using the complete data posterior mean. The complete case posterior, p(β, Σ | Y, X), was sampled via the Gibbs sampling algorithm (Geman and Geman 1984) as discussed in the previous section. Table 1 shows that the bias for the regression coefficients is negligible under both complete case analysis and data augmentation for various

Table 3 Bias comparison of regression coefficients for complete-cases and data augmentation

Missing %    Complete cases              Data augmentation
x1  x2       β1      β2      β3          β1      β2      β3
5 5 0.0029 0.0003 0.0026 0.0122 0.0263 0.0042
5 10 0.0112 0.0086 0.0010 0.0231 0.0273 0.0017
5 15 0.0008 0.0048 0.0105 0.0110 0.0393 0.0143
5 20 0.0032 0.0025 0.005 0.0094 0.0489 0.0088
10 5 0.0045 0.0088 0.0038 0.0206 0.0327 0.0145
10 10 0.0007 0.0021 0.0041 0.0218 0.0398 0.0164
10 15 0.0062 0.0055 0.0022 0.0282 0.0377 0.0052
10 20 0.0114 0.0008 0.0015 0.0172 0.0511 0.0133
15 5 0.0075 0.0035 0.0012 0.0376 0.0232 0.0237
15 10 0.0102 0.0045 0.0016 0.0222 0.0380 0.0298
15 15 0.0025 0.0042 0.0045 0.0264 0.0396 0.0289
15 20 0.0078 0.0034 0.0045 0.0240 0.0478 0.0221
20 5 0.0146 0.0005 0.0008 0.0529 0.0290 0.0325
20 10 0.0137 0.0026 0.0023 0.0426 0.0296 0.0381
20 15 0.0103 0.0042 0.0035 0.0423 0.0430 0.0456
20 20 0.0011 0.0028 0.0033 0.0257 0.0573 0.0406

proportions of missing values of the response variable. However, the RMSEs are smaller for DA as compared to the complete case analysis under different proportions of missingness in the response variable (Table 2).

4.3 Prostate Specific Antigen (PSA) Data

This section focuses on a real data application of the fully Bayesian approach for analyzing the multivariate normal model as discussed in the previous section. We will illustrate the application by using the prostate specific antigen (PSA) data which was published by Etzioni et al. (1999). This was a sub study of the beta-carotene and retinol trial (CARET) with 71 cases of prostate cancer patients and 70 controls (i.e. subjects not diagnosed with prostate cancer by the time of analysis, matched to cases on date of birth). In the PSA dataset, in addition to baseline age, there were two biomarkers measured over time (9 occasions): free PSA (fpsa) and total PSA (tpsa). For illustrative purposes, we investigate the effect of baseline age on the first three fpsa measurements in patients. There were missing values in the fpsa variable at occasions 2 and 3 for 14 patients in the study. Under the assumption of missingness being dependent only on the fully observed baseline age, we obtain an estimate of the regression coefficient for age of β_age = 0.0029 with SE = 0.00026 using the fully Bayesian modeling approach. A similar model fit with complete-case analysis gives an estimate for baseline age of 0.0061 with SE = 0.00033.

5 Missing Covariates

In addition to missing outcomes being a common occurrence in experimental studies, missing covariates are frequently encountered as well. In this section, we focus on missing covariates in which the missingness depends on the fully observed response. We specialize our analysis to the normal regression model. We direct the reader to the multiple methods proposed in the literature for handling missing covariates (Ibrahim et al. 1999a, 2002; Lipsitz and Ibrahim 1996; Satten and Carroll 2000; Xie and Paik 1997).

5.1 Method

Let Y_i be the i-th response and X_ij be the j-th covariate for the i-th subject. A missing value in the j-th covariate for the i-th subject can be represented as a parameter, M_ij. Let R_i be the set that indexes covariates which are observed for the i-th subject. The normal regression model with missing covariates can be represented as

    µ_i = ∑_{j ∈ R_i} X_ij β_j + ∑_{j ∈ R_i^c} M_ij β_j        (1)

 
where Yi ∼ N µi , σ 2 . Ou Ourr go
goal
al is to esti
estima
matete re
regr
gres
essi
sion
on coef
coeffic
ficie
ient
ntss in th
thee pr
pres
esen
ence
ce
of mi
miss
ssin
ing
g da
data
ta.. For nota
 notatition
on purp
purpososeses,, we ca
cann writ
writee th
thee co
coll
llec
ecti
tion
on of miss
missiningg data
data as
parameter,, M = Mi 1 , Mi 2 . . . Mi p , with
the parameter with cor
corres
respon
pondin
ding prior p ( M ). Ea
g prior Each
  ch of the
missing
missing param
parameter s, M i j , will
eters, will be assi
assign
 gneded a pr
prio
iorr di
dist
stri
ribu
buti on p Mi p . The
tion The comp
complelete
te
data posterior p β , σ 2 , M |Y , X can be determined by Bayes rule as follows:

p β , σ , M L Y |β , σ 2 , M , X
 2
  2
 
p β , σ , M |Y , X = (2)
p β , σ 2 , M L Y |β , σ 2 , M , X d β d σ 2 d M

 
The posterior in Eq. 2 depends on the missing covariate parameters M . However,

our main interest is in posterior inference about β and σ 2 , and the desired posterior
  posterior
distribution p β , σ 2 |Y , X can be obtai
obtained
ned by

p β , σ 2 |Y , X =
    p β , σ 2 , M |Y , X d M
 (3)

In general, Eq. 3 involves multi-dimensional integrals which do not have closed forms and will be high-dimensional even for low fractions of missing covariate values. In fact, the dimension of the integration problem is the same as the number of missing values. Posterior sampling can be performed based on various MCMC methods, such as via the Gibbs sampler after specifying full conditionals or using a random walk Metropolis algorithm discussed in Chap. 1 of this book.



An important issue in our current setting is the appropriate prior specification for missing covariates. There are many ways to assign missing covariate distributions; however, some modeling strategies, especially in the presence of large fractions of missing data, can lead to a vast number of nuisance parameters. Hence, MCMC methods to sample posteriors can then become computationally intensive and inefficient even if the parameters are identifiable. Thus, it is essential to reduce the number of nuisance parameters by specifying effective joint covariate distributions. Ibrahim et al. (1999b) and Lipsitz and Ibrahim (1996) proposed the strategy of modeling the joint distribution of missing covariates as a sequence of one-dimensional distributions, which can be given by

    p(M_i1, ..., M_ip | α) = p(M_ip | M_i1, ..., M_i(p−1), α_p)
                             × p(M_i(p−1) | M_i1, ..., M_i(p−2), α_(p−1)) × ···
                             × p(M_i2 | M_i1, α_2) p(M_i1 | α_1)        (4)

where α = (α_1, α_2, ..., α_p)^T. In the specification of the one-dimensional conditional distributions in Eq. 4, suppose we know that X = (X_i1, X_i2, ..., X_ip) contains all continuous predictors; then a sequence of univariate normal distributions can be assigned to p(M_ip | M_i1, ..., M_i(p−1), α_p). If X contains all categorical variables, say binary, then the guideline is to specify a sequence of logistic regressions for each p(M_ip | M_i1, ..., M_i(p−1), α_p). One can consider a sequence of multinomial distributions in the case where all variables in X have more than two levels. Similarly, for all count variables, one can assign Poisson distributions. With missing categorical and continuous covariates, it is recommended that the joint covariate distribution should be assigned by first specifying the distribution of the categorical covariates conditional on the continuous covariates. In some special circumstances, when X_ip is strictly positive and continuous, a normal distribution can be specified on the transformed variable log(X_ip). If log(X_ip) is not approximately normal, then other specifications such as an exponential or gamma distribution are recommended.
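As a minimal illustration of this sequential strategy (our sketch, with hypothetical α values), one can draw a continuous and a binary covariate jointly by sampling the continuous one from its marginal and the binary one from a logistic model given it:

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_covariate_pair(k, a1=-0.5, a2=1.0, a3=0.0, a4=1.0):
    """Draw k pairs from p(M1, M2) = p(M1 | M2, a1, a2) p(M2 | a3, a4):
    M2 ~ N(a3, a4) (continuous), then M1 | M2 ~ Bernoulli with a logistic link.
    The alpha values are illustrative assumptions, not chapter estimates."""
    m2 = rng.normal(a3, np.sqrt(a4), size=k)
    c = 1.0 / (1.0 + np.exp(-(a1 + a2 * m2)))   # P(M1 = 1 | M2)
    m1 = rng.binomial(1, c)
    return m1, m2

m1, m2 = draw_covariate_pair(5000)
print(m1.mean(), m2.mean())
```

Extending the factorization to more covariates only appends further one-dimensional conditionals, which is what keeps the number of nuisance parameters manageable.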

5.2 Simulation

A simulation study was conducted to determine the bias and mean squared error (MSE) of estimating β with various proportions of missing covariates under a MAR assumption. The data augmentation procedure for estimating β and MSEs was compared to a complete case analysis. In the simulation, at each iteration, three covariates (X1, X2, X3), in which X1 is binary and X2, X3 are continuous, were generated. X1 was generated from the binomial distribution with a success probability of 0.3, X2 was simulated from a normal distribution with mean = 0 and variance = 1, and X3 was generated from a normal distribution with mean = 1 and variance = 2. Furthermore, the response variable Y was generated from N(β0 + β1x1 + β2x2 + β3x3, σ² = 1) with β0 = 0 and β1 = β2 = β3 = 1. The sample size for each iteration was 100.
 

Various proportions of missing values were created in X1 and X2, which depended on non-missing X3 in order to simulate the MAR missing mechanism. To model the missing probability for the variable X1, Pr(X_i1 = missing | X_i3), we consider the logistic model as follows:

    Pr(X_i1 = missing | X_i3) = exp(γ + X_i3) / (1 + exp(γ + X_i3)) = p_i.

Based on this probability, we generated the binary variable indicating whether X_i1 is missing: I_i ∼ Bernoulli(1, p_i). If I_i = 1 then we deleted the corresponding value of X_i1. This process ensures that missingness in X1 depends only on X3, which is fully observed. Similarly, we generated missing values in X2, which depend on the fully observed covariate X3. In addition to this, we have a fully observed response variable Y. Inferences about β = (β0, β1, β2, β3) from the complete data posterior p(β, σ², M_i1, M_i2 | Y, X) ∝ p(β, σ², M_i1, M_i2) L(Y_i | β, σ², M_i1, M_i2, X) can be obtained by using the random walk Metropolis algorithm (Metropolis and Ulam 1949; Metropolis et al. 1953; Hastings 1970). The joint missing covariate distribution, p(M_i1, M_i2 | α_1, α_2, α_3, α_4), is modeled as a sequence of one-dimensional distributions p(M_i1 | M_i2, α_1, α_2) ∼ Bernoulli(1, C_i), where log(C_i / (1 − C_i)) = α_1 + α_2 M_i2, and p(M_i2 | α_3, α_4) ∼ N(α_3, α_4). In this simulation study, we placed the following prior distributions on the four regression parameters and hyper-parameters of the missing covariate distributions:

    p(β0), p(β1), p(β2), p(β3) ∼ N(0, var = 10)
    p(α_1), p(α_2), p(α_3) ∼ N(0, var = 10)
    p(α_4), p(σ²) ∼ gamma(2, 2)
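A stripped-down version of this sampler can be written in a few lines. In the sketch below (ours, not the chapter's code) we fix σ² = 1 and the α hyper-parameters, so that θ collects only (β0, β1) and the missing covariate values, and we run a single random-walk Metropolis update on the joint vector; all simulation settings are our own choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 120
x = rng.normal(size=n)                       # covariate, partially missing
y = 1.0 + 2.0 * x + rng.normal(size=n)       # true beta0 = 1, beta1 = 2, sigma = 1
miss = rng.random(n) < 0.2
k = int(miss.sum())

def log_post(theta):
    """Complete-data log posterior (sigma^2 and the alphas fixed to keep the
    sketch short); theta = (beta0, beta1, M_1, ..., M_k)."""
    b0, b1, m = theta[0], theta[1], theta[2:]
    xf = x.copy()
    xf[miss] = m                             # treat missing covariates as parameters
    resid = y - b0 - b1 * xf
    lp = -0.5 * np.sum(resid ** 2)           # N(mu_i, 1) likelihood
    lp += -0.5 * (b0 ** 2 + b1 ** 2) / 10.0  # N(0, var = 10) priors on beta
    lp += -0.5 * np.sum(m ** 2)              # covariate model: x_i ~ N(0, 1)
    return lp

theta = np.zeros(2 + k)
lp = log_post(theta)
step = 0.05
keep = []
for t in range(10000):
    prop = theta + step * rng.normal(size=theta.size)   # random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:             # Metropolis accept/reject
        theta, lp = prop, lp_prop
    if t >= 3000:
        keep.append(theta[:2].copy())

post = np.mean(keep, axis=0)
print(post)         # approximately (1, 2)
```

In the chapter's full analysis, σ², the α hyper-parameters and a second missing covariate are also sampled; the accept/reject step is unchanged, only the dimension of θ grows.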

Bayesian linear regression was fit by using the above described complete data likelihood, missing data model assumptions and prior distributions to estimate β and variances. Bayesian linear regression was also implemented for the complete cases to estimate β and variances. This process was repeated 500 times to evaluate the performance of the estimators in terms of bias and MSE for the regression parameters in order to compare complete case analysis to data augmentation. Tables 3 and 4 demonstrate the results from our simulation study comparing the bias and MSEs. Table 3 shows that complete case analysis has negligible biases for various combinations of the proportions of missing data on X1 and X2, while with data augmentation the regression coefficients have slight biases away from the null when missingness in covariates is independent of the outcome given the covariates. However, as Table 4 shows, the MSEs for both complete-case analysis and data augmentation in estimating the regression coefficients for the missing covariates (β1 and β2) are very similar, and the MSEs for the non-missing covariate, β3, are significantly smaller for DA compared to complete-cases under various proportions of missing values. Moreover, when the missingness in covariates depends on the outcome, complete-case estimates are biased towards the null, while data augmentation had negligible bias (results are not shown here). For detailed discussion

Table 4 MSE comparison of regression coefficients for complete-cases and data augmentation

Missing %    Complete cases              Data augmentation
x1  x2       β1      β2      β3          β1      β2      β3
5 5 0.0484 0.0110 0.0044 0.0464 0.0109 0.0038
5 10 0.0610 0.0122 0.0053 0.0564 0.0123 0.0045
5 15 0.0627 0.0132 0.0078 0.0579 0.0146 0.0062
5 20 0.0730 0.0150 0.0092 0.0621 0.0170 0.0070
10 5 0.0527 0.0134 0.0060 0.0489 0.0136 0.0044
10 10 0.0610 0.0148 0.0078 0.0599 0.0150 0.0057
10 15 0.0642 0.0142 0.0089 0.0607 0.0137 0.0067
10 20 0.0689 0.0138 0.0119 0.0579 0.0155 0.0085
15 5 0.0557 0.0139 0.0071 0.0530 0.0136 0.0056
15 10 0.0659 0.0158 0.0094 0.0601 0.0160 0.0072
15 15 0.0663 0.0161 0.0110 0.0656 0.0157 0.0072
15 20 0.0719 0.0173 0.0151 0.0646 0.0187 0.0084
20 5 0.0644 0.0136 0.0083 0.0648 0.0124 0.0056
20 10 0.0772 0.0174 0.0133 0.0747 0.0155 0.0074
20 15 0.0764 0.0159 0.0162 0.0680 0.0156 0.0085

20 20 0.0915 0.0179 0.0198 0.0815 0.0188 0.0089

about the bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values, see (White and Carlin 2010; Chen et al. 2008).

5.3 BRFSS Data

We applied the FB approach to the Behavioral Risk Factor Surveillance System (BRFSS) data. BRFSS is the world's largest ongoing random-digit dialed (RDD) telephone survey, and is conducted by health departments in the United States. BRFSS collects state level data about US residents, inquiring about their health-related risk behaviors and events, chronic health conditions and use of preventive services (BRFSS 2015). The survey is conducted each year, having begun in 1984. For illustration purposes, we used the Georgia 2013 BRFSS data for the month of January. This dataset has 336 variables and 518 observations. To demonstrate the fully Bayesian approach
for missing covariates, we selected the variable, Reported weight in pounds, as our response variable. Reported height in inches and participated in any physical activity
or exercises in the past month (1 = Yes, 0 = No) were considered as covariates in
our analysis. Because of non-response, some of the subjects are missing values for
the physical activity (PA) variable. There were a total of 34 subjects who did not
respond to the question of participating in physical activity in the past month. To fit
the fully Bayes model, we used the random-walk Metropolis-Hastings algorithm for
Markov Chain Monte-Carlo Methods for Missing … 141

posterior sampling and used posterior means to estimate the regression coefficients
for the covariates, reported height and participated in any physical activity in the
past month. The results are reported in Table 5.

Table 5 Regression coefficients and SEs for BRFSS data

Coefficients    Bayesian approach    Complete-cases
β_PA            –7.6668 (2.5760)     –7.8990 (2.5062)
β_ht             3.4302 (0.1346)      3.3865 (0.1379)
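As a concrete sketch of a random-walk Metropolis-Hastings update of the kind used for posterior sampling here, the following toy example targets a one-dimensional standard normal log-density rather than the chapter's actual BRFSS posterior; the function names, proposal scale, and burn-in choice are illustrative assumptions, not the authors' code.

```python
import numpy as np

def rw_metropolis(log_post, init, n_iter=5000, scale=0.5, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric normal proposal:
    accept theta' with probability min(1, exp(log_post(theta') - log_post(theta)))."""
    rng = np.random.default_rng(seed)
    theta, draws = init, np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + scale * rng.standard_normal()   # symmetric proposal
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop                               # accept; otherwise keep theta
        draws[t] = theta
    return draws

# Toy target: N(0, 1) log-density up to an additive constant.
draws = rw_metropolis(lambda th: -0.5 * th ** 2, init=0.0)
post_mean = draws[1000:].mean()   # posterior mean after discarding burn-in
```

In the BRFSS analysis the target would instead be the joint posterior of the regression coefficients and the missing physical-activity values, with the retained draws summarized by posterior means as in Table 5.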

6 Discussion

In this chapter, we have illustrated how a fully Bayesian regression modeling
approach can be applied to incomplete data under an ignorability assumption. It
is well known that the observed data are not sufficient to identify the underlying
missing mechanism. Therefore, sensitivity analyses should be performed over var-
ious plausible models for the nonresponse mechanism (Little and Rubin 2014). In
general, the stability of conclusions (inferences) over the plausible models gives
an indication of their robustness to unverifiable assumptions about the mechanism
underlying missingness. In linear regression, when missingness on Y depends on
the fully observed Xs (MAR), DA has negligible bias and smaller MSEs compared
to the complete-cases. When missingness in Xs depends on other Xs which are
fully observed then the CC analysis has negligible bias and very similar MSEs com-
pared to the DA for missing covariates. Furthermore, when missingness in covariates
depends on the response, DA will perform better than CC. Because of these biases,
the choice of the method should come from a substantive basis. In summary, a FB

modeling approach enables coherent model estimation because missing data values
are treated as parameters, which are easily sampled within MCMC simulations. The
FB approach takes into account uncertainty about missing observations and offers
a very flexible way to handle the missing data. In conclusion we remark that for
missing covariates we have used a default class of priors to make inferences. How-
ever, for some studies, historical data may be available allowing for construction
of informative priors that may further improve inference. For more details, Ibrahim
et al. (2002) proposed a class of informative priors for generalized linear models
with missing covariates.
142 H. Rochani and D.F. Linder

References

Behavioral risk factor surveillance system. Retrieved July 5, 2015, from http://www.cdc.gov/brfss.
Chen, Q., Ibrahim, J. G., Chen, M. -H., & Senchaudhuri, P. (2008). Theory and inference for
regression models with missing responses and covariates. Journal of Multivariate Analysis, 99(6),
1302–1331.
Daniels, M. J., & Hogan, J. W. (2008). Missing data in longitudinal studies: Strategies for Bayesian
modeling and sensitivity analysis. CRC Press.
Etzioni, R., Pepe, M., Longton, G., Chengcheng, H., & Goodman, G. (1999). Incorporating
the time dimension in receiver operating characteristic curves: A case study of prostate cancer.
Medical Decision Making, 19(3), 242–251.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Hastings, W. K. (1970). Monte-Carlo sampling methods using Markov chains and their applications.
Biometrika, 57(1), 97–109.
Ibrahim, J. G., Chen, M. -H., & Lipsitz, S. R. (1999a). Monte-Carlo EM for missing covariates in
parametric regression models. Biometrics, 55(2), 591–596.
Ibrahim, J. G., Lipsitz, S. R., & Chen, M. -H. (1999b). Missing covariates in generalized linear models
when the missing data mechanism is non-ignorable. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 61(1), 173–190.
Ibrahim, J. G., Chen, M. -H., & Lipsitz, S. R. (2002). Bayesian methods for generalized linear
models with covariates missing at random. Canadian Journal of Statistics , 30 (1), 55–78.
Lipsitz, S. R., & Ibrahim, J. G. (1996). A conditional model for incomplete covariates in parametric
regression models. Biometrika, 83(4), 916–922.
Little, R. J. A., & Rubin, D. B. (2014). Statistical analysis with missing data. Wiley.
Metropolis, N., & Ulam, S. (1949). The Monte-Carlo method. Journal of the American Statistical
Association, 44(247), 335–341.
Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction.
USA: Oxford University Press.
Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing
machines. The Journal of Chemical Physics, 21(6), 1087–1092.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Satten, G. A., & Carroll, R. J. (2000). Conditional and unconditional categorical regression models
with missing covariates. Biometrics, 56 (2), 384–388.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. CRC Press.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmen-
tation. Journal of the American Statistical Association, 82(398), 528–540.
White, I. R., & Carlin, J. B. (2010). Bias and efficiency of multiple imputation compared with
complete-case analysis for missing covariate values. Statistics in Medicine, 29 (28), 2920–2931.
Xie, F., & Paik, M. C. (1997). Multiple imputation methods for the missing covariates in generalized
estimating equation. Biometrics, 1538–1546.
A Multiple Imputation Framework
for Massive Multivariate Data of Different
Variable Types: A Monte-Carlo Technique

Hakan Demirtas

Abstract The purpose of this chapter is to build theoretical, algorithmic, and
implementation-based components of a unified, general-purpose multiple imputa-
tion framework for intensive multivariate data sets that are collected via increasingly
popular real-time data capture methods. Such data typically include all major types
of variables that are incomplete due to planned missingness designs, which have been
developed to reduce respondent burden and lower the cost associated with data collec-
tion. The imputation approach presented herein complements the methods available
for incomplete data analysis via richer and more flexible modeling procedures, and
can easily generalize to a variety of research areas that involve internet studies and
processes that are designed to collect continuous streams of real-time data. Planned
missingness designs are highly useful and will likely increase in popularity in the
future. For this reason, the proposed multiple imputation framework represents an
important and refined addition to the existing methods, and has potential to advance
scientific knowledge and research in a meaningful way. Capability of accommodating
many incomplete variables of different distributional nature, types, and dependence
structures could be a contributing factor for better comprehending the operational
characteristics of today's massive data trends. It offers promising potential for build-
ing enhanced statistical computing infrastructure for education and research in the
sense of providing principled, useful, general, and flexible set of computational tools
for handling incomplete data.

1 Introduction

Missing data are a commonly occurring phenomenon in many contexts. Determining
a suitable analytical approach in the presence of incomplete observations is a major
focus of scientific inquiry due to the additional complexity that arises through missing
data. Incompleteness generally complicates the statistical analysis in terms of biased

H. Demirtas (B)
Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago,
1603 West Taylor Street, Chicago, IL 60612, USA
e-mail: demirtas@uic.edu

© Springer Nature Singapore Pte Ltd. 2017 143


D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical
Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_8
144 H. Demirtas

parameter estimates, reduced statistical power, and degraded confidence intervals,


and thereby may lead to false inferences (Little and Rubin 2002). Advances in com-
putational statistics have produced flexible missing-data procedures with a sound
statistical basis. One of these procedures involves multiple imputation (MI), which
is a stochastic simulation technique in which the missing values are replaced by
m > 1 simulated versions (Rubin 2004). Subsequently, each of the simulated com-
plete data sets is analyzed by standard methods, and the results are combined into a
single inferential statement that formally incorporates missing-data uncertainty into the
modeling process. MI has gained widespread acceptance and popularity in the last
few decades. It has some well-documented advantages: First, MI allows researchers
to use more conventional models and software; an imputed data set may be ana-
lyzed by literally any method that would be suitable if the data were complete. As
computing environments and statistical models grow increasingly sophisticated, the
value of using familiar methods and software becomes important. Second, there are
still many classes of problems for which no direct maximum likelihood procedure
is available. Even when such a procedure exists, MI can be more attractive due to
fact that the separation of the imputation phase from the analysis phase lends greater
flexibility to the entire process. Lastly, MI singles out missing data as a source of
random variation distinct from ordinary sampling variability.
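The combining step just described can be made concrete with Rubin's rules: average the m point estimates, and add the between-imputation variance (inflated by 1 + 1/m) to the average within-imputation variance. A minimal sketch; the function name and the numeric inputs below are made up for illustration:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Rubin's rules: pool m per-imputation estimates Q_j and their squared
    standard errors U_j into one estimate and one total standard error."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()              # pooled point estimate
    w_bar = u.mean()              # within-imputation variance
    b = q.var(ddof=1)             # between-imputation variance
    t = w_bar + (1 + 1 / m) * b   # total variance
    return q_bar, np.sqrt(t)

# Three imputed-data analyses (hypothetical numbers):
est, se = pool_rubin([1.1, 0.9, 1.0], [0.040, 0.050, 0.045])
# The pooled SE exceeds the naive average SE because of between-imputation spread.
```

This is precisely where MI "singles out missing data as a source of random variation": the B term quantifies how much the answer varies across plausible completions of the data.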
These days most incomplete data sets involve variables of many different types
on a structural level; causal and correlational interdependencies are a function of a
mixture of binary, ordinal, count, and continuous variables as well as nonresponse
rates and mechanisms, all of which act simultaneously to characterize the data-
analytics paradigms under consideration. The use of digital data is growing rapidly
as researchers get more capable of collecting instantaneous self-reported data using
mobile devices in naturalistic settings; and real-time data capture methods supply
novel insights into determinants of how real-life processes are formed. Mobile phones
are becoming ubiquitous and easy to use, and thus have the capacity to collect data
quickly from large numbers of people and transfer this information to remote servers
in an unobtrusive way. As these procedures yield relatively large numbers of observations
per subject, planned missing data designs have been developed to reduce respondent
burden and lower the cost associated with data collection. Planned missing data
designs include items or groups of items according to predetermined probabilis-
tic sampling schemes. These designs are highly useful and will likely increase in
popularity in the future.
The purpose of this work is to build theoretical, algorithmic, and implementation-
based components of a unified, general-purpose multiple imputation framework,
which can be instrumental in developing power analysis guidelines for intensive mul-
tivariate data sets that are collected via increasingly popular real-time data capture
(RTDC hereafter) approaches. Such data typically include all major types of variables
(binary, ordinal, count, and continuous) that are incomplete due to planned missing-
ness designs. Existing MI methodologies are restrictive for RTDC data, because
they hinge upon strict and unrealistic assumptions (multivariate normal model for

continuous data, multinomial model for discrete data, general location model for a

mix of normal and categorical data), commonly known as joint MI models (Schafer
1997). Some methods can handle all data types with relaxed assumptions, but lack
theoretical justification (Van Buuren 2012; Raghunathan et al. 2001). For massive
data as collected in RTDC studies, a novel imputation framework that can accommo-
date all four major types of variables is needed with a minimal set of assumptions.
In addition, no statistical power (probability of correctly detecting an effect) and
sample size (number of subjects, measurements, waves) procedures are available for
RTDC data. The lack of these tools severely limits our ability to capitalize on the
full potential of incomplete intensive data, and a unified framework for simultane-
ously imputing all types of variables is necessary to adequately capture a broad set
of substantive messages that massive data are designed to convey. The proposed MI
framework represents an important and refined addition to the existing methods, and
has potential to advance scientific knowledge and research in a meaningful way; it
offers promising potential for building enhanced statistical computing infrastructure
for education and research in the sense of providing principled, useful, general, and
flexible set of computational tools for handling incomplete data.
Combining our previous random number generation (RNG) work for multivariate
ordinal data (Demirtas 2006), joint binary and normal data (Demirtas and Doganay
2012), ordinal and normal data (Demirtas and Yavuz 2015), count and normal data
(Amatya and Demirtas 2015a), binary and nonnormal continuous data (Demirtas
et al. 2012; Demirtas 2014, 2016) with the specification of marginal and associational
parameters; our published work on MI (Demirtas 2004, 2005, 2007, 2008, 2009,
2010, 2017a; Demirtas et al. 2007, 2008; Demirtas and Hedeker 2007, 2008a, b, c;
Demirtas and Schafer 2003; Yucel and Demirtas 2010), along with some related
work (Demirtas et al. 2016a; Demirtas and Hedeker 2011, 2016; Emrich and Pied-
monte 1991; Fleishman 1978; Ferrari and Barbiero 2012; Headrick 2010; Yahav and
Shmueli 2012), a broad mixed data imputation framework that spans all possible
combinations of binary, ordinal, count, and continuous variables, is proposed. This
system is capable of handling the overwhelming majority of continuous shapes; it
can be extended to control for higher order moments for continuous variables, and to
allow over- and under-dispersion for count variables as well as the specification of
Spearman’s rank correlations as the measure of association. Procedural, conceptual,
operational, and algorithmic details of the published, current, and future work will
be given throughout the chapter.
The organization of the chapter is as follows: In Sect. 2, background information
is provided on the generation of multivariate binary, ordinal, and count data with
an emphasis on underlying multivariate normal data that form a basis for the subse-
quent discretization in the binary and ordinal cases, and correlation mapping using
inverse cumulative distribution functions (cdfs) in the count data case. Then, differ-
ent correlation types that are relevant to the work are described; a linear relationship
between correlations before and after discretization is discussed; and multivariate
power polynomials in the context of generating continuous data are elaborated on.
In Sect. 3, operational characteristics of MI under normality assumption are articu-
lated. In Sect. 4, an MI algorithm for multivariate data with all four major variable

types is outlined under ignorable nonresponse by merging the available RNG and
MI routines. Finally, in Sect. 5, concluding remarks, future research directions, and
extensions are given.

2 Background on RNG

The proposed MI algorithm has strong impetus and computational roots derived from
some ideas that appeared in the RNG literature. This section provides salient fea-
tures of multivariate normal (MVN), multivariate count (MVC), multivariate binary
(MVB), and multivariate ordinal (MVO) data generation. Relevant correlation struc-
tures involving these different types of variables are discussed; a connection between
correlations before and after discretization is established; and the use of power poly-
nomials, which will be employed at later stages to accommodate nonnormal contin-
uous data, is explained.
MVN Data Generation: Sampling from the MVN distribution is straightforward.
Suppose Z ∼ N_d(µ, Σ), where µ is the mean vector, and Σ is the symmetric, positive
definite, d × d variance-covariance matrix. A random draw from a MVN distribution
can be obtained using the Cholesky decomposition of Σ and a vector of univariate
normal draws. The Cholesky decomposition of Σ produces a lower-triangular matrix
A for which AA^T = Σ. If z = (z_1, ..., z_d) are d independent standard normal
variables, then Z = µ + Az is a random draw from the MVN distribution with mean
vector µ and covariance matrix Σ.
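The Cholesky recipe translates directly into code; a minimal NumPy sketch, where the particular µ and Σ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2017)
mu = np.array([1.0, -2.0, 0.5])              # mean vector µ
sigma = np.array([[2.0, 0.5, 0.3],           # Σ: symmetric, positive definite
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

A = np.linalg.cholesky(sigma)                # lower triangular; A @ A.T == Σ
z = rng.standard_normal((3, 10_000))         # columns of independent N(0,1) draws
X = mu[:, None] + A @ z                      # each column is one MVN(µ, Σ) draw

sample_mean = X.mean(axis=1)                 # should approximate µ
sample_cov = np.cov(X)                       # should approximate Σ
```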
MVC Data Generation: Count data have been traditionally modeled by the Poisson
distribution. Although a few multivariate Poisson (MVP) data generation techniques
have been published, the method in Yahav and Shmueli (2012) is the only one that
reasonably works (allowing negative correlations) when the number of components
is greater than two. Their method utilizes a slightly modified version of the NORTA
(Normal to Anything) approach (Nelsen 2006), which involves generation of MVN
variates with given univariate marginals and the correlation structure (R_N), and then
transforming it into any desired distribution using the inverse cdf. In the Poisson
case, NORTA can be implemented by the following steps:
1. Generate a k-dimensional normal vector Z_N from the MVN distribution with mean
   vector 0 and a correlation matrix R_N.
2. Transform Z_N to a Poisson vector X_POIS as follows:
   (a) For each element z_i of Z_N, calculate the normal cdf, Φ(z_i).
   (b) For each value of Φ(z_i), calculate the Poisson inverse cdf with a desired
       corresponding marginal rate θ_i, Ψ_{θ_i}^{-1}(Φ(z_i)), where
       Ψ_{θ_i}(x) = Σ_{j=0}^{x} e^{-θ_i} θ_i^j / j!.
3. X_POIS = (Ψ_{θ_1}^{-1}(Φ(z_1)), ..., Ψ_{θ_k}^{-1}(Φ(z_k)))^T is a draw from the desired MVP
   distribution with correlation matrix R_POIS.
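Assuming a standard NumPy/SciPy stack (not code from the chapter), the three NORTA steps can be sketched as follows; the rates and the normal-scale correlation matrix are illustrative:

```python
import numpy as np
from scipy.stats import norm, poisson

def norta_poisson(r_n, thetas, n, seed=0):
    """NORTA draws: MVN with correlation R_N -> uniforms via the normal cdf
    -> Poisson components via the inverse cdf (ppf) at the target rates."""
    rng = np.random.default_rng(seed)
    k = len(thetas)
    z = rng.multivariate_normal(np.zeros(k), r_n, size=n)       # step 1
    u = norm.cdf(z)                                             # step 2(a)
    cols = [poisson.ppf(u[:, i], thetas[i]) for i in range(k)]  # step 2(b)
    return np.column_stack(cols)                                # step 3

r_n = np.array([[1.0, 0.4],
                [0.4, 1.0]])
x = norta_poisson(r_n, thetas=[3.0, 7.0], n=20_000)
# Marginal means approximate the rates; the realized Poisson-scale correlation
# is close to, but generally not equal to, the 0.4 specified on the normal scale.
```

The slight modification of Yahav and Shmueli addresses exactly this last point: R_N is adjusted so that the resulting R_POIS matches a target correlation on the count scale.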
