
Robust

Nonparametric
Statistical Methods
Second Edition



MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY

General Editors

F. Bunea, V. Isham, N. Keiding, T. Louis, R. L. Smith, and H. Tong

1 Stochastic Population Models in Ecology and Epidemiology M.S. Bartlett (1960)


2 Queues D.R. Cox and W.L. Smith (1961)
3 Monte Carlo Methods J.M. Hammersley and D.C. Handscomb (1964)
4 The Statistical Analysis of Series of Events D.R. Cox and P.A.W. Lewis (1966)
5 Population Genetics W.J. Ewens (1969)
6 Probability, Statistics and Time M.S. Bartlett (1975)
7 Statistical Inference S.D. Silvey (1975)
8 The Analysis of Contingency Tables B.S. Everitt (1977)
9 Multivariate Analysis in Behavioural Research A.E. Maxwell (1977)
10 Stochastic Abundance Models S. Engen (1978)
11 Some Basic Theory for Statistical Inference E.J.G. Pitman (1979)
12 Point Processes D.R. Cox and V. Isham (1980)
13 Identification of Outliers D.M. Hawkins (1980)
14 Optimal Design S.D. Silvey (1980)
15 Finite Mixture Distributions B.S. Everitt and D.J. Hand (1981)
16 Classification A.D. Gordon (1981)
17 Distribution-Free Statistical Methods, 2nd edition J.S. Maritz (1995)
18 Residuals and Influence in Regression R.D. Cook and S. Weisberg (1982)
19 Applications of Queueing Theory, 2nd edition G.F. Newell (1982)
20 Risk Theory, 3rd edition R.E. Beard, T. Pentikäinen and E. Pesonen (1984)
21 Analysis of Survival Data D.R. Cox and D. Oakes (1984)
22 An Introduction to Latent Variable Models B.S. Everitt (1984)
23 Bandit Problems D.A. Berry and B. Fristedt (1985)
24 Stochastic Modelling and Control M.H.A. Davis and R. Vinter (1985)
25 The Statistical Analysis of Compositional Data J. Aitchison (1986)
26 Density Estimation for Statistics and Data Analysis B.W. Silverman (1986)
27 Regression Analysis with Applications G.B. Wetherill (1986)
28 Sequential Methods in Statistics, 3rd edition
G.B. Wetherill and K.D. Glazebrook (1986)
29 Tensor Methods in Statistics P. McCullagh (1987)
30 Transformation and Weighting in Regression
R.J. Carroll and D. Ruppert (1988)
31 Asymptotic Techniques for Use in Statistics
O.E. Barndorff-Nielsen and D.R. Cox (1989)
32 Analysis of Binary Data, 2nd edition D.R. Cox and E.J. Snell (1989)
33 Analysis of Infectious Disease Data N.G. Becker (1989)
34 Design and Analysis of Cross-Over Trials B. Jones and M.G. Kenward (1989)
35 Empirical Bayes Methods, 2nd edition J.S. Maritz and T. Lwin (1989)
36 Symmetric Multivariate and Related Distributions
K.T. Fang, S. Kotz and K.W. Ng (1990)
37 Generalized Linear Models, 2nd edition P. McCullagh and J.A. Nelder (1989)
38 Cyclic and Computer Generated Designs, 2nd edition
J.A. John and E.R. Williams (1995)
39 Analog Estimation Methods in Econometrics C.F. Manski (1988)
40 Subset Selection in Regression A.J. Miller (1990)
41 Analysis of Repeated Measures M.J. Crowder and D.J. Hand (1990)
42 Statistical Reasoning with Imprecise Probabilities P. Walley (1991)
43 Generalized Additive Models T.J. Hastie and R.J. Tibshirani (1990)
44 Inspection Errors for Attributes in Quality Control
N.L. Johnson, S. Kotz and X. Wu (1991)



45 The Analysis of Contingency Tables, 2nd edition B.S. Everitt (1992)
46 The Analysis of Quantal Response Data B.J.T. Morgan (1992)
47 Longitudinal Data with Serial Correlation—A State-Space Approach
R.H. Jones (1993)
48 Differential Geometry and Statistics M.K. Murray and J.W. Rice (1993)
49 Markov Models and Optimization M.H.A. Davis (1993)
50 Networks and Chaos—Statistical and Probabilistic Aspects
O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall (1993)
51 Number-Theoretic Methods in Statistics K.-T. Fang and Y. Wang (1994)
52 Inference and Asymptotics O.E. Barndorff-Nielsen and D.R. Cox (1994)
53 Practical Risk Theory for Actuaries
C.D. Daykin, T. Pentikäinen and M. Pesonen (1994)
54 Biplots J.C. Gower and D.J. Hand (1996)
55 Predictive Inference—An Introduction S. Geisser (1993)
56 Model-Free Curve Estimation M.E. Tarter and M.D. Lock (1993)
57 An Introduction to the Bootstrap B. Efron and R.J. Tibshirani (1993)
58 Nonparametric Regression and Generalized Linear Models
P.J. Green and B.W. Silverman (1994)
59 Multidimensional Scaling T.F. Cox and M.A.A. Cox (1994)
60 Kernel Smoothing M.P. Wand and M.C. Jones (1995)
61 Statistics for Long Memory Processes J. Beran (1995)
62 Nonlinear Models for Repeated Measurement Data
M. Davidian and D.M. Giltinan (1995)
63 Measurement Error in Nonlinear Models
R.J. Carroll, D. Ruppert and L.A. Stefanski (1995)
64 Analyzing and Modeling Rank Data J.I. Marden (1995)
65 Time Series Models—In Econometrics, Finance and Other Fields
D.R. Cox, D.V. Hinkley and O.E. Barndorff-Nielsen (1996)
66 Local Polynomial Modeling and its Applications J. Fan and I. Gijbels (1996)
67 Multivariate Dependencies—Models, Analysis and Interpretation
D.R. Cox and N. Wermuth (1996)
68 Statistical Inference—Based on the Likelihood A. Azzalini (1996)
69 Bayes and Empirical Bayes Methods for Data Analysis
B.P. Carlin and T.A. Louis (1996)
70 Hidden Markov and Other Models for Discrete-Valued Time Series
I.L. MacDonald and W. Zucchini (1997)
71 Statistical Evidence—A Likelihood Paradigm R. Royall (1997)
72 Analysis of Incomplete Multivariate Data J.L. Schafer (1997)
73 Multivariate Models and Dependence Concepts H. Joe (1997)
74 Theory of Sample Surveys M.E. Thompson (1997)
75 Retrial Queues G. Falin and J.G.C. Templeton (1997)
76 Theory of Dispersion Models B. Jørgensen (1997)
77 Mixed Poisson Processes J. Grandell (1997)
78 Variance Components Estimation—Mixed Models, Methodologies and Applications P.S.R.S. Rao (1997)
79 Bayesian Methods for Finite Population Sampling
G. Meeden and M. Ghosh (1997)
80 Stochastic Geometry—Likelihood and computation
O.E. Barndorff-Nielsen, W.S. Kendall and M.N.M. van Lieshout (1998)
81 Computer-Assisted Analysis of Mixtures and Applications—
Meta-analysis, Disease Mapping and Others D. Böhning (1999)
82 Classification, 2nd edition A.D. Gordon (1999)
83 Semimartingales and their Statistical Inference B.L.S. Prakasa Rao (1999)
84 Statistical Aspects of BSE and vCJD—Models for Epidemics
C.A. Donnelly and N.M. Ferguson (1999)
85 Set-Indexed Martingales G. Ivanoff and E. Merzbach (2000)



86 The Theory of the Design of Experiments D.R. Cox and N. Reid (2000)
87 Complex Stochastic Systems
O.E. Barndorff-Nielsen, D.R. Cox and C. Klüppelberg (2001)
88 Multidimensional Scaling, 2nd edition T.F. Cox and M.A.A. Cox (2001)
89 Algebraic Statistics—Computational Commutative Algebra in Statistics
G. Pistone, E. Riccomagno and H.P. Wynn (2001)
90 Analysis of Time Series Structure—SSA and Related Techniques
N. Golyandina, V. Nekrutkin and A.A. Zhigljavsky (2001)
91 Subjective Probability Models for Lifetimes
Fabio Spizzichino (2001)
92 Empirical Likelihood Art B. Owen (2001)
93 Statistics in the 21st Century
Adrian E. Raftery, Martin A. Tanner, and Martin T. Wells (2001)
94 Accelerated Life Models: Modeling and Statistical Analysis
Vilijandas Bagdonavicius and Mikhail Nikulin (2001)
95 Subset Selection in Regression, Second Edition Alan Miller (2002)
96 Topics in Modelling of Clustered Data
Marc Aerts, Helena Geys, Geert Molenberghs, and Louise M. Ryan (2002)
97 Components of Variance D.R. Cox and P.J. Solomon (2002)
98 Design and Analysis of Cross-Over Trials, 2nd Edition
Byron Jones and Michael G. Kenward (2003)
99 Extreme Values in Finance, Telecommunications, and the Environment
Bärbel Finkenstädt and Holger Rootzén (2003)
100 Statistical Inference and Simulation for Spatial Point Processes
Jesper Møller and Rasmus Plenge Waagepetersen (2004)
101 Hierarchical Modeling and Analysis for Spatial Data
Sudipto Banerjee, Bradley P. Carlin, and Alan E. Gelfand (2004)
102 Diagnostic Checks in Time Series Wai Keung Li (2004)
103 Stereology for Statisticians Adrian Baddeley and Eva B. Vedel Jensen (2004)
104 Gaussian Markov Random Fields: Theory and Applications
Håvard Rue and Leonhard Held (2005)
105 Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition
Raymond J. Carroll, David Ruppert, Leonard A. Stefanski,
and Ciprian M. Crainiceanu (2006)
106 Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood
Youngjo Lee, John A. Nelder, and Yudi Pawitan (2006)
107 Statistical Methods for Spatio-Temporal Systems
Bärbel Finkenstädt, Leonhard Held, and Valerie Isham (2007)
108 Nonlinear Time Series: Semiparametric and Nonparametric Methods
Jiti Gao (2007)
109 Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis
Michael J. Daniels and Joseph W. Hogan (2008)
110 Hidden Markov Models for Time Series: An Introduction Using R
Walter Zucchini and Iain L. MacDonald (2009)
111 ROC Curves for Continuous Data
Wojtek J. Krzanowski and David J. Hand (2009)
112 Antedependence Models for Longitudinal Data
Dale L. Zimmerman and Vicente A. Núñez-Antón (2009)
113 Mixed Effects Models for Complex Data
Lang Wu (2010)
114 Introduction to Time Series Modeling
Genshiro Kitagawa (2010)
115 Expansions and Asymptotics for Statistics
Christopher G. Small (2010)
116 Statistical Inference: An Integrated Bayesian/Likelihood Approach
Murray Aitkin (2010)
117 Circular and Linear Regression: Fitting Circles and Lines by Least Squares
Nikolai Chernov (2010)
118 Simultaneous Inference in Regression Wei Liu (2010)
119 Robust Nonparametric Statistical Methods, Second Edition Thomas P. Hettmansperger and
Joseph W. McKean (2011)



Monographs on Statistics and Applied Probability 119

Robust
Nonparametric
Statistical Methods
Second Edition

Thomas P. Hettmansperger
Penn State University
University Park, Pennsylvania, USA

Joseph W. McKean
Western Michigan University
Kalamazoo, Michigan, USA



CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2011 by Taylor and Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed in the United States of America on acid-free paper


10 9 8 7 6 5 4 3 2 1

International Standard Book Number: 978-1-4398-0908-2 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to
publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials
or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material repro-
duced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any
form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming,
and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400.
CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been
granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identifica-
tion and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Hettmansperger, Thomas P., 1939-


Robust nonparametric statistical methods / Thomas P. Hettmansperger, Joseph W. McKean. -- 2nd ed.
p. cm. -- (Monographs on statistics and applied probability ; 119)
Summary: “Often referred to as distribution-free methods, nonparametric methods do not rely on
assumptions that the data are drawn from a given probability distribution. With an emphasis on Wilcoxon
rank methods that enable a unified approach to data analysis, this book presents a unique overview of robust
nonparametric statistical methods. Drawing on examples from various disciplines, the relevant R code for
these examples, as well as numerous exercises for self-study, the text covers location models, regression
models, designed experiments, and multivariate methods. This edition features a new chapter on cluster
correlated data”-- Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-4398-0908-2 (hardback)
1. Nonparametric statistics. 2. Robust statistics. I. McKean, Joseph W., 1944- II. Title. III. Series.

QA278.8.H47 2010
519.5--dc22 2010044858

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com




Dedication: To Ann and to Marge


Contents

Preface xv

1 One-Sample Problems 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Location Model . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Geometry and Inference in the Location Model . . . . . . . . . 5
1.3.1 Computation . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 Properties of Norm-Based Inference . . . . . . . . . . . . . . . 19
1.5.1 Basic Properties of the Power Function γS (θ) . . . . . 20
1.5.2 Asymptotic Linearity and Pitman Regularity . . . . . . 22
1.5.3 Asymptotic Theory and Efficiency Results for θ̂ . . . . 26
1.5.4 Asymptotic Power and Efficiency Results for the Test
Based on S(θ) . . . . . . . . . . . . . . . . . . . . . . . 27
1.5.5 Efficiency Results for Confidence Intervals Based on S(θ) 29
1.6 Robustness Properties of Norm-Based Inference . . . . . . . . 32
1.6.1 Robustness Properties of θ̂ . . . . . . . . . . . . . . . . 33
1.6.2 Breakdown Properties of Tests . . . . . . . . . . . . . . 35
1.7 Inference and the Wilcoxon Signed-Rank Norm . . . . . . . . 38
1.7.1 Null Distribution Theory of T (0) . . . . . . . . . . . . 39
1.7.2 Statistical Properties . . . . . . . . . . . . . . . . . . . 40
1.7.3 Robustness Properties . . . . . . . . . . . . . . . . . . 46
1.8 Inference Based on General Signed-Rank Norms . . . . . . . . 48
1.8.1 Null Properties of the Test . . . . . . . . . . . . . . . . 50
1.8.2 Efficiency and Robustness Properties . . . . . . . . . . 51
1.9 Ranked Set Sampling . . . . . . . . . . . . . . . . . . . . . . . 57
1.10 L1 Interpolated Confidence Intervals . . . . . . . . . . . . . . 61
1.11 Two-Sample Analysis . . . . . . . . . . . . . . . . . . . . . . . 65
1.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


2 Two-Sample Problems 77
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.2 Geometric Motivation . . . . . . . . . . . . . . . . . . . . . . 78
2.2.1 Least Squares (LS) Analysis . . . . . . . . . . . . . . . 81
2.2.2 Mann-Whitney-Wilcoxon (MWW) Analysis . . . . . . 82
2.2.3 Computation . . . . . . . . . . . . . . . . . . . . . . . 84
2.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.4 Inference Based on the Mann-Whitney-Wilcoxon . . . . . . . . 87
2.4.1 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.4.2 Confidence Intervals . . . . . . . . . . . . . . . . . . . 97
2.4.3 Statistical Properties of the Inference Based on the MWW 97
2.4.4 Estimation of ∆ . . . . . . . . . . . . . . . . . . . . . . 102
2.4.5 Efficiency Results Based on Confidence Intervals . . . . 103
2.5 General Rank Scores . . . . . . . . . . . . . . . . . . . . . . . 105
2.5.1 Statistical Methods . . . . . . . . . . . . . . . . . . . . 109
2.5.2 Efficiency Results . . . . . . . . . . . . . . . . . . . . . 110
2.5.3 Connection between One- and Two-Sample Scores . . . 113
2.6 L1 Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.6.1 Analysis Based on the L1 Pseudo-Norm . . . . . . . . . 115
2.6.2 Analysis Based on the L1 Norm . . . . . . . . . . . . . 119
2.7 Robustness Properties . . . . . . . . . . . . . . . . . . . . . . 122
2.7.1 Breakdown Properties . . . . . . . . . . . . . . . . . . 122
2.7.2 Influence Functions . . . . . . . . . . . . . . . . . . . . 123
2.8 Proportional Hazards . . . . . . . . . . . . . . . . . . . . . . . 125
2.8.1 The Log Exponential and the Savage Statistic . . . . . 126
2.8.2 Efficiency Properties . . . . . . . . . . . . . . . . . . . 129
2.9 Two-Sample Rank Set Sampling (RSS) . . . . . . . . . . . . . 131
2.10 Two-Sample Scale Problem . . . . . . . . . . . . . . . . . . . 133
2.10.1 Appropriate Score Functions . . . . . . . . . . . . . . . 133
2.10.2 Efficacy of the Traditional F -Test . . . . . . . . . . . . 142
2.11 Behrens-Fisher Problem . . . . . . . . . . . . . . . . . . . . . 144
2.11.1 Behavior of the Usual MWW Test . . . . . . . . . . . . 144
2.11.2 General Rank Tests . . . . . . . . . . . . . . . . . . . . 146
2.11.3 Modified Mathisen’s Test . . . . . . . . . . . . . . . . . 147
2.11.4 Modified MWW Test . . . . . . . . . . . . . . . . . . . 149
2.11.5 Efficiencies and Discussion . . . . . . . . . . . . . . . . 150
2.12 Paired Designs . . . . . . . . . . . . . . . . . . . . . . . . . . 152
2.12.1 Behavior under Alternatives . . . . . . . . . . . . . . . 156
2.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157


3 Linear Models 165


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
3.2 Geometry of Estimation and Tests . . . . . . . . . . . . . . . . 166
3.2.1 The Geometry of Estimation . . . . . . . . . . . . . . . 166
3.2.2 The Geometry of Testing . . . . . . . . . . . . . . . . . 169
3.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
3.4 Assumptions for Asymptotic Theory . . . . . . . . . . . . . . 177
3.5 Theory of Rank-Based Estimates . . . . . . . . . . . . . . . . 180
3.5.1 R Estimators of the Regression Coefficients . . . . . . . 180
3.5.2 R Estimates of the Intercept . . . . . . . . . . . . . . . 185
3.6 Theory of Rank-Based Tests . . . . . . . . . . . . . . . . . . . 191
3.6.1 Null Theory of Rank-Based Tests . . . . . . . . . . . . 191
3.6.2 Theory of Rank-Based Tests under Alternatives . . . . 197
3.6.3 Further Remarks on the Dispersion Function . . . . . . 201
3.7 Implementation of the R Analysis . . . . . . . . . . . . . . . . 203
3.7.1 Estimates of the Scale Parameter τϕ . . . . . . . . . . 204
3.7.2 Algorithms for Computing the R Analysis . . . . . . . 207
3.7.3 An Algorithm for a Linear Search . . . . . . . . . . . . 210
3.8 L1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
3.9 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
3.9.1 Properties of R Residuals and Model Misspecification . 214
3.9.2 Standardization of R Residuals . . . . . . . . . . . . . 220
3.9.3 Measures of Influential Cases . . . . . . . . . . . . . . 227
3.10 Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 231
3.11 Correlation Model . . . . . . . . . . . . . . . . . . . . . . . . . 240
3.11.1 Huber’s Condition for the Correlation Model . . . . . . 240
3.11.2 Traditional Measure of Association and Its Estimate . 242
3.11.3 Robust Measure of Association and Its Estimate . . . . 243
3.11.4 Properties of R Coefficients of Multiple Determination 245
3.11.5 Coefficients of Determination for Regression . . . . . . 250
3.12 High Breakdown (HBR) Estimates . . . . . . . . . . . . . . . 252
3.12.1 Geometry of the HBR Estimates . . . . . . . . . . . . 252
3.12.2 Weights . . . . . . . . . . . . . . . . . . . . . . . . . . 253
3.12.3 Asymptotic Normality of β̂_HBR . . . . . . . . . . . . . 256
3.12.4 Robustness Properties of the HBR Estimates . . . . . . 260
3.12.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 263
3.12.6 Implementation and Examples . . . . . . . . . . . . . . 264
3.12.7 Studentized Residuals . . . . . . . . . . . . . . . . . . 265
3.12.8 Example on Curvature Detection . . . . . . . . . . . . 267
3.13 Diagnostics for Differentiating between Fits . . . . . . . . . . 268
3.14 Rank-Based Procedures for Nonlinear Models . . . . . . . . . 276
3.14.1 Implementation . . . . . . . . . . . . . . . . . . . . . . 279


3.15 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282

4 Experimental Designs: Fixed Effects 291


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
4.2 One-way Design . . . . . . . . . . . . . . . . . . . . . . . . . . 292
4.2.1 R Fit of the One-way Design . . . . . . . . . . . . . . . 294
4.2.2 Rank-Based Tests of H0 : µ1 = · · · = µk . . . . . . . . 296
4.2.3 Tests of General Contrasts . . . . . . . . . . . . . . . . 299
4.2.4 More on Estimation of Contrasts and Location . . . . . 300
4.2.5 Pseudo-observations . . . . . . . . . . . . . . . . . . . 302
4.3 Multiple Comparison Procedures . . . . . . . . . . . . . . . . 304
4.3.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 311
4.4 Two-way Crossed Factorial . . . . . . . . . . . . . . . . . . . . 313
4.5 Analysis of Covariance . . . . . . . . . . . . . . . . . . . . . . 317
4.6 Further Examples . . . . . . . . . . . . . . . . . . . . . . . . . 321
4.7 Rank Transform . . . . . . . . . . . . . . . . . . . . . . . . . . 325
4.7.1 Monte Carlo Study . . . . . . . . . . . . . . . . . . . . 327
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

5 Models with Dependent Error Structure 337


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
5.2 General Mixed Models . . . . . . . . . . . . . . . . . . . . . . 337
5.2.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . 342
5.3 Simple Mixed Models . . . . . . . . . . . . . . . . . . . . . . . 342
5.3.1 Variance Component Estimators . . . . . . . . . . . . . 343
5.3.2 Studentized Residuals . . . . . . . . . . . . . . . . . . 344
5.3.3 Example and Simulation Studies . . . . . . . . . . . . 346
5.3.4 Simulation Studies of Validity . . . . . . . . . . . . . . 347
5.3.5 Simulation Study of Other Score Functions . . . . . . . 349
5.4 Arnold Transformations . . . . . . . . . . . . . . . . . . . . . 350
5.4.1 R Fit Based on Arnold Transformed Data . . . . . . . 351
5.5 General Estimating Equations (GEE) . . . . . . . . . . . . . . 356
5.5.1 Asymptotic Theory . . . . . . . . . . . . . . . . . . . . 359
5.5.2 Implementation and a Monte Carlo Study . . . . . . . 360
5.5.3 Example: Inflammatory Markers . . . . . . . . . . . . . 362
5.6 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
5.6.1 Asymptotic Theory . . . . . . . . . . . . . . . . . . . . 368
5.6.2 Wald-Type Inference . . . . . . . . . . . . . . . . . . . 370
5.6.3 Linear Models with Autoregressive Errors . . . . . . . 372
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375


6 Multivariate 377
6.1 Multivariate Location Model . . . . . . . . . . . . . . . . . . . 377
6.2 Componentwise Methods . . . . . . . . . . . . . . . . . . . . . 382
6.2.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 385
6.2.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
6.2.3 Componentwise Rank Methods . . . . . . . . . . . . . 390
6.3 Spatial Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 392
6.3.1 Spatial Sign Methods . . . . . . . . . . . . . . . . . . . 392
6.3.2 Spatial Rank Methods . . . . . . . . . . . . . . . . . . 399
6.4 Affine Equivariant and Invariant Methods . . . . . . . . . . . 403
6.4.1 Blumen’s Bivariate Sign Test . . . . . . . . . . . . . . 403
6.4.2 Affine Invariant Sign Tests . . . . . . . . . . . . . . . . 405
6.4.3 The Oja Criterion Function . . . . . . . . . . . . . . . 413
6.4.4 Additional Remarks . . . . . . . . . . . . . . . . . . . 418
6.5 Robustness of Estimates of Location . . . . . . . . . . . . . . 419
6.5.1 Location and Scale Invariance: Componentwise Methods 419
6.5.2 Rotation Invariance: Spatial Methods . . . . . . . . . . 420
6.5.3 The Spatial Hodges-Lehmann Estimate . . . . . . . . . 421
6.5.4 Affine Equivariant Spatial Median . . . . . . . . . . . . 421
6.5.5 Affine Equivariant Oja Median . . . . . . . . . . . . . 422
6.6 Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
6.6.1 Test for Regression Effect . . . . . . . . . . . . . . . . 425
6.6.2 The Estimate of the Regression Effect . . . . . . . . . 431
6.6.3 Tests of General Hypotheses . . . . . . . . . . . . . . . 432
6.7 Experimental Designs . . . . . . . . . . . . . . . . . . . . . . . 439
6.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443

A Asymptotic Results 447


A.1 Central Limit Theorems . . . . . . . . . . . . . . . . . . . . . 447
A.2 Simple Linear Rank Statistics . . . . . . . . . . . . . . . . . . 448
A.2.1 Null Asymptotic Distribution Theory . . . . . . . . . . 449
A.2.2 Local Asymptotic Distribution Theory . . . . . . . . . 450
A.2.3 Signed-Rank Statistics . . . . . . . . . . . . . . . . . . 457
A.3 Rank-Based Analysis of Linear Models . . . . . . . . . . . . . 460
A.3.1 Convex Functions . . . . . . . . . . . . . . . . . . . . . 463
A.3.2 Asymptotic Linearity and Quadraticity . . . . . . . . . 464
A.3.3 Asymptotic Distance between β̂ and β̃ . . . . . . . . . 467
A.3.4 Consistency of the Test Statistic Fϕ . . . . . . . . . . . 468
A.3.5 Proof of Lemma 3.5.1 . . . . . . . . . . . . . . . . . . . 469
A.4 Asymptotic Linearity for the L1 Analysis . . . . . . . . . . . . 470
A.5 Influence Functions . . . . . . . . . . . . . . . . . . . . . . . . 473


A.5.1 Influence Function for Estimates Based on Signed-Rank


Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 474
A.5.2 Influence Functions for Chapter 3 . . . . . . . . . . . . 476
A.5.3 Influence Function of β̂_HBR of Section 3.12.4 . . . . . . 482
A.6 Asymptotic Theory for Section 3.12.3 . . . . . . . . . . . . . . 484
A.7 Asymptotic Theory for Section 3.12.7 . . . . . . . . . . . . . . 491
A.8 Asymptotic Theory for Section 3.13 . . . . . . . . . . . . . . . 492

References 495

Author Index 521

Index 527


Preface

Basically, I’m not interested in doing research and I never have been. I’m
interested in understanding, which is quite a different thing. And often to
understand something you have to work it out yourself because no one else has
done it.

David Blackwell describing himself as a “dilettante” in a 1983 interview for
Mathematical People, a collection of profiles and interviews.

I don’t believe I can really do without teaching. The reason is, I have to have
something so that when I don’t have any ideas and I’m not getting anywhere
I can say to myself, “At least I’m living; at least I’m doing something; I’m
making some contribution”-it’s just psychological.

Richard Feynman

Nonparametric inference methods, especially those derived from ranks,
have a long and successful history extending back to early work by Frank
Wilcoxon in 1945. In the first edition of this monograph we developed rank-
based methods from the unifying theme of geometry and continue this ap-
proach in the second edition. The least squares norm is replaced by a weighted
L1 norm, and the resulting statistical interpretations are similar to those of
least squares. This results in rank-based methods or L1 methods depending
on the choice of weights. The rank-based methods proceed much like the tra-
ditional analysis. Using the norm, models are easily fitted. Diagnostics pro-
cedures can then be used to check the quality of fit (model criticism) and to
locate outlying points and points of high influence. Upon satisfaction with the
fit, rank-based inferential procedures can then be used to conduct the statisti-
cal analysis. The advantages of rank-based methods include better power and
efficiency at heavy-tailed distributions and robustness against various model
violations and pathological data.
In the first edition we extended rank methods from univariate location
models to linear models and multivariate models, providing a much more ex-
tensive set of tools and methods for data analysis. The second edition provides

additional models (including models with dependent error structure and non-
linear models) and methods and extends significantly the possible analyses
based on ranks.
In the second edition we have retained the material on one- and two-sample
problems (Chapters 1 and 2) along with the basic development of rank meth-
ods in the linear model (Chapter 3) and fixed effects experimental designs
(Chapter 4). Chapter 5, from the first edition, on high breakdown R esti-
mates has been condensed and moved to Chapter 3. In addition, Chapter
3 now contains a new section on rank procedures for nonlinear models. Se-
lected topics from the first four chapters provide a basic graduate course in
rank-based methods. The methods are fully illustrated and the theory fully
developed. The prerequisites are a basic course in mathematical statistics and
some background in applied statistics. For a one semester course, we suggest
the first seven sections of Chapter 1, the first four sections of Chapter 2, the
first seven sections plus section 9 in Chapter 3, and the first four sections of
Chapter 4, and then choice of topics depending on interest.
The new Chapter 5 deals with models with dependent error structure. New
material on rank methods for mixed models is included along with material
on general estimating equations, GEE. Finally, a section on time series has
been added. As in the first edition, this new material is illustrated on data
sets and R software is made available to the reader.
Chapter 6 in both editions deals with multivariate models. In the sec-
ond edition we have added new material on the development of affine in-
variant/equivariant sign methods based on transform-retransform techniques.
The new methods are computationally efficient as opposed to the earlier affine
invariant/equivariant methods.
The methods developed in the book can be computed using R li-
braries and functions. These libraries are discussed and illustrated in
the relevant sections. Information on several of these packages and func-
tions (including Robnp, ww, and Rfit) can be obtained at the web site
http://www.stat.wmich.edu/mckean/index.html. Hence, we have again ex-
panded significantly the available set of tools and inference methods based
on ranks.
We have included the data sets for many of our examples in the book. For
others, the reader can obtain the data at the Chapman and Hall web site. See
also the site http://www.stat.wmich.edu/mckean/index.html for information
on the data sets used in this book.
We are indebted to many of our students and colleagues for valuable dis-
cussions, stimulation, and motivation. In particular, the first author would like
to express his sincere thanks for many stimulating hours of discussion with
Steve Arnold, Bruce Brown, and Hannu Oja while the second author wants
to express his sincere thanks for discussions over the years with Ash Abebe,

Kim Crimin, Brad Huitema, John Kapenga, John Kloke, Joshua Naranjo, M.
Rashid, Jerry Sievers, Jeff Terpstra, and Tom Vidmar. We both would like to
express our debt to Simon Sheather, our friend, colleague, and co-author on
many papers. We express our thanks to Rob Calver, Sarah Morris, and Michele
Dimont of Chapman & Hall/CRC for their assistance in the preparation of
this book.

Thomas P. Hettmansperger
Joseph W. McKean


Chapter 1

One-Sample Problems

1.1 Introduction
Traditional statistical procedures are widely used because they offer the user a
unified methodology with which to attack a multitude of problems, from simple
location problems to highly complex experimental designs. These procedures
are based on least squares fitting. Once the problem has been cast into a model
then least squares offers the user:

1. a way of fitting the model by minimizing the Euclidean normed distance between the responses and the conjectured model;

2. diagnostic techniques that check the adequacy of the fit of the model, explore the quality of fit, and detect outlying and/or influential cases;

3. inferential procedures, including confidence procedures, tests of hypotheses, and multiple comparison procedures;

4. computational feasibility.

Procedures based on least squares, though, are easily impaired by outlying
observations. Indeed one outlying observation is enough to spoil the least
squares fit, its associated diagnostics and inference procedures. Even though
traditional inference procedures are exact when the errors in the model follow
a normal distribution, they can be quite inefficient when the distribution of
the errors has longer tails than the normal distribution.
For simple location problems, nonparametric methods were proposed by
Wilcoxon (1945). These methods consist of test statistics based on the ranks
of the data and associated estimates and confidence intervals for location pa-
rameters. The test statistics are distribution free in the sense that their null
distributions do not depend on the distribution of the errors. It was soon

realized that these procedures are almost as efficient as the traditional meth-
ods when the errors follow a normal distribution and, furthermore, are often
much more efficient relative to the traditional methods when the error distri-
butions deviate from normality; see Hodges and Lehmann (1956). These pro-
cedures possess both robustness of validity and power. In recent years these
nonparametric methods have been extended to linear and nonlinear models.
In addition, from the perspective of modern robustness theory, contrary to
least squares estimates, these rank-based procedures have bounded influence
functions and positive breakdown points.
Often these nonparametric procedures are thought of as disjoint methods
that differ from one problem to another. In this text, we intend to show that
this is not the case. Instead, these procedures present a unified methodology
analogous to the traditional methods. The four items cited above for the tra-
ditional analysis hold for these procedures too. Indeed the only operational
difference is that the Euclidean norm is replaced by another norm.
There are computational procedures available for the rank-based pro-
cedures discussed in this book. We offer the reader a collection of com-
putational functions written in the software language R; see the site
http://www.stat.wmich.edu/mckean/. We refer to these computational algo-
rithms as robust nonparametric R algorithms or Robnp. For the chapters on
linear models we make use of the set of algorithms ww written by Terpstra
and McKean (2005) and the R package Rfit developed by Kloke and McKean
(2010). We discuss these functions throughout the text and use them in many
of the examples, simulation studies, and exercises. The programming language
R (see Ihaka and Gentleman, 1996) is freeware and can run on all (PC, Mac,
Linux) platforms. To download the R software and accompanying informa-
tion, visit the site http://www.r-project.org/. The language R has intrinsic
functions for computation of some of the procedures discussed in this and the
next chapter.

1.2 Location Model


In this chapter we consider the one-sample location problem. This allows us
to explore some useful concepts such as distribution freeness and robustness
in a simple setting. We extend many of these concepts to more complicated
situations in later chapters. We need to first define a location parameter. For a
random variable X we often subscript its distribution function by X to avoid
confusion.

Definition 1.2.1. Let T(H) be a function defined on the set of distribution
functions. We say T(H) is a location functional if

1. If G is stochastically larger than F (G(x) ≤ F(x) for all x), then T(G) ≥ T(F);

2. T (HaX+b ) = aT (HX ) + b, a > 0;

3. T (H−X ) = −T (HX ).
Then, we call θ = T (H) a location parameter of H.
Note that if X has location parameter θ it follows from the second item in
the above definition that the random variable e = X −θ has location parameter
0. Suppose X1 , . . . , Xn is a random sample having the common distribution
function H(x) and θ = T (H) is a location parameter of interest. We express
this by saying that Xi follows the statistical location model,

Xi = θ + ei , i = 1, . . . , n , (1.2.1)

where e_1, . . . , e_n are independent and identically distributed random variables with distribution function F(x) and density function f(x) and location T(F) =
0. It follows that H(x) = F (x − θ) and that T (H) = θ. We next discuss three
examples of location parameters that we use throughout this chapter. Other
location parameters are discussed in Section 1.8. See Bickel and Lehmann
(1975) for additional discussion of location functionals.

Example 1.2.1 (The Median Location Functional). First define the inverse of the cdf H(x) by H^{-1}(u) = inf{x : H(x) ≥ u}. Generally we suppose that H(x) is strictly increasing on its support and this eliminates ambiguities on the selection of the parameter. Now define θ_1 = T_1(H) = H^{-1}(1/2). This is the median functional. Note that if G(x) ≤ F(x) for all x, then G^{-1}(u) ≥ F^{-1}(u) for all u; and, in particular, G^{-1}(1/2) ≥ F^{-1}(1/2). Hence, T_1(H) satisfies the first condition for a location functional. Next let H^*(x) = P(aX + b ≤ x) = H[a^{-1}(x − b)]. Then it follows at once that H^{*-1}(u) = aH^{-1}(u) + b and the second condition is satisfied. The third condition follows with an argument similar to the one for the second condition.
Example 1.2.2 (The Mean Location Functional). For the mean functional let θ_2 = T_2(H) = ∫ x dH(x), when the mean exists. Note that ∫ x dH(x) = ∫_0^1 H^{-1}(u) du. Now if G(x) ≤ F(x) for all x, then x ≤ G^{-1}(F(x)). Let x = F^{-1}(u); then F^{-1}(u) ≤ G^{-1}(F(F^{-1}(u))) ≤ G^{-1}(u). Hence, T_2(G) = ∫_0^1 G^{-1}(u) du ≥ ∫_0^1 F^{-1}(u) du = T_2(F) and the first condition is satisfied. The other two conditions follow easily from the definition of the integral.
Example 1.2.3 (The Pseudo-Median Location Functional). Assume that X_1 and X_2 are independent and identically distributed (iid) with distribution function H(x). Let Y = (X_1 + X_2)/2. Then Y has distribution function H^*(y) = P(Y ≤ y) = ∫ H(2y − x)h(x) dx. Let θ_3 = T_3(H) = H^{*-1}(1/2). To show that T_3 is a location functional, suppose G(x) ≤ F(x) for all x. Then

G^*(y) = ∫ G(2y − x)g(x) dx = ∫ [ ∫_{−∞}^{2y−x} g(t) dt ] g(x) dx
       ≤ ∫ [ ∫_{−∞}^{2y−x} f(t) dt ] g(x) dx
       = ∫ [ ∫_{−∞}^{2y−t} g(x) dx ] f(t) dt
       ≤ ∫ [ ∫_{−∞}^{2y−t} f(x) dx ] f(t) dt = F^*(y) ;

hence, as in Example 1.2.1, it follows that G^{*-1}(u) ≥ F^{*-1}(u) and, hence, that T_3(G) ≥ T_3(F). For the second property, let W = aX + b where X has distribution function H and a > 0. Then W has distribution function F_W(t) = H((t − b)/a). By the change of variable z = (x − b)/a, we have

F_W^*(y) = ∫ H((2y − x − b)/a) (1/a) h((x − b)/a) dx = ∫ H(2(y − b)/a − z) h(z) dz .

Thus the defining equation for T_3(F_W) is

1/2 = ∫ H(2(T_3(F_W) − b)/a − z) h(z) dz ,

which is satisfied by T_3(F_W) = aT_3(H) + b. For the third property, let V = −X where X has distribution function H. Then V has distribution function F_V(t) = 1 − H(−t). Hence, by the change of variable z = −x,

F_V^*(y) = ∫ (1 − H(−2y + x)) h(−x) dx = 1 − ∫ H(−2y − z) h(z) dz .

Because the defining equation of T_3(F_V) can be written as

1/2 = ∫ H(2(−T_3(F_V)) − z) h(z) dz ,

it follows that T_3(F_V) = −T_3(H). Therefore, T_3 is a location functional. It has been called the pseudo-median by Hoyland (1965) and is more appropriate for symmetric distributions.
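
To make the pseudo-median concrete, here is a small numerical sketch (ours, not from the text; the distributions are chosen only for illustration). It approximates T_3(H) = H^{*-1}(1/2) by Monte Carlo as the median of pairwise averages, showing that it differs from the median for a skewed distribution but agrees with the center of symmetry for a symmetric one.

  ## Illustrative sketch: approximate the pseudo-median T3(H) by simulation.
  set.seed(1)
  n <- 1e5
  x1 <- rexp(n); x2 <- rexp(n)                 # skewed: standard exponential
  c(median = median(x1), pseudo.median = median((x1 + x2)/2))
  y1 <- rnorm(n, mean = 3); y2 <- rnorm(n, mean = 3)   # symmetric about 3
  c(median = median(y1), pseudo.median = median((y1 + y2)/2))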

The next theorem characterizes all the location functionals for a symmetric
distribution.


Theorem 1.2.1. Suppose that the pdf h(x) is symmetric about some point a. If T(H) is a location functional, then T(H) = a.
Proof: Let the random variable X have pdf h(x) symmetric about a. Let Y = X − a; then Y has pdf g(y) = h(y + a), which is symmetric about 0. Hence Y and −Y have the same distribution. By the third property of location functionals, this means that T(G_Y) = T(G_{−Y}) = −T(G_Y); i.e., T(G_Y) = 0. By the second property, 0 = T(G_Y) = T(H) − a; that is, a = T(H).
This theorem means that when we sample from a symmetric distribution
we can unambiguously define location as the center of symmetry. Then all
location functionals that we may wish to study specify the same location
parameter.

1.3 Geometry and Inference in the Location Model
Letting X = (X_1, . . . , X_n)′ and e = (e_1, . . . , e_n)′, we then write the statistical location model, (1.2.1), as

X = 1θ + e , (1.3.1)

where 1 denotes the vector all of whose components are 1 and T(F_e) = 0. If Ω_F denotes the one-dimensional subspace spanned by 1, then we can express the model more compactly as X = η + e, where η ∈ Ω_F. The subscript F on Ω stands for full model in the context of hypothesis testing as discussed below.
Let x be a realization of X. Note that except for random error, x would lie in Ω_F. Hence an intuitive fitting criterion is to estimate θ by a value θ̂ such that the vector 1θ̂ ∈ Ω_F lies “closest” to x, where “closest” is defined in terms of a norm. Furthermore, a norm, as the following general discussion shows, provides a complete inference for the parameter θ.
Recall that a norm is a nonnegative function, ‖·‖, defined on R^n such that ‖y‖ ≥ 0 for all y; ‖y‖ = 0 if and only if y = 0; ‖ay‖ = |a|‖y‖ for all real a; and ‖y + z‖ ≤ ‖y‖ + ‖z‖. The distance between two vectors is d(z, y) = ‖z − y‖.
Given a location model, (1.3.1), and a specified norm, ‖·‖, the estimate of θ induced by the norm is

θ̂ = argmin_θ ‖x − 1θ‖ , (1.3.2)

i.e., the value which minimizes the distance between x and the space Ω_F. As discussed in Exercise 1.12.1, a minimizing value always exists. The dispersion function induced by the norm is given by

D(θ) = ‖x − 1θ‖ . (1.3.3)


The minimum distance between the vector of observations x and the space Ω_F is D(θ̂). As shown in Exercise 1.12.3, D(θ) is a convex and continuous function of θ which is differentiable almost everywhere. Actually the norms discussed in this book are differentiable at all but at most a finite number of points. We define the gradient process by the function

S(θ) = − (d/dθ) D(θ) . (1.3.4)

As Exercise 1.12.3 shows, S(θ) is a nonincreasing function. Its discontinuities are the points where D(θ) is nondifferentiable. Furthermore the minimizing value is a value where S(θ) is 0 or, due to a discontinuity, steps through 0. We express this by saying that θ̂ solves the equation

S(θ̂) ≐ 0 . (1.3.5)
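
As an illustrative sketch (ours, not the authors'; the data vector is made up), the following R code evaluates D(θ) and S(θ) for the L1 norm of Example 1.3.1 below, and confirms that the minimizer of the convex, piecewise-linear D is the point where the step function S passes through zero, namely the sample median.

  ## Sketch: dispersion D(theta) and gradient S(theta) induced by the L1 norm.
  x <- c(1.2, -0.4, 2.7, 0.3, 5.1)
  D <- function(theta) sum(abs(x - theta))
  S <- function(theta) sum(sign(x - theta))
  theta.hat <- optimize(D, interval = range(x))$minimum
  c(theta.hat = theta.hat, median = median(x), S.at.estimate = S(theta.hat))
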

Suppose we can represent the above estimate by θ̂ = θ̂(x) = θ̂(H_n), where H_n denotes the empirical distribution function of the sample. The notation θ̂(H_n) is suggestive of the functional notation used in the last section. This is as it should be, since it is easy to show that θ̂ satisfies the sample analogues of properties (2) and (3) of Definition 1.2.1. For property (2), consider the estimating equation of the translated sample y = ax + 1b, for a > 0, given by

θ̂(y) = argmin_θ ‖y − 1θ‖ = a · argmin_θ ‖x − 1(θ − b)/a‖ .

From this we immediately have that θ̂(y) = aθ̂(x) + b. For property (3), the defining equation for the sample y = −x is

θ̂(y) = argmin_θ ‖y − 1θ‖ = argmin_θ ‖x − 1(−θ)‖ ,

from which we have θ̂(y) = −θ̂(x). Furthermore, for the norms considered in this book it is easy to check that θ̂(H_n) ≥ θ̂(G_n) when H_n and G_n are empirical cdfs for which H_n is stochastically larger than G_n. Hence, the norms generate location functionals on the set of empirical cdfs. The L1 norm provides an easy example. We can think of θ̂(H_n) = H_n^{-1}(1/2) as the restriction of θ(H) = H^{-1}(1/2) to the class of discrete distributions which assign mass 1/n to n points. Generally we can think of θ̂(H_n) as the restriction of θ(H) or, conversely, we can think of θ(H) as the extension of θ̂(H_n). We let the norm determine the location. This is especially simple in the symmetric location model where all location functionals are equal to the point of symmetry.
Next consider the hypotheses,

H0 : θ = θ0 versus HA : θ ≠ θ0 , (1.3.6)

for a specified θ0. Because of the second property of location functionals in
Definition 1.2.1, we can assume without loss of generality that θ0 = 0; oth-
erwise we need only subtract θ0 from each Xi . Based on the data, the most
acceptable value of θ is the value at which the gradient S(θ) is zero. Hence
large values of |S(0)| favor HA . Formally the level α gradient test or score
test for the hypotheses (1.3.6) is given by

Reject H0 in favor of HA if |S(0)| ≥ c , (1.3.7)

where c is such that P0[|S(0)| ≥ c] = α. Typically, the null distribution of S(0) is symmetric, so there is no loss in generality in considering symmetrical critical regions.
A second formulation of a test statistic is based on the difference in minimizing dispersions or the reduction in dispersion. Call Model (1.2.1) the full model. As noted above, the distance between x and the subspace Ω_F is D(θ̂). The reduced model is the full model subject to H0. In this case the reduced model space is {0}. Hence the distance between x and the reduced model space is D(0). Under H0, x should be close to this space; therefore, the reduction in dispersion test is given by

Reject H0 in favor of HA if RD = D(0) − D(θ̂) ≥ m , (1.3.8)

where m is determined by the null distribution of RD. This test is used in Chapter 3 and subsequent chapters.
A third formulation is based on the standardized estimate:

Reject H0 in favor of HA if |θ̂| / √(Var θ̂) ≥ γ , (1.3.9)

where γ is determined by the null distribution of θ̂. Tests based directly on the estimate are often referred to as Wald-type tests.
The following useful theorem allows us to shift between computing proba-
bilities when θ = 0 and for general θ. Its proof is a straightforward application
of a change of variables. See Theorem A.2.4 of the Appendix for a more general
result.
Theorem 1.3.1. Suppose that we can write S(θ) = S(x1 − θ, . . . , xn − θ).
Then Pθ (S(0) ≤ t) = P0 (S(−θ) ≤ t).
We now turn to the problem of the construction of a (1 − α)100% con-
fidence interval for θ based on S(θ). Such an interval is easily obtained
by inverting the acceptance region of the level α test given by (1.3.7). The
acceptance region is |S(0)| < c. Define

θ̂_L = inf{t : S(t) < c} and θ̂_U = sup{t : S(t) > −c} . (1.3.10)


Then because S(θ) is nonincreasing,

{θ : |S(θ)| < c} = {θ : θ̂_L ≤ θ ≤ θ̂_U} . (1.3.11)

Thus from Theorem 1.3.1,

P_θ(θ̂_L ≤ θ ≤ θ̂_U) = P_θ(|S(θ)| < c) = P_0(|S(0)| < c) = 1 − α . (1.3.12)

Hence, inverting a size α test results in the (1 − α)100% confidence interval (θ̂_L, θ̂_U).
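
The inversion can be carried out numerically. The sketch below (our illustration; the data and the critical value c are made up) scans the sign-process gradient of Example 1.3.1 over a grid and reports the endpoints of {θ : |S(θ)| < c}.

  ## Sketch: confidence interval by inverting the gradient test |S(theta)| < c.
  x <- c(1.2, -0.4, 2.7, 0.3, 5.1, -1.8, 0.9)
  S <- function(t) sum(sign(x - t))        # sign-process gradient
  c.crit <- 5                              # illustrative critical value
  grid <- seq(min(x) - 1, max(x) + 1, by = 0.001)
  accept <- grid[abs(sapply(grid, S)) < c.crit]
  c(thetaL = min(accept), thetaU = max(accept))
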
Thus a norm not only provides a fitting criterion but also a complete in-
ference. As with all statistical analyses, checks on the appropriateness of the
model and the quality of fit are needed. Useful plots here include: stem-leaf
plots and q−q plots to check shape and distributional assumptions, boxplots
and dotplots to check for outlying observations, and a plot of Xi versus i (or
other appropriate variables) to check for dependence between observations.
Some of these diagnostic checks are performed in the next section of numerical examples.
In the next three examples, we discuss the inference for the norms associ-
ated with the location functionals presented in the last section. We state the
results of their associated inference, which we derive in later sections.

Example 1.3.1 (L1 Norm). Recall that the L1 norm is defined as ‖x‖_1 = Σ|x_i|; hence the associated dispersion and negative gradient functions are given respectively by D_1(θ) = Σ|X_i − θ| and S_1(θ) = Σ sgn(X_i − θ). Letting H_n denote the empirical cdf, we can write the estimating equation as

0 = n^{-1} Σ sgn(x_i − θ) = ∫ sgn(x − θ) dH_n(x) .

The solution, of course, is θ̂, the median of the observations. If we replace the empirical cdf H_n by the true underlying cdf H then the estimating equation becomes the defining equation for the parameter θ = T(H). In this case, we have

0 = ∫ sgn(x − T(H)) dH(x) = −∫_{−∞}^{T(H)} dH(x) + ∫_{T(H)}^{∞} dH(x) ;

hence, H(T(H)) = 1/2 and, solving for T(H), we find T(H) = H^{-1}(1/2) as expected.
As we show in Section 1.5,

θ̂ has an asymptotic N(θ, τ_S^2/n) distribution , (1.3.13)

where τ_S = 1/(2h(θ)). Estimation of the standard deviation of θ̂ is discussed in Section 1.5.
Turning next to testing the hypotheses (1.3.6), the gradient test statistic is S_1(0) = Σ sgn(X_i). But we can write S_1(0) = S_1^+ − S_1^− + S_1^0, where S_1^+ = Σ I(X_i > 0), S_1^− = Σ I(X_i < 0), and S_1^0 = Σ I(X_i = 0) = 0 with probability one since we are sampling from a continuous distribution, and I(·) is the indicator function. In practice, we must deal with ties and this is usually done by setting aside those observations that are equal to the hypothesized value and carrying out the test with a reduced sample size. Now note that n = S_1^+ + S_1^−, so that we can write S_1 = 2S_1^+ − n and the test can be based on S_1^+. The null distribution of S_1^+ is binomial with parameters n and 1/2. Hence the level α sign test of the hypotheses (1.3.6) is

Reject H0 in favor of HA if S_1^+ ≤ c_1 or S_1^+ ≥ n − c_1 , (1.3.14)

and c_1 satisfies

P[bin(n, 1/2) ≤ c_1] = α/2 , (1.3.15)

where bin(n, 1/2) denotes a binomial random variable based on n trials and with probability of success 1/2. Note that the critical value of the test can be determined without specifying the shape of F. In this sense, the test based on S_1 is distribution free or nonparametric. Using the asymptotic null distribution of S_1^+, c_1 can be approximated as c_1 ≐ n/2 − n^{1/2} z_{α/2}/2 − .5, where Φ(−z_{α/2}) = α/2, Φ(·) is the standard normal cdf, and .5 is the continuity correction.
For the associated (1 − α)100% confidence interval, we follow the general development above, (1.3.12). Hence, we must find θ̂_L = inf{t : S_1^+(t) < n − c_1}, where c_1 is given by (1.3.15). Note that S_1^+(t) < n − c_1 if and only if the number of X_i greater than t is less than n − c_1. But #{i : X_i > X_(c_1+1)} = n − c_1 − 1 and #{i : X_i > X_(c_1+1) − ε} ≥ n − c_1 for any ε > 0. Hence, θ̂_L = X_(c_1+1). A similar argument shows that θ̂_U = X_(n−c_1). We can summarize this by saying that the (1 − α)100% L1 confidence interval is the half open, half closed interval

[X_(c_1+1), X_(n−c_1)) where α/2 = P(S_1^+(0) ≤ c_1) determines c_1 . (1.3.16)

The critical value c_1 can be determined from the bin(n, 1/2) distribution or from the normal approximation cited above. The interval developed here is a distribution-free confidence interval since the confidence coefficient is determined from the binomial distribution without making any shape assumption on the underlying model distribution.
on the underlying model distribution.
Example 1.3.2 (L2 Norm). Recall that the square of the L2 norm is given by ‖x‖_2^2 = Σ_{i=1}^n x_i^2. As shown in Exercise 1.12.4, the estimate determined by this norm is the sample mean X̄ and the functional parameter is µ = ∫ xh(x) dx, provided it exists. Hence the L2 norm is consistent for the mean location problem. The associated test statistic is equivalent to Student’s t-test. The approximate distribution of X̄ is N(0, σ^2/n), provided the variance σ^2 = Var X_1 exists. Hence, the test statistic is not distribution free. In practice, σ is replaced by its estimate s = (Σ(X_i − X̄)^2/(n − 1))^{1/2} and the test is based on the t-ratio, t = √n X̄/s, which, under the null hypothesis, is asymptotically N(0, 1). The usual confidence interval is X̄ ± t_{α/2,n−1} s/√n, where t_{α/2,n−1} is the (1 − α/2)-quantile of a t-distribution with n − 1 degrees of freedom. This interval has the approximate confidence coefficient (1 − α)100%, unless the errors, e_i, follow a normal distribution, in which case it has exact confidence.
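
For completeness, here is a brief sketch (ours, with made-up data) verifying that the hand-computed t-ratio and interval match R's intrinsic t.test.

  ## Sketch: the L2 analysis by hand and via t.test.
  x <- c(1.2, -0.4, 2.7, 0.3, 5.1, -1.8, 0.9, 3.3)
  n <- length(x)
  t.ratio <- sqrt(n) * mean(x) / sd(x)
  ci <- mean(x) + c(-1, 1) * qt(0.975, n - 1) * sd(x) / sqrt(n)
  t.test(x)                                # reports the same t.ratio and ci
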
Example 1.3.3 (Weighted L1 Norm). Consider the function

‖x‖_3 = Σ_{i=1}^n R(|x_i|)|x_i| , (1.3.17)

where R(|x_i|) denotes the rank of |x_i| among |x_1|, . . . , |x_n|. As the next theorem shows, this function is a norm on R^n. See Section 1.8 for a general weighted L1 norm.

Theorem 1.3.2. The function ‖x‖_3 = Σ j|x|_(j) = Σ R(|x_j|)|x_j| is a norm, where R(|x_j|) is the rank of |x_j| among |x_1|, . . . , |x_n| and |x|_(1) ≤ · · · ≤ |x|_(n) are the ordered absolute values.
Proof: The equality relating ‖x‖_3 to the ranks is clear. To show that we have a norm, we first note that ‖x‖_3 ≥ 0 and that ‖x‖_3 = 0 if and only if x = 0. Also clearly ‖ax‖_3 = |a|‖x‖_3 for any real a. Hence, to finish the proof, we must verify the triangle inequality. Now

‖x + y‖_3 = Σ j|x + y|_(j) = Σ R(|x_i + y_i|)|x_i + y_i|
          ≤ Σ R(|x_i + y_i|)|x_i| + Σ R(|x_i + y_i|)|y_i| . (1.3.18)

Consider the first term on the right side. By summing through another index we can write it as

Σ R(|x_i + y_i|)|x_i| = Σ b_j|x|_(j) ,

where b_1, . . . , b_n is a permutation of the integers 1, . . . , n. Suppose the b_j are not in order; then there exist a t and an s such that |x|_(t) ≤ |x|_(s) but b_t > b_s. Whence,

[b_s|x|_(t) + b_t|x|_(s)] − [b_t|x|_(t) + b_s|x|_(s)] = (b_t − b_s)(|x|_(s) − |x|_(t)) ≥ 0 .

Hence such an interchange never decreases the sum. This leads to the result

Σ R(|x_i + y_i|)|x_i| ≤ Σ j|x|_(j) .


A similar result holds for the second term on the right side of (1.3.18). Therefore, ‖x + y‖_3 ≤ Σ j|x|_(j) + Σ j|y|_(j) = ‖x‖_3 + ‖y‖_3, and this completes the proof. The above argument is taken from Hardy, Littlewood, and Polya (1952).

We call this norm the weighted L1 norm. In the next theorem, we offer an interesting identity satisfied by this norm. First, though, we need another representation of it. For a random sample X_1, . . . , X_n, define the anti-ranks to be the random variables D_1, . . . , D_n such that

Z_1 = |X_{D_1}| ≤ · · · ≤ Z_n = |X_{D_n}| . (1.3.19)

For example, if D_1 = 2 then |X_2| is the smallest absolute value and Z_1 has rank 1. Note that the anti-rank function is just the inverse of the rank function. We can then write

‖x‖_3 = Σ_{j=1}^n j|x|_(j) = Σ_{j=1}^n j|x_{D_j}| . (1.3.20)

Theorem 1.3.3. For any vector x,

‖x‖_3 = Σ_{i≤j} |(x_i + x_j)/2| + Σ_{i<j} |(x_i − x_j)/2| . (1.3.21)

Proof: Letting the index run through the anti-ranks, the right side of (1.3.21) is

Σ_{i=1}^n |x_i| + Σ_{i<j} ( |(x_{D_i} + x_{D_j})/2| + |(x_{D_j} − x_{D_i})/2| ) . (1.3.22)

For i < j, hence |x_{D_i}| ≤ |x_{D_j}|, consider the expression

|(x_{D_i} + x_{D_j})/2| + |(x_{D_j} − x_{D_i})/2| .

There are four cases to consider: where x_{D_i} and x_{D_j} are both positive; where they are both negative; and the two cases where they have mixed signs. In all these cases, though, it is easy to show that

|(x_{D_i} + x_{D_j})/2| + |(x_{D_j} − x_{D_i})/2| = |x_{D_j}| .

Using this, we have that the right side of expression (1.3.22) is equal to

Σ_{i=1}^n |x_i| + Σ_{i<j} |x_{D_j}| = Σ_{j=1}^n |x_{D_j}| + Σ_{j=1}^n (j − 1)|x_{D_j}| = Σ_{j=1}^n j|x_{D_j}| = ‖x‖_3 , (1.3.23)


and we are finished.
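
The identity is easy to check numerically. The following sketch (ours; it assumes no ties among the |x_i|) computes both sides of (1.3.21) for an arbitrary vector.

  ## Sketch: numerical check of the identity (1.3.21).
  x <- c(1.2, -0.4, 2.7, 0.3)
  lhs <- sum(rank(abs(x)) * abs(x))                   # weighted L1 norm
  A <- outer(x, x, "+"); B <- outer(x, x, "-")
  rhs <- sum(abs(A[upper.tri(A, diag = TRUE)]) / 2) + # sum over i <= j
         sum(abs(B[upper.tri(B)]) / 2)                # sum over i < j
  c(lhs = lhs, rhs = rhs)                             # both equal 15.5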


The associated gradient function is

    T(θ) = Σ_{i=1}^n R(|X_i − θ|)sgn(X_i − θ) = Σ_{i≤j} sgn((X_i + X_j)/2 − θ) .    (1.3.24)

The middle term is due to the fact that the ranks only change values at the
finite number of points determined by |X_i − θ| = |X_j − θ|; otherwise R(|X_i − θ|)
is constant. The third term is obtained immediately from the identity (1.3.21).
The n(n + 1)/2 pairwise averages {(X_i + X_j)/2 : 1 ≤ i ≤ j ≤ n} are called
the Walsh averages. Hence, the estimate of θ is the median of the Walsh
averages, which we denote as

    θ̂_3 = med_{i≤j} {(X_i + X_j)/2} ,                   (1.3.25)

first discussed by Hodges and Lehmann (1963). Often θ̂_3 is called the
Hodges-Lehmann estimate of location. In order to obtain the corresponding
location functional, note that

    R(|X_i − θ|) = #{|X_j − θ| ≤ |X_i − θ|}
                 = #{θ − |X_i − θ| ≤ X_j ≤ θ + |X_i − θ|}
                 = nH_n(θ + |X_i − θ|) − nH_n^−(θ − |X_i − θ|) ,
where H_n^− is the left limit of H_n. Hence (1.3.24) becomes

    ∫ {H_n(θ + |x − θ|) − H_n^−(θ − |x − θ|)}sgn(x − θ) dH_n(x) = 0 ,

and in the limit we have

    ∫ {H(θ + |x − θ|) − H(θ − |x − θ|)}sgn(x − θ) dH(x) = 0 ,

that is,

    −∫_{−∞}^θ {H(2θ − x) − H(x)} dH(x) + ∫_θ^∞ {H(x) − H(2θ − x)} dH(x) = 0 .

This simplifies to

    ∫_{−∞}^∞ H(2θ − x) dH(x) = 1/2 .                    (1.3.26)

Hence, the functional is the pseudo-median defined in Example 1.2.3. If the
density h(x) is symmetric, then from (1.7.11)

    θ̂_3 has an approximate N(θ_3, τ²/n) distribution ,   (1.3.27)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 13 —


i i

1.3. GEOMETRY AND INFERENCE IN THE LOCATION MODEL 13


where τ = 1/(√12 ∫ h²(x) dx). Estimation of τ is discussed in Section 3.7.
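As a brief illustration, the Hodges-Lehmann estimate (1.3.25) can be computed
directly from the Walsh averages in R; the function name below is ours, and
the sample is an arbitrary illustrative choice.

hodges.lehmann <- function(x) {
  walsh <- outer(x, x, "+")/2                    # matrix of (x_i + x_j)/2
  median(walsh[lower.tri(walsh, diag = TRUE)])   # pairs with i <= j
}
x <- c(1.2, 0.7, -0.1, 2.3, 0.9)
hodges.lehmann(x)

For small samples (where the exact distribution is used), the R intrinsic call
wilcox.test(x, conf.int = TRUE) reports the same (pseudo)median estimate.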
The most convenient form of the gradient process is

    T^+(θ) = Σ_{i≤j} I((X_i + X_j)/2 > θ) = Σ_{i=1}^n R(|X_i − θ|)I(X_i > θ) .    (1.3.28)
The corresponding gradient test statistic for the hypotheses (1.3.6) is T^+(0).
In Section 1.7, provided that h(x) is symmetric, it is shown that T^+(0) is
distribution free under H0 with null mean and variance n(n + 1)/4 and
n(n + 1)(2n + 1)/24, respectively. This test is often referred to as the Wilcoxon
signed-rank test. Thus the test for the hypotheses (1.3.6) is

    Reject H0 in favor of HA, if T^+(0) ≤ k or T^+(0) ≥ n(n + 1)/2 − k ,    (1.3.29)

where P(T^+(0) ≤ k) = α/2. An approximation for k is given in the next
paragraph.
Because of the similarity between the sign and signed-rank processes, the
confidence interval based on T^+(θ) follows immediately from the argument
given in Example 1.3.1 for the sign process. Instead of the order statistics which
were used in the confidence interval based on the sign process, in this case
we use the ordered Walsh averages, which we denote as W_(1), . . . , W_(n(n+1)/2).
Hence a (1 − α)100% confidence interval for θ is given by

    [W_(k+1), W_((n(n+1)/2)−k)) , where k is such that α/2 = P(T^+(0) ≤ k) .    (1.3.30)

As with the sign process, k can be approximated using the asymptotic normal
distribution of T^+(0) by

    k ≐ n(n + 1)/4 − z_{α/2} √(n(n + 1)(2n + 1)/24) − .5 ,

where z_{α/2} is the (1 − α/2)-quantile of the standard normal distribution. Pro-
vided that h(x) is symmetric, this confidence interval is distribution free.
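A direct computation of the interval (1.3.30), using the normal approximation
for k displayed above, might look as follows in R; the function name walsh.ci
is ours, and taking the floor of k is one reasonable rounding choice.

walsh.ci <- function(x, alpha = 0.05) {
  n <- length(x)
  w <- outer(x, x, "+")/2
  w <- sort(w[lower.tri(w, diag = TRUE)])      # ordered Walsh averages
  k <- floor(n*(n + 1)/4
         - qnorm(1 - alpha/2)*sqrt(n*(n + 1)*(2*n + 1)/24) - 0.5)
  c(w[k + 1], w[n*(n + 1)/2 - k])              # interval (1.3.30)
}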

1.3.1 Computation

The three procedures discussed in this section are easily computed in R. The R
intrinsic functions t.test and wilcox.test compute the t- and Wilcoxon
signed-rank tests, respectively. Our collection of R functions, Robnp, contains
the functions onesampwil and onesampsgn which compute the asymptotic ver-
sions of the Wilcoxon signed-rank and sign tests, respectively. These functions
also compute the associated estimates, confidence intervals, and standard er-
rors. Their use is discussed in the examples. Minitab (Ryan, Joiner, and Cryer,
2005) also can be used to compute these tests. At the command line, the Minitab
commands stest, wtest, and ttest compute the sign, Wilcoxon signed-rank,
and t-tests, respectively.
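For instance, assuming a sample is stored in the vector x, minimal calls to the
R intrinsic functions are as follows; the data here are an illustrative choice.

> x <- rnorm(20, mean = 0.5)                 # illustrative data
> t.test(x, mu = 0)                          # t-test and t-interval
> wilcox.test(x, mu = 0, conf.int = TRUE)    # signed-rank test, estimate, CI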


1.4 Examples
In applications by convention, when testing the null hypothesis H0 : θ = θ0
using the sign test, any data point equal to θ0 is set aside and the sample size is
reduced. On the other hand, these values are not set aside for point estimation
or confidence intervals. The output of the Robnp functions onesampwil and
onesampsgn includes the test statistics T and S, respectively, and a continuity
corrected standardized value z. The p-values are approximated by computing
normal probabilities on z. Especially for small sample sizes, for the test based
on the signs, S, the approximate and exact p-values can be somewhat different.
In calculating the signed-ranks for the test statistic T , we use average ranks.
For t-tests, we report the p-values and confidence intervals using the t-
distribution with n − 1 degrees of freedom.

Example 1.4.1 (Cushney-Peebles Data). The data given in Table 1.4.1 gives
the average excess number of hours of sleep that each of 10 patients achieved
from the use of two drugs. The third column gives the difference (Laevo-
Dextro) in excesses across the two drugs. This is a famous data set. Gosset,
writing under the pseudonym Student, published his landmark paper on the
t-test in 1908 and used this data set for illustration. The differences, however,
suggest that the L2 methods may not be the methods of choice in this case.
The normal quantile plot, Panel A of Figure 1.4.1, shows that the tails may
be heavy and that there may be an outlier. A normal quantile plot has the
data (differences) on the vertical axis and the expected values of the standard
normal order statistics on the horizontal axis. When the data is consistent
with a normal assumption, the plot should be roughly linear. The boxplot,
with 95% L1 confidence interval, Panel B of Figure 1.4.1, further illustrates
the presence of an outlier. The box is defined by the quartiles and the shaded
notch represents the confidence interval.
For the sake of discussion and comparison of methods, we provide the
p-values for the sign test, the Wilcoxon signed-rank test, and the t-test. We
used the Robnp functions onesampwil, onesampsgn, and onesampt to compute
the results for the Wilcoxon signed-rank test, the sign test, and the t-test,
respectively. For each function, the following display shows the necessary R
code (these are preceded with the prompt >) to compute these functions,
which is then followed by the results. The standard errors (SE) for the sign
and signed-rank estimates are given by (1.5.29) and (1.7.12), respectively, in
general in Section 1.5.5. These functions also produce a boxplot of the data.
The boxplot produced by the function onesampsgn is shown in Figure 1.4.1.

# Assumes that the differences are in the vector diffs

> onesampwil(diffs)


Table 1.4.1: Excess Hours of Sleep under the Influence of Two Drugs
Row Dextro Laevo Diff(L-D)
1 -0.1 -0.1 0.0
2 0.8 1.6 0.8
3 3.4 4.4 1.0
4 0.7 1.9 1.2
5 -0.2 1.1 1.3
6 -1.2 0.1 1.3
7 2.0 3.4 1.4
8 3.7 5.5 1.8
9 -1.6 0.8 2.4
10 0.0 4.6 4.6

Results for the Wilcoxon-Signed-Rank procedure


Test of theta = 0 versus theta not equal to 0
Test-Stat. is T 54 Standardized (z) Test-Stat. is 2.70113
p-value 0.00691043

Estimate 1.3 SE is 0.484031


95 % Confidence Interval is ( 0.9 , 2.7 )
Estimate of the scale parameter tau 1.530640

> onesampsgn(diffs)

Results for the Sign procedure


Test of theta = 0 versus theta not equal to 0
Test stat. S is 9 Standardized (z) Test-Stat. 2.666667
p-value 0.007660761

Estimate 1.3 SE is 0.4081708


95 % Confidence Interval is ( 0.8 , 2.4 )
Estimate of the scale parameter tau 1.290749

> temp=onesampt(diffs)

Results for the t-test procedure


Test of theta = 0 versus theta not equal to 0
Test stat. Ave(x) - 0 is 1.58 Standardized (t)
Test-Stat. 4.062128 p-value 0.00283289


Estimate 1.58 SE is 0.3889587

95 % Confidence Interval is ( 0.7001142 , 2.459886 )
Estimate of the scale parameter sigma 1.229995

Figure 1.4.1: Panel A: Normal q−q plot of Cushney-Peebles data; Panel B:
Boxplot with 95% notched confidence interval; Panel C: Sensitivity curve for
t-test; and Panel D: Sensitivity curve for sign test.

The confidence interval corresponding to the sign test is (0.8, 2.4) which is
shifted above 0. Hence, there is strong support for the alternative hypothesis
that the location of the difference distribution is not equal to zero. That is,
we reject H0 : θ = 0 in favor of HA : θ 6= 0 at α = .05. All three tests support
this conclusion. The estimates of location corresponding to the three tests are
the median (1.3), the median of the Walsh averages (1.3), and the mean of the
sample differences (1.58). Note that the outlier had an effect on the sample
mean.


Table 1.4.2: Width to Length Ratios of Rectangles


0.553 0.570 0.576 0.601 0.606 0.606 0.609 0.611 0.615 0.628
0.654 0.662 0.668 0.670 0.672 0.690 0.693 0.749 0.844 0.933

In order to see how sensitive the test statistics are to outliers, we change
the value of the outlier (difference in the 10th row of Table 1.4.1) and plot the
value of the test statistic against the value of the difference in the 10th row
of Table 1.4.1; see Panel C of Figure 1.4.1. Note that as the value of the 10th
difference changes, the t-test changes quite rapidly. In fact, the t-test can be
pulled out of the rejection region by making the difference sufficiently small
or large. However, the sign test, Panel D of Figure 1.4.1, stays constant until
the difference crosses zero and then only changes by 2. This illustrates the
high sensitivity of the t-test to outliers and the relative resistance of the sign
test. A similar plot can be prepared for the Wilcoxon signed-rank test; see
Exercise 1.12.8. In addition, the corresponding p-values can be plotted to see
how sensitive the decision to reject the null hypothesis is to outliers. Sensitivity
plots are similar to influence functions. We discuss influence functions for
estimates in Section 1.6.
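A sensitivity curve such as those in Panels C and D of Figure 1.4.1 is easy to
produce in R. The following sketch varies the 10th difference over a grid and
recomputes each standardized statistic; it assumes the ten differences of Table
1.4.1 are in the vector diffs, and the grid is our choice.

grid <- seq(-10, 10, by = 0.1)
tstat <- sapply(grid, function(v) {
  d <- c(diffs[1:9], v)
  sqrt(10)*mean(d)/sd(d)                 # t-statistic
})
sstat <- sapply(grid, function(v) {
  d <- c(diffs[1:9], v)
  (sum(d > 0) - 5)/sqrt(10/4)            # standardized sign statistic
})
plot(grid, tstat, type = "l")            # cf. Panel C
plot(grid, sstat, type = "l")            # cf. Panel D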

Example 1.4.2 (Shoshoni Rectangles). The golden rectangle is a rectangle in


which the ratio of the width to length is approximately 0.618. It can be charac-
terized in various ways. For example, w/l = l/(w + l) characterizes the golden
rectangle. It is considered to be an aesthetic standard in Western civilization
and appears in art and architecture going back to the ancient Greeks. It now
appears in such items as credit and business cards. In a cultural anthropol-
ogy study, DuBois (1960) reports on a study of the Shoshoni beaded baskets.
These baskets contain beaded rectangles and the question was whether the
Shoshonis use the same aesthetic standard as the West. A sample of twenty
width to length ratios from Shoshoni baskets is given in Table 1.4.2.
Panel A of Figure 1.4.2 shows the notched boxplot containing the 95% L1
confidence interval for θ, the median of the population of w/l ratios. It shows
two outliers which are also apparent in the normal quantile plot, Panel B of
Figure 1.4.2. We used the sign procedure to analyze the data, performing the
computations with the Robnp function onesampsgn. For this problem, it is of
interest to test H0 : θ = 0.618 (the golden rectangle). The display below
shows this evaluation for the sign test along with a 90% confidence interval
for θ.

> onesampsgn(x,theta0=.618,alpha=.10)

Results for the Sign procedure


Test of theta = 0.618 versus theta not equal to 0.618

Test stat. S is 2 Standardized (z) Test-Stat. 0.2236068
p-value 0.8230633

Estimate 0.641 SE is 0.01854268

90 % Confidence Interval is ( 0.609 , 0.67 )
Estimate of the scale parameter tau 0.0829254

Figure 1.4.2: Panel A: Boxplot of width to length ratios of Shoshoni rectangles;
Panel B: Normal q−q plot.

With a p-value of 0.823, there is no evidence to refute the null hypothesis.


Further, we see that the golden rectangle 0.618 is contained in the confidence
interval. This suggests that there is no evidence in this data that the Shoshonis
are using a different standard.
For comparison, the analysis based on the t-procedure is


> onesampt(x,theta0=.618,alpha=.10)

Results for the t-test procedure


Test of theta = 0.618 versus theta not equal to 0.618
Test stat. Ave(x) - 0.618 is 0.0425 Standardized (t)
Test-Stat. 2.054523 p-value 0.05394133

Estimate 0.6605 SE is 0.02068606


90 % Confidence Interval is ( 0.624731 , 0.696269 )
Estimate of the scale parameter sigma 0.09251088
Based on the t-test with the p-value of 0.053, one might conclude that there
is evidence that the Shoshonis are using a different standard. Further, the
90% t-interval does not contain the golden rectangle ratio. Hence, the robust
and traditional approaches lead to different practical conclusions for this
problem. The outliers, of course, impaired the t-analysis. For this data, we
have more faith in the simple sign test.

1.5 Properties of Norm-Based Inference


In this section, we establish statistical properties of the inference described
in Section 1.3 for the norm-fit of a location model. These properties describe
the null and alternative distributions of the test (1.3.7), and the asymptotic
distribution of the estimate (1.3.2). Furthermore, these properties allow us to
derive relative efficiencies between competing procedures. While our discussion
is general, we illustrate the inference based on the L1 and L2 norms as we
proceed. The inference based on the signed-rank norm is considered in Section
1.7 and that based on norms of general signed-rank scores in Section 1.8.
We assume then that Model (1.2.1) holds for a random sample X1 , . . . , Xn
with common distribution and density functions H(x) = F (x − θ) and h(x) =
f (x − θ), respectively. Next a norm is specified to fit the model. We assume
that the induced functional is 0 at F , i.e., T (F ) = 0. Let S(θ) be the gradient
function induced by the norm. We establish the properties of the inference
by considering the null and alternative behavior of the gradient test. For
convenience, we consider the one-sided hypothesis,
H0 : θ = 0 versus HA : θ > 0 . (1.5.1)
Since S(θ) is nonincreasing, a level α test of these hypotheses based on S(0)
is
Reject H0 in favor of HA if S(0) ≥ c , (1.5.2)


where c is such that P0 [S(0) ≥ c] = α.


The power function of this test is given by,

γS (θ) = Pθ [S(0) ≥ c] = P0 [S(−θ) ≥ c] , (1.5.3)

where the last equality follows from Theorem 1.3.1.


The power function forms a convenient summary of the test based on S(0).
The probability of a Type I Error (level of the test) is given by γS (0). The
probability of a Type II error at the alternative θ is βS (θ) = 1 − γS (θ). For a
given test of hypotheses (1.5.1) we want the power function to be increasing in
θ with an upper limit of one. In the first subsection below, we establish these
properties for the test (1.5.2). We can also compare level α-tests of (1.5.1) by
comparing their powers at alternative hypotheses. These are efficiency consid-
erations and they are covered in later subsections.

1.5.1 Basic Properties of the Power Function γS (θ)


As a first step we show that γS (θ) is nondecreasing:
Theorem 1.5.1. Suppose the test of H0 : θ = 0 versus HA : θ > 0 rejects
when S(0) ≥ c. Then the power function is nondecreasing in θ.
Proof: Recall that S(θ) is nonincreasing in θ since D(θ) is convex. By Theorem
1.3.1, γS (θ) = P0 [S(−θ) ≥ c]. Now, if θ1 ≤ θ2 then S(−θ1 ) ≤ S(−θ2 ) and,
hence, S(−θ1 ) ≥ c implies that S(−θ2 ) ≥ c. It then follows that P0 (S(−θ1 ) ≥
c) ≤ P0 (S(−θ2 ) ≥ c) and the power function is monotone in θ as required.

This theorem shows that the test of H0 : θ = 0 versus HA : θ > 0 based


on S(0) is unbiased, that is, Pθ (S(0) ≥ c) ≥ α for positive θ, where α is the
size of the test. At times it is convenient to consider the more general null
hypothesis:
H0∗ : θ ≤ 0 versus HA : θ > 0 . (1.5.4)
A test of H0∗ versus HA with power function γS is said to have level α, if

    sup_{θ≤0} γ_S(θ) = α .

The proof of Theorem 1.5.1 shows that γS (θ) is nondecreasing in all θ ∈ R.


Since the gradient test has level α for H0 , it follows immediately that it has
level α for H0∗ also.
We next show that the power function of the gradient test converges to 1
as θ → ∞. We formally define this as:
Definition 1.5.1. Consider a level α test for the hypotheses (1.5.1) which has
power function γS (θ). We say the test is resolving, if γS (θ) → 1 as θ → ∞.


Theorem 1.5.2. Suppose the test of H0 : θ = 0 versus HA : θ > 0 rejects


when S(0) ≥ c. Further, let η = supθ S(θ) and suppose that η is attained for
some finite value of θ. Then the test is resolving, that is, Pθ (S(0) ≥ c) → 1 as
θ → ∞.
Proof: Since S(θ) is nonincreasing, for any unbounded increasing sequence
θm , S(θm ) ≥ S(θm+1 ). For fixed n and F , there is a real number a such that
P0 (| Xi |≤ a, i = 1, . . . , n) > 1 − ǫ for any specified ǫ > 0. Let Aǫ denote the
event {| Xi |≤ a, i = 1, . . . , n}. Now,
Pθm (S(0) ≥ c) = P0 (S(−θm ) ≥ c)
= 1 − P0 (S(−θm ) < c)
= 1 − P0 ({S(−θm ) < c} ∩ Aǫ ) − P0 ({S(−θm ) < c} ∩ Acǫ ) .
The hypothesis of the theorem implies that, for sufficiently large m,
{S(−θm ) < c} ∩ Aǫ is empty. Further, P0 ({S(−θm ) < c} ∩ Acǫ ) ≤ P0 (Acǫ ) < ǫ.
Hence, for m sufficiently large, Pθm (S(0) ≥ c) ≥ 1 − ǫ and the proof is com-
plete.

The condition of boundedness imposed on S(θ) in the above theorem holds


for almost all the nonparametric tests discussed in this book; hence, these non-
parametric tests are resolving. Thus they are able to discern large alternative
hypotheses with high power. What can be said at a fixed alternative? Recall
the definition of a consistent test:

Definition 1.5.2. We say that a test is consistent if the power tends to


one for each fixed alternative as the sample size n increases. The alternatives
consist of specific values of θ and a cdf F.
Consistency implies that the test is behaving as expected when the sample
size increases and the alternative hypothesis is true. To obtain consistency of
the gradient test, we need to impose the following two assumptions on S(θ):
first,

    S̄(θ) = S(θ)/n^γ and S̄(0) →_{Pθ} μ(θ) , where μ(0) = 0 and μ(0) < μ(θ) for all θ > 0,    (1.5.5)

for some γ > 0, and secondly,

    E_0 S̄(0) = 0 and √n S̄(0) →_D N(0, σ²(0)) under H0, for all F ,    (1.5.6)

for some positive constant σ(0). The first assumption means that S̄(0) sep-
arates the null from the alternative hypothesis. Note, it is not crucial that
μ(0) = 0, since this can always be achieved by recentering. It is useful to have
the following result concerning the asymptotic null distribution of S̄(0). Its
proof follows readily from the definition of convergence in distribution.




Theorem 1.5.3. Assume (1.5.6). The test defined by √n S̄(0) ≥ zα σ(0), where
zα is the upper α percentile from the standard normal cdf, i.e., 1 − Φ(zα) = α,
is asymptotically size α. Hence, P0(√n S̄(0) ≥ zα σ(0)) → α.

It follows that a gradient test is consistent; i.e.,

Theorem 1.5.4. Assume conditions (1.5.5) and (1.5.6). Then the gradient
test √n S̄(0) ≥ zα σ(0) is consistent, i.e., the power at fixed alternatives tends
to one as n increases.
Proof: Fix θ* > 0 and F. For ǫ > 0 and for large n, we have n^{−1/2} zα σ(0) <
μ(θ*) − ǫ. This leads to the following string of inequalities:

    P_{θ*,F}(S̄(0) ≥ n^{−1/2} zα σ(0)) ≥ P_{θ*,F}(S̄(0) ≥ μ(θ*) − ǫ)
                                      ≥ P_{θ*,F}(|S̄(0) − μ(θ*)| ≤ ǫ) → 1 ,

which is the desired result.


Example 1.5.1 (The L1 Case). Assume that the model cdf F has the unique
median 0. Consider the L1 norm. The associated level α gradient test of (1.5.1)
is equivalent to the sign test given by:

    Reject H0 in favor of HA if S_1^+ = Σ I(X_i > 0) ≥ c ,

where c is such that P[bin(n, 1/2) ≥ c] = α. The test is nonparametric, i.e.,
it does not depend on F. From the above discussion its power function is
nondecreasing in θ. Since S_1^+(θ) is bounded and attains its bound on a finite
interval, the test is resolving. For consistency, take γ = 1 in expression (1.5.5).
Then E_θ[n^{−1} S_1^+(0)] = P_θ(X > 0) = 1 − F(−θ) = μ(θ). An application of the
Weak Law of Large Numbers shows that the limit in condition (1.5.5) holds.
Further, μ(0) = 1/2 < μ(θ) for all θ > 0 and all F. Finally, apply the Central
Limit Theorem to show that (1.5.6) holds. Hence, the sign test is consistent
for location alternatives. Further, it is consistent for each pair θ, F such that
P(X > 0) > 1/2.
A discussion of these properties for the gradient test based on the L2 norm
can be found in Exercise 1.12.5.
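A small simulation illustrates the consistency of the sign test; the normal
model, the fixed alternative θ = 0.3, and the function name are our illustrative
choices.

power.sign <- function(n, theta, nsim = 2000) {
  mean(replicate(nsim, {
    x <- rnorm(n, mean = theta)
    (sum(x > 0) - n/2)/sqrt(n/4) >= qnorm(0.95)  # asymptotic size .05 test
  }))
}
sapply(c(20, 50, 100, 200), power.sign, theta = 0.3)

The estimated powers increase toward one as n increases, in accord with
Theorem 1.5.4.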

1.5.2 Asymptotic Linearity and Pitman Regularity


In the last section we discussed some of the basic properties of the power
function for a gradient test. Next we establish some general results that allow
us to compare power functions for different level α-tests. These results also
lead to the asymptotic distributions of the location estimators θ̂ based on
norm fits. We also make use of them in later sections and chapters.


Assume the setup found at the beginning of this section; i.e., we are con-
sidering the location model (1.3.1) and we have specified a norm with gradient
function S(θ). We first define a Pitman Regular process:

Definition 1.5.3. We say an estimating function S(θ) is Pitman Regular
if the following four conditions hold: first,

    S(θ) is nonincreasing in θ ;                          (1.5.7)

second, letting S̄(θ) = S(θ)/n^γ for some γ > 0,

    there exists a function μ(θ) such that μ(0) = 0, μ′(θ) is continuous at 0,    (1.5.8)

    μ′(0) > 0, and either S̄(0) →_{Pθ} μ(θ) or E_θ(S̄(0)) = μ(θ) ;    (1.5.9)

third,

    sup_{|b|≤B} | √n S̄(b/√n) − √n S̄(0) + μ′(0)b | →_P 0 ,    (1.5.10)

for any B > 0; and fourth, there is a constant σ(0) such that

    √n S̄(0)/σ(0) →_{D_0} N(0, 1) .                       (1.5.11)

Further, the quantity

    c = μ′(0)/σ(0)                                        (1.5.12)

is called the efficacy of S(θ).
Condition (1.5.10) is called the asymptotic linearity of the process S̄(θ).
Often we can compute c when we have the mean under general θ and the
variance under θ = 0. Thus

    μ′(0) = (d/dθ) E_θ[S̄(0)] |_{θ=0}  and  σ²(0) = lim{nVar_0(S̄(0))} .    (1.5.13)

Hence, another way of expressing the asymptotic linearity of S̄(θ) is

    √n (S̄(b/√n)/σ(0)) = √n (S̄(0)/σ(0)) − cb + o_p(1) .    (1.5.14)

If we replace b by √n θ_n where, of course, |√n θ_n| ≤ B for B > 0, then we can
write

    √n (S̄(θ_n)/σ(0)) = √n (S̄(0)/σ(0)) − c√n θ_n + o_p(1) .    (1.5.15)
We record one more result on limiting distributions whose proof follows from
Theorems 1.3.1 and 1.5.6.


Theorem 1.5.5. Suppose S(θ) is Pitman Regular. Then

    √n (S̄(b/√n)/σ(0)) →_{D_0} Z − cb                     (1.5.16)

and

    √n (S̄(0)/σ(0)) →_{D_{−b/√n}} Z − cb ,                (1.5.17)

where Z ∼ N(0, 1) and, so, Z − cb ∼ N(−cb, 1).

The second part of this theorem says that the limiting distribution of S̄(0),
when standardized by σ(0) and computed along a sequence of alternatives
−b/n^{1/2}, is still normal with the same variance of one but with a new mean,
namely −cb. This result is useful in approximating the power near the null
hypothesis.
We find asymptotic linearity to be useful in establishing statistical prop-
erties. Our next result provides sufficient conditions for linearity.

Theorem 1.5.6. Let S̄(θ) = (1/n^γ)S(θ) for some γ > 0 such that the condi-
tions (1.5.7), (1.5.9), and (1.5.11) of Definition 1.5.3 hold. Suppose for any
b ∈ R,

    nVar_0(S̄(n^{−1/2}b) − S̄(0)) → 0 , as n → ∞ .        (1.5.18)

Then

    sup_{|b|≤B} | √n S̄(b/√n) − √n S̄(0) + μ′(0)b | →_P 0 ,    (1.5.19)

for any B > 0.

Proof: First consider U_n(b) = [S̄(n^{−1/2}b) − S̄(0)]/(b/√n). By (1.5.9) we have

    E_0(U_n(b)) = (√n/b) μ(−b/√n) = −(√n/b)(b/√n) μ′(ξ_n) → −μ′(0) .    (1.5.20)

Furthermore,

    Var_0 U_n(b) = (n/b²) Var_0(S̄(b/√n) − S̄(0)) → 0 .    (1.5.21)

As Exercise 1.12.9 shows, (1.5.20) and (1.5.21) imply that U_n(b) converges to
−μ′(0) in probability, pointwise in b, i.e., U_n(b) = −μ′(0) + o_p(1).

For the second part of the proof, let W_n(b) = √n[S̄(b/√n) − S̄(0) +
μ′(0)b/√n]. Further, let ǫ > 0 and γ > 0 and partition [−B, B] into
−B = b_0 < b_1 < . . . < b_m = B so that b_i − b_{i−1} ≤ ǫ/(2|μ′(0)|) for all i.
There exists N such that n ≥ N implies P[max_i |W_n(b_i)| > ǫ/2] < γ.


Now suppose that W_n(b) ≥ 0 (a similar argument can be given for W_n(b) <
0), and that b_{i−1} ≤ b ≤ b_i. Then, since S̄ is nonincreasing,

    |W_n(b)| = √n[S̄(b/√n) − S̄(0)] + bμ′(0)
             ≤ √n[S̄(b_{i−1}/√n) − S̄(0)] + b_{i−1}μ′(0) + (b − b_{i−1})μ′(0)
             ≤ |W_n(b_{i−1})| + (b − b_{i−1})|μ′(0)| ≤ max_i |W_n(b_i)| + ǫ/2 .

Hence,

    P_0( sup_{|b|≤B} |W_n(b)| > ǫ ) ≤ P_0( max_i |W_n(b_i)| + ǫ/2 > ǫ ) < γ ,

and

    sup_{|b|≤B} |W_n(b)| →_P 0 .

In the next three subsections we use these tools to handle the issues of
power and efficiency for a general norm-based inference, but first we show
that the L1 gradient function is Pitman Regular.
Example 1.5.2 (Pitman Regularity of the L1 Process). Assume that the
model pdf satisfies f(0) > 0. Recall that the L1 gradient function is

    S_1(θ) = Σ_{i=1}^n sgn(X_i − θ) .

Take γ = 1 in Theorem 1.5.6; hence, the average of interest is S̄_1(θ) =
n^{−1} S_1(θ). This is nonincreasing, so condition (1.5.7) is satisfied. Next, it is
easy to check that μ(θ) = E_θ S̄_1(0) = E_θ sgn(X_i) = E_0 sgn(X_i + θ) = 1 − 2F(−θ).
Hence, μ′(0) = 2f(0), and condition (1.5.9) is satisfied. We now consider
condition (1.5.18). Consider the case b > 0 (similarly for b < 0):

    S̄_1(b/√n) − S̄_1(0) = −(2/n) Σ_{i=1}^n I(0 < X_i < b/n^{1/2}) .

Because this is a sum of independent Bernoulli variables, we have

    nVar_0[S̄_1(b/n^{1/2}) − S̄_1(0)] ≤ 4P(0 < X_1 < b/√n) = 4[F(b/√n) − F(0)] → 0 .

The convergence to 0 occurs since F is continuous. Thus condition (1.5.18)
is satisfied. Finally, note that σ(0) = 1, so √n S̄_1(0) converges in distribution to
Z ∼ N(0, 1) by the Central Limit Theorem. Therefore the L1 gradient process
S_1(θ) is Pitman Regular. It follows that the efficacy of the L1 is

    c_{L1} = 2f(0) .                                      (1.5.22)


For future reference, we state the asymptotic linearity result for the L1
process: if |√n θ_n| ≤ B then

    √n S̄_1(θ_n) = √n S̄_1(0) − 2f(0)√n θ_n + o_p(1) .     (1.5.23)
Example 1.5.3 (Pitman Regularity of the L2 Process). In Exercise 1.12.6
it is shown that, provided Xi has finite variance, the L2 gradient function is
Pitman Regular and that the efficacy is simply cL2 = 1/σf .
We are now in a position to investigate the efficiency and power properties
of the statistical methods based on the L1 norm relative to the statistical
methods based on the L2 norm. As we see in the next three subsections, these
properties depend only on the efficacies.

1.5.3 Asymptotic Theory and Efficiency Results for θ̂


As at the beginning of this section, suppose we have the location model, (1.2.1),
and that we have chosen a norm to fit the model with gradient function S(θ). In
this part we develop the asymptotic distribution of the estimate. The asymp-
totic variance provides the basis for efficiency comparisons. We use the asymp-
totic linearity that accompanies Pitman Regularity. To do this, however, we
first need to show that √n θ̂ is bounded in probability.

Lemma 1.5.1. If the gradient function S(θ) is Pitman Regular, then √n(θ̂ −
θ) = O_p(1).
Proof: Assume without loss of generality that θ = 0 and take t > 0. By the
monotonicity of S̄(θ), if S̄(t/√n) < 0 then θ̂ ≤ t/√n. Hence, P_0(S̄(t/√n) <
0) ≤ P_0(θ̂ ≤ t/√n). Theorem 1.5.5 implies that the first probability can be
made as close to Φ(tc) as desired. This, in turn, can be made as close to 1 as
desired. In a similar vein we note that if S̄(−t/√n) > 0, then θ̂ ≥ −t/√n and
−√n θ̂ ≤ t. Again, the probability of this event can be made arbitrarily close
to 1. Hence, P_0(|√n θ̂| ≤ t) is arbitrarily close to 1 and we have boundedness
in probability.
We next exploit this boundedness in probability to determine the asymp-
totic distribution of the estimate.

Theorem 1.5.7. Suppose S(θ) is Pitman Regular with efficacy c. Then
√n(θ̂ − θ) converges in distribution to Z ∼ N(0, c^{−2}).

Proof: As usual we assume, without loss of generality, that θ = 0. First recall
that θ̂ is defined by n^{−1/2}S(θ̂) ≐ 0. From Lemma 1.5.1, we know that √n θ̂ is
bounded in probability so that we can apply (1.5.14) to deduce

    √n (S̄(θ̂)/σ(0)) = √n (S̄(0)/σ(0)) − c√n θ̂ + o_p(1) .


Solving, we have

    √n θ̂ = c^{−1} √n S̄(0)/σ(0) + o_p(1) ;

hence, the result follows because √n S̄(0)/σ(0) is asymptotically N(0, 1).

Definition 1.5.4. If we have two Pitman Regular estimates with efficacies c_1
and c_2, respectively, then the efficiency of θ̂_1 with respect to θ̂_2 is defined to
be the reciprocal ratio of their asymptotic variances, namely, e(θ̂_1, θ̂_2) = c_1²/c_2².
The next example compares the L1 estimate to the L2 estimate.

Example 1.5.4 (Relative Efficiency between the L1 and L2 Estimates). In
this example we compare the L1 and L2 estimates, namely, the sample median
and mean. We have seen that their respective efficacies are 2f(0) and σ_f^{−1}, and
their asymptotic variances are 1/[4f²(0)n] and σ_f²/n, respectively. Hence, the
relative efficiency of the median with respect to the mean is

    e(Ẋ, X̄) = asyvar(√n X̄)/asyvar(√n Ẋ) = c²_Ẋ/c²_X̄ = 4f²(0)σ_f²    (1.5.24)

where Ẋ is the sample median and X̄ is the sample mean. The efficiency com-
putation depends only on the Pitman efficacies. We illustrate the computation
of the efficiency using the contaminated normal distribution. The pdf of the
contaminated normal distribution consists of mixing the standard normal pdf
with a normal pdf having mean zero and variance δ² > 1. For ǫ between 0 and
1, the pdf can be written:

    f_ǫ(x) = (1 − ǫ)φ(x) + ǫδ^{−1}φ(δ^{−1}x)             (1.5.25)

with σ_f² = 1 + ǫ(δ² − 1). This distribution has tails heavier than the standard
normal distribution and can be used to model data contamination; see Tukey
(1960) for more discussion. We can think of ǫ as the fraction of the data
that is contaminated. In Table 1.5.1 we provide values of the efficiencies for
various values of contamination and with δ = 3. Note that when we have 10%
contamination, the efficiency is 1. This indicates that, for this distribution,
the median and mean are equally effective. Finally, this example exhibits a
distribution for which the median is superior to the mean as an estimate of
the center. See Exercise 1.12.12 for other examples.
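The efficiency values of Table 1.5.1 can be reproduced in R from (1.5.24) and
(1.5.25); this sketch, with our own function name, takes δ = 3.

eff.cn <- function(eps, delta = 3) {
  f0   <- (1 - eps)*dnorm(0) + eps*dnorm(0)/delta   # f_eps(0)
  sig2 <- 1 + eps*(delta^2 - 1)                     # variance of f_eps
  4*f0^2*sig2                                       # e(median, mean), (1.5.24)
}
round(sapply(c(0, .03, .05, .10, .15), eff.cn), 3)
# 0.637 0.758 0.833 0.998 1.134, essentially the entries of Table 1.5.1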

1.5.4 Asymptotic Power and Efficiency Results for the Test Based on S(θ)
Consider the location model (1.2.1), and assume that we have chosen a norm to
fit the model with gradient function S(θ). Consider the gradient test (1.5.2) of
the hypotheses (1.5.1). In Section 1.5.1, we showed that the power function of


Table 1.5.1: Efficiencies of the Median Relative to the Mean for Contaminated
Normal Models
ǫ e(Ẋ, X̄)
.00 .637
.03 .758
.05 .833
.10 1.000
.15 1.134

this test is nondecreasing with upper limit one and that it is typically resolving.
Further, we showed that for a fixed alternative, the test is consistent. Thus the
power tends to one as the sample size increases. To offset this effect, we let the
alternative converge to the null value at a rate that stabilizes the power away
from one. This enables us to compare two tests along the same alternative
sequence. Consider the null hypothesis H0 : θ = 0 versus H_{An} : θ = θ_n, where
θ_n = θ*/√n and θ* > 0. Recall that the asymptotic size α test based on S̄(0)
rejects H0 if √n S̄(0)/σ(0) ≥ zα, where 1 − Φ(zα) = α.
The following theorem is called the asymptotic power lemma. Its proof
follows immediately from expression (1.5.14).

Theorem 1.5.8. Assume that S(0) is Pitman Regular with efficacy c. Then
the asymptotic local power along the sequence θ_n = θ*/√n is

    γ_S(θ_n) = P_{θ_n}(√n S̄(0)/σ(0) ≥ zα) = P_0(√n S̄(−θ_n)/σ(0) ≥ zα)
             → 1 − Φ(zα − θ*c) , as n → ∞.

Note that larger values of the efficacy imply larger values of the asymptotic
local power.

Definition 1.5.5. The Pitman asymptotic relative efficiency of one test
relative to another is defined to be e(S_1, S_2) = c_1²/c_2².

Note that this is the same formula as the efficiency of one estimate relative
to another given in Definition 1.5.4. Therefore, the efficiency results discussed
in Example 1.5.4 between the L1 and L2 estimates apply for the sign and t-tests
also. Hence, we have an example in which the simple sign test is asymptotically
more powerful than the t-test.
We can also develop a sample size interpretation for the asymptotic power.
Suppose we specify a power γ < 1. Further, let z_γ be defined by 1 − Φ(z_γ) = γ.
Then 1 − Φ(zα − cn^{1/2}θ_n) = 1 − Φ(z_γ) and zα − cn^{1/2}θ_n = z_γ. Solving for n
yields

    n ≐ (zα − z_γ)²/(c²θ_n²) .                            (1.5.26)


Typically we take θ_n = k_n σ with k_n small. Now if S_1(0) and S_2(0) are two
Pitman Regular asymptotically size α tests, then the ratio of sample sizes
required to achieve the same asymptotic power along the same sequence of
alternatives is given by the approximation n_2/n_1 ≐ c_1²/c_2². This provides ad-
ditional motivation for the above definition of the Pitman efficiency of two tests.
The initial development of asymptotic efficiency was done by Pitman (1948)
in an unpublished manuscript and later published by Noether (1955).
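As a numerical illustration at the standard normal model, where the efficacies
are c = 2f(0) for the sign test and c = 1/σ_f = 1 for the t-test, the asymptotic
local power of Theorem 1.5.8 and the sample size ratio can be computed as
follows; the values of α and θ* are our illustrative choices.

alpha <- 0.05; thetas <- 1
c.sign <- 2*dnorm(0)                         # efficacy of the sign test
c.t    <- 1                                  # efficacy of the t-test
1 - pnorm(qnorm(1 - alpha) - thetas*c.sign)  # local power, sign test
1 - pnorm(qnorm(1 - alpha) - thetas*c.t)     # local power, t-test
(c.t/c.sign)^2   # n_sign/n_t for equal local power, about 1.57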

1.5.5 Efficiency Results for Confidence Intervals Based on S(θ)
In this part we consider the length of the confidence interval as a measure of
its efficiency. Suppose that we specify γ = 1 − α for the confidence coefficient.
Then let z_{α/2} be defined by 1 − Φ(z_{α/2}) = α/2. Again we suppose throughout
the discussion that the estimating functions are Pitman Regular. Then the
endpoints of the 100γ% confidence interval are given asymptotically by θ̂_L
and θ̂_U such that

    √n S̄(θ̂_L)/σ(0) = z_{α/2}  and  √n S̄(θ̂_U)/σ(0) = −z_{α/2} ;    (1.5.27)

see (1.3.10) for the exact versions of the endpoints.
The next theorem provides the asymptotic behavior of the length of this
interval and, further, it shows that the standardized length of the confidence
interval is a consistent estimate of the asymptotic standard deviation of √n θ̂.

Theorem 1.5.9. Suppose S(θ) is a Pitman Regular estimating function with
efficacy c. Let L be the length of the corresponding confidence interval. Then

    √n L/(2z_{α/2}) →_P 1/c .
Proof: Using the same argument as in Lemma 1.5.1, we can show that θ̂_L
and θ̂_U are bounded in probability when multiplied by √n. Hence, the above
estimating equations can be linearized to obtain, for example:

    z_{α/2} = √n S̄(θ̂_L)/σ(0) = √n S̄(0)/σ(0) − c√n θ̂_L + o_P(1) .

This can then be solved to find:

    √n θ̂_L = √n S̄(0)/(cσ(0)) − z_{α/2}/c + o_P(1) .

When this is also done for θ̂_U and the difference is taken, we have:

    n^{1/2}(θ̂_U − θ̂_L) = 2z_{α/2}/c + o_P(1) ,


which concludes the argument.

From Theorem 1.5.7, θ̂ has an approximate normal distribution with vari-
ance c^{−2}/n. So by Theorem 1.5.9, a consistent estimate of the standard error
of θ̂ is

    SE(θ̂) = (√n L/(2z_{α/2})) (1/√n) = L/(2z_{α/2}) .    (1.5.28)

If the ratio of squared asymptotic lengths is used as a measure of efficiency,
then the efficiency of one confidence interval relative to another is
again the ratio of the squares of the efficacies.
The discussion of the properties of estimation, testing, and confidence in-
terval construction shows that, asymptotically at least, the relative merit of a
procedure is measured by its efficacy. This measure is the slope of the linear
approximation of the standardized estimating function that determines these
procedures. In the comparison of L1 and L2 methods, we have seen that the
efficiency e(L1 , L2 ) = 4σf2 f 2 (0). There are other types of asymptotic efficiency
that have been studied in the literature along with finite sample versions of
these asymptotic efficiencies. The conclusions drawn from these other effi-
ciencies are consistent with the picture presented here. Finally, conclusions of
simulation studies have also been consistent with the material presented here.
Hence, we do not discuss these other measures; see Section 2.6 of Hettman-
sperger (1984a) for further references.
Example 1.5.5 (Estimation of the Standard Error of the Sample Median).
Recall that the sample median, when properly standardized, has a limiting
normal distribution. Suppose we have a sample of size n from H(x) = F (x−θ)
where θ is the unknown median. From Theorem 1.5.7, we know that the ap-
proximating distribution for θ̂, the sample median, is normal with mean θ and
variance 1/[4nh²(θ)]. We refer to this variance as the asymptotic variance.
This normal distribution can be used to approximate probabilities concern-
ing the sample median. When the underlying form of the distribution H is
unknown, we must estimate this asymptotic variance. Theorem 1.5.9 provides
one key to the estimation of the asymptotic variance. The square root of the
asymptotic variance is sometimes called the asymptotic standard error of the
sample median. We discuss the estimation of this standard error rather than
the asymptotic variance.
As a simple example, in expression (1.5.28) take α = .05, z_{α/2} = 2, and k =
n/2 − n^{1/2}; then we have the following consistent estimate of the asymptotic
standard error of the median:

    SE(median) ≈ [X_{(n/2+n^{1/2})} − X_{(n/2−n^{1/2})}]/4 .    (1.5.29)

This simple estimate of the asymptotic standard error is based on the length
of the 95% confidence interval for the median. Sheather (1987) shows that


the estimate can be improved by using the interpolated confidence intervals


discussed in Section 1.10. Of course, other confidence intervals with different
confidence coefficients can be used also. We recommend using 90% or 95%;
again, see McKean and Schrader (1984) and Sheather (1987). This SE is com-
puted by our R function onesampsgn for general α. The default value of α is
set at 0.05.
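A direct implementation of (1.5.29) in R might look as follows; the rounding
of the order statistic indices below is our choice, since n/2 ± n^{1/2} need not
be integers.

se.median <- function(x) {
  n  <- length(x)
  xs <- sort(x)
  lo <- max(1, floor(n/2 - sqrt(n)))
  hi <- min(n, ceiling(n/2 + sqrt(n)))
  (xs[hi] - xs[lo])/4                      # expression (1.5.29)
}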

There are other approaches to the estimation of this standard error. For
example, we could estimate the density h(x) directly and then use hn (θ)b where
hn is the density estimate. Another possibility is to estimate the finite sample
standard error of the sample median directly. Sheather (1987) surveys these
approaches. We discuss one further possibility here, namely the bootstrap.
The bootstrap has gained wide attention recently because of its versatility in
estimation and testing in nonstandard situations. See Efron and Tibshirani
(1993) for a very readable account of the bootstrap.
If we know the underlying distribution H(x), then we could estimate the
standard error of the median by repeatedly drawing samples with a computer
from the distribution H. If we have B samples from H and have computed and
stored the B values of the sample median, then our estimate of the standard
error of the median is simply the sample standard deviation of these B values.
When H is unknown we replace it by Hn , the empirical distribution func-
tion, and proceed with the simulation. The bootstrap approach based on Hn
is called the nonparametric bootstrap since nothing is assumed about the
form of the underlying distribution H. In another version, called the paramet-
ric bootstrap, we suppose that we know the form of the underlying distribution
H but there are some unknown parameters such as the mean and variance. We
use the sample to estimate these unknown parameters, insert the values into
H, and use this distribution to draw the B samples. In this book we are con-
cerned mainly with the nonparametric bootstrap and we use the generic term
bootstrap to refer to this approach. In either case, ready access to high speed
computing makes this method appealing. The following example illustrates
the computations.

Example 1.5.6 (Generated Data). Using Minitab, the 30 data points in Ta-
ble 1.5.2 were generated from a normal distribution with mean 0 and vari-
ance 1. Thus, we know that the asymptotic standard error should be about
1/[30^{1/2} · 2f(0)] = 0.23. We use this to check what happens if we try to
estimate the standard error from the data.
Using expression (1.3.16), the 95% confidence interval for the median is
(−0.789, 0.331). Hence, the length of confidence interval estimate, given in
expression (1.5.29), is (0.331 + 0.789)/4 = 0.28. A simple R function was
written to bootstrap the sample; see Exercise 1.12.7. Using this function, we
obtained 1000 bootstrap samples and the resulting standard deviation of the


Table 1.5.2: Generated N(0, 1) Variates (Placed in Order)


-1.79756 -1.66132 -1.46531 -1.45333 -1.21163 -0.92866 -0.86812
-0.84697 -0.81584 -0.78912 -0.68127 -0.37479 -0.33046 -0.22897
-0.02502 -0.00186 0.09666 0.13316 0.17747 0.31737 0.33125
0.80905 0.88860 0.90606 0.99640 1.26032 1.46174 1.52549
1.60306 1.90116

1000 bootstrap medians was 0.27. For this instance, the bootstrap procedure
essentially agrees with the length of confidence interval estimate.
Note that, from the data, the sample mean is −0.03575 and the sample
standard deviation is 1.04769. If we assume the underlying distribution H is
normal with unknown mean and variance, we would use the parametric boot-
strap. Hence, instead of sampling from the empirical distribution function, we
want to sample from a normal distribution with mean −0.03575 and standard
deviation 1.04769. Using R (see Exercise 1.12.7), we obtained 1000 parametric
bootstrapped samples. The sample standard deviation of the resulting medi-
ans was 0.23, just the value we would expect. You should not expect to get the
precise value every time you bootstrap, either parametrically or nonparametri-
cally. It is, however, a very versatile method to use to estimate such quantities
as standard errors of estimates and p-values of tests.
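Minimal sketches of both bootstrap estimates are given below; here x holds the
sample (for instance, the data of Table 1.5.2) and B = 1000 resamples are used,
as in the example.

B <- 1000
npboot <- replicate(B, median(sample(x, replace = TRUE)))         # from H_n
pboot  <- replicate(B, median(rnorm(length(x), mean(x), sd(x))))  # normal fit
sd(npboot)   # nonparametric bootstrap SE of the median
sd(pboot)    # parametric bootstrap SE of the median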

An unusual aspect of the last example is that the bootstrap distribution of


the sample median can be found in closed form and does not have to be simu-
lated as described above. The variance of the sample median computed from
the bootstrap distribution can then be found. The result is another estimate
of the variance of the sample median. This was discovered independently by
Maritz and Jarrett (1978) and Efron (1979). We do not pursue this develop-
ment here because in most cases we must simulate the bootstrap distribution
and that is where the real strength of the bootstrap approach lies. For an
interesting comparison of the various estimates of the variance of the sample
median, see McKean and Schrader (1984).

1.6 Robustness Properties of Norm-Based Inference
We have just considered the statistical properties of the inference procedures.
We have looked at ideas such as efficiency and power. We now turn to stability
or robustness properties. By this we mean how the inference procedures are
affected by outliers or corruption of portions of the data. Ideally, we would
like procedures (tests and estimates) which do not respond too quickly to a

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 33 —


i i

1.6. ROBUSTNESS PROPERTIES OF NORM-BASED INFERENCE 33

single outlying value when it is introduced into the sample. Further, we would
not like procedures that can be changed by arbitrary amounts by corrupting
a small amount of the data. Response to outliers is measured by the influ-
ence curve and response to data corruption is measured by the breakdown
value. We introduce finite sample versions of these concepts. They are easy to
work with and, in the limit, they generally equal the more abstract versions
based on the study of statistical functionals. We consider, first, the robustness
properties of the estimates and, secondly, tests. As in the last section, the dis-
cussion is general but the L1 and L2 procedures are discussed as we proceed.
The robustness properties of the procedures based on the weighted L1 norm
are covered in Sections 1.7 and 1.8. See Section A.5 of the Appendix for a
development based on functionals.

1.6.1 Robustness Properties of θ̂

We begin with the definition of breakdown for the estimator θ̂.

Definition 1.6.1. Let x = (x_1, . . . , x_n) represent a realization of a sample
and let

    x^{(m)} = (x_1*, . . . , x_m*, x_{m+1}, . . . , x_n)′

represent the corruption of any number m of the n observations. We define
the bias of an estimator θ̂ to be bias(m; θ̂, x) = sup |θ̂(x^{(m)}) − θ̂(x)|, where
the sup is taken over all possible corrupted samples x^{(m)}. Note that we change
only x_1*, . . . , x_m* while x_{m+1}, . . . , x_n are fixed at their original values. If the
bias is infinite, we say the estimate has broken down and the finite sample
breakdown value is given by

    ǫ*_n = min {m/n : bias(m; θ̂, x) = ∞} .               (1.6.1)

This approach to breakdown is called replacement breakdown because ob-
servations are replaced by corrupted values; see Donoho and Huber (1983) for
more discussion of this approach. Often there exists an integer m such that
x_(m) ≤ θ̂ ≤ x_(n−m+1) and either θ̂ tends to −∞ as x_(m) tends to −∞ or θ̂
tends to +∞ as x_(n−m+1) tends to +∞. If m* is the smallest such integer, then
ǫ*_n = m*/n. Hodges (1967) was the first to introduce these ideas.

To remove the effects of sample size, the limit, when it exists, can be
computed. In this case we call lim ǫ*_n = ǫ* the asymptotic breakdown
value.

Example 1.6.1 (Breakdown Values for the L1 and L2 Estimates). The L1


estimate is the sample median. If the sample size is n = 2k then it is easy
to see that when x(k) tends to −∞, the median also tends to −∞. Hence,
the breakdown value of the sample median is k/n which tends to .5. By a

similar argument, when the sample size is n = 2k + 1, the breakdown value is


(k + 1)/n and it also tends to .5 as the sample size increases. Hence, we say
that the sample median is a 50% breakdown estimate. The L2 estimate is the
sample mean. A similar analysis shows that the breakdown value is 1/n which
tends to zero. Hence, we say the sample mean is a zero breakdown estimate.
This sharply contrasts the two estimates since we see that the median is the
most resistant estimate and the sample mean is the least resistant estimate. In
Exercise 1.12.13, the reader is asked to show that the pseudo-median induced
by the signed-rank norm, (1.3.25), has breakdown .29.

We have just considered the effect of corrupting some of the observations.


The estimate breaks down if we can force the estimate to change by an ar-
bitrary amount by changing the observations over which we have control.
Another important concept of stability entails measuring the effect of the in-
troduction of a single outlier. An estimate is stable or resistant if it does not
change by a large amount when the outlier is introduced. In particular, we
want the change to be bounded no matter what the value of the outlier.
Suppose we have a sample of observations x_1, . . . , x_n from a distribution
centered at 0 and an estimate θ̂_n based on these observations. By Pitman
Regularity, Definition 1.5.3, and Theorem 1.5.7, we have

    n^{1/2} θ̂_n = c^{−1} n^{−1/2} S(0)/σ(0) + o_P(1) ,   (1.6.2)

provided the true parameter is 0. Further, we often have a representation
of S(0) as a sum of independent random variables. We may have to make
a projection of S(0) to achieve this; see the next chapter for examples of
projections. In any case, we then have the following representation

    c^{−1} n^{−1/2} S(0)/σ(0) = n^{−1/2} Σ_{i=1}^n Ω(x_i) + o_P(1) ,    (1.6.3)

where Ω(·) is the function needed in the representation. When we combine the
above two statements we have

    n^{1/2} θ̂_n = n^{−1/2} Σ_{i=1}^n Ω(x_i) + o_P(1) .   (1.6.4)

Recall that the distribution that we are sampling is assumed to be centered
at 0. The difference (θ̂_n − 0) is approximated by the average of n independent
and identically distributed random variables. Since Ω(x_i) represents the effect
of the ith observation on θ̂_n, it is called the influence function.

The influence function approximates the rate of change of the estimate
when an outlier is introduced. Let x_{n+1} = x* represent a new, outlying, ob-
servation. Since θ̂_n should be roughly 0, we have


    (n + 1)θ̂_{n+1} − (n + 1)θ̂_n ≐ Ω(x*)

and

    (θ̂_{n+1} − θ̂_n)/(1/(n + 1)) ≈ Ω(x*) ,               (1.6.5)
and this reveals the differential character of the influence function. Hampel
(1974) developed the influence function from the theory of von Mises differen-
tiable functions. In Sections A.5 and A.5.2 of the Appendix, we use his formu-
lation to derive several influence functions for later situations. Here, though,
we identify influence functions for the estimates through the approximations
described above. We now illustrate this approach.

Example 1.6.2 (Influence Function for the L1 and L2 Estimates). We briefly
describe the influence functions for the sample median and the sample mean,
the L1 and L2 estimates. From Example 1.5.2 we have immediately that, for
the sample median,

    n^{1/2} θ̂ ≈ (1/√n) Σ_{i=1}^n sgn(X_i)/(2f(0))

and

    Ω(x) = sgn(x)/(2f(0)) .
Note that the influence function is bounded but not continuous. Hence,
outlying observations cannot have an arbitrarily large effect on the estimate. It
is this feature along with the 50% breakdown property that makes the sample
median the prototype of resistant estimates. The sample mean, on the other
hand, has an unbounded influence function. It is easy to see that Ω(x) = x,
linear and unbounded. Hence, a single large outlier is sufficient to carry the
sample mean beyond any bound. The unbounded influence is connected to the
0 breakdown property. Hence, the L2 estimate is the prototype of an estimate
highly efficient at a specified model, the normal model in this case, but not
resistant. This means that quite close to the model for which the estimate is
optimal, the estimate may perform very poorly; recall Table 1.5.1.
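The differential character of (1.6.5) suggests a simple empirical check in R:
append a single outlier x* to a sample and plot the standardized change in
each estimate. The sample and grid below are our illustrative choices.

x <- rnorm(50)                             # sample centered at 0
xstar <- seq(-5, 5, by = 0.1)
infl.mean <- sapply(xstar,
  function(v) (mean(c(x, v)) - mean(x))*(length(x) + 1))
infl.med  <- sapply(xstar,
  function(v) (median(c(x, v)) - median(x))*(length(x) + 1))
plot(xstar, infl.mean, type = "l")         # roughly linear and unbounded
lines(xstar, infl.med, lty = 2)            # bounded, with a step at zero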

1.6.2 Breakdown Properties of Tests


We now turn to the issue of breakdown in testing hypotheses. The problems are
a bit different in this case since we typically want to move, by data corruption,
a test statistic into or out of a critical region. It is not a matter of sending the
statistic beyond any finite bound as it is in estimation breakdown.


Definition 1.6.2. Suppose that V is a statistic for testing H0 : θ = 0 versus
HA : θ > 0 and we reject the null hypothesis when V ≥ k, where P0(V ≥ k) = α
determines k. The rejection breakdown of the test is defined by

    ǫ*_n(reject) = min {m/n : inf_x sup_{x^{(m)}} V ≥ k} ,    (1.6.6)

where the sup is taken over all possible corruptions of m data points. Likewise,
the acceptance breakdown is defined to be

    ǫ*_n(accept) = min {m/n : sup_x inf_{x^{(m)}} V < k} .    (1.6.7)

Rejection breakdown is the smallest portion of the data that can be cor-
rupted to guarantee that the test rejects the null hypothesis. Acceptance
breakdown is interpreted as the smallest portion of the data that must be
corrupted to guarantee that the test statistic is not in the critical region; i.e.,
the test is guaranteed to fail to reject the null hypothesis. We turn immediately
to a comparison of the L1 and L2 tests.
Example 1.6.3 (Rejection Breakdown of the L1). We first consider the one-
sided sign test for testing H0 : θ = 0 versus HA : θ > 0. The asymptotically
size α test rejects the null hypothesis when n^{−1/2}S_1(0) ≥ zα, the upper α
quantile from a standard normal distribution. It is easier to see exactly what
happens if we convert the test to S_1^+(0) = Σ I(X_i > 0) ≥ n/2 + (n^{1/2}zα)/2.
Now each time we make an observation positive it makes S_1^+(0) increase by
one. Hence, if we wish to guarantee that the test rejects the null hypothesis,
we make m observations positive, where m* = [n/2 + (n^{1/2}zα)/2] + 1, with [·]
the greatest integer function. Then the rejection breakdown is

    ǫ*_n(reject) ≐ m*/n = 1/2 + zα/(2n^{1/2}) .

Likewise,

    ǫ*_n(accept) ≐ 1/2 − zα/(2n^{1/2}) .
Note that the rejection breakdown converges down to the estimation break-
down and the acceptance breakdown converges up to it.
We next turn to the one-sided Student’s t-test. Acceptance breakdown
for the t-test is simple. By making a single observation approach −∞, the
t-statistic can be made negative, hence we can always guarantee acceptance
with control of one observation. The rejection breakdown is more interesting.
If we increase an observation both the sample mean and the sample standard
deviation increase. Hence, it is not at all clear what happens to the t-statistic.
In fact it is not sufficient to increase a single observation in order to force the


Table 1.6.1: Rejection Breakdown Values for Size α = .05 Tests


n Sign t
10 .71 .27
13 .70 .21
18 .67 .15
30 .63 .09
100 .58 .03
∞ .50 0

t-statistic to move into the critical region. We now show that the rejection
breakdown for the t-statistic is:

    ǫ*_n(reject) = t_α²/(n − 1 + t_α²) → 0 , as n → ∞ ,

where t_α is the upper α quantile from a t-distribution with n − 1 degrees of
freedom. The infimum part of the definition suggests that we set all obser-
vations at −B < 0 and then change m observations to M > 0. The result
is

    x̄ = [mM − (n − m)B]/n  and  s² = m(n − m)(M + B)²/[(n − 1)n] .

Putting these two quantities together we have

    n^{1/2} x̄/s = [m − (n − m)B/M] [(n − 1)/(m(n − m)(1 + B/M)²)]^{1/2}
                → [m(n − 1)/(n − m)]^{1/2} ,

as M → ∞. We now equate the limit to t_α and solve for m to get m =
nt_α²/(n − 1 + t_α²) (actually we would take the greatest integer and add one).
Then the rejection breakdown is m divided by n, as stated. Table 1.6.1 compares
rejection breakdown values for the sign and t-tests. We assume α = .05 and the
sample sizes are chosen so that the size of the sign test is quite close to .05. For
further discussion, see Ylvisaker (1977).
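The t column of Table 1.6.1 follows directly from the formula above and can
be checked in R; the sign column depends on the exact binomial critical values
and is not reproduced here.

breakdown.t <- function(n, alpha = 0.05) {
  ta <- qt(1 - alpha, n - 1)               # upper alpha t quantile
  ta^2/(n - 1 + ta^2)
}
round(sapply(c(10, 13, 18, 30, 100), breakdown.t), 2)
# 0.27 0.21 0.15 0.09 0.03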
These definitions of breakdown assume a worst case scenario. They assume
that the test statistic is as far away from the critical region (for rejection
breakdown) as possible. In practice, however, it may be the case that a test
statistic is quite near the edge of the critical region and only one observation is
needed to change the decision from fail to reject to that of reject. An alternative
form of breakdown considers the average number of observations that must
be corrupted, conditional on the test statistic being in the acceptance region,
to force a rejection.
Let MR be the number of observations that must be corrupted to force a
rejection; then, MR is a random variable. The expected rejection break-


Table 1.6.2: Comparison of Expected Breakdown and Worst Case Breakdown
for the Size α = .05 Sign Test

    n     Exp*_n(reject)   ǫ*_n(reject)
    10        .27             .71
    13        .24             .70
    18        .20             .67
    30        .16             .63
    100       .08             .58
    ∞         0               .50

down is defined to be

    Exp*_n(reject) = E_{H0}[M_R | M_R > 0]/n .            (1.6.8)

Note that we condition on M_R > 0 since M_R = 0 is equivalent to a rejection.
It is left as Exercise 1.12.14 to show that the expected breakdown can be
computed with unconditional expectation as

    Exp*_n(reject) = E_{H0}[M_R]/[n(1 − α)] .             (1.6.9)
In the following example we illustrate this computation on the sign test and
show how it compares to the worst case breakdown introduced earlier.
Example 1.6.4 (Expected Rejection Breakdown of the Sign Test). Refer to
Example 1.6.3. The one-sided sign test rejects when $\sum I(X_i > 0) \geq n/2 +
n^{1/2}z_\alpha/2$. Hence, given that we fail to reject the null hypothesis, we need to
change (corrupt) $n/2 + n^{1/2}z_\alpha/2 - \sum I(X_i > 0)$ negative observations into
positive ones. This is precisely $M_R$ and $E[M_R] = n^{1/2}z_\alpha/2$. It follows that
$\mbox{Exp}^*_n(\mbox{reject}) = z_\alpha/(2n^{1/2}(1 - \alpha)) \rightarrow 0$ as $n \rightarrow \infty$, rather than .5, which happens
in the worst case breakdown. Table 1.6.2 compares the two types of rejection
breakdown. This simple calculation clearly shows that even highly resistant
tests such as the sign test may breakdown quite easily. This is contrary to what
the worst case breakdown analysis would suggest. For additional reading on
test breakdown see Coakley and Hettmansperger (1992). He, Simpson, and
Portnoy (1990) discuss asymptotic test breakdown.
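A one-line computation (again a sketch in base R) recovers the expected-
breakdown column of Table 1.6.2 from (1.6.9), up to rounding in the last entry.

n <- c(10, 13, 18, 30, 100)
round(qnorm(0.95)/(2*sqrt(n)*0.95), 2)   # .27 .24 .20 .16 .09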

1.7 Inference and the Wilcoxon Signed-Rank Norm
In this section we develop the statistical properties for the procedures based
on the Wilcoxon signed-rank norm, (1.3.17), that was defined in Example 1.3.3


of Section 1.3. Recall that the norm and its associated gradient function are
given in expressions (1.3.17) and (1.3.24), respectively. Recall for a sample
X1 , . . . , Xn that the estimate of θ is the median of the Walsh averages given
by (1.3.25). As in Section 1.3, our hypotheses of interest are

$$ H_0: \theta = 0 \mbox{ versus } H_A: \theta \neq 0 \, . \qquad (1.7.1) $$

The level α test associated with the signed-rank norm is

Reject H0 in favor of HA , if |T (0)| ≥ c , (1.7.2)

where $c$ is such that $P_0[|T(0)| \geq c] = \alpha$. To complete the test we need to determine
the null distribution of T (0), which is given by Theorems 1.7.1 and 1.7.2.
In order to develop the statistical properties, in addition to (1.2.1), we
assume that
$h(x)$ is symmetric about $\theta$ . \qquad (1.7.3)
We refer to this as the symmetric location model. Under symmetry, by
Theorem 1.2.1, V (H) = θ, for all location functionals V .

1.7.1 Null Distribution Theory of T (0)


In addition to expression (1.3.24), a third representation of T (0) is helpful
in establishing its null distribution. Recall the definition of the anti-ranks,
D1 , . . . , Dn , given in expression (1.3.19). Using these anti-ranks, we can write
$$ T(0) = \sum R(|X_i|)\mbox{sgn}(X_i) = \sum j\,\mbox{sgn}(X_{D_j}) = \sum jW_j \, , $$
where $W_j = \mbox{sgn}(X_{D_j})$.
Lemma 1.7.1. Under H0 , |X1 |, . . . , |Xn | and sgn(X1 ), . . . , sgn(Xn ) are inde-
pendent.
Proof: Since $X_1, \ldots, X_n$ is a random sample from $H(x)$, it suffices to show
that $P[|X_i| \leq x, \mbox{sgn}(X_i) = 1] = P[|X_i| \leq x]P[\mbox{sgn}(X_i) = 1]$. But due to $H_0$
and the symmetry of $h(x)$ this follows from the following string of equalities:
$$ P[|X_i| \leq x, \mbox{sgn}(X_i) = 1] = P[0 < X_i \leq x] = H(x) - \frac{1}{2}
= \frac{1}{2}[2H(x) - 1] = P[|X_i| \leq x]P[\mbox{sgn}(X_i) = 1] \, . $$
Based on this lemma, the vector of ranks and, hence, the vector of an-
tiranks (D1 , . . . , Dn ), are independent of the vector (sgn(X1 ), . . . , sgn(Xn )).
Based on these facts, we can obtain the distribution of (W1 , . . . , Wn ), which
we summarize in the following lemma; see Exercise 1.12.15 for its proof.


Lemma 1.7.2. Under H0 and the symmetry of h(x), W1 , . . . , Wn are iid ran-
dom variables with P [Wi = 1] = P [Wi = −1] = 1/2.
We can now easily derive the null distribution theory of T (0) which we
summarize in the following theorems. Details are given in Exercise 1.12.16.
Theorem 1.7.1. Under $H_0$ and the symmetry of $h(x)$,
$$ T(0) \mbox{ is distribution free and symmetrically distributed,} \qquad (1.7.4) $$
$$ E_0[T(0)] = 0 \, , \qquad (1.7.5) $$
$$ \mbox{Var}_0(T(0)) = \frac{n(n + 1)(2n + 1)}{6} \, , \qquad (1.7.6) $$
$$ \frac{T(0)}{\sqrt{\mbox{Var}_0(T(0))}} \mbox{ has an asymptotically } N(0, 1) \mbox{ distribution} \, . \qquad (1.7.7) $$

The exact distribution of T (0) cannot be found in closed form. We do,


however, have the following recursion formula; see Exercise 1.12.17.
Theorem 1.7.2. Consider the version of the signed-rank test statistic given
by $T^+$, (1.3.28). Let $p_n(k) = P[T^+ = k]$ for $k = 0, \ldots, n(n+1)/2$. Then
$$ p_n(k) = \frac{1}{2}\left[p_{n-1}(k) + p_{n-1}(k - n)\right] \, , \qquad (1.7.8) $$
where
$$ p_0(0) = 1 \, ; \quad p_0(k) = 0 \mbox{ for } k \neq 0 \, ; \quad \mbox{and} \quad p_n(k) = 0 \mbox{ for } k < 0 \, . $$

Using this formula algorithms can be developed which obtain the null
distribution of the signed-rank test statistic. The moment generating function
can also be inverted to find the null distribution; see Hettmansperger (1984a,
Section 2.2). As discussed in Section 1.3.1, software is now available which
computes critical values and p-values of the null distribution.
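The recursion translates into a few lines of code. The following sketch (base
R; the function name tplus.pmf is ours) computes the exact null pmf of T+
and checks it against the signrank distribution built into R.

tplus.pmf <- function(n) {
  p <- 1                               # p_0: point mass at 0
  for (m in 1:n) {
    pad <- rep(0, m)
    p <- 0.5*(c(p, pad) + c(pad, p))   # (1.7.8): shift by m is k - m
  }
  p                                    # p[k + 1] = P(T+ = k)
}
sum(tplus.pmf(10)[(50:55) + 1])              # P(T+ >= 50) for n = 10
all.equal(tplus.pmf(10), dsignrank(0:55, 10))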
Theorem 1.7.1 justifies the confidence interval for θ in (1.3.30); i.e., the
$(1 - \alpha)100\%$ confidence interval given by $[W_{(k+1)}, W_{(n(n+1)/2 - k)})$, where $W_{(i)}$
denotes the $i$th ordered Walsh average and $P(T^+(0) \leq k) = \alpha/2$. Based on
(1.7.7), $k$ can be approximated as $k \approx n(n+1)/4 - .5 - z_{\alpha/2}[n(n+1)(2n+1)/24]^{1/2}$.
As noted in Section 1.3.1, the computation of the estimate and confidence
interval can be obtained by the Robnp R function onesampwil or the R intrinsic
function wilcox.test.
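As an illustration, the following sketch (base R, with simulated data) computes
the Walsh averages, the estimate, and the confidence interval directly, using
the normal approximation to k just given.

set.seed(1)
x <- rnorm(15); n <- length(x); alpha <- 0.05
s <- outer(x, x, "+")/2
wa <- sort(s[lower.tri(s, diag = TRUE)])   # the n(n+1)/2 Walsh averages
theta.hat <- median(wa)                    # median of the Walsh averages
k <- floor(n*(n + 1)/4 - 0.5 - qnorm(1 - alpha/2)*sqrt(n*(n + 1)*(2*n + 1)/24))
ci <- c(wa[k + 1], wa[n*(n + 1)/2 - k])    # [W_(k+1), W_(n(n+1)/2 - k))
# cf. wilcox.test(x, conf.int = TRUE)$estimate and $conf.int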

1.7.2 Statistical Properties


From our earlier analysis of the statistical properties of the L1 and L2 methods
we see that Pitman Regularity is crucial. In particular, we need to compute the


Pitman efficacy which determines the asymptotic variance of the estimate, the
asymptotic local power of the test, and the asymptotic length of the confidence
interval. In the following theorem we show that the weighted L1 gradient
function is Pitman Regular and determine the efficacy. Then we make some
preliminary efficiency comparisons with the L1 and L2 methods.
Theorem 1.7.3. Suppose that $h$ is symmetric and that $\int h^2(x)\,dx < \infty$. Let
$$ T(\theta) = \frac{2}{n(n + 1)} \sum_{i \leq j} \mbox{sgn}\left(\frac{X_i + X_j}{2} - \theta\right) \, . $$
Then the conditions of Definition 1.5.3 are satisfied and, thus, $T(\theta)$ is Pitman
Regular. Moreover, the Pitman efficacy is given by
$$ c = \sqrt{12}\int_{-\infty}^{\infty} h^2(x)\,dx \, . \qquad (1.7.9) $$

Proof: Since we have the L1 norm applied to the Walsh averages, the estimating
function is a nonincreasing step function with steps at the Walsh averages.
Hence, (1.5.7) holds. Next note that $h(x) = h(-x)$ and, hence,
$$ \mu(\theta) = E_\theta T(0) = \frac{2}{n + 1} E_\theta\,\mbox{sgn}(X_1) + \frac{n - 1}{n + 1} E_\theta\,\mbox{sgn}\left(\frac{X_1 + X_2}{2}\right) \, . $$
Now
$$ E_\theta\,\mbox{sgn}(X_1) = \int \mbox{sgn}(x + \theta)h(x)\,dx = 1 - 2H(-\theta) \, , $$
and
$$ E_\theta\,\mbox{sgn}\left(\frac{X_1 + X_2}{2}\right) = \int\int \mbox{sgn}[(x + y)/2 + \theta]h(x)h(y)\,dx\,dy
= \int [1 - 2H(-2\theta - y)]h(y)\,dy \, . $$
Differentiate with respect to $\theta$ and set $\theta = 0$ to get
$$ \mu'(0) = \frac{2h(0)}{n + 1} + \frac{4(n - 1)}{n + 1}\int_{-\infty}^{\infty} h^2(y)\,dy \rightarrow 4\int h^2(y)\,dy \, . $$

The finiteness of the integral is sufficient to ensure that the derivative can
be passed through the integral; see Hodges and Lehmann (1961) or Olshen
(1967). Hence, (1.5.9) also holds. We next establish Condition (1.5.10). Since
$$ T(\theta) = \frac{2}{n(n + 1)} \sum_{i=1}^{n} \mbox{sgn}(X_i - \theta) + \frac{2}{n(n + 1)} \sum_{i < j} \mbox{sgn}\left(\frac{X_i + X_j}{2} - \theta\right) \, , $$


the first term is of smaller order and we need only consider the second term.
Now, for $b > 0$, let
$$ V^* = \frac{2}{n(n + 1)} \sum_{i < j} \left[\mbox{sgn}\left(\frac{X_i + X_j}{2} - n^{-1/2}b\right) - \mbox{sgn}\left(\frac{X_i + X_j}{2}\right)\right]
= \frac{-4}{n(n + 1)} \sum_{i < j} I\left(0 < \frac{X_i + X_j}{2} < n^{-1/2}b\right) \, . $$
Hence,
$$ n\mbox{Var}(V^*) = \frac{16n}{n^2(n + 1)^2} E\left\{\sum_{i<j}\sum_{s<t} (I_{ij}I_{st} - EI_{ij}EI_{st})\right\} \, , $$
where $I_{ij} = I(0 < (X_i + X_j)/2 < n^{-1/2}b)$. This becomes
$$ n\mbox{Var}(V^*) = \frac{16n^2(n - 1)}{2n^2(n + 1)^2}\mbox{Var}(I_{12}) + \frac{16n^2(n - 1)(n - 2)}{2n^2(n + 1)^2}[EI_{12}I_{13} - EI_{12}EI_{13}] \, . $$
The first term tends to zero since it behaves like $1/n$. In the second term,
consider $|EI_{12}I_{13} - EI_{12}EI_{13}| \leq EI_{12} + E^2I_{12} = EI_{12}(1 + EI_{12})$. Now, as
$n \rightarrow \infty$,
$$ EI_{12} = P\left(0 < \frac{X_i + X_j}{2} < n^{-1/2}b\right) = \int [H(2n^{-1/2}b - x) - H(-x)]h(x)\,dx \rightarrow 0 \, . $$
Hence, by Theorem 1.5.6, Condition (1.5.10) is true. Finally, asymptotic nor-
mality of the null distribution is established in Theorem 1.7.1 which also yields
$n\mbox{Var}_0 T(0) \rightarrow 4/3 = \sigma^2(0)$. It follows that the Pitman efficacy is
$$ c = \frac{4\int h^2(y)\,dy}{\sqrt{4/3}} = \sqrt{12}\int h^2(y)\,dy \, . $$
For future reference we display the asymptotic linearity result:
$$ \frac{T(\theta)}{\sqrt{n(n + 1)(2n + 1)/6}} = \frac{T(0)}{\sqrt{n(n + 1)(2n + 1)/6}} - \sqrt{12n}\,\theta\int h^2(x)\,dx + o_p(1) \, , \qquad (1.7.10) $$
for $\sqrt{n}|\theta| \leq B$, where $B > 0$.
An immediate consequence of this theorem and Theorem 1.5.7 is that
$$ \sqrt{n}(\hat{\theta} - \theta) \stackrel{D}{\rightarrow} Z \sim N\left(0, 1\Big/\left(12\left(\int h^2(t)\,dt\right)^2\right)\right) \, , \qquad (1.7.11) $$
and we thus have the limiting distribution of the median of the Walsh averages.
Exercise 1.12.21 shows that $\int h^2(t)\,dt < \infty$ when $h$ has finite Fisher
information.


From our general discussion, a simple estimate of the standard error of the
median of the Walsh averages is proportional to the length of a distribution
free confidence interval. Consider the (1 − α)100% confidence interval given
by [W(k+1) , W(((n(n+1))/2)−k) ) where W(i) denotes the ith ordered Walsh average
and P (T + (0) ≤ k) = α/2. Then by expression (1.5.28), a consistent estimate
of the SE of the median of the Walsh averages (medWA) is
$$ SE(\mbox{medWA}) = \frac{W_{(n(n+1)/2 - k)} - W_{(k+1)}}{2z_{\alpha/2}} \, . \qquad (1.7.12) $$

Our R function onesampwil computes this standard error for general α (de-
fault α is set at 0.05). We have more to say about this particular c in the next
chapter where we encounter it in the two-sample location model and later in
the linear model, where a better estimator of this SE is presented.
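Continuing the sketch of Section 1.7.1 (it assumes the objects wa, n, k, and
alpha defined there), (1.7.12) is a one-line computation:

se.medWA <- (wa[n*(n + 1)/2 - k] - wa[k + 1])/(2*qnorm(1 - alpha/2))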
From Example 1.5.3 and Definition 1.5.4, we have the asymptotic relative
efficiency between the signed-rank Wilcoxon process and the L2 process is
given by
$$ e(\mbox{Wilcoxon}, L_2) = 12\sigma_h^2\left(\int h^2(x)\,dx\right)^2 \, , \qquad (1.7.13) $$
where $h$ is the underlying density with variance $\sigma_h^2$.


In the following example, we consider the contaminated normal distribu-
tion and then find the efficiency of the rank methods relative to the L1 and
L2 methods.
Example 1.7.1 (Asymptotic Relative Efficiency for Contaminated Normal
Distributions). Let fǫ (x) denote the pdf of the contaminated normal distri-
bution used in Example 1.5.4; the proportion of contamination is ǫ and the
variance of the contaminated part is 9. A straightforward computation shows
that
$$ \int f_\epsilon^2(y)\,dy = \frac{(1 - \epsilon)^2}{2\sqrt{\pi}} + \frac{\epsilon^2}{6\sqrt{\pi}} + \frac{\epsilon(1 - \epsilon)}{\sqrt{5\pi}} \, , $$
and we use this in the formula for $c$ given above. The efficacies for the $L_1$
and $L_2$ methods are given in Example 1.5.4. We first consider the special case of
$\epsilon = 0$, corresponding to an underlying normal distribution. In this case we have
for the rank methods $c_R^2 = 12/(4\pi) = 3/\pi = .955$, for the $L_1$ methods
$c_1^2 = 2/\pi = .637$, and for the $L_2$ methods $c_2^2 = 1$. We have already seen that the
efficiency $e_{normal}(L_1, L_2) = c_1^2/c_2^2 = .637$ from the first line of Table 1.5.1.
We now have
$$ e_{normal}(\mbox{Wilcoxon}, L_2) = 3/\pi \doteq .955 \quad \mbox{and} \quad e_{normal}(\mbox{Wilcoxon}, L_1) \doteq 1.5 \, . \qquad (1.7.14) $$
The efficiency of the rank methods relative to the L2 methods is extraordinary.
It says that even at the distribution for which the t-test is uniformly most
powerful, the Wilcoxon signed-rank test is almost as efficient. This means


Table 1.7.1: Efficiencies of the Rank, L1 , and L2 Methods for the Contaminated
Normal Distribution
ǫ e(L1 , L2 ) e(R, L1 ) e(R, L2 )
.00 .637 1.500 .955
.01 .678 1.488 1.009
.03 .758 1.462 1.108
.05 .833 1.436 1.196
.10 1.000 1.373 1.373
.15 1.134 1.320 1.497

that replacing the values of the observations by their ranks (retaining only
the order information) does not affect the statistical properties of the test.
This was considered highly nonintuitive in the 1950’s since nonparametric
methods were thought of as quick and dirty. Now they must be considered
highly efficient competitors of the optimal methods and, in addition, they are
more robust than the optimal methods. This provides powerful motivation
for the continued study of rank methods in other statistical models such as
the two-sample location model and the linear model. The early work in the
area of efficiency of rank methods is due largely to Lehmann and his students.
See Lehmann and Hodges (1956, 1961) for two important early papers and
Lehmann (1975, Appendix) for more discussion.
We complete this example with a table of efficiencies of the rank methods
relative to the L1 and L2 methods for the contaminated normal model with
σ = 3. Table 1.7.1 shows these efficiencies and extends Table 1.5.1. As ǫ
increases the weight in the tails of the distribution also increases. Note that the
efficiencies of both the L1 and rank methods relative to L2 methods increase
with ǫ. On the other hand, the efficiency of the rank methods relative to
the L1 methods decreases slightly. The rank methods are still more efficient;
however, this illustrates the fact that the L1 methods are good for heavy
tailed distributions. The overall implication of this example is that the L2
methods, such as the sample mean, the t-test, and t-confidence interval, are not
particularly efficient once the underlying distribution departs from the normal
distribution. Further, the rank methods such as the Wilcoxon signed-rank test,
confidence interval, and the median of the Walsh averages are surprisingly
efficient, even at the normal distribution. Note that the rank methods are
more efficient than L2 methods even for 1% contamination.
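The entries of Table 1.7.1 follow directly from these formulas. A sketch in
base R, using the efficacies of Example 1.5.4:

eps <- c(0, .01, .03, .05, .10, .15)
int.f2 <- (1 - eps)^2/(2*sqrt(pi)) + eps^2/(6*sqrt(pi)) +
          eps*(1 - eps)/sqrt(5*pi)            # integral of f_eps^2 above
sig2 <- 1 + 8*eps                             # variance of f_eps (sigma_c = 3)
f0 <- (1 - eps)*dnorm(0) + eps*dnorm(0)/3     # f_eps(0)
eL1L2 <- 4*f0^2*sig2                          # e(L1, L2) = 4 f(0)^2 sigma^2
eRL2 <- 12*sig2*int.f2^2                      # e(R, L2), from (1.7.13)
round(cbind(eps, eL1L2, eRL1 = eRL2/eL1L2, eRL2), 3)   # rows of Table 1.7.1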

Finally, the following theorem shows that the Wilcoxon signed-rank statis-
tic never loses much efficiency relative to the t-statistic. Let Fs denote the
family of distributions which have symmetric densities and finite Fisher infor-
mation; see Exercise 1.12.21.


Theorem 1.7.4. Let $X_1, \ldots, X_n$ be a random sample from $H \in F_s$. Then
$$ \inf_{F_s} e(\mbox{Wilcoxon}, L_2) = 0.864 \, . \qquad (1.7.15) $$
Proof: By (1.7.13), $e(\mbox{Wilcoxon}, L_2) = 12\sigma_h^2\left(\int h^2(x)\,dx\right)^2$. If $\sigma_h^2 = \infty$ then
$e(\mbox{Wilcoxon}, L_2) > .864$; hence, we can restrict attention to $H \in F_s$ such
that $\sigma_h^2 < \infty$. As Exercise 1.12.22 indicates, $e(\mbox{Wilcoxon}, L_2)$ is location and
scale invariant, so we can further assume that $h$ is symmetric about 0 and
$\sigma_h^2 = 1$. The problem, then, is to minimize $\int h^2$ subject to $\int h = \int x^2 h = 1$
and $\int xh = 0$. This is equivalent to minimizing
$$ \int h^2 + 2b\int x^2 h - 2ba^2\int h \, , \qquad (1.7.16) $$
where $a$ and $b$ are positive constants to be determined later. We now write
(1.7.16) as
$$ \int [h^2 + 2b(x^2 - a^2)h] = \int_{|x| \leq a} [h^2 + 2b(x^2 - a^2)h] + \int_{|x| > a} [h^2 + 2b(x^2 - a^2)h] \, . \qquad (1.7.17) $$
First complete the square on the first term on the right side of (1.7.17) to get
$$ \int_{|x| \leq a} \left[h + b(x^2 - a^2)\right]^2 - \int_{|x| \leq a} b^2(x^2 - a^2)^2 \, . \qquad (1.7.18) $$

Now (1.7.17) is equal to the two terms of (1.7.18) plus the second term on the
right side of (1.7.17). We can now write the density that minimizes (1.7.16).
If |x| > a take h(x) = 0, since x2 > a2 , and if |x| ≤ a take h(x) = b(a2 −x2 ),
since the integral in the first term of (1.7.18) is nonnegative. We can now
determine the values of $a$ and $b$ from the side conditions. From $\int h = 1$ we have
$$ \int_{-a}^{a} b(a^2 - x^2)\,dx = 1 \, , $$
which implies that $a^3 b = \frac{3}{4}$. Further, from $\int x^2 h = 1$ we have
$$ \int_{-a}^{a} x^2 b(a^2 - x^2)\,dx = 1 \, , $$
from which $a^5 b = \frac{15}{4}$. Hence, solving for $a$ and $b$ yields $a = \sqrt{5}$ and
$b = 3\sqrt{5}/100$. Now
$$ \int h^2 = \int_{-\sqrt{5}}^{\sqrt{5}} \left[\frac{3\sqrt{5}}{100}(5 - x^2)\right]^2 dx = \frac{3\sqrt{5}}{25} \, , $$


which leads to the result,
$$ \inf_{F_s} e(\mbox{Wilcoxon}, L_2) = 12\left(\frac{3\sqrt{5}}{25}\right)^2 = \frac{108}{125} \doteq 0.864 \, . $$

1.7.3 Robustness Properties


We complete this section with a discussion of the breakdown point of the
estimate and test and a heuristic derivation of the influence function of the
estimate. In Example 1.6.1 we discussed the breakdown of the sample median
and mean. In those cases we saw that the median is the most resistant while
the mean is the least resistant. In Exercise 1.12.13 you are asked to show that
the breakdown point of the median of the Walsh averages, the R estimate, is
roughly .29. Our next result gives the influence function of $\hat{\theta}$.

Theorem 1.7.5. The influence function of $\hat{\theta} = \mbox{med}_{i \leq j}(X_i + X_j)/2$ is given by
$$ \Omega(x) = \frac{H(x) - 1/2}{\int_{-\infty}^{\infty} h^2(t)\,dt} \, . $$

We sketch a derivation of this result, here. A rigorous development is offered


in Section A.5 of the Appendix. From Theorems 1.7.3 and 1.5.6 we have
$$ n^{1/2}T(\theta)/\sigma(0) \approx n^{1/2}T(0)/\sigma(0) - cn^{1/2}\theta \, , $$
and
$$ \hat{\theta}_n \approx T(0)/(c\,\sigma(0)) \, , $$
where $\sigma(0) = (4/3)^{1/2}$ and $c = (12)^{1/2}\int h^2(t)\,dt$. Making these substitutions,
$$ \hat{\theta}_n \doteq \frac{1}{2n(n + 1)\int h^2(t)\,dt} \sum_{i \leq j} \mbox{sgn}\left(\frac{X_i + X_j}{2}\right) \, . $$
Now introduce an outlier $x_{n+1} = x^*$ and take the difference between $\hat{\theta}_{n+1}$ and
$\hat{\theta}_n$. The result is
$$ 2\int h^2(t)\,dt\,[(n + 2)\hat{\theta}_{n+1} - n\hat{\theta}_n] \doteq \frac{1}{n + 1} \sum_{i=1}^{n+1} \mbox{sgn}\left(\frac{x_i + x^*}{2}\right) \, . $$
We can replace $n + 2$ and $n + 1$ by $n$ where convenient without affecting the
asymptotics. Using the symmetry of the density of $H$, we have
$$ \frac{1}{n}\sum_{i=1}^{n} \mbox{sgn}\left(\frac{x_i + x^*}{2}\right) = 1 - 2H_n(-x^*) \rightarrow 1 - 2H(-x^*) = 2H(x^*) - 1 \, . $$


It now follows that $(n + 1)(\hat{\theta}_{n+1} - \hat{\theta}_n) \doteq \Omega(x^*)$, as given in the statement of the
theorem; see the discussion of the influence function in Section 1.6.
Note that we have a bounded influence function, since the cdf $H$ is
a bounded function. Further, it is continuous, unlike the influence function
of the median. Finally, as an additional check, note that
$E\Omega^2(X) = 1/(12(\int h^2(t)\,dt)^2) = 1/c^2$, the asymptotic variance of $n^{1/2}\hat{\theta}$.
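For instance, at the standard normal $H$ (where $\int h^2 = 1/(2\sqrt{\pi})$) the influence
function and its second moment can be checked numerically. A sketch in base R:

omega <- function(x) (pnorm(x) - 0.5)*2*sqrt(pi)   # bounded: |Omega| <= sqrt(pi)
integrate(function(x) omega(x)^2*dnorm(x), -Inf, Inf)$value  # pi/3 = 1.047,
                                                   # the asymptotic variance 1/c^2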
Let $\hat{\theta}_c = \mbox{med}_{i,j}\{(X_i - cX_j)/(1 - c)\}$ for $-1 \leq c < 1$. This extension of
the Hodges-Lehmann estimate, (1.3.25), has some very interesting robustness
properties for $c > 0$. The influence function of $\hat{\theta}_c$ is not only bounded but also
redescending, similar to the most robust M estimates. In addition, $\hat{\theta}_c$ has 50%
breakdown. For a complete discussion of this estimate see Maritz, Wu, and
Staudte (1977) and Brown and Hettmansperger (1994).
In the next theorem we develop the test breakdown for the Wilcoxon
signed-rank test.

Theorem 1.7.6. The rejection breakdown, Definition 1.6.2, for the Wilcoxon
signed-rank test is
$$ \epsilon^*_n \doteq 1 - \left[\frac{1}{2} - \frac{z_\alpha}{(3n)^{1/2}}\right]^{1/2} \rightarrow 1 - \frac{1}{2^{1/2}} \doteq .29 \, . $$
Proof: Consider the form $T^+(0) = \sum\sum I[(x_i + x_j)/2 > 0]$, where the double
sum is over all $i \leq j$. The asymptotically size $\alpha$ test rejects $H_0: \theta = 0$
in favor of $H_A: \theta > 0$ when $T^+(0) \geq c \doteq n(n + 1)/4 + z_\alpha[n(n + 1)(2n + 1)/24]^{1/2}$.
Now we must guarantee that $T^+(0)$ is in the critical region. This
requires at least $c$ positive Walsh averages. Let $x_{(1)} \leq \ldots \leq x_{(n)}$ be the ordered
observations. Then contamination of $x_{(n)}$ results in $n$ contaminated Walsh
averages, namely those Walsh averages that include $x_{(n)}$. Contamination of
$x_{(n-1)}$ yields $n - 1$ additional contaminated Walsh averages. When we proceed
in this way, contamination of the $b$ ordered values $x_{(n)}, \ldots, x_{(n-b+1)}$ yields
$n + (n - 1) + \ldots + (n - b + 1) = [n(n + 1)/2] - [(n - b)(n - b + 1)/2]$ contaminated
Walsh averages. We now set $[n(n + 1)/2] - [(n - b)(n - b + 1)/2] \doteq c$ and solve
the resulting quadratic for $b$. We must solve $b^2 - (2n + 1)b + 2c \doteq 0$. The
appropriate root in this case is
$$ b \doteq \frac{2n + 1 - [(2n + 1)^2 - 8c]^{1/2}}{2} \, . $$
Substituting the approximate critical value for $c$, dividing by $n$, and ignoring
higher order terms leads to the stated result.
Table 1.7.2 displays the finite rejection breakdowns of the Wilcoxon signed-
rank test over the same sample sizes as the rejection breakdowns of the sign
test given in Table 1.6.1. For convenience we have reproduced the results


Table 1.7.2: Rejection Breakdown Values for Size α = .05 Tests


Signed-rank
n Sign t Wilcoxon
10 .71 .27 .57
13 .70 .21 .53
18 .67 .15 .48
30 .63 .09 .43
100 .58 .03 .37
∞ .50 0 .29

for the sign and t-tests, also. The rejection breakdown for the Wilcoxon test
converges from above to the estimation breakdown of .29. The Wilcoxon test
is more resistant than the t-test but not as resistant as the simple sign test. It
is interesting to note that from the discussion of efficiency, it is clear that we
can now achieve high efficiency and not pay the price in lack of robustness.
The rank-based methods seem to be a very attractive alternative to the highly
resistant but relatively inefficient (at the normal model) L1 methods and the
highly efficient (at the normal model) but nonrobust L2 methods.
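The proof of Theorem 1.7.6 yields the finite-sample values directly. The
following sketch (base R) reproduces the Wilcoxon column of Table 1.7.2.

n <- c(10, 13, 18, 30, 100)
za <- qnorm(0.95)
cc <- n*(n + 1)/4 + za*sqrt(n*(n + 1)*(2*n + 1)/24)   # critical value c
b <- (2*n + 1 - sqrt((2*n + 1)^2 - 8*cc))/2           # observations to corrupt
round(b/n, 2)                                          # .57 .53 .48 .43 .37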

1.8 Inference Based on General Signed-Rank Norms
In this section, we develop properties for a generalized signed-rank process. It
includes the L1 and the weighted L1 as special cases. The development is
similar to that of the weighted L1 so a brief sketch suffices. For x ∈ Rn ,
consider the function,
$$ \|x\|_{\varphi^+} = \sum_{i=1}^{n} a^+(R|x_i|)|x_i| \, , \qquad (1.8.1) $$

where the scores a+ (i) are generated as a+ (i) = ϕ+ (i/(n + 1)) for a positive
valued, nondecreasing, square-integrable function ϕ+ (u) defined on the inter-
val (0, 1). The proof that $\|\cdot\|_{\varphi^+}$ is a norm on $R^n$ follows in the same way as
in the weighted L1 case; see the proof of Theorem 1.3.2 and Exercise 1.12.23.
The gradient function associated with this norm is
$$ T_{\varphi^+}(\theta) = \sum_{i=1}^{n} a^+(R|X_i - \theta|)\mbox{sgn}(X_i - \theta) \, . \qquad (1.8.2) $$

Note that it reduces to the L1 norm if ϕ+ (u) ≡ 1 and the weighted L1 ,


Wilcoxon signed-rank, norm if ϕ+ (u) = u. A family of simple score functions


between the weighted $L_1$ and the $L_1$ are of the form
$$ \varphi_c^+(u) = \left\{ \begin{array}{ll} u & 0 < u < c \\ c & c \leq u < 1 \end{array} \right. \, , \qquad (1.8.3) $$
where the parameter c is between 0 and 1. These scores were proposed by
Policello and Hettmansperger (1976); see, also, Hogg (1974). The frequently
used normal scores are generated by the score function,
 
$$ \varphi_\Phi^+(u) = \Phi^{-1}\left(\frac{u + 1}{2}\right) \, , \qquad (1.8.4) $$
where Φ is the standard normal distribution function. Note that ϕ+ Φ (u) is the
inverse cdf (or quantile function) of the absolute value of a standard normal
random variable. The normal scores were originally proposed by Fraser (1957).
For the location model (1.2.1), the estimate of θ based on the norm (1.8.1)
is the value of θ which minimizes the distance kX − 1θkϕ+ or equivalently
solves the equation
$$ T_{\varphi^+}(\theta) \doteq 0 \, . \qquad (1.8.5) $$
A simple tracing algorithm suffices to compute $\hat{\theta}$. As Exercise 1.12.18 shows,
$T_{\varphi^+}(\theta)$ is a decreasing step function of $\theta$ which steps down only at the Walsh
averages. So first sort the Walsh averages. Next select a starting value $\hat{\theta}^{(0)}$,
such as the median of the Walsh averages, which corresponds to the signed-rank
Wilcoxon scores. Then proceed through the sorted Walsh averages to the left or
right, depending on whether $T_{\varphi^+}(\hat{\theta}^{(0)})$ is negative or positive. The algorithm
continues until the sign of $T_{\varphi^+}(\theta)$ changes; a sketch is given below. This is the
algorithm behind the Robnp function onesampr, which solves equation (1.8.5)
for general score functions; see Exercise 1.12.34. Also, the linear searches
discussed in Chapter 3, Section 3.7.3, can be used to compute $\hat{\theta}$.
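A minimal sketch of this idea follows (base R; this is not the Robnp onesampr
code, the function names are ours, and for simplicity it bisects over the sorted
Walsh averages rather than tracing out from a starting value).

tphi <- function(theta, x, phi.plus) {
  n <- length(x)
  sum(phi.plus(rank(abs(x - theta))/(n + 1))*sign(x - theta))
}
onesamp.scores <- function(x, phi.plus = function(u) u) {
  s <- outer(x, x, "+")/2
  wa <- sort(s[lower.tri(s, diag = TRUE)])
  lo <- 1; hi <- length(wa)   # for distinct data T > 0 at wa[1], < 0 at the end
  while (hi - lo > 1) {       # T decreases, stepping only at the Walsh averages
    m <- (lo + hi) %/% 2
    if (tphi(wa[m], x, phi.plus) > 0) lo <- m else hi <- m
  }
  (wa[lo] + wa[hi])/2         # the step where the sign of T_phi+ changes
}
set.seed(2); x <- rnorm(20)
onesamp.scores(x)                                 # Wilcoxon scores, phi+(u) = u
onesamp.scores(x, function(u) qnorm((u + 1)/2))   # normal scores, (1.8.4)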
To determine the corresponding functional, note that we can write R|Xi −
θ| = #j {θ − |Xi − θ| ≤ Xj ≤ |Xi − θ| + θ}. Let Hn denote the empirical
distribution function of the sample X1 , . . . , Xn and let Hn− denote the left
limit of $H_n$. We can then write the defining equation of $\hat{\theta}$ as
$$ \int \varphi^+(H_n(|x - \theta| + \theta) - H_n^-(\theta - |x - \theta|))\mbox{sgn}(x - \theta)\,dH_n(x) = 0 \, , $$
which converges to
$$ \delta(\theta) = \int_{-\infty}^{\infty} \varphi^+(H(|x - \theta| + \theta) - H(\theta - |x - \theta|))\mbox{sgn}(x - \theta)\,dH(x) = 0 \, . \qquad (1.8.6) $$

For convenience, a second representation of δ(θ) can be obtained if we extend


$\varphi^+(u)$ to the interval $(-1, 0)$ as follows:
$$ \varphi^+(t) = -\varphi^+(-t) \, , \quad \mbox{for } -1 < t < 0 \, . \qquad (1.8.7) $$


Using this extension, the functional $\theta = T_\varphi(H)$ is the solution of
$$ \delta(\theta) = \int_{-\infty}^{\infty} \varphi^+(H(x) - H(2\theta - x))\,dH(x) = 0 \, . \qquad (1.8.8) $$
Compare expressions (1.8.8) and (1.3.26).
The level $\alpha$ test of the hypotheses (1.3.6) based on $T_{\varphi^+}(0)$ is
$$ \mbox{Reject } H_0 \mbox{ in favor of } H_A \mbox{, if } |T_{\varphi^+}(0)| \geq c \, , \qquad (1.8.9) $$
where $c$ solves $P_0[|T_{\varphi^+}(0)| \geq c] = \alpha$. We briefly develop the statistical and
robustness properties of this test and the estimator $\hat{\theta}_{\varphi^+}$ in the next two
subsections.

1.8.1 Null Properties of the Test


For this subsection on null properties and the following subsection on efficiency
properties of the test (1.8.9), we assume that the sample X1 , . . . , Xn follows the
symmetric location model (1.7.3), with common symmetric density function
h(x) = f (x − θ), where f (x) is symmetric about 0. Let H(x) denote the
distribution function associated with h(x).
As in Section 1.7.1, we can express $T_{\varphi^+}(0)$ in terms of the anti-ranks as
$$ T_{\varphi^+}(0) = \sum a^+(R(|X_i|))\mbox{sgn}(X_i) = \sum a^+(j)\mbox{sgn}(X_{D_j}) = \sum a^+(j)W_j \, ; \qquad (1.8.10) $$
see the corresponding expression (1.3.20) for the weighted $L_1$ norm. Recall
that under $H_0$ and the symmetry of $h(x)$, the variables $W_1, \ldots, W_n$ are iid
with $P[W_i = 1] = P[W_i = -1] = 1/2$ (Lemma 1.7.2). Thus we immediately
have that $T_{\varphi^+}(0)$ is distribution free under $H_0$ with mean and variance
$$ E_0[T_{\varphi^+}(0)] = 0 \, , \qquad (1.8.11) $$
$$ \mbox{Var}_0[T_{\varphi^+}(0)] = \sum_{i=1}^{n} a^{+2}(i) \, . \qquad (1.8.12) $$

Tables can be constructed for the null distribution of Tϕ+ (0) from which critical
values, c, can be obtained to complete the test described in (1.8.9).
For the asymptotic null distribution of $T_{\varphi^+}(0)$, the following additional
assumption on the scores is sufficient:
$$ \frac{\max_j a^{+2}(j)}{\sum_{i=1}^{n} a^{+2}(i)} \rightarrow 0 \, . \qquad (1.8.13) $$
Because $\varphi^+$ is square integrable, we have
$$ \frac{1}{n}\sum_{i=1}^{n} a^{+2}(i) \rightarrow \sigma^2_{\varphi^+} = \int_0^1 (\varphi^+(u))^2\,du \, , \quad 0 < \sigma^2_{\varphi^+} < \infty \, ; \qquad (1.8.14) $$


i.e., the left side is a Riemann sum of the integral. Under these assumptions
and the symmetric location model, Corollary A.1.1 of the Appendix can be
used to show that the null distribution of Tϕ+ (0) is asymptotically normal;
see, also, Exercise 1.12.16. Hence, an asymptotic level $\alpha$ test is
$$ \mbox{Reject } H_0 \mbox{ in favor of } H_A \mbox{, if } \left|\frac{T_{\varphi^+}(0)}{\sqrt{n}\,\sigma_{\varphi^+}}\right| \geq z_{\alpha/2} \, . \qquad (1.8.15) $$
An approximate $(1 - \alpha)100\%$ confidence interval for $\theta$ based on the process
$T_{\varphi^+}(\theta)$ is the interval $(\hat{\theta}_{\varphi^+,L}, \hat{\theta}_{\varphi^+,U})$ such that
$$ T_{\varphi^+}(\hat{\theta}_{\varphi^+,L}) = z_{\alpha/2}\sqrt{n}\,\sigma_{\varphi^+} \quad \mbox{and} \quad T_{\varphi^+}(\hat{\theta}_{\varphi^+,U}) = -z_{\alpha/2}\sqrt{n}\,\sigma_{\varphi^+} \, ; \qquad (1.8.16) $$
see (1.5.27). These equations can be solved by the simple tracing algorithm
discussed immediately following expression (1.8.5).

1.8.2 Efficiency and Robustness Properties


We derive the efficiency properties of the analysis described above by estab-
lishing the four conditions of Definition 1.5.3 to show that the process $T_{\varphi^+}(\theta)$
is Pitman Regular. Assume that $\varphi^+(u)$ is differentiable. First define the
quantity $\gamma_h$ as
$$ \gamma_h = \int_0^1 \varphi^+(u)\varphi_h^+(u)\,du \, , \qquad (1.8.17) $$
where
$$ \varphi_h^+(u) = -\frac{h'\left(H^{-1}\left(\frac{u+1}{2}\right)\right)}{h\left(H^{-1}\left(\frac{u+1}{2}\right)\right)} \, . \qquad (1.8.18) $$
As discussed below, $\varphi_h^+(u)$ is called the optimal score function. We assume
that our scores are such that $\gamma_h > 0$.
Since it is the negative of a gradient of a norm, $T_{\varphi^+}(\theta)$ is nonincreasing
in $\theta$; hence, the first condition, (1.5.7), holds. Let $\overline{T}_{\varphi^+}(0) = T_{\varphi^+}(0)/n$ and
consider
$$ \mu_{\varphi^+}(\theta) = E_\theta[\overline{T}_{\varphi^+}(0)] = E_0[\overline{T}_{\varphi^+}(-\theta)] \, . $$
Note that $\overline{T}_{\varphi^+}(-\theta)$ converges in probability to $\delta(-\theta)$ in (1.8.8). Hence,
$\mu_{\varphi^+}(\theta) = \delta(-\theta)$, where in (1.8.8) $H$ is a distribution function with point of
symmetry at 0, without loss of generality. If we differentiate $\delta(-\theta)$ and set
$\theta = 0$, we get
$$ \mu'_{\varphi^+}(0) = 2\int_{-\infty}^{\infty} \varphi^{+\prime}(2H(x) - 1)h(x)\,dH(x)
= 4\int_0^{\infty} \varphi^{+\prime}(2H(x) - 1)h^2(x)\,dx = \int_0^1 \varphi^+(u)\varphi_h^+(u)\,du > 0 \, , \qquad (1.8.19) $$


where the third equality in (1.8.19) follows from an integration by parts. Hence
the second Pitman Regularity condition holds.
For the third condition, (1.5.10), the asymptotic linearity for the process
$T_{\varphi^+}(\theta)$ is given in Theorem A.2.11 of the Appendix. We restate the result
here for reference:
$$ P_0\left[\sup_{\sqrt{n}|\theta| \leq B} \left|\frac{1}{\sqrt{n}}T_{\varphi^+}(\theta) - \frac{1}{\sqrt{n}}T_{\varphi^+}(0) + \theta\gamma_h\right| \geq \epsilon\right] \rightarrow 0 \, , \qquad (1.8.20) $$
for all $\epsilon > 0$ and all $B > 0$. Finally the fourth condition, (1.5.11), concerns
the asymptotic null distribution, which was discussed above. The null variance
of $T_{\varphi^+}(0)/\sqrt{n}$ is given by expression (1.8.12). Therefore the process $T_{\varphi^+}(\theta)$ is
Pitman Regular with efficacy given by
$$ c_{\varphi^+} = \frac{\int_0^1 \varphi^+(u)\varphi_h^+(u)\,du}{\sqrt{\int_0^1 (\varphi^+(u))^2\,du}} = \frac{2\int_{-\infty}^{\infty} \varphi^{+\prime}(2H(x) - 1)h^2(x)\,dx}{\sqrt{\int_0^1 (\varphi^+(u))^2\,du}} \, . \qquad (1.8.21) $$

As our first result, we obtain the asymptotic power lemma for the process
$T_{\varphi^+}(\theta)$. This, of course, follows immediately from Theorem 1.5.8, so we state
it as a corollary.

Corollary 1.8.1. Under the symmetric location model,
$$ P_{\theta_n}\left[\frac{T_{\varphi^+}(0)}{\sqrt{n}\,\sigma_{\varphi^+}} \geq z_\alpha\right] \rightarrow 1 - \Phi(z_\alpha - \theta^* c_{\varphi^+}) \, , \qquad (1.8.22) $$
for the sequence of hypotheses
$$ H_0: \theta = 0 \quad \mbox{versus} \quad H_{An}: \theta = \theta_n = \frac{\theta^*}{\sqrt{n}} \, , \mbox{ for } \theta^* > 0 \, . $$

Based on Pitman Regularity, the asymptotic distribution of the estimate
$\hat{\theta}_{\varphi^+}$ is
$$ \sqrt{n}(\hat{\theta}_{\varphi^+} - \theta) \stackrel{D}{\rightarrow} N(0, \tau^2_{\varphi^+}) \, , \qquad (1.8.23) $$
where the scale parameter $\tau_{\varphi^+}$ is defined by the reciprocal of (1.8.21),
$$ \tau_{\varphi^+} = c^{-1}_{\varphi^+} = \frac{\sigma_{\varphi^+}}{\int_0^1 \varphi^+(u)\varphi_h^+(u)\,du} \, . \qquad (1.8.24) $$
Using the general result of Theorem 1.5.9, the length of the confidence
interval for $\theta$, (1.8.16), can be used to obtain a consistent estimate of $\tau_{\varphi^+}$.
This in turn can be used to obtain a consistent estimate of the standard error
of $\hat{\theta}_{\varphi^+}$; see Exercise 1.12.19.


The asymptotic relative efficiency between two estimates or two tests based
on score functions $\varphi_1^+(u)$ and $\varphi_2^+(u)$ is the ratio
$$ e(\varphi_1^+, \varphi_2^+) = \frac{c^2_{\varphi_1^+}}{c^2_{\varphi_2^+}} = \frac{\tau^2_{\varphi_2^+}}{\tau^2_{\varphi_1^+}} \, . \qquad (1.8.25) $$
This can be used to compare different tests. For a specific distribution we can
determine the optimum scores. Such a score should make the scale parameter
$\tau_{\varphi^+}$ as small as possible. This scale parameter can be written as
$$ c_{\varphi^+} = \tau^{-1}_{\varphi^+} = \left\{ \frac{\int_0^1 \varphi^+(u)\varphi_h^+(u)\,du}{\sigma_{\varphi^+}\sqrt{\int_0^1 (\varphi_h^+(u))^2\,du}} \right\} \sqrt{\int_0^1 (\varphi_h^+(u))^2\,du} \, . \qquad (1.8.26) $$

The quantity in brackets is a correlation coefficient; hence, to minimize the
scale parameter $\tau_{\varphi^+}$, we need to maximize the correlation coefficient, which
can be accomplished by selecting the optimal score function given by
$$ \varphi^+(u) = \varphi_h^+(u) \, , $$
where $\varphi_h^+(u)$ is given by expression (1.8.18). The quantity $\sqrt{\int_0^1 (\varphi_h^+(u))^2\,du}$
is the square root of Fisher information; see Exercise 1.12.24. Therefore for
this choice of scores the estimate $\hat{\theta}_{\varphi_h^+}$ is asymptotically efficient. This is
the reason for calling the score function $\varphi_h^+$ the optimal score function.
It is shown in Exercise 1.12.25 that the optimal scores are the normal
scores if h(x) is a normal density, the Wilcoxon weighted L1 scores if h(x) is a
logistic density, and the L1 scores if h(x) is a double exponential density. It is
further shown that the scores generated by (1.8.3) are optimal for symmetric
densities with a logistic center and exponential tails.
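The first two of these facts are easy to check numerically; a small sketch in
base R, evaluating $-h'/h$ at the quantiles in (1.8.18):

u <- seq(0.1, 0.9, by = 0.2)
xl <- qlogis((u + 1)/2)          # H^{-1}((u+1)/2) for the logistic
(1 - exp(-xl))/(1 + exp(-xl))    # -h'/h at xl: returns u, the Wilcoxon scores
qnorm((u + 1)/2)                 # the normal case yields the scores (1.8.4)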
From Exercise 1.12.25, the efficiency of the normal scores methods relative
to the least squares methods is
$$ e(NS, LS) = \left(\int_{-\infty}^{\infty} \frac{f^2(x)}{\phi(\Phi^{-1}(F(x)))}\,dx\right)^2 \, , \qquad (1.8.27) $$
where $F \in F_s$, the family of symmetric distributions with positive, finite
Fisher information, and $\phi = \Phi'$ is the $N(0, 1)$ pdf.
We now prove a result similar to Theorem 1.7.4. We prove that the normal
scores methods always have efficiency at least equal to 1 relative to the LS
methods. Further, it is only equal to 1 at the normal distribution. The result
was first proved by Chernoff and Savage (1958); however, the proof presented
below is due to Gastwirth and Wolff (1968).


Theorem 1.8.1. Let $X_1, \ldots, X_n$ be a random sample from $F \in F_s$. Then
$$ \inf_{F_s} e(NS, LS) = 1 \, , \qquad (1.8.28) $$
and it is equal to 1 only at the normal distribution.

Proof: If $\sigma_f^2 = \infty$ then $e(NS, LS) > 1$; hence, we suppose that $\sigma_f^2 = 1$. Let
$e = e(NS, LS)$. Then from (1.8.27) we can write
$$ \sqrt{e} = E\left[\frac{f(X)}{\phi(\Phi^{-1}(F(X)))}\right] = E\left[\frac{1}{\phi(\Phi^{-1}(F(X)))/f(X)}\right] \, . $$
Applying Jensen's inequality to the convex function $h(x) = 1/x$, we have
$$ \sqrt{e} \geq \frac{1}{E[\phi(\Phi^{-1}(F(X)))/f(X)]} \, . $$
Hence,
$$ \frac{1}{\sqrt{e}} \leq E\left[\frac{\phi(\Phi^{-1}(F(X)))}{f(X)}\right] = \int \phi(\Phi^{-1}(F(x)))\,dx \, . $$
We now integrate by parts, using $u = \phi(\Phi^{-1}(F(x)))$. Then, since
$\phi'(x)/\phi(x) = -x$, we have $du = \phi'(\Phi^{-1}(F(x)))f(x)\,dx/\phi(\Phi^{-1}(F(x))) =
-\Phi^{-1}(F(x))f(x)\,dx$. Hence, with $dv = dx$, we have
$$ \int_{-\infty}^{\infty} \phi(\Phi^{-1}(F(x)))\,dx = \left[x\phi(\Phi^{-1}(F(x)))\right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} x\Phi^{-1}(F(x))f(x)\,dx \, . \qquad (1.8.29) $$
Now transform $x\phi(\Phi^{-1}(F(x)))$ into $F^{-1}(\Phi(w))\phi(w)$ by first letting $t = F(x)$
and then $w = \Phi^{-1}(t)$. The integral $\int F^{-1}(\Phi(w))\phi(w)\,dw = \int xf(x)\,dx < \infty$;
hence the limit of the integrand must be 0 as $x \rightarrow \pm\infty$. This implies that the
first term on the right side of (1.8.29) is 0. Hence, applying the Cauchy-Schwarz
inequality,
$$ \frac{1}{\sqrt{e}} \leq \int_{-\infty}^{\infty} x\Phi^{-1}(F(x))f(x)\,dx
= \int_{-\infty}^{\infty} x\sqrt{f(x)}\,\Phi^{-1}(F(x))\sqrt{f(x)}\,dx
\leq \left(\int_{-\infty}^{\infty} x^2 f(x)\,dx \int_{-\infty}^{\infty} [\Phi^{-1}(F(x))]^2 f(x)\,dx\right)^{1/2} = 1 \, ,
$$



since $\int x^2 f(x)\,dx = 1$ and $\int [\Phi^{-1}(F(x))]^2 f(x)\,dx = \int x^2\phi(x)\,dx = 1$. Hence
$e^{1/2} \geq 1$ and $e \geq 1$, which completes the proof. It should be noted that the
inequality is strict except at the normal distribution. Hence the normal scores
are strictly more efficient than the LS procedures except at the normal model,
where the asymptotic relative efficiency is 1.
The influence function for $\hat{\theta}_{\varphi^+}$ is derived in Section A.5 of the Appendix.
It is given by
$$ \Omega(t, \hat{\theta}_{\varphi^+}) = \frac{\varphi^+(2H(t) - 1)}{4\int_0^{\infty} \varphi^{+\prime}(2H(x) - 1)h^2(x)\,dx} \, . \qquad (1.8.30) $$
Note, also, that $E[\Omega^2(X, \hat{\theta}_{\varphi^+})] = \tau^2_{\varphi^+}$, as a check on the asymptotic
distribution of $\hat{\theta}_{\varphi^+}$. Note that the influence function is bounded provided the
score function is bounded. Thus the estimates based on the scores discussed in
the last paragraph are all robust except for the normal scores. In the case of
the normal scores, when $H(t) = \Phi(t)$, the influence function is $\Omega(t) = \Phi^{-1}(t)$;
see Exercise 1.12.26.
The asymptotic breakdown of the estimate $\hat{\theta}_{\varphi^+}$ is $\epsilon^*$ given by
$$ \int_0^{1 - \epsilon^*} \varphi^+(u)\,du = \frac{1}{2}\int_0^1 \varphi^+(u)\,du \, . \qquad (1.8.31) $$
We provide a heuristic argument for (1.8.31); for a rigorous development see
Huber (1981). Recall Definition 1.6.1. The idea is to corrupt enough data so
that the estimating equation, (1.8.5), no longer has a solution. Suppose that
$[\epsilon n]$ observations are corrupted, where $[\cdot]$ denotes the greatest integer function.
Push the corrupted observations out towards $+\infty$ so that
$$ \sum_{i=[(1-\epsilon)n]+1}^{n} a^+(R(|X_i - \theta|))\mbox{sgn}(X_i - \theta) = \sum_{i=[(1-\epsilon)n]+1}^{n} a^+(i) \, . $$
This restrains the estimating function from crossing the horizontal axis
provided
$$ -\sum_{i=1}^{[(1-\epsilon)n]} a^+(i) + \sum_{i=[(1-\epsilon)n]+1}^{n} a^+(i) > 0 \, . $$
Replacing the sums by integrals in the limit yields
$$ \int_{1-\epsilon}^{1} \varphi^+(u)\,du > \int_0^{1-\epsilon} \varphi^+(u)\,du \, . $$
Now use the fact that
$$ \int_0^{1-\epsilon} \varphi^+(u)\,du + \int_{1-\epsilon}^{1} \varphi^+(u)\,du = \int_0^1 \varphi^+(u)\,du $$
and that we want the smallest possible $\epsilon$ to get (1.8.31).


Table 1.8.1: Empirical AREs Based on n = 30 and 10,000 Simulations


Estimators Normal Contaminated Normal
NS, LS 0.983 1.035
Wil, LS 0.948 1.007
NS, WIL 1.037 1.028

Example 1.8.1 (Breakdowns of Estimates Based on Wilcoxon and Normal
Scores). For $\hat{\theta} = \mbox{med}(X_i + X_j)/2$, $\varphi^+(u) = u$ and it follows at once that
$\epsilon^* = 1 - (1/\sqrt{2}) \doteq .293$. For the estimate based on the normal scores, where
$\varphi^+(u)$ is given by (1.8.4), expression (1.8.31) becomes
$$ \exp\left\{-\frac{1}{2}\left[\Phi^{-1}\left(1 - \frac{\epsilon}{2}\right)\right]^2\right\} = \frac{1}{2} \, , $$
and $\epsilon^* = 2(1 - \Phi(\sqrt{\log 4})) \doteq .239$. Hence we have the unusual situation
that the estimate based on the normal scores has positive breakdown but an
unbounded influence curve.
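Both breakdown values can also be obtained by solving (1.8.31) numerically;
a sketch in base R (the function name bd is ours):

bd <- function(phi.plus) {
  total <- integrate(phi.plus, 0, 1)$value   # finite: phi+ is square integrable
  uniroot(function(e) integrate(phi.plus, 0, 1 - e)$value - total/2,
          c(1e-3, 0.5))$root
}
bd(function(u) u)                     # 1 - 1/sqrt(2) = 0.293
bd(function(u) qnorm((u + 1)/2))      # 2*(1 - pnorm(sqrt(log(4)))) = 0.239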

Example 1.8.2 (Small Sample Empirical AREs of Estimator Based on Nor-


mal Scores). As discussed above, the ARE between the normal scores estima-
tor and the sample mean is 1 at the normal distribution. This is an asymptotic
result. To answer the question about this efficiency at small samples, we con-
ducted a small simulation study. We set the sample size at n = 30 and ran
10,000 simulations from a normal distribution. We also selected the contam-
inated normal distribution with ǫ = 0.01 and σc = 3, which is a very mild
contaminated distribution. We consider the three estimators: rank-based esti-
mator based on normal scores (NS), rank-based estimator based on Wilcoxon
scores (WIL), and the sample mean (LS). We used the Robnp command
onesampr(x,score=phinscp,grad=spnsc,maktable=F) to compute the nor-
mal scores estimator; see Exercise 1.12.30. As our empirical ARE we used
the ratios of empirical mean square errors of the three estimators. Table 1.8.1
summarizes the results. The empirical AREs for the NS and WIL estimators,
at the normal, are close to their asymptotic counterparts. Note that the NS
estimator results in only a loss of less than 2% efficiency over LS. For this
small amount of contamination the NS estimator dominates the LS estimator.
It also dominates the Wilcoxon estimator. In Exercise 1.12.30, the reader is
asked to extend this study to other situations.

Example 1.8.3 (Shoshoni Rectangles, continued). The next display shows


the normal scores analysis of the Shoshoni Rectangles Data; see Example 1.4.2.
We conducted the same analysis as we did for the sign test and traditional
t-test discussed in Example 1.4.2. Note that the call to the Robnp function


onesampr with the values score=phinscp,grad=spnsc computes the normal


scores analysis.

> onesampr(x,theta0=.618,alpha=.10,score=phinscp,grad=spnsc)

Test of Theta = 0.618 Alternative selected is 0


Test Stat. Tphi+ is 7.809417 Standardized (z) Test-Stat. 1.870514
and p-value 0.06141252

Estimate 0.6485 SE is 0.02502799


90 % Confidence Interval is ( 0.61975 , 0.7 )
Estimate of the scale parameter tau 0.1119286

While not as sensitive to the outliers as the traditional analysis, the outliers
still had some influence on the normal scores analysis. The normal scores test
rejects the null hypothesis at level 0.06 while the 90% confidence interval just
misses the value 0.618.

1.9 Ranked Set Sampling

In this section we discuss an alternative to simple random sampling (SRS)


called ranked set sampling (RSS). This method of data collection is useful
when measurements are destructive or expensive while ranking of the data is
relatively easy. Johnson, Nussbaum, Patil, and Ross (1996) give an interest-
ing application to environmental sampling. As a simple example consider the
problem of estimating the mean volume of trees in a forest. To measure the
volume, we must destroy the tree. On the other hand, an expert may well
be able to rank the trees by volume in a small sample. The idea is to take
a sample of size k of trees and ask the expert to pick the one with smallest
volume. This tree is cut down and the volume measured and the other k − 1
trees are returned to the population for possible future selection. Then a new
sample of size k is taken and the expert identifies the second smallest which
is then cut down and measured. This is repeated until we have k measure-
ments, having looked at k 2 trees. This ends cycle 1. The measurements are
represented as x(1)1 ≤ . . . ≤ x(k)1 where the number in parentheses indicates
an order statistic and the second number indicates the cycle. We repeat the
process for n cycles to get nk measurements:


x(1)1 , . . . , x(1)n   iid h(1)(t)
x(2)1 , . . . , x(2)n   iid h(2)(t)
   ...                     ...
x(k)1 , . . . , x(k)n   iid h(k)(t)

It is important to note that all nk measurements are independent but


are identically distributed only within each row. The density function h(j) (t)
represents the pdf of the jth order statistic from a sample of size k and is
given by:

$$ h_{(j)}(t) = \frac{k!}{(j - 1)!(k - j)!} H^{j-1}(t)[1 - H(t)]^{k-j}h(t) \, . $$
We suppose the measurements are distributed as H(x) = F (x − θ) and we
wish to make a statistical inference concerning θ, such as an estimate, test, or
confidence interval. We illustrate the ideas on the L1 methods since they are
simple to work with. We also wish to compute the efficiency of the RSSL1
methods relative to the SRSL1 methods. We see that there is a substantial
increase in efficiency when using the RSS design. In particular, we compare
the RSS methods to SRS methods based on a sample of size nk. The RSS
method was first applied by McIntyre (1952) in measuring mean pasture yields.
See Hettmansperger (1995) for a development of the RSSL1 methods. The
most convenient form of the RSS sign statistic is the number of positive
measurements given by
$$ S^+_{RSS} = \sum_{j=1}^{k}\sum_{i=1}^{n} I(X_{(j)i} > 0) \, . \qquad (1.9.1) $$

Now note that $S^+_{RSS}$ can be written as $S^+_{RSS} = \sum S^+_{(j)}$, where $S^+_{(j)} =
\sum_i I(X_{(j)i} > 0)$ has a binomial distribution with parameters $n$ and $1 - H_{(j)}(0)$.
Further, $S^+_{(j)}$, $j = 1, \ldots, k$ are stochastically independent. It follows at once
that
$$ ES^+_{RSS} = n\sum_{j=1}^{k} (1 - H_{(j)}(0)) \, , \qquad (1.9.2) $$
$$ \mbox{Var}\,S^+_{RSS} = n\sum_{j=1}^{k} (1 - H_{(j)}(0))H_{(j)}(0) \, . $$

With $k$ fixed and $n \rightarrow \infty$, it follows from the independence of $S^+_{(j)}$, $j = 1, \ldots, k$,


that
$$ (nk)^{-1/2}\left\{S^+_{RSS} - n\sum_{j=1}^{k}(1 - H_{(j)}(0))\right\} \stackrel{D}{\rightarrow} Z \sim n(0, \xi^2) \, , \qquad (1.9.3) $$
and the asymptotic variance is given by
$$ \xi^2 = k^{-1}\sum_{j=1}^{k} [1 - H_{(j)}(0)]H_{(j)}(0) = \frac{1}{4} - k^{-1}\sum_{j=1}^{k}\left(H_{(j)}(0) - \frac{1}{2}\right)^2 \, . \qquad (1.9.4) $$
It is convenient to introduce a parameter $\delta^2 = 1 - (4/k)\sum(H_{(j)}(0) - 1/2)^2$;
then $\xi^2 = \delta^2/4$. The reader is asked to prove the second equality above in
Exercise 1.12.27. Using the formulas for the pdfs of the order statistics it is
straightforward to verify that
$$ h(t) = k^{-1}\sum_{j=1}^{k} h_{(j)}(t) \quad \mbox{and} \quad H(t) = k^{-1}\sum_{j=1}^{k} H_{(j)}(t) \, . $$

We now consider testing $H_0: \theta = 0$ versus $H_A: \theta \neq 0$. The following theorem
provides the mean and variance of the RSS sign statistic under the null
hypothesis.

Theorem 1.9.1. Under the assumption that $H_0: \theta = 0$ is true, $F(0) = 1/2$,
$$ F_{(j)}(0) = \frac{k!}{(j - 1)!(k - j)!} \int_0^{1/2} u^{j-1}(1 - u)^{k-j}\,du \, , $$
and
$$ ES^+_{RSS} = nk/2 \, , \quad \mbox{and} \quad \mbox{Var}\,S^+_{RSS} = nk\left[\frac{1}{4} - k^{-1}\sum\left(F_{(j)}(0) - \frac{1}{2}\right)^2\right] \, . $$
Proof: Use the fact that $k^{-1}\sum F_{(j)}(0) = F(0) = 1/2$, and the expectation
formula follows at once. Note that
$$ F_{(j)}(0) = \frac{k!}{(j - 1)!(k - j)!} \int_{-\infty}^{0} F(t)^{j-1}(1 - F(t))^{k-j}f(t)\,dt \, , $$
and then make the change of variable $u = F(t)$.


The variance of $S^+_{RSS}$ does not depend on $H$, as expected; however, its
computation requires the evaluation of the incomplete beta integral. Table
1.9.1 provides the values of $F_{(j)}(0)$ under $H_0: \theta = 0$. The bottom line of the
table provides the values of $\delta^2 = 1 - (4/k)\sum(F_{(j)}(0) - 1/2)^2$, an important
parameter in assessing the gain of RSS over SRS.
Table 1.9.1: Values of F(j)(0), j = 1, . . . , k, and δ² = 1 − (4/k)Σ(F(j)(0) − 1/2)²

  k:    2     3     4     5     6     7     8     9    10
  1   .750  .875  .938  .969  .984  .992  .996  .998  .999
  2   .250  .500  .688  .813  .891  .938  .965  .981  .989
  3         .125  .313  .500  .656  .773  .856  .910  .945
  4               .063  .188  .344  .500  .637  .746  .828
  5                     .031  .109  .227  .363  .500  .623
  6                           .016  .063  .145  .254  .377
  7                                 .008  .035  .090  .172
  8                                       .004  .020  .055
  9                                             .002  .011
 10                                                   .001
  δ²  .750  .625  .547  .490  .451  .416  .393  .371  .352

We compare the SRS sign statistic $S^+_{SRS}$, based on a sample of size $nk$, to the
RSS sign statistic $S^+_{RSS}$. Note that the variance of $S^+_{SRS}$ is $nk/4$. Then the
ratio of variances is $\mbox{Var}\,S^+_{RSS}/\mbox{Var}\,S^+_{SRS} = \delta^2 = 1 - (4/k)\sum(F_{(j)}(0) - 1/2)^2$.
The reduction in variance is given in the last row of Table 1.9.1 and can be
quite large.
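Since $F_{(j)}(0)$ is an incomplete beta integral, the entries of Table 1.9.1 are a
short computation; a sketch in base R, using the fact that the $j$th uniform
order statistic has a Beta(j, k − j + 1) distribution:

delta2 <- function(k) {
  Fj0 <- pbeta(0.5, 1:k, k:1)       # F_(j)(0), j = 1, ..., k
  1 - (4/k)*sum((Fj0 - 0.5)^2)
}
round(sapply(2:10, delta2), 3)      # the bottom line of Table 1.9.1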
We next show that the parameter $\delta$ is an integral part of the efficacy of
the RSS $L_1$ methods. It is straightforward, using the methods of Section 1.5
and Example 1.5.2, to show that the RSS $L_1$ estimating function is Pitman
Regular. To compute the efficacy we first note that
$$ \bar{S}_{RSS} = (nk)^{-1}\sum_{j=1}^{k}\sum_{i=1}^{n} \mbox{sgn}(X_{(j)i}) = (nk)^{-1}[2S^+_{RSS} - nk] \, . $$

We then have at once that, under $H_0$,
$$ (nk)^{1/2}\bar{S}_{RSS} \stackrel{D}{\rightarrow} Z \sim n(0, \delta^2) \, , \qquad (1.9.5) $$
and $\mu'(0) = 2f(0)$; see Exercise 1.12.28. See Babu and Koti (1996) for a
development of the exact distribution. Hence, the efficacy of the RSS $L_1$
methods is given by
$$ c_{RSS} = \frac{2f(0)}{\delta} = \frac{2f(0)}{\{1 - (4/k)\sum_{j=1}^{k}(F_{(j)}(0) - 1/2)^2\}^{1/2}} \, . $$

We now summarize the inference methods and their efficiency in the fol-
lowing:

1. The test. Reject $H_0: \theta = 0$ in favor of $H_A: \theta > 0$ at significance level
$\alpha$ if $S^+_{RSS} \geq (nk/2) + z_\alpha\delta(nk/4)^{1/2}$ where, as usual, $1 - \Phi(z_\alpha) = \alpha$.


2. The estimate. $(nk)^{1/2}\{\mbox{med}\,X_{(j)i} - \theta\} \stackrel{D}{\rightarrow} Z \sim n(0, \delta^2/4f^2(0))$.

3. The confidence interval. Let $X^*_{(1)}, \ldots, X^*_{(nk)}$ be the ordered values
of $X_{(j)i}$, $j = 1, \ldots, k$ and $i = 1, \ldots, n$. Then $[X^*_{(m+1)}, X^*_{(nk-m)}]$ is a
$(1 - \alpha)100\%$ confidence interval for $\theta$, where $P(S^+_{RSS} \leq m) = \alpha/2$. Using
the normal approximation we have $m \doteq (nk/2) - z_{\alpha/2}\delta(nk/4)^{1/2}$.

4. Efficiency. The efficiency of the RSS methods with respect to the SRS
methods is given by $e(RSS, SRS) = c^2_{RSS}/c^2_{SRS} = \delta^{-2}$. Hence, the
reciprocal of the last line of Table 1.9.1 provides the efficiency values, and
they can be quite substantial. Recall from the discussion following Defi-
nition 1.5.5 that efficiency can be interpreted as the ratio of sample sizes
needed to achieve the same approximate variances, the same approximate
local power, and the same confidence interval length. Hence, we write
$(nk)_{RSS} \doteq \delta^2(nk)_{SRS}$. This is really the point of the RSS design.
Returning to the example of estimating the volume of wood in a forest,
if we let $k = 5$, then from Table 1.9.1, we would need to destroy and
measure only about one half as many trees using the RSS method rather
than the SRS method.
As a final note, we mention the problem of assessing the effect of imperfect
ranking. Suppose that the expert makes a mistake when asked to identify the
jth ordered value in a set of k observations. As expected, there is less gain
from using the RSS method. The interesting point is that if the expert simply
identifies the supposed jth ordered value by random guess then δ 2 = 1 and the
two sign tests have the same information; see Hettmansperger (1995) for more
details. Also, see Presnell and Bohn (1999) for a careful analysis of imperfect
ranking.

1.10 L1 Interpolated Confidence Intervals


When we construct L1 confidence intervals, we are limited in our choice of
confidence coefficients because of the discreteness of the binomial distribution.
The effect does not wear off very quickly as the sample size increases. For
example with a sample of size 50, we can have either a 93.5% or a 96.7%
confidence interval, and that is as close as we can come to 95%. In the following
discussion we provide a method to interpolate between confidence intervals.
The method is nonlinear and seems to be essentially distribution free. We begin
by presenting and illustrating the method and then derive its properties.
Suppose γ is the desired confidence coefficient. Further, suppose the fol-
lowing intervals are available from the binomial table: interval (x(k) , x(n−k+1) )


with confidence coefficient $\gamma_k$ and interval $(x_{(k+1)}, x_{(n-k)})$ with confidence
coefficient $\gamma_{k+1}$, where $\gamma_{k+1} \leq \gamma \leq \gamma_k$. Then the interpolated interval is
$[\hat{\theta}_L, \hat{\theta}_U]$,
$$ \hat{\theta}_L = (1 - \lambda)x_{(k)} + \lambda x_{(k+1)} \quad \mbox{and} \quad \hat{\theta}_U = (1 - \lambda)x_{(n-k+1)} + \lambda x_{(n-k)} \, , \qquad (1.10.1) $$
where
$$ \lambda = \frac{(n - k)I}{k + (n - 2k)I} \quad \mbox{and} \quad I = \frac{\gamma_k - \gamma}{\gamma_k - \gamma_{k+1}} \, . \qquad (1.10.2) $$
We call $I$ the interpolation factor and note that if we were using linear
interpolation then $\lambda = I$. Hence, we see that the interpolation is distinctly
nonlinear.
As a simple example we take $n = 10$ and ask for a 95% confidence interval.
For $k = 2$ we find $\gamma_k = .9786$ and $\gamma_{k+1} = .8907$. Then $I = .325$ and $\lambda = .658$.
Hence, $\hat{\theta}_L = .342x_{(2)} + .658x_{(3)}$ and $\hat{\theta}_U = .342x_{(9)} + .658x_{(8)}$. Note that linear
interpolation is almost the reverse of the recommended mixture, namely
$\lambda = I = .325$, and this can make a substantial difference in small samples.
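A direct implementation of (1.10.1) and (1.10.2) is short; the following is a
sketch in base R (not the Robnp interpci code; interp.ci is our name). For
n = 10 and γ = .95 it reproduces the (.342, .658) mixture found above.

interp.ci <- function(x, gamma = 0.95) {
  x <- sort(x); n <- length(x)
  gam <- function(k) 1 - 2*pbinom(k - 1, n, 0.5)  # coverage of (x_(k), x_(n-k+1))
  k <- max(which(sapply(1:floor(n/2), gam) >= gamma))
  I <- (gam(k) - gamma)/(gam(k) - gam(k + 1))
  lam <- (n - k)*I/(k + (n - 2*k)*I)
  c((1 - lam)*x[k] + lam*x[k + 1], (1 - lam)*x[n - k + 1] + lam*x[n - k])
}
set.seed(4); interp.ci(rnorm(10))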
The method is based on the following theorem. This theorem highlights the
nonlinear relationship between the interpolation factor and λ. After proving
the theorem, we develop an approximate solution and then show that it works
in practice.

Theorem 1.10.1. The interpolation factor $I$ is given by
$$ I = \frac{\gamma_k - \gamma}{\gamma_k - \gamma_{k+1}} = 1 - (n - k)2^n\int_0^{\infty} F^k\left(\frac{-\lambda}{1 - \lambda}y\right)(1 - F(y))^{n-k-1}f(y)\,dy \, . $$

Proof: Without loss of generality we assume that $\theta$ is 0. Then we can write
$$ \gamma_k = P_0(x_{(k)} \leq 0 \leq x_{(n-k+1)}) = P_0(k - 1 < S_1^+(0) < n - k + 1) $$
and
$$ \gamma_{k+1} = P_0(x_{(k+1)} \leq 0 \leq x_{(n-k)}) = P_0(k < S_1^+(0) < n - k) \, . $$
Taking the difference, we have, using $\binom{n}{k}$ to denote the binomial coefficient,
$$ \gamma_k - \gamma_{k+1} = P_0(S_1^+(0) = k) + P_0(S_1^+(0) = n - k) = \binom{n}{k}(1/2)^{n-1} \, . \qquad (1.10.3) $$
We now consider the lower tail probability associated with the confidence
interval. First consider
$$ P_0(X_{(k+1)} > 0) = \frac{1 - \gamma_{k+1}}{2} = \int_0^{\infty} \frac{n!}{k!(n - k - 1)!}F^k(t)(1 - F(t))^{n-k-1}f(t)\,dt
= P_0(S_1^+(0) \geq n - k) = P_0(S_1^+(0) \leq k) \, . \qquad (1.10.4) $$


Let $k_n = n!/[(k - 1)!(n - k - 1)!]$. Then the lower end of the interpolated
interval satisfies
$$ \frac{1 - \gamma}{2} = P_0((1 - \lambda)X_{(k)} + \lambda X_{(k+1)} > 0)
= k_n\int_0^{\infty}\int_{\frac{-\lambda}{1-\lambda}y}^{y} F^{k-1}(x)(1 - F(y))^{n-k-1}f(x)f(y)\,dx\,dy
= k_n\int_0^{\infty} \frac{1}{k}\left[F^k(y) - F^k\left(\frac{-\lambda}{1 - \lambda}y\right)\right](1 - F(y))^{n-k-1}f(y)\,dy
= \frac{1 - \gamma_{k+1}}{2} - \frac{k_n}{k}\int_0^{\infty} F^k\left(\frac{-\lambda}{1 - \lambda}y\right)(1 - F(y))^{n-k-1}f(y)\,dy \, . $$
Use (1.10.4) in the last line above. Now, with (1.10.3), substitute into the
formula for the interpolation factor and the result follows.
Clearly, not only is the relationship between I and λ nonlinear but it also
depends on the underlying distribution F . Hence, the interpolated interval is
not distribution-free. There is one interesting case in which we have a distri-
bution free interval given in the following corollary.

Corollary 1.10.1. Suppose F is the cdf of a symmetric distribution. Then


I(1/2) = k/n, where we write I(λ) to denote the dependence of the interpola-
tion factor on λ.

This shows that when we sample from a symmetric distribution, the in-
terval that lies half between the available intervals does not depend on the
underlying distribution. Other interpolated intervals are not distribution free.
Our next theorem shows how to approximate the solution and the solution
is essentially distribution free. We show by example that the approximate
solution works in many cases.

Theorem 1.10.2. The interpolation factor satisfies the approximation
$$ I(\lambda) \doteq \frac{\lambda k}{\lambda(2k - n) + n - k} \, . $$

Proof: We consider the integral
$$ \int_0^{\infty} F^k\left(\frac{-\lambda}{1 - \lambda}y\right)(1 - F(y))^{n-k-1}f(y)\,dy \, . $$
The integrand decreases rapidly for moderate powers; hence, we expand the
integrand around $y = 0$. First take logarithms; then
$$ k\log F\left(\frac{-\lambda}{1 - \lambda}y\right) = k\log F(0) - k\frac{\lambda}{1 - \lambda}\frac{f(0)}{F(0)}y + o(y) $$
and
$$ (n - k - 1)\log(1 - F(y)) = (n - k - 1)\log(1 - F(0)) - (n - k - 1)\frac{f(0)}{1 - F(0)}y + o(y) \, . $$
Substitute $r = \lambda k/(1 - \lambda)$ and $F(0) = 1 - F(0) = 1/2$ into the above equations,
and add the two equations together. Add and subtract $r\log(1/2)$, and group
terms so the right side of the second equation appears on the right side along
with $k\log(1/2) - r\log(1/2)$. Hence, we have
$$ k\log F\left(\frac{-\lambda}{1 - \lambda}y\right) + (n - k - 1)\log(1 - F(y)) = k\log(1/2) - r\log(1/2)
+ (n + r - k - 1)\log(1 - F(y)) + o(y) \, . $$
Thus, $\int_0^{\infty} F^k\left(\frac{-\lambda}{1-\lambda}y\right)(1 - F(y))^{n-k-1}f(y)\,dy$ is approximately equal to
$$ \int_0^{\infty} 2^{-(k-r)}(1 - F(y))^{n+r-k-1}f(y)\,dy = \frac{1}{2^n(n + r - k)} \, . \qquad (1.10.5) $$
Substitute this approximation into the formula for $I(\lambda)$, use $r = \lambda k/(1 - \lambda)$,
and the result follows.

Table 1.10.1: Confidence Coefficients for Interpolated Confidence Intervals in
Example 1.10.1. DE(Approx) = Double Exponential and the Approximation in
Theorem 1.10.2, U = Uniform, N = Normal, C = Cauchy, Linear = Linear
Interpolation

  λ    DE(Approx)    U      N      C     Linear
 0.1     0.976     0.977  0.976  0.976   0.970
 0.2     0.973     0.974  0.974  0.974   0.961
 0.3     0.970     0.971  0.971  0.970   0.952
 0.4     0.966     0.967  0.966  0.966   0.943
 0.5     0.961     0.961  0.961  0.961   0.935
 0.6     0.955     0.954  0.954  0.954   0.926
 0.7     0.946     0.944  0.944  0.946   0.917
 0.8     0.935     0.930  0.931  0.934   0.908
 0.9     0.918     0.912  0.914  0.918   0.899
Note that the approximation agrees with Corollary 1.10.1. In addition Ex-
ercise 1.12.29 shows that the approximation formula is exact for the dou-
ble exponential (Laplace) distribution. In Table 1.10.1 we show how well the
approximation works for several other distributions. The exact results were
obtained by numerical integration of the integral in Theorem 1.10.1. Simi-
lar close results were found for asymmetric examples. For further reading see
Hettmansperger and Sheather (1986) and Nyblom (1992).


Example 1.10.1 (Cushney-Peebles Example 1.4.1, continued). We now re-


turn to this example using it to illustrate the sign test and the L1 interpolated
confidence interval. For the computations, we use the Robnp function interpci.
We take as our location model: X1 , . . . , X10 iid from
H(x) = F (x − θ), F and θ both unknown, along with the L1 norm. We have
already seen that the estimate of θ is the sample median equal to 1.3. Besides
obtaining an interpolated 95% confidence interval, we test H0 : θ = 0 versus
HA : θ ≠ 0. Assuming that the sample is in the vector x, the output for a test
and a 95% interpolated confidence interval is:

> tm=interpci(.05,x)

Estimation of Median
Sample Median is 1.3
Confidence Interval ( 1 , 1.8 ) 89.0625 %
Confidence Interval ( 0.9315 , 2.0054 ) 95 % Interpolated
Confidence Interval ( 0.8 , 2.4 ) 97.8516 %

Results for the Sign Test


Test of theta = 0 versus theta not equal to 0
Test stat. S is 9 p-value 0.00390625

Note the p-value of the test is .0039 and we would easily reject the null
hypothesis at any reasonable level of significance. The interpolated 95% con-
fidence interval for θ shows the reasonable set of values of θ to be between
.9315 and 2.0054, given the level of confidence.

1.11 Two-Sample Analysis


We now propose a simple way to extend our one-sample methods to the com-
parison of two samples. Suppose X1 , . . . , Xm are iid F (x − θx ) and Y1 , . . . , Yn
are iid F (y −θy ) and the two samples are independent. Let ∆ = θy −θx and we
wish to test the null hypothesis H0 : ∆ = 0 versus the alternative hypothesis
HA : ∆ ≠ 0. Without loss of generality we can consider θx = 0 so that the
X sample is from a distribution with cdf F (x) and the Y sample is from a
distribution with cdf F (y − ∆).
The hypothesis testing rule that we propose is:

1. Construct L1 confidence intervals [XL , XU ] and [YL , YU ].

2. Reject H0 if the intervals are disjoint.


If we consider the confidence interval as a set of reasonable values for the


parameter, given the confidence coefficient, then we reject the null hypothesis
when the respective reasonable values are disjoint. We must determine the
significance level for the test. In particular, for given γx and γy , what is the
value of αc , the significance level for the comparison? Perhaps more pertinent:
Given αc , what values should we choose for γx and γy ? Below we show that
for a broad range of sample sizes,

Comparing two 84% CIs yields a 5% test of $H_0: \Delta = 0$ versus $H_A: \Delta \neq 0$,
\qquad (1.11.1)
where CI denotes confidence interval. In the following theorem we provide the
relationship between αc and the pair γx , γy . Define zx by γx = 2Φ(zx ) − 1 and
likewise zy by γy = 2Φ(zy ) − 1.
Theorem 1.11.1. Suppose $m, n \rightarrow \infty$ so that $m/N \rightarrow \lambda$, $0 < \lambda < 1$,
$N = m + n$. Then under the null hypothesis $H_0: \Delta = 0$,
$$ \alpha_c = P(X_L > Y_U) + P(Y_L > X_U) \rightarrow 2\Phi[-(1 - \lambda)^{1/2}z_x - \lambda^{1/2}z_y] \, . $$
Proof: We consider $\alpha_c/2 = P(X_L > Y_U)$. From (1.5.23) we have
$$ X_L \doteq \frac{S_x(0)}{m2f(0)} - \frac{z_x}{m^{1/2}2f(0)} \quad \mbox{and} \quad Y_U \doteq \frac{S_y(0)}{n2f(0)} + \frac{z_y}{n^{1/2}2f(0)} \, . $$
Since $m/N \rightarrow \lambda$,
$$ N^{1/2}X_L \stackrel{D}{\rightarrow} \lambda^{-1/2}Z_1 \, , \quad Z_1 \sim n(-z_x/2f(0), 1/4f^2(0)) \, , $$
and
$$ N^{1/2}Y_U \stackrel{D}{\rightarrow} (1 - \lambda)^{-1/2}Z_2 \, , \quad Z_2 \sim n(z_y/2f(0), 1/4f^2(0)) \, . $$
Now $\alpha_c/2 = P(X_L > Y_U) = P(N^{1/2}(Y_U - X_L) < 0)$ and $X_L$, $Y_U$ are
independent; hence
$$ N^{1/2}(Y_U - X_L) \stackrel{D}{\rightarrow} (1 - \lambda)^{-1/2}Z_2 - \lambda^{-1/2}Z_1 \, , $$
and $(1 - \lambda)^{-1/2}Z_2 - \lambda^{-1/2}Z_1$ has the distribution
$$ N\left(\frac{1}{2f(0)}\left[\frac{z_x}{\lambda^{1/2}} + \frac{z_y}{(1 - \lambda)^{1/2}}\right], \frac{1}{4f^2(0)}\left[\frac{1}{\lambda} + \frac{1}{1 - \lambda}\right]\right) \, . $$
It then follows that
$$ P(N^{1/2}(Y_U - X_L) < 0) \rightarrow \Phi\left(-\left[\frac{z_x}{\lambda^{1/2}} + \frac{z_y}{(1 - \lambda)^{1/2}}\right]\Big/\left[\frac{1}{\lambda(1 - \lambda)}\right]^{1/2}\right) \, , $$
which, when simplified, yields the result in the statement of the theorem.


Table 1.11.1: Confidence Coefficients for 5% Comparison

λ = m/N .500 .550 .600 .650 .750


m/n 1.00 1.22 1.50 1.86 3.00
zx = zy 1.39 1.39 1.39 1.40 1.43
γx = γy .84 .84 .84 .85 .86

To illustrate, we take equal sample sizes so that $\lambda = 1/2$, and we take
$z_x = z_y = 2$. Then we have two 95% confidence intervals and we reject the null
hypothesis $H_0: \Delta = 0$ if the two intervals are disjoint. The above theorem says
that the significance level is approximately equal to $\alpha_c = 2\Phi(-2.83) = .0046$.
This is a very small level, and it is difficult to reject the null hypothesis. We
might prefer a significance level of, say, $\alpha_c = .05$. We then must find $z_x$ and
$z_y$ so that $.05 = 2\Phi(-(.5)^{1/2}(z_x + z_y))$. Note that now we have an infinite
number of solutions. If we impose the reasonable condition that the two
confidence coefficients are the same, then we require that $z_x = z_y = z$. Then we
have the equation $.025 = \Phi(-(2)^{1/2}z)$, and hence $-1.96 = -(2)^{1/2}z$. So
$z = 1.39$ and the confidence coefficient for the two intervals is
$\gamma = \gamma_x = \gamma_y = 2\Phi(1.39) - 1 \doteq .84$. Hence, if we have equal sample sizes
and we use two 84% confidence intervals then we have a 5% two-sided comparison
of the two samples.
If we set αc = .10, this corresponds to a 5% one-sided test. This
means that we compare the two confidence intervals in the direction specified
by the alternative hypothesis. For example, if we specify ∆ = θy − θx > 0, then
we would reject the null hypothesis if the X-interval is completely below the
Y-interval. To determine the confidence coefficient of the intervals, we again
assume that the two intervals have the same confidence coefficient. Then we
must find z such that .05 = Φ(−(2)^{1/2} z); this leads to −1.645 = −(2)^{1/2} z and
z = 1.16. Hence, the confidence coefficient for the two intervals is γ = γx = γy =
2Φ(1.16) − 1 = .75. Thus, for a one-sided 5% test or a 10% two-sided test,
when you have equal sample sizes, use two 75% confidence intervals.
We must now consider what to do if the sample sizes are not equal. Let zc
be determined by αc /2 = Φ(−zc ); then, again using the same confidence
coefficient for the two intervals, z = zx = zy = zc /(λ^{1/2} + (1 − λ)^{1/2}). When
m = n, so that λ = 1 − λ = .5, we have z = zc /2^{1/2} = .707zc , and so z = 1.39
when αc = .05. We now show by example that when αc = .05, z is not sensitive
to the value of λ. Table 1.11.1 gives the relevant information. Hence, if we use
84% confidence intervals, then the significance level is roughly 5% for the
comparison over a broad range of ratios of sample sizes. Likewise, we use 75%
intervals for a 10% comparison. See Hettmansperger (1984b) for additional
discussion.
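The following R sketch (ours, not part of the Robnp package; the λ values are
those of Table 1.11.1) carries out this computation and can be used to check
the table:

lambda = c(.50, .55, .60, .65, .75)   # values of m/N from Table 1.11.1
zc = qnorm(1 - .05/2)                 # z_c for a 5% comparison
z = zc/(sqrt(lambda) + sqrt(1 - lambda))
gamma = 2*pnorm(z) - 1                # confidence coefficient of each interval
round(rbind(z, gamma), 2)             # compare with Table 1.11.1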


Next suppose that we want a confidence interval for ∆ = θy − θx . In the
following simple theorem we show that the proposed test based on comparing
two confidence intervals is equivalent to checking whether zero is contained in
a different confidence interval. This new interval is a confidence interval for
∆.
Theorem 1.11.2. [XL , XU ] and [YL , YU ] are disjoint if and only if 0 is not
contained in [YL − XU , YU − XL ].
If we specify our significance level to be αc then we have immediately that

1 − αc = P∆ (YL − XU ≤ ∆ ≤ YU − XL )

and [YL − XU , YU − XL ] is a γc = 1 − αc confidence interval for ∆.


This theorem simply points out that the hypothesis test can be equiva-
lently based on a single confidence interval. Hence, two 84% intervals produce
a roughly 95% confidence interval for ∆. The confidence interval is easy to
construct since we need only find the least and greatest differences of the end
points between the respective Y and X intervals.
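As a small illustration, the endpoint arithmetic of Theorem 1.11.2 can be
coded directly (our sketch; xl, xu, yl, yu denote the one-sample interval
endpoints, however they are obtained):

diffci = function(xl, xu, yl, yu){
   # interval for Delta = theta_y - theta_x from [xl,xu] and [yl,yu]
   c(yl - xu, yu - xl)
}
# Reject H0: Delta = 0 when 0 falls outside the returned interval.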
Recall that one way to measure the efficiency of a confidence interval is
to find its asymptotic length. This is directly related to the Pitman efficacy
of the procedure; see Section 1.5.5. This would seem to be the most natural
way to study the efficiency of the test based on confidence intervals. In the
following theorem we determine the asymptotic length of the interval for ∆.

Theorem 1.11.3. Suppose m, n → ∞ in such a way that m/N → λ, 0 < λ <
1, N = m + n. Further suppose that γc = 2Φ(zc ) − 1. Let Λ be the length of
[YL − XU , YU − XL ]. Then

$$\frac{N^{1/2}\Lambda}{2z_c} \to \frac{1}{[\lambda(1-\lambda)]^{1/2}\, 2f(0)} .$$
Proof: First note that Λ = Λx + Λy , the sum of the lengths of the X and
Y intervals, respectively. Further,

$$N^{1/2}\Lambda = \frac{N^{1/2}}{n^{1/2}}\, n^{1/2}\Lambda_y + \frac{N^{1/2}}{m^{1/2}}\, m^{1/2}\Lambda_x .$$

But by Theorem 1.5.9 this converges in probability to $[z_x/\lambda^{1/2} + z_y/(1-\lambda)^{1/2}]/f(0)$.
Now note that $(1-\lambda)^{1/2} z_x + \lambda^{1/2} z_y = z_c$ and the result follows.
The interesting point about this theorem is that the efficiency of the in-
terval does not depend on how zx and zy are chosen, so long as they satisfy
(1 − λ)^{1/2} zx + λ^{1/2} zy = zc . In addition, this interval inherits the efficacy of
the L1 interval in the one-sample location model. We discuss the two-sample
location model in detail in the next chapter. In Hettmansperger (1984b) other
choices for zx and zy are discussed; for example, we could choose zx and zy so
that the asymptotic standardized lengths are equal. The corresponding con-
fidence coefficients for this choice are more sensitive to unequal sample sizes
than the method proposed here.

Table 1.11.2: Silver Percentage in Two Mintings

First   5.9  6.8  6.4  7.0  6.6  7.7  7.2  6.9  6.2
Fourth  5.3  5.6  5.5  5.1  6.2  5.8  5.8

Example 1.11.1 (Hendy and Charles Coin Data). Hendy and Charles (1970)
study the change in silver content in Byzantine coins. During the reign of
Manuel I (1143-1180) there were several mintings. We consider the research
hypothesis that the silver content changed from the first to the fourth coinage.
The data consist of nine coins identified from the first coinage and seven coins
from the fourth. We suppose that they are realizations of random samples of
coins from the two populations. The percentage of silver in each coin is given
in Table 1.11.2. Let ∆ = θ1 − θ4 , where the subscripts 1 and 4 indicate the
coinage. To test the null hypothesis H0 : ∆ = 0 versus HA : ∆ ≠ 0 at α = .05,
we construct two 84% L1 confidence intervals and reject the null hypothesis
if they are disjoint. The confidence intervals can be computed by using the
Robnp function onesampsgn with the value alpha=.16. Results pertinent to the
confidence intervals are:

> onesampsgn(First,alpha=.16)

Estimate 6.8 SE is 0.2135123


84 % Confidence Interval is ( 6.4 , 7 )
Estimate of the scale parameter tau 0.6405368

> onesampsgn(Fourth,alpha=.16)

Estimate 5.6 SE is 0.1779269


84 % Confidence Interval is ( 5.3 , 5.8 )
Estimate of the scale parameter tau 0.4707503

Clearly, the 84% confidence intervals are disjoint, hence, we reject the
null hypothesis at a 5% significance level and claim that the emperor appar-
ently held back a little on the fourth coinage. A 95% confidence interval for
∆ = θ1 − θ4 is found by taking the differences in the ends of the confidence
intervals: (6.4 − 5.8, 7.0 − 5.3) = (0.6, 1.7). Hence, this analysis suggests that
the difference in median percentages is someplace between .6% and 1.7%, with
a point estimate of 6.8 − 5.6 = 1.2%.
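In R, the endpoint arithmetic is immediate (a quick check of the interval
above):

c(6.4 - 5.8, 7.0 - 5.3)   # (Y_L - X_U, Y_U - X_L) = (0.6, 1.7)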
Figure 1.11.1 provides a comparison boxplot of the data for the first and
fourth coinages. If one marks these 84% confidence intervals on the plot, the
relatively large gap between the confidence intervals is apparent. Hence, there
was a sharp reduction in silver content from the first to fourth coinage. In
addition, the box for the fourth coinage is a bit more narrow than the box for
the first coinage indicating that there may be less variation (as measured by
the interquartile range) in the fourth coinage. There are no apparent outliers
as indicated by the whiskers on the boxplot. Larson and Stroup (1976) analyze
this example with a two-sample t-test.

Figure 1.11.1: Comparison boxplots of the Hendy and Charles coin data.
(Vertical axis: percentage of silver, 5.0 to 7.5; groups: First and Fourth.)

1.12 Exercises
1.12.1. Show that if ‖ · ‖ is a norm, then there always exists a value of θ which
minimizes ‖x − θ1‖ for any x = (x1 , . . . , xn )′.


1.12.2. Figure 1.12.1 displays the graph of Z(θ) versus θ for n = 20 data
points (count the steps) where

$$Z(\theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{20} \operatorname{sign}(X_i - \theta) ,$$

i.e., the standardized sign (median) process. Use this plot to answer the fol-
lowing:

(a) What are the minimum and maximum values of the sample?

(b) What is the associated point estimate of θ?

(c) Determine a 95% confidence interval for θ, (approximate, but show on the
graph).

(d) Determine the value of the test statistic and the associated p-value for
testing H0 : θ = 0 versus HA : θ > 0.

Figure 1.12.1: The graph of Z(θ) versus θ. (Plot of Z(theta) versus theta;
vertical axis Z(theta) from −5 to 5, horizontal axis theta from −1 to 3.)

1.12.3. Show D(θ), (1.3.3), is convex and continuous as a function of θ. Fur-
ther, argue that D(θ) is differentiable almost everywhere. Let S(θ) be a func-
tion such that S(θ) = −D′(θ) where the derivative exists. Then show that
S(θ) is a nonincreasing function.


1.12.4. Consider the L2 norm. Show that θ̂ = x̄ and that
$S_2(0) = \sqrt{n}\, t / \sqrt{n - 1 + t^2}$, where $t = \sqrt{n}\,\bar{x}/s$ and s is the
sample standard deviation. Further, show S2 (0) is an increasing function of t,
so the test based on t is equivalent to the test based on S2 (0).
1.12.5. Discuss the consistency of the t-test. Is the t-test resolving?
1.12.6. Discuss the Pitman Regularity in the L2 case.
1.12.7. The following R function computes a bootstrap distribution of the
sample median.
bootmed = function(x,nb){
   # x is the sample; nb is the number of bootstrap resamples
   n = length(x)
   bootmed = rep(0,nb)
   for(i in 1:nb){
      y = sample(x,size=n,replace=T)   # draw a bootstrap sample
      bootmed[i] = median(y)           # record its median
   }
   bootmed
}
(a) Use this code to obtain 1000 bootstrapped medians for the Shoshoni data
of Example 1.4.2. Determine the standard error of this bootstrap sample
of medians and compare it with the estimate based on the length of the
confidence interval for the Shoshoni data.
(b) Now find the mean and variance of the Shoshoni data. Use these estimates
to perform a parametric bootstrap of the sample median, as discussed
in Example 1.5.6. Determine the standard error of this parametric boot-
strap sample of medians and compare it with estimates in Part (a).
1.12.8. Using software such as Minitab or R, obtain a plot of the test sen-
sitivity curves based on the signed-rank Wilcoxon statistic for the Cushney-
Peebles data, Example 1.4.1, similar to the sensitivity curves based on the
t-test and the sign test shown in Figure 1.4.1.
1.12.9. In the proof of Theorem 1.5.6, show that (1.5.20) and (1.5.21) imply
that Un (b) converges to −µ′ (0) in probability, pointwise in b, i.e., Un (b) =
−µ′ (0) + op (1).
1.12.10. Suppose we are sampling from the distribution with pdf

$$f(x) = \frac{3}{4\,\Gamma(2/3)} \exp\{-|x|^{3/2}\}, \quad -\infty < x < \infty ,$$

and we are considering whether to use the Wilcoxon or sign test. Using the
efficacies of these tests, determine which test to use.


1.12.11. For which of the following distributions is the signed-rank Wilcoxon
more powerful? Why?

$$f_1(x) = \begin{cases} \frac{3}{2} x^2 & -1 < x < 1 \\ 0 & \text{elsewhere;} \end{cases} \qquad f_2(x) = \begin{cases} \frac{3}{4}(1 - x^2) & -1 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

1.12.12. Show that (1.5.24) is scale invariant. Hence, the efficiency does not
change if X is multiplied by a positive constant. Let

$$f(x, \delta) = \delta \exp(-|x|^{\delta}) / (2\Gamma(\delta^{-1})), \quad -\infty < x < \infty, \; 1 \le \delta \le 2 .$$

When δ = 2, f is a normal distribution and when δ = 1, f is a Laplace
distribution. Compute and plot, as a function of δ, the efficiency (1.5.24).

1.12.13. Show that the finite sample breakdown of the Hodges-Lehmann esti-
mate (1.3.25) is $\varepsilon^*_n = m/n$, where m is the solution to the quadratic inequality
$2m^2 - (4n + 2)m + n^2 + n \le 0$. Table $\varepsilon^*_n$ as a function of n and show that $\varepsilon^*_n$
converges to $1 - \frac{1}{\sqrt{2}} \doteq .29$.

1.12.14. Derive (1.6.9).

1.12.15. Prove Lemma 1.7.2.

1.12.16. Prove Theorem 1.7.1. In particular, check the conditions of the Lin-
deberg Central Limit Theorem to verify (1.7.7).

1.12.17. Prove Theorem 1.7.2.

1.12.18. For the general signed-rank norm given by (1.8.1), show that the
function Tϕ+ (θ), (1.8.2), is a decreasing step function which steps down only
at the Walsh averages. Hint: First show that the ranks of |Xi − θ| and |Xj − θ|
switch for θ1 < θ2 if and only if

$$\theta_1 < \frac{X_i + X_j}{2} < \theta_2$$

(replace ranks by signs if i = j).

1.12.19. Let ϕ+ (u) be a general score function. Using the general result of
Theorem 1.5.9, show that the length of the confidence interval for θ, (1.8.16),
can be used to obtain a consistent estimate of τϕ+ . Use this to obtain a stan-
dard error for the estimate of θ based on the score function ϕ+ (u).


1.12.20. Use the results of the last exercise to write in some detail the trac-
ing algorithm, described after expression (1.8.5), for obtaining the location
estimator θbϕ+ and its associated standard error.

1.12.21. Suppose h(x) has finite Fisher information:

$$I(h) = \int \frac{(h'(x))^2}{h(x)}\, dx < \infty .$$

Prove that h(x) is bounded and that $\int h^2(x)\, dx < \infty$.
Hint: Write

$$h(x) = \int_{-\infty}^{x} h'(t)\, dt \le \int_{-\infty}^{x} |h'(t)|\, dt .$$

1.12.22. Repeat Exercise 1.12.12 for (1.7.13).

1.12.23. Show that (1.8.1) is a norm.


1.12.24. Show that $\int \phi_h^+(u)^2\, du$, with $\phi_h^+(u)$ given by (1.8.18), is equal to
Fisher information,

$$\int \frac{(h'(x))^2}{h(x)}\, dx .$$

1.12.25. Find (1.8.18) when h is the normal, logistic, and Laplace (double
exponential) density, respectively.

1.12.26. Verify that the influence function of the normal score estimate is
unbounded when the underlying distribution is normal.

1.12.27. Verify (1.9.4).

1.12.28. Derive the limit distribution in expression (1.9.5).

1.12.29. Show that approximation (1.10.5) is exact for the double exponential
(Laplace) distribution.

1.12.30. Extend the simulation study of Example 1.8.2 to the other contam-
inated normal situations found in Table 1.7.1. Comment on the results. Com-
pare the empirical results for the Wilcoxon with the asymptotic results found
in the table.
The following R code performs the contaminated normal simulation dis-
cussed in Example 1.8.2. (Semicolons are end-of-line indicators. As indicated
in the call to onesampr, the normal scores estimator is computed by using the
gradient R function spnsc and score function phinscp.)


nsims = 10000; n = 30; itype = 1; eps = .01; sigc = 3;
collls = rep(0,nsims); collwil = rep(0,nsims);
collnsc = rep(0,nsims)
for(i in 1:nsims){
   # itype = 0: N(0,1) data; itype = 1: contaminated normal via rcn
   if(itype == 0){x = rnorm(n)}
   if(itype == 1){x = rcn(n,eps,sigc)}
   collls[i] = mean(x)                           # LS (sample mean)
   collnsc[i] = onesampr(x,score=phinscp,grad=spnsc,
                maktable=F)$est                  # normal scores estimate
   collwil[i] = onesampwil(x,maktable=F)$est     # Wilcoxon estimate
}
# empirical mean squared errors and relative efficiencies
msels = mean(collls^2); msensc = mean(collnsc^2);
msewil = mean(collwil^2); arensc = msels/msensc;
arewil = msels/msewil; arenscwil = msewil/msensc

1.12.31. Consider the one-sample location problem. Let T (θ) be a nonincreas-
ing process. Consider the hypotheses:

H0 : θ = 0 versus HA : θ > 0 .

Assume that T (θ) is standardized so that the decision rule of the (asymptotic)
level α test is given by

Reject H0 : θ = 0 in favor of HA : θ > 0, if T (0) > zα .

Further assume that, for all |θ| < B, B > 0,

$$T(\theta/\sqrt{n}) = T(0) - 1.2\,\theta + o_p(1) .$$

(a) For θ0 > 0, determine the asymptotic power γ(θ0 ), i.e., determine

$$\gamma(\theta_0) = P_{\theta_0}[T(0) > z_\alpha] .$$

(b) Evaluate γ(θ0 ) for n = 36 and θ0 = 0.5.

1.12.32. Suppose X1 , . . . , X2n are independent observations such that Xi has
cdf F (x − θi ). For testing H0 : θ1 = · · · = θ2n versus HA : θ1 ≤ · · · ≤ θ2n with
at least one strict inequality, consider the test statistic

$$S = \sum_{i=1}^{n} I(X_{n+i} > X_i) .$$

(a) Discuss the small sample and asymptotic distribution of S under H0 .


(b) Determine the distribution of S under the alternative θn+i −
θi = ∆, ∆ > 0, for all i = 1, . . . , n. Show that the test is consistent for
this alternative. This test is called Mann’s (1945) test for trend.
1.12.33. The data (Carrie’s Baseball Data) for this exercise can be found
at the url cited in the Preface. The data set consists of a sample of size
59 containing information on professional baseball players. The data were
recorded from the back of a deck of baseball cards (compliments of Carrie
McKean).
(a) Obtain dotplots of the weights and heights of the baseball players.
(b) Assume the weight of a typical adult male is 175 pounds. Use the
Wilcoxon test statistic to test the hypotheses
H0 : θW = 175 versus HA : θW ≠ 175 ,
where θW is the median weight of a professional baseball player. Compute
the p-value. Next obtain a 95% confidence interval for θW using the
confidence interval procedure based on the Wilcoxon. Use the dotplot in
Part (a) to comment on the assumption of symmetry.
(c) Let θH be the median height of a baseball player. Repeat the analysis of
Part (b) for the hypotheses
H0 : θH = 70 versus HA : θH ≠ 70 .

1.12.34. The signed-rank Wilcoxon scores are optimal for the logistic distri-
bution while the sign scores are optimal for the Laplace distribution. A family
of score functions which are optimal for distributions with logistic “middles”
and Laplace “tails” are the bent scores. These are continuous score functions
ϕ+ (u) with a linear (positive slope and intercept 0) piece for 0 < u < b and
a constant piece for b < u < 1, for a specified value of b; see Policello and
Hettmansperger (1976). These are called signed-rank Winsorized Wilcoxon
scores.
(a) Obtain the standardized scores such that $\int [\varphi^+(u)]^2\, du = 1$.
(b) For these scores with b = 0.75, obtain the corresponding estimate of
location and an estimate of its standard error for the following data set:
7.94 8.13 8.11 7.96 7.83 7.04 7.91 7.82
7.42 8.06 8.51 7.88 8.96 7.58 8.14 8.06

The software Robnp computes this estimate with the call

onesampr(x,score=phipb,grad=sphipb,param=c(.75)).


Chapter 2

Two-Sample Problems

2.1 Introduction
Let X1 , . . . , Xn1 be a random sample with common distribution function F (x)
and density function f (x). Let Y1 , . . . , Yn2 be another random sample, inde-
pendent of the first, with common distribution function G(x) and density g(x).
We call this the general model throughout this chapter. A natural null hypoth-
esis is H0 : F (x) = G(x). In this chapter we consider rank and sign tests of
this hypothesis. A general alternative to H0 is HA : F (x) 6= G(x) for some x.
Except for Section 2.10 on the scale model we are generally concerned with
the alternative models where one distribution is stochastically larger than the
other; for example, the alternative that G is stochastically larger than F which
can be expressed as HA : G(x) ≤ F (x) with a strict inequality for some x.
This family of alternatives includes the location model, described next, and
the Lehmann alternative models discussed in Section 2.7, which are used in
survival analysis.
As in Chapter 1, the location models are of primary interest. For these
models G(x) = F (x − ∆) for some parameter ∆. Thus the parameter ∆
represents a shift in location between the two distributions. It can be expressed
as ∆ = θY −θX where θY and θX are the medians of the distributions of G and
F or equivalently as ∆ = µY − µX where, provided they exist, µY and µX are
the means of G and F . In the location problem the null hypothesis becomes
H0 : ∆ = 0. In addition to tests of this hypothesis we develop estimates
and confidence intervals for ∆. We call this the location model throughout
this chapter and we show that this is a generalization of the location problem
defined in Chapter 1.
As in Chapter 1 with the one-sample problems, for the two-sample prob-
lems, we offer the reader computational R functions which do the computation
for the rank-based analyses discussed in this chapter.


2.2 Geometric Motivation

In this section, we work with the location model described above. As in Chap-
ter 1, we derive sign and rank-based tests and estimates from a geometric
point of view. As we show, their development is analogous to that of least
squares procedures in that other norms are used in place of the least squares
Euclidean norm. In order to do this we place the problem into the context of
a linear model. This facilitates our geometric development and also serves as
an introduction to Chapter 3, linear models.
Let Z′ = (X1 , . . . , Xn1 , Y1 , . . . , Yn2 ) denote the vector of all observations;
let n = n1 + n2 denote the total sample size; and let

$$c_i = \begin{cases} 0 & \text{if } 1 \le i \le n_1 \\ 1 & \text{if } n_1 + 1 \le i \le n \end{cases} . \qquad (2.2.1)$$

Then we can write the location model as

Zi = ∆ci + ei , 1 ≤ i ≤ n , (2.2.2)

where e1 , . . . , en are iid with distribution function F (x). Let C = [ci ] denote
the n × 1 design matrix and let Ω_FULL denote the column space of C. We can
express the location model as

Z = C∆ + e , (2.2.3)

where e′ = (e1 , . . . , en ) is the n × 1 vector of errors. Note that, except for
random error, the observations Z would lie in Ω_FULL . Thus, given a norm, we
estimate ∆ so that C∆̂ minimizes the distance between Z and the subspace
Ω_FULL ; i.e., C∆̂ is the vector in Ω_FULL closest to Z.
Before turning our attention to ∆, however, we write the problem in terms
of the geometry discussed in Chapter 1. Consider any location functional T
of the distribution of e. Let θ = T (F ). Define the random variable e∗ = e − θ.
Then the distribution function of e∗ is F ∗ (x) = F (x + θ) and its functional is
T (F ∗ ) = 0. Thus the model, (2.2.3), can be expressed as

Z = 1θ + C∆ + e∗ . (2.2.4)

Note that this is a generalization of the location problem discussed in Chap-
ter 1. From the last paragraph, the distribution function of Xi can be expressed
as F (x) = F ∗ (x − θ); hence, T (F ) = θ is a location functional of Xi . Further,
the distribution function of Yj can be written as G(x) = F ∗ (x − (∆ + θ)).
Thus T (G) = ∆ + θ is a location functional of Yj . Therefore, ∆ is precisely the
difference in location functionals between Xi and Yj . Furthermore, ∆ does not
depend on the location functional used. Thus, we call ∆ the shift parameter.


Let b = (θ, ∆)′ . Given a norm, we want to choose as our estimate of b
a value b̂ such that [1 C]b̂ minimizes the distance between the vector of
observations Z and the column space V of the matrix [1 C]. Thus we can use
the norms defined in Chapter 1 to estimate b.
If, as an example, we select the L1 norm, then our estimate of b minimizes

$$D(\mathbf{b}) = \sum_{i=1}^{n} |Z_i - \theta - c_i \Delta| . \qquad (2.2.5)$$

Differentiating D with respect to θ and ∆, respectively, and setting the re-
sulting equations to 0, we obtain the equations

$$\sum_{i=1}^{n_1} \operatorname{sgn}(X_i - \theta) + \sum_{j=1}^{n_2} \operatorname{sgn}(Y_j - \theta - \Delta) \doteq 0 \qquad (2.2.6)$$

$$\sum_{j=1}^{n_2} \operatorname{sgn}(Y_j - \theta - \Delta) \doteq 0 . \qquad (2.2.7)$$

Subtracting the second equation from the first, we get $\sum_{i=1}^{n_1} \operatorname{sgn}(X_i - \theta) \doteq 0$;
hence, θ̂ = med {Xi }. Substituting this into the second equation, we get
∆̂ = med {Yj − θ̂} = med {Yj } − med {Xi }; hence, b̂ = (med {Xi }, med {Yj } −
med {Xi })′. We obtain inference based on the L1 norm in Sections 2.6.1
and 2.6.2.
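For example, these closed-form L1 estimates are one line of R apiece (our
sketch; x and y denote the two samples):

thetahat = median(x)               # theta-hat = med{X_i}
deltahat = median(y) - median(x)   # Delta-hat = med{Y_j} - med{X_i}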
If we select the L2 norm then, as shown in Exercise 2.13.1, the LS estimate
is b̂ = (X̄, Ȳ − X̄)′. Another norm discussed in Chapter 1 was the weighted L1
norm. In this case b is estimated by minimizing

$$D(\mathbf{b}) = \sum_{i=1}^{n} R(|Z_i - \theta - c_i \Delta|)\, |Z_i - \theta - c_i \Delta| . \qquad (2.2.8)$$

This estimate cannot be obtained in closed form; however, fast minimization
algorithms for such problems are discussed later in Chapter 3.
In the initial statement of the problem, though, θ is a nuisance parameter
and we are really interested in ∆, the shift in location between the populations.
Hence, we want to define distance in terms of norms which are invariant to θ.
The type of norm that is invariant to θ is a pseudo-norm, which we define
next.

Definition 2.2.1. An operator ‖ · ‖∗ is called a pseudo-norm if it satisfies
the following four conditions:

‖u + v‖∗ ≤ ‖u‖∗ + ‖v‖∗  for all u, v ∈ Rⁿ
‖αu‖∗ = |α| ‖u‖∗  for all α ∈ R, u ∈ Rⁿ
‖u‖∗ ≥ 0  for all u ∈ Rⁿ
‖u‖∗ = 0  if and only if u1 = · · · = un .


Note that a regular norm satisfies the first three properties but, in lieu
of the fourth property, the norm of a vector is 0 if and only if the vector is
0. The following string of inequalities establishes the invariance of pseudo-
norms to the parameter θ:

$$\begin{aligned}
\|Z - \theta 1 - C\Delta\|_* &\le \|Z - C\Delta\|_* + \|\theta 1\|_* = \|Z - C\Delta\|_* \\
&= \|Z - \theta 1 - C\Delta + \theta 1\|_* \le \|Z - \theta 1 - C\Delta\|_* + \|\theta 1\|_* = \|Z - \theta 1 - C\Delta\|_* ,
\end{aligned}$$

since ‖θ1‖∗ = 0 by the fourth property. Hence, ‖Z − θ1 − C∆‖∗ = ‖Z − C∆‖∗ .
Given a pseudo-norm, denote the associated dispersion function by
D∗ (∆) = ‖Z − C∆‖∗ . It follows from the above properties of a pseudo-norm
that D∗ (∆) is a nonnegative, continuous, and convex function of ∆.
We next develop an inference, which includes estimation of ∆ and tests of
hypotheses concerning ∆, for a general pseudo-norm. As an estimate of the
shift parameter ∆, we choose a value ∆̂ which solves

$$\widehat{\Delta} = \operatorname{Argmin} D_*(\Delta) = \operatorname{Argmin} \|Z - C\Delta\|_* ; \qquad (2.2.9)$$

i.e., C∆̂ minimizes the distance between Z and Ω_FULL . Another way of defining
∆̂ is as the stationary point of the gradient of the pseudo-norm. Define the
function S∗ by

$$S_*(\Delta) = -\nabla \|Z - C\Delta\|_* , \qquad (2.2.10)$$

where ∇ denotes the gradient of ‖Z − C∆‖∗ with respect to ∆. Because D∗ (∆)
is convex, it follows immediately that

S∗ (∆) is nonincreasing in ∆ . (2.2.11)

Hence ∆̂ is such that

$$S_*(\widehat{\Delta}) \doteq 0 . \qquad (2.2.12)$$
Given a location functional θ = T (F ), i.e., Model (2.2.4), once ∆ has been
estimated we can base an estimate of θ on the residuals Zi − ∆̂ci . For example,
if we chose the median as our location functional then we could use the median
of the residuals to estimate it. We discuss this in more detail for general linear
models in Chapter 3.
Next consider the hypotheses

H0 : ∆ = 0 versus HA : ∆ ≠ 0 . (2.2.13)

The closer S∗ (0) is to 0 the more plausible is the hypothesis H0 . More formally,
we define the gradient test of H0 versus HA by the rejection rule,

Reject H0 in favor of HA if S∗ (0) ≤ k or S∗ (0) ≥ l ,


where the critical values k and l depend on the null distribution of S∗ (0).
Typically, the null distribution of S∗ (0) is symmetric about 0 and k = −l.
The reduction in dispersion test is given by

Reject H0 in favor of HA if D∗ (0) − D∗ (∆̂) ≥ m ,
where the critical value m is determined by the null distribution of the test
statistic. In this chapter, as in Chapter 1, we are concerned with the gradient
test while in Chapter 3 we use the reduction in dispersion test. A confidence
interval for ∆ of confidence (1 − α)100% is the interval {∆ : k < S∗ (∆) < l}
and
1 − α = P∆ [k < S∗ (∆) < l] . (2.2.14)
Since D∗ (∆) is convex, S∗ (∆) is nonincreasing and, hence,

$$\widehat{\Delta}_L = \inf\{\Delta : S_*(\Delta) < l\} \quad\text{and}\quad \widehat{\Delta}_U = \sup\{\Delta : S_*(\Delta) > k\} ; \qquad (2.2.15)$$
compare (1.3.10). Often we are able to invert k < S∗ (∆) < l to find an explicit
formula for the upper and lower end points.
We discuss a large class of general pseudo-norms in Section 2.5, but now we
present the pseudo-norms that yield the pooled t-test and the Mann-Whitney-
Wilcoxon test.

2.2.1 Least Squares (LS) Analysis

The traditional analysis is based on the squared pseudo-norm given by

$$\|u\|_{LS}^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} (u_i - u_j)^2 , \quad u \in R^n . \qquad (2.2.16)$$

It follows (see Exercise 2.13.1) that

$$\nabla \|Z - C\Delta\|_{LS}^2 = -4 n_1 n_2 (\bar{Y} - \bar{X} - \Delta) ;$$

hence the classical estimate is ∆̂_LS = Ȳ − X̄. Eliminating the constant factor
4n1 n2 , the classical test is based on the statistic

$$S_{LS}(0) = \bar{Y} - \bar{X} .$$
As shown in Exercise 2.13.1, standardizing SLS results in the two-sample
pooled t-statistic. An approximate confidence interval for ∆ is given by

$$\bar{Y} - \bar{X} \pm t_{(\alpha/2,\, n_1+n_2-2)}\, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} ,$$

where σ̂ is the usual pooled estimate of the common standard deviation. This
confidence interval is exact if ei has a normal distribution. Asymptotically, we
replace t(α/2,n1+n2−2) by zα/2 . The test is then asymptotically distribution free.
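As a numerical check on this gradient, the following sketch (ours, with
illustrative data) evaluates the dispersion (2.2.16) on a grid and confirms that
it is minimized near Ȳ − X̄:

dls = function(delta, x, y){
   u = c(x, y - delta)           # residuals Z - C*Delta
   sum(outer(u, u, "-")^2)       # double sum of (u_i - u_j)^2
}
x = rnorm(10); y = rnorm(15) + 1  # illustrative samples
grid = seq(-1, 3, by = .01)
grid[which.min(sapply(grid, dls, x = x, y = y))]  # close to mean(y) - mean(x)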


2.2.2 Mann-Whitney-Wilcoxon (MWW) Analysis

The rank-based analysis is based on the pseudo-norm defined by

$$\|u\|_R = \sum_{i=1}^{n} \sum_{j=1}^{n} |u_i - u_j| , \quad u \in R^n . \qquad (2.2.17)$$

Note that this pseudo-norm is the L1 norm based on the differences between
the components and that it is the second term of expression (1.3.20), which
defines the norm of the signed-rank analysis of Chapter 1. Note further that
this pseudo-norm differs from the least squares pseudo-norm in that the square
root is taken inside the double summation. In Exercise 2.13.2 the reader is
asked to show that this indeed is a pseudo-norm and that, further, it can be
written in terms of ranks as

$$\|u\|_R = 4 \sum_{i=1}^{n} \left( R(u_i) - \frac{n+1}{2} \right) u_i .$$

From (2.2.17), it follows that the MWW gradient is

$$\nabla \|Z - C\Delta\|_R = -2 \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \operatorname{sgn}(Y_j - X_i - \Delta) .$$

Our estimate of ∆ is a value which makes the gradient zero; that is, it makes
half of the differences positive and the other half negative. Thus the rank-based
estimate of ∆ is

$$\widehat{\Delta}_R = \operatorname{med}\{Y_j - X_i\} . \qquad (2.2.18)$$

This pseudo-norm estimate is often called the Hodges-Lehmann estimate of
shift for the two-sample problem (Hodges and Lehmann, 1963). As we show in
Section 2.4.4, ∆̂_R has an approximate normal distribution with mean ∆ and
standard deviation $\tau \sqrt{(1/n_1) + (1/n_2)}$, where the scale parameter τ is given
in display (2.4.22).
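In R the estimate is one line, since outer(y, x, "-") produces all n1 n2
differences (our sketch; x and y are the two samples):

deltahatR = median(outer(y, x, "-"))   # Hodges-Lehmann estimate (2.2.18)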
From the gradient we define

$$S_R(\Delta) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \operatorname{sgn}(Y_j - X_i - \Delta) . \qquad (2.2.19)$$

Next define

$$S_R^+(\Delta) = \#(Y_j - X_i > \Delta) . \qquad (2.2.20)$$

Note that we have (with probability one) that SR (∆) = 2SR+ (∆) − n1 n2 . The
statistic SR+ = SR+ (0), originally proposed by Mann and Whitney (1947), is
more convenient to use. The gradient test for the hypotheses (2.2.13) is

Reject H0 in favor of HA if SR+ ≤ k or SR+ ≥ n1 n2 − k ,


where k is chosen by P0 (SR+ ≤ k) = α/2. We show in Section 2.4 that
the test statistic is distribution free under H0 and that, further, it has an
asymptotic normal distribution with mean n1 n2 /2 and standard deviation
$\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}$ under H0 . Hence, an asymptotic level α test rejects
H0 in favor of HA if

$$|z| > z_{\alpha/2} , \quad \text{where } z = \frac{S_R^+ - (n_1 n_2/2)}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}} . \qquad (2.2.21)$$

As shown in Section 2.4.2, the (1 − α)100% MWW confidence interval
for ∆ is given by

$$[D_{(k+1)} , D_{(n_1 n_2 - k)}) , \qquad (2.2.22)$$

where k is such that P0 [SR+ ≤ k] = α/2 and D(1) ≤ · · · ≤ D(n1 n2 ) denote the or-
dered n1 n2 differences Yj − Xi . It follows from the asymptotic null distribution
of SR+ that k can be approximated as

$$k \doteq \frac{n_1 n_2}{2} - \frac{1}{2} - z_{\alpha/2} \sqrt{\frac{n_1 n_2 (n+1)}{12}} .$$
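A minimal R sketch (ours) of this test and confidence interval, for sample
vectors x and y (ties among the differences are ignored here):

mwwci = function(x, y, alpha = .05){
   n1 = length(x); n2 = length(y)
   d = sort(outer(y, x, "-"))        # ordered differences D_(1), ..., D_(n1*n2)
   srplus = sum(d > 0)               # S_R^+ = #(Y_j - X_i > 0)
   z = (srplus - n1*n2/2)/sqrt(n1*n2*(n1 + n2 + 1)/12)
   k = floor(n1*n2/2 - .5 - qnorm(1 - alpha/2)*sqrt(n1*n2*(n1 + n2 + 1)/12))
   list(srplus = srplus, z = z, ci = c(d[k + 1], d[n1*n2 - k]))
}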
A rank formulation of the MWW test statistic SR+ (∆) also proves useful.
Letting R(ui ) denote the rank of ui among u1 , . . . , un , we can write

$$\sum_{j=1}^{n_2} R(Y_j - \Delta) = \sum_{j=1}^{n_2} \left\{ \#_i(X_i < Y_j - \Delta) + \#_i(Y_i - \Delta \le Y_j - \Delta) \right\} = \#(Y_j - X_i > \Delta) + \frac{n_2(n_2+1)}{2} .$$

Defining

$$W(\Delta) = \sum_{j=1}^{n_2} R(Y_j - \Delta) , \qquad (2.2.23)$$

we thus have the relationship

$$S_R^+(\Delta) = W(\Delta) - \frac{n_2(n_2+1)}{2} . \qquad (2.2.24)$$

The test statistic W (0) was proposed by Wilcoxon (1945). Since it is a linear
function of the Mann-Whitney test statistic it has identical statistical prop-
erties. We refer to the statistic, SR+ , as the Mann-Whitney-Wilcoxon statistic
and label it as MWW.
As a final note on the geometry of the rank-based analysis, reconsider the
model with the location functional θ in it, i.e., (2.2.4). Suppose we obtain the R
estimate of ∆, (2.2.18). Let êR = Z − C∆̂R denote the residuals. Next suppose
we want to estimate the location parameter θ by using the weighted L1 norm,
which was discussed for estimation of location in Section 1.7 of Chapter 1. Let



$\|u\|_{SR} = \sum_{j=1}^{n} j\, |u|_{(j)}$ denote this norm. For the residual vector êR , expression
(1.3.10) of Chapter 1 is given by

$$\|\hat{e} - \theta 1\|_{SR} = \sum_{i \le j} \left| \frac{\hat{e}_i + \hat{e}_j}{2} - \theta \right| + \frac{1}{4} \|\hat{e}_R\|_R . \qquad (2.2.25)$$

Hence the estimate of θ determined by this geometry is the Hodges-Lehmann
estimate based on the residuals; i.e.,

$$\widehat{\theta}_R = \operatorname{med}_{i \le j} \left\{ \frac{\hat{e}_i + \hat{e}_j}{2} \right\} . \qquad (2.2.26)$$

Asymptotic theory for the joint distribution of the random vector (θ̂R , ∆̂R )′ is
discussed in Chapter 3.

2.2.3 Computation

The Mann-Whitney-Wilcoxon analysis described above is easily com-
puted using the Robnp function twosampwil. This function returns the value
of the Mann-Whitney-Wilcoxon test statistic SR+ = SR+ (0), (2.2.20), the esti-
mate ∆̂, (2.2.18), the associated confidence interval (2.2.22), and comparison
boxplots of the samples. Also, the R intrinsic function wilcox.test and the
Minitab command MANN compute this Mann-Whitney-Wilcoxon analysis.
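For example, with sample vectors x and y, the base R call below returns
the MWW statistic, the Hodges-Lehmann estimate, and the distribution-free
confidence interval:

wilcox.test(y, x, conf.int = TRUE, conf.level = .95)
# $statistic is S_R^+, $estimate is the Hodges-Lehmann shift estimate,
# and $conf.int is the confidence interval (2.2.22) for Delta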

2.3 Examples
In this section we present two examples which illustrate the methods discussed
in the last section. The calculations were performed by the Robnp functions
twosampwil and twosampt which compute the Mann-Whitney-Wilcoxon and
LS analyses, respectively. By convention, for each difference Yj − Xi = 0,
we add the value 1/2 to the test statistic SR+ . Further, the returned p-value
is calculated with the usual continuity correction. The estimate of τ and its
standard error (SE) displayed in the results are given by expression (2.4.27),
where a full discussion is given. The LS analysis, computed by twosampt, is
based on the traditional pooled two-sample t-test.

Example 2.3.1 (Quail Data). The data for this problem are drawn from a
high-volume drug screen designed to find compounds which reduce low den-
sity lipoproteins, LDL, cholesterol in quail; see McKean, Vidmar, and Sievers
(1989) for a discussion of this screen. For the purposes of the present exam-
ple, we have taken the plasma LDL levels of one group of quail who were fed
over a specified period of time a special diet mixed with a drug compound,
and the LDL levels of a second group of quail who were fed the same spe-
cial diet but without the drug compound over the same length of time. A
completely randomized design was employed. We refer to the first group as
the treatment group and the second group as the control group. The data
are displayed in Table 2.3.1.

Table 2.3.1: Data for Quail Example

Control  64 49 54 64 97 66 76 44 71 89
         70 72 71 55 60 62 46 77 86 71
Treated  40 31 50 48 152 44 74 38 81 64

Let θC and θT denote the true median levels of
LDL for the control and treatment populations, respectively. The parameter
of interest is ∆ = θC − θT . We are interested in the alternative hypothesis that
the treatment has been effective; hence the hypotheses are:

H0 : ∆ = 0 versus HA : ∆ > 0 .

The comparison boxplots returned by the Robnp function twosampwil are
found in Figure 2.3.1. Note that there is one outlier, the fifth observation of
the treated group, which has the value 152. Outliers such as this were typical
the treated group, which has the value 152. Outliers such as this were typical
with most of the data in this study; see McKean et al. (1989). For the data
at hand, the treated group appears to have lower LDL levels.
The analyses returned by the functions twosampwil and twosampt are
given below. The Mann-Whitney-Wilcoxon test statistic has the value 134.5
with p-value 0.067, while the t-test statistic has value 0.557 with p-value 0.291.
The MWW indicates with marginal significance that the treatment performed
better than the placebo. The two-sample t-analysis was impaired by the out-
lier.
The Hodges-Lehmann estimate of ∆, (2.2.18), is 14 and the 90% confidence
interval is (−2.0, 24.0). In contrast, the least squares estimate of shift is 5 and
the corresponding 90% confidence interval is (−10.25, 20.25).
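The data can be entered as vectors to reproduce the calls below (x is the
treated sample and y the control sample, matching the argument order of
twosampwil):

x = c(40, 31, 50, 48, 152, 44, 74, 38, 81, 64)     # treated
y = c(64, 49, 54, 64, 97, 66, 76, 44, 71, 89,
      70, 72, 71, 55, 60, 62, 46, 77, 86, 71)      # control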

> twosampwil(y,x,alt=1,alpha=.10,namex="Treated",namey="Control",
nameresp="LDL cholesterol")

Test of Delta = 0 Alternative selected is 1


Test Stat. S+ is 134.5 Standardized (z) Test-Stat. 1.495801 and
p-value 0.06735282

MWW estimate of the shift in location is 14 SE is 8.180836


90 % Confidence Interval is ( -2 , 24 )
Estimate of the scale parameter tau 21.12283


Figure 2.3.1: Comparison boxplots of treatment and control quail LDL levels.
(Vertical axis: LDL cholesterol, 40 to 140; groups: Control and Treated.)

> twosampt(y,x,alt=1,alpha=.10,namex="Treated",namey="Control",
nameresp="LDL cholesterol")

Test of Delta = 0 Alternative selected is 1


Test Stat. ybar-xbar- 0 is 5 Standardized (t)
Test-Stat. 0.5577585 and p-value 0.2907209

Mean of y minus the mean of x is 5 SE is 8.964454


90 % Confidence Interval is ( -10.24971 , 20.24971 )
Estimate of the scale parameter sigma 23.14612

The data discussed in the last example were drawn from a high-speed drug
screen to discover drug compounds which have the potential to reduce LDL
cholesterol. In this screen, if a compound was at least marginally significant
then the investigation of it would continue; otherwise, it would be eliminated
from further scrutiny. Hence, for this drug compound, the robust and LS
analyses would result in different practical outcomes.
Example 2.3.2 (Hendy-Charles Coin Data, continuation of Example 1.11.1).
Recall that the 84% L1 confidence intervals for the data are disjoint. Thus
we reject the null hypothesis that the silver content is the same for the two

mintings at the 5% level. We now apply the MWW test and confidence interval
to these data and find the Hodges-Lehmann estimate of shift. If the tailweights
of the underlying distributions are moderate, the MWW methods are more
efficient than the L1 methods used in Example 1.11.1.
The output from the Robnp function twosampwil is:
> twosampwil(Fourth,First)

Test of Delta = 0 Alternative selected is 0


Test Stat. S+ is 61.5 Standardized (z) Test-Stat. 3.122611
and p-value 0.001792544

MWW estimate of the shift in location is 1.1 SE is 0.2999926


95 % Confidence Interval is ( 0.6 , 1.7 )
Estimate of the scale parameter tau 0.5952794
Note that there is strong statistical evidence that the mintings are different.
The Hodges-Lehmann estimate (2.2.18) is 1.1 which suggests that there is
roughly a 1.1% decrease in the silver content from the first to the fourth
mintings. The standard error of the estimate is 0.30. Thus, the decrease in the
proportion of silver content can be expressed as 0.0110 ± 0.0060.

2.4 Inference Based on the Mann-Whitney-Wilcoxon
We next develop the theory for inference based on the Mann-Whitney-
Wilcoxon statistic, including the test, the estimate, and the confidence inter-
val. Although much of the development is for the location model, the general
model is also considered. We begin with testing.

2.4.1 Testing
Although the geometric motivation of the test statistic SR+ was derived under
the location model, the test can be used for more general models. Recall that
the general model is comprised of a random sample X1 , . . . , Xn1 with cdf F (x)
and a random sample Y1 , . . . , Yn2 with cdf G(x). For the discussion we select
the hypotheses,
H0 : F (x) = G(x), for all x, versus
HA : F (x) ≥ G(x), with strict inequality for some x. (2.4.1)
Under this stochastically ordered alternative Y tends to dominate X; i.e.,
P (Y > X) > 1/2. Our rank-based decision rule is to reject H0 in favor of


HA if SR+ is too large, where SR+ = #(Yj − Xi > 0). Our immediate goal
is to make this precise. What we discuss, of course, holds for the other one-
sided alternative F (x) ≤ G(x) and the two-sided alternative F (x) ≤ G(x) or
F (x) ≥ G(x) as well. Furthermore, since the location model is a submodel of
the general model, what holds for the general model holds for it also. It will
always be clear which set of hypotheses is being considered.
Under H0 , we first show that SR+ is distribution free and then show it is
symmetrically distributed about (n1 n2 )/2.

Theorem 2.4.1. Under the general null hypothesis in (2.4.1), SR+ is distribu-
tion free.

Proof: Under H0 , the combined samples X1 , . . . , Xn1 , Y1 , . . . , Yn2 constitute
a random sample of size n from the distribution function F (x). Hence any
assignment of n2 ranks from the set of integers {1, . . . , n} to Y1 , . . . , Yn2 is
equilikely; i.e., has probability $\binom{n}{n_2}^{-1}$, independent of F .

Theorem 2.4.2. Under H0 in (2.4.1), the distribution of SR+ is symmetric
about (n1 n2 )/2.

Proof: Under H0 , (2.4.1), L(Yj − Xi ) = L(Xi − Yj ) for all i, j; see Exercise
2.13.3. Thus if SR− = #(Xi − Yj > 0) then, under H0 , L(SR+ ) = L(SR− ). Since
SR− = n1 n2 − SR+ , we have the following string of equalities, which proves the
result:

$$P\left[S_R^+ \ge \frac{n_1 n_2}{2} + x\right] = P\left[n_1 n_2 - S_R^- \ge \frac{n_1 n_2}{2} + x\right] = P\left[S_R^- \le \frac{n_1 n_2}{2} - x\right] = P\left[S_R^+ \le \frac{n_1 n_2}{2} - x\right] .$$
Hence for the hypotheses (2.4.1), a level α test based on SR+ would reject
H0 if SR+ ≥ cα,n1 ,n2 where PH0 [SR+ ≥ cα,n1 ,n2 ] = α. From the symmetry, note
that the lower α critical point is given by n1 n2 − cα,n1 ,n2 .
Although SR+ is distribution free under the null hypothesis its distribution
cannot be obtained in closed form. The next theorem gives a recursive for-
mula for its distribution. The proof can be found in Exercise 2.13.4; see, also,
Hettmansperger (1984, p. 136-137).

Theorem 2.4.3. Under the general null hypothesis in (2.4.1), let Pn1 ,n2 (k) =
PH0 [SR+ = k]. Then

$$P_{n_1,n_2}(k) = \frac{n_2}{n_1+n_2} P_{n_1,n_2-1}(k - n_1) + \frac{n_1}{n_1+n_2} P_{n_1-1,n_2}(k) ,$$

where Pn1 ,n2 (k) satisfies the boundary conditions Pi,j (k) = 0 if k < 0, and
Pi,0 (k) and P0,j (k) are 1 or 0 according as k = 0 or k ≠ 0.


Based on these recursion formulas, tables of the null distribution can readily
be obtained, which then can be used to obtain the critical values for the
rank-based test. Alternatively, the asymptotic null distribution of SR+ can be
used to determine approximate critical values. This asymptotic test is dis-
cussed later; see Theorem 2.4.9.
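A direct, if naive, R implementation of this recursion (our sketch; a
dynamic-programming version would be preferred for large n1 , n2 ):

pmww = function(n1, n2, k){
   # P_{n1,n2}(k) = P_{H0}[S_R^+ = k] via the recursion of Theorem 2.4.3
   if(k < 0) return(0)
   if(n1 == 0 || n2 == 0) return(as.numeric(k == 0))
   n2/(n1 + n2)*pmww(n1, n2 - 1, k - n1) + n1/(n1 + n2)*pmww(n1 - 1, n2, k)
}
sum(sapply(0:4, pmww, n1 = 3, n2 = 3))   # P[S_R^+ <= 4] when n1 = n2 = 3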
We next derive the mean and variance of SR+ under the three models:

(a) the general model where X has distribution function F (x) and Y has
distribution function G(x);

(b) the location model where G(x) = F (x − ∆);

(c) the null model in which F (x) = G(x).

Of course, from Theorem 2.4.2, the null mean of SR+ is (n1 n2 )/2. In our deriva-
tion we repeatedly make use of the fact that if H is the distribution function
of a random variable Z then the random variable H(Z) has a uniform distri-
bution over the interval (0, 1); see Exercise 2.13.5.

Theorem 2.4.4. Assuming that X1 , . . . , Xn1 are iid F (x) and Y1 , . . . , Yn2 are
iid G(x) and that these two samples are independent of one another, the means
of SR+ under the three models (a)-(c) are:

(a) $E[S_R^+] = n_1 n_2 \left[ 1 - E[G(X)] \right] = n_1 n_2 E[F(Y)]$

(b) $E[S_R^+] = n_1 n_2 \left[ 1 - E[F(X - \Delta)] \right] = n_1 n_2 E[F(X + \Delta)]$

(c) $E[S_R^+] = \dfrac{n_1 n_2}{2}$ .
Proof: We prove only (a), since results (b) and (c) follow directly from it. We
can write SR+ in terms of indicator functions as

$$S_R^+ = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} I(Y_j - X_i > 0) , \qquad (2.4.2)$$

where I(t > 0) is 1 or 0 according as t > 0 or t ≤ 0. Let Y have distri-
bution function G, let X have distribution function F , and let X and Y be
independent. Then

$$E[I(Y - X > 0)] = E[P(Y > X \mid X)] = E[1 - G(X)] = E[F(Y)] ,$$

where the equalities follow from conditioning and the independence of X and
Y . The results then follow.


Theorem 2.4.5. The variances of SR+ under the models (a)-(c) are:

(a) $\mathrm{Var}[S_R^+] = n_1 n_2 \left\{ E[G(X)] - E^2[G(X)] \right\} + n_1 n_2 (n_1 - 1)\mathrm{Var}[F(Y)] + n_1 n_2 (n_2 - 1)\mathrm{Var}[G(X)]$

(b) $\mathrm{Var}[S_R^+] = n_1 n_2 \left\{ E[F(X-\Delta)] - E^2[F(X-\Delta)] \right\} + n_1 n_2 (n_1 - 1)\mathrm{Var}[F(Y)] + n_1 n_2 (n_2 - 1)\mathrm{Var}[F(X-\Delta)]$

(c) $\mathrm{Var}[S_R^+] = \dfrac{n_1 n_2 (n+1)}{12}$ .
Proof: Again, only result (a) is obtained. Using the indicator formulation
of SR+ , (2.4.2), we have

$$\mathrm{Var}[S_R^+] = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \mathrm{Var}[I(Y_j - X_i > 0)] + \sum\!\sum\!\sum\!\sum \mathrm{Cov}[I(Y_j - X_i > 0),\, I(Y_k - X_l > 0)] ,$$

where the sums for the covariance terms are over all possible combinations
except (i, j) = (l, k). For the first term, note that the variance of I(Y − X > 0)
is

$$\mathrm{Var}[I(Y > X)] = E[I(Y > X)] - E^2[I(Y > X)] = E[1 - G(X)] - E^2[1 - G(X)] = E[G(X)] - E^2[G(X)] .$$

This yields the first term in (a). For the covariance terms, note that a co-
variance is 0 unless either j = k or i = l. This leads to the following two
cases:

Case (i) For the covariance terms with j = k and i ≠ l, we need
E[I(Y > X1 )I(Y > X2 )], which is

$$E[I(Y > X_1) I(Y > X_2)] = P[Y > X_1, Y > X_2] = E\left[ P[Y > X_1 \mid Y]\, P[Y > X_2 \mid Y] \right] = E[F(Y)^2] .$$

There are n2 ways to get a j and n1 (n1 − 1) ways to get i ≠ l; hence
there are n1 n2 (n1 − 1) covariances of this form. This leads to the second
term of (a).


Case (ii) The terms for the covariances where i = l and j ≠ k follow similarly
to Case (i). This leads to the third and final term of (a).

The last two theorems suggest that the random variable

$$Z = \frac{S_R^+ - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n+1)}{12}}}$$

has an approximate N(0, 1) distribution under H0 . This follows from the next
results, which yield the asymptotic distribution of SR+ under general alternatives
as well as under the null hypothesis. We obtain these results by projecting our
statistic SR+ onto a set of linear combinations of independent random
variables. Then we can use central limit theory on the projection. See Hájek
and Šidák (1967) for a discussion of this technique.
Let T = T (Z1 , . . . , Zn ) be a random variable based on a sample Z1 , . . . , Zn
such that E [T ] = 0. Let

$$p_k^*(x) = E[T \mid Z_k = x] , \quad k = 1, \ldots, n .$$

Next define the random variable Tp to be

$$T_p = \sum_{k=1}^{n} p_k^*(Z_k) . \qquad (2.4.3)$$

In the next theorem we show that Tp is the projection of T onto the space of
linear functions of Z1 , . . . , Zn . Note that, unlike T , Tp is a linear combination
of independent random variables; hence, its asymptotic distribution is often
easier to obtain than that of T . As the following projection theorem shows,
it is in a sense the “closest” linear function of the form $\sum p_i(Z_i)$ to T .

Theorem 2.4.6. If $W = \sum_{i=1}^{n} p_i(Z_i)$, then $E[(T - W)^2]$ is minimized by
taking $p_i(x) = p_i^*(x)$. Furthermore, $E[(T - T_p)^2] = \mathrm{Var}[T] - \mathrm{Var}[T_p]$.

Proof: First note that E [p∗k (Zk )] = 0. We have

$$E[(T - W)^2] = E\{[(T - T_p) - (W - T_p)]^2\} = E[(T - T_p)^2] + E[(W - T_p)^2] - 2E[(T - T_p)(W - T_p)] . \qquad (2.4.4)$$

We can write one-half the cross-product term as

$$\sum_{i=1}^{n} E[(T - T_p)(p_i(Z_i) - p_i^*(Z_i))] = \sum_{i=1}^{n} E\{ E[(T - T_p)(p_i(Z_i) - p_i^*(Z_i)) \mid Z_i] \} .$$

The right side is equivalent to

$$\sum_{i=1}^{n} E\left\{ (p_i(Z_i) - p_i^*(Z_i))\, E\left[ T - \sum_{j=1}^{n} p_j^*(Z_j) \,\Big|\, Z_i \right] \right\} .$$


The conditional expectation within the inner set of brackets is

$$\left( E[T \mid Z_i] - p_i^*(Z_i) \right) - \sum_{j \ne i} E[p_j^*(Z_j)] = 0 - 0 = 0 .$$

Hence the cross-product term is zero, and therefore the left side of expression
(2.4.4) is minimized with respect to W by taking W = Tp . Also, since this
holds in particular for W = 0, we get

$$E[T^2] = E[(T - T_p)^2] + E[T_p^2] .$$

Since both T and Tp have zero means, the second result of the theorem also
follows.
From these results a strategy for obtaining the asymptotic distribution of
T is apparent: find the asymptotic distribution of its projection Tp
and then show Var [T ] − Var [Tp ] → 0 as n → ∞. This implies that T and
Tp have the same asymptotic distribution; see Exercise 2.13.7. We apply this
strategy to get the asymptotic distribution of the rank-based methods. As a
first step, we obtain the projection of SR+ − E [SR+ ] under the general model.

Theorem 2.4.7. Under the general model, the projection of the random vari-
able SR+ − E [SR+ ] is

$$T_p = n_1 \sum_{j=1}^{n_2} \left( F(Y_j) - E[F(Y_j)] \right) - n_2 \sum_{i=1}^{n_1} \left( G(X_i) - E[G(X_i)] \right) . \qquad (2.4.5)$$

Proof: Define the n random variables Z1 , . . . , Zn by

$$Z_i = \begin{cases} X_i & \text{if } 1 \le i \le n_1 \\ Y_{i-n_1} & \text{if } n_1 + 1 \le i \le n \end{cases} .$$

We have

$$p_k^*(x) = E[S_R^+ \mid Z_k = x] - E[S_R^+] = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} E[I(Y_j > X_i) \mid Z_k = x] - E[S_R^+] . \qquad (2.4.6)$$

There are two cases, depending on whether 1 ≤ k ≤ n1 or n1 + 1 ≤ k ≤
n1 + n2 = n.

Case (1) Suppose 1 ≤ k ≤ n1 . Then the conditional expectation in expression
(2.4.6), depending on the value of i, becomes

(a) i ≠ k: E[I(Yj > Xi ) | Xk = x] = E[I(Yj > Xi )] = P [Y > X] ;
(b) i = k: E[I(Yj > Xi ) | Xi = x] = P [Y > X | X = x] = 1 − G(x) .


Hence, in this case,

$$p_k^*(x) = n_2(n_1 - 1) P[Y > X] + n_2 (1 - G(x)) - E[S_R^+] .$$

Case (2) Next suppose that n1 + 1 ≤ k ≤ n. Then the conditional expectation
in expression (2.4.6), depending on the value of j, becomes

(a) j ≠ k: E[I(Yj > Xi ) | Yk = x] = P [Y > X] ;
(b) j = k: E[I(Yj > Xi ) | Yj = x] = F (x) .

Hence, in this case,

$$p_k^*(x) = n_1(n_2 - 1) P[Y > X] + n_1 F(x) - E[S_R^+] .$$

Combining these results, we get

$$T_p = \sum_{i=1}^{n_1} p_i^*(X_i) + \sum_{j=1}^{n_2} p_j^*(Y_j) = n_1 n_2 (n_1 - 1) P[Y > X] + n_2 \sum_{i=1}^{n_1} (1 - G(X_i)) + n_1 n_2 (n_2 - 1) P[Y > X] + n_1 \sum_{j=1}^{n_2} F(Y_j) - n E[S_R^+] .$$

This can be simplified by noting that

$$P(Y > X) = E[P(Y > X \mid X)] = E[1 - G(X)] ,$$

or, similarly, P (Y > X) = E [F (Y )]. From (a) of Theorem 2.4.4,

$$E[S_R^+] = n_1 n_2 (1 - E[G(X)]) = n_1 n_2 P(Y > X) .$$

Substituting these three results into the above expression for Tp , we get the
desired result.
The next corollary follows immediately.

Corollary 2.4.1. Under the general model, if Tp is given by (2.4.5), then

$$\mathrm{Var}(T_p) = n_1^2 n_2 \mathrm{Var}(F(Y)) + n_1 n_2^2 \mathrm{Var}(G(X)) .$$


From this it follows that Tp should be standardized as

$$T_p^* = \frac{1}{\sqrt{n\, n_1 n_2}}\, T_p .$$

In order to obtain the asymptotic distribution of Tp , and subsequently of SR+ ,
we need the following assumption on the design (sample sizes):

$$(D.1): \quad \frac{n_i}{n} \to \lambda_i , \quad 0 < \lambda_i < 1 . \qquad (2.4.7)$$

This says that the sample sizes go to ∞ at the same rate. Note that λ1 + λ2 = 1.
The asymptotic variance of Tp∗ is thus

$$\mathrm{Var}(T_p^*) \to \lambda_1 \mathrm{Var}(F(Y)) + \lambda_2 \mathrm{Var}(G(X)) .$$

We first want to obtain the asymptotic distribution under general alterna-
tives. In order to do this we need an assumption concerning the ranges of X
and Y . The support of a continuous random variable with distribution func-
tion H and density h is defined to be the set {x : h(x) > 0}, which is denoted
by S(H).
Our second assumption states that the intersection of the supports of F
and G has a nonempty interior; that is,

(E.3): There is an open interval I such that I ⊂ S(F ) ∩ S(G) . (2.4.8)

Note that the asymptotic variance of Tp∗ is not zero under (E.3).
We are now in the position to find the asymptotic distribution of Tp∗ .

Theorem 2.4.8. Under the general model and the assumptions (D.1) and
(E.3), Tp∗ has an asymptotic N (0, λ1 Var (F (Y )) + λ2 Var (G(X))) distribu-
tion.

Proof: By (2.4.5) we can write

$$T_p^* = \sqrt{\frac{n_1}{n\, n_2}} \sum_{j=1}^{n_2} \left( F(Y_j) - E[F(Y_j)] \right) - \sqrt{\frac{n_2}{n\, n_1}} \sum_{i=1}^{n_1} \left( G(X_i) - E[G(X_i)] \right) . \qquad (2.4.9)$$

Note that both sums on the right side of expression (2.4.9) are composed of in-
dependent and identically distributed random variables and that the sums are
independent of one another. The result then follows immediately by applying
the simple central limit theorem to each sum.
This is the key result we need in order to obtain the asymptotic distribution
of our test statistic SR+ . We first obtain the result under the general model and
then under the null hypothesis. As we show, both results are immediate.


Theorem 2.4.9. Under the general model and the conditions (E.3) and (D.1),
the random variable $(S_R^+ - E[S_R^+]) / \sqrt{\mathrm{Var}(S_R^+)}$ has a limiting N(0, 1) distribution.

Proof: By the last theorem and Theorem 2.4.6, we need only show that the
difference in the variances of $S_R^+/\sqrt{n\, n_1 n_2}$ and Tp∗ goes to 0 as n → ∞. Note
that

$$\mathrm{Var}\left( \frac{S_R^+}{\sqrt{n\, n_1 n_2}} \right) = \frac{1}{n}\left( E[G(X)] - E^2[G(X)] \right) + \frac{n_1 - 1}{n}\mathrm{Var}(F(Y)) + \frac{n_2 - 1}{n}\mathrm{Var}(G(X)) .$$

Hence, $\mathrm{Var}(T_p^*) - \mathrm{Var}(S_R^+/\sqrt{n\, n_1 n_2}) \to 0$ and the result follows from Exercise
2.13.7.
The asymptotic distribution of the test statistic under the null hypothesis
follows immediately from this theorem. We record it in the next corollary.

Corollary 2.4.2. Under H0 : F (x) = G(x) and (D.1) only, the test statistic
SR+ is approximately $N\left( \frac{n_1 n_2}{2}, \frac{n_1 n_2 (n+1)}{12} \right)$.

Therefore an asymptotic size α test for H0 : F (x) = G(x) versus HA :
F (x) ≠ G(x) is to reject H0 if |z| ≥ zα/2 , where

$$z = \frac{S_R^+ - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n+1)}{12}}} \quad\text{and}\quad 1 - \Phi(z_{\alpha/2}) = \alpha/2 .$$
Since we approximate a discrete random variable with a continuous one, we
think it is advisable in cases of small samples to use a continuity correction.
Fix and Hodges (1955) give an Edgeworth approximation to the distribution
of SR+ and Bickel (1974) discusses the error of this approximation.
Since the standard normal distribution function Φ is continuous on the
entire real line, we can strengthen the convergence in Theorem 2.4.9 to uniform
convergence; that is, the distribution function of the standardized MWW con-
verges uniformly to Φ. Using this, it is not hard to show that the standardized
critical values of the MWW converge to their counterparts at the standard
normal. Thus if cα,n is the MWW critical value defined by α = PH0 [SR+ ≥ cα,n ],
then

$$\frac{c_{\alpha,n} - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n+1)}{12}}} \to z_\alpha , \qquad (2.4.10)$$


where 1 − α = Φ(zα ); see Exercise 2.13.8 for details. This result proves useful
in the next section.
We now consider when the test based on SR+ is consistent. Consider the
general setup; i.e., X1 , . . . , Xn1 is a random sample with distribution function
F (x) and Y1 , . . . , Yn2 is a random sample with distribution function G(x).
Consider the hypotheses

H0 : F = G versus HA1 : F (x) ≥ G(x) with F (x0 ) > G(x0 ) for some x0 ,
(2.4.11)

where x0 ∈ Int(S(F ) ∩ S(G)). Such an alternative is called a stochastically
ordered alternative. The next theorem shows that the MWW test statistic
is consistent for this alternative. Likewise it is consistent for the other one-
sided stochastically ordered alternative with F and G interchanged, HA2 , and,
also, for the two-sided alternative which consists of the union of HA1 and HA2 .
These results imply that the MWW test is consistent for location alternatives,
provided F and G have overlapping support. As Exercise 2.13.9 shows, it is
also consistent when one support is shifted to the right of the other support.

Theorem 2.4.10. Suppose that the assumptions (D.1), (2.4.7), and (E.3),
(2.4.8), hold. Under the stochastically ordered alternatives given above, the
test based on SR+ is consistent.

Proof: Assume the stochastically ordered alternative HA1 , (2.4.11). For an arbi-
trary level α, select the critical level cα such that the test that rejects H0 if
SR+ ≥ cα has asymptotic level α. We want to show that the power of the test
goes to 1 as n → ∞. Since F (x0 ) > G(x0 ) for some point x0 in the interior of
S(F ) ∩ S(G), there exists an interval N such that F (x) > G(x) on N. Hence

$$E_{H_A}[G(X)] = \int_N G(y)f(y)\,dy + \int_{N^c} G(y)f(y)\,dy < \int_N F(y)f(y)\,dy + \int_{N^c} F(y)f(y)\,dy = \frac{1}{2} . \qquad (2.4.12)$$

The power of the test is given by

$$P_{H_A}[S_R^+ \ge c_\alpha] = P_{H_A}\!\left[ \frac{S_R^+ - E_{H_A}(S_R^+)}{\sqrt{V_{H_A}(S_R^+)}} \ge \frac{c_\alpha - (n_1 n_2/2)}{\sqrt{V_{H_A}(S_R^+)}} + \frac{(n_1 n_2/2) - E_{H_A}(S_R^+)}{\sqrt{V_{H_A}(S_R^+)}} \right] .$$

Note by (2.4.10) that

$$\frac{c_\alpha - (n_1 n_2/2)}{\sqrt{V_{H_A}(S_R^+)}} = \frac{c_\alpha - (n_1 n_2/2)}{\sqrt{V_{H_0}(S_R^+)}}\, \frac{\sqrt{V_{H_0}(S_R^+)}}{\sqrt{V_{H_A}(S_R^+)}} \to z_\alpha \kappa ,$$


where κ is a real number (since the variances are of the same order). But by
(2.4.12),

$$\frac{(n_1 n_2/2) - E_{H_A}(S_R^+)}{\sqrt{V_{H_A}(S_R^+)}} = \frac{n_1 n_2 \left( -\frac{1}{2} + E_{H_A}[G(X)] \right)}{\sqrt{V_{H_A}(S_R^+)}} \to -\infty .$$

By Theorem 2.4.9, under HA the random variable $(S_R^+ - E_{H_A}(S_R^+))/\sqrt{V_{H_A}(S_R^+)}$
converges in distribution to a standard normal variate. Since the convergence is
uniform, it follows from the above limits that the power converges to 1. Hence,
the MWW test is consistent.

2.4.2 Confidence Intervals


Consider the location model (2.2.4). We next obtain a distribution-free confi-
dence interval for ∆ by inverting the MWW test. As a first step, we have the
following result on the function SR+ (∆), (2.2.20):
Lemma 2.4.1. SR+ (∆) is a decreasing step function of ∆ which steps down
by 1 at each difference Yj − Xi . Its maximum is n1 n2 and its minimum is 0.
Proof: Let D(1) ≤ · · · ≤ D(n1 n2 ) denote the ordered n1 n2 differences Yj − Xi .
The results follow immediately by writing SR+ (∆) = #(D(i) > ∆).
Let α be given and choose cα/2 to be the lower α/2 critical point of the
MWW distribution; i.e., P∆ [SR+ (∆) ≤ cα/2 ] = α/2. By the above lemma we
have
 
1 − α = P∆ [cα/2 < SR+ (∆) < n1 n2 − cα/2 ]
      = P∆ [D(cα/2 +1) ≤ ∆ < D(n1 n2 −cα/2 ) ] .

Thus [D(cα/2 +1) , D(n1 n2 −cα/2 ) ) is a (1 − α)100% confidence interval for ∆; com-
pare (1.3.30). From the asymptotic null distribution theory for SR+ , Corollary
(2.4.2), we can approximate cα/2 as
cα/2 ≈ n1 n2 /2 − zα/2 √(n1 n2 (n + 1)/12) − .5 .      (2.4.13)
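For illustration, the interval and the approximation (2.4.13) are easily computed in R. The following is only a sketch (the function name mww.ci and the data vectors x and y are hypothetical; the Robnp function twosampwil provides the full analysis):

mww.ci <- function(x, y, alpha = 0.05) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  d <- sort(as.vector(outer(y, x, "-")))   # ordered differences D_(1),...,D_(n1n2)
  # lower critical point via the normal approximation (2.4.13)
  c.a <- floor(n1*n2/2 - qnorm(1 - alpha/2)*sqrt(n1*n2*(n + 1)/12) - 0.5)
  c(lower = d[c.a + 1], upper = d[n1*n2 - c.a])   # [D_(c+1), D_(n1n2-c))
}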

2.4.3 Statistical Properties of the Inference Based on the MWW
In this section we derive the efficiency properties of the MWW test statistic
and properties of its power function under the location model (2.2.4).


We begin with an investigation of the power function of the MWW test.


For definiteness we consider the one-sided alternative,

H0 : ∆ = 0 versus HA : ∆ > 0 . (2.4.14)

Results similar to those given below can be obtained for the power function of
the other one-sided and the two-sided alternatives. Given a level α, let cα,n1 ,n2
denote the upper critical value for the MWW test of this hypothesis; hence,
the test rejects H0 if SR+ ≥ cα,n1 ,n2 . The power function of this test is given by

γ(∆) = P∆ [SR+ ≥ cα,n1 ,n2 ] , (2.4.15)

where the subscript ∆ on P denotes that the probability is determined when


the true parameter is ∆. Recall that SR+ (∆) = #{Yj − Xi > ∆}.
The following theorem proves useful; its proof is similar to that of Theorem
1.3.1 of Chapter 1 and the more general result Theorem A.2.4 of the Appendix.

Theorem 2.4.11. For all t, P∆ [SR+ (0) ≥ t] = P0 [SR+ (−∆) ≥ t].

From Lemma 2.4.1 and Theorem 2.4.11 we have our first important result
on the power function of the MWW test; namely, that it is monotone.

Theorem 2.4.12. For the above hypotheses (2.4.14), the function γ(∆) is
monotonically increasing in ∆.

Proof: Let ∆1 < ∆2 . Then −∆2 < −∆1 and, hence, from Lemma 2.4.1, we
have SR+ (−∆2 ) ≥ SR+ (−∆1 ). By applying Theorem 2.4.11, the desired result,
γ(∆2 ) ≥ γ(∆1 ), follows from the following:

1 − γ(∆2 ) = P∆2 [SR+ (0) < cα,n1 ,n2 ]


= P0 [SR+ (−∆2 ) < cα,n1 ,n2 ]
≤ P0 [SR+ (−∆1 ) < cα,n1 ,n2 ]
= P∆1 [SR+ (0) < cα,n1 ,n2 ]
= 1 − γ(∆1 ).

From this we immediately have that the MWW test is unbiased; that is,
its power function evaluated at an alternative is always at least as large as its
level of significance. We state it as a corollary.

Corollary 2.4.3. For the above hypotheses (2.4.14), γ(∆) ≥ α for all ∆ > 0.

A more general null hypothesis is given by

H0∗ : ∆ ≤ 0 versus HA : ∆ > 0 .


If T is any test for these hypotheses with critical region C then we say T is a
size α test provided
sup P∆ [T ∈ C] = α .
∆≤0

For selected α, it follows from the monotonicity of the MWW power function
that the MWW test has size α for this more general null hypothesis.
From the above theorems, we have that the MWW power function is mono-
tonically increasing in ∆. Since SR+ (∆) achieves its maximum for ∆ finite, we
have by Theorem 1.5.2 of Chapter 1 that the MWW test is resolving; hence,
its power function approaches one as ∆ → ∞. Even for the location model,
though, we cannot get the power function of the MWW test in closed form.
For local alternatives, however, we can obtain an asymptotic expression for the
power function. Applications of this result include sample size determination
for the MWW test and efficiency comparisons of the MWW with other tests,
both of which we consider.
We need the assumption that the density f (x) has finite Fisher Infor-
mation, i.e.,
(E.1) f is absolutely continuous, 0 < I(f ) = ∫01 ϕf²(u) du < ∞ ,      (2.4.16)

where
ϕf (u) = −f ′(F ⁻¹(u))/f (F ⁻¹(u)) .      (2.4.17)
As discussed in Section 3.4, assumption (E.1) implies that f is uniformly
bounded.
Once again we consider the one-sided alternative, (2.4.14) (similar results
hold for the other one-sided and two-sided alternatives). Consider a sequence
of local alternatives of the form
HAn : ∆n = δ/√n ,      (2.4.18)

where δ > 0 is arbitrary but fixed.


As a first step, we need to show that SR+ (∆) is Pitman Regular as discussed
in Chapter 1. Let S̄R+ (∆) = SR+ (∆)/(n1 n2 ). We need to verify the four condi-
tions of Definition 1.5.3 of Chapter 1. The first condition is true by Lemma
2.4.1 and the fourth condition follows from Corollary 2.4.2. By (b) of Theorem
2.4.4, we have
µ(∆) = E∆ [S̄R+ (0)] = 1 − E[F (X − ∆)] .      (2.4.19)

By assumption (E.1), (2.4.16), ∫ f²(x) dx ≤ sup f ∫ f (x) dx < ∞. Hence, dif-
ferentiating (2.4.19), we obtain µ′(0) = ∫ f²(x) dx > 0 and, thus, the second


condition is true. Hence we need only show that the third condition, asymptotic linearity of S̄R+ (∆), is true. This follows provided we can show the variance
condition (1.5.18) of Theorem 1.5.6 is true. Note that

S̄R+ (0) − S̄R+ (δ/√n) = (n1 n2 )⁻¹ #(0 < Yj − Xi ≤ δ/√n) .

This is similar to the MWW statistic itself. Using essentially the same ar-
gument as that for the variance of the MWW statistic, Theorem 2.4.5, we get

n Var0 [S̄R+ (δ/√n) − S̄R+ (0)] = (n/(n1 n2 ))(an − an²) + (n(n1 − 1)/(n1 n2 ))(bn − cn²)
                               + (n(n2 − 1)/(n1 n2 ))(dn − an²) ,

where an = E0 [F (X + δ/√n) − F (X)], bn = E0 [(F (Y ) − F (Y − δ/√n))²],
cn = E0 [F (Y ) − F (Y − δ/√n)], and dn = E0 [(F (X + δ/√n) − F (X))²].
Using the Lebesgue Dominated Convergence Theorem, it is easy to see that
an , bn , cn , and dn all converge to 0. Therefore Condition (1.5.18) of Theorem
1.5.6 holds and we have thus established the asymptotic linearity result given
by:
sup_{|δ|≤B} | n^{1/2} S̄R+ (δ/√n) − n^{1/2} S̄R+ (0) + δ ∫ f²(x) dx | →P 0 ,      (2.4.20)

for any B > 0. Therefore, it follows that SR+ (∆) is Pitman Regular.
In order to get the efficacy of the MWW test, we need the quantity σ²(0)
defined by

σ²(0) = lim_{n→∞} n Var0 (S̄R+ (0)) = lim_{n→∞} (n/(n1² n2²)) (n1 n2 (n + 1)/12) = (12λ1 λ2 )⁻¹ ;
see expression (1.5.13). Therefore by (1.5.12) the efficacy of the MWW test
is
cMWW = µ′(0)/σ(0) = √(λ1 λ2 ) √12 ∫ f²(x) dx = √(λ1 λ2 ) τ⁻¹ ,      (2.4.21)

where τ is the scale parameter given by

τ = ( √12 ∫ f²(x) dx )⁻¹ .      (2.4.22)

In Exercise 2.13.10 it is shown that the efficacy of the two-sample pooled
t-test is √(λ1 λ2 ) σ⁻¹, where σ² is the common variance of X and Y . Hence the


efficiency of the MWW test to the two-sample t-test is the ratio σ 2 /τ 2 . This
of course is the same efficiency as that of the signed rank Wilcoxon test to
the one-sample t-test; see (1.7.13). In particular if the distribution of X is
normal then the efficiency of the MWW test to the two-sample t-test is .955.
For heavier tailed distributions, this efficiency is usually larger than 1; see
Example 1.7.1.
As in Chapter 1 it is convenient to summarize the asymptotic linearity
result as follows:
√n { (S̄R+ (δ/√n) − µ(0))/σ(0) } = √n { (S̄R+ (0) − µ(0))/σ(0) } − cMWW δ + op (1) ,      (2.4.23)

uniformly for |δ| ≤ B and any B > 0.


The next theorem is the asymptotic power lemma for the MWW test. As in
Chapter 1, (see Theorem 1.5.8), its proof follows from the Pitman Regularity
of the MWW test.

Theorem 2.4.13. Under the sequence of local alternatives, (2.4.18),


lim_{n→∞} γ(∆n ) = P0 [Z ≥ zα − cMWW δ] = 1 − Φ( zα − √(12λ1 λ2 ) (∫ f²) δ ) ,

where Z is N(0, 1).

In Exercise 2.13.10, it is shown that if γLS (∆) denotes the power function
of the usual two-sample t-test then
lim_{n→∞} γLS (∆n ) = 1 − Φ( zα − √(λ1 λ2 ) δ/σ ) ,      (2.4.24)

where σ 2 is the common variance of X and Y . By comparing these two power


functions, it is seen that the Wilcoxon is asymptotically more powerful if
τ < σ, i.e., if e = c²MWW /c²t > 1.
As an application of the asymptotic power lemma, we consider sample
size determination. Consider the MWW test for the one-sided hypothesis
(2.4.14). Suppose the level, α, and the power, β, for a particular alternative ∆A
are specified. For convenience, assume equal sample sizes, i.e., n1 = n2 = n*,
where n* denotes the common sample size; hence, λ1 = λ2 = 2⁻¹. Express ∆A
as √(2n*) ∆A /√(2n*). Then by Theorem 2.4.13 we have

β ≈ 1 − Φ( zα − √(1/4) √(2n*) ∆A /τ ) ,


but this implies


zβ = zα − τ⁻¹ √n* ∆A /√2
and      (2.4.25)
n* = 2τ² ((zα − zβ )/∆A )² .
The above value of n∗ is the approximate sample size. Note that it does depend
on τ which, in applications, would have to be guessed or estimated in a pilot
study; see the discussion in Section 2.4.5 (estimates of τ are discussed in
Sections 2.4.5 and 3.7.1). For a specified distribution it can be evaluated; for
instance, if the underlying density is assumed to be normal with standard
deviation σ, then τ = √(π/3) σ.
Using (2.4.24) a similar derivation can be obtained for the usual two-sample
t-test, resulting in an approximate sample size of
n*LS = 2σ² ((zα − zβ )/∆A )² .
The ratio of the sample size needed by the MWW test to that of the two-
sample t-test is τ 2 /σ 2 . This provides additional motivation for the definition
of efficiency.
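As a quick illustration, the sample size formula (2.4.25) is a one-liner in R. The sketch below uses the convention 1 − β = Φ(zβ ) adopted above; the function name n.star is hypothetical:

n.star <- function(alpha, power, Delta.A, tau) {
  za <- qnorm(1 - alpha)    # one-sided level alpha
  zb <- qnorm(1 - power)    # 1 - beta = Phi(z_beta), beta = desired power
  ceiling(2 * tau^2 * ((za - zb)/Delta.A)^2)    # (2.4.25)
}
# For the LS version replace tau by sigma; e.g., for normal errors with
# sigma = 1, tau = sqrt(pi/3), so n.star(.05, .80, 1, sqrt(pi/3)).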

2.4.4 Estimation of ∆
Recall from the geometry earlier in this chapter that the estimate of ∆ based
on the rank pseudo-norm is ∆̂R = med_{i,j} {Yj − Xi }, (2.2.18). We now obtain
several properties of this estimate including its asymptotic distribution. This
leads again to the efficiency properties of the rank-based methods discussed
in the last section.
For convenience, we note some equivariances of ∆̂R = ∆̂(Y, X), which are
established in Exercise 2.13.11. First, ∆̂R is translation equivariant; i.e.,

∆̂R (Y + ∆ + θ, X + θ) = ∆̂R (Y, X) + ∆ ,

for any ∆ and θ. Second, ∆̂R is scale equivariant; i.e.,

∆̂R (aY, aX) = a ∆̂R (Y, X) ,

for any a. Based on these we next show that ∆̂R is an unbiased estimate of ∆
under certain conditions.
Theorem 2.4.14. If the errors, e*i , in the location model (2.2.4) are symmetrically distributed about 0, then ∆̂R is symmetrically distributed about ∆.


Proof: Due to translation equivariance there is no loss of generality in assuming


that ∆ and θ are 0. Then Y and X are symmetrically distributed about 0;
hence, L(Y ) = L(−Y ) and L(X) = L(−X). Thus from the above equivariance
properties we have,
L(−∆̂(Y, X)) = L(∆̂(−Y, −X)) = L(∆̂(Y, X)) .

Therefore ∆̂R is symmetrically distributed about 0, and, in general, it is symmetrically distributed about ∆.
Theorem 2.4.15. Under Model (2.2.4), if n1 = n2 , then ∆̂R is symmetrically
distributed about ∆.
The reader is asked to prove this in Exercise 2.13.12. In general, ∆̂R may
be biased if the error distribution is not symmetrically distributed but, as the
following result shows, ∆̂R is always asymptotically unbiased. Since the MWW
process SR+ (∆) was shown to be Pitman Regular, the asymptotic distribution
of √n(∆̂ − ∆) is N(0, cMWW⁻²). In practice, we say

∆̂R has an approximate N(∆, τ²(n1⁻¹ + n2⁻¹)) distribution,

where τ was defined in (2.4.22).


Recall from Definition 1.5.4 of Chapter 1, that the asymptotic relative effi-
ciency of two Pitman Regular estimators is the reciprocal of the ratio of their
asymptotic variances. As Exercise 2.13.10 shows, the least squares estimate
∆̂LS = Ȳ − X̄ of ∆ is approximately N(∆, σ²(1/n1 + 1/n2 )); hence,

e(∆̂R , ∆̂LS ) = σ²/τ² = 12σf² (∫ f²(x) dx)² .

This agrees with the asymptotic relative efficiency results for the MWW test
relative to the t-test and (1.7.13).
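In R both estimates are immediate; a minimal sketch (the vectors x and y are hypothetical):

delta.R  <- median(outer(y, x, "-"))   # Hodges-Lehmann estimate, (2.2.18)
delta.LS <- mean(y) - mean(x)          # least squares estimate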

2.4.5 Efficiency Results Based on Confidence Intervals


Let L1−α be the length of the (1 −α)100% distribution-free confidence interval
based on the MWW statistic discussed in Section 2.4.2. Since this interval is
based on the Pitman Regular process SR+ (∆), it follows from Theorem 1.5.9
of Chapter 1 that

√(n1 n2 /n) · L1−α /(2zα/2 ) →P τ ;      (2.4.26)
that is, the standardized length of a distribution-free confidence interval is a
consistent estimate of the scale parameter τ . It further follows from (2.4.26)


that, as in Chapter 1, if efficiency is based on the relative squared asymptotic


lengths of confidence intervals then we obtain the same efficiency results as
quoted above for tests and estimates.
In the Robnp computational function twosampwil a simple degree of free-
dom adjustment is made in the estimation of τ . In the traditional LS analysis
based on the pooled t, this adjustment is equivalent to dividing the pooled
estimate of variance by n1 + n2 − 2 instead of n1 + n2 . Hence, as our estimate
of τ , the function twosampwil uses
τ̂ = √((n1 + n2 )/(n1 + n2 − 2)) √(n1 n2 /n) · L1−α /(2zα/2 ) .      (2.4.27)

Thus the standard error (SE) of the estimator ∆̂R is given by

τ̂ √((1/n1 ) + (1/n2 )) .
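The following R lines sketch the computation of (2.4.27) and the SE; the input L.ci is the length of the distribution-free interval, and the names are hypothetical:

tau.hat <- function(L.ci, n1, n2, alpha = 0.05) {
  n <- n1 + n2
  sqrt((n1 + n2)/(n1 + n2 - 2)) * sqrt(n1*n2/n) * L.ci/(2*qnorm(1 - alpha/2))
}
# SE of the estimate:  tau.hat(L.ci, n1, n2) * sqrt(1/n1 + 1/n2)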
The distribution-free confidence interval is not symmetric about ∆̂R . Often
in practice symmetric intervals are desired. Based on the asymptotic distribution of ∆̂R , we can formulate the approximate interval

∆̂R ± zα/2 τ̂ √(1/n1 + 1/n2 ) ,      (2.4.28)
where τ̂ is a consistent estimate of τ . If we use (2.4.26) as our estimate of τ
with the level α, then the confidence interval simplifies to

∆̂R ± L1−α /2 .      (2.4.29)
Besides the estimate given in (2.4.26), a consistent estimate of τ was pro-
posed by Koul, Sievers, and McKean (1987) and is discussed in Section 3.7.
Using this estimate small sample studies indicate that zα/2 should be replaced
by the t critical value t(α/2,n−1) ; see McKean and Sheather (1991) for a review
of small sample studies on R estimates. In this case, the symmetric confidence
interval based on ∆̂R is directly analogous to the usual t interval based on
least squares in that the only difference is that σ̂ is replaced by τ̂ .

Example 2.4.1 (Hendy and Charles Coin Data, continued from Examples
1.11.1 and 2.3.2). Recall from Chapter 1 that this example concerned the
silver content in two coinages (the first and the fourth) minted during the
reign of Manuel I. The data are given in Chapter 1. The Hodges-Lehmann
estimate of the difference between the first and the fourth coinage is 1.10% of
silver and a 95% confidence interval for the difference is (.60, 1.70). The length
of this confidence interval is 1.10; hence, the estimate of τ given in expression
(2.4.27) is 0.595. The symmetrized confidence interval (2.4.28) based on the t-
upper .025 critical value is (0.46, 1.74). Both of these intervals are in agreement


with the confidence interval obtained in Example 1.11.1 based on the two L1
confidence intervals.

Another estimate of τ can be obtained from a similar consideration of


the distribution-free confidence intervals based on the signed-rank statistic
discussed in Chapter 1; see Exercise 2.13.13. Note in this case for consistency,
though, we would have to assume that f is symmetric.

2.5 General Rank Scores


In this section we are concerned with the location model; i.e., X1 , . . . , Xn1 are
iid F (x), Y1 , . . . , Yn2 are iid G(x) = F (x−∆), and the samples are independent
of one another. We present an analysis for this problem based on general rank
scores. In this terminology, the Mann-Whitney-Wilcoxon procedures are based
on a linear score function. We present the results for the hypotheses

H0 : ∆ = 0 versus HA : ∆ > 0 .      (2.5.1)

The results for the other one-sided and two-sided alternatives are similar. We
are also concerned with estimation and confidence intervals for ∆. As in the
preceding sections, we first present the geometry.
Recall that the pseudo-norm which generated the MWW analysis could
be written as a linear combination of ranks times residuals. This is easily
generalized. Consider the function
‖u‖∗ = Σ_{i=1}^{n} a(R(ui ))ui ,      (2.5.2)

where a(i) are scores such that a(1) ≤ · · · ≤ a(n) and Σ a(i) = 0. For the
next theorem, we also assume that a(i) = −a(n + 1 − i), although this is only
used to show the scalar multiplicative property.
Theorem 2.5.1. Suppose that a(1) ≤ · · · ≤ a(n), Σ a(i) = 0, and a(i) =
−a(n + 1 − i). Then the function ‖ · ‖∗ is a pseudo-norm.

Proof: By the connection between ranks and order statistics we can write
‖u‖∗ = Σ_{i=1}^{n} a(i)u(i) .


Next suppose that u(j) is the last order statistic with a negative score. Since
the scores sum to 0, we can write
‖u‖∗ = Σ_{i=1}^{n} a(i)(u(i) − u(j) )
     = Σ_{i≤j} a(i)(u(i) − u(j) ) + Σ_{i≥j} a(i)(u(i) − u(j) ) .      (2.5.3)

Both terms on the right side are nonnegative; hence, kuk∗ ≥ 0. Since all the
terms in (2.5.3) are nonnegative, kuk∗ = 0 implies that all the terms are zero.
But since the scores are not all 0, yet sum to zero, we must have a(1) < 0
and a(n) > 0. Hence we must have u(1) = u(j) = u(n) ; i.e., u(1) = · · · = u(n) .
Conversely if u(1) = · · · = u(n) then ‖u‖∗ = 0. By the condition a(i) =
−a(n + 1 − i) it follows that ‖αu‖∗ = |α| ‖u‖∗ ; see Exercise 2.13.16.
In order to complete the proof we need to show the triangle inequality
holds. This is established by the following string of inequalities:
‖u + v‖∗ = Σ_{i=1}^{n} a(R(ui + vi ))(ui + vi )
         = Σ_{i=1}^{n} a(R(ui + vi ))ui + Σ_{i=1}^{n} a(R(ui + vi ))vi
         ≤ Σ_{i=1}^{n} a(i)u(i) + Σ_{i=1}^{n} a(i)v(i)
         = ‖u‖∗ + ‖v‖∗ .

The proof of the above inequality is similar to that of Theorem 1.3.2 of Chapter
1.
Based on a set of scores satisfying the above assumptions, we can establish
a rank inference for the two-sample problem similar to the MWW analysis.
We do so for general rank scores of the form

aϕ (i) = ϕ(i/(n + 1)) , (2.5.4)

where ϕ(u) is a square integrable, nondecreasing function defined on the in-


terval (0, 1) which is standardized as
∫01 ϕ(u) du = 0 and ∫01 ϕ²(u) du = 1 ;      (2.5.5)

see, also, Assumption (S.1), (3.4.10), of Chapter 3. The last assumptions con-
cerning standardization of the scores are for convenience. The Wilcoxon scores
are generated in this way by the linear function ϕR (u) = √12 (u − (1/2)) and


the sign scores are generated by ϕS (u) = sgn(2u − 1). We denote the corre-
sponding pseudo-norm for scores generated by ϕ(u) as
‖u‖ϕ = Σ_{i=1}^{n} aϕ (R(ui ))ui .      (2.5.6)

These two-sample sign and Wilcoxon scores are generalizations of the sign
and Wilcoxon scores discussed in Chapter 1 for the one-sample problem. In
Section 1.8 of Chapter 1 we presented one-sample analyses based on general
score functions. Similar to the sign and Wilcoxon cases, we can generate a
two-sample score function from any one-sample score function. For reference
we establish this in the following theorem:
Theorem 2.5.2. As discussed at the beginning of Section 1.8, let ϕ+ (u) be
a score function for the one-sample problem. For u ∈ (−1, 0), let ϕ+ (u) =
−ϕ+ (−u). Define,

ϕ(u) = ϕ+ (2u − 1) , for u ∈ (0, 1) . (2.5.7)

and

‖x‖ϕ = Σ_{i=1}^{n} ϕ(R(xi )/(n + 1))xi .      (2.5.8)

Then ‖ · ‖ϕ is a pseudo-norm on Rⁿ. Furthermore

ϕ(u) = −ϕ(1 − u) , (2.5.9)

and

∫01 ϕ²(u) du = ∫01 (ϕ+ (u))² du .      (2.5.10)

Proof: As discussed in the beginning of Section 1.8 (see expression (1.8.1)),


ϕ+ (u) is a positive valued and nondecreasing function defined on the interval
(0, 1). Based on these properties, it follows that ϕ(u) is nondecreasing and
that ∫01 ϕ(u) du = 0. Hence ‖ · ‖ϕ is a pseudo-norm on Rⁿ. Properties (2.5.9)
and (2.5.10) follow readily; see Exercise 2.13.17 for details.
The two-sample sign and Wilcoxon scores, cited above, are easily seen to
be generated this way from their one-sample counterparts ϕ+ (u) = 1 and
ϕ+ (u) = √3 u, respectively. As discussed further in Section 2.5.3, properties
such as efficiencies of the analysis based on the one-sample scores are the same
for a two-sample analysis based on their corresponding two-sample scores.

In the notation of (2.2.3), the estimate of ∆ is


b ϕ = Argmin kZ − C∆kϕ .


Denote the negative of the gradient of kZ − C∆kϕ by Sϕ (∆). Then based on


(2.5.6),
Sϕ (∆) = Σ_{j=1}^{n2} aϕ (R(Yj − ∆)) .      (2.5.11)

Hence ∆̂ϕ equivalently solves the equation

Sϕ (∆̂ϕ ) ≐ 0 .      (2.5.12)
As with pseudo-norms in general, the function kZ−C∆kϕ is a convex function
of ∆. The negative of its derivative, Sϕ (∆), is a decreasing step function of ∆
which steps down at the differences Yj − Xi ; see Exercise 2.13.18. Unlike the
MWW function SR (∆), the step sizes of Sϕ (∆) are not necessarily the same
size. Based on MWW starting values, a simple trace algorithm through the
differences can be used to obtain the estimator ∆̂ϕ . The R function twosampr2
computes the rank-based analysis for general scores.
The gradient rank test statistic for the hypotheses (2.5.1) is
Sϕ = Σ_{j=1}^{n2} aϕ (R(Yj )) .      (2.5.13)

Since the test statistic only depends on the ranks of the combined sample it
is distribution free under the null hypothesis. As shown in Exercise 2.13.18,
E0 [Sϕ ] = 0      (2.5.14)
σϕ² = V0 [Sϕ ] = (n1 n2 /(n(n − 1))) Σ_{i=1}^{n} a²(i) .      (2.5.15)

Note that we can write the variance as


σϕ² = (n1 n2 /(n − 1)) { (1/n) Σ_{i=1}^{n} a²(i) } ≈ n1 n2 /(n − 1) ,      (2.5.16)

where the approximation is due to the fact that the term in braces is a Riemann
sum of ∫ ϕ²(u) du = 1 and, hence, converges to 1.
It is convenient from time to time to use rank statistics based on unstan-
dardized scores; i.e., a rank statistic of the form
Sa = Σ_{j=1}^{n2} a(R(Yj )) ,      (2.5.17)

where a(i) = ϕ(i/(n + 1)), i = 1, . . . , n, is a set of scores. As Exercise 2.13.18
shows, the null mean µS and null variance σS² of Sa are given by

µS = n2 ā and σS² = (n1 n2 /(n(n − 1))) Σ_{i=1}^{n} (a(i) − ā)² .      (2.5.18)
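A short R sketch of the statistic (2.5.17) and its standardization via (2.5.18) follows; the function name score.test is hypothetical, ties are assumed absent, and any nondecreasing score function may be passed as phi:

score.test <- function(x, y, phi) {
  z <- c(x, y); n <- length(z); n1 <- length(x); n2 <- length(y)
  a  <- phi((1:n)/(n + 1))                    # scores a(i) = phi(i/(n+1))
  Sa <- sum(a[rank(z)[(n1 + 1):n]])           # sum of the scores of the Y ranks
  muS <- n2*mean(a)                           # null mean, (2.5.18)
  vS  <- n1*n2/(n*(n - 1)) * sum((a - mean(a))^2)   # null variance, (2.5.18)
  (Sa - muS)/sqrt(vS)                         # standardized statistic
}
# e.g., Wilcoxon scores:  score.test(x, y, function(u) sqrt(12)*(u - 0.5))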


2.5.1 Statistical Methods


The asymptotic null distribution of the statistic Sϕ , (2.5.13), easily follows
from Theorem A.2.1 of the Appendix. To see this, note that we can use the
notation (2.2.1) and (2.2.2) to write Sϕ as a linear rank statistic; i.e.,
Sϕ = Σ_{i=1}^{n} ci a(R(Zi )) = Σ_{i=1}^{n} (ci − c̄) ϕ( (n/(n + 1)) Fn (Zi ) ) ,      (2.5.19)
where Fn is the empirical distribution function of Z1 , . . . , Zn . Our score func-
tion ϕ is monotone and square integrable; hence, the conditions on scores in
Section A.2 are satisfied. Also F is continuous so the distributional assumption
is satisfied. Finally, we need only show that the constants ci satisfy conditions,
D.2, (3.4.7), and D.3, (3.4.8). It is a simple exercise to show that
Σ_{i=1}^{n} (ci − c̄)² = n1 n2 /n
max_{1≤i≤n} (ci − c̄)² = max{ n2²/n², n1²/n² } .
Under condition (D.1), (2.4.7), 0 < λi < 1 where lim(ni /n) = λi for i = 1, 2.
Using this along with the last two expressions, it is immediate that Noether’s
condition, (3.4.9), holds for the ci ’s. Thus the assumptions of Section A.2 hold
for the statistic Sϕ .
As in expression (A.2.7) of Section A.2, define the random variable Tϕ as
Tϕ = Σ_{i=1}^{n} (ci − c̄) ϕ(F (Zi )) .      (2.5.20)

By comparing expressions (2.5.19) and (2.5.20), it seems that the variable Tϕ


is an approximation of Sϕ . This follows from Section A.2. Briefly, under H0 the
distribution of Tϕ is approximately normal and Var((Tϕ −Sϕ )/σϕ ) → 0; hence,
Sϕ is asymptotically normal with mean and variance given by expressions
(2.5.14) and (2.5.15), respectively. Hence, an asymptotic level α test of the
hypotheses (2.5.1) is
Reject H0 in favor of HA , if Sϕ ≥ zα σϕ ,
where σϕ is defined by (2.5.15).
As discussed above, the estimate ∆̂ϕ of ∆ solves the equation (2.5.12). The
interval (∆̂L , ∆̂U ) is a (1 − α)100% confidence interval for ∆ (based on the
asymptotic distribution) provided ∆̂L and ∆̂U solve the equations

Sϕ (∆̂U ) ≐ −zα/2 √(n1 n2 /n) and Sϕ (∆̂L ) ≐ zα/2 √(n1 n2 /n) ,      (2.5.21)
where 1 − Φ(zα/2 ) = α/2. As with the estimate of ∆, these equations can be
easily solved with an iterative algorithm; see Exercise 2.13.18.
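Because Sϕ (∆) is a decreasing step function which steps only at the differences Yj − Xi , a crude estimate can also be found by a direct search over these differences. The R sketch below (hypothetical names, ties ignored) illustrates the idea only; twosampr2 implements a proper iterative algorithm:

Sphi <- function(delta, x, y, phi) {            # the process (2.5.11)
  z <- c(x, y - delta); n <- length(z)
  sum(phi(rank(z)[(length(x) + 1):n]/(n + 1)))
}
d <- sort(as.vector(outer(y, x, "-")))          # candidate values of Delta
s <- sapply(d, Sphi, x = x, y = y, phi = function(u) sqrt(12)*(u - 0.5))
delta.hat <- median(d[abs(s) == min(abs(s))])   # crude zero-crossing of S_phi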


2.5.2 Efficiency Results


In order to obtain the efficiency results for these statistics, we first show that
the process Sϕ (∆) is Pitman Regular. For general scores we need to further
assume that the density has finite Fisher information; i.e., satisfies condition
(E.1), (2.4.16). Recall that Fisher information is given by I(f ) = ∫01 ϕf²(u) du,
where

ϕf (u) = −f ′(F ⁻¹(u))/f (F ⁻¹(u)) .      (2.5.22)
Below we show that the score function ϕf is optimal. Define the parameter
τϕ as,

τϕ⁻¹ = ∫ ϕ(u)ϕf (u) du .      (2.5.23)

Estimation of τϕ is discussed in Section 3.7.


To show that the process Sϕ (∆) is Pitman Regular, we show that the
four conditions of Definition 1.5.3 are true. As noted after expression (2.5.12),
Sϕ (∆) is nonincreasing; hence, the first condition holds. For the second con-
dition, note that we can write
Sϕ (∆) = Σ_{i=1}^{n2} a(R(Yi − ∆)) = Σ_{i=1}^{n2} ϕ( (n1 /(n + 1)) Fn1 (Yi − ∆) + (n2 /(n + 1)) Fn2 (Yi ) ) ,
      (2.5.24)
where Fn1 and Fn2 are the empirical cdfs of the samples X1 , . . . , Xn1 and
Y1 , . . . , Yn2 , respectively. Hence, passing to the limit we have,
E0 [ (1/n) Sϕ (∆) ] → λ2 ∫_{−∞}^{∞} ϕ[λ1 F (x) + λ2 F (x − ∆)] f (x − ∆) dx
                    = λ2 ∫_{−∞}^{∞} ϕ[λ1 F (x + ∆) + λ2 F (x)] f (x) dx = µϕ (∆) ;

see Chernoff and Savage (1958) for a rigorous proof of the limit. Differentiating
µϕ (∆) and evaluating the derivative at 0 we obtain
µϕ′(0) = λ1 λ2 ∫_{−∞}^{∞} ϕ′[F (t)] f²(t) dt
       = λ1 λ2 ∫_{−∞}^{∞} ϕ[F (t)] (−f ′(t)/f (t)) f (t) dt
       = λ1 λ2 ∫01 ϕ(u)ϕf (u) du = λ1 λ2 τϕ⁻¹ > 0 .      (2.5.25)

Hence, the second condition is satisfied.


The null asymptotic distribution of Sϕ (0) was established in Section 2.5.1;
hence the fourth condition is true. Thus we need only establish asymptotic


linearity. This result follows from the results for general rank regression statis-
tics which are developed in Section A.2.2 of the Appendix. By Theorem A.2.8
of the Appendix, the asymptotic linearity result for Sϕ (∆) is given by

(1/√n) Sϕ (δ/√n) = (1/√n) Sϕ (0) − τϕ⁻¹ λ1 λ2 δ + op (1) ,      (2.5.26)

uniformly for |δ| ≤ B, where B > 0 and τϕ is defined in (2.5.23).


Therefore, following Definition 1.5.3 of Chapter 1, the estimating function
is Pitman Regular.

By the discussion following (2.5.20), we have that n⁻¹ᐟ² Sϕ (0)/√(λ1 λ2 ) is
asymptotically N(0, 1). The efficacy of the test based on Sϕ is thus given by

cϕ = τϕ⁻¹ λ1 λ2 /√(λ1 λ2 ) = τϕ⁻¹ √(λ1 λ2 ) .      (2.5.27)

As with the MWW analysis, several important items follow immediately


from Pitman Regularity. Consider first the behavior of Sϕ under local alterna-
tives. Specifically consider a level α test based on Sϕ for the hypothesis (2.5.1)
and the sequence of local alternatives Hn : ∆n = δ/√n. As in Chapter 1, it
is easy to show that the asymptotic power of the test based on Sϕ is given by

lim_{n→∞} Pδ/√n [Sϕ ≥ zα σϕ ] = 1 − Φ(zα − δ cϕ ) .      (2.5.28)

Based on this result, sample size determination for the test based on Sϕ can
be conducted similar to that based on the MWW test statistic; see (2.4.25).
Next consider the asymptotic distribution of the estimator ∆̂ϕ . Recall
that the estimate ∆̂ϕ solves the equation Sϕ (∆̂ϕ ) ≐ 0. Based on Pitman
Regularity and Theorem 1.5.7 of Chapter 1, the asymptotic distribution of ∆̂ϕ
is given by

√n (∆̂ϕ − ∆) →D N(0, τϕ² (λ1 λ2 )⁻¹) ;      (2.5.29)

By using (2.5.26) and Tϕ (0) to approximate Sϕ (0), we have the following


useful result:

√n ∆̂ = (τϕ /(λ1 λ2 )) (1/√n) Tϕ (0) + op (1) .      (2.5.30)

We want to select scores such that the efficacy cϕ , (2.5.27), is as large as


possible, or equivalently such that the asymptotic variance of ∆̂ϕ is as small
as possible. How large can the efficacy be? Similar to (1.8.26), note that we


can write
τϕ⁻¹ = ∫ ϕ(u)ϕf (u) du
     = √(∫ ϕf²(u) du) · [ ∫ ϕ(u)ϕf (u) du / ( √(∫ ϕf²(u) du) √(∫ ϕ²(u) du) ) ]
     = ρ √(∫ ϕf²(u) du) .      (2.5.31)

The second equation is true since the scores were standardized as above. In the
third equation ρ is a correlation coefficient and ∫ ϕf²(u) du is the Fisher location
information, (2.4.16), which we denoted by I(f ). By the Rao-Cramér lower
bound, the smallest asymptotic variance obtainable by an asymptotically un-
biased estimate is (λ1 λ2 I(f ))−1. Such an estimate is called asymptotically
efficient. Choosing a score function to maximize (2.5.31) is equivalent to
choosing a score function to make ρ = 1. This can be achieved by taking
the score function to be ϕ(u) = ϕf (u), (2.5.22). The resulting estimate, ∆̂ϕ ,
is asymptotically efficient. Of course this can be accomplished only provided
that the form of f is known; see Exercise 2.13.19. Evidently, the closer the
chosen score is to ϕf , the more powerful the rank analysis is.
In Exercise 2.13.19, the reader is asked to show that the MWW analysis
is asymptotically efficient if the errors have a logistic distribution. For normal
errors, it follows in a few steps from expression (2.4.17) that the optimal scores
are generated by the normal scores function,

ϕN (u) = Φ−1 (u) , (2.5.32)

where Φ(u) is the distribution function of a standard normal random vari-


able. Exercise 2.13.19 shows that this score function is standardized. These
scores yield an asymptotically efficient analysis if the errors truly have a nor-
mal distribution and, further, e(ϕN , L2 ) ≥ 1; see Theorem 1.8.1. Also, unlike
the Mann-Whitney-Wilcoxon analysis, the estimate of the shift ∆ based on
the normal scores cannot be obtained in closed form. But as mentioned above
for general scores, provided the score function is nondecreasing, simple iter-
ative algorithms can be used to obtain the estimate and the corresponding
confidence interval for ∆. In the next sections, we discuss analyses that are
asymptotically efficient for other distributions.
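As a usage note for the sketch given after (2.5.18): since the normal scores function ϕN = Φ⁻¹ is qnorm in R, a normal scores analysis is obtained (as a sketch, reusing the hypothetical score.test) by

z.N <- score.test(x, y, qnorm)   # normal scores statistic, standardized
p.value <- 1 - pnorm(z.N)        # one-sided p-value for (2.5.1)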

Example 2.5.1 (Quail Data, continued from Example 2.3.1). In the larger
study, McKean et al. (1989), from which these data were drawn, the responses


were positively skewed with long right tails, although outliers frequently oc-
curred in the left tail also. McKean et al. conducted an investigation of esti-
mates of the score functions for over 20 of these experiments. Classes of simple
scores which seemed appropriate for such data were piecewise linear: linear on
the interval (0, b) and constant on the interval (b, 1); i.e., scores of the form
ϕb (u) = (2/(b(2 − b))) u − 1 ,  if 0 < u < b ;
ϕb (u) = b/(2 − b) ,             if b ≤ u < 1 .      (2.5.33)

These scores are optimal for densities with left logistic and right exponential
tails; see Exercise 2.13.19. A value of b which seemed appropriate for this type
of data was 3/4. Let S3/4 = Σ a3/4 (R(Yj )) denote the test statistic based on
these scores. The Robnp function phibentr with the argument param = 0.75
computes these scores. Using the Robnp function twosampr2 with the argument
score = phibentr, computes the rank-based analysis for the score function
(2.5.33). Assuming that the treated and control observations are in x and y,
respectively, the call and the resulting analysis for a one-sided test as computed
by R is:

> tempb = twosampr2(x,y,test=T,alt=1,delta0=0,score=phibentr,


grad=sphir,param=.75,alpha=.05,maktable=T)

Test of Delta = 0 Alternative selected is 1


Standardized (z) Test-Statistic 1.787738
and p-value 0.03690915

Estimate 15.5 SE is 7.921817


95 % Confidence Interval is ( -2 , 28 )
Estimate of the scale parameter tau 20.45404

Comparing p-values, the analysis based on the score function (2.5.33) is a little
more precise than the MWW analysis given in Example 2.3.1. Recall that the
data are right skewed, so this result is not surprising.

For another class of scores similar to (2.5.33), see the discussion around
expression (3.10.6) in Chapter 3.

2.5.3 Connection between One- and Two-Sample Scores
In Theorem 2.5.2 we discussed how to obtain a corresponding two-sample
score function given a one-sample score function. Here we reverse the problem,


showing how to obtain a one-sample score function from a two-sample score


function. This provides a natural estimate of θ in (2.2.4). We also show the
efficiencies and asymptotic properties are the same for such corresponding
score functions.
Consider the location model but further assume that X has a symmetric
distribution. Then Y also has a symmetric distribution. For associated one-
sample problems, we could then use the signed rank methods developed in
Chapter 1. What one-sample scores should we select?
First consider what two-sample scores would be suitable under symmetry.
Assume without loss of generality that X is symmetrically distributed about
0. Recall that the optimal scores are given by the expression (2.5.22). Using
the fact that F (x) = 1 − F (−x), it is easy to see (Exercise 2.13.20) that the
optimal scores satisfy

ϕf (u) = −ϕf (1 − u) , for 0 < u < 1 ;

that is, the optimal score function is odd about 1/2. Hence for symmetric distributions, it makes sense to consider two-sample scores which are odd about 1/2.
For this sub-section then assume that the two-sample score generating
function satisfies the property
(S.3) ϕ(1 − u) = −ϕ(u) . (2.5.34)
Note that such scores satisfy: ϕ(1/2) = 0 and ϕ(u) ≥ 0 for u ≥ 1/2. Define a
one-sample score generating function as
 
ϕ+ (u) = ϕ( (u + 1)/2 )      (2.5.35)

and the one-sample scores as

a+ (i) = ϕ+ ( i/(n + 1) ) .      (2.5.36)
It follows that these one-sample scores are nonnegative and nondecreasing.
For example, if we use Wilcoxon two-sample scores, that is, scores generated by the function ϕ(u) = √12 (u − 1/2), then the associated one-sample
score generating function is ϕ+ (u) = √3 u and, hence, the one-sample scores
are the Wilcoxon signed rank scores. If instead we use the two-sample sign
scores, ϕ(u) = sgn(2u − 1), then the one-sample score function is ϕ+ (u) = 1.
This results in the one-sample sign scores.
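In R the passage from a two-sample score function to the one-sample scores (2.5.35)-(2.5.36) is a one-liner; a sketch with hypothetical names:

a.plus <- function(n, phi) phi(((1:n)/(n + 1) + 1)/2)   # a+(i) = phi((u+1)/2)
# a.plus(5, function(u) sqrt(12)*(u - 0.5))   # sqrt(3)*i/(n+1), signed rank scores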
Suppose we use two-sample scores which satisfy (2.5.34) and use the associ-
ated one-sample scores. Then the corresponding one- and two-sample efficacies
satisfy

cϕ = √(λ1 λ2 ) cϕ+ ,      (2.5.37)


where the efficacies are given by expressions (2.5.27) and (1.8.21). Hence the
efficiency and asymptotic properties of the one- and two-sample analyses are
the same. As a final remark, if we write the model as in expression (2.2.4),
then we can use the rank statistic based on the two-sample scores to estimate
∆. We next form the residuals Zi − ∆̂ ci . Then using the one-sample scores
statistic of Chapter 1, we can estimate θ based on these residuals, as discussed
in Chapter 1. In terms of a regression problem we are estimating the intercept
parameter θ based on the residuals after fitting the regression coefficient ∆.
This is discussed in some detail in Section 3.5.

2.6 L1 Analyses
In this section, we present analyses based on the L1 norm and pseudo-norm.
We discuss the pseudo-norm first, showing that the corresponding test is
the familiar Mood’s (1950) test. The test which corresponds to the norm is
Mathisen’s (1943) test.

2.6.1 Analysis Based on the L1 Pseudo-Norm


Consider the sign scores. These are the scores generated by the function ϕ(u) =
sgn(u − 1/2). The corresponding pseudo-norm is given by,
‖u‖ϕ = Σ_{i=1}^{n} sgn( R(ui ) − (n + 1)/2 ) ui .      (2.6.1)
This pseudo-norm is optimal for double exponential errors; see Exercise
2.13.19.
We have the following relationship between the L1 pseudo-norm and the
L1 norm. Note that we can write
‖u‖ϕ = Σ_{i=1}^{n} sgn( i − (n + 1)/2 ) u(i) .
Next consider,
Σ_{i=1}^{n} |u(i) − u(n−i+1) | = Σ_{i=1}^{n} sgn(u(i) − u(n−i+1) )(u(i) − u(n−i+1) )
                              = 2 Σ_{i=1}^{n} sgn(u(i) − u(n−i+1) ) u(i) .

Finally note that


 
sgn(u(i) − u(n−i+1) ) = sgn(i − (n − i + 1)) = sgn( i − (n + 1)/2 ) .


Putting these results together we have the relationship,


Σ_{i=1}^{n} |u(i) − u(n−i+1) | = 2 Σ_{i=1}^{n} sgn( i − (n + 1)/2 ) u(i) = 2 ‖u‖ϕ .      (2.6.2)

Recall that the pseudo-norm based on the Wilcoxon scores can be expressed
as the sum of all absolute differences between the components; see (2.2.17).
In contrast the pseudo-norm based on the sign scores only involves the n
symmetric absolute differences |u(i) − u(n−i+1) |.
In the two-sample location model the corresponding R estimate based on
the pseudo-norm (2.6.1) is a value of ∆ which solves the equation
Sϕ (∆) = Σ_{j=1}^{n2} sgn( R(Yj − ∆) − (n + 1)/2 ) ≐ 0 .      (2.6.3)

Note that we are ranking the set {X1 , . . . , Xn1 , Y1 − ∆, . . . , Yn2 − ∆} which
is equivalent to ranking the set {X1 − med Xi , . . . , Xn1 − med Xi , Y1 − ∆ −
med Xi , . . . , Yn2 − ∆ − med Xi }. We must choose ∆ so that half of the ranks
of the Y part of this set are above (n + 1)/2 and half are below. Note that in
the X part of the second set, half of the X part is below 0 and half is above
0. Thus we need to choose ∆ so that half of the Y part of this set is below 0
and half is above 0. This is achieved by taking
∆̂ = med Yj − med Xi .      (2.6.4)

This is the same estimate as produced by the L1 norm, see the discussion
following (2.2.5). We refer to the above pseudo-norm (2.6.1) as the L1 pseudo-
norm. Actually, as pointed out in Section 2.2, this equivalence between es-
timates based on the L1 norm and the L1 pseudo-norm is true for general
regression problems in which the model includes an intercept,
Pn2 as it does here.
The corresponding test statistic for H0 : ∆ = 0 is j=1 sgn(R(Yj ) − n+1 2
).
Note that the sgn function here is only counting the number of Yj ’s which
are above the combined sample median M c = med {X1 , . . . , Xn1 , Y1 , . . . , Yn2 }
minus the number below M c. Hence a more convenient but equivalent test
statistic is
M0+ = #(Yj > M c) , (2.6.5)
which is called Mood’s median test statistic; see Mood (1950).

Testing
Since this L1 analysis is based on a rank-based pseudo-norm we could use the
general theory discussed in Section 2.5 to handle the theory for estimation and


testing. As we point out, though, there are some interesting results pertaining
to this analysis.
For the null distribution of M0+ , first assume that n is even. Without loss
of generality, assume that n = 2r and n1 ≥ n2 . Consider the combined sample
as a population of n items, where n2 of the items are Y ’s and n1 items are
X’s. Think of the n/2 items which exceed M c. Under H0 these items are as
+
likely to be an X as a Y . Hence, M0 , the number of Y ’s in the top half of the
sample follows the hypergeometric distribution, i.e.,
P (M0+ = k) = C(n2 , k) C(n1 , r − k) / C(n, r) ,   k = 0, . . . , n2 ,

where r = n/2. If n is odd the same result holds except in this case r =
(n − 1)/2. Thus as a level α decision rule, we would reject H0 : ∆ = 0 in
favor of HA : ∆ > 0, if M0+ ≥ cα , where cα could be determined from the
hypergeometric distribution or approximated by the binomial distribution.
From the properties of the hypergeometric distribution, E0 [M0+ ] = r(n2 /n)
and V0 [M0+ ] = (r n1 n2 (n − r))/(n²(n − 1)). Under the assumption (D.1), (2.4.7),
it follows that the limiting distribution of M0+ is normal.

Confidence Intervals
Exercise 2.13.21 shows that, for n = 2r,
M0+ (∆) = #(Yj − ∆ > M̂ ) = Σ_{i=1}^{n2} I(Y(i) − X(r−i+1) − ∆ > 0) ,      (2.6.6)

and furthermore that the n = 2r differences,

Y(1) − X(r) < Y(2) − X(r−1) < · · · < Y(n2 ) − X(r−n2 +1) ,

can be ordered only knowing the order statistics from the individual samples.
It is further shown that if k is such that P (M0+ ≤ k) = α/2 then a (1−α)100%
confidence interval for ∆ is given by

(Y(k+1) − X(r−k) , Y(n2 −k) − X(r−n2 +k+1) ) .

The above confidence interval simplifies when n1 = n2 = m, say. In this


case the interval becomes

(Y(k+1) − X(m−k) , Y(m−k) − X(k+1) ) ,

which is the difference in endpoints of the two simple L1 confidence intervals


(X(k+1) , X(m−k) ) and (Y(k+1) , Y(m−k) ) which were discussed in Section 1.11.


Using the normal approximation to the hypergeometric, we have k ≈ m/2 −
zα/2 √(m²/(4(2m − 1))) − .5. Hence, the above two intervals have confidence
coefficient

γ = 1 − 2Φ( (k − m/2)/√(m/4) ) ≈ 1 − 2Φ( −zα/2 √(m/(2m − 1)) )
  ≈ 1 − 2Φ( −zα/2 2⁻¹ᐟ² ) .
For example, for the equal sample size case, a 5% two-sided Mood’s test is
equivalent to rejecting the null hypothesis if the 84% one-sample L1 confidence
intervals are disjoint. While this also could be done for the unequal sample
sizes case, we recommend the direct approach of Section 1.11.

Efficiency Results
We obtain the efficiency results from the asymptotic distribution of the esti-
mate, ∆b = med Yj − med Xi of ∆. Equivalently, we could obtain the results
by asymptotic linearity that was derived for arbitrary scores in (2.5.26); see
Exercise 2.13.22.
Theorem 2.6.1. Under the conditions cited in Example 1.5.2, (L1 Pitman
regularity conditions), and (2.4.7), we have
√n (∆̂ − ∆) →D N(0, (4λ1 λ2 f²(0))⁻¹) .      (2.6.7)
Proof: Without loss of generality assume that ∆ and θ are 0. We can write

√n ∆̂ = √(n/n2 ) √n2 med Yj − √(n/n1 ) √n1 med Xi .

From Example 1.5.2, we have

√n2 med Yj = (1/(2f (0))) (1/√n2 ) Σ_{j=1}^{n2} sgn Yj + op (1) ;

hence, √n2 med Yj →D Z2 where Z2 is N(0, (4f²(0))⁻¹). Likewise √n1 med Xi →D
Z1 where Z1 is N(0, (4f²(0))⁻¹). Since Z1 and Z2 are independent, we have
that √n ∆̂ →D (λ2 )⁻¹ᐟ² Z2 − (λ1 )⁻¹ᐟ² Z1 , which yields the result.
The efficacy of Mood’s test is thus √(λ1 λ2 ) 2f (0). The asymptotic relative
efficiency of Mood’s test to the two-sample t-test is 4σ² f²(0), while its asymptotic relative efficiency with the MWW test is f²(0)/(3(∫ f²)²). These are the
same as the efficiency results of the sign test to the t-test and to the Wilcoxon
signed rank test, respectively, that were obtained in Chapter 1; see Section
1.7.


Example 2.6.1 (Quail Data, continued from Example 2.3.1). For the quail
data the median of the combined samples is M̂ = 64. For the subsequent
test based on Mood’s test we eliminated the three data points which had this
value. Thus n = 27, n1 = 9 and n2 = 18. The value of Mood’s test statistic
is M0+ = #(Pj > 64) = 11. Since EH0 (M0+ ) = 8.67 and VH0 (M0+ ) = 1.55, the
standardized value (using the continuity correction) is 1.47 with a p-value of
.071. Using all the data, the point estimate corresponding to Mood’s test is 19
while a 90% confidence interval, using the normal approximation, is (−10, 31).

2.6.2 Analysis Based on the L1 Norm


Another sign type procedure is based on the L1 norm. Reconsider expression
(2.2.7) which is the partial derivative of the L1 dispersion function with respect
to ∆. We take the parameter θ as a nuisance parameter and we estimate it by
med Xi . An aligned sign test procedure for ∆ is then obtained by aligning
the Yj ’s with respect to this estimate of θ. The process of interest, then, is
S(∆) = Σ_{j=1}^{n2} sgn(Yj − med Xi − ∆) .

A test of H0 : ∆ = 0 is based on the statistic


Ma+ = #(Yj > med Xi ) . (2.6.8)
This statistic was proposed by Mathisen (1943) and is also referred to as
the control median test; see Gastwirth (1968). The estimate of ∆ obtained by
solving S(∆) ≐ 0 is, of course, the L1 estimate ∆̂ = med Yj − med Xi .

Testing
Mathisen’s test statistic, similar to Mood’s, has a hypergeometric distribution
under H0 .
Theorem 2.6.2. Suppose n1 is odd and is written as n1 = 2n∗1 + 1. Then
under H0 : ∆ = 0,

P (Ma+ = t) = C(n1∗ + t, n1∗ ) C(n2 − t + n1∗ , n1∗ ) / C(n, n1 ) ,   t = 0, 1, . . . , n2 .

Proof: The proof is based on a conditional argument. Given X(n∗1 +1) = x, Ma+
is binomial with n2 trials and 1 − F (x) as the probability of success. The
density of X(n∗1 +1) is
f ∗ (x) = (n1 !/(n1∗ !)²) (1 − F (x))^{n1∗} F (x)^{n1∗} f (x) .


Using this and the fact that the samples are independent we get,
P (Ma+ = t) = ∫ C(n2 , t) (1 − F (x))^t F (x)^{n2 −t} f ∗ (x) dx
            = C(n2 , t) (n1 !/(n1∗ !)²) ∫ (1 − F (x))^{t+n1∗} F (x)^{n1∗ +n2 −t} f (x) dx
            = C(n2 , t) (n1 !/(n1∗ !)²) ∫01 (1 − u)^{t+n1∗} u^{n1∗ +n2 −t} du .
By properties of the β function this reduces to the result.
Once again using the conditional argument, we obtain the moments of Ma+
as
E0 [Ma+ ] = n2 /2      (2.6.9)
V0 [Ma+ ] = n2 (n + 1)/(4(n1 + 2)) ;      (2.6.10)
see Exercise 2.13.23.
The result when n1 is even is found in Exercise 2.13.23. For the asymptotic
null distribution of Ma+ we make use of the linearity result for the sign process
derived in Chapter 1; see Example 1.5.2.
Theorem 2.6.3. Under H0 and (D.1), (2.4.7), Ma+ has an approximate
N( n2 /2, n2 (n + 1)/(4(n1 + 2)) ) distribution.
Proof: Assume without loss of generality that the true median of X and Y is
0. Let θb = med Xi . Note that
Ma+ = ( Σ_{j=1}^{n2} sgn(Yj − θ̂) + n2 )/2 .      (2.6.11)


Clearly under (D.1), √n2 θ̂ is bounded in probability. Hence by the asymptotic
linearity result for the L1 analysis, obtained in Example 1.5.2, we have

n2⁻¹ᐟ² Σ_{j=1}^{n2} sgn(Yj − θ̂) = n2⁻¹ᐟ² Σ_{j=1}^{n2} sgn(Yj ) − 2f (0) √n2 θ̂ + op (1) .

But we also have


√n1 θ̂ = (2f (0) √n1 )⁻¹ Σ_{i=1}^{n1} sgn(Xi ) + op (1) .

Therefore
n2⁻¹ᐟ² Σ_{j=1}^{n2} sgn(Yj − θ̂) = n2⁻¹ᐟ² Σ_{j=1}^{n2} sgn(Yj ) − √(n2 /n1 ) n1⁻¹ᐟ² Σ_{i=1}^{n1} sgn(Xi ) + op (1) .


Note that

n2⁻¹ᐟ² Σ_{j=1}^{n2} sgn(Yj ) →D N(0, 1)

and

√(n2 /n1 ) n1⁻¹ᐟ² Σ_{i=1}^{n1} sgn(Xi ) →D N(0, λ2 /λ1 ) .
The result follows from these asymptotic distributions, the independence of
the samples, expression (2.6.11), and the fact that asymptotically the variance
of Ma+ satisfies
n2 (n + 1)/(4(n1 + 2)) ≈ n2 (4λ1 )⁻¹ .

Confidence Intervals
Note that Ma+ (∆) = #(Yj − ∆ > θ̂) = #(Yj − θ̂ > ∆); hence, if k is such that
P0 (Ma+ ≤ k) = α/2, then (Y(k+1) − θ̂, Y(n2 −k) − θ̂) is a (1 − α)100% confidence
interval for ∆. For testing the two-sided hypothesis H0 : ∆ = 0 versus
HA : ∆ 6= 0 we would reject H0 if 0 is not in the confidence interval. This is
equivalent, however, to rejecting if θb is not in the interval (Y(k+1) , Y(n2 −k) ).
Suppose we determine k by the normal approximation. Then
k ≈ n2 /2 − zα/2 √(n2 (n + 1)/(4(n1 + 2))) − .5 ≈ n2 /2 − zα/2 √(n2 /(4λ1 )) − .5 .

The confidence interval (Y(k+1) , Y(n2 −k) ) is a γ100%, γ = 1 −
2Φ(−zα/2 (λ1 )⁻¹ᐟ²), confidence interval based on the sign procedure for the
sample Y1 , . . . , Yn2 . Suppose we take α = .05 and have the equal sample sizes
case so that λ1 = .5. Then γ = 1 − 2Φ(−2√2). Hence, the two-sided 5% test
rejects H0 : ∆ = 0 if θ̂ is not in the confidence interval.

Remarks on Efficiency
Since the estimator of ∆ based on the Mathisen procedure is the same as that
of Mood’s procedure, the asymptotic relative efficiency results for Mathisen’s
procedure are the same as that of Mood’s. Using another type of efficiency
due to Bahadur (1967), Killeen, Hettmansperger, and Sievers (1972) show it
is generally better to compute the median of the smaller sample.
Curtailed sampling on the Y ’s is one situation where Mathisen’s test would
be used instead of Mood’s test since with Mathisen’s test an early decision
could be made; see Gastwirth (1968). For another perspective on median tests,
see Freidlin and Gastwirth (2000).


Example 2.6.2 (Quail Data, continued from Examples 2.3.1 and 2.6.1). For
this data, med Ti = 49. Since one of the placebo values was also 49, we elimi-
nated it in the subsequent computation of Mathisen’s test. The test statistic
has the value Ma+ = #(Cj > 49) = 17. Using n2 = 19 and n1 = 10 the null
mean and variance are 9.5 and 11.875, respectively. This leads to a standard-
ized test statistic of 2.03 (using the continuity correction) with a p-value of
.021. Utilizing all the data, the corresponding point estimate and confidence
interval are 19 and (6, 27). This differs from MWW and Mood analyses; see
Examples 2.3.1 and 2.6.1, respectively.

2.7 Robustness Properties


In this section we obtain the breakdown points and the influence functions of
the L1 and MWW estimates. We first consider the breakdown properties.

2.7.1 Breakdown Properties


We begin with the definition of an equivariant estimator of ∆. For convenience
let the vectors X and Y denote the samples {X1 , . . . , Xn1 } and {Y1, . . . , Yn2 },
respectively. Also let X + a1 = (X1 + a, . . . , Xn1 + a)′ .

Definition 2.7.1. An estimator ∆̂(X, Y) of ∆ is said to be an equivariant
estimator of ∆ if ∆̂(X + a1, Y) = ∆̂(X, Y) − a and ∆̂(X, Y + a1) =
∆̂(X, Y) + a.

Note that the L1 estimator and the Hodges-Lehmann estimator are both
equivariant estimators of ∆. Indeed, as Exercise 2.13.24 shows, any estimator
based on the rank pseudo-norms discussed in Section 2.5 is an equivariant
estimator of ∆. As the following theorem shows, the breakdown point of an
equivariant estimator is bounded above by .25.

Theorem 2.7.1. Suppose n1 ≤ n2 . Then the breakdown point of an equivariant estimator satisfies ε∗ ≤ {[(n1 + 1)/2] + 1}/n, where [·] denotes the greatest
integer function.

Proof: Let m = [(n1 + 1)/2] + 1. Suppose ∆̂ is an equivariant estimator such
that ε∗ > m/n. Then the estimator remains bounded if m points are corrupted.
Let X∗ = (X1 + a, . . . , Xm + a, Xm+1 , . . . , Xn1 )′. Since we have corrupted m
points, there exists a B > 0 such that

|∆̂(X∗ , Y) − ∆̂(X, Y)| ≤ B .      (2.7.1)


Next let X∗∗ = (X1 , . . . , Xm , Xm+1 − a, . . . , Xn1 − a)′. Then X∗∗ contains
n1 − m = [n1 /2] ≤ m altered points. Therefore,

|∆̂(X∗∗ , Y) − ∆̂(X, Y)| ≤ B .      (2.7.2)

Equivariance implies that ∆̂(X∗∗ , Y) = ∆̂(X∗ , Y) + a. By (2.7.1) we have

∆̂(X, Y) − B ≤ ∆̂(X∗ , Y) ≤ ∆̂(X, Y) + B ,      (2.7.3)

while from (2.7.2) we have

∆̂(X, Y) − B + a ≤ ∆̂(X∗∗ , Y) ≤ ∆̂(X, Y) + B + a .      (2.7.4)

Taking a = 3B leads to a contradiction between (2.7.2) and (2.7.4).


By this theorem the maximum breakdown point of any equivariant esti-
mator is roughly half of the smaller sample proportion. If the sample sizes are
equal then the best possible breakdown is 1/4.

Example 2.7.1 (Breakdown of L1 and MWW estimates). The L1 estimator


of ∆, ∆̂ = med Yj − med Xi , achieves the maximal breakdown since med Yj
achieves the maximal breakdown in the one-sample problem.
The Hodges-Lehmann estimate ∆̂R = med {Yj − Xi } also achieves max-
imal breakdown. To see this, suppose we corrupt an Xi . Then n2 differences
Yj − Xi are corrupted. Hence between samples we maximize the corruption
by corrupting the items in the smaller sample, so without loss of general-
ity we can assume that n1 ≤ n2 . Suppose we corrupt m Xi ’s. In order
to corrupt med {Yj − Xi } we must corrupt (n1 n2 )/2 differences. Therefore
mn2 ≥ (n1 n2 )/2; i.e., m ≥ n1 /2. Hence med {Yj − Xi } has maximal break-
down. Based on Exercise 1.12.13 of Chapter 1, the one-sample estimate based
on the Wilcoxon signed rank statistic does not achieve the maximal breakdown
value of 1/2 in the one-sample problem.

2.7.2 Influence Functions


Recall from Section 1.6.1 that the influence function of a Pitman Regular
estimator based on a single sample X1 , . . . , Xn is the function Ω(z) when the
estimator has the representation n⁻¹ᐟ² Σ Ω(Xi ) + op (1). The estimators we are
concerned with in this section are Pitman Regular; hence, to determine their
influence functions we need only obtain similar representations for them.
For the L1 estimate we have from the proof of Theorem 2.6.1 that
√n ∆̂ = √n (med Yj − med Xi )
      = (1/(2f (0))) (1/√n) { Σ_{j=1}^{n2} sgn(Yj )/λ2 − Σ_{i=1}^{n1} sgn(Xi )/λ1 } + op (1) .


Hence the influence function of the L1 estimate is



Ω(z) = { −(2λ1 f (0))⁻¹ sgn z , if z is an x
       {  (2λ2 f (0))⁻¹ sgn z , if z is a y ,
which is a bounded discontinuous function.
For the Hodges-Lehmann estimate, (2.2.18), note that we can write the
linearity result (2.4.23) as
√n (S̄R+ (δ/√n) − 1/2) = √n (S̄R+ (0) − 1/2) − δ ∫ f² + op (1) ,

which upon substituting √n ∆̂R for δ leads to

√n ∆̂R = (∫ f²)⁻¹ √n (S̄R+ (0) − 1/2) + op (1) .

Recall the projection of the statistic S̄R+ (0) − 1/2 given in Theorem 2.4.7.
Since the difference between it and this statistic goes to zero in probability we
can, after some algebra, obtain the following representation for the Hodges-
Lehmann estimator,
√n ∆̂R = (∫ f²)⁻¹ (1/√n) { Σ_{j=1}^{n2} (F (Yj ) − 1/2)/λ2 − Σ_{i=1}^{n1} (F (Xi ) − 1/2)/λ1 } + op (1) .

Therefore the influence function for the Hodges-Lehmann estimate is


Ω(z) = { −(λ1 ∫ f²)⁻¹ (F (z) − 1/2) , if z is an x
       {  (λ2 ∫ f²)⁻¹ (F (z) − 1/2) , if z is a y ,

which is easily seen to be bounded and continuous.


For least squares, since the estimate is Ȳ − X̄, the influence function is

Ω(z) = { −(λ1 )⁻¹ z , if z is an x
       {  (λ2 )⁻¹ z , if z is a y ,
which is unbounded and continuous. The Hodges-Lehmann and L1 estimates
attain the maximal breakdown point and have bounded influence functions;
hence they are robust. On the other hand, the least squares estimate has 0%
breakdown and an unbounded influence function. One bad point can destroy
a least squares analysis.
For a general score function ϕ(u), by (2.5.30) we have the asymptotic
representation

√n ∆̂ = (1/√n) [ − Σ_{i=1}^{n1} (τϕ /λ1 ) ϕ(F (Xi )) + Σ_{i=1}^{n2} (τϕ /λ2 ) ϕ(F (Yi )) ] + op (1) .


Hence, the influence function of the R estimate based on the score function ϕ
is given by
Ω(z) = { −(τϕ /λ1 ) ϕ(F (z)) , if z is an x
       {  (τϕ /λ2 ) ϕ(F (z)) , if z is a y ,

where τϕ is defined by expression (2.5.23). In particular, the influence function


is bounded provided the score generating function is bounded. Note that the
influence function for the R estimate based on normal scores is unbounded;
hence, this estimate is not robust. Recall Example 1.8.1 in which the one-
sample normal scores estimate has an unbounded influence function (nonro-
bust) but has positive breakdown point (resistant). A rigorous derivation of
these influence functions can be based on the influence function derived in
Section A.5.2 of the Appendix.
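The contrast between the bounded and unbounded influence functions is easy to see empirically. The R sketch below corrupts a single y observation and tracks the change in the two estimates (names and data hypothetical):

sens <- function(x, y, out) {                 # replace one y by an outlier
  yc <- c(y[-1], out)
  c(HL = median(outer(yc, x, "-")) - median(outer(y, x, "-")),
    LS = (mean(yc) - mean(x)) - (mean(y) - mean(x)))
}
# sapply(c(10, 100, 1000), function(o) sens(x, y, o))
# The LS change grows linearly in the outlier; the HL change stays bounded.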

2.8 Proportional Hazards


Consider a two-sample problem where the responses are lifetimes of sub-
jects. We continue to denote the independent samples by X1 , . . . , Xn1 and
Y1 , . . . , Yn2 . Let Xi and Yj have distribution functions F (x) and G(x), respec-
tively. Since we are dealing with lifetimes both Xi and Yj are positive valued
random variables. The hazard function for Xi is defined by

hX (t) = f (t)/(1 − F (t))

and represents the likelihood that a subject dies at time t given that he has
survived until that time; see Exercise 2.13.25.
In this section, we consider the class of lifetime models that are called
Lehmann alternative models for which the distribution function G satisfies

1 − G(x) = (1 − F (x))α , (2.8.1)

where the parameter α > 0. See Section 4.4 of Maritz (1981) for an overview
of nonparametric methods for these models. The Lehmann model generalizes
the exponential scale model F (x) = 1−exp(−x) and G(x) = 1−(1−F (x))α =
1 − exp(−αx). As shown in Exercise 2.13.25, the hazard function of Yj is given
by hY (t) = αhX (t); i.e., the hazard function of Yj is proportional to the hazard
function of Xi ; hence, these models are also referred to as proportional haz-
ards models; see, also, Section 3.10. The null hypothesis can be expressed
as HL0 : α = 1. The alternative we consider is HLA : α < 1; that is, Y is
less hazardous than X; i.e., Y has more chance of long survival than X and


is stochastically larger than X. Note that,

Pα (Y > X) = Eα [P (Y > X | X)] = Eα [1 − G(X)]
           = Eα [(1 − F (X))^α ] = (α + 1)⁻¹ .      (2.8.2)

The last equality holds, since 1−F (X) has a uniform (0, 1) distribution. Under
HLA , then, Pα (Y > X) > 1/2; i.e., Y tends to dominate X.
The MWW test statistic SR+ = #(Yj > Xi ) is a consistent test statistic for
HL0 versus HLA , by Theorem 2.4.10. We reject HL0 in favor of HLA for large
values of SR+ . Furthermore by Theorem 2.4.4 and (2.8.2), we have that
n1 n2
Eα [SR+ ] = n1 n2 Eα [1 − G(X)] = .
1+α
This suggests as an estimate of α, the statistic,

b = ((n1 n2 )/SR+ ) − 1 .
α (2.8.3)

By Theorem 2.4.5 it can be shown that


αn1 n2 n1 n2 (n1 − 1)α n1 n2 (n2 − 1)α2
Vα (SR+ ) = + + ; (2.8.4)
(α + 1)2 (α + 2)(α + 1)2 (2α + 1)(α + 1)2
see Exercise 2.13.27. Using this result and the asymptotic distribution of SR+
under general alternatives, Theorem 2.4.9, we can obtain, by the delta method,
the asymptotic variance of αb given by
 
. (1 + α)2 α n1 − 1 (n2 − 1)α
b=
Var α 1+ + . (2.8.5)
n1 n2 α+2 2α + 1
This can be used to obtain an asymptotic confidence interval for α; see Exercise
2.13.27 for details. As in the example below the bootstrap could also be used
to estimate the Var(b α).

2.8.1 The Log Exponential and the Savage Statistic


Another rank test which is frequently used in this situation is the log rank
test proposed by Savage (1956). In order to obtain this test, first consider
the special case where X has the exponential distribution function, F (x) =
1−e−x/θ , for θ > 0. In this case the hazard function of X is a constant function.
Consider the random variable ǫ = log X − log θ. In a few steps we can obtain
its distribution function as,

P [ǫ ≤ t] = P [log X − log θ ≤ t]
= 1 − exp (−et ) ;

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 127 —


i i

2.8. PROPORTIONAL HAZARDS 127

i.e., ǫ has an extreme value distribution. The density of ǫ is fǫ (t) = exp (t − et ).


Hence, we can model log X as the location model:

log X = log θ + ǫ . (2.8.6)

Next consider the distribution of the log Y . Using expression (2.8.1) and a few
steps of algebra we get
α
P [log Y ≤ t] = 1 − exp (− et ) .
θ
But from this it is easy to see that we can model Y as
1
log Y = log θ + log +ǫ, (2.8.7)
α
where the error random variable has the above extreme value distribution.
From (2.8.6) and (2.8.7) we see that the log-transformation problem is simply
a two-sample location problem with shift parameter ∆ = − log α. Here, HL0
is equivalent to H0 : ∆ = 0 and HLA is equivalent to HA : ∆ > 0. We refer to
this model as the log exponential model for the remainder of this section.
Thus any of the rank-based analyses that we have discussed in this chapter
can be used to analyze this model.
Let’s consider the analysis based on the optimal score function for the
model. Based on Section 2.5 and Exercise 2.13.19, the optimal scores for the
extreme value distribution are generated by the function

ϕfǫ (u) = −(1 + log(1 − u)) . (2.8.8)

Hence the optimal rank test in the log exponential model is given by
Xn2   Xn2   
R(Yj ) R(log Yj )
SL = ϕfǫ =− 1 + log 1 −
j=1
n+1 j=1
n+1
Xn2   
R(Yj )
= − 1 + log 1 − . (2.8.9)
j=1
n+1

We reject HL0 in favor of HLA for large values of SL . By (2.5.14) the null mean
of SL is 0 while from (2.5.18) its null variance is given by
n   2
2 n1 n2 X i
σϕfǫ = 1 + log 1 − . (2.8.10)
n(n − 1) i=1 n+1

Then an asymptotic level α test rejects HL0 in favor of HLA if SL ≥ zα σϕfǫ .


Certainly the statistic SL can be used in the general Lehmann alternative
model described above, although it is not optimal if X does not have an
exponential distribution. We discuss the efficiency of this test below.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 128 —


i i

128 CHAPTER 2. TWO-SAMPLE PROBLEMS

For estimation, let ∆ b be the estimate of ∆ based on the optimal score


b
function ϕfǫ ; that is, ∆ solves the equation
Xn2   
R[log(Yj ) − ∆] .
1 + log 1 − =0. (2.8.11)
j=1
n+1

Besides estimation, the confidence intervals discussed in Section 2.5 for general
scores, can be obtained for the score function ϕfǫ ; see Example 2.8.1 for an
illustration. n o
Thus another estimate of α would be α b = exp −∆ b . As discussed in
Exercise 2.13.27, an asymptotic confidence interval for α can be formulated
from this relationship. Keep in mind, though, that we are assuming that X is
exponentially distributed.
As a further note, since ϕfǫ (u) is an unbounded function it follows from
b is unbounded. Thus the estimate
Section 2.7.2 that the influence function of ∆
is not robust.
A frequently used, equivalent test statistic to SL was proposed by Savage.
To derive it, denote R(Yj ) by Rj . Then we can write
  Z 1−Rj /(n+1) Z 0
Rj 1 1
log 1 − = dt = dt .
n+1 1 t Rj /(n+1) 1 − t

We can approximate this last integral by the following Riemann sum:


1 1 1 1
+ +
1 − Rj /(n + 1) n + 1 1 − (Rj − 1)/(n + 1) n + 1
1 1
··· + .
1 − (Rj − (Rj − 1))/(n + 1) n + 1
This simplifies to
Xn
1 1 1 1
+ +···+ = .
n+1−1 n+1−2 n + 1 − Rj i=n+1−R i
j

This suggests the rank statistic proposed by Savage (1956),


n2
X n
X 1
S̃L = −n2 + . (2.8.12)
j=1 i=n−Rj +1
i

Note that it is a rank statistic with scores defined by


n
X 1
aj = −1 + . (2.8.13)
i=n−j+1
i

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 129 —


i i

2.8. PROPORTIONAL HAZARDS 129

Exercise 2.13.28 shows that its null mean and variance are given by

EH0 [S̃L ] = 0
( n
)
2 n1 n2 1X1
σ̃ = 1− . (2.8.14)
n−1 n j=1 j

Hence an asymptotic level α test is to reject HL0 in favor of HLA if S̃L ≥ σ̃zα .
Based on the above Riemann sum it would seem that S̃L and SL are close
statistics. Indeed they are asymptotically equivalent and, hence, both are op-
timal when X is exponentially distributed; see Hájek and Šidák (1967) or
Kalbfleisch and Prentice (1980) for details.

2.8.2 Efficiency Properties


We next derive the asymptotic relative efficiencies for the log exponential
model with fǫ (t) = exp (t − et ). The MWW statistic, SR+ , is a consistent test
for the log exponential model. By (2.4.21), the efficacy of the Wilcoxon test is
Z r
√ p 3p
cM W W = 12 fǫ2 λ1 λ2 = λ1 λ2 .
4

Since the Savage test is asymptotically optimal its efficacy is the square root
of √Fisher information, i.e., I 1/2 (fǫ ) discussed in Section 2.5. This efficacy
is λ1 λ2 . Hence the asymptotic relative efficiency of the Mann-Whitney-
Wilcoxon test to the Savage test at the log exponential model, is 3/4; see
Exercise 2.13.29.
Recall√that the efficacy of the L1 procedures, both Mood’s and Mathisen’s,
is 2fǫ (θǫ ) λ1 λ2 , where θǫ denotes the median of the extreme value distribution.
This turns out √ to be θǫ = log(log 2)). Hence fǫ (θǫ ) = (log 2)/2, which leads
to the efficacy λ1 λ2 log 2 for the L1 methods. Thus the asymptotic relative
efficiency of the L1 procedures with respect to the procedure based on Savage
scores is (log 2)2 = .480. The asymptotic relative efficiency of the L1 methods
to the MWW at this model is .6406. Therefore there is a substantial loss of
efficiency if L1 methods are used for the log exponential model. This makes
sense since the extreme value distribution has very light tails.
The variance of a random variable with density fǫ is π 2 /6; hence the asymp-
totic relative efficiency of the t-test to the Savage test at the log exponential
model is 6/π 2 = .608. Hence, for the procedures analyzed in this chapter on
the log exponential model the Savage test is optimal followed, in order, by the
MWW, t-, and L1 tests.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 130 —


i i

130 CHAPTER 2. TWO-SAMPLE PROBLEMS

Example 2.8.1 (Lifetimes of an Insulation Fluid). The data below are drawn
from an example on page 3 of Lawless (1982); see, also, Nelson (1982, p. 227).
They consist of the breakdown times (in minutes) of an electrical insulating
fluid when subject to two different levels of voltage stress, 30 and 32 kV.
Suppose we are interested in testing to see if the lower level is less hazardous
than the higher level.

Voltage Times to Breakdown (Minutes)


30 kV 17.05 22.66 21.02 175.88 139.07 144.12 20.46 43.40
Y 194.90 47.30 7.74
32 kV 0.40 82.85 9.88 89.29 215.10 2.75 0.79 15.93
X 3.91 0.27 0.69 100.58 27.80 13.95 53.24

Let Y and X denote the log of the breakdown times of the insulating
fluid at the voltage stresses of 30 kV and 32 kVs, respectively. Let ∆ = θY −
θX denote the shift in locations. We are interested in testing H0 : ∆ = 0
versus HA : ∆ > 0. The comparison boxplots for the log-transformed data are
displayed in the left panel of Figure 2.8.1. It appears that the lower level (30
kV) is less hazardous.

Figure 2.8.1: Comparison boxplots of insulation fluids: 30 kV and 32 kV.

Exponential q−q Plot Comparison Boxplots of log 32 kv and log 30 kv


200

5
4
150

3
Breakdown−time
Voltage level

2
100

1
50

0
−1
0

0.0 0.5 1.0 1.5 2.0 2.5 log 30 kv log 32 kv

Exponential Quantiles

The Robnp function twosampr2 with the score argument set at philogr
obtains the analysis based on the log-rank scores. Briefly, the results are:

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 131 —


i i

2.9. TWO-SAMPLE RANK SET SAMPLING (RSS) 131

Test of Delta = 0 Alternative selected is 1


Standardized (z) Test-Stat 1.302 p-value 0.096

Estimate 0.680 SE is 0.776


95 % Confidence Interval is (-0.261, 2.662)
Estimate of the scale parameter tau 1.95

The corresponding Mann-Whitney-Wilcoxon analysis is

Test of Delta = 0 Alternative selected is 1


Test Stat. S+ is 118 z-Test Stat. 1.816 p-value 0.034

MWW estimate of the shift in location is 1.297 SE is 0.944


95 % Confidence Interval is (-0.201, 3.355)
Estimate of the scale parameter tau 2.37

While the log-rank is insignificant, the MWW analysis is significant at level


0.034. This difference is not surprising upon considering the q −q plot of the
original data at the 32 kV level found in the right panel of Figure 2.8.1.
The population quantiles are drawn from an exponential distribution. The
plot indicates heavier tails than that of an exponential distribution. In turn,
the error distribution for the location model would have heavier tails than
the light-tailed extreme-valued distribution. Thus the MWW analysis is more
appropriate. The two-sample t-test has value 1.34 with the p-value also of .096.
It was impaired by the heavy tails too.
Although the exponential model on the original data seems unlikely, for il-
lustration we consider it. The sum of the ranks of the 30 kV (Y ) sample is 184.
The estimate of α based on the MWW statistic is .40. A 90% confidence inter-
val for α based on the approximate (via the delta-method) variance, (2.8.5), is
(.06, .74); while a 90% bootstrap confidence interval based on 1000 bootstrap
samples is (.15, .88). Hence the MWW test, the corresponding estimate of α,
and the two confidence intervals indicate that the lower voltage level is less
hazardous than the higher level.

2.9 Two-Sample Rank Set Sampling (RSS)


The basic background for rank set sampling was discussed in Section 1.9. In
this section we extend these ideas to the two-sample location problem. Suppose
we have the two samples in which X1 , . . . , Xn1 are iid F (x) and Y1 , . . . , Yn2
are iid F (x − ∆) and the two samples are independent of one another. In the
corresponding RSS design, we take n1 cycles of k samples for X and n2 cycles
of q samples for Y . Proceeding as in Section 1.9, we display the measured data

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 132 —


i i

132 CHAPTER 2. TWO-SAMPLE PROBLEMS

as:
X(1)1 , . . . , X(1)n1 iid f(1) (t) Y(1)1 , . . . , Y(1)n2 iid f(1) (t − ∆)
· · · · · ·
· · · · · · .
· · · · · ·
X(k)1 , . . . , X(k)n1 iid f(k) (t) Y(q)1 , . . . , Y(q)n2 iid f(q) (t − ∆)

To test H0 : ∆ = 0 versus HA : ∆ > 0 we compute the Mann-


Whitney-Wilcoxon
P statistic with these rank set samples. Letting Usi =
n2 Pn1
t=1 j=1 I(Y (s)t > X(i)j ), the test statistic is

q k
X X
URSS = Usi .
s=1 i=1

Note that Usi is the Mann-Whitney-Wilcoxon statistic computed on the sam-


ple of the sth Y order statistics and the ith X order statistics. Even under
the null hypothesis H0 : ∆ = 0, Usi is not based on identically distributed
samples unless s = i. This complicates the null distribution of URSS .
Bohn and Wolfe (1992) present a thorough treatment of the distribution
theory for URSS . We note that under H0 : ∆ = 0, URSS is distribution free and
further, using the same ideas as in Theorem 1.9.1, EH0 (URSS ) = qkn1 n2 /2.
For fixed k and q, provided assumption D.1, (2.4.7), p holds, Theorem 2.4.2
can be applied to show that (URSS − qkn1 n2 /2)/ VH0 (URSS ) has a limiting
N(0, 1) distribution. The difficulty is in the calculation of the VH0 (URSS ); recall
Theorem 1.9.1 for a similar calculation for the sign statistic. Bohn and Wolfe
(1992) present a complex formula for the variance. Bohn and Wolfe provide
a table of the approximate null distribution of URSS for q = k = 2, n1 =
1, . . . , 5, n2 = 1, . . . , 5 and likewise for q = k = 3.
Another way to approximate the null distribution of URSS is to bootstrap
it. Consider, for simplicity, the case k = q = 3 and n1 = n2 = m. Hence
the expert must rank three observations and each of the m cycles consists of
three samples of size three for each of the X and Y measurements. In order
to bootstrap the null distribution of URSS , first align the Y -RSS’s with ∆, b
the Hodges-Lehmann estimate of shift computed across the two RSS’s. Our
bootstrap sampling is on the data with the indicated sampling distributions:
b
X(1)1 , . . . , X(1)m sample F̂(1) (x) Y(1)1 , . . . , Y(1)m sample F̂(1) (y− ∆)
b
X(2)1 , . . . , X(2)m sample F̂(2) (x) Y(2)1 , . . . , Y(2)m sample F̂(2) (y− ∆)
b
X(3)1 , . . . , X(3)m sample F̂(3) (x) Y(3)1 , . . . , Y(3)m sample F̂(3) (y− ∆)

In the bootstrap process, for each row i = 1, 2, 3, we take random samples



X(i)1 ∗
, . . . , X(i)m ∗
from F̂(i) (x) and Y(i)1 ∗
, . . . , Y(i)m b We then
from F̂(2) (y − ∆).

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 133 —


i i

2.10. TWO-SAMPLE SCALE PROBLEM 133


compute URSS on these samples. Repeating this B times, we obtain the sample
∗ ∗
of test statistics URSS,1 , . . . , URSS,B . Then the bootstrap p-value for our test

is #(URSS,j ≥ URSS )/B, where URSS is the value of the statistic based on the
original data. Generally we take B = 1000 for a p-value. It is clear how to
modify the above argument to allow for k 6= q and n1 6= n2 .

2.10 Two-Sample Scale Problem


Frequently, it is of interest to investigate whether or not one random variable
is more dispersed than another. The general case is when the random variables
differ in both location and scale. Suppose the distribution functions of X and
Y are given by F (x) and G(y) = F ((y − ∆)/η), respectively; hence L(Y ) =
L(ηX + ∆). For discussion, we consider one-sided hypotheses of the form

H0 : η = 1 versus HA : η > 1. (2.10.1)

The other one-sided or two-sided hypotheses can be handled similarly. Let


X1 , . . . , Xn1 and Y1 , . . . , Yn2 be samples drawn on the random variables X and
Y , respectively.
The traditional test of H0 is the F -test which is based on the ratio of sample
variances. As we discuss in Section 2.10.2, though, this test is generally not
asymptotically correct (one of the exceptions is when F (t) is a normal cdf).
Indeed, as many simulation studies have shown, this test is extremely liberal
in many non-normal situations; see Conover, Johnson, and Johnson (1981).
Tests of H0 should be invariant to the locations. One way of ensuring
this is to first center the observations. For the F -test, the centering is by
sample means; instead, we prefer to use the sample medians. Let θbX and θbY
denote the sample medians of the X and Y samples, respectively. Then the
samples of interest are the folded aligned samples given by |X1∗ |, . . . , |Xn∗1 | and
|Y1∗ |, . . . , |Yn∗2 |, where Xi∗ = Xi − θbX and Yi∗ = Yi − θbY .

2.10.1 Appropriate Score Functions


To obtain appropriate score functions for the scale problem, first consider
the case when the location parameters of X and Y are known. Without
loss of generality, we can then assume that they are 0 and, hence, that
L(Y ) = L(ηX). Further because η > 0, we have L(|Y |) = L(η|X|).
Let Z′ = (log |X1 |, . . . , log |Xn1 |, log |Y1 |, . . . , log |Yn2 |) and ci , (2.2.1), be the
dummy indicator variable, i.e., ci = 0 or 1, depending on whether Zi is an X
or Y , respectively. Then an equivalent formulation of this problem is

Zi = ζci + ei , 1 ≤ i ≤ n , (2.10.2)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 134 —


i i

134 CHAPTER 2. TWO-SAMPLE PROBLEMS

where ζ = log η, e1 , . . . , en are iid with distribution function F ∗ (x) which is


the cdf of log |X|. The hypotheses, (2.10.1), are equivalent to
H0 : ζ = 0 versus HA : ζ > 1. (2.10.3)
Of course, this is the two-sample location problem based on the logs of the ab-
solute values of the observations. Hence, the optimal score function for Model
2.10.2 is given by
f ∗′ (F ∗−1 (u)))
ϕf ∗ (u) = − ∗ ∗−1 . (2.10.4)
f (F (u)))
After some simplification, see Exercise 2.13.30, we have
f ∗′ (x) ex [f ′ (ex ) − f ′ (−ex )]
− = +1. (2.10.5)
f ∗ (x) f (ex ) + f (−ex )
If we further assume that f is symmetric, then expression (2.10.5) for the
optimal scores function simplifies to
  ′ −1 u+1 
u + 1 f F
ϕf ∗ (u) = −F −1 2 
− 1. (2.10.6)
2 f F −1 u+1
2

This expression is convenient to work with because it depends on F (t) and


f (t), the cdf and pdf of X, in the original formulation of this scale problem.
Keep in mind, though, that the scores are for Model (2.10.2). In the following
three examples, we obtain the optimal score functions for the normal, double
exponential, and the generalized F -family distributions, respectively.
Example 2.10.1 (L(X) Is Normal). Without loss of generality, assume that
f (x) is the standard normal density. In this case expression (2.10.6) simplifies
to   2
−1 u+1
ϕF K (u) = Φ −1 , (2.10.7)
2
where Φ is the standard normal distribution function; see Exercise 2.13.33.
Hence, if we are sampling from a normal distribution this suggests the rank
test statistic n2   2
X R|Yj | 1
−1
SF K = Φ + , (2.10.8)
j=1
2(n + 1) 2
where the F K subscript is due to Fligner and Killeen (1976), who discussed
this score function in their work on the two-sample scale problem.
Example 2.10.2 (L(X) Is Double Exponential). Suppose that the density of
X is the double exponential, f (x) = 2−1 exp {−|x|}, −∞ < x < ∞. Then as
Exercise 2.13.33 shows the optimal rank score function is given by
ϕ(u) = −(log (1 − u) + 1) . (2.10.9)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 135 —


i i

2.10. TWO-SAMPLE SCALE PROBLEM 135

These scores are not surprising, because the distribution of |X| is exponential.
Hence, this is precisely the log linear problem with exponentially distributed
lifetime that was discussed in Section 2.8; see the discussion around expression
(2.8.8).
Example 2.10.3 (L(|X|) Is a Member of the Generalized F -family). In Sec-
tion 3.10 a discussion is devoted to a large family of commonly used distri-
butions called the generalized F -family for survival type data. In particular,
as shown there, if |X| follows an F (2, 2)-distribution, then it follows (Exercise
2.13.31), that the log |X| has a logistic distribution. Thus the MWW statistic
is the optimal rank score statistic in this case.

Notice the relationship between tail-weight of the distribution and the


optimal score function for the scale problem over these last three examples. If
the underlying distribution is normal then the optimal score function (2.10.8)
is for very light-tailed distributions. Even at the double-exponential, the score
function (2.10.9) is still for light-tailed errors. Finally, for the heavy-tailed
(variance is ∞) F (2, 2) distribution the score function is the bounded MWW
score function. The reason for the difference in location and scale scores is
that the optimal score function for the scale case is based on the distribution
of the logs of the original variables.
Once a scale score function is selected, following Section 2.5 the general
scores process for this problem is given by
n2
X
Sϕ (ζ) = aϕ (R(log |Yj | − ζ)) , (2.10.10)
j=1

where the scores a(i) are generated by a(i) = ϕ(i/(n + 1)).


A rank test statistic for the hypotheses, (2.10.3), is given by
n2
X n2
X
Sϕ = Sϕ (0) = aϕ (R(log |Yj |) = aϕ (R(|Yj |) , (2.10.11)
j=1 j=1

where the last equality holds because the log function is strictly increasing.
This is not necessarily a standardized score function, but it follows from the
discussion on general scores found in Section 2.5 and (2.5.18) that the null
mean µϕ and null variance σϕ2 of the statistic are given by
n1 n2 X
µϕ = n2 a and σϕ2 = (a(i) − a)2 . (2.10.12)
n(n − 1)
The asymptotic version of this test statistic rejects H0 at approximate level α
if z ≥ zα where
Sϕ − µ ϕ
z= . (2.10.13)
σϕ

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 136 —


i i

136 CHAPTER 2. TWO-SAMPLE PROBLEMS

The efficacy of the test based on Sϕ is given by expression (2.5.27); i.e.,


p
cϕ = τϕ−1 λ1 λ2 , (2.10.14)
where τϕ is given by
Z 1
τϕ−1 = ϕ(u)ϕf ∗ (u) du (2.10.15)
0
and the optimal scores function ϕf ∗ (u) is given in expression (2.10.4). Note
that this formula for the efficacy is under the assumption that the score func-
tion ϕ(u) is standardized.
Recall the original (realistic) problem, where the distribution functions
of X and Y are given by F (x) and G(y) = F ((y − ∆)/η), respectively and
the difference in locations, ∆, is unknown. In this case, L(Y ) = L(ηX + ∆).
As noted above, the samples of interest are the folded aligned samples given
by |X1∗ |, . . . , |Xn∗1 | and |Y1∗ |, . . . , |Yn∗2 |, where Xi∗ = Xi − θbX and Yi∗ = Yi −
θbY , where θbX and θbY denote the sample medians of the X and Y samples,
respectively.
Given a score function ϕ(u), we consider the linear rank statistic, (2.10.11),
where the ranking is performed on the folded-aligned observations; i.e.,
n2
X
Sϕ∗ = a(R(|Yj∗ |)). (2.10.16)
j=1

The statistic S ∗ is no longer distribution free for finite samples. However, if


we further assume that the distributions of X and Y are symmetric, then the
test statistic Sϕ∗ is asymptotically distribution free and has the same efficiency
properties as Sϕ ; see Puri (1968) and Fligner and Hettmansperger (1979).
The requirement that f is symmetric is discussed in detail by Fligner and
Hettmansperger (1979). Note here that the scores need not be standardized
and the null mean and variance of Sϕ∗ are defined in expression (2.10.12). As
with the test statistics, we denote this mean and variance by µ∗ϕ and σϕ∗2 ,
respectively.
Estimation and confidence intervals for the parameter η are based on the
process
n2
X
Sϕ∗ (ζ) = aϕ (R(log |Yj∗ | − ζ)) . (2.10.17)
j=1

An estimate of ζ is a value ζb which solves the equation (2.10.18); i.e.,


b =. ∗
Sϕ∗ (ζ) µϕ . (2.10.18)
An estimate of η, the ratio of scale parameters, is then
b
ηb = eζ . (2.10.19)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 137 —


i i

2.10. TWO-SAMPLE SCALE PROBLEM 137

The interval (ζbL, ζbU ) where ζbL and ζbU solve the respective equations

Sϕ∗ (ζbL ) =
˙ zα/2 σϕ∗ + µ∗ϕ
Sϕ∗ (ζbU ) =
˙ −zα/2 σϕ∗ + µ∗ϕ

forms (asymptotically) a (1 − α)100% confidence interval for ζ. The corre-


sponding confidence interval for η is (exp {ζbL}, exp {ζbU }).
As a simple rank-based analysis, consider the test statistic and estimator
based on the optimal scores (2.10.7) for the normal situation. We call this
the Fligner-Killeen two-sample scale analysis. The folded aligned samples
version of the test statistic (2.10.8) is the statistic

Xn2   2
R|Yj∗ | 1
SF∗ K = Φ−1
+ . (2.10.20)
j=1
2(n + 1) 2

The standardized test statistic is zF∗ K = (SF∗ K − µF K )/σF K , where µF K abd


σF K are the values of (2.10.12) for the scores (2.10.7). This statistic for non-
aligned samples is given on page 74 of Hájek and Šidák (1967). A version of it
was also discussed by Fligner and Killeen (1976). We refer to this test and the
associated estimator and confidence interval as the Fligner-Killeen analysis.
The Robnp function twoscale with the score function phiscalefk computes
the Fligner-Killeen analysis. We next obtain the efficacy of this analysis.

Example 2.10.4 (Efficacy for the Score Function ϕF K (u)). To use expression
(2.5.27) for the efficacy, we must first standardize the score function ϕF K (u) =
{Φ−1 [(u + 1)/2]}2 − 1, (2.10.7). Using the substitution (u + 1)/2 = Φ(t), we
have Z 1 Z ∞
ϕF K (u) du = t2 φ(t) dt − 1 = 1 − 1 = 0.
0 −∞

Hence, the mean is 0. In the same way,


Z 1 Z ∞ Z ∞
2 4
[ϕF K (u)] du = t φ(t) dt − 2 t2 φ(t) dt + 1 = 2.
0 −∞ −∞

Thus the standardized score function is



ϕ∗F K (u) = {Φ−1 [(u + 1)/2]}2 − 1]/ 2. (2.10.21)

Hence, the efficacy of the Fligner-Killeen analysis is


p Z 1
1
c ϕF K = λ 1 λ 2 √ {Φ−1 [(u + 1)/2]}2 − 1]ϕf ∗ (u) du, (2.10.22)
0 2

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 138 —


i i

138 CHAPTER 2. TWO-SAMPLE PROBLEMS

where the optimal score function ϕf ∗ (u) is given in expression (2.10.4). In


particular, the efficacy at the normal distribution is given by
p Z 1 √ p
1
cϕF K (normal) = λ1 λ2 √ {Φ−1 [(u + 1)/2]}2 − 1]2 du, = 2 λ1 λ2 .
0 2
(2.10.23)
In Section 2.10.23, we use this efficacy to determine the ARE between the
Fligner-Killeen and the traditional F -tests. We illustrate the Fligner-Killeen
analysis with the following example.
Example 2.10.5 (Doksum and Sievers Data). Doksum and Sievers (1976)
describe an experiment involving the effect of ozone on weight gain of rats.
The experimental group consisted of n2 = 22 rats which were placed in an
ozone environment for seven days, while the control group contained n1 = 21
rats which were placed in an ozone-free environment for the same amount
of time. The response was the weight gain in a rat over the time period.
Figure 2.10.1 displays the comparison boxplots for the data. There appears
to be a difference in scale. Using the Robnp software discussed above, the
Fligner-Killeen test statistic SF∗ K = 28.711 and its standardized value is zF∗ K =
2.095. The corresponding p-value for a two-sided test is 0.036, confirming
the impression from the plot. The associated estimate of the ratio (ozone to
control) of scales is ηb = 2.36 with a 95% confidence interval of (1.09, 5.10).

Conover, Johnson, and Johnson (1981) performed a large Monte Carlo


study of tests of dispersion, including these folded-aligned rank tests, over a
wide variety of situations for the c-sample scale problem. The traditional F -
test (Bartlett’s test) did poorly (as would be expected from our comments
below about the lack of robustness of the classical F -test). In certain null
situations its empirical α levels exceeded .80 when the nominal α level was
.05. One rank test that performed very well was the aligned rank version of a
test statistic similar to SF∗ K , (2.10.20), but with the exponent of 1 instead of 2
in the definition of the score function. This performed well overall in terms of
validity and power except for highly asymmetric distributions, where it has a
tendency to be liberal. However, in the following simulation study the Fligner-
Killeen test (2.10.20) with exponent 2 is empirically valid over the asymmetric
situations covered.
Example 2.10.6 (Simulation Study for Validity of Tests Sϕ∗ ). Table 2.10.1
displays the results of a small simulation study of the validity of the rank-
based tests of scale for five different score functions over mostly skewed error
distributions. The scores in the study are: (fk2 ), the optimal score function
for the normal distribution; (fk1 ), similar to last except the exponent is one;

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 139 —


i i

2.10. TWO-SAMPLE SCALE PROBLEM 139

Figure 2.10.1: Comparison boxplots of treated and control weight gains in rats.

Comparison Boxplots of Control and Ozone


50
40
30
Weight Gain

20
10
0
−10

Control Ozone

(Wilcoxon), the linear Wilcoxon score function; (Quad), the score function
ϕ(u) = u2 ; and (Logistic), the optimal score function if the distribution of X
is logistic (see Exercise 2.13.32). The error distributions include the normal
and the χ2 (1) distributions and several members of the skewed contaminated
normal distribution. In the latter case, the random variable X is written as
X = X1 (1 − Iǫ ) + Iǫ X2 , where X1 and X2 have N(0, 1) and N(µc , σc2 ) distribu-
tions, respectively, Iǫ has a Bernoulli distribution with probability of success
ǫ, and X1 , X2 , and Iǫ are mutually independent. For the study ǫ was set at
0.3 and µc and σc varied. The pdfs of the three SCN distributions in Table
2.10.1 are shown in Figure 2.10.2. The pdf in the bottom right corner panel
of the figure is that of χ2 (1)-distribution. For all but the last situation in Ta-
ble 2.10.1, the sample sizes are n1 = 20 and n2 = 25. The last situation is
for n1 = n2 = 10. The number of simulations for each situation was set at
1000. For each run, the two-sided alternative, HA : η 6= 1, was tested and
the estimator of η and an associated confidence interval for η were obtained.
Computations were performed by Robnp functions.
The table shows the empirical α levels at the nominal 0.10, 0.05, and
0.01 levels; the empirical confidence coefficient for a nominal 95% confidence
interval; the mean of the estimates of η; and the MSE for ηb. Of the five
analyses, overall the Fligner-Killeen analysis (fk2 ) performed the best. This
analysis was valid (nominal levels and empirical coverage) in all the situations,

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 140 —


i i

140 CHAPTER 2. TWO-SAMPLE PROBLEMS

except for the χ2 (1) distribution at the 10% level and the larger sample sizes.
Even here, its empirical level is 0.128. The other tests were liberal in the
skewed situations, some such as the Wilcoxon test were quite liberal. Also,
the fk analysis (exponent 1 in its score function) was liberal for the χ2 (1)
situations. Notice that the Fligner-Killeen analysis achieved the lowest MSE
in all the situations.

Hall and Padmanabhan (1997) developed a percentile bootstrap for these


rank-based tests which in their accompanying study performed quite well for
skewed error distributions as well as the symmetric error distributions.

Figure 2.10.2: Pdfs of skewed distributions in the simulation study discussed


in Example 2.10.6.

SCN: µc = 2, σc = 1.41, ε = .3 SCN: µc = 6, σc = 1.41, ε = .3


0.30

0.00 0.05 0.10 0.15 0.20 0.25


0.20
f(x)

f(x)
0.10
0.00

−2 0 2 4 6 8 −2 0 2 4 6 8 10

x x

SCN: µc = 12, σc = 1.41, ε = .3 χ2, One Defree of Freedom


1.2
0.00 0.05 0.10 0.15 0.20 0.25

1.0
0.8
f(x)

f(x)
0.6
0.4
0.2
0.0

0 5 10 15 0 1 2 3 4

x x

As a final remark, another class of linear rank statistics for the two-sample
scale problem consists of simple linear rank statistics of the form
n2
X
S= a(R(Yj )) , (2.10.24)
j=1

where the scores are generated as a(i) = ϕ(i/(n+1)). The folded rank statistics
discussed above suggest that ϕ be a convex (or concave) function. One popular
score function is the quadratic function ϕ(u) = (u − 1/2)2. The resulting

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 141 —


i i

2.10. TWO-SAMPLE SCALE PROBLEM 141

Table 2.10.1: Empirical Levels, Confidences, and MSE’s for the Monte Carlo
Study Discussed in Example 2.10.6
Normal Errors, n1 = 20, n2 = 25
b.10
α b.05
α α d .95
b.01 Cnf η̂ MSE(η̂)
Logistic 0.083 0.041 0.006 0.961 1.037 0.060
Quad. 0.080 0.030 0.008 0.970 1.043 0.076
Wilcoxon 0.073 0.033 0.004 0.967 1.042 0.097
fk2 0.087 0.039 0.004 0.960 1.036 0.057
fk 0.077 0.033 0.005 0.969 1.037 0.067

SKCN(µc = 2, σc = 2, ǫc = 0.3), n1 = 20, n2 = 25
Logistic 0.106 0.036 0.006 0.965 1.035 0.076
Quad. 0.106 0.046 0.008 0.953 1.040 0.095
Wilcoxon 0.103 0.049 0.007 0.952 1.043 0.117
fk2 0.100 0.034 0.006 0.966 1.033 0.073
fk 0.099 0.047 0.006 0.953 1.034 0.085

SKCN(µc = 6, σc = 2, ǫc = 0.3), n1 = 20, n2 = 25
Logistic 0.081 0.033 0.006 0.966 1.067 0.166
Quad. 0.122 0.068 0.020 0.933 1.105 0.305
Wilcoxon 0.163 0.103 0.036 0.897 1.125 0.420
fk2 0.072 0.026 0.005 0.974 1.057 0.126
fk 0.111 0.057 0.015 0.942 1.075 0.229

SKCN(µc = 12, σc = 2, ǫc = 0.3), n1 = 20, n2 = 25
Logistic 0.084 0.046 0.007 0.954 1.091 0.298
Quad. 0.138 0.085 0.018 0.916 1.183 0.706
Wilcoxon 0.171 0.116 0.038 0.886 1.188 0.782
fk2 0.074 0.042 0.007 0.958 1.070 0.201
fk 0.115 0.069 0.015 0.932 1.109 0.400
2
χ (1), n1 = 20, n2 = 25
Logistic 0.154 0.086 0.023 0.913 1.128056 0.353
Quad. 0.249 0.149 0.047 0.851 1.170 0.482
Wilcoxon 0.304 0.197 0.067 0.804 1.196 0.611
fk2 0.128 0.066 0.018 0.936 1.120 0.336
fk 0.220 0.131 0.039 0.870 1.154 0.432
2
χ (1), n1 = 10, n2 = 10
Logistic 0.132 0.062 0.018 0.934 1.360 1.495
Quad. 0.192 0.099 0.035 0.900 1.457 2.108
Wilcoxon 0.276 0.166 0.042 0.833 1.560 3.311
2
fk 0.111 0.057 0.013 0.941 1.335 1.349
fk 0.199 0.103 0.033 0.893 1.450 2.086

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 142 —


i i

142 CHAPTER 2. TWO-SAMPLE PROBLEMS

statistic,
n2 
X 2
R(Yj ) 1
SM = − , (2.10.25)
j=1
n+1 2
was proposed by Mood (1954) as a test statistic for the hypotheses (2.10.1).
For the realistic problem with unknown location, though, the observations
have to be first aligned. Asymptotic theory holds, provided the underlying
distribution is symmetric. This class of aligned rank tests, though, did not
perform nearly as well as the folded rank statistics, (2.10.16), in the large
Monte Carlo study of Conover et al. (1981). Hence, we recommend the folded
rank-based analyses discussed above.

2.10.2 Efficacy of the Traditional F -Test


We next obtain the efficacy of the traditional F -test for the ratio of scale
parameters. Actually for our development we need not assume that X and
Y have the same locations. Let σ22 and σ12 denote the variances of Y and
X, respectively. Then in the notation in the first paragraph of this section,
η 2 = σ22 /σ12 . The classical F -test of the hypotheses (2.10.1) is to reject H0 if
F ∗ ≥ F (α, n2 − 1, n1 − 1) where
F∗ = σ
b22 /b
σ12 ,
and σ b22 and σ b12 are the sample variances of the samples Y1 , . . . , Yn2 and
X1 , . . . , Xn1 , respectively. The F -test is exact size α if underlying distribu-
tions are normal. Also the test is invariant to differences in location.
We first need the asymptotic distribution of F ∗ under the null hypothesis.
Instead of working with F ∗ it√is more convenient mathematically to work with
the equivalent test statistic n log F ∗ . We assume that X has a finite fourth
central moment; i.e., µX,4 = E[(X − E(X))4 ] < ∞. Let ξ = (µX,4 /σ14 ) −
3 denote the kurtosis of X. It easily follows that Y has the same kurtosis
under the null and alternative hypotheses. A key result, established in Exercise
2.13.36, is that under these conditions
√ D
σi2 − σi2 ) → N(0, σi4 (ξ + 2)) , for i = 1, 2 .
ni (b (2.10.26)
It follows immediately by the delta method that
√ D
bi2 − log σi2 ) → N(0, ξ + 2) , for i = 1, 2 .
ni (log σ (2.10.27)
Under H0 , σi = σ, say, and the last result,
r r
√ ∗ n√ 2 2 n√
n log F = b2 − log σ ) −
n2 (log σ b12 − log σ 2 )
n1 (log σ
n2 n1
D
→ N(0, (ξ + 2))/(λ1λ2 )) . (2.10.28)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 143 —


i i

2.10. TWO-SAMPLE SCALE PROBLEM 143

The approximate test rejects H0 if



n log F ∗
p ≥ zα , (2.10.29)
(ξ + 2)/(λ1 λ2 )
when n = n1 + n2 → ∞, ni /n → λi , i = 1, 2, and 0 < λ1 , λ2 < 1. Note
that ξ = 0 if X is normal. Usually in practice, it is assumed that ξ = 0; i.e.,
F ∗ is not corrected by an estimate of ξ. This is one reason that the usual
F -test for ratio in variances does not possess robustness of validity; that is,
the significance level is not asymptotically distribution free. Unlike the t-test,
the F -test for variances is not even asymptotically distribution free under H0 .
In order to obtain the efficacy of the √ F -test, consider the sequence of
contiguous alternatives Hn : ∆n = δ/ n, δ > 0. Assume without loss of
generality that the locations of X and Y are the same. Under this sequence of
alternatives we have Yj = e∆n Uj where Uj is a random variable with cdf F (x)
while Yj has cdf F (e∆n x). We also get σ b22 = exp {2∆n }b
σU2 where σ
bU2 denotes
the sample variance of U1 , . . . , Un2 . Let γF (∆) denote the power function of
the F -test. The asymptotic power lemma for the F -test is
Theorem 2.10.1. Assuming that X has a finite fourth moment, with ξ =
(µX,4 /σ14 ) − 3,
lim γF (∆n ) = P (Z ≥ zα − cF δ) ,
n→∞
where Z has a standard normal distribution and efficacy
p p
cF = 2 λ 1 λ 2 / ξ + 2 . (2.10.30)
Proof: The conclusion follows directly upon observing,
√ √
n log F ∗ = b22 − log σ
n(log σ b12 )
√ √
= bU2 + 2(δ/ n) − log σ
n(log σ b12 )
r r
n√ 2 2 n√
= 2δ + n2 (log σbU − log σ ) − b12 − log σ 2 )
n1 (log σ
n2 n1
and that the last quantity converges in distribution to a N(2δ, (ξ + 2))/(λ1λ2 ))
variate.
Let ϕ(u) denote a general score function for a folded-aligned rank-based
analysis as discussed above. It then follows that the asymptotic relative effi-
ciency of this ϕ-test to the F -test is the ratio of the squares of their efficacies,
i.e., e(S, F ) = c2ϕ /c2F , where cϕ is given in expression (2.5.27).
Suppose we use the Fligner-Killeen analysis. Then its efficacy is cϕF K which
is given in expression (2.10.22). The ARE between the Fligner-Killeen analysis
and the traditional F -test analysis is the ratio c2ϕF K /c2F . In particular, if we
assume that the underlying distribution is normal, then by (2.10.23) this ratio
is one and, hence, the Fligner-Killeen test is asymptotically efficient at the
normal model.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 144 —


i i

144 CHAPTER 2. TWO-SAMPLE PROBLEMS

2.11 Behrens-Fisher Problem


Consider the general model in Section 2.1 of this chapter, where X1 , . . . , Xn1
is a random sample on the random variable X which has distribution function
F (x) and density function f (x) and Y1 , . . . , Yn2 is a second random sample,
independent of the first, on the random variable Y which has common distri-
bution function G(x) and density g(x). Let θX and θY denote the medians of
X and Y , respectively, and let ∆ = θY − θX . In Section 2.4 we showed that
the MWW test was consistent for the stochastically ordered alternative. In
the location model where the distributions of X and Y differ by at most a
shift in location, the hypothesis F = G is equivalent to the null hypothesis
that ∆ = 0. In this section we drop the location model assumption, that is, we
assume that X and Y have distribution functions F and G, respectively, but
we still consider the null hypothesis that ∆ = 0. In order to avoid confusion
with Section 2.4, we explicitly state the hypotheses of this section as

H0 : ∆ = 0vsHA : ∆ > 0, where ∆ = θY − θX , L(X) = F, L(Y ) = G.


(2.11.1)
As in the previous sections we have selected a specific alternative for the
discussion.
The above hypothesis is our most general hypothesis of this section and the
modified Mathisen’s test defined below is consistent for it. We also consider
the case where the forms of F and G are the same; that is, G(x) = F (x/η),
for some parameter η. Note in this case that L(Y ) = L(ηX); hence, η =
T (Y )/T (X) where T (X) is any scale functional, (T (X) > 0 and T (aX) =
aT (X) for a ≥ 0). If T (X) = σX , the standard deviation of X, then this
is a Behrens-Fisher problem with F unknown. If we further assume that the
distributions of X and Y are symmetric then the modified MWW, defined
below, can be used to test that ∆ = 0. The most restrictive case is when both
F and G are assumed to be normal distribution functions. This is, of course,
the classical Behrens-Fisher problem and the classical solution to it is the
Welch type t-test, discussed below. For motivation we first show the behavior
of the usual MWW statistic. We then consider general rank procedures and
finally specialize to analogues of the L1 and MWW analyses.

2.11.1 Behavior of the Usual MWW Test


In order to motivate the problem, consider the null behavior of the usual
MWW test under (2.11.1) with the further restriction that the distributions
of X and Y are symmetric. Under H0 , since we are examining null behavior
there is no loss of generality if we assume that θX = θY = 0. The asymptotic

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 145 —


i i

2.11. BEHRENS-FISHER PROBLEM 145

form of the MWW test rejects H0 in favor of HA if


n1 X
n2
r
X n1 n2 n1 n2 (n + 1)
SR+ = I(Yj − Xi > 0) ≥ + zα .
i=1 j=1
2 12

This test would have asymptotic level α if F = G. As Exercise 2.13.39 shows,


we still have EH0 (SR+ ) = n1 n2 /2 when the densities of X and Y are symmetric.
From Theorem 2.4.5, Part (a), the variance of the MWW statistic under H0
satisfies the limit,

VarH0 (SR+ )
→ λ1 Var(F (Y )) + λ2 Var(G(X)) .
n1 n2 (n + 1)

Recall that we obtained the asymptotic distribution of SR+ , Theorem 2.4.9,


under general conditions which cover the current assumptions; hence, the true
significance level of the MWW test has the following limiting behavior:
" r #
n1 n 2 n1 n2 (n + 1)
αS + = PH0 SR+ ≥ + zα
R 2 12
" s #
SR+ − n12n2 n1 n2 (n + 1)
= P H0 p ≥ zα
VarH0 (SR+ ) 12VarH0 (SR+ )
h 1 1
i
→ 1 − Φ zα (12)− 2 (λ1 Var(F (Y )) + λ2 Var(G(X)))− 2 .(2.11.2)

Under the assumptions that the sample sizes are the same and that L(X)
and the L(Y ) have the same form we can simplify expression (2.11.2) further.
We express the result in the following theorem.
Theorem 2.11.1. Suppose that the null hypothesis in (2.11.1) is true. Assume
that the distributions of Y and X are symmetric, n1 = n2 , and G(x) = F (x/η)
where η is an unknown parameter. Then the maximum observed significance
level is 1 − Φ(.816zα ) which is approached as η → 0 or η → ∞.

RProof: Under the assumptions of the R theorem, note that Var(F (Y )) =


F 2 (ηt)dF (t) − 14 and Var(G(X)) = F 2 (x/η)dF (x) − 14 . Differentiating
(2.11.2) with respect to η we get
 
φ zα (12)−1/2 ((1/2)Var(F (Y )) + (1/2)Var(G(X)))−1/2 zα (12)−1/2
Z Z − 32
2
F (ηt)tf (ηt)f (t)dt + F (t/η)f (t/η)(−t/η )f (t)dt . (2.11.3)

Making the substitution


R u = ηt in the first integral, the quantity in braces
−2
reduces to η (F (u) − F (u/η))uf (u)f (u/η)du. Note that the other factors

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 146 —


i i

146 CHAPTER 2. TWO-SAMPLE PROBLEMS

in (2.11.3) are strictly positive. Thus to determine the graphical behavior of


(2.11.2) with respect to η, we need only consider the factor in braces. First
note that it has a critical point at η = 1. Next consider the case η > 1. In
this case F (u) − F (u/η) < 0 on the interval (−∞, 0) and is positive on the
interval (0, ∞); hence the factor in braces is positive for η > 1. Using a similar
argument this factor is negative for 0 < η < 1. Therefore the limit of the
function αS + (η) is decreasing on the interval (0, 1), has a minimum at η = 1,
R
and is increasing on the interval (1, ∞).
Thus the minimum level of significance occurs at η = 1 (the location
model), where it is α. By the graphical behavior of the function, maximum
levels would occur at the extremes of 0 and ∞. But it follows that
Z 
2 1 0 if η → 0
Var(F (Y )) = F (ηt)dF (t) − → 1
4 4
if η → ∞

and Z 
1 1
2 4
if η → 0
Var(G(X)) = F (x/η)dF (x) − → .
4 0 if η → ∞
From these two results and (2.11.2), the true significance level of the MWW
test satisfies

1 − Φ(zα (3/2)−1/2 ) if η → 0
αS + → .
R 1 − Φ(zα (3/2)−1/2 ) if η → ∞

Hence,
αS + → 1 − Φ(zα (3/2)−1/2 ) = 1 − Φ(.816zα ) ,
R

whether η → 0 or ∞. Thus the maximum observed significance level is 1 −


Φ(.816zα ) which is approached as η → 0 or η → ∞.
For example if α = .05 then .816zα = 1.34 and αS + → 1 − Φ(1.34) = .09.
R
Thus in the equal sample size case when F and G differ only in scale parameter
and are symmetric, the nominal 5% level of the MWW test is not worse than
.09. In order to guarantee that α ≤ .05 choose zα so that 1 − Φ(.816zα ) = .05.
This leads to zα = 2.02 which is the critical value for an α = .02. Hence another
way of saying this is: by performing a 2% MWW test we are guaranteed that
the true (asymptotic) level is at most 5%.

2.11.2 General Rank Tests


Assuming the most general hypothesis, (2.11.1), we follow the development of
Fligner and Policello (1981) to construct general tests. Suppose T represents
a rank test statistic, used in the case F = G, and that the test rejects H0 :
∆ = 0 in favor of HA : ∆ > 0 for large values of T . Suppose further that
n1/2 (T − µF,G )/σF,G converges in distribution to a standard normal. Let µ0

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 147 —


i i

2.11. BEHRENS-FISHER PROBLEM 147

denote the null mean of T and assume that it is independent of F . Next


suppose that σ b is a consistent estimate of σF,G which is a function only of
the ranks of the combined sample. This ensures distribution freeness under
H0 ; otherwise, the test statistic is only asymptotically distribution free. The
modified test statistic is
n1/2 (T − µ0 )
Tb = . (2.11.4)
b
σ
Such a test can be used for the general hypothesis (2.11.1). Fligner and Po-
licello (1981) applied this approach to Mood’s statistic; see Hettmansperger
and Malin (1975), also. In the next section, we consider Mathisen’s test.

2.11.3 Modified Mathisen’s Test


We next present a modified version of Mathisen’s test for the most general
hypothesis (2.11.1). Let θbX = medi Xi and define the sign-process
n2
X
S2 (θ) = sgn(Yj − θ) . (2.11.5)
j=1

Recall from expression (2.6.8), Section 2.6.2 that Mathisen’s test statistic (cen-
tered version) is given by S2 (θbX ). This is our test statistic. The modification
lies in its asymptotic distribution which is given in the next theorem.
Theorem 2.11.2. Assume the null hypothesis in expression (2.11.1) is true.
Then under the assumption (D.1), (2.4.7), √1n2 S2 (θbX ) is asymptotically nor-
2 2
mal with mean 0 and asymptotic variance 1 + K12 where K12 is defined by

2 λ2 g 2 (θY )
K12 = . (2.11.6)
λ1 f 2 (θX )
Proof: Assume without loss of generality that θX = θY = 0. From the asymp-
totic linearity results discussed in Example 1.5.2 of Chapter 1, we have that
1 . 1 √
√ S2 (θn ) = √ S2 (0) − 2g(0) n2 θn ,
n2 n2
√ √
for n|θn | ≤ c, c > 0. Since n2 θbX is bounded in probability, upon substitu-
tion in the last expression we get
1 . 1 √
√ S2 (θbX ) = √ S2 (0) − 2g(0) n2 θbX . (2.11.7)
n2 n2
In Example 1.5.2, we also have the approximation
. 1
θbX = S1 (0) , (2.11.8)
n1 2f (0)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 148 —


i i

148 CHAPTER 2. TWO-SAMPLE PROBLEMS


Pn1
where S1 (0) = sgn(Xi ). Combining (2.11.7) and (2.11.8), we get
i=1
r
1 b . 1 g(0) n2 1
√ S2 (θX ) = √ S2 (0) − √ S1 (0) . (2.11.9)
n2 n2 f (0) n1 n1
√ D
The results follow because of independent samples and because Si (0)/ ni →
N(0, 1), for i = 1, 2.
In order to use this test we need an estimate of K12 . As in Chapter 1,
selected order statistics from the sample X1 , . . . , Xn1 provide a confidence
interval for the median of X. Hence given a level α,√ the interval (L, U), where
L1 = X(k+1) , U1 = X(n−k) , and k = n/2 − zα/2 ( n/2) is an approximate
(1 − α)100% confidence interval for the median of X. Let DX denote the
length of this confidence interval. By Theorem 1.5.9 of Chapter 1,

n1 DX P
→ 2f (0) . (2.11.10)
2zα/2
In the same way let DY denote the length of the corresponding (1 − α)100%
confidence interval for the median of Y . Define
b 12 = DY .
K (2.11.11)
DX
From (2.11.10) and the corresponding result for DY , the estimate K b 12 is a
consistent estimate of K12 , under both H0 and HA .
Thus the modified Mathisen’s test for the general hypotheses (2.11.1), is
to reject H0 at approximately level α if
S2 (θbX )
ZM = q ≥ zα . (2.11.12)
b 2
n2 (1 + K12 )
To derive the efficacy of this statistic we use the development of Section
1.5.2. The average to consider is n−1 S2 (θbX ). Let ∆ denote the shift in medians
and without loss of generality let θX = 0. Then the mean function we need is
lim E∆ (n−1 S2 (θbX )) = µ(∆) .
n→∞

Note that we can reexpress the expansion (2.11.9) as


1 n2 1
S2 (θbX ) = S2 (θbX )
n n n2
 r r 
. n2 1 g(0) n2 n1 1
= S2 (0) − S1 (0)
n1 n2 f (0) n1 n2 n1
 
P∆ g(0)
→ λ2 E∆ [sgn(Y )] − E∆ [sgn(X)]
f (0)
= λ2 E∆ [sgn(Y )] = µ(∆) , (2.11.13)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 149 —


i i

2.11. BEHRENS-FISHER PROBLEM 149

where the next to last equality holds since θX = 0. Using E∆ (sgn(Y )) =


1 − 2G(−∆), we obtain the derivative

µ′ (0) = 2λ2 g(0) . (2.11.14)

By Theorem√ 2.11.2 we have the asymptotic null variance of the test statistic
S2 (θbX )/ n. From the above discussion then the statistic S2 (θbX ) is Pitman
Regular with efficacy

2λ2 g(0) λ1 λ2 2g(0)
cm2 = p 2
=p . (2.11.15)
λ2 (1 + K12 ) λ1 + λ2 (g 2 (0)/f 2(0))
Using Theorem 1.5.4 of Chapter 1, consistency of the modified Mathisen’s
test for the hypotheses (2.11.1) is obtained provided µ(∆) > µ(0). But this
follows immediately from the inequality G(−∆) > G(0).

2.11.4 Modified MWW Test


+
Recall by Theorem 2.4.9R that the mean of the MWW test statistic SR is
n1 n2 P (Y > X) = 1 − G(x)f (x)dx. For general F and G, though, this mean
may not be 1/2 under H0 . Since this section is concerned with methods for
testing the specific hypothesis that ∆ = 0, we add the further restriction that
the distributions of X and Y are symmetric. Recall from Section 2.11.1
that under this assumption and ∆ = 0 that E(SR+ ) = n1 n2 /2; see Exercise
2.13.39.
Using the general development of rank tests, Section 2.11.2, our modified
rank test is given by: reject H0 : ∆ = 0 in favor of HA : ∆ > 0 if Z > zα
where
SR+ − (n1 n2 )/2
Z= q , (2.11.16)
d +)
Var(S R

d + ) is a consistent estimate of Var(S + ), under H0 . From the asymp-


where Var(S R R
totic distribution theory obtained for SR+ under general conditions, Theorem
2.4.9, it follows that this test has approximate level α. By Theorem 2.4.5, we
can express the variance as
Z Z 2 !
Var(SR+ ) = n1 n2 GdF − GdF (2.11.17)
Z Z 2 !
+n1 n2 (n1 − 1) F 2 dG − F dG
Z Z 2 !
+n1 n2 (n2 − 1) (1 − G)2 dF − (1 − G)dF .

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 150 —


i i

150 CHAPTER 2. TWO-SAMPLE PROBLEMS

Following the suggestion of Fligner and Policello (1981), we estimate Var(SR+ )


by replacing F and G by the empirical cdfs Fn1 and Gn2 respectively. As
Exercise 2.13.40 demonstrates, this estimate is consistent and, further, it is a
function of the ranks of the combined sample. Thus the test is distribution
free when F (x) = G(x) and is asymptotically distribution free when F and G
have symmetric densities.
The efficacy for the modified MWW follows using an argument similar to
that for the MWW in Section 2.4. As there, the function SR+ (∆) is a decreasing
function of ∆. Its mean function is given by
Z
+ +
E∆ (SR ) = E0 (SR (−∆)) = n1 n2 (1 − G(x − ∆))f (x)dx .

The average to consider here is S R = (nR1 n2 )−1 SR+ . Letting µ(∆) denote the
mean of S R under ∆, we have µ′ (0) = g(x)f (x)dx > 0. The variance we
need is σ 2 (0) = limn→∞ nVar0 (S R ), which using the above result on variance
simplifies to
Z Z ! 2
2
σ (0) = λ−1
2
2
F dG − F dG
Z Z 2 !
+λ−1
1 (1 − G)2 dF − (1 − G)dF .

The process SR+ (∆) is Pitman Regular and, in particular, its efficacy is given
by
√ R
λ1 λ2 g(x)f (x)
cm2 = r  .
R R 2  R R 2 
λ1 F 2 dG − F dG + λ2 (1 − G)2 dF − (1 − G)dF
(2.11.18)
As with the modified Mathisen’s test, we show consistency of the modified
MWW test by using Theorem 1.5.4. Again we need only show that µ(0) <
µ(∆). But this follows immediately provided the supports of F and G overlap
in a neighborhood of 0. Note that this shows that the modified MWW is
consistent for the hypotheses (2.11.1) under the further restriction that the
densities of X and Y are symmetric.

2.11.5 Efficiencies and Discussion


Before obtaining the asymptotic relative efficiencies of the above procedures,
we briefly discuss traditional methods. Suppose we restrict F and G to have
symmetric densities of the same form with finite variance; that is, F (x) =

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 151 —


i i

2.11. BEHRENS-FISHER PROBLEM 151

F0 ((x − θX )/σX ) and G(x) = F0 ((x − θY )/σY ) where F0 is some distribution


function with symmetric density f0 and σX and σY are the standard deviations
of X and Y , respectively. √
Under these assumptions, it follows that n(Y − X − ∆) converges in
2
distribution to N(0, (σX /λ1 ) + (σY2 /λ2 )); see Exercise 2.13.41. The test is to
reject H0 : ∆ = 0 in favor of HA : ∆ > 0 if tW > zα where

Y −X
tW = q 2 ,
sX s2Y
n1
+ n2

where s2X and s2Y are the sample variances of Xi and Yj , respectively. Under
these assumptions, it follows that these sample variances are consistent esti-
2
mates of σX and σY2 , respectively; hence, the test has approximate level α.
If F0 is also normal then, under H0 , tW has an approximate t distribution
with a degrees of freedom correction proposed by Welch (1949). This test is
frequently used in practice and we subsequently call it the Welch t-test.
In contrast, the pooled t-test can behave poorly in this situation, since we
have
Y −X
tp = r  
(n1 −1)s2X +(n2 −1)s2Y 1 1
n1 +n2 −2 n1
+ n2

. Y −X
= q 2 ;
sX s2Y
n2
+ n1

that is, the sample variances are divided by the wrong sample sizes. Hence
unless the sample sizes are fairly close the pooled t is not asymptotically
distribution free. Exercise 2.13.42 obtains the true asymptotic level of tp .
In order to get the efficacy of the Welch t, consider the statistic Y − X.
The mean function at ∆ is µ(∆) = ∆; hence, µ′ (0) = 1. It follows from the
asymptotic distribution discussed above that
" √ #
√ λ1 λ2 (Y − X) D
n p 2 2
→ N(0, 1) ;
(σX /λ1 ) + (σY )/λ2 )
p √
2
hence, σ(0) = (σX /λ1 ) + (σY2 )/λ2 )/ λ1 λ2 . Thus the efficacy of tW is given
by √
µ′ (0) λ1 λ2
ctW = =p 2 . (2.11.19)
σ(0) (σX /λ1 ) + (σY2 )/λ2 )
We obtain the ARE’s of the above procedures for the case where G(x) =
F (x/η) and F (x) has density f (x) symmetric about 0 with variance 1. Thus η

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 152 —


i i

152 CHAPTER 2. TWO-SAMPLE PROBLEMS

is the ratio of standard deviations σY /σX . For this case the efficacies (2.11.15),
(2.11.18), and (2.11.19) reduce to

2 λ1 λ2 f (0)
cm2 = p
λ2 + λ1 η 2
√ R
λ1 λ2 gf
cm2 = q R R R R
λ1 [ F 2 dG − ( F dG)2 ] + λ2 [ (1 − G)2 dF − ( (1 − G)dF )2]

λ1 λ2
ctW = p .
λ2 + λ1 η 2
Thus the ARE between the modified Mathisen’s procedure and the Welch
procedure is the ratio c2m2 /c2tW = 4σX2 2
f (0) = 4f02 (0). This is the same ARE
as in the location problem. In particular the ARE does not depend on η =
σY /σX . Thus the modified Mathisen’s test in comparison to tW would have
poor efficiency at the normal distribution, .63, but in general it would be much
more efficient than tW for heavy-tailed distributions. Similar to the modified
Mathisen’s test, the Mood test can also be modified for these problems; see
Exercise 2.13.43. Its efficacy is the same as that of the Mathisen’s test.
Asymptotic relative efficiencies involving the modified Wilcoxon do depend
on the ratio of scale parameters η. Fligner and Rust (1982) show that if the
variances of X and Y are quite different then the modified Mathisen’s test
may be as efficient as the modified MWW irrespective of the shape of the
underlying distribution.
Fligner and Policello (1981) conducted a simulation study of the pooled
t, Welch’s t, MWW, and the modified MWW over situations where F and G
differ in scale only. The unmodified tests did not maintain their level. Welch’s
t performed well when F and G were normal whereas the modified MWW
performed well over all situations, including unequal sample sizes and normal
and contaminated normal distributions. In the simulation study performed by
Fligner and Rust (1982), they found that the modified Mood test maintains its
level over the situations that were considered by Fligner and Policello (1981).
As a final note, Welch’s t requires distributions with the same shape and
the modified MWW requires symmetric densities. The modified Mathisen’s
test and the modified Mood test, though, are consistent tests for the general
problem stated in expression (2.11.1).

2.12 Paired Designs


Consider the situation where we have two treatments of interest, say, A and B,
which can be applied to subjects from a population of interest. Suppose we are
interested in a particular response after these treatments have been applied.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 153 —


i i

2.12. PAIRED DESIGNS 153

Let X denote the response of a subject after treatment A has been applied and
let Y be the corresponding measurement for a subject after treatment B has
been applied. The natural null hypothesis, H0 , is that there is no difference
in treatment effects. A one-sided alternative is that the response of a subject
under treatment B is in general larger than of a subject under treatment A.
Reversing the roles of A and B yields the other one-sided alternative while
the union of the these two alternatives results in the two-sided alternative.
Again for definiteness we choose as our alternative, HA , the first one-sided
alternative.
The completely randomized design and the paired design are two experi-
mental designs which are often employed in this situation. In the completely
randomized design, n subjects are selected at random from the population of
interest and n1 of them are randomly assigned to treatment A while the re-
maining n2 = n − n1 are assigned to treatment B. At the end of the treatment
period, we then have two samples, one on X while the other is on Y . The two
sample procedures discussed in the previous sections can be used to analyze
the data. Proper randomization along with carefully controlled experimental
conditions give credence to the assumptions that the samples are random and
are independent of one another. The design that produced the data of Example 2.3.1 was a completely randomized design.
While the completely randomized design is often used in practice, the
underlying variability may impair the power of any procedure, robust or tra-
ditional, to detect alternative hypotheses. The design discussed next usually
results in a more powerful analysis but it does require a pairing device; i.e., a
block of length two.
Suppose we have a pairing device (block of length two). Some examples
include identical twins for a study on human subjects, litter mates for a
study on animal subjects, or the same exterior wall of a house for a study
on the durability of exterior house paints. In the paired design, n pairs of
subjects are randomly selected from the population of interest. Within each
pair, one member is randomly assigned to treatment A while the other re-
ceives treatment B. Again let X and Y denote the responses of subjects after
treatments A and B, respectively, have been applied. This experimental de-
sign results in a sample of pairs (X1 , Y1), . . . , (Xn , Yn ). The sample differences
D1 = X1 −Y1 , . . . Dn = Xn −Yn , however, become the single sample of interest.
Note that the random pairing in this design induces under the null hypothesis
a symmetrical distribution for the differences.

Theorem 2.12.1. In a randomized paired design, under the null hypothesis of no treatment effect, the differences Di are symmetrically distributed about 0.

Proof: Let F(x, y) denote the joint distribution of (X, Y). Under the null hypothesis of no treatment effect and randomized pairing, it follows that X and Y are exchangeable random variables; that is, P(X ≤ x, Y ≤ y) = P(X ≤ y, Y ≤ x). Hence for a difference D = Y − X we have
$$P[D \le t] = P[Y - X \le t] = P[X - Y \le t] = P[-D \le t].$$
Thus D and −D have the same distribution; hence D is symmetrically distributed about 0.
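A minimal simulation sketch of ours illustrates the theorem: the pair constructed below is dependent and not exchangeable, but randomizing the treatment assignment within each pair makes the null differences symmetric about 0:

set.seed(123)
n <- 50000
u <- rexp(n)
v <- 0.5 * u + rexp(n)        # (u, v) dependent and not exchangeable
swap <- runif(n) < 0.5        # random assignment within each pair
x <- ifelse(swap, v, u)       # response under treatment A
y <- ifelse(swap, u, v)       # response under B (no treatment effect)
d <- y - x
round(c(mean(d > 0), mean(d^3)), 3)   # about .5 and 0, as symmetry implies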
Let θ be a location functional for the distribution of Di . We further assume
that Di is symmetrically distributed under alternative models also. Then we
can express the above hypotheses by H0 : θ = 0 versus HA : θ > 0.
Note that one-sample analyses based on signs and signed-ranks discussed in Chapter 1 are appropriate for the randomly paired design. The appropriate sign test statistic is $S = \sum \operatorname{sgn}(D_i)$, while the signed-rank statistic is $T = \sum \operatorname{sgn}(D_i)R(|D_i|)$.
From Chapter 1 we summarize the analysis based on the signed-rank statis-
tic. A level α test would reject H0 in favor of HA , if T ≥ cα where cα is de-
termined from the null distribution of the Wilcoxon signed-rank test or from
the asymptotic approximation to the distribution. The test is consistent for
θ > 0 and it has the efficiency results discussed in Chapter 1. In particular, for normal errors the efficiency of T with respect to the usual paired t-test is .955. The associated point estimate of θ is the Hodges-Lehmann estimate given by $\hat\theta = \operatorname{med}_{i\le j}\{(D_i + D_j)/2\}$. A distribution-free confidence interval for θ is constructed based on the Walsh averages $\{(D_i + D_j)/2\}$, $i \le j$, as discussed in Chapter 1. Instead of using Wilcoxon scores, general signed-rank scores as discussed in Chapter 1 can also be used.
A similar summary holds for the analysis based on the sign statistic. In
fact for the sign scores we need not assume that D1 , . . . , Dn are identically
distributed; that is, there can be a block effect. This is discussed further in
Chapter 4.
We should mention that if the pairing is not done randomly then Di may
or may not be symmetrically distributed. If the symmetry assumption is re-
alistic, then both sign and signed-rank analyses can be used. If, however, it is
not realistic then the sign analysis would still be valid but caution would be
necessary in interpreting the results of the signed-rank analysis.

Example 2.12.1 (Darwin Data). The data, Table 2.12.1, are some measure-
ments recorded by Charles Darwin in 1878. They consist of 15 pairs of heights
in inches of cross-fertilized plants and self-fertilized plants (Zea mays), each
pair grown in the same pot.
Let Di denote the difference between the heights of the cross-fertilized
and self-fertilized plants of the ith pot and let θ denote the median of the


Table 2.12.1: Plant Growth, Cross (C) and Self (S) Fertilized

Pot   1      2      3      4      5      6      7      8
C   23.500 12.000 21.000 22.000 19.125 21.500 22.125 20.375
S   17.375 20.375 20.000 20.000 18.375 18.625 18.625 15.250

Pot   9     10     11     12     13     14     15
C   18.250 21.625 23.250 21.000 22.125 23.000 12.000
S   16.500 18.000 16.250 18.000 12.750 15.500 18.000

distribution of Di. Suppose we are interested in testing for an effect; that is, the hypotheses are H0 : θ = 0 versus HA : θ ≠ 0. The boxplot of the differences is displayed in Panel A of Figure 2.12.1, while Panel B gives the normal q–q plot of the differences. As the plots indicate, the differences for Pot 2 and, perhaps, Pot 15 are possible outliers. The results from the Robnp functions onesampwil and onesampsgn are:
Results for the Darwin Data

Results for the Wilcoxon-Signed-Rank procedure
Test of theta = 0 versus theta not equal to 0
Test T is 72   Stand (z) Test-Stat. is 2.016   p-value 0.043
Estimate 3.1375   SE is 1.244385
95 % Confidence Interval is ( 0.5 , 5.2125 )
Estimate of the scale parameter tau 4.819484

Results for the Sign procedure
Test of theta = 0 versus theta not equal to 0
Test S is 11   Stand (z) Test-Stat. is 2.581   p-value 0.009
Estimate 3   SE is 1.307422
95 % Confidence Interval is ( 1 , 6.125 )
Estimate of the scale parameter tau 5.063624
The value of the signed-rank Wilcoxon statistic for this data is T = 72 with the approximate p-value of 0.043. The corresponding estimate of θ is 3.14 inches and the 95% confidence interval is (.50, 5.21). There are 13 positive differences, so the standardized value of the sign test statistic is 2.58, with the p-value of 0.01. The corresponding estimate of θ is 3 inches and the 95% interpolated confidence interval is (1.00, 6.13). The paired t-test statistic has the value of 2.15 with p-value 0.050. The difference in sample means is 2.62 inches and the corresponding 95% confidence interval is (0, 5.23). Note that the outliers impaired the t-test and, to a lesser degree, the Wilcoxon signed-rank test.
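This analysis can be closely reproduced with base R functions alone; in the sketch below the exact sign test p-value differs slightly from the normal approximation reported above:

## Darwin data: heights from Table 2.12.1
cross <- c(23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125, 20.375,
           18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000)
self  <- c(17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625, 15.250,
           16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000)
d <- cross - self
wilcox.test(d, conf.int = TRUE)    # signed-rank test, HL estimate, and CI
binom.test(sum(d > 0), length(d))  # exact sign test: 13 of 15 positive
t.test(d)                          # paired t-test, for comparison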

Figure 2.12.1: Boxplot of Darwin data (paired differences).

2.12.1 Behavior under Alternatives


In this section, we compare sample size determination for the paired design with sample size determination for the completely randomized design. For the paired design, let γ⁺(θ) denote the power function of the Wilcoxon signed-rank test statistic for the alternative θ. Then the asymptotic power lemma, Theorem 1.5.8 with $c = \tau^{-1} = \sqrt{12}\int f^2(t)\,dt$, for the signed-rank Wilcoxon from Chapter 1 states that at significance level α and under the sequence of contiguous alternatives $\theta_n = \theta/\sqrt{n}$,
$$\lim_{n\to\infty} \gamma^+(\theta_n) = P\!\left(Z \ge z_\alpha - \frac{\theta}{\tau}\right).$$

We only consider the case where the random vector (Y, X) is jointly normal with variance-covariance matrix
$$V = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$
Then $\tau = \sqrt{\pi/3}\,\sigma\sqrt{2(1-\rho)}$.
Now suppose we select the sample size n* so that the Wilcoxon signed-rank test has power γ⁺(θ₀) to detect the one-sided alternative θ₀ > 0 for a level α test. Then writing $\theta_0 = \frac{\sqrt{n^*}\,\theta_0}{\sqrt{n^*}}$, we have by the asymptotic power lemma and (1.5.26) that
$$\gamma^+(\theta_0) \doteq 1 - \Phi\!\left(z_\alpha - \sqrt{n^*}\,\theta_0/\tau\right),$$
and
$$n^* \doteq \frac{\left(z_\alpha - z_{\gamma^+(\theta_0)}\right)^2}{\theta_0^2}\,\tau^2.$$
Substituting the value of τ into this final equation, we have that the necessary sample size for the paired design to have the desired local power is
$$n^* \doteq \frac{\left(z_\alpha - z_{\gamma^+(\theta_0)}\right)^2}{\theta_0^2}\,(\pi/3)\sigma^2\, 2(1-\rho). \qquad (2.12.1)$$

Next consider a two-sample design with equal sample sizes nᵢ = n*. Assume that X and Y are iid normal with variance σ². Then τ² = (π/3)σ². Hence by (2.4.25), the necessary sample size for the completely randomized design to achieve power γ⁺(θ₀) at the one-sided alternative θ₀ > 0 for a level α test is given by
$$n = \left(\frac{z_\alpha - z_{\gamma^+(\theta_0)}}{\theta_0}\right)^2 2(\pi/3)\sigma^2. \qquad (2.12.2)$$
Based on expressions (2.12.1) and (2.12.2), the sample size needed for the paired design is (1 − ρ) times the sample size needed for the completely randomized design. If the pairing device is such that X and Y are strongly positively correlated, then it pays to use the paired design. The paired design is a disaster, of course, if the variables are negatively correlated.
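Both formulas are easy to evaluate; a sketch of ours in R, in which $z_\alpha - z_{\gamma^+(\theta_0)}$ equals qnorm(1 - alpha) + qnorm(power) since both z's are upper critical values:

## sample sizes (2.12.1) and (2.12.2)
n.paired <- function(alpha, power, theta0, sigma, rho)
  (qnorm(1 - alpha) + qnorm(power))^2 / theta0^2 * (pi/3) * sigma^2 * 2 * (1 - rho)
n.crd <- function(alpha, power, theta0, sigma)
  (qnorm(1 - alpha) + qnorm(power))^2 / theta0^2 * 2 * (pi/3) * sigma^2
n.paired(0.05, 0.80, theta0 = 0.5, sigma = 1, rho = 0.7)  # about 16 pairs
n.crd(0.05, 0.80, theta0 = 0.5, sigma = 1)                # about 52 per group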

2.13 Exercises
2.13.1. (a) Derive the L2 estimates of intercept and shift based on the L2
norm on Model (2.2.4).

(b) Next apply the pseudo-norm, (2.2.16), to (2.2.4) and derive the estimating
function. Show that the natural test statistic is the pooled t-statistic.

2.13.2. Show that (2.2.17) is a pseudo-norm. Show, also, that it can be written
in terms of ranks; see the formula following (2.2.17).

2.13.3. In the proof of Theorem 2.4.2, verify that L(Yj − Xi ) = L(Xi − Yj ).

2.13.4. Prove Theorem 2.4.3.

2.13.5. Prove that if a continuous random variable Z has cdf H(z), then the
random variable H(Z) has a uniform distribution on (0, 1).

2.13.6. In Theorem 2.4.4, show that $E(F(Y)) = \int F(y)\,dG(y) = \int (1 - G(x))\,dF(x) = E(1 - G(X))$.

2.13.7. Prove that if Zn converges in distribution to Z and if Var(Zn − Wn) and EZn − EWn converge to 0, then Wn also converges in distribution to Z.

2.13.8. Verify (2.4.10).

2.13.9. Explain what happens to the MWW statistic when one support is
shifted completely to the right of the other support. What does this imply
about the consistency of the MWW in this case?

2.13.10. Show that the L2 estimating function is Pitman Regular and derive the efficacy of the pooled t-test. Also, establish the asymptotic power lemma, Theorem 2.4.13, for the L2 case. Finally, establish the asymptotic distribution of $\sqrt{n}(\bar{Y} - \bar{X})$.

2.13.11. Prove that the Hodges-Lehmann estimate of shift, (2.2.18), is translation and scale equivariant. (See the discussion in Section 2.4.4.)

2.13.12. Prove Theorem 2.4.15.

2.13.13. In Example 2.4.1, form the residuals $Z_i - \hat{\Delta} c_i$, i = 1, . . . , n. Then, similar to Section 1.5.5, use these residuals to estimate τ based on (1.3.30).

2.13.14. Simulate independent random samples from N(20, 5²) and N(22, 5²) distributions of sizes 10 and 15, respectively. Let ∆ denote the shift in the locations of the distributions.

(a) Obtain comparison boxplots for your samples.

(b) Use the Wilcoxon procedure to test H0 : ∆ = 0 versus HA : ∆ ≠ 0 at level .05.

(c) Use the Wilcoxon procedure to estimate ∆ and obtain a 95% confidence
interval for it.

(d) Obtain the true value of τ . Use your confidence interval in the last item
to obtain an estimate of τ . Obtain a symmetric 95% confidence interval
for ∆ based on your estimate.

(e) Form a pooled estimate of τ based on the Wilcoxon signed-rank process for each sample. Obtain a symmetric 95% confidence interval for ∆ based on your estimate. Compare it with the estimate from the last item and the true value.

2.13.15. Write an R function to bootstrap the distribution of $\hat\Delta$. Obtain the bootstrap distribution for 500 bootstraps of the data of Problem 2.13.14. What is your bootstrap estimate of τ? Compare with the true value and the other estimates.
2.13.16. Verify the scalar multiple condition for the pseudo-norm in the proof
of Theorem 2.5.1.
2.13.17. Verify (2.5.9) and (2.5.10).
2.13.18. Consider the process Sϕ(∆), (2.5.11):

(a) Show that Sϕ(∆) is a decreasing step function, with steps occurring at Yj − Xi.

(b) Using Part (a) and the MWW estimator as a starting value, write with some details an algorithm which obtains the estimator $\hat\Delta_\varphi$.

(c) Verify expressions (2.5.14), (2.5.15), and (2.5.16).
2.13.19. Consider the optimal score function (2.5.22):

(a) Show it is location invariant and scale equivariant. Hence, show that if $g(x) = \frac{1}{\sigma} f\!\left(\frac{x-\mu}{\sigma}\right)$, then $\varphi_g = \sigma^{-1}\varphi_f$.

(b) Use (2.5.22) to show that the MWW is asymptotically efficient when the underlying distribution is logistic ($F(x) = (1 + \exp(-x))^{-1}$, $-\infty < x < \infty$).

(c) Show that (2.6.1) is optimal for a Laplace or double exponential distribution ($f(x) = \frac{1}{2}\exp(-|x|)$, $-\infty < x < \infty$).

(d) Show that the optimal score function for the extreme value distribution ($f(x) = \exp\{x - e^x\}$, $-\infty < x < \infty$) is given by (2.8.8).
(e) Show that the optimal score function for the normal distribution is given
by (2.5.32). Show that it is standardized.
(f) Show that (2.5.33) is the optimal score function for an underlying distri-
bution that has a left logistic tail and a right exponential tail.
2.13.20. Show that when the underlying density f is symmetric then ϕf (1 −
u) = −ϕf (u).
2.13.21. Show that expression (2.6.6) is true and that the n = 2r differences,

Y(1) − X(r) < Y(2) − X(r−1) < · · · < Y(n2 ) − X(r−n2 +1) ,

can be ordered only knowing the order statistics from the individual samples.


2.13.22. Develop the asymptotic linearity formula for Mood's estimating function given in (2.6.3). Then give an alternative proof of Theorem 2.6.1 based on this result.
2.13.23. Verify the moment formulas (2.6.9) and (2.6.10).
2.13.24. Show that any estimator based on the pseudo-norm (2.5.2) is equiv-
ariant. Hence, if we multiply the combined sample observations by a constant,
then the estimator is multiplied by that same constant.
2.13.25. Suppose X is a continuous random variable representing the time until failure of some process. The hazard function for a continuous random variable X with cdf F is defined to be the instantaneous rate of failure at X = t, conditional on survival to time t. It is formally given by
$$h_X(t) = \lim_{\Delta t \to 0^+} \frac{P(t \le X < t + \Delta t \mid X \ge t)}{\Delta t}.$$
(a) Show that
$$h_X(t) = \frac{f(t)}{1 - F(t)}.$$
(b) Suppose that Y has cdf given by (2.8.1). Show that the hazard function is given by $h_Y(t) = \alpha h_X(t)$.
2.13.26. Verify (2.8.4).
2.13.27. Apply the delta method of finding the asymptotic distribution of a function to (2.8.3) to find the asymptotic distribution of $\hat\alpha$. Then verify (2.8.5). Explain how this can be used to find an approximate (1 − α)100% confidence interval for α.
2.13.28. Verify (2.8.14).
2.13.29. Show that the asymptotic relative efficiency of the Mann-Whitney-Wilcoxon test to the Savage test at the log exponential model is 3/4.
2.13.30. Verify (2.10.5).
2.13.31. Show that if |X| has an F (2, 2) distribution then log |X| has a logistic
distribution.
2.13.32. Suppose f(t) is the logistic pdf. Show that the optimal score function, (2.10.6), is given by ϕ(u) = u{log[(u + 1)/(1 − u)]}.
2.13.33. For expression (2.10.6)
(a) Verify that it is true.


(b) Apply it to the normal distribution.

(c) Apply it to the Laplace or double exponential distribution.

2.13.34. We consider the Siegel-Tukey (1960) test for the equality of variances
when the underlying centers are equal but possibly unknown. The test statistic
is the sum of ranks of the Y sample in the combined sample (MWW statistic).
However, the ranks are assigned in a different way: In the ordered combined
sample assign rank 1 to the smallest value, rank 2 to the largest value, rank
3 to the second largest value, rank 4 to the second smallest value, and so
on, alternatively assigning ranks to end values. To test H0 : varX = varY vs
HA : varX > varY , reject H0 when the sum of ranks of the Y sample is large.
Find the mean, variance, and the limiting distribution of the test statistic.
Show how to find an approximate size α test.

2.13.35. Develop a sample size formula for the scale problem similar to the
sample size formula in the location problem, (2.4.25).

2.13.36. Verify the asymptotic properties given in (2.10.26), (2.10.27), and (2.10.28).

2.13.37. Compute the efficiency of Mood’s scale test and the Ansari-Bradley
scale test relative to the classical F -test for equality of variances.

2.13.38. Show that the Ansari-Bradley scale test is optimal for $f(x) = \frac{1}{2}(1 + |x|)^{-2}$, $-\infty < x < \infty$.

2.13.39. Show that when F and G have densities symmetric at 0 (or any common point), the expected value of $S_R^+$ is $n_1 n_2/2$.

2.13.40. Show that the estimate of (2.11.17) based on the empirical cdfs is
consistent and that it is a function only of the combined sample ranks.

2.13.41. Under the general model in Section 2.11.5, derive the limiting distribution of $\sqrt{n}(\bar{Y} - \Delta - \bar{X})$.

2.13.42. Find the true asymptotic level of the pooled t-test under the null
hypothesis in (2.11.1).

2.13.43. Develop a modified Mood's test similar to the modified Mathisen's test discussed in Section 2.11.5.

2.13.44. Consider the data set of information on professional baseball players given in Exercise 1.12.33. Let ∆ denote the shift parameter of the difference between the height of a pitcher and the height of a hitter.


(a) Obtain comparison dotplots between the heights of the pitchers and hit-
ters. Does a shift model seem appropriate?

(b) Use the MWW test statistic to test the hypotheses H0 : ∆ = 0 versus
HA : ∆ > 0. Compute the p-value.

(c) Determine a point estimate for ∆ and a 95% confidence interval for ∆
based on MWW procedure.
(d) Obtain an estimate of the standard deviation of $\hat\Delta$. Use it to obtain an approximate 95% confidence interval for ∆.

2.13.45. Repeat Exercise 2.13.44 when ∆ is the shift parameter for the dif-
ference in pitchers’ and hitters’ weights.

2.13.46. Repeat Exercise 2.13.44 when ∆ is the shift parameter for the dif-
ference in left-handed (A-1) and right-handed (A-0) pitchers’ ERA’s and the
hypotheses are H0 : ∆ = 0 versus HA : ∆ 6= 0.

2.13.47. Consider the two independent samples X1, . . . , Xn1 and Y1, . . . , Yn2, where Xi has cdf F(x) and Yj has cdf F(x − ∆), and let
$$T = \sum_{j=1}^{n_2} a[R(Y_j)],$$
where the scores satisfy $\sum_{i=1}^{n} a(i) = 0$ and $n^{-1}\sum_{i=1}^{n} a^2(i) = 1$. Suppose we are testing
H0 : ∆ = 0 versus HA : ∆ > 0.

(a) Show that EH0 [T ] = 0.

(b) Suppose the data are:

X: 8 12 18
Y: 13 22 25

and the scores are a(1) = −6/3.6, a(2) = −1/3.6, a(3) = −1/3.6, a(4) =
1/3.6, a(5) = 1/3.6, a(6) = 6/3.6. Find the p-value of the test.

2.13.48. A study was performed to investigate the response time between two
drugs, A and B. It was thought that the response time for A was higher. Ten
subjects were selected. Each was randomly assigned to one of the drugs and
after a specified period (including a washout period), their response times were
recorded. Using a nonparametric procedure, test the appropriate hypotheses
and conclude in terms of the p-value.


Subject 1 2 3 4 5 6 7 8 9 10
A 114 116 97 54 91 103 99 63 86 102
B 105 111 72 81 56 98 121 81 69 87

2.13.49. Let X1, X2, . . . , Xn1 be a random sample with common cdf and pdf F(t) and f(t), respectively. Let Y1, Y2, . . . , Yn2 be a random sample with common cdf and pdf G(t) = F(t − ∆) and g(t) = f(t − ∆), respectively. Assume that the Yj s and Xi s are independent. Let a(i) be a set of rank scores such that $\sum_{i=1}^{n} a(i) = 0$, where n = n1 + n2. Let $S(\Delta) = \sum_{i=1}^{n_2} a(R(Y_i - \Delta))$. Consider the hypotheses
H0 : ∆ = 0 versus HA : ∆ < 0.
Assume that a level α test is to reject H0 if S(0) < c0. Prove that the power function of this test is nonincreasing (decreasing).
2.13.50. Let X1, X2, . . . , Xn1 be a random sample with common cdf and pdf F(t) and f(t), respectively. Let Y1, Y2, . . . , Yn2 be a random sample with common cdf and pdf G(t) = F(t − ∆) and g(t) = f(t − ∆), respectively. Assume that the Yj s and Xi s are independent. Let n = n1 + n2. Let a(i) = ϕ[i/(n + 1)] be a set of rank scores such that
$$\varphi(u) = \begin{cases} \frac{1}{4} & \frac{3}{4} < u < 1 \\ u - \frac{1}{2} & \frac{1}{4} < u < \frac{3}{4} \\ -\frac{1}{4} & 0 < u < \frac{1}{4} \end{cases}.$$
Let $S = \sum_{i=1}^{n_2} a[R(Y_i)]$. Suppose the sampling results in: X : 8, 13 and Y : 12, 15.

(a) Compute S.

(b) Let $\hat\Delta$ be the corresponding estimator. Is $\hat\Delta > 0$ or is $\hat\Delta < 0$? Why (answer using the value of S)?
2.13.51. Suppose Y1 , . . . , Yn and X1 , . . . , Xn are all independent and have the
same distribution with support on (0, ∞), (X > 0 and Y > 0). Let Zi = Yi /Xi ,
i = 1, 2, . . . , n, and T = #{Zi > 1}.
(a) Find the distribution of T .

(b) Write a location model in terms of the log Zi . What does the location
parameter mean in terms of the original random variables?

(c) What is the underlying hypothesis of Part (b)? What does it mean in
terms of the original random variables?

(d) Determine the distribution of T in Part (b).

Chapter 3

Linear Models

3.1 Introduction
In this chapter we discuss the theory for a rank-based analysis of a general
linear model. Applications of this analysis to experimental design models are
discussed in Chapters 4 and 5. The rank-based analysis is complete, consisting
of estimation, testing, and diagnostic tools for checking the adequacy of fit of
the model, outlier detection, and detection of influential cases. As in the earlier
chapters, we present the analysis in terms of its geometry.
The analysis could be based on either rank scores or signed-rank scores.
We have chosen to use the general rank scores of Chapter 2. This allows the
error distribution to be either asymmetric or symmetric. An analysis based
on signed-rank scores would parallel the one based on rank scores except that
the theory would require a symmetric error distribution; see Hettmansperger
and McKean (1983) for discussion. Although the results are established for
general score functions, we illustrate the methods with Wilcoxon and sign
scores throughout. We commonly use the subscripts R and S for results based
on Wilcoxon and sign scores, respectively.
There is software available for the robust nonparametric procedures discussed in this chapter. The software (R code) ww developed by Terpstra and McKean (2005) computes the linear model procedures based on Wilcoxon scores and, also, the high breakdown (HBR) procedures. It also computes most of the diagnostic procedures discussed in this chapter. We illustrate its use in several examples. The R software Rfit developed by Kloke and McKean (2010) uses the R function optim to obtain the rank-based fit for general score functions. It includes functions for inference and diagnostics. Kapenga, McKean, and Vidmar (1988) developed a Fortran program rglm which computes these methods. A web interface for rglm is discussed by Crimin, Abebe, and McKean (2008). See, also, McKean, Terpstra, and Kloke (2009) for a recent review of computational procedures for rank-based fitting procedures.


3.2 Geometry of Estimation and Tests

For i = 1, . . . , n, let Yi denote the ith observation and let xi denote a p × 1 vector of explanatory variables. Consider the linear model
$$Y_i = x_i'\beta + e_i^*, \qquad (3.2.1)$$
where β is a p × 1 vector of unknown parameters. In this chapter, the components of β are the parameters of interest. We are interested in estimating β and testing linear hypotheses concerning it. However, it is convenient to also have a location parameter. So accordingly let $\alpha = T(e_i^*)$ be a location functional. One that we frequently use is the median. Let $e_i = e_i^* - \alpha$; then T(ei) = 0 and the model can be written as
$$Y_i = \alpha + x_i'\beta + e_i. \qquad (3.2.2)$$

The parameter α is called an intercept parameter. An argument similar to the one concerning the shift parameter ∆ of Chapter 2 shows that β does not
depend on the location functional used.
Let Y = (Y1 , . . . , Yn )′ denote the n × 1 vector of observations and let X
denote the n × p matrix whose ith row is x′i . We can then express the model
as
Y = 1α + Xβ + e , (3.2.3)
where 1 is an n × 1 vector of ones, and e′ = (e1 , . . . , en ). Since the model
includes an intercept parameter, α, there is no loss in generality in assuming
that X is centered; i.e., the columns of X sum to 0. Further, in this chapter,
we assume that X has full column rank p. Let ΩF denote the column space
spanned by the columns of X. Note that we can then write the model as

Y = 1α + η + e , where η ∈ ΩF . (3.2.4)

This model is often called the coordinate-free model.


Besides estimation of the regression coefficients, we are interested in tests
of general linear hypotheses of the form

H0 : Mβ = 0 versus HA : Mβ 6= 0 , (3.2.5)

where M is a q × p matrix of full row rank. In this section, we discuss the geometry of estimation and testing with rank-based procedures for the linear model.

3.2.1 The Geometry of Estimation


With respect to model (3.2.4), we estimate η by minimizing the distance
between Y and the subspace ΩF . In this chapter we define distance in terms

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 167 —


i i

3.2. GEOMETRY OF ESTIMATION AND TESTS 167

of the norms or pseudo-norms presented in Chapter 2. Consider, first, the general R pseudo-norm discussed in Chapter 2 which is given by expression (2.5.2) and which we write for convenience,
$$\|v\|_\varphi = \sum_{i=1}^{n} a(R(v_i))\, v_i, \qquad (3.2.6)$$
where a(1) ≤ a(2) ≤ · · · ≤ a(n) is a set of scores generated as a(i) = ϕ(i/(n + 1)) for some nondecreasing score function ϕ(u) defined on the interval (0, 1) and standardized such that $\int \varphi(u)\,du = 0$ and $\int \varphi^2(u)\,du = 1$. This was shown to be a pseudo-norm in Chapter 2. Recall that the Wilcoxon pseudo-norm is generated by the linear score function $\varphi(u) = \sqrt{12}(u - 1/2)$. We also discuss the sign pseudo-norm which is generated by ϕ(u) = sgn(u − 1/2) and show that it is equivalent to using the L1 norm. In Section 3.10 we also discuss a class of score functions appropriate for survival type analyses.
For the general R pseudo-norm given above by (3.2.6), an R estimate of η is a vector $\hat{Y}_\varphi$ such that
$$D_\varphi(Y, \Omega_F) = \|Y - \hat{Y}_\varphi\|_\varphi = \min_{\eta \in \Omega_F} \|Y - \eta\|_\varphi. \qquad (3.2.7)$$
These quantities are represented geometrically in Figure 3.2.1.

Figure 3.2.1: The R estimate of η is a vector $\hat\eta_\varphi$ which minimizes the normed differences, (3.2.6), between Y and ΩF. The distance between Y and the space ΩF is denoted by dF in the figure. Similar items are shown for the reduced model subspace ΩR ⊂ ΩF.


Once η has been estimated, β can be estimated by solving the equation $X\beta = \hat{Y}_\varphi$; that is, the R estimate of β is $\hat\beta_\varphi = (X'X)^{-1}X'\hat{Y}_\varphi$. As discussed later in Section 3.7, the intercept α can be estimated by a location estimate based on the residuals $\hat{e} = Y - \hat{Y}_\varphi$. One that we frequently use is the median of the residuals, which we denote as $\hat\alpha_S = \operatorname{med}\{Y_i - x_i'\hat\beta_\varphi\}$. Theorem 3.5.7 shows, under regularity conditions, that
$$\begin{pmatrix} \hat\alpha_S \\ \hat\beta_\varphi \end{pmatrix} \text{ is approximately } N_{p+1}\!\left( \begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \begin{pmatrix} n^{-1}\tau_S^2 & 0' \\ 0 & \tau_\varphi^2 (X'X)^{-1} \end{pmatrix} \right), \qquad (3.2.8)$$
where τϕ and τS are the scale parameters defined in displays (3.4.4) and (3.4.6), respectively. From this result, an asymptotic confidence interval for the linear function h′β is given by
$$h'\hat\beta_\varphi \pm t_{(\alpha/2,\, n-p-1)}\, \hat\tau_\varphi \sqrt{h'(X'X)^{-1}h}, \qquad (3.2.9)$$
where the estimate $\hat\tau_\varphi$ is discussed in Section 3.7.1. The use of t-critical values instead of z-critical values is documented in the small sample studies cited in Section 3.7. Note the close analogy between this confidence interval and those based on LS estimates. The only difference is that $\hat\sigma$ has been replaced by $\hat\tau_\varphi$.
We make use of the coordinate-free model, especially in Chapter 4; however, in this chapter we are primarily concerned with the properties of the estimator $\hat\beta_\varphi$ and it is more convenient to use the coordinate model (3.2.3). Define the dispersion function by
$$D_\varphi(\beta) = \|Y - X\beta\|_\varphi. \qquad (3.2.10)$$
Then $D_\varphi(\hat\beta_\varphi) = D_\varphi(Y, \Omega_F) = \|Y - \hat{Y}_\varphi\|_\varphi$ is the R distance between Y and the subspace ΩF. It is also the residual dispersion.
Because Dϕ is expressed in terms of a norm, it is a continuous and convex function of β; see Exercise 1.12.3. Exercise 3.15.2 shows that the ranks of the residuals can only change at the boundaries of the regions defined by the $\binom{n}{2}$ equations $y_i - x_i'\beta = y_j - x_j'\beta$. Note that in the simple linear regression case, these equations define the sample slopes $(y_j - y_i)/(x_j - x_i)$. Hence, in the interior of these regions the ranks are constant. Therefore, Dϕ(β) is a piecewise linear, continuous, convex function of β with gradient (defined almost everywhere) given by
$$\nabla D_\varphi(\beta) = -S_\varphi(Y - X\beta), \qquad (3.2.11)$$
where
$$S_\varphi(Y - X\beta) = X'a(R(Y - X\beta)) \qquad (3.2.12)$$
and $a(R(Y - X\beta))' = (a(R(Y_1 - x_1'\beta)), \ldots, a(R(Y_n - x_n'\beta)))$. Thus $\hat\beta_\varphi$ solves the equations
$$S_\varphi(Y - X\beta) = X'a(R(Y - X\beta)) \doteq 0, \qquad (3.2.13)$$


which are called the R normal equations. A quadratic form in $S_\varphi(Y - X\beta_0)$ serves as the gradient R test statistic for testing H0 : β = β0 versus HA : β ≠ β0.
For the asymptotic distribution theory of estimation and testing, we note that the estimate is location and scale equivariant. Let $\hat\beta_\varphi(Y)$ denote the R estimate of β for the linear model (3.2.3). Then, as shown in Exercise 3.15.6, $\hat\beta_\varphi(Y + X\delta) = \hat\beta_\varphi(Y) + \delta$ and $\hat\beta_\varphi(kY) = k\hat\beta_\varphi(Y)$. In particular these results imply, without loss of generality, that the theory developed in the following sections can be accomplished under the assumption that the true β is 0.
As a final note, we outline the least squares estimates. The LS estimate of η in model (3.2.4) is given by
$$\hat{Y}_{LS} = \operatorname*{Argmin}_{\eta \in \Omega_F} \|Y - \eta\|^2_{LS},$$
where $\|\cdot\|_{LS}$ denotes the least squares pseudo-norm given by (2.2.16) of Chapter 2. The value of η which minimizes this pseudo-norm is
$$\hat\eta_{LS} = HY, \qquad (3.2.14)$$
where H is the projection matrix onto the space ΩF; i.e., $H = X(X'X)^{-1}X'$. Denote the sum of squared residuals by $SSE = \min_{\eta\in\Omega_F} \|Y - \eta\|^2_{LS} = \|(I - H)Y\|^2_{LS}$. In order to have similar notation we denote this minimum by $D^2_{LS}(Y, \Omega_F)$. Also, it is easy to show that the least squares estimate of β is $\hat\beta_{LS} = (X'X)^{-1}X'Y$.

Simple Linear Model

In terms of the simple regression problem, Sϕ(β) is a decreasing step function of β, which steps down at each sample slope. There may be an interval of solutions of Sϕ(β) = 0 or Sϕ(β) may step across the horizontal axis. Let $\hat\beta_\varphi$ denote any point in the interval in the former case and the crossing point in the latter case. The gradient test statistic is $S_\varphi(\beta_0) = \sum x_i\, a(R(y_i - x_i\beta_0))$. If the x's are distinct and equally spaced, then for Wilcoxon scores this test statistic is equivalent to the test for correlation based on Spearman's rS; see Exercise 3.15.4.
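To make this concrete, here is a minimal R sketch of ours that evaluates the Wilcoxon dispersion (3.2.10) and minimizes it with the R function optim, in the spirit of (though not identical to) the Rfit fitting routine mentioned in Section 3.1:

## Wilcoxon dispersion D(beta) = sum a(R(e_i)) e_i, with a(i) = phi(i/(n+1))
wil.scores <- function(u) sqrt(12) * (u - 0.5)
disp <- function(beta, X, y) {
  e <- y - X %*% beta
  sum(wil.scores(rank(e) / (length(e) + 1)) * e)
}
set.seed(1)                               # simulated illustration
X <- matrix(rnorm(200), ncol = 2)
y <- X %*% c(1, -2) + rt(100, df = 2)     # heavy-tailed errors
betahat <- optim(c(0, 0), disp, X = X, y = y)$par
betahat    # near (1, -2); the intercept is estimated from the residuals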

3.2.2 The Geometry of Testing


We next discuss the geometry behind rank-based tests of the general linear
hypotheses given by (3.2.5). As above, consider the model (3.2.4),

Y = 1α + η + e , where η ∈ ΩF , (3.2.15)


and ΩF is the column space of the full model design matrix X. Let $\hat{Y}_{\varphi,\Omega_F}$ denote the R fitted value in the full model. Note that Dϕ(Y, ΩF) is the amount of residual dispersion not accounted for in fitting the Model (3.2.4). These are shown geometrically in Figure 3.2.1.
Next let ΩR denote the reduced model subspace of ΩF subject to H0. In symbols, $\Omega_R = \{\eta \in \Omega_F : \eta = X\beta \text{ for some } \beta \text{ such that } M\beta = 0\}$. In Exercise 3.15.7 the reader is asked to show that ΩR is a subspace of ΩF of dimension p − q. Let $\hat{Y}_{\varphi,\Omega_R}$ denote the R estimate of η when the reduced model is fit and let $D_\varphi(Y, \Omega_R) = \|Y - \hat{Y}_{\varphi,\Omega_R}\|_\varphi$ denote the distance between Y and the subspace ΩR. These are illustrated in Figure 3.2.1. The nonnegative quantity
$$RD_\varphi = D_\varphi(Y, \Omega_R) - D_\varphi(Y, \Omega_F) \qquad (3.2.16)$$
denotes the reduction in residual dispersion when we pass from the re-
duced model to the full model. Large values of RDϕ indicate HA while small
values support H0 .
This drop in residual dispersion, RDϕ, is analogous to the drop in residual sums of squares for the LS analysis. In fact, to obtain this reduction in sums of squares, we need only replace the R norm with the square of the Euclidean norm in the above development. Thus the drop in sums of squared errors is
$$SS = D^2_{LS}(Y, \Omega_R) - D^2_{LS}(Y, \Omega_F),$$
where $D^2_{LS}(Y, \Omega_F)$ is defined above. Hence the reduction in sums of squared residuals can be written as
$$SS = \|(I - H_{\Omega_R})Y\|^2_{LS} - \|(I - H_{\Omega_F})Y\|^2_{LS}.$$
The traditional least squares F-test is given by
$$F_{LS} = \frac{SS/q}{\hat\sigma^2}, \qquad (3.2.17)$$
where $\hat\sigma^2 = D^2_{LS}(Y, \Omega_F)/(n - p)$. Other than replacing one norm with another, Figure 3.2.1 remains the same for the two analyses, LS and R.
In order to be useful as a test statistic, similar to least squares, the reduction in dispersion RDϕ must be standardized. The asymptotic distribution theory that follows suggests the standardization
$$F_\varphi = \frac{RD_\varphi/q}{\hat\tau_\varphi/2}, \qquad (3.2.18)$$
where $\hat\tau_\varphi$ is the estimate of τϕ discussed in Section 3.7. Small sample studies cited in Section 3.7 indicate that Fϕ should be compared with F-critical values

Table 3.2.1: Robust ANOVA Table for the Hypotheses H0 : Mβ = 0 versus HA : Mβ ≠ 0, Where RDϕ = Dϕ(Y, ΩR) − Dϕ(Y, ΩF)

Source       Reduction in Dispersion   df            Mean Reduction in Dispersion   Fϕ
Regression   RDϕ                       q             RDϕ/q                          Fϕ
Error                                  n − (p + 1)   $\hat\tau_\varphi/2$

Table 3.2.2: Robust ANOVA Table for the Hypotheses H0 : β = 0 versus HA : β ≠ 0, Where RDϕ = Dϕ(0) − Dϕ(Y, ΩF)

Source       Reduction in Dispersion   df            Mean Reduction in Dispersion   Fϕ
Regression   RDϕ                       p             RDϕ/p                          Fϕ
Error                                  n − p − 1     $\hat\tau_\varphi/2$

with q and n − (p + 1) degrees of freedom analogous to the LS classical F-test statistic. Similar to the LS F-test, the test based on Fϕ can be summarized in the ANOVA table, Table 3.2.1. Note that the reduction in dispersion replaces the reduction in sums of squares in the classical table. These robust ANOVA tables were first discussed by Schrader and McKean (1976).

Tests That All Regression Coefficients Are 0

As discussed more fully in Section 3.6, there are three R test statistics for the hypotheses (3.2.5). These are the R analogues of the classical tests: the likelihood ratio test, the scores test, and the Wald test. We introduce them here for the special null hypothesis that all the regression parameters are 0; i.e.,
$$H_0 : \beta = 0 \text{ versus } H_A : \beta \ne 0. \qquad (3.2.19)$$
Their asymptotic theory and small sample properties are discussed in more detail in later sections.
In this case, the reduced model dispersion is just the dispersion of the response vector Y, i.e., Dϕ(0). Hence, the R test based on the reduction in dispersion is
$$F_\varphi = \frac{\left[D_\varphi(0) - D_\varphi(Y, \Omega_F)\right]/p}{\hat\tau_\varphi/2}. \qquad (3.2.20)$$
As discussed above, Fϕ should be compared with F(α, p, n − p − 1)-critical values. Similar to the general hypothesis, the test based on Fϕ can be expressed in the robust ANOVA table given in Table 3.2.2. This is the robust analogue of the traditional ANOVA table that is printed out for a regression analysis by most least squares regression packages.


Table 3.3.1: Data for Example 3.3.1. (The number of calls is in tens of millions
and the years are from 1950-1973. The top rows are years and the bottom rows
are the number of calls.)
50 51 52 53 54 55 56 57 58 59 60 61
0.44 0.47 0.47 0.59 0.66 0.73 0.81 0.88 1.06 1.20 1.35 1.49

62 63 64 65 66 67 68 69 70 71 72 73
1.61 2.12 11.90 12.40 14.20 15.90 18.20 21.20 4.30 2.40 2.70 2.90

The R scores test is the test based on the gradient. Theorem 3.5.2, below, gives the asymptotic distribution of the gradient Sϕ(0) under the null hypothesis. This leads to the asymptotic level α test: reject H0 if
$$S_\varphi'(0)(X'X)^{-1}S_\varphi(0) \ge \chi^2_\alpha(p). \qquad (3.2.21)$$
Note that this test avoids the estimation of τϕ.
The R Wald test is a quadratic form in the full model estimates. Based on the asymptotic distribution of the full model estimate $\hat\beta_\varphi$ given in Corollary 3.5.1, an asymptotic level α test rejects H0 if
$$\frac{\hat\beta_\varphi'(X'X)\hat\beta_\varphi/p}{\hat\tau_\varphi^2} \ge F(\alpha, p, n - p - 1). \qquad (3.2.22)$$
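In practice these tests are computed by software. The following is a hedged sketch using the Rfit package of Section 3.1, where y, x1, x2, x3 are hypothetical variables:

library(Rfit)
fitF <- rfit(y ~ x1 + x2 + x3)   # full model fit, Wilcoxon scores
fitR <- rfit(y ~ x1)             # reduced model under H0: beta2 = beta3 = 0
summary(fitF)                    # estimates with SEs as in (3.2.8)-(3.2.9)
drop.test(fitF, fitR)            # F_phi: reduction in dispersion test (3.2.18)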

3.3 Examples

We offer several examples to illustrate the rank-based estimates and test procedures discussed in the last section. For all the examples, we use Wilcoxon scores, $\varphi(u) = \sqrt{12}(u - 1/2)$, for the rank-based estimates of the regression coefficients. We estimate the intercept by the median of the residuals and we estimate the scale parameter τϕ as discussed in Section 3.7. We begin with a simple regression data set and proceed to multiple regression problems.

Example 3.3.1 (Telephone Data). The response for this data set is the num-
ber of telephone calls (tens of millions) made in Belgium for the years 1950
through 1973. Time, the years, serves as our only predictor variable. The data
is discussed in Rousseeuw and Leroy (1987) and, for convenience, is displayed
in Table 3.3.1.
The Wilcoxon estimates of the intercept and slope are −7.13 and .145,
respectively, while the LS estimates are −26 and .504. The reason for this
disparity in fits is easily seen in Panel A of Figure 3.3.1 which is a scatter-
plot of the data overlaid with the LS and Wilcoxon fits. Note that the years


1964 through 1969 had a profound effect on the LS fit while the Wilcoxon fit
was much less sensitive to these years. As discussed in Rousseeuw and Leroy
the recording system for the years 1964 through 1969 differed from the other
years. Panels B and C of Figure 3.3.1 are the Studentized residual plots of
the fits; see (3.9.31) of Section 3.9. As with internal LS-Studentized residuals,
values of the internal R Studentized residuals which exceed 2 in absolute value
are potential outliers. Note that the internal Wilcoxon Studentized residuals
clearly show that the years 1964-1969 are outliers while the internal LS Stu-
dentized residuals only detect 1969. The Wilcoxon Studentized residuals also
mildly detect the year 1970. Based on the scatterplot, this point does not
follow the trend of the early (before 1964) years either. The scatterplot and
Wilcoxon residual plot indicate that there may be a quadratic trend over the
years before the outliers occur. The last few years, though, do not seem to
follow this trend. Hence, a linear model for this data is questionable. On the
basis of these plots, we do not discuss any formal inference for this data set.

Figure 3.3.1: Panel A: Scatterplot of the telephone data, overlaid with the LS and Wilcoxon fits; Panel B: Internal LS Studentized residual plot; Panel C: Internal Wilcoxon Studentized residual plot; and Panel D: Wilcoxon dispersion function.

Panel D of Figure 3.3.1 depicts the Wilcoxon dispersion function over the interval (−.2, .6). Note that the Wilcoxon estimate $\hat\beta_R = .145$ is the minimizing value. Next consider the hypotheses H0 : β = 0 versus HA : β ≠ 0. The basis for the test statistic Fϕ can be read from this plot. The reduction in dispersion is given by RD = D(0) − D(.145). Also, the gradient test of these hypotheses would be the negative of the slope of the dispersion function at 0; i.e., −D′(0).
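The two fits can be roughly reproduced from Table 3.3.1 with Rfit and lm; a sketch (the coefficients should approximately match the values quoted above):

year  <- 50:73
calls <- c(0.44, 0.47, 0.47, 0.59, 0.66, 0.73, 0.81, 0.88, 1.06, 1.20,
           1.35, 1.49, 1.61, 2.12, 11.90, 12.40, 14.20, 15.90, 18.20,
           21.20, 4.30, 2.40, 2.70, 2.90)
fit.w  <- Rfit::rfit(calls ~ year)   # Wilcoxon fit: about -7.13 + .145 year
fit.ls <- lm(calls ~ year)           # LS fit: about -26 + .504 year
coef(fit.w); coef(fit.ls)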

Example 3.3.2 (Baseball Salaries). As a large data set, we consider data on the salaries of professional baseball pitchers for the 1987 baseball season. This
data set was taken from the data set on baseball salaries which was used in
the 1988 ASA Graphics Section Poster Session. It can be obtained at the web
site: http://lib.stat.cmu.edu/datasets. Our analysis concerns a subdata
set of 176 pitchers, which can be obtained from the authors upon request. Our
response variable is the 1987 beginning salary (in log dollars) of these pitchers.
As predictors, we took the career summary statistics through the end of the
1986 season. The names of these variables are listed in Table 3.3.2. Panels
A-G of Figure 3.3.2 show the scatter plots of the log of salary versus each of
the predictors. Certainly the strongest predictor on the basis of these plots is
log years; although, linearity in this plot is questionable.
The internal Wilcoxon Studentized residuals, (3.9.31), versus fitted values
are displayed in Panel H of Figure 3.3.2. Based on Panels A and H, the pattern
in the residual plot follows from the fact that log years is not a linear predictor.
Better fitting models are pursued in Exercise 3.15.1. Note that there are several
large outliers. The three identified outliers, circled points in Panel H, are
interesting. These correspond to the pitchers Steve Carlton, Phil Niekro, and
Rick Sutcliff. These were very good pitchers, but in 1987 they were at the end
of their careers, (21, 23, and 21 years of pitching, respectively); hence, they
missed the rapid rise in baseball salaries. A diagnostic analysis (see Section
3.9 and Exercise 3.15.1) indicates a few mildly influential points, also. For
illustration, though, we consider the model that we fit. Table 3.3.2 also displays
the estimated coefficients and their standard errors. The outliers impaired the
LS fit, somewhat. The LS estimate of σ is .515 in comparison to the estimate
of τ which is .388.
Table 3.3.3 displays the robust ANOVA table for testing that all the coef-
ficients, except the intercept, are 0. Based on the large value of Fϕ , (3.2.20),
the predictors are helpful in explaining the response. In particular, based on
Table 3.3.2, the predictors years in professional baseball, earned run average,
average innings per year, and average number of saves per year seem more im-
portant than the variables wins, losses, and games. These last three variables
form a similar group of variables; hence, as an illustration of the rank-based
statistic Fϕ , the hypothesis that the coefficients for these three predictors are
0 was tested. The reduction in dispersion for this hypothesis is RD = 1.24
which leads to Fϕ = 2.12 which is significant at the 10% level. This confirms


the above observations on the regression coefficients.

Figure 3.3.2: Panels A-G: Plots of log-salary versus each of the predictors for the baseball data of Example 3.3.2; Panel H: Internal Wilcoxon Studentized residual plot.

Example 3.3.3 (Potency Data). This example is part of a multivariate data set discussed in Chapter 6; see Table 6.6.4 for the data. The experiment
concerned the potency of drug compounds which were manufactured under
different levels of four factors. Here we consider only one of the response vari-
ables POT2, which is the potency of a drug compound at the end of two
weeks. The factors are: SAI, the amount of intragranular steric acid, which
was set at the three levels −1, 0, and 1; SAE, the amount of extragranular
steric acid, which was set at the three levels −1, 0, and 1; ADS, the amount
of cross carmellose sodium, which was set at the three levels −1, 0, and 1; and
TYPE of steric acid which was set at two levels −1 and 1. The initial potency
of the compound, POT0, served as a covariate. The sample size is n = 34.


Table 3.3.2: Predictors for Baseball Salaries of Pitchers and Their Estimated
(Wilcoxon Fit) Coefficients
Predictor Estimate Stand. Error t-ratio
log Years in professional baseball .839 .044 19.15
Average wins per year .045 .028 1.63
Average losses per year -.024 .026 -.921
Earned run average -.146 .070 -2.11
Average games per year -.006 .004 1.60
Average innings per year .004 .003 1.62
Average saves per year .012 .011 1.07
Intercept 4.22 .324
Scale (τ ) .388

Table 3.3.3: Wilcoxon ANOVA Table for H0 : β = 0

Source       Reduction in Dispersion   df    Mean Reduction in Dispersion   Fϕ
Regression   78.287                    7     11.18                          57.65
Error                                  168   .194

In Example 3.9.2 of Section 3.9 a residual analysis of this data set is per-
formed. This analysis indicates that the model which includes the covariate,
the linear terms of the factors, the simple two-way interaction terms of the
factors, and the quadratic terms of the three factors SAE, SAI, and ADS is
adequate. Let xj for j = 1, . . . , 4 denote the level of the factors SAI, SAE,
ADS, and TYPE, respectively, and let ci denote the value of the covariate.
Then the model is expressed as
$$y_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \beta_4 x_{4,i} + \beta_5 x_{1,i}x_{2,i} + \beta_6 x_{1,i}x_{3,i} + \beta_7 x_{1,i}x_{4,i} + \beta_8 x_{2,i}x_{3,i} + \beta_9 x_{2,i}x_{4,i} + \beta_{10} x_{3,i}x_{4,i} + \beta_{11} x_{1,i}^2 + \beta_{12} x_{2,i}^2 + \beta_{13} x_{3,i}^2 + \beta_{14} c_i + e_i. \qquad (3.3.1)$$
The Wilcoxon and LS estimates of the regression coefficients and their
standard errors are given in Table 3.3.4. The Wilcoxon estimates are more
precise. As the diagnostic analysis of Example 3.9.2 shows, this is due to the
outliers in this data set.
Note that the Wilcoxon estimate of the parameter β13, the quadratic term of the factor ADS, is significant. Again referring to the residual analysis given in
Example 3.9.2, there is some graphical evidence to retain the three quadratic
coefficients in the model. In order to statistically confirm this evidence, we
test the hypotheses
$$H_0 : \beta_{12} = \beta_{13} = \beta_{14} = 0 \text{ versus } H_A : \beta_i \ne 0 \text{ for some } i = 12, 13, 14.$$


Table 3.3.4: Wilcoxon and LS Estimates for the Potency Data


Wilcoxon Estimates LS Estimates
Terms Parameter Est. SE Est. SE
Intercept α 7.184 2.96 5.998 4.50
β1 0.072 0.05 0.000 0.08
Linear β2 0.023 0.05 -0.018 0.07
β3 0.166 0.05 0.135 0.07
β4 0.020 0.04 -0.011 0.05
β5 0.042 0.05 0.086 0.08
β6 -0.040 0.05 0.035 0.08
Two-way β7 0.040 0.05 0.102 0.07
Inter. β8 -0.085 0.06 -0.030 0.09
β9 0.024 0.05 0.070 0.07
β10 -0.049 0.05 -0.011 0.07
β11 -0.002 0.10 0.117 0.15
Quad. β12 -0.222 0.09 -0.240 0.13
β13 0.022 0.09 -0.007 0.14
Covariate β14 0.092 0.31 0.217 0.47
Scale τ or σ .204 .310

Table 3.3.5: Wilcoxon ANOVA Table for H0 : β12 = β13 = β14 = 0

Source            Reduction in Dispersion   df   Mean Reduction in Dispersion   Fϕ
Quadratic Terms   .977                      3    .326                           3.20
Error                                       19   .102

The Wilcoxon test is summarized in Table 3.3.5 and it is based on the test
statistic (3.2.18). The p-value of the test is 0.047 and, hence, is significant
at the 0.05 level. The LS F -test statistic is insignificant, though, with the p-
value 0.340. As with its estimates of the regression coefficients, the LS F -test
statistic has been impaired by the outliers.

3.4 Assumptions for Asymptotic Theory


For the asymptotic theory developed in this chapter certain assumptions on
the distribution of the errors, the design matrix, and the scores are needed.
The required assumptions for each section may differ, but for easy reference,
we have placed them in this section.
The major assumption on the error density function f for much of the rank-based analyses is:
$$\text{(E.1)} \quad f \text{ is absolutely continuous, } 0 < I(f) < \infty, \qquad (3.4.1)$$
where I(f) denotes Fisher information, (2.4.16). Since f is absolutely continuous, we can write
$$f(s) - f(t) = \int_t^s f'(x)\,dx$$
for some function f′. An application of the Cauchy-Schwarz inequality yields
$$|f(s) - f(t)| \le I(f)^{1/2}\,|F(s) - F(t)|^{1/2}; \qquad (3.4.2)$$
see Exercise 1.12.21. It follows from (3.4.2) that assumption (E.1) implies that f is uniformly bounded and is uniformly continuous.
An assumption that is used for analyses based on the L1 norm is:

(E.2) f (θe ) > 0 , (3.4.3)

where θe denotes the median of the error distribution, i.e., θe = F −1 (1/2).


For easy reference, we list again the scale parameter τϕ, (2.5.23),
$$\tau_\varphi = \left(\int \varphi(u)\varphi_f(u)\,du\right)^{-1}, \qquad (3.4.4)$$
where
$$\varphi_f(u) = -\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}. \qquad (3.4.5)$$
Under (E.1) the scale parameter τϕ is well defined. A second scale parameter τS is defined as
$$\tau_S = (2f(\theta_e))^{-1}; \qquad (3.4.6)$$
see (1.5.22). Note that it is well defined under assumption (E.2).
As above, let $H = X(X'X)^{-1}X'$ denote the projection matrix onto Ω, the column space of X. Our asymptotic theory assumes that the design matrix X is embedded in a sequence of design matrices which satisfy the next two properties. We should subscript quantities such as X and the projection matrix with n to show this, but as a matter of convenience we have not done so. We do subscript the leverage values $h_{nii}$, which are the diagonal entries of the projection matrix H. We often impose the next two conditions on the design matrix:

$$\text{(D.2)} \quad \lim_{n\to\infty}\, \max_{1\le i\le n} h_{nii} = 0 \qquad (3.4.7)$$
$$\text{(D.3)} \quad \lim_{n\to\infty}\, n^{-1}X'X = \Sigma, \qquad (3.4.8)$$


where Σ is a p × p positive definite matrix. The first condition has become known as Huber's condition. Huber (1981) showed that (D.2) is a necessary and sufficient design condition for the least squares estimates to have an asymptotic normal distribution provided the errors, ei, are iid with finite variance. Condition (D.3) reduces to assumption (D.1), (2.4.7), of Chapter 2 for the two-sample problem.
Another design condition is Noether's condition, which is given by
$$\text{(N.1)} \quad \max_{1\le i\le n} \frac{x_{ik}^2}{\sum_{j=1}^{n} x_{jk}^2} \to 0 \quad \text{for all } k = 1, \ldots, p. \qquad (3.4.9)$$

Although this condition is convenient, as the next lemma shows it is implied by Huber's condition.

Lemma 3.4.1. (D.2) implies (N.1).
Proof: By the generalized Cauchy-Schwarz inequality (see Graybill, 1976, page 224), for all i = 1, . . . , n we have the following equalities:
$$\sup_{\|\delta\|=1} \frac{\delta' x_i x_i' \delta}{\delta' X'X \delta} = x_i'(X'X)^{-1}x_i = h_{nii}.$$
Next for k = 1, . . . , p take δ to be δk, the p × 1 vector of zeroes except for 1 in the kth component. Then the above equalities imply that
$$\frac{x_{ik}^2}{\sum_{j=1}^{n} x_{jk}^2} \le h_{nii}, \quad i = 1, \ldots, n, \ k = 1, \ldots, p.$$
Hence
$$\max_{1\le k\le p}\, \max_{1\le i\le n} \frac{x_{ik}^2}{\sum_{j=1}^{n} x_{jk}^2} \le \max_{1\le i\le n} h_{nii}.$$
Therefore Huber's condition implies Noether's condition.


As in Chapter 2, we often assume that the score generating function ϕ(u) satisfies assumption (2.5.5). Additionally, we assume that ϕ(u) is bounded. For reference, we assume that ϕ(u) is a function defined on (0, 1) such that
$$\text{(S.1)} \quad \varphi(u) \text{ is a nondecreasing, square-integrable, and bounded function with } \int_0^1 \varphi(u)\,du = 0 \text{ and } \int_0^1 \varphi^2(u)\,du = 1. \qquad (3.4.10)$$
Occasionally we need further assumptions on the score function. In Section 3.7, we assume that
$$\text{(S.2)} \quad \varphi \text{ is differentiable.} \qquad (3.4.11)$$
When estimating the intercept parameter based on signed-rank scores, we need to assume that the score function is odd about $\frac{1}{2}$, i.e.,
$$\text{(S.3)} \quad \varphi(1 - u) = -\varphi(u); \qquad (3.4.12)$$
see, also, (2.5.5).
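As a quick numerical check, the Wilcoxon score function satisfies (S.1) and (S.3); an illustrative R sketch:

phi <- function(u) sqrt(12) * (u - 0.5)           # Wilcoxon scores
integrate(phi, 0, 1)$value                        # 0: centered
integrate(function(u) phi(u)^2, 0, 1)$value       # 1: standardized
isTRUE(all.equal(phi(1 - 0.3), -phi(0.3)))        # odd about 1/2: (S.3)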


3.5 Theory of Rank-Based Estimates

Consider the linear model given by (3.2.3). To avoid confusion, we denote the true vector of parameters by $(\alpha_0, \beta_0)'$; that is, the true model is $Y = 1\alpha_0 + X\beta_0 + e$. In this section, we derive the asymptotic theory for the R analysis (estimation and testing) under the assumptions (E.1), (D.2), (D.3), and (S.1). We occasionally suppress the subscripts ϕ and R from the notation. For example, we denote the R estimate simply by $\hat\beta$.

3.5.1 R Estimators of the Regression Coefficients


A key result for both estimation and testing concerns the gradient S(Y −Xβ),
(3.2.12). We first derive its mean and covariance matrix and then obtain its
asymptotic distribution.

Theorem 3.5.1. Under Model (3.2.3),
$$E[S(Y - X\beta_0)] = 0 \quad \text{and} \quad V[S(Y - X\beta_0)] = \sigma_a^2\, X'X, \qquad (3.5.1)$$
where $\sigma_a^2 = (n-1)^{-1}\sum_{i=1}^{n} a^2(i) \doteq 1$.

Proof: Note that $S(Y - X\beta_0) = X'a(R(e))$. Under Model (3.2.3), $e_1, \ldots, e_n$ are iid; hence, the ith component of $a(R(e))$ has mean
$$E[a(R(e_i))] = \sum_{j=1}^{n} a(j)\, n^{-1} = 0,$$
from which the result for the expectation follows.
Next, $V[S(Y - X\beta_0)] = X'\,V[a(R(e))]\,X$. The diagonal entries of the covariance matrix on the RHS are
$$V[a(R(e_i))] = E[a^2(R(e_i))] = \sum_{j=1}^{n} a(j)^2\, n^{-1} = \frac{n-1}{n}\,\sigma_a^2.$$
The off-diagonal entries are the covariances given by
$$\operatorname{cov}(a(R(e_i)), a(R(e_l))) = E[a(R(e_i))\,a(R(e_l))] = \sum_{j=1}^{n}\sum_{\substack{k=1 \\ k\ne j}}^{n} a(j)a(k)\,(n(n-1))^{-1} = -(n(n-1))^{-1}\sum_{j=1}^{n} a^2(j) = -\sigma_a^2/n,$$
where the third step in the derivation follows from $0 = \left(\sum_{j=1}^{n} a(j)\right)^2$. The result, (3.5.1), is obtained directly from these variances and covariances.
Under (D.3), we have that
$$V\!\left[n^{-1/2} S(Y - X\beta_0)\right] \to \Sigma. \qquad (3.5.2)$$
This anticipates our next result.

Theorem 3.5.2. Under the Model (3.2.3), (E.1), (D.2), (D.3), and (S.1) in Section 3.4,
$$n^{-1/2} S(Y - X\beta_0) \xrightarrow{D} N_p(0, \Sigma). \qquad (3.5.3)$$
Proof: Let $S(0) = S(Y - X\beta_0)$ and let $T(0) = X'\varphi(F(Y - X\beta_0))$. Under the above assumptions, the discussion around Theorem A.3.1 of the Appendix shows that $(T(0) - S(0))/\sqrt{n}$ converges to 0 in probability. Hence we need only show that $T(0)/\sqrt{n}$ converges to the intended distribution. Letting $W^* = n^{-1/2}\,t'T(e)$ where $t \ne 0$ is an arbitrary $p \times 1$ vector, it suffices to show that $W^*$ converges in distribution to a $N(0, t'\Sigma t)$ distribution. Note that we can write $W^*$ as
$$W^* = n^{-1/2}\sum_{k=1}^{n} t'x_k\, \varphi(F(e_k)). \qquad (3.5.4)$$
Since F is the distribution function of $e_k$, it follows from $\int \varphi\,du = 0$ that $E[W^*] = 0$, and from $\int \varphi^2\,du = 1$ and (D.3) that
$$V[W^*] = n^{-1}\sum_{k=1}^{n} (t'x_k)^2 = t'\left(n^{-1}X'X\right)t \to t'\Sigma t > 0. \qquad (3.5.5)$$

Since W ∗ is a sum of independent random variables which are not iden-


tically distributed we establish the limit distribution by the Lindeberg-Feller
Central Limit Theorem; see Theorem A.1.1 of the Appendix. In the notation
of this theorem let Bn2 = V [W ∗ ]. By (3.5.5), Bn2 converges to a positive real
number. We need to show,
X n   
1 ′ 2 2 1
lim Bn−2
E ′
(x t) ϕ (F (ek ))I √ (xk t)ϕ(F (ek )) > ǫBn =0.
n k n
k=1
(3.5.6)
The key is the factor $n^{-1/2}(x_k' t)$ in the indicator function. By the Cauchy-Schwarz inequality and (D.2) we have the string of inequalities:
$$ n^{-1/2} |x_k' t| \le n^{-1/2} \|x_k\| \|t\| = \left[ n^{-1} \sum_{j=1}^p x_{kj}^2 \right]^{1/2} \|t\| \le \left[ p \max_j n^{-1} x_{kj}^2 \right]^{1/2} \|t\| . \quad (3.5.7) $$


By assumptions (D.2) and (D.3), it follows that the quantity in brackets in equation (3.5.7), and, hence, $n^{-1/2} |x_k' t|$, converges to zero as n → ∞. Call the term on the right side of equation (3.5.7) $M_n$. Note that it does not depend on k and $M_n \to 0$. From this string of inequalities, the limit on the left side of (3.5.6) is less than or equal to
$$ \lim B_n^{-2}\; \lim E\left[ \varphi^2(F(e_1))\, I\left( |\varphi(F(e_1))| > \frac{\epsilon B_n}{M_n} \right) \right]\; \lim n^{-1} \sum_{k=1}^n (x_k' t)^2 . $$
The first and third limits are positive reals. For the second limit, note that the random variable inside the expectation is bounded; hence, by the Lebesgue Dominated Convergence Theorem we can interchange the limit and expectation. Since $\epsilon B_n / M_n \to \infty$ as n → ∞, the expectation goes to 0 and our desired result is obtained.
Similar to Chapter 2, Exercise 3.15.9 obtains the proof of the above theo-
rem for the special case of the Wilcoxon scores by first getting the projection
of the statistic W .
Note that from this theorem we obtain a gradient test of the hypothesis that all the regression coefficients are 0; that is, $H_0 : \beta = 0$ versus $H_A : \beta \ne 0$. Consider the test statistic
T = σa−2 S(Y)′ (X′ X)−1 S(Y) . (3.5.8)
From the last theorem an approximate level α test for H0 versus HA is:

Reject H0 in favor of HA if T ≥ χ2 (α, p) , (3.5.9)

where $\chi^2(\alpha, p)$ denotes the upper level α critical value of the χ²-distribution with p degrees of freedom.
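As an illustration, the following is a minimal R sketch of this gradient test under Wilcoxon scores; the function name gradient_test and its arguments are ours and not part of the text, and the design matrix X is assumed centered.

```r
## Gradient test (3.5.9), a minimal sketch assuming Wilcoxon scores;
## X is the centered n x p design matrix and y the response vector.
gradient_test <- function(y, X, alpha = 0.05) {
  n <- length(y); p <- ncol(X)
  a <- sqrt(12) * (rank(y)/(n + 1) - 1/2)    # scores a(R(Y_i)) under H0
  S <- crossprod(X, a)                       # gradient S(Y) = X'a(R(Y))
  sigma2a <- sum(a^2)/(n - 1)                # sigma_a^2, approximately 1
  T <- drop(crossprod(S, solve(crossprod(X), S))) / sigma2a   # (3.5.8)
  c(T = T, reject = (T >= qchisq(1 - alpha, p)))
}
```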
Theorem A.3.8 of the Appendix gives the following linearity result for the process $S(\beta_n)$:
$$ \frac{1}{\sqrt{n}} S(\beta_n) = \frac{1}{\sqrt{n}} S(\beta_0) - \tau_\varphi^{-1} \Sigma \sqrt{n}(\beta_n - \beta_0) + o_p(1) , \quad (3.5.10) $$
for $\sqrt{n}(\beta_n - \beta_0) = O(1)$, where the scale parameter τϕ is given by (3.4.4). Recall that we have made use of this result in Section 2.5 when we showed that the two-sample location process under general scores functions is Pitman Regular. If we integrate the RHS of this result we obtain a locally smooth approximation of the dispersion function $D(\beta_n)$ which is given by the following quadratic function:
$$ Q(Y - X\beta) = (2\tau_\varphi)^{-1} (\beta - \beta_0)' X'X (\beta - \beta_0) - (\beta - \beta_0)' S(Y - X\beta_0) + D(Y - X\beta_0) . \quad (3.5.11) $$
Note that Q depends on τϕ and β 0 so it cannot be used to estimate β. As we
show, the function Q is quite useful for establishing asymptotic properties of


the R estimates and test statistics. As discussed in Section 3.7.3, it also leads
to a Gauss-Newton-type algorithm for obtaining R estimates.
The following theorem shows that Q provides a local approximation to D.
This is an asymptotic quadraticity result which was proved by Jaeckel (1972).
It in turn is based on an asymptotic linearity result derived by Jurečková
(1971) and displayed above, (3.5.10). It is proved in the Appendix; see Theo-
rem A.3.8.
Theorem 3.5.3. Under the Model (3.2.3) and the assumptions (E.1), (D.1), (D.2), and (S.1) of Section 3.4, for any $\epsilon > 0$ and $c > 0$,
$$ P\left[ \max_{\|\beta - \beta_0\| < c/\sqrt{n}} |D(Y - X\beta) - Q(Y - X\beta)| \ge \epsilon \right] \to 0 , \quad (3.5.12) $$
as n → ∞.
We use this result to obtain the asymptotic distribution of the R estimate. Without loss of generality assume that the true $\beta_0 = 0$. Then we can write $Q(Y - X\beta) = (2\tau_\varphi)^{-1} \beta' X'X \beta - \beta' S(Y) + D(Y)$. Because Q is a quadratic function it follows from differentiation that it is minimized by
$$ \tilde{\beta} = \tau_\varphi (X'X)^{-1} S(Y) . \quad (3.5.13) $$
Hence, $\tilde{\beta}$ is a linear function of $S(Y)$. Thus we immediately have from Theorem 3.5.2:

Theorem 3.5.4. Under the Model (3.2.3), (E.1), (D.1), (D.2), and (S.1) in Section 3.4,
$$ \sqrt{n}(\tilde{\beta} - \beta_0) \overset{D}{\to} N_p(0, \tau_\varphi^2 \Sigma^{-1}) . \quad (3.5.14) $$

Since Q is a local approximation to D, it would seem that their minimizing values are close also. As the next result shows this indeed is the case. The proof first appeared in Jaeckel (1972) and is sketched in the Appendix; see Theorem A.3.9.

Theorem 3.5.5. Under the Model (3.2.3), (E.1), (D.1), (D.2), and (S.1) in Section 3.4,
$$ \sqrt{n}(\hat{\beta} - \tilde{\beta}) \overset{P}{\to} 0 . $$
Combining this last result with Theorem 3.5.4, we get the next corollary which gives the asymptotic distribution of the R estimate.

Corollary 3.5.1. Under the Model (3.2.3), (E.1), (D.1), (D.2), and (S.1),
$$ \sqrt{n}(\hat{\beta}_\varphi - \beta_0) \overset{D}{\to} N_p(0, \tau_\varphi^2 \Sigma^{-1}) . \quad (3.5.15) $$


Under the further restriction that the errors have finite variance σ², Exercise 3.15.10 shows that the least squares estimate $\hat{\beta}_{LS}$ of β satisfies $\sqrt{n}(\hat{\beta}_{LS} - \beta) \overset{D}{\to} N_p(0, \sigma^2 \Sigma^{-1})$. Hence, as in the location problems of Chapters 1 and 2, the asymptotic relative efficiency between the R estimates and least squares is the ratio $\sigma^2/\tau_\varphi^2$, where τϕ is the scale parameter (3.4.4). Thus the R estimates of regression coefficients have the same high efficiency relative to LS estimates as do the rank-based estimates in the location problem. In particular, the efficiency of the Wilcoxon estimates relative to the LS estimates at the normal distribution is .955. For longer tailed error distributions this relative efficiency is much higher; see the efficiency discussion for contaminated normal distributions in Example 1.7.1.
From the above corollary, R estimates are asymptotically unbiased. It fol-
lows from the invariance properties, if we additionally assume that the errors
have a symmetric distribution, that R estimates are unbiased for all sample
sizes; see Exercise 3.15.11 for details.
The random vector $\tilde{\beta}$, (3.5.13), is an asymptotic representation of the R estimate $\hat{\beta}$. The following representation proves useful later:

Corollary 3.5.2. Under the Model (3.2.3), (E.1), (D.1), (D.2), and (S.1) in Section 3.4,
$$ n^{1/2}(\hat{\beta}_\varphi - \beta_0) = \tau_\varphi (n^{-1} X'X)^{-1} n^{-1/2} X' \varphi(F(Y - X\beta_0)) + o_p(1) , \quad (3.5.16) $$
where the notation $\varphi(F(Y))$ means the n × 1 vector whose ith component is $\varphi(F(Y_i))$.

Proof: This follows immediately from (A.3.9), (A.3.10), the proof of Theorem 3.5.2, and equation (3.5.13).
Based on this last corollary, we have that the influence function of the R estimate is given by
$$ \Omega(x_0, y_0; \hat{\beta}_\varphi) = \tau_\varphi \Sigma^{-1} \varphi(F(y_0))\, x_0 . \quad (3.5.17) $$
A more rigorous derivation of this result, based on Fréchet derivatives, is given in the Appendix; see Section A.5.2. Note that the influence function is bounded in the Y-space but unbounded in the x-space. Hence an outlier in the x-space can seriously impair an R estimate. Although, as noted above, the R estimates are highly efficient relative to the LS estimates, it follows from its influence function that the breakdown of the R estimate is 0. In Section 3.12, we present the HBR estimates, whose influence function is bounded in both spaces and which can attain 50% breakdown, although they are less efficient than the R estimates.


3.5.2 R Estimates of the Intercept


As discussed in Section 3.2, the intercept parameter requires the specification
of a location functional, T (ei ). In this section we take T (ei ) = med(ei ). Since
we assume, without loss of generality, that T (ei ) = 0, α = T (Yi − x′i β). This
leads immediately to estimating α by the median of the R residuals. Note that
this is analogous to LS, since the LS estimate of the intercept is the arithmetic
average of the LS residuals. Further, this estimate is associated with the sign
test statistic and the L1 norm. More generally we could also consider estimates
associated with signed-rank test statistics. For example, if we consider the
signed-rank Wilcoxon scores of Chapter 1 then the corresponding estimate is
the median of the Walsh averages of the residuals. The theory of such estimates
based on signed-rank tests, though, requires symmetrically distributed errors.
Thus, while we briefly discuss these later, we now concentrate on the median
of the residuals which does not require this symmetry assumption. We make
use of assumption (E.2), (3.4.3), i.e., f(0) > 0.
The process we consider is the sign process based on residuals given by
$$ S_1(Y - \alpha 1 - X\hat{\beta}_\varphi) = \sum_{i=1}^n \operatorname{sgn}(Y_i - \alpha - x_i' \hat{\beta}_\varphi) . \quad (3.5.18) $$
As with the sign process in Chapter 1, this process is a nonincreasing step function of α which steps down at the residuals. The solution to the equation
$$ S_1(Y - \alpha 1 - X\hat{\beta}_\varphi) \doteq 0 \quad (3.5.19) $$
is the median of the residuals, which we denote by $\hat{\alpha}_S = \operatorname{med}\{ Y_i - x_i' \hat{\beta}_\varphi \}$. Our goal is to obtain the asymptotic joint distribution of the estimate $\hat{b}_\varphi = (\hat{\alpha}_S, \hat{\beta}_\varphi')'$.

Similar to the R estimate of β, the estimate of the intercept is location and scale equivariant; hence, without loss of generality we assume that the true intercept and regression parameters are 0. We begin with a lemma.

Lemma 3.5.1. Assume conditions (E.1), (E.2), (S.1), (D.1), and (D.2) of Section 3.4. For any $\epsilon > 0$ and for any $a \in R$,
$$ \lim_{n \to \infty} P\left[ \left| S_1(Y - a n^{-1/2} 1 - X\hat{\beta}_\varphi) - S_1(Y - a n^{-1/2} 1) \right| \ge \epsilon \sqrt{n} \right] = 0 . $$
The proof of this lemma was first given by Jurečková (1971) for general signed-rank scores and it is briefly sketched in the Appendix for the sign scores; see Lemma A.3.2. This lemma leads to the asymptotic linearity result for the process (3.5.18). We need the following linearity result:


Theorem 3.5.6. Assume conditions (E.1), (E.2), (S.1), (D.1), and (D.2) of Section 3.4. For any $\epsilon > 0$ and $c > 0$,
$$ P\left[ \sup_{|a| \le c} \left| n^{-1/2} S_1(Y - a n^{-1/2} 1 - X\hat{\beta}_\varphi) - n^{-1/2} S_1(Y - X\hat{\beta}_\varphi) + a \tau_S^{-1} \right| \ge \epsilon \right] \to 0 , $$
as n → ∞, where $\tau_S$ is the scale parameter defined in expression (3.4.6).

Proof: For any fixed a write
$$ \left| n^{-1/2} S_1(Y - a n^{-1/2} 1 - X\hat{\beta}_\varphi) - n^{-1/2} S_1(Y - X\hat{\beta}_\varphi) + a \tau_S^{-1} \right| \le \left| n^{-1/2} S_1(Y - a n^{-1/2} 1 - X\hat{\beta}_\varphi) - n^{-1/2} S_1(Y - a n^{-1/2} 1) \right| + \left| n^{-1/2} S_1(Y - a n^{-1/2} 1) - n^{-1/2} S_1(Y) + a \tau_S^{-1} \right| + \left| n^{-1/2} S_1(Y) - n^{-1/2} S_1(Y - X\hat{\beta}_\varphi) \right| . $$
We can apply Lemma 3.5.1 to the first and third terms on the right side of the above inequality. For the middle term we can use the asymptotic linearity result in Chapter 1 for the sign process, (1.5.23). This yields the result for any a, and the sup follows from the monotonicity of the process, similar to the proof of Theorem 1.5.6 of Chapter 1.
Letting a = 0 in Lemma 3.5.1, we have that the difference $n^{-1/2} S_1(Y - X\hat{\beta}_\varphi) - n^{-1/2} S_1(Y)$ goes to zero in probability. Thus the asymptotic distribution of $n^{-1/2} S_1(Y - X\hat{\beta}_\varphi)$ is the same as that of $n^{-1/2} S_1(Y)$, namely, N(0, 1). We have two applications of these results. The first is found in the next lemma.

Lemma 3.5.2. Assume conditions (E.1), (E.2), (D.1), (D.2), and (S.1) of Section 3.4. The random variable $n^{1/2} \hat{\alpha}_S$ is bounded in probability.

Proof: Let $\epsilon > 0$ be given. Since $n^{-1/2} S_1(Y - X\hat{\beta}_\varphi)$ is asymptotically N(0, 1), there exists a c < 0 such that
$$ P\left[ n^{-1/2} S_1(Y - X\hat{\beta}_\varphi) < c \right] < \epsilon/2 . \quad (3.5.20) $$
Take $c^* = \tau_S (c - \epsilon)$. By the process's monotonicity and the definition of $\hat{\alpha}_S$, we have the implication $n^{1/2} \hat{\alpha}_S < c^* \Rightarrow n^{-1/2} S_1(Y - c^* n^{-1/2} 1 - X\hat{\beta}_\varphi) \le 0$. Adding in and subtracting out the above linearity result leads to
$$ P\left[ n^{1/2} \hat{\alpha}_S < c^* \right] \le P\left[ n^{-1/2} S_1(Y - n^{-1/2} c^* 1 - X\hat{\beta}_\varphi) \le 0 \right] \le P\left[ \left| n^{-1/2} S_1(Y - c^* n^{-1/2} 1 - X\hat{\beta}_\varphi) - \left( n^{-1/2} S_1(Y - X\hat{\beta}_\varphi) - c^* \tau_S^{-1} \right) \right| \ge \epsilon \right] + P\left[ n^{-1/2} S_1(Y - X\hat{\beta}_\varphi) - c^* \tau_S^{-1} < \epsilon \right] . $$
The first term on the right side can be made less than $\epsilon/2$ for sufficiently large n, whereas, since $c^* \tau_S^{-1} = c - \epsilon$, the second term is less than $\epsilon/2$ by (3.5.20). From this it follows that $n^{1/2} \hat{\alpha}_S$


is bounded below in probability. To finish the proof a similar argument shows that $n^{1/2} \hat{\alpha}_S$ is bounded above in probability.

As a second application we can write the linearity result of the last theorem as
$$ n^{-1/2} S_1(Y - a n^{-1/2} 1 - X\hat{\beta}_\varphi) = n^{-1/2} S_1(Y) - a \tau_S^{-1} + o_p(1) , \quad (3.5.21) $$
uniformly for all $|a| \le c$ and for $c > 0$.


Because $\hat{\alpha}_S$ is a solution to equation (3.5.19) and $n^{1/2} \hat{\alpha}_S$ is bounded in probability, the second linearity result, (3.5.21), yields, after some simplification, the following asymptotic representation of our estimate of the intercept for the true intercept $\alpha_0$:
$$ n^{1/2}(\hat{\alpha}_S - \alpha_0) = \tau_S n^{-1/2} \sum_{i=1}^n \operatorname{sgn}(Y_i - \alpha_0) + o_p(1) , \quad (3.5.22) $$
where $\tau_S$ is given in (3.4.6). From this we have that $n^{1/2}(\hat{\alpha}_S - \alpha_0) \overset{D}{\to} N(0, \tau_S^2)$. Our interest, though, is in the joint distribution of $\hat{\alpha}_S$ and $\hat{\beta}_\varphi$.
ϕ
By Corollary 3.5.2 the corresponding asymptotic representation of $\hat{\beta}_\varphi$ for the true vector of regression coefficients $\beta_0$ is
$$ n^{1/2}(\hat{\beta}_\varphi - \beta_0) = \tau_\varphi (n^{-1} X'X)^{-1} n^{-1/2} X' \varphi(F(Y)) + o_p(1) , \quad (3.5.23) $$
where τϕ is given by (3.4.4). The joint asymptotic distribution is given in the following theorem.

Theorem 3.5.7. Let $\hat{b}_\varphi = (\hat{\alpha}_S, \hat{\beta}_\varphi')'$. Then under (D.1), (D.2), (S.1), (E.1), and (E.2) in Section 3.4,
$$ \hat{b}_\varphi \text{ is approximately } N_{p+1}\left( \begin{pmatrix} \alpha_0 \\ \beta_0 \end{pmatrix} , \begin{pmatrix} n^{-1} \tau_S^2 & 0' \\ 0 & \tau_\varphi^2 (X'X)^{-1} \end{pmatrix} \right) . $$

Proof: As above, assume without loss of generality that the true parameters are 0. It is easier to work with the random vector $T_n = \left( \tau_S^{-1} \sqrt{n}\, \hat{\alpha}_S ,\; \left( \sqrt{n}\, \tau_\varphi^{-1} (n^{-1} X'X) \hat{\beta}_\varphi \right)' \right)'$. Let $t = (t_1, t_2')'$ be an arbitrary, nonzero vector in $R^{p+1}$. We need only show that $Z_n = t' T_n$ has an asymptotically univariate normal distribution. Based on the above asymptotic representations of $\hat{\alpha}_S$, (3.5.22), and $\hat{\beta}_\varphi$, (3.5.23), we have
$$ Z_n = n^{-1/2} \sum_{k=1}^n \left( t_1 \operatorname{sgn}(Y_k) + (t_2' x_k) \varphi(F(Y_k)) \right) + o_p(1) . \quad (3.5.24) $$

Denote the sum on the right side of (3.5.24) as $Z_n^*$. We need only show that $Z_n^*$ converges in distribution to a univariate normal distribution. Denote the



kth summand as $Z_{nk}^*$. We use the Lindeberg-Feller Central Limit Theorem. Our application of this theorem is similar to its use in the proof of Theorem 3.5.2. First note that since the score function ϕ is standardized ($\int \varphi = 0$), $E(Z_n^*) = 0$. Let $B_n^2 = \operatorname{Var}(Z_n^*)$. Because the individual summands are independent, the $Y_k$ are identically distributed, ϕ is standardized ($\int \varphi^2 = 1$), and the design is centered, $B_n^2$ simplifies to
$$ B_n^2 = n^{-1} \left( \sum_{k=1}^n t_1^2 + \sum_{k=1}^n (t_2' x_k)^2 + 2 t_1 \operatorname{cov}\left( \operatorname{sgn}(Y_1), \varphi(F(Y_1)) \right) \sum_{k=1}^n t_2' x_k \right) = t_1^2 + t_2' (n^{-1} X'X) t_2 + 0 . $$
Hence by (D.2),
$$ \lim_{n \to \infty} B_n^2 = t_1^2 + t_2' \Sigma t_2 , \quad (3.5.25) $$

which is a positive number. To satisfy the Lindeberg-Feller condition, we need to show that for any $\epsilon > 0$
$$ \lim_{n \to \infty} B_n^{-2} \sum_{k=1}^n E\left[ Z_{nk}^{*2}\, I\left( |Z_{nk}^*| > \epsilon B_n \right) \right] = 0 . \quad (3.5.26) $$
Since $B_n^2$ converges to a positive constant we need only show that the sum converges to 0. By the triangle inequality we can show that the indicator function satisfies
$$ I\left( n^{-1/2} |t_1| + n^{-1/2} |t_2' x_k| |\varphi(F(Y_k))| > \epsilon B_n \right) \ge I\left( |Z_{nk}^*| > \epsilon B_n \right) . \quad (3.5.27) $$

Following the discussion after expression (3.5.7), we have that $n^{-1/2} |t_2' x_k| \le M_n$ where $M_n$ is independent of k and, furthermore, $M_n \to 0$. Hence, we have
$$ I\left( |\varphi(F(Y_k))| > \frac{\epsilon B_n - n^{-1/2} t_1}{M_n} \right) \ge I\left( n^{-1/2} |t_1| + n^{-1/2} |t_2' x_k| |\varphi(F(Y_k))| > \epsilon B_n \right) . \quad (3.5.28) $$
Thus the sum in expression (3.5.26) is less than or equal to
$$ \sum_{k=1}^n E\left[ Z_{nk}^{*2}\, I\left( |\varphi(F(Y_k))| > \frac{\epsilon B_n - n^{-1/2} t_1}{M_n} \right) \right] = t_1^2\, E\left[ I\left( |\varphi(F(Y_1))| > \frac{\epsilon B_n - n^{-1/2} t_1}{M_n} \right) \right] + \frac{2 t_1}{n} E\left[ \operatorname{sgn}(Y_1) \varphi(F(Y_1))\, I\left( |\varphi(F(Y_1))| > \frac{\epsilon B_n - n^{-1/2} t_1}{M_n} \right) \right] \sum_{k=1}^n t_2' x_k + E\left[ \varphi^2(F(Y_1))\, I\left( |\varphi(F(Y_1))| > \frac{\epsilon B_n - n^{-1/2} t_1}{M_n} \right) \right] \frac{1}{n} \sum_{k=1}^n (t_2' x_k)^2 . $$


Because the design is centered, the middle term on the right side is 0. As remarked above, the term $(1/n) \sum_{k=1}^n (t_2' x_k)^2 = (1/n) t_2' X'X t_2$ converges to a positive constant. In the expression $(\epsilon B_n - n^{-1/2} t_1)/M_n$, the numerator converges to a positive constant while the denominator converges to 0; hence, the expression goes to ∞. Therefore, since ϕ is bounded, the indicator function converges to 0. Again using the boundedness of ϕ, we can interchange limit and expectation by the Lebesgue Dominated Convergence Theorem. Thus condition (3.5.26) is true and, hence, $Z_n^*$ converges in distribution to a univariate normal distribution. Therefore, $T_n$ converges to a multivariate normal distribution. Note by (3.5.25) it follows that the asymptotic covariance of $\hat{b}_\varphi$ is the result displayed in the theorem.
In the above development, we considered the centered design. In practice, though, we are often concerned with an uncentered design. Let $\alpha^*$ denote the intercept for the uncentered model. Then $\alpha^* = \alpha - \bar{x}' \beta$, where $\bar{x}$ denotes the vector of column averages of the uncentered design matrix. An estimate of $\alpha^*$ based on R estimates is given by $\hat{\alpha}_S^* = \hat{\alpha}_S - \bar{x}' \hat{\beta}_\varphi$. Based on the last theorem, it follows (Exercise 3.15.14) that
$$ \begin{pmatrix} \hat{\alpha}_S^* \\ \hat{\beta}_\varphi \end{pmatrix} \text{ is approximately } N_{p+1}\left( \begin{pmatrix} \alpha_0^* \\ \beta_0 \end{pmatrix} , \begin{pmatrix} \kappa_n & -\tau_\varphi^2 \bar{x}' (X'X)^{-1} \\ -\tau_\varphi^2 (X'X)^{-1} \bar{x} & \tau_\varphi^2 (X'X)^{-1} \end{pmatrix} \right) , \quad (3.5.29) $$
where $\kappa_n = n^{-1} \tau_S^2 + \tau_\varphi^2 \bar{x}' (X'X)^{-1} \bar{x}$ and $\tau_S$ and τϕ are given respectively by (3.4.6) and (3.4.4).
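For example, standard errors for the estimates follow from the diagonal of this asymptotic covariance matrix. The following minimal R sketch computes them; the function name rfit_se and the arguments (Xc the centered design, xbar the column means of the uncentered design, and tauhat, tauShat consistent estimates of τϕ and τ_S) are our assumptions, not the book's notation.

```r
## A minimal sketch of standard errors based on (3.5.29).
rfit_se <- function(Xc, xbar, tauhat, tauShat) {
  n <- nrow(Xc)
  XtX_inv <- solve(crossprod(Xc))                      # (X'X)^{-1}, centered design
  se_beta <- sqrt(diag(tauhat^2 * XtX_inv))            # SEs of the slope estimates
  kappa_n <- tauShat^2/n + tauhat^2 * drop(t(xbar) %*% XtX_inv %*% xbar)
  list(se_alpha = sqrt(kappa_n), se_beta = se_beta)    # intercept and slope SEs
}
```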

Intercept Estimate Based on Signed-Rank Scores

Suppose we additionally assume that the errors have a symmetric distribution; i.e., f(−x) = f(x). In this case, all location functionals are the same. Let $\varphi_f(u) = -f'(F^{-1}(u))/f(F^{-1}(u))$ denote the optimal scores for the density f(x). Then as Exercise 3.15.12 shows, $\varphi_f(1 - u) = -\varphi_f(u)$; that is, the scores are odd about 1/2. Hence, in this subsection we additionally assume that the scores satisfy property (S.3), (3.4.12).

For scores satisfying (S.3), the corresponding signed-rank scores are generated as $a^+(i) = \varphi^+(i/(n+1))$ where $\varphi^+(u) = \varphi((u+1)/2)$; see the discussion in Section 2.5.3. For example, if Wilcoxon scores are used, $\varphi(u) = \sqrt{12}(u - 1/2)$, then the signed-rank score function is $\varphi^+(u) = \sqrt{3}\, u$. Recall from Chapter 1 that these signed-rank scores can be used to define a norm and a subsequent R analysis. Here we only want to apply the associated one-sample signed-rank procedure to the residuals in order to obtain an estimate of the intercept. So consider the process
$$ T^+(\hat{e}_R - \alpha 1) = \sum_{i=1}^n \operatorname{sgn}(\hat{e}_{Ri} - \alpha)\, a^+(R|\hat{e}_{Ri} - \alpha|) , \quad (3.5.30) $$


where $\hat{e}_{Ri} = y_i - x_i' \hat{\beta}_\varphi$; see (1.8.2). Note that this is the process discussed in Section 1.8, except now the iid observations are replaced by residuals. The process is still a nonincreasing function of α which steps down at the Walsh averages of the residuals; see Exercise 1.12.29. The estimate of the intercept is a value $\hat{\alpha}_{\varphi^+}$ which solves the equation
$$ T^+(\hat{e}_R - \alpha 1) \doteq 0 . \quad (3.5.31) $$
If Wilcoxon scores are used then the estimate is the median of the Walsh averages, (1.3.25), while if sign scores are used the estimate is the median of the residuals.
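As a concrete illustration, both intercept estimates are short computations in R; in this sketch (ours, not the book's) the argument ehat is assumed to hold the R residuals.

```r
## Minimal sketches of the two intercept estimates discussed above.
intercepts <- function(ehat) {
  walsh <- outer(ehat, ehat, "+")/2           # Walsh averages (e_i + e_j)/2
  c(sign   = median(ehat),                    # sign scores: median of residuals
    wilcox = median(walsh[upper.tri(walsh, diag = TRUE)]))  # signed-rank Wilcoxon
}
```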
Let $\hat{b}_{\varphi^+} = (\hat{\alpha}_{\varphi^+}, \hat{\beta}_\varphi')'$. We next briefly sketch the development of the asymptotic distribution of $\hat{b}_{\varphi^+}$. Assume without loss of generality that the true parameter vector $(\alpha_0, \beta_0')'$ is 0. Suppose instead of the residuals we had the true errors in (3.5.30). Theorem A.2.11 of the Appendix then yields an asymptotic linearity result for the process. McKean and Hettmansperger (1976) show that this result holds for the residuals also; that is,
$$ \frac{1}{\sqrt{n}} T^+\left( \hat{e}_R - \frac{\alpha}{\sqrt{n}} 1 \right) = \frac{1}{\sqrt{n}} T^+(e) - \alpha \tau_\varphi^{-1} + o_p(1) , \quad (3.5.32) $$
for all $|\alpha| \le c$, where $c > 0$. Using arguments similar to those in McKean and Hettmansperger (1976), we can show that $\sqrt{n}\, \hat{\alpha}_{\varphi^+}$ is bounded in probability; hence, by (3.5.32) we have that
$$ \sqrt{n}\, \hat{\alpha}_{\varphi^+} = \tau_\varphi \frac{1}{\sqrt{n}} T^+(e) + o_p(1) . \quad (3.5.33) $$

But by (A.2.43) and (A.2.45) of the Appendix, we have the second representation given by
$$ \sqrt{n}\, \hat{\alpha}_{\varphi^+} = \tau_\varphi \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi^+\left( F^+(|e_i|) \right) \operatorname{sgn}(e_i) + o_p(1) = \tau_\varphi \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi^+(2F(e_i) - 1) + o_p(1) , \quad (3.5.34) $$
where $F^+$ is the distribution function of the absolute errors $|e_i|$. Due to symmetry, $F^+(t) = 2F(t) - 1$. Then using the relationship between the rank and the signed-rank scores, $\varphi^+(u) = \varphi((u+1)/2)$, we obtain finally
$$ \sqrt{n}\, \hat{\alpha}_{\varphi^+} = \tau_\varphi \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(F(Y_i)) + o_p(1) . \quad (3.5.35) $$


Therefore, using expression (3.5.23), we have the asymptotic representation of the estimates:
$$ \sqrt{n} \begin{pmatrix} \hat{\alpha}_{\varphi^+} \\ \hat{\beta}_\varphi \end{pmatrix} = \frac{\tau_\varphi}{\sqrt{n}} \begin{pmatrix} 1' \varphi(F(Y)) \\ n (X'X)^{-1} X' \varphi(F(Y)) \end{pmatrix} + o_p(1) . \quad (3.5.36) $$
This and an application of the Lindeberg Central Limit Theorem, similar to the proof of Theorem 3.5.7, leads to the theorem:

Theorem 3.5.8. Under assumptions (D.1), (D.2), (E.1), (E.2), (S.1), and (S.3) of Section 3.4,
$$ \begin{pmatrix} \hat{\alpha}_{\varphi^+} \\ \hat{\beta}_\varphi \end{pmatrix} \text{ has an approximate } N_{p+1}\left( \begin{pmatrix} \alpha_0 \\ \beta_0 \end{pmatrix} ,\; \tau_\varphi^2 (X_1' X_1)^{-1} \right) \text{ distribution} , \quad (3.5.37) $$
where $X_1 = [1\; X]$.

3.6 Theory of Rank-Based Tests

Consider the general linear hypotheses discussed in Section 3.2,
$$ H_0 : M\beta = 0 \text{ versus } H_A : M\beta \ne 0 , \quad (3.6.1) $$
where M is a q × p matrix of full row rank. The geometry of R testing, Section 3.2.2, indicated the statistic based on the reduction of dispersion between the reduced and full models, $F_\varphi = (RD/q)/(\hat{\tau}_\varphi/2)$, (3.2.18), as a test statistic. In this section we develop the asymptotic theory for this test statistic under null and alternative hypotheses. This theory is sufficient for two other rank-based tests which we discuss later. See Table 3.2.2 and the discussion relating to that table for the special case when M = I.

3.6.1 Null Theory of Rank-Based Tests

We proceed with two lemmas about the dispersion function D(β) and its quadratic approximation Q(β) given by expression (3.5.11).

Lemma 3.6.1. Let $\hat{\beta}$ denote the R estimate of β in the full model (3.2.3); then under (E.1), (S.1), (D.1), and (D.2) of Section 3.4,
$$ D(\hat{\beta}) - Q(\hat{\beta}) \overset{P}{\to} 0 . \quad (3.6.2) $$


Proof: Assume without loss of generality that the true β is 0. Let $\epsilon > 0$ be given. Choose $c_0$ such that $P[\sqrt{n} \|\hat{\beta}\| > c_0] < \epsilon/2$, for n sufficiently large. Using asymptotic quadraticity, Theorem A.3.8, we have for n sufficiently large
$$ P\left[ |D(\hat{\beta}) - Q(\hat{\beta})| < \epsilon \right] \ge P\left[ \left\{ \max_{\|\beta\| < c_0/\sqrt{n}} |D(\beta) - Q(\beta)| < \epsilon \right\} \cap \left\{ \sqrt{n} \|\hat{\beta}\| < c_0 \right\} \right] > 1 - \epsilon . $$
From this we obtain the result.


The last result shows that D and Q are close at the R estimate of β. Our next result shows that $Q(\hat{\beta})$ is close to the minimum of Q.

Lemma 3.6.2. Let $\tilde{\beta}$ denote the minimizing value of the quadratic function Q; then under (E.1), (S.1), (D.1), and (D.2) of Section 3.4,
$$ Q(\tilde{\beta}) - Q(\hat{\beta}) \overset{P}{\to} 0 . \quad (3.6.3) $$

Proof: By simple algebra we have
$$ Q(\tilde{\beta}) - Q(\hat{\beta}) = (2\tau_\varphi)^{-1} (\tilde{\beta} - \hat{\beta})' X'X (\tilde{\beta} + \hat{\beta}) - (\tilde{\beta} - \hat{\beta})' S(Y) = \sqrt{n} (\tilde{\beta} - \hat{\beta})' \left[ (2\tau_\varphi)^{-1} n^{-1} X'X\, \sqrt{n} (\tilde{\beta} + \hat{\beta}) - n^{-1/2} S(Y) \right] . $$
It is shown in Exercise 3.15.15 that the factor in brackets in the last equation is bounded in probability. Since the left factor converges to zero in probability by Theorem 3.5.5, the desired result follows.

It is easier to work with the equivalent formulation of the linear hypotheses given in the following lemma.

Lemma 3.6.3. An equivalent formulation of the model and the hypotheses is:
$$ Y = 1\alpha + X_1^* \beta_1^* + X_2^* \beta_2^* + e , \quad (3.6.4) $$
with the hypotheses $H_0 : \beta_2^* = 0$ versus $H_A : \beta_2^* \ne 0$, where $X_i^*$ and $\beta_i^*$, i = 1, 2, are defined in display (3.6.6).

Proof: Consider the QR-decomposition of M′ given by
$$ M' = [Q_2\; Q_1] \begin{pmatrix} R \\ O \end{pmatrix} = Q_2 R , \quad (3.6.5) $$
where the columns of $Q_1$ form an orthonormal basis for the kernel of the matrix M, the columns of $Q_2$ form an orthonormal basis for the column space of M′, O is a (p − q) × q matrix of 0's, and R is a q × q upper triangular, nonsingular matrix. Define
$$ X_i^* = X Q_i \quad \text{and} \quad \beta_i^* = Q_i' \beta , \quad i = 1, 2 . \quad (3.6.6) $$
It follows that
$$ Y = 1\alpha + X\beta + e = 1\alpha + X_1^* \beta_1^* + X_2^* \beta_2^* + e . $$
Further, $M\beta = 0$ if and only if $\beta_2^* = 0$, which yields the desired result.
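In software this reparameterization is immediate; the following R sketch (ours, with assumed argument names M and X) builds $Q_1$, $Q_2$, and the starred design matrices from the QR-decomposition of M′.

```r
## A minimal sketch of (3.6.5)-(3.6.6): reparameterize the hypothesis via
## the QR-decomposition of M'; M is q x p of full row rank, X is n x p.
reparam <- function(X, M) {
  q  <- nrow(M)
  Q  <- qr.Q(qr(t(M)), complete = TRUE)   # p x p orthogonal matrix
  Q2 <- Q[, 1:q, drop = FALSE]            # basis for the column space of M'
  Q1 <- Q[, -(1:q), drop = FALSE]         # basis for the kernel of M
  list(X1s = X %*% Q1, X2s = X %*% Q2)    # X_1^* and X_2^*
}
```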


Without loss of generality, by the last lemma, for the remainder of the
section, we consider a model of the form

Y = 1α + X1 β 1 + X2 β 2 + e , (3.6.7)

with the hypotheses

H0 : β 2 = 0 versus HA : β 2 6= 0 . (3.6.8)

With these lemmas, we are now ready to obtain the asymptotic distribution
of Fϕ . Let β r = (β ′1 , 0′ )′ denote the reduced model vector of parameters, let
b denote the reduced model R estimate of β , and let β
β b = (β b ′ , 0′ )′ . We
r,1 1 r r,1
use similar notation with the minimizing value of the approximating quadratic
Q. With this notation, the drop in dispersion becomes RDϕ = D(β b ) − D(β). b
r
McKean and Hettmansperger (1976) proved the following:

Theorem 3.6.1. Suppose the assumptions (E.1), (D.1), (D.2), and (S.1) of
Section 3.4 hold. Then under H0 ,
RDϕ D 2
→ χ (q) ,
τϕ /2

where RDϕ is formally defined in expression (3.2.16).


Proof: Assume that the true vector of parameters is 0 and suppress the sub-
script ϕ on RD. Write RD as the sum of five differences:
b ) − D(β)
RD = D(β b
 r    
= D(β b ) − Q(β
b ) + Q(β b ) − Q(β
e ) + Q(β e )
r r r r r
    
e + Q(β)
− Q(β) e − Q(β)
b + Q(β) b − D(β)
b .

By Lemma 3.6.1 the first and fifth differences go to zero in probability and
by Lemma 3.6.2 the second and fourth differences go to zero in probability.


Hence we need only show that the middle difference converges in distribution to the intended distribution. As in Lemma 3.6.2, algebra leads to
$$ Q(\tilde{\beta}) = -2^{-1} \tau_\varphi\, S(Y)' (X'X)^{-1} S(Y) + D(Y) , $$
while
$$ Q(\tilde{\beta}_r) = -2^{-1} \tau_\varphi\, S(Y)' \begin{pmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & 0 \end{pmatrix} S(Y) + D(Y) . $$
Combining these last two results, the middle difference becomes
$$ Q(\tilde{\beta}_r) - Q(\tilde{\beta}) = 2^{-1} \tau_\varphi\, S(Y)' \left( (X'X)^{-1} - \begin{pmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & 0 \end{pmatrix} \right) S(Y) . $$
Using a well-known matrix identity (see page 27 of Searle, 1971),
$$ (X'X)^{-1} = \begin{pmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} -A_1^{-1} B \\ I \end{pmatrix} W \begin{pmatrix} -B' A_1^{-1} & I \end{pmatrix} , $$
where
$$ X'X = \begin{pmatrix} A_1 & B \\ B' & A_2 \end{pmatrix} \quad \text{and} \quad W = \left( A_2 - B' A_1^{-1} B \right)^{-1} . \quad (3.6.9) $$
Hence after some simplification we have
$$ \frac{RD}{\tau_\varphi/2} = S(Y)' \begin{pmatrix} -A_1^{-1} B \\ I \end{pmatrix} W \begin{pmatrix} -B' A_1^{-1} & I \end{pmatrix} S(Y) + o_p(1) = \left( \begin{pmatrix} -B' A_1^{-1} & I \end{pmatrix} n^{-1/2} S(Y) \right)' nW \left( \begin{pmatrix} -B' A_1^{-1} & I \end{pmatrix} n^{-1/2} S(Y) \right) + o_p(1) . \quad (3.6.10) $$
Using $n^{-1} X'X \to \Sigma$ and the asymptotic distribution of $n^{-1/2} S(Y)$, Theorem 3.5.2, it follows that the right side of (3.6.10) converges in distribution to a χ² random variable with q degrees of freedom, which completes the proof of the theorem.
A consistent estimate of τϕ is discussed in Section 3.7. We denote this estimate by $\hat{\tau}_\varphi$. The test statistic we subsequently use is given by
$$ F_\varphi = \frac{RD_\varphi/q}{\hat{\tau}_\varphi/2} ; $$
see also (3.2.18).

Although the test statistic qFϕ has an asymptotic χ2 distribution, small sample
studies (see below) have indicated that it is best to compare the test statistic


with F-critical values having q and n − p − 1 degrees of freedom; that is, the test at nominal level α is:
$$ \text{Reject } H_0 : M\beta = 0 \text{ in favor of } H_A : M\beta \ne 0 \text{ if } F_\varphi \ge F(\alpha, q, n - p - 1) . \quad (3.6.11) $$
McKean and Sheather (1991) review numerous small sample studies concerning the validity of the rank-based analysis based on the test statistic $F_\varphi$. These small sample studies demonstrate that the empirical α levels of $F_\varphi$ over a variety of designs, sample sizes, and error distributions are close to the nominal values.
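In practice this test is readily computed in software; for instance, the Rfit package discussed in Section 3.7.2 provides it. The following minimal sketch (data names y, x1, x2 are ours) assumes the hypothesis concerns the coefficients of x2.

```r
## A minimal sketch of the F_phi test (3.6.11) via the Rfit package of
## Kloke and McKean, referenced in Section 3.7.2 (Wilcoxon scores default).
library(Rfit)
fitF <- rfit(y ~ x1 + x2)   # full model R fit
fitR <- rfit(y ~ x1)        # reduced model fit under H0
drop.test(fitF, fitR)       # reduction in dispersion test, q and n-p-1 df
```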
In classical inference there are three tests of general hypotheses: the like-
lihood ratio test (reduction in sums of squares test), Wald’s test, and Rao’s
scores (gradient) test. A good discussion of these tests can be found in Rao
(1973). When the hypotheses are the general linear hypotheses (3.6.1), the er-
rors have a normal distribution, and the least squares procedure is used then
the three test statistics are algebraically equivalent. Actually the equivalence
holds without normality, although in this case the reduction in sums of squares
statistic is not the likelihood ratio test; see the discussion in Hettmansperger
and McKean (1983).
There are also three rank-based tests for the general linear hypotheses. The reduction in dispersion test statistic $F_\varphi$ is the analogue of the likelihood ratio test, i.e., the reduction in sums of squares test. Since Wald's test statistic is a quadratic form in full model estimates, its rank analogue is given by
$$ F_{\varphi, Q} = \frac{ (M\hat{\beta})' \left[ M (X'X)^{-1} M' \right]^{-1} (M\hat{\beta}) / q }{ \hat{\tau}_\varphi^2 } . \quad (3.6.12) $$
Provided $\hat{\tau}_\varphi$ is a consistent estimate of τϕ, it follows from the asymptotic distribution of $\hat{\beta}$, Corollary 3.5.1, that under $H_0$, $q F_{\varphi,Q}$ has an asymptotic χ² distribution. Hence the test statistics $F_\varphi$ and $F_{\varphi,Q}$ have the same null asymptotic distributions. Actually, as Exercise 3.15.16 shows, the difference of the test statistics converges to zero in probability under $H_0$. Unlike the classical methods, though, they are not algebraically equivalent; see Hettmansperger and McKean (1983).
The rank gradient scores test is easiest to define in terms of the reparameterized model, (3.6.7); that is, the null hypothesis is $H_0 : \beta_2 = 0$. Rewrite the random vector defined in (3.6.10) of Theorem 3.6.1 using as the true parameter under $H_0$, $\beta_0 = (\beta_{01}', 0')'$; i.e.,
$$ \left( \begin{pmatrix} -B' A_1^{-1} & I \end{pmatrix} n^{-1/2} S(Y - X\beta_0) \right)' nW \left( \begin{pmatrix} -B' A_1^{-1} & I \end{pmatrix} n^{-1/2} S(Y - X\beta_0) \right) . \quad (3.6.13) $$
From the proof of Theorem 3.6.1 this quadratic form has an asymptotic χ² distribution with q degrees of freedom. Since it does depend on $\beta_0$, it can-


not be used as a test statistic. Suppose we substitute the reduced model R estimate of $\beta_1$, i.e., the first p − q components of $\hat{\beta}_r$ defined immediately after expression (3.6.8), which we denoted by $\hat{\beta}_{r,1}$. Now since this is the reduced model R estimate, we have
$$ S(Y - X\hat{\beta}_r) \doteq \begin{pmatrix} 0 \\ S_2(Y - X_1 \hat{\beta}_{r,1}) \end{pmatrix} , \quad (3.6.14) $$
where the subscript 2 on S denotes the last q components of S. This yields
$$ A_\varphi = S_2(Y - X_1 \hat{\beta}_{r,1})' \left\{ X_2'X_2 - X_2'X_1 (X_1'X_1)^{-1} X_1'X_2 \right\}^{-1} S_2(Y - X_1 \hat{\beta}_{r,1}) \quad (3.6.15) $$
as a test statistic. This is often called the aligned rank test, since the observations are aligned by the reduced model estimate. Exercise 3.15.17 shows that under $H_0$, $A_\varphi$ has an asymptotic χ² distribution. As the proof shows, the difference between $qF_\varphi$ and $A_\varphi$ converges to zero in probability under $H_0$. Aligned rank tests were introduced by Hodges and Lehmann (1962) and are developed for the linear model by Puri and Sen (1985).
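For illustration, the aligned rank test is only a few lines of R; this sketch (ours) assumes Wilcoxon scores, with eR the reduced-model R residuals and X1, X2 the two blocks of the centered design matrix.

```r
## A minimal sketch of the aligned rank test A_phi, (3.6.15), assuming
## Wilcoxon scores; X1 is n x (p-q) and X2 is n x q.
aligned_test <- function(eR, X1, X2) {
  n  <- length(eR)
  a  <- sqrt(12) * (rank(eR)/(n + 1) - 1/2)   # scores a(R(e_Ri))
  S2 <- crossprod(X2, a)                      # S_2 at the reduced-model fit
  W  <- crossprod(X2) -
        crossprod(X2, X1) %*% solve(crossprod(X1), crossprod(X1, X2))
  drop(crossprod(S2, solve(W, S2)))           # compare with qchisq(1 - alpha, q)
}
```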
Suppose in (3.6.14) we use a reduced model estimate $\hat{\beta}_{r,1}^*$ which is not the R estimate; for example, it may be the LS estimate. Then we have
$$ S(Y - X\hat{\beta}_r^*) \doteq \begin{pmatrix} S_1(Y - X_1 \hat{\beta}_{r,1}^*) \\ S_2(Y - X_1 \hat{\beta}_{r,1}^*) \end{pmatrix} . \quad (3.6.16) $$
The reduced model estimate must satisfy $\sqrt{n}(\hat{\beta}_r^* - \beta_0) = O_p(1)$ under $H_0$. Then the statistic in (3.6.15) is
$$ A_\varphi^* = S_2^{*\prime} \left\{ X_2'X_2 - X_2'X_1 (X_1'X_1)^{-1} X_1'X_2 \right\}^{-1} S_2^* , \quad (3.6.17) $$
where, from (3.6.10),
$$ S_2^* = S_2(Y - X_1 \hat{\beta}_{r,1}^*) - X_2'X_1 (X_1'X_1)^{-1} S_1(Y - X_1 \hat{\beta}_{r,1}^*) . \quad (3.6.18) $$
Note that when the R estimate is used, the second term in $S_2^*$ vanishes and we have (3.6.15); see Adichie (1978) and Chiang and Puri (1984).
Hettmansperger and McKean (1983) give a general discussion of these three tests. Note that both $F_{\varphi,Q}$ and $F_\varphi$ require the full model estimates and the scale parameter τϕ, while $A_\varphi$ does not. However, when using a linear model, one is usually interested in more than hypothesis testing. Of primary interest is checking the quality of the fit; i.e., does the model fit the data. This requires estimation of the full model parameters and an estimate of τϕ. Diagnostics for fits based on R estimates are the topics of Section 3.9. One is

also usually interested in estimating contrasts and their standard errors. For R estimates this requires an estimate of τϕ. Moreover, as discussed in Hettmansperger and McKean (1983), the small sample properties of the aligned rank test can be poor on certain designs.

The influence function of the test statistic $F_\varphi$ is derived in Appendix A.5.2. As discussed there, it is easier to work with $\sqrt{q F_\varphi}$. The result is given by
$$ \Omega\left( x_0, y_0; \sqrt{q F_\varphi} \right) = \left| \varphi\left[ F(y_0 - x_0' \beta_r) \right] \right| \left[ x_0' \left( (X'X)^{-1} - \begin{pmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & 0 \end{pmatrix} \right) x_0 \right]^{1/2} \quad (3.6.19) $$
and, as shown in the Appendix, the null distribution of $F_\varphi$ can be read from this result. Note that, similar to the R estimates, the influence function of $F_\varphi$ is bounded in the Y-space but not in the x-space; see (3.5.17).

3.6.2 Theory of Rank-Based Tests under Alternatives


In the last section, we developed the null asymptotic theory of the rank-
based tests based on a general score function. In this section we obtain some
properties of these tests under alternative models. We show first that the test
based on the reduction of dispersion, RDϕ , (3.2.16), is consistent under any
alternative to the general linear hypothesis. We then show the efficiency of
these tests is the same as the efficiency results obtained in Chapter 2.

Consistency

We want to show that the test statistic $F_\varphi$ is consistent for the general linear hypothesis, (3.2.5). Without loss of generality, we again reparameterize the model as in (3.6.7) and consider as our hypothesis $H_0 : \beta_2 = 0$ versus $H_A : \beta_2 \ne 0$. Let $\beta_0 = (\beta_{01}', \beta_{02}')'$ be the true parameter. We assume that the alternative is true; hence, $\beta_{02} \ne 0$. Let α be a given level of significance. Let $T(\tau_\varphi) = RD_\varphi/(\tau_\varphi/2)$ where $RD_\varphi = D(\hat{\beta}_r) - D(\hat{\beta})$. Because we estimate τϕ under the full model by a consistent estimate, to show consistency of $F_\varphi$ it suffices to show
$$ P_{\beta_0}\left[ T(\tau_\varphi) \ge \chi^2_{\alpha, q} \right] \to 1 , \quad (3.6.20) $$
as n → ∞.

As in the proof under the null hypothesis, it is convenient to work with the approximating quadratic function Q(Y − Xβ), (3.5.11). As above, let $\tilde{\beta}$ and $\hat{\beta}$ denote the minimizing values of Q and D, respectively, under the full model. The present argument simplifies if, for the full model, we replace $\hat{\beta}$ by


$\tilde{\beta}$ in $T(\tau_\varphi)$. We can do this because we can write
$$ D(Y - X\tilde{\beta}) - D(Y - X\hat{\beta}) = \left( D(Y - X\tilde{\beta}) - Q(Y - X\tilde{\beta}) \right) + \left( Q(Y - X\tilde{\beta}) - Q(Y - X\hat{\beta}) \right) + \left( Q(Y - X\hat{\beta}) - D(Y - X\hat{\beta}) \right) . $$
Applying asymptotic quadraticity, Theorem A.3.8, the first and third differences go to 0 in probability, while the second difference goes to 0 in probability by Lemma 3.6.2; hence the left side goes to 0 in probability under the alternative model. Thus we need only show that
$$ P_{\beta_0}\left[ (2/\tau_\varphi) \left( D(\hat{\beta}_r) - D(\tilde{\beta}) \right) \ge \chi^2_{\alpha, q} \right] \to 1 , \quad (3.6.21) $$
where, as above, $\hat{\beta}_r$ denotes the reduced model R estimate. We state the result next. The proof can be found in the Appendix; see Theorem A.3.10.

Theorem 3.6.2. Suppose conditions (E.1), (D.1), (D.2), and (S.1) of Section 3.4 hold. The test statistic $F_\varphi$ is consistent for the hypotheses (3.2.5).

Efficiency Results

The above result establishes that the rank-based test statistic $F_\varphi$ is consistent for the general linear hypothesis, (3.2.5). We next derive the efficiency results of the test. Our first step is to obtain the asymptotic power of $F_\varphi$ along a sequence of alternatives. This generalizes the asymptotic power lemmas discussed in Chapters 1 and 2. From this the efficiency results follow. As with the consistency discussion, it is more convenient to work with the model (3.6.7). The sequence of alternative models to the hypothesis $H_0 : \beta_2 = 0$ is:
$$ Y = 1\alpha + X_1 \beta_1 + X_2 (\theta/\sqrt{n}) + e , \quad (3.6.22) $$
where θ is a nonzero vector. Because R estimates are invariant to location shifts, we can assume without loss of generality that $\beta_1 = 0$. Let $\beta_n = (0', \theta'/\sqrt{n})'$ and let $H_n$ denote the hypothesis that (3.6.22) is the true model. The concept of contiguity proves helpful with the asymptotic theory of the statistic $F_\varphi$ under this sequence of models. A discussion of contiguity is given in the Appendix; see Section A.2.2.

Theorem 3.6.3. Under the sequence of models (3.6.22) and the assumptions (E.1), (D.1), (D.2), and (S.1) of Section 3.4,
$$ P_{\beta_n}\left( T(\hat{\tau}_\varphi) \le t \right) \to P\left( \chi^2_q(\eta_\varphi) \le t \right) , \quad (3.6.23) $$


where $\chi^2_q(\eta_\varphi)$ has a noncentral χ²-distribution with q degrees of freedom and noncentrality parameter
$$ \eta_\varphi = \tau_\varphi^{-2}\, \theta' W_0^{-1} \theta , \quad (3.6.24) $$
where $W_0 = \lim_{n \to \infty} nW$ and W is defined in display (3.6.9).

Proof: As in the proof of Theorem 3.6.1 we can write the drop in dispersion as the sum of the same five differences. Since the first two and last two differences go to zero in probability under the null model, it follows from the discussion on contiguity (Section A.2.2) that these differences go to zero in probability under the model (3.6.22). Hence we need only be concerned about the middle difference. Since $\beta_1 = 0$, the middle difference reduces to the same quantity as in Theorem 3.6.1; i.e., we obtain
$$ \frac{RD_\varphi}{\tau_\varphi/2} = \left( \begin{pmatrix} -B' A_1^{-1} & I \end{pmatrix} S(Y) \right)' W \left( \begin{pmatrix} -B' A_1^{-1} & I \end{pmatrix} S(Y) \right) + o_p(1) . $$
The asymptotic linearity result derived in the Appendix (Theorem A.3.8) is
$$ \sup_{\sqrt{n} \|\beta\| \le c} \left\| n^{-1/2} S(Y - X\beta) - \left( n^{-1/2} S(Y) - \tau_\varphi^{-1} \Sigma \sqrt{n} \beta \right) \right\| = o_p(1) , $$
for all $c > 0$. Since $\sqrt{n} \|\beta_n\| = \|\theta\|$, we can take $c = \|\theta\|$ and get
$$ \left\| n^{-1/2} S(Y - X\beta_n) - \left( n^{-1/2} S(Y) - \tau_\varphi^{-1} \Sigma (0', \theta')' \right) \right\| = o_p(1) . \quad (3.6.25) $$
The above probability statements hold under the null model and, hence, by contiguity, under the sequence of models (3.6.22) also. Under the sequence of models (3.6.22), however,
$$ n^{-1/2} S(Y - X\beta_n) \overset{D}{\to} N_p(0, \Sigma) . $$
Hence, under the sequence of models (3.6.22),
$$ n^{-1/2} S(Y) \overset{D}{\to} N_p\left( \tau_\varphi^{-1} \Sigma (0', \theta')' , \Sigma \right) . \quad (3.6.26) $$
Then under the sequence of models (3.6.22),
$$ \begin{pmatrix} -B' A_1^{-1} & I \end{pmatrix} n^{-1/2} S(Y) \overset{D}{\to} N_q\left( \tau_\varphi^{-1} W_0^{-1} \theta ,\; W_0^{-1} \right) . $$
From this last result, the conclusion readily follows.


Several interesting remarks follow from this theorem. First, since W0 is
positive definite, under alternatives the noncentrality parameter η > 0. Thus
the asymptotic distribution of T (τϕ ) under the sequence of models (3.6.22)
has mean q + η. Furthermore, the asymptotic power of a level α test based on
T (τϕ ) is P [χ2q (η) ≥ χ2α,q ].


Second, note that we can write the noncentrality parameter as
$$ \eta_\varphi = (\tau_\varphi^2 n)^{-1} \left[ \theta' A_2 \theta - (B\theta)' A_1^{-1} (B\theta) \right] . $$
Both matrices $A_2$ and $A_1^{-1}$ are positive definite; hence, the noncentrality parameter is maximized when θ is in the kernel of B. One way of assuring this for a design is to take B = 0. Because $B = X_1'X_2$, this condition holds for orthogonal designs. Therefore orthogonal designs are generally more efficient than nonorthogonal designs.
We next obtain the asymptotic relative efficiency of the test statistic $F_\varphi$ with respect to the least squares classical F-test, $F_{LS}$, defined by (3.2.17) in Section 3.2.2. The theory for $F_{LS}$ under local alternatives is outlined in Exercise 3.15.18, where it is shown that, under the additional assumption that the random errors $e_i$ have finite variance σ², the null asymptotic distribution of $q F_{LS}$ is a central $\chi^2_q$ distribution. Thus both $F_\varphi$ and $F_{LS}$ have the same asymptotic null distribution. As outlined in Exercise 3.15.18, under the sequence of models (3.6.22), $q F_{LS}$ has an asymptotic noncentral $\chi^2_q(\eta_{LS})$ distribution with noncentrality parameter
$$ \eta_{LS} = (\sigma^2)^{-1}\, \theta' W_0^{-1} \theta . \quad (3.6.27) $$
Based on Theorem 3.6.3, the asymptotic relative efficiency of $F_\varphi$ and $F_{LS}$ is the ratio of their noncentrality parameters; i.e.,
$$ e(F_\varphi, F_{LS}) = \frac{\eta_\varphi}{\eta_{LS}} = \frac{\sigma^2}{\tau_\varphi^2} . $$

Thus the efficiency results for the rank-based estimates and tests discussed in
this section are the same as the efficiency results presented in Chapters 1 and
2. An asymptotically efficient analysis can be obtained if the selected rank
score function is ϕf (u) = −f0′ (F0−1 (u))/f0(F0−1 (u)) where f0 is the form of the
density of the error distribution. If the errors have a logistic distribution then
the Wilcoxon scores result in an asymptotically efficient analysis.
Usually we have no knowledge of the distribution of the errors, in which case we would recommend using Wilcoxon scores. With them, the loss in
relative efficiency to the classical analysis at the normal distribution is only
5%, while the gain in efficiency over the classical analysis for long-tailed error
distributions can be substantial as discussed in Chapters 1 and 2.
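As a check on the 5% figure, combining this efficiency ratio with the value of τϕ for Wilcoxon scores at the normal distribution, given later in (3.6.34), yields the worked computation:
$$ e(F_\varphi, F_{LS}) = \frac{\sigma^2}{\tau_\varphi^2} = \frac{\sigma^2}{(\pi/3)\, \sigma^2} = \frac{3}{\pi} \approx 0.955 . $$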
Many of the studies reviewed in the article by McKean and Sheather (1991)
included power comparisons of the rank-based analyses with the least squares
F -test, FLS . The empirical power of FLS at normal error distributions was
slightly better than the empirical power of Fϕ , under Wilcoxon scores. Under
error distributions with heavier tails than the normal distribution, the empir-
ical power of Fϕ was generally larger, often much larger, than the empirical


power of FLS . These studies provide empirical evidence that the good asymp-
totic efficiency properties of the rank-based analysis hold in the small sample
setting.
As discussed above, the noncentrality parameters of the test statistics Fϕ
and FLS differ in only the scale parameters. Hence, in practice, planning de-
signs based on the noncentrality parameter of Fϕ can proceed similar to the
planning of a design using the noncentrality parameter of FLS ; see, for exam-
ple, the discussion in Chapter 4 of Graybill (1976).

3.6.3 Further Remarks on the Dispersion Function

Let $\hat{e}$ denote the rank-based residuals when the linear model, (3.2.4), is fit using the scores based on the function ϕ. Suppose the same assumptions hold as above; i.e., (E.1), (D.1), and (D.2) in Section 3.4. In this section, we explore further properties of the residual dispersion $D(\hat{e})$; see also Sections 3.9.2 and 3.11.

The functional corresponding to the dispersion function evaluated at the errors $e_i$ is determined as follows: letting $F_n$ denote the empirical distribution function of the iid errors $e_1, \ldots, e_n$, we have
$$ \frac{1}{n} D(e) = \frac{1}{n} \sum_{i=1}^n a(R(e_i))\, e_i = \sum_{i=1}^n \varphi\left( \frac{n}{n+1} F_n(e_i) \right) e_i \frac{1}{n} = \int \varphi\left( \frac{n}{n+1} F_n(x) \right) x\, dF_n(x) \overset{P}{\to} \int \varphi(F(x))\, x\, dF(x) = D_e . \quad (3.6.28) $$

As Exercise 3.15.19 shows, $D_e$ is a scale parameter; see also the examples below.

Let $D(\hat{e})$ denote the residual dispersion, $D(\hat{\beta}) = D(Y, \Omega)$. We next show that $n^{-1} D(\hat{e})$ also converges in probability to $D_e$, a result which proves useful in Sections 3.9.2 and 3.11. Assume without loss of generality that the true β is 0. We can write
$$ D(\hat{e}) = \left( D(\hat{e}) - Q(\hat{\beta}) \right) + \left( Q(\hat{\beta}) - Q(\tilde{\beta}) \right) + Q(\tilde{\beta}) . $$
By Lemmas 3.6.1 and 3.6.2 the two differences on the right side converge to 0 in probability. After some algebra, we obtain
$$ Q(\tilde{\beta}) = -\frac{\tau_\varphi}{2} \left\{ \frac{1}{\sqrt{n}} S(e)' \left( \frac{1}{n} X'X \right)^{-1} \frac{1}{\sqrt{n}} S(e) \right\} + D(e) . $$


By Theorem 3.5.2 the term in braces on the right side converges in distribution to a χ² random variable with p degrees of freedom. This implies that $(D(e) - D(\hat{e}))/(\tau_\varphi/2)$ also converges in distribution to a χ² random variable with p degrees of freedom. Although this is a stronger result than we need, it does imply that $n^{-1}(D(e) - D(\hat{e}))$ converges to 0 in probability. Hence, $n^{-1} D(\hat{e})$ converges in probability to $D_e$.

The natural analog of the least squares F-test statistic is
$$ F_\varphi^* = \frac{RD/q}{\hat{\sigma}_D/2} , \quad (3.6.29) $$
where $\hat{\sigma}_D = D(\hat{e})/(n - p - 1)$, rather than $F_\varphi$. But we have
$$ q F_\varphi^* = \frac{\hat{\tau}_\varphi/2}{n^{-1} D(\hat{e})/2}\, q F_\varphi \overset{D}{\to} \kappa_F\, \chi^2(q) , \quad (3.6.30) $$
where $\kappa_F$ is defined by
$$ \frac{\hat{\tau}_\varphi}{n^{-1} D(\hat{e})} \overset{P}{\to} \kappa_F . \quad (3.6.31) $$
n D(b e)
Hence, to have a limiting χ2 -distribution for qFϕ∗ we need to have κF = 1.
Below we give several examples where this occurs. In the first example, the
form of the error distribution is known while in the second example the errors
are normally distributed; however, these cases rarely occur in practice.
There is an even more acute problem with using Fϕ∗ , though. In Sec-
tion A.5.2 of the Appendix, we show that the influence function of Fϕ∗ is
not bounded in the Y -space, while, as noted above, the influence function of
the statistic Fϕ is bounded in the Y -space provided the score function ϕ(u)
is bounded. Note, however, that the influence functions of $D(\hat{e})$ and $F_\varphi^*$ are
linear rather than quadratic as is the influence function of FLS . Hence, they
are somewhat less sensitive to outliers in the Y -space than FLS ; see Hettman-
sperger and McKean (1978).
Example 3.6.1 (Form of Error Density Known). Assume that the errors have density $f(x) = \sigma^{-1} f_0(x/\sigma)$ where $f_0$ is known. Our choice of scores would then be the optimal scores given by
$$ \varphi_0(u) = -\frac{1}{\sqrt{I(f_0)}} \frac{f_0'(F_0^{-1}(u))}{f_0(F_0^{-1}(u))} , \quad (3.6.32) $$
where $I(f_0)$ denotes the Fisher information corresponding to $f_0$. These scores yield an asymptotically efficient rank-based analysis. Exercise 3.15.20 shows that with these scores
$$ \tau_\varphi = D_e . \quad (3.6.33) $$
Thus $\kappa_F = 1$ for this example and $q F_{\varphi_0}^*$ has a limiting χ²(q)-distribution under $H_0$.


Example 3.6.2 (Errors Are Normally Distributed). In this case the form of the error density is $f_0(x) = (\sqrt{2\pi})^{-1} \exp\{-x^2/2\}$; i.e., the standard normal density. This is of course a subcase of the last example. The optimal scores in this case are the normal scores $\varphi_0(u) = \Phi^{-1}(u)$, where Φ denotes the standard normal distribution function. Using these scores, the statistic $q F_{\varphi_0}^*$ has a limiting χ²(q)-distribution under $H_0$. Note here that the score function $\varphi_0(u) = \Phi^{-1}(u)$ is unbounded; hence the above theory must be modified to obtain this result. Under further regularity conditions on the design matrix, Jurečková (1969) obtained asymptotic linearity for the unbounded score function case; see, also, Koul (1992, p. 51). Using these results, the limiting distribution of $q F_{\varphi_0}^*$ can be obtained. The R estimates based on these scores, however, have an unbounded influence function; see Section 1.8.1. We next consider this analysis for Wilcoxon and sign scores.
If Wilcoxon scores are employed, then Exercise 3.15.21 shows that
$$ \tau_\varphi = \sqrt{\frac{\pi}{3}}\, \sigma \quad (3.6.34) \qquad \text{and} \qquad D_e = \sqrt{\frac{3}{\pi}}\, \sigma . \quad (3.6.35) $$
Thus, in this case, a consistent estimate of $\tau_\varphi/2$ is $n^{-1} D(\hat{e})\, (\pi/6)$.

For sign scores a similar computation yields
$$ \tau_S = \sqrt{\frac{\pi}{2}}\, \sigma \quad (3.6.36) \qquad \text{and} \qquad D_e = \sqrt{\frac{2}{\pi}}\, \sigma . \quad (3.6.37) $$
Hence $n^{-1} D(\hat{e})\, (\pi/4)$ is a consistent estimate of $\tau_S/2$.
Note that both examples are overly restrictive and, again, in all cases the resulting rank-based test of the general linear hypothesis $H_0$ based on $F_\varphi^*$ has an unbounded influence function, even in the case when the errors have a normal density and the analysis is based on Wilcoxon or sign scores. In general, then, we recommend using a bounded score function ϕ and the corresponding test statistic $F_\varphi$, (3.2.18), which is highly efficient and whose influence function, (3.6.19), is bounded in the Y-space.

3.7 Implementation of the R Analysis


Up to this point, we have presented the geometry and asymptotic theory of
the R analysis. In order to implement the analysis we need to discuss the esti-
mation of the scale parameters τϕ and τS . Estimation of τS is discussed around


expression (1.5.29). Here, though, the estimate is based on the residuals. We


next discuss estimation of the scale parameter τϕ . We also discuss algorithms
for obtaining the rank-based analysis.

3.7.1 Estimates of the Scale Parameter τϕ

The estimators of τϕ that we discuss are based on the R residuals formed after estimating β. In particular, the estimators do not depend on the estimate of the intercept parameter α. Suppose then we have fit Model (3.2.3) based on a score function ϕ which satisfies (S.1), (3.4.10); i.e., ϕ is bounded and is standardized so that $\int \varphi = 0$ and $\int \varphi^2 = 1$. Let $\hat{\beta}_\varphi$ denote the R estimate of β and let $\hat{e}_R = Y - X\hat{\beta}_\varphi$ denote the residuals based on the R fit.
There have been several estimates of τϕ proposed. McKean and Hettman-
sperger (1976) proposed a Lehmann-type estimator based on the standardized
length of a confidence interval for the intercept parameter α. This estimator
is a function of residuals and is consistent provided the density of the errors
is symmetric. It is similar to the estimators of τϕ discussed in Chapter 1.
For Wilcoxon scores, Aubuchon and Hettmansperger (1984, 1989) obtained
a density-type estimator for τϕ and showed it was consistent for symmetric
and asymmetric error distributions. Both of these estimators are available as
options in the command RREGR in Minitab. In this section we briefly sketch
the development of an estimator of τϕ for bounded score functions proposed
by Koul, Sievers, and McKean (1987). It is a density-type estimate based on
residuals which is consistent for symmetric and asymmetric error distributions
which satisfy (E.1), (3.4.1). It further satisfies a uniform consistency property
as stated in Theorem 3.7.1. Witt et al. (1995) derived the influence function
of this estimator, showing that it is robust.
A bootstrap percentile t-procedure based on this estimator did quite well in
terms of empirical validity and efficiency in the Monte Carlo study performed
by George, McKean, Schucany, and Sheather (1995).
Let the score function ϕ satisfy (S.1), (S.2), and (S.3) of Section 3.4. Since it is bounded, consider the standardization of it given by
$$ \varphi^*(u) = \frac{\varphi(u) - \varphi(0)}{\varphi(1) - \varphi(0)} . \quad (3.7.1) $$
Since $\varphi^*$ is a linear function of ϕ, the inference properties under either score function are the same. The score function $\varphi^*$ is useful since it is also a distribution function on (0, 1). Recall that $\tau_\varphi = 1/\gamma$ where
$$ \gamma = \int_0^1 \varphi(u) \varphi_f(u)\, du \qquad \text{and} \qquad \varphi_f(u) = -\frac{f'(F^{-1}(u))}{f(F^{-1}(u))} . $$



Note that $\gamma^* = \int \varphi^*(u) \varphi_f(u)\, du = (\varphi(1) - \varphi(0))^{-1} \gamma$. For the present it is more convenient to work with $\gamma^*$.

If we make the change of variable u = F(x) in $\gamma^*$, we can rewrite it as
$$ \gamma^* = -\int_{-\infty}^{\infty} \varphi^*(F(x)) f'(x)\, dx = \int_{-\infty}^{\infty} \varphi^{*\prime}(F(x)) f^2(x)\, dx = \int_{-\infty}^{\infty} f(x)\, d\varphi^*(F(x)) , $$
where the second equality is obtained upon integration by parts using $dv = f'(x)\, dx$ and $u = \varphi^*(F(x))$.

From the above assumptions on $\varphi^*$, $\varphi^*(F(x))$ is a distribution function. Suppose $Z_1$ and $Z_2$ are independent random variables with distribution functions F(x) and $\varphi^*(F(x))$, respectively. Let H(y) denote the distribution function of $|Z_1 - Z_2|$. It then follows that
$$ H(y) = P[|Z_1 - Z_2| \le y] = \begin{cases} \int_{-\infty}^{\infty} \left[ F(z_2 + y) - F(z_2 - y) \right] d\varphi^*(F(z_2)) & y > 0 \\ 0 & y \le 0 . \end{cases} \quad (3.7.2) $$
Let h(y) denote the density of H(y). Upon differentiating under the integral sign in expression (3.7.2) it easily follows that
$$ h(0) = 2\gamma^* . \quad (3.7.3) $$

So to estimate γ we need to estimate h(0).


Using the transformation $t = F(z_2)$, rewrite (3.7.2) as
$$ H(y) = \int_0^1 \left[ F(F^{-1}(t) + y) - F(F^{-1}(t) - y) \right] d\varphi^*(t) . \quad (3.7.4) $$
Next let $\hat{F}_n$ denote the empirical distribution function of the R residuals and let $\hat{F}_n^{-1}(t) = \inf\{x : \hat{F}_n(x) \ge t\}$ denote the usual inverse of $\hat{F}_n$. Let $\hat{H}_n$ denote the estimate of H which is obtained by replacing F by $\hat{F}_n$. Some simplification follows by noting that for $t \in ((j-1)/n, j/n]$, $\hat{F}_n^{-1}(t) = \hat{e}_{(j)}$. This leads to the


following form of $\hat{H}_n$:
$$ \hat{H}_n(y) = \int_0^1 \left[ \hat{F}_n(\hat{F}_n^{-1}(t) + y) - \hat{F}_n(\hat{F}_n^{-1}(t) - y) \right] d\varphi^*(t) = \sum_{j=1}^n \int_{((j-1)/n,\; j/n]} \left[ \hat{F}_n(\hat{F}_n^{-1}(t) + y) - \hat{F}_n(\hat{F}_n^{-1}(t) - y) \right] d\varphi^*(t) = \sum_{j=1}^n \left[ \hat{F}_n(\hat{e}_{(j)} + y) - \hat{F}_n(\hat{e}_{(j)} - y) \right] \left[ \varphi^*\left( \frac{j}{n} \right) - \varphi^*\left( \frac{j-1}{n} \right) \right] = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \left[ \varphi^*\left( \frac{j}{n} \right) - \varphi^*\left( \frac{j-1}{n} \right) \right] I\left( |\hat{e}_{(i)} - \hat{e}_{(j)}| \le y \right) . \quad (3.7.5) $$
An estimate of h(0), and hence of $\gamma^*$, (3.7.3), is an estimate of the form $\hat{H}_n(t_n)/(2 t_n)$ where $t_n$ is chosen close to 0. Since $\hat{H}_n$ is a distribution function, let $\hat{t}_{n,\delta}$ denote the δth quantile of $\hat{H}_n$; i.e., $\hat{t}_{n,\delta} = \hat{H}_n^{-1}(\delta)$. Then take $t_n = \hat{t}_{n,\delta}/\sqrt{n}$. Our estimate of γ is given by
$$ \hat{\gamma}_{n,\delta} = \frac{ (\varphi(1) - \varphi(0))\, \hat{H}_n\left( \hat{t}_{n,\delta}/\sqrt{n} \right) }{ 2 \hat{t}_{n,\delta}/\sqrt{n} } . \quad (3.7.6) $$

Its consistency is given by the following theorem:

Theorem 3.7.1. Under (E.1), (D.1), (S.1), and (S.2) of Section 3.4, and for any 0 < δ < 1,
$$ \sup_{\varphi \in \mathcal{C}} \left| \hat{\gamma}_{n,\delta} - \gamma \right| \overset{P}{\to} 0 , $$
where $\mathcal{C}$ denotes the class of all bounded, right continuous, nondecreasing score functions defined on the interval (0, 1).

The proof can be found in Koul et al. (1987). It follows immediately that $\hat{\tau}_\varphi = 1/\hat{\gamma}_{n,\delta}$ is a consistent estimate of τϕ. Note that the uniformity condition on the scores in the theorem is more than we need here. This result, though, proves useful in adaptive procedures which estimate the score function; see McKean and Sievers (1989).
Since the scores are differentiable, an approximation of $\hat{H}_n$ is obtained by an application of the mean value theorem to (3.7.5), which results in
$$ \hat{H}_n^*(y) = \frac{1}{c_n n} \sum_{i=1}^n \sum_{j=1}^n \varphi^{*\prime}\left( \frac{j}{n+1} \right) I\left( |\hat{e}_{(i)} - \hat{e}_{(j)}| \le y \right) , \quad (3.7.7) $$
where $c_n = \sum_{j=1}^n \varphi^{*\prime}(j/(n+1))$ is such that $\hat{H}_n^*$ is a distribution function.


The expression (3.7.5) for $\hat{H}_n$ contains a density estimate of f based on a rectangular kernel. Hence, in choosing δ we are really choosing a bandwidth for a density estimator. As most kernel-type density estimates are sensitive to the bandwidth, so is $\hat{\gamma}_{n,\delta}$ sensitive to δ. Several small sample studies have been done on this estimate of τϕ; see McKean and Sheather (1991) for a summary. In these studies the quality of an estimator of τϕ is based on how well it standardizes test statistics such as $F_\varphi$, in terms of how close the empirical α-levels of the test statistic are to nominal α-levels. In the same way, scale estimators used in confidence intervals were judged by how close empirical confidence levels were to nominal confidence levels. The major concern is thus the validity of the inference procedure. For moderate sample sizes where the ratio n/p exceeds 5, the value δ = .80 yielded valid estimates. For ratios less than 5, larger values of δ, around .90, gave valid estimates. In all cases it was found that the analysis benefited from the following simple degrees of freedom correction,
$$ \hat{\tau}_\varphi = \sqrt{ \frac{n}{n - p - 1} }\, \hat{\gamma}^{-1} . \quad (3.7.8) $$
Note that this is similar to the least squares correction on the maximum likelihood estimate (under normality) of the variance.
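To make the computation concrete, here is a minimal R sketch of the estimate (3.7.6) with the correction (3.7.8), specialized to Wilcoxon scores (for which $\varphi^*(u) = u$); the function name tau_ksm and its arguments are ours, and this is a direct O(n²) illustration rather than a production implementation.

```r
## A minimal sketch of the Koul-Sievers-McKean estimate (3.7.6) and the
## corrected tau-hat (3.7.8), assuming Wilcoxon scores phi(u) = sqrt(12)(u - 1/2);
## ehat holds the R residuals, p the number of regression coefficients.
tau_ksm <- function(ehat, p, delta = 0.80) {
  n  <- length(ehat)
  e  <- sort(ehat)                              # ordered residuals e_(j)
  wj <- (1:n)/n - (0:(n - 1))/n                 # phi*(j/n) - phi*((j-1)/n)
  d  <- abs(outer(e, e, "-"))                   # pairwise |e_(i) - e_(j)|
  w  <- matrix(wj, n, n, byrow = TRUE)/n        # weight on each pair in (3.7.5)
  o  <- order(d)                                # Hn = weighted cdf of distances
  td <- d[o][which(cumsum(w[o]) >= delta)[1]]   # t_{n,delta} = Hn^{-1}(delta)
  tn <- td/sqrt(n)
  gam <- sqrt(12) * sum(w[d <= tn]) / (2*tn)    # (phi(1)-phi(0)) Hn(tn)/(2 tn)
  sqrt(n/(n - p - 1)) / gam                     # tau-hat with correction (3.7.8)
}
```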

3.7.2 Algorithms for Computing the R Analysis

As we saw in Section 3.2, the dispersion function D(β) is a continuous convex function of β. The Rfit package of R functions of Kloke and McKean (2010) uses the R optimizing function optim to minimize D(β). For stability, a QR-decomposition (see (3.7.10)) of the design matrix is used to obtain an orthonormal basis matrix to compute the fitted values. Least squares is then used to solve for the estimates of β. The package uses standard linear model syntax.

The algorithm which we describe next is a Newton-type algorithm based on the asymptotic quadraticity of D(β). It is relatively easy to program and can be extended to other models such as the nonlinear model discussed in Section 3.14. It is used in the RREGR command in Minitab and in the Fortran program RGLM; see Crimin et al. (2008) for discussion. A finite algorithm to minimize D(β) is discussed by Osborne (1985).
The Newton-type algorithm needs an initial estimate, which we denote as $\hat{\beta}^{(0)}$. Let $\hat{e}^{(0)} = Y - X\hat{\beta}^{(0)}$ denote the initial residuals and let $\hat{\tau}_\varphi^{(0)}$ denote the initial estimate of τϕ based on these residuals. By (3.5.11) the approximating


quadratic to D based on $\hat{\beta}^{(0)}$ is given by
$$ Q(\beta) = \frac{1}{2 \hat{\tau}_\varphi^{(0)}} \left( \beta - \hat{\beta}^{(0)} \right)' X'X \left( \beta - \hat{\beta}^{(0)} \right) - \left( \beta - \hat{\beta}^{(0)} \right)' S\left( Y - X\hat{\beta}^{(0)} \right) + D\left( Y - X\hat{\beta}^{(0)} \right) . $$
By (3.5.13), the value of β which minimizes Q(β) is given by
$$ \hat{\beta}^{(1)} = \hat{\beta}^{(0)} + \hat{\tau}_\varphi^{(0)} (X'X)^{-1} S(Y - X\hat{\beta}^{(0)}) . \quad (3.7.9) $$

This is the first Newton step. In the same way that the first step was defined in terms of the initial estimate, a second step can be defined in terms of the first step. We call these iterated estimates $\hat{\beta}^{(k)}$, or k-step estimates. In practice, though, we would want to know whether $D(\hat{\beta}^{(1)})$ is less than $D(\hat{\beta}^{(0)})$ before proceeding. A more formal algorithm is presented below.

These k-step estimates satisfy some interesting properties themselves, which we briefly discuss; details can be found in McKean and Hettmansperger (1978). Provided the initial estimate is such that $\sqrt{n}(\hat{\beta}^{(0)} - \beta)$ is bounded in probability, then for any k ≥ 1 we have
$$ \sqrt{n}\left( \hat{\beta}^{(k)} - \hat{\beta}_\varphi \right) \overset{P}{\to} 0 , $$
where $\hat{\beta}_\varphi$ denotes a minimizing value of D. Hence the k-step estimates have the same asymptotic distribution as $\hat{\beta}_\varphi$. Furthermore, $\hat{\tau}_\varphi^{(k)}$ is a consistent estimate of τϕ if it is any of the scale estimates discussed in Section 3.7.1 based on k-step residuals. Let $F_\varphi^{(k)}$ denote the R test of a general linear hypothesis based on reduced and full model k-step estimates. Then it can be shown that $F_\varphi^{(k)}$ satisfies the same asymptotic properties as the test statistic $F_\varphi$ under the null hypothesis and contiguous alternatives. Also it is consistent for any alternative $H_A$.

Formal Algorithm

In order to outline the algorithm used by RGLM, first consider the QR-decomposition of X, which is given by
$$Q'X = R \; , \quad (3.7.10)$$
where $Q$ is an $n \times n$ orthogonal matrix and $R$ is an $n \times p$ upper triangular matrix of rank $p$. As discussed in Stewart (1973), $Q$ can be expressed as a product of $p$ Householder transformations. Writing $Q = [Q_1\; Q_2]$, where $Q_1$ is $n \times p$, it is easy to show that the columns of $Q_1$ form an orthonormal basis for the column space of X. In particular, the projection matrix onto the column space of X is given by $H = Q_1Q_1'$. The software package LINPACK (1979) is a collection of subroutines which efficiently computes QR-decompositions, and it further has routines which obtain projections of vectors.
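In R, the same quantities are available from the built-in qr routines, which here play the role of the LINPACK subroutines; a short sketch (the design matrix is simulated for illustration):

  set.seed(1)
  X   <- matrix(rnorm(60), nrow = 20, ncol = 3)     # an example design matrix
  qrd <- qr(X)
  Q1  <- qr.Q(qrd)                                  # n x p orthonormal basis of col(X)
  H   <- tcrossprod(Q1)                             # projection matrix Q1 Q1'
  v   <- rnorm(20)
  all.equal(as.vector(H %*% v), qr.fitted(qrd, v))  # the two projections agree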
Note that we can write the kth Newton step in terms of residuals as
$$\hat{e}^{(k)} = \hat{e}^{(k-1)} - \hat{\tau}_\varphi H a(R(\hat{e}^{(k-1)})) \; , \quad (3.7.11)$$
where $a(R(\hat{e}^{(k-1)}))$ denotes the vector whose $i$th component is $a(R(\hat{e}_i^{(k-1)}))$. Let $D^{(k)}$ denote the dispersion function evaluated at $\hat{e}^{(k)}$. The Newton step is a step from $\hat{e}^{(k-1)}$ along the direction $\hat{\tau}_\varphi H a(R(\hat{e}^{(k-1)}))$. If $D^{(k)} < D^{(k-1)}$ the step has been successful; otherwise, a linear search can be made along the direction to find a value which minimizes $D$. This would then become the kth step residual. Such a search can be performed using methods such as false position, as discussed below in Section 3.7.3. Stopping rules can be based on the relative drop in dispersion; i.e., stop when
$$\frac{D^{(k-1)} - D^{(k)}}{D^{(k-1)}} < \epsilon_D \; , \quad (3.7.12)$$
where $\epsilon_D$ is a specified tolerance. A similar stopping rule can be based on the relative size of the step. Upon stopping at step k, obtain the fitted value $\hat{Y} = Y - \hat{e}^{(k)}$ and then the estimate of $\beta$ by solving $X\beta = \hat{Y}$.
A formal algorithm is as follows. Let $\epsilon_D$ and $\epsilon_s$ be the given stopping tolerances.

1. Set $k = 1$. Obtain initial residuals $\hat{e}^{(k-1)}$ and, based upon these, get an initial estimate $\hat{\tau}_\varphi^{(0)}$ of $\tau_\varphi$.

2. Obtain $\hat{e}^{(k)}$ as in expression (3.7.11). If the step is successful proceed to the next step; otherwise search along the Newton direction for a value which minimizes $D$, then go to the next step. An algorithm for this search is discussed in Section 3.7.3.

3. If the relative drop in dispersion or the length of the step is within its respective tolerance $\epsilon_D$ or $\epsilon_s$, stop; otherwise set $\hat{e}^{(k-1)} = \hat{e}^{(k)}$ and go to step (2).

4. Obtain the estimate of $\beta$ and the final estimate of $\tau_\varphi$; an illustrative sketch of the whole iteration follows this list.
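The following small implementation of steps (1)-(4) is a sketch under simplifying assumptions of ours, not the text's: Wilcoxon scores, the MAD of the residuals as a crude stand-in for the $\hat{\tau}_\varphi$ estimators of Section 3.7.1, and step-halving in place of the regula falsi search of Section 3.7.3.

  ## An illustrative sketch of the Newton-type iteration (3.7.11) with
  ## Wilcoxon scores.
  rfit_newton <- function(X, y, eps_D = 1e-6, max_k = 50) {
    n  <- nrow(X)
    Xc <- scale(X, center = TRUE, scale = FALSE)            # centered design
    Q1 <- qr.Q(qr(Xc))                                      # orthonormal basis of col(X)
    a  <- function(e) sqrt(12) * (rank(e) / (n + 1) - 0.5)  # Wilcoxon scores
    D  <- function(e) sum(a(e) * e)                         # dispersion function
    e  <- y - median(y)                                     # residuals at beta^(0) = 0
    tau <- mad(e)                                           # crude initial scale (assumption)
    for (k in 1:max_k) {
      dir   <- as.vector(tau * Q1 %*% crossprod(Q1, a(e)))  # tau * H a(R(e))
      e_new <- e - dir
      t <- 1
      while (D(e_new) >= D(e) && t > 1e-4) {                # step-halving search
        t <- t / 2
        e_new <- e - t * dir
      }
      done <- (D(e) - D(e_new)) / D(e) < eps_D              # relative drop (3.7.12)
      e <- e_new
      if (done) break
    }
    yhat <- y - e                                           # fitted values
    beta <- qr.coef(qr(cbind(1, X)), yhat)                  # solve X beta = yhat by LS
    list(coef = beta, residuals = e, dispersion = D(e))
  }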

The QR decomposition can readily be used to form a reduced model design matrix for testing the general linear hypotheses (3.2.5), $M\beta = 0$, where $M$ is a specified $q \times p$ matrix. Recall that we called the column space of X, $\Omega_F$, and the space $\Omega_F$ constrained by $M\beta = 0$ the reduced model space, $\Omega_R$. The key result lies in the following theorem:


Theorem 3.7.2. Denote the row space of $M$ by $R(M')$. Let $Q_M$ be a $p \times (p-q)$ matrix whose columns consist of an orthonormal basis for the space $(R(M'))^\perp$. If $U = XQ_M$, then $R(U) = \Omega_R$.

Proof: If $u \in \Omega_R$ then $u = Xb$ for some $b$ where $Mb = 0$. Hence $b \in (R(M'))^\perp$; i.e., $b = Q_M c$ for some $c$. Conversely, if $u \in R(U)$ then for some $c \in R^{p-q}$, $u = X(Q_M c)$. Hence $u \in R(X)$ and $M(Q_M c) = (MQ_M)c = 0$.

Thus, using the LINPACK subroutines mentioned above, it is easy to write an algorithm which obtains the reduced model design matrix $U$ defined in the theorem. The package RGLM uses such an algorithm to test linear hypotheses; see Kapenga, McKean, and Vidmar (1988).
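In R the construction in Theorem 3.7.2 takes only a few lines; a sketch, assuming $M$ is $q \times p$ of full row rank $q$:

  ## A sketch of Theorem 3.7.2: the reduced model design matrix U = X Q_M,
  ## where the columns of Q_M are an orthonormal basis for (R(M'))^perp.
  reduced_design <- function(X, M) {
    q <- nrow(M)
    p <- ncol(M)
    Qfull <- qr.Q(qr(t(M)), complete = TRUE)   # p x p orthogonal matrix
    QM <- Qfull[, (q + 1):p, drop = FALSE]     # last p - q columns span (R(M'))^perp
    ## Check: M %*% QM is numerically zero, so M beta = 0 on R(U).
    X %*% QM                                   # column space equals Omega_R
  }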

3.7.3 An Algorithm for a Linear Search

The computation of many of the quantities needed in a rank-based analysis involves simple linear searches. Examples include the estimate of the location parameter for a signed-rank procedure, the estimate of the shift in location in the two-sample location problem, the estimate of $\tau_\varphi$ discussed in Section 3.7, and the search along the Newton direction for a minimizing value in Step (2) of the algorithm for the R fit in a regression problem discussed in the last section. The following is a generic setup for these problems: solve the equation
$$S(b) = K \; , \quad (3.7.13)$$
where $S(b)$ is a decreasing step function and $K$ is a specified constant. Without loss of generality we take $K = 0$ for the remainder of the discussion. By the monotonicity, a solution always exists, although it may be an interval of solutions. In almost all cases, $S(b)$ is asymptotically linear; so, the search problem becomes relatively more efficient as the sample size increases.

There are certainly many search algorithms that can be used for solving (3.7.13). One that we have successfully employed is the Illinois version of regula falsi; see Dowell and Jarratt (1971). McKean and Ryan (1977) employed this routine to obtain the estimate and confidence interval for the two-sample Wilcoxon location problem. We write the generic asymptotic linearity result as
$$S(b) \doteq S(b^{(0)}) - \zeta(b - b^{(0)}) \; . \quad (3.7.14)$$
The parameter $\zeta$ is often of the form $\delta^{-1}C$, where $C$ is some constant. Since $\delta$ is a scale parameter, initial estimates of it include such estimates as the MAD, (3.9.27), or the sample standard deviation. We have found MAD to usually be preferable. An outline of an algorithm for the search is:

1. Bracket Step. Beginning with an initial estimate $b^{(0)}$, step along the b-axis to $b^{(1)}$ so that the interval $(b^{(0)}, b^{(1)})$, or vice versa, brackets the solution.


   Asymptotic linearity can be used here to make these steps; for instance, if $\zeta^{(0)}$ is an estimate of $\zeta$ based on $b^{(0)}$, then the first step is
$$b^{(1)} = b^{(0)} + S(b^{(0)})/\zeta^{(0)} \; .$$

2. Regula Falsi. Assume the interval $(b^{(0)}, b^{(1)})$ brackets the solution and that $b^{(1)}$ is the more recent value of $b^{(0)}, b^{(1)}$. If $|b^{(1)} - b^{(0)}| < \epsilon$ then stop. Else, the next step is where the secant line determined by $b^{(0)}, b^{(1)}$ intersects the b-axis; i.e.,
$$b^{(2)} = b^{(0)} - S(b^{(0)})\,\frac{b^{(1)} - b^{(0)}}{S(b^{(1)}) - S(b^{(0)})} \; . \quad (3.7.15)$$

   (a) If $(b^{(0)}, b^{(2)})$ brackets the solution, then replace $b^{(1)}$ by $b^{(2)}$ and go to (2), but use $S(b^{(0)})/2$ in place of $S(b^{(0)})$ in the determination of the secant line (this is the Illinois modification).

   (b) If $(b^{(2)}, b^{(1)})$ brackets the solution, then replace $b^{(0)}$ by $b^{(2)}$ and go to (2).

The above algorithm is easy to implement. Such searches are used in the
package RGLM; see Kapenga, McKean, and Vidmar (1988).
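A sketch of the bracket and Illinois steps in R follows, under the assumption that $S$ is decreasing with a sign change in the search region; the example solves the one-sample sign-test equation, whose root is the sample median.

  ## A sketch of the Illinois version of regula falsi for S(b) = 0, where S
  ## is a decreasing step function; zeta0 is a slope estimate for the
  ## bracketing step based on the asymptotic linearity (3.7.14).
  illinois <- function(S, b0, zeta0, eps = 1e-8, max_iter = 100) {
    b1 <- b0 + S(b0) / zeta0            # first step via (3.7.14)
    while (S(b0) * S(b1) > 0) {         # expand until (b0, b1) brackets the root
      b0 <- b1
      b1 <- b1 + S(b1) / zeta0
    }
    s0 <- S(b0); s1 <- S(b1)
    for (i in 1:max_iter) {
      if (abs(b1 - b0) < eps) break
      b2 <- b1 - s1 * (b1 - b0) / (s1 - s0)   # secant step (3.7.15)
      s2 <- S(b2)
      if (s2 * s1 < 0) {                # b2 and b1 bracket: ordinary update
        b0 <- b1; s0 <- s1
      } else {                          # retained endpoint keeps the root:
        s0 <- s0 / 2                    # halve its ordinate (Illinois step)
      }
      b1 <- b2; s1 <- s2
    }
    b1
  }

  set.seed(2)
  x <- rt(25, df = 2)
  illinois(function(b) sum(sign(x - b)), b0 = 0, zeta0 = 25 / mad(x))  # near median(x)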

3.8 L1 Analysis

This section is devoted to L1 procedures. These are widely used procedures; see, for example, Bloomfield and Steiger (1983). We first show that they are equivalent to R estimates based on the sign score function under Model (3.2.4). Hence the asymptotic theory for L1 estimation and subsequent analysis is contained in Section 3.5. The asymptotic theory for L1 estimation can also be found in Bassett and Koenker (1978) and Rao (1988) from an L1 point of view.

Consider the sign scores, i.e., the scores generated by $\varphi(u) = \operatorname{sgn}(u - 1/2)$. In this section we denote the associated pseudo-norm by
$$\|v\|_S = \sum_{i=1}^{n}\operatorname{sgn}\left(R(v_i) - \frac{n+1}{2}\right)v_i \; , \quad v \in R^n \; ;$$
see, also, Section 2.6.1. This score function is optimal if the errors follow a double exponential (Laplace) distribution; see Exercise 2.13.19 of Chapter 2. We summarize the analysis based on the sign scores, but first we show that the R estimates based on sign scores are indeed also L1 estimates, provided that the intercept is estimated by the median of the residuals.


Consider the intercept model, (3.2.4), as given in Section 3.2, and let $\Omega$ denote the column space of X and $\Omega_1$ denote the column space of the augmented matrix $X_1 = [\mathbf{1}\; X]$.

First consider the R estimate of $\eta \in \Omega$ based on the sign scores pseudo-norm. This is a vector $\hat{Y}_S \in \Omega$ such that
$$\hat{Y}_S = \operatorname{Argmin}_{\eta \in \Omega}\|Y - \eta\|_S \; .$$

Next consider the L1 estimate for the space $\Omega_1$; i.e., the L1 estimate of $\alpha\mathbf{1} + \eta$. This is a vector $\hat{Y}_{L_1} \in \Omega_1$ such that
$$\hat{Y}_{L_1} = \operatorname{Argmin}_{\theta \in \Omega_1}\|Y - \theta\|_{L_1} \; ,$$
where $\|v\|_{L_1} = \sum|v_i|$ is the L1 norm.

Theorem 3.8.1. R estimates based on sign scores are equivalent to L1 estimates; that is,
$$\hat{Y}_{L_1} = \hat{Y}_S + \operatorname{med}\{Y - \hat{Y}_S\}\mathbf{1} \; . \quad (3.8.1)$$

Proof: Any vector $v \in \Omega_1$ can be written uniquely as $v = a\mathbf{1} + v_c$, where $a$ is a scalar and $v_c \in \Omega$. Since the sample median minimizes the L1 distance between a vector and the space spanned by $\mathbf{1}$, we have
$$\|Y - v\|_{L_1} = \|Y - a\mathbf{1} - v_c\|_{L_1} \geq \|Y - \operatorname{med}\{Y - v_c\}\mathbf{1} - v_c\|_{L_1} \; .$$
But it is easy to show that $\operatorname{sgn}(Y_i - \operatorname{med}\{Y - v_c\} - v_{ci}) = \operatorname{sgn}(R(Y_i - v_{ci}) - (n+1)/2)$ for $i = 1, \ldots, n$. Putting these two results together, along with the fact that the sign scores sum to 0, we have
$$\|Y - v\|_{L_1} = \|Y - a\mathbf{1} - v_c\|_{L_1} \geq \|Y - \operatorname{med}\{Y - v_c\}\mathbf{1} - v_c\|_{L_1} = \|Y - v_c\|_S \; , \quad (3.8.2)$$
for any vector $v \in \Omega_1$. Once more using the sign argument above, we can show that
$$\|Y - \operatorname{med}\{Y - \hat{Y}_S\}\mathbf{1} - \hat{Y}_S\|_{L_1} = \|Y - \hat{Y}_S\|_S \; . \quad (3.8.3)$$
Using (3.8.2) and (3.8.3) together establishes the result.
Let $\hat{b}_S' = (\hat{\alpha}_S, \hat{\beta}_S')$ denote the R estimate of the vector of regression coefficients $b = (\beta_0, \beta')'$. It follows that these R estimates are the maximum likelihood estimates if the errors $e_i$ are double exponentially distributed; see Exercise 3.15.13.

From the discussions in Sections 3.5 and 3.5.2, $\hat{b}_S$ has an approximate $N(b, \tau_S^2(X_1'X_1)^{-1})$ distribution, where $\tau_S = (2f(0))^{-1}$. From this, the efficiency properties of the L1 procedures discussed in the first two chapters carry over to the L1 linear model procedures. In particular, its efficiency relative to LS at the normal distribution is .63, and it can be much more efficient than LS for heavier-tailed error distributions.
As Exercise 3.15.22 shows, the drop in dispersion test based on sign scores, $F_S$, is, except for the scale parameter, the likelihood ratio test of the general linear hypothesis (3.2.5), provided the errors have a double exponential distribution. For other error distributions, the same comments about the efficiency of the L1 estimates can be made about the test $F_S$.

In terms of implementation, Schrader and McKean (1987) found it more difficult to standardize the L1 statistics than those of other R procedures, such as the Wilcoxon. Their most successful standardization of $F_S$ was based on the following bootstrap procedure:

1. Compute the full model L1 estimates $\hat{\beta}_S$ and $\hat{\alpha}_S$, the full model residuals $\hat{e}_1, \ldots, \hat{e}_n$, and the test statistic $F_S$.

2. Select $\tilde{e}_1, \ldots, \tilde{e}_{\tilde{n}}$, the $\tilde{n} = n - (p+1)$ nonzero residuals.

3. Draw a bootstrap random sample $e_1^*, \ldots, e_{\tilde{n}}^*$ with replacement from $\tilde{e}_1, \ldots, \tilde{e}_{\tilde{n}}$. Calculate $\hat{\beta}_S^*$ and $F_S^*$, the L1 estimate and test statistic, from the model $y_i^* = \hat{\alpha}_S + x_i'\hat{\beta}_S + e_i^*$.

4. Independently repeat step 3 a large number $B$ of times. The bootstrap p-value is $p^* = \#\{F_S^* \geq F_S\}/B$.

5. Reject $H_0$ at level $\alpha$ if $p^* \leq \alpha$.

Notice that by using full model residuals, the algorithm estimates the null distribution of $F_S$. The algorithm depends on the number $B$ of bootstrap samples taken. We suggest at least 2000; a rough sketch of the procedure follows.
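The sketch below uses quantreg::rq for the L1 fits (an assumption of convenience; any L1 routine serves) and, for simplicity, the raw drop in L1 dispersion between the reduced and full models stands in for the standardized $F_S$ of the text; we also resample $n$ errors rather than $\tilde{n}$ so that the full design stays fixed.

  ## A rough sketch of the Schrader-McKean bootstrap standardization.
  library(quantreg)

  l1_boot_pvalue <- function(y, X_full, X_red, B = 2000) {
    stat <- function(yy) {
      sum(abs(resid(rq(yy ~ X_red, tau = 0.5)))) -
        sum(abs(resid(rq(yy ~ X_full, tau = 0.5))))   # drop in L1 dispersion
    }
    fit_full <- rq(y ~ X_full, tau = 0.5)
    e <- resid(fit_full)
    e <- e[abs(e) > 1e-10]                   # the nonzero residuals of step 2
    FS <- stat(y)
    FS_star <- replicate(B, {
      ystar <- fitted(fit_full) + sample(e, length(y), replace = TRUE)
      stat(ystar)                            # steps 3 and 4
    })
    mean(FS_star >= FS)                      # bootstrap p-value p*
  }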

3.9 Diagnostics

An important part of the analysis of a linear model is the examination of the resulting fit. Tools for doing this include residual plots and diagnostic techniques. Over the last fifteen years or so, these tools have been developed for fits based on least squares; see, for example, Cook and Weisberg (1982) and Belsley, Kuh, and Welsch (1980). Least squares residual plots can be used to detect such things as curvature not accounted for by the fitted model; see Cook and Weisberg (1989) for a recent discussion. Further diagnostic techniques can be used to detect outliers, which are points that differ greatly from the pattern set by the bulk of the data, and to measure the influence of individual cases on the least squares fit. See McKean and Sheather (2009) for a recent review of diagnostic procedures.


In this section we explore the properties of the residuals from the rank-based fits, showing how they can be used to determine model misspecification. We present diagnostic techniques for rank-based residuals that detect outlying and influential cases. Together these tools offer the user a residual analysis for the rank-based fit of a linear model similar to the residual analysis based on least squares estimates.

In this section we consider the same linear model, (3.2.3), as in Section 3.2. For a given score function $\varphi$, let $\hat{\beta}_\varphi$ and $\hat{e}_R$ denote the R estimate of $\beta$ and the residuals from the R fit of the model based on these scores. Much of the discussion is taken from the articles by McKean, Sheather, and Hettmansperger (1990, 1991, 1993). Also, see Dixon and McKean (1996) for a robust rank-based approach to modeling heteroscedasticity.

3.9.1 Properties of R Residuals and Model Misspecification

As we discussed above, a primary use of least squares residuals is in the detection of model misspecification. In order to show that the R residuals can also be used to detect model misspecification, consider the sequence of models
$$Y = \mathbf{1}\alpha + X\beta + Z\gamma + e \; , \quad (3.9.1)$$
where $Z$ is an $n \times q$ centered matrix of constants and $\gamma = \theta/\sqrt{n}$, for $\theta \neq 0$. Note that this sequence of models is contiguous to Model (3.2.3). Suppose we fit Model (3.2.3), i.e., $Y = \mathbf{1}\alpha + X\beta + e$, when Model (3.9.1) is the true model. Hence the model has been misspecified. As a first step in examining the residuals in this situation, we consider the limiting distribution of the corresponding R estimate.

Theorem 3.9.1. Assume Model (3.9.1) is the true model. Let $\hat{\beta}_\varphi$ be the R estimate for Model (3.2.3). Suppose that conditions (E.1) and (S.1) of Section 3.4 are true and that conditions (D.1) and (D.2) are true for the augmented matrix $[X\; Z]$. Then
$$\hat{\beta}_\varphi \text{ has an approximate } N_p\!\left(\beta + (X'X)^{-1}X'Z\theta/\sqrt{n},\; \tau_\varphi^2(X'X)^{-1}\right) \text{ distribution.} \quad (3.9.2)$$

Proof: Without loss of generality assume that $\beta = 0$. Note that the situation here is the same as the situation in Theorem 3.6.3, except that now the null hypothesis corresponds to $\gamma = 0$ and $\hat{\beta}_\varphi$ is the reduced model estimate. Thus we seek the asymptotic distribution of the reduced model estimate. As in Section 3.5.1 it is easier to consider the corresponding pseudo-estimate $\tilde{\beta}$, which is the reduced model estimate which minimizes the quadratic $Q(Y - X\beta)$, (3.5.11). Under the null hypothesis, $\gamma = 0$, $\sqrt{n}(\hat{\beta}_\varphi - \tilde{\beta}) \stackrel{P}{\rightarrow} 0$; hence by contiguity $\sqrt{n}(\hat{\beta}_\varphi - \tilde{\beta}) \stackrel{P}{\rightarrow} 0$ under the sequence of Models (3.9.1). Thus $\hat{\beta}_\varphi$ and $\tilde{\beta}$ have the same distributions under (3.9.1); hence, it suffices to find the distribution of $\tilde{\beta}$. But by (3.5.13),
$$\tilde{\beta} = \tau_\varphi(X'X)^{-1}S(Y) \; , \quad (3.9.3)$$
where $S(Y)$ is the first $p$ components of the vector $T(Y) = [X\; Z]'a(R(Y))$. By (3.6.26) of Theorem 3.6.3,
$$n^{-1/2}T(Y) \stackrel{D}{\rightarrow} N_{p+q}\!\left(\tau_\varphi^{-1}\Sigma^*(0', \theta')',\; \Sigma^*\right) \; , \quad (3.9.4)$$
where $\Sigma^*$ is the following limit:
$$\lim_{n \to \infty}\frac{1}{n}\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z \end{bmatrix} = \Sigma^* \; .$$
Because $\tilde{\beta}$ is defined by (3.9.3), the result is an algebraic computation applied to (3.9.4).
With a few more steps we can write a first order expression for $\hat{\beta}_\varphi$, which is given in the following corollary:

Corollary 3.9.1. Under the assumptions of the last theorem,
$$\hat{\beta}_\varphi = \beta + \tau_\varphi(X'X)^{-1}X'\varphi(F(e)) + (X'X)^{-1}X'Z\theta/\sqrt{n} + o_p(n^{-1/2}) \; . \quad (3.9.5)$$

Proof: Without loss of generality assume that the regression coefficients are 0. By (A.3.10) and expression (3.6.25) of Theorem 3.6.3 we can write
$$\frac{1}{\sqrt{n}}T(Y) = \frac{1}{\sqrt{n}}\begin{bmatrix} X'\varphi(F(e)) \\ Z'\varphi(F(e)) \end{bmatrix} + \tau_\varphi^{-1}\frac{1}{n}\begin{bmatrix} X'Z\theta \\ Z'Z\theta \end{bmatrix} + o_p(1) \; ;$$
hence, the first $p$ components of $\frac{1}{\sqrt{n}}T(Y)$ satisfy
$$\frac{1}{\sqrt{n}}S(Y) = \frac{1}{\sqrt{n}}X'\varphi(F(e)) + \tau_\varphi^{-1}\frac{1}{n}X'Z\theta + o_p(1) \; .$$
By expression (3.9.3) and the fact that $\sqrt{n}(\hat{\beta}_\varphi - \tilde{\beta}) \stackrel{P}{\rightarrow} 0$, the result follows.

From this corollary we obtain the following first order expressions for the R residuals and R fitted values:
$$\hat{Y}_R \doteq \alpha\mathbf{1} + X\beta + \tau_\varphi H\varphi(F(e)) + HZ\gamma \quad (3.9.6)$$
$$\hat{e}_R \doteq e - \tau_\varphi H\varphi(F(e)) + (I - H)Z\gamma \; , \quad (3.9.7)$$


where $H = X(X'X)^{-1}X'$. In Exercise 3.15.23 the reader is asked to show that the least squares fitted values and residuals satisfy
$$\hat{Y}_{LS} = \alpha\mathbf{1} + X\beta + He + HZ\gamma \quad (3.9.8)$$
$$\hat{e}_{LS} = e - He + (I - H)Z\gamma \; . \quad (3.9.9)$$
In terms of model misspecification the coefficients of interest are the regres-
sion coefficients. Hence, at this time we need not consider the effect of the
estimation of the intercept. This avoids the problem of which estimate of the
intercept to use. In practice, though, for both R and LS fits, the intercept
is also fitted and, subsequently, its effect is removed from the residuals. We
also include the effect of estimation of the intercept in our discussion of the
standardization of residuals and fitted values in Sections 3.9.2 and 3.9.3, re-
spectively.
Suppose that the linear model (3.2.3) is correct. Based on its first order expression when $\gamma = 0$, $\hat{e}_R$ is a function of the random errors similar to $\hat{e}_{LS}$; hence, it follows that a plot of $\hat{e}_R$ versus $\hat{Y}_R$ should generally be a random scatter, similar to the least squares residual plot.

In the case of model misspecification, note that the R residuals and least squares residuals have the same asymptotic bias, namely $(I - H)Z\gamma$. Hence R residual plots, similar to those of least squares, are useful in identifying model misspecification.

For least squares residual plots, since least squares residuals and the fitted values are uncorrelated, any pattern in this plot is due to model misspecification and not the fitting procedure used. The converse, however, is not true. As the example on the potency of drug compounds below illustrates, the least squares residual plot can exhibit a random scatter for a poorly fitted model. This orthogonality in the LS residual plot does, however, make it easier to pick out patterns in the plot. Of course the R residuals are not orthogonal to the R fitted values, but they are usually close to orthogonality; see Naranjo et al. (1994). We introduce the following parameter $\nu$ to measure the extent of departure from orthogonality.

Denote general fitted values and residuals by $\hat{Y}$ and $\hat{e}$, respectively. The expected departure from orthogonality is the parameter $\nu$ defined by
$$\nu = E\left[\hat{e}'\hat{Y}\right] \; . \quad (3.9.10)$$

For least squares, $\nu_{LS}$ is of course 0. For R fits, we have the following first order expression for it:

Theorem 3.9.2. Under the assumptions of Theorem 3.9.1 and either Model (3.2.3) or Model (3.9.1),
$$\nu_R \doteq p\tau_\varphi\left(E[\varphi(F(e_1))e_1] - \tau_\varphi\right) \; . \quad (3.9.11)$$


Proof: Suppose Model (3.9.1) holds. Using the above first order expressions we have
$$\nu_R \doteq E\left[(e + \alpha\mathbf{1} - \tau_\varphi H\varphi(F(e)) + (I - H)Z\gamma)'(X\beta + \tau_\varphi H\varphi(F(e)) + HZ\gamma)\right] \; .$$
Using $E[\varphi(F(e))] = 0$, $E[e] = E(e_1)\mathbf{1}$, and the fact that X is centered, this expression simplifies to
$$\nu_R \doteq \tau_\varphi E\left[\operatorname{tr}H\varphi(F(e))e'\right] - \tau_\varphi^2 E\left[\operatorname{tr}H\varphi(F(e))\varphi(F(e))'\right] \; .$$
Since the components of $e$ are independent, the result follows. The result is invariant to either of the models.

Although in general $\nu_R \neq 0$ for R estimates, if, as the next corollary shows, optimal scores (see Examples 3.6.1 and 3.6.2) are used, then the expected departure from orthogonality is 0.

Corollary 3.9.2. Under the hypothesis of the last theorem, if optimal R scores are used then $\nu_R = 0$.

Proof: Let $\varphi(u) = -c\,\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}$, where $c$ is chosen so that $\int \varphi^2(u)\,du = 1$. Then
$$\tau_\varphi = \left[\int \varphi(u)\left(-\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right)du\right]^{-1} = c \; .$$
Some simplification and an integration by parts shows
$$\int \varphi(F(e))e\,dF(e) = c\int f(e)\,de = c \; .$$
Hence $E[\varphi(F(e_1))e_1] = \tau_\varphi$, and the result follows from (3.9.11).

Naranjo et al. (1994) conducted a simulation study to investigate the above properties of rank-based and LS residuals over several small-sample situations of null (the true model was fitted) models and misspecified models. Error distributions included the normal distribution and a contaminated normal distribution. Wilcoxon scores were used. The first part of the study concerned the amount of association between residuals and fitted values, where the association was measured by several correlation coefficients, including Pearson's r and Kendall's $\tau$. Because of orthogonality between the LS residuals and fitted values, Pearson's r is always 0 for LS. On the other measures of association, however, the results for the Wilcoxon analysis and LS were about the same. In general, there was little association. The second part investigated measures of randomness in a residual plot, including a runs test and a quadrant count test (the quadrants were determined by the medians of the residuals and fitted values). The results were similar for the LS and Wilcoxon fits. Both showed validity over the null models and exhibited similar power over the misspecified models. In a power study over a quadratic misspecified model, the Wilcoxon analysis exhibited more power for long-tailed error distributions. In summary, the simulation study provided empirical evidence that residual analyses based on Wilcoxon fits are similar to LS based residual analyses.


Table 3.9.1: Cloud Data, CP = Cloud Point

%I-8    0     1     2     3     4     5     6     7     8     0
CP     22.1  24.5  26.0  26.8  28.2  28.9  30.0  30.4  31.4  21.9
%I-8    2     4     6     8    10     0     3     6     9
CP     26.1  28.5  30.3  31.5  33.1  22.8  27.3  29.8  31.8
There are other useful residual plots. Two that we briefly discuss are q-q plots and added variable plots. As with standard residual plots, the internal R Studentized residuals (see Section 3.9.2) can be used in place of the residuals. Since the R estimates of $\beta$ are consistent, the distribution of the residuals should resemble the distribution of the errors. This leads to consideration of another useful residual plot, the q-q plot. In this plot, the quantiles of the target distribution form the horizontal coordinates, while the sample quantiles (ordered residuals) form the vertical coordinates. Linearity of this plot indicates the appropriateness of the target distribution as the true model distribution; see Exercise 3.15.24. McKean and Sievers (1989) discuss how to use these plots adaptively to select appropriate rank scores. In the next example, we use them to examine how well the R fit fits the bulk of the data and to highlight outliers.

For the added variable plot, let $\hat{e}_R$ denote the residuals from the R fit of the model $Y = \alpha\mathbf{1} + X\beta + e$. In this case, $Z$ is a known vector and we wish to decide whether or not to add it to the regression model. For the added variable plot, we regress $Z$ on $X$. We denote the residuals from this fit as $\hat{e}(Z \mid X) = (I - H)Z$. The added variable plot consists of the scatter plot of the residuals $\hat{e}_R$ versus $\hat{e}(Z \mid X)$. Under model misspecification, $\gamma \neq 0$, it follows from expression (3.9.7) that the residuals $\hat{e}_R$ are also a function of $(I - H)Z$. Hence, the plot can be quite powerful in determining the potential of Z as a predictor.
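A short sketch of an added variable plot, with the Wilcoxon fit supplied by Rfit and simulated data (ours) in which z truly belongs in the model:

  ## R residuals from the fit of y on X plotted against ehat(Z | X) = (I - H) z;
  ## the LS projection of z on X is computed with lm.
  library(Rfit)
  set.seed(7)
  n <- 50
  X <- matrix(rnorm(2 * n), n, 2)
  z <- X[, 1] + rnorm(n)
  y <- as.vector(2 + X %*% c(1, -1) + 0.5 * z + rt(n, df = 3))  # z belongs
  fit <- rfit(y ~ X)                    # working model omits z
  ez  <- residuals(lm(z ~ X))           # ehat(Z | X) = (I - H) z
  plot(ez, residuals(fit), xlab = "e(Z | X)", ylab = "R residuals")
  abline(lm(residuals(fit) ~ ez), lty = 2)   # a visible trend flags z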
Example 3.9.1 (Cloud Data). The data for this example can be found in Table 3.9.1. It is taken from an exercise on p. 162 of Draper and Smith (1966). The dependent variable is the cloud point of a liquid, a measure of degree of crystallization in a stock. The independent variable is the percentage of I-8 in the base stock. The subsequent R fits for this data set were all based on Wilcoxon scores with the intercept estimate $\hat{\alpha}_S$, the median of the residuals.

Panel A of Figure 3.9.1 displays the residual plot (R residuals versus R fitted values) of the R fit of the simple linear model. The curvature in the plot indicates that this model is a poor choice and that a higher degree polynomial model would be more appropriate. Panel B of Figure 3.9.1 displays the residual plot from the R fit of a quadratic model. Some curvature is still present in the plot. A cubic polynomial was fitted next. Its R residual plot, found in Panel C of Figure 3.9.1, is much more of a random scatter than the first two plots. On the basis of residual plots, the cubic polynomial is an adequate model. Least squares residual plots would also lead to a third degree polynomial.

Figure 3.9.1: Panels A through C are the residual plots of the Wilcoxon fits of the linear, quadratic, and cubic models, respectively, for the cloud data. Panel D is the q-q plot based on the Wilcoxon fit of the cubic model. (Axes: Wilcoxon residuals versus the Wilcoxon linear, quadratic, and cubic fits in Panels A-C; normal quantiles in Panel D.)

In the R residual plot of the cubic model, several points appear to be outlying from the bulk of the data. These points are also apparent in Panel D of Figure 3.9.1, which displays the q-q plot of the R residuals. Based on these plots, the R regression appears to have fit the bulk of the data well. The q-q plot suggests that the underlying error distribution has slightly heavier tails than the normal distribution. A scale would be helpful in interpreting these residual plots, as discussed in the next section. Table 3.9.2 displays the estimated coefficients along with their standard errors. The Wilcoxon and least squares fits are practically the same.

Table 3.9.2: Wilcoxon (W) and LS Estimates of the Regression Coefficients for the Cloud Data. (Standard errors are in parentheses.)

Method   Intercept     Linear       Quadratic    Cubic        Scale
W        22.35 (.18)   2.24 (.17)   -.23 (.04)   .01 (.003)   $\hat{\tau}_\varphi = .307$
LS       22.31 (.15)   2.22 (.15)   -.22 (.04)   .01 (.003)   $\hat{\sigma} = .281$

Example 3.9.2 (Potency Data, Example 3.3.3 continued). This example was discussed in Section 3.3. Recall that the data were the result of an experiment concerning the potency of drug compounds manufactured under different levels of four factors and one covariate. Here we want to discuss a residual analysis of the rank-based fits of the two models that were fit in Example 3.3.3.

First consider Model (3.3.1) without the quadratic terms, i.e., without the parameters $\beta_{11}$, $\beta_{12}$, and $\beta_{13}$. The residuals used are the internal R Studentized residuals defined in the next section; see (3.9.31). They provide a convenient scale for detecting outliers. The curvature in the Wilcoxon residual plot of this model, Panel A of Figure 3.9.2, is quite apparent, indicating the need for quadratic terms in the model; whereas the LS residual plot, Panel C of Figure 3.3.1, does not exhibit this quadratic effect. As the R residual plot indicates, there are outliers in the data, and these had an effect on the LS fit. Panels B and D display the residual plots when the squared terms of the factors are added to the model, i.e., when Model (3.3.1) was fit. This R residual plot no longer exhibits the quadratic effect, indicating a better fitting model. Also, by examining the R plots for both models, it is seen that the outlyingness of some of the outliers indicated in the plot for the first model was accounted for by the larger model.

3.9.2 Standardization of R Residuals

In this section we want to obtain an expression for the variance of the R residuals under Model (3.2.3). We assume in this section that $\sigma^2$, the variance of the errors, is finite. As we show below, similar to the least squares residual, the variance of an R residual depends both on its location in the x-space and the underlying variation of the errors. The internal Studentized least squares residuals (residuals divided by their estimated standard errors) have proved useful in diagnostic procedures since they correct for both the model and the underlying variance. The internal R Studentized residuals defined below, (3.9.31), are similarly Studentized R residuals.

Figure 3.9.2: Panels A and B are the Wilcoxon internal Studentized residual plots for the models without and with, respectively, the three quadratic terms $\beta_{11}$, $\beta_{12}$, and $\beta_{13}$. Panels C and D are the analogous plots for the LS fit.

A diagnostic use of a Studentized residual is in detecting outlying observations. The R method provides a robust fit to the bulk of the data. Thus any case with a large Studentized residual can be considered an outlier from this model. Even though a robust fit is resistant to outliers, it is still useful to detect such points. Indeed, in practice these are often the points of most interest. The value of an internally Studentized residual is in its simplicity: it tells how many estimated standard errors a residual is away from the center of the data.

The standardization depends on which estimate of the intercept is selected. We obtain the result for $\hat{\alpha}_S$, the median of the $\hat{e}_{R,i}$, and only state the results for the intercept based on symmetric errors. Thus the residuals we seek to standardize are given by
$$\hat{e}_R = Y - \hat{\alpha}_S\mathbf{1} - X\hat{\beta}_\varphi \; . \quad (3.9.12)$$
We obtain a first order approximation of $\operatorname{cov}(\hat{e}_R)$. Since the residuals are invariant to the regression coefficients, we can assume without loss of generality that the true parameters are zero. Recall that $h_{c,i}$ is the $i$th diagonal element of $H = X(X'X)^{-1}X'$ and $h_i = n^{-1} + h_{c,i}$.

Theorem 3.9.3. Under the conditions (E.1), (E.2), (D.1), (D.2), and (S.1) of Section 3.4, if the intercept estimate is $\hat{\alpha}_S$, then a first order representation of the variance of $\hat{e}_{R,i}$ is
$$\operatorname{Var}(\hat{e}_{R,i}) \doteq \sigma^2(1 - K_1 n^{-1} - K_2 h_{c,i}) \; , \quad (3.9.13)$$
where $K_1$ and $K_2$ are defined in expressions (3.9.18) and (3.9.19), respectively. In the case of a symmetric error distribution, when the estimate of the intercept is given by $\hat{\alpha}_\varphi^+$, discussed in Section 3.5.2, and (S.3) also holds,
$$\operatorname{Var}(\hat{e}_{R,i}) \doteq \sigma^2(1 - K_2 h_i) \; . \quad (3.9.14)$$

Proof: Using the first order expression for $\hat{\beta}_\varphi$ given in (3.5.23) and the asymptotic representation of $\hat{\alpha}_S$ given by (3.5.22), we have
$$\hat{e}_R \doteq e - \tau_S\,\overline{\operatorname{sgn}}(e)\mathbf{1} - H\tau_\varphi\varphi(F(e)) \; , \quad (3.9.15)$$
where $\overline{\operatorname{sgn}}(e) = n^{-1}\sum\operatorname{sgn}(e_i)$ and $\tau_S$ and $\tau_\varphi$ are defined in expressions (3.4.6) and (3.4.4), respectively. Because the median of $e_i$ is 0 and $\int\varphi(u)\,du = 0$, we have $E[\hat{e}_R] \doteq E(e_1)\mathbf{1}$. Hence,
$$\operatorname{cov}(\hat{e}_R) \doteq E\left[(e - \tau_S\overline{\operatorname{sgn}}(e)\mathbf{1} - H\tau_\varphi\varphi(F(e)) - E(e_1)\mathbf{1})(e - \tau_S\overline{\operatorname{sgn}}(e)\mathbf{1} - H\tau_\varphi\varphi(F(e)) - E(e_1)\mathbf{1})'\right] \; . \quad (3.9.16)$$
Let $J = \mathbf{1}\mathbf{1}'/n$ denote the projection onto the space spanned by $\mathbf{1}$. Since our design matrix is $[\mathbf{1}\; X]$, the leverage of the $i$th case is $h_i = n^{-1} + h_{c,i}$, where $h_{c,i}$ is the $i$th diagonal entry of the projection matrix $H$. By expanding the above expression and using the independence of the components of $e$, we get after some simplification (see Exercise 3.15.25):
$$\operatorname{Cov}(\hat{e}_R) \doteq \sigma^2\{I - K_1 J - K_2 H\} \; , \quad (3.9.17)$$

where
$$K_1 = \frac{\tau_S^2}{\sigma^2}\left(2\frac{\delta_S}{\tau_S} - 1\right) \; , \quad (3.9.18)$$
$$K_2 = \frac{\tau_\varphi^2}{\sigma^2}\left(2\frac{\delta}{\tau_\varphi} - 1\right) \; , \quad (3.9.19)$$
$$\delta_S = E[e_i\operatorname{sgn}(e_i)] \; , \quad (3.9.20)$$
$$\delta = E[e_i\varphi(F(e_i))] \; , \quad (3.9.21)$$
$$\sigma^2 = \operatorname{Var}(e_i) = E\left((e_i - E(e_i))^2\right) \; . \quad (3.9.22)$$

This yields the first result, (3.9.13). Next consider the case of a symmetric error distribution. If the estimate of the intercept is given by $\hat{\alpha}_\varphi^+$, discussed in Section 3.5.2, the result simplifies to (3.9.14).

From Cook and Weisberg (1982, p. 11), in the least squares case $\operatorname{Var}(\hat{e}_{LS,i}) = \sigma^2(1 - h_i)$, so that $K_1$ and $K_2$ are correction factors due to using the rank score function.

Based on the results in the theorem, an estimate of the variance-covariance matrix of $\hat{e}_R$ is
$$\widetilde{S} = \hat{\sigma}^2\{I - \hat{K}_1 J - \hat{K}_2 H_c\} \; , \quad (3.9.23)$$
where
$$\hat{K}_1 = \frac{\hat{\tau}_S^2}{\hat{\sigma}^2}\left(\frac{2\hat{\delta}_S}{\hat{\tau}_S} - 1\right) \; , \quad (3.9.24)$$
$$\hat{K}_2 = \frac{\hat{\tau}_\varphi^2}{\hat{\sigma}^2}\left(\frac{2\hat{\delta}}{\hat{\tau}_\varphi} - 1\right) \; , \quad (3.9.25)$$
$$\hat{\delta}_S = \frac{1}{n-p}\sum_{i=1}^{n}|\hat{e}_{R,i}| \; , \quad (3.9.26)$$
and
$$\hat{\delta} = \frac{1}{n-p}D(\hat{\beta}_\varphi) \; .$$
The estimators $\hat{\tau}_S$ and $\hat{\tau}_\varphi$ are discussed in Section 3.7.1.
To complete the estimate of $\operatorname{Cov}(\hat{e}_R)$ we need to estimate $\sigma$. A robust estimate of it is given by the MAD,
$$\hat{\sigma} = 1.483\operatorname{med}_i\{|\hat{e}_{R,i} - \operatorname{med}_j\,\hat{e}_{R,j}|\} \; , \quad (3.9.27)$$
which is a consistent estimate of $\sigma$ if the errors have a normal distribution. For the examples discussed here, we used this estimate in (3.9.23)-(3.9.25). It follows from (3.9.23) that an estimate of $\operatorname{Var}(\hat{e}_{R,i})$ is
$$\tilde{s}_{R,i}^2 = \hat{\sigma}^2\left(1 - \hat{K}_1\frac{1}{n} - \hat{K}_2 h_{c,i}\right) \; , \quad (3.9.28)$$


where $h_{c,i} = x_i'(X'X)^{-1}x_i$.

Let $\hat{\sigma}_{LS}^2$ denote the usual least squares estimate of the variance. Least squares residuals are standardized by $\tilde{s}_{LS,i}$, where
$$\tilde{s}_{LS,i}^2 = \hat{\sigma}_{LS}^2(1 - h_i) \; ; \quad (3.9.29)$$
see page 11 of Cook and Weisberg (1982), and recall that $h_i = n^{-1} + x_i'(X'X)^{-1}x_i$. If the error distribution is symmetric, (3.9.28) reduces to
$$\tilde{s}_{R,i}^2 = \hat{\sigma}^2(1 - \hat{K}_2 h_i) \; . \quad (3.9.30)$$

Using the ww package, the Studentized residuals based on Wilcoxon scores are computed by the R commands fit.wil = wwest(xmat,y) and studres.hbr(xmat,fit.wil$tmp2$wmat,fit.wil$tmp1$resid), where xmat and y contain the design matrix and the vector of responses, respectively.
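Alternatively, these Studentized residuals, defined formally below in (3.9.31), are easy to compute directly under the symmetric-error form (3.9.30); a sketch with Wilcoxon scores, taking the R residuals, the design matrix, and an estimate of $\tau_\varphi$ from Section 3.7.1 as inputs:

  ## A direct sketch of the internal R Studentized residuals via (3.9.30).
  studres_wil <- function(ehat, X, tauhat) {
    n <- length(ehat); p <- ncol(X)
    sigmahat <- mad(ehat)                          # MAD (3.9.27); constant 1.4826
    a <- sqrt(12) * (rank(ehat) / (n + 1) - 0.5)   # Wilcoxon scores
    deltahat <- sum(a * ehat) / (n - p)            # delta-hat from the dispersion
    K2 <- (tauhat^2 / sigmahat^2) * (2 * deltahat / tauhat - 1)
    Xc <- scale(X, center = TRUE, scale = FALSE)
    h  <- 1 / n + diag(Xc %*% solve(crossprod(Xc), t(Xc)))  # leverages h_i
    v  <- 1 - K2 * h
    v[v <= 0] <- (1 - h)[v <= 0]                   # fallback for negative variances
    ehat / (sigmahat * sqrt(v))                    # r_{R,i} of (3.9.31)
  }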

Internal R Studentized Residual

We define the internal R Studentized residuals as
$$r_{R,i} = \frac{\hat{e}_{R,i}}{\tilde{s}_{R,i}} \; , \quad i = 1, \ldots, n \; , \quad (3.9.31)$$
where $\tilde{s}_{R,i}$ is the square root of either (3.9.28) or (3.9.30), depending on whether one assumes an asymmetric or a symmetric error distribution, respectively.
It is interesting to compare expression (3.9.30) with the estimate of the variance of the least squares residual, $\hat{\sigma}_{LS}^2(1 - h_i)$. The correction factor $\hat{K}_2$ depends on the score function $\varphi(\cdot)$ and the underlying symmetric error distribution. If, for example, the error distribution is normal and if we use normal scores, then $\hat{K}_2$ converges in probability to 1; see Exercise 3.15.26. In general, however, we do not wish to specify the error distribution, and then $\hat{K}_2$ provides a natural adjustment.
A simple benchmark is useful in declaring whether or not a case is an outlier. We are certainly not advocating eliminating such cases but flagging them as potential outliers and targeting them for further study. As we discussed in the last section, the distribution of the R residuals should resemble the true distribution of the errors. Hence a simple rule for all cases is not apparent. In general, unless the residuals appear to be from a highly skewed distribution, a simple rule is to declare a case to be a potential outlier if its residual exceeds two standard errors in absolute value; i.e., $|r_{R,i}| > 2$.

The matrix $\widetilde{S}$, (3.9.23), is an estimate of a first order approximation of $\operatorname{cov}(\hat{e}_R)$. It is not necessarily positive semi-definite, and we have not constrained it to be so. In practice this has not proved troublesome, since only occasionally have we encountered negative estimates of the variance of the residuals. For instance, the R fit for the cloud data resulted in one case with a negative variance. Presently, we replace $\tilde{s}_{R,i}$, (3.9.28), by $\hat{\sigma}\sqrt{1 - h_i}$, where $\hat{\sigma}$ is the MAD estimate (3.9.27), in these situations.
We have already illustrated the internal R Studentized residuals for the potency data of Example 3.9.2, discussed in the last section. We use them next on the cloud data.

Example 3.9.3 (Cloud Data, Example 3.9.1, continued). Returning to the cloud data example, Panel A of Figure 3.9.3 displays a residual plot of the internal Wilcoxon Studentized residuals versus the fitted values. It is similar to Panel C of Figure 3.9.1 but has a meaningful scale on the vertical axis. The residuals for three of the cases (4, 10, and 16) are over two standard errors from the center of the data. These should be flagged as potential outliers. Panel B of Figure 3.9.3 displays the normal q-q plot of the internal Wilcoxon Studentized residuals. The underlying error structure appears to have heavier tails than the normal distribution.

Figure 3.9.3: Internal Wilcoxon Studentized residual plot, Panel A, and corresponding normal q-q plot, Panel B, for the cloud data.

As with their least squares counterparts, we think the chief benefit of the internal R Studentized residuals is their usefulness in diagnostic plots and in flagging potential outliers.


External R Studentized Residual

Another statistic that is useful for flagging outliers is a robust version of the external t statistic. The LS version of this diagnostic is discussed in detail in Cook and Weisberg (1982). A robust version of this diagnostic is discussed in McKean, Sheather, and Hettmansperger (1991). We briefly describe this latter approach.

Suppose we want to examine the $i$th case to see if it is an outlier. Consider the mean shift model given by
$$Y = X_1 b + \theta_i d_i + e \; , \quad (3.9.32)$$
where $X_1$ is the augmented matrix $[\mathbf{1}\; X]$ and $d_i$ is an $n \times 1$ vector of zeroes except for its $i$th component, which is a 1. A formal hypothesis that the $i$th case is an outlier is given by
$$H_0: \theta_i = 0 \;\text{ versus }\; H_A: \theta_i \neq 0 \; . \quad (3.9.33)$$

One way of testing these hypotheses is to use the test procedures described in Section 3.6. This requires fitting Model (3.9.32) for each value of $i$. A second approach is described next.

Note that we can rewrite Model (3.9.32) equivalently as
$$Y = X_1 b^* + \theta_i d_i^* + e \; , \quad (3.9.34)$$
where $d_i^* = (I - H_1)d_i$, $H_1$ is the projection matrix onto the column space of $X_1$, and $b^* = b + H_1 d_i\theta_i$; see Exercise 3.15.27. Because of the orthogonality between $X_1$ and $d_i^*$, the least squares estimate of $\theta_i$ can be obtained by a simple linear regression of $Y$ on $d_i^*$, or equivalently of $\hat{e}_{LS}$ on $d_i^*$. For the rank-based estimate, the asymptotic distribution theory of the regression estimates suggests a similar approach. Accordingly, let $\hat{\theta}_{R,i}$ denote the R estimate when $\hat{e}_R$ is regressed on $d_i^*$. This is a simple regression and the estimate can be obtained by a linear search algorithm; see Section 3.7.2. As Exercise 3.15.29 shows, this estimate is the inversion of an aligned rank statistic to test the hypotheses (3.9.33). Next let $\hat{\tau}_{\varphi,i}$ denote the estimate of $\tau_\varphi$ produced from this regression. We define the external R Studentized residual to be the statistic
$$t_R(i) = \frac{\hat{\theta}_{R,i}}{\hat{\tau}_{\varphi,i}/\sqrt{1 - h_{1,i}}} \; , \quad (3.9.35)$$
where $h_{1,i}$ is the $i$th diagonal entry of $H_1$. Note that we have standardized $\hat{\theta}_{R,i}$ by its asymptotic standard error.
A final remark on these external t-statistics is in order. In the mean shift model, (3.9.32), the leverage value of the $i$th case is 1. Hence, the design assumption (D.2), (3.4.7), is not true. This invalidates both the LS and rank-based asymptotic theory for the external t-statistics. In light of this, we do not propose the statistic $t_R(i)$ as a test statistic for the hypotheses (3.9.33) but as a diagnostic for flagging potential outliers. As a benchmark, we suggest the value 2.

3.9.3 Measures of Influential Cases

Since R estimates have bounded influence in the y-space but not in the x-space, the R fit may be affected by outlying points in the x-space. We next introduce a statistic which measures the influence of the $i$th case on the robust fit. We work with the usual model (3.2.3). First, we need the first order representation of $\hat{Y}_R$. Similar to the proof of Theorem 3.9.3, which obtained the first order representation of the residuals, (3.9.15), we have
$$\hat{Y}_R \doteq \alpha\mathbf{1} + X\beta + \tau_S\,\overline{\operatorname{sgn}}(e)\mathbf{1} + H\tau_\varphi\varphi(F(e)) \; ; \quad (3.9.36)$$
see Exercise 3.15.28.

Let $\hat{Y}_R(i)$ denote the R predicted value of $Y_i$ when the $i$th case is deleted from the model. We call this model the delete $i$ model. Then the change in the robust fit due to the $i$th case is
$$RDFFIT_i = \hat{Y}_{R,i} - \hat{Y}_R(i) \; . \quad (3.9.37)$$
$RDFFIT_i$ is our measure of the influence of case $i$. Computation of this statistic is discussed later. Clearly, in order to be useful, $RDFFIT_i$ must be assessed relative to some scale.

$RDFFIT$ is a change in the fitted value; hence, a natural scale for assessing $RDFFIT$ is a fitted value scale. Using as our estimate of the intercept $\hat{\alpha}_S$, it follows from expression (3.9.36) with $\gamma = 0$ that
$$\operatorname{Var}(\hat{Y}_{R,i}) \doteq n^{-1}\tau_S^2 + h_{c,i}\tau_\varphi^2 \; . \quad (3.9.38)$$

Hence, based on a fitted scale assessment, we standardize $RDFFIT_i$ by an estimate of the square root of this quantity.

For least squares diagnostics there is some discussion on whether to use the original model or the model with the $i$th point deleted for the estimation of scale. Cook and Weisberg (1982) advocate the original model. In this case the scale estimate is the same for all $n$ cases. This allows casewise comparisons involving the diagnostic. Belsley, Kuh, and Welsch (1980), however, advocate scale estimation based on the delete $i$ model. Note that both standardizations correct for the model and the underlying variation of the errors.

Let $\hat{\tau}_S(i)$ and $\hat{\tau}_\varphi(i)$ denote the estimates of $\tau_S$ and $\tau_\varphi$ for the delete $i$ model as discussed above. Then our diagnostic, in which $RDFFIT_i$ is assessed relative to a fitted value scale with estimates of scale based on the delete $i$ model, is given by
$$RDFFITS_i = \frac{RDFFIT_i}{\left(n^{-1}\hat{\tau}_S^2(i) + h_{c,i}\hat{\tau}_\varphi^2(i)\right)^{1/2}} \; . \quad (3.9.39)$$
This is an R analogue of the least squares diagnostic $DFFITS_i$ proposed by Belsley et al. (1980). For standardization based on the original model, replace $\hat{\tau}_S(i)$ and $\hat{\tau}_\varphi(i)$ by $\hat{\tau}_S$ and $\hat{\tau}_\varphi$, respectively. We define
$$RDCOOK_i = \frac{RDFFIT_i}{\left(n^{-1}\hat{\tau}_S^2 + h_{c,i}\hat{\tau}_\varphi^2\right)^{1/2}} \; . \quad (3.9.40)$$

If $\hat{\alpha}_\varphi^+$ is used as the estimate of the intercept then, provided the errors have a symmetric distribution, the R diagnostics are obtained by replacing (3.9.38) with $\operatorname{Var}(\hat{Y}_{R,i}) \doteq h_i\tau_\varphi^2$; see Exercise 3.15.30 for details. This results in the diagnostics
$$RDFFITS_{symm,i} = \frac{RDFFIT_i}{\sqrt{h_i}\,\hat{\tau}_\varphi(i)} \; , \quad (3.9.41)$$
and
$$RDCOOK_{symm,i} = \frac{RDFFIT_i}{\sqrt{h_i}\,\hat{\tau}_\varphi} \; . \quad (3.9.42)$$
This eliminates the need to estimate $\tau_S$.
There is also disagreement on what benchmarks to use for flagging points of potential influence. As Belsley et al. (1980) discuss in some detail, $DFFITS$ is inversely influenced by sample size. They advocate a size-adjusted benchmark of $2\sqrt{p/n}$ for $DFFITS$. Cook and Weisberg (1982) suggest a more conservative value, which results in $\sqrt{p}$. We use both benchmarks in the examples. We realize these diagnostics only flag potentially influential points that require investigation. Similar to the two references cited above, we would never recommend indiscriminately deleting observations solely because their diagnostic values exceed the benchmark. Rather, these are potential points of influence which should be investigated.
The diagnostics described above are formed with the leverage values based on the projection matrix. These leverage values are nonrobust (see Rousseeuw and van Zomeren, 1990). For data sets with clusters of outliers in factor space, robust leverage values can be formulated in terms of high breakdown estimates of the center and scatter matrix in factor space. One such choice would be the MVE, minimum volume ellipsoid, proposed by Rousseeuw and van Zomeren (1990). Other estimates could be based on the robust singular value decomposition discussed by Ammann (1993). See, also, Simpson, Ruppert, and Carroll (1992). We recommend computing $\hat{Y}_R(i)$ with a one or two step R estimate based on the residuals from the original model; see Section 3.7.2. Each step involves a single ordering of the residuals, which are nearly in order (in fact on the first step they are in order), and a single projection onto the range of X (easily obtained by using the routines in LINPACK as discussed in Section 3.7.2).
The diagnostic $RDFFITS_i$ measures the change in the fitted values when the $i$th case is deleted. Similarly we can also measure changes in the estimates of the regression coefficients. For the LS analysis, this is the diagnostic $DFBETAS$ proposed by Belsley, Kuh, and Welsch (1980). The corresponding diagnostic for the rank-based analysis is
$$RDBETAS_{ij} = \frac{\hat{\beta}_{\varphi,j} - \hat{\beta}_{\varphi,j}(i)}{\hat{\tau}_\varphi(i)\sqrt{(X'X)^{-1}_{jj}}} \; , \quad (3.9.43)$$
where $\hat{\beta}_\varphi(i)$ denotes the R estimate of $\beta$ in the delete $i$ model. A similar statistic can be constructed for the intercept parameter. Furthermore, a DCOOK version can also be constructed as above. These diagnostics are often used when $|RDFFITS_i|$ is large. In such cases, it may be of interest to know which components of the regression coefficients are more influential than other components. The benchmark suggested by Belsley, Kuh, and Welsch (1980) is $2/\sqrt{n}$.
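For small data sets, $RDFFITS_i$ can be computed by brute force, refitting each delete $i$ model; the following sketch does so with Rfit, whereas the text recommends one- or two-step R estimates for efficiency. The tauhat and taushat components of an rfit object are assumed here to hold $\hat{\tau}_\varphi$ and $\hat{\tau}_S$.

  ## A brute-force sketch of RDFFITS, (3.9.39), via delete-i refits.
  library(Rfit)
  rdffits <- function(y, X) {
    n   <- nrow(X)
    fit <- rfit(y ~ X)
    Xc  <- scale(X, center = TRUE, scale = FALSE)
    hc  <- diag(Xc %*% solve(crossprod(Xc), t(Xc)))      # centered leverages
    out <- numeric(n)
    for (i in 1:n) {
      fi <- rfit(y[-i] ~ X[-i, , drop = FALSE])          # delete-i fit
      yhat_i <- sum(c(1, X[i, ]) * coef(fi))             # delete-i prediction
      out[i] <- (fitted(fit)[i] - yhat_i) /
        sqrt(fi$taushat^2 / n + hc[i] * fi$tauhat^2)     # standardization (3.9.39)
    }
    out
  }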
Example 3.9.4 (Free Fatty Acid (FFA) Data). The data for this example can be found in Morrison (1983, p. 64); for convenience we have placed it (Free Fatty Acid Data) at the url cited in the Preface. The response is the level of free fatty acid of prepubescent boys, while the independent variables are age, weight, and skin fold thickness. The sample size is 41. Panel A of Figure 3.9.4 depicts the residual plot based on the least squares internal t-residuals. From this plot there appear to be several outliers. Certainly the cases 12, 22, 26, and 9 are outlying, and perhaps the cases 8, 10, and 38. In fact, the first four of these cases probably control the least squares fit, obscuring cases 8, 10, and 38.
As our first R fit of this data, we used the Wilcoxon scores with the intercept estimated by the median of the residuals, $\hat{\alpha}_S$. Note that all seven cases stand out in the Wilcoxon residual plot based on the internal R Studentized residuals, (3.9.31); see Panel B of Figure 3.9.4. This is further confirmed by the fits displayed in Table 3.9.3, where the LS fit with these seven cases deleted is very similar to the Wilcoxon fit using all the cases. The q-q plot of the internal R Studentized residuals, Panel C of Figure 3.9.4, also highlights these outlying cases. Similar to the residual plot, the q-q plot suggests that the underlying error distribution is positively skewed with a light left tail. The estimates of the regression coefficients and their standard errors are displayed in Table 3.9.3. Due to the skewness in the data, it is not surprising that the LS and R estimates of the intercept are different, since the former estimates the mean of the residuals while the latter estimates the median of the residuals.


Table 3.9.3: Estimates of $\beta$ (first cell entry) and $\hat{\sigma}_\beta$ (second cell entry) for the Free Fatty Acid Data. (The rows are for the parameters $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$, respectively. The last row contains the estimates of scale: $\sigma$ for LS and $\tau$ for the Wilcoxon. LS-7 represents the LS fit without the seven outliers.)

                      Original Data                            log y
          LS           Wilcoxon      LS-7          Bent Sc.     Wilcoxon
beta_0    1.70 .33     1.49 .27      1.24 .21      1.37 .21     .99 .54
beta_1    -.002 .003   -.001 .003    -.001 .002    -.001 .002   .000 .005
beta_2    -.015 .005   -.015 .004    -.013 .003    -.015 .003   -.031 .008
beta_3    .205 .167    .274 .137     .285 .103     .355 .104    .555 .271
Scale     .215         .178          .126          .134         .350
Table 3.9.4 displays the values of the R and LS diagnostics for the cases of interest. For the seven cases cited above, the internal Wilcoxon Studentized residuals, (3.9.31), definitely flag three of the cases, and for two of the others the residual exceeds 1.70; see Panel B of Figure 3.9.4. As $RDFFITS$, (3.9.39), indicates, none of these seven cases seems to have an effect on the Wilcoxon fit (the liberal benchmark is .62), whereas the 12th case appears to have an effect on the least squares fit. $RDFFITS$ exceeded the benchmark only for case 2, for which it had the value -.64. Case 36, with $h_{36} = .53$, has high leverage but did not have an adverse effect on either the Wilcoxon fit or the LS fit. This is true too of cases 11 and 40, which were the only other cases whose leverage values exceeded the benchmark of $2p/n$.

As we noted above, both the residual and the q-q plots indicate that the distribution of the residuals is positively skewed. This suggests a transformation as discussed below, or perhaps a prudent choice of a score function which would be more appropriate for skewed error distributions than the Wilcoxon scores. The score function $\varphi_{.5}(u)$, (2.5.33), is more suited to positively skewed errors. Panel D of Figure 3.9.4 displays the internal R Studentized residuals based on the R fit using this bent score function. From this plot and the tabled diagnostics, the outliers stand out more from this fit than from the previous two fits. The $RDFFITS$ values for this fit are even smaller than those of the Wilcoxon fit, which is expected since this score function protects on the right. While Case 7 has a little influence on the bent score fit, no other cases have $RDFFITS$ exceeding the benchmark.
Table 3.9.3 displays the estimates of the betas for the three fits along with their standard errors. At the .05 level, coefficients 2 and 3 are significant for the robust fits, while only coefficient 2 is significant for the LS fit. The robust fits appear to be an improvement over LS. Of the two robust fits, the bent score fit appears to be more precise than the Wilcoxon fit.


Figure 3.9.4: Panel A: Internal LS Studentized residual plot for the original free fatty acid data; Panel B: Internal Wilcoxon Studentized residual plot for the original free fatty acid data; Panel C: Internal Wilcoxon Studentized normal q-q plot for the original free fatty acid data; and Panel D: Internal R Studentized residual plot for the original free fatty acid data based on the score function $\varphi_{.5}(u)$.



A practical transformation of the response variable, suggested by the Box-Cox procedure, is the log. Panel A of Figure 3.9.5 shows the internal R Studentized residual plot based on the Wilcoxon fit of the log-transformed response. Note that five of the cases still stand out in the plot. The residuals from the transformed response still appear to be skewed, as is evident in the q-q plot, Panel B of Figure 3.9.5. The LS and Wilcoxon fits were quite similar for the transformed data, so only the Wilcoxon estimates are displayed in Table 3.9.3.

3.10 Survival Analysis

In this section we discuss scores which are appropriate for lifetime distributions when the log of lifetime follows a linear model. These are called accelerated failure time models; see Kalbfleisch and Prentice (1980). Let $T$ denote the lifetime of a subject and let $x$ be a $p \times 1$ vector of covariates associated with $T$.


Table 3.9.4: Regression Diagnostics for Cases of Interest for the Fatty Acid Data

                 LS               Wilcoxon          Bent Score
Case   h_i      Int. t   DFFIT   Int. t   DFFIT    Int. t   DFFIT
8      0.12     1.16     0.43    1.57     0.44     1.73     0.31
9      0.04     1.74     0.38    2.14     0.13     2.37     0.26
10     0.09     1.12     0.36    1.59     0.53     1.84     0.30
12     0.06     2.84     0.79    3.30     0.33     3.59     0.30
22     0.05     2.26     0.53    2.51     -0.06    2.55     0.11
26     0.04     1.51     0.32    1.79     0.20     1.86     0.10
38     0.15     1.27     0.54    1.70     0.53     1.93     0.19
2      0.10     -1.19    -0.40   -0.17    -0.64    -0.75    -0.48
7      0.11     -1.07    -0.37   -0.75    -0.44    -0.74    -0.64
11     0.22     0.56     0.30    0.97     0.31     1.03     0.07
40     0.25     -0.51    -0.29   -0.31    -0.21    -0.35    0.06
36     0.53     0.18     0.19    -0.04    -0.27    -0.66    -0.34

Figure 3.9.5: Panel A: Internal R Studentized residual plot of the log-transformed free fatty acid data; Panel B: Corresponding normal q-q plot.


Let $h(t; x)$ denote the hazard function of $T$ at time $t$; see Section 2.8. Suppose $T$ follows a log-linear model; that is, $Y = \log T$ follows the linear model
$$Y = \alpha + x'\beta + e \; , \quad (3.10.1)$$
where $e$ is a random error with density $f$. Exponentiating both sides we get $T = \exp\{\alpha + x'\beta\}T_0$, where $T_0 = \exp\{e\}$. Let $h_0(t)$ denote the hazard function of $T_0$. This is called the baseline hazard function. Then the hazard function of $T$ is given by
$$h(t; x) = h_0\!\left(t\exp\{-(\alpha + x'\beta)\}\right)\exp\{-(\alpha + x'\beta)\} \; . \quad (3.10.2)$$

Thus the covariate $x$ accelerates or decelerates the failure time of $T$; hence the name accelerated failure time for these models.

An important subclass of the accelerated failure time models are those where $T_0$ follows a Weibull distribution, i.e.,
$$f_{T_0}(t) = \lambda\gamma(\lambda t)^{\gamma-1}\exp\{-(\lambda t)^\gamma\} \; , \quad t > 0 \; , \quad (3.10.3)$$
where $\lambda$ and $\gamma$ are unknown parameters. In this case it follows that the hazard function of $T$ is proportional to the baseline hazard function, with the covariate acting as the factor of proportionality; i.e.,
$$h(t; x) = h_0(t)\exp\{-(\alpha + x'\beta)\} \; . \quad (3.10.4)$$
Hence these models are called proportional hazards models. Kalbfleisch and Prentice (1980) show that the only proportional hazards models which are also accelerated failure time models are those for which $T_0$ has the Weibull density. We can write the random error $e = \log T_0$ as $e = \xi + \gamma^{-1}W_0$, where $\xi = -\log\gamma$ and $W_0$ has the extreme value distribution discussed in Section 2.8 of Chapter 2. Thus the optimal rank scores for these log-linear models are generated by the function
$$\varphi_{f_\epsilon}(u) = -1 - \log(1 - u) \; ; \quad (3.10.5)$$
see (2.8.8) of Chapter 2.
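Since (3.10.5) is in closed form, the derived scores $a(i) = \varphi(i/(n+1))$ take one line of R; a small sketch:

  ## The optimal score function (3.10.5) for Weibull lifetimes and its scores.
  phi_weibull <- function(u) -1 - log(1 - u)
  n <- 10
  a <- phi_weibull((1:n) / (n + 1))   # increasing, unbounded on the right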


Next we consider suitable score functions for the general failure time mod-
els, (3.10.1). As noted in Kalbfleisch and Prentice (1980) many of the error
distributions currently used for these models are contained in the log-F class.
In this class, e = log T is distributed down to an unknown scale parameter,
as the log of an F random variable with 2m1 and 2m2 degrees of freedom.
In this case we say that e has a GF (2m1 , 2m2 ) distribution. The distribu-
tion of T is Weibull if (m1 , m2 ) → (1, ∞), log-normal if (m1 , m2 ) → (∞, ∞),

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 234 —


i i

234 CHAPTER 3. LINEAR MODELS

and generalized gamma if (m1 , m2 ) → (∞, 1); see Kalbfleisch and Prentice. If
(m1 , m2 ) = (1, 1) then the e has a logistic distribution. In general this class
contains a variety of shapes. The distributions are symmetric for m1 = m2 ,
positively skewed for m1 > m2 , and negatively skewed for m1 < m2 . While
Kalbfleisch and Prentice discuss this class for m1 , m2 ≥ 1, we extend the class
to m1 , m2 > 0 in order to include heavier-tailed error distributions.
For random errors with distribution $GF(2m_1, 2m_2)$, the optimal rank score function is given by
$$\varphi_{m_1,m_2}(u) = \frac{m_1 m_2\left(\exp\{F^{-1}(u)\} - 1\right)}{m_2 + m_1\exp\{F^{-1}(u)\}} \; , \quad (3.10.6)$$
where $F$ is the cdf of the $GF(2m_1, 2m_2)$ distribution; see Exercise 3.15.31. We label these scores as $GF(2m_1, 2m_2)$ scores. It follows that the scores are strictly increasing and bounded below by $-m_1$ and above by $m_2$. Hence an R analysis based on these scores has bounded influence in the Y-space.
This class of scores can be conveniently divided into the four subclasses
C1 through C4 which are represented by the four quadrants with center (1, 1)
as depicted in Figure 3.10.1. The point (1, 1) in this figure corresponds to
the linear-rank, Wilcoxon scores. These scores are optimal for the logistic
distribution, GF (2, 2), and form a “natural” center point for the scores. One
score function from each class with the density for which it is optimal is plotted
in Figure 3.10.2. These plots are generally representative. The score functions
in C2 change from concave to convex as u increases and, hence, are suitable
for light-tailed error structure, while, those in C4 pass from convex to concave
and are suitable for heavy tailed error structure. The score functions in C3
are always convex and are suitable for negatively skewed error structure with
heavy left tails and moderate right tails, while those in C1 are suitable for
positively skewed errors with heavy right tails and moderate left tails.
Figure 3.10.2 shows how a score function corresponds to its density. If the
density has a heavy right tail then the score function tends to be flat on the
right side; hence, the resulting estimate is less sensitive to outliers on the right.
If, on the other hand, the density has a light right tail, then the scores tend to rise on the right in order to accentuate points there.
suggest approximating these scores by scores consisting of two or three line
segments such as the bent score function, (2.5.33).
Generally the GF (2m1 , 2m2 ) scores cannot be obtained in closed form due
to F⁻¹, but software such as R can easily produce them. For example, the R command qf(u,df1,df2) returns F⁻¹(u), where F is the cdf of an F random variable with degrees of freedom ν1 = df1 and ν2 = df2. There are two inter-
esting subclasses for which closed forms are possible. These are the subclasses
GF(2, 2m2) and GF(2m1, 2). As Exercise 3.15.32 shows, the random variables for these classes are the logs of variates having Pareto distributions.


Figure 3.10.1: Schematic of the four classes, C1 - C4, of the GF (2m1 , 2m2 )
scores.

[Schematic: the (m1, m2) plane, with both axes running from 0 to 2 and quadrants centered at (1, 1): C1 (lower right) positively skewed; C2 (upper right) light tailed; C3 (upper left) negatively skewed; C4 (lower left) heavy tailed.]

For the subclass GF(2, 2m2) the score generating function is


ϕm2(u) = [(m2 + 2)/m2]^{1/2} { m2 − (m2 + 1)(1 − u)^{1/m2} } . (3.10.7)
These are the powers of rank scores discussed by Mielke (1972) in the context
of two-sample problems.
It is interesting to note that the asymptotic relative efficiency of the
Wilcoxon to the optimal rank score function at the GF (2m1 , 2m2 ) distri-
bution is given by
ARE = [12 Γ⁴(m1 + m2) Γ²(2m1) Γ²(2m2) (m1 + m2 + 1)] / [Γ⁴(m1) Γ⁴(m2) Γ²(2m1 + 2m2) m1 m2] ; (3.10.8)
see Exercise 3.15.31. This efficiency can be arbitrarily small.


Figure 3.10.2: Column A contains plots of the densities: the Class C1 distribu-
tion GF (3, .8); the Class C2 distribution GF (4, 8); the Class C3 distribution
GF (.5, 6); and the Class C4 distribution GF (1, .6). Column B contains the
corresponding optimal score functions.

For instance, in the subclass GF(2, 2m2) the efficiency reduces to

ARE = 3m2(m2 + 2)/(2m2 + 1)² , (3.10.9)

which approaches 0 as m2 → 0 and 3/4 as m2 → ∞. Thus in the presence of severely skewed errors, the Wilcoxon scores can have arbitrarily low efficiency compared to a fully efficient R estimate based on the optimal scores.
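Expression (3.10.8) is easily evaluated numerically. The following R sketch (areGF is our own name) computes it on the log scale for stability and checks the special case (3.10.9):

areGF <- function(m1, m2) {
  lnum <- log(12) + 4*lgamma(m1 + m2) + 2*lgamma(2*m1) +
    2*lgamma(2*m2) + log(m1 + m2 + 1)
  lden <- 4*lgamma(m1) + 4*lgamma(m2) + 2*lgamma(2*m1 + 2*m2) +
    log(m1) + log(m2)
  exp(lnum - lden)
}
areGF(1, 1)                   # logistic errors: Wilcoxon optimal, ARE = 1
areGF(1, .1)                  # small m2: efficiency well below 1
3*.1*(.1 + 2)/(2*.1 + 1)^2    # agrees with (3.10.9)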
For a given data set, the choice of scores can itself be a difficult problem. McKean and Sievers (1989) discuss several methods for score selection, one of which is illustrated in the next example. This method is adaptive in nature, with the adaption depending on residuals from an initial fit. In practice, this can lead


to overfitting. Its use, however, can lead to insight and may prove beneficial
for fitting future data sets of the same type; see McKean et al. (1989) for such
an application. Using XLISP-STAT (Tierney, 1990) Wang (1996) presents a
graphical interface for methods of score selection.

Example 3.10.1 (Insulating Fluid Data). We consider a problem discussed in Nelson (1982, p. 227) and also discussed by Lawless (1982, p. 185). The
data consist of breakdown times T of an electrical insulating fluid subject to
seven different levels of voltage stress v. Panel A of Figure 3.10.3 displays a
scatter plot of Y = log T versus log v. As a full model we consider a oneway
layout, as discussed in Chapter 4, with the response variable Y = log T and
with the seven voltage levels as treatments. The comparison boxplots, Panel
B of Figure 3.10.3, are an appropriate display for this model.

Figure 3.10.3: Panel A: Scatterplot of insulating fluid data, Example 3.10.1, overlaid with GF(2, 10) and LS fits; Panel B: Comparison boxplots of log breakdown times over levels of voltage stress.


One method for score selection, which we briefly touch on here, is based on q−q plots; see McKean and Sievers (1989). Using Wilcoxon scores we obtained
an initial fit of the oneway layout model as discussed in Chapter 4. Panel A of
Figure 3.10.4 displays the q−q plot of the ordered residuals versus the logistic
quantiles based on this fit. Although the left tail of the logistic distribution
appears adequate, the right side of the plot indicates that distributions with
lighter right tails might be more appropriate. This is confirmed by the near
linearity of the GF (2, 10) quantiles versus the Wilcoxon residuals. After trying
several R-fits using GF (2m1 , 2m2 ) scores with m1 , m2 ≥ 1, we decided that


the q−q plot of the GF(2, 10) fit, Panel B of Figure 3.10.4, appeared to be most linear, and we used it to conduct the following R-analysis.

Figure 3.10.4: Panel A: q−q plot of Wilcoxon fit versus logistic population quantiles of the full model (oneway layout); Panel B: q−q plot of GF(2, 10) fit versus GF(2, 10) population quantiles of the full model (oneway layout).


For the fit of the full model using the GF(2, 10) scores, the minimum value of the dispersion function, D, is 103.298 and the estimate of τϕ is 1.38. Note that this minimum value of D is the analogue of the "pure" sum of squared errors in a least squares analysis; hence, we use the notation DPE = 103.298 for pure error dispersion. We first test the goodness of fit of a simple linear model; that is, the reduced model is the simple linear model, while the alternative hypothesis states only that the model is not linear, so the full model remains the oneway layout. Thus the hypotheses are

H0 : Y = α + β log v + e versus HA : the model is not linear. (3.10.10)

To test H0, we fit the reduced model Y = α + β log v + e. The dispersion at the reduced model is 104.399. Since, as noted above, the dispersion at the full model is 103.298, the lack of fit is the reduction in dispersion RD_LOF = 104.399 − 103.298 = 1.101. Therefore the value of the robust test statistic is Fϕ = .319. There is no evidence on the basis of this test to contest a linear model.
The GF (2, 10)-fit of the simple linear model is Ŷ = 64 − 17.67 log v, which
is graphed in Panel A of Figure 3.10.3. Under this linear model, the estimate
of the scale parameter τϕ is 1.57. From this we compute a 95% confidence


Table 3.10.1: Sensitivity Analysis for Insulating Data: W = Wilcoxon, GF = GF(2, 10), RE = REXT, EX = EXT

                                 Value of Y5
       Original (6.05)    7.75          10.05          16.05          30.05
Est      α̂     β̂        α̂     β̂       α̂     β̂       α̂      β̂      α̂      β̂
LS     59.4  -16.4     60.8  -16.8    62.7  -17.3    67.6  -18.7   79.1  -21.9
W      62.7  -17.2     63.1  -17.4    63.0  -17.4    63.1  -17.4   63.1  -17.4
GF     64.0  -17.7     65.5  -18.1    67.0  -18.5    67.1  -18.5   67.1  -18.5
RE     64.1  -17.7     65.5  -18.1    68.3  -18.9    68.3  -18.9   68.3  -18.9
EX     64.8  -17.7     68.4  -18.7    79.3  -21.8   114.6  -31.8  191.7  -53.5

interval for the slope parameter β to be −17.67 ± 3.67; hence, it appears that
the slope parameter differs significantly from 0. In Lawless there was interest
in computing a confidence interval for E(Y |x = log 20). The robust estimate
of this conditional mean is Ŷ = 11.07 and a confidence interval is 11.07 ± 1.9.
Similar to the other robust confidence intervals, this interval is the same as in
the least squares analysis, except that τ̂ϕ replaces σ̂. A fuller discussion of the
R analysis of this data set can be found in McKean and Sievers (1989).

Example 3.10.2 (Sensitivity Analysis for Insulating Fluid Data). As noted by Lawless, engineers may suggest a Weibull distribution for breakdown times in this problem. As discussed earlier, this means the errors have an extreme value distribution. This distribution is essentially the limit of a GF(2, 2m) distribution as m → ∞. For completeness we obtained, using the IMSL (1987) subroutine UMIAH, estimates based on an extreme value likelihood function. These estimates are labeled EXT. R estimates based on the optimum R score function (2.8.8) for the extreme value distribution are labeled REXT. The influence functions of the EXT and REXT estimates are unbounded in the Y-space and, hence, neither estimate is robust; see (3.5.17).
In order to illustrate this lack of robustness, we conducted a small sen-
sitivity analysis. We replaced the fifth point, which had the value 6.05 (log
units), in the data with an outlying observation. Table 3.10.1 summarizes the
results for several different choices of the outlier. Note that even for the first
case when the changed point is 7.75, which is the maximum of the original
data, there is a substantial change in the EXT -estimates. The EXT fit is a
disaster when the point is changed to 10.05, whereas the R-estimates exhibit
robustness. This is even more so for succeeding cases. Although the REXT -
estimates have an unbounded influence function, they behaved well in this
sensitivity analysis.


3.11 Correlation Model


In this section, we are concerned with the correlation model defined by

Y = α + x′ β + e (3.11.1)

where x is a p-dimensional random vector with distribution function M and density function m, e is a random variable with distribution function F and density f, and x and e are independent. Let H and h denote the joint distribution function and joint density function of Y and x. It follows that

h(x, y) = f (y − α − x′ β)m(x) . (3.11.2)

Denote the marginal distribution and density functions of Y by G and g. The hypotheses of interest are:

H0 : Y and x are independent versus HA : Y and x are dependent . (3.11.3)

By (3.11.2) this is equivalent to the hypotheses H0 : β = 0 versus HA : β ≠ 0.
For this section, we use the additional assumptions:

(E.2) Var(e) = σe² < ∞ (3.11.4)
(M.1) E[xx′] = Σ , Σ > 0 . (3.11.5)

Without loss of generality assume that E[x] = 0 and E(e) = 0.


Let (x1 , Y1 ), . . . , (xn , Yn ) be a random sample from the above model. Define
the n × p matrix X1 to be the matrix whose ith row is the vector xi and let
X be the corresponding centered matrix, i.e., X = (I − n⁻¹11′)X1. Thus the
notation here agrees with that found in the previous sections.
We intend to briefly describe the rank-based analysis for this model. As we show, using conditional arguments, the asymptotic inference we developed for the fixed-x case holds for the stochastic case also. We then want to explore measures of association between x and Y. These are analogues of the classical coefficient of multiple determination, R̄². As with R̄², these robust CMDs are 0 when x and Y are independent and positive when they are dependent. Besides defining these measures, we obtain consistent estimates of them. First we show that, conditionally, the assumptions of Section 3.4 hold. Much of the discussion in this section is taken from the paper by Witt, Naranjo, and McKean (1995).

3.11.1 Huber’s Condition for the Correlation Model


The key assumption on the design matrix for the nonstochastic x linear model
was Huber’s condition, (D.2), (3.4.7). As we next show, it holds almost surely


(a.s.) for the correlation model. This allows us to easily obtain inference meth-
ods for the correlation model as discussed below.
First define the modulus of a matrix A to be
m(A) = max_{i,j} |aij| . (3.11.6)

As Exercise 3.15.33 shows, the following three facts follow from this definition: m(AB) ≤ p m(A)m(B), where p is the common dimension of A and B; m(AA′) ≥ m(A)²; and m(A) = max_i aii if A is positive semidefinite. We next need a preliminary lemma found in Arnold (1980).
Lemma 3.11.1. Let {an} be a sequence of nonnegative real numbers. If n⁻¹ Σ_{i=1}^n ai → a0 then n⁻¹ sup_{1≤i≤n} ai → 0.

Proof: We have

an/n = (1/n) Σ_{i=1}^n ai − ((n−1)/n) (1/(n−1)) Σ_{i=1}^{n−1} ai → 0 . (3.11.7)

Now suppose that n⁻¹ sup_{1≤i≤n} ai does not converge to 0. Then for some ǫ > 0 and for all integers N there exists an nN such that nN ≥ N and nN⁻¹ sup_{1≤i≤nN} ai ≥ ǫ. Thus we can find a subsequence of integers {nj} such that nj → ∞ and nj⁻¹ sup_{1≤i≤nj} ai ≥ ǫ. Let a_{i_{nj}} = sup_{1≤i≤nj} ai. Then

ǫ ≤ a_{i_{nj}}/nj ≤ a_{i_{nj}}/i_{nj} . (3.11.8)

Also, since nj → ∞ and ǫ > 0, we must have i_{nj} → ∞; hence, expression (3.11.8) leads to a contradiction of expression (3.11.7).
The following theorem is due to Arnold (1980).

Theorem 3.11.1. Under (3.11.5),

lim_{n→∞} max diag{ X(X′X)⁻¹X′ } = 0 , a.s. (3.11.9)

Proof: Using the facts cited above on the modulus of a matrix, we have

m( X(X′X)⁻¹X′ ) ≤ p² n⁻¹ m(XX′) m( ((1/n)X′X)⁻¹ ) . (3.11.10)

Using the assumptions on the correlation model, the law of large numbers yields (1/n)X′X → Σ a.s. Hence we need only show that n⁻¹ m(XX′) → 0 a.s. Let Ui denote the ith diagonal element of XX′. We then have

(1/n) Σ_{i=1}^n Ui = (1/n) tr X′X → tr Σ , a.s.


By Lemma 3.11.1 we have n⁻¹ sup_{i≤n} Ui → 0 a.s. Since XX′ is positive semidefinite, the desired conclusion is obtained from the facts which followed expression (3.11.6).
Thus, given X, we have the same assumptions on the design matrix as we did in the previous sections. By conditioning on X, the theory derived in Section 3.5 holds for the correlation model also. Such a conditional argument
Section 3.5 holds for the correlation model also. Such a conditional argument
is demonstrated in Theorem 3.11.2 below. For later discussion we summarize
the rank-based inference for the correlation model. Given a specified score
function ϕ, let βb denote the R estimate of β defined in Section 3.2. Under
ϕ
the correlation model (3.11.1) and the assumptions (3.11.4), (S.1), (3.4.10),
√ b D 2 −1
and (3.11.5) n(β ϕ − β) → Np (0, τϕ Σ ). Also the estimates of τϕ discussed
in Section 3.7.1 are consistent estimates of τϕ under the correlation model. Let
τbϕ denote such an estimate. In terms of testing, consider the R test statistic,
Fϕ = (RD/p)/(b τϕ /2), of the above hypothesis H0 of independence. Employing
D
the usual conditional
√ argument, it follows that pFϕ → χ2 (p, δR ), a.e. M under
Hn : β = θ/ n where the noncentrality parameter δR is given by δ =
θ ′ Σθ/τϕ2 .
Likewise, for the LS estimate β̂LS of β, using the conditional argument (see Arnold (1980) for details), √n(β̂LS − β) →_D Np(0, σ²Σ⁻¹) and, under Hn, pFLS →_D χ²(p, δLS) with noncentrality parameter δLS = θ′Σθ/σ². Thus the ARE of the R test Fϕ to the least squares test FLS is the ratio of noncentrality parameters, σ²/τϕ². This is the usual ARE of rank tests to tests based on least squares in simple location models. Hence the test statistic Fϕ has efficiency robustness. The theory of rank-based tests in Section 3.6 applies to the correlation model.
We return to measures of association and their estimates. For motivation,
we consider the least squares measure first.

3.11.2 Traditional Measure of Association and Its Estimate

The traditional population coefficient of multiple determination (CMD) is defined by

R̄² = β′Σβ/(σe² + β′Σβ) ; (3.11.11)

see Arnold (1981). Note that R̄² is a measure of association between Y and x. It lies between 0 and 1 and it is 0 if and only if Y and x are independent
(because Y and x are independent if and only if β = 0).
In order to obtain a consistent estimate of R̄², treat xi as nonstochastic and fit by least squares the model Yi = α + xi′β + ei, which is called the full model. The residual amount of variation is SSE = Σ_{i=1}^n (Yi − α̂LS − xi′β̂LS)², where β̂LS and α̂LS are the least squares estimates. Next fit the reduced model, defined as the full model subject to H0 : β = 0. The total amount of variation is SST = Σ_{i=1}^n (Yi − Ȳ)². The reduction in variation in fitting the full model over the reduced model is SSR = SST − SSE. An estimate of R̄² is the proportion of explained variation given by

R² = SSR/SST . (3.11.12)
The least squares test statistic for H0 versus HA is FLS = (SSR/p)/σ̂LS², where σ̂LS² = SSE/(n − p − 1). Recall that R² can be expressed as

R² = SSR/(SSR + (n − p − 1)σ̂LS²) = [p/(n − p − 1)] FLS / (1 + [p/(n − p − 1)] FLS) . (3.11.13)
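The identity (3.11.13) can be checked numerically on any least squares fit; a sketch, assuming the response is in y and the design matrix (without intercept) in xmat:

fit <- lm(y ~ xmat)
n <- length(y); p <- ncol(xmat)
Fls <- summary(fit)$fstatistic[1]
r <- (p/(n - p - 1))*Fls
all.equal(unname(r/(1 + r)), summary(fit)$r.squared)   # TRUE, by (3.11.13)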
Now consider the general correlation model. As shown in Arnold (1980), under (3.11.4) and (3.11.5), R² is a consistent estimate of R̄². Under the multivariate normal model, R² is the maximum likelihood estimate of R̄².

3.11.3 Robust Measure of Association and Its Estimate

The rank-based analogue of the reduction in residual variation is the reduction in residual dispersion, which is given by RD = D(0) − D(β̂R). Hence, the proportion of dispersion explained by fitting β is

R1 = RD/D(0) . (3.11.14)

This is a natural CMD for any robust estimate and, as we show below, the population CMD for which R1 is a consistent estimate does satisfy interesting properties. As expression (A.5.11) of the Appendix shows, however, the influence function of the denominator is not bounded in the Y-space. Hence the statistic R1 is not robust.
In order to obtain a CMD which is robust, consider the test statistic of H0, Fϕ = (RD/p)/(τ̂ϕ/2), (3.6.10). As we indicated above, the test statistic Fϕ has efficiency robustness. Furthermore, as shown in the Appendix, the influence function of Fϕ is bounded in the Y-space. Hence the test statistic is robust.
Consider the relationship between the classical F-test and R² given by expression (3.11.13). In the same way, but using the robust test Fϕ, we can define a second R coefficient of multiple determination

R2 = [p/(n − p − 1)] Fϕ / (1 + [p/(n − p − 1)] Fϕ) = RD/(RD + (n − p − 1)(τ̂ϕ/2)) . (3.11.15)
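Both statistics are immediate once a rank-based fit is in hand. A sketch, assuming D0 = D(0), Dfit = D(β̂R), and tauhat = τ̂ϕ have been extracted from one's fitting routine (the function name cmdR is ours):

cmdR <- function(D0, Dfit, tauhat, n, p) {
  RD <- D0 - Dfit                           # reduction in dispersion
  c(R1 = RD/D0,                             # (3.11.14), not robust
    R2 = RD/(RD + (n - p - 1)*(tauhat/2)))  # (3.11.15), robust
}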


It follows from the above discussion on the R test statistic that the influence function of R2 is bounded in the Y-space.
The parameters that respectively correspond to the statistics D(0) and D(β̂R) are D̄y = ∫ ϕ(G(y)) y dG(y) and D̄e = ∫ ϕ(F(e)) e dF(e); see the discussion in Section 3.6.3. The population CMDs associated with R1 and R2 are:

R̄1 = R̄D/D̄y (3.11.16)
R̄2 = R̄D/(R̄D + (τϕ/2)) , (3.11.17)

where R̄D = D̄y − D̄e. The properties of these parameters are discussed in the next section. The consistency of R1 and R2 is given in the following theorem:

Theorem 3.11.2. Under the correlation model (3.11.1) and the assumptions (E.1), (2.4.16), (S.1), (3.4.10), (S.2), (3.4.11), and (3.11.5),

Ri →_P R̄i , a.e. M , i = 1, 2 .

Proof: Note that we can write

(1/n) D(0) = (1/n) Σ_{i=1}^n ϕ( (n/(n+1)) F̂n(Yi) ) Yi = ∫ ϕ( (n/(n+1)) F̂n(t) ) t dF̂n(t) ,

where F̂n denotes the empirical distribution function of the random sample Y1, . . . , Yn. As n → ∞ the integral converges to D̄y.
Next consider the reduction in dispersion. By Theorem 3.11.1, with probability 1, we can restrict the sample space to a space on which Huber's design condition (D.1) holds and on which n⁻¹X′X → Σ. Then, conditionally given X, we have the assumptions found in Section 3.4 for the nonstochastic model. Hence, from the discussion found in Section 3.6.3, (1/n)D(β̂R) →_P D̄e; hence this convergence is true unconditionally, a.e. M. The consistency of τ̂ϕ was discussed above. The result then follows.

Example 3.11.1 (Measures of Association for Wilcoxon Scores). For the Wilcoxon score function, ϕW(u) = √12 (u − 1/2), as Exercise 3.15.34 shows, D̄y = ∫ ϕ(G(y)) y dG(y) = √(3/4) E|Y1 − Y2|, where Y1, Y2 are iid with distribution function G. Likewise, D̄e = √(3/4) E|e1 − e2|, where e1, e2 are iid with distribution function F. Finally, τϕ = (√12 ∫ f²)⁻¹. Hence for Wilcoxon scores these coefficients of multiple determination simplify to

R̄W1 = (E|Y1 − Y2| − E|e1 − e2|)/E|Y1 − Y2| (3.11.18)
R̄W2 = (E|Y1 − Y2| − E|e1 − e2|)/(E|Y1 − Y2| − E|e1 − e2| + 1/(6 ∫ f²)) . (3.11.19)

As discussed above, in general, RW1 is not robust but RW2 is robust.


Example 3.11.2 (Measures of Association for Sign Scores). For the sign score function, Exercise 3.15.34 shows that D̄y = ∫ ϕ(G(y)) y dG(y) = E|Y − med Y|, where med Y denotes the median of Y. Likewise, D̄e = E|e − med e|. Hence for sign scores the coefficients of multiple determination are

R̄S1 = (E|Y − med Y| − E|e − med e|)/E|Y − med Y| (3.11.20)
R̄S2 = (E|Y − med Y| − E|e − med e|)/(E|Y − med Y| − E|e − med e| + (4f(med e))⁻¹) . (3.11.21)

These were obtained by McKean and Sievers (1987) from an l1 point of view.

3.11.4 Properties of R Coefficients of Multiple Determination

In this section we explore further properties of the population coefficients of multiple determination proposed in the last section. To show that R̄1 and R̄2, (3.11.16) and (3.11.17), are indeed measures of association we have the following two theorems. The proof of the first theorem is quite similar to corresponding proofs of properties of the dispersion function for the nonstochastic model.
Theorem 3.11.3. Suppose f and g satisfy the condition (E.1), (3.4.1), and their first moments are finite. Then D̄y > 0 and D̄e > 0, where D̄y = ∫ ϕ(G(y)) y dG(y).

Proof: It suffices to prove the result for D̄y since the proof for D̄e is the same. The function ϕ is increasing and ∫₀¹ ϕ(u) du = 0; hence, ϕ must take on both negative and positive values. Thus the set A = {y : ϕ(G(y)) < 0} is not empty and is bounded above. Let y0 = sup A. Since ∫ ϕ(G(y)) dG(y) = 0, we can write

D̄y = ∫_{−∞}^{y0} ϕ(G(y))(y − y0) dG(y) + ∫_{y0}^{∞} ϕ(G(y))(y − y0) dG(y) . (3.11.22)

Since both integrands are nonnegative, it follows that D̄y ≥ 0. If D̄y = 0 then it follows from (E.1) that ϕ(G(y)) = 0 for all y ≠ y0, which contradicts


the facts that ϕ takes on both positive and negative values and that G is
absolutely continuous.
The next theorem is taken from Witt (1989).
Theorem 3.11.4. Suppose f and g satisfy the conditions (E.1) and (E.2) in Section 3.4 and that ϕ satisfies assumption (S.2), (3.4.11). Then R̄D is a strictly convex function of β and has a minimum value of 0 at β = 0.
Proof: We show that the gradient of R̄D is zero at β = 0 and that its second derivative matrix is positive definite. Note first that the distribution function, G, and density, g, of Y can be expressed as G(y) = ∫ F(y − β′x) dM(x) and g(y) = ∫ f(y − β′x) dM(x). We have

∂R̄D/∂β = −∫∫∫ ϕ′[G(y)] y f(y − β′x) f(y − β′u) u dM(x) dM(u) dy − ∫∫ ϕ[G(y)] y f′(y − β′x) x dM(x) dy . (3.11.23)

Since E[x] = 0, both terms on the right side of the above expression are 0 at β = 0. Before obtaining the second derivative, we rewrite the first term of (3.11.23) as

−∫ [ ∫∫ ϕ′[G(y)] y f(y − β′x) f(y − β′u) dy dM(x) ] u dM(u) = −∫ [ ∫ ϕ′[G(y)] g(y) y f(y − β′u) dy ] u dM(u) .

Next integrate the expression in brackets by parts with respect to y, using dv = ϕ′[G(y)]g(y) dy and t = y f(y − β′u). Since ϕ is bounded and f has a finite second moment, this leads to

∂R̄D/∂β = ∫∫ ϕ[G(y)] f(y − β′u) u dy dM(u) + ∫∫ ϕ[G(y)] y f′(y − β′u) u dy dM(u) − ∫∫ ϕ[G(y)] y f′(y − β′x) x dy dM(x)
        = ∫∫ ϕ[G(y)] f(y − β′u) u dy dM(u) ,

since the last two integrals, differing only in the name of the dummy variable, cancel.

Hence the second derivative of R̄D is

∂²R̄D/∂β∂β′ = −∫∫ ϕ[G(y)] f′(y − β′x) x x′ dy dM(x) − ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) x u′ dy dM(x) dM(u) . (3.11.24)


Now integrate the first term on the right side of (3.11.24) by parts with respect to y, using dt = f′(y − β′x) dy and v = ϕ[G(y)]. This leads to

∂²R̄D/∂β∂β′ = −∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) x (u − x)′ dy dM(x) dM(u) . (3.11.25)

We have, however, the following identity:

∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) (u − x)(u − x)′ dy dM(x) dM(u)
  = ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) u (u − x)′ dy dM(x) dM(u)
  − ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) x (u − x)′ dy dM(x) dM(u) .

By the symmetry in the dummy variables x and u, the two integrals on the right side of the last expression are negatives of each other; this, combined with expression (3.11.25), leads to

2 ∂²R̄D/∂β∂β′ = ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) (u − x)(u − x)′ dy dM(x) dM(u) .

Since the functions f and M are continuous and the score function is increasing, it follows that the right side of this last expression is a positive definite matrix.
It follows from these theorems that the R̄i s satisfy properties of association similar to those of R̄². We have 0 ≤ R̄i ≤ 1. By Theorem 3.11.4, R̄i = 0 if and only if β = 0, if and only if Y and x are independent.

Example 3.11.3 (Multivariate Normal Model). Further understanding of the R̄i can be gleaned from their direct relationship with R̄² for the multivariate normal model.

Theorem 3.11.5. Suppose Model (3.11.1) holds. Assume further that (x, Y) follows a multivariate normal distribution with variance-covariance matrix

Σ(x,Y) = [ Σ    Σβ ; β′Σ    σe² + β′Σβ ] . (3.11.26)

Then, from (3.11.16) and (3.11.17),

R̄1 = 1 − √(1 − R̄²) (3.11.27)
R̄2 = [1 − √(1 − R̄²)] / [1 − √(1 − R̄²)(1 − 1/(2T²))] , (3.11.28)



where T = ∫ ϕ[Φ(t)] t dΦ(t), Φ is the standard normal distribution function, and R̄² is the traditional coefficient of multiple determination given by (3.11.11).

Proof: Note that σy² = σe² + β′Σβ and E(Y) = α + β′E[x]. Further, the distribution function of Y is G(y) = Φ((y − α − β′E(x))/σy), where Φ is the standard normal distribution function. Then

D̄y = ∫_{−∞}^{∞} ϕ[Φ(y/σy)] y dΦ(y/σy) (3.11.29)
    = σy T . (3.11.30)

Similarly, D̄e = σe T. Hence,

R̄D = (σy − σe) T . (3.11.31)

By the definition of R̄², we have R̄² = 1 − σe²/σy². This leads to the relationship

1 − √(1 − R̄²) = (σy − σe)/σy . (3.11.32)

The result (3.11.27) follows from the expressions (3.11.31) and (3.11.32).
For the result (3.11.28), by the assumptions on the distribution of (x, Y), the distribution of e is N(0, σe²); i.e., f(x) = (2πσe²)^{−1/2} exp{−x²/(2σe²)} and F(x) = Φ(x/σe). It follows that f′(x)/f(x) = −σe⁻²x, which leads to

−f′(F⁻¹(u))/f(F⁻¹(u)) = (1/σe) Φ⁻¹(u) .

Hence,

τϕ⁻¹ = ∫₀¹ ϕ(u) [(1/σe) Φ⁻¹(u)] du = (1/σe) ∫₀¹ ϕ(u) Φ⁻¹(u) du .

Making the substitution u = Φ(t), we obtain the relationship T = σe/τϕ. Using this, the result (3.11.31), and the definition (3.11.17) of R̄2, we get

R̄2 = [(σy − σe)/σy] / [ (σy − σe)/σy + (σe/σy)(1/(2T²)) ] .

The result for R̄2 follows from this and (3.11.32).


Note that T is free of all parameters. It can be shown directly that the R̄i s are one-to-one increasing functions of R̄²; see Exercise 3.15.35. Hence, for the multivariate normal model the parameters R̄², R̄1, and R̄2 are equivalent.
Although the CMDs are equivalent for the normal model, they measure dependence between x and Y on different scales. We can use the relationships derived in the last theorem to have these coefficients measure the same quantity at the normal model by simply solving for R̄² in terms of R̄1 and R̄2 in (3.11.27) and (3.11.28), respectively. These parameters are useful later, so we call them R̄1* and R̄2*, respectively. Hence, solving as indicated, we get

R̄1*² = 1 − (1 − R̄1)² (3.11.33)
R̄2*² = 1 − [ (1 − R̄2) / (1 − R̄2(1 − 1/(2T²))) ]² . (3.11.34)

Again, at the multivariate normal model we have R̄² = R̄1*² = R̄2*².
For Wilcoxon scores and sign scores the reader is asked to show, in Exercise 3.15.36, that 1/(2T²) = π/6 and 1/(2T²) = π/4, respectively.

Example 3.11.4 (A Contaminated Normal Model). As an illustration of these population coefficients of multiple determination, we evaluate them for the situation where the random error e has a contaminated normal distribution with proportion of contamination ǫ and ratio of contaminated to uncontaminated variance σc², the random variable x has a univariate normal N(0, 1) distribution, and the parameter β = 1, so that β′Σβ = 1. Without loss of generality, we took α = 0 in (3.11.1). Hence Y and x are dependent. We consider the CMDs based on the Wilcoxon score function only.
The density of Y = x + e is given by

g(y) = [(1 − ǫ)/√2] φ(y/√2) + [ǫ/√(1 + σc²)] φ(y/√(1 + σc²)) .

This leads to the expressions

D̄y = (√12/√(2π)) { 2^{−1/2}(1 − ǫ)²√2 + 2^{−1/2}ǫ²√(1 + σc²) + ǫ(1 − ǫ)[3 + σc²]^{1/2} }
D̄e = (√12/√(2π)) { 2^{−1/2}(1 − ǫ)² + 2^{−1/2}ǫ²σc + ǫ(1 − ǫ)√(1 + σc²) }
τϕ = [ (√12/√(2π)) { (1 − ǫ)²/√2 + ǫ²/(σc√2) + 2ǫ(1 − ǫ)/√(σc² + 1) } ]⁻¹ ;


Table 3.11.1: Coefficients of Multiple Determination under Contaminated Normal Errors (e)

                e ~ CN(ǫ, σc² = 9)                  e ~ CN(ǫ, σc² = 100)
                        ǫ                                    ǫ
CMD       .00   .01   .02   .05   .10   .15    .00   .01   .02   .05   .10   .15
R̄²        .50   .48   .46   .42   .36   .31    .50   .33   .25   .14   .08   .06
R̄1*²      .50   .50   .48   .45   .41   .38    .50   .47   .42   .34   .26   .19
R̄2*²      .50   .50   .49   .47   .44   .42    .50   .49   .47   .45   .40   .36

see Exercise 3.15.37. Based on these quantities the coefficients of multiple determination R̄², R̄1, and R̄2 can be readily formulated.
Table 3.11.1 displays these parameters for several values of ǫ and for σc² = 9 and 100. For ease of interpretation we rescaled the robust CMDs as discussed above. Thus at the normal (ǫ = 0) we have R̄1*² = R̄2*² = R̄², with the common value of .5 in these situations. Certainly as either ǫ or σc changes, the amount of dependence between Y and x changes; hence all the coefficients change somewhat. However, R̄² decays as the percentage of contamination increases, and the decay is rapid in the case σc² = 100. This is true also, to a lesser degree, for R̄1*², which is predictable since the denominator of R1 has unbounded influence in the Y-space. The coefficient R̄2*² shows stability with the increase in contamination. For instance, when σc² = 100, R̄² decays .44 units while R̄2*² decays only .14 units. See Witt et al. (1995) for more discussion on this example.
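The entries of Table 3.11.1 can be reproduced in R from the closed-form expressions above. The following sketch (cmdCN is our own function) applies the rescalings (3.11.33) and (3.11.34), with 1/(2T²) = π/6 for Wilcoxon scores:

cmdCN <- function(eps, sigc2) {
  k  <- sqrt(12)/sqrt(2*pi); sc <- sqrt(sigc2)
  Dy <- k*((1 - eps)^2 + eps^2*sqrt(1 + sigc2)/sqrt(2) +
             eps*(1 - eps)*sqrt(3 + sigc2))
  De <- k*((1 - eps)^2/sqrt(2) + eps^2*sc/sqrt(2) +
             eps*(1 - eps)*sqrt(1 + sigc2))
  tau <- 1/(k*((1 - eps)^2/sqrt(2) + eps^2/(sc*sqrt(2)) +
                 2*eps*(1 - eps)/sqrt(sigc2 + 1)))
  RD  <- Dy - De
  R1b <- RD/Dy; R2b <- RD/(RD + tau/2)
  c(R2bar   = 1/(1 + (1 - eps) + eps*sigc2),  # (3.11.11); Var(e) = (1-eps) + eps*sigc2
    R1star2 = 1 - (1 - R1b)^2,                # (3.11.33)
    R2star2 = 1 - ((1 - R2b)/(1 - R2b*(1 - pi/6)))^2)  # (3.11.34)
}
round(cmdCN(.10, 100), 2)   # .08, .26, .40, as in Table 3.11.1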

Ghosh and Sen (1971) proposed the mixed rank test statistic to test the
hypothesis of independence (3.11.3). It is essentially the gradient test of the
hypothesis H0 : β = 0. As we showed in Section 3.6, this test statistic is
asymptotically equivalent to Fϕ . Ghosh and Sen (1971), also, proposed a pure
rank statistic in which both variables are ranked and scored.

3.11.5 Coefficients of Determination for Regression

We have mainly been concerned with coefficients of multiple determination as measures of dependence between the random variables Y and x. In the regression setting, though, the statistic R² is one of the most widely used statistics, not in the sense of estimating dependence but in the sense of comparing models. As the proportion of variance accounted for, R² is intuitively appealing. Likewise R1, the proportion of dispersion accounted for in the fit, is an intuitive statistic. But neither of these statistics is robust. The statistic R2, though, is robust and is directly linked (a one-to-one function) to the robust test statistic Fϕ. Furthermore, it lies between 0 and 1, having the value 1 for a perfect fit and 0 for a complete lack of fit. These properties make R2 an


Table 3.11.2: Coefficients of Multiple Determination on Hald Data

                 Original Data        Changed Data
Predictors      R²    R1    R2       R²    R1    R2
{x1, x2}       .98   .86   .92      .57   .55   .92
{x1, x3}       .55   .33   .52      .47   .24   .41
{x1, x4}       .97   .84   .90      .52   .51   .88
{x2, x3}       .85   .63   .76      .66   .46   .72
{x2, x4}       .68   .46   .62      .34   .27   .57
{x3, x4}       .94   .76   .89      .67   .52   .83

attractive coefficient of determination for regression, as the following example illustrates.
Example 3.11.5 (Hald Data). These data consist of thirteen observations
and four predictors. It can be found in Hald (1952) but it is also discussed
in Draper and Smith (1966) where it serves to illustrate a method of predic-
tor subset selection based on R2 . For convenience, we have placed this data
(Hald Data) at the url cited in the Preface. The response is the heat evolved
in calories per gram of cement. The predictors are the percent in weight of
ingredients used in the cement and are given by:

x1 = amount of tricalcium aluminate
x2 = amount of tricalcium silicate
x3 = amount of tetracalcium alumino ferrite
x4 = amount of dicalcium silicate .

To illustrate the use of the coefficients of determination R1 and R2, suppose we are interested in selecting the best two-variable predictor model. Table 3.11.2 gives the results for two data sets. The first is the original Hald data, while in the second we changed the 11th response observation from 83.8 to 8.8.
Note that on the original data all three coefficients choose the subset {x1, x2}. For the changed data, though, the outlier severely affects the LS coefficient R² and the nonrobust coefficient R1, but the robust coefficient R2 was much less sensitive to the outlier. It chooses the same subset {x1, x2} as it did with the original data; the LS coefficient, however, selects the subset {x3, x4}, two different predictors than its selection for the original data. The nonrobust coefficient R1 still chooses {x1, x2}, although at a relatively much smaller value.
The last example illustrates that the coefficient R2 can be used in the
selection of predictors in a regression problem. This selection could be


formalized like the MAXR procedure in SAS. In a similar vein, the stepwise
model building criteria based on LS estimation (Draper and Smith, 1966)
could easily be robustified by using R estimates in place of LS estimates and
the robust test statistic Fϕ in place of FLS .

3.12 High Breakdown (HBR) Estimates

By (3.5.17), the influence function of the R estimate is unbounded in the x-space. While in a designed experiment this is of little consequence, for nondesigned experiments where there are widely dispersed xs (i.e., outliers in factor space), this is of some concern. In this section we present R estimators which have influence functions bounded in both spaces and which can attain 50% breakdown. We call these estimators high breakdown R (HBR) estimators. Further, we derive diagnostics which differentiate between fits based on these estimators, R estimators, and LS estimators. Tableman (1990) provides an alternative development of bounded influence R estimates.

3.12.1 Geometry of the HBR Estimates


Consider the linear model (3.2.3). In the preceding sections of this chapter, estimation and testing are based on the pseudo-norm (3.2.6). Here we consider the function

‖u‖_HBR = Σ_{i<j} bij |ui − uj| , (3.12.1)

where the weights bij are positive and symmetric, i.e., bij = bji. It is then easy to show (see Exercise 3.15.42) that the function (3.12.1) is a pseudo-norm. As noted in Section 2.2.2, if the weights bij ≡ 1, then this pseudo-norm is proportional to the pseudo-norm based on the Wilcoxon scores. Hence we refer to it as a generalized R (HBR) pseudo-norm.
Since this is a pseudo-norm, we can develop estimation and testing procedures using the same geometry as before. Briefly, the HBR estimate of β in model (3.2.3) is a vector β̂_HBR such that

β̂_HBR = Argmin_β ‖Y − Xβ‖_HBR . (3.12.2)

Equivalently, we can define the dispersion function

D_HBR(β) = ‖Y − Xβ‖_HBR . (3.12.3)

Since it is based on a pseudo-norm, D_HBR is a continuous, nonnegative, convex function of β. The negative of its gradient is given by

S_HBR(β) = Σ_{i<j} bij (xi − xj) sgn[(Yi − Yj) − (xi − xj)′β] . (3.12.4)


Thus the HBR estimate solves the equation S_HBR(β) ≐ 0. In the next subsection, we discuss the selection of the weights bij.
The HBR estimates were proposed by Chang, McKean, Naranjo, and Sheather (1999). As we noted in Section 3.1, these estimates are easily computed using the package ww; see Terpstra and McKean (2005).
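For concreteness, the dispersion (3.12.3) and the negative gradient (3.12.4) can be computed directly. The following O(n²) R sketch, for a given weight matrix bmat, is our own and not the algorithm used by ww:

dhbr <- function(beta, x, y, bmat) {        # dispersion (3.12.3)
  u <- c(y - x %*% beta)
  d <- abs(outer(u, u, "-"))                # |u_i - u_j|
  sum((bmat*d)[upper.tri(d)])               # sum over i < j
}
shbr <- function(beta, x, y, bmat) {        # negative gradient (3.12.4)
  u <- c(y - x %*% beta)
  M <- bmat*sign(outer(u, u, "-"))          # M_ij = b_ij sgn(u_i - u_j)
  c(t(x) %*% rowSums(M))                    # equals S_HBR(beta)
}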

3.12.2 Weights
The weight for a point (xi , Yi), i = 1, . . . , n, for the HBR estimates is a function
of two components. One component depends on the “distance” of the point
xi from the center of the X-space (factor space) and the other component
depends on the size of the residual based on an initial high breakdown fit. As
shown below, these components are used in combination, so the weight due to
one component may be offset by the weight of the other component.
First, we consider distance in factor space. It seems reasonable to down-
weight points far from the center of the data. The leverage values hi =
n−1 + x′ci (X′c Xc )−1 xci , for i = 1, . . . , n, measure distance (Mahalanobis) from
the center relative to the scatter matrix X′c Xc . Leverage values, though, are
based on means and the usual (LS) variance-covariance scatter matrix which
are not robust estimators. There are several robust estimators of location and
scatter from which to choose, including the high breakdown minimum co-
variance determinant (MCD) which is an ellipsoid that covers about half
of the data and yet has minimum determinant. Although computationally
intensive, Rousseeuw and Van Driessen (1999) present a fast computational
algorithm for it. Let vc denote the center of the ellipsoid. Letting V denote
the MCD, the sample covariance of the points covered, the robust distances
are given by
Qi = (xi − vc)′ V⁻¹ (xi − vc) . (3.12.1)

We define the associated weights by wi = min{1, c/Qi}, where c is usually set at the 95th percentile of the χ²(p) distribution. Note that “good” points generally have weights 1.
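A sketch of these distances and weights in R, using the fast MCD as implemented in, e.g., the robustbase package (the function names Qdist and wGR are ours):

library(robustbase)
Qdist <- function(xmat) {               # robust distances (3.12.1)
  mcd <- covMcd(xmat)
  mahalanobis(xmat, mcd$center, mcd$cov)
}
wGR <- function(xmat) {                 # w_i = min{1, c/Q_i}
  pmin(1, qchisq(.95, df = ncol(xmat))/Qdist(xmat))
}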
The class of GR estimates proposed by Sievers (1983) use weights of the
form bij = wi wj which depend only on distance in factor space. As shown by
Naranjo and Hettmansperger (1994), these estimates have positive breakdown
and bounded influence in factor space, but as Exercise 3.15.41 shows they are
always less efficient than the Wilcoxon estimates, unless all the weights are 1.
Further, at times, the loss in efficiency can be severe; see Chang et al. (1999)
for discussion. One reason is that “good” points of high leverage (points that
follow the model) are downweighted by the same amount as points at the same


distance from the center of factor space but which do not follow the model (“bad” points of high leverage). The asymptotic variance of the GR estimators is also given in Exercise 3.15.41.
The weights for the HBR estimates are a function of the GR weights and residual information from the Y-space. The residuals are based on a high breakdown initial estimate of the regression coefficients. We have chosen to use the least trimmed squares (LTS) estimate, which is given by

Argmin Σ_{i=1}^{h} [Y − α − x′β]²_{(i)} , (3.12.2)

where h = [n/2] + 1 and where the notation (i) denotes the ith ordered absolute residual; see Rousseeuw and Van Driessen (1999). Let ê⁽⁰⁾ denote the residuals from this initial fit.
Define the function ψ(t) by ψ(t) = 1, t, or −1 according as t ≥ 1, −1 < t < 1, or t ≤ −1. Let σ be estimated by the initial scaling estimate MAD = 1.483 med_i |êi⁽⁰⁾ − med_j {êj⁽⁰⁾}|. Recall the robust distances Qi defined in expression (3.12.1). Let

mi = ψ(b/Qi) = min{1, b/Qi} ,

and consider the weights


b̂ij = min{ 1, [c σ̂²/(|êi| |êj|)] min{1, b/Q̂i} min{1, b/Q̂j} } , (3.12.3)

where the tuning constants b and c are both set at 4. From this point of view, it is clear that these weights downweight both outlying points in factor space and outlying responses. Note that the initial residual information enters the weight function as a multiplicative factor. Hence, a good leverage point generally has a small (in absolute value) initial residual which offsets its distance in factor space. The following example illustrates the differences among the Wilcoxon, GR, and HBR estimates.
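Under the form of (3.12.3) displayed above, the weights can be sketched in a few lines of R, given the LTS residuals e0 and the robust distances Q (hbrWeights is our own name; mad is R's scaled median absolute deviation, 1.4826 med|x − med x|):

hbrWeights <- function(e0, Q, b = 4, cc = 4) {
  sig <- mad(e0)                        # initial scale estimate
  m   <- pmin(1, b/Q)                   # the m_i = psi(b/Q_i)
  pmin(1, cc*sig^2*outer(m/abs(e0), m/abs(e0)))   # n x n matrix of b_ij
}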

Example 3.12.1 (Stars Data). This data set is drawn from an astronomy
study on the star cluster CYG OB1 which contains 47 stars; see Rousseeuw
and Leroy (1987) for a discussion on the history of the data set. The response is
the logarithm of the light intensity of the star, while the independent variable is the logarithm of the temperature of the star. For convenience, we have placed the Stars Data at the url cited in the Preface. The data are also shown in Panel A of Figure 3.12.1. Note that four of the stars, called giants, form a cluster of outliers in factor space while the rest of the stars fall in a point cloud. Panel A


shows also the overlay plot of the LS and Wilcoxon fits. Note that the cluster of four outliers in the x-space has exerted such a strong influence on the fits that it has drawn the LS and Wilcoxon fits towards the cluster. This behavior is predictable based on the influence functions of these estimates.
These four giant cases have very large robust distances from the center of the
data. Hence the weights used by the GR estimates severely downweight these
points, resulting in its fit through the point cloud. For this data set, the initial
LTS fit ignores the four giant stars and fits the point cloud. Hence, the four
giant stars are “bad” leverage points and, hence, are downweighted for the
HBR fit, also.
The ww command to compute the GR or HBR estimates is the same as for the Wilcoxon estimates, wwest, except that the argument for the weight indicator bij is set at bij="GR" or bij="HBR", respectively. For example, suppose the design matrix without the intercept column is in the variable xmat and the response vector is in the variable y. Then the following R commands return the LS, Wilcoxon, GR, and HBR estimates:

# with the ww functions loaded
ls.fit  = lm(y ~ xmat)
wil.fit = wwest(xmat, y, bij = "WIL", print.tbl = F)
gr.fit  = wwest(xmat, y, bij = "GR",  print.tbl = F)
hbr.fit = wwest(xmat, y, bij = "HBR", print.tbl = F)

Example 3.12.2 (Stars Data, continued). Suppose in the last example that
we had no subject matter available concerning the data set. Then based on
the scatterplot, we may decide to fit a quadratic model. The plots of the LS,
Wilcoxon, GR, and HBR fits for the quadratic model are found in Panel B of
Figure 3.12.1. The quadratic fits based on the LS, Wilcoxon, and HBR esti-
mates follow the curvature in the data, while the GR fit misses the curvature
resulting in a very poor fit. For the quadratic model, the cluster of four giant
stars are “good” data points and the HBR weights take this into account.
The weights used for the GR fit, however, ignore this residual information
and severely downweight the four giant star cases, resulting in the poor fit as
shown in the figure.
The last two plots in the figure, Panels C and D, are the residual plots
for the GR and HBR fits. Based on their fits, the LS and Wilcoxon residual
plots are the same as the HBR. The pattern in the GR residual plot (Panel
C), while not random, does not indicate how to proceed with model selection.
This is often true for residual plots based on high breakdown fits; see McKean
et al. (1993).


Figure 3.12.1: Panel A: Stars data overlaid with LS, Wilcoxon, GR, and HBR fits of the linear model; Panel B: fits of the quadratic model; Panel C: GR residual plot for the quadratic model; Panel D: HBR residual plot for the quadratic model.

3.12.3 Asymptotic Normality of β̂_HBR
The asymptotic normality of the HBR estimates was developed by Chang (1995) and Chang et al. (1999). Much of our development is in Appendix A.6, which is taken from the latter article. Our discussion is for general weights, under assumptions that we specify as we proceed. In order to establish the asymptotic normality of β̂_HBR, we need some further notation and assumptions. Define the parameters

γij = B′ij(0)/Eβ(bij) , for 1 ≤ i, j ≤ n , (3.12.4)

where
Bij(t) = Eβ[ bij I(0 < Yi − Yj < t) ] . (3.12.5)


Consider the symmetric n × n matrix An = [aij] defined by

aij = −γij bij , i ≠ j ;    aii = Σ_{k≠i} γik bik . (3.12.6)

Define the p × p matrix Cn as

Cn = X′AnX . (3.12.7)

Since the rows and columns of An sum to zero, it can be shown that

Cn = Σ_{i<j} γij bij (xj − xi)(xj − xi)′ ; (3.12.8)

see Exercise 3.15.45. Let

Ui = (1/n) Σ_{j=1}^n (xj − xi) E( bij sgn(Yj − Yi) | Yi ) . (3.12.9)

Besides assumptions (E.1), (3.4.1), (D.2), (3.4.7), and (D.3), (3.4.8) of Chapter 3, we additionally assume:

(H.1) For some matrix C_H, n⁻²Cn = n⁻²X′AnX →_P C_H. (3.12.10)
(H.2) For some matrix Σ_H, (1/n) Σ_{i=1}^n Var(Ui) → Σ_H. (3.12.11)
(H.3) √n(β̂⁽⁰⁾ − β) →_D N(0, Ξ), where β̂⁽⁰⁾ is the initial estimator and Ξ is a positive definite matrix. (3.12.12)
(H.4) The function bij = g(xi, xj, yi, yj, β̂⁽⁰⁾) ≡ gij(β̂⁽⁰⁾) is continuous, and the gradient ∇gij is bounded uniformly in i and j. (3.12.13)

For the correlation model, an explicit expression can be given for the matrix C_H assumed in (H.1); see (3.12.24) and, also, Lemma 3.12.1.
As our theory shows, the HBR estimate attains 50% breakdown (Section 3.12.4) and asymptotic normality, at rate √n, provided the initial estimates of the regression coefficients have these qualities. One such estimate is the least trimmed squares, LTS, which is given by expression (3.12.2). Another class of such estimates are the rank-based estimates proposed by Hössjer (1994); see also Croux, Rousseeuw, and Hössjer (1994).
The development of the theory for β̂_HBR proceeds similar to that of the R estimates. The theory is sketched in the Appendix, Section A.6; here we present only the two main results: the asymptotic distribution of the gradient and the asymptotic distribution of the estimate.


Theorem 3.12.1. Under assumptions (E.1), (3.4.1), and (H.1)-(H.4), (3.12.10)-(3.12.13),

n^{−3/2} S_HBR(0) →_D N(0, Σ_H) .
The proof of this theorem proceeds along the same lines as the theory used to obtain the null distribution of the gradients of the R estimates. The projection of S_HBR(0) is first determined and its asymptotic distribution is established as N(0, Σ_H). The result then follows upon showing that the difference between S_HBR(0) and its projection goes to zero in second mean; see Theorem A.6.4 for details. The following theorem gives the asymptotic distribution of β̂_HBR.

Theorem 3.12.2. Under assumptions (E.1), (3.4.1), and (H.1)-(H.4), (3.12.10)-(3.12.13),

√n(β̂_HBR − β) →_D N( 0, (1/4) C_H⁻¹ Σ_H C_H⁻¹ ) .

For inference, we say that β̂_HBR is approximately N(β, K_HBR), (3.12.14), where

K_HBR = (1/(4n)) C_H⁻¹ Σ_H C_H⁻¹ . (3.12.15)
The proof of this theorem is similar to that of the corresponding theorem
for R estimates. First asymptotic linearity and quadraticity are established.
These results are then combined with Theorem 3.12.1 to yield the result; see
Theorem A.6.1 of the Appendix for details.
The following lemma derives another representation of the limiting matrix C_H, which proves useful both in the derivation of the influence function of β̂_HBR found in the next section and in Section 3.12.6, which concerns the implementation of these high breakdown estimates. For what follows, assume without loss of generality that the true parameter value is β = 0. Let gij(β̂⁽⁰⁾) ≡ b(xi, xj, yi, yj, β̂⁽⁰⁾) denote the weights as a function of the initial estimator, and let gij(0) ≡ b(xi, xj, yi, yj) denote the weight function evaluated at the true value β = 0. The following result is proved in Lemma A.6.1 of the Appendix:

Bij(t) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} b(xi, xj, yj + t, yj, β̂⁽⁰⁾) f(yj + t) f(yj) Π_{k≠i,j} f(yk) dy1 ··· dyn . (3.12.16)

It is further shown that B′ij(t) is continuous in t. The representation we want is:


Lemma 3.12.1. Under assumptions (E.1), (3.4.1), and (H.1)-(H.4), (3.12.10)-(3.12.13),

(1/2)(1/n²) Σ_{i=1}^n Σ_{j=1}^n E[ ∫_{−∞}^{∞} b(xi, xj, yj, yj) f²(yj) dyj ] (xj − xi)(xj − xi)′ → C_H . (3.12.17)
Proof: By (3.12.4), (3.12.5), (3.12.8), and (3.12.16),

E[ (1/n²) Cn ] = (1/2)(1/n²) Σ_{i=1}^n Σ_{j=1}^n B′ij(0) (xj − xi)(xj − xi)′ . (3.12.18)

Because B′ij(0) is uniformly bounded over all i and j, and the matrix (1/n²) Σ_i Σ_j (xj − xi)(xj − xi)′ converges to a positive definite matrix, the right side of (3.12.18) also converges. By Lemmas A.6.1 and A.6.3 of the Appendix, we have

B′ij(0) = ∫ b(xi, xj, yj, yj) f²(yj) dyj + o(1) , (3.12.19)

where the remainder term is uniformly small over all i and j. Under Assumption (H.1), (3.12.10), the result follows.
Discussion on obtaining standard errors for the estimators can be found in
Section 3.12.6.
Remark 3.12.1 (Empirical Efficiency). As noted above, there is always a
loss of efficiency of the GR estimator relative to the Wilcoxon estimator. It
was hoped that the HBR estimator would regain some of this efficiency. This
was confirmed in a Monte Carlo study which is discussed in Section 8 of the
article by Chang et al. (1999). In this study, over a series of designs, which
included contamination in both responses and factor space, in all but two of
the situations, the empirical efficiency of the HBR estimate relative to the
Wilcoxon estimate was always larger than that of the GR estimate relative to
the Wilcoxon estimate.
Remark 3.12.2 (Stability Study). To obtain its full 50% breakdown, the
HBR estimates require initial estimates with 50% breakdown. It is known that
slight changes to centrally located data can cause some high breakdown esti-
mates to change by a large amount. This was discussed for the high breakdown
least median squares (LMS) estimates by Hettmansperger and Sheather (1992,
1993) and later confirmed in a Monte Carlo study by Sheather, McKean, and
Hettmansperger (1997). Part of the article by Chang et al. (1999) consisted of
a stability study for the HBR estimator using LMS and LTS starting values.
Over the situations investigated, the HBR estimates were much more stable
than either the LTS or LMS estimates but were less stable than the Wilcoxon
estimates.


3.12.4 Robustness Properties of the HBR Estimates


In this section we show that the HBR estimate can attain 50% breakdown and
derive its influence function. We show that its influence function is bounded
in both the x- and the Y -spaces. The argument for breakdown is taken from
Chang (1995) while the influence function derivation is taken from Chang et
al. (1999).

Breakdown of the HBR Estimate

Let Z = {zi} = {(xi, yi)}, i = 1, . . . , n, denote the sample of data points and ‖·‖ the Euclidean norm. Define the breakdown point of the estimator β̂ at the sample Z as

ǫn*(β̂, Z) = max{ m/n : sup_{Z′} ‖β̂(Z′) − β̂(Z)‖ < ∞ } ,

where the supremum is taken over all samples Z′ that can result from replacing m observations in Z by arbitrary values. See, also, Definition 1.6.1.
We now state conditions under which the HBR estimate remains bounded.
Lemma 3.12.2. Suppose there exist finite constants M1 > 0 and M2 > 0 such that the following conditions hold:

(B1) inf_{‖β‖=1} sup_{ij} { bij (xj − xi)′β } = M1 ;
(B2) sup_{ij} { bij |yj − yi| } = M2 .

Then
‖β̂_HBR‖ < (1/M1) [1 + n(n − 1)] M2 .
Proof: Note that

D_HBR(β) ≥ sup_{ij} { bij |yj − yi − (xj − xi)′β| } ≥ ‖β‖M1 − M2 ≥ n(n − 1)M2

whenever ‖β‖ ≥ (1/M1)[1 + n(n − 1)]M2. Since D_HBR(0) = Σ_{i<j} bij |yj − yi| ≤ [n(n − 1)/2]M2 and D_HBR is a convex function of β, it follows that β̂_HBR = Argmin D_HBR(β) satisfies

‖β̂_HBR‖ < (1/M1)[1 + n(n − 1)]M2 .

The lemma follows.
For our result, we need to further assume that the data points Z are in general position; that is, any subset of p + 1 of these points determines a unique solution β. In particular, this implies that neither all of the xi s are the same nor all of the yi s are the same; hence, provided the weights have not broken down, both constants M1 and M2 of Lemma 3.12.2 are positive.
Theorem 3.12.3. Assume that the data points Z are in general position. Let v, V, and β̂⁽⁰⁾ denote the initial estimates of location, scatter, and β, and let ǫn*(v, Z), ǫn*(V, Z), and ǫn*(β̂⁽⁰⁾, Z) denote their corresponding breakdown points. Then the breakdown point of the HBR estimator is

ǫn*(β̂_HBR, Z) = min{ ǫn*(v, Z), ǫn*(V, Z), ǫn*(β̂⁽⁰⁾, Z), 1/2 } . (3.12.20)

Proof: Corrupt m points in the data set Z and let Z′ be the sample consisting of these corrupt points and the remaining n − m points. Assume that Z′ is in general position and that v(Z′), V(Z′), and β̂⁽⁰⁾(Z′) have not broken down. Then the constants M1 and M2 of Lemma 3.12.2 are positive and finite. Hence, by Lemma 3.12.2, ‖β̂_HBR(Z′)‖ < ∞ and the theorem follows.
Based on this last result, the HBR estimate has 50% breakdown provided the initial estimates v, V, and β̂⁽⁰⁾ all have 50% breakdown. Assuming that the data points are in general position, the MCD estimates of location and scatter discussed near expression (3.12.1) have 50% breakdown. For initial estimates of the regression coefficients, again assuming that the data points are in general position, the LTS estimates, (3.12.2), have 50% breakdown; see, also, Hössjer (1994). The HBR estimates used in the examples of Section 3.12.6 employ the MCD estimates of location and scatter and the LTS estimate of the regression coefficients, resulting in the weights defined in (3.12.3).

Influence Function of the HBR Estimate

In order to derive the influence function, we start with the gradient equation S(β) ≐ 0, written as

0 ≐ (1/n²) Σ_{i=1}^n Σ_{j=1}^n bij sgn(zj − zi)(xj − xi) ,

where zi = yi − xi′β.

By Lemma A.6.3 of the Appendix, bij = gij(β̂⁽⁰⁾) = gij(0) + Op(1/√n). Hence, the defining equation may be written as

0 ≐ (1/n²) Σ_{i=1}^n Σ_{j=1}^n gij(0) sgn(zj − zi)(xj − xi) , (3.12.21)

ignoring a remainder term of magnitude Op(1/√n).


Influence functions are derived at the model where both x and y are stochastic; hence, consider the correlation model of Section 3.11,

Y = x′β + e , (3.12.22)

where e has density f, x is a p × 1 random vector with density function m, and e and x are independent. Let F and M denote the corresponding distribution functions of e and x. Let H and h denote the joint distribution function and density of Y and x. It then follows that

h(x, y) = f(y − x′β) m(x) . (3.12.23)

If we rewrite equation (3.12.21) using the Stieltjes integral notation of the empirical distribution of (xi, yi), for i = 1, . . . , n, we see that the functional β(H) makes the following expression 0:

∫∫ b(x1, x2, y1, y2) sgn{y2 − y1 − (x2 − x1)′β(H)} (x2 − x1) dH(x1, y1) dH(x2, y2) .
Let $I(a < b) = 1$ or 0, depending on whether $a < b$ or $a > b$. Then, using the fact that the sign function is odd and the symmetry of the weight function in its $x$ and $y$ arguments, the functional $\beta(H)$ makes the following expression 0:
$$\int\!\!\int x_1 b(x_1, x_2, y_1, y_2)\left[I(y_2 - y_1 < (x_2 - x_1)'\beta(H)) - \frac{1}{2}\right] dH(x_1, y_1)\,dH(x_2, y_2).$$

Define the matrix $C_H$ by
$$C_H = \frac{1}{2}\int\!\!\int\!\!\int (x_2 - x_1) b(x_1, x_2, y_1, y_1)(x_2 - x_1)' f^2(y_1)\,dy_1\,dM(x_1)\,dM(x_2) . \quad (3.12.24)$$
Note that under the correlation model CH is the assumed limiting matrix of
Assumption (H.1), (3.12.10); see Lemma 3.12.1.
The next theorem gives the result for the influence function of $\hat{\beta}_{HBR}$. Its proof is given in Theorem A.5.1 of the Appendix.

Theorem 3.12.4. The influence function for the estimate $\hat{\beta}_{HBR}$ is given by
$$\Omega(x_0, y_0, \hat{\beta}_{HBR}) = \frac{1}{2} C_H^{-1}\int\!\!\int (x_0 - x_1) b(x_1, x_0, y_1, y_0)\,\mathrm{sgn}\{y_0 - y_1\}\,dF(y_1)\,dM(x_1) , \quad (3.12.25)$$
where $C_H$ is given by expression (3.12.24).


In order to show that the influence function correctly identifies the asymptotic distribution of the estimator, define $W_i$ as
$$W_i = \int\!\!\int (x_i - x_1) b(x_1, x_i, y_1, Y_i)\,\mathrm{sgn}(Y_i - y_1)\,dF(y_1)\,dM(x_1) . \quad (3.12.26)$$
Next write $W_i$ in terms of a Stieltjes integral over the empirical distribution of $(x_j, y_j)$ as
$$W_i^* = \frac{1}{n}\sum_{j=1}^n (x_i - x_j) b(x_j, x_i, y_j, y_i)\,\mathrm{sgn}(y_i - y_j) . \quad (3.12.27)$$

If we can show that $(1/\sqrt{n})\sum_{i=1}^n W_i^* \stackrel{d}{\rightarrow} N(0, \Sigma_H)$, then we are done. From the proof of Theorem A.6.4 in the Appendix, it suffices to show that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n (U_i - W_i^*) \stackrel{P}{\rightarrow} 0 , \quad (3.12.28)$$
where $U_i = (1/n)\sum_{j=1}^n (x_i - x_j)E[b_{ij}\,\mathrm{sgn}(y_i - y_j)|y_i]$. Writing the left side of (3.12.28) as
$$\frac{1}{n^{3/2}}\sum_{i=1}^n\sum_{j=1}^n (x_i - x_j)\left\{E[b_{ij}\,\mathrm{sgn}(y_i - y_j)|y_i] - g_{ij}(0)\,\mathrm{sgn}(y_i - y_j)\right\} ,$$
where $g_{ij}(0) \equiv b(x_j, x_i, y_j, y_i)$, the proof is analogous to the proof of Theorem A.6.4.

3.12.5 Discussion

The influence function, $\Omega(x_0, y_0, \hat{\beta}_{HBR})$, for the HBR estimate is a continuous function of $x_0$ and $y_0$. With a proper choice of a weight function it is bounded in both the $x$ and $y$ spaces. This is true for the weights given by (3.12.3); furthermore, for these weights $\Omega(x_0, y_0, \hat{\beta}_{HBR})$ goes to zero as $x_0$ and $y_0$ get large in any direction.

The influence function $\Omega(x_0, y_0, \hat{\beta})$ is a generalization of the influence functions for the Wilcoxon estimates; see Exercise 3.15.46. Figure 3 of Chang et al. (1999) shows the influence function of the HBR estimator for the special case where $(x, Y)$ has a bivariate normal distribution with mean 0 and the identity matrix as the variance-covariance matrix. For this plot we used the weights given by (3.12.3), where $m_i = \psi(b/x_i^2)$, with the constants $b = c = 4$. As discussed above, the plot verifies that the influence function is bounded in both the $x$ and $y$ spaces and goes to zero as $x_0$ and $y_0$ get large in any direction. For comparison purposes, Figure 3 of Chang et al. (1999) also shows


the influence function of the Wilcoxon estimator (for convenience, both of these figures have been placed at the url cited in the Preface). As shown, the Wilcoxon influence function is unbounded in the $x$ space; for fixed $x_0$, though, it is bounded in $y_0$, i.e., bounded in the response space. For both plots, we used the method of Monte Carlo (10,000 simulations for each of 1600 grid points) to perform the numerical integration. The plot of the Wilcoxon influence function is an easily verifiable check on the Monte Carlo because of its closed form, (3.5.17).
Other high breakdown estimators can have unbounded influence functions. Such estimators can have instability problems, as discussed in Sheather, McKean, and Hettmansperger (1997) for the LMS estimator, which has unbounded influence in the $x$ space at the quartiles of $Y$. The generalized S estimators discussed in Croux et al. (1994) also have unbounded influence functions in the $x$ space at particular values of $Y$. In contrast, the influence function of the HBR estimate is bounded everywhere. This helps explain its more stable behavior, relative to LMS, in the stability study discussed in Chang et al. (1999).

3.12.6 Implementation and Examples


In this section, we discuss how to estimate the standard errors of the HBR
estimates and how to properly standardize the residuals. We then consider
two examples.

Standard Errors and Studentized Residuals

Using the asymptotic distribution of the HBR estimate as a guideline, and upon substituting the estimated weights for the true weights, we can estimate the asymptotic standard errors for these estimates. The asymptotic variance-covariance matrix of $\hat{\beta}_{HBR}$ is a function of the two matrices $\Sigma_H$ and $C_H$, given in (3.12.11) and (3.12.10), respectively. The matrix $\Sigma_H$ is the variance-covariance matrix of the random vector $U_i$, (3.12.9). We can approximate $U_i$ by the expression
$$\hat{U}_i = \frac{1}{n}\sum_{j=1}^n (x_j - x_i)\hat{b}_{ij}(1 - 2F_n(\hat{e}_i)) , \quad (3.12.29)$$
where $\hat{b}_{ij}$ are the estimated weights, $\hat{e}_i$ are the HBR residuals, and $F_n$ is the empirical distribution function of the residuals. Our estimate of $\Sigma_H$ is then the sample variance-covariance matrix of $\hat{U}_1, \ldots, \hat{U}_n$; i.e.,
$$\hat{\Sigma}_H = \frac{1}{n-1}\sum_{i=1}^n \left(\hat{U}_i - \bar{U}\right)\left(\hat{U}_i - \bar{U}\right)' . \quad (3.12.30)$$
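In code, (3.12.30) amounts to forming the vectors (3.12.29) and taking their sample covariance. The following R sketch assumes xmat is the centered design matrix, bmat the $n \times n$ matrix of estimated weights $\hat{b}_{ij}$, and ehat the HBR residuals; these are hypothetical inputs, not functions of the ww package.

sigmaH.hat <- function(xmat, bmat, ehat) {
  n  <- nrow(xmat)
  Fn <- ecdf(ehat)                     # empirical cdf of the residuals
  U  <- t(sapply(1:n, function(i) {
    ## Uhat_i of (3.12.29): (1/n) sum_j (x_j - x_i) bhat_ij (1 - 2 Fn(ehat_i))
    colSums(bmat[i, ] * sweep(xmat, 2, xmat[i, ])) *
      (1 - 2 * Fn(ehat[i])) / n
  }))
  var(U)                               # sample covariance matrix, (3.12.30)
}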


For the matrix $C_H$, consider the results in Lemma 3.12.1. Upon substituting the estimated weights for the weights, expression (3.12.19) simplifies to
$$\hat{B}_{ij}(0) \doteq \hat{b}_{ij}\int f^2(t)\,dt = \hat{b}_{ij}\frac{1}{\sqrt{12}\,\tau_W} , \quad (3.12.31)$$
where $\tau_W$ is the scale parameter (3.4.4) for the Wilcoxon score function; which, for convenience, is $\tau_W = [\sqrt{12}\int f^2(t)\,dt]^{-1}$. To estimate $\tau_W$, we use the estimator $\hat{\tau}_W$ given in expression (3.7.8). Now approximating $b_{ij}$ in $C_n$ using (3.12.3) leads to the estimate
$$\hat{C}_n = \frac{1}{4\sqrt{3}\,\hat{\tau}_W n^2}\sum_{i=1}^n\sum_{j=1}^n \hat{b}_{ij}(x_j - x_i)(x_j - x_i)' . \quad (3.12.32)$$

Recall, for inference, we denoted the asymptotic covariance matrix of $\hat{\beta}_{HBR}$ by $K_{HBR}$; see the discussion around expression (3.12.15). Hence, as our estimate of $K_{HBR}$, we have
$$\hat{K}_{HBR} = \frac{1}{4n}\hat{C}_H^{-1}\hat{\Sigma}_H\hat{C}_H^{-1} . \quad (3.12.33)$$
Similar to the R and GR estimates, we estimate the intercept by
$$\hat{\alpha}_{HBR} = \mathrm{med}_{1\leq i\leq n}\{y_i - x_i'\hat{\beta}_{HBR}\} . \quad (3.12.34)$$
Because $\sqrt{n}(\hat{\beta}_{HBR} - \beta)$ is bounded in probability and X is centered, it follows, using an argument very similar to the corresponding result for the R estimates (see McKean et al., 1990), that the joint asymptotic distribution of $\hat{\alpha}_{HBR}$ and $\hat{\beta}_{HBR}$ is given by
$$\sqrt{n}\left[\begin{pmatrix}\hat{\alpha}\\ \hat{\beta}_{HBR}\end{pmatrix} - \begin{pmatrix}\alpha\\ \beta\end{pmatrix}\right] \stackrel{D}{\rightarrow} N\left(\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}\tau_S^2 & 0'\\ 0 & (1/4)C^{-1}\Sigma C^{-1}\end{pmatrix}\right) , \quad (3.12.35)$$
where $\tau_S$ is defined by (3.4.6); see the discussion around expression (1.5.29) for estimation of this scale parameter.
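Putting the pieces together, the estimates (3.12.32) and (3.12.33) can be sketched in R as follows, with the same hypothetical inputs as in the sketch above and with tauW.hat denoting the estimator (3.7.8):

KHBR.hat <- function(xmat, bmat, tauW.hat, sigmaH) {
  n <- nrow(xmat); p <- ncol(xmat)
  Cn <- matrix(0, p, p)
  for (i in 1:n) for (j in 1:n) {
    d  <- xmat[j, ] - xmat[i, ]
    Cn <- Cn + bmat[i, j] * tcrossprod(d)  # bhat_ij (x_j - x_i)(x_j - x_i)'
  }
  Cn   <- Cn / (4 * sqrt(3) * tauW.hat * n^2)  # (3.12.32)
  Cinv <- solve(Cn)
  (1 / (4 * n)) * Cinv %*% sigmaH %*% Cinv     # the sandwich (3.12.33)
}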

3.12.7 Studentized Residuals

An important use of robust residuals is in the detection of outliers. This is easiest when the residuals are correctly Studentized by an estimate of their standard deviation. Based on the papers by McKean, Sheather, and Hettmansperger (1990, 1993), in Section 3.9 we discussed Studentized residuals for fits obtained from highly efficient rank-based estimates. In this section, we extend this development to fits based on the HBR estimates.


Let $\hat{\beta}_{HBR}$ and $\hat{\alpha}_{HBR}$ be the estimates of the last section for $\beta$ and $\alpha$, respectively. Denote the residual for the $i$th case by
$$\hat{e}^*_i = Y_i - \hat{\alpha} - x_i'\hat{\beta}_{HBR} , \quad (3.12.36)$$
and the vector of residuals by $\hat{e}^*$. Using (3.12.35), a first-order approximation of the standard deviation of the residuals $\hat{e}^*_i$ can be obtained in the same way as the derivation for Studentized residuals for regular R estimates.

As briefly outlined in the Appendix, this development for the HBR residuals results in the first order approximation given by
$$\mathrm{Var}(\hat{e}^*) \doteq \sigma^2 I + \tau_S^2 H_1 + \frac{1}{4}X(X'A^*X)^{-1}\Sigma_H(X'A^*X)^{-1}X' - 2\tau_S\kappa_1 H_1 - \sqrt{12}\,\tau\kappa_2\left[A^*X(X'A^*X)^{-1}X' + X(X'A^*X)^{-1}X'A^*\right] , \quad (3.12.37)$$
where $\sigma^2$ is the variance of $e_i$, $\kappa_1 = E[|e_i|]$, $\kappa_2 = E[e_i(2F(e_i) - 1)]$, $H_1 = n^{-1}11'$, and the components of $A^*$ are estimated by
$$\hat{a}^*_{ij} = \hat{b}_{ij}/(\sqrt{12}\,\hat{\tau}) ; \quad (3.12.38)$$
see the discussion around expression (A.7.2) in the Appendix. This yields an estimate of the matrix $A^*$.
We recommend estimating $\sigma$ by $\mathrm{MAD} = 1.483\,\mathrm{med}\{|\hat{e}^*_i|\}$; $\kappa_1$ by
$$\hat{\kappa}_1 = \frac{1}{n}\sum_{i=1}^n |\hat{e}^*_i| ; \quad (3.12.39)$$
and $\kappa_2$ by
$$\hat{\kappa}_2 = \frac{1}{n}\sum_{i=1}^n\left(\frac{R(\hat{e}^*_i)}{n+1} - \frac{1}{2}\right)\hat{e}^*_i , \quad (3.12.40)$$
which is a consistent estimate of $\kappa_2$; see McKean et al. (1990). An estimator of $\Sigma_H$ is given in expression (3.12.30). Let $\hat{V}$ denote the estimate of $\mathrm{Var}(\hat{e}^*)$ and let $\hat{\sigma}^2_{\hat{e}_i}$ denote the $i$th diagonal entry of $\hat{V}$. Define the Studentized residuals by
$$\frac{\hat{e}^*_i}{\hat{\sigma}_{\hat{e}_i}} . \quad (3.12.41)$$
As in LS, these standard errors correct for both the underlying variance of the errors and location. For flagging outliers, appropriate benchmarks for these residuals are $\pm 2$; see Section 3.9 for discussion.
Using the ww package, the Studentized residuals based on the HBR fit are computed by the R commands
fit.hbr = wwest(xmat,y,bij="HBR")
studres.hbr(xmat,fit.hbr$tmp2$wmat,fit.hbr$tmp1$resid)
where xmat and y contain the design matrix and the vector of responses, respectively.
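For example, the $\pm 2$ benchmark can then be applied directly; in this sketch, sr is simply a hypothetical name for the vector returned by studres.hbr:

sr = studres.hbr(xmat, fit.hbr$tmp2$wmat, fit.hbr$tmp1$resid)
which(abs(sr) > 2)   # cases flagged as potential outliers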


Table 3.12.1: Estimates of Coefficients for the Quadratic Data


Fit Intercept Linear Quadratic
Wilcoxon -.665 5.95 -.652
HBR .422 4.64 -.375
LMS 1.12 3.65 -.141

3.12.8 Example on Curvature Detection


High breakdown estimates and highly efficient estimates can give conflicting
results. High breakdown estimates are less sensitive to outliers and clusters
of outliers in the x-space; hence, for data sets where this is a problem high
breakdown estimates often give better fits than highly efficient fits. On the
other hand, there is generally a loss of efficiency when using HBR estimates;
see Exercise 3.15.41. Also, the HBR estimates are hampered in fitting and
detecting curvature while this is not true of the highly efficient estimates.
The next example illustrates this problem. McKean et al. (1993) discuss this
curvature problem in general for polynomial models.

Example 3.12.3 (Quadratic Data: Illustrates Curvature Detection). In order to demonstrate the problems that the high breakdown estimates have in fitting curvature, we simulated data from the following quadratic model:
$$Y_i = 5.5|x_i| - .6x_i^2 + e_i , \quad (3.12.42)$$

where the $e_i$'s are simulated iid $N(0, 1)$ variates and the $x_i$'s are simulated
contaminated normal variates with the contamination proportion set at .25
and the ratio of the variance of the contaminated part to the noncontaminated
part set at 16. Panel A of Figure 3.12.2 displays a scatter plot of the data
overlaid with the Wilcoxon, HBR, and LMS fits. The estimated coefficients
for these fits are in Table 3.12.1. As shown, the Wilcoxon fit is quite good in
fitting the curvature of the data. Its estimates are close to the true values. On
the other hand, the high breakdown fits are quite poor. The LMS fit missed
the curvature in the data. This is true too for the HBR fit, although the fit
did correct itself somewhat from the poor LMS starting values. Panels B and
C of Figure 3.12.2 contain the internal Studentized residual plots based on
the Wilcoxon and the HBR fits, respectively. Based on the Wilcoxon residual
plot, no further models would be considered. The HBR residual plot shows as
outliers the two points which were fitted poorly. It also has a mild linear trend, which is not helpful since a linear term was already in the fitted model. The same trend appears in the LMS residual plot (Panel D), although there the overall impression is that the model lacks a quadratic term. In such cases in practice, a higher degree polynomial may be fitted, which in this case would be incorrect. Difficulties in reading residual plots from high breakdown fits, as encountered here, were discussed in general in McKean et al. (1993).

Figure 3.12.2: Panel A: For the quadratic data of Example 3.12.3, scatter plot of data overlaid by Wilcoxon, HBR, and LMS fits; Panel B: Studentized residual plot based on the Wilcoxon fit; Panel C: Studentized residual plot based on the HBR fit; Panel D: Residual plot based on the LMS fit. [Plot not reproduced.]
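A simulation along these lines is easy to run with the ww package. The following sketch uses our own assumptions for the unstated details (the sample size, seed, and binomial contamination indicator are hypothetical), so the generated data differ from those plotted in Figure 3.12.2:

set.seed(123)                         # hypothetical seed
n   = 50                              # assumed sample size
cnt = rbinom(n, 1, 0.25)              # contamination indicator, proportion .25
x   = rnorm(n, sd = ifelse(cnt == 1, 4, 1))   # variance ratio 16
y   = 5.5*abs(x) - 0.6*x^2 + rnorm(n)         # model (3.12.42)
xmat = cbind(x, x^2)
fit.w   = wwest(xmat, y)                  # Wilcoxon fit
fit.hbr = wwest(xmat, y, bij="HBR")       # HBR fit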

3.13 Diagnostics for Differentiating between Fits
Is the least squares (LS) fit appropriate for the data at hand? How different
would a more robust estimate be from LS? Is a high breakdown estimator
necessary, or is a highly efficient robust estimator sufficient? In this section, we
present simple intuitive diagnostics which help answer these questions. These


measure the difference in fits among LS, highly efficient R, and high breakdown R fits. These diagnostics were developed by McKean et al. (1996a, 1999); see, also, McKean and Sheather (2009) for a recent discussion. The package ww computes them; see Terpstra and McKean (2005) and McKean, Terpstra, and Kloke (2009). We sketch the development for the diagnostics that differentiate between the LS and R fits first. Also, we focus on Wilcoxon scores, leaving the general scores analog to the exercises. We begin by looking at the difference in the LS and Wilcoxon (W) fits, which leads to our diagnostics.

Consider the linear model (3.2.3). The design matrix is centered, so the LS estimator is $\hat{\beta}_{LS} = (X'X)^{-1}X'Y$. Let $\hat{\beta}_W$ denote the R estimate, immediately following expression (3.2.7), based on Wilcoxon scores, $\varphi(u) = \sqrt{12}[u - (1/2)]$. We use $\tau_W$ to denote the corresponding R scale parameter $\tau_\varphi$. For Wilcoxon scores, recall that $\tau_W = 1/(\sqrt{12}\int f(t)^2\,dt)$. We then have the following theorem:

Theorem 3.13.1. Assume that the random errors $e_i$ of Model (3.2.3) have finite variance $\sigma^2$. Also, assume that assumptions (E1), (E2), (D1), and (D2) of Section 3.4 are true. Then $\hat{\beta}_{LS} - \hat{\beta}_W$ is asymptotically normal with mean 0 and variance-covariance matrix
$$\mathrm{Var}(\hat{\beta}_{LS} - \hat{\beta}_W) = \delta^2(X'X)^{-1} , \quad (3.13.1)$$
where $\delta^2 = \sigma^2 + \tau_W^2 - 2\kappa$ and $\kappa = \sqrt{12}\,\tau_W E[e(F(e) - 1/2)]$.

The proof of this theorem can be found in the Appendix; see Section A.8.
It follows that $\delta^2 \geq (\tau_W - \sigma)^2$. To see this, note that $E[e(F(e) - 1/2)] = \mathrm{Cov}(e, F(e)) \geq 0$; hence, $\kappa \geq 0$. Then, by the Cauchy-Schwarz inequality,
$$\kappa \leq \tau_W\sqrt{E(e^2)\cdot E[(\sqrt{12}(F(e) - 1/2))^2]} = \tau_W\sqrt{\sigma^2\cdot 1} = \tau_W\sigma .$$
By the definition of $\delta^2$, it follows that $\delta^2 \geq (\tau_W - \sigma)^2 \geq 0$. This implies that if $\tau_W \neq \sigma$, then $\delta^2 > 0$.
Outliers have an effect on the LS estimate of the intercept, $\hat{\alpha}_{LS}$, also. Hence this measurement of overall difference in the R and LS estimates needs to be based on the estimates of $\alpha$, too. The problems to which we apply our diagnostics are often messy. In particular, we do not want to preclude errors with skewed distributions. Hence, we estimate the intercept by the median of the Wilcoxon residuals, i.e., $\hat{\alpha}_S = \mathrm{med}\{Y_i - x_i'\hat{\beta}_W\}$. This, however, raises a problem, since $\hat{\alpha}_{LS}$ is a consistent estimate of the mean of the errors, $\mu_e$, while $\hat{\alpha}_S$ is a consistent estimate of the median of the errors, $\tilde{\mu}_e$. We define the difference in these target values to be
$$\mu_d = \mu_e - \tilde{\mu}_e . \quad (3.13.2)$$


One way to avoid a problem here is to assume that the errors have a symmetric distribution, i.e., $\mu_d = 0$, but this is undesirable in developing diagnostics for exploratory analysis. Instead, we consider measures composed of two parts: one part measures the difference in slope parameters and the other part measures the difference in the estimates of intercept. First, the following notation is convenient here. Let $b = (\alpha, \beta')'$ denote the vector of parameters, and let $\hat{b}_{LS}$ and $\hat{b}_W$ denote respectively the LS and Wilcoxon estimates of $b$.

The version of Theorem 3.13.1 that includes the intercept is the following corollary. Let $\tau_S = 1/(2f(\theta))$, where $f$ is the error density and $\theta$ is the median of $f$. Assume without loss of generality that the design matrix X is centered.

Corollary 3.13.1. Under the assumptions of Theorem 3.13.1, $\hat{b}_W - \hat{b}_{LS}$ is asymptotically normal with mean vector $(\mu_d, 0')'$ and variance-covariance matrix
$$\mathrm{Var}\begin{pmatrix}\hat{\alpha}_{LS} - \hat{\alpha}_S\\ \hat{\beta}_{LS} - \hat{\beta}_W\end{pmatrix} = \begin{pmatrix}\delta_S^2/n & 0'\\ 0 & \delta^2(X'X)^{-1}\end{pmatrix} , \quad (3.13.3)$$
where $\delta_S^2 = \sigma^2 + \tau_S^2 - 2\tau_S E(e\cdot\mathrm{sgn}(e))$.

By the Cauchy-Schwarz inequality, $E(e\cdot\mathrm{sgn}(e)) \leq \sigma$, so that $\delta_S^2 \geq (\tau_S - \sigma)^2 \geq 0$. Hence, the parameter $\delta_S^2 \geq 0$.
A simple diagnostic to measure the difference between the LS and Wilcoxon fits suggested by this corollary is $(\hat{b}_{LS} - \hat{b}_W)'A_D^{-1}(\hat{b}_{LS} - \hat{b}_W)$, where $A_D$ is the covariance matrix (3.13.3). This has an asymptotic $\chi^2$ distribution with $p+1$ degrees of freedom. Monte Carlo studies, however, showed that it was too liberal. The major problem is that if the LS and Wilcoxon fits are close, then $\hat{\tau}_W$ is close to $\hat{\sigma}$, which can lead to a practical singularity for the matrix $A_D$; see McKean et al. (1996a, 1999) for discussion. One practical solution is to standardize with the asymptotic covariance matrix of the Wilcoxon estimate. This leads to the diagnostic
$$TDBETAS(LS, W) = (\hat{b}_{LS} - \hat{b}_W)'\hat{A}_W^{-1}(\hat{b}_{LS} - \hat{b}_W) , \quad (3.13.4)$$
where
$$\hat{A}_W = \begin{pmatrix}\hat{\tau}_S^2/n & 0'\\ 0 & \hat{\tau}_W^2(X'X)^{-1}\end{pmatrix} , \quad (3.13.5)$$
with $\hat{\tau}_W$ the robust estimator of $\tau_W$ given in expression (3.7.8) for Wilcoxon scores, and $\hat{\tau}_S$ the robust estimator of $\tau_S$ discussed around expression (1.5.29).
The diagnostic TDBETAS(LS, W) decomposes into separate intercept

and slope terms:
$$TDBETAS(LS, W) = (n/\hat{\tau}_S^2)(\hat{\alpha}_{LS} - \hat{\alpha}_S)^2 + (1/\hat{\tau}_W^2)(\hat{\beta}_{LS} - \hat{\beta}_W)'X'X(\hat{\beta}_{LS} - \hat{\beta}_W)$$
$$= (n/\hat{\tau}_S^2)(\hat{\alpha}_{LS} - \hat{\alpha}_S)^2 + (1/\hat{\tau}_W^2)\|\hat{Y}_{LS} - \hat{Y}_W\|^2$$
$$= TDINT(LS, W) + TDBETA(LS, W) . \quad (3.13.6)$$
Even if there is little difference between the Wilcoxon and LS estimates of the slope parameters, the total difference in fits TDBETAS(LS, W) can be large because of asymmetry of the errors, i.e., TDINT(LS, W) is large. Hence, both parts of the decomposition are useful diagnostics. Below, in (3.13.9), we give a benchmark for large values of TDBETAS(LS, W).
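For illustration, (3.13.4)-(3.13.6) translate directly into R. This sketch assumes bhat.ls and bhat.w are the $(p+1)$-vectors of intercept and slope estimates from the two fits, X is the centered design matrix, and taus.hat and tauw.hat are the scale estimates; these are hypothetical names, and the ww function fitdiag, discussed below, computes the diagnostics directly.

tdbetas.lsw <- function(bhat.ls, bhat.w, X, taus.hat, tauw.hat) {
  n <- nrow(X); p <- ncol(X)
  tdint  <- (n / taus.hat^2) * (bhat.ls[1] - bhat.w[1])^2     # intercept part
  dbeta  <- bhat.ls[-1] - bhat.w[-1]
  tdbeta <- drop(t(dbeta) %*% crossprod(X) %*% dbeta) / tauw.hat^2  # slope part
  c(TDBETAS = tdint + tdbeta, benchmark = 4 * (p + 1)^2 / n)  # cf. (3.13.9)
}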
If TDBETAS(LS, W) is large, then we often want to determine the cases which are the main contributors to this large difference. Hence, consider the correspondingly standardized statistic for the difference in the $i$th fitted value,
$$CFITS_i(LS, W) = \frac{\hat{Y}_{W,i} - \hat{Y}_{LS,i}}{SE(\hat{Y}_{W,i})} , \quad (3.13.7)$$
where $SE(\hat{Y}_{W,i}) = \hat{\tau}_W[h_i - (1/n)]^{1/2}$ (the expression in brackets is the $i$th leverage value of the centered design matrix). As with TDBETAS(LS, W), the standardization of $CFITS_i(LS, W)$ is robust. Note that this standardization is similar to that proposed for the diagnostic RDFFITS by McKean, Sheather, and Hettmansperger (1990).
Note that (3.13.7) standardizes by only one fitted value in the numerator (instead of the difference). Belsley, Kuh, and Welsch (1980) used a similar standardization in assessing the difference between $\hat{y}_{LS,i}$ and $\hat{y}_{LS,(i)}$, the $i$th deleted fitted value. They suggested a benchmark of $2\sqrt{(p+1)/n}$, and we propose the same benchmark for $CFITS_i(LS, W)$. Having said this, we have found it useful in many cases to look for gaps that separate large CFITS from small CFITS (see the examples presented below).
Simulations, such as those discussed in McKean et al. (1999), show that
standardization at the Wilcoxon fit is successful and has much better perfor-
mance than standardization using the asymptotic covariance of the difference
in fits (Theorem 3.13.1). These simulations were performed over a wide range
of error and x-variable distributions.
Using the benchmark for CFITS, we can derive an analogous benchmark for TDBETAS by replacing $\tau_S$ with $\tau_W$. We realize that this may be a crude approximation, but we are only deriving a benchmark. Let $X_1 = [1 : X]$ and denote the projection matrix by $H = X_1(X_1'X_1)^{-1}X_1'$. Replacing $\tau_S$ with $\tau_W$, we have $\mathrm{Cov}(\hat{b}_W) \doteq \tau_W^2(X_1'X_1)^{-1}$ and $SE(\hat{Y}_{W,i}) \doteq \tau_W\sqrt{h_{ii}}$. Under this approximation, it follows from (3.13.7) that an observation is flagged by the diagnostic $CFITS_i(LS, W)$ whenever


$$\frac{|\hat{Y}_{W,i} - \hat{Y}_{LS,i}|}{\hat{\tau}_W\sqrt{h_{ii}}} > 2\sqrt{(p+1)/n} . \quad (3.13.8)$$
We use this expression to obtain a benchmark for the diagnostic TDBETAS(LS, W) as follows:
$$TDBETAS(LS, W) = (\hat{b}_W - \hat{b}_{LS})'[\hat{\tau}_W^2(X_1'X_1)^{-1}]^{-1}(\hat{b}_W - \hat{b}_{LS})$$
$$= (1/\hat{\tau}_W^2)[X_1(\hat{b}_W - \hat{b}_{LS})]'[X_1(\hat{b}_W - \hat{b}_{LS})]$$
$$= (1/\hat{\tau}_W^2)\sum_i (\hat{Y}_{W,i} - \hat{Y}_{LS,i})^2$$
$$= (p+1)\frac{1}{n}\sum_i\left(\frac{\hat{Y}_{W,i} - \hat{Y}_{LS,i}}{\hat{\tau}_W\sqrt{(p+1)/n}}\right)^2 .$$
Since $h_{ii}$ has the average value $(p+1)/n$, (3.13.8) suggests flagging TDBETAS(LS, W) as large whenever $TDBETAS(LS, W) > (p+1)(2\sqrt{(p+1)/n})^2$, or
$$TDBETAS(LS, W) > \frac{4(p+1)^2}{n} . \quad (3.13.9)$$
We proceed the same way for diagnostics to indicate differences in fits
between Wilcoxon and HBR fits and between LS and HBR fits. The asymptotic
representation for the HBR estimate of β, (A.7.1), can be used to obtain the
asymptotic distribution of the differences between these fits. For data sets
where the HBR weights are close to one, though, the covariance matrix of
this difference is practically singular, resulting in the diagnostic being quite
liberal; see McKean et al. (1996a, 1999) for discussion. So, as we did for the
diagnostic between the Wilcoxon and LS fits, we standardize the differences
in fits using the asymptotic covariance matrix of the Wilcoxon estimate; i.e.,
AW . Hence, the total differences in fits are given by
$$TDBETAS(W, HBR) = (\hat{b}_W - \hat{b}_{HBR})'\hat{A}_W^{-1}(\hat{b}_W - \hat{b}_{HBR}) \quad (3.13.10)$$
and
$$TDBETAS(LS, HBR) = (\hat{b}_{LS} - \hat{b}_{HBR})'\hat{A}_W^{-1}(\hat{b}_{LS} - \hat{b}_{HBR}) . \quad (3.13.11)$$
We recommend using the benchmark given by (3.13.9) for these diagnostics,
also. Likewise the diagnostics for casewise differences are given by
$$CFITS_i(W, HBR) = \frac{\hat{Y}_{W,i} - \hat{Y}_{HBR,i}}{SE(\hat{Y}_{W,i})} \quad (3.13.12)$$
and
$$CFITS_i(LS, HBR) = \frac{\hat{Y}_{LS,i} - \hat{Y}_{HBR,i}}{SE(\hat{Y}_{W,i})} . \quad (3.13.13)$$



As with (3.13.7), we recommend the benchmark $2\sqrt{(p+1)/n}$ for these casewise diagnostics.
These diagnostics can be computed by the ww package. For example, sup-
pose we want the diagnostics T DBET AS and CF IT S between the Wilcoxon
and HBR fits. Assuming that xmat and y contain the design matrix and the
vector of responses, respectively, these diagnostics are returned in the list tds
by the command tds = fitdiag(xmat,y,est=c("WIL","HBR")).
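The returned list also contains the benchmarks. For instance, a quick check is (a sketch; the component names tdbeta, bmtd, and cfit are those used in the Hawkins example of the next section):

p = ncol(xmat); n = length(y)
tds$tdbeta > tds$bmtd                      # is TDBETAS(W,HBR) large?
which(abs(tds$cfit) > 2*sqrt((p+1)/n))     # cases flagged by CFITS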

Example 3.13.1 (Bonds Data). Siegel (1997) presented a data set, the Bonds
data, which we use to illustrate some of these concepts. It was further discussed
in Sheather (2009) and McKean and Sheather (2009). The responses are the
bid prices for U.S. treasury bonds, while the predictor is the coupon rate (the size of the bond's periodic payment rate, in percent). The data are shown
in Panel A of Figure 3.13.1 overlaid with the LS (solid line) and Wilcoxon (bro-
ken line) fits. The fits differ dramatically and the diagnostic TDBETA(LS,W)
has the value 213.7 which far exceeds the benchmark of 0.457. The three cases
yielding the largest values for the casewise diagnostic CFITS are cases 4, 13,
and 35. Panels B and C display the LS and Wilcoxon Studentized residual
plots. As can be seen, the Wilcoxon Studentized residual plot highlights Cases
4, 13, and 35, also. Their Studentized residuals exceed 20 and clearly should be
labeled outliers. These are the outlying points on the far left in the scatterplot
of the data. On the other hand, the LS Studentized residual plot shows only
two of them exceeding the benchmark. Further, the bow-tie pattern of the
Wilcoxon residual plot indicates heteroscedasticity of the errors. As discussed
in Sheather (2009), this heteroscedasticity is to be expected because the bonds
have different maturity dates.
As further discussed in Sheather (2009), the three outlying cases are of a
different type of bond than the others. The plot in Panel D is the Studentized
residuals versus fitted values for the Wilcoxon fit after removing these three
cases. Note that there are still a few outlying data points. The diagnostic
TDBETA(LS,Wil) has the value 1.55 which exceeds the benchmark of 0.50
but the difference is far less than the difference based on the original data.
Next consider the differences between the LS and HBR fits. The leverage
values corresponding to the three outlying cases exceed the benchmark for
leverage points (the smallest leverage value of these three cases has value 0.152
which exceeds the benchmark of 0.114). The diagnostic TDBETA(LS,HBR)
has the value 318.8, which far exceeds the benchmark. As discussed above, the Wilcoxon fit is sensitive to outliers in factor space, and in this case TDBETA(Wil,HBR) is 10.5. When the outliers are omitted, the value of this statistic is 0.034, which is less than the benchmark.

In this simple regression model, it is obvious that the three outlying cases are on the edge of factor space. As the next example shows, in a multiple regression problem this is generally not as apparent. The diagnostics discussed in this section, though, alert the user to potential troublesome points in factor space or response space.

Figure 3.13.1: Plots for the bonds data. Panel A: bid price versus coupon rate, overlaid with the LS and Wilcoxon fits; Panel B: LS Studentized residuals versus LS fit; Panel C: Wilcoxon Studentized residuals versus Wilcoxon fit; Panel D: Wilcoxon Studentized residuals versus Wilcoxon fit with the three outlying cases removed. [Plot not reproduced.]
Example 3.13.2 (Hawkins Data). This is an artificial data set proposed by
Hawkins, Bradu, and Kass (1984) involving three independent variables. There
are a total of 75 data points in the set and the first 14 of them are outlying
in factor space. The other 61 points follow a linear model. Of the 14 outlying
points, the first 10 points do not follow the model while the points 11 through
14 do; hence, the first 10 cases are referred to as bad points of high leverage
while the next 4 cases are referred to as good points of high leverage. Panels
A and B of Figure 3.13.2 are the unstandardized residual plots from the LS
and Wilcoxon fits, respectively. Note that the LS and Wilcoxon fits are fooled
by the bad outliers. Their fits are pulled toward the bad points of high leverage, while both fits flag the four good points of high leverage.


Figure 3.13.2: Panel A: LS residual plot (versus LS fit) of the Hawkins data; Panel B: Wilcoxon residual plot (versus Wilcoxon fit). [Plot not reproduced.]

Table 3.13.1: Estimates of Regression Coefficients for the Hawkins Data


Fit α (se) β1 (se) β2 (se) β3 (se)
LS -0.387 (0.42) 0.239 (0.26) -0.334 (0.15) 0.383 (0.13)
Wil -0.776 (0.20) 0.169 (0.11) 0.018 (0.07) 0.269 (0.05)
HBR -0.155 (0.22) 0.096 (0.12) 0.038 (0.07) -0.046 (0.06)

The residual plot for the HBR fit of the Hawkins data is in Panel A of
Figure 3.13.3. Note that the HBR fit correctly identified the 10 bad points
of high leverage and fit well the 4 good points of high leverage. Table 3.13.1
displays the estimates and standard errors of the fits.
The differences in fits diagnostics were successful for this data set. As displayed on the plot in Panel B of Figure 3.13.3, TDBETAS(W, HBR) = 1324, which far exceeds the benchmark value of 0.853 and indicates that the Wilcoxon and HBR fits differ substantially. The plot in Panel B consists of the diagnostics $CFITS_{W,i}(W, HBR)$ versus case $i$. The 14 largest values of $CFITS_{W,i}(W, HBR)$ are for the 14 outlying cases. Recall that the Wilcoxon fit incorrectly fit the 4 good leverage points, so it is reassuring to see that all 14 outlying cases were correctly identified. Also, in a further investigation of this data set, the gap between these 14 $CFITS_{W,i}(W, HBR)$ values and those of the other cases would lead one to consider a fit based on the other 61 cases.
Figure 3.13.3: Panel A: HBR residual plot (versus HBR fit); Panel B: CFITS(W, HBR) by case. [Plot not reproduced.]

Assuming that the matrix x1 is the design matrix (not including the intercept), the following is the ww code which obtained the fits and the diagnostics:
fit.ls = lm(y~x1)

fit.wl = wwest(x1,y)
fit.hbr = wwest(x1,y,bij="HBR")
fdwilhbr = fitdiag(x1,y,est=c("WIL","HBR"))
fdwilhbr$tdbeta
fdwilhbr$bmtd
cfit =fdwilhbr$cfit
fdwills = fitdiag(x1,y,est=c("WIL","LS"))
fdlshbr = fitdiag(x1,y,est=c("LS","HBR"))

3.14 Rank-Based Procedures for Nonlinear Models

In this section, we consider the general nonlinear model
$$Y_i = f_i(\theta_0) + \varepsilon_i , \quad i = 1, \ldots, n , \quad (3.14.1)$$

where fi are known real valued functions defined on a compact space Θ


and ε1 , . . . , εn are independent and identically distributed random errors with
probability density function h(t). The asymptotic properties and conditions
needed for the numerical stability of the LS estimation procedure were inves-
tigated in Jennrich (1969); see, also, Malinvaud (1970) and Wu (1981). LS
estimation in nonlinear models is a direct extension of its estimation in linear
models. The same norm (Euclidean) is minimized to obtain the LS estimate


of θ 0 . For the rank-based procedures of this section, we simply replace the


Euclidean norm by an R norm, as we did in the linear model case. Hence, the
geometry is the same as in the linear model case.
For the nonlinear model (3.14.1), Oberhofer (1982) obtained the weak con-
sistency for R estimates based on the sign scores, i.e., the L1 estimate. Abebe
and McKean (2007) obtained the consistency and asymptotic normality for the
nonlinear R estimates of Model (3.14.1) based on the Wilcoxon score function.
In this section, we briefly discuss this Wilcoxon development.
For the long form of the model, let $Y = (Y_1, \ldots, Y_n)^T$ and $f(\theta) = (f_1(\theta), \ldots, f_n(\theta))^T$. Given a norm $\|\cdot\|$ on $n$-space, a natural estimator of $\theta$ is a value $\hat{\theta}$ which minimizes the distance between the response vector $Y$ and $f(\theta)$; i.e., $\hat{\theta} = \mathrm{argmin}_{\theta\in\Theta}\|Y - f(\theta)\|$. If the norm is the Euclidean norm, then $\hat{\theta}$ is the LS estimate. Our interest, though, is in the Wilcoxon norm given in expression (3.2.6), where the score function is $\varphi(u) = \sqrt{12}[u - (1/2)]$. Here, we write the norm as $\|\cdot\|_W$, where $W$ denotes the Wilcoxon score function. We define the Wilcoxon estimator of $\theta_0$, denoted hereafter by $\hat{\theta}_{W,n}$, as
$$\hat{\theta}_{W,n} = \mathrm{argmin}_{\theta\in\Theta}\|Y - f(\theta)\|_W . \quad (3.14.2)$$

We assume that fi (θ) is defined and continuous for all θ ∈ Θ, for all i. It then
follows that the dispersion function is a continuous function of θ and, hence,
since Θ is compact, that the Wilcoxon estimate θ b exists.
To state the asymptotic properties of the LS and R nonlinear estimates, certain assumptions are required. These are discussed in detail in Abebe and McKean (2007). We do note the analog of Assumption D.3, (3.4.8), for the linear model; that is, the sequence of matrices
$$n^{-1}\sum_{i=1}^n \{\nabla f_i(\theta_0)\}\{\nabla f_i(\theta_0)\}' \quad (3.14.3)$$
converges to a positive definite matrix $\Sigma(\theta_0)$, where $\nabla f_i(\theta)$ is the $p\times 1$ gradient of $f_i(\theta)$ with respect to $\theta$. Under these assumptions, Abebe and McKean (2007) showed that $\hat{\theta}_{W,n}$ converges in probability to $\theta_0$. They then derived the asymptotic distribution of $\hat{\theta}_{W,n}$. Similar to the derivation in the linear model case of Section 3.5, this involves a pseudo-linear model.

Consider the local linear model given by
$$Y_i^* = x_i^{*T}\theta_0 + \varepsilon_i , \quad (3.14.4)$$
where, for $i = 1, \ldots, n$,
$$Y_i^* = Y_i^*(\theta_0) \equiv Y_i - f_i(\theta_0) + \{\nabla f_i(\theta_0)\}^T\theta_0 .$$


Note that the probability density function of the errors of Model (3.14.4) is $h(t)$, i.e., the density function of $\varepsilon_i$. Define the corresponding Wilcoxon dispersion function as
$$D_n^*(\theta) \equiv [2n(n+1)]^{-1}\sum_{i<j}|e_i^*(\theta) - e_j^*(\theta)| . \quad (3.14.5)$$
Furthermore, let
$$\hat{\theta}_n = \mathrm{argmin}_{\theta\in\Theta} D_n^*(\theta) . \quad (3.14.6)$$
It then follows (see Exercise 3.15.43) that
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \stackrel{D}{\rightarrow} N_p(0, \tau^2\Sigma^{-1}(\theta_0)) , \quad (3.14.7)$$
where $\tau = 1/(\sqrt{12}\int h^2(t)\,dt)$, (3.4.4). Abebe and McKean (2007) show that $\sqrt{n}(\hat{\theta}_{W,n} - \hat{\theta}_n) \rightarrow 0$, in probability; hence, we have the asymptotic distribution for the Wilcoxon estimator, which we state in the next theorem.

Theorem 3.14.1. Under the assumptions in Abebe and McKean (2007),
$$\sqrt{n}(\hat{\theta}_{W,n} - \theta_0) \stackrel{D}{\rightarrow} N_p(0, \tau^2\Sigma^{-1}(\theta_0)) . \quad (3.14.8)$$

Let $\hat{\theta}_{LS,n}$ denote the LS estimator of $\theta_0$. Under suitable regularity conditions, the asymptotic distribution of the LS estimator is given by
$$\sqrt{n}(\hat{\theta}_{LS,n} - \theta_0) \stackrel{D}{\rightarrow} N_p(0, \sigma^2\Sigma^{-1}(\theta_0)) , \quad (3.14.9)$$

where σ 2 is the variance of the random error εi . It follows immediately, from


expressions (3.14.8) and (3.14.9), that, for any component of θ 0 , the asymp-
totic relative efficiency (ARE) between the Wilcoxon estimator and the LS
estimator of the component is given by the ratio σ 2 /τ 2 . This, of course, is the
ARE between the Wilcoxon and LS estimators in linear models. If the error
distribution is normal, then this ratio is the well-known number 0.955. Hence,
there is only a loss of 5% efficiency, if one uses the Wilcoxon estimator instead
of the LS estimator when the errors are normally distributed. In contrast, the
L1 estimator has an asymptotic relative efficiency of only 63% relative to the LS estimator. The ARE between the Wilcoxon and L1 estimators at normal errors
is 150%. Hence, as in the linear model case, the Wilcoxon estimator is a highly
efficient estimator for nonlinear models. For heavier tailed error distributions
the Wilcoxon estimator is generally much more efficient than the LS estima-
tor; see Table 1.7.1 for the AREs among LS, L1 , and Wilcoxon procedures
for selected contaminated normal distributions. A discussion of such results,
along with a Monte Carlo verification, for a family of contaminated normal
error distributions can be found in Abebe and McKean (2007).
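The normal-error ARE constants quoted above are easy to verify numerically in R:

3/pi            # ARE(Wilcoxon, LS) at the normal: 0.9549
2/pi            # ARE(L1, LS) at the normal: 0.6366
(3/pi)/(2/pi)   # ARE(Wilcoxon, L1): 1.5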


Using the pseudo-model and Section 3.5, a useful asymptotic representation of the Wilcoxon estimate is given by
$$\sqrt{n}(\hat{\theta}_{W,n} - \theta_0) = \tau(n^{-1}X^{*T}X^*)^{-1}n^{-1/2}X^{*T}\{\varphi[H(Y^* - X^*\theta_0)]\} + o_p(1) , \quad (3.14.10)$$
where $X^*$ is the $n\times p$ matrix whose $i$th row is $\{\nabla f_i(\theta_0)\}^T$ and $Y^*$ is the $n\times 1$ vector with $i$th component $Y_i - f_i(\theta_0) + \{\nabla f_i(\theta_0)\}^T\theta_0$.

Based on (3.14.10), we can obtain the influence function of the Wilcoxon estimate. Assume $f_i$ depends on a set of predictors $z_i \in \mathcal{Z} \subset \Re^m$ as $f_i(\theta) = f(z_i, \theta)$. Assume also that $f$ is a continuous function of $\theta$ for each $z \in \mathcal{Z}$ and is a measurable function of $z$ for each $\theta \in \Theta$ with respect to a $\sigma$-finite measure. Under these assumptions, the representation above gives us the local influence function of the Wilcoxon estimate at the point $(z_0, y_0)$,
$$\mathrm{IF}(z_0, y_0; \hat{\theta}_{W,n}) = \tau\{\Sigma(\theta_0)\}^{-1}\{\varphi[H(y_0)]\}\nabla f(z_0, \theta_0) .$$

Note that the influence function is unbounded if the tangent plane of S at θ 0


is unbounded. This phenomenon corresponds to the existence of high leverage
points in linear regression. The HBR estimators, however, can be extended to
the nonlinear model, also; see Abebe and McKean (2010). These are robust
in such cases.

3.14.1 Implementation

To implement the asymptotic inference based on the Wilcoxon estimate, we need a consistent estimator of the variance-covariance matrix. Define the statistic $\Sigma(\hat{\theta}_{W,n})$ to be $\Sigma(\theta_0)$ of expression (3.14.3) with $\theta_0$ replaced by $\hat{\theta}_{W,n}$. By the consistency of $\hat{\theta}_{W,n}$ to $\theta_0$, $\Sigma(\hat{\theta}_{W,n})$ converges in probability to $\Sigma(\theta_0)$. Next, it follows from the asymptotic representation (3.14.10) that the estimator (3.7.8) of $\tau$ proposed by Koul et al. (1987) for linear models is also a consistent estimator of $\tau$ for our nonlinear model. We denote this estimator by $\hat{\tau}$. Thus $\hat{\tau}^2\Sigma^{-1}(\hat{\theta}_{W,n})$ is a consistent estimator of the asymptotic variance-covariance matrix of $\hat{\theta}_{W,n}$.

Estimation Algorithm
Similar to the LS estimates for nonlinear models, a Gauss-Newton-type of
algorithm can be used to obtain the Wilcoxon fit. Recall that this is an it-
erated algorithm which uses the Taylor Series expansion of the function f(θ)
evaluated at the current estimate to obtain the estimate at the next iteration.
Thus each iteration consists of fitting a linear model. Abebe and McKean
(2007) show that this algorithm for obtaining the Wilcoxon fit converges in
a probability sense. Using this algorithm, all that is required to compute the Wilcoxon nonlinear estimate is a computational procedure for Wilcoxon linear model estimates; see Exercise 3.15.44 for further discussion.

Table 3.14.1: Wilcoxon and LS Estimates Based on the Original Data with Standard Errors (SE) and the Wilcoxon Estimates Based on the Data with the Substituted Gross Outlier

            Original Data Set                Outlier Data Set
      Wil. Est.     SE     LS Est.     SE     Wil. Est.     SE
θ1     0.1902    0.0161    0.1903    0.0219    0.1897    0.0161
θ2     0.0061    0.0002    0.0061    0.0003    0.0061    0.0002
θ3     0.0197    0.0006    0.0105    0.0008    0.0107    0.0006
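A minimal sketch of such a Gauss-Newton iteration is given below. It assumes the user supplies f(theta) and its $n\times p$ gradient matrix gradf(theta), and it uses rfit from the Rfit package for the linear-model Wilcoxon fit at each step; the function and argument names are hypothetical and, for simplicity, the sketch fits an intercept at each step rather than using the regression-through-the-origin refinement of Exercise 3.15.39.

library(Rfit)
wilnl = function(y, f, gradf, theta0, maxit = 50, tol = 1e-6) {
  theta = theta0
  for (it in 1:maxit) {
    X = gradf(theta)          # design matrix of the local linear model
    r = y - f(theta)          # current residuals: the working response
    step = coef(rfit(r ~ X))[-1]   # Wilcoxon slopes give the increment
    theta = theta + step
    if (sqrt(sum(step^2)) < tol) break
  }
  theta
}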
We next consider an example that demonstrates the robustness and effi-
ciency properties of the rank estimators in comparison to the LS estimator in
practical situations.

Example 3.14.1 (Chwirut's Data). These data are taken from the ultrasonic block reference study by Chwirut (1979). The response variable is ultrasonic response and the predictor variable is metal distance. The study involved 214 observations. The model under consideration is
$$f_i(\theta) \equiv f(x_i; \theta_1, \theta_2, \theta_3) \equiv \frac{\exp[-\theta_1 x_i]}{\theta_2 + \theta_3 x_i} , \quad i = 1, \ldots, 214 .$$

Using the Wilcoxon and LS fitting procedures, we fit the (original) data and
then a data set with one observation replaced by an outlier. Figure 3.14.1
displays the results of the fits.
For the original data, as shown in the figure and by the estimates given
in Table 3.14.1, the LS and Wilcoxon fits are quite similar. As shown in the
residual plots of Figure 3.14.1, there are several moderate outliers in the orig-
inal data set. These outliers have an impact on the LS estimate of scale,
the square root of MSE, which has the value $\hat{\sigma}_{LS} = 3.36$. In contrast, the Wilcoxon estimate of $\tau$ is $\hat{\tau} = 2.45$, which explains why the Wilcoxon standard errors in Table 3.14.1 are smaller than those of LS.
For robustness considerations, we introduced a gross outlier in the response
space (observation 17 was changed from 8.025 to 5000). The Wilcoxon and LS
fits were obtained. As shown in Figure 3.14.1, the LS estimate essentially did
not converge. From the plot of the fitted models and residual plots, it is clear
that the Wilcoxon fit performs dramatically better than its LS counterpart. In Table 3.14.1 the Wilcoxon estimates are displayed with their standard errors. There is basically little difference between the Wilcoxon fits for the original data set and the data set with the gross outlier.
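For instance, the Gauss-Newton sketch given above could be applied to this model along the following lines, where x and y hold the metal distances and ultrasonic responses and the starting values are hypothetical:

f     = function(th) exp(-th[1]*x) / (th[2] + th[3]*x)
gradf = function(th) {
  den = th[2] + th[3]*x
  num = exp(-th[1]*x)
  cbind(-x*num/den,        # df/dtheta1
        -num/den^2,        # df/dtheta2
        -x*num/den^2)      # df/dtheta3
}
theta.w = wilnl(y, f, gradf, theta0 = c(0.1, 0.01, 0.02))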


Figure 3.14.1: Analysis of Chwirut's data. (a) Wilcoxon residuals versus predictor; (b) LS residuals versus predictor; (c) Wilcoxon and LS fits, original data; (d) Wilcoxon residuals versus predictor, outlier data; (e) LS residuals versus predictor, outlier data; (f) Wilcoxon and LS fits, outlier data. [Plot not reproduced.]


3.15 Exercises
3.15.1. For the baseball data in Example 3.3.2, explore other transformations
of the predictor Years in order to obtain a better fitting model than the one
discussed in the example.

3.15.2. Consider the linear model (3.2.2).


(a) Show that the ranks of the residuals can only change values at the $\binom{n}{2}$ equations $y_i - x_i'\beta = y_j - x_j'\beta$.

(b) Determine the change in dispersion as β moves across one of these defining
planes.

(c) For the telephone data, Example 3.3.1, obtain the plot shown in Panel
D of Figure 3.3.1; i.e., a plot of the dispersion function D(β) for a set
of values β in the interval (−.2, .6). Locate the estimate of slope on the
plot.

(d) Plot the gradient function S(β) for the same set of values β in the interval
(−.2, .6). Locate the estimate of slope on the plot.

3.15.3. In Section 2.2 of Chapter 2, the two-sample location problem was


modeled as a regression problem; see (2.2.2). Consider fitting this model using
Wilcoxon scores.

(a) Show that the gradient test statistic (3.5.8) simplifies to the square of the
standardized MWW test statistic (2.2.21).

(b) Show that the regression estimate of the slope parameter is the Hodges-
Lehmann estimator given by expression (2.2.18).

(c) Verify Parts (a) and (b) by fitting the data in the two-sample problem of
Exercise 2.13.45 as a regression model.

3.15.4. For the simple linear regression problem, if the values of the independent variable $x$ are distinct and equally spaced, show that the Wilcoxon test statistic is equivalent to the test for correlation based on Spearman's $r_s$, where
$$r_s = \frac{\sum\left(R(x_i) - \frac{n+1}{2}\right)\left(R(y_i) - \frac{n+1}{2}\right)}{\sqrt{\sum\left(R(x_i) - \frac{n+1}{2}\right)^2}\sqrt{\sum\left(R(y_i) - \frac{n+1}{2}\right)^2}} .$$
Note that the denominator of $r_s$ is a constant. Obtain its value.


3.15.5. For the simple linear regression model, consider the process
$$T(\beta) = \sum_{i=1}^n\sum_{j=1}^n \mathrm{sgn}(x_i - x_j)\,\mathrm{sgn}((Y_i - x_i\beta) - (Y_j - x_j\beta)) .$$
(a) Show, under the null hypothesis $H_0: \beta = 0$, that $E(T(0)) = 0$ and that $\mathrm{Var}(T(0)) = 2(n-1)n(2n+5)/9$.

(b) Determine the estimate of $\beta$ based on inverting the test statistic $T(0)$; i.e., the value of $\beta$ which solves
$$T(\beta) \doteq 0 .$$

(c) Show that when the two-sample problem is written as a regression model,
(2.2.2), this estimate of β is the Hodges-Lehmann estimate (2.2.18).
Note: Kendall’s τ is a measure of association between xi and Yi given by
τ = T (0)/(n(n − 1)); see Chapter 4 of Hettmansperger (1984) for further
discussion.
3.15.6. Show that the R estimate, $\hat{\beta}_\varphi$, is an equivariant estimator; that is, $\hat{\beta}_\varphi(Y + X\delta) = \hat{\beta}_\varphi(Y) + \delta$ and $\hat{\beta}_\varphi(kY) = k\hat{\beta}_\varphi(Y)$.

3.15.7. Consider Model 3.2.1 and the hypotheses (3.2.5). Let ΩF denote the
column space of the full model design matrix X and let ΩR denote the subspace
of ΩF subject to H0 . Show that ΩR is a subspace of ΩF and determine its
dimension. Hint: One way of establishing the dimension is to show that C =
X(X′X)−1 M′ is a basis matrix for ΩF ∩ ΩcR .
3.15.8. Show that Assumptions (3.4.9) and (3.4.8) imply Assumption (3.4.7).
3.15.9. For the special case of Wilcoxon scores, obtain the proof of Theorem
3.5.2 by first getting the projection of the statistic S(0).
3.15.10. Assume that the errors $e_i$ in Model (3.2.2) have finite variance $\sigma^2$. Let $\hat{\beta}_{LS}$ denote the least squares estimate of $\beta$. Show that $\sqrt{n}(\hat{\beta}_{LS} - \beta) \stackrel{D}{\rightarrow} N_p(0, \sigma^2\Sigma^{-1})$. Hint: First show that the LS estimate is location and scale equivariant. Then, without loss of generality, we can assume that the true $\beta$ is 0.
3.15.11. Under the additional assumption that the errors have a symmetric
distribution, show that R estimates are unbiased for all sample sizes.
3.15.12. Let ϕf (u) = −f ′ (F −1 (u))/f (F −1(u)) denote the optimal scores for
the density f (x) and suppose that f is symmetric. Show that ϕf (1 − u) =
−ϕf (u); that is, the optimal scores are odd about 1/2.


3.15.13. Suppose the errors ei are double exponentially distributed. Show


that the L1 estimate, i.e. the R estimate based on sign scores, is the maximum
likelihood estimate.
3.15.14. Using Theorem 3.5.7, show that
$$\begin{pmatrix}\hat{\alpha}^*_S\\ \hat{\beta}_\varphi\end{pmatrix} \sim N_{p+1}\left(\begin{pmatrix}\alpha_0 - \bar{x}'\beta_0\\ \beta_0\end{pmatrix}, \begin{pmatrix}\kappa_n & -\tau_\varphi^2\bar{x}'(X'X)^{-1}\\ -\tau_\varphi^2(X'X)^{-1}\bar{x} & \tau_\varphi^2(X'X)^{-1}\end{pmatrix}\right) , \quad (3.15.1)$$
where $\kappa_n = n^{-1}\tau_S^2 + \tau_\varphi^2\bar{x}'(X'X)^{-1}\bar{x}$ and $\tau_S$ and $\tau_\varphi$ are given respectively by (3.4.6) and (3.4.4).
3.15.15. Show that the random vector within the brackets in the proof of
Lemma 3.6.2 is bounded in probability.
3.15.16. Show that difference between the numerators of the two F -statistics,
(3.6.10) and (3.6.12), converges to 0 in probability under the null hypothesis.
3.15.17. Show that the difference between Fϕ , (3.6.10), and Aϕ , (3.6.15),
converges to 0 in probability under the null hypothesis.
3.15.18. By showing the following results, establish the asymptotic distribution of the least squares test statistic, $F_{LS}$, under the sequence of models (3.6.22), with the additional assumption that the random errors have finite variance $\sigma^2$.

(a) First show that
$$\frac{1}{\sqrt{n}}X'Y \stackrel{D}{\longrightarrow} N\left(\begin{pmatrix}B\theta\\ A_2\theta\end{pmatrix}, \sigma^2 I\right) , \quad (3.15.2)$$
where the matrices $A_2$ and $B$ are defined in the proof of Theorem 3.6.1. This can be established by using the Lindeberg-Feller CLT, Theorem A.1.1 of the Appendix, to show that an arbitrary linear combination of the components of the random vector on the left side converges in distribution to a random variable with a normal distribution.

(b) Based on Part (a), show that
$$\left[-B'A_1^{-1} \,\vdots\, I\right]\frac{1}{\sqrt{n}}X'Y \stackrel{D}{\longrightarrow} N\left(W^{-1}\theta, \sigma^2 W^{-1}\right) , \quad (3.15.3)$$
where the matrices $A_1$ and $W$ are defined in the proof of Theorem 3.6.1.

(c) Let $F_{LS}(\sigma^2)$ denote the LS $F$-test statistic with the true value of $\sigma^2$ replacing the estimate $\hat{\sigma}^2$. Show that
$$F_{LS}(\sigma^2) = \left(\left[-B'A_1^{-1} \,\vdots\, I\right]\frac{1}{\sqrt{n}}X'Y\right)'\sigma^{-2}W\left(\left[-B'A_1^{-1} \,\vdots\, I\right]\frac{1}{\sqrt{n}}X'Y\right) . \quad (3.15.4)$$


(d) Based on (3.15.3) and (3.15.4), show that $F_{LS}(\sigma^2)$ has a limiting noncentral $\chi^2$ distribution with noncentrality parameter given by expression (3.6.27).

(e) Obtain the final result by showing that $\hat{\sigma}^2$ is a consistent estimate of $\sigma^2$ under the sequence of models (3.6.22).
3.15.19. Show that D e , (3.6.28) is a scale parameter; i.e., De (Fae+b ) =
|a|De (Fe ).
3.15.20. Establish expression (3.6.33).
3.15.21. Suppose Wilcoxon scores are used.
(a) Establish the expressions (3.6.34) and (3.6.35).

(b) Similarly for sign scores establish (3.6.36) and (3.6.37).


3.15.22. Consider the model (3.2.1) and hypotheses (3.6.8). Suppose
the errors have a double exponential distribution with density f (t) =
(2b)−1 exp {−|t|/b}. Assume b is known. Show that the likelihood ratio test
is equivalent to the drop in dispersion test based on sign scores.
3.15.23. Establish expressions (3.9.8) and (3.9.9).
3.15.24. Let X be a random variable with distribution function FX (x) and
let Y = aX + b. Define the quantile function of X as qX (p) = FX−1 (p). Show
that qX (p) is a linear function of qY (p).
3.15.25. Verify expression (3.9.17).
3.15.26. Assume that the errors have a normal distribution. Show that $\hat{K}_2$, (3.9.25), converges in probability to 1.
3.15.27. Verify expression (3.9.34).
3.15.28. Proceeding as in Theorem 3.9.3, show that the first order representation of the fitted value $\hat{Y}_R$ is given by expression (3.9.36). Next show that the approximate variance of the $i$th fitted case is given by expression (3.9.38).
3.15.29. Consider the mean shift model, (3.9.32). Show that the estimator of
θi given by the numerator of expression (3.9.35) is based on the inversion of
an aligned rank statistic to test the hypotheses (3.9.33).
3.15.30. Assume that the errors have a symmetric distribution. Verify ex-
pressions (3.9.41) and (3.9.42).
3.15.31. Assume that the errors have the distribution GF (2m1 , 2m2 ).


(a) Show that the optimal rank score function is given by expression (3.10.6).

(b) Show that the asymptotic relative efficiency between the Wilcoxon anal-
ysis and the rank-based analysis based on the optimal scores for the
distribution GF (2m1 , 2m2 ) is given by expression (3.10.8).
3.15.32. Suppose the errors have density function
$$f_{m_2}(x) = e^{-x}(1 + m_2^{-1}e^{-x})^{-(m_2+1)} , \quad m_2 > 0 , \; -\infty < x < \infty . \quad (3.15.5)$$

(a) Show that the optimal scores are given by expression (3.10.7).

(b) Show that the asymptotic relative efficiency of the Wilcoxon analysis to
the rank analysis based on the optimal rank score function for the density
(3.15.5) is given by expression (3.10.9).
3.15.33. The definition of the modulus of a matrix A is given in expression
(3.11.6). Verify the three properties concerning the modulus of a matrix listed
in the text following this definition.
3.15.34. Consider Example 3.11.1. If Wilcoxon scores are used, show that $\bar{D}_y = \sqrt{3/4}\,E|Y_1 - Y_2|$, where $Y_1, Y_2$ are iid with distribution function $G$, and that $\bar{D}_e = \sqrt{3/4}\,E|e_1 - e_2|$, where $e_1, e_2$ are iid with distribution function $F$. Next assume that sign scores are used. Show that $\bar{D}_y = E|Y - \mathrm{med}\,Y|$, where $\mathrm{med}\,Y$ denotes the median of $Y$. Likewise $\bar{D}_e = E|e - \mathrm{med}\,e|$.

3.15.35. In Example 3.11.3, show that the coefficients of multiple determination $R_1$ and $R_2$ given by expressions (3.11.27) and (3.11.28), respectively, are one-to-one functions of $R^2$ given by expression (3.11.11).
3.15.36. At the end of Example 3.11.3, verify, for Wilcoxon scores and sign
scores, that (1/(2T 2)) = π/6 and (1/(2T 2)) = π/4, respectively.

3.15.37. In Example 3.11.4, show that the density of $Y$ is given by
$$g(y) = \frac{1-\epsilon}{\sqrt{2}}\,\phi\!\left(\frac{y}{\sqrt{2}}\right) + \frac{\epsilon}{\sqrt{1+\sigma_c^2}}\,\phi\!\left(\frac{y}{\sqrt{1+\sigma_c^2}}\right) .$$
Using this, verify the expressions for $\bar{D}_y$, $\bar{D}_e$, and $\tau_\varphi$ found in the example.
3.15.38. For the baseball data given in Exercise 1.12.33, consider the variables
height and weight.

(a) Obtain the scatterplot of height versus weight.

(b) Obtain the CMDs: $R^2$, $R_1$, $R_2$, $R_1^{*2}$, and $R_2^{*2}$.


3.15.39. Consider a linear model of the form
$$Y = X^*\beta^* + e , \quad (3.15.6)$$
where $X^*$ is $n\times p$ whose column space $\Omega^*_F$ does not include $1$. This model is often called regression through the origin. Note for the pseudo-norm $\|\cdot\|_\varphi$ that
$$\|Y - X^*\beta^*\|_\varphi = \sum_{i=1}^n a(R(y_i - x_i^{*\prime}\beta^*))(y_i - x_i^{*\prime}\beta^*)$$
$$= \sum_{i=1}^n a(R(y_i - (x_i^* - \bar{x}^*)'\beta^*))(y_i - (x_i^* - \bar{x}^*)'\beta^*)$$
$$= \sum_{i=1}^n a(R(y_i - \alpha - (x_i^* - \bar{x}^*)'\beta^*))(y_i - \alpha - (x_i^* - \bar{x}^*)'\beta^*) , \quad (3.15.7)$$
where $x_i^*$ is the $i$th row of $X^*$ and $\bar{x}^*$ is the vector of column averages of $X^*$. Based on this result, the estimate of the regression coefficients based on the R fit of Model (3.15.6) is estimating the regression coefficients of the centered model, i.e., the model with the design matrix $X = X^* - H_1X^*$. Hence, in general, the parameter $\beta^*$ is not estimated. This problem also occurs in a weighted regression model. Dixon and McKean (1996) proposed the following solution. Assume that (3.15.6) is the true model, but obtain the R fit of the model:
$$Y = 1\alpha_1 + X^*\beta_1^* + e = [1\; X^*]\begin{pmatrix}\alpha_1\\ \beta_1^*\end{pmatrix} + e , \quad (3.15.8)$$
where the true $\alpha_1$ is 0. Let $X_1 = [1\; X^*]$ and let $\Omega_1$ denote the column space of $X_1$. Let $\hat{Y}_1 = 1\hat{\alpha}_1 + X^*\hat{\beta}_1^*$ denote the R fitted value based on the fit of Model (3.15.8). Note that $\Omega^* \subset \Omega_1$. Let $\hat{Y}^* = H_{\Omega^*}\hat{Y}_1$ be the projection of this fitted value onto the desired space $\Omega^*$. Finally, estimate $\beta^*$ by solving the equation
$$X^*\hat{\beta}^* = \hat{Y}^* . \quad (3.15.9)$$

(a) Show that $\hat{\beta}^* = (X^{*\prime}X^*)^{-1}X^{*\prime}\hat{Y}_1$ is the solution of (3.15.9).

(b) Assume that the density function of the errors is symmetric, that the R score function is odd about $1/2$, and that the intercept $\alpha_1$ is estimated by solving the equation $T^+(\hat{e}_R - \alpha) \doteq 0$, as discussed in Section 3.5.2. Under these assumptions, show that
$$\hat{\beta}^* \mbox{ has an approximate } N(\beta^*, \tau^2(X^{*\prime}X^*)^{-1}) \mbox{ distribution.} \quad (3.15.10)$$


(c) Next, suppose that the intercept is estimated by the median of the residuals from the R fit of (3.15.8). Using the asymptotic representations (3.5.23) and (3.5.22), show that the asymptotic representation of $\hat{\beta}^*$ is given by
$$\hat{\beta}^* = \tau_S(X^{*\prime}X^*)^{-1}X^{*\prime}H_1\mathrm{sgn}(e) + \tau_\varphi(X^{*\prime}X^*)^{-1}X^{*\prime}H_X\varphi[F(e)] + o_p(1/\sqrt{n}) . \quad (3.15.11)$$
Use this result to show that the asymptotic variance of $\hat{\beta}^*$ is given by
$$\mathrm{AsyVar}(\hat{\beta}^*) = \tau_S^2(X^{*\prime}X^*)^{-1}X^{*\prime}H_1X^*(X^{*\prime}X^*)^{-1} + \tau_\varphi^2(X^{*\prime}X^*)^{-1}X^{*\prime}H_XX^*(X^{*\prime}X^*)^{-1} . \quad (3.15.12)$$

(d) Show that the invariance to $\bar{x}^*$ as shown in (3.15.7) is true for any pseudo-norm.
3.15.40. The data in Table 3.15.1 are presented in Graybill and Iyer (1994). The dependent variable is the weight (in grams) of a crystalline form of a certain chemical compound, while the independent variable is the length of time (in hours) that the crystal was allowed to grow. A model of interest is the regression through the origin model (3.15.6). Obtain the R estimate of $\beta^*$ for these data using the procedure described in (3.15.9). Compare this fit with the R fit of the intercept model.

Table 3.15.1: Crystal Data for Exercise 3.15.40
Time (hours)      2     4     6     8    10    12    14
Weight (grams)  0.08  1.12  4.43  4.98  4.92  7.18  5.57
Time (hours)     16    18    20    22    24    26    28
Weight (grams)  8.40  8.881 10.81 11.16 10.12 13.12 15.04

3.15.41. Recall that the GR estimates minimize the pseudo-norm in expression (3.12.1), where the weights $b_{ij}$ are functions of the design matrix. Let
$$w_{ij} = \left\{\begin{array}{ll} -(1/n)b_{ij} & i \neq j\\ (1/n)\sum_{k\neq i}b_{ik} & i = j\end{array}\right. .$$
Let $W = [w_{ij}]$. Then, under regularity conditions, the asymptotic covariance matrix of the GR estimator is given by $\tau^2 D$, where $\tau = 1/(\sqrt{12}\int f(t)^2\,dt)$ and $D$ is the matrix $B + (X'X)^{-1}$, where
$$B = (X'WX)^{-1}X'W^2X(X'WX)^{-1} - (X'X)^{-1} ;$$

see Naranjo and Hettmansperger (1994). Note that if $B$ is positive semi-definite, then the Wilcoxon estimator is always at least as efficient as (and possibly equally efficient as) the GR estimator. Show that $B$ is positive semi-definite by filling in the details of the following proof.
Proof: Let $v$ be any vector in $R^p$. Since $X'WX$ is nonsingular, there exists a vector $u$ such that $v = X'WXu$. Hence, by the Pythagorean Theorem,
$$v'Bv = \|WXu\|^2 - \|HWXu\|^2 \geq 0 ,$$
where $H$ is the projection matrix onto the column space of $X$.
Hence, there is always a loss of efficiency when using the GR estimates and, at times, this loss can be severe. If the design matrix, though, has clusters of outlying points, then this downweighting may be necessary. Discuss why the above proof cannot be used for the HBR estimator.

3.15.42. Consider the linear model (3.2.3). Show that expression (3.12.1) is
a pseudo-norm.

3.15.43. Consider the pseudo-linear model (3.14.4) of Section 3.14. For the
Wilcoxon pseudo estimator, obtain the asymptotic result (3.14.7).

3.15.44. By filling in the brief sketch below, write out the Gauss-Newton
algorithm for the Wilcoxon estimate of the nonlinear model (3.14.1).
Let $\hat{\theta}^0$ be an initial estimate of $\theta$ and let $f^0 = f(\hat{\theta}^0)$. Write the norm to be minimized as $\|Y - f\|_W = \|Y - f^0 + [f^0 - f]\|_W$. Then use a Taylor series
of order 1 to approximate the term in brackets. The increment for the next
step estimate is the Wilcoxon estimator of this approximate linear model with
Y − f0 as the dependent variable. For actual implementation, discuss why the
regression through the origin algorithm of Exercise 3.15.39 is usually necessary
here.

3.15.45. Show that the $p\times p$ matrix $C_n$, defined in expression (3.12.7), can be written alternately as $C_n = \sum_{i<j}\gamma_{ij}b_{ij}(x_j - x_i)(x_j - x_i)'$.

3.15.46. Consider the influence function of the HBR estimate given in expres-
sion (A.5.22). If the weights for residuals and the xs are both set at 1, show
that the influence function of the HBR estimate simplifies to the influence
function of the Wilcoxon estimate given in (3.5.17).

Chapter 4

Experimental Designs: Fixed Effects

4.1 Introduction
In this chapter we discuss rank-based inference for experimental designs based
on the theory developed in Chapter 3. We concentrate on factorial type de-
signs and analysis of covariance designs but, based on our discussion, it is
clear how to extend the rank-based analysis for any fixed effects design. For
example, based on this rank-based inference Vidmar and McKean (1996) de-
veloped a response surface methodology which is quite analogous to the tra-
ditional response surface methods. We discuss estimation of effects, tests of
linear hypotheses concerning effects, and multiple comparison procedures. We
illustrate this rank-based inference with numerous examples. One purpose of
our discussion is to show how this rank-based analysis is analogous to the tra-
ditional analysis based on least squares. In Section 4.2.5 we introduce pseudo-
observations which are based on an R fit of the full model. We show that
the rank-based analysis (Wald-type) can be obtained by substituting these
pseudo-observations in place of the responses in a package that obtains the
traditional analysis. We begin with the one-way design.
In our development we apply rank scores to residuals. In this sense our
methods are not pure rank statistics; but they do provide consistent and highly
efficient tests for traditional linear hypotheses. The rank transform method is
a pure rank test and it is discussed in Section 4.7 where we describe various
drawbacks to the approach for testing traditional linear hypotheses in linear
models. Brunner and his colleagues have successfully developed a general ap-
proach to testing in designed experiments based on pure ranks, although the
hypotheses of their approach are generally not linear hypotheses. Brunner and
Puri (1996) provide an excellent survey of these pure rank tests. We do not
pursue them further in this book.


We only consider fixed effects linear models in this chapter. In Chapter


5, we extend our discussion to mixed models and, in general, to models with
dependent error structure. There have been extensions of robust fitting to
discrete responses; see, for example, Vidmar, McKean, and Hettmansperger
(1992), Stefanski, Carroll, and Ruppert (1986), and Li (1991).
The computation for the methods discussed in this chapter can be per-
formed by the R packages ww and Rfit.
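For instance, with Rfit a one-way Wilcoxon fit can be obtained along the following lines. This is only a sketch: the data frame quail with variables ldl and drug is a hypothetical stand-in for the LDL data analyzed below, and argument and component conventions may differ across versions of the package.

library(Rfit)   # rank-based fitting package of Kloke and McKean

fit <- rfit(ldl ~ drug, data = quail)   # Wilcoxon scores by default
summary(fit)    # coefficients with standard errors based on the estimate of tau

# Drop-in-dispersion test that all non-intercept coefficients are zero,
# i.e., no location differences among the drug levels.
drop.test(fit)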

4.2 One-way Design


Suppose we want to determine the effect that a single factor A has on a re-
sponse of interest over a specified population. Assume that A has k levels,
each level being referred to as a treatment group. In this situation, the com-
pletely randomized design is often used to investigate the effect of A. For this
design n subjects are selected at random from the population of interest and
ni of these subjects are randomly assigned to level i of A, for i = 1, . . . , k. Let
Yij denote the response of the jth subject in the ith level of A. We assume
that the responses are independent of one another and that the distributions
among levels differ by at most shifts in location. Although the randomization
gives some credence to the assumption of independence, after fitting the model
a residual analysis should be conducted to check this assumption and the as-
sumption that the level distributions differ by at most a shift in locations.
Under these assumptions, the full model can be written as

Yij = µi + eij j = 1, . . . , ni , i = 1, . . . , k , (4.2.1)

where the eij s are iid random variables with density f (x) and distribution
function F (x) and the parameter µi is a convenient location parameter (for
example, the mean or median). Let T (F ) denote the location functional. As-
sume, without loss of generality, that T (F ) = 0. Let ∆ii′ denote the shift
between the distributions of Yij and Yi′ l . Recall from Chapter 2 that the pa-
rameters ∆ii′ are invariant to the choice of locational functional and that
∆ii′ = µi − µi′ . If µi is the mean of the Yij then Hocking (1985) calls this the
means model. If µi is the median of the Yij then we call it the medians
model; see Section 4.2.4 below.
Observational studies can also be modeled this way. Suppose k independent
samples are drawn from k different populations. If we assume further that the
distributions for the different populations differ by at most a shift in locations
then Model (4.2.1) is appropriate. But as in all observational studies, care
must be taken in the interpretation of the results of the analyses.
While the parameters µi fix the locations, the parameters of interest in
this chapter are contrasts of the form h = Σ_{i=1}^k ci µi where Σ_{i=1}^k ci = 0.


Similar to the shift parameters, contrasts are invariant to the choice of location
functional. In fact contrasts are linear functions of these shifts; i.e.,
h = Σ_{i=1}^k ci µi = Σ_{i=1}^k ci (µi − µ1) = Σ_{i=2}^k ci ∆i1 = c1′ ∆1 ,   (4.2.2)

where c1′ = (c2, . . . , ck) and

∆1′ = (∆21, . . . , ∆k1)   (4.2.3)

is the vector of location shifts from the first cell. In order to easily reference
the theory of Chapter 3, we often use ∆1 which references cell 1. But picking
cell 1 is only for convenience and similar results hold for the selection of any
other cell.
As in Chapter 2, we can write this model in terms of a linear model as
follows. Let Z′ = (Y11, . . . , Y1n₁, . . . , Yk1, . . . , Yknₖ) denote the vector of all
observations, µ′ = (µ1, . . . , µk) denote the vector of locations, and n = Σ ni
denote the total sample size. The model can then be expressed as a linear
model of the form

Z = Wµ + e ,   (4.2.4)

where e denotes the n × 1 vector of random errors eij and the n × k design
matrix W denotes the appropriate incidence matrix of 0s and 1s; i.e.,

W = [ 1n₁   0   · · ·   0
       0   1n₂  · · ·   0
       ⋮     ⋮   · · ·    ⋮
       0    0   · · ·  1nₖ ] .   (4.2.5)

Note that the vector 1n is in the column space of W; hence, the theory derived
in Chapter 3 is valid for this model.
At times it is convenient to reparameterize the model in terms of a vector
of shift parameters. For the vector ∆1 , let W1 denote the last k − 1 columns
of W and let X be the centered W1 ; i.e., X = (I − H1 )W1 , where H1 =
1(1′ 1)−1 1′ = n−1 11′ and 1′ = (1, . . . , 1) . Then we can write Model (4.2.4) as

Z = α1 + X∆1 + e , (4.2.6)

where ∆1 is given in expression (4.2.3). It is easy to show that for any matrix
[1 | X*] having the same column space as W, its corresponding non-intercept
parameters are linear functions of the shifts and, hence, are invariant to the
selected location functional. The relationship between Models (4.2.4) and
(4.2.6) is explored further in Section 4.2.4.
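As a concrete illustration of this setup (a sketch in R; the cell sizes are arbitrary), the incidence matrix W of (4.2.5) and the centered matrix X of (4.2.6) can be built directly:

# Incidence matrix W for a one-way design, eq. (4.2.5).
ni <- c(3, 2, 4)                        # n_1, ..., n_k (illustrative)
k  <- length(ni); n <- sum(ni)
g  <- factor(rep(seq_len(k), times = ni))
W  <- model.matrix(~ g - 1)             # n x k matrix of 0s and 1s

# Centered design for the shift parameterization, eq. (4.2.6):
# X = (I - H_1) W_1, where W_1 is the last k - 1 columns of W and
# H_1 = (1/n) 1 1' projects onto the vector of ones.
W1 <- W[, -1, drop = FALSE]
X  <- scale(W1, center = TRUE, scale = FALSE)   # column centering = (I - H_1) W_1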


4.2.1 R Fit of the One-way Design


Note that the sum of all the column vectors of W equals the vector of ones
1n . Thus 1n is in the column space of W and we can fit Model (4.2.4) by
using the R estimates discussed in Chapter 3. In this chapter we assume that
a specified score function, a(i) = ϕ(i/(n + 1)), has been chosen which, without
loss of generality, has been standardized; recall (S.1), (3.4.10). A convenient
way of obtaining the fit is by the QR-decomposition algorithm on the incidence
matrix W; see Section 3.7.3.
For the R fits used in the examples of this chapter, we use the cell median
model; that is, Model (4.2.4) with T(F) = 0, where T denotes the median
functional and F denotes the distribution of the random errors. We use the
score function ϕ(u) to obtain the R fit of this model. Let X∆̂1 denote the
fitted value. As discussed in Chapter 3, X∆̂1 lies in the column space of the
centered matrix X = (I − H1)W1. We then estimate the intercept as

α̂S = med_{1≤i≤n} {Zi − xi′∆̂1} ,   (4.2.7)

where xi is the ith row of X. The final fitted value and the residuals are,
respectively,

Ẑ = α̂S 1 + X∆̂1   (4.2.8)
ê = Z − Ẑ .   (4.2.9)
Note that Ẑ lies in the column space of W and that, further, T(Fn) = 0,
where Fn denotes the empirical distribution function of the residuals and T is
the median location functional. Denote the fitted value of the response Yij as
Ŷij. Given Ẑ, we find from (4.2.4) that µ̂ = (W′W)^{-1}W′Ẑ. Because W is an
incidence matrix, the estimate of µi is the common fitted value of the ith cell
which, for future reference, is given by

µ̂i = Ŷij ,   (4.2.10)

for any j = 1, . . . , ni. In the examples below, we denote the R fit described in
this paragraph by stating that the model was fit using Wilcoxon scores and
the residuals were adjusted to have median zero.
the residuals were adjusted to have median zero.
It follows from Section 3.5.2 that µ̂ is asymptotically normal with mean µ.
To do inference based on these estimates of µi we need their standard errors;
these can be obtained immediately from the variance of the fitted values given
by expression (3.9.38). First note that the leverage value for an observation
in the ith cell is 1/ni and, hence, the leverage value for the centered design
is hc,i = hi − n^{-1} = (n − ni)/(nni). Therefore by (3.9.38) the approximate
variance of µ̂i is given by

Var(µ̂i) ≐ τϕ² (1/ni) + (τS² − τϕ²)(1/n) ,   i = 1, . . . , k ;   (4.2.11)


see Exercise 4.8.18.


Let τ̂ϕ and τ̂S denote, respectively, the estimates of τϕ and τS presented in
Section 3.7.1. The estimated approximate variance of µ̂i is expression (4.2.11)
with these estimates in place of τϕ and τS. Define the minimum value of the
dispersion function as DE; i.e.,

DE = D(ê) = Σ_{i=1}^k Σ_{j=1}^{n_i} a(R(êij)) êij .   (4.2.12)

The symbol DE stands for the dispersion of the errors and is analogous to LS
sums of squared errors, SSE. Upon fitting such a model, a residual analysis
as discussed in Section 3.9 should be conducted to assess the goodness of fit
of the model.
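Once the fit is in hand these quantities are simple to compute. A small sketch, continuing the earlier Rfit example (the components tauhat and taushat are assumed names for the estimates τ̂ϕ and τ̂S and may differ across versions):

# Estimated standard errors of the cell estimates, eq. (4.2.11), and the
# minimum dispersion DE, eq. (4.2.12), from the one-way fit.
ni   <- table(quail$drug)                 # cell sample sizes
n    <- sum(ni)
tauP <- fit$tauhat                        # estimate of tau_phi (assumed name)
tauS <- fit$taushat                       # estimate of tau_S (assumed name)
se.mu <- sqrt(tauP^2 / ni + (tauS^2 - tauP^2) / n)

ehat <- residuals(fit)
a    <- sqrt(12) * (rank(ehat) / (n + 1) - 0.5)   # standardized Wilcoxon scores
DE   <- sum(a * ehat)                             # dispersion of the errors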
Example 4.2.1 (LDL Cholesterol of Quail). Thirty-nine quail were randomly
assigned to four diets, each diet containing a different drug compound, which,
hopefully, would reduce low density lipid (LDL) cholesterol. The drug com-
pounds are labeled: I, II, III, and IV. At the end of the prescribed experimental
time the LDL cholesterol of each quail was measured. The data are displayed
in the comparison boxplots, found in Panel A of Figure 4.2.1; for convenience,
we have also placed the data at the url site listed in the Preface. From the
comparison boxplots, found in Panel A of Figure 4.2.1, it appears that Drug
Compound II was more effective than the other three in lowering LDL. The
data appear to be positively skewed with a long right tail. We fitted the data
using Wilcoxon scores, ϕ(u) = √12(u − 1/2), and adjusted the residuals to
have median 0. Panel B of Figure 4.2.1 displays the q−q plot of the Wilcoxon
Studentized residuals. The long right tail of the error distribution is apparent
from this plot, as are the six outliers. The estimates of τϕ and τS are
19.19 and 21.96, respectively. For comparison the LS estimate of σ was 30.49.
The Wilcoxon and LS estimates of the cell locations along with their standard
errors are:
Drug            Wilcoxon Fit         LS Fit
Compound        Est.     SE          Est.     SE
I               67.0     6.3         74.5      9.6
II              42.0     6.3         52.3      9.6
III             63.0     6.3         73.8      9.6
IV              62.0     6.6         67.6     10.1
The Wilcoxon and LS estimates of the location levels are quite different, as
they should be since they estimate different functionals under asymmetric
errors. The long right tail has drawn out the LS estimates. The standard
errors of the Wilcoxon estimates are much smaller than their LS counterparts.


Figure 4.2.1: Panel A: Comparison boxplots for data of Example 4.2.1 (LDL
cholesterol by drug compound); Panel B: Wilcoxon internal R Studentized
residual normal q−q plot (Studentized residuals versus normal quantiles).

The data set in the last example was taken from a much larger study dis-
cussed in McKean, Vidmar, and Sievers (1989). Most of the data in that study
exhibited long right tails. The left tails were also long; hence, transformations
such as logarithms were not effective. Scores more appropriate for positively
skewed data were used with considerable success in this study. These scores
are briefly discussed in Example 2.5.1.

4.2.2 Rank-Based Tests of H0 : µ1 = · · · = µk


Consider Model (4.2.4). A hypothesis of interest in the one-way design is that
there are no differences in the levels of A; i.e.,

H0 : µ1 = · · · = µk   versus   H1 : µi ≠ µi′ for some i ≠ i′ .   (4.2.13)

Define the (k − 1) × k matrix M as

M = [ 1  −1   0   0  · · ·   0
      1   0  −1   0  · · ·   0
      ⋮    ⋮    ⋮    ⋮  · · ·    ⋮
      1   0   0   0  · · ·  −1 ] .   (4.2.14)

Then Mµ = ∆1 , (4.2.3), and, hence, H0 is equivalent to Mµ = 0. Note that


the rows of M form k − 1 linearly independent contrasts in the vector µ. If
the design matrix given in (4.2.6) is used then the null hypothesis is simply


H0 : Ik−1 ∆1 = 0; that is, all the regression coefficients are zero. We discuss
two rank-based tests for this hypothesis.

One appropriate test statistic is the gradient test statistic, (3.5.8), which
is given by

T = σa^{-2} S(Z)′ (X′X)^{-1} S(Z) ,   (4.2.15)

where S(Z)′ = (S2(Z), . . . , Sk(Z)) for

Si(Z) = Σ_{j=1}^{n_i} a(R(Zij)) ,   (4.2.16)

and, as defined in Theorem 3.5.1,

σa² = (n − 1)^{-1} Σ_{i=1}^n a²(i) .   (4.2.17)

Based on Theorem 3.5.2 a level α test for H0 versus H1 is:

Reject H0 in favor of H1 if T ≥ χ²(α, k − 1) ,   (4.2.18)

where χ²(α, k − 1) denotes the upper level α critical value of the χ²-distribution
with k − 1 degrees of freedom. Because the design matrix X of Model (4.2.6)
is an incidence matrix, the gradient test simplifies. First note that

(X′X)^{-1} = (1/n1) J + diag(1/n2, . . . , 1/nk) ,   (4.2.19)

where J is a (k − 1) × (k − 1) matrix of ones; see Exercise 4.8.1. Since the
scores sum to 0, we have that S(Z)′ 1_{k−1} = −S1(Z). Upon combining these
results, the gradient test statistic simplifies to

Tϕ = σa^{-2} S(Z)′ (X′X)^{-1} S(Z) = σa^{-2} Σ_{i=1}^k (1/ni) Si²(Z) .   (4.2.20)

For Wilcoxon scores further simplification is possible. In this case

Si(Y) = √12 Σ_{j=1}^{n_i} ( R(Yij)/(n + 1) − 1/2 )
      = ( √12/(n + 1) ) ni ( R̄i − (n + 1)/2 ) ,   (4.2.21)

where R̄i denotes the average of the ranks from sample i. Also, for Wilcoxon
scores σa² = n/(n + 1). Thus the test statistic for Wilcoxon scores is given by

HW = ( 12/(n(n + 1)) ) Σ_{i=1}^k ni ( R̄i − (n + 1)/2 )² .   (4.2.22)


Table 4.2.1: Analysis of Dispersion Table for the Hypotheses (4.2.13)

Source   D=Dispersion   df      MD            F
A        RD             k − 1   RD/(k − 1)    Fϕ
Error                   n − k   τ̂ϕ/2

This is the Kruskal-Wallis (1952) test statistic. It is distribution free under


H0 . In the case of two levels, the Kruskal-Wallis test is equivalent to the
MWW test discussed in Chapter 2; see Exercise 4.8.2. From the discussion on
efficiency, Section 3.5, the efficiency results for the Kruskal-Wallis test are the
same as for the MWW.
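In R the Kruskal-Wallis test is available directly, and (4.2.22) is easy to verify by hand; the sketch below continues the illustrative quail data frame introduced earlier:

# Built-in Kruskal-Wallis test of H0: mu_1 = ... = mu_k.
kruskal.test(ldl ~ drug, data = quail)

# The same statistic computed from eq. (4.2.22).
r    <- rank(quail$ldl)                 # combined-sample ranks
n    <- length(r)
Rbar <- tapply(r, quail$drug, mean)     # average rank per level
ni   <- table(quail$drug)
HW   <- 12 / (n * (n + 1)) * sum(ni * (Rbar - (n + 1) / 2)^2)
HW   # compare with a chi-square critical value on k - 1 degrees of freedom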
As a second rank-based test, we briefly discuss the drop in dispersion test
for H0 versus H1 given by expression (3.6.10). Under the null hypothesis, the
underlying distributions of the k levels of A are the same; hence, the reduced
model is
Yij = µ + eij , (4.2.23)
where µ is a common location functional. Thus there are no parameters to fit
in this case and the reduced model dispersion is
DT = D(Y) = Σ_{i=1}^k Σ_{j=1}^{n_i} a(R(Yij)) Yij .   (4.2.24)

The symbol DT denotes the total dispersion in the problem, which is analogous
to the classical LS total variation, SST. Hence the reduction in dispersion
is RD = DT − DE, where DE is defined in expression (4.2.12), and the drop
in dispersion test is given by Fϕ = (RD/(k − 1))/(τ̂ϕ/2). As discussed in
Section 3.6 this should be compared with F-critical values having k − 1 and
n − k degrees of freedom. The analysis can be summarized in an analysis of
dispersion table of the form given in Table 4.2.1.
Because the Kruskal-Wallis test is a gradient test, the drop in disper-
sion test and the Kruskal-Wallis test have the same asymptotic efficiency; see
Section 3.6. The third test discussed in that section, the Wald-type test, is
discussed below, for this hypothesis, in Section 4.2.5.
Example 4.2.2 (LDL Cholesterol of Quail, Example 4.2.1 continued). For
the hypothesis of no difference among the locations of the cholesterol levels
of the drug compounds, hypotheses (4.2.13), the results of the LS F -test, the
Kruskal-Wallis test, and the drop in dispersion test can be found in Table
4.2.2. The long right tail of the errors spoiled the LS test statistic. Using
it, one would conclude that there is no significant difference among the drug
compounds which is inconsistent with the boxplots in Figure 4.2.1. On the
other hand, both robust procedures detect the differences among the drug
compounds, especially the drop in dispersion test statistic.


Table 4.2.2: Tests of Hypotheses (4.2.13) for the Quail Data

Procedure          Test Statistic   Scale (σ̂ or τ̂ϕ)   df        p-value
LS, FLS            1.14             30.5               (3, 35)   .35
Drop Disp., Fϕ     3.77             19.2               (3, 35)   .02
Kruskal-Wallis     7.18                                3         .067

Table 4.2.3: Analysis of Dispersion Table for H0 : Mµ = 0

Source    D=Dispersion   df      MD           F
Mµ = 0    RD             q       MRD = RD/q   Fϕ
Error                    n − k   τ̂ϕ/2

4.2.3 Tests of General Contrasts


As discussed above the parameters and hypotheses of interest for Model (4.2.4)
can usually be defined in terms of contrasts. In this section we discuss R
estimates and tests of contrasts. We apply these results to more complicated
designs in the remainder of the chapter.
For Model (4.2.4), consider general linear hypotheses of the form

H0 : Mµ = 0   versus   HA : Mµ ≠ 0 ,   (4.2.25)

where M is a q × k matrix of contrasts (rows sum to 0) of full row rank. Since


M is a matrix of contrasts, the hypothesis H0 is invariant to the intercept and,
hence, can be tested by the R test statistic discussed in Section 3.6. To obtain
the test based on the reduction of dispersion, Fϕ , discussed in Section 3.6, we
need to fit the reduced model ΩR which is Model (4.2.4) subject to H0 . Let
D(ΩR ) denote the minimum value of the dispersion function for the reduced
model fit and let RD = D(ΩR ) − DE denote the reduction in dispersion. Note
that RD is analogous to the reduction in sums of squares of the traditional
LS analysis. The test statistic is given by Fϕ = (RD/q)/(τ̂ϕ/2). As discussed
in Chapter 3 this statistic should be compared with F -critical values having
q and n − k degrees of freedom. The test can be summarized in the analysis
of dispersion table found in Table 4.2.3, which is analogous to the traditional
analysis of variance table for summarizing a LS analysis.

Example 4.2.3 (Poland China Pigs). This data set, presented on page 87
of Scheffé (1959), concerns the birth weights of Poland China pigs in eight
litters. For convenience we have tabled that data in Table 4.2.4. There are 56
pigs in the eight litters. The sample sizes of the litters vary from 4 to 10.
In Exercise 4.8.3 a residual analysis is conducted of this data set and the
hypothesis (4.2.13) is tested. Here we are only concerned with the following


Table 4.2.4: Birth Weights of Poland China Pigs by Litter


Litter Birth Weight
1 2.0 2.8 3.3 3.2 4.4 3.6 1.9 3.3 2.8 1.1
2 3.5 2.8 3.2 3.5 2.3 2.4 2.0 1.6
3 3.3 3.6 2.6 3.1 3.2 3.3 2.9 3.4 3.2 3.2
4 3.2 3.3 3.2 2.9 3.3 2.5 2.6 2.8
5 2.6 2.6 2.9 2.0 2.0 2.1
6 3.1 2.9 3.1 2.5
7 2.6 2.2 2.2 2.5 1.2 1.2
8 2.5 2.4 3.0 1.5

contrast suggested by Scheffé. Assume that the litters 1, 3, and 4 were sired
by one boar while the other litters were sired by another boar. The contrast
of interest is that the average litter birthweights of the pigs sired by the two
boars are the same; i.e., H0 : h = 0, where

h = (1/3)(µ1 + µ3 + µ4) − (1/5)(µ2 + µ5 + µ6 + µ7 + µ8) .   (4.2.26)

For this hypothesis, the matrix M of expression (4.2.25) is given by
[5  −3  5  5  −3  −3  −3  −3]. The value of the LS F-test statistic is 11.19, while
Fϕ = 15.65. There are 1 and 48 degrees of freedom for this hypothesis so both
tests are highly significant. Hence both tests indicate a difference in average
litter birthweights of the boars. The reason Fϕ is more significant than FLS is
clear from the residual analysis found in Exercise 4.8.3.
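A Wald-type version of this contrast test is straightforward to compute from an R fit. The sketch below is ours, not the text's computation: the data frame pigs with columns weight and litter (a factor with levels in litter order), and the tauhat component of the fit, are illustrative assumptions. For a contrast the covariance terms in (4.2.31) cancel, so Var(ĥ) = τϕ² Σ ci²/ni, consistent with (4.2.29) and (4.2.30).

library(Rfit)
fit   <- rfit(weight ~ litter, data = pigs)       # one-way Wilcoxon fit
muhat <- tapply(fitted(fit), pigs$litter, mean)   # common fitted value per cell
ni    <- table(pigs$litter)

# Contrast coefficients for h in (4.2.26): litters 1, 3, 4 versus the rest.
cc    <- c(1/3, -1/5, 1/3, 1/3, -1/5, -1/5, -1/5, -1/5)
h     <- sum(cc * muhat)
se.h  <- fit$tauhat * sqrt(sum(cc^2 / ni))
Fwald <- (h / se.h)^2   # compare with F(1, n - k) critical values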

4.2.4 More on Estimation of Contrasts and Location


In this section we further explore the relationship between Models (4.2.4) and
(4.2.6). This enables us to formulate the contrast procedure based on pseudo-
observations discussed in Section 4.2.5. Recall that the design matrix X of
expression (4.2.6) is a centered design matrix based on the last k − 1 columns
of the design matrix W of expression (4.2.4). To determine the relationship
between the parameters of these models, we simply match them by location
parameter for each level. For Model (4.2.4) the location parameter for level i
is of course µi. In terms of Model (4.2.6), the location parameter for the first
level is α − Σ_{j=2}^k (nj/n) ∆j and that of the ith level is α − Σ_{j=2}^k (nj/n) ∆j + ∆i. Hence,
letting δ = Σ_{j=2}^k (nj/n) ∆j, we can write the vector of level locations as

µ = (α − δ)1 + (0, ∆1′)′ ,   (4.2.27)

where ∆1 is defined in expression (4.2.3).


Table 4.2.5: All Pairwise 95% Wilcoxon Confidence Intervals for the Quail
Data
Difference Estimate Confidence Interval
µ2 − µ1 -25.0 (−42.7, −7.8)
µ2 − µ3 -21.0 (−38.6, −3.8)
µ2 − µ4 -20.0 (−37.8, −2.0)
µ1 − µ3 4.0 (−13.41, 21.41)
µ1 − µ4 5.0 (−12.89, 22.89)
µ3 − µ4 1.0 (−16.89, 18.89)

Let h = Mµ be a q × 1 vector of contrasts of interest (i.e., rows of M sum


to 0). Write M as [m M1 ]. Then by (4.2.27) we have

h = Mµ = M1 ∆1 . (4.2.28)

By Corollary 3.5.1, ∆̂1 has an asymptotic N(∆1, τϕ²(X′X)^{-1}) distribution.
Hence, based on expression (4.2.28), the asymptotic variance-covariance ma-
trix of the estimate Mµ̂ is

Σh = τϕ² M1 (X′X)^{-1} M1′ .   (4.2.29)

Note that the only difference for the LS fit is that σ² would be substituted
for τϕ². Expressions (4.2.28) and (4.2.29) are the basic relationships used by
the pseudo-observations discussed in Section 4.2.5.

To illustrate these relationships, suppose we want a confidence interval for
µi − µi′. Based on expression (4.2.29), an asymptotic (1 − α)100% confidence
interval is given by

µ̂i − µ̂i′ ± t(α/2,n−k) τ̂ϕ √(1/ni + 1/ni′) ;   (4.2.30)

i.e., the same as LS except that τ̂ϕ replaces σ̂.
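For instance, interval (4.2.30) is a one-liner once the cell estimates are in hand. A small sketch (muhat, ni, and tauhat are the quantities from the earlier sketches and remain assumptions about the fit object):

# Asymptotic (1 - alpha)100% confidence interval for mu_i - mu_i', eq. (4.2.30).
pairwise.ci <- function(muhat, ni, tauhat, i, ip, alpha = 0.05) {
  n   <- sum(ni); k <- length(ni)
  est <- muhat[i] - muhat[ip]
  me  <- qt(1 - alpha / 2, df = n - k) * tauhat * sqrt(1 / ni[i] + 1 / ni[ip])
  c(estimate = est, lower = est - me, upper = est + me)
}
# Example: compare levels 2 and 1 at the 95% level.
# pairwise.ci(muhat, ni, fit$tauhat, i = 2, ip = 1)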

Example 4.2.4 (LDL Cholesterol of Quail, Example 4.2.1 continued). To


illustrate the above confidence intervals, Table 4.2.5 displays the six pairwise
confidence intervals among the four drug compounds. On the basis of these
intervals Drug Compound 2 seems best. This conclusion, though, is based on
six simultaneous confidence intervals and the problem of overall confidence
in these intervals needs to be addressed. This is discussed in some detail in
Section 4.3, at which time we will return to this example.


Medians Model
Suppose we are interested in estimates of the level locations themselves. We
first need to select a location functional. For the discussion we use the median;
although, for any other functional, only a change of the scale parameter τS
is necessary. Assume then that the R residuals have been adjusted so that
their median is zero. As discussed above, (4.2.10), the estimate of µi is Ŷij, for
any j = 1, . . . , ni, where Ŷij is the fitted value of Yij. Let µ̂ = (µ̂1, . . . , µ̂k)′.
Further, µ̂ is asymptotically normal with mean µ and the asymptotic variance
of µ̂i is given in expression (4.2.11). As Exercise 4.8.4 shows, the asymptotic
covariance between estimates of location levels is

cov(µ̂i, µ̂i′) = (τS² − τϕ²)/n ,   (4.2.31)

for i ≠ i′. As Exercises 4.8.4 and 4.8.18 show, expressions (3.9.38) and (4.2.31)
lead to a verification of the confidence interval (4.2.30).

Note that if the scale parameters are the same, say, τS = τϕ = κ, then the
approximate variance reduces to κ²/ni and the covariances are 0. Hence, in
this case, the estimates µ̂i are asymptotically independent. This occurs in the
following two ways:

1. For the fit of Model (4.2.4) use a score function ϕ(u) which satisfies (S2)
and use the location functional based on the corresponding signed-rank
score function ϕ+ (u) = ϕ((u + 1)/2). The asymptotic theory, though,
requires the assumption of symmetric errors. If the Wilcoxon score func-
tion is used then the location functional would result in the residuals
being adjusted so that the median of the Walsh averages of the adjusted
residuals is 0.

2. Use the l1 score function ϕS (u) = sgn(u − (1/2)) to fit Model (4.2.4) and
use the median as the location functional. This of course is equivalent
to using an l1 fit on Model (4.2.4). The estimate of µi is then the cell
median.

4.2.5 Pseudo-observations
We next discuss a convenient way to estimate and test contrasts once an R fit
of Model (4.2.4) is obtained. Let Ẑ denote the R fit of this model, let ê denote
the vector of residuals, let a(R(ê)) denote the vector of scored residuals, and
let τ̂ϕ be the estimate of τϕ. Let HW denote the projection matrix onto the
column space of the incidence matrix W. Because of (3.2.13), the fact that 1n
is in the column space of W, and the fact that the scores sum to 0, we get

HW a(R(ê)) = 0 .   (4.2.32)


Define the constant ζϕ by

ζϕ² = (n − k) / Σ_{i=1}^n a²(i) .   (4.2.33)

Because n^{-1} Σ a²(i) ≐ 1, ζϕ² ≐ 1 − (k/n). Then the vector of pseudo-
observations is defined by

Z̃ = Ẑ + τ̂ϕ ζϕ a(R(ê)) ;   (4.2.34)

see Bickel (1976) for a discussion of the pseudo-observations. For Wilcoxon
scores, we get

ζW² = (n − k)(n + 1) / (n(n − 1)) .   (4.2.35)

Let Ẑp and êp denote the LS fit and residuals, respectively, of the pseudo-
observations, (4.2.34) (the subscript p simply flags quantities from the LS fit
of the pseudo-observations). By (4.2.32) we have

Ẑp = Ẑ ,   (4.2.36)

and, hence,

êp = τ̂ϕ ζϕ a(R(ê)) .   (4.2.37)

From this last expression and the definition of ζϕ, (4.2.33), we get

(n − k)^{-1} êp′êp = τ̂ϕ² .   (4.2.38)

Therefore the LS fit of the pseudo-observations results in the R fit of Model
(4.2.4) and, further, the LS estimator MSE is τ̂ϕ².
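In R the pseudo-observations take only a few lines (a sketch; fit and the quail data frame carry over from the earlier sketches, and tauhat remains an assumed component name):

# Pseudo-observations, eq. (4.2.34), from a Wilcoxon R fit of the one-way model.
ehat <- residuals(fit)
zhat <- fitted(fit)
n <- length(ehat); k <- nlevels(quail$drug)
a    <- sqrt(12) * (rank(ehat) / (n + 1) - 0.5)   # standardized Wilcoxon scores
zeta <- sqrt((n - k) * (n + 1) / (n * (n - 1)))   # zeta_W, eq. (4.2.35)
ztilde <- zhat + fit$tauhat * zeta * a

# The LS fit of ztilde reproduces the R fit; its MSE estimates tau_phi^2,
# eq. (4.2.38), and its F tests are the Wald-type R tests.
anova(lm(ztilde ~ quail$drug))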
The pseudo-observations can be used to compute the R inference on a
given contrast, say, h = Mµ. If the pseudo-observations are used in place
of the observations in a LS algorithm, then, based on the variance-covariance
matrix of ĥ, (4.2.29), expressions (4.2.36) and (4.2.38) imply that the resulting
LS estimate of h and the LS estimate of the corresponding variance-covariance
matrix of ĥ are the R estimate of h and the R estimate of the corresponding
variance-covariance matrix of ĥ. Similarly, for testing the hypotheses (4.2.25),
the LS test using the pseudo-observations results in the Wald-type R test, Fϕ,Q,
of these hypotheses given by expression (3.6.12). Pseudo-observations are used
in many of the subsequent examples of this chapter.
The pseudo-observations are easy to obtain. For example, the package
rglm returns the pseudo-observations in the data set of fits and residuals.
These pseudo-observations can then be read into Minitab or another package
for further analyses. In Minitab itself, for Wilcoxon scores the robust regres-
sion command, RREGR, has the subcommand PSEUDO which returns the
pseudo-observations. Then the pseudo-observations can be used in place of
the observations in Minitab commands to obtain the R inference on contrasts.


Example 4.2.5 (LDL Cholesterol of Quail, Example 4.2.1 continued). To


demonstrate how easy it is to use the pseudo-observations with Minitab, re-
consider Example 4.2.1 concerning LDL cholesterol levels of quail under the
treatment of four different drug compounds. Suppose we want the Wald-type
R test of the hypotheses that there is no effect due to the different drug com-
pounds. The pseudo-observations were obtained based on the full model R
fit and placed in column 10 and the corresponding levels were placed in col-
umn 11. The Wald Fϕ,Q statistic is obtained by using the following Minitab
command:

oneway c10 c11

The execution of this command returned the value Fϕ,Q = 3.45 with a
p-value of .027, which is close to the result based on the Fϕ-statistic.

4.3 Multiple Comparison Procedures


Our basic model for this section is Model (4.2.4), although much of what we do
here pertains to the rest of this chapter as well. We discuss methods based on
the R fit of this model as described in Section 4.2.1. In particular, we use the
same notation to describe the fit; i.e., the R residuals and fitted values are,
respectively, ê and Ẑ; the estimates of µ and τϕ are µ̂ and τ̂ϕ; and the vector of
pseudo-observations is Z̃. We also denote the pseudo-observation corresponding
to the observation Yij as Z̃ij.
Besides tests of contrasts of level locations, often we want to make compar-
isons among the location levels, for instance, all pairwise comparisons among
the levels. With so many comparisons to make, overall confidence becomes a
problem. Multiple comparison procedures, MCP, have been developed to off-
set this problem. In this section we explore several of these methods in terms
of robust estimation. These procedures can often be directly robustified. It
is our intent to show this for several popular methods, including the Tukey
T -method. We also discuss simultaneous, rank-based tests among levels. We
show how simple Minitab code, based on the pseudo-observations, suffices to
compute these procedures. It is not our purpose to give a full discussion of
MCPs. Such discussions can be found, for example, in Miller (1981) and Hsu
(1996). 
We focus on the problem of simultaneous inference for all k(k − 1)/2 comparisons
µi − µi′ based on an R fit of Model (4.2.4). Recall, (4.2.28), that a (1 − α)100%
asymptotic confidence interval for µi − µi′ based on the R fit of Model (4.2.4)
is given by

µ̂i − µ̂i′ ± t(α/2,n−k) τ̂ϕ √(1/ni + 1/ni′) .   (4.3.1)


In this section we say that this confidence interval has experiment error
rate α. As Exercise 4.8.8 illustrates, simultaneous confidence for several such
intervals can easily slip well below 1 − α. The error rate for a simultaneous
confidence procedure is called its family error rate.
We next describe six robust multiple comparison procedures for the prob-
lem of all pairwise comparisons. The error rates for them are based on asymp-
totics. But note that the same is true for MCPs based on least squares when
the normality assumption is not valid. Sufficient Minitab code is given, to
demonstrate how easily these procedures can be performed.

1. Bonferroni Procedure. This is the simplest of all the MCPs. Suppose
we are interested in making l comparisons of the form µi − µi′. If each
individual confidence interval, (4.3.1), has confidence 1 − α/l, then the
family error rate for these l simultaneous confidence intervals is at most
α; see Exercise 4.8.8. To do all comparisons just select l = k(k − 1)/2.
Hence the R Bonferroni procedure declares

levels i and i′ differ if |µ̂i − µ̂i′| ≥ t(α/(2l),n−k) τ̂ϕ √(1/ni + 1/ni′) .   (4.3.2)

The asymptotic family error rate for this procedure is at most α.
To obtain these Bonferroni intervals by Minitab assume that the pseudo-
observations, Z̃ij, are in column 10, the corresponding levels, i, are in
column 11, and the constant α/l is in k1. Then the following two lines
of Minitab code obtain the intervals:

oneway c10 c11;


bonferroni k1.
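With the pseudo-observations in hand, the same Bonferroni comparisons can be made in base R (a sketch; ztilde and the quail data frame carry over from the pseudo-observation code of Section 4.2.5). The pooled LS variance used here estimates τϕ², by (4.2.38):

# Bonferroni-adjusted pairwise comparisons from the pseudo-observations.
pairwise.t.test(ztilde, quail$drug, p.adjust.method = "bonferroni",
                pool.sd = TRUE)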

2. Protected LSD Procedure of Fisher. First use the test statistic Fϕ
to test the hypothesis that all the level locations are the same, (4.2.13),
at level α. If H0 is rejected then the usual level 1 − α confidence intervals,
(4.2.28), are used to make the comparisons. If we fail to reject H0 then
either no comparisons are made or the comparisons are made using the
Bonferroni procedure. In summary, this procedure declares

levels i and i′ differ if Fϕ ≥ F(α, k − 1, n − k) and
|µ̂i − µ̂i′| ≥ t(α/2,n−k) τ̂ϕ √(1/ni + 1/ni′) .   (4.3.3)
This MCP has no family error rate but the initial test does offer pro-
tection. In a large simulation study conducted by Carmer and Swan-
son (1973) this procedure based on LS estimates performed quite well
in terms of power and level. In fact, it was one of the two procedures


recommended. In a moderate-sized simulation study conducted by Mc-


Kean, Vidmar, and Sievers (1989) the robust version of the protected
LSD discussed here performed similarly to the analogous LS procedure
on normal errors and had a considerable gain in power over LS for error
distributions with heavy tails.
Upon rejection of the hypotheses (4.2.13) at level α, the following
Minitab code obtains the comparison confidence intervals. Assume that
the pseudo-observations, Z̃ij, are in column 10, the corresponding levels,
i, are in column 11, and the constant α is in k1.

oneway c10 c11;


fisher k1.

The F -test that appears in the AOV table upon execution of these com-
mands is Wald’s test statistic Fϕ,Q for the hypotheses (4.2.13). Recall
from Chapter 3 that it is asymptotically equivalent to Fϕ under the null
and local hypotheses.
3. Tukey’s T Procedure. This is a multiple comparison procedure for
the set of all contrasts, h = Σ_{i=1}^k ci µi where Σ_{i=1}^k ci = 0. Assume that
the sample sizes for the levels are the same, say, n1 = · · · = nk = m.
The basic geometric fact for this procedure is the following equivalence
due to Tukey (see Miller, 1981): for t > 0,

max_{1≤i,i′≤k} |(µ̂i − µi) − (µ̂i′ − µi′)| ≤ t   ⟺
Σ_{i=1}^k ci µ̂i − (t/2) Σ_{i=1}^k |ci| ≤ Σ_{i=1}^k ci µi ≤ Σ_{i=1}^k ci µ̂i + (t/2) Σ_{i=1}^k |ci| ,   (4.3.4)

for all contrasts Σ_{i=1}^k ci µi with Σ_{i=1}^k ci = 0. Hence to obtain simulta-
neous confidence intervals for the set of all contrasts we need the distri-
bution of the left side of this inequality. But first note that

(µ̂i − µi) − (µ̂i′ − µi′) = {(µ̂i − µi) − (µ̂1 − µ1)} − {(µ̂i′ − µi′) − (µ̂1 − µ1)}
                        = (∆̂i1 − ∆i1) − (∆̂i′1 − ∆i′1) .

Hence, we need only consider the asymptotic distribution of ∆̂1, which
by (4.2.19) is Nk−1(∆1, (τϕ²/m)[I + J]).

Recall that if v1, . . . , vk are iid N(0, σ²), then max_{1≤i,i′≤k} |vi − vi′|/σ has
the Studentized range distribution with k and ∞ degrees of freedom.
But we can write this random variable as

max_{1≤i,i′≤k} |vi − vi′| = max_{1≤i,i′≤k} |(vi − v1) − (vi′ − v1)| .


Hence we need only consider the random vector of shifts v1′ = (v2 −
v1 , . . . , vk − v1 ) to determine the distribution. But v1 has distribution
Nk−1(0, σ²[I + J]). Based on this, it follows from the asymptotic distribu-
tion of ∆̂1 that if we substitute qα;k,∞ τϕ/√m for t in expression (4.3.4),
where qα;k,∞ is the upper α critical value of a Studentized range distribu-
tion with k and ∞ degrees of freedom, then the asymptotic probability
of the resulting expression is 1 − α.
The parameter τϕ , though, is unknown and must be replaced by an es-
timate. In the Tukey T procedure for LS, the parameter is σ. The usual
estimate s of σ is such that if the errors are normally distributed then the
random variable (n − k)s2 /σ 2 has a χ2 distribution and is independent
of the LS location estimates. In this case the Studentized range distri-
bution with k and n − k degrees of freedom is used. If the errors are
not normally distributed then this distribution leads to an approximate
simultaneous confidence procedure. We proceed similarly for the proce-
dure based on the robust estimates. Replacing t in expression (4.3.4) by
qα;k,n−k τ̂ϕ/√m, where qα;k,n−k is the upper α critical value of a Studentized
range distribution with k and n − k degrees of freedom, yields an
approximate simultaneous confidence procedure for the set of all con-
trasts. As discussed before, though, small sample studies have shown
that the Student t-distribution works well for inference based on the
robust estimates. Hopefully these small sample properties carry over to
the approximation based on the Studentized range distribution. Further
research is needed in this area.
Tukey’s procedure requires that the level sample sizes be the same, which
is frequently not the case in practice. A simple adjustment due to Kramer
(1956) results in the simultaneous confidence intervals

µ̂i − µ̂i′ ± (1/√2) qα;k,n−k τ̂ϕ √(1/ni + 1/ni′) .   (4.3.5)

These intervals have approximate family error rate α. This approxima-
tion is often called the Tukey-Kramer procedure.

In summary, the R Tukey-Kramer procedure declares

levels i and i′ differ if |µ̂i − µ̂i′| ≥ (1/√2) qα;k,n−k τ̂ϕ √(1/ni + 1/ni′) .   (4.3.6)
The asymptotic family error rate for this procedure is approximately α.
To obtain these R Tukey intervals by Minitab assume that the pseudo-
observations, Z̃ij, are in column 10, the corresponding levels, i, are in
column 11, and the constant α is in k1. Then the following two lines of
Minitab code obtain the intervals:


oneway c10 c11;


tukey k1.
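The analogous computation in R feeds the pseudo-observations to the usual LS Tukey machinery (a sketch; ztilde carries over from the pseudo-observation code above, and the quail data frame remains an illustrative assumption). TukeyHSD then applies the Tukey-Kramer adjustment with the LS MSE, which by (4.2.38) estimates τϕ²:

# R Tukey-Kramer intervals via LS applied to the pseudo-observations.
dat    <- data.frame(ztilde = ztilde, drug = quail$drug)
fit.ls <- aov(ztilde ~ drug, data = dat)
TukeyHSD(fit.ls, conf.level = 0.95)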

4. Pairwise Tests Based on Joint Rankings. The above methods were


concerned with estimation and simultaneous confidence intervals for
effects. Traditionally, simultaneous nonparametric inference has dealt
with comparison tests. The first such procedure we discuss is based on
the combined rankings of all levels; i.e, the rankings that are used by
the Kruskal-Wallis test. We discuss this procedure using the Wilcoxon
score function; see Exercise 4.8.10 for the analogous procedure based
on a selected score function. Assume a common level sample size
m. Denote the average of the ranks for the ith level by R̄i· and let
R̄1 = (R̄2· − R̄1·, . . . , R̄k· − R̄1·)′. Using the results of Chapter 3, under
H0 : µ1 = · · · = µk, R̄1 is asymptotically Nk−1(0, (k(n + 1)/12)(Ik−1 + Jk−1));
see Exercise 4.8.9. Hence, as in the development of the Tukey procedure
above, we have the asymptotic result

PH0 [ max_{1≤i,i′≤k} |R̄i· − R̄i′·| ≤ qα;k,∞ √(k(n + 1)/12) ] ≐ 1 − α .   (4.3.7)

Hence the joint ranking procedure declares

levels i and i′ differ if |R̄i· − R̄i′·| ≥ qα;k,∞ √(k(n + 1)/12) .   (4.3.8)
This procedure has an approximate family error rate of α, but it is not
easy to invert for simultaneous confidence intervals for the effects. We
would recommend the Tukey procedure, (3), with Wilcoxon scores for
corresponding simultaneous inference on the effects.
An approximate level α test of the hypotheses (4.2.13) is given by

Reject H0 if max_{1≤i,i′≤k} |R̄i· − R̄i′·| ≥ qα;k,∞ √(k(n + 1)/12) ,   (4.3.9)

although the Kruskal-Wallis test is the usual choice in practice.

The joint ranking procedure, (4.3.9), is approximate for the unequal
sample size case. Miller (1981, p. 166) describes a procedure similar to
the Scheffé procedure in LS which is valid for the unequal sample size
case, but which is also much more conservative; see Exercise 4.8.6. A
Tukey-Kramer type rule, (4.3.6), for the procedure (4.3.9) is

levels i and i′ differ if |R̄i· − R̄i′·| ≥ qα;k,∞ √[ (n(n + 1)/24) (1/ni + 1/ni′) ] .   (4.3.10)
The small sample properties of this approximation need to be studied.
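Rule (4.3.8) is also easy to program directly for equal cell sizes (a sketch under the same illustrative data assumptions; base R's qtukey supplies the Studentized range critical value):

# Pairwise tests based on joint rankings, rule (4.3.8); assumes equal n_i = m.
r    <- rank(quail$ldl)                 # combined-sample (joint) ranks
n    <- length(r)
k    <- nlevels(quail$drug)
Rbar <- tapply(r, quail$drug, mean)     # average rank for each level
crit <- qtukey(0.95, nmeans = k, df = Inf) * sqrt(k * (n + 1) / 12)

# Declare levels i and i' different when |Rbar_i - Rbar_i'| >= crit.
diffs <- abs(outer(Rbar, Rbar, "-"))
which(diffs >= crit & upper.tri(diffs), arr.ind = TRUE)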


5. Pairwise Tests Based on Separate Rankings. For this procedure we


compare levels i and i′ by ranking the combined ith and i′th samples. Let
Ri·^(i′) denote the sum of the ranks of the ith level when it is compared
with the i′th level. Assume that the sample sizes are the same, n1 =
· · · = nk = m. For 0 < α < 1, define the critical value cα;m,k by

PH0 [ max_{1≤i,i′≤k} Ri·^(i′) ≥ cα;m,k ] = α .   (4.3.11)

Tables for this critical value at the 5% and 1% levels are provided in
Miller (1981). The separate ranking procedure declares

levels i and i′ differ if Ri·^(i′) ≥ cα;m,k or Ri′·^(i) ≥ cα;m,k .   (4.3.12)

This procedure has an approximate family error rate of α and was de-
veloped independently by Steel (1960) and Dwass (1960).
An approximate level α test of the hypotheses (4.2.13) is given by

Reject H0 if max_{1≤i,i′≤k} Ri·^(i′) ≥ cα;m,k ,   (4.3.13)

although, as noted for the last procedure, the Kruskal-Wallis test is the
usual choice in practice.

Corresponding simultaneous confidence intervals can be constructed sim-
ilar to the confidence intervals developed in Chapter 2 for a shift in lo-
cations based on the MWW statistic. For the confidence interval for the
ith and i′th samples corresponding to the test (4.3.12), first form the
differences between the two samples, say,

Dkl^{ii′} = Yik − Yi′l ,   1 ≤ k, l ≤ m .

Let D(1), . . . , D(m²) denote the ordered differences. Note here that the
critical value cα;m,k is for the sum of the ranks and not statistics of the
form SR+, (2.4.2). But recall that these versions of the Wilcoxon statistic
differ by the constant m(m + 1)/2. Hence the confidence interval is

( D(cα;m,k − m(m+1)/2 + 1) , D(m² − cα;m,k + m(m+1)/2) ) .   (4.3.14)

It follows that this set of confidence intervals, over all pairs of levels i
and i′, forms a set of simultaneous 1 − α confidence intervals. Using the
iterative algorithm discussed in Section 3.7.2, the differences need not
be formed.


6. Procedures Based on Pairwise Distribution-Free Confidence


Intervals. Simple pairwise (separate ranking) multiple comparison pro-
cedures can be easily formulated based on the MWW confidence inter-
vals discussed in Section 2.4.2. Such procedures do not depend on equal
sample sizes. As an illustration, we describe a Bonferroni-type pro-
cedure for the situation of all l = k(k − 1)/2 comparisons. For the levels (i, i′),
let [D(cα/(2l)+1)^{ii′}, D(ni ni′ − cα/(2l))^{ii′}) denote the (1 − (α/l))100% confidence in-
terval discussed in Section 2.4.2 based on the ni ni′ differences between
the ith and i′th samples. This procedure declares

levels i and i′ differ if 0 is not in [D(cα/(2l)+1)^{ii′}, D(ni ni′ − cα/(2l))^{ii′}) .   (4.3.15)

This Bonferroni-type procedure has family error rate at most α. Note
that the asymptotic value for cα/(2l) is given by

cα/(2l) ≐ (ni ni′)/2 − zα/(2l) √( ni ni′ (ni + ni′ + 1)/12 ) − .5 ;   (4.3.16)
see (2.4.13). A Protected LSD-type procedure can be constructed
in the same way, using as the overall test either the Kruskal-Wallis test
or the test based on Fϕ ; see Exercise 4.8.12.
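These pairwise MWW intervals are available in base R through wilcox.test; the following sketch of the Bonferroni-type rule (4.3.15) again uses the illustrative quail data frame:

# Bonferroni-type MWW procedure, rule (4.3.15): each pair of levels gets a
# confidence interval at level 1 - alpha/l, where l = k(k-1)/2.
alpha <- 0.05
lev <- levels(quail$drug); k <- length(lev); l <- k * (k - 1) / 2
for (i in 1:(k - 1)) for (ip in (i + 1):k) {
  yi  <- quail$ldl[quail$drug == lev[i]]
  yip <- quail$ldl[quail$drug == lev[ip]]
  ci  <- wilcox.test(yi, yip, conf.int = TRUE,
                     conf.level = 1 - alpha / l)$conf.int
  # declare levels i and i' different if this interval excludes 0
  cat(lev[i], "vs", lev[ip], ":", round(ci, 2), "\n")
}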

Example 4.3.1 (LDL Cholesterol of Quail, Example 4.2.1 continued). Re-


consider the data on the LDL levels of quail subject to four different drug
compounds. The full model fit returned the estimate µ̂′ = (67, 42, 63, 62). We
set α = .05 and ran the first five MCPs on this data set. We used the Minitab
code based on pseudo-observations to compute the first three procedures and
we obtained the latter two by Minitab commands. A table that helps for the
separate rank procedure can be found on page 242 of Lehmann (1975) which
links the tables in Miller (1981) with a table of family error α values for this
procedure. Based on these values, the Minitab MANN command can then be
used to obtain the confidence intervals (4.3.14). For each procedure, Table
4.3.1 displays the drug compounds that were declared significantly different
by the procedure. The first three procedures, based on effects, declared drug
compounds 1 and 2 different. Fisher’s PLSD also declared drug compound 2
different from drug compounds 3 and 4. The usual summary schematic based
on Fisher’s is
2 4 3 1 ,
which shows the separation of the second drug compound from the other
three compounds. On the other hand, the schematic for either the Bonferroni
or Tukey-Kramer procedures is
2 4 3 1


Table 4.3.1: Drug Compounds Declared Significantly Different by MCPs

Procedure          Comp. Declared Different   Respective Confidence Interval
Bonferroni         (1, 2)                     (1.25, 49.23)
Fisher             (1, 2), (2, 3), (2, 4)     (7.83, 42.65), (−38.57, −3.75), (−37.79, −2.01)
Tukey-Kramer       (1, 2)                     (2.13, 48.35)
Joint Ranking      None
Separate Ranking   None

which shows that though Treatment 2 is significantly different from Treat-


ment 1 it does not differ significantly from either Treatments 4 or 3. The
joint ranking procedure came close to declaring drug compounds 1 and 2 dif-
ferent because the difference in average rankings between these levels was
12.85 slightly less than the critical value of 13.10. The separate-ranking pro-
cedure declared none different. Its interval, (4.3.14), for compounds 1 and 2
is (−29, 68.99). In comparison, the corresponding confidence interval for the
Tukey procedure based on LS is (−14.5, 58.9). Hence, the separate ranking
procedure was impaired more by the outliers than least squares.

4.3.1 Discussion
We have presented robust analogues to three of the most popular multiple
comparison procedures: the Bonferroni, Fisher’s protected least significant
difference, and the Tukey T method. These procedures provide the user with
estimates of the most interesting parameters in these experiments, namely the
simple contrasts between treatment effects, and estimates of standard errors
with which to assess these contrasts. The robust analogues are straightforward.
Replace the LS estimates of the effects by the robust estimates and replace the
estimate of σ by the estimate of τϕ . Furthermore, these robust procedures can
easily be obtained by using the pseudo-observations as discussed in Section
4.2.5. Hence, the asymptotic relative efficiency between the LS-based MCP
and its robust analogue is the same as the ARE between the LS estimator
and robust estimator, as discussed in Chapters 1-3. In particular if Wilcoxon
scores are used, then the ARE of the Wilcoxon MCP to that of the LS MCP
is .955 provided the errors are normally distributed. For error distributions
with longer tails than the normal, the Wilcoxon MCP is generally much more
efficient than its LS MCP counterpart.
The theory behind the robust MCPs is asymptotic, hence, the error rates
are approximate. But this is true also for the LS MCPs when the errors are
not normally distributed. Verification of the validity and power of both LS


and robust MCPs is based on small sample studies. The small sample study
by McKean et al. (1989) demonstrated that the Wilcoxon Fisher PLSD had
the same validity as its LS counterpart over a variety of error distributions for
a one-way design. For normal errors, the LS MCP had slightly more empirical
power than the Wilcoxon. Under error distributions with heavier tails than
the normal, though, the empirical power of the Wilcoxon MCP was larger
than the empirical power of the LS MCP.

The decision as to which MCP to use has long been debated in the litera-
ture. It is not our purpose here to discuss these issues. We refer the reader to
books devoted to MCPs for discussions on this topic; see, for example, Miller
(1981) and Hsu (1996). We do note that, besides τϕ replacing σ, the error
part of the robust MCP is the same as that of LS; hence, arguments that one
procedure dominates another in a certain situation holds for the robust MCP
as well as for LS.

There has been some controversy on the two simultaneous rank-based test-
ing procedures that we presented: pairwise tests based on joint rankings and
pairwise tests based on separate rankings. Miller (1981) and Hsu (1996) both
favor the tests based on separate rankings because in the separate rankings
procedure the comparison between two levels is not influenced by any infor-
mation from the other levels which is not the case for the procedure based on
joint rankings. They point out that this is true of the LS procedure, also, since
the comparison between two levels is based only on the difference in sample
means for those two levels, except for the estimate of scale. However, Lehmann
(1975) points out that the joint ranking makes use of all the information in
the experiment while the separate ranking procedure does not. The spacings
between all the points constitute information that is utilized by the joint ranking
procedure and that is lost in the separate ranking procedure. The quail data,
Example 4.3.1, is illustrative. The separate ranking procedure did quite poorly
on this data set. The sample sizes are moderate and in the comparisons when
half of the information is lost, the outliers impaired the procedure. In contrast,
the joint ranking procedure came close to declaring drug compounds 1 and
2 different. Consider also the LS procedure on this data set. It is true that
the outliers impaired the sample means, but the estimated variance, being a
weighted average of the level sample variances, was drawn down some over
all the information; for example, instead of using s3 = 37.7 in the compar-
isons with the third level, the LS procedure uses a pooled standard deviation
s = 30.5. There is no way to make a similar correction to the separate ranking
procedure. Also, the separate rankings procedure can lead to inconsistencies
in that it could declare Treatment A superior to B, Treatment B superior to
Treatment C, while not declaring Treatment A superior to Treatment C; see
page 245 of Lehmann (1975) for a simple illustration.


4.4 Two-way Crossed Factorial


For this design we have two factors, say, A at a levels and B at b levels that
may have an effect on the response. Each combination of the ab factor settings
is a treatment. For a completely randomized design, n subjects are selected
at random from the reference population and then nij of these subjects are
randomly assigned to the (i, j)th treatment combination; hence, n = Σi Σj nij.
Let Yijk denote the response for the kth subject at the (i, j)th treatment
combination, let Fij denote the distribution function of Yijk , and let µij =
T (Fij ). Then the unstructured or full model is

Yijk = µij + eijk , (4.4.1)

where eijk are iid with distribution and density functions F and f , respectively.
Let T denote the location functional of interest and assume without loss of
generality that T (F ) = 0. The submodels described below utilize the two-way
structure of the design.
Model 4.4.1 is the same as the one-way design model (4.2.1) of Section 4.2.
Using the scores a(i) = ϕ(i/(n + 1)), the R fit of this model can be obtained
as described in that section. We use the same notation as in Section 4.2; i.e.,
ê denotes the residuals from the fit, adjusted so that T(Fn) = 0 where Fn is
the empirical distribution function of the residuals; µ̂ denotes the R estimate
of µ, the ab × 1 vector of the µij s; and τ̂ϕ denotes the estimate of τϕ. For the
examples discussed in this section, Wilcoxon scores are used and the residuals
are adjusted so that their median is 0.
An interesting submodel is the additive model which is given by

µij = µ + (µi· − µ) + (µ·j − µ) . (4.4.2)

For the additive model, the profile plots (µij versus i or j) are parallel. A
diagnostic check for the additive model is to plot the sample profile plots
(µ̂ij versus i or j) and see how close the profiles are to parallel. The null
hypotheses of interest for this model are the main effect hypotheses given
by

H0A : µi· = µi′·   for all i, i′ = 1, . . . , a ,  and   (4.4.3)

H0B : µ·j = µ·j′   for all j, j′ = 1, . . . , b .   (4.4.4)

Note that there are a − 1 and b − 1 free constraints for H0A and H0B , respec-
tively. Under H0A , the levels of A have no effect on the response.
The interaction parameters are defined as the differences between the
full model parameters and the additive model parameters; i.e.,

γij = µij − [µ + (µi· − µ) + (µ·j − µ)] = µij − µi· − µ·j + µ . (4.4.5)


The hypothesis of no interaction is given by

H0AB : γij = 0 ,   i = 1, . . . , a ,   j = 1, . . . , b .   (4.4.6)

Note that there are (a − 1)(b − 1) free constraints for H0AB. Under H0AB the additive
model holds.
Historically nonparametric tests for interaction were developed in an ad
hoc fashion. They generally do not appear in nonparametric texts and this
has been a shortcoming of the area. Sawilowsky (1990) provides an excellent
review of nonparametric approaches to testing for interaction. The methods
we present are simply part of the general R theory in testing general linear
hypotheses in linear models and they are analogous to the traditional LS tests
for interactions.
All these hypotheses are contrasts in the parameters µij of the one-way
model, (4.4.1); hence they can easily be tested with the rank-based analysis
as described in Section 4.2.3. Usually the interaction hypothesis is tested first.
If H0AB is rejected then there is difficulty in interpretation of the main effect
hypotheses, H0A and H0B . In the presence of interaction H0A concerns the
cell mean averaged over Factor B, which may have little practical significance.
In this case multiple comparisons (see below) between cells may be of more
practical significance. If H0AB is not rejected then there are two schools of
thought. The “pooling” school would take the additive model, (4.4.2), as the
new full model to test main effects. The “non-poolers” would stick with the
unstructured model, (4.4.1), as the full model. In either case, with little evidence
of interaction present, the main effect hypotheses are more interpretable.
Since Model (4.4.1) is a one-way design, the multiple comparison proce-
dures discussed in Section 4.3 can be used. The crossed structure of the design
makes for several interesting families of contrasts. When interaction is present
in the model, it is often of interest to consider simple contrasts between cell
locations. Here, we only mention all ab(ab − 1)/2 pairwise comparisons. Among others,
the Bonferroni, Fisher, and Tukey T procedures described in Section 4.3 can
be used. The rule for the Tukey-Kramer procedure is:

cells (i, j) and (i′, j′) differ if |µ̂ij − µ̂i′j′| ≥ (1/√2) qα;ab,n−ab τ̂ϕ √(1/nij + 1/ni′j′) .   (4.4.7)
The asymptotic family error rate for this procedure is approximately α.
The pseudo-observations discussed in Section 4.2.5 can be used to easily ob-
tain the Wald test statistic, Fϕ,Q , (3.6.12), for tests of hypotheses and similarly
they can be used to obtain multiple comparison procedures for families of con-
trasts. Simply obtain the R fit of Model (4.4.1), form the pseudo-observations,
(4.2.34), and input these pseudo-observations in a LS package. The resulting
analysis of variance table contains the Wald-type R tests of the main effect


hypotheses (H0A and H0B) and the interaction hypothesis (H0AB). As with a
LS analysis, one has to know which main effect hypotheses are being tested by the LS
package. For instance, the main effect hypothesis H0A, (4.4.3), is a Type III
sums of squares hypothesis in SAS; see Speed, Hocking, and Hackney (1978).
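Under our assumptions, the same pseudo-observation device extends directly: obtain the R fit of the full two-way model, form the pseudo-observations, and submit them to any LS factorial routine. A sketch (the data frame motors with columns logtime, temp, and insul, both factors, is hypothetical, as are the Rfit component names):

library(Rfit)

# Full two-way R fit with Wilcoxon scores; temp and insul are factors.
fit2 <- rfit(logtime ~ temp * insul, data = motors)

# Pseudo-observations, eq. (4.2.34), with the number of cells as the
# number of fitted parameters.
e2 <- residuals(fit2); z2 <- fitted(fit2)
n  <- length(e2); p <- length(coef(fit2))
a2 <- sqrt(12) * (rank(e2) / (n + 1) - 0.5)
zt <- z2 + fit2$tauhat * sqrt((n - p) * (n + 1) / (n * (n - 1))) * a2

# Wald-type R tests of H0A, H0B, and H0AB from the LS ANOVA of the
# pseudo-observations. The sequential table below suffices for the
# interaction test; for main effects in unbalanced designs request a
# Type III analysis from your LS routine.
motors$zt <- zt
anova(lm(zt ~ temp * insul, data = motors))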

Example 4.4.1 (Lifetime of Motors). This problem is an unbalanced two-way


design which is discussed on page 471 of Nelson (1982); see, also, McKean and
Sievers (1989) for a discussion on R analyses of this data set. The responses
are lifetimes of three motor insulations, (1, 2, and 3), which were tested at
three different temperatures (200◦ F, 225◦ F, and 250◦ F). The design is an
unbalanced 3 × 3 factorial with 5 replicates in 6 of the cells and 3 replicates in
the others. The data can be found at the url cited in the Preface. Following
Nelson, as the response variable we considered the logs of the lifetimes. Let
Yijk denote the log of the lifetime of the kth replicate at temperature level i
using motor insulation j.
As a full model we use Model (4.4.1). The R analysis is based on Wilcoxon
scores with the intercept estimated by the median of the residuals. Hence the
R estimates of µij estimate the true cell medians. The cell median profile plot
based on the Wilcoxon estimates, Panel A of Figure 4.4.1, indicates that some
interaction is present. Panel B of Figure 4.4.1 is a plot of the internal Wilcoxon
Studentized residuals, (3.9.31), versus fitted values. It indicates randomness
but also shows several outlying data points which are, also, quite evident in
the q−q plot, Panel C of Figure 4.4.1, of the Wilcoxon Studentized residuals
versus logistic population quantiles. This plot indicates that score functions
for distributions with heavier right tails than the logistic would be more ap-
propriate for this data; see McKean and Sievers (1989) for more discussion
on score selection for this example. Panel D of Figure 4.4.1, Casewise plot
of the Wilcoxon Studentized residuals, readily identifies the outliers as the
fifth observation in cell (1, 1), the fifth observation in cell (2, 1), and the first
observation in cell (2, 3).
The ANOVA table for the R analysis is:

Source                  RD    df    MRD      FR
Temperature (T)       26.40    2   13.20   121.7
Motor Insulation (I)   3.72    2    1.86    17.2
T×I                    1.24    4    .310     2.86
Error                         30    .108

Since F (.05, 4, 30) = 2.69, the test of interaction is significant at the .05 level.
This confirms the profile plot, Panel A. It is interesting to note that the
least squares F -test statistic for interaction was 1.30 and, hence, was not
significant. The LS analysis was impaired because of the outliers. The row
effect hypothesis is that the average row effects are the same. The column
effect hypothesis is similarly defined. Both main effects are significant. In the
presence of interaction, though, we have interpretation difficulties with main
effects.

Figure 4.4.1: Panel A: Cell median profile plot for data of Example 4.4.1, cell
medians based on the Wilcoxon fit; Panel B: Internal Wilcoxon Studentized
residual plot; Panel C: Logistic q−q plot based on internal Wilcoxon Studen-
tized residuals; Panel D: Casewise plot of the Wilcoxon Studentized residuals.
In Nelson’s discussion of this problem it was of interest to estimate the
simple contrasts of mean lifetimes of insulations at the temperature setting of
200◦ F. Since this is the first temperature setting, these contrasts are µ1j −µ1j ′ .
The estimates of these contrasts along with corresponding confidence intervals
formed under the Tukey-Kramer procedure as discussed above, (4.3.6), are
given by:

Contrast      Estimate   Confidence Interval


µ11 − µ12 -.76 (−1.22, −.30)
µ11 − µ13 -.84 (−1.37, −.32)
µ12 − µ13 -.09 (−.62, .44)

It seems that insulations 2 and 3 are better than insulation 1 at the tem-
perature of 200◦ F, but between insulations 2 and 3 there is no discernible
difference.

In the last example, the number of observations per parameter was less
than five. To offset uneasiness over the use of the rank analysis for such small
samples, McKean and Sievers (1989) conducted a Monte Carlo study on
this design. The empirical levels and powers of the R analysis were good over
situations similar to those suggested by this data.

4.5 Analysis of Covariance


Often there are extraneous variables available besides the response variable.
Hopefully these variables explain some of the noise in the data. These variables
are called covariates or concomitant variables and the traditional analysis
of such data is called analysis of covariance.
As an example, consider the one-way model (4.2.1), with k levels and
suppose we have a single covariate, say, xij . A first order model is yij =
µi + βxij + eij . This model, however, assumes that the covariate behaves the
same within each treatment combination. A more general model is

yij = µi + βxij + γi xij + eij ,   j = 1, . . . , ni , i = 1, . . . , k .    (4.5.1)

Hence the slope at the ith level is βi = β + γi and, thus, each treatment
combination has its own linear model. There are two natural hypotheses for
this model: H0C : β1 = · · · = βk and H0L : µ1 = · · · = µk . If H0C is true
then the differences between the levels of Factor A are just the differences in the
location parameters µi for a given value of the covariate. In this case, contrasts

in these parameters are often of interest as well as the hypothesis H0L . If
H0C is not true then the covariate and the treatment combinations interact.
For example, whether one treatment combination is better than another may
depend on where in factor space the responses are measured. Thus as in crossed
factorial designs, the interpretation of main effect hypotheses may not be clear;
for more discussion on this point see Huitema (1980).
The above example is easily generalized. Consider a designed experiment
with k treatment combinations. This may be a one-way model with a factor
at k levels, a two-way crossed factorial design model with k = ab treatment
combinations, or some other design. Suppose we have ni observations at treat-
ment level i. Let n = ∑ ni denote the total sample size. Denote by W the
full model incidence matrix and µ the k × 1 vector of location parameters.
Suppose we have p covariates. Let U be the n × p matrix of covariates and let
Z denote the n × 1 vector of responses. Let β denote the corresponding p × 1
vector of regression coefficients. Then the general covariate model is given
by
Z = Wµ + Uβ + Vγ + e , (4.5.2)
where V is the n × pk matrix consisting of all column products of W and U
and the pk × 1 vector γ is the vector of interaction parameters between the
design and the covariates.
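For illustration, V is easy to form in R from W and U; a minimal sketch,
assuming W and U have already been constructed:

    # V: the n x pk matrix of all column products of W (n x k) and U (n x p),
    # as in Model (4.5.2).
    V <- do.call(cbind, lapply(seq_len(ncol(W)), function(j) W[, j] * U))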
The first hypothesis of interest is

H0C : γ11 = · · · = γpk versus HAC : γij ≠ γi′ j ′ for some (i, j) ≠ (i′ , j ′ ) .    (4.5.3)
Other hypotheses of interest consist of contrasts in the µij . In general, let
M be a q × k matrix of contrasts and consider the hypotheses

H0 : Mµ = 0 versus HA : Mµ ≠ 0 .    (4.5.4)

Matrices M of interest are related to the design. For a one-way design M
may be a (k − 1) × k matrix that tests all the location levels to be the same,
while for a two-way design it may be used to test that all interactions between
the two factors are zero. But as noted above, the hypothesis H0C concerns
interaction between the covariate and design spaces. While the interpretation
of these latter hypotheses, (4.5.4), is clear under H0C , it may not be if H0C
is false.
The rank-based fit of the full Model (4.5.2) proceeds as described in Chap-
ter 3, after a score function is chosen. Once the fitted values and residuals
have been obtained, the diagnostic procedures described in Section 3.9 can be
used to assess the fit. With a good fit, the model estimates of the parameters
and their standard errors can be used to form confidence intervals and regions
and multiple comparison procedures can be used for simultaneous inference.
Reduced models appropriate for the hypotheses of interest can be obtained
and the values of the test statistic Fϕ can be used to test them. This analysis
can be conducted by the package rglm. It can also be conducted by fitting
the full model and obtaining the pseudo-observations. These in turn can be
substituted for the responses in a package which performs the traditional LS
analysis of covariance in order to obtain the R analysis.
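As with the factorial designs of Section 4.4, this analysis is easy to sketch in
R with the Rfit package; the data frame dat, with response z, treatment
factor trt, and covariate x, is hypothetical:

    library(Rfit)
    fit.full <- rfit(z ~ trt + x + trt:x, data = dat) # covariate model (4.5.2)
    fit.red  <- rfit(z ~ trt + x, data = dat)         # reduced model under H0C
    drop.test(fit.full, fit.red)                      # F_phi test of H0C, (4.5.3)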
Example 4.5.1 (Snake Data). As an example of an analysis of covariance
problem consider the data set discussed by Afifi and Azen (1972). For conve-
nience, we have placed the Snake Data at the url cited in the Preface. The
study involves four methods, three of which are intended to reduce a human’s
fear of snakes. Forty subjects were given a behavior approach test to determine
how close they could walk to a snake without feeling uncomfortable. This score
was taken as the covariate. Next they were randomly assigned to one of the
four treatments with ten subjects assigned to a treatment. The first treatment
was a control (placebo) while the other three treatments were different meth-
ods intended to reduce a human’s fear of snakes. The response was a subject’s
score on the behavior approach test after treatment. Hence, the sample size is
40 and the number of independent variables in Model (4.5.2) is 8. Wilcoxon
scores were used to conduct the analysis of covariance described above with
the residuals adjusted to have median 0.
The plots of the response variable versus the covariate for each treatment
are found in Panels A-D of Figure 4.5.1. It is clear from the plots that the
relationship between the response and the covariate varies with the treatment,
from virtually no relationship for the first treatment (placebo) to a fairly
strong linear relationship for the third treatment. Outliers are apparent in
these plots also. These plots are overlaid with Wilcoxon and LS fits of the
full model, Model (4.5.1). Panels E and F of Figure 4.5.1 are, respectively,
the internal Wilcoxon Studentized residual plot and the internal Wilcoxon
Studentized logistic q−q plot. The outliers stand out in these plots. From the
residual plot, the data appears to be heteroscedastic and, as Exercise 4.8.14
shows, the square root transformation of the response does lead to a better
fit.
Table 4.5.1 displays the Wilcoxon and LS estimates of the linear models for
each treatment. As this table and Figure 4.5.1 show, the larger discrepancy
between the Wilcoxon and LS estimates occurs for those treatments which
have large outliers. The estimates of τϕ and σ are 3.92 and 5.82, respectively;
hence, as the table shows the estimated standard errors of the Wilcoxon esti-
mates are lower than their LS counterparts.
Table 4.5.2 displays the analysis of dispersion table for this data. Note that
Fϕ strongly rejects H0C (p-value is 0.015). This confirms the discussion above
based on Figure 4.5.1. The second hypothesis tested is no treatment effect,
H0 : µ1 = · · · = µ4 . Although Fϕ strongly rejects this hypothesis also, in light
of the results for H0C , the practical interpretation of such a decision is not
obvious. The value of the LS F -test for H0C is 2.34 (p-value is 0.078). If H0C
is not rejected then the LS analysis could lead to an invalid interpretation.
The outliers spoiled the LS analysis of this data set. As shown in Exercise
4.8.15 both the R analysis and the LS analysis strongly reject H0C for the
square root transformation of the response.

Figure 4.5.1: Panels A-D: For the snake data, scatterplots of final distance
versus initial distance for the placebo and treatments 2-4, overlaid with the
Wilcoxon fit (solid line) and the LS fit (dashed line); Panel E: Internal
Wilcoxon Studentized residual plot; Panel F: Wilcoxon Studentized logistic
q−q plot.


Table 4.5.1: Wilcoxon and LS Estimates of the Linear Models by Treatment


Wilcoxon Estimates LS Estimates
Treatment Int. (SE) Slope (SE) Int. (SE) Slope (SE)
1 27.3 (3.6) -.02 (.20) 25.6 (5.3) .07 ( .29)
2 -1.78 (2.8) .83 ( .15) -1.39 (4.0) .83 ( .22)
3 -6.7 (2.4) .87 (.12) -6.4 (3.5) .87 (.17)
4 2.9 (2.4) .66 ( .13) 7.8 (3.4) .49 (.19)


Table 4.5.2: Analysis of Dispersion (Wilcoxon) for the Snake Data

Source       D = Dispersion   df     MD      F
H0C                24.06       3    8.021   4.09
Treatment          74.89       3   24.96   12.7
Error                         32    1.96


4.6 Further Examples


In this section we present two further data examples. Our main purpose in
this section is to show how easy it is to use the rank-based analysis on more
complicated models. Each example is a three-way crossed factorial design. The
first has replicates while the second involves a covariate. Besides displaying
tests of the effects, we also consider estimates and standard errors of contrasts
of interest.
Example 4.6.1 (Marketing Data). This data set is drawn from an exercise on
page 953 of Neter et al. (1996). A marketing firm research consultant studied
the effects that three factors have on the quality of work performed under
contract by independent marketing research agencies. The three factors and
their levels are: Fee level ((1) High, (2) Average, and (3) Low); Scope ((1) All
contract work performed in house, (2) Some subcontract out); Supervision ((1)
Local supervision, (2) Traveling supervisors). The response was the quality
of the work performed as measured by an index. Four agencies were chosen
for each level combination. For convenience, we have also placed the data
(Marketing Data) at the url listed in the Preface. The design is a 3 × 2 × 2
crossed factorial with 4 replications, which we write as,

yijkl = µijk + eijkl , i = 1, . . . , 3; j, k = 1, 2; l = 1, . . . 4 , (4.6.1)

where yijkl denotes the response for the lth replicate, at Fee i, Scope j, and
Supervision k. Wilcoxon scores were selected for the fit with residuals adjusted
to have median 0. Panels A and B of Figure 4.6.1 show, respectively, the
residual and normal q−q plots for the internal R Studentized residuals, (3.9.31),
based on this fit. The scatter in the residual plot is fairly random and flat.
There do not appear to be any outliers. The main trend in the normal q−q
plot indicates tails lighter than those of a normal distribution. Hence, the fit
is good and we proceed with the analysis.

Figure 4.6.1: Panel A: Wilcoxon Studentized residual plot for data of Example
4.6.1; Panel B: Wilcoxon Studentized residual normal q−q plot.


Table 4.6.1: Tests of Effects for the Market Data

Effect              df    FLS    Fϕ     Fϕ,Q
Fee                  2   679.   207.   793.
Scope                1   248.   160.   290.
Supervision          1   518.   252.   596.
Fee×Scope            2   .108   .098   .103
Fee×Super.           2   .053   .004   .002
Scope×Super.         1   77.7   70.2   89.6
Fee×Scope×Super.     2   .266   .532   .362
σ̂ or τ̂ϕ            36   2.72   2.53   2.53

Table 4.6.1 displays the tests of the effects based on the LS and Wilcoxon
fits. The Wald-type Fϕ,Q statistic based on the pseudo-observations is also
given. The LS and Wilcoxon analyses agree, which is not surprising based on
the residual plot. The main effects are highly significant and the only sig-
nificant interaction is the interaction between Scope and Supervision. As a
subsequent analysis, we consider nine contrasts of interest. We use the Bon-
ferroni method based on the pseudo-observations as discussed in Section 4.3.
We used Minitab to obtain the results that follow. Because the factor Fee
does not interact with the other two factors, the contrasts of interest for this
factor are: µ1·· − µ2·· , µ1·· − µ3·· , and µ2·· − µ3·· . Table 4.6.2 presents the esti-
mates of these contrasts and the 95% Bonferroni confidence intervals, which
are given by the estimate of the contrast ±t(.05/18;36) τ̂ϕ √(2/16) ≐ ±2.64. From
these results, quality of work significantly improves for either high or average
fees over low fees, while the difference between high and average fees is not
significant.

Table 4.6.2: Wilcoxon Estimates of Contrasts of Interest for the Market Data

Contr.          Est.     Conf. Int.           Contr.    Est.     Conf. Int.
µ1·· − µ2··      1.05    (−1.59, 3.69)        µ·11      111.9    (108.9, 114.9)
µ1·· − µ3··     31.34    (28.70, 33.98)       µ·12      101.0    (98.0, 104.0)
µ2·· − µ3··     30.28    (27.64, 32.92)       µ·21      106.4    (103.4, 109.4)
                                              µ·22       81.6    (78.6, 84.6)
Since the factors Scope and Supervision interact, but do not interact sepa-
rately or jointly with the factor fee, the parameters of interest are the simple
contrasts among µ·11 , µ·12 , µ·21 and µ·22 . Table 4.6.2 displays the estimates of
these parameters. Using α = .05, the Bonferroni bound for a simple contrast
here is t.05/18;36 τ̂ϕ √(2/12) ≐ 3.04. Hence all 6 simple pairwise contrasts among
these parameters are significantly different from 0. In particular, averaging
over fees, the best quality of work occurs when all contract work is done in
house and under local supervision. The source of the interaction between the
factors Scope and Supervision is also clear from these estimates.
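As a computational aside, the Bonferroni margins used above are simple to
reproduce; a minimal R sketch based on the values reported in Table 4.6.1:

    tauhat <- 2.53                     # estimate of tau_phi from the Wilcoxon fit
    tcrit  <- qt(1 - .05/18, df = 36)  # Bonferroni: nine two-sided contrasts
    tcrit * tauhat * sqrt(2/16)        # margin for the Fee contrasts, about 2.64
    tcrit * tauhat * sqrt(2/12)        # margin for the simple contrasts, about 3.04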

Example 4.6.2 (Pigs and Diets). This data set is discussed on page 291
of Rao (1973). It concerns the effect of diets on the growth rate of pigs. For
convenience, we have tabled the data at the url cited in the Preface. There are
three diets, called A, B, and C. Besides the diet classification, the pigs were
classified according to their pens (5 levels) and sex (2 levels). Their initial
weight was also recorded as a covariate.
The design is a 5 × 3 × 2 crossed factorial with only one replication.
For comparison purposes, we use the same model that Rao used which is a
fixed effects model with main effects and the two-way interaction between the
factors Diets and Sex. Letting yijk and xijk denote, respectively, the growth
rate in pounds per week and the initial weight of the pig in pen i, on diet j
and sex k, this model is given by:

yijk = µ + αi + βj + γk + (βγ)jk + δxijk + eijk ,   i = 1, . . . , 5; j = 1, 2, 3; k = 1, 2 .    (4.6.2)
For convenience we have written the model as an overparameterized model,
although we could have expressed it as a cell means model with constraints
for the interaction effects which are assumed to be 0. The effects of interest
are the diet effects, βj .
We fit the model using the Wilcoxon scores. The analysis could also be
carried out using pseudo-observations and Minitab.

Figure 4.6.2: Panel A: Internal Wilcoxon Studentized residual plot for data of
Example 4.6.2; Panel B: Internal Wilcoxon Studentized residual normal q−q
plot.

Panels A and B of Figure
4.6.2 display the residual plot and normal q−q plot of the internal R Studen-
tized residuals based on the Wilcoxon fit. The residual plot shows the three
outliers. The outliers are prominent in the q−q plot, but note that even the remain-
ing plotted points indicate an error distribution with heavier tails than the
normal. Not surprisingly the estimate of τϕ is smaller than that of σ, .413 and
.506, respectively. The largest outlier corresponds to the 6th pig which had the
lowest initial weight (recall that the internal R Studentized residuals account
for position in factor space), but its response was above the first quartile. The
second largest outlier corresponds to the pig which had the lowest response.
The results of the tests for the effects for the LS and Wilcoxon fits are:

Effect         df    FLS     Fϕ      Fϕ,Q
Pen             4    2.35    3.65∗   3.48∗
Diet            2    4.67∗   7.98∗   8.70∗
Sex             1    5.05∗   8.08∗   8.02∗
Diet×Sex        2    0.17    1.12     .81
Initial Wt.     1   13.7∗   19.2∗   19.6∗
σ̂ or τ̂ϕ       19    .507    .413    .413

∗ Denotes significance at the .05 level
The pseudo-observations were obtained based on the Wilcoxon fit and were
input as the responses in SAS to obtain Fϕ,Q using Type III sums of squares.
The Wilcoxon analyses based on Fϕ and Fϕ,Q are quite similar. All three tests
indicate no interaction between the factors Diet and Sex which clarifies the

interpretation of the main effects. Also all three agree on the need for the
covariate. Diet has a significant effect on weight gain as does sex. The robust
analyses indicate that pens is also a contributing factor.
The results of the analyses when the covariate is not taken into account
are given in Table 4.6.3:

Table 4.6.3: Test Statistics for the Effects of Pigs and Diets Data with No
Covariate

Effect        df    FLS     Fϕ      Fϕ,Q
Pen            4    2.95∗   4.20∗   5.87∗
Diet           2    2.77    4.80∗   5.54∗
Sex            1    1.08    3.01    3.83
Diet×Sex       2    0.55    1.28    1.46
σ̂ or τ̂ϕ      20    .648    .499    .501

∗ Denotes significance at the .05 level
It is interesting to note, here, that the factor diet is not significant based on
the LS fit while it is for the Wilcoxon analyses. The heavy tails of the error
distribution, as evident in the residual plots, have foiled the LS analysis.

4.7 Rank Transform


In this section we present a short comparison between the rank-based analysis
of this chapter with the rank transform analysis (RT). Much of this discussion
is drawn from McKean and Vidmar (1994). The main point of this section is
to show that often the RT test does not work well for testing hypotheses in
factorial designs and more complicated models. Hence, we do not recommend
using RT methods. On the other hand, Akritas, Arnold, and Brunner (1997)
develop a unique approach in which factorial hypotheses are replaced by
corresponding nonparametric hypotheses based on cdfs. They then show that
RT type methods are appropriate for these nonparametric hypotheses.
As we have pointed out, the rank-based analysis is quite analogous to the
LS-based traditional analysis. It is based on R estimates while the traditional
analysis is based on LS estimates. The only difference in the geometry of
estimation is that the R estimates are based on the pseudo-norm (3.2.6)
while the LS estimates are based on the Euclidean pseudo-norm. The rank-
based analysis produces confidence intervals and regions and tests of general
linear hypotheses. The diagnostic procedures of Chapter 3 can be used to check
the adequacy of the fit of the model and determine outliers and influential
points. Furthermore, the efficiency properties discussed for the simple location
nonparametric procedures carry over to this analysis. The rank-based analysis
offers the user a complete and highly efficient analysis of a linear model as
an alternative to the traditional analysis. Further, there are computational
algorithms available for these procedures.
Proposed by Conover and Iman (1981), the rank transform (RT) has be-
come a very popular procedure. The RT test of a linear hypothesis consists
generally of ranking the dependent variable and then performing the LS test
on these ranks. Although in general the RT offers no estimation, and hence
no model checking, it is a simple procedure for testing.
Some basic differences between the rank-based analysis and RT are readily
apparent. In linear models the Yi ’s are independent but not identically dis-
tributed. Hence when the RT is applied indiscriminately to a linear model,
the ranking is performed on nonidentically distributed items. The rankings in
the RT are not “free” of the x’s. In contrast, the residuals based on the R
estimates, under Wilcoxon scores, satisfy
∑_{i=1}^{n} xij R(Yi − x′i β̂R ) ≐ 0 ,   j = 1, . . . , p .    (4.7.1)

Hence the R residuals have been adjusted by the fit so that the ranks are
orthogonal to the x-space, i.e., the ranks are “free” of the x’s. These are the
ranks that are used in the R test statistic Fϕ , at the full model. Under H0 this
would also be true of the expected ranks of the residuals in the R fit of the
reduced model. Note, also, that the statistic Fϕ is invariant to the values of
the parameters of the reduced model.
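This orthogonality is easily checked numerically; a minimal R sketch based on
a Wilcoxon fit from the Rfit package (the data frame dat, with response y
and predictors x1 and x2, is hypothetical):

    library(Rfit)
    fit <- rfit(y ~ x1 + x2, data = dat)
    Xc  <- scale(cbind(dat$x1, dat$x2), scale = FALSE)  # centered x's
    crossprod(Xc, rank(residuals(fit)))   # entries approximately 0, as in (4.7.1)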
Unlike the rank-based analysis there is no general supporting theory for
the RT. Hora and Conover (1984) presented asymptotic null theory on the
RT for treatment effect in a randomized block design with no interaction.
Thompson and Ammann (1989) explored the efficiency of this RT, showing,
however, that this efficiency depends on the block parameters. RT theory for
repeated measures designs has been developed by Akritas (1991, 1993) and
Thompson (1991b). These extensions also have the unpleasant trait that their
efficiencies depend on nuisance parameters.
Many of these theoretical studies on the RT have raised serious questions
concerning the validity of the RT for simple two-way and more complicated
designs. For a two-way crossed factorial design, Brunner and Neumann (1986)
showed that the RT statistics are not reasonable for testing main effects in the
presence of interaction for designs larger than 2×2 designs. This was echoed by
Akritas (1990) who stated further that RT statistics are not reasonable test
statistics for interaction nor most other common hypotheses in either two-
way crossed or nested classifications. In several of these articles (see Akritas,
1990 and Thompson, 1991a, 1993), the nonlinear nature of the RT is faulted.
For a given model the hypotheses of interest are linear contrasts in model
parameters. The rank transform, though, is nonlinear; hence often the original
hypothesis is no longer tested by the rank transformed data. The same issue
was raised earlier by Fligner (1981) in a discussion of the article by Conover
and Iman (1981).
In terms of small sample properties, initial simulations of the RT analysis
on certain models (see, for example, Iman, 1974) did appear promising. Now
there has been ample evidence based on simulation studies questioning the
wisdom of doing RTs on designs as simple as two-way factorial designs with
interaction; see, for example, Blair, Sawilowsky, and Higgins (1987) and the
Preface in Sawilowsky (2007). We discuss one such study next and then present
an analysis of covariance example where the use of the RT results in a poor
analysis.

4.7.1 Monte Carlo Study


Another major Monte Carlo study on the RT was performed by Sawilowsky,
Blair, and Higgins (1989), which investigated the behavior of the RT over a
three way factorial design with interaction. In many of their situations, the RT
gave severely inflated empirical levels and severely deflated empirical powers.
We present the results of a small Monte Carlo study discussed in McKean and
Vidmar (1994), which is based on the study of Sawilowsky et al. The model
for the study is a 2 × 2 × 2 three-way factorial design. The shortcomings of
the RT as discussed in the two-way models above seem to become worse for
such models. Letting A, B, and C denote the factors, the model is

Yijkl = µ + ai + bj + ck + (ab)ij + (ac)ik + (bc)jk + (abc)ijk + eijkl ,

for i, j, k = 1, 2, l = 1, . . . , r, where r is the number of replicates per cell. In the
study by Sawilowsky et al., r was set at 2, 5, 10, or 20. Several distributions
were considered for the errors eijkl , including the normal. They considered the
usual seven hypotheses (3 main effects, 3 two-ways, and 1 three-way) and 8
patterns of alternatives. The nonnull effects were set at ±c where c was a
multiple of σ; see, also, McKean and Vidmar (1992) for further discussion.
The study of Sawilowsky et al. found that the RT test for interaction “. . . is
dramatically nonrobust at times and that it has poor power properties in many
cases.”


In order to compare the behavior of the rank-based analysis and the RT,
on this design, we performed part of their simulation study. We considered
standard normal errors and contaminated normal errors, which had 10% con-
tamination from a normal distribution with mean 0 and standard deviation 8.
The normal variates were generated as discussed in Marsaglia and Bray (1964)
using uniform variates which were generated by a portable Fortran generator
written by Kahaner, Moler, and Nash (1989). There were 5 replications per cell
and the nonnull constant of proportionality c was set at .75. The simulation
size was 1000.

Table 4.7.1: Empirical Levels and Power for Test of A × C

             Normal Errors                        Contaminated Normal Errors
       Null Model        Alternative          Null Model        Alternative
      .10   .05   .01    .10   .05   .01     .10   .05   .01    .10   .05   .01
LS   .095  .040  .009   .998  .995  .977    .087  .029  .001   .602  .505  .336
Wil  .104  .060  .006   .997  .992  .970    .079  .032  .004   .934  .887  .713
RT   .369  .243  .076   .847  .770  .521    .221  .128  .039   .677  .576  .319

(Entries are empirical rejection rates at the nominal levels α = .10, .05, .01.)

Tables 4.7.1 and 4.7.2 summarize the results of our study for the following
two situations: the two-way interaction A × C and the three-way interaction
effect A × B × C. The alternative for the A × C situation had all main effects
and all two-way interactions in while the alternative for the A×B×C situation
had all main effects, two-way interactions besides the three-way alternative in.
These were poor situations for the RT in the study conducted by Sawilowsky
et al. and as Tables 4.7.1 and 4.7.2 indicate the RT behaves poorly for these
situations in our study also. Its empirical α levels are deplorable. For instance,
at the nominal .10 level for the three-way interaction test under normal errors,
the RT has an empirical level of .777, while the level is .511 at the contaminated
normal. In contrast the levels of the rank-based analysis were quite close to
the nominal levels under normal errors and slightly conservative under the
contaminated normal errors. In terms of power, note that the empirical power
of the rank-based analysis is slightly less than the empirical power of LS under
normal errors while it is substantially greater than the power of LS under
contaminated normal errors. For the three-way interaction test, the empirical
power of the RT falls below its empirical level.

Example 4.7.1 (The Rat Data). This example, taken from Shirley (1981),
contrasts the rank-based methods, the rank transformed methods, and least
squares methods in an analysis of covariance setting. The response is the time
it takes a rat to enter a chamber after receiving a treatment designed to delay
the time of entry. There were 30 rats in the experiment and they were divided
evenly into three groups. The rats in Groups 2 and 3 received an antidote to

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 329 —


i i

4.7. RANK TRANSFORM 329

Table 4.7.2: Empirical Levels and Power for Test of A × B × C

             Normal Errors                        Contaminated Normal Errors
       Null Model        Alternative          Null Model        Alternative
      .10   .05   .01    .10   .05   .01     .10   .05   .01    .10   .05   .01
LS   .094  .050  .005   1.00  .998  .980    .102  .041  .001   .598  .485  .301
Wil  .101  .060  .004   .997  .992  .970    .085  .039  .006   .948  .887  .713
RT   .777  .644  .381   .484  .343  .144    .511  .377  .174   .398  .276  .105

(Entries are empirical rejection rates at the nominal levels α = .10, .05, .01.)

Table 4.7.3: LS (Top Row) and Wilcoxon (Bottom Row) Estimates (Standard
Errors) for the Rat Data
Group 1 Group 2 Group 3
α β α β α β σ or τϕ
-39.1 (20.) 76.8 (10.) -15.6 (22.) 20.5 (14.) -14.7 (19.) 21.9 (12.) 20.5
-54.3 (16.) 84.2 (8.6) -19.3 (18.) 21.0 (11.) -11.6 (16.) 17.4 (10.) 17.0

The rats in Groups 2 and 3 received an antidote to
the treatment. The covariate is the time taken by the rat to enter the chamber
prior to its treatment. The data are displayed in Panel A of Figure 4.7.1; for
convenience we have also placed the data at the url listed in the Preface. As
a full model, we considered the model,

yij = αj + βj xij + eij , j = 1, . . . , 3, i = 1, . . . , 10 (4.7.2)

where yij denotes the response for the ith rat in Group j and xij denotes the
corresponding covariate. There is a slight quadratic aspect to the Wilcoxon
residual plot, Panel B of Figure 4.7.1, which is investigated in Exercise 4.8.16.
Panel C of Figure 4.7.1 displays a plot of the internal Wilcoxon Studentized
residuals by case. Note that there are several outliers. These also can be seen
in the plots of the data for groups 2 and 3, Panels E and F of Figure 4.7.1.
Note that the outliers have an effect on the LS-fits, drawing the fits toward
the outliers in each group. In particular, for Group 3, it only took one outlier
to spoil the LS fit. On the other hand, the Wilcoxon fit is not affected by
the outliers. The estimates are given in Table 4.7.3. As the plots indicate, the
LS and Wilcoxon estimates differ numerically. Further evidence of the more
precise R fits relative to the LS fits is given by the estimates of the scale
parameters σ and τϕ found in the Table 4.7.3.
We first test for homogeneity of slopes for the groups; i.e., H0 : β1 = β2 =
β3 . As clearly shown in Panel A of Figure 4.7.1 this does not appear to be true
for this data. While the slopes for Groups 2 and 3 seem to be about the same
(the Wilcoxon 95% confidence interval for β2 − β3 is 3.9 ± 27.2), the slope for
Group 1 appears to differ from the other two. To confirm this statistically, the
value of the Fϕ statistic to test homogeneity of slopes, H0 , has the value 9.88
with 2 and 24 degrees of freedom, which is highly significant (p < .001). This
says that Group 1, the group that did not receive the antidote, does differ

significantly from the other two groups in terms of how the groups interact
with the covariate. In particular, the estimated slope of post-treatment time
to pre-treatment time for the rats in Group 1 is about four times as large as
the slope for the rats in the two groups which received the antidote. Because
there is interaction between the groups and the covariate, we did not proceed
with the second test on average group effects; i.e., testing α1 = α2 = α3 .

Figure 4.7.1: Panel A: Wilcoxon fits of all groups; Panel B: Internal Wilcoxon
Studentized residual plot; Panel C: Internal Wilcoxon Studentized residuals
by Case; Panel D: LS (solid line) and Wilcoxon (dashed line) fits for Group
1; Panel E: LS (solid line) and Wilcoxon (dashed line) fits for Group 2; Panel
F: LS (solid line) and Wilcoxon (dashed line) fits for Group 3.
Shirley (1981) performed a rank transform on this data by ranking the
response and then applying standard least squares analysis. It is clear from
Panel A of Figure 4.7.1 that this nonlinear transform results in homogeneous
slopes for the ranked problem, as confirmed by Shirley’s analysis. But the rank
transform is a nonlinear transform and the subsequent analysis based on the
rank transformed data does not test homogeneity of slopes in Model (4.7.2).
The RT analysis is misleading in this case.
Note that using the rank-based analysis we performed an overall analysis
of this data set, including a residual analysis for model criticism. Hypotheses
of interest were readily tested and estimates of contrasts, along with standard
errors, were easily obtained.

4.8 Exercises
4.8.1. Derive expression (4.2.19).

4.8.2. In Section 4.2.2 when we have only two levels, show that the Kruskal-
Wallis test is equivalent to the MWW test discussed in Chapter 2.

4.8.3. Consider a one-way design for the data in Example 4.2.3. Fit the model
using Wilcoxon estimates and conduct a residual analysis, including residual
and q−q plots of standardized residuals. Identify any outliers. Next test the
hypothesis (4.2.13) using the Kruskal-Wallis test and the test based on Fϕ .

4.8.4. Using the notation of Section 4.2.4, show that the asymptotic covari-
ance between µ̂i and µ̂i′ is given by expression (4.2.31). Next show that ex-
pressions (3.9.38) and (4.2.31) lead to a verification of the confidence interval
(4.2.30).

4.8.5. Show that the asymptotic covariance between estimates of location
levels is given by expression (4.2.31).

4.8.6. Suppose D is a symmetric, positive definite matrix. Prove that

sup_h [ h′y / √(h′D⁻¹h) ] = √(y′Dy) .    (4.8.1)


Refer to the Kruskal-Wallis statistic HW , given in expression (4.2.22). Let
y′ = (R̄1 − (n + 1)/2, . . . , R̄k − (n + 1)/2) and D = [12/(n(n + 1))] diag(n1 , . . . , nk ).
Then, using (4.8.1), show that HW ≤ χ²α (k − 1) if and only if

∑_{i=1}^{k} hi (R̄i − (n + 1)/2) / √[(n(n + 1)/12) ∑_{j=1}^{k} h²j /nj ] ≤ √χ²α (k − 1) ,

for all vectors h such that ∑ hi = 0.
Hence, if the Kruskal-Wallis test rejects H0 at level α then there must
be at least one contrast in the rank averages that exceeds the critical value
√χ²α (k − 1). This provides Scheffé type multiple contrast tests with family
error rate approximately equal to α.
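A minimal R sketch of the resulting Scheffé-type procedure follows; here Rbar
is the vector of average ranks, nvec the vector of sample sizes, and h a contrast
with sum(h) = 0:

    scheffe.rank <- function(h, Rbar, nvec, alpha = .10) {
      n    <- sum(nvec)
      stat <- sum(h * (Rbar - (n + 1) / 2)) /
                sqrt((n * (n + 1) / 12) * sum(h^2 / nvec))
      crit <- sqrt(qchisq(1 - alpha, df = length(nvec) - 1))
      c(statistic = stat, critical.value = crit)
    }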
4.8.7. Apply the procedure presented in Exercise 4.8.6 to the quail data of
Example 4.2.1. Use α = .10.
4.8.8. Let I1 and I2 be (1 − α)100% confidence intervals for parameters θ1
and θ2 , respectively. Show that

P [{θ1 ∈ I1 } ∩ {θ2 ∈ I2 }] ≥ 1 − 2α . (4.8.2)

(a) Suppose the confidence intervals I1 and I2 are independent. Show that

1 − 2α ≤ P [{θ1 ∈ I1 } ∩ {θ2 ∈ I2 }] ≤ 1 − α .

(b) Generalize expression (4.8.2) to k confidence intervals and derive the Bon-
ferroni procedure described in (4.3.2).
4.8.9. In the notation of the Pairwise Tests Based on Joint Rankings
procedure of Section 4.3, show that R1 is asymptotically
Nk−1 (0, [k(n + 1)/12](Ik−1 + Jk−1 )) under H0 : µ1 = · · · = µk . (Hint: The
asymptotic normality follows as in Theorem 3.5.2. In order to determine the
covariance matrix of R1 , first obtain the covariance matrix of the random
vector R = (R1· , . . . , Rk· )′ and then obtain the covariance matrix of R1 by
using the transformation [−1k−1 Ik−1 ].)
4.8.10. In Section 4.3, the Pairwise Tests Based on Joint Rankings
procedure was discussed based on Wilcoxon scores. Generalize this procedure
for an arbitrary score function ϕ(u).
4.8.11. For the baseball data in Exercise 1.12.33, consider the following one-
way problem. The response of interest is the hitter’s average and the three
groups are left-handed hitters, right-handed hitters, and switch hitters. Using
either Minitab or rglm, obtain the following analyses based on Wilcoxon scores:
(a) Using the test statistic Fϕ , test for an overall group effect. Obtain the
p-value and conclude at the 5% level.


(b) Use the protected LSD procedure of Fisher to compare the groups at the
5% level.

4.8.12. Consider the Bonferroni-type procedure described in item (6) of Sec-
tion 4.3. Formulate a similar Protected LSD-type procedure based on the test
statistic Fϕ . Use these procedures to make the comparisons discussed in Ex-
ercise 4.8.11.

4.8.13. Consider the baseball data in Exercise 1.12.33. In Exercise 3.15.38, we
investigated the linear relationship between a player’s height and his weight.
For this problem, consider the simple linear model

height = α + β · weight + e .

Using Wilcoxon scores and either Minitab or rglm, investigate whether or not
the same simple linear model can be used for both the pitchers and hitters.
Obtain the p-value for the test of this hypothesis based on the statistic Fϕ .

4.8.14. In Example 4.5.1 obtain the square root of the response and fit it to
the full model. Perform a residual analysis on the resulting fit. In particular
identify any outliers and compare the heteroscedasticity in the plot of the
residuals versus the fitted values with the analogous plot in Example 4.5.1.

4.8.15. For Example 4.5.1, overlay the Wilcoxon and LS fits for the four
treatments based on the square root transformation of the response. Then
obtain an analysis of covariance for both the Wilcoxon and LS analyses for
the transformed data. Comment on the plots and the results of the analyses.

4.8.16. Consider Example 4.7.1. Investigate whether a model which also in-
cludes quadratic terms in the covariates is more appropriate for the Rat data
than Model (4.7.2).

4.8.17. Consider Example 4.7.1. Eliminate the placebo group, Group 1, and
perform an analysis of covariance on Groups 2 and 3. Use the linear model,
(4.7.2). Is there any difference between these groups?

4.8.18. Let HW = W(W′W)⁻¹W′ be the projection matrix based on the
incidence matrix, (4.2.5). Show that HW is a block diagonal matrix with the
ith block a ni × ni matrix of all ones. Recall X = (I − H1 )W1 in Section
4.2.1. Let HX = X(X′X)⁻¹X′ be the projection matrix. Then argue that
HW = H1 + HX and, hence, HX = HW − H1 is easy to find. Using (4.2.8),
show that, for the one-way design, cov(Ẑ) ≐ τ²S H1 + τ²ϕ HX and, hence, show
that var(µ̂i ) is given by (4.2.11) and that cov(µ̂i , µ̂i′ ) is given by (4.2.31).


4.8.19. Suppose we have k treatments of interest and we employ a block design
consisting of a blocks. Within each block, we randomly assign mk subjects to
the treatments so that each treatment receives m subjects. Suppose we model
the responses Yijl as

Yijl = µ + αi + βj + eijl ; i = 1, . . . , a , j = 1, . . . , k , l = 1, . . . , m ,

where eijl are iid with cdf F (t). We want to test

H0 : β1 = · · · = βk versus HA : βj 6= βj ′ for some j 6= j ′ .

Suppose we rank the data in the ith block from 1 to mk for i = 1, . . . , a. Let
Rj be the sum of the ranks for the jth treatment. Show that

E(Rj ) = am(mk + 1)/2 ,
Var(Rj ) = am²(mk + 1)(k − 1)/12 ,
Cov(Rj , Rl ) = −am²(mk + 1)/12 .

Further, argue that

Km = [(k − 1)/k] ∑_{j=1}^{k} [(Rj − E(Rj ))/√Var(Rj )]²
   = [12/(akm²(mk + 1))] ∑_{j=1}^{k} R²j − 3a(mk + 1)

is asymptotically χ² with k − 1 degrees of freedom. Note if m = 1 then K1 is
the Friedman statistic. Show that the efficiency of the Friedman test relative
to the two-way LS F -test is 12σ²[∫ f²(x) dx]²(k/(k + 1)). Plot the efficiency
as a function of k when f is N(0, 1).
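For the last part, note that at the standard normal 12σ²[∫ f²(x) dx]² = 3/π,
so the efficiency reduces to (3/π)k/(k + 1); a minimal R sketch of the plot:

    k   <- 2:20
    eff <- (3 / pi) * k / (k + 1)   # efficiency of the Friedman test at f = N(0,1)
    plot(k, eff, type = "b", xlab = "k", ylab = "Efficiency")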

4.8.20. The data in Table 4.8.1 are the results of a 3 × 4 design discussed
in Box and Cox (1964). Forty-eight animals were exposed to three different
poisons and four different treatments. The response was the survival time of
the animal. The design was balanced. Use (4.4.1) as the full model to answer
the questions below.

(a) Using Wilcoxon scores obtain the fit of the full model. Sketch the cell me-
dian profile plot based on this fit and discuss whether or not interaction
between poison and treatments is present.


Table 4.8.1: Box-Cox Data, Exercise 4.8.20


Treatments
Poisons 1 2 3 4
0.31 0.82 0.43 0.45
1 0.45 1.10 0.45 0.71
0.46 0.88 0.63 0.66
0.43 0.72 0.76 0.62
0.36 0.92 0.44 0.56
2 0.29 0.61 0.35 1.02
0.40 0.49 0.31 0.71
0.23 1.24 0.40 0.38
0.22 0.30 0.23 0.30
3 0.21 0.37 0.25 0.36
0.18 0.38 0.24 0.31
0.23 0.29 0.22 0.33

(b) Based on the Wilcoxon fit, plot the residuals versus the fitted values.
Comment on the appropriateness of the model. Also obtain the internal
Wilcoxon Studentized residuals and identify any outliers.

(c) Using the statistic Fϕ , obtain the robust ANOVA table (main effects and
interaction) for this data. Conclude in terms of the p-values.

(d) Note that the hypothesis matrix for interaction defines six interaction
contrasts. Use the Bonferroni and Protected LSD multiple comparison
procedures, (4.3.2) and (4.3.3), to investigate these contrasts. Determine
which, if any, are significant.

(e) Repeat the analysis in Parts (c) and (d), (Bonferroni analysis), using LS.
Compare the Wilcoxon and LS results.

4.8.21. For testing the ordered alternative

H0 : µ1 = · · · = µk versus HA : µ1 ≤ · · · ≤ µk ,

with at least one strict inequality, let


J = ∑_{s<t} S⁺st ,

where S⁺st = #(Ytj > Ysi ) for i = 1, . . . , ns and j = 1, . . . , nt ; see (2.2.20).
This test for ordered alternatives was proposed independently by Jonckheere
(1954) and Terpstra (1952). Under H0 , show the following:



(a) E(J) = [n² − ∑ n²t ]/4 .

(b) V (J) = [n²(2n + 3) − ∑ n²t (2nt + 3)]/72 .

(c) z = (J − E(J))/√V (J) is approximately N(0, 1).

Hence, based on (a)-(c), an asymptotic test for H0 versus HA is to reject H0
if z ≥ zα .
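A minimal R sketch of this test (assuming no ties; y is the response vector
and g the group labels in the hypothesized order):

    jonckheere <- function(y, g) {
      g <- as.integer(factor(g))
      J <- 0
      for (s in 1:(max(g) - 1)) for (t in (s + 1):max(g)) {
        J <- J + sum(outer(y[g == t], y[g == s], ">"))  # S+_st; see (2.2.20)
      }
      n  <- length(y)
      nt <- as.numeric(table(g))
      EJ <- (n^2 - sum(nt^2)) / 4
      VJ <- (n^2 * (2 * n + 3) - sum(nt^2 * (2 * nt + 3))) / 72
      z  <- (J - EJ) / sqrt(VJ)
      c(J = J, z = z, p.value = 1 - pnorm(z))
    }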


Chapter 5

Models with Dependent Error Structure

5.1 Introduction
In this chapter, we develop robust rank-based fitting and inference procedures
for linear models with dependent error structure. The first four sections con-
sider general mixed models. For these, the underlying model is linear but the
data come in clusters; for example, in blocks, subjects, or centers. Hence, there
is a cluster (block) effect; that is, the observations within a cluster are depen-
dent random variables. As in Chapter 4, the fixed effects of the linear model
are of interest but the dependent error structure must be taken into account
in the development of the inference procedures. The last section of the chapter
discusses rank-based fitting and inference procedures for autoregressive time
series models.

5.2 General Mixed Models


Consider an experiment done over m clusters (blocks), where cluster k has
nk observations. In practice, we usually assume that the observations from
different clusters are independent but that the observations within a cluster
are dependent. We use the terms clusters and blocks interchangeably in our
discussion. The term blocks, though, is often associated with experimental
design. The methods described in this chapter, however, can be used on ob-
servational studies, longitudinal studies, repeated measure designs, etc.; so,
we prefer to use the looser term clusters.
Within cluster k, let Yk , Xk , and ek denote respectively the nk × 1 vector
of responses, the nk × p design matrix, and the nk × 1 vector of errors. Let 1nk
denote a vector of nk ones. Then the general mixed model for Yk is


Yk = α1nk + Xk β + ek ,   k = 1, . . . , m ,    (5.2.1)
where β is the p × 1 vector of regression coefficients and α is the intercept
parameter. The components of the random error vector ek are generally de-
pendent random variables. Later, we make certain assumptions on the
distribution of ek but, for now, our discussion is quite general.
Alternately, the model can be written in the long form as
Y = 1n α + Xβ + e, (5.2.2)
where n = ∑_{k=1}^{m} nk denotes the total sample size, Y = (Y1′ , . . . , Ym′ )′ ,
X = (X1′ , . . . , Xm′ )′ , and e = (e1′ , . . . , em′ )′ . Because an intercept parameter is in
the model, we can assume that X is centered and that the true median of ekj
is zero. Since we can always reparameterize, assume that X has full column
rank. It is important to note that the design matrices, Xk s, for the clusters
need not have full column rank. For example, incomplete block designs can
be considered. To distinguish this general mixed model from the linear model
of Chapter 3, in this chapter we call the model of Chapter 3 the independent
error or case model.
This general mixed model often occurs in the applied sciences. Examples
include data from clinical designs carried out over centers, repeated measures
type data on individuals, data from randomized block designs, and clusters
of correlated data. As in Chapters 3 and 4, for inference the primary focus
concerns the regression coefficients (fixed effects), but the dependent structure
must be taken into account in order to obtain valid inference for the fixed
effects. Liang and Zeger (1986) discuss these models in some detail, developing
a weighted LS inference for it.
The fixed effects part of the model is, of course, the linear model of Chap-
ters 3 and 4. So in this section we proceed to discuss the R fit developed in
Chapter 3 for Model (5.2.1). As we show the asymptotic variance of the R
estimator is a function of the dependence structure in the model.

Geometry
The geometry of the R fit of Model 5.2.1 is the same as for that of the linear
model in Chapter 3. Let ϕ(u), 0 < u < 1, be a specified score function
which satisfies Assumption (S.1) of Section 3.4 and consider the norm given
in expression (3.2.6). Then, as in Chapter 3, the R estimator of β is
β̂ϕ = Argmin ‖Y − Xβ‖ϕ .    (5.2.3)

For Model (5.2.1), properties of this estimator were developed by Kloke, Mc-
Kean, and Rashid (2009). They refer to it as the JR estimator for joint rank-
ing; however, we use the terminology of Chapter 3 and call it an R estimator.


As in Chapter 3, equivalently, β̂ϕ is a solution to Sϕ (Y − Xβ) = 0, where
Sϕ (Y − Xβ) is the negative of the gradient of ‖Y − Xβ‖ϕ given in (3.2.12).
Once β is estimated, we estimate the intercept α by the median of the resid-
uals, that is

α̂S = medkj {ykj − x′kj β̂ϕ } .    (5.2.4)
As in Chapter 3, both estimators are regression and scale equivariant.
In the case of Wilcoxon scores, the R code ww, Terpstra and McKean
(2005), can be used to obtain the R fit. For general scores, the R software
Rfit developed by Kloke and McKean (2010b) computes these methods, also.
This package includes routines for the inference discussed next; see the Preface
for url locations.
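As a minimal sketch of such a fit with Rfit (the long-form data frame dat,
one row per observation, is hypothetical):

    library(Rfit)
    fit  <- rfit(y ~ x1 + x2, data = dat)  # rank-based fit of the long form (5.2.2)
    coef(fit)                              # intercept and regression estimates
    ehat <- residuals(fit)                 # residuals (5.2.11), used to estimate
                                           # tau_phi and the Sigma_phi,k of (5.2.6)
    # For clustered data, standard errors must be based on (5.2.9); the
    # independent-error standard errors reported by summary(fit) do not apply.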

Asymptotic Theory
The asymptotic theory for the R estimates is similar to the theory in Chapter
3. For this reason we only briefly sketch it in the following discussion. First,
certain conditions are needed. Assume that the random vectors e1 , e2 , . . . , em
are independent; i.e., the responses drawn from different blocks or clusters
are independent. Assume further that the univariate marginal distributions
of ek are the same for all k. As discussed at the end of this section (see
Subsection 5.2.1), this holds for many models of practical interest; however, in
Section 5.5, we do discuss more general rank-based estimators which do not
require this assumption. Let F (x) and f (x) denote the common univariate
marginal distribution function and density function. Assume that f (x) follows
Assumption (E.1) of Section 3.4 and that the usual regularity (likelihood)
conditions hold; see, for example, Section 6.5 of Lehmann and Casella (1998).
For the design matrix X, assume that Huber’s condition (D.2) of Section 3.4
holds. As with the asymptotic theory for the traditional estimates (see Liang
and Zeger, 1986), assume that the number of clusters goes to ∞, i.e., m → ∞,
and that nk ≤ M, for all k, for some constant M.
Because of the invariances, without loss of generality, assume that the true
regression parameters are zero in Model (5.2.1). As in Chapter 3, asymptotic
theory for the fixed effects estimator involves establishing the distribution of
the gradient and the asymptotic quadraticity of the dispersion function.
Consider Model (5.2.1) and assume the above conditions. It then follows
from Brunner and Denker (1994) that the projection of the gradient Sϕ (Y −
Xβ) is the random vector X′ ϕ[F(Y−Xβ)], where ϕ[F(Y−Xβ)] = (ϕ[F (Y11 −
x′11 β)], . . . , ϕ[F (Ymnm − x′mnm β)])′ . We need to assume that the covariance
structure of this projection is asymptotically stable; that is, the following
limit exists and is positive definite:
Σϕ = lim_{m→∞} n⁻¹ ∑_{k=1}^{m} X′k Σϕ,k Xk ,    (5.2.5)


where
Σϕ,k = Cov{ϕ[F(ek )]}. (5.2.6)
In likelihood methods, a similar assumption is made on the limit of the co-
variance matrix of the errors.
As discussed by Kloke et al. (2009), under these assumptions, it follows
from Theorem 3.1 of Brunner and Denker (1994) that
(1/√n) SX (0) →D Np (0, Σϕ ),    (5.2.7)

where Σϕ is defined in expression (5.2.5). The linearity and quadraticity re-
sults obtained in Chapter 3 for the linear model can be extended to our model.
The linearity result is SX (β) = SX (0) − τϕ⁻¹ n⁻¹ X′Xβ + op (√n), uniformly
for √n‖β‖ ≤ c, for c > 0, where τϕ is the same scale parameter as in Chap-
ter 3; i.e., defined in expression (3.4.4). From this we obtain the asymptotic
representation of the R estimator given by

√n β̂ϕ = τϕ √n (X′X)⁻¹ X′ ϕ[F(e)] + op (1).    (5.2.8)

Based on (5.2.7) and (5.2.8), we obtain the asymptotic distribution of β̂ϕ ,
which we summarize in the following theorem.

Theorem 5.2.1. Under the assumptions discussed above, the distribution of
β̂ϕ is approximately normal with mean β and covariance matrix

Vϕ = τϕ² (X′X)⁻¹ ( ∑_{k=1}^{m} X′k Σϕ,k Xk ) (X′X)⁻¹ .    (5.2.9)

It then follows that α̂S is approximately normal with mean α and variance

σ²1 (0) = τ²S n⁻¹ ∑_{k=1}^{m} [ ∑_{j=1}^{nk} var(sgn(ekj )) + ∑_{j≠j′} cov(sgn(ekj ), sgn(ekj′ )) ] ,    (5.2.10)

where τS = 1/[2f (0)].


In this section, we have kept the model general; i.e., we have not specified
the covariance structure. To conduct inference, we need an estimate of the
covariance matrix of β̂ϕ . Define the residuals of the R fit by

êR = Y − α̂S 1n − Xβ̂ϕ .    (5.2.11)

Using these residuals, we estimate the parameter τϕ as discussed in Section
3.7.1. Next, a nonparametric estimate of Σϕ,k , (5.2.6), is obtained by replacing
the distribution function F (t) in its definition by the empirical distribution
function of the residuals. Based on these results, for a specified vector h ∈ Rp ,
an approximate (1 − α)100% confidence interval for h′β is given by

h′β̂ϕ ± zα/2 √(h′ V̂ϕ h) .    (5.2.12)
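Given the estimates, the interval (5.2.12) is immediate to compute; a minimal
R sketch (betahat and Vhat, the estimate of Vϕ , are assumed available):

    ci.hbeta <- function(h, betahat, Vhat, alpha = .05) {
      est <- sum(h * betahat)
      se  <- sqrt(drop(t(h) %*% Vhat %*% h))
      est + c(-1, 1) * qnorm(1 - alpha / 2) * se  # the interval (5.2.12)
    }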

Consider general linear hypotheses of the form H0 : Mβ = 0 versus
HA : Mβ ≠ 0, where M is a q × p matrix of rank q. We offer two test
statistics. First, the asymptotic distribution of β̂ϕ suggests a Wald-type test
of H0 based on the test statistic

TW,ϕ = (Mβ̂ϕ )T [M V̂ϕ MT ]⁻¹ (Mβ̂ϕ ) .    (5.2.13)

Under H0 , TW,ϕ has an asymptotic χ²q distribution with q degrees of freedom.
Hence, a nominal level α test is to reject H0 if TW,ϕ ≥ χ²α (q). As in the in-
dependent error case, this test is consistent for all alternatives of the form
Mβ ≠ 0. For efficiency results consider a sequence of local alternatives of
the form HAn : Mβn = β/√n , where β ≠ 0. Under this sequence of alter-
natives TW,ϕ has an asymptotic noncentral χ²q -distribution with noncentrality
parameter

η = (Mβ)T [MVϕ MT ]⁻¹ Mβ .    (5.2.14)
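A minimal R sketch of this Wald test (M, betahat, and Vhat are assumed
available):

    wald.test <- function(M, betahat, Vhat) {
      Mb <- M %*% betahat
      TW <- drop(t(Mb) %*% solve(M %*% Vhat %*% t(M)) %*% Mb)  # (5.2.13)
      c(TW = TW, p.value = 1 - pchisq(TW, df = nrow(M)))
    }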

A second test utilizes the reduction in dispersion, RDϕ = D(Red) −
D(Full), where D(Full) and D(Red) are respectively the minimum values of
the dispersion function under the full and reduced (full model constrained by
H0 ) models. The asymptotically correct standardization depends on the de-
pendence structure of the errors; see Exercises 5.7.5 and 5.7.6 for discussion
on this test and also of the aligned rank test of Chapter 3.
Our discussion has been for general scores. If we have knowledge of the dis-
tribution of the errors then we can optimize the analysis by selecting a suitable
score function. From expression (5.2.9), although the dependence structure ap-
pears in the approximate covariance of β̂ϕ , as in Chapters 2 and 3, the constant
of proportionality is τϕ . Hence, the discussion in Chapters 2 and 3 concern-
ing score selection based on minimizing τϕ is still pertinent for the rank-based
analysis of this section. Example 5.3.1 of the next section illustrates such score
selection.
If the score function is bounded, then based on their asymptotic represen-
tation, (5.2.8), these R estimators have bounded influence in response space
but not in factor space. However, for outliers in factor space, the high break-
down HBR estimators, (3.12.2), can be extended in the same way as the R
estimators.


5.2.1 Applications
In many applications the form of the covariance structure of the random vector
of errors $e_k$ of Model (5.2.1) is known. This can result in a simplified asymptotic covariance structure for $\widehat{\beta}_\varphi$. We discuss several such cases in the next few sections. In Section 5.3, we consider a simple mixed model with clusters
handled as a random effect. Here, besides an estimate of τϕ , only an additional
covariance parameter is required to estimate Vϕ . In Section 5.4.1, we discuss
a transformed procedure for a simple mixed model, provided that the design
matrices for each cluster, the $X_k$'s, have full column rank. Another rich class of
such models is the repeated measure designs, where cluster is synonymous
with subject. Two common types of covariance structure for these designs
are: (i) the covariance of the errors for a subject have compound symmetri-
cal structure, i.e., a simple random effect model, or (ii) the errors follow a
stationary time series model, for instance an autoregressive model. For Case
(ii), the univariate marginals would have the same distribution and, hence,
the above assumptions hold for our rank-based estimates. Using the residuals
from the rank-based fit, R estimators of the autoregressive parameters of the
error distribution can be obtained. These estimates could then be used in the
usual way to transform the observations and then a second (generalized) R
estimate could be obtained based on these transformed observations. This is
a robust analogue of the two-stage estimation procedure discussed for cluster
samples in Rao, Sutradhar and Yue (1993). Generalized R estimators based
on transformations are discussed in Sections 5.4 and 5.5.

5.3 Simple Mixed Models


In this section, we discuss a simple mixed model with block or cluster as a
random effect. Consider Model (5.2.1), but for each block k, model the error
vector ek as ek = 1nk bk + ǫk , where the components of ǫk are independent
and identically distributed and bk is a continuous random variable which is
independent of ǫk . Hence, we write the model as

$$ Y_k = \alpha 1_{n_k} + X_k\beta + 1_{n_k} b_k + \epsilon_k, \quad k = 1, \ldots, m. \qquad (5.3.1) $$

Assume that the random effects b1 , . . . , bm are independent and identically dis-
tributed random variables. It follows that the distribution of ek is exchange-
able. In particular, all marginal distributions of ek are the same; so, the theory
of Section 5.2 holds. This family of models contains the randomized block de-
signs, but as in Section 5.2 the blocks can be incomplete and/or unbalanced.
We call Model 5.3.1, the simple mixed model.
For this model, the asymptotic variance-covariance matrix of $\widehat{\beta}_\varphi$, (5.2.9),


simplifies to

$$ V_\varphi = \tau_\varphi^2 (X'X)^{-1} \sum_{k=1}^{m} X_k' \Sigma_{\varphi,k} X_k\, (X'X)^{-1}, \quad \Sigma_{\varphi,k} = (1 - \rho_\varphi) I_{n_k} + \rho_\varphi J_{n_k}, \qquad (5.3.2) $$

where $\rho_\varphi = \mathrm{cov}\{\varphi[F(e_{11})], \varphi[F(e_{12})]\} = E\{\varphi[F(e_{11})]\,\varphi[F(e_{12})]\}$. Also, the asymptotic variance of the intercept (5.2.10) simplifies to $n^{-1}\tau_S^2(1 + n^*\rho_S^*)$, for $\rho_S^* = \mathrm{cov}[\mathrm{sgn}(e_{11}), \mathrm{sgn}(e_{12})]$ and $n^* = n^{-1}\sum_{k=1}^{m} n_k(n_k - 1)$. As with LS, for positive definiteness, we need to assume that each of $\rho_\varphi$ and $\rho_S^*$ exceeds $\max_k\{-1/(n_k - 1)\}$. Let $M = \sum_{k=1}^{m}\binom{n_k}{2} - p$ (the subtraction of $p$, the dimension of the vector $\beta$, is a degrees of freedom correction). A simple moment estimator of $\rho_\varphi$ is

$$ \widehat{\rho}_\varphi = M^{-1} \sum_{k=1}^{m} \sum_{i > j} a[R(\widehat{e}_{ki})]\, a[R(\widehat{e}_{kj})]. \qquad (5.3.3) $$

Plugging this into (5.3.2) and using the estimate of τϕ discussed earlier, we
have an estimate of the asymptotic covariance matrix of the R estimators.
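For Wilcoxon scores, the moment estimator (5.3.3) takes only a few lines of R. The sketch below is hypothetical and assumes the standardized Wilcoxon score function $\varphi(u) = \sqrt{12}(u - 1/2)$; ehat is the vector of R residuals, block is a factor of cluster labels, and p is the dimension of $\beta$.

# Minimal sketch of the moment estimator (5.3.3) with Wilcoxon scores.
rho_phi_hat <- function(ehat, block, p) {
  n <- length(ehat)
  a <- sqrt(12) * (rank(ehat) / (n + 1) - 0.5)  # scores a[R(ehat_ki)]
  pair_sums <- tapply(a, block, function(ak) {
    S <- outer(ak, ak)
    sum(S[lower.tri(S)])                        # sum over pairs i > j in a block
  })
  M <- sum(choose(table(block), 2)) - p         # pair count with df correction
  sum(pair_sums) / M
}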
For the general mixed model (5.2.1) of Section 5.2, the AREs for the rank-
based procedures are difficult to obtain; however, for the simple mixed model,
(5.3.1), the ARE can be obtained in closed form provided the design is centered
within each block; see Kloke et al. (2009). The reader is asked to show in
Exercise 5.7.2 that for Wilcoxon scores, this ARE is
$$ ARE(F_{W,\varphi}, F_{LS}) = \frac{1-\rho}{1-\rho_\varphi}\; 12\sigma^2 \left( \int f^2(t)\, dt \right)^2, \qquad (5.3.4) $$

where ρϕ is defined under expression (5.3.2) and ρ is the correlation coefficient


within a block. If the random vectors in a block follow the multivariate normal
distribution, then this ARE lies in the interval [0.8660, 0.9549] when 0 < ρ < 1.
The lower bound is attained when ρ → 1. The upper bound is attained when
ρ = 0 (the independent case), which is the usual high efficiency of the Wilcoxon
to LS at the normal distribution. When −1 < ρ < 0, this ARE lies in [0.9549,
0.9662] and the upper bound is attained when ρ = −0.52 and the lower bound
is attained when ρ → −1. Generally, the high efficiency properties of the
Wilcoxon analysis to LS analysis in the independent errors case extend to the
Wilcoxon analysis for this mixed model design. See Kloke et al. (2009) for
details.

5.3.1 Variance Component Estimators


In this section, we assume that the variances of the errors exist. Let Σek
denote the variance-covariance matrix of ek . Under the model of this section,
the variance-covariance matrix of ek is compound symmetric having the form


$\Sigma_{e_k} = \sigma^2 A_k(\rho) = \sigma^2[(1-\rho)I_{n_k} + \rho J_{n_k}]$, where $\sigma^2 = \mathrm{Var}(e_{ki})$, $I_{n_k}$ is the identity matrix of order $n_k$, and $J_{n_k}$ is an $n_k \times n_k$ matrix of ones. Letting $\sigma_b^2$ and $\sigma_\varepsilon^2$ denote respectively the variances of the random effect $b_k$ and the error $\varepsilon$, the total variance is given by $\sigma^2 = \sigma_\varepsilon^2 + \sigma_b^2$. The intraclass correlation coefficient is $\rho = \sigma_b^2/(\sigma_\varepsilon^2 + \sigma_b^2)$. These parameters, $(\sigma_\varepsilon^2, \sigma_b^2, \sigma^2)$, are referred to as the variance components.
To estimate these variance components, we use the estimates discussed in Kloke et al. (2009); see also Rashid and Nandram (1998) and Gerard and Schucany (2007). In block $k$, rewrite model (5.3.1) as $y_{kj} - [\alpha + x_{kj}'\beta] = b_k + \varepsilon_{kj}$, $j = 1, \ldots, n_k$. The left side of this expression is estimated by the residual

$$ \widehat{e}_{R,kj} = y_{kj} - [\widehat{\alpha} + x_{kj}'\widehat{\beta}], \quad k = 1, \ldots, m; \; j = 1, \ldots, n_k. \qquad (5.3.5) $$

Hence, a predictor (estimate) of $b_k$ is given by $\widehat{b}_k = \mathrm{med}_{1 \le j \le n_k}\{\widehat{e}_{R,kj}\}$, and a robust estimator of the variance of $b_k$ is $\mathrm{MAD}^2$, (3.9.27); that is,

$$ \widehat{\sigma}_b^2 = [\mathrm{MAD}_{1 \le k \le m}(\widehat{b}_k)]^2 = \left[ 1.483\, \mathrm{med}_{1 \le k \le m} \bigl|\widehat{b}_k - \mathrm{med}_{1 \le j \le m}\{\widehat{b}_j\}\bigr| \right]^2. \qquad (5.3.6) $$

In this simple mixed model, the residuals $\widehat{e}_{R,kj}$, (5.3.5), are often called the marginal residuals. In addition, though, we have the conditional residuals for the errors $\varepsilon_{kj}$, which are defined by

$$ \widehat{\varepsilon}_{kj} = \widehat{e}_{R,kj} - \widehat{b}_k, \quad j = 1, \ldots, n_k, \; k = 1, \ldots, m. \qquad (5.3.7) $$

A robust estimate of $\sigma_\varepsilon^2$ is then

$$ \widehat{\sigma}_\varepsilon^2 = [\mathrm{MAD}_{1 \le j \le n_k,\, 1 \le k \le m}(\widehat{\varepsilon}_{kj})]^2. \qquad (5.3.8) $$

Hence, robust estimates of the total variance $\sigma^2$ and the intraclass correlation coefficient are

$$ \widehat{\sigma}^2 = \widehat{\sigma}_\varepsilon^2 + \widehat{\sigma}_b^2 \quad \text{and} \quad \widehat{\rho} = \widehat{\sigma}_b^2 / \widehat{\sigma}^2. \qquad (5.3.9) $$
Thus, our robust estimates of the variance components are given in expressions
(5.3.6), (5.3.8), and (5.3.9).
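A direct implementation of (5.3.5)-(5.3.9) is equally short. This is a minimal hypothetical sketch (ehat holds the marginal R residuals, block the cluster labels); R's mad uses the constant 1.4826, matching the 1.483 in (5.3.6).

# Minimal sketch of the robust variance component estimates (5.3.6)-(5.3.9).
vc_robust <- function(ehat, block) {
  bhat   <- tapply(ehat, block, median)   # predictors of the random effects
  sig2_b <- mad(bhat)^2                   # (5.3.6)
  eps    <- ehat - bhat[block]            # conditional residuals (5.3.7)
  sig2_e <- mad(eps)^2                    # (5.3.8)
  c(sigma2_b = sig2_b, sigma2_eps = sig2_e,
    sigma2 = sig2_b + sig2_e,             # (5.3.9)
    rho = sig2_b / (sig2_b + sig2_e))
}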

5.3.2 Studentized Residuals


In Chapter 3, we presented Studentized residuals for R and HBR fits. These
residuals are fundamental for diagnostic analyses of linear models. They cor-
rect for both the model (factor space) and the underlying covariance structure
and allow for a simple benchmark rule for designating potential outliers. In
this section, we present Studentized residuals based on the R fit of the simple
mixed model, (5.3.1). Because the marginal residuals $\widehat{e}_{R,kj}$, (5.3.5), are used to
check the quality of fit, these are the appropriate residuals for standardizing.


Because the block sample sizes $n_k$ are not necessarily the same, some additional notation simplifies the presentation. Let $\nu_1$ and $\nu_2$ be two parameters and define the block-diagonal matrix $B(\nu_1, \nu_2) = \mathrm{diag}\{B_1(\nu_1, \nu_2), \ldots, B_m(\nu_1, \nu_2)\}$, where $B_k(\nu_1, \nu_2) = (\nu_1 - \nu_2)I_{n_k} + \nu_2 J_{n_k}$, $k = 1, \ldots, m$. Hence, for Model (5.3.1), we can write $\mathrm{Var}(e) = \sigma^2 B(1, \rho)$.

Using the asymptotic representation for $\widehat{\beta}_\varphi$ given in expression (5.2.8), a tedious calculation, similar to that in Section 3.9.2, shows that the approximate covariance matrix of $\widehat{e}_R$ is given by

$$ \begin{aligned} C_R = {}& \sigma^2 B(1, \rho) + \frac{\tau_s^2}{n^2} J_n B(1, \rho_S^*) J_n + \tau^2 H_c B(1, \rho_\varphi) H_c \\ & - \frac{\tau_s}{n} B(\delta_{11}, \delta_{12}) J_n - \tau B(\delta_{11}^*, \delta_{12}^*) H_c - \frac{\tau_s}{n} J_n B(\delta_{11}, \delta_{12}) \\ & + \frac{\tau\tau_s}{n} J_n B(\gamma_{11}, \gamma_{12}) H_c - \tau H_c B(\delta_{11}^*, \delta_{12}^*) + \frac{\tau_s\tau}{n} H_c B(\gamma_{11}, \gamma_{12}) J_n, \end{aligned} \qquad (5.3.10) $$

where $H_c$ is the projection matrix onto the column space of the centered design matrix $X_c$, $J_n$ is the $n \times n$ matrix of all ones, and

$$ \begin{aligned} \delta_{11} &= E[e_{11}\,\mathrm{sgn}(e_{11})], & \delta_{12} &= E[e_{11}\,\mathrm{sgn}(e_{12})], \\ \delta_{11}^* &= E[e_{11}\,\varphi(F(e_{11}))], & \delta_{12}^* &= E[e_{11}\,\varphi(F(e_{12}))], \\ \gamma_{11} &= E[\mathrm{sgn}(e_{11})\,\varphi(F(e_{11}))], & \gamma_{12} &= E[\mathrm{sgn}(e_{11})\,\varphi(F(e_{12}))], \end{aligned} $$

and $\rho_\varphi$ and $\rho_S^*$ are defined in (5.2.5) and (5.2.9), respectively.


To compute the Studentized residuals, estimates of the parameters in $C_R$, (5.3.10), are required. First, consider the matrix $\sigma^2 B(1, \rho)$. In Section 5.3.1, we obtained robust estimators $\widehat{\sigma}^2$ and $\widehat{\rho}$ given in expression (5.3.9). Substituting these estimators for $\sigma^2$ and $\rho$ into $\sigma^2 B(1, \rho)$, we have a robust estimator of $\sigma^2 B(1, \rho)$ given by $\widehat{\sigma}^2 B(1, \widehat{\rho})$. Expression (5.3.3) gives a simple moment estimator of $\rho_\varphi$. The parameters $\rho_S^*$, $\delta_{11}$, $\delta_{12}$, $\delta_{11}^*$, $\delta_{12}^*$, $\gamma_{11}$, and $\gamma_{12}$ can be estimated in the same way. Substituting these estimators into the matrix $C_R$, let $\widehat{C}_R$ denote the resulting estimator.

For $t = 1, \ldots, n$, let $\widehat{c}_{tt}$ denote the $t$th diagonal entry of the matrix $\widehat{C}_R$. Then the $t$th Studentized marginal residual based on the R fit is

$$ \widehat{e}_{R,t}^* = \widehat{e}_{R,t} / \sqrt{\widehat{c}_{tt}}. \qquad (5.3.11) $$

As in Chapter 3, the traditional benchmarks used with these Studentized


residuals are the limits ±2.
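Given an estimate of (5.3.10), the Studentization itself is one line; a minimal hypothetical sketch, with Chat holding the estimated covariance matrix of the marginal residuals:

# Minimal sketch of (5.3.11) and the usual +/- 2 benchmark rule.
student_resid <- function(ehat, Chat) ehat / sqrt(diag(Chat))
# which(abs(student_resid(ehat, Chat)) > 2)   # flag potential outliers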


5.3.3 Example and Simulation Studies


In this section we present an example of a randomized block design. It consists
of only two blocks, so we also summarize simulation studies which confirm the
validity of the rank-based analysis. For the examples and the simulation stud-
ies, we computed the rank-based analysis using the collection of R functions
Rfit described above. By the traditional fit, we mean the maximum likelihood
fit based on multivariate normality of the error random vectors. This fit and
subsequent analysis was obtained using the R function lme as discussed in
Pinheiro et al. (2008).

Example 5.3.1 (Crab Grass Data). Cobb (1998) presented an example of


a complete block design concerning the weight of crab grass. Much of our
discussion is drawn from Kloke et al. (2009). There are four fixed factors
in the experiment: the density of the crab grass at four levels, the nitrogen
content of the crab grass at two levels, the phosphorus content of the crab
grass at two levels, and the potassium content of the crab grass at two levels.
Two complete blocks of the experiment were carried out, so altogether there
are n = 64 observations. Here, block is a random factor and we assume the
simple mixed model, (5.3.1), of this section. Under each set of experimental
conditions, crab grass was grown in a cup. The response is the dry weight of a
unit (cup) of crab grass, in milligrams. The data are presented in Cobb (1998).
For convenience, we have displayed the data at the url listed in the Preface.
We consider the rank-based analysis of this section based on Wilcoxon
scores. For the main effects model, Table 5.3.1 displays the estimated effects
(contrasts) and standard errors for the Wilcoxon and traditional analyses. For
the nutrients, these effects are the differences between the high and low levels,
while for the factor density the three contrasts reference the highest density
level. There are major differences between the Wilcoxon and the traditional
estimates. For the Wilcoxon estimates, the nutrients nitrogen and phosphorus
are significant and the contrast between the low and high levels of density
is highly significant. Nitrogen is the only significant effect for the traditional
analysis. The Wilcoxon statistic to test the density effects has the value TW,ϕ =
20.55 with p = 0.002; while, the traditional test statistic is Flme = 0.82 with
p = 0.490. The robust estimates of the variance components are: $\widehat{\sigma}^2 = 206.33$, $\widehat{\sigma}_b^2 = 20.28$, and $\widehat{\rho} = 0.098$.
An outlier accounts for much of this dramatic difference between the ro-
bust and traditional analyses. Originally, one of the responses was mistyped;
instead of the correct value 97.25, the response was typed as 972.5. As Cobb
(1998) notes, this outlier was more difficult to spot in the original units. Upon
replacing the outlier with its correct value, the Wilcoxon and traditional anal-
yses are similar, although the Wilcoxon analysis is still more precise; see the
discussion below on the other outliers in this data set. This is true too of the


Table 5.3.1: Wilcoxon and LS Estimates with SEs of Effects for the Crab Grass
Data
Wilcoxon Traditional
Contrast Est. SE Est. SE
Nit 39.90 4.08 69.76 28.7
Pho 10.95 4.08 −11.52 28.7
Pot −1.60 4.08 28.04 28.7
D34 3.26 5.76 57.74 40.6
D24 7.95 5.76 8.36 40.6
D14 24.05 5.76 31.90 40.6

test for the factor density: TW,ϕ = 23.23 (p = 0.001) and Flme = 6.33 with
p = 0.001. The robust estimates of the variance components are: $\widehat{\sigma}^2 = 209.20$, $\widehat{\sigma}_b^2 = 20.45$, and $\widehat{\rho} = 0.098$. These are essentially unchanged from their values
on the original data. If on the original data the experimenter had run the ro-
bust fit and compared it with the traditional fit, then the outlier would have
been discovered immediately.
Figure 5.3.1 contains the Wilcoxon Studentized residual plot and q−q plot
for the original data. We have removed the large outlier from the plots, so
that we can focus on the remaining data. The “vacant middle” in the residual
plot is an indication that interaction may be present. For the hypothesis of
interaction between the nutrients, the value of the Wald-type test statistic
is TW,ϕ = 30.61, with p = 0.000. Hence, the R analysis strongly confirms
that interaction is present. On the other hand, the traditional likelihood ratio
test statistic for this interaction is 2.92, with p = 0.404. In the presence of
interaction, many statisticians would consider interaction contrasts instead of
a main effects analysis. Hence, for such statisticians, the robust and traditional
analyses would have different practical interpretations.

[Figure 5.3.1: Studentized residual and q−q plots, minus large outlier. Panels: Studentized Wilcoxon residual versus the Wilcoxon fit; normal q−q plot of the Studentized Wilcoxon residuals.]

5.3.4 Simulation Studies of Validity


In this data set, the number of blocks is two. Hence, to answer questions con-
cerning the validity of the Wilcoxon analysis, Kloke et al. (2009) conducted
a small simulation study. Table 5.3.2 summarizes the empirical confidences
and AREs of this study for two situations, normal errors and contaminated
normal errors (20% contamination and the ratio of the contaminated variance
to the uncontaminated variance at 25). For each situation, the same randomized block design as in the crab grass example was used, with the correlation
structure as estimated by the Wilcoxon analysis. The empirical confidences of
the asymptotic 95% confidence intervals were recorded. These intervals are of
the form Estimate ±1.96×SE, where SE denotes the standard errors of the
estimates. The number of simulations was 10,000 for each situation; therefore, the error in the table, based on the usual 95% confidence interval for a proportion, is 0.004. The empirical confidences for the Wilcoxon are quite good, with the target of 0.95 usually within range of error. They were perhaps a little conservative at the contaminated normal situation. Hence, the Wilcoxon
analysis appears to be valid for this design. The intervals based on the tra-
ditional fit are slightly liberal. The empirical AREs between two estimators
displayed in Table 5.3.2 are the ratios of empirical mean squared errors of the
two estimators. As the table shows, the traditional fit is more efficient at the
normal but the efficiencies are close to the value 0.95 for the independent er-
ror case. The Wilcoxon analysis is much more efficient over the contaminated
normal situation.
Does this rank-based analysis differ from the independent error analysis
of Chapter 3? As a tentative answer to this question, Kloke et al. (2009)
ran 10,000 simulations using the model for the crab grass example. Wilcoxon
scores were used for both analyses. To avoid confusion, call the analysis of
Chapter 3 the IR analysis (I for independent errors), and the analysis of this


Table 5.3.2: Validity of Inference (Empirical Confidence Sizes and AREs)

Norm. Errors Cont. Norm. Errors


Contrast Wilc. Traditional ARE Wilc. Traditional ARE
Nit 0.948 0.932 0.938 0.964 0.933 7.73
Pho 0.953 0.934 0.941 0.964 0.930 7.82
Pot 0.948 0.927 0.940 0.966 0.934 7.72
D34 0.950 0.929 0.936 0.964 0.931 7.75
D24 0.951 0.929 0.943 0.960 0.931 7.57
D14 0.952 0.930 0.944 0.960 0.929 7.92

section the R analysis. They considered normal error distributions, setting the
variance components at the values of the robust estimates. Because the R and
IR fits are the same, they considered the differences in their inferences of the six
effects listed in Table 5.3.1. For 95% nominal confidence, the average empirical
confidences over these six contrasts are 95.32% and 96.12%, respectively for the
R and IR procedures. Hence, both procedures appear valid. For a measure of
efficiency, they averaged, across the contrasts, the averages of squared lengths
of the confidence intervals. The ratio of the R to the IR averages is 0.914;
hence for the simulation, the R inference is about 9% more efficient than the
IR inference. Similar results for the traditional analyses are reported in Rao
et al. (1993).

5.3.5 Simulation Study of Other Score Functions


Besides the large outlier, there are six other potential outliers in the Cobb
data. This quantity of outliers suggests the use of score functions more suitable than the Wilcoxon score function for very heavy-tailed error structure. To investigate this, we turned to the family of Winsorized Wilcoxon
score functions. Recall that this family was discussed for skewed data in Ex-
ample 2.5.1. Here, though, asymmetry does not appear to be warranted. We
selected the score function which is linear over the interval (0.2, 0.8), i.e.,
20% Winsorizing on both sides. We denote it by WW2 . For the parame-
ters as in Table 5.3.1, the WW2 estimates and standard errors (in paren-
theses) are: 39.16 (3.78), 10.13 (3.78), −2.26 (3.78), 2.55 (5.35), 7.68 (5.35), and
23.28 (5.35). The estimate of the scale parameter τ is 14.97 compared to the
Wilcoxon estimate which is 15.56. This indicates that an analysis based on


the WW2 fit has more precision than one based on the Wilcoxon fit.
To investigate this gain in precision, we ran a small simulation study. We
used the same model and the same correlation structure as estimated by the
Wilcoxon fit. We considered normal and contaminated normal errors, with the
percent of contamination at 20% and the relative variance of the contaminated
part at 25. For each situation 10,000 simulations were run. The AREs were
very similar for all six parameters, so we only report their averages. For the
normal situation the average ARE between the WW2 and Wilcoxon estimates
was 0.90; hence, the WW2 estimate was 10% less efficient for the normal
situation. For the contaminated normal situation, though, this average was
1.21; hence, the WW2 estimate was 20% more efficient than the Wilcoxon
estimate for the contaminated normal situation.
There are families of score functions besides the Winsorized Wilcoxon
scores. Gastwirth (1966) presents several families of score functions appro-
priate for classes of distributions with tails heavier than the exponential dis-
tribution. For certain cases, he selects a score based on a maxi-min strategy.

5.4 Arnold Transformations


In this section, we apply a linear transformation to the mixed model, (5.2.1),
and then obtain the R fits. We begin with a brief but necessary discussion of
the intercept parameter.
Write the mixed model in the long form (5.2.2), Y = 1n α + Xβ + e.
Suppose the transformation matrix is A. Multiplying both sides of the model
by A, the transformed model is of the form

Y ∗ = X∗ b + e∗ , (5.4.1)

where v∗ denotes the vector Av and the vector of parameters is b = (α, β′ )′ .


While the original model has an intercept parameter, in general, the transformed model does not. As discussed in Exercise 3.15.39 of Chapter 3, the R fit of Model (5.4.1) is actually the R fit of the model $Y^* = \widetilde{X}^* b + e^*$, where $\widetilde{X}^* = (I - H_1)X^*$ and $H_1$ is the projection matrix onto the space spanned by $1$; i.e., $\widetilde{X}^*$ is the centered design matrix based on $X^*$.
As proposed in Exercise 3.15.39, to obtain an R fit of Model (5.4.1), we
use the following algorithm:

(1) Fit the model
$$ Y^* = \alpha_1 1 + \widetilde{X}^* b + e^*. \qquad (5.4.2) $$
By fit we mean: obtain the R estimate of $b$ and then estimate $\alpha_1$ by the median of the residuals. Let $\widehat{Y}_1^*$ denote the R fit.


(2) Project $\widehat{Y}_1^*$ to the right space; i.e., obtain
$$ \widehat{Y}^* = H_{X^*} \widehat{Y}_1^*. \qquad (5.4.3) $$

(3) Solve $X^* b = \widehat{Y}^*$; i.e., our estimator is
$$ \widehat{b}^* = (X^{*\prime} X^*)^{-1} X^{*\prime} \widehat{Y}^*. \qquad (5.4.4) $$

As developed in Exercise 3.15.39, $\widehat{b}^*$ is asymptotically normal with the asymptotic representation given by (3.15.11) and asymptotic variance given by (3.15.12). We use these results in the remainder of this chapter.
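The three steps translate directly into code. The following minimal R sketch is ours, not the authors' implementation; it assumes the Rfit function rfit, whose default intercept estimate is the median of the residuals.

# Minimal sketch of steps (5.4.2)-(5.4.4): R fit of Y* = X* b + e*.
rto_fit <- function(Ystar, Xstar) {
  Xc    <- scale(Xstar, center = TRUE, scale = FALSE)   # centered design
  fit   <- Rfit::rfit(Ystar ~ Xc)                       # step (1), model (5.4.2)
  yhat1 <- fitted(fit)                                  # the R fit Yhat_1*
  H     <- Xstar %*% solve(crossprod(Xstar), t(Xstar))  # projection onto col(X*)
  yhat  <- H %*% yhat1                                  # step (2), (5.4.3)
  drop(solve(crossprod(Xstar), t(Xstar) %*% yhat))      # step (3), (5.4.4)
}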

5.4.1 R Fit Based on Arnold Transformed Data


As in the previous sections, consider an experiment done over m blocks (clus-
ters, centers), and let Yk denote the vector of nk observations for the kth
block, k = 1, . . . , m. In this section, we consider the simple mixed model of
Section 5.3. Using the notation of expression (5.3.1), Yk follows the model
Yk = α1nk + Xk β + 1nk bk + ǫk , where bk is a random effect and β denotes
the fixed effects of interest. As in Section 5.3, assume that the blocks are inde-
pendent and bk and ǫk are independent. Let ek = 1nk bk + ǫk . As in expression
(5.2.2), the long form of the model is useful, i.e., $Y = 1_n\alpha + X\beta + e$. Because there is an intercept parameter in the model, we may assume that $X$ is centered. Let $n = \sum_{k=1}^{m} n_k$ denote the total sample size. For this section, in addition we assume that for all blocks $X_k$ has full column rank $p$.
If the variances of the error variables exist, denote them by Var[bk ] = σb2
and Var[ǫkj ] = σǫ2 . In this case, the variance covariance structure for the kth
block is compound symmetric which we denote as

Var[ek ] = σ 2 Ak (ρ) = σ 2 [(1 − ρ)Ink + ρJnk ], (5.4.5)

where σ 2 = σǫ2 + σb2 , and ρ = σb2 /(σb2 + σǫ2 ).

Arnold Transformation
Arnold (Chapters 14 and 15, 1981) discusses a Helmert transformation for
these types of models for traditional (least squares) analyses for balanced
designs, i.e., all nk ’s are the same. Kloke and McKean (2010a) generalized
Arnold’s results to unbalanced designs and developed the properties of the R
fit for the transformed data. Consider the $n_k \times n_k$ orthogonal matrix
$$ \Gamma_k = \begin{bmatrix} \frac{1}{\sqrt{n_k}}\, 1_{n_k}' \\ C_k' \end{bmatrix}, \qquad (5.4.6) $$


where the columns of $C_k$ form an orthonormal basis for $1_{n_k}^{\perp}$ ($C_k' 1_{n_k} = 0$). We call $\Gamma_k$ an Arnold transformation of size $n_k$.
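Numerically, one convenient choice of $C_k$ orthonormalizes the Helmert contrasts; any orthonormal basis of $1_{n_k}^{\perp}$ serves, so the sign ambiguity in the QR factors below is harmless. A minimal hypothetical sketch:

# Minimal sketch of an Arnold transformation (5.4.6) of size nk (nk >= 2).
arnold <- function(nk) {
  Q <- qr.Q(qr(cbind(1, contr.helmert(nk))))  # orthonormalize [1, Helmert]
  rbind(rep(1, nk) / sqrt(nk),                # first row: (1/sqrt(nk)) 1'
        t(Q[, -1, drop = FALSE]))             # remaining rows: C_k'
}
# e.g., arnold(2) is (1/sqrt(2)) [1 1; 1 -1], up to row signs.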
Now, apply an Arnold transformation (AT) of size $n_k$ to the response vector for the $k$th block:
$$ Y_k^* = \Gamma_k Y_k = \begin{bmatrix} Y_{k1}^* \\ Y_{k2}^* \end{bmatrix}, $$
where the mean component is $Y_{k1}^* = \alpha^* + b_k^* + \sqrt{n_k}\, \bar{x}_k'\beta + e_{k1}^*$, the contrast component is $Y_{k2}^* = X_k^*\beta + e_{k2}^*$, and the other quantities are
$$ \bar{x}_k' = \frac{1}{n_k} 1_{n_k}' X_k, \quad e_{k1}^* = \frac{1}{\sqrt{n_k}} 1_{n_k}' e_k, \quad X_k^* = C_k' X_k, \quad e_{k2}^* = C_k' e_k = b_k C_k' 1_{n_k} + C_k' \epsilon_k = C_k' \epsilon_k. $$
In particular, note that the contrast component contains, as a linear model,
the fixed effects of interest and, moreover, it is free of the random block effect.
Furthermore, notice that all the information on β is in the contrast com-
ponent if x̄ = 0. This occurs when the experimental design is replicated at
least once in each of the blocks and the covariate does not change. Also, all of
the information on β is in the mean component if the covariates are constant
within a block. More often, however, there is information on β in both of the
components. If this is the case, then for balanced designs, one can put both
pieces back together and obtain an estimator using all of the information. For
unbalanced designs this is not possible. The approach we take is to ignore
the information in the mean component and use the contrast component for
inference.
Let $n^* = n - m$. Then the long form of the Arnold transformation is $Y_2^* = C'Y$, where $C' = \mathrm{diag}[C_1', \ldots, C_m']$. So we can model $Y_2^*$ as
$$ Y_2^* = X^*\beta + e_2^*, \qquad (5.4.7) $$
where $e_2^* = C'e$ and, provided variances exist, $\mathrm{Var}[e_2^*] = \sigma_2^2 I_{n^*}$, $\sigma_2^2 = \sigma^2(1 - \rho)$, and $X^* = C'X$.

LS Fit on Arnold Transformed Data


For the traditional least squares procedure, suppose the variances of the errors exist. Under the additional assumption of normality, the transformed errors are independent. The traditional estimator is thus the usual LS estimator
$$ \widehat{\beta}_{ATLS} = \mathrm{Argmin}\,\|y_2^* - X^*\beta\|_{LS}, $$


i.e., $\widehat{\beta}_{ATLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y_2^*$. This is the extension of Arnold's (1981) solution that was proposed by Kloke and McKean (2010a) for the unbalanced case of Model (5.4.7). As usual, estimate the intercept based on the mean of the residuals,
$$ \widehat{\alpha}_{LS} = \frac{1}{n} 1'(y - \widehat{y}) = \frac{1}{n} 1'\bigl(I_n - X(X^{*\prime}X^*)^{-1}X^{*\prime}C'\bigr)y = \bar{y}. $$
As Exercise 5.7.3 shows, the joint asymptotic distribution is
$$ \begin{bmatrix} \widehat{\alpha}_{LS} \\ \widehat{\beta}_{ATLS} \end{bmatrix} \mathrel{\dot\sim} N_{p+1}\left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & 0' \\ 0 & \sigma_2^2 (X^{*\prime}X^*)^{-1} \end{bmatrix} \right), \qquad (5.4.8) $$
where $\sigma_1^2 = (\sigma^2/n^2)\sum_{k=1}^{m}[(1-\rho)n_k + \rho n_k^2]$ and $\sigma_2^2 = \sigma^2(1-\rho)$. Notice that if inference is to be on $\beta$, then we avoid explicit estimation of $\rho$. To estimate $\sigma_2^2$ we may use $\widehat{\sigma}_2^2 = \sum_{k=1}^{m}\sum_{j=1}^{n_k} \widehat{e}_{kj}^{*2}/(n^* - p)$, where $\widehat{e}_{kj}^* = y_{kj}^* - x_{kj}^{*\prime}\widehat{\beta}$.

R Fit on Arnold Transformed Data


For the R fit of Model (5.4.7), we briefly sketch the development in Kloke
and McKean (2010a). Assume that we have selected a score function ϕ(u).
We define the Arnold transformation rank-based (ATR) estimator of $\beta$ as the regression through the origin rank estimator defined by the steps (5.4.2)-(5.4.4) of the last section; that is, the rank-based estimator is given by
$$ \widehat{\beta}_{ATR} = \mathrm{Argmin}\,\|y_2^* - X^*\beta\|_\varphi. \qquad (5.4.9) $$
The results of Section 5.2 ensure that the ATR estimates are consistent
and asymptotically normal. The reason for doing an Arnold transformation,
though, is that the transformed error variables are uncorrelated. While this
does not necessarily mean that they are independent, in the literature they
are usually treated as if they are. This is called working independence. The
asymptotic distributions discussed next are formulated under the working in-
dependence. The simulation results reported in Kloke and McKean (2010a)
support the validity of the asymptotic distributions over normal and contam-
inated normal error distributions.
Recall from the regression through the origin algorithm that the asymptotic distribution of $\widehat{\beta}_{ATR}$ depends on the choice of the estimate of the intercept $\alpha_1$. For the first case, suppose the median of the residuals is used as the estimate of the intercept ($\widehat{\alpha}_{ATR} = \mathrm{med}\{y_{2,kj}^* - x_{kj}^{*\prime}\widehat{\beta}_{ATR}\}$). Then, under working independence, the joint approximate distribution of the regression parameters is
$$ \begin{bmatrix} \widehat{\alpha}_{ATR} \\ \widehat{\beta}_{ATR} \end{bmatrix} \mathrel{\dot\sim} N_{p+1}\left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \begin{bmatrix} \sigma_s^2\tau_{s,e}^2/n & 0' \\ 0 & V \end{bmatrix} \right), \qquad (5.4.10) $$


where $V$ is given in expression (3.15.12) of Chapter 3, $\sigma_s^2 = 1 + t^*\rho_s$, $t^* = \sum_{k=1}^{m} n_k(n_k - 1)$, and $\rho_s = \mathrm{cov}[\mathrm{sgn}(e_{11}), \mathrm{sgn}(e_{12})]$.
For the second case, assume that the score function $\varphi(u)$ is odd about $1/2$; $\varphi(1-u) = -\varphi(u)$. Let $\widehat{\alpha}_{ATR}^{+}$ denote the signed-rank estimator of the intercept; see expression (3.5.31) of Chapter 3. Then, under working independence, the joint approximate distribution of the rank-based estimator is
$$ \begin{bmatrix} \widehat{\alpha}_{ATR}^{+} \\ \widehat{\beta}_{ATR} \end{bmatrix} \mathrel{\dot\sim} N_{p+1}\left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \begin{bmatrix} \sigma_s^2\tau_{s,e}^2/n & 0' \\ 0 & V \end{bmatrix} \right), \qquad (5.4.11) $$
where $V = \tau^2(X^{*\prime}X^*)^{-1}$. In comparing expressions (5.4.8) and (5.4.11), we see that the asymptotic relative efficiency (ARE) between the ATLS and the ATR estimators is the same as that of LS and R estimates in ordinary linear models. In particular, when Wilcoxon scores are used and the errors have a normal distribution, the ARE between the ATLS and ATR (Wilcoxon) is the usual 0.95. Hence, for this second case, the ATR estimators are efficiently robust.
To complete the practical inference, the scale parameters, τ and τs are
based on the distribution of e∗2kj and can be estimated as discussed in Chapter
3. From this, an inference is readily formed for the parameters of the model.
Validity of the resulting confidence intervals is confirmed in the simulation
study of Kloke and McKean (2010a). Studentized residuals are also discussed
in this article. A matrix expression such as (5.3.10) for the simple mixed model
is derived by the authors; however, unlike the situation in Section 5.3.2, some
of the necessary correlations are not straightforward to estimate. Kloke and
McKean recommend a bootstrap to estimate the standard error of a residual.
We use these in the following example.

Example and Discussion


The following example is drawn from the article of Kloke and McKean (2010a).
Although simple, this data set demonstrates some of the nice features
of Arnold’s Transformation, particularly for balanced data.
Example 5.4.1 (Milliken and Johnson Data). The data in Table 5.4.1 are
from an example found on page 260 of Milliken and Johnson (2002). Each
row represents a block of length two. There is one covariate, and each of the responses is a measurement on a different treatment.
The model for these data is
$$ Y_k = \alpha 1_2 + \Delta \begin{bmatrix} -0.5 \\ 0.5 \end{bmatrix} + \beta x_k 1_2 + \epsilon_k. $$
The Arnold transformation for this model is
$$ \Gamma_k = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. $$


Table 5.4.1: Data for Example 5.4.1


x y1 y2
23.2 60.4 76.0
26.9 59.9 76.3
29.4 64.4 77.8
22.7 63.5 75.6
30.6 80.6 94.6
36.9 75.9 96.1
17.6 53.7 62.3
28.5 66.3 81.6

Table 5.4.2: ATR and ATLS Estimates and Standard Errors for Example 5.4.1

ATR ATLS
Est SE Est SE
α 70.8 3.54 72.8 8.98
∆ −14.45 1.61 −14.45 1.19
β 1.43 0.65 1.46 0.33

The transformed responses are $Y_k^* = \Gamma_k Y_k = [Y_{k1}^*, Y_{k2}^*]'$, where
$$ Y_{k1}^* = \alpha^* + \beta^* x_k + \epsilon_{k1}^*, \qquad Y_{k2}^* = \Delta^* + \epsilon_{k2}^*, $$
$\alpha^* = \sqrt{2}\,\alpha$, $\beta^* = \sqrt{2}\,\beta$, and $\Delta^* = \frac{1}{\sqrt{2}}\Delta$. We treat the transformed errors $\epsilon_{k1}^*$, $k = 1, \ldots, m$, and $\epsilon_{k2}^*$, $k = 1, \ldots, m$, as iid. Notice that the first component is a simple linear regression model and the second component is a simple location model. For this example, we use the signed-rank estimator for both of the intercept terms. The estimates and standard errors of the parameters are given in Table 5.4.2. Kloke and McKean (2010a) plotted bootstrap Studentized residuals for the least squares and Wilcoxon fits. These plots show no serious outliers.

To demonstrate the robustness of ATR estimates in the example, Kloke


and McKean (2010a) conducted a small sensitivity analysis. They set the


second data point to $y_{12}^{(i)} = y_{11} + \Delta y$, where $\Delta y$ varied from $-30$ to $30$. Then the parameter $\Delta^{(i)}$ is estimated based on the data set with the outlier. The graph below displays the relative change of the estimate, $(\widehat{\Delta} - \widehat{\Delta}^{(i)})/\widehat{\Delta}$, as a function of $\Delta y$.
[Figure: Relative change of the estimate of ∆ as a function of ∆y; the relative change (vertical axis, roughly −0.06 to 0.04) is plotted against ∆y from −30 to 30.]

Over this range of $\Delta y$, the relative change in the ATR estimate is between $-0.042$ and $0.062$. In contrast, as the reader is asked to show in Exercise 5.7.4, the relative change in the ATLS estimate over this range is between $0.125$ and $0.394$. Hence, the relative change in the ATR estimates is small, which indicates the robustness of the ATR estimates.

5.5 General Estimating Equations (GEE)


For longitudinal data, Liang and Zeger (1986) presented an elegant, general it-
erated reweighted least squares (IRLS) fit of a generalized longitudinal model.
As we note below, their fit solves a set of general estimating equations
(GEE). Their model is more general than Model (5.2.1). Abebe, McKean,
and Kloke (2010) developed a rank-based fit of this general model which we
present in this section. While analogous to Liang and Zeger’s fit, it is robust in
response space. Further, the procedure can easily be generalized to be robust
in factor space, also. In this section, we use T to denote the transpose of a
vector or matrix.
Consider a longitudinal set of observations over m subjects. Let Yit denote
the tth response for ith subject for t = 1, 2, . . . , ni and i = 1, 2, . . . , m. Assume



that $x_{it}$ is a $p \times 1$ vector of corresponding covariates. Let $n = \sum_{i=1}^{m} n_i$ denote the total sample size. Assume that the marginal distribution of $Y_{it}$ is of the exponential class of distributions and is given by
$$ f(y_{it}) = \exp\{[y_{it}\theta_{it} - a(\theta_{it}) + b(y_{it})]\phi\}, \qquad (5.5.1) $$
where $\phi > 0$, $\theta_{it} = h(\eta_{it})$, $\eta_{it} = x_{it}^T\beta$, and $h(\cdot)$ is a specified function. Thus the mean and variance of $y_{it}$ are given by
$$ E(Y_{it}) = a'(\theta_{it}) \quad \text{and} \quad \mathrm{Var}(Y_{it}) = a''(\theta_{it})/\phi, \qquad (5.5.2) $$
where the prime denotes derivative. In this notation, the link function is $h^{-1} \circ (a')^{-1}$.
More assumptions are stated later for the theory.
Let Yi = (Yi1 , . . . , Yini )T and Xi = (xi1 , . . . , xini )T denote the ni × 1 vector
of responses and the ni × p matrix of covariates, respectively, for the ith
individual. We consider the general case where the components of the vector
of responses for the ith subject, Yi , are dependent. Let θ i = (θi1 , θi2 , . . . , θini )T ,
so that $E(Y_i) = a'(\theta_i) = (a'(\theta_{i1}), \ldots, a'(\theta_{in_i}))^T$. For an $s \times 1$ vector of unknown correlation parameters $\alpha$, let $C_i = C_i(\alpha)$ denote an $n_i \times n_i$ correlation matrix. Define the matrix
$$ V_i = A_i^{1/2} C_i(\alpha) A_i^{1/2} / \phi, \qquad (5.5.3) $$
where $A_i = \mathrm{diag}\{a''(\theta_{i1}), \ldots, a''(\theta_{in_i})\}$. The matrix $V_i$ need not be the covariance matrix of $Y_i$. In any case, we refer to $C_i$ as the working correlation matrix. For estimation, let $\widehat{V}_i$ be an estimate of $V_i$. This, in general, requires estimation of $\alpha$ and often an initial estimate of $\beta$. In general, we denote the estimator of $\alpha$ by $\widehat{\alpha}(\beta, \phi)$ to reflect its dependence on $\beta$ and $\phi$.
Liang and Zeger (1986) defined their estimate in terms of general estimat-
ing equations (GEE). Define the ni × p Hessian matrix,
$$ D_i = \frac{\partial a'(\theta_i)}{\partial \beta}, \quad i = 1, \ldots, m. \qquad (5.5.4) $$
Then their GEE estimator $\widehat{\beta}_{LS}$ is the solution to the equations
$$ \sum_{i=1}^{m} D_i^T \widehat{V}_i^{-1}[Y_i - a'(\theta_i)] = 0. \qquad (5.5.5) $$
To motivate our estimator, it is convenient to write this in terms of the Euclidean norm. Define the dispersion function
$$ D_{LS}(\beta) = \sum_{i=1}^{m} [Y_i - a'(\theta_i)]^T \widehat{V}_i^{-1} [Y_i - a'(\theta_i)] = \sum_{i=1}^{m} [\widehat{V}_i^{-1/2} Y_i - \widehat{V}_i^{-1/2} a'(\theta_i)]^T [\widehat{V}_i^{-1/2} Y_i - \widehat{V}_i^{-1/2} a'(\theta_i)] = \sum_{i=1}^{m}\sum_{t=1}^{n_i} [y_{it}^* - d_{it}(\beta)]^2, \qquad (5.5.6) $$


where $Y_i^* = \widehat{V}_i^{-1/2} Y_i = (Y_{i1}^*, \ldots, Y_{in_i}^*)^T$, $d_{it}(\beta) = c_t^T a'(\theta_i)$, and $c_t^T$ is the $t$th row of $\widehat{V}_i^{-1/2}$. The gradient of $D_{LS}(\beta)$ is
$$ \nabla D_{LS}(\beta) = -\sum_{i=1}^{m} D_i^T \widehat{V}_i^{-1}[Y_i - a'(\theta_i)]. \qquad (5.5.7) $$
Thus the solution to the GEE equations (5.5.5) also can be expressed as
$$ \widehat{\beta}_{LS} = \mathrm{Argmin}\, D_{LS}(\beta). \qquad (5.5.8) $$
From this point of view, $\widehat{\beta}_{LS}$ is a nonlinear least squares (LS) estimator. We refer to it as the GEEWL2 estimator.
Consider, then, the robust rank-based nonlinear estimators discussed in
Section 3.14. For nonnegative weights (see expression (5.5.10) below), we as-
sume for now that the score function is odd about 1/2, i.e., it satisfies expres-
sion (2.5.9). In situations where this assumption is unwarranted, we can adjust
the weights to accommodate scores appropriate for skewed error distributions;
see the discussion in Section 5.5.3.
Next consider the general model defined by expressions (5.5.1) and (5.5.2). As in the LS development, let $Y_i^* = \widehat{V}_i^{-1/2} Y_i = (Y_{i1}^*, \ldots, Y_{in_i}^*)^T$, $g_{it}(\beta) = c_t^T a'(\theta_i)$, where $c_t^T$ is the $t$th row of $\widehat{V}_i^{-1/2}$, and let $G_i = [g_{it}]$. The rank-based dispersion function is given by
$$ D_R(\beta) = \sum_{i=1}^{m}\sum_{t=1}^{n_i} \varphi[R(Y_{it}^* - g_{it}(\beta))/(n+1)]\,[Y_{it}^* - g_{it}(\beta)]. \qquad (5.5.9) $$

We next write the R estimator as a weighted LS estimator. From this representation the asymptotic theory of the R estimator can be derived. Furthermore, it naturally suggests an IRLS algorithm. Let $e_{it}(\beta) = Y_{it}^* - g_{it}(\beta)$ denote the $(i,t)$th residual and let $m(\beta) = \mathrm{med}_{(i,t)}\{e_{it}(\beta)\}$ denote the median of all the residuals. Then, because the scores sum to 0, we have the identity

$$ D_R(\beta) = \sum_{i=1}^{m}\sum_{t=1}^{n_i} \varphi[R(e_{it}(\beta))/(n+1)][e_{it}(\beta) - m(\beta)] = \sum_{i=1}^{m}\sum_{t=1}^{n_i} \frac{\varphi[R(e_{it}(\beta))/(n+1)]}{e_{it}(\beta) - m(\beta)}\,[e_{it}(\beta) - m(\beta)]^2 = \sum_{i=1}^{m}\sum_{t=1}^{n_i} w_{it}(\beta)[e_{it}(\beta) - m(\beta)]^2, \qquad (5.5.10) $$

where $w_{it}(\beta) = \varphi[R(e_{it}(\beta))/(n+1)]/[e_{it}(\beta) - m(\beta)]$ is a weight function. As usual, we take $w_{it}(\beta) = 0$ if $e_{it}(\beta) - m(\beta) = 0$. Note that by using


the median of the residuals in conjunction with property (2.5.9), the weights
are positive. To accommodate other score functions besides those that satisfy
(2.5.9), quantiles other than the median can be used; see Example 5.5.3 and
Sievers and Abebe (2004) for discussion.
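The weights in (5.5.10) are simple to compute. A minimal R sketch with Wilcoxon scores (function name hypothetical); for these odd scores, the signs of $\varphi[R(e)/(n+1)]$ and $e - m(\beta)$ agree, so the weights are nonnegative.

# Minimal sketch of the weights w_it(beta) in (5.5.10), Wilcoxon scores.
gee_weights <- function(e) {                   # e: stacked residuals e_it(beta)
  n   <- length(e)
  phi <- sqrt(12) * (rank(e) / (n + 1) - 0.5)  # phi[R(e_it)/(n+1)]
  d   <- e - median(e)                         # e_it(beta) - m(beta)
  ifelse(d == 0, 0, phi / d)                   # w_it = 0 if the denominator is 0
}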
For the initial estimator of $\beta$, we recommend the rank-based estimator of Chapter 3 based on the score function $\varphi(u)$. Denote this estimator by $\widehat{\beta}_R^{(0)}$. As estimates of the weights, we use $\widehat{w}_{it}(\widehat{\beta}_R^{(0)})$; i.e., the weight function evaluated at $\widehat{\beta}_R^{(0)}$. Expression (5.5.10) leads to the dispersion function

$$ D_R^*\bigl(\beta \mid \widehat{\beta}_R^{(0)}\bigr) = \sum_{i=1}^{m}\sum_{t=1}^{n_i} \widehat{w}_{it}\bigl(\widehat{\beta}_R^{(0)}\bigr)\Bigl[e_{it}(\beta) - m\bigl(\widehat{\beta}_R^{(0)}\bigr)\Bigr]^2 = \sum_{i=1}^{m}\sum_{t=1}^{n_i} \Bigl[\sqrt{\widehat{w}_{it}\bigl(\widehat{\beta}_R^{(0)}\bigr)}\; e_{it}(\beta) - \sqrt{\widehat{w}_{it}\bigl(\widehat{\beta}_R^{(0)}\bigr)}\; m\bigl(\widehat{\beta}_R^{(0)}\bigr)\Bigr]^2. $$

Let
$$ \widehat{\beta}_R^{(1)} = \mathrm{Argmin}\, D_R^*\bigl(\beta \mid \widehat{\beta}_R^{(0)}\bigr). \qquad (5.5.11) $$
This establishes a sequence of IRLS estimates, $\{\widehat{\beta}_R^{(k)}\}$, $k = 1, 2, \ldots$.
After some algebraic simplification, we obtain the gradient

$$ \nabla D_R^*\bigl(\beta \mid \widehat{\beta}_R^{(k)}\bigr) = -2\sum_{i=1}^{m} D_i^T \widehat{V}_i^{-1/2} \widehat{W}_i \widehat{V}_i^{-1/2}\Bigl[Y_i - a'(\theta) - m^*\bigl(\widehat{\beta}_R^{(k)}\bigr)\Bigr], \qquad (5.5.12) $$

where $m^*(\widehat{\beta}_R^{(k)}) = \widehat{V}_i^{1/2}\, m(\widehat{\beta}_R^{(k)})\, 1$, $1$ denotes an $n_i \times 1$ vector all of whose elements are 1, and $\widehat{W}_i = \mathrm{diag}\{\widehat{w}_{i1}, \ldots, \widehat{w}_{in_i}\}$ is the diagonal matrix of weights for the $i$th subject. Hence, $\widehat{\beta}_R^{(k+1)}$ satisfies the general estimating equations (GEE) given by

$$ \sum_{i=1}^{m} D_i^T \widehat{V}_i^{-1/2} \widehat{W}_i \widehat{V}_i^{-1/2}\Bigl[Y_i - a'(\theta) - m^*\bigl(\widehat{\beta}_R^{(k)}\bigr)\Bigr] = 0. \qquad (5.5.13) $$

We refer to this weighted, general estimating equations estimator as the GEEWR estimator.
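For intuition, in the special case where $g_{it}(\beta) = x_{it}^{*T}\beta$ is linear in $\beta$, one pass of (5.5.11) is a single weighted least squares step. A minimal hypothetical sketch, reusing gee_weights from the sketch above:

# Minimal sketch of one GEEWR iteration (5.5.11), linear case.
geewr_step <- function(Xstar, ystar, beta0) {
  e  <- drop(ystar - Xstar %*% beta0)   # residuals at the current estimate
  w  <- gee_weights(e)                  # weights held fixed at beta0
  yc <- ystar - median(e)               # subtract m(beta0) from the response
  lm.wfit(x = Xstar, y = yc, w = w)$coefficients
}
# Iterate geewr_step until the change in the estimate is negligible.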

5.5.1 Asymptotic Theory


Recall that both the GEEWL2 and GEEWR estimators were defined in terms
of the univariate variables Yit∗ . These of course are transformations of the
original observations by the estimates of the covariance matrix Vi and the


weight matrix $W_i$. For the theory, we need to consider similar transformed variables using the matrices $V_i$ and $W_i$, where this notation means that $V_i$ and $W_i$ are evaluated at the true parameters. For $i = 1, \ldots, m$ and $t = 1, \ldots, n_i$, let

$$ Y_i^\dagger = V_i^{-1/2} Y_i = (Y_{i1}^\dagger, \ldots, Y_{in_i}^\dagger)^T, \quad G_i^\dagger(\beta) = V_i^{-1/2} a_i'(\theta) = [g_{it}^\dagger], \quad e_{it}^\dagger = Y_{it}^\dagger - g_{it}^\dagger(\beta). \qquad (5.5.14) $$

To obtain asymptotic distribution theory for a GEE procedure, assump-


tions concerning these errors e†it must be made. Regularity conditions for the
GEEWL2 estimates are discussed in Liang and Zeger (1986). For the GEEWR
estimator, assume these conditions and, further, that the marginal pdf of $e_{it}^\dagger$
is continuous and the variance-covariance matrix given in (5.5.15) is positive
definite. Under these conditions, Abebe et al. (2010) derived the asymptotic
distribution of the GEEWR estimator. The proof involves a Taylor series ex-
pansion, as in Liang and Zeger’s (1986) proof, and the rank-based theory found
in Brunner and Denker (1994) for dependent observations. We state the result
in the next theorem.
Theorem 5.5.1. Assume that the initial estimate satisfies $\sqrt{m}(\widehat{\beta}_R^{(0)} - \beta) = O_p(1)$. Then under the above assumptions, for $k \ge 1$, $\sqrt{m}(\widehat{\beta}_R^{(k)} - \beta)$ has an asymptotic normal distribution with mean $0$ and covariance matrix

$$ \lim_{m\to\infty} m \left\{\sum_{i=1}^{m} D_i^T V_i^{-1/2} W_i V_i^{-1/2} D_i\right\}^{-1} \left\{\sum_{i=1}^{m} D_i^T V_i^{-1/2} \mathrm{Var}(\varphi_i^\dagger)\, V_i^{-1/2} D_i\right\} \left\{\sum_{i=1}^{m} D_i^T V_i^{-1/2} W_i V_i^{-1/2} D_i\right\}^{-1}, \qquad (5.5.15) $$

where $\varphi_i^\dagger$ denotes the $n_i \times 1$ vector $(\varphi[R(e_{i1}^\dagger)/(n+1)], \ldots, \varphi[R(e_{in_i}^\dagger)/(n+1)])^T$.

5.5.2 Implementation and a Monte Carlo Study


For practical use of the GEEWR estimate, the asymptotic covariance ma-
trix (5.5.15) requires estimation. This is true even in the case where percentile
bootstrap confidence intervals are employed for inference, because appropriate
standardized bootstrap estimates are generally used. We present a nonpara-
metric estimator of the covariance structure and then an approximation to it.
We compare these in a small simulation study.


Nonparametric (NP) Estimator of Covariance

The covariance structure suggests a simple moment estimator. Let $\widehat{\beta}^{(k)}$ and (for the $i$th subject) $\widehat{V}_i^{(k)}$ denote the final estimates of $\beta$ and $V_i$, respectively. Then the residuals which estimate $e_i^\dagger \equiv (e_{i1}^\dagger, \ldots, e_{in_i}^\dagger)^T$ are given by

$$ \widehat{e}_i^\dagger = \bigl[\widehat{V}_i^{(k)}\bigr]^{-1/2} Y_i - \widehat{G}_i^{(k)}\bigl(\widehat{\beta}^{(k)}\bigr), \quad i = 1, \ldots, m, \qquad (5.5.16) $$

where $\widehat{G}_i^{(k)} = \bigl[\widehat{V}_i^{(k)}\bigr]^{-1/2} a'\bigl(\widehat{\theta}^{(k)}\bigr)$ and $\widehat{\theta}_{it}^{(k)} = h\bigl(x_{it}^T\widehat{\beta}^{(k)}\bigr)$. Let $R(\widehat{e}_{it}^\dagger)$ denote the rank of $\widehat{e}_{it}^\dagger$ among $\{\widehat{e}_{i't'}^\dagger\}$, $t = 1, \ldots, n_i$; $i = 1, \ldots, m$. Let $\widehat{\varphi}_i^\dagger = (\varphi[R(\widehat{e}_{i1}^\dagger)/(n+1)], \ldots, \varphi[R(\widehat{e}_{in_i}^\dagger)/(n+1)])^T$ and let $\widehat{S}_i = \widehat{\varphi}_i^\dagger - \overline{\widehat{\varphi}_i^\dagger}\, 1_{n_i}$, where $\overline{\widehat{\varphi}_i^\dagger}$ denotes the mean of the components of $\widehat{\varphi}_i^\dagger$. Then a moment estimator of the covariance matrix (5.5.15) is that expression with $\mathrm{Var}(\varphi_i^\dagger)$ estimated by

$$ \widehat{\mathrm{Var}}(\varphi_i^\dagger) = \widehat{S}_i \widehat{S}_i^T, \qquad (5.5.17) $$

and, of course, final estimates of $D_i$ and $V_i$. We label this estimator by (NP).


Although this is a simple nonparametric estimate of the covariance structure,
in a simulation study Abebe et al. (2010) showed that this estimate often leads
to a very liberal inference. Werner and Brunner (2007) discovered this in a
corresponding rank testing problem.

Approximation (AP) of the Nonparametric Estimator


The form of the weights, though, suggests a simple approximation, which is
based on certain ideal conditions. Suppose the model is correct. Assume that
the true transformed errors are independent. Then, because the scores have
been standardized, asymptotically Var(ϕ†i ) converges to Ini , so replace it with
Ini . This is the first part of the approximation.
Next consider the weights. The functional for the weights is of the form $\varphi[F(e)]/e$. Assuming that $F(0) = 1/2$, a simple application of the Mean Value Theorem gives the approximation $\varphi[F(e)]/e \approx \varphi'[F(e)]f(e)$. The expected value of this approximation can be expressed as

$$ \tau_\varphi^{-1} = \int_{-\infty}^{\infty} \varphi'[F(t)] f^2(t)\, dt = \int_0^1 \varphi(u)\left(-\frac{f'[F^{-1}(u)]}{f[F^{-1}(u)]}\right) du, \qquad (5.5.18) $$

where the second integral is derived from the first by integration by parts
followed by a substitution. The parameter τϕ is of course the usual scale pa-
rameter for the R estimates in the linear model based on the score function
$\varphi(u)$. The second part of the approximation is to replace the weight matrix by $(1/\widehat{\tau}_\varphi)I$. We label this estimator of the covariance matrix of $\widehat{\beta}^{(k)}$ by (AP).
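Under these two simplifications the sandwich (5.5.15) collapses, since with $W_i \approx (1/\widehat{\tau}_\varphi)I$ and $\mathrm{Var}(\varphi_i^\dagger) \approx I_{n_i}$ the outer and middle factors combine to $\widehat{\tau}_\varphi^2\{\sum_i D_i^T V_i^{-1} D_i\}^{-1}$. A minimal hypothetical sketch, with Dlist and Vlist holding the per-subject final estimates of $D_i$ and $V_i$:

# Minimal sketch of the (AP) covariance estimate for the GEEWR estimator.
ap_cov <- function(Dlist, Vlist, tauhat) {
  A <- Reduce(`+`, Map(function(D, V) t(D) %*% solve(V, D), Dlist, Vlist))
  tauhat^2 * solve(A)                    # tau^2 { sum_i D' V^{-1} D }^{-1}
}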


Monte Carlo Study


We report the results of a small simulation study in Abebe et al. (2010) which
compares the estimators (NP) and (AP). It also provides empirical information
on the relative efficiency between $\widehat{\beta}^{(k)}$ and the maximum likelihood estimator
(mle) under assumed normality.
The simulated model is a randomized block design with the fixed factor at
five levels and the random (block) factor at seven levels. The distribution of the
random effect was taken to be normal. Two error distributions were considered:
a normal and a contaminated normal with the contamination rate at 20% and
ratio of the contaminated standard deviation to the noncontaminated at five.
For the normal error model, the intraclass correlation coefficient was set at
0.5. For each distribution, 10,000 simulations were run.
We consider the GEEWR estimator based on a working independence co-
variance structure. We compared it with the maximum likelihood estimator
(mle) for a randomized block design. This yields the traditional analysis used
in practice. We used the R function lme (Pinheiro et al., 2008) to compute it.
Table 5.5.1 records the results of the empirical efficiencies and empiri-
cal confidences between the GEEWR estimator and mle estimator for the
fixed effect contrasts between level 1 and the other four levels. The empiri-
cal confidence coefficients are for nominal 95% confidence intervals based on
asymptotic distribution of the GEEWR estimator using the nonparametric
(NP) estimate of the covariance structure, the approximation (AP) discussed
above, and the mle inference.
At the normal distribution, the loss in empirical efficiency of the GEEWR
estimates over the mle estimates is only about 3%; while for the contaminated
normal distribution the gain in efficiency of the GEEWR estimates over the
maximum likelihood estimates is about 200%. Hence, for these situations the
GEEWR estimator possesses robustness of efficiency. In terms of empirical
confidence coefficients, the nonparametric procedure is quite liberal. In con-
trast, the approximate procedure confidences are quite close to the nominal
confidence (95%) for the normal situation and similar to those of the mle for
the contaminated normal situation.

5.5.3 Example: Inflammatory Markers


As an example, we selected part of a study by Plaisance et al. (2007) concern-
ing the effect of a single session of high intensity aerobic exercise on inflam-
matory markers of subjects taken over time. One purpose of the study was to
see if these markers differed depending on the fitness level of the subject. Sub-
jects were placed into one of two groups (High Fitness and Moderate Fitness)
depending on the level of their peak oxygen uptake. The response we consider


Table 5.5.1: Empirical Efficiencies and Confidence Coefficients


Dist. Method Contrast
β21 β31 β41 β51

Empirical Efficiency
Norm 0.974 0.974 0.972 0.973
CN 2.065 2.102 2.050 2.055

Empirical Conf. Coeff.


Norm mle 0.916 0.915 0.914 0.914
NP 0.546 0.551 0.564 0.549
AP 0.951 0.955 0.954 0.951
CN mle 0.919 0.923 0.916 0.915
NP 0.434 0.445 0.438 0.441
AP 0.890 0.803 0.893 0.889

here is C-reactive protein (CRP). Elevated CRP levels are a marker of low-
grade chronic inflammation and may predict a higher risk for cardiovascular
disease (Ridker et al., 2002). The effect of interest is the difference in CRP
between the two groups, which we denote by θ. Hence, a one-sided hypothesis
of interest is
H0 : θ ≥ 0 versus HA : θ < 0. (5.5.19)
Out of the 21 subjects in the study, 3 were removed due to noncompliance
or incomplete information. Thus, we consider the remaining 18 individuals, 9
in each group. CRP level was obtained 24 hours and immediately prior to the
acute bout of exercise and subsequently 24, 72, and 120 hours following exer-
cise giving 90 data points in all. For the reader’s convenience, the CRP data
are displayed at the url listed in the Preface. The top left comparison boxplot
of Figure 5.5.1 shows the effect based on the raw responses. An estimate of
the effect based on the raw data is the difference in medians, which is −0.54. Note
that the responses are skewed with outliers in each group. We took the time
of measurement as a covariate. Let yi and xi denote respectively the 5 × 1
vectors of observations and times of measurements for subject i and let ci
denote his/her indicator variable for Group, i.e., its components are either 0
(for Moderate Fitness) or 1 (for High Fitness). Then our model is

$$ y_i = \alpha 1_5 + \theta c_i + \beta x_i + e_i, \quad i = 1, \ldots, 18, \qquad (5.5.20) $$


where ei denotes the vector of errors for the ith individual. We present the
results for three covariance structures of ei : working independence (WI), com-
pound symmetry (CS), and autoregressive-one (AR(1)). We fit the GEEWR
estimate for each of these covariance structures using Wilcoxon scores.

Figure 5.5.1: Plots for CRP data. [Four panels: group comparison boxplots of CRP for the High Fit and Mod. Fit groups; comparison boxplots of the residuals from the AR(1), CS, and WI fits; residual plot of the CS fit (CS residuals versus CS fit); normal q−q plot of the CS residuals.]

The error model for compound symmetry is the simple mixed model; i.e.,
ei = bi 1ni + ai , where bi is the random effect for subject i and the components
of ai are iid and independent from bi . Let σb2 and σa2 denote the variances of bi
and aij , respectively. Let σt2 = σb2 +σa2 denote the total variance and ρ = σb2 /σt2
denote the intraclass coefficient. In this case, the covariance matrix of ei is of
the form σt2 [(1−ρ)I+ρJ]. We estimated these variance component parameters
σt2 and ρ at each step of the fit of Model (5.5.20) using the robust estimators
discussed in Section 5.3.1.
The error model for the AR(1) is eij = ρ1 ei,j−1 + aij , j = 2, . . . ni , where
the aij ’s are iid, for the ith subject. The (s, t) entry in the covariance matrix


of $e_i$ is $\kappa\rho_1^{|s-t|}$, where $\kappa = \sigma_a^2/(1 - \rho_1^2)$. To estimate the covariance structure at
step $k$, for each subject, we fit this autoregressive model using the current
residuals. For each subject, we then estimate ρ1 , using the Wilcoxon regression
estimate of Chapter 3; see, also, Section 5.6 on time series. As our estimate of
ρ1 , we take the median over subjects of these Wilcoxon regression estimates.
Likewise, as our estimate of σa2 we took the median over subjects of MAD2 of
the residuals based on the AR(1) fits.
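A minimal hypothetical sketch of these working-parameter estimates, with res_list holding each subject's current residual vector and rfit (from the Rfit collection) supplying the Wilcoxon regression:

# Minimal sketch: AR(1) working parameters via medians over subjects.
est_ar1 <- function(res_list) {
  fits <- lapply(res_list, function(e)
    Rfit::rfit(e[-1] ~ e[-length(e)]))       # lag-one Wilcoxon regression
  c(rho1     = median(sapply(fits, function(f) coef(f)[2])),
    sigma2_a = median(sapply(fits, function(f) mad(residuals(f))^2)))
}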
Note that there are only 18 experimental units in this problem, nine for
each treatment. So it is a small sample problem. Accordingly, we used a boot-
strap to standardize the GEEWR estimates. Our bootstrap consisted of re-
sampling the 18 experimental units, nine from each group. This keeps the
covariance structure intact. Then for each bootstrap sample, the GEEWR es-
timate was computed and recorded. We used 3000 bootstrap samples. With
these small samples, the outliers had an effect on the bootstrap, also. Hence,
we used the MAD of the bootstrap estimates of $\theta$ as our standard error of $\widehat{\theta}$.
Table 5.5.2 summarizes the three GEEWR estimates of θ and β, along
with the estimates of the variance components for the CS and AR(1) models.
As the comparison boxplot of residuals shows in Figure 5.5.1, the three fits are
similar. The WI and AR(1) estimates of the effect θ are quite similar, including
their bootstrap standard errors. The CS estimate of θ, though, is more precise
and it is closer to the difference (based on the raw data) in medians −0.54.
The traditional fit of the simple mixed model (under CS covariance structure),
would be the maximum likelihood fit based on normality. We obtained this fit
by using the lme function in R. Its estimate of θ is −0.319 with standard error
0.297. For the hypotheses of interest (5.5.19), based on asymptotic normality,
the CS GEEWR estimate is marginally significant with p = 0.064, while the
mle estimate is insignificant with p = 0.141.

Table 5.5.2: Summary of Estimates and Bootstrap Standard Errors (BSE)


Wilcoxon Scores
COV.     θ̂        BSE     β̂         BSE      Cov. Parameters
WI       −0.291   0.293   −0.0007   0.0007   NA              NA
CS       −0.370   0.244   −0.0010   0.0007   σ̂²_a = 0.013    ρ̂ = 0.968
AR(1)    −0.303   0.297   −0.0008   0.0015   ρ̂₁ = 0.023      σ̂²_a = 0.032

Winsorized Wilcoxon Scores with Bend at 0.8
CS       −0.442   0.282   −0.008    0.0008   σ̂²_a = 0.017    ρ̂ = 0.966

Note that the residual and q − q plots of the CS GEEWR fit, bottom


plots of Figure 5.5.1, show that the error distribution is right skewed with
a heavy right tail. This suggests using scores more appropriate for skewed
error distributions than the Wilcoxon scores. We considered a simple score
from the class of Winsorized Wilcoxon scores. The Wilcoxon score function
is linear. For this data, a suitable Winsorizing score function is the piecewise
linear function, which is linear on the interval (0, c) and then constant on the
interval (c, 1). As discussed in Example 2.5.1 of Chapter 2, these scores are
optimal for a skewed distribution with a logistic left tail and an exponential
right tail. We obtained the GEEWR fit of this data using this score function
with c = 0.80, i.e., the bend is at 0.80. To insure positive weights, we used the
47th percentile as the location estimator m(β) in the definition of the weights;
see the discussion around expression (5.5.10). The computed estimates and
their bootstrap standard errors are given in the last row of Table 5.5.2 for the
compound symmetry case. The estimate of θ is −0.442 which is closer than the
Wilcoxon estimate to the difference in medians based on the raw data. Using
the bootstrap standard error, the corresponding z-test for hypotheses (5.5.19)
is −1.57 with the p-value of 0.059, which is more significant than the test
based on Wilcoxon scores. Computationally, the iterated reweighted GEEWR
algorithm remains the same except that the Wilcoxon scores are replaced by
these Winsorized Wilcoxon scores.
As a final note, the residual plot of the GEEWR fit for the compound
symmetric dependence structure also shows some heteroscedasticity. The vari-
ability of the residuals is directly proportional to the fitted values. This scalar
trend can be modeled robustly using the rank-based procedures discussed in
Exercise 3.15.39.

5.6 Time Series


A widely used model in time series analysis is the stationary autoregressive
model of order p, denoted here by AR(p). The model (with location parameter)
is typically written as
$$ X_i = \phi_0 + \phi_1 X_{i-1} + \phi_2 X_{i-2} + \cdots + \phi_p X_{i-p} + e_i = \phi_0 + Y_{i-1}'\phi + e_i, \quad i = 1, 2, \ldots, n, \qquad (5.6.1) $$
where $p \ge 1$, $Y_{i-1} = (X_{i-1}, X_{i-2}, \ldots, X_{i-p})'$, $\phi = (\phi_1, \phi_2, \ldots, \phi_p)'$, and $Y_0$ is
an observable random vector independent of e. The stationarity assumption
requires that the solutions to the following equation,
$$ x^p - \phi_1 x^{p-1} - \phi_2 x^{p-2} - \cdots - \phi_p = 0 \qquad (5.6.2) $$
lie in the interval (−1, 1); see, for example, Box and Jenkins (1970). Further-
more, assume that the components of e, ei , are iid with a cdf F (t) and a pdf


$f(t)$. For asymptotic distribution theory, we need to further assume that $F$ satisfies
$$ E(e_i) = 0 \quad \text{and} \quad E(e_i^2) = \sigma^2. \qquad (5.6.3) $$
The assumptions (5.6.1)-(5.6.3) guarantee that the process {Xi } is both causal
and invertible; e.g., Brockwell and Davis (1991). This, along with the conti-
nuity of F , imply that the various inverses appearing in the sequel exist with
probability one.
In this brief section, we are concerned with the rank-based fitting of Model
(5.6.1) using highly efficient and high breakdown R estimates. There has been
work done on rank tests for hypotheses in time series; see, for example, Hallin
and Mélard (1988) and Hallin, Jurečková, and Koul (2007).
Computationally, Model (5.6.1) is a linear model with X_i as the ith re-
sponse and Yi−1 ’s as the ith row of the design matrix. For actual computa-
tion, usually, the first and last responses are the observations Xp+1 and Xn ,
respectively. Thus, the fitting and inference methods discussed in Chapter 3
are appropriate. Note, however, that the responses are dependent and
this dependency must be taken into account for valid asymptotic theory.
As in Chapter 3, let ϕ(u) denote a general score function with scores a_ϕ(i) = ϕ[i/(n + 1)], i = 1, . . . , n. Then the rank-based estimate of φ is given by

φ̂ = Argmin D_ϕ(φ) = Argmin Σ_{i=1}^n a_ϕ[R(X_i − Y′_{i−1}φ)] (X_i − Y′_{i−1}φ) ,   (5.6.4)

where R(X_i − Y′_{i−1}φ) denotes the rank of X_i − Y′_{i−1}φ among X₁ − Y′₀φ, . . . , X_n − Y′_{n−1}φ. Koul and Saleh (1993) developed the asymptotic the-
ory for these rank-based estimates. As we note in the next paragraph, though,
because of the autoregressive model, error distributions with even moderately
heavy tails produce outliers in factor space (points of high leverage). With
this in mind, the high breakdown weighted-Wilcoxon estimates discussed in
Section 3.12 seem more appropriate. The asymptotic theory for these weighted
Wilcoxon estimates was developed by Terpstra, McKean, and Naranjo (2000,
2001). For an account of M estimates and GM estimates for the autoregressive
model see Bustos (1982), Martin and Yohai (1991), and Rousseeuw and Leroy
(1987).
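To make the computation concrete, here is a minimal base R sketch (the function name is ours, and it is only an illustration, not the ww routines) that builds the lagged design matrix and minimizes the dispersion (5.6.4) with Wilcoxon scores:

ar_rankfit <- function(x, p) {
  # Each row of embed(x, p + 1) is (X_i, X_{i-1}, ..., X_{i-p}).
  Z <- embed(x, p + 1)
  y <- Z[, 1]
  Y <- Z[, -1, drop = FALSE]
  m <- length(y)
  # Jaeckel's dispersion with Wilcoxon scores a(i) = sqrt(12)(i/(m+1) - 1/2).
  disp <- function(phi) {
    e <- y - Y %*% phi
    sum(sqrt(12) * (rank(e)/(m + 1) - 0.5) * e)
  }
  fit <- optim(rep(0, p), disp)            # Nelder-Mead minimization
  phi <- fit$par
  # Intercept: median of the residuals, as in Section 5.6.1.
  list(phi0 = median(y - Y %*% phi), phi = phi)
}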
Suppose the random errors of Model (5.6.1) have a heavy-tailed distri-
bution. In this case, by the nature of the model, outlying errors (outliers in
response space) also become outliers in factor space. For instance, if, at time
i, ei is an outlier then by the model Xi is an outlier in response space but,
at time i + 1, Xi appears in the design matrix and hence is also an outlier in
factor space. Since the outlier becomes incorporated into the model, outliers
of this form are generally “good” points of high leverage; see, e.g., page 275
of Rousseeuw and Leroy (1987). These are called innovative outliers (IO).
Another class of outliers, additive outliers (AO), occur frequently in time
series data; see Fox (1972) for discussion of both AO and IO types of outliers
(he labeled them as Type I and Type II, respectively). One way of modeling
AO and IO types of outliers is with a simple mixture distribution. Suppose
Xi follows Model (5.6.1) but we observe instead Xi∗ where

X_i* = X_i + ν_i ,   i = 1, 2, . . . , n,   (5.6.5)

and the ν_i's (not necessarily independent) follow the mixture distribution
(1 − γ)δ₀(·) + γM(·). Here, γ denotes the proportion of contamination, δ₀ is
a point mass at zero, and M is the contaminating distribution function. Note
that when γ = 0 the observed process reduces to the process Xi and, hence, for
heavy-tailed error distributions IO outliers can occur. When γ > 0, AO outliers
can occur. For example, suppose at time i, γ > 0 and the contaminating
distribution results in an outlier X_i*. Then X_i* is in the design matrix at time
i + 1, but it is X_i that appears on the right side of the model statement for time i + 1. Hence,
generally, Xi∗ is a “bad” point of high leverage. Many time series data sets
have both IO and AO outliers.
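As an illustration, the following R sketch (the parameter values are ours) simulates an AR(1) series contaminated according to (5.6.5), with the contaminating distribution M taken to be N(0, 10²):

set.seed(1)
n <- 100; gamma <- 0.05
x  <- arima.sim(list(ar = 0.6), n = n)              # underlying AR(1) process X_i
nu <- ifelse(runif(n) < gamma, rnorm(n, 0, 10), 0)  # (1 - gamma) point mass at 0, gamma draws from M
xstar <- x + nu                                     # observed contaminated series X_i*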
For the reasons cited in the last paragraph, we consider the HBR estimate
(3.12.2) of Chapter 3 which, in the notation of Model (5.6.1), is given by

φ̂_HBR = Argmin Σ_{i<j} b̂_ij |(X_i − Y′_{i−1}φ) − (X_j − Y′_{j−1}φ)| ,   (5.6.6)

where the weights b̂_ij are the robust weights given by expression (3.12.3).
Recall these weights downweight “bad” points of high leverage but not “good”
points of high leverage. Hence, based on the discussion of AO and IO outliers,
the HBR estimates seem appropriate for the autoregressive model. Other types
of weights, for autoregressive models, including the GR weights discussed in
Chapter 3, are presented in Terpstra et al. (2001).
The R collection of functions ww developed by Terpstra and McKean (2005)
can be used to compute the HBR estimates with the weights (3.12.3). In their
accompanying article, Terpstra and McKean discuss the application of ww to
an autoregressive time series data set. The package ww also computes the
diagnostics TDBETAS(W, HBR), (3.13.10), and CFITS_i(W, HBR), (3.13.12),
discussed in Chapter 3, which are used to compare the Wilcoxon and HBR
fits. We used ww for the computation in this section.

5.6.1 Asymptotic Theory


Terpstra et al. (2000, 2001) develop the asymptotic theory for the HBR esti-
mates for the autoregressive model. In this section we briefly summarize this
theory. The results are very close to those for the HBR estimates for the lin-
ear model of Chapter 3 and we will point out the similarities. The proofs,
though, use results from stochastic processes and are different; see Terpstra et
al. (2000, 2001) for the details of the proofs. Let b_ij = b(Y_{i−1}, Y_{j−1}, ê_i, ê_j) denote the weight function, which is similar to its definition in the linear model case; see expression (3.12.13). Define the terms,

γ_F(Y₀, Y₁) = ∫_{−∞}^{∞} b(Y₀, Y₁, t, t) f(t) dF(t) ,

B(u₁, u₂, e) = ∫_{−∞}^{∞} sgn(s − e) b(u₁, u₂, e, s) dF(s) ,   (5.6.7)

A_F(Y₀, Y₁, Y₂) = ∫_{−∞}^{∞} B(Y₀, Y₁, e) B(Y₀, Y₂, e) dF(e) .

Next, define

C_F = (1/2) ∫ (y₁ − y₀) γ_F(y₀, y₁) (y₁ − y₀)′ dG(y₀) dG(y₁)   (5.6.8)

and

Σ_F = ∫ (y₁ − y₀) A_F(y₀, y₁, y₂) (y₂ − y₀)′ dG(y₀) dG(y₁) dG(y₂) ,   (5.6.9)

where G is the cdf of Y₁.


We first state the asymptotic uniform linearity (AUL) and asymptotic uniform quadraticity (AUQ) results for the HBR process. For all c > 0 and ∆ ∈ ℜ^p,

sup_{‖∆‖≤c} ‖S_n(∆) − S_n(0) + 2C_F ∆‖ = o_p(1)

and

sup_{‖∆‖≤c} |D_n(∆) − Q_n(∆)| = o_p(1).
The main results of this section are summarized in the following theorem.

Theorem 5.6.1. Under regularity conditions discussed in Terpstra et al. (2000),

1. AUL and AUQ hold, where C_F is defined in (5.6.8).

2. S_n(0) →^D N(0, Σ_F), where Σ_F is defined in (5.6.9).

3. √n(φ̂_n − φ₀) →^D N(0, (1/4) C_F^{−1} Σ_F C_F^{−1}).
Note how similar the results of (2) and (3) are to Theorems 3.12.1 and
3.12.2, respectively. Terpstra et al. (2000) developed consistent, method of
moments type estimators for C_F and Σ_F. These estimates are essentially
the same as the estimates discussed in Section 3.12.6 for the HBR estimates.
For inference, we recommend that the estimates discussed in Chapter 3 be
used. Hence, we suggest using K̂_HBR of expression (3.12.33) for the estimate
of the asymptotic variance-covariance matrix of φ̂_n.
As in the linear model case, the intercept parameter φ0 cannot be estimated
directly by minimizing the rank-based pseudo-norms. As in Chapter 3, a robust
estimate of the intercept is the median of the residuals. More specifically, define
the initial residuals as follows,

ê_i = X_i − Y′_{i−1} φ̂_n ,   i = 1, 2, . . . , n.

Then a natural robust estimate of φ₀ is φ̂₀ = med_i{ê_i}. Similar to the theory of Chapter 3, the joint distribution of φ̂₀ and φ̂ is asymptotically normal; see Terpstra et al. (2001) for details.

5.6.2 Wald-Type Inference


Assume that Model (5.6.1) holds. Based on Theorem 5.6.1, a Wald-type inference can be constructed. As in Chapter 3, consider general linear hypotheses of the form

H₀: Mφ = 0 versus H_A: Mφ ≠ 0,   (5.6.10)

where M is a q × p specified matrix of rank q. Let φ̂_HBR, (5.6.6), be the HBR estimate of φ. Consider the test statistic

W² = (Mφ̂_HBR)′ [M K̂_HBR M′]^{−1} Mφ̂_HBR ,   (5.6.11)

where K̂_HBR, (3.12.33), is the estimate of the variance-covariance matrix of φ̂_HBR.
For efficiency results, consider the sequence of local alternatives Hn : Mφ =
n^{−1/2}ζ, where ζ ≠ 0. The following theorem follows easily from Theorem 5.6.1;
see Exercise 5.7.7.
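A minimal R sketch of this statistic follows; the inputs phihat and Khat stand for the HBR estimate φ̂_HBR and its estimated variance-covariance matrix K̂_HBR, which an HBR fitting routine would supply:

# Wald statistic (5.6.11) for H0: M phi = 0.
wald_W2 <- function(phihat, Khat, M) {
  Mphi <- M %*% phihat
  drop(t(Mphi) %*% solve(M %*% Khat %*% t(M)) %*% Mphi)
}
# Approximate p-value (see Theorem 5.6.2(a) below):
#   pchisq(W2, df = nrow(M), lower.tail = FALSE)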

Theorem 5.6.2. Assume the regularity conditions of Theorem 5.6.1. Then

(a) under H₀, W² →^D χ²_q;

(b) under H_n, W² →^D χ²_q(η*), with the noncentrality parameter η* = n^{−1} ζ′(M K_HBR M′)^{−1} ζ;

(c) the test statistic W² is consistent for H_A.
Example 5.6.1 (Order of an Autoregressive Series). In practice, usually the


order (value of p) of an autoregressive model is not known. Upon specifying
p and then fitting the model, a residual analysis can be performed to see how
well the model (i.e., selection of p) fits the data. If, based on residual plots, the
fit is poor then a higher order can be tried. Using the Wald test procedure, a
more formal testing algorithm can be constructed. First select a value of P of
maximal order; i.e., the residual analysis shows that the model fits well. Next,
select a value of α for the testing described next. Then the algorithm is given
by

(0) Set p = P , i = 1.

(1) While p > 0, fit Model (5.6.1) with order p.

(2) Let φ_{2,i} = (φ_{p−i+1}, . . . , φ_p)′. Then use the Wald test procedure to test
H₀: φ_{2,i} = 0 versus H_A: φ_{2,i} ≠ 0.

(3) If H0 is rejected then stop and declare p to be the order; otherwise, set
p = p − 1 and i = i + 1 and go to (1).

See Terpstra et al. (2001) for more discussion on this algorithm.
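For illustration, the following R sketch implements a slightly simplified variant of the algorithm in which, at each stage, only the leading coefficient of the current order-p fit is tested. Here fit_ar stands for any fitting routine returning the estimate phi and its covariance matrix K (e.g., an HBR fit with K̂_HBR), and wald_W2 is the sketch given after (5.6.11); all names are ours.

select_order <- function(x, P, fit_ar, alpha = 0.05) {
  for (p in P:1) {
    fit <- fit_ar(x, p)
    M <- matrix(0, 1, p); M[1, p] <- 1          # tests H0: phi_p = 0
    W2 <- wald_W2(fit$phi, fit$K, M)
    if (pchisq(W2, df = 1, lower.tail = FALSE) < alpha)
      return(p)                                  # H0 rejected: declare order p
  }
  0                                              # no lag found significant
}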

Graybill (1976) discusses an algorithm similar to the one in the last ex-
ample for selecting the order of a polynomial. Terpstra and McKean (2005)
discuss this algorithm for rank-based methods for polynomial models. In a
small simulation study, the algorithm was successful in determining the order
of the polynomial.

Example 5.6.2 (Residential Extensions Data). A widely cited example in the


robust time series literature is a monthly time series (RESX), which originated
at Bell Canada and is discussed in Rousseeuw and Leroy (1987). The series
consists of the number of telephone installations in a given region and has
two obvious outliers. The outliers are essentially attributed to bargain months
where telephone installations were free. Following other authors (e.g., Martin,
1980; Rousseeuw and Leroy, 1987), we consider the seasonally adjusted data
Xi = RESXi+12 − RESXi , i = 1, . . . , 77, where RESXi is the original data.
Historically, the stationary zero mean AR(2) has been used to model the
seasonally differenced series. An autoregressive model of at least order 2 is
clear from the plot of Xi versus Xi−2 found in the top left plot of Figure
5.6.1. There is a definite linear trend and two large outliers in the vertical
direction. Notice that these two points have become points of high leverage in
the design (the two outliers in the horizontal direction). The plot of Xi versus
X_{i−1} is quite similar. Terpstra et al. (2001) applied the algorithm for the
order of an autoregressive model discussed in Example 5.6.1 to the Wilcoxon and GR fits.
The algorithm selected p = 2 for both fits. The reader is asked to run this
algorithm for the HBR fit in Exercise 5.7.8.
Table 5.6.1 displays the estimates along with standard errors for the LS,
Wilcoxon, and HBR fits. Notice that the HBR fit differs from the LS fit. This
is clear from the top right plot of Figure 5.6.1 which shows the data overlaid
with the LS (dashed line) and HBR (solid line) fits. Both large outliers were
omitted from this plot to improve the resolution. The HBR fit hugs the data
much better than the LS fit. The HBR and LS estimates of φ₂ have opposite
signs. In terms of inference, the HBR estimates of both coefficients are highly
significant. For LS, only the estimate of φ₁ is significant. The
outliers have impaired the LS fit and its associated inference. The diagnostic
TDBETA between the HBR and LS fits is 258, well beyond the benchmark of
0.48.
The HBR fit also differs from the Wilcoxon fit; the diagnostic for this comparison is
TDBETA = 233. As the casewise plot of Figure 5.6.1 shows, the two fits differ
at many cases. The Wilcoxon fit differs somewhat from the LS fit (TDBETA =
11.7). The final plot of Figure 5.6.1 shows the Studentized residuals of the
HBR fit versus Cases. The two large outliers are clear, along with a few others.
But the remainder of the points fall within the benchmarks. For this data, the
HBR fit performed the best.

The Studentized residuals discussed in the last example were those dis-
cussed in Chapter 3 for the HBR fit, expression (3.12.41); see Terpstra, McKean, and Anderson (2003) for Studentized residuals for the traditional and
robust fits for the AR(1) model.

Table 5.6.1: LS, Wilcoxon and HBR Estimates and Standard Errors for the
Residential Data of Example 5.6.2
Procedure   φ̂₁      s.e.(φ̂₁)   φ̂₂       s.e.(φ̂₂)
LS          0.473   0.116      −0.166   0.116
Wil         0.503   0.029      −0.151   0.029
HBR         0.413   0.069       0.290   0.076

Figure 5.6.1: Plots for Example 5.6.2: the seasonally adjusted data versus the lag-2 data (top left); the seasonally adjusted data with the LS and HBR fits, the two large outliers omitted to improve the resolution (top right); CFIT between the HBR and Wilcoxon fits versus case, TD = 233 (bottom left); HBR Studentized residuals versus case (bottom right).

5.6.3 Linear Models with Autoregressive Errors

A model that is often used in practice consists of a linear model where the
random errors follow a time series. This may occur when the responses are
systematically collected, say, over time. One such model is a two-phase (AB)
intervention experiment on a subject. Here the responses are collected over
time; the A phase of the responses for the subject falls before the intervention
while the B phase of his/her responses falls after the intervention. A common
design matrix for this experiment is a first order design allowing for differ-
ent intercepts and slopes in the phases; see Huitema and McKean (2000) for
discussion. Since the data are collected over time on the same subject, an au-
toregressive model for the random errors is often assumed. The general mixed
model of Section 5.2 is also of this type when the observations in a cluster are
taken over time, such as in a repeated measures design. In such cases, we may
want to model the random errors with an autoregressive model. These types of
models differ from the autoregressive model (5.6.1) discussed at the beginning
of this section in two aspects: firstly, the parameters of interest are those of
the linear model not those of the time series and, secondly, the series is often
quite short. Type AB intervention experiments on a single subject may only
be of length five for each phase. Likewise in a repeated measures design there
may be just a few measurements per subject.

For discussion, suppose the data (Y₁, x₁), . . . , (Y_n, x_n) follow the linear model

Y_i = α + x′_i β + e_i ,   where   (5.6.12)

e_i = Σ_{j=1}^k φ_j e_{i−j} + a_i ,   i = 1, 2, . . . , n,   (5.6.13)

and xi is a p × 1 vector of covariates for the ith response, k is the order of the
autoregressive model, and the ai ’s are iid (white noise). For many real data
situations k is quite small, often k = 1, i.e., an AR(1). One way of proceeding
is to fit the linear model, (5.6.12), and obtain the residuals from the fit. For
our discussion, assume a robust fit is used, say, the Wilcoxon fit. Let ê_i denote
the residuals based on this fit. In practice, diagnostics are then run on these
residuals examining them for time series trends. If the check is negative then
usually one proceeds with the linear model analysis. If it is positive then other
fitting methods are used. We discuss these two aspects from a robust point of
view.
A simple diagnostic plot for systematic dependence consists of the resid-
uals versus time order. There are general tests for dependence, including the
nonparametric runs tests. For this test, runs of positive and negative residuals
(in time order) are obtained and measured against what is expected under
independence. Huitema and McKean (1996), though, found that the runs test
based on residuals had very poor small sample properties for the AB interven-
tion designs that they considered. On the other hand, diagnostic tests designed
for specific dependent alternatives, such as the Durbin-Watson test, were valid.
With the autoregressive errors in mind, there are specific diagnostic tools
to use on the residuals. Simple diagnostic plots, lag plots, consist of the
scatter plots of ê_i versus ê_{i−j}, j = 1, . . . , k. Linear patterns are indicative of
an autoregressive model. For traditional methods, a common test for an AR(1)
model on the errors of a linear model is based on the Durbin-Watson statistic given by

d = [Σ_{t=2}^n (ẽ_t − ẽ_{t−1})²] / [Σ_{t=1}^n ẽ_t²] = 2 − (ẽ₁² + ẽ_n²)/Σ_{t=1}^n ẽ_t² − 2r₁ ,   (5.6.14)

where

r₁ = [Σ_{t=2}^n ẽ_t ẽ_{t−1}] / [Σ_{t=1}^n ẽ_t²]   (5.6.15)

and ẽ_t denotes the tth LS residual. The null (errors are iid) distribution depends on the design matrix, so often approximate critical values are used. By the far right expression of (5.6.14), the statistic d is a function of r₁. This suggests another test statistic based on r₁ given by

h = (r₁ + [(p + 1)/n]) / √{(n − 2)²/[(n − 1)n²]} ;   (5.6.16)
see Huitema and McKean (2000). The associated test is based on standard
normal critical values. In the article by Huitema and McKean, this test per-
formed as well as the Durbin-Watson tests in terms of power and validity over
AB type designs. The factor (p + 1)/n in the formula for h is a bias correction.
Provided an AR(1) is used to model the errors, r1 is the LS estimate of φ1 ;
however, in the test statistic h it is standardized under the null hypothesis
of independence. This suggests using the robust analog; i.e., using the HBR
estimate of the AR(1) model (based on the R residuals), standardized
under the null hypothesis, as a diagnostic test statistic.
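The following R sketch (the function name is ours) computes d, r₁, and h from a vector of residuals e; p denotes the number of regression parameters in the fitted linear model:

ar1_diagnostics <- function(e, p) {
  n  <- length(e)
  r1 <- sum(e[-1] * e[-n]) / sum(e^2)            # lag-one statistic (5.6.15)
  d  <- sum(diff(e)^2) / sum(e^2)                # Durbin-Watson statistic (5.6.14)
  h  <- (r1 + (p + 1)/n) / sqrt((n - 2)^2 / ((n - 1) * n^2))   # (5.6.16)
  c(r1 = r1, d = d, h = h, p.value = 2 * pnorm(-abs(h)))       # standard normal reference
}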
If dependence is diagnosed, there are several traditional fitting procedures
to fit the linear model. Several methods make use of transformations based
on estimates of the dependent structure. The reader is cautioned, though,
that this can lead to very liberal inference; see, for instance, the study by
Huitema et al. (1999). The problem appears to be the bias in the estimates.
McKnight et al. (2000) developed a double bootstrap procedure based on a
two-stage Durbin type approach (Chapter 9 of Fuller, 1996), for autoregressive
errors. The first bootstrap corrects the bias of the estimates of the
autocorrelation coefficients while the second bootstrap yields a valid inference
for the regression parameters of the linear model, (5.6.12). Robust analogs of
these traditional methods are currently being investigated.

5.7 Exercises
5.7.1. Assume the simple mixed model (5.3.1). Show that expression (5.3.2)
is true.
5.7.2. Obtain the ARE between the R and traditional estimates found in
expression (5.3.4), for Wilcoxon scores when the random error vector has a
multivariate normal distribution.
5.7.3. Show that the asymptotic distribution of the LS estimator for the
Arnold transformed model is given by expression (5.4.8).
5.7.4. Consider Example 5.4.1.
(a) Verify the ATR and ATLS estimates in Table 5.4.2.

(b) Over the range of ∆y used in the example, verify the relative changes in
the ATR and ATLS estimates as shown in the example.
5.7.5. Consider the discussion of test statistics around expression (5.2.13).
Explore the asymptotic distributions of the drop in dispersion and aligned
rank test statistics under the null and contiguous alternatives for the general
mixed model.

5.7.6. Continuing with the last exercise, suppose that the simple mixed model
(5.3.1) is true. Suppose further that the design is centered within each block;
i.e., X′_k 1_{n_k} = 0_p. For example, this is true for an ANOVA design in which
all subjects have all treatment combinations such as the Plasma Example of
Section 4.

(a) Under this assumption, show that expression (5.3.2) simplifies to V_ϕ = τ_ϕ²(1 − ρ_ϕ)(X′X)^{−1}.

(b) Show that the noncentrality parameter η, (5.2.14), simplifies to

η = [τ_ϕ²(1 − ρ_ϕ)]^{−1} (Mβ)′[M(X′X)^{−1}M′]^{−1} Mβ.

(c) Consider as a test statistic the standardized version of the reduction in dispersion,

F_{RD,ϕ} = (RD_ϕ/q) / [(1 − ρ̂_ϕ)(τ̂_ϕ/2)].

Show that under the null hypothesis H₀, qF_{RD,ϕ} →^D χ²(q) and that under the sequence of alternatives H_{An}, qF_{RD,ϕ} →^D χ²(q, η), where the noncentrality parameter η is given in Part (b).

(d) Show that F_{W,ϕ}, (5.2.13), and F_{RD,ϕ} are asymptotically equivalent under the null and local alternative models.

(e) Explore the asymptotic distribution of the aligned rank test under the
conditions of this exercise.

5.7.7. Prove Theorem 5.6.2.

5.7.8. Consider the residential extensions data discussed in Example 5.6.2.

(a) Apply the algorithm for the order of an autoregressive model discussed in Example 5.6.1 to the HBR fit.

(b) Replace the two large outliers in the data set with their predicted HBR
fits. Run the Wilcoxon and HBR fits of the changed data set. Obtain
the diagnostics TDBETA and CFIT.


Chapter 6

Multivariate

6.1 Multivariate Location Model


We now consider a statistical model in which we observe vectors of observa-
tions. For example, we may record both the SAT verbal and math scores on
students. We then wish to investigate the bivariate distribution of scores. We
may wish to test the hypothesis that the vector of population locations has
changed over time or to estimate the vector of locations. The framework in
which we carry out the statistical inference is the multivariate location model
which is similar to the location model of Chapter 1.
For simplicity and convenience, we often discuss the bivariate case. The
k-dimensional results are usually obtained by obvious changes in notation.
Suppose that X₁, . . . , X_n are iid random vectors with X_i^T = (X_{i1}, X_{i2}). In
this chapter, T denotes transpose and we reserve prime for differentiation. We
assume that X has an absolutely continuous distribution with cdf F (s −θ1 , t−
θ2 ) and pdf f (s − θ1 , t − θ2 ). We also assume that the marginal distributions
are absolutely continuous. The vector θ = (θ1 , θ2 )T is the location vector.
Definition 6.1.1. Distribution models for bivariate data. Let F (s, t) be a
prototype cdf, then the underlying model is a shifted version: H(s, t) = F (s −
θ1 , t − θ2 ).
The following models are used throughout this chapter.
1. We say the distribution is symmetric when X and −X have the same
distribution or f (s, t) = f (−s, −t). This is sometimes called diagonal
symmetry. The vector (0, 0)T is the center of symmetry of F and the
location functionals all equal the center of symmetry. Unless stated oth-
erwise, we assume symmetry throughout this chapter.
2. The distribution has spherical symmetry when ΓX and X have the
same distribution where Γ is an orthogonal matrix. The pdf has the form
g(‖x‖), where ‖x‖ = (x^T x)^{1/2} is the Euclidean norm of x. The contours
of the density are circular.

3. In an elliptical model the pdf has the form |det Σ|^{−1/2} g(x^T Σ^{−1} x), where
det denotes determinant and Σ is a symmetric, positive definite matrix.
The contours of the density are ellipses.

4. A distribution is directionally symmetric if X/‖X‖ and −X/‖X‖


have the same distribution.

Note that elliptical symmetry implies symmetry which in turn implies di-
rectional symmetry. In an elliptical model, the contours of the density are
elliptical and if Σ is the identity matrix then we have a spherically symmet-
ric distribution. An elliptical distribution can be transformed into a spherical
one by a transformation of the form Y = DX where D is a nonsingular ma-
trix. Along with various models, we encounter various transformations in this
chapter. The following definition summarizes the transformations.

Definition 6.1.2. Data transformations.


(a) Y = ΓX is an orthogonal transformation when the matrix Γ is orthog-
onal. These transformations include rotations and reflections of the data.
(b) Y = AX + b is called an affine transformation when A is a nonsingular
matrix and b is any vector of real numbers.
(c) When the matrix A in (b) is diagonal, we have a special affine transfor-
mation called a scale and location transformation.
(d) Suppose t(X) represents one of the above transformations of the data. Let θ̂(t(X)) denote the estimator computed from the transformed data. Then we say the estimator is equivariant if θ̂(t(X)) = t(θ̂(X)). Let V(t(X)) denote a test statistic computed from the transformed data. We say the test statistic is invariant when V(t(X)) = V(X).

Recall that Hotelling's T² statistic is given by

T² = n(X̄ − µ)^T S^{−1} (X̄ − µ),

where S is the sample covariance matrix. In Exercise 6.8.1, the reader is asked
to show that the vector of sample means is affine equivariant and Hotelling’s
T 2 test statistic is affine invariant.
As in the earlier chapters, we begin with a criterion function or with a set
of estimating equations. To fix the ideas, suppose that we wish to estimate
θ or test the hypothesis H0 : θ = 0 and we are given a pair of estimating
equations:

S(θ) = (S₁(θ), S₂(θ))^T = 0 ;   (6.1.1)

see expressions (6.1.3)-(6.1.5) for examples of three criterion functions. We


now list the usual set of assumptions that we have been using throughout the
book. These assumptions guarantee that the estimating equations are Pitman
regular in the sense of Definition 1.5.3 so that we can define the estimate and
test and develop the necessary asymptotic distribution theory. It is convenient
to suppose that the true value of θ is 0 which we can do without loss of
generality.

Definition 6.1.3. We say that the multivariate process S(θ) is Pitman Regular if the following conditions hold:

(a) The components of S(θ) should be nonincreasing functions of θ₁ and θ₂.

(b) E₀(S(0)) = 0.

(c) (1/√n) S(0) →^{D₀} Z ∼ N₂(0, A).

(d) sup_{‖b‖≤B} ‖ (1/√n) S((1/√n)b) − (1/√n) S(0) + Bb ‖ →^P 0.

The matrix A in (c) is the asymptotic covariance matrix of (1/√n)S(0) and the matrix B in (d) can be computed in various ways, depending on when differentiation and expectation can be interchanged. We list the various computations of B for completeness. Note that ▽ denotes differentiation with respect to the components of θ:

B = −E₀[ (1/n) ▽S(θ) ]|_{θ=0}
  = (1/n) ▽E_θ[S(0)]|_{θ=0}
  = E₀[(−▽ log f(X)) Ψ^T(X)] ,   (6.1.2)

where ▽ log f(X) denotes the vector of partial derivatives of log f(X) and Ψ(·) is such that

(1/√n) S(θ) = (1/√n) Σ_{i=1}^n Ψ(X_i − θ) + o_p(1).

Brown (1985) proved a multivariate counterpart to Theorem 1.5.6. We state


it next and refer the reader to the paper for the proof.

Theorem 6.1.1. Suppose conditions (a)-(c) of Definition 6.1.3 hold. Suppose further that B is given by the second expression in (6.1.2) and is positive definite. If, for any b,

trace{ n cov[ (1/n) S((1/√n)b) − (1/n) S(0) ] } → 0,

then (d) of Definition 6.1.3 also holds.
The estimate of θ is, of course, the solution of the estimating equations, denoted θ̂. Conditions (a) and (b) make this reasonable. To test the hypothesis H₀: θ = 0 versus H_A: θ ≠ 0, we reject the null hypothesis when (1/n) S^T(0) Â^{−1} S(0) ≥ χ²_α(2), where χ²_α(2) is the upper α percentile of a chisquare distribution with 2 degrees of freedom. Note that Â → A, in probability, and typically Â is a simple moment estimator of A. Condition (c) implies that this is an asymptotically size α test.
With condition (d) we can determine the asymptotic distribution of the
estimate and the asymptotic local power of the test; hence, asymptotic effi-
ciencies can be computed. We can determine the quantity that corresponds
to the efficacy in the univariate case described in Section 1.5.2 of Chapter 1.
We do this next before discussing specific estimating equations. The following
proposition follows at once from the assumptions.
Theorem 6.1.2. Suppose conditions (a)-(d) in Definition 6.1.3 are satisfied, θ = 0 is the true parameter value, and θ_n = γ/√n for some fixed vector γ. Further, θ̂ is the solution of the estimating equations. Then

1. √n θ̂ = B^{−1} (1/√n) S(0) + o_p(1) →^{D₀} Z ∼ MVN(0, B^{−1}AB^{−1}),

2. (1/n) S^T(0) A^{−1} S(0) →^{D_{θ_n}} χ²(2, γ^T B A^{−1} B γ),

where χ²(a, b) is noncentral chisquare with a degrees of freedom and noncentrality parameter b.
Proof: Part 1 follows immediately from condition (d) and letting θ_n = θ̂ → 0 in probability; see Theorem 1.5.7. Part 2 follows by observing (see Theorem 1.5.8) that

P_{θ_n}[ (1/n) S^T(0) A^{−1} S(0) ≤ t ] = P₀[ (1/n) S^T(−γ/√n) A^{−1} S(−γ/√n) ≤ t ]

and from (d),

(1/√n) S(−γ/√n) = (1/√n) S(0) + Bγ + o_p(1) →^{D₀} Z ∼ MVN(Bγ, A).

Hence, we have a noncentral chisquare limiting distribution for the quadratic form. Note that the influence function of θ̂ is Ω(x) = B^{−1}Ψ(x) and we say θ̂ has bounded influence provided ‖Ω(x)‖ is bounded.

Definition 6.1.4. The estimation efficiency of a bivariate estimator can be measured using the Wilk's generalized variance, defined to be the determinant of the covariance matrix of the estimator: σ₁²σ₂²(1 − ρ₁₂²), where ((ρ_ij σ_i σ_j)) is the covariance matrix of the bivariate vector of estimates. The estimation efficiency of θ̂₁ relative to θ̂₂ is the square root of the reciprocal ratio of the generalized variances.
This means that the asymptotic covariance matrix given by B−1 AB−1 of
the more efficient estimator is “small” in the sense of generalized variance. See
Bickel (1964) for further discussion of efficiency in the multivariate case.
Definition 6.1.5. When comparing two tests based on S1 and S2 , since the
asymptotic local power is an increasing function of the noncentrality parame-
ter, we define the test efficiency as the ratio of the respective noncentrality
parameters.
In the bivariate case, we have γ^T B₁A₁^{−1}B₁γ divided by γ^T B₂A₂^{−1}B₂γ and,
unlike the estimation case, the test efficiency may depend on the direction γ
along which we approach the origin; see Theorem 6.1.2. Hence, we note that,
unlike the univariate case, the testing and estimation efficiencies are not nec-
essarily equal. Bickel (1965) shows that the ratio of noncentrality parameters
can be interpreted as the limiting ratio of sample sizes needed for the same
asymptotic level and same asymptotic power along the same sequence of al-
ternatives, as in the Pitman efficiency used throughout this book. We can see
that BA−1 B should be “large” just as B−1 AB−1 should be “small.” In the
next section we consider how to set up the estimating equations and consider
what sort of estimates and tests result. We will be in a position to compute
the efficiency of the estimates and tests relative to the traditional least squares
estimates and tests. First we list three important criterion functions and their
associated estimating equations (other criterion functions are introduced in
later sections).
D₁(θ) = √{ Σ_{i=1}^n [(x_{i1} − θ₁)² + (x_{i2} − θ₂)²] }   (6.1.3)

D₂(θ) = Σ_{i=1}^n √{(x_{i1} − θ₁)² + (x_{i2} − θ₂)²}   (6.1.4)

D₃(θ) = Σ_{i=1}^n {|x_{i1} − θ₁| + |x_{i2} − θ₂|} .   (6.1.5)

The first criterion function generates the vector of means, the L2 or least
squares estimates. The other two criterion functions generate different versions
of what may be considered L1 estimates or bivariate medians. The two types

of medians differ in their equivariance properties. See Small (1990) for an ex-
cellent review of multidimensional medians. The vector of means is equivariant
under affine transformations of the data; see Exercise 6.8.1. In each of these
criterion functions we have pushed the square root operation deeper into the
expression. As we see, this produces very different types of estimates. We now
take the gradients of these criterion functions and display the corresponding
estimating functions. The computation of these gradients is given in Exercise
6.8.2.
S₁(θ) = [D₁(θ)]^{−1} ( Σ(x_{i1} − θ₁), Σ(x_{i2} − θ₂) )^T   (6.1.6)

S₂(θ) = Σ_{i=1}^n ‖x_i − θ‖^{−1} ( x_{i1} − θ₁, x_{i2} − θ₂ )^T   (6.1.7)

S₃(θ) = ( Σ sgn(x_{i1} − θ₁), Σ sgn(x_{i2} − θ₂) )^T .   (6.1.8)
In (6.1.8) if the vector is zero, then we take the term in the summation to
be zero also. In Exercise 6.8.3 the reader is asked to verify that S2 (θ) = S3 (θ)
in the univariate case; hence, we already see something new in the structure of
the bivariate location model over the univariate location model. On the other
hand, S1 (θ) and S3 (θ) are componentwise equations unlike S2 (θ) in which the
two components are entangled. The solution to (6.1.8) is the vector of medians,
and the solution to (6.1.7) is the spatial median which is discussed in Section
6.3. We begin with an analysis of componentwise estimating equations and
then consider other types.
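In R, for a bivariate data matrix X (rows x_i) and a candidate location theta, the three criterion functions can be written directly; the following is a sketch with our own naming:

D1 <- function(X, theta) sqrt(sum(sweep(X, 2, theta)^2))          # (6.1.3)
D2 <- function(X, theta) sum(sqrt(rowSums(sweep(X, 2, theta)^2))) # (6.1.4)
D3 <- function(X, theta) sum(abs(sweep(X, 2, theta)))             # (6.1.5)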
Sections 6.2.3 through 6.4.4 deal with one sample estimates and tests based
on vector signs and ranks. Both rotational and affine invariant/equivariant
methods are developed. Two and several sample models are treated in Section
6.6 as examples of location models. In Section 6.6 we are primarily concerned
with componentwise methods.

6.2 Componentwise Methods


Note that S₁(θ) and S₃(θ) are of the general form

S(θ) = ( Σψ(x_{i1} − θ₁), Σψ(x_{i2} − θ₂) )^T ,   (6.2.1)

where ψ(t) = t or sgn(t) for (6.1.6) and (6.1.8), respectively. We need to find the matrices A and B in Definition 6.1.3. It is straightforward to verify that, when the true value of θ is 0,

A = ( Eψ²(X₁₁)          Eψ(X₁₁)ψ(X₁₂) )
    ( Eψ(X₁₁)ψ(X₁₂)     Eψ²(X₁₂)      ) ,   (6.2.2)

and, from (6.1.2),

B = ( Eψ′(X₁₁)   0         )
    ( 0          Eψ′(X₁₂)  ) .   (6.2.3)
Provided that A is positive definite, the multivariate central limit theorem
implies that condition (c) in Definition 6.1.3 is satisfied for the component-
wise estimating functions. In the case that ψ(t) = sgn(t), we use the second
representation in (6.1.2). The estimating functions in (6.2.1) are examples of
M-estimating functions; see Maronna, Martin, and Yohai (2006).
Example 6.2.1 (Pulmonary Measurements on Workers Exposed to Cotton
Dust). In this example we extend the discussion to k = 3 dimensions. The
data consists of n = 12 trivariate (k = 3) observations on workers exposed
to cotton dust. The responses are the changes in measurements of pulmonary
functions: FVC (forced vital capacity), FEV3 (forced expiratory volume), and
CC (closing capacity). The data are presented in Merchant et al. (1975) and
are also displayed at the url listed in the Preface.
Let θ^T = (θ₁, θ₂, θ₃) and consider H₀: θ = 0 versus H_A: θ ≠ 0. First we compute the componentwise sign test. In (6.2.1) take ψ(x) = sgn(x); then n^{−1/2}S₃^T = n^{−1/2}(−6, −6, 2) and the estimate of A = Cov(n^{−1/2}S₃) is Â given by

       ( n                          Σ sgn(x_{i1})sgn(x_{i2})   Σ sgn(x_{i1})sgn(x_{i3}) )            ( 12    8   −4 )
(1/n)  ( Σ sgn(x_{i1})sgn(x_{i2})   n                          Σ sgn(x_{i2})sgn(x_{i3}) )  =  (1/12) (  8   12    0 )
       ( Σ sgn(x_{i1})sgn(x_{i3})   Σ sgn(x_{i2})sgn(x_{i3})   n                        )            ( −4    0   12 ),

where the diagonal elements are Σ_i sgn²(X_{is}) = n and the off-diagonal elements are values of the statistics Σ_i sgn(X_{is})sgn(X_{it}). Hence, the test statistic is n^{−1}S₃^T Â^{−1} S₃ = 3.667, and using χ²(3), the approximate p-value is 0.299; see Section 6.2.2.

We can also consider the finite sample conditional distribution in which sign changes are generated with a binomial with n = 12 and p = .5; see the discussion in Section 6.2.2. Again note that the signs of all components of the observation vector are either changed or not. The matrix Â remains unchanged so it is simple to generate many values of n^{−1}S₃^T Â^{−1} S₃. Out of 2500 values we found 704 greater than or equal to 3.667; hence, the randomization or sign change p-value is approximately 704/2500 = 0.282, quite close to the asymptotic approximation. At any rate, we fail to reject H₀: θ = 0 at any reasonable level. Further, Hotelling's T² = n X̄^T Σ̂^{−1} X̄ = 14.02 with a p-value of 0.051, based on the F-distribution for [(n − p)/((n − 1)p)]T² with 3 and 9 degrees of freedom. Hence, Hotelling's T² is significant at approximately 0.05.
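The computations of this example, including the sign change randomization, are simple to carry out; the following R sketch (our own naming) does so for an n × k data matrix X:

sign_test <- function(X, nsim = 2500) {
  n  <- nrow(X)
  S  <- sign(X)
  S3 <- colSums(S)
  A  <- crossprod(S)/n                        # A-hat = (1/n) sum of s_i s_i^T
  W  <- drop(t(S3) %*% solve(A) %*% S3)/n     # n^{-1} S3^T A-hat^{-1} S3
  # Sign changes attach to entire observation vectors; A-hat is unchanged.
  Wsim <- replicate(nsim, {
    Sz <- colSums(sample(c(-1, 1), n, replace = TRUE) * S)
    drop(t(Sz) %*% solve(A) %*% Sz)/n
  })
  c(W = W, p.asymp = pchisq(W, df = ncol(X), lower.tail = FALSE),
    p.sign.change = mean(Wsim >= W))
}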
Figure 6.2.1: Panel A: Boxplots of the changes in pulmonary function for the cotton dust data; the responses have been standardized by componentwise standard deviations. Panel B: Normal q−q plot for the component CC, original scale.

Figure 6.2.1, Panel A, provides boxplots for each component. These boxplots suggest that any differences are due to the upward shift in the CC distribution. The normal q−q plot of the component CC, Panel B, shows two
outlying values on the right side. The plots (not shown) for the other two components exhibit no outliers. In the case of the componentwise Wilcoxon test, Section 6.2.3, we consider (n + 1)S₄(0) in (6.2.14) along with (n + 1)²A, essentially in (6.2.15). For this data (n + 1)S₄^T(0) = (−63, −52, 28) and

(n + 1)²Â = (1/n) (   649    620.5   −260.5 )
                  (  620.5   649.5   −141.5 )
                  ( −260.5  −141.5     650  ) .

The diagonal elements are Σ_i R²(|X_{is}|), which should be Σ_i i² = 650 but differ for the first two components due to ties among the absolute values. The off-diagonal elements are Σ_i R(|X_{is}|)R(|X_{it}|)sgn(X_{is})sgn(X_{it}). The test statistic is then n^{−1}S₄^T(0)Â^{−1}S₄(0) = 7.82. From the χ²(3) distribution, the approximate p-value is 0.0498. Hence, the Wilcoxon test rejects the null hypothesis at essentially the same level as Hotelling's T² test.
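A corresponding R sketch of the componentwise signed-rank Wilcoxon test (our own naming; average ranks are used for ties among the absolute values):

wilcoxon_test <- function(X) {
  n  <- nrow(X)
  SR <- apply(X, 2, function(x) rank(abs(x)) * sign(x))/(n + 1)  # signed ranks / (n+1)
  S4 <- colSums(SR)                          # S4(0) of (6.2.14)
  A  <- crossprod(SR)/n                      # conditional estimate of A
  W  <- drop(t(S4) %*% solve(A) %*% S4)/n    # n^{-1} S4^T A-hat^{-1} S4
  c(W = W, p.value = pchisq(W, df = ncol(X), lower.tail = FALSE))
}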
In the construction of tests we generally must estimate the matrix A. When testing H₀: θ = 0 the question arises as to whether or not we should center the data using θ̂. If we do not center then we are using a reduced model estimate of A; otherwise, it is a full model estimate. Reduced model estimates are generally used in randomization tests. In this case, generally, Â must only be computed once in the process of randomizing and recomputing the test statistic n^{−1}S^T Â^{−1} S. Note also that when H₀: θ = 0 is true, θ̂ →^P 0.

Hence, the centered Â is valid under H₀. When estimating the asymptotic Cov(θ̂), B^{−1}AB^{−1}, we should center Â because we no longer assume that H₀ is true.

6.2.1 Estimation
Let θ = (θ₁, θ₂)^T denote the true vector of location parameters. Then, when (6.1.2) holds, the asymptotic covariance matrix in Theorem 6.1.2 is

B^{−1}AB^{−1} = ( Eψ²(X₁₁−θ₁)/[Eψ′(X₁₁−θ₁)]²                             Eψ(X₁₁−θ₁)ψ(X₁₂−θ₂)/[Eψ′(X₁₁−θ₁)Eψ′(X₁₂−θ₂)] )
                ( Eψ(X₁₁−θ₁)ψ(X₁₂−θ₂)/[Eψ′(X₁₁−θ₁)Eψ′(X₁₂−θ₂)]          Eψ²(X₁₂−θ₂)/[Eψ′(X₁₂−θ₂)]²                   ) .   (6.2.4)

Now Theorem 6.1.2 can be applied for various M estimates to establish


asymptotic normality. Our interest is in the comparison of L2 and L1 estimates
and we now turn to that discussion. In the case of L2 estimates, corresponding
to S1 (θ), we take ψ(t) = t. In this case, θ in expression (6.2.4) is the vector of
means. Then it is easy to see that B−1 AB−1 is equal to the covariance matrix
of the underlying model, say Σf . In applications, θ is estimated by the vector
of component sample means. For the standard errors of these estimates, the
vector of componentwise sample means replaces θ in expression (6.2.4) and
the expected values are replaced by the corresponding sample moments. Then
it is easy to see that the estimate of B−1 AB−1 is equal to the traditional
sample covariance matrix.
In the first L₁ case, corresponding to S₃(θ), we take ψ(t) = sgn(t) and find, using the second representation in (6.1.2), that

B^{−1}AB^{−1} = ( 1/[4f₁²(0)]                                      E sgn(X₁₁−θ₁)sgn(X₁₂−θ₂)/[4f₁(0)f₂(0)] )
                ( E sgn(X₁₁−θ₁)sgn(X₁₂−θ₂)/[4f₁(0)f₂(0)]           1/[4f₂²(0)]                            ) ,   (6.2.5)

where f₁ and f₂ denote the marginal pdfs of the joint pdf f(s, t) and θ₁ and θ₂ denote the componentwise medians. In applications, the estimate of θ is the vector of componentwise sample medians, which we denote by (θ̂₁, θ̂₂)^T. For inference an estimate of the asymptotic covariance matrix (6.2.5) is required. An estimate of E sgn(X₁₁ − θ₁)sgn(X₁₂ − θ₂) is the simple moment estimator n^{−1} Σ sgn(x_{i1} − θ̂₁)sgn(x_{i2} − θ̂₂). The estimators discussed in Section 1.5.5, (1.5.29), can be used to estimate the scale parameters 1/(2f₁(0)) and 1/(2f₂(0)).
We now turn to the efficiency of the vector of sample medians with respect to the vector of sample means. Assume for each component that the median and mean are the same and that without loss of generality their common value is 0. Let δ = det(B^{−1}AB^{−1}) = det(A)/[det(B)]² be the Wilk's generalized variance of √n θ̂ in Definition 6.1.4. For the vector of means we have δ = σ₁²σ₂²(1 − ρ²), the determinant of the underlying variance-covariance matrix. For the vector of sample medians we have

δ = [1 − (E sgn(X₁₁)sgn(X₁₂))²] / [16 f₁²(0) f₂²(0)]

and the efficiency of the vector of medians with respect to the vector of means is given by

e(med, mean) = 4σ₁σ₂ f₁(0) f₂(0) √{ (1 − ρ²) / (1 − [E sgn(X₁₁)sgn(X₁₂)]²) } .   (6.2.6)

Note that E sgn(X₁₁)sgn(X₁₂) = 4P(X₁₁ < 0, X₁₂ < 0) − 1. When the underlying distribution is bivariate normal with means 0, variances 1, and correlation ρ, Exercise 6.8.4 shows that

P(X₁₁ < 0, X₁₂ < 0) = 1/4 + (1/(2π)) sin^{−1}ρ .   (6.2.7)

Further, the marginal distributions are standard normal; hence, (6.2.6) becomes

e(med, mean) = (2/π) √{ (1 − ρ²) / (1 − [(2/π) sin^{−1}ρ]²) } .   (6.2.8)
The first factor 2/π ≈ 0.637 is the univariate efficiency of the median relative
to the mean when the underlying distribution is normal and also the efficiency
of the vector of medians relative to the vector of means when the correlation
in the underlying model is zero. The second factor accounts for the bivariate
structure of the model and, in general, depends on the correlation ρ. Some
values of the efficiency are given in Table 6.2.1.
Clearly, as the elliptical contours of the underlying normal distribution
flatten out, the efficiency of the vector of medians decreases. This is the first
indication that the vector of medians is not affine (or even rotation) equiv-
ariant. The vector of means is affine equivariant and hence the dependency
of the efficiency on ρ must be due to the vector of medians. Indeed, Exercise
6.8.5 asks the reader to construct an example showing that when the axes are
rotated the vector of means rotates into the new vector of means while the
vector of medians fails to do so.
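The entries of Table 6.2.1 below follow directly from (6.2.8); for instance, in R:

# Efficiency (6.2.8) of the vector of medians relative to the vector of means.
eff_med_mean <- function(rho)
  (2/pi) * sqrt((1 - rho^2)/(1 - ((2/pi) * asin(rho))^2))
round(eff_med_mean(c(0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .99)), 2)
# 0.64 0.63 0.63 0.62 0.60 0.58 0.56 0.52 0.47 0.40 0.22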

Table 6.2.1: Efficiency (6.2.8) of the Vector of Medians Relative to the Vector of Means When the Underlying Distribution is Bivariate Normal

ρ     0    .1   .2   .3   .4   .5   .6   .7   .8   .9   .99
eff  .64  .63  .63  .62  .60  .58  .56  .52  .47  .40  .22

6.2.2 Testing

We now consider the properties of bivariate tests. Recall that we assume the
underlying bivariate distribution is symmetric. In addition, we would generally
use an odd ψ-function, so that ψ(t) = −ψ(−t). This implies that ψ(t) =
ψ(|t|)sgn(t) which is useful shortly.
Now referring to Theorem 6.1.2 along with the corresponding matrix A, the test of H₀: θ = 0 vs H_A: θ ≠ 0 rejects the null hypothesis when (1/n) S^T(0) A^{−1} S(0) ≥ χ²_α(2). Note that the covariance term in A is Eψ(X₁₁)ψ(X₁₂) = ∫∫ ψ(s)ψ(t) f(s, t) ds dt and it depends upon the underlying bivariate distribution f. Hence, even the sign test based on the componentwise sign statistics S₃(0) is not distribution-free under the null hypothesis as it is in the univariate case. In this case, Eψ(X₁₁)ψ(X₁₂) = 4P(X₁₁ < 0, X₁₂ <
0) − 1 as we saw in the discussion of estimation.
To make the test operational we must estimate the components of A. Since
they are expectations, we use moment estimates, under the null hypothesis.
Now condition (c) in Definition 6.1.3 guarantees that the test with the esti-
mated A is asymptotically distribution-free since it has a limiting chisquare
distribution, independent of the underlying distribution. What can we say
about finite samples?
First note that

S(0) = ( Σψ(|x_{i1}|)sgn(x_{i1}), Σψ(|x_{i2}|)sgn(x_{i2}) )^T .   (6.2.9)
Under the assumption of symmetry, (x₁, . . . , x_n) is a realization of (s₁x₁, . . . , s_nx_n) where (s₁, . . . , s_n) is a vector of independent random variables each equalling ±1 with probability 1/2, 1/2. Hence Es_i = 0 and Es_i² = 1. Conditional on (x₁, . . . , x_n) then, under the null hypothesis, there are 2^n equally likely sign combinations associated with these vectors. Note that the sign changes attach to the entire vector. From (6.2.9), we see that conditionally, the scores are not affected by the sign changes and S(0) depends on the sign changes only through the signs of the components of the observation vectors. It follows at once that the conditional mean of S(0) under the null hypothesis is 0. Further the conditional covariance matrix is given by

( Σψ²(|x_{i1}|)                                      Σψ(|x_{i1}|)ψ(|x_{i2}|)sgn(x_{i1})sgn(x_{i2}) )
( Σψ(|x_{i1}|)ψ(|x_{i2}|)sgn(x_{i1})sgn(x_{i2})      Σψ²(|x_{i2}|)                                 ) .   (6.2.10)

Note that conditionally, n^{−1} times this matrix is an estimate of the matrix A above. Thus we have a conditionally distribution-free sign change distribution.

For small to moderate n the test statistic (quadratic form) can be computed
for each combination of signs and a conditional p-value of the test is the
number of values (divided by 2^n) of the test statistic at least as large as the
observed value of the test statistic. In the first chapter on univariate methods
this argument also leads to unconditionally distribution-free tests in the case
of the univariate sign and rank tests since in those cases the signs and the
ranks do not depend on the values of the conditioning variables. Again, the
situation is different in the bivariate case due to the matrix A which must
be estimated since it depends on the unknown underlying distribution. In
Exercise 6.8.6 the reader is asked to construct the sign change distributions
for some examples.
We now turn to a more detailed analysis of the tests based on S₁ = S₁(0) and S₃ = S₃(0). Recall that S₁ is the vector of sample means. The matrix A is the covariance matrix of the underlying distribution and we take the sample covariance matrix as the natural estimate. The resulting test statistic is nX̄^T Â^{−1} X̄, which is Hotelling's T² statistic. Note for T², we typically use a centered estimate of A. If we want the randomization distribution then we use the uncentered estimate. Since BA^{−1}B = Σ_f^{−1}, where Σ_f is the covariance matrix of the underlying distribution, the asymptotic noncentrality parameter for Hotelling's test is γ^T Σ_f^{−1} γ. The vector S₃ is the vector of component sign statistics. By inverting (6.2.5) we can write down the noncentrality parameter for the bivariate componentwise sign test.
To illustrate the efficiency of the bivariate sign test relative to Hotelling’s
test we simplify the structure as follows: assume that the marginal distribu-
tions are identical. Let ξ = 4P (X11 < 0, X12 < 0) − 1 and let ρ denote the
underlying correlation, as usual. Then Hotelling's noncentrality parameter is

[1/(σ²(1 − ρ²))] γ^T ( 1  −ρ ; −ρ  1 ) γ = (γ₁² − 2ργ₁γ₂ + γ₂²) / [σ²(1 − ρ²)] .   (6.2.11)

Likewise the noncentrality parameter for the bivariate sign test is

[4f²(0)/(1 − ξ²)] γ^T ( 1  −ξ ; −ξ  1 ) γ = 4f²(0)(γ₁² − 2ξγ₁γ₂ + γ₂²) / (1 − ξ²) .   (6.2.12)

The efficiency of the bivariate sign test relative to Hotelling's test is the ratio of their respective noncentrality parameters:

[4f²(0)σ²(1 − ρ²)(γ₁² − 2ξγ₁γ₂ + γ₂²)] / [(1 − ξ²)(γ₁² − 2ργ₁γ₂ + γ₂²)] .   (6.2.13)

There are three contributing factors in this efficiency: 4f²(0)σ², which is the univariate efficiency of the sign test relative to the t-test; (1 − ρ²)/(1 − ξ²), due to the dependence structure in the bivariate distribution; and the final factor, which reflects the direction of approach of the sequence of alternatives. It is this last factor which separates the testing efficiency from the estimation efficiency. In order to see the effect of direction on the efficiency we use the following result from matrix theory; see Graybill (1983).

Table 6.2.2: Minimum and Maximum Efficiencies of the Bivariate Sign Test Relative to Hotelling's T² When the Underlying Distribution is Bivariate Normal

ρ     0    .2   .4   .6   .8   .9   .99
min  .64  .58  .52  .43  .31  .22  .07
max  .64  .68  .71  .72  .72  .71  .66

Lemma 6.2.1. Suppose D is a nonsingular, square matrix and C is any square matrix, and suppose λ₁ and λ₂ are the minimum and maximum eigenvalues of CD^{−1}. Then

λ₁ ≤ (γ^T Cγ)/(γ^T Dγ) ≤ λ₂ .
The proof of the following proposition is left as Exercise 6.8.7.

Theorem 6.2.1. The efficiency e(S₃, S₁) is bounded between the minimum and maximum of 4f²(0)σ²(1 − ρ)/(1 − ξ) and 4f²(0)σ²(1 + ρ)/(1 + ξ).

In Table 6.2.2 we give some values of the maximum and minimum efficien-
cies when the underlying distribution is bivariate normal with means 0, vari-
ances 1, and correlation ρ. This table can be compared to Table 6.2.1 which
contains the corresponding estimation efficiencies. We have f²(0) = (2π)^{−1}
and ξ = (2/π) sin^{−1}ρ. Hence, the dependence of the efficiency on direction
determined by γ is apparent. The examples involving the bivariate normal
distribution also show the superiority of the vector of means over the vec-
tor of medians and Hotelling’s test over the bivariate sign test as expected.
Bickel (1964, 1965) gives a more thorough analysis of the efficiency for general
models. He points out that when heavy-tailed models are expected then the
medians and sign test are much better provided ρ is not too close to ±1.
In the exercises the reader is asked to show that Hotelling’s T 2 statistic is
affine invariant. Thus the efficiency properties of this statistic do not depend on
ρ. This means that the bivariate sign test cannot be affine invariant; again, this
is developed in the exercises. It is now natural to inquire about the properties
of the estimate and test based on S2 . This estimating function cannot be
written in the componentwise form that we have been considering. Before we
turn to this statistic, we consider estimates and tests based on componentwise
ranking.

6.2.3 Componentwise Rank Methods


In this part we sketch the results for the vector of Wilcoxon signed-rank statis-
tics discussed in Section 1.7 for each component. See Example 6.2.1 for an
illustration of the calculations. In Section 6.6 we provide a full development of
componentwise rank-based methods for location and regression models with
examples. We let
S₄(θ) = ( Σ [R(|x_{i1} − θ₁|)/(n + 1)] sgn(x_{i1} − θ₁), Σ [R(|x_{i2} − θ₂|)/(n + 1)] sgn(x_{i2} − θ₂) )^T .   (6.2.14)

Using the projection method, Theorem 2.4.6, we have from Exercise 6.8.8, for the case θ = 0,

S₄(0) = ( ΣF₁⁺(|x_{i1}|)sgn(x_{i1}), ΣF₂⁺(|x_{i2}|)sgn(x_{i2}) )^T + o_p(1) = ( 2Σ[F₁(x_{i1}) − 1/2], 2Σ[F₂(x_{i2}) − 1/2] )^T + o_p(1),

where Fj+ is the marginal distribution of |X1j | for j = 1, 2 and Fj is the


marginal distribution of X1j for j = 1, 2; see, also, Section A.2.3 of the Ap-
pendix. Symmetry of the marginal distributions is used in the computation
of the projections. The conditions (a)-(d) of Definition 6.1.3 can now be veri-
fied for the projection and then we note that the vector of rank statistics has
the same asymptotic properties. We must identify the matrices A and B for
the purposes of constructing the quadratic form test statistic, the asymptotic
distribution of the vector of estimates, and the noncentrality parameter.
The first two conditions, (a) and (b), are easy to check since the multivari-
ate central limit theorem can be applied to the projection. Since under the
null hypothesis that θ = 0, F (Xi1 ) has a uniform distribution on (0, 1), and
introducing θ and differentiating with respect to θ1 and θ2 , the matrices A
and B are

A = ( 1/3   δ ; δ   1/3 )   and   B = ( 2∫f₁²(t) dt   0 ; 0   2∫f₂²(t) dt ) ,   (6.2.15)

where δ = 4∫∫F₁(s)F₂(t) dF(s, t) − 1. Hence, similar to the vector of sign
statistics, the vector of Wilcoxon signed rank statistics also has a covariance
which depends on the underlying bivariate distribution. We could construct a
conditionally distribution-free test but not an unconditionally distribution-free
one. Of course, the test is asymptotically distribution-free.
A consistent estimate of the parameter δ in A is given by

δ̂ = (1/n) Σ_{i=1}^n [R_{is}R_{it}/((n + 1)(n + 1))] sgn(X_{is}) sgn(X_{it}) ,   (6.2.16)

where R_{it} is the rank of |X_{it}| in the tth component among |X_{1t}|, . . . , |X_{nt}|. This estimate is the conditional covariance and can be used in estimating A in the construction of an asymptotically distribution-free test; when we estimate the asymptotic covariance matrix of θ̂ we first center the data and then compute (6.2.16).

Table 6.2.3: Efficiencies of Componentwise Wilcoxon Methods Relative to L₂ Methods When the Underlying Distribution is Bivariate Normal

ρ     0    .2   .4   .6   .8   .9   .99
min  .96  .94  .93  .91  .89  .88  .87
max  .96  .96  .97  .97  .96  .96  .96
est  .96  .96  .95  .94  .93  .92  .91
The estimator that solves S4 (θ) = 0 is the vector of Hodges-Lehmann
estimates for the two components; that is, the vector of medians of Walsh
averages for each component. Like the vector of medians, the vector of HL
estimates is not equivariant under orthogonal transformations and the test is
not invariant under these transformations. This shows up in the efficiency with
respect to the L2 methods which are an equivariant estimate and an invariant
test. Theorem 6.1.2 provides the asymptotic distribution of the estimator and
the asymptotic local power of the test.
Suppose the underlying distribution is bivariate normal with means 0,
variances 1, and correlation ρ, then the estimation and testing efficiencies are
given by
e(HL, mean) = (3/π) √{(1 − ρ²)/(1 − 9δ²)}   (6.2.17)

e(Wil, Hotel) = (3/π) [(1 − ρ²)/(1 − 9δ²)] { (γ₁² − 6δγ₁γ₂ + γ₂²)/(γ₁² − 2ργ₁γ₂ + γ₂²) } .   (6.2.18)

Exercise 6.8.9 asks the reader to apply Lemma 6.2.1 and show the testing efficiency is bounded between

3(1 + ρ) / {2π[2 − (3/π) cos^{−1}(ρ/2)]}   and   3(1 − ρ) / {2π[(3/π) cos^{−1}(ρ/2) − 1]} .   (6.2.19)

In Table 6.2.3 we provide some values of the minimum and maximum effi-
ciencies as well as estimation efficiency. Note how much more stable the rank
methods are than the sign methods. Bickel (1964) points out, however, that
when there is heavy contamination and ρ is close to ±1 the estimation effi-
ciency can be arbitrarily close to 0. Further, this efficiency can be arbitrarily
large. This behavior is due to the fact that the sign and rank methods are

not invariant and equivariant under orthogonal transformations, unlike the


L2 methods. Hence, we now turn to an analysis of the methods generated by
S2 (θ). Additional material on the componentwise methods can be found in
the papers of Bickel (1964, 1965) and the monograph by Puri and Sen (1971).
The extension of the results to dimensions higher than two is straightforward
and the formulas are obvious. One interesting question is how the efficiencies
of the sign or rank methods relative to the L2 methods depend on the dimen-
sion. See Section 6.6 and Davis and McKean (1993) for componentwise linear
model rank-based methods.
6.3 Spatial Methods

6.3.1 Spatial Sign Methods
We are now ready to consider the estimate and test generated by S₂(θ); recall (6.1.4) and (6.1.7). This estimating function cannot be written in componentwise fashion because ‖xᵢ − θ‖ appears in both components. Note that S₂(θ) = Σ‖xᵢ − θ‖⁻¹(xᵢ − θ), a sum of unit vectors, so that the estimating function depends on the data only through the directions and not on the magnitudes of xᵢ − θ, i = 1, . . . , n. The vector ‖x‖⁻¹x is also called the spatial sign of x. It generalizes the notion of univariate sign: sgn(x) = |x|⁻¹x. Hence, the
test is sometimes called the angle test or spatial sign test and the estimate
is called the spatial median; see Brown (1983). Milasevic and Ducharme
(1987) show that the spatial median is always unique, unlike the univariate
median. We see that the test is invariant under orthogonal transformations and
the estimate is equivariant under these transformations. Hence, the methods
are rotation invariant and equivariant, properties suitable for methods used
on spatial data. However, applications do not have to be confined to spatial
data and we consider these methods to be competitors to the other methods
already discussed.
Following our pattern above, we first consider the matrices A and B in Definition 6.1.3. Suppose θ = 0; then, since S₂(0) is a sum of independent random variables, condition (c) is immediate with A = E‖X‖⁻²XXᵀ, and the obvious estimate of A, under H₀, is

$$\hat{A} = \frac{1}{n}\sum_{i=1}^n \|x_i\|^{-2}x_ix_i^T\,, \qquad (6.3.1)$$

which can be used to construct the spatial sign test statistic with

$$\frac{1}{\sqrt{n}}S_2(0) \xrightarrow{D} N_2(0, A) \quad\text{and}\quad \frac{1}{n}S_2^T(0)\hat{A}^{-1}S_2(0) \xrightarrow{D} \chi^2(2)\,. \qquad (6.3.2)$$

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 393 —


i i

6.3. SPATIAL METHODS 393

In order to compute B, we first compute the partial derivatives; then we take the expectation. This yields

$$B = E\left[\frac{1}{\|X\|}\left(I - \frac{1}{\|X\|^2}XX^T\right)\right], \qquad (6.3.3)$$

where I is the identity matrix. Use a moment estimate for B similar to the estimate of A.
The spatial median is determined by

$$\hat{\theta} = \mathrm{Argmin}\sum_{i=1}^n \|x_i - \theta\| \qquad (6.3.4)$$

or as the solution to the estimating equations

$$S_2(\theta) = \sum_{i=1}^n \frac{x_i - \theta}{\|x_i - \theta\|} = 0\,. \qquad (6.3.5)$$
The R package SpatialNP provides routines to compute the spatial median. Gower (1974) calls the estimate the mediancentre and provides a Fortran program for its computation. See Bedall and Zimmerman (1979) for a program in dimensions higher than 2; for higher dimensions see also Möttönen and Oja (1995).
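The computation is simple enough to sketch directly. The following base R functions are a minimal illustration of solving (6.3.5) by a Weiszfeld-type iteration and of forming the test statistic in (6.3.2); the function names are ours, the starting value and tolerances are arbitrary choices, and the SpatialNP routines should be preferred in practice.

spatial.median <- function(X, tol = 1e-8, maxit = 200) {
  theta <- apply(X, 2, median)              # componentwise median as a start
  for (it in 1:maxit) {
    D <- sweep(X, 2, theta)                 # rows are x_i - theta
    d <- pmax(sqrt(rowSums(D^2)), 1e-12)    # ||x_i - theta||, guarded from 0
    theta.new <- colSums(X / d) / sum(1 / d)
    if (sqrt(sum((theta.new - theta)^2)) < tol) break
    theta <- theta.new
  }
  theta.new
}

spatial.sign.test <- function(X) {          # tests H0: theta = 0
  U <- X / sqrt(rowSums(X^2))               # spatial signs ||x_i||^{-1} x_i
  S2 <- colSums(U)                          # S_2(0)
  Ahat <- crossprod(U) / nrow(X)            # the estimate (6.3.1)
  stat <- drop(t(S2) %*% solve(Ahat) %*% S2) / nrow(X)
  c(statistic = stat,
    p.value = pchisq(stat, df = ncol(X), lower.tail = FALSE))
}

For k-variate data, such as the trivariate differences in Example 6.3.1 below, the statistic is referred to a chisquare distribution with k degrees of freedom.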
We have the asymptotic representation

$$\sqrt{n}\,\hat{\theta} = B^{-1}\frac{1}{\sqrt{n}}S_2(0) + o_p(1) \xrightarrow{D} N_2(0, B^{-1}AB^{-1})\,. \qquad (6.3.6)$$
Chaudhuri (1992) provides a sharper analysis for the remainder term in his Theorem 3.2. The consistency of the moment estimates of A and B is established rigorously in the linear model setting by Bai, Chen, Miao, and Rao (1990). Hence, we would use Â and B̂ computed from the residuals. Bose and Chaudhuri (1993) develop estimates of A and B that converge more quickly than the moment estimates. Bose and Chaudhuri provide a very interesting analysis of why it is easier to estimate the asymptotic covariance matrix of θ̂ than to estimate the asymptotic variance of the univariate median. Essentially, unlike the univariate case, we do not need to estimate the multivariate density at a point. It is left as an exercise to show that the estimate is equivariant and the test is invariant under orthogonal transformations of the data; see Exercise 6.8.13.

Example 6.3.1 (Cork Borings Data). We consider a well-known example due
to Rao (1948) of testing whether the weight of cork borings on trees is inde-
pendent of the directions: North, South, East, and West. In this case we have
Table 6.3.1: Weight of Cork Borings (in Centigrams) in Four Directions for 28
Trees

N E S W N E S W
72 66 76 77 91 79 100 75
60 53 66 63 56 68 47 50
56 57 64 58 79 65 70 61
41 29 36 38 81 80 68 58
32 32 35 36 78 55 67 60
30 35 34 26 46 38 37 38
39 39 31 27 39 35 34 37
42 43 31 25 32 30 30 32
37 40 31 25 60 50 67 54
33 29 27 36 35 37 48 39
32 30 34 28 39 36 39 31
63 45 74 63 50 34 37 40
54 46 60 52 43 37 39 50
47 51 52 43 48 54 57 43
four measurements on each tree and we wish to test the equality of marginal
locations: H₀: θN = θS = θE = θW. This is a common hypothesis in repeated measures designs. See Jan and Randles (1996) for an excellent discussion of issues in repeated measures designs. We reduce the data to trivariate vectors via N − S, S − E, E − W. Then we test δ = 0 where δᵀ = (θN − θS, θS − θE, θE − θW).
Table 6.3.1 displays the original n = 28 four component data vectors.
We consider the differences: N − S, S − E, and E − W. For the reader's convenience, at the url listed in the Preface, we have tabled these differences along with the unit spatial sign vectors ‖x‖⁻¹x for each data point. Note that, except for rounding error, for the spatial sign vectors, the sum of squares in each row is 1.
We compute the spatial sign statistic to be S₂ᵀ(0) = (7.78, −4.99, 6.65) and, from (6.3.1),

$$\hat{A} = \begin{pmatrix} .2809 & -.1321 & -.0539 \\ -.1321 & .3706 & -.0648 \\ -.0539 & -.0648 & .3484 \end{pmatrix}.$$

Then n⁻¹S₂ᵀ(0)Â⁻¹S₂(0) = 14.74, which yields an asymptotic p-value of .002, using a χ² approximation with 3 degrees of freedom. Hence, we easily reject H₀: δ = 0 and conclude that boring size depends on direction.
For estimation we return to the original component data. Since we have rejected the null hypothesis of equality of locations, we want to estimate the four components of the location vector: θᵀ = (θ₁, θ₂, θ₃, θ₄). The spatial median solves S₂(θ) = 0, and we find θ̂ᵀ = (45.38, 41.54, 43.91, 41.03). For comparison the mean vector is (50.54, 46.18, 49.68, 45.18)ᵀ. These computations
can be performed using the R package SpatialNP. The issue of how to apply rank methods in repeated measures designs has an extensive literature. In addition to Jan and Randles (1996), Kepner and Robinson (1988) and Akritas and Arnold (1994) discuss the use of rank transforms and pure ranks for testing hypotheses in repeated measures designs. The Friedman test, Exercise 4.8.19, can also be used for repeated measures designs.
Efficiency for Spherical Distributions

Expressions for A and B can be simplified and the computation of efficiencies made easier if we transform to polar coordinates. We write

$$\mathbf{x} = r\begin{pmatrix}\cos\phi \\ \sin\phi\end{pmatrix} = rs\begin{pmatrix}\cos\varphi \\ \sin\varphi\end{pmatrix} \qquad (6.3.7)$$

where r = ‖x‖ ≥ 0, 0 ≤ φ < 2π, and s = ±1 depending on whether x is above or below the horizontal axis, with 0 < ϕ < π. The second representation is similar to (6.2.9) and is useful in the development of the conditional distribution of the test under the null hypothesis. Hence
$$S_2(0) = \sum s_i\begin{pmatrix}\cos\varphi_i \\ \sin\varphi_i\end{pmatrix} \qquad (6.3.8)$$

where ϕᵢ is the angle measured counterclockwise between the positive horizontal axis and the line through xᵢ extending indefinitely through the origin, and sᵢ indicates whether the observation is above or below the axis. Under the null hypothesis θ = 0, sᵢ = ±1 with probabilities 1/2, 1/2 and s₁, . . . , sₙ are independent. Thus, we can condition on ϕ₁, . . . , ϕₙ to get a conditionally distribution-free test. The conditional covariance matrix is

$$\sum_{i=1}^n\begin{pmatrix}\cos^2\varphi_i & \cos\varphi_i\sin\varphi_i \\ \cos\varphi_i\sin\varphi_i & \sin^2\varphi_i\end{pmatrix} \qquad (6.3.9)$$

and this is used in the quadratic form with S₂(0) to construct the test statistic; see Möttönen and Oja (1995, Section 2.1).
To consider the asymptotically distribution-free version of this test we use the form

$$S_2(0) = \sum\begin{pmatrix}\cos\phi_i \\ \sin\phi_i\end{pmatrix} \qquad (6.3.10)$$

where, recall, 0 ≤ φ < 2π, and the multivariate central limit theorem implies that (1/√n)S₂(0) has a limiting bivariate normal distribution with mean 0 and covariance matrix A. We now translate A and its estimate into polar coordinates:

$$A = E\begin{pmatrix}\cos^2\phi & \cos\phi\sin\phi \\ \cos\phi\sin\phi & \sin^2\phi\end{pmatrix}, \qquad \hat{A} = \frac{1}{n}\sum_{i=1}^n\begin{pmatrix}\cos^2\phi_i & \cos\phi_i\sin\phi_i \\ \cos\phi_i\sin\phi_i & \sin^2\phi_i\end{pmatrix}. \qquad (6.3.11)$$

Hence, rejecting when n⁻¹S₂ᵀ(0)Â⁻¹S₂(0) ≥ χ²α(2) is an asymptotically size α test.
The polar coordinate representation of B is given by

$$Er^{-1}\begin{pmatrix} 1-\cos^2\phi & -\cos\phi\sin\phi \\ -\cos\phi\sin\phi & 1-\sin^2\phi\end{pmatrix} = Er^{-1}\begin{pmatrix}\sin^2\phi & -\cos\phi\sin\phi \\ -\cos\phi\sin\phi & \cos^2\phi\end{pmatrix}. \qquad (6.3.12)$$

Hence, √n times the spatial median is limiting bivariate normal with asymptotic covariance matrix equal to B⁻¹AB⁻¹. The corresponding noncentrality parameter of the noncentral chisquare limiting distribution of the test is γᵀBA⁻¹Bγ. We are now in a position to evaluate the efficiency of the spatial median and the spatial sign test with respect to the mean vector and Hotelling's test under various model assumptions. The following result is basic and is derived in Exercise 6.8.10.
Theorem 6.3.1. Suppose the underlying distribution is spherically symmet-
ric so that the joint density is of the form f (x) = h(kxk). Let (r, φ) be the
polar coordinates. Then r and φ are stochastically independent, the pdf of φ is
uniform on (0, 2π] and the pdf of r is g(r) = 2πrf (r), for r > 0.
Theorem 6.3.2. If the underlying distribution is spherically symmetric, then the matrices A = (1/2)I and B = [(Er⁻¹)/2]I. Hence, under the null hypothesis, the test statistic n⁻¹S₂ᵀ(0)A⁻¹S₂(0) is distribution-free over the class of spherically symmetric distributions.
Proof: First note that

$$E\cos\phi\sin\phi = \int_0^{2\pi}\cos\phi\,\sin\phi\,\frac{1}{2\pi}\,d\phi = 0\,.$$

Then note that

$$Er^{-1}\cos\phi\sin\phi = Er^{-1}\,E\cos\phi\sin\phi = 0\,.$$

Finally note that E cos²φ = E sin²φ = 1/2.
We can then compute B⁻¹AB⁻¹ = [2/(Er⁻¹)²]I and BA⁻¹B = [(Er⁻¹)²/2]I. This implies that the generalized variance of the spatial median and the noncentrality parameter of the angle sign test are given by det B⁻¹AB⁻¹ = 2/(Er⁻¹)² and [(Er⁻¹)²/2]γᵀγ. Notice that the efficiencies relative to the mean and Hotelling's test are now equal and independent

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 397 —


i i

6.3. SPATIAL METHODS 397

of the direction. Recall, for the mean vector and T², that A = 2⁻¹E(r²)I, det B⁻¹AB⁻¹ = 2⁻¹E(r²), and γᵀBA⁻¹Bγ = [2/E(r²)]γᵀγ. This is because both the spatial L1 methods and the L2 methods are equivariant and invariant with respect to orthogonal (rotations and reflections) transformations. Hence, we see that the efficiency is

$$e(\text{spatial } L_1, L_2) = \frac{1}{4}\,Er^2\{Er^{-1}\}^2\,. \qquad (6.3.13)$$

If, in addition, we assume the underlying distribution is spherical normal (bivariate normal with means 0 and identity covariance matrix), then Er⁻¹ = √(π/2), Er² = 2, and e(spatial L₁, L₂) = π/4 ≈ .785. Hence, the spatial L1 methods based on S₂(θ) are more efficient relative to the L2 methods at the spherical normal model than the componentwise L1 methods (.637) discussed in Section 6.2.3.
In Exercise 6.8.12 the reader is asked to show that the efficiency of the spatial L1 methods relative to the L2 methods with a k-variate spherical model is given by

$$e_k(\text{spatial } L_1, L_2) = \left(\frac{k-1}{k}\right)^2 E(r^2)[E(r^{-1})]^2\,. \qquad (6.3.14)$$

When the k-variate spherical model is normal, the exercise shows that Er⁻¹ = Γ[(k − 1)/2]/[√2 Γ(k/2)], with Γ(1/2) = √π. Table 6.3.2 gives some values for this efficiency as a function of dimension. Hence, we see that the efficiency increases with dimension. This suggests that the spatial methods are superior to the componentwise L1 methods, at least for spherical models.
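Formula (6.3.14), together with the normal-model expressions just given (and the fact that E(r²) = k for the k-variate standard normal), is easy to check numerically. The following base R lines are a small verification sketch and reproduce the entries of Table 6.3.2.

ek <- function(k) {
  Erinv <- gamma((k - 1) / 2) / (sqrt(2) * gamma(k / 2))  # E(r^{-1})
  ((k - 1) / k)^2 * k * Erinv^2                           # (6.3.14) with E(r^2) = k
}
round(sapply(c(2, 4, 6), ek), 3)                          # 0.785 0.884 0.920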

Efficiency for Elliptical Distributions

We need to consider what happens to the efficiency when the model is elliptical but not spherical. Since the methods that we are considering are equivariant and invariant to rotations, we can eliminate the correlation from the elliptical model with a rotation, but then the variances are typically not equal. Hence, we study, without loss of generality, the efficiency when the underlying model has unequal variances but covariance 0. Now the L2 methods are affine equivariant and invariant but the spatial L1 methods are not scale equivariant and invariant (hence not affine equivariant and invariant); hence, the efficiency is a function of the underlying variances.
The computations are now more difficult. To fix the ideas, suppose the underlying model is bivariate normal with means 0, variances 1 and σ², and covariance 0. If we let X and Z denote iid N(0, 1) random variables, then the model distribution is that of X and Y = σZ. Note that W = Z/X has a standard Cauchy distribution. Now we are ready to determine the matrices A and B.
Table 6.3.2: Efficiency as a Function of Dimension for a k-Variate Spherical Normal Model

k                      2      4      6
e(spatial L1, L2)    0.785  0.884  0.920

Table 6.3.3: Efficiencies of Spatial L1 Methods Relative to the L2 Methods for Bivariate Normal Model with Means 0, Variances 1 and σ², and 0 Correlation, the Elliptical Case

σ                      1     .8     .6     .4     .2    .05    .01
e(spatial L1, L2)    0.785  0.783  0.773  0.747  0.678  0.593  0.321

First, by symmetry, we have E cos φ sin φ = E[XY/(X² + Y²)] = 0 and Er⁻¹cos φ sin φ = E[XY/(X² + Y²)^{3/2}] = 0; hence, the matrices A and B are diagonal. Next, cos²φ = X²/[X² + σ²Z²] = 1/[1 + σ²W²], so we can use the Cauchy density to compute the expectation. Using the method of partial fractions:

$$E\cos^2\phi = \int \frac{1}{(1+\sigma^2w^2)}\,\frac{1}{\pi(1+w^2)}\,dw = \frac{1}{1+\sigma}\,.$$

Hence, E sin²φ = σ/(1 + σ). The next two formulas are given by Brown (1983) and are derivable by several steps of partial integration:

$$Er^{-1} = \sqrt{\frac{\pi}{2}}\sum_{j=0}^{\infty}\left[\frac{(2j)!}{2^{2j}(j!)^2}\right]^2 (1-\sigma^2)^j\,,$$

$$Er^{-1}\cos^2\phi = \frac{1}{2}\sqrt{\frac{\pi}{2}}\sum_{j=0}^{\infty}\frac{(2j+2)!\,(2j)!}{2^{4j+1}(j!)^2[(j+1)!]^2}\,(1-\sigma^2)^j\,,$$

and

$$Er^{-1}\sin^2\phi = Er^{-1} - Er^{-1}\cos^2\phi\,.$$

Thus A = diag[(1 + σ)⁻¹, σ(1 + σ)⁻¹] and the distribution of the test statistic, even under the normal model, depends on σ. The formulas can be used to compute the efficiency of the spatial L1 methods relative to the L2 methods; numerical values are given in Table 6.3.3. The dependency of the efficiency on σ reflects the dependency of the efficiency on the underlying correlation which is present prior to rotation.
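Brown's series converge slowly as σ approaches 0, but they are straightforward to evaluate numerically. The following base R sketch reproduces Table 6.3.3 under the assumption, consistent with the σ = 1 value π/4, that the tabled efficiency is the square root of the ratio of generalized variances, det Σ / det(B⁻¹AB⁻¹); logs are used to avoid overflow in the binomial terms, and the truncation point jmax is an arbitrary choice.

eff.elliptical <- function(sigma, jmax = 5e5) {
  j <- 0:jmax
  lc1 <- lchoose(2 * j, j) - j * log(4)              # log of (2j)!/(2^(2j)(j!)^2)
  lc2 <- lchoose(2 * j + 2, j + 1) - (j + 1) * log(4)
  lx <- j * log1p(-sigma^2); lx[1] <- 0              # log of (1 - sigma^2)^j
  Erinv  <- sqrt(pi / 2) * sum(exp(2 * lc1 + lx))    # Brown's first series
  Ercos2 <- sqrt(pi / 2) * sum(exp(lc1 + lc2 + lx))  # Brown's second series
  b1 <- Erinv - Ercos2; b2 <- Ercos2                 # diagonal of B
  a1 <- 1 / (1 + sigma); a2 <- sigma / (1 + sigma)   # diagonal of A
  sigma * b1 * b2 / sqrt(a1 * a2)                    # sqrt of det ratio
}
round(sapply(c(1, .8, .6, .4, .2, .05, .01), eff.elliptical), 3)
# compare Table 6.3.3: 0.785 0.783 0.773 0.747 0.678 0.593 0.321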
Hence, just as the componentwise L1 methods have decreasing efficiency as
a function of the underlying correlation, the spatial L1 methods have decreas-
ing efficiency as a function of the ratio of underlying variances. It should be
emphasized that the spatial methods are most appropriate for spherical models
where they have equivariance and invariance properties. The componentwise
methods, although equivariant and invariant under scale transformations of
the components, cannot tolerate changes in correlation. See Mardia (1972)
and Fisher (1987, 1993) for further discussion of spatial methods. In higher
dimensions, Mardia refers to the angle test as Rayleigh’s test; see Section 9.3.1
of Mardia (1972). Möttönen and Oja (1995) extend the spatial median and
the spatial sign test to higher dimensions. See Table 6.3.5 below for efficien-
cies relative to Hotelling’s test for higher dimensions and for a multivariate t
underlying distribution. Note that for higher dimensions and lower degrees of
freedom, the spatial sign test is superior to Hotelling’s T 2 .

6.3.2 Spatial Rank Methods

Spatial Signed-Rank Test

Möttönen and Oja (1995) develop the concept of an orthogonally invariant rank vector. Hence, rather than use the univariate concept of rank in the construction of a test, they define a spatial rank vector that has both magnitude and direction. This problem is delicate since there is no inherently natural way to order or rank vectors.
We must first review the relationship between sign, rank, and signed-rank.
Recall the norm, (1.3.17) and (1.3.21), that was used to generate the Wilcoxon
signed rank statistic. Further, recall that the second term in the norm was the
basis, in Section 2.2.2, for the Mann-Whitney-Wilcoxon rank sum statistic. We
reverse this approach here and show how the one-sample signed-rank statistic
based on ranks of the absolute values can be developed from the ranks of
the data. This provides the motivation for a one-sample spatial signed-rank
statistic.
Let x₁, . . . , xₙ be a univariate sample. Then 2[Rₙ(xᵢ) − (n + 1)/2] = Σⱼ sgn(xᵢ − xⱼ). Thus the centered rank is constructed from the signs of the differences. Now to construct a one-sample statistic, we introduce the reflections −x₁, . . . , −xₙ and consider the centered rank of xᵢ among the 2n combined observations and their reflections. The subscript 2n indicates that the reflections are included in the ranking. So,

$$2\left[R_{2n}(x_i) - \frac{2n+1}{2}\right] = \sum_j \mathrm{sgn}(x_i - x_j) + \sum_j \mathrm{sgn}(x_i + x_j) = [2R_n(|x_i|) - 1]\,\mathrm{sgn}(x_i)\,; \qquad (6.3.15)$$

see Exercise 6.8.14. Hence, ranking observations in the combined observations and reflections is essentially equivalent to ranking the absolute values
|x₁|, . . . , |xₙ|. In this way, one-sample rank-based methods can be developed from two-sample rank-based methods.
Möttönen and Oja (1995) use this approach to develop a one-sample spatial signed-rank statistic. The key is the expression sgn(xᵢ − xⱼ) + sgn(xᵢ + xⱼ), which requires only the concept of sign, not rank. Hence, we must find the appropriate extension of sign to two dimensions. In one dimension, sgn(x) = |x|⁻¹x can be thought of as a unit vector pointing in the positive or negative direction toward x.
Likewise u(x) = ‖x‖⁻¹x is a unit vector in the direction of x. Hence, as in the previous section, we take u(x) to be the vector spatial sign. The vector centered spatial rank of xᵢ is then R(xᵢ) = Σⱼ u(xᵢ − xⱼ). Thus, the vector spatial signed-rank statistic is

$$S_5(0) = \sum_i\sum_j \{u(x_i - x_j) + u(x_i + x_j)\}\,. \qquad (6.3.16)$$

This is also the sum of the centered spatial ranks of the observations when ranked in the combined observations and their reflections. Note that −u(xᵢ − xⱼ) = u(xⱼ − xᵢ), so that ΣΣ u(xᵢ − xⱼ) = 0 and the statistic can be computed from

$$S_5(0) = \sum_i\sum_j u(x_i + x_j)\,, \qquad (6.3.17)$$
which is the direct analog of (1.3.24).
We now develop a conditional test by conditioning on the data x₁, . . . , xₙ. From (6.3.16) we can write

$$S_5(0) = \sum_i r^+(x_i)\,, \qquad (6.3.18)$$

where r⁺(x) = Σⱼ{u(x − xⱼ) + u(x + xⱼ)}. Now it is easy to see that r⁺(−x) = −r⁺(x). Under the null hypothesis of symmetry about 0, we can think of S₅(0) as a realization of Σᵢ bᵢr⁺(xᵢ) where b₁, . . . , bₙ are iid variables with P(bᵢ = +1) = P(bᵢ = −1) = 1/2. Hence, Ebᵢ = 0 and var(bᵢ) = 1. This means that, conditional on the data,

$$ES_5(0) = 0 \quad\text{and}\quad \hat{A} = \widehat{\mathrm{Cov}}\left(\frac{1}{n^{3/2}}S_5(0)\right) = \frac{1}{n^3}\sum_{i=1}^n (r^+(x_i))(r^+(x_i))^T. \qquad (6.3.19)$$

The approximate size α conditional test of H₀: θ = 0 versus Hₐ: θ ≠ 0 rejects H₀ when

$$\frac{1}{n^3}\,S_5^T\hat{A}^{-1}S_5 \ge \chi^2_\alpha(2)\,, \qquad (6.3.20)$$

where χ²α(2) is the upper α percentile from a chisquare distribution with 2 degrees of freedom. Note that the extension to higher dimensions is done in exactly the same way. See Chaudhuri (1992) for rigorous asymptotics.
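The signed-rank vectors r⁺(xᵢ) and the statistic (6.3.20) can be computed directly. The following base R sketch is a naive O(n²) illustration; the function names are ours, and for k-variate data the chisquare has k degrees of freedom in place of the bivariate χ²(2) of the display.

spatial.rplus <- function(X) {              # rows are r+(x_i) of (6.3.18)
  n <- nrow(X)
  u <- function(v) { nv <- sqrt(sum(v^2)); if (nv > 0) v / nv else v * 0 }
  t(sapply(1:n, function(i) {
    Reduce(`+`, lapply(1:n, function(j) u(X[i, ] - X[j, ]) + u(X[i, ] + X[j, ])))
  }))
}

spatial.signed.rank.test <- function(X) {
  n <- nrow(X)
  R <- spatial.rplus(X)
  S5 <- colSums(R)                          # S_5(0) as in (6.3.18)
  Ahat <- crossprod(R) / n^3                # the estimate (6.3.19)
  stat <- drop(t(S5) %*% solve(Ahat) %*% S5) / n^3
  c(statistic = stat,
    p.value = pchisq(stat, df = ncol(X), lower.tail = FALSE))
}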
Table 6.3.4: Each Row is a Spatial Signed-Rank Vector for the Data Differences

Row   SR1    SR2    SR3      Row   SR1    SR2    SR3
 1    0.28  -0.49  -0.07      15   0.30  -0.54   0.69
 2    0.28  -0.58   0.12      16  -0.40   0.73  -0.07
 3   -0.09  -0.39   0.31      17   0.60  -0.14   0.39
 4    0.58  -0.29  -0.11      18   0.10   0.56   0.49
 5   -0.03  -0.20  -0.07      19   0.77  -0.34   0.22
 6   -0.28   0.07   0.43      20   0.48   0.10  -0.03
 7    0.07   0.43   0.23      21   0.26   0.08  -0.16
 8    0.01   0.60   0.32      22   0.12   0.00  -0.11
 9   -0.13   0.46   0.34      23   0.32  -0.58   0.48
10    0.23   0.13  -0.49      24  -0.14  -0.53   0.42
11    0.12  -0.20   0.33      25   0.19  -0.12   0.45
12    0.46  -0.76   0.28      26   0.73  -0.07  -0.14
13    0.30  -0.56   0.34      27   0.31  -0.12  -0.58
14   -0.22  -0.05   0.49      28  -0.30  -0.14   0.67

Example 6.3.2 (Cork Borings, Example 6.3.1 continued). We use the spatial signed-rank method (6.3.20) to test the hypothesis. Table 6.3.4 provides the vector signed-ranks, r⁺(xᵢ), defined in expression (6.3.18). Then S₅ᵀ(0) = (4.94, −2.90, 5.17),

$$n^{-3}\hat{A}^{-1} = \begin{pmatrix} .1231 & -.0655 & .0050 \\ -.0655 & .1611 & -.0373 \\ .0050 & -.0373 & .1338 \end{pmatrix},$$

and n⁻³S₅ᵀ(0)Â⁻¹S₅(0) = 11.19 with an approximate p-value of 0.011 based on a χ²-distribution with 3 degrees of freedom. The Hodges-Lehmann estimate of θ, which solves S₅(θ) ≐ 0, is computed to be θ̂ᵀ = (49.30, 45.07, 48.90, 44.59).

Efficiency
The test in (6.3.20) can be developed from the point of view of asymptotic
theory and the efficiency can be computed. The computations are quite in-
volved. The multivariate t distributions provide both a range of tailweights
and a range of dimensions. A summary of these efficiencies is found in Table
6.3.5; see Möttönen, Oja, and Tienari (1997) for details.
The Möttönen and Oja (1995) test efficiency increases with the dimen-
sion; see especially, the circular normal case. The efficiency begins at .95
and increases! The efficiency also increases with tailweight, as expected. This
strongly suggests that the Möttönen and Oja approach is an excellent way to extend the idea of signed rank from the univariate case. See Example 6.6.2 for a discussion of the two-sample spatial rank test.

Table 6.3.5: The Row Labeled Spatial SR Are the Asymptotic Efficiencies of the Multivariate Spatial Signed-Rank Test, (6.3.20), Relative to Hotelling's Test under the Multivariate t Distribution; the Efficiencies for the Spatial Sign Test, (6.3.2), Are Given in the Rows Labeled Spatial Sign

                          Degrees of Freedom
Dimension  Test          3     4     6     8    10    15    20     ∞
 1   Spatial SR        1.90  1.40  1.16  1.09  1.05  1.01  1.00  0.95
     Spatial Sign      1.62  1.13  0.88  0.80  0.76  0.71  0.70  0.64
 2   Spatial SR        1.95  1.43  1.19  1.11  1.07  1.03  1.01  0.97
     Spatial Sign      2.00  1.39  1.08  0.98  0.93  0.88  0.85  0.79
 3   Spatial SR        1.98  1.45  1.20  1.12  1.08  1.04  1.02  0.97
     Spatial Sign      2.16  1.50  1.17  1.06  1.01  0.95  0.92  0.85
 4   Spatial SR        2.00  1.46  1.21  1.13  1.09  1.04  1.025 0.98
     Spatial Sign      2.25  1.56  1.22  1.11  1.05  0.99  0.96  0.88
 6   Spatial SR        2.02  1.48  1.22  1.14  1.10  1.05  1.03  0.98
     Spatial Sign      2.34  1.63  1.27  1.15  1.09  1.03  1.00  0.92
10   Spatial SR        2.05  1.49  1.23  1.14  1.10  1.06  1.04  0.99
     Spatial Sign      2.42  1.68  1.31  1.19  1.13  1.06  1.03  0.95

Hodges-Lehmann Estimator
The estimator derived from S₅(θ) ≐ 0 is the spatial median of the pairwise averages, a spatial Hodges-Lehmann (1963) estimator. This estimator is studied in great detail by Chaudhuri (1992). His paper contains a thorough review of multidimensional location estimates. He develops a Bahadur representation for the estimate. From his Theorem 3.2, we can immediately conclude that

$$\sqrt{n}\,\hat{\theta} = B_2^{-1}\,\frac{\sqrt{n}}{n(n-1)}\sum_{i=1}^n\sum_{j=1}^n u\left(\frac{1}{2}(x_i + x_j)\right) + o_p(1) \qquad (6.3.21)$$

where B₂ = E{‖x∗‖⁻¹(I − ‖x∗‖⁻²x∗(x∗)ᵀ)} and x∗ = ½(x₁ + x₂). Hence, the asymptotic distribution of √n θ̂ is determined by that of n⁻³/²S₅(0). This leads to

$$\sqrt{n}\,\hat{\theta} \xrightarrow{D} N_2(0, B_2^{-1}A_2B_2^{-1})\,, \qquad (6.3.22)$$

where A₂ = E{u(x₁ + x₂)(u(x₁ + x₂))ᵀ}. Moment estimates of A₂ and B₂ can be used. In fact the estimator Â, defined in expression (6.3.19), is a consistent estimate of A₂. Bose and Chaudhuri (1993) and Chaudhuri (1993) discuss refinements in the estimation of A₂ and B₂.
Choi and Marden (1997) extend these spatial rank methods to the two-
sample model and the one-way layout. They also consider tests for ordered
alternatives; see, also, Oja (2010).
6.4 Affine Equivariant and Invariant Methods

6.4.1 Blumen's Bivariate Sign Test

It is clear from Tables 6.3.3 and 6.3.5 of efficiencies in the previous section that it is desirable to have robust sign and rank methods that are affine invariant and equivariant to compete with LS methods. We begin with yet another representation of the estimating function S₂(θ), (6.1.7). Let the ordered ϕ angles be given by 0 ≤ ϕ₍₁₎ < ϕ₍₂₎ < . . . < ϕ₍ₙ₎ < π and let s₍ᵢ₎ = ±1 when the observation corresponding to ϕ₍ᵢ₎ is above or below the horizontal axis. Then we can write, as in expression (6.3.8),

$$S_2(\theta) = \sum_{i=1}^n s_{(i)}\begin{pmatrix}\cos\varphi_{(i)} \\ \sin\varphi_{(i)}\end{pmatrix}. \qquad (6.4.1)$$

Now under the assumption of spherical symmetry, ϕ₍ᵢ₎ is distributed as the ith order statistic from the uniform distribution on [0, π) and, hence, Eϕ₍ᵢ₎ = πi/(n + 1), i = 1, . . . , n. Recall, in the univariate case, if we believe that the underlying distribution is normal then we could replace the data by the normal scores (expected values of the order statistics from a normal distribution) in a signed-rank statistic. The result is the distribution-free normal scores test. We do the same thing here. We replace ϕ₍ᵢ₎ by its expected value to construct a scores statistic. Let

$$S_6(\theta) = \sum_{i=1}^n s_{(i)}\begin{pmatrix}\cos\frac{\pi i}{n+1} \\ \sin\frac{\pi i}{n+1}\end{pmatrix} = \sum_{i=1}^n s_i\begin{pmatrix}\cos\frac{\pi R_i}{n+1} \\ \sin\frac{\pi R_i}{n+1}\end{pmatrix} \qquad (6.4.2)$$

where R₁, . . . , Rₙ are the ranks of the unordered angles ϕ₁, . . . , ϕₙ. Note that s₁, . . . , sₙ are iid with P(sᵢ = 1) = P(sᵢ = −1) = 1/2 even if the underlying model is elliptical rather than spherical. Since we now have constant vectors in S₆(θ), it follows that the sign test based on S₆(θ) is distribution-free over the class of elliptical models. We look at the test in more detail and consider the efficiency of this sign test relative to Hotelling's test. First, we have immediately, under the null hypothesis, from the distribution of s₁, . . . , sₙ that

$$\mathrm{cov}\left(\frac{1}{\sqrt{n}}S_6(0)\right) = \begin{pmatrix} \frac{\sum\cos^2[\pi i/(n+1)]}{n} & \frac{\sum\cos[\pi i/(n+1)]\sin[\pi i/(n+1)]}{n} \\ \frac{\sum\cos[\pi i/(n+1)]\sin[\pi i/(n+1)]}{n} & \frac{\sum\sin^2[\pi i/(n+1)]}{n}\end{pmatrix} \to A\,,$$

where

$$A = \begin{pmatrix}\int_0^1\cos^2\pi t\,dt & \int_0^1\cos\pi t\sin\pi t\,dt \\ \int_0^1\cos\pi t\sin\pi t\,dt & \int_0^1\sin^2\pi t\,dt\end{pmatrix} = \frac{1}{2}I\,,$$
as n → ∞. So reject H₀: θ = 0 if (2/n)S₆′(0)S₆(0) ≥ χ²α(2) for the asymptotic size α distribution-free version of the test, where

$$\frac{2}{n}S_6'(0)S_6(0) = \frac{2}{n}\left\{\left[\sum s_{(i)}\cos\frac{\pi i}{n+1}\right]^2 + \left[\sum s_{(i)}\sin\frac{\pi i}{n+1}\right]^2\right\}. \qquad (6.4.3)$$

This test is not affine invariant. Blumen (1958) created an asymptotically equivalent test that is affine invariant. We can think of Blumen's statistic as an elliptical scores version of the angle statistic of Brown (1983). In (6.4.3), i/(n + 1) is replaced by (i − 1)/n. Blumen rotated the axes so that ϕ₍₁₎ is equal to zero and the data point is on the horizontal axis. Then the remaining scores are uniformly spaced. In this case, π(i − 1)/n is the conditional expectation of ϕ₍ᵢ₎ given ϕ₍₁₎ = 0. Estimation methods corresponding to Blumen's test, however, have not yet been developed.
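A small base R sketch may help fix the construction. The function below implements the scores statistic (6.4.3) and, optionally, the Blumen-style scores π(i − 1)/n; it omits Blumen's exact rotation step, so it is an illustration of the construction rather than Blumen's test verbatim, and the function name is ours.

blumen.type.test <- function(X, blumen = TRUE) {
  ang <- atan2(X[, 2], X[, 1])                 # angle of each bivariate point
  s <- ifelse(ang >= 0 & ang < pi, 1, -1)      # above or below horizontal axis
  rk <- rank(ang %% pi)                        # ranks of the phi angles in [0, pi)
  n <- length(rk)
  score <- if (blumen) pi * (rk - 1) / n else pi * rk / (n + 1)
  S <- c(sum(s * cos(score)), sum(s * sin(score)))
  stat <- 2 * sum(S^2) / n                     # (2/n) S_6'(0) S_6(0)
  c(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}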
To compute the efficiency of Blumen’s test relative to Hotelling’s test
we must compute the noncentrality parameter of the limiting chisquare dis-
tribution. Hence, we must compute BA−1 B and this leads us to B. Theorem
6.3.2 provides the matrices A and B for the angle sign statistic when the un-
derlying distribution is spherically symmetric. The following theorem shows
that the affine invariant sign statistic has the same A and B matrices as in
Theorem 6.3.2 and they hold for all elliptical distributions. We discuss the
implications after the proof of the proposition.

Theorem 6.4.1. If the underlying distribution is elliptical, then corresponding to S₆(0) we have A = ½I and B = (Er⁻¹/2)I. Hence, the efficiency of Blumen's test relative to Hotelling's test is e(S₆, Hotelling) = E(r²)[E(r⁻¹)]²/4, which is the same for all elliptical models.

Proof: To prove this we show that under a spherical model the angle statistic
S2 (0) and scores statistic S6 (0) are asymptotically equivalent. Then S6 (0) has
the same A and B matrices as in Theorem 6.3.2. But since S6 (0) leads to an
affine invariant test statistic, it follows that the same A and B continue to
apply for elliptical models.
Recall that under the spherical model, s₍₁₎, . . . , s₍ₙ₎ are iid random variables with P(sᵢ = 1) = P(sᵢ = −1) = 1/2. Then we consider

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n s_{(i)}\begin{pmatrix}\cos\frac{\pi i}{n+1} \\ \sin\frac{\pi i}{n+1}\end{pmatrix} - \frac{1}{\sqrt{n}}\sum_{i=1}^n s_{(i)}\begin{pmatrix}\cos\varphi_{(i)} \\ \sin\varphi_{(i)}\end{pmatrix} = \frac{1}{\sqrt{n}}\sum_{i=1}^n s_{(i)}\begin{pmatrix}\cos\frac{\pi i}{n+1} - \cos\varphi_{(i)} \\ \sin\frac{\pi i}{n+1} - \sin\varphi_{(i)}\end{pmatrix}.$$
We treat the two components separately. First,

$$\left|\frac{1}{\sqrt{n}}\sum s_{(i)}\left\{\cos\frac{\pi i}{n+1} - \cos\varphi_{(i)}\right\}\right| \le \max_i\left|\cos\frac{\pi i}{n+1} - \cos\varphi_{(i)}\right|\;\left|\frac{1}{\sqrt{n}}\sum s_{(i)}\right|.$$
The cdf of the uniform distribution on [0, π) is equal to t/π for 0 ≤ t < π. Let Gₙ(t) be the empirical cdf of the angles ϕᵢ, i = 1, . . . , n. Then Gₙ⁻¹(i/(n + 1)) = ϕ₍ᵢ₎ and maxᵢ|πi/(n + 1) − ϕ₍ᵢ₎| ≤ sup_t|Gₙ⁻¹(t) − tπ| = sup_t|Gₙ(t) − t/π| → 0 wp1 by the Glivenko-Cantelli Lemma. The result now follows by using a linear approximation to cos(πi/(n + 1)) − cos ϕ₍ᵢ₎ and noting that the cos and sin are bounded. The same argument applies to the second component. Hence, the difference of the two statistics is op(1) and they are asymptotically equivalent. The results for the angle statistic now apply to S₆(0) for a spherical model. The affine invariance extends the result to an elliptical model.
The main implication of this proposition is that the efficiency of the test
based on S6 (0) relative to Hotelling’s test is π/4 ≈ .785 for all bivariate normal
models, not just the spherical normal model. Recall that the test based on
S2 (0), the angle sign test, has efficiency π/4 only for the spherical normal and
declining efficiency for elliptical normal models. Hence, we not only gain affine
invariance but also have a constant, nondecreasing efficiency.
Oja and Nyblom (1989) study a class of sign tests for the bivariate loca-
tion problem. They show that Blumen’s test is locally most powerful invariant
for the entire class of elliptical models. Ducharme and Milasevic (1987) define
a normalized spatial median as an estimate of location of a spherical distri-
bution. They construct a confidence region for the modal direction. These
methods are resistant to outliers.

6.4.2 Affine Invariant Sign Tests
Affine invariance is determined in the Blumen test statistic by rearranging the
data axes to be uniformly spaced scores. Further, note that the asymptotic
covariance matrix A is (1/2)I, where I is the identity. This is the covariance
matrix for a random vector that is uniformly distributed on the unit circle.
The equally spaced scores cannot be constructed in higher dimensions. The
approach taken here is due to Randles (2000) in which we seek a linear trans-
formation of the data that makes the data axes roughly equally spaced and
the resulting direction vectors are roughly uniformly distributed on the unit
sphere. We choose the transformation so that the sample covariance matrix of
the unit vectors of the transformed data is that of a random vector uniformly
distributed on the unit sphere. We then compute the spatial sign test (6.3.2)
on the transformed data. The result is an affine invariant test.
Let x₁, ..., xₙ be a random sample of size n from a k-variate multivariate symmetric distribution with symmetry center 0. Suppose for the moment that a nonsingular matrix Ux, determined by the data, exists and satisfies

$$\frac{1}{n}\sum_{i=1}^n\left(\frac{U_x x_i}{\|U_x x_i\|}\right)\left(\frac{U_x x_i}{\|U_x x_i\|}\right)^T = \frac{1}{k}I. \qquad (6.4.4)$$
Hence, the unit vectors of the transformed data have covariance matrix equal to that of a random vector uniformly distributed on the unit k-sphere. Below we describe a simple and fast way to compute Ux for any dimension k. The test statistic (6.3.2) computed on the transformed data becomes

$$\frac{1}{n}S_7^T\hat{A}^{-1}S_7 = \frac{k}{n}S_7^TS_7 \qquad (6.4.5)$$

where

$$S_7 = \sum_{i=1}^n \frac{U_x x_i}{\|U_x x_i\|} \qquad (6.4.6)$$

and Â in (6.3.1) becomes k⁻¹I because of the definition of Ux in (6.4.4).
and A

Theorem 6.4.2. Suppose n > k(k − 1) and the underlying distribution is symmetric about 0. Then (k/n)S₇ᵀS₇ in (6.4.5) is affine invariant and the limiting distribution, as n → ∞, is chisquare with k degrees of freedom.

The following lemma is helpful in the proof of the theorem. The lemma’s
proof depends on a uniqueness result from Tyler (1987).

Lemma 6.4.1. Suppose n > k(k − 1) and D is a fixed, nonsingular transformation matrix. Suppose Ux and U_Dx are defined by (6.4.4). Then

1. DᵀU_DxᵀU_Dx D = c₀UxᵀUx for some positive constant c₀ that may depend on D and the data, and

2. there exists an orthogonal matrix G such that c₀^{1/2}GUx = U_Dx D.

Proof: Define U∗ = UxD⁻¹; then

$$\frac{1}{n}\sum_{i=1}^n\left(\frac{U^*Dx_i}{\|U^*Dx_i\|}\right)\left(\frac{U^*Dx_i}{\|U^*Dx_i\|}\right)^T = \frac{1}{n}\sum_{i=1}^n\left(\frac{U_x x_i}{\|U_x x_i\|}\right)\left(\frac{U_x x_i}{\|U_x x_i\|}\right)^T = \frac{1}{k}I.$$
Tyler (1987) showed that the matrix U_Dx defined from Dx₁, ..., Dxₙ is unique up to a positive constant. Hence, U_Dx = aU∗ for some positive constant a. Hence,

$$U_{Dx}^TU_{Dx} = a^2U^{*T}U^* = a^2(D^T)^{-1}U_x^TU_xD^{-1}$$

and DᵀU_DxᵀU_Dx D = a²UxᵀUx, which completes the proof of Part (1) with c₀ = a². Next, define G = c₀^{−1/2}U_Dx DUx⁻¹, where c₀ comes from Part (1). Then, using Part (1), it follows that GᵀG = I and G is orthogonal. Hence,

$$c_0^{1/2}GU_x = c_0^{1/2}c_0^{-1/2}U_{Dx}DU_x^{-1}U_x = U_{Dx}D$$

and Part (2) follows.


Proof of Theorem 6.4.2: Given D is a fixed, nonsingular matrix, let yᵢ = Dxᵢ for i = 1, ..., n. Then (6.4.6) becomes

$$S_7^D = \sum_{i=1}^n\frac{U_{Dx}Dx_i}{\|U_{Dx}Dx_i\|}.$$

We show that S₇DᵀS₇D = S₇ᵀS₇ and hence does not depend on D. Now, from Lemma 6.4.1,

$$\frac{U_{Dx}Dx}{\|U_{Dx}Dx\|} = \frac{c_0^{1/2}GU_x x}{\|c_0^{1/2}GU_x x\|} = G\,\frac{U_x x}{\|U_x x\|}$$

and

$$S_7^D = \sum_{i=1}^n G\,\frac{U_x x_i}{\|U_x x_i\|} = GS_7.$$

Hence, S₇DᵀS₇D = S₇ᵀS₇ and the affine invariance follows from the orthogonal invariance of S₇ᵀS₇.
Sketch of the argument that the asymptotic distribution is chisquared with k degrees of freedom. Tyler (1987) showed that there exists a unique upper triangular matrix U∗ with upper left diagonal element equal to 1 such that

$$E\left[\left(\frac{U^*X}{\|U^*X\|}\right)\left(\frac{U^*X}{\|U^*X\|}\right)^T\right] = \frac{1}{k}I$$

and √n(Ux − U∗) = Op(1). Theorem 6.1.2 implies that (k/n)S₇∗ᵀS₇∗ is asymptotically chisquared with k degrees of freedom, where U∗ replaces Ux in S₇. But since Ux and U∗ are close, (k/n)S₇∗ᵀS₇∗ − (k/n)S₇ᵀS₇ = op(1), and the asymptotic distribution follows. See the appendix in Randles (2000) for details.
We have assumed symmetry of the underlying multivariate distribution. The results continue to hold with the weaker assumption of directional symmetry about 0, in which X/‖X‖ and −X/‖X‖ have the same distribution.
In addition to the asymptotic distribution, we can compute or approximate the conditional distribution (given the direction axes of the data) of (k/n)S₇ᵀS₇ under the assumption of directional symmetry by listing or sampling the 2ⁿ equally likely values of

$$\frac{k}{n}\left(\sum_{i=1}^n\delta_i\frac{U_x x_i}{\|U_x x_i\|}\right)^T\left(\sum_{i=1}^n\delta_i\frac{U_x x_i}{\|U_x x_i\|}\right)$$

where δᵢ = ±1 for i = 1, ..., n. Hence, it is straightforward to approximate the p-value of the test.

Computation of Ux

It remains to compute Ux from the data x₁, ..., xₙ. The following efficient iterative procedure is due to Tyler (1987), who also shows the sequence of iterates converges when n > k(k − 1). We begin with

$$V_0 = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i}{\|x_i\|}\right)\left(\frac{x_i}{\|x_i\|}\right)^T,$$

and U₀ = Chol(V₀⁻¹), where Chol(M) is the upper triangular Cholesky decomposition of the positive definite matrix M divided by the upper left diagonal element of the upper triangular matrix. This places a 1 as the first element of the main diagonal and makes Chol(M) unique.
If ‖V₀ − k⁻¹I‖ is sufficiently small (a prespecified tolerance), stop and take Ux = U₀. If ‖V₀ − k⁻¹I‖ is large, compute

$$V_1 = \frac{1}{n}\sum_{i=1}^n\left(\frac{U_0x_i}{\|U_0x_i\|}\right)\left(\frac{U_0x_i}{\|U_0x_i\|}\right)^T,$$

and compute U₁ = Chol(V₁⁻¹). If ‖V₁ − k⁻¹I‖ is sufficiently small, stop and take Ux = U₁U₀. If ‖V₁ − k⁻¹I‖ is large, compute

$$V_2 = \frac{1}{n}\sum_{i=1}^n\left(\frac{U_1U_0x_i}{\|U_1U_0x_i\|}\right)\left(\frac{U_1U_0x_i}{\|U_1U_0x_i\|}\right)^T,$$

and U₂ = Chol(V₂⁻¹). If ‖V₂ − k⁻¹I‖ is sufficiently small, stop and take Ux = U₂U₁U₀. If ‖V₂ − k⁻¹I‖ is large, compute V₃ and U₃ and proceed until ‖V_{j₀} − k⁻¹I‖ is sufficiently small; then take Ux = U_{j₀}U_{j₀−1}···U₀.
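The iteration is only a few lines in base R. The following sketch follows the procedure just described (the function name is ours); chol() returns the upper triangular factor, which is then normalized to have a 1 in the upper left position.

tyler.Ux <- function(X, tol = 1e-8, maxit = 100) {
  k <- ncol(X)
  U <- diag(k)                              # accumulates U_j ... U_0
  for (it in 1:maxit) {
    Y <- X %*% t(U)                         # rows are (U x_i)^T
    W <- Y / sqrt(rowSums(Y^2))             # unit vectors of transformed data
    V <- crossprod(W) / nrow(X)             # current V_j
    C <- chol(solve(V))                     # Cholesky factor of V_j^{-1}
    U <- (C / C[1, 1]) %*% U                # normalized: upper left element 1
    if (norm(V - diag(k) / k, "F") < tol) break
  }
  U
}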
Affine Equivariant Median. We now turn to the problem of constructing an affine equivariant estimate of the center of symmetry of the underlying
distribution. Our goal is to produce an estimate that is computationally efficient for large samples in any dimension, a problem that plagued some earlier attempts; see Small (1990) for an overview of multivariate medians. The estimate described below was proposed by Hettmansperger and Randles (2002) and we refer to it as the HR estimate. The estimator θ̂ is chosen to be the solution of

$$\frac{1}{n}\sum_{i=1}^n\frac{U_x(x_i - \theta)}{\|U_x(x_i - \theta)\|} = 0 \qquad (6.4.7)$$

in which Ux is the k × k upper triangular positive definite matrix, with a 1 in the upper left position on the diagonal, chosen to satisfy

$$\frac{1}{n}\sum_{i=1}^n\left(\frac{U_x(x_i - \theta)}{\|U_x(x_i - \theta)\|}\right)\left(\frac{U_x(x_i - \theta)}{\|U_x(x_i - \theta)\|}\right)^T = \frac{1}{k}I. \qquad (6.4.8)$$

This is a transform-retransform estimate; see, for example, Chakraborty, Chaudhuri, and Oja (1998). The data are transformed using Ux, and the estimate τ̂ = Uxθ̂ is computed. Then the estimate is retransformed back to the original scale, θ̂ = Ux⁻¹τ̂. The simultaneous solutions of (6.4.7) and (6.4.8) are M estimates; see Section 6.5.4 for the explicit representation. It follows from this that the estimate is affine equivariant. It is also possible to directly verify the affine equivariance.
The calculation of (Ux, θ̂) involves two routines. The first routine finds the value that solves (6.4.7) with Ux fixed. This is done by letting yᵢ = Uxxᵢ and finding τ̂ that solves Σ(yᵢ − τ)/‖yᵢ − τ‖ = 0. Hence, τ̂ is the spatial median of y₁, . . . , yₙ; see Section 6.3.1. The solution to (6.4.7) is θ̂ = Ux⁻¹τ̂. The second routine then finds Ux in (6.4.8) as described above for the computation of Ux for a fixed value of θ, with xᵢ replaced by xᵢ − θ̂.
The calculation of (Ux, θ̂) alternates between these two routines until convergence. To obtain starting values, let θ₀ⱼ = xⱼ. Use the second routine to obtain U₀ⱼ for this value of θ. The starting pair (θ₀ⱼ, U₀ⱼ) is the one that minimizes, for j = 1, ..., n, the inner product

$$\left[\sum_{i=1}^n\frac{U_{0j}(x_i - \theta_{0j})}{\|U_{0j}(x_i - \theta_{0j})\|}\right]^T\left[\sum_{i=1}^n\frac{U_{0j}(x_i - \theta_{0j})}{\|U_{0j}(x_i - \theta_{0j})\|}\right].$$

This starting procedure is used since starting values need to be affine invariant and equivariant.
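Given the two routines, the alternating algorithm is short. The sketch below (base R, reusing the spatial.median() and tyler.Ux() sketches given earlier) illustrates the iteration; for brevity it uses the componentwise median as the starting value rather than the affine invariant start just described.

hr.estimate <- function(X, maxit = 50, tol = 1e-6) {
  theta <- apply(X, 2, median)              # illustrative start only
  for (it in 1:maxit) {
    U <- tyler.Ux(sweep(X, 2, theta))       # routine 2: U_x at current theta
    tau <- spatial.median(X %*% t(U))       # routine 1: spatial median of U x_i
    theta.new <- drop(solve(U, tau))        # retransform: theta = U^{-1} tau
    if (sqrt(sum((theta.new - theta)^2)) < tol) break
    theta <- theta.new
  }
  theta.new
}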
For a fixed Ux there exists a unique solution for θ, and for fixed θ there ex-
ists a unique Ux up to multiplicative constant. In simulations and calculations
described in Hettmansperger and Randles (2002) the alternating algorithm did
not fail to converge. However, the equations defining the simultaneous solution
(Ux, θ̂) do not fully satisfy all conditions stated in the literature for existence and uniqueness; see Maronna (1976) and Kent and Tyler (1991).
The asymptotic distribution theory developed in Hettmansperger and Randles (2002) shows that θ̂ is approximately multivariate normally distributed under the assumption of directional symmetry and, hence, symmetry. The asymptotic covariance matrix is complicated and we recommend a bootstrap estimate of the covariance matrix of θ̂.
The approach taken above is more general. If we begin with the orthog-
onally invariant statistic in (6.3.2) and use a matrix U that satisfies the in-
variance property in part (2) of Lemma 6.4.1 then the resulting statistic is
affine invariant. For example we could take U to be the inverse of the sample
covariance matrix. This results in a test statistic studied by Hössjer and Croux
(1995). We prefer the more robust matrix Ux proposed by Tyler (1987).
Example 6.4.1 (Mathematics and Statistics Exam Scores). We now illus-
trate the one-sample affine invariant spatial sign test (6.4.5) and the affine
equivariant spatial median on a small data set. A major advantage of this
method is the speed of computation which allows for bootstrap estimates of
the covariance matrix and standard errors for the estimator. The data consists
of 20 vectors, chosen at random from a larger data set published in Mardia,
Kent, and Bibby (1979). Each vector consists of four components and records
test scores in Mechanics, Vectors, Analysis, and Statistics. We wish to test the
hypothesis that there are no differences among the examination topics. This
is a traditional hypothesis in repeated measures designs; see Jan and Randles
(1996) for a thorough discussion of this problem. Similar to our findings above
on efficiencies, they found that multivariate sign and signed-rank tests were
often superior to least squares in robustness of level and efficiency.
We consider the trivariate data that result when the Statistics score is
subtracted from the other three scores. For convenience, we have tabled these
differences at the url cited in the Preface. We suppose that the trivariate
data are a sample of size 20 from a symmetric distribution with center θ =
(θ1 , θ2 , θ3 )T and we wish to test H0 : θ = 0 versus HA : θ 6= 0. In Table 6.4.1
we have the HR estimates (standard errors) and the tests for the affine spatial
methods, Hotelling’s T2 , and Oja’s affine methods described later in Section
6.4.3. The standard errors of the HR estimate are obtained from a bootstrap
estimate of the covariance matrix. The following estimates are based on 500 bootstrap resamples:

$$\widehat{\mathrm{Cov}}(\hat{\theta}) = \begin{pmatrix} 33.88 & 10.53 & 21.05 \\ 10.53 & 17.03 & 12.49 \\ 21.05 & 12.49 & 32.71 \end{pmatrix}.$$

The standard errors in Table 6.4.1 are the square roots of the main diagonal of this matrix.
Table 6.4.1: Results for the Original and Contaminated Test Score Data: Mean of Signed-Rank Vectors, Usual Mean Vectors, the Hodges-Lehmann Estimate of θ; Results for the Signed-Rank Test (6.4.16) and Hotelling's T² Test

                             M−S     V−S     A−S   Test Statistic  Asymp. p-value
Original Data
HR Estimate                 −2.12   13.85    6.21
SE HR                        5.82    4.13    5.72
Mean                        −4.95   12.10    2.40
SE Mean                      4.07    3.33    3.62
Oja HL-est.                 −3.05   14.06    4.96
Affine Sign Test (6.4.5)                             14.19          0.0027
Hotelling's T²                                       13.47          0.0037
Oja Signed Rank (6.4.16)                             14.07          0.0028
Contaminated Data
HR Estimate                 −2.92   12.83    6.90
SE HR                        5.58    8.27    6.60
Mean Vector                 −4.95    8.60    2.40
Oja HL-estimate             −3.90   12.69    4.64
Affine Sign Test (6.4.5)                             10.76          0.0131
Hotelling's T²                                        6.95          0.0736
Oja Signed Rank (6.4.16)                             10.09          0.0178

The affine sign methods suggest that the major source of statistical sig-
nificance is the V − S difference. In particular, Vector scores are higher than
Statistics scores. A more convenient comparison is achieved by estimating the
locations in the four-dimensional problem. We find the affine equivariant spa-
tial median for M, V, A, S to be (along with bootstrap standard errors) 36.54
(8.41), 53.04 (5.09), 44.28 (8.39), and 39.65 (7.06). This again reflects the sig-
nificant differences between Vector scores and Statistics. In fact, it appears
the Vector exam was easiest while the other subjects are roughly equivalent.

An outlier was created in V by replacing the 70 (first observation) by 0. The results are shown in the lower part of Table 6.4.1. Note, in particular, unlike the robust methods, the p-value for Hotelling's T² test has shifted above 0.05 and, hence, would no longer be considered significant.
An Affine Invariant Signed-Rank Test and Affine Equivariant Estimate
The test statistic can be constructed in the same way that the affine invariant sign test was constructed. We sketch this development below. For a detailed and rigorous development see Oja (2010, Chapter 7) or Oja and Randles (2004). The spatial signed-rank statistic is given by S₅ in (6.3.19), along with the spatial signed-rank covariance matrix, given in this case by

$$\frac{1}{n}\sum_{i=1}^n r^+(x_i)\,r^+(x_i)^T. \qquad (6.4.9)$$

Now suppose we can construct a matrix Vx such that when xᵢ is replaced by Vxxᵢ in (6.4.9) we have

$$\frac{1}{\frac{1}{n}\sum\|r^+(V_xx_i)\|^2}\;\frac{1}{n}\sum r^+(V_xx_i)\,r^+(V_xx_i)^T = \frac{1}{k}I. \qquad (6.4.10)$$

The divisor in (6.4.10) is the average squared length of the signed-rank vectors and is needed to normalize (on average) the signed-rank vectors. In the simpler sign vector case, n⁻¹Σ[xᵢᵀxᵢ/‖xᵢ‖²] = 1. The normalized signed-rank vectors now have roughly the same covariance structure as vectors uniformly distributed on the unit k-sphere. It is straightforward to develop an iterative routine to compute Vx in the same way we computed Ux for the sign statistic.
The signed-rank test statistic developed from (6.3.22) is then

$$\frac{k}{n}S_8^TS_8\,, \qquad (6.4.11)$$

where S₈ = Σ r⁺(Vxxᵢ). Again, it can be verified directly that this test statistic is affine invariant. In addition, the p-value of the test can be approximated using the chisquare distribution with k degrees of freedom or by simulation, conditionally using the 2ⁿ equally likely values of

$$\frac{k}{n}\left[\sum_{i=1}^n\delta_i r^+(V_xx_i)\right]^T\left[\sum_{i=1}^n\delta_i r^+(V_xx_i)\right]$$

with δᵢ = ±1.
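By analogy with tyler.Ux(), such an iterative routine for Vx can be sketched in a few lines of base R, using the spatial.rplus() helper given earlier; the normalization is the one displayed in (6.4.10), the function name is ours, and convergence is assumed rather than proved here.

signed.rank.Vx <- function(X, tol = 1e-6, maxit = 50) {
  k <- ncol(X)
  V <- diag(k)
  for (it in 1:maxit) {
    R <- spatial.rplus(X %*% t(V))          # signed-rank vectors of V x_i
    M <- crossprod(R) / sum(R^2)            # left side of (6.4.10)
    C <- chol(solve(M))
    V <- (C / C[1, 1]) %*% V                # normalized Cholesky update
    if (norm(M - diag(k) / k, "F") < tol) break
  }
  V
}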
Recall that the Hodges-Lehmann estimate related to the spatial signed-rank statistic is the spatial median of the pairwise averages of the data vectors. This estimate is orthogonally equivariant but not affine equivariant. We use the transformation-retransformation method. We transform the data using Vx to get yᵢ = Vxxᵢ, i = 1, ..., n, and then compute the spatial median of the pairwise averages (yᵢ + yⱼ)/2, which we denote by τ̂. Then we retransform it back: θ̂ = Vx⁻¹τ̂. This estimate is now affine equivariant. Because of the complexity of the asymptotic covariance matrix we recommend a bootstrap estimate of the covariance matrix of θ̂.
Efficiency

Recall Table 6.3.5, which provides efficiency values for either the spatial sign test or the spatial signed-rank test relative to Hotelling's T² test. The calculations were made for the spherical t-distribution for various degrees of freedom and finally for the spherical normal distribution. Now that we have affine invariant sign and signed-rank tests and affine equivariant estimates, we can apply these efficiency results to elliptical t and normal distributions. Hence, we again see the superiority of the sign and signed-rank methods over Hotelling's test and the sample mean. The affine invariant tests and affine equivariant estimates are efficient and robust alternatives to the traditional least squares methods.
In the case of the affine invariant sign test, Randles (2000) presents a power
sensitivity simulation comparing his test to Hotelling’s T 2 test, Blumen’s test,
and Oja’s sign test (6.4.14). In addition to the multivariate normal distribu-
tion, he included t distributions and a skewed distribution. Randles’ affine
invariant sign test performed extremely well. Although Oja’s sign test per-
formed comparably, it is much more computationally intensive than Randles’
test.

6.4.3 The Oja Criterion Function

This method provides a direct approach to affine invariance/equivariance and does not require a transform-retransform technique. It is, however, much more computationally intensive. We only sketch the results in this section and give references where the more detailed derivations can be found. Recall from the univariate location model that L1 and L2 are special cases of methods that are derived from minimizing Σ|xᵢ − θ|ᵐ, for m = 1 and m = 2. Oja (1983) proposed the bivariate objective function D₈(θ) = Σᵢ<ⱼ Aᵐ(xᵢ, xⱼ, θ), where A(xᵢ, xⱼ, θ) is the area of the triangle formed by the three vectors xᵢ, xⱼ, θ. When m = 2, Wilks (1960) showed that D₈(θ) is proportional to the determinant of the classical scatter matrix and the sample mean vector minimizes this criterion. Thus, by analogy with the univariate case, the m = 1 case is called the L1 case. The same results carry over to dimensions greater than 2, in which the triangles are replaced by simplices. For the remainder of the section, m = 1.
We introduce the following notation:

$$A_{ij} = \frac{1}{2}\begin{pmatrix} 1 & 1 & 1 \\ \theta_1 & x_{i1} & x_{j1} \\ \theta_2 & x_{i2} & x_{j2} \end{pmatrix}.$$

Then D₈(θ) = ½Σᵢ<ⱼ abs{det Aᵢⱼ}, where det stands for determinant and abs stands for absolute value. Now if we differentiate this criterion function with
respect to θ₁ and θ₂, we get a new set of estimating equations:

$$S_8(\theta) = \frac{1}{2}\sum_{i=1}^{n-1}\sum_{j=i+1}^n \mathrm{sgn}\{\det A_{ij}\}(x_j^* - x_i^*) = 0\,, \qquad (6.4.12)$$

where x∗ᵢ is the vector xᵢ rotated counterclockwise by π/2 radians; hence, x∗ᵢ = (−xᵢ₂, xᵢ₁)ᵀ. Note that θ enters only through the Aᵢⱼ. The expression in (6.4.12) is (CW for clockwise, CCW for counterclockwise):

$$\mathrm{sgn}\{(x_j^* - x_i^*)^T(\theta - x_i)\}(x_j^* - x_i^*) = \begin{cases} x_j^* - x_i^* & \text{if } x_i \to x_j \to \theta \text{ is CCW} \\ -(x_j^* - x_i^*) & \text{if } x_i \to x_j \to \theta \text{ is CW.} \end{cases}$$

The estimator that solves (6.4.12) is called the Oja median and we are interested in its properties. This estimator minimizes the sum of triangular areas formed by all pairs of observations along with θ. Niinimaa, Nyblom, and Oja (1992) provide a Fortran program for computing the Oja median and discuss further aspects of its computation; see, also, the R package OjaNP. Brown and Hettmansperger (1987a) present a geometric description of the determination of the Oja median. The statistic S₈(0) forms the basis of a sign-type statistic for testing H₀: θ = 0. We refer to this test as the Oja sign test. In order to study the Oja median and the Oja sign test we need once again to determine the matrices A and B. Before doing this we rewrite (6.4.12) in a more convenient form, a form that expresses it as a function of s₁, . . . , sₙ. Recall the polar form of x, (6.3.7), that we have been using and at the same time introduce the vector y as follows:

$$\mathbf{x} = r\begin{pmatrix}\cos\phi \\ \sin\phi\end{pmatrix} = rs\begin{pmatrix}\cos\varphi \\ \sin\varphi\end{pmatrix} = s\mathbf{y}\,.$$

As usual 0 ≤ ϕ < π, s indicates whether x is above or below the horizontal axis, and r is the length of x. Hence, if s = 1 then y = x, and if s = −1 then y = −x, so y is always above the horizontal axis.
Theorem 6.4.3. The following string of equalities is true:

$$\frac{1}{n}S_8(0) = \frac{1}{2n}\sum_{i=1}^{n-1}\sum_{j=i+1}^n \mathrm{sgn}\left\{\det\begin{pmatrix} x_{i1} & x_{j1} \\ x_{i2} & x_{j2}\end{pmatrix}\right\}(x_j^* - x_i^*) = \frac{1}{2n}\sum_{i=1}^{n-1}\sum_{j=i+1}^n s_is_j(s_jy_j^* - s_iy_i^*) = \frac{1}{2}\sum_{i=1}^n s_iz_i$$

where

$$z_i = \frac{1}{n}\sum_{j=1}^{n-1} y_{i+j}^* \quad\text{and}\quad y_{n+i} = -y_i\,.$$
Proof: The first formula follows at once from (6.4.12). In the second formula we need to recall the ∗ operation. It entails a counterclockwise rotation of 90 degrees. Suppose, without loss of generality, that 0 ≤ ϕ₁ ≤ . . . ≤ ϕₙ ≤ π. Then

$$\mathrm{sgn}\left\{\det\begin{pmatrix} x_{i1} & x_{j1} \\ x_{i2} & x_{j2}\end{pmatrix}\right\} = \mathrm{sgn}\left\{\det\begin{pmatrix} s_ir_i\cos\varphi_i & s_jr_j\cos\varphi_j \\ s_ir_i\sin\varphi_i & s_jr_j\sin\varphi_j\end{pmatrix}\right\} = \mathrm{sgn}\{s_is_jr_ir_j(\cos\varphi_i\sin\varphi_j - \sin\varphi_i\cos\varphi_j)\} = s_is_j\,\mathrm{sgn}\{\sin(\varphi_j - \varphi_i)\} = s_is_j\,.$$

Now if xᵢ is in the first or second quadrant then y∗ᵢ = x∗ᵢ = sᵢx∗ᵢ, and if xᵢ is in the third or fourth quadrant then y∗ᵢ = −x∗ᵢ = sᵢx∗ᵢ. Hence, in all cases we have x∗ᵢ = sᵢy∗ᵢ. The second formula now follows. The third formula follows by straightforward algebraic manipulations. We leave these details to the reader; see Exercise 6.8.15.
Based on the notation at the end of the proof of Theorem 6.4.3, we have

$$z_i = \sum_{j=i+1}^n y_j^* - \sum_{j=1}^{i-1} y_j^*\,,\ \ i = 2, \ldots, n-1\,, \qquad z_1 = \sum_{j=2}^n y_j^*\,, \qquad z_n = -\sum_{j=1}^{n-1} y_j^*\,. \qquad (6.4.13)$$

The third formula shows that we have a sign statistic similar to the ones that we have been studying. Under the null hypothesis, (s₁, . . . , sₙ) and (z₁, . . . , zₙ) are independent. Hence, conditionally on z₁, . . . , zₙ (or equivalently conditionally on y₁, . . . , yₙ), the conditional covariance matrix of S₈(0) is Â = ¼Σᵢzᵢzᵢᵀ. A conditional distribution-free test is

$$\text{reject } H_0: \theta = 0 \text{ when } S_8^T(0)\hat{A}^{-1}S_8(0) \ge \chi^2_\alpha(2)\,. \qquad (6.4.14)$$

Theorem 6.4.3 shows that conditionally on the data, the χ2 -approximation is


appropriate. The next theorem shows that the approximation is appropriate
unconditionally as well. For additional discussion of this test see Brown and
Hettmansperger (1989). We want to describe the asymptotically distribution-
free version of the Oja sign test. Then we show that, for elliptical models,
the Oja sign test and Blumen’s test are equivalent. It is left to the exercises
to show that the Oja median is affine equivariant and the Oja sign test is
affine invariant so they compete with Blumen’s invariant test, the affine spa-
tial methods in Section 6.3, and with the L2 methods (vector of means and
Hotelling’s test); see Exercise 6.8.16.
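The conditional test (6.4.14) is easy to carry out using the representation in Theorem 6.4.3. The following base R sketch (our function name; bivariate data only) computes S₈(0) = (1/2)Σ s₍ᵢ₎zᵢ with the zᵢ of (6.4.13) formed from the ϕ-ordered observations, and refers S₈ᵀ(0)Â⁻¹S₈(0) to χ²(2).

oja.sign.test <- function(X) {
  ang <- atan2(X[, 2], X[, 1])
  s <- ifelse(ang >= 0 & ang < pi, 1, -1)    # above/below horizontal axis
  ord <- order(ang %% pi)                    # order the phi angles in [0, pi)
  Y <- (s * X)[ord, ]                        # y_i: observations folded above axis
  Ys <- cbind(-Y[, 2], Y[, 1])               # y_i^*: rotation by pi/2
  n <- nrow(X)
  csum <- apply(Ys, 2, cumsum)
  tot <- csum[n, ]
  Z <- matrix(tot, n, 2, byrow = TRUE) - 2 * csum + Ys   # z_i of (6.4.13)
  S8 <- 0.5 * colSums(s[ord] * Z)            # Theorem 6.4.3 representation
  Ahat <- 0.25 * crossprod(Z)                # conditional covariance matrix
  stat <- drop(t(S8) %*% solve(Ahat) %*% S8)
  c(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}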
Since the Oja sign test is affine invariant, we consider the behavior un-
der spherical models, without loss of generality. The elliptical models can be
reduced to spherical models by affine transformations. The next proposition
shows that zi has a useful limiting value.
Theorem 6.4.4. Suppose that we sample from a spherical distribution centered at the origin. Let

$$z(t) = -\frac{2}{\pi}E(r)\begin{pmatrix}\cos t\pi \\ \sin t\pi\end{pmatrix},$$

then

$$\frac{1}{n^{3/2}}S_8(0) = \frac{1}{2\sqrt{n}}\sum_{i=1}^n s_i\,z\!\left(\frac{i}{n}\right) + o_p(1)\,.$$

Proof: We sketch the argument. A more general result and a rigorous argument can be found in Brown et al. (1992). We begin by referring to formula (6.4.13). Recall that

$$\frac{1}{n}\sum_{i=1}^n y_i^* = \frac{1}{n}\sum_{i=1}^n r_i\begin{pmatrix}-\sin\varphi_i \\ \cos\varphi_i\end{pmatrix}.$$

Consider the second component and let ≐ mean that the approximation is valid up to op(1) terms. From the discussion of maxᵢ|πi/(n + 1) − ϕ₍ᵢ₎| in Theorem 6.4.1, we have

$$\frac{1}{n}\sum_{i=1}^{[nt]} r_i\cos\varphi_i \doteq \frac{1}{n}\sum_{i=1}^{[nt]}\{Er\}\cos\varphi_i \doteq \frac{Er}{\pi}\,\frac{\pi}{n}\sum_{i=1}^{[nt]}\cos\frac{\pi i}{n+1} \doteq \frac{Er}{\pi}\int_0^{\pi t}\cos u\,du = \frac{Er}{\pi}\sin\pi t\,.$$

Furthermore,

$$\frac{1}{n}\sum_{i=[nt]}^n r_i\cos\varphi_i \doteq \frac{Er}{\pi}\int_{\pi t}^{\pi}\cos u\,du = -\frac{Er}{\pi}\sin\pi t\,.$$

Hence the formula holds for the second component. The first component formula follows in a similar way.
This proposition is important since it shows that the Oja sign test is asymp-
totically equivalent to Blumen’s test under elliptical models since they are
both invariant under affine transformations. Hence, the efficiency results for
Blumen’s test carry over for spherical and elliptical models to the Oja sign
test. Also recall that Blumen’s test is locally most powerful invariant for the
class of elliptical models so the Oja sign test should be quite good for elliptical
models in general. The two tests are not equivalent for nonelliptical models.
In Brown et al. (1992) the efficiency of the Oja sign test relative to Blumen’s
test was computed for a class of symmetric densities with contours of the form
|x1 |m + |x2 |m . When m = 2 we have spherical densities, and when m = 1 we
have Laplacian densities with independent marginals. Table 1 of Brown et al.
(1992) shows that the Oja sign test is more efficient than Blumen’s test except
when m = 2 where, of course, the efficiency is 1. Hettmansperger, Nyblom,
and Oja (1994) extend the Oja methods to dimensions higher than 2 in the
one sample case and Hettmansperger and Oja (1994) extend the methods to
higher dimensions for the multisample problem.
In Brown and Hettmansperger (1987a), the idea of an affine invariant rank vector is introduced. The approach is similar to that of Möttönen and Oja (1995) for the spatial rank vector discussed earlier; see Section 6.3.2. The Oja criterion D₈(θ) with m = 1 in Section 6.4.3 is a multivariate extension of the univariate L1 criterion function and we take its gradient to be the centered rank vector. Recall in the univariate case D(θ) = Σ|xⱼ − θ| and the derivative D′(θ) = Σ sgn(θ − xⱼ). Hence, D′(xᵢ) is the centered rank of xᵢ. Likewise the vector centered rank of xₖ is defined to be

$$R_n(x_k) = \nabla D_8(x_k) = \frac{1}{2}\sum_{i<j}\mathrm{sgn}\left\{\det\begin{pmatrix} 1 & 1 & 1 \\ x_{k1} & x_{i1} & x_{j1} \\ x_{k2} & x_{i2} & x_{j2}\end{pmatrix}\right\}(x_j^* - x_i^*)\,. \qquad (6.4.15)$$
Again we use the idea of affine invariant vector rank to define the Oja signed-rank statistic. Let R₂ₙ(xₖ) be the rank vector when xₖ is ranked among the observation vectors x₁, . . . , xₙ and their reflections −x₁, . . . , −xₙ. Then the test statistic is S₉(0) = Σ R₂ₙ(xⱼ). Now R₂ₙ(−xⱼ) = −R₂ₙ(xⱼ), so that the conditional covariance matrix (conditioning on the observed data) is

$$\hat{A} = \sum_{j=1}^n R_{2n}(x_j)R_{2n}^T(x_j)\,.$$

The approximate size α test of H₀: θ = 0 is:

$$\text{Reject } H_0 \text{ if } S_9^T(0)\hat{A}^{-1}S_9(0) \ge \chi^2_\alpha(2)\,. \qquad (6.4.16)$$
In addition, the Hodges-Lehmann estimate of θ based on S₉(θ) ≐ 0 is the Oja median of a set of linked pairwise averages; see Brown and Hettmansperger (1987a) for details.
Hettmansperger, Möttönen, and Oja (1997a, 1997b) extend the affine in-
variant one- and two-sample rank tests to dimensions greater than 2. Because
of affine invariance, Table 6.3.5 provides the efficiencies relative to Hotelling’s
test for a multivariate t distribution; see Möttönen, Hettmansperger, Oja, and
Tienari (1998). Note that the efficiency is quite high even for the multivari-
ate normal distribution. Further, note that this efficiency is the same for all
elliptical normal distributions as well since the test is affine invariant.
Example 6.4.2 (Mathematics and Statistics Exam Scores, Example 6.4.1
continued). We apply the Oja signed-rank test and the Oja HL estimate to

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 418 —


i i

418 CHAPTER 6. MULTIVARIATE

the data of Example 6.4.1. The numerical results are similar to the results of
the affine spatial methods; see Table 6.4.1 for the results. Note that due to
computational complexity it is not possible to bootstrap the covariance matrix
of the Oja HL-estimate. The R library OjaNP can be used for computations.

6.4.4 Additional Remarks


Many authors have worked on the problem of developing multidimensional sign
tests under various invariance conditions. The sign statistics are important for
defining medians, and further in defining the concept of centered rank. Oja and
Nyblom (1989) propose a family of locally most powerful sign tests that are
affine invariant and show that the Blumen (1958) test is optimal for elliptical
alternatives. Using a different approach that involves data-based coordinate
systems, Chaudhuri and Sengupta (1993) introduce a family of affine invariant
sign tests. See also Dietz (1982) for a development of affine invariant sign
and rank procedures based on rotations of the coordinate systems. Another
interesting approach to the construction of a multivariate median and rank is
based on the idea of data depth due to Liu (1990). In this case, the median
is a point contained in a maximum number of triangles formed by the n3
different choices of three data vectors. See, also, Liu and Singh (1993).
Hence, we conclude that if we are fairly certain that we have a spherical
model, in a spatial statistics context, for example, then the spatial median and
the spatial sign test are quite good. If the model is likely to be elliptical with
heavy tails then either Blumen’s test or the affine invariant spatial sign or
spatial signed-rank tests along with the corresponding equivariant estimators
are both statistically and computationally quite efficient. If we suspect that
the model is nonelliptical then the methods of Oja are preferable. On the
other hand, if invariance and equivariance considerations are not relevant then
the componentwise methods should work quite well. Finally, departures from
bivariate normality should be considered. The L1 type methods are good when
there is a heavy-tailed model. However, the efficiency can be improved by
rank-type methods when the tail weight is more moderate and perhaps close
to normality. Even at the bivariate normal model the rank methods lose very
little efficiency when invariance is taken into account. Oja and Randles (2004)
discuss affine invariant rank tests for several samples and, further, discuss tests
of independence.
Consider the situation where multivariate data is collected over groups, i.e.,
the one-way MANOVA model. For such data, discriminant analysis is a two-
stage procedure: separation and allocation. For the traditional least squares
procedure, separation of training data into groups is accomplished by the max-
imization of the Lawley-Hotelling test for differences between group means.
This produces a set of discriminant coordinates which are used to visualize

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 419 —


i i

6.5. ROBUSTNESS OF ESTIMATES OF LOCATION 419

the data. Further, the discriminant representation can be used for allocation
of data of unknown group membership. Crimin, McKean, and Sheather (2007)
proposed a robust approach to discriminant analysis based on efficient robust
discriminant coordinates. These coordinates are obtained by the maximiza-
tion of a Lawley-Hotelling test based on robust estimates. The design matrix
used in their fitting is the usual one-way incidence matrix of zeros and ones;
hence, the procedure uses highly efficient robust estimators to do the fitting.
This produces efficient robust discriminant coordinates which allow the user
to visually assess the differences among groups. In particular, Crimin et al.
showed that the robust procedure based on using the Hettmansperger and
Randles (2002) (HR) estimates of location for each group had quite good ef-
ficiency properties. In the simulation study conducted by Crimin et al. these
efficiency properties were verified for the situations investigated. Other than
the normally distributed data, the HR procedure was more efficient than the
LS procedure in terms of empirical misclassification probabilities. Further,
over the situations investigated (including the elliptical Cauchy distribution),
it was much more efficient than the high breakdown but low efficient dis-
criminant analysis proposed by Hawkins and McLachlan (1997) which uses
minimum covariance determinants (MCD) as an estimator of scatter.

6.5 Robustness of Estimates of Location


In this section we sketch some results on the influence and breakdown points
for the estimators derived from the various estimating equations. Recall from
Theorem 6.1.2 that the vector influence is proportional to the vector Ψ(x).
Typically Ω(x) is a projection and reduces the problem of finding the asymp-
totic distribution of the estimating function √1n S(θ) to a central limit problem.
To determine whether an estimator has bounded influence or not, it is only
necessary to check that the norm of Ω(x) is bounded. Further, recall that
the breakdown point is the smallest proportion of contamination needed to
carry the estimator beyond all bounds. We now briefly consider the different
invariance models:

6.5.1 Location and Scale Invariance: Componentwise


Methods
In the case of component medians, the influence function is given by

ΩT (x) ∝ (sgn(x11 ), sgn(x21 )) .

The norm is clearly bounded. Further, the breakdown point is 50% as it is in


the univariate case. Likewise, for the Hodges-Lehmann component estimates

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 420 —


i i

420 CHAPTER 6. MULTIVARIATE

ΩT (x) ∝ (F1 (x11 ) − 1/2, F2 (x21 ) − 1/2), where Fi (· ) is the ith marginal cdf.
Hence, the influence is bounded in this case as well. The breakdown point is
29%, the same as the univariate case. Note, however, that the componentwise
methods are neither rotation nor affine invariant/equivariant.

6.5.2 Rotation Invariance: Spatial Methods


We assume in this subsection that the underlying distribution is spherical. For
the spatial median, we have Ω(x) = u(x), the unit vector in the x direction.
Hence, again we have bounded influence. Lopuhaä and Rousseeuw (1991) were
the first to point out that the spatial median has 50% breakdown point. The
proof is given in the following theorem. First note from Exercise 6.8.17 that
the maximum breakdown point for any translation equivariant estimator is
[(n+1)/2]
n
and the spatial median is translation equivariant.

b has breakdown point ǫ∗ =


Theorem 6.5.1. The spatial median θ [(n+1)/2]
for
n
every dimension.

Proof: In view of the preceding remarks, we only need to show ǫ∗ ≥ [(n+1)/2] n


.
Let X = (x1 , . . . , xn ) be a collection of n observations in k dimensions. Let
Ym = (y1 , . . . , yn ) be formed from X by corrupting any m observations.
b m ) minimizes P kyi − θk. Assume, without loss of generality, that
Then θ(Y
b
θ(X) = 0. (Use translation equivariance.) We suggest that the reader follow
the argument with a picture in two dimensions.
Let M = maxi kxi k and let B(0, 2M) be the sphere of radius 2M centered
at the origin. Suppose the number of corrupted observations m ≤ [ n−1 2
]. We
b
show that supkθ(Ym )k over Ym is finite. Hence, ǫ ≥ ∗ (n−1)/2+1 (n+1)/2
= n and
n
we will be finished.
Let dm = inf{kθ(Y b m ) − γk : γ in B(0, 2M)}, the distance of θ(Yb m ) from
B(0, 2M). Then the distance of θ(Y b m ) from the origin is kθ(Y
b m )k ≤ dm +2M.
Now

b m )k ≥ kyj k − kθ(Y
kyj − θ(Y b m )k ≥ kyj k − (dm + 2M) . (6.5.1)

b m ) far outside B(0, 2M). In partic-


Suppose the contamination has pushed θ(Y
ular, suppose dm > 2M[(n + 1)/2]. We show that this leads to a contradiction.
We know that X ⊂ B(0, M) and if xk is not contaminated,

b m )k ≥ M + kxk k + dm .
kxk − θ(Y (6.5.2)

Next split the following sum up over contaminated and not contaminated

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 421 —


i i

6.5. ROBUSTNESS OF ESTIMATES OF LOCATION 421

observations using (6.5.1) and (6.5.2).


n
X X X
b m )k ≥
kyi − θ(Y (kyi k − (dm + 2M)) + (kyi k − dm )
i=1 contam   not  
X n−1 n−1
= kyi k − (dm + 2M) + (n − )dm
2 2
X    
n−1 n−1
= kyi k − 2M( ) + dm (n − 2( ))
2 2
X      
n−1 n−1 n−1
> kyi k−2M( )+2M( )(n−2 )
2 2 2
X    
n−1 n−1
= kyi k + 2M( )(n − 1 − 2( ))
2 2
X
≥ kyi k

b m ) minimizes P kyi − θk, hence we have a contradic-


But, recall that θ(Y
 
tion. Thus dm ≤ 2M( n−1
2
). Hence,
 ǫ∗ must be at least [(n−1)/2]+1
n
. The proof
follows, because n−1
2
+ 1 = n+1
2
.

6.5.3 The Spatial Hodges-Lehmann Estimate


This estimate is the spatial median of the pairwise averages: 12 (xi + xj ). It was
first studied in detail by Chaudhuri (1992) and it is the estimate corresponding
to the spatial signed-rank statistic (6.3.16) of Möttönen, and Oja (1995).
From (6.3.21) it is clear that the influence function is bounded. Further,
since it is the spatial median of the pairwise averages, the argument that shows
that the breakdown of the univariate Hodges-Lehmann estimate is 1− √12 ≈ .29
works in the multivariate case; see Exercise 1.12.13 in Chapter 1.

6.5.4 Affine Equivariant Spatial Median


We can represent the affine equivariant spatial median as an M estimate; see
Maronna (1976) or Maronna et al. (2006). Our multivariate estimators θb and
Ux are the solutions of the following M estimating equations:
n n
1X 1X
u1 (di )Ux (xi − θ) = 0, u2 (di )Ux (xi − θ)(xi − θ)T UTx = I (6.5.3)
n i=1 n i=1

where di = kUx (xi − θ)k2 with u1 (d) = d−1/2 and u2 (d) = kd−1 . Because they
b is between (k+1)−1 and k −1 where
are M estimators, the breakdown value for θ
k is the dimension of the underlying population. The asymptotic theory for

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 422 —


i i

422 CHAPTER 6. MULTIVARIATE

b developed in the appendix of Hettmansperger and Randles (2002), shows


θ,
b is B−1 Ux (x − θ)/kUx (xi − θ)k, where
that the influence function for θ
  
1 Ux (x − θ) (x − θ)T UTx
B=E I− ; (6.5.4)
kUx (I − xi − θ)k kUx (xi − θ)k kUx (xi − θ)k

recall (6.3.3). Hence, we see that the influence function is bounded with a pos-
itive breakdown. Note however that the breakdown decreases as the dimension
of the underlying distribution increases.

6.5.5 Affine Equivariant Oja Median


This estimator is affine equivariant and solves the equation (6.4.12). From
the projection representation of the statistic in Theorem 6.4.3 notice that the
vector z(t) is bounded. It then follows that, for spherical models (with finite
first moment), the influence function is bounded. See Niinimaa and Oja (1995)
for a rigorous derivation of the influence function.
The breakdown properties of the Oja median are more interesting. As
shown by Niinimaa, Oja, and Tableman (1990), even though the influence
function is bounded, the estimate can be broken down with just two contam-
inated points; that is, they showed that the breakdown of Oja’s median is
2/n. Further, Niinimaa and Oja (1995) show that the breakdown point of the
Oja median depends on the dispersion of the contaminated data. When the
dispersion of the contaminated data is less than the dispersion of the original
data then the asymptotic breakdown point is positive. If, for example, the
contaminated points are at a single point, then the breakdown point is 1/3.

6.6 Linear Model


We consider the bivariate linear model. As examples of the linear model, we
find bivariate estimates and tests for a general regression effect as well as shifts
in the bivariate two-sample location model and multisample location models.
We focus primarily on componentwise rank-based procedures for estimation
and testing, which are discussed in detail for the general multivariate linear
model in Davis and McKean (1993). We discuss some other methods for the
multiple sample location model in the examples of Section 6.6.1. Spatial and
affine invariant/equivariant methods for the general linear model are discussed
in Oja (2010, Chapter 13).
In Chapter 3, Section 3.2, we present the notation for the univariate linear
model. Here, we think of the multivariate linear model as a series of concate-

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 423 —


i i

6.6. LINEAR MODEL 423

nations of the univariate models. Hence, we introduce


   
Y11 Y12 Y1T
 ..  = (Y (1) , Y (2) ) =  ..  .
Yn×2 =  ... .   .  (6.6.1)
Yn1 Yn2 YnT

The superscript indicates a column, a subscript a row, and, as usual in this


chapter, T denotes transpose. Now the multivariate linear model is

Y = 1αT + Xβ + ǫ , (6.6.2)

where 1 is n × 1 vector of ones, αT = (α(1) , α(2) ), X is n × p full rank, centered


design matrix, β is p × 2 matrix of unknown regression constants, and ǫ is
n × 2 matrix of errors. The rows of ǫ, and hence, Y, are independent and the
rows of ǫ are identically distributed with a continuous bivariate cdf F (s, t).
Model 6.6.2 is the concatenation of two univariate linear models: Y (i) =
1α(i) + Xβ (i) + ǫ(i) for i = 1, 2. We have restricted attention to the bivariate
case to simplify the presentation. In most cases the general multivariate results
are obvious.
We rank within components or columns. Hence, the rank-score of the ith
item in the jth column is:

aij = a(Rij ) = a(R(Yij − xTi β (j) )) (6.6.3)

where Rij is the rank of Yij −xTi β (j) when ranked among Y1j −xT1 β (j) , . . . , Ynj −
xTn β (j) . The rank scores are generated by a(i) = ϕ( n+1 i
), 0 < ϕ(u) <
R R 2
1, ϕ(u)du = 0, and ϕ (u)du = 1; see Section 3.4. Let the score matrix
A be defined as follows:
 
a11 a12
 ..  = (a(1) , a(2) )
A =  ... .  (6.6.4)
an1 an2

so that each column is the set of rank scores within the column.

The criterion function is


n
X
D(β) = aTi ri (6.6.5)
i=1

where aTi = (ai1 ai2 ) = (a(R(Yi1 − xTi β (1) ), a(R(Yi2 − xTi β (2) )) and rTi = (Yi1 −
xTi β (1) , Yi2 −xTi β (2) ). Note at once that this is an analog, using inner products,
of the univariate criterion in Section 3.2.1. In fact, D(β) is the sum of the

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 424 —


i i

424 CHAPTER 6. MULTIVARIATE

corresponding univariate criterion functions. The matrix of the negatives of


the partial derivatives is:
 P P 
xi1 ai1 xi1 ai2 X X 
 .. .. 
L(β) = XT A =   = a x , a x ; (6.6.6)
P . P .
i1 i i2 i
xip ai1 xip ai2

see Exercise 6.8.18 and equation (3.2.11). Again, note that the two columns in
(6.6.6) are the estimating functions for the two concatenated univariate linear
models and xi is the ith row of X written as a column.
Hence, the componentwise multivariate R estimator of β is β b that mini-
.
mizes (6.6.5) or solves L(β) = 0. Further, L(0) is the basic quantity that we
use to test H0 : β = 0. We must statistically assess the size of L(0) and reject
H0 and claim the presence of a regression effect when L(0) is “too large” or
“too far from the zero matrix.”
We first consider testing H0 : β = 0 since the distribution theory of the test
statistic is useful later for the asymptotic distribution theory of the estimate.
For the linear model we need some results on direct products; see Magnus
and Neudedker (1988) for a complete discussion. We list here the results that
we need:
1. Let A and B be m × n and p × q matrices. The mp × nq matrix A ⊗ B
defined by  
a11 B · · · a1n B
 .. 
A ⊗ B =  ... .  (6.6.7)
am1 B · · · amn B
is called the direct product or Kronecker product of A and B.
2.
(A ⊗ B)T = AT ⊗ BT , (6.6.8)
(A ⊗ B)−1 = A−1 ⊗ B−1 , (6.6.9)
(A ⊗ B)(C ⊗ D) = (AC ⊗ BD) . (6.6.10)

3. Let D be a m × n matrix. Then Dcol is the mn × 1 vector formed by


stacking the columns of D. We then have
tr(ABCD) = (DTcol )T (CT ⊗ A)Bcol = DTcol (A ⊗ CT )(BT )col .
(6.6.11)
4.
(AB)col = (BT ⊗ I)Acol = (I ⊗ A)Bcol . (6.6.12)

These facts are used in the proofs of the theorems in the rest of this chapter.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 425 —


i i

6.6. LINEAR MODEL 425

6.6.1 Test for Regression Effect


As mentioned above, we base the test of H0 : β = 0 on the size of the random
matrix L(0). We deal with this random matrix by rolling out the matrix by
columns. Note from (6.6.4) and (6.6.6) that L(0) = X′ A = (X′ a(1) , X′ a(2) ).
Then we define the vector
 T (1)   T   (1) 
X a X 0 a
Lcol = T (2) = T . (6.6.13)
X a 0 X a(2)
Now from the discussion in Section 3.5.1, let the column variances and covari-
ances be
n Z
2 1 X 2
σa(i) = aji → σi = ϕ2 (u) du = 1
2
n − 1 j=1
n Z
2 1 X
σa(1) a(2) = aj1 aj2 → σ12 = ϕ(F1 (s))ϕ(F2 (t)) dF (s, t),
n − 1 j=1

where F1 (s) and F2 (t) are the marginal cdfs of F (s, t). Since the ranks are
centered and using the same argument as in Theorem 3.5.1, E(Lcol ) = 0 and
 
σa2(1) XT X σa(1) a(2) XT X
V = Cov(Lcol ) =
σa(1) a(2) XT X σa2(2) XT X
 
1 
= A A ⊗ XT X .
T
(6.6.14)
n−1
Further,  
1 1 σ12
V→ ⊗Σ , (6.6.15)
n σ12 1
where n−1 XT X → Σ and Σ is positive definite.
The test statistic for H0 : β = 0 is the quadratic form
 
AR = LTcol V−1Lcol = (n − 1)LTcol (AT A)−1 ⊗ (XT X)−1 Lcol (6.6.16)
where we use a basic formula for finding the inverse of a direct product; see
(6.6.9). Before discussing the distribution theory we record one final result
from traditional multivariate analysis:
AR = (n − 1)trace{LT (XT X)−1 L(AT A)−1 } ; (6.6.17)
see Exercise 6.8.19. This result is useful in translating a quadratic form involv-
ing a direct product into a trace involving ordinary matrix products. Expres-
sion (6.6.17) corresponds to the Lawley-Hotelling trace statistic based on
ranks within the components. The following theorem summarizes the distri-
bution theory needed to carry out the test.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 426 —


i i

426 CHAPTER 6. MULTIVARIATE

Theorem 6.6.1. Suppose H0 : β = 0 is true and the conditions in Section


3.4 hold. Then
P0 (AR ≥ χ2α (2p)) → α as n → ∞
where χ2α (2p) is the upper α percentile from a chisquared distribution with 2
degrees of freedom.

Proof: This theorem follows along the same lines as Theorem 3.5.2. Use a
projection to establish that √1n Lcol is asymptotically normally distributed
and then AR is asymptotically chisquared. The details are left as an Exercise
6.8.20; however, the projection is provided below for use with the estimator.
 T (1) 
1 1 X ϕ
√ Lcol = √ + op (1) (6.6.18)
n n XT ϕ(2)
T
where ϕ(i) = (ϕ(Fi (ǫ1i ) . . . ϕ(Fi (ǫni )) i = 1, 2 and F1 , F2 are the marginal cdfs.
i
Recall also that a(i) = ϕ( n+1 ) where ϕ(· ) is the score generating function.
The asymptotic covariance matrix is given in (6.6.15).

Example 6.6.1 (Multivariate Mann-Whitney-Wilcoxon


√ Test). We now spe-
i
cialize to the Wilcoxon score function a(i) = 12( n+1 − .5) and consider the
two-sample model. The testP is a multivariate version
P 2 of the Mann-Whitney-
2 1 n
Wilcoxon test. Note that a(i) = 0, σa = n−1 a (i) = n+1 −→ 1, and
n   
12 X Ri1 1 Ri2 1
σa(1) a(2) = − −
n − 1 i=1 n + 1 2 n+1 2

where R11 , . . . , Rn1 are the ranks of the combined samples in the first com-
ponent and similarly for R12 , . . . , Rn2 for the second component. Note that
n
σa(1) a(2) = n+1 rs , where rs is Spearman’s Rank Correlation Coefficient.
Hence,
 n
    
1 T n+1
σa(1) a(2) n 1 rs 1 σ12
A A= n = →
n−1 σa(1) a(2) n+1 n + 1 rs 1 σ12 1

where ZZ   
1 1
σ12 = 12 F1 (r) − F2 (s) − dF (r, s)
2 2
depends on the underlying bivariate distribution.
Next, we must consider the design matrix X for the two-sample model.
Recall (2.2.1) and (2.2.2) which cast the two-sample model as a linear model in
the univariate case. The design matrix (or vector in this case) is not centered.
For convenience we modify C in (2.2.1) to have 1 in the first n1 places and

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 427 —


i i

6.6. LINEAR MODEL 427

0 elsewhere. Note that the mean of C is nn1 and subtracting this from the
elements of C yields the centered design:
 
n2
 .. 
 . 
1 n2 

X=  
n  −n1 
 . 
 .. 
−n1

where n2 appears n1 times. Then XT X = n1nn2 and n1 XT X = nn1 n2 2 −→ λ1 λ2 .


We assume as usual that 0 < λi < 1, i = 1, 2.
Now L = L(0) = (l1 , l2 ) and li = XT a(i) . It is easy to see that
Xn1 n1  
√ X Rji 1
li = aji = 12 − , i = 1, 2 .
j=1 j=1
n + 1 2

So li is the centered and scaled sum of ranks of the first sample in the ith
component.
Now Lcol = (l1 , l2 )T has an approximate bivariate normal distribution
with covariance matrix:
 
1 T T n1 n2 T n1 n2 1 rs
Cov(Lcol ) = (A A) ⊗ (X X) = A A= .
n−1 n(n − 1) n + 1 rs 1
Note that σ12 is unknown but estimated by Spearman’s rank correlation co-
efficient rs (see above discussion). Hence the test is based on AR in (6.6.17).
It is easy to invert Cov(Lcol ) and we have (see Exercise 6.8.20)
n+1 2 2 1  ∗2 ∗2 ∗ ∗

AR = {l + l − 2r s l1 l2 } = l + l − 2r s l l ,
n1 n2 (1 − rs2 ) 1 2
1 − rs2 1 2 1 2

where l1∗ and l2∗ are the standardized MWW statistics. We reject H0 : β =
0 at approximately level α when AR ≥ χ2α (2). The test statistic AR is a
quadratic form in the component Mann-Whitney-Wilcoxon rank statistics and
rs provides the adjustment for the correlation between the components.
Example 6.6.2 (Brains of Mice Data). For this example, we consider bivari-
ate data on levels of certain biochemical components in the brains of mice;
see the url listed in the Preface for the tabled data. The treatment group re-
ceived a drug which was hypothesized to alter these levels. The control group
received a placebo.
The ranks of the combined treatment and control data for each component
are given in the table, under component ranks. The Spearman rank correlation

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 428 —


i i

428 CHAPTER 6. MULTIVARIATE

coefficient is rs = .149, the standardized MWW statistics are l1∗ = −2.74


and l2∗ = 2.17. Hence AR = 14.31 with the p-value of .0008 based on a χ2 -
distribution with 2 degrees of freedom. Panels A and B of Figure 6.6.1 show,
respectively, plots of the bivariate data and the component ranks. The strong
separation between treatment and control is clear from the plot. The treatment
group contains an outlier which is brought in by the component ranking.
We have discussed the multivariate version of the MWW test based on
the centered ranks of the combined data where the ranking mechanism is
represented by the matrix A. Given A and the design matrix X, the test
statistic AR can be computed. Recall from Section 6.3.2 P that Möttöen and
Oja (1995) introduced the vector spatial rank R(xi) = j u(xi − xj ), where
u(x) = kxk−1 x is a unit vector in the direction of x. In Section 6.4.3, an affine
rank vector R(xi ) is given by (6.4.15). Both spatial and affine rank vectors
are centered. Let R(xi) be the ith row of A. Note that in these two cases the
columns of A are not of length 1. Nevertheless, from (6.6.17), we have
n(n − 1)
AR = [l1 l2 ](AT A)−1[l1 l2 ]T
n1 n2
 
n(n − 1) 1 l12 l22 l1 l2
= + − 2r (1) (2) ,
n1 n2 1 − r 2 ka(1) k2 ka(2) k2 ka ka k
where r is the correlation coefficient between the two columns of A; see Brown
and Hettmansperger (1987b). The table of data at the url cited in the Preface
for this example, contains the affine rank vectors. The corresponding affine
invariant MWW test is AR = 15.69 with an approximate p-value of .0004
based on a χ2 (2)-distribution. See Exercise 6.8.21.
The methods of Section 6.4.2 could also be adapted to develop affine sign
and rank methods; see Oja (2010). The advantage of that approach is compu-
tational efficiency; it can be used for much larger data sets and more complex
designs.
Example 6.6.3 (Multivariate Kruskal-Wallis Test). In this example we de-
velop the multivariate version of the Kruskal-Wallis statistic for use in a mul-
tivariate one-way layout; see Section 4.2.2 for the univariate case. We suppose
we have k samples from k independent distributions. The n × (k − 1) design
matrix is given by  
0n1 0n1 . . . 0n1
 1n 0n . . . 0n 
 2 2 2 
 0n 1n . . . 0n 
C= 3 3 3 
 .. .. .. .. 
 . . . . 
0nk 0nk . . . 1nk
ni
and the column means are c′ = (λ2 , . . . , λk ) where λi = n
. The centered
design is X = C − 1c′ and has full column rank k − 1.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 429 —


i i

6.6. LINEAR MODEL 429

Figure 6.6.1: Panel A: Plot of the data for the brains of mice data; Panel B:
Plot of the corresponding ranks of the brains of mice data.

Panel A Panel B

T T
C
1.4

20
C
C
C
1.2

C
T

15
Component 2 responses

T C

Component 2 ranks
1.0

T
C

C T
0.8

10
T
C
T
0.6

C
T
C C

5
C
C T C T
TC T
C C T
0.4

C T
T T
T
C T
T
T T
T T

0.8 0.9 1.0 1.1 1.2 1.3 1.4 5 10 15 20

Component 1 responses Component 1 ranks

In this design the first of the k populations is taken as the reference pop-
ulation with location (α1 , α2 )T . The ith row of the β matrix is the vector of
shift parameters for the (i + 1)st population relative to the first population.
We wish to test H0 : β = 0 that all populations have the same (unknown)
location vector.
The matrix A = (a(1) , a(2) ) has the centered and scaled Wilcoxon scores of
the previous example. Hence, a(1) is the vector of rank scores for the combined
k samples in the first component. Since the rank scores are centered, we have

XT a(i) = (C − 1cT )T a(i) = CT a(i)

and the second version is easier to compute. Now L(0) = (L(1) , L(2) ) and the
hth component of column i is
 
√ X Rji 1
lhi = 12 −
j∈Sh
n+1 2
√  
12 n+1
= nh Rhi −
n+1 2

where Sh is the index set corresponding to the hth sample and Rhi is the
average rank of the hth sample in the ith component.
1
As in the previous example, we replace n−1 AT A by its limit with 1 on the
main diagonal and σ12 off the diagonal. Then let ((σ ij )) be the inverse matrix.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 430 —


i i

430 CHAPTER 6. MULTIVARIATE

This is easy to compute and is useful below. The test statistic is then, from
(6.6.16)
 11 T −1 12 T −1   (1) 
(1)T (2)T σ (X X) σ (X X) L
AR ≈ (L , L ) 12 T −1 22 T −1
σ (X X) σ (X X) L(2)
T T T
= σ 11 L(1) (XTX)−1 L(1) +2σ 12 L(1) (XT X)−1 L(2) +σ 22 L(2) (XTX)−1 L(2) .
The ≈ indicates that the right side contains asymptotic quantities which must
be estimated in practice. Now
X k k  
(1)T T −1 (1) −1 2 12 X n+1 n
L (X X) L = nj lj1 = 2
nj R j1 − = H1
j=1
(n + 1) j=1
2 n + 1

where H1 is the Kruskal-Wallis statistic computed on the first component.


Similarly,
k   
(1)T T −1 (2) 12 X n+1 n+1 nH12
L (X X) L = 2
nj Rj1 − Rj2 − =
(n + 1) j=1 2 2 n+1

and H12 is a cross component statistic. Using Spearman’s rank correlation


coefficient rs to estimate σ12 , we get
1
AR = {H1 − 2rs H12 + H2 }.
1 − rs2
The test rejects the null hypothesis of equal location vectors at approximately
level α when AR ≥ χ2α (2(k − 1)).
In order to compute the test, first compute componentwise rankings. We
can display the means of the rankings in a 2 × k table as follows:

Treatment
1 2 ··· k
Component 1 R11 R21 ··· R1k
Component 2 R12 R22 ··· R2k

Then use Minitab or some other package to find the two Kruskal-Wallis statis-
tics. To compute H12 either use the formula above or use
k 
X nj 
H12 = 1− Zj1 Zj2 , (6.6.19)
j=1
n
q
where Zji = (Rji − (n + 1)/2)/ VarRji and VarRji = (n − nj )(n + 1)/nj ; see
Exercise 6.8.22. The package Minitab lists Zji in its output.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 431 —


i i

6.6. LINEAR MODEL 431

The last example shows that in the general regression problem with
Wilcoxon scores, if we wish to test H0 : β = 0, the test statistic (6.6.16)
can be written as
1 (1)T (1)T T
AR = 2 {L (XT X)−1 L(1) −2σc
12 L (XT X)−1L(2) +L(2) (XT X)−1 L(2) }
1− σc
12
n
where the estimate of σ12 can be taken to be rs or n+1 rs and rs is Spearman’s
rank correlation coefficient and
√ n   n
12 X n+1 X
lhi = Rji − xjh = a(R(Yji))xjh .
n + 1 j=1 2 j=1

Then reject H0 : β = 0 when AR ≥ χ2α (2p).

6.6.2 The Estimate of the Regression Effect


In the introduction to Section 6.6, we pointed out that the R estimate βb solves
.
L(β) = 0, (6.6.7). Recall the representation of the R estimate in the univariate
case given in Corollary 3.5.2. This immediately extends to the multivariate
case as
 −1
√ 1 T 1
b
n(β − β 0 ) = X X √ (τ1 XT ϕ(1) , τ2 XT ϕ(2) ) + op (1) (6.6.20)
n n
T
where ϕ(i) = (ϕ(Fi (ǫ1i )), . . . , ϕ(Fi (ǫni ))), i = 1, 2 Further, τi is given by
(3.4.4) and we define the matrix τ by τ = diag{τ1 , τ2 }. To investigate the
asymptotic distribution of the random matrix (6.6.20), we again roll it out by
columns. We need only consider the linear approximation on the right side.
Theorem 6.6.2. Assume the regularity conditions in Section 3.4. Then, if β
is the true matrix,
     
√ D 1 σ12
b
n(β col − β col ) → N2p 0, τ τ ⊗Σ −1
σ12 1
where ZZ
σ12 = ϕ(F1 (s))ϕ(F2 (t))dF (s, t) , τ = diag{τ1 , τ2 }

and τi is given by (3.4.4), and n1 X′ X −→ Σ , positive definite.


Proof: We sketch the argument based on (6.6.1), (6.6.13), and Theorem 3.5.2.
Consider, with Σ−1 replaced by ( n1 X′ X)−1 ,
    !
1 τ1 Σ−1 XT ϕ(1) τ1 Σ−1 0 √1 XT ϕ(1)
n
√ = .
n τ2 Σ−1 XT ϕ(2) 0 τ2 Σ−1 √1 XT ϕ(2)
n

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 432 —


i i

432 CHAPTER 6. MULTIVARIATE

The multivariate central limit theorem establishes the asymptotic multivariate


T
normality. From the discussion after (6.6.1), we have Eϕ(i) ϕ(i) = I, i = 1, 2
T
and Eϕ(1) ϕ(2) = σ12 I. Let V denote the covariance matrix of the above
vector. Then
   
τ1 Σ−1 0 Σ σ12 Σ τ1 Σ−1 0
V =
0 τ2 Σ−1 σ12 Σ Σ 0 τ2 Σ−1
   
τ12 Σ−1 τ1 τ2 σ12 Σ−1 τ12 τ1 τ2 σ12
= = ⊗ Σ−1
τ1 τ2 σ12 Σ−1 τ22 Σ−1 τ1 τ2 σ12 τ22
   
1 σ12
= τ τ ⊗ Σ−1
σ12 1

√ b
and this is the asymptotic covariance matrix for n(β col − β col ).
√ We remind the reader √thatR when we use the Wilcoxon score ϕ(u) =
12(u − 2 ), then τi−1 = 12 fi2 (x)dx, fi the marginal pdf i = 1, 2 and
1
n
b12 = n+1
σ rs , where rs is Spearman’s rank correlation coefficient. See Section
3.7.1 for a discussion of the estimation of τi .

6.6.3 Tests of General Hypotheses


Recall the model (6.6.1) and let the matrix M be r × p of rank r and the
matrix K be 2 × s of rank s. The matrices M and K are fully specified by the
researcher. We consider a test of H0 : MβK = 0. For example, when K = I2 ,
and M = (O Ir ) where O denotes the r × (p − r) matrix of zeros, we have
H0 : Mβ = 0 and this null hypothesis specifies that the last r parameters in
both components are 0. This is the usual subhypothesis in the linear model
applied to both components. Alternatively we might let M = Ip , p × p, and
 
1
K= .
−1

Then H0 : βK = 0 and we test the null hypothesis that the parameters of


the two concatenated linear models are equal: βi1 = βi2 for i = 1, . . . , p. This
is appropriate for a pre-post test model. Thus, we generalize (3.6.1) to the
multivariate linear model. The development proceeds in steps beginning with
H0 : Mβ = 0, i.e. K = I2 .

Theorem 6.6.3. Under H0 : Mβ = 0


     
√ D 1 σ
b col → N2r 0, τ
n(Mβ) 12 −1 T
τ ⊗ [MΣ M ]
σ12 1

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 433 —


i i

6.6. LINEAR MODEL 433

here τ , σ12 , Σ are given in Theorem 6.6.2. Let V denote the asymptotic co-
variance matrix and let V ∗ = n(Mβ) b T V b col . Then
b −1 (Mβ)
col

b T {[b 1 b col
V∗ = β col τ AT Ab τ ]−1 ⊗ MT (M(XT X)−1 MT )−1 M}β
n−1
b T (M(XT X)−1 MT )−1 (Mβ)
= trace{(Mβ) b
1
·[b
τ AT Abτ ]−1 } (6.6.21)
n−1

is asymptotically χ2 (2r). Note that we have estimated unknown parameters in


the asymptotic covariance matrix V.

Proof: First note that


 
b M 0 b
(Mβ) col = β col .
0 M

Using Theorem 6.6.2, the asymptotic covariance is, with τ = diag{τ1 , τ2 },


       T 
M 0 1 σ12 −1 M 0
V = τ τ ⊗Σ
0 M σ12 1 0 MT
    
M 0 τ1 Σ−1 τ1 σ12 Σ−1 MT 0
=
0 M τ2 σ12 Σ−1 τ2 Σ−1 0 MT
 
τ1 MΣ−1 MT τ1 σ12 MΣ−1 M T
= −1 T
τ2 σ12 MΣ M τ2 MΣ−1 MT
   
1 σ12
= τ τ ⊗ MΣ−1 MT .
σ12 1

Hence, by the same argument,


 T   
b T M 0 −1 M 0 b
β col V β col =
0 MT 0 M
 
b T 1 σ12 b
β col {[τ τ ]−1 ⊗ MT (MΣ−1 MT )−1 M}β col =
σ12 1
 
b T −1 T −1 b 1 σ12
trace{(Mβ) (MΣ M ) (Mβ)[τ τ ]−1 } .
σ12 1

Using tr to denote trace, denote the test statistic, (6.6.21), defined in the
last theorem to be

b T (M(XT X)−1 MT )−1 (Mβ)[b


b τ 1
QM V R = tr{(Mβ) AT Ab
τ ]−1 } . (6.6.22)
n−1

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 434 —


i i

434 CHAPTER 6. MULTIVARIATE

Then the corresponding level α asymptotic decision rule is:

Reject H0 : Mβ = 0 in favor of HA : Mβ 6= 0 if QM V R ≥ χα (2r) . (6.6.23)

The next theorem describes the test when only K is involved. After that
we put the two results together for the general statement.
√ b
Theorem 6.6.4. Under H0 : βK = 0, where K is a 2 × s matrix, n(βK) col
is asymptotically
     
T 1 σ12 −1
Nps 0, K τ τK ⊗ Σ
σ12 1

where τ , σ12 , and Σ are given in Theorem 6.6.2. Let V denote the asymptotic
covariance matrix. Then

b T V b col = trace{(βK)
b −1 (βK) b T (XT X)(βK)[Kb
b 1
n(βK) col τ AT Ab
τ K]−1 }
n−1

is asymptotically χ2 (ps).
cT K) T
 cT
Proof: First note that (β col = K ⊗ I β col . Then from Theorem 6.6.2,
√ cT
the asymptotic covariance matrix of nβ col is
   
√ cT 1 σ12
AsyCov( nβ col ) = τ τ ⊗ Σ−1 .
σ12 1
√ b
Hence, the asymptotic covariance matrix of n(βK) col is
    
√  1 σ12 T
b
AsyCov( n(βK) T −1
KT ⊗ I
col ) = K ⊗I τ
σ12 1
τ ⊗Σ
   
T 1 σ12 −1
= K τ τ ⊗Σ (K ⊗ I)
σ12 1
   
T 1 σ12
= K τ τ K ⊗ Σ−1 ,
σ12 1

which is the desired result. The asymptotic normality and chisquare distribu-
tion follow from Theorem 6.6.2.
The previous two theorems can be combined to yield the general case.

Theorem 6.6.5. Under H0 : MβK = 0,


   
√ D 1 σ
b col −→ Nrs 0, [K τ
n(MβK) T 12
τ K] ⊗ MΣMT −1
.
σ12 1

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 435 —


i i

6.6. LINEAR MODEL 435

b then, letting tr
If V is the asymptotic covariance matrix with estimate V
denote the trace operator,
b T V
n(MβK) b col =
b −1 (MβK)
col

b T {[KT τb 1 AT Ab
(MβK) b col =
τ K]−1 ⊗ [M(XT X)−1 MT ]−1 }(MβK)
col
n−1
b T [M(XT X)−1 MT ]−1 (MβK)[K
b T 1
tr{(MβK) τb AT Abτ K]−1 } (6.6.24)
n−1
has an asymptotic χ2 (rs) distribution.
The last theorem provides great flexibility in composing and testing hy-
potheses in the multivariate linear model. We must estimate the matrix β
along with the other parameters familiar in the linear model. However, once
we have these estimates it is a simple series of matrix multiplications and the
trace operation to yield the test statistic.
Denote the test statistic, (6.6.24), defined in the last theorem to be

QM V RK = (6.6.25)
b T [M(XT X)−1 MT ]−1 (MβK)[K
b T 1
tr{(MβK) τb AT Ab
τ K]−1 }.
n−1
Then the corresponding level α asymptotic decision rule is:

Reject H0 : MβK = 0 in favor of HA : MβK 6= 0 if QM V RK ≥ χα (rs) .


(6.6.26)
The test statistics QM V R and QM V RK are extensions to the multivariate
linear model of the quadratic form test statistic Fϕ,Q , (3.6.12). The score
or aligned test and the drop in dispersion test are also available. Davis and
McKean (1993) develop these in detail and provide the rigorous development
of the asymptotic theory. See also Puri and Sen (1985) for a development of
rank methods in the multivariate linear model.
In traditional analysis, based on the least squares estimate of the matrix of
regression coefficients, there are several tests of the hypothesis H0 : MβK = 0.
The test statistic QM V RK , (6.6.24), is an analogue of the Lawley (1938) and
Hotelling (1951) trace criterion. This traditional test statistic is given by
n o
b T T −1 T −1
QLH = trace (Mβ LS K) [M(X X) M ] Mβ LS K(K ΛK) b Tb −1
,
(6.6.27)
b ′ −1 ′
where β LS = (X X) X Y is the least squares estimate of the matrix of
regression coefficients β and Λ b is the usual estimate of Λ, the covariance
matrix of the matrix of errors ǫ, given by
b )′ (Y − Xβ
b = (Y − Xβ
Λ b )/(n − p − 1) . (6.6.28)
LS LS

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 436 —


i i

436 CHAPTER 6. MULTIVARIATE

Under the above assumptions and the assumption that Λ is positive definite
and assuming H0 : MβK = 0 is true, QLH has an asymptotic χ2 distribution
with rs degrees of freedom. This type of hypothesis arises in profile analysis;
see Chinchilli and Sen (1982) for this application. In order to illustrate these
tests, we complete this section with an example.
Example 6.6.4 (Tablet Potency Data). The following data are the results
from a pharmaceutical experiment on the effects of four factors on five mea-
surements of a tablet:
Responses Factors Covariate
POT2 POT4 RSDCU HARD H2 O SAE SAI ADS TYPE POT0
7.94 3.15 1.20 8.50 0.188 1 1 1 1 9.38
8.13 3.00 0.90 6.80 0.250 1 1 1 -1 9.67
8.11 2.70 2.00 9.50 0.107 1 1 -1 1 9.91
7.96 4.05 2.30 6.00 0.125 1 1 -1 -1 9.77
7.83 1.90 0.50 9.80 0.142 -1 1 1 1 9.50
7.91 2.30 0.90 6.60 0.229 -1 1 1 -1 9.35
7.82 1.40 1.10 8.43 0.112 -1 1 -1 1 9.58
7.42 2.60 2.60 8.50 0.093 -1 1 -1 -1 9.69
8.06 2.00 1.90 6.17 0.207 1 -1 1 1 9.62
8.51 2.80 1.70 7.20 0.184 1 -1 1 -1 9.89
7.88 3.35 4.70 9.30 0.107 1 -1 -1 1 9.80
7.58 3.05 4.00 8.10 0.102 1 -1 -1 -1 9.73
8.14 1.20 0.80 7.17 0.202 -1 -1 1 1 9.51
8.06 2.95 2.50 7.80 0.027 -1 -1 1 -1 9.82
7.31 1.85 2.10 8.70 0.116 -1 -1 -1 1 9.20
8.66 4.10 3.60 6.40 0.114 -1 -1 -1 -1 9.53
8.16 3.95 2.00 8.00 0.183 0 0 0 1 9.67
8.02 2.85 1.10 6.61 0.139 0 0 0 -1 9.41
8.03 3.20 3.60 9.80 0.171 0 1 0 1 9.62
7.93 3.20 6.10 7.33 0.152 0 1 0 -1 9.49
7.84 3.95 2.00 7.70 0.165 0 -1 0 1 9.96
7.59 1.15 2.10 7.03 0.149 0 -1 0 -1 9.79
8.28 3.95 0.70 8.40 0.195 1 0 0 1 9.46
7.75 3.35 2.20 6.37 0.168 1 0 0 -1 9.78
7.95 3.85 7.20 9.30 0.158 -1 0 0 1 9.48
8.69 2.80 1.30 6.57 0.169 -1 0 0 -1 9.46
8.38 3.50 1.70 8.00 0.249 0 0 1 1 9.73
8.15 2.00 2.30 6.80 0.189 0 0 1 -1 9.67
8.12 3.85 2.50 7.90 0.116 0 0 -1 1 9.84
7.72 3.50 2.20 5.60 0.110 0 0 -1 -1 9.84
7.96 3.55 1.80 7.85 0.135 0 0 0 1 9.50
8.20 2.75 0.60 7.20 0.161 0 0 0 -1 9.78
8.10 3.30 0.97 8.73 0.152 0 0 0 1 9.71
8.16 3.90 2.40 7.50 0.155 0 0 0 -1 9.57

There are n = 34 data cases. The five responses are: (POT2), potency of
the tablet at the end of 2 weeks; (POT4), potency of the tablet at the end
of 4 weeks; the third and fourth responses are measures of the tablet purity
(RSDCU) and hardness (HARD); and the fifth response is its water content
(H2 O); hence, we have a 5-dimensional response rather than the bivariate
responses discussed so far. This means that the degrees of freedom are 5r rather
than 2r in Theorem 6.6.3. The factors are: SAI, the amount of intragranular
steric acid, which was set at the three levels −1, 0, and 1; SAE, the amount of
extragranular steric acid, which was set at the three levels −1, 0, and 1; ADS,

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 437 —


i i

6.6. LINEAR MODEL 437

Table 6.6.1: Tests of the Effects for the Potency Data


Effects M-matrix df QM V R p-value QLH p-value
Main [I4 O4×10 ] 20 179.6 .00 91.8 .00
Higher Order [O9×4 I9 09×1 ] 45 102.1 .00 70.7 .01
Interaction [O6×4 I6 O6×4 ] 30 70.2 .00 52.2 .01
Quadratic [O3×10 I3 03×1 ] 15 34.5 .00 18.7 .23
Covariate [O1×13 1] 5 3.88 .57 4.34 .50

the amount of cross carmellose sodium, which was set at the three levels −1, 0,
and 1; and TYPE of steric acid which was set at two levels −1 and 1. The
initial potency of the compound, POT0, served as a covariate. This data were
used in an example in the article by Davis and McKean (1993) and much of
our discussion below is taken from this article.
This data set was treated as a univariate model for the response POT2
in Chapter 3; see Examples 3.3.3 and 3.9.2. As our full model we choose the
same model described in expression (3.3.1) of Example 3.3.3. It includes: the
linear effects of the four factors; six simple two-way interactions between the
factors; the three quadratic terms of the factors SAI, SAE, and ADS; and the
covariate for a total of fifteen terms. The need for the quadratic terms was
discussed in the diagnostic analysis of this model for the response POT2; see
Example 3.9.2. Hence, Y is 34 × 5, X is 34 × 14, and β is 14 × 5.
Table 6.6.1 displays the results for the test statistic QM V R , (6.6.22) for the
usual ANOVA hypotheses of interest: main effects, interaction effects broken
down as simple two-way and quadratic, and covariate. Also listed are the
hypothesis matrices M for each effect where the notation Ot×u represents a
t × u matrix of 0s and It is the t × t identity matrix. Also given for comparison
purposes are the results of the traditional Lawley-Hotelling test, based on the
statistic (6.6.27) with K = I5 . For example, M = [I4 O4×10 ] yields a test of
the hypothesis:

H0 : β11 = · · · = β41 = 0, β12 = · · · = β42 = 0, . . . , β15 = · · · = β45 = 0 ;

that is, the linear terms vanish in all five components. Note that M is 4 × 14
so r = 4 and hence we have 4 × 5 = 20 degrees of freedom in Theorem 6.6.3.
The other hypothesis matrices are developed similarly. The robust analysis
indicates that all effects are significant except the covariate effect. In particular
the quadratic effect is significant for the robust analysis but not for the Lawley-
Hotelling test. This confirms the discussion on LS and robust residual plots
for this data set given in Example 3.9.2.
Are the effects of the factors different on potencies of the tablet after 2
weeks, POT2, or 4 weeks, POT4? This question can be answered by evaluating
the statistic QM V RK , (6.6.26), for hypotheses of the form MβK, for the ma-
trices M given in Table 6.6.1 and the 5 × 1 matrix K where K′ = [1 −1 0 0 0].
For example, β11 , . . . , β41 are the linear effects of SAE, SAI, ADS, and TYPE

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 438 —


i i

438 CHAPTER 6. MULTIVARIATE

Table 6.6.2: Contrast Analyses between Responses POT2 and POT4


All Terms Covariate main effect Higher order Interaction Quadratic
except mean terms terms terms terms terms
df 14 1 4 9 6 3
QM V RK 21.93 3.00 5.07 12.20 8.22 4.77
p-value .08 .08 .28 .20 .22 .19
QLH 22.28 2.67 6.36 11.73 6.99 5.48
p-value .07 .10 .17 .23 .32 .14

on PO2 and β12 , . . . , β42 are the linear effects on PO4. We may want to test
the hypothesis
H0 : β11 = β12 , . . . , β41 = β42 .
The M matrix picks the appropriate βs within a component and the K
matrix compares the results across components. From Table 6.6.1, choose
M = [I4 O4×10 ]. Then
 
β11 β12 · · · β15
 .. .. .. 
Mβ =  . . .  .
β41 β42 · · · β45

Next choose KT = [1 −1 0 0 0] so that


 
β11 − β12
 β21 − β22 
 
MβK =  ..  .
 . 
β41 − β42

Then the null hypothesis is H0 : MβK = 0. In this example r = 4 and s = 1


so the test has rs = 4 degrees of freedom. The test is illustrated in column 3
of Table 6.6.2. Other comparisons are also given. Once again, for comparison
purposes the results for the Lawley-Hotelling test based on the test statistic,
(6.6.27), are given also. The robust and traditional analyses seem to agree
on the contrasts. Although there is some overall difference the factors behave
somewhat the same on the responses.
Suppose we have the linear model 6.6.2 along with a matrix of scores that
sum to zero. The criterion function and the matrix of partial derivatives are
given by (6.6.5) and (6.6.6). Then the test statistic for a general regression
effect is given by (6.6.16) or (6.6.17). Special cases yield the two-sample and
k-sample tests discussed in Examples 6.6.1 and 6.6.3. The componentwise rank
case uses chisquare critical values. The computation of the tests require the
score matrix A along with the design matrix X. For example, we could use
the L1 norm componentwise and produce multivariate sign tests that extend
Mood’s test to the multivariate model.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 439 —


i i

6.7. EXPERIMENTAL DESIGNS 439

This approach can be extended to the spatial rank and affine rank cases;
recall the discussion
P in Example 6.3.2. In the spatial case the criterion function
is D(α, β) = kyi − αTP
T
− x′i βk, (6.1.4). Let u(x) = kxk−1 x and rTi =
α − xi β, then D(α, β) = uT (ri )ri and hence,
T ′

 
uT (r1 )
 .. 
A= .  .
T
u (rn )
P
Further, let Rc (ri) = j u(ri − rj ) be the centered spatial rank vector.
P
Then the criterion function is D(α, β) = RTc (ri )ri and
 
RT (r1 )
 .. 
A∗ =  .  .
T
R (rn )

The tests then can be carried out using the chisquare critical values. See
Brown and Hettmansperger (1987b) and Möttönen, and Oja (1995) for details.
For the details in the affine invariant sign or rank vector cases see Brown
and Hettmansperger (1987b), Hettmansperger, Nyblom, and Oja (1994), and
Hettmansperger, Möttönen, and Oja (1997a,b).
Rao (1988) and Bai, Chen, Miao, and Rao (1990) consider a different
formulation of a linear model. Suppose, for i = 1, . . . , n, Yi = Xi β + ǫi where
Yi is a 2 × 1 vector, Xi is a q × 2 matrix of known values, β is a 2 × 1 vector of
unknown parameters. Further, ǫ1 , . . . , ǫn is an iid set of random
P vectors from
a distribution with median vector 0. The criterion function is kYi − Xi βk,
the spatial criterion function. Estimates, tests, and the asymptotic theory are
developed in the above references.

6.7 Experimental Designs


Recall that in Chapter 4 we developed rank-based procedures for experimental
designs based on the general R estimation and testing theory of Chapter 3.
Analogously in the multivariate case, rank-based procedures for experimental
designs can be based on the R estimation and testing theory of the last section.
In this short section we show how this development can proceed. In particular,
we use the cell median model (the basic model of Chapter 4), and show how the
test (6.6.26) can be used to test general linear hypotheses involving contrasts
in these cell medians. This allows the testing of MANOVA type hypotheses as
well as, for instance, profile analyses for multivariate data.
Suppose we have k groups and within the jth group, j = 1, . . . , k, we have
a sample of size nj . For each subject a d-dimensional vector of variables has

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 440 —


i i

440 CHAPTER 6. MULTIVARIATE

been recorded. Let yijl denote the response for the ith subject in Group j for
the lth variable and let yij = (yij1 , . . . , yij2 )T denote the vector of responses
for this subject. Consider the model,
yij = µj + eij , j = 1, . . . , k , i = 1, . . . , nk ,
P
where the eij are independent and identically distributed. Let n = nj denote
the total sample size. Let Yn×d denote the matrix of responses (the yij s are
stacked sequentially by group) and let ǫ be the corresponding n × d matrix of
eij . Let Γ = (µ1 , . . . , µk )T be the k × d matrix of parameters. We can then
write the model as
Y = WΓ + ǫ , (6.7.1)
where W is the incidence matrix in expression (4.2.5). This is our full model
and it is the multivariate analog of the basic model of Chapter 4, (4.2.1). If
µj is the vector of medians then this is the multivariate medians model.
On the other hand, if µj is the vector of means then this is the multivariate
means model.
We are interested in the following general hypotheses:
H0 : MΓK = O versus HA : MΓK 6= O , (6.7.2)
where M is an r × k contrast matrix (the rows of M sum to zero) of rank r
and K is a d × s matrix of rank s.
In order to use the theory of Section 6.6 we need to transform Model (6.7.1)
into a model of the form (6.6.2). Consider the k × k elementary column matrix
E which replaces the first column of a matrix by the sum of all columns of the
matrix; i.e., " k #
X
[c1 c2 · · · ck ]E = ci c2 · · · ck , (6.7.3)
i=1

for any matrix [c1 c2 · · · ck ]. Note that E is nonsingular. Hence we can write
Model (6.7.1) as
 T 
−1 α
Y = WΓ + ǫ = WEE Γ + ǫ = [1 W1 ] +ǫ, (6.7.4)
β

where W1 is the last k − 1 columns of W and E−1 Γ = [α β T ]T . This is a


model of the form (6.6.2). Since M is a contrast matrix, its rows sum to zero.
Hence the hypothesis simplifies to:
 T 
−1 α
MΓK = MEE ΓK = [0 M1 ] K = M1 βK . (6.7.5)
β
Therefore the hypotheses (6.7.2) can be tested by the procedure (6.6.26) based
on the fit of Model (6.7.4).

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 441 —


i i

6.7. EXPERIMENTAL DESIGNS 441

Most of the interesting hypotheses in MANOVA can be written in the form


(6.7.2) for some specified contrast matrix M. Therefore based on the theory
developed in Section 6.6, a robust rank-based methodology can be developed
for MANOVA type models. This methodology is demonstrated in Example
6.7.1, which follows, and Exercise 6.8.23.
For the multivariate setting, Davis and McKean (1993) developed an ana-
log of Theorem 3.5.7 which gives the joint asymptotic distribution of [b
αβb T ]T .
They further developed a test of the hypothesis H0 : MΓK = O, where
M is any full row rank matrix, not necessarily a contrast matrix. Hence, this
provides a robust rank-based analysis for any multivariate linear model.
Example 6.7.1 (Paspalum Grass). This data set, discussed on page 460 of
Seber (1984), concerns the effect on growth of paspalum grass due to a fungal
infection. The experiment was a 4 × 2 two-way design. Half of the forty-eight
pots of paspalum grass in the experiment were inoculated with a fungal infec-
tion and half were left as controls. The second factor was the temperature (14,
18, 22, 26o C) at which the inoculation was applied. The design was balanced
so that six plants were used for each combination of treatment and tempera-
ture. After a specified amount of time, the following three measurements were
made on each plant: y1 is the fresh weight of the roots of the plant (gm); y2
is the maximum root length of the plant (mm); and y3 is the fresh weight of
the tops of the plant (gm). For the reader’s convenience, we have tabled the
Paspalum Grass Data at the url cited in the Preface.
As a full model we fit Model 6.7.1. Based on the residual analysis found in
Exercise 6.8.24, though, the fit clearly shows heteroscedasticity and suggests
the log transformation. The subsequent analysis is based on the transformed
data. Table 6.7.1 displays the estimates of Model 6.7.1 based on the Wilcoxon
score function and LS. Note the fits are very similar. The estimates of the
vector τ and the matrix AT A are also displayed.
The hypotheses of interest concern the average main effects and interac-
tion. For Model 6.7.1, matrices for treatment effects, temperature effects and
interaction are given by
MTreat. = [ 1 1 1 1 −1 −1 −1 −1 ]
 
1 −1 0 0 1 −1 0 0
MTemp. =  1 0 −1 0 1 0 −1 0 
1 0 0 −1 1 0 0 −1
 
1 −1 0 0 −1 1 0 0
MTreat.×Temp. =  0 1 −1 0 0 −1 1 0  .
0 0 1 −1 0 0 −1 1
Take the matrix K to be I3 . Then the hypotheses of interest can be expressed
as MΓK = O for the above M matrices. Using the summary statistics in

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 442 —


i i

442 CHAPTER 6. MULTIVARIATE

Table 6.7.1: Estimates Based on the Wilcoxon and LS Fits for the Paspalum
Grass Data, Example 6.7.1. (V is the variance-covariance matrix of vector of
random errors ǫ.)
Wilcoxon Fit LS Fit
Components Components
Parameter (1) (2) (3) (1) (2) (3)
µ11 1.04 3.14 .82 .97 3.12 .78
µ21 2.74 3.70 3.05 2.71 3.67 3.02
µ31 2.47 3.63 3.25 2.40 3.61 3.20
µ41 1.49 3.29 2.79 1.45 3.29 2.76
µ12 .94 3.12 .77 .92 3.12 .70
µ22 1.95 3.43 2.58 1.96 3.43 2.55
µ32 2.26 3.36 3.19 2.19 3.39 3.11
µ42 1.09 3.18 2.45 1.01 3.17 2.41
τ or σ .376 .188 .333 .370 .197 .292
1.04 .62 .92 .14 .04 .09
AT A or V .62 1.04 .57 .04 .04 .03
.92 .57 1.04 .09 .03 .09

Table 6.7.2: Test Statistics QM V RK and QLH Based on the Wilcoxon and LS
Fits, Respectively, for the Paspalum Grass Data, Example 6.7.1. (Marginal
F -tests are also given. The numerator degrees of freedom are given. Note that
the denominator degrees of freedom for the marginal F -tests is 40.)
Wilcoxon LS
MVAR Marginal Fϕ MVAR Marginal FLS
Effect df QM V RK df (1) (2) (3) df QLH df (1) (2) (3)
Treat. 3 14.9 1 9.19 7.07 11.6 3 12.2 1 11.4 6.72 8.66
Temp. 9 819 3 32.5 13.4 61.4 9 980 3 45.2 13.4 162
Treat. ×Temp. 9 11.2 3 2.27 1.49 1.35 9 7.98 3 2.01 .79 1.36

Table 6.7.1 and the elementary column matrix E, as defined above expression
(6.7.3), we obtained the test statistics QM V RK , (6.6.26) based on the Wilcoxon
fit. For comparison we also obtain the LS test statistics QLH , (6.6.27). The
values of these statistics for the hypotheses of interest are summarized in
Table 6.7.2. The test for interaction is not significant while both main effects,
Treatment and Temperature, are significant. The results are quite similar for
the traditional test also. We also tabulated the marginal test statistics, Fϕ .
The results for each component are similar to the multivariate result.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 443 —


i i

6.8. EXERCISES 443

6.8 Exercises
6.8.1. Show that the vector of sample means of the components is affine
equivariant. See Definition 6.1.1.

6.8.2. Compute the gradients of the three criterion functions (6.1.3)-(6.1.5).

6.8.3. Show that in the univariate case S2 (θ) = S3 (θ), (6.1.7) and (6.1.8).

6.8.4. Establish (6.2.7).

6.8.5. Construct an example in the bivariate case for which the mean vector
rotates into the new mean vector but the vector of componentwise medians
does not rotate into the new vector of medians.

6.8.6. Students were given a math aptitude and reading comprehension test
before starting an intensive study skills workshop. At the end of the program
they were given the test again. The following data represents the change in
the math and reading tests for the five students in the program.

Math Reading
11 7
20 40
-10 -4
10 12
16 5

We would like to test the hypothesis H0 : θ = 0 vs HA : θ 6= 0. Fol-


lowing the discussion at the beginning of Section 6.2.2, find the sign change
distribution of the componentwise sign test and find the conditional p-value.

6.8.7. Prove Theorem 6.2.1.

6.8.8. Using the projection method discussed in Chapter 2, derive the pro-
jection of the statistic given in (6.2.14).

6.8.9. Apply Lemma 6.2.1 and show that (6.2.19) provides the bounds on the
testing efficiency of the Wilcoxon test relative to Hotelling’s test in the case
of a bivariate normal distribution.

6.8.10. Prove Theorem 6.3.1.

6.8.11. Show that (6.3.13) can be generalized to k dimensions.

6.8.12. Consider the spatial L1 methods.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 444 —


i i

444 CHAPTER 6. MULTIVARIATE

(a) Show that the efficiency of the spatial L1 methods relative to the L2
methods with a k-variate spherical model is given by
 2
k−1
ek (spatial L1 , L2 ) = E(r 2 )[E(r −1 )]2 .
k

(b) Next assume that the k-variate spherical model is normal. Show that

Er −1 = Γ[(k−1)/2)]

2Γ(k/2)
with Γ(1/2) = π.

6.8.13. . Show that the spatial median is equivariant and that the spatial sign
test is invariant under orthogonal transformations of the data.

6.8.14. Verify (6.3.15).

6.8.15. Complete the proof of Theorem 6.4.3 by establishing the third formula
for S8 (0).

6.8.16. Show that the Oja median and Oja sign test are affine equivariant
and affine invariant, respectively. See Section 6.4.3.

6.8.17. Show that the maximum breakdown point for a translation equiv-
ariant estimator is (n+1)/(2n). An estimator is translation equivariant if
T (X + a1) = T (X) + a1, for every real a. Note that 1 is the vector of all
ones.

6.8.18. Verify (6.6.6).

6.8.19. Show that (6.6.17) can be derived from (6.6.16).

6.8.20. Fill in the details of the proof of Theorem 6.6.1.

6.8.21. Show that AR = 15.69 in Example 6.6.2.

6.8.22. Verify formula (6.6.19).

6.8.23. Consider Model (6.7.1) for a repeated measures design in which the
responses are recorded on the same variable over time; i.e., yijl is response for
the ith subject in Group j at time period l. In this model the vector µj is
the profile vector for the jth group and the plot of µij versus i is called the
b j denote the estimate of µj based on the R fit
profile plot for Group j. Let µ
of Model (6.7.1). The plot of µbij versus j is called the sample profile plot of
Group j. These group plots are overlaid and are called the sample profiles.
A hypothesis of interest is whether or not the population profiles are parallel.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 445 —


i i

6.8. EXERCISES 445

(a) Let At−1 be the (t − 1) × t matrix given by


 
1 −1 0 ··· 0
 0 1 −1 · · · 0 
 
At−1 =  .. .. .. ..  .
 . . . . 
0 0 0 · · · −1
Show that parallel profiles are equivalent to the null hypothesis H0 de-
fined by:
H0 : Ak−1ΓAd−1 = O versus HA : Ak−1 ΓAd−1 6= O , (6.8.1)
where Γ is defined in Model 6.7.1. Hence show that a test of parallel
profiles can be based on the test (6.6.26).
(b) The data below are the times (in seconds) it took three different species
(A, B, and C) of rats to run a maze at four different times (I, II, III, and
IV). Each row contains the scores of a single rat. Compare the sample
profile plots based on Wilcoxon and LS estimates.
Group A Group B Group C
Times Times Times
Rat I II III IV Rat I II III IV Rat I II III IV
1 47 53 51 28 6 44 57 46 27 11 45 33 30 18
2 35 66 38 39 7 47 29 21 30 12 30 50 21 25
3 43 40 34 40 8 28 76 29 39 13 33 32 32 24
4 49 60 44 32 9 57 63 60 15 14 44 62 38 22
5 41 61 38 32 10 34 62 41 27 15 40 42 33 24

(c) Test the hypotheses (6.8.1) using the procedure (6.6.26) based on
Wilcoxon scores. Repeat using the LS test procedure (6.6.27).
(d) Repeat items (b) and (c) if the 13th rat at time period 2 took 80 seconds
to run the maze instead of 34. Note that p-value of the LS procedure
changes from .77 to .15 while the p-value of the Wilcoxon procedure
changes from .95 to .85.
6.8.24. Consider the data of Example 6.7.1.
(a) Using the Wilcoxon scores, fit Model (6.7.4) to the original data. Obtain
the marginal residual plots which show heteroscedasticity. Reason that
the log transformation is appropriate. Show that the residual plots based
on the transformed remove much of the heteroscedasticity. For both the
transformed and original data obtain the internal Wilcoxon Studentized
residuals. Identify the outliers.
(b) In order to see the effect of the transformation, obtain the Wilcoxon and
LS analyses of Example 6.7.1 based on the original data. Discuss your
findings.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 447 —


i i

Appendix A

Asymptotic Results

A.1 Central Limit Theorems


The following version of the Lindeberg-Feller Central Limit Theorem proves
to be useful for our theory. A proof of it can be found in Arnold (1981).
Theorem A.1.1. Consider the sequence of independent random variables
2
W1n , . . . , Wnn , for n = 1, 2 . . . . Suppose E(Win ) = 0, Var(Win ) = σin < ∞,
and
2
max σin , → 0 as n → ∞, (A.1.1)
1≤i≤n
X n
2
σin → σ 2 , 0 < σ 2 < ∞, as n → ∞, (A.1.2)
i=1
and n
X
2
lim E(Win Iǫ (|Win |) = 0 , (A.1.3)
n→∞
i=1
for all ǫ > 0, where Ia (|x|) is 0 or 1 when |x| > a or |x| ≤ a, respectively.
Then n
X D
Win → N(0, σ 2 ) .
i=1
A useful corollary to this theorem is given next; see, also, page 153 of Hájek
and Šidák (1967).
Corollary A.1.1. Suppose that the sequence of random variables X1 , . . . , Xn
are iid with E(Xi ) = 0 and Var(Xi ) = σ 2 < ∞. Suppose the sequence of
constants a1n , . . . , ann are such that
Xn
a2in → σa2 , as n → ∞ , 0 < σa2 < ∞ , (A.1.4)
i=1
max |ain | → 0 , as n → ∞ . (A.1.5)
1≤i≤n

447
i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 448 —


i i

448 APPENDIX A. ASYMPTOTIC RESULTS

Then
n
X D
ain Xi → N(0, σ 2 σa2 ) .
i=1

Proof: Take Win of Theorem A.1.1 to be ain Xi . Then the mean of Win is 0
2
and
P 2its variance is σin = a2in σ 2 . By (A.1.5), max σin
2
→ 0 and by (A.1.4),
2 2
σin → σ σa . Hence we need only show that condition (A.1.3) is true. For
i = 1, . . . , n, define

Win = max |ajn ||Xi | .
1≤j≤n

∗ ∗
Then |Win | ≥ |Win |; hence, Iǫ (|Win |) ≤ Iǫ (|Win |), for ǫ > 0. Therefore,

n n
( n
)
X  2  X  2  X  

E Win Iǫ (|Win |) ≤ E Win Iǫ (|Win |) = a2in E X12 Iǫ (|W1n

|) .
i=1 i=1 i=1
(A.1.6)
Note that the sum in braces converges to σ 2 σa2 .
Because converges ∗
X12 Iǫ (|W1n |)
to 0 pointwise and it is bounded above by the integrable function X12 , it then
follows that by Lebesgue’s Dominated Convergence Theorem that the rightside
of (A.1.6) converges to 0. Thus condition (A.1.3) of Theorem A.1.1 is true and
we are finished.
Note that the simple Central Limit Theorem follows from this corollary by
taking ain = n−1/2 , so that (A.1.4) and (A.1.5) hold.

A.2 Simple Linear Rank Statistics


In the next two subsections, we present the asymptotic distribution theory for
a simple linear rank statistic under the null and local alternative hypotheses.
This theory is used in Chapters 1 and 2 for location models and, also, in Section
A.3, where it is used to establish asymptotic linearity and quadraticity results
for Chapters 3 and 5. The theory for a simple linear rank statistic is presented
in detail in Chapters 5 and 6 of the book by Hájek and Šidák (1967); hence,
here we only present a heuristic development with appropriate references to
Hájek and Šidák. Also, Chapter 8 of Randles and Wolfe (1979) presents a
detailed development of the null asymptotic theory of a simple linear rank
statistic.
In this section we assume that the sequence of random variables Y1 , . . . , Yn
are iid with common density function f (y) which follows assumption (E.1),
(3.4.1). Let x1 , . . . , xn denote a sequence of centered, (x = 0), regression co-
efficients and assume that they follow assumptions (D.2), (3.4.7), and (D.3),

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 449 —


i i

A.2. SIMPLE LINEAR RANK STATISTICS 449

(3.4.8). For this one-dimensional case, these assumptions simplify to:


max x2
Pn i2 → 0 (A.2.1)
i=1 xi
n
1X 2
x → σx2 , σx2 > 0 , (A.2.2)
n i=1 i

for some constant σx2 . It follows from these assumptions that maxi |xi |/ n →
0, a fact that we find useful. Assume that the score function ϕ(u) is defined
on the interval (0, 1) and that it satisfies (S.1), (3.4.10); in particular,
R1 R1 2
0
ϕ(u) du = 0 and 0
ϕ (u) du = 1 . (A.2.3)
Consider then the linear rank statistics,
n
X
S= xi a(R(Yi)) , (A.2.4)
i=1

where the scores are generated as a(i) = ϕ(i/(n + 1)).

A.2.1 Null Asymptotic Distribution Theory


It follows immediately that the mean and variance of S are given by
P  1 Pn 2 . Pn 2
E(S) = 0 and Var(S) = ni=1 x2i n−1 i=1 a (i) = i=1 xi , (A.2.5)

where the approximation


R1 2 is due to the fact that the quantity in braces is a
Riemann sum of 0 ϕ (u) du = 1.
Note that we can write S as
Xn  
n
S= xi ϕ Fn (Yi ) , (A.2.6)
i=1
n + 1

where Fn is the empirical distribution function of Y1 , . . . , Yn . This suggests the


approximation,
Xn
T = xi ϕ(F (Yi)) . (A.2.7)
i=1

We have immediately from (A.2.3) that the mean and variance of T are
P
E(T ) = 0 and Var(T ) = ni=1 x2i . (A.2.8)

Furthermore, by assumptions (A.2.1) and (A.2.2), we can apply Corollary


A.1.1 to show that
√1 T is asymptotically distributed as N(0, σx2 ) . (A.2.9)
n

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 450 —


i i

450 APPENDIX A. ASYMPTOTIC RESULTS

Because the means of S and T are the same, it follows that S has the same
asymptotic distribution as T provided the second moment of their difference
goes to 0. But this follows from the string of inequalities:
"  !2 
2 # Xn    
1 1 1 n
E √ S−√ T = E xi ϕ Fn (Yi ) − ϕ(F (Yi)) 
n n n i=1
n + 1
( n ) "   2 #
n 1X 2 n
≤ x E ϕ Fn (Y1 ) − ϕ(F (Y1 )) → σx2 · 0,
n − 1 n i=1 i n+1

where the inequality and the derivation of the limit is given on page 160 of
Hájek and Šidák (1967). This results in the following theorem,
Theorem A.2.1. Under the above assumptions,
1 P
√ (T − S) → 0 , (A.2.10)
n
and
1 D
√ S → N(0, σx2 ) . (A.2.11)
n
Hence we have established the null asymptotic distribution theory of a
simple linear rank statistic.

A.2.2 Local Asymptotic Distribution Theory


We first need the definition of contiguity between two sequences of densities.
Definition A.2.1. A sequence of densities {qn } is contiguous to another
sequence of densities {pn }, if for any sequence of events {An },
Z Z
pn → 0 ⇒ qn → 0 .
An An

This concept is discussed in some detail in Hájek and Šidák (1967).


The following fact follows immediately from this definition. Suppose the
sequence of densities {qn } is contiguous to the sequence of densities {pn }. Let
P P
{Xn } be a sequence of random variables. If Xn → 0 under pn then Xn → 0
under qn .
Then according to LeCam’s First Lemma, if log(qn /pn ) is asymp-
totically N(−σ 2 /2, σ 2 ) under pn , then qn is contiguous to pn . Further by
LeCam’s Third Lemma, if (Sn , log(qn /pn )) is asymptotically bivariate nor-
mal (µ1 , µ2 , σ12 , σ22 , ρσ1 σ2 ) with µ2 = −σ22 /2 under pn , then Sn is asymptotically
N(µ1 + ρσ1 σ2 , σ12 ) under qn ; see pages 202-209 in Hájek and Šidák (1967).

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 451 —


i i

A.2. SIMPLE LINEAR RANK STATISTICS 451

In this section, we assume that the random variables Y1 , . . . , Yn and the


regression coefficients x1 , . . . , xn follow the same assumptions that we made in
the last section; see expressions (A.2.1) and (A.2.2). We denote the likelihood
function of Y1 , . . . , Yn by
py = Πni=1 f (yi) . (A.2.12)
In the last section we derived the asymptotic distribution of S under py . In
this section we are further concerned with the likelihood function
qd = Πni=1 f (yi + di ) , (A.2.13)
for a sequence of constants d1 , . . . , dn which satisfies the conditions
n
X
di = 0 (A.2.14)
i=1
n
X
d2i → σd2 > 0 , as n → ∞ (A.2.15)
i=1
max d2i → 0 , as n → ∞ (A.2.16)
1≤i≤n
Xn
1
√ xi di → σxd , as n → ∞ . (A.2.17)
n i=1

In applications (e.g., power in simple linear models) we take di = −xi β/ n.
For xi s following assumptions (A.2.1) and (A.2.2), the above assumptions
would hold for these di s.
In this section, we establish the asymptotic distribution of S under qd .
Consider the log of the ratio of the likehood functions qd and py given by
n
X f (Yi + ηdi )
l(η) = log . (A.2.18)
i=1
f (Yi )

Expanding l(η) about 0 and evaluating the resulting expression at η = 1


results in
X n n
f ′ (Yi ) 1 X 2 f (Yi)f ′′ (Yi ) − (f ′ (Yi))2
l= di + di 2 (Y )
+ op (1) , (A.2.19)
i=1
f (Y i ) 2 i=1
f i

provided that the third derivative of the log-ratio, evaluated at 0, is square


integrable. Under py , the middle term converges in probability to −I(f )σd2 /2,
provided that the second derivative of the log-ratio, evaluated at 0, is square
integrable.
Hence, under py and some further regularity conditions we can write,
n
X f ′ (Yi) I(f )σd2
l= di − + op (1) . (A.2.20)
i=1
f (Yi ) 2

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 452 —


i i

452 APPENDIX A. ASYMPTOTIC RESULTS

The random variables in the first term, f ′ (Yi)/f (Yi ) are iid with mean 0 and
variance I(f ). Because the sequence d1 , . . . , dn satisfies (A.2.14)-(A.2.16), we
can use Corollary A.1.1 to show that, under py , l converges in distribution to
N(−I(f )σd2 /2, I(f )σd2). By the definition of contiguity (A.2.1) and the imme-
diate following discussion of LeCam’s first lemma, we have the result
the densities qd = Πni=1 f (yi + di) are contiguous to py = Πni=1 f (yi ) ;
(A.2.21)
see, also, page 204 of Hájek and Šidák (1967).
We next establish the key result:
Theorem A.2.2. For T given by (A.2.7) and under py and the assumptions
(3.4.1), (A.2.1), (A.2.2), (A.2.14)-(A.2.17),
 1  !  !
√ T
n D 0 σ 2
x σ γ
xd f
→ N2 I(f )σ2 , . (A.2.22)
l − 2 d σxd γf I(f )σd2

Proof: Consider the random vector V = (T / n, l)′ , where T is defined in ex-
pression (A.2.7). To show that V is asymptotically normal under pn it suffices
to show that for t ∈ R2 , t 6= 0, t′V is asymptotically univariate normal. By
the above discussion, for the second component of V, we need only be con-
cerned with the first term in expression (A.2.19); hence, for t = (t1 , t2 )′ , define
the random variables Win by
X n   X n
1 f ′ (Yi)
√ xi t1 ϕ(F (Yi)) + t2 di = Win . (A.2.23)
i=1
n f (Y i ) i=1

We want to apply Theorem A.1.1. The random variables Win are independent
and have mean 0. After some simplification, we can show that the variance of
Win is
2 1 xi
σin = x2i t21 + t22 d2i I(f ) − 2t1 t2 di √ γf , (A.2.24)
n n
where γf is given by
Z 1  ′ −1 
f (F (u))
γf = ϕ(u) − du . (A.2.25)
0 f (F −1(u))
Note by assumptions (A.2.1), (A.2.2), and (A.2.15)-(A.2.17) that
n
X
2
σin → t21 σx2 + t22 σd2 I(f ) − 2t1 t2 γf σxd > 0 , (A.2.26)
i=1

and that
2 1 22 2 1
max σin ≤ max xi t1 + t2 I(f ) max d2i + 2|t1 t2 |γf max √ |xi | max |di | → 0 ;
1≤i≤n 1≤i≤n n 1≤i≤n 1≤i≤n n 1≤i≤n
(A.2.27)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 453 —


i i

A.2. SIMPLE LINEAR RANK STATISTICS 453

hence conditions (A.1.2) and (A.1.1) are true. Thus to obtain the result we
need to show n
X  2 
lim E Win Iǫ (|Win |) = 0 , (A.2.28)
n→∞
i=1

for ǫ > 0. But |Win | ≤ Win where

1 f (Yi)

Win
= |t1 | max √ |xj ||ϕ(F (Yi))| + |t2 | max |dj | .
1≤j≤n n 1≤j≤n f (Yi )
Hence,
n
X n n 
 2  X  2 ∗
 X 1 2 2 2
E Win Iǫ (|Win |) ≤ E Win Iǫ (Win ) = E t1 xi ϕ (F (Y1))
i=1 i=1 i=1
n
 ′ 2  ′ ! #
f (Y 1 ) 1 f (Y 1 )
+t22 d2i + 2t1 t2 √ xi diϕ(F (Y1 )) − ∗
Iǫ (W1n ) =
f (Y1) n f (Y1 )
( n )
X1  
t21 x2i E ϕ2 (F (Y1))Iǫ (W1n ∗
)
i=1
n
( n
) " 2 #
X f ′
(Y i )
+t22 d2i E Iǫ (W1n∗
)
i=1
f (Y i )
( n
) "  ′ 2 #
1 X f (Y1 ) ∗
+2t1 t2 √ xi di E ϕ(F (Y1)) − Iǫ (W1n ) . (A.2.29)
n i=1 f (Y1 )

Because Iǫ (W1n ) → 0 pointwise and each of the other random variables in
the expectations of (A.2.29) are absolutely integrable, the Lebesgue Domi-
nated Convergence Theorem implies that each of these expectations converge
to 0. The desired limit in expression (A.2.28) then follows from assumptions
(A.2.1), (A.2.2), and (A.2.15)-(A.2.17). Hence V is asymptotically bivariate
normal. We can obtain its asymptotic variance-covariance matrix from expres-
sion (A.2.26), which completes the proof.
Based on Theorem A.2.2, an application
√ of LeCam’s third lemma leads to
the asymptotic distribution of T / n under local alternatives which we state
in the following theorem.
Theorem A.2.3. Under the sequence of densities qd = Πni=1 f (yi + di ), and
the assumptions (3.4.1), (A.2.1), (A.2.2), (A.2.14)-(A.2.17),
1 D
√ T → N(σxd γf , σx2 ) , (A.2.30)
n
1 D
√ S → N(σxd γf , σx2 ) . (A.2.31)
n

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 454 —


i i

454 APPENDIX A. ASYMPTOTIC RESULTS


√ √
The result for S/ n follows because (T − S)/ n → 0 in probability under
the densities
√ py ; hence, due to the contiguity cited in expression (A.2.21),
(T − S)/ n → 0, also, under the densities qd . A proof of the asymptotic
power lemma, Theorem 2.4.13, follows from this result.
We now investigate the relationship between S and the shifted process
given by
X n
Sd = xi a(R(Yi + di )) . (A.2.32)
i=1

Consider the analogous process,


n
X
Td = xi ϕ(F (Yi + di )) . (A.2.33)
i=1

We next establish the connection between T and Td ; see Theorem 1.3.1, also.

Theorem A.2.4. Under the likelihoods qd and py , we have the following iden-
tity:    
1 1
Pqd √ T ≤ t = Ppy √ Td ≤ t . (A.2.34)
n n
Proof: The proof follows from the following string of equalities.
  " n
#
1 1 X
Pq d √ T ≤ t = Pq d √ xi ϕ(F (Yi )) ≤ t
n n i=1
" n
#
1 X
= Pq d √ xi ϕ(F ((Yi − di ) + di )) ≤ t
n i=1
" n
#
1 X
= Pp y √ xi ϕ(F (Zi + di )) ≤ t
n i=1
 
1
= Ppy √ Td ≤ t ,
n
(A.2.35)

where the third equality follows because the sequence of random variables
Z1 , . . . , Zn follows the likelihood py .
We next establish an asymptotic relationship between T and Td .

Theorem A.2.5. Under py and the assumptions (3.4.1), (A.2.1), (A.2.2),


(A.2.14)-(A.2.17),  
T − [Td − Epy (Td )] P
√ →0.
n

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 455 —


i i

A.2. SIMPLE LINEAR RANK STATISTICS 455



Proof: Since E(T ) = 0, it suffices to show that the V [(T − Td )/ n ] → 0. We
have,
  n
T − Td 1X 2
V √ = x V [ϕ(F (Yi)) − ϕ (F (Yi + di ))]
n n i=1 i
n
1X 2
≤ xi E [ϕ(F (Yi)) − ϕ(F (Yi + di )]2
n i=1
n Z
1X 2 ∞
= x [ϕ(F (y)) − ϕ(F (y + di )]2 f (y) dy
n i=1 i −∞
n
! Z 
1X 2 ∞
2
≤ x max [ϕ(F (y)) − ϕ(F (y + di )] f (y) dy .
n i=1 i −∞ 1≤i≤n

The first factor in the last expression converges to σx2 ; hence, it suffices to
show that the lim of the second factor is 0. Fix y. Let ǫ > 0 be given. Then
since ϕ(u) is continuous a.e. we can assume it is continuous at F (y). Hence
there exists a δ1 > 0 such that |ϕ(z) − ϕ(F (y))| < ǫ for |z − F (y)| < δ1 . By
the uniform continuity of F , choose δ2 > 0 such that |F (t) − F (s)| < δ1 for
|s − t| < δ2 . By (A.2.16) choose N0 so that for n > N0 implies
max {|di|} < δ2 .
1≤i≤n

Thus for n > N0 ,


|F (y) − F (y + di )| < δ1 , for i = 1, . . . , n ,
and, hence,
|ϕ(F (y)) − ϕ (F (y + di))| < ǫ , for i = 1, . . . , n .
Thus for n > N0 ,
max [ϕ(F (y)) − ϕ(F (y + di )]2 < ǫ2 ,
1≤i≤n

Therefore,
Z ∞ 
2
lim max [ϕ(F (y)) − ϕ(F (y + di ))] f (y) dy ≤ ǫ2 ,
−∞ 1≤i≤n

and we are finished.


The next result yields the asymptotic mean of Td .
Theorem A.2.6. Under py and the assumptions (3.4.1), (A.2.1), (A.2.2),
(A.2.14)-(A.2.17),  
1
Epy √ Td → γf σxd .
n

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 456 —


i i

456 APPENDIX A. ASYMPTOTIC RESULTS

Proof: By Theorem A.2.3,


√1 T − γf σxd
n D
→ N(0, 1) , under qd .
σx
Hence by the transformation Theorem A.2.4,
√1 Td − γf σxd
n D
→ N(0, 1) , under py . (A.2.36)
σx
By (A.2.9),
√1 T
n D
→ N(0, 1) , under py ;
σx
hence by Theorem A.2.5, we must have
h i
√1 Td − E √1 Td
n n D
→ N(0, 1) , under py . (A.2.37)
σx
The conclusion follows from the results (A.2.36) and (A.2.37).
By the last two theorems we have under py
1 1
√ Td = √ T + γf σxd + op (1) .
n n

We need to express these results for the random variables S, (A.2.4),√ and Sd ,
√ to py and (T −S)/ n → 0 in
(A.2.32). Because the densities qd are contiguous
probability under py , it follows that (T − S)/√n → 0 in probability under qd .
By a change of variable this means (Td − Sd )/ n → 0 in probability under py .
This discussion leads to the following two results which we state in a theorem.

Theorem A.2.7. Under py and the assumptions (3.4.1), (A.2.1), (A.2.2),


(A.2.14)-(A.2.17),
1 1
√ Sd = √ S + γf σxd + op (1) (A.2.38)
n n
1 1
√ Sd = √ T + γf σxd + op (1) . (A.2.39)
n n

Next we relate the result Theorem A.2.7 to (2.5.26), the asymptotic lin-
earity of the general scores statistic in the two-sample problem. Recall in
the two-sample problem that ci = 0 for 1 ≤ i ≤ n1 and ci = 1 for
n1 + 1 ≤ i ≤ n1 + n2 = n, (2.2.1). Hence, xi = ci − c = −n2√
/n for 1 ≤ i ≤ n1
and xi = n1 /n for n1 + 1 ≤ i ≤ n. Defining di = −δxi / n, it is easy to
check that conditions (A.2.14)-(A.2.17) hold with σxd = −λ1 λ2 δ. Further

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 457 —


i i

A.2. SIMPLE LINEAR RANK STATISTICS 457


√ P √
(A.2.32) becomes
P S ϕ (δ/ n) = xi a(R(Y i − δxi / n))
R and (A.2.4)R becomes
Sϕ (0) = xi a(R(Yi )), where a(i) = ϕ(i/(n + 1)), ϕ = 0 and ϕ2 = 1.
Hence (A.2.38) becomes

1 √ 1
√ Sϕ (δ/ n) = √ Sϕ (0) − λ1 λ2 γf δ + op (1) .
n n

Finally using the√usual partition argument, Theorem 1.5.6, and the mono-
tonicity of Sϕ (δ/ n) we have:

Theorem A.2.8. Assuming Finite Fisher information, nondecreasing and


square integrable ϕ(u), and ni /n → λi , 0 < λi < 1, i = 1, 2,
  !
1 δ 1
Ppx √sup √ Sϕ √ − √ Sϕ (0) + λ1 λ2 γf δ ≥ ǫ → 0 , (A.2.40)
n|δ|≤c n n n

for all ǫ > 0 and for all c > 0.

(A.2.11), n−1/2 Sϕ (0)


This theorem establishes (2.5.26). As a final note from P
2 2 2 −1
is asymptotically N(0, σx ), where σx = σ (0) = lim n x2i = λ1 λ2 . Hence
to determine the efficacy using this approach, we have

λ1 λ2 γ f p
cϕ = = λ1 λ2 τϕ−1 , (A.2.41)
σ(0)

see (2.5.27).

A.2.3 Signed-Rank Statistics


In this section we develop the asymptotic local behavior for the general signed-
rank statistics defined in Section 1.8. Assume that X1 , . . . Xn are a random
sample having distribution function H(x) with density h(x) which is symmet-
ric about 0. Recall that general signed-rank statistics are given by
X
Tϕ+ = a+ (R(|Xi|))sgn(Xi ) , (A.2.42)

where the scores are generated as a+ (i) = ϕ+ (i/(n + 1)) for a nonnega-
+
tive
R +and 2square integrable function ϕ (u) which is standardized such that
(ϕ (u)) du = 1.
The null asymptotic distribution of Tϕ+ was derived in Section 1.8 so here
we are concerned with its behavior under local alternatives. Also the deriva-
tions here are similar to those for simple linear rank statistics, Section A.2.2;
hence, our derivation is brief.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 458 —


i i

458 APPENDIX A. ASYMPTOTIC RESULTS

Note that we can write Tϕ+ as


X  
+ n +
Tϕ+ = ϕ H (|Xi |) sgn(Xi ) ,
n+1 n
where Hn+ denotes the empirical distribution function of |X1 |, . . . , |Xn |. This
suggests the approximation
X
Tϕ∗+ = ϕ+ (H + (|Xi|))sgn(Xi ) , (A.2.43)

where H + (x) is the distribution function of |Xi|.


Denote the likelihood of the sample X1 , . . . Xn by
px = Πni=1 h(xi ) . (A.2.44)
A result that we need is
1  P
√ Tϕ+ − Tϕ∗+ → 0 , under px . (A.2.45)
n
This result is shown on page 167 of Hájek and√Šidák (1967).
For the sequence
√ of local alternatives, b/ n with b ∈ R, (here we are
taking di = −b/ n), we denote the likelihood by
 
n b
qb = Πi=1 h xi − √ . (A.2.46)
n
For b ∈ R, consider the log of the likelihoods given by
n
X h(Xi − η √bn )
l(η) = log . (A.2.47)
i=1
h(Xi )

If we expand l(η) about 0 and evaluate it at η = 1, similar to the expansion


(A.2.19), we obtain
n n
b X h′ (Xi ) b2 X h(Xi )h′′ (Xi ) − (h′ (Xi ))2
l = −√ + + op (1) , (A.2.48)
n i=1 h(Xi ) 2n i=1 h2 (Xi )

provided that the third derivative of the log-ratio, evaluated at 0, is square


integrable. Under px , the middle term converges in probability to −I(h)b2 /2,
provided that the second derivative of the log-ratio, evaluated at 0, is square
integrable. An application of Theorem A.1.1 shows that l converges in distri-
2
bution to a N(− I(h)b
2
, I(h)b2 ). Hence, by LeCam’s first lemma,
 
the densities qb = Πni=1 h xi − √bn are contiguous to px = Πni=1 h(xi ) .
(A.2.49)
Similar to Section A.2.2, by using Theorem√A.1.1 we can derive the asymp-
totic distribution of the random vector (Tϕ∗+ / n, l), which we record as:

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 459 —


i i

A.2. SIMPLE LINEAR RANK STATISTICS 459

Theorem A.2.9. Under px and some regularity conditions on h,


 1 ∗     
√ T +
n ϕ D 0 1 bγh
→ N2 2 , , (A.2.50)
l − I(h)b
2
bγh I(h)b2

where γh = 1/τϕ+ and τϕ+ is given in expression (1.8.24).

By this last theorem and LeCam’s third lemma, we have


1 D
√ Tϕ∗+ → N(bγh , 1) , under qb . (A.2.51)
n

By the result on contiguity, (A.2.49), the test statistic Tϕ+ / n has the same
distribution under qb . A proof of the asymptotic power lemma, Theorem 1.8.1,
follows from this result.
Next consider a shifted version of Tϕ∗+ given by
n     
X b b
∗ +
+
Tbϕ + = ϕ H Xi + √n sgn Xi + √
n
. (A.2.52)
i=1

The following identity is readily established:



Pqb [Tϕ∗+ ≤ t] = Ppx [Tbϕ + ≤ t] ; (A.2.53)

see, also, Theorem 1.3.1. We need the following theorem:

Theorem A.2.10. Under px ,


 ∗ ∗ ∗ 
Tϕ+ − [Tbϕ + − Epx (Tbϕ+ )] P
√ →0.
n

Proof: As in Theorem A.2.5, it suffices to show that V [(Tϕ∗+ − Tbϕ

+ )/ n] → 0.
∗ ∗

Let Vn = V [(Tϕ+ − Tbϕ+ )/ n]. Then Vn reduces to
Z ∞     2
+ +

+ b
+ b
ϕ H (|x|) sgn(x)−ϕ H x+ √n sgn x+ √
n
h(x)dx.
−∞

Since ϕ+ (u) is square integrable, the quantity in braces is dominated by an


integrable function. Since it converges pointwise to 0, a.e., an application of
the Lebesgue Dominated Convergence Theorem establishes the result.
Using the above results, we can proceed as we did for Theorem A.2.6 to
show that under px ,  
1 ∗
Epx √ Tbϕ+ → bγh . (A.2.54)
n

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 460 —


i i

460 APPENDIX A. ASYMPTOTIC RESULTS

Hence,
1 ∗ 1 ∗
√ Tbϕ + = √ Tϕ+ + bγh + op (1) . (A.2.55)
n n
A similar result holds for the signed-rank statistic.
For the results needed in Chapter 1, however, it is convenient to change
the notation to:
X n
Tϕ+ (b) = a+ (R|Xi − b|)sgn(Xi − b) . (A.2.56)
i=1

The above results imply that


1 1
√ Tϕ+ (θ) = √ Tϕ+ (0) − θγh + op (1) , (A.2.57)
n n

for n|θ| ≤ B, for B > 0.
The general signed-rank statistics found in Chapter 1 are based on norms.
In this case, since the scores are nondecreasing, we can strengthen our results
to include uniformity; that is,
Theorem A.2.11. Assuming Finite Fisher information, nondecreasing and
square integrable ϕ+ (u),
1 1
Ppx [√ sup | √ Tϕ+ (θ) − √ Tϕ+ (0) + θγh | ≥ ǫ] → 0 , (A.2.58)
n|θ|≤B n n
for all ǫ > 0 and all B > 0.
Proof: A proof can be obtained by the usual partitioning type of argu-
ment
R + on 2the interval [−B, B]; see the proof of Theorem 1.5.6. Hence, since
(ϕ (u)) du = 1, the efficacy is given by cϕ+ = γh ; see (1.8.21).

A.3 Rank-Based Analysis of Linear Models


In this section we consider the linear model defined by (3.2.3) in Chapter
3. The distribution of the errors satisfies assumption (E.1), (3.4.1). The de-
sign matrix satisfies conditions (D.2), (3.4.7), and (D.3), (3.4.8). We assume
without loss of generality that the true vector of parameters is 0.
It is easier to work with the following √
transformation of the design matrix
and parameters. We consider β such that nβ = O(1). Note that we suppress
the notation indicating that β depends on n. Let,
1/2
∆ = (X′X) β, (A.3.1)
′ −1/2
C = X (X X) , (A.3.2)
di = −c′i ∆ , (A.3.3)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 461 —


i i

A.3. RANK-BASED ANALYSIS OF LINEAR MODELS 461

where ci is √the ith row of C and note that ∆ = O(1) because n−1 X′ X →
Σ > 0 and nβ = O(1). Then C′ C = Ip and HC = HX , where HC is the
projection matrix onto the column space of C. Note that since X is centered,
C is also. Also kci k2 = h2nii where h2nii is the ith diagonal entry of HX . It
is straightforward to show that c′i ∆ = x′i β. Using the conditions (D.2) and
(D.3), the following conditions are readily established:

d = 0 (A.3.4)
n
X X n
2
di ≤ kci k2 k∆k2 = pk∆k2 , for all n (A.3.5)
i=1 i=1
max d2i ≤ k∆k2 max kci k2 (A.3.6)
1≤i≤n 1≤i≤n

= k∆k2 max h2nii → 0 as n → ∞ ,


1≤i≤n

since k∆k is bounded.


For j = 1, . . . , p define
n
X
Snj (∆) = cij a(R(Yi − c′i ∆)) , (A.3.7)
i=1

where the scores are generated by a function ϕ which satisfies (S.1), (3.4.10).
We now show that the theory established in Section A.2 for simple linear rank
statistics holds for Snj , for each j.
√ Fix j, then the regression coefficients
P 2 xP
i of Section A.2 are given by xi =
ncij . Note from (A.3.2) that xi /n = c2ij = 1; hence, condition (A.2.2)
is true. Further by (A.3.6),

max1≤i≤n x2i
Pn 2 = max c2ij → 0 ;
i=1 xi
1≤i≤n

hence, condition (A.2.1) is true.


For the sequence di = −c′i ∆, conditions (A.3.4)-(A.3.6) imply conditions
(A.2.14)-(A.2.16) (the upper bound in condition (A.3.6) was actually all that
was needed in the proofs of Section A.2). Finally for (A.2.17), because C is
orthogonal, σxd is given by
n n p
( n )
1 X X

X X
σxd = √ xi di = − cij ci ∆ = − cij cik ∆k = −∆j . (A.3.8)
n i=1 i=1 k=1 i=1

Thus by Theorem A.2.7, for j = 1, . . . , p, we have the results,

Snj (∆) = Snj (0) − γf ∆j + op (1) (A.3.9)


Snj (∆) = Tnj (0) − γf ∆j + op (1) , (A.3.10)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 462 —


i i

462 APPENDIX A. ASYMPTOTIC RESULTS

where
n
X
Tnj (0) = cij ϕ(F (Yi)) . (A.3.11)
i=1

Let Sn (∆)′ = (Sn1 (∆), . . . , Snp (∆)). Because component-wise convergence


in probability implies that the corresponding vector converges, we have shown
that the following theorem is true:

Theorem A.3.1. Under the above assumptions, for ǫ > 0 and for all ∆

lim P (kSn (∆) − (Sn (0) − γ∆) k ≥ ǫ) = 0 . (A.3.12)


n→∞

The conditions we want are asymptotic linearity and quadraticity. Asymp-


totic linearity is the condition
!
lim P sup kSn (∆) − (Sn (0) − γ∆) k ≥ ǫ =0, (A.3.13)
n→∞
k∆k≤c

for arbitrary c > 0 and ǫ > 0. This result was first shown by Jurečková (1971)
under more stringent conditions on the design matrix.
Consider the dispersion function discussed in Chapter 2. In terms of the
above notation
n
X
Dn (∆) = a(R(Yi − ci ∆))(Yi − ci ∆) . (A.3.14)
i=1

An approximation of Dn (∆) is the quadratic function

Qn (∆) = γ∆′ ∆/2 − ∆′ Sn (0) + Dn (0) . (A.3.15)

Using Jurečková’s conditions, Jaeckel (1972) extended the result (A.3.13) to


asymptotic quadraticity which is given by
!
lim P sup |Dn (∆) − Qn (∆)| ≥ ǫ =0, (A.3.16)
n→∞
k∆k≤c

for arbitrary c > 0 and ǫ > 0. Our main result of this section shows that
(A.3.12), (A.3.13), and (A.3.16) are equivalent. The proof proceeds as in Heiler
and Willers (1988) who established their results based on convex function
theory. Before proceeding with the proof, for the reader’s convenience, we
present some notes on convex functions.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 463 —


i i

A.3. RANK-BASED ANALYSIS OF LINEAR MODELS 463

A.3.1 Convex Functions


Let f be a real valued function defined on Rp . Recall the definition of a convex
function:
Definition A.3.1. The function f is convex if

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) , (A.3.17)

for 0 < λ < 1. Further, a convex function f is called proper if it is defined


on an open set C ∈ Rp and is everywhere finite on C.
The convex functions of interest in this appendix are proper with C = Rp .
The proof of the following theorem can be found in Rockafellar (1970); see
pages 82 and 246.
Theorem A.3.2. Suppose f is convex and proper on an open subset C of Rp .
Then f is continuous on C and is differentiable almost everywhere on C.
We find it useful to define a subgradient:
Definition A.3.2. The vector D(x0 ) is called a subgradient of f at x0 if

f (x) − f (x0 ) ≥ D(x0 )′ (x − x0 ) , for all x ∈ C . (A.3.18)

As shown on page 217 of Rockafellar (1970), a proper convex function which


is defined on an open set C has a subgradient at each point in C. Furthermore,
at the points of differentiability, the subgradient is unique and it agrees with
the gradient. This is a theorem proved on page 242 of Rockafellar which we
next state.
Theorem A.3.3. Let f be convex. If f is differentiable at x0 then ▽f (x0 ),
the gradient of f at x0 , is the unique subgradient of f at x0 .
Hence combining Theorems A.3.2 and A.3.3, we see that for proper convex
functions the subgradient is the gradient almost everywhere; hence if f is a
proper convex function we have

f (x) − f (x0 ) ≥ ▽f (x0 )′ (x − x0 ) , a.e. x ∈ C . (A.3.19)

The next theorem can be found on page 90 of Rockafellar (1970).


Theorem A.3.4. Let the sequence of convex functions {fn } be proper on
C and suppose the sequence converges for all x ∈ C∗ where C∗ is dense in
C. Then the functions fn converge on the whole set C to a proper and con-
vex function f and, furthermore, the convergence is uniform on each compact
subset of C.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 464 —


i i

464 APPENDIX A. ASYMPTOTIC RESULTS

The following theorem is a modification by Heiler and Willers (1988) of a


theorem found on page 248 of Rockafellar (1970).
Theorem A.3.5. Suppose in addition to the assumptions of the last theorem
the limit function f is differentiable, then

lim ▽fn (x) = ▽f (x) , for all x ∈ C . (A.3.20)


n→∞

Furthermore the convergence is uniform on each compact subset of C.


The following result is proved in Heiler and Willers (1988).
Theorem A.3.6. Suppose the hypotheses of Theorem A.3.4 hold. Assume,
also, that the limit function f is differentiable. Then

lim ▽fn (x) = ▽f (x) , for all x ∈ C∗ (A.3.21)


n→∞

and
lim fn (x0 ) = f (x0 ) , for at least one x0 ∈ C∗ (A.3.22)
n→∞

where C∗ is dense in C, imply that

lim fn (x) = f (x) , for all x ∈ C (A.3.23)


n→∞

and the convergence is uniform on each compact subset of C.

A.3.2 Asymptotic Linearity and Quadraticity


We now proceed with Heiler and Willers (1988) proof of the equivalence of
(A.3.12), (A.3.13), and (A.3.16).
Theorem A.3.7. Under Model (3.2.3) and the assumptions (3.4.7), (3.4.8),
and (3.4.1), the expressions (A.3.12), (A.3.13), and (A.3.16) are equivalent.
Proof:
(A.3.12) ⇒ (A.3.16). Both functions Dn (∆) and Qn (∆) are proper convex
functions for ∆ ∈ Rp . Their gradients are given by,

▽Qn (∆) = γ∆ − Sn (0) (A.3.24)


▽Dn (∆) = −Sn (∆) , a.e. ∆ ∈ Rp . (A.3.25)

By Theorem A.3.2 the gradient of D exists almost everywhere. Where


the derivative of Dn (∆) is not defined, we use the subgradient of Dn (∆),
(A.3.2), which, in the case of proper convex functions, exists everywhere
and which agrees uniquely with the gradient at points where D(∆) is

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 465 —


i i

A.3. RANK-BASED ANALYSIS OF LINEAR MODELS 465

differentiable; see Theorem A.3.3 and the surrounding discussion. Com-


bining these results we have
▽(Dn (∆) − Qn (∆)) = −[Sn (∆) − Sn (0) + γ∆] (A.3.26)

Let N denote the set of positive integers. Let ∆(1) , ∆(2) , . . . be a listing
of the vectors in p-space with rational components. By (A.3.12) the
rightside of (A.3.26) goes to 0 in probability for ∆(1) . Hence, for every
infinite index set N ∗ ⊂ N there exists another infinite index set N1∗∗ ⊂
N ∗ such that
a.s.
[Sn (∆(1) ) − Sn (0) + γ∆(1) ] → 0 , (A.3.27)
for n ∈ N1∗∗ . Since the right side of (A.3.26) goes to 0 in probability for
∆(2) and N1∗∗ is an infinite index set, there exists another infinite index
set N2∗∗ ⊂ N1∗∗ such that
a.s.
[Sn (∆(i) ) − Sn (0) + γ∆(i) ] → 0 , (A.3.28)
for n ∈ N2∗∗ and i ≤ 2. We continue and, hence, get a sequence of nested
infinite index sets N1∗∗ ⊃ N2∗∗ ⊃ · · · ⊃ Ni∗∗ ⊃ · · · such that
a.s.
[Sn (∆(j) ) − Sn (0) + γ∆(j) ] → 0 , (A.3.29)
for n ∈ Ni∗∗ ⊃ Ni+1
∗∗
⊃ · · · and j ≤ i. Let Ne be a diagonal infinite index
set of the sequence N1∗∗ ⊃ N2∗∗ ⊃ · · · ⊃ Ni∗∗ ⊃ · · · . Then
a.s.
[Sn (∆) − Sn (0) + γ∆] → 0 , (A.3.30)
e and for all rational ∆.
for n ∈ N
Define the convex function Hn (∆) = Dn (∆) − Dn (0) + ∆′ Sn (0). Then
Dn (∆) − Qn (∆) = Hn (∆) − γ∆′ ∆/2 (A.3.31)
▽(Dn (∆) − Qn (∆)) = ▽Hn (∆) − γ∆ . (A.3.32)
Hence by (A.3.30) we have
a.s.
▽Hn (∆) → γ∆ = ▽γ∆′ ∆/2 , (A.3.33)
e and for all rational ∆. Also note
for n ∈ N
Hn (0) = 0 = γ∆′ ∆/2|∆=0 . (A.3.34)
Since Hn is convex and (A.3.33) and (A.3.34) hold, we have by Theo-
rem A.3.6 that {Hn (∆)}n∈Ne converges to γ∆′ ∆/2 a.s., uniformly on
each compact subset of Rp . That is by (A.3.31), Dn (∆) − Qn (∆) → 0
a.s., uniformly on each compact subset of Rp . Since N ∗ is arbitrary,
we can conclude (see Theorem 4, page 103 of Tucker, 1967) that
P
Dn (∆) − Qn (∆) → 0 uniformly on each compact subset of Rp .

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 466 —


i i

466 APPENDIX A. ASYMPTOTIC RESULTS

(A.3.16) ⇒ (A.3.13). Let c > 0 be given and let C = {∆ : k∆k ≤ c}.


P
By (A.3.16) we know that Dn (∆) − Qn (∆) → 0 on C. Using the same
diagonal argument as above, for any infinite index set N ∗ ⊂ N there
e ⊂ N ∗ such that Dn (∆) − Qn (∆) a.s.
exists an infinite index set N → 0 for
e
n ∈ N and for all rational ∆. As in the last part, introduce the function
Hn as
Dn (∆) − Qn (∆) = Hn (∆) − γ∆′ ∆/2 . (A.3.35)
Hence,
a.s.
Hn (∆) → γ∆′ ∆/2 , (A.3.36)
for n ∈ Ne and for all rational ∆. By (A.3.36) and the fact that the
function γ∆′ ∆/2 is differentiable we have by Theorem A.3.5,
a.s.
▽Hn (∆) → γ∆ , (A.3.37)

for n ∈ Ne and uniformly on C. This leads to the following string of


convergences,
a.s.
▽(Dn (∆) − Qn (∆)) → 0
a.s.
Sn (∆) − (Sn (0) − γ∆) → 0 , (A.3.38)
e and uniformly on C. Since N ∗
where both convergences are for n ∈ N
was arbitrary we can conclude that
P
Sn (∆) − (Sn (0) − γ∆) → 0 , (A.3.39)

uniformly on C. Hence (A.3.13) holds.

(A.3.13) ⇒ (A.3.12). This is trivial.


These are the results we wanted. For convenience we summarize asymp-
totic linearity and asymptotic quadraticity in the following theorem:
Theorem A.3.8. Under Model (3.2.3) and the assumptions (3.4.7), (3.4.8),
and (3.4.1),
!
lim P sup kSn (∆) − (Sn (0) − γ∆) k ≥ ǫ =0, (A.3.40)
n→∞
k∆k≤c
!
lim P sup |Dn (∆) − Qn (∆)| ≥ ǫ =0, (A.3.41)
n→∞
k∆k≤c

for all ǫ > 0 and all c > 0.


Proof: This follows from the Theorems A.3.1 and A.3.7.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 467 —


i i

A.3. RANK-BASED ANALYSIS OF LINEAR MODELS 467

A.3.3 b and β
Asymptotic Distance between β e
This section contains a proof of Theorem 3.5.5. It shows that the R estimate in
Chapter 3 is close to the value which minimizes the quadratic approximation
to the dispersion function. The proof is due to Jaeckel (1972). For convenience,
we restate the theorem.

Theorem A.3.9. Under the Model (3.2.3), (E.1), (D.1), (D.2), and (S.1) in
Section 3.4,
√ P
b − β)
n(β e → 0.
√ e
Proof: Choose ǫ > 0 and δ > 0. Since nβ converges in distribution, there
exists a c0 such that h √ i
e
P kβk ≥ c0 / n < δ/2 , (A.3.42)
for n sufficiently large. Let
n √ o
e e .
T = min Q(Y − Xβ) : kβ − βk = ǫ/ n − Q(Y − Xβ) (A.3.43)

e is the unique minimizer of Q, T > 0; hence, by asymptotic quadraticity


Since β
we have
" #
P max √
|D(Y − Xβ) − Q(Y − Xβ)| ≥ T /2 ≤ δ/2 , (A.3.44)
kβ k<(c0 +ǫ)/ n

for sufficiently large n. By (A.3.42) and (A.3.44) we can assert with probability
greater than 1 − δ that for sufficiently large n, |Q(Y − Xβ)e − D(Y − Xβ)| f <

e < c0 / n. This implies with probability greater than 1 − δ that
(T /2) and kβk
for sufficiently large n,
f < Q(Y − Xβ) √
D(Y − Xβ) e + T /2 and kβk
e < c0 / n . (A.3.45)

Next suppose β is arbitrary and on the ring kβ − e = ǫ/√n. For kβk


βk e <
√ √
c0 / n it then follows that kβk ≤ (c0 + ǫ)/ n. Arguing as above, we have
with probability greater than 1 − δ that D(Y − Xβ) > Q(Y − Xβ) − T /2,
for sufficiently large n. From this, (A.3.43), and (A.3.45) we get the following
string of inequalities

D(Y − Xβ) > Q(Y − Xβ) − T /2


n √ o
e = ǫ/ n − T /2
≥ min Q(Y − Xβ) : kβ − βk
= T + Q(Y − Xβ)e − T /2
e > D(Y − Xβ)
= T /2 + Q(Y − Xβ) e . (A.3.46)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 468 —


i i

468 APPENDIX A. ASYMPTOTIC RESULTS

Thus, D(Y − Xβ) > D(Y − Xβ), e = ǫ/√n. Since D is convex, we


e for kβ − βk
must also have D(Y − Xβ) > D(Y − Xβ), e ≥ ǫ/√n. But D(Y −
e for kβ − βk
e ≥ min D(Y − Xβ) = D(Y − Xβ).
Xβ) b Hence β b must lie inside the disk kβ −
√ h i
e = ǫ/ n with probability of at least 1−2δ; that is, P kβ
βk e < ǫ/√n >
b − βk
1 − 2δ. This yields the result.

A.3.4 Consistency of the Test Statistic Fϕ


This section contains a proof of the consistency of the test statistic Fϕ , The-
orem 3.6.2. We begin with a lemma.
Lemma A.3.1. Let a > 0 be given and let tn = min√ e
e
(Q(β)−Q(β)).
nkβ −β k=a
Then tn = (2τ )−1 a2 λn,1 where λn,1 is the minimum eigenvalue of n X X. −1 ′

Proof: After some computation, we have


√ √
Q(β) − Q(β) e ′ n−1 X′ X n(β − β)
e = (2τ )−1 n(β − β) e .

Let 0 < λn,1 ≤ · · · ≤ λn,p be the eigenvalues of n−1 X′ X and let γ n,1 , . . . , γ n,p
be a corresponding set of orthonormal
Pp eigenvectors. The spectral decompo-
sition of n X X is n X X = i=1 λn,iγ n,iγ ′n,i . From this we can show for
−1 ′ −1 ′

any vector δ that δ ′ n−1 X′ Xδ ≥ λn,1kδk2 and, that further, the minimum is
achieved over all vectors of unit length when δ = γ n,1. It then follows that
min δ ′ n−1 X′ Xδ = λn,1 a2 ,
kδ k=a

which yields the conclusion.


Note that by (D.2) of Section 3.4, λn,1 → λ1 , for some λ1 > 0. The following
is a restatement and a proof of Theorem 3.6.2.
Theorem A.3.10. Suppose conditions (E.1), (D.1), (D.2), and (S.1) of Sec-
tion 3.4 hold. The test statistic Fϕ is consistent for the hypotheses (3.2.5).
Proof: By the above discussion we need only show that (3.6.21) is true. Let
ǫ > 0 be given. Let c0 = (2τ )−1 χ2α,q . By Lemma A.3.1, choose a > 0 so large
that (2τ )−1 a2 λ1 > 3c0 + ǫ. Next choose n0 so large that (2τ )−1 a2 λn,1 > 3c0 ,
√ e
for n ≥ n0 . Since nkβ − β 0 k is bounded in probability, there exits a c > 0
and a n1 such that for n ≥ n1
Pβ (C1,n ) ≥ 1 − (ǫ/2) , (A.3.47)
0

√ e
where we define the event C1,n = { nkβ−β 0 k < c}. Since t > 0 by asymptotic
quadraticity, Theorem A.3.8, there exits an n2 such that for n > n2 ,
Pβ (C2,n ) ≥ 1 − (ǫ/2) , (A.3.48)
0

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 469 —


i i

A.3. RANK-BASED ANALYSIS OF LINEAR MODELS 469

where C2,n = {max√nkβ −β k≤c+a |Q(β) − D(β)| < (t/3)}. For the remainder
0
of the proof assume that n ≥ max{n0 , n1 , n2 } = n∗ . Next suppose β is such
√ e √ e
that nkβ − βk = a. Then on C1,n it follows that nkβ − βk ≤ c + a. Hence
on both C1,n and C2,n we have
D(β) > Q(β) − (t/3)
e + t − (t/3)
≥ Q(β)
e + 2(t/3)
= Q(β)
e + (t/3) .
> D(β)
√ e e > (t/3) > c0 .
Therefore, for all β such that nkβ − βk = a, D(β) − D(β)
√ e
But D is convex; hence on C1,n ∩ C2,n , for all β such that nkβ − βk ≥ a,
D(β) − D(β) e > (t/3) > c0 .

Finally choose n3 such that for n ≥ n3 , δ > (c + a)/ n where δ is the
positive distance between β 0 and Rr . Now assume that n ≥ max{n∗ , n3 } and
b = (β
C1,n ∩ C2,n is true. Recall that the reduced model R estimate is β b ′ , 0′ )′
r r,1
where βb lies in Rr ; hence,
r,1
√ √ √ √
nkβb r − βk
e ≥ nkβ b r − β 0 k − nkβe − β 0 k ≥ nδ − c > a .

b ) − D(β)
Thus on C1,n ∩ C2,n , D(β e > c0 . Thus for n sufficiently large we have
r

b ) − D(β)
P [D(β e > (2τ )−1 χ2 ] ≥ 1 − ǫ .
r α,q

Because ǫ was arbitrary (3.6.21) is true and consistency of Fϕ follows.

A.3.5 Proof of Lemma 3.5.1


The following lemma was used to establish the asymptotic linearity for the
sign process for linear models in Chapter 3. The proof of this lemma was first
given by Jurečková (1971) for general scores. We restate the lemma and give
its proof for sign scores.
Lemma A.3.2. Assume conditions (E.1), (E.2), (S.1), (D.1), and (D.2) of
Section 3.4. For any ǫ > 0 and for any a ∈ R,

lim P [|S1(Y − an−1/2 − Xβ b ) − S1 (Y − an−1/2 )| ≥ ǫ n] = 0 .
R
n→∞

Proof: Let a be arbitrary but fixed and let c > |a|. After matching notation,
Theorem A.4.3 leads to the result,

1 1
max √ S1 (Y − an−1/2
− Xβ) − √ S1 (Y) + (2f (0))a = op (1) .
n n
k(X′ X)1/2 β k≤c
(A.3.49)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 470 —


i i

470 APPENDIX A. ASYMPTOTIC RESULTS

Obviously the above result holds for β = 0. Hence for any ǫ > 0,
" #
1 1
P max √ S1 (Y − an −1/2
− Xβ) − √ S1 (Y − an −1/2
) ≥ ǫ ≤
n n
k(X′ X)1/2 β k≤c
" #
1
P max √ S1 (Y − an−1/2 − Xβ) − √1 S1 (Y) + (2f (0)a ≥ ǫ
n n 2
k(X′ X)1/2 β k≤c
 
1 1 ǫ
+P √ S1 (Y − an−1/2 ) − √ S1 (Y) + (2f (0)a ≥ .
n n 2

By (A.3.49), for n sufficiently large, the two terms on the rightside are arbi-
b is bounded
trarily small. The desired result follows from this since (X′ X)1/2 β
in probability.

A.4 Asymptotic Linearity for the L1 Analysis


In this section we obtain a linearity result for the L1 analysis of a linear
model. Recall from Section 3.6 that the L1 estimates are equivalent to the R
estimates when the rank scores are generated by the sign function; hence, the
distribution theory for the L1 estimates is derived in Section 3.4. The linearity
result derived below offers another way to obtain this result. More importantly
though, we need the linearity result for the proof of Lemma 3.5.6 of Section
3.5. As we next show, this result is a corollary to the linearity results derived
in the last section.
We assume the same linear model and use the same notation as in Section
3.2. Recall that the L1 estimate of β minimizes the dispersion function,
n
X
D1 (α, β) = |Yi − α − xi β| .
i=1

The corresponding gradient function is the (p+1)×1 vector whose components


are  Pn
− Pi=1 sgn(Yi − α − xi β) if j = 0
▽ j D1 = ,
− ni=1 xij sgn(Yi − α − xi β) if j = 1, . . . , p

where j = 0 denotes the partial of D1 with respect to α. The parameter α


denotes the location functional med(Yi − xi β), i.e., the median of the errors.
Without loss of generality, we assume that the true parameters are 0.
We first consider the simple linear model. Consider then the notation
of Section A.3; see (A.3.1)-(A.3.7). We derive the analogue of Theorem A.3.8

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 471 —


i i

A.4. ASYMPTOTIC LINEARITY FOR THE L1 ANALYSIS 471

for the processes


n
X α
U0 (α, ∆) = sgn(Yi − √ − ∆ci ) (A.4.1)
i=1
n
n
X α
U1 (α, ∆) = ci sgn(Yi − √ − ∆ci ) . (A.4.2)
i=1
n

Let pd = Πni=1 f0 (yi ) denote √


the likelihood for the iid observations Y1 , . . . , Yn
n
and let qd = Πi=1 f0 (yi + α/ n + ∆ci ) denote the likelihood of the variables
Yi − √αn − ∆ci . We assume throughout that f (0) > 0. Similar to Section
A.2.2, the sequence of densities qd is contiguous to the sequence pd . Note that
the processes U0 and U1 are already sums of independent variables; hence,
projections are unnecessary.
We first work with the process U1 .

Lemma A.4.1. Under the above assumptions and as n → ∞,

E0 (U1 (α, ∆)) → −2∆f0 (0) .

Proof: After some simplification we get


n
X  √ 
E0 (U1 (α, ∆)) = 2 ci F0 (0) − F0 (α/ n + ∆ci )
i=1
n
X √
= 2 ci (−∆ci − α/ n)f0 (ξin ) ,
i=1

where, by the mean value theorem, ξin is between 0 and |α/ n + ∆ci |. Since
the ci ’s are centered, we further obtain
n
X n
X
E0 (U1 (α, ∆)) = −2∆ c2i [f0 (ξin ) − f0 (0)] − 2∆ c2i f0 (0) .
i=1 i=1

By assumptions P of Section A.2.2, it follows that maxi |α/ n + ∆ci | → 0 as
n 2
n → ∞. Since i=1 ci = 1 and the assumptions that f0 continuous and
positive at 0, the desired result easily follows.
This leads us to our main result for U1 (α, ∆):

Theorem A.4.1. Under the above assumptions, for all α and ∆


P
U1 (α, ∆) − [U1 (0, 0) − ∆2f0 (0)] → 0 ,

as n → ∞.

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 472 —


i i

472 APPENDIX A. ASYMPTOTIC RESULTS

Because the ci ’s are centered it follows that Epd (U1 (0, 0)) = 0. Thus by
the last lemma, we need only show that Var(U1 (α, ∆) − U1 (0, 0)) → 0. By
considering the variance of the sign of a random variable, simplification leads
to the bound:
n
X √
Var((U1 (α, ∆) − U1 (0, 0)) ≤ 4 c2i |F0 (α/ n + ∆ci ) − F0 (0)| .
i=1

By our assumptions, maxi |∆ci + α/ n| → 0 as n → ∞. From this and the
continuity of F0 at 0, it follows that Var(U1 (α, ∆) − U1 (0, 0)) → 0.
We need analogous results for the process U0 (α, ∆).

Lemma A.4.2. Under the above assumptions,

E0 [U0 (α, ∆)] → −2αf0 (0) ,

as n → ∞.

Proof: Upon simplification and an application of the mean value theorem,


n   
2 X α
E0 [U0 (α, ∆)] = √ F0 (0) − F0 √ + ci ∆
n i=1 n
n  
−2 X α
= √ √ + ci ∆ f0 (ξin )
n i=1 n
n
−2α X
= [f0 (ξin ) − f0 (0)] − 2αf0 (0) ,
n i=1

where we have used √ the fact that the ci ’s are √ centered. Note that |ξin | is
between 0 and |α/ n + ci ∆| and that max |α/ n + ci ∆| → 0 as n → ∞. By
the continuity of f0 at 0, the desired result follows.

Theorem A.4.2. Under the above assumptions, for all α and ∆


P
U0 (α, ∆) − [U0 (0, 0) − 2αf0 (0)] → 0 ,

as n → ∞.

Because the medYi is 0, E0 [U0 (0, 0)] = 0. Hence by the last lemma it then
suffices to show that Var(U0 (α, ∆) − U0 (0, 0)) → 0. But,
n  
4 X α
.
Var(U0 (α, ∆) − U0 (0, 0)) ≤ F √ + c ∆ − F (0)
n i=1
0 i 0
n

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 473 —


i i

A.5. INFLUENCE FUNCTIONS 473



Because max |α/ n + ci ∆| → 0 and F0 is continuous at 0, Var(U0 (α, ∆) −
U0 (0, 0)) → 0.
Next consider the multiple regression model as discussed in Section A.3.
The only difference in notation is that here we have the intercept parameter
included. Let ∆ = (α, ∆1 , . . . , ∆p )′ denote the vector of all regression param-
eters. Take X = [1n : Xc ], where Xc denotes a centered design matrix√and as
in (A.3.2) take C = X(X′ X)−1/2 . Note that the first column of C is (1/ n)1n .
Let U(∆) = (U0 (∆), . . . , Up (∆))′ denote the vector of processes. Similar to
the discussion prior to Theorem A.3.1, the last two theorems imply that
P
U(∆) − [U(0) − 2f0 (0)∆] → 0 ,

for all real ∆ in Rp+1 .


As in Section A.3, we define the approximation quadratic to D1 as

Q1n (∆) = (2f0 (0))∆′ ∆/2 − ∆′ U(0) + D1 (0) .

The asymptotic linearity of U and the asymptotic quadraticity of D1 then


follow as in the last section. We state the result for reference:

Theorem A.4.3. Under conditions (3.4.1), (3.4.3), (3.4.7), and (3.4.8),


!
lim P max kU(∆) − (U(0) − (2f0 (0))∆) k ≥ ǫ =0, (A.4.3)
n→∞ k∆k≤c
!
lim P max |D1 (∆) − Q1 (∆)| ≥ ǫ =0, (A.4.4)
n→∞ k∆k≤c

for all ǫ > 0 and all c > 0.

A.5 Influence Functions


In this section we derive the influence functions found in Chapters 1-3. Discus-
sions of the influence function can be found in Staudte and Sheather (1990),
Hampel et al. (1986), and Huber (1981). For the influence functions of Chapter
3, we find the Gâteux derivative to be a convenient functional; see Fernholz
(1983) and Huber (1981) for rigorous discussions of functionals and deriva-
tives.

Definition A.5.1. Let T be a statistical functional defined on a space of


distribution functions and let H denote a distribution function in the domain
of T . We say that T is Gâteux differentiable at H if for any distribution

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 474 —


i i

474 APPENDIX A. ASYMPTOTIC RESULTS

function W , such that the distribution functions {(1 − s)H + sW } lie in the
domain of T , the following limit exists:
Z
T [(1 − s)H + sW ] − T [H]
lim = ψH dW , (A.5.1)
s→0 s
for some function ψH .

Note by taking W to be H in the above definition we have


Z
ψH dH = 0 . (A.5.2)

The usual definition of the influence function is obtained by taking the


distribution function W to be a point mass distribution. Denote the point
mass distribution function at t by ∆t (x). Letting W (x) = ∆t (x), the Gâteux
derivative of T (H) is

T [(1 − s)H + s∆s (x)] − T [H]


lim = ψH (x) . (A.5.3)
s→0 s
The function ψH (x) is called the influence function of T (H). Note that this
is the derivative of the functional T [(1 − s)H + s∆s (x)] at s = 0. It measures
the rate of change of the functional T (H) at H in the direction of ∆s . A
functional is said to be robust when this derivative is bounded.

A.5.1 Influence Function for Estimates Based on


Signed-Rank Statistics
In this section we derive the influence function for the one-sample location
estimate θbϕ+ , (1.8.5), discussed in Chapter 1. We assume that we are sampling
from a symmetric density h(x) with distribution function H(x), as in Section
1.8. As in Chapter 2, we assume that the one sample score function ϕ+ (u) is
defined by  
+ u+1
ϕ (u) = ϕ , (A.5.4)
2
where ϕ(u) is a nondecreasing, differentiable function defined on the interval
(0, 1) satisfying
ϕ(1 − u) = −ϕ(u) . (A.5.5)
Recall from Chapter 2 that this assumption is appropriate for scores for sam-
ples from symmetrical distributions. For convenience we extend ϕ+ (u) to the
interval (−1, 0) by
ϕ+ (−u) = −ϕ+ (u) . (A.5.6)

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 475 —


i i

A.5. INFLUENCE FUNCTIONS 475

Our functional T (H) is defined implicitly by the equation (1.8.5). Using the
symmetry of h(x), (A.5.5), and (A.5.6) we can write the defining equation for
θ = T (H) as
Z ∞
0 = ϕ+ (H(x) − H(2θ − x))h(x) dx
Z−∞

0 = ϕ(1 − H(2θ − x))h(x) dx . (A.5.7)
−∞

For the derivation, we proceed as discussed above; see the discussion around
expression (A.5.3). Consider the contaminated distribution of H(x) given by

Ht,ǫ (x) = (1 − ǫ)H(x) + ǫ∆t (x) , (A.5.8)

where 0 < ǫ < 1 is the proportion of contamination and ∆t (x) is the distri-
bution function for a point mass at t. By (A.5.3) the influence function is the
derivative of the functional at ǫ = 0. To obtain this derivative we implicitly
differentiate the defining equation (A.5.7) at Ht,ǫ (x); i.e., at
Z ∞
0 = (1 − ǫ) ϕ(1 − (1 − ǫ)H(2θ − x) − ǫ∆t (2θ − x))h(x) dx
−∞
Z ∞
= ǫ ϕ(1 − (1 − ǫ)H(2θ − x) − ǫ∆t (2θ − x)) d∆t (x) .
−∞

Let θ̇ denote the derivative of the functional. Implicitly differentiating this


equation and then setting ǫ = 0 and without loss of generality θ = 0, we get
Z ∞ Z ∞
0 = − ϕ(H(x))h(x) dx + ϕ′ (H(x))H(−x)h(x) dx
−∞ −∞
Z ∞ Z ∞
′ 2
= −2θ̇ ϕ (H(x))h (x) dx − ϕ′ (H(x))∆t (−x)h(x) dx + ϕ(H(t)) .
−∞ −∞
R
Label the four integrals in the above equation as I1 , . . . , I4 . Since ϕ(u) du =
0, I1 = 0. For I2 we get
Z ∞ Z ∞

I2 = ϕ (H(x))h(x) dx − ϕ′ (H(x))H(x)h(x) dx
−∞ −∞
Z 1 Z 1
= ϕ′ (u) du − ϕ′ (u)u du = −ϕ(0) .
0 0

Next I4 reduces to
Z −t Z H(−t)

− ϕ (H(x))h(x) dx = − ϕ′ (u) du = ϕ(H(t)) + ϕ(0) .
−∞ 0

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 476 —


i i

476 APPENDIX A. ASYMPTOTIC RESULTS

Combining these results and solving for θ̇ leads to the influence function which
we can write in either of the following two ways,
ϕ(H(t))
Ω(t, θbϕ+ ) = R ∞
−∞
ϕ′ (H(x))h2 (x) dx
ϕ+ (2H(t) − 1)
= R∞ . (A.5.9)
4 0
ϕ+′ (2H(x) − 1)h2 (x) dx

A.5.2 Influence Functions for Chapter 3


In this section, we derive the influence functions which were presented in
Chapter 3. Much of this work was developed in Witt (1989) and Witt, McKean,
and Naranjo (1995). The correlation model of Section 3.11 is the underlying
model for the influence functions derived in this section. Recall that the joint
distribution function of x and Y is H, the distribution functions of x, Y , and
e are M, G, and F , respectively, and Σ is the variance-covariance matrix of
x.
b denote the R estimate of β for a specified score function ϕ(u). In this
Let β ϕ
section we are interested in deriving the influence functions of this R estimate
and of the corresponding R test statistic for the general linear hypotheses.
We obtain these influence functions by using the definition of the Gâteux
derivative of a functional, (A.5.1). The influence functions are then obtained
by taking W to be the point mass distribution function ∆(x0 ,y0 ) ; see expression
(A.5.3). If T is Gâteux differentiable at H then by setting W = ∆(x0 ,y0 ) we
see that the influence function of T is given by
Z
Ω(x0 , y0 ; T ) = ψH d∆(x0 ,y0 ) = ψH (x0 , y0) . (A.5.10)

P For example, we obtain the influence function of the statistic D(0) =


a(R(YRi ))Yi . Since G is the distribution function of Y , the functional is
T [G] = ϕ(G(y))ydG(y). Hence for a given distribution function W ,
Z
T [(1 − s)G + sW ] = (1 − s) ϕ[(1 − s)G(y) + sW (y)]ydG(y)
Z
+s ϕ[(1 − s)G(y) + sW (y)]ydW (y) .

Taking the partial derivative of the right side with respect to s, setting s = 0,
and substituting ∆y0 for W leads to the influence function
Z Z
Ω(y0 ; D(0)) = − ϕ(G(y))ydG(y) − ϕ′ (G(y))G(y)ydG(y)
Z ∞
+ ϕ′ (G(y))ydG(y) + ϕ(G(y0))y0 . (A.5.11)
y0

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 477 —


i i

A.5. INFLUENCE FUNCTIONS 477

Note that this is not bounded in the Y -space and, hence, the statistic D(0)
is not robust. Thus, as noted in Section 3.11, the coefficient of multiple de-
termination R1 , (3.11.16), is not robust. A similar development establishes
the influence function for the denominator of LS coefficient of multiple deter-
mination R2 , showing too that it is not bounded. Hence R2 is not a robust
statistic.
Another example is the influence function of the least squares estimate of
β which is given by
b ) = σ −1 y0 Σ−1 x0 .
Ω(x0 , y0 ; β (A.5.12)
LS

The influence function of the least squares estimate is, thus, unbounded in
both the Y - and x-spaces.

b
Influence Function of β ϕ

Recall that H is the joint distribution function of x and Y . Let the p × 1


vector T (H) denote the functional corresponding to βb . Assume without loss
ϕ
of generality that the true β = 0, α = 0, and that the Ex = 0. Hence
the distribution function of Y is F (y) and Y and x are independent; i.e.,
H(x, y) = M(x)F (y).
Recall that the R estimate satisfies the equations
n
X .
xi a(R(Yi − x′i β)) = 0 .
i=1

b∗n denote the empirical distribution function of Yi − x′i β. Then we can


Let G
rewrite the above equations as
Xn  
n b∗ ′ 1 .
n xi ϕ Gn (Yi − xi β) =0.
i=1
n+1 n

Let G∗ denote the distribution function of Y − x′ T (H). Then the functional


T (H) satisfies Z
ϕ(G∗ (y − x′ T (H))xdH(x, y) = 0 . (A.5.13)

We can show that


Z Z

G (t) = dH(v, u) . (A.5.14)
u≤v′ T (H)+t

Let Hs = (1 − s)H + sW for an arbitrary distribution function W . Then


the functional T (H) evaluated at Hs satisfies the equation
Z Z
(1−s) ϕ(Gs (y−x T (Hs ))xdH(x, y)+s ϕ(G∗s (y−x′ T (Hs ))xdW (x, y) = 0 ,
∗ ′

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 478 —


i i

478 APPENDIX A. ASYMPTOTIC RESULTS

where G∗s is the distribution function of Y − x′ T (Hs ). We obtain ∂T /∂s by


implicit differentiation. Then upon substituting ∆x0 ,y0 for W the influence
function is given by (∂T /∂s) |s=0, which we denote by Ṫ . Implicit differentia-
tion leads to
Z
0 = − ϕ(G∗s (y − x′ T (Hs ))xdH(x, y)
Z
∂G∗
−(1 − s) ϕ′ (G∗s (y − x′ T (Hs )) s xdH(x, y)
∂s
Z
+ ϕ(G∗s (y − x′ T (Hs ))xdW (x, y) + sB1 , (A.5.15)

where B1 is irrelevant since we are setting s to 0. We first get the partial


derivative of G∗s with respect to s. By (A.5.14) and the independence between
Y and x at H, we have
Z Z
∗ ′
Gs (y − x T (Hs )) = dHs (v, u)
u≤y−T (Hs )′ (x−v)
Z
= (1 − s) F [y − T (Hs )′ (x − v)]dM(v)
Z Z
+s dW (v, u) .
u≤y−T (Hs )′ (x−v)

Thus,
Z
∂G∗s (y − x′ T (Hs ))
= − F [y − T (Hs )′ (x − v)]dM(v)
∂s
Z
∂T
+(1 − s) F ′ [y − T (Hs )′ (x − v)](v − x)′ dM(v)
∂s
Z Z
+ dW (v, u) + sB2 ,
u≤y−T (Hs )′ (x−v)

where B2 is irrelevant since we are setting s to 0. Therefore using the inde-


pendence between Y and x at H, T (H) = 0, and Ex = 0, we get
∂G∗s (y − x′ T (Hs ))
|s=0 = −F (y) − f (y)x′ Ṫ + WY (y) , (A.5.16)
∂s
where WY denotes the marginal (second variable) distribution function of W .
Upon evaluating expression (A.5.15) at s = 0 and substituting into it
expression (A.5.16) we have
Z Z
− xϕ(F (y))dH(x, y)+ xϕ′ (F (y))[−F (y)−f (y)x′Ṫ +WY (y)]dH(x, y)
Z
+ xϕ(F (y))dW (x, y) =
Z Z
′ ′
− ϕ (F (y))f (y)xx Ṫ dH(x, y) + xϕ(F (y))dW (x, y) = 0 .

i i

i i
i i

“book” — 2010/11/17 — 16:39 — page 479 —


i i

A.5. INFLUENCE FUNCTIONS 479

Substituting $\Delta_{x_0,y_0}$ in for $W$, we get
\[
0 = -\tau^{-1}\Sigma\dot{T} + x_0\,\varphi(F(y_0)) .
\]
Solving this last expression for $\dot{T}$, the influence function of $\hat{\beta}_\varphi$ is given by
\[
\Omega(x_0, y_0; \hat{\beta}_\varphi) = \tau\,\Sigma^{-1}\varphi(F(y_0))\,x_0 . \tag{A.5.17}
\]
Hence the influence function of $\hat{\beta}_\varphi$ is bounded in the $Y$-space but not in the $x$-space. The estimate is thus bias robust. Note that the asymptotic representation of $\hat{\beta}_\varphi$ in Corollary 3.5.23 can be written in terms of this influence function as
\[
\sqrt{n}\,\hat{\beta}_\varphi = n^{-1/2}\sum_{i=1}^{n} \Omega(x_i, Y_i; \hat{\beta}_\varphi) + o_p(1) .
\]
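The contrast with least squares can be seen numerically. The sketch below evaluates (A.5.17) for the Wilcoxon score $\varphi(u) = \sqrt{12}(u - 1/2)$ with standard normal errors; the values of $\tau$ and $\Sigma$ used here are illustrative assumptions for this special case, not part of the general result.

import numpy as np
from scipy.stats import norm

# Sketch of (A.5.17) with Wilcoxon scores and N(0,1) errors (assumptions):
# tau = sqrt(pi/3) at the normal, Sigma = I_2, phi(u) = sqrt(12)(u - 1/2).
tau = np.sqrt(np.pi / 3.0)
Sigma_inv = np.eye(2)

def infl_r(x0, y0):
    phi = np.sqrt(12.0) * (norm.cdf(y0) - 0.5)  # phi(F(y0)); |phi| <= sqrt(3)
    return tau * phi * (Sigma_inv @ x0)

x0 = np.array([1.0, 1.0])
for y0 in [1.0, 10.0, 100.0]:
    print(y0, np.linalg.norm(infl_r(x0, y0)))   # bounded in the Y-space
print(np.linalg.norm(infl_r(100.0 * x0, 1.0))) # but linear in the x-space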

Influence Function of $F_\varphi$

Rewrite the correlation model as
\[
Y = \alpha + x_1'\beta_1 + x_2'\beta_2 + e
\]
and consider testing the general linear hypotheses
\[
H_0: \beta_2 = 0 \;\text{ versus }\; H_A: \beta_2 \ne 0 , \tag{A.5.18}
\]
where $\beta_1$ and $\beta_2$ are $(p-q) \times 1$ and $q \times 1$ vectors of parameters, respectively. Let $\hat{\beta}_{1,\varphi}$ denote the reduced model estimate. Recall that the R test based upon the drop in dispersion is given by
\[
F_\varphi = \frac{RD/q}{\hat{\tau}/2} ,
\]
where $RD = D(\hat{\beta}_{1,\varphi}) - D(\hat{\beta}_\varphi)$ is the reduction in dispersion. In this section we want to derive the influence function of the test statistic.

Let $RD(H)$ denote the functional for the statistic $RD$. Then
\[
RD(H) = D_1(H) - D_2(H) ,
\]
where $D_1(H)$ and $D_2(H)$ are the reduced and full model functionals given by
\begin{align*}
D_1(H) &= \int \varphi[G^*(y - x_1'T_1(H))]\,(y - x_1'T_1(H))\, dH(x,y) \\
D_2(H) &= \int \varphi[G^*(y - x'T(H))]\,(y - x'T(H))\, dH(x,y) , \tag{A.5.19}
\end{align*}


and $T_1(H)$ and $T_2(H)$ denote the reduced and full model functionals for $\beta_1$ and $\beta$, respectively. Let $\beta_r = (\beta_1', 0')'$ denote the true vector of parameters under $H_0$. Then the random variables $Y - x'\beta_r$ and $x$ are independent. Next write $\Sigma$ as
\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} .
\]
It is convenient to define the matrices $\Sigma_r$ and $\Sigma_r^+$ as
\[
\Sigma_r = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}
\quad\text{and}\quad
\Sigma_r^+ = \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} .
\]
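As a numerical aside, the matrix $\Sigma^{-1} - \Sigma_r^+$ is positive semidefinite (it is a quadratic form in the inverse of the Schur complement of $\Sigma_{11}$), so the square root appearing below in (A.5.20) is real. The following sketch checks this on one randomly generated $\Sigma$; the dimensions are illustrative.

import numpy as np

# Check, on one example, that Sigma^{-1} - Sigma_r^+ is positive semidefinite.
rng = np.random.default_rng(3)
p, p1 = 4, 2                                  # p1 = dimension of Sigma_11
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)               # a positive definite Sigma
Sigma_r_plus = np.zeros((p, p))
Sigma_r_plus[:p1, :p1] = np.linalg.inv(Sigma[:p1, :p1])
print(np.linalg.eigvalsh(np.linalg.inv(Sigma) - Sigma_r_plus))  # all >= 0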

As above, let $H_s = (1-s)H + sW$. We begin with a lemma.

Lemma A.5.1. Under the correlation model,

(a) $RD(0) = 0$

(b) $\displaystyle\left.\frac{\partial RD(H_s)}{\partial s}\right|_{s=0} = 0$

(c) $\displaystyle\left.\frac{\partial^2 RD(H_s)}{\partial s^2}\right|_{s=0} = \tau\,\varphi^2[F(y - x'\beta_r)]\;x'\left[\Sigma^{-1} - \Sigma_r^+\right]x .$
Proof: Part (a) is immediate. For Part (b), it follows from (A.5.19) that
\begin{align*}
\frac{\partial D_2(H_s)}{\partial s} ={}& -\int \varphi[G_s^*(y - x'T(H_s))]\,(y - x'T(H_s))\, dH \\
 &+ (1-s)\int \varphi'[G_s^*(y - x'T(H_s))]\,(y - x'T(H_s))\,\frac{\partial G_s^*}{\partial s}\, dH \\
 &+ (1-s)\int \varphi[G_s^*(y - x'T(H_s))]\left(-x'\frac{\partial T}{\partial s}\right) dH \\
 &+ \int \varphi[G_s^*(y - x'T(H_s))]\,(y - x'T(H_s))\, dW + sB ,
\end{align*}
where $B$ is irrelevant because we are setting $s$ to 0. Evaluating this at $s = 0$ and using the independence of $Y - x'\beta_r$ and $x$, and $E(x) = 0$, we get after some simplification
\begin{align*}
\left.\frac{\partial D_2(H_s)}{\partial s}\right|_{s=0} ={}& -\int \varphi[F(y - x'\beta_r)]\,(y - x'\beta_r)\, dH \\
 &- \int \varphi'[F(y - x'\beta_r)]\,F(y - x'\beta_r)\,(y - x'\beta_r)\, dH \\
 &+ \int \varphi'[F(y - x'\beta_r)]\,W_Y(y - x'\beta_r)\,(y - x'\beta_r)\, dH \\
 &+ \varphi[F(y_0 - x_0'\beta_r)]\,(y_0 - x_0'\beta_r) .
\end{align*}


Differentiating as above and using $x'\beta_r = x_1'\beta_1$, we get the same expression for $\partial D_1/\partial s|_{s=0}$. Hence Part (b) is true. Taking the second partial derivatives of $D_1(H)$ and $D_2(H)$ with respect to $s$, the result for Part (c) can be obtained. This is a tedious derivation; details can be found in Witt (1989) and Witt et al. (1995).

Since $F_\varphi$ is nonnegative, there is no loss in generality in deriving the influence function of $\sqrt{qF_\varphi}$. Letting $Q^2 = 2\tau^{-1}RD$, we have
\[
\Omega(x_0, y_0; \sqrt{qF_\varphi}) = \lim_{s\to0} \frac{Q[(1-s)H + s\Delta_{x_0,y_0}] - Q[H]}{s} .
\]
But $Q[H] = 0$ by Part (a) of Lemma A.5.1. Hence we can rewrite the above limit as
\[
\Omega(x_0, y_0; \sqrt{qF_\varphi}) = \lim_{s\to0} \left\{\frac{Q^2[(1-s)H + s\Delta_{x_0,y_0}]}{s^2}\right\}^{1/2} .
\]
Using Parts (a) and (b) of Lemma A.5.1, we can apply L'Hospital's rule twice to evaluate this limit. Thus
\begin{align*}
\Omega(x_0, y_0; \sqrt{qF_\varphi}) &= \lim_{s\to0}\left\{\frac12 \frac{\partial^2 Q^2}{\partial s^2}\right\}^{1/2}
 = \left\{\tau^{-1}\frac{\partial^2 RD}{\partial s^2}\right\}^{1/2} \\
 &= \left|\varphi[F(y_0 - x_0'\beta_r)]\right| \sqrt{x_0'\left[\Sigma^{-1} - \Sigma_r^+\right]x_0} . \tag{A.5.20}
\end{align*}
Hence, the influence function of the rank-based test statistic $F_\varphi$ is bounded in the $Y$-space as long as the score function is bounded. It can be shown that the influence function of the least squares test statistic is not bounded in the $Y$-space. It is clear from the above argument that the coefficient of multiple determination $R_2$ is also robust. Hence, for R fits, $R_2$ is the preferred coefficient of determination.

However, the influence function of the rank-based test statistic $F_\varphi$ is not bounded in the $x$-space. In Chapter 3 we presented test statistics whose influence functions are bounded in both spaces; they are, however, less efficient.

The asymptotic distribution of $qF_\varphi$ was derived in Section 3.6; however, we can use the above result on the influence function to display it immediately. If we expand $Q^2$ in a von Mises expansion at $H$, we have
\begin{align*}
Q^2(H_s) &= Q^2(H) + \left.\frac{\partial Q^2}{\partial s}\right|_{s=0} s + \frac12 \left.\frac{\partial^2 Q^2}{\partial s^2}\right|_{s=0} s^2 + R \\
 &= \left[\int \varphi(F(y - x'\beta_r))\,x'\, d\Delta_{x_0,y_0}(x,y)\right]\left[\Sigma^{-1} - \Sigma_r^+\right]
    \left[\int \varphi(F(y - x'\beta_r))\,x\, d\Delta_{x_0,y_0}(x,y)\right] s^2 + R . \tag{A.5.21}
\end{align*}


Upon substituting the empirical distribution function for $\Delta_{x_0,y_0}$ in expression (A.5.21), we have at the sample
\[
nQ^2(H_s) = \left[\frac{1}{\sqrt n}\sum_{i=1}^n x_i\,\varphi\!\left(\frac{1}{n+1}R(Y_i - x_i'\beta_r)\right)\right]'
\left[\Sigma^{-1} - \Sigma_r^+\right]
\left[\frac{1}{\sqrt n}\sum_{i=1}^n x_i\,\varphi\!\left(\frac{1}{n+1}R(Y_i - x_i'\beta_r)\right)\right] + o_p(1).
\]
This expression is equivalent to expression (3.6.10), which yields the asymptotic distribution of the test statistic in Section 3.6.
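As a check on this equivalence, the following Monte Carlo sketch computes the sample quadratic form above under $H_0$ with Wilcoxon scores and $\beta_r = 0$ known; the design, dimensions, and normal errors are illustrative assumptions. Its null distribution should be approximately chi-square with $q$ degrees of freedom, the limit identified in Section 3.6.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
n, p, q = 200, 3, 1                        # q components under test (assumed)
phi = lambda u: np.sqrt(12.0) * (u - 0.5)  # Wilcoxon score

stats = []
for _ in range(2000):
    X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
    y = rng.normal(size=n)                 # null model, beta_r = 0
    ranks = np.argsort(np.argsort(y)) + 1.0
    S = (X * phi(ranks / (n + 1.0))[:, None]).sum(axis=0) / np.sqrt(n)
    Sigma = X.T @ X / n
    Sr_plus = np.zeros((p, p))
    Sr_plus[:p - q, :p - q] = np.linalg.inv(Sigma[:p - q, :p - q])
    stats.append(S @ (np.linalg.inv(Sigma) - Sr_plus) @ S)
print(np.mean(stats), chi2(q).mean())      # both should be close to q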

A.5.3 Influence Function of $\hat{\beta}_{HBR}$ of Section 3.12.4

The influence function of the high breakdown estimator $\hat{\beta}_{HBR}$ is discussed in Section 3.12.4. In this section, we restate Theorem 3.12.4 and then derive a proof of it.

Theorem A.5.1. The influence function for the estimate $\hat{\beta}_{HBR}$ is given by
\[
\Omega(x_0, y_0; \hat{\beta}_{HBR}) = \frac12\, C_H^{-1} \iint (x_0 - x_1)\, b(x_1, x_0, y_1, y_0)\,\operatorname{sgn}\{y_0 - y_1\}\, dF(y_1)\, dM(x_1) , \tag{A.5.22}
\]
where $C_H$ is given by expression (3.12.24).

Proof: Let $\Delta_0(x, y)$ denote the distribution function of the point mass at the point $(x_0, y_0)$ and consider the contaminated distribution $H_t = (1-t)H + t\Delta_0$ for $0 < t < 1$. Let $\beta(H_t)$ denote the functional at $H_t$. Then $\beta(H_t)$ satisfies
\[
0 = \iint x_1\, b(x_1, x_2, y_1, y_2)\left[I(y_2 - y_1 < (x_2 - x_1)'\beta(H_t)) - \frac12\right] dH_t(x_1, y_1)\, dH_t(x_2, y_2) . \tag{A.5.23}
\]
We next implicitly differentiate (A.5.23) with respect to $t$ to obtain the derivative of the functional. The value of this derivative at $t = 0$ is the influence function. Without loss of generality, we can assume that the true parameter $\beta = 0$. Under this assumption, $x$ and $y$ are independent. Substituting the value

of $H_t$ into (A.5.23) and expanding, we obtain the four terms:
\begin{align*}
0 ={}& (1-t)^2 \iiint x_1 \left[\int_{-\infty}^{y_1 + (x_2-x_1)'\beta(H_t)} b(x_1, x_2, y_1, y_2)\, dF(y_2) - \frac12\right] dM(x_2)\, dM(x_1)\, dF(y_1) \\
 &+ (1-t)t \iiiint x_1\, b(x_1, x_2, y_1, y_2)\left[I(y_2 - y_1 < (x_2-x_1)'\beta(H_t)) - \frac12\right] dM(x_2)\, dF(y_2)\, d\Delta_0(x_1, y_1) \\
 &+ (1-t)t \iiiint x_1\, b(x_1, x_2, y_1, y_2)\left[I(y_2 - y_1 < (x_2-x_1)'\beta(H_t)) - \frac12\right] d\Delta_0(x_2, y_2)\, dM(x_1)\, dF(y_1) \\
 &+ t^2 \iiiint x_1\, b(x_1, x_2, y_1, y_2)\left[I(y_2 - y_1 < (x_2-x_1)'\beta(H_t)) - \frac12\right] d\Delta_0(x_2, y_2)\, d\Delta_0(x_1, y_1) .
\end{align*}
Let $\dot{\beta}$ denote the derivative of the functional evaluated at 0. Proceeding to implicitly differentiate this equation and evaluating the derivative at 0, we get, after some derivation,
\begin{align*}
0 ={}& \left[\iiint x_1\, b(x_1, x_2, y_1, y_1)\, f^2(y_1)\,(x_2 - x_1)'\, dy_1\, dM(x_1)\, dM(x_2)\right] \dot{\beta} \\
 &+ \iint x_0\, b(x_0, x_2, y_0, y_2)\left[I(y_2 < y_0) - \frac12\right] dF(y_2)\, dM(x_2) \\
 &+ \iint x_1\, b(x_1, x_0, y_1, y_0)\left[I(y_0 < y_1) - \frac12\right] dF(y_1)\, dM(x_1) .
\end{align*}
Once again using the symmetry in the $x$ arguments and $y$ arguments of the function $b$, we can simplify this expression to
\[
0 = -\left[\frac12 \iiint (x_2 - x_1)\, b(x_1, x_2, y_1, y_1)\,(x_2 - x_1)'\, f^2(y_1)\, dy_1\, dM(x_1)\, dM(x_2)\right] \dot{\beta}
 + \iint (x_0 - x_1)\, b(x_1, x_0, y_1, y_0)\left[I(y_1 < y_0) - \frac12\right] dF(y_1)\, dM(x_1) .
\]
Using the relationship between the indicator function and the sign function, and the definition of $C_H$, (3.12.24), we can rewrite this last expression as
\[
0 = -C_H \dot{\beta} + \frac12 \iint (x_0 - x_1)\, b(x_1, x_0, y_1, y_0)\,\operatorname{sgn}\{y_0 - y_1\}\, dF(y_1)\, dM(x_1) .
\]
Solving for $\dot{\beta}$ leads to the desired result.
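The double integral in (A.5.22) is easy to approximate by Monte Carlo. The sketch below does so under the illustrative assumptions $b \equiv 1$ (so the weights reduce to Wilcoxon-type weights), $x \sim N(0, I_2)$, and standard normal errors; with nontrivial HBR weights, $b(x_1, x_0, y_1, y_0)$ would also downweight outlying points $(x_0, y_0)$.

import numpy as np

rng = np.random.default_rng(11)
m = 200_000
x1 = rng.normal(size=(m, 2))                   # draws from M
y1 = rng.normal(size=m)                        # draws from F
x2 = rng.normal(size=(m, 2))
int_f2 = 1.0 / (2.0 * np.sqrt(np.pi))          # integral of f^2 for N(0,1)
d = x2 - x1
C_H = 0.5 * int_f2 * (d.T @ d) / m             # C_H via the triple integral
                                               # in the proof, with b = 1

def infl_hbr(x0, y0):
    g = (x0 - x1) * np.sign(y0 - y1)[:, None]  # integrand of (A.5.22), b = 1
    return 0.5 * np.linalg.solve(C_H, g.mean(axis=0))

print(infl_hbr(np.array([1.0, 1.0]), 2.0))
print(infl_hbr(np.array([1.0, 1.0]), 200.0))   # bounded in y0 when b = 1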


A.6 Asymptotic Theory for Section 3.12.3

The theoretical results needed for Section 3.12.3 of Chapter 3 were derived in Chang, McKean, Naranjo, and Sheather (1999). Our development is taken from this article. The main goal is to prove Theorem 3.12.2, which we restate here:

Theorem A.6.1. Under assumptions (E.1), (3.4.1), and (H.1)-(H.4), (3.12.10)-(3.12.13),
\[
\sqrt{n}(\hat{\beta}_{HBR} - \beta) \xrightarrow{d} N\!\left(0, \tfrac14\, C^{-1}\Sigma_H C^{-1}\right).
\]

Besides the notation of Chapter 3, we need:

1. $W_{ij}(\Delta) = \frac12[\operatorname{sgn}(z_j - z_i) - \operatorname{sgn}(y_j - y_i)]$, where $z_j = y_j - x_j'\Delta/\sqrt{n}$. (A.6.1)

2. $t_{ij}(\Delta) = (x_j - x_i)'\Delta/\sqrt{n}$. (A.6.2)

3. $B_{ij}(t) = E[b_{ij}\, I(0 < Y_i - Y_j < t)]$. (A.6.3)

4. $\gamma_{ij} = B_{ij}'(0)/E(b_{ij})$. (A.6.4)

5. $C_n = \sum_{i<j} \gamma_{ij}\, b_{ij}\,(x_j - x_i)(x_j - x_i)'$. (A.6.5)

6. $R(\Delta) = n^{-3/2}\left[\sum_{i<j} b_{ij}(x_j - x_i)W_{ij}(\Delta) + C_n\Delta/\sqrt{n}\right]$. (A.6.6)
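This notation translates directly into code. The following sketch computes $W_{ij}(\Delta)$, $t_{ij}(\Delta)$, $C_n$, and $R(\Delta)$ for a toy data set; the weights $b_{ij} \equiv 1$ and the value used for $\gamma_{ij}$ are illustrative stand-ins, since in practice they depend on the underlying model.

import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 2
x = rng.normal(size=(n, p))                   # rows are x_i'
y = rng.normal(size=n)
Delta = np.array([0.5, -1.0])
i, j = np.triu_indices(n, k=1)                # all pairs i < j

z = y - x @ Delta / np.sqrt(n)                # z_j in (A.6.1)
W = 0.5 * (np.sign(z[j] - z[i]) - np.sign(y[j] - y[i]))    # (A.6.1)
t = (x[j] - x[i]) @ Delta / np.sqrt(n)                     # (A.6.2)
b = np.ones(len(W))                           # given weights b_ij (assumed 1)
gamma = 0.28                                  # stand-in for B'_ij(0)/E(b_ij)
d = x[j] - x[i]
Cn = (d * (gamma * b)[:, None]).T @ d                      # (A.6.5)
R = n ** -1.5 * ((d * (b * W)[:, None]).sum(axis=0)
                 + Cn @ Delta / np.sqrt(n))                # (A.6.6)
print(R)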

Without loss of generality, we assume that the true $\beta_0$ is 0. We begin with the following lemma.

Lemma A.6.1. Under assumptions (E.1), (3.4.1), and (H.1), (3.12.13),
\[
B_{ij}(t) = \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} b(x_i, x_j, y_j + t, y_j, \hat{\beta}_0)\, f(y_j + t)\, f(y_j) \prod_{k\ne i,j} f(y_k)\; dy_1\cdots dy_n
\]
is continuous in $t$.

Proof: This result follows from (3.4.1), (3.12.13), and an application of Leibniz's rule on differentiation of definite integrals.

Let $\Delta$ be arbitrary but fixed. Denote $W_{ij}(\Delta)$ by $W_{ij}$, suppressing dependence on $\Delta$.

Lemma A.6.2. Under assumptions (E.1), (3.4.1), and (H.4), (3.12.13), there exist constants $|\xi_{ij}| < |t_{ij}|$ such that $E(b_{ij}W_{ij}) = -t_{ij}B_{ij}'(\xi_{ij})$.


Proof: Since $W_{ij} = 1, -1,$ or $0$ according as $t_{ij} < y_j - y_i < 0$, $0 < y_j - y_i < t_{ij}$, or otherwise, we have
\[
E_{\beta_0}(b_{ij}W_{ij}) = \int_{t_{ij} < y_j - y_i < 0} b_{ij}\, f_Y(y)\, dy - \int_{0 < y_j - y_i < t_{ij}} b_{ij}\, f_Y(y)\, dy .
\]
When $t_{ij} > 0$, $E(b_{ij}W_{ij}) = -B_{ij}(t_{ij}) = B_{ij}(0) - B_{ij}(t_{ij}) = -t_{ij}B_{ij}'(\xi_{ij})$ by Lemma A.6.1 and the Mean Value Theorem. The same result holds for $t_{ij} < 0$, which proves the lemma.

Lemma A.6.3. Under assumptions (H.3), (3.12.12), and (H.4), (3.12.13), we have
\[
b_{ij} = g_{ij}(\hat{\beta}_0) = g_{ij}(0) + [\nabla g_{ij}(\xi)]'\hat{\beta}_0 = g_{ij}(0) + O_p(1/\sqrt{n}),
\]
uniformly over all $i$ and $j$, where $\|\xi\| \le \|\hat{\beta}_0\|$.

Proof: Follows from a multivariate Mean Value Theorem (see, e.g., page 355 of Apostol, 1974), and by (3.12.12) and (3.12.13).

Lemma A.6.4. Under assumptions (3.12.10)-(3.12.13), (3.4.1), (3.4.7), and (3.4.8),

(i) $E(g_{ij}(0)g_{ik}(0)W_{ij}W_{ik}) \to 0$, as $n \to \infty$,

(ii) $E(g_{ij}(0)W_{ij}) \to 0$, as $n \to \infty$,

uniformly over $i$ and $j$.

Proof: Without loss of generality, let $t_{ij} > 0$ and $t_{ik} > 0$, where the indices $i$, $j$, and $k$ are all different. Then
\[
E(g_{ij}(0)g_{ik}(0)W_{ij}W_{ik}) = E[g_{ij}g_{ik}\, I(0 < Y_j - Y_i < t_{ij})\, I(0 < Y_k - Y_i < t_{ik})]
 = \int_{-\infty}^{\infty}\int_{y_i}^{y_i + t_{ik}}\int_{y_i}^{y_i + t_{ij}} g_{ij}\,g_{ik}\, f_i f_j f_k\; dy_j\, dy_k\, dy_i .
\]
Assumptions (3.4.7) and (3.4.8) imply $(1/n)\max_i (x_{ik} - \bar{x}_k)^2 \to 0$ for all $k$, or equivalently $(1/\sqrt{n})\max_i |x_{ik} - \bar{x}_k| \to 0$ for all $k$, which implies that $t_{ij} \to 0$. Since the integrand is bounded, this proves (i). Similarly, $E(g_{ij}(0)W_{ij}) = \int_{-\infty}^{\infty}\int_{y_i}^{y_i + t_{ij}} g_{ij}\, f_i f_j\, dy_j\, dy_i \to 0$, which proves (ii).

Lemma A.6.5. Under assumptions (3.12.10)-(3.12.13), (3.4.1), (3.4.7), and (3.4.8),

(i) $\operatorname{Cov}(b_{12}W_{12}, b_{34}W_{34}) = o(n^{-1})$.

(ii) $\operatorname{Cov}(b_{12}W_{12}, b_{34}) = o(n^{-1})$.

(iii) $\operatorname{Cov}(b_{12}W_{12}, b_{13}W_{13}) = o(1)$.

(iv) $\operatorname{Cov}(b_{12}W_{12}, b_{13}) = o(1)$.


Proof: To prove (i), recall that $b_{12} = g_{12}(0) + [\nabla g_{12}(\xi)]'\hat{\beta}_0$. Thus
\begin{align*}
\operatorname{Cov}(b_{12}W_{12}, b_{34}W_{34}) ={}& \operatorname{Cov}(g_{12}(0)W_{12},\, g_{34}(0)W_{34}) \\
 &+ 2\operatorname{Cov}([\nabla g_{12}(\xi)]'\hat{\beta}_0 W_{12},\, g_{34}(0)W_{34}) \\
 &+ \operatorname{Cov}([\nabla g_{12}(\xi)]'\hat{\beta}_0 W_{12},\, [\nabla g_{34}(\xi)]'\hat{\beta}_0 W_{34}) .
\end{align*}
Let $I_1$, $I_2$, and $I_3$ denote the three terms on the right side. The term $I_1$ is 0, by independence. Now,
\begin{align*}
I_2 &= 2E\left\{[\nabla g_{12}(\xi)]'\hat{\beta}_0\, W_{12}\, g_{34}(0)W_{34}\right\}
 - 2E\left\{[\nabla g_{12}(\xi)]'\hat{\beta}_0\, W_{12}\right\} E\left\{g_{34}(0)W_{34}\right\} \\
 &= I_{21} - I_{22} .
\end{align*}
Write the first term above as
\[
I_{21} = 2(1/n)E\left\{[\nabla g_{12}(\xi)]'\hat{\beta}_0\, g_{34}(0)(\sqrt{n}\,W_{12})(\sqrt{n}\,W_{34})\right\} .
\]
The term $[\nabla g_{12}(\xi)]'\hat{\beta}_0 = b_{12} - g_{12}(0)$ is bounded and of magnitude $o_p(1)$. If we can show that $\sqrt{n}\,W_{12}$ is integrable, then it follows, using standard arguments, that $I_{21} = o(1/n)$. Let $F^*$ denote the cdf of $y_2 - y_1$ and $f^*$ denote its pdf. Using the Mean Value Theorem,
\begin{align*}
E[\sqrt{n}\,W_{12}(\Delta)] &= \sqrt{n}\,(1/2)E[\operatorname{sgn}(Y_2 - Y_1 - (x_2 - x_1)'\Delta/\sqrt{n}) - \operatorname{sgn}(Y_2 - Y_1)] \\
 &= \sqrt{n}\,(1/2)[2F^*(-(x_2 - x_1)'\Delta/\sqrt{n}) - 2F^*(0)] \\
 &= -\sqrt{n}\, f^*(\xi^*)(x_2 - x_1)'\Delta/\sqrt{n} \le f^*(\xi^*)\,|(x_2 - x_1)'\Delta| ,
\end{align*}
for $|\xi^*| < |(x_2 - x_1)'\Delta/\sqrt{n}|$. The right side of this inequality is bounded. This proves that $I_{21} = o(1/n)$. Similarly,
\[
I_{22} = 2(1/n)E\left\{[\nabla g_{12}(\xi)]'\hat{\beta}_0\,(\sqrt{n}\,W_{12})\right\} E\left\{g_{34}(0)(\sqrt{n}\,W_{34})\right\} = o(1/n),
\]
which proves $I_2 = o(1/n)$. The term $I_3$ can be shown to be $o(n^{-1})$ similarly, which proves (i). The proof of (ii) is analogous to that of (i). To prove (iii), note that
\begin{align*}
\operatorname{Cov}(b_{12}W_{12}, b_{13}W_{13}) ={}& \operatorname{Cov}(g_{12}(0)W_{12},\, g_{13}(0)W_{13}) \\
 &+ 2\operatorname{Cov}([\nabla g_{12}(\xi)]'\hat{\beta}_0 W_{12},\, g_{13}(0)W_{13}) \\
 &+ \operatorname{Cov}([\nabla g_{12}(\xi)]'\hat{\beta}_0 W_{12},\, [\nabla g_{13}(\xi)]'\hat{\beta}_0 W_{13}) .
\end{align*}
The first term is $o(1)$ by Lemma A.6.4. The second and third terms are clearly $o(1)$. This proves (iii). Result (iv) is proved analogously.


We are now ready to state and prove asymptotic linearity. Consider the negative gradient function
\[
S(\beta) = -\nabla D(\beta) = \sum_{i<j} b_{ij}\,\operatorname{sgn}(z_j - z_i)(x_j - x_i) . \tag{A.6.7}
\]

Theorem A.6.2. Under assumptions (3.12.10)-(3.12.13), (3.4.1), (3.4.7), and (3.4.8),
\[
\sup_{\|\sqrt{n}\,\beta\| \le C} \left\| n^{-3/2}\left[S(\beta) - S(0) + 2C_n\beta\right] \right\| \xrightarrow{p} 0.
\]

Proof: Write $R(\Delta) = n^{-3/2}\left[S(n^{-1/2}\Delta) - S(0) + 2n^{-1/2}C_n\Delta\right]$. We show that
\[
\sup_{\|\Delta\| \le C} \|R(\Delta)\| = 2 \sup_{\|\Delta\| \le C} \left\| n^{-3/2}\sum_{i<j} b_{ij}(x_j - x_i)W_{ij}(\Delta) + n^{-2}C_n\Delta \right\| \xrightarrow{p} 0.
\]
It suffices to show that each component converges to 0. Consider the $k$th component
\begin{align*}
R_k(\Delta) &= 2n^{-3/2}\left[\sum_{i<j} b_{ij}(x_{jk} - x_{ik})W_{ij}(\Delta) + \sum_{i<j}\gamma_{ij}b_{ij}(x_{jk} - x_{ik})t_{ij}\right] \\
 &= 2n^{-3/2}\sum_{i<j}(x_{jk} - x_{ik})(b_{ij}W_{ij} + \gamma_{ij}t_{ij}b_{ij}) .
\end{align*}
We show that $E(R_k(\Delta)) \to 0$ and $\operatorname{Var}(R_k(\Delta)) \to 0$. By Lemma A.6.2 and the definition of $\gamma_{ij}$,
\begin{align*}
E(R_k) &= 2n^{-3/2}\sum_{i<j}(x_{jk} - x_{ik})\left[E(b_{ij}W_{ij}) + \gamma_{ij}t_{ij}E(b_{ij})\right] \\
 &= 2n^{-3/2}\sum_{i<j}(x_{jk} - x_{ik})\,t_{ij}\left[B_{ij}'(0) - B_{ij}'(\xi_{ij})\right] \\
 &\le 2\left[(1/n^2)\sum_{i<j}(x_{jk} - x_{ik})^2\right]^{1/2}\left[(1/n)\sum_{i<j}t_{ij}^2\right]^{1/2} \sup_{i,j}\left|B_{ij}'(0) - B_{ij}'(\xi_{ij})\right| \to 0,
\end{align*}
since $(1/n)\sum_{i<j}t_{ij}^2 = (1/n)\Delta'X'X\Delta = O(1)$ and $\sup_{i,j}|B_{ij}'(0) - B_{ij}'(\xi_{ij})| \to 0$ by Lemma A.6.1.


Next, we show that $\operatorname{Var}(R_k) \to 0$:
\begin{align*}
\operatorname{Var}(R_k) &= \operatorname{Var}\left[2n^{-3/2}\sum_{i<j}(x_{jk} - x_{ik})(b_{ij}W_{ij} + \gamma_{ij}t_{ij}b_{ij})\right] \\
 &= \operatorname{Var}\left[2n^{-3/2}\sum_{i=1}^n\sum_{j=1}^n (x_{jk} - \bar{x}_k)(b_{ij}W_{ij} + \gamma_{ij}t_{ij}b_{ij})\right] \\
 &= 4n^{-3}\sum_{i=1}^n\sum_{j=1}^n (x_{jk} - \bar{x}_k)^2\,\operatorname{Var}(b_{ij}W_{ij} + \gamma_{ij}t_{ij}b_{ij}) \\
 &\quad + 4n^{-3}\sum\!\!\sum_{(i,j)\ne(l,m)} (x_{jk} - \bar{x}_k)(x_{mk} - \bar{x}_k)\,\operatorname{Cov}(b_{ij}W_{ij} + \gamma_{ij}t_{ij}b_{ij},\; b_{lm}W_{lm} + \gamma_{lm}t_{lm}b_{lm}) .
\end{align*}
The double sum term above goes to 0, since there are $n^2$ bounded terms in the double sum, multiplied by $n^{-3}$. There are two types of covariance terms in the quadruple sum: covariance terms with all four indices different, e.g. $((i,j),(l,m)) = ((1,2),(3,4))$, and covariance terms with one index of the first pair equal to one index of the second pair, e.g. $((i,j),(l,m)) = ((1,2),(1,3))$. Since there are of magnitude $n^4$ terms with all four indices different, we need to show that each such covariance term is $o(n^{-1})$. This immediately follows from Lemma A.6.5. Finally, there are of magnitude $n^3$ covariance terms with one shared index, and we need to show that each such term is $o(1)$. Again, this immediately follows from Lemma A.6.5. Hence, we have established the desired result.
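A simulation makes the conclusion of Theorem A.6.2 visible. In the sketch below, the remainder $\|n^{-3/2}[S(n^{-1/2}\Delta) - S(0) + 2n^{-1/2}C_n\Delta]\|$, maximized over a grid of $\Delta$ values, shrinks as $n$ grows. The case $p = 1$, $b_{ij} \equiv 1$, and standard normal errors (so that $B_{ij}'(0) = \int f^2 = 1/(2\sqrt{\pi})$) is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(1)

def S(beta, x, y, i, j):
    z = y - beta * x
    return np.sum(np.sign(z[j] - z[i]) * (x[j] - x[i]))

gamma = 1.0 / (2.0 * np.sqrt(np.pi))   # B'_ij(0) at the normal, with b = 1

for n in [50, 200, 800]:
    x = rng.normal(size=n); x -= x.mean()
    y = rng.normal(size=n)             # true beta = 0
    i, j = np.triu_indices(n, k=1)
    Cn = gamma * np.sum((x[j] - x[i]) ** 2)
    rem = [n ** -1.5 * (S(D / np.sqrt(n), x, y, i, j) - S(0.0, x, y, i, j)
                        + 2.0 * Cn * D / np.sqrt(n))
           for D in np.linspace(-2.0, 2.0, 9)]
    print(n, np.max(np.abs(rem)))      # the maximum decreases with n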
Next define the approximating quadratic process,
\[
Q(\beta) = D(0) - \sum_{i<j} b_{ij}\,\operatorname{sgn}(Y_j - Y_i)(x_j - x_i)'\beta + \beta' C_n \beta . \tag{A.6.8}
\]
Let
\[
D^*(\Delta) = n^{-1}D(n^{-1/2}\Delta) \tag{A.6.9}
\]
and
\[
Q^*(\Delta) = n^{-1}Q(n^{-1/2}\Delta) . \tag{A.6.10}
\]
Note that minimizing $D^*(\Delta)$ and $Q^*(\Delta)$ is equivalent to minimizing $D(n^{-1/2}\Delta)$ and $Q(n^{-1/2}\Delta)$, respectively.

The next result is asymptotic quadraticity.

Theorem A.6.3. Under assumptions (3.12.10)-(3.12.13), (3.4.1), (3.4.7), and (3.4.8), for a fixed constant $C$ and for any $\epsilon > 0$,
\[
P\left(\sup_{\|\Delta\| < C} |Q^*(\Delta) - D^*(\Delta)| \ge \epsilon\right) \to 0 . \tag{A.6.11}
\]



Proof: Since
\[
\frac{\partial Q^*}{\partial \Delta} - \frac{\partial D^*}{\partial \Delta} = 2n^{-3/2}\left[\sum_{i<j} b_{ij}(x_j - x_i)W_{ij} + n^{-1/2}C_n\Delta\right] = 2R(\Delta),
\]
it follows from Theorem A.6.2 that for $\epsilon > 0$ and $C > 0$,
\[
P\left(\sup_{\|\Delta\| < C}\left\|\frac{\partial Q^*}{\partial \Delta} - \frac{\partial D^*}{\partial \Delta}\right\| \ge \epsilon/C\right) \to 0 .
\]
For $0 \le t \le 1$, let $\Delta_t = t\Delta$. Then
\[
\frac{d}{dt}\left[Q^*(\Delta_t) - D^*(\Delta_t)\right] = \sum_{k=1}^{p} \Delta_k\left(\frac{\partial Q^*}{\partial \Delta_{tk}} - \frac{\partial D^*}{\partial \Delta_{tk}}\right)
 \le \|\Delta\| \sup_{\|\Delta\| < C}\left\|\frac{\partial Q^*}{\partial \Delta} - \frac{\partial D^*}{\partial \Delta}\right\| < \|\Delta\|\,(\epsilon/C) < \epsilon
\]
with probability approaching 1. Now, let $h(t) = Q^*(\Delta_t) - D^*(\Delta_t)$. By the previous result, we have $|h'(t)| < \epsilon$ with high probability. Thus
\[
|h(1)| = |h(1) - h(0)| = \left|\int_0^1 h'(t)\, dt\right| \le \int_0^1 |h'(t)|\, dt < \epsilon,
\]
with probability approaching 1. This proves the theorem.


The next theorem states the asymptotic normality of $S(0)$.

Theorem A.6.4. Under assumptions (3.12.10)-(3.12.13), (3.4.1), (3.4.7), and (3.4.8),
\[
n^{-3/2}S(0) \xrightarrow{D} N(0, \Sigma_H) . \tag{A.6.12}
\]

Proof: Let $S_P$ denote the projection of $S^*(0) = n^{-3/2}S(0)$ onto the space of linear combinations of independent random variables. Then
\begin{align*}
S_P &= \sum_{k=1}^n E[S^*(0)\mid y_k] = \sum_{k=1}^n E\left[n^{-3/2}\sum_{i<j}(x_j - x_i)\,b_{ij}\,\operatorname{sgn}(Y_j - Y_i)\,\Big|\, y_k\right] \\
 &= n^{-3/2}\sum_{k=1}^n\left[\sum_{i=1}^{k-1}(x_k - x_i)E[b_{ik}\,\operatorname{sgn}(y_k - Y_i)\mid y_k] + \sum_{j=k+1}^{n}(x_j - x_k)E[b_{kj}\,\operatorname{sgn}(Y_j - y_k)\mid y_k]\right] \\
 &= n^{-3/2}\sum_{k=1}^n\sum_{j=1}^n (x_j - x_k)E[b_{kj}\,\operatorname{sgn}(Y_j - y_k)\mid y_k] \\
 &= (1/\sqrt{n})\sum_{k=1}^n U_k , \tag{A.6.13}
\end{align*}


where $U_k$ is defined in expression (3.12.9) of Chapter 3. By assumption (D.3), (3.4.8), and a multivariate extension of the Lindeberg-Feller theorem (Rao, 1973), it follows that $S_P \sim AN(0, \Sigma_H)$. If we show that $E\|S_P - S^*(0)\|^2 \to 0$, then it follows from the Projection Theorem 2.4.6 that $S^*(0)$ has the same asymptotic distribution as $S_P$, and the proof will be done. Equivalently, we may show that $E(S_{P,r} - S_r^*(0))^2 \to 0$ for each component $r = 1, \ldots, p$. Since for each $r$ we have $E(S_{P,r} - S_r^*(0)) = 0$, then
\begin{align*}
E(S_{P,r} - S_r^*(0))^2 &= \operatorname{Var}(S_{P,r} - S_r^*(0)) \\
 &= \operatorname{Var}\left[n^{-3/2}\sum_{k=1}^n\sum_{j=1}^n (x_{jr} - x_{kr})\left\{E[b_{kj}\,\operatorname{sgn}(Y_j - y_k)\mid y_k] - b_{kj}\,\operatorname{sgn}(Y_j - Y_k)\right\}\right] \\
 &\equiv \operatorname{Var}\left[n^{-3/2}\sum_{k=1}^n\sum_{j=1}^n T(Y_j, Y_k)\right] \\
 &= n^{-3}\left(\sum_{k=1}^n\sum_{j=1}^n \operatorname{Var}(T(Y_j, Y_k)) + \sum_k\sum_j\sum_l\sum_m \operatorname{Cov}[T(Y_j, Y_k), T(Y_l, Y_m)]\right),
\end{align*}
where the quadruple sum is taken over $(j,k) \ne (l,m)$. The double sum term goes to 0 since there are $n^2$ bounded terms divided by $n^3$. There are two types of covariance terms in the quadruple sum: terms with four different indices and terms with three different indices (i.e., one shared index). Covariance terms with four different indices are zero (this can be shown by writing out the covariance in terms of expectations and using symmetry to show that each covariance term is zero). Thus we only need to consider the covariance terms with three different indices and show that their sum goes to 0. Letting $k$ be the shared index (without loss of generality), and noting that $E\,T(y_j, y_k) = 0$ for all $j, k$, we have
\begin{align*}
& n^{-3}\sum_k\sum_{j\ne k}\sum_{l\ne k,j} \operatorname{Cov}[T(Y_j, Y_k), T(Y_l, Y_k)] \\
&\quad = n^{-3}\sum_k\sum_{j\ne k}\sum_{l\ne k,j} E\left\{T(Y_j, Y_k)\cdot T(Y_l, Y_k)\right\} \\
&\quad = n^{-3}\sum_k\sum_{j\ne k}\sum_{l\ne k,j} E\left\{[E(b_{kj}\,\operatorname{sgn}(Y_j - y_k)\mid y_k) - b_{kj}\,\operatorname{sgn}(Y_j - Y_k)]\right. \\
&\qquad\qquad\qquad\qquad \left.\cdot\, [E(b_{kl}\,\operatorname{sgn}(Y_l - y_k)\mid y_k) - b_{kl}\,\operatorname{sgn}(Y_l - Y_k)]\right\} \\
&\quad = n^{-3}\sum_k\sum_{j\ne k}\sum_{l\ne k,j} E\left\{[E(g_{kj}(0)\,\operatorname{sgn}(Y_j - y_k)\mid y_k) - g_{kj}(0)\,\operatorname{sgn}(Y_j - Y_k)]\right. \\
&\qquad\qquad\qquad\qquad \left.\cdot\, [E(g_{kl}(0)\,\operatorname{sgn}(Y_l - y_k)\mid y_k) - g_{kl}(0)\,\operatorname{sgn}(Y_l - Y_k)]\right\} + o_p(1),
\end{align*}
where the last equality follows from the relation $b_{kj} = g_{kj}(0) + O_p(1/\sqrt{n})$.


Expanding the product, each term in the triple sum may be written as
\begin{align*}
& E\left\{[E(g_{kj}(0)\,\operatorname{sgn}(Y_j - y_k)\mid y_k)]^2\right\}
 + E\left\{g_{kj}(0)\,\operatorname{sgn}(Y_j - Y_k)\, g_{kl}(0)\,\operatorname{sgn}(Y_l - Y_k)\right\} \\
&\quad - 2E\left\{g_{kj}(0)\,\operatorname{sgn}(Y_j - Y_k)\,[E(g_{kl}(0)\,\operatorname{sgn}(Y_l - y_k)\mid y_k)]\right\} \\
&\quad = (1 + 1 - 2)\,E\left\{[E(g_{kj}(0)\,\operatorname{sgn}(Y_j - y_k)\mid y_k)]^2\right\} = 0,
\end{align*}
where the first equality follows by taking conditional expectations with respect to $y_k$ inside the appropriate terms. A similar argument applies to the terms where $k$ is not the shared index. The theorem is proved.

Proof of Theorem A.6.1: Let $\tilde{\beta}$ denote the value which minimizes $Q(\beta)$. Then $\tilde{\beta}$ is the solution of
\[
0 = S(0) - 2C_n\beta ,
\]
so that $\sqrt{n}\,\tilde{\beta} = (1/2)n^2 C_n^{-1}[n^{-3/2}S(0)] \sim AN(0, (1/4)\,C^{-1}\Sigma_H C^{-1})$, by Theorem A.6.4 and assumption (D.2), (3.4.7). It remains to show that $\sqrt{n}(\tilde{\beta} - \hat{\beta}) = o_p(1)$. This follows from Theorem A.6.3 and the convexity of $D(\beta)$, using standard arguments as in Jaeckel (1972).
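To spell out the scaling in the last display (a restatement under the same assumptions):
\[
\tilde{\beta} = \tfrac12\,C_n^{-1}S(0)
\quad\Longrightarrow\quad
\sqrt{n}\,\tilde{\beta} = \tfrac12\,\sqrt{n}\,C_n^{-1}S(0) = \tfrac12\left(n^{2} C_n^{-1}\right)\left(n^{-3/2}S(0)\right),
\]
where $n^{-2}C_n$ converges by (D.2), (3.4.7), and $n^{-3/2}S(0)$ has the limiting normal distribution of Theorem A.6.4.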

A.7 Asymptotic Theory for Section 3.12.7

Assume without loss of generality that $\alpha = 0$, $\beta = 0$, and $\operatorname{med} e_i = 0$. In this section, we must further assume that the variance of $e_i$ is finite, i.e., $\sigma^2 < \infty$. From the proof of Theorem A.6.1, $\hat{\beta}$ can be expressed asymptotically as
\[
\sqrt{n}\,\hat{\beta} = \frac12\left(n^{-2}X'A_nX\right)^{-1}\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k + o_p(1) , \tag{A.7.1}
\]
where $U_k$ is defined in expression (A.6.13). In this section, we assume that the weights $b_{ij}$ are given. Writing $a_{ij} = b_{ij}/(\sqrt{12}\,\tau)$ and using the definition of $U_k$, we approximate $U_k$ by
\[
\hat{U}_k^* = -\frac{\sqrt{12}\,\tau}{n}\sum_{j=1}^n (x_j - x_k)\,a_{kj}^*\,(1 - 2F(e_k)) . \tag{A.7.2}
\]
The estimate of $\alpha$ is given by the median of the HBR residuals. This estimator can be expressed asymptotically as
\[
\sqrt{n}\,\hat{\alpha} = \tau_S\, n^{-1/2}\sum_{i=1}^n \operatorname{sgn}(e_i) + o_p(1) ; \tag{A.7.3}
\]


see McKean et al. (1990). Using (A.7.1) and (A.7.3), we have the following first order expression for the residuals $\hat{e}_i^*$, (3.12.36):
\[
\hat{e}^* \doteq e - \tau_S\frac{1}{n}\sum_{i=1}^n \operatorname{sgn}(e_i)\,1 - \frac{1}{2\sqrt{n}}\,X\left(n^{-2}X'A^*X\right)^{-1}\frac{1}{\sqrt{n}}\sum_{k=1}^n \hat{U}_k^* , \tag{A.7.4}
\]
where $A^* = [a_{ij}^*]$. Because $E[\hat{U}_k^*] = 0$ and $\operatorname{med} e_i = 0$, taking expectations of both sides of (A.7.4) leads to
\[
E[\hat{e}^*] \doteq E[e_1]\,1 . \tag{A.7.5}
\]
Write
\[
\operatorname{Var}(\hat{e}^*) \doteq E\left[(\hat{e}^* - E[e_1]1)(\hat{e}^* - E[e_1]1)'\right] . \tag{A.7.6}
\]
The approximate variance-covariance matrix of $\hat{e}^*$ can then be obtained by substituting the right side of expression (A.7.4) for $\hat{e}^*$ in expression (A.7.6) and then expanding and taking expectations term by term. This is a tedious derivation, but by making use of $E[\hat{U}_k^*] = 0$, $\operatorname{med} e_i = 0$, $\sum_{i=1}^n x_i = 0$, and $\sum_{j=1}^n a_{ij}^* = \sum_{j=1}^n a_{ji}^* = 0$, we obtain expression (3.12.37) of Section 3.12.7.

A.8 Asymptotic Theory for Section 3.13

Proof of Theorem 3.13.1: Let $X_1 = [1, X]$ denote the augmented matrix of centered explanatory variables and a column of ones. Recall that $\hat{\beta}_{LS}^* = (X_1'X_1)^{-1}X_1'Y$. Since $1'X = 0$, it can be shown that the vector of slope parameters $\hat{\beta}_{LS}$ satisfies $\hat{\beta}_{LS} = (X'X)^{-1}X'Y$. Since $Y = \alpha 1 + X\beta + e$, we get the relation
\[
\hat{\beta}_{LS} = \beta + (X'X)^{-1}X'e . \tag{A.8.1}
\]
From McKean et al. (1990), we have the equivalent relation
\[
\hat{\beta}_R = \beta + \sqrt{12}\,\tau\,(X'X)^{-1}X'F_c(e) + o_p(n^{-1/2}) , \tag{A.8.2}
\]
where $F_c(e) = [F_c(e_1) - 1/2, \ldots, F_c(e_n) - 1/2]'$ is an $n \times 1$ vector of independent random variables. Now,
\begin{align*}
\operatorname{Var}(\hat{\beta}_{LS} - \hat{\beta}_R) &= \operatorname{Var}(\hat{\beta}_{LS}) + \operatorname{Var}(\hat{\beta}_R) - 2E(\hat{\beta}_{LS} - \beta)(\hat{\beta}_R - \beta)' \\
 &= \sigma^2(X'X)^{-1} + \tau^2(X'X)^{-1} - 2\sqrt{12}\,\tau\, E\left[(X'X)^{-1}X'e\,F_c'(e)X(X'X)^{-1}\right] = \delta(X'X)^{-1} ,
\end{align*}
where $\delta = \sigma^2 + \tau^2 - 2\sqrt{12}\,\tau\,E[e_1(F(e_1) - 1/2)]$.


Finally, note from expressions (A.8.1) and (A.8.2) that $\hat{\beta}_{LS}$ and $\hat{\beta}_R$ are both functions of the errors $e_i$. Hence, asymptotic normality follows in the usual way by using (A2) to show that the Lindeberg condition holds.

Proof of Corollary 3.13.1: Recall that $\hat{\alpha}_{LS} = \bar{Y} = (1/n)\sum_{i=1}^n Y_i$. Define the R intercept estimate, $\hat{\alpha}_R$, as the median of the residuals. Expression (3.5.22) gives the asymptotic representation of $\hat{\alpha}_R$ which, for convenience, we restate here:
\[
\hat{\alpha}_R = (1/n)\,\tau_s\sum_{i=1}^n \operatorname{sgn}(Y_i) + o_p(n^{-1/2}), \tag{A.8.3}
\]
which gives
\begin{align*}
\operatorname{Var}(\hat{\alpha}_{LS} - \hat{\alpha}_R) &= (1/n^2)\sum_{i=1}^n \operatorname{Var}(Y_i - \tau_s\,\operatorname{sgn}(Y_i)) \\
 &= (1/n^2)\sum_{i=1}^n \left[\operatorname{Var}(Y_i) + \tau_s^2\,\operatorname{Var}(\operatorname{sgn}(Y_i)) - 2\tau_s\,\operatorname{Cov}(Y_i, \operatorname{sgn}(Y_i))\right] \\
 &= (1/n^2)\sum_{i=1}^n \left[\sigma^2 + \tau_s^2 - 2\tau_s E(e_i\,\operatorname{sgn}(e_i))\right] \\
 &= (1/n)\left[\sigma^2 + \tau_s^2 - 2\tau_s E(e_1\,\operatorname{sgn}(e_1))\right] .
\end{align*}
Next we need to show that the intercept and slope differences have zero covariance. This follows from the fact that $1'X = 0$. Asymptotic normality follows as in the proof of Theorem 3.13.1.
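For a sense of the magnitudes involved, the following Monte Carlo sketch evaluates $\delta$ and the corresponding intercept quantity for standard normal errors with Wilcoxon scores; the values $\tau = \sqrt{\pi/3}$ and $\tau_s = \sqrt{\pi/2}$ are the standard ones at the normal and, like the error law itself, are assumptions of this illustration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
e = rng.normal(size=1_000_000)            # standard normal errors (assumed)
sigma2 = 1.0
tau = np.sqrt(np.pi / 3.0)                # 1 / (sqrt(12) * integral of f^2)
tau_s = np.sqrt(np.pi / 2.0)              # 1 / (2 f(0))

delta = (sigma2 + tau**2
         - 2.0 * np.sqrt(12.0) * tau * np.mean(e * (norm.cdf(e) - 0.5)))
delta_s = sigma2 + tau_s**2 - 2.0 * tau_s * np.mean(e * np.sign(e))
print(delta, delta_s)                     # roughly pi/3 - 1 and pi/2 - 1

Both quantities are positive at the normal, so the difference between the two estimates has a nondegenerate limiting distribution there.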

References

Abebe, A. and McKean, J. W. (2007), Highly efficient nonlinear regression


based on the Wilcoxon norm, Festschrift in Honor of Mir Masoom Ali on
the Occasion of his Retirement, D. Umbach, ed, 340-357.

Abebe, A. and McKean, J. W. (2010), Weighted Wilcoxon estimates in non-


linear regression, Submitted.

Abebe, A., McKean, J. W., and Kloke, J. D. (2010), Iterated reweighted rank-
based estimates for GEE models, Submitted.

Adichie, J. N. (1978), Rank tests of sub-hypotheses in the general regression model, Annals of Statistics, 6, 1012-1026.

Afifi, A. A. and Azen, S. P. (1972), Statistical Analysis: A Computer Oriented


Approach, New York: Academic Press.

Akritas, M. G. (1990), The rank transform method in some two-factor designs,


Journal of the American Statistical Association, 85, 73-78.

Akritas, M. G. (1991), Limitations of the rank transform procedure: A study


of repeated measures designs, Part I, Journal of the American Statistical
Association, 86, 457-460.

Akritas, M. G. (1993), Limitations of the rank transform procedure: A study


of repeated measures designs, Part II, Statistics and Probability Letters, 17,
149-156.

Akritas, M. G. and Arnold, S. F. (1994), Fully nonparametric hypotheses for


factorial designs I: Multivariate repeated measures designs, Journal of the
American Statistical Association, 89, 336-343.

Akritas, M. G., Arnold, S. F., and Brunner, E. (1997), Nonparametric hy-


potheses and rank statistics for unbalanced factorial designs, Journal of the
American Statistical Association, 92, 258-265.


Ammann, L. P. (1993), Robust singular value decompositions: A new approach


to projection pursuit, Journal of the American Statistical Association, 88,
505-514.

Ansari, A. R. and Bradley, R. A. (1960), Rank-sum tests for dispersion, Annals


of Mathematical Statistics, 31, 1174-1189.

Apostol, T. M. (1974), Mathematical Analysis, 2nd Edition, Reading, Massachusetts: Addison-Wesley.

Arnold, S. F. (1980), Asymptotic validity of F -tests for the ordinary linear


model and the multiple correlation model, Journal of the American Statis-
tical Association, 75, 890-894.

Arnold, S. F. (1981), The Theory of Linear Models and Multivariate Analysis,


New York: John Wiley and Sons.

Aubuchon, J. C. and Hettmansperger, T. P. (1984), A note on the estimation


of the integral of f 2 (x), Journal of Statistical Inference and Planning, 9,
321-331.

Aubuchon, J. C. and Hettmansperger, T. P. (1989), Rank-based inference for


linear models: Asymmetric errors, Statistics and Probability Letters, 8, 97-
107.

Babu, G. J. and Koti, K. M. (1996), Sign test for ranked-set sampling, Com-
munications in Statistics, Part A-Theory and Methods, 25(7), 1617-1630.

Bahadur, R. R. (1967), Rates of convergence of estimates and test statistics,


Annals of Mathematical Statistics, 31, 276-295.

Bai, Z. D., Chen, X. R., Miao, B. Q., and Rao, C. R. (1990), Asymptotic
theory of least distance estimate in multivariate linear models, Statistics,
21, 503-519.

Bassett, G. and Koenker, R. (1978), Asymptotic theory of least absolute error


regression, Journal of the American Statistical Association, 73, 618-622.

Bedall, F. K. and Zimmerman, H. (1979), Algorithm AS143, the mediancenter,


Applied Statistics 28, 325-328.

Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics, New York: John Wiley and Sons.

Bickel, P. J. (1964), On some alternative estimates for shift in the p-variate


one sample problem, Annals of Mathematical Statistics, 35, 1079-1090.


Bickel, P. J. (1965), On some asymptotically nonparametric competitors of


Hotelling’s T 2 , Annals of Mathematical Statistics, 36, 160-173.

Bickel, P. J. (1974), Edgeworth expansions in nonparametric statistics, Annals


of Statistics, 2, 1-20.

Bickel, P. J. (1976), Another look at robustness: A review of reviews and some


new developments (Reply to Discussant), Scandinavian Journal of Statis-
tics, 3, 167.

Bickel, P. J. and Lehmann, E. L. (1975), Descriptive statistics for nonparametric models. II. Location, Annals of Statistics, 3, 1045-1069.

Blair, R. C., Sawilowsky, S. S., and Higgins, J. J. (1987), Limitations of the rank
transform statistic in tests for interaction, Communications in Statistics,
Part B-Simulation and Computation, 16, 1133-1145.

Bloomfield, P. and Steiger, W. L. (1983), Least Absolute Deviations, Boston: Birkhäuser.

Blumen, I. (1958), A new bivariate sign test, Journal of the American Statistical
Association, 53, 448-456.

Bohn, L. L. and Wolfe, D. A. (1992), Nonparametric two-sample procedures for


ranked-set samples data, Journal of the American Statistical Association,
87, 552-561.

Boos, D. D. (1982), A test for asymmetry associated with the Hodges-Lehmann


estimator, Journal of the American Statistical Association, 77, 647-651.

Bose, A. and Chaudhuri, P. (1993), On the dispersion of multivariate median,


Annals of the Institute of Statistical Mathematics, 45, 541-550.

Box, G. E. P. and Cox, D. R. (1964), An analysis of transformations, Journal


of the Royal Statistical Society, Series B, Methodological, 26, 211-252.

Box, G. E. P. and Jenkins, G. (1970), Time series analysis: Forecasting and


control, San Francisco: Holden-Day.

Brockwell, P. J. and Davis, R. A. (1991), Time Series: Theory and Methods,


New York: Springer-Verlag.

Brown, B. M. (1983), Statistical uses of the spatial median, Journal of the


Royal Statistical Society, Series B, Methodological, 45, 25-30.

Brown, B. M. (1985), Multiparameter linearization theorems, Journal of the


Royal Statistical Society, Series B, Methodological, 47, 323-331.


Brown, B. M. and Hettmansperger, T. P. (1987a), Affine invariant rank meth-


ods in the bivariate location model, Journal of the Royal Statistical Society,
Series B, Methodological, 49, 301-310.

Brown, B. M. and Hettmansperger, T. P. (1987b), Invariant tests in bivariate


models and the L1 criterion function, In: Statistical Data Analysis Based on
the L1 Norm and Related Methods, ed, Y. Dodge, 333-344, North Holland,
Amsterdam.

Brown, B. M. and Hettmansperger, T. P. (1989), An affine invariant version


of the sign test, Journal of the Royal Statistical Society, Series B, Method-
ological, 51, 117-125.

Brown, B. M. and Hettmansperger, T. P. (1994), Regular redescending rank


estimates, Journal of the American Statistical Association, 89, 538-542.

Brown, B. M., Hettmansperger, T. P., Nyblom, J., and Oja, H. (1992), On


certain bivariate sign tests and medians, Journal of the American Statistical
Association, 87, 127-135.

Brunner, E. and Denker, M. (1994), Rank statistics under dependent observa-


tions and applications to factorial designs, Journal of Statistical Inference
and Planning, 42, 353-378.

Brunner, E. and Neumann, N. (1986), Rank tests in 2 × 2 designs, Statistica


Neerlandica, 40, 251-272.

Brunner, E. and Puri, M. L. (1996), Nonparametric methods in design and


analysis of experiments, In: Handbook of Statistics, S. Ghosh and C. R.
Rao, eds, 13, 631-703, The Netherlands: Elsevier Science, B. V.

Bustos, O.H. (1982), General M-estimates for contaminated pth-order autore-


gressive processes: Consistency and asymptotic normality, Z. Wahrschein-
lichkeitstheorie verw. Gebiete, 59, 491-504.

Carmer, S. G. and Swanson, M. R. (1973), An evaluation of ten pairwise multi-


ple comparison procedures by Monte Carlo methods, Journal of the Amer-
ican Statistical Association, 68, 66-74.

Chakraborty, B., Chaudhuri, P., and Oja, H. (1998), Operating transformation


retransformation on spatial median and angle test, Statistica Sinica, 8, 767-
784.

Chang, W., McKean, J. W., Naranjo, J. D., and Sheather, S. J. (1999), High
breakdown rank-based regression, Journal of the American Statistical As-
sociation, 94, 205-219.


Chaudhuri, P. (1992), Multivariate location estimation using extension of R-


estimates through U-statistics type approach, Annals of Statistics, 20, 897-
916.

Chaudhuri, P. and Sengupta, D. (1993), Sign tests in multidimensional infer-


ence based on the geometry of the data cloud, Journal of the American
Statistical Association, 88, 1363-1370.

Chernoff, H. and Savage, I. R. (1958), Asymptotic normality and efficiency of certain nonparametric test statistics, Annals of Mathematical Statistics, 29, 972-994.

Chiang, C.-Y. and Puri, M. L. (1984), Rank procedures for testing subhypothe-
ses in linear regression, Annals of the Institute of Statistical Mathematics,
36, 35-50.

Chinchilli, V. M. and Sen, P. K. (1982), Multivariate linear rank statistics for


profile analysis, Journal of Multivariate Analysis, 12, 219-229.

Choi, K. and Marden, J. (1997), An approach to multivariate rank tests in


multivariate analysis of variance, Journal of the American Statistical Asso-
ciation, To appear.

Chwirut, D. J. (1979), Recent improvements to the ASTM-Type Ultrasonic Reference Block System, Research Report NBSIR 79-1742, Washington DC: National Bureau of Standards.

Coakley, C. W. and Hettmansperger, T. P. (1992), Breakdown bounds and


expected test resistance, Journal of Nonparametric Statistics, 1, 267-276.

Cobb, G. W. (1998), Introduction to Design and Analysis of Experiments, New


York: Springer-Verlag.

Conover, W. J. and Iman, R. L. (1981), Rank transform as a bridge between


parametric and nonparametric statistics, The American Statistician, 35,
124-133.

Conover, W. J., Johnson, M. E., and Johnson, M. M. (1981), A comparative


study of tests for homogeneity of variances, with applications to the outer
continental shelf bidding data, Technometrics, 23, 351-361.

Cook, R. D., Hawkins, D. M., and Weisberg, S. (1992), Comparison of model


misspecification diagnostics using residuals from least mean of squares and
least median of squares fits, Journal of the American Statistical Association,
87, 419-424.


Cook, R. D. and Weisberg, S. (1982), Residuals and Influence in Regression,


New York: Chapman and Hall.

Cook, R. D. and Weisberg, S. (1989), Regression diagnostics with dynamic


graphics, Technometrics, 31, 277-291.

Cook, R. D. and Weisberg, S. (1994), An Introduction to Regression Graphics,


New York: John Wiley and Sons.

Crimin, K., Abebe, A. and McKean, J. W. (2008), Robust General Linear Mod-
els and Graphics via a User Interface, Journal of Modern Applied Statistics,
7, 318-330.

Crimin, K., McKean, J. W. and Sheather, S. J. (2007), Discriminant proce-


dures based on robust discriminant coordinates, Journal of Nonparametric
Statistics, 19, 199-213.

Croux, C. Rousseeuw, P. J. and Hössjer, O. (1994), Generalized S-estimators,


Journal of the American Statistical Association, 89, 1271-1281.

Cushney, A. R. and Peebles, A. R. (1905), The action of optical isomers, II,


Hyoscines, Journal of Physiology, 32, 501-510.

Davis, J. B. and McKean, J. W. (1993), Rank based methods for multivariate


linear models, Journal of the American Statistical Association, 88, 245-251.

Dietz, E. J. (1982), Bivariate nonparametric tests for the one-sample location


problem, Journal of the American Statistical Association, 77, 163-169.

Dixon, S. L. and McKean, J. W. (1996), Rank-based analysis of the het-


eroscedastic linear model, Journal of the American Statistical Association,
91, 699-712.

Doksum, K. A. and Sievers, G. L. (1976), Plotting with confidence: Graphical


comparisons of two populations, Biometrika, 63, 421-434.

Dongarra, J. J., Bunch, J. R., Moler, C. B., and Stewart, G. W. (1979), Linpack
Users’ Guide, Philadelphia: SIAM.

Donoho, D. L. and Huber, P. J. (1983), The notion of breakdown point, In:


A Festschrift for Erich L. Lehmann, eds, P. J. Bickel, K. A. Doksum, J. L.
Hodges Jr., 157-184, Belmont, CA: Wadsworth.

Dowell, M. and Jarratt, P. (1971), A modified regula falsi method for computing the root of an equation, BIT, 11, 168-171.


Draper, D. (1988), Rank-based robust analysis of linear models. I. Exposition


and review, Statistical Science, 3, 239-257.

Draper, N. R. and Smith, H. (1966), Applied Regression Analysis, New York:


John Wiley and Sons.

DuBois, C., ed, (1960), Lowie’s Selected Papers in Anthropology, Berkeley: Uni-
versity of California Press.

Ducharme, G. R. and Milasevic, P. (1987), Spatial median and directional data,


Biometrika, 74, 212-215.

Dwass, M. (1960), Some k-sample rank order tests, In: I. Olkin, et al., eds,
Contributions to Probability and Statistics. Stanford: Stanford University
Press.

Efron, B. (1979), Bootstrap methods: Another look at the jackknife, Annals of


Statistics, 7, 1-26.

Efron B. and Tibshirani, R. J. (1993), An Introduction to the Bootstrap, New


York: Chapman and Hall.

Eubank, R. L., LaRiccia, V. N., and Rosenstein, R. B. (1992), Testing sym-


metry about an unknown median, via linear rank procedures, Journal of
Nonparametric Statistics, 1, 301-311.

Fernholz, L. T. (1983), von Mises calculus for statistical functionals, Lecture


Notes in Statistics 19, New York: Springer.

Fisher, N. I. (1987), Statistical Analysis for Spherical Data, Cambridge: Cam-


bridge University Press.

Fisher, N. I. (1993), Statistical Analysis for Circular Data, Cambridge: Cam-


bridge University Press.

Fix, E. and Hodges, J. L., Jr. (1955), Significance probabilities of the Wilcoxon
test, Annals of Mathematical Statistics, 26, 301-312.

Fligner, M. A. (1981), Comment, American Statistician, 35, 131-132.

Fligner, M. A. and Hettmansperger, T. P. (1979), On the use of conditional


asymptotic normality, Journal of the Royal Statistical Society, Series B,
Methodological, 41, 178-183.

Fligner, M. A. and Killeen, T. J. (1976), Distribution-free two-sample test for


scale, Journal of the American Statistical Association, 71, 210-213.


Fligner, M. A. and Policello, G. E. (1981), Robust rank procedures for the


Behrens-Fisher problem, Journal of the American Statistical Association,
76, 162-168.

Fligner, M. A. and Rust, S. W. (1982), A modification of Mood’s median test


for the generalized Behrens-Fisher problem, Biometrika, 69, 221-226.

Fox, A. J. (1972), Outliers in time series, Journal of the Royal Statistical Society
B, 34, 350-363.

Fraser, D. A. S. (1957), Nonparametric Methods in Statistics, New York: John


Wiley and Sons.

Freidlin, B. and Gastwirth, J. L. (2000), Should the median test be retired


from general use?, The American Statistician, 54, 161-164.

Fuller, W. A. (1996), Introduction to Statistical Time Series, New York: John


Wiley and Sons.

Gastwirth, J. L. (1966), On robust procedures, Journal of the American Sta-


tistical Association, 61, 929-938.

Gastwirth, J. L. (1968), The first median test: A two-sided version of the control
median test, Journal of the American Statistical Association, 63, 692-706.

Gastwirth, J. L. (1971), On the sign test for symmetry, Journal of the American
Statistical Association, 66, 821-823.

George, K. J., McKean, J. W., Schucany, W. R., and Sheather, S. J. (1995), A


comparison of confidence intervals from R-estimators in regression, Journal
of Statistical Computation and Simulation, 53, 13-22.

Gerard, P. D. and Schucany, W. R. (2007), An enhanced sign test for dependent


binary data with small number of clusters, Computational Statistics and
Data Analysis, 51, 4622-4632.

Ghosh, M. and Sen, P. K. (1971), On a class of rank order tests for regres-
sion with partially formed stochastic predictors, Annals of Mathematical
Statistics, 42, 650-661.

Gower, J. C. (1974), The mediancenter, Applied Statistics, 32, 466-470.

Graybill, F. A. (1976), Theory and Application of the Linear Model, North


Scituate, Massachusetts: Duxbury.

Graybill, F. A. (1983), Matrices with Applications in Statistics, Belmont, CA:


Wadsworth.


Graybill, F. A. and Iyer, H. K. (1994), Regression Analysis: Concepts and


Applications, Belmont, California: Duxbury Press.

Hadi, A. S. and Simonoff, J. S. (1993), Procedures for the identification of


multiple outliers in linear models, Journal of the American Statistical As-
sociation, 88, 1264-1272.

Hájek, J. and Šidák, Z. (1967), Theory of Rank Tests, New York: Academic
Press.

Hald, A. (1952), Statistical Theory with Engineering Applications, New York:


John Wiley and Sons.

Hall, P. and Padmanabhan, A. R. (1997), Adaptive inference for the two-sample


scale problem, Technometrics, 39, 412-422.

Hallin, M., Jurečková, J., and Koul, H. L. (2007), Serial autoregression and regression rank score statistics, In: V. Nair, ed, Advances in Statistical Modeling and Inference; Essays in Honor of Kjell Doksum's 65th Birthday, Singapore: World Scientific.

Hallin, M. and Mélard, G. (1988), Rank-based tests for randomness against first-order ARMA processes, Journal of the American Statistical Association, 16, 402-432.

Hampel, F. R. (1974), The influence curve and its role in robust estimation,
Journal of the American Statistical Association, 69, 383-393.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. J. (1986),


Robust Statistics, the Approach Based on Influence Functions, New York:
John Wiley and Sons.

Hardy, G. H., Littlewood, J. E., and Polya, G. (1952), Inequalities. 2nd ed.,
Cambridge: Cambridge University Press.

Hawkins, D. M., Bradu, D., and Kass, G. V. (1984), Location of several outliers
in multiple regression data using elemental sets, Technometrics, 26, 197-208.

Hawkins, D. M. and McLachlan, G. J. (1997), High-breakdown linear discrimi-


nant analysis, Journal of the American Statistical Association, 92, 136-143.

He, X., Simpson, D. G. and Portnoy, S. L. (1990), Breakdown robustness of


tests, Journal of the American Statistical Association, 85, 446-452.

Heiler, S. and Willers, R. (1988), Asymptotic normality of R-estimation the


linear model, Statistics, 19, 173-184.


Hendy, M. F. and Charles, J. A. (1970), The production techniques, silver con-


tent and circulation history of the twelfth-century Byzantine, Archaeometry,
12, 13-21.

Hettmansperger, T. P. (1984a), Statistical Inference Based on Ranks, New


York: John Wiley and Sons.

Hettmansperger, T. P. (1984b), Two-sample inference based on one-sample


sign statistics, Applied Statistics, 33, 45-51.

Hettmansperger, T. P. (1995), The rank-set sample sign test, Journal of Non-


parametric Statistics, 4, 263-270.

Hettmansperger, T. P. and Malin, J. S. (1975), A modified Mood’s test for loca-


tion with no shape assumptions on the underlying distributions, Biometrika,
62, 527-529.

Hettmansperger, T. P. and McKean, J. W. (1978), Statistical inference based


on ranks, Psychometrika, 43, 69-79.

Hettmansperger, T. P. and McKean, J. W. (1983), A geometric interpretation


of inferences based on ranks in the linear model, Journal of the American
Statistical Association, 78, 885-893.

Hettmansperger, T. P., McKean, J. W., and Sheather, S. J. (1997), Rank-based analyses of linear models, In: Handbook of Statistics, 145-175, S. Ghosh and C. R. Rao, eds, 15. Amsterdam: Elsevier Science.

Hettmansperger, T. P., Möttönen, J., and Oja, H. (1997a), Affine invariant


multivariate one-sample signed-rank tests, Journal of the American Statis-
tical Association, To appear.

Hettmansperger, T. P., Möttönen, J., and Oja, H. (1997b), Affine invariant


multivariate two-sample rank tests, Statistica Sinica, To appear.

Hettmansperger, T. P., Nyblom, J. and Oja, H. (1994), Affine invariant multi-


variate one-sample sign tests, Journal of the Royal Statistical Society, Series
B, Methodological, 56, 221-234.

Hettmansperger, T. P. and Oja, H. (1994), Affine invariant multivariate multi-


sample sign tests, Journal of the Royal Statistical Society, Series B, Method-
ological, 56, 235-249.

Hettmansperger, T. P. and Randles, R. H. (2002) A practical affine equivariant


multivariate median. Biometrika, 89, 851-860.


Hettmansperger, T. P. and Sheather, S. J. (1986), Confidence intervals based


on interpolated order statistics, Statistics and Probability Letters, 4, 75-79.

Hocking, R. R. (1985), The Analysis of Linear Models, Monterey, California:


Brooks/Cole.

Hodges, J. L. Jr. (1967), Efficiency in normal samples and tolerance of extreme


values for some estimates of location, In: Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, 1, 163-186. Berkeley:
University of California Press.

Hodges, J. L., Jr. and Lehmann, E. L. (1956), The efficiency of some non-
parametric competitors of the t-test, Annals of Mathematical Statistics, 27,
324-335.

Hodges, J. L., Jr. and Lehmann, E. L. (1961), Comparison of the normal scores
and Wilcoxon tests, In: Proceedings of the Fourth Berkeley Symposium on
Mathematical Statistics and Probability, 1, 307-317, Berkeley: University of
California Press.

Hodges, J. L., Jr. and Lehmann, E. L. (1962), Rank methods for combination of independent experiments in analysis of variance, Annals of Mathematical Statistics, 33, 482-497.

Hodges, J. L., Jr. and Lehmann, E. L. (1963), Estimates of location based on


rank tests. Annals of Mathematical Statistics, 34, 598-611.

Hogg, R. V. (1974), Adaptive robust procedures: A partial review and some


suggestions for future applications and theory, Journal of the American
Statistical Association, 69, 909-923.

Hora, S. C. and Conover, W. J. (1984), The F-statistic in the two-way layout with rank-score transformed data, Journal of the American Statistical Association, 79, 668-673.

Hössjer, O. (1994), Rank-based estimates in the linear model with high break-
down point, Journal of the American Statistical Association, 89, 149-158.

Hössjer, O. and Croux, C. (1995), Generalizing univariate signed rank statistics


for testing and estimating a multivariate location parameter, Journal of
Nonparametric Statistics, 4, 293-308.

Hotelling, H. (1951), A generalized T -test and measure of multivariate disper-


sion, In Proceedings of the Second Berkeley Symposium on Mathematical
Statistics, 23-41. Berkeley: University of California Press.


Høyland, A. (1965), Robustness of the Hodges-Lehmann estimates for shift, Annals of Mathematical Statistics, 36, 174-197.

Hsu, J. C. (1996), Multiple Comparisons, London: Chapman and Hall.

Huber, P. J. (1981), Robust Statistics, New York: John Wiley and Sons.

Huitema, B. E. (1980), The Analysis of Covariates and Alternatives, New York:


John Wiley and Sons.

Huitema, B. E. and McKean, J. W. (2000), Design specification issues in time-


series intervention models, Educational and Psychological Measurement, 60,
38-58.

Huitema, B. E., McKean, J. W., and McKnight, S. (1999), Autocorrelation ef-


fects on least squares intervention analyses of short time series, Educational
and Psychological Measurement, 59, 767-786.

Huitema, B. E., McKean, J. W., and Zhao, J. (1996), The runs test for au-
tocorrelated errors: Unacceptable properties, Journal of Educational and
Behavior Statistics, 21, 390-404.

Iman, R. L. (1974), A power study of the rank transform for the two-way
classification model when interaction may be present, Canadian Journal of
Statistics, 2, 227-239.

International Mathematical and Statistical Libraries, Inc. (1987), User’s Man-


ual: Stat/Library, Houston, Texas: Author.

Jaeckel, L. A. (1972), Estimating regression coefficients by minimizing the dis-


persion of the residuals, Annals of Mathematical Statistics, 43, 1449-1458.

Jan, S. L. and Randles, R. H. (1995), A multivariate signed sum test for the
one-sample location problem, Journal of Nonparametric Statistics, 4, 49-63.

Jan, S. L. and Randles, R. H. (1996), Interdirection tests for simple repeated


measures designs, Journal of the American Statistical Association, 91, 1611-
1618.

Jennrich, R. I. (1969), Asymptotic properties of non-linear least squares esti-


mators, The Annals of Mathematical Statistics, 40, 633-643.

Johnson, G. D., Nussbaum, B. D., Patil, G. P., and Ross, N. P. (1996), Design-
ing cost-effective environmental sampling using concomitant information,
Chance, 9, 4-16.


Jonckheere, A. R. (1954), A distribution-free k-sample test against ordered alternatives, Biometrika, 41, 133-145.

Jurečková, J. (1969), Asymptotic linearity of rank statistics in regression pa-


rameters, Annals of Mathematical Statistics, 40, 1449-1458.

Jurečková, J. (1971), Nonparametric estimate of regression coefficients, Annals


of Mathematical Statistics, 42, 1328-1338.

Kahaner, D., Moler, C., and Nash, S. (1989), Numerical Methods and Software,
Englewood Cliffs, New Jersey: Prentice Hall.

Kalbfleisch, J. D. and Prentice, R. L. (1980), The Statistical Analysis of Failure


Time Data, New York: John Wiley and Sons.

Kapenga, J. A., McKean, J. W., and Vidmar, T. J. (1988), RGLM: Users Man-
ual, Amer. Statist. Assoc. Short Course on Robust Statistical Procedures
for the Analysis of Linear and Nonlinear Models, New Orleans.

Kent, J. and Tyler, D. (1991), Redescending M-estimates of multivariate location and scatter, Annals of Statistics, 19, 2102-2119.

Kepner, J. C. and Robinson, D. H. (1988), Nonparametric methods for detect-


ing treatment effects in repeated measures designs, Journal of the American
Statistical Association, 83, 456-461.

Killeen, T. J., Hettmansperger, T. P., and Sievers, G. L. (1972), An elemen-


tary theorem on the probability of large deviations, Annals of Mathematical
Statistics, 43, 181-192.

Kloke, J. D. and McKean, J. W. (2010a), Rank-based estimation for Arnold transformed data, Festschrift for Professor Thomas P. Hettmansperger, In press.

Kloke, J. D. and McKean, J. W. (2010b), Rfit: R algorithms for rank-based fitting, Submitted.

Kloke, J., McKean, J. W., and Rashid, M. (2009), Rank-based estimation and
associated inferences for linear models with cluster correlated errors, Journal
of the American Statistical Association, 104, 384-390.

Klotz, J. (1962), Nonparametric tests for scale, Annals of Mathematical Statistics, 33, 498-512.

Koul, H. L. (1992), Weighted Empiricals and Linear Models, Hayward, Cali-


fornia: Institute of Mathematical Statistics.


Koul, H.L. and Saleh, A. K. (1993), R-estimation of the parameters of autore-


gressive AR(p) models, The Annals of Statistics, 21, 534-551.

Koul, H. L., Sievers, G. L., and McKean, J. W. (1987), An estimator of the


scale parameter for the rank analysis of linear models under general score
functions, Scandinavian Journal of Statistics, 14, 131-141.

Kramer, C. Y. (1956), Extension of multiple range tests to group means with


unequal numbers of replications, Biometrics, 12, 307-310.

Kruskal, W. H. and Wallis, W. A. (1952), Use of ranks in one criterion variance analysis, Journal of the American Statistical Association, 47, 583-621.

Larsen, R. J. and Stroup, D. F. (1976), Statistics in the Real World, New York:
Macmillan.

Lawless, J. F. (1982), Statistical Models and Methods for Lifetime Data, New
York: John Wiley and Sons.

Lawley, D. N. (1938), A generalization of Fisher’s z-test, Biometrika, 30, 180-


187.

Lehmann, E. L. (1975), Nonparametrics: Statistical Methods Based on Ranks,


San Francisco: Holden-Day.

Lehmann, E. L. and Casella, G. (1998), Theory of Point Estimation, 2nd Ed.,


New York: Springer.

Li, H. (1991), Rank Procedures for the Logistic Model, Unpublished Ph.D. The-
sis, Western Michigan University, Kalamazoo, MI.

Liang, K.-Y. and Zeger, S. L. (1986), Longitudinal data analysis using gener-
alized linear models, Biometrika, 73, 13-22.

Liu, R. Y. (1990), On a notion of data depth based on simplices, Annals of


Statistics, 18, 405-414.

Liu, R. Y. and Singh, K. (1993), A quality index based on data depth and
multivariate rank tests, Journal of the American Statistical Association,
88, 405-414.

Lopuhaä, H. P. and Rousseeuw, P. J. (1991), Breakdown properties of affine equivariant estimators of multivariate location and covariance matrices, Annals of Statistics, 19, 229-248.


Magnus, J. R. and Neudecker, H. (1988), Matrix Differential Calculus with


Applications in Statistics and Econometrics, New York: John Wiley and
Sons.

Malinvaud, E. (1970), The consistency of nonlinear regressions, The Annals of


Mathematical Statistics, 41, 956-959.

Mann, H. B. (1945), Nonparametric tests against trend, Econometrica, 13,


245-259.

Mann, H. B. and Whitney, D. R. (1947), On a test of whether one of two


random variables is stochastically larger than the other, Annals of Mathe-
matical Statistics, 18, 50-60.

Mardia, K. V. (1972), Statistics of Directional Data, London: Academic Press.

Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979), Multivariate Analysis,


Orlando, Fl.: Academic Press.

Maritz, J. S. (1981), Distribution-Free Statistical Methods, London: Chapman


and Hall.

Maritz, J. S. and Jarrett, R. G. (1978), A note on estimating the variance


of the sample median, Journal of the American Statistical Association, 73,
194-196.

Maritz, J. S., Wu, M. and Staudte, R. G. Jr. (1977), A location estimator based
on a U-statistic, Annals of Statistics, 5, 779-786.

Maronna, R.A. (1976), Robust M-estimators of multivariate location and scat-


ter, Annals of Statistics, 4, 51-67.

Maronna, R. A., Martin, R. D., and Yohai, V. Y. (2006), Robust Statistics


Theory and Methods, New York: John Wiley and Sons.

Marsaglia, G. and Bray, T. A. (1964), A convenient method for generating


normal variables, SIAM Review, 6, 260-264.

Martin, R.D. (1980), Robust estimation of autoregressive models (with discus-


sion), In: Directions in Time Series, eds, D. R. Brillinger and G. C. Tiao,
228-254. Hayward, CA: Institute of Mathematical Statistics.

Martin, R.D. and Yohai, V.J. (1991), Bias robust estimation of autoregression
parameters, In: Directions in Robust Statistics and Diagnostics, Part I, eds,
W. Stahel and S. Weisberg, 233-246, New York: Springer-Verlag.


Mason, R. L., Gunst, R. F., and Hess, J. L. (1989), Statistical Design and
Analysis of Experiments, New York: John Wiley and Sons.

Mathisen, H. C. (1943), A method of testing the hypothesis that two samples


are from the same population, Annals of Mathematical Statistics, 14, 188-
194.

McIntyre, G. A. (1952), A method of unbiased selective sampling, using ranked


sets, Australian Journal of Agricultural Research, 3, 385-390.

McKean, J. W. and Hettmansperger, T. P. (1976), Tests of hypotheses of the


general linear model based on ranks, Communications in Statistics, Part
A-Theory and Methods, 5, 693-709.

McKean, J. W. and Hettmansperger, T. P. (1978), A robust analysis of the


general linear model based on one step R-estimates, Biometrika, 65, 571-
579.

McKean, J. W., Naranjo, J. D., and Sheather, S. J. (1996a), Diagnostics to


detect differences in robust fits of linear models, Computational Statistics,
11, 223-243.

McKean, J. W., Naranjo, J. D., and Sheather, S. J. (1996b), An efficient and


high breakdown procedure for model criticism, Communications in Statis-
tics, Part A-Theory and Methods, 25, 2575-2595.

McKean, J. W., Naranjo, J. D., and Sheather, S. J. (1999), Diagnostics for


comparing robust and least squares fits, Journal of Nonparametric Statis-
tics, 11, 161-188.

McKean, J. W. and Ryan, T. A. Jr. (1977), An algorithm for obtaining con-


fidence intervals and point estimates based on ranks in the two sample
location problem, Transactions of Mathematical Software, 3, 183-185.

McKean, J. W. and Schrader, R. (1980), The geometry of robust procedures


in linear models, Journal of the Royal Statistical Society, Series B, Method-
ological, 42, 366-371.

McKean, J. W. and Schrader, R. M. (1984), A comparison of methods for


Studentizing the sample median, Communications in Statistics, Part B-
Simulation and Computation, 6, 751-773.

McKean, J. W. and Sheather, S. J. (1991), Small sample properties of robust


analyses of linear models based on r-estimates, In: Directions in Robust
Statistics and Diagnostics, Part II, 1-20, W. Stahel and S. Weisberg, eds,
New York: Springer-Verlag.


McKean, J. W. and Sheather, S. J. (2009), Diagnostic procedures, Wiley In-


terdisciplinary Reviews: Computational Statistics, 1(2), 221-233.

McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1990), Regression diagnostics for rank-based methods, Journal of the American Statistical Association, 85, 1018-1028.

McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1991), Regression diagnostics for rank-based methods II, In: Directions in Robust Statistics and Diagnostics, Part II, eds, W. Stahel and S. Weisberg, 21-31, New York: Springer-Verlag.

McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1993), The use and interpretation of residuals based on robust estimation, Journal of the American Statistical Association, 88, 1254-1263.

McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1994), Robust and high breakdown fits of polynomial models, Technometrics, 36, 409-415.

McKean, J. W. and Sievers, G. L. (1987), Coefficients of determination for least absolute deviation analysis, Statistics and Probability Letters, 5, 49-54.

McKean, J. W. and Sievers, G. L. (1989), Rank scores suitable for the analysis
of linear models under asymmetric error distributions, Technometrics, 31,
207-218.

McKean, J. W., Terpstra, J., and Kloke, J. D. (2009), Computational rank-based statistics, Wiley Interdisciplinary Reviews: Computational Statistics, 1(2), 132-140.

McKean, J. W. and Vidmar, T. J. (1992), Using procedures based on ranks: cautions and recommendations, American Statistical Association 1992 Proceedings of the Biopharmaceutical Section, 280-289.

McKean, J. W. and Vidmar, T. J. (1994), A comparison of two rank-based methods for the analysis of linear models, The American Statistician, 48, 220-229.

McKean, J. W., Vidmar, T. J., and Sievers, G. L. (1989), A robust two-stage multiple comparison procedure with application to a random drug screen, Biometrics, 45, 1281-1297.

McKnight, S., McKean, J. W., and Huitema, B. E. (2000), A double bootstrap method to analyze an intervention time series model with autoregressive error terms, Psychological Methods, 5, 87-101.

Merchant, J. A., Halprin, G. M., Hudson, A. R., Kilburn, K. H., McKenzie, W. N., Jr., Hurst, D. J., and Bermazohn, P. (1975), Responses to cotton dust, Archives of Environmental Health, 30, 222-229.

Milasevic, P. and Ducharme, G. R. (1987), Uniqueness of the spatial median, Annals of Statistics, 15, 1332-1333.

Mielke, P. W. (1972), Asymptotic behavior of two-sample tests based on the powers of ranks for detecting scale and location alternatives, Journal of the American Statistical Association, 67, 850-854.

Miller, R. G. (1981), Simultaneous Statistical Inference, New York: Springer-Verlag.

Milliken, G. A. and Johnson, D. E. (2001), Analysis of Messy Data, Vol. 3: Analysis of Covariance, New York: Chapman & Hall/CRC.

Mood, A. M. (1950), Introduction to the Theory of Statistics, New York: McGraw-Hill.

Mood, A. M. (1954), On the asymptotic efficiency of certain nonparametric two-sample tests, Annals of Mathematical Statistics, 25, 514-533.

Morrison, D. F. (1983), Applied Linear Statistical Methods, Englewood Cliffs, New Jersey: Prentice Hall.

Möttönen, J. (1997a), SAS/IML Macros for spatial sign and rank tests, Math-
ematics Department, University of Oulu, Finland.

Möttönen, J. (1997b), SAS/IML Macros for affine invariant multivariate sign and rank tests, Mathematics Department, University of Oulu, Finland.

Möttönen, J., Hettmansperger, T. P., Oja, H., and Tienari, J. (1998), On the
efficiency of the affine invariant multivariate rank tests, Journal of Multi-
variate Analysis, 66, 118-132.

Möttönen, J. and Oja, H. (1995), Multivariate spatial sign and rank methods,
Journal of Nonparametric Statistics, 5, 201-213.

Möttönen, J., Oja, H., and Tienari, J. (1997), On the efficiency of multivariate
spatial sign and rank tests, Annals of Statistics, 25, 542-552.

Naranjo, J. D. and Hettmansperger, T. P. (1994), Bounded-influence rank regression, Journal of the Royal Statistical Society, Series B, Methodological, 56, 209-220.

Naranjo, J. D., McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1994), The use and interpretation of rank-based residuals, Journal of Nonparametric Statistics, 3, 323-341.

Nelson, W. (1982), Applied Life Data Analysis, New York: John Wiley and Sons.

Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996), Ap-
plied Linear Statistical Models, 4th Ed., Chicago: Irwin.

Niinimaa, A. and Oja, H. (1995), On the influence functions of certain bivariate medians, Journal of the Royal Statistical Society, Series B, Methodological, 57, 565-574.

Niinimaa, A., Oja, H., and Nyblom, J. (1992), Algorithm AS 277: The Oja bivariate median, Applied Statistics, 41, 611-617.

Niinimaa, A., Oja, H., and Tableman, M. (1990), The finite-sample breakdown
point of the Oja bivariate median, Statistics and Probability Letters, 10,
325-328.

Noether, G. E. (1955), On a theorem of Pitman, Annals of Mathematical Statistics, 26, 64-68.

Noether, G. E. (1987), Sample size determination for some common nonparametric tests, Journal of the American Statistical Association, 82, 645-647.

Numerical Algorithms Group, Inc. (1983), Library Manual Mark 15, Oxford:
Numerical Algorithms Group.

Nyblom, J. (1992), Note on interpolated order statistics, Statistics and Probability Letters, 14, 129-131.

Oberhofer, W. (1982), The consistency of the nonlinear regression minimizing the L1-norm, Annals of Statistics, 10, 316-319.

Oja, H. (1983), Descriptive statistics for multivariate distributions, Statistics and Probability Letters, 1, 327-333.

Oja, H. (2010), Multivariate Nonparametric Methods with R: An Approach Based on Spatial Signs and Ranks, New York: Springer.

Oja, H. and Nyblom, J. (1989), Bivariate sign tests, Journal of the American
Statistical Association, 84, 249-259.

Oja, H. and Randles, R. H. (2004), Multivariate nonparametric tests, Statistical Science, 19, 598-605.

Olshen, R. A. (1967), Sign and Wilcoxon test for linearity, Annals of Mathe-
matical Statistics, 38, 1759-1769.

Osborne, M. R. (1985), Finite Algorithms in Optimization and Data Analysis, Chichester: John Wiley and Sons.

Peters, D. and Randles, R. H. (1990a), Multivariate rank tests in the two-sample location problem, Communications in Statistics, Part A-Theory and Methods, 19(11), 4225-4238.

Peters, D. and Randles, R. H. (1990b), A multivariate signed-rank test for the one-sample location problem, Journal of the American Statistical Association, 85, 552-557.

Pinheiro, J. and Bates, D. (2000), Mixed-Effects Models in S and S-PLUS, New York: Springer.

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., and the R Core team (2008),
nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-89.

Pitman, E. J. G. (1948), Notes on nonparametric statistical inference, Unpublished notes.

Plaisance, E. P., Taylor, J. K., Alhasson, S., Abebe, A., Mestek, M. L., and Grandjean, P. W. (2007), Cardiovascular fitness and vascular inflammatory markers following acute aerobic exercise, International Journal of Sport Nutrition and Exercise Metabolism, 17, 152-162.

Policello, G. E. II, and Hettmansperger, T. P. (1976), Adaptive robust procedures for the one-sample location model, Journal of the American Statistical Association, 71, 624-633.

Presnell, B. and Bohn, L. (1999), U-statistics and imperfect ranking in ranked-set sampling, Journal of Nonparametric Statistics, 10, 111-126.

Puri, M. L. (1968), Multisample scale problem: Unknown location parameters, Annals of the Institute of Statistical Mathematics, 40, 619-632.

Puri, M. L. and Sen, P. K. (1971), Nonparametric Methods in Multivariate Analysis, New York: John Wiley and Sons.

Puri, M. L. and Sen, P. K. (1985), Nonparametric Methods in General Linear Models, New York: John Wiley and Sons.

Randles, R. H. (1989), A distribution-free multivariate sign test based on interdirections, Journal of the American Statistical Association, 84, 1045-1050.

Randles, R. H. (2000), A simpler, affine-invariant, multivariate, distribution-free sign test, Journal of the American Statistical Association, 95, 1263-1268.

Randles, R. H., Fligner, M. A., Policello, G. E., and Wolfe, D. A. (1980), An asymptotically distribution-free test for symmetry versus asymmetry, Journal of the American Statistical Association, 75, 168-172.

Randles, R. H. and Wolfe, D. A. (1979), Introduction to the Theory of Nonparametric Statistics, New York: John Wiley and Sons.

Rao, C. R. (1948), Tests of significance in multivariate analysis, Biometrika, 35, 58-79.

Rao, C. R. (1973), Linear Statistical Inference and Its Applications, 2nd Edi-
tion, New York: John Wiley and Sons.

Rao, C. R. (1988), Methodology based on the L1-norm in statistical inference, Sankhya, Series A, 50, 289-313.

Rao, J. N. K., Sutradhar, B. C., and Yue, K. (1993), Generalized least squares
F test in regression analysis with two-stage cluster samples, Journal of the
American Statistical Association, 88, 1388-1391.

Rashid, M. M. and Nandram, B. (1998), A rank-based predictor for the finite population mean of a small area: An application to crop production, Journal of Agricultural, Biological, and Environmental Statistics, 3, 201-222.

Ridker, P. M., Rifai, N., Rose, L., Buring, J. E., and Cook, N. R. (2002),
Comparison of C-reactive protein and low-density lipoprotein cholesterol
levels in the prediction of first cardiovascular events, New England Journal
of Medicine, 347, 1557-1565.

Rockafellar, R. T. (1970), Convex Analysis, Princeton, New Jersey: Princeton University Press.

Rousseeuw, P. J. (1984), Least median of squares regression, Journal of the American Statistical Association, 79, 871-880.

Rousseeuw, P. J. and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: John Wiley and Sons.

Rousseeuw, P. J. and Van Driessen, K. (1999), A fast algorithm for the minimum
covariance determinant estimator, Technometrics, 41, 212-223.

Rousseeuw, P. J. and van Zomeren, B. C. (1990), Unmasking multivariate outliers and leverage points, Journal of the American Statistical Association, 85, 633-648.

Rousseeuw, P. J. and van Zomeren, B. C. (1991), Robust distances: Simulations and cutoff values, In: Directions in Robust Statistics and Diagnostics, Part II, eds, W. Stahel and S. Weisberg, 195-203, New York: Springer-Verlag.

Ryan, B., Joiner, B., and Cryer, J. (2005), Minitab Handbook, 5th Ed., Aus-
tralia: Thomson.

Savage, I. R. (1956), Contributions to the theory of rank order statistics - the two-sample case, Annals of Mathematical Statistics, 27, 590-615.

Sawilowsky, S. S. (1990), Nonparametric tests of interaction in experimental design, Review of Educational Research, 60, 91-126.

Sawilowsky, S. S. (2007), Real Data Analysis, Charlotte, North Carolina: Information Age Publishing.

Sawilowsky, S. S., Blair, R. C., and Higgins, J. J. (1989), An investigation of the Type I error and power properties of the rank transform procedure in factorial ANOVA, Journal of Educational Statistics, 14, 255-267.

Scheffé, H. (1959), The Analysis of Variance, New York: John Wiley and Sons.

Schrader, R. M. and McKean, J. W. (1977), Robust analysis of variance, Communications in Statistics, Part A-Theory and Methods, 6, 879-894.

Schrader, R. M. and McKean, J. W. (1987), Small sample properties of least absolute values analysis of variance, In: Statistical Data Analysis Based on the L1-Norm and Related Methods, Y. Dodge, ed, 307-321, Amsterdam: North-Holland.

Schuster, E. F. (1975), Estimating the distribution function of a symmetric distribution, Biometrika, 62, 631-635.

Schuster, E. F. (1987), Identifying the closest symmetric distribution or density function, Annals of Statistics, 15, 865-874.

Schuster, E. F. and Becker, R. C. (1987), Using the bootstrap in testing symmetry versus asymmetry, Communications in Statistics, Part B-Simulation and Computation, 16, 19-84.

Searle, S. R. (1971), Linear Models, New York: John Wiley and Sons.

Sheather, S. J. (1987), Assessing the accuracy of the sample median: Estimated standard errors versus interpolated confidence intervals, In: Statistical Data Analysis Based on the L1-Norm and Related Methods, Y. Dodge, ed, 203-216, Amsterdam: North-Holland.

Sheather, S. J. (2009), A Modern Approach to Regression with R, New York: Springer.

Sheather, S. J., McKean, J. W., and Hettmansperger, T. P. (1997), Finite sample stability properties of the least median of squares estimator, Journal of Statistical Computation and Simulation, 58, 371-383.

Shirley, E. A. C. (1981), A distribution-free method for analysis of covariance based on rank data, Applied Statistics, 30, 158-162.

Siegel, S. and Tukey, J. W. (1960), A nonparametric sum of ranks procedure for relative spread in unpaired samples, Journal of the American Statistical Association, 55, 429-444.

Sievers, G. L. (1983), A weighted dispersion function for estimation in linear models, Communications in Statistics, Part A-Theory and Methods, 12(10), 1161-1179.

Sievers, G. L. and Abebe, A. (2004), Rank estimation of regression coefficients using iterated reweighted least squares, Journal of Statistical Computation and Simulation, 74, 821-831.

Simonoff, J. S. and Hawkins, D. M. (1993), Algorithm AS 282: High breakdown regression and multivariate estimation, Applied Statistics, 42, 423-432.

Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992), On one-step GM-estimates and stability of inferences in linear regression, Journal of the American Statistical Association, 87, 439-450.

Small, C. G. (1990), A survey of multidimensional medians, International Statistical Review, 58, 263-277.

Speed, F. M., Hocking, R. R., and Hackney, O. P. (1978), Methods of analysis with unbalanced data, Journal of the American Statistical Association, 73, 105-112.

Steel, R. G. D. (1960), A rank sum test for comparing all pairs of treatments, Technometrics, 2, 197-207.

Stefanski, L. A., Carroll, R. J., and Ruppert, D. (1986), Optimally bounded score functions for generalized linear models with applications to logistic regression, Biometrika, 73, 413-424.

Stewart, G. W. (1973), Introduction to Matrix Computations, New York: Academic Press.

Stromberg, A. J. (1993), Computing the exact least median of squares estimate and stability diagnostics in multiple linear regression, SIAM Journal on Scientific Computing, 14, 1289-1299.

Student (1908), The probable error of a mean, Biometrika, 6, 1-25.

Tableman, M. (1990), Bounded-influence rank regression: A one-step estimator based on Wilcoxon scores, Journal of the American Statistical Association, 85, 508-513.

Terpstra, T. J. (1952), The asymptotic normality and consistency of Kendall's test against trend, when ties are present, Indagationes Mathematicae, 14, 327-333.

Terpstra, J. and McKean, J. W. (2005), Rank-based analyses of linear models using R, Journal of Statistical Software, 14(7), http://www.jstatsoft.org.

Terpstra, J., McKean, J. W., and Anderson, K. (2003), Studentized autoregressive time series residuals, Computational Statistics, 18, 123-141.

Terpstra, J., McKean, J. W., and Naranjo, J. D. (2000), Highly efficient weighted Wilcoxon estimates for the autoregression, Statistics, 35, 45-80.

Terpstra, J., McKean, J. W., and Naranjo, J. D. (2001), Weighted Wilcoxon es-
timates for autoregression, Australian & New Zealand Journal of Statistics,
43, 399-419.

Thompson, G. L. (1991a), A note on the rank transform for interactions, Biometrika, 78, 697-701.

Thompson, G. L. (1991b), A unified approach to rank tests for multivariate and repeated measures designs, Journal of the American Statistical Association, 86, 410-419.

Thompson, G. L. (1993), Correction note to: A note on the rank transform for interactions (V. 78, 697-701), Biometrika, 80, 211.

Thompson, G. L. and Ammann, L. P. (1989), Efficacies of rank-transform statistics in two-way models with no interaction, Journal of the American Statistical Association, 85, 519-528.

Tierney, L. (1990), XLISP-STAT, New York: John Wiley and Sons.

Tucker, H. G. (1967), A Graduate Course in Probability, New York: Academic Press.

Tyler, D. E. (1987), A distribution-free M-estimator of multivariate scatter, Annals of Statistics, 15, 234-251.

Vidmar, T. J. and McKean, J. W. (1996), A Monte Carlo study of robust and least squares response surface methods, Journal of Statistical Computation and Simulation, 54, 1-18.

Vidmar, T. J., McKean, J. W., and Hettmansperger, T. P. (1992), Robust procedures for drug combination problems with quantal responses, Applied Statistics, 41, 299-315.

Wang, M. H. (1996), Statistical Graphics: Applications to the R and GR Methods in Linear Models, Unpublished Ph.D. Thesis, Western Michigan University, Kalamazoo, MI.

Welch, B. L. (1937), The significance of the difference between two means when
the population variances are unequal, Biometrika, 29, 350-362.

Werner, C. and Brunner, E. (2007), Rank methods for the analysis of clustered
data in diagnostic trials, Computational Statistics and Data Analysis, 51,
5041-5054.

Wilcoxon, F. (1945), Individual comparisons by ranking methods, Biometrics, 1, 80-83.

Wilks, S. S. (1960), Multidimensional statistical scatter, In: Contributions to Probability and Statistics in Honor of Harold Hotelling, ed. I. Olkin et al., 486-503, Stanford: Stanford University Press.

Witt, L. D. (1989), Coefficients of Multiple Determination Based on Rank Estimates, Unpublished Ph.D. Thesis, Western Michigan University, Kalamazoo, MI.

Witt, L. D., McKean, J. W., and Naranjo, J. D. (1994), Robust measures of association in the correlation model, Statistics and Probability Letters, 20, 295-306.

Witt, L. D., Naranjo, J. D., and McKean, J. W. (1995), Influence functions for rank-based procedures in the linear model, Journal of Nonparametric Statistics, 5, 339-358.

Wu, C.-F. (1981), Asymptotic theory of nonlinear least squares estimation, Annals of Statistics, 9, 501-513.

Ylvisaker, D. (1977), Test resistance, Journal of the American Statistical Association, 72, 551-556.
