
Applied Logistic Regression

WILEY SERIES IN PROBABILITY AND STATISTICS TEXTS AND REFERENCES SECTION



Established by WALTER A. SHEWHART and SAMUEL S. WILKS

Editors: Noel A. C. Cressie, Nicholas I. Fisher, Iain M. Johnstone, J. B. Kadane, David W. Scott, Bernard W. Silverman, Adrian F. M. Smith, Jozef L. Teugels; Vic Barnett, Emeritus, Ralph A. Bradley, Emeritus, J. Stuart Hunter, Emeritus, David G. Kendall, Emeritus

A complete list of the titles in this series appears at the end of this volume.

Applied Logistic Regression

Second Edition

DAVID W. HOSMER

University of Massachusetts Amherst, Massachusetts

STANLEY LEMESHOW

The Ohio State University Columbus, Ohio

A Wiley-Interscience Publication JOHN WILEY & SONS, INC.

New York • Chichester • Weinheim • Brisbane • Singapore • Toronto

To Trina, Wylie, Tri,

D. W. H.

To Elaine, Jenny, Adina, Steven, S. L.

This text is printed on acid-free paper. Copyright © 2000 by John Wiley & Sons, Inc.

All rights reserved. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, E-Mail: PERMREQ@WILEY.COM.

To order books or for customer service, please call 1 (800) CALL-WILEY (225-5945).

Library of Congress Cataloging in Publication Data:

Hosmer, David W.

Applied logistic regression / David W. Hosmer, Jr., Stanley Lemeshow.-2nd ed.

p. cm.

Includes bibliographical references and index. ISBN 0-471-35632-8 (cloth: alk. paper)

1. Regression analysis. I. Lemeshow, Stanley. II. Title.

QA278.2.H67 2000

519.5'36-dc21 00-036843

Printed in the United States of America

10 9 8 7 6 5 4

CONTENTS
1 Introduction to the Logistic Regression Model 1
1.1 Introduction, 1
1.2 Fitting the Logistic Regression Model, 7
1.3 Testing for the Significance of the Coefficients, 11
1.4 Confidence Interval Estimation, 17
1.5 Other Methods of Estimation, 21
1.6 Data Sets, 23
1.6.1 The ICU Study, 23
1.6.2 The Low Birth Weight Study, 25
1.6.3 The Prostate Cancer Study, 26
1.6.4 The UMARU IMPACT Study, 27
Exercises, 28
2 Multiple Logistic Regression 31
2.1 Introduction, 31
2.2 The Multiple Logistic Regression Model, 31
2.3 Fitting the Multiple Logistic Regression Model, 33
2.4 Testing for the Significance of the Model, 36
2.5 Confidence Interval Estimation, 40
2.6 Other Methods of Estimation, 43
Exercises, 44
3 Interpretation of the Fitted Logistic Regression Model 47
3.1 Introduction, 47
3.2 Dichotomous Independent Variable, 48
3.3 Polychotomous Independent Variable, 56
3.4 Continuous Independent Variable, 63
3.5 The Multivariable Model, 64
3.6 Interaction and Confounding, 70
3.7 Estimation of Odds Ratios in the Presence of Interaction, 74
3.8 A Comparison of Logistic Regression and Stratified Analysis for 2 x 2 Tables, 79
3.9 Interpretation of the Fitted Values, 85
Exercises, 88
4 Model-Building Strategies and Methods for Logistic Regression 91
4.1 Introduction, 91
4.2 Variable Selection, 92
4.3 Stepwise Logistic Regression, 116
4.4 Best Subsets Logistic Regression, 128
4.5 Numerical Problems, 135
Exercises, 142
5 Assessing the Fit of the Model 143
5.1 Introduction, 143
5.2 Summary Measures of Goodness-of-Fit, 144
5.2.1 Pearson Chi-Square Statistic and Deviance, 145
5.2.2 The Hosmer-Lemeshow Tests, 147
5.2.3 Classification Tables, 156
5.2.4 Area Under the ROC Curve, 160
5.2.5 Other Summary Measures, 164
5.3 Logistic Regression Diagnostics, 167
5.4 Assessment of Fit via External Validation, 186
5.5 Interpretation and Presentation of Results from a Fitted Logistic Regression Model, 188
Exercises, 200
6 Application of Logistic Regression with Different Sampling Models 203
6.1 Introduction, 203
6.2 Cohort Studies, 203
6.3 Case-Control Studies, 205
6.4 Fitting Logistic Regression Models to Data from Complex Sample Surveys, 211
Exercises, 222
7 Logistic Regression for Matched Case-Control Studies 223
7.1 Introduction, 223
7.2 Logistic Regression Analysis for the 1-1 Matched Study, 226
7.3 An Example of the Use of the Logistic Regression Model in a 1-1 Matched Study, 230
7.4 Assessment of Fit in a Matched Study, 236
7.5 An Example of the Use of the Logistic Regression Model in a 1-M Matched Study, 243
7.6 Methods for Assessment of Fit in a 1-M Matched Study, 248
7.7 An Example of Assessment of Fit in a 1-M Matched Study, 252
Exercises, 259
8 Special Topics 260
8.1 The Multinomial Logistic Regression Model, 260
8.1.1 Introduction to the Model and Estimation of the Parameters, 260
8.1.2 Interpreting and Assessing the Significance of the Estimated Coefficients, 264
8.1.3 Model-Building Strategies for Multinomial Logistic Regression, 273
8.1.4 Assessment of Fit and Diagnostics for the Multinomial Logistic Regression Model, 280
8.2 Ordinal Logistic Regression Models, 288
8.2.1 Introduction to the Models, Methods for Fitting and Interpretation of Model Parameters, 288
8.2.2 Model Building Strategies for Ordinal Logistic Regression Models, 305
8.3 Logistic Regression Models for the Analysis of Correlated Data, 308
8.4 Exact Methods for Logistic Regression Models, 330
8.5 Sample Size Issues When Fitting Logistic Regression Models, 339
Exercises, 347
Addendum 352
References 354
Index 369


Preface to the Second Edition

The use of logistic regression modeling has exploded during the past decade. From its original acceptance in epidemiologic research, the method is now commonly employed in many fields including but not nearly limited to biomedical research, business and finance, criminology, ecology, engineering, health policy, linguistics and wildlife biology. At the same time there has been an equal amount of effort in research on all statistical aspects of the logistic regression model. A literature search that we did in preparing this Second Edition turned up more than 1000 citations that have appeared in the 10 years since the First Edition of this book was published.

When we worked on the First Edition of this book we were very limited by software that could carry out the kinds of analyses we felt were important. Specifically, beyond estimation of regression coefficients, we were interested in such issues as measures of model performance, diagnostic statistics, conditional analyses and multinomial response data. Software is now readily available in numerous easy to use and widely available statistical packages to address these and other extremely important modeling issues. Enhancements to these capabilities are being added to each new version. As is well-recognized in the statistical community, the inherent danger of this easy-to-use software is that investigators are using a very powerful tool about which they may have only limited understanding. It is our hope that this Second Edition will bridge the gap between the outstanding theoretical developments and the need to apply these methods to diverse fields of inquiry.

Numerous texts have sections containing a limited discussion of logistic regression modeling but there are still very few comprehensive texts on this subject. Among the textbooks written at a level similar to this one are: Cox and Snell (1989), Collett (1991) and Kleinbaum (1994).

As was the case in our First Edition, the primary objective of the Second Edition is to provide a focused introduction to the logistic regression model and its use in methods for modeling the relationship between a categorical outcome variable and a set of covariates. Topics that have been added to this edition include: numerous new techniques for model building including determination of scale of continuous covariates; a greatly expanded discussion of assessing model performance; a discussion of logistic regression modeling using complex sample survey data; a comprehensive treatment of the use of logistic regression modeling in matched studies; completely new sections dealing with logistic regression models for multinomial, ordinal and correlated response data, exact methods for logistic regression and sample size issues. An underlying theme throughout this entire book is the focus on providing guidelines for effective model building and interpreting the resulting fitted model within the context of the applied problem.

The materials in the book have evolved considerably over the past ten years as a result of our teaching and consulting experiences. We have used this book to teach parts of graduate level survey courses, quarter- or semester-long courses, and focused short courses to working professionals. We assume that students have a solid foundation in linear regression methodology and contingency table analysis.

The approach we take is to develop the model from a regression analysis point of view. This is accomplished by approaching logistic regression in a manner analogous to what would be considered good statistical practice for linear regression. This differs from the approach used by other authors who have begun their discussion from a contingency table point of view. While the contingency table approach may facilitate the interpretation of the results, we believe that it obscures the regression aspects of the analysis. Thus, discussion of the interpretation of the model is deferred until the regression approach to the analysis is firmly established.

To a large extent there are no major differences in the capabilities of the various software packages. When a particular approach is available in a limited number of packages, it will be noted in this text. In general, analyses in this book have been performed in STATA [Stata Corp. (1999)]. This easy-to-use package combines excellent graphics and analysis routines, is fast, is compatible across Macintosh, Windows and UNIX platforms and interacts well with Microsoft Word. Other major statistical packages employed at various points during the preparation of this text include SAS [SAS Institute Inc. (1999)], SPSS [SPSS Inc. (1998)], and BMDP [BMDP Statistical Software (1992)]. In general, the results produced were the same regardless of which package was used. Reported numeric results have been rounded from figures obtained from computer output and thus may differ slightly from those that would be obtained in a replication of our analyses or from calculations based on the reported results. When features or capabilities of the programs differ in an important way, we note them by the names given rather than by their bibliographic citation.

This text was prepared in camera-ready format using Microsoft Word 98 on a Power Macintosh platform. Mathematical equations and symbols were built using MathType 3.6a [MathType: Mathematical Equation Editor (1998)].

Early on in the preparation of the Second Edition we made a decision that data sets used in the text would be made available to readers via the World Wide Web. The ftp site at John Wiley & Sons, Inc. for the data in this text is

ftp://ftp.wiley.com/public/sci_tech_med/logistic.

In addition, the data may also be found, by permission of John Wiley & Sons Inc., in the archive of statistical data sets maintained at the University of Massachusetts at Internet address

http://www-unix.oit.umass.edu/~statdata

in the logistic regression section. Another advantage to having a text web site is that it provides a convenient medium for conveying to readers text changes after publication. In particular, as errata become known to us they will be added to an errata section of the text's web site at John Wiley & Sons, Inc. Another use that we envision for the web is the addition, over time, of additional data sets to the statistical data set archive at the University of Massachusetts.

We are deeply appreciative of the efforts of our students and colleagues who carefully read and contributed to the clarity of this manuscript. In particular we are indebted to Elizabeth Donohoe-Cook, Sunny Kim and Soon-Kwi Kim for their careful and meticulous reading of the drafts of this manuscript. Special thanks also go to Rita Popat for helping us make the transition between the software we used for the first and second editions. We appreciate Alan Agresti's comments on the section dealing with the analysis of correlated data. Cyrus Mehta was particularly helpful in sharing key papers and in providing us with the LogXact 4 (2000) program used for computations in Section 8.4. Others contributed significantly to the First Edition and their original suggestions made this Second Edition stronger. These include Gordon Fitzgerald, Sander Greenland, Bob Harris and Ed Stanek.

There have been many other contributors to this book. Data sets were made available by our colleagues, Donn Young, Jane McCusker, Carol Bigelow, Anne Stoddard, Harris Pastides, and Jane Zapka, as well as by Doctors Daniel Teres and Laurence E. Lundy at Baystate Medical Center in Springfield, Massachusetts. Cliff Johnson at NCHS was helpful in providing us with a data set from the NHANES III that we used extensively in Section 6.4 as well as for sharing insights with us into analytic strategies used by that agency. We are very grateful to Professor Petter Laake, Section of Medical Statistics at the University of Oslo and Professeur Roger Salamon of the University of Bordeaux, II who provided us with support to work on this manuscript during visits to their universities. Comments by many of our students and colleagues at the University of Massachusetts, The Ohio State University, the New England Epidemiology Summer Program, the Erasmus Summer Program, the Summer Program in Applied Statistical Methods at The Ohio State University, the University of Oslo and the University of Bordeaux as well as at innumerable short courses that we have had the privilege to be invited to teach over the past ten years, were extremely useful.

Finally, we would like to thank Steve Quigley and the production staff at John Wiley & Sons for their help in bringing this project to completion.

DAVID W. HOSMER, JR.
STANLEY LEMESHOW

Amherst, Massachusetts
Columbus, Ohio

June, 2000
