MULTILEVEL ANALYSIS
An introduction to basic and advanced multilevel modeling

Tom A. B. Snijders and Roel J. Bosker

SAGE Publications
London · Thousand Oaks · New Delhi

© Tom A. B. Snijders and Roel J. Bosker 1999
First published 1999. Reprinted 2000, 2002, 2003.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, transmitted or utilized in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without permission in writing from the Publishers.

SAGE Publications Ltd, 6 Bonhill Street, London EC2A 4PU
SAGE Publications Inc, 2455 Teller Road, Thousand Oaks, California 91320
SAGE Publications India Pvt Ltd, 32, M-Block Market, Greater Kailash - I, New Delhi 110 048

British Library Cataloguing in Publication data
A catalogue record for this book is available from the British Library
ISBN 0-7619-5889-4
ISBN 0-7619-5890-8 (pbk)
Library of Congress catalog record available

Printed in Great Britain by The Cromwell Press Ltd, Trowbridge, Wiltshire

Contents

Preface

1 Introduction
1.1 Multilevel analysis
1.1.1 Probability models
1.2 This book
1.2.1 Prerequisites
1.2.2 Notation

2 Multilevel Theories, Multi-stage Sampling, and Multilevel Models
2.1 Dependence as a nuisance
2.2 Dependence as an interesting phenomenon
2.3 Macro-level, micro-level, and cross-level relations

3 Statistical Treatment of Clustered Data
3.1 Aggregation
3.2 Disaggregation
3.3 The intraclass correlation
3.3.1 Within-group and between-group variance
3.3.2 Testing for group differences
3.4 Design effects in two-stage samples
3.5 Reliability of aggregated variables
3.6 Within- and between-group relations
3.6.1 Regressions
3.6.2 Correlations
3.6.3 Estimation of within- and between-group correlations
3.7 Combination of within-group evidence

4 The Random Intercept Model
4.1 A regression model: fixed effects only
4.2 Variable intercepts: fixed or random parameters?
4.2.1 When to use random coefficient models?
4.3 Definition of the random intercept model
4.4 More explanatory variables
4.5 Within- and between-group regressions
4.6 Parameter estimation
4.7 'Estimating' random group effects: posterior means
4.7.1 Posterior confidence intervals
4.8 Three-level random intercept models

5 The Hierarchical Linear Model
5.1 Random slopes
5.1.1 Heteroscedasticity
5.1.2 Don't force τ₀₁ to be 0!
5.1.3 Interpretation of random slope variances
5.2 Explanation of random intercepts and slopes
5.2.1 Cross-level interaction effects
5.2.2 A general formulation of fixed and random parts
5.3 Specification of random slope models
5.3.1 Centering variables with random slopes?
5.4 Estimation
5.5 Three and more levels

6 Testing and Model Specification
6.1 Tests for fixed parameters
6.1.1 Multi-parameter tests for fixed effects
6.2 Deviance tests
6.2.1 Halved p-values for variance parameters
6.3 Other tests for parameters in the random part
6.4 Model specification
6.4.1 Working upward from level one
6.4.2 Joint consideration of level-one and level-two variables
6.4.3 Concluding remarks about model specification

7 How Much Does the Model Explain?
7.1 Explained variance
7.1.1 Negative values of R²?
7.1.2 Definitions of proportions of explained variance in two-level models
7.1.3 Explained variance in three-level models
7.1.4 Explained variance in models with random slopes
7.2 Components of variance
7.2.1 Random intercept models
7.2.2 Random slope models

8 Heteroscedasticity
8.1 Heteroscedasticity at level one
8.1.1 Linear variance functions
8.1.2 Quadratic variance functions
8.2 Heteroscedasticity at level two

9 Assumptions of the Hierarchical Linear Model
9.1 Assumptions of the hierarchical linear model
9.2 Following the logic of the hierarchical linear model
9.2.1 Include contextual effects
9.2.2 Check whether variables have random effects
9.2.3 Explained variance
9.3 Specification of the fixed part
9.4 Specification of the random part
9.4.1 Testing for heteroscedasticity
9.4.2 What to do in case of heteroscedasticity
9.5 Inspection of level-one residuals
9.6 Residuals and influence at level two
9.6.1 Empirical Bayes residuals
9.6.2 Influence of level-two units
9.7 More general distributional assumptions

10 Designing Multilevel Studies
10.1 Some introductory notes on power
10.2 Estimating a population mean
10.3 Measurement of subjects
10.4 Estimating association between variables
10.4.1 Cross-level interaction effects
10.5 Exploring the variance structure
10.5.1 The intraclass correlation
10.5.2 Variance parameters

11 Crossed Random Coefficients
11.1 A two-level model with a crossed random factor
11.1.1 Random slopes of dummy variables
11.2 Crossed random effects in three-level models
11.3 Correlated random coefficients of crossed factors
11.3.1 Random slopes in a crossed design
11.3.2 Multiple roles
11.3.3 Social networks

12 Longitudinal Data
12.1 Fixed occasions
12.1.1 The compound symmetry model
12.1.2 Random slopes
12.1.3 The fully multivariate model
12.1.4 Multivariate regression analysis
12.1.5 Explained variance
12.2 Variable occasion designs
12.2.1 Populations of curves
12.2.2 Random functions
12.2.3 Explaining the functions
12.2.4 Changing covariates
12.3 Autocorrelated residuals

13 Multivariate Multilevel Models
13.1 The multivariate random intercept model
13.2 Multivariate random slope models

14 Discrete Dependent Variables
14.1 Hierarchical generalized linear models
14.2 Introduction to multilevel logistic regression
14.2.1 Heterogeneous proportions
14.2.2 The logit function: log-odds
14.2.3 The empty model
14.2.4 The random intercept model
14.2.5 Estimation
14.2.6 Aggregation
14.2.7 Testing the random intercept
14.3 Further topics about multilevel logistic regression
14.3.1 Random slope model
14.3.2 Representation as a threshold model
14.3.3 Residual intraclass correlation coefficient
14.3.4 Explained variance
14.3.5 Consequences of adding effects to the model
14.3.6 Bibliographic remarks
14.4 Ordered categorical variables
14.5 Multilevel Poisson regression

15 Software
15.1 Special software for multilevel modeling
15.1.1 HLM
15.1.2 MLn / MLwiN
15.1.3 VARCL
15.1.4 MIXREG, MIXOR, MIXNO, MIXPREG
15.2 Modules in general purpose software packages
15.2.1 SAS, procedure MIXED
15.2.2 SPSS, command VARCOMP
15.2.3 BMDP-V modules
15.2.4 Stata
15.3 Other multilevel software
15.3.1 PinT
15.3.2 Mplus
15.3.3 MLA
15.3.4 BUGS

References
Index

Preface

This book grew out of our teaching and consultation activities in the domain of multilevel analysis. It is intended as well for the absolute beginner in this field as for those who have already mastered the fundamentals and are now entering more complicated areas of application. The reader is referred to Section 1.2 for an overview of this book and for some reading guidelines.

We are grateful to various people from whom we got reactions on earlier parts of this manuscript and also to the students who were exposed to it and helped us realize what was unclear. We received useful comments and benefited from discussions about parts of the manuscript with, among others, Joerg Blasius, Marijtje van Duijn, Wolfgang Langer, Ralf Maslowski, and Ian Plewis. Moreover, we would like to thank Hennie Brandsma, Mieke Brekelmans, Jan van Damme, Hetty Dekkers, Miranda Lubbers, Lyset Rekers-Mombarg and Jan Maarten Wit, Carolina de Weerth, Beate Volker, Ger van der Werf, and the Zentral Archiv (Cologne), who kindly permitted us to use data from their respective research projects as illustrative material for this book. We would also like to thank Annelies Verstappen-Remmers for her unfailing secretarial assistance.

Tom Snijders
Roel Bosker
1999

1 Introduction

1.1 Multilevel analysis

Multilevel analysis is a methodology for the analysis of data with complex patterns of variability, with a focus on nested sources of variability: e.g., pupils in classes, employees in firms, suspects tried by judges in courts, animals in litters, longitudinal measurements of subjects, etc. In the analysis of such data, it usually is illuminating to take account of the variability associated with each level of nesting. There is variability, e.g., between pupils but also between classes, and one may draw wrong conclusions if either of these sources of variability is ignored. Multilevel analysis is an approach to the analysis of such data, including the statistical techniques as well as the methodology of how to use these. The name of multilevel analysis is used mainly in the social sciences (in the wide sense: sociology, education, psychology, economics, criminology, etc.), but also in other fields such as the bio-medical sciences. Our focus will be on the social sciences.

In its present form, multilevel analysis is a stream which has two tributaries: contextual analysis and mixed effects models.

Contextual analysis is a development in the social sciences which has focused on the effects of the social context on individual behavior. Some landmarks before 1980 were the paper by Robinson (1950) who discussed the ecological fallacy (which refers to confusion between aggregate and individual effects), the paper by Davis, Spaeth, and Huson (1961) about the distinction between within-group and between-group regression, the volume edited by Dogan and Rokkan (1969), and the paper by Burstein, Linn, and Capell (1978) about treating regression intercepts and slopes on one level as outcomes on the higher level.

Mixed effects models are statistical models in the analysis of variance and in regression analysis where it is assumed that some of the coefficients are fixed and others are random.
This subject is too vast even to mention some landmarks. The standard reference book on random effects models and mixed effects models is Searle, Casella, and McCulloch (1992), who give an extensive historical overview in their Chapter 2. The name 'mixed model' seems to have been used first by Eisenhart (1947).

Contextual modeling until about 1980 focused on the definition of appropriate variables to be used in ordinary least squares regression analysis. The main focus in the development of statistical procedures for mixed models was until the 1980s on random effects (i.e., random differences between classes in some classification system) more than on random coefficients (i.e., random effects of numerical variables). Multilevel analysis as we now know it was formed by these two streams coming together. It was realized that in contextual modeling, the individual and the context are distinct sources of variability, which should both be modeled as random influences. On the other hand, statistical methods and algorithms were developed that allowed the practical use of regression-type models with nested random coefficients. There was a cascade of statistical papers: Aitkin, Anderson, and Hinde (1981); Laird and Ware (1982); Mason, Wong, and Entwisle (1983); Goldstein (1986); Aitkin and Longford (1986); Raudenbush and Bryk (1986); De Leeuw and Kreft (1986); and Longford (1987) proposed and developed techniques for calculating estimates for mixed models with nested coefficients. These techniques, together with the programs implementing them which were developed by a number of these researchers or under their supervision, allowed the practical use of models of which until that moment only special cases were accessible for practical use. By 1986 the basis of multilevel analysis was established; many further elaborations have been developed since then, and the methodology has proved to be quite fruitful for applications. On the organizational side, the 'Multilevel Models Project' in London stimulates developments by its Newsletter and its web site http://www.ioe.ac.uk/multilevel/ with the mirror web sites http://www.medent.umontreal.ca/multilevel/ and http://www.edfac.unimelb.edu.au/multilevel/.

In the biomedical sciences mixed models were proposed especially for longitudinal data; in economics mainly for panel data (Swamy, 1971), the most common longitudinal data in economics. One of the issues treated in the economic literature was the pooling of cross-sectional and time series data (e.g., Maddala, 1971, and Hausman and Taylor, 1981), which is closely related to the difference between within-group and between-group regressions. Overviews are given by Chow (1984) and Baltagi (1995).

A more elaborate history of multilevel analysis is presented in the bibliographical sections of Longford (1993a) and in Kreft and de Leeuw (1998). For an extensive bibliography, see Hüttner and van den Eeden (1996).

1.1.1 Probability models

The main statistical model of multilevel analysis is the hierarchical linear model, an extension of the multiple linear regression model to a model that includes nested random coefficients. This model is explained in Chapter 5 and forms the basis of most of this book.

There are several ways to argue why it makes sense to use a probability model for data analysis. In sampling theory a distinction is made between design-based inference and model-based inference (see, e.g., Särndal, Swensson, and Wretman, 1992).
The former means that the researcher draws a probability sample from some finite population, and wishes to make inferences from the sample to this finite population. The probability model then follows from how the sample is drawn by the researcher. Model-based inference means that the researcher postulates a probability model, usually aiming at inference to some large and sometimes hypothetical population like all English primary school pupils in the 1990s or all human adults living in a present-day industrialized culture. If the probability model is adequate then so are the inferences based on it, but checking this adequacy is possible only to a limited extent.

It is possible to apply model-based inference to data collected by investigating some entire research population, like all twelve-year-old pupils in Amsterdam at a given moment. Sometimes the question is posed why one should use a probability model if no sample is drawn but an entire population is observed. Using a probability model that assumes statistical variability, even though an entire research population was investigated, can be justified by realising that conclusions are sought which apply not only to the investigated research population but to a wider population. The investigated research population is supposed to be representative for this wider population - for pupils also in earlier or later years, in other towns, maybe in other countries. Applicability to such a wider population is not automatic, but has to be carefully argued by considering whether indeed the research population may be considered to be representative for the larger (often vaguely outlined) population. The inference then is not primarily about a given delimited set of individuals but about social, behavioral, biological, etc., mechanisms and processes. The random effects, or residuals, playing a role in such probability models can be regarded as the resultants of the factors that are not included in the explanatory variables used. They reflect the approximating nature of the model used. The model-based inference will be adequate to the extent that the assumptions of the probability model are an adequate reflection of the effects that are not explicitly included by means of observed variables.

As we shall see in Chapters 3, 4, and 6, the basic idea of multilevel analysis is that data sets with a nesting structure that includes unexplained variability at each level of nesting, such as pupils in classes or employees in firms, are usually not adequately represented by the probability model of multiple linear regression analysis, but are often adequately represented by the hierarchical linear model. Thus, the use of the hierarchical linear model in multilevel analysis is in the tradition of model-based inference.

1.2 This book

This book is meant as an introductory textbook and as a reference book for practical users of multilevel analysis. We have tried to include all the main points that come up when applying multilevel analysis. Some of the data sets used in the examples, and corresponding commands to run the examples in the computer packages MLn/MLwiN and HLM (see Chapter 15), are available at the web site http://stat.gamma.rug.nl/snijders/multilevel.htm.

After this introductory chapter, the book proceeds with a conceptual chapter about multilevel questions and a chapter about ways for treating multilevel data that are not based on the hierarchical linear model.
Chapters 4 to 6 treat the basic conceptual ideas of the hierarchical linear model, and how to work with it in practice. Chapter 4 introduces the random intercept model as the primary example of the hierarchical linear model. This is extended in Chapter 5 to random slope models. Chapters 4 and 5 focus on understanding the hierarchical linear model and its parameters, paying only very limited attention to procedures and algorithms for parameter estimation (estimation being work that most researchers delegate to the computer). Testing parameters and specifying a multilevel model is the topic of Chapter 6.

An introductory course on multilevel analysis could cover Chapters 1 to 6 and Section 7.1, with selected material from other chapters. A minimal course would focus on Chapters 4 to 6. The later chapters are about topics which are more specialized or more advanced, but important in the practice of multilevel analysis.

The text of this book is not based on a particular computer program for multilevel analysis. The last chapter, 15, gives a brief review of programs that can be used for multilevel analysis and makes the link (to the extent that this is still necessary) between the terminology used in these programs and the terminology of the book.

Chapters 7 (about the explanatory power of the model) and 9 (about model assumptions) are important for the interpretation of results of statistical analyses using the hierarchical linear model. Chapter 10 helps the researcher in setting up a multilevel study, and in choosing sample sizes at the various levels.

Chapters 8, and 11 to 14, treat various extensions of the basic hierarchical linear model that are useful in practical research. The topic of Chapter 8, heteroscedasticity (non-constant residual variances), may seem rather specialized. Modeling heteroscedasticity, however, can be very useful. It also allows model checks and model modifications that are used in Chapter 9. Chapter 11 treats crossed random coefficients, a model ingredient which strictly speaking is outside the domain of multilevel models, but which is practically important and can be implemented in currently available multilevel software. Chapter 12 is about longitudinal data, with a fixed occasion design (i.e., repeated measures data) as well as those with a variable occasion design. This chapter indicates how the flexibility of the multilevel model gives important opportunities for data analysis (e.g., for incomplete multivariate or longitudinal data) that were unavailable earlier. Chapter 13 is about multilevel analysis for multivariate dependent variables. Chapter 14 describes possibilities of multilevel modeling for dichotomous, ordinal, and frequency data.

If additional textbooks are sought, one could consider Hox (1994) and Kreft and de Leeuw (1998), good introductions; Bryk and Raudenbush (1992), an elaborate treatment of the hierarchical linear model; and Longford (1993a) and Goldstein (1995) for more of the mathematical background.

1.2.1 Prerequisites

For reading this textbook, it is required that you have a good working knowledge of statistics. It is assumed that you know the concepts of probability, random variable, probability distribution, population, sample, statistical independence, expectation (= population mean), variance, covariance, correlation, standard deviation, and standard error.
Further it is assumed that you know the basics of hypothesis testing and multiple regression analysis, and that you can understand formulae of the kind that occur in the explanation of regression analysis. Matrix notation is used only in a few more advanced sections. These sections can be skipped without loss of understanding of other parts of the book.

1.2.2 Notation

The main notational conventions are as follows. Abstract variables and random variables are denoted by italicized capital letters, like X or Y. Outcomes of random variables and other fixed values are denoted by italicized small letters, like x or y. Thus we speak about the variable X, but in formulae where the value of this variable is considered as a fixed, non-random value, it will be denoted x. There are some exceptions to this, e.g., in Chapter 2, and the use of the letter N for the number of groups (level-two units) in the data.

The letter E is used to denote the expected value, or population average, of a random variable. Thus, EY and E(Y) denote the expected value of Y. For example, if P_n is the fraction of tails obtained in n coin flips, and the coin is fair, then the expected value is E P_n = 1/2.

Statistical parameters are indicated by Greek letters. Examples are μ, σ², and β. The following Greek letters are used:

α alpha, β beta, γ gamma, δ delta, η eta, θ theta, λ lambda, μ mu, π pi, ρ rho, σ sigma, τ tau, φ phi, χ chi, ω omega, Σ capital sigma, T capital tau.

2 Multilevel Theories, Multi-stage Sampling, and Multilevel Models

In many cases simple random sampling is not a very cost-efficient strategy, and multi-stage samples may be more efficient instead. In that case the clustering of the data is, in the phase of data analysis, a nuisance which should be taken into consideration. In many other situations, however, multi-stage samples are employed because one is interested in relations between variables at different layers in a hierarchical system. In this case the dependency of observations within groups is of focal interest, because it reflects that groups differ in certain respects. In either case, the use of single-level statistical models is no longer valid. The fallacies to which their use can lead are described in the next chapter.

2.1 Dependence as a nuisance

From textbooks on statistics it is learned that, as a standard situation, observations should be sampled independently from each other. The standard sampling design to which statistical models are linked accordingly is simple random sampling with replacement from an infinite population: the result of one selection is independent of the result of any other selection, and the chances of selecting a certain single unit are constant (and known) across all units in the population. Textbooks on sampling, however, make clear that there are more cost-efficient sampling designs, based on the idea that probabilities of selection should be known but do not have to be constant. One of those cost-efficient sampling designs is the multi-stage sample: the population of interest consists of subpopulations, and selection takes place via those subpopulations. If there is only one subpopulation level, the design is a two-stage sample. Pupils, for instance, are grouped in schools, so the population of pupils consists of subpopulations of schools that contain pupils. Other examples are: families in neighborhoods, teeth in jawbones, animals in litters, employees in firms, children in families, etc.
In a random two-stage sample, a random sample of the primary units (schools, neighborhoods, jawbones, litters, firms, families) is taken in the first stage, and then the secondary units (pupils, families, teeth, animals, employees, children) are sampled at random from the selected primary units in the second stage.

A common mistake in research is to ignore the fact that the sampling scheme was a two-stage one, and to pretend that the secondary units were selected independently. The mistake in this case would be that the researcher overlooks the fact that the secondary units were not sampled independently from each other: having selected a primary unit (a school, for example) increases the chances of selection of secondary units (pupils, for example) from that primary unit. Stated otherwise: the multi-stage sampling design leads to dependent observations.

The multi-stage sampling design can be graphically depicted as in Figure 2.1. [Figure 2.1 Multi-stage sampling; legend: selected versus not selected units.] In this figure we see a population that consists of 10 subpopulations, each containing 10 micro-units. A sample of 25 is taken by randomly selecting 5 out of 10 subpopulations and - within these, again at random of course - 5 out of 10 micro-units.

Multi-stage samples are preferred in practice, because the costs of interviewing or testing persons are reduced enormously if these persons are geographically or organizationally grouped. It is cheaper to travel to 100 neighborhoods and interview ten persons per neighborhood on their political preferences than to travel to 1,000 neighborhoods and interview one person per neighborhood. In the next chapters we will see how we can make adjustments to deal with these dependencies.

2.2 Dependence as an interesting phenomenon

[...]

Questions that we seek to answer may be: do employees in multinationals earn more than employees in other firms? Or: is there a relation between the performance of pupils and the experience of their teacher? Or: is the sentence differential between black and white suspects different between judges, and if so, can we find characteristics of judges to which this sentence differential is related? In this case a variable is defined at the primary unit level (firms, teachers, judges) as well as on the secondary unit level (employees, pupils, cases). From here on we will refer to primary units as macro-level units (or macro-units for short) and to the secondary units as micro-level units (or micro-units for short). Moreover, for the time being, we will restrict ourselves to the two-level case, and thus to two-stage samples only. In Table 2.1 a summary of the terminology is given. Examples of macro-units and the micro-units nested within them are presented in Table 2.2.

Table 2.1 Summary of terms to describe units at either level in the two-level case.

macro-level units | micro-level units
primary units | secondary units
clusters | elementary units
level-two units | level-one units

Table 2.2 Some examples of units at the macro and micro level.

macro level | micro level
schools | teachers
classes | pupils
neighborhoods | families
firms | employees
jawbones | teeth
families | children
litters | animals
doctors | patients
subjects | measurements
interviewers | respondents
judges | suspects

Most of the examples presented in the table have been dealt with in the text already.
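To see the two-stage selection scheme of Figure 2.1 in concrete terms, a few lines of code suffice. The following minimal Python sketch (an illustration, with arbitrary names; the population sizes follow the figure) first samples 5 of the 10 subpopulations and then 5 of the 10 micro-units within each selected subpopulation.

```python
import random

random.seed(42)  # reproducible illustration

# A toy population: 10 subpopulations (macro-units), each with 10 micro-units.
population = {j: [f"unit_{j}_{i}" for i in range(10)] for j in range(10)}

# Stage 1: randomly select 5 of the 10 macro-units.
selected_macro = random.sample(sorted(population), 5)

# Stage 2: within each selected macro-unit, randomly select 5 micro-units.
two_stage_sample = {j: random.sample(population[j], 5) for j in selected_macro}

for j, units in two_stage_sample.items():
    print(f"macro-unit {j}: {units}")
```

The resulting 25 micro-units are not independent draws from the population: once a subpopulation is selected, five of its members enter the sample together. This is exactly the dependence discussed above.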
It is important to note that what is defined as a macro-unit and a micro-unit, respectively, depends on the theory at hand. Teachers are nested within schools, if we study organizational effects on teacher burn-out: then teachers are the micro-units and schools the macro-units. But when studying teacher effects on student achievement, teachers are the macro-units and students the micro-units. The same goes, mutatis mutandis, for neighborhoods and families (e.g., when studying the effects of housing conditions on marital problems), and for families and children (e.g., when studying effects of income on educational performance of siblings). In all these instances the dependency of the observations on the micro-units within the macro-units is of focal interest. If we stick to the example of schools and pupils then the dependency (e.g., in mathematics achievement of pupils within a school) may stem from:

1. pupils within a school sharing the same school environment;
2. pupils within a school sharing the same teachers;
3. pupils within a school affecting each other by direct communication or shared group norms;
4. pupils within a school coming from the same neighborhood.

The more the achievement levels of pupils within a school are alike (as compared to pupils from other schools), the more likely it is that causes for the achievement have to do with the organizational unit (in this case: the school). Absence of dependency in this case implies absence of institutional effects on individual performance.

A special kind of nesting is defined by longitudinal data, represented in Table 2.2 as 'measurements within subjects'. The measurement occasions here are the micro-units and the subjects the macro-units. The dependence of the different measurements for a given subject is of primary importance in longitudinal data, but the following section about relations between variables defined at either level is not directly intended for the nesting structure defined by longitudinal data. Because of the special nature of this nesting structure, a separate chapter (Chapter 12) is devoted to it.

2.3 Macro-level, micro-level, and cross-level relations

For the study of hierarchical, or multilevel, systems, having two distinct layers, Tacq (1986) distinguished between three kinds of propositions: about micro-units (e.g., 'employees have on average 4 effective working hours per day'; 'boys lag behind girls in reading comprehension'), about macro-units (e.g., 'schools have on average a budget of $20,000 to spend on resources'; 'in neighborhoods with bad housing conditions crime rates are above average'), or about macro-micro relations (e.g., 'if firms have a salary bonus system, employees will have increased productivity'; 'a child suffering from a broken family situation will affect classroom climate').

Multilevel statistical models are always needed if a multi-stage sampling design has been employed. The use of such a sampling design is quite obvious if we are interested in macro-micro relations, less obvious - but often necessary from a cost-effectiveness point of view - if micro-level propositions are our primary concern, and hardly obvious - but sometimes still applicable - if macro-level propositions are what we are focusing on. These three instances will be dealt with in the sequel of this chapter.
To facilitate comprehension, following Tacq (1986) we use figures with the following conventions: a dotted line indicates that there are two levels; below the line is the micro-level; above the line is the macro-level; macro-level variables are denoted by capitals; micro-level variables are denoted by lower case letters; arrows denote presumed causal relations.

Multilevel propositions

Multilevel propositions can be represented as in Figure 2.2. [Figure 2.2 The structure of a multilevel proposition.] In this example we are interested in the effect of the macro-level variable Z (e.g., teacher efficacy) on the micro-level variable y (e.g., pupil motivation), controlling for the micro-level variable x (e.g., pupil aptitude).

Micro-level propositions

Micro-level propositions are of the form indicated in Figure 2.3. [Figure 2.3 The structure of a micro-level proposition.] In this case the line indicates that there is a macro-level that is not referred to in the hypothesis that is put to the test, but that is used in the sampling design in the first stage. In assessing the strength of the relation between occupational status and income, for instance, respondents may have been selected for face-to-face interviews per zip-code area. This then may cause dependency (as a nuisance) in the data.

Macro-level propositions

Macro-level propositions are of the form of Figure 2.4. [Figure 2.4 The structure of a macro-level proposition.] The line separating the macro-level from the micro-level seems to be superfluous here. When investigating the relation between long-range strategic planning policy of firms and their profits, there is no multilevel situation, and a simple random sample may have been taken. When either or both variables are not directly observable, however, and have to be measured at the micro-level (e.g., organizational climate measured as the average satisfaction of employees), then a two-stage sample is needed nevertheless. This is the case a fortiori for variables defined as aggregates of micro-level variables (e.g., the crime rate in a neighborhood).

Macro-micro relations

The most common situation in social research is that macro-level variables are supposed to have a relation with micro-level variables. There are three obvious instances of macro-to-micro relations, all of which are typical examples of the multilevel situation. [Figure 2.5 The structure of macro-micro propositions.]

The first case is the macro-to-micro proposition. The more explicit the religious norms in social networks, for example, the more conservative the views that individuals have on contraception. The second proposition is a special case of this. It refers to the case where there is a relation between Z and y, given that the effect of x on y is taken into account. The example given may be modified to: the same relation, but now holding 'for individuals of a given educational level'. The last case in the figure is the macro-micro interaction, also known as the cross-level interaction: the relation between x and y is dependent on Z. Or, stated otherwise: the relation between Z and y is dependent on x. The effect of aptitude on achievement, for instance, may be small in case of ability grouping of pupils within classrooms but large in ungrouped classrooms.

Next to these three situations there is the so-called emergent, or micro-macro, proposition (Figure 2.6).
[Figure 2.6 The structure of a micro-macro proposition.]

In this case, a micro-level variable x affects a macro-level variable Z (student achievement may affect teacher's stress experience).

There are of course combinations of the various examples given. Figure 2.7 contains a causal chain that explains through which micro-variables there is an association between the macro-level variables W and Z (cf. Coleman, 1990). [Figure 2.7 A causal macro-micro-micro-macro chain.] An example of this chain: why do the qualities of a football coach affect his social prestige? The reason is that good coaches are capable of motivating their players, thus leading the players to good performance, thus to winning games, and this of course leads to more social prestige for the coach. Another instance of a complex multilevel proposition is the contextual effects proposition. An example: 'low socio-economic status pupils achieve less in classrooms with a low average aptitude'. This is also a cross-level interaction effect, but the macro-level variable, average aptitude in the classroom, now is an aggregate of a micro-level variable. In the next chapters the statistical tools to handle multilevel structures will be introduced for outcome variables defined at the micro-level.

3 Statistical Treatment of Clustered Data

Before proceeding in the next chapters to explain ways for the statistical modeling of data with a multilevel structure, we focus attention in this chapter on the question: what will happen if we ignore the multilevel structure of the data? Are there any instances where one may proceed with single-level statistical models although the data stem from a multi-stage sampling design? What kind of errors may occur when this is done? Next to this, we present some statistical methods for multilevel data that do not use the hierarchical linear model (which receives ample treatment in following chapters). First, we describe the intraclass correlation coefficient, a basic measure for the degree of dependency in clustered observations. Second, some simple statistics (mean, standard error of the mean, variance, correlation, reliability of aggregates) are treated for two-stage sampling designs. The relations are spelled out between within-group, between-group, and total regressions; and similarly for correlations. Finally, we mention some simple methods for combining evidence within groups into an overall test.

3.1 Aggregation

A common procedure in social research with two-level data is to aggregate the micro-level data to the macro-level. The simplest way to do this is to work with the averages for each macro-unit.

There is nothing wrong with aggregation in cases where the researcher is only interested in macro-level propositions, although it should be borne in mind that the reliability of an aggregated variable depends, among others, on the number of micro-level units in a macro-level unit (see later in this chapter), and thus will be larger for the larger macro-units than for the smaller ones. In cases where the researcher is interested in macro-micro or micro-level propositions, however, aggregation may result in gross errors.

The first potential error is the 'shift of meaning' (cf. Hüttner, 1981). A variable that is aggregated to the macro level refers to the macro-units, not directly to the micro-units. The firm average of a rating of employees on their working conditions, e.g., may be used as an index for 'organizational climate'.
This variable refers to the firm, not directly to the employees.

The second potential error with aggregation is the ecological fallacy (Robinson, 1950). A correlation between macro-level variables cannot be used to make assertions about micro-level relations. The percentage of black inhabitants in a neighborhood could be related to average political views in the neighborhood, e.g., the higher the percentage of blacks in a neighborhood, the higher might be the proportion of people with extreme right-wing political views. This, however, does not give us any clue about the micro-level relation between race and political conviction. (The shift of meaning plays a role here, too. The percentage of black inhabitants is a variable that means something for the neighborhood, and this meaning is distinct from the meaning of ethnicity as an individual-level variable.) The ecological and other related fallacies are extensively discussed by Alker (1969).

The third potential error is the neglect of the original data structure, especially when some kind of analysis of covariance is to be used. Suppose one is interested in assessing between-school differences in pupil achievement after correcting for intake differences, and that Figure 3.1 depicts the true situation. The figure depicts the situation for five groups, for each of which we have five observations; each group is indicated by its own plotting symbol, and the five group means by a filled dot. [Figure 3.1 Micro-level versus macro-level adjustments: (X,Y) values for five groups, with group averages.]

Now suppose the question is: do the differences between the groups on the variable Y, after adjusting for differences on the variable X, have a substantial size? The micro-level approach, which adjusts for the within-group regression of Y on X, will lead to the regression line that has a positive slope. In this picture, the micro-units from one group are all above this line, whereas the micro-units from another group are all under it. The micro-level regression approach thus will lead us to conclude that the five groups do differ, given that an adjustment for X has been made. Now suppose that we would aggregate the data, and regress the average Ȳ on the average X̄. The averages are the filled dots in the figure. This situation is represented in the graph by the regression line with a negative slope. The averages of all groups are almost perfectly on the regression line (the observed average Ȳ can almost perfectly be predicted from the observed average X̄), thus leading us to the conclusion that there are almost no differences between the five groups after adjusting for the average X̄. Although the situation depicted in the graph is an idealized example, it clearly shows that working with aggregate data 'is dangerous at best, and disastrous at worst' (Aitkin and Longford, 1986, p. 42). When analysing multilevel data, without aggregation, the problem described in this paragraph can be dealt with by distinguishing between the within-group and the between-group regressions. This is worked out in Sections 3.6, 4.5, and 9.2.1.
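The reversal depicted in Figure 3.1 is easy to reproduce by simulation. The following Python sketch (a constructed illustration; the parameter values are arbitrary) generates five groups whose means lie on a downward-sloping line while X and Y are positively related within each group, and then fits the within-group and the aggregated regressions.

```python
import numpy as np

rng = np.random.default_rng(1)

n_groups, n_per_group = 5, 5
x_parts, y_parts, group_ids = [], [], []
for j in range(n_groups):
    # Group means lie on a downward-sloping line: high mean X goes with low mean Y.
    mean_x, mean_y = j * 2.0, -j * 1.0
    # Within each group, X and Y are positively related (slope +1) around the group mean.
    dev_x = rng.normal(0, 1, n_per_group)
    x = mean_x + dev_x
    y = mean_y + 1.0 * dev_x + rng.normal(0, 0.2, n_per_group)
    x_parts.append(x)
    y_parts.append(y)
    group_ids.append(np.full(n_per_group, j))

x, y, g = np.concatenate(x_parts), np.concatenate(y_parts), np.concatenate(group_ids)

# Within-group regression: slope of Y on X after centering both around group means.
x_cent = x - np.array([x[g == j].mean() for j in g])
y_cent = y - np.array([y[g == j].mean() for j in g])
within_slope = np.polyfit(x_cent, y_cent, 1)[0]

# Aggregated (between-group) regression: slope of the group means.
gx = np.array([x[g == j].mean() for j in range(n_groups)])
gy = np.array([y[g == j].mean() for j in range(n_groups)])
between_slope = np.polyfit(gx, gy, 1)[0]

print(f"within-group slope:  {within_slope:+.2f}")   # close to +1
print(f"between-group slope: {between_slope:+.2f}")  # negative
```

The within-group slope comes out close to +1 while the slope of the regression on the group means is negative, reproducing the two regression lines of the figure.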
The last objection against aggregation is that it prevents examination of potential cross-level interaction effects of a specified micro-level variable with an as yet unspecified macro-level variable. Having aggregated the data to the macro-level one cannot examine relations like: is the sentence differential between black and white suspects different between judges, when allowance is made for differences in seriousness of crimes? Or, to give another example: is the effect of aptitude on achievement, present in the case of whole-class instruction, smaller or even absent in case of ability grouping of pupils within classrooms?

3.2 Disaggregation

Now suppose that we treat our data at the micro level. There are two situations:

1. we also have a measure of a variable at the macro level, next to the measures at the micro level;
2. we only have measures of micro-level variables.

In situation (1), disaggregation leads to 'the miraculous multiplication of the number of units'. To illustrate what is meant: suppose a researcher is interested in the question whether older judges give more lenient sentences than younger judges. A two-stage sample is taken: in the first stage ten judges are sampled, and in the second stage per judge ten trials are sampled (in total there are thus 10 × 10 = 100 trials). One might disaggregate the data to the level of the trials and estimate the relation between the experience of the judge and the length of the sentence, without taking into account that some trials involve the same judge. This is like pretending that there are 100 independent observations, whereas in actual fact there are only 10 independent observations (the 10 judges). This shows that disaggregation and treating the data as if they are independent implies that the sample size is dramatically exaggerated. For the study of between-group differences, disaggregation often leads to serious risks of committing type I errors (asserting on the basis of the observations that there is a difference between older and younger judges whereas in the population of judges there is no such relation). On the other hand, for studying within-group differences, disaggregation often leads to unnecessarily conservative tests (i.e., too low type I error probabilities); this is elaborated in Moerbeek et al. (1997).

If only measures are taken at the micro level, analysing the data at the micro level is a correct way to proceed, as long as one takes into account that observations within a macro-unit may be correlated. In sampling theory, this phenomenon is known as the design effect for two-stage samples. If one wants to estimate the average management capability of young managers, while in the first stage a limited number of organizations (say 10) are selected and within each organization five managers are sampled, one runs the risk of making an error if (as is usually the case) there are systematic differences between organizations. In general, two-stage sampling leads to the situation that the 'effective' sample size that should be used to calculate standard errors is less than the total number of cases, the latter being given here by the 50 managers. The formula will be presented in one of the next sections.

Starting with Robinson's (1950) paper about the ecological fallacy, many papers have been written about the possibilities and dangers of cross-level inference, i.e., methods to conclude something about relations between micro-units on the basis of relations between data at the aggregate level, or to conclude something about relations between macro-units on the basis of relations between disaggregated data.
A concise discussion and many references are given by Pedhazur (1982, Chapter 13) and by Aitkin and Longford (1986). Our conclusion is that if the macro-units have any meaningful relation with the phenomenon under study, analysing only aggregated or only disaggregated data is apt to lead to misleading and erroneous conclusions. A multilevel approach, in which within-group and between-group relations are combined, is more difficult but much more productive. This approach requires, however, specifying assumptions about the way in which macro- and micro-effects are put together. The present chapter presents some multilevel procedures that are based on only a minimum of such assumptions (e.g., the additive model of equation (3.1)). The further chapters of this book are based on a more elaborate model, the so-called hierarchical linear model, since about 1990 the most widely accepted basis for multilevel analysis.

3.3 The intraclass correlation

The degree of resemblance between micro-units belonging to the same macro-unit can be expressed by the intraclass correlation coefficient. The term 'class' is conventionally used here and refers to the macro-units in the classification system under consideration. There are, however, several definitions of this coefficient, depending on the assumptions about the sampling design. In this section we assume a two-stage sampling design, and infinite populations at either level. The macro-units will also be referred to as groups.

A relevant model here is the random effects ANOVA model.¹ Indicating by Y_ij the outcome value observed for micro-unit i within macro-unit j, this model can be expressed as

\[ Y_{ij} = \mu + U_j + R_{ij} , \tag{3.1} \]

where μ is the population grand mean, U_j is the specific effect of macro-unit j, and R_ij is the residual effect for micro-unit i within this macro-unit. In other words, macro-unit j has the 'true mean' μ + U_j, and each measurement of a micro-unit within this macro-unit deviates from this true mean by some value, called R_ij. Units differ randomly from one another, which is reflected by the fact that U_j is a random variable and gives the name 'random effects model'. Some units have a high true mean, corresponding to a high value of U_j, others have a true mean close to average, still others a low true mean. It is assumed that all variables are independent, the group effects U_j having population mean 0 and population variance τ² (the population between-group variance), and the residuals having mean 0 and variance σ² (the population within-group variance). For example, if micro-units are pupils and macro-units are schools, then the within-group variance is the variance within the schools about their true means, while the between-group variance is the variance between the school true means. The total variance of Y_ij is then equal to the sum of these two variances,

var(Y_ij) = τ² + σ².

The number of micro-units within the j'th macro-unit is denoted by n_j. The number of macro-units is N, and the total sample size is M = Σ_j n_j. In this situation, the intraclass correlation coefficient ρ_I can be defined as

\[ \rho_I = \frac{\text{population variance between macro-units}}{\text{total variance}} = \frac{\tau^2}{\tau^2 + \sigma^2} . \tag{3.2} \]

It is the proportion of variance that is accounted for by the group level. This parameter is called a correlation coefficient, because it is equal to the correlation between values of two randomly drawn micro-units in the same, randomly drawn, macro-unit.

¹ This model is also known in the statistical literature as the one-way random effects ANOVA model and as Eisenhart's Type II ANOVA model. In multilevel modeling it is known as the empty model, which is treated further in Section 4.3.
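Definition (3.2) and its interpretation as a correlation can be checked numerically. In the Python sketch below (an added illustration; the values τ² = 0.25 and σ² = 0.75 are arbitrary choices giving ρ_I = 0.25), two micro-units are drawn from each of many macro-units according to model (3.1), and the correlation between the two is compared with τ²/(τ² + σ²).

```python
import numpy as np

rng = np.random.default_rng(7)

mu, tau, sigma = 0.0, 0.5, np.sqrt(0.75)   # tau^2 = 0.25, sigma^2 = 0.75
rho_I = tau**2 / (tau**2 + sigma**2)        # = 0.25 by (3.2)

n_groups = 100_000
U = rng.normal(0, tau, n_groups)            # group effects U_j
# Two micro-units per group: Y_1j and Y_2j share the same U_j.
Y1 = mu + U + rng.normal(0, sigma, n_groups)
Y2 = mu + U + rng.normal(0, sigma, n_groups)

print(f"theoretical rho_I:      {rho_I:.3f}")
print(f"empirical corr(Y1, Y2): {np.corrcoef(Y1, Y2)[0, 1]:.3f}")
```

The empirical correlation between the two micro-units from the same macro-unit is close to 0.25, as the definition asserts.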
It is important to note that the population variance between macro-units is not directly reflected by the observed variance between the means of the macro-units (the observed between-macro-units variance). The reason is that in a two-stage sample, variation between micro-units will also show up as extra observed variance between macro-units. It is indicated below how the observed variance between cluster means must be adjusted to yield a good estimator for the population variance between macro-units.

Example 3.1 Random data.

Suppose we have a series of 100 observations as in the random digits Table 3.1.

Table 3.1 Data grouped into macro-units (random digits from Glass and Stanley, 1970, p. 511). [Rows are macro-units j = 1, ..., 10; each row lists the scores Y_ij of its 10 micro-units (random digits) and their average Ȳ_j. The individual values are not legible in this scan.]

The core part of the table contains the random digits. Now suppose that each row in the table is a macro-unit, so that for each macro-unit we have observations on 10 micro-units. The averages of the scores for each macro-unit are in the last column. There seem to be large differences between the randomly constructed macro-units, if we look at the variance in the macro-unit averages (which is 105.7). The total observed variance between the 100 micro-units is 814.0. Suppose the macro-units were schools, the micro-units pupils, and the random digits test scores. According to these two observed variances we might conclude that the schools differ considerably with respect to their average test scores. We know in this case, however, that in 'reality' the macro-units differ only by chance.

The following subsections show how the intraclass correlation can be estimated and tested. For a review of various inference procedures for the intraclass correlation we refer to Donner (1986). An extensive overview of many methods for estimating and testing the within-group and between-group variances is given by Searle, Casella, and McCulloch (1992).

3.3.1 Within-group and between-group variance

We continue referring to the macro-units as groups. To disentangle the information contained in the data about the population between-group variance and the population within-group variance, we consider the observed variance between groups and the observed variance within groups. These are defined in the following way. The mean of macro-unit j is denoted

\[ \bar{Y}_{\cdot j} = \frac{1}{n_j} \sum_{i=1}^{n_j} Y_{ij} . \]

The observed variance within group j is given by

\[ S_j^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} \left( Y_{ij} - \bar{Y}_{\cdot j} \right)^2 . \]

This number will vary from group to group. To have one parameter that expresses the within-group variability for all groups jointly, one uses the observed within-group variance, or pooled within-group variance. This is a weighted average of the variances within the various macro-units, defined as

\[ S_{\text{within}}^2 = \frac{1}{M - N} \sum_{j=1}^{N} (n_j - 1) S_j^2 = \frac{1}{M - N} \sum_{j=1}^{N} \sum_{i=1}^{n_j} \left( Y_{ij} - \bar{Y}_{\cdot j} \right)^2 . \tag{3.3} \]

If model (3.1) holds, the expected value of the observed within-group variance is exactly equal to the population within-group variance:

\[ \mathcal{E}\, S_{\text{within}}^2 = \sigma^2 . \tag{3.4} \]

The situation for the between-group variance is a bit more complicated.
For equal group sizes n_j, the observed between-group variance is defined as the variance between the group means,

\[ S_{\text{between}}^2 = \frac{1}{N - 1} \sum_{j=1}^{N} \left( \bar{Y}_{\cdot j} - \bar{Y}_{\cdot \cdot} \right)^2 . \tag{3.5} \]

For unequal group sizes, the contributions of the various groups need to be weighted. The following formula uses weights that are useful for estimating the population between-group variance:

\[ S_{\text{between}}^2 = \frac{1}{\tilde{n}\,(N - 1)} \sum_{j=1}^{N} n_j \left( \bar{Y}_{\cdot j} - \bar{Y}_{\cdot \cdot} \right)^2 . \tag{3.6} \]

In this formula, ñ is defined by

\[ \tilde{n} = \frac{1}{N - 1} \left\{ M - \frac{\sum_j n_j^2}{M} \right\} = \bar{n} - \frac{s^2(n_j)}{N \bar{n}} , \tag{3.7} \]

where n̄ = M/N is the mean sample size and

\[ s^2(n_j) = \frac{1}{N - 1} \sum_{j=1}^{N} \left( n_j - \bar{n} \right)^2 \]

is the variance of the sample sizes. If all n_j have the same value, then ñ also has this value, and in this case S²_between is just the variance of the group means, given by (3.5).

It can be shown that the total observed variance is a combination of the within-group and the between-group variances, expressed as follows:

\[ \text{observed total variance} = \frac{1}{M - 1} \sum_{j=1}^{N} \sum_{i=1}^{n_j} \left( Y_{ij} - \bar{Y}_{\cdot \cdot} \right)^2 = \frac{M - N}{M - 1}\, S_{\text{within}}^2 + \frac{\tilde{n}(N - 1)}{M - 1}\, S_{\text{between}}^2 . \tag{3.8} \]

The complications with respect to the between-group variance arise from the fact that the level-one residuals R_ij also contribute, although to a minor extent, to the observed between-group variance. Statistical theory tells us that

expected observed variance between = true variance between + expected sampling error variance.

More specifically, the formula (Hays (1988) for the case with constant n_j, and Searle, Casella, and McCulloch (1992, Section 3.6) for the general case) is

\[ \mathcal{E}\, S_{\text{between}}^2 = \tau^2 + \frac{\sigma^2}{\tilde{n}} , \tag{3.9} \]

which holds provided that model (3.1) is valid. The second term in this formula becomes small when ñ becomes large. Thus, for large group sizes, the expected observed between-group variance is practically equal to the true between-group variance. For small group sizes, however, it tends to be larger than the true between-group variance, due to the random differences that also exist between the group means.

In practice, we do not know the population values of the between-group and within-group variances; these have to be estimated from the data. One way of estimating these parameters is based on formulae (3.4) and (3.9). From the first it follows that the population within-group variance, σ², can be estimated unbiasedly by the observed within-group variance:

\[ \hat{\sigma}^2 = S_{\text{within}}^2 . \tag{3.10} \]

From the combination of the last two formulae it follows that the population between-group variance, τ², can be estimated unbiasedly by taking the observed between-group variance and subtracting the contribution that the within-group variance makes, on average, according to (3.9), to the observed between-group variance:

\[ \hat{\tau}^2 = S_{\text{between}}^2 - \frac{S_{\text{within}}^2}{\tilde{n}} . \tag{3.11} \]

(Another expression is given in (3.14).) This expression can take negative values. This happens when the difference between group means is less than would be expected on the basis of the within-group variability, even if the true between-group variance τ² would be 0. In such a case, it is natural to estimate τ² as being 0.

It can be concluded that the split between observed within-group variance and observed between-group variance does not correspond precisely to the split between the within-group and between-group variances in the population: the observed between-group variance reflects the population between-group variance plus a bit of the population within-group variance. The intraclass correlation is estimated according to formula (3.2) by

\[ \hat{\rho}_I = \frac{\hat{\tau}^2}{\hat{\tau}^2 + \hat{\sigma}^2} . \tag{3.12} \]

(Formula (3.15) gives another, equivalent, expression.) The standard error of this estimator in the case where all group sizes are constant, n_j = n, is given by

\[ \text{S.E.}(\hat{\rho}_I) = (1 - \rho_I)\left( 1 + (n - 1)\rho_I \right) \sqrt{\frac{2}{n(n-1)(N-1)}} . \tag{3.13} \]

This formula was given by Donner (1986, equation (6.1)), who also gives the (quite complicated) formula for the standard error for the case of variable group sizes.

The estimators given above are so-called analysis of variance (ANOVA) estimators.
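Formulae (3.3), (3.6), (3.7), and (3.10)-(3.12) translate directly into code. The following Python function (our sketch; the book itself leaves such computations to standard software) computes the ANOVA estimates from a list of group arrays and allows unequal group sizes.

```python
import numpy as np

def anova_estimates(groups):
    """ANOVA estimators for the empty model (3.1); `groups` is a list of 1-d arrays."""
    N = len(groups)
    n = np.array([len(g) for g in groups])
    M = n.sum()
    means = np.array([g.mean() for g in groups])
    grand_mean = np.concatenate(groups).mean()

    # Pooled observed within-group variance, formula (3.3).
    s2_within = sum(((g - m) ** 2).sum() for g, m in zip(groups, means)) / (M - N)

    # n-tilde, formula (3.7), and the weighted between-group variance, formula (3.6).
    n_tilde = (M - (n ** 2).sum() / M) / (N - 1)
    s2_between = (n * (means - grand_mean) ** 2).sum() / (n_tilde * (N - 1))

    sigma2_hat = s2_within                                   # (3.10)
    tau2_hat = max(s2_between - s2_within / n_tilde, 0.0)    # (3.11), truncated at 0
    rho_hat = tau2_hat / (tau2_hat + sigma2_hat)             # (3.12)
    return sigma2_hat, tau2_hat, rho_hat

# Example: 10 groups of 10 uniform random 'test scores', mimicking Example 3.1.
rng = np.random.default_rng(3)
groups = [rng.integers(0, 100, 10).astype(float) for _ in range(10)]
print(anova_estimates(groups))
```

For equal group sizes, ñ reduces to n and (3.6) reduces to the plain variance of the group means (3.5), which the function reproduces.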
These estimators have the advantage that they can be represented by explicit formulae. Other much used estimators are those produced by the maximum likelihood (ML) and residual maximum likelihood (REML) methods (cf. Section 4.6). For equal group sizes, the ANOVA estimators are the same as the REML estimators (Searle, Casella, and McCulloch, 1992). For unequal group sizes, the ML and REML estimators are slightly more efficient than the ANOVA estimators. Multilevel software can be used to calculate the ML and REML estimates.

Example 3.2 Within- and between-group variability for random data.

For the random digits table of the earlier example the observed between variance is S²_between = 105.7. The observed variance within the macro-units can be computed from formula (3.8). The observed total variance is known to be 814.0 and the observed between variance is given by 105.7. Solving (3.8) for the observed within variance yields S²_within = (99/90) × (814.0 − (10/11) × 105.7) = 789.7. The estimated true variance within the macro-units then also is σ̂² = 789.7. The estimate for the true between-macro-units variance is computed from (3.11) as τ̂² = 105.7 − (789.7/10) = 26.7. Finally, the estimate of the intraclass correlation is ρ̂_I = 26.7/(789.7 + 26.7) = 0.03. Its standard error, computed from (3.13), is 0.06.

3.3.2 Testing for group differences

The intraclass correlation as defined by (3.2) can be zero or positive. A statistical test can be performed to investigate if a positive value for this coefficient could be attributed to chance. If it may be assumed that the within-group deviations R_ij are normally distributed, one can use an exact test for the hypothesis that the intraclass correlation is 0, which is the same as the null hypothesis that there are no group differences, or the true between-group variance is 0. This is just the F-test for a group effect in the one-way analysis of variance (ANOVA), which can be found in any textbook on ANOVA. The test statistic can be written as

\[ F = \frac{\tilde{n}\, S_{\text{between}}^2}{S_{\text{within}}^2} , \]

and it has an F distribution with N − 1 and M − N degrees of freedom if the null hypothesis holds.

Example 3.3 The F-test for the random data set.

For the data of Table 3.1, F = (10 × 105.7)/789.7 = 1.34 with 9 and 90 degrees of freedom. This value is far from significant (p > 0.10). Thus, there is no evidence of true between-group differences.

Statistical computer packages usually give the F statistic and the within-group variance, S²_within. From this output, the estimated population between-group variance can be calculated by

\[ \hat{\tau}^2 = S_{\text{within}}^2\, \frac{F - 1}{\tilde{n}} \tag{3.14} \]

and the estimated intraclass correlation coefficient by

\[ \hat{\rho}_I = \frac{F - 1}{F + \tilde{n} - 1} , \tag{3.15} \]

where ñ is given by (3.7). If F < 1, it is natural to replace both of these expressions by 0. These formulae show that a high value for the F statistic will lead to large estimates for the between-group variance as well as the intraclass correlation, but that the group sizes, as expressed by ñ, moderate the relation between the test statistic and the parameter estimates.

If there are covariates, it often is relevant to test whether there are group differences in addition to those accounted for by the effect of the covariates. This is achieved by the usual F-test for the group effect in an analysis of covariance (ANCOVA). Such a test is relevant because it is possible that the ANOVA F-test does not demonstrate any group effects, but that such effects do emerge when controlling for the covariates (or vice versa). Another check on whether the groups make a difference can be carried out by testing the group-by-covariate interaction effect. These tests can be found in textbooks on ANOVA and ANCOVA, and they are contained in the well-known general purpose statistical computer programs.
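In code, the F-test of this section can be obtained from any one-way ANOVA routine or computed by hand from S²_between and S²_within. A short Python sketch (an added illustration using scipy's one-way ANOVA; the simulated 'scores' mimic the random data of Examples 3.1 and 3.3):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Ten macro-units with ten random 'scores' each, as in Example 3.3.
groups = [rng.integers(0, 100, 10).astype(float) for _ in range(10)]

# scipy computes the one-way ANOVA F statistic, with N-1 and M-N degrees of freedom.
F, p = stats.f_oneway(*groups)
print(f"F = {F:.2f}, p = {p:.2f}")

# Equivalent by hand for equal group sizes n_j = n, here n = 10 and N = 10:
n, N = 10, 10
means = np.array([g.mean() for g in groups])
s2_between = means.var(ddof=1)                                   # (3.5)
s2_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N * n - N)  # (3.3)
print(f"F by hand = {n * s2_between / s2_within:.2f}")
```

Since the 'scores' are pure noise, the F value will typically be unremarkable and the p-value large, in line with Example 3.3.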
If there are covariates, it often is relevant to test whether there are group differences in addition to those accounted for by the effects of the covariates. This is achieved by the usual F-test for the group effect in an analysis of covariance (ANCOVA). Such a test is relevant because it is possible that the ANOVA F-test does not demonstrate any group effects, but that such effects do emerge when controlling for the covariates (or vice versa). Another check on whether the groups make a difference can be carried out by testing the group-by-covariate interaction effect. These tests can be found in textbooks on ANOVA and ANCOVA, and they are contained in the well-known general purpose statistical computer programs.

So, to test whether a given nesting structure in a data set calls for multilevel analysis, one can use standard techniques from the analysis of variance. In addition to testing for the main group effect, it is also advisable to test for group-by-covariate interactions. If there is neither evidence for main effects nor for interaction effects involving the group structure, then the researcher may leave aside the nesting structure and analyse the data by unilevel methods such as ordinary least squares (OLS) regression analysis. This approach to testing for group differences can be taken whenever the number of groups is not too large for the computer program being used. If there are too many groups, however, the program will refuse to do the job. In such a case it will still be possible to carry out the tests for group differences that are treated in the following chapters, following the logic of the hierarchical linear model. This will require the use of statistical multilevel software.

3.4 Design effects in two-stage samples

In the design of empirical investigations, the determination of sample sizes is an important decision. For two-stage samples this is more complicated than for simple ('one-stage') random samples. An elaborate treatment of this question is given in Cochran (1977). This section gives a simple approach to the precision of estimating a population mean, indicating the basic role played by the intraclass correlation. We return to this question in Chapter 10.

Large samples are preferable to increase the precision of parameter estimates, i.e., to obtain tight confidence intervals around the parameter estimates. In a simple random sample, the standard error of the mean is related to the sample size by the formula

\text{standard error} = \frac{\text{standard deviation}}{\sqrt{\text{sample size}}} .    (3.16)

This formula can be used to indicate the required sample size (in a simple random sample) if a given standard error is desired.

When using two-stage samples, however, the clustering of the data should be taken into account when determining the sample size. Let us suppose that all group sizes are equal, n_j = n for all j. The total sample size then is Nn. The design effect is a number that indicates how much the sample size in the denominator of (3.16) is to be adjusted because of the sampling design used. It is the ratio of the variance obtained with the given sampling design to the variance obtained for a simple random sample from the same population, supposing that the total sample size is the same. A large design effect implies a relatively large variance, which is a disadvantage that may be offset by the cost reductions implied by the design. The design effect of a two-stage sample with equal group sizes is given by

\text{design effect} = 1 + (n-1)\,\rho_I .    (3.17)

This formula expresses that, from a purely statistical point of view, a two-stage sample becomes less attractive as ρ_I increases (clusters become more homogeneous) and as the group size n increases (the two-stage nature of the sampling design becomes stronger).
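Before turning to an example, note that the consequences of (3.17) are easy to compute; a small sketch (ours), using the numbers of the patient study discussed next:

    def design_effect(n, rho):
        """Design effect of a two-stage sample with equal group sizes, (3.17)."""
        return 1.0 + (n - 1) * rho

    def effective_sample_size(N, n, rho):
        """Equivalent simple-random-sample size; cf. formula (3.18) below."""
        return N * n / design_effect(n, rho)

    print(design_effect(5, 0.30))               # 2.2
    print(effective_sample_size(100, 5, 0.30))  # about 227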
Suppose, e.g., we were studying the satisfaction of patients with their doctors' treatments. Furthermore, let us assume that some doctors have more satisfied patients than others, leading to a ρ_I of 0.30. The researchers used a two-stage sample, since that is far cheaper than selecting patients simply at random. They first randomly selected 100 doctors, from each chosen doctor selected five patients at random, and then interviewed each of these. In this case the design effect is 1 + (5 − 1) × 0.30 = 2.20. When estimating the standard error of the mean, we no longer can treat the observations as independent from each other. The effective sample size, i.e., the equivalent total sample size that we should use in estimating the standard error, is equal to

N_{\text{effective}} = \frac{Nn}{\text{design effect}} ,    (3.18)

in which N is the number of selected macro-units. For our example we find N_effective = (100 × 5)/2.20 = 227. So the two-stage sample with a total of 500 patients here is equivalent to a simple random sample of 227 patients.

One can also derive the total sample size for a two-stage sampling design on the basis of a desired level of precision, assuming that ρ_I is known and fixing n because of budgetary or time-related considerations. The general rule is: the required sample size increases as ρ_I increases, and it increases with the number of micro-units one wishes to select per macro-unit. Using (3.17) and (3.18), it can be derived numerically from the formula

N_{\text{ts}} = N_{\text{srs}} + N_{\text{srs}}\,(n-1)\,\rho_I .

The quantity N_ts in this formula refers to the total desired sample size when using a two-stage sample, whereas N_srs refers to the desired sample size if one would have used a simple random sample. In practice, ρ_I is unknown. However, it often is possible to make an educated guess about it on the basis of earlier research. In Figure 3.2, N_ts is graphed as a function of n and ρ_I (0.1, 0.2, 0.4, and 0.8, respectively), taking N_srs = 100 as the desired sample size for an equally informative simple random sample.

[Figure 3.2 The total desired sample size in two-stage sampling, as a function of the group size n, for ρ_I = 0.1, 0.2, 0.4, and 0.8.]

3.5 Reliability of aggregated variables

Reliability, as conceived in psychological test theory (e.g., Lord and Novick, 1968) and in generalizability theory (e.g., Shavelson and Webb, 1991), is closely related to clustered data, although this may not be obvious at first sight. Classical psychological test theory considers a subject (an individual or other observational unit) with a given true score, of which imprecise, or unreliable, observations may be made. The observations can be considered to be nested within the subjects. If there is more than one observation per subject, the data are clustered. Whether there is only one observation or several, equation (3.1) is the model for this situation: the true score of subject j is μ + U_j, and the i'th observation on this subject is Y_ij, with associated measurement error R_ij. If several observations are taken, these can be aggregated to the mean value Ȳ_.j, which then is the measurement for the true score of subject j.

The same idea can be used when it is not an individual subject who is to be measured, but some collective entity: a school, a firm, or in general any macro-level unit such as those mentioned in Table 2.2. For example, when the school climate is measured on the basis of questions posed to pupils of the school, then Y_ij could refer to the answer by pupil i in school j to a given question, and the opinion of the pupils about this school would be measured by the mean value Ȳ_.j. In terms of psychological test theory, the micro-level units i are regarded as parallel items for measuring the macro-level unit j.
The reliability of a measurement is defined generally as

\text{reliability} = \frac{\text{variance of true scores}}{\text{variance of observed scores}} .    (3.19)

It can be proved that this is equal to the correlation between independent replications of measuring the same subject. (This means, in the mathematical model, that the same value U_j is measured, but with independent realizations of the random error R_ij.) The reliability is indicated by the symbol λ_j.²

For measurement on the basis of a single observation according to model (3.1), the reliability is just the intraclass correlation coefficient:

\lambda_j = \rho_I = \frac{\tau^2}{\tau^2 + \sigma^2} .

When several measurements are made for each macro-level unit, these constitute a cluster or group of measurements which are aggregated to the group mean Ȳ_.j. To apply the general definition of reliability to Ȳ_.j, note that the observed variance is the variance between the observed means Ȳ_.j, while the true variance is the variance between the true scores μ + U_j. Therefore the reliability of the aggregate is

\text{reliability of } \bar{Y}_{.j} = \frac{\text{variance between } \mu + U_j}{\text{variance between } \bar{Y}_{.j}} .    (3.20)

Example 3.4 Reliability for random data.
If, in our previous random digits example, the digits represented, e.g., the perceptions by teachers in schools of their working conditions, then the aggregated variable, an indicator for organizational climate, has an estimated reliability of 26.7/105.7 = 0.25. (The population value of this reliability is 0, however, as the data are random, so the true variance is nil.)

It can readily be demonstrated that the reliability of aggregated variables increases as the number of micro-units per macro-unit increases, since the true variance of the group mean (with group size n_j) is τ², while the expected observed variance of the group mean is τ² + σ²/n_j. Hence the reliability can be expressed as

\lambda_j = \frac{\tau^2}{\tau^2 + \sigma^2 / n_j} = \frac{n_j\, \rho_I}{1 + (n_j - 1)\,\rho_I} .    (3.21)

It is quite clear that if n_j is very large, then λ_j is almost 1. If n_j = 1, we are not able to distinguish between within- and between-group variance. Figure 3.3 presents a graph in which the reliability of an aggregate is depicted as a function of n_j (denoted n) and ρ_I (0.1 and 0.4, respectively).

[Figure 3.3 Reliability of aggregated variables, as a function of the group size n, for ρ_I = 0.1 and 0.4.]

² In the literature, the reliability of a measurement X is frequently denoted by the symbol ρ_XX', so that the reliability coefficient λ_j could also be denoted by ρ_ȲȲ'.
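Formula (3.21) is easily explored numerically; a one-function sketch (ours):

    def aggregate_reliability(n, rho):
        """Reliability of a group mean based on n micro-units, (3.21)."""
        return n * rho / (1.0 + (n - 1) * rho)

    # the reliability grows with the group size; e.g., for rho_I = 0.1:
    for n in (1, 5, 10, 25, 100):
        print(n, round(aggregate_reliability(n, 0.1), 2))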
3.6 Within- and between-group relations

We saw in Section 3.1 that regressions at the macro-level between aggregated variables X̄ and Ȳ can be completely different from the regressions between the micro-level variables X and Y. This section considers in more detail the interplay between macro-level and micro-level relations between two variables. First the focus is on the regression of Y on X, then on the correlation between X and Y.

The main point of this section is that within-group relations can, in principle, be completely different from between-group relations. This is natural, because the processes at work within groups may be different from the processes at work between groups (see Section 3.1). Total relations, i.e., relations at the micro-level when the clustering into macro-units is disregarded, are mostly a kind of average of the within-group and between-group relations. Therefore it is necessary to consider within- and between-group relations jointly whenever the clustering of micro-units in macro-units is meaningful for the phenomenon being studied.

3.6.1 Regressions

The linear regression of a 'dependent' variable Y on an 'explanatory' or 'independent' variable X is the linear function of X that yields the best³ prediction of Y. When the bivariate distribution of (X, Y) is known and the data structure has only a single level, the expression for this regression function is

Y = \beta_0 + \beta_1 X + R ,

where the regression coefficients are given by

\beta_0 = \mathcal{E}(Y) - \beta_1\, \mathcal{E}(X) , \qquad \beta_1 = \frac{\text{cov}(X, Y)}{\text{var}(X)} .

The constant term β₀ is called the intercept, while β₁ is called the regression coefficient. The term R is the residual or error component, and expresses the part of the dependent variable Y that cannot be approximated by a linear function of X. Recall from Section 1.2.2 that E(X) and E(Y) denote the population means (expected values) of X and Y, respectively. In a multilevel data structure, this principle can be applied in various ways, depending on which population of X and Y values is being considered.

Let us consider the artificial data set of Table 3.2. The first two columns in the table contain the identification numbers of the macro-unit (j) and the micro-unit (i). The other four columns contain the data. By X_ij is denoted the variable observed for micro-unit i in macro-unit j, and by X̄_.j the average of the X_ij values for group j. The analogous notation holds for the dependent variable Y.

Table 3.2 Artificial data on 5 macro-units, each with 2 micro-units.

    j  i  X_ij  X̄_.j  Y_ij  Ȳ_.j
    1  1   1     2     5     6
    1  2   3     2     7     6
    2  1   2     3     4     5
    2  2   4     3     6     5
    3  1   3     4     3     4
    3  2   5     4     5     4
    4  1   4     5     2     3
    4  2   6     5     4     3
    5  1   5     6     1     2
    5  2   7     6     3     2

One might be interested in the relation between Y_ij and X_ij. The linear regression line of Y_ij on X_ij at the micro-level for the total group of 10 observations is

Y_{ij} = 5.33 - 0.33\, X_{ij} + R .    (Total regression)

This is the disaggregated relation, since the nesting of micro-units in macro-units is not taken into account. The regression coefficient is −0.33.

The aggregated relation is the linear regression relationship at the macro-level of the group means Ȳ_.j on the group means X̄_.j. This regression line is

\bar{Y}_{.j} = 8.00 - 1.00\, \bar{X}_{.j} + R .    (Regression between group means)

The regression coefficient now is −1.00.

A third option is to describe the relation between Y_ij and X_ij within each single group. Assuming that the regression coefficient has the same value in each group, this is the same as the regression of the within-group Y-deviations (Y_ij − Ȳ_.j) on the X-deviations (X_ij − X̄_.j). This within-group regression line is given by

Y_{ij} = \bar{Y}_{.j} + 1.00\, (X_{ij} - \bar{X}_{.j}) + R ,    (Regression within groups)

with a regression coefficient of +1.00.

Finally, and that is how the artificial data set was constructed, Y_ij can be written as a function of the within-group and between-group relations between Y and X. This amounts to putting together the between-group and the within-group regression equations. The result is

Y_{ij} = 8.00 - 1.00\, \bar{X}_{.j} + 1.00\, (X_{ij} - \bar{X}_{.j}) + R
       = 8.00 + 1.00\, X_{ij} - 2.00\, \bar{X}_{.j} + R .    (3.22)  (Multilevel regression)

Figure 3.4 graphically depicts the total, within-group, and between-group relations between the variables.

[Figure 3.4 Within-group, between-group, and total relations between Y and X for the data of Table 3.2.]

The five parallel ascending lines represent the within-group relation between Y and X. The steep descending line represents the relation at the aggregate level (i.e., between the group means), whereas the almost horizontal descending line represents the total relationship, i.e., the micro-level relation between X and Y ignoring the hierarchical structure.

³ 'Best prediction' means here the prediction that has the smallest mean squared error: the so-called least squares criterion.
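The three regression coefficients can be verified directly from Table 3.2; a minimal sketch (ours, assuming numpy):

    import numpy as np

    X = np.array([1, 3, 2, 4, 3, 5, 4, 6, 5, 7], dtype=float)
    Y = np.array([5, 7, 4, 6, 3, 5, 2, 4, 1, 3], dtype=float)
    g = np.repeat(np.arange(5), 2)                  # the five groups of Table 3.2
    Xbar = np.array([X[g == j].mean() for j in range(5)])[g]   # group means, per case
    Ybar = np.array([Y[g == j].mean() for j in range(5)])[g]

    def slope(x, y):
        """Least-squares regression coefficient of y on x."""
        return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

    print(slope(X, Y))                 # total regression:    -0.33
    print(slope(Xbar, Ybar))           # between group means: -1.00 (equal group sizes)
    print(slope(X - Xbar, Y - Ybar))   # within groups:       +1.00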
The within-group regression coefficient is +1, whereas the between-group coefficient is −1. The total regression coefficient, −0.33, is in between these two. This illustrates that within-group and between-group relations can be completely different, and can even have opposite signs. The true relation between Y and X is revealed only when the within- and between-group relations are considered jointly, i.e., by the multilevel regression. In the multilevel regression, both the between-group and the within-group regression coefficients play a role. Thus there are many different ways to describe the data, of which one is the best, because it describes how the data were generated. In this artificial data set, the residual R is 0; real data have, of course, non-zero residuals.

A population model

The interplay of within-group and between-group relations can be better understood on the basis of a population model such as (3.1). Since this section is about two variables, X and Y, a bivariate version of the model is needed. In this model, group (macro-unit) j has specific main effects U_xj and U_yj for variables X and Y, and associated with individual (micro-unit) i are the within-group deviations R_xij and R_yij. The population means are denoted μ_x and μ_y, and it is assumed that the U's and the R's have population means 0. The U's on one hand and the R's on the other are independent. The formula for X and Y then reads

X_{ij} = \mu_x + U_{xj} + R_{xij} ,
Y_{ij} = \mu_y + U_{yj} + R_{yij} .    (3.23)

For the formulae that refer to relations between group means X̄_.j and Ȳ_.j, it is assumed that each group has the same size, denoted by n.

The correlation between the group effects is defined as

\rho_{\text{between}} = \rho(U_{xj}, U_{yj}) ,

while the correlation between the individual deviations is defined by

\rho_{\text{within}} = \rho(R_{xij}, R_{yij}) .

One of the two variables X and Y might have a stronger group nature than the other, so the intraclass correlation coefficients for X and Y may be different. These are denoted by ρ_Ix and ρ_Iy, respectively.

The within-group regression coefficient is the regression coefficient within each group of Y on X, assumed to be the same for each group. This coefficient is denoted by β_within and defined by the within-group regression equation

Y_{ij} = \mu_y + U_{yj} + \beta_{\text{within}}\, (X_{ij} - \mu_x - U_{xj}) + R .    (3.24)

This equation may be regarded as an analysis of covariance (ANCOVA) model for Y. Hence the within-group regression coefficient also is the effect of X in the ANCOVA approach to this multilevel data.

The within-group regression coefficient is also obtained when the Y-deviation values (Y_ij − Ȳ_.j) are regressed on the X-deviation values (X_ij − X̄_.j). In other words, it is also the regression coefficient obtained in the disaggregated analysis of the within-group deviation scores.

The population between-group regression coefficient is defined as the regression coefficient for the group effects U_y on U_x. This coefficient is denoted by β_between U and defined by the regression equation

U_{yj} = \beta_{\text{between } U}\, U_{xj} + R ,

where R now is the group-level residual.

The total regression coefficient of Y on X is the regression coefficient in the disaggregated analysis, i.e., when the data are treated as single-level data:

Y_{ij} = \mu_y + \beta_{\text{total}}\, (X_{ij} - \mu_x) + R .

The total regression coefficient can be expressed as a weighted mean of the within- and the between-group coefficients, where the weight for the between-group coefficient is just the intraclass correlation for X.
The formula is

\beta_{\text{total}} = \rho_{Ix}\, \beta_{\text{between } U} + (1 - \rho_{Ix})\, \beta_{\text{within}} .    (3.25)

This expression implies that if X is a pure macro-level variable (so that ρ_Ix = 1), the total regression coefficient is equal to the between-group coefficient. Conversely, if X is a pure micro-level variable, we have ρ_Ix = 0, and the total regression coefficient is just the within-group coefficient. Usually X will have both a within-group and a between-group component, and the total regression coefficient will be somewhere between the two level-specific regression coefficients.

Regressions between observed group means⁴

At the macro-level, the regression of the observed group means Ȳ_.j on X̄_.j is not the same as the regression of the 'true' group effects U_y on U_x. This is because the observed group averages, X̄_.j and Ȳ_.j, can be regarded as the 'true' group means to which some error, R̄_x.j and R̄_y.j, has been added.⁵ Therefore the regression coefficient for the observed group means is not exactly equal to the (population) between-group regression coefficient, but is given by

\beta_{\text{between group means}} = \lambda_{xj}\, \beta_{\text{between } U} + (1 - \lambda_{xj})\, \beta_{\text{within}} ,    (3.26)

where λ_xj is the reliability of the group means X̄_.j for measuring μ_x + U_xj, given by equation (3.20) applied to the X variable. If n is large, the reliability will be close to unity, and the regression coefficient for the group means will be close to the between-group regression coefficient at the population level.

Combining equations (3.25) and (3.26) leads to another expression for the total regression coefficient. This expression uses the correlation ratio η_x², which is defined as the ratio of the intraclass correlation coefficient to the reliability of the group mean,

\eta_x^2 = \frac{\rho_{Ix}}{\lambda_{xj}} = \frac{\tau_x^2 + \sigma_x^2 / n}{\tau_x^2 + \sigma_x^2} .    (3.27)

For large group sizes the reliability approaches unity, so the correlation ratio approaches the intraclass correlation.

In the data, the correlation ratio η_x² is the same as the proportion of variance in X_ij explained by the group means, and it can be computed as the ratio of the between-group sum of squares to the total sum of squares in an analysis of variance, i.e.,

\hat{\eta}_x^2 = \frac{\sum_{j} n_j\, (\bar{X}_{.j} - \bar{X}_{..})^2}{\sum_{j} \sum_{i} (X_{ij} - \bar{X}_{..})^2} .

The combined expression indicates how the total regression coefficient depends on the within-group regression coefficient and the regression coefficient between the group means:

\beta_{\text{total}} = \eta_x^2\, \beta_{\text{between group means}} + (1 - \eta_x^2)\, \beta_{\text{within}} .    (3.28)

Expression (3.28) was first given by Duncan et al. (1961) and can be found also, e.g., in Pedhazur (1982, p. 538). A multivariate version was given by Maddala (1971). To apply this equation to an unbalanced data set, the regression coefficient between group means must be calculated in a weighted regression, group j having weight n_j.

Example 3.5 Within- and between-group regressions for artificial data.
In the artificial example given above, the total sum of squares of X_ij as well as Y_ij is 30, and the between-group sums of squares for X and Y are 20. Hence the correlation ratios are η_x² = η_y² = 20/30 = 0.667. If we use this value and plug it into formula (3.28), we find

\beta_{\text{total}} = 0.667 \times (-1.00) + (1 - 0.667) \times 1.00 = -0.33 ,

which is indeed what we found earlier.

⁴ This subsection may be skipped by the cursory reader.
⁵ The same phenomenon lies behind formulae (3.9) and (3.11).
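The correlation ratio and the decomposition (3.28) can be checked numerically on the same data; a short sketch (ours, assuming numpy):

    import numpy as np

    X = np.array([1, 3, 2, 4, 3, 5, 4, 6, 5, 7], dtype=float)
    Y = np.array([5, 7, 4, 6, 3, 5, 2, 4, 1, 3], dtype=float)
    g = np.repeat(np.arange(5), 2)
    Xbar = np.array([X[g == j].mean() for j in range(5)])[g]
    Ybar = np.array([Y[g == j].mean() for j in range(5)])[g]

    def slope(x, y):
        return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

    # correlation ratio: between-group sum of squares over total sum of squares
    eta2 = ((Xbar - X.mean()) ** 2).sum() / ((X - X.mean()) ** 2).sum()
    b_total = eta2 * slope(Xbar, Ybar) + (1 - eta2) * slope(X - Xbar, Y - Ybar)
    print(eta2, b_total, slope(X, Y))   # 0.667, -0.33, -0.33: (3.28) is confirmed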
3.6.2 Correlations

The quite extreme nature of the artificial data set of Table 3.2 becomes apparent when we consider the correlations. The group means (X̄_.j, Ȳ_.j) lie on a decreasing straight line, so the observed between-group correlation, which is defined as the correlation between the group means, is R_between = −1. The within-group correlation is defined as the correlation within the groups, assuming that this correlation is the same within each group. This can be calculated as the correlation coefficient between the within-group deviation scores X̃_ij = (X_ij − X̄_.j) and Ỹ_ij = (Y_ij − Ȳ_.j). In this data set the deviation scores (X̃_ij, Ỹ_ij) are (−1, −1) for i = 1 and (+1, +1) for i = 2, so the within-group correlation here is R_within = +1. Thus we see that the within-group as well as the between-group correlations are perfect, but of opposite signs. The disaggregated correlation, i.e., the correlation computed without taking the nesting structure into account, is R_total = −0.33. (This is the same as the value for the regression coefficient in the total (disaggregated) regression equation, because X and Y have the same variance.)

The population model again

Recall that in the population model mentioned above, the correlation coefficient between the group effects U_x and U_y was defined as ρ_between, and the correlation between the individual deviations R_x and R_y was defined as ρ_within. The intraclass correlation coefficients for X and Y were denoted by ρ_Ix and ρ_Iy. How do these correlations between unobservable variables relate to correlations between observables?

The population within-group correlation is also the correlation between the within-group deviation scores (X̃_ij, Ỹ_ij):

\rho(\tilde{X}_{ij}, \tilde{Y}_{ij}) = \rho_{\text{within}} .    (3.29)

For the between-group coefficient the relation is, as always, a bit more complicated. The correlation coefficient between the group means is equal to

\rho(\bar{X}_{.j}, \bar{Y}_{.j}) = \sqrt{\lambda_{xj}\, \lambda_{yj}}\; \rho_{\text{between}} + \sqrt{(1 - \lambda_{xj})(1 - \lambda_{yj})}\; \rho_{\text{within}} ,    (3.30)

where λ_xj and λ_yj are the reliability coefficients of the group means (see equation (3.20)). For large group sizes, the reliabilities will be close to 1 (provided the intraclass correlations are larger than 0), so that the correlation between the group means will then be close to ρ_between. This formula shows that the correlation between group means is higher than the total correlation, i.e., aggregation will increase correlation, only if the between-group correlation coefficient is larger than the within-group correlation coefficient. Therefore, the reason that correlations between group means are often higher than correlations between individuals is not a mathematical consequence of aggregation, but a consequence of the processes at the group level (which determine the value of ρ_between) being different from the processes at the individual level (which determine the value of ρ_within).

The total correlation (i.e., the correlation in the disaggregated analysis) is a combination of the within-group and the between-group correlation coefficients. The combination depends on the intraclass correlations, as shown by the formula

\rho(X_{ij}, Y_{ij}) = \sqrt{\rho_{Ix}\, \rho_{Iy}}\; \rho_{\text{between}} + \sqrt{(1 - \rho_{Ix})(1 - \rho_{Iy})}\; \rho_{\text{within}} .    (3.31)

If the intraclass correlations are low, then X and Y have primarily the nature of level-one variables, and the total correlation will be close to the within-group correlation; on the other hand, if the intraclass correlations are close to 1, then X and Y have almost the nature of level-two variables, and the total correlation is close to the between-group correlation. If the intraclass correlations of X and Y are equal and denoted by ρ_I, then (3.31) can be formulated more simply as

\rho(X_{ij}, Y_{ij}) = \rho_I\, \rho_{\text{between}} + (1 - \rho_I)\, \rho_{\text{within}} .

In this case the weights ρ_I and (1 − ρ_I) add up to 1, and the total correlation coefficient is necessarily between the within-group and the between-group correlation coefficients.
In general, however, this is not always true, because the sum of the weights in (3.31) is smaller than 1 if the intraclass correlations for X and Y are different. For example, if one of the intraclass correlations is close to 0 and the other is close to 1, then one variable is mainly a level-one variable and the other mainly a level-two variable. Formula (3.31) then implies that the total correlation coefficient is close to 0, no matter how large the within-group and the between-group correlations. This is rather obvious, since a level-one variable with hardly any between-group variability cannot be substantially correlated with a variable with hardly any within-group variability.

Correlations between observed group means⁶

Analogous to the regression coefficients, also for the correlation coefficients we can combine the equations to see how the total correlation depends on the within-group correlation and the correlation between the group means. This yields

\rho(X_{ij}, Y_{ij}) = \eta_x\, \eta_y\; \rho(\bar{X}_{.j}, \bar{Y}_{.j}) + \sqrt{(1 - \eta_x^2)(1 - \eta_y^2)}\; \rho_{\text{within}} .    (3.32)

This expression was given by Knapp (1977) and can also be found, e.g., in Pedhazur (1982, p. 536). When it is applied to an unbalanced data set, the correlation between the group means should be calculated with weights n_j.

It may be noted that many texts do not make the explicit distinction between population and data. If the population and the data are equated, then the reliabilities are unity, the correlation ratios are the same as the intraclass correlations, and the population between-group correlation is equal to the correlation between the group means. The equation for the total correlation then becomes

R_{\text{total}} = \eta_x\, \eta_y\, R_{\text{between}} + \sqrt{(1 - \eta_x^2)(1 - \eta_y^2)}\; R_{\text{within}} .    (3.33)

When parameter estimation is being considered, however, confusion may be caused by neglecting this distinction.

Example 3.6 Within- and between-group correlations for artificial data.
The correlation ratios in the artificial data example are η_x² = η_y² = 0.667, and we also saw above that R_within = +1 and R_between = −1. Filling in these numbers in formula (3.33) yields

R_{\text{total}} = \sqrt{0.667 \times 0.667} \times (-1.00) + \sqrt{(1 - 0.667)^2} \times 1.00 = -0.33 ,

which indeed is the value found earlier for the total correlation.

3.6.3 Estimation of within- and between-group correlations

There are several ways of obtaining estimates for the correlation parameters treated in this section.

A quick method is based on the intraclass correlations, estimated as in Section 3.3.1 or from the output of a multilevel computer program, and on the observed within-group and total correlations. The observed within-group correlation is just the ordinary correlation coefficient between the within-group deviations (X_ij − X̄_.j) and (Y_ij − Ȳ_.j), and the total correlation is the ordinary correlation coefficient between X and Y in the whole data set. The quick method then is based on (3.29) and (3.31). This leads to the estimates

\hat{\rho}_{\text{within}} = R_{\text{within}} ,    (3.34)

\hat{\rho}_{\text{between}} = \frac{R_{\text{total}} - \sqrt{(1 - \hat{\rho}_{Ix})(1 - \hat{\rho}_{Iy})}\; R_{\text{within}}}{\sqrt{\hat{\rho}_{Ix}\, \hat{\rho}_{Iy}}} .    (3.35)

This is not the statistically most efficient method, but it is straightforward and leads to good results if sample sizes are not too small.

⁶ The remainder of Section 3.6 may also be skipped by the cursory reader.
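A sketch of the quick method (ours; the total correlation 0.694 used in the illustration is not printed in Example 3.7 below but is back-computed from the estimates reported there):

    import math

    def quick_estimates(r_total, r_within, rho_ix, rho_iy):
        """Quick estimates (3.34)-(3.35) of the within- and between-group
        correlations from observed correlations and intraclass correlations."""
        rho_within = r_within                                      # (3.34)
        rho_between = ((r_total
                        - math.sqrt((1 - rho_ix) * (1 - rho_iy)) * r_within)
                       / math.sqrt(rho_ix * rho_iy))               # (3.35)
        return rho_within, rho_between

    print(quick_estimates(0.694, 0.626, 0.264, 0.208))   # about (0.626, 0.92)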
The ANOVA method (Searle, 1956) goes via the variances and covariances, based on the definition

\rho(X, Y) = \frac{\text{cov}(X, Y)}{\sqrt{\text{var}(X)\, \text{var}(Y)}} .

Estimating the within- and between-group variances was discussed in Section 3.3.1. The within- and between-group covariances between X and Y can be estimated by formulae analogous to (3.4), (3.6), (3.10), and (3.11), replacing the squares (Y_ij − Ȳ_.j)² and (Ȳ_.j − Ȳ_..)² by the cross-products (X_ij − X̄_.j)(Y_ij − Ȳ_.j) and (X̄_.j − X̄_..)(Ȳ_.j − Ȳ_..). It is shown in Searle, Casella, and McCulloch (1992) how these calculations can be replaced by a calculation involving only sums of squares.

Finally, the maximum likelihood (ML) and residual maximum likelihood (REML) methods can be used. These are the most often used estimation methods (cf. Section 4.6) and are implemented in multilevel software. Chapter 13 describes multivariate multilevel models; the correlation coefficient between two variables refers to the simplest multivariate situation, viz. bivariate data. Formula (13.5) represents the model which allows the estimation of within-group and between-group correlations.

Example 3.7 Within- and between-group correlations for school test data.
In a data set to be described in Example 4.1, with M = 2287 pupils in N = 131 elementary schools, we consider the correlation between the pupils' scores on an arithmetic test (X) and on a language test (Y). Within-group correlations reflect individual-level processes that make a pupil's language and arithmetic capacities go together, whereas between-group correlations are influenced by school-level processes, such as school policies (e.g., selective admission) and the influence of neighborhood and school composition, which affect the schools' average scores on both tests. The within-school and between-school correlations therefore need not be the same. The observed total correlation between the two test scores is 0.69.

The ANOVA estimates are calculated along the lines indicated in Section 3.3.1. The estimated within-group population variances and covariance (cf. (3.10)) are σ̂²_x = S²_within,x = 32.233, σ̂²_y = S²_within,y = 64.319, and σ̂_xy = S_within,xy = 28.516. The observed between-group variances and covariance are S²_between,x = 13.41, S²_between,y = 20.580, and S_between,xy = 14.558. For this data set, ñ = 17.435. According to (3.11), the estimated population between-group variances and covariance are τ̂²_x = 11.564, τ̂²_y = 16.890, and τ̂_xy = 12.92.

From these estimated variances, the intraclass correlations are computed as ρ̂_Ix = 11.564/(11.564 + 32.233) = 0.264 and ρ̂_Iy = 16.890/(16.890 + 64.319) = 0.208.

The 'quick' method uses the observed within and total correlations and the intraclass correlations. The resulting estimates are ρ̂_within = 0.626 from (3.34) and ρ̂_between = 0.922 from (3.35).

The ANOVA estimates for the within- and between-group correlations use the estimated within- and between-group population variances and covariance. The results are ρ̂_within = 28.516/√(32.233 × 64.319) = 0.626 and ρ̂_between = 12.92/√(11.564 × 16.890) = 0.924.

The ML estimates are obtained from multilevel software. The ML estimates of the within-group and between-group covariances are σ̂_xy = 28.662 and τ̂_xy = 14.54, which, together with the corresponding ML variance estimates, lead to the estimated correlations ρ̂_within = 0.621 and ρ̂_between = 0.938.

It can be concluded that, for this large data set, the three methods all yield practically the same results. The within-school (pupil-level) correlation, 0.62, is substantial: pupils' language and arithmetic capacities are closely correlated. The between-school correlation, 0.92 to 0.94, is very high. This demonstrates that the school policies, the teaching quality, and the processes that determine the composition of the school population have practically the same effect on the pupils' language as on their arithmetic performance. Note that the observed between-school correlation, 0.88, is a trifle less high because of the attenuation caused by unreliability that follows from (3.30).
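The ANOVA estimates of Example 3.7 can be reproduced for any pair of level-one variables by extending the variance estimators of Section 3.3.1 to covariances; a sketch (ours, assuming numpy):

    import numpy as np

    def anova_correlations(x, y, g):
        """ANOVA estimates of rho_within and rho_between; squares are
        replaced by cross-products, cf. Section 3.6.3."""
        x, y, g = np.asarray(x, float), np.asarray(y, float), np.asarray(g)
        labels = np.unique(g)
        N, M = len(labels), len(x)
        n = np.array([(g == j).sum() for j in labels])
        nbar = M / N
        n_tilde = nbar - ((n - nbar) ** 2).sum() / ((N - 1) * N * nbar)

        def components(u, v):
            ub = np.array([u[g == j].mean() for j in labels])
            vb = np.array([v[g == j].mean() for j in labels])
            s_w = sum(((u[g == j] - ub[k]) * (v[g == j] - vb[k])).sum()
                      for k, j in enumerate(labels)) / (M - N)       # cf. (3.4)
            s_b = (n * (ub - u.mean()) * (vb - v.mean())).sum() / (n_tilde * (N - 1))
            return s_w, s_b - s_w / n_tilde                          # cf. (3.11)

        sxx_w, sxx_b = components(x, x)
        syy_w, syy_b = components(y, y)
        sxy_w, sxy_b = components(x, y)
        return (sxy_w / np.sqrt(sxx_w * syy_w),     # rho_within
                sxy_b / np.sqrt(sxx_b * syy_b))     # rho_between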
3.7 Combination of within-group evidence

When research focuses on within-group relations and several groups (macro-units) were investigated, it is often desired to combine the evidence gathered in the various groups. For example, consider a study where the relation between work satisfaction (X) and sickness leave (Y) is studied in several organizations. If the organizations are sufficiently similar, or if they can be considered as a sample from some population of organizations, then the data can be analysed according to the hierarchical linear model of the next chapter. If, however, the organizations are too diverse and not representative of any population, then it still can be relevant to conduct one test for the relation between X and Y in which the evidence from all organizations is combined.

Another example is meta-analysis, the statistically based combination of several studies. There exist many texts about meta-analysis, e.g., Hedges and Olkin (1985), Rosenthal (1991), and Hedges (1992). A number of publications may contain information about the same phenomenon, and it can be important to combine this information in a single test. If the studies leading to the publications may be regarded as a sample from some population of studies, then again methods based on the hierarchical linear model can be used. The hierarchical linear model is treated in the following chapters; applications to meta-analysis are given, e.g., by Bryk and Raudenbush (1992, Chapter 7). If the studies collected cannot be regarded as a sample from a population, then still the methods mentioned below may be used.

There exist various methods for combining evidence from several studies, based only on the assumption that this evidence is statistically independent. They can be applied already if the number of independent studies is at least two. The least demanding method is Fisher's combination of p-values (Fisher, 1932; Hedges and Olkin, 1985). This method assumes that in each of N studies a null hypothesis is tested, which results in independent p-values p_1, ..., p_N. The combined null hypothesis is that in all of the studies the null hypothesis holds; the combined alternative hypothesis is that in at least one of the studies the alternative hypothesis holds. It is not required that the N independent studies used the same operationalizations or methods of analysis, only that it is meaningful to test this combined null hypothesis. This hypothesis can be tested by minus twice the sum of the natural logarithms of the p-values,

\chi^2 = -2 \sum_{j=1}^{N} \ln(p_j) ,    (3.36)

which under the combined null hypothesis has a chi-squared distribution with 2N degrees of freedom. Because of the shape of the logarithmic function, this combined statistic will already have a large value if at least one of the p-values is very small.

A stronger combination procedure can be achieved if the several studies all lead to estimates of theoretically the same parameter, denoted here by θ. Suppose that the jth study yields a parameter estimate θ̂_j with standard error s_j, and that all the studies are statistically independent. Then the combined estimate with smallest standard error is the weighted average with weights inversely proportional to s_j²,

\hat{\theta} = \frac{\sum_j \hat{\theta}_j / s_j^2}{\sum_j 1 / s_j^2} ,    (3.37)

with standard error

\text{S.E.}(\hat{\theta}) = \sqrt{\frac{1}{\sum_j 1 / s_j^2}} .    (3.38)

For example, if standard errors are inversely proportional to the square root of sample size, s_j = σ/√n_j for some value σ, then the weights are directly proportional to the sample sizes, and the standard error of the combined estimate is σ/√(Σ_j n_j). If the individual estimates are approximately normally distributed, and also (even when the estimates are not nearly normally distributed) when N is large, the ratio

\frac{\hat{\theta}}{\text{S.E.}(\hat{\theta})}

can be tested against a standard normal distribution.
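Both combination rules fit in a few lines; a sketch (ours, assuming scipy for the chi-squared tail probability):

    import math
    from scipy.stats import chi2

    def fisher_combination(p_values):
        """Fisher's combination of independent p-values, (3.36)."""
        stat = -2.0 * sum(math.log(p) for p in p_values)
        return stat, chi2.sf(stat, 2 * len(p_values))   # statistic, combined p-value

    def combine_estimates(estimates, std_errors):
        """Inverse-variance weighted combination, (3.37)-(3.38)."""
        weights = [1.0 / s ** 2 for s in std_errors]
        theta = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
        return theta, math.sqrt(1.0 / sum(weights))

    # the six p-values of Example 3.8 below:
    print(fisher_combination([0.015, 0.43, 0.19, 0.13, 0.25, 0.42]))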
The choice between these two combination methods can be made as follows. The combination of estimates, expressed by (3.37), is more suitable if the true parameter value (the estimated θ) is approximately the same in each of the combined studies, while Fisher's method (3.36) for combining p-values is more suitable if it is possible that the effect sizes are very different between the N studies. More combination methods can be found in the literature about meta-analysis, e.g., Hedges and Olkin (1985).

Example 3.8 Gossip behavior in six organizations.
Wittek and Wielers (1998) investigated effects of informal social network structures on gossip behavior in six work organizations. One of the hypotheses tested was that individuals tend to gossip more if they are involved in more coalition triads. An individual A is involved in a coalition triad with two others, B and C, if A has a positive relation with B, while A and B both have a negative relation with C. Six organizations were studied, which were so different that an approach following the lines of the hierarchical linear model was not considered appropriate. For each organization separately, a multiple regression was carried out to estimate the effect of the number of coalition triads in which a person was involved on a measure for gossip behavior, controlling for some relevant other variables.

The p-values obtained were 0.015, 0.43, 0.19, 0.13, 0.25, and 0.42. Only one of these is significant (i.e., less than 0.05), and the question is whether this combination of six p-values would be unlikely under the combined null hypothesis, which states that in all six organizations the effect of coalition triads on gossip is absent. Equation (3.36) yields the test statistic χ² = 22.0 with d.f. = 2 × 6 = 12, p < 0.05. Thus the result is significant, which shows that in the combined data there is evidence for an effect of coalition triads on gossip behavior, even though this effect is significant in only one of the organizations considered separately.

In this chapter we have presented some statistics to describe hierarchically structured data. We mostly gave examples with balanced data, i.e., an equal number of micro-units per macro-unit. In practice, most data sets are unbalanced (except mainly for some experimental designs with longitudinal data on a fixed set of occasions). Our main purpose was to demonstrate that clustering in the data, i.e., dependent observations, is not only a nuisance that should be taken care of statistically, but can also be a very interesting phenomenon, worth further study. Finally, before proceeding with the introduction of the hierarchical linear model of multilevel analysis, it should be borne in mind that usually there are explanatory variables, and statistically it is the independence not of the observations but of the residuals (the unexplained part of the dependent variable) which is the basic assumption of single-level linear models.

4 The Random Intercept Model

In the preceding chapters it was argued that the best way to analyse multilevel data is an approach that represents within-group as well as between-group relations within a single analysis, where 'group' refers to the units at the higher levels of the nesting hierarchy. Very often it makes sense to use probability models to represent the variability within and between groups, in other words, to conceive of the unexplained variation within groups and the unexplained variation between groups as random variability.
For a study of pupils within schools, e.g., this means that not only unexplained variation between pupils, but also unexplained variation between schools is regarded as random variability. This can be expressed by statistical models with so-called random coefficients.

The hierarchical linear model is such a random coefficient model for multilevel, or hierarchically structured, data, and is by now the main tool for multilevel analysis. Chapters 4 and 5 treat the definition of this model and the interpretation of the model parameters. The present chapter discusses the simpler case of the random intercept model; Chapter 5 treats the general hierarchical linear model, which also has random slopes. Testing the various components of the model is treated in Chapter 6. The later chapters treat various elaborations and other aspects of the hierarchical linear model. The focus of this treatment is on the two-level case, but Chapters 4 and 5 also contain sections on models with more than two levels of variability.

For the sake of concreteness, we refer to the level-one units as 'individuals' and to the level-two units as 'groups'. The reader may fill in other names for the units if she has a different application in mind; e.g., if the application is to repeated measurements, 'measurement occasions' for level-one units and 'subjects' for level-two units. The nesting situation of measurement occasions within individuals is given special attention in Chapter 12. The number of groups in the data is denoted N; the number of individuals in the groups may vary from group to group, and is denoted n_j for group j (j = 1, 2, ..., N). The total number of individuals is denoted M = Σ_j n_j.

The hierarchical linear model is a type of regression model that is particularly suitable for multilevel data. It differs from the usual multiple regression model in the fact that the equation defining the hierarchical linear model contains more than one error term: one (or more) for each level. As in all regression models, there is a distinction between dependent and explanatory variables: the aim is to construct a model that expresses how the dependent variable depends on, or is explained by, the explanatory variables. Instead of explanatory variable, the names predictor variable and independent variable are also in use. The dependent variable must be a variable at level one: the hierarchical linear model is a model for explaining something that happens at the lowest, most detailed level.

In this section, we assume that one explanatory variable is available at either level. In the notation, we distinguish the following types of indices and variables:

j is the index for the groups (j = 1, ..., N);
i is the index for the individuals within the groups (i = 1, ..., n_j).

The indices can be regarded as case numbers; note that the numbering of the individuals starts again in every group. For example, individual 1 in group 1 is different from individual 1 in group 2. For individual i in group j, we have the following variables:

Y_ij is the dependent variable;
x_ij is the explanatory variable at the individual level;

and for group j, we have that

z_j is the explanatory variable at the group level.

To understand the notation, it is essential to realize that the indices i and j indicate precisely on what the variables depend. The notation Y_ij indicates that the value of variable Y depends on the group j and also on the individual i.
(Since the individuals are nested within groups, the index i makes sense only if it is accompanied by the index j: to identify individual 1, we must know to which group we refer!) The notation z_j, on the other hand, indicates that the value of Z depends only on the group j, and not on the individual i.

The basic idea of multilevel modeling is that the outcome variable Y has an individual as well as a group aspect. This carries through also for other level-one variables. The X variable, although it is a variable at the individual level, may also contain a group aspect. The mean of X in one group may be different from the mean in another group. In other words, X may (and often will) have a positive between-group variance. Stated more generally, the compositions of the various groups with respect to X may differ from one another. It should be kept in mind that explanatory variables that are defined at the individual level often also contain some information about the groups.

4.1 A regression model: fixed effects only

The simplest model is one without the random effects that are characteristic of multilevel models; it is the classical model of multiple regression. This model states that the dependent variable, Y_ij, can be written as the sum of a systematic part (a linear combination of the explanatory variables) and a random residual,

Y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 z_j + R_{ij} .    (4.1)

In this model equation, the β's are the regression parameters: β₀ is the intercept (i.e., the value obtained if x_ij as well as z_j are 0), β₁ is the coefficient for the individual variable X, while β₂ is the coefficient for the group variable Z. The variable R_ij is the residual (sometimes called error); an essential requirement in regression model (4.1) is that all residuals are mutually independent and have a zero mean; a convenient assumption is that in all groups they have the same variance (the homoscedasticity assumption) and are normally distributed. This model has a multilevel nature only to the extent that one of the explanatory variables refers to the lower and the other to the higher level.

Model (4.1) can be extended to a regression model where not only the main effects of X and Z, but also the cross-level interaction effect is present. This type of interaction is discussed more elaborately in the following chapter. It means that the product variable ZX = Z × X is added to the list of explanatory variables. The resulting regression equation is

Y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 z_j + \beta_3 z_j x_{ij} + R_{ij} .    (4.2)

These models pretend, as it were, that all the multilevel structure in the data is fully explained by the group variable Z and the individual variable X. If two individuals are being considered and their X- and Z-values are given, then for their Y-value it is immaterial whether they belong to the same or to different groups.

Models of the type (4.1) and (4.2), and their extensions with more explanatory variables at either or both levels, have in the past been widely used in research on data with a multilevel structure. They are convenient to handle for anybody who knows multiple regression analysis. Is anything wrong with them? YES! For data with a meaningful multilevel structure, it is practically always unfounded to make the a priori assumption that all of the group structure is represented by the explanatory variables. Given that there are only N groups, it is unfounded to do as if one has n_1 + n_2 + ... + n_N independent replications.
There is one exception: when all group sample sizes n_j are equal to 1, the researcher does not need to have any qualms about using these models, because the nesting structure, although it may be present in the population, is not present in the data. Designs with n_j = 1 can be used when the explanatory variables have been chosen on the basis of substantive theory, and the focus of the research is on the regression coefficients rather than on how the variability of Y is partitioned into within-group and between-group variability.

In designs with group sizes larger than 1, however, the nesting structure often cannot be represented completely in the regression model by the explanatory variables. Additional effects of the nesting structure can be represented by letting the regression coefficients vary from group to group. Thus, the coefficients β₀ and β₁ in equation (4.1) must depend on the group, denoted by j. This is expressed in the formula by an extra index j for these coefficients. This yields the model

Y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + \beta_2 z_j + R_{ij} .    (4.3)

Groups j can have a higher (or lower) value of β_0j, indicating that, for any given value of X, they tend to have higher (or lower) values of the dependent variable Y. Groups can also have a higher or lower value of β_1j, which indicates that the effect of X on Y is higher or lower. Since Z is a group-level variable, it would not make much sense conceptually to let the coefficient of Z depend on the group. Therefore β₂ is left unaltered in this formula.

The multilevel models treated in the following sections and in Chapter 5 contain diverse specifications of the varying coefficients β_0j and β_1j. The simplest version of model (4.3) is the version where β_0j and β_1j are constant (do not depend on j), i.e., the nesting structure has no effect, and we are back at model (4.1). In this case, the OLS¹ regression models of type (4.1) and (4.2) offer a good approach to analysing the data. If, on the other hand, the coefficients β_0j and β_1j do depend on j, then these regression models may give misleading results. Then it is preferable to take into account how the nesting structure influences the effects of X and Z on Y. This can be done using the random coefficient model of this and the following chapters. In this chapter the case is treated where the intercept β_0j depends on the group; the next chapter treats the case where also the regression coefficient β_1j is group-dependent.

4.2 Variable intercepts: fixed or random parameters?

Let us first consider only the regression on the level-one variable X. A first step towards modeling between-group variability is to let the intercept vary between groups. This reflects that some groups tend to have, on average, higher responses Y and others tend to have lower responses. This model is halfway between (4.1) and (4.3) (but omitting the effect of Z), in the sense that the intercept β_0j does depend on the group, but the regression coefficient of X, β₁, is constant:

Y_{ij} = \beta_{0j} + \beta_1 x_{ij} + R_{ij} .    (4.4)

This is pictured in Figure 4.1.

[Figure 4.1 Different parallel regression lines for groups 1, 2, and 3, with intercepts β₀₁, β₀₂, β₀₃. The point y₁₂ is indicated with its residual R₁₂.]

The group-dependent intercept can be split into an average intercept and the group-dependent deviation:

\beta_{0j} = \gamma_{00} + U_{0j} .

For reasons that will become clear in Chapter 5, the notation for the regression coefficients is changed here: the average intercept is called γ₀₀, while the regression coefficient for X is called γ₁₀.

¹ Ordinary least squares.
Substitution now leads to the model

Y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + U_{0j} + R_{ij} .    (4.5)

The values U_0j are the main effects of the groups: conditional on an individual having a given X-value and being in group j, the Y-value is expected to be U_0j higher than in the average group.

Model (4.5) can be understood in two ways.

(1) As a model where the U_0j are fixed parameters, N in number, of the statistical model. This is relevant if the groups j refer to categories each with its own distinct interpretation, e.g., a classification according to gender or religious denomination. In order to obtain identified parameters, the restriction that Σ_j U_0j = 0 can be imposed, so that effectively the groups lead to N − 1 regression parameters. This is the usual analysis of covariance model, in which the grouping variable is a factor. (Since we prefer to use Greek letters for statistical parameters and capital Roman letters for random variables, we would prefer to use a Greek letter rather than U when we take this view of model (4.5).) In this specification it is impossible to use a group-level variable Z_j as an explanatory variable, because it would be redundant given the fixed group effects.

(2) As a model where the U_0j are independent identically distributed random variables. Note that the U_0j are the unexplained group effects, which may also be called group residuals, controlling for the effects of variable X. These residuals now are assumed to be randomly drawn from a population with zero mean and an a priori unknown variance. This is relevant if the effects of the groups j (which can be neighborhoods, schools, companies, etc.), controlling for the explanatory variables, can be considered to be exchangeable. There is one parameter associated with the U_0j in this statistical model: their variance. This is the simplest random coefficient regression model. It is called the random intercept model because the group-dependent intercept, γ₀₀ + U_0j, is a quantity which varies randomly from group to group. The groups are regarded as a sample from a population of groups. It is possible that there are group-level variables Z_j that express relevant attributes of the groups (such variables will be incorporated in the model in Section 4.4 and, more extensively, in Section 5.2).

The next section discusses how to determine which of these specifications of model (4.5) is appropriate in a given situation.

Note that models (4.1) and (4.2) are OLS models or fixed effects models which do not take the nesting structure into account (except maybe by the use of a group-level variable Z_j), whereas models of type (1) above are OLS models that do take the nesting structure into account. The latter kind of OLS model has a much larger number of regression parameters, since in such models N groups lead to N − 1 regression coefficients. It is important to distinguish between these two kinds of OLS models in discussions about how to handle data with a multilevel structure.

4.2.1 When to use random coefficient models?

These two different interpretations of equation (4.5) imply that multilevel data can be approached in two different ways, using models with fixed or with random coefficients.
Which of these two interpretations is the most appropriate in a given situation depends on the focus of the statistical inference, the nature of the set of N groups, the magnitudes of the group sample sizes n_j, and the population distributions involved.

1. If the groups are regarded as unique entities and the researcher wishes primarily to draw conclusions pertaining to each of these N specific groups, then it is appropriate to use the analysis of covariance model. Examples are groups defined by gender or ethnic background.

2. If the groups are regarded as a sample from a (real or hypothetical) population and the researcher wishes to draw conclusions pertaining to this population, then the random coefficient model is appropriate. Examples are the groupings mentioned in Table 2.2.

3. If the researcher wishes to test effects of group-level variables, the random coefficient model should be used. The reason is that the fixed effects model already 'explains' all differences between groups by the fixed effects, and there is no unexplained between-group variability left that could be explained by group-level variables. 'Random effects' and 'unexplained variability' are two ways of saying the same thing.

4. Especially for relatively small group sizes (in the range from 2 to 50 or 100), the random coefficient model has important advantages over the analysis of covariance model, provided that the assumptions about the random coefficients are reasonable. This can be understood as follows.

The random coefficient model includes the extra assumption of independent and identically distributed group effects U_0j. Stated less formally: the unexplained group effects are governed by 'mechanisms' that are roughly similar from one group to the next, and operate independently between the groups. The groups are said to be exchangeable. This assumption helps to counteract the paucity of the data that is implied by relatively small group sizes n_j. Since all group effects are assumed to come from the same population, the data from each group also have a bearing on inference with respect to the other groups, namely, through the information they provide about the population of groups.

In the analysis of covariance model, each of the U_0j is estimated as a separate parameter. If group sizes are small, then the data do not contain very much information about the values of the U_0j, and there will be a considerable extent of overfitting in the analysis of covariance model: many parameters have large standard errors. This overfitting is avoided by using the random coefficient model, because the U_0j do not figure as parameters. If, on the other hand, the group sizes are large (say, 100 or more), then in the analysis of covariance the group-dependent parameters U_0j are estimated very precisely (with small standard errors), and the additional information that they come from the same population does not add much to this precision. In such a situation the difference between the results of the two approaches will be negligible.

5. The random coefficient model is mostly used with the additional assumption that the random coefficients, U_0j and R_ij in (4.5), are normally distributed. If this assumption is a very poor approximation, results obtained with this model may be unreliable. This can happen, e.g., when there are more outlying groups than can be accounted for by a normally distributed group effect U_0j with a common variance.
Other discussions about the choice between fixed and random coefficients can be found, e.g., in Searle et al. (1992, Section 1.4) and in Hsiao (1995, Section 8). An often mentioned condition for the use of random coefficient models is the restriction that the random coefficients should be independent of the explanatory variables. However, if there is a possible correlation between group-dependent coefficients and explanatory variables, this residual correlation can be removed, while continuing to use a random coefficient model, by also including effects of the group means of the explanatory variables. This is treated in Sections 4.5 and 9.2.1.

In order to choose between regarding the group-dependent intercepts U_0j as fixed statistical parameters and regarding them as random variables, a rule of thumb that often works in educational and social research is the following. This rule mainly depends on N, the number of groups in the data. If N is small, say N < 10, then use the analysis of covariance approach: the problem with viewing the groups as a sample from a population is, in this case, that the data will contain only scanty information about this population. If N is not small, say N > 10, while n_j is small or intermediate, say n_j < 100, then use the random coefficient approach: 10 or more groups is usually too large a number to be regarded as unique entities. If the group sizes n_j are large, say n_j > 100, then it does not matter much which view we take. However, this rule of thumb should be taken with a large grain of salt and serves only to give a first hunch, not to determine the choice between fixed and random effects.

Populations and populations

When the researcher has indeed chosen to work with a random coefficient model, she must be aware that more than one population is involved in the multilevel analysis. Each level corresponds to a population! For a study of pupils in schools, there is a population of schools and a population of pupils. For voters in municipalities, there is a population of municipalities and a population of voters; etc.

Recall that in this book we take a model-based view. This implies that the populations are infinite hypothetical entities, which express 'what could be the case'. The random residuals and coefficients can be regarded as representing the effects of unmeasured variables and the approximate nature of the linear model. Randomness, in this sense, may be interpreted as unexplained variability.

Sometimes a random coefficient model can be used also when the population idea at the lower level is less natural. For example, in a study of longitudinal data where respondents are measured repeatedly, a multilevel model can be used with respondents at the second and measurements at the first level: measurements are nested within respondents. Then the population of respondents is an obvious concept. Measurements may be related to a population of time points. This will sometimes be natural, but not always. Another way of expressing the idea of random coefficient models in such a situation is to say that residual (non-explained) variability is present at level one as well as at level two, and this non-explained variability is represented by a probability model.

4.3 Definition of the random intercept model

In this text we treat the random coefficient view on model (4.5).
This model, the random intercept model, is a simple case of the so-called hierarchical linear model. We shall not specifically treat the analysis of covariance model, and refer for this purpose to texts on analysis of variance and covariance (for example, Cook and Campbell, 1979, or Stevens, 1996). However, we shall encounter a number of considerations from the analysis of covariance that also play a role in multilevel modeling.

The empty model

Although this chapter follows an approach along the lines of regression analysis, the simplest case of the hierarchical linear model is the random effects analysis of variance model, in which the explanatory variables, X and Z, do not figure. This model only contains random groups and random variation within groups. It can be expressed as a model where the dependent variable is the sum of a general mean, γ00, a random effect at the group level, U0j, and a random effect at the individual level, Rij:

    Yij = γ00 + U0j + Rij .    (4.6)

This is the same model encountered before in formula (3.1). Groups with a high value of U0j tend to have, on the average, high responses, whereas groups with a low value of U0j tend to have, on the average, low responses. The random variables U0j and Rij are assumed to have a mean of 0 (the mean of Yij is already represented by γ00), to be mutually independent, and to have variances var(Rij) = σ² and var(U0j) = τ0². In the context of multilevel modeling, (4.6) is called the empty model, because it contains not a single explanatory variable. It is important because it provides the basic partition of the variability in the data between the two levels. Given model (4.6), the total variance of Y can be decomposed as the sum of the level-two and the level-one variances,

    var(Yij) = var(U0j) + var(Rij) = τ0² + σ² .

The covariance between two individuals (i and i', with i ≠ i') in the same group j is equal to the variance of the contribution U0j that is shared by these individuals,

    cov(Yij, Yi'j) = var(U0j) = τ0² ,

and their correlation is

    ρ(Yij, Yi'j) = τ0² / (τ0² + σ²) .

This parameter is just the intraclass correlation coefficient ρI(Y) which we encountered already in Chapter 3. It can be interpreted in two ways: it is the correlation between two randomly drawn individuals in one randomly drawn group, and it is also the fraction of total variability that is due to the group level.

Example 4.1 Empty model for language scores in elementary schools.
In this example a data set is used that will be used in examples in many chapters of this book. The data set is concerned with grade 8 pupils (age about 11 years) in elementary schools in The Netherlands. After deleting pupils with missing values, the number of pupils is M = 2287, and the number of schools is N = 131. Class sizes in the original data set range from 10 to 42; in the data set reduced by deleting cases with missing data, the class sizes range from 4 to 35. The nesting structure is pupils within classes.

The dependent variable is the score on a language test. Most of the analyses of this data set in this book are concerned with investigating how the language test score depends on the pupil's intelligence and his or her family's social-economic status, and on a number of school or class variables.

Fitting the empty model yields the parameter estimates presented in Table 4.1. The deviance in this table is given for the sake of completeness and later reference; this concept is explained in Chapter 6. The estimates σ̂² = 64.57 and τ̂0² = 19.42 yield an intraclass correlation of ρ̂I = 19.42/(19.42 + 64.57) = 0.23.
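As an aside from the text: this decomposition can be reproduced with standard mixed-model software. The following sketch simulates data that roughly mimic Example 4.1 (the variance values 19.4 and 64.6 are taken from the reported estimates; the grand mean and all names are made up) and fits the empty model (4.6) with statsmodels.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated stand-in for the language-score data (hypothetical grand mean 41.0).
    rng = np.random.default_rng(1)
    school = np.repeat(np.arange(131), 18)                      # 131 schools
    u = rng.normal(0, np.sqrt(19.4), 131)[school]               # school effects U0j
    y = 41.0 + u + rng.normal(0, np.sqrt(64.6), school.size)    # Y = gamma00 + U0j + Rij
    df = pd.DataFrame({"langscore": y, "school": school})

    empty = smf.mixedlm("langscore ~ 1", df, groups="school").fit()
    sigma2 = empty.scale                    # level-one variance sigma^2
    tau02 = float(empty.cov_re.iloc[0, 0])  # level-two variance tau0^2
    print(tau02 / (tau02 + sigma2))         # intraclass correlation, about 0.23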
The reverse, where the between-group regression line is less steep than the within-group regression lines, can also be the case.

Figure 4.3 Different between-group and within-group regression lines.

If the within- and between-group regression coefficients are different, then it often is convenient to replace xij in (4.8) by the within-group deviation score, defined as xij − x̄.j. To distinguish the corresponding parameters from those in (4.8), they are denoted by γ̃. The resulting model is

    Yij = γ̃00 + γ̃10 (xij − x̄.j) + γ̃01 x̄.j + U0j + Rij .    (4.9)

This model is statistically equivalent to model (4.8), but it has a more convenient parametrization, because the between-group regression coefficient is

    γ̃01 = γ10 + γ01 ,    (4.10)

while the within-group regression coefficient is

    γ̃10 = γ10 .    (4.11)

The use of the within-group deviation score is called within-group centering; some computer programs for multilevel analysis have special facilities for this.

Example 4.3 Within- and between-group regressions for IQ.
We continue Example 4.2 by allowing differences between the within-group and between-group regressions of the language score on IQ. The results are displayed in Table 4.4. IQ here is the raw variable, i.e., without group-centering. In other words, the results refer to model (4.8).

Table 4.4 Estimates for random intercept model with different within- and between-group regressions.

    Fixed Effect                             Coefficient    S.E.
    γ00 = Intercept                              40.74
    γ10 = Coefficient of IQ                       2.415
    γ01 = Coefficient of IQ̄ (group mean)          1.589    0.313

    Random Effect                     Variance Component    S.E.
    Level-two variance: τ0² = var(U0j)            7.73     1.30
    Level-one variance: σ² = var(Rij)            42.15     1.28
    Deviance

The within-group regression coefficient is 2.415, and the between-group regression coefficient is 2.415 + 1.589 = 4.004. A pupil with a given IQ obtains, on average, a higher language test score if he or she is in a class with a higher average IQ. In other words, the context effect of mean IQ gives an additional contribution over and above the effect of individual IQ. A figure for these results is qualitatively similar to Figure 4.3, in the sense that the within-group regression lines are less steep than the between-group regression line.

The table represents, with each class denoted j, a linear regression equation

    Y = 40.74 + U0j + 2.415 IQ + 1.589 IQ̄.j ,

where U0j is a class-dependent deviation with mean 0 and variance 7.73 (standard deviation 2.78). The within-class deviations about this regression equation, Rij, have a variance of 42.15 (standard deviation 6.49). Within each class, the effect (regression coefficient) of IQ is 2.415, so the regression lines are parallel. Classes differ in two ways: they may have different mean IQ values, which affects the expected results Y through the term 1.589 IQ̄.j; this is an explained difference between the classes; and they have randomly differing values for U0j, which is an unexplained difference. These two ingredients contribute to the class-dependent intercept, given by 40.74 + U0j + 1.589 IQ̄.j.

The within-group and between-group regression coefficients would be equal if, in formula (4.8), the coefficient of average IQ would be 0, i.e., γ01 = 0. This null hypothesis can be tested (see Section 6.1) by the t-statistic given here by 1.589/0.313 = 5.08, a highly significant result. In other words, we may conclude that the within- and between-group regression coefficients are different indeed.

If the individual IQ variable had been replaced by within-group deviation scores IQij − IQ̄.j, i.e., model (4.9) had been used, then the estimates obtained would have been γ̃10 = 2.415 and γ̃01 = 4.004, cf. formulae (4.10) and (4.11).
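A sketch of how (4.8) and (4.9) relate in practice. The code below uses simulated data with made-up values (not the book's data set); it fits both parametrizations and shows that the centered model recovers the within-group coefficient directly, while its group-mean coefficient equals the sum γ10 + γ01 of model (4.8), cf. (4.10) and (4.11).

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    school = np.repeat(np.arange(100), 20)
    iq_bar = rng.normal(0, 1, 100)[school]           # true class mean IQ
    iq = iq_bar + rng.normal(0, 2, school.size)      # pupil IQ around the class mean
    u = rng.normal(0, 2.8, 100)[school]              # random intercepts U0j
    y = (40.7 + 2.4 * (iq - iq_bar) + 4.0 * iq_bar   # distinct within/between effects
         + u + rng.normal(0, 6.5, school.size))
    df = pd.DataFrame({"langscore": y, "iq": iq, "school": school})

    df["iq_mean"] = df.groupby("school")["iq"].transform("mean")
    df["iq_dev"] = df["iq"] - df["iq_mean"]

    m48 = smf.mixedlm("langscore ~ iq + iq_mean", df, groups="school").fit()
    m49 = smf.mixedlm("langscore ~ iq_dev + iq_mean", df, groups="school").fit()

    print(m48.fe_params["iq"], m49.fe_params["iq_dev"])   # same within-group slope
    print(m48.fe_params["iq"] + m48.fe_params["iq_mean"],
          m49.fe_params["iq_mean"])                       # both equal gamma10 + gamma01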
Indeed, the regression equation given above can be described equivalently by

    Y = 40.74 + U0j + 2.415 (IQ − IQ̄.j) + 4.004 IQ̄.j ,

which indicates explicitly that the within-group regression coefficient is 2.415, while the between-group regression coefficient, i.e., the coefficient of the group means Ȳ.j on the group means IQ̄.j, is 4.004.

When interpreting the results of a multilevel analysis, it is important to keep in mind that the conceptual interpretation of within-group and between-group regression coefficients usually is completely different. These two coefficients may express quite contradictory mechanisms. This is related to the shift of meaning and the ecological fallacy discussed in Section 3.1. It is the rule rather than the exception that within-group regression coefficients differ from between-group regression coefficients (although the statistical significance of this difference may be another matter, depending as it is on sample sizes, etc.).

4.6 Parameter estimation

The random intercept model (4.7) is defined by its statistical parameters: the regression parameters γ and the variance components σ² and τ0². Note that the random effects, U0j, are not parameters in a statistical sense, but latent (i.e., not directly observable) variables. The literature (e.g., Longford, 1993) contains two major estimation methods for estimating the statistical parameters, under the assumption that the U0j as well as the Rij are normally distributed: maximum likelihood (ML) and residual (or restricted) maximum likelihood (REML).

The two methods differ little with respect to estimating the regression coefficients, but they do differ with respect to estimating the variance components. A very brief indication of the difference between the two estimation methods is that the REML method estimates the variance components while taking into account the loss of degrees of freedom resulting from the estimation of the regression parameters, whereas the ML method does not take this into account. The result is that the ML estimators for the variance components have a downward bias, and the REML estimators don't. For example, the usual variance estimator for a single-level sample, in which the sum of squared deviations is divided by sample size minus 1, is a REML estimator; the corresponding ML estimator divides instead by the total sample size. The difference can be important especially when the number of groups is small. For a large number of groups (as a rule of thumb, 'large' here means larger than 30), the difference between the ML and the REML estimates is immaterial. The literature suggests that the REML method is preferable with respect to the estimation of the variance parameters (and the covariance parameters for the more general models treated in Chapter 5). When one wishes to carry out deviance tests (see Section 6.2), it sometimes is required to use ML rather than REML estimates.²

² When models are compared with different fixed parts, deviance tests should be based on ML estimation. Deviance tests with REML estimates may be used for comparing models with different random parts and the same fixed part. Different random parts will be treated in the next chapter.

Various algorithms are available to determine these estimates. They have names such as EM (Expectation-Maximization), Fisher scoring, IGLS (Iterative Generalized Least Squares), and RIGLS (Residual or Restricted IGLS). They are iterative, which means that a number of steps are taken in which a provisional estimate comes closer and closer to the final estimate. When all goes well, the steps converge to the ML or REML estimate. Technical details can be found, e.g., in Bryk and Raudenbush (1992), Goldstein (1995), or Longford (1993a, 1995). In principle, the algorithms all yield the same estimates for a given estimation method (ML or REML).
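To make the ML/REML distinction concrete, here is a small sketch (reusing the simulated data frame df of the previous example; not the book's software), fitting the same model under both methods. In statsmodels the choice is simply a fit option.

    import statsmodels.formula.api as smf

    # ML versus REML for the same random intercept model.
    m_reml = smf.mixedlm("langscore ~ iq", df, groups="school").fit(reml=True)
    m_ml = smf.mixedlm("langscore ~ iq", df, groups="school").fit(reml=False)

    # The fixed estimates barely differ:
    print(m_reml.fe_params["iq"], m_ml.fe_params["iq"])
    # The ML variance components are slightly smaller (downward bias),
    # analogous to dividing a sum of squares by M instead of M - 1:
    print(float(m_reml.cov_re.iloc[0, 0]), float(m_ml.cov_re.iloc[0, 0]))
    print(-2 * m_ml.llf)   # the ML deviance used in deviance tests (Section 6.2)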
The differences are that for some complicated models the algorithms may vary in the amount of computational problems (sometimes one algorithm may converge and the other not), and that computing time may differ. For the practical user, the differences between the algorithms are hardly worth thinking about.

An aspect of the estimation of hierarchical linear model parameters that surprises some users of this model is the fact that it is possible that the variance parameters, in model (4.7) notably parameter τ0², can be estimated to be exactly 0. The value of 0 is then also reported for the standard error. This does not mean that the data imply absolute certainty that the population value of τ0² is equal to 0. Such an estimate can be understood as follows. For simplicity, consider the empty model, (4.6). The level-one residual variance σ² is estimated by the pooled within-group variance. The parameter τ0² is estimated by comparing this within-group variability to the between-group variability. The latter is determined not only by τ0² but also by σ², since the expected between-group variance is of the form

    τ0² + σ²/n ,    (4.12)

with n the group size. Note that, being a variance, τ0² cannot be negative. This implies that, even if τ0² = 0, a positive between-group variability is expected. If the observed between-group variability is equal to or smaller than what is expected from (4.12) in case τ0² = 0, then the estimate τ̂0² = 0 is reported (cf. the discussion following (3.11)).

If the group sizes nj are variable, the larger groups will, naturally, have a larger influence on the estimates than the smaller groups. The influence of group size on the estimates is, however, mediated by the intraclass correlation coefficient. Consider, e.g., the estimation of the mean intercept, γ00. If the residual intraclass correlation is 0, the groups have an influence on the estimated value of γ00 that is proportional to their size. In the extreme case that the residual intraclass correlation is 1, each group has an equally large influence, independent of its size. In practice, where the residual intraclass correlation is between 0 and 1, larger groups will have a larger influence, but less than proportionately.

4.7 'Estimating' random group effects: posterior means

The random group effects U0j are latent variables rather than statistical parameters, and therefore are not estimated as an integral part of the statistical parameter estimation. However, there can be many reasons why it can nevertheless be desirable to 'estimate' them.³ This can be done by a method known as empirical Bayes estimation, which produces so-called posterior means; see, e.g., Efron and Morris (1975). The basic idea of this method is that U0j is 'estimated' by combining two kinds of information:

(1) the data from group j;
(2) the fact (or, rather, the model assumption) that the unobserved U0j is a random variable just like all other random group effects, and therefore has a normal distribution with mean 0 and variance τ0².

In other words, data information is combined with population information. The formula is given here only for the empty model, i.e., the model without explanatory variables. The idea for more complicated models is analogous; formulae can be found in the literature, e.g., Longford (1993a, Section 2.10).
The empty model was formulated in (4.6) as

    Yij = β0j + Rij = γ00 + U0j + Rij .

Since γ00 is already an estimated parameter, an estimate for β0j will be the same as an estimate for U0j plus γ̂00. Therefore, estimating β0j and 'estimating' U0j are equivalent problems, given that an estimate for γ00 is available.

If we used only the data from group j, β0j would be estimated by the group mean, which is also the OLS estimate,

    β̂0j = Ȳ.j .    (4.13)

If we looked only at the population, we would estimate β0j by its population mean, γ00. This parameter is estimated by the overall mean,

    γ̂00 = Ȳ.. ,

where M = Σj nj denotes the total sample size. Another possibility is to combine the information from group j with the population information. The optimal combined 'estimate' for β0j is a weighted average of the two previous estimates:

    β̂0j^EB = λj β̂0j + (1 − λj) γ̂00 ,    (4.14)

³ The word 'estimation' is put between quotation marks because the proper statistical term for finding likely values of the U0j, these being random variables, is prediction. The term estimation is reserved for finding likely values for statistical parameters. Since prediction is associated in everyday speech, however, with determining something about the future, we prefer to speak here about 'estimation' between quotation marks.
The squared err averaged overall troaps wil be smaller for the empirical Bayes estimate, but the price is @ Conservative (drawn tothe average) appraisal of the groups with trly very high ot very low values of fy. The estimation variance of the empirical Bayes estimate is var (25 - By) ==) B (4.15) ifthe uncertaiaty due tothe estimation of po (whch sof secondary impor- tance anyway) is neglected. This formula also is well-known from classical psychological tet theory (eg, Lord and Novick, 1968) —ataonca wee Hat te average f may ~ pote ~lndependentrepicn tous f this estimate for this pactiular group J would be very clogs to the true valde Bos 6 The random intercept model The same principle can be applied (but with more complicated formulae) to the ‘timation’ of the group-dependent intercept fay = 900 + Uay random intercept models that do include explanatory variables, such as (4.7). This intecept can be ‘etimated” again by yop plus the posterior mean of Uoj, anc is then also referred to asthe posterior intercept. Tnstead of being primaiily interated in the inwrcept as defined by vo + Vay whic is the value of the regression equation for all explana tory variables having the value 0, one may aso be interested in the value of the regression line for group j forthe case where ony the level-one variables iy t0 py are O while the level-two variables have the values proper to this sroup. To ‘estimate’ this version of the intercept of group j, we use fm + ony on + font + OFF» 418) where the values 7 indicate the (ML or REML) estimates of the regression coefficients. These values alo are sometimes called posterior intercepts ‘The posterior means (4.14) can be used, eg, to see which groupe have unexpectedly high of low values on the outcome variable, given their values on the explanatory variables. ‘They can also be used in a residual analysis, for checking the assumption of normality for the random group effects, and for detecting outer, cf. Chapter 9. The posterior intercepts (4.16) indicate the total main efect of group j, controling for the leverone variables X; to Xp, but including the effects of the lev-two variables Z; t0 Zp. For example, in a study of pupils in schools where the dependent variable isa relevant indicator of scholastic performance, thee posterior intercepts could be valuable information for the parents indicating the contribution of the various schools tothe performance of ther beloved children. Example 4.4 Posterior means for random data. ‘We can illustrate the ‘estimation’ procedure by returning to the random digits table (Chapter 2, Table 3.1). Macro unit 04 in that table has an average of Yj = 315 over its 10 random digits. The grand mean of the total 100, random digits is ¥., = 47.2. The average of macro unit 04 thus seems to be far below the grand mean. But the reliability of this mean is aly 2, 26.7 /{26.7 + (769:7/10)} = (414), the posterior mean is calealated a8 0.25 x315 + (1-025) x472 =433 ‘In words: the pesterior mean for macro-unit 04 is determined for 78 percent (e, 1= 3) by the grand mean of 47.2 and by only 25 percent (i.e, 33) by its OLS mean of $1.5. The shrinkage to the grand mean is evident. Because of the low estimated intraclass correlation of f, = 0.08 and the low number of observations per macro unit, n; = 10, the empirical Bayes estimate of the average of macro unit 0¢ is closer to the grand mean than to the group mean. 
4.7.1 Posterior confidence intervals

Now suppose that parents have to choose a school for their children, and that they wish to do so on the basis of the value a school adds to abilities that pupils already have when entering the school (as indicated by an IQ test). Let us focus on the language scores. 'Good' schools then are schools where pupils on average are 'over-achievers', that is to say, they achieve more than expected on the basis of their IQ. 'Poor' schools are schools where pupils on average have language scores that are lower than one would expect given their IQ scores.

In this case the level-two residuals U0j from a two-level model with language as the dependent and IQ as the predictor variable convey the relevant information. But remember that each U0j has to be estimated from the data, and that there is sampling error associated with each residual, since we work with a sample of students from each school. Of course we might argue that within each school the entire population of students is studied, but in general we should handle each parameter estimate with its associated uncertainty, since we are now considering the performance of the school for a hypothetical new pupil at this school.

Therefore, instead of simply comparing schools on the basis of the level-two residuals, it is better to compare these residuals taking account of the associated confidence intervals. The standard error of the empirical Bayes estimate is smaller than the mean squared error of the OLS estimate based on the data only for the given macro-unit (the given school, in our example). This is just the point of using the empirical Bayes estimate. For the empty model the standard error is the square root of (4.15), which can also be expressed as

    S.E.(β̂0j^EB) = √( 1 / (1/τ0² + nj/σ²) ) .    (4.17)

This formula also was given by Longford (1993a, Section 1.7). Thus, the standard error depends on the within-group as well as the between-group variance and on the number of sampled pupils for the school. For models with explanatory variables, the standard error can be obtained from computer output of multilevel software (see Chapter 15). Denoting the standard error for school j shortly by SEj, the corresponding ninety percent confidence intervals can be calculated as the intervals

    (β̂0j^EB − 1.64 × SEj , β̂0j^EB + 1.64 × SEj) .

Two cautionary remarks are in order, however. In the first place, the shrinkage construction of the empirical Bayes estimates implies a bias: 'good' schools (with a high U0j) will tend to be represented too negatively, 'poor' schools (with a low U0j) will tend to be represented too positively (especially if the sample sizes are small). The smaller standard error is bought at the expense of this bias! These confidence intervals have the property that, on average, the random group effects U0j will be included in the confidence interval for ninety percent of the groups. But for close to average groups the coverage probability is higher, while for the groups with very low or very high group effects the coverage probability will be lower than ninety percent.

In the second place, users of such information generally wish to compare a series of groups. This problem was addressed by Goldstein and Healy (1995). The parent in our example will make her or his own selection of schools, and (if the parent is a trained statistician) will compare the schools on the basis of whether the confidence intervals overlap.
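The interval computations are straightforward; a hedged sketch (illustrative numbers taken from Example 4.4, names hypothetical). It also anticipates the narrower 'comparative' intervals that are motivated in the next paragraphs.

    import numpy as np

    def posterior_se(tau02, sigma2, n_j):
        # standard error (4.17) of the empirical Bayes 'estimate'
        return np.sqrt(1.0 / (1.0 / tau02 + n_j / sigma2))

    se = posterior_se(25.7, 769.7, 10)
    eb = 43.3                                # posterior mean from Example 4.4
    print(eb - 1.64 * se, eb + 1.64 * se)    # 90% interval for a single group
    print(eb - 1.39 * se, eb + 1.39 * se)    # comparative interval, 5% level,
                                             # per the adjustment discussed below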
Comparing schools in this way means that the parent is implicitly performing a series of statistical tests on the differences between the group effects U0j. Goldstein and Healy (1995, p. 175) write: 'It is a common statistical misconception to suppose that two quantities whose 95% confidence intervals just fail to overlap are significantly different at the 5% significance level.' The reader is referred to their article on how to adjust the width of the confidence intervals in order to perform such significance testing. For example, testing the equality of a series of level-two residuals at the five percent significance level requires confidence intervals that are constructed by multiplying the standard error given above by 1.39 rather than the well-known five percent value of 1.96. For a ten percent significance level, the factor is 1.24 rather than 1.64. So these 'comparative confidence intervals' are allowed to be narrower than the confidence intervals used for assessing single groups.

Example 4.5 Comparing added value of schools.
Table 4.2 presents the multilevel model where language scores are controlled for IQ. The posterior means U0j^EB, which can be interpreted as the estimated value added, are graphically presented in Figure 4.4. The figure also presents the confidence intervals for testing the equality of any pair of residuals at a significance level of five percent. For convenience, the schools are ordered by the size of their posterior mean.

Figure 4.4 The added value scores for 131 schools with comparative posterior confidence intervals.

Note that approximately 30 schools have confidence intervals that overlap the confidence interval of the best school in this sample, implying that their value-added scores do not differ significantly. At the lower extreme, also about as many schools have confidence intervals overlapping that of the school with the lowest value-added score, so that these cannot be distinguished significantly from it either.

Example 4.6 Posterior confidence intervals for random data.
Let us look once more at the random digits example. For each of the 10 macro-units of random digits, Figure 4.5 presents the OLS means and the posterior means with comparative confidence intervals.

Figure 4.5 OLS means (×) and posterior means (•) with comparative posterior confidence intervals.

Once again we clearly observe the shrinkage, since the OLS means (×) are further apart than the posterior means (•). Further, as we would expect from a random digits example, none of the pairwise comparisons results in any significant differences between the macro-units, since all 10 confidence intervals overlap.

4.8 Three-level random intercept models

The three-level random intercept model is a straightforward extension of the two-level model. In the previous examples, data were used with a nesting structure of pupils within schools. The actual hierarchical structure of educational data is, however: students nested within classes nested within schools. Other examples are: siblings within families within neighborhoods, or people within regions within states. Less obvious examples are: students within cohorts within schools, or longitudinal measurements within persons within groups. These latter cases will be illustrated in Chapter 12 on longitudinal data. For the time being, we concentrate on 'simple' three-level hierarchical data structures.

The dependent variable now is denoted by Yijk, referring to, e.g., pupil i in class j in school k. More generally, one can speak about level-one unit i in level-two unit j in level-three unit k.
The three-level model for such data with one explanatory variable may be formulated as a regression model

    Yijk = β0jk + β1 xijk + Rijk ,    (4.18)

where β0jk is the intercept in level-two unit j within level-three unit k. For the intercept we have the level-two model,

    β0jk = δ00k + U0jk ,    (4.19)

where δ00k is the average intercept in level-three unit k. For this average intercept we have the level-three model,

    δ00k = γ000 + V00k .    (4.20)

This shows that now there are three residuals, as there is variability on three levels. Their variances are denoted by

    var(Rijk) = σ² ,  var(U0jk) = τ² ,  var(V00k) = φ² .    (4.21)

The total variance between all level-one units now equals σ² + τ² + φ², and the population variance between the level-two units is τ² + φ². Substituting (4.20) and (4.19) into the level-one model (4.18) and using (in view of the next chapter) the triple indexing notation γ100 for the regression coefficient β1 yields

    Yijk = γ000 + γ100 xijk + V00k + U0jk + Rijk .    (4.22)

Example 4.7 A three-level model: students in classes in schools.
For this example we use a dataset on 3792 students in 280 classes in 57 secondary schools with complete data (see Opdenakker and Van Damme, 1997). At school entrance, students were administered tests on IQ, mathematics ability, and achievement motivation; furthermore, data was collected on the educational level of the father and the students' gender.

The response variable is the score on a mathematics test administered at the end of the second grade of the secondary school (when the students were approximately 14 years old). Table 4.5 contains the results of the analysis of the empty three-level model (Model 1) and a model with a fixed effect of the students' intelligence (Model 2).

Table 4.5 Estimates for three-level model.

                                             Model 1             Model 2
    Fixed Effect                       Coefficient  S.E.   Coefficient  S.E.
    γ000 = Intercept
    γ100 = Coefficient of IQ                                   0.121   0.005

    Random Effect                      Var. Comp.   S.E.   Var. Comp.   S.E.
    Level-three variance: var(V00k)        2.124
    Level-two variance: var(U0jk)          1.746
    Level-one variance: var(Rijk)          7.816
    Deviance

The total variance is 11.686, the sum of the three variance components. Since this is a three-level model, there are several kinds of intraclass correlation coefficient. Of the total variance, 2.124/11.686 = 0.18, i.e., 18 percent, is situated at the school level, while (2.124 + 1.746)/11.686 = 0.33, i.e., 33 percent, is situated at the class and school levels together. The level-three intraclass correlation expressing the likeness of students in the same school thus is estimated to be 0.18, while the intraclass correlation expressing the likeness of students in the same class (and hence also the same school) is estimated to be 0.33. In addition, one can also estimate the intraclass correlation that expresses the likeness of classes in the same school. This level-two intraclass correlation is estimated to be 2.124/(2.124 + 1.746) = 0.55. This is more than 0.5: the school level contributes slightly more to variability than the class level. The interpretation is that if one randomly takes two classes within one school and calculates the average mathematics achievement level in one of the two, one can predict reasonably accurately the average achievement level in the other class. Of course we could have estimated a two-level model as well, ignoring the class level, but that would have led to a redistribution of the class-level variance to the two other levels, and it would affect the validity of hypothesis tests for added fixed effects.

Model 2 shows that the fixed effect of IQ is very strong, with a t-ratio (cf. Section 6.1) of 0.121/0.005 = 24.2.
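The various intraclass correlations of Example 4.7 follow directly from the three variance components reported above; a quick check:

    # Variance components of the empty three-level model (Table 4.5, Model 1).
    phi2, tau2, sigma2 = 2.124, 1.746, 7.816   # school, class, student level
    total = phi2 + tau2 + sigma2               # 11.686

    print(phi2 / total)             # 0.18: students in the same school
    print((phi2 + tau2) / total)    # 0.33: students in the same class and school
    print(phi2 / (phi2 + tau2))     # 0.55: classes in the same school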
(The intercept changes drastically because the IQ score does not have a zero mean; the conventional IQ scale, with a population mean of 100, was used.) Adding the effect of IQ leads to a stronger decrease in the class- and school-level variances than in the student-level variance. This suggests that schools and classes are rather homogeneous with respect to IQ and/or that intelligence may play its role partly at the school and class levels.

As in the two-level model, predictor variables at any of the three levels can be added. All features of the two-level model can be generalized to the three-level model quite straightforwardly: significance testing, model building, testing the model fit, centering of variables, etc., although the researcher should be more careful now because of the more complicated formulation. For example, for a level-one explanatory variable there can be three kinds of regressions. In the school example, these are the within-class regression, the within-school/between-class regression, and the between-school regression. Coefficients for these distinct regressions can be obtained by using the class means as well as the school means as explanatory variables with fixed effects.

Example 4.8 Within-class, between-class, and between-school regressions.
Continuing the previous example, we now investigate whether indeed the effect of IQ is in part a class-level or school-level effect; in other words, whether the within-class, between-class/within-school, and between-school regressions are different. Table 4.6 presents the results.

In Model 3, the effects of the class mean, IQ̄.jk, as well as the school mean, IQ̄..k, have been added. The class mean has a clearly significant effect (t = 0.106/0.013 = 8.15), which indicates that between-class regressions are different from within-class regressions. The school mean does not have a significant effect (t = 0.039/0.028 = 1.39), so there is no evidence that the between-school regressions are different from the between-class regressions. It can be concluded that the composition with respect to intelligence plays a role at the class level, but not at the school level.

Table 4.6 Estimates for three-level model with distinct within-class, within-school, and between-school regressions.

                                             Model 3             Model 4
    Fixed Effect                       Coefficient  S.E.   Coefficient  S.E.
    Intercept
    Coefficient of IQ                      0.107   0.005
    Coefficient of IQ − IQ̄.jk                                  0.107
    Coefficient of IQ̄.jk                   0.106   0.013
    Coefficient of IQ̄.jk − IQ̄..k                               0.212
    Coefficient of IQ̄..k                   0.039   0.028      0.252

    Random Effect                      Var. Comp.   S.E.   Var. Comp.   S.E.
    Level-three variance: var(V00k)
    Level-two variance: var(U0jk)          0.453   0.089
    Level-one variance: var(Rijk)
    Deviance

Like in Section 4.5, replacing the variables by deviation scores leads to an equivalent model formulation in which, however, the within-class, between-class, and between-school regression coefficients are given directly by the fixed parameters. In the three-level case, this means that we must use the following three variables:

IQijk − IQ̄.jk, the within-class deviation score of the student from the class mean;
IQ̄.jk − IQ̄..k, the within-school deviation score of the class mean from the school mean;
IQ̄..k, the school mean itself.

The results are shown as Model 4. We see here that the within-class regression coefficient is 0.107, equal to the coefficient of student-level IQ in Model 3; the between-class within-school regression coefficient is 0.212, equal (up to rounding errors) to the sum of the student-level and the class-level coefficients in Model 3; while the between-school regression coefficient is 0.252, equal to the sum of all three coefficients in Model 3. From Model 3 we know that the difference between the last two coefficients is not significant.

5 The Hierarchical Linear Model

In the previous chapter the simpler case of the hierarchical linear model was treated, where only intercepts are assumed to be random. In the more general case, slopes may also be random. For a study of pupils within schools, e.g., the effect of the pupil's intelligence or socio-economic status on scholastic performance could differ between schools. This chapter presents the general hierarchical linear model, which allows intercepts as well as slopes to vary randomly. The chapter follows the approach of the previous one: most attention is paid to the case of a two-level nesting structure, and the level-one units are called, for convenience only, 'individuals', while the level-two units are called 'groups'. The notation is also the same.

5.1 Random slopes

In the random intercept model of Chapter 4, the groups differ with respect to the average value of the dependent variable: the only random group effect is the random intercept. But the relation between explanatory and dependent variables can differ between groups in more ways. For example, in the educational field (nesting structure: pupils within classrooms), it is possible that the effect of socio-economic status of pupils on their scholastic achievement is stronger in some classrooms than in others. As an example in developmental psychology (repeated measurements within individual subjects), it is possible that some subjects progress faster than others. In the analysis of covariance, this phenomenon is known as heterogeneity of regressions across groups, or as group-by-covariate interaction. In the hierarchical linear model, it is modeled by random slopes.

Let us go back to a model with group-specific regressions of Y on one level-one variable X only, like model (4.3) but without the effect of Z,

    Yij = β0j + β1j xij + Rij .    (5.1)

The intercepts β0j as well as the regression coefficients, or slopes, β1j are group-dependent. These group-dependent coefficients can be split into an average coefficient and the group-dependent deviation:

    β0j = γ00 + U0j ,
    β1j = γ10 + U1j .    (5.2)

Substitution leads to the model

    Yij = γ00 + γ10 xij + U0j + U1j xij + Rij .    (5.3)

It is assumed here that the level-two residuals U0j and U1j, as well as the level-one residuals Rij, have means 0, given the values of the explanatory variable X. Thus, γ10 is the average regression coefficient, just like γ00 is the average intercept. The first part of (5.3), γ00 + γ10 xij, is called the fixed part of the model. The second part, U0j + U1j xij + Rij, is called the random part. The term U1j xij can be regarded as a random interaction between group and X. This model implies that the groups are characterized by two random effects: their intercept and their slope. We say that X has a random slope, or a random effect, or a random coefficient. These two group effects will usually not be independent, but correlated.
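A model of this form can be fitted by letting the software estimate a random slope in addition to the random intercept. A hedged sketch with statsmodels, continuing the hypothetical data frame df of the earlier sketches (re_formula specifies the random part):

    import statsmodels.formula.api as smf

    # Random intercept and random IQ slope, as in (5.3).
    rs = smf.mixedlm("langscore ~ iq", df, groups="school",
                     re_formula="~iq").fit()

    print(rs.cov_re)   # 2x2 matrix with tau0^2, tau01, tau1^2, cf. (5.4) below
    print(rs.scale)    # level-one residual variance sigma^2
    # With these simulated data no slope variation was generated, so tau1^2
    # will be estimated near zero (cf. the remarks on zero estimates in 4.6).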
It is assumed that, for different groups, the pairs of random effects (U0j, U1j) are independent and identically distributed, that they are independent of the level-one residuals Rij, and that all level-one residuals are independent and identically distributed. The variance of the level-one residuals Rij is again denoted σ²; the variances and covariance of the level-two residuals (U0j, U1j) are denoted as follows:

    var(U0j) = τ00 = τ0² ,
    var(U1j) = τ11 = τ1² ,    (5.4)
    cov(U0j, U1j) = τ01 .

Just like in the preceding chapter, one can say that the unexplained group effects are assumed to be exchangeable.

5.1.1 Heteroscedasticity

Model (5.3) implies not only that individuals within the same group have correlated Y-values (recall the residual intraclass correlation coefficient of Chapter 4), but also that this correlation as well as the variance of Y are dependent on the value of X. In an example, this can be understood as follows. Suppose that, in a study of the effect of socio-economic status (SES) on scholastic performance (Y), we have schools which do not differ in their effect on high-SES children, but which do differ in the effect of SES on Y (e.g., because of teacher expectancy effects). Then for children from a high-SES background it does not matter which school they go to, but for children from a low-SES background it does. The school then adds a component of variance for the low-SES children, but not for the high-SES children: as a consequence, the variance of Y (for a random child at a random school) will be larger for the former than for the latter children. Further, the within-school correlation between high-SES children will be nil, whereas between low-SES children it will be positive.

This example shows that model (5.3) implies that the variance of Y, given the value x on X, depends on x. This is called heteroscedasticity in the statistical literature. An expression for the variance of (5.3) is obtained as the sum of the variances of the random variables involved plus a term depending on the covariance between U0j and U1j (the other random variables are uncorrelated). Using (5.3) and (5.4), the result is

    var(Yij | xij) = τ0² + 2 τ01 xij + τ1² xij² + σ² ,    (5.5)

and, for two different individuals (i and i', with i ≠ i') in the same group,

    cov(Yij, Yi'j | xij, xi'j) = τ0² + τ01 (xij + xi'j) + τ1² xij xi'j .    (5.6)

Formula (5.5) implies that the residual variance of Y is minimal for xij = −τ01/τ1². (This is deduced by differentiation with respect to xij.) When this value is within the range of possible X-values, the residual variance first decreases and then increases again; if this value is smaller than all X-values, then the residual variance is an increasing function of x; if it is larger than all X-values, then the residual variance is decreasing.

5.1.2 Don't force τ01 to be 0!

The preceding discussion implies that the group effects depend on x: according to (5.3), this effect is given by U0j + U1j x. This is illustrated by Figure 5.1, which gives a hypothetical graph of the regression of school achievement (Y) on intelligence (X) in three schools.

Figure 5.1 Different vertical axes.

It is clear that there are slope differences between the three schools. Looking at the Y(1)-axis, there are almost no intercept differences between the schools. But if we add a value 10 to each intelligence score, then the Y-axis is shifted to the left by 10 units: the Y(2)-axis. Now the school that was the 'best' becomes the 'worst': there are strong intercept differences. If we would have subtracted 10 from the x-scores, we would have obtained the Y(3)-axis, with again intercept differences but now in reverse order.
This implies that the intercept variance τ0² and also the intercept-by-slope covariance τ01 depend on the origin (0-value) of the X-variable. From this we can learn two things:

(1) Since the origin of most variables in the social sciences is arbitrary, in a random slope model the intercept-by-slope covariance should be a free parameter estimated from the data, and not a priori constrained to the value 0 (i.e., left out of the model).

(2) In random slope models we should be careful with the interpretation of the intercept variance and the intercept-by-slope covariance, since the intercept refers to an individual with x = 0. For the interpretation of these parameters it is helpful when the scale for X is defined so that x = 0 has an interpretable meaning, preferably as a reference situation. For example, in repeated measurements where X refers to time or measurement number, x = 0 could correspond to the start, or the end, of the measurements. In nesting structures of individuals within groups, it is often convenient to let x = 0 correspond to the overall mean of the population or the sample; e.g., if X is IQ at the conventional scale with mean 100, it is advised to subtract 100 to obtain a population mean of 0.

5.1.3 Interpretation of random slope variances

For the interpretation of the variance of the random slopes, τ1², it is illuminating to take also the average slope, γ10, into consideration. Model (5.3) implies that the regression coefficient, or slope, for group j is γ10 + U1j. This is a normally distributed random variable with mean γ10 and standard deviation τ1 = √τ1². Since about 95 percent of the probability of a normal distribution is within two standard deviations from the mean, it follows that approximately 95 percent of the groups have slopes between γ10 − 2τ1 and γ10 + 2τ1. Conversely, about one in forty groups has a slope less than γ10 − 2τ1, and one in forty has a slope steeper than γ10 + 2τ1.

Example 5.1 A random slope for IQ.
We continue the examples of Chapter 4 where the effect of IQ on a language test score was studied. Recall that IQ is here on a scale with mean 0 and standard deviation about 2. A random slope of IQ is added to the model of Example 4.3:

    Yij = γ00 + γ10 xij + γ01 x̄.j + U0j + U1j xij + Rij .

The results can be read from Table 5.1. Note that the heading 'Level-two random effects' refers to the random intercept and random slope, which are effects associated with the level-two units (the classes), but that the variable that has the random slope, IQ, is itself a level-one variable.

Table 5.1 Estimates for random slope model.

    Fixed Effect                             Coefficient    S.E.
    γ00 = Intercept                              40.75
    γ10 = Coefficient of IQ                       2.450
    γ01 = Coefficient of IQ̄ (group mean)          1.405

    Random Effect                     Variance Component    S.E.
    Level-two random effects:
    τ0² = var(U0j)                                7.92
    τ1² = var(U1j)                                0.20
    τ01 = cov(U0j, U1j)                          −0.82
    Level-one variance: σ² = var(Rij)            41.35
    Deviance

Figure 5.2 presents a sample of fifteen regression lines, randomly chosen according to the model of Table 5.1. (The values of the group mean IQ were chosen randomly from a normal distribution with mean 0.127 and standard deviation 1.005, which are the mean and standard deviation of the group mean of IQ in this data set.) This figure thus demonstrates the population of regression lines that characterizes, according to this model, the population of schools.

Figure 5.2 Fifteen random regression lines according to the model of Table 5.1 (with randomly chosen intercepts and slopes).

The random slope variance is estimated as τ̂1² = 0.20, so the estimated standard deviation of the slopes is √0.20 = 0.45. The average slope is estimated as 2.45; hence approximately 95 percent of the classes have slopes between 2.45 − 2 × 0.45 = 1.55 and 2.45 + 2 × 0.45 = 3.35, so the effect of IQ is positive in all or almost all classes. The estimated intercept-slope correlation is

    ρ(U0j, U1j) = −0.82 / √(7.92 × 0.20) = −0.65 .
Recall that all variables are centered (i.e., they have zero means), so that the intercept corresponds to the language test score for a pupil with an average intelligence in a class with an average mean intelligence. This negative correlation between slope and intercept means that classes with a higher performance for a pupil of average intelligence have a lower within-class effect of intelligence. Thus, the higher average performance tends to be achieved more by higher language scores of the less intelligent pupils than by higher scores of the more intelligent pupils.

In a random slope model, the within-group coherence cannot be simply expressed by the intraclass correlation coefficient. The reason is that, in terms of the present example, the correlation between pupils in the same class depends on their intelligence. Thus, the extent to which a given classroom deserves to be called 'good' varies across pupils.

To investigate how the contribution of classrooms to pupils' performance depends on IQ, consider the equation implied by the parameter estimates:

    Yij = 40.75 + 2.450 IQij + 1.405 IQ̄.j + U0j + U1j IQij + Rij .

Recall from Example 4.2 that the standard deviation of the IQ score is about 2, and the mean is 0. Hence pupils with an intelligence among the bottom few percent or the top few percent have IQ scores of about ±4. Substituting these values in the contribution of the random effects gives U0j ± 4 U1j. It follows from equations (5.5) and (5.6) that, for pupils with IQ = ±4, we have

    var(Yij | IQij = −4) = 7.92 + 2 × (−0.82) × (−4) + (−4)² × 0.20 + 41.35 = 59.03 ,
    cov(Yij, Yi'j | IQij = −4, IQi'j = +4) = 7.92 − 16 × 0.20 = 4.72 ,
    var(Yij | IQij = +4) = 7.92 − 8 × 0.82 + 16 × 0.20 + 41.35 = 45.91 ,

and therefore

    ρ(Yij, Yi'j | IQij = −4, IQi'j = +4) = 4.72 / √(59.03 × 45.91) = 0.09 .

Hence, the language test scores of the most intelligent and the least intelligent pupils in the same class are positively correlated over the population of classes: classes that have relatively good results for the less able tend also to have relatively good results for the more able students.

This positive correlation corresponds to the result that the value of x for which the variance given by (5.5) is minimal lies outside the range from −4 to +4. For the estimates in Table 5.1, this variance is

    var(Yij | IQij = x) = 7.92 − 1.64 x + 0.20 x² + 41.35 ,

which is minimal for x = 1.64/0.40 = 4.1, just outside the IQ range from −4 to +4. This again implies that classes tend mostly to perform systematically higher or lower over the entire range of IQ. This is illustrated also by Figure 5.2 (which, however, also contains some regression lines that cross each other within the range of IQ).

5.2 Explanation of random intercepts and slopes

Regression analysis aims at explaining variability in the outcome (i.e., dependent) variable. Explanation is understood here in a quite limited way, viz., as being able to predict the value of the dependent variable from knowledge of the values of the explanatory variables. The unexplained variability in single-level multiple regression analysis is just the variance of the residual term. Variability in multilevel data, however, has a more complicated structure. This is related to the fact, mentioned in the preceding chapter, that several populations are involved in multilevel modeling: one population for each level.
Explaining variability in a multilevel structure can be achieved by explaining variability between individuals, but also by explaining variability between groups; if there are random slopes as well as random intercepts, at the group level one could try to explain the variability of slopes as well as intercepts.

In the model defined by (5.1) – (5.3), some variability in Y is explained by the regression on X, i.e., by the term γ10 xij; the random coefficients U0j, U1j, and Rij each express different parts of the unexplained variability. In order to try to explain more of the unexplained variability, all three of these can be the point of attack. In the first place, one can try to find explanations in the population of individuals (at level one). The part of residual variance that is expressed by σ² = var(Rij) can be diminished by including other level-one variables. Since group compositions with respect to level-one variables can differ from group to group, inclusion of such variables may also diminish residual variance at the group level. A second possibility is to try to find explanations in the population of groups (at level two). If we wish to reduce the unexplained variability associated with U0j and U1j, we can also say that we wish to expand equations (5.2) by predicting the group-dependent regression coefficients β0j and β1j from level-two variables Z. Supposing for the moment that we have one such variable, this leads to regression formulae for β0j and β1j on the variable Z,

    β0j = γ00 + γ01 zj + U0j ,    (5.7)
    β1j = γ10 + γ11 zj + U1j .    (5.8)

In words, the βs are treated as dependent variables in regression models for the population of groups; however, these are 'latent regressions', because the βs cannot be observed without error. Equation (5.7) is called an intercepts-as-outcomes model, and (5.8) a slopes-as-outcomes model.¹

¹ In the older literature, these equations were applied to the estimated groupwise regression coefficients rather than the latent coefficients. The statistical estimation then was carried out in two stages: first ordinary least squares ('OLS') estimation within each group, next OLS estimation with the estimated coefficients as outcomes. This method is statistically inefficient and does not differentiate the 'true score' variability of the latent coefficients from the sampling variability of the estimated groupwise regression coefficients. We do not treat this two-stage method.

5.2.1 Cross-level interaction effects

This chapter started with the basic model (5.1), reading

    Yij = β0j + β1j xij + Rij .

Substituting (5.7) and (5.8) in this equation leads to the model

    Yij = (γ00 + γ01 zj + U0j) + (γ10 + γ11 zj + U1j) xij + Rij
For the defii ‘4244, the main effect coefficient m9 of X is to be interpreted as the cfiect of X for cases with Z = 0, while the main effect coeficent nor of 7 's to be interpreted as the effect of Z for cuses with X = 0. Example 8.2. Crosplevel interaction between 19 and group size ‘The group sit of the school clases yields a pictial explanation of the clase: dependent slones. Group size ranges from § to 31, with a mean of 201 ‘The school varlable Zs is defzed as group size minis 281.” (The name Z, 's implicitly wed already for IQ, the group mean ef 1Q.) Whes this aeiabie bs added to the model of Example 5.1, the parameter estimates presented ie ‘Table 5:2 are obtained, ‘The value of Za ranges from about -18 to about +14. The clase dependent regression cocficient of 10, cf. (58), is ye + ‘a2, + Uyy, eximeted as 2443 0.022554 Uiy. Fr 4 ranging betwoen —18 anil 414, the feed port of Cross-level interactions can be considered on the basis of two different Kinds of argument. The above presentation is in line with an inductive argument: if a researcher finds a significant random slope variance, she ‘may be led to think of level-two variables that could explain the random slope. An alternative approach is to base the cross-level interaction on substantive (theoretical) arguments formulated before looking at ehe data, Beplanation of random intercepts and slopes 6 ‘table 5.2 Batimates for model with random slope and crom-level interaction. eee Coxe eee Se a once of 1 2 Content 1G 138 Coates of 2 ost 73 =Cocicnt of Z5x1Q 022 fmponent_ SE. Random Effct____Varance Componert_S.6. “Tevet too rendom Der paves) 1 19 waved) a om ee) am oss eee a6 Ls m Devian The reearchr hen i ed to fatimate and ts the rom evel interaction fect irespetive of whether aan slope vsance maa found, If on level interaction effect exist, the power of the statistical test, x effect is considerably higher thaa the power of the test for the coresponding random slope (assuming that the same model serves asthe nul) hypothesis} Taerfor it ono contradict t ok for a specie cos eve interaction even ifno significant random slope was found. This is further elaborated in the last part of subsection 6.4.1. More variables ‘The preceding models can be extended by including more variables that have random ees, and more variables epenng thse random cs Sappose that thee arp leveLene explanatory varies Xin. Xp and ¢ lee explanatory vaables Ziq, Then I the rere! nt alraid of a model with too many parameters, he ean consider the m: wher all Xara have varying Hope, and whee the random intercept a5 well as all these slopes are explained by ail Zvaviables. At the within- soup level, Le, for the individuals, the model then is @ regression model with p variables, tas # ‘sx Vij = Bag + Bag uy +--+ Bos tps + Pas : ‘The explanation of the regresion coefficients os to By Is based on the ‘between-group model, which is a g-variable regression model for the group- dependent coefficient Js, em) Bas =o + 7a 24j + oe + Me zas + Urs» . Substitution of (6.11) n (5.10) and rearrangement of terms then yields the model 76 ‘The hierarchical near model Yer + Srwsw + Samay + Sawaya ine z Hay + Stiga + Ry (51 This shows that we obtain main effects of each X and Z variable as well all crosslevel product interactions. Further, we see the reason why, in formula (4.7), the fixed coeficients were called 7g for the levelone variable Xn and roe for the levetwo variable Zs. 
The groups are now characterized by p+ random coeficients Voy to Uy ‘These random coefficients are independent between groups, but may be o2, ‘elated within groups. It is assumed that the vector (Uay,.-,Uyy) is inde. Pendent of the level-one residuals Rij and that al residuals have Ropulation ‘means 0, given the values of all explanatory variables. It is also sesumed that the levol-one residual Ry has a normal distribution with constant vest, ance o” and that (Uoj,...,Ups) has a multivariate normal distribution with & constant covariance matrix. Analogous to (5.4), the variances and covarl ances of the leveltwo random effects are denoted var(Uns) = ts = 18 (h= 1.040); covtUass Tay) = me (k= 1,...4p) 6.13) Example 5.3 A model with many fced effects. For this example, it must be noted that there isa distinction between class and group. The clas isthe set of pupils who are being taugbs fy the sae teacher in the same clasroom. The group isthe subset of those yp is ake Gass who are in grade 8. Some clases are combinations of grade 7 and grade 8 pupils Only the grade 8 pupils are part of cis data set. According {aviable COMB is defined that indicates whether a cass such a multrade cass (COMB = 1; 59 clase), or entirely composed of grade 8 pupils (COMS 78 classes) Tae model forthe laaguage test score inthis example include main effects ‘and crosrlevel interactions between the following variables, Pupil lese! + 1Q (as used in the preceding examples) + SES = social-economic status ofthe pupil's family (a numerical viable ‘with mean 0 and standard deviation 10.9), Chass level + TQ = average 19 in the group + GS = group size + COMB = indicator of mult-grade clases (Te class average of SES is not included in this example, because other Analyses showed that this variable bas ao significant effet in other words, there is no significant diference becween the within-group and the becwees’ ‘0up regressions on SES.) Explanation of random intercepts and slopes 7 eee Baoan dates nso gt as ae ar oe eee ae rene eee ae Sia ergata ee = . “Sotinatag model (5.12) (with p = 2, q = 3) leads to the results presented Tecan Sec ore ‘Table 5.3 Estimates for model with random slopeo and many effects. Ceficient of GS Coeelet of 19 x 1G Coeielnt of 19 x G3 (Coeient of 1G x COMB, ite smal, mode (512) ental « umber of ta ates p and gare que smal, co tistical parametes that usualy istoo larg for comfort. "Therefore, two aon te oan used * {a} Not all X-rablen ae considered to have random som Note at th craton ate tae stops by the Pl ye eo sorte, a coin a eo gaat tt fon O (esting the parameter cused in Chaps 6) oy that even tatmate to be 0 “ Se coofients hy of certain variable Xy are vasa ‘Gar 'goupe, tb aot smear) to we al vaables Zy for explaining tc way The ue craton tt isang cach fy Dy cay a welcheoen aoe othe Bye ® Gitano tao oe nd ih ced ttn is wos, peas on sbjot wae a wel a emisal conde one "the nana! snp of toting and wel Bing re talon 8 ‘The hierarchical linear model Example 5:4 A parsimonious model inthe case of many vara In Table 53, he random slope valance of SES wetted ty (happens ocesonaly o ston 88) Thule, hi andom dope's elsed oe We sale in Chapter 6, cha sigsifcane of iad eet canbe soplyng a tet tothe ratio etiate to standard eon Tha eek inf othe mod ewe all cemiod nterainy te eae hae interaction between 1Q and COMB. The resul timates are di in Table 5.4. Se ‘Table 8.4 Estimates for a more parsimonious model with a random slope and many effects. 
The fixed part of Table 5.4 contains the intercept and the coefficients of IQ, SES, IQ̄, COMB, and IQ × COMB; the random part contains the intercept variance, the IQ-slope variance and the intercept-slope covariance, the level-one residual variance, and the deviance.

The estimates of the remaining effects do not change much compared to Table 5.3, but the standard errors of many fixed coefficients decrease. This can be explained by the omission of the many non-significant effects. Note that COMB is not a centered variable, so it has a positive mean, whereas all other explanatory variables have zero means. Therefore, the intercept is the expected score of a pupil with average characteristics in a class composed entirely of grade 8 pupils (COMB = 0), and the regression coefficient of IQ is the effect of IQ in such classes. The interaction effect gives the additional effect of IQ in a multi-grade class, so the effect of IQ in multi-grade classes is the sum of the main effect and the interaction effect.
Compared to Example 5.2, it turns out that it is not group size but COMB that seems to have an effect: a main effect as well as an interaction effect with IQ. Pupils in multi-grade classes (COMB = 1) tend to obtain lower language scores, with a higher effect of intelligence. The unexplained class-dependent slope of language score on IQ is smaller than in Table 5.2, where the interaction of IQ with GS (group size) instead was included in the model.

The model found can be expressed as a model with variable intercepts and slopes,

Y_ij = β_0j + β_1j x_1ij + β_2j x_2ij + R_ij ,

where X_1 is IQ and X_2 is SES. The intercept is

β_0j = γ_00 + γ_01 z_1j + γ_02 z_2j + U_0j ,

where Z_1 is average IQ and Z_2 is COMB. The coefficient of X_1 is

β_1j = γ_10 + γ_12 z_2j + U_1j ,

while the coefficient of X_2 is not variable, β_2j = γ_20.

5.2.2 A general formulation of fixed and random parts
Formally, and in many computer programs, these simplifications lead to a representation of the hierarchical linear model that is slightly different from (5.12). (For example, the HLM program uses the formulations (5.10)-(5.12), whereas MLn uses formulation (5.14).) Whether a level-one variable was obtained as a cross-level interaction or not is immaterial to the computer program. Even the difference between level-one variables and level-two variables, although possibly relevant for the way the data are stored, is not of any importance for the parameter estimation. Therefore, all variables (level-one and level-two variables, including product interactions) can be represented mathematically simply as x_hij. When there are r explanatory variables, ordered so that the first p have fixed and random coefficients, while the last r − p have only fixed coefficients*, the hierarchical linear model can be represented as

Y_ij = γ_0 + Σ_{h=1}^{r} γ_h x_hij + U_0j + Σ_{h=1}^{p} U_hj x_hij + R_ij .   (5.14)

The two terms

γ_0 + Σ_{h=1}^{r} γ_h x_hij   and   U_0j + Σ_{h=1}^{p} U_hj x_hij + R_ij

are the fixed and the random parts of the model, respectively.
In cases where the explanation of the random effects works extremely well, one may end up with models without any random effects at level two. In other words, the random intercept U_0j and all random slopes U_hj in (5.14) have zero variance, and may just as well be omitted from the formula. In this case, the resulting model may be analysed just as well with OLS regression analysis, because the residuals are independent and have constant variance. Of course, this is known only after the multilevel analysis has been carried out. In such a case, the within-group dependence between measurements has been fully explained by the available explanatory variables (and their interactions).
This underlines that whether the hierarchical linear model is a more adequate model for analysis than OLS regression depends not on the dependence of the measurements, but on the dependence of the residuals.

* It is mathematically possible that some variables have a random but not a fixed effect. This makes sense only in special cases.

5.3 Specification of random slope models
Given that random slope models are available, the researcher has many options to model his data. Each predictor may be assigned a random slope, and each random slope may covary with any other random slope. Parsimonious models, however, should be preferred, if only for the simple reason that a strong scientific theory is general rather than specific. A good guide for choosing between a fixed or a random slope for a given predictor variable should preferably be found in the theory that is being investigated. If the theory (whether this is a general scientific theory or a practical policy theory) does not give any clue with respect to a random slope for a certain predictor variable, then one may be tempted to refrain from using random slopes. However, this implies a risk of invalid statistical tests, because if some variable does have a random slope, then omitting this feature from the model could affect the estimated standard errors of the other variables. The specification of the hierarchical linear model, including the random part, is discussed more fully in Section 6.4 and Chapter 9.
In data exploration, one can try various specifications. Often it appears that the chance of detecting slope variation is high for variables with strong fixed effects. This, however, is an empirical rather than a theoretical assertion. Actually, it may well be that when a fixed effect is almost zero, there does exist slope variation. Consider, for instance, the case where male teachers treat boys advantageously over girls, whereas for female teachers the situation is reversed. If half of the sample consists of male and the other half of female teachers, then, all other things being equal, the main gender effect on achievement will be absent, since in half of the classes the gender effect will be positive and in the other half negative. The fixed effect of students' gender then is zero, but the effect varies across classes (depending on the teacher's gender). In this example, of course, the random effect would disappear if one were to specify the cross-level interaction effect of teachers' gender with students' gender.

5.3.1 Centering variables with random slopes?
Recall from Figure 5.1 that the intercept variance and the meaning of the intercept in random slope models depend on the location of the X variable. Also the covariance between the intercepts and the slopes depends on this location. In the examples presented so far we have used an IQ score for which the grand mean was zero (the original score was transformed by subtracting the grand mean IQ). This facilitated interpretation, since the intercept could be interpreted as the expected score for a student with average IQ. Making the IQ slope random did not have consequences for these meanings.
In Section 4.5 a model was introduced by which we could distinguish within-group from between-group regression. Two models were discussed:

Y_ij = γ_00 + γ_10 x_ij + γ_01 x̄_j + U_0j + R_ij   (4.8)

and

Y_ij = γ̃_00 + γ̃_10 (x_ij − x̄_j) + γ̃_01 x̄_j + U_0j + R_ij .   (4.9)

It was shown that γ̃_01 = γ_10 + γ_01 and γ̃_10 = γ_10, so that the two models are equivalent.
Are the models also equivalent when the effect of X_ij or (X_ij − X̄_j) is random across groups? This was discussed by Kreft, de Leeuw, and Aiken (1995). Let us first consider the extension of (4.8). Define the level-one and level-two models

Y_ij = β_0j + β_1j x_ij + R_ij ,
β_0j = γ_00 + γ_01 x̄_j + U_0j ,
β_1j = γ_10 + U_1j ;

substituting the level-two models into the level-one model leads to

Y_ij = γ_00 + γ_10 x_ij + γ_01 x̄_j + U_0j + U_1j x_ij + R_ij .

Next we consider the extension of (4.9):

Y_ij = β̃_0j + β̃_1j (x_ij − x̄_j) + R_ij ,
β̃_0j = γ̃_00 + γ̃_01 x̄_j + Ũ_0j ,
β̃_1j = γ̃_10 + Ũ_1j ;

substitution and rearrangement of terms now yields

Y_ij = γ̃_00 + γ̃_10 x_ij + (γ̃_01 − γ̃_10) x̄_j + Ũ_0j + Ũ_1j x_ij − Ũ_1j x̄_j + R_ij .

This shows that the two models differ in the term Ũ_1j x̄_j, which is included in the group-mean centered random slope model but not in the other model. Therefore, in general there is no one-to-one relation between the γ and the γ̃ parameters, so the models are not statistically equivalent, except for the extraordinary case where variable X has no between-group variability.
This implies that in constant slope models one can use either X_ij and X̄_j, or (X_ij − X̄_j) and X̄_j, as predictors, since this results in statistically equivalent models; but in random slope models one should carefully choose one or the other specification.
On which considerations should this choice be based? Generally one should be reluctant to use group-mean centered random slope models unless there is a clear theory (or an empirical clue) that not in the first place the absolute score X_ij but rather the relative score (X_ij − X̄_j) is related to Y_ij. Now (X_ij − X̄_j) indicates the relative position of an individual in his or her group, and examples of instances where one may be particularly interested in this variable are:
+ research on normative or comparative social reference processes (e.g., Guldemond, 1994),
+ research on relative deprivation,
+ research on teachers' rating of student performance.

5.4 Estimation
What was mentioned in Section 4.6 can be applied, with the necessary extensions, also to the estimation of parameters in the more complicated model (5.14). A number of iterative estimation algorithms have been proposed, e.g., by Laird and Ware (1982), Goldstein (1986), and Longford (1987), and are now implemented in multilevel software.
The following may give some intuitive understanding of the estimation methods. If the parameters of the random part, i.e., the parameters in (5.13) together with σ², were known, then the regression coefficients could be estimated straightforwardly with the so-called generalized least squares ('GLS') method. Conversely, if all regression coefficients γ_hk were known, the total 'residuals' (which seems an apt name for the second line of equation (5.12)) could be computed, and their covariance matrix could be used to estimate the parameters of the random part. These two partial estimation processes can be alternated: use provisional values for the random part parameters to estimate regression coefficients, use the latter estimates to estimate the random part parameters again (and now better), then go on to estimate the regression coefficients again, and so on ad libitum, or, rather, until convergence of this iterative process.
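To see the distinction in practice, the following sketch (Python with pandas and statsmodels; the data frame and column names y, x, and school are hypothetical) builds the group mean and the deviation score and fits both random-slope specifications; with fixed slopes the two parametrizations would have equal likelihoods, with random slopes they generally do not.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_both_centerings(df):
    df = df.copy()
    # Group mean of X (written above as x-bar_j) and the within-group
    # deviation score.
    df["x_mean"] = df.groupby("school")["x"].transform("mean")
    df["x_dev"] = df["x"] - df["x_mean"]

    # Raw-score specification: random slope for x itself.
    m_raw = smf.mixedlm("y ~ x + x_mean", df, groups="school",
                        re_formula="~x").fit(reml=False)
    # Group-mean centered specification: random slope for the deviation.
    m_cen = smf.mixedlm("y ~ x_dev + x_mean", df, groups="school",
                        re_formula="~x_dev").fit(reml=False)

    # The deviances will in general differ, because of the extra
    # U_1j * x-bar_j term in the centered model.
    return -2 * m_raw.llf, -2 * m_cen.llf
```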
This loose description is close to the iterated generalized least squares ('IGLS') method, which is one of the algorithms used to calculate the ML estimates. There exist other methods (one called Fisher scoring, treated in Longford (1993); the other called EM, for Expectation-Maximization, treated in Bryk and Raudenbush (1992)) which calculate the same estimates, each with its own advantages.
Parameters can again be estimated with the ML or with the REML method; the REML method is preferable in the sense that it produces less biased estimates for the random part parameters in the case of small sample sizes, but the ML method is more convenient if one wishes to use deviance tests (see the next chapter). The IGLS algorithm produces ML estimates, whereas the so-called RIGLS ('restricted IGLS') algorithm yields the REML estimates.
For the random slopes model it is also possible that estimates for the variance parameters τ_h² are exactly 0. The explanation is analogous to the explanation given for the intercept variance in Section 4.6.
The random group effects U_hj can again be 'estimated' by the empirical Bayes method, and the resulting 'estimates' are called posterior slopes (sometimes posterior means). This is analogous to what is treated in Section 4.7 about posterior means.
Usually the estimation algorithms do not allow one to include an unlimited number of random slopes. Depending on the data set and the model specification, it is not uncommon that the algorithm refuses to converge for more than two or three variables with random slopes. Sometimes the convergence can be improved by linearly transforming the variables with random slopes so that they have (approximately) zero means, or by transforming them to have (approximately) zero correlations.
For some data sets the estimation method can produce estimated variance and covariance parameters that correspond to impossible covariance matrices for the random effects at level two; e.g., τ_01 sometimes is estimated larger than τ_0 × τ_1. This would imply an intercept-slope correlation larger than 1. This is not an error of the estimation procedure, and it can be understood as follows. The estimation procedure is directed at the mean vector and covariance matrix of the vector of all observations. Some combinations of parameter values correspond to permissible structures of the latter covariance matrix that, nevertheless, cannot be formulated as a random effects model such as (5.3). Even if the estimated values for the τ parameters do not combine into a positive definite matrix τ, the σ² parameter will still make the covariance matrix of the original observations (cf. equations (5.5) and (5.6)) positive definite. Therefore, such a strange result is in contradiction to the random effects formulation (5.3), but not to the more general formulation of a patterned covariance matrix for the observations.
Most computer programs give the standard errors of the variances of the random intercept and slopes; some give the standard errors of the estimated standard deviations instead. The two standard errors can be transformed into each other by the approximation formula

S.E.(τ̂²) ≈ 2 τ̂ S.E.(τ̂) .   (5.15)

However, some caution is necessary in the use of these standard errors.
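Formula (5.15) is simple enough to apply directly; a minimal helper (plain Python, function names my own) converts between the two kinds of reported standard errors:

```python
def se_variance_from_se_sd(tau_hat, se_tau_hat):
    # Approximation (5.15): S.E.(tau^2) is about 2 * tau * S.E.(tau).
    return 2.0 * tau_hat * se_tau_hat

def se_sd_from_se_variance(tau_hat, se_tau2_hat):
    # The same relation inverted: S.E.(tau) is about S.E.(tau^2) / (2 tau).
    return se_tau2_hat / (2.0 * tau_hat)
```

The caution expressed next applies to either form.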
The estimated value plus or minus twice the standard error is a valid approximation of a confidence interval only if the relative standard error of τ̂² (i.e., standard error divided by parameter estimate) is small, say, less than 1/4.

5.5 Three and more levels
When the data have a three-level hierarchy, slopes of level-one variables can be made random at level two and also at level three. In this case there will be at least two level-two and two level-three equations: one for the random intercept and one for the random slope. So, in the case of one explanatory variable, the model might be formulated as follows:

Y_ijk = β_0jk + β_1jk x_ijk + R_ijk   (level-one model)
β_0jk = δ_00k + U_0jk   (level-two model for the intercept)
β_1jk = δ_10k + U_1jk   (level-two model for the slope)
δ_00k = γ_000 + V_00k   (level-three model for the intercept)
δ_10k = γ_100 + V_10k   (level-three model for the slope)

In the specification of such a model, for each level-one variable with a random slope it has to be decided whether its slope must be random at level two, random at level three, or both. Generally one should have either strong a priori knowledge or a good theory to formulate models as complex as this one, or even more complex models (i.e., with more random slopes). Further, for each level-two variable it must be decided whether its slope is random at level three.

Example 5.5 A three-level model with a random slope.
We continue with the example of Section 4.8, where we illustrated the three-level model using a data set about a math test administered to students in classes in schools. Now we include the available covariates (which are all centered around their grand means), and moreover the regression coefficient for the mathematics pretest is allowed to be random at level two and level three. The results are in Table 5.5.

Table 5.5 Estimates for three-level model with random slopes. The fixed part contains the intercept and the coefficients of IQ, the mathematics pretest (0.146), achievement motivation, the educational level of the father, and gender; the random part contains the intercept variances at levels three and two, the pretest slope variances at level three (0.0024) and level two (0.0019), and the level-one residual variance.

The interpretation of the fixed part is straightforward, as in conventional single-level regression models. The random part is more complicated. Since all predictor variables were grand mean centered, the intercept variances at level three and level two have a clear meaning: they represent the amount of variation in mathematics achievement across schools and across classes within schools, respectively, for the average student, whilst controlling for differences in IQ, mathematics ability, achievement motivation, educational level of the father, and gender. Comparing this table with Table 4.5 shows that much of the initial level-three and (especially) level-two variation has now been accounted for. Once there is control for initial differences, schools and classes within schools differ considerably less in the average mathematics achievement of their students at the end of grade two.
Now we turn to the slope variance. The fixed slope coefficient for the mathematics pretest is estimated to be 0.146. The variance of this slope at level three is 0.0024, and at level two 0.0019. So the variability between schools of the effect of the pretest is somewhat larger than the variability of this effect between classes.
At one end of the distribution there are a few percent of schools where the effect of the pretest is only about 0.146 − 2 √0.0024 ≈ 0.048, whereas in the most effective schools this effect is about 0.146 + 2 √0.0024 ≈ 0.244. Since, in addition, the effect of the pretest also varies within schools, the gap between initially low and initially high achievers (4 standard deviations apart; the standard deviation of the pretest is 8.21) within one school can become as big as (0.146 + 2 √(0.0024 + 0.0019)) × (4 × 8.21) ≈ 9 points, whereas, on the other hand, it can be as small as (0.146 − 2 √(0.0024 + 0.0019)) × (4 × 8.21) ≈ 0.5 points in the least selective schools. Given the standard deviation of about 8.4 for the dependent variable, a difference of 0.5 points is very small, whereas 9 points is quite a large difference.

6 Testing and Model Specification

(It is assumed for this chapter that the reader has a basic knowledge of statistical testing: null hypothesis, alternative hypothesis, errors of the first and the second kind, significance level, statistical power.)

6.1 Tests for fixed parameters
Suppose we are working with a model represented as (5.14). The null hypothesis that a certain regression parameter is 0, i.e.,

H0: γ_h = 0 ,   (6.1)

can be tested by a t-test. The statistical estimation leads to an estimate γ̂_h with associated standard error S.E.(γ̂_h). Their ratio is a t-value,

T(γ_h) = γ̂_h / S.E.(γ̂_h) .   (6.2)

One-sided as well as two-sided tests can be carried out on the basis of this test statistic.¹ Under the null hypothesis, T(γ_h) has approximately a t-distribution, but the number of degrees of freedom (d.f.) is somewhat more complicated than in multiple linear regression, because of the presence of the two levels. The approximation by the t-distribution is not exact even if the normality assumption for the random coefficients holds. Suppose first that we are testing the coefficient of a level-one variable (in accordance with the specification of model (5.14), a cross-level interaction variable is also considered a level-one variable). If the total number of level-one units is M and the total number of explanatory variables is r, then we can take d.f. = M − r − 1. For testing the coefficient of a level-two variable, when there are N level-two units and q explanatory variables at level two, we can take d.f. = N − q − 1. If the number of units minus the number of variables is large enough, say, larger than 40, the t-distribution can be replaced by a standard normal distribution.

¹ This is one of the common principles for construction of a test. This type of test is called the Wald test, after the statistician Abraham Wald (1902-1950).

Example 6.1 Testing within- and between-group regressions.
We wish to test whether the between-group and within-group regressions of the language test score on IQ are different from one another, when controlling for socio-economic status (SES). A model with a random slope for IQ is used. Two models are estimated and presented in Table 6.1. The first contains the raw IQ variable along with the group mean IQ̄; the second contains the within-group deviation variable IQ̃, defined as

IQ̃_ij = IQ_ij − IQ̄_j ,

also together with the group mean. To test whether within- and between-group regression coefficients are different, the significance of the group mean is tested in Model 1, i.e., while controlling for the raw level-one variable. The difference between within-group and between-group regressions is discussed further in Section 4.5.

Table 6.1 Estimates for two models with different parametrizations of the within- and between-group regressions. Both models contain the intercept, the group mean IQ̄, and SES; Model 1 contains the raw IQ variable, while Model 2 contains the within-group deviation IQ̃; the random part (intercept variance, IQ-slope variance, intercept-slope covariance, level-one residual variance) and the deviance are identical for the two models.

The table shows that only the estimated coefficient of IQ̄ differs between the two models.
This is in accordance with Section 4.5: the within-group regression coefficient is γ_10 in Model 1 as well as in Model 2, while the between-group regression coefficient is γ_10 + γ_01 in Model 1 and γ_01 in Model 2. The two models are equivalent representations of the data and differ only in the parametrization.
The within-group and between-group regressions are the same if and only if the coefficient of IQ̄ is 0 in Model 1, i.e., in the model that also contains the raw variable (the level-one variable without group-centering). The test statistic for H0: γ_01 = 0 in Model 1 is strongly significant (p < 0.0002). It may be concluded that the within-group and between-group regressions are significantly different.
The results for Model 2 can be used to test whether the within-group or between-group regressions are 0. The test statistics, the ratios of estimate to standard error for IQ̃ and for IQ̄ in Model 2, are both extremely significant. Concluding: there are positive within-group as well as between-group regressions, and these are different from one another.

6.1.1 Multi-parameter tests for fixed effects
Sometimes we wish to test several regression parameters simultaneously. For example, consider testing the effect of a categorical variable with more than two categories. The effect of such a variable can be represented in the fixed part of the hierarchical linear model by c − 1 dummy variables, where c is the number of categories, and this effect is nil if and only if all the corresponding c − 1 regression coefficients are 0. Two types of test are much used for this purpose: the multivariate Wald test and the likelihood ratio test, also known as the deviance test. The latter test is explained in the next section.
For the multivariate Wald test, we need not only the standard errors of the estimates but also the covariances among them. Suppose that we consider a certain vector γ of q regression parameters, for which we wish to test the null hypothesis

H0: γ = 0 .   (6.3)

The statistical estimation leads to an estimate γ̂ and an associated estimated covariance matrix Σ̂. From these, we can let a computer program calculate the test statistic, represented in matrix form by

γ̂' Σ̂⁻¹ γ̂ .   (6.4)

Under the null hypothesis, the distribution of this statistic can be approximated by the chi-squared distribution with q degrees of freedom.²
The way of obtaining tests presented in this section is not applicable to tests of whether parameters (variances or covariances) in the random part of the model are 0. The reason is the fact that, if a population variance parameter is 0, its estimate divided by its estimated standard error does not approximately have a t-distribution. Tests for such hypotheses are discussed in the next section.

6.2 Deviance tests
The deviance test, or likelihood ratio test, is a quite general principle for statistical testing. In applications of the hierarchical linear model this test is mainly used for multi-parameter tests and for tests about the random part of the model. The general principle is as follows.
When parameters of a statistical model are estimated by the maximum likelihood (ML) method, the estimation also provides the likelihood, which can be transformed into the deviance, defined as minus twice the natural logarithm of the likelihood. This deviance can be regarded as a measure of lack of fit between model and data, but (in most statistical models) one cannot interpret the values of the deviance directly; only differences in deviance values for several models fitted to the same data set are interpretable.
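Looking back at Section 6.1.1, the Wald statistic (6.4) is easy to compute once an estimate vector and its covariance matrix are available; a minimal sketch (Python with numpy and scipy; function name my own):

```python
import numpy as np
from scipy.stats import chi2

def multivariate_wald(gamma_hat, cov_hat):
    """Wald test of H0: gamma = 0, cf. (6.3)-(6.4).

    gamma_hat: vector of q estimated coefficients;
    cov_hat: their estimated q x q covariance matrix.
    """
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    stat = float(gamma_hat @ np.linalg.solve(cov_hat, gamma_hat))
    df = gamma_hat.size
    return stat, chi2.sf(stat, df)  # statistic and approximate p-value
```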
² This approximation neglects the fact that Σ̂ is estimated and not known exactly; it corresponds to using the standard normal distribution to test the value of (6.2). In principle, this could be taken into account by using the F rather than the chi-squared distribution. If the number of groups is large, the difference is not appreciable.

Suppose that two models are fitted to one data set: model M0 with m0 parameters and a larger model M1 with m1 parameters. So M1 can be regarded as an extension of M0, with m1 − m0 parameters added. Suppose that M0 is tested as the null hypothesis and M1 is the alternative hypothesis. Indicating the deviances by D0 and D1, respectively, their difference D0 − D1 can be used as a test statistic having a chi-squared distribution with m1 − m0 degrees of freedom. This type of test can be applied to parameters of the fixed as well as of the random part.
The deviance produced by the residual maximum likelihood (REML) method can be used in deviance tests only if the two models compared (M0 and M1) have the same fixed parts and differ only in their random parts.

Example 6.2 Test of the random intercept.
In Example 4.2, the random intercept model yields a deviance of D1 = 15251.8, while the OLS regression model has a deviance of D0 = 15477.7. There is m1 − m0 = 1 parameter added, the random intercept variance. The deviance difference is 225.9, immensely significant in a chi-squared distribution with d.f. = 1. This implies that, even when controlling for the effect of IQ, the differences between groups are strongly significant.

For example, suppose one is testing the fixed effect of a categorical explanatory variable with c categories. This categorical variable can be represented by c − 1 dummy variables. Model M0 is the hierarchical linear model with the effects of the other variables in the fixed part and with the given random part. Model M1 also includes all these effects; in addition, the c − 1 regression coefficients of the dummy variables have been added. Hence the difference in the number of parameters is m1 − m0 = c − 1. This implies that the deviance difference D0 − D1 can be tested in a chi-squared distribution with d.f. = c − 1. This test is an alternative for the multi-parameter Wald test treated in the preceding section. These tests will be very close to each other for intermediate and large sample sizes.

Example 6.3 Effect of a categorical variable.
In the data set used in Chapters 4 and 5, schools differ according to their denomination: public, catholic, protestant, or non-denominational private. To represent these four categories, three dummy variables are used, contrasting the last three against the first category. This means that all dummy variables are 0 for public schools, the first is 1 for catholic schools, the second is 1 for protestant schools, and the third is 1 for non-denominational private schools.
When the fixed effects of these c − 1 = 3 variables are added to the model presented in Table 5.2, which in this example has the role of M0 with deviance D0 = 15208.4, the deviance decreases to D1 = 15193.6. The chi-squared value is D0 − D1 = 14.8 with d.f. = 3, p < 0.005. It can be concluded that, when controlling for IQ and group size (see the specification of model M0 in Table 5.2), there are differences between the school types.
The estimated fixed effects (with standard errors) of the dummy variables are 1.70 (0.64) for the catholic, −0.0 (0.67) for the protestant, and 1.09 (1.30) for the non-denominational private schools.
These effects are relative to the public schools. This implies that the catholic schools achieve higher, controlling for IQ and group size, than the public schools, while the other two categories do not differ significantly from the public schools.

6.2.1 Halved p-values for variance parameters
Variances are by definition non-negative. When testing the null hypothesis that a variance of the random intercept or of a random slope is zero, the alternative hypothesis is therefore one-sided. This observation can indeed be used to arrive at a sharpened version of the deviance test for variance parameters. This principle was derived by Miller (1977) and Self and Liang (1987).
First consider the case that a random intercept is tested. The null model M0 then is the model without a random part at level two, i.e., all observations Y_ij are independent, conditional on the values of the explanatory variables. This is an ordinary linear regression model. The alternative model M1 is the random intercept model with the same explanatory variables. There is m1 − m0 = 1 additional parameter, the random intercept variance τ_0². For the observed deviances D0 of model M0 (this model can be estimated by ordinary least squares) and D1 of the random intercept model, the difference D0 − D1 is calculated. If D0 − D1 = 0, the random intercept variance is definitely not significant (it is estimated as being 0). If D0 − D1 > 0, the tail probability of the difference D0 − D1 is looked up in a table of the chi-squared distribution with d.f. = 1. The p-value for testing the significance of the random intercept variance is half this tail value.
Second, consider the case of testing a random slope. Specifically, suppose that model (5.14) holds, and the null hypothesis is that the last random slope variance is zero: τ_p² = 0. Under this null hypothesis, the p covariances τ_hp for h = 0, ..., p − 1 also are 0. The alternative hypothesis is the model defined by (5.14), which has m1 − m0 = p + 1 more parameters than the null model (one variance and p covariances). For example, if there are no other random slopes in the model (p = 1), m1 − m0 = 2. The same procedure is followed as for testing the random intercept: both models are estimated, yielding the deviance difference D0 − D1. If D0 − D1 = 0, the random slope variance is not significant. If D0 − D1 > 0, the tail probability of the difference D0 − D1 is looked up in a table of the chi-squared distribution with d.f. = p + 1. The p-value for testing the significance of the random slope variance is half this tail value.³

³ The argumentation can be given loosely as follows. If the variance parameter is 0, the probability is about 1/2 that the estimated value is 0, and also 1/2 that the estimated value is positive. For example, for estimating the level-two intercept variance τ² in an empty model, formula (3.11) can be used. If indeed τ² = 0, the probability is about 1/2 that this expression will be negative, and therefore truncated to 0. If the variance parameter is estimated as 0, the associated covariance parameters also are estimated as 0, with the result that all estimated parameters under model M1 are the same as under M0. This implies D1 = D0. The chi-squared distribution holds under the condition that the variance parameter is estimated at a positive value. It can be concluded that the null distribution of the deviance difference is a so-called mixed distribution, with probability 1/2 for the value 0, and probability 1/2 for a chi-squared distribution.
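In computational form, the halved p-value procedure is a few lines (Python with scipy; function name my own):

```python
from scipy.stats import chi2

def variance_parameter_test(dev_null, dev_alt, n_added):
    """Halved p-value deviance test for a variance parameter (6.2.1).

    dev_null, dev_alt: ML deviances of the models without and with the
    random effect; n_added: number of parameters added (1 for a random
    intercept, p + 1 for the p-th random slope with its p covariances).
    """
    diff = dev_null - dev_alt
    if diff <= 0:
        return 1.0  # variance estimated as 0: not significant
    return 0.5 * chi2.sf(diff, n_added)
```

For instance, a deviance difference of 14.0 with two added parameters gives 0.5 × P(χ² ≥ 14.0) ≈ 0.0005, the situation of Example 6.4 below.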
Example 6.4 Test of a random slope.
When comparing Tables 4.4 and 5.1, it can be concluded that m1 − m0 = 2 parameters are added and the deviance diminishes by D0 − D1 = 15227.5 − 15213.5 = 14.0. Testing the value of 14.0 in a chi-squared distribution with d.f. = 2 yields p < 0.001. Halving the p-value leads to p < 0.0005. Thus, the significance probability of the random slope for IQ in the model of Table 5.1 is p < 0.0005.
As another example, suppose that one wishes to test the significance of the random slope for IQ in the model of Table 5.4. Then the model must be fitted in which this effect is omitted and all other effects remain. Thus, the omitted parameters are τ_1² and τ_01, so that d.f. = 2. Fitting this reduced model leads to a deviance of D0 = 15101.5, which is 8.5 more than the deviance in Table 5.4. The chi-squared distribution with d.f. = 2 gives a tail probability of p < 0.02, so that halving the p-value yields p < 0.01. Thus, testing the random slope for IQ in Table 5.4 yields a significant outcome with p < 0.01.

6.3 Other tests for parameters in the random part
The deviance tests are very convenient for testing parameters in the random part, but other tests for random intercepts and slopes also exist. In Section 3.3.2, the ANOVA F-test for the intraclass correlation was mentioned. This is effectively a test for randomness of the intercept. If it is desired to test the random intercept while controlling for explanatory variables, one may use the F-test from the ANCOVA model, using the explanatory variables as covariates.
Bryk and Raudenbush (1992) present chi-squared tests for random intercepts and slopes, i.e., for variance parameters in the random part. They are based on calculating OLS estimates for the values of the random effects within each group and testing these values for equality. For the random intercept, these are the large-sample chi-squared approximations to the F-tests in the ANOVA or ANCOVA model mentioned in Section 3.3.2.
Another test for variance parameters in the random part was proposed by Berkhof and Snijders (1998), and is explained in Section 9.2.2.

6.4 Model specification
Model specification is the choice of a satisfactory model. For the hierarchical linear model this amounts to the selection of relevant explanatory variables (and interactions) in the fixed part, and relevant random slopes (with their covariance pattern) in the random part. Model specification is one of the most difficult parts of statistical inference, because there are two steering wheels: substantive (subject-matter related) and statistical considerations. These steering wheels must be handled jointly. The purpose of model specification is to arrive at a model that describes the observed data to a satisfactory extent, but without unnecessary complications. A parallel purpose is to obtain a model that is substantively interesting, without wringing from the data drops that are really based on chance but interpreted as substance. In linear regression analysis, model specification is already complicated (as is elaborated in many textbooks on regression, e.g., Ryan, 1997), and in multilevel analysis the number of complications is multiplied because of the complicated nature of the random part.
The complicated nature of the hierarchical linear model, combined with the two steering wheels for model specification, implies that there are no clear and fixed rules to follow.
Model specification is a process guided by the following principles.
1. Considerations relating to the subject matter. These follow from field knowledge, existing theory, detailed problem formulation, and common sense.
2. The distinction between effects that are indicated a priori as effects to be tested, i.e., effects on which the research is focused, and effects that are necessary to obtain a good model fit. Often the tested effects are a subset of the fixed effects, and the random part is to be fitted adequately but is of secondary interest. When there is no strong prior knowledge about which variables to include in the random part, one may follow a data-driven approach to select the variables for the random part.
3. A preference for 'hierarchical' models in the general sense (not the 'hierarchical linear model' sense): if a model contains an interaction effect, then usually also the corresponding main effects should be included (even if these are not significant); and if a variable has a random slope, it normally should also have a fixed effect. The reason is that omitting such effects may lead to erroneous interpretations.
4. Doing justice to the multilevel nature of the problem. This is done as follows.
(a) When a given level-one variable is present, one should be aware that the within-group regression coefficient may differ from the between-group regression coefficient, as described in Section 4.5. This can be investigated by calculating a new variable, defined as the group mean of the level-one variable, and testing the effect of the new variable.
(b) When there is an important random intercept variance, there are important unexplained differences between group means. One may look for level-two variables (original level-two variables as well as aggregates of level-one variables) that explain part of these between-group differences.
(c) When there is an important random slope variance of some level-one variable, say, X_1, there are important unexplained differences between the within-group effects of X_1 on Y. Here also one may look for level-two variables that explain part of these differences. This leads to cross-level interactions as explained in Section 5.2. Recall from this section, however, that cross-level interactions can also be expected from theoretical considerations, even if no significant random slope variance is found.
5. Awareness of the necessity of including certain covariances of random effects. Including such covariances means that they are free parameters in the model, not constrained to 0 but estimated from the data. In Section 5.1.2, attention was given to the necessity of including in the model all covariances τ_0h between the random slopes and the random intercept.
Another case of this point arises when a categorical variable with c ≥ 3 categories has a random effect. This is implemented by giving random slopes to the c − 1 dummy variables that are used to represent the categorical variable. The covariances between these random slopes should then also be included in the model.
Formulated generally, suppose that two variables X_h and X_k have random effects, and that the meaning of these variables is such that they could be replaced by two linear combinations a X_h + a' X_k and b X_h + b' X_k. (For the random intercept and random slope discussed in Section 5.1.2, the relevant type of linear combination would correspond to a change of origin of the variable with the random slope.) Then the covariance τ_hk between the two random slopes should be included in the model.
6. Reluctance to include non-significant effects in the model; one could also say, a reluctance to overfitting. Each of the points above, however, could override this reluctance. An obvious example of this overriding is the case where one wishes to test for the effect of X_2 while controlling for the effect of X_1. The purpose of the analysis is a subject-matter consideration, and even if the effect of X_1 is non-significant, one still should include this effect in the model.
7. The desire to obtain a good fit, and to include all effects in the model that contribute to a good fit; in practice, this leads to the inclusion of all significant effects, unless the data set is so large that certain effects, although significant, are deemed unimportant nevertheless.
8. Awareness of the following two basic statistical facts.
(a) Every test of a given statistical parameter controls for all other effects in the model used as a null hypothesis (M0 in Section 6.2). Since the latter set of effects has an influence on the interpretation as well as on the statistical power, test results may depend on the set of other effects included in the model.
(b) We are constantly making errors of the first and the second kind; especially the latter, since statistical power often is rather low. This implies that an effect being non-significant does not mean that the effect is absent in the population. It also implies that a significant effect may be so by chance (but the probability of this is no larger than the level of significance, most often set at 0.05). Multilevel research often is based on data with a limited number of groups. Since the power for detecting effects of level-two variables depends strongly on the number of groups in the data, warnings about low power are especially important for level-two variables.
These considerations are nice; but how to proceed in practice? To get an insight into the data, it is usually advisable to start with a descriptive analysis of the variables: an investigation of their means, standard deviations, correlations, and distributional forms. It is also helpful to make a preliminary ('quick and dirty') analysis with a simpler method such as OLS regression.
When starting with the multilevel analysis as such, in most situations (longitudinal data may provide an exception) it is advisable to start with fitting the empty model (4.6). This gives the raw within-group and between-group variances, from which the estimated intraclass correlation can be calculated. These parameters are useful as a general description and a starting point for further model fitting. The process of further model specification will include forward steps (select additional effects, fixed or random, test their significance, and decide whether or not to include them in the model) and backward steps (exclude effects from the model because they are not important from a statistical or substantive point of view). We mention two possible approaches in the following subsections.

6.4.1 Working upward from level one
In the spirit of Section 5.2, one may start with constructing a model for level one, i.e., first explain within-group variability, and explain between-group variability subsequently. This two-phase approach is advocated by Bryk and Raudenbush (1992) and is followed in the program HLM (Bryk, Raudenbush, and Congdon, 1996) (see Chapter 15).
Modeling within-group variability
Subject-matter knowledge and the availability of data lead to a number of level-one variables X_1 to X_p which are deemed important, or hypothetically important, to predict or explain the value of Y. These variables lead to equation (5.10) as a starting point:

Y_ij = β_0j + β_1j x_1ij + ... + β_pj x_pij + R_ij .

This equation represents the within-group effects of the X_h on Y. The between-group variability is first modeled as random variability. This is represented by splitting the group-dependent regression coefficients β_hj into a mean coefficient γ_h0 and a group-dependent deviation U_hj:

β_hj = γ_h0 + U_hj .

Substitution yields

Y_ij = γ_00 + Σ_h γ_h0 x_hij + U_0j + Σ_h U_hj x_hij + R_ij ,

where the sums run from h = 1 to p. This model has a number of level-one variables with fixed and random effects, but it will usually not be necessary to include all random effects. For the precise specification of the level-one model, the following steps are useful.
1. Select in any case the variables on which the research is focused. In addition, select relevant available level-one variables on the basis of subject-matter knowledge. Also include plausible interactions between level-one variables.
2. Select, among these variables, those for which, on the basis of subject-matter knowledge, a group-dependent effect (random slope!) is plausible. If one does not have a clue, one could select the variables that are expected to have the strongest fixed effects.
3. Estimate the model with the fixed effects of step 1 and the random effects of step 2.
4. Test the significance of the random slopes, and exclude the non-significant slopes from the model.
5. Test the significance of the regression coefficients, and exclude the non-significant coefficients from the model. This can also be a moment to consider the inclusion of interaction effects between level-one variables.
6. For a check, one could test whether the variables for which a group-dependent effect (i.e., a random slope) was not thought plausible in step 2 indeed have a non-significant random slope. (Keep in mind that, normally, including a random slope implies inclusion of the fixed effect!) Be reluctant to include random slopes for interactions; these are often hard to interpret.
With respect to the random slopes, one may be restricted by the fact that data usually contain less information about random effects than about fixed effects. Including many random slopes can therefore lead to long iteration processes of the estimation algorithm. The algorithm may even fail to converge. For this reason it may be necessary to specify only a small number of random slopes.
After this process, one has arrived at a model with a number of level-one variables, some of which have a random in addition to their fixed effect. It is possible that the random intercept is the only remaining random effect. This model is an interesting intermediate product, as it indicates the within-group regressions and their variability.

Modeling between-group variability
The next step is to try to explain these random effects by level-two variables, along the lines sketched in the code below. The random intercept variance can be explained by level-two variables, the random slopes by interactions of level-one with level-two variables, as was discussed in Section 5.2. It should be kept in mind that aggregates of level-one variables can be important level-two variables.
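A schematic version of this two-phase approach, reusing the hypothetical statsmodels setup of the earlier sketches (the column names y, x1, x2, z, and school are placeholders for whatever variables steps 1 and 2 selected), might look as follows.

```python
import statsmodels.formula.api as smf

def upward_workflow(df):
    # Phase 1: level-one model with random intercept and a candidate
    # random slope for x1 (steps 1-3 above).
    m1 = smf.mixedlm("y ~ x1 + x2", df, groups="school",
                     re_formula="~x1").fit(reml=False)

    # Step 4: deviance comparison against the model without the x1
    # slope (halved p-value, Section 6.2.1).
    m0 = smf.mixedlm("y ~ x1 + x2", df, groups="school").fit(reml=False)
    slope_dev_diff = (-2 * m0.llf) - (-2 * m1.llf)

    # Phase 2: explain intercept and slope variation with a level-two
    # variable z and the cross-level interaction x1:z.
    m2 = smf.mixedlm("y ~ x1 + x2 + z + x1:z", df, groups="school",
                     re_formula="~x1").fit(reml=False)
    return m1, slope_dev_diff, m2
```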
For deciding which main effects of level-two variables and which cross-level interactions to include, it is again advisable first to select those effects that are plausible on the basis of substantive knowledge, then to test these and include or omit them depending on their importance (statistical and substantive), and finally to check whether other (less plausible) effects also are significant.
This procedure has a built-in filter for cross-level interactions: an interaction between level-one variable X and level-two variable Z is considered only if X has a significant random slope. However, this 'filter' should not be employed as a strict rule. If there are theoretical reasons to consider the X × Z interaction, this interaction can be tested even if X does not have a significant random slope. The background to this is the fact that if there is an X × Z interaction, the test for this interaction has a higher power to detect it than the test for a random slope.
It is possible that, if one carries out both tests, the test of the random slope is non-significant whereas the test of the X × Z interaction is indeed significant. This implies that either an error of the first kind is made by the test on the X × Z interaction (this is the case if there is no interaction), or an error of the second kind is made by the test of the random slope (this is the case if there is interaction). Assuming that the significance level is 0.05, and one focuses on the test of this interaction effect, the probability of the first event is less than 0.05, whereas the probability of an error of the second kind can be quite high, especially since the test of the random slope does not always have high power for testing the specific alternative hypothesis of an X × Z interaction effect. Therefore, provided that the X × Z interaction effect was hypothesized before looking at the data, the significant result of the test of this effect is what counts, and not the lack of significance for the random slope.

6.4.2 Joint consideration of level-one and level-two variables
The procedure of first building a level-one model and subsequently extending it with level-two variables is neat, but not always the most efficient or the most relevant. If there are level-two variables or cross-level interactions that are known to be important, why not include them in the model right from the start? For example, it could be expected that a certain level-one variable has a within-group regression differing from its between-group regression. In such a case, one may wish to include the group mean of this variable right from the start. In this approach, the same steps are followed as in the preceding section, but without the distinction between level-one and level-two variables. This leads to the following steps.
1. Select relevant available level-one and level-two variables on the basis of subject-matter knowledge. Also include plausible interactions. Don't forget group means of level-one variables, to account for the possibility of a difference between within-group and between-group regressions. Also don't forget cross-level interactions.
2. Select among the level-one variables those for which, on the basis of subject-matter knowledge, a group-dependent effect (random slope!) is plausible. (A possibility would be, again, to select those variables that are expected to have the strongest fixed effects.)
3. Estimate the model with the fixed effects of step 1 and the random effects of step 2.
4. Test the significance of the random slopes, and exclude the non-significant slopes from the model.
5. Test the significance of the regression coefficients, and exclude the non-significant coefficients from the model. This can also be a moment to consider the inclusion of more interaction effects.
6. Check whether other effects, thought less plausible at the start of model building, indeed are not significant. If they are significant, include them in the model.
In an extreme instance of step 1, one may wish to include all available variables and a large number of interactions in the fixed part. Similarly, one might wish to give all level-one variables random effects in step 2. Whether this is practically possible will depend, among other things, on the number of level-one variables. Such an implementation of these steps leads to a backward model-fitting process, where one starts with a large model and reduces it by stepwise excluding non-significant effects. The advantage is that masking effects (where a variable is excluded early in the model building process because of non-significance, whereas it would have reached significance if one had controlled for another variable) do not occur. The disadvantage is that it may be a very time-consuming procedure.

6.4.3 Concluding remarks about model specification
This section has suggested a general approach to the specification of multilevel models rather than laying out a step-by-step procedure. This is in accordance with our view of model specification as a process with two steering wheels and without foolproof procedures. This implies that, given one data set, a researcher (let alone two researchers...) may come up with more than one model, each seeming in itself a satisfactory result of a model specification process.
For the hierarchical linear model, however, the concept of ‘explained proportion of variance’ is somewhat problematic. In this section, we follow the approach of Snijders snd Bota (100) to explain the fice and give a subi muller version of F*. ‘One way to approach this concept is to transfer its customary treat- ment, wellknown from multiple linear regression, straightforwardly to the Ierazchical random effects model: treat proportional reductions in the es timated variance components, 7 and 29 in the random-intercept model for to levels, as analogues of H? values. Since there are several variance com ‘ponents in the hierarchical linear model, this approach leads to several values, one for each variance component. However, this defiaition of R now and then leads to capleasant surprises: it sometimes happens that adding explanatory variables increases rather than decreases some of the variance components. Even negative values of R? are peesible. Negative values of FE clearly are undesirable and are not in accordance with its intuitive in- terpretation In the discussion of R-type measures, t should be kept in mind that ‘these measures depend on the distribution ofthe explanatory variables, ‘This implies that these variables, denoted in this section by X, are supposed to be drawa at raadom ffom the population at levl one and the population at level two, and not determined by the experimental design or the researcher's ‘whims. In order to stress the random nature of the X-variable, the values of X are denoted Xy, instead of by 2i a8 in earlier chapters. 7.1.1 Negative values of RY ‘As an example, we consider data from a study by Vermeulen and Bosker (1992) on the effects of part-time teaching in primary schools. The de- 99 100 How rauch does the model explain? pendent variable ¥ is an acithmetic test score; the sample consists of 718 grade 3 pupils in 42 school, “Aa intelligence txt score Kis taed te poe dictor variable. Group sizes range from 1 to 33 with an average of 20. In the folowing, i sometimes is desirable to present an example for balecced data (-e, with equal group sizes). The balanced data presented below st the data restricted to 33 schools with 10 pupils in each school, by deleting schools with less than 10 pupils from the sample and randomly sampling 10 pupis from each school fom the remaining schools. For den onstrate purpose ode wr fle: the empty Moe A Model B ith group mean, X s, as predictor variable; and Model C, with s withineem deviation score, (Xy~X,), an predicir variable Table 71 wrest oe results of the analyses both for the balanced and for the entire data set, ‘The residual variance at level one is denoted 0; the residual vasiance ai level two is denoted rf. From Table 7.1 we see that in the balanced as ‘Table 7.1 Estimated residoal variance ps # and if ance parameters # and #2 for ‘models with within-group and between-group predictor variables, + BlXiy~ Kj) +0 +B 6973 2.443 TE Unbalanced desiga ALY = Bo4 Voy + By, 7.653 2.708 BLYiy = Bo +f, Xy 4+ Uoy + By 7.685 2.038 CY = f+ Bi(Xy~Xj)+ Uy + By 6.658 2801 well as in the unbalanced case, #2 incr as a within-grot ; #9 increases a8 a within-group deviation yatable is added as an explanatory variable to the model Purthernoce for the balanced case, 3 ip not affected by adding a group-level variable to the model. In the unbalanced case, 6? increases slightly when arding fo sta set R? on the devel is nega- trl Bata 2 ill hy ater arama! ste REE etd re same. It is argued below that defining A? 
as the Proportional reduction in residual variance parameters 6? and ‘3, respectively, is not the best way to aoe — cae ‘FP in the linear regression model; and that obi snl nats a Ei dl measures denoted below by R} and Ri ae Explained variance 301 11.8. Definitions of proportions of explained variance in two-level models In multiple linear regression, the customary R? parameter can be introduced in several ways: eg, ax the maximal squared correlation coefficient between the dependent variable and some linear combination of the predictor vari ables, or as the proportional reduction in the residual variance parameter ue to the joint predictor variables. A very appealing principle to define ‘measures of modeled (or explained) variation 's the principle of proportional reduction of prediction error. This is one of the defiaitions of Rin multiple linear regression, and can be described 2s follows. A population of values is given for the explanatory and the dependent variables, (Xt, .-» Xqis Ys with a known joint probability distribution; f ig the value for the vector © for which the expected squared error £0 - Soom Xu)? is minimal. (This is the defition of the ordinary least squares (‘OLS’) estimation criterion.) (In this equation, vo is defined as the intercept and ou = 1 for all i,j.) If, for a certain case i, the values of X, ‘unknown, then the best predictor for ¥; is its expectation €(Y), squared prediction exor var(¥,); ifthe values Xi. Xi are given, then the linear predictor of Yj with minimum squared ervor is the regression value YE SaXne. The diference betwoon the observed valu ¥, and the predicted sale Sq eX i the prediction error. Accordingly, the mean squared prediction err is defined as EM — An Xn)? r ‘The proportional reduction of the mean squared error of prediction is the same as the proportional reduction in the unexplained variance, due to the tse of the variables X; to Xq- In a formula, it can be expressed by 2 = Yer) ~ var(¥ = Fn AX) ® var(¥) var(¥i- Yon AnXns) var) ‘this formula expresies one of the equivalent ways to define F?, ‘The same principle can be wed to deine ‘explained proportion of vari- ance in the Bierarchical near model. For this model, however, there are several options with respect to what one wishes to predict. Let ws consider a two-level model with dependent variable ¥. In such a model, one can choose between predicting an individual value Yy at the lowest level, or a group mean Yj. On the bass ofthis distinction, two concepts of explained pro- portion of variance in a tworlevel model can be defined. ‘The first, and most important, isthe proportional reduction of error for predicting an tndivédual ‘outcome. ‘The second is the propertionat reduction of error for predicting a group mean. To elaborate these concepts more specifically, first consider a =1- 102 How much does the model explain? ‘two-level random effects model with 2 random intercept: and some predictor variables with fixed effects but 1o other random effects: = + Vm Xnis + Voy + Ry» (ray cad Since we wish to discuss the definition of ‘explained proportion of variance’ as = population parameter, we assume temporarily that the vector 7 of regression coeflicients is known. Level one For the level-one explained proportion of variance, we consider the predic- tion of Yj for @ randomly drawn leve-one unit within a randomly drawn level-two unit j. 
The same principle can be used to define the 'explained proportion of variance' in the hierarchical linear model. For this model, however, there are several options with respect to what one wishes to predict. Let us consider a two-level model with dependent variable Y. In such a model, one can choose between predicting an individual value Y_ij at the lowest level, or a group mean Ȳ_j. On the basis of this distinction, two concepts of explained proportion of variance in a two-level model can be defined. The first, and most important, is the proportional reduction of error for predicting an individual outcome; the second is the proportional reduction of error for predicting a group mean. To elaborate these concepts more specifically, first consider a two-level random effects model with a random intercept and some predictor variables with fixed effects, but no other random effects:

    Y_ij = γ₀ + Σ_h γ_h X_hij + U_0j + R_ij.    (7.1)

Since we wish to discuss the definition of 'explained proportion of variance' as a population parameter, we assume temporarily that the vector γ of regression coefficients is known.

Level one

For the level-one explained proportion of variance, we consider the prediction of Y_ij for a randomly drawn level-one unit i within a randomly drawn level-two unit j. If the values of the predictors X_hij are unknown, then the best predictor for Y_ij is its expectation; the associated mean squared prediction error is var(Y_ij). If the value of the predictor vector X_ij for the given unit is known, then the best linear predictor for Y_ij is the regression value Σ_h γ_h X_hij (where X_0ij is defined as 1 for all i, j). The associated mean squared prediction error is

    var(Y_ij − Σ_h γ_h X_hij) = σ² + τ₀².

The level-one explained proportion of variance is defined as the proportional reduction in mean squared prediction error:

    R₁² = 1 − var(Y_ij − Σ_h γ_h X_hij) / var(Y_ij).    (7.2)

Now let us proceed from the population to the data. The most straightforward way to estimate R₁² is to consider σ̂² + τ̂₀² for the empty model,

    Y_ij = γ₀ + U_0j + R_ij,    (7.3)

as well as for the fitted model (7.1), and to compute 1 minus the ratio of these values. In other words, R₁² is just the proportional reduction in the value of σ̂² + τ̂₀² due to including the X-variables in the model. For a sequence of nested models, the contributions to the estimated value of (7.2) due to adding new predictors can be considered to be the contributions of these predictors to the explained variance at level one. To illustrate this, we once again use the data from the first (balanced) example, and estimate the proportional reduction of prediction error for a model, Model D, in which the within-group and between-group regression coefficients may be different.

Table 7.2 Estimating the level-one explained variance (balanced data).

                                                               σ̂²      τ̂₀²
  A.  Y_ij = γ₀ + U_0j + R_ij                                 8.694    2.271
  D.  Y_ij = γ₀ + γ₁ (X_ij − X̄_j) + γ₂ X̄_j + U_0j + R_ij      6.973    0.991

From Table 7.2 we see that σ̂² + τ̂₀² amounts to 10.965 for Model A and to 7.964 for Model D. R₁² is thus estimated to be 1 − (7.964/10.965) = 0.274.

Level two

The level-two explained proportion of variance can be defined as the proportional reduction in mean squared prediction error for the prediction of Ȳ_j for a randomly drawn level-two unit j. If the values of the predictors X_hij for the level-one units i within level-two unit j are completely unknown, then the best predictor for Ȳ_j is its expectation; the associated mean squared prediction error is var(Ȳ_j). If the values of all predictors X_hij for all units i in this particular group j are known, then the best linear predictor for Ȳ_j is the regression value Σ_h γ_h X̄_hj; the associated mean squared prediction error is

    var(Ȳ_j − Σ_h γ_h X̄_hj) = σ²/n + τ₀²,

where n is the number of level-one units on which the average is based. The level-two explained proportion of variance is now defined as the proportional reduction in mean squared prediction error for Ȳ_j:

    R₂² = 1 − var(Ȳ_j − Σ_h γ_h X̄_hj) / var(Ȳ_j).    (7.4)

To estimate the level-two explained proportion of variance, we follow a similar approach as above: R₂² is estimated as the proportional reduction in the value of σ̂²/n + τ̂₀², where n is a representative value for the group size. In the example given earlier, let us use a usual group size of n = 30. For Model A, the value of σ̂²/n + τ̂₀² is 8.694/30 + 2.271 = 2.561, whereas for Model D it amounts to 6.973/30 + 0.991 = 1.223. R₂² is thus estimated to be 1 − (1.223/2.561) = 0.52.
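Both estimates can be reproduced with a few lines of Python (the variance estimates are hard-coded from Table 7.2):

    # Estimates for the balanced data: empty Model A and Model D.
    sigma2_A, tau2_A = 8.694, 2.271
    sigma2_D, tau2_D = 6.973, 0.991

    # Level one: proportional reduction in sigma^2 + tau0^2.
    R2_1 = 1 - (sigma2_D + tau2_D) / (sigma2_A + tau2_A)

    # Level two: proportional reduction in sigma^2/n + tau0^2,
    # with n a representative group size.
    n = 30
    R2_2 = 1 - (sigma2_D / n + tau2_D) / (sigma2_A / n + tau2_A)

    print(round(R2_1, 3), round(R2_2, 2))   # 0.274 and 0.52, as in the text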
It is natural that the mean squared error for predicting a group mean should depend on the group size. Often one can use for n a value which is deemed a priori to be 'representative'. For example, if normal class size is considered to be 30, then even if, because of missing data, the values of n_j in the data set are on average less than 30, it is advisable to use the representative value n = 30. In the case of varying group sizes, if it is unclear how a representative group size should be chosen, one possibility is to use the harmonic mean, defined by N/{Σ_j (1/n_j)}.

It is quite common that the data within the groups are based on a sample, so that not the entire groups in the population are observed and the group sizes in the data set do not reflect the group sizes in the population. In such a case, since it is more relevant to predict population group averages than sample group averages, it is advisable to let n reflect the group sizes in the population rather than the sample group sizes. If group sizes are very large, e.g., when the level-two units are defined by municipalities or other regions and the level-one units by their inhabitants, this means that, practically speaking, R₂² is the proportional reduction in the intercept variance.

Population values of R₁² and R₂² are non-negative

What happens to R₁² and R₂² when predictor variables are added to the multilevel model? Is it possible that adding predictor variables leads to smaller values of R₁² or R₂²? Can we even be sure at all that these quantities are positive? It turns out that a distinction must be made between the population parameters R₁² and R₂² and their estimates from data. Population values of R₁² and R₂² in correctly specified models, with a constant group size n, become smaller when predictor variables are deleted, provided that the variables U_0j and R_ij on the one hand are uncorrelated with all the X_hij variables on the other hand (the usual model assumption). For estimates of R₁² and R₂², however, the situation is different: these estimates sometimes do increase when predictor variables are deleted. When it is observed that an estimated value for R₁² or R₂² becomes smaller by the addition of a predictor variable, or larger by the deletion of a predictor variable, there are two possibilities: either this is a chance fluctuation, or the larger model is misspecified. It will depend on how large the change in R₁² or R₂² is, and on the subject-matter insight of the researcher, whether the first or the second possibility is deemed more likely. In this sense, changes of R₁² or R₂² in the 'wrong' direction serve as a diagnostic for possible misspecification. This possibility of misspecification refers to the fixed part of the model, i.e., the specification of the explanatory variables having fixed regression coefficients, and not to the random part of the model. We return to this in Section 9.2.3.

7.1.3 Explained variance in three-level models

In three-level random intercept models (Section 4.8), the residual variance, or mean squared prediction error, is the sum of the variance components at the three levels. Accordingly, the level-one explained proportion of variance can be defined here as the proportional reduction in the sum of these three variance parameters.

Example 7.1 Variance in maths performance explained by IQ.
In Example 4.8, Table 4.6 exhibits the results of the empty model (Model 1) and of a model in which IQ has a fixed effect (Model 2). The total variance in the empty model is 7.816 + 1.746 + 2.124 = 11.686, while the total unexplained variance in Model 2 is 6.910 + 0.701 + 1.109 = 8.720. Hence the level-one explained proportion of variance is 1 − (8.720/11.686) = 0.25.
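Both the harmonic-mean rule and the three-level calculation of Example 7.1 are easily sketched in Python (the group sizes in the first part are invented for illustration; the variance components are those quoted above):

    # Harmonic mean as a 'representative' group size: N / sum_j (1/n_j),
    # with N the number of groups. Small groups dominate this mean.
    group_sizes = [5, 10, 20, 40]                       # hypothetical n_j values
    n_harm = len(group_sizes) / sum(1 / n for n in group_sizes)
    print(round(n_harm, 2))                             # 10.67

    # Three-level R2_1 (Example 7.1): proportional reduction in the sum of
    # the three variance components.
    total_empty = 7.816 + 1.746 + 2.124                 # = 11.686
    total_model = 6.910 + 0.701 + 1.109                 # =  8.720
    print(round(1 - total_model / total_empty, 2))      # 0.25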
7.1.4 Explained variance in models with random slopes

The idea of using the proportional reduction in the prediction errors for Y_ij and Ȳ_j, respectively, as the definitions of explained variance at either level, can be extended to two-level models with one or more random regression coefficients. The formulae to calculate R₁² and R₂² can be found in Snijders and Bosker (1994). However, the estimated values for R₁² and R₂² usually change only very little when random regression coefficients are included in the model.

The formulae for estimating R₁² and R₂² in models with random intercepts only are very easy; estimating R₁² and R₂² in models with random slopes is more tedious. The software package HLM (Bryk et al., 1996), however, provides the necessary estimates, since it produces estimates not only of the variance components but also of the observed residual between-group variance

    var(Ȳ_j − Σ_h γ_h X̄_hj).    (7.5)

Using this estimate, denoted in the HLM output as D-BAR, one can calculate the estimate of R₂² (as the proportional reduction in D-BAR). In HLM versions 2.20 and higher, the output includes estimates of τ₀² and of the reliability of the intercepts, but no longer the D-BAR. However, this observed residual between-group variance can then be calculated as τ̂₀²/reliability. Since the level-two variance is not constant in random slope models, it is usually advisable here to center the explanatory variables around their grand means, so that R₂² now refers to the explained variance at level two for the average level-two unit.

The simplest possibility to estimate R₁² and R₂² in models with random slopes is to re-estimate the models as random intercept models with the same fixed parts (omitting the random slopes), and to use the resulting parameter estimates to calculate R₁² and R₂² in the usual (simple) way for random intercept models. This will usually yield values that are very close to the values for the random slopes model.

Example 7.2 Explained variance for language scores.
In Table 5.4, a model was presented for the data set on language scores in elementary schools used throughout Chapters 4 and 5. When a random intercept model is fitted with the same fixed part, the estimated variance parameters are τ̂₀² = 7.61 for level two and σ̂² = 39.82 for level one. For the empty model, Table 4.1 shows that the estimates are τ̂₀² = 19.42 and σ̂² = 64.57. This implies that the explained variance at level one is 1 − (39.82 + 7.61)/(64.57 + 19.42) = 0.44 and, using an average class size of n = 25, the explained variance at level two is 1 − [(39.82/25) + 7.61]/[(64.57/25) + 19.42] = 0.58. These explained variances are quite high, and can be attributed mainly to the explanation by IQ.
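The numbers in Example 7.2, and the τ̂₀²/reliability rule just mentioned, can be sketched in Python (estimates hard-coded from the text; the reliability in the last line is a hypothetical value, purely for illustration):

    # Random intercept re-fit with the same fixed part as the random slope model.
    sigma2_full, tau2_full = 39.82, 7.61    # Table 5.4 model, slopes omitted
    sigma2_0,    tau2_0    = 64.57, 19.42   # empty model (Table 4.1)
    n = 25                                  # average class size

    R2_1 = 1 - (sigma2_full + tau2_full) / (sigma2_0 + tau2_0)
    R2_2 = 1 - (sigma2_full / n + tau2_full) / (sigma2_0 / n + tau2_0)
    print(round(R2_1, 2), round(R2_2, 2))   # 0.44 and 0.58

    # When only tau0^2 and the intercept reliability are reported (newer HLM
    # versions), the observed residual between-group variance (7.5) follows as:
    print(7.61 / 0.80)                      # tau0^2 / reliability (0.80 hypothetical)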
7.2 Components of variance¹

The preceding section focused on the total amount of variance that can be explained by the explanatory variables. In these measures of explained variance, only the fixed effects contribute. It can also be theoretically illuminating to decompose the observed variance of Y into parts that correspond to the various constituents of the model. This is discussed in this section for a two-level model.

¹ This is a more advanced section which may be skipped by the reader.

For the dependent variable Y, the level-one and level-two variances in the empty model (4.6) are denoted σ₀² and τ₀², respectively. The total variance of Y therefore is σ₀² + τ₀², and the components of variance are the parts into which this quantity is split. The first split, obviously, is the split of σ₀² + τ₀² into σ₀² and τ₀², which was extensively discussed in our treatment of the intraclass correlation coefficient. To obtain formulae for a further decomposition, it is necessary to be more specific about the distribution of the explanatory variables. It is usual, in single-level as well as in multilevel regression analysis, to condition on the values of the explanatory variables, i.e., to consider these as given values. In this section, however, all explanatory variables are regarded as random variables with a given distribution.

7.2.1 Random intercept models

For the random intercept model, we distinguish between the level-one explanatory variables X and the level-two explanatory variables Z. Deviating from the notation in other parts of this book, matrix notation is used, and X and Z denote vectors. The explanatory variables X₁, ..., X_p at level one are collected in the vector X, with value X_ij for unit i in group j. It is assumed, more specifically, that X_ij can be decomposed into independent level-one and level-two parts,

    X_ij = X_ij^W + X_j^B    (7.6)

(i.e., a kind of multivariate hierarchical linear model without a fixed part). The expectation is denoted E X_ij = μ_X, the level-one covariance matrix is cov(X_ij^W) = Σ_X^W, and the level-two covariance matrix is cov(X_j^B) = Σ_X^B. This implies that the overall covariance matrix of X is the sum of these two,

    cov(X_ij) = Σ_X^W + Σ_X^B = Σ_X.

Further, the covariance matrix of the group average for a group of size n is

    cov(X̄_j) = (1/n) Σ_X^W + Σ_X^B.

It may be noted that this notation deviates slightly from the common split of X_ij into

    X_ij = (X_ij − X̄_j) + X̄_j.    (7.7)

The split (7.6) is a population-based split, whereas the more usual split (7.7) is sample-based. In the notation used here, the covariance matrix of the within-group deviation variable is

    cov(X_ij − X̄_j) = ((n − 1)/n) Σ_X^W,

while the covariance matrix of the group means is cov(X̄_j) = (1/n) Σ_X^W + Σ_X^B. For the discussion in this section, the present notation is more convenient.

The split (7.6) is not a completely innocuous assumption. The independence between X^W and X^B implies that the covariance matrix of the group means is larger² than 1/(n − 1) times the within-group covariance matrix of X.

² The word 'larger' is meant here in the sense of the ordering of positive definite symmetric matrices.

The vector of explanatory variables Z = (Z₁, ..., Z_q) at level two has value Z_j for group j. The vector of expectations of Z is denoted E Z_j = μ_Z, and the covariance matrix is cov(Z_j) = Σ_Z. In the random intercept model (4.7), denote the vector of regression coefficients of the X's by γ_X and the vector of regression coefficients of the Z's by γ_Z. Taking into account the stochastic nature of the explanatory variables then leads to the following expression for the variance of Y:

    var(Y_ij) = γ_X′ Σ_X γ_X + γ_Z′ Σ_Z γ_Z + τ₀² + σ²
              = γ_X′ Σ_X^W γ_X + γ_X′ Σ_X^B γ_X + γ_Z′ Σ_Z γ_Z + τ₀² + σ².    (7.8)

It may be illuminating to remark that, for the special case that all explanatory variables are uncorrelated, this expression is equal to

    var(Y_ij) = Σ_h γ_h² var(X_h) + Σ_h γ_h² var(Z_h) + τ₀² + σ²

(this holds, e.g., if there is only one level-one and only one level-two explanatory variable). This formula shows that, in this special case, the contribution of each explanatory variable to the variance of the dependent variable is given by the product of its squared regression coefficient and its variance.

The decomposition of X into independent level-one and level-two parts allows us to indicate precisely which parts of (7.8) correspond to the unconditional level-one variance σ₀² of Y, and which parts to the unconditional level-two variance τ₀² of the empty model:

    level one:  σ₀² = γ_X′ Σ_X^W γ_X + σ²,
    level two:  τ₀² = γ_X′ Σ_X^B γ_X + γ_Z′ Σ_Z γ_Z + τ₀²,

where σ² and τ₀² on the right-hand sides denote the residual variances of the model that includes the explanatory variables, and σ₀² and τ₀² on the left-hand sides those of the empty model. This shows how the within-group variation of the level-one variables contributes to the unconditional level-one variance, whereas parts of the unconditional level-two variance are explained by the variation of the level-two variables, and also by the between-group (composition) variation of the level-one variables. Recall, however, the definition of Σ_X^B, which implies that the between-group variation of X is taken net of the random variation of the group means that may be expected given the within-group variation of X_ij.
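For the special case of uncorrelated explanatory variables, the decomposition can be illustrated with a small Python sketch (one level-one variable X, one level-two variable Z; all parameter values are invented):

    # Contributions to var(Y) for one X and one Z, mutually uncorrelated.
    gamma_x, gamma_z = 2.0, 1.0
    var_x_within, var_x_between = 3.0, 1.0  # scalar Sigma_X^W and Sigma_X^B
    var_z = 2.0
    tau0_2, sigma_2 = 4.0, 9.0              # residual level-2 and level-1 variances

    # Unconditional level-one and level-two variances of Y:
    level1 = gamma_x**2 * var_x_within + sigma_2
    level2 = gamma_x**2 * var_x_between + gamma_z**2 * var_z + tau0_2
    print(level1, level2, level1 + level2)  # 21.0, 10.0, total var(Y) = 31.0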
7.2.2 Random slope models

For the hierarchical linear model in its general specification given by (5.12), a decomposition of the variance is very complicated because of the presence of the cross-level interactions. Therefore the decomposition of the variance is discussed for random slope models in the formulation (5.14), repeated here as

    Y_ij = γ₀ + Σ_h γ_h x_hij + U_0j + Σ_h U_hj x_hij + R_ij,    (7.9)

without bothering about whether some of the x_h are level-one or level-two variables, or products of a level-one and a level-two variable. Recall that in this section the explanatory variables X are stochastic. The vector X = (X₁, ..., X_q) of all explanatory variables has mean μ_X and covariance matrix Σ_X. The sub-vector (X₁, ..., X_p) of the variables that have random slopes has mean μ_X(p) and covariance matrix Σ_X(p). These covariance matrices could be split into within-group and between-group parts, but this is left up to the reader. The covariance matrix of the random slopes (U_1j, ..., U_pj) is denoted T₁₁, and the p × 1 vector of the intercept-slope covariances is denoted T₁₀. With these specifications, the variance of the dependent variable can be shown to be given by

    var(Y_ij) = γ′ Σ_X γ + τ₀² + 2 μ_X(p)′ T₁₀ + μ_X(p)′ T₁₁ μ_X(p) + trace(T₁₁ Σ_X(p)) + σ².    (7.10)

(A similar expression, but without taking the fixed effects into account, is given by Snijders and Bosker (1993) as formula (21).) A brief discussion of all terms in this expression is as follows.

1. The first term, γ′ Σ_X γ, gives the contribution of the fixed effects and may be regarded as the 'explained part' of the variance. This term could be split into a level-one and a level-two part as in the preceding subsection.

2. The part

    τ₀² + 2 μ_X(p)′ T₁₀ + μ_X(p)′ T₁₁ μ_X(p)    (7.11)

should be seen as one piece. One could rescale all variables with random slopes to have a zero mean (cf. the discussion in Section 5.1.2); this would lead to μ_X(p) = 0 and leave of this piece only the intercept variance τ₀². In other words, (7.11) is just the intercept variance after subtracting the mean from all variables with random slopes.

3. The part

    trace(T₁₁ Σ_X(p))

is the contribution of the random slopes to the variance of Y. In the extreme case that all variables X₁ to X_p would be uncorrelated and have unit variances, this expression reduces to the sum of the random slope variances. This term also could be split into a level-one and a level-two part.

4. Finally, σ² is the residual level-one variability that can be explained neither on the basis of the fixed effects, nor on the basis of the latent group characteristics that are represented by the random intercept and slopes.
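The bookkeeping in (7.10) can be made concrete with a short numpy sketch (all parameter values are invented; only the structure of the computation matters):

    import numpy as np

    gamma   = np.array([2.0, 1.0])           # fixed coefficients of X1, X2
    Sigma_X = np.array([[3.0, 0.5],
                        [0.5, 2.0]])         # covariance matrix of (X1, X2)
    mu_p    = np.array([1.0])                # mean of X1, the random slope variable
    Sigma_p = Sigma_X[:1, :1]                # covariance matrix of (X1,)
    T11     = np.array([[0.4]])              # random slope variance
    T10     = np.array([0.3])                # intercept-slope covariance
    tau0_2, sigma_2 = 4.0, 9.0

    var_y = (gamma @ Sigma_X @ gamma          # fixed, 'explained' part
             + tau0_2 + 2 * mu_p @ T10        # intercept piece (7.11) ...
             + mu_p @ T11 @ mu_p              # ... continued
             + np.trace(T11 @ Sigma_p)        # random slope contribution
             + sigma_2)                       # residual level-one variance
    print(var_y)                              # 31.2 for these numbers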
8 Heteroscedasticity

The hierarchical linear model is a quite flexible model, and it has some other features in addition to the possibility of representing a nested data structure. One of these features is the possibility of representing multilevel as well as single-level regression models where the residual variance is not constant.

In ordinary least squares regression analysis, one of the standard assumptions is homoscedasticity: the residual variance is constant, i.e., it does not depend on the explanatory variables. This assumption was made in the preceding chapters, e.g., for the residual variance at level one and for the intercept variance at level two. The techniques used in the hierarchical linear model allow us to relax this assumption and to replace it by the weaker assumption that variances depend linearly or quadratically on explanatory variables. Thus, an important special case of heteroscedastic models (i.e., models with heterogeneous variances) is obtained, viz., heteroscedasticity where the variance depends on given explanatory variables. This feature is implemented at this moment in the MLn/MLwiN program and in HLM version 5, and can also be obtained in SAS (cf. Chapter 15). This chapter treats a two-level model, but the techniques treated (and the software mentioned) can also be used for heteroscedastic single-level regression models.

8.1 Heteroscedasticity at level one

8.1.1 Linear variance functions

In a hierarchical linear model it sometimes makes sense to consider the possibility that the residual variance at level one depends on one of the predictor variables. An example is a situation where two measurement instruments have been used, each with a different precision, resulting in two different values for the measurement error variance, which is a component of the level-one variance. If the level-one residual variance depends linearly on some variable X₁, it can be expressed by

    level-one variance = σ₀² + 2 σ₀₁ x₁ij,    (8.1)

where the value of X₁ for a given unit is denoted x₁ij, while the random part at level one now has two parameters, σ₀² and σ₀₁. The reason for incorporating the factor 2 will become clear later, when also quadratic variance functions are considered. For example, when X₁ is a dummy variable with values 0 and 1, the residual variance is σ₀² for the units with X₁ = 0 and σ₀² + 2σ₀₁ for the units with X₁ = 1. When the level-one variance depends on more than one variable, their effects can be added to the variance function (8.1) by adding terms 2σ₀₂ x₂ij, etc.

Example 8.1 Residual variance depending on gender.
In the example used in Chapters 4 and 5, the residual variance might depend on the pupil's gender. To investigate this in a model that is not overly complicated, we take the model of Table 5.4, delete the effects of multi-grade classes, and add the effect of gender (a dummy variable which is 0 for boys and 1 for girls). Table 8.1 presents estimates for two models: one with constant residual variance, and one with residual variance depending on gender. Thus, Model 1 is a homoscedastic model and Model 2 a gender-dependent heteroscedastic model.

Table 8.1 Homoscedastic model (Model 1) and gender-dependent heteroscedastic model (Model 2).

                                       Model 1            Model 2
  Fixed Effect                     Coefficient  S.E.   Coefficient  S.E.
  Intercept                             …        …          …        …
  IQ                                  2.288      …        2.254    0.081
  SES                                   …        …          …        …
  Gender                              2.64      0.26      2.66     0.25
  IQ̄ (school average)                 1.02       …        1.01      …
  Random Effect                     Parameter   S.E.   Parameter   S.E.
  Intercept variance                    …        …          …        …
  IQ slope variance                     …        …          …        …
  Intercept-IQ slope covariance       −0.75     0.28      −0.78    0.28
  Level-one variance parameters:
  σ₀² constant term                  37.56       …       38.72      …
  σ₀₁ gender effect                     -                −1.21      …
  Deviance                          15005.5             15004.4

According to formula (8.1), the residual variance in Model 2 is 38.72 for boys and 38.72 − 2 × 1.21 = 36.30 for girls. The residual variance estimated in the homoscedastic Model 1 is very close to the average of these two figures. This is natural, since about half of the pupils are girls and half are boys. The difference between the two variances is, however, not significant: the deviance test yields χ² = 15005.5 − 15004.4 = 1.1, d.f. = 1, p > 0.2. The fixed effect of gender is quite significant (t = 2.64/0.26 = 10.2, p < 0.0001): controlling for IQ and SES, girls score on average 2.6 higher than boys.
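The fitted variance function of Model 2 can be written down directly in Python (estimates hard-coded from the text; the function name is ours):

    # Linear level-one variance function (8.1) with a 0-1 dummy for gender.
    def residual_variance(x, sigma0_2=38.72, sigma01=-1.21):
        return sigma0_2 + 2 * sigma01 * x

    print(residual_variance(0))   # boys:  38.72
    print(residual_variance(1))   # girls: 38.72 - 2*1.21 = 36.30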
Analogous to the dependence due to the multilevel nesting structure as discussed in Chapter 2, heteroscedasticity has two faces: it can be a nuisance and it can be an interesting phenomenon. It can be a nuisance because the failure to take it into account may lead to a misspecified model and, hence, to incorrect parameter estimates and standard errors. On the other hand, it can also be an interesting phenomenon in itself. When high values on some variable X₁ are associated with a higher residual variance, this means that for the units who score high on X₁ there is, within the context of the model being considered, more uncertainty about their value on the dependent variable Y. Thus, it may be interesting to look for explanatory variables that differentiate especially between the units who score high on X₁. Sometimes a non-linear function of X₁, or an interaction involving X₁, could play such a role.

Example 8.2 Heteroscedasticity related to IQ.
Continuing the previous example, it is now investigated whether the residual variance depends on IQ. The corresponding parameter estimates are presented as Model 3 in Table 8.2.

Table 8.2 Heteroscedastic models depending on IQ.

                                       Model 3            Model 4
  Fixed Effect                     Coefficient  S.E.   Coefficient  S.E.
  Intercept                             …        …          …        …
  IQ                                    …        …        2.293      …
  (IQ⁻)²                                -                 0.266      …
  (IQ⁺)²                                -                −0.251      …
  SES                                   …        …          …        …
  Gender                                …        …          …        …
  IQ̄ (school average)                   …        …          …        …
  Random Effect                     Parameter   S.E.   Parameter   S.E.
  Intercept variance                    …        …          …        …
  IQ slope variance                     …        …          -
  Intercept-IQ slope covariance         …        …          -
  Level-one variance parameters:
  σ₀² constant term                  37.83       …          …        …
  σ₀₁ IQ effect                      −0.97      0.25        …        …
  Deviance                          14960.0             14908.1

Comparing the deviance with that of Model 1 shows that there is a quite significant heteroscedasticity associated with IQ: χ² = 15005.5 − 14960.0 = 45.5, d.f. = 1, p < 0.0001. The level-one variance function is (cf. (8.1))

    37.83 − 1.94 IQ

(i.e., σ̂₀₁ = −0.97). This shows that the language scores of the less intelligent pupils are more variable than the language scores of the more intelligent. The standard deviation of IQ is 2.07 and its mean is 0. Thus, the range of the level-one variance, when IQ is in the range of the mean ± twice the standard deviation, is between 29.79 and 45.87. This is an appreciable variation around the average value of 37.56 estimated in the homoscedastic Model 1.
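The quoted range is easily verified in Python (σ̂₀₁ = −0.97 from Table 8.2; the small discrepancies with 29.79 and 45.87 are due to rounding of σ̂₀₁):

    # Model 3 level-one variance 37.83 + 2*sigma01*IQ, with IQ standardized
    # to mean 0 and standard deviation 2.07.
    sigma0_2, sigma01, sd_iq = 37.83, -0.97, 2.07
    for iq in (-2 * sd_iq, 0.0, 2 * sd_iq):
        print(round(sigma0_2 + 2 * sigma01 * iq, 2))   # 45.86, 37.83, 29.80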
Prompted by the IQ-dependent heteroscedasticity, the data were explored for effects that might differentiate between the pupils with lower IQ scores. Non-linear effects of IQ and some interactions involving IQ were tried. It appeared that a non-linear effect of IQ is discernible, represented better by a so-called spline function¹ than by a polynomial function of IQ. Specifically, the coefficient of the square of IQ turned out to be different for negative than for positive IQ values (recall that IQ was standardized to have an average of 0). This is represented in Model 4 of Table 8.2 by the variables

    IQ⁻ = IQ if IQ < 0,   IQ⁻ = 0 if IQ ≥ 0;
    IQ⁺ = 0 if IQ < 0,   IQ⁺ = IQ if IQ ≥ 0.

¹ Spline functions (introduced more extensively in Section 12.2.2, and treated more fully, e.g., in Seber and Wild, 1989) are a more flexible class of functions than polynomials: they are polynomials of which the coefficients may be different on several intervals.

Adding the squares of these two variables to the fixed part gives a quite significant decrease of the deviance (51.9 for two degrees of freedom) and completely takes away the random slope of IQ. The IQ-related heteroscedasticity, however, becomes even stronger. The total fixed effect of IQ now is given by

    effect of IQ = 2.293 IQ + 0.266 IQ²   if IQ < 0,
    effect of IQ = 2.293 IQ − 0.251 IQ²   if IQ ≥ 0.    (8.3)

The graph of this effect is shown in Figure 8.1. It is an increasing function which flattens out for low and for high values of IQ, in a way that cannot be well represented by a quadratic or cubic function.

[Figure 8.1 Effect of IQ on language test.]

This is an interesting turn of this modeling exercise, and a nuisance only because it indicates once again that school performance is a quite complicated subject. Our interpretation of this data set toward the closing of Chapter 5 was that there is a random slope of IQ, i.e., that schools differ in the effect of IQ. Now it turns out that the data are clearly better described by a model in which IQ has a non-linear effect, the effect of IQ being stronger in the middle range than toward its extreme values; in which the effect of IQ does not vary across schools; and in which the combined effects of IQ, SES, gender, and the school average of IQ predict the language scores of high-IQ pupils much better than those of low-IQ pupils.

8.1.2 Quadratic variance functions

The formal representation of level-one heteroscedasticity is based on including random effects at level one, as spelled out in Goldstein (1995; see the remarks there on a complex random part at level one). In Section 5.1.1 it was remarked already that random slopes, i.e., random effects at level two, lead to heteroscedasticity. This also holds for random effects at level one. Consider a two-level model and suppose that the level-one random part is

    random part at level one = R_0ij + R_1ij x₁ij.    (8.4)

Denote the variances of R_0ij and R_1ij by σ₀² and σ₁², respectively, and their covariance by σ₀₁. The rules for calculating with variances and covariances imply that

    var(R_0ij + R_1ij x₁ij) = σ₀² + 2 σ₀₁ x₁ij + σ₁² x₁ij².    (8.5)

Formula (8.4) is just a formal representation, used in MLn/MLwiN to specify this heteroscedastic model. For the interpretation one should rather look at (8.5). This formula can be used without the interpretation that σ₀² and σ₁² are variances and σ₀₁ a covariance; these parameters might be any numbers. The formula only implies that the residual variance is a quadratic function of x₁ij. In the previous section, the case was used where σ₁² = 0, producing the linear function (8.1). If a quadratic function is desired, all three parameters are estimated from the data.
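A minimal Python sketch (assuming numpy; the IQ values and the parameters of the variance function are invented) shows the quadratic variance function (8.5) and the construction of the two spline variables used for Model 4:

    import numpy as np

    # Quadratic level-one variance function (8.5); illustrative parameters.
    def level1_variance(x, sigma0_2=37.0, sigma01=-1.0, sigma1_2=0.25):
        return sigma0_2 + 2 * sigma01 * x + sigma1_2 * x**2

    iq = np.array([-3.0, -1.0, 0.0, 1.5, 3.0])
    print(level1_variance(iq))

    # Spline variables: IQ- equals IQ below 0 (else 0), IQ+ equals IQ at or
    # above 0 (else 0); their squares get separate fixed coefficients.
    iq_neg = np.where(iq < 0, iq, 0.0)
    iq_pos = np.where(iq >= 0, iq, 0.0)
    print(iq_neg**2, iq_pos**2)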
Example 8.3 Educational level attained by pupils.
This example is about a cohort of pupils who entered secondary school in 1989, studied by Dekkers, Bosker, and Driessen (1998). The question is how well the educational level attained in 1995 can be predicted from individual characteristics and school achievement at the end of primary school. The data are about 15,007 pupils in 369 secondary schools. The dependent variable is an educational attainment variable, defined as 12 minus the minimum number of additional years of schooling it would theoretically take this pupil in 1995 to gain a certificate giving access to university. The range is 4 to 13 (e.g., the value 13 means that the pupil already is a first-year university student). Explanatory variables are the teacher's rating at the end of primary school (an advice on the most suitable type of secondary school, range 1 to 4); achievement on three standardized tests, the so-called CITO tests, on language, arithmetic, and information processing (all with a mean value around 11 and a standard deviation between 4 and 5); socio-economic status (a discrete ordered scale with values 1 to 6); gender (0 for boys, 1 for girls); and minority status (based on the parents' country of birth, 0 for the Netherlands and other industrialized countries, 1 for other countries).

Table 8.3 presents the results of a random intercept model as Model 1. Note that the standard errors are quite small due to the large number of pupils in this data set. The explained proportion of variance at level one is R₁² = 0.39. Model 2 shows the results for a model where the residual variance depends quadratically on SES. It follows from (8.5) that the residual variance here is given by

    residual variance = 3.475 − 0.632 SES + 0.056 SES².

The deviance difference between Models 1 and 2 (χ² = 97.3, d.f. = 2, p < 0.0001) indicates that the dependence of the residual variance on SES is quite significant. The variance function decreases curvilinearly from the value 2.91 for SES = 1 to a minimum value of 1.75 for SES = 6. This implies that, when educational attainment is predicted by the variables in this model, the uncertainty in the prediction is highest for the low-SES pupils. It is reassuring that the estimates and standard errors of the other effects are not appreciably different between Models 1 and 2.

The specification of this random part was checked in the following way. First, the models with only a linear or only a quadratic variance term were estimated separately. This showed that both variance terms are significant. Further, it might be possible that the SES-dependence of the residual variance is a random slope in disguise. Therefore a model with a random slope (i.e., a random effect at level two) for SES also was fitted. This showed that the random slope was barely significant and not very large, and that it did not take away the heteroscedasticity effect.

Table 8.3 Heteroscedastic model depending quadratically on SES.

                                       Model 1            Model 2
  Fixed Effect                     Coefficient  S.E.   Coefficient  S.E.
  Intercept                             …        …          …        …
  Teacher's rating                      …        …          …        …
  CITO language                         …        …          …        …
  CITO arithmetic                       …        …          …        …
  CITO information processing           …        …          …        …
  SES                                   …        …          …        …
  Gender                                …        …          …        …
  Minority status                       …        …          …        …
  Random Effect                     Parameter   S.E.   Parameter   S.E.
  Intercept variance                    …        …          …        …
  Level-one variance parameters:
  σ₀² constant term                     …        …        3.475    0.147
  σ₀₁ SES effect                        -                −0.316    0.064
  σ₁² quadratic SES effect              -                 0.056      …
  Deviance                           40205.1             40107.8

The SES-dependent heteroscedasticity led to the consideration of non-linear effects of SES and of interaction effects involving SES. Since the SES variable assumes values 1 through 6, five dummy variables were used, contrasting the respective SES values with the reference category SES = 3. In order to limit the number of variables, the interactions of SES were defined as interactions with the numerical SES variable rather than with the categorical variable represented by these dummies. For the same reason, for the interaction of SES with the CITO tests, only the average of the three CITO tests (range 1 to 20, mean 11.7) was considered. Product interactions were considered of SES with gender, with the average CITO score, and with minority status. As factors in the products, SES and the CITO average were centered approximately, by using (SES − 3) and (CITO average − 12). Gender and minority status, being 0-1 variables, did not need to be centered. This implies that although SES is represented by dummies (i.e., as a categorical variable) in its main effect, it is used as a numerical variable in the interactions.
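A short Python check of the quoted variance function over the SES scale, together with the approximate centering used for the product interactions (the CITO averages below are invented for illustration):

    import numpy as np

    ses = np.arange(1, 7, dtype=float)
    print(3.475 - 0.632 * ses + 0.056 * ses**2)   # from ~2.90 down to ~1.70

    # Approximate centering for the interactions: (SES - 3) and (CITO - 12).
    cito_avg = np.array([8.0, 10.0, 12.0, 14.0, 16.0, 18.0])  # hypothetical
    ses_x_cito = (ses - 3) * (cito_avg - 12)                  # product term
    print(ses_x_cito)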
