Scale Development: Theory and Applications
Second Edition

Robert F. DeVellis

Applied Social Research Methods Series, Volume 26
SAGE Publications

This book is dedicated, with deepest love and appreciation, to my parents, John A. DeVellis and Mary DeVellis Cox, and my wife, Brenda DeVellis.

Copyright © 2003 by Sage Publications, Inc. All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Contents

Preface

1. Overview
   General Perspectives on Measurement
   Historical Origins of Measurement in Social Science
   Later Developments in Measurement
   The Role of Measurement in the Social Sciences
   Summary and Preview

2. Understanding the Latent Variable
   Constructs Versus Measures
   Latent Variable as the Presumed Cause of Item Values
   Path Diagrams
   Further Elaboration of the Measurement Model
   Parallel Tests
   Alternative Models
   Exercises

3. Reliability
   Continuous Versus Dichotomous Items
   Internal Consistency
   Reliability Based on Correlations Between Scale Scores
   Generalizability Theory
   Summary
   Exercises

4. Validity
   Content Validity
   Criterion-Related Validity
   Construct Validity
   What About Face Validity?
   Exercises

5. Guidelines in Scale Development
   Step 1: Determine Clearly What It Is You Want to Measure
   Step 2: Generate an Item Pool
   Step 3: Determine the Format for Measurement
   Step 4: Have the Initial Item Pool Reviewed by Experts
   Step 5: Consider Inclusion of Validation Items
   Step 6: Administer Items to a Development Sample
   Step 7: Evaluate the Items
   Step 8: Optimize Scale Length
   Exercises

6. Factor Analysis
   An Overview of Factor Analysis
   A Conceptual Description of Factor Analysis
   Interpreting Factors
   Principal Components Versus Common Factors
   Confirmatory Factor Analysis
   Using Factor Analysis in Scale Development
   Sample Size
   Conclusion

7. An Overview of Item Response Theory
   Item Difficulty
   Item Discrimination
   False Positives
   Item Characteristic Curves
   Complexities of IRT
   When to Use IRT
   Conclusions

8. Measurement in the Broader Research Context
   Before Scale Development
   After Scale Administration
   Final Thoughts

References
Index
About the Author

Preface

The first edition of this book has been widely adopted as an introduction to measurement concepts and methods. The reason for its success, I am quite sure, is that it made rather complex ideas accessible to nonexperts. That was certainly my goal. Helping students at all levels to understand measurement issues conceptually is an extremely important aspect of conveying this material.
The graduate course on scale development I offer in the School of Public Health at the University of North Carolina at Chapel Hill attracts students with varied quantitative backgrounds. Within the same semester, my students may range from people who have had only a single graduate course in statistics to students studying for a Ph.D. in quantitative psychology. My experience in teaching that course suggests that people at all levels benefit from material presented in clear, conceptual, nonmathematical terms. Although formulas are inevitable in such a course, I try to explain the concepts in ways that make those formulas transparent, merely shorthand for a reasonable series of operations applied to data. I attempted, with some apparent success, to transfer those teaching methods to the first edition, and I have tried to do so again in this revised edition. The emphasis in this book is squarely on conveying information in ways that make the underlying principles clear and that let readers peer into the "black boxes" that various methods seem to be.

For this edition, the volume has been extensively revised. My approach has been to retain what students found clearest and most useful, to revise material for which I have conceived more lucid explanations, and to add topics that have grown in importance since the appearance of the first edition. Every chapter has changed; several chapters, substantially. More than 30 new references have been added, but many classic volumes retain their importance and are cited once again in this edition. Figures have been added to several chapters to make key points visually. In the opening chapter, I have added some new examples, clarified why some variables need many items for valid assessment and others do not, and provided a more extensive discussion of different types of item composites. Chapters 2 and 3 have been edited for greater clarity, and I have added figures illustrating key points in each chapter. In Chapter 4, in addition to doing some fine-tuning, I have added a new section on face validity. In Chapter 5, which lists the steps followed in scale development, I have added the details of several practical tips that students have found useful. Chapter 8, on viewing measurement in the broader perspective, has been expanded to include places to look for instruments, how qualitative procedures can serve as a foundation for scale development, and issues related to differential item functioning. The two remaining chapters comprise the most substantial changes from the previous edition. Chapter 6, on factor analysis, has been substantially expanded in coverage and completely rewritten to provide more vivid and accessible analogies to the factor analytic process. I have used figures liberally to support the ideas presented textually. Finally, a new Chapter 7 has been added on a topic merely hinted at in the first edition, item response theory (IRT). My goal is not to equip readers with a working knowledge of IRT's highly complex and still-evolving methods but to provide them with a conceptual foundation that will help them understand more advanced material when they encounter it elsewhere.

Despite the inclusion of Chapter 7, the primary focus of the book continues to be classical measurement methods. There is no doubt that methods such as IRT will gain in popularity, especially as more manageable computer programs for the necessary analyses become available.
Classical methods will not disappear, however. Despite certain theoretical limitations, those methods work surprisingly well in a variety of circumstances. Their foundations and applications are certainly easily understood. In various parts of the revised text, I have highlighted some areas where I think the advantages of IRT over classical methods may be particularly important. For the bulk of the research being done, however, classical methods work very well. These methods will not become obsolete as IRT gains prominence. The two will exist side by side as alternative methods with their respective advantages and disadvantages. Many applied researchers will never really need anything other than classical measurement techniques. The measurement gap that is still most troubling to me lies not between those who have or have not mastered the newest methods available but between those who have or have not mastered any measurement concepts or methodologies. I hope that this volume will help to move people to the better-informed side of that gap.

Overview

Measurement is of vital concern across a broad range of social research contexts. For example, consider the following hypothetical situations:

1. A health psychologist faces a common dilemma: the measurement scale she needs apparently does not exist. Her study requires that she have a measure that can differentiate between what individuals want to happen and what they expect to happen when they see a physician. Her research shows that previous studies used scales that inadvertently confounded these two ideas. No existing scales seem to make this distinction in precisely the way that she would like. Although she could fabricate a few questions to tap the distinction between what one wants and expects, she worries that "made up" items might not be reliable or valid indicators of these concepts.

2. An epidemiologist is unsure how to proceed. He is performing secondary analyses on a large data set based on a national health survey. He would like to examine the relationship between certain aspects of perceived psychological stress and health status. Although no set of items intended as a stress measure was included in the original survey, several items originally intended to measure other variables appear to tap content related to stress. It might be possible to pool these items into a reliable and valid measure of psychological stress. However, if the pooled items constituted a poor measure of stress, the investigator might reach erroneous conclusions.

3. A marketing team is frustrated in its attempts to plan a campaign for a new line of high-priced infant toys. Focus groups have suggested that parents' purchasing decisions are strongly influenced by the apparent educational relevance of toys of this sort. The team suspects that parents who have high educational and career aspirations for their infants will be most attracted to this new line of toys. Therefore, the team would like to assess these aspirations among a large and geographically dispersed sample of parents. Additional focus groups are judged to be too cumbersome for reaching a sufficiently large sample of consumers.

In each of these situations, people interested in some substantive area have come head to head with a measurement problem. None of these researchers is interested primarily in measurement per se. However, each must find a way to quantify a particular phenomenon before tackling the main research objective. In each case, "off-the-shelf" measurement tools are either inappropriate or unavailable.
All the researchers recognize that, if they adopt haphazard measurement approaches, they run the risk of yielding inaccurate data. Developing their own measurement instruments seems to be the only remaining option.

Many social science researchers have encountered similar problems. One all-too-common response to these types of problems is reliance on existing instruments of questionable suitability. Another is to assume that newly developed questionnaire items that "look right" will do an adequate measurement job. Uneasiness or unfamiliarity with methods for developing reliable and valid instruments and the inaccessibility of practical information on this topic are common excuses for weak measurement strategies. Attempts at acquiring scale development skills may lead a researcher either to arcane sources intended primarily for measurement specialists or to information too general to be useful. This volume is intended as an alternative to those choices.

GENERAL PERSPECTIVES ON MEASUREMENT

Measurement is a fundamental activity of science. We acquire knowledge about people, objects, events, and processes by observing them. Making sense of these observations frequently requires that we quantify them, that is, that we measure the things in which we have a scientific interest. The process of measurement and the broader scientific questions it serves interact with one another; the boundaries between them are often imperceptible. This happens, for example, when a new entity is detected or refined in the course of measurement, or when the reasoning involved in determining how to quantify a phenomenon of interest sheds new light on the phenomenon itself. For example, Smith, Earp, and DeVellis (1995) investigated women's perceptions of battering. An a priori conceptual model based on theoretical analysis suggested that there are six distinct components to these perceptions. Empirical work aimed at developing a scale to measure these perceptions indicated that, among both battered and nonbattered women, a much simpler conceptualization prevailed: a single concept thoroughly explained how study participants responded to 37 of the 40 items administered. This finding suggests that what researchers saw as a complex constellation of variables was actually perceived by women living in the community as a single, broader phenomenon. Thus, in the course of devising a means of measuring women's perceptions about battering, we discovered something new about the structure of those perceptions.

Duncan (1984) argues that the roots of measurement lie in social processes and that these processes and their measurement actually precede science: "All measurement . . . is social measurement. Physical measures are made for social purposes" (p. 35). In reference to the earliest formal social measurement processes, such as voting, census-taking, and systems of job advancement, Duncan notes that "their origins seem to represent attempts to meet everyday human needs, not merely experiments undertaken to satisfy scientific curiosity" (p. 106). He goes on to say that similar processes "can be drawn in the history of physics: the measurement of length or distance, area, volume, weight, and time was achieved by ancient peoples in the course of solving practical, social problems; and physical science was built on the foundations of those achievements" (p. 106).

Whatever the initial motives, each area of science develops its own set of measurement procedures.
Physics, for example, has developed specialized methods and equipment for detecting subatomic particles. Within the behavioral and social sciences, psychometrics has evolved as the subspecialty concerned with measuring psychological and social phenomena. Typically, the measurement procedure used is the questionnaire, and the variables of interest are part of a broader theoretical framework.

HISTORICAL ORIGINS OF MEASUREMENT IN SOCIAL SCIENCE

Early Examples

Common sense and historical record support Duncan's claim that social necessity led to the development of measurement before science emerged. No doubt, some form of measurement has been a part of our species' repertoire since prehistoric times. The earliest humans must have evaluated objects, possessions, and opponents on the basis of characteristics such as size. Duncan (1984) cites biblical references to concerns with measurement (e.g., "A false balance is an abomination to the Lord, but a just weight is a delight," Proverbs 11:1) and notes that the writings of Aristotle refer to officials charged with checking weights and measures. Anastasi (1968) notes that the Socratic method employed in ancient Greece involved probing for understanding in a manner that might be regarded as knowledge testing. In his 1964 essay, P. H. DuBois (reprinted in Barnette, 1976) describes the use of civil service testing as early as 2200 BCE in China. Wright (1990) cites other examples of the importance ascribed in antiquity to accurate measurement, including the "weight of seven" on which 7th-century Muslim taxation was based. He also notes that some have linked the French Revolution, in part, to peasants' being fed up with unfair measurement practices.

Emergence of Statistical Methods and the Role of Mental Testing

Nunnally (1978) points out that, although systematic observations may have been going on, the absence of statistical methods hindered the development of a science of measuring human abilities until the latter half of the 19th century. Similarly, Duncan (1984) observes that, in most fields of mathematics other than geometry, applications preceded a formal development of the foundations (which he ascribes to the 19th century) by millennia. The eventual development of suitable statistical methods in the 19th century was set in motion by Darwin's work on evolution and his observation and measurement of systematic variation across species. Darwin's cousin, Sir Francis Galton, extended the systematic observation of differences to humans. A chief concern of Galton was the inheritance of anatomical and intellectual traits. Karl Pearson, regarded by many as the "founder of statistics" (e.g., Allen & Yen, 1979, p. 3), was a junior colleague of Galton's. Pearson developed the mathematical tools, including the product-moment correlation coefficient bearing his name, needed to examine systematically the relationships among variables. Scientists could then quantify the extent to which measurable characteristics were interrelated. Charles Spearman continued in the tradition of his predecessors and set the stage for the subsequent development and popularization of factor analysis in the early 20th century. It is noteworthy that many of the early contributors to formal measurement (including Alfred Binet, who developed tests of mental ability in France in the early 1900s) shared an interest in intellectual abilities. Hence, much of the early work in psychometrics was applied to "mental testing."

The Role of Psychophysics

Another historical root of modern psychometrics arose from psychophysics.
Attempts to apply the measurement procedures of physics to the study of sensations led to a protracted debate regarding the nature of measurement. Narens and Luce (1986) have summarized the issues. They note that in the late 19th century, Hermann von Helmholtz observed that physical attributes, such as length and mass, possessed the same intrinsic mathematical structure as did positive real numbers. For example, units of length or mass could be ordered and added just like ordinary numbers. In the early 1900s, the debate continued. The Commission of the British Association for Advancement of Science regarded fundamental measurement of psychological variables to be impossible because of the problems inherent in ordering or adding sensory perceptions. S. S. Stevens argued that strict additivity, as would apply to length or mass, was not necessary and pointed out that individuals could make fairly consistent ratio judgments of sound intensity. For example, they could judge one sound to be twice or half as loud as another. He argued that this ratio property enabled the data from such measurements to be subjected to mathematical manipulation. Stevens is credited with classifying measurements into nominal, ordinal, interval, and ratio scales. Loudness judgments, he argued, conformed to a ratio scale (Duncan, 1984). At about the time that Stevens was presenting his arguments on the legitimacy of scaling psychophysical measures, Louis L. Thurstone was developing the mathematical foundations of factor analysis (Nunnally, 1978). Thurstone's interests spanned both psychophysics and mental abilities. According to Duncan (1984), Stevens credited Thurstone with applying psychophysical methods to the scaling of social stimuli. Thus, his work represents a convergence of what had been separate historical roots.

LATER DEVELOPMENTS IN MEASUREMENT

Evolution of Basic Concepts

As influential as Stevens has been, his conceptualization of measurement is by no means the final word. He defined measurement as the "assignment of numerals to objects or events according to rules" (cited in Duncan, 1984). Duncan (1984) challenged this definition as "incomplete in the same way that 'playing the piano is striking the keys of the instrument according to some pattern' is incomplete. Measurement is not only the assignment of numerals, etc. It is also the assignment of numerals in such a way as to correspond to different degrees of a quality . . . or property of some object or event" (p. 126). Narens and Luce (1986) also identified limitations in Stevens's original conceptualization of measurement and illustrated a number of subsequent refinements. However, their work underscores a basic point made by Stevens: Measurement models other than the type endorsed by the Commission (of the British Association for Advancement of Science) exist, and these lead to measurement methods applicable to the nonphysical as well as physical sciences. In essence, this work on the fundamental properties of measures has established the scientific legitimacy of the types of measurement procedures used in the social sciences.

Evolution of "Mental Testing"

Although traditionally "mental testing" (or "ability testing," as it is now more commonly known) has been an active area of psychometrics, it is not a primary focus of this volume. Many of the advances in that branch of psychometrics are less commonly and perhaps less easily applied when the goal is to measure characteristics other than abilities. These advances include item response theory.
Over time, the applicability of these methods to measurement contexts other than ability assessment has become more apparent, and we will briefly examine them in a later chapter. Primarily, however, I will emphasize the "classical" methods that largely have dominated the measurement of social and psychological phenomena other than abilities.

Broadening the Domain of Psychometrics

Duncan (1984) notes that the impact of psychometrics in the social sciences has transcended its origins in the measurement of sensations and intellectual abilities. Psychometrics has emerged as a methodological paradigm in its own right. Duncan supports this argument with three examples of the impact of psychometrics: (1) the widespread use of psychometric definitions of reliability and validity, (2) the popularity of factor analysis in social science research, and (3) the adoption of psychometric methods for developing scales measuring an array of variables far broader than those with which psychometrics was initially concerned (p. 203). The applicability of psychometric concepts and methods to the measurement of diverse psychological and social phenomena will occupy our attention for the remainder of this volume.

THE ROLE OF MEASUREMENT IN THE SOCIAL SCIENCES

The Relationship of Theory to Measurement

The phenomena we try to measure in social science research often derive from theory. Consequently, theory plays a key role in how we conceptualize our measurement problems. Of course, many areas of science measure things derived from theory. Until a subatomic particle is confirmed through measurement, it is merely a theoretical construct. However, theory in psychology and other social sciences is different from theory in the physical sciences. In the social sciences, scientists tend to rely on numerous theoretical models that concern rather narrowly circumscribed phenomena, whereas in the physical sciences, the theories scientists use are fewer in number and more comprehensive in scope. Festinger's (1954) social comparison theory, for example, focuses on a rather narrow range of human experience: the way people evaluate their own abilities or opinions by comparing themselves to others. In contrast, physicists continue to work toward a grand unified field theory that will embrace all of the fundamental forces of nature within a single conceptual framework. Also, the social sciences are less mature than the physical sciences, and their theories are evolving more rapidly. Measuring elusive, intangible phenomena derived from multiple, evolving theories poses a clear challenge to social science researchers. Therefore, it is especially important to be mindful of measurement procedures and to recognize fully their strengths and shortcomings.

The more researchers know about the phenomena in which they are interested, the abstract relationships that exist among hypothetical constructs, and the quantitative tools available to them, the better equipped they are to develop reliable, valid, and usable scales. Detailed knowledge of the specific phenomenon of interest is probably the most important of these considerations. For example, social comparison theory has many aspects that may imply different measurement strategies. One research question might require operationalizing social comparisons as relative preference for information about higher- or lower-status others, while another might dictate ratings of self relative to the "typical person" on various dimensions. Different measures capturing distinct aspects of the same general phenomenon (e.g., "social comparison") thus may not yield convergent results (DeVellis et al., 1997). In essence, the measures are assessing different variables despite the use of a common variable name in their descriptions. Consequently, developing a measure that is optimally suited to the research question requires understanding the subtleties of the theory.

Different variables call for different assessment strategies. Number of tokens taken from a container, for example, can be observed directly. Many, arguably most, of the variables of interest to social and behavioral scientists are not directly observable; beliefs, motivational states, expectancies, needs, emotions, and social role perceptions are but a few examples. Certain variables cannot be directly observed but can be determined by research procedures other than questionnaires. For example, although cognitive researchers cannot directly observe how individuals organize information about gender into their self-schemas, they may be able to use recall procedures to make inferences about how individuals structure their thoughts about self and gender. There are many instances, however, in which it is impossible or impractical to assess social science variables with any method other than a paper-and-pencil measurement scale. This is often, but not always, the case when we are interested in measuring theoretical constructs. Thus, an investigator interested in measuring androgyny may find it far easier to do so by means of a carefully developed questionnaire than by some alternative procedure.

Theoretical and Atheoretical Measures

At this point, we should acknowledge that although this book focuses on measures of theoretical constructs, not all paper-and-pencil assessments need be theoretical. Sex and age, for example, can be ascertained from self-report by means of a questionnaire. Depending on the research question, these two variables can be components of a theoretical model or simply part of a description of a study's participants. Some contexts in which people are asked to respond to a list of questions using a paper-and-pencil format, such as an assessment of hospital patient meal preferences, have no theoretical foundation. In other cases, a study may begin atheoretically but result in the formulation of theory. For example, a market researcher might ask parents to list the types of toys they have bought for their children. Subsequently, the researcher might explore these listings for patterns of relationships. Based on the observed patterns of toy purchases, the researcher may develop a model of purchasing behavior. Other examples of relatively atheoretical measurement are public opinion questionnaires. Asking people which brand of soap they use or for whom they intend to vote seldom involves any attempt to tap an underlying theoretical construct. Rather, the interest is in the subject's response per se, not in some characteristic of the person it is presumed to reflect.

Distinguishing between theoretical and atheoretical measurement situations can be difficult at times. For example, seeking a voter's preference in presidential candidates as a means of predicting the outcome of an election amounts to asking a respondent to report his or her behavioral intention.
An investigator may ask people how they plan to vote not out of an interest in voter decision-making processes, but merely to anticipate the eventual election results. If, on the other hand, the same question is asked in the context of examining how attitudes toward specific issues affect candidate preference, a well-elaborated theory may underlie the research. The information about voting is intended in this case not to reveal how the respondent will vote but to shed light on individual characteristics. In these two instances, the relevance or irrelevance of the measure to theory is a matter of the investigator's intent, not the procedures used. Readers interested in learning more about constructing survey questionnaires that are not primarily concerned with measuring hypothetical constructs are referred to Converse and Presser (1986), Czaja and Blair (1996), Dillman (2000), Fink (1995), Fowler (1993, 1995), and Weisberg, Krosnick, and Bowen (1996).

Measurement Scales

Measurement instruments that are collections of items combined into a composite score, and intended to reveal levels of theoretical variables not readily observable by direct means, are often referred to as scales. We develop scales when we want to measure phenomena that we believe to exist because of our theoretical understanding of the world, but that we cannot assess directly. For example, we may invoke depression or anxiety as an explanation for behaviors we observe. Most theoreticians would agree that depression or anxiety is not equivalent to the behavior we see, but underlies it. Our theories suggest that these phenomena exist and that they influence behavior, but that they are intangible. Sometimes, it may be appropriate to infer their existence from their behavioral consequences. However, at other times, we may not have access to behavioral information (such as when we are restricted to mail survey methodologies), may not be sure how to interpret available samples of behavior (such as when a person remains passive in the face of an event that most others would react to strongly), or may be unwilling to assume that behavior is isomorphic with the underlying construct of interest (such as when we suspect that crying is the result of joy rather than sadness). In situations in which we cannot rely on behavior as an indication of a phenomenon, it may be useful to assess the construct by means of a carefully constructed and validated scale.

Even among theoretically derived variables, there is an implicit continuum ranging from relatively concrete and accessible phenomena to relatively abstract and inaccessible phenomena. Not all phenomena will require multi-item scales. Age and gender certainly have relevance to many theories but rarely require a multi-item scale for accurate assessment. People know their age and gender. These variables, for the most part, are linked to concrete, relatively unambiguous characteristics (e.g., morphology) or events (date of birth). Unless some special circumstance, such as a neurological impairment, is present, respondents can retrieve information about their age and gender from memory quite easily. They can respond with a high degree of accuracy to a single question assessing a variable such as these. Ethnicity arguably is a more complex and abstract variable than age or gender. It typically involves a combination of physical, cultural, and historical factors. As a result, it is less tangible, more of a social construction, than is age or gender.
Although the mechanisms involved in defining one's ethnicity may be complex and unfold over an extended period of time, most individuals have arrived at a personal definition and can report their ethnicity with little reflection or introspection. Thus, a single variable may suffice for assessing ethnicity under most circumstances. Many other theoretical variables, however, require respondents to reconstruct, interpret, compare, or evaluate less accessible information. For example, measuring how married people believe their lives would be different if they had chosen a different spouse probably would require substantial mental effort, and one item may not capture the complexity of the phenomenon of interest. Under conditions such as these, a scale may be the appropriate assessment tool. Multiple items may capture the essence of such a variable with a degree of precision that a single item could not attain. It is precisely this type of variable, one that is not directly observable and that involves thought on the part of the respondent, that is most appropriately assessed by means of a scale.

A scale should be contrasted with other types of multi-item measures that yield a composite score. The distinctions among these different types of item composites are of both theoretical and practical importance, as later chapters of this book will reveal. As the terms are used in this volume, a scale consists of what Bollen (1989, pp. 64-65; see also Loehlin, 1998, pp. 200-202) terms "effect indicators," that is, items whose values are caused by an underlying construct (or "latent variable," as I shall refer to it in the next chapter). A measure of depression often conforms to the characteristics of a scale, with the responses to individual items sharing a common cause, namely, the affective state of the respondent. Thus, how someone responds to items such as "I feel sad" and "My life is joyless" probably is largely determined by that person's feelings at the time. I will use the term index, on the other hand, to describe sets of items that are "cause indicators," that is, items that determine the level of a construct. A measure of presidential candidate appeal, for example, might fit the characteristics of an index. The items might assess a candidate's geographical residence, family size, physical attractiveness, ability to inspire campaign workers, and potential financial resources. Although these characteristics probably do not share any common cause, they might all share an effect: increasing the likelihood of a successful presidential campaign. The items are not the result of any one thing, but they determine the same outcome. A more general term for a collection of items that one might aggregate into a composite score is emergent variable (e.g., Cohen, Cohen, Teresi, Marchi, & Velez, 1990), which includes collections of entities that share certain characteristics and can be grouped under a common category heading. Grouping characteristics together, however, does not necessarily imply any causal linkage. Sentences beginning with a word that has fewer than five letters, for example, can easily be categorized together although they share neither a common cause nor a common effect. An emergent variable "pops up" merely because someone or something (such as a data analytic program) perceives some type of similarity among the items in question.

All Scales Are Not Created Equal

Regrettably, not all item composites are developed carefully.
For many, assembly may be a more appropriate term than development. Researchers often "throw together" or "dredge up" items and assume they constitute a suitable scale. These researchers may give no thought to whether the items share a common cause (thus constituting a scale), share a common consequence (thus constituting an index), or merely are examples of a shared superordinate category that does not imply either a common causal antecedent or consequence (thus constituting an emergent variable).

A researcher not only may fail to exploit theory in developing a scale but also may reach erroneous conclusions about theory by misinterpreting what a scale measures. An unfortunate but distressingly common occurrence is the researcher coming to the conclusion that some construct is unimportant or that some theory is inconsistent, based on the performance of a measure that may not reflect the variable assumed by the investigator. Why might this happen? Rarely in research do we examine relationships among variables directly. As noted earlier, many interesting variables are not directly observable, a fact we can easily forget. More often, we assess relationships among proxies (such as scales) that are intended to represent the variables of interest. The observable proxy and the unobservable variable may become confused. For example, variables such as blood pressure and body temperature, at first consideration, appear to be directly observable; but what we actually observe are proxies, such as a column of mercury. Our conclusions about the variables assume that the observable proxies are very closely linked to the underlying variables they are intended to represent. Such is the case for a thermometer; we describe the level of the mercury in a thermometer as "the temperature," even though, strictly speaking, it is merely a visible manifestation of temperature (i.e., thermal energy). In this case, where the two correspond very closely, the consequences of referring to the measurement (the scale value that the mercury attains) as the variable (the amount of thermal energy) are nearly always inconsequential. When the relationship between the variable and its indicator is weaker than in the thermometer example, confusing the measure with the phenomenon it is intended to reveal can lead to erroneous conclusions.

Consider a hypothetical situation in which an investigator wishes to perform a secondary analysis on an existing data set. Let us assume that our investigator is interested in the role of social support on subsequent professional attainment. The investigator observes that the available data set contains a wealth of information on subjects' professional status over an extended period of time and that subjects were asked whether they were married. In fact, there may be several items, collected at various times, that pertain to marriage. Let us further assume that, in the absence of any data providing a more detailed assessment of social support, the investigator decides to sum these marriage items into a "scale" and to use this as a measure of support. Most social scientists would agree that equating social support with marital status is not justified. The latter both omits important aspects of social support (e.g., the perceived quality of support received) and includes potentially irrelevant factors (e.g., status as an adult versus child at the time of measurement). If this hypothetical investigator concluded, on the basis of having used this assessment method, that social support played no role in professional attainment, that conclusion might be completely wrong. In fact,
the comparison was between marital status and professional attainment. Only if marital status actually indicated level of support would the conclusion be valid.

Costs of Poor Measurement

Even if a poor measure is the only one available, the costs of using it may be greater than any benefits attained. It is rare in the social sciences for there to be situations in which an immediate decision must be made in order to avoid dire consequences and one has no other choice but to make do with the best instruments available. Even in these rare instances, however, the inherent problems of using poor measures to assess constructs do not vanish. Using a measure that does not assess what one presumes it assesses can lead to wrong decisions. Does this mean that we should only use measurement tools that have undergone rigorous development and extensive validation testing? Not necessarily. Although imperfect measurement may be better than no measurement at all in some situations, we should recognize when our measurement procedures are flawed and temper our conclusions accordingly.

Often, an investigator will consider measurement secondary to the important scientific issues that motivate a study and thus attempt to "economize" by skimping on measurement. However, adequate measures are a necessary condition for valid research. Investigators should strive for an isomorphism between the theoretical constructs in which they have an interest and the methods of measurement they use to operationalize them. Poor measurement imposes an absolute limit on the validity of the conclusions one can reach. For an investigator who prefers to pay as little attention as possible to measurement, and as much attention as possible to substantive issues, an appropriate strategy might be to get the measurement part of the investigation correct from the very beginning so that it can be taken more or less for granted thereafter.

A researcher also can falsely economize by using scales that are too brief in the hope of reducing the burden on respondents. Choosing a questionnaire that is too brief to be reliable is a bad idea no matter how much respondents appreciate its brevity. A reliable questionnaire that is completed by half of the respondents yields more information than an unreliable questionnaire that is
completed by all respondents. If you cannot determine what the data mean, the amount of information collected is irrelevant. Consequently, respondents' completing "convenient" questionnaires that cannot yield meaningful information is a poorer use of their time and effort than their completing a somewhat longer version that produces valid data. Thus, using inadequately brief assessment methods may have ethical as well as scientific implications.

SUMMARY AND PREVIEW

This chapter has stressed that measurement is a fundamental activity in all branches of science, including the behavioral and social sciences. Psychometrics, the specialty area of the social sciences that is concerned with measuring social and psychological phenomena, has historical antecedents extending back to ancient times. In the social sciences, theory plays a vital role in the development of measurement scales, which are collections of items that reveal the level of an underlying theoretical variable. However, not all collections of items constitute scales in this sense. Developing scales may be more demanding than selecting items casually; however, the costs of using "informal" measures usually greatly outweigh the benefits.

The following chapters cover the rationale and methods of scale development in greater detail. Chapter 2 explores the "latent variable," the underlying construct that a scale attempts to quantify, and it presents the theoretical basis for the methods described in later chapters. Chapter 3 provides a conceptual foundation for understanding reliability and the logic underlying the reliability coefficient. The fourth chapter reviews validity, while the fifth is a practical guide to the steps involved in scale development. Chapter 6 introduces factor analytic concepts and describes their use in scale development. Chapter 7 is a conceptual overview of an alternative approach to scale development, item response theory. Finally, Chapter 8 briefly discusses how scales fit into the broader research process.

Understanding the Latent Variable

This chapter presents a conceptual schema for understanding the relationship between measures and the constructs they represent, though it is not the only framework available. Item response theory (IRT) is an alternative measurement perspective that we will examine in Chapter 7. Because of its relative conceptual and computational accessibility and wide usage, I emphasize the classical measurement model, which assumes that individual items are comparable indicators of the underlying construct.

CONSTRUCTS VERSUS MEASURES

Typically, researchers are interested in constructs rather than items or scales per se. For example, a market researcher measuring parents' aspirations for their children would be more interested in intangible parental sentiments and hopes about what their children will accomplish than in where those parents place marks on a questionnaire. However, recording responses to a questionnaire may, in many cases, be the best method of assessing those sentiments and hopes. Scale items are usually a means to the end of construct assessment. In other words, they are necessary because many constructs cannot be assessed directly. In a sense, measures are proxies for variables that we cannot directly observe. By assessing the relationships between measures, we infer, indirectly, the relationships between constructs. In Figure 2.1, for example, although our primary interest is the relationship between variables A and B, we estimate that relationship on the basis of the relationship between the measures corresponding to those variables.

Figure 2.1. Relationships between instruments correspond to relationships between latent variables only when each measure corresponds to its latent variable.

The underlying phenomenon or construct that a scale is intended to reflect is often called the latent variable. Exactly what is a latent variable? Its name reveals two chief features. Consider the example of parents' aspirations for their children's achievement. First, it is latent rather than manifest. Parents' aspirations for their children's achievement are not directly observable. In addition, the construct is variable rather than constant; that is, some aspect of it, such as its strength or magnitude, changes. Parents' aspirations for their children's achievement may vary with regard to time (e.g., during the child's infancy versus adolescence), place (e.g., on an athletic field versus a classroom), people (e.g., parents whose own backgrounds or careers differ), or any combination of these and other dimensions.
The latent variable is the actual phenomenon that is of interest, in this case, child achievement aspirations. Although we cannot observe or quantify it directly, the latent variable presumably takes on a specific value under some specified set of conditions. A scale developed to measure a latent variable is intended to estimate its actual magnitude at the time and place of measurement for each person measured. This unobservable "actual magnitude" is the true score.

LATENT VARIABLE AS THE PRESUMED CAUSE OF ITEM VALUES

The notion of a latent variable implies a certain relationship between it and the items that tap it. The latent variable is regarded as a cause of the item score; that is, the strength or quantity of the latent variable (i.e., the value of its true score) is presumed to cause an item (or set of items) to take on a certain value. An example may reinforce this point. The following are hypothetical items for assessing parents' aspirations for their children's achievement:

1. My child's achievements determine my own success.
2. I will do almost anything to ensure my child's success.
3. No sacrifice is too great if it helps my child achieve success.
4. My child's accomplishments are more important to me than just about anything else I can think of.

If parents are given an opportunity to express how strongly they agree with each of these items, their underlying aspirations for their children's achievement should influence their responses. In other words, each item should give an indication of the strength of the latent variable, aspirations for children's achievement. The score obtained on the item is caused by the strength or quantity of the latent variable for that person at that particular time.

A causal relationship between a latent variable and a measure implies certain empirical relationships. For example, if an item value is caused by a latent variable, then there should be a correlation between that value and the true score of the latent variable. Because we cannot directly assess the true score, we cannot compute a correlation between it and the item. However, when we examine a set of items that are presumably caused by the same latent variable, we can examine their relationships to one another. So, if we had items like the ones above measuring parental aspirations for child achievement, we could look directly at how they correlated with one another, invoke the latent variable as the basis for the correlations among items, and use that information to infer how highly each item was correlated with the latent variable. Shortly, I will explain how all this can be learned from correlations among items. First, however, I will introduce some diagrammatic procedures to help make this explanation more clear.

PATH DIAGRAMS

Coverage of this topic here will be limited to issues pertinent to scale development. For more in-depth treatment of the topic, consult Asher (1983) or Loehlin (1998).

Diagrammatic Conventions

Path diagrams are a method for depicting causal relationships among variables. Although they can be used in conjunction with path analysis, which is a data analytic method, path diagrams have more general utility as a means of depicting how a set of variables is assumed to be causally interrelated.
Figure 2.2. The causal path from X to Y.

Figure 2.3. Two variables plus error determine Y.

Path Diagrams in Scale Development

Path diagrams can help us see how scale items are causally related to a latent variable. They can also help us understand how certain relationships among items imply certain relationships between items and the latent variable. We begin by examining a simple computational rule for path diagrams. Let us look at the simple path diagram in Figure 2.4.

Figure 2.4. A path diagram with path coefficients, which can be used to compute correlations between variables.

The numbers along the paths are standardized path coefficients. Each one expresses the strength of the causal relationship between the variables joined by the arrow. The fact that the coefficients are standardized means that they all use the same scale to quantify the causal relationships. In this diagram, Y is a cause of X1 through X7. A useful relationship exists between the values of path coefficients and the correlations between the Xs (which would represent items, in the case of a scale-development-type path diagram). For diagrams like this one that have only one common origin (Y, in this case), the correlation between any two Xs is equal to the product of the coefficients for the paths making up a route, through Y, between the X variables in question. For example, the correlation between X2 and X3 is calculated by multiplying the two standardized path coefficients that join them via Y. Thus, r = .6 x .5 = .30. Variables X1 and X7 also share Y as a common source, but the route connecting them is longer. However, the rule still applies. Beginning at X1, we can trace back to Y and then forward again to X7. (Or, we could have gone in the other direction, from X7 to X1.) The correlation between them is, once again, the product of the path coefficients along that route.

This relationship between path coefficients and correlations provides a basis for estimating paths between a latent variable and the items that it influences. Even though the latent variable is hypothetical and unmeasurable, the items are real and the correlations among them can be directly computed. By using these correlations, the simple rule just discussed, and some assumptions about the relationships among items and the true score, we can come up with estimates for the paths between the items and the latent variable. We can begin with a set of correlations among variables. Then, working backward from the relationship among paths and correlations, we can determine what the values of certain paths must be if the assumptions are correct. Let us consider the example in Figure 2.5.

Figure 2.5. A path diagram with error terms.

This diagram is similar to the example considered earlier except that there are no path values, two of the X variables have been dropped, the remaining X variables represent scale items, and each item has a variable (error, labeled e) other than Y influencing it. These e variables are unique in the case of each item and represent the residual variation in each item not explained by Y. The diagram indicates that all of the items are influenced by Y. In addition, each is influenced by a unique set of variables other than Y that is collectively treated as error.

This revised diagram represents how five individual items are related to a single latent variable. The numerical subscripts given to the es and Xs indicate that the five items are different and that the five sources of error, one for each item, are also different. The diagram has no arrows going directly from one X to another X, or from an e to another e, or from an e to an X other than the one with which it is associated. These aspects of the diagram represent assumptions that will be discussed later.
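The tracing rule can be checked numerically. The short simulation below is a sketch of my own rather than an example from the book: it assumes a standardized latent variable Y, five items whose only link to one another runs through Y, and illustrative path values (with .6 and .5 for the second and third items, echoing the figures in the text). The observed correlations should land close to the products of the corresponding paths.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                      # large sample so correlations are stable

# Standardized paths from the latent variable Y to five items.
# These values are illustrative, not taken from the book's figure.
a = np.array([0.7, 0.6, 0.5, 0.8, 0.4])

y = rng.standard_normal(n)       # latent variable, standardized

# Each item = path * Y + unique error; the error variance is chosen so that
# every item has total variance 1, keeping the coefficients standardized.
e = rng.standard_normal((5, n)) * np.sqrt(1 - a[:, None] ** 2)
x = a[:, None] * y + e

r = np.corrcoef(x)               # observed inter-item correlation matrix

for i, j in [(1, 2), (0, 3), (2, 4)]:
    print(f"r(X{i + 1}, X{j + 1}) = {r[i, j]:.3f}   "
          f"product of paths = {a[i] * a[j]:.3f}")
# Each observed correlation comes out close to the product of the two path
# coefficients joining the items through Y, e.g., .6 * .5 = .30 for X2 and X3.
```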
If we had five actual items that a group of people had completed, we would have item scores that we could then correlate with one another. The rule examined earlier allows the computation of correlations from path coefficients. With the addition of some assumptions, it also lets us compute path coefficients from correlations; that is, correlations computed from actual items can be used to determine how each item relates to the latent variable. If, for example, X1 and X2 have a correlation of .49, then we know that the product of the values for the path leading from Y to X1 and the path leading from Y to X2 is equal to .49. We know this because our rule established that the correlation of two variables equals the product of the path coefficients along the route that joins them. If we also assume that the two path values are equal, then they both must be .70.
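The reverse computation is equally mechanical. A minimal sketch, using the same numbers as the paragraph above and assuming the two paths are equal:

```python
import math

r_12 = 0.49                      # observed correlation between items X1 and X2

# Under the equal-paths assumption, the only route linking X1 and X2 runs
# through Y, so r_12 = a * a and the common path is the square root of r_12.
# (Strictly, -0.70 also satisfies the equation; the positive root is taken
# here, as when items are scored in the same direction.)
a = math.sqrt(r_12)
print(f"estimated path from Y to each item: {a:.2f}")   # 0.70
```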
FURTHER ELABORATION OF THE MEASUREMENT MODEL

Classical Measurement Assumptions

The classical measurement model starts with common assumptions about items and their relationships to the latent variable and sources of error:

1. The amount of error associated with individual items varies randomly. The error associated with individual items has a mean of zero when it is aggregated across a large number of people. Thus, items' means tend to be unaffected by error when a large number of respondents complete the items.

2. One item's error term is not correlated with another item's error term; the only routes linking items pass through the latent variable, never through the error terms.

3. Error terms are not correlated with the true score of the latent variable. Note that the paths emanating from the latent variable do not extend onward to the error terms. The arrow between an item and its error term points toward the item.

The first two assumptions above are common statistical assumptions that underlie many analytic procedures. The third amounts to defining "error" as the residual remaining after one considers all of the relationships between a set of predictors and an outcome (or, in this case, a latent variable and its items).

PARALLEL TESTS

Classical measurement theory, in its most orthodox form, is based on the assumption of parallel "tests." The term parallel tests stems from the fact that one can view each individual item as a "test" for the value of the latent variable. For our purposes, referring to parallel items would be more accurate. However, I will defer to convention and use the traditional name.

A virtue of the parallel tests model is that its assumptions make it quite easy to reach useful conclusions about how individual items relate to the latent variable, based on our observations of how the items relate to one another. Earlier, I suggested that with a knowledge of the correlations among items and with certain assumptions, one could make inferences about the paths leading from a causal variable to an item. As will be shown in the next chapter, being able to assign a numerical value to the relationships between the latent variable and the items themselves is quite important. Thus, in this section, I will examine in some detail how the assumptions of parallel tests lead to certain conclusions that make this possible.

The rationale underlying the model of parallel tests is that each item of a scale is precisely as good a measure of the latent variable as any other of the scale items. The individual items are thus strictly parallel, which is to say that each item's relationship to the latent variable is presumed identical to every other item's relationship to that variable, and the amount of error present in each item is also presumed to be identical. Diagrammatically, this model can be represented as shown in Figure 2.6.

Figure 2.6. A diagram of a parallel tests model, in which all pathways from the latent variable (L) to the items (X1, X2, . . .) are equal in value to one another, as are all pathways from the error terms to the items.

This model adds two assumptions to those listed earlier:

1. The amount of influence from the latent variable to each item is assumed to be the same for all items.

2. Each item is assumed to have the same amount of error as any other item, meaning that the influence of factors other than the latent variable is equal for all items.

These added assumptions mean that the correlation of each item with the true score is identical. Being able to assert that these correlations are equal is important because it leads to a means of determining the value for each of these identical correlations. This, in turn, leads to a means of quantifying reliability, which will be discussed in the next chapter.

Asserting that correlations between the true score and each item are equal requires both of the preceding assumptions. A squared correlation is the proportion of variance shared between two variables. So, if correlations between the true score and each of the items are equal, the proportions of variance shared between the true score and each item also must be equal. Assume that a true score contributes the same amount of variance to each item. This amount can be an equal proportion of total variance for each item only if the items have identical total variances. In order for the total variances to be equal for two items, the amount of variance each item receives from sources other than the true score must also be equal. As all variation sources other than the true score are lumped together as error, this means that the two items must have equal error variances. For example, if X1 got 9 arbitrary units of variation from its true score and 1 from error, the true score proportion would be 90% of total variation. If X2 also got 9 units of variation from the true score, these 9 units could only be 90% of the total if the total variation were 10. The total could only equal 10 if error contributed 1 unit to X2, as it did to X1. The correlation between each item and the true score then would equal the square root of the proportion of each item's variance that is attributable to the true score, or roughly .95 in this case.

Thus, because the parallel tests model assumes that the amount of influence from the latent variable is the same for each item and that the amount from other sources (error) is the same for each item, the proportions of item variance attributable to the latent variable and to error are equal for all items. This also means that, under the assumptions of parallel tests, standardized path coefficients from the latent variable to each item are equal for all items. It was assuming that standardized path coefficients were equal that made it possible, in an earlier example, to compute path coefficients from correlations between items. The path diagram rule, discussed earlier, relating path coefficients to correlations should help us to understand why these equalities hold when one accepts the preceding assumptions.
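The 9-to-1 example can be written out as a few lines of arithmetic. This is a worked sketch rather than anything from the book; it restates the variance bookkeeping above and, anticipating the next paragraph, shows that the correlation between two such items is the square of the item-true-score correlation.

```python
true_var = 9.0      # variance each item receives from the true score
error_var = 1.0     # variance each item receives from error
total_var = true_var + error_var

proportion_true = true_var / total_var        # 0.90
item_true_corr = proportion_true ** 0.5       # ~0.95, the path from latent variable to item
inter_item_corr = item_true_corr ** 2         # 0.90, the product of two equal paths

print(f"proportion of item variance from true score: {proportion_true:.2f}")
print(f"correlation of each item with the true score: {item_true_corr:.3f}")
print(f"correlation between any two parallel items:   {inter_item_corr:.2f}")
```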
The assumptions of this model also imply that correlations among items are identical (e.g., the correlation between X1 and X2 is identical to the correlation between X1 and X3, or X4 and X5). How do we arrive at this conclusion from the assumptions? The correlations are all the same because the only mechanism to account for the correlation between any two items is the route through the latent variable that links those items. For example, X1 and X2 are linked only by the route made up of paths a1 and a2. The correlation can be computed by tracing the route joining the two items in question and multiplying the path values. For any two items, this entails multiplying two paths that have identical values (i.e., a1 = a2 = a3). Correlations computed by multiplying equal values will, of course, be equal.

The assumptions also imply that each of these correlations between items equals the square of any path from the latent variable to an individual item. How do we reach this conclusion? The product of two different paths (e.g., a1 × a2) is identical to the square of either path because both path coefficients are identical. If a1 = a2 = a3 and (a1 × a2) = (a1 × a3) = (a2 × a3), then each of these latter products must also equal the value of any of the paths multiplied by itself.

It also follows from the assumptions of this model that the proportion of error associated with each item is the complement of the proportion of variance that is related to the latent variable. In other words, whatever effect on a given item is not explained by the latent variable must be explained by error. Together, these two effects explain 100% of the variation in any given item. This is so simply because the error term, e, is defined as encompassing all sources of variation in the item other than the latent variable.

These assumptions support at least one other conclusion: Because each item is influenced equally by the latent variable, and each error term's influence on its corresponding item is also equal, the items all have equal means and equal variances. If the only two sources that can influence the mean are identical across items, then clearly the means for the items also will be identical. The same reasoning holds for the item variances.

In conclusion, the parallel tests model assumes

1. random error,
2. errors that are not correlated with one another,
3. errors that are not correlated with the true score,
4. that the latent variable affects all items equally, and
5. that the amount of error for each item is equal to that of every other item.

These assumptions allow us to reach a variety of interesting conclusions. Furthermore, the model enables us to make inferences about the latent variable based on the items' correlations with one another. However, the model accomplishes this feat by setting forth fairly stringent assumptions.
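Because the parallel tests model makes such concrete claims (equal item means and variances, equal inter-item correlations, each correlation equal to the square of the standardized path), they are easy to check against simulated data. The Python sketch below is purely illustrative; the sample size, the loading of .70, and the error variance are arbitrary choices for the demonstration, not values taken from the text. It reproduces the earlier observation that equal paths of .70 imply inter-item correlations of about .49.

```python
# A minimal simulation of five strictly parallel items (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_people = 100_000

latent = rng.normal(0, 1, n_people)            # true score of the latent variable
loading = 0.70                                  # same influence of the latent variable on every item
error_sd = np.sqrt(1 - loading**2)              # same error variance for every item

items = np.column_stack([
    loading * latent + rng.normal(0, error_sd, n_people) for _ in range(5)
])

print(np.round(np.corrcoef(items, rowvar=False), 2))  # off-diagonal values all near .49 (= .70 squared)
print(np.round(items.var(axis=0), 2))                 # item variances approximately equal
print(np.round(items.mean(axis=0), 2))                # item means approximately equal
```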
ALTERNATIVE MODELS

As it happens, not all of the narrowly restrictive assumptions associated with strictly parallel tests are necessary in order to make useful inferences about the relationship of true scores to observed scores. A model based on what are technically called essentially tau-equivalent tests (or, occasionally, randomly parallel tests) makes a more liberal assumption, namely, that the amount of error variance associated with a given item need not equal the error variance of the other items (e.g., Allen & Yen, 1979). Consequently, the standardized values of the paths from the latent variable to each item may not be equal. However, the unstandardized values of the paths from the latent variable to each item (i.e., the amount, as opposed to the proportion, of influence that the latent variable has on each item) are still presumed to be identical for all items. This means that items are parallel with respect to how much they are influenced by the latent variable but are not necessarily influenced to exactly the same extent by the extraneous factors that are lumped together as error. Under strictly parallel assumptions, different items not only tap the true score to the same degree, but their error components are also the same. Tau equivalency ("tau" is the Greek equivalent of "t," as in true score) is much easier to live with because it does not impose the "equal errors" condition. Because errors may vary, item means and variances may also vary. The more liberal assumptions of this model are attractive because finding equivalent measures of equal variance is rare. This model allows us to reach many of the same conclusions we reach with strictly parallel tests but with less restrictive assumptions. Readers may wish to compare this model to Nunnally and Bernstein's (1994) discussion of the "domain sampling model."

Some scale developers consider even the essentially tau-equivalent model too restrictive. After all, how often can we assume that each item is influenced by the latent variable to the same degree? Tests developed under what is called the congeneric model (Jöreskog, 1971) are subject to an even more relaxed set of assumptions (see Carmines & McIver, 1981, for a discussion of congeneric tests). It assumes (beyond the basic measurement assumptions) merely that all the items share a common latent variable. They need not bear equally strong relationships to the latent variable, and their error variances need not be equal. One must assume only that each item reflects the true score to some degree. Of course, the more strongly each item correlates with the true score, the more reliable the scale will be.

An even less constrained approach is the general factor model, which allows multiple latent variables to underlie a given set of items. Carmines and McIver (1981), Loehlin (1998), and Long (1983) have discussed the merits of this type of very general model, chief among them being its improved correspondence to real-world data. Structural equation modeling approaches often incorporate factor analyses into their measurement models. Situations in which multiple latent variables underlie a set of indicators exemplify the general factor model (Loehlin, 1998).

The congeneric model is a special case of the factor model (i.e., a single-factor case). Likewise, an essentially tau-equivalent measure is a special case of a congeneric measure, one for which the relationships of items to their latent variable are assumed to be equal. Finally, a strictly parallel test is a special case of an essentially tau-equivalent one, adding the assumption of equal relationships between each item and its associated sources of error.

Another measurement strategy should be mentioned: item response theory (IRT). This approach has been used primarily, but not exclusively, with dichotomous-response (e.g., correct versus incorrect) items in developing ability tests. Different models within the broader class of IRT may be based on the normal or, with increasing frequency, the logistic probability function. IRT assumes that each individual item has its own characteristic sensitivity to the latent variable, represented by an item characteristic curve (ICC). An ICC is a plot of the relationship between the value of the latent variable (e.g., ability) and the probability of a certain response to an item (e.g., answering it correctly). Thus the curve reveals how much ability an item demands to be answered correctly.
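To make the idea of an item characteristic curve concrete, the brief sketch below evaluates one common logistic form, the two-parameter logistic model. The specific parameter values (a discrimination of 1.5 and a difficulty of 0.0) are arbitrary illustrations, not quantities from the text, and this is only one of several IRT models discussed in Chapter 7.

```python
# A minimal sketch of a two-parameter logistic item characteristic curve (ICC).
# The discrimination (a) and difficulty (b) values are arbitrary examples.
import numpy as np

def icc(theta, a=1.5, b=0.0):
    """Probability of a correct response given ability theta (2PL form)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

for theta in np.linspace(-3, 3, 7):
    print(f"ability = {theta:+.1f}   P(correct) = {icc(theta):.2f}")
# Low-ability respondents rarely answer correctly, high-ability respondents usually do,
# and the steepness of the rise reflects how sharply the item discriminates.
```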
We will consider IRT further in Chapter 7. Except for that consideration of IRT in Chapter 7 and a discussion of factor analysis in Chapter 6, we will focus primarily on parallel and essentially tau-equivalent models, for several reasons. First, they exemplify classical measurement theory. In addition, discussing the mechanisms by which other models operate can quickly become burdensome. Finally, classical models have proven very useful for social scientists with primary interests other than measurement who, nonetheless, take careful measurement seriously. This group is the audience for whom the present text has been written. For these individuals, the scale development procedures that follow from a classical model generally yield very satisfactory scales. Indeed, although to my knowledge no tally is readily available, I suspect that (outside of ability testing) a substantial majority of the well-known and highly regarded scales used in social science research were developed using such procedures.

EXERCISES

1. How can we infer the relationship between the latent variable and two items that tap it, based on the correlations between the two items?
2. What is the chief difference in assumptions between the parallel tests and essentially tau-equivalent models?
3. Which measurement model assumes, beyond the basic assumptions common to all measurement approaches, only that the items share a common latent variable?

NOTE

1. Although −.70 is also an allowable square root of .49, deciding between the positive and negative root is usually of less concern than one would think. As long as the items can be made to correlate positively with one another (if necessary, by reverse scoring certain items, as discussed in Chapter 3), then the signs of the paths from the latent variable to the individual items will be the same and are arbitrary. Note, however, that giving positive signs to these paths implies that the items indicate more of the construct, whereas negative coefficients would imply the opposite.

Reliability

Reliability is a fundamental issue in psychological measurement. Its importance is clear once its meaning is fully understood. Scale reliability is the proportion of variance attributable to the true score of the latent variable. There are several methods for computing reliability, but they all share this fundamental definition. However, how one conceptualizes and operationalizes reliability differs with the computational method one uses.

CONTINUOUS VERSUS DICHOTOMOUS ITEMS

Although items may have a variety of response formats, we assume in this chapter that item responses consist of multiple-value response options. Dichotomous items (i.e., items with only two response options, such as "yes" and "no," or with multiple response options that can be classified as "right" versus "wrong") are widely used in ability testing and, to a lesser degree, in other measurement contexts. Examples are

1. Zurich is the capital of Switzerland.   True   False
2. What is the value of pi?   (a) 3.14   (b) 2.78

Special methods for computing reliability that take advantage of the computational simplicity of dichotomous responses have been developed. General measurement texts such as Nunnally and Bernstein (1994) cover these methods in some detail.
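One of the best known of these special-purpose methods is the Kuder-Richardson formula 20 (KR-20), which reappears briefly later in this chapter. The sketch below is a minimal illustration of how it could be computed; the tiny response matrix is invented purely for demonstration.

```python
# A minimal sketch of the Kuder-Richardson formula 20 (KR-20) for dichotomous items.
# Rows are respondents, columns are items; the data are hypothetical.
import numpy as np

responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
])

k = responses.shape[1]
p = responses.mean(axis=0)                 # proportion answering each item "correctly"
item_var_sum = (p * (1 - p)).sum()         # sum of item variances (p*q for dichotomous items)
total_var = responses.sum(axis=1).var()    # variance of total scores

kr20 = (k / (k - 1)) * (1 - item_var_sum / total_var)
print(round(kr20, 3))
```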
The logic of these methods for assessing reliability largely parallels the more general approach that applies to multipoint, continuous scale items. In the interest of brevity, this chapter will make only passing reference to reliability assessment methods intended for scales made up of dichotomous items. Some characteristics of this type of scale are discussed in Chapter 5.

INTERNAL CONSISTENCY

Internal consistency reliability, as the name implies, is concerned with the homogeneity of the items within a scale. Scales based on classical measurement models are intended to measure a single phenomenon. As we saw in the preceding chapter, measurement theory suggests that the relationships among items are logically connected to the relationships of items to the latent variable. If the items of a scale have a strong relationship to their latent variable, they will have a strong relationship to each other. Although we cannot directly observe the linkage between items and the latent variable, we can certainly determine whether the items are correlated with one another. A scale is internally consistent to the extent that its items are highly intercorrelated. What can account for correlations among items? There are two possibilities: that items causally affect each other (e.g., item A causes item B) or that the items share a common cause. Under most conditions, the former explanation is unlikely, leaving the latter as the more obvious choice. High interitem correlations thus suggest that the items are all measuring (i.e., are manifestations of) the same thing. If we make the assumptions discussed in the preceding chapter, we also can conclude that strong correlations among items imply strong links between items and the latent variable. Thus, a unidimensional scale, or a single dimension of a multidimensional scale, should consist of a set of items that correlate well with each other. Multidimensional scales measuring several phenomena, for example, the Multidimensional Health Locus of Control (MHLC) scales (Wallston et al., 1978), are really families of related scales; each "dimension" is a scale in its own right.

Coefficient Alpha

Internal consistency is typically equated with Cronbach's (1951) coefficient alpha, α. We will examine alpha in some detail for several reasons. First, it is widely used as a measure of reliability. Second, its connection to the definition of reliability may be less self-evident than is the case for other measures of reliability (such as the alternate forms method) discussed later. Consequently, alpha may appear more mysterious than other reliability computation methods to those who are not familiar with its internal workings. Finally, an exploration of the logic underlying the computation of alpha provides a sound basis for comparing how other computational methods capture the essence of what is meant by reliability.

The Kuder-Richardson formula 20, or KR-20 as it is more commonly known, is a special version of alpha for items that are dichotomous (e.g., Nunnally & Bernstein, 1994). However, as noted earlier, we will concentrate on the more general form that applies to items having multiple response options.

You can think about all the variability in a set of item scores as due to one of two things: (a) actual variation across individuals in the phenomenon that the scale measures (i.e., true variation in the latent variable) and (b) error.
This is true because classical measurement models define the phenomenon (e.g., patients' desire for control of their interactions with a physician) as the source of all shared variation, and error as any remaining, or unshared, variation in scale scores (e.g., a single item's unintended double meaning). Another way to think about this is to regard total variation as having two components: signal (i.e., true differences in patients' desire for control) and noise (i.e., score differences caused by everything but true differences in desire for control). Computing alpha, as we shall see, partitions the total variance among the set of items into signal and noise components. The proportion of total variation that is signal equals alpha. Thus another way to think about alpha is that it equals 1 − error variance or, conversely, that error variance = 1 − alpha.

The Covariance Matrix

To understand internal consistency more fully, it helps to examine the covariance matrix of a set of scale items. A covariance matrix for a set of scale items reveals important information about the scale as a whole.

A covariance matrix is a more general form of a correlation matrix. In a correlation matrix, the data have been standardized, with the variances set to 1.0. In a covariance matrix, the data entries are unstandardized; thus, it contains the same information, in unstandardized form, as a correlation matrix. The diagonal elements of a covariance matrix are variances (covariances of items with themselves), just as the unities along the main diagonal of a correlation matrix are variables' variances standardized to 1.0 and also their correlations with themselves. Its off-diagonal values are covariances, expressing relationships between pairs of unstandardized variables just as correlation coefficients do with standardization. So, conceptually, a covariance matrix consists of (a) variances (on the diagonal) for individual variables and (b) covariances (off-diagonal) representing the unstandardized relationships between variable pairs. A typical covariance matrix for three variables, X1, X2, and X3, is shown in Table 3.1.

Table 3.1

             X1             X2             X3
X1        Var(X1)       Cov(X1, X2)    Cov(X1, X3)
X2      Cov(X2, X1)       Var(X2)      Cov(X2, X3)
X3      Cov(X3, X1)     Cov(X3, X2)      Var(X3)

An alternative that somewhat more compactly uses the customary symbols for matrices, variances, and covariances is

$$
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \sigma_{13} \\
\sigma_{21} & \sigma_2^2 & \sigma_{23} \\
\sigma_{31} & \sigma_{32} & \sigma_3^2
\end{bmatrix}
$$

Covariance Matrices for Multi-item Scales

Let us focus our attention on the properties of a covariance matrix for a set of items that, when added together, make up a scale. The covariance matrix presented above has three variables, X1, X2, and X3. Assume that these variables are actually scores for three items and that the items, when added together, make up a scale we will call Y. What can this matrix tell us about the relationship of the individual items to the scale as a whole?

A covariance matrix has a number of very interesting (well, useful at least) properties. Among these is the fact that adding all of the elements in the matrix together (i.e., summing the variances, which are along the diagonal, and the covariances, which are off of the diagonal) gives a value that is exactly equal to the variance of the scale as a whole, assuming that the items are equally weighted. So, if we add all the terms in the symbolic covariance matrix, the resulting sum would be the variance of scale Y. This is very important and bears repeating: The variance of a scale, Y, made up of any number of items, equals the sum of all the values in the covariance matrix for those items, assuming equal item weighting.¹ Thus the variance of a scale Y, made up of three equally weighted items, X1, X2, and X3, has the following relationship to the covariance matrix of the items:

$$
\sigma_Y^2 \;=\; \sum \mathbf{C} \;=\; \sum_{i}\sum_{j}\sigma_{ij}
$$
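This property is easy to confirm numerically. The short sketch below generates three arbitrary, correlated item scores (the data are simulated purely for illustration) and checks that the variance of their equally weighted sum matches the sum of all entries in their covariance matrix.

```python
# A minimal numerical check that the variance of an equally weighted sum of items
# equals the sum of all elements of the items' covariance matrix. Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=10_000)
items = np.column_stack(
    [latent + rng.normal(scale=s, size=latent.size) for s in (0.8, 1.0, 1.2)]
)

cov = np.cov(items, rowvar=False)          # 3 x 3 covariance matrix of the items
scale_scores = items.sum(axis=1)           # Y = X1 + X2 + X3

print(round(cov.sum(), 3))                 # sum of all variances and covariances
print(round(scale_scores.var(ddof=1), 3))  # variance of the scale; matches the line above
```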
Readers who would like more information about the topics covered in this section are referred to Nunnally (1978) for covariance matrices and Namboodiri (1984) for an introduction to matrix algebra in statistics. The covariance matrix for the individual items has additional useful information not mentioned here. Applications that can be derived from item covariance matrices are discussed by Bohrnstedt (1969).

Alpha and the Covariance Matrix

Alpha is defined as the proportion of a scale's total variance that is attributable to a common source, presumably the true score of a latent variable underlying the items. Thus, if we want to compute alpha, it would be useful to have a value for the scale's total variance and a value for the proportion that is "common" variance. The covariance matrix is just what we need in order to do this.

Recall the diagram we used in Chapter 2 to show how items relate to their latent variable, as in Figure 3.1.

Figure 3.1  A diagrammatic representation of how a set of five items (X1 through X5, each with its own error term) relates to the common latent variable, Y.

All of the variation in items that is due to the latent variable, Y, is shared, or common. (The terms joint and communal are also used to describe this variation.) When Y varies (as it will, for example, across individuals having different levels of the attribute it represents), scores on all the items will vary with it, because it is a cause of those scores. Thus, if Y is high, all the item scores will tend to be high; if Y is low, they will tend to be low. This means that the items will tend to vary jointly (i.e., be correlated with one another). So, the latent variable affects all of the items, and thus they are correlated. The error terms, in contrast, are the source of the unique variation that each item possesses. Whereas all items share variability due to Y, no two items share any variation from the same error source, under our classical measurement assumptions. The value of a given error term affects the score of only one item. Thus, the error terms are not correlated with one another. So, each item (and, by implication, the scale defined by the sum of the items) varies as a function of (a) the source of variation common to it and the other items and (b) unique, unshared variation that we refer to as error. It follows that the total variance for each item, and hence for the scale as a whole, must be a combination of variance from common and unique sources. According to the definition of reliability, alpha should equal the ratio of common-source variation to total variation.

Now, consider a k-item measure called Y whose covariance matrix is as follows:

$$
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1k} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{k1} & \sigma_{k2} & \cdots & \sigma_k^2
\end{bmatrix}
$$

The variance, σ²_Y, of the k-item scale equals the sum of all the matrix elements. The entries along the main diagonal are the variances of the individual items represented in the matrix. The variance of the ith item is signified as σ²_i. Therefore, the sum of the elements along the main diagonal, Σσ²_i, is the sum of the variances of the individual items.
Thus the covariance matrix gives us ready access to two values: (a) the total variance of the scale, σ²_Y, defined as the sum of all elements in the matrix, and (b) the sum of the individual item variances, Σσ²_i, computed by summing entries along the main diagonal. These two values can be given a conceptual interpretation. The sum of the whole matrix is, by definition, the variance of Y, the scale made up of the individual items. However, this total variance, as we have said, can be partitioned into different parts.

Let us consider how the covariance matrix separates common from unique variance by examining how the elements on the main diagonal of the covariance matrix differ from all the off-diagonal elements. All of the variances (diagonal elements) are single-variable, or "variable with itself," terms. I noted earlier that these variances can be thought of as covariances of items with themselves. Each variance contains information about only one item. In other words, each represents information that is based on a single item, not joint variation shared among items. (Within that single item, some of its variation will be due to the common underlying variable and thus shared with other items; some will not. However, the item's variance does not quantify the extent of shared variance, merely the amount of dispersion in the scores for that item, irrespective of what causes it.) The off-diagonal elements of the covariance matrix all involve pairs of items, and thus common, or joint, variation between two of the scale's items (covariation). Thus the elements in the covariance matrix (and hence the total variance of Y) consist of covariation (joint variation, if you will) plus nonjoint, or noncommunal, variation concerning items considered individually. Figure 3.2 pictorially represents these two subdivisions of the covariance matrix.

Figure 3.2  A variance-covariance matrix showing that the variances along the main diagonal (shaded area) are noncommunal, whereas the covariances lying above and below the diagonal (unshaded area) are communal.

As the covariances, and only the covariances, represent communal variation, all noncommunal variation must be represented in the variances along the main diagonal of the covariance matrix and thus by the term Σσ²_i. The total variance, of course, is expressed by σ²_Y, the sum of all the matrix elements. Thus we can express the ratio of nonjoint variation to total variation in Y as

$$
\frac{\sum \sigma_i^2}{\sigma_Y^2}
$$

This ratio corresponds to the sum of the diagonal values relative to the covariance matrix as a whole. It then follows that we can express the proportion of joint, or communal, variation as what is left over, that is, the complement of this value, as shown:

$$
1 \;-\; \frac{\sum \sigma_i^2}{\sigma_Y^2}
$$

This value corresponds to the sum of all the off-diagonal values relative to the covariance matrix as a whole. It may seem strange, or at least inefficient, to compute the diagonal elements and then subtract them from the value of the covariance matrix as a whole. Why not just compute the sum of the off-diagonal elements directly, as Σσ_ij, where i and j represent each of the two items involved in a particular covariance? In fact, one would arrive at exactly the same point by directly computing the sum of the off-diagonal elements. The formula involving subtraction from 1 is a legacy of the days when computers were not available to do calculations. Computing the total variance for Y and the variance for each individual item, i, were probably operations that had already been done for other purposes.
Even if there were no need to calculate these variances for other purposes, consider the computational effort involved. For a 20-item scale, the choice would be between computing 21 variances (one for each item and another for the entire scale) and 190 covariances (i.e., one for each of the 380 off-diagonal elements of the matrix, with those above the diagonal identical to those below) plus the total variance. Thus, a formula that quantifies communal variance as what remains after removing noncommunal from total variance makes more sense than might at first be apparent.

The value represented by the formula

$$
\frac{\sigma_Y^2 - \sum \sigma_i^2}{\sigma_Y^2}
\quad\text{or, equivalently,}\quad
1 - \frac{\sum \sigma_i^2}{\sigma_Y^2}
$$

would at first blush seem to capture the definition of alpha, that is, the communal portion of total variance in a scale that can be attributed to the items' common source, which we presume reflects the true score of the latent variable. We need one more correction, however. This need becomes apparent if we consider what would happen if we had, say, five perfectly correlated items. Such an arrangement should result in perfect reliability. The correlation matrix in this instance would consist of a 5 × 5 matrix with all values equal to 1.0. The denominator of the preceding equation would thus equal 25. The numerator, however, would equal only 20, thus yielding a reliability of 20/25, or .80, rather than 1.0. Why is this so? The total number of elements in the covariance matrix is k². The number of elements in the matrix that are noncommunal (i.e., those along the main diagonal) is k. The number that are communal (all those not on the diagonal) is k² − k. The fraction in our last formula thus has a numerator based on k² − k values and a denominator based on k² values. To adjust our calculations so that the ratio expresses the relative magnitude, rather than the numbers, of the terms summed in the numerator and denominator, we multiply the entire expression representing the proportion of communal variation by a factor that counteracts the difference in the numbers of terms summed. To do this, we multiply by k²/(k² − k) or, equivalently, k/(k − 1). This limits the range of possible values for alpha to between 0.0 and 1.0. In the five-item example just discussed, multiplying .80 by 5/4 yields the appropriate 1.0. Readers may want to do the mental arithmetic for matrices of other sizes. It should soon become apparent that k/(k − 1) is always the multiplier that will yield an alpha of 1.0 when the items are all perfectly correlated. Thus, we arrive at the usual formula for coefficient alpha:

$$
\alpha \;=\; \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_Y^2}\right)
$$

To summarize, a measure's reliability equals the proportion of total variance among its items that is due to the latent variable and thus is communal. The formula for alpha expresses this by specifying the portion of total variance in the item set that is unique, subtracting this from 1 to determine the proportion that is communal, and multiplying by a correction factor to adjust for the number of elements contributing to the earlier computations.
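The formula just derived translates directly into a few lines of code. The sketch below is a minimal illustration using simulated item scores (the data and sample size are arbitrary), and it also reproduces the five-perfectly-correlated-items check described above.

```python
# A minimal sketch of coefficient alpha computed from the item covariance matrix.
# The simulated data are purely illustrative.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: a respondents-by-items array of scores."""
    cov = np.cov(items, rowvar=False)
    k = cov.shape[0]
    total_variance = cov.sum()            # variance of the summed scale
    item_variances = np.diag(cov).sum()   # noncommunal portion (main diagonal)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

rng = np.random.default_rng(2)
latent = rng.normal(size=5_000)
items = np.column_stack([latent + rng.normal(size=latent.size) for _ in range(5)])
print(round(cronbach_alpha(items), 2))    # alpha for the simulated five-item scale

perfect = np.column_stack([latent] * 5)   # five perfectly correlated "items"
print(round(cronbach_alpha(perfect), 6))  # 1.0 (up to floating-point rounding):
                                          # the k/(k - 1) correction restores perfect reliability
```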
An Alternative Formula for Alpha

Another common formula for computing alpha is based on correlations rather than covariances. Actually, it uses r̄, the average interitem correlation. This formula is

$$
\alpha \;=\; \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}
$$

It follows logically from the covariance-based formula for alpha. Consider the covariance formula in conceptual terms:

1 − (Sum of item variances / Sum of variances and covariances)

Note that the numerator and denominator in the term on the right are sums of individual values. However, the sum of these individual values is identical to the mean of the values multiplied by the number of values involved. (For example, k numbers that sum to 50 and k times the mean of those numbers both equal 50. To illustrate further, substitute 10 for k in the preceding sentence: the average of 10 values that sum to 50 has to be 5, and 10 times 5 equals 50, the same value as the original sum.) Therefore, the numerator of the term on the right must equal k times the average item variance, v̄, and the denominator must equal k times the average variance plus (k² − k), or, alternatively, k(k − 1), times the average covariance, c̄:

$$
\alpha \;=\; \frac{k}{k-1}\left(1 - \frac{k\bar{v}}{k\bar{v} + (k^2-k)\,\bar{c}}\right)
$$

To remove the "1 −" from the equation, we can replace it with its equivalent, [k v̄ + (k² − k) c̄] / [k v̄ + (k² − k) c̄], which allows us to consolidate the whole term on the right into a single ratio:

$$
\alpha \;=\; \frac{k}{k-1}\times\frac{(k^2-k)\,\bar{c}}{k\bar{v} + (k^2-k)\,\bar{c}}
\quad\text{or, equivalently,}\quad
\frac{k}{k-1}\times\frac{k(k-1)\,\bar{c}}{k\bar{v} + k(k-1)\,\bar{c}}
$$

Cross-canceling k from the numerator of the left term and the denominator of the right term, and cross-canceling (k − 1) from the numerator of the right term and the denominator of the left term, yields the simplified expression

$$
\alpha \;=\; \frac{k\,\bar{c}}{\bar{v} + (k-1)\,\bar{c}}
$$

Recall that the formula we are striving for involves correlations rather than covariances and thus standardized rather than unstandardized terms. After standardizing, an average of covariances is identical to an average of correlations, and a variance equals 1.0. Consequently, we can replace c̄ with the average interitem correlation, r̄, and v̄ with 1.0. This yields the correlation-based formula for coefficient alpha:

$$
\alpha \;=\; \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}
$$

This formula is known as the Spearman-Brown prophecy formula, and one of its important uses will be illustrated in the section of this chapter dealing with split-half reliability computation.

The two different formulas, one based on covariances and the other on correlations, are sometimes referred to as the raw score and standardized score formulas for alpha, respectively. The raw score formula preserves information about item means and variances in the computation process, because covariances are based on values that retain the original scaling of the raw data. If items have markedly different variances, those with larger variances will be given greater weight than those with smaller variances when this formula is used to compute alpha. The standardized score formula based on correlations does not retain the original scaling metric of the items. Recall that a correlation is a standardized covariance. So, all items are placed on a common metric and thus weighted equally in the computation of alpha by the standardized formula. Which is better depends on the specific context and whether equal weighting is desired. As we shall see in later chapters, recommended procedures for developing items often entail structuring their wording so as to yield comparable variances for each item. When these procedures are followed, there is typically little difference between the alpha coefficients computed by the two alternative methods. On the other hand, when procedures aimed at producing equivalent item variances are not followed, observing that the standardized and raw alpha values differ appreciably (e.g., by .05 or more) indicates that at least one item has a variance that differs appreciably from the variances of the other items.
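As a companion to the covariance-based sketch above, the following fragment computes the standardized (correlation-based) form from the average interitem correlation and compares it with the raw form. Again, the simulated data are only for illustration, and the deliberately inflated variance of one item is a made-up detail included to show how the two versions can diverge.

```python
# A minimal sketch contrasting raw (covariance-based) and standardized
# (correlation-based) alpha. The simulated items are illustrative only.
import numpy as np

def raw_alpha(items):
    cov = np.cov(items, rowvar=False)
    k = cov.shape[0]
    return (k / (k - 1)) * (1 - np.diag(cov).sum() / cov.sum())

def standardized_alpha(items):
    corr = np.corrcoef(items, rowvar=False)
    k = corr.shape[0]
    r_bar = corr[np.triu_indices(k, 1)].mean()   # average interitem correlation
    return (k * r_bar) / (1 + (k - 1) * r_bar)

rng = np.random.default_rng(3)
latent = rng.normal(size=5_000)
items = np.column_stack([latent + rng.normal(size=latent.size) for _ in range(5)])
items[:, 0] *= 4                                  # give one item a much larger variance

print(round(raw_alpha(items), 3), round(standardized_alpha(items), 3))
# When item variances differ markedly, the two versions of alpha no longer agree closely.
```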
Reliability and Statistical Power

An often overlooked benefit of more reliable scales is that they increase statistical power for a given sample size (or allow a smaller sample size to yield equivalent power), relative to less reliable measures. To have a specified degree of confidence in the ability to detect a difference of a given magnitude between two experimental groups, for example, one needs a sample of a particular size. The probability of detecting such a difference (the power of the statistical test) can be increased by increasing the sample size. In many applications, much the same effect can be obtained by improving the reliability of measurement. A reliable measure, like a larger sample, contributes relatively less error to the statistical analysis. Researchers might do well to weigh the relative advantages of increasing scale reliability versus sample size in research situations where both options are available.

The power gains from improving reliability depend on a number of factors, including the initial sample size, the probability level set for detecting a Type I error, the effect size (e.g., mean difference) that is considered significant, and the proportion of error variance that is attributable to measure unreliability rather than sample heterogeneity or other sources. A precise comparison between reliability enhancement and sample size increase requires that these factors be specified; however, the following examples illustrate the point. In a hypothetical research situation with the probability of a Type I error set at .05, a 10-point difference between two means regarded as important, and an error variance equal to 100, the sample size would have to be increased from 128 to 172 (a 34% increase) to raise the power of an F test from .80 to .90. Reducing the total error variance from 100 to 75 (a 25% decrease) would have essentially the same result without increasing the sample size. Substituting a highly reliable scale for a substantially poorer one might accomplish this. As another example, for N = 50, two scales with reliabilities of .38 that have a correlation (r = .24) barely achieving significance at p < .10 are significant at p < .01 if their reliabilities are increased to .90. If the reliabilities remained at .38, a sample more than twice as large would be needed for the correlation to reach p < .01. Lipsey (1990) provides a more comprehensive discussion of statistical power, including the effects of measurement reliability.
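The N = 50 example can be restated numerically using the standard correction-for-attenuation relationship from classical test theory, r_observed = r_true × √(reliability_x × reliability_y). That relationship is not derived in this chapter, so treat the sketch below as an assumption-laden illustration rather than a reproduction of the original calculations.

```python
# A small numerical restatement of the N = 50 example, assuming the classical
# attenuation relationship: r_observed = r_true * sqrt(rel_x * rel_y).
import math

r_observed, rel_low, rel_high = 0.24, 0.38, 0.90

r_true = r_observed / math.sqrt(rel_low * rel_low)            # correlation corrected for attenuation
r_with_better_scales = r_true * math.sqrt(rel_high * rel_high)

print(round(r_true, 2))                # roughly .63
print(round(r_with_better_scales, 2))  # roughly .57
# Per the text, with N = 50 an observed r of about .24 is barely significant at p < .10,
# whereas an observed r near .57 is comfortably significant at p < .01.
```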
RELIABILITY BASED ON CORRELATIONS BETWEEN SCALE SCORES

There are alternatives to internal consistency reliability. These types of reliability computation involve having the same set of people complete two separate versions of a scale or the same version on multiple occasions.

Alternate Forms Reliability

If two strictly parallel forms of a scale exist, then the correlation between them can be computed as long as the same people complete both parallel forms. For example, assume that a researcher first developed two equivalent sets of items measuring patients' desire for control when interacting with physicians, then administered both sets of items to a group of patients and, finally, correlated the scores from one set of items with the scores from the other set. This correlation would be the alternate forms reliability. Recall that parallel forms are made up of items, all of which (whether within or between forms) do an equally good job of measuring the latent variable. This implies that both forms of the scale have identical alphas, means, and variances and assess the same phenomenon. In essence, parallel forms consist of one set of items that has been divided, more or less arbitrarily, into two subsets that make up the two parallel alternate forms of the scale. Under these conditions, the correlation between one form and the other is equivalent to correlating either form with itself, as each alternate form is equivalent to the other.

Split-Half Reliability

A problem with alternate forms reliability is that we usually do not have two versions of a scale that conform strictly to the assumptions of parallel tests. However, there are other reliability estimates that apply the same sort of logic to a single set of items. Because alternate forms are essentially made up of a single pool of items that has been divided in two, it follows that we can (a) take the set of items that makes up a single scale (i.e., a scale that does not have any alternate form), (b) divide that set of items into two subsets, and (c) correlate the subsets to assess reliability.

A reliability measure of this type is called a split-half reliability. Split-half reliability is really a class rather than a single type of computational method, because there are a variety of ways in which the scale can be split in half. One method is to compare the first half of the items to the second half. This type of first-half versus last-half split may be problematic, however, because factors other than the value of the latent variable (in other words, sources of error) might affect each subset differently. For example, if the items making up the scale in question were scattered throughout a lengthy questionnaire, the respondents might be more fatigued when completing the second half of the scale. Fatigue would then differ systematically between the two halves and would thus make them appear less similar. However, the dissimilarity would not be so much a characteristic of the items per se as of their position in the item order of the scale. Other factors that might differentiate earlier-occurring from later-occurring items are a practice effect whereby respondents might get better at answering the items as they go along, failure to complete the entire set of items, and possibly even something as mundane as changes in the print quality of a questionnaire from front to back. As with fatigue, these factors would lower the correlation between halves because of the order in which the scale items were presented and not because of the quality of the scale items. As a result of influences such as these, measuring the strength of the relationships among items may be complicated by circumstances not directly related to item quality, resulting in an erroneous reliability assessment.

To avoid some of the pitfalls associated with item order, one can assess another type of split-half reliability known as odd-even reliability. In this instance, the subset of odd-numbered items is compared to the even-numbered items. This ensures that each of the two subsets of items consists of an equal number from each section (i.e., the beginning, middle, and end) of the original scale. Assuming the item order is irrelevant (as opposed to the "easy-to-hard" order common to achievement tests, for example), this method avoids many of the problems associated with first-half versus second-half split halves.

In theory, there are many other ways to arrive at split-half reliability. Two alternatives to the methods discussed above for constituting the item subsets are balanced halves and random halves. In the former case, one would identify some potentially important item characteristics (such as first-person wording, item length, or whether a certain type of response indicates the presence or absence of the attribute in question).
The two halves of the scale would then be constituted so as to have these characteristics equally represented in each half. Thus, an investigator might divide up the items so that each subset had the same number of items worded in the first person, the same number of short items, and so on. However, when considering multiple item characteristics, it might be impossible to balance the proportion of one without making it impossible to balance another. This would be the case, for example, if there were more long than short first-person items; creating a balance for the one characteristic would result in an imbalance of the other. Also, it may be difficult to determine which characteristics of the items should be considered.

An investigator can create random halves merely by randomly allocating each item to one of the two subsets, which are then correlated with one another to compute the reliability coefficient. How well this works depends on the number of items, the number of characteristics of concern, and the degree of independence among those characteristics. Hoping that a small number of items, varying along several interrelated dimensions, will yield comparable groupings through randomization is unrealistic. On the other hand, randomly assigning items from a larger pool, varying with respect to a few uncorrelated characteristics, to subgroups might well result in reasonably comparable subsets. Which method of achieving split halves is best depends on the particular situation. What is most important is that the investigator think about how splitting the items might result in nonequivalent subsets and take steps to avoid that outcome.

The reasoning behind both split-half and alternate forms reliability is a natural extension of the parallel tests model. Although in discussing that model we regarded each item as a separate test, a group of items (such as half of the items in a scale, or one of two alternate forms) can also conform to the model's assumptions. Therefore, we can apply the same logic used with a set of several items to the case of two alternate forms or two halves of a scale. Consider two "tests" (i.e., alternate forms or split halves) under the parallel tests assumptions, as shown in Figure 3.3.

Figure 3.3  A path diagram showing the relationship of two split halves of a measure (X1 and X2) to their common latent variable.

The only route linking the two tests consists of the paths joining them to the latent variable they share. Thus the product of these paths equals the correlation between the tests. If the path values are equal (as they should be under the assumptions of this model), then the correlation between the tests equals the square of the path value from the latent variable to either test. The square of that path (assuming that it is a standardized path coefficient) is also the proportion of variance in either test that is influenced by the latent variable. This, in turn, is the definition of reliability. Thus the correlation between the two tests equals the reliability of each.

Whereas the "tests" referred to in the preceding paragraph are two complete versions of a scale in the alternate forms case, they are two half-scales in the split-half instance. Thus the correlation between two split halves yields a reliability estimate for each half of the whole set of items, which is an underestimate of the reliability of the entire set. An estimate of the reliability of the entire scale, based on the reliability of a portion of the scale, can be computed by using the Spearman-Brown formula discussed earlier in this chapter. Recall that, according to this formula,

$$
\alpha \;=\; \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}
$$

where k is the number of items in question and r̄ is the average correlation of any one item with any other (i.e., the average interitem correlation). If you had determined the reliability of a subset of items (e.g., by means of the split-half method) and knew how many items that reliability was based on (e.g., half the number in the whole scale), you could use the formula to compute r̄. Then, you could plug that value of r̄ and the number of items in the whole scale back into the formula. The result would be an estimate of the reliability of the whole scale, based on a reliability value computed on split halves of the scale. It simplifies matters if you perform a little algebra on the Spearman-Brown equation to put it into the following form:

$$
\bar{r} \;=\; \frac{r_{yy}}{k - (k-1)\,r_{yy}}
$$

where r_yy is the reliability of the item set in question. For example, if you knew that the split-half reliability for two 9-item halves was equal to .90, you could compute r̄ as follows:

$$
\bar{r} \;=\; \frac{.90}{9 - (8 \times .90)} \;=\; \frac{.90}{1.80} \;=\; .50
$$

You could then recompute the reliability for the whole 18-item scale by using r̄ = .50 and k = 18 in the Spearman-Brown formula. Thus, the reliability estimate for the full scale is

$$
\frac{18 \times .50}{1 + (17 \times .50)} \;=\; \frac{9}{9.5} \;=\; .947
$$

(Note that increasing the number of items has increased the reliability. A quick look at the Spearman-Brown formula should make it apparent that, all else being equal, a longer scale will always be more reliable than a shorter one.)
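The two-step calculation just described (solve for r̄ from the half-scale reliability, then project the reliability of the full-length scale) is easy to script. The minimal sketch below simply reproduces the 9-item/18-item example above.

```python
# A minimal sketch of the Spearman-Brown "step-up" from a split-half reliability
# to a full-scale estimate, reproducing the 9-item / 18-item example in the text.

def spearman_brown(k: int, r_bar: float) -> float:
    """Reliability of a k-item scale with average interitem correlation r_bar."""
    return (k * r_bar) / (1 + (k - 1) * r_bar)

def average_interitem_r(reliability: float, k: int) -> float:
    """Solve the Spearman-Brown formula for the average interitem correlation."""
    return reliability / (k - (k - 1) * reliability)

r_bar = average_interitem_r(reliability=0.90, k=9)   # two 9-item halves, split-half r = .90
print(round(r_bar, 2))                                # .50
print(round(spearman_brown(k=18, r_bar=r_bar), 3))    # about .947 for the full 18-item scale
```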
Temporal Stability

Another two-score method of computing reliability involves the temporal stability of a measure, that is, how constant scores remain from one occasion to another. Test-retest reliability is the method typically used to assess this. Suppose that, instead of developing two sets of items to measure patients' desire for control when interacting with physicians, our hypothetical investigator developed only a single set. Those items could be given to one group of patients on two separate occasions, and the scores from the first occasion could be correlated with those from the later administration. The rationale underlying reliability determinations of this type is that, if a measure truly reflects some meaningful construct, it should assess that construct comparably on separate occasions. In other words, the true score of the latent variable should exert comparable influence on observed scores on two (or more) occasions, while the error component should not remain constant across administrations of the scale. Consequently, the correlation of scores obtained across two administrations of a scale to the same individuals should represent the extent to which the latent variable determines observed scores. This is then equated to the definition of reliability as the proportion of variance attributable to the true score of the latent variable.

The problem with this reasoning is that what happens to the scores over time may or may not have to do with the error-proneness of the measurement procedure. Nunnally (1978) points out that characteristics of the items might cause them to yield temporally stable responses even when the construct of interest has changed. For example, if a purported anxiety measure were influenced by social desirability as well as anxiety, scores might remain constant despite variations in anxiety. The stability in scores, reflected in a high correlation across occasions of administration, would not be the result of invariance in the phenomenon of interest. Alternatively, the phenomenon may not change while scores on the measure do; that is, the scale could be unreliable. Or, changes in scores may be attributed to unreliability when, in fact, the phenomenon itself has changed and the measure has accurately tracked that change. The problem is that either a change or the absence of a change can be due to a variety of things besides the (un)reliability of the measurement procedure. Kelly and McGrath (1988) identified four factors that are confounded when one examines two sets of scores on the same measure, separated in time.
Those are (a) real change in the construct of interest (e.g., a net increase in average level of anxiety among a sample of individuals), (b) systematic oscillations in the phenomenon (e.g., variations in anxiety, around some constant mean, as a function of time of day), (c) changes attributable to differences in subjects or measurement methods rather than the phenomenon of interest (e.g., fatigue effects that cause items to be misread), and (d) temporal instability due to the inherent unreliability of the measurement procedure. Of these factors, only the fourth is unreliability. These authors also note that, although methods such as the multitrait-multimethod matrix approach (discussed in the next chapter) can help, it is never possible to unconfound these factors fully.

This is not to say that demonstrating temporal stability is unimportant. In any number of research contexts, it may be critical to assume (or demonstrate) that measurements separated in time are highly correlated. However, the stability we seek in these situations is stability of both the measure and the phenomenon. Test-retest correlations tell us about the measure only when we are highly confident that the phenomenon has remained stable, and such confidence is not often warranted. Thus test-retest reliability, although important, may best be thought of as revealing something about the nature of a phenomenon and its measurement, not the latter alone. Referring to invariance in scores over time as temporal stability is preferable because it does not suggest, as does test-retest reliability, that measurement error is the source of any instability we observe.

GENERALIZABILITY THEORY

Thus far, our discussion of reliability has focused on partitioning observed variation into the portion that is attributable to the true score of the latent variable and the remaining portion, which is error. This section briefly introduces a more general framework for partitioning variance among error and nonerror sources.

Before we apply the idea of finer partitioning of error variance to measurement, let us consider a more general research example in which multiple sources of variation are examined. Suppose that a researcher wanted to determine the effectiveness of a training program intended to increase professional productivity. Assume, furthermore, that the researcher administered the training program to a large sample of college professors and to a comparable sample of artists. The researcher also identified comparable groups of professors and artists who would not participate in the training program but would take part in the same productivity assessment as the training program participants. Upon giving the study some thought, this researcher might have concluded that the observations of productivity would reflect the operation of three identifiable sources of systematic variation: (a) participant versus nonparticipant, (b) professor versus artist, and (c) the interaction of these effects. A reasonable analytic strategy in this situation would be to perform an analysis of variance (ANOVA) on the productivity scores, treating each of these sources of variation as a dimension in the analysis. The investigator could then determine to what extent each source of variation contributed to the total variation in professional productivity. In essence, this analytic strategy would partition the total variance among observed productivity scores into several sources: training participation, profession, the interaction of these, and error.
Error would represent all sources of variation other than those specified by the preceding factors.

Now, consider a hypothetical situation in which a researcher is developing a scale of desire for autonomy. The measure will be used in a study of elderly people, some of whom may have visual problems. Consequently, the investigator plans to administer the desire-for-autonomy measure orally to those people who would have difficulty reading, and in written form to the remaining study participants.

If the researcher ignored mode of administration (written versus oral) as a source of variation in test scores, he or she would be regarding each score obtained as due to the true level of the respondent's desire for autonomy plus some degree of error. The researcher could proceed to calculate reliability as discussed earlier. Note, however, that merely computing alpha on the scale scores without regard for the mode of administration would not differentiate the potential systematic error due to administration method from any other source of error.

Alternatively, it is possible for the researcher to acknowledge administration mode as a source of variation among scores, using an analysis of variance approach. If the resulting analysis demonstrated that the difference between administration methods accounted for an inconsequential proportion of the total variation in scores, then the researcher could have greater confidence in the comparability of scores for individuals completing either the oral or the written version. If, on the other hand, a significant amount of the total observed variation in scores were attributable to administration mode, then the researcher would know that any interpretation of scores should take this difference between modes into consideration.

Generalizability theory (e.g., Cronbach, Gleser, Nanda, & Rajaratnam, 1972) provides a framework for examining the extent to which one can assume equivalence of a measurement process across one or more dimensions. In the preceding example, the dimension in question was mode of administration. Each dimension of this sort constitutes a potential source of variation and is referred to as a facet. In this example, mode of administration was the only potential source of variation (other than individuals) across which the investigator wished to generalize; therefore, this example involves a single facet. In the parlance of generalizability theory, the observations obtainable across all levels of a facet (e.g., with both oral and written administration of the scale) constitute a universe of admissible observations. The mean of these observations is referred to as the universe score and is analogous to the true score of classical test theory (Allen & Yen, 1979). A study aimed at determining to what extent scores are comparable across the levels of a facet is called a generalizability study, or G-study. The hypothetical study of desire for autonomy is an example of a G-study by virtue of its addressing the effects of different "levels" of the mode-of-administration facet.

The purpose of the G-study is to help the investigator determine the extent to which a facet does or does not limit generalizability. If the facet (e.g., mode of administration) explains a significant amount of the variance in observed scores, findings do not generalize across levels (e.g., oral versus written administration) of that facet. The extent to which one can generalize across levels of the facet without misrepresenting the data is expressed as a generalizability coefficient. This is typically computed by forming a ratio from the appropriate mean squares resulting from the ANOVA performed as part of the G-study. Conceptually, the generalizability coefficient is the ratio of universe score variance to observed score variance and is analogous to the reliability coefficient (Allen & Yen, 1979).
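A minimal one-facet sketch may help fix these ideas. The code below simulates persons crossed with mode of administration (oral versus written), computes the mean squares by hand, and forms one common version of the generalizability coefficient (universe-score variance over universe-score plus residual variance, for a single administration). The data, effect sizes, and this particular coefficient are assumptions chosen for illustration; they are not the designs or formulas prescribed by the sources cited above.

```python
# A minimal sketch of a one-facet G-study: persons crossed with administration mode,
# one score per person per mode. Simulated data; one common coefficient formulation.
import numpy as np

rng = np.random.default_rng(4)
n_people, n_modes = 200, 2

person_effect = rng.normal(0, 1.0, size=(n_people, 1))   # universe scores
mode_effect = np.array([[0.0, 0.1]])                      # small systematic mode difference
scores = person_effect + mode_effect + rng.normal(0, 0.5, size=(n_people, n_modes))

grand = scores.mean()
ss_person = n_modes * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_mode = n_people * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_resid = ((scores - grand) ** 2).sum() - ss_person - ss_mode

ms_person = ss_person / (n_people - 1)
ms_mode = ss_mode / (n_modes - 1)
ms_resid = ss_resid / ((n_people - 1) * (n_modes - 1))

var_person = (ms_person - ms_resid) / n_modes    # universe-score variance component
g_coefficient = var_person / (var_person + ms_resid)

print(round(ms_mode, 2), round(g_coefficient, 2))
# A mode mean square that is large relative to the residual would warn that scores
# do not generalize well across oral and written administration.
```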
Note, however, that when a G-study yields a poor generalizability coefficient, the study's design points to a source of the problem, that is, the facet examined. A reliability coefficient merely identifies the amount of error without attributing it to a specific source.

In some instances, choosing the appropriate ANOVA design, deciding which effects correspond to the facets of interest, and constructing the correct generalizability coefficient can be demanding. Just as with analysis of variance in general, multiple dimensions and nested, crossed, and mixed effects can complicate a G-study. (See Myers, 1978, or Kirk, 1995, for general discussions of ANOVA designs.) Keeping the design of a G-study simple is advisable. It is also prudent to consult a source that explains in detail how to build the appropriate ANOVA model for a given type of G-study. Crocker and Algina (1986) describe the appropriate designs for several different one- and two-facet generalizability studies. This source also provides a good general introduction to generalizability theory.

SUMMARY

Scales are reliable to the extent that they consist of reliable items that share a common latent variable. Coefficient alpha corresponds closely to the classical definition of reliability as the proportion of variance in a scale that is attributable to the true score of the latent variable. Various methods for computing reliability have different utility in particular situations. For example, if one does not have access to parallel versions of a scale, computing alternate forms reliability is impossible. A researcher who understands the advantages and disadvantages of alternative methods for computing reliability is in a good position to make informed judgments when designing a measurement study or evaluating a published report.

EXERCISES²

1. If a set of items has good internal consistency, what does that imply about the relationship of the items to the latent variable?
2. For this exercise, assume that the following is the covariance matrix for a scale, Y, made up of three items, X1, X2, and X3:

   1.2   .4   .5
    .4   1.0   .6
    .5    .6   1.8

   (a) What are the variances of X1, X2, and X3?
   (b) What is the variance of Y?
   (c) What is coefficient alpha for scale Y?
3. Discuss the ways in which test-retest reliability confounds other factors with the actual scale properties.
4. How does the logic of alternate forms reliability follow from the assumptions of parallel tests?

NOTES

1. For weighted items, covariances are multiplied by products, and variances by squares, of the corresponding item weights. See Nunnally (1978, pp. 158-186) for a more complete description of this.
2. Throughout the book, the solution for any exercise that requires a numeric answer will be found in the Notes section of the chapter in which the exercise appears.
3. The answers are (a) 1.2, 1.0, and 1.8 (which sum to 4.0); (b) 7.0 (the sum of all elements in the matrix); and (c) (3/2)[1 − (4.0/7.0)] = .64.

Validity

Whereas reliability concerns how much a variable influences a set of items, validity concerns whether the variable is the underlying cause of item covariation. To the extent that a scale is reliable, variation in scale scores can be attributed to the true score of some phenomenon that exerts a causal influence over all the items. However, determining that a scale is reliable does not guarantee that the latent variable shared by the items is, in fact, the variable of interest to the scale developer. The adequacy of a scale as a measure of a specific variable (e.g., perceived psychological stress) is an issue of validity.

Some authors have assigned a broader meaning to validity.
For example, Messick (1995) described six types of validity, one of which (consequential validity) concerns the impact on respondents of how their scores are used. Although Messick's (1995) views on validity have raised some thought-provoking issues, his classification system has not been widely adopted. According to the more conventional interpretation, validity is inferred from the manner in which a scale was constructed, its ability to predict specific events, or its relationship to measures of other constructs. There are essentially three types of validity that correspond to these operations:

1. content validity
2. criterion-related validity
3. construct validity

Each type will be reviewed briefly. For a more extensive treatment of validity, including a discussion of methodological and statistical issues in criterion-related validity and alternative validity indices, see Ghiselli, Campbell, and Zedeck (1981, Chapter 10). Readers might also want to consider Messick's (1995) more all-encompassing view of validity.

CONTENT VALIDITY

Content validity concerns item sampling adequacy, that is, the extent to which a specific set of items reflects a content domain. Content validity is easiest to evaluate when the domain (e.g., all the vocabulary words taught to sixth graders) is well defined. The issue is more subtle when measuring attributes, such as beliefs, attitudes, or dispositions, because it is difficult to determine exactly what the range of potential items is and when a sample of items is representative. In theory, a scale has content validity when its items are a randomly chosen subset of the universe of appropriate items. In the vocabulary test example used above, this is easily accomplished: all the words taught during the school year would be defined as the universe of items, and some subset could then be sampled. However, in the case of measuring beliefs, for example, we do not have a convenient listing of the relevant universe of items. Still, one's methods in developing a scale (e.g., having items reviewed by experts for relevance to the domain of interest, as suggested in Chapter 5) can help to maximize item appropriateness. For example, if a researcher needed to develop a measure contrasting expected outcomes and desired outcomes (e.g., expecting versus wanting a physician to involve the patient in decision making), it might be desirable for her or him to establish that all relevant outcomes were represented in the items. To do this, the researcher might ask colleagues familiar with the context of the research to review an initial list of items and suggest content areas that have been omitted but should be included. Items reflecting this content could then be added.

CRITERION-RELATED VALIDITY

In order to have criterion-related validity, as the term implies, an item or scale is required to have only an empirical association with some criterion or "gold standard." Whether or not the theoretical basis for that association is understood is irrelevant to criterion-related validity. If one could show, for example, that dowsing is empirically associated with locating underground water sources, then dowsing would have validity with respect to the criterion of successful well digging. Thus criterion-related validity is, per se, more of a practical issue than a scientific one, because it is concerned not with understanding a process but merely with predicting it. In fact, criterion-related validity is often referred to as predictive validity.

Criterion-related validity, by any name, does not necessarily imply a causal relationship among variables, even when the ordering of the predictor and the criterion is unambiguous.
Of course, prediction in the context of theory (e.g., prediction as a hypothesis) may be relevant to the causal relationships among variables and can serve a very useful purpose in supporting validity.

Another point worth noting about criterion-related validity is that, logically, one is dealing with the same type of validity issue whether the criterion follows, precedes, or coincides with the measurement in question. Thus, in addition to "predictive validity," concurrent validity (e.g., "predicting" driving skill from answers to oral questions asked during the driving test) or even postdictive validity (e.g., "predicting" birth weight from an infancy developmental status scale) may be used more or less synonymously with criterion-related validity. The most important aspect of criterion-related validity is not the time relationship between the measure in question and the criterion whose value one is attempting to infer but, rather, the extent of the empirical relationship between the two events. The term criterion-related validity has the advantage over the other terms of being temporally neutral and thus is preferable.

Criterion-Related Validity Versus Accuracy

Before leaving criterion-related validity, a few words are in order concerning its relationship to accuracy. As Ghiselli and colleagues (1981) point out, the correlation coefficient, which has been the traditional index of criterion-related validity, may not be very useful when predictive accuracy is the issue. A correlation coefficient, for example, does not reveal how many cases are correctly classified by a predictor (although tables that provide an estimate of the proportion of cases falling into various percentile categories, based on the size of the correlation between predictor and criterion, are described by Ghiselli et al., 1981, p. 91). It may be more appropriate in some situations to divide both a predictor and its criterion into discrete categories and to assess the "hit rate" for placing cases into the correct category of the criterion based on their predictor category. For example, one could classify each variable into "low" versus "high" categories and conceptualize accuracy as the proportion of correct classifications (i.e., instances in which the value of the predictor corresponds to the value of the criterion). Where one divides the categories is an important consideration. Consider a criterion that has two nonarbitrary states, such as "sick" and "well," and an assessment tool with a range of scores that an investigator wants to dichotomize. The purpose of the assessment tool is to predict whether people will test as positive or negative for the sickness in question. Because the outcome is dichotomous, it makes sense to make the predictor dichotomous. There are two possible errors in classification: the test can mistakenly classify a truly sick person as well (a false negative) or a truly well person as sick (a false positive). Where along the range of scores on the assessment tool the dividing line is placed when dichotomizing can affect the rates of these two types of errors. At the extremes, classifying everyone as well will avoid any false positives (but increase false negatives), whereas classifying everyone as sick will avoid any false negatives (but increase false positives). Obviously, in both of these extreme cases, the assessment tool would have no predictive value at all. The goal, of course, is to choose a cutoff that produces the fewest errors of either type, and thus the highest accuracy. Often, there is no ideal cut point, that is, one resulting in perfect classification. In such a case, the investigator may make a conscious effort to minimize one type of error rather than the other. For example, if the sickness is devastating and the treatment is effective, inexpensive, and benign, the cost of a false negative (resulting in undertreating) is far greater than the cost of a false positive (resulting in overtreating). Thus, choosing a cutoff so as to reduce false negatives while accepting more false positives would seem appropriate. On the other hand, if the remedy is both expensive and unpleasant and the sickness mild, the opposite trade-off might make more sense.
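A small numerical sketch may help make this trade-off concrete. The scores, disease states, and cutoffs below are entirely hypothetical and are chosen only to show how moving the dividing line shifts the balance between the two kinds of error.

```python
# Hypothetical screening scores and true states (1 = sick, 0 = well),
# invented purely to illustrate how a cutoff trades off error types.
scores     = [2, 3, 4, 4, 5, 6, 6, 7, 8, 9]
truly_sick = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

def classification_errors(cutoff):
    """Classify as 'sick' when score >= cutoff; count each error type."""
    false_neg = sum(1 for s, d in zip(scores, truly_sick) if s < cutoff and d == 1)
    false_pos = sum(1 for s, d in zip(scores, truly_sick) if s >= cutoff and d == 0)
    hit_rate = 1 - (false_neg + false_pos) / len(scores)
    return false_neg, false_pos, hit_rate

for cutoff in (0, 5, 7, 11):
    fn, fp, hits = classification_errors(cutoff)
    print(f"cutoff={cutoff:>2}  false negatives={fn}  false positives={fp}  hit rate={hits:.2f}")
```

With these invented values, a cutoff below the observed range classifies everyone as sick (no false negatives, many false positives), and a cutoff above the range classifies everyone as well (no false positives, many false negatives), reproducing the two extremes described above; intermediate cutoffs yield the higher hit rates.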
Also, it is important to remember that, even if the correlation between a predictor measure and a criterion is perfect, the score obtained on the predictor is not an estimate of the criterion. Correlation coefficients are insensitive to linear transformations of one or both variables. A high correlation between two variables implies that scores on those variables obtained from the same individual will occupy similar locations on their respective distributions. For example, someone scoring very high on the first variable is likely also to score very high on the second, if the two are strongly correlated. "Very high," however, is a relative rather than an absolute term and does not consider the two variables' units of measurement, for example. Transforming the predictor's units of measurement to those of the criterion may be necessary to obtain an accurate numerical prediction. This adjustment is equivalent to determining the appropriate intercept in addition to the slope of a regression line. A failure to recognize the need to transform a score could lead to erroneous conclusions. An error of this sort is perhaps most likely to occur if the predictor happens to be calibrated in units that fall into the same range as the criterion. Assume, for example, that someone devised the following "speeding ticket scale" to predict how many tickets drivers would receive over 5 years:

1. I exceed the speed limit when I drive.
   Frequently   Occasionally   Rarely   Never

2. On multilane roads, I drive in the passing lane.
   Frequently   Occasionally   Rarely   Never

3. I judge for myself what driving speed is appropriate.
   Frequently   Occasionally   Rarely   Never

Let us also make the implausible assumption that the scale correlates perfectly with the number of tickets received in a 5-year period. The scale is scored by giving each item a value of 3 when a respondent circles "frequently," 2 for "occasionally," 1 for "rarely," and 0 for "never." The item scores then are summed to get a scale score. The scale's perfect criterion-related validity does not mean that a score of 9 translates into nine tickets over 5 years. Rather, it means that the people who score highest on the instrument are also the people who have the highest observed number of tickets over that period. Some empirically determined transformation (e.g., .33 x SCORE) would yield the actual estimate. This particular transformation would predict three tickets for a driver scoring 9. If criterion-related validity were high, then a more accurate estimate could be computed. However, the similarity between the numerical values of the criterion and the predictor measure prior to an appropriate transformation would have nothing to do with the degree of validity.
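The role of the transformation can be illustrated with another hypothetical sketch (NumPy assumed). The four drivers below are invented so that the scale correlates perfectly with the criterion; fitting a regression line then recovers a transformation of roughly the .33 x SCORE form mentioned above.

```python
import numpy as np

# Hypothetical data: speeding-ticket-scale scores (0-9) and observed
# 5-year ticket counts, contrived so the relationship is perfectly linear.
scale_scores = np.array([0, 3, 6, 9])
tickets      = np.array([0, 1, 2, 3])

# The correlation is unaffected by the difference in units ...
r = np.corrcoef(scale_scores, tickets)[0, 1]
print(f"correlation = {r:.2f}")          # 1.00

# ... but a raw score of 9 is not an estimate of nine tickets. Regressing
# the criterion on the predictor supplies the slope (and intercept) needed
# for a numerical prediction.
slope, intercept = np.polyfit(scale_scores, tickets, 1)
print(f"predicted tickets for a score of 9: {slope * 9 + intercept:.1f}")  # 3.0
```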
CONSTRUCT VALIDITY

Construct validity (Cronbach & Meehl, 1955) is directly concerned with the theoretical relationship of a variable (e.g., a score on some scale) to other variables. It is the extent to which a measure "behaves" the way that the construct it purports to measure should behave with regard to established measures of other constructs. So, for example, if we view some variable, based on theory, as positively related to constructs A and B, negatively related to C and D, and unrelated to X and Y, then a scale that purports to measure that construct should bear a similar relationship to measures of those constructs. In other words, our measure should be positively correlated with measures of constructs A and B, negatively correlated with measures of C and D, and uncorrelated with measures of X and Y. A depiction of these hypothesized relationships might look like Figure 4.1.

Figure 4.1   A hypothesized relationship among variables

The extent to which empirical correlations match the predicted pattern provides some evidence of how well the measure "behaves" like the variable it is supposed to measure.
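One informal way to examine such a pattern, sketched below with invented data and arbitrary variable names, is simply to compare the signs of the observed correlations against the signs predicted by theory. (The 0.2 threshold used to call a correlation "essentially zero" is arbitrary and serves only this illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: a new scale plus established measures of constructs
# A, B (expected positive), C, D (expected negative), and X, Y (expected null).
new_scale = rng.normal(size=n)
measures = {
    "A": new_scale + rng.normal(size=n),      # built to correlate positively
    "B": 0.8 * new_scale + rng.normal(size=n),
    "C": -new_scale + rng.normal(size=n),     # built to correlate negatively
    "D": -0.8 * new_scale + rng.normal(size=n),
    "X": rng.normal(size=n),                  # unrelated
    "Y": rng.normal(size=n),
}
expected = {"A": "+", "B": "+", "C": "-", "D": "-", "X": "0", "Y": "0"}

for name, values in measures.items():
    r = np.corrcoef(new_scale, values)[0, 1]
    observed = "+" if r > 0.2 else "-" if r < -0.2 else "0"
    match = "consistent" if observed == expected[name] else "inconsistent"
    print(f"{name}: r = {r:+.2f}  expected {expected[name]}  ({match})")
```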
Differentiating Construct From Criterion-Related Validity

People often confuse construct and criterion-related validity because the same exact correlation can serve either purpose. The difference resides more in the investigator's intent than in the value obtained. For example, an epidemiologist might attempt to determine which of a variety of measures obtained in a survey study correlate with health status. The intent might be merely to identify risk factors without concern (at least initially) for the underlying causal mechanisms linking scores on measures to health status. Validity, in this case, is the degree to which the scales can predict health status. Alternatively, the concern could be more theoretical and explanatory. The investigator, like the epidemiologist described in this book's opening chapter, might endorse a theoretical model that views stress as a cause of health status, and the issue might be how well a newly developed scale measures stress. This might be assessed by evaluating the "behavior" of the scale relative to how theory suggests stress should operate. If the theory suggested that stress and health status should be correlated, then the same empirical relationship used as evidence of predictive validity in the preceding example might be used as evidence of construct validity.

So-called known-groups validation is another example of a procedure that can be classified as either construct or criterion-related validity, depending on the investigator's intent. Known-groups validation typically involves demonstrating that some scale can differentiate members of one group from another, based on their scale scores. The purpose may be either theory related (such as when a measure of attitudes toward a certain group is validated by correctly differentiating those who do or do not affiliate with members of that group) or purely predictive (such as when one uses a series of seemingly unrelated items to predict job turnover). In the former case, the procedure should be considered a type of construct validity and, in the latter, criterion-related validity.

How Strong Should Correlations Be in Order to Demonstrate Construct Validity?

There is no cutoff that defines construct validity. It is important to recognize that two measures may share more than construct similarity. Specifically, similarities in the way that constructs are measured may account for some covariation in scores, independent of construct similarity. For example, two variables scored on a multipoint scoring system (with scores from 1 to 100) will have a higher correlation with each other than with a binary variable, all else being equal; this is an artifact caused by the similarity of the measurement methods. Likewise, because of procedural similarities, data of one type gathered by interviews may correlate to a degree with other data gathered in the same way; that is, some of the covariation between two variables may be due to measurement similarity rather than construct similarity. This fact provides a basis for answering the question concerning the magnitude of correlations necessary to conclude construct validity. The variables, at a minimum, should demonstrate covariation above and beyond what can be attributed to shared method variance.

Multitrait-Multimethod Matrix

Campbell and Fiske (1959) devised a procedure called the multitrait-multimethod matrix that is extremely useful for examining construct validity. The procedure involves measuring more than one construct by means of more than one method so that one obtains a "fully crossed" method-by-measure matrix. For example, suppose that a study is designed in which anxiety, depression, and shoe size are each measured at two times, using two different measurement procedures each time. (Note that two different samples of individuals could be measured at the same time. What effect would this have on the logic of the approach?) Each construct could be assessed by two methods: a visual analog scale (a line on which respondents make a mark to indicate the amount of the attribute they possess, be it anxiety, depression, or bigness of foot) and a rating assigned by an interviewer following a 15-minute interaction with each subject. One could then construct a matrix of the correlations obtained between measurements, as in Table 4.1.
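The general logic of such a matrix can be sketched with invented correlations (NumPy assumed): correlations between different methods measuring the same trait should exceed correlations between different traits that merely share a method.

```python
import numpy as np

traits  = ["anxiety", "depression", "shoe_size"]
methods = ["visual_analog", "interviewer_rating"]
labels  = [(t, m) for m in methods for t in traits]

# Invented correlations for illustration only: same-trait/different-method
# entries (e.g., anxiety by both methods) are set higher than
# different-trait/same-method entries.
R = np.array([
    # anxVAS depVAS shoeVAS anxInt depInt shoeInt
    [ 1.00,  0.35,  0.05,  0.60,  0.30,  0.02],   # anxiety, visual analog
    [ 0.35,  1.00,  0.04,  0.32,  0.65,  0.03],   # depression, visual analog
    [ 0.05,  0.04,  1.00,  0.03,  0.02,  0.70],   # shoe size, visual analog
    [ 0.60,  0.32,  0.03,  1.00,  0.33,  0.04],   # anxiety, interviewer rating
    [ 0.30,  0.65,  0.02,  0.33,  1.00,  0.05],   # depression, interviewer rating
    [ 0.02,  0.03,  0.70,  0.04,  0.05,  1.00],   # shoe size, interviewer rating
])

same_trait_diff_method = []
diff_trait_same_method = []
for i, (t1, m1) in enumerate(labels):
    for j, (t2, m2) in enumerate(labels):
        if j <= i:
            continue
        if t1 == t2 and m1 != m2:
            same_trait_diff_method.append(R[i, j])
        elif t1 != t2 and m1 == m2:
            diff_trait_same_method.append(R[i, j])

print("mean same-trait/different-method r:", round(np.mean(same_trait_diff_method), 2))
print("mean different-trait/same-method r:", round(np.mean(diff_trait_same_method), 2))
```

With these invented values, the same-trait/different-method correlations average well above the different-trait/same-method ones, which is the pattern one hopes to see when construct variance outweighs method variance.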

Another possible distinction, not in the table, is between related versus unrelated traits. Because the entries that reflect the same trait (construct) and the same method should share both method and construct variance, one would expect these correlations to be highest. It is hoped that correlations corresponding to the same trait but different methods would be the next highest, as this would suggest that construct covariation is higher than method covariation; in other words, our measures were more influenced by what was measured than by how it was measured. In contrast, there is no reason why any covariation should exist between shoe size and either of the other two constructs when they are measured by different procedures. Thus these correlations should not be significantly different from zero. For nonidentical but theoretically related constructs, such as depression and anxiety, one would expect some construct covariation. This is potentially a highly informative