(Jum Nunnally, Ira Bernstein) Psychometric Theory

PSYCHOMETRIC THEORY
THIRD EDITION

This book was set in Times Roman by [...]. The editors were [...]; the production supervisor was [...]. New drawings were done by ECL Art.

Copyright © 1994, 1978, 1967 by McGraw-Hill, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher.

This book is printed on acid-free paper.

ISBN 0-07-047849-X

Library of Congress Cataloging-in-Publication Data
Nunnally, Jum C.
Psychometric theory / Jum C. Nunnally, Ira H. Bernstein.
Includes bibliographical references and index.
ISBN 0-07-047849-X (alk. paper)
1. Psychometrics. I. Bernstein, Ira H.

ABOUT THE AUTHORS

JUM C. NUNNALLY received a doctorate from the University of Chicago in 1952, working with William Stephenson, Leon Thurstone, and Carl Rogers. Early in his career he was associated with Samuel Beck and Roy Grinker. From 1954 to 1960, he was a faculty member at the University of Illinois. He joined Vanderbilt University in 1960 as professor and became chair in 1961. He published four books besides the previous editions of Psychometric Theory. Popular Conceptions of Mental Health (Holt) became a basic resource in community mental health. In addition, he published 110 articles. Many involved various issues in measurement, such as Q methodology and confirmatory factor analysis. In fact, he pioneered modern multivariate hypothesis testing. He also published significant work on language development in deaf children, the acquisition of reward value in previously neutral objects, experimental aesthetics, and the scaling of the evaluative and affective components of language.
As an administrator he was responsible for making the psychology department at Vanderbilt the major force it is today. He served on the editorial board of Applied Psychological Measurement and three other journals and was active in many scholarly organizations. He died on August 22, 1982.

IRA H. BERNSTEIN received a doctorate at Vanderbilt University in 1963 under the direction of Prof. Richard L. Blanton, a close friend of and collaborator with the late Prof. Nunnally. He then spent a postdoctoral year at Prof. Nunnally's institution, the University of Illinois, studying perception with Prof. Charles Eriksen. He joined the faculty of the University of Texas at Arlington (then Arlington State College) in 1964. He has remained there ever since, progressing to the rank of professor in 1969. He divides his work equally among basic research (perception/cognition), methodological issues in applied measurement (item-level factoring, categorical modeling), and application (police selection, pain, and other medical topics). He has published approximately 70 articles in such journals as Educational and Psychological Measurement, Health Education Quarterly, Multivariate Behavioral Research, Perception and Psychophysics, Psychological Bulletin, and The Journal of Experimental Psychology: General, and has served on two editorial boards. He is a Fellow of the American Psychological Association and the Society for Personality Assessment and is also the author of Applied Multivariate Analysis (Springer-Verlag).

This book is dedicated to the memory of Jum C. Nunnally

CONTENTS

PREFACE

PART ONE: INTRODUCTION

1 Introduction
CHAPTER OVERVIEW
MEASUREMENT IN SCIENCE
What Is "Meaningful" and "Useful"?
ADVANTAGES OF STANDARDIZED MEASURES
Objectivity / Quantification / Communication / Economy / Scientific Generalization
MEASUREMENT AND MATHEMATICS
Measurement and Statistics
MEASUREMENT SCALES
Nominal Scales / Ordinal Scales / Interval Scales / Ratio Scales / Other Scales / Invariance
DECISIONS ABOUT MEASUREMENT SCALES
Ostensive Characteristics / Consequences of Assumptions / Convention / Classification as Measurement
RECENT TRENDS IN MEASUREMENT
The Impact of Computers / Closed versus Open-Form Solutions / Computer Simulation
SUMMARY
SUGGESTED ADDITIONAL READINGS

PART TWO: STATISTICAL FOUNDATIONS

2 Traditional Approaches to Scaling
CHAPTER OVERVIEW
DATA MATRICES
More Complex Organizations / "Holes" in the Matrix (Missing Data)
Scaling Stimuli versus Scaling People
A BRIEF INTRODUCTION TO PSYCHOPHYSICS
Psychophysical Methods / Absolute Thresholds / Simulating a Threshold / Difference Thresholds / The Weber Fraction, Fechner's Law, and Psychophysical Scaling / Direct Psychophysics and the Plateau-Stevens Tradition / The Fullerton-Cattell Law / Signal Detection Theory and Modern Psychophysics
TYPES OF STIMULI AND RESPONSES
Judgments versus Sentiments / Absolute versus Comparative Responses / Preferences versus Similarity Responses / Specified versus Unspecified Attributes
METHODS FOR CONVERTING RESPONSES TO STIMULUS SCALES
Ordinal Methods / Interval Methods / Ratio Methods
MODELS FOR SCALING STIMULI
Direct (Subjective Estimate) Models / Indirect (Discriminant) Models / Simulating Thurstone Scaling / A Comparison of the Two Simulations / The Logistic Distribution and Luce's Choice Theory / Averages as Scale Values / Checks and Balances / Multi-Item Measures / Item Trace Lines (Item Characteristic Curves) / Difficulty and Discrimination
DETERMINISTIC MODELS FOR SCALING PEOPLE
The Guttman Scale / Evaluation of the Guttman Scale
PROBABILISTIC MODELS FOR SCALING PEOPLE
Nonmonotone Models / Monotone Models with Specified Distribution Forms / Monotone Models with Unspecified Distribution Forms
SUMMARY
SUGGESTED ADDITIONAL READINGS

3 Validity
CHAPTER OVERVIEW
GENERAL CONSIDERATIONS
CONSTRUCT VALIDITY
Domain of Observables / Relations among Observables / Relations among Constructs / Campbell and Fiske's Contribution to Construct Validation
PREDICTIVE VALIDITY
The Temporal Relation between Predictor and Criterion / The Criterion Problem / Other Problems in Prediction / The "Composite Criterion" / Validity Coefficients / Validity Generalization / Meta-analysis
CONTENT VALIDITY
EXPLICATION OF CONSTRUCTS
Changing Substantive Theories versus Changing Measurement Theories / A Commonsense Point of View
OTHER ISSUES CONCERNING VALIDITY
Relations among the Three Types of Validity / Other Names / The Place of Factor Analysis
SUMMARY
SUGGESTED ADDITIONAL READINGS

4 Elements of Statistical Description and Estimation
CHAPTER OVERVIEW
CONTINUOUS VERSUS DISCRETE (CATEGORICAL) VARIABLES
VARIANCE
Transformations of Distributions
CORRELATION AND COVARIANCE AS CONCEPTS
THE PEARSON PRODUCT-MOMENT CORRELATION
The Meaning of Pearson Product-Moment Correlation / Computer Applications / Covariance / Other Measures of Linear Relation / Three Special Cases
ESTIMATES OF r
Biserial r (r_bis) / Tetrachoric Correlation (r_tet) and Related Estimates
PEARSON r VERSUS ESTIMATES OF PEARSON r
Some Related Issues in Categorization
ASSUMPTIONS UNDERLYING r
FACTORS INFLUENCING r
Restriction of Range / Distribution Form
A UNIVERSAL MEASURE OF RELATIONSHIP
PREDICTION, REGRESSION, AND STRUCTURAL EQUATIONS
Regression / Regression Based upon Raw Scores / The Standard Error of Estimate / Partitioning of Variance / Structural Equations
STATISTICAL ESTIMATION AND STATISTICAL DECISION THEORY
Generalized Least-Squares Estimation / Maximum Likelihood Estimation / Maximum Likelihood and the Testing of Hierarchical Models / Bayesian Estimation / The Method of Moments / Equal Weighting (The "It Don't Make No Nevermind Principle") / Properties of Estimators
SUMMARY
SUGGESTED ADDITIONAL READINGS

5 Linear Combinations, Partial Correlation, Multiple Correlation, and Multiple Regression
CHAPTER OVERVIEW
VARIANCES OF LINEAR COMBINATIONS
Variance of a Weighted Sum / Variance of a Sum of Standard Scores / Variance of Sums of Dichotomous Distributions / Numerical Examples
CHARACTERISTICS OF SCORE DISTRIBUTIONS
Variances / Distribution Shape
COVARIANCE OF LINEAR COMBINATIONS
Correlation of Linear Combinations / Numerical Example
PARTIAL CORRELATION
An Example of Partialling / Higher-Order Partialling / Another Form of Partialling
MULTIPLE CORRELATION AND MULTIPLE REGRESSION
The Two-Predictor Case / Numerical Example / The General Case / Testing the Significance of R and Increments in R / Determinants of R / Categorical Predictors / Multicollinearity / Predictor Importance
SELECTION AND ALTERNATIVE WEIGHTINGS OF PREDICTORS
Stepwise Inclusion of Predictors / All Possible Subsets Approaches / Hierarchical Inclusion of Variables / Combining Strategies / Moderated Multiple Regression / Variable Weighting
The Analysis of Covariance / Nonlinear Relations / Residual Analysis / Canonical Analysis
SUMMARY
SUGGESTED ADDITIONAL READINGS

PART THREE: CONSTRUCTION OF MULTI-ITEM MEASURES

6 The Theory of Measurement Error
CHAPTER OVERVIEW
THE CONCEPT OF MEASUREMENT ERROR
ONE FORM OF CLASSICAL TEST THEORY
THE DOMAIN-SAMPLING MODEL
Multi-Item Measures / Estimates of Reliability / The Importance of the Reliability Coefficient
THE MODEL OF PARALLEL TESTS
PERSPECTIVES ON THE TWO MODELS
Factorial Composition
PRECISION OF RELIABILITY ESTIMATES
Variances of Items
FURTHER DEDUCTIONS FROM THE DOMAIN-SAMPLING MODEL
Test Length / The Reliability of an Item Sample and Coefficient Alpha / Numerical Example / Variances of True and Error Scores / Estimation of True Scores / The Standard Error of Measurement / Attenuation
ALTERNATIVE MODELS
Factorial Domain-Sampling Model / The Binomial Model
RELIABILITY AS STABILITY OVER TIME
Difference Scores / One Other Consideration
SUMMARY
SUGGESTED ADDITIONAL READINGS

7 The Assessment of Reliability
CHAPTER OVERVIEW
SOURCES OF ERROR
Variation within a Test / Variation between Tests
ESTIMATION OF RELIABILITY
Internal Consistency / Alternative Forms / Other Estimates of Reliability / Long-Range Stability
USES OF THE RELIABILITY COEFFICIENT
Corrections for Attenuation / Confidence Intervals / Effect of Dispersion on Reliability
MAKING MEASURES RELIABLE
Test Length / Standards of Reliability / Limitations on the Reliability Coefficient's Utility
RELIABILITY OF LINEAR COMBINATIONS
Negative Elements / Weighted Sums / Principles Concerning the Reliability of Linear Combinations
AN ANALYSIS OF VARIANCE APPROACH TO RELIABILITY
Some Basic Concepts / Application to the Study of Reliability
GENERALIZABILITY THEORY
Basic Concepts / Generalizability Studies and Decision Studies / A Single-Facet Design / Theoretical Considerations / Applying the Results of a Single-Facet G Study to a D Study / A Fixed-Facet Design / Higher-Order Designs
SUMMARY
SUGGESTED ADDITIONAL READINGS

8 Construction of Conventional Tests
CHAPTER OVERVIEW
CONSTRUCTION OF TESTS DESIGNED FOR CONTENT VALIDATION
The Domain of Content and Test Plan / Test Items / Test Length / Sample of Subjects / Item Analysis / Item Selection / Norms / The Role of External Correlates
CONSTRUCTION OF TESTS DESIGNED FOR CONSTRUCT VALIDATION
The Hypothesis and Domain of Content / Content Homogeneity / Methodological Heterogeneity / Relations among Measures and Constructs / The Role of Factor Analysis / Item Analysis and Selection /
The Inadequacy of Rational Approaches to Test Construction / The Inadequacy of Empirical (Criterion-Oriented) Approaches to Test Construction / Norms / Applying the Measure / Some Examples of Constructs in the Abilities Area
CONSTRUCTION OF TESTS DESIGNED FOR PREDICTIVE VALIDATION
Item Analysis, Item Selection, and Norms
PROBLEMS UNIQUE TO CERTAIN TESTING SITUATIONS
Reversing the Direction of Keying / Unipolar versus Bipolar Attributes / Discrimination at a Point / Equidiscriminating Tests / Weighting of Items / Taking Advantage of Chance
SUMMARY
SUGGESTED ADDITIONAL READINGS

9 Special Problems in Classical Test Theory
CHAPTER OVERVIEW
GUESSING
The Blind Guessing Model and Abbott's Formula / Effects of Guessing on Test Parameters / The Accuracy of the Correction for Blind Guessing / Sophisticated Guessing Model / Practical Considerations / Using the Model to Estimate Test Parameters / Multiple-Choice versus Short-Answer Tests
SPEED TESTS
The Internal Structure of Speed Tests / The Item Pool / Measurement of Reliability / Factor Composition / Variables Relating to Speed / Statistical Effects of Time Limits / One-Trial Measures of the Effects of Time Limits / Correction for Guessing in Speed Tests / Timed-Power Tests / Speed-Difficulty Tests / Factors Measured by Speed and Power Tests / Implications
ADVERSE IMPACT, IMPROPER DISCRIMINATION, TEST BIAS, AND DISPARITY
Definitions of Bias / Disparity and Bias / Test Bias, Regression, and the Cleary Rule / Applying Linear Regression to Salary Disputes / Reverse Regression / Residual Analysis / Simpson's Paradox Revisited / Bias in Content-Validated Measures / Barriers and Cutoffs /
Selection Fairness and Quotas / Pooled versus Separate Group Norms
HALO EFFECTS
Traditional Measures of Halo / Recent Developments in the Study of Halo
RESPONSE BIASES AND RESPONSE STYLES
Sources of Bias / Changes in Test Scores as Personality Changes / Carelessness and Confusion / The Role of Social Desirability / Other Proposed Stylistic Variables
MULTISCALE TESTS
Item Overlap
SUGGESTED ADDITIONAL READINGS

10 Recent Developments in Test Theory
CHAPTER OVERVIEW
ITEM RESPONSE THEORY
Conditional Independence / One-Parameter Models / Two-Parameter Models / Three-Parameter Models / Item and Test Information / The Bock Nominal Model / The Samejima Model for Graded (Ordinal) Responses / A Nonparametric Approach / Other IRT Models / Applications to Nonstandard Testing Conditions / Scoring Algorithms
DIFFERENTIAL ITEM FUNCTIONING (ITEM BIAS)
A Substantive Example / A Simulated Example / Differential Alternative Functioning / IRT Approaches to Assessing DIF / Alternative IRT Approaches / Classical Approaches to Assessing DIF / Content Bias
TAILORED TESTS AND COMPUTERIZED ADAPTIVE TESTING
Tailored Testing and Psychophysical Thresholds / Applying the Staircase Principle to Psychometrics / Flexilevel Tests / More Complex Forms of Tailored Testing / Perspectives on Tailored Tests
COMMENTARY ON IRT
ACHIEVEMENT TESTS FOR MASTERY LEARNING
Nature of Mastery Learning / Test Construction / Definition of "Mastery" / Practical Problems
SUMMARY
SUGGESTED ADDITIONAL READINGS

PART FOUR: FACTOR ANALYSIS

11 Factor Analysis I: The General Model and Variance Condensation
CHAPTER OVERVIEW
USES OF FACTOR ANALYSIS
Factors as Groupings of Variables / Exploratory and Confirmatory Analysis / Factor Analysis and Scientific Generalization / Variable [...]
BASIC CONCEPTS
The General Factor Model / The Unit of Measurement / Estimating [...] / The Role of the Correlation Matrix / Properties of a Factor Solution
CENTROID CONDENSATION
PRINCIPAL COMPONENT AND PRINCIPAL AXIS CONDENSATION
Principal Components / Mathematical Properties of Principal Components / Principal Axis Solutions
MAXIMUM LIKELIHOOD AND RELATED FORMS OF CONDENSATION
Usefulness of ML Factoring / Variants on ML Factoring
OTHER METHODS OF CONDENSATION
DETERMINING THE NUMBER OF FACTORS
Consequences of Choosing a Given Number of Factors
SUMMARY
SUGGESTED ADDITIONAL READINGS

12 Exploratory Factor Analysis II: Rotation and Other Topics
CHAPTER OVERVIEW
FACTOR ROTATION
Geometric Analogy / Visual Rotation / Further Mathematics of Rotation / Oblique Rotations / Simple and "Simpler" Structures / Reference Vectors
ANALYTIC ROTATIONS
Quartimax / Varimax / Promax
ESTIMATION OF FACTOR SCORES
Practical Considerations in Obtaining Factor Scores
RELATIONS AMONG THE VARIOUS MATRICES
THE COMMON FACTOR MODEL
The Problem of Communality Estimation / Factorable Matrices / Matrix Rank / Unities as Communality Estimates / Communalities Derived from Hypotheses / Statistical Criteria of Rank / Iteration / Squared Multiple Correlations / Reliability Coefficients / Direct Estimation / Some Major Differences between Component and Common Factor Solutions / Some Conceptual Problems with the Common Factor Model / Effects of Number of Variables and Average Correlation upon the Factor Structure
FACTOR ANALYTIC DESIGNS
Alternative Designs / Three-Mode Factor Analysis
AD LIB FACTORING
HIGHER-ORDER FACTORS
HOW TO FOOL YOURSELF WITH FACTOR ANALYSIS
SUMMARY
SUGGESTED ADDITIONAL READINGS

13 Confirmatory Factor Analysis
CHAPTER OVERVIEW
SPEARMAN'S GENERAL FACTOR SOLUTION
COMPARING FACTORS IN DIFFERENT ANALYSES
Classical Approaches to Testing Factor Invariance / Some Practical Aspects of Comparing Factor Structures / Comparing Overall Solutions / ACS Approaches
TESTING WEAK THEORIES (THEORIES CONCERNING GROUPINGS OF VARIABLES)
Multiple Group Confirmatory Analysis / Procrustes Confirmatory Analysis / ACS Confirmatory Analysis / A Comparison of the Three Approaches
FACTORING CATEGORICAL VARIABLES (ITEM LEVEL FACTORING)
ACS and Related Approaches with PM Measures / Multiscale Analyses
TESTING STRONG THEORIES
Introduction to the Full ACS Model / ACS Notation / Assumptions / Properties of Path Coefficients / A Note on Inferring Causality from Correlational Data / Model Specification in ACS / Recursive and Nonrecursive Models / Cross-Lagged Correlation / Applying ACS / Reapplying ACS / Classical Approaches to Strong Theories
SUMMARY
SUGGESTED ADDITIONAL READINGS

PART FIVE: ADDITIONAL STATISTICAL MODELS, CONCEPTS, AND ISSUES

14 Profile Analysis, Discriminant Analysis, and Multidimensional Scaling
CHAPTER OVERVIEW
PROBLEMS IN PROFILE ANALYSIS
Characteristics of Score Profiles
CLUSTERING OF PROFILES
Measures of Profile Similarity / Distance Measure / Hierarchical and Overlapping Clustering
RAW-SCORE FACTOR ANALYSIS
An Example of Raw-Score Factor Analysis / How Raw-Score Factor Analysis Works / Transformations of Variables / Transformations of Profiles
DISCRIMINANT ANALYSIS
Geometric Interpretation of Discriminant Analysis / Linear Discriminant Function / Multiple Linear Discriminant Functions / Placement / Evaluation of Discriminant Analysis
PATTERN ANALYSIS
Discovering Latent Groups / Discriminating among Existing Groups / Evaluation of Pattern Analysis
MULTIDIMENSIONAL SCALING
Spatial Conceptions of MDS / An Overview of Alternative Approaches to MDS / Psychophysical Methods Based upon Similarity / Psychophysical Methods Based upon Attribute Ratings / Indirect Methods / Vector-Space Ratio Methods / Euclidian Distance Ratio Methods / Interval Methods / Ordinal Methods and ALSCAL / Some Empirical Properties of Alternative MDS Solutions / MDS of Correlation Matrices / Scaling of Individual Differences / An Example of the Use of MDS / Some Concluding Comments
DOMINANCE (PREFERENCE) SCALING
The Unfolding Concept / Multidimensional Unfolding and ALSCAL
SUMMARY
SUGGESTED ADDITIONAL READINGS

15 The Analysis of Categorical Data, Binary Classification, and Alternatives to Geometric Representations
CHAPTER OVERVIEW
CATEGORICAL MODELING
Two-way Independence / Association (Nonindependence) / Alternative Models for the 2 × 2 Case / Measures of Association in the 2 × 2 Design / More about G² / The Generalized Logit Variant / Structural and Random Zeros / Multiple Levels on a Variable / Higher-Order Designs / Predictor-Criterion Models / Multiple Response Categories in Predictor-Criterion Models / Some Important Assumptions / Log-linear Modeling and Item Response Theory / More-Specific Categorical Models / Logistic Regression / Comparing Groups with Logistic Regression / An Illustrative Problem / A Note on Residuals / Predicting Categorical Criteria
BINARY CLASSIFICATION
Classical Signal Detection / Categorical Modeling Approaches to the Equal Variance Gaussian Model / General Recognition Theory / Application to Condensation Tasks / MDS, Dissimilarity Judgments, and General Recognition Theory / Implications for Measurement
NONGEOMETRIC AND NON-EUCLIDIAN MODELS
Nearest Neighbors / Tree Representations / Network and Graph Theoretic Approaches / Conclusions
SUMMARY
SUGGESTED ADDITIONAL READINGS

REFERENCES
INDEXES
Name Index
Subject Index

PREFACE

Like the previous edition, the third edition of Psychometric Theory is a comprehensive text in measurement designed for researchers and for use in graduate courses in psychology, education, and areas of business such as management and marketing. It is intended to consider the broad measurement problems that arise in these areas once one steps aside from the specific content that often obscures these similarities.
This does not mean that all situations are the same. Lee Cronbach (1957) pointed out a major difference between the measurement needs of those who study group differences, as in experimental manipulations, and those who study individual differences. This difference is noted in the pages that follow. I have also attempted to write the book so that the reader needs only a basic background in statistics.

The previous editions of this book were so widely read and accepted that they became a common denominator for more than a generation of scholars. Prof. Nunnally's death a decade ago raised the possibility that this contribution might be forgotten. I cannot, of course, know what he would have written for this edition; however, I hope that I have stood with sufficient solidarity upon his shoulders. My main goal is to express to the readers of this book the love of solving measurement problems that he inspired in me. It is also with pride that I include some contributions of my own students who have followed this path. They include Victor Bissonnette, Sebastiano Fisicaro, and Calvin Garbin.

One essential feature that I have tried to maintain is the emphasis upon principles that characterized the previous editions. Now, as then, there are many excellent references that go much further into the details of the various analyses. These are found in the many references, especially the Suggested Additional Readings at the end of each chapter. I have attempted to strike a balance between papers designed for a general audience of psychologists and graduate students, which appear in sources like Psychological Bulletin and empirical journals, versus those of a more mathematical orientation, such as Psychometrika and Applied Psychological Measurement. I have also maintained the use of several examples that allow hand calculation, since many of the newer procedures do not. Consequently, I have included some procedures that many consider obsolete, specifically centroid factor analysis.
Not every reader or instructor will find this useful, but I can speak from my own memories as a student on this point. Most recent developments in measurement have had the unintended effect of taking [...] farther from the data than older methods. To the extent that this reflects the importance of more general, latent variables, it is an important step forward. However, one cannot totally ignore the possibility that the [...] profound.

I have also made use of the general-purpose statistical packages that are now a necessary part of every researcher's repertoire of tools. I [...] LISREL, and Excel in my own work, but there [...]. Prof. Nunnally stressed [...] in previous editions [...].

Without doubt, the major change that has taken place since the publication of the previous edition of this book has been the [...] from classical procedures [...] to modern [...]. This is discussed [...]. However, I strongly feel [...] Prof. Nunnally [...]. [...] should be used when data gathering [...]. Anyone with a serious interest in measurement [...] knowledgeable about these methods. However, you should never [...]. The position I have taken [...].

A major change from the previous edition is that classification has been given explicit status as a form of measurement distinct from scaling. [...] necessary distinction. A well-chosen categorization can be as fruitful as improved scaling. I am particularly indebted to Calvin Garbin for [...] on this point, and I hope that Chapter 15 is useful to this end.

[...] of discussing both classical and modern models [...] the book of manageable length forced the deletion of chapters previously devoted to specific tests, which I regret having to do. There are specific texts, such as Anastasi (1988) and Cronbach (1990), that deal with [...] and also present an excellent discussion of theoretical issues.

Part and chapter overviews and chapter summaries have been added, but the organization [...] similar to that [...]. Part One (Chapter 1) [...] measurement as a general process. Part Two (Chapters 2-5) deals with [...] of measurement, including discussion of the crucial concept of validity. Although there has been a noticeable trend toward unifying the major meanings of validity, I have continued to point out Prof. Nunnally's useful distinctions among them. This part also presents basic concepts in statistics and correlational analysis. My discussion of certain approximations to the ordinary correlation coefficient (r), e.g., polyserial and polychoric correlation, somewhat softens Prof. Nunnally's negative [...] toward them despite our general agreement. I have also expanded the discussion of statistical estimation, as it has become considerably more complex in recent years.

Part Three (Chapters 6-10) continues to deal with the internal structure of multi-item measures and the process of aggregating items into scales. The previous edition mentioned generalizability theory in passing; I felt it now needs more detailed discussion. One minor change is that I have moved some material on speed and guessing from the last chapter of the previous edition into Chapter 9 to unify discussion with related material. I have also added material on statistical definitions of test bias and halo effects to this chapter. A full chapter has been added on modern test theory.

Part Four (Chapters 11-13) deals with how measures relate to one another. The previous edition devoted two chapters to factor analysis. Its importance led me to devote three chapters to it. I have provided a detailed contrast between the component and common factor models. In addition, I have provided a detailed discussion of confirmatory factor analysis and the factoring of categorical data (item-level factoring). I have also attempted to maintain the clarity of Prof.
Nunnally's discussion of the geometric basis of factor analysis. The first of two chapters forming Part Five, Chapter 14, discusses some alternative models that largely share the geometric assumptions of factor analysis. Finally, the last chapter deals with a variety of emerging topics in measurement: categorical modeling, [...] of independence, and alternatives to geometric representation.

I thank several people besides those already mentioned. I am indebted to my recent students: Laura Feld, Paul Havig, Matthew Lee, and Meredith O'Brien, and [...] not mentioning by name all of the other [...] students who have worked with me. I thank Laurie Liska and Tim Larey for their comments on an earlier draft, and Pauline Gregory, Amy Osborn, and Susan Sterling for clerical assistance. I am particularly indebted to Professor Dennis Duffy at the University of Houston Law School for his suggestions about the legal aspects of test bias. Professor James Tanaka completed his penetrating comments only a short time before the tragic accident that claimed his life. Several colleagues at the University of Texas at Arlington (Jim Bowen, Jim Erickson, Bill Ickes, and Paul Paulus) and elsewhere (Charles Eriksen, Matt Jaremko, Michael & Judith Keith, Rob Kolodner, Paul McKinney, and Rick Weideman) stimulated my thinking about many issues. The book could not have been written without the help of Professor Nunnally's widow, Kay. Jim Cullum, John Sheridan, the late John B. Gillespie, Ferdinand LaMenthe, Thelonious Monk, and Charles C. Parker played an oblique role but deserve note nonetheless. I especially thank my wife, Linda, and daughters, Cari and Dina, for their love and support.

Finally, I extend my appreciation to the following reviewers whose suggestions guided my writing of this text: J. William Asher, Purdue University; Jacob G. Beard, Florida State University; Jeffrey Bjorck, Fuller Theological Seminary; Richard L. Blanton, Professor Emeritus, Vanderbilt University; Donald R. Brown, Purdue University; Joseph M. Fitzgerald, Wayne State University; Calvin Garbin, University of Nebraska; Gene V. Glass, University of Colorado; Richard L. Gorsuch, Fuller Theological Seminary; Frank M. Gresham, Louisiana State University; Larry H. Ludlow, Boston College; Samuel T. Mayo; Michael Pressley, University of Maryland; Louis Primavera, St. John's University; John B. [...], Texas A&M University; Roger Schvaneveldt, New Mexico State University; Eugene F. Stone, SUNY, Albany; Michael J. Subkoviak, University of Wisconsin, Madison; J. S. Tanaka, University of Illinois; and David Thissen, University of North Carolina, Chapel Hill.

I conclude this preface with the wisdom of one of my oldest friends, Stan Coren of the University of British Columbia. He told me while I was working on my previous major effort that "You never finish writing a book; they just take it away from you."

Ira H. Bernstein

PART ONE
INTRODUCTION

The main purpose of Part One (a single chapter in this case) is to define "measurement" in terms of two fairly simple concepts: Measurement consists of rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification). Most of the book is concerned with the first of these meanings. The topics of levels of scaling and the general standards by which measurement rules are evaluated are focal issues.

CHAPTER 1
INTRODUCTION

CHAPTER OVERVIEW

This opening chapter begins with a definition of measurement, which we break down into two subtopics: scaling and classification. Some general properties of good measurement are introduced, and the importance of standardization is discussed. The separate roles of measurement and pure mathematics are contrasted.
One major, and still controversial, topic in measurement concerns what are known as levels of measurement. According to some, the appropriate level of a measure must be established before employing mathematical and statistical procedures associated with that level. Many look for ostensive (visualizable) properties of measures like the yardsticks and clocks of physics. They view present scales as imperfect correlates of unknown "true" scales. We attempt to show that these strategies easily lead to unreasonable outcomes. One should demonstrate that a measure has the properties ascribed to it, establish scales by convention, but be prepared to change these conventions as better measures become available. The chapter concludes by noting some of the changes brought to the study of measurement that result from the availability of computers.

MEASUREMENT IN SCIENCE

Although tomes have been written on the nature of measurement, in the end it boils down to two fairly simple concepts: "measurement" consists of rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification). Most of what is historically called measurement involves scaling, and therefore properties of numbers, but classification can be equally important.

The objects in psychology are usually people, but they may be lower animals, as in some areas of psychology and biology, or physical objects, as in some market research. The term "rules" indicates that the assignment of numbers must be explicitly stated. Some rules are so obvious that detailed definition is unnecessary, as in measuring height with a tape measure. Unfortunately, these obvious rules are exceptional in science. For instance, assaying a chemical compound usually requires extremely complex procedures.
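The two-part definition above can be made concrete with a minimal sketch in Python. It is an illustration only: the `Person` record, the item responses, and the two rule functions are hypothetical and do not come from the text. One explicitly stated rule assigns a number representing the quantity of an attribute (scaling), and the other assigns a category symbol (classification).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Person:
    """A hypothetical object of measurement (not from the text)."""
    name: str
    item_responses: Tuple[int, ...]  # 1 = keyed response, 0 = otherwise
    writing_hand: str                # "left" or "right"


def scaling_rule(person: Person) -> int:
    """Scaling: assign a number representing the quantity of an attribute.

    The assumed rule here is simply the count of keyed responses.
    """
    return sum(person.item_responses)


def classification_rule(person: Person) -> str:
    """Classification: assign a symbol showing which category the object is in.

    The symbols "L" and "R" are labels only; no quantity is implied.
    """
    return "L" if person.writing_hand == "left" else "R"


people = [
    Person("A", (1, 1, 0, 1), "left"),
    Person("B", (0, 1, 0, 0), "right"),
    Person("C", (1, 0, 1, 1), "right"),
]

scores = {p.name: scaling_rule(p) for p in people}
categories = {p.name: classification_rule(p) for p in people}

print(scores)      # quantities of one attribute
print(categories)  # same-or-different category membership
```

Because both rules are written out explicitly, any two users applying them to the same records obtain the same symbols, which is the sense in which the assignment must be explicitly stated.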
Certainly the rules for measuring most attributes, such as intelligence, shyness, or priming, are not intuitively obvious.

Rules, in turn, are an important aspect of standardization. A measure is standardized to the extent that (1) its rules are clear, (2) it is practical to apply, (3) it does not demand great skill of administrators beyond that necessary for their initial training, and (4) its results do not depend upon the specific administrator. The basic point about standardization is that users of a given instrument should obtain similar results. The results must therefore be reliable in a sense to be discussed at several points in this book. Thus, measuring the surface temperature of a planet is well standardized if different astronomers obtain very similar estimates. Similarly, an intelligence test is well standardized if different examiners obtain similar scores from testing a particular child at a given time.

The term "attribute" in the definition indicates that measurement concerns some particular feature of objects. One cannot measure objects; one measures their attributes. One does not measure a child but rather his or her intelligence, height, or socialization. The distinction between an object and its attributes may sound like mere hairsplitting, but it is important. First, it demonstrates that measurement requires a process of abstraction. An attribute concerns relations among objects on a particular dimension, e.g., weight or intelligence. A red rock and a white rock may weigh the same, and two white rocks may have different weights. The attributes of weight and color must not be confounded with each other nor with any other attributes. It is quite easy to confuse a particular attribute of objects with other attributes.
For example, some people find it difficult to understand that a criminal and a law-abiding citizen can both be equally smart. Failing to abstract a particular attribute from the whole makes the concept of measurement difficult to grasp.

A second reason for emphasizing that one measures attributes and not objects is that it makes us consider the nature of an attribute carefully before attempting measurement. An attribute we believe in may not exist in the form proposed. For example, the many negative results obtained in the efforts to measure an overall attribute of rigidity make it debatable that such an attribute exists. Even highly popular terms used to describe people may not correspond to measurable attributes, e.g., clairvoyance. It is also common for an assumed unitary attribute to confound several more specific attributes. For example, "adjustment" may include satisfaction with one's life, positive mood, skills in coping with stress, and other meanings of the term. Although such conglomerate measures may be partly justifiable on practical grounds, their use can undermine psychological science. As this book will show in detail, a measure should generally concern some one thing: some distinct, unitary attribute. To the extent that unitary attributes need be combined in an overall appraisal, e.g., of adjustment, they should usually be rationally combined from different measures rather than being confounded within one measure.

The first part of the definition of measurement stresses the use of numbers to represent quantities in scaling. Technically, quantification concerns how much of an attribute is present in an object, and numbers communicate the amount. Quantification is so intimately intertwined with measurement that the two terms are often used interchangeably. This is unfortunate, as the second part, classification, is at least as important.

Although the definition emphasizes that rules are at the heart of measurement, it does not specify the nature of these rules or place any limit on the allowable kinds of rules.
This is because a clear distinction must be made between measurement as a process and the standards for validating measures. The measurement process involves such considerations as the levels-of-measurement issue that is discussed later in this chapter. Validation involves issues that are discussed in Chapter 3. Numerous standards can be applied to assess the usefulness of a measurement method, including the extent to which data obtained from the method (1) fit a mathematical model, (2) measure a single attribute, (3) are repeatable over time if necessary, (4) are valid in various senses, and (5) produce interesting relationships with other scientific measures. Such standards will be discussed throughout this book. Thus, a psychologist might establish rules to measure, say, dogmatism, in a manner that seems quite illogical to other psychologists, but the measure's usefulness cannot be dismissed beforehand.

The rules employed to define a particular measure must be unambiguous. They may be developed from an elaborate deductive model, be based on previous experience, flow from common sense, or simply spring from hunches, but the crucial point is how consistently users agree on the measure and ultimately how well the measurement method explains important phenomena. Consequently, any set of rules that unambiguously quantifies properties of objects constitutes a legitimate measurement method and has a right to compete with other measures for scientific usefulness. Keep in mind, however, that clarity does not guarantee explanatory power.

What Is "Meaningful" and "Useful"?

There is both agreement and disagreement among scientists about what is a meaningful and/or useful result. It is fair to say that there is a high degree of agreement on two points. One is that any result should be repeatable under similar circumstances. It is quite possible that a finding obtained on April 8, 1991, from a particular group of psychology students at the University of Texas at Arlington was a real effect descriptive of that group of people.
However, unless that effect also applied to some other group, e.g., students at the University of Texas at Arlington tested on another day or at some other university on the same day, there is no need for a scientist to be concerned with it.

The second point of agreement that all scientists have learned is that any set of results can be understood after the fact even if it is a chance occurrence or even systematically wrong. Perhaps every investigator has analyzed a set of results and formulated an explanation only to discover that there was a "bug" in the analysis. That bug probably did not hamper a "creative" explanation of the wrong results. In a like manner, some of the more sadistic instructors we have known assign randomly generated results to students for explanation. Students often find the exercise creative until they are let in on the joke.

The keys to meaningfulness are to proceed from some position that anticipates results. This is where scientists differ. Some are strongly biased toward testing hypotheses derived from highly formalized theories; others are more informal and/or result-oriented in their approach. For a debate on this issue, see Greenwald, Pratkanis, Leippe, and Baumgardner (1986) and a series of commentaries that appeared in the October 1988 issue of Psychological Review. As of this writing, the pendulum seems to have swung in a more formal direction, at least in cognitive psychology, but it probably will swing back. Whatever the level of formality preferred, meaningfulness depends upon context. One of the most common phrases one hears about results is "So what?" The answer lies in placing findings in a relevant context. This is not to rule out unanticipated findings, which are always an exciting part of science.
However, before one becomes too enraptured by an interpretation given a set of findings, one should be prepared to replicate them, preferably in some way that broadens their generality.

ADVANTAGES OF STANDARDIZED MEASURES

Although you may already have a healthy respect for the importance of measurement in science, it is useful to look at some particular advantages that measurement provides. To note these advantages, consider what would be left if no measures were available, e.g., if there were no thermometers or intelligence tests. Measures based upon well-developed rules, usually including some form of norms that describe the scores obtained in populations of interest, are called "standardized." Despite criticisms of standardized psychological tests, the decisions that these are used for would still be made. What would be left would consist of subjective appraisals, personal judgments, etc. Some of the advantages of standardized measures over personal judgments are as follows.

Objectivity

The major advantage of measurement is in taking the guesswork out of scientific observation. A key principle of science is that any statement of fact made by one scientist should be independently verifiable by other scientists. The principle is violated if scientists can disagree about the measure. For example, since there is no standardized measure of "libidinal energy," two psychologists could disagree widely about a patient's libidinal energy. It is obviously difficult to test theories of libidinal energy until it can be measured.

One could well argue that measurement is the major problem in psychology. There are many theories, but a theory can be tested only to the extent that its hypothesized attributes can be adequately measured. This has historically been the problem with Freudian theory: there are no agreed-on procedures for observing and quantifying such attributes as libidinal energy, etc. Major advances in psychology, if not all sciences, are often based upon breakthroughs in measurement. Consider, for example, the flood of research stimulated by the development of intelligence tests and of personality tests like the Minnesota Multiphasic Personality Inventory (MMPI), or, in a very different area, the development of techniques to record from single neurons (Hartline, 1940; Kuffler, 1953). Scientific results inevitably involve functional relations among measured variables, and the science of psychology can progress no faster than the measurement of its key variables.

Quantification

The numerical results provided by standardized measures have two advantages. First, numerical indices can be reported in finer detail than personal judgments, allowing more subtle effects to be noted. Thus the availability of thermometers makes it possible to report the exact increase in temperature when two chemicals are mixed, rather than for the investigator to intuitively judge only that "the temperature increases." Similarly, teachers may be able to reliably assign children to broad categories of intelligence such as bright, average, and below normal, but intelligence tests provide finer differentiations.

Second, quantification permits the use of more powerful methods of mathematical analysis that are often essential to the elaboration of theories and the analysis of experiments. Although important psychological theories need not be highly quantitative, the trend is and will continue to be clearly in that direction. Mathematically statable theories make precise deductions possible for empirical investigation. Also, other mathematical models and tools, such as factor analysis and the analysis of variance (ANOVA), may be used to analyze various results even when the study does not test any formal theory.

Communication

Science is a highly public enterprise requiring efficient communication among scientists. Scientists build on the past, and their findings must be compared with results of other scientists working on the same problem. Communication is greatly facilitated when standardized measures are available. Suppose, for example, it is reported that a particular treatment made the subjects "appear anxious" in an experiment concerning the effects of stress on anxiety reaction. This leaves many questions as to what the experimenter meant by "appear anxious," and makes it difficult for other experimenters to investigate the same effect. Much better communication could be achieved if the anxiety measure were standardized, as the means and standard deviations of these scores could be compared across treatment groups. Even very careful subjective evaluations are much more difficult to communicate than statistical analyses of standardized measures.

Economy

Although standardized measures frequently require a great deal of work to develop, they generally are much more economical of time and money than are subjective evaluations after they have been developed. For example, even the best judges of intelligence need to observe a child for some time. At least as good an appraisal can usually be obtained in less than an hour with any of several inexpensively administered group measures of intelligence. Similarly, one can use a standardized activity measure such as rate of bar pressing in a Skinner box to evaluate the effect of a proposed stimulant.

Besides saving time and money, standardized measures often free professionals for more important work. Progress generally favors measures that either require relatively little effort to employ or allow less highly trained technicians to do the administration and scoring. The time saved allows practitioners and scientists more time for the more scholarly and creative aspects of their work.
It is sometimes difficult to disentangle the measure from the measurement process, as in individually administered intelligence tests. Although individual intelligence tests are highly standardized, they still require much time to administer and score. Context determines whether there are sufficient advantages to compensate for these disadvantages over even more highly standardized pencil-and-paper tests.

Scientific Generalization

Scientific generalization is at the very heart of scientific work. Most observations involve particular events: a "falling" star, a baby crying, a feeling of pain from a pin scratch, or a friend remarking about the weather. Science seeks to find underlying order in these particular events by formulating and testing hypotheses of a more general nature. The most widely known examples are the principles of gravitation, heat, and states of gases in physics. Theories, including those in the behavioral sciences, are intended to be general and thereby explain a large number of phenomena with a small, simple set of principles. Many scientific generalizations, particularly in the behavioral sciences, must be stated in statistical terms. They deal with the probability of an event occurring and cannot be specified with more exactness. The development and use of standardized measurement methods are just as essential to probabilistic relationships as they are for deterministic ones. Figure 1-1 illustrates a simple probabilistic relationship noted by the first author between the complexity of randomly generated geometric forms and the amount of time that subjects looked at the forms. The data are group averages and are much more regular than individual subject data. However, the principle seems clear: people look longer at more complex figures than at simpler figures; but this would have been much less apparent in the data of individual subjects.
MEASUREMENT AND MATHEMATICS

A clear distinction needs be made between measurement, which is directly concerned with the real world, and mathematics, which, as an abstract enterprise, needs have nothing to do with the real world. Perhaps the two would not be so readily confused if both did not frequently involve numbers. Measurement always concerns numbers relatable to the physical world, and the legitimacy of any measurement is determined by data (facts about the physical world). In particular, scaling, but not classification, always concerns some form of numerical statement of how much of an attribute is present, as its purpose is to quantify the attributes of real objects. A measure may be intended to fit a set of measurement axioms (a model), but its fit to the model can be determined only by seeing how well the data fit the model's predictions. Even if there is no formal model, the eventual and crucial test of any measure (scale or classification) is how well it explains relations among variables. As will be discussed in Chapter 3, the various types of validity for psychological measures all require data rather than purely mathematical deductions.

[FIGURE 1-1 Viewing time as a function of stimulus complexity (number of sides on randomly generated geometric forms).]

In contrast to measurement, pure mathematics is limited to deductive sets of rules for the manipulation of symbols, of which those used to denote quantities and categories are only one type. Many deductive systems in modern mathematics do not involve numbers, though they may involve classification. Any internally consistent set of rules for manipulating a set of symbols can be a legitimate branch of mathematics. Thus the statement "iggle wug drang flous" could be a legitimate mathematical statement in a set of rules stating that when any iggle is wugged it drangs a flous. A mathematical system could be constructed in which both the objects and the operations are symbolized by nonsense words.
This system might not and need not be of practical use, as its legitimacy depends entirely on the internal consistency of its rules.

As a result, scientists develop measures by stating rules to quantify attributes of real objects, but borrow mathematical systems to examine the structure of the data. Fortunately, scientifically useful measurement methods can usually be associated with appropriate mathematical systems.

Measurement and Statistics

Because the term "statistics" is used broadly, some distinctions among different uses of the term are necessary in order to see their implications for psychometric theory.

There is a basic distinction between descriptive and inferential statistics. "Descriptive statistics" concerns quantitative statements about an attribute of a particular group of observations and does not necessarily imply generalization. Thus, one may report the arithmetic mean of the scores on a classroom test, the correlation between two presumed measures of anxiety, or the scores of two job applicants without making any broader statements about those not taking the tests. In contrast, "inferential statistics" concerns generalizing from observed sample values (statistics) to their counterparts in a population (parameters), nearly always in the form of probability statements. A common example is to estimate the probability that the observed mean difference between an experimental group and control group is a chance departure from 0, the expected result if the treatment had no effect.

We will say less in this book about inference than description, as most of the traditional quantitative methods to be presented are primarily designed for description rather than inference. Thus correlational analysis, factor analysis, discriminant analysis, and other procedures can be discussed and employed with minimal use of inference. This is not to say that inferential statistics are unimportant or that they will be totally neglected.
We will consider some advances in inferential statistics that have recently become prominent. There are three reasons to emphasize description. First, traditional and some newer models are large-sample theories that assume that many subjects are studied. Second, even some investigators who have been very concerned with developing these newer inferential measurement models stress the importance of description (Bentler & Bonett, 1980). Finally, we have enough material to present without going too far into a somewhat ancillary topic. There are excellent books on the relevant inferential statistics for psychometric theory that will be referenced where appropriate.

A second important statistical distinction is that between the sampling of objects (in this context, usually people) and the sampling of content (items). After a measure has been developed, it is often important to make statements about objects, as in developing test norms. Before measures are developed, however, measurement is much more closely related to the sampling of content, as in deciding which test items to include. We will later stress how it is useful to think of particular test items as a sample from a hypothetical infinite population or universe of items measuring the same trait. Thus a spelling test for fourth-grade students can be thought of as a sample of all possible appropriate words. Part of measurement theory thus concerns statistical relations between the actual test scores and the hypothetical scores that would be made if all items in the universe had been administered.

There is a two-way problem in all psychology concerned with the sampling of objects to be measured and the sampling of content. The former usually concerns the generality of findings over objects, and the latter concerns the generality of findings over test items. Some item response theory models (Chapters 2 and 10) simultaneously take objects and items into account.
However, most analyses take only one of these dimensions into account explicitly and keep the other in mind or, worse, simply ignore it. Thus, a study comparing different approaches to teaching mathematics on a particular achievement test may explicitly concern gender differences. However, it might have to acknowledge that different results might have been obtained with different achievement measures.

The frequent necessity of considering only one of these two dimensions is not ideal, but it is not necessarily fatal. Subsequent studies can deal with generalizing over the other dimension. The most desirable situation is when one samples so extensively on one dimension that the only sampling error present is on the other dimension. This normally requires an extremely large sample of subjects. At least hundreds, if not thousands, of subjects should be used in the development process. Except as noted, we will assume that all mathematical analyses are based on large numbers of subjects so that issues will be limited to the sampling of content. Studies conducted on relatively small numbers of subjects are usually not sufficient. Thus, even though a few dozen subjects may suffice to establish that the test's reliability is greater than zero, a precise statement of the magnitude is nearly always required. The idea that sampling content is more important than sampling objects in developing a measure is not easy to grasp. Many students fall into the trap of assuming that a test's reliability increases with the number of objects (subjects) used in the study of reliability, when in fact it is directly related to the number of items on the test and independent of the number of objects.

MEASUREMENT SCALES

A series of articles by Stevens (1946, 1951, 1958, 1960) evoked considerable discussion and soul searching about the different possible types of measurement scales. Stevens proposed that measurements fall into four major classes (some extensions of these basic types will be noted below): nominal, ordinal, interval, and ratio. The levels allow progressively more sophisticated quantitative procedures to be performed on the measures but in turn demand progressively more of the measurement operations. In addition, the levels restrict the transformations possible upon the data. Table 1-1 provides an illustration of his proposed classification, which we will embellish on in the succeeding pages.

[TABLE 1-1 STEVENS' LEVELS OF MEASUREMENT, BASIC DEFINING OPERATIONS, PERMISSIBLE TRANSFORMATIONS, EXAMPLES OF PERMISSIBLE STATISTICS, AND EXAMPLES]

Stevens' work evoked a great deal of controversy at the time, some of which continues. One major effect was that it led to a healthy self-consciousness about psychological measurement, but it also led to some unfortunate conclusions about the legitimacy of employing particular classes of mathematical procedures with measures of psychological attributes. Of these, the issue of whether or not it is meaningful to compute the mean of a series of test scores derived by summing individual items had the greatest implications. We will first present Stevens' position in a simplified, conventional manner, after which we will discuss the nature of psychological measurement in more general terms.

Nominal Scales

Nominal scales contain rules for deciding whether two objects are equivalent or not, i.e., for categorizing. Equivalence ($\equiv$ or $=$) means that two objects have a critical property in common, e.g., two people are both females. It does not imply identity or equality with respect to all relevant properties, and it will be discussed in a more formal sense below. The result of a nominal scale is a series of classes which may be given a numeric designation. The numbers are frequently used to keep track of things, without implying that they can be subjected to any mathematical analysis. Telephone and social security numbers are common examples of using numbers simply as labels that could just as well be expressed without numbers. These labels have no mathematical properties, and so it makes no sense to average a work and a home telephone number. However, it is important to distinguish between using the category "names" numerically, which is improper, and the category "frequencies," which is quite proper, e.g., to ask whether there are more Democrats, Independents, or Republicans in a political poll.

It is sometimes useful to distinguish between labels and categories even though both are nominal scales. Labels identify individual objects. These may be unique, as are the social security numbers given to U.S. citizens and residents, or there may be many duplications, as with given names. In contrast, categories are groupings of objects, in which it is usually desirable to have relatively few categories compared to the number of objects. Common categories are race, ethnicity, and gender.

Although categories and labels need not reflect any specific quantitative relationship, they may lead to the discovery of important correlates. For example, the finding that people of a certain ethnicity are more prone to a particular disease than people of a different ethnicity is vital to geneticists. However, this is an issue of classification, discussed below and in Chapter 15, and not scaling. Labels and categories are nominal scales, but nominal scales have thus far offered little to formal scaling models even though such models exist.

Nominal scales can be transformed in any manner that does not assign the same number to different categories. Thus, males and females could, respectively, be coded 1 and 0, 0 and 1, or even −257.3 and 534.8 without gain or loss of information. These one-to-one transformations are permissible because the names do not have numeric properties.
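The point that a one-to-one recoding of a nominal scale neither gains nor loses information can be illustrated with a minimal Python sketch. It applies the three codings of males and females mentioned above to a small invented sample and tallies the category frequencies under each. The sample, the function name `category_frequencies`, and the variable names are assumptions introduced here for illustration only.

```python
from collections import Counter

# Invented sample of category labels (illustration only).
sample = ["male", "female", "female", "male", "female"]

# Three one-to-one codings of the same two categories, using the values in the text.
codings = [
    {"male": 1, "female": 0},
    {"male": 0, "female": 1},
    {"male": -257.3, "female": 534.8},
]

def category_frequencies(labels, coding):
    """Tally observations per category after applying a numeric coding."""
    # Invert the coding; a permissible coding never reuses a number.
    decode = {code: label for label, code in coding.items()}
    if len(decode) != len(coding):
        raise ValueError("coding assigns the same number to different categories")
    counts = Counter(coding[label] for label in labels)
    # Report counts by category name so that different codings can be compared.
    return {decode[code]: n for code, n in counts.items()}

frequencies = [category_frequencies(sample, coding) for coding in codings]
for result in frequencies:
    print(sorted(result.items()))
```

Under these assumptions the three tallies are identical, which is the sense in which the category frequencies, unlike the numeric codes themselves, may properly be analyzed.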
The flexibility with which one can transform nominal scales reflects the limited mathematical operations that can be performed with them. For example, assume that a survey has coded parental votes as 1, 2, or 3 for Democrat, Republican, and Independent and that the frequencies of individuals in these three classes are 35, 25, and 40. One could compute a "mean" as $(35 \cdot 1 + 25 \cdot 2 + 40 \cdot 3)/100$ or 2.05. However, this figure would change capriciously if permissible transformations were made upon the categories. For example, it would change to 2.95 if Independents were coded 0, Democrats were coded 2, and Republicans were coded 9, and there is no logical connection between changes in the scale values and changes in the resulting mean. One important exception to this principle is when there are two categories. This exception underlies much contemporary multiple regression theory, as we will see later in this book. In this case, statistics such as means do change predictably as categories are changed. We will show why this is the case when we consider interval scales.

Ordinal Scales

Ordinal scaling involves rules for deciding whether one object that is $\neq$ to another is $>$ (greater than) or $<$ (less than) with respect to a given attribute (there may also be ties, so $\le$ and $\ge$ are also used). An ordinal scale for $N$ persons (S's) allows one to determine that $S_1 \ge S_2 \ge S_3 \ge \dots \ge S_N$ with respect to an attribute (the $=$ part of $\ge$ allows for ties). This implies that (1) a set of objects is ordered from "most" to "least" with respect to an attribute, (2) one does not know how much any of the objects possesses of the attribute in an absolute sense, and (3) one does not know how far apart the objects are with respect to the attribute. An ordinal scale is obtained if a group of people are ranked from tallest to shortest. This scale gives no indication of the average height. The mean rank of the height of $N$ jockeys and $N$ professional basketball players will be $(N + 1)/2$ in both cases; the mean of five ranked observations will thus be $(5 + 1)/2$ or 3. Likewise, the variance of the ranks will equal $(N^2 - 1)/12$ regardless of whether the measures are very similar or very dissimilar. If there are five ranked observations the result will be $(5^2 - 1)/12$ or 2.

Dichotomous (pass-fail) scoring is a special and, indeed, the simplest case of ordinal scaling, and it is commonly present in true-false or multiple-choice ability tests. A pass is commonly designated 1, and a failure is designated 0. Items using an agree-disagree format in personality or attitude measurement logically also yield pass-fail orderings, since agreeing with the key is a form of passing.

Ordered categories arise when a measure yields relatively precise information but the investigator lumps scores into a smaller number of successive categories. For example, an economist may categorize family income measures into a small number of levels. This can sacrifice a great deal of information, but it may be needed for data presentation. In contrast, data may be gathered as ranks. Likert scale items are a common example used in personality and attitude measurement in which subjects describe their intensity of feeling toward the item. For example, subjects might be asked whether they "strongly agree," "agree," "are indifferent," "disagree," or "strongly disagree" with the statement "I feel uncomfortable asking professors questions in class." The subject is then assigned a score from 1 to 5, and the total scale score is the sum of individual item scores. This format generates more information than dichotomous scoring, as it may increase the range of scores substantially over dichotomous item scoring, a benefit to the statistical analysis as it more faithfully reflects the individual differences on the attribute.

Rank ordering is basic to higher forms of measurement. Most of the information contained in higher-level scales is contained simply in the rank orderings (Coombs, 1964; Parker, Casey, Ziriax, & Silberberg, 1988).
Thus, if two sets of measures obtained from higher-level scales are correlated and converted to ranks, and the ranked data are also correlated (see Spearman's rank order correlation in Chapter 4), the correlation between the original numbers and the correlation between the ranks are usually quite similar in magnitude. In contrast, considerable information is lost, and correlations become much smaller, when both sets of observations are dichotomized. Consequently, methods based upon rank ordering, such as rank order multidimensional scaling considered in Chapter 14, often do justice to the relations contained in higher-level data, but the common practice of dichotomizing variables when the underlying data are of a stronger form should be avoided (Cohen, 1990).

The class of transformations permissible for ordinal scales is more limited than it is for nominal scales. The transformation must preserve the rank-order properties of the data. Thus, category names 1, 2, and 3 may be transformed to 4, 5, and 23 or −1.3, 2.05, and 5.33, but not 3, 1, and 2. These permissible transformations are called "monotonic" and are illustrated in Fig. 1-2. A set of statistical operations has been designed for use with ordinal data. The central tendency may be described in terms of the median or the mode (which is also meaningful with nominal data) rather than the arithmetic mean. The median and mode will change predictably with permissible transformations, whereas the mean will not. For example, if the median and mode are in the second of four ordinal categories coded from 1 to 4, they will remain so under any permissible transformation, which is not true of the arithmetic mean. A considerably different mean will obtain if the categories are recoded as 2, 4, 17, and 39, for example, but the median and mode simply change to the second category, 4.

Interval Scales

Interval scales reflect operations that define a unit of measurement as well as $>$, $=$, and $<$. They are often referred to as "equal interval scales" for this reason.
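Two numerical claims made above about ordinal data can be checked with a short Python sketch. It first recodes an invented set of responses in four ordered categories with the monotonic recoding 2, 4, 17, and 39 and compares the median, mode, and mean before and after, and it then confirms the rank formulas $(N + 1)/2$ and $(N^2 - 1)/12$ for $N = 5$. The response data and all variable names are hypothetical and serve only as an illustration.

```python
from statistics import mean, median, mode, pvariance

# Invented responses in four ordered categories coded 1..4 (illustration only).
responses = [1, 2, 2, 2, 3, 3, 4]

# Order-preserving recoding of the four categories, using the values in the text.
recode = {1: 2, 2: 4, 3: 17, 4: 39}
recoded = [recode[r] for r in responses]

# The median and mode follow the recoding: both remain in the second category.
median_tracks = median(recoded) == recode[median(responses)]
mode_tracks = mode(recoded) == recode[mode(responses)]

# The mean depends on the arbitrary spacing of the codes that were chosen.
mean_before = mean(responses)
mean_after = mean(recoded)

# Ranks 1..N have mean (N + 1) / 2 and variance (N**2 - 1) / 12,
# whatever the underlying measurements were.
N = 5
ranks = list(range(1, N + 1))
rank_mean = mean(ranks)
rank_variance = pvariance(ranks)

print(median_tracks, mode_tracks)
print(round(mean_before, 2), round(mean_after, 2))
print(rank_mean, rank_variance)
```

The population variance (`pvariance`) is used because the formula $(N^2 - 1)/12$ divides by $N$ rather than $N - 1$.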
Consequently, (1) the rank ordering of objects on an attribute is known, (2) the distances among objects on the attribute are also known, but (3) the absolute magnitudes of the attribute are unknown. Expressing the height of each of a series of children relative to their mean height would yield an interval scale of their height. Thus a child 2 inches taller than average would receive a score of +2, a child 3 inches shorter than average would receive a score of −3, etc. Deviations from any mean can be calculated without actually knowing how far anyone is from a true zero point, e.g., zero height. The absolute magnitudes of the attribute are potentially important but unknown, since the tallest child is probably short in a more general sense. However, psychological measures are commonly described as deviations from the mean.

Interval scales do not require an equal number of objects (people) at each point, i.e., a rectangular distribution of scores. The term "equal" describes the intervals on the scale, not the number of people between equally spaced points on the scale. Thus, the difference between intelligence measures of 100 and 105 is assumed equal to the difference between intelligence measures of 120 and 125 even though many more people fall between 100 and 105 than between 120 and 125.

[FIGURE 1-2 Two examples of monotonic transformations permissible on ordinal scales (horizontal axis: original scale).]

Interval properties imply that if $a, b, c, \dots, k$ are equally spaced points on the scale, the scale is defined by two statements:

1. $a > b > c > \dots > k$
2. $a - b = b - c = \dots = j - k$

An interval scale is defined by algebraic differences between points, and so addition and subtraction of the scale points are permissible operations. Since $a - b = b - c$, the sum of the two intervals equals $(a - b) + (b - c) = a - c$.
The difference between the two intervals equals zero:

$(a - b) - (b - c) = a - 2b + c = 0$

The expression equals zero because $a + c = 2b$:

$a - b = b - c$
$a + c = 2b$

Similarly, since points are assumed to be equidistant on an interval scale, the distance from $a$ to $c$ equals twice the distance from $a$ to $b$: $a - c = 2(a - b)$.

Whereas there is usually little dispute over whether nominal or ordinal properties have been established, there is often great dispute over whether or not a scale possesses a meaningful unit of measurement. Formal scaling methods designed to this end are discussed in Chapters 2, 10, and 15. For now, it suffices to note that many measures are sums of item responses, such as multiple-choice, true-false, and Likert scale items. Data from individual items are clearly ordinal. However, the total score is usually treated as interval, as when the arithmetic mean score, which assumes equality of intervals, is computed. Those who perform such operations thus implicitly use a scaling model to convert data from a lower (ordinal) to a higher (interval) level of measurement when they sum over items to obtain a total score. Some adherents of Stevens' position have argued that these statistical operations are improper and advocate, among other things, that medians, rather than arithmetic means, should be used to describe conventional test data. We strongly disagree with this point of view for reasons we will note throughout this book, not the least of which is that the results of summing item responses are usually indistinguishable from those of using more formal methods. However, some situations clearly do provide only ordinal data, and the results of using statistics that assume an interval scale can be misleading. One example would be the responses to individual items scored on multicategory (Likert-type) scales.
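The dispute over medians versus means for ordinal item data can be made concrete with a small sketch. It is an illustration only: the two groups of responses are hypothetical, and the recoding 1, 2, 3, 4 to 2, 4, 17, 39 is the permissible monotonic transformation used as an example earlier in this section. The sketch shows that the ordering of two group medians survives the recoding, whereas the ordering of the two group means need not.

```python
from statistics import mean, median_low

# Permissible (order-preserving) recoding of four ordinal categories,
# taken from the example in the text: 1, 2, 3, 4 -> 2, 4, 17, 39.
RECODE = {1: 2, 2: 4, 3: 17, 4: 39}

# Hypothetical responses of two small groups to one four-category item.
group_a = [2, 2, 2, 2]
group_b = [1, 1, 1, 4]


def recode(scores):
    """Apply the monotonic recoding to every response."""
    return [RECODE[s] for s in scores]


def summary(scores):
    """Median (as an actual category value) and arithmetic mean."""
    return {"median": median_low(scores), "mean": mean(scores)}


before = {"A": summary(group_a), "B": summary(group_b)}
after = {"A": summary(recode(group_a)), "B": summary(recode(group_b))}

# Does group A exceed group B both before and after the recoding?
median_order_kept = (
    (before["A"]["median"] > before["B"]["median"])
    == (after["A"]["median"] > after["B"]["median"])
)
mean_order_kept = (
    (before["A"]["mean"] > before["B"]["mean"])
    == (after["A"]["mean"] > after["B"]["mean"])
)

for label in ("A", "B"):
    print(label, "original:", before[label], "recoded:", after[label])
print("median ordering preserved:", median_order_kept)
print("mean ordering preserved:", mean_order_kept)
```

With these made-up data the medians keep their order while the means reverse it, which is the kind of result that motivates caution with single ordinal items. It does not by itself bear on summed total scores, for which the text argues that interval treatment is usually reasonable.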
The only transformation that preserves the properties of an interval scale is called the general linear transformation and is of the form $X' = bX + a$, where $X'$ is the transformed measure, $X$ is the original measure, and $a$ and $b$ are, respectively, additive and multiplicative constants involved in the transformation. Transforming temperatures from Celsius ($C$) to Fahrenheit ($F$), both of which are interval scales, by the relation $F = \tfrac{9}{5}C + 32$ is a common example. Figure 1-3 illustrates three general linear transformations. Ratios of individual values are not meaningful on an interval scale because the zero of an interval scale may be legitimately changed through changes in the additive constant $a$. The ratios, in degrees Fahrenheit, of 64 to 32 and of 100 to 50 are both numerically computable as 2. However, these no longer remain equal, and the first of them becomes undefined, if these temperatures are expressed in degrees Celsius. On the other hand, ratios of differences in interval scale values are meaningful. For example, assume the summer mean temperature (in degrees Fahrenheit) of a particular city is 90 during the day and 75 at night and that these change to 50 and 40 in the winter. The ratio of the differences in temperatures is $(90 - 75)/(50 - 40)$ or 1.5. The corresponding ratio in degrees Celsius is $(32.2 - 23.9)/(10 - 4.4)$ or (within rounding error) also 1.5. This is because the effects of changes in $b$ and $a$ cancel in the process of forming ratios of differences.

FIGURE 1-3 Three examples of general linear transformations permissible on an interval scale. The general form of the transformation is $X' = bX + a$.

When there are only two categories, there is only one interval to consider, so that one interval may be considered an "equal" interval. That is why binary (dichotomous) variables may be considered to form interval scales, the point noted above as being so important to modern regression and elsewhere in statistics.

Ratio Scales

A ratio scale is an interval scale with a rational (true) zero rather than an arbitrary zero. A rational zero for children's height in the above example would be physical zero rather than the mean height. The presence of a meaningful zero makes ratios of any two measures meaningful. Unlike the three lower types of scales, all four fundamental operations of algebra (addition, subtraction, division, and multiplication) may be used with individual values defined on ratio scales.

A rational zero means absence of the attribute and not simply a "reasonable" reference point, e.g., zero height or weight. It is often reasonable to reference scores to the mean, but the mean clearly does not denote absence of the attribute, and so it is not a rational zero in the present sense. If there is no rational zero, it does not make sense to form ratios since ratios change as the arbitrary zero changes, another way of saying that ratios of individual values on an interval scale are not meaningful. For example, suppose the class average on a test is 30 and two particular students obtain scores of 50 and 40. Relative to a score of zero, the ratio of these two scores is 1.25:1. However, zero correct is not a rational zero because a student obtaining a score of zero might be able to answer some simpler items correctly. Relative to the mean, the ratio becomes $(50 - 30)/(40 - 30)$ or 2:1, but this ratio is just as arbitrary as the 1.25:1 ratio relative to zero. There are many examples of rational zeros in physics, zero time and absolute zero (Kelvin) temperature being two others. However, it has proven difficult to define rational zeros for most psychological attributes like intelligence.
Zero reaction time is based upon physical time, and so it is a rational zero. This means that it is sensible to form such ratios as the mean reaction time obtained from a more versus a less intense stimulus. The major example of ratio scales comes from the fact that differences between observations on an interval scale form a ratio scale. Thus, if pre- and posttest scores on a measure are obtained, the resulting change score can be assumed to form a ratio scale with 0 representing no change. However, Chapter 5 will discuss how change scores may have other problems: it is difficult to compare two change scores based upon different pretest scores.

Actually, ratio scales are rarely needed to address the most common needs of scaling. Defining an interval is very important, but ordering is the most crucial concept. In contrast, nominal measurement rules suffice for most classification problems. On a ratio scale, it is not proper to employ the general linear transformation permissible with interval scales; only the more restricted form $X' = bX$ is allowable. This more specific form of linear transformation, depicted in Fig. 1-4, is also called a multiplicative transformation. Employing an additive constant ($a$) implies that the zero point is not fixed, which it is for a ratio scale, by definition. Changing from feet ($F$) to inches ($I$) by the relation $I = 12F$ is a frequently used multiplicative transformation.

Ratios of height, weight, etc., as measured from their true zero points are meaningful. These ratios do not change with permissible transformations since these permissible transformations do not allow a change in the zero point. This is why the term "ratio scale" is used. Someone who weighs twice as much as another person in pounds will also weigh twice as much in kilograms.
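The invariance claims made for interval and ratio scales can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `f_to_c` and `lb_to_kg` are introduced here for illustration, and the temperatures are the ones used in the examples above. It shows that ratios of individual values change under a general linear transformation, ratios of differences do not, and ratios of individual values survive a multiplicative transformation.

```python
def f_to_c(fahrenheit):
    """Fahrenheit to Celsius: a general linear transformation X' = bX + a."""
    return (fahrenheit - 32.0) * 5.0 / 9.0


def lb_to_kg(pounds):
    """Pounds to kilograms: a multiplicative transformation X' = bX."""
    return pounds * 0.45359237


# Interval scale: the ratio of two individual values is not invariant.
ratio_f = 100.0 / 50.0
ratio_c = f_to_c(100.0) / f_to_c(50.0)

# The ratio 64:32 in Fahrenheit has no Celsius counterpart at all,
# because 32 degrees Fahrenheit is 0 degrees Celsius.
try:
    f_to_c(64.0) / f_to_c(32.0)
    first_ratio_defined = True
except ZeroDivisionError:
    first_ratio_defined = False

# Interval scale: the ratio of two differences is invariant.
diff_ratio_f = (90.0 - 75.0) / (50.0 - 40.0)
diff_ratio_c = (f_to_c(90.0) - f_to_c(75.0)) / (f_to_c(50.0) - f_to_c(40.0))

# Ratio scale: the ratio of two individual values is invariant.
weight_ratio_lb = 200.0 / 100.0
weight_ratio_kg = lb_to_kg(200.0) / lb_to_kg(100.0)

print("value ratio, F vs C:", ratio_f, round(ratio_c, 3))
print("64:32 defined in Celsius:", first_ratio_defined)
print("difference ratio, F vs C:", diff_ratio_f, round(diff_ratio_c, 3))
print("weight ratio, lb vs kg:", weight_ratio_lb, round(weight_ratio_kg, 3))
```

The small discrepancy from 1.5 mentioned in the text arises only from rounding the Celsius values to one decimal place; with unrounded values the two difference ratios agree to floating-point precision.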
Other Scales

Those within the tradition exemplified by Stevens have proposed scale types other than these basic four, and it is important not to think that all scales are divided into four levels. Coombs (1964), Coombs, Dawes, and Tversky (1970), and Stine (1989a) have discussed these in some detail. One additional type is an ordered metric in which (1) the rank order of objects is known, (2) the rank order of intervals between objects is known, but (3) the magnitudes of the intervals are unknown. Such a scale allows one to say that $a$ and $b$ differ more than $c$ and $d$ but does not allow more precise statements about the relative magnitudes of difference. Stevens (1958) proposed a logarithmic interval scale where the ratios of magnitudes corresponding to successive points $a, b, c, d, \ldots$ are $a/b = b/c = c/d$, etc. Then $\log a - \log b = \log b - \log c = \log c - \log d$, etc. The decibel scale that is familiar to physicists is a logarithmic interval scale (it is not limited to the measurement of sound intensity), since it involves transforming stimulus energies to their logs.

The absolute scale formed from counts is the strongest type of measurement because it has the interesting property of being its own invariant scale of measurement. When one says "There are three people in the room," the meaning of "three" is inherent in the real number system. In contrast, if you were told
that a room is three units wide, this might refer to yards, meters, or some other unit of measurement. As interesting as some of these other scales are, though, the four basic ones listed above are far and away the most important to psychometric theory and application.

FIGURE 1-4 Two examples of multiplicative transformations permissible on a ratio scale. The general form of the transformation is $X' = bX$.

It is important to consider the circumstances under which a particular type of scale remains invariant, i.e., maintains its properties when the unit of measurement is changed. As we have seen, the more powerful the mathematical operations meaningful with a given scale, the less free one is to change it. Thus, nominal scale labels may be changed in an almost unlimited manner as long as no two categories are given the same label, but at the other extreme absolute scales lose their absolute properties when changed in any way. Invariance is basic to the generality of scientific statements derived from a scale (Luce, 1959, 1960). It is easy to imagine the chaos that would result if some physical measures lacked the invariance of ratio scales. Without invariance, a stick that is twice as long as another measured in feet might be three times as long when measured in inches. The range of invariance of a scale determines the extent to which principles remain unaffected by expressing the scale in different units, e.g., feet rather than inches.

This does not mean that the results of using the scale will not change. A mean temperature in degrees Fahrenheit will be numerically different from a mean temperature in degrees Celsius even though both are permissible operations. The point is that the means will change in an orderly fashion: Specifically, the same equation will relate the means as relates the individual values, so that the general form of relationship is preserved. Similar statements apply about operations meaningful on other scales.

DECISIONS ABOUT MEASUREMENT SCALES

A strong view of measurement is called the representational position (or the "fundamentalist" position in the previous edition of this book) about measurement scales because it states that scale values represent empirical relations among objects (Michell, 1986; also see Stine, 1989b). Its main assertions are that (1) measurement scales have empirical reality in addition to being theoretical constructs, (2) a possible measure of an attribute can be classified into one of a small number of distinct levels, and (3) investigators must document the scale properties of particular measures before analyzing data obtained from the scale because the scale's level limits the permissible mathematical operations. Besides Stevens, the tradition includes Krantz, Luce, Suppes, and Tversky (1971), Luce (1959b), Suppes and Zinnes (1963), and Townsend and Ashby (1984; also see Ashby & Perrin, 1988; Davison & Sharma, 1988, 1990).

Representational theory had great impact in the 1950s. Investigators tended to avoid parametric tests (t, the F ratio of the ANOVA, etc.) that required an interval scale (at least according to representational theory) and used nonparametric tests (Siegel & Castellan, 1988) that required only ordinal or nominal assumptions instead. Representational proponents of nonparametric tests argued that these tests were only slightly weaker (less able to detect differences) than their parametric counterparts, a difference that could generally be overcome by gathering slightly more data. However, they largely ignored the greater flexibility of parametric methods in evaluating interactions (combined effects of two or more variables that are not predictable from the individual variables). Starting in the 1960s, investigators returned to the use of parametric tests.

As a simple example of the representational approach, consider this approach to defining the equivalence ("=") of two objects (the presence of a property in common, e.g., being enrolled in the same college course). Equivalence requires transitivity, symmetry, and reflexivity. "Transitivity" means that the relation passes across objects: if John and Richard are enrolled in the course and if Richard and Mary are enrolled in the course, then John and Mary must be enrolled in the course. "Symmetry" means that the relationship extends the same way in both directions: if John is enrolled in the same course as Mary, then Mary must be enrolled in the same course as John. "Reflexivity" states that the relation extends to the object itself: every object is equivalent to itself (if John is enrolled in the course, then John is enrolled in the course, but not all examples are that obvious, as we will see in Chapter 13). Parallel considerations yield definitions of the ">" and "<" relationships used to define ordinal scales, the unit used to define interval scales, and the zero point used to define ratio scales. These latter relations are not symmetrical, among other things. If Mary is ">", e.g., taller than, Susan, then Susan cannot be ">" Mary. Representationalists have been most concerned with whether computing the mean is permissible. We have already stressed the issue of whether scores on a test form an interval scale, and they have often argued that they form only ordinal scales. In their view, test-based data can be analyzed by means of the arithmetic mean if they form interval scales but not if they do not. We strongly suggest that this position can easily become too narrow and counterproductive.

Michell (1986) describes two other traditions that he terms operational theory (Gaito, 1980; Bridgman, 1927) and classical theory (Rozeboom, 1966). Neither holds that particular scale properties are needed to perform a particular statistical operation. Operational theory views a concept as synonymous with the operations that define it. In other words, a score on a test does not represent (stand for) something beyond the measure itself [...]. Finally, classical theory views measurement as the determination of quantity, i.e., how much of an attribute is present in an object (note that above, we assume that measurement also involves classification). Gaito (1980; also see Baker, Hardyck, & Petrinovich, 1966) termed this position "statistical theory" and was highly critical of representational theory ("measurement theory"). [...] Perhaps his major point concerns the consequences of using presumably impermissible operations in the most common analyses [...]. For a counterview, see Townsend and Ashby (1984).

Ostensive Characteristics

The physical characteristics of the measurement operations provide one way to judge the scale characteristics of a particular measure, e.g., length with some form of yardstick. To prove that the attribute in question is measured on a ratio scale requires proof of both (1) equal intervals and (2) an axiomatically unquestionable zero point. Anyone can see the zero point where the yardstick starts. The measuring instrument begins at the end of the yardstick, and open space lies behind that point. Who could argue for a more meaningful zero point? The equality of intervals is also easy to demonstrate, e.g., saw the yardstick inch by inch and compare the inch-long pieces to ensure equality.

To a lesser or greater extent, all other measures employ correlates of the attribute rather than the attribute itself and are therefore indirect. We can establish equality of time intervals but, strictly speaking, we observe the effects of time and not time itself: pendulum swings and the earth's rotation are only consequences of time. Nearly all measures of interest to behavioral scientists are indirect. We cannot observe intelligence per se but only its by-products. Likewise, a subject's perception can be inferred only from the subject's ability to discriminate and/or report what he or she experiences (Eriksen, 1960).

Many investigators, who may not even consider themselves representationalists in a formal sense, tend to evaluate scale properties in terms of ostensive characteristics even though the measure may have been developed from a formal scaling model. We suggest that if the data obtained from applying the measure fit the model under consideration and the axioms (assumptions) of the model are appropriate, then the measure has scale properties specified by the model. For example, Chapter 2 considers a model proposed by Louis Guttman for the construction of ordinal scales based upon assumptions about patterns of responses to test items. Data are used to determine how well the actual score patterns relate to the patterns predicted by the model. A good fit implies that an appropriate scale exists.

Because there are no ostensive properties to guarantee the equality of intervals of intelligence, some have argued that intelligence tests, for example, provide ordinal scales at best. We hope the above discussion illustrates that few measures in all sciences would be considered more than ordinal scales by these standards; the following sections will show that proper standards for judging the scale properties of a measure do not require observing the ostensive characteristics of an attribute. In particular:

1 Standards can be based on data rather than ostensive characteristics. One studies the results of applying a measure to real objects when using a scaling model, or one studies the measurement tool directly when using ostensive characteristics. Thus, instead of relying upon the ostensive properties of yardsticks, one could test a model concerning properties of ratio scales and then see if it fits data obtained from yardstick measurements. One could therefore derive the scale properties of the yardstick without ever seeing a yardstick. People have done this, and the data fit a variety of scaling models beautifully, e.g., produce transitivity. This is what psychological scaling is about. It is an attempt to work backward from data to test the fit to a model.
In this way, ratio, interval, ordinal, or perhaps nominal scales for psychological attributes which cannot be seen directly may be constructed.

2 Using scaling models is a healthy trend in the development of measurement methods. Many models are intuitively quite appealing. Because they specify the characteristics that should be found in data, they are subject to refutation (can be falsified; Popper, 1959). Some models have produced scales that have led to interesting scientific findings.

3 A model is no better and no worse than its assumptions (axioms). There is ample room for disagreement, and there is plenty of it, about the fruitfulness of different models. For example, we have argued that measures like multiple-choice test scores should be viewed as having interval properties. However, if psychologists disagree about the correctness of different scaling models, how are scale characteristics ever determined? If, for example, several interval scaling models are being tried on a particular type of data, a failure of the data to fit one model does not automatically prevent the measure from being considered as an interval scale. Conversely, even if the data fit all the models, the measures should not automatically be thought of as constituting an interval scale. A more final decision should be made with respect to standards to be discussed in the following sections.

Consequences of Assumptions

Even if one believes that there is a real scale for each attribute that is either directly present in a particular measure or mirrored in a monotonic transformation, an important question is, What difference does it make if the measure does not have the same zero point or proportionally equal intervals as the real scale? If the scientist assumes, for example, that the scale is an interval scale when it really is not, something will go wrong in the daily work of the scientist. What could go wrong? How could the difficulty be detected?
The scientist could misstate the specific form of the relationship between the attribute and other variables. For example, a power function might be found between two measures using an imperfect interval scale, whereas the right scale may produce a linear relationship.

How seriously would such a misstatement affect the progress of the behavioral sciences? At present, the usual answer is "very little." Most results are reported as either correlations or mean differences. We have stressed and will stress that correlations are little affected by monotonic transformations on variables. These correlations are the basis of still more powerful methods like factor analysis. However, we also stress that justifying the rank order is vital. Even if one accepted the representational point of view about measurement scales, what sense does it make to sacrifice methods of correlational analysis just because there is no way of proving the claimed scale properties of the measures?

There is also often major concern about the ratios of variances among different sources of variation in analyzing mean differences among groups, e.g., F, the variance among means relative to the variance within groups. This ratio and related statistics are also little affected by monotonic transformations of the dependent measure. If it is granted that the measure used in the experiment is at least monotonically related to the real scale, it usually makes little difference which is used in the analysis. There are some exceptions of import. Two of these are (1) in examining details of functional relationships, such as whether a particular monotonic relation is linear, logarithmic, a power function, or some other form, and (2) for some goodness-of-fit tests used in structural modeling (see Chapters 5, 10, and 15).

A simple rule of thumb is that transformations become more important as the level of sophistication of the research hypotheses increases.
Thus, tests simply concerned with looking for group differences and rank orderings of groups typically involve statistical procedures that are little affected by transformations. Numerically, these perhaps account for the vast majority of research. Interval assumptions are therefore not crucial when interest centers on ordinal relations among group means, etc. However, more refined tests of highly quantitative models are very sensitive to the interval properties of the scale, virtually by definition.

After analyzing the results of investigations, as in correlations and/or ratios of variance components, it often is important to make probability statements about the results after applying inferential statistics. Thus, it may be important to set confidence zones for a correlation coefficient or to test the significance of a particular ratio among components of variance. Such statistical methods are completely indifferent to the zero point and consequently do not require ratio scales. However, they do assume interval properties, but since they are based on ratios of variation and covariation, they are also little affected by monotonic deviations from any true interval scale. Moreover, statistical methods are completely blind to any meaning in the real world of the numbers involved. All that is required is a population of numbers that meets the assumptions in the particular statistical method, such as normality of the population error distribution. We suggest that it is perfectly permissible to employ the ANOVA to test hypotheses about the average size of the numbers on the backs of football players on different teams.
What use you may make of the result is, of course, a different story, since there is no meaning to a theory of football numbers beyond identifying the position individuals play, rather than how well they play it (see Lord, 1953).

Chapters 10 and 15 will show some extremely useful consequences of the representational point of view. We merely note that it is easily misused when the usual intent is to compute correlations or infer the ordering among group means. Moreover, even when the intent is to study specifics of functional relations, one may discover that equally good definitions of attributes are not linearly related to one another, so the relation to other measures depends upon how the attribute is defined.

Convention

We have thus far considered the representational point of view that scientists normally think in terms of "real" scales and obtain measures as approximations to such "real" scales. Our opinion is that (1) this point of view frequently leads to unanswerable questions and (2) violations of even relatively important assumptions are not harmful in most settings. The authors oppose the concept of "real" scales in most settings and deplore the confusion that this conception has wrought to the average investigator. It is much more appropriate to think of measurement scales as conventions or agreements among scientists about a "good" scaling.

In saying that scales are established by convention and not God-given, we do not mean that such conventions should be arbitrary. Before measuring an attribute, all manner of wisdom should be sought as to the nature of the attribute; one cannot measure something unless one has some general conception about what is to be measured. The nature of a "good" scaling of certain measures can be so readily agreed upon that a convention is easily established, e.g.,
length, weight, and time. Exasperation about theories of measurement has tempted some to wish that there were no yardsticks and no balances for the measurement of weight so that all scientists could see that measurement always involves convention rather than discovery of the "real" measure.

Sometimes, one person establishes a measurement convention and other scientists neglect to participate in establishing the particular convention. Consequently, the measure becomes accepted as the scale. The Fahrenheit thermometer was once taken as the scale of temperature. Later, the discovery of absolute zero led to a new and more useful scaling. In psychology, intelligence was once defined as the ratio of mental age to chronological age, i.e., as an intelligence quotient (IQ), but intelligence is now measured relative to performance within a given age distribution. Both these instances illustrate why it is wrong to think that "real" scales had been discovered. It is better to say that conventions changed because better conventions were developed. The key is continued validation of measures.

After applying all available wisdom to the problem, it is good to apply some type of formal scaling model when actually constructing measurement scales. Although any set of rules for the assignment of numbers constitutes measurement, silly and/or ad hoc rules probably will not result in a useful measure. It is useful to think of a scaling model as an internally consistent plan for scaling an attribute. When the plan is put to the test, the measure may eventually prove unsatisfactory to the scientific community, but having a plan increases the probability that it will be acceptable. Sometimes, useful measures are simply stumbled upon. However, explicit plans based on common sense and past experience improve the probabilities of useful measurement scales.

A convention establishes the scale properties of a measure.
If it is established as a ratio scale, then the zero point can be taken seriously and the intervals may be treated as equal in any form of analysis. If it is established as an interval scale, the intervals may be treated as equal in all forms of analysis. This is not meant to imply that such conventions are, or should be, established quickly or before much evidence is in, but in the end they are conventions, not discoveries of "real" scales.

Certain conventions are not employed because they make no sense or do not lead to scientifically useful results. For example, the Celsius scale's use of the freezing point of water to define temperature's zero point has limited scientific utility. Water is an important substance, but it is not the only important substance. On the other hand, the absolute zero of the Kelvin scale, based upon the absence of molecular activity, is useful to a wide range of physical laws. It similarly makes little sense to establish zero points on scales of many, but not all, psychological attributes. Zero intelligence might be defined as the problem-solving ability of a dead person, but the utility of this convention in establishing a ratio scale of intelligence remains to be determined. Psychologists seek to develop interval scales for many attributes because it is reasonable to ask how far apart people are on the scale and not simply their ordering. For example, we frequently need to determine if $a$ is closer to $b$ than to $c$.

Scaling procedures that make sense may still not produce scales that work well in practice. These last four words are the key to establishing a measurement convention: a good measure is one that mathematically fits well in a system of lawful relationships. Chapter 3 will emphasize that the usefulness (validity) of a measure is the extent to which it relates to other variables in a domain of interest. The "best" scaling of any particular attribute is that producing the simplest forms of relationship with other variables.
An increasing hierarchy of simplicity is (1) a random relationship, (2) a nonrandom pattern fitting no particular line of relationship, (3) an unevenly ascending or descending monotonic relationship, (4) a smooth monotonic relationship, (5) a straight line, and (6) a straight line passing through the origin. The only way to describe a random relationship completely is to describe every point. However, a straight line passing through the origin is completely described by $Y = bX$, and the $b$ (slope) parameter is usually arbitrary. Since the scientist's task is to translate and simplify the complexity of events in the universe through lawful relationships, the simpler these relationships, the better.

One way to make relationships simpler is to change the scaling of one or more of the variables. Thus an irregular monotonic relationship can be smoothed by stretching the scale, a procedure widely used by Anderson (1981, 1982) under the name "functional measurement." Any monotonic curve can be transformed to a straight line by this device. A straight line can be made to pass through the origin by changing the origin (zero point) on one of the scales. Of course, conventions about a particular attribute should not be altered because of the relationships found with only one or two other measures. One should consider the effects upon several measures. Nonetheless, if many relationships are simplified by a particular transformation, the new scale is logically a better scale. Such transformations are actually made quite frequently. For example, logarithmic transformations are quite common, especially in sensory psychology.

Following this point of view to the extreme, there is no reason why all variables known to science could not be rescaled to simplify all relationships. This would be a wise move if it could be done, a big "if." The new scales are as "real" as the old ones, and there might be every reason to take the zero points and the intervals of the new scales seriously.
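The idea of rescaling to simplify a relationship can be sketched with the logarithmic transformation mentioned above. This is an illustration under stated assumptions, not Anderson's functional measurement procedure: the data are a made-up power function $Y = 3X^{0.5}$, and the helper names `pearson_r` and `least_squares_line` are introduced here only for the example. Taking logs of both variables turns the smooth monotonic curve into a straight line, and shifting the origin of one scale then makes that line pass through the origin.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation, written out to avoid dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: a smooth monotonic (power) relationship, level 4
# in the hierarchy of simplicity. Constant and exponent are made up.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
y = [3.0 * xi ** 0.5 for xi in x]
r_raw = pearson_r(x, y)

# Rescale both variables logarithmically: the curve becomes a straight line.
log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]
r_log = pearson_r(log_x, log_y)
slope, intercept = least_squares_line(log_x, log_y)

# Change the origin of the rescaled Y so the line passes through the origin.
shifted_log_y = [ly - intercept for ly in log_y]

print("r on the original scales:", round(r_raw, 3))
print("r after the log transformation:", round(r_log, 6))
print("slope and intercept in log units:", round(slope, 3), round(intercept, 3))
```

The slope recovered in log units is the exponent of the assumed power function, and the intercept removed by the shift is the log of its constant. As the text cautions, a rescaling of this kind is worth adopting as a convention only if it simplifies relationships with several other measures, not just one.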
There are two major problems with considering scaling merely as a matter of convention. First, it is disquieting to those who think of real scales and futilely wish for infallible tests of the relationships among real scales. Looking at measurement scaling as convention also seems to make the problem "messy." How well a particular scaling of an attribute fits in with other variables is vague. Which variables? How good is a particular fit? To avoid such questions, however, is to blind oneself to the realities of scientific enterprise. To seek shelter in the apparent neatness of conceptions regarding real scales is not to provide answers about the properties of measurement scales but to ask logically unanswerable questions.

A second, and more serious, problem with considering scaling as a matter of convention is that two or more conventions often compete with one another. For example, there has been much dispute about whether Thurstone's law of comparative judgment or Stevens' magnitude-estimation methods better describe the results of measuring sensations (see Chapter 2). As it turns out, Thurstone's procedures are more useful in describing lawful relations involving confusion among stimuli, and Stevens' are more useful in predicting how stimuli will appear (the two are also simply related through a logarithmic transformation). More appropriate than asking which is correct would be to ask whether confusion among stimuli or their appearance is at issue in the particular situation.

Having competing conventions regarding the scaling of attributes is not as bad as it sounds for two reasons. First, if the two scalings are monotonically related to each other, as is usually the case, and if one has a monotonic relationship with a third variable, so will the other. Thus the principles established with the two scalings will produce the same general functional relations, even though the specifics may differ. The specific form of relationship is rarely the major issue in contemporary psychology even though it can be.
The more common question is the strength of relationship between the two variables. Correlations greater than .60 are the exception rather than the rule, and, as was said previously, such correlations are largely insensitive to monotonic transformations. Consequently, if there are two competing, monotonically related conventions for scaling that are equally reliable in the sense to be described later, both will produce about the same correlation with any other variable. In sum, the specific forms of relationship can be settled only when there are firm conventions for scaling. The form of a relationship is relative to the measurement convention. To hope to find the "real" form of a relationship is either to continue to search vainly for real scales or to assume that one measurement convention eventually will win out over others.

We have devoted nearly all of this chapter to the first part of the definition of measurement, measurement as scaling. This is because measurement as scaling has led to more issues of dispute than has measurement as classification, and because until recently there were few sophisticated techniques to use with categorical (nominal) data, the usual fruits of classification. This has changed, especially since the last edition of this book, and Chapter 15 will focus on some of these new developments.

Classification demands a nominal scale (rules to define "=" and "≠") at a minimum and, conversely, illustrates that a nominal scale, which was considered "lowly" in terms of scaling, can be extremely important. Consider two common statements: (1) "Everyone is unique; no two people are the same" and (2) "People are pretty much alike." Although these two statements appear totally contradictory, both share the characteristic that they lead one away from some useful, if not obvious, results. For example, people who describe themselves as Republicans are quite likely to answer a variety of politically related questions differently from people who describe themselves as Democrats, e.g., "Should prayer be allowed
in public schools?" Similarly, the relation between political affiliation and response to the political issue may jointly vary with additional variables such as whether the person lives in a rural, suburban, or urban area. Note that this analysis does not necessarily ignore individuality. Two people who fall within the same "cell" of the analysis (e.g., who are both Democrats, live in a suburban area, and oppose school prayer) may differ in countless ways (e.g., gender, religion, height, or weight). As with scaling, classification assumes equivalence and not identity.

Although classification is relatively simple conceptually, it can be quite difficult empirically. Useful classification along one dimension implies that the dimension in question will relate to another dimension (which in turn could be at any of the previously mentioned levels). There is no reason to classify people as type alpha versus type beta unless these categories have a useful external correlate. Even such obvious categories as Catholic, Protestant, Jewish, and Muslim may not be widely useful (though religiously orthodox versus religiously nonorthodox, disregarding the specific religion, may be). Moreover, apparent relations between a categorical variable (or any other) and a given criterion may be an artifact of a third variable; religious differences may, for example, be an artifact of differences in education and/or income. Thus, one may obtain apparent differences between Catholics and Protestants on an issue that involves liberal versus conservative attitudes because more affluent individuals also tend to be more conservative and the two groups differ in affluence. Likewise, empirical disputes often arise between "lumpers" (people who favor a small and therefore more parsimonious number of broad categories) and "splitters" (people who favor a larger number of more finely defined categories).

RECENT TRENDS IN MEASUREMENT

The Impact of Computers

It is very easy to think that the main role of a computer is to expedite analyses that one
would have performed anyway. This is certainly important. Anyone who has used computers for a long time appreciates the increasing flexibility and user-friendliness of major computer packages such as BMDP, SAS, SPSSX, SYSTAT, and UniMult. One likewise appreciates the related factors of greater power, increased reliability, and lower cost in the personal computers that are now beginning to dominate statistical analyses and the availability of supercomputers for massive undertakings. However, one additional point must be stressed: computers now allow fundamentally different kinds of analyses to be performed, i.e., open-form analyses that are effectively impossible to do by hand.

Closed versus Open-Form Solutions

Many of the techniques, concepts, and measurement theories that have recently become popular actually have long histories. However, they were essentially interesting statistical curiosities before computers became generally available. The distinction between closed- and open-form solutions helps make this point more understandable.

Consider your first statistics class, where you were taught to compute the arithmetic mean of a sample by adding up the scores and dividing by the number of scores and given the associated equation $\bar{X} = \sum X / N$. This is a closed-form solution because all you need do is plug the numbers into the formula to obtain the result. You might wish to use a computer if $N$ were very large, but the principle would be the same.

On the other hand, suppose you did not know the formula but for some bizarre reason you remembered that the mean minimizes the sum of squared deviations. This too can be expressed by a kind of formula: $\sum (X - C)^2$ is a minimum when $C = \bar{X}$, but the formula does not tell you how to obtain $\bar{X}$. You might use this information to compute $\bar{X}$ by plugging in different values of $C$, computing the sum of squared deviations for each value, and accepting the one producing the smallest sum.
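The trial-and-error procedure just described can be sketched in a few lines. The scores, the grid of candidate values, and the function name `sum_sq_dev` are hypothetical choices made for illustration; a practical open-form algorithm would refine its guesses iteratively rather than scan a fixed grid.

```python
# Hypothetical sample of scores.
scores = [2, 4, 4, 5, 7, 9, 11]

def sum_sq_dev(c, xs):
    """Sum of squared deviations of the scores about a candidate value c."""
    return sum((x - c) ** 2 for x in xs)

# Closed-form solution: plug the numbers into the formula.
closed_form = sum(scores) / len(scores)

# Open-form estimate: try many candidate values of C (0.00 to 12.00 in
# steps of 0.01, an arbitrary grid) and keep the one with the smallest sum.
candidates = [i / 100 for i in range(0, 1201)]
open_form = min(candidates, key=lambda c: sum_sq_dev(c, scores))

print(closed_form, open_form)
```

The open-form search needs over a thousand evaluations to reach what the formula gives in one step, which is why such methods were impractical before computers.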
If you performed enough calculations, you could in fact obtain an open-form estimate of $\bar{X}$.

Many statistical quantities of interest, particularly those of recent prominence, require an open form of estimation because they lack a closed-form solution. This is often true of maximum likelihood estimates discussed at several points in this book. For all intents and purposes, such estimates require a computer and, even then, can be very time-consuming. The process involves repeated calculations or iterations. Numerical analysts often specialize in developing better algorithms to obtain the necessary successive approximations. Iterative proportional fitting and Newton-Raphson algorithms are two such common computational processes. You will not need to know how to use either one yourself, but they are widely employed in programs you may use.

Computer Simulation

Computers are also invaluable in simulating processes. One form of simulation that is widely performed on computers is the Monte Carlo method, in which the value of a parameter is obtained by random sampling. If you were asked to verify that the probability of obtaining heads on a coin flip is .5, you might actually flip a coin a large number of times and count the actual number of heads, hoping the coin was fair. This would illustrate the Monte Carlo method but would not be a computer simulation. The experiment may be done more efficiently on a computer, where the program would conduct a series of trials. On each trial, the program generates a random number from 0 to 1 and adds one to the count of heads if the random number is greater than 0.5. When finished, it prints the proportion of times heads occurred.
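A minimal sketch of the coin-flip simulation follows. The function name `simulate_heads`, the number of trials, and the seed are illustrative assumptions; the decision rule (count a head when the random number exceeds 0.5) is the one described above.

```python
import random

def simulate_heads(n_trials, seed=None):
    """Return the proportion of simulated flips that come up heads."""
    rng = random.Random(seed)      # seeded generator so runs can be repeated
    heads = 0
    for _ in range(n_trials):
        # Draw a random number in [0, 1); count a head if it exceeds 0.5.
        if rng.random() > 0.5:
            heads += 1
    return heads / n_trials

proportion = simulate_heads(100_000, seed=1)
print(proportion)
```

With a large number of trials the printed proportion should lie close to .5, although it will rarely equal .5 exactly.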
Computer simulations are often performed when it is difficult to obtain a solution analytically (algebraically) or if no solution is known to exist.

SUMMARY

Measurement consists of rules for assigning symbols to objects to (1) represent quantities of attributes numerically (scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification). Both scaling and classification involve the formulation and evaluation of rules. These rules apply to attributes of objects, usually, but not exclusively, people. It is important to remember that we can measure only attributes of objects, not the objects themselves. Among the characteristics of good rules are repeatability (reliability) and, more importantly, validity in senses to be described. Standardization is an important goal of measurement because it facilitates objectivity, quantification, communication, economy, and scientific generalization.

Measurement uses mathematics, but the two serve separate roles. Measurement needs to relate to the physical world, but pure mathematics is solely concerned with logical consistency. One traditionally important, but controversial, aspect of scaling that involves mathematics is the concept of levels of measurement. Scales generally fall at one of four levels (others have been suggested): nominal, ordinal, interval, and ratio. These four levels represent progressively better articulated rules. For example, nominal scales simply define whether or not two objects are equivalent to one another with respect to a critical attribute, but ordinal scales determine whether one object that is not equivalent to another is greater than or less than the other object. Stronger results are possible from higher levels of measurement.
Basic to these levels of measurement is the concept of invariance, which concerns what remains the same as permissible changes are made in the scale (e.g., in its unit of measurement); higher-level scales are more restricted as to how they may be transformed and still preserve key invariances. Focal to the debate about levels of measurement are the statistical operations permissible on a given set of measures. The representational position asserts that scale properties must be established before performing relevant operations; e.g., a scale must demonstrably have interval properties before it is proper to compute an arithmetic mean. Alternative positions, classical and operational, do not share this view. Many, who need not be formally aligned with a specific position, look for scales to have ostensive (visualizable) properties like yardsticks or clocks have before accepting a scale as real; they view existing measures as highly imperfect correlates of true scales. We suggest that very few measures in science are ostensive. A much more important consideration is the extent to which the results of using the scale are useful. Measurement use is essentially based upon convention, and progress is made when better conventions are agreed upon. In general, the more well elaborated a hypothesis is and the more it is stated quantitatively, the more important formal scaling issues are.

The most important single factor in the recent progress in measurement has been the computer. Although computers obviously allow analyses that could be done by hand to be done more easily and accurately, they also allow fundamentally different analyses to be performed. Many of these use open-form solutions, so named because the results cannot be defined directly by a formula (closed-form solution). In addition, computers allow simulation of processes that are difficult to study directly.

SUGGESTED ADDITIONAL READINGS

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Coombs, C. H. (1964). A theory of data.
New York: Wiley.
Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old misconception. Psychological Bulletin, 87, 564-567.
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.
Stevens, S. S. (1958). Problems and methods of psychophysics. Psychological Bulletin, 55, 177-196.
Townsend, J. T., & Ashby, F. G. (1984). Measurement scales and statistics: The misconception misconceived. Psychological Bulletin, 96, 394-401.

(Note: Sage Publications offers many short monographs on an extremely wide variety of relevant topics aimed at scholars who are not quantitative specialists. Although they should not be used as the sole guide to a given problem because of the innumerable complications that may be present, they are highly recommended as starting points. We will not cite these works individually.)

PART TWO
STATISTICAL FOUNDATIONS

Part Two contains four chapters that deal with statistical concepts basic to measurement. First, we look at some models used to construct scales. One central concept is that of the item trace line (item-characteristic curve), which relates the magnitude along a dimension (trait) to the magnitude of response to a particular item. The next chapter deals with the three basic meanings of test validity: content validity, construct validity, and predictive validity. Many have debated whether these are ultimately the same or not.
Though they share important similarities, there are also important differences among them. The third chapter considers statistical description and estimation. Much of this involves traditional issues in correlation and regression that you may have been previously exposed to. However, two additional topics may be less familiar: [...] and alternative forms of statistical estimation. The latter is important because statistical inference plays a much larger role in psychometric theory than it did in the previous edition. The method of maximum likelihood is especially important. Finally, we discuss properties of linear combinations, which are central to psychometric theory.

TRADITIONAL APPROACHES TO SCALING

CHAPTER OVERVIEW

Scaling was defined in Chapter 1 as the assignment of numbers to objects to represent quantities of attributes. Although any relevant set of rules can be spoken of as measurement, it helps to have some internally consistent plan when developing a new measure. The plan is a "scaling model," and the resulting measure is a "scale" or a "measurement method." The simplest example is a ruler used as a scale of length. The methods for constructing and applying rulers constitute the scaling model. Scaling models are designed to generate one or more dimensions (continua) to locate people or objects. In the following example, persons P1, P2, P3, and P4 fall along one such dimension, which could be social anxiety, spelling ability, attitude toward abortion, etc.

[Figure: persons P1 to P4 located along a single dimension running from lower to higher levels of the attribute.]

Because this is an interval scale, the distances between people are meaningful. Thus one person is considerably higher in the attribute than another, two persons are close together, and one is far below the others.

We begin this chapter with an introduction to the concept of a data matrix, which is central to nearly all measurement data, and some differences between scaling stimuli and scaling people. Next, we present a brief history of "psychophysics," which is the study of the relation between variation in physical dimensions of stimuli and their associated responses, as it forms the foundation for "psychometric" theory. In contrast, "psychometrics" in general may or may not study the effects of variation in a single physical dimension, and so it includes psychophysics as a topic. Then, some distinctions among different types of stimuli and, especially, responses are made. We then consider some general principles underlying the development of ordinal, interval, and ratio scales. Following this, we present what is probably the historically most important scaling model for stimuli, Thurstone scaling. The ensuing section considers some models used to scale people. In particular, we introduce the linear model (also called the summative or centroid model), which simply involves the familiar process of defining a score as the ordinary sum, perhaps weighted, of responses to individual items.

DATA MATRICES

Most measurement problems begin with a data matrix or two-way array or table (we will describe some other matrices from time to time). Rows typically represent N different objects (usually people), and columns represent K different stimuli (commonly, questionnaire items; see Table 2-1). It is conventional to denote the entire matrix by an uppercase letter in boldface, e.g., X. The data are responses, e.g., 0 = incorrect versus 1 = correct, Likert scales, etc. Individual elements appear in lowercase italic. The first subscript conventionally denotes the row (usually the object being measured, as a person), and the second subscript denotes the column (stimulus, as a questionnaire item number), so that $x_{ij}$ denotes the response of subject i to stimulus j. However, the stimuli and responses can represent anything that the experimenter does to the subjects and anything the subjects do in return. Consequently, we need not limit the discussion to people and test items in the ordinary sense. Subjects might estimate the weights of various objects, for example. It is possible, though rare, that the matrix is a single person's response to a series of stimuli studied over occasions (e.g., Nunnally, 1955), among other variants.

TABLE 2-1
A BASIC TWO-WAY DATA MATRIX (X) CONTAINING RESPONSES OF N PERSONS (ROWS) TO K STIMULI (COLUMNS)

                  Stimuli
Objects    1      2      ...    K
1          x11    x12    ...    x1K
2          x21    x22    ...    x2K
...
N          xN1    xN2    ...    xNK

Most classical psychometric models treat scale items as replicates of one another in the sense that differences among the items are ignored in scaling. Thus, a patient's anxiety is typically defined by counting the number of anxiety-related symptoms that are endorsed regardless of which specific ones they are. Alternative models, mainly of recent origin, derive scale scores from the pattern of responses. These latter models will be introduced here but are discussed in more detail in Chapter 10. Likewise, methods of scaling objects, as in market research studies, often assume that people are replicates of one another. For example, the percentage of persons in a group that prefer one brand of cereal to another is assumed to be the same as the percentage of times a typical (modal) individual would have this preference over occasions. These classical methods, by definition, treat individual differences among items and people as random error. In contrast, newer methods incorporate individual differences in a more systematic manner.

It is only meaningful to obtain a single measure by counting responses if the stimuli measure a single attribute. This in turn implies that responses to the various stimuli are highly correlated, e.g., if people who admit to one anxiety-related symptom also tend to admit to others, and vice versa for people who deny these symptoms. Various correlational methods are used to evaluate the extent to which people or stimuli can be viewed as replicates. If responses correlate poorly with one another, two or more scales would have to be formed from the items.
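The idea of checking whether items behave as replicates can be sketched with a small data matrix. The responses and the helper name `item_correlation` are hypothetical; the matrix follows the convention above, with rows as persons, columns as items, and `X[i][j]` holding the response of person i to item j (indices start at 0 in the code).

```python
import math

# Hypothetical 6 x 3 data matrix: 1 = symptom endorsed, 0 = symptom denied.
X = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]

def item_correlation(matrix, j, k):
    """Pearson correlation between the responses to items (columns) j and k."""
    col_j = [row[j] for row in matrix]
    col_k = [row[k] for row in matrix]
    n = len(matrix)
    mean_j = sum(col_j) / n
    mean_k = sum(col_k) / n
    sjk = sum((a - mean_j) * (b - mean_k) for a, b in zip(col_j, col_k))
    sjj = sum((a - mean_j) ** 2 for a in col_j)
    skk = sum((b - mean_k) ** 2 for b in col_k)
    return sjk / math.sqrt(sjj * skk)

r_01 = item_correlation(X, 0, 1)   # first and second items
r_02 = item_correlation(X, 0, 2)   # first and third items
print(round(r_01, 3), round(r_02, 3))
```

In these made-up data the first two items correlate positively, so summing them is defensible, whereas the third item correlates negatively with the first and would not belong on the same scale.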
These involve methods discussed throughout the book, especially in Chapters 11 through 14. This chapter will be limited to models that assume the stimuli measure a single attribute (unidimensional scaling), i.e., situations in which the data under consideration can be summarized satisfactorily with only one "yardstick."

More Complex Organizations

The two-way organization of Table 2-1 contains the minimal elements of interest to a measurement problem. If there were but a single column (stimulus), there would be no way to evaluate the structure of the stimuli, which is basic to psychometric theory. The only results possible would be descriptive statistics on the single measure (e.g., the mean and standard deviation) for the single group of subjects. These data are rarely of interest to the psychometrician because nothing can be said about the structure. Likewise, data from a single row (subject) in isolation are unlikely to be informative. At a minimum, we need to compare that person's data to normative data.

More complex arrangements of the data are extremely common. First, the two-way matrix may be repeated over occasions, as when a pretest and a posttest are administered. This gives rise to a three-dimensional arrangement in which there are rows and columns, as before, plus "slices" that represent the two or more occasions. Another possibility is that subjects are sampled from two or more groups, e.g., one studies gender differences in response to items measuring depression. A third possibility is that two or more attributes are investigated simultaneously, as when one series of items measures job satisfaction and another series of items reflects job performance. This design involves methods of multidimensional (multivariate) analysis considered later in the book.

Scaling objects often involves a three-dimensional array, as when a market researcher conducts a taste test and has people judge multiple attributes of several brands of cola, e.g., sweetness and intensity of flavor.
(As an incidental point, the application of measurement methods to quantify the perceived appearance, including taste, of consumer product preferences is known as "sensory evaluation" to market researchers.) These possibilities may be combined in still higher-order ways, e.g., by obtaining pre- and posttest measures that compare two or more groups.

We have frequently used the phrase "people or objects," but the vast majority of studies examine people's responses to different stimuli. In fact, objects (which may be abstract concepts) play the same role as people in some studies and as stimuli in others.

"Holes" in the Matrix (Missing Data)

In an ideal situation, there is an outcome at each location in the matrix; i.e., each person is administered each stimulus. Sometimes this is not possible or even meaningful. For example, the number of stimuli may be too large to allow a given person to respond to each one. Similarly, the effects of administering one stimulus may influence subsequent behavior, known as "carryover effects." Subjects are then often deliberately given a subset of the stimuli chosen according to a plan usually involving random assignment of stimuli to a given subject.
This is part of the experimental design. Perhaps the most comprehensive text dealing with these problems is Winer, Brown, and Michels (1991). Although some statistical power is lost when subjects do not respond to all stimuli, this loss of power can be offset by increasing the sample size. The problem will not be considered further since it poses no additional problems.

Far more serious problems emerge when the resulting holes in the data matrix are nonrandom. For example, the second author once was given neuropsychological test data. The data involved many scales (subtests) that were normally not all administered to each patient. Thus, patients with frontal lobe damage were given one set of subtests, patients with temporal lobe damage were given a different set of subtests, etc. Such limitations on data gathering caused the missing data to be nonrandom. Type of injury was confounded with the particular scales that were administered. The results obtained from analyzing these data might well differ substantially from a study in which all subjects responded to all measures or the pattern of administration was random. Good design dictates minimizing the impact of missing data. If all measures are equally important, randomize the order of administration or administer random subsets if all cannot be administered to each subject. Conversely, if some are relatively unimportant, be sure to administer these at the end.

EVALUATION OF MODELS

Often different models can be applied to a given set of data to develop alternative scales. These models and their associated scales sometimes lead to different substantive conclusions. Two different models might produce scales that are not linearly related. One model might suggest that the data do not even possess ordinal properties, whereas another might indicate they clearly form an interval scale. How, then, does one know which model to choose? Chapter 1 noted why this cannot be known in advance. We suggest that the most crucial test is how well the scale provides meaningful and repeatable relations with other variables. Before time and effort are spent on such investigations, however, some additional criteria can be applied.

1. The intuitive appeal of a model provides one criterion for "reasonable." Although the data must be public, a scientist's intuition plays an inevitable role in the gathering and analysis of data. Looked at in one way, a measurement model is nothing more than an explicitly defined hunch that particular operations on data will be useful. In particular, we suggest that psychologists lean toward measurement models that are most analogous to the measurement of simple physical attributes, e.g., length.

2. Another aspect of "reasonable" is that one should exploit what is already known about similar data. For example, power functions are well known to describe relations between physical and perceived intensity (see below). On the negative side, some models assume that individual test item responses are highly reliable; yet, a wealth of evidence shows that such responses usually are highly unreliable.

3. Preliminary analyses often provide cues about the usefulness of a scale. If the scale values for objects or persons are markedly affected by slight procedural differences, the scale will probably not work well in practice. There are, for example, numerous ways in which subjects can judge weight. If two similar-appearing approaches yield very different intervals of judged weight, either or both methods are suspect. Conversely, different models that yield similar results provide converging operations (Garner, Hake, & Eriksen, 1956) that mutually strengthen the confidence one may have about any given method. "Triangulation" is another common term used to describe this.

4. Another important type of evidence is the magnitude of measurement error in using a particular scale, which we will discuss in detail in Chapters 6 to 10. A scale
A seale ‘that eles great deal of measurement error cannot possibly be useful Beyond the standards f good sense, however, the ukimate cst of any mode isthe extent ta which it olds useful empirical results. Scaling Simul versus Sealing People Although paychometsic methods canbe used 1 scale people, stimuli, or bot, fferent method are often used when the focus is on sealing people than when the focus is on scaling objects. As Cronbach (1957) pointed out in a classical acl, clinical, counsel Jing, and school psychologists are more inclined to think in terms of indvilualdter- ences among people, e.g. in measuring such asibutes as intelligence and level of adjustment. These individual differences are « quisance to experimental psychologists ‘nd market researchers who largely ignore individual differences, though both may interested in group differences. Their problems typically involve sealing stimuli, ‘measuring which Words or scvertisemnens are most readily recalled. Regardless ofthe ocus ofthe reseuvh, ie tsie data ae ipucsentable as a twordimensionel any, 9 taps extenced into ater dimensions because of addtional coasiderations. ‘Unigimensional seling of people is probably the easiest siuation to deseribe. For example, a spelling test contains words as stimuli and students as subjects. The data OQ38 PART a te simply | = correct and 0 = incorrect. The simplest model for scaling subjecs (see the linear model below) collapses the stimulus dimension of wards by adding the aumber Ls For each person. Although additional analyses ae usually conducted to deter mine the interrelations among eesponses fo different words, these simple suas of cor rect responses scale students on their spelling ability. Consequently, ina may obtain a score of 48 and Ralph may obtaia a score of 45 out of 30 words. 
It is quite possible that a simple ranking of the students will suffice, so that an ordinal scale may be all that is necessary for such purposes as grading. The major requirement in scaling people is that alternative scalings be monotonically related to one another, i.e., that they rank-order people in the same way. Thus if two different methods for scaling anxiety have a strong monotonic relationship, research results will be much the same regardless of which scale is employed.

The roles of people and stimuli are often reversed to scale objects. Specifically, sums over students for each word describe differences in the difficulty of the words; e.g., if 50 students spell "abacus" correctly but only 35 spell "mnemonic" correctly, "mnemonic" is considered more difficult than "abacus." (In fact, these data are usually a standard part of a test analysis, even when interest is directed toward scaling people.) However, studies directed toward scaling stimuli are also more likely to be concerned with establishing functional relationships to various attributes, in which case ordinal scales are quite likely to be insufficient. Assume, for example, that the stimuli are tones of different intensity which subjects rate for loudness. Everyone knows that more intense tones will be rated louder; the key to the study is whether the relationship is logarithmic, linear, or of some other form. A unidimensional scale of stimuli should also fit a typical (modal) individual. Such a scale should be typical of a group even if it imperfectly represents the data from any one individual.

Because of the thornier problems in stimulus scaling, most of the issues and more complex scaling models have arisen from scaling stimuli. This difference has influenced the language used to describe psychological research. "Scaling" and "scaling methods" usually refer to the scaling of stimuli. Problems of scaling people are more likely to be called "test construction." Those who are interested in the details of stimulus scaling could well consult the classical works of Guilford (1954), Torgerson (1958), and Woodworth and Schlosberg (1954).
Despite their age, all three of these books describe the major models in unique step-by-step detail; more recent books have tended to concentrate on newer models.

Perhaps the main consideration in measurement is what kind of response is to be obtained from the subject, because this has profound effects on what subsequent analyses may be performed: one cannot analyze data that one has not obtained. There are two broad approaches, and both derive from psychophysics. In one, which originated with Gustav Fechner, subjects make only ordinal judgments as to whether a stimulus was seen or not and whether a comparison stimulus is more or less intense than a standard stimulus. The methods require very little of subjects. Indeed, animals can be trained to make the requisite responses by means of such devices as bar pressing. In the other approach, most strongly associated with S. S. Stevens (see Chapter 1), subjects are required to use properties of the real-number system to make interval or ratio judgments, as by saying how much more intense a comparison stimulus was than a standard. Such methods normally require adults or older children.

INTRODUCTION TO PSYCHOPHYSICS

We defined psychophysics as the study of the relation between variation in physical dimensions of stimuli, which we will symbolize as Φ (for physical), and their associated responses, historically called "sensations," which we will symbolize as Ψ (for psychological). The physical dimension need not be intensity, but it will be for all examples in this chapter, and the associated responses will describe apparent intensity. We have already noted the obvious ordinal relation between the physical and apparent intensities of weights, flashes of light, and tones.
A 5-pound weight obviously feels heavier than a 1-pound weight. In particular, the probability that a weak event will be detected also increases as the intensity increases. Psychophysics is concerned with making more detailed statements about the relations between Φ and Ψ which, as was also noted, are usually required by the problem under study. Three particular questions are historically important yet relevant to many contemporary problems:

1. What is the minimal energy needed for a particular event to be perceived under particular conditions, i.e., the absolute threshold or limen? For reasons to be noted below, this normally involves determining the stimulus event that is perceptible 50 percent of the time.

2. How different must two stimuli be in order to determine which is of greater intensity? This involves what is variously called the "difference threshold," "difference limen," or "just noticeable difference" (JND) between a standard and a comparison stimulus.

3. How may the relation between physical intensity and its associated sensation be described in the interval or ratio terms of Chapter 1? This is known as the problem of psychophysical scaling.

The history of these questions is covered in several excellent books on the general history of experimental psychology (Boring, 1950; Robinson, 1981) because early experimental psychology was psychophysics. Simple but useful applications may be found in any standard undergraduate textbook on perception, such as Coren and Ward (1989). For a more detailed treatment, see Engen (1972a, 1972b) or Woodworth and Schlosberg (1954). Psychophysics is important for its own sake, as exemplified by its use in such areas as communications engineering and photography. Audiologists perform psychophysical scaling on individuals in testing for hearing loss when they compare absolute thresholds they obtain with norms. An abnormally high threshold implies hearing loss.
Psychophysics is limited to the study of relationships that hold when stimuli vary along a specified physical dimension such as sound intensity. Measuring intelligence, psychopathology, etc., is not psychophysical because no physical dimension underlies these attributes. Nonetheless, concepts like the threshold are applicable to psychometrics in general.

Psychophysical Methods

Methods used to gather psychophysical data were first developed by Fechner (1860/1966) to study the relation between mind and body. Later, J. M. Cattell, Fullerton (Fullerton & Cattell, 1892), Thurstone (1928), and others expanded upon their use.

Several psychophysical methods developed by Fechner are still widely used. One is called the method of constant stimuli. Assume that a tone whose physical intensity is 185 units is essentially never reported as being heard, but a tone whose physical intensity is 215 units is nearly always reported as being heard. The experimenter might choose to use intensities of 185, 190, ..., 215 units. On each trial, one level (magnitude) is chosen at random for presentation. There is no limit upon the number of levels the experimenter may use. The levels need not be equally spaced and they need not occur equally often, but it is typical to use from 5 to 10 equally spaced and equally probable levels. The results are the probabilities of an affirmative response (e.g., saying the tone was heard) for each level.

Two related procedures are the method of adjustment and the method of limits. In the "method of adjustment," a standard is varied until it is barely sensed to determine an absolute threshold, or a comparison is made to barely differ from a standard to produce a difference threshold (JND). The method of limits takes two forms. The "ascending method," as used to determine an absolute threshold, starts with a stimulus that is not sensed. The stimulus is progressively increased until it is sensed. The "descending method" starts with a stimulus that is sensed and decreases the intensity.
The modification made to determine difference thresholds is straightforward. The comparison stimulus is presented either below (ascending method) or above (descending method) the standard and incremented or decremented.

Absolute Thresholds

The original idea of an absolute threshold goes very far back in philosophy. It implies a "cut" in Φ: the subject never senses the stimulus below the cut (threshold) and always detects it above the cut. Imagine that the method of constant stimuli is used to present a series of weights. This idea predicts a step function relating Φ to Ψ (in this case, the probability of reporting that the stimulus was sensed or detected), as illustrated in Figure 2-1a. The general name given to any relation between Φ and Ψ is a "psychometric" (mind/measuring) function. This particular function describes local psychophysics, because it is defined in terms of sensations in the location of the threshold. However, it is extremely unusual for data to provide a step function, which we will later show is of general importance to psychometric theory. The data will more likely resemble panel (b) of Figure 2-1, known as an ogive or S-curve. Figure 2-2a illustrates an ogive and its associated data points as simulated by methods defined below. Although several mathematical functions produce ogives and there are many explicit curve fitting methods (see Chapter 15), curve fitting can often be done by inspection. The point at which the curve crosses the .50 level for Ψ defines the absolute threshold. This is approximately 200 units in the present case.

In order to explain this lack of a step function, the original theory was modified to incorporate sensory noise. "Sensory noise" refers to random error in perceiving an event,
causing a fixed stimulus to have variable effects on different trials. The process may be thought of as physiological in origin, but it need not be so viewed. The most popular specific conception of sensory noise is the phi-gamma hypothesis: numerous independent factors contribute to the error, and so it varies normally.

FIGURE 2-1 (a) A step function representing the initial concept of the threshold; (b) an ogive representing a more realistic outcome. Abscissa: physical magnitude, Φ.

The 0.5 point that describes the absolute threshold is therefore the location of the psychometric function, which is one of its two basic parameters. If auditory stimuli are used, the function of a subject with more acute hearing and consequently a lower threshold will fall to the left of the function of a subject with less acute hearing. Likewise, we hear tones at middle frequencies better than lower- or higher-frequency tones, holding intensity constant, so that middle-frequency tones produce psychometric functions to the left of higher- and lower-frequency tones. Location thus defines task difficulty.

FIGURE 2-2 Psychometric functions derived from applying the method of constant stimuli (simulated data) to (a) absolute (detection) responses and (b) comparative responses. Abscissa: physical magnitude, Φ (standard = 200 units); ordinate: probability of a "yes" response.

The second parameter of importance is the slope of the function or the extent to which it resembles a step function.
The steeper the slope, the more discriminating the responses are. Quantities related to these two parameters play a critical role in psychometric theory, as we will show later in this chapter.

Now consider a question like "Are you unhappy at times?" on a depression inventory. The probability that this question will be answered in the affirmative should be quite low for people who are low in the attribute (not depressed) and increase with the level of depression until it reaches 1.0. This implies that there should be a level of depression for which the probability of endorsing the item is .5, and so it is meaningful to think of an absolute threshold associated with the item. Similar considerations hold for items for which there is a correct answer and the underlying dimension is course knowledge or general intelligence. We will explore the generality of the threshold and psychometric function concepts, especially in this chapter and in Chapter 10. The fact that there are physical dimensions of weight, sound intensity, and light intensity, but none of depression, course knowledge, or general intelligence, might appear to reflect a major difference between psychophysical and other applications. However, as was noted in Chapter 1, such ostensive characteristics are not needed to provide a scale. The scaling models considered in this book allow dimensions that are not defined physically to be inferred.

Simulating a Threshold

The data in Fig. 2-2 were actually derived from a very simple computer simulation to illustrate the absolute threshold and sensory noise. We defined the absolute threshold as 200 units. Sensory noise was produced by choosing a random number from a normal distribution with a mean of 0 and a standard deviation of 10 in accord with the phi-gamma hypothesis. The mean of any given physical magnitude (Φ) was its physical value (185 to 215 in 5-unit steps), but it varied normally about this mean on any given trial. The sensory effect of a stimulus on any given trial equaled Φ plus the random number.
We ran 100 trials per stimulus. For example, the two random numbers obtained for the first two trials using the 195-gram stimulus were +20.6 and +2.8. These produce sensory effects of 215.6 and 197.8. If the effect equaled or exceeded the threshold value of 200, the subject said yes (the stimulus was felt); otherwise the subject said no. Consequently, the subject said yes in the first case and no in the second. Note that the sensory effect of any comparison stimulus can exceed 200, but the probability of this happening increases as the physical magnitude increases. The resulting proportions of yes responses (Ψ) for the seven stimuli were 0.07, 0.17, 0.35, 0.53, 0.66, 0.87, and 0.94, as plotted. The important point to remember is how sensory noise can cause physically unchanging stimuli to vary over trials.

Difference Thresholds

Defining a difference threshold (JND) is a bit trickier when the subject compares two stimuli to determine which is of the greater magnitude. The point at which the corresponding psychometric function is .5 describes the comparison stimulus perceived as equal to a standard half the time, not the threshold. This is called the point of subjective equality. Its value need not match the physical magnitude of the standard (the point of objective equality). For example, suppose the standard and comparison stimuli in Fig. 2-2b were weights of different density, e.g., lead versus wood. A 200-gram lead standard stimulus would obviously be much smaller than a 200-gram wood comparison, and so there might be an illusory difference in weight. The two weights might have to differ in physical magnitude to appear equal.

The "interval of uncertainty" is that range of stimulus differences for which judgments can "go either way" and is usually taken from .25 to .75 on the function, as illustrated in Fig. 2-2. The concept also applies to absolute thresholds, even though that is not depicted here. The difference threshold (JND) is usually defined as half this interval of uncertainty, again by convention.
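The simulation and the threshold conventions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the random seed, and the use of linear interpolation to read the .25, .50, and .75 points off the simulated function are choices made here, not details given in the text.

```python
import random

THRESHOLD = 200.0                   # absolute threshold, as in the text
NOISE_SD = 10.0                     # standard deviation of the sensory noise
LEVELS = list(range(185, 216, 5))   # 185, 190, ..., 215

def simulate(n_trials, seed):
    """Return the proportion of "yes" responses at each stimulus level."""
    rng = random.Random(seed)
    props = []
    for level in LEVELS:
        # A trial is a "yes" when stimulus plus noise reaches the threshold.
        yes = sum(level + rng.gauss(0.0, NOISE_SD) >= THRESHOLD
                  for _ in range(n_trials))
        props.append(yes / n_trials)
    return props

def crossing(props, target):
    """Level at which the proportions first reach target (linear interpolation)."""
    for i in range(len(LEVELS) - 1):
        p0, p1 = props[i], props[i + 1]
        if p0 < target <= p1:
            step = LEVELS[i + 1] - LEVELS[i]
            return LEVELS[i] + (target - p0) * step / (p1 - p0)
    raise ValueError("target proportion is not bracketed by the data")

def summarize(props):
    """Threshold (.50 point) and half the .25-to-.75 interval of uncertainty."""
    threshold = crossing(props, 0.50)
    half_interval = (crossing(props, 0.75) - crossing(props, 0.25)) / 2.0
    return threshold, half_interval

if __name__ == "__main__":
    props = simulate(n_trials=100, seed=0)   # the seed is arbitrary
    print([round(p, 2) for p in props])
    print(summarize(props))
```

With only 100 trials per level the proportions fluctuate from run to run, much as the plotted values do; increasing `n_trials` brings the estimated .50 point closer to 200.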
The key to both types of threshold is the varied psychological effect of a fixed physical stimulus due to sensory noise. It is possible to simulate a difference threshold in a manner similar to the absolute threshold. However, sensory noise would affect both the standard and the comparison. Although this might seem to decrease subjects' ability to make judgments, this need not be the case. The covariance (or correlation) between the two noise sources is also important for reasons that will become clear when we consider the logic Thurstone (1928) used to develop his discriminant model.

The Weber Fraction, Fechner's Law, and Psychophysical Scaling

E. H. Weber noted an important property of the JND which was the main stimulus to Fechner's subsequent ideas: its magnitude is proportional to the standard against which it is derived. Subsequent research indicates that his findings are a good first approximation for a wide variety of sensory dimensions as long as the standard is not extremely weak or strong. Thus, suppose he found that a 1.05-gram weight was just noticeably different from a 1-gram standard weight so that the JND was 0.05 (1.05 - 1) grams. The Weber fraction is the JND divided by the magnitude of the standard (Φ), or 0.05/1 = 0.05 in this particular case. Weber's results were that a 10.5-gram comparison stimulus was just noticeably heavier than a 10-gram standard, a 105-gram comparison was just noticeably heavier than a 100-gram standard, etc. His results may be generally stated as ΔΦ/Φ equals a constant, where ΔΦ is the physical magnitude of the JND associated with a given Φ.

Suppose that Weber's law held exactly, a 1.0-unit standard was also the absolute threshold, and the fraction was 0.05. A 1.05-unit comparison will be 1 JND more intense than this standard. Now, let the resulting 1.05-unit stimulus become a new standard. A 1.10, i.e., 1.05(1 + 0.05), unit comparison will be just noticeably more intense.
Keep repeating the process of obtaining a stimulus that is 1 JND more intense and use it as the next standard. The resulting values are 1.00, 1.05, 1.10, ..., to two decimal places. A corollary is that one can speak of two stimuli in terms of how many JNDs separate them: 2.3, 0.5, or whatever. Mathematically, this relationship can be expressed as Eq. 2-1, which is called Fechner's law:

$$\Psi = b \log(\Phi) + a \tag{2-1}$$

where Ψ = scale value of the sensation (apparent magnitude)
Φ = physical magnitude
b, a = scaling constants

Neither scaling constant is important to our discussion; a is commonly chosen so that Ψ = 0 when Φ is at threshold, but this is usually not viewed as a rational zero in the ratio scale sense. Figure 2-3a depicts Fechner's law. Unlike Figs. 2-1 and 2-2, values of Φ need not fall near threshold. The relation applies to the entire physical dimension (Φ) and is known as global psychophysics.

Logarithmic functions have several important characteristics. The one particularly important for our purpose is that equal physical ratios yield equal sensory differences. Suppose stimuli a, b, c, and d are, respectively, 10, 20, 100, and 200 grams. Since a/b = c/d, a and b are just as many JNDs apart from each other as are c and d. Fechner's methods are called indirect methods because subjects do not define sensory magnitudes directly, and discriminant methods because they concern the subject's ability to discriminate. They are also called confusion methods because scale values require that stimuli generally be confusable with one another in magnitude.

Direct Psychophysics and the Plateau/Stevens Tradition

Coren and Ward (1989) described a test of Fechner's law made by Plateau in 1872. He had artists mix black and white pigments to make a gray appear midway between the two. Fechner's law predicts that the gray's intensity should be the average of the black's intensity and the white's intensity.
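The JND-counting argument behind Fechner's law can be checked numerically. The sketch below assumes Weber's law holds exactly with a fraction of 0.05 and a threshold of 1.0 unit, as in the example above; the function names are illustrative only. Counting JNDs above threshold is Eq. 2-1 with b = 1/log(1 + k) and a chosen so that Ψ = 0 at threshold.

```python
import math

def jnd_steps(phi, phi_threshold=1.0, weber_fraction=0.05):
    """Number of JNDs separating phi from threshold if Weber's law is exact."""
    return math.log(phi / phi_threshold) / math.log(1.0 + weber_fraction)

def jnds_between(phi_low, phi_high, weber_fraction=0.05):
    """Number of JNDs separating two stimuli on the same dimension."""
    return (jnd_steps(phi_high, weber_fraction=weber_fraction)
            - jnd_steps(phi_low, weber_fraction=weber_fraction))

# Successive standards, each 1 JND above the last, to two decimal places.
ladder = [round(1.0 * 1.05 ** n, 2) for n in range(4)]

if __name__ == "__main__":
    print(ladder)
    # Equal physical ratios (10:20 and 100:200) give equal JND counts.
    print(jnds_between(10, 20), jnds_between(100, 200))
```

Because only the ratio of the two magnitudes enters `jnds_between`, the 10-versus-20-gram and 100-versus-200-gram pairs come out the same number of JNDs apart, which is the property of logarithmic functions noted in the text.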
Plateau obtained a systematic departure in that the grays fell near the cube roots of the two other intensities. Four important things about Plateau's research and Stevens' (1951, 1956, 1975) subsequent extensions are that (1) unlike Fechner's approach, subjects respond directly through subjective estimates; (2) equal physical ratios provide equal sensory ratios and not differences with these subjective estimates; (3) equal numbers of JNDs between different pairs of stimuli are not equal appearing; and (4) the emphasis is upon global and not local psychophysics.

FIGURE 2-3 (a) Fechner's logarithmic law for indirect psychophysics; (b) Stevens' power law for direct psychophysics with an exponent a = 1; (c) Stevens' law with an exponent a < 1; (d) Stevens' law with an exponent a > 1. Abscissa: physical magnitude, Φ.

Point 2 may be stated as Eq. 2-2, called Stevens' law, since he examined it so thoroughly, or the power law, from its mathematical form:

$$\Psi = b\Phi^{a} \tag{2-2}$$

where Ψ = scale value of the sensation (apparent magnitude)
Φ = physical magnitude
b = scaling constant
a = exponent

The a parameter is more complex. It describes the sensory ratio associated with the physical ratio of two stimuli that differ along the physical dimension in question, Φ. Let the two stimuli be x and y, their associated sensory ratio be Ψ_x/Ψ_y, and their physical ratio be Φ_x/Φ_y. If the two are the same (Ψ_x/Ψ_y = Φ_x/Φ_y), then a = 1. For example, doubling the duration of a noise also makes it appear to last twice as long. However, the sensory ratio is smaller than the associated physical ratio for most dimensions (Ψ_x/Ψ_y < Φ_x/Φ_y), and so a < 1. The brightness (apparent intensity) of many light sources increases only as the cube root of the change in physical intensity. This means the physical intensities of two lights must be in an 8:1 ratio for the more intense light to appear twice as bright.
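The arithmetic linking physical and sensory ratios under the power law is short enough to sketch directly. The function names below are illustrative; the exponents used (1 for duration, 1/3 for brightness) are the ones given in the text.

```python
def sensory_ratio(physical_ratio, exponent):
    """Sensory ratio implied by a physical ratio under the power law."""
    return physical_ratio ** exponent

def physical_ratio_needed(target_sensory_ratio, exponent):
    """Physical ratio required to produce a given sensory ratio."""
    return target_sensory_ratio ** (1.0 / exponent)

if __name__ == "__main__":
    # Duration (a = 1): doubling the physical value doubles the apparent value.
    print(sensory_ratio(2.0, 1.0))
    # Brightness (a = 1/3): the physical ratio needed to look twice as bright.
    print(physical_ratio_needed(2.0, 1.0 / 3.0))
```

The scaling constant b cancels when ratios are taken, which is why it does not appear in either function.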
Finally, a few sensory ratios are larger than their associated physical ratios (Ψ_x/Ψ_y > Φ_x/Φ_y), and so a > 1. If one electric shock is physically twice as powerful as another, it will actually appear more than 10 times as intense.

Stevens and his associates devoted many years to a thorough study of different sensory modalities. In particular, Stevens (1961) "cataloged" the exponents of various dimensions. Figures 2-3b through 2-3d depict these three outcomes (a = 1, a < 1, and a > 1). Note that even though the function for a < 1 resembles Fechner's law in being concave downward, the two are quite different. Data fitting Fechner's law become linear when the abscissa, but not the ordinate, is logarithmic (semilog graph paper), and data fitting a power law become linear when both axes are logarithmic (log-log graph paper), regardless of the magnitude of the exponent. The slope of the line in the latter case defines the magnitude of the exponent.

Although Fechner's and Stevens' laws were once regarded as competitors (investigators commonly asked which one was "right"), it is now generally recognized that the Ψ of Fechner's law for discrimination need not be the same as the Ψ of Stevens' law for subjective estimates, and so there need be no incompatibility. Indeed, the two would be completely compatible if Stevens' Ψ were the logarithm of Fechner's (Luce, 1983).

Stevens also developed several methods for inferring the exponents and showing that any given estimate was not an artifact of a single method; i.e., he used converging operations as defined above. The most commonly used of these methods are the following:

1 Ratio production. A subject is shown a standard stimulus and is then asked to adjust a comparison so that it appears in a specified ratio to the standard. The simplest and most common ratio is 2:1, so that the subject is asked to make the second stimulus appear twice as intense. If, for example, the comparison has to be physically four times as intense, the exponent (a) will be .5.
However, the subject might also be asked to make the second stimulus three times as intense.
2 Ratio estimation. The subject is shown standard and comparison stimuli and asked to define the ratio of their apparent intensities. Thus, a subject might report that a comparison tone is 1.5 times louder than a standard tone.
3 Magnitude estimation. The subject is shown a single stimulus and simply asked to define its magnitude numerically. Usually subjects are also shown a different stimulus, called the modulus, which is given an assigned value to fix the units of the scale, making it somewhat similar to ratio estimation.
4 Bisection. As in Plateau's experiment, subjects are shown two stimuli and asked to adjust a third so that it appears midway between the first two. Unlike other subjective estimates, bisection requires interval rather than ratio judgments.
5 Cross-modal matching. The subject is presented a stimulus in one modality and asked to adjust a stimulus in another modality to apparent equality. For example, the task might be to make a tone appear as loud as a light is bright. As bizarre as the task may seem, the exponent relating the two modalities is predictable from the exponents inferred from the other tasks. For example, the sweetness of a sucrose solution and the apparent thickness of wood blocks both have exponents of about 1.3. Suppose a given sucrose solution is matched with a given thickness. Then the concentration of the sucrose is doubled. According to Stevens' power law, the matching wood block should seem twice as thick, which it does.

In all methods, the procedure is repeated with different stimuli in order to determine the exponent. Although it is not associated as strongly with the Stevens tradition, the method of equal-appearing intervals (category scaling) also tends to fit Stevens' power law (Marks, 1974; Ward, 1974). Subjects simply sort stimuli into categories so that the intervals between category boundaries appear equal.
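The way an exponent is recovered from these methods can be illustrated with a small sketch. It assumes the power law Ψ = bΦ^a holds exactly; the function names are hypothetical, and the cross-modal prediction (the ratio of the two exponents) follows from setting the two power functions equal rather than from any formula quoted in the text.

```python
import math

def exponent_from_ratio_production(sensory_ratio, physical_ratio):
    """Exponent a such that physical_ratio ** a equals sensory_ratio."""
    return math.log(sensory_ratio) / math.log(physical_ratio)

def loglog_slope(phis, psis):
    """Least-squares slope of log(psi) on log(phi): the power-law exponent."""
    xs = [math.log(p) for p in phis]
    ys = [math.log(s) for s in psis]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def cross_modal_exponent(exponent_fixed, exponent_adjusted):
    """Predicted exponent relating the adjusted stimulus to the fixed one."""
    return exponent_fixed / exponent_adjusted

if __name__ == "__main__":
    # Ratio production: four times the intensity needed to look twice as intense.
    print(exponent_from_ratio_production(2.0, 4.0))
    # Noise-free magnitude estimates generated with an exponent of 0.33.
    phis = [1, 2, 4, 8, 16]
    print(loglog_slope(phis, [p ** 0.33 for p in phis]))
    # Sucrose sweetness and wood-block thickness, both with exponents near 1.3.
    print(cross_modal_exponent(1.3, 1.3))
```

In the sucrose and wood-block case the predicted cross-modal exponent is 1, so a doubling on one physical dimension is matched by a doubling on the other.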
In particular, the sensory difference between the upper and lower boundaries of each category should be the same.

The Fullerton-Cattell Law

The Fullerton-Cattell (Fullerton & Cattell, 1892) law is a basic link between Fechnerian indirect psychophysics and psychometrics in general. It states, simply and euphoniously, that equally often noticed differences are equal unless always or never noticed. This is certainly true in the psychophysical case since the unit (the JND) is defined by equally often noticed differences. The significance of the Fullerton-Cattell law is that it does not depend upon how the stimuli differ or on the basis of the judgment. In particular, the ">" relationship that meant brighter, heavier, or louder above can also mean "is more preferred," among other things. If you prefer bananas to apples 75 percent of the time and apples to pears 75 percent of the time, the distance between apples and bananas and the distance between apples and pears may be assumed equal; i.e., apples are at the midpoint of a scale defined by these three stimuli. The "always or never" part is simply a caveat that one cannot draw inferences when there is no confusion over trials: If you always prefer bananas to apples and always prefer apples to pears, their relative distances cannot be inferred from these data alone. However, if you sometimes prefer plums over each and sometimes not, a scale can be constructed.

Signal Detection Theory and Modern Psychophysics

In early studies of the absolute threshold, a stimulus was always presented. Subjects, who were often also the investigators, typically knew this but were trained at analytic introspection to report their sensations and to ignore this knowledge. Sometimes, however, the equipment would malfunction and fail to produce a stimulus, but subjects might say "Yes, I saw (heard, felt, etc.)
it," thus committing the stimulus error by responding on the basis of their conceptions of the stimulus rather than the sensation itself. Gradually, "catch" trials were regularly used to "keep subjects on their toes," but no systematic use was made of the data obtained on these trials since the purpose of the experiments was to measure sensations.

Measuring sensations was the exclusive goal of nineteenth-century psychophysical research and is often a valid goal today, but it is not the only goal. Reflecting a variety of factors such as the behaviorists' rejection of mental states like sensations, much of psychophysics eventually became concerned with subjects' ability to discriminate the presence of stimulation from its absence. A particular tradition emerged known as the theory of signal detection (TSD) (Egan, 1975; Green & Swets, 1967; Macmillan & Creelman, 1991; Swets, 1986a, 1986b; Swets, Tanner, & Birdsall, 1961; Tanner & Swets, 1954). It bears a close kinship to Thurstone scaling, and we will consider it in more detail in Chapter 15. For the present, it is most important in helping to illustrate the difference between the classical psychophysics of judging sensations and the more modern emphasis upon accuracy of discrimination.

TSD has proven particularly important because of its emphasis upon assessing response bias, or differential willingness to use the response alternatives, independently of sensitivity or accuracy at discrimination. Threshold measures using psychophysical procedures derived from Fechner are particularly influenced by a subject's willingness to report having sensed the stimulus. A practical example of a response bias involves the accuracy of two clinicians who see the same set of patients. Clinician A correctly diagnoses 90 percent of the patients determined to have a given disorder on the basis of some appropriate method, but clinician B diagnoses only 80 percent of the patients correctly. Does this mean that clinician A is the better diagnostician?
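One way to make the clinician comparison concrete is to separate sensitivity from response bias numerically. The sketch below assumes the equal-variance Gaussian model commonly used in signal detection work, in which sensitivity is d' = z(hit rate) - z(false alarm rate) and the criterion is c = -[z(hit rate) + z(false alarm rate)]/2. The hit rates are those of the example; the false alarm rates are hypothetical values chosen only for illustration, since the example reports hit rates alone.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf   # inverse of the standard normal CDF

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity under the equal-variance Gaussian model (an assumption)."""
    return _z(hit_rate) - _z(false_alarm_rate)

def criterion(hit_rate, false_alarm_rate):
    """Response bias c; negative values indicate a bias toward saying yes."""
    return -0.5 * (_z(hit_rate) + _z(false_alarm_rate))

if __name__ == "__main__":
    # Hypothetical false alarm rates: 0.90 for clinician A, 0.30 for clinician B.
    for name, hits, false_alarms in [("A", 0.90, 0.90), ("B", 0.80, 0.30)]:
        print(name, round(d_prime(hits, false_alarms), 2),
              round(criterion(hits, false_alarms), 2))
```

A clinician whose hit and false alarm rates are equal has a d' of zero however high the hit rate is, so under these assumed false alarm rates clinician B would be the more sensitive of the two.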
The data are insufficient since only their hit (true positive) rates in identifying those who have the disease are known. We also need to know the false alarm (false positive) rates of diagnosing normals as having the disease. Perhaps clinician A has a false alarm rate of 90 percent, in which case he or she is just blindly guessing the presence of the disease in 90 percent of the population. If this is true and if clinician B's false alarm rate is less than 80 percent, clinician B could be the better.

TYPES OF STIMULI AND RESPONSES

Endless distinctions could be made about stimuli and responses that are important to psychometrics, but we will consider only the most important. Most are derived from psychophysics.

Judgments versus Sentiments

Although no two words perfectly symbolize the distinction, the distinction between what we call "judgments," where there is a correct response, and "sentiments," which involve preferences, is very basic. There are correct (veridical) versus incorrect answers to "How much is two plus two?" and "Which of the two weights is heavier?" There may also be degrees of correctness, as in line-length judgments of visual illusions. In contrast, sentiments cover personal reactions, preferences, interests, attitudes, values, and likes and dislikes.
Some examples of sentiments include (1) rating how much you like boiled cabbage on a seven-category Likert scale, (2) answering the question, "Which would you rather do, organize a club or work on a stamp collection?" and (3) rank-ordering 10 celebrities in terms of preference. Veridicality does not apply to sentiments: a subject is neither correct nor incorrect for preferring chocolate ice cream to vanilla ice cream. This distinction is very close to the difference between making discriminations in TSD and reporting sensations in classical psychophysics. Judgments also tend to be cognitive, involving "knowing," whereas sentiments tend to be affective, involving "feeling."

Ability tests nearly always employ judgments regardless of whether an essay, short-answer, multiple-choice, or true-false format is used. Conversely, tests of interests inherently concern sentiments as the subject identifies liked and disliked activities. Attitudes and personality measures can use either form. Items like "Do you like going to parties?" involve sentiments. But items like "How often do you go to parties?" are essentially judgments. The distinction may be obscured because the perceived frequency may reflect preference as well as actual frequency.

Social desirability may bias sentiments in the signal detection sense so that the popularity of socially endorsed behaviors may be overestimated. This is less likely to be a problem with judgments. However, the internal consistency, or extent to which items measure the same thing, is highly important. Temporal stability, or the extent to which the measure tends to remain the same over time, may or may not be important. Chapters 6 through 9 consider these statistics. The logic of using judgments is generally clearer than the logic of using sentiments because of the advantages inherent in having a correct response.

Other terms are frequently employed to describe these two categories. Goldiamond's (1958) distinction between what he called "objective" and "subjective" indicators of perception corresponds in essence to the judgment-sentiment distinction. The word "choice" is frequently used in place of the word "sentiment."
Absolute versus Comparative Responses

In general, an absolute response concerns a particular stimulus, whereas a comparative response relates two or more stimuli. The distinction applies to both judgments and sentiments. "How many concerts have you been to in the past year?" versus "Have you been to more concerts than movies in the past year?" illustrates this distinction for judgments. Likewise, asking whether something is liked versus asking whether it is liked more than something else illustrates the distinction for sentiments.

One of psychology's truisms is that people are almost invariably better (more consistent and/or accurate) at making comparative responses than absolute responses. This is because there is a frame-of-reference problem present to at least some extent in absolute responses that is avoided in comparative responses. Asking a consumer "Is this cola sweet?" raises the question of how sweet is sweet, which is avoided when he or she is asked to judge which of several colas is the sweetest, since the criterion of sweetness can be applied equally to all colas. One possible application of this principle is in ability testing. If there is no "none of the above" or "all of the above" alternative, multiple-choice tests are comparative judgments of the relative truth of the alternatives. We suggest (as do some others) that these alternatives be avoided because they compromise the comparative nature of the test by asking whether none or all of the other alternatives are true in an absolute sense. Similarly, true-false tests are absolute judgments of the truth or falsity of a single item, and we suggest the use of multiple-choice questions for this and other reasons to be considered.

People rarely make absolute judgments in daily life since most choices are inherently comparative. There are thus few instances in which it makes sense to employ absolute judgments. One important exception is when absolute level is important, as in attitudes toward various ethnic groups. A subject could, for example, rank various groups from most to least preferred. However, the subject may dislike all the national groups or like them all, which would not be apparent from the comparative rankings. Absolute responses are especially important when some indicator of neutrality is needed. For example, people who are more neutral with respect to candidates in an election are probably more susceptible to influence and change than those who have a clear preference. By requiring absolute responses from subjects, one is able to approximate a neutral point.
Another case in which it makes sense to phrase items in absolute terms is when the inherent ambiguity of absolute judgments is of interest. For example, the MMPI-2 contains items like "I often have headaches" and "I frequently have trouble falling asleep." A psychologist is probably not actually interested in the actual frequency of headaches or sleepless nights (if he or she were, more objective tests could be developed through clinical observation). The issue is how the patient interprets words like "often" and "frequently." Absolute judgments are perfectly appropriate in that case.

Absolute responses are also useful because they are much easier and faster to obtain than comparative responses. For example, the method of pair comparison is an extremely powerful way to gather data. A market research example could involve preferences among K brands of cola. The subject is given two brands in succession and asked to state a preference. This is repeated for all possible pairs of brands. Unfortunately, this requires anywhere from K(K - 1)/2 pairs (if a given pair of brands is presented in only one of the two possible positions in the pair) to K² pairs (if all brands appear in both positions and a given brand is paired with itself). The number of comparisons increases rapidly with K. For example, if there are 20 brands, from 190 (20 × 19/2) to 400 (20²) trials are required per subject. However, it is much quicker to have subjects rate each brand individually. Any of several scaling models can be used to obtain interval estimates of preference from each cola's average rating over subjects. Conversely, paired comparison methods generally give much more reliable results when applicable.

To the extent that a person answering an item phrased absolutely has a criterion to define terms like "often," "seldom," or "hardly ever," the judgment becomes partly comparative. Individuals generally have feelings about their absolute liking for an object or activity, but such sentiments are influenced by the range of objects or activities available. An individual who rates how much he or she likes boiled cabbage probably thinks "What else is there to eat?" Differences among subjects and over time contribute to unreliability. However, temporal instabilities can be of interest in themselves (Spielberger, Gorsuch, & Lushene, 1970).

If an absolute format is appropriate, anchoring by specifying the meaning of the response scale is generally important to reducing unwanted error due to differences in implicit bases of comparison. For example, instead of simply asking subjects to rate how often they go to the movies on a five-point scale, indicate that 1 means once a month or less, 2 means at least once a month, etc. (the actual anchors should be developed by pretesting). Similarly, if a pretest reveals that subjects nearly always respond in the negative, change the anchor to favor a higher incidence of positive responses, such as "I would eat cabbage if it were served to me." Not all situations demand anchors, as in the MMPI example where the ambiguity was intentional.

Preference versus Similarity Responses

Different methods are required to study responses denoting which stimuli are preferred versus which are most similar. Preference responses, which necessarily involve sentiments, denote which stimuli are most liked. Similarity responses denote which stimuli are most like one another. Similarity responses are normally symmetric: saying A is similar to B implies that B is similar to A (an interesting exception will be considered later). Thurstone scaling, described below, requires preference data. However, the most common methods of analysis (factor analysis and multidimensional scaling) require similarity data because they are based upon the ordinary Pearson correlation coefficient (Chapter 6), a measure of similarity rather than preference.
Specified versus Unspecified Attributes

By definition, psychophysical responses are obtained with respect to an attribute defined by the experimenter. This may also be the case when the attribute is not a single physical dimension. For example, a marketing study may ask which of several packages differing in height, width, and depth looks largest even though all contain the same volume. Conversely, subjects may be asked to evaluate similarities or preferences among stimuli without being told in what respect. If the stimuli clearly differ in a single, dominant respect, instructions may be unnecessary. However, if the stimuli are multidimensional, the goals of the experiment dictate whether or not some particular attribute should be specified. The study may concern how well subjects ignore a given attribute, so that it is important to tell them which attribute is critical. On the other hand, subjects should not be told, implicitly or explicitly, if the goal is to find out which attributes subjects actually use.

METHODS FOR CONVERTING RESPONSES TO STIMULUS SCALES

Fechnerian methods, which provide ordinal data, and Stevens' methods, which provide interval or ratio data, are applicable outside the confines of psychophysics. Keep in mind that the level at which data are gathered may well differ from the level of the resulting scale, particularly for Fechnerian methods. Scaling models often take data obtained at one level and transform it to a higher level, most specifically to produce an interval scale from ordinal data.

Ordinal Methods

In general, the simplest way to obtain ordinal data is the method of rank order. In the A-B-X method, subjects are presented with stimuli A and B followed by a third stimulus (X) which is either A or B. The subject is asked to say whether X is A or B. The process is repeated,
comparing all pairs of stimuli. The probability of confusing any two stimuli is an ordinal measure of their similarity. This method is particularly useful in scaling stimuli that are difficult to describe. For example, suppose Alpha Cola and Beta Cola are in fact similar in taste, but both differ somewhat from Gamma Cola. Subjects' A-B-X judgments may be only 60 percent correct when Alpha and Beta are paired (50 percent is chance), but 80 percent correct when Alpha and Gamma are paired and 85 percent correct when Beta and Gamma are paired.

In contrast, the method of triads uses three different stimuli which may all be highly discriminable from one another and asks which two are most similar. For example, the subject might taste string beans, lima beans, and green peas. It is probable that lima beans and green peas would be found to be the most similar pairing. The data obtained from all possible triads in a larger set (the number of combinations of k things taken three at a time) provide similarity rankings.

In the method of successive categories, the subject sorts the stimuli into distinct piles or categories that are ordered with respect to a specified attribute. For example, subjects could sort the U.S. presidents into five piles ranging from "very effective" to "very ineffective." This information can be obtained most easily by having the subject mark a printed rating scale. This method has many variants depending on the information sought by the experimenter. If the experimenter is seeking only ordinal information, the subject may be allowed free choice as to the number of stimuli per category and number of categories. In contrast, the categories may be constrained to appear equally spaced in the method of successive categories. Sometimes, subjects are required to place an equal number of stimuli in each category. Perhaps the most important variant is the Q sort, where subjects sort the stimuli so that the distribution of stimuli in successive piles forms a normal distribution.
These methods necessarily provide numerous tied ranks. Thus, if stimuli are placed in a series of categories, those in the top category can be thought of as tied for top rank. Averaging over subjects eliminates most of these ties.

Interval Methods

The primary methods used to obtain interval data from subjects are variations upon the method of successive categories and Stevens' methods of bisection. This involves instructing the subject to use the scale as though the distances between successive categories were the same; e.g., the difference between a rating of 2 and 4 is equal to the difference between a rating of 6 and 8. Frequently anchors are also employed. For example, pleasantness could be anchored with adjectives ranging from "extremely pleasant" to "extremely unpleasant." Rating anchors also may be expressed as percentages to further ensure the interval nature of the responses, so that subjects can be asked what percent of the general population they feel agrees with each of a series of statements.

The method of bisection may be applied outside psychophysics as follows. Subjects may be given two statements differing in how favorable they are toward the President and asked to select another statement from a list that falls closest to halfway between them. Rather than bisecting the distance between the two stimuli, other ratios may be used, as in psychophysics. For example, subjects may be asked to select a stimulus X [...] the distance between the two standards. Another approach is to present two stimuli that are at the extremes of the attribute and have subjects judge the ratio of intervals formed when a third stimulus is inserted.
In all these methods, the subject evaluates intervals of judgment or sentiment. Even though he or she may describe 1:1 ratios in the method of bisection, these ratios are not formed with respect to the absolute magnitudes of the stimuli as in ratio scaling. The experimenter might eventually use a scaling model to obtain these absolute magnitudes, but it is important to maintain the distinction between what the subject is required to do and the experimenter's use of the data in a scaling model.

Ratio Methods

Ratio methods require subjects to evaluate the absolute magnitudes of stimuli. For example, subjects may be given the name of a food liked moderately well by most people and asked to name a food liked twice as much, half as much, etc. Note that in ratio production, the subject generates the actual stimulus, unlike in other ratio methods. This may be somewhat difficult outside psychophysical applications.

If a zero point can be taken seriously, previously described percentage scales can be employed for ratio estimation. For example, subjects might rate the complexity of 100 geometric forms. The stimulus rated as most complex in pilot research is used as a standard, and the other stimuli are rated in relation to this standard on a percentage scale. If the least complex form is rated at 20 percent, its scale value will be .20, where the standard is 1.0. These ratio scales closely resemble scales obtained from more direct ratio estimation methods (Stevens, 1951, 1958, 1960).

Interval and ratio estimation methods may appear superficially similar. For example, choosing a stimulus that is halfway between two others (bisection) seems similar to choosing a stimulus that is twice as great as another (ratio production). In both cases, the subject forms two equal-appearing intervals.
The important difference between these two methods is that the lower interval is bounded by a phenomenal zero in ratio production. The subject is essentially required to form an interval between two stimuli that is equal to the interval between the less intense stimulus and zero. Moreover, if subjects are sophisticated enough to provide interval judgments, they can also usually provide ratio judgments, making interval methods somewhat unnecessary.

MODELS FOR SCALING STIMULI

The next step in scaling is to generate an ordinal, interval, or ratio scale as desired. The models considered in this chapter are considered classical primarily because they have been available for a long time. They may also be considered classical because they provide relatively simple closed-form solutions and therefore do not require a computer (in practice, computers would probably be used). In contrast, modern psychometrics, considered in Chapter 10, usually requires open-form estimation.

Ordinal scales do not require complex models, and the various methods of gathering data and scaling usually produce the same rank ordering. In general, simply average individual subjects' ranks and rank-order the averages of these ranks. This final set of ranks is the desired ordinal scaling of a modal subject.

In paired comparison methods, the first step is to determine the percentage of subjects that rate each stimulus as being higher on the particular response dimension than each of the other stimuli. Thus, each of 10 stimuli produces 9 percentages comparing that stimulus to the rest. The full data from the group of subjects are summarized by a square matrix containing all possible percentages of paired comparison preferences. These percentages are summed for each stimulus (column of the matrix), and these sums are then ranked from highest to lowest.

Formal scaling models are more important in constructing interval (the more common situation) or ratio scales.
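The paired comparison procedure for an ordinal scale (sum each stimulus's column of the preference matrix, then rank the sums from highest to lowest) can be sketched as follows. This is only an illustration: the three stimuli, their labels, and the proportions are hypothetical, and the function name `ordinal_scale` is not from the text.

```python
def ordinal_scale(pref, labels):
    """Rank stimuli by the column sums of a paired comparison matrix.

    pref[i][j] is the proportion of subjects rating the column stimulus j
    higher than the row stimulus i. Diagonal cells are ignored.
    """
    n = len(labels)
    col_sums = [sum(pref[i][j] for i in range(n) if i != j) for j in range(n)]
    # Highest column sum first: that stimulus was preferred most often.
    order = sorted(range(n), key=lambda j: col_sums[j], reverse=True)
    return [(labels[j], round(col_sums[j], 3)) for j in order]


# Hypothetical data for three stimuli (not taken from the chapter's tables).
labels = ["A", "B", "C"]
pref = [
    [0.5, 0.7, 0.9],  # row A: B beats A 70% of the time, C beats A 90%
    [0.3, 0.5, 0.8],  # row B
    [0.1, 0.2, 0.5],  # row C
]
ranking = ordinal_scale(pref, labels)
print(ranking)
```

Only the order of the sums is used here; treating the sums themselves as interval values would require one of the formal models discussed next.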
The remainder of this section will consider models used for these purposes. They fall into two broad classes of models paralleling the distinction between Fechnerian indirect (discriminant) methods and Stevens' direct (subjective estimate) methods. Stevens' approach will be discussed first because it is simpler.

Direct (Subjective Estimate) Models

Direct models are usually close to the data because the experimenter takes the subject's interval responses (e.g., bisections) or ratio responses (e.g., magnitude estimations, ratio estimations, ratio productions) seriously. Often, the experimenter needs only to average responses over repeated measurements of one individual to obtain an individual scale or, more commonly, over subjects in a group to obtain a group scale. The Stevens tradition, like the Fechner tradition, recognizes variability from sensory noise, but simply as error rather than as an intrinsic part of scaling.

One example is to use the aforementioned method of equal-appearing intervals. Subjects might sort 100 occupations into 10 successive categories ranging from least to most prestigious. The subjects are instructed to treat the 10 numbered categories as an interval scale. Error is minimized by averaging judgments over subjects or occasions. Thus "psychology professor" may be rated 9, 9, 8, and 8 by four subjects. This yields an average rating, and therefore a scale rating, of 8.5 on the interval scale. Measurements are obtained in a like manner for the 99 remaining occupations. This scale may then be used in any situation requiring an equal-appearing interval scale, e.g., to study the relation between job prestige and job satisfaction. A ratio scale can be formed in a like manner using ratio production. For example, one occupation (e.g., [...]) may be anchored at 50 and subjects asked to rate others as ratios relative to this norm. See Stevens (1958, 1960) and the Suggested Additional Readings for further details. It is important to test the assumption that the subjects are behaving consistently.
One important statistic is the internal consistency reliability (homogeneity) of the data. Chapters 6-8 will illustrate the process.

Indirect (Discriminant) Models

Although the logic traces back to Fechner, Thurstone's law of comparative judgment (Thurstone, 1928) is the foundation of modern discriminant models. This law takes on numerous forms depending upon more specific assumptions. We will consider only the basic ideas and stress the single most popular model. A more complete discussion may be found in Bock and Jones (1968), Guilford (1954), and Torgerson (1958). The law of comparative judgment led to signal detection theory and general recognition theory (Ashby & Townsend, 1986; Ashby & Perrin, 1988; see Chapter 15).

Although the same computational procedures can be applied to testing one individual repeatedly by pooling individual data, we will illustrate the logic here with the classic example of how one individual's subjective rank orderings can be "brought into the open" as an interval scale. Any stimulus is assumed to yield a discriminal process with respect to a specified attribute. The "discriminal process" is simply a broadly defined reaction which correlates with the intensity of the stimulus on an interval scale for an attribute. Because of what is equivalent to sensory noise, each stimulus has a discriminal distribution (discriminal dispersion) which reflects the variation in response to that stimulus. The model assumes the phi-gamma hypothesis by assuming reactions to a given stimulus are normally distributed, as shown in Fig. 2-4.

These distributions and the attribute continuum on which they fall, most simply called a "strength axis," are entirely hypothetical. Unlike psychophysics, the experimenter cannot locate the stimuli directly on the attribute; any model would be unnecessary if this could happen.
Only after the experimenter makes a series of assumptions about what is going on in the subject's head and about the statistical relationship of such covert reactions to the hypothetical dimension can a suitable model be formulated.

The mean discriminal process (reaction) to each stimulus is the best estimate of the scale value of that stimulus in several senses, such as most likely and least squares (see Chapter 4). If all stimulus means were known, an interval scale would complete the scaling problem, which is unfortunately not directly possible. They must be inferred from the subject's responses. Each of several variants upon the basic model makes somewhat different assumptions about the nature of these discriminal processes.

FIGURE 2-4 Discriminal distributions of three stimuli which fall at progressively higher points along the strength axis and are also progressively more variable.

The standard deviations depicted in Fig. 2-4 are unequal, and so some stimuli are more variable than others. Because this is a discriminant model, the discriminal processes of at least some stimuli must overlap measurably. If the discriminal distribution of any stimulus does not overlap with any of the others, its interval location cannot be determined. The major assumptions and deductions of the general model are as follows:

1 Denote the covert discriminal responses to stimulus j as $r_j$ and the covert discriminal responses to stimulus k as $r_k$.

2 The means of these discriminal responses, $\bar{r}_j$ and $\bar{r}_k$, are the best estimates of their respective scale positions. That is, if each stimulus' discriminal processes could be determined directly, its mean (arithmetic average) would be the best estimate of a typical reaction and therefore its location on the interval scale of judgment or sentiment.

3 The overlap in discriminal distributions causes the difference in response to the two stimuli, $r_d = r_j - r_k$, to be positive on some trials and negative on others, producing the varied response to fixed stimuli that is necessary in discriminant models.
In the present case, there is variation in the perception of difference. Understanding distributions of difference scores is absolutely crucial to understanding discriminant models used in comparisons. By analogy, two weight lifters each vary in their skill because of a variety of random factors. The varied amounts of weight they lift at a competition produce distributions analogous to those in Fig. 2-4. Heavier weights quite literally mean greater strength. One lifter may be better than the other on average. However, if their abilities are sufficiently similar, their distributions will overlap; the weaker athlete may sometimes lift a heavier weight than the better athlete. One could subtract the weight of the poorer lifter from the weight of the better lifter in any competition to obtain a difference score. Most of these differences will reflect the fact that the poorer lifter cannot lift as heavy a weight as the better lifter. It is perfectly proper to place these difference scores into a frequency distribution which summarizes the overlap of the two separate distributions. In this case, the weights can actually be scaled directly, but this is the exception.

4 Because the individual discriminal processes $r_j$ and $r_k$ are assumed to be normally distributed, the distribution of their difference, $r_d = r_j - r_k$, will also be normally distributed. This distribution of differences is illustrated in Fig. 2-5. The shaded area is proportional to the percentage of times stimulus j is judged greater than stimulus k, and vice versa for the unshaded area. Note that the mean ($\bar{r}_d$) is positive. This is because the mean discriminal response to stimulus j ($\bar{r}_j$) is greater than the mean discriminal response to stimulus k ($\bar{r}_k$); consequently, the majority of the differences (the shaded portion) are positive rather than negative.

5 The mean of the differences between responses to the two stimuli on numerous occasions, $\bar{r}_d = \bar{r}_j - \bar{r}_k$, is the best estimate of the interval separating the two.
Although this mean cannot be estimated directly because it is entirely hypothetical, Thurstone's law of comparative judgment allows it to be estimated from paired comparisons as follows:

6 Ask a subject to state whether stimulus j is greater or less than stimulus k with respect to an attribute. Denote the proportion of times j is judged greater than k as $p_{j>k}$.

7 Next, assume that discriminal differences are normally distributed with a mean of $\bar{r}_d$ and a standard deviation of 1.0. The zero point will fall to the left or to the right of the mean depending on which stimulus is more frequently judged greater with respect to the attribute. Convert $p_{j>k}$ into a corresponding number of standard deviation units from a table of the normal distribution. If, for example, j is judged greater than k 92 percent of the time ($p_{j>k} = .92$), the corresponding normal deviate ($z_{jk}$) is approximately 1.4. This implies that the zero point is 1.4 standard deviations below the mean. More importantly, $\bar{r}_d = \bar{r}_j - \bar{r}_k$ is 1.4 standard deviations above 0, which moves us close to a solution.

8 With $\bar{r}_d = \bar{r}_j - \bar{r}_k$ expressed in standard deviation units, all that needs to be done is to express $\bar{r}_d$ in terms of the actual standard deviation of the dispersion of discriminal differences. This is necessary because the standard deviations of discriminal differences might differ for different pairs of stimuli. In the above analogy to weight lifters, this would happen if some lifters are more consistent than others. If that occurs, two pairs of stimuli separated by the same mean distance could be separated by different scale distances. Thus, even if $z_{jk}$ and [...] are the same, the standard deviations of the [...]

[...] 2.0 are eliminated from this averaging process because the assumption that these stimuli overlap is not tenable (the "always or never" part of the Fullerton-Cattell law). The results are normal deviates expressed as deviations from the average stimulus in the set. Finally, the value of the lowest (most negative) stimulus is subtracted from each of the values to eliminate negative values in the final scale.
This produces the final interval scale, e.g., of food preferences for the data in Table 2-3. Corn is the most liked vegetable, and turnips are the least liked. The latter is arbitrarily designated as zero on the scale. This zero is arbitrary, by definition, since this is an interval scale.

Simulating Thurstone Scaling

The scale values presented at the bottom of Table 2-3 can be obtained directly by applying Eqs. 2-3 through 2-6. One can work backward from them to estimate the proportions found in Table 2-2 in reverse order. Consequently, a Monte Carlo approach is unnecessary. However, it is instructive to perform one and compare it to our previous simulation. The first step is to multiply the scale values in the bottom line of Table 2-3 by [...] to conform to Eq. [...]. This provides values of .000, [...], which are the mean discriminal responses.

To compare turnips with cabbage, two numbers were chosen from a normal distribution having a mean of zero and a standard deviation of 1.0. The first number was added to the value associated with turnips (.000), and the second number was added to the value associated with cabbage ([...]). These independent, normally distributed numbers provided discriminal dispersion. When added to the scale values, they yielded the covert discriminal responses $r_j$ and $r_k$ of assumption 1. They were assumed normally distributed because of assumption 4. The subject preferred turnips over cabbage if $r_j - r_k$ was > 0 but preferred cabbage over turnips if $r_j - r_k$ was < 0. This was repeated 1000 times for each pair. The resulting probabilities appear in Table 2-4.

TABLE 2-4 ESTIMATED PROPORTIONS OF SUBJECTS PREFERRING EACH VEGETABLE BASED UPON COMPUTER SIMULATION

These probabilities are only a first approximation to those in Table 2-2. For example, turnips are preferred to cabbage .810 of the time, but the simulation only predicts a difference of .710. On the other hand, the observed preference for corn over cabbage (.858) is fairly close to the predicted preference (.864). Consider why the fit was not better. One major factor was that the simulation assumed the equal discriminal dispersions of Eqs. 2-5 and 2-6 instead of the more general Eqs. 2-3 and 2-4. Another possibility is that the stimuli vary along more than one axis, i.e., are multidimensional.
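The Monte Carlo logic described above (add independent normal noise to two mean discriminal responses and count how often the difference is positive) can be sketched roughly as follows. The two means used here are hypothetical placeholders, not the values from Table 2-3, and the function names are illustrative. The closed-form comparison assumes equal, independent dispersions, so the difference has a standard deviation of sd times the square root of 2.

```python
import random
from statistics import NormalDist


def simulate_preference(mean_j, mean_k, trials=1000, sd=1.0, seed=0):
    """Estimate the proportion of trials on which stimulus j is preferred to k."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        r_j = mean_j + rng.gauss(0.0, sd)  # covert discriminal response to j
        r_k = mean_k + rng.gauss(0.0, sd)  # covert discriminal response to k
        if r_j - r_k > 0:                  # positive difference: j preferred
            wins += 1
    return wins / trials


def expected_preference(mean_j, mean_k, sd=1.0):
    """Closed-form proportion under equal, independent normal dispersions."""
    sd_diff = sd * 2 ** 0.5  # SD of the difference of two independent responses
    return NormalDist().cdf((mean_j - mean_k) / sd_diff)


# Hypothetical mean discriminal responses (assumed for illustration only).
p_simulated = simulate_preference(1.0, 0.0, trials=1000, seed=42)
p_expected = expected_preference(1.0, 0.0)
print(round(p_simulated, 3), round(p_expected, 3))
```

With only 1000 trials per pair, the simulated proportion carries sampling error of roughly a percentage point or two, which is one reason simulated and closed-form values need not agree exactly.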
Should one try a more general model with more parameters to estimate? Perhaps yes; perhaps no. This is a question of the tradeoff of completeness and goodness of fit against parsimony.

A Comparison of the Two Simulations

Two simulations have been presented in this chapter. The first involved absolute judgments along a single physical dimension, i.e., was psychophysical. The second involved a comparison of two sentiments with stimuli that did not vary along one physical dimension, i.e., was not psychophysical.

The law of comparative judgment has had both historical and continuing importance. The first author had the privilege of sitting in Thurstone's classroom when he indicated that the law of comparative judgment was his proudest achievement. This came from a man for whom the word "genius" is appropriate. Hundreds of journal articles and numerous books have been stimulated by the law of comparative judgment. Although the derivation of the law is not simple, the law itself is held in reverence by some psychometricians, and for good reason.

In the end, the law is very simple. It consists of transforming percentages of "greater than" responses for pairs of stimuli into z scores reflecting their difference. The process uses the inverse of the cumulative normal curve introduced in basic statistics. This inverse function is depicted in Fig. 2-6. The interval between any two stimuli is the z score that corresponds to the percentage of "greater than" responses. Intervals are computed for all pairs of stimuli. Although these z scores themselves can define intervals, they are usually averaged to increase the reliability of the estimates, and the lowest one is set to zero to simplify description.

The point basic to both simulations is that variability due to noise unified the two types of response. The additional factor of a correlation between the separate processes in comparison is also important in reducing the magnitude of error.
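The summary of the law given above (convert each "greater than" proportion to a z score with the inverse cumulative normal, average the z scores for each stimulus, and set the lowest value to zero) might be sketched as below. This assumes the equal-dispersion form of the model; the function name, the cutoff argument `z_limit` for excluding extreme deviates, and the three "true" scale values used to build the example matrix are all assumptions for illustration.

```python
from statistics import NormalDist


def case_v_scale(p, z_limit=2.0):
    """Scale stimuli from a paired comparison matrix, equal-dispersion case.

    p[i][j] is the proportion of times the column stimulus j is judged
    greater than the row stimulus i (diagonal cells are .5).
    Proportions must lie strictly between 0 and 1.
    """
    inv = NormalDist().inv_cdf  # inverse of the cumulative normal curve
    n = len(p)
    col_means = []
    for j in range(n):
        z_col = [inv(p[i][j]) for i in range(n)]
        # Drop extreme deviates, where overlap of the distributions is doubtful.
        kept = [z for z in z_col if abs(z) <= z_limit]
        col_means.append(sum(kept) / len(kept))
    lowest = min(col_means)
    return [m - lowest for m in col_means]  # lowest stimulus set to zero


# Build a perfectly consistent matrix from hypothetical scale values.
true_values = [0.0, 0.5, 1.2]
cdf = NormalDist().cdf
p = [[cdf(true_values[j] - true_values[i]) for j in range(3)] for i in range(3)]
scale = case_v_scale(p)
print([round(s, 3) for s in scale])
```

Because the example matrix is generated from the model itself, the recovered scale reproduces the generating values; real data contain sampling error, so the averages only approximate the underlying intervals.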
The simulations reasonably document what the subjects do.

The Logistic Distribution and Luce's Choice Theory

Although much statistical theory used in scaling employs the familiar normal distribution, more recent work tends to stress the logistic distribution. The ogival shape of the logistic distribution is visually indistinguishable from the cumulative normal distribution, but it is much more convenient mathematically. This will be especially important in Chapters 10 and 15. Equation 2-7 defines the logistic function:

$$Y = \frac{e^{1.702X}}{1 + e^{1.702X}} \tag{2-7}$$

where e = 2.71828+. The constant 1.702 causes Y to differ from the cumulative normal distribution by no more than 0.01 for any value of X. Had this distribution been used instead of the cumulative normal distribution, [...] comparative judgment [...].

Consider choosing one vegetable from menus on which (1) asparagus and beets are the only two choices and (2) there are other options. One consequence of choice theory is the constant ratio rule, which predicts that the ratio of choosing asparagus over beets will be the same in both situations. Thus, Table 2-2 [...] chose asparagus over beets when they are the [...]. These constant ratios in turn [...] scale values. The logistic transformation relates scale values (X) to probabilities. (In contrast, Eqs. 2-3 through 2-6 show that Thurstone's law of comparative judgment is a constant-difference rule.)

Averages as Scale Values

Both Thurstone's law of comparative judgment and choice theory are representational models of interval scaling in the sense of Chapter 1. The reason for choosing either the normal curve transformation that gave rise to Table 2-2 or the logistic transformation follows from the constant-difference and constant-ratio rules (assumptions).
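Two claims in this section lend themselves to a quick numerical check: that a suitably scaled logistic function stays within 0.01 of the cumulative normal, and that choice probabilities formed as ratios of scale values obey the constant ratio rule. The sketch below assumes the widely used scaling constant 1.702 and a simple ratio-of-values choice rule; the vegetable values are made up for illustration.

```python
import math
from statistics import NormalDist


def logistic(x, scale=1.702):
    """Logistic function with the scaling constant assumed to be 1.702."""
    return 1.0 / (1.0 + math.exp(-scale * x))


# Largest gap between the logistic and cumulative normal on a fine grid.
phi = NormalDist().cdf
grid = [i / 100 for i in range(-500, 501)]
max_gap = max(abs(logistic(x) - phi(x)) for x in grid)


def choice_probabilities(values):
    """Probability of choosing each option as its share of total scale value."""
    total = sum(values.values())
    return {name: v / total for name, v in values.items()}


# Hypothetical ratio-scale values (not taken from the chapter's tables).
two = choice_probabilities({"asparagus": 3.0, "beets": 2.0})
three = choice_probabilities({"asparagus": 3.0, "beets": 2.0, "corn": 5.0})
ratio_two = two["asparagus"] / two["beets"]
ratio_three = three["asparagus"] / three["beets"]
print(round(max_gap, 4), round(ratio_two, 3), round(ratio_three, 3))
```

Adding a third option lowers both individual probabilities but leaves their ratio unchanged, which is what the constant ratio rule asserts.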
Consider what would happen if the scale were simply formed from the preference probabilities in Table 2-2 themselves. One first computes the column sums, which are 1.634, 3.00, 3.438, 4.439, 4.878, 4.987, 5.768, 5.925, and 6.414. Dividing each in turn by 9 to form averages gives 0.181, 0.333, 0.382, 0.493, 0.542, 0.554, 0.640, 0.658, and 0.712. Next, subtracting 0.181 from each average gives values of 0.000, 0.152, 0.201, 0.312, 0.361, 0.373, 0.458, 0.477, and 0.531. In order to visualize the similarities between these values, based upon simple sums, and either Thurstone's or Luce's formal assumptions, multiply each value by the ratio of the highest scale value in Table 2-3 (1.630) to the highest scale value here (0.531), or 3.07. This makes the first and last values of the two scales the same. Both this and the subtraction of the smallest scale value (0.181) are permissible transformations of an interval scale. The resulting scale values are 0.000, 0.466, 0.615, 0.957, 1.108, 1.145, 1.409, 1.464, and 1.630. The similarities to the proper Thurstone values are apparent and important.

This similarity is one justification for the operationalist position (Gaito, 1980) discussed in Chapter 1. Were the table comprised of outcomes for nine baseball teams, the result would be familiar won-loss percentages. However, the operation and therefore the scale values are meaningless in a representational sense since, unlike the rationale provided by Thurstone and his predecessors, there is none for summing probabilities as opposed to z scores. The operationalist position is that it is difficult to see why one operation is meaningless when it gives results nearly identical to those of another that is meaningful.

So far in this chapter numerous assumptions have been discussed regarding the use of various models for scaling stimuli.
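The arithmetic behind the averaged scale (subtract the smallest average, then stretch so the top value matches the top Thurstone value of 1.630) can be retraced with a few lines. The nine averages are the ones quoted in the text; the variable names are illustrative, and small differences in the third decimal place relative to the printed values are to be expected from rounding.

```python
# Column averages of the preference probabilities, as quoted in the text.
averages = [0.181, 0.333, 0.382, 0.493, 0.542, 0.554, 0.640, 0.658, 0.712]

# Shift so the least preferred stimulus sits at zero.
lowest = min(averages)
shifted = [a - lowest for a in averages]

# Stretch so the highest value equals the highest Thurstone scale value.
top_thurstone = 1.630
ratio = top_thurstone / max(shifted)
rescaled = [round(s * ratio, 3) for s in shifted]

print(round(ratio, 2))
print(rescaled)
```

Both steps are linear transformations, which is why they are permissible for an interval scale: they change the origin and unit but not the relative spacing of the stimuli.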
How does one know if the assumptions are correct?

1 We have already noted the importance of internal consistency in developing subjective estimate scales. Similar considerations hold for discriminant models. Basically, an ordinal scale is developed by averaging individual subjects' rankings, and the data are internally consistent to the extent that different subjects give similar rankings. As previously noted, suitable methods for obtaining internal consistency measures are discussed later in Chapters 6 through 8.

2 As indicated in the simulations, one can work backward from Thurstone scale values to paired comparison probabilities. These estimated probabilities should be similar to the observed probabilities.

3 One should examine the transitivity of the response probabilities. If stimulus i is preferred to j and j is preferred to k, then i should be preferred to k. Violations of
