ANNE ANASTASI

Professor of Psychology, Fordham University

Psychological Testing

MACMILLAN PUBLISHING CO., INC.
New York

Collier Macmillan Publishers
London

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the Publisher. Earlier editions copyright 1954 and © 1961 by Macmillan Publishing Co., Inc., and copyright © 1968 by Anne Anastasi.
MACMILLAN PUBLISHING CO., INC.

866 Third Avenue, New York, New York 10022
COLLIER MACMILLAN CANADA, LTD.

Library of Congress Cataloging in Publication Data

Anastasi, Anne, (date)
Psychological testing.

Bibliography: p.
Includes indexes.
1. Mental tests. 2. Personality tests. I. Title.
[DNLM: 1. Psychological tests. WM145 A534P]
BF431.A573 1976   153.9   75-2206
ISBN 0-02-302980-3

Preface

IN a revised edition, one expects both similarities and differences. This edition shares with the earlier versions the objectives and basic approach of the book. The primary goal of this text is still to contribute toward the proper evaluation of psychological tests and the correct interpretation and use of test results. This goal calls for several kinds of information: (1) an understanding of the major principles of test construction, (2) psychological knowledge about the behavior being assessed, (3) sensitivity to the social and ethical implications of test use, and (4) broad familiarity with the types of available instruments and the sources of information about tests. A minor innovation in the fourth edition is the addition of a suggested outline for test evaluation (Appendix C).

In successive editions, it has been necessary to exercise more and more restraint to keep the number of specific tests discussed in the book from growing with the field; it has never been my intention to provide a miniature Mental Measurements Yearbook! Nevertheless, I am aware that principles of test construction and interpretation can be better understood when applied to particular tests. Moreover, acquaintance with the major types of available tests, together with an understanding of their special contributions and limitations, is an essential component of knowledge about contemporary testing. For these reasons, specific tests are again examined and evaluated in Parts 3, 4, and 5. These tests have been chosen either because they are outstanding examples with which the student of testing should be familiar or because they illustrate some special point of test construction or interpretation. In the text itself, the principal focus is on types of tests rather than on specific instruments. At the same time, Appendix E contains a classified list of over 250 tests, including not only those cited in the text but also others added to provide a more representative sample.

As for the differences, they loomed especially large during the preparation of this edition. Much that has happened in human society since the mid-1960's has had an impact on psychological testing. Some of these developments were briefly described in the last two chapters of the third edition. Today they have become part of the mainstream of testing and have accordingly been incorporated in the appropriate sections throughout the book. Recent changes in psychological testing that are reflected in the present edition can be described on three levels: (1) general orientation toward testing, (2) substantive and methodological developments, and (3) "ordinary progress," such as the publication of new tests and revision of earlier tests.


An example of changes on the first level is the increasing awareness of the ethical, social, and legal implications of testing. In the present edition, this topic has been expanded and treated in a separate chapter early in the book (Ch. 3) and in Appendixes A and B. A cluster of related developments represents a broadening of test uses. Besides the traditional applications of tests in selection and diagnosis, increasing attention is being given to administering tests for self-knowledge and self-development, and to training individuals in the use of their own test results in decision making (Chs. 3 and 4). In the same category are the continuing replacement of global scores with multitrait profiles and the application of classification strategies, whereby "everyone can be above average" in one or more socially valued variables (Ch. 7). From another angle, efforts are being made to modify traditional interpretations of test scores, in both cognitive and noncognitive areas, in the light of accumulating psychological knowledge. In this edition, Chapter 12 brings together psychological issues in the interpretation of intelligence test scores, touching on such problems as stability and change in intellectual level over time; the nature of intelligence; and the testing of intelligence in early childhood, in old age, and in different cultures. Another example is provided by the increasing emphasis on situational specificity and person-by-situation interactions in personality testing, stimulated in large part by the social-learning theorists (Ch. 17).

The second level, covering substantive and methodological changes, is illustrated by the impact of computers on the development, administration, scoring, and interpretation of tests (see especially Chs. 4, 11, 13, 17, 18, 19). The use of computers in administering or managing instructional programs has also stimulated the development of criterion-referenced tests, although other conditions have contributed to the upsurge of interest in such tests in education. Criterion-referenced tests are discussed principally in Chapters 4, 5, and 14. Other types of instruments that have risen to prominence and have received fuller treatment in the present edition include: tests for identifying specific learning disabilities (Ch. 16), inventories and other devices for use in behavior modification programs (Ch. 20), instruments for assessing early childhood education (Ch. 14), Piagetian "ordinal" scales (Chs. 10 and 14), basic education and literacy tests for adults (Chs. 13 and 14), and techniques for the assessment of environments (Ch. 20). Problems to be considered in the assessment of minority groups, including the question of test bias, are examined from different angles in Chapters 3, 7, 8, and 12.

On the third level, it may be noted that over 100 of the tests listed in this edition have been either initially published or revised since the publication of the preceding edition (1968). Major examples include the McCarthy Scales of Children's Abilities, the WISC-R, the 1972 Stanford-Binet norms (with all the resulting readjustments in interpretations),
Forms S and T of the DAT (including a computerized Career Planning Program), the Strong-Campbell Interest Inventory (merged form of the SVIB), and the latest revisions of the Stanford Achievement Test and the Metropolitan Readiness Tests.

It is a pleasure to acknowledge the assistance received from many sources in the preparation of this edition. The completion of the project was facilitated by a one-semester Faculty Fellowship awarded by Fordham University and by a grant from the Fordham University Research Council covering principally the services of a research assistant. These services were performed by Stanley Friedland with an unusual combination of expertise, responsibility, and graciousness. I am indebted to the many authors and test publishers who provided reprints, unpublished manuscripts, specimen sets of tests, and answers to my innumerable inquiries by mail and telephone. For assistance extending far beyond the interests and responsibilities of any single publisher, I am especially grateful to Anna Dragositz of Educational Testing Service and Blythe Mitchell of Harcourt Brace Jovanovich, Inc. I want to acknowledge the significant contribution of John T. Cowles of the University of Pittsburgh, who assumed complete responsibility for the preparation of the Instructor's Manual to accompany this text.

For informative discussions and critical comments on particular topics, I want to convey my sincere thanks to William H. Angoff of Educational Testing Service and to several members of the Fordham University Psychology Department, including David R. Chabot, Marvin Reznikoff, Reuben M. Schonebaum, and Warren W. Tryon. Grateful acknowledgment is also made of the thoughtful recommendations submitted by course instructors in response to the questionnaire distributed to current users of the third edition. Special thanks in this connection are due to Mary Carol Cahill for her extensive, constructive, and wide-ranging suggestions. I wish to express my appreciation to Victoria Overton of the Fordham University library staff for her efficient and courteous assistance in bibliographic matters. Finally, I am happy to record the contributions of my husband, John Porter Foley, Jr., who again participated in the solution of countless problems at all stages in the preparation of the book.

A.A.

CONTENTS

PART 1 CONTEXT OF PSYCHOLOGICAL TESTING

1. FUNCTIONS AND ORIGINS OF PSYCHOLOGICAL TESTING
Current uses of psychological tests 3
Early interest in classification and training of the mentally retarded 5
The first experimental psychologists 7
Contributions of Francis Galton 8
Cattell and the early "mental tests" 9
Binet and the rise of intelligence tests 10
Group testing 12
Aptitude testing 13
Standardized achievement tests 16
Measurement of personality 18
Sources of information about tests 20

2. NATURE AND USE OF PSYCHOLOGICAL TESTS
What is a psychological test? 23
Reasons for controlling the use of psychological tests
Test administration 32
Rapport 34
Test anxiety 37
Examiner and situational variables 39
Coaching, practice, and test sophistication 41

3. SOCIAL AND ETHICAL IMPLICATIONS OF TESTING
User qualifications 45
Testing instruments and procedures 47
Protection of privacy 49
Confidentiality 52
Communicating test results 56
Testing and the civil rights of minorities 57

PART 2 PRINCIPLES OF PSYCHOLOGICAL TESTING

4. NORMS AND THE INTERPRETATION OF TEST SCORES
Statistical concepts 68
Developmental norms 73
Within-group norms 77
Relativity of norms 88
Computer utilization in the interpretation of test scores 94
Criterion-referenced testing 96

5. RELIABILITY
The correlation coefficient 104
Types of reliability 110
Reliability of speeded tests 122
Dependence of reliability coefficients on the sample tested 125
Standard error of measurement 127
Reliability of criterion-referenced tests 131

6. VALIDITY: BASIC CONCEPTS
Content validity 134
Criterion-related validity 140
Construct validity 151
Overview 158

7. VALIDITY: MEASUREMENT AND INTERPRETATION
Validity coefficient and error of estimate 163
Test validity and decision theory 167
Moderator variables 177
Combining information from different tests 180
Use of tests for classification decisions 186
Statistical analyses of test bias 191

8. ITEM ANALYSIS
Item difficulty 199
Item validity 206
Internal consistency 215
Item analysis of speeded tests 217
Cross validation 219
Item-group interaction 222

PART 3 TESTS OF GENERAL INTELLECTUAL LEVEL

9. INDIVIDUAL TESTS
Stanford-Binet Intelligence Scale 230
Wechsler Adult Intelligence Scale 245
Wechsler Intelligence Scale for Children 255
Wechsler Preschool and Primary Scale of Intelligence 260

10. TESTS FOR SPECIAL POPULATIONS
Infant and preschool testing 266
Testing the physically handicapped 281
Cross-cultural testing 287

11. GROUP TESTING
Group tests versus individual tests 299
Multilevel batteries 305
Tests for the college level and beyond 318

12. PSYCHOLOGICAL ISSUES IN INTELLIGENCE TESTING
Longitudinal studies of intelligence 327
Intelligence in early childhood 332
Problems in the testing of adult intelligence 337
Problems in cross-cultural testing 343
Nature of intelligence 349

PART 4 TESTS OF SEPARATE ABILITIES

13. MEASURING MULTIPLE APTITUDES
Factor analysis 362
Theories of trait organization 369
Multiple aptitude batteries 378
Measurement of creativity 388

14. EDUCATIONAL TESTING
Achievement tests: their nature and uses 398
General achievement batteries 403
Standardized tests in separate subjects 410
Teacher-made classroom tests 412
Diagnostic and criterion-referenced tests 417
Specialized prognostic tests 423
Assessment in early childhood education 425

15. OCCUPATIONAL TESTING
Validation of industrial tests 435
Short screening tests for industrial personnel 439
Special aptitude tests 442
Testing in the professions 458

16. CLINICAL TESTING
Diagnostic use of intelligence tests 465
Special tests for detecting cognitive dysfunction
Identifying specific learning disabilities 478
Clinical judgment 482
Report writing 487

PART 5 PERSONALITY TESTS

17. SELF-REPORT INVENTORIES
Content validation 494
Empirical criterion keying 496
Factor analysis in test development 506
Personality theory in test development 510
Test-taking attitudes and response sets 515
Situational specificity 521
Evaluation of personality inventories 527

18. MEASURES OF INTERESTS, ATTITUDES, AND VALUES
Interest inventories 528
Opinion and attitude measurement 543
Attitude scales 546
Assessment of values and related variables 552

19. PROJECTIVE TECHNIQUES
Nature of projective techniques 558
Inkblot techniques 559
Thematic Apperception Test and related instruments
Other projective techniques 569
Evaluation of projective techniques 576

20. OTHER ASSESSMENT TECHNIQUES
"Objective" performance tests 588
Situational tests 593
Self-concepts and personal constructs 598
Assessment techniques in behavior modification programs
Observer reports 606
Biographical inventories 614
The assessment of environments 616

APPENDIXES
A. Guidelines on Employee Selection Procedures (EEOC)
B. Guidelines for Reporting Criterion-Related and Content Validity (OFCC)

PART 1
Context of Psychological Testing

CHAPTER 1
Functions and Origins of Psychological Testing

ANYONE reading this book today could undoubtedly illustrate what is meant by a psychological test. It would be easy enough to recall a test the reader himself has taken in school, in college, in the armed services, in the counseling center, or in the personnel office. Or perhaps the reader has served as a subject in an experiment in which standardized tests were employed. This would certainly not have been the case fifty years ago. Psychological testing is a relatively young branch of one of the youngest of the sciences.

CURRENT USES OF PSYCHOLOGICAL TESTS

Basically, the function of psychological tests is to measure differences between individuals or between the reactions of the same individual on different occasions. One of the first problems that stimulated the development of psychological tests was the identification of the mentally retarded. To this day, the detection of intellectual deficiencies remains an important application of certain types of psychological tests. Related clinical uses of tests include the examination of the emotionally disturbed, the delinquent, and other types of behavioral deviants.

A strong impetus to the early development of tests was likewise provided by problems arising in education. At present, schools are among the largest test users. The classification of children with reference to their ability to profit from different types of school instruction, the identification of the intellectually retarded on the one hand and the gifted on the other, the diagnosis of academic failures, the educational and vocational counseling of high school and college students, and the selection of applicants for professional and other special schools are among the many educational uses of tests.

The selection and classification of industrial personnel represent another major application of psychological testing.

From the assembly-line operator or filing clerk to top management, there is scarcely a type of job for which some kind of psychological test has not proved helpful in such matters as hiring, job assignment, transfer, promotion, or termination. To be sure, the effective employment of tests in many of these situations, especially in connection with high-level jobs, usually requires that the tests be used as an adjunct to skillful interviewing, so that test scores may be properly interpreted in the light of other background information about the individual. Nevertheless, testing constitutes an important part of the total personnel program.

A closely related application of psychological testing is to be found in the selection and classification of military personnel. From simple beginnings in World War I, the scope and variety of psychological tests employed in military situations underwent a phenomenal increase during World War II. Subsequently, research on test development has been continuing on a large scale in all branches of the armed services.

The use of tests in counseling has gradually broadened from a narrowly defined guidance regarding educational and vocational plans to an involvement with all aspects of the person's life. Emotional well-being and effective interpersonal relations have become increasingly prominent objectives of counseling. There is growing emphasis, too, on the use of tests to enhance self-understanding and personal development. Within this framework, test scores are part of the information given to the individual as aids to his own decision-making processes.

It is clearly evident that psychological tests are currently being employed in the solution of a wide range of practical problems. One should not, however, lose sight of the fact that such tests are also serving important functions in basic research. Nearly all problems in differential psychology, for example, require testing procedures as a means of gathering data. As illustrations, reference may be made to studies on the nature and extent of individual differences, the identification of psychological traits, the measurement of group differences, and the investigation of biological and cultural factors associated with behavioral differences. For all such areas of research, and for many others, the precise measurement of individual differences made possible by well-constructed tests is an essential prerequisite. Similarly, psychological tests provide standardized tools for investigating such varied problems as life-span developmental changes within the individual, the relative effectiveness of different educational procedures, the outcomes of psychotherapy, the impact of community programs, and the influence of noise on performance.

From the many different uses of psychological tests, it follows that some knowledge of such tests is needed for an adequate understanding of most fields of contemporary psychology. It is primarily with this end in view that the present book has been prepared. The book is not designed to make the individual either a skilled examiner and test administrator or an expert on test construction. It is directed, not to the test specialist, but to the general student of psychology. Some acquaintance with the leading current tests is necessary in order to understand references to the use of such tests in the psychological literature. And a proper evaluation and interpretation of test results must ultimately rest on a knowledge of how the tests were constructed, what they can be expected to accomplish, and what are their peculiar limitations. Today a familiarity with tests is required, not only by those who give or construct tests, but by the general psychologist as well.

A brief overview of the historical antecedents and origins of psychological testing will provide perspective and should aid in the understanding of present-day tests.1 The direction in which contemporary psychological testing has been progressing can be clarified when considered in the light of the precursors of such tests. The special limitations as well as the advantages that characterize current tests likewise become more intelligible when viewed against the background in which they originated.

1 A more detailed account of the early origins of psychological tests can be found in Goodenough (1949) and J. Peterson (1926). See also Anastasi (1965) for historical antecedents of the study of individual differences; Boring (1950) and Murphy and Kovach (1972) for more general background; and DuBois (1970) for a brief but comprehensive history of psychological testing.

The roots of testing are lost in antiquity. DuBois (1966) gives a provocative and entertaining account of the system of civil service examinations prevailing in the Chinese empire for some three thousand years. Among the ancient Greeks, testing was an established adjunct to the educational process. Tests were used to assess the mastery of physical as well as intellectual skills. The Socratic method of teaching, with its interweaving of testing and teaching, has much in common with today's programed learning. From their beginnings in the middle ages, European universities relied on formal examinations in awarding degrees and honors. To identify the major developments that shaped contemporary testing, however, we need go no farther than the nineteenth century. It is to these developments that we now turn.

EARLY INTEREST IN CLASSIFICATION AND TRAINING OF THE MENTALLY RETARDED

The nineteenth century witnessed a strong awakening of interest in the humane treatment of the mentally retarded and the insane. Prior to that time, neglect, ridicule, and even torture had been the common lot of these unfortunates.

With the growing concern for the proper care of mental deviates came a realization that some uniform criteria for identifying and classifying these cases were required. The establishment of many special institutions for the care of the mentally retarded in both Europe and America made the need for setting up admission standards and an objective system of classification especially urgent.

First it was necessary to differentiate between the insane and the mentally retarded. The former manifested emotional disorders that might or might not be accompanied by intellectual deterioration from an initially normal level; the latter were characterized essentially by intellectual defect that had been present from birth or early infancy. What is probably the first explicit statement of this distinction is to be found in a two-volume work published in 1838 by the French physician Esquirol (1838), in which over one hundred pages are devoted to mental retardation. Esquirol also pointed out that there are many degrees of mental retardation, varying along a continuum from normality to low-grade idiocy. In the effort to develop some system for classifying the different degrees and varieties of retardation, Esquirol tried several procedures but concluded that the individual's use of language provides the most dependable criterion of his intellectual level. It is interesting to note that current criteria of mental retardation are also largely linguistic and that present-day intelligence tests are heavily loaded with verbal content. The important part verbal ability plays in our concept of intelligence will be repeatedly demonstrated in subsequent chapters.

Of special significance are the contributions of another French physician, Seguin, who pioneered in the training of the mentally retarded. Having rejected the prevalent notion of the incurability of mental retardation, Seguin (1866) experimented for many years with what he termed the physiological method of training; and in 1837 he established the first school devoted to the education of mentally retarded children. In 1848 he emigrated to America, where his ideas gained wide recognition. Many of the sense-training and muscle-training techniques currently in use in institutions for the mentally retarded were originated by Seguin. By these methods, severely retarded children are given intensive exercise in sensory discrimination and in the development of motor control. Some of the procedures developed by Seguin for this purpose were eventually incorporated into performance or nonverbal tests of intelligence. An example is the Seguin Form Board, in which the individual is required to insert variously shaped blocks into the corresponding recesses as quickly as possible.

More than half a century after the work of Esquirol and Seguin, the French psychologist Alfred Binet urged that children who failed to respond to normal schooling be examined before dismissal and, if considered educable, be assigned to special classes (T. Wolf, 1973). With his fellow members of the Society for the Psychological Study of the Child, Binet stimulated the Ministry of Public Instruction to take steps to improve the condition of retarded children. A specific outcome was the establishment of a ministerial commission for the study of retarded children, to which Binet was appointed. This appointment was a momentous event in the history of psychological testing, of which more will be said later.

THE FIRST EXPERIMENTAL PSYCHOLOGISTS

The early experimental psychologists of the nineteenth century were not, in general, concerned with the measurement of individual differences. The principal aim of psychologists of that period was the formulation of generalized descriptions of human behavior. It was the uniformities rather than the differences in behavior that were the focus of attention. Individual differences were either ignored or were accepted as a necessary evil that limited the applicability of the generalizations. Thus, the fact that one individual reacted differently from another when observed under identical conditions was regarded as a form of error. The presence of such error, or individual variability, rendered the generalizations approximate rather than exact. This was the attitude toward individual differences that prevailed in such laboratories as that founded by Wundt at Leipzig in 1879, where many of the early experimental psychologists received their training.

In their choice of topics, as in many other phases of their work, the founders of experimental psychology reflected the influence of their backgrounds in physiology and physics. The problems studied in their laboratories were concerned largely with sensitivity to visual, auditory, and other sensory stimuli and with simple reaction time. This emphasis on sensory phenomena was in turn reflected in the nature of the first psychological tests.

Still another way in which nineteenth-century experimental psychology influenced the course of the testing movement may be noted. The early psychological experiments brought out the need for rigorous control of the conditions under which observations were made. For example, the wording of directions given to the subject in a reaction-time experiment might appreciably increase or decrease the speed of the subject's response. Or again, the brightness or color of the surrounding field could markedly alter the appearance of a visual stimulus. The importance of making observations on all subjects under standardized conditions was thus vividly demonstrated. Such standardization of procedure eventually became one of the special earmarks of psychological tests.

CONTRIBUTIONS OF FRANCIS GALTON

It was the English biologist Sir Francis Galton who was primarily responsible for launching the testing movement. A unifying factor in Galton's numerous and varied research activities was his interest in human heredity. In the course of his investigations on heredity, Galton realized the need for measuring the characteristics of related and unrelated persons. Only in this way could he discover, for example, the exact degree of resemblance between parents and offspring, brothers and sisters, cousins, or twins. With this end in view, Galton was instrumental in inducing a number of educational institutions to keep systematic anthropometric records on their students. He also set up an anthropometric laboratory at the International Exposition of 1884 where, by paying threepence, visitors could be measured in certain physical traits and could take tests of keenness of vision and hearing, muscular strength, reaction time, and other simple sensorimotor functions. When the exposition closed, the laboratory was transferred to South Kensington Museum, London, where it operated for six years. By such methods, the first large, systematic body of data on individual differences in simple psychological processes was gradually accumulated.

Galton himself devised most of the simple tests administered at his anthropometric laboratory, many of which are still familiar either in their original or in modified forms. Examples include the Galton bar for visual discrimination of length, the Galton whistle for determining the highest audible pitch, and graduated series of weights for measuring kinesthetic discrimination. It was Galton's belief that tests of sensory discrimination could serve as a means of gauging a person's intellect. In this respect, he was partly influenced by the theories of Locke. Thus Galton wrote: "The only information that reaches us concerning outward events appears to pass through the avenue of our senses; and the more perceptive the senses are of difference, the larger is the field upon which our judgment and intelligence can act" (Galton, 1883, p. 27). Galton had also noted that idiots tend to be defective in the ability to discriminate heat, cold, and pain, an observation that further strengthened his conviction that sensory discriminative capacity "would on the whole be highest among the intellectually ablest" (Galton, 1883, p. 29).

Galton also pioneered in the application of rating-scale and questionnaire methods, as well as in the use of the free association technique subsequently employed for a wide variety of purposes. A further contribution of Galton is to be found in his development of statistical methods for the analysis of data on individual differences. Galton selected and adapted a number of techniques previously derived by mathematicians. These techniques he put in such form as to permit their use by the mathematically untrained investigator who might wish to treat test results quantitatively. He thereby extended enormously the application of statistical procedures to the analysis of test data. This phase of Galton's work has been carried forward by many of his students, the most eminent of whom was Karl Pearson.

CATTELL AND THE EARLY "MENTAL TESTS"

An especially prominent position in the development of psychological testing is occupied by the American psychologist James McKeen Cattell. The newly established science of experimental psychology and the still newer testing movement merged in Cattell's work. For his doctorate at Leipzig, he completed a dissertation on individual differences in reaction time, despite Wundt's resistance to this type of investigation. While lecturing at Cambridge in 1888, Cattell's own interest in the measurement of individual differences was reinforced by contact with Galton. On his return to America, Cattell was active both in the establishment of laboratories for experimental psychology and in the spread of the testing movement.

In an article written by Cattell in 1890, the term "mental test" was used for the first time in the psychological literature. This article described a series of tests that were being administered annually to college students in the effort to determine their intellectual level. The tests, which had to be administered individually, included measures of muscular strength, speed of movement, sensitivity to pain, keenness of vision and of hearing, weight discrimination, reaction time, memory span, and the like. In his choice of tests, Cattell shared Galton's view that a measure of intellectual functions could be obtained through tests of sensory discrimination and reaction time. Cattell's preference for such tests was also bolstered by the fact that simple functions could be measured with precision and accuracy, whereas the development of objective measures for the more complex functions seemed at that time a well-nigh hopeless task.

Cattell's tests were typical of those to be found in a number of test series developed during the last decade of the nineteenth century. Such test series were administered to schoolchildren, college students, and miscellaneous adults. At the Columbian Exposition held in Chicago in 1893, Jastrow set up an exhibit at which visitors were invited to take tests of sensory, motor, and simple perceptual processes and to compare their skill with the norms (J. Peterson, 1926; Philippe, 1894). A few attempts to evaluate such early tests yielded very discouraging results. The individual's performance showed little correspondence from one test to another (Sharp, 1898-1899; Wissler, 1901), and it exhibited little or no relation to independent estimates of intellectual level based on teachers' ratings (Bolton, 1891-1892; J. A. Gilbert, 1894) or academic grades (Wissler, 1901).
sisters.~ _ In an article written by Cattell in .""V'.Functions and Ol'igills of Psychological Testing 9 It "'as the English biologist Sir Francis Galton who . An especially prominent position in the development of psychological testing is occupied by the American psychologist James McKeen Cattell. 1. . Whe~l the exposition closed.. In this respec. speed of movement..fu. A .t.oset up an anthropo~ctric laboratory at the International EXposI~on of . the most eminent of whom was Karl Pearson. weight discrimination.\'. Thus Galton wrote: ..~~ c~pination and reaction time. London. factor ~n Calton's numerous and vaI'ied research activities was hiS }nterest llL 'humaJ. In the course of his imestigations on heredity. included measures of muscular strength. Calton t~a 'ize t e need for measuring the characteristics of related and unrelated persons. many of which are still familiar either in ~heir original or in modified forms.rther contribution of Galton is to be found in hiS development of statistical methods for the analysis of data on individual differences. Peterson. and it exhibited little or no . Examples include the Cal~o~ bar for . whereas the development of objective measures1-<=~. 1883. These techniques he put in such form as to permit theIr use by the mathematically untrained investigator who might wish to treat test results quantitatively.V1. muscular strength. This phase of Galton's work has been carried forward by many of his students. At the Columbian Exposition Jield in Chicago in 189~.~' he was partly influenced hy the theories of L?cke..p!i<ck{t<:1.~J preCiSIOnand accuracy.establishment of laboratories for experimental psychology and in the spread of the testing movement.pa) mg threepence.:mual . l -. to college o students in the effort to determine their irteilectuall~yel. and other simple sensorimotor functions. 1~1899. C~lt~n !lad. I In his choice of tests. 29). the larger is the field upon which our Judgment and 10telligence can act" (Calton.:~lso noted that idiots tend to be defective in the ability to discrlmmaJe·:heat. systematic body of data on individual differences in simple psychological processes was gradually aceu~ulated.:as..e~ed by the fact that simple functions could be measured with . reaction time. keenness of vision and of hearing. cousins. which had to be administered individually.f<.

A number of test series assembled by European psychologists of the period tended to cover somewhat more complex functions. In these tests we can recognize the trends that were eventually to lead to the development of the famous Binet intelligence scales. employing chiefly simple arithmetic operations. H. A. because the scientific ~Om!J1l1nity was not ready for it. A few years earlier. Chaille publi!iheq in the New Orleans Medical a~d Surgical Journal a series of tests for infan~ 11l7anged according to the a!1:eat whIch the tests are commonly passed. E. The 1905 scale was presented as a preliminary and tentative instrument. had tests of perception. Binet and Henri criticized most of the available test series as being too largely sensory and as concentrating unduly on simple. and the analysis of handwriting. a much greater proportion of verbal content was found in this scale than in most test series of the time. pp. 1973). however. 1905). Wolf. The most complex of the three tests. attention. the year of Binet's untimely death.udgmt.. The tests. Although sensory and perceptual tests were included. great precision is not necessary. global score Eor eaclrdiild (T. Partly because' of the limited circulation of the journal 'nd partly. The results. An extensive and varied list of tests was proposed. with speCial emphasis onJ. \\Tolf. In this scale. some unsatisfactory tests from the earlier scale were eliminated. and to some mentally retarded children and adults. Minor revisions and relocations of specific tests were instituted. memory. comprehension. The test series they devised ranged from physiological measures and motor tests to apprehension span and the interpretation of pictures. covering such functions as memory. the Binet-Simon tests attracted wide > Goodenough (1949. avoided the term "mental age" because of its unverified developmental implications and preferred the more neutral term "mental level" (T.10 Context of PSlJc11010gical Testing Functions and Origi. 50-51) notes that in 1881. since individual differences are larger in these functions. the ages of 3 and 13 Years. Like Kraepelin. known as the 1905 seale. More tests were added at several year levels. Many approaches were tried. no fundamental changes were introduced. Which Binet regarded as essential components of intelligence. the Minister of Public Instruction appointed ~inet to the previ- . Kraepelin (1895). a pupil of Kraepelin. and reasoning. and all tests were grouped into age levels on the basis of the performance of about 300 normal children between. The tests were designed to cover a wide variety of functions. This scale. Binet and his co-workers devoted many years to active and ingenious research on ways of measuring intelligence. Ebbinghaus (1897). however.nt. notably Blin and Damaye. S. in the measurement of the more complex functions. including even the measurement of cranial. T en a specific situation arose that brought Binet's efforts to imme(]iate practical fruition. 1973). led to a growing conviction that the direct. It was in connection 'with the objectives of this commission that Binet. In the various translations and adaptations of the Binet scales. prepared a long series of tests to measure what he regarded as basic factors in the characterization of an individual. memory span. ously cited commission to study procedures for the education of retarded children. imagination. in collaboration with Simon. Binet's own scale was in~ed by the work oE some oE ~is contemporaries. comprehension. 
In the second, or 1908, scale, the number of tests was increased, some unsatisfactory tests from the earlier scale were eliminated, and all tests were grouped into age levels on the basis of the performance of about 300 normal children between the ages of 3 and 13 years. Thus, in the 3-year level were placed all tests passed by 80 to 90 percent of normal 3-year-olds; in the 4-year level, all tests similarly passed by normal 4-year-olds; and so on to age 13. The child's score on the entire test could then be expressed as a mental level corresponding to the age of normal children whose performance he equaled. In the various translations and adaptations of the Binet scales, the term "mental age" was commonly substituted for "mental level." Since mental age is such a simple concept to grasp, the introduction of this term undoubtedly did much to popularize intelligence testing. Binet himself, however, avoided the term "mental age" because of its unverified developmental implications and preferred the more neutral term "mental level" (T. Wolf, 1973).2

2 Goodenough (1949, pp. 50-51) notes that in 1887, 21 years before the appearance of the 1908 Binet-Simon Scale, S. E. Chaille published in the New Orleans Medical and Surgical Journal a series of tests for infants arranged according to the age at which the tests are commonly passed. Partly because of the limited circulation of the journal and partly, perhaps, because the scientific community was not ready for it, the significance of this age-scale concept passed unnoticed at the time.

A third revision of the Binet-Simon Scale appeared in 1911, the year of Binet's untimely death. In this scale, minor revisions and relocations of specific tests were instituted. More tests were added at several year levels, and the scale was extended to the adult level.

Even prior to the 1908 revision, the Binet-Simon tests attracted wide attention among psychologists throughout the world. Translations and adaptations appeared in many languages. In America, a number of different revisions were prepared, the most famous of which is the one developed under the direction of L. M. Terman at Stanford University, and known as the Stanford-Binet (Terman, 1916). It was in this test that the intelligence quotient (IQ), or ratio between mental age and chronological age, was first used. The latest revision of this test is widely employed today and will be more fully considered in Chapter 9. Of special interest, too, is the first Kuhlmann-Binet revision, which extended the scale downward to the age level of 3 months (Kuhlmann, 1912). This scale represents one of the earliest efforts to develop preschool and infant tests of intelligence.
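The ratio between mental age and chronological age just mentioned lends itself to a simple worked illustration. In its conventional form (the multiplication by 100 is the standard scaling convention, though it is not spelled out in the passage above), the ratio IQ may be written:

\[
\mathrm{IQ} = \frac{\mathrm{MA}}{\mathrm{CA}} \times 100
\]

where MA is the mental age and CA the chronological age, both expressed in the same units. A child with a mental age of 10 years and a chronological age of 8 years thus obtains an IQ of (10/8) x 100 = 125, while a child whose mental age exactly matches his chronological age obtains an IQ of 100, the average for his age group, regardless of age.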

GROUP TESTING

The Binet tests, as well as all their revisions, are individual scales in the sense that they can be administered to only one person at a time. Many of the tests in these scales require oral responses from the subject or necessitate the manipulation of materials. Some call for individual timing of responses. For these and other reasons, such tests are not adapted to group administration. Another characteristic of the Binet type of test is that it requires a highly trained examiner. Such tests are essentially clinical instruments, suited to the intensive study of individual cases.

Group testing, like the first Binet scale, was developed to meet a pressing practical need. When the United States entered World War I in 1917, a committee was appointed by the American Psychological Association to consider ways in which psychology might assist in the conduct of the war. This committee, under the direction of Robert M. Yerkes, recognized the need for the rapid classification of the million and a half recruits with respect to general intellectual level. Such information was relevant to many administrative decisions, including rejection or discharge from military service, assignment to different types of service, and admission to officer-training camps. It was in this setting that the first group intelligence test was developed. In this task, the Army psychologists drew on all available test materials, and especially on an unpublished group intelligence test prepared by Arthur S. Otis, which he turned over to the Army. A major contribution of Otis's test, which he designed while a student in one of Terman's graduate courses, was the introduction of multiple-choice and other "objective" item types.

The tests finally developed by the Army psychologists came to be known as the Army Alpha and the Army Beta. The former was designed for general routine testing; the latter was a nonlanguage scale employed with illiterates and with foreign-born recruits who were unable to take a test in English. Both tests were suitable for administration to large groups.

Shortly after the termination of World War I, the Army tests were released for civilian use. Not only did the Army Alpha and Army Beta themselves pass through many revisions, the latest of which are even now in use, but they also served as models for most group intelligence tests. The testing movement underwent a tremendous spurt of growth. Soon group intelligence tests were being devised for all ages and types of persons, from preschool children to graduate students. Large-scale testing programs, previously impossible, were now being launched with zestful optimism. Because group tests were designed as mass testing instruments, they not only permitted the simultaneous examination of large groups but also simplified the instructions and administration procedures so as to demand a minimum of training on the part of the examiner. Schoolteachers began to give intelligence tests to their classes. College students were routinely examined prior to admission. Extensive studies of special adult groups, such as prisoners, were undertaken. And soon the general public became IQ-conscious.

The application of such group intelligence tests far outran their technical improvement. That the tests were still crude instruments was often forgotten in the rush of gathering scores and drawing practical conclusions from the results. When the tests failed to meet unwarranted expectations, skepticism and hostility toward all testing often resulted. Thus, the testing boom of the twenties, based on the indiscriminate use of tests, may have done as much to retard as to advance the progress of psychological testing.

APTITUDE TESTING

Although intelligence tests were originally designed to sample a wide variety of functions in order to estimate the individual's general intellectual level, it soon became apparent that such tests were quite limited in their coverage. Not all important functions were represented. In fact, most intelligence tests were primarily measures of verbal ability and, to a lesser extent, of the ability to handle numerical and other abstract and symbolic relations. Gradually psychologists came to recognize that the term "intelligence test" was a misnomer, since only certain aspects of intelligence were measured by such tests.
which he designed while a student in one of Terman's graduate courses. Such informati~. In Americ. of the ability to handle numerical and other abstract and symb~~ic re~ations. Soon group mtelhgence tests were being devised for all ages and types of ~ersons.cove~age. Man\' of the tests in these scales require .rime importance in our culture. The Binet tests. or admission to officer-training camps. and especially on an unpublished group intelligence test prepared by ~rthur S. This scale represents one of the earliest efforts to develop preschool and infant tests of intelligence. in terms of the type of mformation these tests are able to yield. 'Vhen. the war.l. which hc turned over to the Army. recognized the need for the rapid classification of the million and a ha1f recruits with respect to general intellectual level. b. and known as the Stanfmd-Binet (Terman. the Ar-m~' psychologists drew on all available test materials.iclual scales in the sense that the\" can be administered to onlY one person at a time. realized that more'precise designations. That the tests were still crude instruments was often f?rgotten in the rush of gathering scores and drawing practical conduslO~Sfrom the ~esults. Of special interest. For these and other reasons. B~ It was. And soon the general public became IQ-conscious. th~ tests cov~red abilities . fact. most mtelhgence tests were primarily measures of verbal ability and. When the United States entered l)!orld 'Var I in 1917. assignment to different types of sel'vicei.uages. Large-sc~le test109 progra~ns: previously impossible. the latest of which are even now in use. or mtio between mental age and chronological age. t~e latter was a nonlanguage scale employed WIth Illiterates and wIth foreign-born recruits who were unable to take a tcst in English. the most famous of which is the one developed under the direction of L. w<. Not all important functions were represented. was first used.

This shift ill terminology was made in l'ec:ognition of the fact that mallY so-called intelligence tests measure that combination of abilities demanded by academic work. perceptual. Test users.~~mber of multiple aptitude batteries !rl. For example.!!lechaniea . special battent's were constructed for pilots. it will suffice to note that the data gathered by such procedures have indicated the presence of a Dumber of rebtiyely . This . not only tllC'IQ or other global score but also scores on subtests wonld lJt' examined in the e\'aluation of the indhidual case. t at c inicians a een tr\'ing for matiy years to .14 Context of Psyclwlo{!. in vary~ng proportions. or vice versa.('ver. Subsequent methodological developments. _ ' vocationa counseling and in the selection and classification of industrial and military ersonn~1. in which. in the traditional intelligence tests. were found more often in special aptitude tests than in intelligence tests. and especially clinicians.. Research along these line~ is still in progress under the sponsorship of various branches of the armed services. have come to be known as "factor analvsis. ~)eeaus~ inaphtude tellig('J]ce tests were not designed for the purpose of . wlth crude and often errODl:'OUSesults from intelligence tests. Such investigations were begun by the English . the obtained diffl:'rence betwcen subtest scores might be reversed if the individual were retestE'd on a different day or with another foml of the same test. For example. A report of the batterics prepared in the Air Force alone occupies at least nine of the nineteen volumes devoted to the aviation psychology program during 'Vorld War II (Anny Air Forces. ReIley (1928) and L. frequently utilized such interc~l11parisons in order to obtain 1110re insight into the individual's psychological make-up. c erica. a parallel development in the stu. since the multiple aptitude batteries cover some of the traits not ordinarily me u e JlI IJ1e 1 ence tests. . !hurs~one (1935. such internal variability is also discernible on a test like the Stanford-Binet. In the Air Force. a separate score is obtained for such traits as "erhal comprehension.14C\\' items to yield a stable or reliable estimate of a specific ability:.~11 anal.ide variety of different tests.n educati0l1~l and vocational counselmg and in personnel' selectioll and' cJassincadqIl. A. Others. numerical aptitude. L.homogeneous content. ~fuch of the test research conducted in the armed services was based on factor analysis and was directed toward the construction of mu. lIote"iOlthy fact: an individual's erformance on ' test often -showed mar -c variation. as well as on that of other American and English ll1veshgators. such as spatial. 194i). If such intraindividual comparisons are to be made." The contributions that the methods of factor ana'lysis have made to test c'Onstruction will be more fully examined and ill~strated in Chapter 1:3. To some extent."-' " such butteries will be discussed in Chapter 13.\yelikewise ~en 4. u tip e ap u e atteries represent a relatively late development in the testing field.obtam. The .eveloped for clVllian.. radio operators. Verbal comprehenSIOn and numerical reasoning are examples of this tvpe of trait. r These batteries also incorporate into a comprehensivl:' and svstl:'matic testing program much of the inform.ndeJ)endent factors.yas especially apparent on gl'OUp tests.ation formerly obtained fro~l special aptihlde tl:'sts.dIHerel.jis'a result. the work of thc military psychologists during World War II s. 
For example, a person might score relatively high on a verbal subtest and low on a numerical subtest, or vice versa. To some extent, such internal variability is also discernible on a test like the Stanford-Binet, in which, for example, all items involving words might prove difficult for a particular individual, whereas items employing pictures or geometric diagrams may place him at an advantage.

Test users, and especially clinicians, frequently utilized such intercomparisons in order to obtain more insight into the individual's psychological make-up. Thus, not only the IQ or other global score but also scores on subtests would be examined in the evaluation of the individual case. Such a practice is not to be generally recommended, however, because intelligence tests were not designed for the purpose of differential aptitude analysis. Often the subtests being compared contain too few items to yield a stable or reliable estimate of a specific ability. As a result, the obtained difference between subtest scores might be reversed if the individual were retested on a different day or with another form of the same test. If such intraindividual comparisons are to be made, tests are needed that are specially designed to reveal differences in performance in various functions.

While the practical application of tests demonstrated the need for differential aptitude tests, a parallel development in the study of trait organization was gradually providing the means for constructing such tests. Statistical studies on the nature of intelligence had been exploring the interrelations among scores obtained by many persons on a wide variety of different tests. Such investigations were begun by the English psychologist Charles Spearman (1904, 1927) during the first decade of the present century. Subsequent methodological developments, based on the work of such American psychologists as T. L. Kelley (1928) and L. L. Thurstone (1935, 1947), as well as on that of other American and English investigators, have come to be known as "factor analysis."

The contributions that the methods of factor analysis have made to test construction will be more fully examined and illustrated in Chapter 13. For the present, it will suffice to note that the data gathered by such procedures have indicated the presence of a number of relatively independent factors, or traits. Some of these traits were represented, in varying proportions, in the traditional intelligence tests. Verbal comprehension and numerical reasoning are examples of this type of trait. Others, such as spatial, perceptual, and mechanical aptitudes, were found more often in special aptitude tests than in intelligence tests.

One of the chief practical outcomes of factor analysis was the development of multiple aptitude batteries. These batteries are designed to provide a measure of the individual's standing in each of a number of traits. In place of a total score or IQ, a separate score is obtained for such traits as verbal comprehension, numerical aptitude, spatial visualization, arithmetic reasoning, and perceptual speed. Such batteries thus provide a suitable instrument for making the kind of intraindividual analysis, or differential diagnosis, that clinicians had been trying for many years to obtain, with crude and often erroneous results, from intelligence tests. These batteries also incorporate into a comprehensive and systematic testing program much of the information formerly obtained from special aptitude tests, since the multiple aptitude batteries cover some of the traits not ordinarily included in intelligence tests.

Multiple aptitude batteries represent a relatively late development in the testing field. Nearly all have appeared since 1945. In this connection, the work of the military psychologists during World War II should also be noted. Much of the test research conducted in the armed services was based on factor analysis and was directed toward the construction of multiple aptitude batteries. In the Air Force, for example, special batteries were constructed for pilots, bombardiers, radio operators, range finders, and scores of other military specialists. A report of the batteries prepared in the Air Force alone occupies at least nine of the nineteen volumes devoted to the aviation psychology program during World War II (Army Air Forces, 1947-1948). Research along these lines is still in progress under the sponsorship of various branches of the armed services. Multiple aptitude batteries have likewise been developed for civilian use and are being widely applied in educational and vocational counseling and in personnel selection and classification. Examples of such batteries will be discussed in Chapter 13.

To avoid confusion, a point of terminology should be clarified. The term "aptitude test" has been traditionally employed to refer to tests measuring relatively homogeneous and clearly defined segments of ability; special aptitude tests typically measure a single aptitude, while multiple aptitude batteries measure a number of aptitudes but provide a profile of scores, one for each aptitude. The term "intelligence test," in contrast, customarily refers to more heterogeneous tests yielding a single global score, such as an IQ.
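The logic of Spearman's analysis, described above, can be stated compactly. The following equations are a modern restatement offered for illustration, not the notation of the sources cited in this chapter. In the two-factor theory, each standardized test score is regarded as the sum of a general factor g, shared by all tests, and a specific factor s unique to that test:

\[
z_j = a_j\,g + \sqrt{1 - a_j^{2}}\;s_j ,
\qquad
r_{jk} = a_j a_k \quad (j \neq k)
\]

Here z_j is the standardized score on test j, a_j is its loading on g, and g and the specific factors are taken to be uncorrelated with unit variance, so that each score retains unit variance. The correlation between any two tests then reduces to the product of their g loadings. Systematic departures from this pattern in observed correlation tables are what the multiple-factor methods of Kelley and Thurstone were developed to accommodate, yielding the additional group factors, such as verbal comprehension and numerical reasoning, mentioned in the text.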

STANDARDIZED ACHIEVEMENT TESTS

While psychologists were busy developing intelligence and aptitude tests, traditional school examinations were undergoing a number of technical improvements (Caldwell & Courtis, 1923; Ebel & Damrin, 1960). An important step in this direction was taken by the Boston public schools in 1845, when written examinations were substituted for the oral interrogation of students by visiting examiners. Commenting on this innovation, Horace Mann cited arguments remarkably similar to those used much later to justify the replacement of essay questions by objective multiple-choice items. The written examinations, Mann noted, put all students in a uniform situation, permitted a wider coverage of content, reduced the chance element in question choice, and eliminated the possibility of favoritism on the examiner's part.

After the turn of the century, the first standardized tests for measuring the outcomes of school instruction began to appear. Spearheaded by the work of E. L. Thorndike, these tests utilized measurement principles developed in the psychological laboratory. Examples include scales for rating the quality of handwriting and written compositions, as well as tests in spelling, arithmetic computation, and arithmetic reasoning. Still later came the achievement batteries, initiated by the publication of the first edition of the Stanford Achievement Test in 1923. Its authors were three early leaders in test development: Truman L. Kelley, Giles M. Ruch, and Lewis M. Terman. Foreshadowing many characteristics of modern testing, this battery provided comparable measures of performance in different school subjects, evaluated in terms of a single normative group.

Meanwhile, evidence was accumulating regarding the lack of agreement among teachers in grading essay tests. By 1930 it was widely recognized that essay tests were not only more time-consuming for examiners and examinees, but also yielded less reliable results than the "new type" of objective items. As the latter came into increasing use in standardized achievement tests, the decade of the 1930s also witnessed the introduction of test-scoring machines, for which the new objective tests could be readily adapted.

The establishment of statewide, regional, and national testing programs was another noteworthy parallel development. Probably the best known of these programs is that of the College Entrance Examination Board (CEEB). Established at the turn of the century to reduce duplication in the examining of entering college freshmen, this program has undergone profound changes in its testing procedures and in the number and nature of participating colleges, changes that reflect intervening developments in both testing and education. In 1947, the testing functions of the CEEB were merged with those of the Carnegie Corporation and the American Council on Education to form Educational Testing Service (ETS). In subsequent years, ETS has assumed responsibility for a growing number of testing programs on behalf of universities, professional schools, government agencies, and other institutions. Mention should also be made of the American College Testing Program, established in 1959 to screen applicants to colleges not included in the CEEB program, and of several national testing programs for the selection of highly talented students for scholarship awards.

As more and more psychologists trained in psychometrics participated in the construction of standardized achievement tests, the technical aspects of achievement tests increasingly came to resemble those of intelligence and aptitude tests. At the same time, there was a growing emphasis on the design of items to test the understanding and application of knowledge and other broad educational objectives. The increasing efforts to prepare achievement tests that would measure the attainment of broad educational goals, as contrasted to the recall of factual minutiae, also made the content of achievement tests resemble more closely that of intelligence tests. Today the difference between these two types of tests is chiefly one of degree of specificity of content and extent to which the test presupposes a designated course of prior instruction. Moreover, the procedures for constructing and evaluating all these tests have much in common.

Achievement tests are used not only for educational purposes but also in the selection of applicants for industrial and government jobs. Mention has already been made of the systematic use of civil service examinations in the Chinese empire, dating from 1115 B.C. In modern times, the selection of government employees by examination was introduced in European countries in the late eighteenth and early nineteenth centuries. The United States Civil Service Commission installed competitive examinations as a regular procedure in 1883 (Kavruck, 1956). Test construction techniques developed during and prior to World War I were introduced into the examination program of the United States Civil Service with the appointment of L. J. O'Rourke as director of the newly established research division in 1922.
17 term "aptitude test" has been tracHtiollalJ" cmployed to refer to tests measuring relativel\" homo ('ncous and dparlv defined sc rn1C'nts of • I I \. professional schools. 19. Test construction techniques developed during and prior to World "'a~ I were introduded into tll<:'examination program of the United States Ch-il Service with the appointment of L. An important step in this direction was taken by the Boston pubhc schools in 1845. L. as.5 . Ruch. initiated by the publication of the first edition of the Stanford Achievement Test in 192:3. S~)ecial aptitu~c tests typically measure a single aptitude. In 1947.B. \[ention should also he made of the American Collegc Testing Program established in 1959 to scrccn applicants to colleges not included i~ thc CEEB program. The establishment of statewide. dating from 111. In subscq.c. put all students in a uniform situation. Examples include scales for rating the quality of handwriting and written compos. for which the new ohjec:tive tests could be readily adapted. this battery provided com~arable measu~'es of perfo~ance in different school subjects. Spearheaded h~' the work of E. evidence was accumulating regarding the lack of agreement among teachers in grading essay tests.. 192:3. well ~s tests in spelling. Achievem. Aft~r the turn of the centurv. \fention has already been made of the systematic use of ci\'i\ sen'jce examinations in the Chinese empire. As the latter came into increasing use in standardized achievement tests. 1960 ~. refers to more hderogence-. O'Rourke as director of the newlv established research dh'ision in 1922. government agencies. evaluated 111 terms of a smgle norma live group.itiol1s.as another noteworthy parallel denlopment. Thorndike. The incre~s!ng effOlts to prepare achIevement tests that would measure the attainment of broad educational goals.56). By . one for eaeh aptitude. Co) While psychologists were busy developing intelligence and aptitude tests.ical Tcsli. The l!llited States Chi! Service Commission in~talled competitive examinations as a regular procedure in 1883 (Kanuck. \lann noted.16 COIl!ex! of Psyclwlogict. Procedur~s for cons. The written examiuations.~ of Psyc1IO/<l{!.

Another area of psychological testing is concerned with the affective or nonintellectual aspects of behavior. Tests designed for this purpose are commonly known as personality tests, although some psychologists prefer to use the term personality in a broader sense, to refer to the entire individual; intellectual as well as nonintellectual traits would thus be included under this heading. In the terminology of psychological testing, however, the designation "personality test" most often refers to measures of such characteristics as emotional adjustment, interpersonal relations, motivation, interests, and attitudes.

An early precursor of personality testing may be recognized in Kraepelin's use of the free association test with abnormal patients. In this test, the subject is given specially selected stimulus words and is required to respond to each with the first word that comes to mind. Kraepelin (1892) also employed this technique to study the psychological effects of fatigue, hunger, and drugs, and concluded that all these agents increase the relative frequency of superficial associations. Sommer (1894), also writing during the last decade of the nineteenth century, suggested that the free association test might be used to differentiate between the various forms of mental disorder. The free association technique has subsequently been utilized for a variety of testing purposes and is still currently employed. Mention should also be made of the work of Galton, Pearson, and Cattell in the development of standardized questionnaire and rating-scale techniques. Although originally devised for other purposes, these procedures were eventually employed by others in constructing some of the most common types of current personality tests.

The prototype of the personality questionnaire, or self-report inventory, is the Personal Data Sheet developed by Woodworth during World War I (DuBois, 1970; Goldberg, 1971; Symonds, 1931). This test was designed as a rough screening device for identifying seriously neurotic men who would be unfit for military service. The inventory consisted of a number of questions dealing with common neurotic symptoms, which the individual answered about himself. A total score was obtained by counting the number of symptoms reported. The Personal Data Sheet was not completed early enough to permit its operational use before the war ended. Immediately after the war, however, civilian forms were prepared, including a special form for use with children. The Woodworth Personal Data Sheet served as a model for most subsequent emotional adjustment inventories. In some of these questionnaires, an attempt was made to subdivide emotional adjustment into more specific forms, such as home adjustment, school adjustment, and vocational adjustment. Other tests concentrated more intensively on a narrower area of behavior or were concerned with more distinctly social responses, such as dominance-submission in interpersonal contacts. A later development was the construction of tests for quantifying the expression of interests and attitudes. These tests, too, were based essentially on questionnaire techniques.

Another approach to the measurement of personality is through the application of performance or situational tests. In such tests, the subject has a task to perform whose purpose is often disguised. Most of these tests simulate everyday-life situations quite closely. The first extensive application of such techniques is to be found in the tests developed in the late twenties and early thirties by Hartshorne, May, and their associates (1928, 1929, 1930). This series, standardized on schoolchildren, was concerned with such behavior as cheating, lying, stealing, cooperativeness, and persistence. Objective, quantitative scores could be obtained on each of a large number of specific tests. A more recent illustration, for the adult level, is provided by the series of situational tests developed during World War II in the Assessment Program of the Office of Strategic Services (OSS, 1948). These tests were concerned with relatively complex and subtle social and emotional behavior and required rather elaborate facilities and trained personnel for their administration. The interpretation of the subject's responses, moreover, was relatively subjective.

Projective techniques represent a third approach to the study of personality, and one that has shown phenomenal growth, especially among clinicians. In such tests, the subject is given a relatively unstructured task that permits wide latitude in its solution. The assumption underlying such methods is that the individual will project his characteristic modes of response into the task. Like the performance and situational tests, projective techniques are more or less disguised in their purpose, thereby reducing the chances that the subject can deliberately create a desired impression. The previously cited free association test represents one of the earliest types of projective techniques. Sentence-completion tests have also been used in this manner. Other tasks commonly employed in projective techniques include drawing, arranging toys to create a scene, extemporaneous dramatic play, and interpreting pictures or inkblots.

All available types of personality tests present serious difficulties, both practical and theoretical. Each approach has its own special advantages and disadvantages. On the whole, personality testing has lagged far behind aptitude testing in its positive accomplishments. But such lack of progress is not to be attributed to insufficient effort. Research on the measurement of personality has attained impressive proportions, and many ingenious devices and technical improvements are under investigation. It is rather the special difficulties encountered in the measurement of personality that account for the slow advances in this area.

Psychological testing is in a state of rapid change. There are shifting orientations, a constant stream of new tests, revised forms of old tests, and additional data that may refine or alter the interpretation of scores on existing tests. The accelerating rate of change, together with the vast number of available tests, makes it impracticable to survey specific tests in any single text. More intensive coverage of testing instruments and problems in special areas can be found in books dealing with the use of tests in such fields as counseling, clinical practice, personnel selection, and education. References to such publications are given in the appropriate chapters of this book. In order to keep abreast of current developments, anyone working with tests needs to be familiar with more direct sources of contemporary information about tests.

One of the most important sources is the series of Mental Measurements Yearbooks (MMY) edited by Buros (1972). These yearbooks cover nearly all commercially available psychological, educational, and vocational tests published in English. Each yearbook includes tests published during a specified period, thus supplementing rather than supplanting the earlier yearbooks. The Seventh Mental Measurements Yearbook, for example, is concerned principally with tests appearing between 1964 and 1970. Tests of continuing interest, however, may be reviewed repeatedly in successive yearbooks, as new data accumulate from pertinent research. The earliest publications in this series were merely bibliographies of tests. Beginning in 1938, the yearbook assumed its current form, which includes critical reviews of most of the tests by one or more test experts, as well as a complete list of published references pertaining to each test. Routine information regarding publisher, price, forms, and age of subjects for whom the test is suitable is also regularly given.

A comprehensive bibliography covering all types of published tests available in English-speaking countries is provided by Tests in Print (Buros, 1974). Two related sources are Reading Tests and Reviews (Buros, 1968) and Personality Tests and Reviews (Buros, 1970). Both include a number of tests not found in any volume of the MMY, as well as master indexes that facilitate the location of tests in the MMY. Reviews of specific tests are also published in several psychological and educational journals, such as the Journal of Educational Measurement and the Journal of Counseling Psychology.

Since 1970, several sourcebooks have appeared which provide information about unpublished or little-known instruments, largely supplementing the material listed in the MMY. A comprehensive survey of such instruments can be found in A Sourcebook for Mental Health Measures (Comrey, Backer, & Glaser, 1973). Containing approximately 1,100 abstracts, this sourcebook includes tests, questionnaires, rating scales, and other devices for assessing both aptitude and personality variables in adults and children. Another similar reference is entitled Measures for Psychological Assessment (Chun, Cobb, & French, 1975); covering only tests not listed in the MMY, this handbook describes instruments located through a search of 26 measurement-related journals for the years 1960 to 1970. Information on assessment devices suitable for children from birth to 12 years is summarized in Tests and Measurements in Child Development: A Handbook (Johnson & Bommarito, 1971). For each measure, this volume gives the original source as well as an annotated bibliography of the studies in which the measure was subsequently used. A still more specialized collection covers measures of social and emotional development applicable to children between the ages of 3 and 6 years (Walker, 1973). Selection criteria included availability of the test to professionals, sufficient length, adequate instructions for administration and scoring, and ease of use (i.e., paper-and-pencil tests not requiring expensive or elaborate equipment).

Finally, it should be noted that the most direct source of information regarding specific current tests is provided by the catalogues of test publishers and by the manual that accompanies each test. A comprehensive list of test publishers, with addresses, can be found in the latest Mental Measurements Yearbook. For ready reference, the names and addresses of some of the larger American publishers and distributors of psychological tests are given in Appendix D. Catalogues of current tests can be obtained from each of these publishers on request. Manuals and specimen sets of tests can be purchased by qualified users.

The test manual should provide the essential information required for administering, scoring, and evaluating a particular test. In it should be found full and detailed instructions, scoring key, norms, and data on reliability and validity. Moreover, the manual should report the number and nature of the subjects on whom norms, reliability, and validity were established, the methods employed in computing indices of reliability and validity, and the specific criteria against which validity was checked. In the event that the necessary information is too lengthy to fit conveniently into the manual, references to the printed sources in which such information can be readily located should be given. The manual should, in other words, enable the test user to evaluate the test before choosing it for his specific purpose. It might be added that many test manuals still fall short of this goal. But some of the larger and more professionally oriented test publishers are giving increasing attention to the preparation of manuals that meet adequate scientific standards.

A succinct but comprehensive guide for the evaluation of psychological tests is to be found in Standards for Educational and Psychological Tests (1974), published by the American Psychological Association. These standards represent a summary of recommended practices in test construction, based on the current state of knowledge in the field. They are concerned with the information about validity, reliability, norms, and other test characteristics that ought to be reported in the manual. In their latest revision, the Standards also provide a guide for the proper use of tests and for the correct interpretation and application of test results. Relevant portions of the Standards will be cited in the following chapters, in connection with the appropriate topics. An enlightened public of test users provides the firmest assurance that such standards will be maintained and improved in the future.

CHAPTER 2

Nature and Use of Psychological Tests

THE introduction in Chapter 1 has already suggested some of the many uses of psychological tests, as well as the wide diversity of available tests. Although the general public may still associate psychological tests most closely with "IQ tests" and with tests designed to detect emotional disorders, these tests represent only a small proportion of the available types of instruments. The major categories of psychological tests will be discussed and illustrated in Parts 3, 4, and 5, which cover tests of general intellectual level, traditionally called intelligence tests; tests of separate abilities, including multiple aptitude batteries, tests of special aptitudes, and achievement tests; and personality tests, concerned with measures of emotional and motivational traits, interpersonal behavior, interests, attitudes, and other noncognitive characteristics.

In the face of such diversity in nature and purpose, what are the common differentiating characteristics of psychological tests? How do psychological tests differ from other methods of gathering information about individuals? The answer is to be found in certain fundamental features of both the construction and use of tests. It is with these features that the present chapter is concerned.

BEHAVIOR SAMPLE. A psychological test is essentially an objective and standardized measure of a sample of behavior. Psychological tests are like tests in any other science, insofar as observations are made on a small but carefully chosen sample of an individual's behavior. In this respect, the psychologist proceeds in much the same way as the chemist who tests a patient's blood or a community's water supply by analyzing one or more samples of it. If the psychologist wishes to test the extent of a child's vocabulary, a clerk's ability to perform arithmetic computations, or a pilot's eye-hand coordination, he examines their performance with a representative set of words, arithmetic problems, or motor tests.

Whether or not the test adequately covers the behavior under consideration obviously depends on the number and nature of the items in the sample. For example, an arithmetic test consisting of only five problems, or one including only multiplication items, would be a poor measure of the individual's computational skill. A vocabulary test composed entirely of baseball terms would hardly provide a dependable estimate of a child's total range of vocabulary.

The diagnostic or predictive value of a psychological test depends on the degree to which it serves as an indicator of a relatively broad and significant area of behavior. Measurement of the behavior sample directly covered by the test is rarely, if ever, the goal of psychological testing. The child's knowledge of a particular list of 50 words is not, in itself, of great interest. Nor is the job applicant's performance on a specific set of 20 arithmetic problems of much importance. If, however, it can be demonstrated that there is a close correspondence between the child's knowledge of the word list and his total mastery of vocabulary, or between the applicant's score on the arithmetic problems and his computational performance on the job, then the tests are serving their purpose.

It should be noted in this connection that the test items need not resemble closely the behavior the test is to predict. It is only necessary that an empirical correspondence be demonstrated between the two. The degree of similarity between the test sample and the predicted behavior may vary widely. At one extreme, the test may coincide completely with a part of the behavior to be predicted. An example might be a foreign vocabulary test in which the students are examined on 20 of the 50 new words they have studied; another example is provided by the road test taken prior to obtaining a driver's license. A lesser degree of similarity is illustrated by many vocational aptitude tests administered prior to job training, in which there is only a moderate resemblance between the tasks performed on the job and those incorporated in the test. At the other extreme one finds projective personality tests such as the Rorschach inkblot test, in which an attempt is made to predict from the subject's associations to inkblots how he will react to other people, to emotionally toned stimuli, and to other complex, everyday-life situations. Despite their superficial differences, all these tests consist of samples of the individual's behavior. And each must prove its worth by an empirically demonstrated correspondence between the subject's performance on the test and in other situations.

Whether the term "diagnosis" or the term "prediction" is employed in this connection represents a minor distinction. Prediction commonly connotes a temporal estimate, the individual's future performance on a job, for example, being forecast from his present test performance. In a broader sense, however, even the diagnosis of present condition, such as mental retardation or emotional disorder, implies a prediction of what the individual will do in situations other than the present test. It is logically simpler to consider all tests as behavior samples from which predictions regarding other behavior can be made. Different types of tests can then be characterized as variants of this basic pattern.

Another point that should be considered at the outset pertains to the concept of capacity. It is entirely possible, for example, to devise a test for predicting how well an individual can learn French before he has even begun the study of French. Such a test would involve a sample of the types of behavior required to learn the new language, but would in itself presuppose no knowledge of French. It could then be said that this test measures the individual's "capacity" or "potentiality" for learning French. Such terms should, however, be used with caution in reference to psychological tests. Only in the sense that a present behavior sample can be used as an indicator of other, future behavior can we speak of a test measuring "capacity." No psychological test can do more than measure behavior. Whether such behavior can serve as an effective index of other behavior can be determined only by empirical try-out.

STANDARDIZATION. It will be recalled that in the initial definition a psychological test was described as a standardized measure. Standardization implies uniformity of procedure in administering and scoring the test. If the scores obtained by different individuals are to be comparable, testing conditions must obviously be the same for all. Such a requirement is only a special application of the need for controlled conditions in all scientific observations. In a test situation, the single independent variable is usually the individual being tested.

In order to secure uniformity of testing conditions, the test constructor provides detailed directions for administering each newly developed test. The formulation of such directions is a major part of the standardization of a new test. Such standardization extends to the exact materials employed, time limits, oral instructions to subjects, preliminary demonstrations, ways of handling queries from subjects, and every other detail of the testing situation. Many other, more subtle factors may also influence the subject's performance on certain tests. Thus, in giving instructions or presenting problems orally, consideration must be given to the rate of speaking, tone of voice, inflection, pauses, and facial expression. In a test involving the detection of absurdities, for example, the correct answer may be given away by smiling or pausing when the crucial word is read. Standardized testing procedure, from the examiner's point of view, will be discussed further in the section of this chapter dealing with problems of test administration.
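The behavior-sampling principle introduced at the start of this chapter lends itself to a simple numerical illustration. The sketch below is purely illustrative: the domain size, mastery rate, and sample size are invented for the example and are not figures from the text.

    import random

    # Hypothetical domain of 1,000 words; suppose the child actually knows 600
    # of them (a 60% mastery rate, unknown to the examiner). Invented numbers.
    random.seed(1)
    domain = list(range(1000))
    known = set(random.sample(domain, 600))

    # The test draws a representative sample of 50 words, just as a vocabulary
    # test samples the child's total vocabulary instead of measuring it all.
    test_items = random.sample(domain, 50)
    score = sum(item in known for item in test_items)

    # The proportion correct on the sample estimates total mastery. A biased
    # sample (e.g., baseball terms only) would not support this generalization.
    print(f"Sample score: {score}/50 -> estimated mastery {score / 50:.0%}")

With a representative sample, the estimate will hover near the true 60 percent; the larger and more representative the sample, the smaller the sampling error, which is the statistical counterpart of the point made above about the number and nature of the items.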

Another important step in the standardization of a test is the establishment of norms. Psychological tests have no predetermined standards of passing or failing; an individual's score is evaluated by comparing it with the scores obtained by others on the same test. As its name implies, a norm is the normal or average performance. Thus, if normal 8-year-old children complete 12 out of 50 problems correctly on a particular arithmetic reasoning test, then the 8-year-old norm on this test corresponds to a score of 12. The latter is known as the raw score on the test. It may be expressed as number of correct items, time required to complete a task, number of errors, or some other objective measure appropriate to the content of the test. Such a raw score is meaningless until evaluated in terms of a suitable set of norms.

In the process of standardizing a test, it is administered to a large, representative sample of the type of subjects for whom it is designed. This group, known as the standardization sample, serves to establish the norms. Such norms indicate not only the average performance but also the relative frequency of varying degrees of deviation above and below the average. It is thus possible to evaluate different degrees of superiority and inferiority. The specific ways in which such norms may be expressed will be considered in Chapter 4. All permit the designation of the individual's position with reference to the normative or standardization sample.

It might also be noted that norms are established for personality tests in essentially the same way as for aptitude tests. The norm on a personality test is not necessarily the most desirable or "ideal" performance, any more than a perfect or errorless score is the norm on an aptitude test. On both types of tests, the norm corresponds to the performance of typical or average individuals. On dominance-submission tests, for example, the norm falls at an intermediate point representing the degree of dominance or submission manifested by the average individual. Similarly, in an emotional adjustment inventory, the norm does not ordinarily correspond to a complete absence of unfavorable or maladaptive responses, since a few such responses occur in the majority of "normal" individuals in the standardization sample. It is thus apparent that psychological tests, of whatever type, are based on empirically established norms.

OBJECTIVE MEASUREMENT OF DIFFICULTY. Reference to the definition of a psychological test with which this discussion opened will show that such a test was characterized as an objective as well as a standardized measure. In what specific ways are such tests objective? Some aspects of the objectivity of psychological tests have already been touched on in the discussion of standardization. Thus, the administration, scoring, and interpretation of scores are objective insofar as they are independent of the subjective judgment of the individual examiner. Any one individual should theoretically obtain the identical score on a test regardless of who happens to be his examiner. This is not entirely so, of course, since perfect standardization and objectivity have not been attained in practice. But at least such objectivity is the goal of test construction and has been achieved to a reasonably high degree in most tests.

There are other major ways in which psychological tests can be properly described as objective. The determination of the difficulty level of an item or of a whole test is based on objective, empirical procedures. When Binet and Simon prepared their original, 1905 scale for the measurement of intelligence, they arranged the 30 items of the scale in order of increasing difficulty. Such difficulty, it will be recalled, was determined by trying out the items on 50 normal and a few mentally retarded children. The items correctly solved by the largest number of children were, ipso facto, taken to be the easiest; those passed by relatively few children were regarded as more difficult items. By this procedure, an empirical order of difficulty was established. This early example typifies the objective measurement of difficulty level, which is now common practice in psychological test construction.

Not only the arrangement but also the selection of items for inclusion in a test can be determined by the proportion of subjects in the trial samples who pass each item. Thus, if there is a bunching of items at the easy or difficult end of the scale, some items can be discarded. Similarly, if items are sparse in certain portions of the difficulty range, new items can be added to fill the gaps. More technical aspects of item analysis will be considered in Chapter 8.
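Because the difficulty values and norms just described reduce to simple proportions and averages, they can be illustrated in a few lines of code. The sketch below uses a fabricated tryout sample of five children and five items; it is not Binet's data or the arithmetic-reasoning norms cited earlier.

    # Each row is one child's record on five tryout items (1 = pass, 0 = fail).
    responses = [
        [1, 1, 0, 1, 0],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 0],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1],
    ]
    n_children = len(responses)
    n_items = len(responses[0])

    # Difficulty of an item = proportion of the tryout sample passing it.
    # Items passed by the most children are, ipso facto, the easiest.
    p_values = [sum(row[i] for row in responses) / n_children
                for i in range(n_items)]
    order = sorted(range(n_items), key=lambda i: p_values[i], reverse=True)
    print("Proportion passing each item:", p_values)
    print("Empirical order, easiest to hardest:", order)

    # A norm for this group = the average raw score of the standardization
    # sample, against which any new raw score can be compared.
    raw_scores = [sum(row) for row in responses]
    print("Group norm (mean raw score):", sum(raw_scores) / n_children)

Items with proportions bunched near 1.0 or 0.0 contribute little information and could be discarded or replaced, which is the item-selection principle described above.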

RELIABILITY. How good is this test? Does it really work? These questions could, and occasionally do, result in long hours of futile discussion. Subjective opinions, hunches, and personal biases may lead, on the one hand, to extravagant claims regarding what a particular test can accomplish and, on the other hand, to stubborn rejection. The only way questions such as these can be conclusively answered is by empirical trial. The objective evaluation of psychological tests involves primarily the determination of the reliability and the validity of the test in specified situations.

As used in psychometrics, the term reliability always means consistency. Test reliability is the consistency of scores obtained by the same persons when retested with the identical test or with an equivalent form of the test. If a child receives an IQ of 110 on Monday and an IQ of 80 when retested on Friday, it is obvious that little or no confidence can be put in either score. Similarly, if in one set of 50 words an individual identifies 40 correctly, whereas in another, supposedly equivalent set he gets a score of only 20 right, then neither score can be taken as a dependable index of his verbal comprehension. To be sure, in both illustrations it is possible that only one of the two scores is in error, but this could be demonstrated only by further retests. From the given data, we can conclude only that both scores cannot be right. Whether one or neither is an adequate estimate of the individual's ability in vocabulary cannot be established without additional information.

Before a psychological test is released for general use, a thorough, objective check of its reliability should be carried out. Reliability can be checked with reference to temporal fluctuations, the particular selection of items or behavior sample constituting the test, the role of different examiners or scorers, and other aspects of the testing situation. It is essential to specify the type of reliability and the method employed to determine it, because the same test may vary in these different aspects. The number and nature of the individuals on whom reliability was checked should likewise be reported. With such information, the test user can predict whether the test will be about equally reliable for the group with which he expects to use it, or whether it is likely to be more reliable or less reliable. The different types of test reliability, as well as methods of measuring each, will be considered in Chapter 5.

VALIDITY. Undoubtedly the most important question to be asked about any psychological test concerns its validity, i.e., the degree to which the test actually measures what it purports to measure. Validity provides a direct check on how well the test fulfills its function. The determination of validity usually requires independent, external criteria of whatever the test is designed to measure. For example, if a medical aptitude test is to be used in selecting promising applicants for medical school, ultimate success in medical school would be a criterion. In the process of validating such a test, it would be administered to a large group of students at the time of their admission to medical school. Some measure of performance in medical school would eventually be obtained for each student on the basis of grades, ratings by instructors, success or failure in completing training, and the like. Such a composite measure constitutes the criterion with which each student's initial test score is to be correlated. A high correlation, or validity coefficient, would signify that those individuals who scored high on the test had been relatively successful in medical school, whereas those scoring low on the test had done poorly in medical school. A low correlation would indicate little correspondence between test score and criterion measure, and hence poor validity for the test. The validity coefficient enables us to determine how closely the criterion performance could have been predicted from the test scores.

In a similar manner, tests designed for other purposes can be validated against appropriate criteria. A vocational aptitude test, for example, can be validated against the on-the-job success of a trial group of new employees. A pilot aptitude battery can be validated against achievement in flight training. Tests designed for broader and more varied uses are validated against a number of criteria, and their validity can be established only by the gradual accumulation of data from many different kinds of investigations. The special problems encountered in determining the validity of different types of tests will be discussed in Chapters 6 and 7.

The reader may have noticed an apparent paradox in the concept of test validity. If it is necessary to follow up the subjects or in other ways to obtain independent measures of what the test is trying to predict, why not dispense with the test? The answer to this riddle is to be found in the distinction between the validation group on the one hand and the groups on which the test will eventually be employed for operational purposes on the other. Before the test is ready for use, its validity must be established on a representative sample of subjects. The scores of these persons are not themselves employed for operational purposes but serve only in the process of testing the test. If the test proves valid by this method, it can then be used on other samples in the absence of criterion measures.

It might still be argued that we would need only to wait for the criterion measure to mature, to become available, on any group in order to obtain the information that the test is trying to predict. Thus, we could determine which applicants will succeed on a job or which students will satisfactorily complete college by admitting all who apply and waiting for subsequent developments! But such a procedure would be so wasteful of time and energy as to be prohibitive in most instances. It is the very wastefulness of this procedure, and its deleterious emotional impact on individuals, that tests are designed to minimize. By means of tests, the person's present level of prerequisite skills, knowledge, and other relevant characteristics can be assessed with a determinable margin of error. The more valid and reliable the test, the smaller will be this margin of error.

Validity tells us more than the degree to which the test is fulfilling its function; by studying the validation data, we can objectively determine what the test is measuring. It would thus be more accurate to define validity as the extent to which we know what the test measures.
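Both the reliability and the validity of a test are ordinarily reported as correlation coefficients. The sketch below computes a Pearson correlation from first principles; the score lists are invented for illustration and are not data from the medical-school example. The same function yields a retest reliability coefficient when given two administrations of one test, and a validity coefficient when given test scores and criterion measures.

    from math import sqrt

    def pearson_r(x, y):
        """Pearson product-moment correlation between two equal-length lists."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    # Retest reliability: the same five persons tested on two occasions.
    first = [110, 95, 102, 88, 120]
    second = [108, 97, 100, 90, 118]
    print("Retest reliability:", round(pearson_r(first, second), 2))

    # Validity: admission test scores against a later criterion of success,
    # e.g., a composite of grades and instructor ratings (invented values).
    test_scores = [50, 62, 47, 70, 58]
    criterion = [2.9, 3.1, 2.6, 3.6, 2.7]
    print("Validity coefficient:", round(pearson_r(test_scores, criterion), 2))

A coefficient near 1.0 in the first call reflects the consistency demanded of a dependable test; a high value in the second indicates that criterion standing could have been predicted closely from the test scores, while a value near zero would signal poor validity.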

The interpretation of test scores would undoubtedly be clearer and less ambiguous if tests were regularly named in terms of the criterion through which they had been validated. A tendency in this direction can be recognized in such test labels as "scholastic aptitude test" and "personnel classification test" in place of the vague title "intelligence test."

REASONS FOR CONTROLLING THE USE OF PSYCHOLOGICAL TESTS

"May I have a Stanford-Binet blank? My nephew has to take it next week for admission to School X and I'd like to give him some practice so he can pass."

"Last year you gave a new personality test to our employees for research purposes. We would now like to have the scores for their personnel folders."

"My roommate is studying psych. She gave me a personality test and I came out neurotic. I've been too upset to go to class ever since."

"Last night I answered the questions in an intelligence test published in a magazine and I got an IQ of 80; I think psychological tests are silly."

"To improve the reading program in our school, we need a culture-free IQ test that measures each child's innate potential."

The above remarks are not imaginary. Each is based on a real incident, and the list could easily be extended by any psychologist. Such remarks illustrate potential misuses or misinterpretations of psychological tests, in ways that render the tests worthless or hurt the individual. Like any scientific instrument or precision tool, psychological tests must be properly used to be effective. In the hands of either the unscrupulous or the well-meaning but uninformed user, such tests can cause serious damage. There are two principal reasons for controlling the use of psychological tests: (a) to prevent general familiarity with test content, which would invalidate the test, and (b) to ensure that the test is used by a qualified examiner.

Test content clearly has to be restricted in order to forestall deliberate efforts to fake scores. If an individual were to memorize the correct responses on a test of color blindness, such a test would no longer be a measure of color vision for him; under such conditions, the test would be completely invalidated. In other cases, the effect of familiarity may be less obvious, or the test may be invalidated in good faith by misinformed persons. A schoolteacher, for example, may give her class special practice in problems closely resembling those on an intelligence test, "so that the pupils will be well prepared to take the test." Such an attitude is simply a carry-over from the usual procedure of preparing for a school examination. When applied to an intelligence test, however, such specific training or coaching is likely to raise scores on the test without appreciably affecting the broader area of behavior the test tries to sample. Under these conditions, the validity of the test as a predictive instrument is reduced.

The need for a qualified examiner is evident in each of the three major aspects of the testing situation: selection of the test, administration and scoring, and interpretation of scores. Tests cannot be chosen like lawn mowers, from a mail-order catalogue. They cannot be evaluated by name, author, or other easy marks of identification. To be sure, it requires no psychological training to consider such factors as cost, bulkiness and ease of transporting test materials, testing time required, and ease and rapidity of scoring. Information on these practical points can usually be obtained from a test catalogue and should be taken into account in planning a testing program. For the test to serve its function, however, an evaluation of its technical merits in terms of such characteristics as validity, reliability, difficulty level, and norms is essential. Only in such a way can the test user determine the appropriateness of the test for his particular purpose and its suitability for the type of persons with whom he plans to use it.

The introductory discussion of test standardization earlier in this chapter has already suggested the importance of a trained examiner. An adequate realization of the need to follow instructions precisely, as well as a thorough familiarity with the standard instructions, is required if the test scores obtained by different examiners are to be comparable or if any one individual's score is to be evaluated in terms of the published norms. Careful control of testing conditions is also essential. Similarly, incorrect or inaccurate scoring may render the test score worthless. In the absence of proper checking procedures, scoring errors are far more likely to occur than is generally realized.

The proper interpretation of test scores requires a thorough understanding of the test, the individual, and the testing conditions. What is being measured can be objectively determined only by reference to the specific procedures in terms of which the particular test was validated. Other information, pertaining to reliability, the nature of the group on which norms were established, and the like, is likewise relevant. Some background data regarding the individual being tested are essential in interpreting any test score. The same score may be obtained by different persons for very different reasons; the conclusions to be drawn from such scores would therefore be quite dissimilar. Finally, some consideration must also be given to special factors that may have influenced a particular score, such as unusual testing conditions, the temporary emotional or physical state of the subject, and the extent of the subject's previous experience with tests.

TEST ADMINISTRATION

The basic rationale of testing involves generalization from the behavior sample observed in the testing situation to behavior manifested in other, nontest situations. A test score should help us to predict how the client will feel and act outside the clinic, how the student will achieve in college courses, and how the applicant will perform on the job. Any influences that are specific to the test situation constitute error variance and reduce test validity. It is therefore important to identify any test-related influences that may limit or impair the generalizability of test results.

A whole volume could easily be devoted to a discussion of desirable procedures of test administration. But such a survey falls outside the scope of the present book. Moreover, it is more practicable to acquire such techniques within specific settings, because no one person would normally be concerned with all forms of testing, from the examination of infants to the clinical testing of psychotic patients or the administration of a mass testing program for military personnel. The present discussion will therefore deal principally with the common rationale of test administration rather than with specific questions of implementation. For detailed suggestions regarding testing procedure, see Palmer (1970), Sattler (1974), and Terman and Merrill (1960) for individual testing, and Clemans (1971) for group testing.

ADVANCE PREPARATION OF EXAMINERS. The most important requirement for good testing procedure is advance preparation. In testing there can be no emergencies. Special efforts must therefore be made to foresee and forestall emergencies. Only in this way can uniformity of procedure be assured.

Advance preparation for the testing session takes many forms. Memorizing the exact verbal instructions is essential in most individual testing. Even in a group test in which the instructions are read to the subjects, some previous familiarity with the statements to be read prevents misreading and hesitation and permits a more natural, informal manner during test administration. The preparation of test materials is another important preliminary step. In individual testing, and especially in the administration of performance tests, such preparation involves the actual layout of the necessary materials to facilitate subsequent use with a minimum of search or fumbling. Materials should generally be placed on a table near the testing table so that they are within easy reach of the examiner but do not distract the subject. When apparatus is employed, frequent periodic checking and calibration may be necessary. In group testing, all test blanks, answer sheets, special pencils, or other materials needed should be carefully counted, checked, and arranged in advance of the testing day.

Thorough familiarity with the specific testing procedure is another important prerequisite in both individual and group testing. For individual testing, supervised training in the administration of the particular test is usually essential. Depending upon the nature of the test and the type of subjects to be examined, such training may require from a few demonstration and practice sessions to over a year of instruction. For group testing, and especially in large-scale projects, advance preparation may include the briefing of examiners and proctors, so that each is fully informed about the functions he is to perform. In general, the examiner reads the instructions, takes care of timing, and is in charge of the group in any one testing room. The proctors hand out and collect test materials, make certain that subjects are following instructions, answer individual questions of subjects within the limitations specified in the manual, and prevent cheating.

TESTING CONDITIONS. Standardized procedure applies not only to verbal instructions, timing, materials, and other aspects of the tests themselves but also to the testing environment. Some attention should be given to the selection of a suitable testing room. This room should be free from undue noise and distraction and should provide adequate lighting, ventilation, and seating facilities for the subjects. Special steps should also be taken to prevent interruptions during the test. Posting a sign on the door to indicate that testing is in progress is effective, provided all personnel have learned that such a sign means no admittance under any circumstances. In the testing of large groups, locking the doors or posting an assistant outside each door may be necessary to prevent the entrance of latecomers.

It is important to realize the extent to which testing conditions may influence scores. Even apparently minor aspects of the testing situation may appreciably alter performance. Such a factor as the use of desks or of chairs with desk arms, for example, proved to be significant in a group testing project with high school students, the groups using desks tending to obtain higher scores (Kelley, 1943; Traxler & Hilkert, 1942). There is also evidence to show that the type of answer sheet employed may affect test scores (Bell, Hoff, & Hoyt). Because of the establishment of independent test-scoring and data-processing agencies that provide their own machine-scorable answer sheets, examiners sometimes administer group tests with answer sheets other than those used in the standardization sample. In the absence of empirical verification, the equivalence of these answer sheets cannot be assumed. The Differential Aptitude Tests, for example, may be administered with any of five different answer sheets.

On the Clerical Speed and Accuracy Test of this battery, separate norms are provided for three of the five answer sheets, because they were found to yield substantially different scores than those obtained with the answer sheets used by the standardization sample. When testing children below the fifth grade, the use of any separate answer sheet may significantly lower test scores (Metropolitan Achievement Tests Special Report, 1956). At these grade levels, having the child mark the answers in the test booklet itself is generally preferable.

Other, more subtle testing conditions have been shown to affect performance on ability as well as personality tests. Whether the examiner is a stranger or someone familiar to the subjects may make a significant difference in test scores (Sacks, 1952; Tsudzuki, Hata, & Kuze, 1957). In another study, the general manner and behavior of the examiner, as illustrated by smiling, nodding, and making such comments as "good" or "fine," were shown to have a decided effect on test results (Wickes, 1956). In a projective test requiring the subject to write stories to fit given pictures, the presence of the examiner in the room tended to inhibit the inclusion of strongly emotional content in the stories (Bernstein, 1956). In the administration of a typing test, job applicants typed at a significantly faster rate when tested alone than when tested in groups of two or more (Kirchner, 1966). Examples could readily be multiplied. The implications are threefold. First, follow standardized procedures to the minutest detail. Second, record any unusual testing conditions, however minor. Third, take testing conditions into account when interpreting test results.

In the intensive assessment of a person through individual testing, an experienced examiner may occasionally depart from the standardized test procedure in order to elicit additional information for special reasons. When he does so, he can no longer interpret the subject's responses in terms of the test norms. Under these circumstances, the test stimuli are used only for qualitative exploration, and the responses should be treated in the same way as any other informal behavioral observations or interview data.

RAPPORT. In psychometrics, the term "rapport" refers to the examiner's efforts to arouse the subject's interest in the test, elicit his cooperation, and ensure that he follows the standard test instructions. In ability tests, the instructions call for careful concentration on the given tasks and for putting forth one's best efforts to perform well; in personality inventories, they call for frank and honest responses to questions about one's usual behavior; in certain projective tests, they call for full reporting of associations evoked by the stimuli, without any censoring or editing of content. Still other kinds of tests may require other approaches. But in all instances, the examiner endeavors to motivate the subject to follow the instructions as fully and conscientiously as he can.

The training of examiners covers techniques for the establishment of rapport as well as those more directly related to test administration. In establishing rapport, as in other testing procedures, uniformity of conditions is essential for comparability of results. If a child is given a coveted prize whenever he solves a test problem correctly, his performance cannot be directly compared with the norms or with that of other children who are motivated only with the standard verbal encouragement or praise. Any deviation from standard motivating conditions for a particular test should be noted and taken into account in interpreting performance.

Although rapport can be more fully established in individual testing, steps can also be taken in group testing to motivate the subjects and relieve their anxiety. Specific techniques for establishing rapport vary with the nature of the test and with the age and other characteristics of the subjects. In testing preschool children, special factors to be considered include shyness with strangers, distractibility, and negativism. A friendly, cheerful, and relaxed manner on the part of the examiner helps to reassure the child. The shy, timid child needs more preliminary time to become familiar with his surroundings. For this reason it is better for the examiner not to be too demonstrative at the outset, but rather to wait until the child is ready to make the first contact. Test periods should be brief, and the tasks should be varied and intrinsically interesting to the child. The testing should be presented to the child as a game and his curiosity aroused before each new task is introduced. A certain flexibility of procedure is necessary at this age level because of possible refusals, loss of interest, and other manifestations of negativism.

Children in the first two or three grades of elementary school present many of the same testing problems as the preschool child. The game approach is still the most effective way of arousing their interest in the test. The older schoolchild can usually be motivated through an appeal to his competitive spirit and his desire to do well on tests. When testing children from educationally disadvantaged backgrounds or from different cultures, however, the examiner cannot assume they will be motivated to excel on academic tasks to the same extent as children in the standardization sample. This problem and others pertaining to the testing of persons with dissimilar experiential backgrounds will be considered further in Chapters 3, 7, and 12.

Special motivational problems may be encountered in testing emotionally disturbed persons, prisoners, or juvenile delinquents. Especially when examined in an institutional setting, such persons are likely to manifest a number of unfavorable attitudes, such as suspicion, insecurity, fear, or cynical indifference.

Abnormal conditions in their past experiences are also likely to influence their test performance adversely. As a result of early failures and frustrations in school, for example, they may have developed feelings of hostility and inferiority toward academic tasks, which the tests resemble. The experienced examiner makes special efforts to establish rapport under these conditions. In any event, he must be sensitive to these special difficulties and take them into account in interpreting and explaining test performance.

In testing any school-age child or adult, one should bear in mind that every test presents an implied threat to the individual's prestige. Some reassurance should therefore be given at the outset. It is helpful to explain, for example, that no one is expected to finish or to get all the items correct. The individual might otherwise experience a mounting sense of failure as he advances to the more difficult items or finds that he is unable to finish any subtest within the time allowed.

Adult testing presents some additional problems. Unlike the schoolchild, the adult is not so likely to work hard at a task merely because it is assigned to him. It therefore becomes more important to "sell" the purpose of the tests to the adult, although high school and college students also respond to such an appeal. Cooperation of the examinee can usually be secured by convincing him that it is in his own interests to obtain a valid score, i.e., a score correctly indicating what he can do rather than overestimating or underestimating his abilities. Most persons will understand that an incorrect decision, which might result from invalid test scores, would mean subsequent failure, loss of time, and frustration for them. It is certainly not in the best interests of the individual to be admitted to a course of study for which he is not qualified, or assigned to a job he cannot perform or that he would find uncongenial. This approach can serve not only to motivate the individual to try his best on ability tests but also to reduce faking and encourage frank reporting on personality inventories.

TEST ANXIETY. Many of the practices designed to enhance rapport serve also to reduce test anxiety. Procedures tending to dispel surprise and strangeness from the testing situation and to reassure and encourage the subject should certainly help to lower anxiety. The examiner's own manner and a well-organized, smoothly running testing operation will contribute toward the same goal. It is also desirable to eliminate the element of surprise from the test situation as far as possible, because the unexpected and unknown are likely to produce anxiety. Many group tests provide a preliminary explanatory statement that is read to the group by the examiner. An even better procedure is to announce the tests a few days in advance and to give each subject a printed booklet that explains the purpose and nature of the tests, offers general suggestions on how to take tests, and contains a few sample items. Such explanatory booklets are regularly available to participants in large-scale testing programs such as those conducted by the College Entrance Examination Board (1974a, 1974b). The United States Employment Service has likewise developed a booklet on how to take tests, as well as a more extensive pretesting orientation technique for use with culturally disadvantaged applicants unfamiliar with tests. A tape recording and two booklets are combined in the Test Orientation Procedure (TOP), designed specifically for job applicants with little prior testing experience (Bennett & Doppelt, 1967). The first booklet, used together with the tape, provides general information on how to take tests; the second contains practice tests. In the absence of a tape recorder, the examiner may read the instructions from a printed script. More general orientation booklets are also available, such as Meeting the Test (Anderson, Katz, & Shimberg, 1965).

Individual differences in test anxiety have been studied with both schoolchildren and college students (Gaudry & Spielberger, 1971). Much of this research was initiated by Sarason and his associates at Yale (Sarason, Davidson, Lighthall, Waite, & Ruebush, 1960). The first step was to construct a questionnaire to assess the individual's test-taking attitudes. The children's form, for example, contains items such as the following:

Do you worry a lot before taking a test?
When the teacher says she is going to find out how much you have learned, does your heart begin to beat faster?
While you are taking a test, do you usually think you are not doing well?

Of primary interest is the finding that both school achievement and intelligence test scores yielded significant negative correlations with test anxiety. Similar correlations have been found among college students (I. G. Sarason, 1961). Longitudinal studies likewise revealed an inverse relation between changes in anxiety level and changes in intelligence or achievement test performance (Hill & Sarason, 1966). Such findings, of course, do not indicate the direction of causal relationships. It is possible that children develop test anxiety because they perform poorly on tests and have thus experienced failure and frustration in previous test situations.

In support of this interpretation is the finding that within subgroups of high scorers on intelligence tests, the negative correlation between anxiety level and test performance disappears (Denny, 1966; Feldhusen & Klausmeier, 1962). Moreover, there is evidence suggesting that at least some of the relationship results from the deleterious effects of anxiety on test performance. In one study (Waite, Sarason, Lighthall, & Davidson, 1958), high-anxious and low-anxious schoolchildren equated in intelligence test scores were given repeated trials in a learning task. Although initially equal in the learning test, the low-anxious group improved significantly more than the high-anxious.

It is undoubtedly true that a chronically high anxiety level will exert a detrimental effect on school learning and intellectual development. Such effects, however, should be distinguished from the test-limited effects with which this discussion is concerned. To what extent does test anxiety make the individual's test performance unrepresentative of his customary performance level in nontest situations? Because of the competitive pressure experienced by college-bound high school seniors in America today, it has been argued that performance on college admission tests may be unduly affected by test anxiety. In a thorough and well-controlled investigation of this question, French (1962) compared the performance of high school students on a test given as part of the regular administration of the SAT with performance on a parallel form of the test administered at a different time under "relaxed" conditions. The instructions on the latter occasion specified that the test was given for research purposes only and scores would not be sent to any college. The results showed that performance was no poorer during the standard administration than during the relaxed administration. Moreover, the concurrent validity of the test scores against high school course grades did not differ significantly under the two conditions.

Several investigators have compared test performance under conditions designed to evoke "anxious" and "relaxed" states. Mandler and Sarason (1952), for example, found that ego-involving instructions, such as telling subjects that everyone is expected to finish in the time allotted, had a beneficial effect on the performance of low-anxious subjects, but a deleterious effect on that of high-anxious subjects. Other studies have likewise found an interaction between testing conditions and such individual characteristics as anxiety level and achievement motivation (Lawrence, 1962; Paul & Eriksen, 1964). It thus appears likely that the relation between anxiety and test performance is nonlinear, a slight amount of anxiety being beneficial while a large amount is detrimental. Individuals who are customarily low-anxious benefit from test conditions that arouse some anxiety, while those who are customarily high-anxious perform better under more relaxed conditions.

EXAMINER AND SITUATIONAL VARIABLES. Comprehensive surveys of the effects of examiner and situational variables on test scores have been prepared by S. B. Sarason (1954), Masling (1960), Moriarty (1961, 1966), Palmer (1970), Sattler (1970, 1974), and Sattler and Theye (1967). There is considerable evidence that test results may vary systematically as a function of the examiner (E. Cohen, 1965). These differences may be related to personal characteristics of the examiner, such as his age, sex, race, professional or socioeconomic status, training and experience, personality characteristics, and appearance. Several studies of these examiner variables, however, have yielded misleading or inconclusive results because the experimental designs failed to control or isolate the influence of different examiner or subject characteristics. Hence the effects of two or more variables may be confounded.

The examiner's behavior before and during test administration has also been shown to affect test results. For example, controlled investigations have yielded significant differences in intelligence test performance as a result of a "warm" versus a "cold" interpersonal relation between examiner and examinees, or a rigid and aloof versus a natural manner on the part of the examiner (Exner, 1966; Masling, 1959). Moreover, there may be significant interactions between examiner and examinee characteristics, in the sense that the same examiner characteristic or testing manner may have a different effect on different examinees as a function of the examinee's own personality characteristics. Similar interactions may occur with task variables, such as the nature of the test. These extraneous factors are more likely to operate with unstructured and ambiguous stimuli, as well as with difficult and novel tasks, than with clearly defined and well-learned functions. Although some effects have been demonstrated with objective group tests, most of the data have been obtained with either projective techniques or individual intelligence tests. In general, children are more susceptible to examiner and situational influences than are adults; in the examination of preschool children, the role of the examiner is especially crucial. Emotionally disturbed and insecure persons of any age are also more likely to be affected by such conditions than are well-adjusted persons. Dyer (1973) adds even more variables to this list, calling attention to the possible influence of the test givers' and the test takers' diverse perceptions of the functions and goals of testing.

Another way in which an examiner may inadvertently affect the examinee's responses is through his own expectations. This is simply a special instance of the self-fulfilling prophecy (Rosenthal, 1966; Rosenthal & Rosnow, 1969).

The examinees' activities immediately preceding the test may also affect their performance, especially when such activities produce emotional disturbance, fatigue, or other handicapping conditions. In an investigation with third- and fourth-grade schoolchildren, there was some evidence to suggest that IQ on the Draw-a-Man Test was influenced by the children's preceding classroom activity (McCarthy, 1944). On one occasion, the class had been engaged in writing a composition on "The Best Thing That Ever Happened to Me"; on the second occasion, they had again been writing, but this time on "The Worst Thing That Ever Happened to Me." The IQ's on the second test, following what may have been an emotionally depressing experience, averaged 4 or 5 points lower than on the first test. These findings were corroborated in a later investigation specifically designed to determine the effect of immediately preceding experience on the Draw-a-Man Test (Reichenberg-Hackett, 1953). In this study, children who had had a gratifying experience involving the successful solution of an interesting puzzle, followed by a reward of toys and candy, showed more improvement in their test scores than those who had undergone neutral or less gratifying experiences.

Military recruits, for example, are often examined shortly after induction, during a period of intense readjustment to an unfamiliar and stressful situation. In one investigation designed to test the effect of acclimatization to such a situation on test performance, 2,724 recruits were given the Navy Classification Battery during their ninth day at the Naval Training Center (Gordon & Alf, 1960). When their scores were compared with those obtained by 2,180 recruits tested at the conventional time, during their third day, the 9-day group scored significantly higher on all subtests of the battery.

Several studies have been concerned with the effects of feedback regarding test scores on the individual's subsequent test performance. In a particularly well-designed investigation with seventh-grade students, Bridgeman (1974) found that "success" feedback was followed by significantly higher performance on a similar test than was "failure" feedback in subjects who had actually performed equally well to begin with. Similar results were obtained by W. Davis (1969a, 1969b) with college students: performance on an arithmetic reasoning test was significantly poorer when preceded by a failure experience on a verbal comprehension test than it was in a control group given no preceding test, and in one that had taken a standard verbal comprehension test under ordinary conditions. This type of motivational feedback may operate largely through the goals the subjects set for themselves in subsequent performance and may thus represent another example of the self-fulfilling prophecy. As might be expected, such feedback is much more likely to improve the performance of initially low-scoring persons. General motivational feedback, however, should not be confused with corrective feedback, whereby the individual is informed about the specific items he missed and given remedial instruction.

The examples cited in this section illustrate the wide diversity of test-related factors that may affect test scores. In the majority of well-administered testing programs, the influence of these factors is negligible for practical purposes. Nevertheless, the skilled examiner is constantly on guard to detect the possible operation of such factors and to minimize their influence. When circumstances do not permit the control of these conditions, the conclusions drawn from test performance should be qualified.

In evaluating the effect of coaching or practice on test scores, a fundamental question is whether the improvement is limited to the specific items included in the test or whether it extends to the broader area of behavior that the test is designed to predict. The answer to this question represents the difference between coaching and education. Obviously, any educational experience the individual undergoes, whether formal or informal, in or out of school, should be reflected in his performance on tests sampling the relevant aspects of behavior. Such broad influences will in no way invalidate the test, since the test score presents an accurate picture of the individual's standing in the abilities under consideration. The difference is, of course, one of degree. Influences cannot be rigidly classified as either narrow or broad, but obviously vary widely in scope, from those affecting only a single administration of a single test, through those affecting performance on all items of a certain type, to those influencing the individual's performance in the large majority of his activities. From the standpoint of effective testing, however, a workable distinction can be made: a test score is invalidated only when a particular experience raises it without appreciably affecting the criterion behavior that the test is designed to predict.
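The distinction just drawn, namely that a score gain is suspect only when it fails to carry over to the criterion, can be expressed as a simple check on group data. The sketch below is purely illustrative: the group means, the helper function, and the criterion measure are hypothetical, and no study cited in this chapter reports these figures.

    # Illustrative check of whether a coached score gain carries over to
    # the criterion the test is meant to predict.  All data are invented.

    def mean(xs):
        return sum(xs) / len(xs)

    # Hypothetical before/after test scores for a coached group and an
    # uncoached control group, plus later criterion measures (e.g., grades).
    coached_gain = mean([68, 72, 75]) - mean([55, 58, 60])   # large test gain
    control_gain = mean([58, 61, 63]) - mean([56, 59, 61])   # small test gain

    coached_criterion = mean([2.4, 2.6, 2.5])   # later grade-point averages
    control_criterion = mean([2.4, 2.5, 2.6])

    extra_test_gain = coached_gain - control_gain
    criterion_diff = coached_criterion - control_criterion

    # A sizable extra score gain with no criterion difference is the
    # pattern that, in the terms used above, invalidates the coached scores.
    print(f"Extra test-score gain from coaching: {extra_test_gain:+.1f}")
    print(f"Criterion difference:                {criterion_diff:+.2f}")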

COACHING. The effects of coaching on test scores have been widely investigated. Many of these studies were conducted by British psychologists, with special reference to the effects of practice and coaching on the tests formerly used in assigning 11-year-old children to different types of secondary schools (Yates et al., 1953-1954). As might be expected, the extent of improvement depends on the ability and earlier educational experiences of the examinees, the nature of the tests, and the amount and type of coaching provided. Individuals with deficient educational backgrounds are more likely to benefit from special coaching than are those who have had superior educational opportunities and are already prepared to do well on the tests. It is obvious, too, that the closer the resemblance between test content and coaching material, the greater will be the improvement in test scores. On the other hand, the more closely instruction is restricted to specific test content, the less likely is improvement to extend to criterion performance.

In America, the College Entrance Examination Board has been concerned about the spread of ill-advised commercial coaching courses for college applicants. To clarify the issues, the College Board conducted several well-controlled experiments to determine the effects of coaching on its Scholastic Aptitude Test and surveyed the results of similar studies by other, independent investigators (Angoff, 1971b; College Entrance Examination Board, 1968). These studies covered a variety of coaching methods and included students in both public and private high schools; one investigation was conducted with black students in 15 urban and rural high schools in Tennessee. The conclusion from all these studies is that intensive drill on items similar to those on the SAT is unlikely to produce appreciably greater gains than occur when students are retested with the SAT after a year of regular high school instruction.

On the basis of such research, the Trustees of the College Board issued a formal statement about coaching, in which the following points were made (College Entrance Examination Board, 1968, pp. 8-9):

The results of the coaching studies which have thus far been completed indicate that average increases of less than 10 points on a 600 point scale can be expected. It is not reasonable to believe that admissions decisions can be affected by such small changes in scores. This is especially true since the tests are merely supplementary to the school record and other evidence taken into account by admissions officers. . . . This particular Scholastic Aptitude Test is a measure of abilities that seem to grow slowly and stubbornly, profoundly influenced by conditions at home and at school over the years, but not responding to hasty attempts to relive a young lifetime. . . . Aptitude is not something fixed and impervious to influence by the way the child is brought up and is taught.

It should also be noted that in its test construction procedures, the College Board investigates the susceptibility of new item types to coaching (Angoff, 1971b; Pike & Evans, 1972). Item types on which performance can be appreciably raised by short-term drill or instruction of a narrowly limited nature are not included in the operational forms of the tests.

PRACTICE. The effects of sheer repetition, or practice, on test performance are similar to the effects of coaching, but usually less pronounced. It should be noted that practice, as well as coaching, may alter the nature of the test, since the subjects may employ different work methods in solving the same problems. Moreover, certain types of items may be much easier when encountered a second time. An example is provided by problems requiring insightful solutions which, once attained, can be applied directly in solving the same or similar problems in a retest. Scores on such tests, whether derived from a repetition of the identical test or from a parallel form, should therefore be carefully scrutinized.

A number of studies have been concerned with the effects of the identical repetition of intelligence tests over periods ranging from a few days to several years (see Quereshi, 1968). The studies have covered individual as well as group tests, administered to both adults and children, and to both normal and mentally retarded persons. All agree in showing significant mean gains on retests. Nor is improvement necessarily limited to the initial repetitions. Whether gains persist or level off in successive administrations seems to depend on the difficulty of the test and the ability level of the subjects. The implications of such findings are illustrated by the results obtained in annual retests of 3,500 schoolchildren with a variety of intelligence tests (Dearborn & Rothney, 1941). When the same test was readministered in successive years, the median IQ of the group rose from 102 to 113, but it dropped to 104 when another test was substituted. Because of the retest gains, the meaning of an IQ obtained on an initial and a later trial proved to be quite different. Thus, an IQ of 100 fell approximately at the average of the distribution on the initial trial, but in the lowest quarter on a retest. Such IQ's, though numerically identical and derived from the same test, might thus signify normal ability in the one instance and inferior ability in the other.

Gains in score are also found on retesting with parallel forms of the same test, although such gains tend in general to be smaller. Significant mean gains have been reported when alternate forms of a test were administered in immediate succession or after intervals ranging from one day to three years (Angoff, 1971b; Droege, 1966; Peel, 1951). Similar results have been obtained with normal and intellectually gifted schoolchildren, high school and college students, and employee samples. A statement on the distribution of gains to be expected on a retest with a parallel form should therefore be provided in test manuals, and allowance for such gains should be made when interpreting test scores.
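Where a manual does report the distribution of retest gains, the allowance just recommended can be applied quite mechanically. A minimal sketch follows, assuming a manual-supplied mean gain and standard deviation; the figures are invented for the example.

    # Minimal sketch of allowing for parallel-form retest gains when
    # interpreting a second score.  The mean gain and its standard
    # deviation are assumed to come from a test manual; both are invented.

    manual_mean_gain = 4.0   # hypothetical average gain on retest
    manual_gain_sd = 3.0     # hypothetical spread of such gains

    first_score, retest_score = 100, 109

    observed_gain = retest_score - first_score
    # How unusual is this gain, relative to ordinary practice effects?
    gain_in_sd_units = (observed_gain - manual_mean_gain) / manual_gain_sd

    adjusted_retest = retest_score - manual_mean_gain  # discount typical gain

    print(f"Observed gain: {observed_gain}")
    print(f"Gain relative to typical practice effect: {gain_in_sd_units:+.1f} SD")
    print(f"Retest score discounted for typical gain: {adjusted_retest:.0f}")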

Thus. The distribution and use of psychological tests constitutes a major area in Ethical Standards of Psychologists. 1965. 1952)."familiaritywith common item types and practice in the use of objective "answer sheets may also improve performance slightly. . and employee samples. cited in Chapter 1. IIace.c~tions vary with the type of test. can be quite effective in equalizing test sophistication (Wahlstrom & Boersman. Qf course. The general problem o(test sophistication should '"be considered in this connection. as described em'lier in this chapter. whereas a mini~um of specialized psychological training is needed in the case of educational achievement 45 . where the extent of test-taking experience may have varied Widely. and 9 (Impersonal Services). Peel. It is particularly important to take test sophistication into account when comparing the scores obtained by children from different types of schools. as well as from haVing developed more self-confidence and "etter test"taking attitudes. The individual who has had ex'vl! prior experience in taking psychological tests enjoys a certain adJage in test performance over one who is taking his first test (Heim & . & Ebel. Principles 13.Context of Psychological Tesring b three years (Angoff. Bishop. as between the advancement of science for human betterment and the protection of the rights and welfare of individuals. and 15 are specifically directed to testing. ~dbe made when interpreting test scores. SpeCific . 1971b. For a fuller . )17 SOPHJSTICATIO~. the Casebook on Ethical Standards of PsycllOlogists (1967) and Ethical Principles in tIle Conduct of Researc11 with Human Participants (1973). The requirement that tests be used only by appropriately qualified examiners is one step toward protecting !he indiy!~ual againE: the im~oper use of tests. high school and college students. Other principles that. CHAPTER 3 Social a1ld Etltical 11JljJZicatioTls of Testi1lg I 1968).ri!'d of int~nsive training and s~pervised experience is required for the proper use of individual intelligence tests and most personality tests. Rodger. being concerned with Test Security. 1966.and richer understanding of the principles set forth in the Ethical Standards. Both report specific incidents to illustrate each prinCiple. a relatively long pe. x RDER to prevent the misuse of psychological tests. 14. 'although broader in scope. and Test Publication. the reader should consult two companion publications. Some of the matters discussed in the Ethical Standards are closely related to points covered in the Standards for Educational and Psychological Tests (1974). the necessary qualiB. Part is the result of a certain amount of overlap in the type of content and functions covered by many tests.r results have been obtained with normal and intellectually gifted )children. Millman. Test Interpretation.194~1950. it has become O necessary to erect a number of safeguards around both the tests themselves and the test scores. Short orientation and practice sessions. Part Ithis advantage stems from having overcome an initial feeling of angeness. the code of professional ethics officially adopted by the American Psychological Association and reproduced in Appendix A. Special attention is given to marginal situations in which there may be a conflict of values. a "onthe distribution of gains to be expected on a retest with a parallel should be provided in test manuals and allowance for such gains . 1951. Droege. are highly relevant to testing include 6 (ConfideIitiality). 1936). 
7 (Client Welfare).

It should also be noted that students who take tests in class for instructional purposes are not usually equipped to administer the tests to others or to interpret the scores properly.

The well-trained examiner chooses tests that are appropriate both for the particular purpose for which he is testing and for the person to be examined. He is also cognizant of the available research literature on the chosen test and able to evaluate its technical merits with regard to such characteristics as norms, reliability, and validity. In administering the test, he is sensitive to the many conditions that may affect test performance, such as those illustrated in Chapter 2. He draws conclusions or makes recommendations only after considering the test score (or scores) in the light of other pertinent information about the individual. Above all, he should be sufficiently knowledgeable about the science of human behavior to guard against unwarranted inferences in his interpretations of test scores.

Who is a qualified psychologist? Obviously, with the diversification of the field and the consequent specialization of training, no psychologist is equally qualified in all areas. In recognition of this fact, the Ethical Standards specify: "The psychologist recognizes the boundaries of his competence and the limitations of his techniques and does not offer services or use techniques that fail to meet professional standards established in particular fields" (Appendix A, Principle 2c).

A useful distinction is that between a psychologist working in an institutional setting, such as a school system, university, clinic, or government agency, and one engaged in independent practice. Because the independent practitioner is less subject to judgment and evaluation by knowledgeable colleagues than is the institutional psychologist, he needs to meet higher standards of professional qualification. The same would be true of a psychologist responsible for the supervision of other institutional psychologists, or of one who serves as an expert consultant to institutional personnel. When tests are administered by psychological technicians or assistants, or by persons in other professions, it is essential that an adequately qualified psychologist be available, at least as a consultant, to provide the needed perspective for a proper interpretation of test performance.

A significant step, both in upgrading professional standards and in helping the public to identify qualified psychologists, was the enactment of state licensing and certification laws for psychologists. Nearly all states now have such laws. Although the terms "licensing" and "certification" are often used interchangeably, in psychology certification typically refers to legal protection of the title "psychologist," whereas licensing controls the practice of psychology. Licensing laws thus need to include a definition of the practice of psychology. Although most states began with the simpler certification laws, there has been continuing movement toward licensing. In either type of law, the requirements are generally a PhD in psychology, a specified amount of supervised experience, and satisfactory performance on a qualifying examination. Violations of the APA ethics code constitute grounds for revoking a certificate or license.

At a more advanced level, specialty certification within psychology is provided by the American Board of Professional Psychology (ABPP). Requiring a high level of training and experience within designated specialties, ABPP grants diplomas in such areas as clinical, counseling, industrial and organizational, and school psychology. As a privately constituted board within the profession, ABPP does not have the enforcement authority available to the agencies administering the state licensing and certification laws. Its principal function is to provide information regarding qualified psychologists. The Biographical Directory of the APA contains a list of current diplomates in each specialty, which can also be obtained directly from ABPP.

Misconceptions about the nature and purpose of tests and misinterpretations of test results underlie many of the popular criticisms of psychological tests. In part, these difficulties arise from inadequate communication between psychometricians and their various publics: educators, parents, legislators, job applicants, and so forth. Probably the most common examples center on unfounded inferences from IQs. Not all misconceptions about tests, however, can be attributed to inadequate communication between psychologists and laymen. The growing complexity of the science of psychology has inevitably been accompanied by increasing specialization among psychologists. In this process, psychological testing itself has tended to become dissociated from the mainstream of behavioral science (Anastasi, 1967). Psychometricians have concentrated more and more on the technical refinements of test construction and have tended to lose contact with developments in other relevant specialties, such as learning, child development, individual differences, and behavior genetics. Thus the technical aspects of test construction have tended to outstrip the psychological sophistication with which test results are interpreted. Test scores can be properly interpreted only in the light of all available knowledge regarding the behavior that the tests are designed to measure.

The purchase of tests is generally restricted to persons who meet certain minimal qualifications. The catalogues of major test publishers specify the requirements that must be met by purchasers. Usually individuals with a master's degree in psychology or its equivalent qualify. Some publishers classify their tests into levels with reference to user qualifications, ranging from educational achievement and vocational proficiency tests, through group intelligence tests and interest inventories, to such clinical instruments as individual intelligence tests and personality tests. Distinctions are also made between individual purchasers and authorized institutions. Graduate students who need tests for a specific class assignment or for research must have the order countersigned by their psychology instructor. It should be noted, however, that even a PhD, a state license, or an ABPP diploma does not necessarily signify that the individual is qualified to use a particular test, or that his training is relevant to the specific instrument. Since the control that test distributors are able to exert is necessarily limited, the major responsibility for the proper use of tests resides in the individual user or institution.

The marketing of psychological tests is the responsibility of authors and publishers, and it involves a dual objective: security of test materials and prevention of misuse. Psychological tests can serve their function only if access to them is limited to persons with professional interests who will safeguard their use. Tests should not be released prematurely for general use, nor should any claims be made regarding the merits of a test in the absence of sufficient objective evidence. When a test is distributed early for research purposes only, this condition should be clearly specified and the distribution of the test restricted accordingly. The test manual should provide adequate data to permit an evaluation of the test itself, as well as full information regarding administration, scoring, and norms. The manual should be a factual exposition of what is known about the test rather than a selling device designed to put the test in a favorable light. It is also the responsibility of the test author and publisher to revise tests and norms often enough to prevent obsolescence. In the words of the Ethical Standards, test scores, like test materials, "are released only to persons who are qualified to interpret and use them properly" (Principle 14).

A test should not be published in a newspaper, magazine, or popular book, either for descriptive purposes or for self-evaluation. Under these conditions, self-evaluation would not only be psychologically injurious to some individuals, but it would also tend to invalidate the future use of the test with other persons. Moreover, the presentation of test materials in this fashion tends to create an erroneous and distorted picture of testing in general; such publicity may foster either naive credulity or indiscriminate resistance on the part of the public toward all psychological testing.

Another unprofessional practice is testing by mail. An individual's performance on either aptitude or personality tests cannot be properly assessed by mailing test forms to him and having him return them by mail for scoring and interpretation. Not only does this procedure provide no control of testing conditions but, usually, it also involves the interpretation of test scores in the absence of other pertinent information about the individual. Under these conditions, test results may be worse than useless.

A question arising particularly in connection with personality tests is that of the invasion of privacy. Insofar as some tests of emotional, motivational, or attitudinal traits are necessarily disguised, the subject may reveal characteristics in the course of such a test without realizing that he is so doing. Although there are few available tests whose approach is subtle enough to fall into this category, the possibility of developing such indirect testing procedures imposes a grave responsibility on the psychologist who uses them. For testing to be effective, it may be necessary to keep the examinee in ignorance of the specific ways in which his responses on any one test are to be interpreted. Nevertheless, a person should not be subjected to any testing program under false pretenses. Of primary importance in this connection is the obligation to have a clear understanding with the examinee regarding the use that will be made of his test results. The following statement contained in Ethical Standards of Psychologists (Principle 7d) is especially germane to this problem:

The psychologist who asks that an individual reveal personal information in the course of interviewing, testing, or evaluation, or who allows such information to be divulged to him, does so only after making certain that the responsible person is fully aware of the purposes of the interview, testing, or evaluation and of the ways in which the information may be used.

Although concerns about the invasion of privacy have been expressed most commonly about personality tests, they logically apply to any type of test. Certainly any intelligence, aptitude, or achievement test may reveal limitations in skills and knowledge that an individual would rather not disclose. Moreover, any observation of an individual's behavior, as in an interview, casual conversation, or other personal encounter, may yield information about him that he would prefer to conceal and that he may reveal unwittingly.

The fact that psychological tests have often been singled out in discussions of the invasion of privacy probably reflects prevalent misconceptions about tests. If all tests were recognized as measures of behavior samples, with no mysterious powers to penetrate beyond behavior, popular fears and suspicions would be lessened.

It should also be noted that all behavior research, whether employing tests or other observational procedures, presents the possibility of invasion of privacy. Yet, as scientists, psychologists are committed to the goal of advancing knowledge about human behavior. Principle 1a in Ethical Standards of Psychologists (Appendix A) clearly spells out the psychologist's conviction that society will be best served when he investigates where his judgment indicates investigation is needed. Conflicts may thus arise between freedom of inquiry, which is essential to the progress of science, and the protection of the individual; such conflicts must be resolved in individual cases. The investigator must be alert to the values involved and must carefully weigh alternative solutions (see Ethical Principles, 1973). Examples of such conflict resolutions can be found in the previously cited Ethical Principles in the Conduct of Research with Human Participants (1973).

The problem is obviously not simple, and it has been the subject of extensive deliberation by psychologists and other professionals. In a report entitled Privacy and Behavioral Research (1967), prepared for the Office of Science and Technology, the right to privacy is defined as "the right of the individual to decide for himself how much he will share with others his thoughts, his feelings, and the facts of his personal life" (p. 2). It is further characterized as "a right that is essential to insure dignity and freedom of self-determination" (p. 2). Several other principles of the report are concerned with the protection of privacy and the welfare of research subjects (see also Ruebhausen & Brim, 1966). To safeguard personal privacy, no universal rules can be formulated; only general guidelines can be provided. In the application of these guidelines to specific cases, there is no substitute for the ethical awareness and professional responsibility of the individual psychologist. Solutions must be worked out in terms of the particular circumstances.

One relevant factor is the purpose for which the testing is conducted, whether for individual counseling, institutional decisions regarding selection and classification, or research. In clinical or counseling situations, the client is usually willing to reveal himself in order to obtain help with his problems. The clinician or examiner does not invade privacy where he is freely admitted (Ruebhausen & Brim, 1966). Even under these conditions, however, the client should be warned that in the course of the testing or interviewing he may reveal information about himself without realizing that he is so doing, or he may disclose feelings of which he himself is unaware. The results of tests administered in a clinical or counseling situation, moreover, should not be made available for institutional purposes unless the examinee gives his consent.

When testing is conducted for institutional purposes, the examinee should be fully informed as to the use that will be made of his test scores. An individual is less likely to feel that his privacy is being invaded by a test assessing his readiness for a particular educational program than by a test allegedly measuring his "innate intelligence." It is also desirable to explain to the examinee that correct assessment will benefit him, since it is not to his advantage to be placed in a position where he will fail or which he will find uncongenial. When tests are administered for research purposes, anonymity should be preserved as fully as possible, and the procedures for ensuring such anonymity should be explained in advance to the subjects. Anonymity does not, however, solve the problem of protecting privacy in all research contexts; some subjects may still resent the disclosure of facts they consider personal. In most cases, the cooperation of subjects may be elicited if they are convinced that the information is needed for the research in question and if they have sufficient confidence in the integrity and competence of the investigator.

Whatever the purposes of testing, the protection of privacy involves two key concepts: relevance and informed consent. The information that the individual is asked to reveal must be relevant to the stated purposes of the testing. An important implication of this principle is that all practicable efforts should be made to ascertain the validity of tests for the particular diagnostic or predictive purpose for which they are used. An instrument that is demonstrably valid for a given purpose is one that provides relevant information. It also behooves the examiner to make sure that test scores are correctly interpreted.

The concept of informed consent also requires clarification, and its application in individual cases may call for the exercise of considerable judgment (Ethical Principles, 1973; Ruebhausen & Brim, 1966). The examinee should certainly be informed about the purpose of testing, the kinds of data sought, and the use that will be made of his scores. It is not implied, however, that he be shown the test items in advance or told how specific responses will be scored; nor, in the case of a minor, should the test items be shown to a parent unless the examinee gives his consent. Such information would usually invalidate the test. Not only would it seriously impair the usefulness of an ability test, but it would also tend to distort responses on many personality tests. For example, if an individual is told in advance that a self-report inventory will be scored with a dominance scale, his responses are likely to be influenced by stereotyped (and often erroneous) ideas he may have about this trait, or by a false or distorted self-concept.

The establishment of procedures that adequately protect the privacy of the individual while yielding scientifically meaningful data presents a challenge to the psychologist's ingenuity. Special questions arise with regard to school testing programs and pupil records. In this connection, the Russell Sage Foundation has issued a set of Guidelines for the Collection, Maintenance, and Dissemination of Pupil Records (Russell Sage Foundation, 1970). With regard to the collection of data, the Guidelines differentiate between individual consent and representational consent, the latter given by the parents' legally elected or appointed representatives, such as a school board. While the gathering of educational achievement and ability test data may be covered by representational consent, testing in the more sensitive area of personality requires individual consent. A helpful feature of the Guidelines is the inclusion of sample consent forms, in which the information to be collected is specified and preceded by a simple explanation of the purposes of the testing and of how the results will be interpreted and used.

Requiring individual consent introduces the possibility of biased sampling and volunteer error, since those who refuse may differ systematically from those who cooperate. With proper rapport and attitudes of mutual respect, however, the number of refusals to cooperate may be reduced to a negligible quantity. There is also some evidence to suggest that giving examinees a frank explanation of a personality inventory need not distort their responses; in one study, such an explanation did not affect the mean profile of scores on the inventory (Fink & Butcher, 1972).

CONFIDENTIALITY. Like the protection of privacy, the problem of the confidentiality of test data is multifaceted. The fundamental question is: Who shall have access to test results? Several considerations influence the answer in particular situations. Among them are the security of test content, the hazards of misunderstanding test scores, and the need of various persons to know the results.

Discussions of the confidentiality of test records have usually dealt with accessibility to a third person, other than the individual tested (or a parent of a minor) and the examiner (Ethical Standards, Principle 6). The underlying principle is that such records should not be released without the knowledge and consent of the individual.

There has also been a growing awareness of the right of the individual himself to have access to the findings in his test report. He should likewise have the opportunity to comment on the contents of the report and, if necessary, to clarify or correct factual information. Counselors are now trying more and more to involve the client as an active participant in his own assessment. For these purposes, test results should be presented in a form that is readily understandable, free from technical jargon or labels, and oriented toward the immediate objective of the testing. Proper safeguards must be observed against misuse and misinterpretation of test findings (see Ethical Standards, Principle 14).

In the case of minors, one must also consider the parents' right of access to the child's test record. This presents a possible conflict with the child's own right to privacy, especially in the case of older children. In a searching analysis of the problem, Ruebhausen and Brim (1966, p. 432) wrote: "Should not a child, even before the age of full legal responsibility, be accorded the dignity of a private personality? Considerations of healthy personal growth, buttressed with reasons of ethics, seem to command that this be done." The previously mentioned Guidelines (Russell Sage Foundation, 1970, p. 27) recommend that when a student reaches the age of eighteen and no longer is attending high school, or is married (whether age eighteen or not), he should have the right to deny parental access to his records. This recommendation is followed by the caution that school authorities check local state laws for possible legal difficulties in implementing such a policy.

Apart from these possible exceptions, the question is not whether to communicate test results to the parents of a minor, but how to do so. Parents normally have a legal right to information about their child, and it is usually desirable for them to have such information. In some cases, moreover, a child's academic or emotional difficulties may arise in part from parent-child relations. Under these conditions, the counselor's contact with the parents is of prime importance, both to fill in background data and to elicit parental cooperation.

When tests are administered in an institutional context, as in a school system or employment setting, the individual should be informed at the time of testing regarding the purpose of the test, how the results will be used, and their availability to institutional personnel who have a legitimate need for them.

Another problem pertains to the retention of records in institutions. On the one hand, longitudinal records can be very valuable, not only for research purposes but also for understanding and counseling the individual. On the other hand, the availability of old records opens the way for such misuses as incorrect inferences from obsolete data and unauthorized access for purposes other than the original testing purpose. When records are retained for many years, there is danger that they may be used for ends that the individual (or his parents) never suspected and would not have approved. It would seem, for example, that a reading achievement score obtained by a child in the third grade should not enter into an evaluation of him for admission to college: too much may have happened to him in the intervening years to make such early records meaningful. The Guidelines (Russell Sage Foundation, 1970, p. 42) contain a sample form for the use of school systems in governing the transmission, retention, and destruction of pupil records. In the Guidelines, data are classified into three categories with regard to retention, a major determining factor being the relevance of the data to the educational objectives of the school. Records retained for longitudinal use in the interest of the individual, or for acceptable research purposes, should be subject to unusually stringent controls on access. It would seem desirable for every institution that keeps test records to formulate similar explicit policies regarding their retention, accessibility, and destruction.

The unprecedented advances in storing, processing, and retrieving data made possible by computers can be of inestimable service both in research and in the more immediate handling of social problems. At the same time, the development of computerized data banks has stimulated deep concern. In the foreword to the Guidelines (Russell Sage Foundation, 1970, pp. 5-6), Ruebhausen wrote:

Modern science has introduced a new dimension into the issues of privacy. There was a time when the inefficiency of man, the fallibility of his memory, and the healing compassion that accompanied both the passing of time and the warmth of human recollection, were among the strongest allies of privacy. Modern science has given us the capacity to record faithfully, to maintain permanently, to retrieve promptly, and to communicate both widely and instantly.

The potential dangers of invasion of privacy and violation of confidentiality need to be faced squarely, constructively, and imaginatively. Rather than fearing the centralization and efficiency of complex computer systems, we should explore the possibility that these very characteristics may permit more effective procedures for protecting the security of individual records.

An example of what can be accomplished with adequate facilities is provided by the Link system developed by the American Council on Education (Astin & Boruch, 1970). In a longitudinal research program on the effects of different types of college environments, questionnaires were administered annually to several hundred thousand college freshmen. To permit the collection of follow-up data on the same persons, while preventing the identification of individual responses by anyone at any future time, a three-file system of computer tapes was devised. The first tape, containing each student's responses marked with an arbitrary identification number, is readily accessible for research purposes. The second tape, containing only the students' names and addresses with the same identification numbers, was originally housed in a locked vault and used only to print labels for follow-up mailings. This two-file system represents the traditional security arrangement; it still did not provide complete protection, since some staff members would have access to both files. Moreover, such files are subject to judicial and legislative subpoena. For these reasons, a third file was prepared, containing only the original identification numbers and a new set of random numbers, which were then substituted for the original identification numbers in the name and address file. After the preparation of these tapes, the original questionnaires were destroyed. Known as the Link file, the file matching the two sets of code numbers was deposited at a computer facility in a foreign country, with the agreement that the file would never be released to anyone. Follow-up data tapes are sent to the foreign facility, which substitutes one set of code numbers for the other. Under these conditions, no one can identify the responses of individuals in the data files. Such elaborate precautions for the protection of confidentiality obviously would not be feasible except in a large-scale computerized data bank. The procedure could be simplified somewhat if the linking facility were located in a domestic agency given adequate protection against subpoena.
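The logic of the three-file arrangement just described can be summarized schematically. The sketch below is a reconstruction for illustration only, not the American Council on Education's actual software; all names, addresses, responses, and numbers are invented.

    import random

    # Schematic sketch of the three-file Link arrangement described above.
    # All data are invented for illustration.

    students = {"Ann Doe, 12 Elm St": "response set A",
                "Bob Roe, 34 Oak Ave": "response set B"}

    # Assign each student an arbitrary identification number.
    ids = dict(zip(students, random.sample(range(10_000, 99_999), len(students))))

    # File 1 (research file): responses keyed by identification number only.
    research_file = {ids[s]: resp for s, resp in students.items()}

    # File 2 originally paired names and addresses with the same ID numbers,
    # the traditional two-file security system.  To go further, a new set
    # of random code numbers is substituted for the IDs in the address file:
    codes = dict(zip(students, random.sample(range(10_000, 99_999), len(students))))
    address_file = {codes[s]: s for s in students}

    # File 3 (the Link file, deposited abroad): the only table pairing the
    # two sets of numbers.  Without it, files 1 and 2 cannot be joined.
    link_file = {codes[s]: ids[s] for s in students}

    # Follow-up: mailing labels come from the address file; returned data,
    # keyed by code number, go to the foreign facility, which swaps in the
    # research IDs before the tape comes back.
    followup_data = {code: "follow-up answers" for code in address_file}
    relinked = {link_file[code]: data for code, data in followup_data.items()}

The essential design choice is that no single domestic holder ever possesses both code sets at once, so neither staff access nor subpoena of the domestic files can connect a name to a response.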

i. insofar as ossible. scope of ?~present discussion. Se d. two major gll1del~nes are of particular mterest. s~ould the data be interpreted by a properly qualified person. a college student mIght become seriously' discouraged when he leams of his poor performance on a scholastic aptitude test.c:nnical meaning is that mderstood.tit>ns (see. whether child or adult.'s against misinterpretation apply here as in ~mmuni~tm~ ird party.. but should be accomnterpretive explanations by a professionally trained person. whose impact ran. when he is learnin? about hfs 0'1\'11 assets . Humphreys. It IS clear that the should not be transmitted routinely.e individual himself.Social and "Etl1icalIIll1"ications of Testing 57 i$tshave given much thought to the comm~nication of test "formthat will be meaningful and useful.. Such de~nmental effects may.even w en their te.:pt when comg with adequately trained professlOnals . This. Kendrick.. I 1 I II II T~ SETfINC. but also to his anticipated eIllotional response to the on. Among the more clarifying contributions are several po.anageable. applies o at person's general educatIOn 1~:imowledge about psynd testing.te persons. Although the details of . ~. e. of the person who is to receive the i~fomlation. a knowledge of such a score WIthout the opportunity to discuss it further ~nay be harmful to the individual.. for whatever reasons.d also be available for counseling anyone who may become cmOti01~any dIsturbed by such information.~ . example. psychological testing has been a major focus of att:nbon. The counseling situation IS such thaf If the individual rejects any information. . approprig. 1969. Written'reports about their own children may ributed to the parents.But a. ~he decades since 1950 have witnessed an increasing publIc concern With the rights of minorities. scores. ut by no means least is the problem of commumcatlOg test re'. The same gene. The person's emotional reaction to the mforrnatlOn lS ly important. ~ehow they afe transmitted.. e.y. 'H1".. Cleary... A familiar example is the popuhyassumption !cates a fixed characteristic of the individual wl)ich pTedeis lifetime level of intellectual achievemen~.nr1. and the of the d~ta.litcommunication it is desirable to take . Ch. but faclli~Ies shoul.. and arrangements made for personal '.ges from clanflcabo~ ~o obfuscation.vir'll1:1l !!iven hiS own test results.. An Important consideration in counseling relates to the' counselee's ~cceptance o~ the information presented to him. and int~Fts~ ratlOgs With 'ores. Even when a test has been accurately administer:d and scored and properly interpreted. of course. Counseling psychologists h~e been especially concerned with the dev~lo ment of effective wavs of transmittin test inform' to-their-_ c IC11t5 see. For example. 1971. similar safeguard~ shoul~ b~ proVided. Goldman.' a concern that is reflected in the enactment of civil rights legislation at both federal and state levels. an Important condItIon resu1tsshould be prcsented in terms of descriptive performrather than isolated numerical scores.ll1icatingresults to teachers. for example. nee test· which are more likely to be misinter reted than are 't tes . then that information is likely to be totally wasted.slbon papers by professional associ. Hence w understood to includj) men. a recommended to arrange a group meeting at which a counselor or school '\explains the purpose and nature of the tests... personal I' involvement with the child may interfere with a calm and 'cceptance of factual information. 
In communicating results to teachers, school administrators, and employers, it is an important condition that the results be presented in terms of descriptive levels of performance and qualitative explanations in simple terms, rather than as isolated numerical scores. This applies especially to tests whose scores are expressed as IQs, which are more likely to be misinterpreted than are scores on special aptitude tests. Even well-educated laymen have been known to confuse percentiles with percentage scores, percentile norms with standards, and interest ratings with aptitude scores.
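The first of these confusions, percentage score versus percentile rank, can be made concrete with a few lines of code. The raw scores and norm group below are invented for the example, and the percentile computation uses one common convention (the proportion of the norm group scoring below the examinee); other conventions count half of any tied scores.

    # Illustrative contrast between a percentage score and a percentile
    # rank.  All scores are invented.

    norm_group = [12, 15, 18, 21, 23, 25, 27, 30, 33, 38]  # raw scores
    examinee_raw = 27
    items_on_test = 50

    # Percentage score: proportion of the test items answered correctly.
    percentage_score = 100 * examinee_raw / items_on_test    # 54 percent

    # Percentile rank: percentage of the norm group scoring below him.
    below = sum(1 for s in norm_group if s < examinee_raw)
    percentile_rank = 100 * below / len(norm_group)          # 60th percentile

    print(f"Percentage score: {percentage_score:.0f}% of items correct")
    print(f"Percentile rank:  exceeds {percentile_rank:.0f}% of the norm group")

The two numbers answer entirely different questions: the first describes mastery of the test content, the second describes standing relative to a particular norm group.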

But a more serious kind of misinterpretation pertains to the conclusions drawn from test scores, even when their technical meaning is correctly understood. A familiar example is the popular assumption that an IQ indicates a fixed characteristic of the individual which predetermines his lifetime level of intellectual achievement. Such general misunderstandings about the nature and function of psychological tests have figured prominently in the testing of cultural minorities.

THE SETTING. The decades since 1950 have witnessed an increasing public concern with the rights of minorities, a concern that is reflected in the enactment of civil rights legislation at both federal and state levels. In connection with mechanisms for improving the educational and vocational opportunities of such groups, psychological testing has been a major focus of attention. The psychological literature of the 1960s and early 1970s contains many discussions of the topic, whose impact ranges from clarification to obfuscation. Among the more clarifying contributions are several position papers by professional associations (e.g., American Psychological Association, 1969; Cleary, Humphreys, Kendrick, & Wesman, 1975; Deutsch, Fishman, Kogan, North, & Whiteman, 1964). A brief but cogent paper by Flaugher also helps to clear away some prevalent misconceptions regarding the basic issues and social implications of minority group testing. Because the testing of minorities represents a special case of the broader problem of cross-cultural testing, the theoretical rationale and testing procedures are treated more fully in Chapter 12; a more technical analysis of the concept of test bias is given in Chapter 7. In the present chapter, our interest is chiefly in the basic issues and social implications of minority group testing. Although women represent a statistical majority in the national population, they have shared many of the problems of minorities, occupationally and in other ways. Hence, when the term "minority" is used in this section, it will be understood to include women.

Much of the concern about the use of tests with minority group members centers on the lowering of test scores by cultural conditions that may have affected the development of the abilities the test is designed to assess. In testing culturally diverse persons, it is important to differentiate between cultural factors that affect both test and criterion behavior and those whose influence is restricted to the test. Insofar as culture affects all behavior, cultural differences will and should be detected by tests. If the development of arithmetic ability itself, for example, is more strongly fostered in one culture than in another, scores on an arithmetic test should not eliminate or conceal such a difference. To conceal the effects of cultural disadvantage would, moreover, fail to provide the kind of information needed to correct the very conditions that impaired performance.

TEST-RELATED FACTORS. Test-related factors, by contrast, affect performance on the particular test while remaining unrelated to criterion performance, and they thereby lower the test's validity. Examples of such factors include previous experience in taking tests, motivation to perform well on tests, and rapport with the examiner. The use of names or pictures of objects unfamiliar in a particular cultural milieu would obviously represent a test-restricted handicap, since the ability to carry out, say, quantitative thinking does not depend upon familiarity with such objects. The influence of test-related factors can be reduced by procedures such as the short orientation and practice sessions, with booklets and tape recordings, cited in Chapter 2. With low-scoring examinees who have had little or no prior testing experience, retesting with a parallel form after such test-taking orientation and preliminary practice is also recommended.

Another, more subtle way in which specific test content may spuriously affect performance is through the examinee's emotional and attitudinal responses. Stories or pictures portraying typical suburban middle-class family scenes, for example, may alienate a child reared in a low-income inner-city home. Exclusive representation of the physical features of a single racial type in test illustrations may have a similar effect on members of an ethnic minority. Certain words, too, may have acquired connotations that are offensive to minority groups. In the same vein, women's organizations have objected to the perpetuation of sex stereotypes in test content, as in the portrayal of male doctors or executives and female nurses or secretaries. The major test publishers now make special efforts to weed out inappropriate, culturally restricted, or stereotyped material. Their own test construction staffs have become sensitized to potentially offensive content; members of different ethnic groups participate either as regular staff members or as consultants; and the reviewing of test content with reference to possible minority implications is a regular step in the process of test construction. An example of the application of these procedures in item construction and revision is provided by the 1970 edition of the Metropolitan Achievement Tests (Fitzgibbon, 1972), published by Harcourt Brace Jovanovich. As one test publisher aptly expressed it, "Until fairly recently, most standardized tests were constructed by white middle-class people," who sometimes clumsily violate the feelings of the test-taker without even knowing it; "one could say that we have been not so much culture biased as we have been 'culture blind'" (Fitzgibbon, 1972, pp. 2-3).

INTERPRETATION AND USE OF TEST SCORES. By far the most important considerations in the testing of culturally diverse groups, as in all testing, pertain to the interpretation of test scores. The most frequent misgivings regarding the use of tests with minority group members stem from misinterpretations of scores. If a minority examinee obtains a low score on an aptitude test or a deviant score on a personality test, it is essential to investigate why he did so. For example, an inferior score on an arithmetic test could result from low test-taking motivation, poor reading ability, or inadequate knowledge of arithmetic. Low scores obtained under such handicapping conditions fail to provide the kind of information needed to correct the conditions that impaired performance. Some thought should also be given to the type of norms to be employed in evaluating individual scores. Depending on the purpose of the testing, the appropriate norms may be general norms or norms based on subgroups with comparable backgrounds.

When combined with information about his experiential background. Loretan.of t~sts was aptly characterized in the following words by John Macy.f~~!f. intelligence test scores should not foster a l'igid categorizing ~f persons. Gardner (1961. That it proved necessary to discard the tests in order to eliminate the misconceptions. with backgrounds different from those of their teachers. 1966). . These are very often chffaren whose cultural handicaps are most evident in their overt social and interpersonal behavior. 1965. It is largely because implications of permanent status have become attached tq. test scores should facilitate effective planning for the optimal development of the individual. the Guidelines for Testitlg Minority Group Children (Deutsch et at. non-conforming pupils.Social alld Et!lical171lplicatiolls of Testing 61 an IQ would thus serve to perpetuate their handicap. p. the IQ is an index of innate intellectual potential and represents a fixed property of the organism.rg Public Policy." In the same vein.Jhe IQ that in 1964 the use of group intelligenGe-testS-. 139) contain the follOWingobservation: Many bright. \Vhen properly intcrrireted. this view is neither theoretically defensible nor supported by empirical data.M:as discontinued in the l\ew York City public schools (H. OBJECTIVITY OF TESTS. and they couldn't hean the accents of the slum.. Without the intervention of standardized tests. . \Vith regard to personnel selection. many such children would he stigmatized by the adverse subjective ratings of teachers who tend to reward can· formist behavior of middle-class character. which are administered and interpreted by trained examiners and school psychologists. p. 4&-49) wrote: "The tests couldn't see whether the youngster was in rags or in tweeds. was not eliminated. 1964. intelligence tests-and any other test-may be regarded as a map on which the individual's present position can be located. It was the mass testing and routine use of IQs by relatively unsophisticated persons that was considered hazardous. about the fixity of the IQ is a revealing commentary on the tenacity of the misconceptions. "'hen social stereot:'pes and prejudice may distort interpersonal evaluations. It should also be noted that the use of individual intelligence tests like the Stanford-Binet.. According to a popular misconception.¥:. 1966. The tests revealed intellectual gifts at every level of the population. tests provide a safeguard against favoritism and arbitrary or capricious decisions. B. As will be seen in Chapter 12. On the conhary. the contr!!>ution:. Cilbeli. Commenting on the use of tests in schools.'Chairman of and the United States Civil Service Commission (7. pp. make favorable showings on achievement tests. Jr. in contrast to their low classroom marks. 883) :""'.

Macy observed that the Civil Service has had a vital interest in the development and application of psychological testing methods in the career services of the federal government, and that the widespread public confidence in the objectivity of its examining procedures has in large part been earned by properly validated tests. The fairness of such tests, in eliminating irrelevant and unfair discrimination, is at the very root of the merit system. When properly used in testing culturally disadvantaged minority group members, then, tests serve an important function in preventing irrelevant and unfair discrimination; they can also provide a quantitative index of the extent of cultural handicap, as a necessary first step in remedial programs.

LEGAL REGULATIONS. Psychological testing has also been affected by legal mechanisms at the federal, state, and local levels, a reflection of broader social developments since midcentury. The most pertinent federal legislation is the Equal Employment Opportunity Act of 1972 (Title VII of the Civil Rights Act of 1964 and its subsequent amendments), which prohibits discrimination by employers, trade unions, or employment agencies on the basis of race, color, religion, sex, or national origin. Responsibility for implementation and enforcement is vested in the Equal Employment Opportunity Commission (EEOC). When charges are filed, the EEOC investigates the complaint and, if it finds the charges justified, tries first to correct the situation through conferences and voluntary compliance. If these procedures fail, the EEOC may proceed to hold hearings, issue cease and desist orders, and finally bring action in the federal courts. A number of states have enacted similar legislation and established Fair Employment Practices Commissions (FEPC) to implement it; in states having an approved FEPC, the federal Commission defers to the local agency and gives its findings and conclusions "substantial weight." A brief summary of the major legislative actions, executive orders, and court decisions bearing on employment testing can be found in Fincher (1973). In addition, the Office of Federal Contract Compliance (OFCC) has the authority to monitor the use of tests for employment purposes by government contractors. Colleges and universities are among the institutions concerned with OFCC regulations, because of their many research and training grants from such federal sources as the Department of Health, Education, and Welfare; like other contractors, they must attend to the quality of the appraisal methods they use.

Both EEOC and OFCC have drawn up guidelines regarding employee testing and other selection procedures, and the two sets are virtually identical in substance.3 A copy of the EEOC Guidelines on Employee Selection Procedures, prepared by the Commission (1970) as an aid in the implementation of the Civil Rights Act, is reproduced in Appendix B, together with a 1974 amendment of the OFCC guidelines clarifying acceptable procedures for reporting test validity. Some major provisions of these Guidelines should be noted. They begin with the following statement: "The guidelines in this part are based on the belief that properly validated and standardized employee selection procedures can significantly contribute to the implementation of nondiscriminatory personnel policies, as required by Title VII." It is also recognized that professionally developed tests, when used in conjunction with other tools of personnel assessment and complemented by sound programs of job design, may significantly aid in the development and maintenance of an efficient work force and, indeed, aid in the utilization and conservation of human resources generally. The Guidelines likewise point to the necessity of measuring those characteristics of people that are related to job performance.

A major portion of the Guidelines covers minimum requirements for acceptable validation (Sections 5 to 9). The reader may find it profitable to review these requirements after reading the more detailed technical discussion of validity in Chapters 6 and 7 of this book; it will be seen that they are generally in line with good psychometric practice. In defining acceptable procedures for establishing validity, the Guidelines make explicit reference to the Standards for Educational and Psychological Tests (1974) prepared by the American Psychological Association. When the use of a test (or other selection procedure) results in a significantly higher rejection rate for minority candidates than for nonminority candidates, its utility must be justified by evidence of validity for the job in question. Moreover, the same regulations specified for tests are applied to all other formal and informal selection procedures, such as educational or work-history requirements, interviews, and application forms (Sections 2 and 13). In the final section, dealing with affirmative action, the Guidelines point out that even when selection procedures have been satisfactorily validated, the employer is expected to take affirmative action with respect to the rejection rates of minority applicants.

3 In 1973, in the interest of simplification and improved coordination, the preparation of a set of uniform guidelines was undertaken by the Equal Employment Opportunity Coordinating Council, consisting of representatives of the EEOC, the U.S. Civil Service Commission, the U.S. Commission on Civil Rights, and the Departments of Labor and Justice. No uniform version has yet been adopted, however.

Affirmative action implies that an organization does more than merely avoid discriminatory practices. If the use of a selection procedure results in disproportionate rejection rates for minority candidates, steps are taken to reduce this discrepancy as much as possible. Such steps include explicitly encouraging minority candidates to apply, advertising job openings in the media most likely to reach minorities, and following other recruiting practices designed to counteract past stereotypes. Psychologically, affirmative action programs may be regarded as efforts to compensate for the residual effects of past social injustices. Such effects may include deficiencies in aptitudes, job skills, motivation, and other job-related behavior. They may also be manifested in a person's reluctance to apply for a job not traditionally open to members of his group, or in his inexperience in job-seeking procedures. Affirmative actions in meeting these problems include, when practicable, special training programs for the acquisition of prerequisite knowledge and job skills.

PART 2
Principles of Psychological Testing

CHAPTER 4

Norms and the Interpretation of Test Scores

IN THE absence of additional interpretive data, a raw score on any psychological test is meaningless. To say that an individual has correctly solved 15 problems on an arithmetic reasoning test, or identified 34 words in a vocabulary test, or successfully assembled a mechanical object in 57 seconds conveys little or no information about his standing in any of these functions. Nor do the familiar percentage scores provide a satisfactory solution to the problem of interpreting test scores. A score of 65 percent correct on one vocabulary test, for example, might be equivalent to 30 percent correct on another, and to 80 percent correct on a third. The difficulty level of the items making up each test will, of course, determine the meaning of the score. Like all raw scores, percentage scores can be interpreted only in terms of a clearly defined and uniform frame of reference.

Scores on psychological tests are most commonly interpreted by reference to norms, which represent the test performance of the standardization sample. The norms are thus empirically established by determining what a representative group of persons actually do on the test. Any individual's raw score is then referred to the distribution of scores obtained by the standardization sample, to discover where he falls in that distribution. Does his score coincide with the average performance of the standardization group? Is he slightly below average? Or does he fall near the upper end of the distribution?

In order to determine more precisely the individual's exact position with reference to the standardization sample, the raw score is converted into some relative measure. These derived scores are designed to serve a dual purpose. First, they indicate the individual's relative standing in the normative sample and thus permit an evaluation of his performance in reference to other persons. Second, they provide comparable measures that permit a direct comparison of the individual's performance on different tests. For example, if an individual has a raw score of 40 on a vocabulary test and a raw score of 22 on an arithmetic reasoning test, we obviously know nothing about his relative performance on the two tests. Is he better in vocabulary or in arithmetic, or equally good in both?

Since raw scores on different tests are usually expressed in different units, a direct comparison of such scores is impossible. The difficulty level of the particular test would also affect such a comparison between raw scores. Derived scores, on the other hand, can be expressed in the same units and referred to the same or to closely similar normative samples for different tests. The individual's relative performance in many different functions can thus be compared.

There are various ways in which raw scores may be converted to fulfill the two objectives stated above. Fundamentally, however, derived scores are expressed in one of two major ways: (1) developmental level attained, or (2) relative position within a specified group. These types of scores, together with some of their common variants, will be considered in separate sections of this chapter. But first it will be necessary to examine a few elementary statistical concepts that underlie the development and utilization of norms. The following section is included simply to clarify the meaning of certain common statistical measures. Simplified examples are given only for this purpose and not to provide training in statistical methods. For computational details and specific procedures to be followed in the practical application of these techniques, the reader is referred to any recent textbook on psychological statistics.

STATISTICAL CONCEPTS

A major object of statistical method is to organize and summarize quantitative data in order to facilitate their understanding. A list of 1,000 test scores can be an overwhelming sight. In that form, it conveys little meaning. A first step in bringing order into such a chaos of raw data is to tabulate the scores into a frequency distribution, as illustrated in Table 1. A distribution is prepared by grouping the scores into convenient class intervals and tallying each score in the appropriate interval. When all scores have been entered, the tallies are counted to find the frequency, or number of cases, in each class interval. The sums of these frequencies will equal N, the total number of cases in the group.

Table 1 shows the scores of 1,000 college students on a code-learning test in which one set of artificial words, or nonsense syllables, was to be substituted for another. The scores, giving the number of correct syllables substituted in a two-minute trial, ranged from 8 to 52. They have been grouped into class intervals of 4 points, from 8-11 at the bottom of the distribution to 52-55 at the top. Examination of the frequency column reveals that two persons scored between 8 and 11, three between 12 and 15, eight between 16 and 19, and so on.

TABLE 1
Frequency Distribution of Scores of 1,000 College Students on a Code-Learning Test
(Data from Anastasi, 1934, p. 34)

Class Interval    Frequency
    52-55              1
    48-51              1
    44-47             20
    40-43             73
    36-39            156
    32-35            328
    28-31            244
    24-27            136
    20-23             28
    16-19              8
    12-15              3
     8-11              2
              N = 1,000

The information provided by a frequency distribution can also be presented graphically in the form of a distribution curve. Figure 1 shows the data of Table 1 in graphic form. On the baseline, or horizontal axis, are the scores grouped into class intervals; on the vertical axis are the frequencies, or number of cases falling within each class interval. The graph has been plotted in two ways, both forms being in common use. In the histogram, the height of the column erected over each class interval corresponds to the number of persons scoring in that interval. We can think of each individual as standing on another's shoulders to form the column. In the frequency polygon, the number of persons in each interval is indicated by a point placed in the center of the class interval and across from the appropriate frequency. The successive points are then joined by straight lines.

FIG. 1. Distribution Curves: Frequency Polygon and Histogram. (Data from Table 1.)
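The tallying just described is easily mechanized. The following short program (in Python; the illustrative scores and all names are ours, not part of any published test materials) groups raw scores into 4-point class intervals and counts the frequency in each, in the manner of Table 1:

    from collections import Counter

    def frequency_distribution(scores, low, high, width):
        # Tally scores into class intervals of the given width, returning
        # (bottom, top, frequency) triples from the highest interval down,
        # as in Table 1.  Scores are assumed to lie between low and high.
        counts = Counter((score - low) // width for score in scores)
        n_intervals = (high - low + 1) // width
        table = []
        for i in reversed(range(n_intervals)):
            bottom = low + i * width
            table.append((bottom, bottom + width - 1, counts.get(i, 0)))
        return table

    # A handful of illustrative scores; Table 1 was based on 1,000 of them.
    sample = [34, 29, 41, 33, 26, 38, 35, 31, 44, 22, 33, 37]
    for bottom, top, freq in frequency_distribution(sample, 8, 55, 4):
        print(f"{bottom}-{top}: {freq}")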

Except for minor irregularities, the distribution portrayed in Figure 1 resembles the bell-shaped normal curve. A mathematically determined, perfect normal curve is reproduced in Figure 3. This type of curve has important mathematical properties and provides the basis for many kinds of statistical analyses. For the present purpose, only a few features will be noted. Essentially, the curve indicates that the largest number of cases cluster in the center of the range and that the number drops off gradually in both directions as the extremes are approached. The curve is bilaterally symmetrical, with a single peak in the center. Most distributions of human traits, from height and weight to aptitudes and personality characteristics, approximate the normal curve. In general, the larger the group, the more closely will the distribution resemble the theoretical normal curve.

A group of scores can also be described in terms of some measure of central tendency. Such a measure provides a single, most typical or representative score to characterize the performance of the entire group. The most familiar of these measures is the average, more technically known as the mean (M). As is well known, this is found by adding all scores and dividing the sum by the number of cases (N). Another measure of central tendency is the mode, or most frequent score. In a frequency distribution, the mode is the midpoint of the class interval with the highest frequency. Thus, in Table 1, the mode falls midway between 32 and 35, being 33.5. It will be noted that this score corresponds to the highest point on the distribution curve in Figure 1. A third measure of central tendency is the median, or middlemost score when all scores have been arranged in order of size. The median is the point that bisects the distribution, half the cases falling above it and half below.

Further description of a set of test scores is given by measures of variability, or the extent of individual differences around the central tendency. The most obvious and familiar way of reporting variability is in terms of the range between the highest and lowest score. The range, however, is extremely crude and unstable, for it is determined by only two scores. A single unusually high or low score would thus markedly affect its size. A more precise method of measuring variability is based on the differences between each individual's score and the mean of the group.

At this point it will be helpful to look at the example in Table 2, in which the various measures under consideration have been computed on a group of 10 cases. Such a small group was chosen in order to simplify the demonstration, although in actual practice we would rarely perform these computations on so few cases. The table also serves to introduce certain standard statistical symbols that should be noted for future reference. Original raw scores are conventionally designated by a capital X, and a small x is used to refer to deviations of each score from the group mean. The Greek letter Σ means "sum of"; the symbol |x| indicates that absolute values were summed, without regard to sign.

TABLE 2
Illustration of Central Tendency and Variability

Score (X)    Deviation (x)    Deviation Squared (x²)
   48            +8                 64
   47            +7                 49
   43            +3                  9
   41            +1                  1
   41            +1                  1
   40             0                  0
   38            -2                  4
   36            -4                 16
   34            -6                 36
   32            -8                 64

ΣX = 400         Σ|x| = 40          Σx² = 244
M = ΣX/N = 400/10 = 40
Median = 40.5 (50% of cases on each side)
AD = Σ|x|/N = 40/10 = 4.0
Variance = σ² = Σx²/N = 244/10 = 24.4
SD = σ = √24.4 = 4.9

It will be seen that the first column in Table 2 gives the data for the computation of mean and median. The mean is 40; the median is 40.5, falling midway between 40 and 41, with five cases (50 percent) above it and five below. There is little point in finding a mode in so small a group, since the cases do not show clear-cut clustering on any one score. Technically, however, 41 would represent the mode, because two persons obtained this score, while all other scores occur only once.

The second column shows how far each score deviates above or below the mean of 40. The sum of these deviations will always equal zero, because the positive and negative deviations around the mean necessarily cancel each other out (+20 - 20 = 0). If we ignore signs, however, we can average the absolute deviations, thus obtaining a measure known as the average deviation (AD). Although of some descriptive value, the AD is not suitable for use in further mathematical analyses because of the arbitrary discarding of signs.

A much more serviceable measure of variability is the standard deviation (symbolized by either SD or σ), in which the negative signs are legitimately eliminated by squaring each deviation. This procedure is followed in the last column of Table 2. The sum of this column, divided by the number of cases (Σx²/N), is known as the variance, or mean square deviation, and is symbolized by σ². The variance has proved extremely useful in sorting out the contributions of different factors to individual differences in test performance. For the present purposes, however, our chief concern is with the SD, which is the square root of the variance, as shown in Table 2. This measure is commonly employed in comparing the variability of different groups: a distribution with wider individual differences yields a larger SD than one with narrower individual differences, as illustrated in Figure 2. The SD also provides the basis for expressing an individual's scores on different tests in terms of norms, as will be shown in the section on standard scores.

FIG. 2. Frequency Distributions with the Same Mean but Different Variability. (One curve shows a large SD, the other a small SD.)

The interpretation of the SD is especially clear-cut when applied to a normal or approximately normal distribution curve. In such a distribution, there is an exact relationship between the SD and the proportion of cases falling within any given distance of the mean. On the baseline of the normal curve in Figure 3 have been marked distances representing one, two, and three standard deviations above and below the mean. The percentage of cases that fall between the mean and +1σ in a normal curve is 34.13. Because the curve is bilaterally symmetrical, 34.13 percent of the cases are likewise found between the mean and -1σ, so that between +1σ and -1σ on both sides of the mean there are 68.26 percent of the cases. Nearly all the cases (99.72 percent) fall within ±3σ of the mean. In the example given in Table 2, the mean corresponds to a score of 40, +1σ to 44.9 (40 + 4.9), +2σ to 49.8 (40 + 2 × 4.9), and so on. These relationships are particularly relevant in the interpretation of standard scores and percentiles, to be discussed in later sections.

FIG. 3. Percentage Distribution of Cases in a Normal Curve. (68.26 percent of the cases fall between -1σ and +1σ, 95.44 percent between -2σ and +2σ, and 99.72 percent between -3σ and +3σ.)
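The computations laid out in Table 2 can likewise be written out as a brief program. The sketch below applies the population formulas used in the text (dividing by N rather than N - 1) to the ten scores shown in Table 2:

    from statistics import median

    def describe(scores):
        # Mean, median, average deviation, variance, and SD, using the
        # population formulas of Table 2 (division by N, not N - 1).
        n = len(scores)
        mean = sum(scores) / n
        deviations = [x - mean for x in scores]        # these sum to zero
        ad = sum(abs(d) for d in deviations) / n       # average deviation
        variance = sum(d * d for d in deviations) / n  # sigma squared
        return mean, median(scores), ad, variance, variance ** 0.5

    scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]  # data of Table 2
    m, mdn, ad, var, sd = describe(scores)
    print(m, mdn, ad, var, round(sd, 1))               # 40.0 40.5 4.0 24.4 4.9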

DEVELOPMENTAL NORMS

One way in which meaning can be attached to test scores is to indicate how far along the normal developmental path the individual has progressed. Thus an 8-year-old who performs as well as the average 10-year-old on an intelligence test may be described as having a mental age of 10; a mentally retarded adult who performs at the same level would likewise be assigned an MA of 10. In a different context, a fourth-grade child may be characterized as reaching the sixth-grade norm on a reading test and the third-grade norm on an arithmetic test. Other developmental systems utilize more highly qualitative descriptions of behavior in specific functions, ranging from sensorimotor activities to concept formation. Scores based on developmental norms tend to be psychometrically crude and do not lend themselves well to precise statistical treatment. Nevertheless, they have considerable appeal for descriptive purposes, especially in the intensive clinical study of individuals and for certain research purposes.

MENTAL AGE. In Chapter 1 it was noted that the term "mental age" was widely popularized through the various translations and adaptations of the Binet-Simon scales, although Binet himself had employed the more neutral term "mental level." In age scales such as the Binet and its revisions, items are grouped into year levels. For example, those items passed by the majority of 7-year-olds in the standardization sample are placed in the 7-year level, those passed by the majority of 8-year-olds are assigned to the 8-year level, and so forth. A child's score on the test will then correspond to the highest year level that he can successfully complete. In actual practice, the individual's performance shows a certain amount of scatter; in other words, the subject fails some tests below his mental age level and passes some above it. For this reason, it is customary to compute the basal age, i.e., the highest age at and below which all tests are passed. Partial credits, in months, are then added to this basal age for all tests passed at higher year levels. The child's mental age on the test is the sum of the basal age and the additional months of credit earned at higher age levels.

Mental age norms have also been employed with tests that are not divided into year levels. In such a case, the subject's raw score is first determined. Such a score may be the total number of correct items on the whole test; or it may be based on time, on number of errors, or on some combination of such measures. The mean raw scores obtained by the children in each year group within the standardization sample constitute the age norms for such a test. The mean raw score of the 8-year-old children, for example, would represent the 8-year norm. If an individual's raw score is equal to the mean 8-year-old raw score, then his mental age on the test is 8 years. All raw scores on such a test can be transformed in a similar manner by reference to the age norms.
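The basal-age arithmetic can be illustrated with a short sketch. The scheme is hypothetical: it assumes, purely for illustration, a scale in which every test passed above the basal level earns two months of credit.

    def mental_age(basal_age_years, tests_passed_above_basal,
                   months_per_test=2):
        # Basal age plus partial credit: every test passed above the
        # basal level adds a fixed number of months of mental-age credit
        # (two months per test in this hypothetical scale).
        total_months = (basal_age_years * 12
                        + tests_passed_above_basal * months_per_test)
        return divmod(total_months, 12)   # (years, months)

    # A child passing all tests at the 7-year level and 5 scattered
    # tests above it:
    print(mental_age(7, 5))   # -> (7, 10), i.e., an MA of 7 years 10 months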
It should be noted that the mental age unit does not remain constant with age, but tends to shrink with advancing years. For example, a child who is one year retarded at age 4 will be approximately three years retarded at age 12: one year of mental growth from ages 3 to 4 is equivalent to three years of growth from ages 9 to 12. Since intellectual development progresses more rapidly at the earlier ages and gradually decreases as the individual approaches his mature limit, the mental age unit shrinks correspondingly with age. This relationship may be more readily visualized if we think of the individual's height as being expressed in terms of height age. The difference, in inches, between a height age of 3 and 4 years would be greater than that between a height age of 10 and 11. Owing to the progressive shrinkage of the MA unit, one year of acceleration or retardation at, let us say, age 5 represents a larger deviation from the norm than does one year of acceleration or retardation at age 10.

GRADE EQUIVALENTS. Scores on educational achievement tests are often interpreted in terms of grade equivalents. This practice is understandable because the tests are employed within an academic setting. To describe a pupil's achievement as equivalent to seventh-grade performance in spelling, eighth-grade in reading, and fifth-grade in arithmetic has the same popular appeal as the use of mental age in the traditional intelligence tests.

Grade norms are found by computing the mean raw score obtained by children in each grade. Thus, if the average number of problems solved correctly on an arithmetic test by the fourth graders in the standardization sample is 23, then a raw score of 23 corresponds to a grade equivalent of 4. Intermediate grade equivalents, representing fractions of a grade, are usually found by interpolation, although they can also be obtained directly by testing children at different times within the school year. Because the school year covers ten months, successive months can be expressed as decimals. For example, 4.0 refers to average performance at the beginning of the fourth grade (September testing), 4.5 refers to average performance at the middle of the grade (February testing), and so forth.

Despite their popularity, grade norms have several shortcomings. First, the content of instruction varies somewhat from grade to grade. Hence, grade norms are appropriate only for common subjects taught throughout the grade levels covered by the test. They are not generally applicable at the high school level, where many subjects may be studied for only one or two years. Even with subjects taught in each grade, the emphasis placed on different topics may vary from grade to grade, and progress may therefore be more rapid in one subject than in another during a particular grade. In other words, grade units are obviously unequal, and these inequalities occur irregularly in different subjects. Second, grade norms are subject to misinterpretation unless the test user keeps firmly in mind the manner in which they were derived. For example, if a fourth-grade child obtains a grade equivalent of 6.9 in arithmetic, it does not mean that he has mastered the arithmetic processes taught in the sixth grade. He undoubtedly obtained his score largely by superior performance in fourth-grade arithmetic; it certainly could not be assumed that he has the prerequisites for seventh-grade arithmetic. Third, grade norms tend to be incorrectly regarded as performance standards. A sixth-grade teacher, for example, may assume that all the pupils in her class should fall at or close to the sixth-grade norm on achievement tests. This misconception is certainly not surprising when grade norms are used. Yet individual differences within any one grade are such that the range of achievement test scores will inevitably extend over several grades.
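Where norms are established at only one point in each grade, an intermediate grade equivalent may be found by linear interpolation between successive grade means. A sketch follows; the norm table is invented for the example, apart from the arithmetic-test value of 23 for grade 4.0 cited above.

    def grade_equivalent(raw_score, grade_means):
        # Linear interpolation in a {grade: mean raw score} norm table.
        grades = sorted(grade_means)
        for g1, g2 in zip(grades, grades[1:]):
            m1, m2 = grade_means[g1], grade_means[g2]
            if m1 <= raw_score <= m2:
                return g1 + (g2 - g1) * (raw_score - m1) / (m2 - m1)
        raise ValueError("score outside the range of the norms")

    norms = {4.0: 23, 5.0: 31, 6.0: 38}            # hypothetical grade means
    print(round(grade_equivalent(27, norms), 1))   # -> 4.5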

ORDINAL SCALES. Another approach to developmental norms derives from research in child psychology. Empirical observation of behavior development in infants and young children led to the description of behavior typical of successive ages in such functions as locomotion, sensory discrimination, linguistic communication, and concept formation. An early example is provided by the work of Gesell and his associates (Ames, 1937; Gesell et al., 1940; Gesell & Amatruda, 1947; Halverson, 1933). The Gesell Developmental Schedules show the approximate developmental level, in months, that the child has attained in each of four major areas of behavior: motor, adaptive, language, and personal-social. These levels are found by comparing the child's behavior with that typical of eight key ages, ranging from 4 weeks to 36 months.

Gesell and his co-workers emphasized the sequential patterning of early behavior development. They cited extensive evidence of uniformities of developmental sequences and an orderly progression of behavior changes. For example, the infant's reactions toward a small object placed in front of him exhibit a characteristic chronological sequence in visual fixation and in hand and finger movements. Use of the entire hand in crude attempts at palmar prehension occurs at an earlier age than use of the thumb in opposition to the palm; this type of prehension is in turn followed by use of the thumb and index finger in a more efficient pincerlike grasp of the object. Such sequential patterning was likewise observed in walking, stair climbing, and most of the sensorimotor development of the first few years.

Since the 1960s, there has been a sharp upsurge of interest in the developmental theories of the Swiss child psychologist Jean Piaget (see Flavell, 1963; Ginsburg & Opper, 1969). Piaget's research has focused on the development of cognitive processes from infancy to the midteens. He is concerned with specific concepts rather than broad abilities. An example of such a concept, or schema, is object permanence, whereby the child is aware of the identity and continuing existence of objects when they are seen from different angles or are out of sight. Another widely studied concept is conservation, or the recognition that an attribute remains constant over changes in perceptual appearance, as when the same quantity of liquid is poured into differently shaped containers, or when rods of the same length are placed in different spatial arrangements.

Piagetian tasks have been used widely in research by developmental psychologists, and some have been organized into standardized scales, to be discussed in Chapters 10 and 14 (Goldschmid & Bentler, 1968b; Green, Ford, & Flamer, 1971; Pinard & Laurendeau, 1964; Uzgiris & Hunt, 1975). In accordance with Piaget's approach, these instruments are ordinal scales, in which the attainment of one stage is contingent upon completion of the earlier stages in the development of the concept. The tasks are designed to reveal the dominant aspects of each developmental stage; only later are empirical data gathered regarding the ages at which each stage is typically reached. In this respect, the procedure differs from that followed in constructing age scales, in which items are selected in the first place on the basis of their differentiating between successive ages.

It should be noted that this usage of the term "ordinal scale" differs from its meaning in statistics. An ordinal scale in the statistical sense is simply one that permits a rank-ordering of individuals without knowledge about the amount of difference between them. Ordinal scales of child development, in contrast, are ordinal in the sense that the developmental stages follow in a constant order, each stage presupposing mastery of prerequisite behavior characteristic of earlier stages. Such a scale conforms to the model of a Guttman scale, or simplex, in which successful performance at one level implies success at all lower levels (Guttman, 1944); an extension of Guttman's analysis to include nonlinear hierarchies is described by Bart and Airasian (1974). In summary, ordinal scales are designed to identify the stage reached by the child in the development of specific behavior functions. Insofar as they typically provide information about what the child is actually able to do (e.g., climbs stairs without assistance; recognizes identity in quantity of liquid when poured into differently shaped containers), they share important features with the criterion-referenced tests to be discussed in a later section of this chapter. Although scores may be reported in terms of approximate age levels, such scores are secondary to a qualitative description of the child's characteristic behavior.

WITHIN-GROUP NORMS

Nearly all standardized tests now provide some form of within-group norms. With such norms, the individual's performance is evaluated in terms of the performance of the most nearly comparable standardization group, as when comparing a child's raw score with that of children of the same chronological age or in the same school grade. Within-group scores have a uniform and clearly defined quantitative meaning and can be appropriately employed in most types of statistical analysis.
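The ordinal property just described, namely that success at any level implies success at all lower levels, is what Guttman's scalogram analysis examines. A minimal check of whether individual response patterns conform to such a simplex might look like the following sketch (the items and patterns are hypothetical):

    def is_guttman_pattern(responses):
        # True if pass/fail results on items ordered from easiest to
        # hardest never show a pass after a fail, i.e., success at one
        # level implies success at all lower levels.
        failed = False
        for passed in responses:
            if passed and failed:
                return False
            if not passed:
                failed = True
        return True

    print(is_guttman_pattern([True, True, True, False, False]))   # True
    print(is_guttman_pattern([True, False, True, False, False]))  # False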

PERCENTILES. Percentile scores are expressed in terms of the percentage of persons in the standardization sample who fall below a given raw score. For example, if 28 percent of the persons obtain fewer than 15 problems correct on an arithmetic reasoning test, then a raw score of 15 corresponds to the 28th percentile (P28). A percentile indicates the individual's relative position in the standardization sample. Percentiles can also be regarded as ranks in a group of 100, except that in ranking it is customary to start counting at the top, the best person in the group receiving a rank of one. With percentiles, on the other hand, we begin counting at the bottom, so that the lower the percentile, the poorer the individual's standing.

The 50th percentile (P50) corresponds to the median, already discussed as a measure of central tendency. Percentiles above 50 represent above-average performance; those below 50 signify inferior performance. The 25th and 75th percentiles are known as the first and third quartile points (Q1 and Q3), because they cut off the lowest and highest quarters of the distribution. Like the median, they provide convenient landmarks for describing a distribution of scores and comparing it with other distributions.

Percentiles should not be confused with the familiar percentage scores. The latter are raw scores, expressed in terms of the percentage of correct items; percentiles are derived scores, expressed in terms of percentage of persons. A raw score lower than any obtained in the standardization sample would have a percentile rank of zero (P0); a raw score higher than any in the standardization sample would have a percentile rank of 100 (P100). These percentiles, however, do not imply a zero raw score or a perfect raw score.

Percentile scores have several advantages. They are easy to compute and can be readily understood, even by relatively untrained persons. Moreover, percentiles are universally applicable. They can be used equally well with adults and children and are suitable for any type of test, whether it measures aptitude or personality variables.

The chief drawback of percentile scores arises from the marked inequality of their units, especially at the extremes of the distribution. If the distribution of raw scores approximates the normal curve, as is true of most test scores, then raw score differences near the median or center of the distribution are exaggerated in the percentile transformation, whereas raw score differences near the ends of the distribution are greatly shrunk. In a normal curve, it will be recalled, cases cluster closely at the center and scatter more widely as the extremes are approached. Consequently, any given percentage of cases near the center covers a shorter distance on the baseline than the same percentage near the ends of the distribution. This distortion of distances between scores can be seen in Figure 4: the discrepancy in the gaps between percentile ranks (PR) can readily be seen if we compare the distance between a PR of 40 and a PR of 50 with that between a PR of 10 and a PR of 20. Even more striking is the discrepancy between these distances and that between a PR of 1 and a PR of 10. (In a mathematically derived normal curve, the zero percentile is not reached until infinity and hence cannot be shown on the graph.)

The same relationship can be seen from the opposite direction if we examine the percentile ranks corresponding to equal σ-distances from the mean of a normal curve. These percentile ranks are given under the graph in Figure 4. For example, the percentile difference between the mean and +1σ is 34 (84 - 50), whereas that between +1σ and +2σ is only 14 (98 - 84).

FIG. 4. Percentile Ranks in a Normal Distribution. (The baseline shows percentile ranks of 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, and 99, together with the corresponding σ-distances from -3σ to +3σ; Q1, the median, and Q3 are marked on the curve.)
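A percentile rank is computed from the standardization sample by a simple count, as the sketch below shows. It uses one common convention, crediting half of the scores tied with the obtained score; other conventions are also in use.

    def percentile_rank(raw_score, sample):
        # Percentage of the standardization sample falling below the
        # given score, with half credit for scores tied with it.
        below = sum(1 for x in sample if x < raw_score)
        ties = sum(1 for x in sample if x == raw_score)
        return 100.0 * (below + 0.5 * ties) / len(sample)

    sample = [10, 12, 12, 13, 15, 15, 15, 18, 20, 22]
    print(percentile_rank(15, sample))   # -> 55.0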

To con"er~ an origi~$ll!tandard score to the new scale. and an SD of 100.5 X 100 = 650). because the total range of most groups extends no farther than about 3 SD's above and below the mean.". ard scores" or "z scores. such standard scores will have to be reported to at least one decimal place in order to provide sufficient differentiation among individuals. theyretain the exact numerical r~labons of the ongmal raw scores.:lly applied.60 = +1. "AXDARD SCORES.~:. a standard score of + l. score. any computations that can be carried out with the original raw scores can also be carried out with linear standard scores. tend to produce awkward numbers that are confusing and difficult to use for both computational and reporting purposes. Current tests are making increasing use of standard. For this reason. entile difference is 5 points. withollt any distortion of results. the illinterscoredifference will be correctly represented~ Many aptitude achievement atteries now utilize this technique in their score prob 'whichshow the individual's performance in each test. It is apparent that such a procedure will yield derived scores that have a negative sign for all subjects falling below the mean. usethey are computed by subtracting a constant from each raw score thendividing the result by another con~tant The relative magnitude ". Standard scores express the individual's distance from of meanin terms of the standard deviation of the distribution. Percentiles are spaced so as to ~orrespond ~~I istancesin a normal distribution. John Mary Ellen Edgar Jane Dick Bill Debby TABLE ~h-A Normal"PercentileChart.00 Both the abovE'conditions. . reprod in Figure 13 (Ch. X\=65 65 . w!. simply to put the scores into a more convenient form. . Linearly derived standard scores are often desilTnatedsimpl\' as "standb . .40 SD below the mean.100 = 4(0).60 Zl= X:=58 58 .Moreover.S ou1ltcorrespond to 650 (500 + 1. 5).thm both pal:s.Norms and the Interprdation of Test Scores . For this reason.eartransforma. viz." To compute a :. the occurrence of negative values and of decimals. Any raw score that is exactly equal to the mean is equivalent to a z smre of zero. Similarly.aw scores. ressed as 400 (500 . 3 Computation of Standard Scores X-M SD JOHN'S SCORE BILL'S SCORE "'canbe used to plot the scores of different persons.Whe~ found by a l. Table 3 shows the computation of z scores for two individuals. Jane and Dick differ by 10 percentile as do Bill and Debby. An example ~eIndividualReport Form of the Differential Aptitude Tests. on the same r thescoresof the same person on different tests. scoreswhichare the most satisfactory type of derived score ftom most ~oints' view. it is Simplynecessary to multiply the standard score by the 5 t .in. All-properties of the original distribution of raw scores are duplicated in the distribution of these standard scores.pfillciIJles of Psychological Testing 81 of differences between standard scores derived by such a linear transformation corresponds exactly to that between the . Compare the sc~re. . ~x~lnple. the scores on the Scholastic Aptitude Test (SAT) of the College Entrance Examination Board are standard scores adjusted to a mean ot. Standardscores mav be obtained by either linear or nonlinear transationsof the origi~al raw scores. In elther case.For. . we find the difference between the individual's raw score and the mean of the normative group and then divide this difference by the SD of the normative group. distance ~ed " hn and Mary with that between EIIen and Edgar. 
Both of the above conditions, viz., the occurrence of negative values and of decimals, tend to produce awkward numbers that are confusing and difficult to use for both computational and reporting purposes. For this reason, some further linear transformation is usually applied, simply to put the scores into a more convenient form. For example, the scores on the Scholastic Aptitude Test (SAT) of the College Entrance Examination Board are standard scores adjusted to a mean of 500 and an SD of 100. To convert an original standard score to the new scale, it is simply necessary to multiply the standard score by the desired SD (100) and add the result to or subtract it from the desired mean (500). Thus, a standard score of -1 on this test would be expressed as 400 (500 - 100 = 400). Similarly, a standard score of +1.5 would correspond to 650 (500 + 1.5 × 100 = 650).
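The z-score computation of Table 3 and the further linear transformation used for College Board scores can be expressed directly in a short sketch:

    def z_score(raw, mean, sd):
        # Standard score: distance from the mean in SD units.
        return (raw - mean) / sd

    def rescale(z, new_mean, new_sd):
        # Linear transformation to a more convenient scale, e.g. the
        # CEEB scale with mean 500 and SD 100.
        return new_mean + z * new_sd

    print(z_score(65, 60, 5))          # -> 1.0, John's z in Table 3
    print(z_score(58, 60, 5))          # -> -0.4, Bill's z
    print(rescale(-1.0, 500, 100))     # -> 400.0
    print(rescale(1.5, 500, 100))      # -> 650.0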

Any other convenient values can be arbitrarily chosen for the new mean and SD. Scores on the separate subtests of the Wechsler Intelligence Scales, for instance, are converted to a distribution with a mean of 10 and an SD of 3. All such measures are examples of linearly transformed standard scores.

It will be recalled that one of the reasons for transforming raw scores into any derived scale is to render scores on different tests comparable. The linearly derived standard scores discussed in the preceding section will be comparable only when found from distributions that have approximately the same form. Under such conditions, a score that cuts off the same percentage of persons in both distributions signifies that the individual occupies the same position in relation to both groups. If one distribution is markedly skewed and the other normal, however, a z score of +1.00 might exceed only 50 percent of the cases in one group but would exceed 84 percent in the other. In order to achieve comparability of scores from dissimilarly shaped distributions, nonlinear transformations may be employed to fit the scores to any specified type of distribution curve. The mental age and percentile scores described in earlier sections represent nonlinear transformations, but they are subject to other limitations already discussed. Although under certain circumstances another type of distribution may be more appropriate, the normal curve is usually employed for this purpose. One of the chief reasons for this choice is that most raw score distributions approximate the normal curve more closely than they do any other type of curve. Moreover, physical measures such as height and weight, which use equal-unit scales derived through physical operations, generally yield normal distributions. Another important advantage of the normal curve is that it has many useful mathematical properties, which facilitate further computations.2

Normalized standard scores are standard scores expressed in terms of a distribution that has been transformed to fit a normal curve. Such scores can be computed by reference to tables giving the percentage of cases falling at different SD distances from the mean of a normal curve. First, the percentage of persons in the standardization sample falling at or above each raw score is found. This percentage is then located in the normal curve frequency table, and the corresponding normalized standard score is obtained. Normalized standard scores are expressed in the same form as linearly derived standard scores, with a mean of zero and an SD of 1. Thus, a normalized score of zero indicates that the individual falls at the mean of a normal curve, excelling 50 percent of the group. A score of -1 means that he surpasses approximately 16 percent of the group; a score of +1, that he surpasses 84 percent. These percentages correspond to a distance of 1 SD below and 1 SD above the mean of a normal curve, respectively, as can be seen by reference to the bottom line of Figure 4.

Like linearly derived standard scores, normalized standard scores can be put into any convenient form. If the normalized standard score is multiplied by 10 and added to or subtracted from 50, it is converted into a T score, a type of score first proposed by McCall (1922). On this scale, a score of 50 corresponds to the mean, a score of 60 to 1 SD above the mean, and so forth. Another well-known transformation is represented by the stanine scale, developed by the United States Air Force during World War II. This scale provides a single-digit system of scores with a mean of 5 and an SD of approximately 2.3 The name stanine (a contraction of "standard nine") is based on the fact that the scores run from 1 to 9. The restriction of scores to single-digit numbers has certain computational advantages, for each score requires only a single column on computer punched cards, thus being easier to handle quantitatively.

Raw scores can readily be converted to stanines by arranging the original scores in order of size and then assigning stanines in accordance with the normal curve percentages reproduced in Table 4.

TABLE 4
Normal Curve Percentages for Use in Stanine Conversion

Percentage:   4    7   12   17   20   17   12    7    4
Stanine:      1    2    3    4    5    6    7    8    9

If the group consists of exactly 100 cases, the conversion is direct: the 4 lowest-scoring persons receive a stanine score of 1, the next 7 a score of 2, the next 12 a score of 3, and so forth. When the group contains more or fewer than 100 cases, the number of cases corresponding to each designated percentage is first computed, and these numbers of cases are then given the appropriate stanines. Thus, out of 200 cases, 8 would be assigned a stanine of 1 (4 percent of 200 = 8); with 150 cases, 6 would receive a stanine of 1 (4 percent of 150 = 6). For any group containing from 10 to 100 cases, Bartlett and Edgerton (1966) have prepared a table whereby ranks can be directly converted to stanines. Because of their practical as well as theoretical advantages, stanines are being used increasingly, especially with aptitude and achievement tests. Other variants are the C scale (Guilford & Fruchter, 1973, Ch. 19), consisting of 11 units and also yielding an SD of 2, and the 10-unit sten scale, with 5 units above and 5 below the mean (Canfield, 1951).

2 Partly for this reason and partly as a result of other theoretical considerations, it has frequently been argued that, by normalizing raw scores, an equal-unit scale could be developed for psychological measurement, similar to the equal-unit scales of physical measurement. This, however, is a debatable point that involves certain questionable assumptions.

3 Kaiser (1958) proposed a modification of the stanine scale that involves slight changes in the percentages and yields an SD of exactly 2.
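Assigning stanines by the percentages of Table 4 amounts to slicing the rank-ordered group into nine blocks, as sketched below. The rounding of block boundaries and the treatment of tied scores are handled crudely here and would need refinement in practice.

    def assign_stanines(scores):
        # Slice the rank-ordered group into blocks of 4, 7, 12, 17, 20,
        # 17, 12, 7, and 4 percent (Table 4).  Ties at a block boundary
        # are broken arbitrarily.
        percents = [4, 7, 12, 17, 20, 17, 12, 7, 4]
        n = len(scores)
        order = sorted(range(n), key=lambda i: scores[i])
        stanines = [0] * n
        start = 0.0
        for stanine, pct in enumerate(percents, start=1):
            stop = start + n * pct / 100.0
            for rank in range(round(start), round(stop)):
                stanines[order[rank]] = stanine
            start = stop
        return stanines

    s = assign_stanines(list(range(1, 101)))    # 100 distinct scores
    print(s.count(1), s.count(5), s.count(9))   # -> 4 20 4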

Although normalized standard scores are the most satisfactory type of score for the majority of purposes, there are nevertheless certain technical objections to normalizing all distributions routinely. Such a transformation should be carried out only when the sample is large and representative and when there is reason to believe that the deviation from normality results from defects in the test rather than from characteristics of the sample or from other factors affecting the behavior under consideration. It should also be noted that when the original distribution of raw scores approximates normality, the linearly derived standard scores and the normalized standard scores will be very similar. Although the methods of deriving these two types of scores are quite different, the resulting scores will be nearly identical under such conditions; obviously, the process of normalizing a distribution that is already virtually normal will produce little or no change. Whenever feasible, it is generally more desirable to obtain a normal distribution of raw scores by proper adjustment of the difficulty level of test items than by subsequently normalizing a markedly nonnormal distribution. With an approximately normal distribution of raw scores, the linearly derived standard scores will serve the same purposes as normalized standard scores.

DEVIATION IQ. In an effort to convert MA scores into a uniform index of the individual's relative status, the ratio IQ (Intelligence Quotient) was introduced in early intelligence tests. Such an IQ was simply the ratio of mental age to chronological age, multiplied by 100 to eliminate decimals (IQ = 100 × MA/CA). Obviously, if a child's MA equals his CA, his IQ will be exactly 100. An IQ of 100 thus represents normal or average performance; IQ's below 100 indicate retardation, those above 100, acceleration.

The apparent logical simplicity of the traditional ratio IQ, however, proved deceptive. A major technical difficulty is that, unless the SD of the IQ distribution remains approximately constant with age, ratio IQ's will not be comparable at different age levels. An IQ of 115 at one age, for example, may indicate the same degree of superiority as a numerically different IQ at age 12. In actual practice, it proved difficult to construct tests that met the psychometric requirements for comparability of ratio IQ's throughout their age range. Chiefly for this reason, the ratio IQ has been largely replaced by the so-called deviation IQ, which is actually another variant of the familiar standard score. The deviation IQ is a standard score with a mean of 100 and an SD that approximates the SD of the Stanford-Binet IQ distribution. Although the SD of the Stanford-Binet ratio IQ (last used in the 1937 edition) was not exactly constant at all ages, it fluctuated around a median value slightly greater than 16. Hence, if an SD close to 16 is chosen in reporting standard scores on a newly developed test, the resulting scores can be interpreted in the same way as Stanford-Binet ratio IQ's. Since Stanford-Binet IQ's have been in use for many years, testers and clinicians have become accustomed to interpreting and classifying test performance in terms of such IQ levels. They have learned what to expect from individuals with IQ's of 40, 70, 90, 130, and so forth. There are therefore certain practical advantages in the use of a derived scale that corresponds to the familiar distribution of Stanford-Binet IQ's. Such a correspondence of score units can be achieved by the selection of numerical values for the mean and SD that agree closely with those in the Stanford-Binet distribution.

It should be added that the use of the term "IQ" to designate such standard scores may seem somewhat misleading. Such IQ's are not derived by the same methods employed in finding traditional ratio IQ's; they are not ratios of mental ages and chronological ages. The justification lies in the general familiarity of the term "IQ," and in the fact that such scores can be interpreted as IQ's provided that their SD is approximately equal to that of previously known IQ's. Among the first tests to express scores in terms of deviation IQ's were the Wechsler Intelligence Scales. In these tests, the mean is 100 and the SD 15. Deviation IQ's are also used in a number of current group tests of intelligence and in the latest revision of the Stanford-Binet itself.

With the increasing use of deviation IQ's, it is important to remember that deviation IQ's from different tests are comparable only when they employ the same or closely similar values for the SD. This value should always be reported in the manual and carefully noted by the test user. If a test maker chooses a different value for the SD in making up his deviation IQ scale, the meaning of any given IQ on his test will be quite different from its meaning on other tests. These discrepancies are illustrated in Table 5, which shows the percentage of cases in normal distributions with SD's from 12 to 18 who would obtain IQ's at different levels. An IQ of 70 has been used traditionally as a cutoff point for identifying mental retardation. Table 5 shows that an IQ of 70 cuts off the lowest 0.7 percent of the cases when the SD is 12, but as many as 5.1 percent when the SD is 18; an intermediate percentage, about 3 percent, is cut off when the SD is 16 (as in the Stanford-Binet).
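Because a deviation IQ is simply a standard score placed on a scale with a mean of 100, both the computation and the effect of choosing a different SD can be shown in a few lines (the age-group mean and SD below are invented for the illustration):

    def deviation_iq(raw, age_mean, age_sd, iq_sd=16):
        # A z score within one's own age group, rescaled to mean 100
        # and the chosen IQ standard deviation.
        return 100 + iq_sd * (raw - age_mean) / age_sd

    # The same performance, one SD above the age mean (hypothetical
    # age norms: mean 50, SD 8), reported with different IQ SDs:
    print(deviation_iq(58, 50, 8, iq_sd=12))   # -> 112.0
    print(deviation_iq(58, 50, 8, iq_sd=18))   # -> 118.0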

TABLE 5
Percentage of Cases at Each IQ Interval in Normal Distributions with Mean of 100 and Different Standard Deviations
(Courtesy Test Department, Harcourt Brace Jovanovich, Inc.)

[The table lists the IQ intervals 130 and above, 120-129, 110-119, 100-109, 90-99, 80-89, 70-79, and below 70, with the percentage of cases falling in each interval shown separately for SD's of 12, 14, 16, and 18; each column totals 100.0.]

The same discrepancies, of course, apply to IQ's of 130 and above, which might be used in selecting children for special programs for the intellectually gifted. The IQ range between 90 and 110, generally described as normal, may include as few as 42 percent or as many as 59.6 percent of the population, depending on the test chosen. In connection with this point, it should be noted that test publishers are making efforts to adopt the uniform SD of 16 in new tests and in new editions of earlier tests. There are still enough variations among currently available tests, however, to make the checking of the SD imperative.

INTERRELATIONSHIPS OF WITHIN-GROUP SCORES. At this stage in our discussion of derived scores, the reader may have become aware of a rapprochement among the various types of scores. Percentiles have gradually been taking on at least a graphic resemblance to normalized standard scores. Linear standard scores are indistinguishable from normalized standard scores if the original distribution of raw scores closely approximates the normal curve. Finally, standard scores have become IQ's, and vice versa. In connection with the latter point, a reexamination of the meaning of a ratio IQ on such a test as the Stanford-Binet will show that these IQ's can themselves be interpreted as standard scores. If we know that the distribution of Stanford-Binet ratio IQ's had a mean of 100 and an SD of approximately 16, we can conclude that an IQ of 116 falls at a distance of 1 SD above the mean and represents a standard score of +1.00. Similarly, an IQ of 132 corresponds to a standard score of +2.00, an IQ of 76 to a standard score of -1.50, and so forth. Moreover, a Stanford-Binet ratio IQ of 116 corresponds to a percentile rank of approximately 84, because in a normal curve 84 percent of the cases fall below +1.00 SD (Figure 4).

In Figure 6 are summarized the relationships that exist in a normal distribution among the types of scores so far discussed in this chapter. These include z scores, College Entrance Examination Board (CEEB) scores, Wechsler deviation IQ's (SD = 15), T scores, stanines, and percentiles. Ratio IQ's on any test will coincide with the given deviation IQ scale if they are normally distributed and have an SD of 15. Any other normally distributed IQ could be added to the chart, provided we know its SD. If the SD is 20, for instance, then an IQ of 120 corresponds to +1 SD, an IQ of 80 to -1 SD, and so on.

FIG. 6. Relationships among Different Types of Test Scores in a Normal Distribution. (The chart aligns z scores from -4 to +4 with CEEB scores from 200 to 800, deviation IQ's (SD = 15) from 55 to 145, T scores from 10 to 90, stanines 1 through 9 with their percentages of 4, 7, 12, 17, 20, 17, 12, 7, and 4, and percentile ranks from 1 to 99.)
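All of the equivalencies displayed in Figure 6 follow from the normal curve, so any one type of score can be converted to the others, as this sketch shows:

    from statistics import NormalDist

    def from_z(z):
        # One z score expressed on each scale shown in Figure 6.
        return {
            "z score": z,
            "T score": 50 + 10 * z,
            "CEEB score": 500 + 100 * z,
            "Deviation IQ (SD = 15)": 100 + 15 * z,
            "Percentile": 100 * NormalDist().cdf(z),
        }

    for scale, value in from_z(1.0).items():
        print(f"{scale}: {value:.1f}")
    # T = 60.0, CEEB = 600.0, IQ = 115.0, percentile = 84.1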

When certain statistical conditions are met, each of these scores can be readily translated into any of the others, provided we know the SD of the distribution. If the SD is 20, for instance, then an IQ of 120 corresponds to +1 SD, an IQ of 80 to -1 SD, and so on. Most types of within-group derived scores are thus fundamentally similar. Standard scores in any form (including the deviation IQ) have generally replaced other types of scores because of certain advantages they offer with regard to test construction, statistical treatment of data, familiarity, and ease of developing norms. In actual practice, the exact form in which scores are reported is dictated largely by convenience.

INTERTEST COMPARISONS. An IQ, or any other score, should always be accompanied by the name of the test on which it was obtained. Test scores cannot be properly interpreted in the abstract; they must be referred to particular tests. If the school records show that Bill Jones received an IQ of 94 and Tom Brown an IQ of 110, such IQ's cannot be accepted at face value without further information. The positions of these two students might have been reversed by exchanging the particular tests that each was given in his respective school.

Similarly, an individual's relative standing in different functions may be grossly misrepresented through lack of comparability of test norms. Let us suppose that a student has been given a verbal comprehension test and a spatial aptitude test to determine his relative standing in the two fields. If the verbal ability test was standardized on a random sample of high school students, while the spatial test was standardized on a selected group of boys attending elective shop courses, the examiner might erroneously conclude that the individual is much more able along verbal than along spatial lines, when the reverse may actually be the case.

Still another example involves longitudinal comparisons of a single individual's test performance over time. If a schoolchild's cumulative record shows IQ's of 118, 115, and 101 at the fourth, fifth, and sixth grades, the first question to ask before interpreting these changes is, "What tests did he take on these three occasions?" The apparent decline may reflect no more than the differences among the tests; he might have obtained these same scores even if the three tests had been administered within a week of each other.

There are three principal reasons for systematic variations among the scores obtained by the same individual on different tests. First, tests may differ in content despite their similar labels. So-called intelligence tests provide many illustrations of this confusion. Although commonly described by the same blanket term, one of these tests may include only verbal content, another may tap predominantly spatial aptitudes, and still another may cover verbal, numerical, and spatial content in about equal proportions. Second, the scale units may not be comparable. As explained earlier in this chapter, if IQ's on one test have an SD of 12 and IQ's on another have an SD of 18, then an individual who received an IQ of 112 on the first test is most likely to receive an IQ of 118 on the second. Third, the composition of the standardization samples used in establishing norms for different tests may vary. Obviously, the same individual will appear to have performed better when compared with an inferior group than when compared with a superior group.

Lack of comparability of either test content or scale units can usually be detected by reference to the test itself or to the test manual. Differences in the respective normative samples, however, are more likely to be overlooked. Such differences probably account for many otherwise unexplained discrepancies in test results.

THE NORMATIVE SAMPLE. Any norm, however expressed, is restricted to the particular normative population from which it was derived. Psychological test norms are in no sense absolute, universal, or permanent. They merely represent the test performance of the subjects constituting the standardization sample, and the test user should never lose sight of the way in which norms are established. In choosing such a sample, an effort is usually made to obtain a representative cross section of the population for which the test is designed.

In statistical terminology, a distinction is made between sample and population. The former refers to the group of individuals actually tested. The latter designates the larger, but similarly constituted, group from which the sample is drawn. For example, if we wish to establish norms of test performance for the population of 10-year-old, urban, public school boys, we might test a carefully chosen sample of 500 10-year-old boys attending public schools in several American cities. The sample would be checked with reference to geographical distribution, socioeconomic level, ethnic composition, and other relevant characteristics to ensure that it was truly representative of the defined population.

In the development and application of test norms, considerable attention should therefore be given to the standardization sample. It is apparent that the sample on which the norms are based should be large enough to provide stable values; another, similarly chosen sample of the same population should not yield norms that diverge appreciably from those obtained. Norms with a large sampling error would obviously be of little value in the interpretation of test scores.
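Returning for a moment to the second of the three reasons noted above, noncomparable scale units: the conversion involved is simple arithmetic, and can be sketched in a few lines of code. The sketch below is illustrative only; the function name and the assumption that both tests use the conventional mean of 100 are the writer's, not part of any test manual. It reproduces the example in which an IQ of 112 on a test whose SD is 12 corresponds to an IQ of 118 on a test whose SD is 18.

```python
# A minimal sketch of converting an IQ between two tests whose scales
# share the same mean (assumed here to be 100) but have different SDs.

def convert_iq(iq, sd_from, sd_to, mean=100.0):
    z = (iq - mean) / sd_from   # the individual's relative standing on the first test
    return mean + z * sd_to     # the same standing expressed in the second test's units

print(convert_iq(112, sd_from=12, sd_to=18))  # -> 118.0, as in the example above
```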

Equally important is the requirement that the sample be representative of the population under consideration. Subtle selective factors that might make the sample unrepresentative should be carefully investigated. A number of such selective factors are illustrated in institutional samples. Because such samples are usually large and readily available for testing purposes, they offer an alluring field for the accumulation of normative data. The special limitations of these samples, however, should be carefully analyzed. Testing subjects in school, for example, will yield an increasingly superior selection of cases in the successive grades, owing to the progressive dropping out of the less able pupils. Nor does such elimination affect different subgroups equally. For example, the rate of selective elimination from school is greater for boys than for girls, and greater in lower than in higher socioeconomic levels.

Selective factors likewise operate in other institutional samples, such as prisoners, patients in mental hospitals, or institutionalized mental retardates. Because of the many special factors that determine institutionalization itself, such groups are not representative of the entire population of criminals, psychotics, or mental retardates. For example, mental retardates with physical handicaps are more likely to be institutionalized than are the physically fit. Similarly, the relative proportion of severely retarded persons will be much greater in institutional samples than in the total population.

Closely related to the question of representativeness of sample is the need for defining the specific population to which the norms apply. Obviously, one way of ensuring that a sample is representative is to restrict the population to fit the specifications of the available sample. For example, if the population is defined to include only 14-year-old schoolchildren rather than all 14-year-old children, then a school sample would be representative. Ideally, of course, the desired population should be defined in advance in terms of the objectives of the test, and a suitable sample then assembled. Practical obstacles in obtaining subjects, however, may make this goal unattainable. In such a case, it is far better to redefine the population more narrowly than to report norms on an ideal population which is not adequately represented by the standardization sample. In actual practice, very few tests are standardized on such broad populations as is popularly assumed. No test provides norms for the human species! And it is doubtful whether any tests give truly adequate norms for such broadly defined populations as "adult American men," "10-year-old American children," and the like. Consequently, the samples obtained by different test constructors often tend to be unrepresentative of their alleged populations and biased in different ways; hence, the resulting norms are not comparable.

NATIONAL ANCHOR NORMS. One solution for the lack of comparability of norms is to use an anchor test to work out equivalency tables for scores on different tests. Such tables are designed to show what score in Test A is equivalent to each score in Test B. This can be done by the equipercentile method, in which scores are considered equivalent when they have equal percentiles in a given group. For example, if the 80th percentile in the same group corresponds to an IQ of 115 on Test A and to an IQ of 120 on Test B, then Test-A-IQ 115 is considered to be equivalent to Test-B-IQ 120. This approach has been followed to a limited extent by some test publishers, who have prepared equivalency tables for a few of their own tests (see, e.g., Lennon, 1966a).

More ambitious proposals have been made from time to time for calibrating each new test against a single anchor test, which has itself been administered to a highly representative, national normative sample (Lennon, 1966b). No single anchor test, of course, could be used in establishing norms for all tests, regardless of content. What is required is a battery of anchor tests, all administered to the same national sample. Each new test could then be checked against the most nearly similar anchor test in the battery. The data gathered in Project TALENT (Flanagan et al., 1964) so far come closest to providing such an anchor battery for a high school population. Using a random sample of about 5 percent of the high schools in this country, the investigators administered a two-day battery of specially constructed aptitude, achievement, interest, and temperament tests to approximately 400,000 students in grades 9 through 12.

The Project TALENT battery has been employed to calibrate several test batteries in use by the Navy and Air Force (Dailey, Shaycoft, & Orr, 1962; Shaycoft, Neyman, & Dailey, 1962). The general procedure is to administer both the Project TALENT battery and the tests to be calibrated to the same sample. Through correlational analysis, a composite of Project TALENT tests is identified that is most nearly comparable to each test to be normed. By means of the equipercentile method, tables are then prepared giving the corresponding scores on the Project TALENT composite and on the particular test. For several other batteries, data have likewise been gathered to identify the Project TALENT composite corresponding to each test in the battery (Cooley, 1965; Cooley & Miller, 1965). These batteries include the General Aptitude Test Battery of the United States Employment Service, the Differential Aptitude Tests, and the Flanagan Aptitude Classification Tests.

Even with the availability of anchor data such as these, however, it must be recognized that independently developed tests can never be regarded as completely interchangeable. At best, the use of national anchor norms would appreciably reduce the lack of comparability among tests, but it would not eliminate it.4

4 For an excellent analysis of some of the technical difficulties involved in efforts to achieve score comparability with different tests, see Angoff (1964, 1966, 1971a).
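The equipercentile method lends itself to a brief computational illustration. The sketch below is hypothetical in every particular: the data are randomly generated rather than real norms, and the function name is invented. It simply pairs the scores in two distributions that occupy the same percentile in the same group, which is the operation the method prescribes.

```python
# A minimal sketch of the equipercentile method: scores on Test A and
# Test B are treated as equivalent when they have equal percentiles in
# the same group of examinees. Data here are simulated, not real norms.
import numpy as np

def equipercentile_pairs(scores_a, scores_b, percentiles):
    return [(np.percentile(scores_a, p), np.percentile(scores_b, p))
            for p in percentiles]

rng = np.random.default_rng(0)
test_a = rng.normal(100, 12, size=1000)   # IQ's on Test A (SD 12)
test_b = rng.normal(100, 18, size=1000)   # IQ's on Test B (SD 18), same group
for a, b in equipercentile_pairs(test_a, test_b, [20, 50, 80]):
    print(f"Test A {a:6.1f}  is treated as equivalent to  Test B {b:6.1f}")
```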

Of particular interest is the Anchor Test Study conducted by the Educational Testing Service under the auspices of the U.S. Office of Education (Jaeger, 1973). This study represents a systematic effort to provide comparable and truly representative national norms for the seven most widely used reading achievement tests for elementary schoolchildren. Over 300,000 fourth-, fifth-, and sixth-grade schoolchildren were examined in 50 states. The anchor test consisted of the reading comprehension and vocabulary subtests of the Metropolitan Achievement Test, for which new norms were established in one phase of the project. In the equating phase of the study, conducted through an unusually well-controlled experimental design, each child took the reading comprehension and vocabulary subtests from two of the seven batteries, each battery being paired in turn with every other battery; in order to control for order of administration, all the pairings were duplicated in reverse sequence. Some groups took parallel forms of the same battery. From statistical analyses of all these data, score equivalency tables for the seven tests were prepared by the equipercentile method. A manual for interpreting scores is provided for use by school systems and other interested persons (Loret, Seder, Bianchini, & Vale, 1974).

SPECIFIC NORMS. Another approach to the nonequivalence of existing norms, and probably a more realistic one for most tests, is to standardize tests on more narrowly defined populations, so chosen as to suit the specific purposes of each test. In such cases, the limits of the normative population should be clearly reported with the norms. Thus, the norms might be said to apply to "employed clerical workers in large business organizations" or to "first-year engineering students." For many testing purposes, highly specific norms are desirable. Even when representative norms are available for a broadly defined population, it is often helpful to have separately reported subgroup norms. This is true whenever recognizable subgroups yield appreciably different scores on a particular test. The subgroups may be formed with respect to age, grade, type of curriculum, sex, geographical region, urban or rural environment, socioeconomic level, and many other factors. The use to be made of the test determines the type of differentiation that is most relevant, as well as whether general or specific norms are more appropriate.

Mention should also be made of local norms, often developed by the test users themselves within a particular setting. The groups employed in deriving such norms are even more narrowly defined than the subgroups considered above. Thus, an employer may accumulate norms on applicants for a given type of job within his company. A college admissions office may develop norms on its own student population. Or a single elementary school may evaluate the performance of individual pupils in terms of its own score distribution. These local norms are more appropriate than broad national norms for many testing purposes, such as the prediction of subsequent job performance or college achievement, the comparison of a child's relative achievement in different subjects, or the measurement of an individual's progress over time.

FIXED REFERENCE GROUP. Although most derived scores are computed in such a way as to provide an immediate normative interpretation of test performance, there are some notable exceptions. One type of nonnormative scale utilizes a fixed reference group in order to ensure comparability and continuity of scores, without providing normative evaluation of performance. With such a scale, normative interpretation requires reference to independently collected norms from a suitable population; local or other specific norms are often used for this purpose.

One of the clearest examples of scaling in terms of a fixed reference group is provided by the score scale of the College Board Scholastic Aptitude Test (Angoff, 1962, 1971b). Between 1926 (when this test was first administered) and 1941, SAT scores were expressed on a normative scale, in terms of the mean and SD of the candidates taking the test at each administration. As the number and variety of College Board member colleges increased and the composition of the candidate population changed, it was concluded that scale continuity should be maintained. Otherwise, an individual's score would depend on the characteristics of the group tested during a particular year. An even more urgent reason for scale continuity stemmed from the observation that students taking the SAT at certain times of the year performed more poorly than those taking it at other times, owing to the differential operation of selective factors.

After 1941, therefore, all SAT scores were expressed in terms of the mean and SD of the approximately 11,000 candidates who took the test in 1941. These candidates constitute the fixed reference group employed in scaling all subsequent forms of the test. Thus, a score of 500 on any form of the SAT corresponds to the mean of the 1941 sample, a score of 600 falls 1 SD above that mean, and so on. To permit translation of raw scores on any form of the SAT into these fixed-reference-group scores, a short anchor test (or set of common items) is included in each form. Each new form is thereby linked to one or two earlier forms, which in turn are linked with other forms by a chain of items extending back to the 1941 form. These nonnormative SAT scores can then be interpreted by comparison with any appropriate distribution of scores, such as that of a particular college, a type of college, a region, and so on.
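The arithmetic of such a fixed-reference-group scale can be sketched briefly. In the illustration below, every number is hypothetical; in operational practice, the mean and SD that the reference group would have earned on a new form are estimated through the anchor items rather than assumed, as they are here. The scale constants 500 and 100 follow from the statement above that 500 marks the reference mean and 600 lies one reference SD above it.

```python
# A minimal sketch of reporting scores on a fixed-reference-group scale
# (mean 500, 100 points per reference-group SD). The reference mean and
# SD for the new form are taken as given; estimating them through the
# anchor items is the equating step omitted here.

def to_fixed_scale(raw, ref_mean_raw, ref_sd_raw):
    z = (raw - ref_mean_raw) / ref_sd_raw
    return 500 + 100 * z

# If the reference group would have averaged 31 raw points (SD 9) on this
# form, a raw score of 40 lies 1 SD above that mean and is reported as 600.
print(to_fixed_scale(40, ref_mean_raw=31, ref_sd_raw=9))  # -> 600.0
```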

Such specific norms are, moreover, more useful in making college admission decisions than would be annual norms based on the entire candidate population. And any changes in the candidate population over time can be detected only with a fixed-score scale.

It will be noted that the principal difference between the fixed-reference-group scales under consideration and the previously discussed scales based on national anchor norms is that the latter require the choice of a single group that is broadly representative and appropriate for normative purposes. Apart from the practical difficulties in obtaining such a group and the need to update the norms, it is likely that for many testing purposes such broad norms are not required. Scales built from a fixed reference group are analogous in one respect to scales employed in physical measurement. In this connection, Angoff (1962, pp. 32-33) writes:

There is hardly a person here who knows the precise original definition of the length of the foot used in the measurement of height or distance, or which it was whose foot was originally agreed upon as the standard. Our ignorance of the precise origin or derivation of the foot does not lessen its usefulness to us in any way. Its usefulness derives from the fact that it remains the same over time and allows us to familiarize ourselves with it; there is no one here who does not know how to evaluate lengths and distances in terms of this unit. Needless to say, precisely the same considerations apply to other units of measurement: the inch, the mile, the degree of Fahrenheit, and so on. In the field of psychological measurement it is similarly reasonable to say that the original definition of the scale is, or should be, of no consequence. What is of consequence is the maintenance of a constant scale (which, in the case of a multiple-form testing program, is achieved by rigorous form-to-form equating) and the provision of supplementary normative data to aid in interpretation and in the formation of specific decisions, data which would be revised from time to time as conditions warrant.

COMPUTER UTILIZATION IN THE INTERPRETATION OF TEST SCORES

Computers have already made a significant impact upon every phase of testing, from test construction to administration, scoring, reporting, and interpretation. The obvious uses of computers, and those developed earliest, represent simply an unprecedented increase in the speed with which traditional data analyses and scoring processes can be carried out. Most current tests, and especially those designed for group administration, are now adapted for computer scoring (Baker, 1971). Several test publishers, as well as independent test-scoring organizations, are equipped to provide such scoring services to test users. Although separate answer sheets are commonly used for this purpose, optical scanning equipment available at some scoring centers permits the reading of responses directly from test booklets.

Far more important, however, are the adoption of new procedures and the exploration of new approaches to psychological testing which would have been impossible without the flexibility, speed, and data-processing capabilities of computers. Many innovative possibilities, such as diagnostic scoring and path analysis (recording a student's progress at various stages of learning), have barely been explored. As Baker (1971, p. 227) succinctly puts it, computer capabilities should serve "to free one's thinking from the constraints of the past." Various testing innovations resulting from computer utilization will be discussed under appropriate topics throughout the book. In the present connection, we shall examine some applications of computers in the interpretation of test scores.

At the simplest level, certain tests now provide facilities for computer interpretation of test scores. In such cases, test users may obtain computer printouts of diagnostic and interpretive statements about the subject's personality tendencies and emotional condition, together with the numerical scores. This approach has been followed, for example, with the Minnesota Multiphasic Personality Inventory (MMPI). Essentially, the computer program associates prepared verbal statements with particular patterns of test responses. At a somewhat more complex level, the Differential Aptitude Tests (see Ch. 13) provide a Career Planning Report, which includes a profile of scores on the separate subtests as well as an interpretive computer printout. The latter contains verbal statements that combine the test data with information on interests and goals given by the student on a Career Planning Questionnaire. These statements are typical of what a counselor would say to the student in going over his test results in an individual conference (Super, 1970). This approach has been pursued with both personality and aptitude tests.

Individualized interpretation of test scores at a still more complex level is illustrated by interactive computer systems, in which the individual is in direct contact with the computer by means of response stations and in effect engages in a dialogue with the computer (J. A. Harris, 1973; Holtzman, 1970; Super, 1970). Typically, test scores are incorporated in the computer data base, together with other information provided by the student or client. In such a situation, the computer combines all the available information about the individual with stored normative data, and it utilizes all relevant facts and relations in answering the individual's questions and aiding him in reaching decisions. This technique has been investigated with regard to educational and vocational planning and decision making. Examples of such interactive computer systems, to be discussed in Chapter 17, include IBM's Education and Career Exploration System (ECES) and the System for Interactive Guidance and Information (SIGI) developed by Educational Testing Service (M. Katz, 1974). Each includes a program of self-knowledge, occupational exploration, and educational planning. Preliminary field trials show good acceptance of these systems by high school students and their parents.
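At the simplest of the levels just described, the program merely looks up prepared statements keyed to patterns of scores. The fragment below is a minimal sketch of that idea; the scale names, cutoffs, and report wording are all invented for illustration and are not taken from the MMPI, the DAT, or any other published system.

```python
# A minimal sketch of the simplest level of computer interpretation:
# prepared verbal statements are associated with particular patterns of
# scores. All rules and wording here are hypothetical.

REPORT_RULES = [
    (lambda s: s["verbal"] >= 60 and s["numerical"] >= 60,
     "Scores are well above average on both the verbal and numerical tests."),
    (lambda s: s["verbal"] - s["numerical"] >= 15,
     "The verbal score is markedly higher than the numerical score."),
    (lambda s: s["numerical"] - s["verbal"] >= 15,
     "The numerical score is markedly higher than the verbal score."),
]

def interpret(scores):
    """Return every prepared statement whose score pattern matches."""
    return [text for matches, text in REPORT_RULES if matches(scores)]

print(interpret({"verbal": 72, "numerical": 48}))
```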

Test results also form an integral part of the data utilized in systems of computer-assisted instruction (CAI), as in a widely used CAI system for teaching reading to first-graders. In such systems, the computer presents instructional material to each student, who may be involved in a different activity and working on different material from every other student. The contribution of the computer is to process the rather formidable mass of data accumulated daily regarding the performance of each student, and to use these data in prescribing the next instructional step for each student. On the basis of the student's response history, the material presented is kept appropriate to his current level of attainment. Thus, the diagnostic analysis of errors may lead to further practice at the present level, to more advanced material when the student is ready for it, or to a remedial branch of elementary prerequisite material designed to correct the specific learning difficulties identified in individual cases.

A less costly and operationally more feasible variant is computer-managed instruction (CMI), in which the learner does not interact directly with the computer. In such systems, the role of the computer is to assist the teacher in managing the instructional process; the teacher must relate each student's responses to the appropriate instructional materials. Examples of computer-managed instruction are provided by the University of Pittsburgh's IPI (Individually Prescribed Instruction; see Glaser, 1968) and by Project PLAN (Planning for Learning in Accordance with Needs), developed by the American Institutes for Research (Flanagan, Shanner, Brudner, & Marker, 1975).

CRITERION-REFERENCED TESTING

An approach to testing that has aroused a surge of activity, particularly in education, is generally designated as "criterion-referenced testing." First proposed by Glaser (1963), this term is still used somewhat loosely, and its definition varies among different writers. Moreover, several alternative terms are in common use, such as content-referenced, domain-referenced, and objective-referenced. These terms are sometimes employed as synonyms for criterion-referenced and sometimes with slightly different connotations. "Criterion-referenced," however, seems to have gained ascendancy, although it is not the most appropriate term.

USES. Thus far, criterion-referenced testing has found its major applications in several recent innovations in education. Prominent among these are computer-assisted, computer-managed, and other individualized, self-paced instructional systems. The previously cited Project PLAN and IPI are examples of such programs, and the testing they employ illustrates criterion-referenced testing. In all these systems, testing is closely integrated with instruction, being introduced before, during, and after completion of each instructional unit to check on prerequisite skills, diagnose possible learning difficulties, and prescribe subsequent instructional procedures.

Criterion-referenced tests are also useful in broad surveys of educational accomplishment, such as the National Assessment of Educational Progress (Womer, 1970); in testing for the attainment of minimum requirements, as in qualifying for a driver's license or a pilot's license; and in meeting demands for educational accountability (Gronlund, 1974). From still another angle, familiarity with the concepts of criterion-referenced testing can contribute to the improvement of the traditional, informal tests prepared by teachers for classroom use. Gronlund (1973) provides a helpful guide for this purpose, as well as a simple and well-balanced introduction to criterion-referenced testing; a brief but excellent discussion of the chief limitations of criterion-referenced tests is given by Ebel (1972b).

CONTENT MEANING. The major distinguishing feature of criterion-referenced testing (however defined, and whether designated by this term or by one of its synonyms) is its interpretation of test performance in terms of content meaning. The focus is clearly on what the person can do and what he knows, not on how he compares with others. Essentially, criterion-referenced testing uses as its interpretive frame of reference a specified content domain rather than a specified population of persons. In this respect, it has been contrasted with the usual norm-referenced testing, in which an individual's score is interpreted by comparing it with the scores obtained by others on the same test. In criterion-referenced testing, for example, an examinee's test performance may be reported in terms of the specific kinds of arithmetic operations he has mastered, the estimated size of his vocabulary, the difficulty level of reading matter he can comprehend (from comic books to literary classics), or the chances of his achieving a designated performance level on an external criterion (educational or vocational).
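A content-referenced score report of the kind just described can be sketched in a few lines. The objectives and item data below are invented for illustration; the point is only that the report is organized around what the examinee can do, objective by objective, rather than around his standing in a group.

```python
# A minimal sketch of content-referenced reporting: performance is
# tallied and reported per instructional objective. Data are invented.
from collections import defaultdict

def objective_report(responses):
    """responses: (objective, correct) pairs for one examinee."""
    tally = defaultdict(lambda: [0, 0])            # objective -> [right, tried]
    for objective, correct in responses:
        tally[objective][1] += 1
        tally[objective][0] += int(correct)
    for objective, (right, tried) in tally.items():
        print(f"{objective}: {right}/{tried} items correct")

objective_report([
    ("adds two-digit numbers", True),
    ("adds two-digit numbers", True),
    ("multiplies three-digit by two-digit numbers", True),
    ("multiplies three-digit by two-digit numbers", False),
])
```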

A fundamental requirement in constructing this type of test is a clearly defined domain of knowledge or skills to be assessed. If scores on such a test are to have communicable meaning, the content domain to be sampled must be widely recognized as important. The selected domain must then be subdivided into small units defined in performance terms. In an educational context, these units correspond to behaviorally defined instructional objectives, such as "multiplies three-digit by two-digit numbers" or "identifies the misspelled word in which the final e is retained when adding -ing." In the programs prepared for individualized instruction, the objectives run to several hundred for a single school subject. After the instructional objectives have been formulated, items are prepared to sample each objective. This procedure is admittedly difficult and time-consuming. Without such careful specification and control of content, however, the results of criterion-referenced testing could degenerate into an idiosyncratic and uninterpretable jumble. Some instructional objectives can also be arranged in an ordinal hierarchy, the acquisition of more elementary skills being prerequisite to the acquisition of higher-level skills,6 as are the Piagetian ordinal scales discussed earlier in this chapter.

Because of these requirements, criterion-referenced testing is best adapted for testing basic skills (as in reading and arithmetic) at elementary levels. It is impracticable, and probably undesirable, to formulate highly specific objectives for advanced levels of knowledge in less highly structured subjects. At these levels, both the content and the sequence of learning are likely to be much more flexible.

Some published tests are so constructed as to permit both norm-referenced and criterion-referenced applications. An example is the 1973 Edition of the Stanford Achievement Test, suitable for the elementary school grades. While providing appropriate norms at each level, this battery meets three important requirements of criterion-referenced tests: specification of detailed instructional objectives, adequate coverage of each objective with appropriate items, and a wide range of item difficulty.

From another angle, in its emphasis on content meaning in the interpretation of test scores, criterion-referenced testing may exert a salutary effect on testing in general. The interpretation of intelligence test scores would benefit from this approach. To describe a child's intelligence test performance in terms of the specific intellectual skills and knowledge it represents might help to counteract the confusions and misconceptions that have become attached to the IQ. Such an interpretation can certainly be combined with norm-referenced scores; when stated in these general terms, the criterion-referenced approach is equivalent to interpreting test scores in the light of the demonstrated validity of the particular test, rather than in terms of vague underlying entities.

MASTERY TESTING. A second major feature almost always found in criterion-referenced testing is the procedure of testing for mastery. Essentially, this procedure yields an all-or-none score, indicating that the individual has or has not attained the preestablished level of mastery. A three-way distinction may also be employed, including mastery, nonmastery, and an intermediate, doubtful, or "review" interval. When basic skills are tested, nearly complete mastery is generally expected (e.g., 80-85% correct items).

Mastery testing is regularly employed in the previously cited programs for individualized instruction. It is also characteristic of published criterion-referenced tests for basic skills. Examples of such tests include the Prescriptive Reading Inventory and Prescriptive Mathematics Inventory (California Test Bureau), the Skills Monitoring System in Reading and in Study Skills (Harcourt Brace Jovanovich), and Diagnosis: An Instructional Aid Series in Reading and in Mathematics (Science Research Associates).

In connection with individualized instruction, some educators have argued that, given enough time and suitable instructional methods, nearly everyone can achieve complete mastery of the chosen instructional objectives. Individual differences would thus be manifested in learning time rather than in final achievement, as in traditional educational testing (Bloom, 1968; Carroll, 1963, 1970; Cooley & Glaser, 1969; Gagné, 1965). It follows that in mastery testing, individual differences in performance are of little or no interest. Hence, as generally constructed, criterion-referenced tests minimize individual differences. For example, they include items passed or failed by all or nearly all examinees, although such items are usually excluded from norm-referenced tests. As a result of this reduction in variability, the usual methods for finding reliability and validity are inapplicable to most criterion-referenced tests. Further discussion of these points will be found in Chapters 5 and 8.

Beyond basic skills, mastery testing is inapplicable or insufficient. In more advanced and less structured subjects, achievement is open-ended. The individual may progress almost without limit in such functions as understanding, critical thinking, appreciation, and originality. Moreover, content coverage may proceed in many different directions, depending upon the individual's abilities, interests, and goals, as well as upon local instructional facilities. Under these conditions, complete mastery is unrealistic and unnecessary. Hence, norm-referenced evaluation is generally employed in such cases to assess degree of attainment. It should be noted, finally, that criterion-referenced testing is neither as new nor as clearly divorced from norm-referenced testing as some of its proponents imply.

6 Ideally, such tests follow the simplex model of a Guttman scale (see Popham & Husek, 1969).
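The mastery scoring procedure described above reduces to a simple classification rule. The sketch below implements the three-way variant; the specific cutoffs of 85 and 70 percent are invented for the illustration, the text noting only that roughly 80-85 percent correct is typical for basic skills.

```python
# A minimal sketch of mastery scoring: an all-or-none decision, here with
# the intermediate "review" interval mentioned above. The cutoffs are
# illustrative assumptions, not values from any published test.

def mastery_status(n_correct, n_items, mastery_cut=0.85, review_cut=0.70):
    proportion = n_correct / n_items
    if proportion >= mastery_cut:
        return "mastery"
    if proportion >= review_cut:
        return "review"          # the doubtful, intermediate interval
    return "nonmastery"

print(mastery_status(18, 20))    # -> mastery
print(mastery_status(15, 20))    # -> review
```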

Evaluating an individual's test performance in absolute terms, such as by letter grades or percentage of correct items, is far older than normative interpretation; the concept of mastery in education, applied to the learning of specific units, achieved considerable popularity in the 1920s and 1930s and was later abandoned. More precise attempts to describe test performance in terms of content meaning also antedate the introduction of the term "criterion-referenced testing" (Ebel, 1962; see also Anastasi, 1968, pp. 69-70). Other examples may be found in the early product scales for assessing the quality of handwriting, compositions, or drawings by matching the individual's work sample against a set of standard specimens.

Moreover, a normative framework is implicit in all testing, regardless of how scores are expressed. The very choice of content or skills to be measured is influenced by the examiner's knowledge of what can be expected from human organisms at a particular developmental or instructional stage; such a choice presupposes information about what persons have done in similar situations. Nor does mastery testing eliminate individual differences, as Ebel (1972b) observes. To describe an individual's level of reading comprehension as "the ability to understand the content of the New York Times" still leaves room for a wide range of individual differences in degree of understanding. And the dichotomy between masters and nonmasters is itself created by imposing uniform cutoff scores on an ability continuum.

EXPECTANCY TABLES. Test scores may also be interpreted in terms of expected criterion performance, as in a training program or on a job. Strictly speaking, the term "criterion-referenced testing" should refer to this type of performance interpretation, while the other approaches discussed in this section can be more precisely described as content-referenced. This usage of the term "criterion" follows standard psychometric practice, as when a test is said to be validated against a particular criterion (Ch. 2), and it is in this sense that the term is used in the APA test Standards (1974).

An expectancy table gives the probability of different criterion outcomes for persons who obtain each test score. For example, if a student obtains a score of 530 on the CEEB Scholastic Aptitude Test, what are the chances that his freshman grade-point average in a specific college will fall in the A, B, C, D, or F category? This type of information can be obtained by examining the bivariate distribution of predictor scores (SAT) plotted against criterion status (freshman grade-point average). If the number of cases in each cell of such a bivariate distribution is changed to a percentage, the result is an expectancy table, such as the one illustrated in Table 6. The data for this table were obtained from 171 eleventh-grade boys enrolled in courses in American history. The predictor was the Verbal Reasoning test of the Differential Aptitude Tests, administered early in the course; the criterion was end-of-course grades. The correlation between test scores and criterion grades was .58.

TABLE 6
Expectancy Table Showing Relation between DAT Verbal Reasoning Test and Course Grades in American History for 171 Boys in Grade 11
(Adapted from Fifth Edition Manual for the Differential Aptitude Tests, Forms S and T. Copyright © 1974 by The Psychological Corporation, New York, N.Y. Reproduced by permission. All rights reserved.)

                                 Percentage Receiving Each Criterion Grade
Test Score    Number of Cases    Below 70    70-79    80-89    90 & above
40 & above          46                         15       22         63
30-39               36                         39       39         21
20-29               43              12         63       17          5
Below 20            46              30         52       17

The first column of Table 6 shows the test scores, divided into four class intervals; the number of students whose scores fall in each interval is given in the second column. The remaining entries in each row indicate the percentage of cases within each test-score interval who received each grade at the end of the course. Thus, of the 46 students with scores of 40 or above on the Verbal Reasoning test, 15 percent received grades of 70-79, 22 percent grades of 80-89, and 63 percent grades of 90 or above. At the other extreme, of the 46 students scoring below 20 on the test, 30 percent received grades below 70, 52 percent grades between 70 and 79, and 17 percent grades between 80 and 89. Within the limitations of the available data, these percentages represent the best estimates of the probability that an individual will receive a given criterion grade. For example, if a new student receives a test score of 34 (i.e., a score in the 30-39 interval), we would conclude that the probability of his obtaining a grade of 90 or above is 21 out of 100, and the probability of his obtaining a grade between 80 and 89 is 39 out of 100.
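The computation behind such a table is nothing more than converting each row of a bivariate frequency distribution to percentages, as the sketch below does. The cell counts shown are not the original data; they were chosen so that the first row reproduces the 15, 22, and 63 percent figures of Table 6.

```python
# A minimal sketch of forming an expectancy table: cell frequencies in a
# bivariate distribution are converted to percentages within each
# test-score interval. Counts here are reconstructed for illustration.
import numpy as np

def expectancy_table(counts):
    """counts: rows = score intervals, columns = criterion grades."""
    counts = np.asarray(counts, dtype=float)
    return 100 * counts / counts.sum(axis=1, keepdims=True)

counts = [[0, 7, 10, 29],   # 46 cases scoring 40 & above
          [0, 14, 14, 8]]   # 36 cases scoring 30-39
print(np.round(expectancy_table(counts)))   # first row -> 0, 15, 22, 63
```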

In many practical situations, criteria can be dichotomized into "success" and "failure" on a job, in a training program, or in other undertakings. Under these conditions, an expectancy chart can be prepared, showing the probability of success or failure corresponding to each score interval. Figure 7 is an example of such an expectancy chart. Based on a pilot selection battery developed by the Air Force, this expectancy chart shows the percentage of men scoring within each stanine on the battery who failed to complete primary flight training.

[FIG. 7. Expectancy Chart Showing Relation between Performance on Pilot Selection Battery and Elimination from Primary Flight Training. For each stanine on the battery, the original chart gives the number of men receiving that stanine and the percentage eliminated in the course of training. (From Flanagan, 1947, p. 58.)]

It will be seen that 77 percent of the men receiving a stanine of 1 failed to complete primary flight training, while only 4 percent of those at stanine 9 were eliminated in the course of training; the percentage of failures decreases consistently over the successive stanines. On the basis of this expectancy chart, it could be predicted, for example, that approximately 40 percent of pilot cadets who obtain a stanine of 4 will fail and approximately 60 percent will complete primary flight training satisfactorily. Similar statements regarding the probability of success and failure could be made about the men receiving each stanine. Thus, an individual with a stanine of 4 has a 60:40, or 3:2, chance of completing flight training. Besides providing a criterion-referenced interpretation of test scores, it can be seen that both expectancy tables and expectancy charts give a general idea of the validity of a test in predicting a given criterion.

CHAPTER 5

Reliability

RELIABILITY refers to the consistency of scores obtained by the same persons when reexamined with the same test on different occasions, or with different sets of equivalent items, or under other variable examining conditions. This concept of reliability underlies the computation of the error of measurement of a single score, whereby we can predict the range of fluctuation likely to occur in a single individual's score as a result of irrelevant, chance factors.

The concept of test reliability has been used to cover several aspects of score consistency. In its broadest sense, test reliability indicates the extent to which individual differences in test scores are attributable to "true" differences in the characteristics under consideration and the extent to which they are attributable to chance errors. To put it in more technical terms, measures of test reliability make it possible to estimate what proportion of the total variance of test scores is error variance. The crux of the matter, however, lies in the definition of error variance. Factors that might be considered error variance for one purpose would be classified under true variance for another. For example, if we are interested in measuring fluctuations of mood, then the day-by-day changes in scores on a test of cheerfulness-depression would be relevant to the purpose of the test and would hence be part of the true variance of the scores. If, on the other hand, the test is designed to measure more permanent personality characteristics, the same daily fluctuations would fall under the heading of error variance.

Essentially, any condition that is irrelevant to the purpose of the test represents error variance. Thus, when the examiner tries to maintain uniform testing conditions by controlling the testing environment, instructions, time limits, rapport, and other similar factors, he is reducing error variance and making the test scores more reliable. Despite optimum testing conditions, however, no test is a perfectly reliable instrument. Hence, every test should be accompanied by a statement of its reliability. Such a measure of reliability characterizes the test when administered under standard conditions and given to subjects similar to those constituting the normative sample. The characteristics of this sample should therefore be specified, together with the type of reliability that was measured.
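The variance language of the preceding paragraph can be made concrete with a small computation. On the classical formulation (a standard psychometric identity, stated here as an assumption rather than quoted from the text), the reliability coefficient equals the proportion of total score variance that is true variance, so the error proportion is simply its complement; the reliability of .85 below is an arbitrary illustrative figure.

```python
# A minimal sketch of the variance partition underlying reliability:
# reliability = true variance / total variance, so the proportion of
# error variance is 1 minus the reliability coefficient.

def variance_breakdown(reliability, total_variance=1.0):
    true_var = reliability * total_variance
    error_var = (1.0 - reliability) * total_variance
    return true_var, error_var

true_var, error_var = variance_breakdown(0.85)
print(f"true variance: {true_var:.0%} of total; error variance: {error_var:.0%}")
```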

There could, of course, be as many varieties of test reliability as there are conditions affecting test scores. Since any such conditions might be irrelevant for a certain purpose and would thus be classified as error variance, the types of reliability computed in actual practice are relatively few. In this chapter, the principal techniques for measuring the reliability of test scores will be examined, together with the sources of error variance identified by each. Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient. Accordingly, the next section will consider some of the basic characteristics of correlation coefficients, in order to clarify their use and interpretation. More technical discussion of correlation, as well as more detailed specifications of computing procedures, can be found in any elementary textbook of educational or psychological statistics, such as Guilford and Fruchter (1973).

THE CORRELATION COEFFICIENT

MEANING OF CORRELATION. Correlation coefficients have many uses in the analysis of psychological data. Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable 2, the second-best individual in variable 1 is second best in variable 2, and so on down to the poorest individual in the group, then there would be a perfect correlation between variables 1 and 2. Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in Figure 8. This figure presents a scatter diagram, or bivariate distribution. Each tally mark in this diagram indicates the score of one individual in both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases in the group are distributed along the diagonal running from the lower left- to the upper right-hand corner of the diagram. Such a distribution indicates a perfect positive correlation (+1.00), since it shows that each individual occupies the same relative position in both variables. The closer the bivariate distribution of scores approaches this diagonal, the higher will be the positive correlation.

[FIG. 8. Bivariate Distribution for a Hypothetical Correlation of +1.00. The tally marks fall along the diagonal from the lower left- to the upper right-hand corner; the axes give scores on Variable 1 and Variable 2.]

Figure 9 illustrates a perfect negative correlation (-1.00). In this case, there is a complete reversal of scores from one variable to the other. The best individual in variable 1 is the poorest in variable 2, and vice versa, this reversal being consistently maintained throughout the distribution. It will be noted that, in this scatter diagram, all individuals fall on the diagonal extending from the upper left- to the lower right-hand corner, a diagonal that runs in the reverse direction from that in Figure 8.

A zero correlation indicates complete absence of relationship, such as might occur by chance. If each individual's name were pulled at random out of a hat to determine his position in variable 1, and if the process were repeated for variable 2, a zero or near-zero correlation would result. Under these conditions, it would be impossible to predict an individual's relative standing in variable 2 from a knowledge of his score in variable 1. The top-scoring subject in variable 1 might score high, low, or average in variable 2. Some subjects might by chance score above average in both variables, or below average in both; others might fall above average in one variable and below in the other; still others might be above the average in one and at the average in the second; and so forth. There would be no regularity in the relationship from one individual to another.

The coefficients found in actual practice generally fall between these extremes, having some value higher than zero but lower than 1.00. Correlations between measures of abilities are nearly always positive, although frequently low. When a negative correlation is obtained between two such variables, it usually results from the way in which the scores are expressed. For example, if time scores are correlated with amount scores, a negative correlation will probably result. Thus, if each subject's score on an arithmetic computation test is recorded as the number of seconds required to complete all the items, while his score on an arithmetic reasoning test represents the number of problems correctly solved, a negative correlation can be expected. In such a case, the poorest (i.e., slowest) individual will have the numerically highest score on the first test, while the best individual will have the highest score on the second.


Correlation coefficients may be computed in various ways, depending on the nature of the data. The most common is the Pearson Product-Moment Correlation Coefficient. This correlation coefficient takes into account not only the person's position in the group, but also the amount of his deviation above or below the group mean. It will be recalled that when each individual's standing is expressed in terms of standard scores, persons falling above the average receive positive standard scores, while those below the average receive negative scores. Thus, an individual who is superior in both variables to be correlated would have two positive standard scores; one inferior in both would have two negative standard scores. If, now, we multiply each individual's standard score in variable 1 by his standard score in variable 2, all of these products will be positive, provided that each individual falls on the same side of the mean on both variables. The Pearson correlation coefficient is simply the mean of these products. It will have a high positive value when corresponding standard scores are of equal sign and of approximately equal amount in the two variables. When subjects above the average in one variable are below the average in the other, the corresponding cross-products will be negative. If the sum of the cross-products is negative, the correlation will be negative. When some products are positive and some negative, the correlation is close to zero.

In actual practice, it is not necessary to convert each raw score to a standard score before computing the cross-products, since this conversion can be made once for all after the cross-products have been added. There are many shortcuts for computing the Pearson correlation coefficient. The method demonstrated in Table 7 is not the quickest, but it illustrates the meaning of the correlation coefficient more clearly than other methods. Table 7 shows the computation of a Pearson r for the arithmetic and reading scores of 10 children. Next to each child's name are his scores in the arithmetic test (X) and the reading test (Y). The sums and means of the 10 scores are given under each column. The third column shows the deviation (x) of each arithmetic score from the arithmetic mean; and the fourth column, the deviation (y) of each reading score from the reading mean. These deviations are squared in the next two columns, and the sums of the squares are used in computing the standard deviations of the arithmetic and reading scores. Rather than dividing each x and y by its corresponding standard deviation to find standard scores, we perform this division only once at the end, as shown in the correlation formula in Table 7. The cross-products in the last column (xy) have been found by multiplying the corresponding deviations in the x and y columns. To compute the correlation (r), the sum of these cross-products is divided by the number of cases (N) and by the product of the two standard deviations.

TABLE 7
Computation of Pearson Product-Moment Correlation Coefficient

Pupil      Arithmetic X   Reading Y     x     y     x²     y²     xy
Bill            41            17       +1    -4      1     16     -4
Carol           38            28       -2    +7      4     49    -14
Geoffrey        48            22       +8    +1     64      1     +8
Ann             32            16       -8    -5     64     25    +40
Bob             34            18       -6    -3     36      9    +18
Jane            36            15       -4    -6     16     36    +24
Ellen           41            24       +1    +3      1      9     +3
Ruth            43            20       +3    -1      9      1     -3
Dick            47            23       +7    +2     49      4    +14
Mary            40            27        0    +6      0     36      0
Sum            400           210                   244    186    +86
Mean            40            21

$$\sigma_x = \sqrt{\frac{244}{10}} = \sqrt{24.40} = 4.94 \qquad \sigma_y = \sqrt{\frac{186}{10}} = \sqrt{18.60} = 4.31$$

$$r = \frac{\Sigma xy}{N\,\sigma_x\,\sigma_y} = \frac{86}{(10)(4.94)(4.31)} = \frac{86}{212.91} = .40$$
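The computation in Table 7 is easy to verify by machine. The following short Python program is an editorial sketch, not part of the original text; it reproduces the deviation-score method of the table, and the variable names are ours.

```python
# Pearson r by the deviation-score method of Table 7.
import math

X = [41, 38, 48, 32, 34, 36, 41, 43, 47, 40]  # arithmetic scores
Y = [17, 28, 22, 16, 18, 15, 24, 20, 23, 27]  # reading scores
N = len(X)

mean_x, mean_y = sum(X) / N, sum(Y) / N        # 40 and 21

# Deviations from the means and their cross-products.
x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 86

# Standard deviations, dividing by N as in the text.
sigma_x = math.sqrt(sum(xi ** 2 for xi in x) / N)  # 4.94
sigma_y = math.sqrt(sum(yi ** 2 for yi in y) / N)  # 4.31

r = sum_xy / (N * sigma_x * sigma_y)
print(round(r, 2))  # 0.40
```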

The correlation of .40 found in Table 7 indicates a moderate degree of positive relationship between the arithmetic and reading scores. There is some tendency for those children doing well on the arithmetic test also to perform well on the reading test, and vice versa, although the relation is not close.

STATISTICAL SIGNIFICANCE. If we are concerned only with the performance of these 10 children, we can accept this correlation as an adequate description of the degree of relation existing between the two variables in this group. In psychological research, however, we are usually interested in generalizing beyond the particular sample of individuals tested to the larger population which they represent. We might want to know, for example, whether arithmetic and reading ability are correlated among American schoolchildren of the same age as those we tested. Obviously, the 10 cases actually examined would constitute a very inadequate sample of such a population; another comparable sample of the same size might yield a much lower or a much higher correlation. There are statistical procedures for estimating the probable fluctuation to be expected from sample to sample in the size of correlations, means, standard deviations, and other group measures. The question usually asked about correlations, however, is simply whether the correlation is significantly greater than zero. In other words, if the correlation in the population is zero, could a correlation as high as that obtained in our sample have resulted from sampling error alone? When we say that a correlation is "significant at the 1 percent (.01) level," we mean that the chances are no greater than one out of 100 that the population correlation is zero. Hence, we conclude that the two variables are truly correlated. Significance levels refer to the risk of error we are willing to take in drawing conclusions from our data. If a correlation is said to be significant at the .05 level, the probability of error is 5 out of 100. Most psychological research applies either the .01 or the .05 level, although other significance levels may be employed for special reasons.

The correlation of .40 found in Table 7 fails to reach significance even at the .05 level. As might have been anticipated, it is difficult to establish a general relationship conclusively with only 10 cases. With a sample of this size, the smallest correlation significant at the .05 level is .63. Any smaller correlation simply leaves unanswered the question whether the two variables are correlated in the population from which the sample was drawn. The minimum correlations significant at the .01 and .05 levels for groups of different sizes can be found by consulting tables of the significance of correlations in any statistics textbook. For interpretive purposes in this book, only an understanding of the general concept is required. Parenthetically, it might be added that significance levels can be interpreted in a similar way when applied to other statistical measures. For example, to say that the difference between two means is significant at the .01 level indicates that we can conclude, with only one chance out of 100 of being wrong, that a difference in the obtained direction would be found if we tested the whole population from which our samples were drawn. For instance, if in the sample tested the boys had obtained a significantly higher mean than the girls on a mechanical comprehension test, we could conclude that the boys would also excel in the total population.

THE RELIABILITY COEFFICIENT. Correlation coefficients have many uses in the analysis of psychological data. The measurement of test reliability represents one application of such coefficients. An example of a reliability coefficient, computed by the Pearson Product-Moment method, is to be found in Figure 10. In this case, the scores of 104 persons on two equivalent forms of a Word Fluency test1 were correlated. In one form, the subjects were given five minutes to write as many words as they could that began with a given letter. The second form was identical, except that a different letter was employed. The two letters were chosen by the test authors as being approximately equal in difficulty for this purpose.

The correlation between the number of words written in the two forms of this test was found to be .72. This correlation is high and significant at the .01 level. With 104 cases, any correlation of .25 or higher is significant at this level. Nevertheless, the obtained correlation is somewhat lower than is desirable for reliability coefficients, which usually fall in the .80's or .90's. An examination of the scatter diagram in Figure 10 shows a typical bivariate distribution of scores corresponding to a high positive correlation. It will be noted that the tallies cluster close to the diagonal extending from the lower left- to the upper right-hand corner; the trend is definitely in this direction, although there is a certain amount of scatter of individual entries. In the following section, the use of the correlation coefficient in computing different measures of test reliability will be considered.

1 One of the subtests of the SRA Tests of Primary Mental Abilities for Ages 11 to 17. The data were obtained in an investigation by Anastasi and Drake (1954).
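The significance computations themselves are not shown in the text. One standard procedure, consistent with the thresholds quoted above, is the t test for a correlation coefficient; the sketch below is an editorial illustration using the scipy library, and the function name is ours.

```python
# Two-tailed p-value for the hypothesis that a population correlation is zero.
import math
from scipy import stats

def p_value_for_r(r, n):
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(p_value_for_r(0.40, 10))   # about .25: not significant, as stated above
print(p_value_for_r(0.72, 104))  # far below .01: clearly significant
```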

Fig. 10. A Reliability Coefficient of .72. Scatter diagram of the scores of 104 persons on two forms of the Word Fluency test. (Data from Anastasi & Drake, 1954.)

TYPES OF RELIABILITY

TEST-RETEST RELIABILITY. The most obvious method for finding the reliability of test scores is by repeating the identical test on a second occasion. The reliability coefficient (r11) in this case is simply the correlation between the scores obtained by the same persons on the two administrations of the test. The error variance corresponds to the random fluctuations of performance from one test session to the other. These variations may result in part from uncontrolled testing conditions, such as extreme changes in weather, sudden noises and other distractions, or a broken pencil point. To some extent, however, they arise from changes in the condition of the subject himself, as illustrated by illness, fatigue, emotional strain, worry, recent experiences of a pleasant or unpleasant nature, and the like. Retest reliability shows the extent to which scores on a test can be generalized over different occasions; the higher the reliability, the less susceptible the scores are to the random daily changes in the condition of the subject or of the testing environment.

When retest reliability is reported in a test manual, the interval over which it was measured should always be specified. Since retest correlations decrease progressively as this interval lengthens, there is not one but an infinite number of retest reliability coefficients for any test. It is also desirable to give some indication of relevant intervening experiences of the subjects on whom reliability was measured, such as counseling, psychotherapy, or educational or job experiences.

Apart from the desirability of reporting length of interval, what considerations should guide the choice of interval? Illustrations could readily be cited of tests showing high reliability over periods of a few days or weeks, but whose scores reveal an almost complete lack of correspondence when the interval is extended to as long as ten or fifteen years. Many preschool intelligence tests, for example, yield moderately stable measures within the preschool period, but are virtually useless as predictors of late childhood or adult IQ's. In actual practice, however, a simple distinction can usually be made. Short-range, random fluctuations that occur during intervals ranging from a few hours to a few months are generally included under the error variance of the test score. Thus, in checking this type of test reliability, an effort is made to keep the interval short. In testing young children, the period should be even shorter than for older persons, since at early ages progressive developmental changes are discernible over a period of a month or even less. For any type of person, the interval between retests should rarely exceed six months.

Any additional changes in the relative test performance of individuals that occur over longer periods of time are apt to be cumulative and progressive rather than entirely random. Moreover, they are likely to characterize a broader area of behavior than that covered by the test performance itself. Thus, one's general level of scholastic aptitude, mechanical comprehension, or artistic judgment may have altered appreciably over a ten-year period, owing to unusual intervening experiences. The individual's status may have either risen or dropped appreciably in relation to others of his own age, because of circumstances peculiar to his own home, school, or community environment, or for other reasons such as illness or emotional disturbance.

The extent to which such factors can affect an individual's psychological development provides an important problem for investigation. This question, however, should not be confused with that of the reliability of a particular test. When we measure the reliability of the Stanford-Binet, for example, we do not ordinarily correlate retest scores over a period of ten years, but over a few weeks. To be sure, long-range retests have been conducted with such tests, but the results are generally discussed in terms of the predictability of adult intelligence from childhood performance, rather than in terms of the reliability of a particular test. The concept of reliability is generally restricted to short-range, random changes that characterize the test performance itself rather than the entire behavior domain that is being tested.

It should be noted that different behavior functions may themselves vary in the extent of daily fluctuation they exhibit. For example, steadiness of delicate finger movements is undoubtedly more susceptible to slight changes in the person's condition than is verbal comprehension. If we wish to obtain an over-all estimate of the individual's habitual finger steadiness, we would probably require repeated tests on several days, whereas a single test session would suffice for verbal comprehension.

Although apparently simple and straightforward, the test-retest technique presents difficulties when applied to most psychological tests. Practice will probably produce varying amounts of improvement in the retest scores of different individuals. Moreover, if the interval between retests is fairly short, the examinees may recall many of their former responses. In other words, the same pattern of right and wrong responses is likely to recur through sheer memory. Thus, the scores on the two administrations of the test are not independently obtained, and the correlation between them will be spuriously high. The nature of the test itself may also change with repetition. This is especially true of problems involving reasoning or ingenuity. Once the subject has grasped the principle involved in the problem, or once he has worked out a solution, he can reproduce the correct response in the future without going through the intervening steps. Only tests that are not appreciably affected by repetition lend themselves to the retest technique. A number of sensory discrimination and motor tests would fall into this category. For the large majority of psychological tests, however, the retest technique is inappropriate.

ALTERNATE-FORM RELIABILITY. One way of avoiding the difficulties encountered in test-retest reliability is through the use of alternate forms of the test. The same persons can thus be tested with one form on the first occasion and with another, comparable form on the second. The correlation between the scores obtained on the two forms represents the reliability coefficient of the test. It will be noted that such a reliability coefficient is a measure of both temporal stability and consistency of response to different item samples (or test forms). This coefficient thus combines two types of reliability. Since both types are important for most testing purposes, however, alternate-form reliability provides a useful measure for evaluating many tests.

The concept of item sampling, or content sampling, underlies not only alternate-form reliability but also other types of reliability to be discussed shortly. It is therefore appropriate to examine it more closely. Everyone has probably had the experience of taking a course examination in which he felt he had a "lucky break" because many of the items covered the very topics he happened to have studied most carefully. On another occasion, he may have had the opposite experience, finding an unusually large number of items on areas he had failed to review. This familiar situation illustrates error variance resulting from content sampling. To what extent do scores on a test depend on factors specific to the particular selection of items? If a different investigator, working independently, were to prepare another test in accordance with the same specifications, how much would an individual's score differ on the two tests?

Let us suppose that a 40-item vocabulary test has been constructed as a measure of general verbal comprehension. Now suppose that a second list of 40 different words is assembled for the same purpose, and that the items are constructed with equal care to cover the same range of difficulty as the first test. The differences in the scores obtained by the same individuals on these two tests illustrate the type of error variance under consideration. Owing to fortuitous factors in the past experience of different individuals, the relative difficulty of the two lists will vary somewhat from person to person. Thus, the first list might contain a larger number of words unfamiliar to individual A than does the second list. The second list, on the other hand, might contain a disproportionately large number of words unfamiliar to individual B. If the two individuals are approximately equal in their overall word knowledge (i.e., in their "true scores"), B will nevertheless excel A on the first list, while A will excel B on the second. The relative standing of these two persons will therefore be reversed on the two lists, owing to chance differences in the selection of items.

In the development of alternate forms, care should of course be exercised to ensure that they are truly parallel. Fundamentally, parallel forms of a test should be independently constructed tests designed to meet the same specifications. The tests should contain the same number of items, and the items should be expressed in the same form and should cover the same type of content. The range and level of difficulty of the items should also be equal. Instructions, time limits, illustrative examples, format, and all other aspects of the test must likewise be checked for comparability.

Like test-retest reliability, alternate-form reliability should always be accompanied by a statement of the length of the interval between test administrations, as well as a description of relevant intervening experiences. If the two forms are administered in immediate succession, the resulting correlation shows reliability across forms only, not across occasions; the error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.
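In computational terms, alternate-form reliability is simply the correlation between the two sets of scores. The following sketch is an editorial illustration with invented Form A and Form B scores for six persons:

```python
# Alternate-form reliability: correlate Form A and Form B scores
# obtained by the same persons.
import numpy as np

form_a = np.array([52, 47, 60, 41, 55, 49])
form_b = np.array([50, 45, 62, 44, 53, 51])
print(np.corrcoef(form_a, form_b)[0, 1])  # the reliability coefficient
```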


It should be added that the availability of parallel test forms is desirable for other reasons besides the determination of test reliability. Alternate forms are useful in follow-up studies or in investigations of the effects of some intervening experimental factor on test performance. The use of several alternate forms also provides a means of reducing the possibility of coaching or cheating.

Although much more widely applicable than test-retest reliability, alternate-form reliability also has certain limitations. In the first place, if the behavior functions under consideration are subject to a large practice effect, the use of alternate forms will reduce but not eliminate such an effect. To be sure, if all examinees were to show the same improvement with repetition, the correlation between their scores would remain unaffected, since adding a constant amount to each score does not alter the correlation coefficient. It is much more likely, however, that individuals will differ in amount of improvement, owing to extent of previous practice with similar material, motivation in taking the test, and other factors. Under these conditions, the practice effect represents another source of variance that will tend to reduce the correlation between the two test forms. If the practice effect is small, the reduction will be negligible.

Another related question concerns the degree to which the nature of the test will change with repetition. In certain types of ingenuity problems, for example, any item involving the same principle can be readily solved by most subjects once they have worked out the solution to the first. In such a case, changing the specific content of the items in the second form would not suffice to eliminate this carry-over from the first form. Finally, it should be added that alternate forms are unavailable for many tests, because of the practical difficulties of constructing comparable forms. For all these reasons, other techniques for estimating test reliability are often required.

SPLIT-HALF RELIABILITY. From a single administration of one form of a test it is possible to arrive at a measure of reliability by various split-half procedures. In such a way, two scores are obtained for each person by dividing the test into comparable halves. It is apparent that split-half reliability provides a measure of consistency with regard to content sampling. Temporal stability of the scores does not enter into such reliability, because only one test session is involved. This type of reliability coefficient is sometimes called a coefficient of internal consistency, since only a single administration of a single form is required.

To find split-half reliability, the first problem is how to split the test in order to obtain the most nearly comparable halves. Any test can be divided in many different ways. In most tests, the first half and the second half would not be comparable, owing to differences in nature and difficulty level of items, as well as to the cumulative effects of warming up, practice, fatigue, boredom, and other factors varying progressively from the beginning to the end of the test (see Cureton, 1965). A procedure that is adequate for most purposes is to find the scores on the odd and even items of the test. If the items were originally arranged in an approximate order of difficulty, such a division yields very nearly equivalent half-scores. One precaution to be observed in making such an odd-even split pertains to groups of items dealing with a single problem, such as questions referring to a particular mechanical diagram or to a given passage in a reading test. In this case, a whole group of items should be assigned intact to one or the other half. Were the items in such a group to be placed in different halves of the test, the similarity of the half-scores would be spuriously inflated, since any single error in understanding of the problem might affect items in both halves.

Once the two half-scores have been obtained for each person, they may be correlated by the usual method. It should be noted, however, that this correlation actually gives the reliability of only a half test. For example, if the entire test consists of 100 items, the correlation is computed between two sets of scores each of which is based on only 50 items. In both test-retest and alternate-form reliability, on the other hand, each score is based on the full number of items in the test.

Other things being equal, the longer a test, the more reliable it will be. It is reasonable to expect that, with a larger sample of behavior, we can arrive at a more adequate and consistent measure. The effect that lengthening or shortening a test will have on its coefficient can be estimated by means of the Spearman-Brown formula, given below:

$$r_{nn} = \frac{n\,r_{11}}{1 + (n-1)\,r_{11}}$$

in which r11 is the obtained coefficient, rnn is the estimated coefficient, and n is the number of times the test is lengthened or shortened. Thus, if the number of test items is increased from 25 to 100, n is 4; if it is decreased from 60 to 30, n is 1/2. The Spearman-Brown formula is widely used in determining reliability by the split-half method, many test manuals reporting reliability in this form. When applied to split-half reliability, the formula always involves doubling the length of the test. Under these conditions, it can be simplified as follows:

$$r_{11} = \frac{2\,r'_{11}}{1 + r'_{11}}$$

in which r'11 is the correlation between the two half-tests.
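Both the general and the doubled form of the formula are easily expressed in code. The following Python sketch is an editorial illustration; the function name is ours.

```python
# Spearman-Brown estimate of reliability when a test is lengthened
# (or shortened) to n times its present length.
def spearman_brown(r, n):
    return n * r / (1 + (n - 1) * r)

# Doubling a half-test correlation of .80 (the split-half case, n = 2):
print(spearman_brown(0.80, 2))    # about .89
# Halving a test whose reliability is .90 (n = 1/2):
print(spearman_brown(0.90, 0.5))  # about .82
```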

An alternate method for finding split-half reliability was developed by Rulon (1939). It requires only the variance of the differences between each person's scores on the two half-tests (σ²d) and the variance of total scores (σ²t); these two values are substituted in the following formula, which yields the reliability of the whole test directly:

$$r_{11} = 1 - \frac{\sigma^2_d}{\sigma^2_t}$$

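A Python sketch of the Rulon estimate follows; it is an editorial illustration, and the half-scores shown are hypothetical. The interpretation of the formula is discussed immediately below.

```python
# Rulon's formula: reliability of the whole test from the variance of
# half-score differences and the variance of total scores.
import statistics

def rulon(half_1, half_2):
    diffs = [a - b for a, b in zip(half_1, half_2)]
    totals = [a + b for a, b in zip(half_1, half_2)]
    return 1 - statistics.pvariance(diffs) / statistics.pvariance(totals)

odd_half = [10, 14, 9, 16, 12, 11]    # hypothetical odd-item half-scores
even_half = [11, 13, 10, 15, 12, 10]  # hypothetical even-item half-scores
print(rulon(odd_half, even_half))
```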
It is interesting to note the relationship of this formula to the definition of error variance. Any difference between a person's scores on the two half-tests represents chance error. The variance of these differences, divided by the variance of total scores, gives the proportion of error variance in the scores. When this error variance is subtracted from 1.00, it gives the proportion of "true" variance, which is equal to the reliability coefficient.

KUDER-RICHARDSON RELIABILITY. A fourth method for finding reliability, also utilizing a single administration of a single form, is based on the consistency of responses to all items in the test. This interitem consistency is influenced by two sources of error variance: (1) content sampling (as in alternate-form and split-half reliability); and (2) heterogeneity of the behavior domain sampled. The more homogeneous the domain, the higher the interitem consistency. For example, if one test includes only multiplication items, while another comprises addition, subtraction, multiplication, and division items, the former test will probably show more interitem consistency than the latter. In the latter, more heterogeneous test, one subject may perform better in subtraction than in any of the other arithmetic operations; another subject may score relatively well on the division items, but more poorly in addition, subtraction, and multiplication; and so on. A more extreme example would be represented by a test consisting of 40 vocabulary items, in contrast to one containing 10 vocabulary, 10 spatial relations, 10 arithmetic reasoning, and 10 perceptual speed items. In the latter test, there might be little or no relationship between an individual's performance on the different types of items.

It is apparent that test scores will be less ambiguous when derived from relatively homogeneous tests. Suppose that in the highly heterogeneous, 40-item test cited above, Smith and Jones both obtain a score of 20. Can we conclude that the performances of the two on this test were equal? Not at all. Smith may have correctly completed 10 vocabulary items, 10 perceptual speed items, and none of the arithmetic reasoning and spatial relations items. In contrast, Jones may have received a score of 20 by the successful completion of 5 perceptual speed, 5 spatial relations, 10 arithmetic reasoning, and no vocabulary items. Many other combinations could obviously produce the same total score of 20. This score would have a very different meaning when obtained through such dissimilar combinations of items. In the relatively homogeneous vocabulary test, on the other hand, a score of 20 would probably mean that the subject had succeeded with approximately the first 20 words, if the items were arranged in ascending order of difficulty. He might have failed two or three easier words and correctly responded to two or three more difficult items beyond the 20th, but such individual variations are slight in comparison with those found in a more heterogeneous test.

A highly relevant question in this connection is whether the criterion that the test is trying to predict is itself relatively homogeneous or heterogeneous. Although homogeneous tests are to be preferred because their scores permit fairly unambiguous interpretation, a single homogeneous test is obviously not an adequate predictor of a highly heterogeneous criterion. Moreover, in the prediction of a heterogeneous criterion, the heterogeneity of test items would not necessarily represent error variance. Traditional intelligence tests provide a good example of heterogeneous tests designed to predict heterogeneous criteria. In such a case, however, it may be desirable to construct several relatively homogeneous tests, each measuring a different phase of the heterogeneous criterion. Thus, unambiguous interpretation of test scores could be combined with adequate criterion coverage.

The most common procedure for finding interitem consistency is that developed by Kuder and Richardson (1937). As in the split-half methods, interitem consistency is found from a single administration of a single test. Rather than requiring two half-scores, however, such a technique is based on an examination of performance on each item. Of the various formulas derived in the original article, the most widely applicable, commonly known as "Kuder-Richardson formula 20,"3 is the following:

$$r_{11} = \left(\frac{n}{n-1}\right)\left(\frac{\sigma^2_t - \Sigma pq}{\sigma^2_t}\right)$$

In this formula, r11 is the reliability coefficient of the whole test, n is the number of items in the test, and σt the standard deviation of total scores on the test. The only new term in this formula, Σpq, is found by tabulating the proportion of persons who pass (p) and the proportion who do not pass (q) each item. The product of p and q is computed for each item, and these products are then added across all items to give Σpq. Since in the process of test construction p is often routinely recorded in order to find the difficulty level of each item, this method of determining reliability involves little additional computation.
3 A simple derivation of this formula can be found in Ebel (1965, pp. 326-327).
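The formula translates directly into code. The Python sketch below is an editorial illustration; the right-wrong item matrix is invented.

```python
# Kuder-Richardson formula 20 from a persons-by-items matrix of
# right (1) and wrong (0) scores.
import statistics

def kr20(items):
    n = len(items[0])                           # number of items
    totals = [sum(person) for person in items]  # total score per person
    var_t = statistics.pvariance(totals)        # variance of total scores
    sum_pq = 0.0
    for j in range(n):
        p = sum(person[j] for person in items) / len(items)  # proportion passing
        sum_pq += p * (1 - p)
    return (n / (n - 1)) * (var_t - sum_pq) / var_t

scores = [[1, 1, 1, 0],
          [1, 0, 1, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 0],
          [1, 1, 1, 1]]
print(kr20(scores))  # about .73 for these invented data
```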


It can be shown mathematically that the Kuder-Richardson reliability coefficient is actually the mean of all split-half coefficients resulting from different splittings of a test (Cronbach, 1951).4 The ordinary split-half coefficient, on the other hand, is based on a planned split designed to yield equivalent sets of items. Hence, unless the test items are highly homogeneous, the Kuder-Richardson coefficient will be lower than the split-half reliability. An extreme example will serve to highlight the difference. Suppose we construct a 50-item test out of 25 different kinds of items, such that items 1 and 2 are vocabulary items, items 3 and 4 arithmetic reasoning, items 5 and 6 spatial orientation, and so on. The odd and even scores on this test could theoretically agree quite closely, thus yielding a high split-half reliability coefficient. The homogeneity of this test, however, would be very low, since there would be little consistency of performance among the entire set of 50 items. In this example, we would expect the Kuder-Richardson reliability to be much lower than the split-half reliability. It can be seen that the difference between Kuder-Richardson and split-half reliability coefficients may serve as a rough index of the heterogeneity of a test.

4 This is strictly true only when the split-half coefficients are found by the Rulon formula, not when they are found by correlation of halves and the Spearman-Brown formula (Novick & Lewis, 1967).


The Kuder-Richardson formula is applicable to tests whose items are scored as right or wrong, or according to some other all-or-none system. Some tests, however, may have multiple-scored items. On a personality inventory, for example, the respondent may receive a different numerical score on an item, depending on whether he checks "usually," "sometimes," "rarely," or "never." For such tests, a generalized formula has been derived, known as coefficient alpha (Cronbach, 1951; Novick & Lewis, 1967). In this formula, the value Σpq is replaced by Σσ²i, the sum of the variances of item scores. The procedure is to find the variance of all individuals' scores for each item and then to add these variances across all items. The complete formula for coefficient alpha is given below:

$$r_{11} = \left(\frac{n}{n-1}\right)\left(\frac{\sigma^2_t - \Sigma\sigma^2_i}{\sigma^2_t}\right)$$

A clear description of the computational layout for finding coefficient alpha can be found in Ebel (1965, pp. 326-330).
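A Python sketch of coefficient alpha follows; it differs from the Kuder-Richardson sketch given earlier only in substituting the sum of item variances for Σpq. The multiple-scored item responses are invented for illustration.

```python
# Coefficient alpha from a persons-by-items matrix of graded item scores.
import statistics

def coefficient_alpha(items):
    n = len(items[0])                           # number of items
    totals = [sum(person) for person in items]
    var_t = statistics.pvariance(totals)        # variance of total scores
    sum_var_items = sum(
        statistics.pvariance([person[j] for person in items])
        for j in range(n)
    )
    return (n / (n - 1)) * (var_t - sum_var_items) / var_t

scores = [[3, 2, 3],   # four persons answering three items scored 0 to 3
          [1, 1, 2],
          [2, 2, 2],
          [0, 1, 1]]
print(coefficient_alpha(scores))  # 0.9 for these invented data
```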
SCORER RELIABILITY. It should now be apparent that the different types of reliability vary in the factors they subsume under error variance. In one case, error variance covers temporal fluctuations; in another, it refers to differences between sets of parallel items; and in still another, it includes any interitem inconsistency. On the other hand, the factors excluded from measures of error variance are broadly of two types: (a) those factors whose variance should remain in the scores, since they are part of the true differences under consideration; and (b) those irrelevant factors that can be experimentally controlled. For example, it is not customary to report the error of measurement resulting when a test is administered under distracting conditions or with a longer or shorter time limit than that specified in the manual. Timing errors and serious distractions can be empirically eliminated from the testing situation. Hence, it is not necessary to report special reliability coefficients corresponding to "distraction variance" or "timing variance."

Similarly, most tests provide such highly standardized procedures for administration and scoring that error variance attributable to these factors is negligible. This is particularly true of group tests designed for mass testing and computer scoring. With such instruments, we need only to make certain that the prescribed procedures are carefully followed and adequately checked. With clinical instruments employed in intensive individual examinations, on the other hand, there is evidence of considerable "examiner variance." Through special experimental designs, it is possible to separate this variance from that attributable to temporal fluctuations in the subject's condition or to the use of alternate test forms.

One source of error variance that can be checked quite simply is scorer variance. Certain types of tests, notably tests of creativity and projective tests of personality, leave a good deal to the judgment of the scorer. With such tests, there is as much need for a measure of scorer reliability as there is for the more usual reliability coefficients. Scorer reliability can be found by having a sample of test papers independently scored by two examiners. The two scores thus obtained by each examinee are then correlated in the usual way, and the resulting correlation coefficient is a measure of scorer reliability. This type of reliability is commonly computed when subjectively scored instruments are employed in research. Test manuals should also report it when appropriate.

OVERVIEW. The different types of reliability coefficients discussed in this section are summarized in Tables 8 and 9. In Table 8, the operations followed in obtaining each type of reliability are classified with regard to number of test forms and number of testing sessions required. Table 9 shows the sources of variance treated as error variance by each procedure. Any reliability coefficient may be interpreted directly in terms of the percentage of score variance attributable to different sources. Thus, a reliability coefficient of .85 signifies that 85 percent of the variance in test

scores depends on true variance in the trait measured and 15 percent depends on error variance (as operationally defined by the specific procedure followed). The statistically sophisticated reader may recall that it is the square of a correlation coefficient that represents the proportion of common variance. Actually, the proportion of true variance in test scores is the square of the correlation between scores on a single form of the test and true scores, free from chance errors. This correlation, known as the index of reliability,5 is equal to the square root of the reliability coefficient ($\sqrt{r_{11}}$). When the index of reliability is squared, the result is the reliability coefficient (r11), which can therefore be interpreted directly as the percentage of true variance.

TABLE 8
Techniques for Measuring Reliability, in Relation to Test Form and Testing Session

Testing Sessions   Test Forms Required:
Required           One                    Two
One                Split-Half             Alternate-Form (Immediate)
                   Kuder-Richardson
                   Scorer
Two                Test-Retest            Alternate-Form (Delayed)

TABLE 9
Sources of Error Variance in Relation to Reliability Coefficients

Type of Reliability Coefficient           Error Variance
Test-Retest                               Time sampling
Alternate-Form (Immediate)                Content sampling
Alternate-Form (Delayed)                  Time sampling and Content sampling
Split-Half                                Content sampling
Kuder-Richardson and Coefficient Alpha    Content sampling and Content heterogeneity
Scorer                                    Interscorer differences

Experimental designs that yield more than one type of reliability coefficient for the same group permit the analysis of total score variance into different components. Let us consider the following hypothetical example. Forms A and B of a creativity test have been administered with a two-month interval to 100 sixth-grade children. The resulting alternate-form reliability is .70. From the responses of either form, a split-half reliability coefficient can also be computed.6 This coefficient, stepped up by the Spearman-Brown formula, is .80. Finally, a second scorer has rescored a random sample of 50 papers, from which a scorer reliability of .92 is obtained. The three reliability coefficients can now be analyzed to yield the error variances shown in Table 10 and Figure 11. It will be noted that by subtracting the error variance attributable to content sampling alone (split-half reliability) from the error variance attributable to both content and time sampling (alternate-form reliability), we find that .10 of the variance can be attributed to time sampling alone. Adding the error variances attributable to content sampling (.20), time sampling (.10), and interscorer difference (.08) gives a total error variance of .38, and hence a true variance of .62. These proportions, expressed in the more familiar percentage terms, are shown graphically in Figure 11.

TABLE 10
Analysis of Sources of Error Variance in a Hypothetical Test

From delayed alternate-form reliability:      1 - .70 = .30  (time sampling plus content sampling)
From split-half, Spearman-Brown reliability:  1 - .80 = .20  (content sampling)
Difference:                                   .30 - .20 = .10  (time sampling)
From scorer reliability:                      1 - .92 = .08  (interscorer difference)
Total measured error variance:                .20 + .10 + .08 = .38
True variance:                                1.00 - .38 = .62

5 Derivations of the index of reliability, based on two different sets of assumptions, are given by Gulliksen (1950b, Chs. 2 and 3).
6 For a better estimate of the coefficient of internal consistency, split-half correlations could be computed for each form and the two coefficients averaged by the appropriate statistical procedures.
.t..based on two dilTerentsets of assumptions.~ is equal to the square root of the reliability co- efficient (\/.!:..at.08) gives a total error variance of . 2 and 3).OS· (interscorer difference ) er-Richardsonand Coefficient Ipha rer Total Measured Error Varianetl· ...-. '-\. of \given Gulliksen (l950b. Chs. :..tJ'..- Two \ Test-Retest .

Fig. 11. Percentage Distribution of Score Variance in a Hypothetical Test. (Error variance, 38 percent; true variance, stable over time, consistent over forms, and free from interscorer differences, 62 percent.)
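The arithmetic of Table 10 and Figure 11 can be traced in a few lines of code; the sketch below simply restates the analysis above.

```python
# Partition of score variance for the hypothetical creativity test (Table 10).
r_alternate = 0.70  # delayed alternate-form reliability
r_split = 0.80      # split-half (Spearman-Brown) reliability
r_scorer = 0.92     # scorer reliability

content_plus_time = 1 - r_alternate                    # .30
content_sampling = 1 - r_split                         # .20
time_sampling = content_plus_time - content_sampling   # .10
interscorer = 1 - r_scorer                             # .08

error_variance = content_sampling + time_sampling + interscorer  # .38
true_variance = 1 - error_variance                               # .62
print(error_variance, true_variance)
```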

RELIABILITY OF SPEEDED TESTS

Both in test construction and in the interpretation of test scores, an important distinction is that between the measurement of speed and of power. A pure speed test is one in which individual differences depend entirely on speed of performance. Such a test is constructed from items of uniformly low difficulty, all of which are well within the ability level of the persons for whom the test is designed. The time limit is made so short that no one can finish all the items. Under these conditions, each person's score reflects only the speed with which he worked. A pure power test, on the other hand, has a time limit long enough to permit everyone to attempt all items. The difficulty of the items is steeply graded, and the test includes some items too difficult for anyone to solve, so that no one can get a perfect score.

It will be noted that both speed and power tests are designed to prevent the achievement of perfect scores. The reason for such a precaution is that perfect scores are indeterminate, since it is impossible to know how much higher the individual's score would have been if more items, or more difficult items, had been included. To enable each individual to show fully what he is able to accomplish, the test must provide adequate ceiling, either in number of items or in difficulty level. An exception to this rule is found in mastery testing, as illustrated by the criterion-referenced tests discussed in Chapter 4. The purpose of such testing is not to establish the limits of what the individual can do, but to determine whether a preestablished performance level has or has not been reached.

In actual practice, the distinction between speed and power tests is one of degree, most tests depending on both power and speed in varying proportions. Information about these proportions is needed for each test in order not only to understand what the test measures but also to choose the proper procedures for evaluating its reliability. Single-trial reliability coefficients, such as those found by odd-even or Kuder-Richardson techniques, are inapplicable to speeded tests.

To the extent that individual differences in test scores depend on speed of performance, reliability coefficients found by these methods will be spuriously high. An extreme example will help to clarify this point. Let us suppose that a 50-item test depends entirely on speed, so that individual differences in score are based wholly on number of items attempted, rather than on errors. Then, if individual A obtains a score of 44, he will obviously have 22 correct odd items and 22 correct even items. Similarly, individual B, with a score of 34, will have odd and even scores of 17 and 17, respectively. Consequently, except for accidental careless errors on a few items, the correlation between odd and even scores would be perfect, or +1.00. Such a correlation, however, is entirely spurious and provides no information about the reliability of the test.

An examination of the procedures followed in finding both split-half and Kuder-Richardson reliability will show that both are based on the consistency in number of errors made by the examinee. If, now, individual differences in test scores depend, not on errors, but on speed, the measure of reliability must obviously be based on consistency in speed of work. When test performance depends on a combination of speed and power, the single-trial reliability coefficient will fall below 1.00, but it will still be spuriously high. As long as individual differences in test scores are appreciably affected by speed, single-trial reliability coefficients cannot be properly interpreted.

What alternative procedures are available to determine the reliability of significantly speeded tests? If the test-retest technique is applicable, it would be appropriate. Similarly, equivalent-form reliability may be properly employed with speed tests. Split-half techniques may also be used, provided that the split is made in terms of time rather than in terms of items. In other words, the half-scores must be based on separately timed parts of the test. One way of effecting such a split is to administer two equivalent halves of the test with separate time limits. For example, the odd and even items may be separately printed on different pages, and each set of items given with one-half the time limit of the entire test. Such a procedure is tantamount to administering two equivalent forms of the test in immediate succession. Each form, however, is half as long as the test proper, while the subjects' scores are normally based on the whole test. For this reason, either the Spearman-Brown or some other appropriate formula should be used to find the reliability of the whole test.

If it is not feasible to administer the two half-tests separately, an alternative procedure is to divide the total time into quarters, and to find a score for each of the four quarters. This can easily be done by having the examinees mark the item on which they are working whenever the examiner gives a prearranged signal. The number of items correctly completed within the first and fourth quarters can then be combined to represent one half-score, while those in the second and third quarters can be combined to yield the other half-score. Such a combination of quarters tends to balance out the cumulative effects of practice, fatigue, and other factors. This method is especially satisfactory when the items are not steeply graded in difficulty level.

When is a test appreciably speeded? Under what conditions must the special precautions discussed in this section be observed? Obviously, the mere employment of a time limit does not signify a speed test. If all subjects finish within the given time limit, speed of work plays no part in determining the scores. Percentage of persons who fail to complete the test might be taken as a crude index of speed versus power. Even when no one finishes the test, however, the role of speed may be negligible. For example, if everyone completes exactly 40 items of a 50-item test, individual differences with regard to speed are entirely absent, although no one had time to attempt all the items.

The essential question, of course, is: "To what extent are individual differences in test scores attributable to speed?" In more technical terms, we want to know what proportion of the total variance of test scores is speed variance. This proportion can be estimated roughly by finding the variance of number of items completed by different persons and dividing it by the variance of total test scores (σ²c/σ²t). In the example cited above, in which every individual finishes 40 items, the numerator of this fraction would be zero, since there are no individual differences in number of items completed (σ²c = 0). The entire index would thus equal zero in a pure power test. On the other hand, if the total test variance (σ²t) is attributable to individual differences in speed, the two variances will be equal and the ratio will be 1.00. Several more refined procedures have been developed for determining this proportion, but their detailed consideration falls beyond the scope of this book.7

An example of the effect of speed on single-trial reliability coefficients is provided by data collected in an investigation of the first edition of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (Anastasi & Drake, 1954). In this study, the reliability of each test was first determined by the usual odd-even procedure. These coefficients, given in the first row of Table 11, are closely similar to those reported in the test manual. Reliability coefficients were then computed by correlating scores on separately timed halves. These coefficients are shown in the second row of Table 11. Calculation of speed indexes showed that the Verbal Meaning test is primarily a power test, while the Reasoning test is somewhat more dependent on speed. The Space and Number tests proved to be highly speeded.

7 See, e.g., Cronbach & Warrington (1951), Gulliksen (1950a, 1950b), Guttman (1955), Helmstadter & Ortmeyer (1953).

TABLE 11
Reliability Coefficients of Four of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (1st Edition)
(Data from Anastasi & Drake, 1954)

Reliability Coefficient Found by:    Verbal Meaning   Reasoning   Space   Number
Single-trial odd-even method              .94            .96       .90      .92
Separately timed halves                   .90            .87       .75      .83

It will be noted in Table 11 that, when properly computed, the reliability of the Space test is .75, in contrast to a spuriously high odd-even coefficient of .90. Similarly, the reliability of the Reasoning test drops from .96 to .87, and that of the Number test drops from .92 to .83. The reliability of the relatively unspeeded Verbal Meaning test, on the other hand, shows a negligible difference when computed by the two methods.
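The crude speed index described above is likewise simple to compute. The sketch below is an editorial illustration; the data are invented to reproduce the pure-power case in which every examinee completes exactly 40 items.

```python
# Proportion of total-score variance attributable to speed:
# variance of items completed divided by variance of total scores.
import statistics

def speed_index(items_completed, total_scores):
    return statistics.pvariance(items_completed) / statistics.pvariance(total_scores)

completed = [40, 40, 40, 40]  # every examinee attempts the same number of items
totals = [28, 35, 31, 24]     # total scores still differ
print(speed_index(completed, totals))  # 0.0: a pure power test
```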

7 See, e.g., Cronbach & Warrington (1951), Gulliksen (1950a, 1950b), Guttman (1955), Helmstadter & Ortmeyer (1953).

DEPENDENCE OF RELIABILITY COEFFICIENTS ON THE SAMPLE TESTED

HETEROGENEITY. An important factor influencing the size of a reliability coefficient is the nature of the group on which reliability is measured. In the first place, any correlation coefficient is affected by the range of individual differences in the group. If every member of a group were alike in spelling ability, then the correlation of spelling with any other ability would be zero in that group. It would obviously be impossible, within such a group, to predict an individual's standing in any other ability from a knowledge of his spelling score.

Another, less extreme, example is provided by the correlation between two aptitude tests, such as a verbal comprehension and an arithmetic reasoning test. If these tests were administered to a highly homogeneous sample, such as a group of 300 college sophomores, the correlation between the two would probably be close to zero. There is little relationship, within such a selected sample of college students, between any individual's verbal ability and his numerical reasoning ability. On the other hand, were the tests to be given to a heterogeneous sample of 300 persons, ranging from institutionalized mentally retarded persons to college graduates, a high correlation would undoubtedly be obtained between the two tests. The mentally retarded would obtain poorer scores than the college graduates on both tests, and similar relationships would hold for other subgroups within this highly heterogeneous sample.
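The shrinkage of a correlation within a homogeneous subgroup can be demonstrated with a small simulation. The sketch below is illustrative only; the sample size, the shared general factor, and the selection cutoff are arbitrary assumptions, not values from the text.

    # Sketch: a correlation that is high in a heterogeneous sample
    # nearly vanishes when recomputed in a restricted subgroup.
    import random
    random.seed(0)

    def corr(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / (sxx * syy) ** 0.5

    # Two ability measures sharing a common general factor.
    g = [random.gauss(0, 1) for _ in range(3000)]
    verbal    = [x + random.gauss(0, 0.5) for x in g]
    numerical = [x + random.gauss(0, 0.5) for x in g]
    print(round(corr(verbal, numerical), 2))      # high, about .80

    # Restrict the range: keep only high scorers on both variables,
    # like the college sophomores of the example above.
    pairs = [(v, m) for v, m in zip(verbal, numerical) if v > 1 and m > 1]
    vs, ms = zip(*pairs)
    print(round(corr(vs, ms), 2))                 # much lower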

Examination of the hypothetical scatter diagram given in Figure 12 will further illustrate the dependence of correlation coefficients on the variability, or extent of individual differences, within the group. This scatter diagram shows a high positive correlation in the entire, heterogeneous group, since the entries are closely clustered about the diagonal extending from the lower left- to the upper right-hand corner. If, now, we consider only the subgroup falling within the small rectangle in the upper right-hand portion of the diagram, it is apparent that the correlation between the two variables is close to zero. Individuals falling within this restricted range in both variables represent a highly homogeneous group, as did the college sophomores mentioned above.

Like all correlation coefficients, reliability coefficients depend on the variability of the sample within which they are found. Thus, if the reliability coefficient reported in a test manual was determined in a group ranging from fourth-grade children to high school students, it cannot be assumed that the reliability would be equally high within, let us say, an eighth-grade sample.
When a test is to be used to discriminate individual differences within a more homogeneous sample than the standardization group, the reliability coefficient should be redetermined on such a sample. Formulas for estimating the reliability coefficient to be expected when the standard deviation of the group is increased or decreased are available in elementary statistics textbooks. It is preferable, however, to recompute the reliability coefficient empirically on a group comparable to that on which the test is to be used. For tests designed to cover a wide range of age or ability, the test manual should report separate reliability coefficients for relatively homogeneous subgroups within the standardization sample.
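One such textbook formula rests on the assumption that the error variance, SD²(1 − r), remains constant when the spread of the group changes, so that r₂ = 1 − (SD₁²/SD₂²)(1 − r₁), where r₁ and SD₁ describe the original group and SD₂ the new group. The following minimal sketch applies that formula under the stated assumption; the numbers are invented for illustration.

    # Sketch: reliability expected in a new group of different spread,
    # assuming the error variance stays constant across groups.
    def expected_reliability(r1, sd1, sd2):
        return 1 - (sd1 ** 2 / sd2 ** 2) * (1 - r1)

    # Reliability .90 found in a group with SD 16; expected value
    # in a narrower group with SD 8:
    print(round(expected_reliability(0.90, 16, 8), 2))   # 0.6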

[Figure 12 appears here: a hypothetical bivariate scatter diagram ("Score on Variable 1" along the base), with tallies clustered about the diagonal and a small rectangle in the upper right-hand corner enclosing the restricted subgroup.]

Fig. 12. The Effect of Restricted Range upon a Correlation Coefficient.
ABILITY LEVEL. Not only does the reliability coefficient vary with the extent of individual differences in the sample, but it may also vary between groups differing in average ability level. These differences, moreover, cannot usually be predicted or estimated by any statistical formula, but can be discovered only by empirical tryout of the test on groups differing in age or ability level. Such differences in the reliability of a single test may arise from the fact that a slightly different combination of abilities is measured at different difficulty levels of the test. Or it may result from the statistical properties of the scale itself, as in the Stanford-Binet (Pinneau, 1961, Ch. 5). Thus, for different ages and for different IQ levels, the reliability coefficient of the Stanford-Binet varies from .83 to .98. In other tests, reliability may be relatively low for the younger and less able groups, since their scores are unduly influenced by guessing. Under such circumstances, the particular test should not be employed at these levels.

It is apparent that every reliability coefficient should be accompanied by a full description of the type of group on which it was determined. Special attention should be given to the variability and the ability level of the sample. The reported reliability coefficient is applicable only to samples similar to that on which it was computed. A desirable and growing practice in test construction is to fractionate the standardization sample into more homogeneous subgroups, with regard to age, sex, grade level, occupation, and the like, and to report separate reliability coefficients for each subgroup. Under these conditions, the reliability coefficients are more likely to be applicable to the samples with which the test is to be used in actual practice.

INTERPRETATION OF INDIVIDUAL SCORES. The reliability of a test may be expressed in terms of the standard error of measurement (σ_meas), also called the standard error of a score. This measure is particularly well suited to the interpretation of individual scores. For many testing purposes, it is therefore more useful than the reliability coefficient. The standard error of measurement can be easily computed from the reliability coefficient of the test, by the following formula:

    σ_meas = SD √(1 − r₁₁)

in which SD is the standard deviation of the test scores and r₁₁ the reliability coefficient, both computed on the same group. For example, if deviation IQ's on a particular intelligence test have a standard deviation of 15 and a reliability coefficient of .89, the σ_meas of an IQ on this test is 15√(1 − .89) = 15√.11 = 15(.33) = 5.

To understand what the σ_meas tells us about a score, let us suppose that we had a set of 100 IQ's obtained with the above test by a single boy, Jim. Because of the types of chance errors discussed in this chapter, these scores will vary, falling into a normal distribution around Jim's true score. The mean of this distribution of 100 scores can be taken as the true score, and the standard deviation of the distribution can be taken as the standard error of measurement. Like any standard deviation, this standard error can be interpreted in terms of the normal curve frequencies discussed in Chapter 4 (see Figure 3). It will be recalled that between the mean and ±1σ there are approximately 68 percent of the cases in a normal curve. Thus, the chances are roughly 2:1 (or 68:32) that Jim's IQ on any one administration will fluctuate between ±1σ_meas, or 5 points, on either side of his true IQ. If his true IQ is 110, we would expect him to score between 105 and 115 about two-thirds (68 percent) of the time.

If we want to be more certain of our prediction, we can choose higher odds than 2:1. Reference to Figure 3 in Chapter 4 shows that ±3σ covers 99.7 percent of the cases. It can be ascertained from normal curve frequency tables that a distance of 2.58σ on either side of the mean includes exactly 99 percent of the cases. Hence, the chances are 99:1 that Jim's IQ will fall within 2.58σ_meas, or (2.58)(5) = 13 points, on either side of his true IQ. We can thus state at the 99 percent confidence level (with only one chance of error out of 100) that Jim's IQ on any single administration of the test will lie between 97 and 123 (110 − 13 and 110 + 13). If Jim were given 100 equivalent tests, his IQ would fall outside this band of values only once.

In actual practice, of course, we do not have the true scores, but only the scores obtained in a single test administration. Under these circumstances, we could try to follow the above reasoning in the reverse direction. If an individual's obtained score is unlikely to deviate by more than 2.58σ_meas from his true score, we could argue that his true score must lie within 2.58σ_meas of his obtained score. Although we cannot assign a probability to this statement for any given obtained score, we can say that the statement would be correct for 99 percent of all the cases. On the basis of this reasoning, Gulliksen (1950b, pp. 17-20) proposed that the standard error of measurement be used as illustrated above to estimate the reasonable limits of the true score for persons with any given obtained score. It is in terms of such "reasonable limits" that the error of measurement is customarily interpreted in psychological testing, and it will be so interpreted in this book.

The standard error of measurement and the reliability coefficient are obviously alternative ways of expressing test reliability. Unlike the reliability coefficient, the error of measurement is independent of the variability of the group on which it is computed. Expressed in terms of individual scores, it remains unchanged when found in a homogeneous or a heterogeneous group. On the other hand, being reported in score units, the error of measurement will not be directly comparable from test to test. The usual problems of comparability of units would thus arise when errors of measurement are reported in terms of arithmetic problems, words in a vocabulary test, and the like. Hence, if we want to compare the reliability of different tests, the reliability coefficient is the better measure. To interpret individual scores, the standard error of measurement is more appropriate.
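The arithmetic of such "reasonable limits" can be written out compactly. The following minimal sketch simply restates the example above (SD 15, reliability .89, obtained IQ 110); the 2.58 multiplier is the normal-curve value quoted in the text.

    # Sketch: standard error of measurement and "reasonable limits"
    # for an obtained score.
    def sem(sd, reliability):
        return sd * (1 - reliability) ** 0.5

    s = sem(15, 0.89)        # about 5 IQ points
    obtained = 110

    # Chances are roughly 2:1 that the score lies within +/- 1 SEM:
    print(round(obtained - s), round(obtained + s))                # 105 115
    # Chances are 99:1 that it lies within +/- 2.58 SEM:
    print(round(obtained - 2.58 * s), round(obtained + 2.58 * s))  # 97 123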
INTERPRETATION OF SCORE DIFFERENCES. It is particularly important to consider test reliability and errors of measurement when evaluating the differences between two scores. Thinking in terms of the range within which each score may fluctuate serves as a check against overemphasizing small differences between scores. Such caution is desirable both when comparing test scores of different persons and when comparing the scores of the same individual in different abilities. Similarly, changes in scores following instruction or other experimental variables need to be interpreted in the light of errors of measurement.

A frequent question about test scores concerns the individual's relative standing in different areas. Is Jane more able along verbal than along numerical lines? Does Tom have more aptitude for mechanical than for verbal activities? If Jane scored higher on the verbal than on the numerical subtests of an aptitude battery, and Tom scored higher on the mechanical than on the verbal, how sure can we be that each would still do so on a retest with another form of the battery? In other words, could the score differences have resulted merely from the chance selection of specific items in the particular verbal, numerical, and mechanical tests employed?

It is well to bear in mind that the standard error of the difference between two scores is larger than the error of measurement of either of the two scores. This follows from the fact that this difference is affected by the chance errors present in both scores. The standard error of the difference between two scores can be found from the standard errors of measurement of the two scores by the following formula:

    σ_diff = √(σ²_meas1 + σ²_meas2)

in which σ_diff is the standard error of the difference between the two scores, and σ_meas1 and σ_meas2 are the standard errors of measurement of the separate scores. By substituting SD√(1 − r₁₁) for σ_meas1 and SD√(1 − r₂₂) for σ_meas2, we may rewrite the formula directly in terms of the reliability coefficients:

    σ_diff = SD √(2 − r₁₁ − r₂₂)

In this substitution, the same SD was used for tests 1 and 2, since their scores would have to be expressed in terms of the same scale before they could be compared.

We may illustrate the above procedure with the Verbal and Performance IQ's on the Wechsler Adult Intelligence Scale (WAIS). The split-half reliabilities of these scores are .96 and .93, respectively. WAIS deviation IQ's have a mean of 100 and an SD of 15. Hence the standard error of the difference between these two scores can be found as follows:

    σ_diff = 15 √(2 − .96 − .93) = 4.95

To determine how large a score difference could be obtained by chance at the .05 level, we multiply the standard error of the difference by 1.96. The result is 9.7, or approximately 10 points. Thus the difference between an individual's WAIS Verbal and Performance IQ should be at least 10 points to be significant at the .05 level.
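As a check on this arithmetic, the computation can be carried out directly. The following minimal sketch simply re-derives the WAIS figures quoted above (the text rounds √.11 to .33, so its 4.95 corresponds to the 4.97 obtained here).

    # Sketch: standard error of the difference between two scores,
    # expressed in terms of the two reliability coefficients.
    def se_diff(sd, r11, r22):
        return sd * (2 - r11 - r22) ** 0.5

    d = se_diff(15, 0.96, 0.93)   # WAIS Verbal vs. Performance IQ
    print(round(d, 2))            # 4.97
    print(round(1.96 * d, 1))     # 9.7 -> about 10 points at the .05 level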
Because of the growing interest in the interpretation of score profiles, test publishers have been developing report forms that permit the evaluation of scores in terms of their errors of measurement. An example is the Individual Report Form for use with the Differential Aptitude Tests, reproduced in Figure 13. On this form, percentile scores on each subtest of the battery are plotted as one-inch bars, with the obtained percentile at the center. Each percentile bar corresponds to a distance of approximately 1½ to 2 standard errors on either side of the obtained score.8 Hence the assumption that the individual's true score falls within the bar is correct about 90 percent of the time. In interpreting the profiles, test users are advised not to attach importance to differences between scores whose percentile bars overlap, especially if they overlap by more than half their length. In the profile illustrated in Figure 13, the difference between the Verbal Reasoning and Numerical Ability scores probably reflects a genuine difference in ability level; that between Mechanical Reasoning and Space Relations probably does not; the difference between Abstract Reasoning and Mechanical Reasoning is in the doubtful range.

[Figure 13 appears here: an Individual Report Form listing each subtest's raw score and percentile, with the percentiles plotted as one-inch bars.]

Fig. 13. Score Profile on the Differential Aptitude Tests, Illustrating Use of Percentile Bands. (From Differential Aptitude Tests, Fifth Edition Manual. Copyright © 1973, 1974 by The Psychological Corporation, New York, N.Y. All rights reserved. Reproduced by permission.)

8 Because the reliability coefficient (and hence the error of measurement) varies somewhat with subtest, grade, and sex, the actual ranges covered by the one-inch lines are not identical; but they are sufficiently close to permit uniform interpretations for practical purposes.

RELIABILITY OF CRITERION-REFERENCED TESTS

It will be recalled from Chapter 4 that criterion-referenced tests usually (but not necessarily) evaluate performance in terms of mastery rather than degree of achievement.

A major statistical implication of mastery testing is a reduction in the variability of scores among persons. Theoretically, if everyone continues training until the skill is mastered, variability is reduced to zero. Not only is low variability a result of the way such tests are used; it is also built into the tests through the construction and choice of items, as will be shown in Chapter 8. In an earlier section of this chapter, we saw that any correlation, including reliability coefficients, is affected by the variability of the group in which it is computed. As the variability of the sample decreases, so does the correlation coefficient. Obviously, under these conditions, even a highly stable and internally consistent test could yield a reliability coefficient near zero. It would thus be inappropriate to assess the reliability of most criterion-referenced tests by the usual procedures.

Because of the large number of specific instructional objectives to be tested, criterion-referenced tests typically provide only a small number of items for each objective. In the construction of such tests, two important questions are: (1) How many items must be used for reliable assessment of each of the specific instructional objectives covered by the test? (2) What proportion of items must be correct for the reliable establishment of mastery? In much current testing, these two questions have been answered by judgmental decisions. Efforts are under way, however, to develop appropriate statistical techniques that will provide objective, empirical answers (see, e.g., Ferguson & Novick, 1973; Glaser & Nitko, 1971; Hambleton & Novick, 1973; Livingston, 1972; Millman, 1974; Popham & Husek, 1969).9 All statistical procedures for use with criterion-referenced tests are in an exploratory stage. Much remains to be done, in both theoretical development and empirical tryouts, before the most effective methodology for different testing situations can be formulated. A few examples will serve to illustrate the nature and scope of these efforts.

The two questions about number of items and cutoff score can be incorporated into a single hypothesis: we wish to test the hypothesis that the examinee has achieved the required level of mastery in the content domain or instructional objective sampled by the test items. Such a hypothesis is amenable to testing within the framework of decision theory and sequential analysis (Wald, 1947), which lend themselves well to the kind of decisions required by mastery testing. Sequential analysis consists in taking observations one at a time and deciding after each observation whether to: (1) accept the hypothesis, (2) reject the hypothesis, or (3) make additional observations. Thus the number of observations (in this case, number of items) needed to reach a reliable conclusion is itself determined during the process of testing. Rather than being presented with a fixed, predetermined number of items, the examinee continues taking the test until a mastery or nonmastery decision is reached. At that point, testing is discontinued and the student is either directed to the next instructional level or returned to the nonmastered level for further study. With the computer facilities described in Chapter 4, such sequential decision procedures are feasible and can reduce total testing time while yielding reliable estimates of mastery (Glaser & Nitko, 1971; Lindgren & McElrath, 1969). A set of tables for determining the minimum number of items required for establishing mastery at specified levels is provided by Millman (1972, 1973). When flexible, individually tailored procedures are impracticable, more traditional techniques can be utilized to assess the reliability of a given test. Some investigators have also been exploring the use of Bayesian estimation techniques; procedures have been developed for incorporating collateral data from the student's previous performance history, as well as from the test results of other students (Ferguson & Novick, 1973).

In the development of several criterion-referenced tests, Educational Testing Service has followed an empirical procedure to set standards of mastery. This procedure involves administering the test in classes one grade below and one grade above the grade where the particular concept or skill is taught. The dichotomization can be further refined by using teacher judgments to exclude any cases in the lower grade known to have mastered the concept or skill and any cases in the higher grade who have demonstrably failed to master it. A cutting score, in terms of number or percentage of correct items, is then selected that best discriminates between the two groups.

Another procedure for determining the reliability of a mastery test is to administer two parallel forms to the same individuals and note the percentage of persons for whom the same decision (mastery or nonmastery) is reached on both forms (Hambleton & Novick, 1973). To supplement this limited information, mastery decisions reached at a prerequisite instructional level can be checked against performance at the next instructional level. Is there a sizeable proportion of students who reached or exceeded the cutoff score on the mastery test at the lower level and failed to achieve mastery at the next level within a reasonable period of instructional time? Does an analysis of their difficulties suggest that they had not truly mastered the prerequisite skills? If so, these findings would strongly suggest that the mastery test was unreliable. Either the addition of more items or the establishment of a higher cutoff score would seem to be indicated.

9 For fuller discussion of special statistical procedures required for the construction and evaluation of criterion-referenced tests, see Glaser and Nitko (1971), Hambleton and Novick (1973), Popham and Husek (1969), and Millman (1974).
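The parallel-form check on a mastery test reduces to counting classification agreements rather than correlating scores. A minimal sketch follows; the cutoff and the two score lists are hypothetical values chosen only for illustration.

    # Sketch: percentage of examinees receiving the same decision
    # (mastery or nonmastery) on two parallel forms of a mastery test.
    cutoff = 8                          # hypothetical mastery cutoff
    form_a = [9, 7, 10, 8, 5, 9, 6, 10]
    form_b = [8, 6, 10, 7, 5, 9, 8, 10]

    agree = sum((a >= cutoff) == (b >= cutoff)
                for a, b in zip(form_a, form_b))
    print(100 * agree / len(form_a), "percent consistent decisions")  # 75.0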

CHAPTER 6

Validity: Basic Concepts

THE VALIDITY of a test concerns what the test measures and how well it does so. In this connection, we should guard against accepting the test name as an index of what the test measures. Test names provide short, convenient labels for identification purposes. Most test names are far too broad and vague to furnish meaningful clues to the behavior area covered, although increasing efforts are being made to use more specific and operationally definable test names. The trait measured by a given test can be defined only through an examination of the objective sources of information and empirical operations utilized in establishing its validity (Anastasi, 1950). Fundamentally, all procedures for determining test validity are concerned with the relationships between performance on the test and other independently observable facts about the behavior characteristics under consideration. The specific methods employed for investigating these relationships are numerous and have been described by various names. In the Standards for Educational and Psychological Tests (1974), these procedures are classified under three principal categories: content, criterion-related, and construct validity. Each of these types of validation procedures will be considered in one of the following sections, and the relations among them will be examined in a concluding section. Techniques for analyzing and interpreting validity data with reference to practical decisions will be discussed in Chapter 7. No test can be said to have "high" or "low" validity in the abstract; its validity must be determined with reference to the particular use for which the test is being considered.

CONTENT VALIDITY

NATURE. Content validity involves essentially the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured. Such a validation procedure is commonly used in evaluating achievement tests. This type of test is designed to measure how well the individual has mastered a specific skill or course of study. It might thus appear that mere inspection of the content of the test should suffice to establish its validity for such a purpose. A test of multiplication, spelling, or bookkeeping would seem to be valid by definition if it consists of multiplication, spelling, or bookkeeping items.

The solution, however, is not so simple as it appears to be. One difficulty is that of adequately sampling the item universe. The behavior domain to be tested must be systematically analyzed to make certain that all major aspects are covered by the test items, and in the correct proportions. A test can easily become overloaded with those aspects of the field that lend themselves more readily to the preparation of objective items. The domain under consideration should be fully described in advance, rather than being defined after the test has been prepared.1

It is also important to guard against any tendency to overgeneralize regarding the domain sampled by the test. For instance, a multiple-choice spelling test may measure the ability to recognize correctly and incorrectly spelled words. But it cannot be assumed that such a test also measures ability to spell correctly from dictation, frequency of misspellings in written compositions, and other aspects of spelling ability (Ahlstrom, 1964; Knoell & Harris, 1952). Still another difficulty arises from the possible inclusion of irrelevant factors in the test scores. For example, a test designed to measure proficiency in such areas as mathematics or mechanics may be unduly influenced by the ability to understand verbal directions or by speed of performing simple, routine tasks. Moreover, mere inspection of the test may fail to reveal the processes actually used by examinees in taking the test; content validity depends on the relevance of the individual's test responses to the behavior area under consideration, rather than on the apparent relevance of item content.

Content must therefore be broadly defined to include major objectives, such as the application of principles and the interpretation of data, as well as factual knowledge. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter.

SPECIFIC PROCEDURES. For educational tests, content validity is built into the test from the outset, through the choice of appropriate items. The preparation of items is preceded by a thorough and systematic examination of relevant course syllabi and textbooks, as well as by consultation with subject-matter experts.

1 Further discussions of content validity from several angles can be found in Ebel (1956), Huddleston (1956), and Lennon (1956).

'" "'Oeo "''''''' N."'''I''' coo". Two volumes are ilable.In general. '" •..•...•... f. LllMcn .. If they served as judges in classi. analysis. and appreciation. terms. llj5!l:f% CONO ~ '" . ""'" ~ "'''' . the directions they were given should be reported. the instruc'onal objectives or processes to be tested. Both total s and performance on individual items can be checked for grade ess... s and classifying items should be described.!~eJJeN " "'''' •.. These specifications should show the content areas or topics to be covered. Krathwohl et al. and evaluation.) ......CO'" "'''I''' "''''''' "'''''''' ~.to ".. includes five major categories: recciv'g.-CD •.•. "'N~ "I•.."INN "''''0 JaqwnN wall -N'" "''''(0 "'COOl O~N ~~~ ~~~ "'O~ ~NN "1M. responding. those items are retained that show the largest gains percentages of children passing them from the lower to the upper JeqwnN wall ~N'" ... Information should likebe provided about number and nature of 'course syllabi and texts surveyed. •. values.. •. MLll CO.... yaluing. respectively. "l"'''' COOlN "--N "'0>..... "INN 0:>0>0 Lll"'''' NNN NNM U!Pn&S IU!~OS 'u.Principles of Psychological Testing with subject-matter experts.. . .. The jor categories given in the cognitive domain include knowledge (in sense of remembered facts..nee certain processes may be unsuitable or irrelevant for certain topics..test specifications are drawn up for the item writers. their number and prolal qualifications should be stated. the procedures followed in selecting cate... covering cognitive and affective domains. On this basis.. 14).. . E 0 " x Iii . "" ~~~ --~ "'o~ ~NN "''''. methods. principles... a:>Ua!~S "" " " : " x " " 5a!)!lJcwnH I. need to have items.. The classification affective objectives.... 0. inrests. Not all cells in such a table. 1964).r.. co •••. umber of empirical procedures may also be followed in order to ement the content validation of an achievement test. this handbook also provides examples of any types of items designed to test each objective. :> ~ a3u:a!3S ~ is ~ samuewnH -.. concerned with the modification of attitudes. synthesis... A convenient ay to set up such specifications is in terms of a two-way table. a.. as well as extent of agreement among judges. ~ 'M.. 'In addition. .•.. 1956..application. and the relative importance of 'ndividual topics and processes.•••.'" ~::~ "'.. ~Jn listing objectives to be co\'ered in an educational achievement test. If subject-matter experts ipated in the test-construction process.. .:rl"'.... ••.•.. "'NN N"'''' NNM "'''' . ••• o N"'.~- 6 oiIpt'!JE) coco". "'N. ~~g LllNN 11l5!1l~ L ape.!leJJC!N I" " " sa!P01S II?POS 0 'u..l. <t ••• . with ocesses across the top and topics in the left-hand column (see Table . IThediscussion of content validity in the manual of an achievement test uld include information on th~ content areas and the skills or obives covered bv the test... of course. M COOOl •. it is paI:tJcularly desirable to give the dates n subject-matter experts were' consulted.... compresion. including publication dates... t might be added that such a specification table will also prove helpful ..•• LllCO •.. etc.. the number of items of ach kind to be prepared on each topic can be established.•. e test constructor can be guided by the extensive survey of educational jectives given in the Taxonomy of ~ducational Objectives (Bloom a!.M -~•..'C a" f! ..lllCO Nmq-10l!:t~LO " MNO .eh.• . £~ ~ 0 ~Z '4D!1l% gaP'!J~! ~~. organization. 
the preparation of teacher-made examinations for classroom use in any ubject.. c "'ON 0 '" ~~filll!'l~... """-CX) r--. Prepared by a group of specialists educational measurement.•. items. "'~Lll N"'<t NM.. <0 .•.... and characterization. with some indication of the number of items ach category.•.. On the basis of the information thus gath'-ered.)... " E 0. " " " I 3A!leJJeN iI " "" "" 5Cl'lpn~s '0 le!XlS " is 'p u. :6- eou81OS x " " "" x S3!l!UllWnH . Because curricula and course eilt change over time.'" "' •.!) "''''''' "'... "I"'''' < .

Figure 14 shows a portion of a table from the manual of the Sequential Tests of Educational Progress, Series II (STEP). The 30 items included in Figure 14 represent one part of the Reading test for Level 3, which covers grades 7 to 9. For every item in each test in this achievement battery, the information provided includes its classification with regard to learning skill and type of material, as well as the percentage of children in the normative sample who marked the right answer to the item in each of the grades for which that level of the test is designed.

A number of supplementary procedures may be employed, when appropriate, to detect irrelevant influences on test performance. To detect the possible influence of ability to read instructions, scores on the test can be correlated with scores on a reading comprehension test. On the other hand, if the test is designed to measure reading comprehension, giving the questions without the reading passage on which they are based will show how many could be answered simply from the examinees' prior information or other irrelevant cues. The contribution of speed can be checked by noting how many persons fail to finish the test, or by one of the more refined methods discussed in Chapter 5. Other supplementary procedures include analyses of the types of errors commonly made on a test and observation of the work methods employed by examinees. The latter could be done by testing students individually, with instructions to "think aloud" while solving each problem.

APPLICATIONS. Especially when bolstered by such empirical checks as those illustrated above, content validity provides an adequate technique for evaluating achievement tests. It permits us to answer two questions that are basic to the validity of an achievement test: (1) Does the test cover a representative sample of the specified skills and knowledge? (2) Is test performance reasonably free from the influence of irrelevant variables?

Content validity is particularly appropriate for the criterion-referenced tests described in Chapter 4. Because performance on these tests is interpreted in terms of content meaning, it is obvious that content validity is a prime requirement for their effective use. Content validation is also applicable to certain occupational tests designed for employee selection and classification, to be discussed in Chapter 15. This type of validation is suitable when the test is an actual job sample or otherwise calls for the same skills and knowledge required on the job. In such cases, a thorough job analysis should be carried out in order to demonstrate the close resemblance between the job activities and the test.

For aptitude and personality tests, on the other hand, content validity is usually inappropriate and may, in fact, be misleading. Although considerations of relevance and effectiveness of content must obviously enter into the initial stages of constructing any test, eventual validation of aptitude or personality tests requires empirical verification by the procedures described in the following sections. These tests bear less intrinsic resemblance to the behavior domain they are trying to sample than do achievement tests. Consequently, the content of such tests can do little more than reveal the hypotheses that led the test constructor to choose a certain type of content for measuring a specified trait. Such hypotheses need to be empirically confirmed to establish the validity of the test.

Unlike achievement tests, aptitude and personality tests are not based on a specified course of instruction or uniform set of prior experiences from which test content can be drawn. Hence, in the latter tests, individuals are likely to vary more in the work methods or psychological processes employed in responding to the same test items. The identical test might thus measure different functions in different persons. Under these conditions, it would be virtually impossible to determine the psychological functions measured by the test from an inspection of its content. For example, college graduates might solve a problem in verbal or mathematical terms, while a mechanic would arrive at the same solution in terms of spatial visualization. Or a test measuring arithmetic reasoning among high school freshmen might measure only individual differences in speed of computation when given to college students. A specific illustration of the dangers of relying on content analysis of aptitude tests is provided by a study conducted with a digit-symbol substitution test (Burik, 1950). This test, generally regarded as a typical "code-learning" test, was found to measure chiefly motor speed in a group of high school students.

FACE VALIDITY. Content validity should not be confused with face validity. The latter is not validity in the technical sense; it refers, not to what the test actually measures, but to what it appears superficially to measure. Face validity pertains to whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers. Fundamentally, the question of face validity concerns rapport and public relations. Although common usage of the term validity in this connection may make for confusion, face validity itself is a desirable feature of tests. For example, when tests originally designed for children and developed within a classroom setting were first extended for adult use, they frequently met with resistance and criticism because of their lack of face validity. Certainly if test content appears irrelevant, inappropriate, silly, or childish, the result will be poor cooperation, regardless of the actual validity of the test. Especially in adult testing, it is not sufficient for a test to be objectively valid; it also needs face validity to function effectively in practical situations.

Face validity can often be improved by merely reformulating test items in terms that appear relevant and plausible in the particular setting in which they will be used. For example, if a test of simple arithmetic reasoning is constructed for use with machinists, the items should be worded in terms of machine operations rather than in terms of "how many oranges can be purchased for 36 cents" or other traditional schoolbook problems. Similarly, an arithmetic test for naval personnel can be expressed in naval terminology, without necessarily altering the functions measured. To be sure, face validity should never be regarded as a substitute for objectively determined validity. It cannot be assumed that improving the face validity of a test will improve its objective validity. Nor can it be assumed that when a test is modified so as to increase its face validity, its objective validity remains unaltered. The validity of the test in its final form will always need to be directly checked.

CRITERION-RELATED VALIDITY

CONCURRENT AND PREDICTIVE VALIDITY. Criterion-related validity indicates the effectiveness of a test in predicting an individual's behavior in specified situations. For this purpose, performance on the test is checked against a criterion, i.e., a direct and independent measure of that which the test is designed to predict. Thus, for a mechanical aptitude test, the criterion might be subsequent job performance as a machinist; for a scholastic aptitude test, it might be college grades; and for a neuroticism test, it might be associates' ratings or other available information on the subjects' behavior in various life situations.

The criterion measure against which test scores are validated may be obtained at approximately the same time as the test scores or after a stated interval. The APA test Standards (1974) differentiate between concurrent and predictive validity on the basis of these time relations between criterion and test. The term "prediction" can be used in the broader sense, to refer to prediction from the test to any criterion situation, or in the more limited sense of prediction over a time interval. It is in the latter sense that it is used in the expression "predictive validity." The information provided by predictive validity is most relevant to tests used in the selection and classification of personnel. Hiring job applicants, selecting students for admission to college or professional schools, and assigning military personnel to occupational training programs represent examples of the sort of decisions requiring a knowledge of the predictive validity of tests. Other examples include the use of tests to screen out applicants likely to develop emotional disorders in stressful environments and the use of tests to identify psychiatric patients most likely to benefit from a particular therapy.

In a number of instances, concurrent validity is found merely as a substitute for predictive validity. It is frequently impracticable to extend validation procedures over the time required for predictive validity or to obtain a suitable preselection sample for testing purposes. As a compromise solution, therefore, tests are administered to a group on whom criterion data are already available. Thus, the test scores of college students may be compared with their cumulative grade-point average at the time of testing, or those of employees with their current job success.

For certain uses of psychological tests, on the other hand, concurrent validity is the most appropriate type and can be justified in its own right. The logical distinction between predictive and concurrent validity is based, not on time, but on the objectives of testing. Concurrent validity is relevant to tests employed for diagnosis of existing status, rather than prediction of future outcomes. The difference can be illustrated by asking: "Is Smith neurotic?" (concurrent validity) and "Is Smith likely to become neurotic?" (predictive validity).

Because the criterion for concurrent validity is always available at the time of testing, we might ask what function is served by the test in such situations. Basically, such tests provide a simpler, quicker, or less expensive substitute for the criterion data. For example, if the criterion consists of continuous observation of a patient during a two-week hospitalization period, a test that could sort out normals from neurotic and doubtful cases would appreciably reduce the number of persons requiring such extensive observation.

CRITERION CONTAMINATION. An essential precaution in finding the validity of a test is to make certain that the test scores do not themselves influence any individual's criterion status. For example, if a college instructor or a foreman in an industrial plant knows that a particular individual scored very poorly on an aptitude test, such knowledge might influence the grade given to the student or the rating assigned to the worker. Or a high-scoring person might be given the benefit of the doubt when academic grades or on-the-job ratings are being prepared. Such influences would obviously raise the correlation between test scores and criterion in a manner that is entirely spurious or artificial. This possible source of error in test validation is known as criterion contamination, since the criterion ratings become "contaminated" by the rater's knowledge of the test scores. To prevent the operation of such an error, it is absolutely essential that no person who participates in the assignment of criterion ratings have any knowledge of the examinees' test scores. For this reason, test scores employed in "testing the test" must be kept strictly confidential. It is sometimes difficult to convince teachers, military officers, and other line personnel that such a precaution is essential. In their urgency to utilize all available information for practical decisions, such persons may fail to realize that the test scores must be put aside until the criterion data mature and validity can be checked.

COMMON CRITERIA. Any test may be validated against as many criteria as there are specific uses for it. Any method for assessing behavior in any situation could provide a criterion measure for some particular purpose. The criteria employed in finding the validities reported in test manuals, however, fall into a few common categories. Among the criteria most frequently employed in validating intelligence tests is some index of academic achievement. It is for this reason that such tests have often been more precisely described as measures of scholastic aptitude. The specific indices used as criterion measures include school grades, achievement test scores, promotion and graduation records, special honors and awards, and teachers' or instructors' ratings for "intelligence." Insofar as ratings given within an academic setting are likely to be heavily colored by the individual's scholastic performance, they may be properly classified with the criterion of academic achievement.

The various indices of academic achievement have provided criterion data at all educational levels, from the primary grades to college and graduate school. Although employed principally in the validation of general intelligence tests, they have also served as criteria for certain multiple-aptitude and personality tests. In the validation of any of these types of tests for use in the selection of college students, for example, a common criterion is freshman grade-point average. This measure is the average grade in all courses taken during the freshman year, each grade being weighted by the number of course points for which it was received.

A variant of the criterion of academic achievement frequently employed with out-of-school adults is the amount of education the individual completed. It is expected that in general the more intelligent individuals continue their education longer, while the less intelligent drop out of school earlier. The assumption underlying this criterion is that the educational ladder serves as a progressively selective influence, eliminating those incapable of continuing beyond each step. Although it is undoubtedly true that college graduates, for example, represent a more highly selected group than elementary school graduates, the relation between amount of education and scholastic aptitude is far from perfect. Especially at the higher educational levels, economic, social, motivational, and other nonintellectual factors may influence the continuation of the individual's education. Moreover, with such concurrent validation it is difficult to disentangle cause-and-effect relations. To what extent are the obtained differences in intelligence test scores simply the result of the varying amount of education? And to what extent could the test have predicted individual differences in subsequent educational progress? These questions can be answered only when the test is administered before the criterion data have matured, as in predictive validation.

In the development of special aptitude tests, a frequent type of criterion is based on performance in specialized training. For example, mechanical aptitude tests may be validated against final achievement in shop courses. Various business school courses, such as stenography, typing, or bookkeeping, provide criteria for aptitude tests in these areas. Similarly, performance in music or art schools has been employed in validating music and art aptitude tests, respectively. Several professional aptitude tests have been validated in terms of achievement in schools of law, medicine, dentistry, engineering, and other areas. In the case of custom-made tests designed for use within a specific testing program, training records are a frequent source of criterion data. An outstanding illustration is the validation of Air Force pilot selection tests against performance in basic flight training. Performance in training programs is also commonly used as a criterion for test validation in other military occupational specialties and in some industrial validation studies. Among the specific indices of training performance employed for criterion purposes may be mentioned achievement tests administered on completion of training, formally assigned grades, instructors' ratings, and successful completion of training versus elimination from the program. Multiple aptitude batteries have often been checked against grades in specific high school or college courses, in order to determine their validity as differential predictors. For example, scores on a verbal comprehension test may be compared with grades in English courses, spatial visualization scores with geometry grades, and so forth.

In connection with the use of training records in general as criterion measures, a useful distinction is that between intermediate and ultimate criteria. In the development of an Air Force pilot-selection test or a medical aptitude test, for example, the ultimate criteria would be combat performance and eventual achievement as a practicing physician, respectively. Obviously it would require a long time for such criterion data to mature. It is doubtful, moreover, whether a truly ultimate criterion is ever obtained in actual practice. Even were such an ultimate criterion available, it would probably be subject to many uncontrolled factors that would render it relatively useless. For example, it would be difficult to evaluate the relative degree of success of physicians practicing different specialties and in different parts of the country. For these reasons, such intermediate criteria as performance records at some stage of training are frequently employed as criterion measures.

For many purposes, the most satisfactory type of criterion measure is that based on follow-up records of actual job performance. This criterion has been used to some extent in the validation of general intelligence as well as personality tests, and to a larger extent in the validation of special aptitude tests. It is a common criterion in the validation of custom-made tests for specific jobs. The "jobs" in question may vary widely in both level and kind, including work in business, industry, the professions, and the armed services. Most measures of job performance, although probably not representing ultimate criteria, at least provide good intermediate criteria for many testing purposes. In this respect they are to be preferred to training records. On the other hand, the measurement of job performance does not permit as much uniformity of conditions as is possible during training. Moreover, since it usually involves a longer follow-up, the criterion of job performance is likely to entail a loss in the number of available subjects. Because of the variation in the nature of nominally similar jobs in different organizations, test manuals reporting validity data against job criteria should describe not only the specific criterion measures employed but also the job duties performed by the workers.

Validation by the method of contrasted groups generally involves a composite criterion that reflects the cumulative and uncontrolled selective influences of everyday life. This criterion is ultimately based on survival within a particular group versus elimination therefrom. For example, in the validation of an intelligence test, the scores obtained by institutionalized mentally retarded children may be compared with those obtained by schoolchildren of the same age. In this case, the multiplicity of factors determining commitment to an institution for the mentally retarded constitutes the criterion. Similarly, the validity of a musical aptitude or a mechanical aptitude test may be checked by comparing the scores obtained by students enrolled in a music school or an engineering school with the scores of unselected high school or college students. Occupational groups have frequently been used in the development and validation of interest tests, such as the Strong Vocational Interest Blank, as well as in the preparation of attitude scales. Thus, the test performance of salesmen or executives, on the one hand, may be compared with that of clerks or engineers, on the other. The assumption underlying such a procedure is that, with reference to many social traits, individuals who have entered and remained in such occupations as selling or executive work will as a group excel persons in such fields as clerical work or engineering. The contrasted groups included in the present category are distinct groups that have gradually become differentiated through the operation of the multiple demands of daily living. Other groups sometimes employed in the validation of attitude scales include political, religious, geographical, or other special groups generally known to represent distinctly different points of view on certain issues.

The method of contrasted groups is used quite commonly in the validation of personality tests. In validating a test of social traits, for example, college students who have engaged in many extracurricular activities may be compared with those who have participated in none during a comparable period of college attendance. Sometimes the contrasted groups are selected on the basis of some other criterion, by simply choosing the extremes of the distribution of criterion measures.

Psychiatric diagnosis is used both as a basis for the selection of items and as evidence of test validity in the development of certain personality tests. Psychiatric diagnosis may serve as a satisfactory criterion provided that it is based on prolonged observation and detailed case history, rather than on a cursory psychiatric interview or examination. In the latter case, there is no reason to expect the psychiatric diagnosis to be superior to the test score itself as an indication of the individual's emotional condition. Such a psychiatric diagnosis could not be regarded as a criterion measure, but rather as an indicator or predictor whose own validity would have to be determined.

Mention has already been made, in connection with other criterion categories, of certain types of ratings by school teachers, instructors in specialized training, and job supervisors. To these can be added ratings by officers in military situations, ratings of students by school counselors, and ratings by co-workers, classmates, fellow club-members, and other groups of associates. The ratings discussed earlier represented merely a subsidiary technique for obtaining information regarding such criteria as academic achievement, performance in specialized training, or job success. We are now considering the use of ratings as the very core of the criterion measure. Under these circumstances, the ratings themselves define the criterion. Moreover, such ratings are not restricted to the evaluation of specific achievement, but involve a personal judgment by an observer regarding any of the variety of traits that psychological tests attempt to measure. Thus, the subjects in the validation sample might be rated on such characteristics as dominance, mechanical ingenuity, originality, leadership, or honesty. Ratings have been employed in the validation of almost every type of test. The criterion under consideration is thus more complex and less clearly definable than those previously discussed.

formally assigned grades. provide criteria for aptitude tests in these areas. and other line personnel that such a precauential. ~mong the specific indices of training performance employed for critenon purposes may be mentioned achievement tests administered on completion of train.aphtude battenes have often been checked against grades in spec. doubtful. t. economic.since the criterion ratings become "contaminated" by the owledgeof the test scores. Among the criteria equentlyemployed in validating intelligence test~ is some index of ic ac . Performance in training programs is also commonly used as a ~ntenon for test validation in other military occupational specialties and m some industrial validation studies. in order to determine their validity as dIfferential predictors. military officers. eliminating se incapable of continuing beyond each step. and other nonintellectual factors may influence the continuation of the individual's education. selected group than elementary school graduates. ' In connection with the use of training records in general as criterion measures. social.p~rformance in music or art schools has been employed in validatmg musIC.ct to many uncontrolled . moreover. respectively. The assumption underlying this erite . even were such an ultimate criterion available. To prevent the operation of such an . Any test may be validated against as many criteria e are specific uses for it. dentistry. they may be properly . scores on a verbal comprehension test may be compared with grades in English courses spatial visualization scores with geometry grades. To what extent ~re the obtain~d differences in intelligence test scores simply the result of the yarymg amount of education? And to what extent could the test have predicted individual differences in subsequent educational progress? These questions can be answered only when the test is administered before the criterion data have matured. whether a". An outstanding illustration is the validahO~ ?f Alr Force pllot selection tests against performance in basic flight tr~m~g. Although it is undoubtly true that college graduates. represent a more highly MON CRiTERIA. and teachers' or instructors' ratings for "intelligence.intelligence tests. for example.'s absolutely essential that no person who participates in the ast of criterion ratings have any knowledge of the examinees' test or this reason. however. medicine. Although employed principally in the validation of gen. a useful distinction is that between intermediate and ultimate criteria: In the development of an Air Force pilpt-selection test or a medical aptitude test. such as stenography.ing. each grade g weighted by the number of course points for which it waJJ~ceived. promotion and graduation records. or bookkeeping. ObVIOuslyit would require a long time for such criterion data to mature: It is. the relation between amount of education and scholastic a titude is far from erfect. it would probably be subje. fall into a few common categories. s. Moreover..such persons may fail to realize that the test scores e put aside until the criterion data mature and validity can be 143 d. from the primary grades to college and .edby the individual's scholastic performance.. and other areas. and so forth. the ultimate criteria would be combat perfo~mance a~d eventual achievement as a practidng physician. motivational. This measure is the .fi~hlg~ school or college courses. l\ful~lple . aptitude tests. achieveest scores. 
A variant of the criterion of academic achievement frequently employed with out-of-school adults is the amount of education the individual completed. It is expected that in general the more intelligent individuals continue their education longer, while the less intelligent drop out of school earlier. The assumption underlying this criterion is that the educational ladder serves as a progressively selective influence, eliminating those incapable of continuing beyond each step. Although it is undoubtedly true that college graduates, for example, represent a more highly selected group than elementary school graduates, the relation between amount of education and scholastic aptitude is far from perfect. Especially at the higher educational levels, economic, social, motivational, and other nonintellectual factors may influence the continuation of the individual's education. Moreover, with such concurrent validation it is difficult to disentangle cause-and-effect relations. To what extent are the obtained differences in intelligence test scores simply the result of the varying amount of education? And to what extent could the test have predicted individual differences in subsequent educational progress? These questions can be answered only when the test is administered before the criterion data have matured, as in predictive validation.

In the development of special aptitude tests, a frequent type of criterion is based on performance in specialized training. For example, mechanical aptitude tests may be validated against final achievement in shop courses. Various business school courses, such as stenography or bookkeeping, provide criteria for aptitude tests in these areas. Similarly, performance in music or art schools has been employed in validating music or art aptitude tests. Several professional aptitude tests have been validated in terms of achievement in schools of law, medicine, dentistry, engineering, and other areas. In the case of custom-made tests designed for use within a specific testing program, training records are a frequent source of criterion data. An outstanding illustration is the validation of Air Force pilot-selection tests against performance in basic flight training. Performance in training programs is also commonly used as a criterion for test validation in other military occupational specialties and in some industrial validation studies. Among the specific indices of training performance employed for criterion purposes may be mentioned achievement tests administered on completion of training, formally assigned grades, instructors' ratings, and successful completion of training versus elimination from the program. Multiple aptitude batteries have often been checked against grades in specific high school or college courses, in order to determine their validity as differential predictors. For example, scores on a verbal comprehension test may be compared with grades in English courses, spatial visualization scores with geometry grades, and so forth.

In connection with the use of training records in general as criterion measures, a useful distinction is that between intermediate and ultimate criteria. In the development of an Air Force pilot-selection test or a medical aptitude test, for example, the ultimate criteria would be combat performance and eventual achievement as a practicing physician, respectively. Obviously it would require a long time for such criterion data to mature. It is doubtful, moreover, whether a truly ultimate criterion is ever obtained in actual practice. Even were such an ultimate criterion available, it would probably be subject to many uncontrolled factors that would render it relatively useless. For example, it would be difficult to evaluate the relative degree of success of physicians practicing different specialties and in different parts of the country. For these reasons, such intermediate criteria as performance records at some stage of training are frequently employed as criterion measures.

For many purposes, the most satisfactory type of criterion measure is that based on follow-up records of actual job performance. This criterion has been used to some extent in the validation of general intelligence as well as personality tests, and to a larger extent in the validation of special aptitude tests. It is a common criterion in the validation of custom-made tests for specific jobs. The "jobs" in question may vary widely in both level and kind, including work in business, industry, the professions, and the armed services. Most measures of job performance, although probably not representing ultimate criteria, at least provide good intermediate criteria for many testing purposes. In this respect they are to be preferred to training records. On the other hand, the measurement of job performance does not permit as much uniformity of conditions as is possible during training. Moreover, since it usually involves a longer follow-up, the criterion of job performance is likely to entail a loss in the number of available subjects. Because of the variation in the nature of nominally similar jobs in different organizations, test manuals reporting validity data against job criteria should describe not only the specific criterion measures employed but also the job duties performed by the workers.

Validation by the method of contrasted groups generally involves a composite criterion that reflects the cumulative and uncontrolled selective influences of everyday life. This criterion is ultimately based on survival within a particular group versus elimination therefrom, and it is thus more complex and less clearly definable than those previously discussed. In the validation of an intelligence test, for example, the scores obtained by institutionalized mentally retarded children may be compared with those obtained by schoolchildren of the same age. In this case, the multiplicity of factors determining commitment to an institution for the mentally retarded constitutes the criterion. Similarly, the validity of a musical aptitude or a mechanical aptitude test may be checked by comparing the scores obtained by students enrolled in a music school or an engineering school, respectively, with the scores of unselected high school or college students. To be sure, contrasted groups can be selected on the basis of any criterion, such as school grades, ratings, or job performance, by simply choosing the extremes of the distribution of criterion measures. The contrasted groups included in the present category, however, are distinct groups that have gradually become differentiated through the operation of the multiple demands of daily living.

Occupational groups have frequently been used in the development and validation of interest tests, such as the Strong Vocational Interest Blank, as well as in the preparation of attitude scales. Thus, the test performance of salesmen or executives, on the one hand, may be compared with that of clerks or engineers, on the other. The assumption underlying such a procedure is that, with reference to many social traits, individuals who have entered and remained in such occupations as selling or executive work will as a group excel persons in such fields as clerical work or engineering. Other groups sometimes employed in the validation of attitude scales include political, geographical, religious, or other special groups generally known to represent distinctly different points of view on certain issues.

The method of contrasted groups is used quite commonly in the validation of personality tests. Thus, in validating a test of social traits, the test performance of college students who have engaged in many extracurricular activities may be compared with that of students who have participated in none during a comparable period of college attendance.
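As a rough illustration of the contrasted-groups logic, the sketch below compares the score distributions of two naturally differentiated groups. The scores are fabricated, and the point-biserial correlation is offered only as one conventional index of group separation; the chapter itself does not prescribe a particular statistic.

```python
import statistics

# Fabricated test scores: music-school students versus unselected students.
music_school = [78, 85, 92, 88, 75, 81, 90, 84]
unselected   = [60, 72, 55, 68, 63, 70, 58, 66]

# Pool the scores and code group membership as 1 or 0.
scores = music_school + unselected
group  = [1] * len(music_school) + [0] * len(unselected)

# Point-biserial correlation: a Pearson r between score and the 0/1 group code.
r = statistics.correlation(scores, group)
print(f"Mean difference: {statistics.mean(music_school) - statistics.mean(unselected):.1f}")
print(f"Correlation with group membership: {r:.2f}")
```

(statistics.correlation requires Python 3.10 or later.)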
In the development of certain personality tests, psychiatric diagnosis is used both as a basis for the selection of items and as evidence of test validity. Psychiatric diagnosis may serve as a satisfactory criterion provided that it is based on prolonged observation and detailed case history, rather than on a cursory psychiatric interview or examination. In the latter case, there is no reason to expect the psychiatric diagnosis to be superior to the test score itself as an indication of the individual's emotional condition. Such a psychiatric diagnosis could not be regarded as a criterion measure, but rather as an indicator or predictor whose own validity would have to be determined.

Mention has already been made, in connection with other criterion categories, of certain types of ratings by school teachers, instructors in specialized courses, and job supervisors. To these can be added ratings by officers in military situations, ratings of students by school counselors, and ratings by co-workers, classmates, fellow club-members, and other groups of associates. The ratings discussed earlier represented merely a subsidiary technique for obtaining information regarding such criteria as academic achievement, performance in specialized training, or job success. We are now considering the use of ratings as the very core of the criterion measure. In this case, the ratings themselves define the criterion. Moreover, such ratings are not restricted to the evaluation of specific achievement, but involve a personal judgment by an observer regarding any of the variety of traits that psychological tests attempt to measure. Thus, the subjects in the validation sample might be rated on such characteristics as dominance, mechanical ingenuity, originality, or leadership.

Ratings have been employed in the validation of almost every type of test. They are particularly useful in providing criteria for personality tests, since objective criteria are much more difficult to find in this area.

Although ratings may be subject to many judgmental errors, when obtained under carefully controlled conditions they represent a valuable source of criterion data. This is especially true of distinctly social traits, for which ratings based on personal contact may constitute the most logically defensible criterion. Techniques for improving the accuracy of ratings and for reducing common types of errors will be considered in Chapter 20.

Finally, correlations between a new test and previously available tests are frequently cited as evidence of validity. When the new test is an abbreviated or simplified form of a currently available test, the latter can properly be regarded as a criterion measure. Thus, a paper-and-pencil test might be validated against a more elaborate and time-consuming performance test whose validity had previously been established. Or a group test might be validated against an individual test. The Stanford-Binet, for example, has repeatedly served as a criterion in validating group tests. In such a case, the new test may be regarded at best as a crude approximation of the earlier one. It should be noted that unless the new test represents a simpler or shorter substitute for the earlier test, the use of the latter as a criterion is indefensible.

SPECIFICITY OF CRITERIA. Criterion-related validity is most appropriate for local validation studies, in which the effectiveness of a test for a specific program is to be assessed. This is the approach followed, for example, when a given company wishes to evaluate a test for selecting applicants for one of its jobs or when a given college wishes to determine how well an academic aptitude test can predict the course performance of its students. Criterion-related validity can be best characterized as the practical validity of a test in a specified situation. This type of validation represents applied research, as distinguished from basic research, and as such it provides results that are less generalizable than the results of other procedures.

That criterion-related validity may be quite specific has been demonstrated repeatedly. Figure 15 gives examples of the wide variation in the correlations of a single type of test with criteria of job proficiency. The first graph shows the distribution of 72 correlations found between intelligence test scores and measures of the job proficiency of general clerks; the second graph summarizes in similar fashion 191 correlations between finger dexterity tests and the job proficiency of benchworkers. Although in both instances the correlations tend to cluster in a particular range, the variation among individual studies is considerable. The validity coefficient may be high and positive in one study and negligible or even substantially negative in another.

FIG. 15. Examples of Variation in Validity Coefficients of Given Tests for Particular Jobs. (Adapted from Ghiselli, 1966, p. 29.) [The figure shows the two distributions of coefficients against proficiency criteria, 72 for general clerks on intelligence tests and 191 for benchworkers on finger dexterity tests, plotted on a scale from -1.00 to +1.00.]

Some of the variation in validity coefficients against job criteria reported in Figure 15 results from differences among the specific tests employed in different studies to measure intelligence or finger dexterity. Differences in the criteria themselves are undoubtedly a major reason for the variation observed among validity coefficients: the duties of office clerks or benchworkers may differ widely in different establishments. Moreover, some variation is attributable to differences in the homogeneity and level of the groups tested. The range of validity coefficients found, however, is far wider than could be explained in these terms.

Similar variation with regard to the prediction of course grades is illustrated in Figure 16, which shows the distribution of correlations obtained between grades in mathematics and scores on each of the subtests of the Differential Aptitude Tests. Thus, for the Numerical Ability test (NA), the largest number of validity coefficients among boys fell between .50 and .59; but the correlations obtained in different mathematics courses and in different schools ranged from .22 to .75. Equally wide differences were found with the other subtests and, it might be added, with grades in other subjects not included in Figure 16.

FIG. 16. Graphic Summary of Validity Coefficients of the Differential Aptitude Tests (Forms S and T) for Course Grades in Mathematics. The numbers in each column indicate the number of coefficients in the range given at the left. (From the Fifth Edition Manual, p. 82. Reproduced by permission. Copyright © 1975 by The Psychological Corporation, New York, N.Y. All rights reserved.)

Criteria not only differ across situations; they may also vary over time in the same situation, for such reasons as the changing nature of jobs, shifts in organizational conditions, and other temporal factors. It is well known, of course, that educational curricula and course content change over time, as do the duties of many jobs. There is also evidence that the traits required for successful performance of a given job may change with the individual's advancement in rank, and that the criteria most commonly used in validating intelligence and aptitude tests, namely academic achievement and job performance, are dynamic rather than static (Ghiselli, 1966; Richards, Taylor, Price, & Jacobsen, 1965). It follows that criterion-related validity is itself subject to temporal changes.

Criteria differ not only across situations and over time; practical criteria are also likely to be multifaceted and complex. Success on a job, in school, or in other activities of daily life depends not on one trait but on many traits. Several different indicators or measures of job proficiency or academic achievement could thus be used in validating a test. When different criterion measures are obtained for the same individuals, their intercorrelations are often quite low. For instance, accident records or absenteeism may show virtually no relation to productivity or error data for the same job (Seashore, Indik, & Georgopoulos, 1960). Since these measures may tap different traits or combinations of traits, it is not surprising to find that they yield different validity coefficients for any given test; these differences are reflected in the validity coefficients of any given test against different criterion measures.

Because of criterion complexity, validating a test against a composite criterion of job proficiency, academic achievement, or other similar accomplishments may be of questionable value and is certainly of limited generality. If different subcriteria are relatively independent, a more effective procedure is to validate each test against that aspect of the criterion it is best designed to measure. For example, one test might prove to be a valid predictor of a clerk's perceptual speed and accuracy in handling detail work, another of his ability to spell correctly, and still another of his ability to resist distraction. An analysis of these more specific relationships lends meaning to the test scores in terms of the multiple dimensions of criterion behavior (Dunnette, 1963; Wallace, 1965).
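The alternative just described, validating each test against the aspect of the criterion it is best designed to measure, amounts to computing a full matrix of test-subcriterion correlations rather than a single coefficient against a composite. The sketch below illustrates this with fabricated data; the test and subcriterion names are hypothetical.

```python
import statistics

# Fabricated scores for six workers on two tests and three subcriteria.
tests = {
    "perceptual speed": [52, 61, 47, 70, 58, 65],
    "spelling":         [33, 45, 40, 38, 50, 42],
}
subcriteria = {
    "detail accuracy":       [14, 18, 12, 21, 16, 19],
    "spelling accuracy":     [70, 84, 78, 75, 90, 80],
    "resisting distraction": [7, 9, 5, 8, 11, 6],
}

# Correlate each test with each relatively independent subcriterion.
for test_name, x in tests.items():
    for crit_name, y in subcriteria.items():
        print(f"r({test_name}, {crit_name}) = {statistics.correlation(x, y):+.2f}")
```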
SYNTHETIC VALIDITY. With the specificity, complexity, and instability of criteria in mind, we return to the practical question of evaluating a test or combination of tests for effectiveness in predicting a complex criterion such as success on a given job. Validating each test against local criteria is admittedly a desirable procedure and one that is often recommended in test manuals. In many situations, however, it is not feasible to follow this procedure because of well-nigh insurmountable practical obstacles. Most criterion-related validity studies conducted in industry are likely to prove unsatisfactory for at least three reasons.

First, the number of employees engaged in the same or closely similar jobs within a company is often too small for significant statistical results. Second, correlations will very probably be lowered by restriction of range through preselection (to be discussed in Chapter 7), since only those persons actually hired can be followed up on the job. Third, it is difficult to obtain dependable and sufficiently comprehensive criterion data. Even if adequately trained personnel are available to carry out the necessary research, the validity coefficient of a test against job performance criteria often differs widely among companies or among departments in the same company (Ghiselli, 1966); similarly, courses in the same subject may differ in content, teaching method, instructor characteristics, bases for evaluating student achievement, and numerous other ways. Hence, we would be faced with the necessity of conducting a separate validation study in each local situation and repeating it at frequent intervals.

For these reasons, personnel psychologists have shown increasing interest in a technique known as synthetic validity. First introduced by Lawshe (1952), the concept of synthetic validity has been defined by Balma (1959, p. 395) as "the inferring of validity in a specific situation from a systematic analysis of job elements, a determination of test validity for these elements, and a combination of elemental validities into a whole." Essentially, the process involves three steps: (1) detailed job analysis to identify the job elements and their relative weights; (2) analysis and empirical study of each test to determine the extent to which it measures proficiency in performing each of these job elements; and (3) finding the validity of each test for the given job synthetically from the weights of these elements in the job and in the test. Several procedures have been developed for gathering the needed empirical data and for combining these data to obtain an estimate of synthetic validity for a particular complex criterion (see, e.g., Guion, 1965; Lawshe & Balma, 1959, Ch. 14; McCormick, 1959; Primoff, 1959, 1975).

One application of synthetic validity, especially suitable for use in a small company with few employees in each type of job, is described by Guion (1965). The study was carried out in a company having 48 employees, each of whom was doing a job that was appreciably different from the jobs of the other employees. Detailed job analyses nevertheless revealed seven job elements common to many jobs. Each employee was rated on the job elements appropriate to his job, and these ratings were then checked against the employees' scores on each test in a trial battery. On the basis of these analyses, a separate battery could be "synthesized" for each job by combining the two best tests for each of the job elements demanded by that job. When the batteries thus assembled were applied to a subsequently hired group of 13 employees, the results showed considerable promise. Because of the small number of cases, these results are only suggestive; the study was conducted primarily to demonstrate a model for the utilization of synthetic validity.

A different application of synthetic validity is represented by the J-coefficient (for "job-coefficient"), developed by Primoff (1975) in a long-term research program conducted with U.S. Civil Service job applicants. Among the special features of this procedure are the listing of job elements expressed in terms of worker behavior and the rating of the relative importance of these elements in each job by supervisors and job incumbents. The final estimate of correlation between test and job performance is found from the correlation of each job element with the particular job and the weight of the same element in the given test: for each job element, its correlation with the job is multiplied by its weight in the test, and these products are added across all appropriate job elements. The statistical procedures are essentially an adaptation of multiple regression equations; for a description of the actual procedures followed, the reader is referred to the original sources (Primoff, 1959, 1975).
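Read literally, the combination rule just described can be sketched in a few lines: each job element's correlation with the job is multiplied by that element's weight in the test, and the products are summed. The element names and values below are hypothetical, and the full J-coefficient procedure involves further refinements described in Primoff's original sources.

```python
# Hypothetical job elements with (correlation of element with the job,
# weight of the same element in the given test).
job_elements = {
    "following directions": (0.40, 0.55),
    "numerical checking":   (0.35, 0.30),
    "mechanical assembly":  (0.25, 0.10),
}

# Synthetic estimate of test-job correlation: sum of products across elements.
j_coefficient = sum(r_job * w_test for r_job, w_test in job_elements.values())
print(f"Synthetic validity estimate: {j_coefficient:.2f}")  # 0.35
```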
Correlations between test scores and self-ratings on job elements are obtained from different samples of applicant populations, the coefficients being found from total applicant samples not subject to the preselection of employed workers. Various checking procedures are followed to ensure stability of the correlations and of the weights derived from self-ratings. There is evidence that the J-coefficient has proved helpful in improving the employment opportunities of minority applicants and of persons with little formal education, because of its concentration on job-relevant skills (Primoff, 1975).

In summary, the concept of synthetic validity can be implemented in different ways to fit the practical exigencies of different situations; the two examples cited here serve only to illustrate the scope of possible applications of these techniques. Synthetic validity offers a promising approach to the problem of complex and changing criteria, and it permits the assembling of test batteries to fit the requirements of specific jobs and the determination of test validity in many contexts where adequate criterion-related validation studies are impracticable.

CONSTRUCT VALIDITY. The construct validity of a test is the extent to which the test may be said to measure a theoretical construct or trait. Examples of such constructs are intelligence, mechanical comprehension, verbal fluency, speed of walking, neuroticism, and anxiety. Focusing on a broader, more enduring, and more abstract kind of behavioral description than the previously discussed types of validity, construct validation requires the gradual accumulation of information from a variety of sources. Any data throwing light on the nature of the trait under consideration and the conditions affecting its development and manifestations are grist for this validity mill. Illustrations of specific techniques suitable for construct validation are considered below.

DEVELOPMENTAL CHANGES. A major criterion employed in the validation of a number of intelligence tests is age differentiation. Such tests as the Stanford-Binet and most preschool tests are checked against chronological age to determine whether the scores show a progressive increase with advancing age. Since abilities are expected to increase with age during childhood, it is argued that the test scores should likewise show such an increase, if the test is valid. The very concept of an age scale of intelligence, as initiated by Binet, is based on the assumption that "intelligence" increases with age, at least until maturity.

The criterion of age differentiation, of course, is inapplicable to any functions that do not exhibit clear-cut and consistent age changes. In the area of personality measurement, for example, it has found limited use. Moreover, it should be noted that, even when applicable, age differentiation is a necessary but not a sufficient condition for validity. Thus, if test scores fail to improve with age, such a finding probably indicates that the test is not a valid measure of the abilities it was designed to sample. On the other hand, to prove that a test measures something that increases with age does not define the area covered by the test very precisely. A measure of height or weight would also show regular age increments, although it would obviously not be designated as an intelligence test.

A final point should be emphasized regarding the interpretation of the age criterion. A psychological test validated against such a criterion measures behavior characteristics that increase with age under the conditions existing in the type of environment in which the test was standardized. Because different cultures may stimulate and foster the development of dissimilar behavior characteristics, it cannot be assumed that the criterion of age differentiation is a universal one. Like all other criteria, it is circumscribed by the particular cultural setting in which it is derived.
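A minimal sketch of an age-differentiation check follows: mean scores are computed for successive age groups, and the correlation of score with chronological age indexes the progressive increase. The sample data are invented.

```python
import statistics
from collections import defaultdict

# Invented (age, score) pairs from a standardization sample.
sample = [(6, 21), (6, 24), (7, 27), (7, 30), (8, 33), (8, 31),
          (9, 38), (9, 36), (10, 42), (10, 45)]

# Do mean scores rise with each year of age?
by_age = defaultdict(list)
for age, score in sample:
    by_age[age].append(score)
for age in sorted(by_age):
    print(f"age {age}: mean score {statistics.mean(by_age[age]):.1f}")

# Overall correlation between score and chronological age.
ages, scores = zip(*sample)
print(f"r(score, age) = {statistics.correlation(ages, scores):.2f}")
```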
Developmental analyses are also basic to the construct validation of the Piagetian ordinal scales cited in Chapter 4. A fundamental assumption of such scales is the sequential patterning of development, such that the attainment of earlier stages in concept development is prerequisite to the acquisition of later conceptual skills; there is thus an intrinsic hierarchy in the content of these scales. The construct validation of ordinal scales should therefore include empirical data on the sequential invariance of the successive steps. This involves checking the performance of children at different levels in the development of any tested concept, such as conservation or object permanence. Do children who demonstrate mastery of the concept at a given level also exhibit mastery at the lower levels? Insofar as criterion-referenced tests are also frequently designed to assess a hierarchical pattern of learned skills, they too can utilize empirical evidence of hierarchical invariance in their validation.

CORRELATIONS WITH OTHER TESTS. Correlations between a new test and similar earlier tests are sometimes cited as evidence that the new test measures approximately the same general area of behavior as other tests designated by the same name, such as "intelligence tests" or "mechanical aptitude tests." Unlike the correlations found in criterion-related validity, these correlations should be moderately high, but not too high. If the new test correlates too highly with an already available test, without such added advantages as brevity or ease of administration, then the new test represents needless duplication. It will be noted that this use of correlations with other tests is similar to one of the supplementary techniques described under content validity.

Correlations with other tests are employed in still another way, to demonstrate that the new test is relatively free from the influence of certain irrelevant factors. For example, a special aptitude test or a personality test should have a negligible correlation with tests of general intelligence or scholastic aptitude; similarly, reading comprehension should not appreciably affect performance on such tests. Thus, low correlations with tests of general intelligence or verbal comprehension are sometimes reported as indirect or negative evidence of validity. In these cases, high correlations would make the test suspect; low correlations, however, would not in themselves insure validity.

FACTOR ANALYSIS. Of particular relevance to construct validity is factor analysis, a statistical procedure for the identification of psychological traits. Essentially, factor analysis is a refined technique for analyzing the interrelationships of behavior data. For example, if 20 tests have been given to 300 persons, the first step is to compute the correlations of each test with every other. An inspection of the resulting table of 190 correlations may itself reveal certain clusters among the tests, suggesting the location of common traits. Thus, if such tests as vocabulary, analogies, opposites, and sentence completion have high correlations with each other and low correlations with all other tests, we could tentatively infer the presence of a verbal comprehension factor. Because such an inspectional analysis of a correlation table is difficult and uncertain, however, more precise statistical techniques have been developed to locate the common factors required to account for the obtained correlations. These techniques of factor analysis will be examined further in Chapter 13.

In the process of factor analysis, the number of variables or categories in terms of which each individual's performance can be described is reduced from the number of original tests to a relatively small number of factors, or common traits. In the example cited above, five or six factors might suffice to account for the intercorrelations among the 20 tests. Each individual might thus be described in terms of his scores in the five or six factors, rather than in terms of the original 20 scores. A major purpose of factor analysis is to simplify the description of behavior by reducing the number of categories from an initial multiplicity of test variables to a few common traits.

After the factors have been identified, they can be utilized in describing the factorial composition of a test. Each test can thus be characterized in terms of the major factors determining its scores, together with the weight or loading of each factor and the correlation of the test with each factor. Such a correlation is known as the factorial validity of the test. For example, if the verbal comprehension factor has a weight of .66 in a vocabulary test, the factorial validity of this vocabulary test as a measure of the trait of verbal comprehension is .66. It should be noted that factorial validity is essentially the correlation of the test with whatever is common to a group of tests or other indices of behavior. The set of variables analyzed can, of course, include both test and nontest data; ratings and other criterion measures can thus be utilized, along with other tests, to explore the factorial validity of a particular test and to define the common traits it measures.

INTERNAL CONSISTENCY. In the published descriptions of certain tests, especially in the area of personality, the statement is made that the test has been validated by the method of internal consistency. The essential characteristic of this method is that the criterion is none other than the total score on the test itself. Sometimes an adaptation of the contrasted group method is used, extreme groups being selected on the basis of the total test score. The performance of the upper criterion group on each test item is then compared with that of the lower criterion group. Items that fail to show a significantly greater proportion of "passes" in the upper than in the lower criterion group are considered invalid, and are either eliminated or revised. Correlational procedures may also be employed for this purpose. For example, the biserial correlation between "pass-fail" on each item and total test score can be computed, and only those items yielding significant item-test correlations would be retained. A test whose items were selected by this method can be said to show internal consistency, since each item differentiates in the same direction as the entire test.

Another application of the criterion of internal consistency involves the correlation of subtest scores with total score. Many intelligence tests, for instance, consist of separately administered subtests (such as vocabulary, arithmetic, picture completion, etc.) whose scores are combined in finding the total test score. In the construction of such tests, the scores on each subtest are often correlated with total score, and any subtest whose correlation with total score is too low is eliminated. The correlations of the remaining subtests with total score are then reported as evidence of the internal consistency of the entire instrument.

It is apparent that internal consistency correlations, whether based on items or subtests, are essentially measures of homogeneity. Because it helps to characterize the behavior domain or trait sampled by the test, the degree of homogeneity of a test has some relevance to its construct validity. Nevertheless, the contribution of internal consistency data to test validation is very limited; in the absence of data external to the test itself, little can be learned about what a test measures.
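As a minimal sketch of the item-total approach, the correlation between pass-fail on a single item and total score can be computed as an ordinary Pearson correlation with the item coded 0/1 (the point-biserial variant of the biserial index mentioned above). The response data are fabricated.

```python
import statistics

# Fabricated data: pass (1) or fail (0) on one item, and total test score,
# for twelve examinees.
item   = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
totals = [38, 22, 35, 40, 25, 33, 28, 41, 36, 20, 30, 26]

# Item-test correlation: items with low or nonsignificant values
# would be eliminated or revised.
r = statistics.correlation(item, totals)
print(f"item-test correlation = {r:.2f}")
```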
EFFECT OF EXPERIMENTAL VARIABLES ON TEST SCORES. A further source of data for construct validation is provided by experiments on the effect of selected variables on test scores. In checking the validity of a criterion-referenced test for use in an individualized instructional program, for example, one approach is through a comparison of pretest and posttest scores. The rationale of such a test calls for low scores on the pretest, administered before the relevant instruction, and high scores on the posttest. This relationship can also be checked for individual items in the test (Popham, 1971). Ideally, the largest proportion of examinees should fail an item on the pretest and pass it on the posttest (a minimal counting sketch of this item check appears at the end of this section). Items that are commonly failed on both tests are too difficult, and those passed on both tests too easy, for the purposes of such a test, and are either eliminated or revised. If a sizeable proportion of examinees pass an item on the pretest and fail it on the posttest, there is obviously something wrong with the item, or the instruction, or both.

A test designed to measure anxiety-proneness can be administered to subjects who are subsequently put through a situation designed to arouse anxiety, such as taking an examination under distracting and stressful conditions. The initial anxiety test scores can then be correlated with physiological and other indices of anxiety expression during and after the examination. Positive findings from such an experiment would indicate that the test scores reflect anxiety-proneness. A different hypothesis regarding an anxiety test could be evaluated by administering the test before and after an anxiety-arousing experience and seeing whether test scores rise significantly on the retest; positive findings in this case would indicate that the test scores reflect current anxiety level. Similar experiments can be designed to test any other hypothesis regarding the trait measured by a given test.
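Here is the counting sketch promised above for the pretest-posttest item check; the pass-fail records are fabricated. Each examinee's pair of responses to one item falls into one of four patterns, and the pattern frequencies carry the diagnostic information described in the text.

```python
# Fabricated responses to one item (1 = pass, 0 = fail) for ten examinees,
# before and after the relevant instruction.
pretest  = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
posttest = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(pretest, posttest))
n = len(pairs)
print(f"fail -> pass: {pairs.count((0, 1))}/{n}  (the desired pattern)")
print(f"pass -> fail: {pairs.count((1, 0))}/{n}  (item or instruction at fault)")
print(f"pass both:    {pairs.count((1, 1))}/{n}  (item may be too easy)")
print(f"fail both:    {pairs.count((0, 0))}/{n}  (item may be too difficult)")
```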

CONVERGENT AND DISCRIMINANT VALIDATION. In a thoughtful analysis of construct validation, Campbell (1960) points out that in order to demonstrate construct validity we must show not only that a test correlates highly with other variables with which it should theoretically correlate, but also that it does not correlate significantly with variables from which it should differ. In an earlier article, Campbell and Fiske (1959) described the former process as convergent validation and the latter as discriminant validation. Correlation of a mechanical aptitude test with subsequent grades in a shop course would be an example of convergent validation. For the same test, discriminant validity would be illustrated by a low and insignificant correlation with scores on a reading comprehension test, since reading ability is an irrelevant variable in a test designed to measure mechanical aptitude.

It will be recalled that the requirement of low correlation with irrelevant variables was discussed in connection with supplementary and precautionary procedures followed in content validation. Discriminant validation is also especially relevant to the validation of personality tests, in which irrelevant variables may affect scores in a variety of ways.

Campbell and Fiske (1959) proposed a systematic experimental design for the dual approach of convergent and discriminant validation, which they called the multitrait-multimethod matrix. Essentially, this procedure requires the assessment of two or more traits by two or more methods. A hypothetical example provided by Campbell and Fiske will serve to illustrate the procedure. Table 12 shows all possible correlations among the scores obtained when three traits are each measured by three methods. The three traits could represent three personality characteristics, such as (A) dominance, (B) sociability, and (C) achievement motivation. The three methods could be (1) a self-report inventory, (2) a projective technique, and (3) associates' ratings. Thus, A1 would indicate dominance scores on the self-report inventory, A2 dominance scores on the projective test, C3 associates' ratings on achievement motivation, and so forth.

TABLE 12. A Hypothetical Multitrait-Multimethod Matrix. (From Campbell & Fiske, 1959, p. 82.) [The numerical entries of the original matrix are not reproduced here.] Note: Letters A, B, C refer to traits; subscripts 1, 2, 3 to methods. Reliability coefficients (monotrait-monomethod) are the values in parentheses along the principal diagonal; validity coefficients (monotrait-heteromethod) are the three diagonal sets of boldface values. Solid triangles enclose heterotrait-monomethod correlations; broken triangles enclose heterotrait-heteromethod correlations.

The hypothetical correlations given in Table 12 include reliability coefficients (in parentheses, along the principal diagonal) and validity coefficients (in boldface, along three shorter diagonals). In these validity coefficients, the scores obtained for the same trait by different methods are correlated; each measure is thus being checked against other, independent measures of the same trait, as in the familiar validation procedure. The table also includes correlations between different traits measured by the same method (in solid triangles) and correlations between different traits measured by different methods (in broken triangles).
For satisfactory construct validity, the validity coefficients should obviously be higher than the correlations between different traits measured by different methods; they should also be higher than the correlations between different traits measured by the same method. For example, the correlation between dominance scores from a self-report inventory and dominance scores from a projective test should be higher than the correlation between dominance and sociability scores from a self-report inventory. If the latter correlation, representing common method variance, were high, it might indicate, for example, that a person's scores on this inventory are unduly affected by some irrelevant common factor, such as ability to understand the questions or desire to make oneself appear in a favorable light on all traits.

Fiske (1973) has added still another set of correlations that should be checked, namely, correlations between scores for the same trait obtained by the same method but with a different test. For example, two investigators may each prepare a self-report inventory designed to assess endurance. Yet the endurance scores obtained with the two inventories may show quite different patterns of correlations with measures of other personality traits. Under these conditions, it cannot be concluded that both inventories measure the same personality construct of endurance.
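The two comparisons that the matrix is meant to support can be expressed directly in code. The sketch below uses fabricated correlations for two traits (A, B) measured by two methods (1, 2); the values are illustrative and are not those of Table 12.

```python
# Fabricated correlations among trait-method combinations.
validity = {                  # monotrait-heteromethod (the validity diagonal)
    "A1-A2": 0.57, "B1-B2": 0.55,
}
heterotrait_monomethod = {    # different traits, same method
    "A1-B1": 0.35, "A2-B2": 0.32,
}
heterotrait_heteromethod = {  # different traits, different methods
    "A1-B2": 0.18, "A2-B1": 0.20,
}

# Discriminant requirements: every validity coefficient should exceed the
# heterotrait correlations, whether obtained by different or by the same method.
print("validity > heterotrait-heteromethod:",
      min(validity.values()) > max(heterotrait_heteromethod.values()))
print("validity > heterotrait-monomethod: ",
      min(validity.values()) > max(heterotrait_monomethod.values()))
```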

It might be noted that, within the framework of the multitrait-multimethod matrix, reliability represents agreement between two measures of the same trait obtained through maximally similar methods, such as parallel forms of the same test; validity represents agreement between two measures of the same trait obtained by maximally different methods, such as test scores and supervisor's ratings. Since similarity and difference of methods are matters of degree, reliability and validity can theoretically be regarded as falling along a single continuum. Ordinarily, however, the techniques actually employed to measure reliability and validity correspond to easily identifiable regions of this continuum.

We have considered several ways of asking, "How valid is this test?" To point up the distinctive features of the different types of validity, let us apply each in turn to a test consisting of 50 assorted arithmetic problems. Four ways in which this test might be employed, together with the type of validation procedure appropriate to each, are illustrated in Table 13.

TABLE 13
Validation of a Single Arithmetic Test for Different Purposes

Testing purpose: Achievement test in elementary school arithmetic
  Illustrative question: How much has Dick learned in the past?
  Type of validity: Content

Testing purpose: Aptitude test to predict performance in high school mathematics
  Illustrative question: How well will Jim learn in the future?
  Type of validity: Criterion-related: predictive

Testing purpose: Technique for diagnosing learning disabilities
  Illustrative question: Does Bill's performance show specific disabilities?
  Type of validity: Criterion-related: concurrent

Testing purpose: Measure of logical reasoning
  Illustrative question: How can we describe Henry's psychological functioning?
  Type of validity: Construct

This example highlights the fact that the choice of validation procedure depends on the use to be made of the test scores. If an achievement test is used to predict subsequent performance at a higher educational level, as when selecting high school students for college admission, it needs to be evaluated against the criterion of subsequent college performance rather than in terms of its content validity.
The examples given in Table 13 point up the differences among the various types of validation procedures. Further consideration of these procedures, however, shows that content, criterion-related, and construct validity do not correspond to distinct or logically coordinate categories. On the contrary, construct validity is a comprehensive concept, which includes the other types. All the specific techniques for establishing content and criterion-related validity, discussed in earlier sections of this chapter, could have been listed again under construct validity. Comparing the test performance of contrasted groups, such as neurotics and normals, is one way of checking the construct validity of a test designed to measure emotional adjustment, anxiety, or other postulated traits. Comparing the test scores of institutionalized mental retardates with those of normal schoolchildren is one way to investigate the construct validity of an intelligence test. The correlations of a mechanical aptitude test with performance in shop courses and in a wide variety of jobs contribute to our understanding of the construct measured by the test. Validity against various practical criteria is commonly reported in test manuals to aid the potential user in understanding what a test measures. Although the test user may not be directly concerned with the prediction of any of the specific criteria employed, by examining such criteria he is able to build up a concept of the behavior domain sampled by the test.

Content validity likewise enters into both the construction and the subsequent evaluation of all tests. In assembling items for any new test, the test constructor is guided by hypotheses regarding the relations between the type of content he chooses and the behavior he wishes to measure. All the techniques of criterion-related validation, as well as the other techniques discussed under construct validation, represent ways of testing such hypotheses. As for the test user, he too relies in part on content validity in evaluating any test. For example, he may check the vocabulary in an emotional adjustment inventory to determine whether some of the words are too difficult for the persons he plans to test; he may conclude that the scores on a particular test depend too much on speed for his purposes; or he may notice that an intelligence test developed twenty years ago contains many obsolescent items unsuitable for use today. All these observations about content are relevant to the construct validity of a test. In fact, there is no information provided by any validation procedure that is not relevant to construct validity.

The term construct validity was officially introduced into the psychometrist's lexicon in 1954 in the Technical Recommendations for Psychological Tests and Diagnostic Techniques, which constituted the first edition of the current APA test Standards (1974).

Although the validation procedures subsumed under construct validity were not new at the time, the discussions of construct validation that followed served to make the implications of these procedures more explicit and to provide a systematic rationale for their use. Construct validation has focused attention on the role of psychological theory in test construction and on the need to formulate hypotheses that can be proved or disproved in the validation process. It is particularly appropriate in the evaluation of tests for use in research. Construct validation has also stimulated the search for novel ways of gathering validity data; although the principal techniques employed in investigating construct validity have long been familiar, the field of operation has been expanded to admit a wider variety of procedures.

This very multiplicity of data-gathering techniques, however, presents certain hazards. It is possible for a test constructor to try a large number of different validation procedures, a few of which will yield positive results by chance. If these confirmatory results were then reported without mention of all the validity probes that yielded negative results, a very misleading impression about the validity of the test could be created. Another possible danger in the application of construct validation is that it may open the way for subjective, unverified assertions about test validity. Since construct validity is such a broad and loosely defined concept, some textbook writers and test constructors seem to perceive it as content validity expressed in terms of psychological trait names. Hence, they present as construct validity purely subjective accounts of what they believe (or hope) the test measures.

A further source of possible confusion arises from a statement in the first detailed published analysis of construct validity, to the effect that construct validation "is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not 'operationally defined'" (Cronbach & Meehl, 1955, p. 282). Taken out of context, this statement was often incorrectly accepted as justifying a claim for construct validity in the absence of data. The statement did not intend such an interpretation, as the same article makes clear: "unless the network makes contact with observations," the authors insist, construct validation cannot be claimed; otherwise "a fine-spun network of rationalizations has been offered as if it were validation" (p. 291). Under any circumstances, it is only through the empirical investigation of the relationships of test scores to other external data that we can discover what a test measures.
Actually, the theoretical trait or domain measured by a particular test can be adequately defined only in the light of data gathered in the process of validating that test. Such a definition would take into account the variables with which the test correlated significantly, as well as the conditions found to affect its scores and the groups that differ significantly in such scores.

In practical contexts, construct validation is also suitable for investigating the validity of the criterion measures used in traditional criterion-related test validation (see, e.g., James, 1973). Through an analysis of the correlations of different criterion measures with each other and with other relevant variables, and through factorial analyses of such data, one can learn more about the meaning of a particular criterion. The results of such a study may lead to modification or replacement of the criterion chosen to validate a test; if so, the results will enrich the interpretation of the test validation study.

Another practical application of construct validation is in the evaluation of tests in situations that do not permit acceptable criterion-related validation studies, as in the local validation of some personnel tests for industrial use. The difficulties encountered in these situations were discussed earlier in this chapter, in connection with synthetic validity. Construct validation offers another alternative approach that could be followed in evaluating the appropriateness of published tests for a particular job. Like synthetic validation, this approach requires a systematic job analysis, followed by a description of worker qualifications expressed in terms of relevant behavioral constructs. If the test has been subjected to sufficient research prior to publication, the data cited in the manual should permit a specification of the principal constructs measured by the test. This information could be used directly in assessing the relevance of the test to the required job functions; or, if the correspondence of constructs is clear enough, it could serve as a basis for computing a J-coefficient or some other quantitative index of synthetic validity.

CHAPTER 7

Validity: Measurement and Interpretation

Chapter 6 was concerned with different concepts of validity and their appropriateness for various testing functions. This chapter deals with quantitative expressions of validity and their interpretation. The test user is concerned with validity at either or both of two stages. First, when considering the suitability of a test for his purposes, he examines the available validity data reported in the test manual or other published sources. Through such information, he arrives at a tentative concept of what psychological functions the test actually measures, and he judges the relevance of such functions to his proposed use of the test. In effect, when a test user relies on published validation data, he is dealing with construct validity. Although published data may strongly suggest that a given test should have high validity in a particular situation, direct corroboration is always desirable. The determination of validity against specific local criteria represents the second stage in the test user's evaluation of validity. Test users are usually advised to check the validity of any chosen test against local criteria whenever possible. Regardless of the specific procedures used in gathering the data, the criteria employed in published studies cannot be assumed to be identical with those the test user wants to predict. Jobs bearing the same title in two different companies are rarely identical, and two courses in freshman English taught in different colleges may be quite dissimilar. The techniques to be discussed in this chapter are especially relevant to the analysis of validity data obtained by the test user himself. Most of them are also useful, however, in understanding and interpreting the validity data reported in test manuals.

MEASUREMENT OF RELATIONSHIP. A validity coefficient is a correlation between test score and criterion measure. Because it provides a single numerical index of test validity, it is commonly used in test manuals to report the validity of a test against each criterion for which data are available. When both test and criterion variables are continuous, the familiar Pearson Product-Moment Correlation Coefficient is applicable. Other types of correlation coefficients can be computed when the data are expressed in different forms, as when a two-fold pass-fail criterion is employed (e.g., Table 6, Ch. 4, p. 101). The specific procedures for computing these different kinds of correlations can be found in any standard statistics text.

The data used in computing any validity coefficient can also be expressed in the form of an expectancy table or expectancy chart, illustrated in Chapter 4. It will be recalled that expectancy charts give the probability that an individual who obtains a certain score on the test will attain a specified level of criterion performance. For example, if we know a student's score on the DAT Verbal Reasoning test, we can look up the chances that he will earn a particular grade in a high school course (see Fig. 7, Ch. 4). Such tables and charts provide a convenient way to show what a validity coefficient means for the person tested.
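To show how the same validation data feed both summaries, the sketch below computes a Pearson validity coefficient and then tabulates a crude expectancy table from fabricated test-criterion pairs; the score bands and the grade cutoff are arbitrary choices.

```python
import statistics

# Fabricated (test score, course grade points) pairs for twelve students.
data = [(22, 1.0), (28, 1.7), (31, 2.0), (35, 2.3), (38, 2.0), (41, 2.7),
        (44, 3.0), (47, 2.7), (50, 3.3), (53, 3.0), (57, 3.7), (60, 4.0)]
scores, grades = zip(*data)

# The validity coefficient is simply the Pearson r between test and criterion.
print(f"validity coefficient r = {statistics.correlation(scores, grades):.2f}")

# A crude expectancy table: chances of earning a grade of B (3.0) or better
# within each band of test scores.
for lo, hi in [(20, 34), (35, 49), (50, 64)]:
    band = [g for s, g in data if lo <= s <= hi]
    p = sum(1 for g in band if g >= 3.0) / len(band)
    print(f"scores {lo}-{hi}: {p:.0%} chance of B or better (n={len(band)})")
```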
CONDITIONS AFFECTING VALIDITY COEFFICIENTS. As in the case of reliability, it is essential to specify the nature of the group on which a validity coefficient is found. The same test may measure different functions when given to individuals who differ in age, sex, educational level, occupation, or any other relevant characteristic. Persons with different experiential backgrounds, for example, may utilize different work methods to solve the same test problem. Consequently, a test could have high validity in predicting a particular criterion in one population, and little or no validity in another; or it might be a valid measure of different functions in the two populations. Thus, unless the validation sample is representative of the population on which the test is to be used, validity should be redetermined on a more appropriate sample.

The question of sample heterogeneity is relevant to the measurement of validity, as it is to the measurement of reliability, since both characteristics are commonly reported in terms of correlation coefficients. It will be recalled that, other things being equal, the wider the range of scores, the higher will be the correlation. This fact should be kept in mind when interpreting the validity coefficients given in test manuals.

A special difficulty encountered in many validation samples arises from preselection. For example, a new test that is being validated for job selection may be administered to a group of newly hired employees on whom criterion measures of job performance will eventually be available. It is likely, however, that such employees represent a superior selection of all those who applied for the job. Hence, the range of such a group in both test scores and criterion measures will be curtailed at the lower end of the distribution. The effect of such preselection will therefore be to lower the validity coefficient. In the subsequent use of the test, when it is administered to all applicants for selection purposes, the validity can be expected to be somewhat higher.

Validity coefficients may also change over time because of changing selection standards. An example is provided by a comparison of validity coefficients computed over a 30-year interval with Yale students (Burnham, 1965). Correlations were found between a predictive index based on College Entrance Examination Board tests and high school records, on the one hand, and average freshman grades, on the other. This correlation dropped from .71 to .52 over the 30 years. An examination of the bivariate distributions clearly reveals the reason for this drop. Because of higher admission standards, the later class was more homogeneous than the earlier class in both predictor and criterion performance. Consequently, the correlation was lower in the later group, although the accuracy with which individuals' grades were predicted showed little change. In other words, the observed drop in correlation did not indicate that the predictors were less valid than they had been 30 years earlier. Had the differences in group homogeneity been ignored, it might have been wrongly concluded that this was the case.
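The preselection effect can be demonstrated directly: compute the correlation first in the full applicant group, then only among the high scorers who would have been hired. The data are fabricated, but the drop in the coefficient for the curtailed group is the phenomenon described above.

```python
import statistics

# Fabricated (test score, criterion measure) pairs for sixteen applicants.
applicants = [(30, 40), (35, 38), (40, 50), (45, 46), (50, 55), (55, 52),
              (60, 58), (65, 63), (70, 60), (75, 72), (80, 68), (85, 78),
              (90, 74), (95, 83), (100, 80), (105, 88)]

full_r = statistics.correlation(*zip(*applicants))

# Preselection: suppose criterion data exist only for applicants scoring 70+.
hired = [(s, c) for s, c in applicants if s >= 70]
hired_r = statistics.correlation(*zip(*hired))

print(f"r in full applicant sample: {full_r:.2f}")
print(f"r in preselected group:     {hired_r:.2f}  (lowered by curtailed range)")
```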
when it is admindster¢d to all applicants for selection purposes.m a true correlation of zero. mong th Ig -scor' e t d . It wl"ll be recalled pected in an individual's n Icates the margin.. (Fisher. This correlation dropped from .sumes that the relationship is linear and uniform throughout the range. the validity can be expected to be somewhllt higher. There is evidence I situations. since the of concomitant circumsta. s ISCusse in Cha t 5 I h ' drawing any conclusions about th • lid' per . Correlations were found between a predictive index based .tionmay be admini$tered to a group of newly hired employees on whom .11 to . Consequently. the eITor of esti=~: :~ a res~t of th~ unreliability of the t~t. good indication of the e ween test and 't' E and expectancy charts also I cn erIOn. h~ how-scoring students will perform • . am?ng the low-scoring stuscedasticih. of course.:the earlier class in both predictor and criterion performance. a answer to thIS gr' . we need to e correlat1~n between test Scores and f v~luate the SIZeo~ the correlation in ~he light of the uses to be m d vidual's exact criterion s~ e 0 ~le test.have arisen throug~ chance fluctuatip. ~ut u~:se that 'performa~c::e on a scholastic achievement in a course Th t' h a tufficlent condItion for successful poorly in the cOU"se' bl!lt'a' a IS t.t may be interpreted in terms measurement discussed in : whl7h IS analogous to the . coe clent ~ust take into account a number be high enough to be sta~~~·' o~tamed correlation. some WIll per:erf~rm poorly because of low motivation. not er words.criterJIon meaSures of job performance will eventua11y be available. xpectancy tables the test at different levels. Once this minimum is e:. Thus. however. This condition in ~g ~sco~g t~an. In other words. WI e WIder variability of criterion dents. it might have " been 'Wrongly concluded that this was the case.ula: ". If we WIsh to predict an indias:: grade-point average a student will receive in college the of the standard erro.rotto be expe~tec:l m the individual's predicted validity of the test. a particular job may " require a minimum level of reading comprehension.ted over a 3D-year interval with Yale students (Burn"ham. d th mg s u ents. 1965).. An examination of the bivariate distributjon or scat. although the act curacy with whkh individuals' grades were predicted showed little ch~nge. and the like. should such as the 01 or 05 level' d~8.tceeded.tls of sam Ii Havmg establjshed a signiflcant p ng fro.

in which r_xy² is the square of the validity coefficient and σ_y is the standard deviation of the criterion scores. It will be noted that if the validity were perfect (r_xy = 1.00), the error of estimate would be zero. On the other hand, with a test having zero validity, the error of estimate is as large as the standard deviation of the criterion distribution (σ_est = σ_y √(1 − 0) = σ_y). Under these conditions, the prediction is no better than a guess, and the range of prediction error is as wide as the entire distribution of criterion scores. Between these two extremes are to be found the errors of estimate corresponding to tests of varying validity.

Reference to the formula for σ_est will show that the term √(1 − r_xy²) serves to indicate the size of the error relative to the error that would result from a mere guess, that is, from prediction with zero validity. In other words, if √(1 − r_xy²) is equal to 1.00, the error of estimate is as large as it would be if we were to guess the subject's score, and the predictive improvement attributable to the use of the test would be nil. If the validity coefficient is .80, then √(1 − r_xy²) equals .60, and the error is 60 percent as large as it would be by chance. To put it differently, the use of such a test enables us to predict the individual's criterion performance with a margin of error that is 40 percent smaller than it would be if we were to guess.

It would thus appear that even with a validity of .80, which is unusually high, the error of predicted scores is considerable. If the primary function of psychological tests were to predict each individual's exact position in the criterion distribution, the outlook would be quite discouraging. When examined in the light of the error of estimate, most tests do not appear very efficient. In most testing situations, however, it is not necessary to predict the specific criterion performance of individual cases, but rather to determine which individuals will exceed a certain minimum standard of performance, or cutoff point, in the criterion. What are the chances that Mary Greene will graduate from medical school, that Tom Higgins will pass a course in calculus, or that Beverly Bruce will succeed as an astronaut? Which applicants are likely to be satisfactory clerks, salesmen, or machine operators? Such information is useful not only for group selection but also for individual career planning. For example, it is advantageous for a student to know that he has a good chance of passing all courses in law school, even if we are unable to estimate with certainty whether his grade average will be 74 or 81.
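The interpretation just described is easily verified numerically. The following sketch (in Python; the function name and the sample validities are supplied here for illustration and do not come from the text) computes the error of estimate and the relative error term for tests of several validities:

    import math

    def error_of_estimate(validity, sd_criterion):
        """Standard error of estimate: SD_y * sqrt(1 - r^2)."""
        return sd_criterion * math.sqrt(1.0 - validity ** 2)

    # The term sqrt(1 - r^2) expresses the prediction error as a
    # proportion of the error made by guessing (zero validity).
    for r in (0.00, 0.20, 0.50, 0.80, 1.00):
        relative = math.sqrt(1.0 - r ** 2)
        print(f"validity {r:.2f}: error is {relative:.2f} of the chance error")
    # With a validity of .80 the relative error is .60, i.e., a margin
    # of error 40 percent smaller than a guess, as noted in the text.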
A test may appreciably improve predictive efficiency if it shows any significant correlation with the criterion, however low. Under certain circumstances, even validities as low as .20 or .30 may justify inclusion of the test in a selection program. For many testing purposes, evaluation of tests in terms of the error of estimate is unrealistically stringent. Consideration must be given to other ways of evaluating the contribution of a test, which take into account the types of decisions to be made from the scores. Some of these procedures will be illustrated in the following section.

BASIC APPROACH. Let us suppose that 100 applicants have been given an aptitude test and followed up until each could be evaluated for success on a certain job. Figure 17 shows the bivariate distribution of test scores and measures of job success for the 100 subjects. The correlation between these two variables is slightly below .70, which is unusually high. The minimum acceptable job performance, or criterion cutoff point, is indicated in the diagram by a heavy horizontal line. The 40 cases falling below this line would represent job failures; the 60 cases above the line, job successes. If all 100 applicants are hired, therefore, 60 percent will succeed on the job. Similarly, if a smaller number were hired at random, without reference to test scores, the proportion of successes would probably be close to 60 percent. Suppose, however, that the test scores are used to select the 45 most promising applicants out of the 100 (selection ratio = .45). In such a case, the 45 individuals falling to the right of the heavy vertical line would be chosen. Within this group of 45, there are 7 job failures and 38 job successes. Hence, the percentage of job successes is now 84 rather than 60 (i.e., 38/45 = .84). This increase is attributable to the use of the test as a screening instrument. It will be noted that errors in predicted criterion score that do not affect the decision can be ignored. Only those prediction errors that cross the cutoff line, and hence place the individual in the wrong category, will reduce the selective effectiveness of the test.

For a complete evaluation of the effectiveness of the test as a screening instrument, another category of cases in Figure 17 must also be examined. This is the category of false rejects, comprising the 22 persons who score below the cutoff point on the test but fall above the criterion cutoff. From these data we would estimate that 22 percent of the total applicant sample are potential job successes who will be lost if the test is used as a screening device with the present cutoff point. These false rejects in a personnel selection situation correspond to the false positives in clinical evaluations. The latter term has been adopted from medical practice, in which a test for a pathological condition is reported as positive if the condition is present and negative if the patient is normal. A false positive thus refers to a case in which the test erroneously indicates the presence of a pathological condition, as when brain damage is indicated in an individual who is actually normal. This terminology is likely to be confusing unless we remember that in clinical practice a positive result on a test denotes pathology and an unfavorable diagnosis, whereas in personnel selection a positive result conventionally refers to a favorable prediction regarding job performance, academic achievement, and the like.
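The percentages cited in this section can be recomputed directly from the four cells of Figure 17. A minimal sketch follows; the counts are those given in the text, and the 33 valid rejections are obtained by subtraction:

    def screening_summary(valid_accepts, false_accepts, false_rejects, valid_rejects):
        total = valid_accepts + false_accepts + false_rejects + valid_rejects
        selected = valid_accepts + false_accepts
        successes = valid_accepts + false_rejects  # all who meet the criterion
        return {
            "base rate": successes / total,
            "selection ratio": selected / total,
            "successes among selected": valid_accepts / selected,
            "false rejects (share of total)": false_rejects / total,
        }

    # Figure 17: 38 valid acceptances, 7 false acceptances, 22 false
    # rejects, and hence 100 - 45 - 22 = 33 valid rejections.
    print(screening_summary(38, 7, 22, 33))
    # base rate .60, selection ratio .45, successes among selected .84,
    # false rejects .22, matching the figures discussed above.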

[FIG. 17. Increase in the Proportion of "Successes" Resulting from the Use of a Selection Test. The scatter diagram plots test score (horizontal axis) against criterion performance (vertical axis); a heavy horizontal line marks the criterion cutoff separating job successes from job failures, and a heavy vertical line marks the cutoff score on the test.]

In the terminology of decision theory, the example given in Figure 17 illustrates a simple strategy, or plan for deciding which applicants to accept and which to reject. In more general terms, a strategy is a technique for utilizing information in order to reach a decision about individuals. In this case, the strategy was to accept the 45 persons with the highest test scores. The increase in the percentage of successful employees from 60 to 84 could be used as a basis for estimating the net benefit resulting from the use of the test.

Statistical decision theory was developed by Wald (1950) with special reference to the decisions required in the inspection and quality control of industrial products. Many of its implications for the construction and interpretation of psychological tests have been systematically worked out by Cronbach and Gleser (1965). Essentially, decision theory is an attempt to put the decision-making process into mathematical form, so that available information may be used to arrive at the most effective decision under specified circumstances. The mathematical procedures employed in decision theory are often quite complex, and few are in a form permitting their immediate application to practical testing problems. Some of the basic concepts of decision theory, however, are proving helpful in the reformulation and clarification of certain questions about tests. A few of these ideas were introduced into testing before the formal development of statistical decision theory and were later recognized as fitting into that framework.

In setting a cutoff score on a test, attention should be given to the percentage of false rejects (or false positives), as well as to the percentages of successes and failures within the selected group. In certain situations, the cutoff score should be set sufficiently high to exclude all but a few possible failures. This would be the case when the job is of such a nature that a poorly qualified worker could cause serious loss or damage; an example would be a commercial airline pilot. Under other circumstances, it may be more important to admit as many qualified persons as possible, at the risk of including more failures. In the latter case, the number of false rejects can be reduced by the choice of a lower cutoff score. Other factors that normally determine the position of the cutoff score include the available personnel supply, the number of job openings, and the urgency or speed with which the openings must be filled.

In many personnel decisions, the selection ratio is determined by the practical demands of the situation. Because of supply and demand in filling job openings, it may be necessary to hire the top 40 percent of applicants in one case and the top 75 percent in another. When the selection ratio is not externally imposed, the cutting score on a test can be set at that point giving the maximum differentiation between the two criterion groups. This can be done roughly by comparing the distributions of test scores in the two criterion groups; one way of mechanizing the rough procedure is sketched below. More precise mathematical procedures for setting optimal cutting scores have also been worked out (Darlington & Stauffer, 1966; Guttman & Raju, 1965; Hoffman, 1965; La Forge, 1965; Rorer, Hoffman, & Hsieh, 1966). These procedures make it possible to take into account other relevant parameters, such as the relative seriousness of false rejections and false acceptances.
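The sketch below illustrates the general idea only; it is not any of the cited authors' procedures, and the scores are invented. It tries every candidate cutoff and keeps the one with the smallest weighted misclassification cost, so that the relative seriousness of false rejections and false acceptances can be reflected in the two cost terms:

    def best_cutoff(success_scores, failure_scores,
                    cost_false_reject=1.0, cost_false_accept=1.0):
        """Cutoff minimizing weighted misclassifications; applicants
        scoring at or above the cutoff are accepted."""
        best_c, best_cost = None, float("inf")
        for c in sorted(set(success_scores) | set(failure_scores)):
            false_rejects = sum(s < c for s in success_scores)
            false_accepts = sum(s >= c for s in failure_scores)
            cost = (cost_false_reject * false_rejects
                    + cost_false_accept * false_accepts)
            if cost < best_cost:
                best_c, best_cost = c, cost
        return best_c, best_cost

    # Hypothetical test scores of known job successes and failures:
    successes = [52, 60, 63, 67, 71, 74, 80]
    failures = [40, 45, 50, 55, 58, 62]
    print(best_cutoff(successes, failures))
    # Raising cost_false_accept (e.g., for airline pilots) pushes the
    # cutoff higher; raising cost_false_reject lowers it.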

".70 . for Base Rate .63 .60 .75 .63 .aie or ercenta e of successful applicants selected pnor to the use of he test 1s 60. Obviously if the selection ratio w. but only 11 and 9 when the b.95 .Validity: Measurement and Interpretation o Principles of Psychological Testing 171 cient of the test.62 .68 .65 .90 .61 .40 . the contribution of the test is not evaluated against chance success unless 'applicants were preViously selected by chance-a most unlikely circumstance.60 .:.~~.63).80 .60. (From Taylor and Russell. when only 5 percent of applicants need to be chosen..00 1.63 . of course.62 .63.63 .63 1.73 . Let us consider an example in which test validity is .61 .69 .65 . extreme.00 1.ase rates were more .70 .63 selected after the use of the test.00 1.6J . the.76 ..61 .00 1.72 .65 . a test with a validity coefficient of only .00 .65 .95 1. letters of recommendation.75 . The incremental validity resul~~~ from the use of a test depends not only on the selection ratio but l\~'()ll the base rate.64 .82 .86 . when as many as 95 percent of applicants must be admitted.99 1.62 . It indicates the contribution the test makes to the selection of individuals who will meet the minimum standards in criterion performance.91 .64 .00 .62 .ere 100 percent. Ot~er tables are prOVided by Taylor and Russe~l for ~t~~r base ra~es Across the top of the table are given different va ues ~ .71 . Thus.71 . 1939).00 1.63 .75 .00 1. 1959. where the base rate refe~ to' the frequency of the patholOgical condition to be diagnosed in the.:~!M.atio.90 .99 1. we need to consult the other appropriate tables in the cited reference (Taylor & Russell.63 .66 .interest in clinical psychology.97 1.77 .00 .75 . If applicants had been sele<.67 . .~~~ _ i ( lven . In applying the Taylor-Russell tables.JI.65 .00 1..78 . the proportion of applicants who m~~t be acclep~e~ lection ratio).99 1.94 .00 1.62.90 .00 .99 1. and along the side are the tes~ validities. the improvement in percentage of successful employees attributable tQ . This table is designed for us~ when the base .00) would raise the proportion of successful persons by only 3 percent (.92 . 576) ~':~~.30 .81 .:teq on the basis of previous job history. On the other hand.00 .61 . Table 14 shows the anticipated outcomes when the base rate is .96 .7"'J2'-':UliH~~'. For urposes of illustration.80 .67 .86 .88 . that is.e selection .63 . even a test with perfect validity ( r = 1.63 .60.93 .:::·.63 . p.66 .00 1.73 .64 .63 .61 .60 to ..69 .70 .83 . For other base rates.61 .80 .60 .84 .86 .95 .00 . The rise from 60 to 82 represents the incremental vaUdity of the test (Sechrest. howen'r valid.73 .99 1. if all applicants had to be accepted. Thus. could improve the selection process.98 . The implications of extreme base rates are of specia~. A change many 0 t I" ctors can alter the predictive efficiency of the test..97 .72 . Under these conditions.67 .75 . from 10 to 21 in the second.88 . test validity should be computed on the same sort of group used to estimate percentage of prior successes.64 .60 and anyone table entry shows the increase in proportion of successful selections attributable to the test. or the increase in predictive validity attributable to the test.66 .64 . the difference between . In other words.60 .66 .60 .63 . one of the Taylor-Russell tables has been e rod~eed in Table 14.95 .71 . .40 and the selection ratio is 70 percent.63 . the base rale refers to the proportion of successful employees prior to the introduction of the test for selection purposes.85 . 
1963).66 .62 .60 ·60 . The entnes 111 the' body of the table indicate the proportion of successful· persons 14 f G' 0 TABLE Proportionof "Successes" Expected through the Use 0 Test Validityand Given Selection Ratio.30 can raise the percentage of successful applicants selected from 60 to 82.61 .62 .70 .67 .:>.97 . and interviews.69 .92 .62 .':5.64 . Reference to Table 14 sho\\'s that.the use of the test is 25 when the base rate was 50.78 .66 . and the proportion of successfu~ app lc~n~ :: ~~r:e thout the use of the test (base rate).00 1. . what would be the contribution or incremental validity of the test if we begin with a base rate of 50 percent? And what would be the contribution if we begin with more extreme base rates of 10 and 90 percent? Reference to the appropriate Taylor-Russell tables for these base rates shows that the percentage of successful employees would rise from 50 to 75 in the Hrst case.68 .99 .50 .62 .66 .67 . no test.00 .qpulation tested (Buchwald.74 .74 .60 .80 .r •• . and from 9 to 99 in the third.basis at what the test adds to these previous selection procedures. the contribution of the test should be evaluated ODe.61 ..00 .68 .61 . In the previously illustrated job selection situation.60 . p.62 . =-" Selection Ratio .66 .
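Entries like those of Table 14 can be closely approximated, since the Taylor-Russell tables assume a bivariate normal relation between test and criterion. A sketch in Python with scipy follows; it is an approximation offered for illustration, not a substitute for the published tables:

    from scipy.stats import multivariate_normal, norm

    def taylor_russell(validity, selection_ratio, base_rate):
        """Expected proportion of successes among those selected,
        assuming bivariate normal test and criterion scores."""
        x_cut = norm.ppf(1.0 - selection_ratio)  # test cutoff, z units
        y_cut = norm.ppf(1.0 - base_rate)        # criterion cutoff, z units
        joint = multivariate_normal(mean=[0.0, 0.0],
                                    cov=[[1.0, validity], [validity, 1.0]])
        # P(test > x_cut and criterion > y_cut):
        p_both = (1.0 - norm.cdf(x_cut) - norm.cdf(y_cut)
                  + joint.cdf([x_cut, y_cut]))
        return p_both / selection_ratio

    print(round(taylor_russell(0.30, 0.05, 0.60), 2))  # about .82, as cited
    print(round(taylor_russell(0.00, 0.70, 0.60), 2))  # .60: a worthless
                                                       # test returns the
                                                       # base rate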

For example, if 5 percent of the intake population of a clinic has organic brain damage, then 5 percent is the base rate of brain damage in this population. Although the introduction of any valid test will improve predictive or diagnostic accuracy, the improvement is greatest when the base rates are closest to 50 percent. With the extreme base rates found with rare pathological conditions, the improvement may be negligible, and the use of a test may prove to be unjustified when the cost of its administration and scoring is taken into account. In a clinical situation, this cost would include the time of professional personnel that might otherwise be spent on the treatment of additional cases (Buchwald, 1965). The number of false positives, or normal individuals incorrectly classified as pathological, would of course add to this overall cost.

When the seriousness of a rare condition makes its diagnosis urgent, however, tests of moderate validity may be employed in an early stage of sequential decisions. For example, all cases might first be screened with an easily administered test of moderate validity. If the cutoff score is set high enough (high scores being favorable), there will be few false negatives but many false positives, or normals diagnosed as pathological. The latter can then be detected through a more intensive individual examination given to all cases classified as positive by the screening test. This solution is appropriate when available facilities make the intensive individual examination of all cases impracticable (J. S. Wiggins, 1973).
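The handicap imposed by an extreme base rate is easily made concrete. In the sketch below, the hit rate and false-alarm rate describe a hypothetical screening test, not any actual instrument; with a base rate of 5 percent, most positive results turn out to be false positives:

    def positives(base_rate, hit_rate, false_alarm_rate):
        """Split a population's positive test results into true and false.
        hit_rate: share of affected cases flagged (few false negatives
        when this is high); false_alarm_rate: share of normals flagged."""
        true_pos = base_rate * hit_rate
        false_pos = (1.0 - base_rate) * false_alarm_rate
        return true_pos, false_pos

    tp, fp = positives(base_rate=0.05, hit_rate=0.90, false_alarm_rate=0.20)
    print(f"true positives:  {tp:.3f} of those tested")   # 0.045
    print(f"false positives: {fp:.3f} of those tested")   # 0.190
    print(f"share of positives that are false: {fp / (tp + fp):.2f}")  # 0.81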

RELATION OF VALIDITY TO MEAN OUTPUT LEVEL. In many practical situations, what is wanted is an estimate of the effect of the selection test, not on the percentage of persons exceeding the minimum performance, but on the overall output of the selected persons. How does the actual level of job proficiency or criterion achievement of the workers hired on the basis of the test compare with that of the total applicant sample that would have been hired without the test? Following the work of Taylor and Russell, several investigators addressed themselves to this question (Brogden, 1946; Brown & Ghiselli, 1953; Jarrett, 1948; Richardson, 1944). Brogden (1946) first demonstrated that the expected increase in output is directly proportional to the validity of the test: doubling the validity doubles the output rise.

The relation between test validity and expected rise in criterion achievement can be readily seen in Table 15.¹ Expressing criterion scores as standard scores with a mean of zero and an SD of 1.00, this table gives the expected mean criterion score of workers selected with a test of given validity and with a given selection ratio. The base output, corresponding to the performance of applicants selected without use of the test, is given in the column for zero validity.

[TABLE 15. Mean standard criterion score of accepted cases, for tests of varying validity and varying selection ratios. Each row corresponds to a selection ratio and each column to a validity coefficient; the column for zero validity gives the base output of an untested sample.]

To illustrate the use of the table, let us assume that the highest scoring 20 percent of the applicants are hired (selection ratio = .20) by means of a test whose validity coefficient is .50. Reference to Table 15 shows that the mean criterion performance of this group is .70 SD above the expected base mean of an untested sample. With the same 20 percent selection ratio and a perfect test (validity coefficient 1.00), the mean criterion score of the accepted applicants would be 1.40, just twice what it would be with the test of validity .50. Similar direct linear relations will be found if other mean criterion performances are compared within any row of Table 15: with a selection ratio of 60 percent, for example, a validity of .25 yields a mean criterion score of .16, while a validity of .50 yields a mean of .32.

¹ A table including more values for both selection ratios and validity coefficients was prepared by Naylor and Shine (1965).
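Entries of this kind follow from Brogden's relation: with scores in standard form and normally distributed, the expected mean criterion score of the selected group is the validity coefficient multiplied by the mean standard test score of that group. A sketch in Python with scipy, assuming normal test-score distributions:

    from scipy.stats import norm

    def mean_criterion_of_selected(validity, selection_ratio):
        """Expected mean standard criterion score of those above the
        test cutoff (Brogden, 1946), assuming normal scores."""
        cutoff = norm.ppf(1.0 - selection_ratio)
        mean_test_of_selected = norm.pdf(cutoff) / selection_ratio
        return validity * mean_test_of_selected

    print(round(mean_criterion_of_selected(0.50, 0.20), 2))  # 0.70
    print(round(mean_criterion_of_selected(1.00, 0.20), 2))  # 1.40, twice .70
    print(round(mean_criterion_of_selected(0.25, 0.60), 2))  # 0.16
    print(round(mean_criterion_of_selected(0.50, 0.60), 2))  # 0.32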
The evaluation of test validity in terms of either mean predicted output or the proportion of persons exceeding a minimum criterion cutoff is obviously much more favorable than an evaluation based on the previously discussed error of estimate. The reason for the difference is that prediction errors which do not affect decisions are irrelevant to the selection situation. For example, if Smith and Jones are both superior workers and are both hired on the basis of the test, it does not matter if the test shows Smith to be better than Jones while in job performance Jones excels Smith.

THE ROLE OF VALUES IN DECISION THEORY. It is characteristic of decision theory that tests are evaluated in terms of their effectiveness in a specific situation. Such evaluation takes into account not only the validity of the test in predicting a particular criterion but also a number of other parameters, including base rate and selection ratio. Another important parameter is the relative utility of expected outcomes, the judged favorableness or unfavorableness of each outcome. Reference to the schematic representation of a simple decision strategy in Figure 18 will help to clarify the procedure.² The diagram shows the decision strategy illustrated in Figure 17, in which a single test is administered to a group of applicants and the decision to accept or reject an applicant is made on the basis of a cutoff score on the test. There are four possible outcomes: valid and false acceptances, and valid and false rejections. The probability of each outcome can be found from the number of persons in the corresponding section of Figure 17. Since there were 100 applicants in that example, these numbers divided by 100 give the probabilities of the four outcomes.

[FIG. 18. A Simple Decision Strategy. Administer the test and apply the cutoff score; the four outcomes and their probabilities: valid acceptance .38, false acceptance .07, false rejection .22, valid rejection .33.]

The other data needed are the utilities of the different outcomes, expressed on a common scale. The expected overall utility of the strategy could then be found by multiplying the probability of each outcome by its utility, adding these products across all outcomes, and subtracting a value corresponding to the cost of testing. Reference to this last term highlights the fact that a test of moderate validity might be retained if it is short, inexpensive, easily administered by relatively untrained personnel, and suitable for group administration, whereas an individual test requiring a trained examiner or expensive equipment would need higher validity to justify its use. In choosing a decision strategy, the goal is to maximize expected utility across all outcomes.

The lack of adequate systems for assigning values to outcomes in terms of a uniform utility scale is one of the chief obstacles to the application of decision theory. In industrial decisions, a dollar-and-cents value can frequently be assigned to different outcomes. Even in such cases, however, certain outcomes pertaining to good will, public relations, and employee morale are difficult to assess in monetary terms. Educational decisions must take into account institutional goals, social values, and other relatively intangible factors. Individual decisions, as in counseling, must consider the individual's preferences and value system. It should be noted, however, that decision theory did not introduce the problem of values into the decision process; it merely made it explicit. Value judgments have always entered into decisions, but heretofore they were not clearly recognized or systematically handled.

² For fuller discussion of ways of evaluating the outcomes of testing, see Wiggins (1973), pp. 257-274.
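The expected-utility computation just described can be written out in a few lines. The outcome probabilities below are those of Figure 18; the utility values and the cost of testing are hypothetical, since the text assigns none:

    def expected_utility(probabilities, utilities, cost_of_testing):
        """Probability-weighted utility of a strategy, less testing cost."""
        gross = sum(probabilities[k] * utilities[k] for k in probabilities)
        return gross - cost_of_testing

    probabilities = {"valid acceptance": 0.38, "false acceptance": 0.07,
                     "false rejection": 0.22, "valid rejection": 0.33}
    # Hypothetical utilities on an arbitrary common scale:
    utilities = {"valid acceptance": 1.0, "false acceptance": -1.5,
                 "false rejection": -0.5, "valid rejection": 0.5}
    print(round(expected_utility(probabilities, utilities,
                                 cost_of_testing=0.05), 2))
    # 0.28 on this scale; a costlier, individually administered test
    # (a higher cost_of_testing) would need higher validity to compete.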

.l!lpensatory educational programs for students with certain educational disabilities.the effectiveness of a test may be increased through the use ~f more complex decision strategies which take still more param. Wlggms (1973). the decision strategy followed in individual eases should take into account available data on the interaction of initial test score and differential treatment.SEQUEXTIAL STRATEGIES AND ADAPTIVE TREATMENTS.t t may be used to make sequential rather than termmal deCISIOns. }Jut " ". sented by Test B. '.. The validity coefficient alone cannot indicate whether or not a test should be used ~ince it is only one of the factors to be considered in evaluating th~ lmpact of the test on the efficacy of the total decision process) . Be£ause. With the flexibility of ap~roach ushe. :"decisions to accept or reject ar: treated as terminal.: mcludl~~ those clearly accepted or rejected. & Cleary. Two examples will serve to illustrate these poss~blhtles. in connection with the utlhzahon of computers 10 group testing.> session. In such situations. '. positives (i. rather than trying all items.to ma'Ximize the effectlve usc 0 es mg Ie. Figure 19.e~tion. Ch.that are later rectified may be costly in terms of several value systems. An example would be the utilization of different training procedures for workers at different aptitude levels. but to test further. a~ cases clas~ified as . s.' . es sod' F' 17 d 18 aU . possibly pathol~gi~al) ~y the. . On the baSIS of. ~ssenhally the sequen~e ~f items ~r item groups 'within the test is determine? b~ the examl. A second condition that may alter the effectiveness of a psychological test is. When adaptive treatments ar~ utilized. Incompetent employees hired because of prediction errors can usually be discharged after a probationary period.nose pathological condItIons With very low base rates. repeatedly at several stages. incorrect selection decisionS. the availability of alternative treatments and the possibility of adaptmg treatments to individual characteristics. . Rock. hI' I d' d Another strategy.-& 13etz. 6. 1973): Altho~gh.shortand easilv administered screening test. they are often less costly than terminal wrong decisions. to more difficult items. decision theory has served to focus attention on the complexity of factors that determine the contribution a given test can make in a particular situation. there has been increasing exploration of prediction models involving interacti~ hetween persons and 3 J. everyone might begm w1th a set of Ite~s of intermediate difficulty.. Under these conditions. tIllS group .fuller discussion of the implications of decision theory for test use.lllto .etoe~s.re~ in by decision theory.. seq~entIal testmg IS particularly well suited for computer testing.nee s ownperfom1anceo For example. count. Such branchmg may oeeu. or thc introduction of CQ.. Cronbach and GIeser . the assignment of in<!ividuals to alternative treatments is essentiallyadilsSif'ication rather -mOre wiI[ be Sald about tlleTequired methodtharu-sel~oblem. The classic psychometric model assumes that prediction errors are characteristic of the test rather than of the person and that these errors are randomly distributed among persons.ac. appli~ cable to paper-and-pencil printed grou~ ~ts. in'dividuals would be sorted into three categ?nes. To be sure. Essentially. is to use only two categories. Those who score poorly are routed t? easIer items' those who score well. ology in a later section on classification decisions.. Weiss.'. on the " other hand. 
A second condition that may alter the effectiveness of a psychological test is the availability of alternative treatments and the possibility of adapting treatments to individual characteristics. An example would be the utilization of different training procedures for workers at different aptitude levels, or the introduction of compensatory educational programs for students with certain educational disabilities. When adaptive treatments are utilized, the decision strategy followed in individual cases should take into account available data on the interaction of initial test score and differential treatment. Under these conditions, the success rate is likely to be substantially improved. Because the assignment of individuals to alternative treatments is essentially a classification rather than a selection problem, more will be said about the required methodology in a later section on classification decisions.

The examples cited illustrate a few of the ways in which the concepts and rationale of decision theory can assist in the evaluation of psychological tests for specific testing purposes. The validity coefficient alone cannot indicate whether or not a test should be used, since it is only one of the factors to be considered in evaluating the impact of the test on the efficacy of the total decision process. In general, decision theory has served to focus attention on the complexity of factors that determine the contribution a given test can make in a particular situation.³

DIFFERENTIALLY PREDICTABLE SUBSETS OF PERSONS. The validity of a test for a given criterion may vary among subgroups differing in personal characteristics. The classic psychometric model assumes that prediction errors are characteristic of the test rather than of the person, and that these errors are randomly distributed among persons. With the flexibility of approach ushered in by decision theory, there has been increasing exploration of prediction models involving interaction between persons and tests.

³ For a fuller discussion of the implications of decision theory for test use, see Wiggins (1973), and, at a more technical level, Cronbach and Gleser (1965).

Such interaction implies that the same test may be a better predictor for certain classes or subsets of persons than it is for others. For example, a given test may be a better predictor of criterion performance for men than for women, or a better predictor for applicants from a lower than for applicants from a higher socioeconomic level. Variables such as sex and socioeconomic level, which make it possible to predict the predictability of different individuals with a given instrument, are known as moderator variables, since they moderate the validity of the test (Saunders, 1956). A moderator variable may be a demographic characteristic, such as sex, age, or socioeconomic background, or it may be a score on another test. When computed in a total group, the validity coefficient of a test may be too low to be of much practical value in prediction. But when recomputed in subsets of individuals differing in some identifiable characteristic, validity may be high in one subset and negligible in another. The test could then be used effectively in making decisions regarding persons in the first group but not in the second; perhaps another test or some other assessment device could be found that is an effective predictor in the second group.

Interests and motivation often function as moderator variables. Thus, if an applicant has little interest in a job, he will probably perform poorly regardless of his scores on relevant aptitude tests. Among such persons, the correlation between aptitude test scores and job performance would be low. For individuals who are interested and highly motivated, on the other hand, the correlation between aptitude test score and job success may be quite high.

EMPIRICAL EXAMPLES OF MODERATOR VARIABLES. Evidence for the operation of moderator variables comes from a variety of sources. In a survey of several hundred correlation coefficients between aptitude test scores and academic grades, Seashore (1962) found higher correlations for women than for men in the large majority of instances. The same trend was found in high school and in college, although it was more pronounced at the college level. Thus, sex does appear to function as a moderator variable in the predictability of academic grades from aptitude test scores. The data do not indicate the reason for this sex difference, but it may be interesting to speculate about it in the light of other known sex differences. If women students in general tend to be more conforming and more inclined to accept the values and standards of the school situation, their class achievement will probably depend largely on their abilities. Men students, on the other hand, tend to concentrate their efforts on those activities (in or out of school) that arouse their individual interests. Since effort will be reflected in grades, these interest differences would introduce additional variance in the men's course achievement and would make it more difficult to predict achievement from test scores. In another study (Grooms & Endler, 1960), the college grades of the more anxious students correlated higher (r = .63) with aptitude and achievement test scores than did the grades of the less anxious students (r = .19).

A number of investigations have been specially designed to assess the role of moderator variables in the prediction of academic achievement. Several studies (Frederiksen & Gilbert, 1960; Frederiksen & Melville, 1954) tested the hypothesis that the more compulsive students would put a great deal of effort into their course work, regardless of their interest in the courses, but that the effort of the less compulsive students would depend on their interest. Since effort is reflected in grades, it was hypothesized that the correlation between the appropriate interest test scores and grades would be higher among noncompulsive than among compulsive students. This hypothesis was confirmed in several groups of male engineering students, identified through two tests of compulsivity, but not among liberal arts students of either sex (Stricker, 1966). Although the hypothesis was thus partially confirmed, the relation proved to be more complex than anticipated. Moreover, the lack of agreement among different indicators of compulsivity casts doubt on the generality of the construct that was being measured.

A different approach is illustrated by Berdie (1961), who investigated the relation between intraindividual variability on a test and the predictive validity of the same test. It was hypothesized that a given test will be a better predictor for those individuals who perform more consistently in different parts of the test, and whose total scores are thus more reliable (see also Berdie, 1968, 1969).

In a different context, there is evidence that self-report personality inventories may have higher validity for some types of neurotics than for others (Fulkerson, 1959). The individual who is characteristically precise and careful about details, who tends to worry about his problems, and who uses intellectualization as a primary defense is likely to provide a more accurate picture of his emotional difficulties on a self-report inventory than is the impulsive, careless individual who tends to avoid expressing unpleasant thoughts and emotions and who uses denial as a primary defense. The characteristic behavior of the two types tends to make one type careful and accurate in reporting symptoms, the other careless and evasive.

Ghiselli (1956, 1960a, 1960b, 1963, 1967) has extensively explored the role of moderator variables in industrial situations. In a study of taxi drivers (Ghiselli, 1956), the correlation between an aptitude test and a job-performance criterion in the total applicant sample was only .220. The group was then sorted into thirds on the basis of scores on an occupational interest test. When the validity of the aptitude test was recomputed within the third whose occupational interest level was most appropriate for the job, it rose to .664.
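Findings of this kind are obtained by recomputing the validity coefficient within subgroups sorted on the moderator. A sketch with simulated data follows; the data generator is contrived so that aptitude predicts the criterion only in the most interested third of the sample, and numpy's corrcoef supplies the Pearson r:

    import numpy as np

    def validity_by_subgroup(test, criterion, moderator, n_groups=3):
        """Pearson validity within subgroups formed by sorting on a moderator."""
        order = np.argsort(moderator)
        for chunk in np.array_split(order, n_groups):
            r = np.corrcoef(test[chunk], criterion[chunk])[0, 1]
            print(f"n = {len(chunk)}: r = {r:.3f}")

    rng = np.random.default_rng(0)
    n = 300
    interest = rng.normal(size=n)
    aptitude = rng.normal(size=n)
    # Aptitude contributes to performance only in the top third on interest:
    slope = np.where(interest > np.quantile(interest, 2 / 3), 1.0, 0.0)
    performance = slope * aptitude + rng.normal(size=n)
    validity_by_subgroup(aptitude, performance, interest)
    # The validity is negligible in the two low-interest thirds and
    # substantial in the high-interest third.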

A technique employed by Ghiselli in much of his research consists in finding for each individual the absolute difference (D) between his actual and his predicted criterion scores. The smaller the value of D, the more predictable is the individual's criterion score. A predictability scale is then developed by comparing the item responses of two contrasted subgroups, selected on the basis of their D scores. The predictability scale is subsequently applied to a new sample, to identify highly predictable and poorly predictable subgroups. An extension of the same procedure has been developed to determine in advance which of two tests will be a better predictor for each individual (Ghiselli, 1960a). Other investigators (Dunnette, 1972; Hobert & Dunnette, 1967) have argued that Ghiselli's D index, based on the absolute amount of prediction error without regard to direction of error, may obscure important individual differences. Alternative procedures, involving separate analyses of overpredicted and underpredicted cases, have accordingly been proposed.

At this time, the identification and use of moderator variables are still in an exploratory phase. Although this approach has shown promise as a means of identifying persons for whom a test will be a good or a poor predictor, the results are usually quite specific to the situations in which they were obtained. Considerable caution is required to avoid methodological pitfalls (see, e.g., Abrahams & Alf, 1972; Dunnette, 1972a, 1972b; Velicer, 1972a, 1972b). And it is important to check the extent to which the use of moderators actually improves the prediction that could be achieved through other, more direct means (Ghiselli, 1972; Pinder, 1973).

For the prediction of practical criteria, not one but several tests are generally required. Most criteria are complex, the criterion measure depending on a number of different traits. A single test designed to measure such a criterion would thus have to be highly heterogeneous. It has already been pointed out, however, that a relatively homogeneous test, measuring largely a single trait, is more satisfactory because it yields less ambiguous scores. Hence, it is usually preferable to use a combination of several relatively homogeneous tests, each covering a different aspect of the criterion, rather than a single test consisting of a hodgepodge of many different kinds of items. When a number of specially selected tests are employed together to predict a single criterion, they are known as a test battery. The chief problem arising in the use of such batteries concerns the way in which scores on the different tests are to be combined in arriving at a decision regarding each individual. The statistical procedures followed for this purpose are of two major types, namely, the multiple regression equation and multiple cutoff scores.

When tests are administered in the intensive study of individual cases, as in clinical diagnosis, counseling, or the evaluation of high-level executives, it is a common practice for the examiner to utilize test scores without further statistical analysis. In preparing a case report and in making recommendations, the examiner relies on judgment, past experience, and theoretical rationale to interpret score patterns and integrate findings from different tests. Such clinical use of test scores will be discussed further in Chapter 16.

MULTIPLE REGRESSION EQUATION. The multiple regression equation yields a predicted criterion score for each individual on the basis of his scores on all the tests in the battery. The following regression equation illustrates the application of this technique to predicting a student's achievement in high school mathematics courses from his scores on verbal (V), numerical (N), and reasoning (R) tests:

    Mathematics Achievement = .21 V + .21 N + .32 R + 1.35

In this equation, the student's stanine score on each of the three tests is multiplied by the corresponding weight given in the equation. The sum of these products, plus a constant (1.35), gives the student's predicted stanine position in mathematics courses. Suppose that Bill Jones receives the following stanine scores: Verbal 6, Numerical 4, Reasoning 8. The estimated mathematics achievement of this student is found as follows:

    Math. Achievement = (.21)(6) + (.21)(4) + (.32)(8) + 1.35 = 6.01

Bill's predicted stanine is thus approximately 6. It will be recalled (Ch. 4) that a stanine of 5 represents average performance. Bill would therefore be expected to do somewhat better than average in mathematics courses. His very superior performance in the reasoning test (R = 8) and his above-average score on the verbal test (V = 6) compensate for his poor score in speed and accuracy of computation (N = 4). Specific techniques for the computation of regression equations can be …
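The regression computation is easily mechanized; the weights, the constant, and Bill Jones's stanines below are those given in the text:

    def predicted_math_stanine(verbal, numerical, reasoning):
        """Predicted mathematics stanine from the equation
        Math = .21 V + .21 N + .32 R + 1.35."""
        return 0.21 * verbal + 0.21 * numerical + 0.32 * reasoning + 1.35

    # Bill Jones: Verbal 6, Numerical 4, Reasoning 8
    print(round(predicted_math_stanine(6, 4, 8), 2))  # 6.01, about stanine 6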
